hud

Platform for building RL environments and evals

Winter 2025 · Active · 2025 · Website
Artificial Intelligence · Reinforcement Learning

Report from 13 days ago

What do they actually do

hud provides an open-source SDK and a hosted service that let teams turn real software—websites, desktop apps, APIs, or Dockerized apps—into repeatable RL/evaluation environments. Users develop environments locally with a CLI, then run large-scale evaluations or RL training either on hud’s cloud or on their own machines. Every run streams telemetry to a web dashboard with traces and public leaderboards; there are built-in benchmarks and example agents to get started quickly (site/docs, docs, leaderboards).
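
To make that workflow concrete, the sketch below shows the general pattern in plain Python: wrap a piece of software behind a reset/step/score interface, run an agent against it, and keep a replayable trace. All class and function names here are illustrative assumptions for exposition, not hud's actual SDK or CLI API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Trace:
    """Replayable record of one scored episode."""
    steps: List[Tuple[str, str]] = field(default_factory=list)
    score: float = 0.0

class TodoAppEnv:
    """Illustrative stand-in for a real app wrapped as an environment."""
    def __init__(self) -> None:
        self.todos: List[str] = []

    def reset(self) -> str:
        self.todos = []
        return "empty todo list"

    def step(self, action: str) -> str:
        # A real wrapper would drive a website, desktop app, or container;
        # here we just mutate in-memory state.
        if action.startswith("add:"):
            self.todos.append(action[len("add:"):].strip())
        return f"todos={self.todos}"

    def score(self, goal: str) -> float:
        return 1.0 if goal in self.todos else 0.0

def evaluate(env: TodoAppEnv, agent: Callable[[str], str],
             goal: str, max_steps: int = 5) -> Trace:
    """Run one episode, recording every (action, observation) pair."""
    trace = Trace()
    obs = env.reset()
    for _ in range(max_steps):
        action = agent(obs)
        obs = env.step(action)
        trace.steps.append((action, obs))
    trace.score = env.score(goal)
    return trace

if __name__ == "__main__":
    def scripted_agent(obs: str) -> str:
        return "add: buy milk"  # stands in for an LLM-driven agent

    result = evaluate(TodoAppEnv(), scripted_agent, goal="buy milk")
    print(result.score, len(result.steps))  # 1.0 5
```

In hud's hosted setup, many copies of an environment like this run in parallel, and the resulting traces and scores feed the dashboard and leaderboards described above.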

The platform emphasizes scale and reproducibility: it supports parallel execution of many environment copies, live scoring, and comparisons across agents, including example integrations for popular LLMs. Pricing for the hosted execution layer is public (e.g., $0.50 per environment-hour), with academic credits and enterprise options available (pricing, examples/docs). hud is currently used by researchers and frontier AI labs and is a YC Winter 2025 company (YC profile).
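
At the listed $0.50 per environment-hour, run costs scale linearly with parallelism and episode length. The run sizes below are made-up examples, not hud benchmarks.

```python
# Back-of-the-envelope cost at the listed $0.50 per environment-hour.
# The run sizes are illustrative assumptions, not hud figures.
RATE_PER_ENV_HOUR = 0.50

def run_cost(num_envs: int, hours_per_env: float) -> float:
    return num_envs * hours_per_env * RATE_PER_ENV_HOUR

print(run_cost(100, 2.0))   # 100 parallel environments for 2 hours -> 100.0 ($100)
print(run_cost(1000, 0.5))  # 1,000 half-hour episodes              -> 250.0 ($250)
```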

Who are their target customer(s)

  • Academic ML researchers (universities and labs): Need to convert real apps and websites into reproducible testbeds and run many evaluations for papers, but lack infrastructure for large, scored, shareable runs and reliable leaderboards (docs, leaderboards).
  • Frontier AI / model teams building agent capabilities: Need consistent, high-throughput evaluations on real “computer‑use” tasks with replayable telemetry and comparisons across models, but spinning this up in-house is slow and brittle (YC profile, docs).
  • Engineers building agent environments and automated evaluators: Need fast local dev, hot-reload, and clear traces to iterate on environment wrappers and scorers, but current toolchains make development, debugging, and scaling to many machines cumbersome (GitHub SDK, docs).
  • Product or security teams testing models on internal apps: Need private, tightly controlled evaluations (on‑prem or dedicated cloud) to avoid data leakage, but lack easy-to-run private benchmarking and audit controls from public eval platforms (pricing/enterprise, docs).
  • Benchmark organizers and leaderboard maintainers: Need standardized task formats, automated scoring, and reproducible submissions, but building/hosting submission pipelines, leaderboards, and traceability is operationally heavy (leaderboards, docs).

How would they acquire their first 10, 50, and 100 customers

  • First 10: Run hands-on pilots with 10 labs/frontier research groups by offering credits/grants and 1:1 onboarding to port one benchmark each, execute runs with staff support, and publish reproducible leaderboards and case studies (docs, leaderboards).
  • First 50: Launch public benchmark challenges and a workshop at a major ML venue, distributing time‑limited cloud credits and ready‑to‑run agent adapters to reduce setup; showcase runs/traces on leaderboards to drive word-of-mouth (leaderboards, examples/docs).
  • First 100: Productize enterprise pilots from successful research users with on‑prem/dedicated deployments, privacy controls, SLAs, clear pricing, and self‑serve onboarding; expand adapters for major agent providers to ease evaluation (pricing/enterprise, docs).

What is the rough total addressable market

Top-down context:

The early market centers on evaluation and training infrastructure for “computer‑use” agents across academic labs and enterprise AI teams. Monetization blends metered execution (environment-hours) with enterprise deployments and support (pricing).

Bottom-up calculation:

Assume 700 academic/research orgs averaging ~$10k/year, 300 industry AI teams averaging ~$100k/year in metered usage and services, and 100 enterprise customers averaging ~$200k/year for private/on‑prem deployments, for an initial TAM of roughly $57M/year.
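
The arithmetic behind that figure, using the three segment assumptions stated above:

```python
# Bottom-up TAM from the stated segment assumptions.
segments = {
    "academic/research orgs": 700 * 10_000,   # $7M
    "industry AI teams":      300 * 100_000,  # $30M
    "enterprise on-prem":     100 * 200_000,  # $20M
}
print(sum(segments.values()))  # 57_000_000 -> ~$57M/year
```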

Assumptions:

  • Metered cloud pricing remains near current list rates and usage scales with agent evaluation/training demand (pricing).
  • Counts of active research labs and industry AI teams engaging in agent evals are in the low thousands globally, with a subset willing to pay annually.
  • Enterprise buyers require private deployments and support, yielding higher ACVs than metered-only users.

Who are some of their notable competitors

  • OpenAI Evals: Open-source toolkit for writing automated model evaluations and benchmark suites. Overlaps on evaluation pipelines but focuses on scoring model outputs, not orchestrating live software environments or integrated RL training.
  • Hugging Face (Evaluate + Model Hub/leaderboards): Libraries and hosted leaderboards for dataset/task evaluations. Strong for shared benchmarks/model comparison, but does not provide a hosted runtime to turn live apps into interactive RL environments or agent orchestration.
  • Weights & Biases: Experiment tracking, visualizations, and team leaderboards. Competes on telemetry and sharing, but is primarily experiment tracking rather than an environment-orchestration/evaluation stack for external software.
  • EvalAI: Platform for hosting academic challenges and public leaderboards. Overlaps on submissions and reproducible scoring but lacks local/dev SDKs for MCP-style environments and integrated RL execution/training.
  • Ray (RLlib / Tune): Distributed compute and RL libraries for scaling training/evals. Overlaps on parallel execution but expects teams to supply their own environments and instrumentation; hud packages SDK/CLI, scoring, and real-software wrappers.