hud

Platform for building RL environments and evals

Winter 2025 · Active · 2025 · Website
Artificial Intelligence · Reinforcement Learning

Report from 13 days ago

What do they actually do

hud provides an open-source SDK and a hosted service that let teams turn real software—websites, desktop apps, APIs, or Dockerized apps—into repeatable RL/evaluation environments. Users develop environments locally with a CLI, then run large-scale evaluations or RL training either on hud’s cloud or on their own machines. Every run streams telemetry to a web dashboard with traces and public leaderboards; there are built-in benchmarks and example agents to get started quickly (site/docs, docs, leaderboards).
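
To make that workflow concrete, the sketch below shows the general pattern in plain Python: wrap a piece of software behind a reset/step/score interface, run an agent against it, and keep a replayable trace. All class and function names here are illustrative assumptions for exposition, not hud's actual SDK or CLI API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Trace:
    """Replayable record of one scored episode."""
    steps: List[Tuple[str, str]] = field(default_factory=list)
    score: float = 0.0

class TodoAppEnv:
    """Illustrative stand-in for a real app wrapped as an environment."""
    def __init__(self) -> None:
        self.todos: List[str] = []

    def reset(self) -> str:
        self.todos = []
        return "empty todo list"

    def step(self, action: str) -> str:
        # A real wrapper would drive a website, desktop app, or container;
        # here we just mutate in-memory state.
        if action.startswith("add:"):
            self.todos.append(action[len("add:"):].strip())
        return f"todos={self.todos}"

    def score(self, goal: str) -> float:
        return 1.0 if goal in self.todos else 0.0

def evaluate(env: TodoAppEnv, agent: Callable[[str], str],
             goal: str, max_steps: int = 5) -> Trace:
    """Run one episode, recording every (action, observation) pair."""
    trace = Trace()
    obs = env.reset()
    for _ in range(max_steps):
        action = agent(obs)
        obs = env.step(action)
        trace.steps.append((action, obs))
    trace.score = env.score(goal)
    return trace

if __name__ == "__main__":
    def scripted_agent(obs: str) -> str:
        return "add: buy milk"  # stands in for an LLM-driven agent

    result = evaluate(TodoAppEnv(), scripted_agent, goal="buy milk")
    print(result.score, len(result.steps))  # 1.0 5
```

In hud's hosted setup, many copies of an environment like this run in parallel, and the resulting traces and scores feed the dashboard and leaderboards described above.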

The platform emphasizes scale and reproducibility: it supports parallel execution of many environment copies, live scoring, and comparisons across agents, including example integrations for popular LLMs. Pricing for the hosted execution layer is public (e.g., $0.50 per environment-hour), with academic credits and enterprise options available (pricing, examples/docs). hud is currently used by researchers and frontier AI labs and is a YC Winter 2025 company (YC profile).
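
At the listed $0.50 per environment-hour, run costs scale linearly with parallelism and episode length. The run sizes below are made-up examples, not hud benchmarks.

```python
# Back-of-the-envelope cost at the listed $0.50 per environment-hour.
# The run sizes are illustrative assumptions, not hud figures.
RATE_PER_ENV_HOUR = 0.50

def run_cost(num_envs: int, hours_per_env: float) -> float:
    return num_envs * hours_per_env * RATE_PER_ENV_HOUR

print(run_cost(100, 2.0))   # 100 parallel environments for 2 hours -> 100.0 ($100)
print(run_cost(1000, 0.5))  # 1,000 half-hour episodes              -> 250.0 ($250)
```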

Who are their target customer(s)

  • Academic ML researchers (universities and labs): Need to convert real apps and websites into reproducible testbeds and run many evaluations for papers, but lack infrastructure for large, scored, shareable runs and reliable leaderboards (docs, leaderboards).
  • Frontier AI / model teams building agent capabilities: Need consistent, high-throughput evaluations on real “computer‑use” tasks with replayable telemetry and comparisons across models, but spinning this up in-house is slow and brittle (YC profile, docs).
  • Engineers building agent environments and automated evaluators: Need fast local dev, hot-reload, and clear traces to iterate on environment wrappers and scorers, but current toolchains make development, debugging, and scaling to many machines cumbersome (GitHub SDK, docs).
  • Product or security teams testing models on internal apps: Need private, tightly controlled evaluations (on‑prem or dedicated cloud) to avoid data leakage, but lack easy-to-run private benchmarking and audit controls from public eval platforms (pricing/enterprise, docs).
  • Benchmark organizers and leaderboard maintainers: Need standardized task formats, automated scoring, and reproducible submissions, but building/hosting submission pipelines, leaderboards, and traceability is operationally heavy (leaderboards, docs).

How would they acquire their first 10, 50, and 100 customers

  • First 10: Run hands-on pilots with 10 labs/frontier research groups by offering credits/grants and 1:1 onboarding to port one benchmark each, execute runs with staff support, and publish reproducible leaderboards and case studies (docs, leaderboards).
  • First 50: Launch public benchmark challenges and a workshop at a major ML venue, distributing time‑limited cloud credits and ready‑to‑run agent adapters to reduce setup; showcase runs/traces on leaderboards to drive word-of-mouth (leaderboards, examples/docs).
  • First 100: Productize enterprise pilots from successful research users with on‑prem/dedicated deployments, privacy controls, SLAs, clear pricing, and self‑serve onboarding; expand adapters for major agent providers to ease evaluation (pricing/enterprise, docs).

What is the rough total addressable market

Top-down context:

The early market centers on evaluation and training infrastructure for “computer‑use” agents across academic labs and enterprise AI teams. Monetization blends metered execution (environment-hours) with enterprise deployments and support (pricing).

Bottom-up calculation:

Assume 700 academic/research orgs averaging ~$10k/year, 300 industry AI teams averaging ~$100k/year in metered usage and services, and 100 enterprise customers averaging ~$200k/year for private/on‑prem deployments, for an initial TAM of roughly $57M/year.
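
The arithmetic behind that figure, using the three segment assumptions stated above:

```python
# Bottom-up TAM from the stated segment assumptions.
segments = {
    "academic/research orgs": 700 * 10_000,   # $7M
    "industry AI teams":      300 * 100_000,  # $30M
    "enterprise on-prem":     100 * 200_000,  # $20M
}
print(sum(segments.values()))  # 57_000_000 -> ~$57M/year
```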

Assumptions:

  • Metered cloud pricing remains near current list rates and usage scales with agent evaluation/training demand (pricing).
  • Counts of active research labs and industry AI teams engaging in agent evals are in the low thousands globally, with a subset willing to pay annually.
  • Enterprise buyers require private deployments and support, yielding higher ACVs than metered-only users.

Who are some of their notable competitors

  • OpenAI Evals: Open-source toolkit for writing automated model evaluations and benchmark suites. Overlaps on evaluation pipelines but focuses on scoring model outputs, not orchestrating live software environments or integrated RL training.
  • Hugging Face (Evaluate + Model Hub/leaderboards): Libraries and hosted leaderboards for dataset/task evaluations. Strong for shared benchmarks/model comparison, but does not provide a hosted runtime to turn live apps into interactive RL environments or agent orchestration.
  • Weights & Biases: Experiment tracking, visualizations, and team leaderboards. Competes on telemetry and sharing, but is primarily experiment tracking rather than an environment-orchestration/evaluation stack for external software.
  • EvalAI: Platform for hosting academic challenges and public leaderboards. Overlaps on submissions and reproducible scoring but lacks local/dev SDKs for MCP-style environments and integrated RL execution/training.
  • Ray (RLlib / Tune): Distributed compute and RL libraries for scaling training/evals. Overlaps on parallel execution but expects teams to supply their own environments and instrumentation; hud packages SDK/CLI, scoring, and real-software wrappers.