
Refresh

RL environments for computer use and software engineering work

Spring 2025 · Active · Website
Market Research · Reinforcement Learning · Recruiting · Data Labeling · AI

Report from 15 days ago

What do they actually do

Refresh builds realistic computer-use training and evaluation environments so AI agents can practice the full loop of software engineering: writing code, running it, using tools, and debugging end-to-end. Their focus is on RL-style, multimodal environments that look and behave like a developer’s machine and browser, enabling repeatable experiments and benchmarks for agent capabilities (refresh.dev; YC profile).
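To make the shape of such an environment concrete, here is a minimal sketch of a reset()/step() loop for a code-writing task, loosely following common RL-environment conventions. The class, the action format, and the pass/fail reward are hypothetical illustrations under those assumptions, not Refresh's actual interface.

```python
# Hypothetical sketch only: a reset()/step() environment for a code-writing
# task. Class names, the action format, and the binary reward are
# illustrative assumptions, not Refresh's API.
import subprocess
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class StepResult:
    observation: dict          # runtime feedback the agent can condition on
    reward: float              # here: 1.0 if the hidden check passes, else 0.0
    terminated: bool
    info: dict = field(default_factory=dict)


class CodeTaskEnv:
    """The agent writes a file; the env runs a check script against it and
    returns stdout/stderr as the next observation."""

    def __init__(self, task_prompt: str, check_code: str):
        self.task_prompt = task_prompt
        # e.g. "import solution; assert solution.add(1, 2) == 3"
        self.check_code = check_code
        self.workdir = Path(tempfile.mkdtemp(prefix="codetask_"))

    def reset(self) -> dict:
        return {"prompt": self.task_prompt, "stdout": "", "stderr": ""}

    def step(self, action: dict) -> StepResult:
        # Action: {"path": "solution.py", "content": "<source code>"}
        (self.workdir / action["path"]).write_text(action["content"])
        try:
            proc = subprocess.run(
                ["python", "-c", self.check_code],
                cwd=self.workdir, capture_output=True, text=True, timeout=30,
            )
            passed, out, err = proc.returncode == 0, proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            passed, out, err = False, "", "check timed out after 30s"
        obs = {"prompt": self.task_prompt, "stdout": out, "stderr": err}
        return StepResult(obs, reward=1.0 if passed else 0.0, terminated=passed)
```

An agent policy would alternate between proposing new `content` and reading back `stderr`, which is the write-run-debug loop described above; a production environment would add tool use, browser state, and multimodal observations on top of this skeleton.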

Today, Refresh offers evaluation tooling and open-source components that let teams run agent workflows in browsers and collect clear, reproducible signals. Examples include an MCP server that autonomously exercises and evaluates web apps while capturing screenshots, console logs, and network traffic for failure analysis (web-eval-agent). Their GitHub org also hosts an OSS RL/evals toolkit fork used for agent evaluations and training workflows (Operative‑Sh org).
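For a sense of the signals such a run produces, the following is a minimal sketch using Playwright directly; it is not web-eval-agent's API, just an illustration of capturing the same classes of output (console logs, failed network requests, a screenshot), and the localhost URL is a placeholder.

```python
# Illustrative sketch, not web-eval-agent's API: drive a page headlessly and
# collect console output, failed requests, and a screenshot for analysis.
from playwright.sync_api import sync_playwright


def evaluate_page(url: str, screenshot_path: str = "run.png") -> dict:
    console_logs, failed_requests = [], []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Collect console messages and network failures as the page runs.
        page.on("console", lambda msg: console_logs.append(f"[{msg.type}] {msg.text}"))
        page.on("requestfailed", lambda req: failed_requests.append(
            {"url": req.url, "error": req.failure}))

        page.goto(url, wait_until="networkidle")
        page.screenshot(path=screenshot_path, full_page=True)
        browser.close()

    return {
        "console": console_logs,
        "failed_requests": failed_requests,
        "screenshot": screenshot_path,
    }


if __name__ == "__main__":
    report = evaluate_page("http://localhost:3000")  # placeholder app URL
    print(f"{len(report['failed_requests'])} failed requests, "
          f"{len(report['console'])} console messages")
```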

Who are their target customer(s)

  • AI research teams training large models and agents: They need realistic, multimodal environments where models can write, run, and debug code end‑to‑end; building and maintaining those simulated computer‑plus‑tool environments in‑house is time‑consuming and costly (YC; TechCrunch).
  • Companies building AI coding assistants (Cursor‑style code agents): Models often produce code that looks plausible but fails at runtime; teams spend significant time wiring test‑and‑debug loops to verify outputs across real tools and browsers (YC; web‑eval‑agent).
  • Engineering and QA teams automating app testing: Manual QA is slow, failures are hard to reproduce, and cross‑environment checks are brittle; they need browser‑driven runs that capture console/network errors and report concrete failures automatically (web‑eval‑agent).
  • ML evaluation / MLOps engineers building agent benchmarks: Creating reproducible RL‑style evals that use real developer tools is tedious and brittle; they want turnkey environments and tooling to run repeatable experiments with consistent metrics (hud‑python; YC).
  • IDEs and developer‑tooling platforms integrating agent features: They need safe backends they can integrate to generate, execute, and validate code inside the editor (state/session handling, cookie/login reuse, etc.), capabilities that are hard to build and maintain reliably in‑house (web‑eval‑agent; YC).

How would they acquire their first 10, 50, and 100 customers

  • First 10: Manually recruit AI research groups, code‑agent startups, and friendly engineering teams for tightly scoped pilots with hands‑on support and instrumentation that reproduces their common failures, in exchange for feedback and a public case study.
  • First 50: Publish small, runnable repos and benchmark demos showing before/after outcomes, present at focused meetups/conferences, and run targeted outbound to code‑assistant builders, QA automation teams, and ML eval groups. Convert via packaged paid pilots (templates + onboarding) supported by early case studies and technical write‑ups.
  • First 100: Add self‑serve onboarding, prebuilt templates for common stacks, and integrations (CI, editors, test frameworks) to drive PLG, while running account‑based outreach for larger buyers. Support with a small enterprise onboarding team, published benchmark reports/case studies, and co‑sales with CI/QA/cloud vendors.

What is the rough total addressable market

Top-down context:

Refresh touches three large markets today: AI/ML infrastructure (reports cite ~$47B–$60B+ around 2024–2025), AI code tools (~$5B–$7B in 2024), and automation/software testing (automation testing around ~$18B; the broader software testing market is larger). Conservatively, and accounting for overlap between these categories, relevant top‑down spend spans roughly $60B–$120B (Precedence Research; Grand View Research; Fortune Business Insights; GMI).

Bottom-up calculation:

As a working model: 1,500 orgs building agentic coding/eval systems adopting at $50k–$200k ARR implies ~$75M–$300M; add 10,000 QA/automation teams on smaller SKUs at $10k–$50k for ~$100M–$500M; plus 200+ AI labs/enterprises with advanced needs at $200k–$1M adds ~$40M–$200M. Together, this supports a path from low hundreds of millions toward low single‑digit billions over time, consistent with a conservative $10B–$30B SAM derived from software/tooling slices of the top‑down markets.
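For transparency, those ranges follow from straight multiplication of the assumed customer counts and price bands; the short script below makes the arithmetic explicit (the inputs are the report's working assumptions, not measured data).

```python
# Worked version of the bottom-up ranges above; segment sizes and ARR bands
# are the report's working assumptions, not measured figures.
segments = {
    # name: (orgs, low ARR, high ARR)
    "agentic coding/eval orgs":        (1_500, 50_000, 200_000),
    "QA/automation teams":             (10_000, 10_000, 50_000),
    "AI labs / advanced enterprises":  (200, 200_000, 1_000_000),
}

low = sum(n * lo for n, lo, hi in segments.values())
high = sum(n * hi for n, lo, hi in segments.values())

print(f"Bottom-up range: ${low/1e6:.0f}M - ${high/1e6:,.0f}M")
# -> Bottom-up range: $215M - $1,000M (before multi-year adoption growth)
```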

Assumptions:

  • Do not double‑count overlapping buyers across AI code tools and testing; exclude most pure hardware spend from AI infra.
  • Adoption mix spans enterprise pilots to team‑level SKUs; pricing bands reflect environment + eval tooling, not generic cloud/GPU.
  • 3–5 year horizon for bottom‑up adoption; penetration ramps with editor/CI integrations and enterprise templates.

Who are some of their notable competitors

  • E2B: Provides secure, isolated “virtual computer” sandboxes for AI agents and MCPs. Overlaps where teams need safe, scalable code execution environments for agent workflows (E2B; E2B blog).
  • AgentOps: Agent observability, testing, and replay to trace, debug, and deploy reliable AI agents—commonly used to validate and improve agent behavior in production (AgentOps).
  • LangSmith (LangChain): Widely used tracing/observability and evaluation for LLM apps and agents; a common alternative for building eval pipelines and debugging agent runs (LangSmith).
  • BrowserStack: Cloud browsers and large‑scale test automation; recently launched AI agents for test generation and self‑healing, relevant to QA teams automating UI flows (BrowserStack AI agents).
  • Surge: Builds RL environments and benchmarks for agent training across realistic tasks—overlapping with Refresh’s focus on training grounds for agent capabilities (Surge blog).