What do they actually do
Datacurve builds custom, high‑quality coding datasets and runnable evaluation environments that labs and developer‑tool companies use to train and evaluate code‑focused language models. A typical engagement starts with a private benchmark of the customer's model to pinpoint failure modes; Datacurve then co‑scopes targeted data projects and delivers finished datasets or dockerized repos with test harnesses that teams can plug into their training or eval pipelines (Datacurve, YC profile).
They source the data via a gamified, bounty‑based platform (Shipd) where vetted software engineers complete coding “quests.” Submissions run through automated tests and human review before Datacurve packages them into formats like supervised fine‑tuning pairs, human‑feedback traces, agent‑style developer session data, and repo‑wide RL/eval environments (Datacurve, YC launch post, Sacra analysis).
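As a concrete illustration of the first of those deliverable formats, a supervised fine‑tuning pair is commonly serialized as prompt/completion records in JSON Lines. The field names below are a generic sketch, not Datacurve's actual schema:

```python
import json

# Generic supervised fine-tuning (SFT) pair; field names are illustrative,
# not Datacurve's actual delivery schema.
sft_pair = {
    "prompt": "Write a Python function that reverses a singly linked list.",
    "completion": (
        "def reverse(head):\n"
        "    prev = None\n"
        "    while head:\n"
        "        head.next, prev, head = prev, head, head.next\n"
        "    return prev\n"
    ),
    "metadata": {"language": "python", "passed_tests": True},
}

# SFT datasets are typically shipped as JSON Lines: one record per line,
# so training pipelines can stream them without loading the whole file.
line = json.dumps(sft_pair)
record = json.loads(line)
print(record["metadata"]["language"])  # → python
```

Human‑feedback traces and agent‑style session data follow the same idea with richer records (e.g. ranked completions or tool‑call sequences), while the repo‑wide RL/eval environments are delivered as runnable containers rather than flat records.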
Commercially, Datacurve sells project‑based engagements (often pilots that expand) rather than self‑serve labeling, positioning around bespoke code data that is hard to scrape or synthesize, plus repeatable evals that fit into customers' engineering workflows (Datacurve, Sacra analysis).
Who are their target customer(s)
- Foundation‑model training teams (labs and model researchers): They need targeted coding examples and runnable suites to fix specific failure modes that public or synthetic data doesn’t surface reliably. Building this in‑house is slow and distracts researchers from core modeling work (Datacurve, YC profile).
- Startups building AI developer tools (code completion, code review, agents): They require representative, end‑to‑end code tasks and tests to benchmark features and iterate quickly; curating and validating that data internally is costly and time‑consuming (Datacurve, Sacra analysis).
- Enterprise ML/engineering platform teams piloting internal code models: They need private, tailored datasets and secure, stack‑specific test harnesses. Capturing this internally ties up senior engineers and is expensive to maintain (Datacurve, Sacra analysis).
- Evaluation / RLHF ops teams inside labs and companies: They need repeatable benchmarks, human‑rated corrections, and dockerized environments to track regressions and guide fine‑tuning, but existing evals are often ad‑hoc and one‑off (Datacurve, Sacra analysis).
How would they acquire their first 10, 50, and 100 customers
- First 10: Leverage founders’ YC/VC and pilot contacts to run small, paid pilots: benchmark a customer model, target one clear failure mode, and deliver a runnable dataset or test harness with measured impact (Datacurve, YC profile).
- First 50: Template the pilot into a repeatable package (standard scope, timeline, deliverables) and run targeted outbound to similar labs and dev‑tool startups; showcase anonymized benchmark results and secure referrals from early customers/investors (Datacurve, Sacra analysis).
- First 100: Productize into recurring evaluation subscriptions and hire sales + solutions engineers to onboard accounts; build delivery integrations (dockerized repos, test harness endpoints) and scale the Shipd contributor engine to shorten turnaround and handle larger contracts (Datacurve, Sacra analysis).
What is the rough total addressable market
Top-down context:
The AI training‑dataset market is estimated at about $3.2B in 2024 and growing, with the broader data‑labeling/collection market in the single‑ to low‑double‑digit billions depending on scope (Grand View Research, Market.us summary). Demand for AI coding tools is also in the multiple billions, indicating meaningful spend on code‑specific training and eval assets (industry report example).
Bottom-up calculation:
Starting from $3.2B training‑dataset spend, assume 10%–25% is code‑focused, giving roughly $320M–$800M. If 20%–40% of that code slice goes to bespoke, high‑touch datasets and runnable evals, Datacurve's near‑term SAM is roughly $64M–$320M annually (Grand View Research, AI code tools market context).
Assumptions:
- Code‑focused spend is 10%–25% of overall AI training‑dataset spend today.
- 20%–40% of code‑dataset spend is on bespoke, high‑touch datasets and runnable evaluation (vs. generic/synthetic or fully in‑house).
- Buyers externalize a material portion of these projects to vendors rather than doing everything internally.
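The range above follows directly from multiplying the assumed shares; a short script makes the arithmetic explicit (the percentages are the stated assumptions, not measured figures):

```python
# Bottom-up SAM sketch using the stated assumption ranges (illustrative only).
training_dataset_spend = 3.2e9     # ~$3.2B AI training-dataset market (2024)
code_share = (0.10, 0.25)          # assumed code-focused fraction of that spend
bespoke_share = (0.20, 0.40)       # assumed bespoke/eval fraction of the code slice

# Code-focused slice of overall training-dataset spend.
code_low = training_dataset_spend * code_share[0]
code_high = training_dataset_spend * code_share[1]

# Bespoke, high-touch portion of the code slice (near-term SAM).
sam_low = code_low * bespoke_share[0]
sam_high = code_high * bespoke_share[1]

print(f"Code-focused slice: ${code_low / 1e6:.0f}M–${code_high / 1e6:.0f}M")
print(f"Bespoke SAM:        ${sam_low / 1e6:.0f}M–${sam_high / 1e6:.0f}M")
# → Code-focused slice: $320M–$800M
# → Bespoke SAM:        $64M–$320M
```

Pairing the low ends together and the high ends together gives the widest plausible range; the true figure depends on how much of this work buyers outsource rather than build in‑house.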
Who are some of their notable competitors
- Scale AI: Large data‑ops provider offering custom training data, RLHF/fine‑tuning, and private evaluation; includes a Coding Data Stream that overlaps with curated coding datasets/evals (Scale Data Engine, Coding Stream).
- Appen: Managed annotation and crowd services with end‑to‑end LLM training, evaluation, and preference data for enterprises seeking a large vetted workforce and managed delivery (Appen LLM services, Data annotation).
- Labelbox: Labeling platform + managed services focused on RLHF and evaluations (RL data, eval arenas, expert workforce), enabling continuous evals without bespoke one‑off datasets (Platform, Managed services docs).
- Hugging Face: Community and enterprise datasets, evaluation libraries, and leaderboards that some teams use instead of commissioning private, tailored eval environments (Evaluate docs, Datasets docs).
- Topcoder: On‑demand, competition‑style developer crowd running paid coding challenges—an alternative to bounty‑based generation of code tasks and managed talent programs (Topcoder challenges).