What do they actually do
Sepal AI runs a data-development service and platform that builds evaluation and training datasets for advanced AI. They combine software tooling, synthetic data generation, and a paid network of vetted domain experts to deliver curated datasets, benchmarks, human evaluation campaigns, RL/agent environments, and red‑team results. They describe this as a repeatable “Cloud‑Native Agent Dataset Factory” that turns research or product questions into evaluation/training data and verified outcomes (Sepal site · YC profile).
In practice, Sepal scopes the task, recruits and vets experts, runs the work, applies quality controls, and hands back usable artifacts. Public examples include biology reasoning benchmarks, finance Q&A/SQL evaluations, uplift trials with human baselines, and end‑to‑end security/red‑team engagements (Sepal site · YC profile). Sepal operates an Expert Hub to source specialists (e.g., DFIR/blue‑team roles with posted hourly pay ranges) and advertises SOC 2 for enterprise buyers (Expert Hub posting · Careers/SOC 2).
Customers reportedly include leading AI labs and enterprises. One public example is Sepal’s participation in pre‑launch testing for Anthropic’s Claude 3.7 Sonnet, cited in Anthropic’s system card and acknowledged by Sepal (Anthropic system card · Sepal post).
Who are their target customer(s)
- Frontier AI lab evaluation and safety teams: They need rigorous, adversarial pre‑release testing and human evaluations that internal teams can’t scale or repeat frequently. Sepal provides expert‑backed evaluations and pre‑launch testing (e.g., Anthropic engagement) (Sepal · Anthropic system card).
- Enterprise ML/product teams in regulated domains (finance, biotech, healthcare): They require domain‑accurate training/evaluation data with compliance/audit trails but lack access to vetted SMEs. Sepal offers curated datasets, expert recruitment, and enterprise compliance signals like SOC 2 (Sepal · Careers/SOC 2).
- RL/agent and autonomous‑system teams: They need reproducible, outcome‑verifiable environments to assess agent behavior and avoid regressions, which are costly to design and validate in‑house. Sepal builds RL environments and verifiable tasks (Sepal).
- Security, red‑team, and incident‑response managers: They need realistic adversarial campaigns and expert operators to uncover failure modes before release, but running credible red teams at scale is operationally heavy. Sepal runs end‑to‑end red‑teaming and hires DFIR/blue‑team specialists via its Expert Hub (Sepal · Expert Hub posting).
- Academic and industrial researchers needing benchmarks and human baselines: They need carefully designed, reproducible evaluation datasets and human‑judgment baselines, which are time‑consuming to coordinate. Sepal runs benchmark design, baselining, and full evaluation campaigns (YC profile · Sepal).
How would they acquire their first 10, 50, and 100 customers
- First 10: Land paid, tightly scoped pilots with frontier labs and a few regulated enterprises for pre‑release testing or dataset builds, leveraging the Anthropic collaboration and YC network for warm intros and credibility (Sepal · Anthropic system card · YC profile).
- First 50: Package repeatable offerings (e.g., red‑team pack, finance eval kit, biotech benchmark) and sell via direct outreach, workshops, and targeted content, staffing quickly through the Expert Hub; convert pilots to subscriptions with SLAs and compliance attestations (SOC 2) (Sepal · Expert Hub posting · Careers/SOC 2).
- First 100: Launch a customer portal/API with prebuilt environments and expert matching so mid‑sized teams can buy repeatable offers without long sales cycles; scale with partnerships (MLOps vendors, consultancies, compliance providers) and conference workshops, while automating synthetic augmentation to lower costs (Sepal · Talent search announcement).
What is the rough total addressable market
Top-down context:
Adjacent 2024 markets include data collection/labeling (~$3.8B), MLOps/model testing (~$1.6B), penetration testing/red‑team (~$2.45B), and synthetic data (~$0.2–0.4B). These categories overlap, so their combined ~$8B sum is a ceiling rather than an addressable total, and the realistically addressable pool is smaller (Grand View Research · Fortune BI — MLOps · Fortune BI — Pen Testing · GMI/Grand View — Synthetic Data).
Bottom-up calculation:
Assume ~25 frontier labs and ~750 regulated enterprises each purchase recurring evaluation/red‑team/dataset offerings averaging $1–$3M annually, implying roughly $0.8–$2.3B; if a productized platform expands reach to ~1,500 enterprises at similar spend, the opportunity grows to roughly $1.5–$4.6B.
Assumptions:
- Focus on the premium slice of budgets (bespoke evaluations, AI red‑teaming, and high‑assurance datasets), not commodity labeling or broad cybersecurity.
- Average annual spend per customer across multiple campaigns falls in the $1–$3M range for labs and regulated enterprises.
- Market category overlaps are accounted for by sizing only the expert‑led, high‑value portion.
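The sizing above is simple arithmetic, and it is easy to sanity-check. The sketch below recomputes both the top-down ceiling and the bottom-up ranges; all inputs are the document's own estimates (customer counts, $1–$3M contract values, the 2024 market figures), not independent data.

```python
# Back-of-envelope TAM check using the figures cited in the text.
# Inputs are the document's estimates, not measured market data.

# Top-down: adjacent 2024 market sizes in USD billions. Categories overlap,
# so the sum is a ceiling, not an addressable total.
adjacent_markets_b = {
    "data collection/labeling": 3.8,
    "MLOps/model testing": 1.6,
    "pen testing/red-team": 2.45,
    "synthetic data (midpoint of $0.2-0.4B)": 0.3,
}
ceiling_b = sum(adjacent_markets_b.values())  # ~8.15 -> "roughly $8B"

def tam_range_b(n_labs, n_enterprises, low_m=1.0, high_m=3.0):
    """Bottom-up TAM (low, high) in $B: customer count x $low_m-$high_m M/yr."""
    customers = n_labs + n_enterprises
    return customers * low_m / 1000, customers * high_m / 1000

base = tam_range_b(25, 750)       # 775 customers
expanded = tam_range_b(25, 1500)  # platform expands enterprise reach

print(f"Top-down ceiling: ~${ceiling_b:.1f}B")
print(f"Base case: ~${base[0]:.1f}-{base[1]:.1f}B")
print(f"Expanded: ~${expanded[0]:.1f}-{expanded[1]:.1f}B")
```

Running this gives a ceiling of ~$8.2B, a base case of ~$0.8–$2.3B, and an expanded case of ~$1.5–$4.6B, matching the ranges quoted above.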
Who are some of their notable competitors
- Scale AI: Enterprise data platform offering human rater programs, synthetic‑data pipelines, and model evaluation/red‑teaming products; overlaps with Sepal on expert‑backed adversarial evaluation and synthetic augmentation (Scale Evaluation · Scale synthetic data).
- Labelbox: Training‑data and human‑evaluation platform with managed expert evaluators and an Evaluation Studio for live/multi‑turn LLM testing; overlaps on large, repeatable human‑evaluation campaigns and managed labeling workflows (Labelbox Evaluation · Platform).
- Giskard: LLM testing and continuous red‑teaming product focused on finding vulnerabilities and running repeatable evaluation suites; overlaps with Sepal’s LLM evaluation/red‑team use cases but is more tool‑first (Giskard product).
- Hugging Face: Open community hub for datasets, evaluation libraries, and benchmarks (Datasets, Evaluate, leaderboards); overlaps on benchmark/data distribution and researcher self‑service evaluation rather than bespoke expert services (Datasets · Evaluate).
- Galileo: Observability and evaluation platform that captures ground truth and monitors agent/LLM behavior for regressions; overlaps on agent behavior testing and reproducible evaluation pipelines (Galileo).