Soren

The AI engineer for evals.

Fall 2025 · Active · 2025 · Website
Artificial Intelligence · Developer Tools · SaaS · B2B · Analytics

Report from 27 days ago

What do they actually do

Soren builds an AI-first tool that automates evaluation and triage for LLMs and agentic workflows. It analyzes failed cases, clusters issues, pinpoints likely root causes, and runs targeted experiments to surface better-performing fixes. The system also updates or adds tests when models, prompts, or tools change, reducing manual maintenance and speeding iteration (YC profile, homepage).

Today, teams use Soren to continuously test, diagnose, and experiment on AI systems through automated test generation, failure analysis, and intelligent experimentation, shifting engineering effort from combing through logs to applying targeted improvements (homepage, YC profile).
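
Soren's internal pipeline isn't public, but the core idea of "cluster failed eval cases to surface common failure modes" can be sketched in a few lines. The snippet below is purely illustrative (the failure descriptions, cluster count, and the TF-IDF + k-means choice are assumptions for the sketch, not Soren's implementation); it groups failed cases by similarity of their error text so a reviewer can triage buckets instead of individual logs.

```python
# Illustrative only: a toy version of "cluster failed eval cases to find
# common failure modes" using TF-IDF + k-means from scikit-learn.
# This is NOT Soren's implementation; all data below is invented.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical failed eval cases (judge feedback / error summaries).
failed_cases = [
    "Agent called the refund tool with a missing order_id argument",
    "Agent called the refund tool before verifying the customer",
    "Response cited a policy document that does not exist",
    "Response hallucinated a URL for the pricing page",
    "Agent looped on the search tool and never produced a final answer",
    "Agent exceeded the step limit while retrying the search tool",
]

# Embed the failure text and group it into a few candidate failure modes.
vectors = TfidfVectorizer(stop_words="english").fit_transform(failed_cases)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Print a simple triage view: one bucket per suspected failure mode.
for cluster_id in sorted(set(labels)):
    print(f"Failure mode {cluster_id}:")
    for case, label in zip(failed_cases, labels):
        if label == cluster_id:
            print(f"  - {case}")
```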

Who are their target customer(s)

  • Startup ML/AI engineers building LLM or agent features: They spend many hours writing and updating tests after every model, prompt, or tool change, then manually comb through failures to find what broke, slowing releases and iteration.
  • Reliability / QA engineers responsible for production AI: They receive large volumes of logs and traces from evaluations but lack clear root-cause signals, making triage slow and fixes imprecise.
  • Product managers shipping AI-powered features: Each update can cause regressions and they lack fast, repeatable signals to tell whether a change improved or degraded the product, reducing release confidence.
  • Teams building vertical or multi‑turn AI workflows: They lack good ways to evaluate end‑to‑end behavior on complex tasks, so performance gaps surface in production rather than tests.
  • Research or benchmarking teams running large eval suites: Creating, maintaining, and extending benchmarks is time‑consuming, pulling effort away from improving models or experiments.

How would they acquire their first 10, 50, and 100 customers

  • First 10: Run hands-on pilots with warm, high-signal accounts (YC/alumni startups, research labs, public agent adopters). Ingest their evals, perform triage, deliver a prioritized fix plan, and trade a short discount for a co‑authored case study/testimonial (YC profile).
  • First 50: Package the pilot into a 1–2 week “starter kit” with vertical templates (assistants, customer automation, multi‑turn workflows). Drive signups via targeted outbound, technical webinars with before/after triage, and outreach anchored on early case studies (homepage).
  • First 100: Productize into self‑serve with plug‑and‑play templates, CI integration, and clear onboarding. Add a small sales‑engineering/dev‑evangelism motion and partnerships (LLM infra, observability/QA, developer communities) to reach mid‑stage startups and ops/QA teams at scale (homepage).

What is the rough total addressable market

Top-down context:

Conservatively, Soren sells into the evaluation-focused slice of AI testing and adjacent MLOps budgets, roughly $3.5–4.0B today, based on an AI-powered testing market near $3.4B in 2025 plus a modest portion of the $1.6–2.2B MLOps market (FMI, Fortune BI MLOps, Grand View MLOps). A broader, upper-bound view that also includes full MLOps, AIOps (~$1.9B), and application observability (~$10B) sums to roughly $17B, but it risks overlap and double counting (Fortune BI AIOps, Credence monitoring).

Bottom-up calculation:

Estimate ~50,000 global teams actively building or operating LLM/agent features over the next 1–2 years, with average annual evaluation tooling spend of ~$70k per team (software + limited services), yielding ~$3.5B near‑term TAM. This aligns with the conservative, evaluation‑focused market slice above.

Assumptions:

  • Roughly 50k target teams globally will procure dedicated eval tooling in the near term (startups to mid‑market/enterprise feature teams).
  • Average annual budget per team for evaluation software and lightweight services is ~$70k, reflecting needs like dataset versioning, automated evaluators, triage, and CI runs.
  • Open‑source and incumbent platform features don’t fully displace dedicated eval tooling for most production teams in the next 1–2 years.
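
The bottom-up figure is simple arithmetic, and the conservative top-down range can be reproduced the same way. The sketch below restates both so the inputs are easy to vary; the team count and per-team spend are the report's estimates, and the MLOps slice is back-solved from the stated $3.5–4.0B range rather than given explicitly.

```python
# Restate the report's rough TAM arithmetic so the inputs are easy to vary.

# Bottom-up: teams actively building or operating LLM/agent features.
teams = 50_000            # report's estimate of target teams worldwide
spend_per_team = 70_000   # ~$70k/yr on eval software plus light services
bottom_up_tam = teams * spend_per_team
print(f"Bottom-up TAM: ${bottom_up_tam / 1e9:.1f}B")  # -> $3.5B

# Top-down (conservative slice): AI-powered testing plus part of MLOps budgets.
ai_testing_2025 = 3.4e9        # ~$3.4B AI-powered testing market in 2025
mlops_slice = (0.1e9, 0.6e9)   # "modest portion" of the $1.6–2.2B MLOps market,
                               # back-solved from the stated $3.5–4.0B range
low = ai_testing_2025 + mlops_slice[0]
high = ai_testing_2025 + mlops_slice[1]
print(f"Top-down (conservative): ${low / 1e9:.1f}B–${high / 1e9:.1f}B")  # -> $3.5B–$4.0B
```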

Who are some of their notable competitors

  • LangSmith: LangChain’s tracing and evaluation product for LLM apps. Supports offline/online evals, multi‑step traces, annotations, and experiment comparison with CI‑friendly workflows (evaluation docs).
  • Arize: Observability and evaluation for LLMs/agents combining tracing, automated evaluators (LLM‑as‑judge), monitoring, and experiment comparisons for production teams (LLM evaluation guide).
  • Galileo: Platform for agentic AI observability and evals with tracing, regression detection, and failure clustering to help engineering/QA teams prioritize fixes.
  • Humanloop: Collaborative evaluation UI + API for running LLM tests with automated and human evaluators; simplifies dataset/evaluator creation without custom infra.
  • Confident AI / DeepEval: Developer‑focused eval suite for regression testing and robustness/security checks (e.g., RAG/agent tests), emphasizing CI and large benchmark maintenance.
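
For a concrete sense of the developer-focused end of this market, the snippet below shows the kind of pytest-based regression check DeepEval documents. It follows the library's public README (metric names and signatures may differ across versions), the LLM-as-judge metric assumes an OpenAI API key is configured in the environment, and the test case itself is hypothetical.

```python
# A minimal DeepEval-style regression test, run with `pytest`. Based on the
# library's documented pytest integration; exact names may vary by version,
# and the LLM-as-judge metric expects an OpenAI API key in the environment.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_refund_answer_stays_relevant():
    # Hypothetical case: in CI, actual_output would come from your app.
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # Fail the run if answer relevancy drops below the chosen threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```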