
David AI

Data for audio AI

Summer 2024 · Active · Website

Report from 29 days ago

What do they actually do

David AI designs, collects, and licenses curated audio datasets used to train speech recognition, speech translation, text-to-speech, and conversational models. They publish named, off‑the‑shelf datasets and also build custom datasets. Buyers typically request samples, sign a license that defines allowed use cases, then receive access within 1–2 days for off‑the‑shelf sets or collaborate on short joint experiments for bespoke collections (website).

They report a large, channel‑separated, multilingual corpus spanning many accents, used by Fortune‑100 companies and research labs; in interviews they describe a library in the hundreds of thousands of hours across 15+ languages (website; YC interview). Dataset creation follows a defined research loop: hypothesize, design the data shape, collect, evaluate, scale, release, and continuously improve. The business model is data licensing plus bespoke partnerships, and the company has raised seed, Series A, and Series B rounds to scale operations, with posts noting early seven‑figure revenue (website; seed post; Series A; Series B).

Who are their target customer(s)

  • Enterprise AI labs and large model teams building speech/voice models: Need large, high‑quality, multilingual, channel‑separated audio to improve robustness and understand how data changes model performance; collecting and validating at that scale in‑house is slow and costly (website).
  • Product teams shipping voice assistants, wearables, or robots: Require data that matches real‑world accents, noise, and devices so models stop failing in production; they lack reliable collections and evaluations to verify improvements quickly (website).
  • ASR / speech‑translation startups: See performance drops across dialects and noisy environments; need curated, labeled multilingual corpora and benchmarks they can license rapidly to ship improvements (website).
  • Text‑to‑speech and voice‑synthesis teams: Need studio‑grade, multi‑speaker, channel‑separated recordings with rich metadata; producing controlled, high‑fidelity collections internally is operationally hard and expensive (website).
  • Academic and corporate research groups running speech experiments: Want reproducible datasets and evaluation suites to show data‑driven gains, but building and iterating such research‑quality corpora diverts time from core modeling work (website).

How would they acquire their first 10, 50, and 100 customers

  • First 10: Founder‑led pilots using warm introductions: share tailored samples immediately, run a short joint experiment to demonstrate model uplift, then sign a simple data license to unlock fast access (website).
  • First 50: Turn the best pilots into named, off‑the‑shelf datasets and publish brief case studies with measured gains; pair with a predictable path (sample → license → delivery) and lightweight pilot contracts to reduce friction (website).
  • First 100: Hire small, sector‑focused sales pods and embed standard evaluation suites in demos; pursue larger licenses and multi‑project partnerships while building tooling so customers can buy, evaluate, and iterate with minimal custom work (Series A; website).

What is the rough total addressable market

Top-down context:

Industry reports project the overall AI training‑dataset market reaching the mid‑single‑digit billions this decade, with audio a smaller share; this implies a core audio‑dataset TAM around $1–2B over the next 5 years (MarketsandMarkets; Grand View).

Bottom-up calculation:

Counting practical buyers across ASR, speech‑to‑text APIs, conversational AI, and TTS (each a multi‑billion‑dollar market) suggests a serviceable TAM in the single‑digit to low‑double‑digit billions (~$8–$15B today) for datasets, evaluations, and bespoke collections (speech & voice recognition; conversational AI; speech‑to‑text API; TTS).

Assumptions:

  • Audio represents a single‑digit to low‑teens percentage of the overall training‑dataset market.
  • Buyer markets overlap; serviceable TAM reflects the pool of potential customers, not an additive sum across segments.
  • Enterprise buyers spend roughly mid‑five to low‑seven figures annually on datasets, evaluations, and bespoke collections, depending on stage and scope.
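The bottom-up estimate can be sketched as a back-of-the-envelope calculation. The per-segment buyer counts, overlap factor, and per-buyer spend range below are illustrative assumptions chosen to be consistent with the stated assumptions (overlapping segments, mid-five- to low-seven-figure annual spend), not sourced figures:

```python
# Back-of-the-envelope bottom-up TAM sketch for audio training data.
# All figures are illustrative assumptions, not sourced market data.

# Hypothetical count of practical buyers per segment.
segments = {
    "speech & voice recognition": 6000,
    "conversational AI": 5000,
    "speech-to-text API": 3500,
    "TTS / voice synthesis": 2500,
}

# Buyers appear in multiple segments, so deduplicate rather than sum.
overlap_factor = 0.6

# Assumed average annual spend per buyer on datasets, evaluations,
# and bespoke collections (USD), reflecting a mix of mid-five- to
# low-seven-figure contracts.
avg_spend_low, avg_spend_high = 800_000, 1_500_000

unique_buyers = sum(segments.values()) * overlap_factor

tam_low = unique_buyers * avg_spend_low    # ~$8.2B
tam_high = unique_buyers * avg_spend_high  # ~$15.3B

print(f"~{unique_buyers:,.0f} unique buyers")
print(f"TAM range: ${tam_low / 1e9:.1f}B - ${tam_high / 1e9:.1f}B")
```

Under these assumptions the range lands near the ~$8–$15B figure above; the estimate is most sensitive to the overlap factor and the average-spend assumption, so varying those inputs is the quickest way to stress-test it.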

Who are some of their notable competitors

  • Appen: Global data‑services provider offering large‑scale crowd collection, transcription, and off‑the‑shelf speech datasets; overlaps on multilingual, annotated audio and end‑to‑end services, but is a broad provider vs. a focused audio‑research dataset publisher (Appen).
  • Scale AI: Enterprise data platform with labeling and evaluation services; competes where customers need rigorous annotation pipelines and model evaluation integrated into ML workflows rather than bespoke audio R&D and channel‑separated dataset design (Scale AI).
  • Mozilla Common Voice: Open, crowd‑built multilingual speech dataset; a free alternative for teams that can work with open licensing and variable quality, but lacks commercial licensing, bespoke recording/cleaning, and enterprise research partnerships.
  • Datatang: Commercial provider of speech corpora and bespoke collection/annotation (China‑based, global customers); competes on off‑the‑shelf multilingual corpora and large‑scale operations, while David AI emphasizes research‑driven dataset design and packaged evaluations (dataset reference).
  • Veritone: Enterprise AI platform with speech/audio processing and synthetic voice services; customers wanting integrated transcription, separation, and TTS workflows may choose the platform approach over a pure dataset vendor (Veritone Voice).