What do they actually do
Chonkie builds tooling to prepare documents for Retrieval‑Augmented Generation (RAG). They offer an open‑source library (Python and TypeScript) that parses files, cleans and splits them into chunks, can generate embeddings, and writes results to vector stores or files. The library includes pluggable parsers (“chefs”), chunkers, refineries, and database adapters, and is published on GitHub and PyPI so teams can run it locally or embed it in their own pipelines (docs, GitHub, PyPI).
They also run a hosted service that exposes the same pipeline via a web playground, API keys, and a dashboard, so users can upload data and get processed chunks without managing infrastructure. The site lists pricing tiers (Starter/Pro/Enterprise) and enterprise controls, and the API/playground let newcomers test chunking and embeddings quickly (API intro, playground, pricing).
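The hosted flow looks roughly like the sketch below. The endpoint URL, payload fields, and response shape are illustrative assumptions only, not the documented API; the real contract lives in the API docs:

```python
import requests

API_KEY = "sk-..."  # issued from the dashboard

# Hypothetical endpoint and payload -- illustrative assumptions, not the
# service's documented API surface.
resp = requests.post(
    "https://api.chonkie.example/v1/chunk",   # assumed URL for illustration
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Quarterly handbook text...", "chunk_size": 512},
    timeout=30,
)
resp.raise_for_status()

for chunk in resp.json().get("chunks", []):   # assumed response shape
    print(chunk["text"])
```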
Who are their target customer(s)
- ML engineer building RAG features at a product company: Spends time writing/tuning parsers and chunkers to avoid noisy or incomplete context; needs a reliable, repeatable way to prep documents for search and LLM prompts (chunkers docs, GitHub).
- Data engineer maintaining ingestion pipelines and vector databases: Has to connect many sources and stores, orchestrate multi‑step parsing→chunking→storage, and debug failures; wants stable adapters and pipeline primitives (integrations, roadmap).
- Support/knowledge‑base manager making company docs searchable: Messy PDFs, spreadsheets, and screenshots lead to poor extraction and incoherent chunks, so search results are irrelevant and agents can’t find answers; needs better table/OCR handling and an easy sandbox (playground, roadmap).
- Enterprise security/compliance officer: Worried about sending sensitive data to third‑party services without SOC2/HIPAA controls; wants local/controlled deployments and auditability (pricing/enterprise, OSS).
- Startup founder or PM prototyping an AI feature: Needs a no‑ops way to upload docs, test chunking, and get embeddings fast without building an ingestion stack or writing glue code (playground, homepage).
How would they acquire their first 10, 50, and 100 customers
- First 10: Identify heavy OSS users/contributors on GitHub/PyPI and convert them to closely supported cloud pilots, offering account migration and a bespoke connector if needed; close Starter/Pro seats after onboarding (GitHub, PyPI, playground).
- First 50: Publish turnkey recipes and demo pipelines (support KBs, spreadsheets, screenshots) in docs/playground, run targeted workshops, and showcase before/after results to drive self‑serve Starter conversions (chunking API, integrations, pricing).
- First 100: Partner with major vector DBs/embedding providers for joint marketing and co‑sell, run case‑study outbound to mid‑market teams, and promote enterprise controls (SSO/SOC2/HIPAA) to win regulated customers (roadmap, pricing).
What is the rough total addressable market
Top-down context:
Use the generative‑AI software market as the umbrella and estimate the share spent on RAG data prep/ingestion. If GenAI software spend is in the tens of billions of dollars in 2024 and ingestion tooling captures ~10% of it, the addressable market is in the low single‑digit billions today (ABI Research).
Bottom-up calculation:
Start with representative 2024 figures: knowledge management (~$20B), data integration/iPaaS (~$12–17B), enterprise search (~$5B), vector DB (~$2.2B), roughly $39B–$44B combined. After down‑weighting 50–66% for overlap and for the ingestion‑specific share of each budget, an ingestion specialist’s TAM is roughly $5B–$25B; see the worked arithmetic after the assumptions list (KM, Data integration, Enterprise search, Vector DB).
Assumptions:
- Ingestion/knowledge‑prep is ~10% of GenAI software spend.
- Significant budget overlap across KM, search, iPaaS, and vector DB buyers (50–66%).
- Figures are 2024 estimates from the cited analyst reports; scope definitions differ across sources.
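A worked version of the arithmetic above, under exactly these assumptions (all figures in USD billions; the "tens of billions" band for GenAI spend is an assumed range):

```python
# Top-down check: GenAI software spend in the tens of billions, ~10% ingestion.
genai_low, genai_high = 30.0, 50.0   # assumed "tens of billions" band
ingestion_share = 0.10
print(genai_low * ingestion_share, genai_high * ingestion_share)
# -> $3B-$5B: low single-digit billions, as stated.

# Bottom-up: sum the representative 2024 segment figures.
km, ipaas_low, ipaas_high, search, vector_db = 20.0, 12.0, 17.0, 5.0, 2.2
combined_low = km + ipaas_low + search + vector_db    # $39.2B
combined_high = km + ipaas_high + search + vector_db  # $44.2B

# Overlap discount: down-weight 50-66%, i.e., keep 34-50% of the combined total.
tam_low = combined_low * 0.34    # ~$13.3B
tam_high = combined_high * 0.50  # ~$22.1B
print(f"${tam_low:.1f}B - ${tam_high:.1f}B after overlap discount")
# The memo's wider $5B-$25B band additionally discounts for how much of each
# budget is truly ingestion-specific.
```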
Who are some of their notable competitors
- Unstructured: Open‑source and hosted document parsing/extraction for unstructured data; overlaps on file processing and text preparation for RAG.
- LlamaIndex: RAG data framework with loaders, chunking, indices, and storage integrations; often used as the ingestion layer for LLM apps.
- LangChain: General LLM orchestration with document loaders and text splitters; widely adopted and frequently used for basic chunking/ingestion.
- Haystack (deepset): Open‑source RAG framework providing pipelines, document stores, and components for ingestion and retrieval.
- Vectara: Hosted retrieval platform that includes ingestion, chunking, and embeddings; competes when teams want an end‑to‑end managed stack.