sieve

AI + human review to solve data cleaning - accessible via API or Excel

Spring 2025 · active · 2025 · Website
Investing · Data Engineering · APIs

Report from 16 days ago

What do they actually do

Sieve provides a way for investment and data teams to extract and clean data from messy documents and external feeds, with accuracy guaranteed by human review. Teams can call an API or use an Excel add‑in to pull specific fields like earnings dates, line items from SEC filings, or historical series from PDFs without building their own parsing and QA workflows. Low‑confidence extractions are routed for manual verification, and the cleaned outputs are returned in a consistent format for models and pipelines. YC profile and Sieve’s docs and listing confirm the API + Excel add‑in approach and human‑in‑the‑loop review. (usesieve docs, Excel add‑in guide, Microsoft marketplace listing)

In Excel, analysts can call functions like =SIEVE.GET_DATA(...) against PDFs and filings to fetch specific values without leaving their spreadsheets, or integrate the API into ETL pipelines to replace ad‑hoc manual checks. (Excel add‑in guide) The marketplace listing notes the add‑in targets trading/investment teams and requires an enterprise subscription, aligning with a workflow where Sieve aims to replace manual data entry and review with an AI + human validation step. (Microsoft marketplace listing)
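The routing step described above (low‑confidence extractions go to manual verification, the rest pass straight through) can be sketched in a few lines of Python. This is only an illustration of the general human‑in‑the‑loop pattern, not Sieve's implementation: the `Extraction` type, the `route` function, the 0.9 threshold, and the sample values are all hypothetical and do not come from Sieve's docs.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """One extracted field (hypothetical shape, for illustration only)."""
    field: str
    value: str
    confidence: float  # assumed model confidence score in [0, 1]


# Assumed cutoff; a real system would tune this per field and document type.
REVIEW_THRESHOLD = 0.9


def route(extractions, threshold=REVIEW_THRESHOLD):
    """Split extractions into auto-accepted ones and ones needing human review."""
    accepted, needs_review = [], []
    for item in extractions:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


# Made-up sample batch: one confident field, one uncertain field.
batch = [
    Extraction("earnings_date", "2025-02-14", 0.98),
    Extraction("total_revenue", "1,204.5", 0.62),
]
accepted, needs_review = route(batch)
print("accepted:", [e.field for e in accepted])
print("needs review:", [e.field for e in needs_review])
```

In a pipeline of this shape, only the `needs_review` items would consume human time, which is the cost lever the report's pilot pricing discussion depends on.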

Who are their target customer(s)

  • Quant researchers at hedge funds collecting features/labels: They spend hours locating earnings dates and financial line items from filings and reports, diverting time from modeling and exposing their datasets to human error. YC profile
  • Data engineers maintaining ETL and validation pipelines: Pipelines frequently surface anomalies that require manual review (often via email), causing firefighting, slower releases, and brittle integrations. YC profile
  • Portfolio analysts and PMs needing clean, analysis‑ready series: Wrong or missing data from vendors/internal feeds breaks models and forces extra verification steps before decisions. Sieve product pages
  • Small‑cap or niche research teams using obscure PDFs/publications: Major vendors lack coverage; teams manually extract historical series from PDFs and custom documents, which is slow and error‑prone. Sieve case studies
  • Operations/vendor‑management teams buying/overseeing data feeds or BPO: Managing multiple vendors and re‑checking outputs is costly and manual; they need to reduce oversight while maintaining accuracy. Sieve FAQ/product messaging

How would they acquire their first 10, 50, and 100 customers

  • First 10: Run 4–8 week paid pilots with quant and niche research teams using the Excel add‑in and a small API ingestion. Use warm intros plus targeted outreach to secure time with a lead analyst; quantify time saved and errors fixed to convert on a simple SLA. YC profile, Excel add‑in listing
  • First 50: Turn pilot wins into a repeatable outbound motion: templated outreach to data engineers/heads of research, standardized pilot packages (Excel trial + small API), and public case studies showing specific extraction tasks to reduce evaluation friction. Price pilots to cover human‑review costs. Sieve case studies
  • First 100: Layer partnerships and channels: Excel marketplace + API integrations, reseller/white‑label deals with boutique data vendors/BPOs, and bundles with portfolio tools/data catalogs for the “cleaning” step. Hire AEs/AMs and solutions engineers to standardize onboarding and track the pilot‑to‑paid funnel with explicit conversion metrics. Marketplace listing, Sieve product pages

What is the rough total addressable market

Top-down context:

Sieve sits at the intersection of intelligent document processing (IDP) and data quality/observability. The IDP market was about $7.9B in 2024 and growing fast, while data quality tools were around $2.3B in 2024; financial services is a meaningful slice of both. (IDP market, Data quality tools, Data observability)

Bottom-up calculation:

Start with ~3,342 active hedge funds globally as a proxy for research‑intensive buyers; assume 30% have recurring bespoke extraction/cleaning needs (~1,000 teams) at a blended $50k ACV → ~$50M. Add ~1,500 adjacent buy‑side/fin‑data engineering teams at ~$40k ACV → ~$60M. Together that is ~$110M, implying a near‑term serviceable market of roughly $100–150M, with upside into broader FS and ETL teams. (Aurum 2024 fund count)

Assumptions:

  • Aurum’s reported 3,342 fund count is a reasonable proxy for active hedge funds globally in 2024.
  • Roughly 30% of funds have ongoing, custom data extraction/cleaning needs not met by standard vendors.
  • Blended ACVs ($40k–$50k) reflect enterprise Excel/API deployments with human review for multiple workflows.
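
The bottom‑up estimate can be reproduced with a short Python sketch, which also makes it easy to vary the assumptions. The inputs are the figures stated in the report (fund count, 30% share, team count, ACVs); the function and parameter names are illustrative only.

```python
def bottom_up_tam(
    hedge_funds=3_342,        # Aurum 2024 fund count cited in the report
    share_with_need=0.30,     # assumed share with bespoke extraction needs
    hedge_fund_acv=50_000,    # assumed blended ACV for hedge fund teams, USD
    adjacent_teams=1_500,     # assumed adjacent buy-side / fin-data teams
    adjacent_acv=40_000,      # assumed ACV for adjacent teams, USD
):
    """Return the serviceable-market estimate in USD, broken out by segment."""
    hedge_fund_teams = hedge_funds * share_with_need
    hedge_fund_market = hedge_fund_teams * hedge_fund_acv
    adjacent_market = adjacent_teams * adjacent_acv
    return {
        "hedge_fund_teams": hedge_fund_teams,
        "hedge_fund_market": hedge_fund_market,
        "adjacent_market": adjacent_market,
        "total": hedge_fund_market + adjacent_market,
    }


estimate = bottom_up_tam()
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

With the stated inputs the total comes to about $110M, inside the $100–150M range quoted above. Raising `share_with_need` to 0.40, for example, moves the hedge fund segment to roughly $67M, which shows how sensitive the estimate is to that single assumption.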

Who are some of their notable competitors

  • Calcbench: Specialist for SEC/XBRL data with an API and Excel add‑in that lets teams pull point‑in‑time financials and press‑release numbers; overlaps on filings line items and earnings dates but focuses on structured SEC data, not human‑review workflows for messy/custom docs. Calcbench
  • Sentieo: Equity‑research platform that indexes filings/transcripts and offers table extraction plus an Excel plugin; overlaps with analyst workflows but is a full research workspace rather than a human‑in‑the‑loop cleaning API for custom extractions. Sentieo product/Excel references, Office marketplace listing
  • Monte Carlo: Data‑observability tool for monitoring/alerting and lineage; competes on reducing manual review of pipeline breaks but doesn’t provide document extraction or human‑verified corrections for ad‑hoc PDF/vendor feed issues. Monte Carlo platform
  • Great Expectations: Open‑source data‑validation framework embedded in pipelines to assert schemas and run tests; reduces some manual checks but isn’t a managed extraction + human‑review service for messy external sources. GE docs
  • Amazon Textract (+ Mechanical Turk / Ground Truth): AWS OCR/table extractor often paired with human labeling (MTurk or Ground Truth) to correct errors; a DIY path to automated extraction + human review that requires significant glue work to match Sieve’s integrated API/Excel experience. Textract, Mechanical Turk