What do they actually do
Moonshine provides an API and web console that ingest video (files or URLs), index it into searchable collections, and expose two core tools: Search (find moments with natural‑language queries) and Inquire (ask questions about one or more indexed videos and get an answer) (YC profile, docs). It’s aimed at developers who want to add video understanding to their apps without building the back‑end indexing and ML stack themselves.
Today, a developer can sign up, create an index, upload videos, and let Moonshine extract embeddings, transcripts, OCR text, and scene segments to enable search and Q&A. The system handles long videos by first searching for relevant segments and then running vision‑language models on those clips (“any‑length VLM”), and includes an Embeddings Viewer for exploring clusters to refine prompts (any‑length VLM, January update, quickstart). Pricing is usage‑based, with ingestion listed at $0.02 per minute and plan tiers for credits, concurrency, and storage (rates).
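As a rough illustration of that quickstart flow only — the base URL, endpoint paths, payload fields, and auth header below are assumptions for the sketch, not Moonshine's documented API (see their docs for the real interface):

```python
import json
import urllib.request

BASE_URL = "https://api.moonshine.example/v1"  # placeholder host, not the real endpoint
API_KEY = "YOUR_API_KEY"

def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Assemble an authenticated JSON POST (auth scheme is an assumption)."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical flow: create an index, ingest a video by URL, then search
# for a moment and ask a question about the indexed footage.
create_index = build_request("/indexes", {"name": "site-footage"})
ingest_video = build_request("/indexes/site-footage/videos",
                             {"url": "https://example.com/cam01.mp4"})
search = build_request("/indexes/site-footage/search",
                       {"query": "forklift near loading dock"})
inquire = build_request("/indexes/site-footage/inquire",
                        {"question": "Was the dock door left open overnight?"})

if __name__ == "__main__":
    # Actually sending the requests is left out; paths above are illustrative.
    for req in (create_index, ingest_video, search, inquire):
        print(req.get_method(), req.full_url)
```

The point of the sketch is the shape of the workflow (index → ingest → search/inquire), not the specific endpoints, which Moonshine's quickstart defines.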
Who are their target customer(s)
- Developers and engineering teams building products that need searchable video: They want to index footage and add natural‑language search/Q&A without building ingestion, embeddings, and model orchestration infrastructure from scratch.
- Media and post‑production teams with large archives: Finding the right clip by hand is slow and expensive; they need reliable text search, similar‑clip surfacing, and quick clip extraction for reuse.
- Operations teams (construction, inspection, retail) monitoring many cameras: They need summaries, change detection, and localized events across long time periods instead of manual review spread across fragmented tools.
- Security and compliance teams reviewing surveillance: Scanning hours or days of footage for incidents is tedious; they want fast natural‑language search that surfaces relevant moments over long archives.
- Robotics/AR/perception R&D teams exploring camera‑only spatial understanding: They seek to reduce multi‑sensor complexity and cost by using a single camera for mapping and object localization, but lack robust vision‑only spatial tools.
How would they acquire their first 10, 50, and 100 customers
- First 10: Invite YC teams and developer startups into an early‑access program with onboarding help and free ingestion credits; provide hands‑on engineering support to get a first video library indexed and turned into a working demo (quickstart, YC profile).
- First 50: Run targeted pilots in media/post‑production and site operations with clear metrics (e.g., time to find clips), integrate with cloud storage/MAM systems, and publish short ROI case studies to drive similar inbound (January update, rates/limits).
- First 100: Open self‑serve with generous starter credits, expand SDKs/docs/templates for common workflows, and pair product‑led growth with account‑based outreach for larger archives needing migration and negotiated terms (January update, quickstart).
What is the rough total addressable market
Top-down context:
Adjacent markets suggest a large software opportunity: video analytics is ~$12.7B in 2024, computer vision is ~$19–20B, and media asset management is ~$6B; after overlap adjustments, a practical software TAM is roughly $20–40B (video analytics, computer vision, MAM).
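The overlap adjustment above can be made explicit; the discount factors below are illustrative assumptions, chosen only to bracket the quoted $20–40B range:

```python
# Adjacent-market estimates cited above (USD billions, 2024-era figures);
# computer vision uses the midpoint of the ~$19-20B range.
markets = {"video analytics": 12.7, "computer vision": 19.5, "media asset management": 6.0}

gross = sum(markets.values())  # gross sum before removing overlap

# Assumed overlap discount of roughly 0-50% (illustrative, not sourced)
# brackets the practical $20-40B software TAM quoted above.
low, high = gross * 0.5, gross
print(f"gross sum ${gross:.1f}B; overlap-adjusted roughly ${low:.0f}-{high:.0f}B")
```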
Bottom-up calculation:
With ingestion priced at $0.02/min, 1,000 hours/year yields ≈$1.2k, 10,000 hours ≈$12k, and 100,000 hours ≈$120k per customer; reaching $10M ARR would require ~833 mid‑market customers at ~$12k/year, or a mix of higher‑ARPC enterprise deals and smaller usage‑based accounts (rates).
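The per-customer figures follow directly from the listed $0.02/min ingestion rate; a quick sanity check of the arithmetic (ingestion revenue only, ignoring query/storage charges):

```python
RATE_PER_MIN = 0.02  # ingestion price per minute, from the published rates

def annual_ingestion_revenue(hours_per_year: float) -> float:
    """Ingestion-only revenue for one customer at the listed rate."""
    return hours_per_year * 60 * RATE_PER_MIN

for hours in (1_000, 10_000, 100_000):
    print(f"{hours:>7,} h/yr -> ${annual_ingestion_revenue(hours):,.0f}")

# Customers needed to reach $10M ARR at the ~$12k/yr mid-market profile
target_arr = 10_000_000
customers = target_arr / annual_ingestion_revenue(10_000)
print(f"~{customers:.0f} customers")
```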
Assumptions:
- Overlap between CV, video analytics, and MAM is significant; TAM consolidates overlapping software portions.
- Enterprise deals include additional revenue beyond ingestion (queries, storage, integrations, SLAs).
- Customer volumes and ARPC vary widely by vertical and footage hours ingested.
Who are some of their notable competitors
- Google Cloud Video AI: General‑purpose video indexing (labels, shot changes, OCR, speech, streaming) with strong cloud integrations; broader tooling rather than a focused “index + natural‑language search/Q&A” developer flow (features).
- Microsoft Azure AI Video Indexer: End‑to‑end enterprise video indexing (transcripts, faces, objects/scenes, summarization) across cloud and edge; more of a media AI platform than an API‑first VLM search/Q&A product (overview).
- Amazon Rekognition Video: Managed analysis for stored/streaming video (labels, activities, text, faces, person tracking) suited to detection/alerting pipelines; emphasizes timestamped metadata over long‑archive natural‑language Q&A (docs).
- Clarifai: Platform for custom CV models and video workflows (embeddings, tracking, full‑motion analysis); strong for bespoke detectors vs. Moonshine’s packaged indexing + VLM‑based search/ask approach (docs).
- Descript: Transcript‑first editor used by media teams for text‑based editing and clip generation; overlaps on clip discovery needs but is a creator tool, not a programmatic video‑indexing API for developers/operations.