What do they actually do
Cactus is an open‑source inference engine and SDK for running language, vision, and speech models directly inside mobile apps on iOS and Android. It includes a small C++ runtime (Cactus Engine), a computation layer (Cactus Graph), and ARM‑focused kernels, plus ready‑made SDK bindings for Flutter, React Native, Kotlin, and C++, so developers can embed local inference without wiring everything together from scratch (docs, GitHub).
A typical workflow: add the SDK, convert or load a supported model (e.g., GGUF or Cactus format), and call an OpenAI‑style API to run inference locally. Optional features include a telemetry dashboard for monitoring on‑device inference, an Agent Builder canvas for wiring up tool calls, and a hybrid cloud fallback for tasks that exceed device limits. The approach emphasizes privacy (data stays on device) and offline capability (docs, website).
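To make that workflow concrete, here is a minimal Kotlin sketch of the integration shape. The type and method names (OnDeviceLM, loadModel, chatCompletion) and the model path are illustrative assumptions, not the SDK's confirmed API; the official docs define the actual surface.

```kotlin
// Illustrative sketch only: OnDeviceLM, loadModel, and chatCompletion are
// stand-ins for whatever the Cactus SDK actually exposes, used here to show
// the "load a model, call an OpenAI-style API locally" flow described above.

data class Message(val role: String, val content: String)

// Stand-in for the SDK's local inference handle.
class OnDeviceLM(private val modelPath: String) {
    fun loadModel(): Boolean {
        // In a real integration this step would hand the GGUF/Cactus-format
        // weights to the on-device C++ runtime via the platform bindings.
        println("Loading $modelPath into the on-device runtime...")
        return true
    }

    fun chatCompletion(messages: List<Message>): String {
        // OpenAI-style role/content request, answered entirely on device --
        // no network round trip, so it works offline and keeps data local.
        return "stub response for '${messages.last().content}'"
    }
}

fun main() {
    // 1. Add the SDK dependency, then point it at a converted model file.
    val lm = OnDeviceLM(modelPath = "models/qwen-0.6b.gguf") // hypothetical path
    check(lm.loadModel()) { "model failed to load" }

    // 2. Run inference locally through the OpenAI-style API.
    val reply = lm.chatCompletion(
        listOf(
            Message("system", "You are a concise on-device assistant."),
            Message("user", "Summarize today's notes in one sentence.")
        )
    )
    println(reply)
}
```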
Traction signals include an active public repo with thousands of stars, company claims that thousands of developers and many shipped products build on the SDK, and demo apps on the App Store and Google Play. The team reports production usage on the order of 500k+ weekly inference tasks (GitHub, YC profile, docs, website).
Who are their target customer(s)
- Mobile app developers adding AI features to consumer apps: They need instant responses regardless of network conditions and want to avoid sending user data to the cloud; current cloud APIs add latency, fail without connectivity, and raise privacy concerns.
- Teams in regulated or privacy‑sensitive domains (health, finance): They must keep data on‑device to meet compliance and user trust while still shipping modern AI features, but cloud services complicate approvals and risk data exposure.
- Product managers seeking to reduce inference costs and vendor dependence: Cloud API bills and rate limits are rising, and outages or policy changes can disrupt features; they want predictable costs and fewer external dependencies.
- Teams building real‑time features (voice assistants, keyboards, AR/image): They need low latency and battery‑efficient inference across diverse ARM phones; current stacks are hard to optimize and inconsistent across devices.
- Small teams/indie developers using cross‑platform stacks (Flutter/React Native/Kotlin): Tooling is brittle, model conversion is painful, and platform integration takes time; they want production‑ready SDKs and simple packaging to ship quickly.
How would they acquire their first 10, 50, and 100 customers
- First 10: Recruit from existing users—active GitHub contributors, stargazers, demo‑app users, and YC network teams—for a free, time‑boxed pilot with hands‑on onboarding and device benchmarking. Win them by rapidly fixing blockers, shipping tailored SDK examples, and securing short public case studies.
- First 50: Publish step‑by‑step tutorials and turnkey sample apps for Flutter/React Native/Kotlin, performance comparison posts, and short video demos; distribute packages via npm, pub.dev (Flutter/Dart), crates.io (Rust), and Hugging Face. Funnel interest via an automated trial, community office hours, and targeted outreach to teams that tried the demos.
- First 100: Add DevRel and a lightweight sales motion focused on regulated and high‑value real‑time use cases; run formal pilots with prioritized support and publish security/compliance materials and pilot metrics. Partner with mobile agencies and framework maintainers and run a referral program that drives paid pilots and enterprise conversations.
What is the rough total addressable market
Top-down context:
On‑device AI for mobile sits at the intersection of mobile SDKs and inference infrastructure, serving teams that want LLM/vision/speech features without cloud round trips. As AI features diffuse into mainstream apps, a meaningful subset will require local, privacy‑preserving, low‑latency inference.
Bottom-up calculation:
If 10,000 mobile apps adopt on‑device inference with paid support/licenses averaging $5k–$50k annually per app, the annual TAM would be roughly $50M–$500M (a worked sanity check follows the assumptions below). Larger enterprises with multiple apps or higher SLAs could push the upper bound higher.
Assumptions:
- Adoption by 10k apps that need local inference for latency/privacy/offline.
- Paid plans centered on support, SDK/tooling, and production features at $5k–$50k per app/year.
- Does not include adjacent revenue (custom acceleration work, managed fallback, or enterprise features).
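As a quick sanity check of that arithmetic, the sketch below multiplies the assumed app count by the assumed price band; both inputs are the section's estimates above, not observed data.

```kotlin
// Back-of-the-envelope TAM check using the assumptions listed above
// (10k adopting apps, $5k-$50k paid plan per app per year).

fun main() {
    val adoptingApps = 10_000L
    val annualPricePerAppUsd = 5_000..50_000 // low and high ends of the assumed plan range

    val tamLowUsd = adoptingApps * annualPricePerAppUsd.first   // 50,000,000
    val tamHighUsd = adoptingApps * annualPricePerAppUsd.last   // 500,000,000

    println("Annual TAM ≈ \$${tamLowUsd / 1_000_000}M – \$${tamHighUsd / 1_000_000}M")
}
```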
Who are some of their notable competitors
- MLC (mlc‑llm): Open‑source compiler/runtime to run LLMs across iOS/Android/web with language bindings. Strong for model compilation, but developers often handle platform bindings and packaging themselves (site, GitHub).
- llama.cpp / ggml: Popular minimal C/C++ runtime for running LLMs on many devices, including phones. Flexible and fast for experiments, but app‑level SDKs, packaging, and production tooling are typically DIY (repo).
- TensorFlow Lite (LiteRT): General on‑device ML runtime with conversion tools and hardware delegates (GPU/NNAPI/CoreML). Broad model support, but it is not focused on end‑to‑end LLM workflows or agent‑style mobile SDKs out of the box (docs).
- Apple Core ML: Native iOS framework that leverages the Neural Engine and integrates with Xcode. Excellent for iOS‑only teams with Core ML models, but not cross‑platform and not centered on LLM agent workflows or hybrid cloud fallbacks (docs).
- ONNX Runtime Mobile: Cross‑platform inference engine with mobile optimizations and execution providers (NNAPI/CoreML). Good for ONNX‑converted models, but app‑level OpenAI‑style APIs, packaging, and agent tooling require extra work (docs).