What do they actually do
Wafer builds Herdora, a tool and runtime that makes existing PyTorch models run faster on GPUs and other accelerators. Teams point Herdora at their model; it profiles the workload to find bottlenecks, then generates or patches optimized kernels so the model runs more efficiently. The company advertises typical speedups in the 1.5–5× range, though these are its own claims (Herdora site).
Customers can deploy the optimized model into the Wafer Inference Engine as a managed service or in their own VPC, with built‑in monitoring and ongoing performance fixes. The product targets teams running inference in production that don’t have in‑house CUDA specialists and want lower costs, steadier latency, or portability to non‑NVIDIA hardware such as AMD GPUs (Herdora site; Wafer YC profile). Wafer also publishes tooling, including a profiler, in its GitHub organization, which supports the profiling and monitoring workflow described above (Herdora GitHub).
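For context, the profiling step in that workflow looks roughly like a standard PyTorch profiler pass. The sketch below is illustrative only: the toy model and input shapes are placeholders, it requires a CUDA GPU, and nothing here uses Herdora's actual API (which isn't documented in the sources cited).

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and batch standing in for a customer's real PyTorch workload.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda().eval()
inputs = torch.randn(32, 4096, device="cuda")

# Profile one inference pass on CPU + GPU, recording tensor shapes.
with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    with record_function("inference"):
        model(inputs)

# Rank operators by total GPU time; the top rows are the kernel-level
# bottlenecks that a tool like Herdora would presumably target with
# generated or patched kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

In a workflow like the one described above, the operators at the top of this table would be the candidates for optimized kernels and for ongoing monitoring after deployment.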
Who are their target customer(s)
- ML/infra engineers at startups running user‑facing models: They need predictable latency and lower cloud GPU bills but lack time and specialist GPU expertise to hand‑tune performance (Herdora; YC profile).
- Platform or SRE teams responsible for production inference: They must keep models reliable at scale but lack tooling to quickly find and fix kernel‑ or pipeline‑level bottlenecks (Herdora GitHub).
- Teams migrating off NVIDIA or evaluating alternative accelerators: Porting and tuning models for new hardware is slow and error‑prone; they want portability and performance without rebuilding low‑level code (Herdora; YC profile).
- Product/ML leads with large, continuous inference traffic: They need to reduce per‑request cost and avoid hiring scarce senior CUDA engineers to maintain performance (Herdora; YC profile).
- Early‑adopter ML teams running many models on shared GPUs: They want higher utilization and continuous, automated performance fixes instead of one‑off manual patches ([Herdora homepage](https://www.herdora.com/); [Herdora blog](https://www.herdora.com/blog)).
How would they acquire their first 10, 50, and 100 customers
- First 10: Run hands‑on, no‑risk pilots via warm intros (YC network and early adopters), where Wafer engineers integrate Herdora, profile workloads, and deliver clear before/after results and cost deltas (Herdora; YC profile).
- First 50: Package the pilot into a standard "quick win" offer with an integration checklist and ROI report; publish short case studies and technical playbooks, run focused outbound to matching teams, and use the open‑source tooling on GitHub to attract developers (Herdora; Herdora GitHub).
- First 100: Productize onboarding for the managed runtime (self‑hosted images, templates, checklists), add channel partners (cloud resellers, hardware vendors, MLOps firms), and run a light enterprise motion for accounts that need SLAs while continuing to publish verified outcomes (YC profile; Herdora product pages).
What is the rough total addressable market
Top-down context:
Analyst estimates put global AI infrastructure spend around $136B in 2024 and growing rapidly, with the data‑center GPU market alone around $87B in 2024 (MarketsandMarkets AI infrastructure; MarketsandMarkets data‑center GPU). Within that, the most directly comparable segment—AI inference platform/runtimes—is estimated at roughly $18–20B in the mid‑2020s and expanding quickly (MarketsandMarkets AI Inference PaaS via PR).
Bottom-up calculation:
As a near‑term bottom‑up view, assume 3,000–6,000 organizations worldwide run user‑facing inference at meaningful scale and are candidates for third‑party optimization/runtimes; if 40% are serviceable at an average annual contract of ~$200k, that implies an initial SAM of roughly $240M–$480M for Wafer, with upside as more teams adopt GenAI and expand usage (a worked check follows the assumptions below).
Assumptions:
- Counts reflect companies with sustained production inference (not pilots or research only).
- Average ACV includes optimization plus managed runtime; excludes pure consulting.
- Serviceability limited by supported frameworks/hardware and required integration effort.
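The bottom‑up figure is simple enough to check mechanically. A minimal sketch under the stated assumptions (3,000–6,000 candidate organizations, 40% serviceable, ~$200k ACV); all inputs are the memo's estimates, not verified data:

```python
# Back-of-the-envelope check of the bottom-up SAM range above.
candidate_orgs = (3_000, 6_000)   # orgs running user-facing inference at meaningful scale
serviceable_share = 0.40          # share reachable given supported frameworks/hardware
avg_acv_usd = 200_000             # ~$200k average annual contract (optimization + runtime)

low, high = (n * serviceable_share * avg_acv_usd for n in candidate_orgs)
print(f"Initial SAM: ${low/1e6:.0f}M–${high/1e6:.0f}M")   # Initial SAM: $240M–$480M
```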
Who are some of their notable competitors
- OctoML (later OctoAI; acquired by NVIDIA in 2024): Commercial optimizer built on Apache TVM that tunes/compiles models for multiple hardware targets and provides packaging/deployment; overlaps on automated optimization and cross‑hardware deployment.
- NVIDIA (TensorRT + Triton): NVIDIA’s first‑party pair of a kernel‑level optimizer (TensorRT) and a production inference server (Triton) for its GPUs; strong for teams staying on NVIDIA hardware and wanting a vendor‑supported stack.
- Apache TVM: Open‑source ML compiler that auto‑generates tuned kernels across CPUs/GPUs/accelerators; alternative for teams willing to build and operate their own compilation and autotuning pipelines.
- IREE: An MLIR‑based compiler and runtime targeting multiple backends (Vulkan/ROCm/CUDA/CPU); competes on multi‑backend compilation and efficient deployable artifacts, including non‑NVIDIA accelerators.
- Neural Magic (DeepSparse): Inference runtime and toolset optimized for sparsity and CPU execution; relevant for teams exploring CPU‑based cost reductions instead of GPU‑centric tuning.