What do they actually do
Osmosis provides an early‑access “Agent Improvement” platform that plugs into deployed AI agents to help them learn from real interactions. Teams connect their agent to Osmosis, which captures traces, scores outputs with developer‑defined reward functions or rubrics, and uses those signals to adjust future behavior so the agent improves over time instead of staying static. The system supports Model Context Protocol (MCP) tools, records interaction history, exposes vector search over past context, and cleans up stale knowledge to keep what the agent learns current (docs/introduction; blog on MCP/tool‑use).
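The docs describe MCP tool support but don't publish integration code, so the snippet below is only a minimal sketch of the kind of tool an agent might call, written with FastMCP from the official MCP Python SDK; the lookup_order tool and its behavior are invented for illustration and are not an Osmosis API.

```python
# A minimal MCP server exposing one tool, using FastMCP from the official
# MCP Python SDK (pip install "mcp[cli]"). The lookup_order tool is invented
# for illustration and is not an Osmosis API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-tools")


@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Return the status of an order (stubbed here)."""
    # A real integration would hit the order system; the point is that the
    # agent's calls to this tool become discrete, observable events.
    return f"Order {order_id}: shipped"


if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so an MCP-capable agent can attach
```

Once a tool like this is registered, the agent's calls to it show up as scoreable events in the interaction traces described above.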
Developers keep evaluation logic, tool wrappers, and reward rules in a GitHub repo; Osmosis auto‑syncs this repo so changes to rubrics or code immediately update how interactions are evaluated and learned from. This Git‑backed workflow provides versioning and auditability for reward logic and tool integrations (docs/git‑sync/overview). Public docs, a launch blog/case study, and a released model artifact (Osmosis‑Apply‑1.7B) indicate the product is in active early use with pilots. The team announced a seed round and public launch in October 2025 and is opening access more broadly to production teams running agents (homepage; blog; Hugging Face; YC profile; LinkedIn launch post).
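Osmosis doesn't publish its reward‑function interface, so the module below is a hypothetical sketch of what a Git‑synced scorer could look like; the file path, Trace fields, and reward() signature are assumptions, but the shape (trace in, bounded score plus rationale out) matches the rubric‑style evaluation the docs describe.

```python
# rewards/helpfulness.py -- hypothetical layout for a Git-synced reward repo.
# The Trace fields and reward() signature are assumptions for illustration,
# not Osmosis's published interface.
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One captured agent interaction (assumed shape)."""
    user_message: str
    agent_response: str
    tool_calls: list = field(default_factory=list)  # e.g. [{"name": "search", "ok": True}]


def reward(trace: Trace) -> dict:
    """Score one interaction in [0.0, 1.0] with a short rationale."""
    score = 1.0
    reasons = []

    # Penalize failed tool calls -- the mis-sequenced or erroring tool use
    # the platform is meant to surface and train away.
    failed = [c for c in trace.tool_calls if not c.get("ok", False)]
    if failed:
        score -= 0.3
        reasons.append(f"{len(failed)} failed tool call(s)")

    # Penalize answers too short to be useful.
    if len(trace.agent_response.strip()) < 20:
        score -= 0.5
        reasons.append("response shorter than 20 characters")

    return {"score": max(score, 0.0), "rationale": "; ".join(reasons) or "ok"}
```

Because the scorer lives in the synced repo, tightening a penalty is a commit and undoing it is a revert, which is the versioning and rollback story outlined above.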
Who are their target customer(s)
- Product managers running customer‑facing chatbots and virtual assistants: Bots degrade or repeat mistakes in live conversations, and teams lack a safe, low‑effort way to teach them from real interactions without risky retraining (docs/introduction).
- ML/Ops engineers responsible for production agents: There’s no repeatable pipeline to score agent behavior, version evaluation logic, and roll back changes; improvements are slow and hard to audit. Git‑synced reward functions/rubrics aim to fix this (docs/git‑sync/overview).
- Teams building multi‑tool automation (RPA, agentic workflows, developer tooling): Agents pick wrong tools, mis‑sequence calls, or fail silently across services; tool‑use is hard to observe and improve. Osmosis focuses on tool‑call observability and MCP integrations to make this trainable (blog on MCP/tool‑use).
- Startups or product teams creating specialized domain agents (e.g., code or vertical assistants): They need iterative tuning of small models or agent policies from real usage, and one‑off fine‑tuning is expensive and cumbersome. Osmosis positions itself for continuous/online learning and lightweight deployments (Hugging Face).
- Security, privacy, and compliance owners in regulated companies: They need to prevent undesirable learning from user data and maintain an audit trail of what changed and why. Osmosis records traces and scores and keeps reward logic in Git for auditability (docs/introduction; docs/git‑sync/overview).
How would they acquire their first 10, 50, and 100 customers
- First 10: Run hands‑on co‑development pilots with YC/startup teams already testing agents. Embed engineers for 4–8 weeks to wire up Git‑synced rewards and MCP tools, show measurable error reductions, and publish success stories (docs/introduction; YC profile; blog).
- First 50: Open the waitlist with self‑serve Git Sync templates, MCP examples, and sample repos so small teams can install in days. Support with technical content, webinars, and office hours; reuse pilot playbooks to speed onboarding (docs/git‑sync/overview; blog on MCP/tool‑use; Hugging Face).
- First 100: Convert active users to paid pilots with clear scopes (30–90 day SLAs, audit logs, rollback controls) targeted at PM and ML/Ops buyers. Pursue integrations and channel partnerships with LLM/agent platforms to drive inbound enterprise deals and partner‑led implementations (docs/introduction; docs/git‑sync/overview; blog on MCP/tool‑use).
What is the rough total addressable market
Top-down context:
Relevant spend spans conversational/IVA software, AI agents/orchestration, MLOps/model governance, and RPA/automation, which together run to the tens of billions today. Example estimates cite multi‑billion markets for conversational AI and AI for customer service, several billion for AI agents and MLOps, and RPA in the mid‑single‑digit billions, implying a non‑exclusive combined pool of roughly USD 30–45B in 2024–25 (GVR conversational AI; GVR AI for customer service; GVR AI agents; GVR/GMInsights MLOps, https://gminsights.com/industry-analysis/mlops-market; Fortune Business RPA).
Bottom-up calculation:
Illustratively, if 40,000–80,000 teams worldwide operate production agents and each spends roughly USD 100k–125k per year on agent improvement/ops, that implies USD 4–10B of serviceable TAM today. This aligns with a pragmatic slice of the broader, overlapping markets noted above.
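A quick back‑of‑envelope check of that range, using the stated team counts and per‑team spend as the only inputs:

```python
# Back-of-envelope check of the range above; team counts and per-team spend
# are the stated assumptions, not measured data.
teams_low, teams_high = 40_000, 80_000
spend_low, spend_high = 100_000, 125_000  # USD per team per year

tam_low = teams_low * spend_low     # 4_000_000_000  -> ~USD 4B
tam_high = teams_high * spend_high  # 10_000_000_000 -> ~USD 10B
print(f"Serviceable TAM: USD {tam_low / 1e9:.0f}B - {tam_high / 1e9:.0f}B")
```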
Assumptions:
- Tens of thousands of teams are already deploying or piloting production agents across support, operations, and internal tooling.
- Annual spend per team covers evaluation, continuous learning, monitoring, and governance separate from core LLM/API costs.
- Osmosis addresses a horizontal need across stacks (independent of model/provider), capturing a share of ops budgets as agent usage scales.
Who are some of their notable competitors
- LangSmith (LangChain): Observability and evaluation platform for LLM apps and agents with tracing, dashboards, alerts, online/offline evals, and self‑hosting options. Strong fit for teams on LangChain/LangGraph but framework‑agnostic (Observability; Evaluation).
- Humanloop: Enterprise LLM evals, prompt management, and observability platform with CI/CD integrations and human/AI evaluators; positioned for evaluation‑driven development of prompts and agents (home; Evaluations).
- Langfuse: Open‑source LLM engineering platform for tracing, evaluation (LLM‑as‑judge, human, custom), and prompt management; popular for self‑hosting and OTel‑style workflows (Observability overview; Evaluation overview).
- Weights & Biases Weave: W&B’s observability and evaluation toolkit for LLM apps and agents (traces, scorers, guardrails, registries), with dedicated guidance for agent systems and MCP integrations (Weave docs; Agents).
- Arize Phoenix: Open‑source LLM tracing and evaluation built on OpenTelemetry/OpenInference; includes prompt playgrounds, datasets/experiments, and human/LLM evals; usable standalone or alongside Arize’s enterprise platform (Phoenix site; Docs).
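For concreteness, the tracing these platforms share usually amounts to wrapping an agent step so its inputs, output, and latency are logged as a run; below is a minimal sketch with LangSmith's @traceable decorator (the triage_ticket function and its logic are invented for the example).

```python
# Illustrative tracing in the style these platforms share, using LangSmith's
# @traceable decorator (pip install langsmith; credentials via environment).
# The triage_ticket function and its logic are invented for the example.
from langsmith import traceable


@traceable(name="triage_ticket")
def triage_ticket(ticket_text: str) -> str:
    """Classify a support ticket; the call is logged as a run when tracing is enabled."""
    if "refund" in ticket_text.lower():
        return "billing"
    return "general"


if __name__ == "__main__":
    print(triage_ticket("I want a refund for my last order"))
```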