What do they actually do
Ragas is an open-source Python toolkit that teams run locally or in CI to evaluate and monitor LLM applications such as RAG pipelines, prompts, agents, and multi-step workflows. It’s a pip-installable library with a CLI and APIs, documented metrics, examples, and an active GitHub community. It focuses on producing per-sample scores and reasoning, storing results, and enabling repeatable experiments whenever prompts, retrieval, or models change (GitHub, docs).
Typical use: engineers install Ragas, prepare evaluation datasets (hand-curated or synthetic), choose built-in or custom metrics (e.g., faithfulness, context precision/recall, tool-call accuracy), and run experiments to compare baselines vs. changes. Results are saved (e.g., CSV) with per-sample details so teams can inspect failures and iterate. Teams can also connect Ragas to production traces or observability tools (e.g., Datadog, Langfuse, LangSmith/LangChain) to score sampled traffic and surface likely bad answers for review (metrics, how-to, Datadog integration, Langfuse guide).
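The evaluate-and-save loop described above can be sketched offline. Note the hedges: `faithfulness_stub` is a hypothetical token-overlap stand-in for Ragas’s real faithfulness metric (which uses an LLM judge to check whether each claim in the answer is supported by the retrieved contexts), and the CSV layout is illustrative rather than the library’s exact schema.

```python
import csv

def faithfulness_stub(answer: str, contexts: list[str]) -> float:
    """Hypothetical stand-in for an LLM-judged faithfulness metric:
    fraction of answer tokens that also appear in the retrieved contexts."""
    norm = lambda text: {t.strip(".,!?") for t in text.lower().split()}
    answer_tokens = norm(answer)
    context_tokens = norm(" ".join(contexts))
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# A tiny hand-curated eval set: question, pipeline answer, retrieved contexts.
samples = [
    {"question": "Who wrote Dune?",
     "answer": "Frank Herbert wrote Dune.",
     "contexts": ["Dune is a novel by Frank Herbert."]},
    {"question": "When was it published?",
     "answer": "It was published in 1965 by Chilton Books.",
     "contexts": ["Dune was first published in 1965."]},
]

# Score every sample and persist per-sample rows so failures can be inspected.
rows = [{"question": s["question"],
         "faithfulness": round(faithfulness_stub(s["answer"], s["contexts"]), 2)}
        for s in samples]

with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "faithfulness"])
    writer.writeheader()
    writer.writerows(rows)
```

The second sample scores lower because “Chilton Books” is not supported by the retrieved context — exactly the kind of per-sample failure the saved results let a team triage.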
Operationally, Ragas is a developer library rather than a hosted SaaS. The near-term roadmap emphasizes more first-class templates (agent/prompt/workflow evals), broader integrations, and improved synthetic testset generation and judge alignment, consistent with their repo, docs, and blog posts (quickstart/README, integrations, blog on aligning judges).
Who are their target customer(s)
- ML/LLM engineers at startups building RAG and chat features: They need a repeatable way to re-run the same tests after prompt/model changes and to see per-example failures to debug regressions. Ragas provides a local/CI toolkit and quickstarts for this workflow (quickstart).
- Platform or SRE teams operating LLMs in production: They need to detect wrong or risky model outputs from live traffic and feed those scores into monitoring/alerting; logs/traces alone don’t capture answer quality. Ragas integrates with observability tooling to score sampled traces (Datadog integration).
- Observability and tooling vendors adding LLM evaluation features: They want an embeddable evaluation backend rather than building one from scratch. Ragas is used as a backend in vendor guides like Langfuse’s cookbook (Langfuse guide).
- Product managers and QA owners of LLM-powered features: They need concrete metrics and failing examples to prioritize fixes and decide if changes actually help users. Ragas saves per-sample scores and reasoning to categorize and triage errors (how-to).
- Research engineers and data scientists working on evals and alignment: They need better test-set generation and ways to align LLM-based judges with human labels because off-the-shelf judges can disagree. Ragas offers tools and guidance for synthetic data and judge alignment (blog).
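The judge-alignment problem in the last bullet can be made concrete with a quick agreement check. The verdict arrays below are invented illustration data, not Ragas output; aligning a judge means tuning its prompt or rubric until agreement with human labels is high enough to trust.

```python
# Hypothetical pass/fail verdicts (1 = acceptable, 0 = not) from an LLM judge
# and from human reviewers on the same ten answers.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

# Raw agreement: fraction of samples where judge and human match.
agreement = sum(h == j for h, j in zip(human, judge)) / len(human)

# Cohen's kappa corrects for the agreement expected by chance alone.
p_yes = (sum(human) / len(human)) * (sum(judge) / len(judge))
p_no = (1 - sum(human) / len(human)) * (1 - sum(judge) / len(judge))
p_chance = p_yes + p_no
kappa = (agreement - p_chance) / (1 - p_chance)

print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Here raw agreement is 0.80 but kappa is only ~0.58, illustrating why off-the-shelf judges that “mostly agree” with humans can still be too unreliable to act on.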
How would they acquire their first 10, 50, and 100 customers
- First 10: Convert existing community users (contributors, Discord members, teams that starred or forked the repo) by offering hands-on help to get their first CI evals running within a week (GitHub, quickstart).
- First 50: Publish partner cookbooks and example repos (Datadog, Langfuse, LangChain) and run short workshops showing how to add Ragas to a live RAG/chat pipeline; follow up with attendees for setup help (Datadog, Langfuse guide).

- First 100: Ship plug-and-play connectors for popular observability/pipeline tools, offer a paid onboarding package, and publish 2–3 case studies on regressions caught to drive inbound and targeted outreach (integrations).
What is the rough total addressable market
Top-down context:
A share of generative-AI software spend flows to evaluation/monitoring. If 0.5–1% of a ~$63.7B 2025 generative-AI software market goes to evaluation, that implies a ~$318M–$637M TAM, with upside if evaluation becomes core infrastructure (ABI Research). Observability budgets are already substantial, with median annual spend around $1.9–$2.0M per organization according to New Relic’s 2024 report, and AI monitoring adoption is rising (New Relic 2024 report, New Relic 2025 report).
Bottom-up calculation:
Illustrative: 1,000 mid/large orgs × $75k/year + 5,000 startups/SMBs × $5k/year + 200 vendors × $100k/year ≈ $120M/year. With higher adoption or larger enterprise contracts, totals move into the multiple hundreds of millions.
Assumptions:
- A meaningful subset of orgs will operationalize LLM evaluation in CI and monitoring.
- Vendors will embed an evaluation backend via licenses/integration deals.
- Per-customer spend reflects OSS-first usage plus paid onboarding/connectors or support.
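The arithmetic behind both estimates is easy to verify; the figures are the same ones stated above, recomputed:

```python
# Top-down: 0.5–1% of a ~$63.7B generative-AI software market (ABI Research).
market = 63.7e9
top_down_low, top_down_high = 0.005 * market, 0.01 * market  # ~$318M–$637M

# Bottom-up: the illustrative segment mix from above.
bottom_up = (1_000 * 75_000      # mid/large orgs at $75k/year
             + 5_000 * 5_000     # startups/SMBs at $5k/year
             + 200 * 100_000)    # embedding vendors at $100k/year

print(f"top-down: ${top_down_low / 1e6:.1f}M-${top_down_high / 1e6:.0f}M")
print(f"bottom-up: ${bottom_up / 1e6:.0f}M/year")
```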
Who are some of their notable competitors
- OpenAI Evals: Open-source framework (with hosted options) for building and running model evals and benchmarks. Strong for model/benchmark comparisons; less focused than Ragas on CI-friendly RAG/agent workflows and observability integrations (GitHub, docs).
- LangChain / LangSmith (OpenEvals): LangChain provides evaluators and examples (OpenEvals, agentevals); LangSmith adds hosted tracing, experiments, and evaluation UX. Tight integration for LangChain users and a hosted UI, versus Ragas’ standalone, library-first approach (OpenEvals, LangSmith docs).
- Hugging Face Evaluate: Library and Hub for community metrics (e.g., BLEU, ROUGE). Useful for standard NLP metrics, but not designed as an application-level CI/monitoring toolkit for RAG or production traces like Ragas (docs, GitHub).
- EleutherAI lm-evaluation-harness: Benchmarking harness for running standard NLP/reasoning benchmarks across models. Geared toward research-style model comparisons rather than continuous per-sample evaluation or production monitoring (GitHub).
- Langfuse: Observability and evaluation platform for LLM apps with tracing, dashboards, and evaluations (including LLM-as-a-judge). Broader hosted observability platform that can run/store evals, while Ragas is a lightweight evaluation library often used as a backend (evaluation docs, observability docs).