What do they actually do
Expected Parrot provides tools to run scripted, reproducible surveys and simulations across many language models and human respondents. Today it ships an open‑source Python package (EDSL) for programmatic experiments and a hosted no‑code workspace (Polly/Coop) where teams can build personas, launch runs, store results, and share experiments (EDSL docs, Polly, Coop docs).
Users define agents/personas and questionnaires, choose models (hundreds are available via their own API keys or an Expected Parrot key), and run jobs; outputs are cached, so repeat runs are cheaper and easy to audit (Remote inference, Model catalog, Caching). The same instrument can be sent to human respondents (web surveys or Prolific) to compare AI and human answers; the product is used by academic researchers and is in early enterprise pilots (Human validation, Homepage testimonials, Polly pilot).
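For a sense of the programmatic workflow, here is a minimal sketch following the pattern shown in the EDSL docs (an Agent with persona traits, a structured question, a Survey, and a Model combined via `.by(...).run()`). The persona traits, question text, and model identifier below are illustrative assumptions, and exact class names and signatures should be checked against the current EDSL documentation.

```python
# Minimal EDSL-style run: persona + question + model.
# Traits, question text, and the model identifier are illustrative assumptions.
from edsl import Agent, Model, QuestionMultipleChoice, Survey

# A persona is an Agent with traits the model answers "in character" as.
agent = Agent(traits={"persona": "Graduate student in behavioral economics",
                      "age": 28})

# A structured question with fixed answer options.
q = QuestionMultipleChoice(
    question_name="pricing_reaction",
    question_text="How do you feel about a $15/month subscription?",
    question_options=["Too expensive", "Fair", "Cheap"],
)

survey = Survey(questions=[q])

# Choose a model (assumed identifier); models run against your own API keys
# or an Expected Parrot key for remote inference.
model = Model("gpt-4o")

# Run the survey; responses are cached, so re-running is cheap and auditable.
results = survey.by(agent).by(model).run()
results.select("pricing_reaction").print()
```

The same Survey object can be crossed with many agents and many models in one run, which is what makes side-by-side AI vs. human comparisons and model-to-model comparisons straightforward to reproduce.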
Who are their target customer(s)
- Academic researchers (universities and labs): Need to run controlled experiments across many language models and keep results reproducible and shareable for papers or replication; current tooling is ad hoc and hard to version.
- Product or UX research teams: Need a fast way to simulate how different user segments or stakeholders will react to copy or flows, and to compare AI‑generated responses against real people within one workspace.
- Market researchers and behavioral scientists: Must run structured surveys across many conditions and split results by respondent type; stitching together model runs, panel data, and analysis is tedious and error‑prone.
- Enterprise risk, compliance, and audit teams: Need to prove how a model produced a decision and control spend and access across vendors; lack reproducible runs, audit trails, and clear cost tracking.
- Internal ML evaluation or platform teams: Have to benchmark hundreds of models, compare cost vs latency, and run repeatable evaluations without building and maintaining custom infrastructure.
How would they acquire their first 10, 50, and 100 customers
- First 10: Directly onboard professors, lab leads, and active contributors already visible in testimonials and on GitHub by offering pilot credits and a hands‑on workshop to port one existing experiment into EDSL/Coop, producing a reproducible notebook and a public case study.
- First 50: Scale through academic channels: run tutorials at NLP/behavioral‑science conferences and university seminars, publish how‑to replications, and provide self‑serve academic onboarding with starter credits and ready‑to‑run notebooks to lower friction.
- First 100: Use short industry case studies showing time/cost savings and auditable comparisons to drive outbound to product/UX and market‑research teams; offer a two‑week paid pilot with clear success metrics, cost estimates, and human‑validation integration to convert to small paid accounts.
What is the rough total addressable market
Top-down context:
Expected Parrot sits at the intersection of the ~$140B global insights industry and adjacent software spend (ESOMAR/ResearchWorld), the online/survey software market of roughly $3.6–4.5B today (Grand View Research, Mordor Intelligence), and a fast‑growing enterprise AI software market already in the low tens of billions (ABI Research, Grand View Research—Enterprise AI).
Bottom-up calculation:
Near‑term, if 10,000 research labs and product/UX teams adopt at ~$10k ACV, that’s ~$100M; at 50,000 teams and ~$15k ACV, that’s ~$750M. The academic slice alone is supported by a large base of higher‑education institutions globally (CWUR).
Assumptions:
- ACV for research/survey software in the $10k–$15k range per team per year.
- Addressable teams across academia and industry on the order of 10k–50k globally; the ~20k higher‑ed institutions provide part of this base (CWUR).
- This bottom‑up view focuses on survey/research software buyers and excludes broader enterprise AI tooling budgets.
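The bottom-up figures above are a straightforward multiplication of adopting teams by ACV; the short sketch below reproduces them under the stated assumptions (the team counts and ACVs are this memo's own estimates, not measured data).

```python
# Reproduce the bottom-up TAM estimates from the stated assumptions.
# Team counts and ACVs are the memo's assumptions, not measured data.

def bottom_up_tam(teams: int, acv_usd: float) -> float:
    """Annual revenue opportunity = number of adopting teams x ACV."""
    return teams * acv_usd

scenarios = {
    "near-term": (10_000, 10_000),   # 10k teams at ~$10k ACV
    "expanded":  (50_000, 15_000),   # 50k teams at ~$15k ACV
}

for name, (teams, acv) in scenarios.items():
    tam = bottom_up_tam(teams, acv)
    print(f"{name}: {teams:,} teams x ${acv:,} ACV = ${tam / 1e6:,.0f}M")

# near-term: 10,000 teams x $10,000 ACV = $100M
# expanded: 50,000 teams x $15,000 ACV = $750M
```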
Who are some of their notable competitors
- OpenAI Evals: Open‑source framework for writing and sharing reproducible model evaluations. Overlaps on repeatable experiments but is an evaluation harness rather than a hosted, multi‑model survey workspace with personas and human‑panel integrations.
- LangSmith (LangChain): Observability, testing, and evaluation for LLM apps with tracing and dashboards. Strong on app debugging and prompt testing, less focused on survey‑style persona simulations and mixed AI/human research workflows.
- Hugging Face (Evaluate / Datasets / Inference Endpoints): Large model hub with evaluation libraries, datasets, and managed inference. Great for benchmarking and deployment, but not purpose‑built for team survey/persona simulations with integrated human validation and collaboration.
- Scale AI: Enterprise data labeling and evaluation services (RLHF, safety, and eval programs). Competes on human validation and annotation, but centers on labeling and production data pipelines rather than a researcher‑friendly multi‑model survey runner with caching.
- Botium (Cyara): Conversational testing to simulate users and run automated tests across chatbot platforms. Suited for QA and integration testing, not for structured, reproducible LLM surveys and mixed AI/human research datasets.