ParaQuery

Managed GPU-accelerated Spark + SQL at 2x speed and half the cost

Spring 2025 • active • 2025 • Website

Developer Tools • Analytics • Big Data • Enterprise Software • Infrastructure

Disclaimer

FYI Combinator is not affiliated with Y Combinator. Reports are generated by AI Research Agents and may not be 100% accurate.

Report from 8 months ago

What do they actually do

ParaQuery runs a fully managed Spark + SQL service that executes high‑volume ETL and analytical jobs on GPUs. It uses the Spark‑RAPIDS plugin, so teams keep Spark SQL compatibility while getting GPU acceleration. Customers keep data in their existing object storage (e.g., S3/GCS) and point ParaQuery at it. They submit Spark or SQL jobs, and ParaQuery operates the GPU clusters and runtime, so customers do not manage GPU infrastructure themselves (site, Launch HN).
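For context, turning on the open‑source Spark‑RAPIDS plugin on a self‑managed cluster mostly comes down to a handful of Spark settings. The Python sketch below assembles a spark-submit command with those settings. The configuration keys are the ones documented by the open‑source plugin; the jar path, the script name, and the GPU amounts are hypothetical placeholders, and nothing here is ParaQuery's actual configuration.

```python
import shlex

# Settings documented by the open-source RAPIDS Accelerator for Apache Spark.
# The values are illustrative assumptions, not ParaQuery's configuration.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",   # load the GPU SQL plugin
    "spark.rapids.sql.enabled": "true",              # allow GPU execution of SQL ops
    "spark.executor.resource.gpu.amount": "1",       # assumed: one GPU per executor
    "spark.task.resource.gpu.amount": "0.25",        # assumed: four tasks share a GPU
}


def build_spark_submit(app, plugin_jar, conf=RAPIDS_CONF):
    """Return the argument list for a spark-submit call with GPU settings."""
    args = ["spark-submit", "--jars", plugin_jar]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Hypothetical job script and jar location, for illustration only.
cmd = build_spark_submit("etl_job.py", "/path/to/rapids-4-spark.jar")
print(shlex.join(cmd))
```

The existing Spark job itself is unchanged, which is the compatibility point the report makes; the operational work a managed service takes on is everything around this command (GPU nodes, drivers, tuning).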

Today it’s an early, white‑glove offering with a waitlist and founder‑led pilots rather than a broad self‑serve product. The team reports demo speedups and cost savings (e.g., a ~44‑minute BigQuery job running in ~5–6 minutes on ParaQuery, and one early customer cutting BigQuery spend by ~60%), but these are company‑reported results from pilots and demos, not yet public benchmarks at scale (Launch HN, site).
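As a quick check on the demo figure, the quoted runtimes imply a speedup of roughly 7x to 9x on that single job. A minimal Python sketch of the arithmetic, using only the company‑reported numbers (the variable names are illustrative):

```python
# Company-reported demo runtimes, in minutes (one job, not a benchmark suite).
baseline_minutes = 44          # reported BigQuery runtime
paraquery_minutes = (5, 6)     # reported ParaQuery runtime range

# A shorter runtime gives a larger speedup, so the slow end (6 min) is the low bound.
speedup_low = baseline_minutes / max(paraquery_minutes)
speedup_high = baseline_minutes / min(paraquery_minutes)

print(f"Implied speedup: {speedup_low:.1f}x to {speedup_high:.1f}x")
```

This is one data point from a demo, so it says little about typical workloads.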

Who are their target customer(s)

Data engineering teams running frequent, large Spark/SQL ETL pipelines: Multi‑hour jobs create downstream delays and require costly compute to meet SLAs; they want faster, more predictable runtimes without re‑platforming (site, Launch HN).
Analytics/BI teams with high warehouse compute bills: Big joins and aggregations have long latency and escalating costs on warehouses; they need lower runtime and cost for heavy queries (Launch HN).
ML feature‑engineering teams with large preprocessing pipelines: Slow feature jobs delay iteration and model training; they need cheaper, faster batch feature generation to run more frequently (site).
Platform/infra engineers who operate Spark clusters but not GPUs: Managing GPU hardware, drivers, and cluster reliability adds operational risk and complexity they prefer to avoid (site).
Engineering teams hitting very large shuffles or petabyte‑scale workloads: Jobs become brittle at scale with performance cliffs; they want a service that handles large shuffles and scaling without heavy custom engineering (Launch HN).

How would they acquire their first 10, 50, and 100 customers

First 10: Founder‑led, risk‑free pilots for companies with large BigQuery/Databricks bills; run a customer’s slowest ETL/join job, share a custom ROI analysis, and handle on‑call operations during the pilot to prove savings (site, Launch HN).
First 50: Standardize a two‑week pilot playbook (connectors, compatibility checklist, repeatable benchmark + ROI report); hire sales engineers/CS to run pilots in parallel and publish case studies to drive inbound (site, Launch HN).
First 100: Launch clear per‑cloud pricing and automated compatibility checks for self‑serve onboarding; list on cloud marketplaces, partner with data consultancies, and use public benchmarks/OSS contributions to reduce trust friction (site, Launch HN).

What is the rough total addressable market

Top-down context:

Industry reports size cloud analytics/data‑warehouse spend around ~$30–$42B annually in 2024–2025; if 50–70% is compute, that’s ~$15–$29B of compute, and focusing on heavy ETL/joins suggests ~$4.5–$14.5B is addressable for ParaQuery (Grand View Research, MRFR).

Bottom-up calculation:

Assume 8,000–12,000 mid/large enterprises run heavy Spark/SQL workloads. With $1–$2M/year in analytics compute per enterprise and 30–50% of that spend in heavy ETL/joins, the addressable spend is roughly $2.4–$12B/year. This overlaps most of the top‑down range and reflects ParaQuery’s workload focus.

Assumptions:

50–70% of cloud analytics/warehouse revenue is compute for heavy users
30–50% of compute spend goes to heavy ETL/joins/feature pipelines
8k–12k enterprises run substantial Spark/SQL pipelines with $1–$2M/year compute budgets
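Both estimates are products of (low, high) ranges, so they can be reproduced in a few lines. The Python sketch below uses only the figures and assumptions stated in this section; the helper name `scale_range` is illustrative.

```python
def scale_range(base, *shares):
    """Multiply a (low, high) range by successive (low, high) factors."""
    low, high = base
    for share_low, share_high in shares:
        low, high = low * share_low, high * share_high
    return low, high


# Shared assumptions from the report.
compute_share = (0.5, 0.7)   # share of analytics/warehouse spend that is compute
heavy_share = (0.3, 0.5)     # share of compute spent on heavy ETL/joins

# Top-down, in $B per year.
market = (30, 42)
compute = scale_range(market, compute_share)
top_down = scale_range(compute, heavy_share)

# Bottom-up: enterprises x compute budget ($M per year) x heavy share.
enterprises = (8_000, 12_000)
budget_musd = (1, 2)
bottom_up_musd = scale_range(enterprises, budget_musd, heavy_share)
bottom_up = tuple(x / 1000 for x in bottom_up_musd)  # convert $M to $B

print(f"Compute spend: ${compute[0]:.1f}B to ${compute[1]:.1f}B")
print(f"Top-down:      ${top_down[0]:.1f}B to ${top_down[1]:.1f}B")
print(f"Bottom-up:     ${bottom_up[0]:.1f}B to ${bottom_up[1]:.1f}B")
```

Without intermediate rounding, the top‑down upper bound comes out near $14.7B; the ~$14.5B quoted above follows from rounding compute spend to $29B before applying the 50% share.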

Who are some of their notable competitors

Databricks: Managed Spark and analytics platform widely used for ETL, SQL, and ML; ParaQuery competes on GPU‑accelerated execution for similar Spark workloads (Databricks).
Google BigQuery: Serverless cloud data warehouse known for ease of use and scale; ParaQuery aims to undercut runtime and cost for heavy ETL/joins by running on GPUs (BigQuery).
Snowflake: Popular cloud data warehouse with separated storage/compute and strong BI ecosystem; ParaQuery targets teams that want Spark compatibility and GPU speedups (Snowflake).
OmniSci (now HEAVY.AI; GPU‑native DB): GPU databases provide very fast SQL/visual analytics but require moving data into a GPU‑native DB; ParaQuery instead accelerates existing Spark pipelines (OmniSci).
NVIDIA Spark‑RAPIDS (open source): Open‑source plugin to run Spark on GPUs; teams can self‑manage RAPIDS and GPU infra instead of using a hosted service like ParaQuery (Spark‑RAPIDS).