
Archil

Transform S3 into an unlimited, local file system

Fall 2024 batch · Active


What do they actually do?

Archil lets you mount an S3‑compatible bucket as a normal, POSIX‑like disk on your servers. Frequently used data is served from Archil’s shared, SSD‑backed cache for low‑latency reads/writes, while the rest stays in S3 and is fetched on demand; writes are synced back to S3 over time (docs intro, architecture).

Teams create a volume in the web console or CLI and mount it on one or more machines (including via a Kubernetes CSI driver). Files appear immediately to existing tools like PyTorch, Spark, and Jupyter without code changes (quickstart, CSI driver).
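The "no code changes" claim can be illustrated with a minimal sketch: once a volume is mounted, data under the mount point is just ordinary files to any filesystem API. The mount path here is hypothetical, and is simulated with a temporary directory so the sketch is self-contained; with a real Archil volume the path would be wherever the volume is mounted.

```python
import tempfile
from pathlib import Path

# Hypothetical mount point, simulated here with a temp directory.
# With a real volume this would be e.g. the path the CSI driver mounts.
mount = Path(tempfile.mkdtemp())

# Anything under the mount is an ordinary file: no SDK, no special
# client, no application changes.
(mount / "dataset").mkdir()
(mount / "dataset" / "shard-000.txt").write_text("sample record\n")

# Standard filesystem APIs (and hence PyTorch, Spark, Jupyter) see the
# data directly, while the caching layer handles S3 behind the scenes.
shards = sorted(p.name for p in (mount / "dataset").iterdir())
print(shards)  # ['shard-000.txt']
```

This is what "files appear immediately to existing tools" means in practice: a training job points its data loader at a path, not at an object-store API.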

Pricing is based on “active” data in the working set rather than total bytes in S3; the public Developer plan lists $0.20 per active GiB‑month (pricing). Archil publishes performance comparisons and details of its cache/protocol design in the docs (performance).

Who are their target customers?

  • ML training teams (researchers and infra): They spend hours copying multi‑TB datasets to local disks and maintaining checkpoint/sync logic across many machines. They need fast shared access so jobs can start immediately without custom data plumbing.
  • Data scientists and analytics teams (notebooks, Spark, ad‑hoc): They face long time‑to‑first‑IO and duplicate large datasets across instances, slowing experiments and wasting storage. They want a mount that makes data usable by notebooks and Spark right away.
  • Platform and DevOps engineers (internal platforms, sandboxes): They maintain brittle sync scripts, image baking, and sandbox provisioning. They want a single shared mount to simplify onboarding, reduce per‑project ops, and improve reproducibility.
  • Teams building retrieval/RAG and inference infra: They need consistent, low‑latency access to large indexes across many services and can’t afford slow S3 fetches or duplicated state. They want a shared cache that serves hot data quickly.
  • Enterprise IT and security (governance, on‑prem/BYOC): They need versioning, access controls, locality guarantees, and deployment options that standard S3+EBS setups don’t provide out of the box.

How would they acquire their first 10, 50, and 100 customers?

  • First 10: Founder‑led, hands‑on pilots with high‑fit ML, data science, and platform teams via YC/intros and targeted outreach; offer free credits, help mount their buckets, and run a measurable job to prove value in days (quickstart, CSI).
  • First 50: Publish reproducible benchmarks and short recipes (PyTorch, Spark, Jupyter) with repos/notebooks; run workshops and webinars, and pair them with a one‑click sandbox and low‑friction developer plan for self‑serve onboarding (performance, pricing).
  • First 100: Productize onboarding (self‑serve signup, templates, trials) and add a small sales/partners motion plus marketplace listings; target MLOps vendors and platform teams, supported by 3–5 case studies with clear ROI metrics (roadmap post, pricing).

What is the rough total addressable market?

Top-down context:

The closest direct analog is AI‑powered storage, estimated at about $30.6B in 2024 and growing quickly (Grand View). Broader context: IaaS was ~$172B in 2024 and public cloud spending is forecast at $723B in 2025, but only a slice maps to Archil’s use cases (Gartner IaaS, Gartner cloud).

Bottom-up calculation:

Using current pricing, a 100‑TiB active working set (≈102,400 GiB) costs about $20,480/month, or ~$246k/year per project (pricing). At 1,000 such projects, revenue would be roughly $246M/year; this illustrates how revenue scales with active working‑set size and customer count.
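The arithmetic above can be checked directly, assuming the listed $0.20 per active GiB-month rate holds across the working set:

```python
# Bottom-up revenue sketch using the public Developer-plan rate.
PRICE_PER_GIB_MONTH = 0.20          # $/active GiB-month (listed pricing)

active_gib = 100 * 1024             # 100 TiB active working set = 102,400 GiB
monthly = active_gib * PRICE_PER_GIB_MONTH   # per-project monthly cost
annual = monthly * 12                        # per-project annual cost
at_1000_projects = annual * 1000             # illustrative revenue scale

print(f"${monthly:,.0f}/month")          # $20,480/month
print(f"${annual:,.0f}/year")            # $245,760/year (~$246k)
print(f"${at_1000_projects:,.0f}/year")  # $245,760,000/year (~$246M)
```

The point of the sketch is the linearity: revenue scales with active working-set size and project count, not with total bytes parked in S3.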

Assumptions:

  • Active working set billed at $0.20/GiB‑month remains representative.
  • Typical ML/analytics projects manage on the order of tens to hundreds of TiB of active data.
  • A portion of teams adopt a managed S3‑backed mount rather than self‑managed alternatives.

Who are some of their notable competitors?

  • Alluxio: Open‑source data‑caching layer that mounts object stores as a filesystem for clusters; similar goal (make S3 data appear local) but typically self‑managed, with POSIX/consistency limitations that teams must work around (docs).
  • Amazon FSx for Lustre: AWS’s managed Lustre service with S3 linking for high‑throughput HPC; strong on scale/throughput inside AWS but not a multi‑cloud, pay‑for‑active‑data S3 mount flow.
  • Weka (WekaFS): Enterprise high‑performance filesystem for AI/HPC with S3 tiering; sold as software/appliance for on‑prem/cloud clusters, emphasizing raw throughput and enterprise features over a hosted, per‑active‑data SaaS model (overview).
  • LucidLink: Streaming cloud filespace with client‑side caching, popular in media/creative. Similar mount/stream UX, but oriented to desktop collaboration rather than multi‑node training clusters or Kubernetes CSI (platform).
  • s3fs‑fuse: Open‑source FUSE driver to mount S3 as a filesystem; simple and free for single‑node use, but lacks coordination and strong guarantees, so teams often outgrow it and build extra syncing logic.