Solution: a RAG pipeline for 10M+ docs with zero hallucination
This page is arborist’s answer to a Google L5 system-design prompt that made the rounds online:
“Design a RAG pipeline for 10M docs with zero hallucination.”
arborist answers it as a Reverse RAG, the Merkle Providence Reverse RAG architecture (whitepaper): instead of trusting a model and hoping it stays grounded, every claim is bound back to source spans by Merkle proof, and what cannot be bound is refused. The canonical answer is a ten-box pipeline: ingest/normalize, hybrid BM25+embedding retrieval, ANN+rerank, source-confidence scoring, constrained generation, citation-backed responses, a no-answer path for insufficient evidence, plus evals, caching and observability. arborist already implements every one of those boxes and goes four steps further, which is what actually buys zero hallucination and a near-zero bill:
A deterministic verifier, not a model confidence score. “Zero hallucination” is not a threshold you tune; it is a property you prove. arborist grounds each claim against source spans with a byte-for-byte lexical verifier and emits an honest UNGROUNDED (no answer) when the evidence is not there. No model judges itself.
A Merkle-bound cache that skips the GPU. A hot answer is a content-addressed providence record with a Merkle proof: it replays with zero GPU joules and never adds to the GPU cost.
No vector embeddings. Retrieval is lexical-first (FTS5 BM25 + Merkle proofs); dense-vector semantic search is an optional layer, disabled by default. Embedding 10M docs would cost 10–100× more per document to ingest and a whole vector index to store, maintain and re-embed on drift — arborist skips all of it. That asymmetry is a large part of why the bill is what it is.
Measured energy COGS — per 1,000 answers. Because the pipeline is a Reverse RAG with no embedding step and no reasoning chains (cheap prefill + short decode), the GPU cost of a grounded answer is small enough to measure per thousand: ~$0.07–0.16 per 1,000 answers at $0.33/kWh (see Benchmark surface). A cache hit skips the GPU entirely, so it never enters that count.
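The first of the four points above, byte-for-byte grounding, can be illustrated with a minimal sketch. This is not arborist's verifier: the `Claim` and `Grounding` types, the `ground` and `answer` functions, and the `GROUNDED` label are all hypothetical names chosen for the example, and it assumes a claim counts as grounded only when its quoted text occurs verbatim, as UTF-8 bytes, in the cited source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    quote: str   # text the answer asserts, expected verbatim in the source
    doc_id: str  # hypothetical identifier of the cited source document


@dataclass(frozen=True)
class Grounding:
    doc_id: str
    start: int   # byte offset where the quote begins in the source
    end: int     # byte offset one past the end of the quote


def ground(claim: Claim, sources: dict[str, bytes]) -> Optional[Grounding]:
    """Return the byte span where the claim occurs verbatim, else None."""
    source = sources.get(claim.doc_id)
    needle = claim.quote.encode("utf-8")
    if source is None or not needle:
        return None
    start = source.find(needle)  # exact byte match, no fuzzy scoring
    if start < 0:
        return None
    return Grounding(claim.doc_id, start, start + len(needle))


def answer(claims: list[Claim], sources: dict[str, bytes]):
    """Refuse the whole answer unless every claim is bound to a span."""
    spans = [ground(c, sources) for c in claims]
    if not claims or any(s is None for s in spans):
        return "UNGROUNDED", []
    return "GROUNDED", spans  # illustrative label, not one of arborist's modes
```

The point of the design is that the decision is a pure function of the claim and the source bytes: the same inputs always yield the same verdict, and there is no threshold to tune.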
Pipeline
How arborist maps onto (and extends) the canonical design
| Canonical RAG box | arborist component | What arborist adds |
|---|---|---|
| Ingest + normalize | dedup + chunk + FTS5 ( | Merkle commitment + idempotent content-addressing + lossless |
| Hybrid retrieval (BM25 + embeddings) | 4-route FTS5 merge, lexical-first ( | no embeddings needed; dense-vector optional + off by default (10-100x cheaper ingest, no vector index); phrase-pattern route |
| ANN + reranking | sqrt body-coverage rerank + title boost | corpus-derived synonym/rivalry layer, no embedding index to drift |
| Source confidence scoring (heuristics + ML) | deterministic lexical verifier ( | confidence is proven, not scored: STRICT / HYBRID / UNGROUNDED |
| Constrained generation | claim_lattice / quote ( | pointer IDs the model can’t forge; non-reasoning (cheap) |
| Citation-backed responses | per-claim source span -> Merkle root | every answer carries a verifiable inclusion proof to its sources |
| Caching + memory | providence cache ( | content-addressed + proof-carrying; a hit skips the GPU |
| Continuous evals | bench harnesses ( | energy COGS per grounded answer, not just quality (see Benchmark surface) |
| Insufficient evidence -> no answer | | rejection happens before the GPU is touched |
| Observability | run-DAG + audit chain ( | tamper-evident |
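The "per-claim source span -> Merkle root" row relies on inclusion proofs. The sketch below shows the general technique, not arborist's implementation: it assumes SHA-256, a binary tree that pairs an odd trailing node with itself, and `0x00`/`0x01` prefix bytes to separate leaf hashes from interior hashes (an idea borrowed from RFC 6962). The function names are hypothetical.

```python
import hashlib


def leaf_hash(chunk: bytes) -> bytes:
    # Prefix byte keeps leaf hashes distinct from interior-node hashes.
    return hashlib.sha256(b"\x00" + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first; the root is levels[-1][0]."""
    if not chunks:
        raise ValueError("cannot commit to an empty corpus")
    levels = [[leaf_hash(c) for c in chunks]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        # Simplification: an odd trailing node is paired with itself.
        padded = cur if len(cur) % 2 == 0 else cur + [cur[-1]]
        levels.append(
            [node_hash(padded[i], padded[i + 1]) for i in range(0, len(padded), 2)]
        )
    return levels


def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_right) pairs from leaf up to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # the self-paired trailing node
        proof.append((level[sibling], index % 2 == 0))
        index //= 2
    return proof


def verify(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one chunk and its proof; compare to the commitment."""
    acc = leaf_hash(chunk)
    for sibling, sibling_is_right in proof:
        acc = node_hash(acc, sibling) if sibling_is_right else node_hash(sibling, acc)
    return acc == root
```

A verifier holding only the root can check a cited span with a proof of about log2(n) hashes, which is why a citation can be validated without access to the whole corpus.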
Scaling to 10M+
arborist is at 3.47M documents / 6.2M chunks today (all of 2010 English Wikipedia, time-locked) — about 35% of the 10M target. The architecture is complete; reaching 10M is a sourcing + storage question, not a redesign:
Storage: ~11.6 KB/doc -> +77 GB for the remaining ~6.5M docs (~118 GB total at 10M).
Sourcing: 2010 enwiki is essentially exhausted at 3.47M, so the remaining docs come from later dumps, other languages, or other knowledge bases — each ingested as its own time-locked corpus rather than blurring the cutoff.
Levers already in-tree: zstd-dictionary compression (compress.py), hot/cold eviction (evict.py), and shard/mesh distribution (mesh/), so 10M never has to live on one box.
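The storage estimate above can be re-derived with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and treats ~11.6 KB/doc as exact; the function name is hypothetical. Under those assumptions the straight multiplication comes out slightly below the quoted +77 GB and ~118 GB, presumably because the per-document figure is rounded.

```python
GB = 1_000_000_000  # decimal gigabyte (assumption)


def projected_storage(docs_now: int, target_docs: int, bytes_per_doc: int):
    """Return (extra_gb, total_gb) for growing the corpus to target_docs."""
    remaining = target_docs - docs_now
    extra_gb = remaining * bytes_per_doc / GB
    total_gb = target_docs * bytes_per_doc / GB
    return extra_gb, total_gb


extra_gb, total_gb = projected_storage(
    docs_now=3_470_000,      # documents ingested today (from the text)
    target_docs=10_000_000,  # the 10M target
    bytes_per_doc=11_600,    # ~11.6 KB/doc, treated as exact here
)
print(f"remaining docs: {10_000_000 - 3_470_000:,}")
print(f"extra storage:  {extra_gb:.1f} GB")
print(f"total at 10M:   {total_gb:.1f} GB")
```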
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.