Solution: a RAG pipeline for 10M+ docs with zero hallucination

This page is arborist’s answer to a Google L5 system-design prompt that made the rounds online:

“Design a RAG pipeline for 10M docs with zero hallucination.”

arborist answers it as a Reverse RAG, the Merkle Providence Reverse RAG architecture (whitepaper): instead of trusting a model and hoping it stays grounded, every claim is bound back to source spans by Merkle proof, and whatever cannot be bound is refused. The canonical answer is a ten-box pipeline: ingest/normalize, hybrid BM25+embedding retrieval, ANN+rerank, source-confidence scoring, constrained generation, citation-backed responses, caching, continuous evals, a no-answer path for insufficient evidence, and observability. arborist already implements every one of those boxes and goes four steps further, which is what actually buys zero hallucination and a near-zero bill:

A deterministic verifier, not a model confidence score. “Zero hallucination” is not a threshold you tune; it is a property you prove. arborist grounds each claim against source spans with a byte-for-byte lexical verifier and emits an honest UNGROUNDED (no answer) when the evidence is not there. No model judges itself.
A Merkle-bound cache that skips the GPU. A hot answer is a content-addressed providence record with a Merkle proof; it replays with zero GPU joules and never enters the cost count.
No vector embeddings. Retrieval is lexical-first (FTS5 BM25 + Merkle proofs); dense-vector semantic search is an optional layer, disabled by default. Embedding 10M docs would cost 10–100× more per document to ingest and would require a whole vector index to store, maintain and re-embed on drift. arborist skips all of it, and that asymmetry is a large part of why the bill is what it is.
Measured energy COGS, per 1,000 answers. Because the pipeline is a Reverse RAG with no embedding step and no reasoning chains (cheap prefill + short decode), the GPU cost of a grounded answer is small enough to measure per thousand: ~$0.07–0.16 per 1,000 answers at $0.33/kWh (see Benchmark surface). A cache hit skips the GPU entirely, so it never enters that count.
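The first point above can be illustrated with a minimal sketch of a deterministic grounding check. This is not the code in `qa/verify.py`: the `Claim` fields, the `verify_claim` and `verify_answer` names, and the byte-offset span format are all assumptions made for illustration, and the HYBRID verdict is left out because its definition is not given here.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    STRICT = "STRICT"          # every claim matched its source bytes exactly
    UNGROUNDED = "UNGROUNDED"  # honest abstention: no answer is returned


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record: a quote plus the source span it points at."""
    quote: str    # text the answer asserts
    doc_id: str   # source document identifier
    start: int    # byte offset where the cited span begins
    end: int      # byte offset one past the end of the cited span


def verify_claim(claim: Claim, sources: dict[str, bytes]) -> Verdict:
    """STRICT only if the cited bytes equal the quoted bytes exactly."""
    source = sources.get(claim.doc_id)
    if source is None:
        return Verdict.UNGROUNDED
    if not (0 <= claim.start < claim.end <= len(source)):
        return Verdict.UNGROUNDED
    span = source[claim.start:claim.end]
    if span == claim.quote.encode("utf-8"):
        return Verdict.STRICT
    return Verdict.UNGROUNDED


def verify_answer(claims: list[Claim], sources: dict[str, bytes]) -> Verdict:
    """An answer is grounded only if it has claims and every one is STRICT."""
    if not claims:
        return Verdict.UNGROUNDED
    if all(verify_claim(c, sources) is Verdict.STRICT for c in claims):
        return Verdict.STRICT
    return Verdict.UNGROUNDED


if __name__ == "__main__":
    sources = {"d1": b"The Eiffel Tower is in Paris."}
    good = Claim(quote="Eiffel Tower", doc_id="d1", start=4, end=16)
    bad = Claim(quote="Eiffel Tower is in Rome", doc_id="d1", start=4, end=27)
    print(verify_answer([good], sources).value)
    print(verify_answer([good, bad], sources).value)
```

The check involves no model call and no tunable threshold: the same claim and the same source bytes always produce the same verdict, which is the property the text relies on.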

Pipeline

How arborist maps onto (and extends) the canonical design

Canonical RAG box	arborist component	What arborist adds
Ingest + normalize	dedup + chunk + FTS5 (`ingest.py`, `store.py`)	Merkle commitment + idempotent content-addressing + lossless `supersedes` version history
Hybrid retrieval (BM25 + embeddings)	4-route FTS5 merge, lexical-first (`qa/query.py`)	no embeddings needed: dense-vector search is optional and off by default (10–100× cheaper ingest, no vector index); adds a phrase-pattern route
ANN + reranking	sqrt body-coverage rerank + title boost	corpus-derived synonym/rivalry layer, no embedding index to drift
Source confidence scoring (heuristics + ML)	deterministic lexical verifier (`qa/verify.py`)	confidence is proven, not scored: STRICT / HYBRID / UNGROUNDED
Constrained generation	claim_lattice / quote (`qa/runner.py`)	pointer IDs the model can’t forge; non-reasoning (cheap)
Citation-backed responses	per-claim source span -> Merkle root	every answer carries a verifiable inclusion proof to its sources
Caching + memory	providence cache (`qa/canonical_cache.py`)	content-addressed + proof-carrying; a hit skips the GPU
Continuous evals	bench harnesses (`bench/`)	energy COGS per grounded answer, not just quality (see Benchmark surface)
Insufficient evidence -> no answer	`UNGROUNDED` honest abstention + pre-GPU metacognition guards	rejection happens before the GPU is touched
Observability	run-DAG + audit chain (`qa/dag.py`, `store.py`)	tamper-evident `event_hash` chain, not just dashboards
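The "verifiable inclusion proof" in the citation row can be sketched as a textbook binary Merkle tree over source chunks. This is a generic illustration, not arborist's actual tree layout or hash scheme: SHA-256, the leaf/node prefix bytes, the rule of pairing an odd node with itself, and all function names are assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(chunk: bytes) -> bytes:
    # Prefix bytes keep leaf hashes and interior hashes in separate domains.
    return _h(b"\x00" + chunk)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first, root level last."""
    if not chunks:
        raise ValueError("cannot build a Merkle tree over zero chunks")
    level = [leaf_hash(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Simplification: an unpaired last node is paired with itself.
            level = level + [level[-1]]
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(chunks: list[bytes]) -> bytes:
    return build_levels(chunks)[-1][0]


def inclusion_proof(chunks: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag is True when the
    running node is the right-hand child at that level."""
    if not 0 <= index < len(chunks):
        raise IndexError("chunk index out of range")
    proof = []
    for level in build_levels(chunks)[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # unpaired node was hashed with itself
        proof.append((level[sibling], index % 2 == 1))
        index //= 2
    return proof


def verify_inclusion(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from a chunk up to the root and compare."""
    acc = leaf_hash(chunk)
    for sibling, node_is_right in proof:
        acc = node_hash(sibling, acc) if node_is_right else node_hash(acc, sibling)
    return acc == root


if __name__ == "__main__":
    chunks = [b"span one", b"span two", b"span three"]
    root = merkle_root(chunks)
    proof = inclusion_proof(chunks, 2)
    print(verify_inclusion(b"span three", proof, root))
    print(verify_inclusion(b"span thr3e", proof, root))
```

A reader holding only the root and a proof of logarithmic length can confirm that a cited span was part of the committed corpus, without access to the rest of it.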

Scaling to 10M+

arborist is at 3.47M documents / 6.2M chunks today (all of 2010 English Wikipedia, time-locked) — about 35% of the 10M target. The architecture is complete; reaching 10M is a sourcing + storage question, not a redesign:

Storage: ~11.6 KB/doc -> +77 GB for the remaining ~6.5M docs (~118 GB total at 10M).
Sourcing: 2010 enwiki is essentially exhausted at 3.47M, so the remaining docs come from later dumps, other languages, or other knowledge bases — each ingested as its own time-locked corpus rather than blurring the cutoff.
Levers already in-tree: zstd-dictionary compression (`compress.py`), hot/cold eviction (`evict.py`), and shard/mesh distribution (`mesh/`), so 10M never has to live on one box.
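The storage estimate above is straightforward arithmetic, sketched here as a back-of-the-envelope check. The function name is hypothetical, and the unit convention is an assumption because the text does not say whether a KB is 1,000 or 1,024 bytes, so both are computed, with 1 GB taken as 10^9 bytes.

```python
def projected_storage_gb(docs: int, kb_per_doc: float, bytes_per_kb: int = 1000) -> float:
    """Projected on-disk size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return docs * kb_per_doc * bytes_per_kb / 1e9


CURRENT_DOCS = 3_470_000   # documents ingested today
TARGET_DOCS = 10_000_000   # the 10M target
KB_PER_DOC = 11.6          # average storage per document quoted above

remaining_docs = TARGET_DOCS - CURRENT_DOCS

if __name__ == "__main__":
    for bytes_per_kb in (1000, 1024):
        extra = projected_storage_gb(remaining_docs, KB_PER_DOC, bytes_per_kb)
        total = projected_storage_gb(TARGET_DOCS, KB_PER_DOC, bytes_per_kb)
        print(f"{bytes_per_kb} B/KB: +{extra:.1f} GB remaining, {total:.1f} GB total")
```

With 1,000-byte kilobytes this gives roughly 76 GB additional and 116 GB total; with 1,024-byte kilobytes, roughly 78 GB and 119 GB. The quoted +77 GB and ~118 GB fall within that range.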

Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.
