Solution: a RAG pipeline for 10M+ docs with zero hallucination

This page is arborist’s answer to a Google L5 system-design prompt that made the rounds online:

“Design a RAG pipeline for 10M docs with zero hallucination.”

arborist answers it as a Reverse RAG, the Merkle Providence Reverse RAG architecture (whitepaper): instead of trusting a model and hoping it stays grounded, every claim is bound back to source spans by Merkle proof, and whatever cannot be bound is refused. The canonical answer is a ten-box pipeline: ingest/normalize, hybrid BM25+embedding retrieval, ANN+rerank, source-confidence scoring, constrained generation, citation-backed responses, caching, continuous evals, a no-answer path for insufficient evidence, and observability. arborist already implements every one of those boxes and goes four steps further, which is what actually buys zero hallucination and a near-zero bill:

  1. A deterministic verifier, not a model confidence score. “Zero hallucination” is not a threshold you tune — it is a property you prove. arborist grounds each claim against source spans with a byte-for-byte lexical verifier and emits an honest UNGROUNDED (no answer) when the evidence is not there. No model judges itself.

  2. A Merkle-bound cache that skips the GPU. A hot answer is a content-addressed providence record with a Merkle proof; it replays with zero GPU joules and adds nothing to the GPU cost.

  3. No vector embeddings by default. Retrieval is lexical-first (FTS5 BM25 + Merkle proofs); dense-vector semantic search is an optional layer, disabled by default. Embedding 10M docs would cost 10–100× more per document at ingest and would require a whole vector index to store, maintain and re-embed on drift; arborist skips all of it. That asymmetry is a large part of why the bill is what it is.

  4. Measured energy COGS — per 1,000 answers. Because the pipeline is a Reverse RAG with no embedding step and no reasoning chains (cheap prefill + short decode), the GPU cost of a grounded answer is small enough to measure per thousand: ~$0.07–0.16 per 1,000 answers at $0.33/kWh (see Benchmark surface). A cache hit skips the GPU entirely, so it never enters that count.
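As a rough illustration of item 1, the sketch below shows what a byte-for-byte lexical check could look like. It is not arborist's own verifier: the names Verdict, verify_quote and SAMPLE_SOURCES, the dict-of-bytes source layout, and the sample sentence are all assumptions made for this example. Only the STRICT and UNGROUNDED outcomes are modeled; the HYBRID tier is left out because its rules are not described on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    label: str                 # "STRICT" or "UNGROUNDED" in this sketch
    source_id: Optional[str]   # source that contains the quote, if any
    offset: Optional[int]      # byte offset of the match inside that source


def verify_quote(quote: str, sources: dict[str, bytes]) -> Verdict:
    """Ground a quoted claim by exact byte match; abstain otherwise.

    No model and no similarity score are involved: the quote either occurs
    verbatim in one of the source bodies or the claim is refused.
    """
    needle = quote.encode("utf-8")
    if not needle.strip():
        # An empty quote would "match" everywhere, so it is never grounded.
        return Verdict("UNGROUNDED", None, None)
    for source_id, body in sources.items():
        offset = body.find(needle)
        if offset != -1:
            return Verdict("STRICT", source_id, offset)
    return Verdict("UNGROUNDED", None, None)


# Hypothetical sample data, for illustration only.
SAMPLE_SOURCES = {
    "doc-1": "The Eiffel Tower was completed in 1889.".encode("utf-8"),
}

print(verify_quote("completed in 1889", SAMPLE_SOURCES))  # grounded
print(verify_quote("completed in 1890", SAMPLE_SOURCES))  # abstains
```

The point of the design is that the second call cannot be talked into a pass: a single wrong byte turns the verdict into UNGROUNDED, which is what makes the outcome a checkable property rather than a tuned threshold.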

Pipeline

digraph arborist_rag {
  rankdir=TB; ranksep=0.4; nodesep=0.25; fontname="Helvetica";
  node [fontname="Helvetica", shape=box, style=rounded, fontsize=10];
  edge [fontname="Helvetica", fontsize=9];

  query [label="USER QUERY", shape=oval, style=filled, fillcolor="#eeeeff"];

  subgraph cluster_ingest {
    label="0 INGEST -> MERKLE STORE (offline, content-addressed)"; style=dashed; color="#888888";
    ing [label="normalize + dedup\n(content-addressed = idempotent)\nmetadata + versioning"];
    chunk [label="chunk (tok-512-v1)\n-> Merkle root\n-> FTS5 index\n-> supersedes edges"];
    store [label="MERKLE STORE\n3.47M docs / 6.2M chunks\nWikipedia 2010 (time-locked)", shape=cylinder, style=filled, fillcolor="#eeffee"];
    ing -> chunk -> store;
  }

  subgraph cluster_pre {
    label="1 PRE-GPU GUARDS (filter / falsify BEFORE the GPU)"; style=dashed; color="#cc8800";
    guard [label="metacognition\n(temporal / contradiction /\nfalse-premise / out-of-corpus)\n+ broad-quantifier + cross-lang + coherence"];
    reject [label="REJECT / FLAG\nno GPU touched", style=filled, fillcolor="#ffdddd"];
    guard -> reject [label="incoherent /\nunanswerable", style=dashed];
  }

  subgraph cluster_cache {
    label="2 PROVIDENCE CACHE (Merkle-bound)"; style=dashed; color="#0088cc";
    cache [label="8-dim cache_key lookup\nsource_root / question / model /\npolicy / schema / chunking ...", shape=cylinder, style=filled, fillcolor="#ddeeff"];
    hit [label="CACHE HIT ->\nreplay verified answer\n+ Merkle proof\nSKIP GPU - 0 joules", style=filled, fillcolor="#ddffdd"];
    cache -> hit [label="hot"];
  }

  subgraph cluster_retr {
    label="3 RETRIEVAL + RERANK (4 routes, merged)"; style=dashed; color="#00aa00";
    retr [label="FTS5 LEXICAL-FIRST (no vector index)\nBM25 body / title-LIKE /\ncore-keyword (TF-IDF) / phrase-pattern\n-> merge"];
    rerank [label="sqrt body-coverage rerank /\ntitle boost / rivalry exclusion +\nsynonym / per-source cap"];
    ctx [label="context assembly\nwikitext -> base prose /\nper-mode budget"];
    retr -> rerank -> ctx;
  }

  subgraph cluster_gen {
    label="4 CONSTRAINED GENERATION (narrow LLM, NON-reasoning)"; style=dashed; color="#aa00aa";
    gen [label="claim_lattice (pointer IDs ->\nruntime-interpolated spans)\nOR quote - no test-time compute", style=filled, fillcolor="#fff0ff"];
    cost [label="GPU: cheap prefill + short decode\n~$0.07-0.16 / 1k answers @ $0.33/kWh\nno embeddings · NO reasoning (would cost 4-6x)", shape=note, style=filled, fillcolor="#ffffcc"];
    gen -> cost [style=dotted, arrowhead=none];
  }

  subgraph cluster_verify {
    label="5 DETERMINISTIC VERIFIER (no model -- the 'zero hallucination')"; style=dashed; color="#cc0000";
    verify [label="quote / span / entity / paraphrase\n-> STRICT / HYBRID / UNGROUNDED\nfour-rung ladder / title-relevance /\nclaim-cap / warrant", style=filled, fillcolor="#ffeeee"];
    ung [label="UNGROUNDED ->\nHONEST ABSTENTION\n(no answer, not a guess)", style=filled, fillcolor="#ffdddd"];
    verify -> ung [label="no evidence", style=dashed];
  }

  subgraph cluster_audit {
    label="6 AUDIT + WRITE-BACK + FALSIFICATION"; style=dashed; color="#444444";
    audit [label="audit chain\nevent_hash = sha256(prev || body)"];
    write [label="write providence record\n+ Merkle proof / state=live"];
    falsify [label="falsification controller\ndrift -> stale / failed / quarantined\n(cores never evict)"];
    audit -> write -> falsify;
  }

  resp [label="FINAL ANSWER\nclaim -> source span -> Merkle root\nauditable + reproducible", shape=note, style=filled, fillcolor="#ddffdd"];

  query -> guard;
  guard -> cache [label="ok"];
  hit -> resp;
  cache -> retr [label="miss"];
  store -> retr [style=dotted, label="FTS5"];
  ctx -> gen;
  gen -> verify;
  verify -> audit [label="STRICT / HYBRID"];
  ung -> resp;
  write -> resp;
  write -> cache [label="now hot ->\nnext time skips GPU", style=dashed, constraint=false];
  falsify -> store [style=dashed, constraint=false];
}
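The audit node in the diagram defines the chain rule as event_hash = sha256(prev || body). A minimal sketch of that rule follows, to show why such a chain is tamper-evident. The function names, the all-zero GENESIS value used as the first "prev", and the sample event bodies are assumptions for this example, not arborist's actual store layout.

```python
import hashlib

# Assumed placeholder for the "prev" of the very first event.
GENESIS = b"\x00" * 32


def event_hash(prev: bytes, body: bytes) -> bytes:
    """event_hash = sha256(prev || body), with prev a 32-byte digest."""
    return hashlib.sha256(prev + body).digest()


def build_chain(bodies: list[bytes]) -> list[bytes]:
    """Return one hash per event, each committing to everything before it."""
    hashes = []
    prev = GENESIS
    for body in bodies:
        prev = event_hash(prev, body)
        hashes.append(prev)
    return hashes


def verify_chain(bodies: list[bytes], hashes: list[bytes]) -> bool:
    """Recompute the chain and compare it with the recorded hashes."""
    if len(bodies) != len(hashes):
        return False
    prev = GENESIS
    for body, expected in zip(bodies, hashes):
        if event_hash(prev, body) != expected:
            return False
        prev = expected
    return True


# Hypothetical event bodies, for illustration only.
events = [b"retrieve q=1", b"verify STRICT", b"write record"]
chain = build_chain(events)
print(verify_chain(events, chain))                                        # True
print(verify_chain([events[0], b"verify HYBRID", events[2]], chain))      # False
```

Because prev is always a fixed-length digest, plain concatenation is unambiguous here, and editing any earlier event changes every hash after it.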

How arborist maps onto (and extends) the canonical design

| Canonical RAG box | arborist component | What arborist adds |
|---|---|---|
| Ingest + normalize | dedup + chunk + FTS5 (ingest.py, store.py) | Merkle commitment + idempotent content-addressing + lossless supersedes version history |
| Hybrid retrieval (BM25 + embeddings) | 4-route FTS5 merge, lexical-first (qa/query.py) | no embeddings needed: dense-vector optional + off by default (10-100x cheaper ingest, no vector index); phrase-pattern route |
| ANN + reranking | sqrt body-coverage rerank + title boost | corpus-derived synonym/rivalry layer, no embedding index to drift |
| Source confidence scoring (heuristics + ML) | deterministic lexical verifier (qa/verify.py) | confidence is proven, not scored: STRICT / HYBRID / UNGROUNDED |
| Constrained generation | claim_lattice / quote (qa/runner.py) | pointer IDs the model can’t forge; non-reasoning (cheap) |
| Citation-backed responses | per-claim source span -> Merkle root | every answer carries a verifiable inclusion proof to its sources |
| Caching + memory | providence cache (qa/canonical_cache.py) | content-addressed + proof-carrying; a hit skips the GPU |
| Continuous evals | bench harnesses (bench/) | energy COGS per grounded answer, not just quality (see Benchmark surface) |
| Insufficient evidence -> no answer | UNGROUNDED honest abstention + pre-GPU metacognition guards | rejection happens before the GPU is touched |
| Observability | run-DAG + audit chain (qa/dag.py, store.py) | tamper-evident event_hash chain, not just dashboards |
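The "Citation-backed responses" row says every answer carries a verifiable inclusion proof from a source span to a Merkle root. The sketch below shows the general technique of building and checking such a proof. It is a generic Merkle tree under stated assumptions, not arborist's store format: the function names, the 0x00/0x01 leaf and node prefixes (in the style of RFC 6962), and the rule that an unpaired node is promoted unchanged are all choices made for this example.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _leaf(chunk: bytes) -> bytes:
    # Prefix keeps leaf hashes distinct from interior-node hashes.
    return _h(b"\x00" + chunk)


def _node(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first; the last level holds the root."""
    if not chunks:
        raise ValueError("need at least one chunk")
    levels = [[_leaf(c) for c in chunks]]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for i in range(0, len(cur), 2):
            if i + 1 < len(cur):
                nxt.append(_node(cur[i], cur[i + 1]))
            else:
                nxt.append(cur[i])  # unpaired node is promoted unchanged
        levels.append(nxt)
    return levels


def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf up to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):  # a promoted node has no sibling here
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from one chunk and its proof path."""
    acc = _leaf(chunk)
    for sibling, sibling_is_left in proof:
        acc = _node(sibling, acc) if sibling_is_left else _node(acc, sibling)
    return acc == root


# Hypothetical chunks, for illustration only.
chunks = [b"span-a", b"span-b", b"span-c", b"span-d", b"span-e"]
levels = build_levels(chunks)
root = levels[-1][0]
print(verify_inclusion(chunks[2], prove(levels, 2), root))       # True
print(verify_inclusion(b"forged span", prove(levels, 2), root))  # False
```

A reader who holds only the root can check a cited span with a proof whose length grows with the logarithm of the chunk count, which is what lets a cached answer carry its own evidence.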

Scaling to 10M+

arborist is at 3.47M documents / 6.2M chunks today (all of 2010 English Wikipedia, time-locked) — about 35% of the 10M target. The architecture is complete; reaching 10M is a sourcing + storage question, not a redesign:

  • Storage: ~11.6 KB/doc -> +77 GB for the remaining ~6.5M docs (~118 GB total at 10M).

  • Sourcing: 2010 enwiki is essentially exhausted at 3.47M, so the remaining docs come from later dumps, other languages, or other knowledge bases — each ingested as its own time-locked corpus rather than blurring the cutoff.

  • Levers already in-tree: zstd-dictionary compression (compress.py), hot/cold eviction (evict.py), and shard/mesh distribution (mesh/) — 10M never has to live on one box.
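The storage estimate above can be checked with a few lines of arithmetic. One assumption is needed to reproduce it: the quoted "~11.6 KB/doc" is read here as 11.6 KiB (1,024-byte units) while the totals are decimal gigabytes, since that reading lands closest to the quoted +77 GB and ~118 GB. The variable names are illustrative.

```python
DOCS_NOW = 3_470_000        # documents ingested today (from the text)
DOCS_TARGET = 10_000_000    # the 10M target

# Assumption: "11.6 KB" per document is taken as KiB (1,024 bytes each).
BYTES_PER_DOC = 11.6 * 1024


def to_gb(n_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


remaining_docs = DOCS_TARGET - DOCS_NOW
fraction_done = DOCS_NOW / DOCS_TARGET
extra_gb = to_gb(remaining_docs * BYTES_PER_DOC)
total_gb = to_gb(DOCS_TARGET * BYTES_PER_DOC)

print(f"ingested so far: {fraction_done:.1%}")
print(f"remaining docs:  {remaining_docs:,}")
print(f"extra storage:   {extra_gb:.1f} GB")
print(f"total at 10M:    {total_gb:.1f} GB")
```

Under that reading the remaining 6.53M documents need roughly 77.6 GB and the full 10M roughly 118.8 GB, in line with the figures quoted above; with decimal kilobytes the same inputs give slightly lower totals.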


Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

  • TRUTH — First principles, math & science, open source code freely distributed.

  • FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.

  • HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.

  • LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.