Solution: a RAG pipeline for 10M+ docs with zero hallucination
==============================================================

This page is **arborist's answer to a Google L5 system-design prompt** that
made the rounds online:

   *"Design a RAG pipeline for 10M docs with zero hallucination."*

arborist answers it as a **Reverse RAG** — the Merkle Providence Reverse
RAG architecture (`whitepaper`_): instead of trusting a model and hoping
it stays grounded, every claim is *bound back* to source spans by Merkle
proof, and what cannot be bound is refused. The canonical answer is a
ten-box pipeline: ingest/normalize, hybrid BM25+embedding retrieval,
ANN+rerank, source-confidence scoring, constrained generation,
citation-backed responses, caching, evals, a no-answer path for
insufficient evidence, and observability.
arborist already covers every one of those boxes and goes four
steps further, which is what actually buys *zero hallucination* and a
near-zero bill:

#. **A deterministic verifier, not a model confidence score.** "Zero
   hallucination" is not a threshold you tune — it is a property you prove.
   arborist grounds each claim against source spans with a byte-for-byte
   lexical verifier and emits an honest ``UNGROUNDED`` (no answer) when the
   evidence is not there. No model judges itself.
#. **A Merkle-bound cache that skips the GPU.** A hot answer is a
   content-addressed providence record with a Merkle proof. It replays
   with **zero GPU joules** and adds nothing to the GPU cost.
#. **No vector embeddings by default.** Retrieval is **lexical-first**
   (FTS5 BM25 + Merkle proofs); dense-vector semantic search is an
   *optional* layer, **disabled by default**. Embedding 10M docs would
   cost **10–100× more per document** to ingest than lexical indexing,
   plus a whole vector index to store, maintain and re-embed on drift.
   arborist skips all of it, and that asymmetry is a large part of why
   the bill stays low.
#. **Measured energy COGS — per 1,000 answers.** Because the pipeline is
   a Reverse RAG with no embedding step and no reasoning chains (cheap
   prefill + short decode), the GPU cost of a *grounded* answer is small
   enough to **measure per thousand**: **~$0.07–0.16 per 1,000 answers
   at $0.33/kWh** (see :doc:`bench`). A cache hit skips the GPU entirely,
   so it never enters that count.
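
The first point can be sketched as follows. This is a minimal illustration
of byte-for-byte grounding, not the code in ``qa/verify.py``: the names
``Grounding``, ``ground`` and ``verdict`` are hypothetical, claims are
assumed to be verbatim quotes, and the ``HYBRID`` verdict and the
verifier's other checks are deliberately left out.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Optional


   @dataclass(frozen=True)
   class Grounding:
       """Result of binding one claim to a source (hypothetical type)."""
       claim: str
       start: Optional[int]  # byte offset of the match, or None if unbound
       end: Optional[int]    # byte offset one past the match, or None


   def ground(claim: str, source: bytes) -> Grounding:
       """Find the claim's exact bytes in the source; no fuzzy matching."""
       needle = claim.encode("utf-8")
       pos = source.find(needle) if needle else -1
       if pos < 0:
           return Grounding(claim, None, None)
       return Grounding(claim, pos, pos + len(needle))


   def verdict(claims: list[str], source: bytes) -> str:
       """STRICT only if every claim binds to a span, else UNGROUNDED."""
       if not claims:
           return "UNGROUNDED"  # nothing to prove means nothing to answer
       results = [ground(c, source) for c in claims]
       if all(r.start is not None for r in results):
           return "STRICT"
       return "UNGROUNDED"


   source = "Paris is the capital of France.".encode("utf-8")
   print(verdict(["capital of France"], source))
   print(verdict(["capital of Spain"], source))

The check is a plain substring search over bytes, so the outcome depends
only on the claim and the source, never on a model's opinion of itself.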

.. _whitepaper: https://unfirehose.com/merkle-providence-reverse-rag-whitepaper
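
One way to read "content-addressed" for the cache in point 2 is that the
lookup key is a hash over everything that could change the answer. The
sketch below is an assumption, not the scheme in ``qa/canonical_cache.py``:
the function name ``cache_key`` and the JSON canonicalization are
hypothetical, and only dimensions that this page names (``source_root``,
``question``, ``model``, ``chunking``) are shown.

.. code-block:: python

   import hashlib
   import json


   def cache_key(dimensions: dict[str, str]) -> str:
       """Hash a set of key dimensions into one hex digest.

       Sorting keys and fixing separators makes the serialization
       canonical, so the same dimensions always give the same key.
       """
       canonical = json.dumps(
           dimensions, sort_keys=True, separators=(",", ":"), ensure_ascii=False
       )
       return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


   key = cache_key({
       "source_root": "ab12",        # placeholder Merkle root of the corpus
       "question": "Who wrote Hamlet?",
       "model": "example-model",     # placeholder model identifier
       "chunking": "tok-512-v1",
   })
   print(key)

Because the corpus root is part of the key, an answer cached against one
corpus state cannot be served for another.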

Pipeline
--------

.. graphviz::

   digraph arborist_rag {
     rankdir=TB;
     ranksep=0.4; nodesep=0.25;
     fontname="Helvetica";
     node [fontname="Helvetica", shape=box, style=rounded, fontsize=10];
     edge [fontname="Helvetica", fontsize=9];

     query [label="USER QUERY", shape=oval, style=filled, fillcolor="#eeeeff"];

     subgraph cluster_ingest {
       label="0  INGEST -> MERKLE STORE  (offline, content-addressed)";
       style=dashed; color="#888888";
       ing   [label="normalize + dedup\n(content-addressed = idempotent)\nmetadata + versioning"];
       chunk [label="chunk (tok-512-v1)\n-> Merkle root\n-> FTS5 index\n-> supersedes edges"];
       store [label="MERKLE STORE\n3.47M docs / 6.2M chunks\nWikipedia 2010 (time-locked)",
              shape=cylinder, style=filled, fillcolor="#eeffee"];
       ing -> chunk -> store;
     }

     subgraph cluster_pre {
       label="1  PRE-GPU GUARDS  (filter / falsify BEFORE the GPU)";
       style=dashed; color="#cc8800";
       guard  [label="metacognition\n(temporal / contradiction /\nfalse-premise / out-of-corpus)\n+ broad-quantifier + cross-lang + coherence"];
       reject [label="REJECT / FLAG\nno GPU touched", style=filled, fillcolor="#ffdddd"];
       guard -> reject [label="incoherent /\nunanswerable", style=dashed];
     }

     subgraph cluster_cache {
       label="2  PROVIDENCE CACHE  (Merkle-bound)";
       style=dashed; color="#0088cc";
       cache [label="8-dim cache_key lookup\nsource_root / question / model /\npolicy / schema / chunking ...",
              shape=cylinder, style=filled, fillcolor="#ddeeff"];
       hit   [label="CACHE HIT ->\nreplay verified answer\n+ Merkle proof\nSKIP GPU - 0 joules",
              style=filled, fillcolor="#ddffdd"];
       cache -> hit [label="hot"];
     }

     subgraph cluster_retr {
       label="3  RETRIEVAL + RERANK  (4 routes, merged)";
       style=dashed; color="#00aa00";
       retr   [label="FTS5 LEXICAL-FIRST (no vector index)\nBM25 body / title-LIKE /\ncore-keyword (TF-IDF) / phrase-pattern\n-> merge"];
       rerank [label="sqrt body-coverage rerank /\ntitle boost / rivalry exclusion +\nsynonym / per-source cap"];
       ctx    [label="context assembly\nwikitext -> base prose /\nper-mode budget"];
       retr -> rerank -> ctx;
     }

     subgraph cluster_gen {
       label="4  CONSTRAINED GENERATION  (narrow LLM, NON-reasoning)";
       style=dashed; color="#aa00aa";
       gen  [label="claim_lattice (pointer IDs ->\nruntime-interpolated spans)\nOR quote - no test-time compute",
             style=filled, fillcolor="#fff0ff"];
       cost [label="GPU: cheap prefill + short decode\n~$0.07-0.16 / 1k answers @ $0.33/kWh\nno embeddings · NO reasoning (would cost 4-6x)",
             shape=note, style=filled, fillcolor="#ffffcc"];
       gen -> cost [style=dotted, arrowhead=none];
     }

     subgraph cluster_verify {
       label="5  DETERMINISTIC VERIFIER  (no model -- the 'zero hallucination')";
       style=dashed; color="#cc0000";
       verify [label="quote / span / entity / paraphrase\n-> STRICT / HYBRID / UNGROUNDED\nfour-rung ladder / title-relevance /\nclaim-cap / warrant",
               style=filled, fillcolor="#ffeeee"];
       ung    [label="UNGROUNDED ->\nHONEST ABSTENTION\n(no answer, not a guess)",
               style=filled, fillcolor="#ffdddd"];
       verify -> ung [label="no evidence", style=dashed];
     }

     subgraph cluster_audit {
       label="6  AUDIT + WRITE-BACK + FALSIFICATION";
       style=dashed; color="#444444";
       audit   [label="audit chain\nevent_hash = sha256(prev || body)"];
       write   [label="write providence record\n+ Merkle proof / state=live"];
       falsify [label="falsification controller\ndrift -> stale / failed / quarantined\n(cores never evict)"];
       audit -> write -> falsify;
     }

     resp [label="FINAL ANSWER\nclaim -> source span -> Merkle root\nauditable + reproducible",
           shape=note, style=filled, fillcolor="#ddffdd"];

     query -> guard;
     guard -> cache [label="ok"];
     hit -> resp;
     cache -> retr [label="miss"];
     store -> retr [style=dotted, label="FTS5"];
     ctx -> gen;
     gen -> verify;
     verify -> audit [label="STRICT / HYBRID"];
     ung -> resp;
     write -> resp;
     write -> cache [label="now hot ->\nnext time skips GPU", style=dashed, constraint=false];
     falsify -> store [style=dashed, constraint=false];
   }
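
The "chunk -> Merkle root" step and the final "claim -> source span ->
Merkle root" binding in the diagram rest on inclusion proofs. The sketch
below shows the general technique only. It assumes SHA-256 with the
``0x00``/``0x01`` leaf and node prefixes used by RFC 6962 and, as a
simplification, duplicates the last node on odd-sized levels; arborist's
actual tree layout may differ, and all function names are hypothetical.

.. code-block:: python

   import hashlib


   def _h(data: bytes) -> bytes:
       return hashlib.sha256(data).digest()


   def merkle_levels(chunks: list[bytes]) -> list[list[bytes]]:
       """Return every tree level, leaves first and the root level last."""
       if not chunks:
           raise ValueError("need at least one chunk")
       level = [_h(b"\x00" + c) for c in chunks]  # leaf hashes
       levels = [level]
       while len(level) > 1:
           if len(level) % 2:
               level = level + [level[-1]]  # pad odd levels (assumption)
           level = [_h(b"\x01" + level[i] + level[i + 1])
                    for i in range(0, len(level), 2)]
           levels.append(level)
       return levels


   def inclusion_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
       """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
       proof = []
       for level in levels[:-1]:
           if len(level) % 2:
               level = level + [level[-1]]
           sibling = index ^ 1
           proof.append((level[sibling], sibling < index))
           index //= 2
       return proof


   def verify(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
       """Recompute the root from one chunk and its proof."""
       node = _h(b"\x00" + chunk)
       for sibling, sibling_is_left in proof:
           pair = sibling + node if sibling_is_left else node + sibling
           node = _h(b"\x01" + pair)
       return node == root


   chunks = [b"chunk-a", b"chunk-b", b"chunk-c"]
   levels = merkle_levels(chunks)
   root = levels[-1][0]
   print(verify(chunks[2], inclusion_proof(levels, 2), root))

A verifier needs only the chunk, the proof and the root, so a cited span
can be checked without access to the rest of the store.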

How arborist maps onto (and extends) the canonical design
---------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 40 30

   * - Canonical RAG box
     - arborist component
     - What arborist adds
   * - Ingest + normalize
     - dedup + chunk + FTS5 (``ingest.py``, ``store.py``)
     - **Merkle commitment** + idempotent content-addressing + lossless
       ``supersedes`` version history
   * - Hybrid retrieval (BM25 + embeddings)
     - 4-route FTS5 merge, **lexical-first** (``qa/query.py``)
     - **no embeddings needed** — dense-vector optional + off by default
       (10-100x cheaper ingest, no vector index); phrase-pattern route
   * - ANN + reranking
     - sqrt body-coverage rerank + title boost
     - corpus-derived synonym/rivalry layer, no embedding index to drift
   * - Source confidence scoring (heuristics + ML)
     - **deterministic lexical verifier** (``qa/verify.py``)
     - confidence is *proven*, not scored: STRICT / HYBRID / UNGROUNDED
   * - Constrained generation
     - claim_lattice / quote (``qa/runner.py``)
     - pointer IDs the model can't forge; **non-reasoning** (cheap)
   * - Citation-backed responses
     - per-claim source span -> Merkle root
     - every answer carries a verifiable inclusion proof to its sources
   * - Caching + memory
     - providence cache (``qa/canonical_cache.py``)
     - **content-addressed + proof-carrying**; a hit **skips the GPU**
   * - Continuous evals
     - bench harnesses (``bench/``)
     - energy COGS per grounded answer, not just quality (see :doc:`bench`)
   * - Insufficient evidence -> no answer
     - ``UNGROUNDED`` honest abstention + pre-GPU metacognition guards
     - rejection happens **before** the GPU is touched
   * - Observability
     - run-DAG + audit chain (``qa/dag.py``, ``store.py``)
     - tamper-evident ``event_hash`` chain, not just dashboards
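
The tamper-evident chain in the last row follows the rule shown in the
pipeline diagram, ``event_hash = sha256(prev || body)``. A minimal sketch
of that rule is below. The all-zero genesis value, the in-memory list and
the function names are assumptions for illustration and are not taken
from ``store.py``.

.. code-block:: python

   import hashlib

   GENESIS = b"\x00" * 32  # assumed value of "prev" for the first event

   Chain = list[tuple[bytes, bytes]]  # (body, event_hash) pairs


   def event_hash(prev: bytes, body: bytes) -> bytes:
       """sha256(prev || body), as in the diagram's audit node."""
       return hashlib.sha256(prev + body).digest()


   def append_event(chain: Chain, body: bytes) -> None:
       """Link a new event to the hash of the one before it."""
       prev = chain[-1][1] if chain else GENESIS
       chain.append((body, event_hash(prev, body)))


   def verify_chain(chain: Chain) -> bool:
       """Recompute every link; one edited body breaks all later hashes."""
       prev = GENESIS
       for body, recorded in chain:
           if event_hash(prev, body) != recorded:
               return False
           prev = recorded
       return True


   chain: Chain = []
   for body in (b"retrieve", b"generate", b"verify"):
       append_event(chain, body)
   print(verify_chain(chain))

   chain[1] = (b"generate (edited)", chain[1][1])  # tamper with one event
   print(verify_chain(chain))

Each hash commits to all earlier events, which is what lets an auditor
detect an edit anywhere in the history from the final hash alone.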

Scaling to 10M+
---------------

arborist is at **3.47M documents / 6.2M chunks** today (all of 2010 English
Wikipedia, time-locked) — about **35%** of the 10M target. The architecture
is complete; reaching 10M is a sourcing + storage question, not a redesign:

* **Storage:** at ~11.6 KB/doc, the remaining ~6.5M docs add about
  **77 GB** (~118 GB total at 10M).
* **Sourcing:** 2010 enwiki is essentially exhausted at 3.47M, so the
  remaining docs come from later dumps, other languages, or other knowledge
  bases — each ingested as its own **time-locked** corpus rather than
  blurring the cutoff.
* **Levers already in-tree:** zstd-dictionary compression (``compress.py``),
  hot/cold eviction (``evict.py``), and shard/mesh distribution
  (``mesh/``) — 10M never has to live on one box.
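
The storage estimate above is straightforward arithmetic and can be
rechecked as the corpus grows. The sketch below assumes that "KB" means
binary kilobytes (1,024 bytes) and "GB" means 10^9 bytes; that reading is
an assumption, chosen because it comes closest to the quoted figures, and
with decimal kilobytes the results are about 2% lower. The function name
is hypothetical.

.. code-block:: python

   def projected_storage_gb(docs: int, bytes_per_doc: float) -> float:
       """Storage in decimal gigabytes for a given document count."""
       return docs * bytes_per_doc / 1e9


   CURRENT_DOCS = 3_470_000
   TARGET_DOCS = 10_000_000
   BYTES_PER_DOC = 11.6 * 1024  # ~11.6 KB/doc, read as KiB (assumption)

   added = projected_storage_gb(TARGET_DOCS - CURRENT_DOCS, BYTES_PER_DOC)
   total = projected_storage_gb(TARGET_DOCS, BYTES_PER_DOC)
   print(f"added: {added:.1f} GB, total at 10M: {total:.1f} GB")

Under these assumptions the script gives roughly 77.6 GB added and
118.8 GB in total, before any zstd-dictionary compression is applied.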
