Concepts#

Arborist is a content-addressed document store that gives every cached LLM answer a verifiable Merkle proof tying it back to its source documents. This page is the orientation: what the system is, the core abstractions you’ll see in code and docs, and how they compose.

What arborist is#

Arborist is a reference implementation of two stacked papers:

  1. Merkle Providence Reverse RAG (whitepaper, April 2026) — the runtime: question → cache → retrieval → LLM → verifier → cache write.

  2. Merkle-AGI v9.8 — the admissibility ledger that makes cached answers cryptographically verifiable across peers.

Three layers stacked on one SQLite file (per shard):

  • Surface — ingested documents (Wikipedia, HTML pages, anything with a URI). Chunked, Merkle-rooted, FTS5-indexed.

  • Core — distilled documents Merkle-bound back to surfaces via per-chunk inclusion proofs. Recursive — a depth-N core can compress into a depth-N+1 core.

  • Providence cache — Q&A records keyed on the v9.8 8-dimension invariant; every record carries an audit_mode and a Merkle proof.

Figure: Three-layer stack (surface, core, providence cache)

How the three layers compose. Surface document_roots flow into both core derivations (with per-chunk inclusion proofs in derivations.proof_blob) and providence-cache records (as the source_root dimension of the 8-dim cache_key). Cores reuse the surface proof structure, so the Merkle binding survives compression. Providence rows then carry their own merkle_proof that ties an answer back to the chunks the verifier ran against, verifiable offline by any peer that holds the same surface roots.

Figure: Arborist module graph

Top-level module graph. Substrate (merkle, document) at the bottom; pipelines (ingest, distill, qa) in the middle; CLI on top.

The Merkle commitment#

Two peers that ingest the same source and run the same chunker and canonicalization compute bit-identical document_root hashes. That is the v9.8 admissibility property: identity by content, not by location.

Arborist uses fox’s existing Go Merkle conventions verbatim (proxy.unturf.com/pkg/verified/merkle.go):

  • Leaf hash: sha256(0x00 || canonical_chunk_bytes).

  • Internal hash: sha256(0x03 || left || right). Non-commutative: H(L,R) ≠ H(R,L).

  • Odd layers self-duplicate the trailing element. Not zero-pad.

  • Proof carries explicit IsLeft flag per sibling. Not lexical sort.
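
The four conventions above can be sketched in Python as follows. This is a minimal illustration, not the Go original or arborist's own port: the function names (leaf_hash, node_hash, merkle_root, inclusion_proof, verify_inclusion) are hypothetical, and treating the root of a single-chunk document as its bare leaf hash is an assumption.

```python
import hashlib

LEAF_PREFIX = b"\x00"  # domain separator for leaves
NODE_PREFIX = b"\x03"  # domain separator for internal nodes


def leaf_hash(chunk: bytes) -> bytes:
    return hashlib.sha256(LEAF_PREFIX + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Order matters: swapping left and right gives a different digest.
    return hashlib.sha256(NODE_PREFIX + left + right).digest()


def _pad(level):
    # Odd layers duplicate their trailing element (no zero padding).
    return level + [level[-1]] if len(level) % 2 else level


def _levels(chunks):
    if not chunks:
        raise ValueError("at least one chunk is required")
    level = [leaf_hash(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        padded = _pad(level)
        level = [node_hash(padded[i], padded[i + 1])
                 for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels


def merkle_root(chunks) -> bytes:
    return _levels(chunks)[-1][0]


def inclusion_proof(chunks, index):
    """Return [(sibling_hash, is_left), ...] from the leaf up to the root."""
    proof = []
    for level in _levels(chunks)[:-1]:
        padded = _pad(level)
        sibling = index ^ 1  # neighbour within the same pair
        proof.append((padded[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(chunk: bytes, proof, root: bytes) -> bool:
    h = leaf_hash(chunk)
    for sibling, is_left in proof:
        # The explicit flag says which side the sibling sits on.
        h = node_hash(sibling, h) if is_left else node_hash(h, sibling)
    return h == root
```

Because the sibling position travels with the proof, a verifier never has to sort hashes to decide concatenation order, which is what keeps the non-commutative internal hash checkable.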

See Substrate: Core data structures for the Python port.

Federation across peers#

Identity-by-content is the federation primitive. Two peers that ingest the same source with the same chunker and canonicalization compute the same document_root and the same cache_key; that property is what lets peers exchange and verify each other’s providence records without a central authority. The mesh layer implements gossip-based sync, group key management, and per-peer audit-chain reconciliation on top of that.

See Federation: multiplayer arborist for the mesh wire format, member-set semantics, and deploy notes.

The 8-dim cache key#

Every providence record is keyed on:

cache_key = sha256(
    source_root | question_hash | model_profile_hash |
    conversation_hash | governance_policy_hash |
    schema_version | canonicalization_version | chunking_version
)

Bumping any of the eight dimensions invalidates prior records on lookup. This is how the system stays honest across model changes, schema migrations, prompt edits, etc. — old answers don’t silently serve under new conditions.
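
A minimal sketch of why bumping one dimension misses the cache, assuming each dimension is already a string and that the `|` in the formula above means joining the fields with a literal pipe byte. The real serialization is not specified on this page, so the CacheKeyDims class, the cache_key function, and the placeholder values below are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class CacheKeyDims:
    # Field order mirrors the formula above; it is part of the key.
    source_root: str
    question_hash: str
    model_profile_hash: str
    conversation_hash: str
    governance_policy_hash: str
    schema_version: str
    canonicalization_version: str
    chunking_version: str


def cache_key(dims: CacheKeyDims) -> str:
    # Assumption: fields are joined with "|" before hashing.
    joined = "|".join(getattr(dims, f.name) for f in fields(dims))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Placeholder values, for illustration only.
base = CacheKeyDims(
    source_root="root-1", question_hash="q-1", model_profile_hash="m-1",
    conversation_hash="c-1", governance_policy_hash="g-1",
    schema_version="1", canonicalization_version="1", chunking_version="1",
)
bumped = replace(base, chunking_version="2")

old_key = cache_key(base)
new_key = cache_key(bumped)
```

Here old_key and new_key differ, so a lookup under the new chunking version cannot return a record written under the old one.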

Figure: 8-dim cache_key composition

The eight mandatory dimensions split across two groups: the top row (corpus, question, model, conversation) is the dynamic input surface; the bottom row (policy + three pinned versions) is the admissibility scaffold. The optional 9th dimension (verifier_policy_hash) exists for audit legibility under ticket #000058; it does not add correctness coverage, because the verifier-policy fields are already a subset of governance_policy_hash. 8-dim is the default write form; 9-dim is opt-in.

The audit chain#

Every state-changing operation writes one row in audit_events with event_hash = sha256(prev_event_hash || canonical(body)). Linear chain per shard, verified by make chain-check-shards (any break is the loudest possible signal).

Always write via arborist.store.append_audit — never insert into audit_events directly.
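
To illustrate the chaining rule only, here is a self-contained sketch of a hash chain and a break detector. It is not arborist.store.append_audit and does not touch audit_events: the in-memory list, the all-zero genesis value, the sorted-key JSON canonicalization, and the names append_event and first_break are assumptions made for the example.

```python
import hashlib
import json
from typing import Optional

GENESIS = b"\x00" * 32  # assumed predecessor of the first event


def canonical(body: dict) -> bytes:
    # Assumed canonical form: compact JSON with sorted keys.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def event_hash(prev_event_hash: bytes, body: dict) -> bytes:
    return hashlib.sha256(prev_event_hash + canonical(body)).digest()


def append_event(chain: list, body: dict) -> None:
    prev = chain[-1]["event_hash"] if chain else GENESIS
    chain.append({
        "prev_event_hash": prev,
        "body": body,
        "event_hash": event_hash(prev, body),
    })


def first_break(chain: list) -> Optional[int]:
    """Return the index of the first row that fails to verify, or None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_event_hash"] != prev:
            return i
        if row["event_hash"] != event_hash(prev, row["body"]):
            return i
        prev = row["event_hash"]
    return None
```

Editing any earlier body changes that row's recomputed hash, so the check fails at the edited row rather than silently passing.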

The trichotomy and the four-rung ladder#

Every answer carries two stacked labels.

Schema layer — v9.8 trichotomy, persisted, drives cache lookups and the audit chain:

| audit_mode | meaning |
| --- | --- |
| STRICT | every evidence unit verifies against context |
| HYBRID | some claims source-grounded, some emerged from training |
| UNGROUNDED | no evidence, or none verifies; purely emergent |

Display layer — four-rung ladder for claim-lattice modes only; renderer-only transformation, schema unchanged:

| rung | what’s actually proved |
| --- | --- |
| EVIDENCE-WARRANTED | pointer verified + warrant ran & passed + no soft demotes |
| ANCHOR-WARRANTED | pointer-linked + warrant passed; soft-demote violations |
| POINTER-LINKED | pointer/source/chunk verified, but warrant didn’t apply |
| UNGROUNDED | no verified pairs |

The point of the display ladder: STRICT in claim-lattice mode is not “the answer is correct” — it’s “every pointer resolved to a valid evidence object, source_role allowed, citation coverage passed.” The display label spells out the actual property so users don’t read STRICT as full semantic entailment.

Figure: Verifier ladder

How (audit_mode, violations) maps to a display rung at render time.
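
One way to read the rung table as a pure render-time function is sketched below. The inputs are hypothetical simplifications, not the renderer's real signature: verified_pairs counts pointer/evidence pairs that resolved, warrant_passed is True only when the warrant ran and passed (None or False both mean it did not), and soft_demotes counts soft-demote violations. How a warrant that ran and failed is actually handled is not stated on this page, so the sketch simply treats it like a warrant that did not apply.

```python
from typing import Optional


def display_rung(verified_pairs: int,
                 warrant_passed: Optional[bool],
                 soft_demotes: int) -> str:
    """Map verifier outcomes to a display label; the schema is untouched."""
    if verified_pairs == 0:
        return "UNGROUNDED"          # no verified pairs
    if warrant_passed is not True:
        return "POINTER-LINKED"      # pointers verified, no passing warrant
    if soft_demotes > 0:
        return "ANCHOR-WARRANTED"    # warrant passed, soft demotes present
    return "EVIDENCE-WARRANTED"      # warrant passed, no soft demotes
```

Checking from the weakest rung upward means each label is only reached when every weaker condition has already been ruled out.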

The verifier (lexical, layered)#

Five strategies run in sequence; the first one that finds evidence determines the classification. Each is lexical (substring or token coverage), never embeddings, so soft signals stay out of the proof path.

  1. quote — model wraps claims in "..."; verbatim substring tested

  2. span — bullet/sentence units substring-tested

  3. entity — multi-word proper nouns with proximity-cluster gating

  4. paraphrase — ≥85% token coverage on prose-shaped spans

  5. claim_lattice — pointer-line [E1,E2] or JSON; runs seven deterministic hard checks: parser succeeded, evidence_id resolves, source_role allowed, claim text non-empty, citation coverage threshold, pointer count cap, anchor-class warrant.
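
The "first strategy to find evidence wins" control flow can be sketched with two of the five strategies. This is an illustration under stated assumptions, not the arborist implementation: the tokenizer, the sentence splitter, the set-based notion of token coverage, and every function name here are hypothetical; only the 85% threshold and the strategy order come from the list above.

```python
import re


def tokens(text: str) -> list:
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def quote_strategy(answer: str, context: str) -> list:
    # Keep only double-quoted spans that appear verbatim in the context.
    return [q for q in re.findall(r'"([^"]+)"', answer) if q in context]


def paraphrase_strategy(answer: str, context: str, threshold: float = 0.85) -> list:
    # Keep sentences whose tokens are covered by context tokens at or above
    # the threshold. Coverage ignores word order, a deliberate simplification.
    context_tokens = set(tokens(context))
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        toks = tokens(sentence)
        if toks and sum(t in context_tokens for t in toks) / len(toks) >= threshold:
            hits.append(sentence)
    return hits


STRATEGIES = [("quote", quote_strategy), ("paraphrase", paraphrase_strategy)]


def first_evidence(answer: str, context: str):
    """Run strategies in order; the first one that finds evidence wins."""
    for name, strategy in STRATEGIES:
        evidence = strategy(answer, context)
        if evidence:
            return name, evidence
    return None, []
```

Both checks are plain string and token operations, so the same inputs always produce the same classification on any peer.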

See Q&A Pipeline: question → answer → verify → cache for the implementation; the whitepaper §13.8 covers the layered design.

Falsification state#

Every providence record carries a falsification_state in {live, failed, stale, quarantined}. Cache lookups filter on state='live'. Drift detection (re-ingest produces a different document_root) flips the record to stale.

Two ways to remove a wrong answer:

  • make falsify KEY=... — record stays in DB, state flips to failed. Audit-preserving. Use when downstream consumers might reference it.

  • make burn KEY=... — actually deletes the row. Refuses if the record has children unless FORCE=1. Use during scratch corpus building.
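
The difference between the two removal paths can be sketched against an in-memory SQLite table. The schema below is a hypothetical reduction (the real providence_cache has more columns), the lookup, falsify, and burn functions are illustrative stand-ins for the make targets, and the sketch omits both the audit event that accompanies a transition and the children check that burn performs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE providence_cache (
        cache_key TEXT PRIMARY KEY,
        answer TEXT NOT NULL,
        falsification_state TEXT NOT NULL DEFAULT 'live'
            CHECK (falsification_state IN ('live', 'failed', 'stale', 'quarantined'))
    )
    """
)
conn.executemany(
    "INSERT INTO providence_cache (cache_key, answer) VALUES (?, ?)",
    [("k1", "wrong answer"), ("k2", "scratch answer"), ("k3", "good answer")],
)


def lookup(conn, key):
    # Only live records are eligible to be served from the cache.
    row = conn.execute(
        "SELECT answer FROM providence_cache "
        "WHERE cache_key = ? AND falsification_state = 'live'",
        (key,),
    ).fetchone()
    return row[0] if row else None


def falsify(conn, key):
    # The row stays; only the state flips, so history is preserved.
    conn.execute(
        "UPDATE providence_cache SET falsification_state = 'failed' "
        "WHERE cache_key = ?",
        (key,),
    )


def burn(conn, key):
    # The row is physically deleted.
    conn.execute("DELETE FROM providence_cache WHERE cache_key = ?", (key,))


falsify(conn, "k1")
burn(conn, "k2")
```

After these two calls, k1 is still in the table but no longer served, k2 is gone entirely, and k3 is unaffected.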

Figure: Falsification state machine

State machine for falsification_state. Every transition writes one audit_event so the chain stays intact: the schema column is the lookup gate, the audit chain is the history. stale → live is the rarely-trodden path where a corpus was rolled back to a prior state; cleanest in single-shard test setups. burn actually deletes the row and is refused if the record has children unless FORCE=1.

Sidecar diagnostics#

The verifier stays binary; soft signals proliferate as sidecars: read-only diagnostic functions that pull the same source chunks the verifier saw, classify spans, detect topic shifts, and surface metaphor framing, but never write to providence_cache or extend the audit chain. That invariant is what keeps audit_mode a discrete classification rather than a soft score.

See Q&A Pipeline: question → answer → verify → cache (arborist.qa.inspect).


Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

  • TRUTH — First principles, math & science, open source code freely distributed.

  • FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.

  • HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.

  • LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.