Concepts#
Arborist is a content-addressed document store that gives every cached LLM answer a verifiable Merkle proof tying it back to its source documents. This page is the orientation: what the system is, the core abstractions you’ll see in code and docs, and how they compose.
What arborist is#
A reference implementation of two papers stacked:
Merkle Providence Reverse RAG (whitepaper, April 2026) — the runtime: question → cache → retrieval → LLM → verifier → cache write.
Merkle-AGI v9.8 — the admissibility ledger that makes cached answers cryptographically verifiable across peers.
Three layers stacked on one SQLite file (per shard):
Surface — ingested documents (Wikipedia, HTML pages, anything with a URI). Chunked, Merkle-rooted, FTS5-indexed.
Core — distilled documents Merkle-bound back to surfaces via per-chunk inclusion proofs. Recursive — a depth-N core can compress into a depth-N+1 core.
Providence cache — Q&A records keyed on the v9.8 8-dimension invariant; every record carries an audit_mode and a Merkle proof.
How the three layers compose. Surface document_roots flow into both
core derivations (with per-chunk inclusion proofs in
derivations.proof_blob) and providence-cache records (as the
source_root dimension of the 8-dim cache_key). Cores reuse
the surface proof structure, so the Merkle binding survives
compression. Providence rows then carry their own merkle_proof
that ties an answer back to the chunks the verifier ran against —
verifiable offline by any peer that holds the same surface roots.#
Top-level module graph. Substrate (merkle, document) at the bottom; pipelines (ingest, distill, qa) in the middle; CLI on top.#
The Merkle commitment#
Two peers that ingest the same source + run the same chunker +
canonicalization compute bit-identical document_root hashes.
That is the v9.8 admissibility property: identity by content, not by
location.
Arborist uses fox’s existing Go Merkle conventions verbatim
(proxy.unturf.com/pkg/verified/merkle.go):
Leaf hash: sha256(0x00 || canonical_chunk_bytes).
Internal hash: sha256(0x03 || left || right). Non-commutative — H(L,R) ≠ H(R,L).
Odd layers self-duplicate the trailing element. Not zero-pad.
Proof carries explicit IsLeft flag per sibling. Not lexical sort.
See Substrate: Core data structures for the Python port.
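The four conventions above can be sketched in a few lines of Python. This is an illustrative sketch, not the arborist port: the function names are hypothetical, chunks are assumed to be already canonicalized bytes, and rejecting an empty chunk list is an assumption the text does not cover.

```python
import hashlib
from typing import List, Tuple

LEAF_PREFIX = b"\x00"  # domain separator for leaves, per the text above
NODE_PREFIX = b"\x03"  # domain separator for internal nodes, per the text above


def leaf_hash(chunk: bytes) -> bytes:
    return hashlib.sha256(LEAF_PREFIX + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Order matters: node_hash(L, R) != node_hash(R, L).
    return hashlib.sha256(NODE_PREFIX + left + right).digest()


def _next_layer(layer: List[bytes]) -> List[bytes]:
    # Expects an even-length layer and pairs neighbours left to right.
    return [node_hash(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]


def merkle_root(chunks: List[bytes]) -> bytes:
    if not chunks:
        raise ValueError("at least one chunk is required")  # assumption
    layer = [leaf_hash(c) for c in chunks]
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(layer[-1])  # odd layer: duplicate the trailing element
        layer = _next_layer(layer)
    return layer[0]


def inclusion_proof(chunks: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Return (sibling_hash, is_left) pairs from the leaf up to the root."""
    if not 0 <= index < len(chunks):
        raise IndexError("chunk index out of range")
    layer = [leaf_hash(c) for c in chunks]
    proof = []
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(layer[-1])
        sibling = index ^ 1  # the other member of the pair
        proof.append((layer[sibling], sibling < index))  # is_left: sibling sits on the left
        layer = _next_layer(layer)
        index //= 2
    return proof


def verify_inclusion(chunk: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    h = leaf_hash(chunk)
    for sibling, is_left in proof:
        h = node_hash(sibling, h) if is_left else node_hash(h, sibling)
    return h == root
```

Because each proof step carries its own is_left flag, a verifier needs only the chunk, the proof, and the expected root; it never has to know the chunk's index or re-sort siblings.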
Federation across peers#
Identity-by-content is the federation primitive. Two peers that
ingest the same source with the same chunker and canonicalization
compute the same document_root and the same cache_key;
that property is what lets peers exchange and verify each other’s
providence records without a central authority. The mesh layer
implements gossip-based sync, group key management, and per-peer
audit-chain reconciliation on top of that.
See Federation: multiplayer arborist for the mesh wire format, member-set semantics, and deploy notes.
The 8-dim cache key#
Every providence record is keyed on:
cache_key = sha256(
source_root | question_hash | model_profile_hash |
conversation_hash | governance_policy_hash |
schema_version | canonicalization_version | chunking_version
)
Bumping any of the eight dimensions invalidates prior records on lookup. This is how the system stays honest across model changes, schema migrations, prompt edits, etc. — old answers don’t silently serve under new conditions.
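A minimal sketch of that invalidation behaviour follows. The field order comes from the formula above; everything else is an assumption made for illustration: the function name is hypothetical, and the byte-level serialization (UTF-8 strings joined with "|") is a guess, since this page does not specify how the dimensions are encoded before hashing.

```python
import hashlib

# Field order as listed in the cache_key formula.
DIMENSIONS = (
    "source_root",
    "question_hash",
    "model_profile_hash",
    "conversation_hash",
    "governance_policy_hash",
    "schema_version",
    "canonicalization_version",
    "chunking_version",
)


def cache_key(dims: dict) -> str:
    """Hash the eight dimensions in a fixed order (serialization is assumed)."""
    missing = [name for name in DIMENSIONS if name not in dims]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    joined = "|".join(str(dims[name]) for name in DIMENSIONS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def example_dims() -> dict:
    # Placeholder values only; real records would carry real hashes.
    return {name: f"demo-{name}" for name in DIMENSIONS}
```

Looking up with a bumped chunking_version produces a different key, so the old record is simply never found rather than being served under new conditions.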
The eight mandatory dimensions split across two groups: the top row
(corpus, question, model, conversation) is the dynamic input
surface; the bottom row (policy + three pinned versions) is the
admissibility scaffold. The optional 9th dimension
(verifier_policy_hash) exists for audit legibility under
ticket #000058 — it does not add correctness coverage, because the
verifier-policy fields are already a subset of
governance_policy_hash. 8-dim is the default write form; 9-dim
is opt-in.#
The audit chain#
Every state-changing operation writes one row in audit_events
with event_hash = sha256(prev_event_hash || canonical(body)).
Linear chain per shard, verified by make chain-check-shards (any
break is the loudest possible signal).
Always write via arborist.store.append_audit — never insert into
audit_events directly.
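The chaining rule can be illustrated with an in-memory sketch. It does not use or reproduce arborist.store.append_audit; the helper names, the all-zero genesis value, and the choice of sorted-key compact JSON as the canonical form are all assumptions made for the example.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = b"\x00" * 32  # assumed prev_event_hash for the first row


def canonical(body: dict) -> bytes:
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def event_hash(prev_event_hash: bytes, body: dict) -> bytes:
    return hashlib.sha256(prev_event_hash + canonical(body)).digest()


def append_event(chain: List[dict], body: dict) -> None:
    """Append one row whose hash commits to the previous row's hash."""
    prev = chain[-1]["event_hash"] if chain else GENESIS
    chain.append({
        "prev_event_hash": prev,
        "body": body,
        "event_hash": event_hash(prev, body),
    })


def first_break(chain: List[dict]) -> Optional[int]:
    """Return the index of the first row that fails verification, or None."""
    prev = GENESIS
    for i, row in enumerate(chain):
        if row["prev_event_hash"] != prev:
            return i
        if row["event_hash"] != event_hash(prev, row["body"]):
            return i
        prev = row["event_hash"]
    return None
```

Editing any stored body changes the hash that row should have, so verification fails at that row; this is why writes that bypass the append path show up as a chain break.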
The trichotomy and the four-rung ladder#
Every answer carries two stacked labels.
Schema layer — v9.8 trichotomy, persisted, drives cache lookups and the audit chain:
| audit_mode | meaning |
|---|---|
| STRICT | every evidence unit verifies against context |
| HYBRID | some claims source-grounded, some emerged from training |
| UNGROUNDED | no evidence, or none verifies — purely emergent |
Display layer — four-rung ladder for claim-lattice modes only; renderer-only transformation, schema unchanged:
| rung | what’s actually proved |
|---|---|
| EVIDENCE-WARRANTED | pointer verified + warrant ran & passed + no soft demotes |
| ANCHOR-WARRANTED | pointer-linked + warrant passed; soft-demote violations |
| POINTER-LINKED | pointer/source/chunk verified, but warrant didn’t apply |
| UNGROUNDED | no verified pairs |
The point of the display ladder: STRICT in claim-lattice mode
is not “the answer is correct” — it’s “every pointer resolved to a
valid evidence object, source_role allowed, citation coverage passed.”
The display label spells out the actual property so users don’t read
STRICT as full semantic entailment.
How (audit_mode, violations) maps to a display rung at render time.#
The verifier (lexical, layered)#
Five strategies run in sequence; first to find evidence classifies. Each is lexical (substring or token coverage), never embeddings — soft signals stay out of the proof path.
quote — model wraps claims in "..."; verbatim substring tested
span — bullet/sentence units substring-tested
entity — multi-word proper nouns with proximity-cluster gating
paraphrase — ≥85% token coverage on prose-shaped spans
claim_lattice — pointer-line [E1,E2] or JSON; runs seven deterministic hard checks: parser succeeded, evidence_id resolves, source_role allowed, claim text non-empty, citation coverage threshold, pointer count cap, anchor-class warrant.
See Q&A Pipeline: question → answer → verify → cache for the implementation; the whitepaper §13.8 covers the layered design.
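To make "lexical, never embeddings" concrete, here is a rough sketch of two of the strategies listed above, quote and paraphrase. It is not the arborist verifier: the function names, the regex-based tokenization, the case folding, and the rule that every quoted span must match are assumptions; only the substring test and the 85% threshold come from the text.

```python
import re

QUOTE_RE = re.compile(r'"([^"]+)"')  # spans the model wrapped in double quotes
TOKEN_RE = re.compile(r"\w+")        # assumed tokenizer: runs of word characters


def quote_verifies(answer: str, context: str) -> bool:
    """True if the answer has quoted spans and each one is a verbatim substring."""
    spans = QUOTE_RE.findall(answer)
    return bool(spans) and all(span in context for span in spans)


def token_coverage(span: str, context: str) -> float:
    """Fraction of the span's tokens that also occur somewhere in the context."""
    span_tokens = TOKEN_RE.findall(span.lower())
    if not span_tokens:
        return 0.0
    context_tokens = set(TOKEN_RE.findall(context.lower()))
    hits = sum(1 for tok in span_tokens if tok in context_tokens)
    return hits / len(span_tokens)


def paraphrase_verifies(span: str, context: str, threshold: float = 0.85) -> bool:
    return token_coverage(span, context) >= threshold


def classify(answer: str, context: str) -> str:
    """Run the two sketched strategies in order; the first to find evidence wins."""
    if quote_verifies(answer, context):
        return "quote"
    if paraphrase_verifies(answer, context):
        return "paraphrase"
    return "none"
```

Both checks are deterministic string operations, so two peers holding the same chunks reach the same verdict without sharing a model or an embedding index.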
Falsification state#
Every providence record carries a falsification_state in
{live, failed, stale, quarantined}. Cache lookups filter on
state='live'. Drift detection (re-ingest produces a different
document_root) flips the record to stale.
Two ways to remove a wrong answer:
make falsify KEY=... — record stays in DB, state flips to failed. Audit-preserving. Use when downstream consumers might reference it.
make burn KEY=... — actually deletes the row. Refuses if the record has children unless FORCE=1. Use during scratch corpus building.
State machine for falsification_state. Every transition writes
one audit_event so the chain stays intact — the schema column
is the lookup gate, the audit chain is the history. stale →
live is the rarely-trodden path where a corpus was rolled back to
a prior state; cleanest in single-shard test setups. burn
actually deletes the row and is refused if the record has
children unless FORCE=1.#
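The lookup gate can be illustrated with a small in-memory SQLite sketch. The table name providence_cache and the column falsification_state appear on this page; the other columns, the helper names, and the CHECK constraint are assumptions, and the audit_event that the real system writes on every transition is deliberately left out.

```python
import sqlite3
from typing import Optional

STATES = ("live", "failed", "stale", "quarantined")


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE providence_cache (
            cache_key TEXT PRIMARY KEY,
            answer TEXT NOT NULL,
            falsification_state TEXT NOT NULL DEFAULT 'live'
                CHECK (falsification_state IN ('live', 'failed', 'stale', 'quarantined'))
        )
        """
    )
    return conn


def lookup(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Serve an answer only while its record is live."""
    row = conn.execute(
        "SELECT answer FROM providence_cache "
        "WHERE cache_key = ? AND falsification_state = 'live'",
        (key,),
    ).fetchone()
    return row[0] if row else None


def set_state(conn: sqlite3.Connection, key: str, state: str) -> None:
    """Flip the state column; the row itself is kept (unlike a burn)."""
    if state not in STATES:
        raise ValueError(f"unknown falsification_state: {state}")
    conn.execute(
        "UPDATE providence_cache SET falsification_state = ? WHERE cache_key = ?",
        (state, key),
    )
```

After a flip to failed or stale the row is still present for anything that references it, but it no longer satisfies the lookup filter, which matches the audit-preserving behaviour described for falsify.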
Sidecar diagnostics#
The verifier stays binary; soft signals proliferate as sidecars —
read-only diagnostic functions that pull the same source chunks the
verifier saw, classify spans, detect topic shifts, surface metaphor
framing, and never write to providence_cache or extend the audit
chain. That invariant is what keeps audit_mode a discrete
classification (one of the three trichotomy labels) rather than a soft score.
See Q&A Pipeline: question → answer → verify → cache (arborist.qa.inspect).
Where to go next#
Quickstart — install and run the canonical paths
Makefile reference — every workflow as a make target
Substrate: Core data structures — Merkle tree + document primitives
Q&A Pipeline: question → answer → verify → cache — Q&A pipeline, verifier, evidence map, run-DAG
Federation: multiplayer arborist — federation primitives across peers
License — full AGPL + Permacomputer Preamble
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.