Concepts
========

Arborist is a content-addressed document store that gives every cached
LLM answer a verifiable Merkle proof tying it back to its source
documents. This page is the orientation: what the system is, the
core abstractions you'll see in code and docs, and how they compose.

What arborist is
----------------

A reference implementation of two papers stacked:

1. **Merkle Providence Reverse RAG** (whitepaper, April 2026) — the
   runtime: question → cache → retrieval → LLM → verifier → cache write.
2. **Merkle-AGI v9.8** — the admissibility ledger that makes cached
   answers cryptographically verifiable across peers.

Three layers stacked on one SQLite file (per shard):

- **Surface** — ingested documents (Wikipedia, HTML pages, anything
  with a URI). Chunked, Merkle-rooted, FTS5-indexed.
- **Core** — distilled documents Merkle-bound back to surfaces via
  per-chunk inclusion proofs. Recursive — a depth-N core can compress
  into a depth-N+1 core.
- **Providence cache** — Q&A records keyed on the v9.8 8-dimension
  invariant; every record carries an audit_mode and a Merkle proof.

.. figure:: diagrams/three-layer-stack.svg
   :alt: Three-layer stack (surface, core, providence cache)
   :width: 100%

   How the three layers compose. Surface document_roots flow into both
   core derivations (with per-chunk inclusion proofs in
   ``derivations.proof_blob``) and providence-cache records (as the
   ``source_root`` dimension of the 8-dim ``cache_key``). Cores reuse
   the surface proof structure, so the Merkle binding survives
   compression. Providence rows then carry their own ``merkle_proof``
   that ties an answer back to the chunks the verifier ran against —
   verifiable offline by any peer that holds the same surface roots.

.. figure:: diagrams/arborist-modules.svg
   :alt: Arborist module graph
   :width: 100%

   Top-level module graph. Substrate (merkle, document) at the bottom;
   pipelines (ingest, distill, qa) in the middle; CLI on top.

The Merkle commitment
---------------------

Two peers that ingest the same source + run the same chunker +
canonicalization compute **bit-identical** ``document_root`` hashes.
That is the v9.8 admissibility property: identity by content, not by
location.

Arborist uses fox's existing Go Merkle conventions verbatim
(``proxy.unturf.com/pkg/verified/merkle.go``):

- **Leaf hash:** ``sha256(0x00 || canonical_chunk_bytes)``.
- **Internal hash:** ``sha256(0x03 || left || right)``. Non-commutative —
  ``H(L,R) ≠ H(R,L)``.
- **Odd layers self-duplicate** the trailing element. Not zero-pad.
- **Proof carries explicit IsLeft flag** per sibling. Not lexical sort.

See :doc:`api/substrate` for the Python port.

Federation across peers
-----------------------

Identity-by-content is the federation primitive. Two peers that
ingest the same source with the same chunker and canonicalization
compute the same ``document_root`` and the same ``cache_key``;
that property is what lets peers exchange and verify each other's
providence records without a central authority. The mesh layer
implements gossip-based sync, group key management, and per-peer
audit-chain reconciliation on top of that.

See :doc:`api/mesh` (also reachable as :ref:`federation`) for the
mesh wire format, member-set semantics, and deploy notes.

The 8-dim cache key
-------------------

Every providence record is keyed on:

.. code-block::

   cache_key = sha256(
       source_root | question_hash | model_profile_hash |
       conversation_hash | governance_policy_hash |
       schema_version | canonicalization_version | chunking_version
   )

Bumping any of the eight dimensions invalidates prior records on
lookup. This is how the system stays honest across model changes,
schema migrations, prompt edits, etc. — old answers don't silently
serve under new conditions.

.. figure:: diagrams/cache-key-8dim.svg
   :alt: 8-dim cache_key composition
   :width: 100%

   The eight mandatory dimensions split across two groups: the top row
   (corpus, question, model, conversation) is the dynamic input
   surface; the bottom row (policy + three pinned versions) is the
   admissibility scaffold. The optional 9th dimension
   (``verifier_policy_hash``) exists for audit *legibility* under
   ticket #000058 — it does not add correctness coverage, because the
   verifier-policy fields are already a subset of
   ``governance_policy_hash``. 8-dim is the default write form; 9-dim
   is opt-in.

The audit chain
---------------

Every state-changing operation writes one row in ``audit_events``
with ``event_hash = sha256(prev_event_hash || canonical(body))``.
Linear chain per shard, verified by ``make chain-check-shards`` (any
break is the loudest possible signal).

Always write via ``arborist.store.append_audit`` — never insert into
``audit_events`` directly.

The trichotomy and the four-rung ladder
----------------------------------------

Every answer carries two stacked labels.

**Schema layer** — v9.8 trichotomy, persisted, drives cache lookups
and the audit chain:

============== =============================================================
``audit_mode`` meaning
============== =============================================================
STRICT         every evidence unit verifies against context
HYBRID         some claims source-grounded, some emerged from training
UNGROUNDED     no evidence, or none verifies — purely emergent
============== =============================================================

**Display layer** — four-rung ladder for claim-lattice modes only;
renderer-only transformation, schema unchanged:

==================== ==========================================================
rung                 what's actually proved
==================== ==========================================================
EVIDENCE-WARRANTED   pointer verified + warrant ran & passed + no soft demotes
ANCHOR-WARRANTED     pointer-linked + warrant passed; soft-demote violations
POINTER-LINKED       pointer/source/chunk verified, but warrant didn't apply
UNGROUNDED           no verified pairs
==================== ==========================================================

The point of the display ladder: ``STRICT`` in claim-lattice mode
is *not* "the answer is correct" — it's "every pointer resolved to a
valid evidence object, source_role allowed, citation coverage passed."
The display label spells out the actual property so users don't read
``STRICT`` as full semantic entailment.

.. figure:: diagrams/verifier-ladder.svg
   :alt: Verifier ladder
   :width: 80%

   How ``(audit_mode, violations)`` maps to a display rung at render time.

The verifier (lexical, layered)
-------------------------------

Five strategies run in sequence; first to find evidence classifies.
Each is **lexical** (substring or token coverage), never embeddings —
soft signals stay out of the proof path.

1. **quote** — model wraps claims in ``"..."``; verbatim substring tested
2. **span** — bullet/sentence units substring-tested
3. **entity** — multi-word proper nouns with proximity-cluster gating
4. **paraphrase** — ≥85% token coverage on prose-shaped spans
5. **claim_lattice** — pointer-line ``[E1,E2]`` or JSON; runs **seven
   deterministic hard checks**:
   parser succeeded, evidence_id resolves, source_role allowed, claim
   text non-empty, citation coverage threshold, pointer count cap,
   anchor-class warrant.

See :doc:`api/qa` for the implementation; the whitepaper §13.8
covers the layered design.

Falsification state
-------------------

Every providence record carries a ``falsification_state`` in
``{live, failed, stale, quarantined}``. Cache lookups filter on
``state='live'``. Drift detection (re-ingest produces a different
``document_root``) flips the record to ``stale``.

Two ways to remove a wrong answer:

- ``make falsify KEY=...`` — record stays in DB, state flips to
  ``failed``. Audit-preserving. Use when downstream consumers might
  reference it.
- ``make burn KEY=...`` — actually deletes the row. Refuses if the
  record has children unless ``FORCE=1``. Use during scratch corpus
  building.

.. figure:: diagrams/falsification-states.svg
   :alt: Falsification state machine
   :width: 100%

   State machine for ``falsification_state``. Every transition writes
   one ``audit_event`` so the chain stays intact — the schema column
   is the lookup gate, the audit chain is the history. ``stale →
   live`` is the rarely-trodden path where a corpus was rolled back to
   a prior state; cleanest in single-shard test setups. ``burn``
   actually deletes the row and is refused if the record has
   children unless ``FORCE=1``.

Sidecar diagnostics
-------------------

The verifier stays binary; soft signals proliferate as **sidecars** —
read-only diagnostic functions that pull the same source chunks the
verifier saw, classify spans, detect topic shifts, surface metaphor
framing, and **never write to providence_cache or extend the audit
chain**. That invariant is what keeps ``audit_mode`` a binary
classification rather than a soft score.

See :doc:`api/qa` (``arborist.qa.inspect``).

Where to go next
----------------

* :doc:`quickstart` — install and run the canonical paths
* :doc:`api/makefile` — every workflow as a ``make`` target
* :doc:`api/substrate` — Merkle tree + document primitives
* :doc:`api/qa` — Q&A pipeline, verifier, evidence map, run-DAG
* :doc:`api/mesh` — federation primitives across peers
* :doc:`license` — full AGPL + Permacomputer Preamble
