Cookbook#

Recipes for common workflows beyond the quickstart. Each starts from a working arborist install (make bootstrap already run) and a populated shards directory under ~/.arborist/shards/.

Re-crawl a website to detect changes#

After make crawl-ingest lands a site into a shard, the per-page ETag and Last-Modified headers are kept in document_http_meta. A re-crawl can ask “did anything change?” without downloading bodies.

make crawl-ingest URL=https://example.com DEPTH=2  # initial crawl
# ...later...
make recrawl-check DOMAIN=example.com              # conditional HEAD per page

Each URL is classified as fresh (304), stale (200 with new body), gone (404 or 410), or unreachable. That costs one small round-trip per URL, and no body is transferred when the content is unchanged.
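The classification above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code behind make recrawl-check: the function names are hypothetical, and treating every other status code as unreachable is an assumption.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build the validator headers for a conditional request."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag               # standard HTTP validator
    if last_modified:
        headers["If-Modified-Since"] = last_modified  # standard HTTP validator
    return headers


def classify(status: Optional[int]) -> str:
    """Map an HTTP status (None means no response at all) to a verdict."""
    if status is None:
        return "unreachable"
    if status == 304:
        return "fresh"
    if status == 200:
        return "stale"
    if status in (404, 410):
        return "gone"
    # Assumption: anything else (5xx, unexpected redirects) counts as unreachable.
    return "unreachable"
```

A caller would send the stored ETag and Last-Modified values back as If-None-Match and If-Modified-Since, then feed the response status to classify().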

Falsify a wrong answer (audit-preserving)#

The verifier called something STRICT but you know it’s wrong. Mark the record falsified — it stays in the DB so downstream consumers that referenced it can still trace history.

make query Q="When did X happen?"             # see the answer + cache_key
make inspect KEY=<cache_key>                  # diagnose unverified spans
make falsify KEY=<cache_key> REASON='wrong year — sources cited 1942 not 1944'

Future lookups skip records whose falsification_state != 'live'. A falsify audit event records the act; the chain stays intact.
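A minimal sketch of that audit-preserving pattern, using an in-memory SQLite database. Only the falsification_state column and its 'live' value come from the text above; the table names, the other columns, and the 'falsified' value are hypothetical stand-ins for the real schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a toy schema (hypothetical, not arborist's real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE answers (
            cache_key TEXT PRIMARY KEY,
            answer_text TEXT NOT NULL,
            falsification_state TEXT NOT NULL DEFAULT 'live',
            falsification_reason TEXT
        );
        CREATE TABLE audit (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            kind TEXT NOT NULL,
            cache_key TEXT NOT NULL
        );
    """)
    return conn


def falsify(conn, cache_key, reason):
    """Flip the state instead of deleting, and record the act."""
    conn.execute(
        "UPDATE answers SET falsification_state='falsified', "
        "falsification_reason=? WHERE cache_key=?",
        (reason, cache_key),
    )
    conn.execute(
        "INSERT INTO audit (kind, cache_key) VALUES ('falsify', ?)",
        (cache_key,),
    )


def lookup(conn, cache_key):
    """Return the answer only while the record is still live."""
    row = conn.execute(
        "SELECT answer_text FROM answers "
        "WHERE cache_key=? AND falsification_state='live'",
        (cache_key,),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
conn.execute(
    "INSERT INTO answers (cache_key, answer_text) VALUES (?, ?)",
    ("abc123", "It happened in 1944."),
)
before = lookup(conn, "abc123")
falsify(conn, "abc123", "sources cited 1942 not 1944")
after = lookup(conn, "abc123")
```

The row is still present after falsify(), so anything that referenced the cache key can trace what happened, while lookup() no longer serves it.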

Promote your own past answers into the corpus#

After enough STRICT answers accumulate, treat them as a derived source. The providence ingest path promotes mature STRICT records into the document corpus where retrieval can pick them up.

make ingest-self-providence KG_SECONDS=86400   # only records ≥1 day old

KG_SECONDS (the kindergarten window) keeps the system from treating freshly cached answers as ground truth before they have had time to fail. See Q&A Pipeline: question → answer → verify → cache for the ProvidenceSource impl.
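The window itself is a plain age filter. The sketch below assumes each record is a dict with audit_mode and a created_at Unix timestamp; both the record shape and the function name are hypothetical, not the ProvidenceSource implementation.

```python
import time


def mature_strict(records, kg_seconds, now=None):
    """Keep STRICT records that are at least kg_seconds old."""
    now = time.time() if now is None else now
    return [
        r for r in records
        if r["audit_mode"] == "STRICT" and now - r["created_at"] >= kg_seconds
    ]


NOW = 1_000_000
records = [
    {"cache_key": "old-strict", "audit_mode": "STRICT", "created_at": NOW - 90_000},
    {"cache_key": "new-strict", "audit_mode": "STRICT", "created_at": NOW - 60},
    {"cache_key": "old-hybrid", "audit_mode": "HYBRID", "created_at": NOW - 90_000},
]
promoted = mature_strict(records, kg_seconds=86_400, now=NOW)
```

With KG_SECONDS=86400 only the first record qualifies: the second is too young and the third is not STRICT.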

Query across mixed corpora#

Every shard under ~/.arborist/shards/ is queried automatically. Mix Wikipedia, your Grok export, a crawled site, and your own git repos in one query — retrieval ranks across all of them.

make ingest-cur-attached                                          # Wikipedia
make ingest-grok-attached GROK_EXPORT=$HOME/Downloads/<uuid>      # Grok chats
make crawl-ingest URL=https://russell.ballestrini.net DEPTH=2     # personal site
make ingest-git GIT_REPO=$HOME/git/myproject                      # source code

make query Q="how does my project handle authentication?"

Each shard contributes hits; the query path’s title-relevance + body coverage rerank lets neologisms in your private corpus outrank generic Wikipedia matches.
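To make the intuition concrete, here is a toy rerank that blends title relevance with body coverage. It is not arborist's scoring function: the tokenizer, the 0.6 title weight, and the function names are all assumptions chosen for illustration.

```python
import re


def _terms(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_score(query, title, body, title_weight=0.6):
    """Blend the fraction of query terms found in the title and in the body."""
    q = _terms(query)
    if not q:
        return 0.0
    title_relevance = len(q & _terms(title)) / len(q)
    body_coverage = len(q & _terms(body)) / len(q)
    return title_weight * title_relevance + (1 - title_weight) * body_coverage


query = "arborist shard authentication"
private_hit = rerank_score(
    query,
    title="Arborist shard authentication notes",
    body="The arborist shard verifies tokens before serving hits.",
)
generic_hit = rerank_score(
    query,
    title="Authentication",
    body="Authentication is the act of proving an assertion.",
)
```

The private document matches the coined terms in both title and body, so it scores well above the generic encyclopedia entry that only matches "authentication".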

Override the LLM endpoint#

Default points at https://hermes.ai.unturf.com/v1 (Hermes-3-8B, no auth). Point at any OpenAI-compatible endpoint via env:

export ARBORIST_LLM_ENDPOINT="https://your-vllm.example.com/v1"
export ARBORIST_LLM_MODEL="meta-llama/Llama-3.1-70B-Instruct"
export ARBORIST_LLM_API_KEY="..."   # optional; many vLLM deploys are open
make query Q="..."

The model id folds into model_profile_hash (one of the 8 cache key dimensions), so swapping models invalidates prior cache hits on lookup — no risk of serving an answer one model produced under another model’s identity.
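The invalidation follows directly from hashing the model identity into the key. In the sketch below only model_profile_hash and the 64-character hex key come from this page; the second dimension name, the canonical-JSON encoding, and the way model_profile_hash is derived are assumptions, and the real key has eight dimensions rather than two.

```python
import hashlib
import json


def model_profile_hash(model_id: str) -> str:
    """Assumption: the profile hash is a digest over the model id."""
    return hashlib.sha256(model_id.encode("utf-8")).hexdigest()


def cache_key(dimensions: dict) -> str:
    """Hash a canonical JSON encoding of the key dimensions."""
    canonical = json.dumps(dimensions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def key_for(question: str, model_id: str) -> str:
    return cache_key({
        "question_hash": hashlib.sha256(question.encode("utf-8")).hexdigest(),  # hypothetical dimension
        "model_profile_hash": model_profile_hash(model_id),
    })


key_8b = key_for("When did X happen?", "adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic")
key_70b = key_for("When did X happen?", "meta-llama/Llama-3.1-70B-Instruct")
```

Same question, different model id, different key: the answer cached under one model is simply not found when looking up under the other.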

Verify shard integrity after a bulk operation#

Any state-changing op (mass falsify, hash bump, schema migration) should be followed by:

make chain-check-shards          # 0 chain breaks per shard = intact
make analyze-shards              # compression spectrum + audit integrity
make verify-shards               # round-trip Merkle proofs on a sample

Chain breaks are the loudest possible signal. Run these before declaring an op successful.

Run the QA bench and read the results#

The QA bench measures how often the verifier says STRICT vs HYBRID vs UNGROUNDED across a fixed question set, per answer mode.

make bench-qa-smoke                          # ~30s, 5 questions × 3 modes
make bench-qa BENCH_QA_N=3                   # full sweep, 3 samples each

Output lands in bench/qa_results/<utc-stamp>.{jsonl,md}. The markdown file has the summary table; the JSONL has every per-question record for drill-down.
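For drill-down beyond the markdown summary, the JSONL can be tallied with the standard library. The field names mode and audit_mode, and the mode labels in the sample, are assumptions about the record shape; check a real results file before relying on them.

```python
import json
from collections import Counter, defaultdict


def tally(lines):
    """Count audit modes per answer mode from JSONL lines."""
    table = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        table[record["mode"]][record["audit_mode"]] += 1
    return table


def strict_rate(counts):
    """Share of STRICT verdicts within one answer mode."""
    total = sum(counts.values())
    return counts["STRICT"] / total if total else 0.0


# Hypothetical sample standing in for bench/qa_results/<utc-stamp>.jsonl
SAMPLE = """\
{"question": "q1", "mode": "mode_a", "audit_mode": "STRICT"}
{"question": "q2", "mode": "mode_a", "audit_mode": "HYBRID"}
{"question": "q1", "mode": "mode_b", "audit_mode": "UNGROUNDED"}
{"question": "q2", "mode": "mode_b", "audit_mode": "STRICT"}
"""

table = tally(SAMPLE.splitlines())
```

To run it against a real file, pass an open file handle to tally() instead of the sample lines.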

Resume an interrupted bench:

make bench-qa --resume bench/qa_results/<previous-utc-stamp>.jsonl

Pass the same --seed as the interrupted run so the shuffled task order lines up with the records already written.
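Why the seed matters can be shown with a seeded shuffle: the same seed reproduces the same task order, so a resumed run can skip exactly what was already done. This is a sketch of the idea with hypothetical function names, not the bench runner's code.

```python
import random


def task_order(tasks, seed):
    """Deterministically shuffle tasks with a private RNG."""
    order = list(tasks)
    random.Random(seed).shuffle(order)
    return order


def remaining(tasks, seed, done):
    """Tasks still to run, in the original shuffled order."""
    return [t for t in task_order(tasks, seed) if t not in done]


tasks = [f"q{i}" for i in range(10)]
first_run = task_order(tasks, seed=7)
completed = set(first_run[:4])          # the run was interrupted after 4 tasks
resumed = remaining(tasks, seed=7, done=completed)
```

With the same seed, resumed picks up at the fifth task of the first run. A different seed would produce a different order that no longer matches the interrupted file.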

Tune retrieval per-question#

When a query returns the wrong sources, K= injects extra retrieval keywords without changing what the LLM sees as the question:

make query Q="What did Orwell mean by always at war?" K="1984 Oceania Eastasia"

The provenance gap this creates is tracked in Q&A Pipeline: question → answer → verify → cache (arborist.qa.query).

Use arborist as a Python library#

Everything below uses the supported embedding surface, arborist.embed. Import from there, not from internal modules — internal refactors are free to move things around behind that seam. See Storage: SQLite schema and audit for the full arborist.store reference and Substrate: Core data structures for the Merkle primitives the recipes call into.

Open a store and ingest documents you already hold#

The minimum useful contact surface: pass in your own Document objects, get back content-addressed storage with an audit chain. Idempotent — re-running with the same content yields the same document_root and skips the insert.

from pathlib import Path
from arborist.embed import (
    open_store, ingest_documents, search, Document, Edge,
)

conn = open_store(Path("data/arborist.db"))   # creates + migrates

stats = ingest_documents(conn, [
    Document(
        uri="https://example.com/post-a",
        content="anarcho-capitalism describes a stateless society "
                "where private property and free markets coordinate "
                "without coercion.",
        source_type="my_app",
        title="Anarcho-Capitalism Primer",
        edges=[Edge(edge_type="references",
                    dst_uri="https://example.com/post-b")],
        extra={"md5": "deadbeef"},          # your provenance, carried along
    ),
])
print(stats)                                # IngestStats(seen=1, inserted=1, ...)

for hit in search(conn, "free markets", limit=5):
    print(hit.document_uri, round(hit.score, 3))

conn.close()
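The idempotency claim rests on content addressing: the key is derived from the content, so a second ingest of the same bytes finds the row already there. The sketch below illustrates this with sqlite3 and a plain SHA-256 of the content as a stand-in for document_root. The real document_root is a Merkle root over chunks, and the schema here is hypothetical.

```python
import hashlib
import sqlite3


def ingest(conn, uri, content):
    """Insert a document keyed by its content hash; skip if already present."""
    root = hashlib.sha256(content.encode("utf-8")).hexdigest()
    exists = conn.execute(
        "SELECT 1 FROM documents WHERE document_root=?", (root,)
    ).fetchone() is not None
    if not exists:
        conn.execute(
            "INSERT INTO documents (document_root, uri) VALUES (?, ?)",
            (root, uri),
        )
    return root, not exists  # (content address, inserted?)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (document_root TEXT PRIMARY KEY, uri TEXT)")

root_1, inserted_1 = ingest(conn, "https://example.com/post-a", "same body")
root_2, inserted_2 = ingest(conn, "https://example.com/post-a", "same body")
row_count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

The second call returns the same root and reports no insert, which mirrors the seen versus inserted split in IngestStats.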

Define a custom Source for stateful corpora#

When you have a corpus (a directory tree, a paginated API, a database table) it’s cleaner to express it as a Source. The ABC has one required method, iter_documents(), which must be deterministic and idempotent. That’s exactly the contract every built-in source under arborist/sources/ already implements.

from arborist.embed import Source, Document, open_store
from arborist.ingest import ingest_source

class TaggedDocs(Source):
    """Ingest a list of (uri, body) pairs under a shared tag."""

    source_type = "tagged_example"

    def __init__(self, tag, items):
        self.tag = tag
        self._items = items

    def iter_documents(self):
        for uri, body in self._items:
            yield Document(
                uri=uri,
                content=body,
                source_type=self.source_type,
                extra={"tag": self.tag},
            )

conn = open_store("data/arborist.db")
stats = ingest_source(conn, TaggedDocs("research", [
    ("https://example.com/c", "third doc body about content-addressing."),
]))
print(stats)
conn.close()

Walk and verify the audit chain#

Every state-changing op writes one row in audit_events with event_hash = sha256(prev_event_hash || canonical(body)). Verifying the chain is just re-running that hash for every row and checking the linkage. (make chain-check-shards does this at scale; the recipe below is the same logic, inlined.)

import hashlib
from arborist.embed import open_store
from arborist.store import latest_event_hash

conn = open_store("data/arborist.db")
print("head:", latest_event_hash(conn))

prev = None
bad = 0
for seq, eh, ph, body in conn.execute(
    "SELECT seq, event_hash, prev_event_hash, body "
    "FROM audit_events ORDER BY seq"
):
    h = hashlib.sha256()
    if ph is not None:
        h.update(bytes.fromhex(ph))
    h.update(body.encode("utf-8", errors="surrogatepass"))
    if h.hexdigest() != eh or (prev is not None and ph != prev):
        bad += 1
    prev = eh

print(f"chain breaks: {bad}")              # 0 = intact
conn.close()
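To see the check catch something, the same hash rule can be exercised on an in-memory chain with no database at all. The canonical() encoding below (sorted-key compact JSON) is an assumption, as are the event bodies; only the sha256(prev_event_hash || body) rule is taken from the recipe above.

```python
import hashlib
import json


def canonical(body: dict) -> str:
    """Assumed canonical form: compact JSON with sorted keys."""
    return json.dumps(body, sort_keys=True, separators=(",", ":"))


def event_hash(prev_hex, body_text):
    """sha256(prev_event_hash || body), with no prefix for the first event."""
    h = hashlib.sha256()
    if prev_hex is not None:
        h.update(bytes.fromhex(prev_hex))
    h.update(body_text.encode("utf-8"))
    return h.hexdigest()


def append(chain, body):
    prev = chain[-1]["event_hash"] if chain else None
    text = canonical(body)
    chain.append({
        "prev_event_hash": prev,
        "body": text,
        "event_hash": event_hash(prev, text),
    })


def count_breaks(chain):
    """Recompute every hash and check each row links to the one before."""
    prev, bad = None, 0
    for row in chain:
        recomputed = event_hash(row["prev_event_hash"], row["body"])
        if recomputed != row["event_hash"] or row["prev_event_hash"] != prev:
            bad += 1
        prev = row["event_hash"]
    return bad


chain = []
for body in ({"kind": "ingest", "n": 1}, {"kind": "falsify", "n": 2}, {"kind": "ingest", "n": 3}):
    append(chain, body)

intact = count_breaks(chain)
chain[1]["body"] = canonical({"kind": "falsify", "n": 999})  # tamper with one row
tampered = count_breaks(chain)
```

Editing a body without recomputing its hash shows up as a break on that row. Recomputing the hash to hide the edit would instead break the link from the next row.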

Round-trip a Merkle inclusion proof#

The proof primitives from arborist.merkle are the Python port of proxy.unturf.com’s Go conventions. Use them to re-derive a document_root from its leaves, build a proof for any chunk, and serialize the proof for over-the-wire delivery to another peer.

import json
from arborist.embed import open_store
from arborist.merkle import (
    MerkleTree, verify_proof, proof_to_dict, proof_from_dict,
)

conn = open_store("data/arborist.db")

doc_root_hex, = conn.execute(
    "SELECT document_root FROM documents LIMIT 1"
).fetchone()
leaves = [
    bytes.fromhex(r[0]) for r in conn.execute(
        "SELECT leaf_hash FROM chunks WHERE document_root=? ORDER BY idx",
        (doc_root_hex,),
    )
]

tree = MerkleTree.build(leaves)
assert tree.root.hex() == doc_root_hex     # bit-identical re-derivation

proof = tree.proof(0)
assert verify_proof(proof)                 # local round-trip

blob = json.dumps(proof_to_dict(proof))    # serialize for wire
proof_received = proof_from_dict(json.loads(blob))
assert verify_proof(proof_received)        # any peer can verify

conn.close()

Two peers that ingested the same source with the same chunker and canonicalization will compute byte-identical document_root hashes and accept each other’s proofs — see Federation: multiplayer arborist for the federation primitives built on top of this property.

Run a Q&A from Python and read the result programmatically#

The CLI (arborist query / make query) is one entry point to the multi-route retrieval + LLM + verifier pipeline. The same surface is callable directly from Python via arborist.qa.query.query() — useful when you want to drive a batch sweep, integrate into a notebook, or wrap the result in your own application logic. Cache lookups, the 8-dim cache_key, the verifier, the run-DAG, and the falsification gate all behave identically to the CLI path.

from pathlib import Path
from arborist.qa.query import query
from arborist.qa.client import OpenAICompatibleClient

client = OpenAICompatibleClient(
    base_url="https://hermes.ai.unturf.com/v1",       # any OpenAI-compat
)

result = query(
    question="What is anarcho-capitalism?",
    qa_db=Path.home() / ".arborist" / "qa.db",        # cache lives here
    chat_client=client,
    model_id="adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
    shards_dir=Path.home() / ".arborist" / "shards",  # OR single_db=...
)

print(result["status"])              # cache_hit | cache_miss_then_written
print(result["audit_mode"])          # STRICT | HYBRID | UNGROUNDED
print(result["cache_key"])           # 64-char hex
print(result["answer_text"])
for src in result["sources"]:
    print("  ->", src["document_uri"], src["document_root"][:12],
          "role:", src["source_role"])

The first call writes one providence-cache row + one audit event; the second call with the same question replays from cache in ~100 ms (status == "cache_hit"). To exercise the pipeline deterministically in tests, swap the live client for StubClient:

from pathlib import Path
from arborist.qa.client import StubClient
from arborist.qa.query import query

stub = StubClient(answer='Anarcho-capitalism is "a political '
                         'philosophy that advocates the elimination '
                         'of centralized state dictums".')
result = query(
    question="What is anarcho-capitalism?",
    qa_db=Path("/tmp/qa.db"),
    chat_client=stub,
    model_id="stub/test",
    single_db=Path("/tmp/arborist.db"),
)
assert result["audit_mode"] in {"STRICT", "HYBRID", "UNGROUNDED"}

That stubbed shape is exactly how the test suite drives the runner — the verifier still runs lexically against the assembled context, so the audit_mode classification is real even though the model output is canned.

For the single-document path (arborist ask / one document_root in hand), use arborist.qa.runner.ask() instead — same return shape, same cache_key invariants, scoped to one document.


Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

  • TRUTH — First principles, math & science, open source code freely distributed.

  • FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.

  • HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.

  • LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.