Cookbook#
Recipes for common workflows beyond the quickstart. Each starts from
a working arborist install (make bootstrap already run) and a
populated shards directory under ~/.arborist/shards/.
Re-crawl a website to detect changes#
After make crawl-ingest lands a site into a shard, the per-page
ETag and Last-Modified headers are kept in document_http_meta. A
re-crawl can ask “did anything change?” without downloading bodies.
make crawl-ingest URL=https://example.com DEPTH=2 # initial crawl
# ...later...
make recrawl-check DOMAIN=example.com # conditional HEAD per page
Each URL is classified as fresh (304), stale (200, meaning the stored
validators no longer match), gone (404/410), or unreachable. The check
costs one small round-trip per URL and transfers no body.
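The classification can be sketched in plain Python. This is a minimal illustration of a conditional HEAD using only the standard library, not arborist's own implementation; the function names are hypothetical, and treating every other status (5xx, for example) as unreachable is an assumption.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build validator headers from whatever was stored at crawl time."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def classify_status(status):
    """Map an HTTP status (None for a network failure) to a label."""
    if status == 304:
        return "fresh"
    if status == 200:
        return "stale"
    if status in (404, 410):
        return "gone"
    # Assumption: anything else is lumped in with network failures.
    return "unreachable"


def recheck(url, etag=None, last_modified=None, timeout=10):
    """Issue one conditional HEAD request and classify the response."""
    req = urllib.request.Request(
        url, method="HEAD",
        headers=conditional_headers(etag, last_modified),
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # urllib raises for 304 as well as 4xx/5xx
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, refused connection, timeout
    return classify_status(status)
```

Calling `recheck(url, etag=stored_etag)` for each stored page would give one label per URL without fetching any body.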
Falsify a wrong answer (audit-preserving)#
The verifier called something STRICT but you know it’s wrong. Mark the record falsified — it stays in the DB so downstream consumers that referenced it can still trace history.
make query Q="When did X happen?" # see the answer + cache_key
make inspect KEY=<cache_key> # diagnose unverified spans
make falsify KEY=<cache_key> REASON='wrong year — sources cited 1942 not 1944'
Future lookups skip records whose falsification_state != 'live'.
A falsify audit event records the act; the chain stays intact.
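The skip-on-lookup behaviour can be pictured with an in-memory SQLite table. This is a sketch under stated assumptions: the table name `answers`, its columns other than `falsification_state`, and the state value `'falsified'` are hypothetical stand-ins, not arborist's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and columns, for illustration only.
conn.execute(
    "CREATE TABLE answers ("
    " cache_key TEXT PRIMARY KEY,"
    " answer_text TEXT,"
    " falsification_state TEXT NOT NULL DEFAULT 'live')"
)
conn.executemany(
    "INSERT INTO answers (cache_key, answer_text) VALUES (?, ?)",
    [("k1", "It happened in 1944."), ("k2", "It happened in 1942.")],
)

# Falsifying flips the state; the row itself is never deleted.
conn.execute(
    "UPDATE answers SET falsification_state = 'falsified' WHERE cache_key = ?",
    ("k1",),
)


def lookup(conn, cache_key):
    """Return the cached answer only if the record is still live."""
    row = conn.execute(
        "SELECT answer_text FROM answers"
        " WHERE cache_key = ? AND falsification_state = 'live'",
        (cache_key,),
    ).fetchone()
    return row[0] if row else None


live_hit = lookup(conn, "k2")        # still served
falsified_hit = lookup(conn, "k1")   # skipped on lookup
total_rows = conn.execute("SELECT COUNT(*) FROM answers").fetchone()[0]
```

The falsified row is invisible to lookups but still present in the table, which is what keeps history traceable.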
Promote your own past answers into the corpus#
After enough STRICT answers accumulate, treat them as a derived
source. The providence ingest path promotes mature STRICT records
into the document corpus where retrieval can pick them up.
make ingest-self-providence KG_SECONDS=86400 # only records ≥1 day old
KG_SECONDS (the kindergarten window) keeps the system from
trusting freshly cached answers as ground truth before they have had
time to fail. See Q&A Pipeline: question → answer → verify → cache for the ProvidenceSource implementation.
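The age gate itself is simple arithmetic. A minimal sketch, assuming each record carries a creation timestamp in epoch seconds; the field name `created_at` and the helper `is_mature` are hypothetical.

```python
import time


def is_mature(created_at, kg_seconds, now=None):
    """True once a record has survived at least kg_seconds."""
    now = time.time() if now is None else now
    return (now - created_at) >= kg_seconds


now = 1_700_000_000.0  # fixed clock so the example is reproducible
records = [
    {"cache_key": "old", "created_at": now - 2 * 86400},  # two days old
    {"cache_key": "new", "created_at": now - 3600},       # one hour old
]
promotable = [
    r["cache_key"] for r in records
    if is_mature(r["created_at"], 86400, now)
]
```

With `KG_SECONDS=86400`, only the two-day-old record would qualify for promotion.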
Query across mixed corpora#
Every shard under ~/.arborist/shards/ is queried automatically.
Mix Wikipedia, your Grok export, a crawled site, and your own git
repos in one query — retrieval ranks across all of them.
make ingest-cur-attached # Wikipedia
make ingest-grok-attached GROK_EXPORT=$HOME/Downloads/<uuid> # Grok chats
make crawl-ingest URL=https://russell.ballestrini.net DEPTH=2 # personal site
make ingest-git GIT_REPO=$HOME/git/myproject # source code
make query Q="how does my project handle authentication?"
Each shard contributes hits; the query path’s title-relevance + body coverage rerank lets neologisms in your private corpus outrank generic Wikipedia matches.
Override the LLM endpoint#
Default points at https://hermes.ai.unturf.com/v1 (Hermes-3-8B,
no auth). Point at any OpenAI-compatible endpoint via env:
export ARBORIST_LLM_ENDPOINT="https://your-vllm.example.com/v1"
export ARBORIST_LLM_MODEL="meta-llama/Llama-3.1-70B-Instruct"
export ARBORIST_LLM_API_KEY="..." # optional; many vLLM deploys are open
make query Q="..."
The model id folds into model_profile_hash (one of the 8 cache
key dimensions), so swapping models changes the cache key and lookups
no longer match entries written under the old model. There is no risk
of serving an answer one model produced under another model’s identity.
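Why a model swap misses the cache can be shown with a toy key derivation. This is a sketch, not arborist's key function: it hashes a canonical JSON encoding of the dimensions with SHA-256, shows only two of the eight dimensions, and the dimension name `question` and the way `model_profile_hash` is computed here are assumptions.

```python
import hashlib
import json


def cache_key(dimensions):
    """Hash a canonical encoding of every key dimension."""
    canonical = json.dumps(dimensions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "question": "What is anarcho-capitalism?",
    "model_profile_hash": hashlib.sha256(b"model-a").hexdigest(),
}
# Same question, different model profile.
swapped = dict(
    base, model_profile_hash=hashlib.sha256(b"model-b").hexdigest()
)

key_a = cache_key(base)
key_b = cache_key(swapped)
```

Because any change to one dimension yields a different key, an entry written under `key_a` cannot be returned for a lookup that computes `key_b`.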
Verify shard integrity after a bulk operation#
Any state-changing op (mass falsify, hash bump, schema migration) should be followed by:
make chain-check-shards # 0 chain breaks per shard = intact
make analyze-shards # compression spectrum + audit integrity
make verify-shards # round-trip Merkle proofs on a sample
Chain breaks are the loudest possible signal: a nonzero count means at least one audit row no longer hashes or links as recorded. Run all three checks before declaring an op successful.
Run the QA bench and read the results#
The QA bench measures how often the verifier says STRICT vs HYBRID vs UNGROUNDED across a fixed question set, per answer mode.
make bench-qa-smoke # ~30s, 5 questions × 3 modes
make bench-qa BENCH_QA_N=3 # full sweep, 3 samples each
Output lands in bench/qa_results/<utc-stamp>.{jsonl,md}. The
markdown file has the summary table; the JSONL has every per-question
record for drill-down.
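For drill-down, the JSONL can be tallied with a few lines of Python. This is a sketch only: the field names `mode` and `audit_mode`, the mode labels, and the sample records are hypothetical, so check a real results file for the actual keys before relying on it.

```python
import json
from collections import Counter

# Hypothetical records standing in for lines of a results file.
sample_jsonl = """\
{"question_id": "q1", "mode": "mode_a", "audit_mode": "STRICT"}
{"question_id": "q2", "mode": "mode_a", "audit_mode": "HYBRID"}
{"question_id": "q1", "mode": "mode_b", "audit_mode": "STRICT"}
{"question_id": "q2", "mode": "mode_b", "audit_mode": "UNGROUNDED"}
"""


def tally(lines):
    """Count records per (answer mode, verifier verdict) pair."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        counts[(rec["mode"], rec["audit_mode"])] += 1
    return counts


counts = tally(sample_jsonl.splitlines())
strict_total = sum(n for (_, verdict), n in counts.items() if verdict == "STRICT")
```

To run it against a real file, pass the open file object to `tally` instead of the sample lines.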
Resume an interrupted bench:
make bench-qa --resume bench/qa_results/<previous-utc-stamp>.jsonl
Resume with the same --seed as the original run so the shuffled task order lines up.
Tune retrieval per-question#
When a query returns the wrong sources, K= injects extra retrieval
keywords without changing what the LLM sees as the question:
make query Q="What did Orwell mean by always at war?" K="1984 Oceania Eastasia"
Provenance gap on this is tracked in
Q&A Pipeline: question → answer → verify → cache (arborist.qa.query).
Use arborist as a Python library#
Everything below uses the supported embedding surface,
arborist.embed. Import from there, not from internal modules —
internal refactors are free to move things around behind that seam.
See Storage: SQLite schema and audit for the full arborist.store reference and
Substrate: Core data structures for the Merkle primitives the recipes call into.
Open a store and ingest documents you already hold#
The minimum useful contact surface: pass in your own
Document objects, get back content-addressed
storage with an audit chain. Idempotent — re-running with the same
content yields the same document_root and skips the insert.
from pathlib import Path
from arborist.embed import (
    open_store, ingest_documents, search, Document, Edge,
)

conn = open_store(Path("data/arborist.db"))  # creates + migrates
stats = ingest_documents(conn, [
    Document(
        uri="https://example.com/post-a",
        content="anarcho-capitalism describes a stateless society "
                "where private property and free markets coordinate "
                "without coercion.",
        source_type="my_app",
        title="Anarcho-Capitalism Primer",
        edges=[Edge(edge_type="references",
                    dst_uri="https://example.com/post-b")],
        extra={"md5": "deadbeef"},  # your provenance, carried along
    ),
])
print(stats)  # IngestStats(seen=1, inserted=1, ...)
for hit in search(conn, "free markets", limit=5):
    print(hit.document_uri, round(hit.score, 3))
conn.close()
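The idempotence claim follows from content addressing, which a toy store makes concrete. This is a sketch, not arborist's storage layer: a plain SHA-256 of the content stands in for the real document_root (which is a Merkle root over chunks), and `TinyStore` is a hypothetical name.

```python
import hashlib


class TinyStore:
    """Toy content-addressed store keyed by a hash of the content."""

    def __init__(self):
        self.docs = {}

    def ingest(self, content):
        """Return (root, inserted); identical content is stored once."""
        # Stand-in for document_root: plain SHA-256 of the text.
        root = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if root in self.docs:
            return root, False  # already present, skip the insert
        self.docs[root] = content
        return root, True


store = TinyStore()
body = "free markets coordinate without coercion."
first = store.ingest(body)
second = store.ingest(body)  # re-run with the same content
```

The second call returns the same root and reports that nothing was inserted, which is the behaviour the recipe above relies on.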
Define a custom Source for stateful corpora#
When you have a corpus (a directory tree, a paginated API, a database
table) it’s cleaner to express it as a Source. The ABC has one
required method, iter_documents(), which must be deterministic
and idempotent. That’s exactly the contract every built-in source under
arborist/sources/ already implements.
from arborist.embed import Source, Document, open_store
from arborist.ingest import ingest_source


class TaggedDocs(Source):
    """Ingest a list of (uri, body) pairs under a shared tag."""

    source_type = "tagged_example"

    def __init__(self, tag, items):
        self.tag = tag
        self._items = items

    def iter_documents(self):
        for uri, body in self._items:
            yield Document(
                uri=uri,
                content=body,
                source_type=self.source_type,
                extra={"tag": self.tag},
            )


conn = open_store("data/arborist.db")
stats = ingest_source(conn, TaggedDocs("research", [
    ("https://example.com/c", "third doc body about content-addressing."),
]))
print(stats)
conn.close()
Walk and verify the audit chain#
Every state-changing op writes one row in audit_events with
event_hash = sha256(prev_event_hash || canonical(body)). Verifying
the chain is just re-running that hash for every row and checking the
linkage. (make chain-check-shards does this at scale; the recipe
below is the same logic, inlined.)
import hashlib
from arborist.embed import open_store
from arborist.store import latest_event_hash

conn = open_store("data/arborist.db")
print("head:", latest_event_hash(conn))
prev = None
bad = 0
for seq, eh, ph, body in conn.execute(
    "SELECT seq, event_hash, prev_event_hash, body "
    "FROM audit_events ORDER BY seq"
):
    # Recompute sha256(prev_event_hash || body) for this row.
    h = hashlib.sha256()
    if ph is not None:
        h.update(bytes.fromhex(ph))
    h.update(body.encode("utf-8", errors="surrogatepass"))
    # A break is a hash mismatch or a prev pointer that skips the prior row.
    if h.hexdigest() != eh or (prev is not None and ph != prev):
        bad += 1
    prev = eh
print(f"chain breaks: {bad}")  # 0 = intact
conn.close()
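To see the check catch tampering without touching a real shard, the same logic can run against a throwaway in-memory table. This is a sketch under assumptions: the table is created here with only the four columns the query above reads, and the event bodies are made-up strings rather than arborist's canonical encoding.

```python
import hashlib
import sqlite3


def event_hash(prev_hex, body):
    """sha256(prev_event_hash || body), matching the formula above."""
    h = hashlib.sha256()
    if prev_hex is not None:
        h.update(bytes.fromhex(prev_hex))
    h.update(body.encode("utf-8"))
    return h.hexdigest()


def count_breaks(conn):
    """Re-hash every row in order and count mismatches or bad links."""
    prev, bad = None, 0
    for eh, ph, body in conn.execute(
        "SELECT event_hash, prev_event_hash, body "
        "FROM audit_events ORDER BY seq"
    ):
        if event_hash(ph, body) != eh or (prev is not None and ph != prev):
            bad += 1
        prev = eh
    return bad


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_events ("
    " seq INTEGER PRIMARY KEY, event_hash TEXT,"
    " prev_event_hash TEXT, body TEXT)"
)
prev = None
for body in ('{"op":"ingest"}', '{"op":"falsify"}', '{"op":"migrate"}'):
    eh = event_hash(prev, body)
    conn.execute(
        "INSERT INTO audit_events (event_hash, prev_event_hash, body)"
        " VALUES (?, ?, ?)",
        (eh, prev, body),
    )
    prev = eh

breaks_before = count_breaks(conn)
# Rewrite one body without updating its hash.
conn.execute(
    "UPDATE audit_events SET body = ? WHERE seq = 2", ('{"op":"tampered"}',)
)
breaks_after = count_breaks(conn)
```

Editing a single body surfaces as exactly one break, because the stored hash of that row no longer matches its contents while the links on either side are untouched.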
Round-trip a Merkle inclusion proof#
The proof primitives from arborist.merkle are the Python port of
proxy.unturf.com’s Go conventions. Use them to re-derive a
document_root from its leaves, build a proof for any chunk, and
serialize the proof for over-the-wire delivery to another peer.
import json
from arborist.embed import open_store
from arborist.merkle import (
    MerkleTree, verify_proof, proof_to_dict, proof_from_dict,
)

conn = open_store("data/arborist.db")
doc_root_hex, = conn.execute(
    "SELECT document_root FROM documents LIMIT 1"
).fetchone()
leaves = [
    bytes.fromhex(r[0]) for r in conn.execute(
        "SELECT leaf_hash FROM chunks WHERE document_root=? ORDER BY idx",
        (doc_root_hex,),
    )
]
tree = MerkleTree.build(leaves)
assert tree.root.hex() == doc_root_hex  # bit-identical re-derivation
proof = tree.proof(0)
assert verify_proof(proof)  # local round-trip
blob = json.dumps(proof_to_dict(proof))  # serialize for wire
proof_received = proof_from_dict(json.loads(blob))
assert verify_proof(proof_received)  # any peer can verify
conn.close()
Two peers that ingested the same source with the same chunker and
canonicalization will compute byte-identical document_root hashes
and accept each other’s proofs — see Federation: multiplayer arborist for the
federation primitives built on top of this property.
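For readers new to inclusion proofs, the idea can be sketched without arborist at all. The code below is a generic teaching example, not the arborist.merkle implementation: its node encoding (plain SHA-256 of the two child hashes, odd nodes paired with themselves, no leaf/node domain separation) is an assumption and will not reproduce arborist's document_root values.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, leaves first; an odd node pairs with itself."""
    if not leaves:
        raise ValueError("need at least one leaf")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append(
            [_h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)]
        )
    return levels


def make_proof(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1  # neighbour within the same pair
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(leaf, path, root):
    """Fold the sibling path back up and compare with the expected root."""
    node = leaf
    for sibling, sibling_is_left in path:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == root


leaves = [_h(f"chunk-{i}".encode()) for i in range(5)]
levels = build_levels(leaves)
root = levels[-1][0]
all_ok = all(
    verify(leaf, make_proof(levels, i), root)
    for i, leaf in enumerate(leaves)
)
forged_ok = verify(_h(b"not-a-chunk"), make_proof(levels, 0), root)
```

A verifier needs only the leaf, the sibling path, and the root, which is why a serialized proof is enough for a peer that holds none of the other chunks.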
Run a Q&A from Python and read the result programmatically#
The CLI (arborist query / make query) is one entry point to
the multi-route retrieval + LLM + verifier pipeline. The same surface
is callable directly from Python via arborist.qa.query.query() —
useful when you want to drive a batch sweep, integrate into a notebook,
or wrap the result in your own application logic. Cache lookups, the
8-dim cache_key, the verifier, the run-DAG, and the falsification gate
all behave identically to the CLI path.
from pathlib import Path
from arborist.qa.query import query
from arborist.qa.client import OpenAICompatibleClient

client = OpenAICompatibleClient(
    base_url="https://hermes.ai.unturf.com/v1",  # any OpenAI-compat
)
result = query(
    question="What is anarcho-capitalism?",
    qa_db=Path.home() / ".arborist" / "qa.db",  # cache lives here
    chat_client=client,
    model_id="adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
    shards_dir=Path.home() / ".arborist" / "shards",  # OR single_db=...
)
print(result["status"])      # cache_hit | cache_miss_then_written
print(result["audit_mode"])  # STRICT | HYBRID | UNGROUNDED
print(result["cache_key"])   # 64-char hex
print(result["answer_text"])
for src in result["sources"]:
    print(" ->", src["document_uri"], src["document_root"][:12],
          "role:", src["source_role"])
The first call writes one providence-cache row + one audit event;
the second call with the same question replays from cache in ~100 ms
(status == "cache_hit"). To exercise the pipeline deterministically
in tests, swap the live client for StubClient:
from pathlib import Path
from arborist.qa.query import query
from arborist.qa.client import StubClient

stub = StubClient(answer='Anarcho-capitalism is "a political '
                         'philosophy that advocates the elimination '
                         'of centralized state dictums".')
result = query(
    question="What is anarcho-capitalism?",
    qa_db=Path("/tmp/qa.db"),
    chat_client=stub,
    model_id="stub/test",
    single_db=Path("/tmp/arborist.db"),
)
assert result["audit_mode"] in {"STRICT", "HYBRID", "UNGROUNDED"}
That stubbed shape is exactly how the test suite drives the runner — the verifier still runs lexically against the assembled context, so the audit_mode classification is real even though the model output is canned.
For the single-document path (arborist ask / one document_root
in hand), use arborist.qa.runner.ask() instead — same return
shape, same cache_key invariants, scoped to one document.
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.