Cookbook
========

Recipes for common workflows beyond the quickstart. Each starts from
a working arborist install (``make bootstrap`` already run) and a
populated shards directory under ``~/.arborist/shards/``.

Re-crawl a website to detect changes
-------------------------------------

After ``make crawl-ingest`` lands a site into a shard, the per-page
ETag and Last-Modified headers are kept in ``document_http_meta``. A
re-crawl can ask "did anything change?" without downloading bodies.

.. code-block:: sh

   make crawl-ingest URL=https://example.com DEPTH=2  # initial crawl
   # ...later...
   make recrawl-check DOMAIN=example.com              # conditional HEAD per page

Each URL is classified ``fresh`` (304), ``stale`` (200, meaning the
stored validators no longer match), ``gone`` (404/410), or
``unreachable``. That costs one small round-trip per URL, and unchanged
pages transfer no body.
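
The classification can be sketched with the standard library alone.
This is an illustration, not arborist's implementation:
``classify_status`` and ``conditional_head`` are hypothetical names, and
mapping every other status to ``unreachable`` is an assumption.

.. code-block:: python

   import urllib.error
   import urllib.request


   def classify_status(status):
       """Map an HTTP status code to the recrawl labels."""
       if status == 304:
           return "fresh"          # stored validators still match
       if status == 200:
           return "stale"          # validators no longer match
       if status in (404, 410):
           return "gone"
       return "unreachable"        # assumption: everything else lands here


   def conditional_head(url, etag=None, last_modified=None, timeout=10.0):
       """Send one conditional HEAD request and classify the response."""
       request = urllib.request.Request(url, method="HEAD")
       if etag:
           request.add_header("If-None-Match", etag)
       if last_modified:
           request.add_header("If-Modified-Since", last_modified)
       try:
           with urllib.request.urlopen(request, timeout=timeout) as response:
               return classify_status(response.status)
       except urllib.error.HTTPError as exc:   # urllib raises on 304 and 4xx/5xx
           return classify_status(exc.code)
       except OSError:                         # DNS failure, refusal, timeout
           return "unreachable"

``HTTPError`` is caught before the broader ``OSError`` because it is a
subclass of it, and a 304 must not be counted as a network failure.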

Falsify a wrong answer (audit-preserving)
------------------------------------------

When the verifier rated an answer STRICT but you know it is wrong,
mark the record falsified. It stays in the DB so downstream consumers
that referenced it can still trace history.

.. code-block:: sh

   make query Q="When did X happen?"             # see the answer + cache_key
   make inspect KEY=<cache_key>                  # diagnose unverified spans
   make falsify KEY=<cache_key> REASON='wrong year — sources cited 1942 not 1944'

Future lookups skip records whose ``falsification_state != 'live'``.
A ``falsify`` audit event records the act; the chain stays intact.
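
The "flag, do not delete" behaviour can be sketched against an
in-memory SQLite table. The schema here is hypothetical; only the
``falsification_state`` column and its ``'live'`` value come from the
text above, and the ``'falsified'`` value is an assumption.

.. code-block:: python

   import sqlite3

   conn = sqlite3.connect(":memory:")
   conn.execute(
       "CREATE TABLE answers ("                  # hypothetical table
       " cache_key TEXT PRIMARY KEY,"
       " answer_text TEXT NOT NULL,"
       " falsification_state TEXT NOT NULL DEFAULT 'live',"
       " falsification_reason TEXT)"
   )
   conn.execute(
       "INSERT INTO answers (cache_key, answer_text) VALUES ('k1', '1944')"
   )


   def falsify(conn, cache_key, reason):
       """Flag the record instead of deleting it, so history stays traceable."""
       conn.execute(
           "UPDATE answers SET falsification_state = 'falsified',"
           " falsification_reason = ? WHERE cache_key = ?",
           (reason, cache_key),
       )


   def lookup(conn, cache_key):
       """Return the cached answer only while the record is still live."""
       row = conn.execute(
           "SELECT answer_text FROM answers"
           " WHERE cache_key = ? AND falsification_state = 'live'",
           (cache_key,),
       ).fetchone()
       return row[0] if row else None


   before = lookup(conn, "k1")                   # record is live
   falsify(conn, "k1", "sources cited 1942 not 1944")
   after = lookup(conn, "k1")                    # skipped once falsified
   remaining = conn.execute("SELECT COUNT(*) FROM answers").fetchone()[0]

The row count is unchanged after ``falsify``; only the lookup filter
hides the record.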

Promote your own past answers into the corpus
----------------------------------------------

After enough STRICT answers accumulate, treat them as a derived
source. The ``providence`` ingest path promotes mature STRICT records
into the document corpus where retrieval can pick them up.

.. code-block:: sh

   make ingest-self-providence KG_SECONDS=86400   # only records ≥1 day old

``KG_SECONDS`` (the kindergarten window) prevents the system from
trusting freshly cached answers as ground truth before they have had
time to fail. See :doc:`api/qa` for the ``ProvidenceSource``
implementation.
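
The window check itself is a single comparison. A minimal sketch,
assuming records carry a creation timestamp in epoch seconds;
``is_mature`` is a hypothetical helper, not part of arborist:

.. code-block:: python

   import time


   def is_mature(created_at, kg_seconds, now=None):
       """True once a record has outlived the kindergarten window."""
       if now is None:
           now = time.time()
       return (now - created_at) >= kg_seconds


   now = 1_700_000_000.0
   day_old = is_mature(now - 86400, kg_seconds=86400, now=now)
   hour_old = is_mature(now - 3600, kg_seconds=86400, now=now)

With ``KG_SECONDS=86400`` a record exactly one day old qualifies, while
one written an hour ago does not.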

Query across mixed corpora
--------------------------

Every shard under ``~/.arborist/shards/`` is queried automatically.
Mix Wikipedia, your Grok export, a crawled site, and your own git
repos in one query — retrieval ranks across all of them.

.. code-block:: sh

   make ingest-cur-attached                                          # Wikipedia
   make ingest-grok-attached GROK_EXPORT=$HOME/Downloads/<uuid>      # Grok chats
   make crawl-ingest URL=https://russell.ballestrini.net DEPTH=2     # personal site
   make ingest-git GIT_REPO=$HOME/git/myproject                      # source code

   make query Q="how does my project handle authentication?"

Each shard contributes hits. The query path reranks on title relevance
and body coverage, which lets neologisms in your private corpus outrank
generic Wikipedia matches.

Override the LLM endpoint
-------------------------

The default endpoint is ``https://hermes.ai.unturf.com/v1``
(Hermes-3-8B, no auth). Point at any OpenAI-compatible endpoint via
environment variables:

.. code-block:: sh

   export ARBORIST_LLM_ENDPOINT="https://your-vllm.example.com/v1"
   export ARBORIST_LLM_MODEL="meta-llama/Llama-3.1-70B-Instruct"
   export ARBORIST_LLM_API_KEY="..."   # optional; many vLLM deploys are open
   make query Q="..."

The model id folds into ``model_profile_hash`` (one of the 8 cache
key dimensions), so swapping models changes the cache key and answers
cached under the old model no longer match on lookup. There is no risk
of serving an answer one model produced under another model's
identity.
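
Why a model swap misses the cache can be sketched with a hash over the
key dimensions. Everything below is illustrative: the real key has 8
dimensions and its canonical form is not shown here, so the
``question`` field, the JSON encoding, and deriving
``model_profile_hash`` from the model id alone are all assumptions.

.. code-block:: python

   import hashlib
   import json


   def model_profile_hash(model_id):
       # Assumption: the real profile hash may cover more than the model id.
       return hashlib.sha256(model_id.encode("utf-8")).hexdigest()


   def cache_key(dimensions):
       """Hash a mapping of key dimensions into a 64-char hex digest."""
       canonical = json.dumps(dimensions, sort_keys=True, separators=(",", ":"))
       return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


   base = {"question": "When did X happen?"}   # stand-in for the other dimensions
   key_a = cache_key({**base, "model_profile_hash": model_profile_hash("model-a")})
   key_b = cache_key({**base, "model_profile_hash": model_profile_hash("model-b")})

The two keys differ even though the question is identical, so a lookup
under the new model cannot land on the old model's record.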

Verify shard integrity after a bulk operation
----------------------------------------------

Any state-changing op (mass falsify, hash bump, schema migration)
should be followed by:

.. code-block:: sh

   make chain-check-shards          # 0 chain breaks per shard = intact
   make analyze-shards              # compression spectrum + audit integrity
   make verify-shards               # round-trip Merkle proofs on a sample

Chain breaks are the loudest possible signal: any non-zero count means
the audit log no longer hashes end to end. Run these before declaring
an op successful.

Run the QA bench and read the results
--------------------------------------

The QA bench measures how often the verifier says STRICT vs HYBRID
vs UNGROUNDED across a fixed question set, per answer mode.

.. code-block:: sh

   make bench-qa-smoke                          # ~30s, 5 questions × 3 modes
   make bench-qa BENCH_QA_N=3                   # full sweep, 3 samples each

Output lands in ``bench/qa_results/<utc-stamp>.{jsonl,md}``. The
markdown file has the summary table; the JSONL has every per-question
record for drill-down.
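
For drill-down, the JSONL can be tallied in a few lines. A sketch only:
the field names ``mode`` and ``audit_mode`` and the mode labels in the
sample are assumptions about the record layout, so check them against a
real results file first.

.. code-block:: python

   import io
   import json
   from collections import Counter, defaultdict


   def tally(lines):
       """Count verifier verdicts per answer mode from JSONL lines."""
       counts = defaultdict(Counter)
       for line in lines:
           line = line.strip()
           if not line:
               continue                       # tolerate blank lines
           record = json.loads(line)
           counts[record["mode"]][record["audit_mode"]] += 1
       return counts


   # Hand-written sample standing in for a results file.
   sample = io.StringIO(
       '{"mode": "mode_a", "audit_mode": "STRICT"}\n'
       '{"mode": "mode_a", "audit_mode": "HYBRID"}\n'
       '{"mode": "mode_a", "audit_mode": "STRICT"}\n'
       '{"mode": "mode_b", "audit_mode": "UNGROUNDED"}\n'
   )
   counts = tally(sample)

``tally`` accepts any iterable of lines, so an open file handle on the
``.jsonl`` output works in place of the sample.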

Resume an interrupted bench:

.. code-block:: sh

   make bench-qa --resume bench/qa_results/<previous-utc-stamp>.jsonl

Pass the same ``--seed`` as the interrupted run so the shuffled task
order lines up with the records already written.

Tune retrieval per-question
---------------------------

When a query returns the wrong sources, ``K=`` injects extra retrieval
keywords without changing what the LLM sees as the question:

.. code-block:: sh

   make query Q="What did Orwell mean by always at war?" K="1984 Oceania Eastasia"

A known provenance gap with ``K=`` is tracked in
:doc:`api/qa` (``arborist.qa.query``).

Use arborist as a Python library
================================

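The recipes below start from the supported embedding surface,
:mod:`arborist.embed`. Prefer importing from there over internal
modules, since internal refactors are free to move things around behind
that seam. Where a recipe needs something outside it
(``arborist.ingest``, ``arborist.store``, ``arborist.merkle``,
``arborist.qa``), the import names the module explicitly.
See :doc:`api/storage` for the full ``arborist.store`` reference and
:doc:`api/substrate` for the Merkle primitives the recipes call into.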

Open a store and ingest documents you already hold
---------------------------------------------------

The minimum useful contact surface: pass in your own
:class:`~arborist.document.Document` objects, get back content-addressed
storage with an audit chain. Idempotent — re-running with the same
``content`` yields the same ``document_root`` and skips the insert.

.. code-block:: python

   from pathlib import Path
   from arborist.embed import (
       open_store, ingest_documents, search, Document, Edge,
   )

   conn = open_store(Path("data/arborist.db"))   # creates + migrates

   stats = ingest_documents(conn, [
       Document(
           uri="https://example.com/post-a",
           content="anarcho-capitalism describes a stateless society "
                   "where private property and free markets coordinate "
                   "without coercion.",
           source_type="my_app",
           title="Anarcho-Capitalism Primer",
           edges=[Edge(edge_type="references",
                       dst_uri="https://example.com/post-b")],
           extra={"md5": "deadbeef"},          # your provenance, carried along
       ),
   ])
   print(stats)                                # IngestStats(seen=1, inserted=1, ...)

   for hit in search(conn, "free markets", limit=5):
       print(hit.document_uri, round(hit.score, 3))

   conn.close()

Define a custom :class:`~arborist.source.Source` for stateful corpora
---------------------------------------------------------------------

When you have a corpus (a directory tree, a paginated API, a database
table) it's cleaner to express it as a :class:`Source`. The ABC has one
required method, :meth:`iter_documents`, which must be deterministic
and idempotent. That's exactly the contract every built-in source under
``arborist/sources/`` already implements.

.. code-block:: python

   from arborist.embed import Document, Source, open_store
   from arborist.ingest import ingest_source

   class TaggedDocs(Source):
       """Ingest a list of (uri, body) pairs under a shared tag."""

       source_type = "tagged_example"

       def __init__(self, tag, items):
           self.tag = tag
           self._items = items

       def iter_documents(self):
           for uri, body in self._items:
               yield Document(
                   uri=uri,
                   content=body,
                   source_type=self.source_type,
                   extra={"tag": self.tag},
               )

   conn = open_store("data/arborist.db")
   stats = ingest_source(conn, TaggedDocs("research", [
       ("https://example.com/c", "third doc body about content-addressing."),
   ]))
   print(stats)
   conn.close()

Walk and verify the audit chain
-------------------------------

Every state-changing op writes one row in ``audit_events`` with
``event_hash = sha256(prev_event_hash || canonical(body))``. Verifying
the chain is just re-running that hash for every row and checking the
linkage. (``make chain-check-shards`` does this at scale; the recipe
below is the same logic, inlined.)

.. code-block:: python

   import hashlib
   from arborist.embed import open_store
   from arborist.store import latest_event_hash

   conn = open_store("data/arborist.db")
   print("head:", latest_event_hash(conn))

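   prev = None                                # event_hash of the previous row
   bad = 0
   for seq, eh, ph, body in conn.execute(
       "SELECT seq, event_hash, prev_event_hash, body "
       "FROM audit_events ORDER BY seq"
   ):
       # Recompute sha256(prev_event_hash || body) from the stored columns.
       h = hashlib.sha256()
       if ph is not None:                     # NULL when there is no predecessor
           h.update(bytes.fromhex(ph))
       h.update(body.encode("utf-8", errors="surrogatepass"))
       # A break is a hash mismatch, or a row that does not point at the
       # previous row's event_hash.
       if h.hexdigest() != eh or (prev is not None and ph != prev):
           bad += 1
       prev = eh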

   print(f"chain breaks: {bad}")              # 0 = intact
   conn.close()
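
To see the check catch tampering without touching a real store, the
same hashing rule can be run against a throwaway in-memory table. This
is a sketch under assumptions: the table layout mirrors only the four
columns queried above, and ``canonical`` (compact JSON with sorted
keys) is a stand-in for whatever canonicalization arborist applies.

.. code-block:: python

   import hashlib
   import json
   import sqlite3


   def canonical(body):
       # Assumption: compact JSON with sorted keys.
       return json.dumps(body, sort_keys=True, separators=(",", ":"))


   def event_hash(prev_hex, text):
       h = hashlib.sha256()
       if prev_hex is not None:
           h.update(bytes.fromhex(prev_hex))
       h.update(text.encode("utf-8"))
       return h.hexdigest()


   def append_event(conn, body):
       """Link a new event to the current head of the chain."""
       row = conn.execute(
           "SELECT event_hash FROM audit_events ORDER BY seq DESC LIMIT 1"
       ).fetchone()
       prev = row[0] if row else None
       text = canonical(body)
       conn.execute(
           "INSERT INTO audit_events (event_hash, prev_event_hash, body)"
           " VALUES (?, ?, ?)",
           (event_hash(prev, text), prev, text),
       )


   def count_breaks(conn):
       prev, bad = None, 0
       for eh, ph, body in conn.execute(
           "SELECT event_hash, prev_event_hash, body"
           " FROM audit_events ORDER BY seq"
       ):
           if event_hash(ph, body) != eh or ph != prev:
               bad += 1
           prev = eh
       return bad


   conn = sqlite3.connect(":memory:")
   conn.execute(
       "CREATE TABLE audit_events ("
       " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
       " event_hash TEXT NOT NULL,"
       " prev_event_hash TEXT,"
       " body TEXT NOT NULL)"
   )
   for n in range(3):
       append_event(conn, {"op": "ingest", "n": n})
   intact = count_breaks(conn)

   # Rewrite the middle event's body without updating its hash.
   conn.execute(
       "UPDATE audit_events SET body = ? WHERE seq = 2",
       (canonical({"op": "ingest", "n": 99}),),
   )
   tampered = count_breaks(conn)

Editing one body yields exactly one break here, because the stored
``event_hash`` of the edited row no longer matches its contents while
the later rows still link to that stored hash.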

Round-trip a Merkle inclusion proof
-----------------------------------

The proof primitives from :mod:`arborist.merkle` are the Python port of
``proxy.unturf.com``'s Go conventions. Use them to re-derive a
``document_root`` from its leaves, build a proof for any chunk, and
serialize the proof for over-the-wire delivery to another peer.

.. code-block:: python

   import json
   from arborist.embed import open_store
   from arborist.merkle import (
       MerkleTree, verify_proof, proof_to_dict, proof_from_dict,
   )

   conn = open_store("data/arborist.db")

   doc_root_hex, = conn.execute(
       "SELECT document_root FROM documents LIMIT 1"
   ).fetchone()
   leaves = [
       bytes.fromhex(r[0]) for r in conn.execute(
           "SELECT leaf_hash FROM chunks WHERE document_root=? ORDER BY idx",
           (doc_root_hex,),
       )
   ]

   tree = MerkleTree.build(leaves)
   assert tree.root.hex() == doc_root_hex     # bit-identical re-derivation

   proof = tree.proof(0)
   assert verify_proof(proof)                 # local round-trip

   blob = json.dumps(proof_to_dict(proof))    # serialize for wire
   proof_received = proof_from_dict(json.loads(blob))
   assert verify_proof(proof_received)        # any peer can verify

   conn.close()

Two peers that ingested the same source with the same chunker and
canonicalization will compute byte-identical ``document_root`` hashes
and accept each other's proofs — see :doc:`api/mesh` for the
federation primitives built on top of this property.
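
For readers new to inclusion proofs, the idea can be sketched without
arborist at all. This toy tree is not wire-compatible with
:mod:`arborist.merkle`: plain ``sha256(left + right)`` parents,
promotion of an unpaired node, and the ``(side, sibling)`` proof shape
are all assumptions made for illustration.

.. code-block:: python

   import hashlib
   import json


   def _parent(left, right):
       return hashlib.sha256(left + right).digest()


   def build_levels(leaves):
       """Return every level of the tree, leaves first, root level last."""
       if not leaves:
           raise ValueError("need at least one leaf")
       levels = [list(leaves)]
       while len(levels[-1]) > 1:
           current, nxt = levels[-1], []
           for i in range(0, len(current), 2):
               if i + 1 < len(current):
                   nxt.append(_parent(current[i], current[i + 1]))
               else:
                   nxt.append(current[i])   # assumption: odd node promoted as-is
           levels.append(nxt)
       return levels


   def make_proof(levels, index):
       """Collect the sibling hashes needed to rebuild the root from one leaf."""
       path = []
       for level in levels[:-1]:
           sibling = index ^ 1
           if sibling < len(level):
               side = "left" if sibling < index else "right"
               path.append((side, level[sibling]))
           index //= 2
       return path


   def verify(leaf, path, root):
       node = leaf
       for side, sibling in path:
           node = _parent(sibling, node) if side == "left" else _parent(node, sibling)
       return node == root


   leaves = [hashlib.sha256(b"chunk-%d" % i).digest() for i in range(5)]
   levels = build_levels(leaves)
   root = levels[-1][0]
   proof = make_proof(levels, 4)

   # Hex-encode for the wire, then decode on the receiving side.
   blob = json.dumps([[side, sib.hex()] for side, sib in proof])
   received = [(side, bytes.fromhex(hx)) for side, hx in json.loads(blob)]
   ok = verify(leaves[4], received, root)

The verifier needs only the leaf, the proof, and the root, which is why
a proof can be checked by a peer that holds none of the other chunks.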

Run a Q&A from Python and read the result programmatically
-----------------------------------------------------------

The CLI (``arborist query`` / ``make query``) is one entry point to
the multi-route retrieval + LLM + verifier pipeline. The same surface
is callable directly from Python via :func:`arborist.qa.query.query` —
useful when you want to drive a batch sweep, integrate into a notebook,
or wrap the result in your own application logic. Cache lookups, the
8-dimension ``cache_key``, the verifier, the run-DAG, and the
falsification gate all behave identically to the CLI path.

.. code-block:: python

   from pathlib import Path
   from arborist.qa.query import query
   from arborist.qa.client import OpenAICompatibleClient

   client = OpenAICompatibleClient(
       base_url="https://hermes.ai.unturf.com/v1",       # any OpenAI-compat
   )

   result = query(
       question="What is anarcho-capitalism?",
       qa_db=Path.home() / ".arborist" / "qa.db",        # cache lives here
       chat_client=client,
       model_id="adamo1139/Hermes-3-Llama-3.1-8B-FP8-Dynamic",
       shards_dir=Path.home() / ".arborist" / "shards",  # OR single_db=...
   )

   print(result["status"])              # cache_hit | cache_miss_then_written
   print(result["audit_mode"])          # STRICT | HYBRID | UNGROUNDED
   print(result["cache_key"])           # 64-char hex
   print(result["answer_text"])
   for src in result["sources"]:
       print("  ->", src["document_uri"], src["document_root"][:12],
             "role:", src["source_role"])

The first call writes one providence-cache row + one audit event;
the second call with the same question replays from cache in ~100 ms
(``status == "cache_hit"``). To exercise the pipeline deterministically
in tests, swap the live client for :class:`~arborist.qa.client.StubClient`:

.. code-block:: python

   from pathlib import Path

   from arborist.qa.client import StubClient
   from arborist.qa.query import query

   stub = StubClient(answer='Anarcho-capitalism is "a political '
                            'philosophy that advocates the elimination '
                            'of centralized state dictums".')
   result = query(
       question="What is anarcho-capitalism?",
       qa_db=Path("/tmp/qa.db"),
       chat_client=stub,
       model_id="stub/test",
       single_db=Path("/tmp/arborist.db"),
   )
   assert result["audit_mode"] in {"STRICT", "HYBRID", "UNGROUNDED"}

That stubbed shape is exactly how the test suite drives the runner:
the verifier still runs lexically against the assembled context, so
the ``audit_mode`` classification is real even though the model output
is canned.

For the single-document path (``arborist ask`` / one ``document_root``
in hand), use :func:`arborist.qa.runner.ask` instead. It has the same
return shape and the same ``cache_key`` invariants, scoped to one
document.
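
A batch sweep over either entry point reduces to a loop and two
counters. In this sketch ``sweep`` and ``fake_ask`` are hypothetical
helpers; the only parts taken from the recipes above are the
``audit_mode`` and ``status`` keys of the result.

.. code-block:: python

   from collections import Counter


   def sweep(questions, ask):
       """Run every question through ``ask`` and tally verdicts and status."""
       verdicts, statuses = Counter(), Counter()
       for question in questions:
           result = ask(question)
           verdicts[result["audit_mode"]] += 1
           statuses[result["status"]] += 1
       return verdicts, statuses


   def fake_ask(question):
       # Stand-in for a real call; returns only the two keys the sweep reads.
       return {
           "audit_mode": "STRICT" if question.endswith("?") else "UNGROUNDED",
           "status": "cache_miss_then_written",
       }


   verdicts, statuses = sweep(["What is X?", "not a question"], fake_ask)

In real use ``ask`` would wrap ``query`` with the keyword arguments
from the first example in this recipe, for instance via
``functools.partial`` or a small lambda.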
