Q&A Pipeline: question → answer → verify → cache

Contents

Q&A Pipeline: question → answer → verify → cache#

Question answering, caching, verification, evidence mapping.

keys#

The 8-dim Merkle-AGI v9.8 cache_key.

v9.8 invariant: no answer is reused unless all eight match and the record is live (not failed/stale/quarantined):

  1. source_root — content fingerprint of the document

  2. question_hash — SHA-256 of normalized question text

  3. model_profile_hash — model_id + revision + quantization

  4. conversation_hash — full canonical OpenAI messages array

  5. governance_policy_hash — sampling/policy parameters dict

  6. schema_version — arborist DB schema version

  7. canonicalization_version — text normalization rules

  8. chunking_version — chunker name & parameters

Bumping ANY of these eight dimensions yields a distinct cache_key, so prior records cannot be served. This is the runtime drift detection the providence whitepaper compresses into “cache_key = source_root + ‘:’ + question_hash” — that’s a simplification; the rigorous form is all eight dimensions hashed together.

arborist.qa.keys.canonical_question(question, *, mode='equivalence_class')[source]#

Canonical form of question for the given dedup mode.

Two modes:

  • "equivalence_class" (default): four-step canonicalization — canonicalize() (NFC + ws-collapse + strip ends), then lowercase, then trailing-punctuation strip, then standalone-article filter (the, a, an). All variants of “Who is THE Batman?” / “who is batman” / “who is X.” collapse to one form. The default for chat-style agents that prefer fast cache hits.

  • "strict": only canonicalize() — NFC + ws-collapse + strip ends. Case-sensitive, punctuation-sensitive, article-sensitive. Maximum granularity. The choice for audit-grade agents that want every distinct phrasing to get its own answer.

Exposed as a function so callers can dedup BEFORE hashing — e.g. inject the canonical form into the user message used for conversation_hash, while still sending the verbatim question to the LLM. Without this split "who is batman", "who is batman?", and "who is the batman?" collapse on question_hash (under equivalence_class) but each hits conversation_hash differently, missing cache.

The choice of mode flows through the question_dedup policy field into governance_policy_hash so two agents under different modes write records under different cache_key values — they coexist in parallel namespaces, never collide.

Parameters:
Return type:

str

arborist.qa.keys.question_hash(question, *, mode='equivalence_class')[source]#

SHA-256 of the dedup-mode-canonicalized question.

See canonical_question for what each mode does. The hash is the SHA-256 of the canonical form. Bumping _QUESTION_TRAILING_STRIP or _QUESTION_ARTICLE_STRIP (the equivalence-class strip sets) orphans prior cache records whose canonical question contained newly-stripped tokens; they live as history but won’t be re-hit on lookup.

Equivalence class examples (mode=”equivalence_class”):

"who is X"           |
"who is X?"          |
"Who Is X."          | -> same question_hash
"who is the X"       |
"who is a X"         |
"who is an X?"      |  (CJK question mark)

Strict mode (mode=”strict”) distinguishes all of those.

What’s IN the trailing-strip set: .?!,;: (ASCII), ?!。、 (CJK full-width), (ellipsis). Pairs like " ' ) ] } are NOT — naive one-sided stripping breaks balance. Apostrophes aren’t either — X's is a different question from X.

Parameters:
Return type:

str

arborist.qa.keys.model_profile_hash(model_id, revision='', quantization='')[source]#

SHA-256 of model identity. Bumping any field bumps the cache key.

Parameters:
  • model_id (str)

  • revision (str)

  • quantization (str)

Return type:

str

arborist.qa.keys.conversation_hash(messages)[source]#

SHA-256 of canonical JSON of the full OpenAI messages array.

Order matters: a 6-turn dialogue arriving at the same final question produces a different hash than a single-turn ask.

Parameters:

messages (list[dict])

Return type:

str

arborist.qa.keys.governance_policy_hash(policy)[source]#

SHA-256 of canonical JSON of the sampling/policy dict.

Includes temperature, top_p, max_tokens, and the system prompt — any of those changing means the answer is governed differently and the cache must miss.

Parameters:

policy (dict)

Return type:

str

arborist.qa.keys.verifier_policy_hash(policy)[source]#

SHA-256 of canonical JSON of the verifier-relevant subset of policy.

Pulls _VERIFIER_POLICY_FIELDS out of policy and hashes only those. Empty dict → constant hash (sha256(“{}”)). Folded into cache_key as a 9th dimension so a verifier-policy change is observable from the cache_key alone, separate from governance_policy_hash which folds in temperature / top_p / prompts.

The two hashes overlap (verifier fields ARE in the broader policy dict and so contribute to governance_policy_hash too). That’s intentional — bumping a verifier rule bumps BOTH dimensions. Bumping a non-verifier field (e.g. temperature) bumps ONLY governance_policy_hash. The asymmetry is what makes the audit legible: which dimension changed answers a question that scanning the whole policy dict cannot.

Parameters:

policy (dict)

Return type:

str

arborist.qa.keys.cache_key(source_root, question_hash_value, model_profile_hash_value, conversation_hash_value, governance_policy_hash_value, schema_version, canonicalization_version, chunking_version, verifier_policy_hash_value=None)[source]#

SHA-256 of the cache-identity dimensions joined with ‘|’.

8-dim form (legacy): omit verifier_policy_hash_value (or pass None). The result matches pre-2026-05-01 cache identity and keeps backward compatibility with cached records written before the 9th dimension landed.

9-dim form: pass verifier_policy_hash_value explicitly. Records written under the 9-dim form bind to the verifier-policy identity; lookups with a different verifier_policy_hash miss. The 9th dimension is the explicit “did the verifier rules change?” gate.

Any drift in any dimension produces a distinct cache_key.

Parameters:
  • source_root (str)

  • question_hash_value (str)

  • model_profile_hash_value (str)

  • conversation_hash_value (str)

  • governance_policy_hash_value (str)

  • schema_version (str)

  • canonicalization_version (str)

  • chunking_version (str)

  • verifier_policy_hash_value (str | None)

Return type:

str

runner#

Q&A runner: cache-first lookup -> inference fallback -> provable record.

Implements the v9.8 admissibility invariant: no record reused unless all 8 cache_key dimensions match AND state is ‘live’ (not failed/stale/quarantined).

  • Cache hit -> persisted audit_mode (STRICT/HYBRID/UNGROUNDED).

  • Cache miss -> call ChatClient, run faithfulness check, classify, store record, audit event.

arborist.qa.runner.ask(conn, *, document_root, question, client, model_id, revision='', quantization='', policy=None, chain='private', fidelity=None)[source]#

Look up cached answer or run inference. Returns a result dict.

See arborist.qa.query.query for fidelity semantics — it controls lookup tolerance: "strict" only checks the cache_key matching the call’s policy["question_dedup"]; the default "equivalence_class" falls back to the alternate dedup mode’s cache_key on miss so a fast-cache agent can reuse records written under either mode. Result includes lookup_path.

Parameters:
Return type:

dict

query#

Multi-source corpus Q&A.

Pose a question, the tree finds related cached docs, assembles them as context, asks Hermes, caches the answer.

The flow:
  1. FTS5 search across all shards (chunks_fts can’t be UNION’d in views, so we query each shard’s index independently and merge by score).

  2. Title-boost rerank: hits whose title contains query tokens get a score bump. Title is a strong topical signal that BM25 alone misses (BM25 favors short docs with rare body tokens — Tell_(poker) outranks Back_to_the_Future without a title boost).

  3. Pick top-K distinct documents within a character budget.

  4. Compute context_root = Merkle root over the sorted source document_roots — that’s the “source” dimension of v9.8’s 8-dim cache_key for this multi-source answer.

  5. Cache lookup; hit returns the persisted audit_mode.

  6. Miss calls Hermes via the OpenAI-compatible client, then runs the faithfulness check (verify_quotes) — every double-quoted span in the answer is verbatim-matched against the assembled context. Result classifies the answer as STRICT (every quote >=1 verified against context), HYBRID (some claims sourced, some emergent / training-derived), or UNGROUNDED (no quotes verify — purely emergent).

  7. Persist record with merkle_proof = {context_root, sources: […]}, audit_mode, and unverified_quotes (the spans the model produced that didn’t appear in any source — corpus-growth signal).

Per-source proofs are not bundled here (the source roots themselves are already content-addressed). A verifier asks the shards for any specific chunk’s proof on demand.

arborist.qa.query.query(*, question, qa_db, chat_client, model_id, revision='', quantization='', shards_dir=None, single_db=None, top_k=8, over_fetch=32, max_context_chars=None, policy=None, chain='private', fidelity=None, burn_existing=False, retrieval_keywords=None, translator=None, progress=None, extra_body=None)[source]#

Answer question using the corpus. Cache to qa_db. Returns a result dict.

fidelity controls lookup tolerance — see FIDELITY_MODES in arborist.qa.keys. "strict" checks only the cache_key matching this call’s policy["question_dedup"]. "equivalence_class" (default) tries the primary cache_key first, then the alternate dedup-mode cache_key as a fallback so a fast-cache agent can reuse a record written under either mode. Result includes lookup_path naming which key matched (or "miss" when the LLM ran).

burn_existing=True deletes the matching live providence_cache row (under the primary dedup-mode cache_key) BEFORE the cache lookup, forcing a fresh inference. Each burn writes a providence_burn audit event. Test-ergonomic: run make query Q=… BURN=1 after tweaking a knob to see the new behavior without finding cache_keys by hand. Result includes burned_existing reporting how many rows were deleted (0 or 1 for the primary key; the equivalence- class fallback key is left alone so prior alt-mode records stay historic).

retrieval_keywords augments the FTS5 search and title-filter token set with operator-supplied keywords WITHOUT changing what the LLM sees as its question or what the verifier checks. Empirically observed 2026-05-01: long discursive questions like ‘what technology is currently or soon available which may enable one person to reconstruct another person’s thoughts…’ under- retrieve because their content tokens get diluted by template phrasing. Appending domain keywords (‘transcranial knowledge acquisition’) narrows OR-mode FTS5 to the topical article (Neurotechnology) and lifts the verdict from HYBRID to STRICT.

Keywords do NOT enter question_hash directly, but they DO change which sources get chosen — and that re-routes the context_root and conversation_hash components of cache_key. Two calls with the same question and different keywords therefore land under different cache_keys (different contexts, different cached records — correctly so). Pair with burn_existing=True to force fresh inference when iterating on keyword sets.

Parameters:
  • question (str)

  • qa_db (Path)

  • chat_client (ChatClient)

  • model_id (str)

  • revision (str)

  • quantization (str)

  • shards_dir (Path | None)

  • single_db (Path | None)

  • top_k (int)

  • over_fetch (int)

  • max_context_chars (int | None)

  • policy (dict | None)

  • chain (str)

  • fidelity (str | None)

  • burn_existing (bool)

  • retrieval_keywords (str | None)

  • translator (object | None)

  • progress (Progress | None)

  • extra_body (dict | None)

Return type:

dict

verify#

Post-LLM faithfulness check: did the answer ground its claims in context?

Three layered strategies, tried in order. The first one that finds evidence classifies the answer. verifier_method on the result records which path fired so the audit chain stays diagnostic.

  1. quote — model wrapped claims in double quotes per system prompt. Strongest signal — explicit, verbatim, model-asserted.

  2. span — no quotes, but bullet/sentence-level lines from the answer appear verbatim in context. Catches models that quote inline without "..." marks.

  3. entity — no quotes and no span match, but multi-word proper-noun phrases from the answer appear verbatim in context. Catches the Wikipedia-infobox-to-prose case: the model paraphrases structure so spans diverge, but every named entity is intact and grounded.

Each strategy classifies into v9.8’s audit-mode trichotomy (RAG-adapted vocabulary; substrate calls UNGROUNDED “VISUAL”):

  • STRICT — every evidence unit (>=1) verifies verbatim against context

  • HYBRID — some verify, others do not (mixed source / emergent)

  • UNGROUNDED — no evidence, or none verify (purely emergent)

unverified_quotes (kept under that name for schema continuity) collects spans the model produced that don’t appear in any source — the corpus-growth signal mined by arborist emergent.

Hard rule (CLAUDE.md “soft hash vs hard hash”): every check is a lexical substring test under norm-v1 + lowercase canonicalization. No embeddings, no semantic similarity, no fuzzy alignment. The contract is “this token sequence either is or isn’t in the context.”

Wikitext context is run through arborist.wikitext.to_base before the substring test. The corpus stores raw wikitext (so the link graph is recoverable from any page), but the LLM produces clean prose. Without the strip, every wikilink-carrying source paragraph compares as “different surface form” and the verifier wrongly reports UNGROUNDED on genuine source-grounded quotes. With the strip, paraphrases of markup ([[Cloud]] vs Cloud) verify, while paraphrases of prose still flag honestly. mwparserfromhell is an optional dep; if absent, the strip is a no-op and verification falls back to today’s behavior.

arborist.qa.verify.extract_quotes(answer_text)[source]#

Pull double-quoted spans of length >= MIN_QUOTE_CHARS from answer_text.

Sequential pairing: locate every double-quote character, then pair them as (1st, 2nd), (3rd, 4th), …. Each pair brackets one quoted span; text between consecutive pairs is the model’s own framing prose (not captured). This is the correct model for adjacent quote pairs like “title” prose “quote” — naive regex matching paired the close of “title” with the open of “quote” and captured prose as a phantom quote, dragging classifications down to HYBRID incorrectly.

Parameters:

answer_text (str)

Return type:

list[str]

arborist.qa.verify.extract_claim_spans(answer_text)[source]#

Strip bullet markers, split into sentences, drop framing prefixes.

Returns each non-empty span of length >= MIN_SPAN_CHARS. These are the “claim units” the model wrote — each one we’ll substring-test against context.

Parameters:

answer_text (str)

Return type:

list[str]

arborist.qa.verify.extract_proper_nouns(answer_text)[source]#

Pull multi-word capitalized phrases. Deduplicated, order preserved.

Multi-word only — single capitalized words at sentence start are too noisy (“Based”, “Now”, “However”). Multi-word phrases like “Keanu Reeves” or “Thomas A. Anderson” are reliable proper-noun candidates and substring-test cleanly against source prose or structured wikitext.

Parameters:

answer_text (str)

Return type:

list[str]

arborist.qa.verify.verify_quotes(answer_text, context, *, entity_policy='proximity', proximity_n=3, proximity_window=300)[source]#

Classify an answer’s grounding against its retrieved context.

Tries quote → span → entity verification in sequence. The first strategy that finds evidence classifies the answer; later strategies don’t run.

entity_policy controls how the entity path classifies — see ENTITY_POLICIES. The quote and span paths are unaffected; they are explicit-claim evidence and always classify per the trichotomy.

Returns a dict with these keys:

n_quotes:          int   # evidence units extracted (any path)
n_verified:        int   # of those, how many appear verbatim
audit_mode:        str   # STRICT | HYBRID | UNGROUNDED
unverified_quotes: [str] # spans we couldn't ground in context
verifier_method:   str   # 'quote' | 'span' | 'entity' | 'none'
Parameters:
  • answer_text (str)

  • context (str)

  • entity_policy (str)

  • proximity_n (int)

  • proximity_window (int)

Return type:

dict

arborist.qa.verify.verify_claim_lattice(answer_text, evidence_map, *, allowed_source_roles=('primary_answer_source', 'secondary_context_source', 'background_source', 'unclassified'), max_pointers_per_claim=2, min_citation_coverage=0.3, min_claim_content_tokens=2, lazy_anchor_demote_threshold=0.5, lazy_anchor_demote_min_pairs=3, max_claims_per_answer=12, subject_tokens_absent_threshold=3, question=None, warrant_check_enabled=True, deflection_check_enabled=True, format_collapse_check_enabled=True, warrant_chain_roots=frozenset({}))[source]#

Deterministic verifier for answer_mode="claim_lattice_pointer".

The model wrote pointer-line prose (Claim text. [E12]); the parser pulled (claim_text, [pointer_ids]) pairs from each non-empty line. This verifier maps each pointer id back to its content-addressed evidence object and runs six hard checks:

  1. Parser succeeded — parse_status == "PARSED" (line had a bracket tag). NO_EVIDENCE_POINTER claims (prose without tag) count toward the denominator and downgrade the verdict.

  2. Pointer id resolves to an entry in the runtime-built evidence map. No model-invented ids.

  3. Resolved entry’s source_role is in allowed_source_roles.

  4. Claim text non-empty after tag strip.

  5. Claim’s content tokens textually overlap the cited evidence span at coverage ≥ min_citation_coverage (per-pair, lexical only — see _claim_textually_overlaps_evidence). Catches the magnet-chunk lazy-anchor where the model cites an evidence pointer whose text contains few claim-content tokens.

  6. Pointer count per claim does not exceed max_pointers_per_claim (default 2 — matches the prompt’s “1 or 2 pointers per claim” rule). When exceeded, the claim is TRIMMED to the first N pointers and verification proceeds normally; a POINTER_OVERFLOW_TRIMMED violation is recorded so STRICT is no longer reachable (audit_mode caps at HYBRID for the run). Trim-and-verify (vs hard fail) protects correct claims that were over-cited (e.g. “Leonardo painted the Mona Lisa. [E2,…,E14]”) while keeping the over-citation pattern surfaced. The dropped pointers count toward n_quotes so the denominator reflects what the model emitted.

Removed 2026-04-30: the strict no-double-quote rule. The model routinely paraphrases source prose but copies named-quoted phrases verbatim (e.g. "Constitution State" from a Connecticut span). Hard-rejecting claims that contained any " char was rejecting factually correct, source-grounded claims for cosmetic punctuation. The coverage threshold (Rule 5) and pointer cap (Rule 6) carry the weight of catching synthetic-quote / mega-claim failures the old rule was meant to catch. _has_manual_quote is still defined and used by verify_claim_lattice_json.

Returns a verdict in the same shape as verify_quotes + extras:

n_quotes              total claim-pointer pairs (denominator)
n_verified            pairs where pointer resolved AND
                      source_role allowed AND coverage met
                      AND claim text non-empty
audit_mode            STRICT / HYBRID / UNGROUNDED
unverified_quotes     claim texts that didn't reach
                      EVIDENCE_LINKED -- kept under that name
                      for schema continuity with verify_quotes
verifier_method       "claim_lattice"
claim_statuses        per-claim {text, evidence_ids,
                      pointer_ids, status, reasons[]}; status in
                      {EVIDENCE_LINKED, EVIDENCE_LINKED_PARTIAL,
                      UNKNOWN_EVIDENCE_ID, SOURCE_ROLE_BLOCKED,
                      CITATION_MISMATCH, NO_EVIDENCE_POINTER,
                      SCHEMA_INVALID}
violations            structured violation records for the
                      run-DAG / sidecar
rendered_text         human-readable prose with literal spans
                      interpolated; what the runner persists
                      as answer_text
evidence_id_pairs     per-claim list of resolved
                      content-addressed evidence_ids
                      (run-stable form). Used to thread the
                      parsed lattice into the run-DAG.
Parameters:
  • answer_text (str)

  • allowed_source_roles (tuple[str, ...])

  • max_pointers_per_claim (int)

  • min_citation_coverage (float)

  • min_claim_content_tokens (int)

  • lazy_anchor_demote_threshold (float)

  • lazy_anchor_demote_min_pairs (int)

  • max_claims_per_answer (int)

  • subject_tokens_absent_threshold (int)

  • question (str | None)

  • warrant_check_enabled (bool)

  • deflection_check_enabled (bool)

  • format_collapse_check_enabled (bool)

  • warrant_chain_roots (frozenset[str])

Return type:

dict

arborist.qa.verify.evidence_map_by_evidence_id_local(evidence_map, eid)[source]#

Local helper — returns the EvidenceObject whose evidence_id matches eid, or None. Avoids the import-cycle risk of pulling evidence_map_by_evidence_id into this module’s hot path; the O(N) walk is fine since evidence maps are <30 entries.

Parameters:

eid (str)

arborist.qa.verify.claim_lattice_structured_output_extras(schema=None, *, name='claim_lattice')[source]#

Multi-engine extra_body for JSON-schema enforcement on chat completions. Each inference engine recognises its own key and silently drops the others, so sending all three lets the same call site work across vLLM, llama.cpp, and OpenAI-spec endpoints without per-endpoint branching:

  • guided_json — vLLM grammar-constrained sampling

  • json_schema — llama.cpp native shorthand

  • response_format— OpenAI-spec {type: json_schema, …}

    (honoured by llama.cpp and newer vLLM)

Returns a dict you splat into client.chat_completion(extra_body=…). Defaults to the claim- lattice schema; pass an alternate schema to reuse the helper for other structured-output features. The name is required by OpenAI-spec response_format and is the user-visible label for the schema in some engines’ error messages.

Added 2026-05-19 to enable the Arborist arm to run with Qwen on llama.cpp (the old single-key guided_json was silently dropped on llama.cpp, leaving Qwen un-enforced and the parse-tolerant fallback doing all the work). Hermes/vLLM path is unchanged — it still picks up guided_json and ignores the other two.

Parameters:
Return type:

dict

arborist.qa.verify.verify_claim_lattice_json(answer_json_text, evidence_map, *, allowed_source_roles=('primary_answer_source', 'secondary_context_source', 'background_source', 'unclassified'), max_evidence_per_claim=2, min_citation_coverage=0.3, max_claims_per_answer=12, subject_tokens_absent_threshold=3, question=None, warrant_check_enabled=True, deflection_check_enabled=True, warrant_chain_roots=frozenset({}))[source]#

Deterministic verifier for answer_mode="claim_lattice" (JSON).

Parses the model’s JSON output (lenient pre-parser handles markdown fences / preamble / curly quotes / trailing commas), validates the schema, then runs the same hard checks as verify_claim_lattice but reading evidence_ids from the JSON claim objects.

2026-04-30: switched from content-addressed evidence_ids (Eed1b6e396) to pointer_ids (E1, E2, …) in the prompt & JSON output. Hermes-3-8B was fabricating plausible content- addressed IDs (E1b6e396-style near-misses) on cross-document relationship questions; the verifier correctly rejected them as UNKNOWN_EVIDENCE_ID but the answer text was often factually correct, leaving us with honest UNGROUNDED on right answers. Pointer IDs are short, enumerable, and fabrication-obvious. The runtime still resolves each pointer_id to its content-addressed evidence_id internally and stores that in evidence_id_pairs (cache/run-DAG continuity); only the prompt-facing surface changes.

  1. JSON parses (lenient). Failure → SCHEMA_INVALID, UNGROUNDED.

  2. Top-level is {"claims": [...]}.

  3. Each claim is {"text": str, "evidence_ids": [str, ...]}.

  4. Each evidence_id resolves in the runtime-built evidence map (no model-invented IDs).

  5. Resolved entry’s source_role is in allowed_source_roles.

  6. Claim text contains no double-quote characters anywhere.

  7. Claim text non-empty.

  8. Claim’s content tokens textually overlap the cited evidence span.

  9. len(evidence_ids) <= max_evidence_per_claim.

Returns a verdict in the same shape as verify_claim_lattice plus a json_fixups field naming any drift the lenient parser had to peel ("fence" / "prose_trim" / "curly_quotes" / "trailing_comma"). Empty list = strict JSON parse on first try.

Parameters:
  • answer_json_text (str)

  • allowed_source_roles (tuple[str, ...])

  • max_evidence_per_claim (int)

  • min_citation_coverage (float)

  • max_claims_per_answer (int)

  • subject_tokens_absent_threshold (int)

  • question (str | None)

  • warrant_check_enabled (bool)

  • deflection_check_enabled (bool)

  • warrant_chain_roots (frozenset[str])

Return type:

dict

evidence#

Evidence map for claim-lattice-pointer (quote-by-pointer) answer mode.

Builds a deterministic table of evidence objects from already-retrieved chunks. The model references each object by a short pointer_id (“E1”, “E2”, …) which the runtime maps back to a content-addressed evidence_id for the cache, run-DAG, and audit chain.

Design pinned by G0 ticket (2026-04-29) and the CTI / Clause Lattice Intelligence reframe:

Models should not generate verbatim quotes. They should point to evidence IDs extracted by deterministic code.

This kills the synthetic-elision class by construction — the model never types the quote string, so it can’t drop characters from one.

Two-layer id scheme:

  • pointer_id short numeric tag the model sees in the prompt and

    writes back in pointer-line answers. E + the 1-based position of the evidence object in the map (E1, E2, …, E37). One BPE token per id in standard tokenizers. Stays in the model’s in-distribution citation style.

  • evidence_id content-addressed handle. E + first 8 hex of

    sha256(chunk_root:offset_start:offset_end). Same chunk + same offsets in two runs → same id, forever. The cache_key, run-DAG, and audit chain all use this form so provenance is run-stable.

The verifier (verify_claim_lattice) maps each pointer_id the model writes back to its content-addressed evidence_id before persistence. The model’s literal output is run-dependent (run #1’s E1 and run #2’s E1 likely point at different chunks); the content-addressed layer is what stays stable.

For the first cut every chunk produces exactly one evidence object covering offset_start=0 .. offset_end=len(span). Sub-chunk extraction (paragraph-level, sentence-level) is a future refinement; the schema already accepts arbitrary offsets so adding finer-grained splits later doesn’t break the contract.

class arborist.qa.evidence.EvidenceObject(pointer_id, evidence_id, source_root, document_uri, title, chunk_idx, chunk_root, offset_start, offset_end, source_role, text_hash, span)[source]#

One pinned span the model may reference by pointer_id.

All fields are deterministic from inputs; same chunk + offsets yields the same object byte-for-byte.

  • pointer_id prompt-facing id (“E1”, “E2”, …) — what the

    model sees & writes back. Position-derived; run-dependent on purpose.

  • evidence_id content-addressed handle (“E” + 8 hex of

    sha256(chunk_root:start:end)) — what the cache, run-DAG, and audit chain use. Run-stable.

  • source_root document_root the chunk belongs to

  • document_uri human-readable URI (for the renderer)

  • title doc title (for the renderer & prompt)

  • chunk_idx chunk index within the document

  • chunk_root leaf hash of the chunk

  • offset_start byte offset within the chunk (0 for whole-chunk)

  • offset_end end offset (exclusive)

  • source_role role classification (primary_answer_source / …)

  • text_hash sha256 of the span (for tamper detection)

  • span the literal text (what the renderer interpolates)

Parameters:
  • pointer_id (str)

  • evidence_id (str)

  • source_root (str)

  • document_uri (str)

  • title (str | None)

  • chunk_idx (int)

  • chunk_root (str)

  • offset_start (int)

  • offset_end (int)

  • source_role (str)

  • text_hash (str)

  • span (str)

pointer_id: str#
evidence_id: str#
source_root: str#
document_uri: str#
title: str | None#
chunk_idx: int#
chunk_root: str#
offset_start: int#
offset_end: int#
source_role: str#
text_hash: str#
span: str#
to_dict()[source]#
Return type:

dict

arborist.qa.evidence.build_evidence_map(chunks)[source]#

Build the evidence table from retrieved chunks.

chunks is a list of dicts with keys:

source_root, document_uri, title (optional), chunk_idx, chunk_root, span, source_role (optional, default ‘unclassified’)

Returns a list of EvidenceObject``s, one per chunk, in input order. The 1-based position drives ``pointer_id (E1, E2, …); the chunk’s content drives evidence_id (sha256-derived). For the first cut each chunk = one whole-span evidence object (offset 0 .. len(span)). Sub-chunk splitting is a future refinement.

Parameters:

chunks (list[dict])

Return type:

list[EvidenceObject]

arborist.qa.evidence.evidence_map_root(evidence)[source]#

Merkle root over the sorted evidence_id leaves.

Sorting makes the root order-independent — two retrieval runs that return the same chunks in different orders produce the same root. Use as the evidence_map stage hash in the run-DAG.

Parameters:

evidence (list[EvidenceObject])

Return type:

str

arborist.qa.evidence.render_evidence_block(e)[source]#

Format one evidence object for the LLM prompt.

Header carries the prompt-facing pointer_id, title (or URI tail), and source role so the model has everything it needs to cite without typing the span:

=== E1 (Jurassic_Park_(film) | primary_answer_source) === <literal span text>

The runtime maps E1 back to the content-addressed evidence_id before persistence; the model never sees the hex form.

Parameters:

e (EvidenceObject)

Return type:

str

arborist.qa.evidence.render_evidence_map(evidence)[source]#

Concatenated evidence blocks, ready to drop into the prompt.

Parameters:

evidence (list[EvidenceObject])

Return type:

str

arborist.qa.evidence.render_evidence_block_for_json(e)[source]#

Format one evidence object for the JSON-mode LLM prompt.

2026-04-30: header uses the prompt-facing pointer_id (E1, E2, …) — same as claim_lattice_pointer mode — instead of the content-addressed evidence_id (long hex). The change closes a real failure mode: small models (Hermes-3-8B observed) were fabricating plausible-looking content-addressed IDs (e.g. E1b6e396 when the runtime had Eed1b6e396) → UNKNOWN_EVIDENCE_ID → UNGROUNDED, even when the answer text was correct. Pointer IDs (E1 - E10) are short, enumerable, and fabrication-obvious — the model can’t invent E27 if only E1 - E10 were shown.

The runtime still stores content-addressed evidence_id in the cache & run-DAG (resolved on-the-fly in verify_claim_lattice_json); only the prompt-facing string changes:

=== E1 (Jurassic_Park_(film) | primary_answer_source) ===
<literal span text>
Parameters:

e (EvidenceObject)

Return type:

str

arborist.qa.evidence.render_evidence_map_for_json(evidence)[source]#

Concatenated evidence blocks for JSON mode.

Parameters:

evidence (list[EvidenceObject])

Return type:

str

arborist.qa.evidence.evidence_map_by_pointer_id(evidence)[source]#

Index by prompt-facing pointer_id (E1, E2, …).

Parameters:

evidence (list[EvidenceObject])

Return type:

dict[str, EvidenceObject]

arborist.qa.evidence.evidence_map_by_evidence_id(evidence)[source]#

Index by content-addressed evidence_id (E1f8e4c2a, …).

Parameters:

evidence (list[EvidenceObject])

Return type:

dict[str, EvidenceObject]

arborist.qa.evidence.render_claim_lattice(claims, by_id, *, window=200)[source]#

Convert structured claims to human-readable prose with literal spans.

Each claim becomes one bullet line followed by inlined evidence excerpts. The model’s claim text is rendered verbatim; each cited pointer_id is followed by a spotlight excerpt of the literal source span — a window of window chars centered on the first content token from the claim that appears in the span. Falls back to the leading window when no claim token matches.

Why spotlight over leading-N truncation: when the cited evidence is a whole article and the model lazy-anchors every claim at the same pointer, the leading-N strategy displayed the same article-intro sentence under every claim. The spotlight finds the part of the span the claim is about — “Brachiosaurus appears in the film” + a 15 KB article span gets a window centered on the first “brachiosaurus” mention, not the production-history opener. Same cited evidence id, but the displayed text actually supports different claims differently.

by_id is the pointer_id → EvidenceObject index — what evidence_map_by_pointer_id returns. Determinism: same (claims, by_id, window) → same prose, byte-for-byte. Unknown ids render as [<id>: ?] so violations are visible at a glance.

Parameters:
Return type:

str

quantifier#

Pure quantifier preflight classifier — Ticket #000008 Phase 1.

Maps a question string onto the ten-rung intensity ladder defined in docs/tickets/ticket-000008-broad-quantifier-preflight-guard.md §2. The classifier exists to estimate expected number of claims in the answer and format-discipline risk on small models. It is not formal-semantics quantifier theory; the operational axis is what matters.

Pure function. No I/O. No model call. No retrieval call. Folds into governance_policy_hash via classifier_version (added to arborist.qa.keys._VERIFIER_POLICY_FIELDS in Phase 2).

Intensity rungs (highest wins for multi-quantifier questions):

1.  ABSENT             universal-negation, single-claim shape
2.  SINGULAR           one-fact wh / definite reference
3.  PROPORTIONAL       descriptive fraction (`most`, `half`)
4.  SMALL_NUM_EXPLICIT bounded by digit/word (`top 3`, `seven X`)
5.  COMPARATIVE_BOUND  bounded by inequality (`at least 5`)
6.  FEW                small set, vague (`some`, `a few`)
7.  MANY               medium set, vague (`many`, `numerous`)
8.  ALL                universal quantifier (`all`, `every`)
9.  COMPREHENSIVE      exhaustive request (`complete list of`,
                       `tell me everything`)
10. OPEN_REQUEST       verb-driven enumeration (`tell me about`,
                       `describe`, `explain`)

Returns a dict with:

intensity              one of the ten rungs (or "SINGULAR" by default)
matched_token          the lexical surface form that triggered the rung
explicit_count         int when SMALL_NUM_EXPLICIT or COMPARATIVE_BOUND;
                       None otherwise
is_broad               True for ALL / COMPREHENSIVE / OPEN_REQUEST
operational_shape      mnemonic for downstream policy (e.g.
                       "universal_enumeration", "exhaustive_request")
scope_bound_hint       "bounded" | "unbounded" | "unknown"
                       (see ticket §10.1 -- bounded != unbounded
                       universals; classifier defaults to "unknown"
                       when intensity is broad and no domain anchor
                       is present)

Highest-intensity-wins arbitration: when a question contains overlapping markers (e.g. “tell me about all the planets”), pick the rung farther from SINGULAR. The order in _RUNG_PRIORITY codifies this — later rungs win.

arborist.qa.quantifier.classify_question_quantifier(question)[source]#

Classify question onto the ten-rung intensity ladder.

Highest-intensity-wins arbitration: when multiple rungs match, pick the one farther from SINGULAR. Operationally that means “tell me about all the planets” classifies OPEN_REQUEST (later in the priority order than ALL), even though ALL also matched. The downstream cap is the broader rung’s cap, which is what we want under enumeration pressure.

Returns a dict; see module docstring for fields.

Empty / whitespace-only questions classify SINGULAR (no quantifier pressure) with no matched_token.

Parameters:

question (str)

Return type:

dict

metacognition#

Meta-Cognition Preflight Guard — Ticket #000010 Phase 1.

Runtime epistemic control layer that classifies a question’s shape BEFORE generation, so the model never answers from the surface form alone when the question is ill-posed (false-premise, contradictory, under-specified, broad-quantifier, time-sensitive, out-of-corpus, reference-frame ambiguous).

Pure and deterministic. No I/O, no model call, no retrieval call. Reuses arborist.qa.quantifier.classify_question_quantifier for the broad-quantifier rung; adds four new lightweight detectors:

  • temporal sensitivity (current/latest/today/CEO/etc.)

  • contradiction (lexical) (unmarried+spouse, always+sometimes-not)

  • false-premise (lite) (presupposition patterns)

  • out-of-corpus (my-uploaded-X / my-file shapes)

Reference-frame detection lives in arborist.qa.query._detect_frame (ticket #000002) and is called from the surrounding runtime, not from this module — keeps detection pure-on-question (no corpus lookup needed here).

The output is a QuestionState dataclass that becomes a CTI root node. First pass surfaces it on the query() result dict only; run-DAG node binding deferred to the same Phase 5 work tracked in ticket #000009 (both nodes can land together).

Hard rule (D1): No LLM in this hard path. Model-assisted preflight, if added later, labels itself SOFT_PREFLIGHT_HINT and never produces a PREFLIGHT_OK / PREFLIGHT_BLOCKED without deterministic support.

class arborist.qa.metacognition.QuestionState(raw_question, question_hash, logical_statuses, question_shape, quantifier_intensity, quantifier_matched_token, scope_bound_hint, reference_frames, temporal_sensitivity, temporal_matched_tokens, contradiction_pairs, false_premise_hints, corpus_requirement, known_boundaries, answer_constraints, preflight_result, preflight_policy_hash, classifier_version='metacognition-v0.1')[source]#

Runtime epistemic state for one question.

All fields are deterministic from the question + per-call model_profile / corpus_profile / policy inputs. No randomness, no LLM. Hashable via preflight_policy_hash so the run-DAG (Phase 5) can bind the decision into the audit chain.

Parameters:
  • raw_question (str)

  • question_hash (str)

  • logical_statuses (tuple[Literal['well_formed', 'under_specified', 'false_premise_suspected', 'contradictory_question', 'out_of_corpus_risk', 'stale_risk', 'reference_frame_ambiguous', 'broad_quantifier_unbounded'], ...])

  • question_shape (str)

  • quantifier_intensity (str | None)

  • quantifier_matched_token (str | None)

  • scope_bound_hint (str)

  • reference_frames (tuple[str, ...])

  • temporal_sensitivity (Literal['high', 'medium', 'low'])

  • temporal_matched_tokens (tuple[str, ...])

  • contradiction_pairs (tuple[tuple[str, str], ...])

  • false_premise_hints (tuple[dict, ...])

  • corpus_requirement (str)

  • known_boundaries (tuple[str, ...])

  • answer_constraints (dict)

  • preflight_result (Literal['PREFLIGHT_OK', 'PREFLIGHT_PARTIAL', 'PREFLIGHT_BLOCKED'])

  • preflight_policy_hash (str)

  • classifier_version (str)

raw_question: str#
question_hash: str#
logical_statuses: tuple[Literal['well_formed', 'under_specified', 'false_premise_suspected', 'contradictory_question', 'out_of_corpus_risk', 'stale_risk', 'reference_frame_ambiguous', 'broad_quantifier_unbounded'], ...]#
question_shape: str#
quantifier_intensity: str | None#
quantifier_matched_token: str | None#
scope_bound_hint: str#
reference_frames: tuple[str, ...]#
temporal_sensitivity: Literal['high', 'medium', 'low']#
temporal_matched_tokens: tuple[str, ...]#
contradiction_pairs: tuple[tuple[str, str], ...]#
false_premise_hints: tuple[dict, ...]#
corpus_requirement: str#
known_boundaries: tuple[str, ...]#
answer_constraints: dict#
preflight_result: Literal['PREFLIGHT_OK', 'PREFLIGHT_PARTIAL', 'PREFLIGHT_BLOCKED']#
preflight_policy_hash: str#
classifier_version: str = 'metacognition-v0.1'#
to_dict()[source]#

Convert to JSON-serializable dict for run-DAG / bench.

Return type:

dict

arborist.qa.metacognition.detect_temporal_sensitivity(question)[source]#

Return (sensitivity, matched_tokens).

high = explicit temporal anchor (current, latest, etc.) OR rapid-turnover role pattern. medium reserved for future weekly/monthly cadence detection (not implemented in this pass). low = no temporal markers detected (the default).

Parameters:

question (str)

Return type:

tuple[Literal[‘high’, ‘medium’, ‘low’], tuple[str, …]]

arborist.qa.metacognition.detect_contradiction(question)[source]#

Return tuple of (token_a, token_b) pairs whose BOTH members appear in question (case-insensitive whole-word match).

Returns empty tuple when no contradiction detected. The caller decides whether to label-only or block — by default this surfaces in the audit-line tail, NOT a hard block, since false positives on contradiction would refuse legitimate questions.

Parameters:

question (str)

Return type:

tuple[tuple[str, str], …]

arborist.qa.metacognition.detect_false_premise(question)[source]#

Return tuple of presupposition dicts surfacing the implied relation. Each dict carries:

kind             -- pattern label (stopped_doing, caused, ...)
presupposition   -- natural-language statement of the
                    presupposition
subject          -- extracted subject token-span
predicate        -- extracted predicate token-span

First-pass detection only. The verifier uses these as soft hints; downstream the audit-line tail surfaces “false premise suspected” so the operator can read the audit log and check whether the cited evidence supports the presupposition.

Returns empty tuple when no pattern fires.

Parameters:

question (str)

Return type:

tuple[dict, …]

arborist.qa.metacognition.detect_out_of_corpus(question)[source]#

Return True iff the question references a private / uploaded document that the encyclopedic corpus cannot have.

Parameters:

question (str)

Return type:

bool

arborist.qa.metacognition.preflight_question(question, *, model_profile_id=None, corpus_profile=None, reference_frames=(), policy=None)[source]#

Classify question deterministically into a QuestionState.

Pure function. Reuses the Phase 1 quantifier classifier (#000008) plus four new lightweight detectors (temporal, contradiction, false-premise-lite, out-of-corpus).

corpus_profile is an optional dict carrying corpus boundary metadata (e.g. {"corpus_latest_timestamp": "2003-05-16"}); when present, the temporal detector cross-checks against it. First-pass implementation just records corpus_requirement based on the temporal sensitivity — full cutoff arithmetic deferred to a future amend.

reference_frames is passed in by the caller because frame detection requires retrieved sources (lives in arborist.qa.query._detect_frame). Empty tuple is the default for “no frame routing happened”.

policy overrides for the per-detector enables. Defaults are permissive (all checks on) per ticket #000010 §7.3.

Parameters:
  • question (str)

  • model_profile_id (str | None)

  • corpus_profile (dict | None)

  • reference_frames (tuple[str, ...])

  • policy (dict | None)

Return type:

QuestionState

dag#

Per-run Merkle-DAG provenance for providence records.

Each query/ask call passes through several stages:

question → retrieval → context → prompt → answer → verify → final_label

Each stage emits a hash; the run’s identity is the Merkle root over the ordered sequence of stage hashes. Stored on the providence record as run_dag_root (alongside cache_key). The DAG is verifiable: given the persisted node list & the same Merkle conventions arborist uses elsewhere (non-commutative HashCombine, prefix 0x03, leaf prefix 0x00, self-duplicate odd rule), an auditor can recompute the root from the nodes & confirm the run was constructed as recorded.

Distinct from the linear audit_events chain — that chain tracks state-changing operations across the DB. This DAG tracks the computation provenance of one specific answer. Both coexist; the record’s audit_event_hash links to the chain, run_dag_root & run_dag_blob carry the per-run computation graph.

Stages chosen to mirror the toy-Hermes design (fox 2026-04-30):

question      hash of question_hash (8-dim cache_key dim)
retrieval     hash of sources summary (document_roots + roles +
              scores) -- captures which docs ranked & how
context       context_root (Merkle root over sorted source roots,
              the "source" dim of the cache_key)
prompt        conversation_hash (the assembled messages)
answer        sha256(answer_text)
verify        hash of verdict summary (audit_mode, verifier_method,
              n_quotes, n_verified, claim_statuses)
final_label   hash of (audit_mode, verifier_method, lookup_path)

The DAG is NOT part of cache_key. cache_key inputs (the 8 dims) determine the answer; the answer determines the DAG. Folding the DAG back into cache_key would create a circular dependency.

arborist.qa.dag.localize_failure(*, audit_mode, n_sources, n_quotes, n_verified)[source]#

Map a non-STRICT verdict to the pipeline stage that introduced the failure. Returns None for STRICT outcomes.

Stage labels (in pipeline order):

  • retrieval — no admitted sources. Title/body gates rejected everything, or the corpus genuinely lacks the topic. Repair path: ingest more sources or relax the breadth threshold.

  • context — sources admitted but no quotes extracted. Could be a context-truncation issue (per-source cap dropped the relevant paragraph) or a model that declined to cite anything. Repair path: raise per-source cap; tighten prompt.

  • answer — sources retrieved & quotes extracted but they don’t verify. The model either fabricated content, paraphrased inside quotes, or appended citation tails. Repair path: the mechanical_repair pass + (when wired) the re-prompt feedback loop.

The toy-Hermes design pass calls this “chain-segment failure localization” — debugging becomes typed instead of vague. An operator reading failure_stage='answer' knows retrieval & context were fine; the model is what to fix. failure_stage='retrieval' means stop tuning the verifier & go ingest a relevant source.

Parameters:
  • audit_mode (str)

  • n_sources (int)

  • n_quotes (int)

  • n_verified (int)

Return type:

str | None

arborist.qa.dag.build_preflight_node_payload(*, question_state=None, quantifier=None, answer_contract=None, prompt_contract=None, evidence_contract=None, policy_refs=None)[source]#

Build the canonical nested-clause payload for the preflight DAG stage. Returns a JSON-ready dict; pair with preflight_node_hash() to get the SHA-256 hex.

Five-clause structure per ticket #000009 §8.2 / feedback §3:

  • classifier — quantifier classifier output (#000008): intensity, matched_token, explicit_count, scope_bound_hint, is_broad, classifier_version, operational_shape.

  • answer_contract — guard / cap / reject decisions taken on this run.

  • prompt_contract — reminder enabled / injected / template_id (#000008 §10.5).

  • evidence_contract — exposure budget, one-claim-per-line discipline (#000010 §10.4).

  • policy_refs — governance_policy_hash + model_profile_hash + answer_mode. Reference-by-hash rather than raw policy bundles (feedback §4: avoid double-committing already-hashed state).

Plus the metacog question_state from #000010 — that’s its own clause for now (logical_statuses, false_premise_hints, contradiction_pairs). It’s hashed separately by metacognition.preflight_policy_hash already.

Any clause may be None / empty — the resulting payload is still stable. Includes node_version so legacy runs without the node can be unambiguously labeled unavailable_legacy_run by audit tools.

Parameters:
  • question_state (dict | None)

  • quantifier (dict | None)

  • answer_contract (dict | None)

  • prompt_contract (dict | None)

  • evidence_contract (dict | None)

  • policy_refs (dict | None)

Return type:

dict

arborist.qa.dag.preflight_node_hash(*, question_state=None, quantifier=None, answer_contract=None, prompt_contract=None, evidence_contract=None, policy_refs=None)[source]#

Hash the preflight decision into a stable SHA-256 hex string.

Returns the hash of the nested-clause payload built by build_preflight_node_payload(). See that function for the five-clause structure.

Audit-replay payoff: two cache rows that share the same question + same model output + same verifier verdict but different preflight policy state produce different hashes here, which propagate to run_dag_root via build_run_dag().

Backward compatibility note: Pre-2026-05-04 (c36e85c) callers used a flat 3-key payload (question_state / quantifier / policy_state). Hashes computed with that callsite will NOT match this restructured callsite — run_dag_root values for rows written between c36e85c and the current commit are treated as a discrete generation; they’re still verifiable by re-reading run_dag_blob (the persisted blob captures the payload that was actually hashed).

Parameters:
  • question_state (dict | None)

  • quantifier (dict | None)

  • answer_contract (dict | None)

  • prompt_contract (dict | None)

  • evidence_contract (dict | None)

  • policy_refs (dict | None)

Return type:

str

arborist.qa.dag.build_run_dag(*, question_hash, sources, context_root, conversation_hash, answer_text, audit_mode, verifier_method, n_quotes, n_verified, claim_statuses=None, lookup_path=None, evidence_map_root=None, answer_mode=None, violations=None, raw_answer_text=None, parsed_lattice=None, rendered_text=None, retrieval_plan_hash=None, preflight_hash=None, preflight_payload=None)[source]#

Return {"root": <hex>, "nodes": [<stage>, <hash>], ...}.

All inputs are already-computed hashes or text; no I/O. Idempotent & deterministic — same inputs always produce the same root, byte-for- byte across machines (as long as the Merkle conventions stay pinned; they do, via arborist.merkle).

Two base DAG shapes; both gain an optional preflight stage when preflight_hash is supplied (Ticket #000009):

  • Quote mode (default). 7 stages — question / retrieval / context / prompt / answer / verify / final_label. Triggered when evidence_map_root is None. Backward-compatible with all run_dag_root values written by code that pre-dates G0. With preflight_hash, becomes 8 stages — question / preflight / retrieval / ....

  • Claim-lattice-pointer mode (G0 / CTI). 9 stages — question / retrieval / evidence_map / prompt / raw_answer / parsed_claim_lattice / verify / render / final_label. Triggered when evidence_map_root is non-None. Splits the single answer node into three: the model’s raw output, the parsed claim-lattice, and the rendered prose with literal spans interpolated. context drops out (the context IS the evidence map). All three of raw_answer_text / parsed_lattice / rendered_text should be supplied; missing args fall back to answer_text for the raw_answer & render hashes and [] for the parsed_lattice hash. With preflight_hash, becomes 10 stages.

answer_mode & violations fold into the verify & final_label payloads when provided. preflight_hash (Ticket #000009) is optional; when None, the DAG shape remains 7/9 stages exactly so pre-#000009 records can be re-validated. When supplied, the preflight stage inserts at position 1 (between question and retrieval) per ticket #000009 §3.1.

Parameters:
  • question_hash (str)

  • sources (list[dict])

  • context_root (str)

  • conversation_hash (str)

  • answer_text (str)

  • audit_mode (str)

  • verifier_method (str)

  • n_quotes (int)

  • n_verified (int)

  • claim_statuses (list[dict] | None)

  • lookup_path (str | None)

  • evidence_map_root (str | None)

  • answer_mode (str | None)

  • violations (list[dict] | None)

  • raw_answer_text (str | None)

  • parsed_lattice (list | None)

  • rendered_text (str | None)

  • retrieval_plan_hash (str | None)

  • preflight_hash (str | None)

  • preflight_payload (dict | None)

Return type:

dict

arborist.qa.dag.build_reject_run_dag(*, question_hash, preflight_hash, rejection_reason, answer_text, audit_mode='UNGROUNDED', verifier_method='claim_lattice_pointer', violations=None, preflight_payload=None)[source]#

3-stage reject-broad run-DAG: question preflight final_label.

Ticket #000009 §8.2 / 2026-05-04 feedback §6.2: preflight rejection currently early-returns from query() before the standard build_run_dag() runs, so reject rows have no auditable Merkle commitment. This builder fills that gap with a minimal DAG shape that captures the rejection without pretending retrieval / prompt / raw_model_output happened.

The returned shape is INTENTIONALLY shorter than the standard 7/9/8/10-stage shapes — audit replay can read the stage list and tell instantly that this row is a preflight rejection: 3 stages always means reject path.

final_label carries the rejection_reason + answer_text hash so two rejections that differ only in their (rendered) rationale string still produce different roots. The rejection_reason is the canonical string from the violation (“preflight rejection — broad-quantifier query with unbounded scope. …”), NOT the operator-facing rendered answer_text — that lets policy template changes invalidate the hash even if the operator-visible text is unchanged.

Parameters:
  • question_hash (str)

  • preflight_hash (str)

  • rejection_reason (str)

  • answer_text (str)

  • audit_mode (str)

  • verifier_method (str)

  • violations (list[dict] | None)

  • preflight_payload (dict | None)

Return type:

dict

arborist.qa.dag.verify_run_dag(blob)[source]#

Recompute the Merkle root from blob and check it matches.

Used by audit tooling. Accepts either a parsed dict or the JSON string we persist in providence_cache.run_dag_blob.

Parameters:

blob (str | dict)

Return type:

bool

client#

Chat-completion clients.

ChatClient is a Protocol — any object with a chat_completion method plugs in. We ship two concrete clients:

  • OpenAICompatibleClient — talks to any OpenAI-compatible /v1/chat/completions endpoint (vllm, llama.cpp server, ollama, TGI, hosted services).

  • StubClient — offline canned responses for tests and dry-runs. No network. Operation Voyeur safe.

class arborist.qa.client.ChatClient(*args, **kwargs)[source]#
chat_completion(messages, *, model, temperature=0.1, max_tokens=512, top_p=1.0, extra_body=None)[source]#

Return the assistant’s text response.

extra_body is forwarded as additional fields in the JSON request payload — used for vLLM-specific knobs like guided_json (constrain output to a JSON Schema at sampling time, eliminating SCHEMA_INVALID failures from prompt drift). Endpoints that don’t recognize the field ignore it; the client passes it through opaque-ly.

Parameters:
Return type:

str

class arborist.qa.client.StubClient(answer='[STUB] dry-run answer; no LLM was called.')[source]#

Offline client for tests / –dry-run.

Pass answer=callable(messages, **kw) -> str for dynamic stubbing.

chat_completion(messages, **kwargs)[source]#
Return type:

str

class arborist.qa.client.OpenAICompatibleClient(base_url, api_key=None, timeout=60.0, max_retries=3, retry_backoff_base_s=0.5)[source]#

OpenAI-compatible chat completion over HTTP.

Default endpoint is configurable via env. Pass api_key only if the target requires it; uncloseai’s free endpoint does not.

Retries on transient upstream failures (HTTP 502/503/504) with exponential backoff. The 2026-04-30 QA-modes bench saw 19 of 66 JSON-mode runs error out with 502 from vLLM — clustered, plausibly correlated with guided_json stressing the grammar engine. Retry smooths over the cluster without changing semantics: a 502 still fails the bench cell if all attempts exhaust, but transient bursts no longer dominate the error column.

Parameters:
  • base_url (str)

  • api_key (str | None)

  • timeout (float)

  • max_retries (int)

  • retry_backoff_base_s (float)

close()[source]#

Release the underlying connection pool.

Return type:

None

chat_completion(messages, *, model, temperature=0.1, max_tokens=512, top_p=1.0, extra_body=None, stop=None)[source]#
Parameters:
Return type:

str


Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

  • TRUTH — First principles, math & science, open source code freely distributed.

  • FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.

  • HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.

  • LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.