Q&A Pipeline: question → answer → verify → cache#
Question answering, caching, verification, evidence mapping.
keys#
The 8-dim Merkle-AGI v9.8 cache_key.
v9.8 invariant: no answer is reused unless all eight match and the record is live (not failed/stale/quarantined):
source_root — content fingerprint of the document
question_hash — SHA-256 of normalized question text
model_profile_hash — model_id + revision + quantization
conversation_hash — full canonical OpenAI messages array
governance_policy_hash — sampling/policy parameters dict
schema_version — arborist DB schema version
canonicalization_version — text normalization rules
chunking_version — chunker name & parameters
Bumping ANY of these eight dimensions yields a distinct cache_key, so prior records cannot be served. This is the runtime drift detection the providence whitepaper compresses into “cache_key = source_root + ‘:’ + question_hash” — that’s a simplification; the rigorous form is all eight dimensions hashed together.
- arborist.qa.keys.canonical_question(question, *, mode='equivalence_class')[source]#
Canonical form of question for the given dedup
mode.Two modes:
"equivalence_class"(default): four-step canonicalization —canonicalize()(NFC + ws-collapse + strip ends), then lowercase, then trailing-punctuation strip, then standalone-article filter (the,a,an). All variants of “Who is THE Batman?” / “who is batman” / “who is X.” collapse to one form. The default for chat-style agents that prefer fast cache hits."strict": onlycanonicalize()— NFC + ws-collapse + strip ends. Case-sensitive, punctuation-sensitive, article-sensitive. Maximum granularity. The choice for audit-grade agents that want every distinct phrasing to get its own answer.
Exposed as a function so callers can dedup BEFORE hashing — e.g. inject the canonical form into the user message used for
conversation_hash, while still sending the verbatim question to the LLM. Without this split"who is batman","who is batman?", and"who is the batman?"collapse onquestion_hash(under equivalence_class) but each hitsconversation_hashdifferently, missing cache.The choice of mode flows through the
question_deduppolicy field intogovernance_policy_hashso two agents under different modes write records under differentcache_keyvalues — they coexist in parallel namespaces, never collide.
- arborist.qa.keys.question_hash(question, *, mode='equivalence_class')[source]#
SHA-256 of the dedup-mode-canonicalized question.
See
canonical_questionfor what each mode does. The hash is the SHA-256 of the canonical form. Bumping_QUESTION_TRAILING_STRIPor_QUESTION_ARTICLE_STRIP(the equivalence-class strip sets) orphans prior cache records whose canonical question contained newly-stripped tokens; they live as history but won’t be re-hit on lookup.Equivalence class examples (mode=”equivalence_class”):
"who is X" | "who is X?" | "Who Is X." | -> same question_hash "who is the X" | "who is a X" | "who is an X?" | (CJK question mark)
Strict mode (mode=”strict”) distinguishes all of those.
What’s IN the trailing-strip set:
.?!,;:(ASCII),?!。、(CJK full-width),…(ellipsis). Pairs like"')]}are NOT — naive one-sided stripping breaks balance. Apostrophes aren’t either —X'sis a different question fromX.
- arborist.qa.keys.model_profile_hash(model_id, revision='', quantization='')[source]#
SHA-256 of model identity. Bumping any field bumps the cache key.
- arborist.qa.keys.conversation_hash(messages)[source]#
SHA-256 of canonical JSON of the full OpenAI messages array.
Order matters: a 6-turn dialogue arriving at the same final question produces a different hash than a single-turn ask.
- arborist.qa.keys.governance_policy_hash(policy)[source]#
SHA-256 of canonical JSON of the sampling/policy dict.
Includes temperature, top_p, max_tokens, and the system prompt — any of those changing means the answer is governed differently and the cache must miss.
- arborist.qa.keys.verifier_policy_hash(policy)[source]#
SHA-256 of canonical JSON of the verifier-relevant subset of policy.
Pulls _VERIFIER_POLICY_FIELDS out of policy and hashes only those. Empty dict → constant hash (sha256(“{}”)). Folded into cache_key as a 9th dimension so a verifier-policy change is observable from the cache_key alone, separate from governance_policy_hash which folds in temperature / top_p / prompts.
The two hashes overlap (verifier fields ARE in the broader policy dict and so contribute to governance_policy_hash too). That’s intentional — bumping a verifier rule bumps BOTH dimensions. Bumping a non-verifier field (e.g. temperature) bumps ONLY governance_policy_hash. The asymmetry is what makes the audit legible: which dimension changed answers a question that scanning the whole policy dict cannot.
- arborist.qa.keys.cache_key(source_root, question_hash_value, model_profile_hash_value, conversation_hash_value, governance_policy_hash_value, schema_version, canonicalization_version, chunking_version, verifier_policy_hash_value=None)[source]#
SHA-256 of the cache-identity dimensions joined with ‘|’.
8-dim form (legacy): omit verifier_policy_hash_value (or pass None). The result matches pre-2026-05-01 cache identity and keeps backward compatibility with cached records written before the 9th dimension landed.
9-dim form: pass verifier_policy_hash_value explicitly. Records written under the 9-dim form bind to the verifier-policy identity; lookups with a different verifier_policy_hash miss. The 9th dimension is the explicit “did the verifier rules change?” gate.
Any drift in any dimension produces a distinct cache_key.
runner#
Q&A runner: cache-first lookup -> inference fallback -> provable record.
Implements the v9.8 admissibility invariant: no record reused unless all 8 cache_key dimensions match AND state is ‘live’ (not failed/stale/quarantined).
Cache hit -> persisted audit_mode (STRICT/HYBRID/UNGROUNDED).
Cache miss -> call ChatClient, run faithfulness check, classify, store record, audit event.
- arborist.qa.runner.ask(conn, *, document_root, question, client, model_id, revision='', quantization='', policy=None, chain='private', fidelity=None)[source]#
Look up cached answer or run inference. Returns a result dict.
See
arborist.qa.query.queryfor fidelity semantics — it controls lookup tolerance:"strict"only checks the cache_key matching the call’spolicy["question_dedup"]; the default"equivalence_class"falls back to the alternate dedup mode’s cache_key on miss so a fast-cache agent can reuse records written under either mode. Result includeslookup_path.
query#
Multi-source corpus Q&A.
Pose a question, the tree finds related cached docs, assembles them as context, asks Hermes, caches the answer.
- The flow:
FTS5 search across all shards (chunks_fts can’t be UNION’d in views, so we query each shard’s index independently and merge by score).
Title-boost rerank: hits whose title contains query tokens get a score bump. Title is a strong topical signal that BM25 alone misses (BM25 favors short docs with rare body tokens — Tell_(poker) outranks Back_to_the_Future without a title boost).
Pick top-K distinct documents within a character budget.
Compute context_root = Merkle root over the sorted source document_roots — that’s the “source” dimension of v9.8’s 8-dim cache_key for this multi-source answer.
Cache lookup; hit returns the persisted audit_mode.
Miss calls Hermes via the OpenAI-compatible client, then runs the faithfulness check (
verify_quotes) — every double-quoted span in the answer is verbatim-matched against the assembled context. Result classifies the answer as STRICT (every quote >=1 verified against context), HYBRID (some claims sourced, some emergent / training-derived), or UNGROUNDED (no quotes verify — purely emergent).Persist record with merkle_proof = {context_root, sources: […]}, audit_mode, and unverified_quotes (the spans the model produced that didn’t appear in any source — corpus-growth signal).
Per-source proofs are not bundled here (the source roots themselves are already content-addressed). A verifier asks the shards for any specific chunk’s proof on demand.
- arborist.qa.query.query(*, question, qa_db, chat_client, model_id, revision='', quantization='', shards_dir=None, single_db=None, top_k=8, over_fetch=32, max_context_chars=None, policy=None, chain='private', fidelity=None, burn_existing=False, retrieval_keywords=None, translator=None, progress=None, extra_body=None)[source]#
Answer question using the corpus. Cache to qa_db. Returns a result dict.
fidelity controls lookup tolerance — see
FIDELITY_MODESinarborist.qa.keys."strict"checks only the cache_key matching this call’spolicy["question_dedup"]."equivalence_class"(default) tries the primary cache_key first, then the alternate dedup-mode cache_key as a fallback so a fast-cache agent can reuse a record written under either mode. Result includeslookup_pathnaming which key matched (or"miss"when the LLM ran).burn_existing=True deletes the matching live providence_cache row (under the primary dedup-mode cache_key) BEFORE the cache lookup, forcing a fresh inference. Each burn writes a
providence_burnaudit event. Test-ergonomic: run make query Q=… BURN=1 after tweaking a knob to see the new behavior without finding cache_keys by hand. Result includesburned_existingreporting how many rows were deleted (0 or 1 for the primary key; the equivalence- class fallback key is left alone so prior alt-mode records stay historic).retrieval_keywords augments the FTS5 search and title-filter token set with operator-supplied keywords WITHOUT changing what the LLM sees as its question or what the verifier checks. Empirically observed 2026-05-01: long discursive questions like ‘what technology is currently or soon available which may enable one person to reconstruct another person’s thoughts…’ under- retrieve because their content tokens get diluted by template phrasing. Appending domain keywords (‘transcranial knowledge acquisition’) narrows OR-mode FTS5 to the topical article (Neurotechnology) and lifts the verdict from HYBRID to STRICT.
Keywords do NOT enter
question_hashdirectly, but they DO change which sources get chosen — and that re-routes thecontext_rootandconversation_hashcomponents ofcache_key. Two calls with the same question and different keywords therefore land under different cache_keys (different contexts, different cached records — correctly so). Pair withburn_existing=Trueto force fresh inference when iterating on keyword sets.- Parameters:
question (str)
qa_db (Path)
chat_client (ChatClient)
model_id (str)
revision (str)
quantization (str)
shards_dir (Path | None)
single_db (Path | None)
top_k (int)
over_fetch (int)
max_context_chars (int | None)
policy (dict | None)
chain (str)
fidelity (str | None)
burn_existing (bool)
retrieval_keywords (str | None)
translator (object | None)
progress (Progress | None)
extra_body (dict | None)
- Return type:
verify#
Post-LLM faithfulness check: did the answer ground its claims in context?
Three layered strategies, tried in order. The first one that finds evidence
classifies the answer. verifier_method on the result records which path
fired so the audit chain stays diagnostic.
quote — model wrapped claims in double quotes per system prompt. Strongest signal — explicit, verbatim, model-asserted.
span — no quotes, but bullet/sentence-level lines from the answer appear verbatim in context. Catches models that quote inline without
"..."marks.entity — no quotes and no span match, but multi-word proper-noun phrases from the answer appear verbatim in context. Catches the Wikipedia-infobox-to-prose case: the model paraphrases structure so spans diverge, but every named entity is intact and grounded.
Each strategy classifies into v9.8’s audit-mode trichotomy (RAG-adapted vocabulary; substrate calls UNGROUNDED “VISUAL”):
STRICT — every evidence unit (>=1) verifies verbatim against context
HYBRID — some verify, others do not (mixed source / emergent)
UNGROUNDED — no evidence, or none verify (purely emergent)
unverified_quotes (kept under that name for schema continuity) collects
spans the model produced that don’t appear in any source — the
corpus-growth signal mined by arborist emergent.
Hard rule (CLAUDE.md “soft hash vs hard hash”): every check is a lexical substring test under norm-v1 + lowercase canonicalization. No embeddings, no semantic similarity, no fuzzy alignment. The contract is “this token sequence either is or isn’t in the context.”
Wikitext context is run through arborist.wikitext.to_base before the
substring test. The corpus stores raw wikitext (so the link graph is
recoverable from any page), but the LLM produces clean prose. Without
the strip, every wikilink-carrying source paragraph compares as
“different surface form” and the verifier wrongly reports UNGROUNDED on
genuine source-grounded quotes. With the strip, paraphrases of markup
([[Cloud]] vs Cloud) verify, while paraphrases of prose still
flag honestly. mwparserfromhell is an optional dep; if absent, the
strip is a no-op and verification falls back to today’s behavior.
- arborist.qa.verify.extract_quotes(answer_text)[source]#
Pull double-quoted spans of length >= MIN_QUOTE_CHARS from answer_text.
Sequential pairing: locate every double-quote character, then pair them as (1st, 2nd), (3rd, 4th), …. Each pair brackets one quoted span; text between consecutive pairs is the model’s own framing prose (not captured). This is the correct model for adjacent quote pairs like “title” prose “quote” — naive regex matching paired the close of “title” with the open of “quote” and captured prose as a phantom quote, dragging classifications down to HYBRID incorrectly.
- arborist.qa.verify.extract_claim_spans(answer_text)[source]#
Strip bullet markers, split into sentences, drop framing prefixes.
Returns each non-empty span of length >= MIN_SPAN_CHARS. These are the “claim units” the model wrote — each one we’ll substring-test against context.
- arborist.qa.verify.extract_proper_nouns(answer_text)[source]#
Pull multi-word capitalized phrases. Deduplicated, order preserved.
Multi-word only — single capitalized words at sentence start are too noisy (“Based”, “Now”, “However”). Multi-word phrases like “Keanu Reeves” or “Thomas A. Anderson” are reliable proper-noun candidates and substring-test cleanly against source prose or structured wikitext.
- arborist.qa.verify.verify_quotes(answer_text, context, *, entity_policy='proximity', proximity_n=3, proximity_window=300)[source]#
Classify an answer’s grounding against its retrieved context.
Tries quote → span → entity verification in sequence. The first strategy that finds evidence classifies the answer; later strategies don’t run.
entity_policycontrols how the entity path classifies — seeENTITY_POLICIES. The quote and span paths are unaffected; they are explicit-claim evidence and always classify per the trichotomy.Returns a dict with these keys:
n_quotes: int # evidence units extracted (any path) n_verified: int # of those, how many appear verbatim audit_mode: str # STRICT | HYBRID | UNGROUNDED unverified_quotes: [str] # spans we couldn't ground in context verifier_method: str # 'quote' | 'span' | 'entity' | 'none'
- arborist.qa.verify.verify_claim_lattice(answer_text, evidence_map, *, allowed_source_roles=('primary_answer_source', 'secondary_context_source', 'background_source', 'unclassified'), max_pointers_per_claim=2, min_citation_coverage=0.3, min_claim_content_tokens=2, lazy_anchor_demote_threshold=0.5, lazy_anchor_demote_min_pairs=3, max_claims_per_answer=12, subject_tokens_absent_threshold=3, question=None, warrant_check_enabled=True, deflection_check_enabled=True, format_collapse_check_enabled=True, warrant_chain_roots=frozenset({}))[source]#
Deterministic verifier for
answer_mode="claim_lattice_pointer".The model wrote pointer-line prose (
Claim text. [E12]); the parser pulled (claim_text, [pointer_ids]) pairs from each non-empty line. This verifier maps each pointer id back to its content-addressed evidence object and runs six hard checks:Parser succeeded —
parse_status == "PARSED"(line had a bracket tag). NO_EVIDENCE_POINTER claims (prose without tag) count toward the denominator and downgrade the verdict.Pointer id resolves to an entry in the runtime-built evidence map. No model-invented ids.
Resolved entry’s
source_roleis inallowed_source_roles.Claim text non-empty after tag strip.
Claim’s content tokens textually overlap the cited evidence span at coverage ≥
min_citation_coverage(per-pair, lexical only — see_claim_textually_overlaps_evidence). Catches the magnet-chunk lazy-anchor where the model cites an evidence pointer whose text contains few claim-content tokens.Pointer count per claim does not exceed
max_pointers_per_claim(default 2 — matches the prompt’s “1 or 2 pointers per claim” rule). When exceeded, the claim is TRIMMED to the first N pointers and verification proceeds normally; aPOINTER_OVERFLOW_TRIMMEDviolation is recorded so STRICT is no longer reachable (audit_mode caps at HYBRID for the run). Trim-and-verify (vs hard fail) protects correct claims that were over-cited (e.g. “Leonardo painted the Mona Lisa. [E2,…,E14]”) while keeping the over-citation pattern surfaced. The dropped pointers count towardn_quotesso the denominator reflects what the model emitted.
Removed 2026-04-30: the strict no-double-quote rule. The model routinely paraphrases source prose but copies named-quoted phrases verbatim (e.g.
"Constitution State"from a Connecticut span). Hard-rejecting claims that contained any"char was rejecting factually correct, source-grounded claims for cosmetic punctuation. The coverage threshold (Rule 5) and pointer cap (Rule 6) carry the weight of catching synthetic-quote / mega-claim failures the old rule was meant to catch._has_manual_quoteis still defined and used byverify_claim_lattice_json.Returns a verdict in the same shape as
verify_quotes+ extras:n_quotes total claim-pointer pairs (denominator) n_verified pairs where pointer resolved AND source_role allowed AND coverage met AND claim text non-empty audit_mode STRICT / HYBRID / UNGROUNDED unverified_quotes claim texts that didn't reach EVIDENCE_LINKED -- kept under that name for schema continuity with verify_quotes verifier_method "claim_lattice" claim_statuses per-claim {text, evidence_ids, pointer_ids, status, reasons[]}; status in {EVIDENCE_LINKED, EVIDENCE_LINKED_PARTIAL, UNKNOWN_EVIDENCE_ID, SOURCE_ROLE_BLOCKED, CITATION_MISMATCH, NO_EVIDENCE_POINTER, SCHEMA_INVALID} violations structured violation records for the run-DAG / sidecar rendered_text human-readable prose with literal spans interpolated; what the runner persists as answer_text evidence_id_pairs per-claim list of resolved content-addressed evidence_ids (run-stable form). Used to thread the parsed lattice into the run-DAG.
- Parameters:
answer_text (str)
max_pointers_per_claim (int)
min_citation_coverage (float)
min_claim_content_tokens (int)
lazy_anchor_demote_threshold (float)
lazy_anchor_demote_min_pairs (int)
max_claims_per_answer (int)
subject_tokens_absent_threshold (int)
question (str | None)
warrant_check_enabled (bool)
deflection_check_enabled (bool)
format_collapse_check_enabled (bool)
- Return type:
- arborist.qa.verify.evidence_map_by_evidence_id_local(evidence_map, eid)[source]#
Local helper — returns the EvidenceObject whose
evidence_idmatcheseid, or None. Avoids the import-cycle risk of pulling evidence_map_by_evidence_id into this module’s hot path; the O(N) walk is fine since evidence maps are <30 entries.- Parameters:
eid (str)
- arborist.qa.verify.claim_lattice_structured_output_extras(schema=None, *, name='claim_lattice')[source]#
Multi-engine
extra_bodyfor JSON-schema enforcement on chat completions. Each inference engine recognises its own key and silently drops the others, so sending all three lets the same call site work across vLLM, llama.cpp, and OpenAI-spec endpoints without per-endpoint branching:guided_json— vLLM grammar-constrained samplingjson_schema— llama.cpp native shorthandresponse_format— OpenAI-spec{type: json_schema, …}(honoured by llama.cpp and newer vLLM)
Returns a dict you splat into
client.chat_completion(extra_body=…). Defaults to the claim- lattice schema; pass an alternate schema to reuse the helper for other structured-output features. Thenameis required by OpenAI-specresponse_formatand is the user-visible label for the schema in some engines’ error messages.Added 2026-05-19 to enable the Arborist arm to run with Qwen on llama.cpp (the old single-key
guided_jsonwas silently dropped on llama.cpp, leaving Qwen un-enforced and the parse-tolerant fallback doing all the work). Hermes/vLLM path is unchanged — it still picks upguided_jsonand ignores the other two.
- arborist.qa.verify.verify_claim_lattice_json(answer_json_text, evidence_map, *, allowed_source_roles=('primary_answer_source', 'secondary_context_source', 'background_source', 'unclassified'), max_evidence_per_claim=2, min_citation_coverage=0.3, max_claims_per_answer=12, subject_tokens_absent_threshold=3, question=None, warrant_check_enabled=True, deflection_check_enabled=True, warrant_chain_roots=frozenset({}))[source]#
Deterministic verifier for
answer_mode="claim_lattice"(JSON).Parses the model’s JSON output (lenient pre-parser handles markdown fences / preamble / curly quotes / trailing commas), validates the schema, then runs the same hard checks as
verify_claim_latticebut readingevidence_idsfrom the JSON claim objects.2026-04-30: switched from content-addressed evidence_ids (
Eed1b6e396) to pointer_ids (E1,E2, …) in the prompt & JSON output. Hermes-3-8B was fabricating plausible content- addressed IDs (E1b6e396-style near-misses) on cross-document relationship questions; the verifier correctly rejected them as UNKNOWN_EVIDENCE_ID but the answer text was often factually correct, leaving us with honest UNGROUNDED on right answers. Pointer IDs are short, enumerable, and fabrication-obvious. The runtime still resolves each pointer_id to its content-addressed evidence_id internally and stores that inevidence_id_pairs(cache/run-DAG continuity); only the prompt-facing surface changes.JSON parses (lenient). Failure → SCHEMA_INVALID, UNGROUNDED.
Top-level is
{"claims": [...]}.Each claim is
{"text": str, "evidence_ids": [str, ...]}.Each evidence_id resolves in the runtime-built evidence map (no model-invented IDs).
Resolved entry’s
source_roleis inallowed_source_roles.Claim text contains no double-quote characters anywhere.
Claim text non-empty.
Claim’s content tokens textually overlap the cited evidence span.
len(evidence_ids) <= max_evidence_per_claim.
Returns a verdict in the same shape as
verify_claim_latticeplus ajson_fixupsfield naming any drift the lenient parser had to peel ("fence"/"prose_trim"/"curly_quotes"/"trailing_comma"). Empty list = strict JSON parse on first try.
evidence#
Evidence map for claim-lattice-pointer (quote-by-pointer) answer mode.
Builds a deterministic table of evidence objects from already-retrieved
chunks. The model references each object by a short pointer_id
(“E1”, “E2”, …) which the runtime maps back to a content-addressed
evidence_id for the cache, run-DAG, and audit chain.
Design pinned by G0 ticket (2026-04-29) and the CTI / Clause Lattice Intelligence reframe:
Models should not generate verbatim quotes. They should point to evidence IDs extracted by deterministic code.
This kills the synthetic-elision class by construction — the model never types the quote string, so it can’t drop characters from one.
Two-layer id scheme:
pointer_idshort numeric tag the model sees in the prompt andwrites back in pointer-line answers.
E+ the 1-based position of the evidence object in the map (E1,E2, …,E37). One BPE token per id in standard tokenizers. Stays in the model’s in-distribution citation style.
evidence_idcontent-addressed handle.E+ first 8 hex ofsha256(chunk_root:offset_start:offset_end). Same chunk + same offsets in two runs → same id, forever. The cache_key, run-DAG, and audit chain all use this form so provenance is run-stable.
The verifier (verify_claim_lattice) maps each pointer_id the model
writes back to its content-addressed evidence_id before persistence.
The model’s literal output is run-dependent (run #1’s E1 and run
#2’s E1 likely point at different chunks); the content-addressed
layer is what stays stable.
For the first cut every chunk produces exactly one evidence object
covering offset_start=0 .. offset_end=len(span). Sub-chunk
extraction (paragraph-level, sentence-level) is a future refinement;
the schema already accepts arbitrary offsets so adding finer-grained
splits later doesn’t break the contract.
- class arborist.qa.evidence.EvidenceObject(pointer_id, evidence_id, source_root, document_uri, title, chunk_idx, chunk_root, offset_start, offset_end, source_role, text_hash, span)[source]#
One pinned span the model may reference by
pointer_id.All fields are deterministic from inputs; same chunk + offsets yields the same object byte-for-byte.
pointer_idprompt-facing id (“E1”, “E2”, …) — what themodel sees & writes back. Position-derived; run-dependent on purpose.
evidence_idcontent-addressed handle (“E” + 8 hex ofsha256(chunk_root:start:end)) — what the cache, run-DAG, and audit chain use. Run-stable.
source_rootdocument_root the chunk belongs todocument_urihuman-readable URI (for the renderer)titledoc title (for the renderer & prompt)chunk_idxchunk index within the documentchunk_rootleaf hash of the chunkoffset_startbyte offset within the chunk (0 for whole-chunk)offset_endend offset (exclusive)source_rolerole classification (primary_answer_source / …)text_hashsha256 of the span (for tamper detection)spanthe literal text (what the renderer interpolates)
- Parameters:
- arborist.qa.evidence.build_evidence_map(chunks)[source]#
Build the evidence table from retrieved chunks.
chunksis a list of dicts with keys:source_root, document_uri, title (optional), chunk_idx, chunk_root, span, source_role (optional, default ‘unclassified’)
Returns a list of
EvidenceObject``s, one per chunk, in input order. The 1-based position drives ``pointer_id(E1, E2, …); the chunk’s content drivesevidence_id(sha256-derived). For the first cut each chunk = one whole-span evidence object (offset 0 .. len(span)). Sub-chunk splitting is a future refinement.- Parameters:
- Return type:
- arborist.qa.evidence.evidence_map_root(evidence)[source]#
Merkle root over the sorted evidence_id leaves.
Sorting makes the root order-independent — two retrieval runs that return the same chunks in different orders produce the same root. Use as the
evidence_mapstage hash in the run-DAG.- Parameters:
evidence (list[EvidenceObject])
- Return type:
- arborist.qa.evidence.render_evidence_block(e)[source]#
Format one evidence object for the LLM prompt.
Header carries the prompt-facing pointer_id, title (or URI tail), and source role so the model has everything it needs to cite without typing the span:
=== E1 (Jurassic_Park_(film) | primary_answer_source) === <literal span text>
The runtime maps E1 back to the content-addressed evidence_id before persistence; the model never sees the hex form.
- Parameters:
e (EvidenceObject)
- Return type:
- arborist.qa.evidence.render_evidence_map(evidence)[source]#
Concatenated evidence blocks, ready to drop into the prompt.
- Parameters:
evidence (list[EvidenceObject])
- Return type:
- arborist.qa.evidence.render_evidence_block_for_json(e)[source]#
Format one evidence object for the JSON-mode LLM prompt.
2026-04-30: header uses the prompt-facing
pointer_id(E1, E2, …) — same as claim_lattice_pointer mode — instead of the content-addressedevidence_id(long hex). The change closes a real failure mode: small models (Hermes-3-8B observed) were fabricating plausible-looking content-addressed IDs (e.g.E1b6e396when the runtime hadEed1b6e396) →UNKNOWN_EVIDENCE_ID→ UNGROUNDED, even when the answer text was correct. Pointer IDs (E1-E10) are short, enumerable, and fabrication-obvious — the model can’t inventE27if onlyE1-E10were shown.The runtime still stores content-addressed
evidence_idin the cache & run-DAG (resolved on-the-fly inverify_claim_lattice_json); only the prompt-facing string changes:=== E1 (Jurassic_Park_(film) | primary_answer_source) === <literal span text>
- Parameters:
e (EvidenceObject)
- Return type:
- arborist.qa.evidence.render_evidence_map_for_json(evidence)[source]#
Concatenated evidence blocks for JSON mode.
- Parameters:
evidence (list[EvidenceObject])
- Return type:
- arborist.qa.evidence.evidence_map_by_pointer_id(evidence)[source]#
Index by prompt-facing pointer_id (E1, E2, …).
- Parameters:
evidence (list[EvidenceObject])
- Return type:
- arborist.qa.evidence.evidence_map_by_evidence_id(evidence)[source]#
Index by content-addressed evidence_id (E1f8e4c2a, …).
- Parameters:
evidence (list[EvidenceObject])
- Return type:
- arborist.qa.evidence.render_claim_lattice(claims, by_id, *, window=200)[source]#
Convert structured claims to human-readable prose with literal spans.
Each claim becomes one bullet line followed by inlined evidence excerpts. The model’s claim text is rendered verbatim; each cited pointer_id is followed by a spotlight excerpt of the literal source span — a window of
windowchars centered on the first content token from the claim that appears in the span. Falls back to the leading window when no claim token matches.Why spotlight over leading-N truncation: when the cited evidence is a whole article and the model lazy-anchors every claim at the same pointer, the leading-N strategy displayed the same article-intro sentence under every claim. The spotlight finds the part of the span the claim is about — “Brachiosaurus appears in the film” + a 15 KB article span gets a window centered on the first “brachiosaurus” mention, not the production-history opener. Same cited evidence id, but the displayed text actually supports different claims differently.
by_idis the pointer_id → EvidenceObject index — whatevidence_map_by_pointer_idreturns. Determinism: same (claims, by_id, window) → same prose, byte-for-byte. Unknown ids render as[<id>: ?]so violations are visible at a glance.
quantifier#
Pure quantifier preflight classifier — Ticket #000008 Phase 1.
Maps a question string onto the ten-rung intensity ladder defined in
docs/tickets/ticket-000008-broad-quantifier-preflight-guard.md §2.
The classifier exists to estimate expected number of claims in the
answer and format-discipline risk on small models. It is not
formal-semantics quantifier theory; the operational axis is what
matters.
Pure function. No I/O. No model call. No retrieval call. Folds into
governance_policy_hash via classifier_version (added to
arborist.qa.keys._VERIFIER_POLICY_FIELDS in Phase 2).
Intensity rungs (highest wins for multi-quantifier questions):
1. ABSENT universal-negation, single-claim shape
2. SINGULAR one-fact wh / definite reference
3. PROPORTIONAL descriptive fraction (`most`, `half`)
4. SMALL_NUM_EXPLICIT bounded by digit/word (`top 3`, `seven X`)
5. COMPARATIVE_BOUND bounded by inequality (`at least 5`)
6. FEW small set, vague (`some`, `a few`)
7. MANY medium set, vague (`many`, `numerous`)
8. ALL universal quantifier (`all`, `every`)
9. COMPREHENSIVE exhaustive request (`complete list of`,
`tell me everything`)
10. OPEN_REQUEST verb-driven enumeration (`tell me about`,
`describe`, `explain`)
Returns a dict with:
intensity one of the ten rungs (or "SINGULAR" by default)
matched_token the lexical surface form that triggered the rung
explicit_count int when SMALL_NUM_EXPLICIT or COMPARATIVE_BOUND;
None otherwise
is_broad True for ALL / COMPREHENSIVE / OPEN_REQUEST
operational_shape mnemonic for downstream policy (e.g.
"universal_enumeration", "exhaustive_request")
scope_bound_hint "bounded" | "unbounded" | "unknown"
(see ticket §10.1 -- bounded != unbounded
universals; classifier defaults to "unknown"
when intensity is broad and no domain anchor
is present)
Highest-intensity-wins arbitration: when a question contains
overlapping markers (e.g. “tell me about all the planets”), pick the
rung farther from SINGULAR. The order in _RUNG_PRIORITY codifies
this — later rungs win.
- arborist.qa.quantifier.classify_question_quantifier(question)[source]#
Classify
questiononto the ten-rung intensity ladder.Highest-intensity-wins arbitration: when multiple rungs match, pick the one farther from SINGULAR. Operationally that means “tell me about all the planets” classifies OPEN_REQUEST (later in the priority order than ALL), even though ALL also matched. The downstream cap is the broader rung’s cap, which is what we want under enumeration pressure.
Returns a dict; see module docstring for fields.
Empty / whitespace-only questions classify SINGULAR (no quantifier pressure) with no matched_token.
metacognition#
Meta-Cognition Preflight Guard — Ticket #000010 Phase 1.
Runtime epistemic control layer that classifies a question’s shape BEFORE generation, so the model never answers from the surface form alone when the question is ill-posed (false-premise, contradictory, under-specified, broad-quantifier, time-sensitive, out-of-corpus, reference-frame ambiguous).
Pure and deterministic. No I/O, no model call, no retrieval call.
Reuses arborist.qa.quantifier.classify_question_quantifier for
the broad-quantifier rung; adds four new lightweight detectors:
temporal sensitivity (current/latest/today/CEO/etc.)
contradiction (lexical) (unmarried+spouse, always+sometimes-not)
false-premise (lite) (presupposition patterns)
out-of-corpus (my-uploaded-X / my-file shapes)
Reference-frame detection lives in arborist.qa.query._detect_frame
(ticket #000002) and is called from the surrounding runtime, not
from this module — keeps detection pure-on-question (no corpus
lookup needed here).
The output is a QuestionState dataclass that becomes a CTI root
node. First pass surfaces it on the query() result dict only;
run-DAG node binding deferred to the same Phase 5 work tracked in
ticket #000009 (both nodes can land together).
Hard rule (D1): No LLM in this hard path. Model-assisted preflight,
if added later, labels itself SOFT_PREFLIGHT_HINT and never
produces a PREFLIGHT_OK / PREFLIGHT_BLOCKED without
deterministic support.
- class arborist.qa.metacognition.QuestionState(raw_question, question_hash, logical_statuses, question_shape, quantifier_intensity, quantifier_matched_token, scope_bound_hint, reference_frames, temporal_sensitivity, temporal_matched_tokens, contradiction_pairs, false_premise_hints, corpus_requirement, known_boundaries, answer_constraints, preflight_result, preflight_policy_hash, classifier_version='metacognition-v0.1')[source]#
Runtime epistemic state for one question.
All fields are deterministic from the question + per-call model_profile / corpus_profile / policy inputs. No randomness, no LLM. Hashable via preflight_policy_hash so the run-DAG (Phase 5) can bind the decision into the audit chain.
- Parameters:
raw_question (str)
question_hash (str)
logical_statuses (tuple[Literal['well_formed', 'under_specified', 'false_premise_suspected', 'contradictory_question', 'out_of_corpus_risk', 'stale_risk', 'reference_frame_ambiguous', 'broad_quantifier_unbounded'], ...])
question_shape (str)
quantifier_intensity (str | None)
quantifier_matched_token (str | None)
scope_bound_hint (str)
temporal_sensitivity (Literal['high', 'medium', 'low'])
corpus_requirement (str)
answer_constraints (dict)
preflight_result (Literal['PREFLIGHT_OK', 'PREFLIGHT_PARTIAL', 'PREFLIGHT_BLOCKED'])
preflight_policy_hash (str)
classifier_version (str)
- arborist.qa.metacognition.detect_temporal_sensitivity(question)[source]#
Return
(sensitivity, matched_tokens).high = explicit temporal anchor (current, latest, etc.) OR rapid-turnover role pattern. medium reserved for future weekly/monthly cadence detection (not implemented in this pass). low = no temporal markers detected (the default).
- arborist.qa.metacognition.detect_contradiction(question)[source]#
Return tuple of (token_a, token_b) pairs whose BOTH members appear in
question(case-insensitive whole-word match).Returns empty tuple when no contradiction detected. The caller decides whether to label-only or block — by default this surfaces in the audit-line tail, NOT a hard block, since false positives on contradiction would refuse legitimate questions.
- arborist.qa.metacognition.detect_false_premise(question)[source]#
Return tuple of presupposition dicts surfacing the implied relation. Each dict carries:
kind -- pattern label (stopped_doing, caused, ...) presupposition -- natural-language statement of the presupposition subject -- extracted subject token-span predicate -- extracted predicate token-span
First-pass detection only. The verifier uses these as soft hints; downstream the audit-line tail surfaces “false premise suspected” so the operator can read the audit log and check whether the cited evidence supports the presupposition.
Returns empty tuple when no pattern fires.
- arborist.qa.metacognition.detect_out_of_corpus(question)[source]#
Return True iff the question references a private / uploaded document that the encyclopedic corpus cannot have.
- arborist.qa.metacognition.preflight_question(question, *, model_profile_id=None, corpus_profile=None, reference_frames=(), policy=None)[source]#
Classify
questiondeterministically into a QuestionState.Pure function. Reuses the Phase 1 quantifier classifier (#000008) plus four new lightweight detectors (temporal, contradiction, false-premise-lite, out-of-corpus).
corpus_profile is an optional dict carrying corpus boundary metadata (e.g.
{"corpus_latest_timestamp": "2003-05-16"}); when present, the temporal detector cross-checks against it. First-pass implementation just records corpus_requirement based on the temporal sensitivity — full cutoff arithmetic deferred to a future amend.reference_frames is passed in by the caller because frame detection requires retrieved sources (lives in arborist.qa.query._detect_frame). Empty tuple is the default for “no frame routing happened”.
policy overrides for the per-detector enables. Defaults are permissive (all checks on) per ticket #000010 §7.3.
dag#
Per-run Merkle-DAG provenance for providence records.
Each query/ask call passes through several stages:
question → retrieval → context → prompt → answer → verify → final_label
Each stage emits a hash; the run’s identity is the Merkle root over the
ordered sequence of stage hashes. Stored on the providence record as
run_dag_root (alongside cache_key). The DAG is verifiable: given
the persisted node list & the same Merkle conventions arborist uses
elsewhere (non-commutative HashCombine, prefix 0x03, leaf prefix 0x00,
self-duplicate odd rule), an auditor can recompute the root from the
nodes & confirm the run was constructed as recorded.
Distinct from the linear audit_events chain — that chain tracks
state-changing operations across the DB. This DAG tracks the
computation provenance of one specific answer. Both coexist; the
record’s audit_event_hash links to the chain, run_dag_root &
run_dag_blob carry the per-run computation graph.
Stages chosen to mirror the toy-Hermes design (fox 2026-04-30):
question hash of question_hash (8-dim cache_key dim)
retrieval hash of sources summary (document_roots + roles +
scores) -- captures which docs ranked & how
context context_root (Merkle root over sorted source roots,
the "source" dim of the cache_key)
prompt conversation_hash (the assembled messages)
answer sha256(answer_text)
verify hash of verdict summary (audit_mode, verifier_method,
n_quotes, n_verified, claim_statuses)
final_label hash of (audit_mode, verifier_method, lookup_path)
The DAG is NOT part of cache_key. cache_key inputs (the 8 dims) determine the answer; the answer determines the DAG. Folding the DAG back into cache_key would create a circular dependency.
- arborist.qa.dag.localize_failure(*, audit_mode, n_sources, n_quotes, n_verified)[source]#
Map a non-STRICT verdict to the pipeline stage that introduced the failure. Returns
Nonefor STRICT outcomes.Stage labels (in pipeline order):
retrieval— no admitted sources. Title/body gates rejected everything, or the corpus genuinely lacks the topic. Repair path: ingest more sources or relax the breadth threshold.context— sources admitted but no quotes extracted. Could be a context-truncation issue (per-source cap dropped the relevant paragraph) or a model that declined to cite anything. Repair path: raise per-source cap; tighten prompt.answer— sources retrieved & quotes extracted but they don’t verify. The model either fabricated content, paraphrased inside quotes, or appended citation tails. Repair path: themechanical_repairpass + (when wired) the re-prompt feedback loop.
The toy-Hermes design pass calls this “chain-segment failure localization” — debugging becomes typed instead of vague. An operator reading
failure_stage='answer'knows retrieval & context were fine; the model is what to fix.failure_stage='retrieval'means stop tuning the verifier & go ingest a relevant source.
- arborist.qa.dag.build_preflight_node_payload(*, question_state=None, quantifier=None, answer_contract=None, prompt_contract=None, evidence_contract=None, policy_refs=None)[source]#
Build the canonical nested-clause payload for the preflight DAG stage. Returns a JSON-ready dict; pair with
preflight_node_hash()to get the SHA-256 hex.Five-clause structure per ticket #000009 §8.2 / feedback §3:
classifier— quantifier classifier output (#000008): intensity, matched_token, explicit_count, scope_bound_hint, is_broad, classifier_version, operational_shape.answer_contract— guard / cap / reject decisions taken on this run.prompt_contract— reminder enabled / injected / template_id (#000008 §10.5).evidence_contract— exposure budget, one-claim-per-line discipline (#000010 §10.4).policy_refs— governance_policy_hash + model_profile_hash + answer_mode. Reference-by-hash rather than raw policy bundles (feedback §4: avoid double-committing already-hashed state).
Plus the metacog
question_statefrom #000010 — that’s its own clause for now (logical_statuses, false_premise_hints, contradiction_pairs). It’s hashed separately by metacognition.preflight_policy_hash already.Any clause may be None / empty — the resulting payload is still stable. Includes
node_versionso legacy runs without the node can be unambiguously labeled unavailable_legacy_run by audit tools.
- arborist.qa.dag.preflight_node_hash(*, question_state=None, quantifier=None, answer_contract=None, prompt_contract=None, evidence_contract=None, policy_refs=None)[source]#
Hash the preflight decision into a stable SHA-256 hex string.
Returns the hash of the nested-clause payload built by
build_preflight_node_payload(). See that function for the five-clause structure.Audit-replay payoff: two cache rows that share the same question + same model output + same verifier verdict but different preflight policy state produce different hashes here, which propagate to
run_dag_rootviabuild_run_dag().Backward compatibility note: Pre-2026-05-04 (c36e85c) callers used a flat 3-key payload (question_state / quantifier / policy_state). Hashes computed with that callsite will NOT match this restructured callsite — run_dag_root values for rows written between c36e85c and the current commit are treated as a discrete generation; they’re still verifiable by re-reading run_dag_blob (the persisted blob captures the payload that was actually hashed).
- arborist.qa.dag.build_run_dag(*, question_hash, sources, context_root, conversation_hash, answer_text, audit_mode, verifier_method, n_quotes, n_verified, claim_statuses=None, lookup_path=None, evidence_map_root=None, answer_mode=None, violations=None, raw_answer_text=None, parsed_lattice=None, rendered_text=None, retrieval_plan_hash=None, preflight_hash=None, preflight_payload=None)[source]#
Return
{"root": <hex>, "nodes": [<stage>, <hash>], ...}.All inputs are already-computed hashes or text; no I/O. Idempotent & deterministic — same inputs always produce the same root, byte-for- byte across machines (as long as the Merkle conventions stay pinned; they do, via
arborist.merkle).Two base DAG shapes; both gain an optional
preflightstage whenpreflight_hashis supplied (Ticket #000009):Quote mode (default). 7 stages —
question / retrieval / context / prompt / answer / verify / final_label. Triggered whenevidence_map_rootis None. Backward-compatible with all run_dag_root values written by code that pre-dates G0. Withpreflight_hash, becomes 8 stages —question / preflight / retrieval / ....Claim-lattice-pointer mode (G0 / CTI). 9 stages —
question / retrieval / evidence_map / prompt / raw_answer / parsed_claim_lattice / verify / render / final_label. Triggered whenevidence_map_rootis non-None. Splits the singleanswernode into three: the model’s raw output, the parsed claim-lattice, and the rendered prose with literal spans interpolated.contextdrops out (the context IS the evidence map). All three ofraw_answer_text/parsed_lattice/rendered_textshould be supplied; missing args fall back toanswer_textfor the raw_answer & render hashes and[]for the parsed_lattice hash. Withpreflight_hash, becomes 10 stages.
answer_mode&violationsfold into the verify & final_label payloads when provided.preflight_hash(Ticket #000009) is optional; when None, the DAG shape remains 7/9 stages exactly so pre-#000009 records can be re-validated. When supplied, the preflight stage inserts at position 1 (betweenquestionandretrieval) per ticket #000009 §3.1.- Parameters:
question_hash (str)
context_root (str)
conversation_hash (str)
answer_text (str)
audit_mode (str)
verifier_method (str)
n_quotes (int)
n_verified (int)
lookup_path (str | None)
evidence_map_root (str | None)
answer_mode (str | None)
raw_answer_text (str | None)
parsed_lattice (list | None)
rendered_text (str | None)
retrieval_plan_hash (str | None)
preflight_hash (str | None)
preflight_payload (dict | None)
- Return type:
- arborist.qa.dag.build_reject_run_dag(*, question_hash, preflight_hash, rejection_reason, answer_text, audit_mode='UNGROUNDED', verifier_method='claim_lattice_pointer', violations=None, preflight_payload=None)[source]#
3-stage reject-broad run-DAG:
question → preflight → final_label.Ticket #000009 §8.2 / 2026-05-04 feedback §6.2: preflight rejection currently early-returns from
query()before the standardbuild_run_dag()runs, so reject rows have no auditable Merkle commitment. This builder fills that gap with a minimal DAG shape that captures the rejection without pretending retrieval / prompt / raw_model_output happened.The returned shape is INTENTIONALLY shorter than the standard 7/9/8/10-stage shapes — audit replay can read the stage list and tell instantly that this row is a preflight rejection: 3 stages always means reject path.
final_label carries the rejection_reason + answer_text hash so two rejections that differ only in their (rendered) rationale string still produce different roots. The rejection_reason is the canonical string from the violation (“preflight rejection — broad-quantifier query with unbounded scope. …”), NOT the operator-facing rendered answer_text — that lets policy template changes invalidate the hash even if the operator-visible text is unchanged.
client#
Chat-completion clients.
ChatClient is a Protocol — any object with a chat_completion method plugs in. We ship two concrete clients:
OpenAICompatibleClient — talks to any OpenAI-compatible /v1/chat/completions endpoint (vllm, llama.cpp server, ollama, TGI, hosted services).
StubClient — offline canned responses for tests and dry-runs. No network. Operation Voyeur safe.
- class arborist.qa.client.ChatClient(*args, **kwargs)[source]#
- chat_completion(messages, *, model, temperature=0.1, max_tokens=512, top_p=1.0, extra_body=None)[source]#
Return the assistant’s text response.
extra_bodyis forwarded as additional fields in the JSON request payload — used for vLLM-specific knobs likeguided_json(constrain output to a JSON Schema at sampling time, eliminating SCHEMA_INVALID failures from prompt drift). Endpoints that don’t recognize the field ignore it; the client passes it through opaque-ly.
- class arborist.qa.client.StubClient(answer='[STUB] dry-run answer; no LLM was called.')[source]#
Offline client for tests / –dry-run.
Pass answer=callable(messages, **kw) -> str for dynamic stubbing.
- class arborist.qa.client.OpenAICompatibleClient(base_url, api_key=None, timeout=60.0, max_retries=3, retry_backoff_base_s=0.5)[source]#
OpenAI-compatible chat completion over HTTP.
Default endpoint is configurable via env. Pass api_key only if the target requires it; uncloseai’s free endpoint does not.
Retries on transient upstream failures (HTTP 502/503/504) with exponential backoff. The 2026-04-30 QA-modes bench saw 19 of 66 JSON-mode runs error out with 502 from vLLM — clustered, plausibly correlated with guided_json stressing the grammar engine. Retry smooths over the cluster without changing semantics: a 502 still fails the bench cell if all attempts exhaust, but transient bursts no longer dominate the error column.
- Parameters:
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.