Substrate: Core data structures#
Pure Merkle tree and document primitives.
merkle#
Merkle tree with non-commutative HashCombine.
Python port of ~/git/proxy.unturf.com/pkg/verified/merkle.go conventions:
Domain separation via single-byte prefixes (leaf=0x00, node=0x03).
HashCombine is non-commutative; sibling order matters always.
Odd layers self-duplicate the trailing element (NOT zero-pad).
Proof carries explicit IsLeft flag per sibling (NOT lexical sort).
Empty tree root is ZeroHash (32 zero bytes).
- arborist.merkle.hash_combine(left, right)[source]#
Non-commutative interior combine with domain prefix 0x03.
- class arborist.merkle.ProofNode(hash, is_left)[source]#
Bases:
objectOne sibling step in a Merkle inclusion proof.
is_left=True means the sibling sits to the LEFT of the running hash, so verification order is: HashCombine(sibling, current).
- class arborist.merkle.MerkleProof(leaf, leaf_index, siblings, root)[source]#
Bases:
objectMerkle inclusion proof: leaf + index + sibling path to root.
Proves that a leaf at leaf_index exists in a tree with the given root. verify_proof() recomputes the root by combining the leaf with siblings.
- class arborist.merkle.MerkleTree(layers=<factory>)[source]#
Bases:
objectLayered tree. layers[0] = leaves, layers[-1] = [root].
- classmethod build(leaves)[source]#
Build tree from leaf hashes. Applies odd-element self-duplication.
- Parameters:
- Return type:
- arborist.merkle.verify_proof(proof)[source]#
Recompute root from leaf + sibling path. Returns True iff matches.
- Parameters:
proof (MerkleProof)
- Return type:
- arborist.merkle.proof_to_dict(proof)[source]#
JSON-serializable form for storage in providence_cache.merkle_proof.
- Parameters:
proof (MerkleProof)
- Return type:
document#
Document and Chunk dataclasses + canonical chunkers.
Every Document carries a URI (identification, backtrack, cross-link) and content. Chunkers split content into byte-determined chunks before Merkle hashing. The chunker’s name is committed in chunking_version — changing chunker invalidates all prior cache records under v9.8 admissibility.
- class arborist.document.Edge(edge_type, dst_uri, dst_root=None, anchor=None)[source]#
Bases:
objectA cross-link from this document to another.
- class arborist.document.Document(uri, content, source_type, title=None, edges=<factory>, extra=<factory>)[source]#
Bases:
objectAn ingestable document: URI + content + outbound edges.
- Parameters:
- arborist.document.canonicalize(text)[source]#
Stable text normalization. Bumping this requires CANONICALIZATION_VERSION bump.
- class arborist.document.Chunker(*args, **kwargs)[source]#
Bases:
ProtocolA chunker splits canonicalized text into ordered chunks.
- class arborist.document.TokenChunker(tokens_per_chunk=512)[source]#
Bases:
object512-token chunker (whitespace-tokenized, byte-deterministic).
“Token” here means whitespace-separated unit, NOT a model BPE token. This avoids tokenizer-version drift in the chunking_version.
- Parameters:
tokens_per_chunk (int)
- name = 'tok-512-v1'#
wikitext#
Wikitext → base prose conversion.
Arborist stores raw MediaWiki wikitext in chunks.content so the link
graph and original markup are recoverable from any page on demand. For
LLM context and post-LLM faithfulness verification we need prose — a
deterministic plain-text projection of the same chunk.
This module provides that projection. to_base(raw) is a pure function
of its input plus BASE_VERSION: same wikitext → same prose, forever,
as long as BASE_VERSION is unchanged.
Versioning protocol#
Bump BASE_VERSION whenever the algorithm changes. Callers fold
BASE_VERSION into governance_policy_hash (via policy["base_version"]
in arborist.qa.runner / arborist.qa.query) so a bump invalidates every
prior providence-cache record’s 8-dim cache_key on the next lookup. No
schema migration; the next ask re-derives against fresh prose.
Algorithm (wikitext-base-v1)#
Parse with
mwparserfromhell(handles nested templates, complex tables, and edge cases that pure regex mangles).Drop
<ref>...</ref>and self-closing<ref ... />tags. Citations are not quotable claims about the topic.Drop namespace-prefixed wikilinks:
[[File:...]],[[Image:...]],[[Category:...]]. Image params (thumb|250px|...) and category tags are not prose; they’re metadata.strip_code(normalize=True, collapse=True)— converts surviving templates to empty, wikilinks to their display text, headers to bare text, bold/italic markers to plain text, HTML tags to inner text, HTML entities to characters, external links to anchor text.Whitespace pass: collapse runs of spaces/tabs, drop trailing space on lines, collapse 3+ newlines to 2.
Optional dependency. Install with pip install arborist[wikitext].
- arborist.wikitext.to_base(raw, *, loss_collector=None)[source]#
Convert raw wikitext to base prose. Deterministic. Idempotent.
Empty / whitespace-only input returns
"". Non-wikitext input (already-clean prose) round-trips unchanged modulo whitespace collapsing.When
loss_collectoris provided, every drop / transform / normalize step records aLossEventagainst the collector. The output bytes remain bit-identical to the no-collector path — LossReport is additive metadata, never a gate. DefaultNonekeeps the LLM-call path (verifier + query + runner) unchanged.
- arborist.wikitext.extract_wikilinks(raw)[source]#
Return
(target, display)for every wikilink inraw.targetis the link target (page title) with any#sectionfragment stripped.displayis the visible text if the link uses[[Target|Display]]form, elseNone.File / Image / Category namespace links are included — they’re the very signal the link graph wants. This is the “definition cloud” artifact: from any page, recover its full out-link set without re-parsing wikitext at query time.
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.