Substrate: Core data structures#

Pure Merkle tree and document primitives.

merkle#

Merkle tree with non-commutative HashCombine.

Python port of ~/git/proxy.unturf.com/pkg/verified/merkle.go conventions:

  • Domain separation via single-byte prefixes (leaf=0x00, node=0x03).

  • HashCombine is non-commutative; sibling order matters always.

  • Odd layers self-duplicate the trailing element (NOT zero-pad).

  • Proof carries explicit IsLeft flag per sibling (NOT lexical sort).

  • Empty tree root is ZeroHash (32 zero bytes).

arborist.merkle.hash_leaf(content)[source]#

Hash a leaf with domain prefix 0x00.

Parameters:

content (bytes)

Return type:

bytes

arborist.merkle.hash_combine(left, right)[source]#

Non-commutative interior combine with domain prefix 0x03.

Parameters:
Return type:

bytes

class arborist.merkle.ProofNode(hash, is_left)[source]#

Bases: object

One sibling step in a Merkle inclusion proof.

is_left=True means the sibling sits to the LEFT of the running hash, so verification order is: HashCombine(sibling, current).

Parameters:
hash: bytes#
is_left: bool#
class arborist.merkle.MerkleProof(leaf, leaf_index, siblings, root)[source]#

Bases: object

Merkle inclusion proof: leaf + index + sibling path to root.

Proves that a leaf at leaf_index exists in a tree with the given root. verify_proof() recomputes the root by combining the leaf with siblings.

Parameters:
leaf: bytes#
leaf_index: int#
siblings: tuple[ProofNode, ...]#
root: bytes#
class arborist.merkle.MerkleTree(layers=<factory>)[source]#

Bases: object

Layered tree. layers[0] = leaves, layers[-1] = [root].

Parameters:

layers (list[list[bytes]])

layers: list[list[bytes]]#
property root: bytes#

Root hash of the tree (topmost layer). Empty tree → ZERO_HASH.

property leaves: list[bytes]#

Leaf hashes in input order (bottom layer of tree).

classmethod build(leaves)[source]#

Build tree from leaf hashes. Applies odd-element self-duplication.

Parameters:

leaves (Iterable[bytes])

Return type:

MerkleTree

proof(leaf_index)[source]#

Generate inclusion proof for leaf_index. Raises IndexError if out of range.

Parameters:

leaf_index (int)

Return type:

MerkleProof

arborist.merkle.verify_proof(proof)[source]#

Recompute root from leaf + sibling path. Returns True iff matches.

Parameters:

proof (MerkleProof)

Return type:

bool

arborist.merkle.proof_to_dict(proof)[source]#

JSON-serializable form for storage in providence_cache.merkle_proof.

Parameters:

proof (MerkleProof)

Return type:

dict

arborist.merkle.proof_from_dict(d)[source]#

Deserialize a proof from JSON dict (inverse of proof_to_dict).

Parameters:

d (dict)

Return type:

MerkleProof

document#

Document and Chunk dataclasses + canonical chunkers.

Every Document carries a URI (identification, backtrack, cross-link) and content. Chunkers split content into byte-determined chunks before Merkle hashing. The chunker’s name is committed in chunking_version — changing chunker invalidates all prior cache records under v9.8 admissibility.

class arborist.document.Edge(edge_type, dst_uri, dst_root=None, anchor=None)[source]#

Bases: object

A cross-link from this document to another.

Parameters:
  • edge_type (str)

  • dst_uri (str)

  • dst_root (str | None)

  • anchor (str | None)

edge_type: str#
dst_uri: str#
dst_root: str | None = None#
anchor: str | None = None#
class arborist.document.Document(uri, content, source_type, title=None, edges=<factory>, extra=<factory>)[source]#

Bases: object

An ingestable document: URI + content + outbound edges.

Parameters:
uri: str#
content: str#
source_type: str#
title: str | None = None#
edges: list[Edge]#
extra: dict#
arborist.document.canonicalize(text)[source]#

Stable text normalization. Bumping this requires CANONICALIZATION_VERSION bump.

Parameters:

text (str)

Return type:

str

class arborist.document.Chunker(*args, **kwargs)[source]#

Bases: Protocol

A chunker splits canonicalized text into ordered chunks.

name: str#
split(text)[source]#
Parameters:

text (str)

Return type:

list[str]

class arborist.document.TokenChunker(tokens_per_chunk=512)[source]#

Bases: object

512-token chunker (whitespace-tokenized, byte-deterministic).

“Token” here means whitespace-separated unit, NOT a model BPE token. This avoids tokenizer-version drift in the chunking_version.

Parameters:

tokens_per_chunk (int)

name = 'tok-512-v1'#
split(text)[source]#
Parameters:

text (str)

Return type:

list[str]

class arborist.document.SentenceChunker[source]#

Bases: object

Sentence-aligned chunker (better for short docs like 2003 Wikipedia).

name = 'sent-v1'#
split(text)[source]#
Parameters:

text (str)

Return type:

list[str]

arborist.document.get_chunker(name=None)[source]#

Lookup chunker by name. Default = TokenChunker.

Parameters:

name (str | None)

Return type:

Chunker

wikitext#

Wikitext → base prose conversion.

Arborist stores raw MediaWiki wikitext in chunks.content so the link graph and original markup are recoverable from any page on demand. For LLM context and post-LLM faithfulness verification we need prose — a deterministic plain-text projection of the same chunk.

This module provides that projection. to_base(raw) is a pure function of its input plus BASE_VERSION: same wikitext → same prose, forever, as long as BASE_VERSION is unchanged.

Versioning protocol#

Bump BASE_VERSION whenever the algorithm changes. Callers fold BASE_VERSION into governance_policy_hash (via policy["base_version"] in arborist.qa.runner / arborist.qa.query) so a bump invalidates every prior providence-cache record’s 8-dim cache_key on the next lookup. No schema migration; the next ask re-derives against fresh prose.

Algorithm (wikitext-base-v1)#

  1. Parse with mwparserfromhell (handles nested templates, complex tables, and edge cases that pure regex mangles).

  2. Drop <ref>...</ref> and self-closing <ref ... /> tags. Citations are not quotable claims about the topic.

  3. Drop namespace-prefixed wikilinks: [[File:...]], [[Image:...]], [[Category:...]]. Image params (thumb|250px|...) and category tags are not prose; they’re metadata.

  4. strip_code(normalize=True, collapse=True) — converts surviving templates to empty, wikilinks to their display text, headers to bare text, bold/italic markers to plain text, HTML tags to inner text, HTML entities to characters, external links to anchor text.

  5. Whitespace pass: collapse runs of spaces/tabs, drop trailing space on lines, collapse 3+ newlines to 2.

Optional dependency. Install with pip install arborist[wikitext].

arborist.wikitext.to_base(raw, *, loss_collector=None)[source]#

Convert raw wikitext to base prose. Deterministic. Idempotent.

Empty / whitespace-only input returns "". Non-wikitext input (already-clean prose) round-trips unchanged modulo whitespace collapsing.

When loss_collector is provided, every drop / transform / normalize step records a LossEvent against the collector. The output bytes remain bit-identical to the no-collector path — LossReport is additive metadata, never a gate. Default None keeps the LLM-call path (verifier + query + runner) unchanged.

Parameters:
  • raw (str)

  • loss_collector (LossCollector | None)

Return type:

str

Return (target, display) for every wikilink in raw.

target is the link target (page title) with any #section fragment stripped. display is the visible text if the link uses [[Target|Display]] form, else None.

File / Image / Category namespace links are included — they’re the very signal the link graph wants. This is the “definition cloud” artifact: from any page, recover its full out-link set without re-parsing wikitext at query time.

Parameters:

raw (str)

Return type:

list[tuple[str, str | None]]


Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

  • TRUTH — First principles, math & science, open source code freely distributed.

  • FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.

  • HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.

  • LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.