Storage: SQLite schema and audit

v9.8 SQLite schema, audit chain, and cross-shard views.

store#

SQLite-backed v9.8 store.

Schema implements the Merkle-AGI v9.8 admissibility ledger:

8-dim providence_cache key (source_root, question_hash, model_profile_hash, conversation_hash, governance_policy_hash, schema_version, canonicalization_version, chunking_version)
falsification_state ∈ {live, failed, stale, quarantined}
audit_events append-only chain (event_hash chains via prev_event_hash)
documents.kind ∈ {surface, core} for layered compression
chunks.tier ∈ {hot, warm, cold} for reversible eviction
derivations table binds core docs back to source surface roots

The providence_cache layer is schema-only in Phase 0 — no Q&A inference yet.

arborist.store.invalidate_migration_cache(db_path)[source]#

Drop the memoized migration claim for one path. Call this after replacing the underlying file; the next connect() will re-run the full migration probe sequence.

Parameters: db_path (Path | str)
Return type: None

arborist.store.connect(db_path=PosixPath('/home/docs/.arborist/arborist.db'))[source]#

Open a writable connection, creating the parent dir + schema if needed.

Performance pragmas applied per-connection. Under WAL (set in the schema):

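synchronous=NORMAL skips the per-commit fsync. The database stays consistent, but commits made since the last WAL sync can roll back after a power loss or OS crash (an application crash alone loses nothing). SQLite auto-checkpoints when the WAL reaches about 1000 pages by default.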
cache_size=-65536 = 64 MB page cache (reduces re-reads).
temp_store=MEMORY keeps temp tables in RAM (no /tmp churn).
mmap_size=256 MB lets reads come from page-cache without read() syscalls.
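As a rough illustration of the pragmas listed above, the sketch below applies them with the standard-library sqlite3 module. The helper name open_tuned is hypothetical and not part of arborist; journal_mode=WAL is issued here only because the text says the schema sets it.

```python
import sqlite3


def open_tuned(db_path):
    """Open a connection and apply the per-connection pragmas described above.

    Hypothetical helper for illustration; not arborist's connect().
    """
    conn = sqlite3.connect(str(db_path))
    conn.execute("PRAGMA journal_mode=WAL")     # persistent; arborist sets this in its schema
    conn.execute("PRAGMA synchronous=NORMAL")   # no fsync on every commit
    conn.execute("PRAGMA cache_size=-65536")    # negative means KiB, so 64 MiB
    conn.execute("PRAGMA temp_store=MEMORY")    # temp tables and indices stay in RAM
    conn.execute("PRAGMA mmap_size=268435456")  # 256 MiB; some builds cap or ignore this
    return conn
```

Apart from journal_mode, these pragmas are per-connection settings, so they have to be reapplied on every open, which matches the "applied per-connection" wording above.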

Migration probes (executescript(SCHEMA_SQL) + the forward migrations) run once per (physical file, process). Subsequent connect() calls on the same shard skip migration entirely — see #000026 Phase 1.

busy_timeout is set on every connection (before the migration pass, so it covers that too): a peer mid-write — migration DDL, a transaction() block, append_audit’s own BEGIN IMMEDIATE — makes us wait rather than fail fast with database is locked. Without it, concurrent appenders that fail-and-retry can re-read a stale chain head and fork the audit chain (qa.db seq 7724/7725 was that bug); waiting + the BEGIN IMMEDIATE serialization fixes it.

Parameters: db_path (Path | str)
Return type: Connection

arborist.store.discover_shards(shards_dir)[source]#

Enumerate shard DB files in shards_dir. Returns sorted list of paths.

Parameters: shards_dir (Path | str)
Return type: list[Path]

arborist.store.connect_query(db_path=None, shards_dir=None)[source]#

Open a read-only-style connection that surfaces ALL shards as one DB.

If shards_dir is set, every *.db in it is ATTACHed and UNION ALL views are created over the standard tables so existing queries (SELECT * FROM documents) work unchanged across shards. Reads only — writes still go through connect() against a specific shard.

If shards_dir is None, returns a normal connect(db_path) for back-compat.

Parameters:

db_path (Path | str | None)
shards_dir (Path | str | None)

Return type:

Connection
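The ATTACH plus UNION ALL approach described for connect_query can be sketched roughly as follows. This is an assumption-laden illustration, not arborist's code: the function names, the shardN aliases, and the use of an in-memory main database are all hypothetical. TEMP views are used because SQLite only lets temporary views reference tables in attached databases.

```python
import sqlite3
from pathlib import Path


def discover(shards_dir):
    """Return every *.db file in shards_dir, sorted by path."""
    return sorted(Path(shards_dir).glob("*.db"))


def open_union(shards_dir, tables=("documents",)):
    """Attach each shard and expose one UNION ALL temp view per table.

    Hypothetical sketch. SQLite's default limit is 10 attached databases,
    so a real implementation needs a plan for larger shard sets.
    """
    conn = sqlite3.connect(":memory:")
    aliases = []
    for i, path in enumerate(discover(shards_dir)):
        alias = f"shard{i}"  # generated here, never user input
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (str(path),))
        aliases.append(alias)
    if aliases:
        for table in tables:
            selects = " UNION ALL ".join(
                f"SELECT * FROM {alias}.{table}" for alias in aliases
            )
            conn.execute(f"CREATE TEMP VIEW {table} AS {selects}")
    return conn
```

With this shape, an unqualified SELECT * FROM documents resolves to the temp view and returns rows from every attached shard.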

arborist.store.connect_readonly(db_path)[source]#

Open an existing DB read-only — mode=ro URI, no schema bootstrap, no migration probes.

For CLI paths that only walk data (audit-chain checks, per-shard counts): a read op must neither run DDL nor take a write lock, but connect() does both — it runs executescript(SCHEMA_SQL) plus the forward migrations on the first open of each file in a process. (That migration pass on a read-only walk is exactly what surfaced the fork_score_branches already exists crash.) Raises sqlite3.OperationalError if the file is missing or unreadable; callers iterating a shard set should catch and skip if they expect stragglers.

Parameters: db_path (Path | str)
Return type: Connection
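A minimal sketch of the mode=ro pattern, including the catch-and-skip handling suggested for shard walks. The helper names are hypothetical and the table name passed in is only an example; the real connect_readonly may differ in detail.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path):
    """Open an existing SQLite file read-only via a file: URI.

    No file is created and no DDL runs; a missing file raises
    sqlite3.OperationalError.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def count_rows_per_shard(paths, table):
    """Count rows of `table` in each shard, skipping shards that cannot be opened."""
    counts = {}
    for path in paths:
        try:
            conn = open_readonly(path)
        except sqlite3.OperationalError:
            continue  # straggler: missing or unreadable file
        try:
            counts[str(path)] = conn.execute(
                f"SELECT COUNT(*) FROM {table}"
            ).fetchone()[0]
        finally:
            conn.close()
    return counts
```

Any write attempted on such a connection fails with sqlite3.OperationalError, so a read-only walk cannot run DDL by accident.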

arborist.store.transaction(conn)[source]#

BEGIN IMMEDIATE / COMMIT / ROLLBACK around a block.

Parameters: conn (Connection)
Return type: Iterator[Connection]
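One plausible shape for such a context manager is sketched below. The name immediate_transaction is hypothetical, and the sketch assumes the connection was opened with isolation_level=None so that sqlite3 does not open implicit transactions of its own.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def immediate_transaction(conn):
    """BEGIN IMMEDIATE on entry; COMMIT on success, ROLLBACK on any exception."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")
```

Usage would look like `with immediate_transaction(conn): conn.execute(...)`. Taking the write lock at BEGIN, rather than at the first write, is what lets a read-then-write sequence inside the block run without another writer slipping in between.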

arborist.store.get_meta(conn, key)[source]#

Read a value from the per-DB meta table; None if missing.

Parameters:

conn (Connection)
key (str)

Return type:

str | None

arborist.store.set_meta(conn, key, value)[source]#

Upsert a (key, value) into meta. Caller wraps in a transaction.

Parameters:

conn (Connection)
key (str)
value (str)

Return type:

None
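For orientation, a key/value read and upsert pair along these lines could look like the sketch below. The meta(key, value) column names and the function names read_meta and write_meta are assumptions made for illustration; the ON CONFLICT upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3


def read_meta(conn, key):
    """Return the stored value for key, or None if the key is absent."""
    row = conn.execute("SELECT value FROM meta WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def write_meta(conn, key, value):
    """Insert or overwrite a (key, value) pair; the caller owns the transaction."""
    conn.execute(
        "INSERT INTO meta (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )
```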

arborist.store.latest_event_hash(conn)[source]#

Return the last event_hash in the audit chain, or None for genesis.

Parameters: conn (Connection)
Return type: str | None

arborist.store.chain_audit_events(prev_event_hash, events)[source]#

Compute the event_hash chain for a batch in pure Python.

Each event dict needs: event_type, body (dict), subject_root (str|None), ts (int). Returns (rows_for_executemany, last_event_hash). Insert with:

    executemany("INSERT INTO audit_events "
                "(event_hash, prev_event_hash, event_type, subject_root, body, ts) "
                "VALUES (?, ?, ?, ?, ?, ?)", rows)

All chain SHA-256s are computed locally — zero DB round-trips per event.

Parameters:

prev_event_hash (str | None)
events (list[dict])

Return type:

tuple[list[tuple], str | None]
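The chaining idea can be sketched in pure Python as below. The hash preimage (a compact JSON array of the previous hash and the event fields) is an assumption chosen for illustration, so hashes produced by this sketch will not match arborist's; only the row layout follows the INSERT column order given above.

```python
import hashlib
import json


def chain_events(prev_event_hash, events):
    """Return (rows, last_hash) where each row links to the hash before it.

    Hypothetical sketch: the preimage format is assumed, not arborist's.
    """
    rows = []
    head = prev_event_hash
    for ev in events:
        # Canonical JSON so the same body always hashes the same way.
        body_json = json.dumps(ev["body"], sort_keys=True, separators=(",", ":"))
        preimage = json.dumps(
            [head, ev["event_type"], ev.get("subject_root"), body_json, ev["ts"]],
            separators=(",", ":"),
        ).encode("utf-8")
        event_hash = hashlib.sha256(preimage).hexdigest()
        rows.append(
            (event_hash, head, ev["event_type"], ev.get("subject_root"), body_json, ev["ts"])
        )
        head = event_hash  # the next event chains off this one
    return rows, head
```

Because each hash depends only on the previous hash and the event itself, the whole batch is computed without touching the database, and an empty batch simply returns the incoming head unchanged.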

arborist.store.append_audit(conn, event_type, body, subject_root=None, ts=None)[source]#

Append one event to the audit chain. Returns the new event_hash (hex).

Atomic head-read + insert. If the connection is not already inside a transaction, the read of the current chain head and the INSERT run inside this call’s own BEGIN IMMEDIATE / COMMIT — so two concurrent appenders serialize on the write lock instead of both reading the same head and chaining off it (which forks the chain; qa.db seq 7724/7725 was exactly that, from two concurrent providence_burn writes). A caller already inside a transaction() gets the append folded into that unit. With connect()’s busy_timeout the loser waits rather than failing database is locked.

Convenience wrapper for one-off events. Bulk inserts should use chain_audit_events() + executemany() inside a transaction() for ~10x throughput on large batches (same serialization guarantee).

Parameters:

conn (Connection)
event_type (str)
body (dict)
subject_root (str | None)
ts (int | None)

Return type:

str
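The atomic head-read plus insert described above might be sketched as follows. Several details are assumptions made for the example: the seq ordering column, the DEMO_SCHEMA table definition, the hash preimage, and the names head_hash and append_event. The sketch expects a connection opened with isolation_level=None.

```python
import hashlib
import json
import sqlite3
import time

# Hypothetical table layout for the demo; arborist's real schema may differ.
DEMO_SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_events (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    event_hash TEXT UNIQUE, prev_event_hash TEXT,
    event_type TEXT, subject_root TEXT, body TEXT, ts INTEGER
)
"""


def head_hash(conn):
    """Return the newest event_hash, or None for an empty chain."""
    row = conn.execute(
        "SELECT event_hash FROM audit_events ORDER BY seq DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def append_event(conn, event_type, body, subject_root=None, ts=None):
    """Read the chain head and insert the next event under one write lock."""
    ts = int(time.time()) if ts is None else ts
    own_txn = not conn.in_transaction
    if own_txn:
        conn.execute("BEGIN IMMEDIATE")  # lock first, then read the head
    try:
        prev = head_hash(conn)
        body_json = json.dumps(body, sort_keys=True, separators=(",", ":"))
        preimage = json.dumps([prev, event_type, subject_root, body_json, ts])
        event_hash = hashlib.sha256(preimage.encode("utf-8")).hexdigest()
        conn.execute(
            "INSERT INTO audit_events "
            "(event_hash, prev_event_hash, event_type, subject_root, body, ts) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (event_hash, prev, event_type, subject_root, body_json, ts),
        )
        if own_txn:
            conn.execute("COMMIT")
    except BaseException:
        if own_txn:
            conn.execute("ROLLBACK")
        raise
    return event_hash
```

The ordering is the point: because the write lock is taken before the head is read, a second appender blocks (given a busy timeout) until the first has committed, and then reads the updated head instead of a stale one.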

arborist.store.stats(conn)[source]#

Quick landscape report.

Parameters: conn (Connection)
Return type: dict

ingest#

Ingest pipeline: Source -> normalize -> chunk -> merkle -> upsert.

Idempotent: re-ingesting the same Document is a no-op (document_root collision is the upsert key).

Performance shape (bulk-batched writer): each batch collapses ALL inserts across N docs into a small set of executemany() calls (one per table) instead of per-doc calls. Audit chain hashes are computed in pure Python via store.chain_audit_events, then inserted in one shot. With WAL+synchronous=NORMAL, the dominant cost shifts from Python<->C boundary crossings to actual SQLite work.
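To make the idempotent, batched write concrete, here is a small sketch that inserts a batch with a single executemany() call and derives the inserted and duplicate counts from it. The documents(document_root, uri) layout and the name flush_batch are assumptions; INSERT OR IGNORE is one way to get no-op behaviour on a key collision and may not be what arborist does internally.

```python
import sqlite3


def flush_batch(conn, docs):
    """Insert (document_root, uri) pairs; return (inserted, skipped_duplicate).

    Hypothetical sketch: relies on document_root being the primary key so a
    repeated root is silently ignored.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO documents (document_root, uri) VALUES (?, ?)",
        docs,
    )
    inserted = conn.total_changes - before  # ignored rows do not count as changes
    return inserted, len(docs) - inserted
```

Re-running the same batch yields zero inserts and counts every document as a duplicate, which is the no-op property described above.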

class arborist.ingest.IngestStats(seen: 'int' = 0, inserted: 'int' = 0, skipped_duplicate: 'int' = 0, chunks_total: 'int' = 0, edges_total: 'int' = 0)[source]#

Bases: object

Parameters:

seen (int)
inserted (int)
skipped_duplicate (int)
chunks_total (int)
edges_total (int)

seen: int = 0#

inserted: int = 0#

skipped_duplicate: int = 0#

chunks_total: int = 0#

edges_total: int = 0#

arborist.ingest.ingest_source(conn, source, chunker_name=None, limit=None, batch_size=200, resume=False, progress=None, loss_report_enabled=True, loss_report_excerpts=True, loss_report_max_excerpt_bytes=200)[source]#

Ingest every document the source yields. Returns counts.

resume=True reads the per-source high-water mark from this DB's meta table and asks the source to fast-forward past it. After each successful batch flush, the high-water mark is updated in meta. A killed process can pick up from the last flushed batch by re-running with --resume.

progress (optional) gets a tick(seen, inserted=…) call after each batch flush. Pass an arborist.progress.Progress for live stderr output.

Parameters:

conn (Connection)
source (Source)
chunker_name (str | None)
limit (int | None)
batch_size (int)
resume (bool)
progress (Progress | None)
loss_report_enabled (bool)
loss_report_excerpts (bool)
loss_report_max_excerpt_bytes (int)

Return type:

IngestStats

arborist.ingest.verify_random_sample(conn, n=10)[source]#

Sample N documents, regenerate Merkle proof for chunk 0, verify.

Parameters:

conn (Connection)
n (int)

Return type:

dict

evict#

Reversible eviction + rehydrate.

Implements the systematic-forgetting mechanic from the design philosophy:

evict_to_cold: surface chunks demote from hot to cold; content set to NULL, FTS5 row deleted. leaf_hash retained — identity preserved.
rehydrate: refetch URI through the same source pipeline, re-chunk with the original chunking_version, compare leaves and root. Match -> content restored, tier hot. Mismatch -> drift event in audit chain, providence records flipped to falsification_state='stale'. No content restored.

Cores never evict.

arborist.evict.evict_to_cold(conn, *, source_type=None, older_than_days=None, document_roots=None)[source]#

Demote matching surface chunks from hot to cold.

Cores are never evicted. Content is NULLed; FTS row removed.

Parameters:

conn (Connection)
source_type (str | None)
older_than_days (int | None)
document_roots (Iterable[str] | None)

Return type:

dict

arborist.evict.rehydrate(conn, document_root, *, fetcher=None)[source]#

Refetch URI, verify leaves, restore content if and only if root matches.

Returns a dict whose status is one of: unknown_document, nothing_to_do, source_not_rehydratable, fetch_failed, drift_detected, or rehydrated.

Parameters:

conn (Connection)
document_root (str)
fetcher (Callable[[str], str | None] | None)

Return type:

dict
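The evict and rehydrate cycle can be illustrated with a deliberately simplified sketch. It assumes a chunks(document_root, idx, leaf_hash, content, tier) table and a plain SHA-256 leaf hash, and it compares leaf hashes only; the FTS5 row handling, the Merkle root comparison, the audit drift event, and the falsification_state update described above are all left out. Function names are hypothetical.

```python
import hashlib
import sqlite3


def leaf_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def evict(conn, document_root):
    """Drop content for hot chunks but keep leaf_hash; return rows demoted."""
    cur = conn.execute(
        "UPDATE chunks SET content = NULL, tier = 'cold' "
        "WHERE document_root = ? AND tier = 'hot'",
        (document_root,),
    )
    return cur.rowcount


def rehydrate_chunks(conn, document_root, fresh_chunks):
    """Restore content only if every refetched chunk matches its stored leaf_hash."""
    stored = conn.execute(
        "SELECT idx, leaf_hash FROM chunks WHERE document_root = ? ORDER BY idx",
        (document_root,),
    ).fetchall()
    if [h for _, h in stored] != [leaf_hash(t) for t in fresh_chunks]:
        return "drift_detected"  # nothing is restored on mismatch
    conn.executemany(
        "UPDATE chunks SET content = ?, tier = 'hot' "
        "WHERE document_root = ? AND idx = ?",
        [(text, document_root, idx) for (idx, _), text in zip(stored, fresh_chunks)],
    )
    return "rehydrated"
```

Keeping leaf_hash through eviction is what makes the operation reversible: the stored hash is the yardstick that refetched content must match before it is written back.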

Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.
