Storage: SQLite schema and audit
v9.8 SQLite schema, audit chain, and cross-shard views.
store
SQLite-backed v9.8 store.
Schema implements the Merkle-AGI v9.8 admissibility ledger:
8-dim providence_cache key (source_root, question_hash, model_profile_hash, conversation_hash, governance_policy_hash, schema_version, canonicalization_version, chunking_version)
falsification_state ∈ {live, failed, stale, quarantined}
audit_events append-only chain (event_hash chains via prev_event_hash)
documents.kind ∈ {surface, core} for layered compression
chunks.tier ∈ {hot, warm, cold} for reversible eviction
derivations table binds core docs back to source surface roots
The providence_cache layer is schema-only in Phase 0 — no Q&A inference yet.
- arborist.store.invalidate_migration_cache(db_path)
Drop the memoized migration claim for one path. Call this after replacing the underlying file; the next connect() will re-run the full migration probe sequence.
- arborist.store.connect(db_path=PosixPath('/home/docs/.arborist/arborist.db'))
Open a writable connection, creating the parent dir + schema if needed.
Performance pragmas applied per-connection. Under WAL (set in the schema):
synchronous=NORMAL skips the per-commit fsync; durable up to the last checkpoint (SQLite auto-checkpoints at WAL ~1000 frames).
cache_size=-65536 = 64 MB page cache (reduces re-reads).
temp_store=MEMORY keeps temp tables in RAM (no /tmp churn).
mmap_size=256 MB lets reads come from page-cache without read() syscalls.
Migration probes (executescript(SCHEMA_SQL) + the forward migrations) run once per (physical file, process). Subsequent connect() calls on the same shard skip migration entirely; see #000026 Phase 1.
busy_timeout is set on every connection (before the migration pass, so it covers that too): a peer mid-write (migration DDL, a transaction() block, or append_audit's own BEGIN IMMEDIATE) makes us wait rather than fail fast with "database is locked". Without it, concurrent appenders that fail and retry can re-read a stale chain head and fork the audit chain (qa.db seq 7724/7725 was that bug); waiting plus the BEGIN IMMEDIATE serialization fixes it.
- Parameters:
- Return type:
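The per-connection pragmas described for connect() can be sketched with the standard sqlite3 module. This is a minimal illustration under stated assumptions, not arborist's actual implementation: the helper name open_tuned and the 5000 ms timeout are hypothetical, and WAL is switched on directly here, whereas the text says the real schema sets it.

```python
import sqlite3
from pathlib import Path


def open_tuned(db_path: Path, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Hypothetical helper: open db_path and apply the pragmas listed above."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    # Set the busy timeout first so every later statement waits on a locked DB
    # instead of failing immediately. The 5000 ms value is an assumption.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    conn.execute("PRAGMA journal_mode = WAL")    # assumption: done here, not in a schema script
    conn.execute("PRAGMA synchronous = NORMAL")  # no fsync on every commit under WAL
    conn.execute("PRAGMA cache_size = -65536")   # negative means KiB, so 64 MiB
    conn.execute("PRAGMA temp_store = MEMORY")   # temp tables and indices stay in RAM
    # 256 MiB memory map; the effective value may be capped by the SQLite build.
    conn.execute(f"PRAGMA mmap_size = {256 * 1024 * 1024}")
    return conn
```

Reading a pragma back (for example `conn.execute("PRAGMA synchronous").fetchone()`) is a cheap way to confirm the setting took effect on a given build.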
- arborist.store.discover_shards(shards_dir)
Enumerate shard DB files in shards_dir. Returns sorted list of paths.
- arborist.store.connect_query(db_path=None, shards_dir=None)
Open a read-only-style connection that surfaces ALL shards as one DB.
If shards_dir is set, every *.db in it is ATTACHed and UNION ALL views are created over the standard tables so existing queries (SELECT * FROM documents) work unchanged across shards. Reads only — writes still go through connect() against a specific shard.
If shards_dir is None, returns a normal connect(db_path) for back-compat.
- Parameters:
- Return type:
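The ATTACH plus UNION ALL idea behind connect_query() can be sketched as follows. The function name open_union_view, the shardN aliases, and the choice of an in-memory main database are assumptions made for illustration; only the documents table name comes from the text above. The view is created as a TEMP view because SQLite only lets temp-schema views reference tables in attached databases.

```python
import sqlite3
from pathlib import Path


def open_union_view(shards_dir: Path, table: str = "documents") -> sqlite3.Connection:
    """Hypothetical sketch: expose every *.db in shards_dir as one logical table."""
    conn = sqlite3.connect(":memory:")  # assumption: empty main DB, shards attached to it
    selects = []
    for i, path in enumerate(sorted(shards_dir.glob("*.db"))):
        alias = f"shard{i}"
        # The file name can be a bound parameter; the schema alias cannot.
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (str(path),))
        selects.append(f"SELECT * FROM {alias}.{table}")
    if selects:
        # `table` is interpolated, so it must be a trusted identifier.
        conn.execute(f"CREATE TEMP VIEW {table} AS " + " UNION ALL ".join(selects))
    return conn
```

A stock SQLite build allows at most 10 attached databases per connection, so a sketch like this would need a different strategy for larger shard sets.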
- arborist.store.connect_readonly(db_path)
Open an existing DB read-only: mode=ro URI, no schema bootstrap, no migration probes.
For CLI paths that only walk data (audit-chain checks, per-shard counts): a read op must neither run DDL nor take a write lock, but connect() does both, because it runs executescript(SCHEMA_SQL) plus the forward migrations on the first open of each file in a process. (That migration pass on a read-only walk is exactly what surfaced the "fork_score_branches already exists" crash.)
Raises sqlite3.OperationalError if the file is missing or unreadable; callers iterating a shard set should catch and skip if they expect stragglers.
- Parameters:
- Return type:
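A read-only open with a mode=ro URI, plus the catch-and-skip pattern suggested for shard walks, might look like the sketch below. The names open_readonly and count_rows are hypothetical and this is not the library's own code; it only uses the standard sqlite3 URI support.

```python
import sqlite3
from pathlib import Path
from typing import Dict, Iterable


def open_readonly(db_path: Path) -> sqlite3.Connection:
    """Hypothetical sketch: open an existing file read-only, never creating it."""
    # as_uri() yields a percent-encoded file: URI; mode=ro forbids writes and creation.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def count_rows(paths: Iterable[Path], table: str) -> Dict[Path, int]:
    """Count rows per shard, skipping shards that cannot be opened or read."""
    counts: Dict[Path, int] = {}
    for path in paths:
        try:
            conn = open_readonly(path)
        except sqlite3.OperationalError:
            continue  # missing file: skip the straggler
        try:
            # `table` is interpolated, so it must be a trusted identifier.
            counts[path] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        except sqlite3.DatabaseError:
            pass  # unreadable or not a database: also skipped
        finally:
            conn.close()
    return counts
```

Because mode=ro never creates the file, a missing shard fails at connect time rather than leaving an empty database behind.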
- arborist.store.transaction(conn)
BEGIN IMMEDIATE / COMMIT / ROLLBACK around a block.
- Parameters:
conn (Connection)
- Return type:
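A BEGIN IMMEDIATE / COMMIT / ROLLBACK wrapper is commonly written as a context manager. The sketch below shows that general shape and should not be read as arborist's implementation; the name immediate_transaction is hypothetical, and it assumes the connection was opened with isolation_level=None so that Python's sqlite3 module does not open transactions of its own.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def immediate_transaction(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Hypothetical sketch: take the write lock up front, commit or roll back at exit."""
    # BEGIN IMMEDIATE acquires the write lock now, so a competing writer waits
    # (subject to busy_timeout) before this block reads anything.
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")  # undo everything done inside the block
        raise
    else:
        conn.execute("COMMIT")
```

Usage is `with immediate_transaction(conn): ...`; any exception raised inside the block leaves the database unchanged.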
- arborist.store.get_meta(conn, key)
Read a value from the per-DB meta table; None if missing.
- Parameters:
conn (Connection)
key (str)
- Return type:
str | None
- arborist.store.set_meta(conn, key, value)
Upsert a (key, value) into meta. Caller wraps in a transaction.
- Parameters:
conn (Connection)
key (str)
value (str)
- Return type:
None
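The read and upsert pair for the meta table can be sketched as below. The column layout meta(key TEXT PRIMARY KEY, value TEXT) and the sketch_ function names are assumptions; the real schema is not shown in this section. The ON CONFLICT upsert form needs SQLite 3.24 or newer.

```python
import sqlite3
from typing import Optional

# Assumed layout; the actual arborist meta table may differ.
META_DDL = "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT NOT NULL)"


def sketch_get_meta(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Return the stored value, or None when the key is absent."""
    row = conn.execute("SELECT value FROM meta WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def sketch_set_meta(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Insert or overwrite one key. The caller owns the surrounding transaction."""
    conn.execute(
        "INSERT INTO meta (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )
```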
- arborist.store.latest_event_hash(conn)
Return the last event_hash in the audit chain, or None for genesis.
- Parameters:
conn (Connection)
- Return type:
str | None
- arborist.store.chain_audit_events(prev_event_hash, events)
Compute the event_hash chain for a batch in pure Python.
Each event dict needs: event_type, body (dict), subject_root (str|None), ts (int). Returns (rows_for_executemany, last_event_hash). Insert with:
    executemany("INSERT INTO audit_events
        (event_hash, prev_event_hash, event_type, subject_root,
         body, ts) VALUES (?, ?, ?, ?, ?, ?)", rows)
All chain SHA-256s are computed locally — zero DB round-trips per event.
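To illustrate how a batch can be chained without touching the database, here is a minimal sketch. The hash preimage (a compact JSON array of the previous hash and the event fields) is an assumption; the section does not specify how arborist canonicalizes events, so hashes from this sketch will not match a real ledger. The name chain_events is likewise hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[str, Optional[str], str, Optional[str], str, int]


def chain_events(
    prev_event_hash: Optional[str], events: List[Dict[str, Any]]
) -> Tuple[List[Row], Optional[str]]:
    """Hypothetical sketch: link each event to its predecessor by SHA-256."""
    rows: List[Row] = []
    prev = prev_event_hash
    for ev in events:
        # Sorted keys and fixed separators make the body serialization deterministic.
        body_json = json.dumps(ev["body"], sort_keys=True, separators=(",", ":"))
        # ASSUMED preimage layout; the real canonical form is not documented here.
        preimage = json.dumps(
            [prev, ev["event_type"], ev.get("subject_root"), body_json, ev["ts"]],
            separators=(",", ":"),
        ).encode("utf-8")
        event_hash = hashlib.sha256(preimage).hexdigest()
        # Column order matches the INSERT statement shown above.
        rows.append((event_hash, prev, ev["event_type"], ev.get("subject_root"), body_json, ev["ts"]))
        prev = event_hash
    return rows, prev
```

Because every hash depends on the one before it, altering any earlier event changes all later hashes, which is what makes the chain tamper-evident.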
- arborist.store.append_audit(conn, event_type, body, subject_root=None, ts=None)
Append one event to the audit chain. Returns the new event_hash (hex).
Atomic head-read + insert. If the connection is not already inside a transaction, the read of the current chain head and the INSERT run inside this call's own BEGIN IMMEDIATE/COMMIT, so two concurrent appenders serialize on the write lock instead of both reading the same head and chaining off it (which forks the chain; qa.db seq 7724/7725 was exactly that, from two concurrent providence_burn writes). A caller already inside a transaction() gets the append folded into that unit. With connect()'s busy_timeout the loser waits rather than failing with "database is locked".
Convenience wrapper for one-off events. Bulk inserts should use chain_audit_events() + executemany() inside a transaction() for ~10x throughput on large batches (same serialization guarantee).
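The head-read plus insert under one write lock can be sketched as follows. Everything schema-related here is an assumption made so the example runs on its own: the seq column used to find the chain head, the DDL, the hash preimage, and the name append_event. It also assumes a connection opened with isolation_level=None.

```python
import hashlib
import json
import sqlite3
import time
from typing import Any, Dict, Optional

# Assumed table layout; only the six inserted column names come from the docs above.
AUDIT_DDL = """CREATE TABLE IF NOT EXISTS audit_events (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    event_hash TEXT NOT NULL UNIQUE, prev_event_hash TEXT,
    event_type TEXT NOT NULL, subject_root TEXT, body TEXT NOT NULL, ts INTEGER NOT NULL)"""


def append_event(conn: sqlite3.Connection, event_type: str, body: Dict[str, Any],
                 subject_root: Optional[str] = None, ts: Optional[int] = None) -> str:
    """Hypothetical sketch: read the chain head and insert under one write lock."""
    ts = int(time.time()) if ts is None else ts
    own_txn = not conn.in_transaction  # fold into the caller's transaction if one is open
    if own_txn:
        conn.execute("BEGIN IMMEDIATE")  # lock first, then read the head
    try:
        row = conn.execute(
            "SELECT event_hash FROM audit_events ORDER BY seq DESC LIMIT 1").fetchone()
        prev = row[0] if row else None
        body_json = json.dumps(body, sort_keys=True, separators=(",", ":"))
        # ASSUMED preimage layout, for illustration only.
        preimage = json.dumps([prev, event_type, subject_root, body_json, ts]).encode("utf-8")
        event_hash = hashlib.sha256(preimage).hexdigest()
        conn.execute(
            "INSERT INTO audit_events (event_hash, prev_event_hash, event_type, "
            "subject_root, body, ts) VALUES (?, ?, ?, ?, ?, ?)",
            (event_hash, prev, event_type, subject_root, body_json, ts))
    except BaseException:
        if own_txn:
            conn.execute("ROLLBACK")
        raise
    if own_txn:
        conn.execute("COMMIT")
    return event_hash
```

The key design point is ordering: the write lock is taken before the head is read, so a second appender cannot observe the same head until the first has committed.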
- arborist.store.stats(conn)
Quick landscape report.
- Parameters:
conn (Connection)
- Return type:
ingest
Ingest pipeline: Source -> normalize -> chunk -> merkle -> upsert.
Idempotent: re-ingesting the same Document is a no-op (document_root collision is the upsert key).
- Performance shape — bulk-batched writer:
Each batch collapses ALL inserts across N docs into a small set of executemany() calls (one per table) instead of per-doc calls. Audit chain hashes computed in pure Python via store.chain_audit_events, then inserted in one shot. With WAL+synchronous=NORMAL, the dominant cost shifts from Python<->C boundary crossings to actual SQLite work.
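The batched, idempotent write pattern described above can be illustrated with a small sketch. The two-table layout (documents and chunks with these columns), the dict shape of each document, and the name flush_batch are all assumptions; the real pipeline also writes edges and audit events, which are omitted here.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

# Assumed minimal schema, for illustration only.
INGEST_DDL = """
CREATE TABLE IF NOT EXISTS documents (document_root TEXT PRIMARY KEY, kind TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS chunks (document_root TEXT, idx INTEGER, content TEXT,
                                   PRIMARY KEY (document_root, idx));
"""


def flush_batch(conn: sqlite3.Connection, batch: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Hypothetical sketch: write a batch with one executemany() per table.

    Returns (inserted, skipped_duplicate).
    """
    if not batch:
        return 0, 0
    roots = [d["document_root"] for d in batch]
    marks = ",".join("?" * len(roots))
    existing = {r[0] for r in conn.execute(
        f"SELECT document_root FROM documents WHERE document_root IN ({marks})", roots)}
    # Keep only unseen roots; setdefault also collapses duplicates within the batch.
    fresh: Dict[str, Dict[str, Any]] = {}
    for d in batch:
        if d["document_root"] not in existing:
            fresh.setdefault(d["document_root"], d)
    doc_rows = [(root, d["kind"]) for root, d in fresh.items()]
    chunk_rows = [(root, i, text) for root, d in fresh.items()
                  for i, text in enumerate(d["chunks"])]
    with conn:  # one transaction per batch: commit on success, roll back on error
        conn.executemany("INSERT INTO documents (document_root, kind) VALUES (?, ?)", doc_rows)
        conn.executemany(
            "INSERT INTO chunks (document_root, idx, content) VALUES (?, ?, ?)", chunk_rows)
    return len(doc_rows), len(batch) - len(doc_rows)
```

Re-running the same batch finds every document_root already present, so both executemany() calls receive empty row lists and nothing is written.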
- class arborist.ingest.IngestStats(seen: 'int' = 0, inserted: 'int' = 0, skipped_duplicate: 'int' = 0, chunks_total: 'int' = 0, edges_total: 'int' = 0)
Bases:
object
- arborist.ingest.ingest_source(conn, source, chunker_name=None, limit=None, batch_size=200, resume=False, progress=None, loss_report_enabled=True, loss_report_excerpts=True, loss_report_max_excerpt_bytes=200)
Ingest every document the source yields. Returns counts.
resume=True reads the per-source high-water mark from this DB's meta table and asks the source to fast-forward past it. After each successful batch flush, the high-water mark is updated in meta. A killed process can rsync forward by re-running with --resume.
progress (optional) gets a tick(seen, inserted=…) call after each batch flush. Pass an arborist.progress.Progress for live stderr output.
- Parameters:
- Return type:
- arborist.ingest.verify_random_sample(conn, n=10)
Sample N documents, regenerate Merkle proof for chunk 0, verify.
- Parameters:
conn (Connection)
n (int)
- Return type:
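Regenerating and checking a Merkle proof for one chunk can be sketched in pure Python. The tree construction here (SHA-256 over the concatenated children, last node duplicated on odd levels, no domain separation) is an assumption chosen for brevity; arborist's actual tree rules are not described in this section and may differ.

```python
import hashlib
from typing import List, Tuple


def sha256_bytes(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """Build all tree levels, leaves first. ASSUMED rule: duplicate the last node if odd."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # padded copy; the stored level stays unpadded
        levels.append([sha256_bytes(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def merkle_root(leaves: List[bytes]) -> bytes:
    return merkle_levels(leaves)[-1][0]


def merkle_proof(leaves: List[bytes], index: int) -> List[Tuple[bytes, bool]]:
    """Return (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    for level in merkle_levels(leaves)[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1  # the neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Fold the proof into the leaf and compare the result with the expected root."""
    acc = leaf
    for sibling, sibling_is_left in proof:
        acc = sha256_bytes(sibling + acc) if sibling_is_left else sha256_bytes(acc + sibling)
    return acc == root
```

A spot check in this style would hash the stored content of chunk 0, build its proof from the stored leaf hashes, and compare the folded result against the recorded document_root.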
evict
Reversible eviction + rehydrate.
Implements the systematic-forgetting mechanic from the design philosophy:
evict_to_cold: surface chunks demote from hot to cold; content set to NULL, FTS5 row deleted. leaf_hash retained — identity preserved.
rehydrate: refetch URI through the same source pipeline, re-chunk with the original chunking_version, compare leaves and root. Match -> content restored, tier hot. Mismatch -> drift event in audit chain, providence records flipped to falsification_state='stale'. No content restored.
Cores never evict.
- arborist.evict.evict_to_cold(conn, *, source_type=None, older_than_days=None, document_roots=None)
Demote matching surface chunks from hot to cold.
Cores are never evicted. Content is NULLed; FTS row removed.
- arborist.evict.rehydrate(conn, document_root, *, fetcher=None)
Refetch URI, verify leaves, restore content if and only if root matches.
Returns a dict whose status is one of: unknown_document, nothing_to_do, source_not_rehydratable, fetch_failed, drift_detected, or rehydrated.
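The evict-then-verify-on-restore cycle can be sketched on a toy chunks table. The table layout, the leaf hash being a plain SHA-256 of the chunk text, and the function names are assumptions; the sketch also leaves out the FTS5 row deletion, the audit-chain drift event, and the providence staleness flip that the text describes, and it returns a bare status string rather than a dict.

```python
import hashlib
import sqlite3
from typing import List

# Assumed toy layout; the real chunks table has more columns.
CHUNKS_DDL = """CREATE TABLE IF NOT EXISTS chunks (
    document_root TEXT, idx INTEGER, leaf_hash TEXT NOT NULL,
    content TEXT, tier TEXT NOT NULL DEFAULT 'hot',
    PRIMARY KEY (document_root, idx))"""


def leaf_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()  # ASSUMED leaf rule


def evict_document(conn: sqlite3.Connection, document_root: str) -> int:
    """Drop content but keep leaf_hash, so identity survives eviction."""
    cur = conn.execute(
        "UPDATE chunks SET content = NULL, tier = 'cold' "
        "WHERE document_root = ? AND tier = 'hot'", (document_root,))
    return cur.rowcount


def rehydrate_document(conn: sqlite3.Connection, document_root: str,
                       refetched_chunks: List[str]) -> str:
    """Restore content only if every refetched leaf matches the stored hash."""
    rows = conn.execute(
        "SELECT idx, leaf_hash, tier FROM chunks WHERE document_root = ? ORDER BY idx",
        (document_root,)).fetchall()
    if not any(tier == "cold" for _, _, tier in rows):
        return "nothing_to_do"
    # Equal leaves in equal order imply an equal root for a deterministic tree,
    # so this sketch skips the separate root comparison.
    if [leaf_hash(t) for t in refetched_chunks] != [h for _, h, _ in rows]:
        return "drift_detected"  # nothing is restored on mismatch
    conn.executemany(
        "UPDATE chunks SET content = ?, tier = 'hot' WHERE document_root = ? AND idx = ?",
        [(text, document_root, idx) for (idx, _, _), text in zip(rows, refetched_chunks)])
    return "rehydrated"
```

Keeping leaf_hash through eviction is what makes the restore verifiable: the refetched source either reproduces the recorded hashes exactly or is rejected as drift.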
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.