Retrieval: FTS5 search and concepts
Full-text search, concept relations (synonym/rivalry overlay).
search
Search backends.
- class arborist.search.AuditMode(*values)
- STRICT = 'STRICT'
- HYBRID = 'HYBRID'
- UNGROUNDED = 'UNGROUNDED'
- class arborist.search.Hit(document_root: 'str', document_uri: 'str', chunk_idx: 'int', snippet: 'str', score: 'float', audit_mode: 'AuditMode', title: 'str | None' = None)
Bases: object
- Parameters:
- class arborist.search.SearchBackend(conn)
Bases: ABC
A search hook over the chunk store.
- Parameters:
conn (sqlite3.Connection)
- class arborist.search.FTS5Backend(conn)
Bases: SearchBackend
- Parameters:
conn (sqlite3.Connection)
- search(query, limit=20, extra_or_tokens=None)
Run FTS5 BM25 over chunk content.
extra_or_tokens (a synonym-expanded set) is passed through to the OR-mode fallback only. AND mode stays on the original query tokens, because adding synonyms there would relax the AND constraint and pull in noise. The intended caller is the retrieval pipeline that has already computed synonym_expand(qtokens); passing the result here keeps the OR-mode pool from missing topical synonym terms.
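The AND/OR split described above can be sketched as a pair of FTS5 MATCH expressions. This is a minimal illustration, not the FTS5Backend implementation: the helper names fts5_quote and build_match_queries are hypothetical, and only the FTS5 string-quoting rule (double quotes, with embedded quotes doubled) is taken from SQLite itself.

```python
def fts5_quote(token: str) -> str:
    """Quote a token as an FTS5 string; embedded double quotes are doubled."""
    return '"' + token.replace('"', '""') + '"'


def build_match_queries(query_tokens, extra_or_tokens=None):
    """Return (and_query, or_query) MATCH expressions (hypothetical helper).

    AND mode uses the original tokens only. The OR fallback pools the
    original tokens with the synonym-expanded extras.
    """
    and_query = " AND ".join(fts5_quote(t) for t in query_tokens)
    # Originals first, then extras in sorted order; dict.fromkeys de-duplicates
    # while keeping first-seen order.
    pool = list(dict.fromkeys([*query_tokens, *sorted(extra_or_tokens or ())]))
    or_query = " OR ".join(fts5_quote(t) for t in pool)
    return and_query, or_query


and_q, or_q = build_match_queries(["athlon", "cpu"], {"k7", "athlon"})
print(and_q)
print(or_q)
```

A caller would presumably try the AND expression first and fall back to the OR expression when it returns too few rows, ranking both with FTS5's bm25() auxiliary function; the fallback threshold is not stated in the source.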
sources
Source ABC.
Adding a new corpus to arborist = one new Source subclass. The Source contract is intentionally minimal: yield Document objects, one at a time.
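As a rough sketch of that contract, the following defines stand-in Source and Document types and one subclass. Everything here is an assumption for illustration: the real arborist classes, their field names, and the name of the iteration method are not shown in this page, so __iter__ and the Document fields below are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Document:
    """Hypothetical stand-in for arborist's Document."""
    uri: str
    title: str
    text: str
    edges: list = field(default_factory=list)  # outbound links


class Source(ABC):
    """Hypothetical stand-in for the Source ABC: yield Documents one at a time."""

    @abstractmethod
    def __iter__(self) -> Iterator[Document]:
        ...


class InMemorySource(Source):
    """A toy corpus backed by a dict of title -> text."""

    def __init__(self, pages: dict):
        self.pages = pages

    def __iter__(self) -> Iterator[Document]:
        for title, text in self.pages.items():
            yield Document(uri=f"mem://{title}", title=title, text=text)


for doc in InMemorySource({"A": "alpha", "B": "beta"}):
    print(doc.uri, doc.title)
```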
MediaWiki ‘cur’ table SQL dump source.
Handles 2003-era SQL dumps in bz2 format (e.g. 20030516_cur_tablesql.bz2). Yields one Document per non-redirect main-namespace article, with [[wikilinks]] extracted as outbound edges.
Implements a stream parser for MySQL extended INSERT syntax. The cur table schema for that era starts: cur_id, cur_namespace, cur_title, cur_text, … We rely on positional access for the first four columns.
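To make the positional-access idea concrete, here is a small sketch of splitting one MySQL extended INSERT statement into rows. It is not the arborist parser: it works on a whole statement held in memory rather than streaming from bz2, the function name iter_insert_rows is hypothetical, and it assumes no whitespace between fields.

```python
# Backslash escapes that map to a different character; any other escaped
# character (\\, \', \") stands for itself.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def iter_insert_rows(statement: str):
    """Yield each (...) tuple of an extended INSERT as a list of str or None."""
    i = statement.index("VALUES") + len("VALUES")
    n = len(statement)
    while True:
        while i < n and statement[i] != "(":  # skip to the next tuple
            i += 1
        if i >= n:
            return
        i += 1
        row = []
        while True:
            if statement[i] == "'":  # quoted string field
                i += 1
                buf = []
                while True:
                    ch = statement[i]
                    if ch == "\\":
                        nxt = statement[i + 1]
                        buf.append(ESCAPES.get(nxt, nxt))
                        i += 2
                    elif ch == "'" and statement[i + 1:i + 2] == "'":
                        buf.append("'")  # doubled quote inside a string
                        i += 2
                    elif ch == "'":
                        i += 1
                        break
                    else:
                        buf.append(ch)
                        i += 1
                row.append("".join(buf))
            else:  # unquoted field: number or NULL
                j = i
                while statement[j] not in ",)":
                    j += 1
                raw = statement[i:j]
                row.append(None if raw == "NULL" else raw)
                i = j
            if statement[i] == ",":
                i += 1  # more fields in this tuple
            else:
                i += 1  # consume ')'
                break
        yield row


stmt = "INSERT INTO cur VALUES (1,0,'Foo','It\\'s (a), b'),(2,1,'Bar',NULL);"
for row in iter_insert_rows(stmt):
    cur_id, cur_namespace, cur_title, cur_text = row[:4]  # positional access
    print(cur_id, cur_namespace, cur_title)
```

Tracking quote state character by character is what lets commas and parentheses inside article text pass through without splitting a field.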
- class arborist.sources.wikipedia.WikipediaSqlDump(path, *, table='cur', namespace=0, base_uri='https://en.wikipedia.org/wiki/', shard=None, start_id=0, encoding='latin-1')
Iterates a MediaWiki SQL table dump (cur or old), bz2 or plain.
Both cur (current snapshot) and old (revision history) tables share the first four column positions: id, namespace, title, text. The cur table has cur_is_redirect at position 10 (we skip redirects); old has no redirect flag (every revision is real).
Shard support: pass shard=(rank, total) and the source yields only docs whose 0-based index satisfies index % total == rank. Useful for spawning N parallel ingest processes against the same dump file — parser CPU runs in parallel, writes serialize at the WAL writer lock.
- Parameters:
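The shard rule (index % total == rank) can be sketched as a generator wrapper. The name shard_filter is hypothetical and this is only an illustration of the modulo selection, not the code inside WikipediaSqlDump.

```python
def shard_filter(docs, shard=None):
    """Yield only the docs whose 0-based index satisfies index % total == rank."""
    if shard is None:
        yield from docs
        return
    rank, total = shard
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} out of range for total {total}")
    for index, doc in enumerate(docs):
        if index % total == rank:
            yield doc


# Three workers over the same ten-item stream see disjoint slices.
for rank in range(3):
    print(rank, list(shard_filter(range(10), shard=(rank, 3))))
```

Every worker still parses the full dump to keep the index consistent, which matches the note above that parser CPU runs in parallel while writes serialize.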
- class arborist.sources.wikipedia.WikipediaCurDump(path, namespace=0, base_uri='https://en.wikipedia.org/wiki/')
Iterates a MediaWiki ‘cur’ table SQL dump.
- class arborist.sources.wikipedia.WikipediaOldDump(path, namespace=0, base_uri='https://en.wikipedia.org/wiki/')
Iterates a MediaWiki ‘old’ (revision history) table SQL dump.
HTML page source.
Fetches URLs, honors robots.txt automatically, strips noise (script/style/nav/footer/header), extracts main body text + outbound <a href> links as edges.
Optional dependency. Install with pip install arborist[html].
- arborist.sources.html_page.parse_html(url, html, source_type='html', *, loss_collector=None)
Pure parse function. Separated so tests can run without network.
When loss_collector is provided, drops from noise-selector decomposition (<script>, <style>, <nav>, etc.) and whitespace normalization are recorded against the collector. Output bytes remain bit-identical to the no-collector path. The default of None keeps existing callers stable.
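A network-free parse of this kind might look like the sketch below, which drops text inside noise elements, normalizes whitespace, and resolves outbound links against the page URL. It uses only the standard library's html.parser; the real parse_html sits behind the optional arborist[html] extra and may work differently. The function name extract_text_and_links is hypothetical, and loss recording is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

NOISE_TAGS = {"script", "style", "nav", "footer", "header"}


class _BodyExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.noise_depth = 0  # > 0 while inside any noise element
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a" and self.noise_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.text_parts.append(data)


def extract_text_and_links(url: str, html: str):
    """Return (body_text, outbound_links) for one page; no network access."""
    parser = _BodyExtractor(url)
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.text_parts).split())  # collapse whitespace
    return text, parser.links


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav><a href='/home'>Home</a></nav>"
    "<p>Hello <a href='/wiki/Foo'>Foo</a>   world.</p>"
    "<script>var x = 1;</script><footer>bye</footer></body></html>"
)
print(extract_text_and_links("https://example.org/page", page))
```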
- class arborist.sources.html_page.HtmlPageSource(urls, *, respect_robots=True, timeout=30.0, loss_report_enabled=True, loss_report_excerpts=True, loss_report_max_excerpt_bytes=200, default_author=None)
Iterates a list of URLs, fetching and parsing each as HTML.
- Parameters:
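What respect_robots=True implies can be illustrated with the standard library's urllib.robotparser. This is a sketch under assumptions: the helper allowed_urls and the user-agent string are hypothetical, the robots.txt lines are passed in directly so that nothing is fetched, and HtmlPageSource may implement the check differently.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_lines, user_agent="arborist-sketch"):
    """Keep only the URLs that the given robots.txt rules permit."""
    rules = RobotFileParser()
    rules.parse(robots_lines)  # parse rules already in memory
    return [u for u in urls if rules.can_fetch(user_agent, u)]


robots = ["User-agent: *", "Disallow: /private/"]
candidates = ["https://example.org/a", "https://example.org/private/b"]
print(allowed_urls(candidates, robots))
```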
concepts
Corpus-derived concept relations: synonyms, antonyms, rivalries, categories.
Replaces the hand-curated frozensets that lived in arborist.qa.concepts
through April 2026 (commit c6182ae). The frozensets were Phase 1; this is
Phase 2.
The concept-relations layer is a secondary index over the existing
Merkle-committed corpus. Writes to concept_relations NEVER affect
document_root, chunk_root, or cache_key — backfilling
relations is safe across the entire corpus without invalidating cached
answers or breaking audit chains.
Architecture:
- store.py: DB read/write helpers, append-only with UNIQUE-key idempotency
- extract.py: Extractor ABC + registry; per-source extractors plug in here
- query.py: Cross-shard synonyms_for(token) & rivalries_for(token)
- seed.py: One-time migration of the legacy frozensets to manual rows
Public API for retrieval-time use (matches the legacy
arborist.qa.concepts shape, so call sites in query.py keep
working):
synonym_expand(tokens, *, shards_dir) -> set[str]
rivalry_excluded(tokens, *, shards_dir, compare_phrasing=False) -> set[str]
has_compare_phrasing(question) -> bool
- arborist.concepts.add_concept_relation(conn, *, source_root, relation_kind, token, target, evidence_kind, confidence=1.0, derived_from=None, derived_at=None)
Append a concept relation. Returns True if a row was inserted, False if the (source_root, relation_kind, token, target, evidence_kind) tuple already existed (idempotent re-derivation).
Tokens are stored exactly as given — case preservation lets the query layer decide normalization. Substring lookup at retrieval time is case-insensitive via SQLite’s NOCASE comparator.
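The idempotent append and the case-insensitive lookup can be sketched with plain sqlite3. The table layout below is an assumption inferred from the function signature, not the actual arborist schema, and the helper names open_store, add_relation and relations_for_token are hypothetical. The lookup uses an equality match under COLLATE NOCASE.

```python
import sqlite3

# Assumed schema: the UNIQUE key mirrors the tuple named in the docstring.
SCHEMA = """
CREATE TABLE IF NOT EXISTS concept_relations (
    source_root   TEXT NOT NULL,
    relation_kind TEXT NOT NULL,
    token         TEXT NOT NULL,
    target        TEXT NOT NULL,
    evidence_kind TEXT NOT NULL,
    confidence    REAL NOT NULL DEFAULT 1.0,
    UNIQUE (source_root, relation_kind, token, target, evidence_kind)
)
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_relation(conn, *, source_root, relation_kind, token, target,
                 evidence_kind, confidence=1.0):
    """Insert one row; return True if inserted, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO concept_relations "
        "(source_root, relation_kind, token, target, evidence_kind, confidence) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source_root, relation_kind, token, target, evidence_kind, confidence),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 when the UNIQUE key caused the row to be ignored


def relations_for_token(conn, token, relation_kind=None):
    """Case-insensitive lookup; tokens stay stored exactly as given."""
    sql = ("SELECT token, target, relation_kind FROM concept_relations "
           "WHERE token = ? COLLATE NOCASE")
    params = [token]
    if relation_kind is not None:
        sql += " AND relation_kind = ?"
        params.append(relation_kind)
    return conn.execute(sql, params).fetchall()


conn = open_store()
row = dict(source_root="r1", relation_kind="synonym", token="Athlon",
           target="K7", evidence_kind="manual")
print(add_relation(conn, **row), add_relation(conn, **row))
print(relations_for_token(conn, "ATHLON"))
```

Leaning on the UNIQUE constraint means an extractor can be re-run over the same documents without producing duplicate rows.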
- arborist.concepts.concept_relations_for_token(conn, token, *, relation_kind=None)
Return all relations whose token matches (case-insensitive). If relation_kind is given, filter to that kind.
- Parameters:
conn (Connection)
token (str)
relation_kind (str | None)
- Return type:
- arborist.concepts.has_compare_phrasing(question)
True if the question contains comparison language.
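One plausible shape for such a check is a single case-insensitive regular expression. The phrase list below is purely an assumption for illustration; the source does not say which phrases arborist treats as comparison language, and looks_like_comparison is a hypothetical name.

```python
import re

# Assumed phrase list; the real set of comparison cues is not documented here.
_COMPARE_RE = re.compile(
    r"\b(?:vs\.?|versus|compar(?:e|ed|ing|ison)|difference between)\b",
    re.IGNORECASE,
)


def looks_like_comparison(question: str) -> bool:
    """True if the question contains one of the assumed comparison cues."""
    return _COMPARE_RE.search(question) is not None


print(looks_like_comparison("Athlon vs Pentium"))
print(looks_like_comparison("What is an Athlon?"))
```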
- arborist.concepts.invalidate_cache()
Drop all cached indices. Call after a writer commits new rows (the mtime check would catch this on next read, but invalidating explicitly is faster on the same-process write+read pattern).
- Return type:
None
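The interplay between the mtime check and explicit invalidation can be sketched as follows. This is an assumption-laden illustration: the cache layout, the load_index helper, and the loader callback are hypothetical, and only the idea (reuse an index while the file's mtime is unchanged, or drop everything on request) comes from the description above.

```python
import os

_INDEX_CACHE = {}  # path -> (mtime_ns, index)


def load_index(path, loader):
    """Return the cached index for path, rebuilding it if the file changed."""
    mtime = os.stat(path).st_mtime_ns
    cached = _INDEX_CACHE.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    index = loader(path)  # expensive rebuild
    _INDEX_CACHE[path] = (mtime, index)
    return index


def invalidate_cache():
    """Drop all cached indices, forcing a rebuild on the next read."""
    _INDEX_CACHE.clear()
```

Explicit invalidation skips the stat-and-compare path, which is the same-process write-then-read case the docstring mentions.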
- arborist.concepts.purge_by_evidence_kind(conn, evidence_kind, *, derived_from=None)
Delete every row with the given evidence_kind (and optional derived_from). Returns the number of rows removed.
The intended use: revoke a buggy extractor’s output cleanly. Manual rows live under evidence_kind='manual' and are NOT touched by a purge of any other kind.
- Parameters:
conn (Connection)
evidence_kind (str)
derived_from (str | None)
- Return type:
int
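A purge of this kind reduces to one parameterized DELETE. The sketch below assumes a concept_relations table with evidence_kind and derived_from columns (inferred from the signature, not confirmed by the source) and uses the hypothetical name purge_by_kind.

```python
import sqlite3


def purge_by_kind(conn, evidence_kind, derived_from=None):
    """Delete rows of one evidence kind; return how many were removed."""
    sql = "DELETE FROM concept_relations WHERE evidence_kind = ?"
    params = [evidence_kind]
    if derived_from is not None:
        sql += " AND derived_from = ?"
        params.append(derived_from)
    cur = conn.execute(sql, params)
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_relations (token TEXT, evidence_kind TEXT, derived_from TEXT)"
)
conn.executemany(
    "INSERT INTO concept_relations VALUES (?, ?, ?)",
    [("a", "manual", None), ("b", "wikilink", "v1"), ("c", "wikilink", "v2")],
)
print(purge_by_kind(conn, "wikilink", derived_from="v1"))
```

Because the WHERE clause always filters on evidence_kind, rows tagged 'manual' survive a purge of any other kind.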
- arborist.concepts.rivalry_excluded(tokens, *, shards_dir=None, compare_phrasing=False)
Tokens whose presence in a doc title means EXCLUDE that doc.
For each rivalry pair (A, B): if exactly ONE side appears in the query AND no comparison language was used, exclude the OTHER side’s tokens. If both sides appear, or if the user asked for a comparison, no exclusion (they wanted both).
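The exclusion rule can be written out over an in-memory list of rivalry pairs. This is a sketch only: the real function reads relations from shards, whereas rivalry_exclusions below takes the pairs as an argument, and both its name and the pair representation (two sets of lowercase tokens) are assumptions.

```python
def rivalry_exclusions(query_tokens, rivalry_pairs, compare_phrasing=False):
    """Return tokens to exclude, given (side_a, side_b) pairs of token sets."""
    if compare_phrasing:
        return set()  # the user asked for a comparison, so keep both sides
    query = {t.lower() for t in query_tokens}
    excluded = set()
    for side_a, side_b in rivalry_pairs:
        hit_a = bool(query & side_a)
        hit_b = bool(query & side_b)
        if hit_a and not hit_b:
            excluded |= side_b  # only A was mentioned: drop B's tokens
        elif hit_b and not hit_a:
            excluded |= side_a  # only B was mentioned: drop A's tokens
        # both or neither mentioned: no exclusion for this pair
    return excluded


pairs = [({"intel", "pentium"}, {"amd", "athlon"})]
print(rivalry_exclusions(["Athlon", "benchmark"], pairs))
print(rivalry_exclusions(["athlon", "pentium"], pairs))
```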
- arborist.concepts.synonym_expand(tokens, *, shards_dir=None, max_neighbors_per_token=8, max_total=50)
Add direct synonym neighbors for any input token whose degree is bounded enough that its neighbors are likely topical, not topic-adjacency noise.
Two caps protect retrieval performance & quality:
- max_neighbors_per_token: tokens with more direct neighbors than this contribute NO expansion. Generic tokens (“person”, “thoughts”) have huge degree in the Wikipedia reciprocal-link graph; expanding them dumps random topical-cluster noise. Specific named entities (“athlon”, “telepathy”) have small focused neighborhoods that pass the cap.
- max_total: overall cap on expanded set size. Bounds the SQL clause count downstream. Original query tokens are always preserved; if total > cap, neighbors are sorted alphabetically & truncated.
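Both caps can be shown on a toy neighbor graph. The sketch below assumes the synonym graph is available as a dict of token -> set of neighbors, which stands in for the cross-shard lookup; the function name expand_with_caps is hypothetical.

```python
def expand_with_caps(tokens, neighbors, max_neighbors_per_token=8, max_total=50):
    """Expand tokens with direct neighbors, applying the two caps."""
    original = set(tokens)
    added = set()
    for token in original:
        direct = neighbors.get(token, set())
        if len(direct) > max_neighbors_per_token:
            continue  # high-degree (generic) token: contributes no expansion
        added |= set(direct) - original
    # Originals are always kept; neighbors fill the remaining room in
    # alphabetical order.
    room = max(max_total - len(original), 0)
    return original | set(sorted(added)[:room])


graph = {
    "athlon": {"k7", "duron"},
    "person": {f"topic{i}" for i in range(20)},  # generic, over the degree cap
}
print(sorted(expand_with_caps({"athlon", "person"}, graph)))
print(sorted(expand_with_caps({"athlon"}, graph, max_total=2)))
```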
Permacomputer Preamble — License: AGPL-3.0-only
This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.
Our permacomputer is community-owned infrastructure optimized around four values:
TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.
NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.