Retrieval: FTS5 search and concepts#

Full-text search, concept relations (synonym/rivalry overlay).

search#

Search backends.

class arborist.search.AuditMode(*values)[source]#

Bases: str, Enum

STRICT = 'STRICT'#

HYBRID = 'HYBRID'#

UNGROUNDED = 'UNGROUNDED'#
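Because AuditMode derives from both str and Enum, its members behave as ordinary strings. The sketch below uses a local stand-in that mirrors the three documented members (it is not imported from arborist) to show what that implies for comparison and JSON encoding.

```python
import json
from enum import Enum


class AuditMode(str, Enum):  # stand-in mirroring the documented members
    STRICT = "STRICT"
    HYBRID = "HYBRID"
    UNGROUNDED = "UNGROUNDED"


# Members compare equal to plain strings and can be looked up by value.
is_equal = AuditMode.STRICT == "STRICT"
looked_up = AuditMode("HYBRID")

# Being real str instances, they serialize without a custom encoder.
encoded = json.dumps({"audit_mode": AuditMode.UNGROUNDED})
```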

class arborist.search.Hit(document_root: 'str', document_uri: 'str', chunk_idx: 'int', snippet: 'str', score: 'float', audit_mode: 'AuditMode', title: 'str | None' = None)[source]#

Bases: object

Parameters:

document_root (str)
document_uri (str)
chunk_idx (int)
snippet (str)
score (float)
audit_mode (AuditMode)
title (str | None)

document_root: str#

document_uri: str#

chunk_idx: int#

snippet: str#

score: float#

audit_mode: AuditMode#

title: str | None = None#

class arborist.search.SearchBackend(conn)[source]#

Bases: ABC

A search hook over the chunk store.

Parameters:

conn (sqlite3.Connection)

name: str#

audit_mode: AuditMode#

abstractmethod search(query, limit=20)[source]#

Parameters:

query (str)
limit (int)

Return type:

list[Hit]
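A minimal sketch of a custom backend, assuming the constructor simply stores the connection. AuditMode, Hit and SearchBackend are redefined locally to mirror the documented signatures. The SubstringBackend class and the chunks table layout are hypothetical and exist only for this illustration.

```python
import sqlite3
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AuditMode(str, Enum):  # stand-in for arborist.search.AuditMode
    STRICT = "STRICT"
    HYBRID = "HYBRID"
    UNGROUNDED = "UNGROUNDED"


@dataclass
class Hit:  # stand-in with the documented fields
    document_root: str
    document_uri: str
    chunk_idx: int
    snippet: str
    score: float
    audit_mode: AuditMode
    title: Optional[str] = None


class SearchBackend(ABC):  # stand-in for the documented ABC
    name: str
    audit_mode: AuditMode

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    @abstractmethod
    def search(self, query: str, limit: int = 20) -> list[Hit]: ...


class SubstringBackend(SearchBackend):
    """Hypothetical backend: case-insensitive substring match."""

    name = "substring"
    audit_mode = AuditMode.UNGROUNDED

    def search(self, query, limit=20):
        rows = self.conn.execute(
            "SELECT document_root, document_uri, chunk_idx, content "
            "FROM chunks WHERE instr(lower(content), lower(?)) > 0 "
            "ORDER BY chunk_idx LIMIT ?",
            (query, limit),
        ).fetchall()
        # Every hit carries the backend's audit mode; the score is a constant
        # because substring matching has no ranking signal.
        return [Hit(r[0], r[1], r[2], r[3][:80], 1.0, self.audit_mode) for r in rows]


# Hypothetical chunk table, just enough to exercise the backend.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (document_root TEXT, document_uri TEXT, "
    "chunk_idx INTEGER, content TEXT)"
)
conn.execute("INSERT INTO chunks VALUES ('r1', 'doc://a', 0, 'Athlon benchmarks')")
conn.execute("INSERT INTO chunks VALUES ('r2', 'doc://b', 0, 'Gardening tips')")
hits = SubstringBackend(conn).search("athlon")
```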

class arborist.search.FTS5Backend(conn)[source]#

Bases: SearchBackend

Parameters:

conn (sqlite3.Connection)

name: str = 'fts5'#

audit_mode: AuditMode = 'UNGROUNDED'#

search(query, limit=20, extra_or_tokens=None)[source]#

Run FTS5 BM25 over chunk content.

extra_or_tokens (the synonym-expanded token set) is passed only to the OR-mode fallback. AND mode stays on the original query tokens, because adding synonyms there would relax the AND constraint and pull in noise. The intended caller is the retrieval pipeline, which has already computed synonym_expand(qtokens); passing that set here keeps the OR-mode candidate pool from missing topical synonym terms.

Parameters:

query (str)
limit (int)
extra_or_tokens (set[str] | None)

Return type:

list[Hit]
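The AND-then-OR behaviour described above can be sketched with the standard sqlite3 module, assuming the local SQLite build includes FTS5. The table name chunks, the single content column, and the rule that the fallback triggers when AND mode returns nothing are all assumptions of this sketch, not details confirmed by the source.

```python
import sqlite3


def fts5_quote(token: str) -> str:
    # Double quotes make the token an FTS5 string literal.
    return '"' + token.replace('"', '""') + '"'


def search_and_then_or(conn, query, limit=20, extra_or_tokens=None):
    tokens = query.lower().split()
    if not tokens:
        return []
    # "rank" defaults to BM25 in FTS5; lower values are better matches.
    sql = "SELECT rowid, rank FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?"

    # AND mode: original query tokens only.
    and_expr = " AND ".join(fts5_quote(t) for t in tokens)
    rows = conn.execute(sql, (and_expr, limit)).fetchall()
    if rows:
        return rows

    # OR-mode fallback (assumed trigger: empty AND result). This is the
    # only place the synonym-expanded tokens are added.
    or_tokens = sorted(set(tokens) | set(extra_or_tokens or ()))
    or_expr = " OR ".join(fts5_quote(t) for t in or_tokens)
    return conn.execute(sql, (or_expr, limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [("athlon benchmarks",), ("k7 microarchitecture notes",), ("gardening tips",)],
)

strict = search_and_then_or(conn, "athlon benchmarks", extra_or_tokens={"k7"})
loose = search_and_then_or(conn, "athlon overclocking")
expanded = search_and_then_or(conn, "athlon overclocking", extra_or_tokens={"k7"})
```

In this toy corpus the synonym token only widens the result when AND mode has already failed, which is the property the docstring describes.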

sources#

Source ABC.

Adding a new corpus to arborist takes one new Source subclass. The Source contract is intentionally minimal: yield Document objects, one at a time.

class arborist.source.Source[source]#

A corpus that yields documents into the ingest pipeline.

source_type: str#

source_type tag stored on every Document this source produces.

abstractmethod iter_documents()[source]#

Yield Document objects. Must be deterministic & idempotent.

Return type:

Iterator[Document]
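A sketch of what a new Source subclass might look like. The Document fields are not listed in this section, so the Document class below is a hypothetical stand-in, as are DictSource and its in-memory mapping. Source is redefined locally to mirror the documented contract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass(frozen=True)
class Document:  # hypothetical stand-in; the real fields are not shown here
    uri: str
    text: str
    source_type: str


class Source(ABC):  # stand-in mirroring the documented contract
    source_type: str

    @abstractmethod
    def iter_documents(self) -> Iterator[Document]: ...


class DictSource(Source):
    """Hypothetical source over an in-memory mapping of uri -> text."""

    source_type = "dict"

    def __init__(self, pages: Dict[str, str]):
        self.pages = dict(pages)

    def iter_documents(self) -> Iterator[Document]:
        # Sorting by URI makes the order independent of insertion order,
        # so repeated runs yield the same sequence.
        for uri in sorted(self.pages):
            yield Document(uri, self.pages[uri], self.source_type)


source = DictSource({"doc://b": "second", "doc://a": "first"})
first_pass = list(source.iter_documents())
second_pass = list(source.iter_documents())
```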

MediaWiki ‘cur’ table SQL dump source.

Handles 2003-era SQL dumps in bz2 format (e.g. 20030516_cur_tablesql.bz2). Yields one Document per non-redirect main-namespace article, with [[wikilinks]] extracted as outbound edges.

Implements a stream parser for MySQL extended INSERT syntax. The cur table schema for that era starts: cur_id, cur_namespace, cur_title, cur_text, … We rely on positional access for the first four columns.
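To make the extended INSERT parsing concrete, here is a small state-machine sketch. It works on one complete statement string for brevity rather than a byte stream, handles only backslash escapes inside single-quoted strings, and returns every value as a string (an unquoted NULL comes back as the text "NULL"). The function name and sample statement are illustrative, not the module's own code.

```python
from typing import Iterator, List

_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "0": "\0"}


def iter_insert_rows(statement: str) -> Iterator[List[str]]:
    """Yield each parenthesised tuple of an extended INSERT as a list of strings."""
    i = statement.index("VALUES") + len("VALUES")
    row: List[str] = []
    field: List[str] = []
    in_row = in_string = False
    while i < len(statement):
        ch = statement[i]
        if in_string:
            if ch == "\\" and i + 1 < len(statement):
                # Backslash escape: keep the escaped character, skip both.
                nxt = statement[i + 1]
                field.append(_ESCAPES.get(nxt, nxt))
                i += 2
                continue
            if ch == "'":
                in_string = False
            else:
                field.append(ch)  # commas and parentheses are literal here
        elif ch == "'":
            in_string = True
        elif ch == "(" and not in_row:
            in_row, row, field = True, [], []
        elif ch == "," and in_row:
            row.append("".join(field))
            field = []
        elif ch == ")" and in_row:
            row.append("".join(field))
            yield row
            in_row = False
        elif in_row:
            field.append(ch)
        i += 1


stmt = "INSERT INTO cur VALUES (1,0,'Apple','A fruit, (red) it\\'s'),(2,0,'Pear','x');"
rows = list(iter_insert_rows(stmt))
# Positional access to the first four columns, as the module describes.
cur_id, cur_namespace, cur_title, cur_text = rows[0][:4]
```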

class arborist.sources.wikipedia.WikipediaSqlDump(path, *, table='cur', namespace=0, base_uri='https://en.wikipedia.org/wiki/', shard=None, start_id=0, encoding='latin-1')[source]#

Iterates a MediaWiki SQL table dump (cur or old), bz2 or plain.

Both cur (current snapshot) and old (revision history) tables share the first four column positions: id, namespace, title, text. The cur table has cur_is_redirect at position 10 (we skip redirects); old has no redirect flag (every revision is real).

Shard support: pass shard=(rank, total) and the source yields only docs whose 0-based index satisfies index % total == rank. Useful for spawning N parallel ingest processes against the same dump file — parser CPU runs in parallel, writes serialize at the WAL writer lock.

Parameters:

path (str | Path)
table (str)
namespace (int)
base_uri (str)
shard (tuple[int, int] | None)
start_id (int)
encoding (str)

iter_documents()[source]#

Yield Document objects. Must be deterministic & idempotent.

Return type:

Iterator[Document]
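The shard rule (index % total == rank over the 0-based document index) can be sketched as a generic filter. The helper name shard_filter is hypothetical; only the modulo rule comes from the description above.

```python
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def shard_filter(items: Iterable[T], shard: Optional[Tuple[int, int]] = None) -> Iterator[T]:
    """Yield only the items that belong to shard=(rank, total)."""
    if shard is None:
        yield from items
        return
    rank, total = shard
    if not 0 <= rank < total:
        raise ValueError("rank must satisfy 0 <= rank < total")
    for index, item in enumerate(items):
        if index % total == rank:
            yield item


docs = ["a", "b", "c", "d", "e", "f", "g"]
parts = [list(shard_filter(docs, (rank, 3))) for rank in range(3)]
```

Each process sees a disjoint slice and the slices together cover the whole dump, so N workers can share one file without coordinating.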

class arborist.sources.wikipedia.WikipediaCurDump(path, namespace=0, base_uri='https://en.wikipedia.org/wiki/')[source]#

Iterates a MediaWiki ‘cur’ table SQL dump.

Parameters:

path (str | Path)
namespace (int)
base_uri (str)

class arborist.sources.wikipedia.WikipediaOldDump(path, namespace=0, base_uri='https://en.wikipedia.org/wiki/')[source]#

Iterates a MediaWiki ‘old’ (revision history) table SQL dump.

Parameters:

path (str | Path)
namespace (int)
base_uri (str)

HTML page source.

Fetches URLs, honors robots.txt automatically, strips noise (script/style/nav/footer/header), extracts main body text + outbound <a href> links as edges.

Optional dependency. Install with pip install arborist[html].

arborist.sources.html_page.parse_html(url, html, source_type='html', *, loss_collector=None)[source]#

Pure parse function. Separated so tests can run without network.

When loss_collector is provided, content dropped by noise-selector decomposition (<script>, <style>, <nav>, etc.) and by whitespace normalization is recorded against the collector. The output remains bit-identical to the no-collector path. The default of None keeps existing callers unchanged.

Parameters:

url (str)
html (str)
source_type (str)
loss_collector (LossCollector | None)

Return type:

Document | None
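A pure-parse sketch of the same idea using only the standard library's html.parser. This is a swapped-in technique for illustration: the parser parse_html actually uses is not stated here, and the sketch returns a (text, links) tuple rather than a Document. Dropping links that sit inside noise elements is a choice made by this sketch.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin

NOISE_TAGS = {"script", "style", "nav", "footer", "header"}


class _BodyExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.noise_depth = 0  # > 0 while inside any noise element
        self.parts: List[str] = []
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a" and self.noise_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.parts.append(data)


def extract_text_and_links(url: str, html: str) -> Tuple[str, List[str]]:
    parser = _BodyExtractor(url)
    parser.feed(html)
    parser.close()
    # Whitespace normalization: collapse every run to a single space.
    text = " ".join(" ".join(parser.parts).split())
    return text, parser.links


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav><a href='/x'>Nav</a></nav>"
    "<p>Hello   <a href='/wiki/B'>B</a></p>"
    "<script>var a = 1;</script><footer>f</footer></body></html>"
)
text, links = extract_text_and_links("https://example.org/a", page)
```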

class arborist.sources.html_page.HtmlPageSource(urls, *, respect_robots=True, timeout=30.0, loss_report_enabled=True, loss_report_excerpts=True, loss_report_max_excerpt_bytes=200, default_author=None)[source]#

Iterates a list of URLs, fetching and parsing each as HTML.

Parameters:

urls (Iterable[str])
respect_robots (bool)
timeout (float)
loss_report_enabled (bool)
loss_report_excerpts (bool)
loss_report_max_excerpt_bytes (int)
default_author (str | None)

source_type: str = 'html'#

source_type tag stored on every Document this source produces.

classmethod from_file(path, **kwargs)[source]#

Parameters:

path (str | Path)

Return type:

HtmlPageSource

iter_documents()[source]#

Yield Document objects. Must be deterministic & idempotent.

Return type:

Iterator[Document]
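How HtmlPageSource checks robots.txt is not shown in this section. One way to honor it with the standard library is urllib.robotparser, sketched below with the rules supplied as lines so that no network access is needed. The helper name and the user-agent string are hypothetical.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def allowed_urls(urls: Iterable[str], robots_lines: List[str],
                 user_agent: str = "example-bot") -> List[str]:
    """Keep only the URLs that the given robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules passed directly, no fetch
    return [u for u in urls if parser.can_fetch(user_agent, u)]


rules = ["User-agent: *", "Disallow: /private/"]
candidates = [
    "https://example.org/public/page",
    "https://example.org/private/page",
]
fetchable = allowed_urls(candidates, rules)
```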

concepts#

Corpus-derived concept relations: synonyms, antonyms, rivalries, categories.

Replaces the hand-curated frozensets that lived in arborist.qa.concepts through April 2026 (commit c6182ae). The frozensets were Phase 1; this is Phase 2.

The concept-relations layer is a secondary index over the existing Merkle-committed corpus. Writes to concept_relations NEVER affect document_root, chunk_root, or cache_key — backfilling relations is safe across the entire corpus without invalidating cached answers or breaking audit chains.

Architecture:

store.py — DB read/write helpers, append-only with UNIQUE-key idempotency
extract.py — Extractor ABC + registry; per-source extractors plug in here
query.py — Cross-shard synonyms_for(token) & rivalries_for(token)
seed.py — One-time migration of the legacy frozensets to manual rows

Public API for retrieval-time use (matches the legacy arborist.qa.concepts shape, so call sites in query.py keep working):

synonym_expand(tokens, *, shards_dir) -> set[str]
rivalry_excluded(tokens, *, shards_dir, compare_phrasing=False) -> set[str]
has_compare_phrasing(question) -> bool

arborist.concepts.add_concept_relation(conn, *, source_root, relation_kind, token, target, evidence_kind, confidence=1.0, derived_from=None, derived_at=None)[source]#

Append a concept relation. Returns True if a row was inserted, False if the (source_root, relation_kind, token, target, evidence_kind) tuple already existed (idempotent re-derivation).

Tokens are stored exactly as given; preserving case lets the query layer decide normalization. Substring lookup at retrieval time is case-insensitive via SQLite’s NOCASE collation.

Parameters:

conn (Connection)
source_root (str)
relation_kind (str)
token (str)
target (str)
evidence_kind (str)
confidence (float)
derived_from (str | None)
derived_at (int | None)

Return type:

bool
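A sketch of the idempotent append, assuming a table whose UNIQUE key is the five-column tuple named above. The schema and the INSERT OR IGNORE approach are assumptions of this sketch, and the sample values ("wikilink", "root-1") are made up. The real store.py may implement the check differently.

```python
import sqlite3

# Hypothetical schema: columns taken from the documented parameters.
SCHEMA = """
CREATE TABLE concept_relations (
    source_root   TEXT NOT NULL,
    relation_kind TEXT NOT NULL,
    token         TEXT NOT NULL,
    target        TEXT NOT NULL,
    evidence_kind TEXT NOT NULL,
    confidence    REAL NOT NULL DEFAULT 1.0,
    derived_from  TEXT,
    derived_at    INTEGER,
    UNIQUE (source_root, relation_kind, token, target, evidence_kind)
)
"""


def add_relation_sketch(conn, *, source_root, relation_kind, token, target,
                        evidence_kind, confidence=1.0, derived_from=None,
                        derived_at=None) -> bool:
    cur = conn.execute(
        "INSERT OR IGNORE INTO concept_relations VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (source_root, relation_kind, token, target, evidence_kind,
         confidence, derived_from, derived_at),
    )
    # rowcount is 1 for a new row and 0 when the UNIQUE key already existed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = dict(source_root="root-1", relation_kind="synonym", token="Athlon",
           target="K7", evidence_kind="wikilink")
inserted_first = add_relation_sketch(conn, **row)
inserted_again = add_relation_sketch(conn, **row)
total_rows = conn.execute("SELECT COUNT(*) FROM concept_relations").fetchone()[0]
```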

arborist.concepts.concept_relations_for_token(conn, token, *, relation_kind=None)[source]#

Return all relations whose token matches (case-insensitive). If relation_kind is given, filter to that kind.

Parameters:

conn (Connection)
token (str)
relation_kind (str | None)

Return type:

list[dict]
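A sketch of the case-insensitive lookup using SQLite's built-in NOCASE collation, which folds ASCII letters only. The three-column table is a trimmed, hypothetical version of concept_relations and the function name is illustrative.

```python
import sqlite3
from typing import List, Optional


def relations_for_token_sketch(conn, token: str, *,
                               relation_kind: Optional[str] = None) -> List[dict]:
    sql = "SELECT * FROM concept_relations WHERE token = ? COLLATE NOCASE"
    params = [token]
    if relation_kind is not None:
        sql += " AND relation_kind = ?"
        params.append(relation_kind)
    cur = conn.execute(sql, params)
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept_relations (relation_kind TEXT, token TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO concept_relations VALUES (?, ?, ?)",
    [("synonym", "Athlon", "K7"), ("rivalry", "athlon", "pentium"),
     ("synonym", "telepathy", "ESP")],
)
all_kinds = relations_for_token_sketch(conn, "ATHLON")
synonyms_only = relations_for_token_sketch(conn, "ATHLON", relation_kind="synonym")
```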

arborist.concepts.has_compare_phrasing(question)[source]#

True if the question contains comparison language.

Parameters:

question (str)

Return type:

bool
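The actual list of comparison markers is not given in this section. A regex-based sketch with a hypothetical marker list shows the general shape of such a check:

```python
import re

# Hypothetical marker list; the real function may recognize other phrasing.
_COMPARE_RE = re.compile(
    r"\b(?:vs|versus|compare[ds]?|comparison|difference between|better than)\b",
    re.IGNORECASE,
)


def has_compare_phrasing_sketch(question: str) -> bool:
    return _COMPARE_RE.search(question) is not None


with_vs = has_compare_phrasing_sketch("Athlon vs. Pentium: which is faster?")
with_compare = has_compare_phrasing_sketch("Compare the Athlon with the Pentium")
plain = has_compare_phrasing_sketch("What is an Athlon?")
```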

arborist.concepts.invalidate_cache()[source]#

Drop all cached indices. Call after a writer commits new rows (the mtime check would catch this on next read, but invalidating explicitly is faster on the same-process write+read pattern).

Return type:

None
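The mtime check plus explicit invalidation can be sketched as below. The cache layout, the loader callback, and the function names are assumptions for illustration; only the two mechanisms (reload when the file's mtime changes, drop everything on request) come from the description above.

```python
import os
import tempfile
from typing import Any, Callable, Dict, Tuple

_cache: Dict[str, Tuple[int, Any]] = {}


def load_cached(path: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached value for path, reloading if the file's mtime changed."""
    mtime = os.stat(path).st_mtime_ns
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]
    value = loader(path)
    _cache[path] = (mtime, value)
    return value


def invalidate_cache_sketch() -> None:
    """Drop all cached entries so the next read reloads unconditionally."""
    _cache.clear()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "relations.db")
    open(path, "w").close()
    loads = []
    loader = lambda p: loads.append(p) or len(loads)  # returns the load count
    first = load_cached(path, loader)
    second = load_cached(path, loader)   # served from the cache
    invalidate_cache_sketch()
    third = load_cached(path, loader)    # reloaded after explicit invalidation
```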

arborist.concepts.purge_by_evidence_kind(conn, evidence_kind, *, derived_from=None)[source]#

Delete every row with the given evidence_kind (and optional derived_from). Returns the number of rows removed.

The intended use: revoke a buggy extractor’s output cleanly. Manual rows live under evidence_kind='manual' and are NOT touched by a purge of any other kind.

Parameters:

conn (Connection)
evidence_kind (str)
derived_from (str | None)

Return type:

int
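A sketch of the purge, assuming a plain DELETE filtered on evidence_kind and, optionally, derived_from. The trimmed table, the function name, and the sample values ("wikilink", "v1", "v2") are hypothetical.

```python
import sqlite3
from typing import Optional


def purge_sketch(conn, evidence_kind: str, *, derived_from: Optional[str] = None) -> int:
    sql = "DELETE FROM concept_relations WHERE evidence_kind = ?"
    params = [evidence_kind]
    if derived_from is not None:
        sql += " AND derived_from = ?"
        params.append(derived_from)
    # rowcount reports how many rows the DELETE removed.
    return conn.execute(sql, params).rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept_relations (token TEXT, evidence_kind TEXT, derived_from TEXT)")
conn.executemany(
    "INSERT INTO concept_relations VALUES (?, ?, ?)",
    [("athlon", "manual", None), ("athlon", "wikilink", "v1"), ("k7", "wikilink", "v2")],
)
removed_v1 = purge_sketch(conn, "wikilink", derived_from="v1")
removed_rest = purge_sketch(conn, "wikilink")
manual_left = conn.execute(
    "SELECT COUNT(*) FROM concept_relations WHERE evidence_kind = 'manual'"
).fetchone()[0]
```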

arborist.concepts.rivalry_excluded(tokens, *, shards_dir=None, compare_phrasing=False)[source]#

Tokens whose presence in a doc title means EXCLUDE that doc.

For each rivalry pair (A, B): if exactly ONE side appears in the query AND no comparison language was used, exclude the OTHER side’s tokens. If both sides appear, or if the user asked for a comparison, no exclusion (they wanted both).

Parameters:

tokens (set[str])
shards_dir (Path | str | None)
compare_phrasing (bool)

Return type:

set[str]
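The exclusion rule can be sketched over an in-memory list of rivalry pairs. In the library the pairs come from the shards under shards_dir. Here the pairs argument, its shape (a pair of token sets per rivalry), and the function name are hypothetical.

```python
from typing import Iterable, Set, Tuple


def rivalry_excluded_sketch(tokens: Set[str],
                            pairs: Iterable[Tuple[Set[str], Set[str]]],
                            compare_phrasing: bool = False) -> Set[str]:
    if compare_phrasing:
        return set()  # the user asked for a comparison, keep both sides
    excluded: Set[str] = set()
    for side_a, side_b in pairs:
        has_a = bool(tokens & side_a)
        has_b = bool(tokens & side_b)
        if has_a and not has_b:
            excluded |= side_b
        elif has_b and not has_a:
            excluded |= side_a
        # Both sides present, or neither: nothing to exclude for this pair.
    return excluded


pairs = [({"athlon"}, {"pentium"})]
one_side = rivalry_excluded_sketch({"athlon", "benchmarks"}, pairs)
both_sides = rivalry_excluded_sketch({"athlon", "pentium"}, pairs)
comparing = rivalry_excluded_sketch({"athlon"}, pairs, compare_phrasing=True)
```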

arborist.concepts.synonym_expand(tokens, *, shards_dir=None, max_neighbors_per_token=8, max_total=50)[source]#

Add direct synonym neighbors for any input token whose degree is bounded enough that its neighbors are likely topical, not topic-adjacency noise.

Two caps protect retrieval performance & quality:

max_neighbors_per_token — tokens with more direct neighbors than this contribute NO expansion. Generic tokens (“person”, “thoughts”) have huge degree in the Wikipedia reciprocal-link graph; expanding them dumps random topical-cluster noise. Specific named entities (“athlon”, “telepathy”) have small focused neighborhoods that pass the cap.
max_total — overall cap on expanded set size. Bounds the SQL clause count downstream. Original query tokens are always preserved; if total > cap, neighbors are sorted alphabetically & truncated.

Parameters:

tokens (set[str])
shards_dir (Path | str | None)
max_neighbors_per_token (int)
max_total (int)

Return type:

set[str]
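The two caps can be sketched over an in-memory neighbor map. In the library the neighbors come from the concept-relation shards. Here the neighbors dict and the function name are hypothetical stand-ins.

```python
from typing import Dict, Set


def synonym_expand_sketch(tokens: Set[str], neighbors: Dict[str, Set[str]],
                          max_neighbors_per_token: int = 8,
                          max_total: int = 50) -> Set[str]:
    extra: Set[str] = set()
    for token in tokens:
        near = neighbors.get(token, set())
        # High-degree tokens contribute no expansion at all.
        if len(near) <= max_neighbors_per_token:
            extra |= near
    extra -= tokens
    # Original tokens are always kept; only neighbors are truncated,
    # in alphabetical order so the result is reproducible.
    room = max(0, max_total - len(tokens))
    if len(extra) > room:
        extra = set(sorted(extra)[:room])
    return set(tokens) | extra


neighbors = {
    "athlon": {"k7", "duron"},
    "person": {f"n{i}" for i in range(20)},  # generic token, huge degree
}
expanded = synonym_expand_sketch({"athlon", "person"}, neighbors)
capped = synonym_expand_sketch({"athlon", "person"}, neighbors, max_total=3)
```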

Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.
