Benchmark surface

Arborist ships the complete Dav1DPrometheus 5S/5T/5F/5R evaluation suite as first-class infrastructure: 21 sub-batteries with 662+ deterministic fixtures in the default runner (Phase 1d expansion, 2026-05-09), plus ~110 additional math π* fixtures. Every benchmark is reproducible and uses no LLM-as-judge, and many sub-batteries route through the actual arborist surface (parser, verifier, audit chain, π* registry) rather than synthetic gold output.

Quick reference

make bench-suite              # complete 5S + 5T + 5F + 5R (662 tasks)
make bench-5s                 # representational discipline
make bench-5t                 # temporal / cross-reasoning
make bench-5f                 # operational quality (synthetic + live)
make bench-5r                 # workspace operators
# Per-π* 5S targets (#000030 SymPy substrate + the registry's text/
# arithmetic/logic core + the last-stub graduation):
make bench-5s-math            # arithmetic@v1 + logic-kernel@v1
make bench-5s-code            # code-py-ast@v1
make bench-5s-time-series     # time-series-quantized@v1
make bench-5s-tabular         # tabular-pinned@v1 (last reserved stub)
make bench-5s-algebra         # algebra-symbolic@v1
make bench-5s-calculus-limit  # calculus-limit@v1
make bench-5s-calculus-series # calculus-series@v1
make bench-5s-linear-algebra  # linear-algebra@v1
make bench-5s-function-sampled # function-sampled@v1 (SymPy → time-series)
# Real-shard + selection + witness-divergence harness:
make bench-real-shard         # #000026 — real-shard latency / audit baseline
make bench-fork-baseline      # pin current bench output as ForkScore parent
make bench-fork-score         # score child vs pinned parent (CI-gateable)
make bench-witness-divergence # extract LLM-divergence as 5F fixtures

Each invocation emits a JSON bench.batteries.base.BatteryResult with per-task pass/fail, fixture digest, runtime digest, and sub-battery-specific metrics.
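A consumer of this output might tally per-task results along the following lines. This is a minimal sketch: the key names (`battery`, `tasks`, `task_id`, `passed`, `fixture_digest`, `runtime_digest`) and the sample payload are assumptions for illustration; the real schema is whatever bench.batteries.base.BatteryResult serializes to.

```python
import json

# Hypothetical sample payload; the real schema is defined by
# bench.batteries.base.BatteryResult and may use different keys.
SAMPLE = """
{
  "battery": "5S",
  "fixture_digest": "sha256:aaaa",
  "runtime_digest": "sha256:bbbb",
  "tasks": [
    {"task_id": "syntax-001", "passed": true},
    {"task_id": "syntax-002", "passed": false},
    {"task_id": "semantics-001", "passed": true}
  ]
}
"""


def summarize(result: dict) -> dict:
    """Count passing and failing tasks in one battery result."""
    tasks = result.get("tasks", [])
    passed = sum(1 for task in tasks if task.get("passed"))
    return {
        "battery": result.get("battery"),
        "total": len(tasks),
        "passed": passed,
        "failed": len(tasks) - passed,
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```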

The four batteries

5S: representation discipline

Five sub-batteries testing what the system understands at the sign / meaning / derivation level:

Syntax — does the named π* parse the input without raising?
Semantics — do two surface forms canonicalize to the same bytes when they should (and not when they shouldn’t)?
Syllogism — does each step in a deductive chain validly follow under the named rule (categorical_transitivity, chain_3, invalid_converse, missing_premise)?
Synthesis — does the system assemble cited facts into a coherent derivation supported by the fact set?
Semiotics — is meaning preserved under controlled label swaps?
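To make the Semantics check concrete, here is a minimal sketch of a byte-level canonicalization comparison. The `canonicalize_rational` function is a toy stand-in written for this example, not one of the registry's π* canonicalizers; it only illustrates the "same bytes when they should, different bytes when they shouldn't" contract.

```python
from fractions import Fraction


def canonicalize_rational(text: str) -> bytes:
    """Toy canonicalizer (hypothetical): reduce a rational to lowest terms.

    Accepts strings such as "2/4", "0.5", or "3" and emits "num/den" bytes.
    """
    value = Fraction(text.strip())
    return f"{value.numerator}/{value.denominator}".encode("utf-8")


def same_meaning(left: str, right: str) -> bool:
    """Two surface forms agree only if their canonical bytes are identical."""
    return canonicalize_rational(left) == canonicalize_rational(right)


# Different surface forms of the same value collapse to the same bytes.
print(canonicalize_rational("2/4"), same_meaning("2/4", "0.5"))
# A near miss must stay distinct: 0.333 is 333/1000, not 1/3.
print(same_meaning("1/3", "0.333"))
```

A fixture for this kind of check needs both positive pairs (should collapse) and negative pairs (must not collapse), otherwise a canonicalizer that maps everything to one value would pass.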

5T: temporal / cross-reasoning discipline

Six sub-batteries (legacy transfer plus the canonical Dav1DPrometheus five):

Transfer Learning — does a learned pattern carry across task / domain / carrier?
Triangulation — do independent strategies (substring, token_subset, token_overlap, entity_match) agree at threshold?
Truthtables — exhaustive propositional coverage at N=2..4 variables.
Transitivity — typed-relation chains under whitelist (implies, subset_of, ancestor_of, before, less_than).
Time — temporal-context preservation across memory_root snapshots; integrates with #000017 surface.
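As an illustration of what exhaustive propositional coverage means for the Truthtables sub-battery, the sketch below enumerates every assignment for N variables and compares two formulas row by row. The helper names are hypothetical and the formulas are plain Python callables; the actual fixture format is not shown in this page.

```python
from itertools import product


def truth_table(formula, n_vars: int) -> list:
    """Evaluate `formula` on all 2**n_vars boolean assignments."""
    return [(row, formula(*row)) for row in product((False, True), repeat=n_vars)]


def equivalent(f, g, n_vars: int) -> bool:
    """Two formulas are equivalent if they agree on every assignment."""
    return all(f(*row) == g(*row) for row in product((False, True), repeat=n_vars))


# De Morgan: not (a and b) is equivalent to (not a) or (not b).
def nand(a, b):
    return not (a and b)


def de_morgan(a, b):
    return (not a) or (not b)


# An implication and its converse are not equivalent.
def implies(a, b):
    return (not a) or b


def converse(a, b):
    return (not b) or a


for n in (2, 3, 4):
    print(n, len(truth_table(lambda *xs: all(xs), n)))  # 4, 8, 16 rows
print(equivalent(nand, de_morgan, 2), equivalent(implies, converse, 2))
```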

5F: operational quality

Five sub-batteries, each with an embedded mode (Phase 1a, synthetic) and a live mode (Phase 1b.2, routed through real arborist surfaces):

Sub-battery	Live surface
Function	`arborist.qa.parse_claims.parse_pointer_claims`
Finetuning	`arborist.selfmodel.store_snapshot` round-trip
Falsification	`arborist.qa.verify.verify_quotes`
Formulate	`arborist.qa.parse_claims.parse_pointer_claims`
Feedback Loop	`arborist.store.append_audit` + `memory.snapshot`

Per-task detail.source reports "embedded" or "live" so bench output distinguishes synthetic from production signal.
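One way to use that field is to report pass rates separately per source, so a synthetic pass cannot mask a live failure. The sketch below assumes each task is a dict with `passed` and a nested `detail["source"]`; those key names and the sample tasks are illustrative assumptions, not the documented schema.

```python
from collections import defaultdict

# Hypothetical per-task records; only detail.source is named in the text above.
tasks = [
    {"task_id": "function-001", "passed": True, "detail": {"source": "embedded"}},
    {"task_id": "function-002", "passed": True, "detail": {"source": "embedded"}},
    {"task_id": "function-101", "passed": True, "detail": {"source": "live"}},
    {"task_id": "function-102", "passed": False, "detail": {"source": "live"}},
]


def pass_rate_by_source(task_list: list) -> dict:
    """Return {source: fraction of tasks passed} for each source seen."""
    counts = defaultdict(lambda: [0, 0])  # source -> [passed, total]
    for task in task_list:
        source = task["detail"]["source"]
        counts[source][1] += 1
        if task["passed"]:
            counts[source][0] += 1
    return {source: passed / total for source, (passed, total) in counts.items()}


rates = pass_rate_by_source(tasks)
print(rates)
```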

5R: workspace operators

Five sub-batteries testing operators applied to a workspace (SelfModel + memory_root + audit chain):

React — observations integrate into downstream state.
Rearrange — restructure without semantic shift.
Restore — retrieve prior facts from history.
Replicate — π*-determinism across N replicas.
Resonate — variance-zero across N runs.
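The Replicate and Resonate checks both come down to running the same operator several times and requiring identical output. A minimal sketch of that idea, assuming outputs are JSON-serializable and compared by SHA-256 digest (the helper names and the dict-shaped workspace are hypothetical, not arborist APIs):

```python
import hashlib
import itertools
import json


def digest(obj) -> str:
    """Hash a JSON-serializable object using a stable key order."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def unique_digests(operator, workspace: dict, n: int = 5) -> set:
    """Apply `operator` to n fresh copies of the workspace and collect digests."""
    return {digest(operator(dict(workspace))) for _ in range(n)}


def sort_facts(ws: dict) -> dict:
    """Deterministic operator: output depends only on the input."""
    return {"facts": sorted(ws["facts"])}


_run_counter = itertools.count()


def stamp_run(ws: dict) -> dict:
    """Non-deterministic operator: leaks a per-call counter into the output."""
    return {"facts": sorted(ws["facts"]), "run": next(_run_counter)}


workspace = {"facts": ["b", "a", "c"]}
stable = unique_digests(sort_facts, workspace)
unstable = unique_digests(stamp_run, workspace)
print(len(stable), len(unstable))  # one digest means zero variance
```

Comparing digests rather than raw objects keeps the check cheap and makes the failure report a short list of hashes.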

Cross-modality discipline

Every fixture carries:

carrier: carrier type, checked against the whitelist bench.batteries.base.PHASE_1_CARRIERS. Phase 1 carriers: text, claim_lattice, memory_snapshot, selfmodel_snapshot, providence_record, audit_event, code, arithmetic, logic, time_series.
domain: sub-domain qualifier (e.g., rational, propositional, python_ast).
pi_star_ref: registry key naming the canonicalizer.
loss_report_refs: optional projection-loss links.
modality_notes: scope note.

Unsupported carriers fail explicitly with reason="unsupported_carrier" and are never silently accepted. Hidden-channel work is defensive only (detection / flagging, never generation).
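The explicit-failure rule can be sketched as a small gate that runs before any task logic. The set below is a local copy of the Phase 1 carrier list from this page; the authoritative constant is bench.batteries.base.PHASE_1_CARRIERS, and the `check_carrier` helper and its return shape are assumptions made for this example.

```python
# Local copy for illustration; the real whitelist lives in
# bench.batteries.base.PHASE_1_CARRIERS.
PHASE_1_CARRIERS = frozenset({
    "text", "claim_lattice", "memory_snapshot", "selfmodel_snapshot",
    "providence_record", "audit_event", "code", "arithmetic", "logic",
    "time_series",
})


def check_carrier(fixture: dict) -> dict:
    """Hypothetical gate: fail loudly when the carrier is missing or unknown."""
    carrier = fixture.get("carrier")
    if carrier not in PHASE_1_CARRIERS:
        return {"passed": False, "reason": "unsupported_carrier", "carrier": carrier}
    return {"passed": True, "reason": None, "carrier": carrier}


ok = check_carrier({"carrier": "arithmetic", "domain": "rational"})
bad = check_carrier({"carrier": "audio"})
missing = check_carrier({"domain": "rational"})
print(ok, bad, missing, sep="\n")
```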

ForkScore consumes battery output

The v8 ForkScore (see v8 ForkScore) reads BatteryResult JSON from a parent and a child organism and computes a weighted scalar verdict classed as ACCEPT, MARGINAL, or REJECT:

make bench-suite                                    # generates parent.json
# ... apply changes ...
make bench-suite                                    # generates child.json
arborist substrate score --parent parent.json --child child.json
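To show the general shape of a weighted parent-versus-child comparison, here is a toy sketch. It is not the v8 ForkScore formula: the equal weights, the pass-rate delta, the two thresholds, and the input layout are all assumptions chosen for illustration only.

```python
# All constants below are hypothetical; the real weights and class
# boundaries are defined by the v8 ForkScore, not by this sketch.
WEIGHTS = {"5S": 0.25, "5T": 0.25, "5F": 0.25, "5R": 0.25}
ACCEPT_AT = 0.01      # score at or above this is ACCEPT
REJECT_BELOW = -0.01  # score below this is REJECT


def battery(passed: int, total: int) -> dict:
    """Build a stand-in battery result with `passed` of `total` tasks passing."""
    return {"tasks": [{"passed": i < passed} for i in range(total)]}


def pass_rate(result: dict) -> float:
    tasks = result["tasks"]
    return sum(1 for t in tasks if t["passed"]) / len(tasks) if tasks else 0.0


def fork_score(parent: dict, child: dict) -> float:
    """Weighted sum of per-battery pass-rate deltas (child minus parent)."""
    return sum(
        weight * (pass_rate(child[name]) - pass_rate(parent[name]))
        for name, weight in WEIGHTS.items()
    )


def verdict(score: float) -> str:
    if score >= ACCEPT_AT:
        return "ACCEPT"
    if score >= REJECT_BELOW:
        return "MARGINAL"
    return "REJECT"


parent = {name: battery(3, 4) for name in WEIGHTS}
improved = {**parent, "5S": battery(4, 4)}
regressed = {**parent, "5S": battery(2, 4)}

for label, child in (("improved", improved), ("same", parent), ("regressed", regressed)):
    score = fork_score(parent, child)
    print(label, score, verdict(score))
```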

Authoring new fixtures

See docs/spec-methodology.md (per-author checklist) and the existing fixture files under bench/fixtures/. New sub-batteries follow the protocol in bench.batteries.base: Battery.run(fixtures_path) → BatteryResult must be deterministic, must not use an LLM as judge, and requires carrier metadata on every fixture.
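The shape of that protocol can be sketched with local stand-in classes. Nothing below is imported from bench.batteries.base: `TaskResult`, `BatteryResult`, `EchoBattery`, the fixture keys (`task_id`, `input`, `expected`), and the `missing_carrier` reason are hypothetical names used only to illustrate a deterministic run(fixtures_path) with mandatory carrier metadata.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class TaskResult:  # stand-in, not the real per-task record
    task_id: str
    passed: bool
    detail: dict = field(default_factory=dict)


@dataclass
class BatteryResult:  # stand-in for bench.batteries.base.BatteryResult
    name: str
    tasks: List[TaskResult]


class EchoBattery:
    """Toy sub-battery: lowercases and strips input, compares to gold."""

    name = "echo"

    def run(self, fixtures_path: Path) -> BatteryResult:
        fixtures = json.loads(Path(fixtures_path).read_text(encoding="utf-8"))
        tasks = []
        # Sort by task_id so output order never depends on file order.
        for fx in sorted(fixtures, key=lambda f: f["task_id"]):
            if "carrier" not in fx:
                tasks.append(TaskResult(fx["task_id"], False, {"reason": "missing_carrier"}))
                continue
            actual = fx["input"].strip().lower()
            tasks.append(TaskResult(fx["task_id"], actual == fx["expected"]))
        return BatteryResult(self.name, tasks)


fixtures = [
    {"task_id": "t2", "carrier": "text", "input": "  Hello ", "expected": "hello"},
    {"task_id": "t1", "carrier": "text", "input": "ABC", "expected": "abd"},
    {"task_id": "t3", "input": "x", "expected": "x"},  # no carrier metadata
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixtures.json"
    path.write_text(json.dumps(fixtures), encoding="utf-8")
    result = EchoBattery().run(path)

print([(t.task_id, t.passed) for t in result.tasks])
```

The judge here is a plain equality check against gold output, which is what keeps the run reproducible.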

Permacomputer Preamble — License: AGPL-3.0-only

This is free software for the public good of a permacomputer hosted at permacomputer.com, an always-on computer by the people, for the people. Durable, easy to repair, & distributed like tap water for machine learning intelligence.

Our permacomputer is community-owned infrastructure optimized around four values:

TRUTH — First principles, math & science, open source code freely distributed.
FREEDOM — Voluntary partnerships, freedom from tyranny & corporate control.
HARMONY — Minimal waste, self-renewing systems with diverse thriving connections.
LOVE — Be yourself without hurting others, cooperation through natural law.

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.
