Benchmark surface#
Arborist ships the complete Dav1DPrometheus 5S/5T/5F/5R evaluation suite as first-class infrastructure: 21 sub-batteries, 662+ deterministic fixtures in the default runner (Phase 1d expansion 2026-05-09), plus ~110 additional math π* fixtures. Every benchmark is reproducible and uses no LLM-as-judge, and many sub-batteries route through the actual arborist surface (parser, verifier, audit chain, π* registry) rather than synthetic gold output.
Quick reference#
make bench-suite # complete 5S + 5T + 5F + 5R (662 tasks)
make bench-5s # representational discipline
make bench-5t # temporal / cross-reasoning
make bench-5f # operational quality (synthetic + live)
make bench-5r # workspace operators
# Per-π* 5S targets (#000030 SymPy substrate + the registry's text/
# arithmetic/logic core + the last-stub graduation):
make bench-5s-math # arithmetic@v1 + logic-kernel@v1
make bench-5s-code # code-py-ast@v1
make bench-5s-time-series # time-series-quantized@v1
make bench-5s-tabular # tabular-pinned@v1 (last reserved stub)
make bench-5s-algebra # algebra-symbolic@v1
make bench-5s-calculus-limit # calculus-limit@v1
make bench-5s-calculus-series # calculus-series@v1
make bench-5s-linear-algebra # linear-algebra@v1
make bench-5s-function-sampled # function-sampled@v1 (SymPy → time-series)
# Real-shard + selection + witness-divergence harness:
make bench-real-shard # #000026 — real-shard latency / audit baseline
make bench-fork-baseline # pin current bench output as ForkScore parent
make bench-fork-score # score child vs pinned parent (CI-gateable)
make bench-witness-divergence # extract LLM-divergence as 5F fixtures
Each invocation emits a JSON bench.batteries.base.BatteryResult
with per-task pass/fail, fixture digest, runtime digest, and
sub-battery-specific metrics.
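As a rough illustration of consuming that JSON, the sketch below counts passing and failing tasks. The field names `tasks`, `task_id`, and `passed` are assumptions made for this example; the real schema is whatever bench.batteries.base.BatteryResult serializes.

```python
import json


def summarize(result_json: str) -> dict:
    """Count passing and failing tasks in one battery result (assumed layout)."""
    result = json.loads(result_json)
    tasks = result.get("tasks", [])  # hypothetical field name
    failed = [t["task_id"] for t in tasks if not t.get("passed", False)]
    return {
        "total": len(tasks),
        "passed": len(tasks) - len(failed),
        "failed_ids": failed,
    }


# Hand-written sample standing in for real bench output.
sample = json.dumps({
    "tasks": [
        {"task_id": "syntax-001", "passed": True},
        {"task_id": "syntax-002", "passed": False},
    ]
})
summary = summarize(sample)
print(summary)
```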
The four batteries#
5S — representation discipline#
Five sub-batteries testing what the system understands at the sign / meaning / derivation level:
Syntax — does the named π* parse the input without raising?
Semantics — do two surface forms canonicalize to the same bytes when they should (and not when they shouldn’t)?
Syllogism — does each step in a deductive chain validly follow under the named rule (categorical_transitivity, chain_3, invalid_converse, missing_premise)?
Synthesis — does the system assemble cited facts into a coherent derivation supported by the fact set?
Semiotics — is meaning preserved under controlled label swaps?
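The Semantics check above can be pictured with a toy canonicalizer for rational literals. This is a minimal sketch, not the registry's arithmetic@v1 implementation: `canonicalize_rational` is a hypothetical helper that reduces a literal to lowest terms and emits its bytes, so equivalent surface forms collide and distinct values do not.

```python
from fractions import Fraction


def canonicalize_rational(text: str) -> bytes:
    """Hypothetical canonicalizer: reduce a rational literal to lowest terms."""
    value = Fraction(text.strip())  # accepts "2/4", "0.5", "3"
    return f"{value.numerator}/{value.denominator}".encode("ascii")


# Surface forms that should canonicalize to the same bytes.
same = canonicalize_rational("2/4") == canonicalize_rational("0.5")

# Surface forms that should stay distinct.
different = canonicalize_rational("1/3") != canonicalize_rational("1/2")

print(same, different)
```

A Semantics fixture in this style needs both kinds of pair: a canonicalizer that maps everything to the same bytes would pass the first check and fail the second.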
5T — temporal / cross-reasoning discipline#
Six sub-batteries (legacy transfer plus the canonical
Dav1DPrometheus five):
Transfer Learning — does a learned pattern carry across task / domain / carrier?
Triangulation — do independent strategies (substring, token_subset, token_overlap, entity_match) agree at threshold?
Truthtables — exhaustive propositional coverage at N=2..4 variables.
Transitivity — typed-relation chains under whitelist (implies, subset_of, ancestor_of, before, less_than).
Time — temporal-context preservation across memory_root snapshots; integrates with #000017 surface.
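Exhaustive propositional coverage, as in the Truthtables sub-battery, is cheap at these sizes: N variables give 2**N rows, so N=2..4 means 4, 8, and 16 assignments. The sketch below is an illustration only; `equivalent` is a hypothetical helper, not part of the bench package.

```python
from itertools import product
from typing import Callable


def equivalent(f: Callable[..., bool], g: Callable[..., bool], n_vars: int) -> bool:
    """Return True when f and g agree on every one of the 2**n_vars assignments."""
    return all(
        f(*row) == g(*row)
        for row in product([False, True], repeat=n_vars)
    )


# De Morgan's law holds on all four rows for N=2.
lhs = lambda a, b: not (a and b)
rhs = lambda a, b: (not a) or (not b)

# A wrong rewrite that the exhaustive check should catch.
bad = lambda a, b: (not a) and (not b)

print(equivalent(lhs, rhs, 2), equivalent(lhs, bad, 2))
```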
5F — operational quality#
Five sub-batteries, all with embedded (Phase 1a synthetic) AND live (Phase 1b.2, routes through real arborist surfaces) modes:
| Sub-battery | Live surface |
|---|---|
| Function | |
| Finetuning | |
| Falsification | |
| Formulate | |
| Feedback Loop | |
Per-task detail.source reports "embedded" or "live" so
bench output distinguishes synthetic from production signal.
5R — workspace operators#
Five sub-batteries testing operators applied to a workspace (SelfModel + memory_root + audit chain):
React — observations integrate into downstream state.
Rearrange — restructure without semantic shift.
Restore — retrieve prior facts from history.
Replicate — π*-determinism across N replicas.
Resonate — variance-zero across N runs.
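One way to picture the Replicate and Resonate checks is to hash the output of N runs and require a single distinct digest. The sketch below makes that assumption explicit; `run_digest` and `is_deterministic` are hypothetical helpers, and the real sub-batteries may compare different artifacts.

```python
import hashlib
import json


def run_digest(fn, payload) -> str:
    """Hash one run's output using a stable JSON encoding."""
    out = fn(payload)
    blob = json.dumps(out, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def is_deterministic(fn, payload, n_runs: int = 5) -> bool:
    """True when all n_runs executions produce the same digest."""
    digests = {run_digest(fn, payload) for _ in range(n_runs)}
    return len(digests) == 1


def stable(payload):
    # Output depends only on the input.
    return {"tokens": sorted(payload["tokens"])}


_calls = {"n": 0}


def unstable(payload):
    # Leaks a call counter into the output, so each run differs.
    _calls["n"] += 1
    return {"tokens": payload["tokens"], "call": _calls["n"]}


print(is_deterministic(stable, {"tokens": ["b", "a"]}))
print(is_deterministic(unstable, {"tokens": ["b", "a"]}))
```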
Cross-modality discipline#
Every fixture carries:
carrier — domain whitelist enforced by bench.batteries.base.PHASE_1_CARRIERS. Phase 1 domains: text, claim_lattice, memory_snapshot, selfmodel_snapshot, providence_record, audit_event, code, arithmetic, logic, time_series.
domain — sub-domain qualifier (e.g., rational, propositional, python_ast).
pi_star_ref — registry key naming the canonicalizer.
loss_report_refs — optional projection-loss links.
modality_notes — scope note.
Unsupported carriers fail explicitly with reason="unsupported_carrier" and are never silently accepted. Hidden-channel work is defensive only (detection / flagging, never generation).
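The explicit-failure rule can be sketched as a guard that runs before any fixture logic. The set below is copied from the Phase 1 list in this section as a stand-in for bench.batteries.base.PHASE_1_CARRIERS; `check_carrier` and its return shape are assumptions for illustration.

```python
# Stand-in for bench.batteries.base.PHASE_1_CARRIERS (values from this page).
PHASE_1_CARRIERS = frozenset({
    "text", "claim_lattice", "memory_snapshot", "selfmodel_snapshot",
    "providence_record", "audit_event", "code", "arithmetic", "logic",
    "time_series",
})


def check_carrier(fixture: dict) -> dict:
    """Hypothetical guard: reject fixtures whose carrier is not whitelisted."""
    carrier = fixture.get("carrier")
    if carrier not in PHASE_1_CARRIERS:
        # Fail loudly instead of skipping or coercing the fixture.
        return {"passed": False, "reason": "unsupported_carrier"}
    return {"passed": True, "reason": None}


print(check_carrier({"carrier": "arithmetic", "domain": "rational"}))
print(check_carrier({"carrier": "audio"}))
```

A missing `carrier` key takes the same failure path as an unknown one, which keeps the "never silently accepted" property for malformed fixtures too.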
ForkScore consumes battery output#
The v8 ForkScore (see v8 ForkScore) reads BatteryResult JSON from a parent and a child organism and computes a weighted scalar verdict classed as ACCEPT, MARGINAL, or REJECT:
make bench-suite # generates parent.json
# ... apply changes ...
make bench-suite # generates child.json
arborist substrate score --parent parent.json --child child.json
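To make "weighted scalar verdict" concrete, here is a minimal sketch that scores a child against a parent from per-battery pass rates. Everything in it is hypothetical: the equal weights, the thresholds, the input shape, and the `fork_verdict` name are illustrative choices, not the v8 ForkScore definition.

```python
def fork_verdict(parent, child, weights, accept_at=0.01, reject_below=-0.01):
    """Weighted sum of per-battery pass-rate deltas, mapped to three classes.

    All parameters are illustrative; the real ForkScore weighting is not
    specified on this page.
    """
    score = sum(w * (child[name] - parent[name]) for name, w in weights.items())
    if score >= accept_at:
        verdict = "ACCEPT"
    elif score >= reject_below:
        verdict = "MARGINAL"
    else:
        verdict = "REJECT"
    return score, verdict


# Hypothetical equal weighting across the four batteries.
weights = {"5S": 0.25, "5T": 0.25, "5F": 0.25, "5R": 0.25}
parent = {"5S": 0.90, "5T": 0.90, "5F": 0.90, "5R": 0.90}
improved = {"5S": 0.95, "5T": 0.95, "5F": 0.95, "5R": 0.95}
regressed = {"5S": 0.85, "5T": 0.85, "5F": 0.85, "5R": 0.85}

print(fork_verdict(parent, improved, weights)[1])
print(fork_verdict(parent, parent, weights)[1])
print(fork_verdict(parent, regressed, weights)[1])
```

Keeping a MARGINAL band between the two thresholds lets a CI gate block clear regressions while routing borderline changes to review.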