Benchmark surface

Arborist ships the complete Dav1DPrometheus 5S/5T/5F/5R evaluation suite as first-class infrastructure: 21 sub-batteries, 662+ deterministic fixtures in the default runner (Phase 1d expansion, 2026-05-09), plus ~110 additional math π* fixtures. Every benchmark is reproducible and uses no LLM-as-judge, and many sub-batteries route through the actual arborist surface (parser, verifier, audit chain, π* registry) rather than synthetic gold output.

Quick reference

make bench-suite              # complete 5S + 5T + 5F + 5R (662 tasks)
make bench-5s                 # representational discipline
make bench-5t                 # temporal / cross-reasoning
make bench-5f                 # operational quality (synthetic + live)
make bench-5r                 # workspace operators
# Per-π* 5S targets (#000030 SymPy substrate + the registry's text/
# arithmetic/logic core + the last-stub graduation):
make bench-5s-math            # arithmetic@v1 + logic-kernel@v1
make bench-5s-code            # code-py-ast@v1
make bench-5s-time-series     # time-series-quantized@v1
make bench-5s-tabular         # tabular-pinned@v1 (last reserved stub)
make bench-5s-algebra         # algebra-symbolic@v1
make bench-5s-calculus-limit  # calculus-limit@v1
make bench-5s-calculus-series # calculus-series@v1
make bench-5s-linear-algebra  # linear-algebra@v1
make bench-5s-function-sampled # function-sampled@v1 (SymPy → time-series)
# Real-shard + selection + witness-divergence harness:
make bench-real-shard         # #000026 — real-shard latency / audit baseline
make bench-fork-baseline      # pin current bench output as ForkScore parent
make bench-fork-score         # score child vs pinned parent (CI-gateable)
make bench-witness-divergence # extract LLM-divergence as 5F fixtures

Each invocation emits a JSON bench.batteries.base.BatteryResult with per-task pass/fail, fixture digest, runtime digest, and sub-battery-specific metrics.

The four batteries

5S — representation discipline

Five sub-batteries testing what the system understands at the sign / meaning / derivation level:

  • Syntax — does the named π* parse the input without raising?

  • Semantics — do two surface forms canonicalize to the same bytes when they should (and not when they shouldn’t)?

  • Syllogism — does each step in a deductive chain validly follow under the named rule (categorical_transitivity, chain_3, invalid_converse, missing_premise)?

  • Synthesis — does the system assemble cited facts into a coherent derivation supported by the fact set?

  • Semiotics — is meaning preserved under controlled label swaps?
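The Syntax and Semantics checks can be illustrated with a toy canonicalizer. This is a minimal sketch, not the registry's implementation: `canonicalize_rational` and `parses` are hypothetical helpers built on the standard-library `fractions` module, standing in for whatever a real π* entry does.

```python
from fractions import Fraction


def canonicalize_rational(text: str) -> bytes:
    """Hypothetical canonicalizer: reduce a rational literal to lowest terms."""
    value = Fraction(text.strip())
    return f"{value.numerator}/{value.denominator}".encode("utf-8")


def parses(text: str) -> bool:
    """Syntax-style check: does the canonicalizer accept the input without raising?"""
    try:
        canonicalize_rational(text)
        return True
    except ValueError:
        return False


# Semantics-style checks: equal values must share bytes, unequal values must not.
should_match = canonicalize_rational("2/4") == canonicalize_rational("0.5")
should_differ = canonicalize_rational("1/3") != canonicalize_rational("0.333")

print(parses("2/4"), parses("abc"), should_match, should_differ)
```

Comparing bytes rather than parsed values keeps the check independent of any equality logic inside the canonicalizer itself.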

5T — temporal / cross-reasoning discipline

Six sub-batteries (legacy transfer plus the canonical Dav1DPrometheus five):

  • Transfer Learning — does a learned pattern carry across task / domain / carrier?

  • Triangulation — do independent strategies (substring, token_subset, token_overlap, entity_match) agree at threshold?

  • Truthtables — exhaustive propositional coverage at N=2..4 variables.

  • Transitivity — typed-relation chains under whitelist (implies, subset_of, ancestor_of, before, less_than).

  • Time — temporal-context preservation across memory_root snapshots; integrates with #000017 surface.
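As an illustration of what exhaustive propositional coverage at N=2..4 means, the sketch below enumerates every assignment and compares two formulas row by row. The helper names (`truth_table`, `equivalent`) are assumptions made for this example and are not taken from the Truthtables sub-battery.

```python
from itertools import product


def truth_table(n: int) -> list:
    """All 2**n assignments of n boolean variables."""
    return list(product([False, True], repeat=n))


def equivalent(f, g, n: int) -> bool:
    """True when f and g agree on every row of the n-variable table."""
    return all(f(*row) == g(*row) for row in truth_table(n))


def de_morgan_lhs(*vs: bool) -> bool:
    return not all(vs)


def de_morgan_rhs(*vs: bool) -> bool:
    return any(not v for v in vs)


# Row counts and a De Morgan check for each N in 2..4.
row_counts = {n: len(truth_table(n)) for n in range(2, 5)}
de_morgan_holds = {n: equivalent(de_morgan_lhs, de_morgan_rhs, n) for n in range(2, 5)}

# An implication is not equivalent to its converse, so this check must fail.
converse_equivalent = equivalent(
    lambda p, q: (not p) or q,
    lambda p, q: (not q) or p,
    2,
)

print(row_counts, de_morgan_holds, converse_equivalent)
```

Because the table is enumerated in full, a pass is a proof for that N rather than a sampled estimate.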

5F — operational quality

Five sub-batteries, each with both an embedded mode (Phase 1a, synthetic) and a live mode (Phase 1b.2, routed through real arborist surfaces):

Sub-battery      Live surface
Function         arborist.qa.parse_claims.parse_pointer_claims
Finetuning       arborist.selfmodel.store_snapshot round-trip
Falsification    arborist.qa.verify.verify_quotes
Formulate        arborist.qa.parse_claims.parse_pointer_claims
Feedback Loop    arborist.store.append_audit + memory.snapshot

Per-task detail.source reports "embedded" or "live" so bench output distinguishes synthetic from production signal.
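A consumer of the bench output might split results by that field along these lines. This is a sketch under stated assumptions: only `detail.source` comes from the description above, while the other keys (`battery`, `tasks`, `task_id`, `passed`, the digest names) and the sample payload are hypothetical and may not match the real BatteryResult schema.

```python
import json
from collections import Counter

# Hypothetical payload; every key except detail.source is an assumption.
SAMPLE = """
{
  "battery": "5F",
  "fixture_digest": "sha256:aaaa",
  "runtime_digest": "sha256:bbbb",
  "tasks": [
    {"task_id": "function-001", "passed": true, "detail": {"source": "embedded"}},
    {"task_id": "function-002", "passed": false, "detail": {"source": "live"}},
    {"task_id": "formulate-001", "passed": true, "detail": {"source": "live"}}
  ]
}
"""


def summarize(result: dict) -> dict:
    """Count pass/fail totals and how many tasks came from each source."""
    tasks = result.get("tasks", [])
    passed = sum(1 for task in tasks if task.get("passed"))
    by_source = Counter(
        task.get("detail", {}).get("source", "unknown") for task in tasks
    )
    return {
        "total": len(tasks),
        "passed": passed,
        "failed": len(tasks) - passed,
        "by_source": dict(by_source),
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

Keeping the source split next to the pass/fail totals makes it harder to mistake a synthetic-only pass rate for production signal.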

5R — workspace operators

Five sub-batteries testing operators applied to a workspace (SelfModel + memory_root + audit chain):

  • React — observations integrate into downstream state.

  • Rearrange — restructure without semantic shift.

  • Restore — retrieve prior facts from history.

  • Replicate — π*-determinism across N replicas.

  • Resonate — variance-zero across N runs.
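The determinism idea behind Replicate and Resonate can be sketched as "run N times, hash each output, expect exactly one distinct digest". The operators and helpers below are hypothetical stand-ins written for this example; they are not arborist operators.

```python
import hashlib
import itertools
import json


def digest(obj) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def replicate(operator, payload, n: int = 5) -> set:
    """Apply the operator n times and collect the distinct output digests."""
    return {digest(operator(payload)) for _ in range(n)}


def stable_operator(payload: dict) -> dict:
    """Depends only on its input, so every run yields the same bytes."""
    return {"tokens": sorted(payload["text"].split())}


_calls = itertools.count()


def drifting_operator(payload: dict) -> dict:
    """Leaks hidden state (a call counter), so outputs differ between runs."""
    return {"tokens": sorted(payload["text"].split()), "call": next(_calls)}


stable_digests = replicate(stable_operator, {"text": "b a c"}, n=5)
drifting_digests = replicate(drifting_operator, {"text": "b a c"}, n=5)

print(len(stable_digests), len(drifting_digests))
```

A single distinct digest across all runs is the zero-variance condition; any larger set pinpoints nondeterminism without needing a tolerance threshold.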

Cross-modality discipline

Every fixture carries:

  • carrier — top-level domain, whitelisted by bench.batteries.base.PHASE_1_CARRIERS. Phase 1 carriers: text, claim_lattice, memory_snapshot, selfmodel_snapshot, providence_record, audit_event, code, arithmetic, logic, time_series.

  • domain — sub-domain qualifier (e.g., rational, propositional, python_ast).

  • pi_star_ref — registry key naming the canonicalizer.

  • loss_report_refs — optional projection-loss links.

  • modality_notes — scope note.

Unsupported carriers fail explicitly with reason="unsupported_carrier" and are never silently accepted. Hidden-channel work is defensive only (detection / flagging, never generation).
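The explicit-failure rule can be sketched as a small gate in front of a fixture. Here the local `PHASE_1_CARRIERS` set is a copy of the Phase 1 list above, standing in for the real constant in bench.batteries.base, and `check_carrier` with its return shape is a hypothetical helper, not the project's API.

```python
# Local copy of the Phase 1 whitelist listed above (stand-in for the real constant).
PHASE_1_CARRIERS = frozenset({
    "text", "claim_lattice", "memory_snapshot", "selfmodel_snapshot",
    "providence_record", "audit_event", "code", "arithmetic", "logic",
    "time_series",
})


def check_carrier(fixture: dict) -> dict:
    """Hypothetical gate: reject any fixture whose carrier is not whitelisted."""
    carrier = fixture.get("carrier")
    if carrier not in PHASE_1_CARRIERS:
        # Fail loudly with a machine-readable reason instead of skipping the task.
        return {"ok": False, "reason": "unsupported_carrier", "carrier": carrier}
    return {"ok": True, "reason": None, "carrier": carrier}


accepted = check_carrier(
    {"carrier": "arithmetic", "domain": "rational", "pi_star_ref": "arithmetic@v1"}
)
rejected = check_carrier({"carrier": "audio", "domain": "waveform"})
missing = check_carrier({"domain": "rational"})

print(accepted, rejected, missing)
```

A fixture with no carrier at all takes the same failure path as an unknown one, so an omission cannot pass by default.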

ForkScore consumes battery output

The v8 ForkScore (see v8 ForkScore) reads BatteryResult JSON from a parent and a child organism and computes a weighted scalar verdict classed as ACCEPT, MARGINAL, or REJECT:

make bench-suite                                    # generates parent.json
# ... apply changes ...
make bench-suite                                    # generates child.json
arborist substrate score --parent parent.json --child child.json
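To make "weighted scalar verdict" concrete, the sketch below scores per-battery pass-rate deltas between parent and child. Everything numeric here is an assumption for illustration: the equal weights, the two thresholds, and the use of pass rates as inputs are hypothetical and are not the actual v8 ForkScore definition.

```python
# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = {"5S": 0.25, "5T": 0.25, "5F": 0.25, "5R": 0.25}
ACCEPT_AT = 0.0        # assumed: a non-negative weighted delta is accepted
REJECT_BELOW = -0.02   # assumed: small regressions above this are marginal


def fork_score(parent: dict, child: dict) -> float:
    """Weighted sum of per-battery pass-rate deltas (child minus parent)."""
    return sum(w * (child[name] - parent[name]) for name, w in WEIGHTS.items())


def verdict(score: float) -> str:
    """Map the scalar score onto the three verdict classes."""
    if score >= ACCEPT_AT:
        return "ACCEPT"
    if score >= REJECT_BELOW:
        return "MARGINAL"
    return "REJECT"


parent = {"5S": 0.5, "5T": 0.5, "5F": 0.5, "5R": 0.5}
improved = {"5S": 0.75, "5T": 0.5, "5F": 0.5, "5R": 0.5}
slipped = {"5S": 0.46875, "5T": 0.5, "5F": 0.5, "5R": 0.5}
regressed = {"5S": 0.5, "5T": 0.5, "5F": 0.25, "5R": 0.5}

for name, child in [("improved", improved), ("slipped", slipped), ("regressed", regressed)]:
    print(name, fork_score(parent, child), verdict(fork_score(parent, child)))
```

A three-class verdict lets a CI gate block on REJECT while routing MARGINAL changes to a human reviewer.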

Authoring new fixtures

See docs/spec-methodology.md (per-author checklist) and the existing fixture files under bench/fixtures/. New sub-batteries follow the protocol in bench.batteries.base: Battery.run(fixtures_path) → BatteryResult, deterministic, no LLM-as-judge, carrier metadata mandatory.
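A new sub-battery following the run(fixtures_path) protocol might look roughly like the sketch below. The `BatteryResult` dataclass, the `EchoBattery` class, the fixture keys (`id`, `input`, `expected`), and the `missing_carrier` reason are all hypothetical stand-ins written for this example; the real base classes in bench.batteries.base may differ.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class BatteryResult:
    """Stand-in result type; the real one lives in bench.batteries.base."""
    name: str
    tasks: list = field(default_factory=list)


class EchoBattery:
    """Toy sub-battery: passes when the normalized input equals the expected text."""
    name = "echo"

    def run(self, fixtures_path) -> BatteryResult:
        fixtures = json.loads(Path(fixtures_path).read_text(encoding="utf-8"))
        result = BatteryResult(name=self.name)
        # Sort by id so the output order never depends on file order.
        for fx in sorted(fixtures, key=lambda f: f["id"]):
            if "carrier" not in fx:
                # Carrier metadata is mandatory, so record an explicit failure.
                result.tasks.append(
                    {"id": fx["id"], "passed": False, "reason": "missing_carrier"}
                )
                continue
            # Pure string comparison: deterministic, no model in the loop.
            passed = fx["input"].strip().lower() == fx["expected"]
            result.tasks.append({"id": fx["id"], "passed": passed, "reason": None})
        return result


fixtures = [
    {"id": "b", "carrier": "text", "input": " Hello ", "expected": "hello"},
    {"id": "a", "carrier": "text", "input": "World", "expected": "earth"},
    {"id": "c", "input": "x", "expected": "x"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "fixtures.json"
    path.write_text(json.dumps(fixtures), encoding="utf-8")
    result = EchoBattery().run(path)

print(result)
```

Sorting fixtures before running them is one simple way to keep the result byte-stable across runs.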


Permacomputer Preamble — License: AGPL-3.0-only

NO WARRANTY. Software is provided “AS IS” without warranty of any kind. Full text: License.