Benchmark surface
=================

Arborist ships the complete **Dav1DPrometheus 5S/5T/5F/5R** evaluation
suite as first-class infrastructure: 21 sub-batteries and
**662+ deterministic fixtures** in the default runner (Phase 1d
expansion, 2026-05-09), plus ~110 additional math π* fixtures. Every
benchmark is reproducible and uses no LLM-as-judge, and many
sub-batteries route through the actual arborist surface (parser,
verifier, audit chain, π* registry) rather than synthetic gold output.

Quick reference
---------------

.. code-block:: bash

   make bench-suite              # complete 5S + 5T + 5F + 5R (662 tasks)
   make bench-5s                 # representation discipline
   make bench-5t                 # temporal / cross-reasoning
   make bench-5f                 # operational quality (synthetic + live)
   make bench-5r                 # workspace operators
   # Per-π* 5S targets (#000030 SymPy substrate + the registry's text/
   # arithmetic/logic core + the last-stub graduation):
   make bench-5s-math            # arithmetic@v1 + logic-kernel@v1
   make bench-5s-code            # code-py-ast@v1
   make bench-5s-time-series     # time-series-quantized@v1
   make bench-5s-tabular         # tabular-pinned@v1 (last reserved stub)
   make bench-5s-algebra         # algebra-symbolic@v1
   make bench-5s-calculus-limit  # calculus-limit@v1
   make bench-5s-calculus-series # calculus-series@v1
   make bench-5s-linear-algebra  # linear-algebra@v1
   make bench-5s-function-sampled # function-sampled@v1 (SymPy → time-series)
   # Real-shard + selection + witness-divergence harness:
   make bench-real-shard         # #000026 — real-shard latency / audit baseline
   make bench-fork-baseline      # pin current bench output as ForkScore parent
   make bench-fork-score         # score child vs pinned parent (CI-gateable)
   make bench-witness-divergence # extract LLM-divergence as 5F fixtures

Each invocation emits a JSON :class:`bench.batteries.base.BatteryResult`
with per-task pass/fail, fixture digest, runtime digest, and
sub-battery-specific metrics.
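
As a rough illustration of consuming that output, the sketch below
tallies per-task results from a sample payload. The field names
(``tasks``, ``task_id``, ``passed``, ``fixture_digest``,
``runtime_digest``) and the digest values are assumptions made for this
example; check :class:`bench.batteries.base.BatteryResult` for the real
schema.

.. code-block:: python

   import json

   # Hypothetical payload: the real BatteryResult field names may differ.
   SAMPLE = json.loads("""
   {
     "battery": "5s-semantics",
     "fixture_digest": "sha256:aaaa",
     "runtime_digest": "sha256:bbbb",
     "tasks": [
       {"task_id": "sem-001", "passed": true},
       {"task_id": "sem-002", "passed": false},
       {"task_id": "sem-003", "passed": true}
     ]
   }
   """)


   def summarize(result):
       """Count passes and collect the ids of failing tasks."""
       tasks = result["tasks"]
       failed = [task["task_id"] for task in tasks if not task["passed"]]
       return {
           "total": len(tasks),
           "passed": len(tasks) - len(failed),
           "failed_ids": failed,
       }


   summary = summarize(SAMPLE)
   print(summary)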

The four batteries
------------------

5S — representation discipline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Five sub-batteries testing what the system understands at the
sign / meaning / derivation level:

- **Syntax** — does the named π* parse the input without raising?
- **Semantics** — do two surface forms canonicalize to the same
  bytes when they should (and not when they shouldn't)?
- **Syllogism** — does each step in a deductive chain validly
  follow under the named rule (categorical_transitivity, chain_3,
  invalid_converse, missing_premise)?
- **Synthesis** — does the system assemble cited facts into a
  coherent derivation supported by the fact set?
- **Semiotics** — is meaning preserved under controlled label
  swaps?
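
The Semantics check can be pictured with a toy canonicalizer. This is a
minimal sketch, not a registered π*: ``canonicalize_rational`` is a
hypothetical helper that reduces a rational to lowest terms, so
equivalent surface forms yield identical bytes and distinct values do
not.

.. code-block:: python

   from fractions import Fraction


   def canonicalize_rational(text: str) -> bytes:
       """Toy canonicalizer (hypothetical): lowest-terms ``n/d`` as ASCII bytes."""
       value = Fraction(text.strip())
       return f"{value.numerator}/{value.denominator}".encode("ascii")


   # Equivalent surface forms must collapse to the same bytes ...
   same = canonicalize_rational("2/4") == canonicalize_rational("1/2")
   also_same = canonicalize_rational("0.5") == canonicalize_rational("1/2")
   # ... and different values must stay distinct.
   different = canonicalize_rational("1/3") != canonicalize_rational("1/2")

   print(same, also_same, different)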

5T — temporal / cross-reasoning discipline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Six sub-batteries: the legacy ``transfer`` battery plus the five
canonical Dav1DPrometheus sub-batteries listed here:

- **Transfer Learning** — does a learned pattern carry across
  task / domain / carrier?
- **Triangulation** — do independent strategies (substring,
  token_subset, token_overlap, entity_match) agree at threshold?
- **Truthtables** — exhaustive propositional coverage at N=2..4
  variables.
- **Transitivity** — typed-relation chains under whitelist
  (``implies``, ``subset_of``, ``ancestor_of``, ``before``,
  ``less_than``).
- **Time** — temporal-context preservation across memory_root
  snapshots; integrates with #000017 surface.
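
To make "exhaustive propositional coverage" concrete, the sketch below
enumerates every assignment for a formula and checks whether it holds
under all of them. The helper names are illustrative and are not part of
the bench package.

.. code-block:: python

   from itertools import product


   def truth_table(formula, n_vars):
       """Return (assignment, result) for all 2**n_vars assignments."""
       return [
           (values, formula(*values))
           for values in product((False, True), repeat=n_vars)
       ]


   def is_tautology(formula, n_vars):
       """True when the formula holds under every assignment."""
       return all(result for _, result in truth_table(formula, n_vars))


   def implies(a, b):
       return (not a) or b


   # Modus ponens, (p and (p -> q)) -> q, is valid.
   def modus_ponens(p, q):
       return implies(p and implies(p, q), q)


   # Affirming the consequent, (q and (p -> q)) -> p, is not.
   def affirming_consequent(p, q):
       return implies(q and implies(p, q), p)


   print(is_tautology(modus_ponens, 2))
   print(is_tautology(affirming_consequent, 2))
   # Row counts at N = 2, 3, 4 variables.
   print([len(truth_table(lambda *v: True, n)) for n in (2, 3, 4)])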

5F — operational quality
~~~~~~~~~~~~~~~~~~~~~~~~

Five sub-batteries, each with an **embedded** mode (Phase 1a,
synthetic) and a **live** mode (Phase 1b.2, routed through real
arborist surfaces):

================  =============================================
Sub-battery       Live surface
================  =============================================
Function          ``arborist.qa.parse_claims.parse_pointer_claims``
Finetuning        ``arborist.selfmodel.store_snapshot`` round-trip
Falsification     ``arborist.qa.verify.verify_quotes``
Formulate         ``arborist.qa.parse_claims.parse_pointer_claims``
Feedback Loop     ``arborist.store.append_audit`` + ``memory.snapshot``
================  =============================================

Per-task ``detail.source`` reports ``"embedded"`` or ``"live"`` so
bench output distinguishes synthetic from production signal.
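
One way to use that field is to report pass rates separately per
source. The task records below are invented for illustration; only the
``detail.source`` key and its two values come from the description
above, and the other field names are assumptions.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical task records; only detail.source is taken from the docs.
   tasks = [
       {"task_id": "fn-001", "passed": True, "detail": {"source": "embedded"}},
       {"task_id": "fn-002", "passed": True, "detail": {"source": "live"}},
       {"task_id": "fn-003", "passed": False, "detail": {"source": "live"}},
   ]


   def pass_rate_by_source(task_list):
       """Map each detail.source value to its fraction of passing tasks."""
       counts = defaultdict(lambda: [0, 0])  # source -> [passed, total]
       for task in task_list:
           bucket = counts[task["detail"]["source"]]
           bucket[0] += int(task["passed"])
           bucket[1] += 1
       return {source: passed / total for source, (passed, total) in counts.items()}


   rates = pass_rate_by_source(tasks)
   print(rates)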

5R — workspace operators
~~~~~~~~~~~~~~~~~~~~~~~~

Five sub-batteries testing operators applied to a workspace
(SelfModel + memory_root + audit chain):

- **React** — observations integrate into downstream state.
- **Rearrange** — restructure without semantic shift.
- **Restore** — retrieve prior facts from history.
- **Replicate** — π*-determinism across N replicas.
- **Resonate** — variance-zero across N runs.
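
The Replicate and Resonate checks both reduce to "N runs, one
digest". The sketch below shows that idea with plain dictionaries
standing in for a workspace; the operators and the digest helper are
hypothetical and do not reflect the actual arborist API.

.. code-block:: python

   import hashlib
   import itertools
   import json


   def canonical_digest(obj) -> str:
       """SHA-256 over a key-sorted, whitespace-free JSON encoding."""
       payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
       return hashlib.sha256(payload.encode("utf-8")).hexdigest()


   def is_deterministic(operator, workspace, n_runs=5):
       """True when every run of the operator yields the same digest."""
       digests = {canonical_digest(operator(workspace)) for _ in range(n_runs)}
       return len(digests) == 1


   def sort_facts(workspace):
       # Deterministic: output depends only on the input facts.
       return {"facts": sorted(workspace["facts"])}


   _run_counter = itertools.count()


   def stamp_run(workspace):
       # Non-deterministic across runs: leaks a per-call counter into the output.
       return {"facts": sorted(workspace["facts"]), "run": next(_run_counter)}


   workspace = {"facts": ["b", "a", "c"]}
   print(is_deterministic(sort_facts, workspace))
   print(is_deterministic(stamp_run, workspace))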

Cross-modality discipline
-------------------------

Every fixture carries:

- ``carrier`` — domain whitelist enforced by
  :data:`bench.batteries.base.PHASE_1_CARRIERS`. Phase 1
  domains: ``text``, ``claim_lattice``, ``memory_snapshot``,
  ``selfmodel_snapshot``, ``providence_record``, ``audit_event``,
  ``code``, ``arithmetic``, ``logic``, ``time_series``.
- ``domain`` — sub-domain qualifier (e.g.,
  ``rational``, ``propositional``, ``python_ast``).
- ``pi_star_ref`` — registry key naming the canonicalizer.
- ``loss_report_refs`` — optional projection-loss links.
- ``modality_notes`` — scope note.

Unsupported carriers fail explicitly with
``reason="unsupported_carrier"``; they are never silently accepted.
Hidden-channel work is defensive only (detection and flagging, never
generation).
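
The explicit-failure rule might look like the following. This is a
sketch under assumptions: the set below is a local stand-in that mirrors
the Phase 1 list above (the type of the real
:data:`bench.batteries.base.PHASE_1_CARRIERS` is not shown here), and
``check_carrier`` is a hypothetical helper.

.. code-block:: python

   # Local stand-in mirroring the Phase 1 list; not imported from bench.
   PHASE_1_CARRIERS = frozenset({
       "text", "claim_lattice", "memory_snapshot", "selfmodel_snapshot",
       "providence_record", "audit_event", "code", "arithmetic", "logic",
       "time_series",
   })


   def check_carrier(fixture: dict) -> dict:
       """Reject fixtures whose carrier is missing or not whitelisted."""
       carrier = fixture.get("carrier")
       if carrier not in PHASE_1_CARRIERS:
           # Fail loudly instead of silently accepting the fixture.
           return {"passed": False, "reason": "unsupported_carrier", "carrier": carrier}
       return {"passed": True, "reason": None, "carrier": carrier}


   print(check_carrier({"carrier": "arithmetic", "domain": "rational"}))
   print(check_carrier({"carrier": "audio"}))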

ForkScore consumes battery output
---------------------------------

The v8 ForkScore (see :doc:`v8-fork-score`) reads BatteryResult JSON
from a parent and a child organism and computes a weighted scalar
verdict classified as ACCEPT, MARGINAL, or REJECT:

.. code-block:: bash

   make bench-suite                                    # generates parent.json
   # ... apply changes ...
   make bench-suite                                    # generates child.json
   arborist substrate score --parent parent.json --child child.json
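
The actual weights and thresholds are defined in :doc:`v8-fork-score`.
Purely to illustrate the shape of a weighted scalar verdict, the sketch
below scores per-battery pass-rate deltas. Every weight, threshold, and
input shape in it is an assumption chosen for the example, not the v8
definition.

.. code-block:: python

   # All constants are hypothetical; see v8-fork-score for the real ones.
   WEIGHTS = {"5s": 0.25, "5t": 0.25, "5f": 0.25, "5r": 0.25}
   ACCEPT_AT = 0.0
   REJECT_BELOW = -0.02


   def fork_score(parent: dict, child: dict) -> float:
       """Weighted sum of (child - parent) pass-rate deltas per battery."""
       return sum(
           weight * (child[name] - parent[name])
           for name, weight in WEIGHTS.items()
       )


   def verdict(score: float) -> str:
       """Bucket a scalar score into one of three verdict classes."""
       if score >= ACCEPT_AT:
           return "ACCEPT"
       if score >= REJECT_BELOW:
           return "MARGINAL"
       return "REJECT"


   parent = {"5s": 0.90, "5t": 0.90, "5f": 0.90, "5r": 0.90}
   improved = {**parent, "5s": 0.94}
   slipped = {**parent, "5f": 0.86}
   regressed = {**parent, "5f": 0.50}

   for child in (improved, slipped, regressed):
       print(verdict(fork_score(parent, child)))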

Authoring new fixtures
----------------------

See :file:`docs/spec-methodology.md` (per-author checklist) and the
existing fixture files under :file:`bench/fixtures/`. New
sub-batteries follow the protocol in :mod:`bench.batteries.base`:
``Battery.run(fixtures_path) → BatteryResult`` must be deterministic,
use no LLM-as-judge, and treat carrier metadata as mandatory.
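
A toy sub-battery following that shape could look like the sketch
below. The ``BatteryResult`` dataclass here is a hypothetical stand-in,
and the fixture keys (``id``, ``input``) and the ``missing_carrier`` /
``parse_error`` reasons are invented for the example; the real base
classes live in :mod:`bench.batteries.base`.

.. code-block:: python

   import json
   import tempfile
   from dataclasses import dataclass, field
   from pathlib import Path


   @dataclass
   class BatteryResult:
       """Hypothetical stand-in for bench.batteries.base.BatteryResult."""
       battery: str
       tasks: list = field(default_factory=list)


   class IntegerSyntaxBattery:
       """Toy sub-battery: a task passes when its input parses as an integer."""

       name = "toy-integer-syntax"

       def run(self, fixtures_path) -> BatteryResult:
           fixtures = json.loads(Path(fixtures_path).read_text(encoding="utf-8"))
           result = BatteryResult(battery=self.name)
           for fixture in fixtures:
               if "carrier" not in fixture:
                   # Carrier metadata is mandatory, so fail explicitly.
                   outcome = {"passed": False, "reason": "missing_carrier"}
               else:
                   try:
                       int(fixture["input"])
                       outcome = {"passed": True, "reason": None}
                   except ValueError:
                       outcome = {"passed": False, "reason": "parse_error"}
               result.tasks.append({"task_id": fixture["id"], **outcome})
           return result


   fixtures = [
       {"id": "t1", "carrier": "arithmetic", "input": "42"},
       {"id": "t2", "carrier": "arithmetic", "input": "4x2"},
       {"id": "t3", "input": "7"},
   ]
   with tempfile.TemporaryDirectory() as tmp:
       path = Path(tmp) / "fixtures.json"
       path.write_text(json.dumps(fixtures), encoding="utf-8")
       result = IntegerSyntaxBattery().run(path)

   print([task["passed"] for task in result.tasks])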
