EVALSIG

Know whether your LLM eval gains are real or just noise. Catch it in CI, before shipping.

EVALSIG sits between any LLM eval harness (Inspect AI, lm-eval-harness, HELM, simple-evals, your internal pipeline) and the decision to ship a model. It applies the statistical machinery the academic literature has spent the last two years recommending but no commercial tool ships end-to-end: paired-difference testing, clustered standard errors, permutation tests, minimum-detectable-effect / power analysis, always-valid sequential monitoring, and multiple-comparison corrections.

The problem in one sentence

Frontier labs ship model updates on 1 to 3 percentage-point eval deltas, and Anthropic measured a 6 percentage-point swing on Terminal-Bench from infrastructure config alone. EVALSIG is the release gate that tells those two cases apart.

Why now

Item-level noise. Apple's GSM-Symbolic showed up to 65pp accuracy drops from adding an irrelevant clause to a math problem with the same answer. Zhao et al. showed swapping two few-shot examples can drop accuracy from 88.5% to 51.3%.
Infrastructure noise. Anthropic published a 6pp swing on Terminal-Bench 2.0 from resource config alone, and 1.54pp on SWE-bench from a 5x RAM change. Even at temperature 0, differences in batch size and kernel fusion can change model outputs.
No paired inference in the field. Frontier models' per-question scores correlate at 0.3 to 0.7. Comparing two models on the same items therefore yields a difference estimate with 2 to 4 times lower variance than treating the two runs as independent samples. Every commercial eval tool today stops at "bootstrap CI on a single run".
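The variance argument can be checked on synthetic data. The sketch below is illustrative only and does not use EVALSIG: the helper names `unpaired_se` and `paired_se`, the easy/hard item mix, and the pass rates are all assumptions chosen to give two models a positive per-question correlation.

```python
import math
import random
import statistics


def unpaired_se(a, b):
    """SE of mean(b) - mean(a) when the two runs are treated as independent."""
    return math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))


def paired_se(a, b):
    """SE of the mean per-item difference, using the fact that items are shared."""
    diffs = [y - x for x, y in zip(a, b)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))


# Synthetic benchmark (assumption): half the items are easy, half are hard.
# Both models find the same items easy or hard, which correlates their scores.
rng = random.Random(0)
n_items = 2000
model_a, model_b = [], []
for _ in range(n_items):
    easy = rng.random() < 0.5
    p_a = 0.90 if easy else 0.20  # hypothetical baseline pass rates
    p_b = 0.92 if easy else 0.22  # hypothetical candidate pass rates
    model_a.append(1.0 if rng.random() < p_a else 0.0)
    model_b.append(1.0 if rng.random() < p_b else 0.0)

ratio = (unpaired_se(model_a, model_b) / paired_se(model_a, model_b)) ** 2
print(f"unpaired SE: {unpaired_se(model_a, model_b):.4f}")
print(f"paired SE:   {paired_se(model_a, model_b):.4f}")
print(f"variance reduction from pairing: {ratio:.2f}x")
```

The paired standard error is smaller whenever the per-item scores of the two models are positively correlated, because the shared item difficulty cancels in the per-item difference.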

What you get

| You want to... | Reach for |
| --- | --- |
| Decide if a 1.2pp delta is real | `evalsig.compare(a, b)` -> `ComparisonResult` |
| Block a CI build on a bad release | `evalsig gate --baseline a.json --candidate b.json --min-delta 0.005` |
| Plan how many items you actually need | `evalsig.required_n(target_delta, sd_diff)` |
| Stop early on expensive evals without alpha inflation | `evalsig.sequential_gate(stream)` |
| Audit a third-party "X is +3pp better" claim | `evalsig compare --output markdown ...` |
| Keep an append-only history per project | `evalsig.store.write_run(...)` + `evalsig history` |
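Planning the number of items rests on a standard power calculation. Under a normal approximation for a paired mean difference, the required item count is roughly n = ((z_alpha + z_power) * sd_diff / target_delta)^2. The sketch below implements that textbook formula with the standard library; `required_items` is a hypothetical stand-in and is not the signature or implementation of `evalsig.required_n`, and the `alpha`, `power`, and `one_sided` defaults are assumptions.

```python
import math
from statistics import NormalDist


def required_items(target_delta, sd_diff, alpha=0.05, power=0.80, one_sided=True):
    """Normal-approximation item count for detecting a paired mean difference.

    target_delta: smallest true delta worth detecting (0.01 means 1pp).
    sd_diff: standard deviation of the per-item score differences.
    """
    if target_delta <= 0 or sd_diff <= 0:
        raise ValueError("target_delta and sd_diff must be positive")
    std_normal = NormalDist()
    # Critical value for the test, one-sided or two-sided.
    tail = alpha if one_sided else alpha / 2
    z_alpha = std_normal.inv_cdf(1 - tail)
    # Quantile matching the desired power.
    z_power = std_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / target_delta) ** 2)


# Hypothetical planning inputs: detect a 1pp gain when per-item
# differences have a standard deviation of 0.5.
n_needed = required_items(target_delta=0.01, sd_diff=0.5)
print(f"items needed: {n_needed}")
```

Because the item count scales with the inverse square of `target_delta`, halving the delta you want to detect roughly quadruples the number of items required.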

Quickstart

Install:

pip install evalsig

Compare two runs from the CLI:

evalsig gate \
  --baseline baseline.eval \
  --candidate candidate.eval \
  --metric accuracy \
  --cluster passage_id \
  --min-delta 0.005 \
  --alpha 0.05 \
  --power 0.80

Or from Python:

from evalsig import compare
from evalsig.io import read_inspect_log

a = read_inspect_log("baseline.eval")
b = read_inspect_log("candidate.eval")

result = compare(a, b, cluster="passage_id", alpha=0.05, one_sided=True)
print(result.delta)         # +0.0124
print(result.ci)            # (-0.003, +0.027)
print(result.p_value)       # 0.082
print(result.significant)   # False
print(result.mde)           # 0.018 (smallest delta this run could detect at 80% power: ~1.8pp)
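To make the `cluster="passage_id"` argument concrete, here is a minimal standard-library sketch of a paired comparison with a cluster-robust standard error. It illustrates the general technique and is not EVALSIG's implementation: the function name `clustered_paired_compare`, its return keys, and the toy data are hypothetical, and the p-value is a one-sided normal approximation for "candidate beats baseline".

```python
import math
from collections import defaultdict
from statistics import NormalDist


def clustered_paired_compare(scores_a, scores_b, clusters, alpha=0.05):
    """Paired mean difference with a cluster-robust standard error."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    delta = sum(diffs) / n

    # Sum the centered differences inside each cluster, so that items
    # sharing a passage count as one correlated unit rather than many.
    resid_by_cluster = defaultdict(float)
    for d, c in zip(diffs, clusters):
        resid_by_cluster[c] += d - delta
    se = math.sqrt(sum(r * r for r in resid_by_cluster.values())) / n
    if se == 0:
        raise ValueError("zero standard error: differences do not vary")

    std_normal = NormalDist()
    p_value = 1 - std_normal.cdf(delta / se)       # one-sided: candidate > baseline
    half = std_normal.inv_cdf(1 - alpha / 2) * se  # two-sided (1 - alpha) interval
    return {"delta": delta, "se": se, "ci": (delta - half, delta + half), "p_value": p_value}


# Toy data (hypothetical): six items drawn from three passages.
baseline = [0, 0, 1, 1, 0, 1]
candidate = [1, 1, 1, 1, 0, 0]
passages = ["p1", "p1", "p2", "p2", "p3", "p3"]

clustered = clustered_paired_compare(baseline, candidate, passages)
per_item = clustered_paired_compare(baseline, candidate, list(range(6)))
print(f"delta={clustered['delta']:.3f}")
print(f"clustered SE={clustered['se']:.3f}, item-level SE={per_item['se']:.3f}")
```

When differences within a passage move together, as in this toy data, the clustered standard error is larger than the item-level one, which is why ignoring clustering tends to overstate significance.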

Where to go next

New to EVALSIG? Start with Quickstart.
Want the conceptual map? Read What EVALSIG solves and Paired vs unpaired.
Building automation? Jump to CLI reference and Integrations.
Plugging into your own harness? See Modules: evalsig.io and Configuration.
Want to know why the statistics work? Methodology is the long-form story with citations.

Project status

Version: 0.1.0
Python: 3.10+
License: Apache-2.0
Tests: 45 unit + 4 end-to-end Monte-Carlo experiments, all passing.