Quickstart
This page gets you from pip install to a working release gate in about 30
seconds. For a longer, step-by-step walkthrough, see Your first comparison.
1. Install
Not yet on PyPI; install from source:
2. Write two runs as RunFrame JSON
EVALSIG's native format is one JSON file per run. Each item carries an id,
a score, and optionally a cluster_id (for example a passage or template).
{
  "run_id": "claude-x::mmlu-pro",
  "model_id": "claude-x",
  "task_id": "mmlu-pro",
  "metric_name": "accuracy",
  "items": [
    {"item_id": "q1", "score": 1.0, "cluster_id": "stem"},
    {"item_id": "q2", "score": 0.0, "cluster_id": "stem"},
    {"item_id": "q3", "score": 1.0, "cluster_id": "humanities"}
  ]
}
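If your harness already produces per-item scores, a file in this shape can be written with the Python standard library alone. The sketch below is a minimal illustration that mirrors only the fields shown in the example above. The helper names `build_runframe` and `write_runframe` are hypothetical and not part of EVALSIG, and the `model_id::task_id` pattern for `run_id` is an assumption inferred from the example.

```python
import json
from pathlib import Path


def build_runframe(model_id, task_id, metric_name, rows):
    """Build a RunFrame-shaped dict from (item_id, score, cluster_id) rows.

    Hypothetical helper, not part of EVALSIG. It only mirrors the fields
    shown in the example above.
    """
    items = []
    for item_id, score, cluster_id in rows:
        item = {"item_id": item_id, "score": float(score)}
        # cluster_id is optional, so omit the key when it is not provided.
        if cluster_id is not None:
            item["cluster_id"] = cluster_id
        items.append(item)
    return {
        # Assumption: run_id follows the "model_id::task_id" pattern.
        "run_id": f"{model_id}::{task_id}",
        "model_id": model_id,
        "task_id": task_id,
        "metric_name": metric_name,
        "items": items,
    }


def write_runframe(path, frame):
    """Serialize the dict as one JSON file per run."""
    Path(path).write_text(json.dumps(frame, indent=2), encoding="utf-8")


frame = build_runframe(
    "claude-x",
    "mmlu-pro",
    "accuracy",
    [("q1", 1, "stem"), ("q2", 0, "stem"), ("q3", 1, None)],
)
```

Calling `write_runframe("baseline.json", frame)` would then produce a file suitable for the next step, provided the run carries no required fields beyond those shown here.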
EVALSIG can also read native Inspect AI .eval logs, lm-eval-harness
samples_*.jsonl files, and HELM scenario_state.json files without any
conversion.
3. Gate the release from the CLI
evalsig gate \
  --baseline baseline.json \
  --candidate candidate.json \
  --metric accuracy \
  --min-delta 0.005 \
  --alpha 0.05 \
  --power 0.80
You'll see something like:
EVALSIG release gate
====================
delta: +0.0124 (paired_permutation)
CI (95%): [+0.0023, +inf]
p-value: 0.0070
required MDE: 0.0050
detectable: 0.0040 at 80% power
VERDICT: ALLOW
The exit code is 0 (ALLOW), 1 (REJECT), or 2 (INCONCLUSIVE), so CI
systems can read the verdict without parsing stdout. See
Understanding the output for what each field means.
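In a CI script written in Python, the documented exit codes can be mapped back to verdict names without reading stdout. The following is a sketch of one way to do that, assuming the `evalsig` executable is on `PATH` and accepts the flags shown in the command above. `verdict_from_exit_code` and `run_gate` are hypothetical helper names, not part of EVALSIG.

```python
import subprocess

# Exit-code mapping as documented above.
VERDICT_BY_EXIT_CODE = {0: "ALLOW", 1: "REJECT", 2: "INCONCLUSIVE"}


def verdict_from_exit_code(code):
    """Translate a gate exit code into a verdict string."""
    if code not in VERDICT_BY_EXIT_CODE:
        # Any other code (usage error, crash) is not a verdict.
        raise RuntimeError(f"unexpected exit code {code}")
    return VERDICT_BY_EXIT_CODE[code]


def run_gate(baseline, candidate, min_delta=0.005):
    """Run the CLI gate and return its verdict.

    Uses only the flags shown in the CLI example; check=False because a
    non-zero exit code is an expected outcome here, not an error.
    """
    cmd = [
        "evalsig", "gate",
        "--baseline", baseline,
        "--candidate", candidate,
        "--metric", "accuracy",
        "--min-delta", str(min_delta),
        "--alpha", "0.05",
        "--power", "0.80",
    ]
    completed = subprocess.run(cmd, check=False)
    return verdict_from_exit_code(completed.returncode)
```

Treating unknown exit codes as errors rather than as REJECT keeps a crashed gate from being mistaken for a statistical verdict.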
4. Or use the Python API
from evalsig import compare, gate
from evalsig.io import read_runframe_json

# Load the two runs written in step 2.
a = read_runframe_json("baseline.json")
b = read_runframe_json("candidate.json")

# Paired comparison: delta, p-value, and significance at alpha.
result = compare(a, b, alpha=0.05, one_sided=True)
print(result.delta, result.p_value, result.significant)

# Release gate with the same policy as the CLI example.
report = gate(a, b, min_delta=0.005, alpha=0.05, power=0.80)
print(report.verdict.value)  # 'ALLOW' / 'REJECT' / 'INCONCLUSIVE'
5. Wire it into CI
The shipped GitHub Action runs the same gate and writes a Markdown summary into the workflow run:
- uses: vtensor/evalsig@v0.1
  with:
    baseline: baseline.json
    candidate: candidate.json
    metric: accuracy
    min_delta: '0.005'
    alpha: '0.05'
    power: '0.80'
Or use the pytest plugin, which fails the test with the full Markdown report when the gate refuses to ship:
def test_no_regression(evalsig_gate):
    a = evalsig_gate.load("baseline.json")
    b = evalsig_gate.load("candidate.json")
    evalsig_gate.assert_no_regression(a, b, min_delta=0.005)
What just happened
EVALSIG aligned the two runs on item_id, picked a paired statistical test
based on the data shape (binary scores -> McNemar, continuous -> paired
permutation, clustered -> cluster bootstrap), computed the delta, the
confidence interval, the p-value, and the minimum detectable effect, then
compared the result to your --min-delta policy.
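To make the alignment and test-selection steps concrete, here is a simplified standard-library sketch. It is not EVALSIG's implementation: the function names are hypothetical, the p-values are two-sided, and the cluster bootstrap, confidence interval, and minimum detectable effect are left out. It assumes each run is reduced to a dict mapping item_id to score.

```python
import random
from math import comb


def align(a_scores, b_scores):
    """Pair scores on shared item_id; returns (baseline, candidate) tuples."""
    shared = sorted(a_scores.keys() & b_scores.keys())
    return [(a_scores[i], b_scores[i]) for i in shared]


def mcnemar_exact(pairs):
    """Two-sided exact McNemar p-value for binary (0/1) paired scores."""
    b_only = sum(1 for a, b in pairs if a == 0 and b == 1)  # candidate-only wins
    a_only = sum(1 for a, b in pairs if a == 1 and b == 0)  # baseline-only wins
    n = a_only + b_only  # only discordant pairs carry information
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_permutation(pairs, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in pairs]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def paired_p_value(pairs):
    """Pick a test by data shape: binary -> McNemar, otherwise permutation."""
    is_binary = all(s in (0, 1) for pair in pairs for s in pair)
    return mcnemar_exact(pairs) if is_binary else paired_permutation(pairs)
```

Items present in only one run are dropped during alignment in this sketch; how EVALSIG itself treats unmatched items is not stated here.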
For the full reasoning chain see Concepts. For every knob, see CLI reference and Python API.