Methodology
This page is the long-form story behind EVALSIG: why the package exists, which papers it borrows from, and how we validated the claims in code.
The pain in numbers
- Item-level noise. Apple's GSM-Symbolic showed up to 65pp accuracy drops from adding an irrelevant clause to a math problem with the same answer (Mirzadeh et al., 2024). Zhao et al. (2021) showed swapping two few-shot examples drops accuracy from 88.5% to 51.3%.
- Infrastructure noise. Anthropic Engineering (March 2025) quantified a 6pp swing (p < 0.01) on Terminal-Bench 2.0 from resource configuration alone, and a 1.54pp swing on SWE-bench from 5x RAM. Even at temperature 0, batch size and kernel fusion produce stochastic outputs (arXiv:2408.04667; Thinking Machines, 2025).
- No paired inference. Miller (2024, "Adding Error Bars to Evals") shows that frontier models' scores correlate at 0.3 to 0.7 from question to question. Because both models face the same items, a paired analysis gives 2 to 4 times lower variance for free.
Despite all this, every commercial eval tool we surveyed in May 2026 stopped at "bootstrap CI on a single run" at best. Inspect AI is the only one that ships clustered SE.
What EVALSIG implements
| Feature | EVALSIG | lm-eval | Inspect AI | HELM | OpenAI Evals | Promptfoo | Braintrust | LangSmith | Galileo | Patronus | W&B Weave | Vellum | HoneyHive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bootstrap CI | yes | partial | yes | per-scenario | no | no | side-by-side | no | no | no | no | no | no |
| Clustered SE | yes | no | yes | no | no | no | no | no | no | no | no | no | no |
| Paired test | yes | no | no | no | no | no | no | no | no | no | no | no | no |
| Permutation | yes | no | no | no | no | no | no | no | no | no | no | no | no |
| MDE / power | yes | no | no | no | no | no | no | no | no | no | no | no | no |
| Sequential | yes | no | no | no | no | no | no | no | no | no | no | no | no |
| Multiplicity | yes | no | no | no | no | no | no | no | no | no | no | no | no |
Validation
The research/validate.py script runs four Monte Carlo experiments that, together, confirm EVALSIG does what the design doc promises. Total runtime is under 30 seconds on a laptop.
E1: paired inference beats unpaired Welch
- Setup: 400 simulations, 500 items each, true lift 1.5pp, shared per-item luck (the regime Miller cites).
- Result: paired permutation power 85.8%, unpaired Welch power 0.0%. Same data, same effect.
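The gap between the two tests comes from the shared per-item term: it inflates each model's variance but cancels in the per-item difference. The sketch below illustrates this on a single simulated dataset. It is not the code in research/validate.py; the continuous score model, the noise levels, and the function names are all assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_scores(n_items, lift, rng):
    """Simulate two models scored on the same items (hypothetical score model)."""
    a, b = [], []
    for _ in range(n_items):
        item_luck = rng.gauss(0.0, 1.0)  # shared by both models on this item
        a.append(item_luck + rng.gauss(0.0, 0.1))
        b.append(item_luck + lift + rng.gauss(0.0, 0.1))
    return a, b


def welch_se(a, b):
    """Standard error of the mean difference, treating the samples as independent."""
    return math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))


def paired_se(a, b):
    """Standard error of the mean per-item difference."""
    diffs = [y - x for x, y in zip(a, b)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))


def sign_flip_p_value(diffs, n_perm, rng):
    """Two-sided paired permutation test: randomly flip the sign of each difference."""
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (extreme + 1) / (n_perm + 1)


rng = random.Random(0)
a, b = simulate_scores(n_items=500, lift=0.015, rng=rng)
diffs = [y - x for x, y in zip(a, b)]
print("unpaired (Welch) SE:", welch_se(a, b))
print("paired SE:          ", paired_se(a, b))
print("sign-flip p-value:  ", sign_flip_p_value(diffs, 1000, rng))
```

Under these assumed noise levels the paired standard error is roughly a tenth of the unpaired one, which is why an unpaired test can miss an effect that the paired test detects on the same data.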
E2: cluster bootstrap controls Type-I under clustering
- Setup: 600 simulations under the null, 50 clusters of 10 items, mean within-cluster correlation 0.71.
- Result: naive item-level bootstrap rejects 43.2% of the time (target 5%). Cluster bootstrap rejects 5.5%.
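The naive bootstrap over-rejects because it treats correlated items as independent and so understates the standard error. A minimal sketch of the difference follows, assuming a simple Gaussian cluster-effect model whose within-cluster correlation is about 0.71. The helper name bootstrap_se and the data model are hypothetical, not EVALSIG's API.

```python
import random
import statistics


def bootstrap_se(units, n_boot, rng):
    """Bootstrap SE of the overall mean, resampling whole units with replacement.

    Pass one-item units for the naive bootstrap, or whole clusters
    for the cluster bootstrap.
    """
    means = []
    for _ in range(n_boot):
        resampled = rng.choices(units, k=len(units))
        total = sum(sum(unit) for unit in resampled)
        count = sum(len(unit) for unit in resampled)
        means.append(total / count)
    return statistics.stdev(means)


rng = random.Random(0)

# 50 clusters of 10 items; each item = shared cluster effect + item noise.
# ICC = 1 / (1 + 0.64**2), roughly 0.71 (illustrative choice).
clusters = []
for _ in range(50):
    cluster_effect = rng.gauss(0.0, 1.0)
    clusters.append([cluster_effect + rng.gauss(0.0, 0.64) for _ in range(10)])

items_as_units = [[x] for cluster in clusters for x in cluster]

naive_se = bootstrap_se(items_as_units, 500, rng)
cluster_se = bootstrap_se(clusters, 500, rng)
print("naive item-level SE:", naive_se)
print("cluster SE:         ", cluster_se)
```

With this design the cluster standard error comes out well over twice the naive one, so intervals built from the naive value are far too narrow.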
E3: MDE matches the empirical detection rate
- Setup: sd_diff = 0.4, n = 1000, alpha = 0.05, target power 80%. Computed MDE = 0.0315. Then 500 simulations with the true effect equal to that MDE.
- Result: empirical detection rate 81%, within 1pp of the target.
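The reported MDE can be reproduced with the standard normal-approximation formula, MDE = (z_alpha + z_power) * sd_diff / sqrt(n). The sketch below is an assumption about how the number arises, not EVALSIG's actual planner. The function name and the two_sided flag are hypothetical. Note that 0.0315 matches a one-sided test at alpha = 0.05, while a two-sided test would give about 0.0354, so the one-sided reading is our inference from the figures above.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(sd_diff, n, alpha=0.05, power=0.80, two_sided=False):
    """Normal-approximation MDE for the mean of n paired differences."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = std_normal.inv_cdf(1 - tail)  # critical value of the test
    z_power = std_normal.inv_cdf(power)     # quantile for the target power
    return (z_alpha + z_power) * sd_diff / math.sqrt(n)


print(round(minimum_detectable_effect(0.4, 1000), 4))                  # 0.0315
print(round(minimum_detectable_effect(0.4, 1000, two_sided=True), 4))  # 0.0354
```

Because the MDE scales with 1/sqrt(n), halving it requires four times as many items.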
E4: CLI release gate end-to-end
- infra-noise (6,000 items, two configs of the same model): REJECT (delta -0.30pp, p = 0.78, MDE 1.00pp).
- real-improvement (2,000 items, 2.5pp true lift): ALLOW (delta +2.95pp, p = 0.0002, MDE 0.94pp).
- underpowered (80 items, 2.5pp true lift): INCONCLUSIVE (delta +6.25pp, p = 0.23, MDE 16.75pp). The gate suggests collecting ~9,899 more items.
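One decision rule that is consistent with these three verdicts is sketched below. The source does not state the gate's actual logic, so everything here is an assumption: the function name, the 0.05 threshold, and especially the 1.5pp target MDE, which was chosen only because it roughly reproduces the suggested sample size through the 1/sqrt(n) scaling of the MDE.

```python
import math
from dataclasses import dataclass


@dataclass
class GateResult:
    verdict: str
    extra_items: int  # additional items suggested; 0 unless INCONCLUSIVE


def release_gate(delta_pp, p_value, mde_pp, n_items, alpha=0.05, target_mde_pp=1.5):
    """Hypothetical three-way gate: ALLOW, REJECT, or INCONCLUSIVE."""
    if p_value < alpha:
        # Significant result: ship improvements, block regressions.
        return GateResult("ALLOW" if delta_pp > 0 else "REJECT", 0)
    if mde_pp <= target_mde_pp:
        # Well powered but not significant: no evidence of a real improvement.
        return GateResult("REJECT", 0)
    # Underpowered: MDE shrinks like 1/sqrt(n), so scale n by (mde / target)^2.
    required = math.ceil(n_items * (mde_pp / target_mde_pp) ** 2)
    return GateResult("INCONCLUSIVE", required - n_items)


print(release_gate(-0.30, 0.78, 1.00, 6000))
print(release_gate(2.95, 0.0002, 0.94, 2000))
print(release_gate(6.25, 0.23, 16.75, 80))
```

With the rounded MDE of 16.75pp, this rule suggests about 9,900 extra items for the underpowered case, close to the ~9,899 reported by the gate.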
Reproduce by running the research/validate.py script.
Why this is hard to copy
The statistics are well known. The literature is mature. So why doesn't every eval vendor ship this already? Three reasons:
- It is dull. Implementing bootstrap CIs, McNemar, cluster bootstraps, and an MDE planner is not a sexy roadmap item. The teams that would do it are usually doing something else.
- It requires giving up the easy headline. A statistically defensible release gate refuses some releases that would have shipped under a naive comparison. Vendors do not love saying "actually, your improvement is not significant".
- It crosses team lines. The eval runner, the dashboard, the CI gate, the audit trail, and the SaaS billing are usually separate products with separate roadmaps. EVALSIG is the thin layer that joins them.
References
- Anthropic Engineering, "Quantifying infrastructure noise in agentic coding evals," March 2025.
- Mirzadeh et al., "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs," Apple ML Research, Oct 2024.
- Zhao et al., "Calibrate Before Use: Improving Few-Shot Performance of Language Models," ICML 2021.
- "Non-Determinism of Deterministic LLM Settings," arXiv:2408.04667.
- Thinking Machines, "Defeating Nondeterminism in LLM Inference," 2025.
- Miller, "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations," arXiv:2411.00640.
- Howard, Ramdas, McAuliffe, Sekhon, "Time-uniform, nonparametric, nonasymptotic confidence sequences," Annals of Statistics, 2021.
- Liang et al., "Holistic Evaluation of Language Models" (HELM), arXiv:2211.09110.
- Benjamini, Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," JRSS-B, 1995.