Changelog
EVALSIG follows Semantic Versioning. The current release
line is 0.x, which means any minor release (0.MINOR.x) may include
breaking changes; we will flag them clearly when they happen.
0.1.0 (2026-05-17)
Initial release. This is the version the design doc was written against; every "Goals v1" item is shipped except as noted.
Library
- evalsig.types: ItemResult, RunFrame, ComparisonResult, MDEResult.
- evalsig.exceptions: typed error hierarchy.
- evalsig.inference.paired: paired t-test, paired permutation, paired bootstrap.
- evalsig.inference.unpaired: Welch's t-test, two-sample permutation, two-sample bootstrap.
- evalsig.inference.mcnemar: exact-binomial and chi-squared-with-continuity branches.
- evalsig.inference.cluster_bootstrap: block bootstrap over the cluster id.
- evalsig.inference.mde: closed-form MDE, required-N inverse, one-way-ANOVA ICC estimator.
- evalsig.inference.power: power-for-delta under both fixed and clustered designs.
- evalsig.inference.effect_size: Cohen's d, paired Cohen's d, Cliff's delta.
- evalsig.inference.sequential: Howard 2021 asymptotic confidence sequence; streaming sequential_gate.
- evalsig.inference.multiplicity: Bonferroni, Holm, Benjamini-Hochberg.
- evalsig.compare.compare: orchestration with auto-method selection.
- evalsig.compare.gate: ALLOW / REJECT / INCONCLUSIVE state machine with required-N suggestions.
- evalsig.compare.report: JSON, Markdown, and TTY renderers.
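For readers unfamiliar with two of the primitives named above, the following is a minimal, standard-library sketch of the textbook exact-binomial McNemar test and the Holm step-down adjustment. It does not use or reproduce the evalsig API; the function names mcnemar_exact and holm_adjust are hypothetical and chosen only for this illustration.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (hypothetical helper, not evalsig API).

    b and c are the discordant-pair counts: items that only system A got
    right and items that only system B got right. Under the null hypothesis,
    b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_adjust(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Adjusted values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


p_single = mcnemar_exact(1, 9)
p_family = holm_adjust([0.01, 0.04, 0.03])
```

With 1 versus 9 discordant pairs the one-sided tail is 11/1024, so the two-sided p-value is 22/1024. The Holm example shows how a raw p-value of 0.04 is raised to match the larger adjusted value that precedes it in sorted order.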
I/O and store
- evalsig.io.json_runframe: read/write the canonical EVALSIG JSON schema.
- evalsig.io.lm_eval: read EleutherAI's samples_*.jsonl.
- evalsig.io.inspect_log: read Inspect AI JSON exports.
- evalsig.io.helm: read HELM scenario_state.json.
- evalsig.io.parquet: read/write the canonical Parquet schema.
- evalsig.io.base: Reader protocol and a small format registry.
- evalsig.store: append-only writer, manifest, and read/query API.
CLI
- evalsig compare, evalsig gate, evalsig mde, evalsig watch, evalsig doctor, evalsig history, evalsig version.
- --output {tty,json,markdown} on every result-producing command.
- BSD sysexits-style exit codes.
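As background on the exit-code convention mentioned above, here is a small sketch that names the standard values from BSD sysexits.h so a wrapper script can log them readably. Which of these codes evalsig actually returns, and under what conditions, is not stated in this changelog; the describe_exit helper is hypothetical and makes no assumption about that.

```python
# Standard values from BSD <sysexits.h>, plus 0 for success.
SYSEXITS = {
    0: "EX_OK",
    64: "EX_USAGE",
    65: "EX_DATAERR",
    66: "EX_NOINPUT",
    67: "EX_NOUSER",
    68: "EX_NOHOST",
    69: "EX_UNAVAILABLE",
    70: "EX_SOFTWARE",
    71: "EX_OSERR",
    72: "EX_OSFILE",
    73: "EX_CANTCREAT",
    74: "EX_IOERR",
    75: "EX_TEMPFAIL",
    76: "EX_PROTOCOL",
    77: "EX_NOPERM",
    78: "EX_CONFIG",
}


def describe_exit(code: int) -> str:
    """Return the sysexits name for a process exit code (hypothetical helper)."""
    return SYSEXITS.get(code, f"non-sysexits ({code})")


usage_error = describe_exit(64)
unknown = describe_exit(1)
```

A CI wrapper could pass the return code of a subprocess call to describe_exit to distinguish, for example, a usage error from a data error in its logs.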
Integrations
- evalsig.integrations.pytest_plugin: evalsig_gate fixture with assert_no_regression.
- evalsig.integrations.github_action: the entry point used by the shipped GitHub Action (action.yml).
- evalsig.integrations.braintrust: optional Braintrust publisher.
- evalsig._telemetry: opt-in local usage log (off by default).
Tests
- 45 unit tests covering every primitive (paired, unpaired, mcnemar, cluster bootstrap, MDE/power, effect sizes, multiplicity, sequential testing, the store, the IO layer, the report renderers, the CLI, and the pytest plugin).
- 4 end-to-end Monte Carlo experiments in research/validate.py. All pass on each release.
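To illustrate what a Monte Carlo validation of a statistical primitive can look like, the sketch below simulates the null hypothesis many times and measures how often an exact McNemar test rejects at the 5% level. It is not the contents of research/validate.py; the function names, trial count, sample size, and seed are all assumptions made for this example.

```python
import random
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact-binomial McNemar p-value (hypothetical helper)."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def false_positive_rate(
    n_discordant: int = 30,
    trials: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> float:
    """Fraction of null simulations in which the test rejects at level alpha."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    rejections = 0
    for _ in range(trials):
        # Under the null, each discordant pair favours either system with p = 0.5.
        b = sum(rng.random() < 0.5 for _ in range(n_discordant))
        c = n_discordant - b
        if mcnemar_exact(b, c) <= alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
```

Because the exact test is conservative, the observed rate should sit at or below alpha up to simulation noise. A validation script of this kind would typically fail the release if the rate exceeded alpha by more than a few standard errors.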
Known omissions
- Sphinx-style reference pages are not auto-generated yet; the per-module docs in docs/modules/ are hand-written.
- The SaaS dashboard, billing, and the closed-source ingestion layer are tracked in a separate repo.