Changelog

EVALSIG follows Semantic Versioning. The current line is 0.x, which means any minor release (0.MINOR.x) may include breaking changes; we will flag them clearly when they happen.

0.1.0 (2026-05-17)

Initial release. This is the version the design doc was written against; every "Goals v1" item is shipped except as noted.

Library

  • evalsig.types: ItemResult, RunFrame, ComparisonResult, MDEResult.
  • evalsig.exceptions: typed error hierarchy.
  • evalsig.inference.paired: paired t-test, paired permutation, paired bootstrap.
  • evalsig.inference.unpaired: Welch's t-test, two-sample permutation, two-sample bootstrap.
  • evalsig.inference.mcnemar: exact-binomial and chi-squared-with-continuity branches.
  • evalsig.inference.cluster_bootstrap: block bootstrap over the cluster id.
  • evalsig.inference.mde: closed-form MDE, required-N inverse, one-way-ANOVA ICC estimator.
  • evalsig.inference.power: power-for-delta under both fixed and clustered designs.
  • evalsig.inference.effect_size: Cohen's d, paired Cohen's d, Cliff's delta.
  • evalsig.inference.sequential: Howard 2021 asymptotic confidence sequence; streaming sequential_gate.
  • evalsig.inference.multiplicity: Bonferroni, Holm, Benjamini-Hochberg.
  • evalsig.compare.compare: orchestration with auto-method selection.
  • evalsig.compare.gate: ALLOW / REJECT / INCONCLUSIVE state machine with required-N suggestions.
  • evalsig.compare.report: JSON, Markdown, and TTY renderers.
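
For readers unfamiliar with the two McNemar branches listed above, the following is a minimal standalone sketch of the textbook formulas, using only the Python standard library. It does not use or reproduce the evalsig.inference.mcnemar API; the function names and the (b, c) discordant-pair arguments are hypothetical and chosen for illustration only.

```python
import math


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact-binomial McNemar p-value (illustrative, not the evalsig API).

    b and c are the discordant pair counts: items that only system A got
    right, and items that only system B got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5) up to k, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mcnemar_chi2_cc(b: int, c: int) -> float:
    """Chi-squared McNemar p-value with continuity correction (illustrative)."""
    n = b + c
    if n == 0:
        return 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / n
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    print(mcnemar_exact(1, 9))
    print(mcnemar_chi2_cc(1, 9))
```

The exact branch is usually preferred when the number of discordant pairs is small, and the chi-squared approximation becomes reasonable as that count grows. The threshold evalsig uses to switch between them is not stated here.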

I/O and store

  • evalsig.io.json_runframe: read/write the canonical EVALSIG JSON schema.
  • evalsig.io.lm_eval: read EleutherAI's samples_*.jsonl.
  • evalsig.io.inspect_log: read Inspect AI JSON exports.
  • evalsig.io.helm: read HELM scenario_state.json.
  • evalsig.io.parquet: read/write the canonical Parquet schema.
  • evalsig.io.base: Reader protocol and a small format registry.
  • evalsig.store: append-only writer, manifest, and read/query API.

CLI

  • evalsig compare, evalsig gate, evalsig mde, evalsig watch, evalsig doctor, evalsig history, evalsig version.
  • --output {tty,json,markdown} on every result-producing command.
  • BSD sysexits-style exit codes.

Integrations

  • evalsig.integrations.pytest_plugin: evalsig_gate fixture with assert_no_regression.
  • evalsig.integrations.github_action: the entry point used by the shipped GitHub Action (action.yml).
  • evalsig.integrations.braintrust: optional Braintrust publisher.
  • evalsig._telemetry: opt-in local usage log (off by default).

Tests

  • 45 unit tests covering every primitive and subsystem: paired, unpaired, mcnemar, cluster bootstrap, MDE/power, effect sizes, multiplicity, sequential testing, the store, the IO layer, the report renderers, the CLI, and the pytest plugin.
  • 4 end-to-end Monte Carlo experiments in research/validate.py. All pass on each release.

Known omissions

  • Sphinx-style reference pages are not auto-generated yet; the per-module docs in docs/modules/ are hand-written.
  • The SaaS dashboard, billing, and the closed-source ingestion layer are tracked in a separate repo.