Configuration¶
EVALSIG has no required configuration file. Every knob is either a CLI flag, a Python argument, or an environment variable, and every default is documented below.
Defaults¶
| Knob | Default | Where to override |
|---|---|---|
| alpha | 0.05 | --alpha / alpha= |
| power | 0.80 | --power / target_power= |
| method | auto | --method / method= |
| n_resamples | 10,000 | --resamples / n_resamples= |
| one-sided | off | --one-sided / one_sided= |
| rng seed | 0 | --seed / rng= |
| Output renderer | tty | --output {tty,json,markdown} |
| Store project id | default | --project / project_id= |
The minimum-detectable-effect policy (--min-delta) is the one
parameter with no default: every gate invocation must declare what
"meaningfully better" means.
Environment variables¶
| Variable | Default | Meaning |
|---|---|---|
| EVALSIG_LOG_LEVEL | WARNING | Stdlib log level (DEBUG, INFO, WARNING, ERROR) |
| EVALSIG_TELEMETRY | off | Set to 1 to enable opt-in local usage logging |
| EVALSIG_TELEMETRY_PATH | ~/.evalsig/usage.jsonl | Where the telemetry file lives |
EVALSIG never writes telemetry unless EVALSIG_TELEMETRY=1 is set, and
even then it only writes a local JSONL file. No network traffic.
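As a rough illustration of how the three variables and their documented defaults fit together, here is a minimal sketch. The helper name `read_settings` and the returned dictionary keys are hypothetical and are not part of EVALSIG; only the variable names and defaults come from the table above.

```python
import os
from pathlib import Path


def read_settings(env=None):
    """Hypothetical helper: resolve the documented variables with their defaults."""
    env = os.environ if env is None else env
    raw_path = env.get("EVALSIG_TELEMETRY_PATH", "~/.evalsig/usage.jsonl")
    return {
        # Stdlib log level name; WARNING when unset.
        "log_level": env.get("EVALSIG_LOG_LEVEL", "WARNING"),
        # Telemetry is opt-in: only the literal value "1" turns it on.
        "telemetry": env.get("EVALSIG_TELEMETRY", "") == "1",
        # Expand "~" so the default path points into the user's home directory.
        "telemetry_path": Path(os.path.expanduser(raw_path)),
    }


if __name__ == "__main__":
    print(read_settings())
```

Passing a plain dictionary instead of `os.environ` makes the lookup easy to exercise in tests without touching the real environment.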
Choosing a method¶
The default method="auto" picks based on the data shape:
- Both runs are 0/1 and no cluster id -> mcnemar
- A cluster id is set -> cluster_bootstrap
- Otherwise -> paired_permutation
Override when you have a reason: very large n with normal-looking diffs
benefits from paired_t (~100x faster), and paired_bootstrap is the
right call when you want a percentile CI rather than a t-distribution
CI.
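The method="auto" selection rule described in this section can be restated as a few lines of Python. This is only a sketch of the documented decision rule; the function name `pick_method` and its arguments are hypothetical, and EVALSIG's real implementation may inspect the data differently.

```python
def pick_method(a, b, cluster_ids=None):
    """Hypothetical restatement of the documented method="auto" rule."""
    # A cluster id always routes to the cluster bootstrap.
    if cluster_ids is not None:
        return "cluster_bootstrap"
    # Both runs must consist solely of 0/1 outcomes to qualify for McNemar.
    binary = all(x in (0, 1) for x in a) and all(x in (0, 1) for x in b)
    return "mcnemar" if binary else "paired_permutation"


if __name__ == "__main__":
    print(pick_method([0, 1, 1], [1, 1, 0]))
    print(pick_method([0, 1, 1], [1, 1, 0], cluster_ids=["q1", "q1", "q2"]))
    print(pick_method([0.2, 0.9], [0.4, 0.7]))
```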
Choosing min-delta¶
A common rule of thumb:
| Eval type | Suggested min-delta |
|---|---|
| Large, low-noise (MMLU, BBH) | 0.005 (0.5pp) |
| Agentic / long-context | 0.01 to 0.02 (1 to 2pp) |
| Small or noisy judge-graded | 0.02 to 0.05 |
If your run can't afford the items needed for a tight min-delta, relax the policy. Pretending the run is bigger than it is just hides the problem.
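To get a feel for what a given min-delta costs, the usual normal-approximation formula for a paired comparison is n ≈ ((z_alpha + z_power) · sd / min_delta)². The sketch below applies it; `items_needed` and the `sd_diff` input (the standard deviation of per-item score differences) are assumptions made for illustration, not EVALSIG functions or parameters, and the result is a planning estimate rather than what EVALSIG itself would report.

```python
import math
from statistics import NormalDist


def items_needed(min_delta, sd_diff, alpha=0.05, power=0.80, one_sided=False):
    """Approximate paired-sample size under a normal approximation (illustrative)."""
    z = NormalDist()
    # Critical value for the chosen alpha; two-sided splits alpha across both tails.
    z_alpha = z.inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / min_delta) ** 2)


if __name__ == "__main__":
    # sd_diff = 0.5 is an assumed value, not a measured one.
    for delta in (0.005, 0.01, 0.02, 0.05):
        print(delta, items_needed(delta, sd_diff=0.5))
```

Because min-delta enters the formula squared, halving it roughly quadruples the number of items, which is why a tight policy is only realistic on large evals.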
Choosing alpha and power¶
alpha = 0.05 and power = 0.80 are the standard defaults from the
literature. Departures we have seen:
- Very high-stakes shipments (foundation model releases) use alpha = 0.01.
- Power of 0.90 cuts the chance of an inconclusive run but requires about 33% more items.
- Two-sided is the right default for "is this different?"; one-sided is the right default for "is this better?", which is the usual release question.
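The cost of departing from the defaults can be estimated with the same normal approximation used for sample-size planning, where the required number of items scales with (z_alpha + z_power)². The sketch below is illustrative only: `relative_items` is a hypothetical helper, not an EVALSIG function. Under this approximation, moving a two-sided test from power 0.80 to 0.90 comes out at roughly a third more items, in line with the figure quoted above.

```python
from statistics import NormalDist


def relative_items(alpha_new=0.05, power_new=0.80,
                   alpha_old=0.05, power_old=0.80, one_sided=False):
    """Ratio of items needed under new vs. old settings (normal approximation)."""
    z = NormalDist()

    def z_sum(alpha, power):
        # Two-sided tests split alpha across both tails.
        tail = 1 - alpha if one_sided else 1 - alpha / 2
        return z.inv_cdf(tail) + z.inv_cdf(power)

    return (z_sum(alpha_new, power_new) / z_sum(alpha_old, power_old)) ** 2


if __name__ == "__main__":
    print("power 0.90:", round(relative_items(power_new=0.90), 2))
    print("alpha 0.01:", round(relative_items(alpha_new=0.01), 2))
```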
Reproducibility¶
Always pass an rng argument (Python) or --seed (CLI). EVALSIG never
uses a global random state, so without an explicit seed your runs are
non-reproducible. The default seed is 0, which makes accidental non-determinism rare.
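The idea of a local, explicitly seeded generator can be illustrated with a small sign-flip permutation test written against the standard library. This is a generic sketch, not EVALSIG's implementation: `paired_permutation_p` and its parameters are hypothetical names chosen for the example.

```python
import random


def paired_permutation_p(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for paired scores (illustrative)."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have either sign.
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed:
            hits += 1
    # Add-one estimator keeps the p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    run_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
    run_b = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
    print(paired_permutation_p(run_a, run_b, n_resamples=2_000, seed=7))
    print(paired_permutation_p(run_a, run_b, n_resamples=2_000, seed=7))
```

Two calls with the same seed return the same p-value, and because the generator is local, nothing else in the process that uses `random` is affected.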
Limits¶
- n_pairs must be at least 2 for any test.
- n_resamples of 10,000 is a sensible default. Below 1,000 the p-value is too granular for typical alpha values; above 100,000 you usually hit a memory bound before you hit a precision bound.
- The cluster bootstrap concatenates per-cluster slices, which is fine up to about a million items in a single resample. For larger runs, switch to chunked Parquet reads (evalsig.io.read_runframe_parquet) and resample at the cluster level only.
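The n_resamples guidance can be made concrete with two back-of-the-envelope quantities: the step between attainable p-values and the Monte Carlo standard error of a p-value near alpha. The sketch below assumes the common add-one estimator (hits + 1) / (n_resamples + 1); whether EVALSIG uses exactly that estimator is an assumption, and `p_value_resolution` is a hypothetical helper.

```python
import math


def p_value_resolution(n_resamples, alpha=0.05):
    """Return (step between attainable p-values, Monte Carlo SE near alpha)."""
    # With the add-one estimator, p-values move in increments of 1 / (n + 1).
    step = 1 / (n_resamples + 1)
    # Binomial standard error of an estimated p-value whose true value is alpha.
    mc_se = math.sqrt(alpha * (1 - alpha) / n_resamples)
    return step, mc_se


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        step, se = p_value_resolution(n)
        print(f"n_resamples={n}: step={step:.6f}, se at alpha=0.05 = {se:.5f}")
```

The standard error shrinks only with the square root of n_resamples, so a tenfold increase in resamples buys about a threefold gain in precision.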
See also¶
- CLI reference for every flag.
- Python API for every keyword argument.
- Output formats for renderer details.