Understanding the output¶
Every evalsig compare and evalsig gate invocation produces a
ComparisonResult (or a GateReport wrapping one). This page is a guided
tour through every field, so you know how to read it without guessing.
A sample output¶
EVALSIG release gate
====================
delta: +0.0124 (paired_permutation)
CI (95%): [+0.0023, +inf]
p-value: 0.0070
required MDE: 0.0050
detectable: 0.0040 at 80% power
VERDICT: ALLOW
delta¶
The estimated effect: candidate mean minus baseline mean over the aligned
items. Always reported as (b - a), where a is the baseline run and b is the
candidate, so for a higher-is-better metric a positive delta means the
candidate is better.
Units are whatever your metric is. For accuracy, this is a proportion, so
+0.0124 means "1.24 percentage points better than baseline".
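As a minimal sketch of that definition, assuming per-item scores live in two dictionaries keyed by item id (the function name `paired_delta` and this data layout are illustrative, not EVALSIG's API):

```python
from statistics import fmean


def paired_delta(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Candidate mean minus baseline mean over items present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no aligned items")
    return fmean(candidate[i] for i in shared) - fmean(baseline[i] for i in shared)


# Hypothetical 0/1 accuracy scores; only q1..q3 appear in both runs.
baseline = {"q1": 1.0, "q2": 0.0, "q3": 1.0, "q4": 0.0}
candidate = {"q1": 1.0, "q2": 1.0, "q3": 1.0, "q5": 1.0}
delta = paired_delta(baseline, candidate)
```

Items that appear in only one run are ignored here, which is why partial overlap between runs can shift the delta.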
method¶
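Which underlying statistical test produced the p-value and CI. The auto-selector picks among the McNemar tests, paired_permutation, and cluster_bootstrap; paired_t and paired_bootstrap run only when you ask for them: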
| Method | When auto picks it |
|---|---|
| mcnemar_exact / mcnemar_chi2 | Both runs are 0/1 and no cluster id |
| paired_permutation | Continuous scores, no cluster id |
| cluster_bootstrap | A cluster id is provided |
| paired_t | You asked for it explicitly |
| paired_bootstrap | You asked for it explicitly |
You can override with --method on the CLI or method= in the Python API.
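The selection rules in the table can be sketched as a small function. This is an illustration of the documented rules only: `pick_method` is a hypothetical name, and because this page does not say how the selector chooses between mcnemar_exact and mcnemar_chi2, the sketch always returns mcnemar_exact for binary data.

```python
def pick_method(baseline, candidate, cluster_ids=None):
    """Mirror the auto-selection table above (hypothetical helper)."""
    # A cluster id takes priority over the score type.
    if cluster_ids is not None:
        return "cluster_bootstrap"
    # Both runs must contain only 0/1 scores to qualify for McNemar.
    if all(x in (0, 1) for x in list(baseline) + list(candidate)):
        return "mcnemar_exact"  # assumption: exact vs chi2 rule is not documented here
    return "paired_permutation"


binary_choice = pick_method([0, 1, 1], [1, 1, 0])
continuous_choice = pick_method([0.2, 0.9], [0.4, 0.8])
clustered_choice = pick_method([0, 1], [1, 1], cluster_ids=["doc1", "doc1"])
```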
CI¶
Confidence interval for the delta. Two-sided unless you passed
--one-sided, in which case the upper bound is +inf (for "greater") or the
lower bound is -inf (for "less").
The width of the CI is the practical answer to "how much did we learn?": narrow means precise, wide means we still have a lot of noise.
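To make the one-sided bounds concrete, here is a sketch of a percentile bootstrap interval for the mean of paired differences. It is a generic technique and not necessarily how EVALSIG computes its CI; the function name, the nearest-rank quantile, and the `alternative` argument are assumptions for illustration.

```python
import math
import random


def bootstrap_ci(diffs, level=0.95, alternative="two-sided", n_resamples=5000, seed=0):
    """Percentile bootstrap CI for the mean of per-item differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample items with replacement and record each resampled mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))

    def quantile(q):
        # Nearest-rank quantile of the resampled means.
        idx = min(n_resamples - 1, max(0, int(q * n_resamples)))
        return means[idx]

    tail = 1 - level
    if alternative == "greater":
        return quantile(tail), math.inf   # only a lower bound is informative
    if alternative == "less":
        return -math.inf, quantile(level)  # only an upper bound is informative
    return quantile(tail / 2), quantile(1 - tail / 2)


# Hypothetical per-item differences (candidate minus baseline).
diffs = [0.2, -0.1, 0.3, 0.0, 0.25, -0.05, 0.2, 0.1, -0.15, 0.2]
two_sided = bootstrap_ci(diffs)
one_sided = bootstrap_ci(diffs, alternative="greater")
```

At the same confidence level, the one-sided lower bound sits at or above the two-sided one, because all of the error budget is spent on a single tail.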
p-value¶
The probability of observing a delta at least this extreme if the true
effect were zero. Smaller is stronger evidence against the null. Don't read
much into 0.001 versus 0.01: what matters for the gate is whether the
p-value falls below your --alpha, typically 0.05.
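For intuition about where such a number comes from, the sketch below estimates a two-sided p-value with a sign-flip permutation test on per-item differences. This is the textbook paired permutation procedure, written with the standard library only; EVALSIG's own implementation may differ in resampling count, tie handling, and one-sided support.

```python
import random


def paired_permutation_p(diffs, n_resamples=10_000, seed=0):
    """Monte Carlo two-sided p-value for 'mean difference is zero'."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each item's difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical differences that all favor the candidate.
consistent = [0.2, 0.1, 0.3, 0.15, 0.25, 0.05, 0.2, 0.1, 0.3, 0.2]
p_consistent = paired_permutation_p(consistent)

# Hypothetical differences that cancel out exactly.
p_null = paired_permutation_p([0.1, -0.1, 0.2, -0.2])
```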
significant¶
True when p_value < alpha. For a one-sided test we also require the
delta to point in the requested direction. This boolean is the input to
the gate's verdict logic.
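The rule above is small enough to write down directly. In this sketch the function name and the `alternative` argument are hypothetical; "greater" and "less" are the one-sided directions mentioned in the CI section, and "two-sided" is an assumed label for the default.

```python
def is_significant(p_value, delta, alpha=0.05, alternative="two-sided"):
    """True when p_value < alpha and, for one-sided tests, delta has the requested sign."""
    if p_value >= alpha:
        return False
    if alternative == "greater":
        return delta > 0
    if alternative == "less":
        return delta < 0
    return True


# Values from the sample output, tested in the "greater" direction.
sample_significant = is_significant(0.0070, 0.0124, alternative="greater")
```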
required MDE / min-delta¶
How big a real effect you said you cared about before running the gate. Below this threshold, the gate refuses to ship even if the result is statistically significant. A common policy is 0.5pp for general-purpose evals, 1pp for higher-noise agentic evals, and lower for very large human-labeled tests.
detectable / MDE¶
The smallest delta you could have detected at the requested power with this much data.
If this MDE is larger than min-delta, the run was underpowered: even if a real effect equal to min-delta existed, you would probably not have seen it. When the result is also not significant, the gate reports INCONCLUSIVE and tells you roughly how many more items to collect.
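One common way to compute a detectable effect is the normal-approximation power calculation for a mean of paired differences: MDE = (z_alpha + z_power) * sd_diff / sqrt(n_pairs). The sketch below uses that textbook form; it is an assumption, not EVALSIG's documented formula, and the standard deviation of 0.05 is a made-up input.

```python
from math import ceil, sqrt
from statistics import NormalDist

_z = NormalDist().inv_cdf  # standard normal quantile function


def detectable_delta(sd_diff, n_pairs, alpha=0.05, power=0.80, one_sided=True):
    """Smallest true delta detectable at the given power (normal approximation)."""
    z_alpha = _z(1 - alpha) if one_sided else _z(1 - alpha / 2)
    return (z_alpha + _z(power)) * sd_diff / sqrt(n_pairs)


def pairs_needed(sd_diff, min_delta, alpha=0.05, power=0.80, one_sided=True):
    """Number of paired items needed so that the MDE drops to min_delta."""
    z_alpha = _z(1 - alpha) if one_sided else _z(1 - alpha / 2)
    return ceil(((z_alpha + _z(power)) * sd_diff / min_delta) ** 2)


# Hypothetical inputs: per-item differences with standard deviation 0.05.
mde = detectable_delta(sd_diff=0.05, n_pairs=1000)
needed = pairs_needed(sd_diff=0.05, min_delta=0.005)
```

Because the MDE shrinks with the square root of the number of pairs, halving it takes roughly four times as many items.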
VERDICT¶
The release-gate decision. Three possible values:
- ALLOW (exit 0). Significant and clears the policy threshold. Ship.
- REJECT (exit 1). Either not significant despite enough data to detect the policy threshold, or significant but below the policy threshold. The candidate is not statistically better in the way you said you cared about.
- INCONCLUSIVE (exit 2). Not significant AND the run was too small to detect the policy threshold. Collect more data and re-run.
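The three outcomes can be sketched as a decision function. This is a reading of the list above, not EVALSIG's source: the function name is hypothetical, and "clears the policy threshold" is interpreted here as delta >= min_delta.

```python
def gate_verdict(significant, delta, min_delta, mde):
    """Return (verdict, exit_code) following the three cases listed above."""
    if significant and delta >= min_delta:
        return "ALLOW", 0
    # Not significant and the run could not have detected min_delta anyway.
    if not significant and mde > min_delta:
        return "INCONCLUSIVE", 2
    # Not significant with adequate power, or significant but too small.
    return "REJECT", 1


# Values from the sample output: delta 0.0124, required 0.0050, detectable 0.0040.
sample = gate_verdict(significant=True, delta=0.0124, min_delta=0.0050, mde=0.0040)
```

In a CI pipeline the second element is what a wrapper script would pass to `sys.exit`, so only ALLOW lets the job succeed.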
Notes¶
Any time EVALSIG sees something worth flagging during alignment, it adds a short note:
- "item-set coverage is 87% (below 95%); 35 items only in candidate, 12 only in baseline" -- partial overlap can bias the delta. You probably want to figure out why.
- "3 item(s) have cluster_id mismatch between runs; using baseline's cluster assignment" -- a sign the two harnesses disagree on how items are grouped.
- "RunFrames carry cluster_id; pass cluster=<name> to opt into cluster-aware inference" -- the data has clusters but you didn't ask for the cluster bootstrap, so you got item-level inference.
Treat the notes like compiler warnings: not blocking, but worth reading.
JSON form¶
Pass --output json or --json report.json to get the same fields as a
structured object:
{
"delta": 0.0124,
"ci": [0.0023, "Infinity"],
"ci_level": 0.95,
"p_value": 0.0070,
"significant": true,
"n_pairs": 1000,
"n_clusters": null,
"method": "paired_permutation",
"mde": 0.0040,
"notes": []
}
This is the format the SaaS dashboard ingests and the format you should persist for compliance audits. The schema is part of EVALSIG's public contract; field names and types are stable across the 0.x line.
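When reading a persisted report back, note that the open CI bound in the example is the string "Infinity", not a JSON number. A minimal loader, assuming the fields shown above and assuming the "less" direction is written as "-Infinity" (not confirmed by this page), could look like this:

```python
import json
import math


def load_report(text):
    """Parse a report and convert CI bounds, including "Infinity", to floats."""
    report = json.loads(text)
    # float() accepts both numbers and the strings "Infinity" / "-Infinity".
    report["ci"] = [float(bound) for bound in report["ci"]]
    return report


raw = '{"delta": 0.0124, "ci": [0.0023, "Infinity"], "p_value": 0.0070, "significant": true, "notes": []}'
report = load_report(raw)
lower, upper = report["ci"]
```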