Scenario: A/B on RLHF checkpoints
You run an RLHF loop. Every few hours you have a fresh checkpoint and need to decide whether to promote it. The eval suite is fixed; the items overlap exactly between runs; the per-step deltas are typically small.
This is the canonical paired-difference scenario, and EVALSIG's gate is built for it.
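To make the paired-difference idea concrete, here is a minimal sketch of a one-sided paired test on per-item scores. It is not EVALSIG's implementation: the function name `paired_one_sided_test`, the normal approximation, and the toy scores are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_one_sided_test(baseline_scores, candidate_scores):
    """One-sided paired z-test of H0: mean(candidate - baseline) <= 0.

    Hypothetical helper for illustration; assumes both lists score the
    same items in the same order and that n is large enough for a
    normal approximation.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("paired test needs the same items in both runs")
    # Work on per-item differences, not on the two raw score lists.
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    delta = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    z = delta / se
    p_value = 1.0 - NormalDist().cdf(z)  # upper tail only (one-sided)
    return delta, se, p_value


# Toy data: 800 items, the candidate fixes one item in every block of eight.
baseline_scores = [1, 0, 1, 1, 0, 1, 0, 0] * 100
candidate_scores = [1, 1, 1, 1, 0, 1, 0, 0] * 100

delta, se, p_value = paired_one_sided_test(baseline_scores, candidate_scores)
print(f"delta={delta:.3f} se={se:.4f} p={p_value:.3g}")
```

Because both runs score the same items, the standard error comes from the spread of the per-item differences rather than from the spread of the raw scores. That spread is usually much smaller when the two checkpoints agree on most items, which is why a paired design can resolve small deltas.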
The setup
- Baseline: the current production checkpoint, scored on the eval suite.
- Candidate: the new RLHF checkpoint, scored on the same suite.
- Decision: promote if and only if the candidate is statistically significantly better than the baseline under a policy threshold of 1 percentage point (1pp).
The script
scripts/promote_or_reject.py
import sys
from evalsig import gate
from evalsig.io import read_runframe_json
baseline = read_runframe_json("baseline.json")
candidate = read_runframe_json(sys.argv[1])
report = gate(
    baseline, candidate,
    min_delta=0.01,  # promotion policy: at least 1pp better
    alpha=0.05,
    power=0.80,
    one_sided=True,
    rng=0,
)
print(report.verdict.value)
sys.exit(report.exit_code)
Run it after every checkpoint:
python scripts/promote_or_reject.py checkpoint-step-12345.json
case $? in
    0) ./promote.sh ;;
    1) echo "checkpoint rejected" ;;
    2) echo "need more eval items" ;;
esac
Recording history
If you also want the per-checkpoint trend, write each candidate plus its verdict into the local store:
from evalsig.store import write_run
write_run(
    "/var/lib/evalsig/store",
    candidate,
    project_id="rlhf-loop",
    delta=report.comparison.delta,
    p_value=report.comparison.p_value,
    verdict=report.verdict.value,
    parent_run_id=baseline.run_id,
)
And query later:
evalsig history --root /var/lib/evalsig/store --project rlhf-loop \
    --since 2026-05-01 --until 2026-06-01
When the answer is "not yet"
Small deltas plus a small eval suite mean a lot of INCONCLUSIVE verdicts. Two options:
- Grow the suite. evalsig.required_n(target_delta=0.005, sd_diff=0.3) gives you the number.
- Switch to a sequential gate. Stream item-level diffs as you collect them and stop as soon as the always-valid CI excludes zero. See Sequential watch in CI.
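For intuition about how large "grow the suite" can get, the sketch below applies the textbook normal-approximation sample-size formula for a paired design, n = ((z_alpha + z_power) * sd_diff / target_delta)^2. The function name `required_n_sketch` and its defaults (one-sided alpha of 0.05, power of 0.80, taken from the gate call above) are assumptions; `evalsig.required_n` may use a different formula or different defaults.

```python
import math
from statistics import NormalDist


def required_n_sketch(target_delta, sd_diff, alpha=0.05, power=0.80, one_sided=True):
    """Items needed to detect target_delta with the given power.

    Hypothetical helper: standard normal-approximation formula for a
    paired comparison, not necessarily what evalsig.required_n computes.
    """
    if target_delta <= 0 or sd_diff <= 0:
        raise ValueError("target_delta and sd_diff must be positive")
    tail = alpha if one_sided else alpha / 2
    z_alpha = NormalDist().inv_cdf(1 - tail)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)     # quantile for the desired power
    return math.ceil(((z_alpha + z_power) * sd_diff / target_delta) ** 2)


n_items = required_n_sketch(target_delta=0.005, sd_diff=0.3)
n_items_two_sided = required_n_sketch(target_delta=0.005, sd_diff=0.3, one_sided=False)
print(n_items, n_items_two_sided)
```

Under these assumptions, a 0.5pp target with a per-item difference standard deviation of 0.3 needs on the order of 22,000 items. The required size scales with the inverse square of the target delta, so halving the delta you want to detect roughly quadruples the suite.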
What you should not do
- Don't peek at fixed-sample p-values and stop early. Each extra look is another chance for a false positive, so the alpha guarantee no longer holds. If you want to peek, use evalsig watch.
- Don't promote on a single 1pp delta without checking the MDE (minimum detectable effect). Without the MDE, you can't tell whether a delta of that size is signal or noise.
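As a rough way to sanity-check the MDE by hand, the sketch below inverts the usual normal-approximation power formula: MDE = (z_alpha + z_power) * sd_diff / sqrt(n). The helper name `mde_sketch`, the suite size of 2,000 items, and the standard deviation of 0.3 are illustrative assumptions; EVALSIG's own MDE calculation may differ.

```python
import math
from statistics import NormalDist


def mde_sketch(n_items, sd_diff, alpha=0.05, power=0.80, one_sided=True):
    """Smallest true delta detectable with the given power.

    Hypothetical helper using the normal approximation for a paired
    comparison; not EVALSIG's implementation.
    """
    if n_items <= 0 or sd_diff <= 0:
        raise ValueError("n_items and sd_diff must be positive")
    tail = alpha if one_sided else alpha / 2
    z_sum = NormalDist().inv_cdf(1 - tail) + NormalDist().inv_cdf(power)
    return z_sum * sd_diff / math.sqrt(n_items)


# Assumed suite: 2,000 paired items, per-item difference sd of 0.3.
mde_small = mde_sketch(n_items=2_000, sd_diff=0.3)
policy_delta = 0.01  # the 1pp promotion threshold
print(f"MDE = {mde_small:.4f}, policy delta = {policy_delta:.4f}")
if mde_small > policy_delta:
    print("suite is underpowered for a 1pp effect")
```

With these assumed numbers the MDE comes out above 1pp, so a suite of that size would not reliably detect a true 1pp improvement, and a single observed 1pp delta should be treated with caution.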