evalsig.inference¶
The statistical core. Pure NumPy and SciPy, no I/O, no globals, fully reproducible given an RNG.
This module's job is the math. The compare/ and cli/ layers exist
only to orchestrate it.
Paired tests¶
| Function | Returns |
|---|---|
| paired_t_test(a, b, alternative, ci_level) | PairedOutcome |
| paired_permutation_test(a, b, alternative, ci_level, n_resamples, rng) | PairedOutcome |
| paired_bootstrap_ci(a, b, alternative, ci_level, n_resamples, rng) | PairedOutcome |
The PairedOutcome dataclass has delta, ci, ci_level, p_value,
n_pairs, method, and sd_diff.
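To make the idea behind a paired permutation test concrete, here is a minimal standard-library sketch of a two-sided sign-flip test on the paired differences. It is an illustration only, not the evalsig implementation: the function name `sign_flip_p_value`, the `seed` argument, and the add-one correction are assumptions made for this example.

```python
import random
from statistics import fmean


def sign_flip_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for the mean paired difference."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(fmean(diffs))
    rng = random.Random(seed)  # a seeded generator makes the result reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to be +d or -d.
        flipped = fmean(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed - 1e-12:  # small tolerance for float ties
            hits += 1
    # Add-one correction keeps the Monte Carlo estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-item scores for two runs on the same eight items.
a = [0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.88, 0.79]
b = [0.75, 0.70, 0.90, 0.61, 0.80, 0.72, 0.83, 0.71]
p = sign_flip_p_value(a, b)
```

Because every difference in this toy data is positive, only the all-plus and all-minus sign patterns match the observed mean, so the p-value lands near 2/256.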
Unpaired tests¶
Each unpaired test returns an UnpairedOutcome. Use these only when
the two runs were not evaluated on the same items.
McNemar's test¶
from evalsig.inference import mcnemar_test
out = mcnemar_test(a, b, alternative="greater")
print(out.b_wins, out.c_wins) # discordant pair counts
Auto-picks exact binomial vs continuity-corrected chi-squared based on the number of discordant pairs.
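The exact branch can be sketched in a few lines of standard-library Python. This is a hedged illustration of the exact binomial form of McNemar's test, not the code evalsig runs: the name `mcnemar_exact` and the `a_only` / `b_only` counters are hypothetical, and the chi-squared branch and the switching threshold are not shown because the source does not specify them.

```python
from math import comb


def mcnemar_exact(a, b):
    """Exact two-sided McNemar p-value from two 0/1 correctness vectors."""
    a_only = sum(1 for x, y in zip(a, b) if x and not y)  # a right, b wrong
    b_only = sum(1 for x, y in zip(a, b) if y and not x)  # b right, a wrong
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # no discordant pairs, no evidence either way
    k = min(a_only, b_only)
    # Under the null, each discordant pair favours either run with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return a_only, b_only, min(1.0, 2.0 * tail)


# Hypothetical per-item correctness for two runs on the same ten items.
a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
a_only, b_only, p_value = mcnemar_exact(a, b)
```

Only the discordant pairs enter the test; items that both runs get right or both get wrong carry no information about which run is better.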
Cluster bootstrap¶
from evalsig.inference import cluster_bootstrap_ci
out = cluster_bootstrap_ci(a, b, cluster_id,
                           alternative="two-sided",
                           n_resamples=5_000,
                           rng=0)
print(out.n_clusters)
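Resamples whole clusters with replacement. Use it whenever items belong to groups: items within a group are usually correlated, so resampling individual items would understate the variance.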
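The resampling step can be sketched as follows. This is a minimal standard-library illustration of a percentile cluster bootstrap on paired differences, under the assumption that the interval is a plain percentile interval; the function name `cluster_percentile_ci` and its return shape are hypothetical, and evalsig may use a different interval construction.

```python
import random
from collections import defaultdict
from statistics import fmean


def cluster_percentile_ci(diffs, cluster_id, ci_level=0.95, n_resamples=2_000, seed=0):
    """Percentile CI for the mean difference, resampling whole clusters."""
    groups = defaultdict(list)
    for d, cid in zip(diffs, cluster_id):
        groups[cid].append(d)
    clusters = list(groups.values())
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw as many clusters as observed, with replacement, keeping each intact.
        sample = rng.choices(clusters, k=len(clusters))
        means.append(fmean(d for cluster in sample for d in cluster))
    means.sort()
    tail = (1.0 - ci_level) / 2.0
    lo = means[int(tail * (n_resamples - 1))]
    hi = means[int((1.0 - tail) * (n_resamples - 1))]
    return lo, hi, len(clusters)


# Hypothetical paired differences for ten items drawn from four passages.
diffs = [0.1, 0.2, 0.15, -0.05, 0.0, 0.3, 0.25, 0.2, 0.05, 0.1]
cluster_id = ["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3", "p4", "p4"]
lo, hi, n_clusters = cluster_percentile_ci(diffs, cluster_id)
```

With only four clusters the interval is wide, which is the point: the effective sample size is closer to the number of clusters than to the number of items.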
MDE / required N¶
- mde(sd_diff, n_pairs, alpha, power, one_sided, n_clusters, icc) -> MDEResult
- required_n(target_delta, sd_diff, alpha, power, one_sided, icc, mean_cluster_size) -> int
- estimate_icc(values, cluster_id) -> float in [0, 1]
- power_for_delta(delta, sd_diff, n_pairs, alpha, one_sided, n_clusters, icc) -> float in [0, 1]
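For intuition about how these quantities relate, the sketch below uses the textbook normal approximation, MDE = (z_alpha + z_power) * sd_diff * sqrt(deff / n), with the Kish design effect deff = 1 + (m - 1) * icc for mean cluster size m. Whether evalsig uses exactly this approximation is an assumption; the names `design_effect`, `mde_normal`, and `required_n_normal` are hypothetical and the return types are simplified to plain numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _z_sum(alpha, power, one_sided):
    """Sum of the critical value and the power quantile of a standard normal."""
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha) if one_sided else z(1.0 - alpha / 2.0)
    return z_alpha + z(power)


def design_effect(icc, mean_cluster_size):
    """Kish design effect: variance inflation from within-cluster correlation."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def mde_normal(sd_diff, n_pairs, alpha=0.05, power=0.8, one_sided=False, deff=1.0):
    """Smallest mean difference detectable with the requested power."""
    return _z_sum(alpha, power, one_sided) * sd_diff * sqrt(deff / n_pairs)


def required_n_normal(target_delta, sd_diff, alpha=0.05, power=0.8,
                      one_sided=False, deff=1.0):
    """Number of pairs needed to detect target_delta, by inverting mde_normal."""
    return ceil(deff * (_z_sum(alpha, power, one_sided) * sd_diff / target_delta) ** 2)


detectable = mde_normal(sd_diff=0.4, n_pairs=500)          # about 0.050
needed = required_n_normal(target_delta=0.05, sd_diff=0.4)  # 503 pairs
clustered = required_n_normal(0.05, 0.4, deff=design_effect(icc=0.1, mean_cluster_size=5))
```

The clustered call shows why the ICC matters: a modest within-cluster correlation of 0.1 with five items per cluster inflates the required sample by 40 percent.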
Effect sizes¶
All three return an EffectSize(name, value, magnitude) dataclass.
Magnitudes follow conventional thresholds (negligible / small /
medium / large).
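As an illustration of how a value is mapped to a magnitude label, here is a sketch of Cohen's d on paired differences using Cohen's usual cut-offs of 0.2, 0.5, and 0.8. This page does not list which effect sizes evalsig provides, so treating Cohen's d as one of them is an assumption, as are the exact thresholds, the class name `EffectSizeSketch`, and the function name `cohens_dz`.

```python
from dataclasses import dataclass
from statistics import fmean, stdev


@dataclass(frozen=True)
class EffectSizeSketch:  # hypothetical stand-in mirroring (name, value, magnitude)
    name: str
    value: float
    magnitude: str


def cohens_dz(a, b):
    """Standardised mean of the paired differences, labelled by Cohen's cut-offs."""
    diffs = [x - y for x, y in zip(a, b)]
    value = fmean(diffs) / stdev(diffs)  # sample standard deviation (n - 1)
    size = abs(value)
    if size < 0.2:
        magnitude = "negligible"
    elif size < 0.5:
        magnitude = "small"
    elif size < 0.8:
        magnitude = "medium"
    else:
        magnitude = "large"
    return EffectSizeSketch("cohens_dz", value, magnitude)


# Hypothetical scores for two runs on the same five items.
effect = cohens_dz([3, 5, 4, 6, 5], [2, 5, 3, 4, 5])
```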
Sequential testing¶
- confidence_sequence(diffs, alpha, rho, sigma_bound): one-shot always-valid CI on the running mean.
- sequential_gate(stream, alpha, alternative, rho, min_n, sigma_bound): walk a stream and stop when the CI excludes zero.
Returns a SequentialOutcome with stopped, delta, ci, n_pairs,
method, half_width.
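One common way to build an always-valid interval with a `rho` tuning parameter is the two-sided normal-mixture boundary for sub-Gaussian data, whose half-width at sample size n is sqrt((v + rho) * log((v + rho) / (rho * alpha^2))) / n with v = n * sigma_bound^2. The sketch below assumes that construction; the source does not say which boundary evalsig uses, and the names `mixture_half_width` and `gate` are hypothetical.

```python
import random
from math import log, sqrt


def mixture_half_width(n, alpha=0.05, rho=1.0, sigma_bound=1.0):
    """Half-width at sample size n of a two-sided normal-mixture confidence sequence."""
    v = n * sigma_bound**2  # accumulated variance proxy
    return sqrt((v + rho) * log((v + rho) / (rho * alpha**2))) / n


def gate(stream, alpha=0.05, rho=1.0, min_n=10, sigma_bound=1.0):
    """Walk a stream of paired differences; stop once the interval excludes zero."""
    total, n, stopped = 0.0, 0, False
    for d in stream:
        total += d
        n += 1
        if n >= min_n and abs(total / n) > mixture_half_width(n, alpha, rho, sigma_bound):
            stopped = True
            break
    if n == 0:
        raise ValueError("stream was empty")
    mean = total / n
    hw = mixture_half_width(n, alpha, rho, sigma_bound)
    return stopped, mean, (mean - hw, mean + hw), n


# Hypothetical stream: differences with true mean 0.5 and unit standard deviation.
rng = random.Random(0)
stopped, delta, ci, n_pairs = gate(rng.gauss(0.5, 1.0) for _ in range(5_000))
```

Unlike a fixed-sample interval, this one may be checked after every observation without inflating the error rate, at the cost of being wider at any single n. The guarantee relies on sigma_bound genuinely bounding the spread of the differences.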
Multiplicity corrections¶
All three take a 1D array of p-values and return a MultipleTestResult
with p_adjusted, reject, and method.
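For readers unfamiliar with adjusted p-values, here is a standard-library sketch of two widely used corrections, Holm (family-wise error) and Benjamini-Hochberg (false discovery rate). The page does not name the three corrections evalsig ships, so including these two is an assumption; the function names and the plain tuple return value are hypothetical simplifications of MultipleTestResult.

```python
def holm(p_values, alpha=0.05):
    """Holm step-down adjusted p-values and rejection decisions."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, p_values[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted, [p <= alpha for p in adjusted]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg adjusted p-values and rejection decisions."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted, [p <= alpha for p in adjusted]


p_values = [0.01, 0.04, 0.03, 0.20]
holm_adjusted, holm_reject = holm(p_values)
bh_adjusted, bh_reject = benjamini_hochberg(p_values)
```

Both functions return results in the input order, so the adjusted values line up with the original comparisons.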
Why a pure-math module¶
The rest of the package depends on inference/, and inference/ never
depends back on it. That gives three properties:
- The math is testable in isolation. Property tests verify coverage with Monte Carlo; golden tests pin numeric outputs against R and statsmodels.
- The math is parallelisable. No globals means no surprise serialisation.
- The math is portable. You can pull inference/ into another tool and use it as a stand-alone library if you want only the primitives.