Clustered standard errors

If items in your eval naturally belong to groups (a passage with multiple questions, a template that spawns many problems, a stem with several follow-ups), you cannot treat them as independent. The naive standard error will be too small, the resulting confidence interval too narrow, and your false-positive rate can be far higher than the alpha you asked for.

The model

Suppose the per-item difference between two runs decomposes as

d_i = w_{cluster(i)} + e_i

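w_k is a cluster-level shift: every item in cluster k gets the same push. e_i is the per-item residual. Even under the null (mean of d is zero), a single eval run sees only one realization of each w_k, shared by every item in that cluster. Items in the same cluster are therefore correlated, and item-level inference that treats them as independent understates the variance of the mean difference.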
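To make the cost concrete, the variance of the mean difference under this model can be written down directly. The sketch below assumes equal-sized clusters and independent w_k and e_i; the function name and the variance values 0.71 and 0.29 are illustrative choices, not part of EVALSIG.

```python
def variance_of_mean(var_w, var_e, n_clusters, cluster_size):
    """Variance of the mean of d_i = w_cluster(i) + e_i for balanced clusters."""
    n_items = n_clusters * cluster_size
    # Each w_k is drawn once per cluster, so it only averages out over clusters.
    true_var = var_w / n_clusters + var_e / n_items
    # Treating items as independent averages both components over all items.
    naive_var = (var_w + var_e) / n_items
    return true_var, naive_var


# Illustrative values (assumed): 50 clusters of 10 items, cluster share 0.71.
true_var, naive_var = variance_of_mean(var_w=0.71, var_e=0.29,
                                       n_clusters=50, cluster_size=10)
inflation = true_var / naive_var
print(f"true variance is {inflation:.2f}x the naive variance")
```

The ratio works out to 1 + (cluster_size - 1) * var_w / (var_w + var_e), so the naive standard error is too small by the square root of that factor.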

What goes wrong without clustering

We ran 600 simulations under the null with 50 clusters of 10 items (mean within-cluster correlation 0.71). The naive paired bootstrap rejected the null in 43.2% of runs. The target was 5%.

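In other words: ship 100 candidates that are not actually better and the naive test would let about 43 of them through. The cluster bootstrap brings that back down to 5.5%, which is within simulation noise of the 5% target (with 600 runs, the Monte Carlo standard error on a 5% rate is about 0.9 percentage points).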

You can reproduce this with:

python research/validate.py

See experiment_2_clustered_typeI in research/validate.py.
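For intuition without running the full validation script, the effect can be sketched in a few lines of standard-library Python. This is not the code in research/validate.py: it swaps the two bootstraps for analytic z-tests (a naive one and a cluster-robust one) to stay fast, fixes the ICC at exactly 0.71, and uses hypothetical function names, so its rejection rates will not match the figures above exactly.

```python
import math
import random
from statistics import NormalDist


def simulate_null(n_clusters, cluster_size, icc, rng):
    """Draw d_i = w_cluster + e_i with mean 0, total variance 1, given ICC."""
    clusters = []
    for _ in range(n_clusters):
        w = rng.gauss(0.0, math.sqrt(icc))
        clusters.append([w + rng.gauss(0.0, math.sqrt(1.0 - icc))
                         for _ in range(cluster_size)])
    return clusters


def mean_and_ses(clusters):
    """Return the mean difference, the naive SE and a cluster-robust SE."""
    items = [d for cluster in clusters for d in cluster]
    n = len(items)
    mean = sum(items) / n
    # Naive: pretend all n items are independent.
    naive_se = math.sqrt(sum((d - mean) ** 2 for d in items) / (n - 1) / n)
    # Cluster-robust: sum residuals within each cluster before squaring.
    cluster_sums = [sum(d - mean for d in cluster) for cluster in clusters]
    robust_se = math.sqrt(sum(s * s for s in cluster_sums)) / n
    return mean, naive_se, robust_se


def rejection_rates(n_sims=600, n_clusters=50, cluster_size=10,
                    icc=0.71, alpha=0.05, seed=0):
    """Fraction of null simulations rejected by each two-sided z-test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    naive_hits = robust_hits = 0
    for _ in range(n_sims):
        mean, naive_se, robust_se = mean_and_ses(
            simulate_null(n_clusters, cluster_size, icc, rng))
        naive_hits += abs(mean) / naive_se > z_crit
        robust_hits += abs(mean) / robust_se > z_crit
    return naive_hits / n_sims, robust_hits / n_sims


naive_rate, robust_rate = rejection_rates()
print(f"naive: {naive_rate:.1%}  cluster-robust: {robust_rate:.1%}")
```

The naive rate should land far above alpha, while the cluster-robust rate should sit close to it.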

How EVALSIG handles it

If your RunFrame items have cluster_id set and you pass cluster=<name> to compare() (or --cluster <name> on the CLI), EVALSIG switches the auto method to the cluster bootstrap, which resamples whole clusters with replacement.

from evalsig import compare

result = compare(a, b, cluster="passage_id", alpha=0.05)
print(result.method)        # 'cluster_bootstrap'
print(result.n_clusters)    # e.g. 200
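
Resampling whole clusters with replacement can be sketched as follows. This is a minimal standard-library illustration, not EVALSIG's implementation: the function name, the percentile interval, and the defaults are all assumptions made for the example.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(diffs, cluster_ids, n_resamples=2000,
                         alpha=0.05, seed=0):
    """Percentile CI for the mean difference, resampling clusters, not items."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for d, cid in zip(diffs, cluster_ids):
        groups[cid].append(d)
    # Precompute (sum, size) per cluster so each resample is cheap.
    stats = [(sum(g), len(g)) for g in groups.values()]
    k = len(stats)

    means = []
    for _ in range(n_resamples):
        total, count = 0.0, 0
        for _ in range(k):
            s, size = stats[rng.randrange(k)]  # draw one whole cluster
            total += s
            count += size
        means.append(total / count)

    means.sort()
    lo_idx = int(alpha / 2 * n_resamples)
    return means[lo_idx], means[n_resamples - 1 - lo_idx]


# Toy data (made up): 20 passages of 5 items, each passage shifted by +1 or -1.
diffs, cluster_ids = [], []
for k in range(20):
    shift = 1.0 if k % 2 == 0 else -1.0
    for _ in range(5):
        diffs.append(shift)
        cluster_ids.append(f"passage_{k}")

lo, hi = cluster_bootstrap_ci(diffs, cluster_ids)
print(lo, hi)
```

Because every draw keeps a cluster's items together, the shared shift w_k stays intact in each resample, and the spread of the resampled means reflects the between-cluster variation.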

The MDE returned in this case includes a design-effect correction:

deff   = 1 + (mean_cluster_size - 1) * icc
n_eff  = n_pairs / deff
MDE    = (z_alpha + z_beta) * sd_diff / sqrt(n_eff)

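icc is the intraclass correlation (the correlation between items in the same cluster), estimated automatically from the data. z_alpha and z_beta are the standard normal quantiles for the chosen alpha and power, and sd_diff is the standard deviation of the per-item differences. The design effect tells you how much the clustering "wastes" your sample: with ICC = 0.20 and 10-item clusters, the design effect is 1 + 9 * 0.20 = 2.8, so your effective N is roughly a third of your nominal N (n_pairs / 2.8, about 36%).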
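
The arithmetic is easy to check by hand. The sketch below computes the design effect and, as one possible way to estimate the ICC, uses the one-way ANOVA estimator for equal-sized clusters. How EVALSIG estimates icc internally is not stated here, so treat anova_icc as an assumption; both function names are hypothetical.

```python
def design_effect(icc, mean_cluster_size):
    """deff = 1 + (mean_cluster_size - 1) * icc, as in the formula above."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def anova_icc(clusters):
    """One-way ANOVA ICC estimate; this sketch assumes equal-sized clusters."""
    k = len(clusters)
    m = len(clusters[0])
    if any(len(c) != m for c in clusters):
        raise ValueError("this sketch only handles equal-sized clusters")
    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]
    # Mean square between clusters and mean square within clusters.
    msb = m * sum((mu - grand_mean) ** 2 for mu in cluster_means) / (k - 1)
    msw = sum((x - mu) ** 2
              for c, mu in zip(clusters, cluster_means)
              for x in c) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)


deff = design_effect(icc=0.20, mean_cluster_size=10)
n_eff = 1000 / deff  # effective N for a nominal 1000 pairs
print(f"deff={deff:.1f}, n_eff={n_eff:.0f}")

# Items identical within each cluster: all variation is between clusters.
icc_identical = anova_icc([[1.0, 1.0], [3.0, 3.0]])
```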

The CLI flag

evalsig gate \
  --baseline a.json --candidate b.json \
  --cluster passage_id \
  --min-delta 0.005 --alpha 0.05 --power 0.80

--cluster accepts any field name that exists as cluster_id on the items.

How to estimate cluster size and ICC in advance

When planning a new eval suite, use:

from evalsig.inference.mde import required_n

n = required_n(target_delta=0.01, sd_diff=0.3,
               icc=0.15, mean_cluster_size=8,
               alpha=0.05, power=0.80, one_sided=True)
print(n)  # by the MDE formula above: about 11,400 items (vs about 5,600 without clustering)

The clustering inflates your required sample size; planning for it up front avoids a "we have 1000 items, why is everything inconclusive?" debugging round.
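
To see how much the clustering adds, the MDE formula from the previous section can be solved for n. The sketch below assumes the planning calculation is exactly that inversion, which is not confirmed here; required_n_sketch is a hypothetical stand-in, not evalsig's required_n.

```python
import math
from statistics import NormalDist


def required_n_sketch(target_delta, sd_diff, icc=0.0, mean_cluster_size=1.0,
                      alpha=0.05, power=0.80, one_sided=True):
    """Smallest n whose design-effect-corrected MDE is at most target_delta."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    deff = 1.0 + (mean_cluster_size - 1.0) * icc
    # Solve MDE = (z_alpha + z_beta) * sd_diff / sqrt(n / deff) for n.
    n_independent = ((z_alpha + z_beta) * sd_diff / target_delta) ** 2
    return math.ceil(deff * n_independent)


n_clustered = required_n_sketch(0.01, 0.3, icc=0.15, mean_cluster_size=8)
n_flat = required_n_sketch(0.01, 0.3)  # same inputs, no clustering
print(n_clustered, n_flat, round(n_clustered / n_flat, 2))
```

Under this formula the clustered requirement is the unclustered one multiplied by the design effect, here 1 + 7 * 0.15 = 2.05. If the library's own required_n returns something different for the same inputs, trust the library and check which formula it documents.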

What you should not do

Do not drop the cluster id and pretend items are independent. The CI will be too narrow.
Do not average within clusters and then run an item-level test on the averages. That throws away the within-cluster information and usually loses more power than the clustering would have cost.
Do not try to "fix" the naive bootstrap by using more resamples. More resamples only reduce Monte Carlo noise in the interval; the resampling unit has to be the cluster, not the item.