1. The problem, stated plainly
The empirical index is a mean of the half-tie kernel \(h\) over comparisons (write \(s_r \in \{0,\tfrac12,1\}\) for
the score of an individual comparison \(r\) — the realized value of \(h\)), so the textbook standard
error divides its variance by the number of comparisons. That count overstates the information: in a log,
one model response is reused across many comparisons, and the same prompt and judge recur across many more.
The comparisons are far from independent, so dividing by their count is simply the wrong denominator
(draft_sections.tex §4).
2. Primitive, just-in-time: the influence function
New to the leaderboard score \(S_a\) and the opponent weights \(w_{ab}\)? L0009 builds them from the pairwise \(\psi\) and derives the influence form below step by step — read it first if this section moves too fast.
variance.py: produce an influence matrix once, cluster it however you like.
For a model's win-rate (a self-normalized / Hájek ratio), the contribution of comparison
\(r=(a,b)\) to model \(a\)'s score is, from leaderboard.py:67,
\[ u_{r,a} = \frac{w_{ab}}{n_{ab}}\,\bigl(s_r - \hat\psi_{ab}\bigr). \]
Read it as: take the comparison's centered outcome \(s_r - \hat\psi_{ab}\) (deviation from the pair's
mean), weight it by the opponent weight \(w_{ab}\), and spread it over the \(n_{ab}\) comparisons of that pair.
Counts and weights are treated as fixed — the standard linearization of a ratio estimator
(winrate/README.md, caveats). Stack these into the \((n, K)\) influence array and the rest is bookkeeping.
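The stacking step can be sketched in a few lines of NumPy. This is only an illustration of the formula described above, not the code in leaderboard.py or variance.py: the function name `influence_matrix_sketch`, its arguments, and the toy data are all hypothetical, and it assumes each row is scored from model \(a\)'s side and fills only model \(a\)'s column.

```python
import numpy as np

def influence_matrix_sketch(a_idx, b_idx, scores, weights):
    """Hypothetical helper: build an (n, K) influence array.

    a_idx, b_idx : int arrays of length n, the models in comparison r = (a, b)
    scores       : float array of s_r in {0, 0.5, 1}, scored from a's side
    weights      : (K, K) array of opponent weights w[a, b], treated as fixed
    """
    K = weights.shape[0]
    n = len(scores)

    # Per-pair counts n_ab and per-pair mean outcomes psi_hat_ab.
    counts = np.zeros((K, K))
    sums = np.zeros((K, K))
    np.add.at(counts, (a_idx, b_idx), 1.0)
    np.add.at(sums, (a_idx, b_idx), scores)
    psi_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    # u_{r,a} = w_ab * (s_r - psi_hat_ab) / n_ab, placed in column a of row r.
    centered = scores - psi_hat[a_idx, b_idx]
    infl = np.zeros((n, K))
    infl[np.arange(n), a_idx] = weights[a_idx, b_idx] * centered / counts[a_idx, b_idx]
    return infl

# Toy log with K = 3 models and 4 comparisons (made-up numbers).
a_idx = np.array([0, 0, 0, 1])
b_idx = np.array([1, 1, 2, 0])
scores = np.array([1.0, 0.0, 0.5, 0.5])
weights = np.array([[0.0, 0.5, 0.5],
                    [0.5, 0.0, 0.5],
                    [0.5, 0.5, 0.0]])
infl = influence_matrix_sketch(a_idx, b_idx, scores, weights)
```

Because each outcome is centered on its own pair mean, every column of `infl` sums to zero, which is a quick sanity check on any real influence array as well.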
3. From influence to the sandwich
The naive variance assumes the rows are independent, so it just sums their outer products —
\(\hat V_{\text{naive}} = \sum_r u_r u_r^\top\) = influence.T @ influence (variance.naive_cov).
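The cluster-robust variance instead sums influence within a cluster first, then takes outer products
of the cluster totals:
\[ \hat V_{\text{cluster}} = \sum_{g} \Bigl(\sum_{r \in g} u_r\Bigr)\Bigl(\sum_{r \in g} u_r\Bigr)^{\top}. \]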
That is exactly variance._cluster_meat (group, sum, outer-product). The difference between the two is
the sum of cross-products of rows in the same cluster — the within-cluster covariances the naive
formula throws away. When those covariances are positive (the usual case), \(\hat V_{\text{cluster}} > \hat V_{\text{naive}}\)
and the naive interval undercovers. The naive estimator is literally the special case "every row is its own
cluster" (multiway_cluster_cov(infl, [arange(n)])).
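A small numerical check makes the comparison concrete. The two functions below are stand-ins written for this example (the `_sketch` names are hypothetical and are not the repo's `variance.naive_cov` or `variance._cluster_meat`); they assume the influence array is already scaled so that no further normalization is needed.

```python
import numpy as np

def naive_cov_sketch(infl):
    # Sum of per-row outer products: treats every row as independent.
    return infl.T @ infl

def cluster_meat_sketch(infl, cluster_ids):
    # Group rows by cluster, sum within each cluster, then take outer products.
    _, codes = np.unique(np.asarray(cluster_ids), return_inverse=True)
    totals = np.zeros((codes.max() + 1, infl.shape[1]))
    np.add.at(totals, codes, infl)
    return totals.T @ totals

# Toy influence array: 4 rows, 2 models, 2 clusters (made-up numbers).
infl = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [-1.0, 1.0],
                 [0.0, 1.0]])
clusters = np.array([0, 0, 1, 1])

v_naive = naive_cov_sketch(infl)                 # [[3, -1], [-1, 2]]
v_cluster = cluster_meat_sketch(infl, clusters)  # [[5, -2], [-2, 4]]

# The gap is exactly the within-cluster cross-products u_r u_s^T for r != s.
cross = np.zeros((2, 2))
for g in np.unique(clusters):
    rows = infl[clusters == g]
    total = rows.sum(axis=0)
    cross += np.outer(total, total) - rows.T @ rows

# One cluster per row reproduces the naive estimator.
v_singletons = cluster_meat_sketch(infl, np.arange(len(infl)))
```

In this toy case the first model's variance rises from 3 to 5 because its two rows in cluster 0 move together; the naive sum never sees that product.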
Two axes at once: inclusion–exclusion
Comparisons cluster by prompt and by judge simultaneously (crossed, not nested). Cameron–Gelbach–Miller
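combine dimensions by inclusion–exclusion (variance.py:49–53):
\[ \hat V_{\text{multiway}} = \hat V_{\text{prompt}} + \hat V_{\text{judge}} - \hat V_{\text{prompt}\,\cap\,\text{judge}}. \]
Add each dimension's cluster sandwich, then subtract the intersection term (clusters defined by sharing both a prompt and a judge) so that rows sharing both are not double-counted. For \(d\) dimensions it is the full alternating sum over nonempty subsets of the dimensions.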
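For the two-axis case, the inclusion–exclusion step can be sketched as follows. This is a minimal illustration under assumptions, not the implementation at variance.py:49–53: the function names are hypothetical, only two dimensions are handled, and no small-sample correction is applied.

```python
import numpy as np

def cluster_meat_sketch(infl, cluster_ids):
    # Sum influence within each cluster, then sum outer products of the totals.
    _, codes = np.unique(np.asarray(cluster_ids), return_inverse=True)
    totals = np.zeros((codes.max() + 1, infl.shape[1]))
    np.add.at(totals, codes, infl)
    return totals.T @ totals

def twoway_cov_sketch(infl, prompt_ids, judge_ids):
    # Integer codes for each axis, then one code per (prompt, judge) cell.
    _, p = np.unique(np.asarray(prompt_ids), return_inverse=True)
    _, j = np.unique(np.asarray(judge_ids), return_inverse=True)
    cell = p * (j.max() + 1) + j
    # V_prompt + V_judge - V_(prompt and judge)
    return (cluster_meat_sketch(infl, p)
            + cluster_meat_sketch(infl, j)
            - cluster_meat_sketch(infl, cell))

# Toy data: one model (K = 1), 2 prompts crossed with 2 judges.
infl = np.array([[1.0], [2.0], [3.0], [4.0]])
prompts = np.array([0, 0, 1, 1])
judges = np.array([0, 1, 0, 1])

v_twoway = twoway_cov_sketch(infl, prompts, judges)
# prompt term: 3^2 + 7^2 = 58; judge term: 4^2 + 6^2 = 52;
# intersection term: 1 + 4 + 9 + 16 = 30; total: 58 + 52 - 30 = 80
```

Here every (prompt, judge) cell holds a single row, so the subtracted term equals the naive sum of squares; with repeated rows per cell it would be larger.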
4. The punchline: effective \(n\) = number of prompts
Here is the derivation to keep in your pocket. On mmlu_pro there are 19 models, so each model's single
answer to a prompt is compared against 18 opponents (per judge). But whether model \(a\) wins those 18
comparisons is mostly decided by that one answer. So the 18 within-prompt comparisons are nearly
perfectly correlated for the purpose of estimating \(a\)'s win-rate — they collapse to roughly one
independent observation per prompt.
\[ \frac{\mathrm{SE}_{\text{cluster}}}{\mathrm{SE}_{\text{naive}}} \approx \sqrt{1 + (\bar m - 1)\,\rho_{\text{shared}}} \approx \sqrt{18} \approx 4.2, \]
with cluster size \(\bar m \approx 18\) and shared-response correlation \(\rho_{\text{shared}}\) near 1. The measured prompt-clustered inflation is 4.2×, the same number. (Adding the judge axis, where 3 judges re-rate the same answer, pushes it to 5.4×.) A nominal 95% naive interval therefore covers closer to one half.
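The arithmetic behind the 4.2× figure is short enough to check directly. The snippet assumes the standard design-effect approximation, SE inflation ≈ sqrt(1 + (m̄ − 1)ρ), with the cluster size of 18 quoted above; the function name is hypothetical and nothing here reads the actual log.

```python
import math

def se_inflation_sketch(mean_cluster_size, rho):
    # Design-effect approximation: variance inflates by 1 + (m - 1) * rho,
    # so the standard error inflates by the square root of that factor.
    return math.sqrt(1.0 + (mean_cluster_size - 1.0) * rho)

# 18 opponents per prompt, shared-response correlation taken as exactly 1.
inflation = se_inflation_sketch(18, 1.0)          # sqrt(18), about 4.24

# Working backwards: the correlation that gives exactly 4.2 with m = 18.
implied_rho = (4.2 ** 2 - 1.0) / (18 - 1.0)       # about 0.979

# Independent rows (rho = 0) give no inflation at all.
no_inflation = se_inflation_sketch(18, 0.0)       # 1.0
```

An inflation of about \(\sqrt{18}\) is the same statement as the section title: the effective sample size is the comparison count divided by roughly 18, i.e. about one observation per prompt.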
5. Two-sided honesty
The naive formula doesn't always undercover. With positive within-prompt or within-judge correlation — typical — it undercovers. But in a deliberately within-prompt paired log, where the same prompt difficulty hits both responses and cancels, the naive interval can actually overcover. Stating both directions is the paper's voice and is more defensible than a blanket "naive is anticonservative."
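The overcoverage direction is easy to see in a toy example. The numbers below are invented for illustration and do not come from the log: each prompt contributes two influence values of opposite sign, mimicking a paired design in which the shared prompt difficulty cancels inside the cluster total.

```python
import numpy as np

# Two prompts, two rows each; within a prompt the rows nearly cancel.
infl = np.array([[1.0], [-0.8], [0.5], [-0.7]])
prompts = np.array([0, 0, 1, 1])

# Naive: sum of squared rows, as if all four rows were independent.
v_naive = float((infl ** 2).sum())                # 1 + 0.64 + 0.25 + 0.49 = 2.38

# Cluster-robust: square the per-prompt totals instead.
totals = np.array([infl[prompts == g].sum() for g in np.unique(prompts)])
v_cluster = float((totals ** 2).sum())            # 0.2^2 + (-0.2)^2 = 0.08

# Ratio of standard errors; below 1 means the naive interval is too wide.
se_ratio = np.sqrt(v_cluster / v_naive)
```

With negative within-cluster cross-products the cluster variance falls below the naive one, so the naive interval is wider than it needs to be and overcovers.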
Load-bearing assumptions for the central limit result (state which are fragile): crossed-cluster sampling, a many-cluster Lindeberg condition (one dominant judge or prompt family breaks it; this is relevant to our 3-judge data, where the judge ICC is only a crude estimate), and nondegenerate variance. Under these assumptions the estimator is consistent and asymptotically normal, and the whole score vector is jointly normal, so any fixed linear leaderboard score gets valid Wald and simultaneous inference (the hook into L0005).
1. What does the cluster-robust variance add that the naive variance omits?
2. Roughly why is the SE inflation on mmlu_pro about 4.2×?
3. The pair-level prompt ICC is only ~0.03. Why isn't that a contradiction with 4–6× inflation?
4. How does the multiway estimator combine prompt and judge clustering?
5. When can the naive interval actually overcover?
Read next (primary source)
Read Cameron, Gelbach & Miller (2011), "Robust inference with multiway
clustering," JBES 29(2) — the inclusion–exclusion estimator implemented in variance.py
(needs adding to paperpile.bib). In-repo, read draft_sections.tex §4 and run
scripts/eda_our_data.py --dataset mmlu_pro to watch the 4.2×/5.4× numbers and the ICCs print.