Win-rate inference · Lesson 0003

Why the naive SE is wrong

Influence functions, the cluster sandwich, multiway inclusion–exclusion, and the punchline: the effective sample size is the number of prompts, not comparisons.

Why this lesson. This is Contribution 2 and your most surprising number: standard errors inflate 4–6×. A reviewer will push back ("your ICC is only 0.03 — aren't you over-correcting?"), and the answer is a clean back-of-envelope you can derive on the spot. Get this and you can defend every interval in the paper.

1. The problem, stated plainly

The empirical index is a mean of the half-tie kernel \(h\) over comparisons (write \(s_r \in \{0,\tfrac12,1\}\) for the score of an individual comparison \(r\) — the realized value of \(h\)), so the textbook standard error divides its variance by the number of comparisons. That count overstates the information: in a log, one model response is reused across many comparisons, and the same prompt and judge recur across many more. The comparisons are far from independent, so dividing by their count is simply the wrong denominator (draft_sections.tex §4).

2. Primitive, just-in-time: the influence function

New to the leaderboard score \(S_a\) and the opponent weights \(w_{ab}\)? L0009 builds them from the pairwise \(\psi\) and derives the influence form below step by step — read it first if this section moves too fast.

Influence = each row's first-order push on the estimate. Linearize the estimator as a sum of per-comparison contributions, \(\hat\theta - \theta \approx \sum_r u_r\). Then the variance is the variance of that sum, and any dependence structure can be applied to the same \(u_r\) without re-deriving the estimator. That is the whole design of variance.py: produce an influence matrix once, cluster it however you like.

For a model's win-rate (a self-normalized / Hájek ratio), the contribution of comparison \(r=(a,b)\) to model \(a\)'s score is, from leaderboard.py:67,

\[ u_r^{(a)} = \frac{w_{ab}\,\big(s_r - \hat\psi_{ab}\big)}{n_{ab}} \]

Read it as: take the comparison's centered outcome \(s_r - \hat\psi_{ab}\) (its deviation from the pair's mean), weight it by the opponent weight \(w_{ab}\), and spread it over the \(n_{ab}\) comparisons of that pair. Counts and weights are treated as fixed, which is the standard linearization of a ratio estimator (winrate/README.md, caveats). Stack these into the \((n, K)\) influence array, with one row per comparison and one column per model, and the rest is bookkeeping.
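A minimal sketch of that stacking step, assuming NumPy and integer model indices. The function name `influence_matrix` and its arguments are hypothetical and are not taken from leaderboard.py. For brevity it fills only model \(a\)'s column for each row and leaves out the mirrored contribution to model \(b\).

```python
import numpy as np

def influence_matrix(a_idx, b_idx, scores, weights, n_models):
    """Hypothetical helper: (n, K) influence array for weighted win-rates.

    a_idx, b_idx : model indices for each comparison row r = (a, b)
    scores       : s_r in {0, 0.5, 1}, scored from model a's side
    weights      : (K, K) opponent weights w_ab, treated as fixed
    """
    a_idx = np.asarray(a_idx)
    b_idx = np.asarray(b_idx)
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]

    # Per-pair counts n_ab and per-pair mean scores psi_hat_ab.
    sums = np.zeros((n_models, n_models))
    counts = np.zeros((n_models, n_models))
    np.add.at(sums, (a_idx, b_idx), scores)
    np.add.at(counts, (a_idx, b_idx), 1.0)
    psi_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

    # u_r^(a) = w_ab * (s_r - psi_hat_ab) / n_ab, written into column a.
    centered = scores - psi_hat[a_idx, b_idx]
    infl = np.zeros((n, n_models))
    infl[np.arange(n), a_idx] = (
        weights[a_idx, b_idx] * centered / counts[a_idx, b_idx]
    )
    return infl

# Three comparisons of model 0 against model 1: a win, a loss, a tie.
demo = influence_matrix([0, 0, 0], [1, 1, 1], [1.0, 0.0, 0.5], np.ones((2, 2)), 2)
print(demo[:, 0])  # centered outcomes (0.5, -0.5, 0) divided by n_ab = 3
```

Because each row is centered at its pair's mean, the contributions within a pair sum to zero, which is a quick sanity check on any influence array built this way.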

3. From influence to the sandwich

The naive variance assumes the rows are independent, so it just sums their outer products: \(\hat V_{\text{naive}} = \sum_r u_r u_r^\top\), which in code is influence.T @ influence (variance.naive_cov). The cluster-robust variance instead sums influence within a cluster first, then takes outer products of the cluster totals:

\[ \hat V_{\text{cluster}} = \sum_{g} \Big(\sum_{r \in g} u_r\Big)\Big(\sum_{r \in g} u_r\Big)^{\!\top} \]

That is exactly variance._cluster_meat (group, sum, outer-product). The difference between the two is the sum of cross-products of rows in the same cluster — the within-cluster covariances the naive formula throws away. When those covariances are positive (the usual case), \(\hat V_{\text{cluster}} > \hat V_{\text{naive}}\) and the naive interval undercovers. The naive estimator is literally the special case "every row is its own cluster" (multiway_cluster_cov(infl, [arange(n)])).
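The group, sum, outer-product recipe can be sketched in a few lines of NumPy. The two functions below are illustrative re-implementations with made-up names, not the code in variance.py, and they assume a 1-D array of cluster labels with one label per row.

```python
import numpy as np

def naive_cov_sketch(infl):
    """Sum of per-row outer products: every row treated as independent."""
    return infl.T @ infl

def cluster_meat_sketch(infl, cluster_ids):
    """Sum influence within each cluster, then sum outer products of totals."""
    _, codes = np.unique(np.asarray(cluster_ids), return_inverse=True)
    codes = codes.ravel()
    totals = np.zeros((codes.max() + 1, infl.shape[1]))
    np.add.at(totals, codes, infl)  # group and sum
    return totals.T @ totals        # outer products of the cluster totals

# Toy influence column: two clusters whose rows move together.
infl = np.array([[1.0], [1.0], [-1.0], [-1.0]])
clusters = np.array([0, 0, 1, 1])

print(naive_cov_sketch(infl))                    # [[4.]]
print(cluster_meat_sketch(infl, clusters))       # [[8.]]
print(cluster_meat_sketch(infl, np.arange(4)))   # [[4.]] every row its own cluster
```

In the toy data the clustered value is twice the naive one, and the extra 4 is exactly the sum of the within-cluster cross-products that the naive sum omits.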

Two axes at once: inclusion–exclusion

Comparisons cluster by prompt and by judge simultaneously (crossed, not nested). Cameron–Gelbach–Miller combine dimensions by inclusion–exclusion (variance.py:49–53):

\[ \hat V = \hat V_{\text{prompt}} + \hat V_{\text{judge}} - \hat V_{\text{prompt}\,\cap\,\text{judge}} \]

Add each dimension's cluster sandwich; subtract the intersection so the rows that share both a prompt and a judge are not double-counted. For \(d\) dimensions it's the full alternating sum over subsets.
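The alternating sum over subsets might look like the sketch below. It is an illustration under assumptions, not the repository's multiway_cluster_cov: the function names are hypothetical, cluster labels are assumed to be integer arrays, and no finite-sample correction or positive-semidefinite repair is applied.

```python
from itertools import combinations

import numpy as np

def _meat(infl, codes):
    """Cluster sandwich meat for integer cluster codes 0..G-1."""
    totals = np.zeros((codes.max() + 1, infl.shape[1]))
    np.add.at(totals, codes, infl)
    return totals.T @ totals

def multiway_cov_sketch(infl, dims):
    """Inclusion-exclusion over every non-empty subset of clustering dimensions."""
    k = infl.shape[1]
    total = np.zeros((k, k))
    for size in range(1, len(dims) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0  # add odd subsets, subtract even
        for subset in combinations(dims, size):
            # Intersection clusters: rows that agree on every dimension in subset.
            stacked = np.column_stack(subset)
            _, codes = np.unique(stacked, axis=0, return_inverse=True)
            total += sign * _meat(infl, codes.ravel())
    return total

# Four rows crossed over 2 prompts x 2 judges, all with influence 1.
infl = np.ones((4, 1))
prompt = np.array([0, 0, 1, 1])
judge = np.array([0, 1, 0, 1])
print(multiway_cov_sketch(infl, [prompt, judge]))  # [[12.]] = 8 + 8 - 4
```

In the toy grid each row shares a prompt with one other row and a judge with one other row, so the result is the 4 own-row products plus 4 prompt cross-products plus 4 judge cross-products. Subtracting the intersection term keeps the own-row products from being counted twice.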

4. The punchline: effective \(n\) = number of prompts

Here is the derivation to keep in your pocket. On mmlu_pro there are 19 models, so each model's single answer to a prompt is compared against 18 opponents (per judge). But whether model \(a\) wins those 18 comparisons is mostly decided by that one answer. So the 18 within-prompt comparisons are nearly perfectly correlated for the purpose of estimating \(a\)'s win-rate — they collapse to roughly one independent observation per prompt.

\[ \text{SE inflation} = \sqrt{\text{DEFF}} = \sqrt{1 + (\bar m - 1)\,\rho_{\text{shared}}} \;\approx\; \sqrt{\bar m} \;\approx\; \sqrt{18} \approx 4.24 \]

with cluster size \(\bar m \approx 18\) and shared-response correlation \(\rho_{\text{shared}}\) near 1. The measured prompt-clustered inflation is 4.2×, the same number. (Adding the judge axis, where 3 judges re-rate the same answer, pushes it to 5.4×.) Under a normal approximation, a nominal 95% naive interval then extends only \(\pm 1.96/4.2 \approx \pm 0.47\) true standard errors, so it covers roughly 36% of the time (about 28% at 5.4×).
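The back-of-envelope is easy to check numerically. The sketch below uses only the equal-cluster-size design-effect formula above. The helper names are made up for illustration, and the inputs (18 opponents, the measured 4.2×, the pair-level ICC of about 0.03) are the figures quoted in this lesson.

```python
import math

def se_inflation(m_bar, rho):
    """sqrt(DEFF) for equal-size clusters: sqrt(1 + (m_bar - 1) * rho)."""
    return math.sqrt(1.0 + (m_bar - 1.0) * rho)

def implied_rho(inflation, m_bar):
    """Invert the formula: the rho that would produce a given SE inflation."""
    return (inflation ** 2 - 1.0) / (m_bar - 1.0)

m_bar = 18  # opponents per prompt on mmlu_pro (19 models)

upper = se_inflation(m_bar, 1.0)      # perfectly correlated limit, sqrt(18)
rho_42 = implied_rho(4.2, m_bar)      # correlation implied by the measured 4.2x
icc_plug = se_inflation(m_bar, 0.03)  # what plugging in the 0.03 ICC would predict

print(f"rho = 1 limit:       {upper:.2f}x")     # 4.24x
print(f"rho implied by 4.2x: {rho_42:.3f}")     # 0.979
print(f"rho = 0.03 plug-in:  {icc_plug:.2f}x")  # 1.23x
```

The last line shows why the pooled ICC is the wrong input here: plugged into the same formula it would predict an inflation of only about 1.23×, far from the measured 4.2×.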

The ICC paradox — the reviewer's trap, defused. The reported pair-level prompt ICC is only \(\approx 0.03\), which seems to contradict a 4.2× inflation. It doesn't. That ICC is computed on pair-centered scores pooled over all within-prompt pairs — including pairs like \((a,b)\) and \((c,d)\) that share no model and are nearly uncorrelated, which drags the average down. The dependence that actually inflates \(a\)'s win-rate is the correlation among comparisons that share \(a\)'s response, and reuse makes that correlation high. The small global ICC and the large inflation measure different things; the sandwich captures the one that matters, the ICC summary does not.

5. Two-sided honesty

The naive formula doesn't always undercover. With positive within-prompt or within-judge correlation — typical — it undercovers. But in a deliberately within-prompt paired log, where the same prompt difficulty hits both responses and cancels, the naive interval can actually overcover. Stating both directions is the paper's voice and is more defensible than a blanket "naive is anticonservative."
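A toy simulation makes the overcoverage direction concrete. It is a sketch under stated assumptions, not the paper's experiment: scores are continuous rather than win/tie/loss, each prompt contributes a shared difficulty term to both responses, and the target is the contrast mean(x) - mean(y). All variable names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 2000  # number of prompts, each seen by both models

# Shared prompt difficulty (sd 2) plus independent response noise (sd 1).
difficulty = rng.normal(0.0, 2.0, size=P)
x = difficulty + rng.normal(0.0, 1.0, size=P)
y = difficulty + rng.normal(0.0, 1.0, size=P)

# Per-row influence on the contrast mean(x) - mean(y).
u_x = (x - x.mean()) / P
u_y = -(y - y.mean()) / P

# Naive: all 2P rows independent, so just sum the squared influences.
naive_var = np.sum(u_x ** 2) + np.sum(u_y ** 2)

# Prompt-clustered: sum the two rows of each prompt first, then square.
paired_var = np.sum((u_x + u_y) ** 2)

print(f"naive variance:  {naive_var:.5f}")
print(f"paired variance: {paired_var:.5f}")
print(f"naive / paired:  {naive_var / paired_var:.1f}")
```

With these settings the population ratio is (4 + 1) / 1 = 5, so the naive variance is several times too large. The within-prompt cross-products are negative for the contrast because the shared difficulty cancels, which is the mirror image of the positive-correlation case in section 4.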

Load-bearing assumptions for the central limit result (state which are fragile): crossed-cluster sampling, a many-cluster Lindeberg condition (one dominant judge or prompt family breaks it, which is relevant to our 3-judge data, where the judge ICC is only a crude estimate), and nondegenerate variance. Under these assumptions the estimator is consistent and asymptotically normal, and the whole score vector is jointly normal, so any fixed linear leaderboard score gets valid Wald and simultaneous inference (the hook into L0005).

Retrieval check — answer from memory before moving on

1. What does the cluster-robust variance add that the naive variance omits?

Why. Summing influence within a cluster before the outer product adds exactly the cross-products of same-cluster rows — the covariances the naive sum-of-outer-products drops. Positive ones ⇒ wider, honest intervals.

2. Roughly why is the SE inflation on mmlu_pro about 4.2×?

Why. A model's one answer decides its ~18 within-prompt comparisons, so they collapse to ≈1 independent unit. Naive counts 18; SE is understated by ≈√18 ≈ 4.24, matching the measured 4.2×.

3. The pair-level prompt ICC is only ~0.03. Why isn't that a contradiction with 4–6× inflation?

Why. The global ICC pools all within-prompt pairs, including \((a,b)\) vs \((c,d)\) that share no response and are ~uncorrelated, pulling the average down. The binding dependence is among comparisons that share a model's response — high because of reuse.

4. How does the multiway estimator combine prompt and judge clustering?

Why. Inclusion–exclusion: \(V_{\text{prompt}} + V_{\text{judge}} - V_{\text{prompt}\cap\text{judge}}\), so rows sharing both a prompt and a judge aren't double-counted (variance.py:49–53).

5. When can the naive interval actually overcover?

Why. In a within-prompt paired design, shared prompt difficulty affects both responses and cancels from the contrast, so the naive variance can be too large. Stating both directions is the honest framing.

Read next (primary source)

Read Cameron, Gelbach & Miller (2011), "Robust inference with multiway clustering," JBES 29(2) — the inclusion–exclusion estimator implemented in variance.py (needs adding to paperpile.bib). In-repo, read draft_sections.tex §4 and run scripts/eda_our_data.py --dataset mmlu_pro to watch the 4.2×/5.4× numbers and the ICCs print.