1. The gap: one pairwise number → one model number
L0002 left us with the PI matrix: \(\psi_{ab}\) is the win-rate of model \(a\) against a single opponent \(b\). But a leaderboard needs one number per model, not a whole row of pairwise numbers. So we have to aggregate each model's row of the matrix into a scalar. That scalar is the leaderboard score \(S_a\).
The natural aggregate is a weighted average of how \(a\) does against each opponent:
\[ S_a(w) = \sum_b w_{ab}\,\psi_{ab}, \qquad w_{ab}\ge 0,\quad \sum_b w_{ab}=1. \]
Read it in words: "how often does \(a\) win, averaged over a chosen field of opponents." \(S_a\) is just a weighted mean of the pairwise win-rates \(\psi_{ab}\) you already understand.
2. Where \(w_{ab}\) comes from — it's a choice, not data
The opponent mixture \(w_{ab}\) is the weight you put on opponent \(b\) when scoring \(a\). It is an
analyst's choice, and it matters. leaderboard_scores offers two (code at leaderboard.py:50–57):
| mixture | weight \(w_{ab}\) | meaning |
|---|---|---|
| "uniform" | \(1/d_a\), where \(d_a\) = # distinct opponents \(a\) faced | every opponent counts equally (a Borda-style score); removes adaptive-matchmaking schedule confounding |
| "as_sampled" | \(n_{ab}/N_a\), where \(N_a=\sum_b n_{ab}\) | weights opponents by how often they were actually played; conflates skill with who you were matched against |
Two details from the code: opponents never faced get \(w_{ab}=0\) (the has = pair > 0 mask), and the
weights are normalized to sum to one (divided by row_sums). That normalization is itself a
self-normalization, the same "ratio" flavor we'll see again in a moment.
The "as_sampled" mixture recovers exactly the raw model-level
win-rate. Substitute \(w_{ab}=n_{ab}/N_a\):
\[ S_a^{\text{as\_sampled}} = \sum_b \frac{n_{ab}}{N_a}\,\psi_{ab} = \frac{1}{N_a}\sum_b \sum_{r\in(a,b)} s_r = \frac{1}{N_a}\sum_{\text{all of }a\text{'s comparisons}} s_r. \]
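A quick numeric check of the two mixtures and of this identity, as a minimal sketch. The scores below are made up for illustration, and the names (scores_vs, normalize) are hypothetical; this is not the code in leaderboard.py.

```python
import numpy as np

# Hypothetical half-tie scores for model a (1 = win, 0.5 = tie, 0 = loss),
# grouped by opponent. Opponent "d" was never faced, so its weight is zero.
scores_vs = {
    "b": np.array([1, 1, 0.5, 0, 1, 1, 0, 1]),  # 8 games
    "c": np.array([0, 0.5]),                    # 2 games
    "d": np.array([]),                          # never faced
}

n = np.array([len(s) for s in scores_vs.values()], dtype=float)  # n_ab
has = n > 0                                                      # opponents faced
# Pairwise win-rates psi_ab; unfaced opponents get a placeholder 0 (weight is 0 anyway).
psi = np.array([s.mean() if len(s) else 0.0 for s in scores_vs.values()])

def normalize(raw):
    """Scale raw weights so they sum to one."""
    return raw / raw.sum()

w_uniform = normalize(has.astype(float))  # 1/d_a on each faced opponent
w_sampled = normalize(n)                  # n_ab / N_a

S_uniform = float(w_uniform @ psi)
S_sampled = float(w_sampled @ psi)

# Raw model-level win-rate: the mean over all of a's comparisons.
raw_mean = float(np.concatenate(list(scores_vs.values())).mean())

print(S_sampled, raw_mean, S_uniform)
```

In this toy data the "as_sampled" score and the raw mean are both 0.6, while the "uniform" score is about 0.469, because the two games against c count as much as the eight games against b.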
So \(S_a\) is literally the mean of all of \(a\)'s half-tie scores — the model-level win-rate, which is a weighted
average of the pairwise win-rates \(\psi_{ab}\). That's the link between L0002's pairwise \(\psi\) and "the win-rate"
the leaderboard reports. "uniform" is the same idea, but it refuses to let a lopsided schedule tilt the field.
3. Now linearize: why, and the form
We want \(\mathrm{Var}(S_a)\). But \(S_a\) is a weighted combination of ratios (each \(\psi_{ab}=\sum s/n_{ab}\) is a ratio), and the comparisons feeding different \(\psi_{ab}\) are dependent. There's no clean formula for the variance of a nonlinear function of correlated ratios. The fix is to approximate \(S_a\) by a sum of per-comparison terms, because the variance of a sum we can always write down.
Condition on the design — treat the counts \(n_{ab}\) and weights \(w_{ab}\) as fixed (this is the "Hájek
linearization treating weights and counts as fixed" caveat in winrate/README.md). Then each \(\psi_{ab}\)
is a plain mean and \(S_a\) is a plain weighted sum of scores:
\[ S_a = \sum_b w_{ab}\,\psi_{ab} = \sum_b \frac{w_{ab}}{n_{ab}} \sum_{r\in(a,b)} s_r. \]
Center each term by its mean to get the mean-zero influence of comparison \(r\) (in pair \((a,b)\)):
\[ u_r^{(a)} = \frac{w_{ab}}{n_{ab}}\,(s_r-\psi_{ab}). \]
This is exactly leaderboard.py:67: w[a,b] * (s − psi[a,b]) / pair[a,b]. (The same comparison
also feeds \(b\)'s score, oriented the other way — line 68.)
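The whole path from comparisons to scores and influence can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the actual leaderboard_scores: the function name scores_and_influence and its argument layout are hypothetical, each comparison is assumed to be between two different models, and s is assumed to be the half-tie score of model i (so model j gets 1 - s).

```python
import numpy as np

def scores_and_influence(i, j, s, K, mixture="uniform"):
    """Hypothetical sketch: leaderboard scores S (K,) and influence U (R, K)."""
    i, j = np.asarray(i), np.asarray(j)
    s = np.asarray(s, dtype=float)
    pair = np.zeros((K, K))  # n_ab: comparisons per ordered pair
    wins = np.zeros((K, K))  # summed half-tie scores of a against b
    np.add.at(pair, (i, j), 1.0)
    np.add.at(pair, (j, i), 1.0)
    np.add.at(wins, (i, j), s)
    np.add.at(wins, (j, i), 1.0 - s)

    has = pair > 0  # opponents actually faced
    psi = np.divide(wins, pair, out=np.zeros_like(wins), where=has)

    raw = has.astype(float) if mixture == "uniform" else pair
    row_sums = raw.sum(axis=1, keepdims=True)
    w = np.divide(raw, row_sums, out=np.zeros_like(raw), where=row_sums > 0)

    S = (w * psi).sum(axis=1)

    # Each comparison r pushes both of its models, in opposite orientations.
    U = np.zeros((len(s), K))
    r = np.arange(len(s))
    U[r, i] = w[i, j] * (s - psi[i, j]) / pair[i, j]
    U[r, j] = w[j, i] * ((1.0 - s) - psi[j, i]) / pair[j, i]
    return S, U

# Four made-up comparisons among three models.
i = [0, 0, 0, 1]
j = [1, 1, 2, 2]
s = [1.0, 0.5, 0.0, 1.0]
S, U = scores_and_influence(i, j, s, K=3, mixture="uniform")
S_sampled, _ = scores_and_influence(i, j, s, K=3, mixture="as_sampled")
```

Because the centering uses the sample \(\psi_{ab}\), every column of U sums to zero here. The column sums are therefore not the interesting quantity; the spread of the rows, and how they group into clusters, is what carries the variance.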
Why the \(-\psi_{ab}\)? The Hájek-ratio view
Two ways to see it, and they agree. Centering: subtracting \(\psi_{ab}\) makes each term mean-zero, so \(\sum_r u_r\) has the same variance as \(S_a\). Ratio: \(\psi_{ab}\) is a self-normalized ratio \(\bar Y/\bar X\) with \(y_r=s_r,\ x_r=1\); the textbook linearization of a ratio is \(\tfrac{1}{\bar X}\,\mathrm{mean}(y_r - R\,x_r)\), which gives \((s_r-\psi_{ab})/n_{ab}\). The \(-\psi_{ab}\) is the ratio-correction term — "Hájek" names this self-normalized estimator, and treating \(n_{ab}\) as fixed is what lets us stop at first order.
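The first-order step behind the ratio view can be written out. This is the standard delta-method expansion of a ratio; \(\mu_X\) (the population mean of \(x_r\)), \(R\) (the population ratio), and \(\hat R=\bar Y/\bar X\) are notation introduced here for the sketch:
\[ \hat R - R = \frac{\bar Y - R\,\bar X}{\bar X} \approx \frac{1}{\mu_X}\,(\bar Y - R\,\bar X) = \frac{1}{\mu_X}\cdot\frac{1}{n}\sum_r (y_r - R\,x_r). \]
With \(y_r=s_r\), \(x_r=1\), \(n=n_{ab}\), and \(R=\psi_{ab}\), we have \(\bar X=\mu_X=1\), so the approximation is exact and each comparison contributes \((s_r-\psi_{ab})/n_{ab}\). Multiplying by the fixed weight \(w_{ab}\) gives \(w_{ab}(s_r-\psi_{ab})/n_{ab}\).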
Sanity check with "as_sampled": there \(w_{ab}=n_{ab}/N_a\), so \(w_{ab}/n_{ab}=1/N_a\) is
constant, and \(u_r^{(a)} = (s_r-\psi_{ab})/N_a\). That is the simple centered score over the model's total count, exactly
what you'd expect for a plain mean. The machinery reduces to the obvious thing in the obvious case.
4. The significance: one \(u_r\), any dependence structure
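Once \(S_a-\mathbb E[S_a]\approx\sum_r u_r\), the variance is a double sum of covariances:
\[ \mathrm{Var}(S_a) \approx \mathrm{Var}\Big(\sum_r u_r\Big) = \sum_r\sum_{r'} \mathrm{Cov}(u_r,u_{r'}). \]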
The \(u_r\) are a fixed property of the estimator — "how does each comparison push \(S_a\)." They do not depend on what you assume about dependence. The only thing that changes between estimators is which off-diagonal covariances you keep — i.e., the assumed correlation pattern:
| dependence assumption | which \(\mathrm{Cov}(u_r,u_{r'})\) you keep | code |
|---|---|---|
| independence (naive) | diagonal only: \(\sum_r u_r u_r^\top\) | naive_cov |
| within-prompt | pairs sharing a prompt: \(\sum_g(\sum_{r\in g}u_r)(\cdot)^\top\) | _cluster_meat |
| within-judge | pairs sharing a judge | _cluster_meat |
| prompt + judge | inclusion–exclusion over both | multiway_cluster_cov |
So there are two orthogonal concerns: influence = how each comparison pushes the estimate (a property
of the estimator); clustering = which comparisons move together (a property of the dependence
model). Linearization separates them, so you compute the influence matrix once and swap dependence
models freely — which is precisely why multiway_cluster_cov(influence, [cluster_arrays]) takes the
influence plus a list of label arrays.
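The "compute influence once, swap the dependence model" idea can be sketched as follows. These are illustrative stand-ins, not the repository's naive_cov, _cluster_meat, or multiway_cluster_cov, whose exact signatures and any small-sample corrections are not shown in this lesson. The function names, the toy influence matrix, and the labels are all hypothetical, and the two-way formula is the standard inclusion-exclusion form.

```python
import numpy as np

def naive_meat(U):
    """Independence: keep only the diagonal terms, sum_r u_r u_r^T."""
    return U.T @ U

def cluster_meat(U, labels):
    """One-way clustering: sum over clusters g of (sum_{r in g} u_r)(...)^T."""
    _, inv = np.unique(np.asarray(labels), return_inverse=True)
    G = np.zeros((inv.max() + 1, U.shape[1]))
    np.add.at(G, inv, U)  # add each influence row to its cluster total
    return G.T @ G

def two_way_meat(U, prompt, judge):
    """Prompt + judge by inclusion-exclusion (no small-sample correction)."""
    _, ip = np.unique(np.asarray(prompt), return_inverse=True)
    _, ij = np.unique(np.asarray(judge), return_inverse=True)
    both = ip * (ij.max() + 1) + ij  # one label per (prompt, judge) cell
    return cluster_meat(U, prompt) + cluster_meat(U, judge) - cluster_meat(U, both)

# Toy influence matrix: 6 comparisons, 2 models (values are made up).
U = np.array([[ 0.2, -0.2], [ 0.1, -0.1], [-0.3,  0.3],
              [ 0.2, -0.2], [-0.1,  0.1], [-0.1,  0.1]])
prompt = np.array(["p1", "p1", "p2", "p2", "p3", "p3"])
judge  = np.array(["j1", "j2", "j1", "j2", "j1", "j2"])

# Same U every time; only the assumed dependence pattern changes.
var_naive  = np.diag(naive_meat(U))
var_prompt = np.diag(cluster_meat(U, prompt))
var_both   = np.diag(two_way_meat(U, prompt, judge))
```

One caveat on the design: the inclusion-exclusion estimate subtracts a term, so in small samples it is not guaranteed to be positive semidefinite. A practical implementation may need to guard against that.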
5. So — is it just to enable the sandwich?
Mechanically, yes: writing \(S_a\) as \(\sum_r u_r\) is what makes the cluster-robust sandwich apply — the
sandwich's "meat" is the estimate of \(\sum_r\sum_{r'}\mathrm{Cov}(u_r,u_{r'})\) under a chosen sparsity
pattern. (Here the sandwich's "bread", the Jacobian of a general M-estimator, is the identity, because we folded
everything into \(u_r\) — so multiway_cluster_cov computes the meat directly.) But the deeper payoff is the
decoupling: one influence matrix serves every clustering and every estimand — the scalar win-rate, the
\(K\)-vector leaderboard, and the standardized board (L0004/L0008) all reduce to "produce influence, then cluster
it." That is why adding a clustering axis or a new estimand never requires re-deriving a variance. L0003 picks up
exactly here.
1. What is the leaderboard score \(S_a(w)\)?
2. The opponent mixture \(w_{ab}\) is best described as…
"uniform" weights every opponent equally (schedule-robust), "as_sampled" weights by how often played (conflates skill with schedule).
3. The "as_sampled" mixture makes \(S_a\) equal to…
4. Why does treating counts and weights as fixed help?
5. "The same \(u_r\) works for any dependence structure" matters because…
Read next (primary source)
Read the code itself: winrate/leaderboard.py lines 41–71
(leaderboard_scores) — the weights, the score, and the influence in ~30 lines. For the ratio/Hájek
linearization, Song et al. (2023) [Song2023-sm] gives the U-statistic/variance backbone; a clean
standalone influence-function primer is still a flagged gap in RESOURCES.md.