L0009 — From ψ to the leaderboard score, and how we linearize it

Course order. Read this between L0002 and L0003. L0002 gave you the pairwise index \(\psi_{ab}\) (one number for a vs one opponent b). L0003's influence function suddenly uses a model-level score \(S_a\) and opponent weights \(w_{ab}\) — this bridge builds those first, so the linearization in L0003 has somewhere to stand. (L0005 later builds rankings and cycles on top of \(S_a\).)

Why this lesson. The influence function is the hinge of the whole inference contribution. To defend the cluster-robust SEs — and to trust them — you need to see \(S_a\) built honestly from \(\psi\), and then watch it collapse into a sum of per-comparison terms. No step on faith.

1. The gap: one pairwise number → one model number

L0002 left us with the PI matrix: \(\psi_{ab}\) is the win-rate of model \(a\) against a single opponent \(b\). But a leaderboard needs one number per model, not a whole row of pairwise numbers. So we have to aggregate each model's row of the matrix into a scalar. That scalar is the leaderboard score \(S_a\).

The natural aggregate is a weighted average of how \(a\) does against each opponent:

\[ S_a(w) = \sum_{b} w_{ab}\,\psi_{ab}, \qquad \sum_b w_{ab} = 1. \]

Read it in words: "how often does \(a\) win, averaged over a chosen field of opponents." \(S_a\) is just a weighted mean of the pairwise win-rates \(\psi_{ab}\) you already understand.

Per-comparison notation (used from here on). \(r\) indexes a single comparison (one row of the schema). Its half-tie score is \(s_r \in \{0, \tfrac12, 1\}\) — the realized value of the kernel \(h\) from L0002 (1 if \(a\) won that comparison, \(\tfrac12\) a tie, 0 a loss). \(n_{ab}\) is the number of comparisons of pair \((a,b)\), so \(\psi_{ab} = \tfrac{1}{n_{ab}}\sum_{r\in(a,b)} s_r\) (the pairwise win-rate is the mean of those scores), and \(N_a = \sum_b n_{ab}\) is \(a\)'s total comparison count.
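To make the notation concrete, here is a minimal sketch of how \(n_{ab}\) and \(\psi_{ab}\) could be tabulated from per-comparison records. The array layout (a_idx, b_idx, s), the toy data, and the function name pairwise_psi_sketch are assumptions for illustration, not the schema or code from L0002.

```python
import numpy as np

# Hypothetical toy data: comparison r pits model a_idx[r] against b_idx[r];
# s[r] is the half-tie score from a's point of view (1 win, 0.5 tie, 0 loss).
a_idx = np.array([0, 0, 0, 1, 0, 1])
b_idx = np.array([1, 1, 2, 2, 2, 2])
s = np.array([1.0, 0.5, 1.0, 0.0, 0.0, 1.0])
K = 3  # number of models


def pairwise_psi_sketch(a_idx, b_idx, s, K):
    """Return (pair, psi): counts n_ab and pairwise win-rates psi_ab."""
    pair = np.zeros((K, K))
    total = np.zeros((K, K))
    # Each comparison counts once for (a, b) and once, flipped, for (b, a).
    np.add.at(pair, (a_idx, b_idx), 1.0)
    np.add.at(pair, (b_idx, a_idx), 1.0)
    np.add.at(total, (a_idx, b_idx), s)
    np.add.at(total, (b_idx, a_idx), 1.0 - s)
    # psi is left as NaN where the pair was never played.
    psi = np.full((K, K), np.nan)
    has = pair > 0
    psi[has] = total[has] / pair[has]
    return pair, psi


pair, psi = pairwise_psi_sketch(a_idx, b_idx, s, K)
N = pair.sum(axis=1)  # N_a: each model's total comparison count
```

With half-tie scoring the flipped orientation satisfies \(\psi_{ba} = 1 - \psi_{ab}\), which is why one pass over the comparisons fills both halves of the matrix.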

2. Where \(w_{ab}\) comes from — it's a choice, not data

The opponent mixture \(w_{ab}\) is the weight you put on opponent \(b\) when scoring \(a\). It is an analyst's choice, and it matters. leaderboard_scores offers two (code at leaderboard.py:50–57):

mixture	weight \(w_{ab}\)	meaning
`"uniform"`	\(1/d_a\), where \(d_a\) = # distinct opponents \(a\) faced	every opponent counts equally — a Borda-style score; removes adaptive-matchmaking schedule confounding
`"as_sampled"`	\(n_{ab}/N_a\), where \(N_a=\sum_b n_{ab}\)	weight opponents by how often actually played; conflates skill with who you were matched against

Two details from the code: opponents never faced get \(w_{ab}=0\) (the has = pair > 0 mask), and the weights are normalized to sum to one (divided by row_sums) — itself a self-normalization, the same "ratio" flavor we'll see again in a moment.
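The two mixtures and the two details above (the mask and the row normalization) can be sketched as follows. This is an illustration under assumptions, not the contents of leaderboard.py: the function name opponent_weights_sketch is hypothetical, and only the names has, pair and row_sums are taken from the text.

```python
import numpy as np


def opponent_weights_sketch(pair, mixture="uniform"):
    """Build w[a, b] from the count matrix pair[a, b] = n_ab."""
    pair = np.asarray(pair, dtype=float)
    has = pair > 0  # opponents never faced get weight 0
    if mixture == "uniform":
        raw = has.astype(float)  # 1 per distinct opponent
    elif mixture == "as_sampled":
        raw = pair.copy()  # n_ab per opponent
    else:
        raise ValueError(f"unknown mixture: {mixture!r}")
    row_sums = raw.sum(axis=1, keepdims=True)
    # Normalize each row to sum to one; rows with no comparisons stay all-zero.
    return np.divide(raw, row_sums, out=np.zeros_like(raw), where=row_sums > 0)


# Model 0 met model 1 three times and model 2 once.
pair = np.array([[0, 3, 1],
                 [3, 0, 0],
                 [1, 0, 0]])
w_uniform = opponent_weights_sketch(pair, "uniform")
w_sampled = opponent_weights_sketch(pair, "as_sampled")
```

For model 0 this gives weights (0, 1/2, 1/2) under "uniform" and (0, 3/4, 1/4) under "as_sampled": the same opponents, weighted very differently.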

The bridge to "win-rate." The "as_sampled" mixture recovers exactly the raw model-level win-rate. Substitute \(w_{ab}=n_{ab}/N_a\); the \(n_{ab}\) cancels against the \(1/n_{ab}\) inside \(\psi_{ab}\):

\[ S_a^{\text{as\_sampled}} = \sum_b \frac{n_{ab}}{N_a}\,\psi_{ab} = \frac{1}{N_a}\sum_b \sum_{r\in(a,b)} s_r = \frac{1}{N_a}\sum_{\text{all of }a\text{'s comparisons}} s_r. \]

So under "as_sampled", \(S_a\) is literally the mean of all of \(a\)'s half-tie scores: the model-level win-rate, written as a weighted average of the pairwise win-rates \(\psi_{ab}\). That's the link between L0002's pairwise \(\psi\) and "the win-rate" the leaderboard reports. "uniform" applies the same averaging but gives every opponent equal weight, so a lopsided schedule cannot tilt the field.
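A small numeric check of this identity, on made-up data for a single model \(a\) with a lopsided schedule (three games against opponent 1, one against opponent 2). The variable names are illustrative only.

```python
import numpy as np

# Hypothetical comparisons for one model a: opponent label and half-tie score.
opp = np.array([1, 1, 1, 2])
s = np.array([1.0, 1.0, 1.0, 0.0])

opponents, n_ab = np.unique(opp, return_counts=True)
psi_ab = np.array([s[opp == b].mean() for b in opponents])  # pairwise win-rates

w_as_sampled = n_ab / n_ab.sum()                             # n_ab / N_a
w_uniform = np.full(len(opponents), 1.0 / len(opponents))    # 1 / d_a

S_as_sampled = float(w_as_sampled @ psi_ab)
S_uniform = float(w_uniform @ psi_ab)
raw_mean = float(s.mean())  # plain mean of all of a's scores
```

Here S_as_sampled and raw_mean are both 0.75, while S_uniform is 0.5: the uniform score does not reward \(a\) for having mostly been matched against the opponent it beats.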

3. Now linearize: why, and the form

We want \(\mathrm{Var}(S_a)\). But \(S_a\) is a weighted combination of ratios (each \(\psi_{ab}=\sum s/n_{ab}\) is a ratio), and the comparisons feeding different \(\psi_{ab}\) are dependent. There's no clean formula for the variance of a nonlinear function of correlated ratios. The fix is to approximate \(S_a\) by a sum of per-comparison terms, because the variance of a sum we can always write down.

Condition on the design — treat the counts \(n_{ab}\) and weights \(w_{ab}\) as fixed (this is the "Hájek linearization treating weights and counts as fixed" caveat in winrate/README.md). Then each \(\psi_{ab}\) is a plain mean and \(S_a\) is a plain weighted sum of scores:

\[ S_a = \sum_b w_{ab}\,\frac{1}{n_{ab}}\sum_{r\in(a,b)} s_r \;=\; \sum_r \frac{w_{ab}}{n_{ab}}\, s_r. \]

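Here the sum on the right runs over all of \(a\)'s comparisons, and \(b\) is the opponent in comparison \(r\). Center each term by its mean (in the code, the unknown pair mean is replaced by its estimate \(\psi_{ab}\)) to get the mean-zero influence of comparison \(r\) (in pair \((a,b)\)):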

\[ u_r^{(a)} = \frac{w_{ab}}{n_{ab}}\,\big(s_r - \psi_{ab}\big), \qquad\text{so}\qquad S_a - \mathbb E[S_a] \approx \sum_r u_r^{(a)}. \]

This is exactly leaderboard.py:67: w[a,b] * (s - psi[a,b]) / pair[a,b]. (The same comparison also feeds \(b\)'s score, oriented the other way: line 68.)
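Vectorized over all comparisons and both orientations, the formula might look like the sketch below. It is not the code in leaderboard.py: the function name influence_matrix_sketch and the input layout are assumptions, and the flipped orientation is written as \(1-s_r\) centered at \(\psi_{ba}\), which is what half-tie scoring implies.

```python
import numpy as np


def influence_matrix_sketch(a_idx, b_idx, s, K, mixture="uniform"):
    """Return (scores, U): S_a per model and the R x K plug-in influence matrix."""
    a_idx, b_idx = np.asarray(a_idx), np.asarray(b_idx)
    s = np.asarray(s, dtype=float)

    # Counts n_ab and pairwise win-rates psi_ab, filled in both orientations.
    pair = np.zeros((K, K))
    total = np.zeros((K, K))
    np.add.at(pair, (a_idx, b_idx), 1.0)
    np.add.at(pair, (b_idx, a_idx), 1.0)
    np.add.at(total, (a_idx, b_idx), s)
    np.add.at(total, (b_idx, a_idx), 1.0 - s)
    has = pair > 0
    psi = np.zeros((K, K))
    psi[has] = total[has] / pair[has]

    # Opponent mixture, normalized so each played row sums to one.
    if mixture == "uniform":
        raw = has.astype(float)
    elif mixture == "as_sampled":
        raw = pair.copy()
    else:
        raise ValueError(f"unknown mixture: {mixture!r}")
    row_sums = raw.sum(axis=1, keepdims=True)
    w = np.divide(raw, row_sums, out=np.zeros_like(raw), where=row_sums > 0)

    scores = (w * psi).sum(axis=1)

    # Row r has two nonzero entries: one for a, one for b (assumes a != b).
    rows = np.arange(len(s))
    U = np.zeros((len(s), K))
    U[rows, a_idx] = w[a_idx, b_idx] * (s - psi[a_idx, b_idx]) / pair[a_idx, b_idx]
    U[rows, b_idx] = (w[b_idx, a_idx] * ((1.0 - s) - psi[b_idx, a_idx])
                      / pair[b_idx, a_idx])
    return scores, U


a_idx = [0, 0, 0, 1, 0, 1]
b_idx = [1, 1, 2, 2, 2, 2]
s = [1.0, 0.5, 1.0, 0.0, 0.0, 1.0]
scores, U = influence_matrix_sketch(a_idx, b_idx, s, K=3)
```

Because the centering uses the estimated \(\psi_{ab}\), every column of U sums to zero in-sample; the variance information lives in how the rows group into clusters, not in the column totals.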

Why the \(-\psi_{ab}\)? The Hájek-ratio view

Two ways to see it, and they agree. Centering: subtracting \(\psi_{ab}\) makes each term mean-zero, so \(\sum_r u_r\) has the same variance as \(S_a\). Ratio: \(\psi_{ab}\) is a self-normalized ratio \(R=\bar Y/\bar X\) with \(y_r=s_r,\ x_r=1\); the textbook linearization of a ratio is \(\tfrac{1}{\bar X}\,\mathrm{mean}(y_r - R\,x_r)\), which, with \(\bar X=1\) and the mean taken over the pair's \(n_{ab}\) comparisons, gives a per-comparison term \((s_r-\psi_{ab})/n_{ab}\). The \(-\psi_{ab}\) is the ratio-correction term; "Hájek" names this self-normalized estimator, and treating \(n_{ab}\) as fixed is what lets us stop at first order.
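For readers who want the ratio step spelled out, here is a short sketch of the standard first-order argument. The symbols \(\hat R\), \(\mu_X\) and \(n\) are introduced here for illustration: \(\hat R=\bar Y/\bar X\) is the estimate, \(R\) its target, \(\mu_X\) the mean of \(x_r\), and \(n\) the number of terms.

\[ \hat R - R \;=\; \frac{\bar Y - R\,\bar X}{\bar X} \;\approx\; \frac{1}{\mu_X}\cdot\frac{1}{n}\sum_r \big(y_r - R\,x_r\big). \]

The equality is exact algebra; the approximation replaces the random denominator \(\bar X\) by its mean. Specializing to \(y_r=s_r\), \(x_r=1\), \(n=n_{ab}\) gives \(\mu_X=1\), so each comparison contributes \((s_r-\psi_{ab})/n_{ab}\) to \(\psi_{ab}\), and multiplying by the fixed weight \(w_{ab}\) gives

\[ u_r^{(a)} \;=\; \frac{w_{ab}}{n_{ab}}\,\big(s_r-\psi_{ab}\big). \]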

Sanity check with as_sampled. There \(w_{ab}=n_{ab}/N_a\), so \(w_{ab}/n_{ab}=1/N_a\) is constant, and \(u_r^{(a)} = (s_r-\psi_{ab})/N_a\) — the simple centered score over the model's total count, exactly what you'd expect for a plain mean. The machinery reduces to the obvious thing in the obvious case.

4. The significance: one \(u_r\), any dependence structure

Once \(S_a-\mathbb E[S_a]\approx\sum_r u_r\), the variance is a double sum of covariances:

\[ \mathrm{Var}(S_a) \approx \sum_r \sum_{r'} \mathrm{Cov}\big(u_r,\,u_{r'}\big). \]

The \(u_r\) are a fixed property of the estimator: "how does each comparison push \(S_a\)." They do not depend on what you assume about dependence. In the table below, \(u_r\) is the vector stacking \(u_r^{(a)}\) over all models, so each product is an outer product. The only thing that changes between variance estimators is which off-diagonal covariances you keep, i.e., the assumed correlation pattern:

dependence assumption	which \(\mathrm{Cov}(u_r,u_{r'})\) you keep	code
independence (naive)	diagonal only: \(\sum_r u_r u_r^\top\)	`naive_cov`
within-prompt	pairs sharing a prompt: \(\sum_g(\sum_{r\in g}u_r)(\cdot)^\top\)	`_cluster_meat`
within-judge	pairs sharing a judge	`_cluster_meat`
prompt + judge	inclusion–exclusion over both	`multiway_cluster_cov`

So there are two orthogonal concerns: influence = how each comparison pushes the estimate (a property of the estimator); clustering = which comparisons move together (a property of the dependence model). Linearization separates them, so you compute the influence matrix once and swap dependence models freely — which is precisely why multiway_cluster_cov(influence, [cluster_arrays]) takes the influence plus a list of label arrays.
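The "compute influence once, swap the grouping" idea can be sketched as below. These are illustrative stand-ins, not the repository's naive_cov, _cluster_meat or multiway_cluster_cov: the function names carry a _sketch suffix, no small-sample correction is applied, and the two-way case uses the usual inclusion-exclusion form (prompt + judge - intersection).

```python
import numpy as np


def naive_meat_sketch(U):
    """Independence: keep only the diagonal terms, sum_r u_r u_r^T."""
    U = np.asarray(U, dtype=float)
    return U.T @ U


def cluster_meat_sketch(U, labels):
    """One-way clustering: sum over groups g of (sum_{r in g} u_r)(...)^T."""
    U = np.asarray(U, dtype=float)
    _, codes = np.unique(np.asarray(labels), return_inverse=True)
    codes = codes.ravel()
    sums = np.zeros((codes.max() + 1, U.shape[1]))
    np.add.at(sums, codes, U)  # within-cluster sums of influence rows
    return sums.T @ sums


def twoway_meat_sketch(U, prompt, judge):
    """Two-way clustering by inclusion-exclusion over prompt and judge."""
    _, p = np.unique(np.asarray(prompt), return_inverse=True)
    _, j = np.unique(np.asarray(judge), return_inverse=True)
    both = p.ravel() * (j.max() + 1) + j.ravel()  # one code per (prompt, judge) cell
    return (cluster_meat_sketch(U, prompt)
            + cluster_meat_sketch(U, judge)
            - cluster_meat_sketch(U, both))


# Toy influence for a single estimand (K = 1) and four comparisons.
U = np.array([[1.0], [2.0], [3.0], [4.0]])
prompt = ["p1", "p1", "p2", "p2"]
judge = ["j1", "j2", "j1", "j2"]

v_naive = naive_meat_sketch(U)
v_prompt = cluster_meat_sketch(U, prompt)
v_judge = cluster_meat_sketch(U, judge)
v_twoway = twoway_meat_sketch(U, prompt, judge)
```

The same U feeds all four calls; only the label arrays change. When every comparison is its own cluster, cluster_meat_sketch reduces to the naive form, which is a quick consistency check.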

5. So — is it just to enable the sandwich?

Mechanically, yes: writing \(S_a\) as \(\sum_r u_r\) is what makes the cluster-robust sandwich apply — the sandwich's "meat" is the estimate of \(\sum_r\sum_{r'}\mathrm{Cov}(u_r,u_{r'})\) under a chosen sparsity pattern. (Here the sandwich's "bread", the Jacobian of a general M-estimator, is the identity, because we folded everything into \(u_r\) — so multiway_cluster_cov computes the meat directly.) But the deeper payoff is the decoupling: one influence matrix serves every clustering and every estimand — the scalar win-rate, the \(K\)-vector leaderboard, and the standardized board (L0004/L0008) all reduce to "produce influence, then cluster it." That is why adding a clustering axis or a new estimand never requires re-deriving a variance. L0003 picks up exactly here.

Retrieval check — answer from memory before moving on

1. What is the leaderboard score \(S_a(w)\)?

Why. \(S_a(w)=\sum_b w_{ab}\psi_{ab}\): each model's row of the PI matrix collapsed to a scalar by averaging the pairwise \(\psi_{ab}\) over a chosen opponent field.

2. The opponent mixture \(w_{ab}\) is best described as…

Why. \(w\) is a choice, not data: "uniform" weights every opponent equally (schedule-robust), "as_sampled" weights by how often played (conflates skill with schedule).

3. The "as_sampled" mixture makes \(S_a\) equal to…

Why. With \(w_{ab}=n_{ab}/N_a\) the \(n_{ab}\) factors cancel and the score collapses to \((1/N_a)\sum s_r\) over all of \(a\)'s comparisons: the plain mean of its half-tie scores.

4. Why does treating counts and weights as fixed help?

Why. Conditioning on the design turns each \(\psi_{ab}\) into a mean and \(S_a\) into a weighted sum of scores, so \(S_a-\mathbb E[S_a]\approx\sum_r u_r\) and the variance is the variance of a sum.

5. "The same \(u_r\) works for any dependence structure" matters because…

Why. The \(u_r\) encode the estimator; the clustering encodes the dependence. Compute influence once, then swap naive / prompt / judge / multiway by regrouping — and reuse it for every estimand.