Win-rate inference · Lesson 0002

The estimand, and why \(\psi\) travels

Why the win-rate is the probabilistic index, which transforms keep its information, and the one structural fact (it's a U-statistic) that makes the whole machinery importable.

Why this lesson. "It's the same object as the clinical win proportion" is the sentence the entire paper leans on — it's what lets you import U-statistic inference and PIM adjustment. To defend that to a reviewer you need to show it's not a loose analogy but an identity, and to know precisely what travels and what doesn't. That last part is also where Paper 2 begins.

1. The object, defined once

For an ordered pair of models, the probabilistic index is

\[ \psi_{ab} = P(a \succ b) + \tfrac12\,P(a = b) = p_1 + \tfrac12 p_0 \]

where \((p_1, p_0, p_{-1})\) are the trinomial probabilities that a wins, ties, or loses (\(p_1+p_0+p_{-1}=1\); main.tex §Methods). Each comparison contributes the half-tie kernel

\[ h = I(a \succ b) + \tfrac12\,I(a = b) \;\in\; \{0,\ \tfrac12,\ 1\} \]

and \(\hat\psi\) is just the mean of \(h\). You can see the kernel computed literally in data.py: score = where(rating_a > rating_b, 1.0, where(==, 0.5, 0.0)) — that expression is \(h\). The schema then carries it as score ∈ {0, ½, 1} (and admits any value in \([0,1]\), so averaged or model-predicted scores reuse the same pipeline).
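A minimal plain-Python sketch of the kernel and its mean is given here. It is not the code in data.py: the function name `half_tie_score` and the toy ratings are assumptions made for illustration, and only the standard library is used.

```python
def half_tie_score(rating_a, rating_b):
    """Half-tie kernel h: 1.0 if a wins, 0.5 on a tie, 0.0 if a loses."""
    if rating_a > rating_b:
        return 1.0
    if rating_a == rating_b:
        return 0.5
    return 0.0


# Toy (rating_a, rating_b) pairs, invented for this example.
pairs = [(5, 3), (4, 4), (2, 5), (4, 1)]

scores = [half_tie_score(a, b) for a, b in pairs]  # [1.0, 0.5, 0.0, 1.0]
psi_hat = sum(scores) / len(scores)                # 2.5 / 4 = 0.625

print(scores, psi_hat)
```

Because `psi_hat` is a plain mean of the scores, a list of averaged or model-predicted scores in \([0,1]\) would pass through the last two lines unchanged.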

The identities, enforced by code. Two facts pin the matrix down: \(\psi_{ba} = 1 - \psi_{ab}\) (antisymmetry) and \(\psi_{aa} = \tfrac12\) (self-comparison). pi.pi_matrix builds these in by construction: it accumulates \(s\) for orientation \((a,b)\) and \(1-s\) for \((b,a)\), then sets the diagonal to \(\tfrac12\) (pi.py:54,60). So you never estimate \(\psi_{ab}\) and \(\psi_{ba}\) as two free numbers that might disagree — the structure is exact.
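The accumulate-both-orientations idea can be sketched as follows. This is a hypothetical stand-in, not the repo's pi.pi_matrix: the name `pi_matrix_sketch`, the `(model_a, model_b, score)` record layout, and the choice to leave unobserved cells as `None` are all assumptions for illustration.

```python
def pi_matrix_sketch(records, models):
    """Build a matrix of psi estimates from (model_a, model_b, score) records.

    Hypothetical helper: each score s in [0, 1] is added to cell (a, b)
    and 1 - s to cell (b, a); the diagonal is set to 0.5.
    Cells with no comparisons are left as None.
    """
    index = {m: i for i, m in enumerate(models)}
    k = len(models)
    total = [[0.0] * k for _ in range(k)]
    count = [[0] * k for _ in range(k)]

    for a, b, s in records:
        i, j = index[a], index[b]
        total[i][j] += s
        count[i][j] += 1
        total[j][i] += 1.0 - s  # reverse orientation receives 1 - s
        count[j][i] += 1

    psi = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                psi[i][j] = 0.5  # self-comparison
            elif count[i][j] > 0:
                psi[i][j] = total[i][j] / count[i][j]
    return psi


models = ["m1", "m2", "m3"]
# Toy records, invented for this example; m3 is never compared.
records = [
    ("m1", "m2", 1.0),
    ("m2", "m1", 0.0),
    ("m1", "m2", 0.5),
    ("m1", "m2", 1.0),
]
psi = pi_matrix_sketch(records, models)
print(psi[0][1], psi[1][0])  # 0.875 0.125
```

Since every record updates both cells with the same count, the two off-diagonal estimates sum to 1 by construction rather than by coincidence of the data.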

2. Which transforms keep the information — and which throw it away

This is the crux of Contribution 1, and a clean thing to put on a whiteboard. From \(p_1+p_0+p_{-1}=1\) we get \(1 - \psi = p_{-1} + \tfrac12 p_0\), and then:

\[ \text{Win odds} = \frac{\psi}{1-\psi} \qquad \text{Net benefit} = 2\psi - 1 = p_1 - p_{-1}\ (=\text{NPS}) \qquad \text{Win ratio} = \frac{p_1}{p_{-1}} \]

Win odds and net benefit are one-to-one (strictly monotone) functions of \(\psi\) — invertible, so they carry exactly the same information. The win ratio is not a function of \(\psi\) alone: it needs how the mass splits between \(p_0\) and the extremes. One numerical example settles it — two outcome distributions with the same \(\psi\):

| trinomial \((p_1, p_0, p_{-1})\) | \(\psi\) | net benefit \(2\psi-1\) | win odds \(\psi/(1-\psi)\) | win ratio \(p_1/p_{-1}\) |
| --- | --- | --- | --- | --- |
| \(A = (0.60,\ 0.00,\ 0.40)\) | 0.60 | +0.20 | 1.50 | 1.50 |
| \(B = (0.50,\ 0.20,\ 0.30)\) | 0.60 | +0.20 | 1.50 | 1.67 |

Same \(\psi\) ⇒ net benefit and win odds are forced equal (they're transforms). The win ratio differs (1.50 vs 1.67) precisely because it discards the tie mass \(p_0\). That is exactly why pi.py ships win_odds and net_benefit but deliberately omits the win ratio — it isn't a property of the estimand the rest of the paper is about.
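The A/B comparison can be reproduced with exact fractions. The helper names below are hypothetical and are not the signatures of the `win_odds` and `net_benefit` functions in pi.py; the sketch only checks the arithmetic of the formulas above and shows that the two transforms invert back to \(\psi\) while the win ratio does not depend on \(\psi\) alone.

```python
from fractions import Fraction as F


def psi_of(p_win, p_tie, p_loss):
    """Probabilistic index: wins plus half the ties."""
    return p_win + p_tie / 2


def win_odds_of(psi):
    return psi / (1 - psi)


def net_benefit_of(psi):
    return 2 * psi - 1


def psi_from_win_odds(wo):
    """Inverse of win_odds_of."""
    return wo / (1 + wo)


def psi_from_net_benefit(nb):
    """Inverse of net_benefit_of."""
    return (nb + 1) / 2


def win_ratio_of(p_win, p_loss):
    """Needs p_win and p_loss separately; psi alone is not enough."""
    return p_win / p_loss


# The two trinomials from the table, as (p_1, p_0, p_{-1}).
A = (F(60, 100), F(0), F(40, 100))
B = (F(50, 100), F(20, 100), F(30, 100))

for name, (p1, p0, pm1) in (("A", A), ("B", B)):
    psi = psi_of(p1, p0, pm1)
    print(
        name,
        float(psi),
        float(net_benefit_of(psi)),
        float(win_odds_of(psi)),
        round(float(win_ratio_of(p1, pm1)), 2),
    )
```

Exact fractions are used so the equalities hold without floating-point tolerance; the printed values are converted to floats only for display.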

Defend-it framing. If a reviewer says "win-rate, win odds, net benefit, win ratio — pick a lane," the answer is: the first three are the same estimand in different clothing, so any one implies the others; the win ratio is a different estimand that answers a different question and cannot be recovered from \(\psi\). You report \(\psi\) (and its transforms) and say so.

3. The structural fact: \(\psi\) is a U-statistic

Primitive, just-in-time: a U-statistic. A U-statistic is the average of a kernel evaluated over tuples of observations. An ordinary sample mean is the trivial case (a one-argument kernel). The win-rate uses a two-argument kernel \(h(Y_i, Y_j)\) that compares two responses. Comparing all of model a's responses against all of model b's gives the two-sample form
\[ \hat\psi_{ab} = \frac{1}{n_a\,n_b} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} h\big(Y_{ai},\, Y_{bj}\big) \]

which is exactly the Mann–Whitney \(U\) divided by the number of cross-pairs, i.e. the "win proportion." Two consequences flow from this, and they are the reason \(\psi\) travels:

  1. It unlocks the inference + adjustment machinery. Because \(\psi\) is the Mann–Whitney parameter / clinical win proportion, the U-statistic variance theory (delta method, asymptotic normality of \(\log\) win odds) and the PIM regression framework — both built on U-statistic estimating equations — apply directly. The paper isn't analogizing; it's reusing theory for the identical estimand (main.tex "A Unified U-Statistic Framework," lines 111–131).
  2. Its summands are dependent by construction. The pairs \((i,j)\) and \((i,k)\) both reuse response \(i\), so the terms in the average are correlated, and a U-statistic's variance is therefore not \(\sigma^2/n\). This is intrinsic to the estimand, not a data defect, and it is the seed of the entire next lesson: a single model response reused against every opponent induces the same dependence, which is why confidence intervals built on the naive i.i.d. standard error undercover.
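The identity between the double sum and the Mann–Whitney \(U\) can be checked on a toy sample. The sketch below is plain Python with invented ratings; `psi_hat`, `midranks`, and `mann_whitney_u` are illustrative names, not functions from the repo. It computes \(\hat\psi_{ab}\) by averaging \(h\) over all cross-pairs, then computes \(U\) independently through the textbook rank-sum formula \(U_a = R_a - n_a(n_a+1)/2\) with midranks for ties.

```python
from fractions import Fraction


def h(ya, yb):
    """Half-tie kernel: 1 if a wins, 1/2 on a tie, 0 if a loses."""
    if ya > yb:
        return Fraction(1)
    if ya == yb:
        return Fraction(1, 2)
    return Fraction(0)


def psi_hat(a, b):
    """Two-sample U-statistic: mean of h over all n_a * n_b cross-pairs."""
    return sum(h(ya, yb) for ya in a for yb in b) / (len(a) * len(b))


def midranks(values):
    """Rank values in the pooled sample, averaging ranks within ties."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1  # 1-based rank of the first copy
        count = ordered.count(v)
        rank_of[v] = Fraction(2 * first + count - 1, 2)
    return [rank_of[v] for v in values]


def mann_whitney_u(a, b):
    """U for sample a via the rank-sum formula U = R_a - n_a (n_a + 1) / 2."""
    ranks = midranks(list(a) + list(b))
    r_a = sum(ranks[: len(a)])
    return r_a - Fraction(len(a) * (len(a) + 1), 2)


# Toy ratings, invented for this example.
ratings_a = [4, 5, 3, 5]
ratings_b = [3, 4, 2]

u = mann_whitney_u(ratings_a, ratings_b)
print(u, psi_hat(ratings_a, ratings_b))  # 10 5/6
print(u / (len(ratings_a) * len(ratings_b)) == psi_hat(ratings_a, ratings_b))
```

The dependence in item 2 is visible in the standard variance decomposition for a two-sample U-statistic, stated here under the assumption of i.i.d. draws within each sample and independence between samples (the \(\xi\) notation is introduced for this note and is not taken from the paper):

\[ \operatorname{Var}(\hat\psi_{ab}) = \frac{1}{n_a\,n_b}\Big[\xi_{11} + (n_b - 1)\,\xi_{10} + (n_a - 1)\,\xi_{01}\Big] \]

where \(\xi_{11} = \operatorname{Var}\,h(Y_{a1}, Y_{b1})\), \(\xi_{10} = \operatorname{Cov}\big(h(Y_{a1}, Y_{b1}),\, h(Y_{a1}, Y_{b2})\big)\) is the covariance between two terms sharing an a-response, and \(\xi_{01} = \operatorname{Cov}\big(h(Y_{a1}, Y_{b1}),\, h(Y_{a2}, Y_{b1})\big)\) is the covariance between two terms sharing a b-response. Treating the \(n_a n_b\) summands as i.i.d. keeps only the \(\xi_{11}\) term and drops the two covariance terms.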

4. What travels — and the seam where it stops

The imported machinery assumes the pseudo-observations are independent / exchangeable across pairs (two independent clinical arms; exchangeable PIM pairs). LLM evaluation logs break that in two specific ways, and naming them now tells you exactly where the later contributions live:

| imported assumption | how LLM logs break it | resolved in |
| --- | --- | --- |
| comparisons are independent draws | same prompt / judge / session / reused response → crossed dependence | L0003 (cluster-robust variance) |
| PIM pairs are exchangeable, covariates differ | both responses answer the same prompt → \(W_i - W_j = 0\) | L0004 (prompt-cancellation) |

So the honest one-liner — the paper's own — is: the estimand transfers exactly; the inference assumptions do not. That gap is the paper, and the fragility of the gap (which assumptions a real log violates) is the doorway to Paper 2.

Retrieval check — answer from memory before moving on

1. In the PI matrix, \(\psi_{ba}\) is related to \(\psi_{ab}\) how?

Why. Antisymmetry: \(\psi_{ba} = 1 - \psi_{ab}\). pi.pi_matrix enforces it by accumulating 1−s for the reverse orientation, and fixes the diagonal at \(\tfrac12\).

2. Which of win odds, net benefit, and win ratio is not a one-to-one function of \(\psi\)?

Why. Win odds (\(\psi/(1-\psi)\)) and net benefit (\(2\psi-1\)) are monotone transforms; the win ratio \(p_1/p_{-1}\) discards the tie mass \(p_0\), so two distributions with the same \(\psi\) can have different win ratios.

3. Two outcome distributions have the same \(\psi\). What is necessarily equal between them?

Why. Net benefit \(= 2\psi - 1\) depends only on \(\psi\), so equal \(\psi\) forces equal net benefit (and equal win odds). \(p_1\), \(p_0\), and the win ratio can all still differ — see the A/B table above.

4. What makes the win-rate a U-statistic, structurally?

Why. \(\hat\psi\) averages the two-argument kernel \(h(Y_i,Y_j)\) over response pairs — the Mann–Whitney \(U\) over \(n_a\,n_b\) cross-pairs. The Gumbel/latent/MLE options describe Bradley–Terry, not \(\psi\).

5. Why does identifying \(\psi\) with the Mann–Whitney parameter actually matter for the paper?

Why. Same estimand ⇒ U-statistic variance theory and PIM adjustment apply directly. It does not fix dependence (that's L0003) or imply transitivity (L0005) — the imported assumptions still have to be earned.

Read next (primary source)

Read Song et al. (2023), "The win odds: statistical inference and regression" [Song2023-sm] — the cleanest statement of U-statistic inference and regression for exactly this estimand (the backbone the paper ports). In-repo, read main.tex lines 111–131 ("A Unified U-Statistic Framework"), which states the equivalence and the transforms you just worked through.