1. The object, defined once
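For an ordered pair of models \((a, b)\), the probabilistic index is
\[\psi_{ab} = p_1 + \tfrac12 p_0,\]
where \((p_1, p_0, p_{-1})\) are the trinomial probabilities that \(a\) wins, ties, or loses
(\(p_1+p_0+p_{-1}=1\); main.tex §Methods). Each comparison contributes the half-tie kernel
\[h = \begin{cases} 1 & \text{if } a \text{ wins,} \\ \tfrac12 & \text{if tied,} \\ 0 & \text{if } a \text{ loses,} \end{cases}\]
and \(\hat\psi\) is just the mean of \(h\). You can see the kernel computed literally in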
data.py: score = where(rating_a > rating_b, 1.0, where(==, 0.5, 0.0)) — that expression
is \(h\). The schema then carries it as score ∈ {0, ½, 1} (and admits any value in
\([0,1]\), so averaged or model-predicted scores reuse the same pipeline).
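The kernel and its mean can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `half_tie_scores` and the toy ratings are hypothetical and are not taken from data.py.

```python
import numpy as np

def half_tie_scores(rating_a, rating_b):
    """Half-tie kernel h: 1.0 if a wins, 0.5 on a tie, 0.0 if a loses."""
    rating_a = np.asarray(rating_a, dtype=float)
    rating_b = np.asarray(rating_b, dtype=float)
    return np.where(rating_a > rating_b, 1.0,
                    np.where(rating_a == rating_b, 0.5, 0.0))

# Toy ratings (made up for illustration): 3 wins, 1 tie, 1 loss for model a.
rating_a = [5, 4, 3, 4, 2]
rating_b = [3, 4, 1, 2, 5]

scores_ab = half_tie_scores(rating_a, rating_b)
psi_ab = scores_ab.mean()          # estimate of psi for orientation (a, b)
psi_ba = (1.0 - scores_ab).mean()  # reverse orientation scores each pair as 1 - s

print(scores_ab, psi_ab, psi_ba)
```

On these toy ratings the scores are 1, ½, 1, 1, 0, so the estimate is 0.7 for orientation \((a,b)\) and 0.3 for \((b,a)\); the two always sum to 1.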
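Two identities follow directly from the kernel: \(\psi_{ba} = 1 - \psi_{ab}\) and \(\psi_{aa} = \tfrac12\).
pi.pi_matrix builds these in by construction: it accumulates \(s\) for orientation \((a,b)\)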
and \(1-s\) for \((b,a)\), then sets the diagonal to \(\tfrac12\) (pi.py:54,60). So you never
estimate \(\psi_{ab}\) and \(\psi_{ba}\) as two free numbers that might disagree; the structure is exact.
2. Which transforms keep the information, and which throw it away
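This is the crux of Contribution 1, and a clean thing to put on a whiteboard. From \(p_1+p_0+p_{-1}=1\) we get \(1 - \psi = p_{-1} + \tfrac12 p_0\), and then:
\[\text{net benefit} = p_1 - p_{-1} = 2\psi - 1, \qquad \text{win odds} = \frac{p_1 + \tfrac12 p_0}{p_{-1} + \tfrac12 p_0} = \frac{\psi}{1-\psi}, \qquad \text{win ratio} = \frac{p_1}{p_{-1}}.\]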
Win odds and net benefit are one-to-one (strictly monotone) functions of \(\psi\) — invertible, so they carry exactly the same information. The win ratio is not a function of \(\psi\) alone: it needs how the mass splits between \(p_0\) and the extremes. One numerical example settles it — two outcome distributions with the same \(\psi\):
| trinomial \((p_1, p_0, p_{-1})\) | \(\psi\) | net benefit \(2\psi-1\) | win odds \(\psi/(1-\psi)\) | win ratio \(p_1/p_{-1}\) |
|---|---|---|---|---|
| \(A = (0.60,\ 0.00,\ 0.40)\) | 0.60 | +0.20 | 1.50 | 1.50 |
| \(B = (0.50,\ 0.20,\ 0.30)\) | 0.60 | +0.20 | 1.50 | 1.67 |
Same \(\psi\) ⇒ net benefit and win odds are forced equal (they're transforms). The win ratio differs
(1.50 vs 1.67) precisely because it discards the tie mass \(p_0\). That is exactly why pi.py ships
win_odds and net_benefit but deliberately omits the win ratio — it isn't a property of
the estimand the rest of the paper is about.
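The table can be reproduced directly from the definitions. The sketch below is plain Python under the formulas stated above; the helper name `summaries` is hypothetical and is not the pi.py API.

```python
def summaries(p_win, p_tie, p_loss):
    """Return psi and three effect measures for a trinomial (p1, p0, p-1)."""
    assert abs(p_win + p_tie + p_loss - 1.0) < 1e-12
    psi = p_win + 0.5 * p_tie
    return {
        "psi": psi,
        "net_benefit": 2.0 * psi - 1.0,   # equals p_win - p_loss
        "win_odds": psi / (1.0 - psi),    # a function of psi alone
        "win_ratio": p_win / p_loss,      # needs p_win and p_loss separately
    }

A = summaries(0.60, 0.00, 0.40)
B = summaries(0.50, 0.20, 0.30)

for name in A:
    print(f"{name:12s} A={A[name]:.3f}  B={B[name]:.3f}")
```

Only the last row differs between A and B, because the win ratio is the one quantity here that cannot be computed from \(\psi\) alone.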
3. The structural fact: \(\psi\) is a U-statistic
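The estimator averages the kernel over every cross-pair of responses,
\[\hat\psi_{ab} = \frac{1}{n_a n_b}\sum_{i=1}^{n_a}\sum_{j=1}^{n_b} h_{ij},\]
where \(h_{ij}\) is the half-tie kernel for response \(i\) of model \(a\) against response \(j\) of model \(b\), and \(n_a\), \(n_b\) are the numbers of responses. This is exactly the Mann–Whitney \(U\) divided by the number of cross-pairs, i.e. the "win proportion." Two consequences flow from this, and they are the reason \(\psi\) travels: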
- It unlocks the inference + adjustment machinery. Because \(\psi\) is the Mann–Whitney parameter /
clinical win proportion, the U-statistic variance theory (delta method, asymptotic normality of
\(\log\) win odds) and the PIM regression framework — both built on U-statistic estimating equations —
apply directly. The paper isn't analogizing; it's reusing theory for the identical estimand
(main.tex, "A Unified U-Statistic Framework," lines 111–131).
- Its summands are dependent, by construction. The pairs \((i,j)\) and \((i,k)\) both reuse response \(i\), so the terms in the average are correlated. A U-statistic's variance is therefore not \(\sigma^2/n\). This is intrinsic to the estimand, not a data defect, and it is the seed of the entire next lesson: a single model response reused against every opponent is the same dependence, which is why the naive i.i.d. standard error undercovers.
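The dependence among summands is easy to see in a small simulation. The sketch below assumes NumPy and a deliberately simple setup that is not from the paper: both models draw scores from the same continuous distribution, so \(\psi = \tfrac12\) and there are no ties. The function name `psi_hat` and all parameters are illustrative.

```python
import numpy as np

def psi_hat(y_a, y_b):
    """Mean of the half-tie kernel over all n_a * n_b cross-pairs."""
    diff = y_a[:, None] - y_b[None, :]   # every (i, j) pair via broadcasting
    h = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    return h.mean()

rng = np.random.default_rng(0)
n_a = n_b = 20
reps = 2000

# Repeat the experiment many times to see how much psi_hat actually varies.
estimates = np.array([
    psi_hat(rng.normal(size=n_a), rng.normal(size=n_b))
    for _ in range(reps)
])
empirical_var = estimates.var(ddof=1)

# Naive i.i.d. formula: treats the n_a * n_b kernel values as independent.
# With psi = 0.5 and no ties, each kernel value has variance 0.25.
naive_var = 0.25 / (n_a * n_b)

# Null variance of the Mann-Whitney proportion for continuous data.
exact_null_var = (n_a + n_b + 1) / (12 * n_a * n_b)

print(empirical_var, naive_var, exact_null_var)
```

In this setup the naive formula gives \(0.25/400 = 0.000625\), while the null variance is \(41/4800 \approx 0.0085\), roughly 14 times larger. The simulated variance should land near the latter, which is the undercoverage the bullet above describes.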
4. What travels — and the seam where it stops
The imported machinery assumes the pseudo-observations are independent / exchangeable across pairs (two independent clinical arms; exchangeable PIM pairs). LLM evaluation logs break that in two specific ways, and naming them now tells you exactly where the later contributions live:
| imported assumption | how LLM logs break it | resolved in |
|---|---|---|
| comparisons are independent draws | same prompt / judge / session / reused response → crossed dependence | L0003 (cluster-robust variance) |
| PIM pairs are exchangeable, covariates differ | both responses answer the same prompt → \(W_i - W_j = 0\) | L0004 (prompt-cancellation) |
So the honest one-liner — the paper's own — is: the estimand transfers exactly; the inference assumptions do not. That gap is the paper, and the fragility of the gap (which assumptions a real log violates) is the doorway to Paper 2.
1. In the PI matrix, \(\psi_{ba}\) is related to \(\psi_{ab}\) how?
\(\psi_{ba} = 1 - \psi_{ab}\). pi.pi_matrix enforces it by accumulating \(1-s\) for the reverse orientation, and fixes the diagonal at \(\tfrac12\).
2. Which of these is not a one-to-one function of \(\psi\)?
3. Two outcome distributions have the same \(\psi\). What is necessarily equal between them?
4. What makes the win-rate a U-statistic, structurally?
5. Why does identifying \(\psi\) with the Mann–Whitney parameter actually matter for the paper?
Read next (primary source)
Read Song et al. (2023), "The win odds: statistical inference and
regression" [Song2023-sm] — the cleanest statement of U-statistic inference and regression for
exactly this estimand (the backbone the paper ports). In-repo, read main.tex lines 111–131 ("A
Unified U-Statistic Framework"), which states the equivalence and the transforms you just worked through.