Win-rate inference · Lesson 0001

The whole method on one page

Four contributions, one estimand spine, and how every winrate/ module and script maps to a paper claim.

Why this lesson. You need to defend every number with Jean and reviewers, trust the framing yourself, and find the seams that become Paper 2. Before any deep dive, you want the map: where each claim lives, what produces it, and how the pieces depend on each other. Everything later hangs here.

The one idea

There is a single primitive estimand. Everything else is either a transform of it or an added assumption on top of it.

\[ \underbrace{\text{win-rate (ties split)}}_{\text{LLM eval}} \;=\; \psi_{ab} = p_1 + \tfrac12 p_0 \;=\; \underbrace{\text{Mann–Whitney parameter}}_{\text{nonparametric stats}} \;=\; \underbrace{\text{win proportion}}_{\text{clinical trials}} \]

Here \(p_1\), \(p_0\), and \(p_{-1}\) are the probabilities that \(a\) wins against, ties with, or loses to \(b\). Win odds \(\psi/(1-\psi)\) and net benefit \(2\psi-1\) are one-to-one transforms of \(\psi\), so they carry the same information. The win ratio \(p_1/p_{-1}\) is not a transform: it throws away ties, which is why it is deliberately absent from pi.py. Hold onto this identity: it is the spine of the entire paper, and the reason the clinical / PIM inference machinery is even available to import.
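A small numeric check of the transform claim, as a sketch. The helper names below are hypothetical stand-ins written for this note, not the functions in pi.py, and the probabilities are made up. Two outcome distributions with the same \(\psi\) share their win odds and net benefit, but not their win ratio.

```python
import math

def psi_from_probs(p_win, p_tie):
    """Probabilistic index with ties split: psi = p_1 + p_0 / 2."""
    return p_win + 0.5 * p_tie

def win_odds_sketch(psi):
    """One-to-one transform of psi."""
    return psi / (1.0 - psi)

def net_benefit_sketch(psi):
    """One-to-one transform of psi."""
    return 2.0 * psi - 1.0

def win_ratio_sketch(p_win, p_loss):
    """Wins over losses; ties never enter, so this is not a function of psi."""
    return p_win / p_loss

# Illustrative (p_win, p_tie, p_loss) triples, chosen so both give psi = 0.6.
many_ties = (0.5, 0.2, 0.3)
no_ties = (0.6, 0.0, 0.4)

psi_many = psi_from_probs(many_ties[0], many_ties[1])
psi_none = psi_from_probs(no_ties[0], no_ties[1])

print("psi:        ", psi_many, psi_none)
print("win odds:   ", win_odds_sketch(psi_many), win_odds_sketch(psi_none))
print("net benefit:", net_benefit_sketch(psi_many), net_benefit_sketch(psi_none))
print("win ratio:  ", win_ratio_sketch(many_ties[0], many_ties[2]),
      win_ratio_sketch(no_ties[0], no_ties[2]))
```

The first three printed rows agree across the two cases; only the win ratio row differs, because it depends on how the tie mass is split off rather than on \(\psi\) alone.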

The spine (data → claim)

One pipeline serves every dataset and every estimand. New dataset = new loader; nothing downstream changes.

loader (data.py) → schema (1 row = 1 comparison) → PI matrix (pi.py) → influence + cov (variance.py) → board / ranks / cycles (leaderboard.py · covariate.py)

The keystone. variance.multiway_cluster_cov is the single inference primitive. Feed it a per-row influence matrix and a list of clustering axes, and it returns the covariance, whether for a scalar win-rate, the full leaderboard vector, or a standardized board. Learn this one function and most of the inference contribution follows.
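To make the shape of that primitive concrete, here is a minimal sketch of a multiway cluster-robust covariance built by inclusion-exclusion over the clustering axes (the standard Cameron-Gelbach-Miller construction). It is an assumption about the general technique, not the code in variance.py: the function names, the signature, and the absence of any finite-sample correction are all choices made for this illustration.

```python
import itertools
import numpy as np

def _codes(labels):
    """Map arbitrary 1-D labels to integer codes 0..G-1."""
    _, codes = np.unique(np.asarray(labels), return_inverse=True)
    return codes.ravel()

def _intersect(code_list):
    """Integer codes for the intersection of several clusterings."""
    combined = np.zeros_like(code_list[0])
    for c in code_list:
        combined = combined * (c.max() + 1) + c
    return _codes(combined)

def _oneway_cov(phi, codes):
    """Cluster-robust covariance of the mean of phi for one clustering."""
    n = phi.shape[0]
    sums = np.zeros((codes.max() + 1, phi.shape[1]))
    np.add.at(sums, codes, phi)          # per-cluster sums of influence rows
    return sums.T @ sums / n**2

def multiway_cluster_cov_sketch(phi, axes):
    """phi: (n_rows, k) influence matrix; axes: list of length-n label arrays."""
    phi = np.asarray(phi, dtype=float)
    phi = phi - phi.mean(axis=0)         # centre the influence values
    codes = [_codes(a) for a in axes]
    cov = np.zeros((phi.shape[1], phi.shape[1]))
    for r in range(1, len(codes) + 1):
        sign = (-1) ** (r + 1)           # add single axes, subtract pairs, ...
        for subset in itertools.combinations(codes, r):
            cov += sign * _oneway_cov(phi, _intersect(list(subset)))
    return cov

# Hypothetical toy data: 50 prompts crossed with 4 judges, one scalar estimand.
rng = np.random.default_rng(0)
prompts = np.repeat(np.arange(50), 4)
judges = np.tile(np.arange(4), 50)
phi_demo = rng.normal(size=(200, 1)) + rng.normal(size=50)[prompts, None]
cov_demo = multiway_cluster_cov_sketch(phi_demo, [prompts, judges])
print(cov_demo.shape)
```

Passing a single axis with one label per row reduces this to the naive covariance. With two or more axes the inclusion-exclusion sum is not guaranteed to be positive semidefinite, which is a known property of this estimator rather than a bug in the sketch.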

The four contributions, mapped

Contribution 1 · the estimand

The win-rate is the probabilistic index; everything downstream is a transform or an assumption.

Code: pi.pi_matrix (\(\psi\) + pair counts), win_odds, net_benefit · schema.py (the half-tie score ∈ {0,½,1}).

Take: identifying \(\psi\) with the Mann–Whitney / win-proportion is what makes U-statistic inference and PIMs importable — the paper's whole leverage.
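As an illustration of what a PI matrix with pair counts looks like, the sketch below builds one from rows of (model_a, model_b, score), where score is the half-tie score for model_a in {0, ½, 1}. The row layout and the function name are assumptions made for this example; the actual schema.py and pi.pi_matrix may differ.

```python
import numpy as np

def pi_matrix_sketch(rows, models):
    """Return (psi, counts): psi[i, j] estimates P(i beats j) with ties split."""
    index = {m: i for i, m in enumerate(models)}
    k = len(models)
    wins = np.zeros((k, k))
    counts = np.zeros((k, k))
    for model_a, model_b, score in rows:
        i, j = index[model_a], index[model_b]
        wins[i, j] += score            # a's half-tie score against b
        wins[j, i] += 1.0 - score      # b's complementary score against a
        counts[i, j] += 1
        counts[j, i] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        psi = np.where(counts > 0, wins / counts, np.nan)  # NaN where never compared
    return psi, counts

# Hypothetical comparisons: one row = one comparison.
rows = [("A", "B", 1.0), ("A", "B", 0.5), ("A", "B", 1.0), ("B", "A", 1.0)]
psi, counts = pi_matrix_sketch(rows, ["A", "B"])
print(psi)
print(counts)
```

By construction psi[i, j] + psi[j, i] = 1 for every compared pair, which is the ties-split property that lets the Mann–Whitney machinery apply.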

Contribution 2 · design-robust inference

Comparisons aren't independent — they share prompts, judges, sessions, reused responses — so the naive SE is wrong.

Code: leaderboard_scores (per-row influence) → variance.multiway_cluster_cov · naive_cov · oneway_icc. Script: eda_our_data.py.

Headline: on mmlu_pro, SE inflates 4.2× (prompt) and 5.4× (prompt+judge) — a nominal 95% naive interval covers ≈50%. Binding \(n\) = number of prompts, not comparisons.
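For intuition on how a small intraclass correlation can produce a large SE inflation, the sketch below uses the textbook one-way design effect \(1 + (m-1)\rho\) for equal cluster sizes \(m\). This is an assumption for illustration only: it is not how variance.py or oneway_icc computes anything, and the cluster sizes are hypothetical rather than taken from mmlu_pro.

```python
import math

def se_inflation_sketch(icc, cluster_size):
    """Square root of the one-way design effect 1 + (m - 1) * icc."""
    return math.sqrt(1.0 + (cluster_size - 1) * icc)

def cluster_size_for_inflation(icc, inflation):
    """Invert the formula: cluster size m that yields a given SE inflation."""
    return 1.0 + (inflation**2 - 1.0) / icc

# Hypothetical cluster sizes with a small ICC.
for m in (10, 100, 600):
    print(m, round(se_inflation_sketch(0.03, m), 2))

# Cluster size at which an ICC of 0.03 alone would give a 4x inflation.
print(round(cluster_size_for_inflation(0.03, 4.0)))
```

The point of the arithmetic is that the inflation grows with the number of comparisons sharing a cluster, so heavy reuse of the same prompts can make even a small ICC expensive.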

Contribution 3 · prompt-cancellation

The difference-form PIM cannot adjust for prompt mix: both answers share the prompt, so \(X_a - X_b = 0\) and \(\beta_W\) is unidentified.
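The cancellation can be seen directly in a toy design matrix. The sketch below uses simulated covariates with hypothetical names (a response-level covariate z that differs between the two answers, and a prompt-level covariate w that both answers share); it is not the diagnostic in covariate.py.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

z_a = rng.normal(size=n)   # response-level covariate for answer a
z_b = rng.normal(size=n)   # response-level covariate for answer b
w = rng.normal(size=n)     # prompt-level covariate, identical for both answers

X_a = np.column_stack([z_a, w])
X_b = np.column_stack([z_b, w])

# Difference-form design: the prompt column is w - w, which is exactly zero.
D = X_a - X_b
print("prompt column all zero:", bool(np.all(D[:, 1] == 0)))
print("rank of difference design:", np.linalg.matrix_rank(D))

# Effect modification: interacting the prompt covariate with the
# response-level difference gives a column that does not cancel.
D_interact = np.column_stack([z_a - z_b, (z_a - z_b) * w])
print("rank with interaction:", np.linalg.matrix_rank(D_interact))
```

The zero column is why a main effect for the prompt covariate has no identified coefficient, while the interaction column carries information about how the comparison changes across prompts.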

Code: covariate.prompt_covariate_cancellation (diagnose) → stratified_pi (effect modification) → standardized_leaderboard (fix). Scripts: prompt_cancellation.py, our_data_covariate.py.

Headline: on Arena, per-category win-rates span up to 0.142 for one model; standardizing to a balanced mix reorders 34 of 48 models (one by six ranks). The adjusted win odds is the standardized marginal, not \(\exp(\beta_Z)\) (noncollapsibility).
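A minimal sketch of direct standardization, assuming per-category win-rates are already estimated. The category names and numbers are invented for illustration, and the function is a hypothetical stand-in for whatever standardized_leaderboard does internally.

```python
import math

def standardized_pi_sketch(stratum_psi, target_mix):
    """Average per-stratum psi under an explicitly named target prompt mix."""
    total = sum(target_mix.values())
    return sum(stratum_psi[s] * weight for s, weight in target_mix.items()) / total

# Hypothetical per-category win-rates for one model.
stratum_psi = {"coding": 0.62, "writing": 0.48}

observed_mix = {"coding": 0.8, "writing": 0.2}   # what the data happened to contain
balanced_mix = {"coding": 0.5, "writing": 0.5}   # the named target mix

psi_observed = standardized_pi_sketch(stratum_psi, observed_mix)
psi_balanced = standardized_pi_sketch(stratum_psi, balanced_mix)

# The adjusted win odds is a transform of the standardized marginal psi.
adjusted_win_odds = psi_balanced / (1.0 - psi_balanced)

print(psi_observed, psi_balanced, adjusted_win_odds)
```

The same model moves from 0.592 to 0.55 purely because the prompt mix changed, which is the mechanism by which standardization can reorder a board.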

Contribution 4 · pairwise → leaderboard

A PI matrix is not a ranking. Report what the data resolve: rank-confidence sets and a transitivity diagnostic.
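One common way to build rank-confidence sets is to test every pairwise score difference at a simultaneous critical value and count, for each model, how many others are significantly better or worse. The sketch below uses a Bonferroni critical value; it is an illustrative assumption about the technique, not the implementation in leaderboard.py, and the scores and covariance are invented. A max-t version would keep the same counting logic and swap in a different critical value.

```python
import numpy as np
from statistics import NormalDist

def bonferroni_critical_value(n_pairs, alpha=0.05):
    """Two-sided normal critical value adjusted for n_pairs comparisons."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_pairs))

def rank_confidence_sets_sketch(scores, cov, alpha=0.05):
    """Return one (best_rank, worst_rank) interval per model."""
    scores = np.asarray(scores, dtype=float)
    cov = np.asarray(cov, dtype=float)
    k = len(scores)
    crit = bonferroni_critical_value(k * (k - 1) // 2, alpha)
    diff = scores[:, None] - scores[None, :]
    var = np.diag(cov)[:, None] + np.diag(cov)[None, :] - 2.0 * cov
    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(var > 0, diff / np.sqrt(var), 0.0)
    n_better = (z < -crit).sum(axis=1)   # models significantly above model i
    n_worse = (z > crit).sum(axis=1)     # models significantly below model i
    best = 1 + n_better
    worst = k - n_worse
    return list(zip(best.tolist(), worst.tolist()))

# Hypothetical board: one clear leader, two models too close to separate.
scores = [0.70, 0.55, 0.54]
cov = np.diag([0.02**2] * 3)
print(rank_confidence_sets_sketch(scores, cov))
```

In this toy board the leader is certified at rank 1 while the other two models share the set [2, 3], because their difference is well inside the simultaneous band.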

Code: leaderboard.rank_confidence_sets (max-t / bonferroni) · condorcet_cycles · pairwise_pi_table · ci_supported_cycles · variance.maxt_critical_value. Scripts: arena_rank_sets.py, cycle_sweep.py.

Headline: on Arena, 0 of 52 models get a unique certified rank (only gemini-2.5-pro resolved to [1,2]); 305 raw 3-cycles but 0 survive simultaneous CIs (strongest weakest-edge \(z = 1.51\)).
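The difference between a raw cycle and one that survives uncertainty can be sketched as follows. A 3-cycle is raw if every edge has \(\psi > 0.5\); its weakest-edge \(z\) is the smallest edge \(z\)-statistic, and the cycle is CI-supported only if that value clears the simultaneous critical value. Function names, the toy matrix, the standard errors, and the critical value used here are all assumptions for illustration, not the code in leaderboard.py.

```python
import itertools
import numpy as np

def three_cycles_sketch(psi, se):
    """Return a list of ((a, b, c), weakest_edge_z) for every raw 3-cycle."""
    k = psi.shape[0]
    found = []
    for i, j, l in itertools.combinations(range(k), 3):
        for a, b, c in ((i, j, l), (i, l, j)):       # the two orientations
            edges = [(a, b), (b, c), (c, a)]
            if all(psi[u, v] > 0.5 for u, v in edges):
                weakest = min((psi[u, v] - 0.5) / se[u, v] for u, v in edges)
                found.append(((a, b, c), weakest))
    return found

def ci_supported_sketch(cycles, crit):
    """Keep only cycles whose weakest edge exceeds the critical value."""
    return [cycle for cycle, weakest in cycles if weakest > crit]

# Hypothetical rock-paper-scissors board: 0 beats 1, 1 beats 2, 2 beats 0.
psi = np.full((3, 3), 0.5)
psi[0, 1], psi[1, 2], psi[2, 0] = 0.55, 0.56, 0.53
psi[1, 0], psi[2, 1], psi[0, 2] = 0.45, 0.44, 0.47
se = np.full((3, 3), 0.02)

cycles = three_cycles_sketch(psi, se)
print(cycles)
print(ci_supported_sketch(cycles, crit=2.39))
```

Here one raw cycle exists, but its weakest edge sits only about 1.5 standard errors from 0.5, so it does not survive a simultaneous threshold.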

The honest through-line. Each contribution makes a weaker claim than current practice: a cluster-robust win-rate, a rank-confidence set instead of a point rank, a transitivity claim that survives uncertainty, and a covariate-standardized index that names its target prompt mix. That two-sided honesty (here is the method and the cost it exposes) is the paper's voice, and your strongest ground when defending it.

Retrieval check: answer from memory before moving on

1. What single object is the primitive estimand that everything reduces to?

Why. \(\psi = p_1 + \tfrac12 p_0\) equals the win-rate, the Mann–Whitney parameter, and the clinical win proportion. Win odds and net benefit are transforms of it; the win ratio is not (it discards ties).

2. Which count actually binds the precision of a model's win-rate here?

Why. Each model's single answer per prompt is reused against every opponent, so comparisons are far from independent. The effective sample size tracks prompts — that's why a tiny pair-level ICC (~0.03) still inflates the SE 4–6×.

3. Why does the difference-form PIM fail to adjust for a prompt covariate?

Why. A prompt covariate is shared by both responses, so \(X_a = X_b\) and the design column \(X_a - X_b\) is identically zero. \(\beta_W\) is unidentified — prompt-cancellation. The fix is effect modification (interact/stratify) then standardize.

4. The covariate-adjusted win odds equals which quantity?

Why. The win odds is noncollapsible: the adjusted value is \(\psi^\star/(1-\psi^\star)\) where \(\psi^\star\) is the standardized marginal PI — not \(\exp(\hat\beta_Z)\), and the two differ even with no confounding. A common reviewer trap; know it cold.
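A toy calculation shows the gap. Assume, purely for illustration, a logit model \(\operatorname{logit}\psi(w) = \beta_Z + \beta_W w\) with a centred prompt covariate \(w \in \{-1, +1\}\) in a balanced mix; this is not the paper's PIM specification and the coefficients are invented. The mix is fixed by design, so nothing here is confounded, yet the standardized win odds still differs from \(\exp(\beta_Z)\).

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

beta_z, beta_w = 1.0, 2.0        # hypothetical coefficients
w_values = (-1.0, 1.0)           # balanced, centred prompt covariate

# Conditional win-rates in each prompt stratum.
psi_by_w = [expit(beta_z + beta_w * w) for w in w_values]

# Standardized marginal PI under the balanced mix, then its win odds.
psi_star = sum(psi_by_w) / len(psi_by_w)
standardized_win_odds = psi_star / (1.0 - psi_star)

# Conditional win odds read off the coefficient.
conditional_win_odds = math.exp(beta_z)

print(round(psi_star, 4), round(standardized_win_odds, 4), round(conditional_win_odds, 4))
```

Averaging happens on the probability scale while \(\exp(\beta_Z)\) lives on the odds scale at a fixed \(w\), so the two numbers answer different questions even in this clean setup.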

5. Which module turns the PI matrix into rank-confidence sets and cycle diagnostics?

Why. leaderboard.py holds rank_confidence_sets, condorcet_cycles, pairwise_pi_table, and ci_supported_cycles — but it leans on variance.py for every covariance (the shared primitive).

Read next (primary source)

Read outline_paper1.md — the Contributions list (lines 19–29) and the Preliminary findings (lines 89–94). It is the highest-fidelity statement of the four claims and their headline numbers, in the author's own framing. Pair it with winrate/README.md ("The spine") to see the same four claims as code.