The one idea
There is a single primitive estimand: the win-rate \(\psi\), i.e. the probabilistic index. Everything else is either a transform of it or an added assumption on top of it.
Win odds \(\psi/(1-\psi)\) and net benefit \(2\psi-1\) are one-to-one transforms, so they carry
the same information. The win ratio \(p_1/p_{-1}\) is not a transform of \(\psi\): it throws away ties,
which is why it is deliberately absent from pi.py. Hold onto the single-estimand view: it is the spine of the
entire paper, and the reason the clinical / PIM inference machinery is available to import.
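A minimal sketch of the point above, assuming the half-tie score convention (win = 1, tie = ½, loss = 0). The function names and the scores are illustrative stand-ins, not the actual signatures or data in pi.py. Two comparison sets with the same \(\psi\) but different tie rates agree on win odds and net benefit, while the win ratio differs:

```python
import numpy as np

def win_rate(scores):
    """psi as the mean half-tie score: P(win) + 0.5 * P(tie)."""
    return float(np.mean(scores))

def win_odds_from_psi(psi):
    """One-to-one transform of psi."""
    return psi / (1.0 - psi)

def net_benefit_from_psi(psi):
    """One-to-one transform of psi."""
    return 2.0 * psi - 1.0

def win_ratio(scores):
    """Wins over losses; ties are dropped, so this is not a function of psi."""
    s = np.asarray(scores, dtype=float)
    return float(np.sum(s == 1.0) / np.sum(s == 0.0))

# Made-up scores: same psi = 0.6, very different tie rates.
many_ties = np.array([1, 1, 1, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])  # 3 wins, 1 loss, 6 ties
no_ties = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])                # 6 wins, 4 losses

for name, s in [("many_ties", many_ties), ("no_ties", no_ties)]:
    psi = win_rate(s)
    print(name, psi, win_odds_from_psi(psi), net_benefit_from_psi(psi), win_ratio(s))
```

Both rows print the same \(\psi\), win odds, and net benefit; only the win ratio changes, because it depends on the tie rate as well as on \(\psi\).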
The spine (data → claim)
One pipeline serves every dataset and every estimand. New dataset = new loader; nothing downstream changes.
variance.multiway_cluster_cov is the single inference primitive.
Feed it a per-row influence matrix and a list of clustering axes, and it returns the covariance —
for a scalar win-rate, the full leaderboard vector, or a standardized board alike. Learn this one function
and most of the inference contribution follows.
The four contributions, mapped
The win-rate is the probabilistic index; everything downstream is a transform or an assumption.
Code: pi.pi_matrix (\(\psi\) + pair counts), win_odds, net_benefit · schema.py (the half-tie score ∈ {0,½,1}).
Take: identifying \(\psi\) with the Mann–Whitney / win-proportion is what makes U-statistic inference and PIMs importable — the paper's whole leverage.
Comparisons aren't independent — they share prompts, judges, sessions, reused responses — so the naive SE is wrong.
Code: leaderboard_scores (per-row influence) → variance.multiway_cluster_cov · naive_cov · oneway_icc. Script: eda_our_data.py.
Headline: on mmlu_pro, SE inflates 4.2× (prompt) and 5.4× (prompt+judge) — a nominal 95% naive interval covers ≈50%. Binding \(n\) = number of prompts, not comparisons.
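To make the clustering idea concrete, here is a hedged sketch of multiway cluster covariance by inclusion-exclusion over clustering axes (the Cameron, Gelbach and Miller construction). It assumes the estimator is the column mean of a per-row influence matrix and applies no finite-sample correction. The function names and toy numbers are hypothetical, and the repo's variance.multiway_cluster_cov may differ in conventions and details.

```python
from itertools import combinations
import numpy as np

def _codes(ids):
    """Map arbitrary cluster labels to integer codes 0..G-1."""
    return np.unique(np.asarray(ids), return_inverse=True)[1].ravel()

def oneway_cov(infl, codes):
    """One-way cluster covariance of the column means of `infl`."""
    n, p = infl.shape
    sums = np.zeros((codes.max() + 1, p))
    np.add.at(sums, codes, infl)          # per-cluster sums of influence rows
    return sums.T @ sums / n**2

def multiway_cov_sketch(infl, axes):
    """Inclusion-exclusion over all non-empty subsets of clustering axes."""
    infl = np.asarray(infl, dtype=float)
    infl = infl - infl.mean(axis=0)       # center the influence values
    coded = [_codes(a) for a in axes]
    total = np.zeros((infl.shape[1], infl.shape[1]))
    for k in range(1, len(coded) + 1):
        sign = (-1.0) ** (k + 1)          # add odd-size subsets, subtract even-size
        for subset in combinations(coded, k):
            inter = np.zeros(infl.shape[0], dtype=np.int64)
            for c in subset:              # rows share a cluster only if all axes match
                inter = inter * (c.max() + 1) + c
            total += sign * oneway_cov(infl, _codes(inter))
    return total

# Toy data: 3 prompts x 2 judges, one comparison per cell (made-up influence values).
infl = np.array([[2.5], [0.5], [1.0], [-1.0], [-0.5], [-2.5]])
prompts = ["p0", "p0", "p1", "p1", "p2", "p2"]
judges = ["j0", "j1", "j0", "j1", "j0", "j1"]

naive = multiway_cov_sketch(infl, [np.arange(len(infl))])   # every row its own cluster
by_prompt = multiway_cov_sketch(infl, [prompts])
by_prompt_judge = multiway_cov_sketch(infl, [prompts, judges])
print(np.sqrt(by_prompt / naive), np.sqrt(by_prompt_judge / naive))  # SE inflation factors
```

In this toy the variance grows from naive to prompt-clustered to prompt+judge-clustered. Inclusion-exclusion estimates are not guaranteed to be positive semi-definite, so a production version would need to handle that case.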
The difference-form PIM cannot adjust for prompt mix: both answers share the prompt, so \(X_a - X_b = 0\) and \(\beta_W\) is unidentified.
Code: covariate.prompt_covariate_cancellation (diagnose) → stratified_pi (effect modification) → standardized_leaderboard (fix). Scripts: prompt_cancellation.py, our_data_covariate.py.
Headline: on Arena, per-category win-rates vary by as much as 0.142 across categories for a single model; standardizing to a balanced mix reorders 34 of 48 models (one by six ranks). The adjusted win odds is the standardized marginal, not \(\exp(\beta_Z)\) (noncollapsibility).
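The cancellation and the standardization fix can be sketched as follows. This is an illustration under stated assumptions, not the code in covariate.py: the function names, the category labels, and the scores are all made up, and the default target mix is taken to be equal weight per category.

```python
import numpy as np

def stratified_win_rate(scores, strata):
    """Per-stratum mean half-tie score, as {stratum: psi}."""
    scores = np.asarray(scores, dtype=float)
    strata = np.asarray(strata)
    return {str(s): float(scores[strata == s].mean()) for s in np.unique(strata)}

def standardized_win_rate(scores, strata, weights=None):
    """Reweight per-stratum win-rates to a target mix (balanced by default)."""
    per = stratified_win_rate(scores, strata)
    if weights is None:
        weights = {s: 1.0 / len(per) for s in per}
    return sum(weights[s] * per[s] for s in per)

# One model's comparisons: 8 coding prompts, 2 math prompts (made-up scores).
strata = np.array(["coding"] * 8 + ["math"] * 2)
scores = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0.5, 0])

# Cancellation: a prompt-level covariate is identical for both answers,
# so the difference-form regressor is a column of zeros.
x_a = (strata == "coding").astype(float)
x_b = x_a.copy()
print(np.any(x_a - x_b))                      # False: nothing left to regress on

raw = float(scores.mean())                    # weighted by the observed prompt mix
balanced = standardized_win_rate(scores, strata)
print(raw, stratified_win_rate(scores, strata), balanced)
```

Here the raw win-rate (0.65) is dominated by the category the model sees most, while the balanced-mix value (0.5) weights coding and math equally. The win odds reported after adjustment would then be computed from the standardized marginal.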
A PI matrix is not a ranking. Report what the data resolve: rank-confidence sets and a transitivity diagnostic.
Code: leaderboard.rank_confidence_sets (max-t / bonferroni) · condorcet_cycles · pairwise_pi_table · ci_supported_cycles · variance.maxt_critical_value. Scripts: arena_rank_sets.py, cycle_sweep.py.
Headline: on Arena, 0 of 52 models get a unique certified rank (only gemini-2.5-pro resolved to [1,2]); 305 raw 3-cycles but 0 survive simultaneous CIs (strongest weakest-edge \(z = 1.51\)).
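The transitivity diagnostic can be sketched as below, under stated assumptions. `P[i, j]` is taken to be the win-rate of model i over model j, a raw 3-cycle is three models where each beats the next with win-rate above 0.5, and a cycle counts as CI-supported only if its weakest edge clears a simultaneous critical value. The function names, the matrix, the standard errors, and the critical value are hypothetical; the repo's condorcet_cycles and ci_supported_cycles may be defined differently.

```python
from itertools import combinations
import numpy as np

def three_cycles(P):
    """All directed 3-cycles (a beats b, b beats c, c beats a) in a PI matrix."""
    cycles = []
    for i, j, k in combinations(range(P.shape[0]), 3):
        for a, b, c in ((i, j, k), (i, k, j)):   # the two possible orientations
            if P[a, b] > 0.5 and P[b, c] > 0.5 and P[c, a] > 0.5:
                cycles.append((a, b, c))
    return cycles

def weakest_edge_z(P, SE, cycle):
    """Smallest z-statistic against 0.5 among the three edges of a cycle."""
    a, b, c = cycle
    return min((P[x, y] - 0.5) / SE[x, y] for x, y in ((a, b), (b, c), (c, a)))

# Made-up 4-model PI matrix: models 0, 1, 2 form a cycle, model 3 beats all.
P = np.array([
    [0.50, 0.60, 0.45, 0.30],
    [0.40, 0.50, 0.70, 0.35],
    [0.55, 0.30, 0.50, 0.40],
    [0.70, 0.65, 0.60, 0.50],
])
SE = np.full_like(P, 0.05)   # illustrative standard errors
crit = 2.5                   # illustrative simultaneous critical value

raw = three_cycles(P)
supported = [c for c in raw if weakest_edge_z(P, SE, c) > crit]
print(raw, [weakest_edge_z(P, SE, c) for c in raw], supported)
```

The toy has one raw cycle whose weakest edge has \(z \approx 1\), so it does not survive the threshold. This is the same pattern as the headline: raw cycles exist, but none is resolved by the data.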
1. What single object is the primitive estimand that everything reduces to?
2. Which count actually binds the precision of a model's win-rate here?
3. Why does the difference-form PIM fail to adjust for a prompt covariate?
4. The covariate-adjusted win odds equals which quantity?
5. Which module turns the PI matrix into rank-confidence sets and cycle diagnostics?
leaderboard.py holds rank_confidence_sets, condorcet_cycles, pairwise_pi_table, and ci_supported_cycles, but it leans on variance.py for every covariance (the shared primitive).
Read next (primary source)
Read outline_paper1.md — the Contributions list (lines 19–29) and the
Preliminary findings (lines 89–94). It is the highest-fidelity statement of the four claims and their
headline numbers, in the author's own framing. Pair it with winrate/README.md ("The spine")
to see the same four claims as code.