1. A ranking is an added assumption
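The PI matrix is a set of pairwise numbers. Collapsing it to a scalar requires either an opponent-mixture aggregation or a transitivity assumption; either way it is a separate step with its own commitments. The aggregation is \(S_a = \sum_{b \neq a} w_b\,\psi_{ab}\), where \(\psi_{ab}\) is the PI-matrix entry for \(a\) against \(b\).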
The mixture \(w\) is a choice and it matters (leaderboard_scores): "uniform" weights every
opponent faced equally — a Borda-style score that removes the schedule confounding of adaptive matchmaking;
"as_sampled" weights opponents by how often they were actually played, recovering the raw marginal
win-rate, which conflates skill with who you were matched against.
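To make the two mixtures concrete, here is a minimal sketch in plain Python. It is not the repository's leaderboard_scores: the function name, the `wins` dictionary layout, and the toy counts are assumptions made for illustration, and ties are ignored.

```python
def leaderboard_scores_sketch(wins, mixture="uniform"):
    """Aggregate pairwise win-rates into one score per model.

    wins[(a, b)] is the number of times a beat b (hypothetical layout).
    """
    models = sorted({m for pair in wins for m in pair})
    scores = {}
    for a in models:
        rates, counts = [], []
        for b in models:
            if b == a:
                continue
            n = wins.get((a, b), 0) + wins.get((b, a), 0)
            if n == 0:
                continue  # opponent never faced: excluded from the mixture
            rates.append(wins.get((a, b), 0) / n)
            counts.append(n)
        if not rates:
            continue
        if mixture == "uniform":
            # every opponent faced gets the same weight
            weights = [1.0 / len(rates)] * len(rates)
        elif mixture == "as_sampled":
            # opponents weighted by how often they were actually played
            total = sum(counts)
            weights = [n / total for n in counts]
        else:
            raise ValueError(f"unknown mixture: {mixture!r}")
        scores[a] = sum(w * r for w, r in zip(weights, rates))
    return scores


# Toy data: A mostly played the weak model B and rarely the strong model C.
toy_wins = {("A", "B"): 90, ("B", "A"): 10, ("A", "C"): 4, ("C", "A"): 6}
uniform = leaderboard_scores_sketch(toy_wins, "uniform")
as_sampled = leaderboard_scores_sketch(toy_wins, "as_sampled")
print(round(uniform["A"], 3), round(as_sampled["A"], 3))
```

In the toy data, A's "as_sampled" score is its raw win-rate (94 of 110 games), inflated by the easy schedule, while the "uniform" score averages the two pairwise rates (0.9 and 0.4) equally.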
2. Rank-confidence sets
For every pair, form a simultaneous interval on \(S_a - S_c\); the order is certain only when that
interval excludes zero. A model's best possible rank is one plus the number of models certainly above it; its worst is \(K\) minus the number
certainly below (rank_confidence_sets):
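The counting rule can be sketched as follows. This is an illustrative stand-in, not the repository's rank_confidence_sets: the function name, the argument layout, and the example numbers (including the critical value 2.4) are assumptions.

```python
def rank_confidence_sets_sketch(scores, se_diff, crit):
    """Return {model: (best_rank, worst_rank)}.

    scores:  {model: estimated score S}
    se_diff: {frozenset({a, c}): standard error of S_a - S_c}
    crit:    simultaneous critical value (e.g. Bonferroni or max-t)
    """
    models = list(scores)
    sets = {}
    for a in models:
        certainly_above = certainly_below = 0
        for c in models:
            if c == a:
                continue
            diff = scores[a] - scores[c]
            half_width = crit * se_diff[frozenset((a, c))]
            if diff + half_width < 0:
                certainly_above += 1  # whole interval below zero: c outranks a
            elif diff - half_width > 0:
                certainly_below += 1  # whole interval above zero: a outranks c
        sets[a] = (1 + certainly_above, len(models) - certainly_below)
    return sets


# Hypothetical numbers: A is clearly ahead, B and C cannot be separated.
toy_scores = {"A": 0.70, "B": 0.55, "C": 0.52}
toy_se = {frozenset(p): 0.03 for p in [("A", "B"), ("A", "C"), ("B", "C")]}
rank_sets = rank_confidence_sets_sketch(toy_scores, toy_se, crit=2.4)
print(rank_sets)  # {'A': (1, 1), 'B': (2, 3), 'C': (2, 3)}
```

B and C share the rank set (2, 3) because the interval on their difference, 0.03 ± 0.072, still contains zero.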
leaderboard.py:116). Clipping would invent false certainty and overstate
resolution, the opposite of the paper's claim. Small detail, big credibility.
3. Correlation-aware simultaneity (max-t)
Comparing all \(\binom{K}{2}\) differences is a multiplicity problem. A Bonferroni cutoff is valid but
conservative. Because the score differences share components (they're built from the same score vector),
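the max-t critical value simulates the true joint distribution of the largest studentized difference,
\(\max_{a<c} |t_{ac}|\), where \(t_{ac}\) is the estimation error of \(S_a - S_c\) divided by its standard error, and uses its upper quantile as a single cutoff for all pairs.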
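A small simulation shows how a max-t cutoff compares with Bonferroni. The sketch rests on a simplifying assumption that is not taken from the paper: the estimation errors of the individual scores are independent Gaussians with known standard errors. A real analysis would draw from the estimated joint covariance instead. All function names here are hypothetical.

```python
import random
from itertools import combinations
from statistics import NormalDist


def max_t_critical_value(se, alpha=0.05, n_sim=20000, seed=0):
    """Simulated (1 - alpha) quantile of max |t_ac| over all pairs a < c.

    se: per-model standard errors of the scores (assumed independent here).
    """
    rng = random.Random(seed)
    pairs = list(combinations(range(len(se)), 2))
    # standard error of each difference under the independence assumption
    sd = {(a, c): (se[a] ** 2 + se[c] ** 2) ** 0.5 for a, c in pairs}
    maxima = []
    for _ in range(n_sim):
        noise = [rng.gauss(0.0, s) for s in se]  # one draw of the score errors
        # every difference reuses the same draw, so the t-statistics are correlated
        maxima.append(max(abs(noise[a] - noise[c]) / sd[a, c] for a, c in pairs))
    maxima.sort()
    return maxima[min(n_sim - 1, int((1 - alpha) * n_sim))]


def bonferroni_critical_value(k, alpha=0.05):
    """Two-sided normal cutoff, Bonferroni-corrected over k-choose-2 pairs."""
    n_pairs = k * (k - 1) // 2
    return NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))


crit_maxt = max_t_critical_value([0.02] * 6)
crit_bonf = bonferroni_critical_value(6)
print(round(crit_maxt, 2), round(crit_bonf, 2))
```

Under these assumptions the simulated cutoff is smaller than the Bonferroni one, but only modestly, because the correlation among the differences is moderate.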
4. Transitivity diagnostic
Because the \(\psi_{ab}\) need not be transitive, test for Condorcet 3-cycles \(a \succ b \succ c \succ a\),
but report only those that survive uncertainty. The pipeline: condorcet_cycles finds raw cycles in the
count-thresholded tournament; pairwise_pi_table gives each edge a cluster-robust SE;
ci_supported_cycles keeps a cycle only if every directed edge clears \(\tfrac12\) with simultaneous
confidence (Bonferroni over the cycle's three edges); cycle_robustness reports the weakest-edge \(z\).
The finding: 305 raw 3-cycles (pairs with \(\geq 20\) comparisons), but 0 survive: the strongest cycle's weakest edge reaches only \(z = 1.51\), short of the \(\approx 2\) needed. Restricting to well-sampled pairs removes cycles even in the point estimates (\(\geq 200\) per pair ⇒ zero raw cycles). Raw cycle-counting overstates intransitivity; once uncertainty is accounted for, the data give no evidence against an effectively transitive tournament here.
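The "raw cycle, then weakest-edge check" logic can be sketched in a few lines. This is not the repository's condorcet_cycles or ci_supported_cycles pipeline. The function name, the data layout, the toy numbers, and the choice of a one-sided Bonferroni cutoff over three edges are all assumptions.

```python
from itertools import combinations
from statistics import NormalDist


def three_cycles_sketch(models, psi, se, alpha=0.05):
    """List raw 3-cycles with their weakest-edge z and a 'supported' flag.

    psi[(a, b)]: estimated P(a beats b), stored for one direction per pair.
    se[(a, b)]:  standard error of that estimate.
    """
    # Assumption: one-sided test per edge, Bonferroni over the 3 edges.
    z_crit = NormalDist().inv_cdf(1 - alpha / 3)

    def p(a, b):
        return psi[(a, b)] if (a, b) in psi else 1.0 - psi[(b, a)]

    def s(a, b):
        return se[(a, b)] if (a, b) in se else se[(b, a)]

    found = []
    for a, b, c in combinations(models, 3):
        for x, y, w in ((a, b, c), (a, c, b)):  # the two possible orientations
            edges = [(x, y), (y, w), (w, x)]
            if all(p(u, v) > 0.5 for u, v in edges):  # raw cycle x > y > w > x
                weakest_z = min((p(u, v) - 0.5) / s(u, v) for u, v in edges)
                found.append(((x, y, w), weakest_z, weakest_z > z_crit))
    return found


# Hypothetical estimates: a raw cycle A > B > C > A whose weakest edge is C > A.
toy_psi = {("A", "B"): 0.56, ("B", "C"): 0.55, ("C", "A"): 0.53}
toy_se = {pair: 0.02 for pair in toy_psi}
cycles = three_cycles_sketch(["A", "B", "C"], toy_psi, toy_se)
print(cycles)
```

The toy cycle exists in the point estimates, but its weakest edge has \(z = 1.5\), below the roughly 2.13 cutoff used in this sketch, so it is not counted as supported.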
5. Bradley–Terry is a projection, not an oracle
BT fits a latent score with \(P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})\). The paper's framing: BT is a design-weighted logistic projection of the PI matrix onto the additive-log-odds subspace. Its score is therefore a best-fitting summary whose target depends on the analysis weights. When the matrix is materially intransitive, BT doesn't estimate the win-rates; it approximates them under a transitivity assumption it cannot test. The reporting rule: if the matrix is materially intransitive, report the matrix, the chosen aggregation with its uncertainty, and the diagnostics, and say plainly that no assumption-free scalar leaderboard exists. (Here it is effectively transitive, so a scalar board is defensible, with rank sets rather than point ranks.)
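To see the projection at work, the sketch below fits BT by the classical minorize-maximize (Zermelo) iteration and applies it to a deliberately cyclic toy tournament. It is a generic maximum-likelihood fit, not the paper's code. The function names and the toy counts are assumptions, and the fit weights each pair by its game count, which is one particular choice of design weights.

```python
import math


def fit_bradley_terry_sketch(wins, n_iter=500):
    """Return {model: theta} from wins[(i, j)] = number of times i beat j.

    Assumes every model has at least one win and one game, so the
    update never divides by zero or takes log(0).
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        # rescale so the strengths average to 1 (fixes the arbitrary scale)
        strength = {m: v * len(models) / norm for m, v in updated.items()}
    return {m: math.log(v) for m, v in strength.items()}


def bt_probability(theta, i, j):
    """P(i beats j) under the fitted BT model."""
    return 1.0 / (1.0 + math.exp(theta[j] - theta[i]))


# Rock-paper-scissors style data: each model beats the next one 60 to 40.
cyclic_wins = {
    ("A", "B"): 60, ("B", "A"): 40,
    ("B", "C"): 60, ("C", "B"): 40,
    ("C", "A"): 60, ("A", "C"): 40,
}
theta = fit_bradley_terry_sketch(cyclic_wins)
print(bt_probability(theta, "A", "B"))  # 0.5
```

Every observed pairwise rate is 0.6, yet the fitted BT probabilities are all 0.5: the cyclic structure lies outside the additive-log-odds subspace, so the projection removes it.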
1. Why prefer the "uniform" opponent mixture over "as_sampled"?
"as_sampled" recovers the raw marginal win-rate and conflates skill with schedule.
2. A model's order relative to another is "certain" exactly when?
3. Max-t barely beats Bonferroni on Arena. What does that tell you?
4. A Condorcet cycle is counted as statistically real only when…
5. In the paper's framing, the Bradley–Terry score is best understood as…
Read next (primary source)
Read Ameli et al. (2025), "A Statistical Framework for Ranking
LLM-based Chatbots," ICLR [Ameli2025-fg] as the BT/Davidson/Rao–Kupper foil. In-repo, read
draft_sections.tex §ranking and run scripts/arena_rank_sets.py and
scripts/cycle_sweep.py to reproduce the rank sets and the cycle sweep. (Note: the LLM-judge
non-transitivity citation is still a \red{} bib gap.)