Win-rate inference · Lesson 0005

From the matrix to a ranking

A PI matrix is not a leaderboard. Report what the data resolve: rank-confidence sets, correlation-aware simultaneous intervals, and a transitivity diagnostic, and recognize Bradley–Terry as a projection.

Why this lesson. This is Contribution 4 and the most quotable result: on Arena, 0 of 52 models get a unique certified rank, and 305 apparent cycles all evaporate under uncertainty. Both are easy to misread, so you need to defend exactly what they do and don't claim.

1. A ranking is an added assumption

The PI matrix is a set of pairwise numbers. Collapsing it to a scalar requires either an opponent-mixture aggregation or a transitivity assumption — a separate step with its own commitments. The aggregation is

\[ S_a(w) = \sum_b w_b\,\psi_{ab}, \qquad \text{rank} = \text{a function of the differences } S_a - S_c. \]

The mixture \(w\) is a choice and it matters (leaderboard_scores): "uniform" weights every opponent faced equally — a Borda-style score that removes the schedule confounding of adaptive matchmaking; "as_sampled" weights opponents by how often they were actually played, recovering the raw marginal win-rate, which conflates skill with who you were matched against.
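
To make the difference between the two mixtures concrete, here is a minimal sketch of the aggregation. It is not the repo's leaderboard_scores; the function name `mixture_scores`, the per-model normalization of the weights over opponents actually faced, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def mixture_scores(psi, counts, mixture="uniform"):
    """Hypothetical helper: S_a = sum_b w_ab * psi[a, b] over opponents a faced.

    psi[a, b]    : estimated probability that a beats b (diagonal ignored)
    counts[a, b] : number of comparisons between a and b
    Assumes every model faced at least one opponent.
    """
    psi = np.asarray(psi, dtype=float)
    counts = np.asarray(counts, dtype=float)
    faced = counts > 0
    np.fill_diagonal(faced, False)
    if mixture == "uniform":
        w = faced.astype(float)           # every opponent faced counts equally
    elif mixture == "as_sampled":
        w = np.where(faced, counts, 0.0)  # opponents weighted by games played
    else:
        raise ValueError(f"unknown mixture: {mixture!r}")
    w = w / w.sum(axis=1, keepdims=True)  # weights sum to 1 for each model
    return (w * psi).sum(axis=1)

# Toy data: model 0 was matched mostly against the weak model 2.
psi = np.array([[0.5, 0.6, 0.9],
                [0.4, 0.5, 0.7],
                [0.1, 0.3, 0.5]])
counts = np.array([[0, 10, 90],
                   [10, 0, 10],
                   [90, 10, 0]])
uniform = mixture_scores(psi, counts, "uniform")        # [0.75, 0.55, 0.20]
as_sampled = mixture_scores(psi, counts, "as_sampled")  # [0.87, 0.55, 0.12]
```

In this toy example model 0's score moves from 0.75 to 0.87 with no change in any pairwise win-rate; the only thing that changed is how much weight its easy matchup receives.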

2. Rank-confidence sets

For every pair, form a simultaneous interval on \(S_a - S_c\); the order is certain only when that interval excludes zero. A model's best possible rank is one plus the number of models certainly above it; its worst possible rank is \(K\) minus the number certainly below it (rank_confidence_sets):

\[ \text{rank\_best}_a = 1 + \#\{c : S_c - S_a \text{ certainly} > 0\}, \quad \text{rank\_worst}_a = K - \#\{c : S_a - S_c \text{ certainly} > 0\}. \]

Refuse to manufacture certainty. When a difference's variance isn't usably positive (NaN scores, or a finite-sample-indefinite robust covariance), the code skips the pair rather than clipping the variance to zero (leaderboard.py:116). Clipping would invent false certainty and overstate resolution, the opposite of the paper's claim. Small detail, big credibility.
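
The two counting formulas and the skip rule can be sketched as follows. This is an illustration under stated assumptions, not the code in leaderboard.py: the name `rank_sets`, the inputs (a score vector, its covariance matrix, and a simultaneous critical value `crit`), and the toy numbers are hypothetical.

```python
import numpy as np

def rank_sets(scores, cov, crit):
    """Hypothetical sketch: return (rank_best, rank_worst) for each model.

    An order between a and c is certain only when the interval
    (S_a - S_c) +/- crit * se excludes zero. Pairs whose difference or
    variance is not usable are skipped, never clipped to zero variance.
    """
    scores = np.asarray(scores, dtype=float)
    cov = np.asarray(cov, dtype=float)
    K = len(scores)
    above = np.zeros(K, dtype=int)  # number of models certainly above each model
    below = np.zeros(K, dtype=int)  # number of models certainly below each model
    for a in range(K):
        for c in range(a + 1, K):
            diff = scores[a] - scores[c]
            var = cov[a, a] + cov[c, c] - 2.0 * cov[a, c]
            if not (np.isfinite(diff) and np.isfinite(var) and var > 0):
                continue  # unresolved: skipping keeps the rank set wide
            half = crit * np.sqrt(var)
            if diff - half > 0:      # a certainly above c
                above[c] += 1
                below[a] += 1
            elif diff + half < 0:    # c certainly above a
                above[a] += 1
                below[c] += 1
    return 1 + above, K - below

# Toy data: one clear leader, two models the data cannot separate.
scores = np.array([0.80, 0.55, 0.50])
cov = 0.0025 * np.eye(3)            # standard error 0.05 per score
rank_best, rank_worst = rank_sets(scores, cov, crit=2.0)
# rank_best  -> [1, 2, 2]
# rank_worst -> [1, 3, 3]
```

Only the first model gets a unique certified rank here. The other two differ in point estimate, but their interval on the difference covers zero, so each keeps the rank set {2, 3}.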

3. Correlation-aware simultaneity (max-t)

Comparing all \(\binom{K}{2}\) differences is a multiplicity problem. A Bonferroni cutoff is valid but conservative. Because the score differences share components (they're built from the same score vector), the max-t critical value is obtained by simulating the joint distribution of \(\max_{a<c} |t_{ac}|\), where \(t_{ac}\) is the studentized difference \(S_a - S_c\) (variance.maxt_critical_value). It is less conservative than the union bound.
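
One way to picture the max-t cutoff is as a Monte Carlo quantile. The sketch below assumes the score vector is approximately Gaussian with a known, finite, positive semi-definite covariance. The names `simulate_maxt_crit` and `bonferroni_crit`, the number of draws, and the toy covariance are illustrative assumptions, not the repo's implementation.

```python
import numpy as np
from statistics import NormalDist

def simulate_maxt_crit(cov, alpha=0.05, n_draws=50_000, seed=0):
    """Hypothetical sketch: (1 - alpha) quantile of max_{a<c} |t_ac|.

    t_ac is the studentized difference of two scores, simulated from a
    mean-zero Gaussian with the scores' covariance.
    """
    cov = np.asarray(cov, dtype=float)
    K = cov.shape[0]
    a_idx, c_idx = np.triu_indices(K, k=1)          # all pairs a < c
    var = cov[a_idx, a_idx] + cov[c_idx, c_idx] - 2.0 * cov[a_idx, c_idx]
    keep = var > 0                                   # skip unusable pairs
    a_idx, c_idx, sd = a_idx[keep], c_idx[keep], np.sqrt(var[keep])
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(np.zeros(K), cov, size=n_draws)
    t = np.abs(draws[:, a_idx] - draws[:, c_idx]) / sd
    return float(np.quantile(t.max(axis=1), 1.0 - alpha))

def bonferroni_crit(n_pairs, alpha=0.05):
    """Two-sided union-bound cutoff for n_pairs comparisons."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_pairs))

cov = 0.0025 * np.eye(4)          # toy case: 4 models, independent scores
crit_maxt = simulate_maxt_crit(cov)
crit_bonf = bonferroni_crit(n_pairs=6)
```

Even with independent scores the six differences are correlated, because each score appears in three of them. That shared structure is what lets the simulated cutoff fall below the Bonferroni one.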

Read the negative result correctly. On Arena, max-t barely tightens the union bound (mean rank-interval width \(15.5 \to 15.2\)). That is the point: the unresolved middle of the board is genuine sampling uncertainty, not an artifact of conservative correction. So "0 of 52 uniquely ranked" is a statement about the data, not about your method being too cautious. The published total order asserts distinctions the data do not support.

4. Transitivity diagnostic

Because the \(\psi_{ab}\) need not be transitive, test for Condorcet 3-cycles \(a \succ b \succ c \succ a\), but report only those that survive uncertainty. The pipeline: condorcet_cycles finds raw cycles in the count-thresholded tournament; pairwise_pi_table gives each edge a cluster-robust SE; ci_supported_cycles keeps a cycle only if every directed edge clears \(\tfrac12\) with simultaneous confidence (Bonferroni over the cycle's three edges); cycle_robustness reports the weakest-edge \(z\).

\[ \text{cycle significant} \iff \min_{\text{edges}} \frac{\hat\psi_{\text{edge}} - \tfrac12}{\mathrm{se}_{\text{edge}}} \ \gtrsim\ 2. \]

The finding: 305 raw 3-cycles (pairs with \(\geq 20\) comparisons), but 0 survive — the strongest cycle's weakest edge reaches only \(z = 1.51\), short of \(\approx 2\). Restricting to well-sampled pairs removes cycles even in the point estimates (\(\geq 200\) per pair ⇒ zero raw cycles). Raw cycle-counting overstates intransitivity; proper uncertainty quantification reveals the tournament is effectively transitive here.
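
The pipeline's logic (find raw cycles, then keep only those whose weakest edge clears the cutoff) can be sketched on a toy rock-paper-scissors matrix. The function names below are hypothetical stand-ins and do not reproduce condorcet_cycles, ci_supported_cycles, or cycle_robustness. The cutoff `z_crit=2.0` mirrors the approximate threshold quoted above, and the standard errors are made up rather than cluster-robust estimates.

```python
import numpy as np
from itertools import combinations

def raw_three_cycles(psi, counts, min_count=20):
    """Directed 3-cycles among pairs with at least min_count comparisons."""
    beats = (psi > 0.5) & (counts >= min_count)
    cycles = []
    for a, b, c in combinations(range(psi.shape[0]), 3):
        if beats[a, b] and beats[b, c] and beats[c, a]:
            cycles.append((a, b, c))
        elif beats[a, c] and beats[c, b] and beats[b, a]:
            cycles.append((a, c, b))
    return cycles

def weakest_edge_z(cycle, psi, se):
    """Smallest z-score of (psi - 1/2) over the cycle's three directed edges."""
    a, b, c = cycle
    return min((psi[i, j] - 0.5) / se[i, j] for i, j in [(a, b), (b, c), (c, a)])

def supported_cycles(cycles, psi, se, z_crit=2.0):
    """Keep a cycle only if every edge, hence the weakest, clears z_crit."""
    return [cyc for cyc in cycles if weakest_edge_z(cyc, psi, se) > z_crit]

# Toy data: 0 beats 1, 1 beats 2, 2 beats 0 in the point estimates.
psi = np.array([[0.50, 0.60, 0.30],
                [0.40, 0.50, 0.55],
                [0.70, 0.45, 0.50]])
counts = np.full((3, 3), 50)
se = np.full((3, 3), 0.05)        # made-up standard errors

raw = raw_three_cycles(psi, counts)                 # [(0, 1, 2)]
weakest = weakest_edge_z(raw[0], psi, se)           # about 1.0, from edge 1 -> 2
kept = supported_cycles(raw, psi, se)               # []
well_sampled = raw_three_cycles(psi, counts, 200)   # []
```

The cycle exists in the point estimates, but its 1 → 2 edge sits only one standard error above \(\tfrac12\), so it does not survive. Raising the count threshold removes it before any uncertainty is considered, the same two effects the Arena numbers show at scale.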

5. Bradley–Terry is a projection, not an oracle

BT fits a latent score with \(P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})\). The paper's framing: BT is a design-weighted logistic projection of the PI matrix onto the additive-log-odds subspace. So its score is a best-fitting summary whose target depends on the analysis weights — when the matrix is materially intransitive, BT doesn't estimate the win-rates, it approximates them under a transitivity assumption it cannot test. The reporting rule: if the matrix is materially intransitive, report the matrix, the chosen aggregation with its uncertainty, and the diagnostics — and say plainly no assumption-free scalar leaderboard exists. (Here it is effectively transitive, so a scalar board is defensible — with rank sets, not point ranks.)
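
A small example shows what "approximates, doesn't estimate" means in practice. The sketch below fits BT by the standard minorization-maximization (Zermelo) iteration for the count-weighted maximum-likelihood fit, which is one particular choice of analysis weights. The function names and the toy win counts are assumptions for illustration, and the fit assumes every model has at least one win.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Hypothetical sketch: BT log-strengths theta from a wins matrix.

    wins[i, j] = number of times i beat j. Uses the classical MM update
    p_i <- W_i / sum_j n_ij / (p_i + p_j), then renormalizes.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins + wins.T                    # comparisons per pair
    total_wins = wins.sum(axis=1)
    p = np.full(len(wins), 1.0 / len(wins))
    for _ in range(n_iter):
        denom = (n / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return np.log(p)

def bt_prob(theta, i, j):
    """P(i beats j) implied by the fitted BT scores."""
    return 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))

# Intransitive toy data: each model beats the next one 60 to 40.
cycle_wins = np.array([[0, 60, 40],
                       [40, 0, 60],
                       [60, 40, 0]])
theta_cycle = fit_bradley_terry(cycle_wins)
p01_cycle = bt_prob(theta_cycle, 0, 1)   # 0.5, although the data say 0.6

# Transitive toy data: two models, 80 wins to 20.
theta_pair = fit_bradley_terry(np.array([[0, 80], [20, 0]]))
p01_pair = bt_prob(theta_pair, 0, 1)     # 0.8, matching the data
```

On the cyclic data the fitted scores are all equal, so BT reports every matchup as a coin flip even though each observed win-rate is 0.6. On the two-model data, where an additive-log-odds fit is exact, the projection reproduces the win-rate.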

Retrieval check — answer from memory before moving on

1. Why prefer the "uniform" opponent mixture over "as_sampled"?

Why. Uniform weights every opponent faced equally (Borda-style), so a model isn't rewarded for who it happened to be matched against. "as_sampled" recovers the raw marginal win-rate and conflates skill with schedule.

2. A model's order relative to another is "certain" exactly when?

Why. Certainty requires the simultaneous interval on \(S_a - S_c\) to exclude zero. Differing point estimates alone don't certify an order — that's the whole rank-confidence-set idea.

3. Max-t barely beats Bonferroni on Arena. What does that tell you?

Why. If a tighter, correlation-aware cutoff barely changes the rank widths, the lack of resolution isn't from conservative correction — it's real sampling uncertainty. The data don't support the published distinctions.

4. A Condorcet cycle is counted as statistically real only when…

Why. All three directed edges must exceed \(\tfrac12\) simultaneously; the weakest edge is the bottleneck. The strongest Arena cycle's weakest edge reached only \(z=1.51\), so none survive.

5. In the paper's framing, the Bradley–Terry score is best understood as…

Why. BT is a design-weighted logistic projection onto the additive-log-odds subspace — a fitted summary that assumes transitivity, whose target depends on the analysis weights. Under intransitivity it approximates, it doesn't estimate.

Read next (primary source)

Read Ameli et al. (2025), "A Statistical Framework for Ranking LLM-based Chatbots," ICLR [Ameli2025-fg] as the BT/Davidson/Rao–Kupper foil. In-repo, read draft_sections.tex §ranking and run scripts/arena_rank_sets.py and scripts/cycle_sweep.py to reproduce the rank sets and the cycle sweep. (Note: the LLM-judge non-transitivity citation is still a \red{} bib gap.)