L0005 — Pairwise → leaderboard

Why this lesson. This is Contribution 4 and the most quotable result: on Arena, 0 of 52 models get a unique certified rank, and 305 apparent cycles all evaporate under uncertainty. Both are easy to misread, so you need to defend exactly what they do and don't claim.

1. A ranking is an added assumption

The PI matrix is a set of pairwise numbers. Collapsing it to a scalar requires either an opponent-mixture aggregation or a transitivity assumption — a separate step with its own commitments. The aggregation is

\[ S_a(w) = \sum_b w_b\,\psi_{ab}, \qquad \text{rank} = \text{a function of the differences } S_a - S_c. \]

The mixture \(w\) is a choice and it matters (leaderboard_scores): "uniform" weights every opponent faced equally — a Borda-style score that removes the schedule confounding of adaptive matchmaking; "as_sampled" weights opponents by how often they were actually played, recovering the raw marginal win-rate, which conflates skill with who you were matched against.

2. Rank-confidence sets

For every pair, form a simultaneous interval on \(S_a - S_c\); the order is certain only when that interval excludes zero. A model's best possible rank counts the models certainly above it, its worst counts those certainly below (rank_confidence_sets):

\[ \text{rank\_best}_a = 1 + \#\{c : S_c - S_a \text{ certainly} > 0\}, \quad \text{rank\_worst}_a = K - \#\{c : S_a - S_c \text{ certainly} > 0\}. \]

Refuse to manufacture certainty. When a difference's variance isn't usably positive (NaN scores, or a finite-sample-indefinite robust covariance), the code skips the pair rather than clipping the variance to zero (leaderboard.py:116). Clipping would invent false certainty and overstate resolution — the opposite of the paper's claim. Small detail, big credibility.

3. Correlation-aware simultaneity (max-t)

Comparing all \(\binom{K}{2}\) differences is a multiplicity problem. A Bonferroni cutoff is valid but conservative. Because the score differences share components (they're built from the same score vector), the max-t critical value simulates the true joint distribution \(\max_{avariance.maxt_critical_value) — less conservative than the union bound.

Read the negative result correctly. On Arena, max-t barely tightens the union bound (mean rank-interval width \(15.5 \to 15.2\)). That is the point: the unresolved middle of the board is genuine sampling uncertainty, not an artifact of conservative correction. So "0 of 52 uniquely ranked" is a statement about the data, not about your method being too cautious. The published total order asserts distinctions the data do not support.

4. Transitivity diagnostic

Because the \(\psi_{ab}\) need not be transitive, test for Condorcet 3-cycles \(a \succ b \succ c \succ a\), but report only those that survive uncertainty. The pipeline: condorcet_cycles finds raw cycles in the count-thresholded tournament; pairwise_pi_table gives each edge a cluster-robust SE; ci_supported_cycles keeps a cycle only if every directed edge clears \(\tfrac12\) with simultaneous confidence (Bonferroni over the cycle's three edges); cycle_robustness reports the weakest-edge \(z\).

\[ \text{cycle significant} \iff \min_{\text{edges}} \frac{\hat\psi_{\text{edge}} - \tfrac12}{\mathrm{se}_{\text{edge}}} \ \gtrsim\ 2. \]

The finding: 305 raw 3-cycles (pairs with \(\geq 20\) comparisons), but 0 survive — the strongest cycle's weakest edge reaches only \(z = 1.51\), short of \(\approx 2\). Restricting to well-sampled pairs removes cycles even in the point estimates (\(\geq 200\) per pair ⇒ zero raw cycles). Raw cycle-counting overstates intransitivity; proper uncertainty quantification reveals the tournament is effectively transitive here.

5. Bradley–Terry is a projection, not an oracle

BT fits a latent score with \(P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})\). The paper's framing: BT is a design-weighted logistic projection of the PI matrix onto the additive-log-odds subspace. So its score is a best-fitting summary whose target depends on the analysis weights — when the matrix is materially intransitive, BT doesn't estimate the win-rates, it approximates them under a transitivity assumption it cannot test. The reporting rule: if the matrix is materially intransitive, report the matrix, the chosen aggregation with its uncertainty, and the diagnostics — and say plainly no assumption-free scalar leaderboard exists. (Here it is effectively transitive, so a scalar board is defensible — with rank sets, not point ranks.)

Retrieval check — answer from memory before moving on

1. Why prefer the "uniform" opponent mixture over "as_sampled"?

Why. Uniform weights every opponent faced equally (Borda-style), so a model isn't rewarded for who it happened to be matched against. "as_sampled" recovers the raw marginal win-rate and conflates skill with schedule.

2. A model's order relative to another is "certain" exactly when?

Why. Certainty requires the simultaneous interval on \(S_a - S_c\) to exclude zero. Differing point estimates alone don't certify an order — that's the whole rank-confidence-set idea.

3. Max-t barely beats Bonferroni on Arena. What does that tell you?

Why. If a tighter, correlation-aware cutoff barely changes the rank widths, the lack of resolution isn't from conservative correction — it's real sampling uncertainty. The data don't support the published distinctions.

4. A Condorcet cycle is counted as statistically real only when…

Why. All three directed edges must exceed \(\tfrac12\) simultaneously; the weakest edge is the bottleneck. The strongest Arena cycle's weakest edge reached only \(z=1.51\), so none survive.

5. In the paper's framing, the Bradley–Terry score is best understood as…

Why. BT is a design-weighted logistic projection onto the additive-log-odds subspace — a fitted summary that assumes transitivity, whose target depends on the analysis weights. Under intransitivity it approximates, it doesn't estimate.

From the matrix to a rankingA PI matrix is not a leaderboard. Report what the data resolve: rank-confidence sets, correlation-aware simultaneous intervals, and a transitivity diagnostic — and recognize Bradley–Terry as a projection.

1. A ranking is an added assumption

2. Rank-confidence sets

3. Correlation-aware simultaneity (max-t)

4. Transitivity diagnostic

5. Bradley–Terry is a projection, not an oracle

Read next (primary source)