Win-rate inference · Reference

Glossary

The canonical nomenclature. Every lesson uses these symbols and definitions.

Compressed definitions for quick reference. Grouped by where they appear in the method. Symbols match winrate/ and draft_sections.tex.

The estimand

Probabilistic Index (PI)
\(\psi_{ab} = P(a \succ b) + \tfrac12 P(a = b) = p_1 + \tfrac12 p_0\) — the chance model a's answer is preferred to b's, with ties split. The paper's primitive estimand. In code: pi.pi_matrix returns psi[i,j].
Half-tie score / half-win kernel
\(h = I(a \succ b) + \tfrac12 I(a = b)\) — the comparison rule. Its realized value for an individual comparison \(r\) is written \(s_r \in \{0, \tfrac12, 1\}\) (1 = \(a\) won, ½ = tie, 0 = \(a\) lost). So \(h\) is the kernel and \(s_r\) is the number it produces for one row; the mean of \(s_r\) over comparisons estimates \(\psi\). In code: the score column (schema).
Comparison index & counts
\(r\) indexes a single comparison (one schema row). \(n_{ab}\) = number of comparisons of pair \((a,b)\), so the sample estimate of \(\psi_{ab}\) is \(\hat\psi_{ab} = \tfrac{1}{n_{ab}}\sum_{r\in(a,b)} s_r\); \(N_a = \sum_b n_{ab}\) = total comparisons involving model \(a\). In code: pair (the count matrix) and its row sums.
Trinomial outcome
\(p_1, p_0, p_{-1}\) — probabilities that a wins, ties, loses (\(p_1+p_0+p_{-1}=1\)).
Win odds
\(\psi / (1 - \psi)\) — one-to-one transform of \(\psi\). pi.win_odds.
Net benefit (= NPS)
\(2\psi - 1 = p_1 - p_{-1}\) (wins minus losses, which is the NPS analogy); a one-to-one transform of \(\psi\). pi.net_benefit.
Win ratio
\(p_1 / p_{-1}\) — not a transform of \(\psi\) because it discards ties; deliberately absent from the code.
Mann–Whitney parameter / win proportion
The same object as \(\psi\) under different names (nonparametric stats / clinical trials). The identity that lets the paper import existing inference machinery.
U-statistic
An average of a kernel \(h\) over pairs of observations. The win-rate is a (Mann–Whitney) U-statistic, which is why its variance has U-statistic structure rather than i.i.d.-mean structure.
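
The estimand definitions above can be checked numerically. The following is a minimal sketch, not the code in winrate/: the function name pi_summaries and the toy score lists are assumptions made for illustration. It computes \(\hat\psi\) from half-tie scores, applies the win-odds and net-benefit transforms, and shows two samples with the same \(\hat\psi\) but different win ratios, which is why the win ratio is not a transform of \(\psi\).

```python
import numpy as np

def pi_summaries(scores):
    """Summaries for one ordered pair (a, b) from half-tie scores in {0, 0.5, 1}.

    Hypothetical helper for illustration; not part of the winrate package.
    """
    s = np.asarray(scores, dtype=float)
    p_win = np.mean(s == 1.0)    # estimate of p_1
    p_tie = np.mean(s == 0.5)    # estimate of p_0
    p_loss = np.mean(s == 0.0)   # estimate of p_{-1}
    psi = p_win + 0.5 * p_tie    # same value as s.mean()
    # Note: psi == 1 or p_loss == 0 would divide by zero; not handled here.
    return {
        "psi": float(psi),
        "win_odds": float(psi / (1.0 - psi)),
        "net_benefit": float(2.0 * psi - 1.0),
        "win_ratio": float(p_win / p_loss),
    }

# Toy data: 4 wins, 2 ties, 2 losses versus 5 wins, 0 ties, 3 losses.
with_ties = pi_summaries([1, 1, 1, 1, 0.5, 0.5, 0, 0])
no_ties = pi_summaries([1, 1, 1, 1, 1, 0, 0, 0])

print(with_ties)
print(no_ties)
```

Both samples give \(\hat\psi = 0.625\), so win odds and net benefit agree, while the win ratio is 2 in the first sample and 5/3 in the second.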

Inference under dependence

Crossed dependence
Comparisons are correlated along several non-nested axes at once: same prompt, same judge, same reused response, same session. "Crossed," not "nested."
Influence (per-row)
Each comparison's first-order contribution to the estimate. Linearizes the estimator so any clustering can be applied to one influence matrix. In code: the influence array \((n, K)\) from leaderboard_scores.
Hájek estimator
A ratio (self-normalized) estimator — the win-rate is a ratio of weighted sums, linearized by treating weights and counts as fixed (winrate/README.md, "known first-pass caveats").
Cluster-robust (sandwich) variance
Variance built from sums of within-cluster influence totals (outer products), instead of summing per-row outer products. Captures within-cluster correlation.
Multiway clustering (Cameron–Gelbach–Miller)
Combine several clustering dimensions by inclusion–exclusion: add each dimension's cluster sandwich, subtract their intersections, etc. In code: variance.multiway_cluster_cov.
Naive / i.i.d. variance
The special case where every comparison is its own cluster (sum of per-row outer products). In code: variance.naive_cov. Confidence intervals built from it undercover when within-cluster correlation is positive.
ICC (intra-cluster correlation)
How strongly two observations sharing a cluster move together. variance.oneway_icc. Small pair-level ICC can still produce large SE inflation (see "effective sample size").
Effective sample size
The count that actually bounds precision. Here \(\approx\) number of prompts, not number of comparisons, because one response is reused against every opponent.
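
The variance entries above fit together as follows. This is a sketch under stated assumptions, not the implementation behind variance.multiway_cluster_cov: the function names, the two-dimension restriction, and the toy data are hypothetical. It builds a one-way cluster sandwich from within-cluster influence totals, recovers the naive variance as the one-row-per-cluster case, combines two dimensions by inclusion-exclusion, and uses the standard equal-cluster-size design effect \(1 + (m-1)\rho\) to show how a small ICC can inflate the variance.

```python
import numpy as np

def cluster_cov(infl, cluster_ids):
    """One-way cluster sandwich: outer products of within-cluster influence totals."""
    infl = np.asarray(infl, dtype=float)
    if infl.ndim == 1:
        infl = infl[:, None]                      # treat as (n, 1)
    _, codes = np.unique(np.asarray(cluster_ids), return_inverse=True)
    codes = codes.ravel()
    totals = np.zeros((codes.max() + 1, infl.shape[1]))
    np.add.at(totals, codes, infl)                # sum influence rows per cluster
    return totals.T @ totals

def naive_cov(infl):
    """Special case: every row is its own cluster."""
    return cluster_cov(infl, np.arange(len(infl)))

def twoway_cluster_cov(infl, ids_a, ids_b):
    """Inclusion-exclusion over two dimensions: A + B - (A intersect B).

    The result is not guaranteed to be positive semidefinite in finite samples.
    """
    _, code_a = np.unique(np.asarray(ids_a), return_inverse=True)
    _, code_b = np.unique(np.asarray(ids_b), return_inverse=True)
    both = code_a.ravel() * (code_b.max() + 1) + code_b.ravel()
    return cluster_cov(infl, ids_a) + cluster_cov(infl, ids_b) - cluster_cov(infl, both)

def design_effect(cluster_size, icc):
    """Variance inflation for equal-size clusters: 1 + (m - 1) * icc."""
    return 1.0 + (cluster_size - 1.0) * icc

# Toy data (assumed): scores that agree within prompt, judged by two judges.
scores = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0], dtype=float)
prompt = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
judge = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Per-row influence of a simple mean: (s_r - mean) / n.
infl = (scores - scores.mean()) / len(scores)

var_naive = naive_cov(infl)[0, 0]
var_prompt = cluster_cov(infl, prompt)[0, 0]
var_twoway = twoway_cluster_cov(infl, prompt, judge)[0, 0]
print(var_naive, var_prompt, var_twoway)
print(design_effect(50, 0.05))
```

In this toy sample the prompt-clustered variance exceeds the naive one because scores move together within a prompt. With 50 rows per cluster, an ICC of 0.05 already gives a design effect of about 3.45.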

Covariate adjustment

PIM (Probabilistic Index Model)
Semiparametric regression \(g\big(\mathrm{PI}(X_i,X_j)\big) = \boldsymbol{\beta}^\top(X_i - X_j)\) (Thas 2012). The "difference-form" model.
Prompt-cancellation
Because both responses answer the same prompt, \(W_i = W_j\) so \(W_i - W_j = 0\); the prompt-covariate column is identically zero and \(\beta_W\) is unidentified. The difference-form PIM is structurally blind to prompt mix. Diagnosed by covariate.prompt_covariate_cancellation.
Effect modification
The fix: let the prompt covariate enter as a model-by-prompt interaction (or stratify), so \(\psi\) varies across strata. covariate.stratified_pi.
Standardization / standardized marginal
\(\psi^\star_{ab} = \sum_h p^\star(h)\,\psi_{ab}^{(h)}\) — average stratum-specific PIs over a target prompt mixture \(p^\star\). covariate.standardized_leaderboard.
Noncollapsibility
The adjusted win odds is the standardized marginal contrast \(\psi^\star/(1-\psi^\star)\), not \(\exp(\beta_Z)\); the two differ even without confounding.
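
Prompt-cancellation and standardization can both be seen on a toy example. The sketch below is illustrative only: standardized_pi, the stratum labels, and the data are assumptions, not the signature of covariate.standardized_leaderboard. It first shows that the prompt column of a difference-form design is identically zero, then averages stratum-specific PIs over a target mixture \(p^\star\) and forms the adjusted win odds from the standardized marginal.

```python
import numpy as np

# Difference-form design: columns are [model covariate Z, prompt covariate W].
# Both responses answer the same prompt, so W is equal within each row pair.
x_a = np.array([[1.0, 3.0], [1.0, 7.0], [1.0, 2.0]])
x_b = np.array([[0.0, 3.0], [0.0, 7.0], [0.0, 2.0]])
diff = x_a - x_b
prompt_column_cancels = bool(np.all(diff[:, 1] == 0))   # beta_W is unidentified

def standardized_pi(scores, strata, target_mix):
    """Average stratum-specific PIs over a target prompt mixture.

    Hypothetical helper; target_mix maps stratum label -> weight.
    """
    scores = np.asarray(scores, dtype=float)
    strata = np.asarray(strata)
    total = sum(target_mix.values())              # normalize the weights
    psi_star = 0.0
    for stratum, weight in target_mix.items():
        in_stratum = strata == stratum
        if not in_stratum.any():
            # Positivity fails: this target cell has no comparisons.
            raise ValueError(f"no comparisons in stratum {stratum!r}")
        psi_star += (weight / total) * scores[in_stratum].mean()
    return float(psi_star)

# Toy data (assumed): model a is strong on "code" and weak on "chat".
scores = [1, 1, 1, 0, 0, 0.5, 0, 0.5, 0, 0.5]
strata = ["code"] * 4 + ["chat"] * 6

raw = float(np.mean(scores))                       # as-sampled prompt mix
balanced = standardized_pi(scores, strata, {"code": 0.5, "chat": 0.5})
adjusted_win_odds = balanced / (1.0 - balanced)    # from the standardized marginal
print(prompt_column_cancels, raw, balanced, adjusted_win_odds)
```

Here the raw win-rate is 0.45 under the sampled mix (40% code) but 0.5 under an even mix, so the ranking signal depends on the target prompt mixture even though the difference-form model cannot see it.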

From PI matrix to leaderboard

Leaderboard score
\(S_a(w) = \sum_b w_b\,\psi_{ab}\) — aggregate the PI matrix against an opponent mixture \(w\). leaderboard.leaderboard_scores.
Opponent mixture (uniform vs as_sampled)
"uniform": equal weight per opponent faced (Borda-style; removes the influence of the adaptive-matchmaking schedule). "as_sampled": weight by how often each opponent was played (the raw marginal win-rate; conflates skill with schedule).
Rank-confidence set
A model's [rank_best, rank_worst]: the range of ranks consistent with the data. It is placed ahead of another model only when their score-difference interval excludes zero. leaderboard.rank_confidence_sets.
max-t critical value
Correlation-aware simultaneous cutoff for all pairwise score differences (simulated from the score covariance). Less conservative than Bonferroni. variance.maxt_critical_value.
Condorcet cycle / intransitivity
A loop \(a \succ b \succ c \succ a\) in the PI tournament. leaderboard.condorcet_cycles; survival under uncertainty via ci_supported_cycles / cycle_robustness.
Bradley–Terry (BT)
A latent-score model \(P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})\). The paper frames it as a design-weighted logistic projection of the PI matrix onto the additive-log-odds subspace — a fitted summary that assumes transitivity, not a free estimate.
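
The aggregation step and the cycle check can be sketched on a three-model example. The names mixture_scores and three_cycles, and the toy matrices, are assumptions for illustration; they are not the signatures of leaderboard.leaderboard_scores or leaderboard.condorcet_cycles. The sketch computes \(S_a(w) = \sum_b w_b\,\psi_{ab}\) under both opponent mixtures and lists intransitive triples in the point-estimate tournament (no uncertainty is accounted for here).

```python
import itertools
import numpy as np

def mixture_scores(psi, counts, mixture="uniform"):
    """Score each model against an opponent mixture (hypothetical helper).

    psi[a, b] is the PI of a over b; counts[a, b] is n_ab with a zero diagonal.
    A model with no opponents at all would produce nan; not handled here.
    """
    psi = np.asarray(psi, dtype=float)
    counts = np.asarray(counts, dtype=float)
    faced = counts > 0
    if mixture == "uniform":
        weights = faced.astype(float)       # equal weight per opponent faced
    elif mixture == "as_sampled":
        weights = counts                    # weight by comparisons played
    else:
        raise ValueError(f"unknown mixture: {mixture!r}")
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights * np.where(faced, psi, 0.0)).sum(axis=1)

def three_cycles(psi):
    """Triples (x, y, z) with x beating y, y beating z, z beating x (PI > 0.5)."""
    beats = np.asarray(psi, dtype=float) > 0.5
    cycles = []
    for a, b, c in itertools.combinations(range(len(beats)), 3):
        if beats[a, b] and beats[b, c] and beats[c, a]:
            cycles.append((a, b, c))
        elif beats[a, c] and beats[c, b] and beats[b, a]:
            cycles.append((a, c, b))
    return cycles

# Toy rock-paper-scissors tournament with an uneven schedule (assumed data).
psi = np.array([[0.5, 0.6, 0.4],
                [0.4, 0.5, 0.6],
                [0.6, 0.4, 0.5]])
counts = np.array([[0, 100, 10],
                   [100, 0, 10],
                   [10, 10, 0]])

uniform = mixture_scores(psi, counts, "uniform")
as_sampled = mixture_scores(psi, counts, "as_sampled")
cycles = three_cycles(psi)
print(uniform, as_sampled, cycles)
```

Under the uniform mixture the three models tie at 0.5, while the as_sampled mixture separates models 0 and 1 only because they played each other far more often than they played model 2. A transitive summary such as Bradley–Terry cannot represent the cycle that three_cycles reports.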

Identification (what logs license)

Consistency
The observed outcome is the outcome that would occur under the realized comparison (no interference / well-defined potential outcomes).
Sequential ignorability
Adaptive matchmaking selects pairs only through logged history, not through unlogged difficulty. Fragile: a matchmaker that avoids "uninformative" pairs on hidden signals breaks it.
Positivity / overlap
Every comparison (and target-prompt cell) we want has nonzero probability of being sampled. Fragile: adaptive systems avoid uninformative pairs, and unmatched pairs are not identified at all.
Not identified from logs
PI for never-matched pairs; a judge-free "true" human preference; any capability-vs-judge-bias split (without calibration data). The paper states these as features, not apologies.
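
A basic positivity check is to list the pairs that the logs never compared, since their PI is not identified. The sketch below assumes a symmetric count matrix with a zero diagonal; the helper name never_matched_pairs and the toy counts are hypothetical.

```python
import numpy as np

def never_matched_pairs(counts):
    """Return unordered pairs (a, b), a < b, with zero logged comparisons."""
    counts = np.asarray(counts)
    rows, cols = np.triu_indices(counts.shape[0], k=1)   # upper triangle only
    return [(int(a), int(b)) for a, b in zip(rows, cols) if counts[a, b] == 0]

# Toy count matrix (assumed): models 0 and 2 were never matched.
counts = np.array([[0, 40, 0],
                   [40, 0, 25],
                   [0, 25, 0]])
missing = never_matched_pairs(counts)
print(missing)
```

Any leaderboard score whose opponent mixture puts weight on a pair in this list relies on modeling assumptions rather than on the logs.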