Glossary — Win-rate inference

Compressed definitions for quick reference. Grouped by where they appear in the method. Symbols match winrate/ and draft_sections.tex.

The estimand

Probabilistic Index (PI): \(\psi_{ab} = P(a \succ b) + \tfrac12 P(a = b) = p_1 + \tfrac12 p_0\) — the chance model a's answer is preferred to b's, with ties split. The paper's primitive estimand. In code: pi.pi_matrix returns psi[i,j].
Half-tie score / half-win kernel: \(h = I(a \succ b) + \tfrac12 I(a = b)\) — the comparison rule. Its realized value for an individual comparison \(r\) is written \(s_r \in \{0, \tfrac12, 1\}\) (1 = \(a\) won, ½ = tie, 0 = \(a\) lost). So \(h\) is the kernel and \(s_r\) is the number it produces for one row; the mean of \(s_r\) over comparisons estimates \(\psi\). In code: the score column (schema).
Comparison index & counts: \(r\) indexes a single comparison (one schema row). \(n_{ab}\) = number of comparisons of pair \((a,b)\), so \(\psi_{ab} = \tfrac{1}{n_{ab}}\sum_{r\in(a,b)} s_r\); \(N_a = \sum_b n_{ab}\) = total comparisons involving model \(a\). In code: pair (the count matrix) and its row sums.
Trinomial outcome: \(p_1, p_0, p_{-1}\) — probabilities that a wins, ties, loses (\(p_1+p_0+p_{-1}=1\)).
Win odds: \(\psi / (1 - \psi)\) — one-to-one transform of \(\psi\). pi.win_odds.
Net benefit (= NPS): \(2\psi - 1\) — one-to-one transform of \(\psi\). pi.net_benefit.
Win ratio: \(p_1 / p_{-1}\) — not a transform of \(\psi\) because it discards ties; deliberately absent from the code.
Mann–Whitney parameter / win proportion: The same object as \(\psi\) under different names (nonparametric stats / clinical trials). The identity that lets the paper import existing inference machinery.
U-statistic: An average of a kernel \(h\) over pairs of observations. The win-rate is a (Mann–Whitney) U-statistic, which is why its variance has U-statistic structure rather than i.i.d.-mean structure.

Inference under dependence

Crossed dependence: Comparisons are correlated along several non-nested axes at once: same prompt, same judge, same reused response, same session. "Crossed," not "nested."
Influence (per-row): Each comparison's first-order contribution to the estimate. Linearizes the estimator so any clustering can be applied to one influence matrix. In code: the influence array \((n, K)\) from leaderboard_scores.
Hájek estimator: A ratio (self-normalized) estimator — the win-rate is a ratio of weighted sums, linearized by treating weights and counts as fixed (winrate/README.md, "known first-pass caveats").
Cluster-robust (sandwich) variance: Variance built from sums of within-cluster influence totals (outer products), instead of summing per-row outer products. Captures within-cluster correlation.
Multiway clustering (Cameron–Gelbach–Miller): Combine several clustering dimensions by inclusion–exclusion: add each dimension's cluster sandwich, subtract their intersections, etc. In code: variance.multiway_cluster_cov.
Naive / i.i.d. variance: The special case where every comparison is its own cluster (sum of per-row outer products). variance.naive_cov. Undercovers under positive within-cluster correlation.
ICC (intra-cluster correlation): How strongly two observations sharing a cluster move together. variance.oneway_icc. Small pair-level ICC can still produce large SE inflation (see "effective sample size").
Effective sample size: The count that actually bounds precision. Here \(\approx\) number of prompts, not number of comparisons, because one response is reused against every opponent.

Covariate adjustment

PIM (Probabilistic Index Model): Semiparametric regression \(g\big(\mathrm{PI}(X_i,X_j)\big) = \boldsymbol{\beta}^\top(X_i - X_j)\) (Thas 2012). The "difference-form" model.
Prompt-cancellation: Because both responses answer the same prompt, \(W_i = W_j\) so \(W_i - W_j = 0\); the prompt-covariate column is identically zero and \(\beta_W\) is unidentified. The difference-form PIM is structurally blind to prompt mix. Diagnosed by covariate.prompt_covariate_cancellation.
Effect modification: The fix: let the prompt covariate enter as a model-by-prompt interaction (or stratify), so \(\psi\) varies across strata. covariate.stratified_pi.
Standardization / standardized marginal: \(\psi^\star_{ab} = \sum_h p^\star(h)\,\psi_{ab}^{(h)}\) — average stratum-specific PIs over a target prompt mixture \(p^\star\). covariate.standardized_leaderboard.
Noncollapsibility: The adjusted win odds is the standardized marginal contrast \(\psi^\star/(1-\psi^\star)\), not \(\exp(\beta_Z)\); the two differ even without confounding.

From PI matrix to leaderboard

Leaderboard score: \(S_a(w) = \sum_b w_b\,\psi_{ab}\) — aggregate the PI matrix against an opponent mixture \(w\). leaderboard.leaderboard_scores.
Opponent mixture (uniform vs as_sampled): "uniform": equal weight per opponent faced (Borda-style; removes adaptive-matchmaking schedule). "as_sampled": weight by how often each opponent was played (raw marginal win-rate; conflates skill with schedule).
Rank-confidence set: A model's [rank_best, rank_worst] — it is certainly ahead of another only when their score-difference interval excludes zero. leaderboard.rank_confidence_sets.
max-t critical value: Correlation-aware simultaneous cutoff for all pairwise score differences (simulated from the score covariance). Less conservative than Bonferroni. variance.maxt_critical_value.
Condorcet cycle / intransitivity: A loop \(a \succ b \succ c \succ a\) in the PI tournament. leaderboard.condorcet_cycles; survival under uncertainty via ci_supported_cycles / cycle_robustness.
Bradley–Terry (BT): A latent-score model \(P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})\). The paper frames it as a design-weighted logistic projection of the PI matrix onto the additive-log-odds subspace — a fitted summary that assumes transitivity, not a free estimate.

Identification (what logs license)

Consistency: The observed outcome is the outcome that would occur under the realized comparison (no interference / well-defined potential outcomes).
Sequential ignorability: Adaptive matchmaking selects pairs only through logged history, not through unlogged difficulty. Fragile: a matchmaker that avoids "uninformative" pairs on hidden signals breaks it.
Positivity / overlap: Every comparison (and target-prompt cell) we want has nonzero probability of being sampled. Fragile: adaptive systems avoid uninformative pairs, and unmatched pairs are not identified at all.
Not identified from logs: PI for never-matched pairs; a judge-free "true" human preference; any capability-vs-judge-bias split (without calibration data). The paper states these as features, not apologies.