Compressed definitions for quick reference. Grouped by where they appear in the method.
Symbols match winrate/ and draft_sections.tex.
The estimand
- Probabilistic Index (PI)
- \(\psi_{ab} = P(a \succ b) + \tfrac12 P(a = b) = p_1 + \tfrac12 p_0\) — the chance model a's answer is
preferred to b's, with ties split. The paper's primitive estimand. In code:
`pi.pi_matrix` returns `psi[i,j]`.
- Half-tie score / half-win kernel
- \(h = I(a \succ b) + \tfrac12 I(a = b)\) — the comparison rule. Its realized value for an individual comparison \(r\) is written \(s_r \in \{0, \tfrac12, 1\}\) (1 = \(a\) won, ½ = tie, 0 = \(a\) lost). So \(h\) is the kernel and \(s_r\) is the number it produces for one row; the mean of \(s_r\) over comparisons estimates \(\psi\). In code: the
`score` column (schema).
- Comparison index & counts
- \(r\) indexes a single comparison (one schema row). \(n_{ab}\) is the number of comparisons of pair \((a,b)\), so the estimate is \(\hat\psi_{ab} = \tfrac{1}{n_{ab}}\sum_{r\in(a,b)} s_r\); \(N_a = \sum_b n_{ab}\) is the total number of comparisons involving model \(a\). In code: `pair` (the count matrix) and its row sums.
- Trinomial outcome
- \(p_1, p_0, p_{-1}\) — probabilities that a wins, ties, loses (\(p_1+p_0+p_{-1}=1\)).
- Win odds
- \(\psi / (1 - \psi)\) — one-to-one transform of \(\psi\).
In code: `pi.win_odds`.
- Net benefit (= NPS)
- \(2\psi - 1\) — one-to-one transform of \(\psi\).
In code: `pi.net_benefit`.
- Win ratio
- \(p_1 / p_{-1}\) — not a transform of \(\psi\) because it discards ties; deliberately absent from the code.
- Mann–Whitney parameter / win proportion
- The same object as \(\psi\) under different names (nonparametric stats / clinical trials). The identity that lets the paper import existing inference machinery.
- U-statistic
- An average of a kernel \(h\) over pairs of observations. The win-rate is a (Mann–Whitney) U-statistic, which is why its variance has U-statistic structure rather than i.i.d.-mean structure.
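The estimand entries above can be checked on a toy example. The sketch below is a minimal illustration, not the `pi` module: `summarize_pair`, `make_scores`, and both score lists are hypothetical, and exact fractions are used so the identities hold without rounding. It shows that the estimate of \(\psi\) is the mean of the half-tie scores, that win odds and net benefit depend on it alone, and that the win ratio does not.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def summarize_pair(scores):
    """Summarize half-tie scores s_r in {0, 1/2, 1} for one ordered pair (a, b)."""
    n = len(scores)
    p_win = Fraction(sum(1 for s in scores if s == 1), n)
    p_tie = Fraction(sum(1 for s in scores if s == HALF), n)
    p_loss = Fraction(sum(1 for s in scores if s == 0), n)
    psi_hat = p_win + HALF * p_tie
    # Same number as the plain mean of the scores.
    assert psi_hat == sum(scores, Fraction(0)) / n
    return {
        "psi": psi_hat,
        "win_odds": psi_hat / (1 - psi_hat),
        "net_benefit": 2 * psi_hat - 1,
        "win_ratio": p_win / p_loss,  # raises ZeroDivisionError if a never loses
    }

def make_scores(wins, ties, losses):
    """Hypothetical helper: build a score list from outcome counts."""
    return [Fraction(1)] * wins + [HALF] * ties + [Fraction(0)] * losses

many_ties = summarize_pair(make_scores(4, 3, 3))
few_ties = summarize_pair(make_scores(5, 1, 4))

print(many_ties["psi"], few_ties["psi"])              # identical PI estimates
print(many_ties["win_ratio"], few_ties["win_ratio"])  # different win ratios
```

Both toy logs give the same PI estimate, hence the same win odds and net benefit, yet their win ratios differ because the tie counts differ.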
Inference under dependence
- Crossed dependence
- Comparisons are correlated along several non-nested axes at once: same prompt, same judge, same reused response, same session. "Crossed," not "nested."
- Influence (per-row)
- Each comparison's first-order contribution to the estimate. Linearizes the estimator so any clustering can be applied to one influence matrix. In code: the
`influence` array \((n, K)\) from `leaderboard_scores`.
- Hájek estimator
- A ratio (self-normalized) estimator. The win-rate is a ratio of weighted sums, linearized by treating weights and counts as fixed (see `winrate/README.md`, "known first-pass caveats").
- Cluster-robust (sandwich) variance
- Variance built from sums of within-cluster influence totals (outer products), instead of summing per-row outer products. Captures within-cluster correlation.
- Multiway clustering (Cameron–Gelbach–Miller)
- Combine several clustering dimensions by inclusion–exclusion: add each dimension's cluster sandwich, subtract their intersections, etc. In code:
`variance.multiway_cluster_cov`.
- Naive / i.i.d. variance
- The special case where every comparison is its own cluster (sum of per-row outer products).
In code: `variance.naive_cov`. Its confidence intervals undercover under positive within-cluster correlation.
- ICC (intra-cluster correlation)
- How strongly two observations sharing a cluster move together.
In code: `variance.oneway_icc`. A small pair-level ICC can still produce large SE inflation (see "effective sample size").
- Effective sample size
- The count that actually bounds precision. Here \(\approx\) number of prompts, not number of comparisons, because one response is reused against every opponent.
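The variance entries above can be illustrated for a scalar estimator (one pair's mean half-tie score). This is a sketch under simplifying assumptions, not the `variance` module: the helper names and the toy log are hypothetical, no finite-sample correction is applied, and the influence of a row on a plain mean is taken to be \((s_r - \bar s)/n\).

```python
from collections import defaultdict
from itertools import combinations

def influences(scores):
    """Per-row influence of each score on the sample mean: (s_r - mean) / n."""
    n = len(scores)
    mean = sum(scores) / n
    return [(s - mean) / n for s in scores]

def cluster_var(infl, labels):
    """Sum of squared within-cluster influence totals (scalar sandwich)."""
    totals = defaultdict(float)
    for x, g in zip(infl, labels):
        totals[g] += x
    return sum(t * t for t in totals.values())

def multiway_var(infl, dims):
    """Inclusion-exclusion over every non-empty subset of clustering dimensions."""
    total = 0.0
    for k in range(1, len(dims) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(dims, k):
            # Intersection clusters: rows sharing a label on every dimension in the subset.
            total += sign * cluster_var(infl, list(zip(*subset)))
    return total

# Toy log: (prompt, judge, half-tie score). Scores agree strongly within a prompt.
rows = [
    ("p1", "j1", 1.0), ("p1", "j2", 1.0),
    ("p2", "j1", 1.0), ("p2", "j2", 1.0),
    ("p3", "j1", 0.0), ("p3", "j2", 0.0),
    ("p4", "j1", 0.0), ("p4", "j2", 0.5),
]
prompts = [r[0] for r in rows]
judges = [r[1] for r in rows]
infl = influences([r[2] for r in rows])

v_naive = cluster_var(infl, list(range(len(rows))))  # every row is its own cluster
v_prompt = cluster_var(infl, prompts)
v_judge = cluster_var(infl, judges)
v_twoway = multiway_var(infl, [prompts, judges])

print(f"naive SE   {v_naive ** 0.5:.4f}")
print(f"prompt SE  {v_prompt ** 0.5:.4f}")
print(f"two-way SE {v_twoway ** 0.5:.4f}")
```

Because scores agree within prompts, the prompt-clustered variance exceeds the naive one. In this toy log every (prompt, judge) cell holds one row, so the two-way estimate reduces to prompt plus judge minus naive; with so few clusters the result is only illustrative, and inclusion-exclusion estimates are not guaranteed to be positive in general.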
Covariate adjustment
- PIM (Probabilistic Index Model)
- Semiparametric regression \(g\big(\mathrm{PI}(X_i,X_j)\big) = \boldsymbol{\beta}^\top(X_i - X_j)\) (Thas 2012). The "difference-form" model.
- Prompt-cancellation
- Because both responses answer the same prompt, \(W_i = W_j\) so \(W_i - W_j = 0\); the prompt-covariate column is identically zero and \(\beta_W\) is unidentified. The difference-form PIM is structurally blind to prompt mix. Diagnosed by
`covariate.prompt_covariate_cancellation`.
- Effect modification
- The fix: let the prompt covariate enter as a model-by-prompt interaction (or stratify), so \(\psi\) varies across strata.
In code: `covariate.stratified_pi`.
- Standardization / standardized marginal
- \(\psi^\star_{ab} = \sum_h p^\star(h)\,\psi_{ab}^{(h)}\) — average stratum-specific PIs over a target prompt mixture \(p^\star\).
In code: `covariate.standardized_leaderboard`.
- Noncollapsibility
- The adjusted win odds is the standardized marginal contrast \(\psi^\star/(1-\psi^\star)\), not \(\exp(\beta_Z)\); the two differ even without confounding.
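The covariate-adjustment entries above can be traced on a two-stratum toy log. The sketch below is an assumption-laden illustration rather than the `covariate` module: `pi_by_stratum`, `standardized_marginal`, the strata names, and all counts are hypothetical. It computes stratum-specific PIs, standardizes them to two different prompt mixtures, and shows the prompt covariate cancelling in the difference form.

```python
# Each comparison: (prompt stratum, half-tie score of model a against model b).
comparisons = (
    [("code", 1.0)] * 56 + [("code", 0.0)] * 24    # 80 rows, a wins 70% on code prompts
    + [("chat", 1.0)] * 8 + [("chat", 0.0)] * 12   # 20 rows, a wins 40% on chat prompts
)

def pi_by_stratum(rows):
    """Mean half-tie score within each prompt stratum, plus stratum counts."""
    sums, counts = {}, {}
    for stratum, score in rows:
        sums[stratum] = sums.get(stratum, 0.0) + score
        counts[stratum] = counts.get(stratum, 0) + 1
    return {h: sums[h] / counts[h] for h in sums}, counts

def standardized_marginal(psi_h, target_mix):
    """psi_star = sum over strata of p_star(h) * psi^(h)."""
    assert abs(sum(target_mix.values()) - 1.0) < 1e-12
    return sum(target_mix[h] * psi_h[h] for h in target_mix)

psi_h, n_h = pi_by_stratum(comparisons)
n = sum(n_h.values())

# Weighting strata by their logged frequency reproduces the pooled win-rate.
as_sampled = standardized_marginal(psi_h, {h: n_h[h] / n for h in n_h})
# A chosen target mixture gives a different marginal from the same strata.
balanced = standardized_marginal(psi_h, {"code": 0.5, "chat": 0.5})
adjusted_win_odds = balanced / (1 - balanced)

# Prompt-cancellation: both responses share the prompt, so W_i - W_j is always zero.
prompt_covariate = [1.0 if stratum == "code" else 0.0 for stratum, _ in comparisons]
difference_column = [w - w for w in prompt_covariate]

print(psi_h, as_sampled, balanced, adjusted_win_odds)
```

The same stratum-specific PIs yield different marginals under different target mixtures, which is why the target mixture has to be stated alongside any adjusted number.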
From PI matrix to leaderboard
- Leaderboard score
- \(S_a(w) = \sum_b w_b\,\psi_{ab}\) — aggregate the PI matrix against an opponent mixture \(w\).
In code: `leaderboard.leaderboard_scores`.
- Opponent mixture (uniform vs as_sampled)
- `"uniform"`: equal weight per opponent faced (Borda-style; removes the adaptive-matchmaking schedule). `"as_sampled"`: weight by how often each opponent was played (raw marginal win-rate; conflates skill with schedule).
- Rank-confidence set
- A model's `[rank_best, rank_worst]`: it is placed confidently ahead of another model only when their score-difference interval excludes zero. In code: `leaderboard.rank_confidence_sets`.
- max-t critical value
- Correlation-aware simultaneous cutoff for all pairwise score differences (simulated from the score covariance). Less conservative than Bonferroni.
In code: `variance.maxt_critical_value`.
- Condorcet cycle / intransitivity
- A loop \(a \succ b \succ c \succ a\) in the PI tournament.
In code: `leaderboard.condorcet_cycles`; survival under uncertainty via `ci_supported_cycles` / `cycle_robustness`.
- Bradley–Terry (BT)
- A latent-score model \(P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})\). The paper frames it as a design-weighted logistic projection of the PI matrix onto the additive-log-odds subspace — a fitted summary that assumes transitivity, not a free estimate.
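The leaderboard entries above can be sketched on a three-model PI matrix. This is a hypothetical illustration, not the `leaderboard` module: the matrix, the counts, and the helpers `score` and `three_cycles` are made up, and the `"as_sampled"` weights are assumed to be \(n_{ab}/N_a\). Only point estimates are used, so no uncertainty is attached to the cycle.

```python
from itertools import permutations

models = ["a", "b", "c"]
# psi[x][y]: estimated probability that x is preferred to y (ties split).
psi = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.4, "c": 0.6},
    "c": {"a": 0.6, "b": 0.4},
}
# n[x][y]: number of logged comparisons of the pair; a was mostly matched against b.
n = {
    "a": {"b": 90, "c": 10},
    "b": {"a": 90, "c": 50},
    "c": {"a": 10, "b": 50},
}

def score(model, mixture):
    """S_a(w) = sum_b w_b * psi_ab for one of the two opponent mixtures."""
    opponents = psi[model]
    if mixture == "uniform":
        weights = {b: 1 / len(opponents) for b in opponents}
    elif mixture == "as_sampled":
        total = sum(n[model].values())
        weights = {b: n[model][b] / total for b in opponents}
    else:
        raise ValueError(f"unknown mixture: {mixture}")
    return sum(weights[b] * opponents[b] for b in opponents)

def three_cycles():
    """Triples (x, y, z) with x > y > z > x in point estimates, each listed once."""
    found = []
    for x, y, z in permutations(models, 3):
        # Requiring x to be the smallest label keeps one rotation of each cycle.
        if x == min(x, y, z) and psi[x][y] > 0.5 and psi[y][z] > 0.5 and psi[z][x] > 0.5:
            found.append((x, y, z))
    return found

uniform_scores = {m: score(m, "uniform") for m in models}
sampled_scores = {m: score(m, "as_sampled") for m in models}
cycles = three_cycles()

print(uniform_scores)
print(sampled_scores)
print(cycles)
```

Under the uniform mixture the three models tie, as a perfectly symmetric cycle implies. Under the as-sampled mixture model a moves ahead only because it was mostly matched against the opponent it beats, which is the schedule effect the entry describes.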
Identification (what logs license)
- Consistency
- The observed outcome is the outcome that would occur under the realized comparison (no interference / well-defined potential outcomes).
- Sequential ignorability
- Adaptive matchmaking selects pairs only through logged history, not through unlogged difficulty. Fragile: a matchmaker that avoids "uninformative" pairs on hidden signals breaks it.
- Positivity / overlap
- Every comparison (and target-prompt cell) we want has nonzero probability of being sampled. Fragile: adaptive systems avoid uninformative pairs, and unmatched pairs are not identified at all.
- Not identified from logs
- PI for never-matched pairs; a judge-free "true" human preference; any capability-vs-judge-bias split (without calibration data). The paper states these as features, not apologies.
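The positivity entry above suggests a simple check before any PI is reported: list the pairs the log never matched or matched only a handful of times. The sketch below is a minimal illustration with hypothetical names (`unmatched_pairs`, `pair_counts`) and made-up counts; the threshold of 10 is an arbitrary assumption, not a recommendation from the paper.

```python
from itertools import combinations

models = ["a", "b", "c", "d"]
# Hypothetical counts of logged comparisons per unordered pair.
pair_counts = {("a", "b"): 120, ("a", "c"): 45, ("b", "c"): 3, ("a", "d"): 60}

def unmatched_pairs(models, counts, min_count=1):
    """Pairs with fewer than min_count logged comparisons, in either order."""
    missing = []
    for x, y in combinations(models, 2):
        if counts.get((x, y), 0) + counts.get((y, x), 0) < min_count:
            missing.append((x, y))
    return missing

never_matched = unmatched_pairs(models, pair_counts)            # PI not identified
thinly_matched = unmatched_pairs(models, pair_counts, min_count=10)

print(never_matched)
print(thinly_matched)
```

Pairs in the first list have no identified PI at all; pairs that appear only in the second are identified but estimated from very little data.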