Win-rate inference · Lesson 0004

Prompt-cancellation, and the fix

Why the difference-form PIM is structurally blind to prompt mix, how effect modification + standardization repairs it, and why the adjusted win odds is not \(\exp(\beta_Z)\).

Why this lesson. This is the paper's sharpest single contribution and the one most likely to be misunderstood by a reviewer who knows clinical PIMs. The result is a two-line proof you should be able to write from memory, and the fix carries a noncollapsibility subtlety that is a classic trap.

1. What you actually want

You would like to report a win-rate adjusted for prompt mix, so a leaderboard is not an artifact of which prompts happened to be asked. The obvious tool is a probabilistic index model (PIM) with a linear predictor in the covariate difference (main.tex §covariate_adjustment, Thas 2012):

\[ g\big(\mathrm{PI}(\mathbf X_i, \mathbf X_j)\big) = \boldsymbol\beta^\top(\mathbf X_i - \mathbf X_j), \qquad \mathbf X = (Z, \mathbf W) \]

with \(Z\) the model indicator and \(\mathbf W\) the prompt-level adjustment covariates (e.g. subject, language, code).

2. The failure, in two lines

Both responses answer the same prompt, so they share its features: \(\mathbf W_i = \mathbf W_j\). Therefore

\[ \mathbf W_i - \mathbf W_j = \mathbf 0 \quad\Longrightarrow\quad \boldsymbol\beta_W \text{ multiplies a column of zeros} \quad\Longrightarrow\quad \boldsymbol\beta_W \text{ is unidentified.} \]

The difference-form model is structurally blind to prompt mix — the dominant source of leaderboard sensitivity. This is prompt-cancellation. It does not arise in the two-arm clinical setting that motivated PIMs, because there the treatment and control patients have different baseline \(\mathbf W\); that's exactly why the imported machinery misses it. covariate.prompt_covariate_cancellation reports the diagnosis: model_side_difference_max_abs = 0 across every comparison, yet the covariate varies across prompts (so it could matter, if it were usable).

The crucial asymmetry. The model indicator \(Z\) does not cancel: the two responses come from different models \(a \neq b\), so \(Z_i - Z_j \neq 0\) and head-to-head model contrasts are still estimable. Only the prompt-level covariates cancel, so the difference form fails at precisely one job: adjusting for prompt content. That specificity is the whole point.
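The cancellation and the asymmetry can both be seen on a toy design. The sketch below is illustrative only: the comparisons, the single binary covariate, and the helper `design_row` are hypothetical and are not the repo's `covariate.prompt_covariate_cancellation`.

```python
# Hypothetical toy data: each comparison is two models answering ONE prompt.
# (model_i, model_j, prompt_is_code) where prompt_is_code plays the role of W.
comparisons = [
    ("A", "B", 1),
    ("A", "B", 0),
    ("B", "A", 1),
    ("A", "B", 0),
]

def design_row(model, prompt_is_code):
    """Return X = (Z, W): Z = 1 if the response is from model A, W = prompt covariate."""
    z = 1 if model == "A" else 0
    return (z, prompt_is_code)

# The difference-form predictor only ever sees X_i - X_j.
diffs = []
for model_i, model_j, w in comparisons:
    x_i = design_row(model_i, w)
    x_j = design_row(model_j, w)  # same prompt, hence the same W
    diffs.append((x_i[0] - x_j[0], x_i[1] - x_j[1]))

z_diffs = [d[0] for d in diffs]
w_diffs = [d[1] for d in diffs]
w_values = [w for _, _, w in comparisons]

print("Z differences:", z_diffs)            # nonzero: model contrast survives
print("W differences:", w_diffs)            # all zero: beta_W has no information
print("max |W_i - W_j|:", max(abs(d) for d in w_diffs))
print("W varies across prompts:", len(set(w_values)) > 1)
```

The covariate varies across prompts, yet its column in the difference design is identically zero, so no value of \(\boldsymbol\beta_W\) changes the fitted index.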

3. The fix: effect modification, then standardize

Let the prompt covariate enter as effect modification rather than a main effect — model-by-prompt interactions, or simply a separate index per stratum:

\[ \operatorname{logit}\mu_{ab}(w) = \alpha_{ab} + \boldsymbol\delta_{ab}^\top f(w), \qquad \psi_{ab}^\star = \sum_h p^\star(h)\,\psi_{ab}^{(h)} \]

First show the covariate does modify the index — covariate.stratified_pi returns the win-rate per model within each stratum; spread across columns is direct evidence of effect modification (the very thing the difference form cannot capture). Then standardize: average the stratum-specific indices over a target prompt mixture \(p^\star\) (covariate.standardized_leaderboard, with \(p^\star\) = uniform / observed / a supplied dict). The standardized score feeds straight into rank-confidence sets because its cluster-robust covariance reuses the same variance engine from L0003.

dataset · covariate | effect-modification evidence | standardization moves the board
Arena · category (5-level) | per-category win-rate spans up to 0.142 for one model | reorders 34 / 48 models (one by six ranks)
mmlu_pro · subject (7) | smaller but real spread | moves 4 / 19 models
gpqa · difficulty (3) | claude-3.5-sonnet: \(0.577\) undergrad → \(0.502\) postgrad | moves 7 / 18; that model drops two ranks

The methodological point is invariant to magnitude: the quantity that moves the ranking is precisely the one the difference-form model discards.
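A minimal sketch of the two steps in this section, stratify and then standardize, may help fix the idea. It assumes made-up battle records and hypothetical helper names (`stratified_win_rate`, `standardize`); it is not the implementation behind `covariate.stratified_pi` or `covariate.standardized_leaderboard`, and it ignores variance estimation entirely.

```python
from collections import defaultdict

# Hypothetical records for one model pair: (stratum, score for model A),
# with score 1.0 = win, 0.0 = loss.
battles = [
    ("code", 1.0), ("code", 1.0), ("code", 0.0), ("code", 1.0),
    ("chat", 0.0), ("chat", 1.0),
]

def stratified_win_rate(records):
    """Stratum-specific index psi^(h): mean score within each stratum."""
    totals, counts = defaultdict(float), defaultdict(int)
    for stratum, score in records:
        totals[stratum] += score
        counts[stratum] += 1
    return {h: totals[h] / counts[h] for h in counts}, dict(counts)

def standardize(psi_by_stratum, target_mix):
    """psi* = sum_h p*(h) psi^(h), with the target mix renormalized to sum to 1."""
    total = sum(target_mix.values())
    return sum(target_mix[h] / total * psi_by_stratum[h] for h in target_mix)

psi, counts = stratified_win_rate(battles)

# Three choices of target mixture p*: observed, uniform, and a supplied dict.
observed_mix = {h: float(n) for h, n in counts.items()}
uniform_mix = {h: 1.0 for h in psi}
custom_mix = {"code": 0.2, "chat": 0.8}

psi_observed = standardize(psi, observed_mix)
psi_uniform = standardize(psi, uniform_mix)
psi_custom = standardize(psi, custom_mix)

print(psi)
print(psi_observed, psi_uniform, psi_custom)
```

With the observed mix the standardized score reproduces the pooled win-rate; changing \(p^\star\) moves the score only because the stratum indices differ, which is exactly the effect modification the difference form cannot express.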

4. Noncollapsibility: the adjusted win odds is not \(\exp(\beta_Z)\)

The trap. main.tex currently states \(WO_{adj} = \exp(\hat\beta_Z)\). That is the conditional win odds. The win odds is noncollapsible: because the odds transform is nonlinear, averaging stratum probabilities and then forming the odds is not the same as exponentiating a coefficient. The adjusted win odds you should report is the standardized marginal contrast

\[ WO^\star = \frac{\psi^\star}{1 - \psi^\star}, \qquad WO^\star \neq \exp(\hat\beta_Z) \ \text{in general, even with no confounding.} \]

Net benefit \(2\psi^\star - 1\) is linear, so it collapses fine; the win odds does not. This is the Scheidegger / Cao caveat (draft_sections.tex §standardization), and it is a one-line correction to flag against the main draft.
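A small numeric check makes the noncollapsibility concrete. The stratum probabilities and the uniform mixture below are invented for illustration, and the "coefficient-scale" summary (exponentiating the mixture-weighted mean log-odds) is only a stand-in for a conditional quantity; it is an assumption, not the exact \(\exp(\hat\beta_Z)\) from main.tex.

```python
import math

# Hypothetical stratum-specific win probabilities and a uniform target mix.
psi_by_stratum = {"easy": 0.9, "hard": 0.5}
p_star = {"easy": 0.5, "hard": 0.5}

def odds(p):
    return p / (1.0 - p)

# Standardized marginal contrast: average probabilities first, then form the odds.
psi_star = sum(p_star[h] * psi_by_stratum[h] for h in p_star)
wo_star = odds(psi_star)                      # about 2.33

# Coefficient-scale stand-in: average on the logit scale, then exponentiate.
mean_log_odds = sum(p_star[h] * math.log(odds(psi_by_stratum[h])) for h in p_star)
wo_logit_scale = math.exp(mean_log_odds)      # about 3.0

# Net benefit is linear in psi, so both orders of averaging agree.
nb_star = 2.0 * psi_star - 1.0
nb_mean = sum(p_star[h] * (2.0 * psi_by_stratum[h] - 1.0) for h in p_star)

print(f"WO* = {wo_star:.3f}, logit-scale summary = {wo_logit_scale:.3f}")
print(f"net benefit: {nb_star:.3f} vs {nb_mean:.3f}")
```

Nothing here involves confounding: the gap comes purely from the nonlinearity of the odds transform, while the net benefit agrees under either order of averaging.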

Two honest limits to state alongside the fix. Overlap: standardization is only defined for models present in every stratum (the scripts enforce models_present_in_all_strata / a per-stratum min_count); absent that, a model's standardized score is NaN, not imputed. Adjustment ≠ free precision: the standardized estimand inherits its uncertainty from the same cluster-robust covariance — you buy interpretability, not narrower intervals.
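The overlap rule can be sketched as a guard in front of the weighted average. The function name `standardized_or_nan` and its signature are hypothetical; only the behavior (NaN instead of imputation when a stratum is empty or below a minimum count) is taken from the text, and the repo's actual checks may differ.

```python
import math

def standardized_or_nan(counts_by_stratum, psi_by_stratum, target_mix, min_count=1):
    """Return psi* over target_mix, or NaN if any target stratum lacks support.

    min_count is an assumed per-stratum threshold on the number of comparisons.
    """
    for h in target_mix:
        too_few = counts_by_stratum.get(h, 0) < min_count
        if too_few or h not in psi_by_stratum:
            return math.nan  # no imputation: the estimand is undefined here
    total = sum(target_mix.values())
    return sum(target_mix[h] / total * psi_by_stratum[h] for h in target_mix)

target = {"code": 0.5, "chat": 0.5}

# Model with support in both strata.
full = standardized_or_nan({"code": 40, "chat": 25}, {"code": 0.6, "chat": 0.4},
                           target, min_count=5)

# Model never observed on "chat" prompts.
missing = standardized_or_nan({"code": 40}, {"code": 0.6}, target, min_count=5)

print(full, missing)
```

Returning NaN keeps the leaderboard honest: a model without support in a target stratum simply has no standardized score under that \(p^\star\).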

Seam to Paper 2. Arena ships prompt covariates, so covariate-standardized win-rates are supported. But the matchmaking propensity (which pairs were shown) is unlogged, so matchmaking-standardized win-rates are not — that needs Paper 2's design or an added assumption. This is the open decision with Jean (outline_paper1.md, "Open decisions").

Retrieval check: answer from memory before moving on

1. In the difference-form PIM, why is \(\boldsymbol\beta_W\) unidentified for a prompt covariate?

Why. \(\mathbf W_i = \mathbf W_j\) for two answers to one prompt, so \(\mathbf W_i - \mathbf W_j = \mathbf 0\): a column of zeros carries no information about \(\boldsymbol\beta_W\). That's prompt-cancellation.

2. Which term does not cancel in the same difference-form model?

Why. The two responses come from different models, so \(Z_i - Z_j \neq 0\): head-to-head contrasts stay estimable. Only the prompt-level covariates cancel — the failure is specific, not total.

3. What is the fix for prompt-cancellation?

Why. The covariate must enter as effect modification (interaction / stratum-specific index), and the stratum indices are then standardized to a target prompt mixture \(p^\star\). A main effect would just cancel again.

4. The covariate-adjusted win odds should be reported as which quantity?

Why. Noncollapsibility: report \(\psi^\star/(1-\psi^\star)\), not \(\exp(\hat\beta_Z)\). The odds transform is nonlinear, so the marginal and conditional odds differ even without confounding.

5. What does covariate standardization buy you?

Why. Standardization makes the estimand interpretable against a stated target prompt distribution; its uncertainty still comes from the same cluster-robust covariance — interpretability, not free precision.

Read next (primary source)

Read Scheidegger, Wandel & Mütze (2026), "Covariate adjustment for the win odds," Stat. Med. [Scheidegger2026-uf] for the standardized-marginal construction and the noncollapsibility caveat; Cao et al. (2025) [Cao2025-dx] reinforces the latter. In-repo, read draft_sections.tex §standardization and run scripts/prompt_cancellation.py --covariate category.