1. What you actually want
You want to report a win-rate adjusted for prompt mix, so that the leaderboard is not an artifact of which
prompts happened to be asked. The obvious tool is a probabilistic index model (PIM) with a linear predictor in
the covariate difference (main.tex §covariate_adjustment, Thas 2012):
\[
\operatorname{logit} P(Y_i \preccurlyeq Y_j) = \beta_Z\,(Z_j - Z_i) + \boldsymbol\beta_W^\top(\mathbf W_j - \mathbf W_i),
\]
with \(Z\) the model indicator and \(\mathbf W\) the prompt-level adjustment covariates (e.g. subject, language, code).
2. The failure, in two lines
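Both responses answer the same prompt, so they share its features: \(\mathbf W_i = \mathbf W_j\). Therefore
\[
\mathbf W_j - \mathbf W_i = \mathbf 0 \quad \text{for every comparison,}
\]
so the covariate term drops out of the linear predictor, its coefficient \(\boldsymbol\beta_W\) is unidentified,
and only the model term survives. The difference-form model is structurally blind to prompt mix, the dominant source of leaderboard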
sensitivity. This is prompt-cancellation. It does not arise in the two-arm clinical setting that
motivated PIMs, because there the treatment and control patients have different baseline \(\mathbf W\); that's
exactly why the imported machinery misses it. covariate.prompt_covariate_cancellation reports the
diagnosis: model_side_difference_max_abs = 0 across every comparison, yet the covariate
varies across prompts (so it could matter, if it were usable).
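To see the cancellation concretely, here is a minimal Python sketch on made-up data. The array names and the toy binary covariate are assumptions for illustration; this is not the repo's covariate.prompt_covariate_cancellation, only the arithmetic that diagnosis is described as reporting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 prompts, one binary prompt covariate (e.g. "is code").
n_prompts = 6
W_prompt = np.array([0, 1, 0, 1, 1, 0], dtype=float)

# Each comparison pits two models against each other on ONE prompt,
# so both sides of the pair inherit that prompt's covariate value.
prompt_of_pair = rng.integers(0, n_prompts, size=20)
W_i = W_prompt[prompt_of_pair]  # covariate attached to model i's response
W_j = W_prompt[prompt_of_pair]  # covariate attached to model j's response

# This is the design column that would multiply beta_W in the difference form.
diff = W_j - W_i
max_abs_diff = float(np.abs(diff).max())

# The covariate itself is far from constant across prompts.
varies_across_prompts = bool(W_prompt.std() > 0)

print(max_abs_diff)           # 0.0
print(varies_across_prompts)  # True
```

An all-zero design column carries no information about its coefficient, which is why no amount of data rescues the difference form here.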
3. The fix: effect modification, then standardize
Let the prompt covariate enter as effect modification rather than as a main effect: model-by-prompt interactions or, more simply, a separate index \(\psi_s\) for each stratum \(s\) of the covariate.
First show the covariate does modify the index — covariate.stratified_pi returns the win-rate per
model within each stratum; spread across columns is direct evidence of effect modification (the very thing the
difference form cannot capture). Then standardize: average the stratum-specific indices \(\psi_s\) over a
target prompt mixture \(p^\star\), giving \(\psi^\star = \sum_s p^\star_s \psi_s\) (covariate.standardized_leaderboard, with \(p^\star\) set to
uniform, observed, or a supplied dict). The standardized score feeds straight into rank-confidence sets because
its cluster-robust covariance reuses the same variance engine from L0003.
| dataset · covariate | effect-modification evidence | standardization moves the board |
|---|---|---|
| Arena · category (5-level) | per-category win-rate spans up to 0.142 for one model | reorders 34 / 48 models (one by six ranks) |
| mmlu_pro · subject (7) | smaller but real spread | moves 4 / 19 models |
| gpqa · difficulty (3) | claude-3.5-sonnet: \(0.577\) undergrad → \(0.502\) postgrad | moves 7 / 18; that model drops two ranks |
The methodological point is invariant to magnitude: the quantity that moves the ranking is precisely the one the difference-form model discards.
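The stratify-then-standardize step described in this section can be sketched in a few lines of Python. Everything here is hypothetical: the battle records, the two model names, and the helper functions stratified_win_rate and standardize are stand-ins written for this example, not the signatures of covariate.stratified_pi or covariate.standardized_leaderboard.

```python
from collections import defaultdict

# Hypothetical battles: (stratum, winner, loser). Not real leaderboard data.
battles = (
    [("code", "A", "B")] * 3 + [("code", "B", "A")] * 1
    + [("chat", "A", "B")] * 2 + [("chat", "B", "A")] * 6
)

def stratified_win_rate(battles):
    """Win-rate per (model, stratum): wins / games played in that stratum."""
    wins, games = defaultdict(int), defaultdict(int)
    for stratum, winner, loser in battles:
        wins[(winner, stratum)] += 1
        games[(winner, stratum)] += 1
        games[(loser, stratum)] += 1
    return {key: wins[key] / n for key, n in games.items()}

def standardize(rates, target_mix):
    """Average each model's stratum win-rates over the target mixture p*.

    Assumes every model appears in every stratum named in target_mix.
    """
    models = sorted({model for model, _ in rates})
    return {
        m: sum(p * rates[(m, s)] for s, p in target_mix.items())
        for m in models
    }

rates = stratified_win_rate(battles)
observed = {"code": 4 / 12, "chat": 8 / 12}  # mix as it occurs in the toy data
code_heavy = {"code": 0.8, "chat": 0.2}      # a supplied target mix

board_observed = standardize(rates, observed)
board_code_heavy = standardize(rates, code_heavy)
print(board_observed)
print(board_code_heavy)
```

In this toy data model A wins 75% of code battles but only 25% of chat battles, so it scores about 0.42 under the observed mix and 0.65 under the code-heavy mix, and the two models swap places. The stratum win-rates never change; only \(p^\star\) does.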
4. Noncollapsibility: the adjusted win odds is not \(\exp(\beta_Z)\)
main.tex currently states \(WO_{adj} = \exp(\hat\beta_Z)\). That is the
conditional win odds. The win odds is noncollapsible: because the odds transform is nonlinear,
averaging stratum probabilities and then forming the odds is not the same as exponentiating a
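coefficient. The adjusted win odds you should report is the standardized marginal contrast
\[
WO^\star = \frac{\psi^\star}{1 - \psi^\star},
\]
with \(\psi^\star\) the standardized win probability. Net benefit \(2\psi^\star - 1\) is linear, so it collapses fine; the win odds does not. This is the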
Scheidegger / Cao caveat (draft_sections.tex §standardization), and it is a one-line correction to
flag against the main draft.
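A small worked example makes the gap visible. The stratum win probabilities below are invented, and averaging the stratum log-odds is used only as a stand-in for a conditional coefficient; it is not claimed to be what a fitted PIM returns for \(\hat\beta_Z\).

```python
import math

# Hypothetical stratum win probabilities and a uniform target mix.
psi = {"easy": 0.9, "hard": 0.5}
p_star = {"easy": 0.5, "hard": 0.5}

# Standardized marginal route: average probabilities first, then form the odds.
psi_star = sum(p_star[s] * psi[s] for s in psi)
wo_marginal = psi_star / (1 - psi_star)

# Conditional-style route (illustrative proxy): average the stratum log-odds,
# then exponentiate, as one would with a single log-odds coefficient.
mean_log_odds = sum(p_star[s] * math.log(psi[s] / (1 - psi[s])) for s in psi)
wo_conditional_style = math.exp(mean_log_odds)

# Net benefit is linear in psi, so averaging before or after gives the same value.
nb_marginal = 2 * psi_star - 1
nb_averaged = sum(p_star[s] * (2 * psi[s] - 1) for s in psi)

print(round(wo_marginal, 3))           # 2.333
print(round(wo_conditional_style, 3))  # 3.0
print(round(nb_marginal, 3), round(nb_averaged, 3))  # 0.4 0.4
```

The two win-odds figures disagree even though both are built from the same stratum probabilities and the same mixture, while the net benefit agrees either way.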
Two honest limits to state alongside the fix. Overlap: standardization is only defined for models present
in every stratum (the scripts enforce models_present_in_all_strata / a per-stratum
min_count); absent that, a model's standardized score is NaN, not imputed. Adjustment ≠ free
precision: the standardized estimand inherits its uncertainty from the same cluster-robust covariance — you
buy interpretability, not narrower intervals.
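The overlap rule can be sketched as a guard in front of the weighted average. The function name standardized_or_nan and its arguments are hypothetical, and treating min_count as a per-stratum minimum number of comparisons is an assumption about what the scripts enforce.

```python
import math

def standardized_or_nan(rates, counts, target_mix, min_count=1):
    """Standardized score for ONE model, or NaN if overlap fails.

    rates:      stratum -> win-rate for this model
    counts:     stratum -> number of comparisons behind that win-rate
    target_mix: stratum -> weight p*, assumed to sum to 1
    """
    for stratum in target_mix:
        # A stratum with too few (or zero) comparisons is not imputed.
        if counts.get(stratum, 0) < min_count:
            return math.nan
    return sum(p * rates[stratum] for stratum, p in target_mix.items())

mix = {"code": 0.5, "chat": 0.5}

# Model seen in both strata: gets a score.
full = standardized_or_nan(
    {"code": 0.75, "chat": 0.25}, {"code": 40, "chat": 80}, mix, min_count=10
)

# Model never evaluated on chat prompts: no score rather than a guess.
partial = standardized_or_nan({"code": 0.75}, {"code": 40}, mix, min_count=10)

print(full)                 # 0.5
print(math.isnan(partial))  # True
```

Returning NaN keeps the gap visible downstream instead of silently scoring a model on a prompt mix it was never tested against.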
outline_paper1.md, "Open decisions").
1. In the difference-form PIM, why is \(\boldsymbol\beta_W\) unidentified for a prompt covariate?
2. Which term does not cancel in the same difference-form model?
3. What is the fix for prompt-cancellation?
4. The covariate-adjusted win odds should be reported as which quantity?
5. What does covariate standardization buy you?
Read next (primary source)
Read Scheidegger, Wandel & Mütze (2026), "Covariate adjustment for the
win odds," Stat. Med. [Scheidegger2026-uf] for the standardized-marginal construction and the
noncollapsibility caveat; Cao et al. (2025) [Cao2025-dx] reinforces the latter. In-repo, read
draft_sections.tex §standardization and run scripts/prompt_cancellation.py --covariate category.