1. First, name the two estimands (so the disagreement is precise)
\(\exp(\hat\beta_Z)\) is not "wrong" — it is the conditional win odds: the within-stratum effect. The paper reports a leaderboard win-rate, which is a population-level (marginal) statement. These are different estimands answering different questions:
| estimand | what it answers | this paper's target? |
|---|---|---|
| \(\exp(\hat\beta_Z)\) — conditional win odds | "within a prompt stratum, what are \(a\)'s odds vs \(b\)?" | no — and see §4, it isn't even adjusted here |
| \(\psi^\star/(1-\psi^\star)\) — standardized marginal | "averaged over the target prompt mix, what are \(a\)'s odds?" | yes — the win-rate is the marginal PI |
So the question isn't "which is correct in general" (both are legitimate); it's "which one is the paper actually about." The reported win-rate, by definition, is the marginal PI — so its odds form must be the marginal too.
2. The argument that needs no concession: internal consistency
This is the one to lead with, because Jean doesn't have to agree with anything philosophical — only that the
paper should be consistent with its own thesis. The framework's whole claim (main.tex §unified) is that
the four metrics are transforms of one PI: the win-rate is \(PI\) itself, \(WO = PI/(1-PI)\), and \(NB = NPS = 2\,PI - 1\).
Now look at what the draft actually does for the adjusted versions:
- main.tex:253–257 recovers \(NB_{adj} = NPS_{adj} = 2\,PI_{adj} - 1\) with \(PI_{adj}\) the standardized marginal (inverse link integrated over the covariate distribution). ✓ marginal
- main.tex:250 reports \(WO_{adj} = \exp(\hat\beta_Z)\), the conditional odds. ✗ not marginal
3. The number that proves it isn't cosmetic
Two prompt strata, model \(a\) vs \(b\). Conditional win probabilities \(\psi^{(1)}=0.55\), \(\psi^{(2)}=0.75\) (effect modification — \(a\) is much stronger on stratum 2). Observed mix is imbalanced (80% stratum 1); the target is balanced. All numbers below are computed, not illustrative.
| quantity | value | what it is |
|---|---|---|
| \(\exp(\hat\beta_Z)\) | 1.439 | as-sampled / unadjusted pooled win odds (\(PI = 0.590\)) |
| \(\psi^\star/(1-\psi^\star)\) | 1.857 | standardized marginal win odds (\(\psi^\star = 0.650\)) |
A gap of \(+0.42\) in the win odds from a relabeling that the draft treats as a triviality. On the real board this is exactly the mechanism behind your Contribution 3 result — standardizing reorders 34 of 48 Arena models. The choice of estimand moves the leaderboard.
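The two table rows can be reproduced in a few lines. This is a minimal sketch of the toy example only; the variable and function names are illustrative and are not taken from winrate/covariate.py.

```python
# Two-stratum toy: conditional win probabilities from the text.
psi = [0.55, 0.75]            # P(a beats b) within stratum 1 and stratum 2
observed_mix = [0.80, 0.20]   # as-sampled prompt mix (80% stratum 1)
target_mix = [0.50, 0.50]     # balanced target mix


def marginal_pi(probs, weights):
    """Weighted average of stratum win probabilities."""
    return sum(p * w for p, w in zip(probs, weights))


def odds(p):
    """Odds transform of a probability."""
    return p / (1.0 - p)


pooled_pi = marginal_pi(psi, observed_mix)       # as-sampled PI
standardized_pi = marginal_pi(psi, target_mix)   # psi-star on the target mix
pooled_odds = odds(pooled_pi)
standardized_odds = odds(standardized_pi)

print(round(pooled_pi, 3), round(pooled_odds, 3))
print(round(standardized_pi, 3), round(standardized_odds, 3))
```

Only the weights change between the two rows; the stratum probabilities are identical, so the whole gap comes from the choice of prompt mix.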
4. Why \(\exp(\hat\beta_Z)\) is the unadjusted odds here (the prompt-cancellation tie-in)
Crucial, and specific to the draft's difference-form model \(\operatorname{logit}\psi = \beta_Z + \boldsymbol\beta_W^\top(\mathbf W_i - \mathbf W_j)\): because both responses share the prompt, \(\mathbf W_i - \mathbf W_j = \mathbf 0\) and the prompt term cancels (L0004). So \(\hat\beta_Z\) is fit with no prompt adjustment, and \(\operatorname{expit}(\hat\beta_Z)\) equals the pooled, as-sampled win probability. In other words, in the draft as written, \(\exp(\hat\beta_Z)\) is not a covariate-adjusted win odds at all — it is the raw win odds wearing an adjusted label. You cannot adjust for prompt mix through the difference form; you must use effect modification and then standardize.
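The cancellation can be checked numerically. The sketch below is an assumption-laden toy, not the draft's fitting code: it invents a battle log whose counts match the two-stratum example, builds the difference-form covariate, and runs Newton steps by hand on a plain logistic likelihood.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical battle log: (prompt_stratum, a_won). 800 stratum-1 battles with
# 440 wins and 200 stratum-2 battles with 150 wins match psi = 0.55 and 0.75.
battles = [(0, 1)] * 440 + [(0, 0)] * 360 + [(1, 1)] * 150 + [(1, 0)] * 50

# Difference-form covariate: both responses share the prompt, so W_i - W_j = 0.
w_diff = [float(stratum - stratum) for stratum, _ in battles]
y = [won for _, won in battles]

beta_z, beta_w = 0.0, 0.0
for _ in range(25):  # Newton steps on the intercept only
    p = [expit(beta_z + beta_w * d) for d in w_diff]
    grad_z = sum(yi - pi for yi, pi in zip(y, p))
    # Score for beta_w: every term is multiplied by a zero covariate.
    grad_w = sum((yi - pi) * d for yi, pi, d in zip(y, p, w_diff))
    hess_z = sum(pi * (1.0 - pi) for pi in p)
    beta_z += grad_z / hess_z
    # beta_w is never updated: its score and information are both zero.

pooled_rate = sum(y) / len(y)
print(round(expit(beta_z), 3), round(pooled_rate, 3), round(math.exp(beta_z), 3))
```

The fitted intercept reproduces the pooled win rate, and the prompt coefficient receives no information from the data, which is the cancellation described above.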
5. Noncollapsibility: you can't shortcut the average
Once you fix cancellation with effect modification, a tempting shortcut is to average the stratum odds (or the coefficients) to get the adjusted win odds. Noncollapsibility says you can't — you must average the probabilities on the target mix and then transform. Same example:
| aggregation | value | verdict |
|---|---|---|
| average the PIs, then take odds — \(\psi^\star/(1-\psi^\star)\) | 1.857 | ✓ the standardized marginal |
| \(\exp(\text{mean of log stratum-odds})\) | 1.915 | ✗ noncollapsible — wrong |
| average of the stratum odds | 2.111 | ✗ noncollapsible — wrong |
Because the odds transform is nonlinear, "average then transform" \(\neq\) "transform then average." The win odds is noncollapsible; net benefit (linear in \(\psi\)) collapses fine, which is why \(NB_{adj}=2\,PI_{adj}-1\) looks innocent while \(WO_{adj}=\exp(\hat\beta_Z)\) hides the problem.
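The three aggregations in the table, plus the net-benefit contrast, can be computed side by side. This is a sketch of the toy example with illustrative names; it assumes the balanced target mix used above.

```python
import math

psi = [0.55, 0.75]        # stratum win probabilities from the example
target_mix = [0.5, 0.5]   # balanced target mix


def odds(p):
    return p / (1.0 - p)


stratum_odds = [odds(p) for p in psi]

# Correct order: average the probabilities on the target mix, then transform.
psi_star = sum(w * p for w, p in zip(target_mix, psi))
average_then_transform = odds(psi_star)

# Tempting shortcuts: transform each stratum first, then average.
geometric_mean_odds = math.exp(
    sum(w * math.log(o) for w, o in zip(target_mix, stratum_odds))
)
arithmetic_mean_odds = sum(w * o for w, o in zip(target_mix, stratum_odds))

# Net benefit is linear in psi, so the order of operations does not matter.
nb_of_average = 2.0 * psi_star - 1.0
average_of_nb = sum(w * (2.0 * p - 1.0) for w, p in zip(target_mix, psi))

print(round(average_then_transform, 3))
print(round(geometric_mean_odds, 3))
print(round(arithmetic_mean_odds, 3))
print(round(nb_of_average, 3), round(average_of_nb, 3))
```

The three odds values disagree while the two net-benefit values coincide, which is the collapsible versus noncollapsible distinction in miniature.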
6. Rehearsing Jean's replies
| Jean might say | your reply |
|---|---|
| "\(\exp(\beta_Z)\) is the standard PIM output." | For the conditional effect, yes. But PIMs also support standardization (Thas; Scheidegger), and for a noncollapsible measure reporting the marginal is standard in adjusted analyses. Our reported estimand is the marginal win-rate, so its odds form must match. |
| "The conditional win odds is a valid estimand." | Agreed — it's just a different question. A leaderboard is a population statement, and in our setting (§4) the conditional from the difference form isn't even adjusted. |
| "They're close enough; it won't matter." | 1.44 vs 1.86 in a two-stratum toy, and 34/48 Arena models reorder under standardization. It changes the board. |
| "Standardizing needs an arbitrary target mix." | We state it explicitly (balanced, or a named target) — transparent by design. \(\exp(\beta_Z)\) also implies a mix: the platform's as-sampled matchmaking, just unstated and usually undesirable. |
| "Let's keep the difference-form; it's simpler." | It's unidentified for prompt covariates (cancellation) — it cannot do the adjustment we want at all. Effect modification + standardization is the minimal correct fix. |
7. What to propose (and the good news)
Concretely: report the adjusted win-rate, win odds, net benefit, and NPS as standardized marginals on a
stated target mix; get their SEs from the same cluster-robust covariance (L0003); cite Scheidegger (2026) / Cao
(2025) for the noncollapsibility correction. The good news for the argument: the implementation in
winrate/covariate.py already computes the standardized marginal —
standardized_leaderboard returns \(\psi^\star = \sum_h p^\star(h) S_a^{(h)}\), from which
\(WO^\star = \psi^\star/(1-\psi^\star)\) follows directly. So this isn't asking Jean to build anything; it's aligning
the main.tex prose (lines 148, 250) with what the code already does correctly.
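One way the proposal could look in code is sketched below. It is hypothetical throughout: it does not call standardized_leaderboard (whose signature is not shown here), it assumes \(\psi^\star\) and a standard error for it are already in hand, and it uses a first-order delta method to carry that standard error to the other metrics. The standard error value of 0.02 is made up for illustration.

```python
def standardized_metrics(psi_star, se_psi):
    """Map a standardized marginal win probability and its standard error
    to win odds and net benefit (= NPS), using the delta method.

    Hypothetical helper: psi_star and se_psi are assumed to come from an
    upstream standardization step with a cluster-robust covariance.
    """
    if not 0.0 < psi_star < 1.0:
        raise ValueError("psi_star must lie strictly between 0 and 1")
    win_odds = psi_star / (1.0 - psi_star)
    # d/dpsi [psi / (1 - psi)] = 1 / (1 - psi)^2
    se_win_odds = se_psi / (1.0 - psi_star) ** 2
    net_benefit = 2.0 * psi_star - 1.0
    # d/dpsi [2 psi - 1] = 2
    se_net_benefit = 2.0 * se_psi
    return {
        "win_rate": (psi_star, se_psi),
        "win_odds": (win_odds, se_win_odds),
        "net_benefit": (net_benefit, se_net_benefit),
    }


# psi_star = 0.65 is the toy value from the text; se_psi = 0.02 is invented.
metrics = standardized_metrics(0.65, 0.02)
for name, (estimate, se) in metrics.items():
    print(name, round(estimate, 3), round(se, 3))
```

Because every metric is derived from the same \(\psi^\star\), the four reported numbers cannot disagree about the estimand, which is the consistency point from §2.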
1. The single strongest opening argument (needs no concession from Jean) is…
2. In the draft's difference-form model, \(\exp(\hat\beta_Z)\) numerically equals…
3. To get the standardized marginal win odds you must…
4. Jean: "Choosing a target mix is arbitrary." Best reply?
5. The most reassuring practical point for Jean is…
covariate.standardized_leaderboard already returns \(\psi^\star\); \(WO^\star=\psi^\star/(1-\psi^\star)\) follows. The ask is to align the main.tex prose with the implementation, not to build anything.

Read next (primary source)
Read Scheidegger, Wandel & Mütze (2026) [Scheidegger2026-uf]
and Cao et al. (2025) [Cao2025-dx] for the standardized-marginal win odds and noncollapsibility,
so you can cite them to Jean. In-repo, reread main.tex:247–257 (the inconsistency) against
winrate/covariate.py (standardized_leaderboard, the correct implementation).