Win-rate inference · Lesson 0008 · the mission lesson

Why the adjusted win odds must be the standardized marginal

The case to make to Jean that the paper should report \(\psi^\star/(1-\psi^\star)\), not \(\exp(\hat\beta_Z)\), built so it survives a competent co-author's pushback.

Why this lesson. This is the concrete goal: persuade Jean. The aim is not to say "you're wrong"; it is to show a cleaner, internally consistent version of her own framework. You'll lead with an argument that needs no concession from her (internal consistency), back it with a verified number, and have rebuttals ready for her strongest replies. Frame it collaboratively throughout.

1. First, name the two estimands (so the disagreement is precise)

\(\exp(\hat\beta_Z)\) is not "wrong" — it is the conditional win odds: the within-stratum effect. The paper reports a leaderboard win-rate, which is a population-level (marginal) statement. These are different estimands answering different questions:

| estimand | what it answers | this paper's target? |
| --- | --- | --- |
| \(\exp(\hat\beta_Z)\): conditional win odds | "within a prompt stratum, what are \(a\)'s odds vs \(b\)?" | no; and see §4, it isn't even adjusted here |
| \(\psi^\star/(1-\psi^\star)\): standardized marginal | "averaged over the target prompt mix, what are \(a\)'s odds?" | yes: the win-rate is the marginal PI |

So the question isn't "which is correct in general" (both are legitimate); it's "which one is the paper actually about." The reported win-rate, by definition, is the marginal PI — so its odds form must be the marginal too.

2. The argument that needs no concession: internal consistency

This is the one to lead with, because Jean doesn't have to agree with anything philosophical — only that the paper should be consistent with its own thesis. The framework's whole claim (main.tex §unified) is that the four metrics are transforms of one PI:

\[ \text{win-rate} = \psi, \qquad WO = \frac{\psi}{1-\psi}, \qquad NB = NPS = 2\psi - 1. \]

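Now look at what the draft actually does for the adjusted versions: it derives \(NB_{adj}\) and \(NPS_{adj}\) from the marginal \(PI_{adj}\), but sets \(WO_{adj}=\exp(\hat\beta_Z)\).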

The contradiction, stated in one line. If \(NB_{adj}\) uses \(PI_{adj}\) but \(WO_{adj}=\exp(\hat\beta_Z)\), then \(WO_{adj} \neq PI_{adj}/(1-PI_{adj})\): the reported win odds and net benefit are no longer transforms of the same PI. The unification — the paper's headline contribution — breaks on its own page. The only repair that keeps the framework intact is to report all four as standardized marginals.
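The consistency check can be sketched in a few lines of Python. This is only an illustration: the helper names `transforms` and `is_consistent` are hypothetical (they are not part of the repo), and the inputs are the two-stratum numbers from §3.

```python
def transforms(psi: float) -> dict:
    """Return the four metrics as transforms of one PI (the unified framework)."""
    return {
        "win_rate": psi,
        "win_odds": psi / (1.0 - psi),
        "net_benefit": 2.0 * psi - 1.0,
        "nps": 2.0 * psi - 1.0,
    }


def is_consistent(win_odds: float, net_benefit: float, tol: float = 1e-9) -> bool:
    """True if the reported win odds and net benefit imply the same PI."""
    psi_from_wo = win_odds / (1.0 + win_odds)   # invert WO = psi / (1 - psi)
    psi_from_nb = (net_benefit + 1.0) / 2.0     # invert NB = 2 * psi - 1
    return abs(psi_from_wo - psi_from_nb) < tol


# Section 3 numbers: marginal PI_adj = 0.65, pooled as-sampled PI = 0.59.
nb_adj = transforms(0.65)["net_benefit"]
wo_draft = 0.59 / 0.41                  # what exp(beta_Z) equals in the example
wo_marginal = transforms(0.65)["win_odds"]

print(is_consistent(wo_draft, nb_adj))     # False: the draft's pairing
print(is_consistent(wo_marginal, nb_adj))  # True: all metrics from one PI
```

Inverting each reported metric back to a PI makes the mismatch concrete: the draft's pairing implies two different values of \(\psi\).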

3. The number that proves it isn't cosmetic

Two prompt strata, model \(a\) vs \(b\). Conditional win probabilities \(\psi^{(1)}=0.55\), \(\psi^{(2)}=0.75\) (effect modification — \(a\) is much stronger on stratum 2). Observed mix is imbalanced (80% stratum 1); the target is balanced. All numbers below are computed, not illustrative.

| quantity | value | what it is |
| --- | --- | --- |
| \(\exp(\hat\beta_Z)\) | 1.439 | as-sampled / unadjusted pooled win odds (\(PI = 0.590\)) |
| \(\psi^\star/(1-\psi^\star)\) | 1.857 | standardized marginal win odds (\(\psi^\star = 0.650\)) |

A gap of \(+0.42\) in the win odds from a relabeling that the draft treats as a triviality. On the real board this is exactly the mechanism behind your Contribution 3 result — standardizing reorders 34 of 48 Arena models. The choice of estimand moves the leaderboard.
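The two numbers can be reproduced with a short Python sketch. The function names `marginal_pi` and `odds` are illustrative, not taken from the repo; the inputs are the stratum probabilities and mixes stated above.

```python
stratum_pi = [0.55, 0.75]      # conditional win probabilities per stratum
observed_mix = [0.80, 0.20]    # as-sampled prompt mix (imbalanced)
target_mix = [0.50, 0.50]      # balanced target mix


def marginal_pi(pis, mix):
    """Mix-weighted average of the stratum win probabilities."""
    assert abs(sum(mix) - 1.0) < 1e-12, "mix weights must sum to 1"
    return sum(w * p for w, p in zip(mix, pis))


def odds(p):
    """Odds transform of a probability."""
    return p / (1.0 - p)


pooled_pi = marginal_pi(stratum_pi, observed_mix)
standardized_pi = marginal_pi(stratum_pi, target_mix)

print(f"{pooled_pi:.3f} -> {odds(pooled_pi):.3f}")              # 0.590 -> 1.439
print(f"{standardized_pi:.3f} -> {odds(standardized_pi):.3f}")  # 0.650 -> 1.857
print(f"{odds(standardized_pi) - odds(pooled_pi):.3f}")         # 0.418
```

Only the mix changes between the two rows; the stratum probabilities are identical, so the whole gap comes from which prompt population the estimand averages over.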

4. Why \(\exp(\hat\beta_Z)\) is the unadjusted odds here (the prompt-cancellation tie-in)

Crucial, and specific to the draft's difference-form model \(\operatorname{logit}\psi = \beta_Z + \boldsymbol\beta_W^\top(\mathbf W_i - \mathbf W_j)\): because both responses share the prompt, \(\mathbf W_i - \mathbf W_j = \mathbf 0\) and the prompt term cancels (L0004). So \(\hat\beta_Z\) is fit with no prompt adjustment, and \(\operatorname{expit}(\hat\beta_Z)\) equals the pooled, as-sampled win probability. In other words, in the draft as written, \(\exp(\hat\beta_Z)\) is not a covariate-adjusted win odds at all — it is the raw win odds wearing an adjusted label. You cannot adjust for prompt mix through the difference form; you must use effect modification and then standardize.
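A minimal numerical sketch of the cancellation, assuming hypothetical counts that match the §3 example (800 comparisons in stratum 1 with 440 wins, 200 in stratum 2 with 150 wins). It fits the intercept with a hand-rolled Newton iteration rather than the paper's actual PIM code, so treat it as an illustration only.

```python
import math

import numpy as np

# Hypothetical outcome data consistent with section 3 (1 = model a wins).
wins = np.concatenate([
    np.repeat([1.0, 0.0], [440, 360]),  # stratum 1: win rate 0.55
    np.repeat([1.0, 0.0], [150, 50]),   # stratum 2: win rate 0.75
])
n = wins.size

# Both responses share the prompt, so W_i - W_j is zero in every row.
w_diff = np.zeros(n)
design = np.column_stack([np.ones(n), w_diff])
design_rank = np.linalg.matrix_rank(design)  # 1: the W column carries no information

# With the W column dead, the model is intercept-only. Newton steps for beta_Z:
beta_z = 0.0
for _ in range(25):
    p = 1.0 / (1.0 + math.exp(-beta_z))
    score = wins.sum() - n * p        # gradient of the log-likelihood
    info = n * p * (1.0 - p)          # Fisher information
    beta_z += score / info

pooled_rate = wins.mean()
print(design_rank)                                   # 1
print(f"{1.0 / (1.0 + math.exp(-beta_z)):.3f}")      # 0.590
print(f"{math.exp(beta_z):.3f}")                     # 1.439
```

The fitted intercept is the logit of the pooled win rate, so \(\exp(\hat\beta_Z)\) lands on the as-sampled odds regardless of how the strata differ.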

5. Noncollapsibility: you can't shortcut the average

Once you fix cancellation with effect modification, a tempting shortcut is to average the stratum odds (or the coefficients) to get the adjusted win odds. Noncollapsibility says you can't — you must average the probabilities on the target mix and then transform. Same example:

| aggregation | value | verdict |
| --- | --- | --- |
| average the PIs, then take odds: \(\psi^\star/(1-\psi^\star)\) | 1.857 | ✓ the standardized marginal |
| \(\exp(\text{mean of log stratum-odds})\) | 1.915 | ✗ noncollapsible, wrong |
| average of the stratum odds | 2.111 | ✗ noncollapsible, wrong |

Because the odds transform is nonlinear, "average then transform" \(\neq\) "transform then average." The win odds is noncollapsible; net benefit (linear in \(\psi\)) collapses fine, which is why \(NB_{adj}=2\,PI_{adj}-1\) looks innocent while \(WO_{adj}=\exp(\hat\beta_Z)\) hides the problem.
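The three aggregations, and the fact that net benefit collapses while win odds does not, can be checked with a short Python sketch on the same two-stratum example (variable names are illustrative):

```python
import math

stratum_pi = [0.55, 0.75]
target_mix = [0.50, 0.50]

stratum_odds = [p / (1.0 - p) for p in stratum_pi]

# Correct: average the probabilities on the target mix, then transform.
psi_star = sum(w * p for w, p in zip(target_mix, stratum_pi))
avg_then_odds = psi_star / (1.0 - psi_star)

# Shortcuts: transform first, then average (on the log or the raw odds scale).
geometric = math.exp(sum(w * math.log(o) for w, o in zip(target_mix, stratum_odds)))
arithmetic = sum(w * o for w, o in zip(target_mix, stratum_odds))

print(f"{avg_then_odds:.3f}")  # 1.857
print(f"{geometric:.3f}")      # 1.915
print(f"{arithmetic:.3f}")     # 2.111

# Net benefit is linear in psi, so either order gives the same answer.
nb_of_average = 2.0 * psi_star - 1.0
average_of_nb = sum(w * (2.0 * p - 1.0) for w, p in zip(target_mix, stratum_pi))
print(math.isclose(nb_of_average, average_of_nb))  # True
```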

Be precise — a subtlety Jean may catch. The textbook clinical noncollapsibility comes from variation in \(\mathbf W_i - \mathbf W_j\) across pairs; in our same-prompt setting that difference is zero, so that exact mechanism is nullified. Here the gap arises through effect modification + prompt-mix reweighting instead. Don't claim "noncollapsibility just like the clinical case" — claim the consistency argument (§2), the cancellation argument (§4), and the aggregation-shortcut failure (§5). That ordering is bulletproof.

6. Rehearsing Jean's replies

| Jean might say | your reply |
| --- | --- |
| "\(\exp(\beta_Z)\) is the standard PIM output." | For the conditional effect, yes. But PIMs also support standardization (Thas; Scheidegger), and for a noncollapsible measure, reporting the marginal is standard in adjusted analyses. Our reported estimand is the marginal win-rate, so its odds form must match. |
| "The conditional win odds is a valid estimand." | Agreed; it's just a different question. A leaderboard is a population statement, and in our setting (§4) the conditional from the difference form isn't even adjusted. |
| "They're close enough; it won't matter." | 1.44 vs 1.86 in a two-stratum toy, and 34/48 Arena models reorder under standardization. It changes the board. |
| "Standardizing needs an arbitrary target mix." | We state it explicitly (balanced, or a named target): transparent by design. \(\exp(\beta_Z)\) also implies a mix: the platform's as-sampled matchmaking, just unstated and usually undesirable. |
| "Let's keep the difference-form; it's simpler." | It's unidentified for prompt covariates (cancellation), so it cannot do the adjustment we want at all. Effect modification + standardization is the minimal correct fix. |

7. What to propose (and the good news)

Concretely: report the adjusted win-rate, win odds, net benefit, and NPS as standardized marginals on a stated target mix; get their SEs from the same cluster-robust covariance (L0003); cite Scheidegger (2026) / Cao (2025) for the noncollapsibility correction. The good news for the argument: the implementation in winrate/covariate.py already computes the standardized marginal — standardized_leaderboard returns \(\psi^\star = \sum_h p^\star(h) S_a^{(h)}\), from which \(WO^\star = \psi^\star/(1-\psi^\star)\) follows directly. So this isn't asking Jean to build anything; it's aligning the main.tex prose (lines 148, 250) with what the code already does correctly.
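As a sketch of what the proposal amounts to numerically, the snippet below forms \(\psi^\star\) from stratum estimates and propagates one covariance to all the reported metrics. Several things here are assumptions, not the repo's code: `standardized_summary` is a hypothetical helper (it is not `standardized_leaderboard`), the delta method is one standard way to carry a standard error through the odds transform, and the covariance matrix is made up for illustration.

```python
import numpy as np


def standardized_summary(stratum_pi, target_mix, stratum_cov):
    """Standardized marginal metrics with delta-method standard errors.

    stratum_cov is the covariance of the stratum win-probability estimates
    (for example, a cluster-robust estimate); here it is a hypothetical input.
    """
    s = np.asarray(stratum_pi, dtype=float)
    p = np.asarray(target_mix, dtype=float)
    cov = np.asarray(stratum_cov, dtype=float)

    psi = float(p @ s)                     # psi* = sum_h p*(h) S_a^(h)
    se_psi = float(np.sqrt(p @ cov @ p))   # linear combination of estimates

    return {
        "win_rate": (psi, se_psi),
        # d/dpsi [psi / (1 - psi)] = 1 / (1 - psi)^2
        "win_odds": (psi / (1.0 - psi), se_psi / (1.0 - psi) ** 2),
        # d/dpsi [2 * psi - 1] = 2
        "net_benefit": (2.0 * psi - 1.0, 2.0 * se_psi),
    }


# Hypothetical covariance: independent strata with SEs of 0.02 and 0.03.
example_cov = np.diag([0.02 ** 2, 0.03 ** 2])
summary = standardized_summary([0.55, 0.75], [0.5, 0.5], example_cov)
for name, (estimate, se) in summary.items():
    print(f"{name}: {estimate:.3f} (SE {se:.3f})")
```

Because every metric is derived from the same \(\psi^\star\) and the same covariance, the reported point estimates and their uncertainties stay mutually consistent by construction.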

Rehearsal — answer as if Jean just said it

1. The single strongest opening argument (needs no concession from Jean) is…

Why. Internal consistency: the draft uses marginal \(PI_{adj}\) for NB/NPS but \(\exp(\beta_Z)\) for WO, so \(WO_{adj}\neq PI_{adj}/(1-PI_{adj})\) — the unification breaks on its own page. No philosophy needed.

2. In the draft's difference-form model, \(\exp(\hat\beta_Z)\) numerically equals…

Why. Prompt-cancellation zeroes the \(\mathbf W\) term, so \(\hat\beta_Z\) is fit with no prompt adjustment and \(\operatorname{expit}(\hat\beta_Z)\) is the pooled as-sampled win probability — the raw win odds, not an adjusted one.

3. To get the standardized marginal win odds you must…

Why. Noncollapsibility: the odds transform is nonlinear, so "average then transform" ≠ "transform then average." Average the probabilities on the target mix first, then form the odds (1.857), not 1.915 or 2.111.

4. Jean: "Choosing a target mix is arbitrary." Best reply?

Why. There is no mix-free adjusted estimand. \(\exp(\beta_Z)\) implicitly uses the platform's as-sampled matchmaking mix — unstated and usually undesirable. Standardization just makes the choice explicit and defensible.

5. The most reassuring practical point for Jean is…

Why. covariate.standardized_leaderboard already returns \(\psi^\star\); \(WO^\star=\psi^\star/(1-\psi^\star)\) follows. The ask is to align the main.tex prose with the implementation, not to build anything.

Read next (primary source)

Read Scheidegger, Wandel & Mütze (2026) [Scheidegger2026-uf] and Cao et al. (2025) [Cao2025-dx] for the standardized-marginal win odds and noncollapsibility, so you can cite them to Jean. In-repo, reread main.tex:247–257 (the inconsistency) against winrate/covariate.py (standardized_leaderboard, the correct implementation).