Win-rate inference · Lesson 0007

Capstone: defending the board

An interleaved review of all four contributions, framed as the reviewer objections you will actually face, with a one-line rebuttal for each.

Why this lesson. Your top goal is defending the paper with Jean and reviewers. This lesson interleaves every prior contribution — deliberately mixing topics so retrieval is effortful and the knowledge sticks (storage strength, not just in-the-moment fluency). Treat each quiz item as a referee report.

The defense cheat-sheet

Six objections you should expect, the rebuttal in one line, and where it's grounded. Print this.

| reviewer objection | your rebuttal | grounding |
| --- | --- | --- |
| "Your ICC is only 0.03 — you're over-correcting the SEs." | The global ICC averages in pairs sharing no model; the binding dependence is among comparisons sharing a model's response (reuse), so effective \(n\) ≈ #prompts and inflation ≈ \(\sqrt{\#\text{opponents}}\). | L0003 |
| "Just fit Bradley–Terry; it handles all this." | BT is a design-weighted logistic projection assuming transitivity; it doesn't remove the dependence (still needs a sandwich on a misspecified latent model) and under intransitivity it approximates, not estimates. | L0005 |
| "Adjust for prompt difficulty with a covariate in the PIM." | Prompt-cancellation: both answers share the prompt, so \(\mathbf W_i - \mathbf W_j = 0\) and \(\beta_W\) is unidentified. Use effect modification + standardization. | L0004 |
| "Report \(\exp(\beta_Z)\) as the adjusted win odds." | Noncollapsibility: report the standardized marginal \(\psi^\star/(1-\psi^\star)\); it differs from \(\exp(\beta_Z)\) even without confounding. | L0004 |
| "305 cycles means your ranking is incoherent." | None survive simultaneous CIs (strongest weakest-edge \(z=1.51\)); at \(\geq 200\) comparisons/pair there are zero even in point estimates. Raw cycle-counting overstates intransitivity. | L0005 |
| "You can't claim anything — judges are biased." | Correct, and stated: \(\psi^\star\) is defined relative to a judge distribution, not a judge-free ideal. The supported claims (cluster-robust, covariate-standardized) don't depend on that split. | L0006 |

The through-line to repeat. Every contribution makes a weaker, better-supported claim than current practice: a cluster-robust win-rate, a rank-confidence set instead of a point rank, a transitivity claim that survives uncertainty, and a covariate-standardized index that names its target prompt mix. The estimand transfers exactly; the inference assumptions do not, and where they fail is Paper 2.
Referee round — interleaved; answer each from memory

1. Referee: "Your win-rate, win odds, and net benefit are three different analyses." Best reply?

Why. Win odds and net benefit are monotone transforms of the same \(\psi\) — one implies the others. (The win ratio is the genuinely different one; it discards ties.) [L0002]
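A minimal sketch of this reply, assuming \(\psi\) counts a tie as half a win (check this against the definition in L0002). The function names and the two outcome mixes are illustrative and not taken from the paper's code.

```python
def win_rate(p_win, p_tie):
    """psi = P(win) + 0.5 * P(tie); the half-tie convention is an assumption."""
    return p_win + 0.5 * p_tie

def win_odds(psi):
    # Monotone in psi: knowing psi fixes the win odds.
    return psi / (1.0 - psi)

def net_benefit(psi):
    # P(win) - P(loss) equals 2*psi - 1 under the half-tie convention.
    return 2.0 * psi - 1.0

def win_ratio(p_win, p_loss):
    # Uses wins and losses only, so it is not a function of psi alone.
    return p_win / p_loss

# Two hypothetical outcome mixes with the same psi but different tie rates.
a = {"p_win": 0.50, "p_tie": 0.20, "p_loss": 0.30}
b = {"p_win": 0.60, "p_tie": 0.00, "p_loss": 0.40}

psi_a = win_rate(a["p_win"], a["p_tie"])
psi_b = win_rate(b["p_win"], b["p_tie"])

print("psi:", psi_a, psi_b)
print("win odds:", win_odds(psi_a), win_odds(psi_b))
print("net benefit:", net_benefit(psi_a), net_benefit(psi_b))
print("win ratio:", win_ratio(a["p_win"], a["p_loss"]), win_ratio(b["p_win"], b["p_loss"]))
```

Both mixes give the same \(\psi\), hence the same win odds and net benefit, while the win ratio differs because it drops the ties.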

2. Referee: "A 0.03 ICC can't justify 4× wider intervals." Best reply?

Why. The dependence that inflates \(a\)'s win-rate is among comparisons sharing \(a\)'s response; the global pair-centered ICC averages those away with unrelated pairs. Effective \(n\) ≈ #prompts. [L0003]
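The whiteboard version of this argument can be checked numerically. The sketch below is a limiting case, not the paper's estimator: it assumes every comparison that reuses model \(a\)'s response on a prompt has the same outcome (perfect within-prompt correlation). The 500 prompts, 16 opponents, and 0.6 win probability are hypothetical values.

```python
import math
import random
import statistics

def naive_and_cluster_se(outcomes_by_prompt):
    """outcomes_by_prompt: one list of 0/1 wins per prompt (equal lengths).

    naive   treats every comparison as independent;
    cluster treats each prompt as one independent unit.
    """
    flat = [y for ys in outcomes_by_prompt for y in ys]
    naive = statistics.pstdev(flat) / math.sqrt(len(flat))
    prompt_means = [statistics.fmean(ys) for ys in outcomes_by_prompt]
    cluster = statistics.pstdev(prompt_means) / math.sqrt(len(prompt_means))
    return naive, cluster

random.seed(0)
n_prompts, n_opponents = 500, 16  # hypothetical design

# Limiting case: the shared response decides every comparison on that prompt.
data = []
for _ in range(n_prompts):
    win = 1 if random.random() < 0.6 else 0
    data.append([win] * n_opponents)

naive_se, cluster_se = naive_and_cluster_se(data)
inflation = cluster_se / naive_se
print(f"naive SE={naive_se:.4f}  cluster SE={cluster_se:.4f}  ratio={inflation:.2f}")
```

In this limiting case the ratio equals \(\sqrt{\#\text{opponents}}\), because the effective sample size is the number of prompts rather than the number of comparisons. Real data sits between this extreme and full independence.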

3. Referee: "Put prompt difficulty in the PIM as a covariate." Best reply?

Why. A prompt covariate is shared by both responses, so \(\mathbf W_i - \mathbf W_j = 0\): prompt-cancellation. \(Z\) still identifies; only the prompt term cancels. Fix with effect modification + standardization. [L0004]
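The two-line argument can be written out as follows, assuming a logit link and a difference-form linear predictor. The notation \(Y_i \succ Y_j\) and the exact form of the predictor are assumptions about the PIM in L0004, not quoted from it.

\[
\operatorname{logit} P(Y_i \succ Y_j) = \beta_Z (Z_i - Z_j) + \beta_W^\top (\mathbf W_i - \mathbf W_j).
\]

When both responses answer the same prompt, \(\mathbf W_i = \mathbf W_j\), so

\[
\beta_W^\top (\mathbf W_i - \mathbf W_j) = \beta_W^\top \mathbf 0 = 0 \quad \text{for every } \beta_W .
\]

The likelihood is therefore flat in \(\beta_W\), while \(\beta_Z\) stays identified because \(Z_i - Z_j \neq 0\) for comparisons between different models.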

4. Referee: "Report the exponentiated coefficient as the adjusted win odds." Best reply?

Why. Report the standardized marginal \(\psi^\star/(1-\psi^\star)\); the nonlinear odds transform makes it differ from \(\exp(\beta_Z)\) even with no confounding. [L0004]
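A small numerical check of noncollapsibility. This is a generic logistic illustration rather than the paper's PIM: the two prompt categories, their log-odds shifts, and the value of `beta_z` are hypothetical, and the prompt mix is identical for both sides of the contrast, so there is no confounding.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

beta_z = 1.0                             # hypothetical conditional log-odds for the model contrast
shifts = {"easy": 2.0, "hard": -2.0}     # hypothetical per-category log-odds shifts
mix = {"easy": 0.5, "hard": 0.5}         # target prompt mix (sums to 1)

# Within each category the contrast multiplies the odds by exp(beta_z).
conditional_odds = math.exp(beta_z)

# Standardize on the probability scale over the target mix, then take odds.
psi_star = sum(mix[c] * expit(beta_z + shifts[c]) for c in mix)
marginal_odds = psi_star / (1.0 - psi_star)

print(f"exp(beta_Z)          = {conditional_odds:.3f}")
print(f"psi* / (1 - psi*)    = {marginal_odds:.3f}")
```

The marginal odds land closer to 1 than \(\exp(\beta_Z)\), because averaging happens on the probability scale and the odds transform is nonlinear.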

5. Referee: "Only one model is uniquely ranked — your method is too conservative." Best reply?

Why. The correlation-aware max-t cutoff hardly tightens the widths (15.5→15.2), so the unresolved middle is genuine sampling uncertainty, not conservative correction. [L0005]
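For intuition on why a correlation-aware cutoff can tighten intervals only slightly, here is a sketch of a max-t cutoff obtained by simulation. It assumes equicorrelated standard-normal test statistics as a stand-in for the estimated cluster-robust covariance. The function name, the 10 contrasts, and the correlation of 0.5 are hypothetical.

```python
import math
import random

def max_t_cutoff(n_contrasts, rho, level=0.95, n_sims=20000, seed=0):
    """Monte Carlo quantile of max_i |z_i| for equicorrelated standard normals.

    Each z_i = sqrt(rho) * shared + sqrt(1 - rho) * own, so corr(z_i, z_j) = rho.
    """
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    maxima = []
    for _ in range(n_sims):
        shared = rng.gauss(0.0, 1.0)
        maxima.append(
            max(abs(a * shared + b * rng.gauss(0.0, 1.0)) for _ in range(n_contrasts))
        )
    maxima.sort()
    index = min(n_sims - 1, math.ceil(level * n_sims) - 1)
    return maxima[index]

cutoff_independent = max_t_cutoff(10, rho=0.0)
cutoff_correlated = max_t_cutoff(10, rho=0.5)
print(f"independent: {cutoff_independent:.2f}  correlated: {cutoff_correlated:.2f}")
```

Both cutoffs sit well above the pointwise 1.96. Interval width scales with the cutoff, so a small gap between the two translates into a small change in width.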

6. Referee: "You found 305 cycles — the ranking is incoherent." Best reply?

Why. The strongest cycle's weakest edge is only \(z=1.51\); at \(\geq200\) comparisons/pair there are zero even in point estimates. Raw cycle-counting overstates intransitivity. [L0005]
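The weakest-edge logic can be sketched as below. A cycle only counts as evidence of intransitivity if every edge in it is distinguishable from a coin flip, so the edge with the smallest \(z\) decides. The edge values and helper names are hypothetical, and the cutoff is left as a parameter because the simultaneous cutoff used in the paper is not stated here.

```python
def weakest_edge_z(cycle_edges):
    """cycle_edges: (win_rate, standard_error) for each directed edge of a cycle,
    e.g. a beats b, b beats c, c beats a. Returns the smallest z against 0.5."""
    return min((p - 0.5) / se for p, se in cycle_edges)

def cycle_survives(cycle_edges, cutoff):
    # Every edge must clear the cutoff, so the weakest edge decides.
    return weakest_edge_z(cycle_edges) > cutoff

# Hypothetical 3-cycle: two clear edges and one marginal edge.
edges = [(0.56, 0.02), (0.53, 0.02), (0.60, 0.04)]

z_min = weakest_edge_z(edges)
print(f"weakest-edge z = {z_min:.2f}")
print("survives pointwise 1.96:", cycle_survives(edges, cutoff=1.96))
```

A simultaneous cutoff is larger than 1.96, so a cycle that fails the pointwise check fails the simultaneous one as well.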

7. Referee: "Standardization reordering 34/48 models is just noise." Best reply?

Why. The shift tracks how heterogeneous prompts are (per-category spread up to 0.142), and the moving quantity is exactly the one the difference-form PIM discards; its uncertainty comes from the same cluster-robust covariance. [L0004]
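To see how standardization can reorder models without any noise, consider a toy case with fixed per-category win rates. All names and numbers below are hypothetical; the only point is that the index depends on the prompt mix it is standardized to.

```python
def standardized_win_rate(rates_by_category, mix):
    """Weighted average of per-category win rates over a named prompt mix."""
    return sum(mix[c] * rates_by_category[c] for c in mix)

# Hypothetical per-category win rates (no sampling error at all).
model_x = {"code": 0.70, "chat": 0.50}
model_y = {"code": 0.55, "chat": 0.62}

observed_mix = {"code": 0.8, "chat": 0.2}   # mix that happened to be collected
target_mix = {"code": 0.2, "chat": 0.8}     # mix the claim is meant to cover

x_obs = standardized_win_rate(model_x, observed_mix)
y_obs = standardized_win_rate(model_y, observed_mix)
x_tgt = standardized_win_rate(model_x, target_mix)
y_tgt = standardized_win_rate(model_y, target_mix)

print(f"observed mix: X={x_obs:.3f}  Y={y_obs:.3f}")
print(f"target mix:   X={x_tgt:.3f}  Y={y_tgt:.3f}")
```

The ordering flips between the two mixes even though the per-category rates never change, and the flip is larger the more the categories differ.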

8. Referee: "Your estimand is meaningless because judges are biased." Best reply?

Why. The paper states this as a scope boundary: there is no judge-free ideal to recover without calibration data. The supported claims don't depend on a capability-vs-bias split. [L0006]

Where this leaves you

You can now trace any number in §7 to a function and a script, derive the SE inflation on a whiteboard, prove prompt-cancellation in two lines, state the noncollapsibility correction against the main draft, read the rank and cycle results without over- or under-claiming, and name the identification boundary that becomes Paper 2. That is the full method, defended.

Read next (primary source)

Read outline_paper1.md — the draft abstract (lines 11–15) and §7 preliminary findings, now that every claim in them has a lesson behind it. Then take the open decisions block to Jean: covariate-standardized (claim it) vs. matchmaking-standardized (Paper 2), and the cross-domain framing call.