Win-rate inference · Lesson 0006

What a log licenses, and where Paper 2 begins

The identification conditions, the candid list of what logs cannot identify, which assumptions are fragile, and how each fragility becomes a controllable lever in the design paper.

Why this lesson. Two of your three goals live here: defending the paper's scope ("you can't claim that from a log") and extending to Paper 2 (each fragile assumption is a seam the design paper turns into a lever). Inference (L0003) tells you the SE; identification tells you whether the target is even estimable from this data at all.

1. Estimation vs identification

L0003 answered "given the data, how precise is \(\hat\psi\)?" This lesson asks the prior question: "does the log even pin down \(\psi^\star\)?" An estimand is identified if it is a function of the observable data distribution. You can have a tight standard error around a number that doesn't answer your question — precision without identification. The paper earns trust by stating the conditions plainly and then listing, without apology, what cannot be recovered (outline_paper1.md §3 identification; draft_sections.tex §3).
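A small simulation may make "precision without identification" concrete. Everything in it is an assumption made for illustration, not something taken from the paper: a binary difficulty signal that the log does not record, win probabilities of 0.8 on easy prompts and 0.4 on hard ones, a target population that is half easy and half hard, and a logger that samples easy prompts 90% of the time.

```python
import math
import random

def simulate_log(n, p_easy_logged, seed=0):
    """Simulate n logged comparisons; difficulty is NOT recorded in the log."""
    rng = random.Random(seed)
    wins = []
    for _ in range(n):
        easy = rng.random() < p_easy_logged   # unlogged selection signal (assumed)
        p_win = 0.8 if easy else 0.4          # assumed win probabilities
        wins.append(1 if rng.random() < p_win else 0)
    return wins

def mean_and_se(xs):
    """Sample mean and its i.i.d. standard error."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, math.sqrt(var / n)

# Target estimand under the assumed 50/50 easy/hard population.
TARGET = 0.5 * 0.8 + 0.5 * 0.4

wins = simulate_log(20000, p_easy_logged=0.9)
est, se = mean_and_se(wins)
print(f"estimate={est:.3f}  se={se:.4f}  target={TARGET:.3f}")
```

Under these assumptions the log mean concentrates near 0.9 × 0.8 + 0.1 × 0.4 = 0.76 rather than the target 0.6, and collecting more rows only tightens the interval around the wrong number.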

2. The three conditions

| Condition | What it says | How a real log can break it |
| --- | --- | --- |
| Consistency | the observed outcome is the one that would occur under the realized comparison (well-defined potential outcomes, no interference) | usually safe; strains if context bleeds across comparisons |
| Sequential ignorability | adaptive matchmaking selects pairs only through logged history | fragile: a matchmaker that avoids "uninformative" pairs on unlogged difficulty signals |
| Positivity / overlap | every comparison (and target-prompt cell) we want has nonzero probability of being sampled | fragile: adaptive systems avoid uninformative pairs; never-matched pairs have zero overlap |
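The crudest form of the positivity condition can be checked directly: enumerate the pairs the estimand needs and count how often each one appears. The sketch below assumes a hypothetical log format, a list of `(model_a, model_b)` tuples, and hypothetical model names. A zero count shows an empirical overlap failure; a nonzero count does not by itself prove the sampling probability is bounded away from zero.

```python
from collections import Counter
from itertools import combinations

def pair_coverage(log_pairs, models):
    """Count logged comparisons for every unordered pair of models."""
    counts = Counter(frozenset(p) for p in log_pairs)
    return {
        pair: counts.get(frozenset(pair), 0)
        for pair in combinations(sorted(models), 2)
    }

def unsupported_pairs(coverage, min_count=1):
    """Pairs with fewer than min_count logged comparisons."""
    return sorted(pair for pair, c in coverage.items() if c < min_count)

# Hypothetical log: m2 and m3 are never matched against each other.
log_pairs = [("m1", "m2"), ("m2", "m1"), ("m1", "m3"), ("m1", "m2")]
coverage = pair_coverage(log_pairs, models=["m1", "m2", "m3"])
print(coverage)
print(unsupported_pairs(coverage))
```

Any pair returned by `unsupported_pairs` is one whose win-rate the log cannot support without an added modeling assumption.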

When the target prompt distribution \(p^\star\) differs from the logged one, identification additionally needs a density ratio (a reweighting) — and that reweighting is only valid where overlap holds.
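For discrete prompt cells, the reweighting can be sketched as follows. The notation is an assumption, since the lesson does not write the formula out: with logged cell shares \(p(x)\) and target shares \(p^\star(x)\), each logged row gets weight \(w(x) = p^\star(x)/p(x)\). The function name, the cell labels, and the `(cell, win)` row format are all hypothetical.

```python
from collections import Counter

def standardized_win_rate(log, target_dist):
    """log: list of (cell, win) with win in {0, 1}; target_dist: cell -> p_star(cell).

    Returns (estimate, missing_cells). A cell with target mass but no logged
    rows violates overlap, so the estimate is None when any cell is missing.
    """
    n_by_cell = Counter(cell for cell, _ in log)
    missing = sorted(c for c, p in target_dist.items() if p > 0 and n_by_cell[c] == 0)
    if missing:
        return None, missing
    n = len(log)
    total = 0.0
    for cell, win in log:
        p_logged = n_by_cell[cell] / n                    # empirical logged share p(x)
        weight = target_dist.get(cell, 0.0) / p_logged    # density ratio w(x)
        total += weight * win
    return total / n, missing

# Hypothetical log: 80% coding prompts (6 of 8 won), 20% writing (1 of 2 won).
log = [("coding", 1)] * 6 + [("coding", 0)] * 2 + [("writing", 1), ("writing", 0)]
estimate, missing = standardized_win_rate(log, {"coding": 0.5, "writing": 0.5})
naive = sum(win for _, win in log) / len(log)
print(naive, estimate, missing)
```

Here the naive log mean is 0.7, while the estimate standardized to a 50/50 target is about 0.625. Asking for a target cell that never appears in the log returns no estimate at all, which is the overlap restriction in code form.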

3. What logs cannot identify (stated as features)

Framing matters for the defense. The quantities a log cannot identify are not weaknesses to bury: they mark the boundary of what the estimand means. Stating them up front is what lets the supported claims (cluster-robust, covariate-standardized) stand unchallenged. A reviewer who raises one of them is agreeing with the paper.

4. The fragile asymptotics

The central limit result also needs a "many non-dominant clusters" condition (a Lindeberg-type requirement): no single cluster may contribute a non-negligible share of the total variance. One dominant judge or one dominant prompt family breaks it. This is directly relevant to the 3-judge benchmark data, where the judge ICC is only a crude estimate and the decision-relevant quantity is the SE inflation, not the ICC point value (winrate/README.md caveats; variance.oneway_icc is unreliable with very few clusters).
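One way to report SE inflation rather than an ICC is to compare a cluster-robust standard error with the naive i.i.d. one and to record the largest cluster's share of the rows as a rough dominance check. The sketch below is a generic sandwich-style calculation for a sample mean with no small-sample correction; it is not the repo's `variance` module, and the function name and toy data are hypothetical. With only three clusters the cluster-robust SE is itself a noisy estimate.

```python
import math
from collections import defaultdict

def se_inflation(outcomes, clusters):
    """Return (naive_se, cluster_se, inflation, max_cluster_share) for a mean."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Naive variance of the mean, treating all rows as independent.
    naive_var = sum((y - mean) ** 2 for y in outcomes) / (n * (n - 1))
    # Cluster-robust variance: sum residuals within each cluster, then square.
    resid_sum = defaultdict(float)
    size = defaultdict(int)
    for y, g in zip(outcomes, clusters):
        resid_sum[g] += y - mean
        size[g] += 1
    cluster_var = sum(s ** 2 for s in resid_sum.values()) / n ** 2
    naive_se = math.sqrt(naive_var)
    cluster_se = math.sqrt(cluster_var)
    inflation = cluster_se / naive_se if naive_se > 0 else float("nan")
    return naive_se, cluster_se, inflation, max(size.values()) / n

# Toy data: three judges who disagree strongly with one another.
outcomes = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
judges = ["j1"] * 4 + ["j2"] * 4 + ["j3"] * 4
naive_se, cluster_se, inflation, max_share = se_inflation(outcomes, judges)
print(f"naive={naive_se:.3f} cluster={cluster_se:.3f} "
      f"inflation={inflation:.2f} max_share={max_share:.2f}")
```

An inflation ratio well above 1 says the judge clustering matters for the decision, even when the ICC point value behind it is too crude to quote.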

5. The seams → Paper 2

The forward look. Paper 1 is honest about three fragilities; Paper 2 turns each into a controllable lever.

| Paper 1 fragility | Paper 2 lever |
| --- | --- |
| sequential ignorability (unlogged matchmaking selection) | design the matchmaking / log the propensity ⇒ off-policy reweighting becomes valid |
| positivity (adaptive systems avoid pairs) | deliberate assignment that guarantees overlap on the pairs you care about |
| many-cluster asymptotics (few judges) | design the judge panel / finite-sample corrections |
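To see why logging the propensity matters, consider a minimal sketch of off-policy reweighting once each comparison carries the probability with which the matchmaker chose it. The row format `(win, logged_propensity, target_prob)`, the function name, and the numbers are hypothetical. The estimator shown is a standard self-normalized importance-weighted mean, used here for illustration and not claimed to be the design paper's method.

```python
def ipw_win_rate(rows, min_propensity=1e-6):
    """Self-normalized importance-weighted mean of the win indicator.

    rows: iterable of (win, logged_propensity, target_prob) per comparison.
    """
    num = 0.0
    den = 0.0
    for win, propensity, target_prob in rows:
        if propensity < min_propensity:
            raise ValueError("logged propensity too small: overlap fails")
        w = target_prob / propensity   # importance weight for this comparison
        num += w * win
        den += w
    if den == 0.0:
        raise ValueError("all weights are zero")
    return num / den

# Hypothetical log: the matchmaker picks pair type P with probability 0.75
# and type Q with probability 0.25; the target weights both types equally.
rows = (
    [(1, 0.75, 0.5)] * 3 + [(0, 0.75, 0.5)] * 3   # type P: 3 wins in 6
    + [(1, 0.25, 0.5)] * 2                        # type Q: 2 wins in 2
)
naive = sum(win for win, _, _ in rows) / len(rows)
print(naive, ipw_win_rate(rows))
```

The naive mean is 0.625, while the reweighted estimate is about 0.75, the equal-weight average of the two per-type win-rates. Without the logged propensity column the weights cannot be formed, which is the gap between the two papers.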

Concretely, the open decision with Jean: Paper 1 claims covariate-standardized win-rates (supported — Arena logs prompt covariates), but not matchmaking-standardized win-rates (the matchmaking propensity is unlogged). That exact line — supported now vs. needs-design — is the boundary between the two papers (outline_paper1.md, "Open decisions to settle with Jean").

Retrieval check — answer from memory before moving on

1. What is the difference between identification and estimation here?

Why. Identification asks whether the estimand is a function of the observable distribution; estimation asks how precisely you can pin it down. You can have a tight SE around a non-identified (wrong) target.

2. Which assumption does adaptive matchmaking most directly threaten?

Why. If matchmaking selects pairs on unlogged difficulty, selection isn't captured by logged history and sequential ignorability fails. Positivity is also threatened, but the selection mechanism is the ignorability issue.

3. Which quantity is not identified from observational logs?

Why. Judges are a selected sample, so \(\psi^\star\) is defined relative to a judge distribution; there is no judge-free ideal to recover without external calibration. The other three are supported claims.

4. Why is the many-cluster condition relevant to the 3-judge benchmark data?

Why. The CLT needs many non-dominant clusters; with only three judges the judge dimension is near-degenerate, so the judge ICC is crude and the asymptotics on that axis are fragile — report the SE inflation, not the ICC.

5. Which standardization does Paper 1 claim is supported by Arena logs?

Why. Arena logs prompt covariates, so covariate standardization is supported. Matchmaking propensity is unlogged, so matchmaking-standardized win-rates need Paper 2's design or an added assumption.

Read next (primary source)

In-repo: outline_paper1.md §3 (the identification subsection and the "what logs cannot identify" list) and §8 (Discussion and limitations), plus the "Open decisions to settle with Jean" block. That block is the precise statement of the Paper 1 / Paper 2 boundary you're defending.