L0006 — Identification & the seams to Paper 2

Why this lesson. Two of your three goals live here: defending the paper's scope ("you can't claim that from a log") and extending to Paper 2 (each fragile assumption is a seam the design paper turns into a lever). Inference (L0003) tells you the SE; identification tells you whether the target is even estimable from this data at all.

1. Estimation vs identification

L0003 answered "given the data, how precise is \(\hat\psi\)?" This lesson asks the prior question: "does the log even pin down \(\psi^\star\)?" An estimand is identified if it is a function of the observable data distribution. You can have a tight standard error around a number that doesn't answer your question — precision without identification. The paper earns trust by stating the conditions plainly and then listing, without apology, what cannot be recovered (outline_paper1.md §3 identification; draft_sections.tex §3).
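
A minimal simulation can make "precision without identification" concrete. The sketch below is illustrative only: the win probabilities, the easy/hard split, and the function name `simulate_log` are assumptions invented for the example, not quantities from the paper. The matchmaker oversamples easy prompts on a difficulty signal that is never logged, so the log's win rate converges tightly to the wrong number.

```python
import math
import random

def simulate_log(n, p_easy_logged, seed=0):
    """Simulate n logged comparisons; return (win-rate estimate, naive SE)."""
    rng = random.Random(seed)
    # Hypothetical truth: A beats B 80% of the time on easy prompts, 40% on hard.
    win_prob = {"easy": 0.8, "hard": 0.4}
    wins = 0
    for _ in range(n):
        # Difficulty drives sampling but is NOT recorded in the log.
        kind = "easy" if rng.random() < p_easy_logged else "hard"
        wins += rng.random() < win_prob[kind]
    rate = wins / n
    se = math.sqrt(rate * (1 - rate) / n)
    return rate, se

# Target population: half easy, half hard prompts.
psi_star = 0.5 * 0.8 + 0.5 * 0.4
# The logged matchmaker samples easy prompts 90% of the time.
rate, se = simulate_log(n=200_000, p_easy_logged=0.9)
print(f"target={psi_star:.3f} estimate={rate:.3f} naive_se={se:.4f}")
```

The standard error shrinks with n, but the gap between the estimate and the target does not: more data only sharpens the answer to a different question.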

2. The three conditions

condition	what it says	how a real log can break it
Consistency	the observed outcome is the one that would occur under the realized comparison (well-defined potential outcomes, no interference)	usually safe; strains if context bleeds across comparisons
Sequential ignorability	adaptive matchmaking selects pairs only through logged history	fragile: a matchmaker that avoids "uninformative" pairs on unlogged difficulty signals
Positivity / overlap	every comparison (and target-prompt cell) we want has nonzero probability of being sampled	fragile: adaptive systems avoid uninformative pairs; never-matched pairs have zero overlap

When the target prompt distribution \(p^\star\) differs from the logged one, identification additionally needs a density ratio (a reweighting) — and that reweighting is only valid where overlap holds.
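
The reweighting step can be sketched as follows, assuming prompts fall into discrete covariate cells and the logged cell frequencies are estimated from the log itself. The function names, the cell labels, and the toy rows are hypothetical. The sketch also reports any target cell with no logged mass, which is exactly where the density ratio is undefined and overlap fails.

```python
from collections import Counter

def density_ratio_weights(logged_cells, target_dist):
    """Return per-cell weights p_star(cell) / q_hat(cell), plus target cells lacking overlap."""
    n = len(logged_cells)
    counts = Counter(logged_cells)
    q_hat = {cell: c / n for cell, c in counts.items()}
    # Target cells that never appear in the log cannot be reweighted.
    no_overlap = [cell for cell, p in target_dist.items() if p > 0 and cell not in q_hat]
    weights = {cell: target_dist.get(cell, 0.0) / q for cell, q in q_hat.items()}
    return weights, no_overlap

def reweighted_win_rate(rows, weights):
    """Self-normalized weighted win rate over (cell, win) rows."""
    num = sum(weights[cell] * win for cell, win in rows)
    den = sum(weights[cell] for cell, _ in rows)
    return num / den

# Toy log: "code" prompts are overrepresented relative to the target.
rows = [("code", 1), ("code", 1), ("code", 0), ("code", 1), ("chat", 0), ("chat", 1)]
target = {"code": 0.5, "chat": 0.5}
weights, no_overlap = density_ratio_weights([c for c, _ in rows], target)
estimate = reweighted_win_rate(rows, weights)
print(weights, no_overlap, estimate)
```

If `no_overlap` is non-empty, the reweighted estimate silently targets a renormalized distribution over the covered cells, so that list should be reported alongside the number.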

3. What logs cannot identify (stated as features)

PI for never-matched pairs — no comparisons, no estimand, absent a structural model that borrows strength.
A judge-free "true" human preference — judges are a selected sample; \(\psi^\star\) is defined relative to a judge distribution, not to a Platonic ideal.
Any capability-vs-judge-bias split — you cannot separate "model is better" from "judge prefers this style" without external calibration data.
Pairwise indices for unobserved pairs — only a model (e.g. BT) fills these, at the cost of its assumption.
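
Before making any pairwise claim, it helps to enumerate which pairs the log never compared. A minimal sketch, assuming the log can be reduced to a list of (model, model) tuples; the helper name and the toy data are hypothetical:

```python
from itertools import combinations

def never_matched_pairs(models, log):
    """List unordered model pairs that have no comparison in the log."""
    # Order within a comparison does not matter for overlap.
    seen = {frozenset(pair) for pair in log}
    return [pair for pair in combinations(sorted(models), 2)
            if frozenset(pair) not in seen]

models = ["A", "B", "C", "D"]
log = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")]
missing = never_matched_pairs(models, log)
print(missing)
```

Any number reported for a pair in `missing` comes from a structural model rather than from data on that pair.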

Framing matters for the defense. These are not weaknesses to bury — they are the boundary of what the estimand means. Stating them up front is what lets the supported claims (cluster-robust, covariate-standardized) stand unchallenged. A reviewer who raises one of these is agreeing with the paper.

4. The fragile asymptotics

The central limit result also needs a many-non-dominant-clusters condition (a Lindeberg-type requirement). One dominant judge or one dominant prompt family breaks it. This is directly relevant to the 3-judge benchmark data, where the judge ICC is only a crude estimate and the decision-relevant quantity is the SE inflation, not the ICC point value (winrate/README.md caveats; variance.oneway_icc is unreliable with very few clusters).
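
One way to look at both quantities is to compare a naive SE with a cluster-robust one and to report the largest cluster's share of the data. The sketch below is an assumption-laden illustration, not the repository's implementation: it uses a plain sandwich variance for a mean with no small-sample correction, and the function name and toy outcomes are hypothetical. With only three clusters the cluster-robust SE is itself unstable, which is the point of the section.

```python
import math
from collections import defaultdict

def se_inflation(outcomes, clusters):
    """Compare naive and cluster-robust SEs of a mean; report cluster dominance."""
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Naive variance of the mean, treating all rows as independent.
    naive_var = sum((y - mean) ** 2 for y in outcomes) / (n * (n - 1))
    resid_sums = defaultdict(float)
    sizes = defaultdict(int)
    for y, g in zip(outcomes, clusters):
        resid_sums[g] += y - mean
        sizes[g] += 1
    # Sandwich variance: squared within-cluster residual sums, no correction.
    cluster_var = sum(s ** 2 for s in resid_sums.values()) / n ** 2
    return {
        "naive_se": math.sqrt(naive_var),
        "cluster_se": math.sqrt(cluster_var),
        "inflation": math.sqrt(cluster_var / naive_var),
        "max_cluster_share": max(sizes.values()) / n,
    }

# Toy data: three judges who largely disagree with each other.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judges = ["j1"] * 4 + ["j2"] * 4 + ["j3"] * 2
result = se_inflation(outcomes, judges)
print(result)
```

A large `max_cluster_share` is a warning that one cluster may dominate; an `inflation` well above 1 says the naive SE understates uncertainty.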

5. The seams → Paper 2

The forward look. Paper 1 is honest about three fragilities; Paper 2 turns each into a controllable lever.

Paper 1 fragility	Paper 2 lever
sequential ignorability (unlogged matchmaking selection)	design the matchmaking / log the propensity ⇒ off-policy reweighting becomes valid
positivity (adaptive systems avoid pairs)	deliberate assignment that guarantees overlap on the pairs you care about
many-cluster asymptotics (few judges)	design the judge panel / finite-sample corrections

Concretely, the open decision with Jean: Paper 1 claims covariate-standardized win-rates (supported — Arena logs prompt covariates), but not matchmaking-standardized win-rates (the matchmaking propensity is unlogged). That exact line — supported now vs. needs-design — is the boundary between the two papers (outline_paper1.md, "Open decisions to settle with Jean").
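
To see what logging the propensity would buy, here is a hedged sketch of a matchmaking-standardized win rate. It assumes a Paper 2 style log in which every row records the probability with which the matchmaker chose that pair; the row layout, the function name, and the target pair distribution are all hypothetical, and current Arena logs do not contain this field.

```python
def matchmaking_standardized_win_rate(rows, target_prob):
    """Self-normalized inverse-propensity estimate over (pair, win, propensity) rows."""
    num = den = 0.0
    for pair, win, propensity in rows:
        if propensity <= 0:
            # A zero propensity means positivity fails for this pair.
            raise ValueError("logged propensity must be positive")
        w = target_prob[pair] / propensity
        num += w * win
        den += w
    return num / den

# Toy log: the matchmaker favored pair ("A", "B") with probability 0.8.
rows = [
    (("A", "B"), 1, 0.8),
    (("A", "B"), 1, 0.8),
    (("A", "B"), 0, 0.8),
    (("A", "B"), 1, 0.8),
    (("A", "C"), 0, 0.2),
]
# Hypothetical target: both pairs equally likely.
target_prob = {("A", "B"): 0.5, ("A", "C"): 0.5}
estimate = matchmaking_standardized_win_rate(rows, target_prob)
print(estimate)
```

Without the logged `propensity` column the weights cannot be formed from data, which is why this standardization sits on the Paper 2 side of the line.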

Retrieval check — answer from memory before moving on

1. What is the difference between identification and estimation here?

Why. Identification asks whether the estimand is a function of the observable distribution; estimation asks how precisely you can pin it down. You can have a tight SE around a non-identified (wrong) target.

2. Which assumption does adaptive matchmaking most directly threaten?

Why. If matchmaking selects pairs on unlogged difficulty, selection isn't captured by logged history and sequential ignorability fails. Positivity is also threatened, but the selection mechanism is the ignorability issue.

3. Which quantity is not identified from observational logs?

Why. Judges are a selected sample, so \(\psi^\star\) is defined relative to a judge distribution; there is no judge-free ideal to recover without external calibration. The other three are supported claims.

4. Why is the many-cluster condition relevant to the 3-judge benchmark data?

Why. The CLT needs many non-dominant clusters; with only three judges the judge dimension is near-degenerate, so the judge ICC is crude and the asymptotics on that axis are fragile — report the SE inflation, not the ICC.

5. Which standardization does Paper 1 claim is supported by Arena logs?

Why. Arena logs prompt covariates, so covariate standardization is supported. Matchmaking propensity is unlogged, so matchmaking-standardized win-rates need Paper 2's design or an added assumption.

What a log licenses, and where Paper 2 begins

The identification conditions, the candid list of what logs cannot identify, which assumptions are fragile, and how each fragility becomes a controllable lever in the design paper.
