1. Estimation vs identification
L0003 answered "given the data, how precise is \(\hat\psi\)?" This lesson asks the prior question: "does the
log even pin down \(\psi^\star\)?" An estimand is identified if it can be written as a function of the
distribution of the observable data, so that two worlds producing the same log distribution cannot disagree
about its value. Without that, you can have a tight standard error around a number that doesn't answer your
question: precision without identification. The paper earns trust by stating the conditions plainly and then
listing, without apology, what cannot be recovered (outline_paper1.md §3 identification; draft_sections.tex §3).
2. The three conditions
| condition | what it says | how a real log can break it |
|---|---|---|
| Consistency | the observed outcome is the one that would occur under the realized comparison (well-defined potential outcomes, no interference) | usually safe; strains if context bleeds across comparisons |
| Sequential ignorability | adaptive matchmaking selects pairs only through logged history | fragile: a matchmaker that avoids "uninformative" pairs on unlogged difficulty signals |
| Positivity / overlap | every comparison (and target-prompt cell) we want has nonzero probability of being sampled | fragile: adaptive systems avoid uninformative pairs; never-matched pairs have zero overlap |
When the target prompt distribution \(p^\star\) differs from the logged one, identification additionally needs a density ratio (a reweighting) — and that reweighting is only valid where overlap holds.
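The reweighting can be sketched in Python, assuming prompts fall into discrete cells and the log is a list of `(cell, win)` records. The function name, the record layout, and the toy numbers are illustrative assumptions, not code from the repo. The density ratio is estimated as the target probability of a cell divided by its logged frequency, and the sketch refuses to return an estimate when a target cell has no logged comparisons, since the ratio is undefined there.

```python
from collections import Counter

def reweighted_win_rate(log, target_dist):
    """Density-ratio reweighting of a win indicator over prompt cells.

    log:         list of (cell, win) records, win in {0, 1}   (hypothetical layout)
    target_dist: dict cell -> target probability p*(cell)
    Returns (estimate, unsupported_cells). The estimate is None when some
    target cell was never logged, because overlap fails there.
    """
    n = len(log)
    counts = Counter(cell for cell, _ in log)

    # Overlap check: every cell with positive target mass must appear in the log.
    unsupported = sorted(c for c, p in target_dist.items() if p > 0 and counts[c] == 0)
    if unsupported:
        return None, unsupported

    # Estimated density ratio per cell: p*(cell) / logged frequency of the cell.
    weights = {c: target_dist.get(c, 0.0) / (counts[c] / n) for c in counts}

    # Weighted average of the win indicator over the logged records.
    estimate = sum(weights[cell] * win for cell, win in log) / n
    return estimate, []

# Toy log: "code" prompts are over-represented relative to the target.
toy_log = [("code", 1), ("code", 1), ("code", 0), ("chat", 0)]
print(reweighted_win_rate(toy_log, {"code": 0.5, "chat": 0.5}))
print(reweighted_win_rate(toy_log, {"code": 0.5, "math": 0.5}))  # "math" never logged
```

In the second call the target puts mass on a cell the log never sampled, so the sketch returns no estimate and names the offending cell instead of silently extrapolating.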
3. What logs cannot identify (stated as features)
- PI for never-matched pairs — no comparisons, no estimand, absent a structural model that borrows strength.
- A judge-free "true" human preference — judges are a selected sample; \(\psi^\star\) is defined relative to a judge distribution, not to a Platonic ideal.
- Any capability-vs-judge-bias split — you cannot separate "model is better" from "judge prefers this style" without external calibration data.
- Pairwise indices for unobserved pairs — only a model (e.g. BT) fills these, at the cost of its assumption.
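The first and last items in this list can be checked mechanically before any estimation. The following is a minimal sketch, assuming the log can be reduced to a list of `(model_a, model_b)` tuples; the function name and the toy model names are hypothetical.

```python
from itertools import combinations

def never_matched_pairs(models, comparisons):
    """Return the unordered model pairs that never appear in the log.

    models:      iterable of model names
    comparisons: iterable of (model_a, model_b) tuples, order irrelevant
    """
    # Treat (a, b) and (b, a) as the same pair.
    seen = {frozenset(pair) for pair in comparisons}
    return [
        (a, b)
        for a, b in combinations(sorted(models), 2)
        if frozenset((a, b)) not in seen
    ]

toy_comparisons = [("A", "B"), ("B", "A"), ("A", "C")]
print(never_matched_pairs(["A", "B", "C"], toy_comparisons))
```

Any pair this returns has zero overlap in the log: a number reported for it comes from a structural model such as BT, not from the data alone.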
4. The fragile asymptotics
The central limit result also needs a "many non-dominant clusters" condition (a Lindeberg-type
requirement): no single cluster may contribute a non-vanishing share of the total variance. One dominant judge
or one dominant prompt family breaks it. This is directly relevant to the 3-judge benchmark data, where the
judge ICC is only a crude estimate and the decision-relevant quantity is the SE inflation, not the ICC point
value (winrate/README.md caveats; variance.oneway_icc is unreliable with very few clusters).
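To see why SE inflation rather than the ICC point value drives the decision, here is a rough sketch using the standard design-effect formula for a simple mean under an exchangeable within-cluster correlation. This is a textbook approximation, not the paper's estimator; the function names, the cluster sizes, and the ICC values are all illustrative assumptions, and the code does not call `variance.oneway_icc`.

```python
import math

def design_effect(cluster_sizes, icc):
    """Variance inflation of a simple mean under exchangeable within-cluster
    correlation `icc`: 1 + (m_eff - 1) * icc, with m_eff = sum(m^2) / sum(m)."""
    n_total = sum(cluster_sizes)
    m_eff = sum(m * m for m in cluster_sizes) / n_total
    return 1.0 + (m_eff - 1.0) * icc

def se_inflation(cluster_sizes, icc):
    """Factor by which the naive (independence) SE understates the clustered SE."""
    return math.sqrt(design_effect(cluster_sizes, icc))

def largest_cluster_share(cluster_sizes):
    """Crude dominance diagnostic: share of observations in the biggest cluster."""
    return max(cluster_sizes) / sum(cluster_sizes)

# Hypothetical panel: 3 judges with 100 comparisons each.
sizes = [100, 100, 100]
for icc in (0.01, 0.05, 0.10):
    print(icc, round(design_effect(sizes, icc), 2), round(se_inflation(sizes, icc), 2))
print(largest_cluster_share(sizes))
```

Under these assumed numbers an ICC as small as 0.05 gives a design effect of 5.95, so the standard error is roughly 2.4 times the naive one. With only three clusters, each judge also holds a third of the data, which is far from "many non-dominant clusters".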
5. The seams → Paper 2
| Paper 1 fragility | Paper 2 lever |
|---|---|
| sequential ignorability (unlogged matchmaking selection) | design the matchmaking / log the propensity ⇒ off-policy reweighting becomes valid |
| positivity (adaptive systems avoid pairs) | deliberate assignment that guarantees overlap on the pairs you care about |
| many-cluster asymptotics (few judges) | design the judge panel / finite-sample corrections |
Concretely, the open decision with Jean: Paper 1 claims covariate-standardized win-rates (supported —
Arena logs prompt covariates), but not matchmaking-standardized win-rates (the matchmaking
propensity is unlogged). That exact line — supported now vs. needs-design — is the boundary between the two
papers (outline_paper1.md, "Open decisions to settle with Jean").
Check yourself
1. What is the difference between identification and estimation here?
2. Which assumption does adaptive matchmaking most directly threaten?
3. Which quantity is not identified from observational logs?
4. Why is the many-cluster condition relevant to the 3-judge benchmark data?
5. Which standardization does Paper 1 claim is supported by Arena logs?
Read next (primary source)
In-repo: read outline_paper1.md §3 (the identification subsection and
the "what logs cannot identify" list) and §8 (Discussion and limitations), plus the "Open decisions to settle
with Jean" block. That is the precise statement of the Paper 1 / Paper 2 boundary you're defending.