7  Instrumental Variables

NoteLearning Objectives

By the end of this chapter, students should be able to:

  1. Diagnose the endogeneity problem: explain why an unobserved confounder makes back-door adjustment fail, and describe how an instrumental variable provides an alternative identification route.
  2. State the three IV assumptions — relevance, exogeneity, and exclusion — in the DAG/do-calculus, structural-equation, and potential-outcomes languages, and explain which are testable and which require institutional justification.
  3. Follow how the IV assumptions identify the Wald estimand, and interpret the reduced form and first stage as its two observable components.
  4. Explain what the Wald estimand identifies when treatment effects are heterogeneous — the Local Average Treatment Effect for compliers — why that estimand depends on the choice of instrument, and when it coincides with the ATE.
  5. Compare IV and back-door adjustment on the assumptions each requires, the estimand each identifies, and the ways each can fail.

7.1 Why Instrumental Variables?

Chapter 6 showed how causal effects can be identified when all confounders are observed and can be blocked by conditioning on \(X\). When some confounders are unobserved, back-door adjustment fails. This chapter develops instrumental variables (IV) as an alternative identification strategy: rather than blocking the confounding path \(T \leftarrow U \to Y\), IV exploits an external variable \(Z\) whose effect on \(T\) is free of confounding by \(U\). This chapter asks what IV identifies and under what assumptions; Chapter 13 asks how that estimand is computed and tested in practice.

7.1.1 The Endogeneity Problem

Unconfoundedness \((Y(0), Y(1)) \indep T \mid X\) requires that every variable affecting both treatment and outcome is observed and included in \(X\). In many empirical settings this is implausible: in labor economics, unobserved ability or motivation affects both schooling decisions and wages; in epidemiology, unobserved health behaviors affect both treatment uptake and outcomes. Whenever an unobserved confounder \(U\) creates a back-door path \(T \leftarrow U \to Y\), the adjustment formula fails: \[\int f(y \mid t, x)\, p(x)\, dx \;\neq\; f(y \mid \doop(T{=}t)).\] The gap is the endogeneity bias. We need a different identification strategy.
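A small simulation (all numbers illustrative, not from the text) makes the gap concrete: when the confounder \(U\) is unobserved and nothing can be adjusted on, the naive treatment–control contrast does not recover the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Unobserved confounder U raises both treatment uptake and the outcome.
U = rng.normal(size=n)
T = (U + rng.normal(size=n) > 0).astype(float)  # treatment depends on U
Y = 1.0 * T + 2.0 * U + rng.normal(size=n)      # true effect of T is 1.0

# Naive contrast with no U to adjust on: biased upward by the back-door path.
naive = Y[T == 1].mean() - Y[T == 0].mean()
print(naive)  # substantially larger than the true effect 1.0
```

The bias here is the endogeneity bias of the adjustment formula: the contrast mixes the causal effect of \(T\) with the association flowing through \(T \leftarrow U \to Y\).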

7.1.2 The IV Idea

An instrumental variable \(Z\) is an observed variable that: (1) moves treatment \(T\) (relevance); (2) does so exogenously, meaning \(Z\) is unrelated to the unobserved confounder \(U\) (exogeneity); (3) affects \(Y\) only through \(T\), meaning \(Z\) has no direct path to \(Y\) (exclusion). Under these three conditions, the variation in \(T\) induced by \(Z\) is free of confounding by \(U\), so the ratio of the \(Z\)-induced variation in \(Y\) to the \(Z\)-induced variation in \(T\) becomes a meaningful target for identification.

NoteExample: Returns to Schooling

A researcher wants to estimate the causal effect of years of schooling \(T\) on log wages \(Y\). Unobserved ability \(U\) raises both schooling and wages, so OLS is biased upward. No set of observed covariates \(X\) fully captures ability. An instrumental variable \(Z\) that shifts schooling for reasons unrelated to ability — such as proximity to a school or a policy change in compulsory attendance laws — provides a way to isolate the causal effect of schooling on wages.

IV does not eliminate the confounding path \(T \leftarrow U \to Y\), nor does it make treatment as-if randomly assigned for the full population. Instead, IV avoids confounding rather than controlling for it — a distinction that matters both for interpreting what is identified (the LATE for compliers, not the ATE) and for understanding which assumption does the heaviest lifting (exclusion, not unconfoundedness).

7.2 Graphical Setup and Core Assumptions

7.2.1 The IV DAG

The causal structure for a basic IV model with covariates \(X\) is:

[Figure: the basic IV DAG with covariates \(X\) and unobserved confounder \(U\). Blue arrows: causal effects. Dashed red: confounding paths. Green: observed covariate effects. The label (1) on the \(Z \to T\) edge marks the relevance assumption. The absence of a \(Z \to Y\) arrow encodes exclusion.]

The three IV assumptions correspond to three distinct features of this DAG: (1) Relevance: there is a directed path \(Z \to T\); (2) Exogeneity: conditional on \(X\), the DAG implies \(Z \indep U \mid X\) by d-separation; (3) Exclusion: every directed path from \(Z\) to \(Y\) in \(\mathcal{G}\) passes through \(T\) — there is no direct edge \(Z \to Y\).

The distinction between exogeneity and exclusion is fundamental. Exogeneity says the instrument is not confounded with latent causes of the outcome. Exclusion says the instrument has no causal channel to the outcome except through treatment. A randomized encouragement may be exogenous by design, yet still violate exclusion if the encouragement changes outcomes through information, motivation, or stigma apart from the treatment itself.

7.2.2 The Three Assumptions in Three Languages

The same three assumptions can be expressed in three causal languages. These are parallel formulations, not literally identical statements: each highlights a different aspect of the design. The graphical formulation is most useful for causal design; the structural formulation is most useful for deriving moment restrictions; the potential-outcomes formulation prepares the ground for the LATE framework. The assumptions are ordered relevance \(\to\) exogeneity \(\to\) exclusion, reflecting the natural sequence in which a researcher assesses them.

| Assumption | Potential outcomes | Structural / econometric | Do-calculus / DAG |
|---|---|---|---|
| Relevance | \(P(T_i(1) \neq T_i(0)) > 0\) | \(\pi \neq 0\) in \(T = \pi Z + \delta^\top X + \eta\) | \(Z \to T\) in \(\mathcal{G}\) (no d-separation) |
| Exogeneity | \(Z \indep (Y(0), Y(1), T(0), T(1)) \mid X\) | \(\E[\varepsilon \mid Z, X] = 0\) | \(Z \indep U \mid X\) |
| Exclusion | \(Y_i(t, z) = Y_i(t, z')\) for all \(z, z'\) | \(Z\) absent from structural equation for \(Y\) | \(f(y \mid x, \doop(T{=}t), z) = f(y \mid x, \doop(T{=}t))\) |

NoteRemark: Three Languages, One Identification Argument

These three formulations are aligned but not literally identical. The graphical statement encodes causal structure. The potential-outcomes statement encodes counterfactual independence. The structural formulation encodes moment orthogonality. In a well-specified IV model they support the same identification argument, but they are not interchangeable symbols.

Relevance is about the \(Z \to T\) link — \(Z\) must genuinely move \(T\). Exogeneity is about the absence of omitted common causes linking \(Z\) to \(Y\). Exclusion is about the absence of any direct causal path \(Z \to Y\) that bypasses \(T\).

Why the do-calculus formulation is preferred. The exclusion restriction in the do-calculus column reads: \[f(y \mid x, \doop(T{=}t), z) = f(y \mid x, \doop(T{=}t)).\] This is a statement about the interventional density — the distribution of \(Y\) after we have set \(T = t\) by do-surgery. The do-operator makes it impossible to confuse this with the observational statement \(f(y \mid x, T{=}t, z) = f(y \mid x, T{=}t)\), which is a much weaker condition.

7.2.3 Relevance

NoteDefinition: Relevance

The instrument \(Z\) is relevant if it has a non-zero causal effect on the treatment \(T\) within at least some stratum of covariates \(X\): \(P(T \mid \doop(Z{=}z), X{=}x)\) varies with \(z\) for some \(x\). In the linear first-stage model \(T = \pi Z + \delta^\top X + \eta\), this reduces to \(\pi \neq 0\).

Graphically, relevance means \(Z\) and \(T\) are not d-separated in \(\mathcal{G}\). The conditional covariance \(\mathrm{Cov}(Z, T \mid X)\) and the first-stage \(F\)-statistic are empirical diagnostics for the observable association; the causal claim that \(Z\) shifts \(T\) is design-based.
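As a sketch (simulated data, not from any study), the first-stage \(F\)-statistic for a single instrument is the squared \(t\)-ratio of \(Z\) in the regression of \(T\) on \(Z\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
pi = 0.3
Z = rng.normal(size=n)
T = pi * Z + rng.normal(size=n)  # first stage with slope pi

# OLS of T on Z with intercept; with one instrument, F = t^2.
Zc = Z - Z.mean()
pi_hat = (Zc @ (T - T.mean())) / (Zc @ Zc)
resid = (T - T.mean()) - pi_hat * Zc
se = np.sqrt((resid @ resid) / (n - 2) / (Zc @ Zc))
F = (pi_hat / se) ** 2
print(F)  # roughly n * pi^2 = 450 in this design
```

A large \(F\) certifies only the observable association; the causal reading of the \(Z \to T\) link still rests on the design.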

7.2.4 Exogeneity

NoteDefinition: Exogeneity

The instrument \(Z\) is exogenous given covariates \(X\) if there is no open back-door path from \(Z\) to \(Y\) through unobserved causes: \(Z \indep U \mid X\) in \(\mathcal{G}\) by d-separation.

In the structural linear model, the same requirement appears as \(\E[Z\varepsilon \mid X] = 0\), because the structural residual \(\varepsilon\) is a function of \(U\). This moment condition is a consequence of the graphical assumption, not a definition of exogeneity. A researcher who adopts the moment condition as a primitive has no guarantee that \(Z\) is free of back-door paths to the outcome.

NoteExample: Quarter of Birth (Angrist and Krueger 1991)

Angrist and Krueger (1991) used quarter of birth as an instrument for schooling to estimate the return to education. Exogeneity requires no open back-door path from quarter of birth to wages through unobserved variables such as ability or family background; birth timing is largely outside parental control. Bound et al. (1995) later raised concerns about weak first stages in some specifications.

Exogeneity is untestable: the unobserved confounder \(U\) is, by definition, unobserved.

7.2.5 Exclusion

NoteDefinition: Exclusion Restriction

The instrument \(Z\) satisfies the exclusion restriction if it affects the outcome \(Y\) only through its effect on the treatment \(T\). In do-calculus terms: \[f(y \mid x, \doop(T{=}t), z) = f(y \mid x, \doop(T{=}t)) \quad \text{for all } z.\] In potential outcomes terms: \(Y_i(t, z) = Y_i(t, z')\) for all \(z \neq z'\).

Exclusion is a causal restriction, not a conditional-correlation restriction. Adding a direct arrow \(Z \to Y\) to the DAG is exactly the formal counterpart of exclusion failing. In the mutilated graph \(\mathcal{G}_{\overline{T}}\), the path \(Z \to Y\) would remain open. Rule 1 of the do-calculus, which would allow removing \(Z\) from the density of \(Y\) given \(\doop(T{=}t)\), no longer applies.

WarningWhy Exclusion Is the Hardest Assumption

Relevance can be tested. Exogeneity can sometimes be defended by design. But exclusion — that \(Z\) has no direct effect on \(Y\) whatsoever — is:

  • Untestable in just-identified models.
  • Easy to violate in practice. An instrument that “moves treatment” often also moves other inputs. If \(Z\) is distance to a hospital (instrument for treatment uptake), it may also directly affect health outcomes through travel time in emergencies.
  • Consequential. Even a small direct effect of \(Z\) on \(Y\) can cause substantial bias in the Wald estimand, especially when the first stage is weak (derived in Section 7.4).
NoteRemark: Three Assumptions Are Necessary but Not Sufficient

The three IV assumptions are necessary but not sufficient for point identification of any causal estimand (Hernán and Robins 2006; Levis et al. 2024). A fourth structural assumption on the counterfactual distribution of treatment response or treatment effect heterogeneity is required. The choice of fourth assumption determines which causal quantity the Wald estimand identifies: constant treatment effects (Section 7.3), under which the Wald estimand identifies the ATE; or monotonicity (Section 7.7), under which it identifies the LATE for compliers.

Strikingly, across these different identification schemes the same conditional Wald formula \(\mathrm{Cov}(Z, Y \mid X)/\mathrm{Cov}(Z, T \mid X)\) appears as the identifying expression in many cases — but its causal target changes. The three core IV assumptions are all expressible within the graphical language; the fourth assumption, by contrast, lies outside the graphical language in every case.

NoteExample: Same Observables, Different ATEs

Let \(Z, T \in \{0,1\}\) with \(\E[T \mid Z{=}1] - \E[T \mid Z{=}0] = 0.2\) and \(\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0] = 0.1\), giving Wald ratio \(= 0.5\). Both DGPs share the same compliance structure: \(P(\text{co}) = 0.2\), \(P(\text{at}) = 0.4\), \(P(\text{nt}) = 0.4\).

DGP A (constant effects): \(\tau_i = 0.5\) for all units. \(\text{ATE} = 0.5\). Wald ratio \(= 0.5\).

DGP B (heterogeneous effects + monotonicity): \(\tau_{\text{co}} = 0.5\), \(\tau_{\text{at}} = \tau_{\text{nt}} = 0\). \(\text{ATE} = 0.2 \times 0.5 = 0.1\). Wald ratio \(= 0.5\).

Both DGPs produce the same first stage, reduced form, and Wald ratio. Yet the ATE is \(0.5\) under DGP A and \(0.1\) under DGP B — a fivefold difference. No amount of data can distinguish them without a fourth structural assumption.
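The example's arithmetic can be checked directly:

```python
# Shared compliance structure and observable moments from the example
p_co, p_at, p_nt = 0.2, 0.4, 0.4
first_stage = p_co                    # E[T|Z=1] - E[T|Z=0]: only compliers move

# DGP A: constant effect tau = 0.5 for every unit
ate_A = 0.5
wald_A = (p_co * 0.5) / first_stage   # reduced form / first stage

# DGP B: tau = 0.5 for compliers, 0 for always- and never-takers
ate_B = p_co * 0.5 + p_at * 0.0 + p_nt * 0.0
wald_B = (p_co * 0.5) / first_stage

print(wald_A, wald_B, ate_A, ate_B)   # 0.5 0.5 0.5 0.1
```

The observable moments coincide; only the maintained structural assumption tells the two ATEs apart.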

7.3 Identification in the Linear Homogeneous-Effect Model

NoteFramework 1: Constant Treatment Effect and Linear Structure

This section works within a linear structural model in which the treatment effect is the same for every unit: \(Y_i(t) - Y_i(t') = \beta(t-t')\) for all \(i, t, t'\). Under this assumption, IV identifies the single parameter \(\beta\), which equals the ATE, the ATT, and the LATE simultaneously: \(\beta = \text{ATE} = \text{ATT} = \text{LATE}\).

7.3.1 The Linear Structural Model

Consider the linear structural model: \[Y = \alpha + \beta T + \gamma^\top X + \varepsilon, \tag{7.1}\] \[T = \pi Z + \delta^\top X + \eta, \tag{7.2}\] where \(\varepsilon\) and \(\eta\) are structural errors with \(\mathrm{Cov}(\varepsilon, \eta) = \rho\sigma\tau \neq 0\). The non-zero covariance is the source of endogeneity: OLS applied to Equation 7.1 gives a biased estimator of \(\beta\). The reduced form substitutes Equation 7.2 into Equation 7.1: \[Y = \alpha + \beta\pi Z + (\beta\delta + \gamma)^\top X + (\beta\eta + \varepsilon).\] The reduced-form coefficient on \(Z\) is \(\beta\pi\): the total effect of the instrument on the outcome. Dividing by the first-stage coefficient \(\pi\) recovers \(\beta\), provided \(\pi \neq 0\).
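A quick numeric check (no covariates; illustrative parameters \(\beta = 1\), \(\pi = 0.5\), \(\rho = 0.8\), with \(\eta = U\)) confirms that the reduced-form slope is \(\beta\pi\) and that dividing by the first-stage slope recovers \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta, pi, rho = 1.0, 0.5, 0.8

U = rng.normal(size=n)                 # unobserved confounder
Z = rng.normal(size=n)                 # instrument, independent of U
eps = rho * U + np.sqrt(1 - rho**2) * rng.normal(size=n)
T = pi * Z + U                         # first stage (eta = U)
Y = beta * T + eps                     # structural outcome equation

# Reduced-form slope of Y on Z and first-stage slope of T on Z
C_yz = np.cov(Y, Z)
rf = C_yz[0, 1] / C_yz[1, 1]           # should be near beta * pi = 0.5
C_tz = np.cov(T, Z)
fs = C_tz[0, 1] / C_tz[1, 1]           # should be near pi = 0.5
print(rf, fs, rf / fs)                 # ratio near beta = 1.0
```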

7.3.2 The OLS Bias

The probability limit of the OLS estimator is: \[\mathrm{plim}\; \hat\beta_{\mathrm{OLS}} = \beta + \frac{\mathrm{Cov}(T, \varepsilon)}{\mathrm{Var}(T)} = \beta + \frac{\rho\sigma\tau}{\pi^2 \mathrm{Var}(Z) + \tau^2}.\] The bias is zero only if \(\rho = 0\) (no unobserved confounding) or if the instrument perfectly determines \(T\). The direction of the bias depends on the sign of \(\rho\).
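The bias formula can be verified by simulation (illustrative parameters; taking \(\eta = U\) gives \(\sigma = \tau = 1\) and \(\mathrm{Cov}(\varepsilon, \eta) = \rho\)):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
beta, pi, rho = 1.0, 0.5, 0.8

U = rng.normal(size=n)
Z = rng.normal(size=n)
eps = rho * U + np.sqrt(1 - rho**2) * rng.normal(size=n)
T = pi * Z + U                       # eta = U, so sigma = tau = 1
Y = beta * T + eps

C = np.cov(Y, T)
ols = C[0, 1] / C[1, 1]              # OLS slope of Y on T
theory = beta + rho / (pi**2 * 1.0 + 1.0)  # beta + rho*sigma*tau/(pi^2 Var(Z) + tau^2)
print(ols, theory)                   # both near 1.64
```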

7.3.3 Derivation of the Wald Estimand

The derivation has three ingredients: the first stage (how much \(Z\) moves \(T\)), the reduced form (how much \(Z\) moves \(Y\)), and the exclusion restriction (any effect of \(Z\) on \(Y\) must operate through \(T\)).

  1. Exogeneity (\(Z \indep U \mid X\)) implies \(\E[\varepsilon \mid Z, X] = 0\).
  2. Exclusion (no \(Z\) term in the outcome equation) combined with exogeneity yields the moment condition \(\E[\varepsilon \cdot Z \mid X] = 0\).
  3. Multiplying Equation 7.1 by \((Z - \E[Z \mid X])\) and applying step 2: \(\mathrm{Cov}(Y, Z \mid X) = \beta \cdot \mathrm{Cov}(T, Z \mid X)\), which is the reduced-form decomposition: the \(Z\)–\(Y\) covariance is entirely attributable to the causal path \(Z \to T \to Y\).
  4. Relevance (\(\mathrm{Cov}(Z, T \mid X) \neq 0\)) ensures the denominator is non-zero:

\[\beta = \frac{\mathrm{Cov}(Y,\, Z \mid X)}{\mathrm{Cov}(T,\, Z \mid X)}. \tag{7.3}\]

In the binary instrument case (\(Z \in \{0,1\}\)) without covariates, this simplifies to the Wald estimand:

NoteThe Wald Estimand

\[\beta = \frac{\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0]}{\E[T \mid Z{=}1] - \E[T \mid Z{=}0]}. \tag{7.4}\]

The numerator is the reduced form: the total effect of \(Z\) on \(Y\). The denominator is the first stage: the effect of \(Z\) on \(T\). The exclusion restriction guarantees the entire reduced form operates through \(T\); dividing by the first stage strips out the \(Z \to T\) piece.
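A minimal sketch of the Wald estimand with a binary instrument (simulated data with illustrative parameters, not the chapter's lab DGP):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta, pi, rho = 1.0, 0.4, 0.8

U = rng.normal(size=n)
Z = rng.integers(0, 2, size=n).astype(float)   # binary instrument
eps = rho * U + np.sqrt(1 - rho**2) * rng.normal(size=n)
T = pi * Z + U
Y = beta * T + eps

# Equation 7.4: reduced form over first stage
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())
print(wald)  # near beta = 1.0 despite the confounding through U
```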

NoteRemark: Estimation Deferred

The Wald estimand is an identification result: it expresses \(\beta\) as a ratio of observable quantities. Estimation — how to consistently estimate this ratio from a finite sample, including the correct treatment of standard errors — is taken up in Chapter 13.

NoteExample: Returns to Schooling (Angrist and Krueger 1991)

In the Mincer earnings equation \(\log W = \alpha + \beta S + \gamma^\top X + \varepsilon\), where \(S\) is years of schooling and \(W\) is wages, OLS is biased upward because unobserved ability \(U\) raises both \(S\) and \(W\). Angrist and Krueger (1991) use quarter of birth as \(Z\): proximity to mandatory school-leaving age at different birth quarters generates exogenous variation in completed schooling. The IV estimate of the return to schooling is approximately \(0.08\)–\(0.10\) per year.

7.4 Why the IV Assumptions Matter

Each assumption is load-bearing, and each failure mode produces a distinct, quantifiable distortion of the Wald estimand.

When relevance fails. If \(\pi = 0\), the Wald estimand is undefined. When \(\pi\) is small but nonzero, the instrument is weak. The estimator’s variance diverges as \(\pi \to 0\), and finite-sample bias pulls the IV estimate toward the OLS estimate at a rate proportional to \(1/F\), where \(F\) is the first-stage \(F\)-statistic.

When exclusion fails. If \(Y = \alpha + \beta T + \delta Z + \gamma^\top X + \varepsilon\) with \(\delta \neq 0\), the Wald estimand converges to \(\beta + \delta/\pi\). The bias \(\delta/\pi\) is amplified by a weak first stage: a small direct effect combined with a weak instrument can produce large bias. This is why a weak instrument with a plausible exclusion violation is not “nearly valid” — it may be severely misleading.

When exogeneity fails. If \(\mathrm{Cov}(Z, \varepsilon \mid X) \neq 0\), the Wald estimand converges to \(\beta + \mathrm{Cov}(\varepsilon, Z)/\mathrm{Cov}(T, Z)\). Again the bias is amplified by weak instruments.
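A simulation sketch of an exclusion violation (illustrative parameters \(\delta = 0.1\), \(\pi = 0.2\)) shows how a weak first stage amplifies a small direct effect:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
beta, pi, delta = 1.0, 0.2, 0.1   # weak first stage, small direct effect of Z on Y

U = rng.normal(size=n)
Z = rng.normal(size=n)
T = pi * Z + U
Y = beta * T + delta * Z + 0.8 * U + rng.normal(size=n)  # delta*Z violates exclusion

wald = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]
print(wald)  # near beta + delta/pi = 1.5: 50% bias from a "small" direct effect
```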

| Assumption | Directly testable? | Basis for assessment |
|---|---|---|
| Relevance | Association testable; causal claim design-based | First-stage \(F\)-statistic; the causal \(Z \to T\) link rests on the design |
| Exogeneity | No | Institutional knowledge; randomization (if available); placebo regressions on pre-determined outcomes |
| Exclusion | No in just-identified case; partially in overidentified case | Institutional argument; overidentification test (\(J\)-test, Chapter 13) checks mutual consistency but cannot confirm all instruments are valid (Kitagawa 2015) |

7.5 Lab: OLS vs. IV Across Instrument Strengths

This lab verifies the bias formulas numerically and traces the bias–variance tradeoff across the full range of instrument strength. It studies estimator behavior conditional on the IV assumptions being true; it does not address whether a proposed instrument is valid in an applied study.

NoteDGP for Lab 7

The simulation implements the linear structural model of Section 7.3 with no covariates \(X\). Each replication draws \(n = 500\) observations from: \[Y_i = \beta T_i + \varepsilon_i, \qquad T_i = \pi Z_i + \eta_i,\] with \(Z_i \overset{\mathrm{i.i.d.}}{\sim} N(0,1)\) and structural errors: \[\eta_i = U_i, \qquad \varepsilon_i = \rho U_i + \sqrt{1-\rho^2}\,\xi_i, \qquad U_i, \xi_i \overset{\mathrm{i.i.d.}}{\sim} N(0,1).\] Fixed parameters: \(\beta = 1\) (true causal effect), \(\rho = 0.8\) (strong positive endogeneity). The first-stage coefficient \(\pi\) is varied across eight values \(\pi \in \{0, 0.10, 0.15, 0.20, 0.30, 0.50, 1.00, 2.00\}\). The expected first-stage \(F\)-statistic is approximately \(n\pi^2 = 500\pi^2\).

NotePotential Outcomes Interpretation

The potential outcome is \(Y_i(t) = \beta t + \varepsilon_i\), so the individual treatment effect is constant: \(Y_i(t) - Y_i(t') = \beta(t-t')\) for all \(i\). This is Framework 1: \(\beta\) is simultaneously the ATE, ATT, and LATE.

The shared factor \(U_i\) creates confounding: \(\mathrm{Cov}(Y_i(t), T_i) = \mathrm{Cov}(\varepsilon_i, \eta_i) = \rho \neq 0\), so unconfoundedness fails. The instrument \(Z_i\) is drawn independently of \(U_i\), so \(Z_i \indep \varepsilon_i\) (exogeneity); \(Z_i\) does not appear in \(Y_i(t)\) (exclusion); and \(Z_i\) moves \(T_i\) through \(\pi\) (relevance). All three IV assumptions hold exactly by construction.

Estimators. OLS regresses \(Y\) on \(T\): \(\hat\beta_{\mathrm{OLS}} = \mathrm{Cov}(Y, T)/\mathrm{Var}(T)\), converging to \(1 + 0.8/(\pi^2+1)\). IV uses the Wald estimator: \(\hat\beta_{\mathrm{IV}} = \mathrm{Cov}(Y, Z)/\mathrm{Cov}(T, Z)\), consistent for \(\beta\) for any \(\pi \neq 0\).
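A condensed sketch of the lab's Monte Carlo (fewer replications than the reported \(B = 2{,}000\), and the helper name `replicate` is ours), shown here at \(\pi = 0.5\):

```python
import numpy as np

def replicate(pi, rng, n=500, rho=0.8, beta=1.0):
    """One replication of the lab DGP; returns (OLS, Wald-IV) estimates."""
    U = rng.normal(size=n)
    Z = rng.normal(size=n)
    eps = rho * U + np.sqrt(1 - rho**2) * rng.normal(size=n)
    T = pi * Z + U                    # eta = U
    Y = beta * T + eps
    ols = np.cov(Y, T)[0, 1] / np.cov(Y, T)[1, 1]
    iv = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]
    return ols, iv

rng = np.random.default_rng(2024)
draws = np.array([replicate(0.5, rng) for _ in range(500)])
print(draws.mean(axis=0))  # OLS mean near 1 + 0.8/1.25 = 1.64; IV mean near 1.0
```

Sweeping `pi` over the eight values in the DGP box reproduces the pattern in the results table below, up to Monte Carlo noise.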

Results (\(n = 500\), \(B = 2{,}000\) replications, seed 2024):

| \(\pi\) | \(F\) | Theory bias | OLS mean | OLS RMSE | IV mean | IV RMSE |
|------|------|--------|-------|-------|-------|-------|
| 0.00 | 1    | +0.800 | 1.801 | 0.802 | n/a   | n/a   |
| 0.10 | 6    | +0.792 | 1.791 | 0.792 | 0.872 | 4.393 |
| 0.15 | 13   | +0.782 | 1.782 | 0.783 | 0.764 | 6.491 |
| 0.20 | 21   | +0.769 | 1.769 | 0.770 | 0.940 | 0.385 |
| 0.30 | 46   | +0.734 | 1.735 | 0.735 | 0.975 | 0.165 |
| 0.50 | 127  | +0.640 | 1.639 | 0.640 | 0.996 | 0.091 |
| 1.00 | 500  | +0.400 | 1.401 | 0.402 | 0.999 | 0.045 |
| 2.00 | 2006 | +0.160 | 1.160 | 0.161 | 1.000 | 0.022 |

Lesson 1: The OLS bias formula is exact. The theory bias \(0.8/(\pi^2+1)\) matches the simulated OLS bias to four decimal places across all eight values of \(\pi\). OLS is biased in the direction of \(\rho\) at every value of \(\pi\), including \(\pi = 0\).

Lesson 2: IV is consistent but has catastrophically heavy tails when the instrument is weak. At \(\pi = 0.10\) (\(F \approx 6\)) and \(\pi = 0.15\) (\(F \approx 13\)), the IV mean is far from the true value. The IV median is \(\approx 1.01\) at both values, confirming consistency — the mean is dragged off by a small fraction of replications in which the first stage is near zero. RMSEs of \(4.4\) and \(6.5\) make these estimates useless.

Lesson 3: The RMSE crossover occurs near \(F \approx 20\). IV first beats OLS on RMSE at \(\pi = 0.20\) (\(F \approx 21\)): \(0.385 < 0.770\). The Staiger–Stock rule of thumb (\(F \geq 10\)) is slightly too lenient: at \(F \approx 13\), IV RMSE is still \(6.5\), more than \(8\times\) OLS. A more conservative threshold of \(F \geq 20\)–\(25\) is needed.

Lesson 4: A strong instrument eliminates both problems. At \(\pi = 1.00\) (\(F \approx 500\)), IV RMSE \(= 0.045\) while OLS RMSE \(= 0.402\) — a 9-fold improvement from IV. OLS efficiency is illusory: its small variance is offset by a large, persistent bias.

WarningRMSE Cannot Be the Only Criterion

OLS has lower RMSE than IV for \(\pi < 0.20\) in this DGP. This does not vindicate OLS. An estimator with RMSE \(= 0.78\) because it is biased by \(0.80\) is useless for causal inference: the bias is systematic and does not shrink with sample size. As \(n \to \infty\), IV RMSE shrinks to zero while OLS RMSE stays at \(0.80\). The RMSE comparison is only meaningful in finite samples — it quantifies when a weak instrument is so unreliable that IV is not yet practically useful, but that is an argument for finding a stronger instrument, not for using OLS.

NoteRemark: The First-Stage \(F\)-Statistic as a Diagnostic

The formula \(F \approx n\pi^2\) gives first-stage \(F\)-statistics matching the simulation throughout. The Staiger–Stock threshold of \(F \geq 10\) is widely used in practice and corresponds roughly to IV RMSE within a factor of two of the strong-instrument limit. This simulation suggests that threshold may understate the problem when endogeneity is strong (\(\rho = 0.8\)): at \(F = 13\) the IV RMSE is \(6.5\), far above any useful threshold. A researcher should report the first-stage \(F\) alongside any IV estimate, and treat values below \(20\)–\(25\) with particular caution.

7.6 Multiple Instruments and Overidentification

When there are exactly as many instruments as endogenous variables (\(q = p\)), the model is just-identified. When \(q > p\), the model is overidentified: the extra instruments impose additional moment restrictions. Under the homogeneous-effect linear model, every valid instrument must imply the same structural coefficient \(\beta\), so those restrictions are testable. Under heterogeneous treatment effects, valid instruments can legitimately identify different LATEs because they shift treatment for different complier populations.

The order condition (\(q \geq p\), counting requirement) and the rank condition (instruments are linearly independent in the first stage) are both necessary for identification. The Sargan–Hansen \(J\)-test (Sargan 1958; Hansen 1982) formalizes the mutual-consistency check; rejection indicates at least one instrument either violates exogeneity or exclusion, or identifies a different LATE, but does not localize which. The test statistic and asymptotic distribution are derived in Chapter 13.
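A sketch of the overidentified homogeneous-effect case (two illustrative independent instruments, \(\beta = 1\)): both Wald ratios estimate the same coefficient, which is the kind of agreement the \(J\)-test checks.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta = 1.0

U = rng.normal(size=n)
Z1 = rng.normal(size=n)
Z2 = rng.normal(size=n)                   # second, independent valid instrument
T = 0.5 * Z1 + 0.3 * Z2 + U
Y = beta * T + 0.8 * U + rng.normal(size=n)

# Under the homogeneous-effect model, every valid instrument implies the same beta.
wald1 = np.cov(Y, Z1)[0, 1] / np.cov(T, Z1)[0, 1]
wald2 = np.cov(Y, Z2)[0, 1] / np.cov(T, Z2)[0, 1]
print(wald1, wald2)  # both near 1.0
```

Agreement between the two ratios is necessary, not sufficient: two invalid instruments sharing the same violation would agree just as well.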

WarningWhat Overidentification Does Not Do

Passing the \(J\)-test does not confirm that any instrument is valid. It only shows that the sample moment conditions are mutually compatible — a weaker conclusion than validity. Multiple invalid instruments can agree with one another if they share the same violation; consistent instruments are not necessarily valid ones. Overidentification is an opportunity for a consistency check, not a substitute for the institutional argument that makes a design credible.

7.7 Heterogeneous Treatment Effects and the LATE Framework

NoteFramework 2: Heterogeneous Treatment Effects

We now drop homogeneity and allow \(\tau_i = Y_i(1) - Y_i(0)\) to vary across units. Throughout this section we work in the binary instrument, binary treatment setting (\(Z, T \in \{0,1\}\)). In this setting, the Wald ratio generally no longer identifies the ATE. Instead, under an additional assumption of monotonicity, it identifies the Local Average Treatment Effect (LATE): the average treatment effect for the subpopulation whose treatment status is changed by the instrument.

7.7.1 Compliance Types

NoteDefinition: Compliance Types (Angrist et al. 1996)

For binary \(Z, T \in \{0,1\}\), the four compliance types are defined by the pair of potential treatment decisions \((T_i(0), T_i(1))\):

| Compliance type | \(T_i(0)\) | \(T_i(1)\) |
|---|---|---|
| Complier | 0 | 1 |
| Always-taker | 1 | 1 |
| Never-taker | 0 | 0 |
| Defier | 1 | 0 |

A complier takes treatment if and only if the instrument is switched on. An always-taker takes treatment regardless of \(Z\). A never-taker never takes treatment. A defier does the opposite of what the instrument suggests.

The instrument \(Z\) only shifts treatment for compliers: always-takers and never-takers have the same treatment status regardless of \(Z\), so they contribute nothing to the denominator \(\E[T \mid Z{=}1] - \E[T \mid Z{=}0]\).

7.7.2 The Monotonicity Assumption

NoteDefinition: Monotonicity (Imbens and Angrist 1994)

The treatment assignment is monotone in \(Z\) if \(T_i(1) \geq T_i(0)\) for all \(i\). Equivalently: there are no defiers in the population.

Monotonicity is not a generic law of causal inference. It is a design-specific claim about how this particular instrument changes treatment behavior. Switching the instrument from 0 to 1 may induce some units to take treatment (compliers) and leave others unaffected (always-takers or never-takers), but it should not reverse anyone’s treatment decision.

7.7.3 The LATE Theorem

TipTheorem: LATE Theorem (Imbens and Angrist 1994)

Suppose: (1) Exogeneity (PO form): \(Z \indep (Y(0), Y(1), T(0), T(1))\); (2) Exclusion: \(Y_i(t, z) = Y_i(t)\) for all \(z\); (3) Relevance: \(\E[T(1) - T(0)] \neq 0\); (4) Monotonicity: \(T_i(1) \geq T_i(0)\) for all \(i\). Then the Wald estimand identifies the Local Average Treatment Effect (LATE): \[\frac{\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0]}{\E[T \mid Z{=}1] - \E[T \mid Z{=}0]} = \E[Y(1) - Y(0) \mid T_i(1) > T_i(0)] \equiv \tau_{\mathrm{LATE}}.\]

Denominator. By consistency for \(T\) and exogeneity (\(Z \indep (T(0), T(1))\)): \[\E[T \mid Z{=}1] - \E[T \mid Z{=}0] = \E[T(1)] - \E[T(0)] = \E[T(1) - T(0)].\] Monotonicity (no defiers) gives \(\E[T(1) - T(0)] = P(\text{complier})\), since always-takers contribute \(1-1=0\) and never-takers contribute \(0-0=0\): \(\E[T \mid Z{=}1] - \E[T \mid Z{=}0] = P(\text{complier})\).

Numerator. By consistency, exclusion, and exogeneity: \[\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0] = \E[Y(T(1))] - \E[Y(T(0))] = \sum_c P(c)\,\E[Y(T_c(1)) - Y(T_c(0)) \mid \text{type}=c].\] For always-takers: \(T(1) = T(0) = 1\), contribution \(= 0\). For never-takers: \(T(1) = T(0) = 0\), contribution \(= 0\). For compliers: \(T(1) = 1\), \(T(0) = 0\), so \(Y(T(1)) - Y(T(0)) = Y(1) - Y(0)\). No defiers by monotonicity. Therefore: \[\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0] = P(\text{complier})\,\E[Y(1) - Y(0) \mid \text{complier}].\]

Ratio. Dividing numerator by denominator: \(\tau_{\mathrm{LATE}} = \E[Y(1) - Y(0) \mid \text{complier}]\). \(\square\)
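The theorem can be checked numerically by simulating compliance types directly (an illustrative DGP with \(\tau = 0.5\) for compliers and \(0\) otherwise, matching the example of Section 7.2):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

# Compliance types: 20% compliers, 40% always-takers, 40% never-takers, no defiers
types = rng.choice(np.array(["co", "at", "nt"]), size=n, p=[0.2, 0.4, 0.4])
Z = rng.integers(0, 2, size=n)

T0 = (types == "at").astype(float)                       # T_i(0)
T1 = ((types == "at") | (types == "co")).astype(float)   # T_i(1); monotone: T1 >= T0
T = np.where(Z == 1, T1, T0)

# Heterogeneous effects: 0.5 for compliers, 0 for everyone else
tau = np.where(types == "co", 0.5, 0.0)
Y = rng.normal(size=n) + tau * T                         # Y(0) + tau * T

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())
print(wald)  # near the complier effect 0.5, far from the ATE of about 0.1
```

The Wald ratio lands on the complier effect, not the population ATE, exactly as the theorem predicts.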

NoteRemark: The Two Senses of “Local”

The LATE is local in two senses: it is local to the complier group defined by this instrument, and local to the particular instrument that defines that group. A different instrument, even for the same treatment, will generally select a different complier population and identify a different LATE. This is developed in Section 7.8.

7.8 Interpreting IV Estimands

7.8.1 What the Two Frameworks Say

| | Framework 1 (linear, homogeneous) | Framework 2 (heterogeneous effects) |
|---|---|---|
| Key assumption | \(\tau_i = \beta\) for all \(i\) | Monotonicity (no defiers) |
| What IV identifies | \(\beta = \text{ATE} = \text{ATT} = \text{LATE}\) | \(\tau_{\mathrm{LATE}} = \E[\tau_i \mid \text{complier}]\) |
| Estimand depends on instrument? | No (same \(\beta\) regardless of \(Z\)) | Yes (different \(Z\) \(\Rightarrow\) different compliers \(\Rightarrow\) different LATE) |
| Identifies the ATE? | Yes, automatically | Only if all units are compliers or effects are homogeneous |

Framework 1 is a special case of Framework 2: when \(\tau_i = \beta\) for all \(i\), the LATE equals the ATE equals \(\beta\). In applied work, the default interpretation of the Wald estimand is the LATE; the ATE interpretation requires the additional homogeneity argument of Framework 1.

NoteRemark: Same Formula, Different Estimand

Across a range of fourth assumptions, the conditional Wald formula \(\mathrm{Cov}(Z, Y \mid X)/\mathrm{Cov}(Z, T \mid X)\) serves as the identifying expression in many cases (Levis et al. 2024). Two researchers using the same instrument and computing the same Wald ratio may be consistently estimating different causal quantities if they maintain different structural assumptions. The data alone cannot resolve which estimand the Wald ratio identifies; that determination requires the researcher to commit to a structural assumption about treatment response.

7.8.2 When Does LATE Equal ATE?

LATE equals ATE only under additional structure, most notably treatment-effect homogeneity. To see this, decompose the ATE by compliance type: \[\text{ATE} = P(\text{co})\,\E[\tau_i \mid \text{co}] + P(\text{at})\,\E[\tau_i \mid \text{at}] + P(\text{nt})\,\E[\tau_i \mid \text{nt}].\] The LATE equals only the first term divided by its probability weight. ATE and LATE coincide if and only if: (1) mean treatment effects are equal across compliance types, or (2) everyone is a complier, or (3) the average effects for always-takers and never-takers happen to equal the LATE — an untestable coincidence.

7.8.3 Different Instruments, Different Estimands

Because the LATE is specific to the complier population, and different instruments select different complier populations, two valid instruments for the same treatment can legitimately identify different LATEs. This is informative about treatment effect heterogeneity, not a contradiction.

NoteExample: Compulsory Schooling Laws vs. Distance to College (Angrist and Krueger 1991; Card 1995)

Compulsory schooling laws (Angrist and Krueger 1991) identify the return to schooling for students at the margin of dropping out — typically lower-income students. Distance to the nearest college (Card 1995) identifies the return for students deterred by geographic distance — again, disproportionately lower-income. The two LATEs can legitimately differ even when both instruments are valid.

7.8.4 The Policy Relevance of LATE

For many policy questions, the LATE is exactly the right estimand. If a policy is designed to encourage a subset of the population to take treatment, then the effect on compliers (those who respond to the encouragement) is precisely what the policy-maker wants to know. When the ATE over the full population is required, IV alone is insufficient under heterogeneous effects; additional assumptions or a second instrument are needed to extrapolate from the LATE to the ATE.

7.9 Practical Guidance on Defending an IV Design

A researcher proposing an instrument should be able to answer five questions explicitly:

  1. What exactly is the instrument? Specify \(Z\) precisely: its source of variation, the level at which it varies, and the population to which it applies.
  2. Why does it shift treatment? Articulate the causal mechanism by which \(Z\) moves \(T\). The first-stage \(F\)-statistic is a diagnostic for instrument strength, not a substitute for a causal account of the \(Z \to T\) link.
  3. Why is it as-if random relative to latent outcome determinants? The most credible sources are designed randomization (lotteries, randomized encouragement), natural experiments, and shift-share designs (Bartik 1991; Goldsmith-Pinkham et al. 2020). Placebo regressions on pre-determined outcomes provide partial evidence.
  4. Why can it affect the outcome only through treatment? Exclusion is untestable in just-identified models. A useful diagnostic: how large would the direct effect \(\delta\) have to be, relative to \(\pi\), to overturn the estimated causal effect? When the first stage is weak, the answer is: not very large.
  5. What population margin does it shift? Identify the complier population. This determines the LATE that is being identified and governs the external validity of the estimates.
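Question 4's sensitivity check can be made concrete. Under the linear model with a direct effect \(\delta\), the probability limit of the Wald estimator is \(\beta + \delta/\pi\), so a direct effect of \(\delta^* = \pi\,\hat\theta\) would fully account for an estimate \(\hat\theta\) even if the true \(\beta\) were zero. A minimal sketch (the numbers are purely illustrative):

```python
def delta_to_overturn(theta_hat: float, pi: float) -> float:
    """Direct effect delta that would fully account for a Wald estimate of
    theta_hat, given first-stage coefficient pi (since plim Wald = beta + delta/pi)."""
    return pi * theta_hat

# Illustrative numbers: a strong first stage requires a sizeable direct effect...
print(delta_to_overturn(theta_hat=0.35, pi=1.2))    # approx 0.42
# ...while a weak first stage is overturned by a tiny one.
print(delta_to_overturn(theta_hat=0.35, pi=0.05))   # approx 0.018
```

This is the sense in which weak instruments amplify exclusion violations: the smaller \(\pi\), the smaller the direct effect needed to explain away the estimate.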

7.10 Applied Example: Charter School Lotteries and the KIPP Lynn Study

This example illustrates a canonical randomized-encouragement IV design. The lottery randomizes offer status \(Z\), not actual treatment \(S\) (years of KIPP attendance). Winning the lottery does not force a student to attend KIPP, and losing does not make later attendance impossible. Thus the lottery offer is the instrument and actual attendance is the endogenous treatment.

Setting. KIPP (Knowledge Is Power Program) schools follow a “No Excuses” model: extended school days, longer academic year, selective teacher hiring, and strict behavioral norms. KIPP Academy Lynn was substantially oversubscribed beginning in 2005. Massachusetts law requires oversubscribed charter schools to select students by lottery, so the school conducted randomized admissions lotteries from 2005 through 2008. The outcome \(Y_{igt}\) is the student’s standardized score on the Massachusetts MCAS, normalized to mean zero and standard deviation one within each subject–grade–year cell statewide.

Mapping the three assumptions (Angrist et al. 2012).

Relevance. Lottery winners were offered a seat and about 80% accepted; lottery losers only rarely gained entry to KIPP through other routes. The first-stage regression yields a coefficient of approximately \(1.2\): winners had spent about 1.2 more years at KIPP than losers by the time of each MCAS exam. The first-stage \(F\)-statistic is far above conventional thresholds.

Exogeneity. Offer status was determined by randomly drawn lottery-sequence numbers, so independence holds by design. A joint test of covariate balance yields \(p = 0.615\), consistent with no pre-lottery differences.

Exclusion. The offer merely provides access to a school; it does not itself deliver instruction. One potential violation is a discouragement effect: losing the lottery might demoralize students. The authors address this by noting that scores of lottery losers are typical of demographically comparable students in Lynn, inconsistent with large discouragement effects. Random assignment of the instrument does not, by itself, imply exclusion — it only guarantees exogeneity.

First stage, reduced form, and 2SLS. The model is just-identified (one excluded instrument per endogenous variable), so the 2SLS estimator of \(\theta\) (effect per year at KIPP) equals the ratio of the reduced-form coefficient on \(Z_i\) to the first-stage coefficient \(\pi\).

| Subject | First stage | Reduced form | 2SLS |
|---------|-------------|--------------|------|
| Math | 1.221 (0.068) | 0.430 (0.067) | 0.352 (0.053) |
| ELA | 1.228 (0.068) | 0.164 (0.073) | 0.133 (0.059) |

Standard errors clustered at the student level. \(N = 833\) student-by-test observations.

Each year at KIPP raises math scores by approximately \(0.35\sigma\) and ELA scores by approximately \(0.13\sigma\). The reduced-form estimate for math (\(0.43\sigma\)) is larger than the 2SLS estimate (\(0.35\sigma\)) because the first stage exceeds 1: by the time of each exam, lottery winners had accumulated somewhat more than one additional year at KIPP.
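A quick arithmetic check of the just-identified relationship: the 2SLS column should equal the reduced form divided by the first stage, up to rounding of the reported coefficients.

```python
# Coefficients read from the table above (standard errors omitted).
table = {"Math": {"first_stage": 1.221, "reduced_form": 0.430, "tsls": 0.352},
         "ELA":  {"first_stage": 1.228, "reduced_form": 0.164, "tsls": 0.133}}

for subject, row in table.items():
    wald = row["reduced_form"] / row["first_stage"]
    # Matches the published 2SLS column up to rounding of the inputs.
    print(f"{subject}: Wald ratio = {wald:.3f}, table 2SLS = {row['tsls']:.3f}")
```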

LATE interpretation. The 2SLS estimand \(\theta\) is not the effect of KIPP on all students in Lynn, nor on all applicants — it is the average per-year treatment effect for the lottery compliers. Compliance is partial in both directions: some lottery winners do not enroll and some lottery losers eventually find entry.

Treatment effect heterogeneity. Reading gains (\(\approx 0.13\sigma\) overall) are driven almost entirely by students classified as having limited English proficiency (LEP, \(\approx 0.43\sigma\)) and special education needs (SPED, \(\approx 0.27\sigma\)); non-LEP, non-SPED students show negligible ELA gains. Math effects are large and positive across all subgroups but are largest for LEP and lower-achieving students. This connects directly to Section 7.8: different instruments would identify distinct subpopulation LATEs; the overall 2SLS estimate is a weighted average of subgroup LATEs with weights proportional to each subgroup’s share of the complier population.

NoteRemark: Lottery Design and Assumption Credibility (Angrist and Pischke 2009)

Instruments derived from designed randomization — lotteries, random assignment, randomized encouragement — provide the most transparent basis for the exogeneity assumption. Exogeneity within the applicant sample, conditional on application cohort, is not merely plausible — it follows from the randomization protocol. This is why lottery-based IV designs occupy a privileged position in the program evaluation literature. The exclusion restriction still requires institutional argument.

7.11 IV versus Back-Door Adjustment

| Dimension | Back-door / propensity score | Instrumental variables |
|-----------|------------------------------|------------------------|
| Core assumption | All confounders observed: \((Y(0),Y(1)) \indep T \mid X\) | Valid instrument: relevance, exogeneity, exclusion |
| Unobserved confounders | Fatal: back-door adjustment fails | Permitted: IV routes around \(U\) |
| Estimand | ATE, ATT, or ATC; all coincide under homogeneity | LATE (compliers only); reduces to common \(\beta\) under homogeneous effects |
| Testability | Unconfoundedness is untestable; overlap is testable | Relevance testable; exogeneity and exclusion untestable (just-identified) |
| Main threat | Unmeasured confounder | Exclusion restriction violation |
| Identifies ATE? | Yes, under strong ignorability | Only under homogeneous effects |

Complementary failure modes. Back-door adjustment fails when \(X\) does not capture all confounders. IV fails when the exclusion restriction is violated: the bias in the Wald estimand is \(\delta/\pi\), amplified by weak instruments. The two failure modes are complementary: back-door adjustment tolerates no unobserved confounders, while IV tolerates unobserved confounders but requires an instrument with no direct effect on the outcome.
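The \(\delta/\pi\) bias formula can be illustrated by simulation. The sketch below generates data from a linear model in which exclusion fails (\(\delta \neq 0\), an invented violation for illustration) and shows the empirical Wald bias tracking \(\delta/\pi\), growing as the first stage weakens:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
beta, delta = 1.0, 0.2   # true effect; hypothetical direct Z -> Y effect (exclusion violation)

bias = {}
for pi in (2.0, 0.5, 0.1):                            # strong, moderate, weak first stage
    z = rng.integers(0, 2, size=n).astype(float)
    t = pi * z + rng.normal(size=n)                   # first stage
    y = beta * t + delta * z + rng.normal(size=n)     # delta != 0 violates exclusion
    wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())
    bias[pi] = wald - beta                            # should be close to delta / pi
    print(f"pi={pi}: empirical bias {bias[pi]:+.3f}, delta/pi {delta / pi:+.3f}")
```

Holding \(\delta\) fixed at 0.2, the bias roughly doubles each time the first stage is cut in half, as the formula predicts.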

When both strategies are available. The Hausman (1978) endogeneity test compares OLS and IV: under the null that \(T\) is exogenous given \(X\), both are consistent, and a large discrepancy is evidence of endogeneity. Under the null, an appropriately scaled quadratic form in \(\hat\beta_{\mathrm{IV}} - \hat\beta_{\mathrm{OLS}}\) is asymptotically \(\chi^2_p\). Disagreement does not by itself tell us which method is wrong: the two strategies typically target different estimands (ATE vs. LATE), and disagreement can arise from legitimate effect heterogeneity rather than failure of either assumption.
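In the scalar case the Hausman statistic takes a simple form: under the null that \(T\) is exogenous, OLS is efficient, so \(\mathrm{Var}(\hat\beta_{\mathrm{IV}} - \hat\beta_{\mathrm{OLS}})\) reduces to \(\mathrm{Var}(\hat\beta_{\mathrm{IV}}) - \mathrm{Var}(\hat\beta_{\mathrm{OLS}})\). A sketch with invented numbers:

```python
def hausman_stat(b_iv: float, b_ols: float, var_iv: float, var_ols: float) -> float:
    """Scalar Hausman statistic. Under the null of exogeneity OLS is efficient,
    so Var(b_iv - b_ols) simplifies to var_iv - var_ols; the statistic is
    asymptotically chi-squared with 1 degree of freedom."""
    return (b_iv - b_ols) ** 2 / (var_iv - var_ols)

# Hypothetical job-training numbers (dollars per year), purely illustrative.
stat = hausman_stat(b_iv=2400.0, b_ols=1800.0, var_iv=500.0**2, var_ols=200.0**2)
print(round(stat, 2), stat > 3.84)   # compare to the 5% chi^2(1) critical value
```

With these invented numbers the statistic is about 1.71, below the 3.84 critical value, so the $600 discrepancy would not by itself be evidence of endogeneity — consistent with the point above that OLS–IV disagreement may also reflect ATE versus LATE.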

7.12 Chapter Summary

| Symbol | Meaning |
|--------|---------|
| \(Z\) | Instrument |
| \(\pi\) | First-stage coefficient: \(\E[\partial T/\partial Z]\) |
| \(\rho\) | Endogeneity correlation: \(\mathrm{Cov}(\varepsilon, \eta)/(\sigma_\varepsilon \sigma_\eta)\) |
| \(\tau_{\mathrm{LATE}}\) | \(\E[Y(1) - Y(0) \mid T_i(1) > T_i(0)]\) |
| Wald estimand | \((\E[Y\mid Z{=}1]-\E[Y\mid Z{=}0])/(\E[T\mid Z{=}1]-\E[T\mid Z{=}0])\) |
| Reduced form | Total effect of \(Z\) on \(Y\) |
| First stage | Effect of \(Z\) on \(T\) |
  1. IV identifies effects from exogenous treatment variation. When back-door adjustment fails because an unobserved \(U\) creates a path \(T \leftarrow U \to Y\), a valid instrument \(Z\) identifies the causal effect by exploiting only the component of treatment variation that \(Z\) induces. IV does not block the confounding path — it avoids it.
  2. Three assumptions, ordered by testability. Relevance can be assessed with the first-stage \(F\)-statistic (a finite-sample diagnostic, not a proof of relevance). Exogeneity and exclusion must be defended by institutional knowledge, design logic, and causal structure. Each violated assumption produces a distinct, quantifiable bias, amplified by weak instruments.
  3. Framework 1: homogeneous-effect SEM \(\Rightarrow\) Wald identifies \(\beta\). Under constant treatment effects and the linear structural model, the Wald estimand identifies \(\beta = \text{ATE} = \text{ATT} = \text{LATE}\).
  4. Framework 2: heterogeneity + monotonicity \(\Rightarrow\) Wald identifies LATE. Under heterogeneous treatment effects and monotonicity, the Wald estimand identifies the average treatment effect for compliers only — those whose treatment status changes with the instrument, a latent subgroup defined by \((T(0), T(1))\).
  5. Different instruments identify different effects. The LATE depends on the instrument through the complier population it selects. This is informative about treatment effect heterogeneity, not a contradiction. LATE equals ATE only under additional structure.
  6. IV versus back-door adjustment. Complementary failure modes: back-door fails when confounders are unobserved; IV fails when the exclusion restriction is violated or the instrument is not truly exogenous.
  7. Estimation deferred to Chapter 13. This chapter establishes what IV identifies and under what assumptions. How the Wald ratio is estimated from finite data — reduced-form regression, two-stage least squares, asymptotic inference, and overidentification tests — is the subject of Chapter 13.

7.13 Problems

1. The three IV assumptions in three languages. Consider the DAG: \(\{Z \to T,\; T \to Y,\; U \to T,\; U \to Y,\; X \to T,\; X \to Y,\; X \to Z\}\) with \(U\) unobserved.

  1. List all back-door paths from \(T\) to \(Y\). Does \(X\) alone satisfy the back-door criterion? Explain.
  2. Verify the three IV assumptions using d-separation: (i) Relevance: show \(Z\) and \(T\) are not d-separated in \(\mathcal{G}\). (ii) Exogeneity: show \(Z \indep U \mid X\) in \(\mathcal{G}\). (iii) Exclusion: show \(Y \indep Z \mid T, X\) in \(\mathcal{G}_{\overline{T}}\).
  3. Now add the arrow \(Z \to Y\) to the DAG. Which IV assumption is violated? Show explicitly which step of this chapter's Wald derivation breaks down.
  4. Translate each of the three IV assumptions into the structural language: write the equations for \(T\) and \(Y\) and identify which coefficient restriction corresponds to each assumption.

2. Bias under assumption violations. Let \(Y = \beta T + \varepsilon\) and \(T = \pi Z + \eta\) with \(\E[\varepsilon \mid Z] = 0\) and \(\pi \neq 0\).

  1. Starting from \(\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0]\), substitute the structural equation for \(Y\) and simplify. What role does exogeneity play?
  2. Show that \(\E[T \mid Z{=}1] - \E[T \mid Z{=}0] = \pi\) in the linear first-stage model.
  3. Derive the Wald estimand and confirm it equals \(\beta\).
  4. Now suppose the exclusion restriction fails and \(Y = \beta T + \delta Z + \varepsilon\) with \(\delta \neq 0\). Derive the probability limit of the Wald estimator and confirm the bias formula from Section 7.4.
  5. Suppose instead that exogeneity fails: \(\E[\varepsilon \mid Z] = cZ\) for some constant \(c \neq 0\). Derive the probability limit of the Wald estimator and express the bias in terms of \(c\) and \(\pi\). Compare the structure of this bias with the exclusion violation bias.

3. Order, rank, and the limits of overidentification. Consider a model with one endogenous variable \(T\) and two instruments \(Z_1\) and \(Z_2\), both satisfying exogeneity and exclusion.

  1. State the order condition and verify it is satisfied.
  2. State the rank condition. What would it mean geometrically if the rank condition failed — i.e., if \(Z_1\) and \(Z_2\) were perfectly collinear in the first-stage regression?
  3. Explain intuitively why having two valid instruments rather than one should improve estimation precision.
  4. Now suppose \(Z_1\) is valid but \(Z_2\) violates the exclusion restriction. Under what conditions does the Sargan–Hansen \(J\)-test have power to detect \(Z_2\)’s invalidity? Under what conditions does the test fail?
  5. Why does passing the \(J\)-test not confirm that both \(Z_1\) and \(Z_2\) are valid? Give a concrete example in which both instruments are invalid and the \(J\)-test has no power.

4. Compliance types and the LATE. In a study with a binary instrument and binary treatment, suppose the population has: 30% compliers with average treatment effect \(\tau_c = 6\); 25% always-takers with \(\tau_a = 3\); 45% never-takers with \(\tau_n = 1\); and no defiers.

  1. Compute \(P(\text{complier}) = \E[T \mid Z{=}1] - \E[T \mid Z{=}0]\).
  2. Compute the ATE as a weighted average of \(\tau_c\), \(\tau_a\), \(\tau_n\) with appropriate weights.
  3. The Wald estimand equals \(\tau_c = 6\). By how much does this overstate the ATE, and why?
  4. A second study uses a different binary instrument \(Z'\) with a complier population of 50% and a LATE of 2. Is this contradictory? What can you infer about the relative treatment effect in the two complier populations?
  5. Explain, using compliance type language, why the denominator of the Wald estimand equals \(P(\text{complier})\).

5. The exclusion restriction: plausibility and violations. Evaluate the exclusion restriction for each proposed instrument. For each, state (i) whether the restriction is plausible and why; (ii) a specific mechanism by which it could be violated; and (iii) whether the violation would bias the IV estimate upward or downward.

  1. Instrument: rainfall in the home region of a politician, used as an instrument for government infrastructure spending. Outcome: local economic growth.
  2. Instrument: distance to the nearest hospital, used as an instrument for hospital admission. Outcome: 30-day mortality.
  3. Instrument: a randomly assigned financial incentive to enroll in a health screening program. Outcome: health status two years later.
  4. Instrument: lottery number in the Vietnam-era draft lottery, used as an instrument for military service. Outcome: lifetime earnings. [This is the Angrist (1990) study; discuss why this instrument is widely regarded as satisfying the exclusion restriction.]

6. IV versus back-door adjustment. A researcher studies the effect of job training (\(T\)) on earnings (\(Y\)). Two strategies are available: (A) a rich set of pre-treatment covariates \(X\) and a propensity-score estimator; (B) a lottery that randomly selected units to be offered training (not required to attend), used as instrument \(Z\).

  1. Under what assumption does strategy (A) identify the ATE? What specific unobserved variable would most plausibly violate this assumption?
  2. Strategy (B) identifies a LATE. Describe the complier population in words. Is the LATE likely to be larger or smaller than the ATE in this setting? Explain.
  3. Both strategies are implemented and yield estimates of $1,800 and $2,400 per year, respectively. Describe a Hausman-type test that uses both estimates. Under what null hypothesis does the test have an approximate \(\chi^2\) distribution?
  4. If the two estimates differ significantly, which strategy would you trust more and why? What additional evidence would help distinguish the two explanations (endogeneity bias in (A) versus LATE \(\neq\) ATE in (B))?
Angrist, Joshua D., Susan M. Dynarski, Thomas J. Kane, Parag A. Pathak, and Christopher R. Walters. 2012. “Who Benefits from KIPP?” Journal of Policy Analysis and Management 31 (4): 837–60.
Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62 (2): 467–75.
Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association 91 (434): 444–55.
Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” Quarterly Journal of Economics 106 (4): 979–1014.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Bartik, Timothy J. 1991. Who Benefits from State and Local Economic Development Policies? W. E. Upjohn Institute for Employment Research.
Bound, John, David A. Jaeger, and Regina M. Baker. 1995. “Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak.” Journal of the American Statistical Association 90 (430): 443–50.
Card, David. 1995. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” In Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, edited by Louis N. Christofides, E. Kenneth Grant, and Robert Swidinsky. University of Toronto Press.
Goldsmith-Pinkham, Paul, Isaac Sorkin, and Henry Swift. 2020. “Bartik Instruments: What, When, Why, and How.” American Economic Review 110 (8): 2586–624.
Hansen, Lars Peter. 1982. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica 50 (4): 1029–54. https://doi.org/10.2307/1912775.
Hausman, Jerry A. 1978. “Specification Tests in Econometrics.” Econometrica 46 (6): 1251–71.
Hernán, Miguel A., and James M. Robins. 2006. “Instruments for Causal Inference: An Epidemiologist’s Dream?” Epidemiology 17 (4): 360–72.
Kitagawa, Toru. 2015. “A Test for Instrument Validity.” Econometrica 83 (5): 2043–63. https://doi.org/10.3982/ECTA11974.
Levis, Alexander W., Edward H. Kennedy, and Luke Keele. 2024. “Nonparametric Identification and Efficient Estimation of Causal Effects with Instrumental Variables.” arXiv Preprint arXiv:2402.09332.
Sargan, John D. 1958. “The Estimation of Economic Relationships Using Instrumental Variables.” Econometrica 26 (3): 393–415. https://doi.org/10.2307/1907619.