13 Estimation under Instrumental Variables
13.1 From Identification to Estimation
Chapter 7 established what IV identifies and under what assumptions. This chapter shows how to estimate that target from finite data using sample analogs of the same orthogonality restrictions.
Chapter 7 showed that when an instrument \(Z\) satisfies relevance, exogeneity, and exclusion, the residualized-covariance ratio \(\mathrm{Cov}(\tilde Z, Y)/\mathrm{Cov}(\tilde Z, T)\) is a well-defined function of observable quantities but does not identify any specific causal parameter without a fourth structural assumption. This chapter works within the linear constant-effect model: \[Y = \alpha + \beta T + \gamma^\top X + \varepsilon, \tag{13.1}\] \[T = a_T + \pi Z + \delta^\top X + \eta, \tag{13.2}\] where the fourth structural assumption is the constant-effect restriction: \(Y_i(1) - Y_i(0) = \beta\) for all \(i\). Under the three core assumptions plus this restriction, the structural coefficient \(\beta\) is identified by the Wald formula: \[\beta = \frac{\mathrm{Cov}(\tilde Z,\, Y)}{\mathrm{Cov}(\tilde Z,\, T)}.\]
This chapter addresses the estimation problem: how to recover \(\beta\) from a finite sample. The answer is not a single formula but a family of estimators — the Wald estimator, two-stage least squares (2SLS), and the generalized method of moments (GMM) — each of which is a sample analog of the same underlying orthogonality restriction.
Chapter structure. Core material (Section 13.2–Section 13.7) develops the mainstream IV estimators and their asymptotic theory. Advanced enrichment (Section 13.8–Section 13.9) introduces GEL and the control function approach.
13.2 The Wald Estimator and the IV Regression Estimator
13.2.1 The Wald Estimator
We begin with the simplest setting: a binary instrument \(Z \in \{0,1\}\), scalar treatment \(T\), scalar outcome \(Y\), and no additional covariates \(X\). Let \(\bar{Y}_z\) and \(\bar{T}_z\) denote the sample means of \(Y\) and \(T\) within the subsample with \(Z_i = z\). The Wald estimator is the ratio of mean contrasts: \[\hat\beta_{\mathrm{Wald}} = \frac{\bar{Y}_1 - \bar{Y}_0}{\bar{T}_1 - \bar{T}_0}. \tag{13.3}\]
The numerator \(\bar{Y}_1 - \bar{Y}_0\) estimates the reduced form: the total effect of the instrument on the outcome. The denominator \(\bar{T}_1 - \bar{T}_0\) estimates the first stage: the effect of the instrument on treatment uptake. Their ratio recovers the effect of treatment on the outcome by attributing all of the instrument’s effect on \(Y\) to the path \(Z \to T \to Y\) — valid precisely because the exclusion restriction rules out any direct path \(Z \to Y\).
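As a minimal numerical sketch of this logic on simulated data (the DGP here, including the true effect \(\beta = 2\) and first-stage coefficient \(0.5\), is an illustrative assumption, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0                     # assumed true constant effect

Z = rng.integers(0, 2, n)                  # binary instrument
eta = rng.normal(size=n)                   # first-stage error
eps = 0.8 * eta + rng.normal(size=n)       # structural error, correlated with eta
T = 0.5 * Z + eta                          # first stage (pi = 0.5)
Y = 1.0 + beta * T + eps                   # structural outcome equation

# Wald estimator: reduced-form contrast over first-stage contrast
num = Y[Z == 1].mean() - Y[Z == 0].mean()  # estimates the reduced form
den = T[Z == 1].mean() - T[Z == 0].mean()  # estimates the first stage
beta_wald = num / den

# OLS of Y on T is inconsistent here because Cov(eps, eta) > 0
beta_ols = np.cov(T, Y)[0, 1] / np.var(T)
```

The OLS slope is pulled away from \(\beta\) by the correlation between \(\varepsilon\) and \(\eta\); the Wald ratio is not, because it uses only the variation in \(T\) induced by \(Z\).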
13.2.2 The IV Regression Estimator with Covariates
When observed covariates \(X \in \mathbb{R}^p\) are present, the Wald estimator is no longer applicable. We derive the general IV estimator from the estimating-equation principle.
Notation. Stack observations into \(\mathbf{Y}, \mathbf{T}, \mathbf{Z} \in \mathbb{R}^n\) and \(\mathbf{X} \in \mathbb{R}^{n \times p}\). Let \(\bar{\mathbf{X}} = [\mathbf{1}_n,\,\mathbf{X}]\) and define the annihilator matrix: \[M_X = I_n - \bar{\mathbf{X}}\bigl(\bar{\mathbf{X}}^\top\bar{\mathbf{X}}\bigr)^{-1}\bar{\mathbf{X}}^\top, \tag{13.4}\] the orthogonal projection onto the orthogonal complement of \(\mathrm{col}(\bar{\mathbf{X}})\).
Derivation via constrained normal equations. The structural model imposes: \[\E\!\left[\begin{pmatrix} 1 \\ X_i \\ Z_i \end{pmatrix}\varepsilon_i\right] = 0, \qquad \E[T_i\,\varepsilon_i] \neq 0. \tag{13.5}\]
Consider the sample normal equations one would obtain by running OLS of \(\mathbf{Y}\) on \((\mathbf{T}, \bar{\mathbf{X}}, \mathbf{Z})\), with residual \(\hat{\mathbf{e}} = \mathbf{Y} - \mathbf{T}\beta - \bar{\mathbf{X}}\bar\gamma\):
\[\mathbf{T}^\top \hat{\mathbf{e}} = \mathbf{0}, \quad (\text{NE-T}) \quad \bar{\mathbf{X}}^\top \hat{\mathbf{e}} = \mathbf{0}, \quad (\text{NE-X}) \quad \mathbf{Z}^\top \hat{\mathbf{e}} = \mathbf{0}. \quad (\text{NE-Z})\]
At the true \(\beta\), (NE-T) fails because \(\E[T_i\varepsilon_i] \neq 0\). We drop the invalid (NE-T) and solve the \((p+2)\)-dimensional system (NE-X), (NE-Z). From (NE-X): \(\hat{\bar\gamma} = (\bar{\mathbf{X}}^\top\bar{\mathbf{X}})^{-1}(\bar{\mathbf{X}}^\top\mathbf{Y} - \bar{\mathbf{X}}^\top\mathbf{T}\beta)\). Substituting into (NE-Z) and simplifying using \(M_X\): \(\mathbf{Z}^\top M_X \mathbf{T}\,\beta = \mathbf{Z}^\top M_X \mathbf{Y}\). Solving yields the IV regression estimator: \[\hat\beta_{\mathrm{IV}} = \frac{\mathbf{Z}^\top M_X \mathbf{Y}}{\mathbf{Z}^\top M_X \mathbf{T}}. \tag{13.6}\]
The Wald estimator is the special case \(Z \in \{0,1\}\), no covariates: \(M_X = I_n - n^{-1}\mathbf{1}\mathbf{1}^\top\) is the centering matrix, and \(\hat\beta_{\mathrm{IV}} = \hat\beta_{\mathrm{Wald}}\).
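A short sketch of the IV regression estimator with a covariate, on simulated data (all parameter values are assumptions of this illustration). Since \(M_X\) is a symmetric idempotent projection, \(\mathbf{Z}^\top M_X \mathbf{Y} = (M_X\mathbf{Z})^\top(M_X\mathbf{Y})\), so we can residualize each variable on \(\bar{\mathbf{X}}\) instead of forming the \(n \times n\) matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, gamma = 50_000, 2.0, 1.5          # assumed true values

X = rng.normal(size=n)
Z = 0.3 * X + rng.normal(size=n)           # instrument may be correlated with X
eta = rng.normal(size=n)
eps = 0.7 * eta + rng.normal(size=n)
T = 0.8 * Z + 0.5 * X + eta
Y = 1.0 + beta * T + gamma * X + eps

Xbar = np.column_stack([np.ones(n), X])    # [1_n, X]

def residualize(v):
    # applies the annihilator M_X without materializing the n x n matrix
    coef, *_ = np.linalg.lstsq(Xbar, v, rcond=None)
    return v - Xbar @ coef

Zt, Tt, Yt = residualize(Z), residualize(T), residualize(Y)
beta_iv = (Zt @ Yt) / (Zt @ Tt)            # Z' M_X Y / Z' M_X T
```
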
13.2.3 Structural Form, Reduced Form, and the Reduced Form Regression
The structural form. The system Equation 13.1–Equation 13.2 is the structural form: each equation describes how one variable is determined by others, including endogenous variables on the right-hand side. The structural coefficient \(\beta\) has a causal interpretation, but \(T\) is correlated with \(\varepsilon\), so OLS applied to the structural form is inconsistent.
The reduced form. The reduced form is obtained by solving the structural system so each endogenous variable is expressed purely as a function of exogenous variables \(Z\) and \(X\). Substituting Equation 13.2 into Equation 13.1: \[Y = \underbrace{(\alpha + \beta a_T)}_{\alpha_{\mathrm{rf}}} + \underbrace{\beta\pi}_{\phi}\, Z + \underbrace{(\beta\delta + \gamma)^\top}_{\gamma_{\mathrm{rf}}^\top}\, X + \underbrace{(\beta\eta + \varepsilon)}_{\nu}. \tag{13.7}\]
Writing compactly: \(Y = \alpha_{\mathrm{rf}} + \phi Z + \gamma_{\mathrm{rf}}^\top X + \nu\), where \(\phi = \beta\pi\), \(\nu = \beta\eta + \varepsilon\). Both right-hand-side variables (\(Z\) and \(X\)) are exogenous, so OLS is consistent for \(\phi\). Neither \(\beta\) nor \(\pi\) is separately identified from the reduced form alone.
Relationship to the first stage and IV estimator. The reduced-form regression estimator is \[\hat\phi_{\mathrm{RF}} = \frac{\widehat{\mathrm{Cov}}(\tilde{Z}, Y)}{\widehat{\mathrm{Var}}(\tilde{Z})}, \tag{13.8}\] and the first-stage regression estimator is \(\hat\pi_{\mathrm{FS}} = \widehat{\mathrm{Cov}}(\tilde{Z}, T)/\widehat{\mathrm{Var}}(\tilde{Z})\). Since \(\phi = \beta\pi\), the sample analog gives: \[\hat\beta_{\mathrm{IV}} = \frac{\hat\phi_{\mathrm{RF}}}{\hat\pi_{\mathrm{FS}}}, \tag{13.9}\] reproducing Equation 13.6. The reduced form delivers the instrument’s total effect on the outcome; the first stage scales it by the instrument’s effect on treatment; the ratio recovers the structural parameter.
Intent-to-treat interpretation. When \(Z\) is randomly assigned, \(\hat\phi_{\mathrm{RF}}\) estimates the intent-to-treat (ITT) effect: the average effect on \(Y\) of being assigned \(Z = 1\) rather than \(Z = 0\), regardless of actual treatment uptake. The ITT requires only exogeneity of \(Z\), not the exclusion restriction, and is often of direct policy interest.
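The reduced-form/first-stage ratio can be checked numerically on simulated data (DGP values are illustrative assumptions; with a binary regressor, each OLS slope equals the corresponding difference in group means, so the ratio reproduces the Wald estimator exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, pi = 200_000, 2.0, 0.5            # assumed true values; phi = beta * pi = 1

Z = rng.integers(0, 2, n)
eta = rng.normal(size=n)
eps = 0.8 * eta + rng.normal(size=n)
T = pi * Z + eta
Y = 1.0 + beta * T + eps

Zc = Z - Z.mean()
phi_rf = (Zc @ Y) / (Zc @ Zc)              # reduced-form slope: OLS of Y on Z (the ITT)
pi_fs = (Zc @ T) / (Zc @ Zc)               # first-stage slope: OLS of T on Z
beta_iv = phi_rf / pi_fs                   # ratio recovers the structural parameter

# identical to the Wald contrast of group means
beta_wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (T[Z == 1].mean() - T[Z == 0].mean())
```
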
13.3 Two-Stage Least Squares
Two-stage least squares (2SLS) extends the Wald estimator to settings with continuous or multi-valued instruments, multiple instruments, and observed covariates.
Stage 1: The First-Stage Regression. Regress \(T\) on \(Z\) and \(X\) by OLS, obtaining fitted values \(\hat{T}_i = \hat{a}_T + \hat\pi^\top Z_i + \hat\delta^\top X_i\). In matrix form, \(\hat{\mathbf{T}} = P_W \mathbf{T}\) where \(P_W\) is the projection onto the column space of the instrument design matrix \(\mathbf{W}\). The fitted values \(\hat{T}_i\) isolate the component of treatment variation spanned by \((1, Z, X)\) — the variation that is exogenous under IV validity.
Stage 2: The Second-Stage Regression. Regress \(Y\) on \(\hat{T}\) and \(X\) by OLS. With \(\hat{\mathbf{D}} = [\mathbf{1}_n,\,\hat{\mathbf{T}},\,\mathbf{X}]\), the second-stage OLS coefficient vector is \[\hat\theta_{\mathrm{2SLS}} = \bigl(\hat{\mathbf{D}}^\top\hat{\mathbf{D}}\bigr)^{-1}\hat{\mathbf{D}}^\top\mathbf{Y}, \tag{13.10}\] whose component on \(\hat{T}\) is \(\hat\beta_{\mathrm{2SLS}}\).
The component of \(T\) orthogonal to \((Z, X)\) — the residual \(\hat\eta_i = T_i - \hat{T}_i\), correlated with \(\varepsilon_i\) when \(T\) is endogenous — is dropped before the causal coefficient is estimated.
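The two stages can be sketched directly with ordinary least squares (simulated data; all DGP values are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 50_000, 2.0                      # assumed true effect

X = rng.normal(size=n)
Z = rng.normal(size=n)
eta = rng.normal(size=n)
eps = 0.7 * eta + rng.normal(size=n)
T = 0.8 * Z + 0.5 * X + eta
Y = 1.0 + beta * T + 1.5 * X + eps

# Stage 1: OLS of T on (1, Z, X); fitted values T_hat = P_W T
W = np.column_stack([np.ones(n), Z, X])
T_hat = W @ np.linalg.lstsq(W, T, rcond=None)[0]

# Stage 2: OLS of Y on (1, T_hat, X); the coefficient on T_hat is beta_2sls
D_hat = np.column_stack([np.ones(n), T_hat, X])
theta = np.linalg.lstsq(D_hat, Y, rcond=None)[0]
beta_2sls = theta[1]
```

Note that the endogenous residual component \(\hat\eta_i = T_i - \hat{T}_i\) never enters the second stage: it is discarded in Stage 1.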
13.4 Equivalence of the IV Regression Estimator and 2SLS
In the scalar-instrument linear model, Wald, IV regression, and 2SLS are not competing methods; they are different representations of the same sample analog of the identification formula.
13.5 The Moment-Condition View and GMM
13.5.1 2SLS as a Method-of-Moments Estimator
Stack the constant, covariates, and instrument into \(W = (1,\,X^\top,\,Z^\top)^\top\). The IV moment condition is \(\E[W\,(Y - \alpha - \beta T - \gamma^\top X)] = 0\). Setting \(\theta = (\alpha, \beta, \gamma^\top)^\top\) and defining \(U(O;\,\theta) = W\,(Y - D^\top\theta)\) where \(D = (1, T, X^\top)^\top\), the identifying condition is \(\E\{U(O;\,\theta_0)\} = 0\) — exactly an estimating equation in the sense of Chapter 10.
13.5.2 Overidentification and GMM
When \(q > 1\) instruments are available, the model is overidentified: more moment conditions than parameters. The system \(\mathbb{P}_n U(O;\,\theta) = 0\) is generically overdetermined and has no exact solution.
The generalized method of moments (GMM) minimizes a weighted quadratic form in the sample moments: \[\hat\theta_{\mathrm{GMM}} = \arg\min_{\theta}\;\bigl[\mathbb{P}_n U(O;\,\theta)\bigr]^\top\,\hat\Omega_n\,\bigl[\mathbb{P}_n U(O;\,\theta)\bigr].\]
Different choices of \(\hat\Omega_n\) yield different estimators: \(\hat\Omega_n = (n^{-1}\sum_i W_i W_i^\top)^{-1}\) yields 2SLS; \(\hat\Omega_n = [\mathbb{P}_n U(O;\,\hat\theta)U(O;\,\hat\theta)^\top]^{-1}\) yields the efficient GMM estimator. Under homoskedasticity, efficient GMM and 2SLS coincide. Under heteroskedasticity, efficient GMM is weakly (and generically strictly) more efficient.
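A compact sketch of linear GMM with two instruments on simulated data (DGP values are illustrative assumptions). For the linear moment \(U = W(Y - D^\top\theta)\), the minimizer of the quadratic form has the closed form \(\hat\theta = (\hat A^\top\Omega\hat A)^{-1}\hat A^\top\Omega\,\hat b\) with \(\hat A = n^{-1}\mathbf{W}^\top\mathbf{D}\), \(\hat b = n^{-1}\mathbf{W}^\top\mathbf{Y}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 100_000, 2.0                     # assumed true effect

Z1, Z2 = rng.normal(size=n), rng.normal(size=n)   # two instruments: overidentified
eta = rng.normal(size=n)
eps = 0.7 * eta + rng.normal(size=n)
T = 0.6 * Z1 + 0.4 * Z2 + eta
Y = 1.0 + beta * T + eps

W = np.column_stack([np.ones(n), Z1, Z2])  # m = 3 moment conditions
D = np.column_stack([np.ones(n), T])       # k = 2 parameters

def gmm(Omega):
    # closed-form minimizer of the GMM quadratic form for linear moments
    A = W.T @ D / n
    b = W.T @ Y / n
    return np.linalg.solve(A.T @ Omega @ A, A.T @ Omega @ b)

# 2SLS weighting matrix
theta_2sls = gmm(np.linalg.inv(W.T @ W / n))
# efficient weighting: inverse moment variance at first-step residuals
e = Y - D @ theta_2sls
Sigma_hat = (W * e[:, None] ** 2).T @ W / n
theta_eff = gmm(np.linalg.inv(Sigma_hat))
```
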
13.6 Asymptotic Theory of the GMM Estimator
13.6.1 Setup and Notation
Stack the structural regressors and instruments: \[D_i = (1,\, T_i,\, X_i^\top)^\top \in \mathbb{R}^k, \qquad W_i = (1,\, X_i^\top,\, Z_i^\top)^\top \in \mathbb{R}^m,\] where \(k = p + 2\) and \(m = p + 1 + q\). The IV moment condition is \(\E[U(O_i;\,\theta_0)] = 0\), \(U(O_i;\,\theta) = W_i(Y_i - D_i^\top\theta)\). Define the sensitivity matrix \(A = \E[W_i D_i^\top] \in \mathbb{R}^{m \times k}\) and moment variance matrix \(\Sigma = \E[\varepsilon_i^2\, W_i W_i^\top]\).
13.6.2 Asymptotic Distribution
Under standard regularity conditions, with weighting matrix \(\hat\Omega_n \xrightarrow{p} \Omega\) positive definite, \[\sqrt{n}\bigl(\hat\theta_{\mathrm{GMM}} - \theta_0\bigr) \xrightarrow{d} N\bigl(0,\ V_{\mathrm{GMM}}(\Omega)\bigr), \qquad V_{\mathrm{GMM}}(\Omega) = (A^\top\Omega A)^{-1}\,A^\top\Omega\,\Sigma\,\Omega A\,(A^\top\Omega A)^{-1}. \tag{13.11}\] The formula is an instance of the general estimating-equation theory from Chapter 10: the sensitivity matrix \(A\) plays the role of \(-\E[\partial U/\partial\theta^\top]\) and the moment variance \(\Sigma\) plays the role of \(\E[UU^\top]\).
13.6.3 Two Important Special Cases
Exactly identified case (\(q = 1\), \(m = k\)). \(A\) is square and invertible. All weighting matrices yield the same estimator. The sandwich variance simplifies to: \[V_{\mathrm{IV}} = A^{-1}\,\Sigma\,(A^\top)^{-1}. \tag{13.12}\] In the scalar no-covariate case (\(p = 0\)), the block corresponding to \(\beta\) is: \[V_\beta = \frac{\E[\varepsilon^2 \tilde{Z}^2]}{(\E[\tilde{Z}\,T])^2}, \qquad \tilde{Z} = Z - \E[Z].\] Under homoskedasticity and first-stage relation \(T = a_T + \pi Z + \eta\): \(V_\beta = \sigma^2/(\pi^2\,\mathrm{Var}(Z))\).
Efficient GMM. The asymptotic variance \(V_{\mathrm{GMM}}(\Omega)\) is minimized by \(\Omega^\ast = \Sigma^{-1}\), giving: \[V_{\mathrm{eff}} = \bigl(A^\top\Sigma^{-1}A\bigr)^{-1}. \tag{13.13}\] This is the semiparametric efficiency lower bound for IV estimation within the linear IV moment-restriction model \(\E[W\varepsilon] = 0\).
13.6.4 Consistent Variance Estimation
Let \(\hat\varepsilon_i = Y_i - D_i^\top\hat\theta\) denote the structural residuals. Consistent estimators: \(\hat{A} = n^{-1}\sum_i W_i D_i^\top\), \(\hat\Sigma = n^{-1}\sum_i \hat\varepsilon_i^2\, W_i W_i^\top\). The heteroskedasticity-robust sandwich variance estimator is: \[\hat{V}_{\mathrm{GMM}} = (\hat{A}^\top\hat\Omega_n\hat{A})^{-1}\,\hat{A}^\top\hat\Omega_n\,\hat\Sigma\,\hat\Omega_n\hat{A}\,(\hat{A}^\top\hat\Omega_n\hat{A})^{-1}. \tag{13.14}\]
In the exactly identified case, Equation 13.14 reduces to \(\hat{A}^{-1}\hat\Sigma(\hat{A}^\top)^{-1}\), independent of \(\hat\Omega_n\). In applications, the default should be heteroskedasticity-robust standard errors; cluster-robust standard errors are required when observations within groups share unmodeled common shocks.
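A minimal sketch of the sandwich variance estimator in the exactly identified case, on simulated data (all DGP values are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta = 100_000, 2.0                     # assumed true effect

Z = rng.normal(size=n)
eta = rng.normal(size=n)
eps = 0.7 * eta + rng.normal(size=n)
T = 0.5 * Z + eta
Y = 1.0 + beta * T + eps

W = np.column_stack([np.ones(n), Z])       # instruments: m = k = 2, exactly identified
D = np.column_stack([np.ones(n), T])       # structural regressors

# Exactly identified IV: solve the sample moment equations W'(Y - D theta) = 0
theta = np.linalg.solve(W.T @ D, W.T @ Y)
e = Y - D @ theta                          # structural residuals (not second-stage residuals)

A_hat = W.T @ D / n
Sigma_hat = (W * e[:, None] ** 2).T @ W / n
A_inv = np.linalg.inv(A_hat)
V_hat = A_inv @ Sigma_hat @ A_inv.T        # exactly identified sandwich
se_beta = np.sqrt(V_hat[1, 1] / n)         # heteroskedasticity-robust SE for beta_hat
```
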
13.7 Weak Instruments and Inferential Fragility
When the first-stage relationship is weak, IV estimation becomes severely fragile. A weak instrument is not merely an efficiency problem: it makes the finite-sample distribution of IV estimators highly non-normal, magnifies bias toward OLS, and undermines conventional confidence intervals.
13.7.1 The Weak-Instrument Problem
The 2SLS closed form Equation 13.10 divides by the sample covariance \(\widehat{\mathrm{Cov}}(\tilde{Z}, T)\). When the population first-stage coefficient \(\pi\) is near zero, the consequences are (Bound et al. 1995):
- Finite-sample bias toward OLS. As \(\pi \to 0\), the 2SLS bias approaches the OLS bias rather than zero.
- Non-Gaussian finite-sample distribution. The distribution of \(\hat\beta_{\mathrm{2SLS}}\) can be highly skewed or heavy-tailed, rendering the \(N(0,\,V_{\mathrm{2SLS}})\) approximation unreliable.
- Size distortion. Wald-type confidence intervals can severely undercover the true parameter.
13.7.2 Diagnostic: The First-Stage \(F\)-Statistic
The most widely used diagnostic is the \(F\)-statistic from the first-stage regression, testing the joint significance of \(Z\) after partialling out \(X\). Staiger and Stock (1997) argued informally for \(F \geq 10\) as adequate instrument strength; Stock and Yogo (2005) provided formal critical values. A large first-stage \(F\) supports relevance, but says nothing directly about exogeneity or exclusion. The first-stage \(F\)-statistic is a relevance diagnostic, not a certificate of instrument validity.
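With a single instrument the first-stage \(F\) is the squared \(t\)-statistic on \(Z\) after partialling out \((1, X)\). A sketch contrasting a strong and a weak first stage on simulated data (coefficients \(0.5\) and \(0.01\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000

X = rng.normal(size=n)
Z = rng.normal(size=n)
eta = rng.normal(size=n)
T_strong = 0.5 * Z + 0.3 * X + eta         # strong first stage
T_weak = 0.01 * Z + 0.3 * X + eta          # nearly irrelevant instrument

def first_stage_F(T):
    # F-statistic for H0: coefficient on Z is zero, after partialling out (1, X)
    Xbar = np.column_stack([np.ones(n), X])
    Zt = Z - Xbar @ np.linalg.lstsq(Xbar, Z, rcond=None)[0]   # residualized Z
    Tt = T - Xbar @ np.linalg.lstsq(Xbar, T, rcond=None)[0]
    pi_hat = (Zt @ Tt) / (Zt @ Zt)
    resid = Tt - pi_hat * Zt
    se = np.sqrt((resid @ resid) / (n - 3) / (Zt @ Zt))       # homoskedastic SE
    return (pi_hat / se) ** 2              # one instrument: F = t^2
```
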
13.7.3 Weak-Instrument-Robust Inference
When first-stage strength is uncertain, alternative inferential procedures with size guarantees are needed.
The Anderson–Rubin (AR) test (Anderson and Rubin 1949) inverts the question: it tests whether a hypothesized value \(\beta_0\) is consistent with the IV moment condition. Substituting \(\beta_0\) into the structural model gives \(Y - \beta_0 T = \alpha + \gamma^\top X + (\varepsilon + (\beta - \beta_0)T)\). If \(\beta_0 = \beta\), the composite error is uncorrelated with \(Z\) by instrument validity. The AR test regresses \(Y - \beta_0 T\) on \(Z\) and \(X\) and tests the null that the coefficient on \(Z\) is zero. Under the classical homoskedastic Gaussian linear model the \(F\)-statistic has an exact finite-sample \(F\)-distribution regardless of instrument strength. Inverting this test yields a confidence set valid whether or not the instrument is weak.
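A sketch of AR test inversion on simulated data (no covariates; the asymptotic \(\chi^2_1\) critical value 3.84 is used rather than the exact \(F\); all DGP values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta = 5_000, 2.0                       # assumed true effect

Z = rng.normal(size=n)
eta = rng.normal(size=n)
eps = 0.8 * eta + rng.normal(size=n)
T = 0.5 * Z + eta
Y = 1.0 + beta * T + eps

def ar_stat(beta0):
    # regress Y - beta0*T on (1, Z); Wald statistic for the Z coefficient
    U = Y - beta0 * T
    Zc = Z - Z.mean()
    g = (Zc @ U) / (Zc @ Zc)               # slope on Z
    r = (U - U.mean()) - g * Zc            # OLS residuals
    se = np.sqrt((r @ r) / (n - 2) / (Zc @ Zc))
    return (g / se) ** 2                   # approx chi2_1 under H0: beta0 = beta

# invert the test over a grid: keep the beta0 values that are not rejected
grid = np.linspace(0.0, 4.0, 401)
mask = np.array([ar_stat(b) < 3.84 for b in grid])
ar_ci = grid[mask]                         # AR confidence set
```

Values far from the truth (such as \(\beta_0 = 0\)) produce a large statistic and are excluded; the resulting set remains valid even when the first stage is weak.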
The conditional likelihood ratio (CLR) test of Moreira (2003) extends the AR idea more efficiently to the multiple-instrument case.
13.8 Generalized Empirical Likelihood
GEL uses the same IV moment restrictions but incorporates them through an implied reweighting of the sample rather than through a separately estimated covariance weight matrix. Section 13.6 showed that efficient GMM requires a two-step procedure; this two-step structure introduces finite-sample bias (Newey and Smith 2004). GEL provides a one-step alternative.
13.8.1 The GEL Estimator
Let \(U_i(\theta) = W_i(Y_i - D_i^\top\theta)\) and \(\bar{U}(\theta) = n^{-1}\sum_i U_i(\theta)\). GEL introduces a strictly convex function \(G\colon \mathcal{V} \to \mathbb{R}\) (open interval \(\mathcal{V}\) containing zero) and solves the saddle-point problem: \[\hat\theta_{\mathrm{GEL}} = \arg\min_{\theta}\;\sup_{\lambda \in \Lambda_n(\theta)}\;\frac{1}{n}\sum_{i=1}^n [-G(\lambda^\top U_i(\theta))], \tag{13.15}\] where \(\lambda \in \mathbb{R}^m\) is an auxiliary dual variable. \(G\) is normalized: \(G(0) = 0\), \(g(0) = 1\), \(g'(0) = 1\) where \(g = G'\).
13.8.2 The Convex-Conjugate Duality (Optional)
The GEL saddle-point problem Equation 13.15 is the Lagrangian dual of a minimum-discrepancy (MD) primal problem that re-weights observations. Define the Legendre–Fenchel conjugate of \(G\): \[F(\omega) = \sup_{v \in \mathcal{V}}\,[\omega v - G(v)], \tag{13.16}\] a strictly convex function with \(F(1) = 0\) (by the normalization). The MD estimator minimizes a convex divergence between the observation weights \(\omega_i\) and the uniform reference \(\omega_i = 1\), subject to the moment constraint: \[\hat\theta_{\mathrm{MD}} = \arg\min_{\theta}\;\min_{\omega_1,\dots,\omega_n}\;\frac{1}{n}\sum_{i=1}^n F(\omega_i) \quad \text{subject to} \quad \frac{1}{n}\sum_{i=1}^n \omega_i\,U_i(\theta) = 0. \tag{13.17}\]
The GEL special cases correspond to different divergences: EL uses reverse KL; ET uses forward KL; CUE uses a quadratic distance.
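For CUE, \(G(v) = v + v^2/2\), the inner supremum has a closed form and the profile objective reduces to \(\frac{1}{2}\bar U(\theta)^\top[n^{-1}\sum_i U_iU_i^\top]^{-1}\bar U(\theta)\). A sketch on simulated overidentified data, minimizing that profile by grid search (the intercept-free DGP and all parameter values are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n, beta = 20_000, 2.0                      # assumed true effect

Z1, Z2 = rng.normal(size=n), rng.normal(size=n)   # two instruments
eta = rng.normal(size=n)
eps = 0.7 * eta + rng.normal(size=n)
T = 0.6 * Z1 + 0.4 * Z2 + eta
Y = beta * T + eps                         # no intercept: theta is scalar

W = np.column_stack([Z1, Z2])

def cue_objective(b):
    # profile GEL objective for G(v) = v + v^2/2 (the CUE):
    #   (1/2) Ubar' [n^{-1} sum_i U_i U_i']^{-1} Ubar
    U = W * (Y - b * T)[:, None]           # U_i(b) = W_i (Y_i - b T_i)
    Ubar = U.mean(axis=0)
    S = U.T @ U / n
    return 0.5 * Ubar @ np.linalg.solve(S, Ubar)

grid = np.linspace(1.5, 2.5, 1001)
beta_cue = grid[np.argmin([cue_objective(b) for b in grid])]
```

Unlike two-step GMM, the weighting matrix here is recomputed at every candidate \(b\) ("continuously updated") rather than fixed at a first-step estimate.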
13.8.3 Asymptotic Properties and Comparison with GMM
All GEL estimators are consistent and asymptotically normal with the efficient GMM variance \(V_{\mathrm{eff}}\), without a preliminary weighting step. The profiled GEL objective also supplies an overidentification statistic asymptotically equivalent to the \(J\)-statistic; for EL specifically, \(T_{\mathrm{EL}} = -2\sum_i \log(n\hat\pi_i)\), the empirical likelihood ratio statistic, where \(\hat\pi_i\) are the implied observation probabilities. To higher order, GEL estimators have smaller bias than two-step GMM: GEL eliminates the bias from estimating the Jacobian; EL additionally eliminates bias from estimating the weighting matrix \(\Sigma\).
13.9 The Control Function Approach
The control-function approach offers an alternative route to handling endogeneity: instead of projecting treatment onto instruments, it augments the outcome model with a control variable that absorbs the endogenous component of treatment selection.
2SLS achieves identification by replacing the endogenous regressor \(T\) with its exogenous projection \(\hat{T}\). The control function approach adds the first-stage residual to the outcome regression as an explicit control for the endogenous variation, rather than removing it from the treatment variable.
13.9.1 Linear Model and Equivalence to 2SLS
The linear control-function representation requires the linear control-function assumption: \[\E[\varepsilon \mid \eta, Z, X] = \rho\,\eta, \qquad \rho = \frac{\mathrm{Cov}(\varepsilon, \eta)}{\mathrm{Var}(\eta)}. \tag{13.18}\]
This holds under joint normality of \((\varepsilon, \eta)\) given \((Z, X)\). Defining \(\xi = \varepsilon - \rho\eta\), assumption Equation 13.18 is equivalent to \(\E[\xi \mid \eta, Z, X] = 0\). Substituting into the outcome equation: \(Y = \alpha + \beta T + \gamma^\top X + \rho\eta + \xi\). Including \(\eta\) as an additional regressor renders \(T\) exogenous in the augmented regression.
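A sketch of the two-step control function procedure on simulated data (all DGP values, including \(\rho = 0.6\), are illustrative assumptions), checking the numerical equivalence to 2SLS:

```python
import numpy as np

rng = np.random.default_rng(9)
n, beta = 10_000, 2.0                      # assumed true effect

Z = rng.normal(size=n)
eta = rng.normal(size=n)
eps = 0.6 * eta + rng.normal(size=n)       # rho = Cov(eps, eta)/Var(eta) = 0.6
T = 0.5 * Z + eta
Y = 1.0 + beta * T + eps

# First stage: OLS of T on (1, Z); keep the residual eta_hat
W = np.column_stack([np.ones(n), Z])
eta_hat = T - W @ np.linalg.lstsq(W, T, rcond=None)[0]

# Control function: OLS of Y on (1, T, eta_hat); eta_hat absorbs the endogeneity
Dcf = np.column_stack([np.ones(n), T, eta_hat])
coef = np.linalg.lstsq(Dcf, Y, rcond=None)[0]
beta_cf, rho_hat = coef[1], coef[2]

# 2SLS for comparison: second stage on the fitted values T_hat = T - eta_hat
T_hat = T - eta_hat
beta_2sls = np.linalg.lstsq(np.column_stack([np.ones(n), T_hat]), Y, rcond=None)[0][1]
```

The coefficients on \(T\) agree exactly: the column spaces \((1, T, \hat\eta)\) and \((1, \hat T, \hat\eta)\) coincide, and \(\hat\eta\) is orthogonal to \((1, \hat T)\) by construction.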
13.9.2 Testing Endogeneity via Residual Inclusion
The control function representation yields a natural test of \(H_0: \rho = 0\) (exogeneity of \(T\)). The \(t\)-statistic on \(\hat\eta\) in the augmented regression tests endogeneity. A rejection suggests endogeneity under the maintained instrument assumptions; it is not a stand-alone validation of the instrument, since the test takes instrument validity as given. This is the regression-based form of the Hausman (1978) endogeneity test.
13.9.3 Brief Note on Nonlinear Extensions
2SLS does not carry over. In a probit or Poisson outcome model, 2SLS is generally inconsistent: plugging in \(\hat T\) breaks the nonlinear link function.
Related control-variable methods exist. Imbens and Newey (2009) showed that under independence, \((\varepsilon, \eta) \indep Z \mid X\), and scalar monotonicity, \(T = h(Z, X, \eta)\) strictly monotonic in a scalar \(\eta\), the conditional CDF: \[V = F_{T \mid Z, X}(T \mid Z, X) \tag{13.19}\] is a valid control variable in the sense that \(T \indep \varepsilon \mid X, V\). Conditioning on \((X, V)\) recovers structural variation in \(T\), allowing identification of the average structural function. Note that \(X\) must be retained in the conditioning set; dropping it gives the stronger \(T \indep \varepsilon \mid V\), which fails whenever \(X\) has any direct effect on \(\varepsilon\).
13.10 Chapter Summary
| Symbol | Meaning |
|---|---|
| \(\hat\beta_{\mathrm{Wald}}\) | Wald estimator: \((\bar{Y}_1 - \bar{Y}_0)/(\bar{T}_1 - \bar{T}_0)\) |
| \(\hat\phi_{\mathrm{RF}}\) | Reduced form regression estimator of \(\phi = \beta\pi\) |
| \(\hat\pi_{\mathrm{FS}}\) | First-stage regression estimator of \(\pi\) |
| \(M_X\) | Annihilator matrix (within-\(X\) residuals) |
| \(\hat\beta_{\mathrm{IV}}\) | IV regression estimator Equation 13.6 |
| \(\hat\beta_{\mathrm{2SLS}}\) | 2SLS estimator Equation 13.10 |
| \(V_{\mathrm{GMM}}(\Omega)\) | Sandwich variance Equation 13.11 |
| \(V_{\mathrm{eff}}\) | Efficient GMM variance Equation 13.13 |
| \(\hat{V}_{\mathrm{GMM}}\) | Consistent variance estimator Equation 13.14 |
| \(J\)-statistic | Sargan–Hansen overidentification test |
| \(\hat\theta_{\mathrm{GEL}}\) | GEL estimator Equation 13.15 |
- Wald, IV regression, and 2SLS are one estimator. The Wald estimator, IV regression estimator, and 2SLS are numerically identical in the single-instrument case: they are different representations of the same sample analog of the identification formula. 2SLS extends to multiple instruments; the Wald estimator is the further special case \(Z \in \{0,1\}\), \(X\) absent.
- Structural form vs. reduced form. The structural form contains endogenous regressors; the reduced form expresses each endogenous variable as a function of exogenous variables. The reduced form coefficient \(\phi = \beta\pi\) is estimable by OLS; \(\beta\) is recovered only via the ratio \(\hat\phi_{\mathrm{RF}}/\hat\pi_{\mathrm{FS}}\).
- 2SLS as method-of-moments. 2SLS is the method-of-moments estimator for the IV orthogonality condition \(\E[W\varepsilon] = 0\), an instance of Chapter 10’s estimating-equation framework.
- GMM and efficient weighting. Efficient GMM achieves the semiparametric efficiency bound; 2SLS is efficient under homoskedasticity but not generally under heteroskedasticity. All standard errors should be heteroskedasticity-robust; cluster-robust when observations within groups share common shocks.
- Weak instruments. Weak instruments cause finite-sample bias toward OLS, non-Gaussian distributions, and size distortion. The first-stage \(F\)-statistic is a relevance diagnostic, not a validity certificate. Anderson–Rubin confidence sets provide weak-instrument-robust inference.
- GEL. GEL estimators achieve the efficient GMM variance in one step without a preliminary weighting step. To higher order, they have smaller bias than two-step GMM in overidentified models.
- Control function. In the linear model, the control function approach is numerically equivalent to 2SLS; it provides a direct test of endogeneity via the \(t\)-statistic on \(\hat\eta\). In nonlinear models, related control-variable methods exist under the stronger independence and scalar-monotonicity conditions of Imbens and Newey (2009).
13.11 Problems
1. The Wald estimator as a ratio-of-moments estimator.
- Augment \(\beta\) with an intercept \(\alpha\) and express the structural model as the solution to the two-dimensional moment condition \(\E[(1, Z)^\top(Y - \alpha - \beta T)] = 0\). Verify exact identification and solve to recover \(\beta = \Delta_Y/\Delta_T\), where \(\Delta_Y = \E[Y \mid Z = 1] - \E[Y \mid Z = 0]\) and \(\Delta_T = \E[T \mid Z = 1] - \E[T \mid Z = 0]\).
- Using \(\bar{Y}_z \xrightarrow{p} \mu_Y(z)\) and \(\bar{T}_z \xrightarrow{p} \mu_T(z)\), prove \(\hat\beta_{\mathrm{Wald}} \xrightarrow{p} \beta\) via the continuous mapping theorem.
- Apply the delta method to \((\hat\Delta_Y, \hat\Delta_T)^\top\) to show \(\sqrt{n}(\hat\beta_{\mathrm{Wald}} - \beta) \xrightarrow{d} N(0, V)\) where \(V = \Delta_T^{-2}\sum_{z \in \{0,1\}}\mathrm{Var}(Y_i - \beta T_i \mid Z_i = z)/p_z\), and confirm this matches the IV variance formula Equation 13.12 in the scalar no-covariate case.
2. Matrix form of 2SLS and why second-stage standard errors are wrong.
Let \(\mathbf{D} \in \mathbb{R}^{n \times k}\) be the full regressor matrix and \(\mathbf{W} \in \mathbb{R}^{n \times m}\) the full instrument matrix. Let \(P_\mathbf{W} = \mathbf{W}(\mathbf{W}^\top\mathbf{W})^{-1}\mathbf{W}^\top\).
- Show the 2SLS estimator can be written as \(\hat\theta_{\mathrm{2SLS}} = (\mathbf{D}^\top P_\mathbf{W}\mathbf{D})^{-1}\mathbf{D}^\top P_\mathbf{W}\mathbf{Y}\).
- In the single-instrument, no-covariate case, verify from the matrix formula that \(\hat\beta_{\mathrm{2SLS}} = \hat\beta_{\mathrm{IV}}\).
- The second-stage OLS uses \(\hat{\mathbf{D}}\) in place of \(\mathbf{D}\). Let \(\hat{\boldsymbol\varepsilon}_{\mathrm{2nd}} = \mathbf{Y} - \hat{\mathbf{D}}\hat\theta_{\mathrm{2SLS}}\). Show \(\hat{\boldsymbol\varepsilon}_{\mathrm{2nd}} \neq \mathbf{Y} - \mathbf{D}\hat\theta_{\mathrm{2SLS}}\) in general. Explain why this discrepancy makes the second-stage OLS standard errors invalid, and identify the correct residuals for the sandwich variance Equation 13.14.
3. Efficiency of GMM and the Sargan–Hansen \(J\)-statistic.
- Prove \(V_{\mathrm{GMM}}(\Omega) \succeq V_{\mathrm{eff}}\) for every positive-definite \(\Omega\), where \(V_{\mathrm{eff}} = (A^\top\Sigma^{-1}A)^{-1}\). (Hint: factor \(V_{\mathrm{GMM}}(\Omega) - V_{\mathrm{eff}}\) as \(C^\top\Sigma^{-1}C\) for a suitable matrix \(C\).)
- Under homoskedasticity, show that 2SLS is the efficient GMM estimator by verifying \(\Omega_{\mathrm{2SLS}}\) is a scalar multiple of \(\Sigma^{-1}\).
- Return to the two-instrument example. At \(\hat\beta \approx 0.764\) with \(\hat\Sigma = I_2\) and \(n = 200\), compute \(J = n\,\hat{U}(\hat\beta)^\top\hat\Sigma^{-1}\hat{U}(\hat\beta)\) and determine using the \(\chi^2_1\) critical value at the 5% level whether the overidentifying restriction is rejected.
4. GEL first-order conditions and the minimum-discrepancy dual.
- For EL, \(G(v) = -\log(1 - v)\). Write the first-order condition for the inner supremum and show it implies \(\sum_i \hat\pi_i U_i(\theta) = 0\) where \(\hat\pi_i \propto (1 - \hat\lambda^\top U_i(\theta))^{-1}\).
- For CUE, \(G(v) = v + v^2/2\). Solve the inner supremum explicitly at fixed \(\theta\) to show \(\hat\lambda = -[n^{-1}\sum_i U_i(\theta)U_i(\theta)^\top]^{-1}\bar{U}(\theta)\), and confirm the profile objective equals \(\frac{1}{2}\bar{U}^\top[n^{-1}\sum_i U_iU_i^\top]^{-1}\bar{U}\).
- In the exactly identified case (\(m = k\)), show that \(\hat\lambda = 0\) at any GEL solution \(\hat\theta\). Conclude that every GEL estimator coincides with the just-identified GMM estimator and the empirical probabilities all equal \(1/n\).
5. Control function, endogeneity testing, and the limits of instrument diagnostics.
Let \(\varepsilon = \rho\eta + \xi\) with \(\rho = \mathrm{Cov}(\varepsilon,\eta)/\mathrm{Var}(\eta)\).
- Show \(\E[\xi] = 0\) and \(\mathrm{Cov}(\xi, \eta) = 0\) by construction. Then assume \(\E[\varepsilon \mid \eta, Z] = \rho\eta\) and verify \(\E[\xi \mid \eta, Z] = 0\). Explain why this renders \(T\) exogenous in the augmented regression of \(Y\) on \((T, \eta)\).
- Show that the coefficient on \(\hat\eta\) in the augmented regression is a consistent estimator of \(\rho\), and connect the \(t\)-test on \(\hat\eta\) to the Hausman (1978) endogeneity test.
- Suppose an instrument \(Z\) affects wages both through education and through a direct network effect, but the model is exactly identified. Explain why neither the first-stage \(F\)-test nor the control function endogeneity test can detect this exclusion-restriction violation, and what additional information would be needed.