11  Doubly Robust Estimation and Semiparametric Efficiency

Learning Objectives

By the end of this chapter, students should be able to:

  1. Derive the AIPW estimator as a bias-corrected prediction estimator, and explain the role of each term.
  2. Prove the double robustness property using the law of iterated expectations.
  3. Describe the class of augmented IPW estimators indexed by arbitrary control functions \(b_t(x)\), verify unbiasedness for any \(b_t\), and identify the optimal choice that minimizes total variance.
  4. Interpret the augmentation step as a Hilbert-space projection onto the orthogonal complement of the augmentation space, and derive the Pythagorean variance decomposition.
  5. Carry out the projection argument in the causal inference setting, starting from the Horvitz–Thompson estimator.
  6. State the semiparametric efficiency bound for the ATE and explain why the AIPW estimating function is the efficient influence function.
  7. Derive two approaches to building double robustness directly into the outcome model: IPW-weighted regression and the augmented model with clever covariate; explain why including \(\hat\pi_i^{-1}\) acts as a debiasing correction and connect this to TMLE.
  8. Identify the product-rate condition as the key sufficient condition for nuisance estimation error to be asymptotically negligible, and construct a consistent variance estimator and Wald confidence interval.

11.1 Why Combine Outcome Regression and Weighting?

Under consistency, conditional exchangeability, and positivity, the ATE \(\tau = \E\{Y(1) - Y(0)\}\) is identified by either of the following: \[\tau = \E\{\mu_1(X) - \mu_0(X)\}, \qquad \text{or} \qquad \tau = \E\!\left\{\frac{TY}{\pi(X)} - \frac{(1-T)Y}{1-\pi(X)}\right\},\] where \(\mu_t(x) = \E(Y \mid T{=}t,\, X{=}x)\) and \(\pi(x) = P(T{=}1 \mid X{=}x)\).

These formulas suggest two basic estimation strategies. The outcome regression (prediction) estimator averages \(\hat\mu_1(X_i) - \hat\mu_0(X_i)\) over the sample. The IPW estimator reweights observed outcomes using the propensity score. Each strategy has weaknesses: regression can be biased under misspecification, while IPW can be unstable when estimated propensity scores are close to 0 or 1 (Rosenbaum and Rubin 1983).

This chapter develops a third strategy: combine both models to construct an estimator that is consistent when either model is correct, and achieves the semiparametric efficiency bound when both are correct. We derive the same estimator in three complementary ways: as a bias-corrected prediction estimator (Section 11.2), as the optimal member of a class of augmented estimators (Section 11.4 and Section 11.6), and as the efficient influence function (Section 11.7).

11.2 The Prediction Estimator and Its Bias

Work first with generic working regression functions \(m_t(x)\), not necessarily equal to the true conditional means. The prediction estimator based on \(m_t\) is: \[\hat\tau_{\mathrm{pred}}(m) = \frac{1}{n}\sum_{i=1}^n \{m_1(X_i) - m_0(X_i)\}.\] Define the population prediction estimand \(\tau_{\mathrm{pred}}(m) = \E\{m_1(X) - m_0(X)\}\). Under ignorability: \[\tau_{\mathrm{pred}}(m) - \tau = \E\bigl[(m_1(X) - \mu_1(X)) - (m_0(X) - \mu_0(X))\bigr]. \tag{11.1}\]

If a propensity score estimator \(\hat\pi\) is available, the two bias terms in Equation 11.1 can be estimated from observed data. Taking \(m_t = \hat\mu_t\), a fitted outcome regression, define the residuals \(e_i(t) = Y_i - \hat\mu_t(X_i)\), computable from observed data for units with \(T_i = t\): \[\widehat{\mathrm{Bias}}(\hat\tau_{\mathrm{pred}}) = -\frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat\pi(X_i)}\,e_i(1) + \frac{1}{n}\sum_{i=1}^n \frac{1-T_i}{1-\hat\pi(X_i)}\,e_i(0).\]

The bias-corrected prediction estimator is \(\hat\tau_{\mathrm{AIPW}} = \hat\tau_{\mathrm{pred}} - \widehat{\mathrm{Bias}}(\hat\tau_{\mathrm{pred}}) = \hat\mu_{1,\mathrm{dr}} - \hat\mu_{0,\mathrm{dr}}\), where: \[\hat\mu_{1,\mathrm{dr}} = \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat\pi(X_i)}\{Y_i - \hat\mu_1(X_i)\} + \frac{1}{n}\sum_{i=1}^n \hat\mu_1(X_i), \tag{11.2}\] \[\hat\mu_{0,\mathrm{dr}} = \frac{1}{n}\sum_{i=1}^n \frac{1-T_i}{1-\hat\pi(X_i)}\{Y_i - \hat\mu_0(X_i)\} + \frac{1}{n}\sum_{i=1}^n \hat\mu_0(X_i). \tag{11.3}\]

11.3 The Augmented IPW Estimator

Notation. From this point on, \(\mu_t(x)\) inside estimating functions denotes a generic working outcome regression; the truth is written \(\mu_t^*(x) = \E(Y \mid T{=}t, X{=}x)\). The propensity score \(\pi(x)\) inside estimating functions denotes a working model; the truth is \(\pi^*(x) = P(T{=}1 \mid X{=}x)\).

Collecting terms gives the estimating-equation form. The AIPW estimator solves \(\mathbb{P}_n\{\phi(O;\tau,\hat\eta)\} = 0\), where \(\eta = (\pi, \mu_0, \mu_1)\) collects the nuisance functions and: \[\phi(O;\tau,\eta) = \left[\frac{T}{\pi(X)}\{Y-\mu_1(X)\}+\mu_1(X)\right] - \left[\frac{1-T}{1-\pi(X)}\{Y-\mu_0(X)\}+\mu_0(X)\right] - \tau. \tag{11.4}\]

Solving explicitly:

Definition: AIPW Estimator

\[\hat\tau_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^n \Bigg[\hat\mu_1(X_i) - \hat\mu_0(X_i) + \frac{T_i}{\hat\pi(X_i)}\{Y_i - \hat\mu_1(X_i)\} - \frac{1-T_i}{1-\hat\pi(X_i)}\{Y_i - \hat\mu_0(X_i)\}\Bigg]. \tag{11.5}\]

The augmented inverse probability weighted (AIPW) estimator equals the prediction estimator plus a weighted-residual bias correction.
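As a computational sketch, Equation 11.5 is a single sample average once the fitted nuisances are in hand. The helper below (the name `aipw_ate` is ours, not a library function) assumes the propensity scores are bounded away from 0 and 1.

```python
import numpy as np

def aipw_ate(y, t, pi_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the ATE (Equation 11.5).

    y, t: outcome and treatment arrays; pi_hat: fitted propensity scores
    (assumed bounded away from 0 and 1); mu1_hat, mu0_hat: fitted outcome
    regressions evaluated at each X_i.
    """
    aug1 = t * (y - mu1_hat) / pi_hat              # treated-arm residual term
    aug0 = (1 - t) * (y - mu0_hat) / (1 - pi_hat)  # control-arm residual term
    return np.mean(mu1_hat - mu0_hat + aug1 - aug0)
```

When both residual terms vanish (as under the IBC conditions of Section 11.8), this reduces to the prediction estimator.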

Remark: Two Roles of the Augmentation Terms

The terms \(T_i\{Y_i - \hat\mu_1(X_i)\}/\hat\pi(X_i)\) have two complementary roles. From the bias-correction perspective, they are IPW-weighted residuals that estimate the prediction error of the outcome regression. From an estimating-equation perspective, they make the estimating function orthogonal to nuisance perturbations. The estimated bias term has zero expectation when the outcome model is correct; the bias-corrected estimator is unbiased when the propensity model is correct. Hence the name doubly robust.

Theorem: Double Robustness

Let \(\phi(O;\tau,\eta)\) be the estimating function in Equation 11.4. Suppose consistency, conditional exchangeability, and positivity hold. Then \(\E\{\phi(O;\tau,\eta)\} = 0\) if either:

  1. \(\mu_t(x) = \mu_t^*(x)\) for \(t=0,1\), regardless of whether \(\pi(x)\) is correctly specified; or
  2. \(\pi(x) = \pi^*(x)\), regardless of whether \(\mu_0\) and \(\mu_1\) are correctly specified.

Case 1: outcome regression correct. If \(\mu_t(X) = \E(Y \mid T{=}t, X)\), then conditional on \(X\): \[\E\!\left[\frac{T}{\pi(X)}\{Y-\mu_1(X)\}\,\middle|\,X\right] = \frac{P(T{=}1 \mid X)}{\pi(X)}\,\E\{Y-\mu_1(X) \mid T{=}1, X\} = 0,\] and similarly the untreated augmentation term vanishes. Therefore \(\E\{\phi\} = \E\{\mu_1(X)-\mu_0(X)\} - \tau = 0\).

Case 2: propensity score correct. Suppose \(\pi(X) = \pi^*(X)\) but \(\mu_0, \mu_1\) are arbitrary. Compute the conditional expectation of the first bracket given \(X\): \[\E\!\left[\frac{T}{\pi(X)}\{Y-\mu_1(X)\} + \mu_1(X)\,\Big|\, X\right] = \frac{\E[TY \mid X]}{\pi(X)} - \mu_1(X)\,\frac{\E[T \mid X]}{\pi(X)} + \mu_1(X) = \mu_1^*(X),\] using \(\E[T \mid X] = \pi^*(X) = \pi(X)\) (so the \(\mu_1\) terms cancel) and \(\E[TY \mid X] = \pi(X)\mu_1^*(X)\). By symmetry the control bracket has conditional expectation \(\mu_0^*(X)\). Taking expectations over \(X\) and invoking \(\tau = \E[\mu_1^*(X) - \mu_0^*(X)]\) gives \(\E\{\phi\} = 0\). \(\square\)

Remark: The Algebra of Case 2

The key cancellation is that the working outcome function \(\mu_1(X)\) enters the bracket twice — once multiplied by \(T/\pi(X)\) and once as a standalone term — and the two instances have equal and opposite conditional expectations because \(\E[T/\pi(X) \mid X] = 1\) when \(\pi\) is correct. The bracket “forgets” \(\mu_1\) entirely and converges to \(\mu_1^*(X)\). A wrong \(\mu_t\) is subtracted out by the correct \(\pi\).

Double robustness guarantees consistency if either model is correct; it does not protect against misspecification of both models simultaneously.
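Both cases of the theorem, and the failure mode when both models are wrong, can be checked by simulation. The sketch below uses a toy DGP with \(\tau = 1\) (not the lab's DGP of Section 11.9); all model choices here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

def expit(z):
    return 1 / (1 + np.exp(-z))

x = rng.normal(size=n)
pi_true = expit(0.5 * x)
t = rng.binomial(1, pi_true)
y = t * (1 + x) + (1 - t) * x + rng.normal(size=n)   # true ATE = 1

def aipw(pi, mu1, mu0):
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / pi
                   - (1 - t) * (y - mu0) / (1 - pi))

tau_or_right = aipw(np.full(n, 0.5), 1 + x, x)            # OR correct, PS wrong
tau_ps_right = aipw(pi_true, np.zeros(n), np.zeros(n))    # PS correct, OR wrong
tau_both_wrong = aipw(np.full(n, 0.5), np.zeros(n), np.zeros(n))
# The first two estimates are close to 1; the last is not.
```

The first call exercises Case 1 (a constant \(\pi = 0.5\) is wrong, yet the correct outcome regression carries the estimator); the second exercises Case 2; the third has no protection.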

11.4 A Class of Augmented Estimators

Remark: Three Senses of “Optimal” in This Chapter

  1. Class-optimal control function \(b_t^*\) minimizing variance within the parametric family (Section 11.4);
  2. Projection-optimal estimator \(\hat\theta_{\mathrm{opt}}\) minimizing variance after orthogonal augmentation correction (Section 11.5);
  3. Semiparametrically efficient estimator attaining the information lower bound (Section 11.7).

The first two agree in the ATE setting, and Section 11.7 shows they also agree with the third.

For any square-integrable functions \(b_1(x)\) and \(b_0(x)\), define: \[\hat\mu_{1,b} = \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\pi(X_i)}\,Y_i - \frac{1}{n}\sum_{i=1}^n \left\{\frac{T_i}{\pi(X_i)}-1\right\} b_1(X_i), \tag{11.6}\] \[\hat\mu_{0,b} = \frac{1}{n}\sum_{i=1}^n \frac{1-T_i}{1-\pi(X_i)}\,Y_i - \frac{1}{n}\sum_{i=1}^n \left\{\frac{1-T_i}{1-\pi(X_i)}-1\right\} b_0(X_i), \tag{11.7}\] and let \(\hat\tau_b = \hat\mu_{1,b} - \hat\mu_{0,b}\). The standard AIPW estimator is the special case \(b_t = \mu_t\) (since \(TY/\pi - (T/\pi - 1)\mu_1 = T(Y-\mu_1)/\pi + \mu_1\)).

Theorem: Unbiasedness and Variance of the AIPW Class

Under conditional exchangeability, SUTVA, and positivity, with the true propensity score, for any square-integrable \(b_0\), \(b_1\):

  1. Unbiasedness: \(\E(\hat\tau_b) = \tau\).
  2. Total variance: \[\mathrm{Var}(\hat\tau_b) = \frac{1}{n}\,\E\!\left[\left(\frac{1}{\pi(X)}-1\right)\{Y(1)-b_1(X)\}^2 + 2\{Y(1)-b_1(X)\}\{Y(0)-b_0(X)\} + \left(\frac{1}{1-\pi(X)}-1\right)\{Y(0)-b_0(X)\}^2\right] + \frac{1}{n}\,\mathrm{Var}\{Y(1)-Y(0)\}. \tag{11.8}\]

Let \(\xi = \frac{T}{\pi(X)}\{Y-b_1(X)\} + b_1(X) - \frac{1-T}{1-\pi(X)}\{Y-b_0(X)\} - b_0(X)\), so \(\hat\tau_b = n^{-1}\sum_i \xi_i\). By SUTVA, \(\xi = \frac{T}{\pi(X)}\{Y(1)-b_1(X)\} + b_1(X) - \frac{1-T}{1-\pi(X)}\{Y(0)-b_0(X)\} - b_0(X)\).

Part (i). Since \(\E[T/\pi(X) \mid X] = 1\), \(\E[(T/\pi(X)-1)b_1(X)] = 0\). Under conditional exchangeability, \(\E[(T/\pi(X))Y(1) \mid X] = \E\{Y(1) \mid X\}\). Hence \(\E(\hat\mu_{1,b}) = \E\{Y(1)\}\), and by the same argument for the control arm, \(\E(\hat\tau_b) = \tau\).

Part (ii). Apply the law of total variance conditioning on \((X, Y(1), Y(0))\). The conditional mean is \(\E[\xi \mid X, Y(1), Y(0)] = Y(1) - Y(0)\). Since \(\xi\) is linear in \(T\) with \(\mathrm{Var}(T \mid X) = \pi(X)(1-\pi(X))\): \[\mathrm{Var}(\xi \mid X, Y(1), Y(0)) = \left(\frac{1}{\pi(X)}-1\right)\{Y(1)-b_1(X)\}^2 + 2\{Y(1)-b_1(X)\}\{Y(0)-b_0(X)\} + \left(\frac{1}{1-\pi(X)}-1\right)\{Y(0)-b_0(X)\}^2.\] Adding the between-group term \(\mathrm{Var}(Y(1)-Y(0))\) gives Equation 11.8. \(\square\)
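The conditional moment computation in Part (ii) can be verified exactly for a single unit, since given \((X, Y(1), Y(0))\) the only randomness left is the Bernoulli draw of \(T\). The numbers below are arbitrary.

```python
# One fixed draw of (y1, y0) with working values b1, b0 and propensity pi;
# all numbers are arbitrary.
y1, y0, b1, b0, pi = 2.3, -0.7, 1.1, 0.4, 0.35

xi_treated = (y1 - b1) / pi + b1 - b0          # xi evaluated at T = 1
xi_control = b1 - (y0 - b0) / (1 - pi) - b0    # xi evaluated at T = 0

cond_mean = pi * xi_treated + (1 - pi) * xi_control
cond_var = pi * (1 - pi) * (xi_treated - xi_control) ** 2

u, v = y1 - b1, y0 - b0
var_formula = (1/pi - 1) * u**2 + 2 * u * v + (pi / (1 - pi)) * v**2
# cond_mean equals y1 - y0, and cond_var equals the displayed formula.
```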

Theorem: Optimal Control Functions

The control functions \(b_1^*(X) = \E\{Y(1) \mid X\}\) and \(b_0^*(X) = \E\{Y(0) \mid X\}\) minimize the total variance in Equation 11.8. Under conditional exchangeability and consistency, \(b_t^*(X) = \mu_t^*(X) = \E(Y \mid T{=}t, X)\).

Write \(u(X) = \mu_1^*(X)-b_1(X)\) and \(v(X) = \mu_0^*(X)-b_0(X)\), and decompose \(Y(t) - b_t(X) = d_t(X) + \varepsilon_t\) where \(d_1 = u\), \(d_0 = v\), \(\E[\varepsilon_t \mid X] = 0\). Cross terms between \((u, v)\) and \((\varepsilon_1, \varepsilon_0)\) vanish by iterated expectations, leaving a \(b_t\)-dependent term: \[Q(b_0, b_1) = \E\!\left[\left(\frac{1}{\pi}-1\right)u^2 + 2uv + \left(\frac{1}{1-\pi}-1\right)v^2\right].\] Writing the integrand as a perfect square: \[\frac{1-\pi}{\pi}\,u^2 + 2uv + \frac{\pi}{1-\pi}\,v^2 = \left(\sqrt{\frac{1-\pi}{\pi}}\,u + \sqrt{\frac{\pi}{1-\pi}}\,v\right)^{\!2} \geq 0,\] with equality at \(u \equiv 0\), \(v \equiv 0\), i.e., \(b_t^* = \mu_t^*\). \(\square\)

Remark: Non-Uniqueness of the Joint Minimizer

The perfect square vanishes whenever \((1-\pi(X))u(X) + \pi(X)v(X) = 0\) a.s., so \(Q = 0\) on a family of pairs \((b_0, b_1)\) with one functional degree of freedom. The canonical choice \(b_t^* = \mu_t^*\) is distinguished by being the arm-wise optimum, separately minimizing \(\mathrm{Var}(\hat\mu_{t,b})\) for each \(t\). The arm-by-arm projection in Section 11.6 makes this uniqueness explicit.

11.5 The Projection Interpretation

The optimality of AIPW has an elegant interpretation in terms of projections in a Hilbert space of estimators. The collection of mean-zero square-integrable random variables under the inner product \(\langle U, V\rangle = \E(UV) = \mathrm{Cov}(U, V)\) is a genuine Hilbert space, and the results below are instances of the \(L^2\) projection theorem.

Let \(\hat\theta_0\) be an unbiased estimator of \(\theta\). Define the augmentation space \(\Lambda\) as a closed linear subspace of mean-zero square-integrable random variables computable from the observed data without knowledge of \(\theta\). For any \(\hat b \in \Lambda\) the estimator \(\hat\theta_b = \hat\theta_0 - \hat b\) remains unbiased.

Theorem: Optimal Projection

The optimal correction is \(\hat b^* = \Pi(\hat\theta_0 \mid \Lambda)\), the \(L^2\) projection of \(\hat\theta_0\) onto \(\Lambda\), characterized by: (1) \(\hat b^* \in \Lambda\); (2) \(\mathrm{Cov}(\hat\theta_0-\hat b^*,\; \hat b) = 0\) for all \(\hat b \in \Lambda\). The optimal estimator is: \[\hat\theta_{\mathrm{opt}} = \hat\theta_0 - \hat b^* = \Pi(\hat\theta_0 \mid \Lambda^\perp),\] and satisfies the Pythagorean identity: \[\mathrm{Var}(\hat\theta_0) = \mathrm{Var}(\hat\theta_{\mathrm{opt}}) + \mathrm{Var}(\hat b^*). \tag{11.9}\]

\(\Lambda\) is a closed linear subspace, so the \(L^2\) projection theorem gives unique \(\hat b^* \in \Lambda\) satisfying the orthogonality condition. Any other choice \(\hat b \in \Lambda\) yields: \[\mathrm{Var}(\hat\theta_0 - \hat b) = \mathrm{Var}(\hat\theta_{\mathrm{opt}}) + \mathrm{Var}(\hat b^* - \hat b) \geq \mathrm{Var}(\hat\theta_{\mathrm{opt}}),\] using \(\mathrm{Cov}(\hat\theta_{\mathrm{opt}},\, \hat b^* - \hat b) = 0\) (since \(\hat b^* - \hat b \in \Lambda\) and \(\hat\theta_{\mathrm{opt}} \perp \Lambda\)). Specializing to \(\hat b = 0\) gives Equation 11.9. \(\square\)

The Pythagorean identity shows the variance of the initial estimator decomposes orthogonally. Tsiatis (2006) calls \(\Lambda\) the augmentation space because its elements are the corrections that reduce variance.
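The projection and its Pythagorean identity can be illustrated with empirical samples standing in for the Hilbert-space elements (an assumption of this sketch, not part of the theorem): with inner product \(\langle u, v\rangle = n^{-1}\sum_i u_i v_i\) on centered vectors, least squares computes the projection, and Equation 11.9 holds exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
# Columns of A play the role of mean-zero random variables spanning Λ.
A = rng.normal(size=(n, 3))
theta0 = A @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

A = A - A.mean(axis=0)         # center: every element of Λ has mean zero
theta0 = theta0 - theta0.mean()

beta, *_ = np.linalg.lstsq(A, theta0, rcond=None)
b_star = A @ beta              # b* = Pi(theta0 | Lambda)
theta_opt = theta0 - b_star    # Pi(theta0 | Lambda-perp)

# Pythagorean identity (Equation 11.9), exact in the empirical inner product:
# mean(theta0**2) == mean(theta_opt**2) + mean(b_star**2)
```

The orthogonality condition is the least-squares normal equation; the variance split is then immediate.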

Remark: Finite-Sample Analogy for Semiparametric Projection

This theorem is a finite-sample \(L^2(P)\) analogue of the projection arguments in semiparametric efficiency theory. In the full theory, the projection is at the level of influence functions and tangent spaces. Here we project a realized estimator \(\hat\theta_0\) onto the orthogonal complement of an augmentation space \(\Lambda\) of observable corrections. Section 11.6 shows that the finite-sample projection recovers AIPW, and Section 11.7 identifies the AIPW estimating function with the efficient influence function from the tangent-space projection.

11.6 Projection in the Causal Inference Setting

We now specialize the projection framework to the ATE, assuming \(\pi(X)\) is known. Decompose the Horvitz–Thompson estimator as \(\hat\tau_{\mathrm{HT}} = \hat\mu_{1,\mathrm{HT}} - \hat\mu_{0,\mathrm{HT}}\) where: \[\hat\mu_{1,\mathrm{HT}} = \frac{1}{n}\sum_{i=1}^n \frac{T_i Y_i}{\pi_i}, \qquad \hat\mu_{0,\mathrm{HT}} = \frac{1}{n}\sum_{i=1}^n \frac{(1-T_i)Y_i}{1-\pi_i}.\]

Define arm-specific augmentation spaces: \[\Lambda_1 = \left\{n^{-1}\sum_{i=1}^n\!\left(\frac{T_i}{\pi_i}-1\right) b_1(X_i) : b_1 \in \mathcal{L}^2\right\}, \quad \Lambda_0 = \left\{n^{-1}\sum_{i=1}^n\!\left(\frac{1-T_i}{1-\pi_i}-1\right) b_0(X_i) : b_0 \in \mathcal{L}^2\right\}. \tag{11.10}\]

Every element of \(\Lambda_1\) (resp. \(\Lambda_0\)) has expectation zero, so augmenting leaves each arm-mean estimator unbiased. The combined augmentation space is \(\Lambda = \Lambda_1 + \Lambda_0\).

Theorem: Arm-wise Projections onto the Augmentation Spaces

The optimal corrections are: \[\Pi(\hat\mu_{1,\mathrm{HT}} \mid \Lambda_1) = n^{-1}\sum_{i=1}^n\!\left(\frac{T_i}{\pi_i}-1\right) b_1^*(X_i), \quad b_1^*(x) = \E\{Y(1) \mid x\},\] \[\Pi(\hat\mu_{0,\mathrm{HT}} \mid \Lambda_0) = n^{-1}\sum_{i=1}^n\!\left(\frac{1-T_i}{1-\pi_i}-1\right) b_0^*(X_i), \quad b_0^*(x) = \E\{Y(0) \mid x\}.\]

We verify the treated arm. Denote the residual \(R_1 = \hat\mu_{1,\mathrm{HT}} - n^{-1}\sum_i(T_i/\pi_i - 1)b_1^*(X_i) = n^{-1}\sum_i r_{1i}\) where \(r_{1i} = (T_i/\pi_i)\{Y_i - b_1^*(X_i)\} + b_1^*(X_i)\).

By the Optimal Projection Theorem, it suffices to show \(\mathrm{Cov}(R_1,\, \hat b) = 0\) for every \(\hat b = n^{-1}\sum_j(T_j/\pi_j - 1)b_1(X_j) \in \Lambda_1\). Write \(b_i = (T_i/\pi_i - 1)b_1(X_i)\). Using independence across units and \(\E[b_i] = 0\), \(\mathrm{Cov}(R_1, \hat b) = n^{-1}\E[r_{1i}\,b_i]\). Condition on \((X_i, Y_i(1), Y_i(0))\); the only randomness is \(T_i \mid X_i\). Since \(T_i^2 = T_i\), \(\E[T_i/\pi_i \cdot(T_i/\pi_i - 1) \mid X_i] = 1/\pi_i - 1\). Taking outer expectation and using \(\E\{Y_i(1) - b_1^*(X_i) \mid X_i\} = 0\), the contribution vanishes. The constant term \(b_1^*(X_i)\) in \(r_{1i}\) also contributes zero because \(\E[(T_i/\pi_i - 1) \mid X_i] = 0\). \(\square\)

The optimal estimators are: \[\hat\mu_{1,\mathrm{opt}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi_i}\{Y_i-\mu_1^*(X_i)\}+\mu_1^*(X_i)\right], \quad \hat\mu_{0,\mathrm{opt}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{1-T_i}{1-\pi_i}\{Y_i-\mu_0^*(X_i)\}+\mu_0^*(X_i)\right].\] Their difference is precisely the AIPW estimator Equation 11.5 with \(b_t^*(X) = \mu_t^*(X)\), confirming the Optimal Control Functions theorem by a different route.

Remark: What the Arm-wise Decomposition Adds

The joint optimization in the Unbiasedness and Variance theorem has a non-unique minimizer (Remark on Non-Uniqueness). The arm-wise projection has a unique minimizer \(b_t^*(x) = \E\{Y(t) \mid X{=}x\}\), and the pair \((\mu_0^*, \mu_1^*)\) also attains the joint minimum. The arm-wise projection therefore selects the canonical representative, and the resulting estimator coincides with AIPW.

11.7 The Efficient Influence Function and Semiparametric Efficiency

This section states the semiparametric efficiency result and interprets it in light of the estimator development above. A complete proof requires explicit characterization of the nuisance tangent space; see Tsiatis (2006) for a rigorous development.

Under the nonparametric model for \(O=(X,T,Y)\), the efficient influence function for the ATE is: \[\varphi_{\mathrm{eff}}(O) = \frac{T}{\pi(X)}\{Y-\mu_1(X)\} - \frac{1-T}{1-\pi(X)}\{Y-\mu_0(X)\} + \mu_1(X) - \mu_0(X) - \tau. \tag{11.11}\]

Comparing Equation 11.11 with Equation 11.4, the AIPW estimating function is the efficient influence function \(\varphi^*(O)\) from Chapter 10. The augmentation space \(\Lambda\) in Equation 11.10 is the finite-sample analogue of the nuisance tangent space of the nonparametric model, and projection onto \(\Lambda^\perp\) is the finite-sample counterpart of the semiparametric operation that removes nuisance tangent directions.

Theorem: Semiparametric Efficiency Bound for the ATE

Under the nonparametric model for \(O = (X, T, Y)\), every regular asymptotically linear estimator \(\hat\tau\) of the ATE satisfies: \[\mathrm{Avar}(\sqrt{n}(\hat\tau - \tau)) \;\geq\; \E\{\varphi_{\mathrm{eff}}(O)^2\},\] and the bound is attained by estimators whose influence function equals \(\varphi_{\mathrm{eff}}\) in Equation 11.11.

Remark: EIF as the Semiparametric Score

The efficient influence function plays the same role in semiparametric theory that the score plays in parametric models: its variance gives the information lower bound, analogous to the Cramér–Rao bound. Strictly speaking, \(\varphi_{\mathrm{eff}}\) is not itself a score (the score belongs to the tangent space); it is the canonical gradient — the Riesz representer of the pathwise derivative of \(\tau\) along regular submodels (Bickel et al. 1993).

The three derivations in this chapter — bias correction (Section 11.2), optimal augmentation (Section 11.4), and semiparametric identification — converge on the same estimating function. This convergence is the deepest explanation of why AIPW occupies a central place in the causal inference toolkit.

11.8 Doubly Robust Regression: Weighted and Augmented Approaches

The AIPW estimator achieves double robustness by adding an explicit bias-correction term. An equally important question is how to build double robustness directly into the outcome model fit, so that the prediction estimator is itself doubly robust without a separate augmentation step. The unifying concept is the internal bias calibration (IBC) condition.

11.8.1 The Internal Bias Calibration Conditions

Let \(\hat\mu_t(x)\) denote any fitted outcome model and \(\hat\pi_i = \hat\pi(X_i)\). The prediction estimator requires no augmentation if the IPW-weighted residuals vanish: \[\sum_{i=1}^n \frac{T_i}{\hat\pi_i}\{Y_i - \hat\mu_1(X_i)\} = 0, \qquad \sum_{i=1}^n \frac{1-T_i}{1-\hat\pi_i}\{Y_i - \hat\mu_0(X_i)\} = 0. \tag{11.12}\]

We call Equation 11.12 the internal bias calibration (IBC) conditions (Firth and Bennett 1998). When both IBC conditions hold, the augmentation terms in Equation 11.5 are zero by construction, so \(\hat\tau_{\mathrm{pred}} = \hat\tau_{\mathrm{AIPW}}\) and the prediction estimator is itself doubly robust.

11.8.2 Weighted Regression Approach

Suppose the outcome model for arm \(t\) is parameterized as \(\mu_t(X;\theta_t)\) with a constant term. Estimate \(\theta_1\) by minimizing the IPW-weighted least-squares criterion: \[\sum_{i=1}^n \frac{T_i}{\hat\pi_i}\{Y_i - \mu_1(X_i;\theta_1)\}^2. \tag{11.13}\]

The normal equation of Equation 11.13 with respect to the intercept component is \(\sum_i \frac{T_i}{\hat\pi_i}\{Y_i - \mu_1(X_i;\hat\theta_1)\} = 0\), which is exactly IBC condition Equation 11.12. Hence IPW-weighted fitted values automatically satisfy IBC for any model containing a constant (Robins et al. 1994; Bang and Robins 2005). The key point is not that weighted regression creates a fundamentally different doubly robust estimator; rather, it constructs fitted values for which the prediction estimator algebraically equals the AIPW estimator.
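This algebra is easy to confirm numerically (illustrative data, not the lab's DGP): fit the treated-arm model by weighted least squares with weights \(T_i/\hat\pi_i\) and check that the IBC condition holds to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
pi_hat = 1 / (1 + np.exp(-0.5 * x))   # any fitted propensity, away from 0 and 1
t = rng.binomial(1, pi_hat)
y = 1 + x + rng.normal(size=n)

# IPW-weighted least squares for the treated arm (Equation 11.13):
# weights T_i / pi_hat_i, design containing a constant.
X = np.column_stack([np.ones(n), x])
w = t / pi_hat
XtW = X.T * w
theta1 = np.linalg.solve(XtW @ X, XtW @ y)
resid = y - X @ theta1

ibc = np.sum(t / pi_hat * resid)      # intercept normal equation: zero by construction
```

The zero is exact (up to floating point), not asymptotic: it is the first normal equation of the weighted fit.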

11.8.3 Augmented Model Approach and the Clever Covariate

Let \(\hat\mu_1^{(0)}(X_i)\) be any initial fit. Augment it by including the clever covariate \(\hat\pi_i^{-1}\) (van der Laan and Rubin 2006; van der Laan and Rose 2011) and run OLS of \(Y_i\) on \(\hat\mu_1^{(0)}(X_i)\) and \(\hat\pi_i^{-1}\) among treated units: \[Y_i = \alpha_1 + \beta_1\hat\mu_1^{(0)}(X_i) + \gamma_1\hat\pi_i^{-1} + e_i(1), \qquad T_i = 1. \tag{11.14}\]

The fitted outcome model is \(\hat\mu_1(X_i) = \hat\alpha_1 + \hat\beta_1\hat\mu_1^{(0)}(X_i) + \hat\gamma_1\hat\pi_i^{-1}\).

The normal equation for \(\hat\gamma_1\) (the coefficient on \(\hat\pi_i^{-1}\)) is \(\sum_i \frac{T_i}{\hat\pi_i}\{Y_i - \hat\mu_1(X_i)\} = 0\), exactly IBC condition Equation 11.12. Since this sum equals zero, the prediction estimator is: \[\hat\mu_1^{\mathrm{pred}} = \frac{1}{n}\sum_{i=1}^n \hat\mu_1(X_i) + \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat\pi_i}\{Y_i - \hat\mu_1(X_i)\},\] which is the AIPW representation. The covariate \(\hat\pi_i^{-1}\) is “clever” because it is chosen so that the fitted regression satisfies the same score equation appearing in the AIPW bias correction.
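The same numerical check works for the clever-covariate regression, here with a deliberately misspecified initial fit (all modeling choices below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
pi_hat = 1 / (1 + np.exp(-0.5 * x))
t = rng.binomial(1, pi_hat)
y = 1 + x + 0.5 * x**2 + rng.normal(size=n)

mu1_init = 1 + x                     # initial fit, deliberately missing the x^2 term
D_full = np.column_stack([np.ones(n), mu1_init, 1 / pi_hat])

# OLS of Y on (1, initial fit, clever covariate 1/pi_hat) among treated units
coef, *_ = np.linalg.lstsq(D_full[t == 1], y[t == 1], rcond=None)
mu1_hat = D_full @ coef              # augmented outcome model, Equation 11.14

# Normal equation for the clever-covariate coefficient = IBC condition (11.12)
ibc = np.sum(t / pi_hat * (y - mu1_hat))
```

Because control units contribute zero to the sum, the IBC condition coincides with the treated-arm normal equation for \(\hat\gamma_1\).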

Remark: Behavior under Correct Outcome Model

If \(\hat\mu_1^{(0)}\) is correctly specified, then \(\hat\alpha_1 \overset{p}{\to} 0\), \(\hat\beta_1 \overset{p}{\to} 1\), and \(\hat\gamma_1 \overset{p}{\to} 0\): the augmented model reduces asymptotically to \(\hat\mu_1^{(0)}(X_i)\), and including the additional covariates carries no asymptotic efficiency cost.

Remark: Connection to TMLE

The augmented model approach captures the same targeting idea that underlies targeted minimum loss-based estimation (TMLE) (van der Laan and Rubin 2006; van der Laan and Rose 2011). The standard TMLE targeting step uses the fixed-offset form (coefficient on \(\hat\mu_1^{(0)}\) fixed at 1), the minimal fluctuation sufficient to satisfy the IBC condition. The model Equation 11.14 goes further by freely estimating \(\hat\beta_1\), providing additional recalibration of the initial fit beyond canonical TMLE.

11.9 Lab: Simulation Study

This lab compares four estimators of the ATE: \(\hat\tau_{\mathrm{HT}}\), \(\hat\tau_{\mathrm{AIPW}}\), the fixed-offset augmented-model estimator \(\hat\tau_{\mathrm{aug}}\), and the improved augmented-model estimator \(\hat\tau_{\mathrm{aug}^+}\) of Equation 11.14. A \(2\times 2\) design over nuisance-model correctness demonstrates double robustness directly. The lab also reports empirical 95% Wald coverage, illustrating the distinction between double-robust consistency and valid asymptotic inference.

DGP. \(n = 1000\) i.i.d. observations. Draw \(X_i \sim N(0,1)\), \(T_i \mid X_i \sim \mathrm{Bernoulli}(\pi^*(X_i))\) with \(\pi^*(x) = \mathrm{expit}\{0.2x + 0.2(x^2 - 1)\}\). Potential outcomes: \(Y_i(1) = 1 + X_i + 0.5X_i^2 + \varepsilon_i(1)\), \(Y_i(0) = X_i + 0.5X_i^2 + \varepsilon_i(0)\), \(\varepsilon_i(t) \sim N(0,1)\) i.i.d., giving true ATE \(\tau = 1\). The \(X^2\) term enters both the true OR and true PS; omitting it yields four scenarios.

Scenarios.

| Scenario | OR fit | PS fit |
|----------|--------|--------|
| S1 | correct: \(Y \sim (1, X, X^2)\) | correct: \(T \sim (1, X, X^2)\) |
| S2 | correct: \(Y \sim (1, X, X^2)\) | misspecified: \(T \sim (1, X)\) |
| S3 | misspecified: \(Y \sim (1, X)\) | correct: \(T \sim (1, X, X^2)\) |
| S4 | misspecified: \(Y \sim (1, X)\) | misspecified: \(T \sim (1, X)\) |

The fitted \(\hat\pi\) is clipped to \([10^{-3}, 1-10^{-3}]\).
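For concreteness, a single replication of scenario S2 can be sketched as follows. We use \(n = 20{,}000\) rather than the lab's 1000 so the HT bias is visible in one draw, and a bare-bones Newton fit stands in for a packaged logistic regression.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 20_000

def expit(z):
    return 1 / (1 + np.exp(-z))

x = rng.normal(size=n)
pi_true = expit(0.2 * x + 0.2 * (x**2 - 1))
t = rng.binomial(1, pi_true)
y = np.where(t == 1,
             1 + x + 0.5 * x**2 + rng.normal(size=n),   # Y(1)
             x + 0.5 * x**2 + rng.normal(size=n))       # Y(0); true ATE = 1

def logit_fit(design):
    """Bare-bones Newton-Raphson logistic regression of t on `design`."""
    beta = np.zeros(design.shape[1])
    for _ in range(25):
        p = expit(design @ beta)
        beta += np.linalg.solve((design.T * (p * (1 - p))) @ design,
                                design.T @ (t - p))
    return expit(design @ beta)

# S2: PS model drops the x^2 term (misspecified); OR keeps it (correct)
pi_hat = np.clip(logit_fit(np.column_stack([np.ones(n), x])), 1e-3, 1 - 1e-3)

X_or = np.column_stack([np.ones(n), x, x**2])
b1, *_ = np.linalg.lstsq(X_or[t == 1], y[t == 1], rcond=None)
b0, *_ = np.linalg.lstsq(X_or[t == 0], y[t == 0], rcond=None)
mu1, mu0 = X_or @ b1, X_or @ b0

tau_ht = np.mean(t * y / pi_hat - (1 - t) * y / (1 - pi_hat))
tau_aipw = np.mean(mu1 - mu0
                   + t * (y - mu1) / pi_hat
                   - (1 - t) * (y - mu0) / (1 - pi_hat))
```

In this scenario AIPW lands near the true \(\tau = 1\) while HT does not, matching the S2 row of the results table.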

Results (\(B = 2000\) replications, set.seed(2025)). Bias, Var, MSE \(\times 10^{-3}\); Cov = empirical 95% Wald coverage.

| Scenario | Metric | \(\hat\tau_{\mathrm{HT}}\) | \(\hat\tau_{\mathrm{AIPW}}\) | \(\hat\tau_{\mathrm{aug}}\) | \(\hat\tau_{\mathrm{aug}^+}\) |
|----------|--------|------|------|------|------|
| S1: OR ✓, PS ✓ | Bias | 0.55 | −0.44 | −0.45 | −0.45 |
| | Var | 6.82 | 4.25 | 4.25 | 4.30 |
| | MSE | 6.82 | 4.25 | 4.25 | 4.30 |
| | Cov (%) | 99.9 | 95.0 | 94.9 | 94.5 |
| S2: OR ✓, PS ✗ | Bias | 188.56 | 0.62 | 0.62 | 0.62 |
| | Var | 6.32 | 4.33 | 4.33 | 4.33 |
| | MSE | 41.87 | 4.32 | 4.32 | 4.32 |
| | Cov (%) | 67.4 | 94.2 | 94.2 | 94.2 |
| S3: OR ✗, PS ✓ | Bias | 1.46 | 2.02 | 2.25 | −25.26 |
| | Var | 5.74 | 4.71 | 4.41 | 8.02 |
| | MSE | 5.74 | 4.71 | 4.41 | 8.66 |
| | Cov (%) | 99.9 | 98.4 | 98.2 | 93.2 |
| S4: OR ✗, PS ✗ | Bias | 188.42 | 188.77 | 188.62 | −5.42 |
| | Var | 6.28 | 6.30 | 6.29 | 4.26 |
| | MSE | 41.78 | 41.93 | 41.87 | 4.29 |
| | Cov (%) | 67.5 | 33.1 | 33.1 | 94.7 |

S1 (both correct). All four estimators are approximately unbiased. \(\hat\tau_{\mathrm{HT}}\) has about 60% higher MSE because it discards the outcome model. The three augmented estimators are essentially equivalent, confirming \((\hat\alpha_t, \hat\beta_t, \hat\gamma_t) \to (0, 1, 0)\) when the OR is correct.

S2 (OR correct, PS misspecified). HT is badly biased (MSE \(\approx 42 \times 10^{-3}\), coverage 67.4%). The three augmented estimators are indistinguishable: when the OR is correct, they converge to the same limit regardless of PS specification. First direct demonstration of double robustness.

S3 (OR misspecified, PS correct). HT, AIPW, and \(\hat\tau_{\mathrm{aug}}\) remain consistent through the correct PS. \(\hat\tau_{\mathrm{aug}}\) achieves the lowest MSE (\(4.41 \times 10^{-3}\)). \(\hat\tau_{\mathrm{aug}^+}\) shows a finite-sample bias (\(-25 \times 10^{-3}\)) from high-leverage clever-covariate values. Second double-robustness demonstration.

S4 (both misspecified). Double robustness provides no guarantee. HT, AIPW, and \(\hat\tau_{\mathrm{aug}}\) are all badly biased; Wald coverage collapses to \(\approx 33\%\) for AIPW and aug. The apparent “recovery” of \(\hat\tau_{\mathrm{aug}^+}\) (bias \(-5 \times 10^{-3}\), coverage 94.7%) is a DGP-specific artifact and does not indicate triple robustness.

Takeaway. Comparing S2–S3 against S4 is the crisp illustration of the Double Robustness theorem (Section 11.3): AIPW is consistent whenever at least one nuisance model is correct, and only S4 breaks the guarantee.

11.10 Asymptotic Inference with Estimated Nuisance Functions

Under suitable regularity conditions: \[\sqrt{n}(\hat\tau_{\mathrm{AIPW}}-\tau) = \frac{1}{\sqrt{n}}\sum_{i=1}^n\varphi_{\mathrm{eff}}(O_i) + o_p(1) \overset{d}{\longrightarrow} N\!\bigl(0,\;\E\{\varphi_{\mathrm{eff}}(O)^2\}\bigr).\]

The \(o_p(1)\) remainder captures nuisance estimation error. Neyman orthogonality implies this error enters only through a second-order remainder. A sufficient condition for the remainder to be negligible is the product-rate condition: \[\|\hat\pi-\pi^*\|\cdot\|\hat\mu_t-\mu_t^*\| = o_p(n^{-1/2}), \qquad t=0,1, \tag{11.15}\] where \(\|\cdot\|\) denotes the \(L_2(P)\) norm. A symmetric sufficient condition: each nuisance estimator converges at rate \(o_p(n^{-1/4})\). This is much weaker than the parametric rate \(n^{-1/2}\) required of each individually.

Remark: Neyman Orthogonality

The first-order insensitivity of the AIPW estimating function to nuisance perturbations is an instance of Neyman orthogonality, a property shared by a broad class of semiparametric estimators. Chapter 12 exploits this property systematically through cross-fitting, which eliminates the need for additional empirical-process conditions on the nuisance estimators.

11.10.1 Variance Estimation and Confidence Intervals

Because \(\hat\tau_{\mathrm{AIPW}}\) is asymptotically linear with influence function \(\varphi_{\mathrm{eff}}\), its asymptotic variance is estimated by the empirical variance of the plug-in influence values: \[\hat\varphi_i = \frac{T_i}{\hat\pi(X_i)}\{Y_i-\hat\mu_1(X_i)\} - \frac{1-T_i}{1-\hat\pi(X_i)}\{Y_i-\hat\mu_0(X_i)\} + \hat\mu_1(X_i)-\hat\mu_0(X_i) - \hat\tau_{\mathrm{AIPW}}, \tag{11.16}\] \[\hat V = \frac{1}{n(n-1)}\sum_{i=1}^n(\hat\varphi_i-\bar\varphi)^2, \qquad \bar\varphi = \frac{1}{n}\sum_{i=1}^n\hat\varphi_i. \tag{11.17}\]

The Wald confidence interval is \(\hat\tau_{\mathrm{AIPW}} \pm z_{1-\alpha/2}\sqrt{\hat V}\).
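Equation 11.16 and Equation 11.17 can be sketched as follows. For brevity this toy example plugs in the true nuisance functions; in practice \(\hat\pi\) and \(\hat\mu_t\) are fitted.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
pi_hat = 1 / (1 + np.exp(-0.5 * x))          # stand-in for a fitted propensity score
t = rng.binomial(1, pi_hat)
y = t * (1 + x) + (1 - t) * x + rng.normal(size=n)   # true ATE = 1
mu1_hat, mu0_hat = 1 + x, x                  # stand-ins for fitted outcome regressions

# Uncentered influence values; phi_i - tau_hat reproduces Equation 11.16
phi = (t * (y - mu1_hat) / pi_hat
       - (1 - t) * (y - mu0_hat) / (1 - pi_hat)
       + mu1_hat - mu0_hat)
tau_hat = phi.mean()                          # the AIPW point estimate
phi_c = phi - tau_hat                         # mean of phi_c is 0 up to rounding
V_hat = np.sum(phi_c ** 2) / (n * (n - 1))    # Equation 11.17
half = 1.96 * np.sqrt(V_hat)                  # z_{0.975} approx. 1.96
ci = (tau_hat - half, tau_hat + half)
```

Note that \(\bar\varphi = 0\) automatically here, because \(\hat\tau_{\mathrm{AIPW}}\) is the mean of the uncentered influence values.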

Remark: Double Robustness versus Efficient Inference

It is important not to conflate two distinct results. Double-robust consistency (the Double Robustness theorem, Section 11.3) requires only that one nuisance model be consistent, together with overlap and standard LLN regularity. It does not require the product-rate condition Equation 11.15.

Efficient asymptotic inference (the \(\sqrt{n}\)-normal expansion and the Wald interval based on \(\hat V\)) is a strictly stronger requirement, needing both nuisance functions to be estimated consistently at rates satisfying Equation 11.15. Plugging estimated nuisance functions into Equation 11.17 estimates the asymptotic variance of the efficient influence function only when the asymptotic linear representation with \(\varphi_{\mathrm{eff}}\) holds; that representation does not follow from double robustness alone.

Suppose the propensity score is correct but the outcome regression is misspecified: \(\hat\pi \to \pi^*\) but \(\hat\mu_t \to \mu_t^\dagger \neq \mu_t^*\). AIPW remains consistent for \(\tau\) (Case 2 of the Double Robustness theorem): the AIPW moment evaluated at the limits \((\pi^*, \mu_0^\dagger, \mu_1^\dagger)\) is \[\varphi^\dagger(O) = \frac{T}{\pi^*(X)}\{Y-\mu_1^\dagger(X)\} - \frac{1-T}{1-\pi^*(X)}\{Y-\mu_0^\dagger(X)\} + \mu_1^\dagger(X) - \mu_0^\dagger(X) - \tau, \tag{11.18}\] with \(\E\{\varphi^\dagger(O)\} = 0\).

It is tempting to conclude that \(\varphi^\dagger\) is then the influence function of \(\hat\tau_{\mathrm{AIPW}}\) and that the plug-in variance Equation 11.17 consistently estimates \(\E\{\varphi^\dagger(O)^2\}\). This conclusion is correct only when \(\pi^*\) is known, not estimated. When \(\pi\) is estimated and \(\mu^\dagger \neq \mu^*\), the AIPW moment is no longer Neyman-orthogonal in the \(\pi\)-direction at \((\pi^*, \mu^\dagger)\). The Gateaux derivative of the population moment with respect to \(\pi\) at this point, along a perturbation \(h(X)\), evaluates to \[-\E\!\left[h(X)\left\{\frac{\mu_1^*(X) - \mu_1^\dagger(X)}{\pi^*(X)} + \frac{\mu_0^*(X) - \mu_0^\dagger(X)}{1-\pi^*(X)}\right\}\right],\] which is generally nonzero whenever \(\mu_t^\dagger \neq \mu_t^*\). The first-order error from estimating \(\pi\) therefore enters the asymptotic linear expansion of \(\sqrt{n}(\hat\tau_{\mathrm{AIPW}}-\tau)\), acquiring a contribution beyond \(n^{-1/2}\sum_i \varphi^\dagger(O_i)\).

Consequently, the plug-in variance Equation 11.17 is not generally valid under one-correct-model misspecification, and the naive Wald interval can have incorrect coverage. Valid asymptotic inference in this regime requires one of the following: (i) treating \((\hat\pi, \hat\mu, \hat\tau_{\mathrm{AIPW}})\) jointly as the solution of a stacked estimating-equation system and using the corresponding sandwich variance, which absorbs the nuisance-estimation contribution; (ii) sample splitting or cross-fitting together with conditions ensuring the nuisance-estimation contribution is asymptotically negligible (Chapter 12); or (iii) verifying that the misspecified limit coincides with the truth, which restores orthogonality. The plug-in variance Equation 11.17 is justified without qualification only when \(\hat\tau_{\mathrm{AIPW}}\) admits an asymptotic linear representation with \(\varphi_{\mathrm{eff}}\) — a representation that holds at \((\pi^*, \mu^*)\) but generally fails at \((\pi^*, \mu^\dagger)\) once \(\pi^*\) is replaced by its estimator.

WarningExtreme Propensity Scores

When \(\hat\pi(X_i)\) is near 0 or 1, the IPW weights become large and can destabilize the estimator. Practical remedies: overlap diagnostics, truncation of extreme propensity scores, or restricting the target population to a subgroup with adequate overlap. The augmented model approach of Section 11.8 provides a complementary strategy by folding the propensity score into the outcome-model fit.
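The truncation and restriction remedies are one-liners in practice. The sketch below is illustrative: the vector of fitted scores and the cutoffs `eps` and \([0.05, 0.95]\) are hypothetical choices, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical fitted propensity scores with mass near 0 and 1 (poor overlap)
pi_hat = rng.beta(0.5, 0.5, size=2000)

# Diagnostic: count extreme scores and the worst-case IPW weight
eps = 1e-3
n_extreme = np.sum((pi_hat < 0.01) | (pi_hat > 0.99))
max_weight_raw = (1 / np.minimum(pi_hat, 1 - pi_hat)).max()

# Remedy 1: truncate (clip) the scores to [eps, 1 - eps]
pi_trunc = np.clip(pi_hat, eps, 1 - eps)
max_weight_trunc = (1 / np.minimum(pi_trunc, 1 - pi_trunc)).max()

# Remedy 2: restrict the target population to a region of adequate overlap
keep = (pi_hat > 0.05) & (pi_hat < 0.95)
```

Truncation caps the weights at \(1/\mathrm{eps}\); restriction changes the estimand to a subgroup ATE, which is the honest option when overlap genuinely fails.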

Looking ahead: when plug-in fails. For finite-dimensional parametric nuisance models, the product-rate condition holds automatically at the parametric rate \(n^{-1/2}\). The situation changes with flexible machine-learning methods: nuisance convergence rates may be slower than \(n^{-1/4}\), and machine-learning function classes are typically not Donsker, so the empirical-process remainder need not vanish when the same data are used for both nuisance estimation and score evaluation.

Chapter 12 addresses both difficulties by (a) formalizing Neyman orthogonality and exhibiting the AIPW score as an orthogonal score, and (b) decoupling nuisance estimation from score evaluation via cross-fitting. The resulting double/debiased machine learning (DML) estimator (Chernozhukov et al. 2018) is a direct extension of the AIPW development here.
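As a schematic preview of that decoupling, the sketch below fits the nuisances on the complement of each fold and evaluates the AIPW score only on the held-out fold. Everything here is hypothetical scaffolding — the function name `crossfit_aipw` and the user-supplied fitting callbacks `fit_pi` and `fit_mu` are illustrative, not the Chapter 12 construction itself.

```python
import numpy as np

def crossfit_aipw(X, T, Y, fit_pi, fit_mu, K=5, seed=0):
    """Cross-fitted AIPW sketch: nuisances fit on training folds,
    score evaluated on the held-out fold. fit_pi(X, T) and fit_mu(X, Y)
    must return prediction functions."""
    n = len(Y)
    folds = np.random.default_rng(seed).integers(0, K, size=n)
    phi = np.empty(n)
    for k in range(K):
        tr, te = folds != k, folds == k
        pi = fit_pi(X[tr], T[tr])                      # propensity model
        mu1 = fit_mu(X[tr][T[tr] == 1], Y[tr][T[tr] == 1])
        mu0 = fit_mu(X[tr][T[tr] == 0], Y[tr][T[tr] == 0])
        p, m1, m0 = pi(X[te]), mu1(X[te]), mu0(X[te])
        phi[te] = (m1 - m0 + T[te] * (Y[te] - m1) / p
                   - (1 - T[te]) * (Y[te] - m0) / (1 - p))
    return phi.mean(), np.sqrt(np.mean((phi - phi.mean()) ** 2) / n)

# Toy usage with simple parametric nuisance fits (illustrative only)
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=n)
T = rng.binomial(1, 0.5, size=n)
Y = T + X + rng.normal(size=n)                         # true ATE = 1

fit_pi = lambda Xtr, Ttr: (lambda x: np.full(len(x), Ttr.mean()))
fit_mu = lambda Xtr, Ytr: (lambda x, b=np.polyfit(Xtr, Ytr, 1): np.polyval(b, x))

tau_hat, se_hat = crossfit_aipw(X, T, Y, fit_pi, fit_mu)
```

With machine-learning nuisance fits plugged in for `fit_pi` and `fit_mu`, this skeleton becomes the DML estimator developed in Chapter 12.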

11.11 Comparison of Regression, IPW, and AIPW

All consistency statements below are conditional on the causal identification assumptions of Chapter 5: consistency, conditional exchangeability, and positivity.

| Estimator | Uses \(\mu_t\) | Uses \(\pi\) | Consistent if | Main weakness |
|---|---|---|---|---|
| Regression (prediction) | Yes | No | outcome model correct | Sensitive to OR misspecification |
| IPW (Horvitz–Thompson) | No | Yes | propensity model correct | Unstable under weak overlap |
| AIPW | Yes | Yes | either model correct | Requires estimating both nuisances and careful inference |

When overlap is weak, the IBC-based approaches of Section 11.8 can improve numerical stability by folding the propensity score into the outcome-model fit. They do not, however, eliminate the fundamental information loss where covariate support is lacking: in such cases the practical recommendation is to change the target estimand (trimming, overlap weighting, or restriction to a subgroup) rather than to expect any algebraic refinement of AIPW to repair the problem.
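The table's "consistent if" column can be checked numerically. The sketch below uses an illustrative DGP in the spirit of Scenario S3 — correct propensity score, outcome model omitting the \(X^2\) term — and compares the three point estimates on one large simulated sample; only the regression estimator should show noticeable bias.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
X = rng.normal(size=n)
pi_star = 1 / (1 + np.exp(-(0.2 * X + 0.2 * (X**2 - 1))))
T = rng.binomial(1, pi_star)
Y = 1.0 * T + X + X**2 + rng.normal(size=n)   # true ATE = 1.0

# Misspecified outcome model: OLS of Y on (1, X) within each arm (omits X^2)
def ols_fit(Xa, Ya):
    A = np.column_stack([np.ones(len(Xa)), Xa])
    beta, *_ = np.linalg.lstsq(A, Ya, rcond=None)
    return lambda x: beta[0] + beta[1] * x

mu1, mu0 = ols_fit(X[T == 1], Y[T == 1]), ols_fit(X[T == 0], Y[T == 0])
m1, m0 = mu1(X), mu0(X)
p = pi_star                                    # correct PS (truth, for illustration)

tau_reg = np.mean(m1 - m0)                                    # biased (OR wrong)
tau_ipw = np.mean(T * Y / p - (1 - T) * Y / (1 - p))          # consistent (PS right)
tau_aipw = np.mean(m1 - m0 + T * (Y - m1) / p
                   - (1 - T) * (Y - m0) / (1 - p))            # consistent (DR)
```

Swapping which nuisance is misspecified flips which single-model estimator fails, while AIPW survives either way — the double robustness property in miniature.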

11.12 Chapter Summary

| Symbol | Meaning |
|---|---|
| \(\tau\) | ATE \(= \E\{Y(1)-Y(0)\}\) |
| \(\mu_t(x)\) | Outcome regression \(\E(Y \mid T{=}t, X{=}x)\) |
| \(\pi(x)\) | Propensity score \(P(T{=}1 \mid X{=}x)\) |
| \(b_t(x)\) | Control function; optimal \(b_t^*(x) = \mu_t^*(x)\) |
| \(\hat\tau_{\mathrm{AIPW}}\) | AIPW estimator, Equation 11.5 |
| \(\Lambda\) | Augmentation space, Equation 11.10 |
| \(\varphi_{\mathrm{eff}}(O)\) | Efficient influence function, Equation 11.11 |
| IBC | Internal bias calibration conditions, Equation 11.12 |
| \(\hat\pi_i^{-1}\) | Clever covariate; its normal equation enforces IBC |
  1. The prediction estimator is biased when the outcome model is misspecified; the bias can be estimated with the propensity score and subtracted to yield the AIPW estimator.
  2. The AIPW estimator has two key properties: double robustness (consistent when either model is correct) and class-optimal variance (the choice \(b_t^* = \mu_t^*\) minimizes variance when both nuisances are correctly specified).
  3. The same class-specific optimum follows from a Hilbert-space projection: the optimal estimator is the projection of the HT estimator onto \(\Lambda^\perp\), and the Pythagorean identity governs the variance reduction.
  4. The AIPW estimating function is the efficient influence function for the ATE; its variance gives the semiparametric efficiency bound. The bias-correction, optimal-augmentation, and semiparametric-efficiency derivations all converge on the same object.
  5. Double robustness can be enforced within the outcome-regression fitting step, by IPW-weighted regression or by an augmented regression model including the inverse propensity score as a clever covariate (TMLE connection).
  6. The simulation (Section 12.10) demonstrates double robustness across a \(2\times 2\) design: AIPW remains consistent whenever at least one nuisance model is correct (S1–S3) and loses the guarantee in S4. Wald coverage is approximately nominal in S1–S3 and collapses in S4.
  7. For asymptotic inference, the product-rate condition Equation 12.12 is sufficient for root-\(n\) inference. In modern machine-learning settings, cross-fitting makes these conditions more plausible; Chapter 12 develops this approach.

11.13 Problems

1. Bias of the prediction estimator.

  1. Verify the bias formula Equation 11.1 by writing \(\tau_{\mathrm{pred}}(m) - \tau\) as a difference of working-model errors and simplifying.
  2. Under what condition on the propensity score model does \(\widehat{\mathrm{Bias}}(\hat\tau_{\mathrm{pred}})\) have zero expectation? Provide a careful argument using iterated expectations.
  3. Construct a simple example (binary \(X\), binary \(T\), binary \(Y\)) in which the estimated bias is nonzero and the outcome model is misspecified, but the AIPW estimator is still consistent.

2. The AIPW class and optimal control functions.

  1. Verify that \(\hat\tau_b\) with \(b_t = \mu_t\) equals the AIPW estimator Equation 11.5.
  2. Using Equation 11.8, show that \(b_t^*(x) = \E\{Y(t) \mid x\}\) minimizes conditional variance by completing the square in \(b_t\).
  3. Suppose \(\pi(x) = 1/2\) for all \(x\). Simplify Equation 11.8 and interpret the result.

3. Double robustness.

  1. Verify Case 1 of the double robustness theorem: with \(\mu_t = \mu_t^*\) and arbitrary \(\pi\), show \(\E\{\phi\} = 0\).
  2. Verify Case 2: with \(\pi = \pi^*\) and arbitrary \(\mu_t\), show \(\E\{\phi\} = 0\).
  3. Provide a counterexample showing \(\E\{\phi\} \neq 0\) when both models are misspecified.

4. Projection and the Pythagorean identity.

  1. Verify \(\mathrm{Cov}(\hat\theta_{\mathrm{opt}}, \hat b^*) = 0\) from the definition of \(\hat b^*\).
  2. Deduce Equation 11.9 from the decomposition \(\hat\theta_0 = \hat\theta_{\mathrm{opt}} + \hat b^*\).
  3. Explain why \(\hat\theta_{\mathrm{opt}}\) is still unbiased for \(\theta\) even though \(\hat b^* \in \Lambda\) is subtracted.

5. Semiparametric efficiency.

  1. Show \(\E\{\varphi_{\mathrm{eff}}(O)^2\} \leq \E\{\varphi_{\mathrm{IPW}}(O)^2\}\) by writing \(\varphi_{\mathrm{IPW}} = \varphi_{\mathrm{eff}} + (\varphi_{\mathrm{IPW}} - \varphi_{\mathrm{eff}})\) and showing the cross term vanishes.
  2. Interpret the efficiency gain \(\E\{\varphi_{\mathrm{IPW}}^2\} - \E\{\varphi_{\mathrm{eff}}^2\}\) in terms of the variance of the outcome regression.
  3. Give one reason why an efficient estimator based on \(\varphi_{\mathrm{eff}}\) may still be disfavored in practice relative to a simpler, less efficient alternative.

6. Augmented model and the clever covariate.

  1. Let \(\hat\mu_1^{(0)}(X_i)\) be any initial outcome model fit. Show that the normal equation for \(\hat\gamma_1\) in the augmented model Equation 11.14 is exactly the IBC condition Equation 11.12, and conclude that the prediction estimator can always be written in AIPW form.
  2. Explain why \(\hat\gamma_1 \overset{p}{\to} 0\) when the initial model \(\hat\mu_1^{(0)}\) is correctly specified, and what this implies about the efficiency cost of including the clever covariate.
  3. Explain in words why \(\hat\pi_i^{-1}\) is called the clever covariate: what bias does it absorb, and why does this make the prediction estimator a debiased estimator even when \(\hat\mu_1^{(0)}\) is misspecified?

7. Coverage of the Wald confidence interval (computational). Use the DGP of Section 12.10 with the propensity score generalized to \(\pi^*(x) = \mathrm{expit}\{0.2x + \gamma(x^2 - 1)\}\), so that the baseline (\(\gamma = 0.2\)) coincides with the lab and larger \(\gamma\) degrades overlap. Implement the AIPW estimator and variance estimator Equation 11.17. Following Scenario S3, use logistic regression of \(T\) on \((1, X, X^2)\) for \(\hat\pi\) (correct PS) and OLS of \(Y\) on \((1, X)\) separately in each arm for \(\hat\mu_t\) (misspecified OR); truncate \(\hat\pi\) to \([10^{-3}, 1-10^{-3}]\).

  1. With \(\gamma = 0.2\), \(n = 500\), \(B = 2000\) replications, compute the empirical coverage of the 95% Wald interval \(\hat\tau_{\mathrm{AIPW}} \pm z_{0.975}\sqrt{\hat V}\). Compare the average SE \(\overline{\sqrt{\hat V}}\) with the empirical SD of \(\hat\tau_{\mathrm{AIPW}}\) across replications.
  2. Repeat with \(\gamma \in \{0.5,\, 1.0\}\), producing increasingly weak overlap. Report coverage, average SE, and empirical SD. What do you observe?
  3. Which assumption of Section 13.6 is stressed as \(\gamma\) grows, and which of the remedies in the Extreme Propensity Scores warning would you try first? (No simulation required; a paragraph suffices.)
Bang, Heejung, and James M. Robins. 2005. “Doubly Robust Estimation in Missing Data and Causal Inference Models.” Biometrics 61 (4): 962–73.
Bickel, Peter J., Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68.
Firth, David, and Karen E. Bennett. 1998. “Robust Models in Probability Sampling.” Journal of the Royal Statistical Society: Series B 60 (1): 3–21.
Laan, Mark J. van der, and Sherri Rose. 2011. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer.
Laan, Mark J. van der, and Daniel Rubin. 2006. “Targeted Maximum Likelihood Learning.” The International Journal of Biostatistics 2 (1): 1–40.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Tsiatis, Anastasios A. 2006. Semiparametric Theory and Missing Data. Springer.