11 Doubly Robust Estimation and Semiparametric Efficiency
11.1 Why Combine Outcome Regression and Weighting?
Under consistency, conditional exchangeability, and positivity, the ATE \(\tau = \E\{Y(1) - Y(0)\}\) is identified by either of the following: \[\tau = \E\{\mu_1(X) - \mu_0(X)\}, \qquad \text{or} \qquad \tau = \E\!\left\{\frac{TY}{\pi(X)} - \frac{(1-T)Y}{1-\pi(X)}\right\},\] where \(\mu_t(x) = \E(Y \mid T{=}t,\, X{=}x)\) and \(\pi(x) = P(T{=}1 \mid X{=}x)\).
These formulas suggest two basic estimation strategies. The outcome regression (prediction) estimator averages \(\hat\mu_1(X_i) - \hat\mu_0(X_i)\) over the sample. The IPW estimator reweights observed outcomes using the propensity score. Each strategy has weaknesses: regression can be biased under misspecification, while IPW can be unstable when estimated propensity scores are close to 0 or 1 (Rosenbaum and Rubin 1983).
This chapter develops a third strategy: combine both models to construct an estimator that is consistent when either model is correct, and achieves the semiparametric efficiency bound when both are correct. We derive the same estimator in three complementary ways: as a bias-corrected prediction estimator (Section 11.2), as the optimal member of a class of augmented estimators (Section 11.4 and Section 11.6), and as the efficient influence function (Section 11.7).
11.2 The Prediction Estimator and Its Bias
Work first with generic working regression functions \(m_t(x)\), not necessarily equal to the true conditional means. The prediction estimator based on \(m_t\) is: \[\hat\tau_{\mathrm{pred}}(m) = \frac{1}{n}\sum_{i=1}^n \{m_1(X_i) - m_0(X_i)\}.\] Define the population prediction estimand \(\tau_{\mathrm{pred}}(m) = \E\{m_1(X) - m_0(X)\}\). Under ignorability: \[\tau_{\mathrm{pred}}(m) - \tau = \E\bigl[(m_1(X) - \mu_1(X)) - (m_0(X) - \mu_0(X))\bigr]. \tag{11.1}\]
If a propensity score estimator \(\hat\pi\) is available, the two bias terms in Equation 11.1 can be estimated from observed data. Specializing the working functions to fitted regressions \(m_t = \hat\mu_t\) and writing residuals \(e_i(t) = Y_i(t) - \hat\mu_t(X_i)\) (the treatment indicators ensure that only observed residuals enter): \[\widehat{\mathrm{Bias}}(\hat\tau_{\mathrm{pred}}) = -\frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat\pi(X_i)}\,e_i(1) + \frac{1}{n}\sum_{i=1}^n \frac{1-T_i}{1-\hat\pi(X_i)}\,e_i(0).\]
The bias-corrected prediction estimator is \(\hat\tau_{\mathrm{AIPW}} = \hat\tau_{\mathrm{pred}} - \widehat{\mathrm{Bias}}(\hat\tau_{\mathrm{pred}}) = \hat\mu_{1,\mathrm{dr}} - \hat\mu_{0,\mathrm{dr}}\), where: \[\hat\mu_{1,\mathrm{dr}} = \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat\pi(X_i)}\{Y_i - \hat\mu_1(X_i)\} + \frac{1}{n}\sum_{i=1}^n \hat\mu_1(X_i), \tag{11.2}\] \[\hat\mu_{0,\mathrm{dr}} = \frac{1}{n}\sum_{i=1}^n \frac{1-T_i}{1-\hat\pi(X_i)}\{Y_i - \hat\mu_0(X_i)\} + \frac{1}{n}\sum_{i=1}^n \hat\mu_0(X_i). \tag{11.3}\]
11.3 The Augmented IPW Estimator
Notation. From this point on, \(\mu_t(x)\) inside estimating functions denotes a generic working outcome regression; the truth is written \(\mu_t^*(x) = \E(Y \mid T{=}t, X{=}x)\). The propensity score \(\pi(x)\) inside estimating functions denotes a working model; the truth is \(\pi^*(x) = P(T{=}1 \mid X{=}x)\).
Collecting terms gives the estimating-equation form. The AIPW estimator solves \(\mathbb{P}_n\{\phi(O;\tau,\hat\eta)\} = 0\) where: \[\phi(O;\tau,\eta) = \left[\frac{T}{\pi(X)}\{Y-\mu_1(X)\}+\mu_1(X)\right] - \left[\frac{1-T}{1-\pi(X)}\{Y-\mu_0(X)\}+\mu_0(X)\right] - \tau. \tag{11.4}\]
Solving \(\mathbb{P}_n\{\phi(O;\tau,\hat\eta)\} = 0\) explicitly gives the AIPW estimator: \[\hat\tau_{\mathrm{AIPW}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\hat\pi(X_i)}\{Y_i-\hat\mu_1(X_i)\}+\hat\mu_1(X_i)\right] - \frac{1}{n}\sum_{i=1}^n\left[\frac{1-T_i}{1-\hat\pi(X_i)}\{Y_i-\hat\mu_0(X_i)\}+\hat\mu_0(X_i)\right], \tag{11.5}\] which is exactly \(\hat\mu_{1,\mathrm{dr}} - \hat\mu_{0,\mathrm{dr}}\) from Equation 11.2 and Equation 11.3.
Double robustness guarantees consistency if either model is correct; it does not protect against misspecification of both models simultaneously.
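To fix ideas, here is a minimal R sketch of Equations 11.2, 11.3, and 11.5, assuming the user supplies a binary treatment vector, fitted propensity scores, and arm-specific outcome predictions for every unit (all object and argument names are illustrative):

```r
# AIPW point estimate from fitted nuisance values.
# y, t: outcome and binary treatment; pi_hat: fitted propensity scores;
# mu1_hat, mu0_hat: fitted outcome predictions for every unit under t = 1 and t = 0.
aipw <- function(y, t, pi_hat, mu1_hat, mu0_hat) {
  mu1_dr <- mean(t * (y - mu1_hat) / pi_hat + mu1_hat)               # Equation 11.2
  mu0_dr <- mean((1 - t) * (y - mu0_hat) / (1 - pi_hat) + mu0_hat)   # Equation 11.3
  mu1_dr - mu0_dr                                                    # Equation 11.5
}
```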
11.4 A Class of Augmented Estimators
For any square-integrable functions \(b_1(x)\) and \(b_0(x)\), define: \[\hat\mu_{1,b} = \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\pi(X_i)}\,Y_i - \frac{1}{n}\sum_{i=1}^n \left\{\frac{T_i}{\pi(X_i)}-1\right\} b_1(X_i), \tag{11.6}\] \[\hat\mu_{0,b} = \frac{1}{n}\sum_{i=1}^n \frac{1-T_i}{1-\pi(X_i)}\,Y_i - \frac{1}{n}\sum_{i=1}^n \left\{\frac{1-T_i}{1-\pi(X_i)}-1\right\} b_0(X_i), \tag{11.7}\] and let \(\hat\tau_b = \hat\mu_{1,b} - \hat\mu_{0,b}\). The standard AIPW estimator is the special case \(b_t = \mu_t\) (since \(TY/\pi - (T/\pi - 1)\mu_1 = T(Y-\mu_1)/\pi + \mu_1\)).
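A small sketch of the augmented class in R, under the assumption that `b1` and `b0` are numeric vectors holding \(b_1(X_i)\) and \(b_0(X_i)\) at the sample points (names are illustrative); choosing \(b_t = \hat\mu_t\) reproduces the AIPW estimator, and \(b_t \equiv 0\) reproduces Horvitz–Thompson:

```r
# tau_hat_b (Equations 11.6-11.7) for user-supplied control functions b1, b0.
tau_aug_class <- function(y, t, pi_hat, b1, b0) {
  mu1_b <- mean(t * y / pi_hat) -
           mean((t / pi_hat - 1) * b1)                               # Equation 11.6
  mu0_b <- mean((1 - t) * y / (1 - pi_hat)) -
           mean(((1 - t) / (1 - pi_hat) - 1) * b0)                   # Equation 11.7
  mu1_b - mu0_b
}
# b1 = b0 = 0 gives Horvitz-Thompson; b1 = mu1_hat, b0 = mu0_hat gives AIPW.
```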
11.5 The Projection Interpretation
The optimality of AIPW has an elegant interpretation in terms of projections in a Hilbert space of estimators. The collection of mean-zero square-integrable random variables, equipped with the inner product \(\langle U, V\rangle = \E(UV) = \mathrm{Cov}(U, V)\), is a genuine Hilbert space, and the results below are instances of the \(L^2\) projection theorem.
Let \(\hat\theta_0\) be an unbiased estimator of \(\theta\). Define the augmentation space \(\Lambda\) as a closed linear subspace of mean-zero square-integrable random variables computable from the observed data without knowledge of \(\theta\). For any \(\hat b \in \Lambda\) the estimator \(\hat\theta_b = \hat\theta_0 - \hat b\) remains unbiased.
The Pythagorean identity shows the variance of the initial estimator decomposes orthogonally. Tsiatis (2006) calls \(\Lambda\) the augmentation space because its elements are the corrections that reduce variance.
11.6 Projection in the Causal Inference Setting
We now specialize the projection framework to the ATE, assuming the propensity score \(\pi(X)\) is known, and write \(\pi_i = \pi(X_i)\). Decompose the Horvitz–Thompson estimator as \(\hat\tau_{\mathrm{HT}} = \hat\mu_{1,\mathrm{HT}} - \hat\mu_{0,\mathrm{HT}}\) where: \[\hat\mu_{1,\mathrm{HT}} = \frac{1}{n}\sum_{i=1}^n \frac{T_i Y_i}{\pi_i}, \qquad \hat\mu_{0,\mathrm{HT}} = \frac{1}{n}\sum_{i=1}^n \frac{(1-T_i)Y_i}{1-\pi_i}.\]
Define arm-specific augmentation spaces: \[\Lambda_1 = \left\{n^{-1}\sum_{i=1}^n\!\left(\frac{T_i}{\pi_i}-1\right) b_1(X_i) : b_1 \in \mathcal{L}^2\right\}, \quad \Lambda_0 = \left\{n^{-1}\sum_{i=1}^n\!\left(\frac{1-T_i}{1-\pi_i}-1\right) b_0(X_i) : b_0 \in \mathcal{L}^2\right\}. \tag{11.10}\]
Every element of \(\Lambda_1\) (resp. \(\Lambda_0\)) has expectation zero, so augmenting leaves each arm-mean estimator unbiased. The combined augmentation space is \(\Lambda = \Lambda_1 + \Lambda_0\).
The optimal estimators are: \[\hat\mu_{1,\mathrm{opt}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i}{\pi_i}\{Y_i-\mu_1^*(X_i)\}+\mu_1^*(X_i)\right], \quad \hat\mu_{0,\mathrm{opt}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{1-T_i}{1-\pi_i}\{Y_i-\mu_0^*(X_i)\}+\mu_0^*(X_i)\right].\] Their difference is precisely the AIPW estimator Equation 11.5 with \(b_t^*(X) = \mu_t^*(X)\), confirming the Optimal Control Functions theorem by a different route.
11.7 The Efficient Influence Function and Semiparametric Efficiency
This section states the semiparametric efficiency result and interprets it in light of the estimator development above. A complete proof requires explicit characterization of the nuisance tangent space; see Tsiatis (2006) for a rigorous development.
Under the nonparametric model for \(O=(X,T,Y)\), the efficient influence function for the ATE, written in terms of the true nuisance functions, is: \[\varphi_{\mathrm{eff}}(O) = \frac{T}{\pi^*(X)}\{Y-\mu_1^*(X)\} - \frac{1-T}{1-\pi^*(X)}\{Y-\mu_0^*(X)\} + \mu_1^*(X) - \mu_0^*(X) - \tau. \tag{11.11}\]
Comparing Equation 11.11 with Equation 11.4 shows that the AIPW estimating function, evaluated at the true nuisance functions, is the efficient influence function \(\varphi^*(O)\) from Chapter 10. The augmentation space \(\Lambda\) in Equation 11.10 is the finite-sample analogue of the nuisance tangent space of the nonparametric model, and projection onto \(\Lambda^\perp\) is the finite-sample counterpart of the semiparametric operation that removes nuisance tangent directions.
11.8 Doubly Robust Regression: Weighted and Augmented Approaches
The AIPW estimator achieves double robustness by adding an explicit bias-correction term. An equally important question is how to build double robustness directly into the outcome model fit, so that the prediction estimator is itself doubly robust without a separate augmentation step. The unifying concept is the internal bias calibration (IBC) condition.
11.8.1 The Internal Bias Calibration Conditions
Let \(\hat\mu_t(x)\) denote any fitted outcome model and \(\hat\pi_i = \hat\pi(X_i)\). The prediction estimator requires no augmentation if the IPW-weighted residuals vanish: \[\sum_{i=1}^n \frac{T_i}{\hat\pi_i}\{Y_i - \hat\mu_1(X_i)\} = 0, \qquad \sum_{i=1}^n \frac{1-T_i}{1-\hat\pi_i}\{Y_i - \hat\mu_0(X_i)\} = 0. \tag{11.12}\]
We call Equation 11.12 the internal bias calibration (IBC) conditions (Firth and Bennett 1998). When both IBC conditions hold, the augmentation terms in Equation 11.5 are zero by construction, so \(\hat\tau_{\mathrm{pred}} = \hat\tau_{\mathrm{AIPW}}\) and the prediction estimator is itself doubly robust.
11.8.2 Weighted Regression Approach
Suppose the outcome model for arm \(t\) is parameterized as \(\mu_t(X;\theta_t)\) and includes a constant term. Fit \(\theta_1\) by minimizing the IPW-weighted least-squares criterion: \[\sum_{i=1}^n \frac{T_i}{\hat\pi_i}\{Y_i - \mu_1(X_i;\theta_1)\}^2, \tag{11.13}\] and fit \(\theta_0\) analogously with weights \((1-T_i)/(1-\hat\pi_i)\).
The normal equation of Equation 11.13 with respect to the intercept component is \(\sum_i \frac{T_i}{\hat\pi_i}\{Y_i - \mu_1(X_i;\hat\theta_1)\} = 0\), which is exactly IBC condition Equation 11.12. Hence IPW-weighted fitted values automatically satisfy IBC for any model containing a constant (Robins et al. 1994; Bang and Robins 2005). The key point is not that weighted regression creates a fundamentally different doubly robust estimator; rather, it constructs fitted values for which the prediction estimator algebraically equals the AIPW estimator.
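The construction is easy to verify numerically. The following sketch uses a toy data set (an illustrative DGP, not the lab's) and base R only; because the weighted fits contain an intercept, both IBC conditions hold up to floating-point error and the prediction estimator equals AIPW:

```r
# Toy data (illustrative DGP), fitted propensity scores, and IPW-weighted fits.
set.seed(1)
n <- 500
dat <- data.frame(x = rnorm(n))
dat$t <- rbinom(n, 1, plogis(0.5 * dat$x))
dat$y <- dat$t + dat$x + rnorm(n)

pi_hat <- fitted(glm(t ~ x, family = binomial, data = dat))
dat$w1 <- dat$t / pi_hat                 # IPW weights, treated arm
dat$w0 <- (1 - dat$t) / (1 - pi_hat)     # IPW weights, control arm

# IPW-weighted outcome regressions; any model with an intercept satisfies IBC.
fit1 <- lm(y ~ x, data = dat, subset = t == 1, weights = w1)
fit0 <- lm(y ~ x, data = dat, subset = t == 0, weights = w0)
mu1_hat <- predict(fit1, newdata = dat)
mu0_hat <- predict(fit0, newdata = dat)

# IBC check (Equation 11.12): both weighted residual sums vanish up to rounding.
sum(dat$t * (dat$y - mu1_hat) / pi_hat)
sum((1 - dat$t) * (dat$y - mu0_hat) / (1 - pi_hat))

mean(mu1_hat - mu0_hat)   # prediction estimator; equals AIPW by construction
```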
11.8.3 Augmented Model Approach and the Clever Covariate
Let \(\hat\mu_1^{(0)}(X_i)\) be any initial fit. Augment it by including the clever covariate \(\hat\pi_i^{-1}\) (van der Laan and Rubin 2006; van der Laan and Rose 2011) and run OLS of \(Y_i\) on \(\hat\mu_1^{(0)}(X_i)\) and \(\hat\pi_i^{-1}\) among treated units: \[Y_i = \alpha_1 + \beta_1\hat\mu_1^{(0)}(X_i) + \gamma_1\hat\pi_i^{-1} + e_i(1), \qquad T_i = 1. \tag{11.14}\]
The fitted outcome model is \(\hat\mu_1(X_i) = \hat\alpha_1 + \hat\beta_1\hat\mu_1^{(0)}(X_i) + \hat\gamma_1\hat\pi_i^{-1}\).
The normal equation for \(\hat\gamma_1\) (the coefficient on \(\hat\pi_i^{-1}\)) is \(\sum_i \frac{T_i}{\hat\pi_i}\{Y_i - \hat\mu_1(X_i)\} = 0\), exactly IBC condition Equation 11.12. Since this sum equals zero, the prediction estimator is: \[\hat\mu_1^{\mathrm{pred}} = \frac{1}{n}\sum_{i=1}^n \hat\mu_1(X_i) + \frac{1}{n}\sum_{i=1}^n \frac{T_i}{\hat\pi_i}\{Y_i - \hat\mu_1(X_i)\},\] which is the AIPW representation. The covariate \(\hat\pi_i^{-1}\) is “clever” because it is chosen so that the fitted regression satisfies the same score equation appearing in the AIPW bias correction.
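Continuing the toy example from the weighted-regression sketch above, the clever-covariate construction for the treated arm looks as follows (the control arm is analogous with \(1-T_i\) and \(1/(1-\hat\pi_i)\)):

```r
# Initial (unweighted) fit for the treated arm, then augment with the clever covariate.
dat$mu1_init <- predict(lm(y ~ x, data = dat, subset = t == 1), newdata = dat)
dat$clever1  <- 1 / pi_hat                      # clever covariate, treated arm
fit_aug1 <- lm(y ~ mu1_init + clever1, data = dat, subset = t == 1)
mu1_hat  <- predict(fit_aug1, newdata = dat)

# Normal equation for the clever-covariate coefficient = IBC condition (Equation 11.12):
sum(dat$t * (dat$y - mu1_hat) / pi_hat)         # zero up to rounding error
```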
11.9 Lab: Simulation Study
This lab compares four estimators of the ATE: \(\hat\tau_{\mathrm{HT}}\), \(\hat\tau_{\mathrm{AIPW}}\), the fixed-offset augmented-model estimator \(\hat\tau_{\mathrm{aug}}\), and the improved augmented-model estimator \(\hat\tau_{\mathrm{aug}^+}\) of Equation 11.14. A \(2\times 2\) design over nuisance-model correctness demonstrates double robustness directly. The lab also reports empirical 95% Wald coverage, illustrating the distinction between double-robust consistency and valid asymptotic inference.
DGP. \(n = 1000\) i.i.d. observations. Draw \(X_i \sim N(0,1)\), \(T_i \mid X_i \sim \mathrm{Bernoulli}(\pi^*(X_i))\) with \(\pi^*(x) = \mathrm{expit}\{0.2x + 0.2(x^2 - 1)\}\). Potential outcomes: \(Y_i(1) = 1 + X_i + 0.5X_i^2 + \varepsilon_i(1)\), \(Y_i(0) = X_i + 0.5X_i^2 + \varepsilon_i(0)\), \(\varepsilon_i(t) \sim N(0,1)\) i.i.d., giving true ATE \(\tau = 1\). The \(X^2\) term enters both the true OR and true PS; omitting it yields four scenarios.
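A sketch of this DGP in R (function and object names are illustrative):

```r
# One draw from the lab DGP (true ATE tau = 1).
gen_data <- function(n = 1000) {
  x  <- rnorm(n)
  ps <- plogis(0.2 * x + 0.2 * (x^2 - 1))    # true propensity score
  t  <- rbinom(n, 1, ps)
  y1 <- 1 + x + 0.5 * x^2 + rnorm(n)         # potential outcome under treatment
  y0 <-     x + 0.5 * x^2 + rnorm(n)         # potential outcome under control
  data.frame(x = x, t = t, y = ifelse(t == 1, y1, y0))
}
```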
Scenarios.
| Scenario | OR fit | PS fit |
|---|---|---|
| S1 | correct: \(Y \sim (1, X, X^2)\) | correct: \(T \sim (1, X, X^2)\) |
| S2 | correct: \(Y \sim (1, X, X^2)\) | misspecified: \(T \sim (1, X)\) |
| S3 | misspecified: \(Y \sim (1, X)\) | correct: \(T \sim (1, X, X^2)\) |
| S4 | misspecified: \(Y \sim (1, X)\) | misspecified: \(T \sim (1, X)\) |
The fitted \(\hat\pi\) is clipped to \([10^{-3}, 1-10^{-3}]\).
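A sketch of one replication in R, assuming the `gen_data()` helper above; the scenario flags switch the \(X^2\) term in and out of each nuisance model, and only the HT and AIPW estimators are shown (the aug and aug\(^+\) estimators follow the constructions in Section 11.8):

```r
one_rep <- function(or_correct = TRUE, ps_correct = TRUE, n = 1000) {
  dat <- gen_data(n)
  ps_form <- if (ps_correct) t ~ x + I(x^2) else t ~ x
  or_form <- if (or_correct) y ~ x + I(x^2) else y ~ x
  pi_hat  <- fitted(glm(ps_form, family = binomial, data = dat))
  pi_hat  <- pmin(pmax(pi_hat, 1e-3), 1 - 1e-3)          # clip to [1e-3, 1 - 1e-3]
  mu1_hat <- predict(lm(or_form, data = dat, subset = t == 1), newdata = dat)
  mu0_hat <- predict(lm(or_form, data = dat, subset = t == 0), newdata = dat)
  ht   <- mean(dat$t * dat$y / pi_hat) -
          mean((1 - dat$t) * dat$y / (1 - pi_hat))
  aipw <- mean(dat$t * (dat$y - mu1_hat) / pi_hat + mu1_hat) -
          mean((1 - dat$t) * (dat$y - mu0_hat) / (1 - pi_hat) + mu0_hat)
  c(ht = ht, aipw = aipw)
}
set.seed(2025)
one_rep(or_correct = FALSE, ps_correct = TRUE)   # one draw from Scenario S3
```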
Results (\(B = 2000\) replications, set.seed(2025)). Bias, Var, MSE \(\times 10^{-3}\); Cov = empirical 95% Wald coverage.
| Scenario | Metric | \(\hat\tau_{\mathrm{HT}}\) | \(\hat\tau_{\mathrm{AIPW}}\) | \(\hat\tau_{\mathrm{aug}}\) | \(\hat\tau_{\mathrm{aug}^+}\) |
|---|---|---|---|---|---|
| S1: OR ✓, PS ✓ | Bias | 0.55 | −0.44 | −0.45 | −0.45 |
| | Var | 6.82 | 4.25 | 4.25 | 4.30 |
| | MSE | 6.82 | 4.25 | 4.25 | 4.30 |
| | Cov (%) | 99.9 | 95.0 | 94.9 | 94.5 |
| S2: OR ✓, PS ✗ | Bias | 188.56 | 0.62 | 0.62 | 0.62 |
| | Var | 6.32 | 4.33 | 4.33 | 4.33 |
| | MSE | 41.87 | 4.32 | 4.32 | 4.32 |
| | Cov (%) | 67.4 | 94.2 | 94.2 | 94.2 |
| S3: OR ✗, PS ✓ | Bias | 1.46 | 2.02 | 2.25 | −25.26 |
| | Var | 5.74 | 4.71 | 4.41 | 8.02 |
| | MSE | 5.74 | 4.71 | 4.41 | 8.66 |
| | Cov (%) | 99.9 | 98.4 | 98.2 | 93.2 |
| S4: OR ✗, PS ✗ | Bias | 188.42 | 188.77 | 188.62 | −5.42 |
| | Var | 6.28 | 6.30 | 6.29 | 4.26 |
| | MSE | 41.78 | 41.93 | 41.87 | 4.29 |
| | Cov (%) | 67.5 | 33.1 | 33.1 | 94.7 |
S1 (both correct). All four estimators are approximately unbiased. \(\hat\tau_{\mathrm{HT}}\) has about 60% higher MSE because it discards the outcome model. The three augmented estimators are essentially equivalent, confirming \((\hat\alpha_t, \hat\beta_t, \hat\gamma_t) \to (0, 1, 0)\) when the OR is correct.
S2 (OR correct, PS misspecified). HT is badly biased (MSE \(\approx 42 \times 10^{-3}\), coverage 67.4%). The three augmented estimators are indistinguishable: when the OR is correct, they converge to the same limit regardless of PS specification. This is the first direct demonstration of double robustness.
S3 (OR misspecified, PS correct). HT, AIPW, and \(\hat\tau_{\mathrm{aug}}\) remain consistent through the correct PS. \(\hat\tau_{\mathrm{aug}}\) achieves the lowest MSE (\(4.41 \times 10^{-3}\)). \(\hat\tau_{\mathrm{aug}^+}\) shows a finite-sample bias (\(-25 \times 10^{-3}\)) from high-leverage clever-covariate values. This is the second demonstration of double robustness, now through a correctly specified propensity score.
S4 (both misspecified). Double robustness provides no guarantee. HT, AIPW, and \(\hat\tau_{\mathrm{aug}}\) are all badly biased; Wald coverage collapses to \(\approx 33\%\) for AIPW and aug. The apparent “recovery” of \(\hat\tau_{\mathrm{aug}^+}\) (bias \(-5 \times 10^{-3}\), coverage 94.7%) is a DGP-specific artifact and does not indicate triple robustness.
Takeaway. Comparing S2–S3 against S4 is the crisp illustration of the double robustness theorem: AIPW is consistent whenever at least one nuisance model is correct, and only S4 breaks the guarantee.
11.10 Asymptotic Inference with Estimated Nuisance Functions
Under suitable regularity conditions: \[\sqrt{n}(\hat\tau_{\mathrm{AIPW}}-\tau) = \frac{1}{\sqrt{n}}\sum_{i=1}^n\varphi_{\mathrm{eff}}(O_i) + o_p(1) \overset{d}{\longrightarrow} N\!\bigl(0,\;\E\{\varphi_{\mathrm{eff}}(O)^2\}\bigr).\]
The \(o_p(1)\) remainder captures nuisance estimation error. Neyman orthogonality implies this error enters only through a second-order remainder. A sufficient condition for the remainder to be negligible is the product-rate condition: \[\|\hat\pi-\pi^*\|\cdot\|\hat\mu_t-\mu_t^*\| = o_p(n^{-1/2}), \qquad t=0,1, \tag{11.15}\] where \(\|\cdot\|\) denotes the \(L_2(P)\) norm. A symmetric sufficient condition is that each nuisance estimator converges at rate \(o_p(n^{-1/4})\), which is much weaker than requiring each to converge at the parametric rate \(n^{-1/2}\).
11.10.1 Variance Estimation and Confidence Intervals
Because \(\hat\tau_{\mathrm{AIPW}}\) is asymptotically linear with influence function \(\varphi_{\mathrm{eff}}\), its asymptotic variance is estimated by the empirical variance of the plug-in influence values: \[\hat\varphi_i = \frac{T_i}{\hat\pi(X_i)}\{Y_i-\hat\mu_1(X_i)\} - \frac{1-T_i}{1-\hat\pi(X_i)}\{Y_i-\hat\mu_0(X_i)\} + \hat\mu_1(X_i)-\hat\mu_0(X_i) - \hat\tau_{\mathrm{AIPW}}, \tag{11.16}\] \[\hat V = \frac{1}{n(n-1)}\sum_{i=1}^n(\hat\varphi_i-\bar\varphi)^2, \qquad \bar\varphi = \frac{1}{n}\sum_{i=1}^n\hat\varphi_i. \tag{11.17}\]
The Wald confidence interval is \(\hat\tau_{\mathrm{AIPW}} \pm z_{1-\alpha/2}\sqrt{\hat V}\).
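A sketch of Equations 11.16 and 11.17 in R, assuming fitted propensity scores and outcome predictions are in hand (names are illustrative):

```r
# Influence-function-based variance estimate and Wald interval for AIPW.
aipw_ci <- function(y, t, pi_hat, mu1_hat, mu0_hat, alpha = 0.05) {
  tau_hat <- mean(t * (y - mu1_hat) / pi_hat + mu1_hat) -
             mean((1 - t) * (y - mu0_hat) / (1 - pi_hat) + mu0_hat)
  phi <- t * (y - mu1_hat) / pi_hat -
         (1 - t) * (y - mu0_hat) / (1 - pi_hat) +
         mu1_hat - mu0_hat - tau_hat                      # Equation 11.16
  n <- length(y)
  v_hat <- sum((phi - mean(phi))^2) / (n * (n - 1))       # Equation 11.17
  z <- qnorm(1 - alpha / 2)
  c(estimate = tau_hat, se = sqrt(v_hat),
    lower = tau_hat - z * sqrt(v_hat), upper = tau_hat + z * sqrt(v_hat))
}
```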
Looking ahead: when plug-in fails. For finite-dimensional parametric nuisance models, the product-rate condition holds automatically at the parametric rate \(n^{-1/2}\). The situation changes with flexible machine-learning methods: nuisance convergence rates may be slower than \(n^{-1/4}\), and machine-learning function classes are typically not Donsker, so the empirical-process remainder need not vanish when the same data are used for both nuisance estimation and score evaluation.
Chapter 12 addresses both difficulties by (a) formalizing Neyman orthogonality and exhibiting the AIPW score as an orthogonal score, and (b) decoupling nuisance estimation from score evaluation via cross-fitting. The resulting double/debiased machine learning (DML) estimator (Chernozhukov et al. 2018) is a direct extension of the AIPW development here.
11.11 Comparison of Regression, IPW, and AIPW
All consistency statements below are conditional on the causal identification assumptions of Chapter 5: consistency, conditional exchangeability, and positivity.
| Estimator | Uses \(\mu_t\) | Uses \(\pi\) | Consistent if | Main weakness |
|---|---|---|---|---|
| Regression (prediction) | Yes | No | outcome model correct | Sensitive to OR misspecification |
| IPW (Horvitz–Thompson) | No | Yes | propensity model correct | Unstable under weak overlap |
| AIPW | Yes | Yes | either model correct | Requires estimating both nuisances and careful inference |
When overlap is weak, the IBC-based approaches of Section 11.8 can improve numerical stability by folding the propensity score into the outcome-model fit. They do not, however, eliminate the fundamental information loss where covariate support is lacking: in such cases the practical recommendation is to change the target estimand (trimming, overlap weighting, or restriction to a subgroup) rather than to expect any algebraic refinement of AIPW to repair the problem.
11.12 Chapter Summary
| Symbol | Meaning |
|---|---|
| \(\tau\) | ATE \(= \E\{Y(1)-Y(0)\}\) |
| \(\mu_t(x)\) | Outcome regression \(\E(Y \mid T{=}t, X{=}x)\) |
| \(\pi(x)\) | Propensity score \(P(T{=}1 \mid X{=}x)\) |
| \(b_t(x)\) | Control function; optimal \(b_t^*(x) = \mu_t^*(x)\) |
| \(\hat\tau_{\mathrm{AIPW}}\) | AIPW estimator Equation 11.5 |
| \(\Lambda\) | Augmentation space Equation 11.10 |
| \(\varphi_{\mathrm{eff}}(O)\) | Efficient influence function Equation 11.11 |
| IBC | Internal bias calibration conditions Equation 11.12 |
| \(\hat\pi_i^{-1}\) | Clever covariate; its normal equation enforces IBC |
- The prediction estimator is biased when the outcome model is misspecified; the bias can be estimated with the propensity score and subtracted to yield the AIPW estimator.
- The AIPW estimator has two key properties: double robustness (consistent when either model is correct) and class-optimal variance (the choice \(b_t^* = \mu_t^*\) minimizes variance when both nuisances are correctly specified).
- The same class-specific optimum follows from a Hilbert-space projection: the optimal estimator is the projection of the HT estimator onto \(\Lambda^\perp\), and the Pythagorean identity governs the variance reduction.
- The AIPW estimating function is the efficient influence function for the ATE; its variance gives the semiparametric efficiency bound. The bias-correction, optimal-augmentation, and semiparametric-efficiency derivations all converge on the same object.
- Double robustness can be enforced within the outcome-regression fitting step, by IPW-weighted regression or by an augmented regression model including the inverse propensity score as a clever covariate (TMLE connection).
- The simulation (Section 11.9) demonstrates double robustness across a \(2\times 2\) design: AIPW remains consistent whenever at least one nuisance model is correct (S1–S3) and loses the guarantee in S4. Wald coverage is approximately nominal in S1–S3 and collapses in S4.
- For inference, the product-rate condition Equation 11.15 suffices for root-\(n\) asymptotics. In modern machine-learning settings, cross-fitting makes this condition more plausible; Chapter 12 develops that approach.
11.13 Problems
1. Bias of the prediction estimator.
- Verify the bias formula Equation 11.1 by writing \(\tau_{\mathrm{pred}}(m) - \tau\) as a difference of working-model errors and simplifying.
- Under what condition on the propensity score model does \(\widehat{\mathrm{Bias}}(\hat\tau_{\mathrm{pred}})\) have zero expectation? Provide a careful argument using iterated expectations.
- Construct a simple example (binary \(X\), binary \(T\), binary \(Y\)) in which the estimated bias is nonzero and the outcome model is misspecified, but the AIPW estimator is still consistent.
2. The AIPW class and optimal control functions.
- Verify that \(\hat\tau_b\) with \(b_t = \mu_t\) equals the AIPW estimator Equation 11.5.
- Using Equation 11.8, show that \(b_t^*(x) = \E\{Y(t) \mid x\}\) minimizes conditional variance by completing the square in \(b_t\).
- Suppose \(\pi(x) = 1/2\) for all \(x\). Simplify Equation 11.8 and interpret the result.
3. Double robustness.
- Verify Case 1 of the double robustness theorem: with \(\mu_t = \mu_t^*\) and arbitrary \(\pi\), show \(\E\{\phi\} = 0\).
- Verify Case 2: with \(\pi = \pi^*\) and arbitrary \(\mu_t\), show \(\E\{\phi\} = 0\).
- Provide a counterexample showing \(\E\{\phi\} \neq 0\) when both models are misspecified.
4. Projection and the Pythagorean identity.
- Verify \(\mathrm{Cov}(\hat\theta_{\mathrm{opt}}, \hat b^*) = 0\) from the definition of \(\hat b^*\).
- Deduce Equation 11.9 from the decomposition \(\hat\theta_0 = \hat\theta_{\mathrm{opt}} + \hat b^*\).
- Explain why \(\hat\theta_{\mathrm{opt}}\) is still unbiased for \(\theta\) even though \(\hat b^* \in \Lambda\) is subtracted.
5. Semiparametric efficiency.
- Show \(\E\{\varphi_{\mathrm{eff}}(O)^2\} \leq \E\{\varphi_{\mathrm{IPW}}(O)^2\}\) by writing \(\varphi_{\mathrm{IPW}} = \varphi_{\mathrm{eff}} + (\varphi_{\mathrm{IPW}} - \varphi_{\mathrm{eff}})\) and showing the cross term vanishes.
- Interpret the efficiency gain \(\E\{\varphi_{\mathrm{IPW}}^2\} - \E\{\varphi_{\mathrm{eff}}^2\}\) in terms of the variance of the outcome regression.
- Give one reason why an efficient estimator based on \(\varphi_{\mathrm{eff}}\) may still be disfavored in practice relative to a simpler, less efficient alternative.
6. Augmented model and the clever covariate.
- Let \(\hat\mu_1^{(0)}(X_i)\) be any initial outcome model fit. Show that the normal equation for \(\hat\gamma_1\) in the augmented model Equation 11.14 is exactly the IBC condition Equation 11.12, and conclude that the prediction estimator can always be written in AIPW form.
- Explain why \(\hat\gamma_1 \overset{p}{\to} 0\) when the initial model \(\hat\mu_1^{(0)}\) is correctly specified, and what this implies about the efficiency cost of including the clever covariate.
- Explain in words why \(\hat\pi_i^{-1}\) is called the clever covariate: what bias does it absorb, and why does this make the prediction estimator a debiased estimator even when \(\hat\mu_1^{(0)}\) is misspecified?
7. Coverage of the Wald confidence interval (computational). Use the DGP of Section 11.9 with the propensity score generalized to \(\pi^*(x) = \mathrm{expit}\{0.2x + \gamma(x^2 - 1)\}\), so that the baseline (\(\gamma = 0.2\)) coincides with the lab and larger \(\gamma\) degrades overlap. Implement the AIPW estimator and variance estimator Equation 11.17. Following Scenario S3, use logistic regression of \(T\) on \((1, X, X^2)\) for \(\hat\pi\) (correct PS) and OLS of \(Y\) on \((1, X)\) separately in each arm for \(\hat\mu_t\) (misspecified OR); truncate \(\hat\pi\) to \([10^{-3}, 1-10^{-3}]\).
- With \(\gamma = 0.2\), \(n = 500\), \(B = 2000\) replications, compute the empirical coverage of the 95% Wald interval \(\hat\tau_{\mathrm{AIPW}} \pm z_{0.975}\sqrt{\hat V}\). Compare the average SE \(\overline{\sqrt{\hat V}}\) with the empirical SD of \(\hat\tau_{\mathrm{AIPW}}\) across replications.
- Repeat with \(\gamma \in \{0.5,\, 1.0\}\), producing increasingly weak overlap. Report coverage, average SE, and empirical SD. What do you observe?
- Which assumption of Section 13.6 is stressed as \(\gamma\) grows, and which of the remedies in the Extreme Propensity Scores warning would you try first? (No simulation required; a paragraph suffices.)