12 Flexible Nuisance Estimation, Orthogonal Scores, and Cross-Fitting
12.1 Why Flexible Nuisance Estimation Is Both Attractive and Dangerous
In Chapters 10 and 11 we developed estimation and inference using estimating equations, influence functions, doubly robust scores, and semiparametric efficiency. A central feature of these methods is that the causal parameter depends on nuisance functions such as the outcome regressions \(\mu_t(x) = \E(Y \mid T{=}t,\, X{=}x)\) and the propensity score \(\pi(x) = P(T{=}1 \mid X{=}x)\). When these nuisance functions are too complex for low-dimensional parametric forms, flexible methods — penalized regression, splines, random forests, boosting, neural networks, ensembles — become attractive.
These methods can substantially improve predictive accuracy, but they create new statistical difficulties: slower convergence, overfitting, difficult asymptotic characterization, and failure of naive plug-in inference even when prediction quality is high.
Better nuisance prediction therefore does not automatically imply valid inference for the target causal parameter. This chapter explains how orthogonal scores, sample splitting, and cross-fitting make it possible to combine flexible nuisance estimation with valid large-sample inference. The solution combines an orthogonal score, which removes first-order sensitivity to nuisance estimation error, with cross-fitting, which separates nuisance estimation from score evaluation to control the remaining empirical process remainder.
12.2 Why Naive Plug-In Estimation Can Fail
Suppose \(\psi = \Psi(\eta)\), where \(\eta\) denotes a nuisance object such as \((\mu_0, \mu_1, \pi)\). The plug-in estimator is \(\hat\psi_{\mathrm{plug}} = \Psi(\hat\eta)\).
12.2.1 The First-Order Taylor Expansion
Apply a first-order Taylor expansion of \(\Psi\) around the true nuisance \(\eta_0\): \[\hat\psi_{\mathrm{plug}} - \psi = \underbrace{D_\eta\Psi(\eta_0)[\hat\eta - \eta_0]}_{\text{linear term}} + \underbrace{R(\hat\eta,\,\eta_0)}_{\text{second-order remainder}}, \tag{12.1}\] where \(D_\eta\Psi(\eta_0)[h]\) denotes the Gateaux derivative and \(|R(\hat\eta, \eta_0)| \lesssim \|\hat\eta - \eta_0\|^2\).
The second-order remainder is manageable: if \(\|\hat\eta - \eta_0\| = o_p(n^{-1/4})\) then \(R = o_p(n^{-1/2})\). The obstacle is the linear term, proportional to the nuisance estimation error \(\hat\eta - \eta_0\).
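As a concrete instance, take the mean potential-outcome functional \(\Psi(\eta) = \E\{\mu_1(X)\}\). Perturbing \(\mu_1\) in the direction \(h\) gives \(D_\eta\Psi(\eta_0)[h] = \E\{h(X)\}\), so the linear term in Equation 12.1 is \(\E\{\hat\mu_1(X) - \mu_1(X)\}\): it is first order in the regression error, and when \(\hat\mu_1\) converges at a nonparametric rate slower than \(n^{-1/2}\) this term can dominate the sampling variability, destroying root-\(n\) consistency of the plug-in estimator.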
12.2.2 Finite-Dimensional Nuisance: Orthogonality Is Sufficient
When \(\eta \in \mathbb{R}^d\) and \(\hat\eta - \eta_0 = O_p(n^{-1/2})\), the linear term in Equation 12.1 is \(O_p(n^{-1/2})\) and contributes to the limiting distribution. If the functional satisfies \(D_\eta\Psi(\eta_0) = 0\) — the Neyman orthogonality condition of Section 12.3 — then the linear term vanishes. The expansion reduces to the second-order remainder alone, which at parametric rates is \(O_p(n^{-1})\) and hence negligible. In the finite-dimensional setting, Neyman orthogonality is sufficient to remove nuisance contamination.
12.2.3 Infinite-Dimensional Nuisance: Orthogonality Is Not Enough
When \(\eta\) belongs to a function space, orthogonality is no longer sufficient on its own. Write the plug-in estimating equation as \(\mathbb{P}_n\{\phi(\,\cdot\,;\psi, \hat\eta)\} = 0\), and decompose the deviation from the population equation: \[\mathbb{P}_n\{\phi(\,\cdot\,;\psi, \hat\eta)\} - \mathbb{P}_n\{\phi(\,\cdot\,;\psi, \eta_0)\} = \underbrace{P\{\phi(\,\cdot\,;\psi,\hat\eta) - \phi(\,\cdot\,;\psi,\eta_0)\}}_{\text{population bias}} + \underbrace{(\mathbb{P}_n - P)\{\phi(\,\cdot\,;\psi,\hat\eta) - \phi(\,\cdot\,;\psi,\eta_0)\}}_{\text{empirical process term}}. \tag{12.2}\]
Neyman orthogonality controls the population bias: the Gateaux derivative of \(P\phi\) vanishes at the truth, so the population bias is second order. It says nothing about the empirical process term, whose behavior depends on the complexity of the function class \(\{\phi(\,\cdot\,;\psi,\eta) : \eta \in \mathcal{H}\}\). For flexible machine-learning estimators this class is typically too rich for classical empirical-process arguments, and the term can remain non-negligible even as \(\hat\eta\) converges consistently.
12.3 Orthogonal Scores
A score \(\phi(O;\,\psi,\eta)\) satisfying \(\E\{\phi(O;\,\psi_0,\eta_0)\} = 0\) is Neyman orthogonal at \((\psi_0, \eta_0)\) if the Gateaux derivative of the population moment with respect to the nuisance vanishes at the truth: \[\partial_r\,\E\bigl\{\phi\bigl(O;\,\psi_0,\,\eta_0 + r(\eta - \eta_0)\bigr)\bigr\}\Big|_{r=0} = 0 \quad\text{for all }\eta\text{ in the nuisance space}. \tag{12.3}\] This condition means that small local perturbations of the nuisance have no first-order effect on the estimating equation at the truth. Orthogonality controls only the population-level bias; a separate argument is needed for the empirical process term when the same sample is reused.
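As a contrast, consider the plain inverse-probability-weighting score for the ATE, \(\phi_{\mathrm{IPW}}(O;\,\tau,\pi) = \frac{TY}{\pi(X)} - \frac{(1-T)Y}{1-\pi(X)} - \tau\). Perturbing the propensity score along a bounded direction \(h\) and differentiating the population moment gives \[\partial_r\,\E\!\left\{\frac{TY}{\pi(X)+r h(X)} - \frac{(1-T)Y}{1-\pi(X)-r h(X)}\right\}\bigg|_{r=0} = -\,\E\!\left[\left\{\frac{\mu_1(X)}{\pi(X)} + \frac{\mu_0(X)}{1-\pi(X)}\right\}h(X)\right],\] which is not zero in general: propensity estimation error enters the IPW moment at first order, so this score is not Neyman orthogonal. The augmentation terms in the score of Section 12.4 are exactly what restores orthogonality.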
12.4 The Orthogonal Score for the Average Treatment Effect
Under consistency, conditional exchangeability, and positivity, the efficient influence function from Chapter 11 is: \[\varphi_{\mathrm{eff}}(O;\,\tau,\eta) = \frac{T}{\pi(X)}\{Y - \mu_1(X)\} - \frac{1-T}{1-\pi(X)}\{Y - \mu_0(X)\} + \mu_1(X) - \mu_0(X) - \tau, \tag{12.4}\] where \(\eta = (\mu_0, \mu_1, \pi)\). Chapter 11 identified this as the right score for the ATE; Chapter 12’s role is to explain how to estimate its nuisance components safely with flexible learners.
The score Equation 12.4 simultaneously fulfills three purposes:
- It identifies \(\tau\) through the moment condition \(\E\{\varphi_{\mathrm{eff}}(O;\,\tau,\eta)\} = 0\).
- It is doubly robust, yielding a consistent estimator whenever either the outcome model or the propensity score model is correctly specified.
- It is the efficient influence function, so an estimator that is regular, asymptotically linear with this influence function, and whose nuisance remainders are asymptotically negligible achieves the semiparametric efficiency bound.
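To make the moment condition concrete, the sketch below evaluates the score Equation 12.4 for arrays of observations and solves \(\mathbb{P}_n\{\varphi_{\mathrm{eff}}\} = 0\) for \(\tau\). The function names and array conventions are illustrative rather than taken from any particular package; later sections reuse the same computation with out-of-fold nuisance predictions.

```python
import numpy as np

def aipw_score(y, t, mu0, mu1, pi, tau):
    """Efficient influence function values (Equation 12.4), one per observation.

    y, t     : outcome and binary treatment arrays
    mu0, mu1 : nuisance predictions E(Y | T=0, X) and E(Y | T=1, X)
    pi       : nuisance predictions P(T=1 | X)
    tau      : candidate value of the ATE
    """
    return (t * (y - mu1) / pi
            - (1 - t) * (y - mu0) / (1 - pi)
            + mu1 - mu0
            - tau)

def solve_ate(y, t, mu0, mu1, pi):
    """Solve the empirical moment condition mean(aipw_score) = 0 for tau.

    Because the score is linear in tau with slope -1, the solution is the
    sample mean of the uncentered part of the score.
    """
    return np.mean(aipw_score(y, t, mu0, mu1, pi, tau=0.0))
```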
12.5 Why Reusing the Same Data Can Be Problematic
12.5.1 The Plug-In Remainder for the AIPW Estimator
Consider the plug-in AIPW estimator defined by the moment equation \(\mathbb{P}_n\{\varphi_{\mathrm{eff}}(O;\,\hat\tau_{\mathrm{plug}},\hat\eta)\} = 0\) where the same sample is used for both nuisance estimation and score evaluation. Because \(\varphi_{\mathrm{eff}}\) is linear in \(\tau\) with \(\partial_\tau\varphi_{\mathrm{eff}} = -1\): \[\hat\tau_{\mathrm{plug}} - \tau_0 = \mathbb{P}_n\{\varphi_{\mathrm{eff}}(O;\,\tau_0,\eta_0)\} + \underbrace{\mathbb{P}_n\{\varphi_{\mathrm{eff}}(O;\,\tau_0,\hat\eta) - \varphi_{\mathrm{eff}}(O;\,\tau_0,\eta_0)\}}_{=:\,R_n}. \tag{12.6}\]
The first term satisfies the CLT. The remainder \(R_n\) decomposes as in Equation 12.2: \[R_n = \underbrace{P\{\varphi_{\mathrm{eff}}(\,\cdot\,;\tau_0,\hat\eta) - \varphi_{\mathrm{eff}}(\,\cdot\,;\tau_0,\eta_0)\}}_{\text{population bias}} + \underbrace{(\mathbb{P}_n - P)\{\varphi_{\mathrm{eff}}(\,\cdot\,;\tau_0,\hat\eta) - \varphi_{\mathrm{eff}}(\,\cdot\,;\tau_0,\eta_0)\}}_{\text{empirical process term}}. \tag{12.7}\]
By Neyman orthogonality, the population bias is second order in the nuisance errors and is \(o_p(n^{-1/2})\) under the product-rate condition of Section 12.9. Writing \(f_\eta(\cdot) = \varphi_{\mathrm{eff}}(\,\cdot\,;\tau_0,\eta)\), the empirical process term takes the form: \[(\mathbb{P}_n - P)\{f_{\hat\eta} - f_{\eta_0}\} = \frac{1}{\sqrt{n}}\,\mathbb{G}_n\{f_{\hat\eta} - f_{\eta_0}\}. \tag{12.8}\]
This term depends on \(\hat\eta\), which is estimated from the same sample. Whether it is \(o_p(n^{-1/2})\) depends on how complex the class \(\{f_\eta : \eta \in \mathcal{H}\}\) is — a question the Donsker condition answers.
12.5.2 The Donsker Condition
A class of functions \(\mathcal{F}\) is \(P\)-Donsker if the empirical process \(\{\mathbb{G}_n f : f \in \mathcal{F}\}\) converges weakly to a tight Gaussian process; informally, \(\mathcal{F}\) is small enough that empirical averages behave uniformly like their population counterparts. A key sufficient condition is finite bracketing entropy: the class \(\mathcal{F}\) is Donsker whenever \(\int_0^{\delta_0}\sqrt{\log N_{[\,]}(\epsilon, \mathcal{F}, L_2(P))}\,d\epsilon < \infty\), where \(N_{[\,]}(\epsilon, \mathcal{F}, L_2(P))\) is the minimum number of \(\epsilon\)-brackets needed to cover \(\mathcal{F}\). Classical examples include uniformly bounded monotone functions on \(\mathbb{R}\) and Hölder-smooth functions on \([0,1]^d\) with smoothness index exceeding \(d/2\); the function classes effectively searched by random forests, boosting, or neural networks carry no comparable entropy guarantee.
12.5.3 Connecting the Two Terms
Returning to Equation 12.7:
- Population bias. Controlled by Neyman orthogonality alone; \(o_p(n^{-1/2})\) under the product-rate condition regardless of how \(\hat\eta\) is estimated.
- Empirical process term. If \(\{f_\eta : \eta \in \mathcal{H}\}\) is Donsker and \(\|\hat\eta - \eta_0\| \to 0\) in probability, then by stochastic equicontinuity, \(\mathbb{G}_n\{f_{\hat\eta} - f_{\eta_0}\} = o_p(1)\), making the term \(o_p(n^{-1/2})\).
When the Donsker condition fails, the empirical process term can remain non-negligible even as \(\hat\eta \to \eta_0\). Orthogonality and cross-fitting are complements, not substitutes: orthogonality controls the population bias; cross-fitting controls the empirical process term.
12.6 Sample Splitting
Sample splitting resolves the empirical process problem by construction. Partition \(\{1,\ldots,n\}\) into a training sample \(\mathcal{I}_{\mathrm{train}}\) and an evaluation sample \(\mathcal{I}_{\mathrm{eval}}\). Fit nuisance estimators \(\hat\eta^{(\mathrm{train})}\) on \(\mathcal{I}_{\mathrm{train}}\), then solve the estimating equation on the held-out sample: \[\frac{1}{|\mathcal{I}_{\mathrm{eval}}|}\sum_{i \in \mathcal{I}_{\mathrm{eval}}}\varphi_{\mathrm{eff}}\!\bigl(O_i;\,\tau,\,\hat\eta^{(\mathrm{train})}\bigr) = 0.\]
The remainder on the evaluation fold decomposes as: \[R_n^{(\mathrm{e})} = \underbrace{P\{\varphi_{\mathrm{eff}}(\,\cdot\,;\tau,\hat\eta^{(\mathrm{train})}) - \varphi_{\mathrm{eff}}(\,\cdot\,;\tau,\eta_0)\}}_{o_p(n^{-1/2})\text{ by orthogonality}} + \underbrace{(\mathbb{P}_{n_{\mathrm{e}}} - P)\{\varphi_{\mathrm{eff}}(\,\cdot\,;\tau,\hat\eta^{(\mathrm{train})}) - \varphi_{\mathrm{eff}}(\,\cdot\,;\tau,\eta_0)\}}_{\text{empirical process term}}. \tag{12.9}\]
Because \(\hat\eta^{(\mathrm{train})}\) is independent of the observations in \(\mathcal{I}_{\mathrm{eval}}\), conditioning on \(\hat\eta^{(\mathrm{train})}\) makes the summands conditionally i.i.d. with mean zero. Under \(L_2\) consistency, strong overlap, and finite-variance moment conditions, a conditional variance bound gives the empirical process term \(= o_p(n^{-1/2})\). No Donsker condition is needed: independence allows a conditional argument in place of stochastic equicontinuity.
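The conditional argument is short enough to sketch. Condition on \(\hat\eta^{(\mathrm{train})}\) and write \(\Delta(O) = \varphi_{\mathrm{eff}}(O;\,\tau,\hat\eta^{(\mathrm{train})}) - \varphi_{\mathrm{eff}}(O;\,\tau,\eta_0)\). The evaluation-fold observations are i.i.d. and independent of \(\hat\eta^{(\mathrm{train})}\), so \[\E\Bigl[\bigl\{(\mathbb{P}_{n_{\mathrm{e}}} - P)\Delta\bigr\}^2 \,\Big|\, \hat\eta^{(\mathrm{train})}\Bigr] = \frac{\mathrm{Var}\bigl(\Delta \mid \hat\eta^{(\mathrm{train})}\bigr)}{n_{\mathrm{e}}} \le \frac{P\Delta^2}{n_{\mathrm{e}}}.\] Under strong overlap and the moment conditions above, \(P\Delta^2 \lesssim \|\hat\eta^{(\mathrm{train})} - \eta_0\|^2 \to 0\) in probability, so by conditional Chebyshev the empirical process term is \(o_p(n_{\mathrm{e}}^{-1/2})\).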
The main disadvantage is inefficiency: only \(|\mathcal{I}_{\mathrm{eval}}|\) observations contribute to the estimation of \(\tau\), and the estimate depends on the particular random partition chosen.
12.7 Cross-Fitting
Cross-fitting extends sample splitting to recover full-sample efficiency while preserving the independence argument. Partition \(\{1,\ldots,n\}\) into \(K\) approximately equal folds \(\mathcal{I}_1,\ldots,\mathcal{I}_K\). For each fold \(k\): estimate nuisance functions on the complement \(\hat\eta^{(-k)} = (\hat\mu_0^{(-k)}, \hat\mu_1^{(-k)}, \hat\pi^{(-k)})\), then evaluate the score on the held-out fold. The cross-fitted estimator \(\hat\tau_{\mathrm{DML}}\) solves: \[\frac{1}{n}\sum_{k=1}^K\sum_{i \in \mathcal{I}_k}\varphi_{\mathrm{eff}}\!\bigl(O_i;\,\tau,\,\hat\eta^{(-k)}\bigr) = 0. \tag{12.10}\]
For each fold \(k\), the nuisance estimate \(\hat\eta^{(-k)}\) is trained on \(\{1,\ldots,n\}\setminus\mathcal{I}_k\), which is independent of \(\mathcal{I}_k\). Conditioning on \(\hat\eta^{(-k)}\), the fold-\(k\) summands are independent and mean zero. The fold-by-fold argument applies to each \(R_n^{(k)}\) separately, giving \(\sqrt{n}R_n^{(k)} = o_p(1)\) for each \(k\) without any Donsker assumption.
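Aggregating over folds, the remainder of the cross-fitted estimator is the fold-size-weighted average of the fold remainders, \[\hat\tau_{\mathrm{DML}} - \tau_0 = \mathbb{P}_n\{\varphi_{\mathrm{eff}}(O;\,\tau_0,\eta_0)\} + \sum_{k=1}^K \frac{|\mathcal{I}_k|}{n}\,R_n^{(k)},\] and because \(K\) is fixed, \(\sqrt{n}\sum_{k}(|\mathcal{I}_k|/n)\,R_n^{(k)} = o_p(1)\) follows directly from the \(K\) fold-wise statements.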
Cross-fitting achieves two goals simultaneously: it preserves the held-out independence structure that removes the need for Donsker conditions, and ensures every observation serves once as an evaluation point so the full sample contributes to the estimation of \(\tau\).
12.8 Double Machine Learning for the Average Treatment Effect
Because the score Equation 12.4 is linear in \(\tau\), the moment equation Equation 12.10 has an explicit solution: \[\hat\tau_{\mathrm{DML}} = \frac{1}{n}\sum_{k=1}^K\sum_{i \in \mathcal{I}_k}\left[\hat\mu_1^{(-k)}(X_i) - \hat\mu_0^{(-k)}(X_i) + \frac{T_i\{Y_i - \hat\mu_1^{(-k)}(X_i)\}}{\hat\pi^{(-k)}(X_i)} - \frac{(1-T_i)\{Y_i - \hat\mu_0^{(-k)}(X_i)\}}{1 - \hat\pi^{(-k)}(X_i)}\right]. \tag{12.11}\] The estimator Equation 12.11 is simply the AIPW estimator from Chapter 11, with out-of-fold nuisance estimates substituted in place of full-sample estimates. The term double machine learning (DML), introduced by Chernozhukov et al. (2018), refers to the combination of three ingredients: (i) Neyman orthogonality, which reduces the population bias to a second-order product; (ii) cross-fitting, which removes same-sample dependence in the empirical-process term; and (iii) flexible machine-learning estimation of the nuisance functions, made valid by the first two ingredients.
The word “double” should not be read as merely meaning “two models”; it refers to the combination of nuisance learning with orthogonalization and debiasing. DML is not a new causal estimand and not a new identification strategy; it is a modern estimation protocol built around an orthogonal score. The cross-fitted AIPW estimator in the biostatistics literature is the same object.
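As an illustration, the following minimal sketch implements Equation 12.11 with scikit-learn random forests. The learner choices, the trimming bounds, and the function name dml_ate are illustrative defaults rather than part of the DML definition; any regression and classification learners can be substituted.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def dml_ate(X, t, y, K=5, clip=(0.01, 0.99), random_state=0):
    """Cross-fitted AIPW (DML) estimator of the ATE, Equation 12.11.

    Returns the point estimate and the centered out-of-fold influence
    values, which Section 12.11 uses for variance estimation.
    """
    n = len(y)
    phi = np.empty(n)                      # uncentered score (tau = 0)
    folds = KFold(n_splits=K, shuffle=True, random_state=random_state)

    outcome_learner = RandomForestRegressor(
        n_estimators=100, min_samples_leaf=5, random_state=random_state)
    propensity_learner = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5, random_state=random_state)

    for train_idx, eval_idx in folds.split(X):
        X_tr, t_tr, y_tr = X[train_idx], t[train_idx], y[train_idx]

        # Nuisance estimates trained without the evaluation fold.
        mu1_hat = clone(outcome_learner).fit(X_tr[t_tr == 1], y_tr[t_tr == 1])
        mu0_hat = clone(outcome_learner).fit(X_tr[t_tr == 0], y_tr[t_tr == 0])
        pi_hat = clone(propensity_learner).fit(X_tr, t_tr)

        # Out-of-fold predictions on the held-out fold.
        X_ev = X[eval_idx]
        mu1 = mu1_hat.predict(X_ev)
        mu0 = mu0_hat.predict(X_ev)
        pi = np.clip(pi_hat.predict_proba(X_ev)[:, 1], *clip)

        t_ev, y_ev = t[eval_idx], y[eval_idx]
        phi[eval_idx] = (mu1 - mu0
                         + t_ev * (y_ev - mu1) / pi
                         - (1 - t_ev) * (y_ev - mu0) / (1 - pi))

    tau_hat = phi.mean()                   # solves the cross-fitted moment equation
    return tau_hat, phi - tau_hat          # centered influence values
```

The centered influence values returned here feed directly into the variance formula of Section 12.11.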
12.9 Rate Conditions and Asymptotic Normality
Write \(\|\cdot\|\) for the \(L_2(P)\) norm. Beyond \(L_2\) consistency of each nuisance estimator, strong overlap, and bounded moments, the key requirement is the product-rate condition: \[\|\hat\pi^{(-k)} - \pi\|\,\Bigl(\|\hat\mu_1^{(-k)} - \mu_1\| + \|\hat\mu_0^{(-k)} - \mu_0\|\Bigr) = o_p(n^{-1/2}) \quad\text{for each fold }k. \tag{12.12}\] A sufficient symmetric condition is that every nuisance estimator converges at rate \(o_p(n^{-1/4})\), but the condition also holds asymmetrically: a fast propensity model can compensate for a slowly converging outcome model, and vice versa. Under these conditions the cross-fitted estimator is asymptotically linear, \[\sqrt{n}\,(\hat\tau_{\mathrm{DML}} - \tau_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \varphi_{\mathrm{eff}}(O_i;\,\tau_0,\eta_0) + o_p(1) \;\xrightarrow{d}\; N\bigl(0,\,V_{\mathrm{eff}}\bigr),\] where \(V_{\mathrm{eff}} = \mathrm{Var}\{\varphi_{\mathrm{eff}}(O;\,\tau_0,\eta_0)\}\) is the semiparametric efficiency bound of Chapter 11.
12.10 Lab: Simulation Study of the DML Estimator
This lab verifies the asymptotic properties established in Section 12.9. The central message is that flexible nuisance estimation alone is not sufficient for valid inference: a consistent nonparametric estimator can still produce a biased AIPW estimator if the same sample is reused for nuisance estimation and score evaluation.
DGP. \(n = 2000\), covariates \(X_1, X_2, X_3 \overset{\mathrm{iid}}{\sim} \mathrm{Uniform}(-1, 1)\). True nuisance functions: \[\pi(X) = \mathrm{expit}\!\bigl(\sin(\pi X_1) + X_2^2 - 0.5\bigr), \quad \mu_1(X) = 2\sin(\pi X_1) + X_2^2 + X_3, \quad \mu_0(X) = \sin(\pi X_1) + X_2^2 - X_3.\] True ATE: \(\tau = \E\{\mu_1(X) - \mu_0(X)\} = \E\{\sin(\pi X_1) + 2X_3\} = 0\) (both \(X_1\) and \(X_3\) are symmetric about zero). The nonlinearity of \(\pi\) and \(\mu_t\) makes simple parametric specifications inadequate; random forests can fit these surfaces consistently, but the function class they effectively search falls outside classical fixed Donsker regimes.
Three estimators. (1) Oracle AIPW: plug in the true nuisance functions — infeasible but provides the semiparametric efficiency benchmark. (2) Naive RF AIPW: estimate the nuisance functions using random forests on the full sample, then evaluate the AIPW score on the same sample. (3) DML (cross-fitted AIPW): the same random forest learners, but with nuisance fitting and score evaluation separated by \(K = 5\)-fold cross-fitting. Estimators (2) and (3) use identical learners; the only difference is data reuse versus cross-fitting.
Settings. Software: Python, scikit-learn. Replications: \(n_{\mathrm{sim}} = 200\). Folds: \(K = 5\), shuffled. Nuisance learners: random forest regressor for \(\mu_t\); random forest classifier for \(\pi\); n_estimators=100, min_samples_leaf=5. Propensity trimming: \(\hat\pi\) clipped to \([0.01, 0.99]\). No hyperparameter tuning inside the cross-fitting loop.
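A single replication of the comparison can be sketched as follows. The outcome noise distribution is not specified above, so the sketch assumes standard normal errors; the seed and helper names are illustrative.

```python
import numpy as np
from scipy.special import expit
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

# DGP from this section: nonlinear propensity and outcome surfaces.
X = rng.uniform(-1.0, 1.0, size=(n, 3))
pi_true = expit(np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 - 0.5)
mu1_true = 2.0 * np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + X[:, 2]
mu0_true = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 - X[:, 2]
t = rng.binomial(1, pi_true)
y = np.where(t == 1, mu1_true, mu0_true) + rng.normal(size=n)  # noise law assumed N(0, 1)

def aipw(y, t, mu0, mu1, pi):
    """Sample mean of the uncentered AIPW score (Equation 12.4 with tau = 0)."""
    phi = mu1 - mu0 + t * (y - mu1) / pi - (1 - t) * (y - mu0) / (1 - pi)
    return phi.mean()

def rf_reg():
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)

# (1) Oracle AIPW: true nuisance functions, the infeasible efficiency benchmark.
tau_oracle = aipw(y, t, mu0_true, mu1_true, pi_true)

# (2) Naive RF AIPW: nuisances fit on the full sample, score evaluated on the same sample.
mu1_in = rf_reg().fit(X[t == 1], y[t == 1]).predict(X)
mu0_in = rf_reg().fit(X[t == 0], y[t == 0]).predict(X)
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
pi_in = np.clip(clf.fit(X, t).predict_proba(X)[:, 1], 0.01, 0.99)
tau_naive = aipw(y, t, mu0_in, mu1_in, pi_in)

print(f"oracle {tau_oracle:+.3f}   naive {tau_naive:+.3f}")
```

The DML column is obtained by passing the same arrays to the dml_ate sketch from Section 12.8; wrapping the whole script in a loop over 200 seeds reproduces the bias pattern reported below.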
Results:
| Estimator | Bias | Variance | RMSE |
|---|---|---|---|
| Oracle AIPW | 0.009 | 0.0029 | 0.055 |
| Naive RF AIPW | 0.039 | 0.0031 | 0.068 |
| DML | 0.007 | 0.0036 | 0.060 |
Oracle AIPW is nearly unbiased with small variance — the semiparametric efficiency benchmark.
Naive RF AIPW exhibits a clear positive bias (0.039, roughly four times the oracle bias) despite using consistent random forest estimators. The bias arises from the uncontrolled empirical process term Equation 12.8: reusing the same sample allows overfitting in the random forests to propagate into the AIPW score.
DML reduces the bias to 0.007 — within Monte Carlo error of zero and indistinguishable from the oracle — at a modest finite-sample variance cost (from 0.0031 to 0.0036). The small variance increase is a finite-sample phenomenon: each out-of-fold nuisance estimator is trained on \((K-1)n/K = 1600\) observations, so its predictions are slightly noisier than in-sample fits. The Naive estimator’s lower variance is partly a consequence of overfitting. The net effect is a lower RMSE for DML (0.060) than for Naive RF AIPW (0.068): the bias reduction outweighs the variance difference.
12.11 Variance Estimation and Confidence Intervals
Inference for \(\hat\tau_{\mathrm{DML}}\) proceeds exactly as in Chapter 11, except that nuisance estimates are now out-of-fold. Let \(k(i)\) denote the fold containing observation \(i\), and define the estimated influence value: \[\hat\varphi_i = \varphi_{\mathrm{eff}}\!\bigl(O_i;\,\hat\tau_{\mathrm{DML}},\,\hat\eta^{(-k(i))}\bigr).\]
The variance estimator is: \[\hat{V} = \frac{1}{n(n-1)}\sum_{i=1}^n(\hat\varphi_i - \bar\varphi)^2, \qquad \bar\varphi = \frac{1}{n}\sum_{i=1}^n\hat\varphi_i. \tag{12.13}\]
The Wald confidence interval is \(\hat\tau_{\mathrm{DML}} \pm z_{1-\alpha/2}\sqrt{\hat{V}}\).
The formula is identical to the one in Chapter 11; the only difference is that \(\hat\varphi_i\) uses the out-of-fold nuisance estimate \(\hat\eta^{(-k(i))}\) rather than a full-sample estimate.
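Given the centered out-of-fold influence values (for example, as returned by the dml_ate sketch in Section 12.8), Equation 12.13 and the Wald interval take only a few lines. The function name and the 95% default level are illustrative.

```python
import numpy as np
from scipy.stats import norm

def dml_confint(tau_hat, phi_centered, alpha=0.05):
    """Wald interval from out-of-fold influence values (Equation 12.13)."""
    n = len(phi_centered)
    # Centering is numerically inconsequential when phi_centered already has mean zero.
    v_hat = np.sum((phi_centered - phi_centered.mean()) ** 2) / (n * (n - 1))
    half = norm.ppf(1 - alpha / 2) * np.sqrt(v_hat)
    return tau_hat - half, tau_hat + half
```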
12.12 A Practical Workflow
A cross-fitted analysis of the ATE proceeds in the following steps.
1. State the estimand and the identification assumptions (consistency, conditional exchangeability, positivity), and justify them on subject-matter grounds; Section 12.13 discusses what flexible estimation cannot repair.
2. Choose the orthogonal score: for the ATE, the efficient influence function Equation 12.4.
3. Choose flexible learners for \(\mu_0\), \(\mu_1\), and \(\pi\), with any tuning performed inside the training folds only.
4. Cross-fit: partition the sample into \(K\) folds, estimate the nuisance functions on each complement, and evaluate the score out of fold as in Equation 12.10.
5. Inspect the out-of-fold propensity estimates for overlap problems before relying on the result; severe non-overlap signals an identification problem, not an estimation problem.
6. Compute \(\hat\tau_{\mathrm{DML}}\) from Equation 12.11, the variance from Equation 12.13, and the Wald confidence interval.
7. Report the learners, the number of folds, any trimming, and the sensitivity of the estimate to these choices.
12.13 What Machine Learning Does Not Solve
Flexible nuisance estimation is a genuine improvement over parametric methods, but it does not resolve the fundamental difficulties of causal inference. Machine learning does not solve:
- Unmeasured confounding. If important confounders are absent from \(X\), conditional exchangeability \(Y(t) \indep T \mid X\) fails. No flexibility in estimating \(\pi(x)\) or \(\mu_t(x)\) can correct this.
- Violations of consistency. If the treatment version is ill-defined or SUTVA is violated, the potential outcomes framework is compromised.
- Failure of positivity. When \(\pi(x) \approx 0\) or \(\pi(x) \approx 1\), influence values become unstable and coverage degrades severely.
- Ambiguity about the estimand. Flexible estimation cannot resolve disagreement about what causal quantity is scientifically meaningful.
- Invalid instrumental variable assumptions. The exclusion restriction and exogeneity conditions (Chapter 7) are not testable; machine learning cannot substitute for subject-matter justification.
- Fragile mediation assumptions. Sequential ignorability and no interference between mediators (Chapter 8) must be argued on substantive grounds.
Machine learning improves the estimation of nuisance functions within a given identification strategy. It does not create identification where none exists.
12.14 Chapter Summary
| Symbol | Meaning |
|---|---|
| \(\eta\) | Nuisance tuple \((\mu_0, \mu_1, \pi)\) |
| \(\phi(O;\,\psi,\eta)\) | Generic orthogonal score; zero mean at truth |
| \(\varphi_{\mathrm{eff}}(O;\,\tau,\eta)\) | Efficient influence function for the ATE Equation 12.4 |
| \(\mathcal{I}_k\) | \(k\)-th fold; \(k=1,\ldots,K\) |
| \(\hat\eta^{(-k)}\) | Nuisance estimates trained without fold \(k\) |
| \(\hat\tau_{\mathrm{DML}}\) | Cross-fitted AIPW (DML) estimator Equation 12.11 |
| \(\hat\varphi_i\) | Out-of-fold estimated influence value for observation \(i\) |
| \(\hat{V}\) | Variance estimator Equation 12.13 |
- Flexible methods are attractive for nuisance estimation, but naive plug-in estimators can fail for two reasons: without orthogonality, nuisance estimation error enters the expansion at first order; and same-sample nuisance fitting can create an additional empirical-process bias.
- A score function is Neyman orthogonal if its Gateaux derivative with respect to the nuisance vanishes at the truth, so nuisance estimation error enters only at second order.
- The efficient influence function for the ATE Equation 12.4 is orthogonal, doubly robust, and efficiency-achieving. It serves as the orthogonal score throughout this chapter.
- Even with an orthogonal score, reusing the same data can induce overfitting bias, especially when the nuisance class falls outside classical fixed Donsker regimes.
- Sample splitting removes this bias but wastes data. Cross-fitting recovers efficiency by rotating training and evaluation roles across \(K\) folds.
- The cross-fitted AIPW estimator \(\hat\tau_{\mathrm{DML}}\) is the standard DML estimator for the ATE.
- Under the product-rate condition Equation 12.12, orthogonality and cross-fitting yield asymptotic linearity, root-\(n\) consistency, and asymptotic normality, with asymptotic variance equal to the semiparametric efficiency bound (Section 12.9). A sufficient symmetric condition is that each nuisance estimator converges at rate \(o_p(n^{-1/4})\).
- The asymptotic variance is estimated consistently from the empirical variance of the out-of-fold estimated influence values.
- Machine learning improves nuisance estimation, but it does not resolve identification problems: unmeasured confounding, positivity failures, and untestable structural assumptions remain the researcher’s responsibility.
12.15 Problems
1. Verifying orthogonality.
- Write out the efficient influence function \(\varphi_{\mathrm{eff}}(O;\,\tau,\eta)\) for the ATE and compute its expectation. Confirm it equals zero at the truth.
- Consider a perturbation \(\pi \mapsto \pi + r\cdot h\) for a bounded function \(h(x)\). Differentiate \(\E\{\varphi_{\mathrm{eff}}(O;\,\tau, \mu_0, \mu_1, \pi+rh)\}\) with respect to \(r\) and evaluate at \(r=0\). Show the derivative is zero, confirming Neyman orthogonality in the propensity score direction.
- Repeat the calculation for a perturbation in \(\mu_1\).
2. First-order bias of the plug-in estimator. Suppose \(\hat\mu_1 = \mu_1 + \delta\) for a deterministic function \(\delta(x)\), and \(\hat\mu_0 = \mu_0\), \(\hat\pi = \pi\).
- Show that the bias of the prediction estimator \(\hat\tau_{\mathrm{pred}} = n^{-1}\sum_i\{\hat\mu_1(X_i) - \hat\mu_0(X_i)\}\) is \(\E\{\delta(X)\}\).
- Show that the population bias of the plug-in AIPW estimator (using \(\hat\mu_1 = \mu_1 + \delta\) with the true \(\pi\) and \(\mu_0\)) is zero. Explain which property of the AIPW score is responsible. Does this mean the plug-in AIPW estimator is asymptotically unbiased when \(\hat\mu_1\) is estimated flexibly from the same sample used to evaluate the score? Explain why or why not.
3. Cross-fitting with \(K = 2\). Partition a sample of \(n = 200\) observations into two halves \(\mathcal{I}_1\) and \(\mathcal{I}_2\).
- Write down the cross-fitted moment equation Equation 12.10 explicitly for \(K = 2\).
- Explain why the contribution from \(\mathcal{I}_1\) uses nuisance estimates trained on \(\mathcal{I}_2\) and vice versa.
- Compare this procedure to single sample splitting with \(\mathcal{I}_1\) as the training fold. What is gained and what is lost?
4. The product-rate condition. Suppose \(\|\hat\pi - \pi\|_{L_2} = o_p(n^{-\alpha})\) and \(\|\hat\mu_t - \mu_t\|_{L_2} = o_p(n^{-\beta})\) for \(\alpha, \beta > 0\).
- State the condition on \(\alpha\) and \(\beta\) under which the product-rate condition Equation 12.12 holds.
- Show that \(\alpha = \beta = 1/4\) satisfies your condition.
- Does \(\alpha = 1/2\), \(\beta = 0\) satisfy it? What does this say about the case where the propensity score is estimated parametrically but the outcome model converges only at an unspecified rate?
5. Variance estimation. Using the explicit formula Equation 12.11, write out the estimated influence value \(\hat\varphi_i\) for a generic observation \(i \in \mathcal{I}_k\). Show that \(n^{-1}\sum_i\hat\varphi_i = 0\) exactly when \(\hat\tau_{\mathrm{DML}}\) solves the cross-fitted moment equation, and explain why the centering in Equation 12.13 is numerically inconsequential.
6. What machine learning cannot do. For each of the following scenarios, identify the identification assumption that is violated and explain why flexible nuisance estimation cannot remedy the problem.
- A study of job-training effects on earnings omits pre-program earnings, a strong predictor of both program participation and subsequent earnings.
- A study of a medical treatment estimates the propensity score accurately, but the treatment was assigned in clusters (hospital wards) and outcomes may be correlated within clusters in a way that depends on the fraction of patients treated.
- An instrument is used to estimate the effect of education on wages, but the instrument also directly affects wages through a channel not related to education.