By the end of this chapter, students should be able to:
- Explain why identification of a causal parameter does not automatically yield a statistically reliable estimator, and articulate the distinct questions that an estimation theory must address.
- Define an estimating equation and verify the population moment condition for standard examples (sample mean, OLS, IPW).
- State the definition of an asymptotically linear estimator and derive the influence function from a generic first-order expansion of an estimating equation.
- Compute the influence function for simple causal estimators when nuisance functions are known, and interpret it as the infinitesimal contribution of a single observation.
- Describe how estimation error in nuisance functions propagates into the target estimator via the Z-estimation stacked-equation framework, and explain why this is the central statistical challenge in causal inference.
- Use the influence function to construct a consistent variance estimator and an approximate Wald confidence interval.
- Define efficiency in the class of regular estimators and explain the role of the efficient influence function as the foundation for doubly robust estimation in Chapter 11.
Why Estimation Needs Its Own Theory
So far the focus has been on identification: under what assumptions can a causal parameter be written as a functional of the observed-data distribution? Identification does not by itself provide a statistically reliable estimator. Once a causal parameter is identified, several distinct questions remain:
- How should the estimator be constructed from the observed sample?
- How does estimation of nuisance functions affect the target estimator?
- What is the large-sample distribution of the estimator?
- How can we compute valid standard errors and confidence intervals?
These questions motivate a separate theory of estimation and inference. In causal inference this issue is especially important because identified parameters often depend on auxiliary, or nuisance, functions such as the outcome regression \(\mu_t(x) = \E(Y \mid T{=}t,\, X{=}x)\) or the propensity score \(\pi(x) = P(T{=}1 \mid X{=}x)\). Even when the causal parameter is identified, different estimation strategies may behave quite differently in terms of robustness, efficiency, and sensitivity to model misspecification.
The goal of this chapter is to introduce a general framework for estimation based on estimating equations and influence functions. This framework will serve as the foundation for the doubly robust and semiparametric methods developed in Chapter 11; see also Imbens and Rubin (2015) and Hernán and Robins (2020) for complementary treatments.
A Running Example: The ATE
Throughout this chapter we use the average treatment effect (ATE) as a running example: \[\tau = \E\{Y(1) - Y(0)\}.\]
Under consistency, conditional exchangeability, and positivity, the ATE is identified by the back-door formula (Chapter 5): \[\tau = \E[\mu_1(X) - \mu_0(X)], \qquad \mu_t(x) = \E(Y \mid T{=}t,\, X{=}x),\quad t\in\{0,1\}.\]
This identity tells us what the target parameter is, but it does not uniquely determine how to estimate it. One may consider a regression-based estimator by fitting models for \(\mu_1(x)\) and \(\mu_0(x)\); a weighting estimator based on the propensity score \(\pi(x)\); or an augmented estimator combining both. All of these may target the same causal parameter yet differ in their statistical properties. A main goal of this chapter is to develop a common language for describing and comparing such estimators.
Estimating Equations
A broad class of estimators can be defined as solutions to estimating equations. Let \(O_1,\dots,O_n\) be i.i.d. observations from a distribution \(P\), let \(\theta\) denote a finite-dimensional target parameter, and write \(\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(O_i)\) for the empirical average.
An estimating equation for a parameter \(\theta\) is an equation of the form \(\mathbb{P}_n\{U(O;\theta)\} = 0\), where the population moment condition \(\E\{U(O;\theta_0)\} = 0\) holds at the true parameter value \(\theta_0\). The function \(U(O;\theta)\) is called the estimating function.
Many familiar estimators take this form.
(i) Sample mean. Let \(\theta = \E(Y)\). The sample mean \(\hat\theta = \bar Y\) solves \(\mathbb{P}_n(Y-\theta) = 0\). The estimating function is \(U(O;\theta) = Y - \theta\).
(ii) Ordinary least squares. Let \(O = (X,Y)\) and \(\beta = [\E(XX^\top)]^{-1}\E(XY)\). OLS solves \(\mathbb{P}_n[X\{Y - X^\top\beta\}] = 0\).
(iii) IPW estimating equation. Suppose \(\pi(X)\) is known. Under consistency, conditional exchangeability, and positivity, \(\tau = \E\{TY/\pi(X) - (1-T)Y/(1-\pi(X))\}\). Then \(\tau\) solves: \[\E\!\left[\frac{TY}{\pi(X)} - \frac{(1-T)Y}{1-\pi(X)} - \tau\right] = 0.\] The expectation is a functional of the observed-data distribution; it equals the causal parameter \(\tau\) only under the identification assumptions, so knowing \(\pi(X)\) alone does not turn the expression into a causal quantity.
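To make the framework concrete, here is a minimal numerical sketch. The data-generating process, sample size, and the bisection solver are illustrative assumptions, not part of the text; the sketch solves the sample-mean and known-\(\pi\) IPW estimating equations by finding the root of the empirical moment condition.

```python
import numpy as np

# Illustrative simulated data (assumed design, not from the text):
# pi(X) = expit(X) is treated as known, and the true ATE is 2.
rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))                 # known propensity score
T = rng.binomial(1, pi)
Y = 2.0 * T + X + rng.normal(size=n)      # E{Y(1) - Y(0)} = 2

def solve_scalar(U, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection root of theta -> P_n{U(O; theta)} for a scalar parameter."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(U(lo)) * np.mean(U(mid)) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# (i) Sample mean: U(O; theta) = Y - theta, so the root is the sample average.
theta_hat = solve_scalar(lambda th: Y - th)

# (iii) IPW with known pi: U(O; tau) = TY/pi - (1-T)Y/(1-pi) - tau.
ipw_term = T * Y / pi - (1 - T) * Y / (1 - pi)
tau_hat = solve_scalar(lambda ta: ipw_term - ta)
print(theta_hat, tau_hat)
```

Both equations are linear in the parameter, so the roots coincide with sample averages; the generic solver is only meant to emphasize the estimating-equation view.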
The main advantage of the estimating-equation framework is that it provides a unified language for both estimator construction and asymptotic analysis.
From Estimating Equations to Asymptotic Linearity
Estimating equations are useful not only because they define estimators, but also because they often yield a convenient first-order expansion. Under suitable regularity conditions, an estimator solving an estimating equation can typically be approximated as \(\sqrt{n}(\hat\theta - \theta_0) = n^{-1/2}\sum_{i=1}^n \varphi(O_i) + o_p(1)\) for some mean-zero function \(\varphi(O)\).
An estimator \(\hat\theta\) is asymptotically linear with influence function \(\varphi(O)\) if \[\sqrt{n}(\hat\theta - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \varphi(O_i) + o_p(1), \qquad \E\{\varphi(O)\} = 0. \tag{10.1}\]
This representation is fundamental because it immediately implies asymptotic normality by the CLT. When \(\theta \in \mathbb{R}^p\): \[\sqrt{n}(\hat\theta - \theta_0) \overset{d}{\longrightarrow} N(\mathbf{0},\; \Sigma), \qquad \Sigma = \E[\varphi(O)\varphi(O)^\top].\] In the scalar case this reduces to \(\Sigma = \E[\varphi(O)^2]\).
Under smoothness and regularity conditions, an estimator \(\hat\theta\) solving \(\mathbb{P}_n\{U(O;\theta)\}=0\) admits a first-order expansion: \[\sqrt{n}(\hat\theta - \theta_0) = -A^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^n U(O_i;\theta_0) + o_p(1), \qquad A = \E\!\left\{\frac{\partial}{\partial\theta^\top}U(O;\theta_0)\right\}. \tag{10.2}\] Hence the influence function is \(\varphi(O) = -A^{-1}U(O;\theta_0)\).
A Taylor expansion of the sample moment condition around \(\theta_0\) gives: \[0 = \mathbb{P}_n\{U(O;\hat\theta)\} \approx \mathbb{P}_n\{U(O;\theta_0)\} + \mathbb{P}_n\!\left\{\frac{\partial}{\partial\theta^\top} U(O;\theta_0)\right\}(\hat\theta - \theta_0).\] By the LLN, \(\mathbb{P}_n\{\partial_{\theta^\top} U(O;\theta_0)\} \to A\) in probability. Rearranging and multiplying by \(\sqrt{n}\) yields the stated expansion. \(\square\)
Influence Functions: Intuition
Statistical Functionals
Many parameters of interest in statistics and causal inference can be written as functionals of the underlying distribution.
A statistical functional is a map \(\Psi : \mathcal{P} \to \mathbb{R}^p\) that assigns a parameter value \(\psi = \Psi(P)\) to each distribution \(P \in \mathcal{P}\). The plug-in estimator of \(\psi\) is \(\hat\psi = \Psi(\mathbb{P}_n)\), obtained by replacing \(P\) with the empirical distribution.
- Mean. \(\Psi(P) = \E_P(Y)\). Plug-in: \(\bar Y\).
- Variance. \(\Psi(P) = \E_P(Y^2) - [\E_P(Y)]^2\). Plug-in: sample variance.
- Quantile. \(\Psi(P) = F_P^{-1}(\tau)\). Plug-in: sample \(\tau\)-quantile.
- OLS regression coefficient. \(\Psi(P) = [\E_P(XX^\top)]^{-1}\E_P(XY)\). Plug-in: OLS estimator.
- ATE. \(\Psi(P) = \E_P[\mu_1(X) - \mu_0(X)]\). Plug-in: average of estimated regression functions.
A functional \(\Psi\) is linear if \(\Psi((1-\epsilon)P + \epsilon Q) = (1-\epsilon)\Psi(P) + \epsilon\Psi(Q)\). The mean is linear; the variance, quantile, OLS coefficient, and ATE are nonlinear. Linear functionals are straightforward to analyze because their plug-in estimators are sample averages. Nonlinear functionals require a local linearization, which is precisely what the influence function provides.
Influence Functions
Influence functions describe the first-order sensitivity of a statistical functional to small perturbations of the underlying distribution. Informally:
The influence function is the infinitesimal contribution of a single observation to the first-order behavior of the estimator.
A function \(\varphi(O)\) with \(\E\{\varphi(O)\} = 0\) under \(P\) is called an influence function of an estimator \(\hat\psi\) of \(\psi = \Psi(P)\) if \[\sqrt{n}(\hat\psi - \psi) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \varphi(O_i) + o_p(1).\]
The influence function can be defined formally through the pathwise (Gateaux) derivative of the functional \(\Psi(P)\): if \(P_\epsilon = (1-\epsilon)P + \epsilon\delta_O\) is the \(\epsilon\)-mixture of \(P\) with a point mass at \(O\), then: \[\varphi(O) = \left.\frac{d}{d\epsilon}\Psi(P_\epsilon)\right|_{\epsilon=0}.\] This derivative-based definition is the starting point for semiparametric efficiency theory. In this chapter it is enough to interpret \(\varphi(O)\) as the object appearing in the asymptotic linear expansion.
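This derivative can be checked numerically. The sketch below (an illustration assuming a simulated sample standing in for \(P\)) differentiates the variance functional along the point-mass mixture and compares the result with the standard influence function of the variance, \(\varphi(y) = (y - \mu)^2 - \sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(loc=3.0, size=50_000)   # empirical sample standing in for P

def Psi_var(eps, y):
    """Variance functional at the mixture (1 - eps) * P + eps * delta_y,
    with P approximated by the empirical distribution of Y."""
    m2 = (1 - eps) * np.mean(Y**2) + eps * y**2
    m1 = (1 - eps) * np.mean(Y) + eps * y
    return m2 - m1**2

y0, eps = 5.0, 1e-7
numeric = (Psi_var(eps, y0) - Psi_var(0.0, y0)) / eps   # Gateaux derivative
analytic = (y0 - Y.mean())**2 - Y.var()                 # (y - mu)^2 - sigma^2
print(numeric, analytic)
```

The two numbers agree up to the finite-difference step, illustrating the pathwise-derivative definition for a nonlinear functional.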
Consider \(\beta = \Psi(P) = [\E(XX^\top)]^{-1}\E(XY)\).
Step 1. Let \(P_\epsilon = (1-\epsilon)P + \epsilon\delta_O\), where \(\delta_O\) is a point mass at a fixed realization \(O = (X, Y)\). Then \(\E_{P_\epsilon}(XX^\top) = (1-\epsilon)\E(XX^\top) + \epsilon XX^\top\), where the last term is evaluated at the fixed point, and similarly \(\E_{P_\epsilon}(XY) = (1-\epsilon)\E(XY) + \epsilon XY\).
Step 2. Differentiate with respect to \(\epsilon\) at \(\epsilon = 0\). Let \(M(\epsilon) = (1-\epsilon)\E(XX^\top) + \epsilon XX^\top\) and \(v(\epsilon) = (1-\epsilon)\E(XY) + \epsilon XY\). By differentiating \(M(\epsilon)M(\epsilon)^{-1} = I\): \[\frac{d}{d\epsilon}M(\epsilon)^{-1} = -M(\epsilon)^{-1}\dot M(\epsilon)\,M(\epsilon)^{-1}.\] Evaluating at \(\epsilon = 0\) and applying the product rule to \(M^{-1}v\): \[\varphi(O) = [\E(XX^\top)]^{-1}X(Y - X^\top\beta).\]
Verification. \(\E\{\varphi(O)\} = [\E(XX^\top)]^{-1}\E[X(Y - X^\top\beta)] = [\E(XX^\top)]^{-1}\cdot 0 = 0\).
Connection to estimating equations. The OLS estimating function is \(U(O;\beta) = X(Y - X^\top\beta)\), so \(A = -\E(XX^\top)\) and Equation 10.2 gives \(\varphi(O) = [\E(XX^\top)]^{-1}X(Y - X^\top\beta)\) — the same answer. The pathwise route is more fundamental; the estimating-equation route is more computationally direct.
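The expansion can also be verified by simulation. In the sketch below (an assumed design with independent standard normal regressors, chosen so that \(\E(XX^\top) = I\) is known exactly), the estimation error \(\hat\beta - \beta_0\) agrees with the average of the influence values up to higher-order terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200_000, 3
X = rng.normal(size=(n, p))            # E(XX^T) = I_p by construction
beta0 = np.array([1.0, -0.5, 2.0])
Y = X @ beta0 + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
M_inv = np.eye(p)                      # [E(XX^T)]^{-1}, known here
phi = ((Y - X @ beta0)[:, None] * X) @ M_inv   # rows are phi(O_i)

print(beta_hat - beta0)                # estimation error
print(phi.mean(axis=0))                # average influence value, first-order match
```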
Influence Functions for Simple Causal Estimators
We now illustrate the idea in causal settings when nuisance functions are known.
Known outcome regression. Suppose \(\mu_1(\cdot)\) and \(\mu_0(\cdot)\) are known. The natural plug-in estimator is \(\hat\tau = \mathbb{P}_n\{\mu_1(X) - \mu_0(X)\}\). Its influence function is: \[\varphi(O) = \mu_1(X) - \mu_0(X) - \tau.\]
Known propensity score. Suppose \(\pi(X)\) is known. The IPW estimator \(\hat\tau = \mathbb{P}_n\{TY/\pi(X) - (1-T)Y/(1-\pi(X))\}\) has influence function: \[\varphi(O) = \frac{TY}{\pi(X)} - \frac{(1-T)Y}{1-\pi(X)} - \tau.\]
When nuisance functions are known, the influence function is simply the centered estimating function. This observation becomes much more consequential in the next section, where nuisance functions must be estimated. The key challenge is to understand how estimation error in \(\hat\mu_t\) or \(\hat\pi\) perturbs the first-order expansion.
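These two influence functions make the earlier point about differing statistical properties concrete: with both nuisances known, the plug-in and IPW estimators target the same \(\tau\) but have different asymptotic variances. A sketch under an assumed simulated design with a heterogeneous treatment effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
X = rng.normal(size=n)
pi = 1 / (1 + np.exp(-X))                         # known propensity score
T = rng.binomial(1, pi)
Y = (2.0 + 0.5 * X) * T + X + rng.normal(size=n)  # CATE = 2 + 0.5 X, tau = 2

mu1 = 2.0 + 1.5 * X                               # true E(Y | T=1, X), known
mu0 = X                                           # true E(Y | T=0, X), known

phi_reg = mu1 - mu0 - 2.0                         # IF of the plug-in estimator
phi_ipw = T*Y/pi - (1-T)*Y/(1-pi) - 2.0           # IF of the IPW estimator
print(phi_reg.var(), phi_ipw.var())               # IPW is far less precise here
```

Both empirical influence values average to roughly zero, but their variances, and hence the precision of the two estimators, differ by an order of magnitude in this design.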
Several closely related objects share the symbol \(\varphi\). Keeping them separate prevents confusion.
(i) Estimating function. A function \(U(O;\theta)\) that defines an estimator through \(\mathbb{P}_n\{U(O;\hat\theta)\}=0\). This is an input to the construction of \(\hat\theta\).
(ii) Influence function of an estimator. A function \(\varphi(O)\) such that \(\sqrt{n}(\hat\psi-\psi_0) = n^{-1/2}\sum_i\varphi(O_i)+o_p(1)\). This is a property of the estimator. With known nuisance, \(\varphi\) is the centered estimating function; when nuisance is estimated, the two generally differ.
(iii) Influence function (canonical gradient) of a functional. A function \(\varphi^*(O;P)\) that is the pathwise derivative of the parameter functional \(\Psi(P)\) at \(P\). This is a property of the parameter and the model, independent of any estimator.
(iv) Efficient influence function (EIF). Among all valid influence functions of regular estimators in the model, the unique one with smallest asymptotic variance.
(v) Estimated influence values. The numerical quantities \(\hat\varphi_i = \hat\varphi(O_i)\), obtained by plugging in estimated nuisance functions. These are used to compute the sandwich variance \(\hat V = (n(n-1))^{-1}\sum_i(\hat\varphi_i-\bar\varphi)^2\).
A regular estimator \(\hat\psi\) is asymptotically linear with influence function \(\varphi^*\) (the canonical gradient) when nuisance estimators converge fast enough; the estimated influence values \(\hat\varphi_i\) then serve as data-driven proxies for \(\varphi^*(O_i)\) for variance estimation.
Z-Estimation with Nuisance Parameters
The examples in Section 10.6 treated nuisance functions as known. In practice, they must be estimated from the same data. When this happens, the randomness in the estimated nuisance introduces an additional term that the naive plug-in formula ignores. Z-estimation provides the framework that accounts jointly for estimation of both the target and nuisance parameters.
The Stacked Estimating Equation Framework
Suppose the target parameter \(\psi \in \mathbb{R}^p\) depends on an unknown nuisance parameter \(\alpha \in \mathbb{R}^q\). Both are estimated by solving a stacked system: \[\mathbb{P}_n\{U_1(O;\psi,\alpha)\} = 0, \qquad \mathbb{P}_n\{U_2(O;\alpha)\} = 0. \tag{10.3}\]
Here \(U_1\) defines the target estimator \(\hat\psi\) given \(\alpha\), while \(U_2\) defines the nuisance estimator \(\hat\alpha\). The equations are solved simultaneously (or sequentially: the second equation in Equation 10.3 first for \(\hat\alpha\), then the first for \(\hat\psi\)).
One might think it is enough to plug the estimated \(\hat\alpha\) into the influence function derived under known \(\alpha\). This is incorrect in general: the randomness in \(\hat\alpha\) contributes an additional term to the first-order expansion of \(\hat\psi\) that is not captured by the plug-in influence function. The stacked framework keeps track of this extra term automatically.
The Z-Estimation Theorem
Write \(\theta = (\psi^\top, \alpha^\top)^\top\) and let \(U(O;\theta) = (U_1(O;\psi,\alpha)^\top, U_2(O;\alpha)^\top)^\top\) be the stacked estimating function.
Suppose: (i) \(\E\{U(O;\theta_0)\}=0\); (ii) \(\hat\theta \overset{p}{\to} \theta_0\); (iii) the Jacobian \(\mathcal{A} = \E\{\partial_{\theta^\top} U(O;\theta_0)\}\) is invertible; (iv) \(U\) is smooth enough in \(\theta\) for a uniform LLN and CLT. Then: \[\sqrt{n}(\hat\theta - \theta_0) = -\mathcal{A}^{-1}\, \frac{1}{\sqrt{n}}\sum_{i=1}^n U(O_i;\theta_0) + o_p(1).\] Partitioning \(\mathcal{A}\) conformably with \((\psi, \alpha)\) as \(\mathcal{A} = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix}\) (the lower-left block vanishes because \(U_2\) does not depend on \(\psi\)), the block-matrix inverse gives \(\mathcal{A}^{-1} = \begin{pmatrix} A_{11}^{-1} & -A_{11}^{-1}A_{12}A_{22}^{-1} \\ 0 & A_{22}^{-1} \end{pmatrix}\), and the influence function of \(\hat\psi\) is: \[\varphi(O) = -A_{11}^{-1}\,U_1(O;\psi_0,\alpha_0) + A_{11}^{-1}A_{12}A_{22}^{-1}\,U_2(O;\alpha_0). \tag{10.4}\]
The argument mirrors Equation 10.2 applied to the full stacked system. Taylor-expand \(\mathbb{P}_n\{U(O;\hat\theta)\}=0\) around \(\theta_0\): \[0 \approx \mathbb{P}_n\{U(O;\theta_0)\} + \mathbb{P}_n\!\left\{\frac{\partial}{\partial\theta^\top} U(O;\theta_0)\right\}(\hat\theta-\theta_0).\] By the LLN, the matrix of derivatives converges to \(\mathcal{A}\). Rearranging and multiplying by \(\sqrt{n}\) gives the stacked expansion. Extracting the first block via the block-matrix inverse formula gives Equation 10.4. \(\square\)
Hypothesis (ii) is a separate prerequisite, not a consequence of (i), (iii), and (iv). Standard arguments require identifiability of \(\theta_0\) via \(\E\{U(O;\theta)\} = 0\) having a unique zero, together with uniform convergence of the sample moment. See van der Vaart (1998, Theorem 5.9) for a standard set of sufficient conditions.
Equation 10.4 decomposes the influence function into two parts. The first term \(-A_{11}^{-1}U_1(O;\psi_0,\alpha_0)\) is exactly what we would obtain if \(\alpha_0\) were known. The second term \(A_{11}^{-1}A_{12}A_{22}^{-1}U_2(O;\alpha_0)\) is the nuisance correction that accounts for the additional variability introduced by estimating \(\alpha\). If \(A_{12}=0\), the target estimating function \(U_1\) does not depend on \(\alpha\) at \(\alpha_0\), and the correction vanishes. This is the key condition exploited by locally efficient and doubly robust estimators in Chapter 11.
Equation 10.4 presupposes the system is just identified: \(U_1\) has the same dimension as \(\psi\) and \(U_2\) has the same dimension as \(\alpha\), so that \(A_{11}\) and \(A_{22}\) are square and invertible. When more moment conditions are available than parameters — the overidentified case from GMM — one minimizes \(\mathbb{P}_n U^\top W\mathbb{P}_n U\) for some weight matrix \(W\), and the influence function takes the standard GMM sandwich form (Hansen 1982). The block formula is the just-identified specialization.
Working Example: IPW with Estimated Propensity Score
We apply the Z-estimation theorem to the IPW estimator of the ATE \(\tau\) when \(\pi(X)\) is estimated by logistic regression. In Section 10.7.2 the generic target was \(\psi\); here \(\psi = \tau\).
Setup. Assume \(\pi(X;\alpha) = \mathrm{expit}(X^\top\alpha)\). The IPW estimating equation for \(\tau\) is \(U_1(O;\tau,\alpha) = TY/\pi(X;\alpha) - (1-T)Y/(1-\pi(X;\alpha)) - \tau\), and the logistic regression score is \(U_2(O;\alpha) = X\{T - \pi(X;\alpha)\}\).
Computing the Jacobian blocks. Direct differentiation gives \(A_{11} = \E\{\partial_\tau U_1\} = -1\) and \(A_{22} = \E\{\partial_{\alpha^\top}U_2\} = -\Sigma_\pi\), where \(\Sigma_\pi = \E\{\pi(X;\alpha_0)(1-\pi(X;\alpha_0))XX^\top\}\). Using \(\partial\pi/\partial\alpha^\top = \pi(1-\pi)X^\top\) and simplifying: \[A_{12} = -\E\!\left[Y X^\top\!\left\{\frac{T}{\pi(X;\alpha_0)} + \frac{1-T}{1-\pi(X;\alpha_0)} - 1\right\}\right].\]
Corrected influence function. Substituting into Equation 10.4 with \(A_{11} = -1\) and \(A_{22}^{-1} = -\Sigma_\pi^{-1}\) (the two negatives cancel): \[\varphi(O) = U_1(O;\tau_0,\alpha_0) + A_{12}\,\Sigma_\pi^{-1}\,U_2(O;\alpha_0). \tag{10.5}\] The first term is the naive IPW influence function from Section 10.6; the second is the nuisance correction removing the first-order contribution of \(\hat\alpha - \alpha_0\).
We show that estimating \(\alpha\) by maximum likelihood (ML) can only reduce the asymptotic variance of \(\hat\tau_{\mathrm{IPW}}\) relative to using the true \(\alpha_0\).
Write \(\sigma_1^2 = \E[U_1(O;\tau_0,\alpha_0)^2]\). Expanding the variance of Equation 10.5: \[\E[\varphi(O)^2] = \sigma_1^2 + 2A_{12}\Sigma_\pi^{-1}\E[U_1 U_2] + A_{12}\Sigma_\pi^{-1}\E[U_2 U_2^\top]\Sigma_\pi^{-1}A_{12}^\top. \tag{10.6}\]
Bartlett identity (ML score): \(\E[U_2 U_2^\top] = \Sigma_\pi\) (information equality for the correctly specified logistic model).
Cross-identity: under a correctly specified PS model, \(\E_\alpha[U_1(O;\tau_0,\alpha)] = 0\) for every \(\alpha\), where \(\E_\alpha\) denotes expectation when the propensity parameter equals \(\alpha\). Differentiating this identity with respect to \(\alpha\), and using the fact that \(U_2\) is the score of the logistic likelihood, gives \(A_{12} = -\E[U_1 U_2^\top]\).
Substituting into Equation 10.6: \[\E[\varphi(O)^2] = \sigma_1^2 - A_{12}\Sigma_\pi^{-1}A_{12}^\top = \sigma_1^2 - \E[U_1 U_2^\top]\Sigma_\pi^{-1}\E[U_2 U_1] \leq \sigma_1^2,\] since \(\Sigma_\pi\) is positive definite. Equality holds only when \(\E[U_1 U_2^\top] = 0\).
Scope. This variance-reduction conclusion (sometimes called the estimated propensity score paradox, Robins (1986)) requires: (i) correctly specified parametric propensity model; (ii) ML estimation so the Bartlett identity holds; (iii) strong overlap and moment conditions. When the parametric model is misspecified, when \(\hat\alpha\) is computed by another method, or when \(\pi(X)\) is estimated nonparametrically, the identity \(A_{12} = -\E[U_1 U_2^\top]\) need not hold and the conclusion fails.
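The inequality can be seen in a small Monte Carlo study. Everything below (the design, sample size, and number of replications) is an illustrative assumption; the point is only that the IPW estimator built on the ML-fitted propensity score is less variable across replications than the one using the true \(\pi(X)\).

```python
import numpy as np

def fit_logit(Xd, T, iters=25):
    """ML logistic regression via Newton-Raphson; Xd includes an intercept."""
    a = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xd @ a))
        a += np.linalg.solve(Xd.T @ ((p * (1 - p))[:, None] * Xd),
                             Xd.T @ (T - p))
    return a

rng = np.random.default_rng(4)
est_true, est_ml = [], []
for _ in range(400):
    n = 2_000
    x = rng.normal(size=n)
    Xd = np.column_stack([np.ones(n), x])
    pi = 1 / (1 + np.exp(-0.5 * x))           # true PS, a logistic model
    T = rng.binomial(1, pi)
    Y = 2.0 + x + T + rng.normal(size=n)      # true ATE = 1
    ipw = lambda p: np.mean(T * Y / p - (1 - T) * Y / (1 - p))
    est_true.append(ipw(pi))                  # oracle: true propensity score
    pi_hat = 1 / (1 + np.exp(-Xd @ fit_logit(Xd, T)))
    est_ml.append(ipw(pi_hat))                # feasible: ML-estimated score

print(np.std(est_true), np.std(est_ml))       # estimated-PS version is smaller
```

Both versions are approximately unbiased for \(\tau = 1\); the reduction in spread comes entirely from the nuisance correction term analyzed above.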
The primary difficulty in practice is often not identifying the target parameter, but correctly accounting for the effect of nuisance estimation on the asymptotic distribution of the final estimator. Treating \(\hat\alpha\) as if it were the true \(\alpha_0\) — computing standard errors from the naive IPW influence function rather than from Equation 10.4 — yields correct point estimates under a correctly specified PS model but incorrect standard errors. The Z-estimation framework resolves this variance-accounting issue whenever the nuisance model is correctly specified and the nuisance estimator converges at rate \(n^{-1/2}\). It does not, by itself, protect against model misspecification: if the parametric PS model is wrong, \(\hat\tau\) is inconsistent regardless of which variance formula is used.
The Z-estimation theorem as stated requires \(\hat\alpha\) to converge at rate \(n^{-1/2}\), which holds for finite-dimensional parametric models but fails for nonparametric or machine-learning estimators. When flexible methods are used for \(\pi(x)\) or \(\mu_t(x)\), the correction term in Equation 10.4 no longer has the simple form derived here, and a more careful analysis based on sample splitting or cross-fitting is needed. This is the starting point for the doubly robust and debiased machine-learning estimators in Chapter 11.
Variance Estimation and Confidence Intervals
Once an estimator admits the asymptotic linear representation, its asymptotic variance is \(n^{-1}\mathrm{Var}\{\varphi(O)\}\). A natural estimator is obtained by replacing \(\varphi(O_i)\) with estimated influence values \(\hat\varphi_i\): \[\hat V = \frac{1}{n(n-1)}\sum_{i=1}^n (\hat\varphi_i - \bar\varphi)^2, \qquad \bar\varphi = \frac{1}{n}\sum_{i=1}^n \hat\varphi_i. \tag{10.7}\]
The asymptotically equivalent uncentered form \(\hat V = n^{-2}\sum_i \hat\varphi_i^2\) is also commonly used; the two differ by \(O_p(n^{-1})\). We adopt the centered form throughout this book. The approximate Wald confidence interval at level \(1-a\) is \(\hat\psi \pm z_{1-a/2}\sqrt{\hat V}\).
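Equation 10.7 and the Wald interval are straightforward to code. A small helper (a sketch; the function name and the simulated usage example are assumptions) takes estimated influence values and returns the interval:

```python
import numpy as np
from statistics import NormalDist

def wald_ci(psi_hat, phi_hat, level=0.95):
    """Centered variance estimator (10.7) and approximate Wald interval."""
    phi_hat = np.asarray(phi_hat)
    n = phi_hat.shape[0]
    V = np.sum((phi_hat - phi_hat.mean())**2) / (n * (n - 1))
    z = NormalDist().inv_cdf(0.5 + level / 2)   # standard normal quantile
    half = z * np.sqrt(V)
    return psi_hat - half, psi_hat + half

# Usage with the sample mean, whose influence values are phi_i = Y_i - psi_hat.
rng = np.random.default_rng(5)
Y = rng.normal(loc=1.0, size=10_000)
lo, hi = wald_ci(Y.mean(), Y - Y.mean())
print(lo, hi)
```

For the sample mean this reproduces the usual \(\bar Y \pm z_{1-a/2}\, s/\sqrt{n}\) interval up to the \(n\) versus \(n-1\) convention.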
If \(\sqrt{n}(\hat\psi - \psi) = n^{-1/2}\sum_{i=1}^n \varphi(O_i) + o_p(1)\) with \(\E\{\varphi(O)\} = 0\) and \(\E\{\varphi(O)^2\} < \infty\), then: \[\sqrt{n}(\hat\psi - \psi) \overset{d}{\longrightarrow} N\!\left(0,\, \E[\varphi(O)^2]\right), \qquad \mathrm{Var}(\hat\psi) = \frac{1}{n}\,\E[\varphi(O)^2] + o(n^{-1}). \tag{10.8}\]
The proof follows immediately from the CLT applied to the i.i.d. sum. The influence function thus plays a dual role: it characterizes both the asymptotic distribution and the asymptotic variance.
Equation 10.4 also provides the basis for the sandwich variance estimator. Replacing \(\theta_0\) by \(\hat\theta\), the asymptotic variance of \(\hat\tau\) is consistently estimated by Equation 10.7 with \(\hat\varphi_i\) evaluated at \((\hat\tau, \hat\alpha)\). This estimator automatically accounts for nuisance estimation uncertainty and is valid without any re-sampling. Standard errors from a logistic regression routine that ignores the link between \(\hat\alpha\) and \(\hat\tau\) will in general be incorrect.
In practice, \(\varphi(O_i)\) depends on unknown parameters and nuisance functions, so it is replaced by an estimated influence value \(\hat\varphi_i\). This substitution is not innocuous: the plug-in variance formula is valid only when the estimator is asymptotically linear with influence function \(\varphi\) and the error from replacing unknown quantities by estimates is asymptotically negligible. For the plug-in estimators of Section 10.7, these conditions require nuisance estimators to converge fast enough that the plug-in error is \(o_p(n^{-1/2})\). We return to this point in Chapter 11.
Toward Efficiency: Semiparametric Models and the EIF
Section 10.4 showed that the influence function determines both the asymptotic distribution and variance. Different regular estimators of the same parameter may therefore have different large-sample variances. This raises a natural question: among all regular estimators in a given model, what is the smallest achievable asymptotic variance? This section introduces the basic objects — semiparametric models, regular estimators, and the efficient influence function — as preparation for Chapter 11, where they are used to construct doubly robust and efficient estimators.
Semiparametric Models
A semiparametric model is a statistical model \(\mathcal{P}\) for the distribution \(P\) of the observed data in which: (i) the parameter of interest \(\psi = \Psi(P) \in \mathbb{R}^p\) is finite-dimensional; and (ii) the remaining aspects of \(P\) — the nuisance parameter — are infinite-dimensional.
The canonical example in this course is the ATE. Under the identification assumptions of consistency, conditional exchangeability, and positivity, it equals the observed-data functional \(\tau(P) = \E_P\{\mu_1(X) - \mu_0(X)\}\). We study estimation in the nonparametric model \(\mathcal{P} = \{P : 0 < \pi(X) < 1\ \text{a.s.}\}\), where the nuisance consists of the entire joint distribution of \((X, T, Y)\) subject only to positivity. No parametric form is assumed for \(\mu_t(x)\) or \(\pi(x)\).
The positivity condition \(0 < \pi(X) < 1\) a.s. suffices for identification of \(\tau\), but not for stable regular root-\(n\) inference. Regular asymptotic theory typically requires strong overlap, \(\epsilon \leq \pi(X) \leq 1-\epsilon\) a.s. for some \(\epsilon > 0\), together with appropriate moment conditions on \(Y\). Under weak overlap the inverse-probability weights are unbounded in probability, and IPW-type estimators may have infinite asymptotic variance or fail to converge to a normal limit at the \(\sqrt{n}\) rate.
In a fully parametric model, the Cramér–Rao lower bound gives the minimum variance of any unbiased estimator of \(\psi\). When the nuisance is infinite-dimensional, the classical bound no longer applies: the relevant lower bound must account for all infinitely many directions in which the likelihood can vary. The semiparametric efficiency bound is the analogue of the Cramér–Rao bound for this setting; see Bickel et al. (1993) and Tsiatis (2006) for comprehensive treatments.
Regular Estimators
The efficiency bound is meaningful only within the class of regular estimators, which informally are estimators whose asymptotic distribution is stable under small perturbations of the data-generating distribution.
An estimator \(\hat\psi\) of \(\psi_0 = \Psi(P_0)\) is called regular at \(P_0\) if, along every smooth parametric submodel \(\{P_t : t \in \mathbb{R}\}\) with \(P_{t=0} = P_0\) and \(\psi_t = \Psi(P_t)\), the limiting distribution of \(\sqrt{n}(\hat\psi - \psi_{1/\sqrt{n}})\) under \(P_{1/\sqrt{n}}\) does not depend on the submodel or its direction.
Regularity rules out pathological estimators that exploit specific features of the data-generating mechanism in a non-uniform way (superefficient estimators). All estimators in this course (regression, IPW, and augmented variants) are regular. See van der Vaart (1998, Chs. 8, 25) and Tsiatis (2006, Ch. 4).
The Semiparametric Efficiency Bound and the EIF
In a semiparametric model \(\mathcal{P}\), the asymptotic distribution of any regular estimator \(\hat\psi\) of \(\psi_0 \in \mathbb{R}^p\) satisfies: \[\sqrt{n}(\hat\psi - \psi_0) \overset{d}{\longrightarrow} N(0, V^*) * M\] for some distribution \(M\), where \(*\) denotes convolution and \(V^* = \E_P[\varphi^*(O)\,\varphi^*(O)^\top]\) is the semiparametric efficiency bound. The function \(\varphi^*(O)\) with \(\E[\varphi^*(O)] = 0\) and \(\E[\varphi^*(O)\varphi^*(O)^\top] = V^*\) is called the efficient influence function (EIF). An estimator achieves \(V^*\) if and only if it is asymptotically linear with influence function \(\varphi^*(O)\).
For a scalar parameter (\(p = 1\)), \(V^*\) is a scalar and any regular estimator has asymptotic variance at least \(V^*\). For vector-valued \(\psi\), \(V^*\) is a \(p \times p\) matrix and the comparison uses the Loewner partial order: any regular estimator has asymptotic covariance \(V \succeq V^*\) (i.e., \(V - V^*\) is positive semidefinite). Equivalently, every linear combination \(c^\top\hat\psi\) has asymptotic variance at least \(c^\top V^* c\) for every \(c \in \mathbb{R}^p\).
A regular estimator \(\hat\psi\) is semiparametrically efficient at \(P_0 \in \mathcal{P}\) if \(\sqrt{n}(\hat\psi - \psi_0) \overset{d}{\to} N(0, V^*)\), i.e., it achieves \(V^*\) with no additional convolution component. Equivalently, \(\hat\psi\) is asymptotically linear with influence function \(\varphi^*(O)\).
It is tempting to describe the EIF as the “semiparametric score” of the model. The analogy is useful: \(\varphi^*(O)\) plays the role in the semiparametric efficiency bound that the score plays in the Cramér–Rao bound. Strictly speaking, however, the score and the EIF live in different spaces. The score \(\partial_\theta \log p(O;\theta_0)\) is an element of the model’s tangent space (mean-zero functions reachable as derivatives of log-likelihoods). The EIF is the unique canonical gradient: the element of the tangent space that represents the derivative of the target functional \(\Psi\) via \(\langle\varphi^*, S_\theta\rangle = \partial_\theta\Psi(P_\theta)|_{\theta=0}\). In a fully parametric model these coincide (scaled by the inverse Fisher information); in a semiparametric model they generally do not.
The Efficient Influence Function for the ATE
For the ATE \(\tau = \E\{Y(1) - Y(0)\}\) in the nonparametric observed-data model under consistency, conditional exchangeability, and positivity, the efficient influence function is: \[\varphi^*(O) = \frac{T\{Y - \mu_1(X)\}}{\pi(X)} - \frac{(1-T)\{Y - \mu_0(X)\}}{1-\pi(X)} + \mu_1(X) - \mu_0(X) - \tau. \tag{10.9}\]
This expression has a natural decomposition: the first two terms are an IPW-style residual correction, and the last two are the regression estimator centered at \(\tau\). The EIF depends on both nuisance functions \(\pi(X)\) and \(\mu_t(X)\); an estimator that plugs in consistent estimates of both achieves the efficiency bound \(V^* = \E[\varphi^*(O)^2]\).
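As a preview of Chapter 11, the sketch below averages the uncentered EIF with estimated nuisances, which yields the AIPW estimator, and uses the centered influence values for a standard error. The simulated design and the simple Newton-Raphson logistic fit are illustrative assumptions; both nuisance models are correctly specified here.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(size=n)
pi = 1 / (1 + np.exp(-x))
T = rng.binomial(1, pi)
Y = 1.0 + x + 2.0 * T + rng.normal(size=n)       # true ATE tau = 2
Xd = np.column_stack([np.ones(n), x])

# Outcome regressions: least squares within each treatment arm.
b1 = np.linalg.lstsq(Xd[T == 1], Y[T == 1], rcond=None)[0]
b0 = np.linalg.lstsq(Xd[T == 0], Y[T == 0], rcond=None)[0]
mu1_hat, mu0_hat = Xd @ b1, Xd @ b0

# Propensity score: ML logistic regression via Newton-Raphson.
a = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-Xd @ a))
    a += np.linalg.solve(Xd.T @ ((p * (1 - p))[:, None] * Xd), Xd.T @ (T - p))
pi_hat = 1 / (1 + np.exp(-Xd @ a))

# AIPW: solve P_n{phi*(O)} = 0 for tau, i.e. average the uncentered EIF (10.9).
aug = (T * (Y - mu1_hat) / pi_hat
       - (1 - T) * (Y - mu0_hat) / (1 - pi_hat)
       + mu1_hat - mu0_hat)
tau_hat = aug.mean()
phi_star = aug - tau_hat                         # estimated influence values
se = np.sqrt(np.sum((phi_star - phi_star.mean())**2) / (n * (n - 1)))
print(tau_hat, se)
```

The standard error comes from the same centered variance formula as in the variance-estimation section, applied to the estimated EIF values.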
Two important properties of Equation 10.9 will be central in Chapter 11:
(i) Double robustness. The estimator obtained by replacing \(\pi(X)\) and \(\mu_t(X)\) by estimates and averaging \(\varphi^*(O_i)\) is consistent for \(\tau\) if either \(\hat\pi\) or \(\hat\mu_t\) is consistent — not necessarily both. This is the doubly robust property.
(ii) Semiparametric efficiency. When both nuisance estimators are consistent and converge at suitable rates, the resulting estimator is asymptotically linear with influence function \(\varphi^*(O)\) and therefore achieves the efficiency bound \(V^*\).
The augmented IPW (AIPW) estimator, sometimes called the one-step or debiased estimator, is the principal tool for exploiting both properties simultaneously; it will be derived and analyzed in Chapter 11. The EIF in Equation 10.9 becomes central there: the same object generates both doubly robust estimating equations and the semiparametric efficiency bound.
Chapter Summary
| Object | Role |
|---|---|
| Estimating function \(U(O;\theta)\) | Defines the estimator via \(\mathbb{P}_n\{U(O;\hat\theta)\}=0\) |
| Influence function \(\varphi(O)\) | First-order term in the expansion of \(\hat\psi - \psi\); used for the CLT and variance estimation |
| Asymptotic variance \(\E[\varphi(O)^2]/n\) | Governs precision; the target for efficiency comparisons |
| Efficient influence function \(\varphi^*(O)\) | Achieves the semiparametric efficiency bound; basis for doubly robust estimators (Chapter 11) |
| Sandwich variance estimator \(\hat V\) | Plug-in estimate of the asymptotic variance; valid under asymptotic linearity |
- Identification is not estimation. Identification provides a population formula, but not automatically a satisfactory estimator.
- Estimating equations. Many estimators are defined as solutions to \(\mathbb{P}_n\{U(O;\theta)\}=0\), with the population moment condition identifying \(\theta_0\).
- Asymptotic linearity. Estimating equations often yield a first-order expansion (Equation 10.1), which immediately implies asymptotic normality via the CLT.
- Influence functions. The function \(\varphi(O)\) describes the first-order sensitivity of the estimator to a single observation. It is the key object for both asymptotic theory and variance estimation.
- Nuisance estimation. In causal inference, nuisance functions must be estimated, and this estimation error propagates into the target estimator via the Z-estimation framework (Equation 10.4). Controlling this propagation is the central statistical challenge.
- Variance from the influence function. The asymptotic variance is \(n^{-1}\E[\varphi(O)^2]\), consistently estimated by Equation 10.7.
- Efficiency. Among regular estimators, the one with the smallest asymptotic variance is efficient. The efficient influence function characterizes this lower bound and guides estimator construction in Chapter 11.
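The variance recipe can be sketched for the simplest case, the sample mean, whose influence function is \(\varphi(O) = Y - \psi\). In this minimal example the data distribution and its parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
Y = rng.normal(loc=1.0, scale=2.0, size=n)  # illustrative sample

psi_hat = Y.mean()                # asymptotically linear estimator
phi_hat = Y - psi_hat             # estimated influence values
V_hat = np.mean(phi_hat**2) / n   # plug-in asymptotic variance
se = np.sqrt(V_hat)

# Approximate 95% Wald confidence interval
ci = (psi_hat - 1.96 * se, psi_hat + 1.96 * se)
print(f"psi_hat = {psi_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Here the plug-in variance reduces to the (biased-denominator) sample variance divided by \(n\), i.e. the usual squared standard error of the mean.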
Problems
1. Estimating equations and moment conditions.
- Let \(O = (X, Y)\) and define \(\theta = \mathrm{Cov}(X,Y)/\mathrm{Var}(X)\) (the coefficient in the population simple regression of \(Y\) on \(X\)). Write down an estimating equation for \(\theta\) and verify the population moment condition holds at \(\theta_0\).
- Let \(\theta = F^{-1}(0.5)\) be the population median. Propose an estimating function \(U(O;\theta)\) and verify the population moment condition. (Hint: consider \(U(O;\theta) = \mathbf{1}(Y \leq \theta) - 0.5\).)
- Show that the OLS estimator and the IPW estimator of the ATE are both special cases of the general M-estimator framework.
2. Asymptotic linearity and the CLT.
- Let \(\hat\psi = \bar Y\) be the sample mean of i.i.d. \(Y_1,\dots,Y_n\) with \(\E Y = \psi\) and \(\mathrm{Var}(Y) = \sigma^2 < \infty\). Write down the influence function, state its asymptotic distribution, and give a consistent estimator of \(\mathrm{Var}(\hat\psi)\).
- Suppose \(\hat\psi_1\) and \(\hat\psi_2\) are two asymptotically linear estimators of the same parameter with influence functions \(\varphi_1\) and \(\varphi_2\). Show that \(\hat\psi_\lambda = \lambda\hat\psi_1 + (1-\lambda)\hat\psi_2\) is also asymptotically linear for any fixed \(\lambda \in \mathbb{R}\), and find its influence function.
- Use part (b) to derive the value of \(\lambda \in \mathbb{R}\) that minimizes the asymptotic variance of \(\hat\psi_\lambda\). Express the answer in terms of \(\sigma_1^2 = \E[\varphi_1^2]\), \(\sigma_2^2 = \E[\varphi_2^2]\), and \(\rho = \E[\varphi_1\varphi_2]\). Under what condition on \((\sigma_1^2, \sigma_2^2, \rho)\) does the optimal \(\lambda\) equal \(1/2\)?
3. Influence functions for causal estimators. Consider the ATE \(\tau = \E\{Y(1) - Y(0)\}\) identified under consistency, conditional exchangeability, and positivity.
- Assume \(\pi(X)\) and \(\mu_t(X)\) are both known. Compute the influence functions of: (i) the regression estimator \(\hat\tau_{\mathrm{reg}} = \mathbb{P}_n\{\mu_1(X) - \mu_0(X)\}\); (ii) the IPW estimator \(\hat\tau_{\mathrm{IPW}} = \mathbb{P}_n\{TY/\pi(X) - (1-T)Y/(1-\pi(X))\}\).
- Under what conditions on the data-generating process will \(\hat\tau_{\mathrm{reg}}\) have a smaller asymptotic variance than \(\hat\tau_{\mathrm{IPW}}\)?
- Verify that the EIF in Equation 10.9 has mean zero under \(P\) by computing \(\E[\varphi^*(O)]\).
4. Variance estimation.
- Let \(\hat\varphi_i = \hat\mu_1(X_i) - \hat\mu_0(X_i) - \hat\tau_{\mathrm{reg}}\) be the estimated influence values for the regression estimator. Write down the plug-in variance estimator \(\hat V\) and simplify. Under what conditions is \(\hat V\) consistent for \(\mathrm{Var}(\hat\tau_{\mathrm{reg}})\)?
- A researcher reports \(\hat\tau = 2.4\) and \(\hat V = 0.09\) based on \(n = 400\) observations. Compute a 95% Wald confidence interval. Is the treatment effect statistically distinguishable from zero at the 5% level?
- Explain why simply plugging in \(\hat\mu_t\) and \(\hat\pi\) into the influence function formula may yield an inconsistent variance estimator when these nuisance estimators converge at a slower rate than \(n^{-1/2}\).
5. Efficiency comparisons.
- Suppose \(\varphi_{\mathrm{reg}}\) and \(\varphi_{\mathrm{IPW}}\) denote the influence functions of the regression and IPW estimators. Without calculation, explain why neither estimator is always more efficient than the other.
- The semiparametric efficiency bound for the ATE is \(\E[\varphi^*(O)^2]\), where \(\varphi^*\) is given in Equation 10.9. Show that \(\E[\varphi^*(O)^2] \leq \E[\varphi_{\mathrm{IPW}}(O)^2]\) by expanding the squared EIF. (Hint: use the law of iterated expectations to show that the cross-terms cancel.)
- Give one practical reason why an efficient estimator based on the EIF may not always be preferred over a simpler, less efficient estimator.