Appendix C — Pathwise Differentiability and Efficient Influence Functions
Section 10.9 introduced the efficient influence function of the average treatment effect functional and stated, without proof, that this object is the canonical gradient of \(\Psi\) relative to the nonparametric tangent space. This appendix supplies the geometric machinery behind that statement. The treatment follows Bickel et al. (1993) and Tsiatis (2006), with notation aligned to Chapter 10.
The reader should think of this appendix as the formal counterpart of the sections on asymptotic linearity and efficiency in Chapter 10: those sections showed how an influence function determines the asymptotic distribution of an estimator; here we develop the parallel notion of an influence function as a derivative of a functional, and characterize the unique influence function that achieves the semiparametric efficiency bound. No new notation is needed beyond what is standard in modern semiparametric theory.
The applied consequences — the AIPW estimator, double robustness, Neyman orthogonality, cross-fitting, and rate conditions — are developed in Chapters 11 and 12. This appendix supplies only the foundational geometry on which those chapters rest. Section C.1 collects the Hilbert-space background used throughout. Section C.7 closes the appendix by carrying out the EIF derivation explicitly for the ATE, recovering the AIPW influence function from the projection construction of the Canonical Gradient Theorem.
C.1 Hilbert-Space Background
The geometry of pathwise differentiability lives in the Hilbert space \(L_2(P)\) of square-integrable real-valued functions on \(\mathcal{O}\), equipped with the inner product \(\langle f, g \rangle_P = \E_P[f(O)\,g(O)]\) and norm \(\|f\|_P = \langle f,f\rangle_P^{1/2}\). Two functions are orthogonal, written \(f \perp g\), when \(\langle f, g \rangle_P = 0\). All scores and influence functions introduced below live in the closed subspace of mean-zero functions, \(L_2^0(P) = \{f \in L_2(P) : \E_P[f(O)] = 0\}\).
This section records the four Hilbert-space facts used repeatedly in the rest of the appendix. For a full treatment, see van der Vaart (1998, sec. 25.7) or the appendices of Tsiatis (2006).
Closed linear span. For a subset \(A \subset L_2^0(P)\), the closed linear span \(\overline{\mathrm{span}}(A)\) is the smallest closed subspace of \(L_2^0(P)\) containing \(A\); it consists of \(L_2(P)\)-limits of finite linear combinations of elements of \(A\). Tangent spaces are defined as closed linear spans because the set of scores of regular parametric submodels need not itself be closed.
Projection theorem. If \(V \subset L_2^0(P)\) is a closed subspace, every \(f \in L_2^0(P)\) admits a unique decomposition \(f = f_V + f_{V^\perp}\) with \(f_V \in V\) and \(f_{V^\perp} \in V^\perp\), where \(V^\perp = \{g \in L_2^0(P) : \langle g, h\rangle_P = 0 \text{ for all } h \in V\}\). The map \(\Pi[\,\cdot\mid V\,]: f \mapsto f_V\) is the orthogonal projection onto \(V\), characterized by (i) \(f - \Pi[f\mid V] \perp V\) (orthogonality of the residual), or (ii) \(\Pi[f\mid V]\) is the unique element of \(V\) minimizing \(\|f - g\|_P\) over \(g \in V\). The decomposition \(L_2^0(P) = V \oplus V^\perp\) is the engine behind the canonical-gradient construction.
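The projection theorem is easiest to internalize in a finite-sample analogue, where projection onto a subspace is ordinary least squares. The following numpy sketch is purely illustrative (the basis functions spanning \(V\) and the element \(f\) are arbitrary choices): it approximates \(\langle f, g\rangle_P\) by an empirical mean and verifies both characterizations of \(\Pi[f \mid V]\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Finite-sample stand-in for L_2^0(P): a "function" is its vector of values
# f(O_i), and <f, g>_P is approximated by the empirical mean of f*g.
inner = lambda f, g: np.mean(f * g)

O = rng.normal(size=n)
# A closed subspace V spanned by two centered basis functions of O.
B = np.column_stack([O - O.mean(), O**2 - np.mean(O**2)])
# An arbitrary element f of L_2^0(P).
f = np.sin(O) - np.mean(np.sin(O))

# Orthogonal projection onto V is least squares against the basis.
coef, *_ = np.linalg.lstsq(B, f, rcond=None)
f_V = B @ coef
resid = f - f_V

# (i) The residual is orthogonal to V ...
print([inner(resid, B[:, j]) for j in range(2)])      # both ~ 0
# (ii) ... and f_V minimizes ||f - g||_P over g in V: perturbing inside V
# only increases the distance (Pythagoras).
g = f_V + 0.1 * B[:, 0]
print(inner(resid, resid) <= inner(f - g, f - g))     # True
```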
Riesz representation. Every continuous linear functional \(\Lambda: L_2^0(P) \to \mathbb{R}\) admits a unique representer \(r_\Lambda \in L_2^0(P)\) such that \(\Lambda(f) = \langle r_\Lambda, f\rangle_P\) for all \(f \in L_2^0(P)\). Pathwise differentiability of a functional \(\Psi\) is precisely the statement that the score-to-derivative map \(S \mapsto \partial_\varepsilon \Psi(P_\varepsilon)|_0\) extends to a continuous linear functional on the tangent space, and the influence function is its Riesz representer.
Codimension and uniqueness. A non-zero continuous linear functional \(\Lambda\) on a Hilbert space \(H\) has closed kernel \(\ker(\Lambda) := \{f : \Lambda(f) = 0\}\) of codimension one, with \(\ker(\Lambda)^\perp = \mathrm{span}\{r_\Lambda\}\) spanned by the Riesz representer. This codimension-one fact is what makes the efficient influence function of a real-valued functional unique, once it exists.
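Both the Riesz representer and the codimension-one kernel can be made tangible on a finite sample space, where all expectations are exact sums. The sketch below is an illustration only; the atom probabilities and the weight function \(w\) defining the functional are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
p = rng.dirichlet(np.ones(m))            # a discrete P on m atoms
E = lambda f: np.sum(p * f)
inner = lambda f, g: E(f * g)            # <f, g>_P
center = lambda f: f - E(f)              # map into L_2^0(P)

# A continuous linear functional on L_2^0(P): Lambda(f) = E[f * w].
w = rng.normal(size=m)
Lam = lambda f: E(f * w)

# Its Riesz representer in L_2^0(P) is the *centered* w: for mean-zero f,
# E[f w] = E[f (w - E w)].
r = center(w)
f = center(rng.normal(size=m))
print(np.isclose(Lam(f), inner(r, f)))   # True

# Codimension one: f splits as a multiple of r plus an element of ker(Lambda).
c = inner(r, f) / inner(r, r)
k = f - c * r
print(np.isclose(Lam(k), 0.0))           # True: k lies in the kernel
```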
C.2 Regular Parametric Submodels and Scores
Let \(\mathcal{P}\) be a statistical model for the distribution of the observed data \(O\) on \(\mathcal{O}\), with true distribution \(P \in \mathcal{P}\). All densities are taken with respect to a common dominating measure \(\mu\). The parameter of interest is a smooth real-valued functional \(\Psi: \mathcal{P} \to \mathbb{R}\), \(\psi = \Psi(P)\).
For vector-valued \(\Psi: \mathcal{P} \to \mathbb{R}^p\), each component is pathwise differentiable in the sense of the definition below and has its own gradient; stacking gives a vector-valued influence function \(\varphi^* = (\varphi_1^*, \ldots, \varphi_p^*)^\top\) with each \(\varphi_j^* \in L_2^0(P)\). The semiparametric efficiency bound is then the covariance matrix \(\E_P[\varphi^*(O)\,\varphi^*(O)^\top] \in \mathbb{R}^{p \times p}\), and asymptotic-variance comparisons among regular asymptotically linear estimators are made in the Loewner partial order. Beyond this matricial bookkeeping no new ideas arise, and for clarity we restrict attention to \(p = 1\) throughout.
A regular parametric submodel is a one-dimensional family \(\{P_\varepsilon : \varepsilon \in (-\delta, \delta)\} \subset \mathcal{P}\) with \(P_0 = P\), whose densities \(p_\varepsilon\) depend on \(\varepsilon\) smoothly enough that the score \[S(o) := \left.\frac{\partial}{\partial\varepsilon}\log p_\varepsilon(o)\right|_{\varepsilon=0}\] exists as an \(L_2(P)\) derivative. This \(L_2\)-differentiability condition is a convenient shorthand for the regularity assumptions — typically formalized by differentiability in quadratic mean of \(\varepsilon \mapsto p_\varepsilon^{1/2}\) — under which scores are valid \(L_2(P)\) directional derivatives and derivative-under-the-integral calculations are justified. It is weaker than pointwise smoothness; see van der Vaart (1998, sec. 7.2) for the rigorous formulation.
A regular submodel is a smooth one-dimensional curve through \(\mathcal{P}\) that can be probed by ordinary parametric methods. Because \(\int p_\varepsilon\,d\mu = 1\) for all \(\varepsilon\), differentiating under the integral gives \(\E_P[S(O)] = 0\), and \(\E_P[S(O)^2] < \infty\) holds by the \(L_2\)-differentiability assumption, so every score lies in \(L_2^0(P)\).
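As a concrete instance, the Gaussian location family \(P_\varepsilon = N(\varepsilon, 1)\) is a regular submodel through \(P = N(0,1)\) with score \(S(o) = o\). The sketch below (the specific family is our illustrative choice) recovers the score by finite differences of the log-density and checks that it is mean-zero under \(P\).

```python
import numpy as np

rng = np.random.default_rng(2)
O = rng.normal(size=200_000)              # O ~ P = N(0, 1)

# Regular submodel P_eps = N(eps, 1) through P; log-density up to a constant.
logp = lambda o, eps: -0.5 * (o - eps) ** 2

# The score at eps = 0 is S(o) = d/deps log p_eps(o)|_0 = o.
eps = 1e-4
S_fd = (logp(O, eps) - logp(O, -eps)) / (2 * eps)   # finite differences
print(np.allclose(S_fd, O))                          # recovers S(o) = o
print(S_fd.mean())                                   # ~ 0: E_P[S(O)] = 0
```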
C.3 Pathwise Differentiability
The functional \(\Psi\) is differentiable along a submodel if the map \(\varepsilon \mapsto \Psi(P_\varepsilon)\) is differentiable at \(\varepsilon = 0\) in the ordinary sense. Pathwise differentiability asserts that this derivative can be represented as an inner product between a fixed function and the score, uniformly over submodels: \(\Psi\) is pathwise differentiable at \(P\) if there exists \(\varphi \in L_2^0(P)\) such that, for every regular submodel with score \(S\), \[\left.\frac{\partial}{\partial\varepsilon}\Psi(P_\varepsilon)\right|_{\varepsilon=0} = \E_P[\varphi(O)\,S(O)] = \langle\varphi,\,S\rangle_P. \tag{C.1}\] Any such \(\varphi\) is called a (pathwise) influence function, or gradient, of \(\Psi\) at \(P\).
Equation C.1 is the defining identity of semiparametric theory. It says the perturbation of \(\Psi\) along a submodel is fully encoded by the \(L_2(P)\) inner product of a fixed function \(\varphi\) with the score \(S\). Geometrically, \(\varphi\) is a representer for the linear functional \(S \mapsto \partial_\varepsilon \Psi(P_\varepsilon)|_0\) restricted to the space of scores.
C.4 Non-Uniqueness of Influence Functions
The definition of pathwise differentiability does not by itself produce a unique influence function. If \(\varphi\) satisfies Equation C.1 and \(h \in L_2^0(P)\) is orthogonal to every score of the model in \(L_2(P)\), then \(\varphi + h\) also satisfies Equation C.1, since \(\E_P[(\varphi+h)S] = \E_P[\varphi S] + \E_P[hS] = \E_P[\varphi S]\). Whether \(\varphi\) is unique depends on how rich the collection of scores is — a point made precise through the tangent space.
The terminological distinctions introduced in Chapter 10 can now be made precise:
- An estimating function is any function \(U(O;\theta)\) whose root defines an estimator.
- An asymptotic influence function of an estimator \(\hat\psi_n\) is the function \(\varphi\) in its asymptotic expansion \(\hat\psi_n - \psi = n^{-1}\sum_{i=1}^n \varphi(O_i) + o_P(n^{-1/2})\).
- A (pathwise) influence function of a functional is a \(\varphi \in L_2^0(P)\) satisfying Equation C.1.
For a regular asymptotically linear estimator of a pathwise-differentiable functional, the estimator’s asymptotic influence function is also a pathwise influence function of the functional. The relationship to estimating functions is looser: for an M-estimator solving \(n^{-1}\sum_i U(O_i;\theta) = 0\), the standard expansion gives \(\varphi(O) = -A^{-1}U(O;\theta_0)\) with \(A = \E_P[\partial U(O;\theta_0)/\partial\theta^\top]\), so a generic estimating function \(U\) equals the influence function \(\varphi\) only after this normalization. The first two notions can be defined without reference to the model \(\mathcal{P}\), while the third depends crucially on \(\mathcal{P}\) through the collection of admissible submodels.
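The normalization \(\varphi = -A^{-1}U\) is easy to check numerically. In the sketch below (the model is an illustrative choice, not tied to anything else in this appendix), \(O \sim \mathrm{Exp}(\theta_0)\) and \(U(o;\theta) = 1/\theta - o\) is the exponential score, so \(A = -1/\theta_0^2\) and \(\varphi(o) = \theta_0 - \theta_0^2 o\); asymptotic linearity then says \(\hat\theta_n - \theta_0\) is approximately the sample mean of \(\varphi\).

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 2.0
n = 100_000
O = rng.exponential(scale=1 / theta0, size=n)   # O ~ Exp(rate theta0)

# Estimating function U(o; theta) = 1/theta - o (the exponential score).
# Its root is the MLE theta_hat = 1 / mean(O).
theta_hat = 1 / O.mean()

# Normalization: A = E[dU/dtheta] = -1/theta0^2, so the influence function is
# phi(o) = -A^{-1} U(o; theta0) = theta0 - theta0^2 * o.
phi = theta0 - theta0**2 * O

# Asymptotic linearity: theta_hat - theta0 equals the mean of phi up to
# a remainder of order 1/n.
print(theta_hat - theta0, phi.mean())           # close to each other
```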
C.5 The Tangent Space
The tangent space of the model at \(P\) is the closed linear span of all scores of regular submodels through \(P\): \(\mathcal{T} := \overline{\mathrm{span}}\{S : S \text{ is the score of a regular submodel of } \mathcal{P}\} \subset L_2^0(P)\). Two extreme cases are illustrative. Nonparametric model. If \(\mathcal{P}\) contains all distributions on \(\mathcal{O}\) satisfying mild regularity, then for any bounded \(h \in L_2^0(P)\) the submodel \(p_\varepsilon(o) \propto (1 + \varepsilon h(o))p(o)\) is regular for \(|\varepsilon| < 1/\|h\|_\infty\), with score \(S = h\). Since bounded mean-zero functions are dense in \(L_2^0(P)\), taking closure gives \(\mathcal{T} = L_2^0(P)\). Fully parametric model. If \(\mathcal{P} = \{P_\theta : \theta \in \Theta \subset \mathbb{R}^k\}\) with score components \(S_{\theta_0,1},\ldots,S_{\theta_0,k}\), then \(\mathcal{T} = \mathrm{span}\{S_{\theta_0,1},\ldots,S_{\theta_0,k}\}\) is \(k\)-dimensional.
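The mixture path \(p_\varepsilon = (1 + \varepsilon h)p\) can be inspected directly on a finite sample space, where no regularity subtleties arise. The sketch below (the atoms and the direction \(h\) are arbitrary illustrative choices) confirms that \(p_\varepsilon\) is a genuine density for small \(\varepsilon\) — no renormalization is needed because \(\E_P[h] = 0\) — and that its score at \(\varepsilon = 0\) is exactly \(h\).

```python
import numpy as np

rng = np.random.default_rng(4)
m = 8
p = rng.dirichlet(np.ones(m))                 # discrete P on m atoms
h = rng.uniform(-1, 1, size=m)
h -= np.sum(p * h)                            # bounded, mean-zero direction

# The mixture submodel p_eps = (1 + eps*h) * p is a valid density for
# |eps| < 1/max|h| (it already sums to one because E_P[h] = 0).
p_eps = lambda eps: (1 + eps * h) * p
print(np.isclose(p_eps(0.3).sum(), 1.0), (p_eps(0.3) > 0).all())

# Its score at eps = 0 is exactly the chosen direction h.
eps = 1e-6
score_fd = (np.log(p_eps(eps)) - np.log(p_eps(-eps))) / (2 * eps)
print(np.allclose(score_fd, h))               # True: S = h
```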
The nuisance tangent space \(\mathcal{T}_\eta\) is the closed linear span of the scores of regular submodels that leave \(\Psi\) unchanged to first order, i.e. those with \(\partial_\varepsilon\Psi(P_\varepsilon)|_0 = 0\). It is a closed subspace of \(\mathcal{T}\), and hence of \(L_2^0(P)\). Its orthogonal complement is taken in \(L_2^0(P)\): \[\mathcal{T}_\eta^\perp := \{f \in L_2^0(P) : \E_P[fg] = 0 \text{ for all } g \in \mathcal{T}_\eta\}. \tag{C.2}\] The Hilbert-space decomposition \(L_2^0(P) = \mathcal{T}_\eta \oplus \mathcal{T}_\eta^\perp\) holds automatically by the projection theorem and provides the geometric structure underlying semiparametric efficiency.
The distinguishing property of the efficient influence function is not membership in \(\mathcal{T}_\eta^\perp\) (which is automatic) but rather that it is the unique influence function lying in the full tangent space \(\mathcal{T}\).
C.6 The Canonical Gradient and the Efficiency Bound
Every influence function lies in \(\mathcal{T}_\eta^\perp\). What distinguishes a single influence function within this set? The answer is orthogonality to \(\mathcal{T}^\perp\), equivalently membership in the full tangent space \(\mathcal{T}\): among all influence functions, there is a unique one lying in \(\mathcal{T}\), and it has the smallest variance. Precisely (the Canonical Gradient Theorem; see Bickel et al. 1993 or Tsiatis 2006), if \(\varphi\) is any influence function of \(\Psi\) at \(P\), then: (i) the projection \(\varphi^* := \Pi[\varphi \mid \mathcal{T}]\) is itself an influence function, lies in \(\mathcal{T}\), and does not depend on the choice of \(\varphi\); (ii) the set of all influence functions is \(\{\varphi^* + h : h \in \mathcal{T}^\perp\}\); and (iii) \(\E_P[\varphi^2] = \E_P[(\varphi^*)^2] + \E_P[(\varphi - \varphi^*)^2] \ge \E_P[(\varphi^*)^2]\), with equality if and only if \(\varphi = \varphi^*\) in \(L_2(P)\). The function \(\varphi^*\) is the canonical gradient, or efficient influence function (EIF), of \(\Psi\) at \(P\), and \(V^* := \E_P[(\varphi^*)^2]\) is the semiparametric efficiency bound.
For regular asymptotically linear estimators, the variance bound follows directly from the Canonical Gradient Theorem (iii) applied to the estimator’s influence function. The Hájek–Le Cam convolution theorem extends the lower-bound interpretation beyond the asymptotically linear class: for any regular estimator \(\hat\psi_n\), \(\sqrt{n}(\hat\psi_n - \psi) \rightsquigarrow Z + W\) where \(Z \sim N(0, V^*)\) and \(W\) is independent of \(Z\), so the variance bound persists by taking variances. A precise statement requires the local asymptotic normality framework of van der Vaart (1998, secs. 25.3–25.6).
C.7 Worked Example: The ATE Functional
This section makes the geometric machinery concrete by carrying out the EIF derivation for the ATE under the nonparametric observed-data model. The end product is the AIPW influence function. The value of the derivation lies in showing how it arises directly from the Canonical Gradient Theorem as the Riesz representer in \(L_2^0(P)\) of the pathwise derivative of \(\tau\) — recovering the AIPW formula from first principles.
Setup. The observed data are \(O = (X, T, Y)\). Under consistency, conditional exchangeability, and positivity (Chapter 3), the ATE is identified with \(\tau(P) = \E_P[\mu_1(X) - \mu_0(X)]\), \(\mu_t(X) = \E_P[Y \mid T{=}t, X]\). Write \(\pi(X) = P(T{=}1 \mid X)\). The model \(\mathcal{P}\) is nonparametric: no restrictions are placed on the joint law of \((X, T, Y)\) beyond positivity \(0 < \pi(X) < 1\) a.s. Hence \(\mathcal{T} = L_2^0(P)\).
Score factorization. The joint density factors as \(p(x, t, y) = p_X(x)\cdot p_{T\mid X}(t\mid x)\cdot p_{Y\mid T,X}(y\mid t,x)\). Along any regular submodel the score decomposes additively: \[S(O) = S_X(X) + S_T(T\mid X) + S_Y(Y\mid T,X),\] with \(\E_P[S_X(X)] = 0\), \(\E_P[S_T(T\mid X)\mid X] = 0\), \(\E_P[S_Y(Y\mid T,X)\mid T,X] = 0\). Define the closed subspaces of \(L_2^0(P)\): \[\mathcal{H}_X = \{a(X) : \E_P[a(X)] = 0\}, \quad \mathcal{H}_T = \{b(X)(T-\pi(X)) : b \in L_2(P_X)\}, \quad \mathcal{H}_Y = \{c(O) : \E_P[c(O)\mid T,X] = 0\}.\]
A short conditioning calculation shows these three subspaces are pairwise orthogonal in \(L_2^0(P)\). For instance, for \(a(X) \in \mathcal{H}_X\) and \(b(X)(T-\pi(X)) \in \mathcal{H}_T\): \(\E_P[a(X)\cdot b(X)(T-\pi(X))] = \E_P[a(X)b(X)\cdot\E_P\{T-\pi(X)\mid X\}] = 0\), since \(\E_P[T\mid X] = \pi(X)\). The other two pairs follow analogously by conditioning on \(X\) and on \((T,X)\) respectively. Joint spanning follows by writing any \(g \in L_2^0(P)\) as a telescoping sum of conditional expectations and centering each piece. Hence: \[L_2^0(P) = \mathcal{H}_X \oplus \mathcal{H}_T \oplus \mathcal{H}_Y. \tag{C.3}\]
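The pairwise orthogonality is easy to confirm by simulation. In the sketch below, the data-generating design (\(\pi(X) = \mathrm{expit}(X)\), \(\mu_t(X) = t + X\)) and the three representative elements are illustrative choices; the empirical inner products of one element from each subspace vanish up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
expit = lambda z: 1 / (1 + np.exp(-z))

# Simulate O = (X, T, Y) with known pi(X) = expit(X) and mu_t(X) = t + X.
X = rng.normal(size=n)
pi = expit(X)
T = rng.binomial(1, pi)
Y = T + X + rng.normal(size=n)             # so Y - mu_T(X) is pure noise

# One representative element of each subspace:
aX = X                                     # in H_X: mean-zero function of X
bT = np.cos(X) * (T - pi)                  # in H_T: b(X)(T - pi(X))
cY = (Y - T - X) * (1 + T * X)             # in H_Y: E[c | T, X] = 0

# Pairwise empirical inner products, each ~ 0 up to O(n^{-1/2}) error.
for f, g in [(aX, bT), (aX, cY), (bT, cY)]:
    print(np.mean(f * g))
```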
The pathwise derivative. Differentiating \(\tau(P_\varepsilon) = \int(\mu_{1,\varepsilon}(x) - \mu_{0,\varepsilon}(x))\,p_{X,\varepsilon}(x)\,d\mu(x)\) at \(\varepsilon = 0\) and applying the product rule gives: \[\left.\frac{\partial}{\partial\varepsilon}\tau(P_\varepsilon)\right|_0 = \underbrace{\int\{\mu_1(x)-\mu_0(x)\}\,S_X(x)\,p(x)\,d\mu(x)}_{(I)} + \underbrace{\int\bigl(\partial_\varepsilon\mu_{1,\varepsilon}(x) - \partial_\varepsilon\mu_{0,\varepsilon}(x)\bigr)\big|_0 p(x)\,d\mu(x)}_{(II)}.\]
No \(S_T\) term appears: \(\tau(P)\) is a functional of \(p_X\) and \(p_{Y\mid T,X}\) only, so perturbations of the treatment law do not affect \(\tau\) to first order. This already shows that \(\mathcal{H}_T \subset \mathcal{T}_\eta\).
Term (I). Since \(\E_P[S_X] = 0\), we may subtract any constant from \(\mu_1(X) - \mu_0(X)\); the choice \(\tau = \E_P[\mu_1(X) - \mu_0(X)]\) places the result in \(\mathcal{H}_X\): \[(I) = \E_P[\{\mu_1(X)-\mu_0(X)-\tau\}\,S_X(X)] = \langle \varphi_X,\,S_X\rangle_P, \quad \varphi_X(X) := \mu_1(X)-\mu_0(X)-\tau \in \mathcal{H}_X.\]
Term (II). Fix \(t \in \{0,1\}\) and compute: \[\left.\partial_\varepsilon\mu_{t,\varepsilon}(x)\right|_0 = \E_P[(Y-\mu_t(x))\,S_Y(Y\mid t,X)\mid T{=}t,\,X{=}x],\] using \(\E_P[S_Y(Y\mid t,x)\mid T{=}t,X{=}x] = 0\) to subtract \(\mu_t(x)\). Write \(\pi_t(x) = P(T{=}t\mid X{=}x)\). Using the identity \(p(x)\,p(y\mid t,x) = p(x,T{=}t,y)/\pi_t(x)\) and applying for \(t=1\) and \(t=0\): \[(II) = \E_P\!\left[\left\{\frac{T(Y-\mu_1(X))}{\pi(X)} - \frac{(1-T)(Y-\mu_0(X))}{1-\pi(X)}\right\}S_Y(Y\mid T,X)\right] = \langle\varphi_Y,\,S_Y\rangle_P,\] with \(\varphi_Y(O) := T(Y-\mu_1(X))/\pi(X) - (1-T)(Y-\mu_0(X))/(1-\pi(X)) \in \mathcal{H}_Y\).
The canonical gradient. Combining and using the orthogonality of Equation C.3: \[\left.\frac{\partial}{\partial\varepsilon}\tau(P_\varepsilon)\right|_0 = \langle\varphi^*,\,S\rangle_P, \qquad \varphi^*(O) := \varphi_X(X) + \varphi_Y(O), \tag{C.4}\] for every score \(S = S_X + S_T + S_Y \in \mathcal{T} = L_2^0(P)\). By the defining identity Equation C.1, \(\varphi^*\) is an influence function of \(\tau\). By construction \(\varphi^* \in \mathcal{H}_X \oplus \mathcal{H}_Y \subset L_2^0(P) = \mathcal{T}\), so \(\varphi^*\) is the canonical gradient. Writing it out explicitly: \[\varphi^*(O) = \mu_1(X)-\mu_0(X)-\tau + \frac{T(Y-\mu_1(X))}{\pi(X)} - \frac{(1-T)(Y-\mu_0(X))}{1-\pi(X)}, \tag{C.5}\] which is the AIPW influence function derived in Chapter 10.
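Because every object in the nonparametric model is computable exactly when \(O\) is discrete, the identity Equation C.4 can be verified by brute force. The sketch below (an arbitrary strictly positive joint law on \(\{0,1\}^3\) and an arbitrary mean-zero direction \(h\) are illustrative choices) computes \(\tau\) by enumeration, perturbs along the mixture submodel \((1+\varepsilon h)p\) whose score is \(h\), and compares the numerical pathwise derivative with \(\langle\varphi^*, h\rangle_P\).

```python
import numpy as np

rng = np.random.default_rng(6)

# An arbitrary strictly positive joint law for O = (X, T, Y) on {0,1}^3,
# stored as p[x, t, y]; everything below is exact enumeration.
p = rng.uniform(0.5, 1.5, size=(2, 2, 2))
p /= p.sum()

def tau(p):
    pX = p.sum(axis=(1, 2))                  # marginal of X
    mu = p[:, :, 1] / p.sum(axis=2)          # mu[x, t] = E[Y | T=t, X=x]
    return np.sum(pX * (mu[:, 1] - mu[:, 0]))

# Nuisances and the claimed canonical gradient (Equation C.5) at each atom.
pX = p.sum(axis=(1, 2))
pi = p[:, 1, :].sum(axis=1) / pX             # pi[x] = P(T=1 | X=x)
mu = p[:, :, 1] / p.sum(axis=2)
t0 = tau(p)
x, t, y = np.meshgrid([0, 1], [0, 1], [0, 1], indexing="ij")
phi = (mu[x, 1] - mu[x, 0] - t0
       + t * (y - mu[x, 1]) / pi[x]
       - (1 - t) * (y - mu[x, 0]) / (1 - pi[x]))
print(np.sum(p * phi))                       # ~ 0: phi* is mean-zero

# Perturb along the mixture submodel p_eps = (1 + eps*h) p, whose score is h,
# and compare the pathwise derivative of tau with <phi*, h>_P (Equation C.4).
h = rng.uniform(-1, 1, size=(2, 2, 2))
h -= np.sum(p * h)                           # mean-zero direction
eps = 1e-6
deriv_fd = (tau((1 + eps * h) * p) - tau((1 - eps * h) * p)) / (2 * eps)
print(deriv_fd, np.sum(p * phi * h))         # the two numbers agree
```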
Dimension of \(\mathcal{T}_\eta^\perp\). The score-to-derivative map \(\Lambda: \mathcal{T} \to \mathbb{R}\), \(\Lambda(S) = \partial_\varepsilon\tau(P_\varepsilon)|_0\), is by Equation C.4 a non-zero continuous linear functional on \(L_2^0(P)\) with Riesz representer \(\varphi^*\). Its kernel is exactly \(\mathcal{T}_\eta\), which by the codimension-and-uniqueness fact of Section C.1 has codimension one. Hence \(\mathcal{T}_\eta^\perp = \mathrm{span}\{\varphi^*\}\) is one-dimensional.
Reading off the efficiency bound. By the definition of the EIF, the semiparametric efficiency bound for estimating \(\tau(P)\) in the nonparametric model is \(V^*(\tau,P) = \E_P[\varphi^*(O)^2]\). Standard manipulations (iterated expectations on each summand of \(\varphi^*\) using the orthogonality of \(\mathcal{H}_X, \mathcal{H}_Y\)) decompose this as: \[V^*(\tau,P) = \E_P\!\left[\frac{\sigma_1^2(X)}{\pi(X)} + \frac{\sigma_0^2(X)}{1-\pi(X)} + \{\mu_1(X)-\mu_0(X)-\tau\}^2\right],\] with \(\sigma_t^2(X) = \mathrm{Var}_P(Y\mid T{=}t,X)\) — the classical semiparametric variance bound for the ATE (Robins et al. 1994). By the Asymptotic Efficiency Bound Theorem, any regular asymptotically linear estimator of \(\tau\) achieves this bound exactly when its influence function equals \(\varphi^*\) in \(L_2(P)\), the analytic statement underpinning the asymptotic optimality of the AIPW estimator established in Chapter 11.
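The variance decomposition can likewise be checked by simulation with the true nuisance functions plugged in. In the sketch below the design (\(\pi(X) = \mathrm{expit}(X)\), \(\mu_t(X) = t(1+X)\), unit noise variance, hence \(\tau = 1\) and \(\sigma_t^2 \equiv 1\)) is an illustrative choice; the Monte Carlo average of \(\varphi^{*2}\) matches the closed-form integrand above up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
expit = lambda z: 1 / (1 + np.exp(-z))

# Illustrative design with known nuisances: pi(X) = expit(X),
# mu_t(X) = t*(1 + X), unit noise variance, hence tau = 1, sigma_t^2 = 1.
X = rng.normal(size=n)
pi = expit(X)
T = rng.binomial(1, pi)
Y = T * (1 + X) + rng.normal(size=n)

mu1, mu0 = 1 + X, np.zeros(n)
tau = 1.0                                  # E[mu1(X) - mu0(X)] = 1 exactly

# The EIF of Equation C.5, evaluated with the true nuisance functions.
phi = (mu1 - mu0 - tau
       + T * (Y - mu1) / pi
       - (1 - T) * (Y - mu0) / (1 - pi))
print(phi.mean())                                   # ~ 0

# E[phi*^2] against the closed-form integrand of the variance bound.
closed_form = 1 / pi + 1 / (1 - pi) + (mu1 - mu0 - tau) ** 2
print(np.mean(phi**2), np.mean(closed_form))        # agree up to MC error
```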
Bibliographic Notes
The modern formulation of pathwise differentiability and tangent spaces is developed in Bickel et al. (1993) and van der Vaart (1998, chap. 25); the latter is the standard reference for the convolution theorem and local asymptotic normality. Tsiatis (2006) gives a treatment oriented specifically toward missing data and causal inference, and is a natural companion to the material developed here.