Appendix C — Pathwise Differentiability and Efficient Influence Functions

Section 10.9 introduced the efficient influence function of the average treatment effect functional and stated, without proof, that this object is the canonical gradient of \(\Psi\) relative to the nonparametric tangent space. This appendix supplies the geometric machinery behind that statement. The treatment follows Bickel et al. (1993) and Tsiatis (2006), with notation aligned to Chapter 10.

The reader should think of this appendix as the formal counterpart of the sections of Chapter 10 on asymptotic linearity and efficiency: those sections showed how an influence function determines the asymptotic distribution of an estimator; here we develop the parallel notion of an influence function as a derivative of a functional, and characterize the unique influence function that achieves the semiparametric efficiency bound. No new notation is needed beyond what is standard in modern semiparametric theory.

The applied consequences — the AIPW estimator, double robustness, Neyman orthogonality, cross-fitting, and rate conditions — are developed in Chapters 11 and 12. This appendix supplies only the foundational geometry on which those chapters rest. Section C.1 collects the Hilbert-space background used throughout. Section C.7 closes the appendix by carrying out the EIF derivation explicitly for the ATE, recovering the AIPW influence function from the projection construction of the Canonical Gradient Theorem.

C.1 Hilbert-Space Background

The geometry of pathwise differentiability lives in the Hilbert space \(L_2(P)\) of square-integrable real-valued functions on \(\mathcal{O}\), equipped with the inner product \(\langle f, g \rangle_P = \E_P[f(O)\,g(O)]\) and norm \(\|f\|_P = \langle f,f\rangle_P^{1/2}\). Two functions are orthogonal, written \(f \perp g\), when \(\langle f, g \rangle_P = 0\). All scores and influence functions introduced below live in the closed subspace of mean-zero functions, \(L_2^0(P) = \{f \in L_2(P) : \E_P[f(O)] = 0\}\).

This section records the four Hilbert-space facts used repeatedly in the rest of the appendix. For a full treatment, see Vaart (1998, sec. 25.7) or the appendices of Tsiatis (2006).

Closed linear span. For a subset \(A \subset L_2^0(P)\), the closed linear span \(\overline{\mathrm{span}}(A)\) is the smallest closed subspace of \(L_2^0(P)\) containing \(A\); it consists of \(L_2(P)\)-limits of finite linear combinations of elements of \(A\). Tangent spaces are defined as closed linear spans because the set of scores of regular parametric submodels need not itself be closed.

Projection theorem. If \(V \subset L_2^0(P)\) is a closed subspace, every \(f \in L_2^0(P)\) admits a unique decomposition \(f = f_V + f_{V^\perp}\) with \(f_V \in V\) and \(f_{V^\perp} \in V^\perp\), where \(V^\perp = \{g \in L_2^0(P) : \langle g, h\rangle_P = 0 \text{ for all } h \in V\}\). The map \(\Pi[\,\cdot\mid V\,]: f \mapsto f_V\) is the orthogonal projection onto \(V\), characterized by (i) \(f - \Pi[f\mid V] \perp V\) (orthogonality of the residual), or (ii) \(\Pi[f\mid V]\) is the unique element of \(V\) minimizing \(\|f - g\|_P\) over \(g \in V\). The decomposition \(L_2^0(P) = V \oplus V^\perp\) is the engine behind the canonical-gradient construction.
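The projection theorem is easy to visualize numerically. The following sketch (an illustration added here; the sample, the function \(f\), and the basis of \(V\) are all assumed choices, not taken from the text) represents functions by their values on a large i.i.d. sample, so that \(\langle f, g\rangle_P\) becomes a sample average and projection onto a finite-dimensional subspace reduces to least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
o = rng.normal(size=200_000)             # i.i.d. draws from P = N(0, 1)

f  = o**3                                # a mean-zero function of O under P
g1 = o                                   # basis of the subspace V
g2 = o**2 - 1                            # (both mean zero under P)

G = np.column_stack([g1, g2])
coef, *_ = np.linalg.lstsq(G, f, rcond=None)
f_V = G @ coef                           # Pi[f | V]; here approximately 3*o
resid = f - f_V                          # component in V-perp

inner = lambda a, b: float(np.mean(a * b))   # <a, b>_P by Monte Carlo
print(inner(resid, g1), inner(resid, g2))    # (i) residual orthogonal to V
print(inner(f, f) - inner(f_V, f_V) - inner(resid, resid))  # Pythagoras
```

Both characterizations of the projection appear: the residual is orthogonal to \(V\) and the Pythagorean identity holds, here to machine precision because both are built into the least-squares normal equations.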

Riesz representation. Every continuous linear functional \(\Lambda: L_2^0(P) \to \mathbb{R}\) admits a unique representer \(r_\Lambda \in L_2^0(P)\) such that \(\Lambda(f) = \langle r_\Lambda, f\rangle_P\) for all \(f \in L_2^0(P)\). Pathwise differentiability of a functional \(\Psi\) is precisely the statement that the score-to-derivative map \(S \mapsto \partial_\varepsilon \Psi(P_\varepsilon)|_0\) extends to a continuous linear functional on the tangent space, and the influence function is its Riesz representer.

Codimension and uniqueness. A non-zero continuous linear functional \(\Lambda\) on a Hilbert space \(H\) has closed kernel \(\ker(\Lambda) := \{f : \Lambda(f) = 0\}\) of codimension one, with \(\ker(\Lambda)^\perp = \mathrm{span}\{r_\Lambda\}\) spanned by the Riesz representer. This codimension-one fact is what makes the efficient influence function of a real-valued functional unique, once it exists.

C.2 Regular Parametric Submodels and Scores

Let \(\mathcal{P}\) be a statistical model for the distribution of the observed data \(O\) on \(\mathcal{O}\), with true distribution \(P \in \mathcal{P}\). All densities are taken with respect to a common dominating measure \(\mu\). The parameter of interest is a smooth real-valued functional \(\Psi: \mathcal{P} \to \mathbb{R}\), \(\psi = \Psi(P)\).

For vector-valued \(\Psi: \mathcal{P} \to \mathbb{R}^p\), each component is pathwise differentiable in the sense of the definition below and has its own gradient; stacking gives a vector-valued influence function \(\varphi^* = (\varphi_1^*, \ldots, \varphi_p^*)^\top\) with each \(\varphi_j^* \in L_2^0(P)\). The semiparametric efficiency bound is then the covariance matrix \(\E_P[\varphi^*(O)\,\varphi^*(O)^\top] \in \mathbb{R}^{p \times p}\), and asymptotic-variance comparisons among regular asymptotically linear estimators are made in the Loewner partial order. Beyond this matricial bookkeeping no new ideas arise, and for clarity we restrict attention to \(p = 1\) throughout.

Definition: Regular Parametric Submodel

A regular parametric submodel of \(\mathcal{P}\) passing through \(P\) is a one-parameter family \(\{P_\varepsilon : \varepsilon \in (-\delta, \delta)\} \subset \mathcal{P}\), \(P_0 = P\), with densities \(p_\varepsilon\) such that \(\varepsilon \mapsto \log p_\varepsilon(O)\) is differentiable at \(\varepsilon = 0\) in \(L_2(P)\), with score \[S(O) = \left.\frac{\partial}{\partial \varepsilon} \log p_\varepsilon(O)\right|_{\varepsilon = 0} \in L_2(P).\]

This \(L_2\)-differentiability condition is a convenient shorthand for the regularity assumptions — typically formalized by differentiability in quadratic mean of \(\varepsilon \mapsto p_\varepsilon^{1/2}\) — under which scores are valid \(L_2(P)\) directional derivatives and derivative-under-the-integral calculations are justified. It is weaker than pointwise smoothness; see Vaart (1998, sec. 7.2) for the rigorous formulation.

A regular submodel is a smooth one-dimensional curve through \(\mathcal{P}\) that can be probed by ordinary parametric methods. Because \(\int p_\varepsilon\,d\mu = 1\) for all \(\varepsilon\), differentiating under the integral gives \(\E_P[S(O)] = 0\) and \(\E_P[S(O)^2] < \infty\), so every score lies in \(L_2^0(P)\).
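The mean-zero property can be checked in a minimal numerical sketch (an illustration with an assumed submodel, not from the text): for the location family \(P_\varepsilon = N(\varepsilon, 1)\) through \(P = N(0,1)\), the score at \(\varepsilon = 0\) is \(S(o) = o\), which a finite-difference derivative of the log density recovers, and whose sample mean under \(P\) is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
o = rng.normal(size=500_000)             # O ~ P = N(0, 1)

def log_p(o, eps):                       # log density of N(eps, 1)
    return -0.5 * (o - eps) ** 2 - 0.5 * np.log(2 * np.pi)

eps = 1e-5                               # central finite difference in eps
S = (log_p(o, eps) - log_p(o, -eps)) / (2 * eps)

print(np.max(np.abs(S - o)))             # the score is S(o) = o
print(np.mean(S))                        # E_P[S] ~ 0 up to Monte Carlo error
```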

Remark: Why Submodels Rather Than Point-Mass Contamination

A heuristic sometimes used for “probing” a functional is the contamination path \(P_\varepsilon = (1-\varepsilon)P + \varepsilon\delta_o\). The corresponding Hampel influence function is the derivative \(\partial_\varepsilon \Psi\{(1-\varepsilon)P + \varepsilon\delta_o\}|_{\varepsilon=0}\). Contamination is intuitive but does not yield a regular submodel in the sense of the definition above: when \(P\) is continuous, \(\delta_o\) is not absolutely continuous with respect to \(P\), so the Radon–Nikodym derivative \(d\delta_o/dP\) does not exist as an \(L_2(P)\) function, and the contamination path lacks a well-defined score in the sense used here. All formal results below use the regular-submodel definition, while the contamination heuristic remains useful for guessing the form of an influence function in concrete examples.
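For the mean functional the contamination heuristic can be carried out in closed form, since \(\Psi\{(1-\varepsilon)P + \varepsilon\delta_y\} = (1-\varepsilon)\mu + \varepsilon y\). The sketch below (an illustration with assumed numbers) confirms that its \(\varepsilon\)-derivative reproduces \(y - \mu\), the same influence function the regular-submodel calculus yields for the mean in Section C.3:

```python
mu = 2.0                                 # mean of P (assumed for illustration)

def psi_contaminated(eps, y):            # mean of (1 - eps) P + eps delta_y
    return (1 - eps) * mu + eps * y

y = 5.0
eps = 1e-6                               # forward finite difference in eps
hampel_if = (psi_contaminated(eps, y) - psi_contaminated(0.0, y)) / eps
print(hampel_if, y - mu)                 # both ~ 3.0, i.e. y - mu
```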

C.3 Pathwise Differentiability

The functional \(\Psi\) is differentiable along a submodel if the map \(\varepsilon \mapsto \Psi(P_\varepsilon)\) is differentiable at \(\varepsilon = 0\) in the ordinary sense. Pathwise differentiability asserts that this derivative can be represented as an inner product between a fixed function and the score, uniformly over submodels.

Definition: Pathwise Differentiability and Influence Function

The functional \(\Psi: \mathcal{P} \to \mathbb{R}\) is pathwise differentiable at \(P\) if there exists a function \(\varphi \in L_2^0(P)\) such that, for every regular parametric submodel \(\{P_\varepsilon\}\) with score \(S\), \[\left.\frac{\partial}{\partial \varepsilon} \Psi(P_\varepsilon)\right|_{\varepsilon=0} = \E_P[\varphi(O)\,S(O)]. \tag{C.1}\] Any such \(\varphi\) is called an influence function (or gradient) of \(\Psi\) at \(P\). The set of influence functions is denoted \(\mathrm{IF}(\Psi, P)\).

Equation C.1 is the defining identity of semiparametric theory. It says the perturbation of \(\Psi\) along a submodel is fully encoded by the \(L_2(P)\) inner product of a fixed function \(\varphi\) with the score \(S\). Geometrically, \(\varphi\) is a representer for the linear functional \(S \mapsto \partial_\varepsilon \Psi(P_\varepsilon)|_0\) restricted to the space of scores.

Example: Mean Functional

Let \(O = Y\) with \(\E_P[Y^2] < \infty\) and \(\Psi(P) = \E_P[Y]\). For any regular submodel with score \(S\): \[\frac{\partial}{\partial\varepsilon}\int y\,p_\varepsilon(y)\,d\mu(y)\bigg|_0 = \int y\,S(y)\,p(y)\,d\mu(y) = \E_P[Y\cdot S(O)].\] Subtracting \(\E_P[Y]\cdot\E_P[S] = 0\) gives \(\partial_\varepsilon\Psi(P_\varepsilon)|_0 = \E_P[(Y - \E_P Y)\,S(O)]\), so \(\varphi(O) = Y - \Psi(P)\) is an influence function of the mean functional. Under the nonparametric model, this is in fact the unique influence function, and hence the efficient influence function.
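The identity just derived can be checked by Monte Carlo (an illustration with an assumed tilt submodel): along \(p_\varepsilon = (1 + \varepsilon h)p\) with bounded mean-zero \(h\), the score is \(S = h\), and \(\Psi(P_\varepsilon) = \E_P[Y(1 + \varepsilon h(Y))]\) is linear in \(\varepsilon\), so its derivative should equal \(\E_P[(Y - \E_P Y)\,h(Y)]\):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=400_000)             # Y ~ P = N(0, 1)
h = np.tanh(y)                           # bounded direction
h = h - h.mean()                         # center so p_eps integrates to 1

def psi(eps):                            # Psi(P_eps) = E_P[Y (1 + eps h(Y))]
    return np.mean(y * (1 + eps * h))

eps = 1e-4
lhs = (psi(eps) - psi(-eps)) / (2 * eps) # pathwise derivative of Psi
rhs = np.mean((y - y.mean()) * h)        # E_P[phi(Y) S(Y)] with phi = Y - E_P[Y]
print(lhs, rhs)                          # agree up to roundoff
```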

Remark: Connection to Estimation

If an estimator \(\hat\psi\) is asymptotically linear at \(P\) with influence function \(\varphi\) in the sense of Chapter 10, and if \(\hat\psi\) is regular along every submodel, then \(\varphi\) satisfies Equation C.1 and is therefore a gradient of \(\Psi\). Pathwise differentiability is thus a necessary condition for the existence of a regular asymptotically linear estimator of \(\Psi\) (Tsiatis 2006, Theorem 3.1).

C.4 Non-Uniqueness of Influence Functions

The definition of pathwise differentiability does not produce a unique influence function. If \(\varphi\) satisfies Equation C.1 and \(h \in L_2^0(P)\) is orthogonal to every score of the model in \(L_2(P)\), then \(\varphi + h\) also satisfies Equation C.1, since \(\E_P[(\varphi+h)S] = \E_P[\varphi S] + \E_P[hS] = \E_P[\varphi S]\). Whether \(\varphi\) is unique depends on how rich the collection of scores is — a point made precise through the tangent space.

The terminological distinctions introduced in Chapter 10 can now be made precise:

  • An estimating function is any function \(U(O;\theta)\) whose root defines an estimator.
  • An asymptotic influence function of an estimator is the function \(\varphi\) in its asymptotic expansion.
  • A (pathwise) influence function of a functional is a \(\varphi \in L_2^0(P)\) satisfying Equation C.1.

For a regular asymptotically linear estimator of a pathwise-differentiable functional, the estimator’s asymptotic influence function is also a pathwise influence function of the functional. The relationship to estimating functions is looser: for an M-estimator solving \(n^{-1}\sum_i U(O_i;\theta) = 0\), the standard expansion gives \(\varphi(O) = -A^{-1}U(O;\theta_0)\) with \(A = \E_P[\partial U(O;\theta_0)/\partial\theta^\top]\), so a generic estimating function \(U\) equals the influence function \(\varphi\) only after this normalization. The first two notions can be defined without reference to the model \(\mathcal{P}\), while the third depends crucially on \(\mathcal{P}\) through the collection of admissible submodels.
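The normalization step can be seen numerically in a sketch (illustration only; the estimating function and data law are assumed): take \(U(o;\theta) = o - e^\theta\) with \(O \sim \mathrm{Exp}(1)\), so \(\theta_0 = \log \E[O] = 0\), \(A = -e^{\theta_0} = -1\), and \(\varphi(o) = -A^{-1}U(o;\theta_0) = o - 1\); the M-estimator's error then matches the sample average of \(\varphi\) to first order:

```python
import numpy as np

rng = np.random.default_rng(3)
o = rng.exponential(size=100_000)        # O ~ Exp(1), so E[O] = 1, theta0 = 0

theta_hat = np.log(o.mean())             # root of n^{-1} sum_i U(O_i; theta) = 0
A = -np.exp(0.0)                         # A = E_P[dU/dtheta] at theta0
phi = -(1.0 / A) * (o - np.exp(0.0))     # normalized influence function o - 1

# asymptotic linearity: theta_hat - theta0 ~ n^{-1} sum_i phi(O_i)
print(theta_hat - 0.0, phi.mean())
```

The unnormalized estimating function \(U(o;\theta_0) = o - 1\) happens to coincide with \(\varphi\) here only because \(A = -1\); rescaling \(U\) would change \(U\) but not \(\varphi\).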

C.5 The Tangent Space

Definition: Tangent Space

The tangent space of \(\mathcal{P}\) at \(P\), denoted \(\mathcal{T}_P(\mathcal{P})\) or simply \(\mathcal{T}\), is the closed linear span in \(L_2^0(P)\) of all scores of regular parametric submodels of \(\mathcal{P}\) passing through \(P\).

Two extreme cases are illustrative. Nonparametric model. If \(\mathcal{P}\) contains all distributions on \(\mathcal{O}\) satisfying mild regularity, then for any bounded \(h \in L_2^0(P)\) the submodel \(p_\varepsilon(o) \propto (1 + \varepsilon h(o))p(o)\) is regular for \(|\varepsilon| < 1/\|h\|_\infty\), with score \(S = h\). Since bounded mean-zero functions are dense in \(L_2^0(P)\), taking closure gives \(\mathcal{T} = L_2^0(P)\). Fully parametric model. If \(\mathcal{P} = \{P_\theta : \theta \in \Theta \subset \mathbb{R}^k\}\) with score components \(S_{\theta_0,1},\ldots,S_{\theta_0,k}\), then \(\mathcal{T} = \mathrm{span}\{S_{\theta_0,1},\ldots,S_{\theta_0,k}\}\) is \(k\)-dimensional.
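The nonparametric tilt construction can be sketched on a numerical grid (an illustration; \(P = N(0,1)\) and \(h = \sin\) are assumed choices): for \(|\varepsilon| < 1/\|h\|_\infty\) the tilted density is nonnegative, still integrates to one because \(\E_P[h] = 0\), and has score exactly \(h\):

```python
import numpy as np

o = np.linspace(-5, 5, 2001)             # grid for numerical integration
dx = o[1] - o[0]
p = np.exp(-0.5 * o**2) / np.sqrt(2 * np.pi)   # density of P = N(0, 1)
h = np.sin(o)                            # bounded, odd, so E_P[h] = 0

def p_eps(eps):                          # tilted density (1 + eps h) p
    return (1 + eps * h) * p

total = np.sum(p_eps(0.3)) * dx          # ~ 1 (mass beyond |o| = 5 is tiny)
eps = 1e-6                               # central finite difference in eps
score = (np.log(p_eps(eps)) - np.log(p_eps(-eps))) / (2 * eps)
print(total, np.max(np.abs(score - h)))  # integral ~ 1, score equals h
```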

Definition: Nuisance Tangent Space

The nuisance tangent space of \(\Psi\) at \(P\), denoted \(\mathcal{T}_\eta\), is the closed linear span in \(L_2^0(P)\) of scores of regular submodels along which the pathwise derivative of \(\Psi\) vanishes: \[\mathcal{T}_\eta = \overline{\mathrm{span}}\left\{S \in \mathcal{T} : \left.\frac{\partial}{\partial\varepsilon}\Psi(P_\varepsilon)\right|_{\varepsilon=0} = 0,\ S \text{ the score of a regular submodel } \{P_\varepsilon\}\right\}.\] A submodel with this property is called a nuisance submodel.

The nuisance tangent space is a closed subspace of \(\mathcal{T}\), and hence of \(L_2^0(P)\). Its orthogonal complement is taken in \(L_2^0(P)\): \[\mathcal{T}_\eta^\perp := \{f \in L_2^0(P) : \E_P[fg] = 0 \text{ for all } g \in \mathcal{T}_\eta\}. \tag{C.2}\] The Hilbert-space decomposition \(L_2^0(P) = \mathcal{T}_\eta \oplus \mathcal{T}_\eta^\perp\) holds automatically by the projection theorem and provides the geometric structure underlying semiparametric efficiency.

Proposition: Every Influence Function Lies in \(\mathcal{T}_\eta^\perp\)

Let \(\Psi\) be pathwise differentiable at \(P\). Then every influence function \(\varphi \in \mathrm{IF}(\Psi, P)\) satisfies \(\varphi \in \mathcal{T}_\eta^\perp\).

For any nuisance submodel \(\{P_\varepsilon\}\) with score \(S_\eta\), \(\partial_\varepsilon\Psi(P_\varepsilon)|_0 = 0\) by definition of \(\mathcal{T}_\eta\). Combined with Equation C.1: \(0 = \E_P[\varphi\,S_\eta]\). This holds for every score \(S_\eta\) of a nuisance submodel; by linearity and \(L_2\)-continuity it extends to every element of the closed linear span \(\mathcal{T}_\eta\). Hence \(\varphi \perp \mathcal{T}_\eta\), i.e., \(\varphi \in \mathcal{T}_\eta^\perp\). \(\square\)

The distinguishing property of the efficient influence function is not membership in \(\mathcal{T}_\eta^\perp\) (which is automatic) but rather that it is the unique influence function lying in the full tangent space \(\mathcal{T}\).

Remark: Causal Example

For the ATE under consistency, conditional exchangeability, and positivity, the causal estimand is identified with the observed-data functional \(\tau(P) = \E_P[\mu_1(X) - \mu_0(X)]\), \(\mu_t(X) = \E_P[Y \mid T{=}t,\, X]\). In the nonparametric observed-data model for \(O = (X, T, Y)\), \(\mathcal{T} = L_2^0(P)\), and \(\mathcal{T}_\eta\) is the closed linear span of scores of submodels along which the pathwise derivative of \(\tau\) vanishes. Section C.7 carries out the construction in detail and shows that \(\mathcal{T}_\eta^\perp\) is one-dimensional, spanned by the AIPW influence function; this is the formal content of the claim in Chapter 10 that \(\varphi^*\) is uniquely determined.

C.6 The Canonical Gradient and the Efficiency Bound

Every influence function lies in \(\mathcal{T}_\eta^\perp\). What distinguishes a single influence function within this set? The answer is orthogonality to \(\mathcal{T}^\perp\), equivalently membership in the full tangent space \(\mathcal{T}\): among all influence functions, there is a unique one lying in \(\mathcal{T}\), and it has the smallest variance.

Theorem: Canonical Gradient

Let \(\Psi\) be pathwise differentiable at \(P\) with non-empty influence function set \(\mathrm{IF}(\Psi, P)\). Then:

  1. Any two influence functions differ by an element of \(\mathcal{T}^\perp\): for \(\varphi_1, \varphi_2 \in \mathrm{IF}(\Psi, P)\), \(\varphi_1 - \varphi_2 \in \mathcal{T}^\perp\).
  2. There exists a unique \(\varphi^* \in \mathrm{IF}(\Psi, P)\) with \(\varphi^* \in \mathcal{T}\), namely \(\varphi^* = \Pi[\varphi \mid \mathcal{T}]\) for any \(\varphi \in \mathrm{IF}(\Psi, P)\).
  3. For every \(\varphi \in \mathrm{IF}(\Psi, P)\): \(\E_P[\varphi(O)^2] \geq \E_P[\varphi^*(O)^2]\), with equality iff \(\varphi = \varphi^*\) in \(L_2(P)\).

(i) For \(\varphi_1, \varphi_2 \in \mathrm{IF}(\Psi, P)\) and every score \(S\) of a regular submodel, \(\E_P[(\varphi_1 - \varphi_2)S] = \partial_\varepsilon\Psi|_0 - \partial_\varepsilon\Psi|_0 = 0\). By linearity and \(L_2\)-continuity, \(\E_P[(\varphi_1 - \varphi_2)g] = 0\) for every \(g \in \mathcal{T}\), i.e., \(\varphi_1 - \varphi_2 \in \mathcal{T}^\perp\).

(ii) Fix any \(\varphi \in \mathrm{IF}(\Psi, P)\) and set \(\varphi^* := \Pi[\varphi \mid \mathcal{T}]\), so \(\varphi - \varphi^* \in \mathcal{T}^\perp\). For any score \(S \in \mathcal{T}\): \(\E_P[\varphi^* S] = \E_P[\varphi S] - \E_P[(\varphi - \varphi^*)S] = \E_P[\varphi S] = \partial_\varepsilon\Psi(P_\varepsilon)|_0\), so \(\varphi^* \in \mathrm{IF}(\Psi, P)\). Uniqueness: if \(\tilde\varphi^* \in \mathrm{IF}(\Psi, P) \cap \mathcal{T}\), then by (i), \(\varphi^* - \tilde\varphi^* \in \mathcal{T}^\perp \cap \mathcal{T} = \{0\}\).

(iii) Write \(\varphi = \varphi^* + (\varphi - \varphi^*)\) with the two summands orthogonal in \(L_2(P)\) (since \(\varphi^* \in \mathcal{T}\) and \(\varphi - \varphi^* \in \mathcal{T}^\perp\)). The Pythagorean identity gives \(\E_P[\varphi^2] = \E_P[(\varphi^*)^2] + \E_P[(\varphi - \varphi^*)^2] \geq \E_P[(\varphi^*)^2]\), with equality iff \(\varphi = \varphi^*\). \(\square\)
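The Pythagorean step in part (iii) is transparent in an empirical \(L_2(P)\) space. A sketch (illustration; the tangent space and candidate gradient are assumed toy choices): with a one-dimensional \(\mathcal{T} = \mathrm{span}\{S\}\), projecting a candidate gradient onto \(\mathcal{T}\) strictly reduces its second moment whenever the \(\mathcal{T}^\perp\) component is non-zero:

```python
import numpy as np

rng = np.random.default_rng(5)
o = rng.normal(size=300_000)             # i.i.d. draws from P = N(0, 1)
inner = lambda a, b: float(np.mean(a * b))

S = o                                    # spans the (toy) tangent space T
phi = o + 0.5 * (o**2 - 1)               # candidate gradient; o**2 - 1 is in T-perp
phi_star = (inner(phi, S) / inner(S, S)) * S   # Pi[phi | T]; here ~ o
resid = phi - phi_star

print(inner(resid, S))                   # residual orthogonal to T
print(inner(phi, phi), inner(phi_star, phi_star))  # E[phi^2] > E[(phi*)^2]
```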

Definition: Efficient Influence Function and Efficiency Bound

The unique element \(\varphi^* \in \mathrm{IF}(\Psi, P) \cap \mathcal{T}\) is called the efficient influence function (EIF), or canonical gradient, of \(\Psi\) at \(P\). Its variance \(V^*(\Psi, P) = \E_P[\varphi^*(O)^2]\) is the semiparametric efficiency bound for estimating \(\Psi(P)\) in the model \(\mathcal{P}\).

Remark: Role of the Nuisance Tangent Space

Since every influence function already lies in \(\mathcal{T}_\eta^\perp\), the EIF can equivalently be described as the unique influence function in \(\mathcal{T} \cap \mathcal{T}_\eta^\perp\). This intersection contains exactly the directions in \(\mathcal{T}\) that are orthogonal to nuisance perturbations — directions along which \(\psi\) genuinely moves. In practice, the EIF is often constructed by taking a candidate function in \(\mathcal{T}\) (such as an unbiased estimating function for \(\psi\)) and subtracting its projection onto \(\mathcal{T}_\eta\) to remove the nuisance components; this is the projection view exploited in Chapter 11.

Theorem: Asymptotic Efficiency Bound

Let \(\hat\psi_n\) be a regular asymptotically linear estimator of \(\Psi(P)\) in the model \(\mathcal{P}\). Then \(\mathrm{AVar}(\sqrt{n}(\hat\psi_n - \psi)) \geq V^*(\Psi, P)\), and the bound is achieved if and only if the influence function of \(\hat\psi_n\) is \(\varphi^*\) in \(L_2(P)\).

For regular asymptotically linear estimators, the variance bound follows directly from the Canonical Gradient Theorem (iii) applied to the estimator’s influence function. The Hájek–Le Cam convolution theorem extends the lower-bound interpretation beyond the asymptotically linear class: for any regular estimator \(\hat\psi_n\), \(\sqrt{n}(\hat\psi_n - \psi) \rightsquigarrow Z + W\) where \(Z \sim N(0, V^*)\) and \(W\) is independent of \(Z\), so the variance bound persists by taking variances. A precise statement requires the local asymptotic normality framework of Vaart (1998, secs. 25.3–25.6).

Remark: Why the EIF Appears in Concrete Formulas

The Canonical Gradient Theorem explains why specific objects such as the AIPW influence function look “constructed” rather than guessed. In concrete causal examples one may start from an unbiased estimating function such as the IPW score \(U_{\mathrm{IPW}}(O) = TY/\pi(X) - (1-T)Y/(1-\pi(X)) - \tau\). For the nonparametric ATE model, \(U_{\mathrm{IPW}}\) already lies in \(\mathcal{T} = L_2^0(P)\) and is unbiased when \(\pi\) is known, yet it fails to satisfy Equation C.1 for \(\tau(P)\) when \(\pi\) is unknown — because perturbations of \(\pi\) contribute first-order terms to \(\partial_\varepsilon\E_{P_\varepsilon}[U_{\mathrm{IPW}}(O)]|_0\) that the EIF must absorb. The EIF is obtained by solving the pathwise-derivative equation directly, or equivalently by constructing the canonical gradient within \(\mathcal{T} \cap \mathcal{T}_\eta^\perp\); for the nonparametric ATE this calculation produces the AIPW influence function.

C.7 Worked Example: The ATE Functional

This section makes the geometric machinery concrete by carrying out the EIF derivation for the ATE under the nonparametric observed-data model. The end product is the AIPW influence function. The value of the derivation lies in showing how it arises directly from the Canonical Gradient Theorem as the Riesz representer in \(L_2^0(P)\) of the pathwise derivative of \(\tau\) — recovering the AIPW formula from first principles.

Setup. The observed data are \(O = (X, T, Y)\). Under consistency, conditional exchangeability, and positivity (Chapter 3), the ATE is identified with \(\tau(P) = \E_P[\mu_1(X) - \mu_0(X)]\), \(\mu_t(X) = \E_P[Y \mid T{=}t, X]\). Write \(\pi(X) = P(T{=}1 \mid X)\). The model \(\mathcal{P}\) is nonparametric: no restrictions are placed on the joint law of \((X, T, Y)\) beyond positivity \(0 < \pi(X) < 1\) a.s. Hence \(\mathcal{T} = L_2^0(P)\).

Score factorization. The joint density factors as \(p(x, t, y) = p_X(x)\cdot p_{T\mid X}(t\mid x)\cdot p_{Y\mid T,X}(y\mid t,x)\). Along any regular submodel the score decomposes additively: \[S(O) = S_X(X) + S_T(T\mid X) + S_Y(Y\mid T,X),\] with \(\E_P[S_X(X)] = 0\), \(\E_P[S_T(T\mid X)\mid X] = 0\), \(\E_P[S_Y(Y\mid T,X)\mid T,X] = 0\). Define the closed subspaces of \(L_2^0(P)\): \[\mathcal{H}_X = \{a(X) : \E_P[a(X)] = 0\}, \quad \mathcal{H}_T = \{b(X)(T-\pi(X)) : b \in L_2(P_X)\}, \quad \mathcal{H}_Y = \{c(O) : \E_P[c(O)\mid T,X] = 0\}.\]

A short conditioning calculation shows these three subspaces are pairwise orthogonal in \(L_2^0(P)\). For instance, for \(a(X) \in \mathcal{H}_X\) and \(b(X)(T-\pi(X)) \in \mathcal{H}_T\): \(\E_P[a(X)\cdot b(X)(T-\pi(X))] = \E_P[a(X)b(X)\cdot\E_P\{T-\pi(X)\mid X\}] = 0\), since \(\E_P[T\mid X] = \pi(X)\). The other two pairs follow analogously by conditioning on \(X\) and on \((T,X)\) respectively. Joint spanning follows by writing any \(g \in L_2^0(P)\) as a telescoping sum of conditional expectations and centering each piece. Hence: \[L_2^0(P) = \mathcal{H}_X \oplus \mathcal{H}_T \oplus \mathcal{H}_Y. \tag{C.3}\]
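The pairwise orthogonality can be verified by simulation (an illustration under an assumed data-generating law: \(X \sim N(0,1)\), \(\pi(X) = \mathrm{expit}(X)\), \(Y = X + T + \text{standard-normal noise}\), so that \(\mu_t(X) = X + t\) exactly):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400_000
X = rng.normal(size=n)                   # assumed data-generating law
pi = 1 / (1 + np.exp(-X))                # propensity score P(T=1 | X)
T = rng.binomial(1, pi)
Y = X + T + rng.normal(size=n)           # so mu_t(X) = X + t

a = X                                    # element of H_X (mean zero)
b = T - pi                               # element of H_T (with b(X) = 1)
c = Y - (X + T)                          # element of H_Y: Y - mu_T(X)

inner = lambda u, v: float(np.mean(u * v))
print(inner(a, b), inner(a, c), inner(b, c))   # all ~ 0
```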

The pathwise derivative. Differentiating \(\tau(P_\varepsilon) = \int(\mu_{1,\varepsilon}(x) - \mu_{0,\varepsilon}(x))\,p_{X,\varepsilon}(x)\,d\mu(x)\) at \(\varepsilon = 0\) and applying the product rule gives: \[\left.\frac{\partial}{\partial\varepsilon}\tau(P_\varepsilon)\right|_0 = \underbrace{\int\{\mu_1(x)-\mu_0(x)\}\,S_X(x)\,p(x)\,d\mu(x)}_{(I)} + \underbrace{\int\bigl(\partial_\varepsilon\mu_{1,\varepsilon}(x) - \partial_\varepsilon\mu_{0,\varepsilon}(x)\bigr)\big|_0 p(x)\,d\mu(x)}_{(II)}.\]

No \(S_T\) term appears: \(\tau(P)\) is a functional of \(p_X\) and \(p_{Y\mid T,X}\) only, so perturbations of the treatment law do not affect \(\tau\) to first order. This already shows that \(\mathcal{H}_T \subset \mathcal{T}_\eta\).

Term (I). Since \(\E_P[S_X] = 0\), we may subtract any constant from \(\mu_1(X) - \mu_0(X)\); the choice \(\tau = \E_P[\mu_1(X) - \mu_0(X)]\) places the result in \(\mathcal{H}_X\): \[(I) = \E_P[\{\mu_1(X)-\mu_0(X)-\tau\}\,S_X(X)] = \langle \varphi_X,\,S_X\rangle_P, \quad \varphi_X(X) := \mu_1(X)-\mu_0(X)-\tau \in \mathcal{H}_X.\]

Term (II). Fix \(t \in \{0,1\}\) and compute: \[\left.\partial_\varepsilon\mu_{t,\varepsilon}(x)\right|_0 = \E_P[(Y-\mu_t(x))\,S_Y(Y\mid t,X)\mid T{=}t,\,X{=}x],\] using \(\E_P[S_Y(Y\mid t,x)\mid T{=}t,X{=}x] = 0\) to subtract \(\mu_t(x)\). Write \(\pi_t(x) = P(T{=}t\mid X{=}x)\). Using the identity \(p(x)\,p(y\mid t,x) = p(x,T{=}t,y)/\pi_t(x)\) and applying it for \(t=1\) and \(t=0\): \[(II) = \E_P\!\left[\left\{\frac{T(Y-\mu_1(X))}{\pi(X)} - \frac{(1-T)(Y-\mu_0(X))}{1-\pi(X)}\right\}S_Y(Y\mid T,X)\right] = \langle\varphi_Y,\,S_Y\rangle_P,\] with \(\varphi_Y(O) := T(Y-\mu_1(X))/\pi(X) - (1-T)(Y-\mu_0(X))/(1-\pi(X)) \in \mathcal{H}_Y\).

The canonical gradient. Combining and using the orthogonality of Equation C.3: \[\left.\frac{\partial}{\partial\varepsilon}\tau(P_\varepsilon)\right|_0 = \langle\varphi^*,\,S\rangle_P, \qquad \varphi^*(O) := \varphi_X(X) + \varphi_Y(O), \tag{C.4}\] for every score \(S = S_X + S_T + S_Y \in \mathcal{T} = L_2^0(P)\). By Definition Equation C.1, \(\varphi^*\) is an influence function of \(\tau\). By construction \(\varphi^* \in \mathcal{H}_X \oplus \mathcal{H}_Y \subset L_2^0(P) = \mathcal{T}\), so \(\varphi^*\) is the canonical gradient. Writing it out explicitly: \[\varphi^*(O) = \mu_1(X)-\mu_0(X)-\tau + \frac{T(Y-\mu_1(X))}{\pi(X)} - \frac{(1-T)(Y-\mu_0(X))}{1-\pi(X)}, \tag{C.5}\] which is the AIPW influence function derived in Chapter 10.
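A quick numerical sanity check on Equation C.5 (an illustration under an assumed data-generating law: \(X \sim N(0,1)\), \(\pi(X) = \mathrm{expit}(X)\), \(Y = X + T + \text{standard-normal noise}\), for which \(\mu_t(X) = X + t\) and \(\tau = 1\)): evaluated at the truth, \(\varphi^*\) has mean zero, so \(\tau + n^{-1}\sum_i \varphi^*(O_i)\) recovers \(\tau\):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000
X = rng.normal(size=n)                   # assumed data-generating law
pi = 1 / (1 + np.exp(-X))                # true propensity score
T = rng.binomial(1, pi)
Y = X + T + rng.normal(size=n)

mu1, mu0 = X + 1, X                      # true outcome regressions
tau = 1.0                                # E[mu1(X) - mu0(X)]

phi_star = (mu1 - mu0 - tau
            + T * (Y - mu1) / pi
            - (1 - T) * (Y - mu0) / (1 - pi))
print(phi_star.mean())                   # ~ 0: phi* lies in L2^0(P)
print(tau + phi_star.mean())             # one-step recovery of tau
```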

Dimension of \(\mathcal{T}_\eta^\perp\). The score-to-derivative map \(\Lambda: \mathcal{T} \to \mathbb{R}\), \(\Lambda(S) = \partial_\varepsilon\tau(P_\varepsilon)|_0\), is by Equation C.4 a non-zero continuous linear functional on \(L_2^0(P)\) with Riesz representer \(\varphi^*\). Its kernel is exactly \(\mathcal{T}_\eta\), which by the codimension-and-uniqueness fact of Section C.1 has codimension one. Hence \(\mathcal{T}_\eta^\perp = \mathrm{span}\{\varphi^*\}\) is one-dimensional, as asserted in the Causal Example remark above.

Reading off the efficiency bound. By the definition of the EIF, the semiparametric efficiency bound for estimating \(\tau(P)\) in the nonparametric model is \(V^*(\tau,P) = \E_P[\varphi^*(O)^2]\). Standard manipulations (iterated expectations on each summand of \(\varphi^*\) using the orthogonality of \(\mathcal{H}_X, \mathcal{H}_Y\)) decompose this as: \[V^*(\tau,P) = \E_P\!\left[\frac{\sigma_1^2(X)}{\pi(X)} + \frac{\sigma_0^2(X)}{1-\pi(X)} + \{\mu_1(X)-\mu_0(X)-\tau\}^2\right],\] with \(\sigma_t^2(X) = \mathrm{Var}_P(Y\mid T{=}t,X)\) — the classical semiparametric variance bound for the ATE (Robins et al. 1994). By the Asymptotic Efficiency Bound Theorem, any regular asymptotically linear estimator of \(\tau\) achieves this bound exactly when its influence function equals \(\varphi^*\) in \(L_2(P)\), the analytic statement underpinning the asymptotic optimality of the AIPW estimator established in Chapter 11.
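The variance-bound formula can also be checked by Monte Carlo (an illustration under an assumed data-generating law: \(X \sim N(0,1)\), \(\pi(X) = \mathrm{expit}(X)\), \(Y = X + T + \text{standard-normal noise}\)). For this law \(\sigma_t^2(X) = 1\), \(\mu_1(X) - \mu_0(X) - \tau = 0\), and the bound reduces to \(\E[1/\{\pi(X)(1-\pi(X))\}] = 2 + 2e^{1/2} \approx 5.30\); the scaled variance of the oracle estimator whose influence function is \(\varphi^*\) should match it over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 2_000, 2_000
est = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=n)               # assumed data-generating law
    pi = 1 / (1 + np.exp(-X))
    T = rng.binomial(1, pi)
    Y = X + T + rng.normal(size=n)
    mu1, mu0 = X + 1, X                  # oracle nuisances
    est[r] = np.mean(mu1 - mu0
                     + T * (Y - mu1) / pi
                     - (1 - T) * (Y - mu0) / (1 - pi))

V_star = 2 + 2 * np.exp(0.5)             # closed-form bound, ~ 5.30
print(est.mean(), n * est.var(), V_star) # mean ~ tau = 1, scaled variance ~ bound
```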

Bibliographic Notes

The modern formulation of pathwise differentiability and tangent spaces is developed in Bickel et al. (1993) and Vaart (1998, chap. 25); the latter is the standard reference for the convolution theorem and local asymptotic normality. Tsiatis (2006) gives a treatment oriented specifically toward missing data and causal inference, and is a natural companion to the material developed here.

Bickel, Peter J., Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. 1993. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed.” Journal of the American Statistical Association 89 (427): 846–66.
Tsiatis, Anastasios A. 2006. Semiparametric Theory and Missing Data. Springer.
Vaart, Aad W. van der. 1998. Asymptotic Statistics. Cambridge University Press.