4 Potential Outcomes and Adjustment
4.1 Motivation: A Third Language for Causality
The first three chapters developed causal inference primarily in the languages of structural equations and directed acyclic graphs. Those languages are especially effective for expressing interventions syntactically: the SEM shows how an intervention replaces an assignment mechanism, and the DAG shows how it deletes incoming arrows. This chapter introduces a third language: the potential outcomes framework of Neyman (1923) and Rubin (1974).
Potential outcomes provide a direct language for defining causal estimands. Quantities such as the average treatment effect, the average treatment effect on the treated, and related contrasts are most naturally written in terms of counterfactual outcomes \(Y(1)\), \(Y(0)\), and more generally \(Y(t)\). For this reason, the potential outcomes framework has become standard in statistics, biostatistics, epidemiology, and much of econometrics.
Assumptions such as ignorability, positivity, and exclusion can be stated in this language, but DAGs and the do-operator often make their structural content more transparent. The two frameworks should be viewed as complementary rather than competing: potential outcomes define the causal quantities of interest, while DAGs and the do-calculus clarify identification.
4.2 The Neyman–Rubin Potential Outcomes Framework
4.2.1 The Potential Outcome
\(Y_i(1)\) is the outcome unit \(i\) would achieve under treatment; \(Y_i(0)\) is the outcome under control. Only one of these is ever observed for any given unit. The unobserved potential outcome is called the counterfactual.
4.2.2 SUTVA and Consistency
Under SUTVA, the observed outcome equals the potential outcome at the treatment actually received: \[Y_i = Y_i(T_i). \tag{4.1}\]
This consistency equation is the bridge between the potential outcome world and the observed data world. It fails whenever SUTVA fails: if treatment spills over between units (e.g., vaccination provides herd immunity), or if the treatment label \(T=1\) covers multiple distinct interventions.
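The consistency equation can be made concrete with a small simulation. The sketch below is purely illustrative (hypothetical, simulated potential outcomes with a constant unit-level effect of 2): both \(Y_i(0)\) and \(Y_i(1)\) are generated for every unit, but only \(Y_i(T_i)\) is revealed as the observed outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical potential outcomes: both Y(0) and Y(1) "exist" for every
# unit, but only one is ever revealed by the data.
y0 = rng.normal(0.0, 1.0, size=n)
y1 = y0 + 2.0                         # constant unit-level effect of 2
t = rng.integers(0, 2, size=n)        # treatment actually received

# Consistency (Eq. 4.1): the observed outcome is the potential outcome
# at the treatment actually received; the other one is the counterfactual.
y_obs = np.where(t == 1, y1, y0)

for i in range(n):
    counterfactual = y0[i] if t[i] == 1 else y1[i]
    print(f"unit {i}: T={t[i]}, observed Y={y_obs[i]:+.2f}, "
          f"unobserved Y({1 - t[i]})={counterfactual:+.2f}")
```

The fundamental problem of causal inference is visible in the output: each unit contributes one observed value and one permanently missing counterfactual.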
4.2.3 Connection to the Do-Operator
Claim. Under SUTVA and the structural causal semantics adopted in these notes, the potential outcome \(Y(t)\) has the same distribution as \(Y\) under the intervention \(\doop(T{=}t)\).
Proof sketch. In the mutilated SEM \(\mathcal{G}_{\overline{T}}\), the structural equation for \(T\) is replaced by \(T := t\) for every unit. The structural model then determines \(Y\) from \(t\) and unit \(i\)’s own background variables — which is precisely what the potential outcomes framework defines as \(Y_i(t)\). SUTVA’s no-interference condition ensures \(Y_i(t)\) does not depend on other units’ treatment values, so the marginal distribution of \(Y\) in \(\mathcal{G}_{\overline{T}}\) equals the marginal distribution of \(Y(t)\). \(\square\)
This equivalence lets us move freely between two notational traditions. When defining estimands such as \(\E[Y(1) - Y(0)]\), potential-outcome notation is most natural. When proving identification results from a graph, the do-operator is often more transparent: \(P(y \mid \doop(t)) \neq P(y \mid T{=}t)\) whenever \(T\) is endogenous (established in Chapter 1), whereas \(Y(t)\) carries no syntactic marker for this gap.
4.3 Causal Estimands
4.3.1 The Average Treatment Effect and Its Relatives
The average treatment effect (ATE) and the average treatment effect on the treated (ATT) are defined as \[\tau_{\mathrm{ATE}} = \E[Y(1) - Y(0)], \qquad \tau_{\mathrm{ATT}} = \E[Y(1) - Y(0) \mid T{=}1]. \tag{4.2}\]
The ATE and ATT coincide only when the treatment effect does not depend on who selected into treatment, i.e., when \(\E[Y(t) \mid T] = \E[Y(t)]\) for \(t \in \{0,1\}\). In a randomized experiment this holds by design.
Neither quantity is directly observable. The naïve estimator \(\hat\tau_{\mathrm{naive}} = \bar{Y}_{T=1} - \bar{Y}_{T=0}\) estimates \(\E[Y \mid T{=}1] - \E[Y \mid T{=}0] \neq \E[Y(1)] - \E[Y(0)]\), with the gap being the selection bias induced by endogeneity.
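A short simulation makes this gap concrete. The data-generating process below is hypothetical (a single confounder \(X\) drives both treatment uptake and the baseline outcome, with a true ATE of 1); the naive difference in means overstates the effect because treated units have systematically higher \(Y(0)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical confounded data-generating process: X raises both the
# probability of treatment and the baseline outcome Y(0).
x = rng.normal(0.0, 1.0, size=n)
p_treat = 1.0 / (1.0 + np.exp(-2.0 * x))         # propensity increasing in X
t = (rng.uniform(size=n) < p_treat).astype(int)
y0 = x + rng.normal(size=n)
y1 = y0 + 1.0                                    # true ATE = 1
y = np.where(t == 1, y1, y0)                     # consistency (Eq. 4.1)

ate_true = np.mean(y1 - y0)
tau_naive = y[t == 1].mean() - y[t == 0].mean()

print(f"true ATE   = {ate_true:.3f}")
print(f"naive est. = {tau_naive:.3f}")           # inflated by selection bias
```

The excess of the naive estimate over the true ATE is exactly the selection-bias term \(\E[Y(0) \mid T{=}1] - \E[Y(0) \mid T{=}0]\), which is positive here because high-\(X\) units both select into treatment and have higher baseline outcomes.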
4.3.2 Causal Estimands as Functionals of the Structural Equation
In the SEM framework, the outcome is determined by a structural equation \(Y = g(T, \mathbf{X}, U_Y)\), where \(U_Y\) collects all sources of variation not accounted for by \((T, \mathbf{X})\). The potential outcome under \(\doop(T{=}t)\) is: \[Y_i(t) = g(t, \mathbf{X}_i, U_{Y,i}). \tag{4.3}\]
Substituting into the definitions: \[\tau_{\mathrm{ATE}} = \E\!\left[g(1, \mathbf{X}, U_Y) - g(0, \mathbf{X}, U_Y)\right], \qquad \tau_{\mathrm{ATT}} = \E\!\left[g(1, \mathbf{X}, U_Y) - g(0, \mathbf{X}, U_Y) \mid T{=}1\right]. \tag{4.4}\]
The linear SEM as a special case. In the Gaussian linear SEM \(Y = \beta T + \boldsymbol{\gamma}^\top\mathbf{X} + \varepsilon\), the unit-level effect is \(Y_i(1) - Y_i(0) = \beta\) for every unit. Consequently, \(\tau_{\mathrm{ATE}} = \tau_{\mathrm{ATT}} = \beta\). The structural coefficient \(\beta\) is the average treatment effect. This homogeneity is a special property of the linear additive model, not a general feature.
Heterogeneous effects. In the nonparametric SEM, the unit-level effect \(g(1, \mathbf{X}_i, U_{Y,i}) - g(0, \mathbf{X}_i, U_{Y,i})\) varies across units. The ATE and ATT differ whenever treatment selection correlates with individual effect size — i.e., whenever units who benefit more also tend to self-select into treatment.
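The following minimal simulation illustrates this divergence under a hypothetical heterogeneous-effects model (not taken from the text): each unit's effect is \(\alpha + u\), and units with larger \(u\) are more likely to take treatment, so selection correlates with effect size.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical model: unit-level effect is alpha + u, and a larger u
# raises the probability of selecting into treatment.
alpha = 1.0
u = rng.normal(0.0, 1.0, size=n)                 # individual effect deviation
p_treat = 1.0 / (1.0 + np.exp(-u))               # selection on the effect
t = (rng.uniform(size=n) < p_treat).astype(int)

effect = alpha + u                               # Y(1) - Y(0) per unit
tau_ate = effect.mean()                          # averages to alpha
tau_att = effect[t == 1].mean()                  # exceeds alpha: treated benefit more

print(f"ATE = {tau_ate:.3f}, ATT = {tau_att:.3f}")
```

Because \(\E[U \mid T{=}1] > 0\) under this selection mechanism, the ATT exceeds the ATE, exactly the pattern described above.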
4.4 Ignorability, Positivity, and Adjustment
4.4.1 Strong Ignorability
The central identifying assumption in observational studies is that treatment assignment is as good as random after conditioning on observed covariates \(X\). Formally, strong ignorability requires conditional exchangeability, \((Y(0), Y(1)) \indep T \mid X\), together with the positivity condition of Section 4.4.4.
4.4.2 Adjustment Formula under Ignorability
Under strong ignorability, the ATE is identified by the standardization formula: \[\tau_{\mathrm{ATE}} = \E_X\!\left[\E[Y \mid T{=}1, X] - \E[Y \mid T{=}0, X]\right]. \tag{4.5}\]
Derivation. For each treatment level \(t\), by the law of iterated expectations: \[\E[Y(t)] = \E_X\!\left[\E[Y(t) \mid X]\right].\] Under unconfoundedness, \(\E[Y(t) \mid X] = \E[Y(t) \mid T{=}t, X]\). By consistency, \(\E[Y(t) \mid T{=}t, X] = \E[Y \mid T{=}t, X]\). Therefore \(\E[Y(t)] = \E_X[\E[Y \mid T{=}t, X]]\). Applying once for \(t=1\) and once for \(t=0\) and taking the difference yields Equation 4.5. \(\square\)
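When \(X\) is discrete, Equation 4.5 translates directly into a plug-in estimator: compute the treated-vs-control contrast within each stratum of \(X\), then average over the marginal distribution of \(X\). The sketch below uses a hypothetical binary confounder with a true ATE of 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Hypothetical discrete-confounder example: X in {0,1} drives both
# treatment assignment and the outcome level.
x = rng.integers(0, 2, size=n)
p_treat = np.where(x == 1, 0.8, 0.2)             # X drives treatment
t = (rng.uniform(size=n) < p_treat).astype(int)
y = 2.0 * t + 3.0 * x + rng.normal(size=n)       # true ATE = 2

# Plug-in standardization (Eq. 4.5): average the within-stratum
# contrast over the marginal distribution of X.
tau_hat = 0.0
for xv in (0, 1):
    mask = x == xv
    contrast = y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()
    tau_hat += mask.mean() * contrast

naive = y[t == 1].mean() - y[t == 0].mean()
print(f"standardization: {tau_hat:.3f}  (truth 2.0)")
print(f"naive:           {naive:.3f}  (biased upward)")
```

Note the weighting: each stratum's contrast is weighted by \(P(X{=}x)\), the marginal distribution, not by the within-treatment-arm distribution of \(X\); using the wrong weights would reintroduce the confounding.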
4.4.3 Back-Door Interpretation
The potential-outcomes assumption \((Y(0), Y(1)) \indep T \mid X\) is the counterfactual expression of the idea that, after conditioning on \(X\), treatment assignment carries no residual information about the outcome that would be observed under intervention \(T{=}t\). In DAG language, the closely related condition is that \(X\) blocks all back-door paths from \(T\) to \(Y\).
Proof sketch. The target statement \(Y(t) \indep T \mid X\) is cross-world: it mixes the counterfactual \(Y(t)\) with the factual treatment \(T\), and is not a d-separation statement in \(\mathcal{G}\) itself. The rigorous translation proceeds through the single-world intervention graph \(\mathcal{G}(t)\) (Appendix B). Under the NPSEM semantics, \(Y(t)\) and \(T\) depend on disjoint exogenous errors together with their own ancestors, and the back-door criterion is precisely the d-separation condition on \(\mathcal{G}(t)\) that makes these conditionally independent given \(X\). See Appendix B for the explicit SWIG derivation. \(\square\)
4.4.4 Overlap and Positivity
Conditional exchangeability identifies the ATE only where both treatment arms actually occur. Positivity (overlap) requires \(0 < P(T{=}1 \mid X{=}x) < 1\) for every \(x\) in the support of \(X\); where it fails, one of the conditional expectations in Equation 4.5 is undefined, and the ATE is not identified without extrapolation.
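A routine empirical diagnostic is to estimate \(P(T{=}1 \mid X)\) within covariate strata and flag strata where one arm is absent. The example below is hypothetical: five discrete strata, one of which has zero treatment probability by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Hypothetical overlap diagnostic: five covariate strata, where
# stratum 4 has treatment probability 0 (a structural positivity violation).
x = rng.integers(0, 5, size=n)
p = np.array([0.5, 0.5, 0.5, 0.02, 0.0])[x]
t = (rng.uniform(size=n) < p).astype(int)

for xv in range(5):
    mask = x == xv
    n_treated = int(t[mask].sum())
    e_hat = t[mask].mean()                 # empirical propensity in stratum
    note = "  <- positivity violated: no treated units" if n_treated == 0 else ""
    print(f"stratum {xv}: P(T=1 | X={xv}) ~= {e_hat:.3f}{note}")
```

Near-violations (stratum 3 here) are in some ways more dangerous than exact ones: the contrast is formally defined but rests on a handful of treated units, so any adjustment estimator becomes extremely variable in that stratum.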
4.5 Where the Frameworks Agree and Diverge
| Task | Potential Outcomes | Do-Calculus / DAG | SEM |
|---|---|---|---|
| Define ATE, ATT | ✓ Most natural notation | Via \(\E[Y \mid \doop(t)]\) | Via structural equations |
| Encode causal assumptions | Typically informal; graph implicit | ✓ Explicit directed edges | ✓ Structural equations |
| Read conditional independence | Requires auxiliary graph | ✓ d-separation | Only via the associated graph |
| Identification from data | Via ignorability | ✓ Back-door, front-door, do-calculus | ✓ Via exclusion restrictions |
| Likelihood construction | Estimating equations / semiparametric | Observational functionals only | ✓ Structural model |
| Cross-world assumptions | ✓ Natural to state | See Appendix B | Implicit in model |
| Standard in statistics | ✓ Dominant | Growing rapidly | Econometrics |
4.6 Summary
Causal inference = counterfactual questions + graphical assumptions + statistical estimation.
- The potential outcome \(Y(t)\) is the outcome that would be observed under intervention \(T = t\). Under consistency, no interference, and the structural causal semantics adopted in these notes, the observed outcome satisfies \(Y_i = Y_i(T_i)\), and \(Y(t)\) has the same distribution as \(Y \mid \doop(T{=}t)\).
- The ATE and ATT are averages of the unit-level causal effect \(g(1, \mathbf{X}_i, U_{Y,i}) - g(0, \mathbf{X}_i, U_{Y,i})\) over the full population and the treated subpopulation respectively. In the linear SEM, both equal \(\beta\). They diverge under heterogeneous effects when treatment selection correlates with individual effect size.
- Strong ignorability \((Y(0), Y(1)) \indep T \mid X\) with overlap identifies the ATE via Equation 4.5. The back-door criterion provides a sufficient graphical condition for the conditional exchangeability assumption.
- Single world intervention graphs (Richardson and Robins 2014) provide a formal graphical representation of counterfactual variables, making ignorability a d-separation statement (Appendix B).
- The frameworks are complementary: potential outcomes define estimands; the do-calculus identifies them. The back-door criterion connects the two by providing the graphical condition under which adjustment recovers the causal effect.
From ignorability to randomization. In observational studies, ignorability must be justified by substantive knowledge encoded in a causal graph. Randomized experiments provide a different solution: when treatment is assigned randomly, \((Y(0), Y(1)) \indep T\) holds by design, guaranteeing ignorability without any covariate adjustment. Chapter 5 studies this design.
Identification vs. estimation. Chapters 1–4 have focused on identification: whether \(\tau = \E[Y(1) - Y(0)]\) can be written as a functional \(\Phi(P(Y,T,X))\) of the observed distribution. Estimation — constructing \(\hat\tau\) from a finite sample to approximate \(\Phi(P)\) — is studied in Part III, covering regression adjustment, IPW, doubly robust estimators, and instrumental variables.
4.7 Problems
1. SUTVA and consistency. Suppose \(n = 3\) units receive binary treatments \((T_1, T_2, T_3)\) and unit \(i\)’s outcome may depend on all three treatments: \(Y_i = Y_i(T_1, T_2, T_3)\).
- How many potential outcomes does unit 1 have? Write them out explicitly.
- SUTVA imposes no interference: \(Y_i(T_1, T_2, T_3) = Y_i(T_i)\). How many distinct potential outcomes remain?
- Give a real-world example where no-interference plausibly holds and one where it plausibly fails. In the latter case, explain what identification strategy (if any) remains available.
2. ATE, ATT, and selection bias. Let \((Y(0), Y(1), T) \sim P\) with \(P(T{=}1) = 0.5\), \(\E[Y(1)] = 3\), \(\E[Y(0)] = 1\), \(\E[Y(1) \mid T{=}1] = 4\), \(\E[Y(0) \mid T{=}1] = 2\), \(\E[Y(1) \mid T{=}0] = 2\), \(\E[Y(0) \mid T{=}0] = 0\).
- Compute the ATE and ATT. Are they equal?
- Compute the naïve estimator \(\E[Y \mid T{=}1] - \E[Y \mid T{=}0]\). Decompose the gap between this and the ATE into a selection bias term and an ATT–ATE difference.
- Under what graphical condition on the DAG would the naïve estimator equal the ATE? State this condition in both potential outcome language (ignorability) and do-calculus language (back-door criterion).
3. Causal estimands from the structural equation. Consider the nonparametric SEM \(Y = g(T, X, U_Y)\) with binary \(T \in \{0,1\}\), observed covariate \(X\), and \(U_Y\) independent of \((T, X)\) in the mutilated graph.
- Write \(\tau_{\mathrm{ATE}}\) and \(\tau_{\mathrm{ATT}}\) as expectations of \(g(1, X, U_Y) - g(0, X, U_Y)\) over the appropriate distribution. Under what condition do they coincide?
- Specialize to the linear SEM \(g(t, x, u) = \alpha + \beta t + \gamma x + u\). Show that the unit-level causal effect \(Y_i(1) - Y_i(0)\) is constant across all units, and hence \(\tau_{\mathrm{ATE}} = \tau_{\mathrm{ATT}} = \beta\).
- Now consider the heterogeneous SEM \(g(t, x, u) = (\alpha + u)\,t + \gamma x\), where \(U_Y \sim \mathcal{N}(0, \sigma^2)\), \(\mathrm{Cov}(U_Y, T) = \rho\), and \(P(T{=}1) = p \in (0,1)\). Compute \(\tau_{\mathrm{ATE}}\) and \(\tau_{\mathrm{ATT}}\). Show that they differ when \(\rho \neq 0\), and interpret this difference.
- In part (c), show that the population OLS slope on \(T\) in the regression of \(Y\) on \((1, T, X)\) equals \(\alpha + \rho/p = \tau_{\mathrm{ATT}}\), not \(\tau_{\mathrm{ATE}} = \alpha\). What additional structure would be needed to identify \(\tau_{\mathrm{ATE}}\)?
4. SWIGs and ignorability. (Requires Appendix B.) Consider the DAG: \(X \to T\), \(X \to Y\), \(T \to Y\), with \(X\) fully observed.
- Construct the SWIG \(\mathcal{G}(t)\) by splitting \(T\) into its random and fixed halves. Draw the result, labeling the random half, fixed half, and the potential outcome \(Y(t)\).
- In \(\mathcal{G}(t)\), identify all paths between the random half \(T\) and \(Y(t)\). Determine which are blocked and which are open before conditioning.
- Use d-separation in \(\mathcal{G}(t)\) to verify that \((Y(t) \indep T \mid X)_{\mathcal{G}(t)}\) holds. Conclude that \(X\) satisfies the back-door criterion, confirming the result of Section 4.4.3 that the back-door criterion implies conditional ignorability.
- Now add a hidden common cause \(U \to T\), \(U \to Y\). Draw the revised SWIG. Does the ignorability argument still hold? State the correct conclusion and identify what additional structure would be needed for identification.