4  Potential Outcomes and Adjustment

Learning Objectives

By the end of this chapter, students should be able to:

  1. Define the potential outcome \(Y(t)\), state SUTVA precisely, and derive the consistency equation \(Y = Y(T)\).
  2. Define the ATE and ATT, explain how they relate to each other, and re-express them as functionals of the structural equation for \(Y\).
  3. Show that the linear SEM structural coefficient \(\beta\) equals the ATE under homogeneous effects, and explain why ATE and ATT diverge under treatment effect heterogeneity.
  4. State the ignorability assumption, interpret it graphically as a back-door condition, and explain why overlap is a necessary companion assumption.
  5. Derive the adjustment formula for the ATE under strong ignorability and positivity, and explain how the back-door criterion provides the graphical justification for conditional exchangeability.

4.1 Motivation: A Third Language for Causality

The first three chapters developed causal inference primarily in the languages of structural equations and directed acyclic graphs. Those languages are especially effective for expressing interventions syntactically: the SEM shows how an intervention replaces an assignment mechanism, and the DAG shows how it deletes incoming arrows. This chapter introduces a third language: the potential outcomes framework of Neyman (1923) and Rubin (1974).

Potential outcomes provide a direct language for defining causal estimands. Quantities such as the average treatment effect, the average treatment effect on the treated, and related contrasts are most naturally written in terms of counterfactual outcomes \(Y(1)\), \(Y(0)\), and more generally \(Y(t)\). For this reason, the potential outcomes framework has become standard in statistics, biostatistics, epidemiology, and much of econometrics.

Assumptions such as ignorability, positivity, and exclusion can be stated in this language, but DAGs and the do-operator often make their structural content more transparent. The two frameworks should be viewed as complementary rather than competing: potential outcomes define the causal quantities of interest, while DAGs and the do-calculus clarify identification.

[Course roadmap: Foundations (Chs. 1–3) → Potential Outcomes (Ch. 4, this chapter) → Designs (Chs. 5–9) → Estimation (Part III).]
Chapter 4 bridges the graphical framework of Chapters 1–3 to the applied designs of Part II and the estimation methods of Part III.

4.2 The Neyman–Rubin Potential Outcomes Framework

4.2.1 The Potential Outcome

Definition: Potential Outcome (Neyman 1923; Rubin 1974)

For each unit \(i\) and each possible treatment value \(t \in \mathcal{T}\), the potential outcome \(Y_i(t)\) is the value of the outcome that unit \(i\) would have exhibited had its treatment been set to \(t\), possibly contrary to fact.

The collection \(\{Y_i(t) : t \in \mathcal{T}\}\) is called the schedule of potential outcomes for unit \(i\). In the binary case \(\mathcal{T} = \{0,1\}\), the schedule is \((Y_i(0),\, Y_i(1))\).

\(Y_i(1)\) is the outcome unit \(i\) would achieve under treatment; \(Y_i(0)\) is the outcome under control. Only one of these is ever observed for any given unit. The unobserved potential outcome is called the counterfactual.

The Fundamental Problem of Causal Inference (Holland 1986)

For any unit \(i\), at most one potential outcome \(Y_i(t)\) is observed. The individual-level causal effect \(Y_i(1) - Y_i(0)\) is therefore never directly observable. Causal inference is an inference problem precisely because this quantity must be recovered from a population of units rather than a single unit.

4.2.2 SUTVA and Consistency

Definition: SUTVA (Rubin 1980)

The stable unit treatment value assumption has two components:

  1. No interference. The potential outcome \(Y_i(t)\) depends only on unit \(i\)’s own treatment: \(Y_i(t_1, \ldots, t_n) = Y_i(t_i)\).
  2. No hidden versions. For each treatment level \(t\), there is a single well-defined version. If two units both receive \(T=1\), the treatment is the same in both cases.

Under SUTVA, the observed outcome equals the potential outcome at the treatment actually received: \[Y_i = Y_i(T_i). \tag{4.1}\]

This consistency equation is the bridge between the potential outcome world and the observed data world. It fails whenever SUTVA fails: if treatment spills over between units (e.g., vaccination provides herd immunity), or if the treatment label \(T=1\) covers multiple distinct interventions.
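A small simulation makes Equation 4.1 concrete. The data-generating process below is hypothetical, chosen only for illustration: each unit carries a full schedule \((Y_i(0), Y_i(1))\), yet the analyst observes only \(Y_i = Y_i(T_i)\), so the unit-level effect never appears in the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Full schedule of potential outcomes -- never jointly observed in practice.
y0 = rng.normal(loc=1.0, size=n)
y1 = y0 + 2.0                     # constant unit-level effect of 2 (illustrative)

t = rng.integers(0, 2, size=n)    # realized treatment assignments

# Consistency (Equation 4.1): the observed outcome is the potential
# outcome at the treatment actually received, Y_i = Y_i(T_i).
y_obs = np.where(t == 1, y1, y0)

for i in range(n):
    counterfactual = y0[i] if t[i] == 1 else y1[i]  # the unobserved half
    print(f"unit {i}: T={t[i]}  observed Y={y_obs[i]:+.2f}  "
          f"unobserved Y({1 - t[i]})={counterfactual:+.2f}")
```

Each printed row shows one observed outcome and one counterfactual; the Fundamental Problem of Causal Inference is that the second column of the schedule is never available in real data.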

4.2.3 Connection to the Do-Operator

Proposition: Potential Outcomes and the Do-Operator

Under the structural causal model of Chapters 1–3, together with consistency and no interference (SUTVA): \[Y(t) \;\overset{d}{=}\; Y \mid \doop(T{=}t), \tag{4.2}\] so that \(P(Y(t) \le y) = P(Y \le y \mid \doop(T{=}t))\) for every \(y\). Equivalently, \(\E[Y(t)] = \E[Y \mid \doop(T{=}t)]\).

Proof sketch. In the mutilated SEM \(\mathcal{G}_{\overline{T}}\), the structural equation for \(T\) is replaced by \(T := t\) for every unit. The structural model then determines \(Y\) from \(t\) and unit \(i\)’s own background variables — which is precisely what the potential outcomes framework defines as \(Y_i(t)\). SUTVA’s no-interference condition ensures \(Y_i(t)\) does not depend on other units’ treatment values, so the marginal distribution of \(Y\) in \(\mathcal{G}_{\overline{T}}\) equals the marginal distribution of \(Y(t)\). \(\square\)

This equivalence lets us move freely between two notational traditions. When defining estimands such as \(\E[Y(1) - Y(0)]\), potential-outcome notation is most natural. When proving identification results from a graph, the do-operator is often more transparent: \(P(y \mid \doop(t)) \neq P(y \mid T{=}t)\) whenever \(T\) is endogenous (established in Chapter 1), whereas \(Y(t)\) carries no syntactic marker for this gap.

Remark

The SEM representation \(Y_i(t) = g(t, \mathbf{X}_i, U_{Y,i})\) clarifies why the two frameworks are equivalent in content but different in emphasis. The do-calculus works with the interventional distribution \(P(y \mid \doop(t))\) as a population-level object and asks when it can be recovered from observational data. The potential outcomes framework works with unit-level quantities \(Y_i(t)\) and asks what population summaries (ATE, ATT) are scientifically meaningful. The SEM ties the two together.
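The equivalence can be checked mechanically in a toy structural model (the structural functions below are illustrative assumptions, not taken from the text): evaluating \(g\) at a fixed \(t\) for every unit (the potential outcome) and simulating the mutilated model with \(T := t\) (the do-operator) are the same computation, while conditioning on \(T{=}t\) in the observational world is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Exogenous variables of a hypothetical structural causal model.
x = rng.normal(size=n)
u_t = rng.normal(size=n)
u_y = rng.normal(size=n)

def g(t, x, u_y):
    """Structural equation for Y (an illustrative choice)."""
    return 1.0 + 2.0 * t + 0.5 * x + u_y

# Observational world: T is endogenous because it depends on X.
t_obs = (x + u_t > 0).astype(float)
y_obs = g(t_obs, x, u_y)

# Potential outcome: Y(1) = g(1, X, U_Y) for every unit.
ey1_po = g(1.0, x, u_y).mean()

# Mutilated SEM under do(T=1): replace the assignment mechanism by T := 1,
# keeping the distributions of X and U_Y -- the same computation as above.
ey1_do = g(np.ones(n), x, u_y).mean()

print(f"E[Y(1)]            = {ey1_po:.3f}")
print(f"E[Y | do(T=1)]     = {ey1_do:.3f}")
print(f"E[Y | T=1] (naive) = {y_obs[t_obs == 1].mean():.3f}")
```

The first two quantities agree by construction; the third is larger, because conditioning on \(T{=}1\) selects units with high \(X\), illustrating \(P(y \mid \doop(t)) \neq P(y \mid T{=}t)\) under endogeneity.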

4.3 Causal Estimands

4.3.1 The Average Treatment Effect and Its Relatives

Definition: ATE and ATT

For a binary treatment \(T \in \{0,1\}\):

  • The average treatment effect (ATE): \(\tau_{\mathrm{ATE}} = \E[Y(1) - Y(0)]\).
  • The average treatment effect on the treated (ATT): \(\tau_{\mathrm{ATT}} = \E[Y(1) - Y(0) \mid T{=}1]\).

The ATE and ATT coincide exactly when the average effect among the treated equals the population average effect, \(\E[Y(1) - Y(0) \mid T{=}1] = \E[Y(1) - Y(0)]\). A sufficient condition is that mean potential outcomes do not depend on treatment status: \(\E[Y(t) \mid T] = \E[Y(t)]\) for \(t \in \{0,1\}\). In a randomized experiment this holds by design.

Neither quantity is directly observable. The naïve estimator \(\hat\tau_{\mathrm{naive}} = \bar{Y}_{T=1} - \bar{Y}_{T=0}\) estimates \(\E[Y \mid T{=}1] - \E[Y \mid T{=}0] \neq \E[Y(1)] - \E[Y(0)]\), with the gap being the selection bias induced by endogeneity.
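A simulation with the full (normally unobservable) schedule in hand exhibits the accounting identity \(\hat\tau_{\mathrm{naive}} = \tau_{\mathrm{ATT}} + \text{selection bias}\), where the selection-bias term compares the two groups' untreated outcomes. The data-generating process is hypothetical, built so that high-ability units both gain more and select into treatment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# A hypothetical world where the full schedule (Y(0), Y(1)) is known.
ability = rng.normal(size=n)
y0 = ability + rng.normal(size=n)
y1 = y0 + 1.0 + 0.5 * ability            # larger gains for high-ability units
t = ability + rng.normal(size=n) > 0     # high-ability units select in

ate = (y1 - y0).mean()                   # ~1.0
att = (y1 - y0)[t].mean()                # > ATE: treated units benefit more

# The analyst sees only the observed outcome.
y = np.where(t, y1, y0)
naive = y[t].mean() - y[~t].mean()

# Decomposition: naive = ATT + selection bias, where selection bias is
# E[Y(0) | T=1] - E[Y(0) | T=0].
selection_bias = y0[t].mean() - y0[~t].mean()
print(ate, att, naive, att + selection_bias)
```

The naive contrast overstates both estimands here: it adds a positive selection-bias term to an ATT that already exceeds the ATE.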

4.3.2 Causal Estimands as Functionals of the Structural Equation

In the SEM framework, the outcome is determined by a structural equation \(Y = g(T, \mathbf{X}, U_Y)\), where \(U_Y\) collects all sources of variation not accounted for by \((T, \mathbf{X})\). The potential outcome under \(\doop(T{=}t)\) is: \[Y_i(t) = g(t, \mathbf{X}_i, U_{Y,i}). \tag{4.3}\]

Substituting into the definitions: \[\tau_{\mathrm{ATE}} = \E\!\left[g(1, \mathbf{X}, U_Y) - g(0, \mathbf{X}, U_Y)\right], \qquad \tau_{\mathrm{ATT}} = \E\!\left[g(1, \mathbf{X}, U_Y) - g(0, \mathbf{X}, U_Y) \mid T{=}1\right]. \tag{4.4}\]

The linear SEM as a special case. In the Gaussian linear SEM \(Y = \beta T + \boldsymbol{\gamma}^\top\mathbf{X} + \varepsilon\), the unit-level effect is \(Y_i(1) - Y_i(0) = \beta\) for every unit. Consequently, \(\tau_{\mathrm{ATE}} = \tau_{\mathrm{ATT}} = \beta\). The structural coefficient \(\beta\) is the average treatment effect. This homogeneity is a special property of the linear additive model, not a general feature.

Heterogeneous effects. In the nonparametric SEM, the unit-level effect \(g(1, \mathbf{X}_i, U_{Y,i}) - g(0, \mathbf{X}_i, U_{Y,i})\) varies across units. The ATE and ATT differ whenever treatment selection correlates with individual effect size — i.e., whenever units who benefit more also tend to self-select into treatment.
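Both claims are easy to verify in code (the structural equations below are chosen for illustration): in the linear additive SEM the unit-level effect has zero variance, while a heterogeneous \(g\) produces a whole distribution of effects.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
u = rng.normal(size=n)

# Linear additive SEM: g(t, x, u) = beta*t + gamma*x + u.
beta, gamma = 1.5, 0.7
eff_linear = (beta * 1 + gamma * x + u) - (beta * 0 + gamma * x + u)

# A heterogeneous SEM: the unit-level effect is (1 + u), varying across units.
def g_het(t, x, u):
    return (1.0 + u) * t + gamma * x

eff_het = g_het(1, x, u) - g_het(0, x, u)

print("linear SEM effect std:       ", eff_linear.std())  # ~0: every unit's effect is beta
print("heterogeneous SEM effect std:", eff_het.std())      # ~1: effects vary
```

In the linear case the covariates and noise cancel exactly, so ATE, ATT, and every conditional average effect all equal \(\beta\); in the heterogeneous case the estimands can differ as soon as selection correlates with the effect.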

4.4 Ignorability, Positivity, and Adjustment

4.4.1 Strong Ignorability

The central identifying assumption in observational studies is that treatment assignment is as good as random after conditioning on observed covariates \(X\).

Definition: Strong Ignorability (Rosenbaum and Rubin 1983)

The treatment assignment \(T\) is strongly ignorable given \(X\) if:

  1. Unconfoundedness: \((Y(0), Y(1)) \indep T \mid X\).
  2. Overlap (positivity): \(0 < P(T{=}1 \mid X{=}x) < 1\) for all \(x\) in the support of \(X\).

4.4.2 Adjustment Formula under Ignorability

Under strong ignorability, the ATE is identified by the standardization formula: \[\tau_{\mathrm{ATE}} = \E_X\!\left[\E[Y \mid T{=}1, X] - \E[Y \mid T{=}0, X]\right]. \tag{4.5}\]

Derivation. For each treatment level \(t\), by the law of iterated expectations: \[\E[Y(t)] = \E_X\!\left[\E[Y(t) \mid X]\right].\] Under unconfoundedness, \(\E[Y(t) \mid X] = \E[Y(t) \mid T{=}t, X]\). By consistency, \(\E[Y(t) \mid T{=}t, X] = \E[Y \mid T{=}t, X]\). Therefore \(\E[Y(t)] = \E_X[\E[Y \mid T{=}t, X]]\). Applying once for \(t=1\) and once for \(t=0\) and taking the difference yields Equation 4.5. \(\square\)

Example: A Two-Stratum Adjustment Calculation

Let \(X \in \{0,1\}\) with \(P(X{=}1) = 0.4\), \(P(X{=}0) = 0.6\), and observed conditional means:

|          | \(T=1\)                        | \(T=0\)                        |
|----------|--------------------------------|--------------------------------|
| \(X=1\)  | \(\E[Y \mid T=1, X=1] = 8\)    | \(\E[Y \mid T=0, X=1] = 6\)    |
| \(X=0\)  | \(\E[Y \mid T=1, X=0] = 5\)    | \(\E[Y \mid T=0, X=0] = 4\)    |

Under ignorability and positivity: \[\E[Y(1)] = 0.4 \times 8 + 0.6 \times 5 = 6.2, \qquad \E[Y(0)] = 0.4 \times 6 + 0.6 \times 4 = 4.8.\] Hence \(\tau_{\mathrm{ATE}} = 6.2 - 4.8 = 1.4\). The formula works by comparing treated and control outcomes within each stratum and averaging those within-stratum comparisons over the marginal distribution of \(X\).
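The standardization arithmetic is easy to mirror in code; the numbers below are exactly those of the example.

```python
# Two-stratum standardization (Equation 4.5), reproducing the example.
p_x = {1: 0.4, 0: 0.6}                     # marginal distribution of X
mean_y = {(1, 1): 8, (0, 1): 6,            # E[Y | T=t, X=x], keyed by (t, x)
          (1, 0): 5, (0, 0): 4}

ey1 = sum(p_x[x] * mean_y[(1, x)] for x in p_x)  # E[Y(1)] = 6.2
ey0 = sum(p_x[x] * mean_y[(0, x)] for x in p_x)  # E[Y(0)] = 4.8
ate = ey1 - ey0                                   # 1.4

print(ey1, ey0, ate)
```

Note that each stratum's treated–control contrast is weighted by the marginal \(P(X{=}x)\), not by the within-arm covariate distributions; that is what removes the confounding.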

4.4.3 Back-Door Interpretation

The potential-outcomes assumption \((Y(0), Y(1)) \indep T \mid X\) is the counterfactual expression of the idea that, after conditioning on \(X\), treatment assignment carries no residual information about the outcome that would be observed under intervention \(T{=}t\). In DAG language, the closely related condition is that \(X\) blocks all back-door paths from \(T\) to \(Y\).

Proposition: Back-Door Criterion Implies Ignorability

Under the NPSEM/SWIG semantics adopted in these notes, if \(X\) satisfies the back-door criterion for the effect of \(T\) on \(Y\) — i.e., (1) no node in \(X\) is a descendant of \(T\), and (2) \(X\) blocks every back-door path from \(T\) to \(Y\) — then unconfoundedness \((Y(t) \indep T \mid X)\) holds for all \(t\).

Proof sketch. The target statement \(Y(t) \indep T \mid X\) is cross-world: it mixes the counterfactual \(Y(t)\) with the factual treatment \(T\), and is not a d-separation statement in \(\mathcal{G}\) itself. The rigorous translation proceeds through the single-world intervention graph \(\mathcal{G}(t)\) (Appendix B). Under the NPSEM semantics, \(Y(t)\) and \(T\) depend on disjoint exogenous errors together with their own ancestors, and the back-door criterion is precisely the d-separation condition on \(\mathcal{G}(t)\) that makes these conditionally independent given \(X\). See Appendix B for the explicit SWIG derivation. \(\square\)

Remark

This is the direction most important for practice: a graphical adjustment set justifies the counterfactual independence needed for standardization, regression adjustment, and propensity-score methods. The converse — whether every ignorability statement corresponds to a back-door condition — is more delicate; see Appendix B.

Example: Labor Training Program

Following LaLonde (1986) and Dehejia and Wahba (1999), let \(T\) = receipt of job training, \(Y\) = earnings two years later, \(X\) = (age, education, prior earnings, race, marital status).

The back-door path \(T \leftarrow X \to Y\) is blocked by conditioning on \(X\), leaving only the causal path \(T \to Y\). Ignorability holds if this DAG is correctly specified. If there is an unobserved variable \(U\) (motivation, ability) that affects both training participation and earnings, \(X\) fails the back-door criterion and ignorability fails.

4.4.4 Overlap and Positivity

Why Overlap Is Non-Negotiable

Without overlap, some covariate strata contain only treated or only control units. In those strata, \(\E[Y \mid T{=}0, X{=}x]\) or \(\E[Y \mid T{=}1, X{=}x]\) is unobservable, and Equation 4.5 cannot be evaluated at those values of \(x\). Identification of the ATE fails not because the causal structure is wrong, but because the data do not span the support needed.

The ATT can remain identified under a weaker, one-sided overlap condition: \(P(T{=}0 \mid X{=}x) > 0\) almost surely on the support of \(X\) in the treated subpopulation. Full two-sided overlap is not required for ATT, because ATT averages only over the treated population.
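In practice, overlap is checked before any adjustment is attempted. A minimal sketch, assuming a discrete covariate (the data-generating process is hypothetical): tabulate treated and control counts per stratum and flag strata where one arm is empty.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x = rng.integers(0, 3, size=n)            # three covariate strata
p_treat = np.array([0.5, 0.9, 1.0])[x]    # stratum x=2 violates positivity
t = rng.random(n) < p_treat

for stratum in range(3):
    mask = x == stratum
    n1 = int(t[mask].sum())
    n0 = int(mask.sum()) - n1
    flag = "  <-- no controls: E[Y | T=0, X=x] not estimable" if n0 == 0 else ""
    print(f"X={stratum}: treated={n1}, control={n0}{flag}")
```

With continuous covariates the analogous diagnostic is to inspect the distribution of estimated propensity scores in each arm and look for regions where one arm has no support.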

4.5 Where the Frameworks Agree and Diverge

| Task | Potential Outcomes | Do-Calculus / DAG | SEM |
|---|---|---|---|
| Define ATE, ATT | ✓ Most natural notation | Via \(\E[Y \mid \doop(t)]\) | Via structural equations |
| Encode causal assumptions | Typically informal; graph implicit | ✓ Explicit directed edges | ✓ Structural equations |
| Read conditional independence | Requires auxiliary graph | ✓ d-separation | With graph only |
| Identification from data | Via ignorability | ✓ Back-door, front-door, do-calculus | ✓ Via exclusion restrictions |
| Likelihood construction | Estimating equations / semiparametric | Observational functionals only | ✓ Structural model |
| Cross-world assumptions | ✓ Natural to state | See Appendix B | Implicit in model |
| Standard in statistics | ✓ Dominant | Growing rapidly | Econometrics |

Our Position in This Course

Both frameworks are indispensable. We use potential outcomes to define causal estimands. We use the do-calculus and DAGs for identification. The back-door criterion is the graphical condition that justifies conditional ignorability and hence the adjustment formula. The do-operator language is preferred throughout this course because it enforces the interventional/observational distinction syntactically: \(P(y \mid \doop(t))\) cannot be confused with \(P(y \mid T{=}t)\), whereas \(\E[Y(t)]\) carries no such syntactic marker.

4.6 Summary

Causal inference = counterfactual questions + graphical assumptions + statistical estimation.

  1. The potential outcome \(Y(t)\) is the outcome that would be observed under intervention \(T = t\). Under consistency, no interference, and the structural causal semantics adopted in these notes, the observed outcome satisfies \(Y_i = Y_i(T_i)\), and \(Y(t)\) has the same distribution as \(Y \mid \doop(T{=}t)\).
  2. The ATE and ATT are averages of the unit-level causal effect \(g(1, \mathbf{X}_i, U_{Y,i}) - g(0, \mathbf{X}_i, U_{Y,i})\) over the full population and the treated subpopulation respectively. In the linear SEM, both equal \(\beta\). They diverge under heterogeneous effects when treatment selection correlates with individual effect size.
  3. Strong ignorability \((Y(0), Y(1)) \indep T \mid X\) with overlap identifies the ATE via Equation 4.5. The back-door criterion provides a sufficient graphical condition for the conditional exchangeability assumption.
  4. Single-world intervention graphs (SWIGs; Richardson and Robins 2014) provide a formal graphical representation of counterfactual variables, making ignorability a d-separation statement (Appendix B).
  5. The frameworks are complementary: potential outcomes define estimands; the do-calculus identifies them. The back-door criterion connects the two by providing the graphical condition under which adjustment recovers the causal effect.

From ignorability to randomization. In observational studies, ignorability must be justified by substantive knowledge encoded in a causal graph. Randomized experiments provide a different solution: when treatment is assigned randomly, \((Y(0), Y(1)) \indep T\) holds by design, guaranteeing ignorability without any covariate adjustment. Chapter 5 studies this design.

Identification vs. estimation. Chapters 1–4 have focused on identification: whether \(\tau = \E[Y(1) - Y(0)]\) can be written as a functional \(\Phi(P(Y,T,X))\) of the observed distribution. Estimation — constructing \(\hat\tau\) from a finite sample to approximate \(\Phi(P)\) — is studied in Part III, covering regression adjustment, IPW, doubly robust estimators, and instrumental variables.

4.7 Problems

1. SUTVA and consistency. Suppose \(n = 3\) units receive binary treatments \((T_1, T_2, T_3)\) and unit \(i\)’s outcome may depend on all three treatments: \(Y_i = Y_i(T_1, T_2, T_3)\).

  1. How many potential outcomes does unit 1 have? Write them out explicitly.
  2. SUTVA imposes no interference: \(Y_i(T_1, T_2, T_3) = Y_i(T_i)\). How many distinct potential outcomes remain?
  3. Give a real-world example where no-interference plausibly holds and one where it plausibly fails. In the latter case, explain what identification strategy (if any) remains available.

2. ATE, ATT, and selection bias. Let \((Y(0), Y(1), T) \sim P\) with \(P(T{=}1) = 0.5\), \(\E[Y(1)] = 3\), \(\E[Y(0)] = 1\), \(\E[Y(1) \mid T{=}1] = 4\), \(\E[Y(0) \mid T{=}1] = 2\), \(\E[Y(1) \mid T{=}0] = 2\), \(\E[Y(0) \mid T{=}0] = 0\).

  1. Compute the ATE and ATT. Are they equal?
  2. Compute the naïve estimator \(\E[Y \mid T{=}1] - \E[Y \mid T{=}0]\). Decompose the gap between this and the ATE into a selection bias term and an ATT–ATE difference.
  3. Under what graphical condition on the DAG would the naïve estimator equal the ATE? State this condition in both potential outcome language (ignorability) and do-calculus language (back-door criterion).

3. Causal estimands from the structural equation. Consider the nonparametric SEM \(Y = g(T, X, U_Y)\) with binary \(T \in \{0,1\}\), observed covariate \(X\), and \(U_Y\) independent of \((T, X)\) in the mutilated graph.

  1. Write \(\tau_{\mathrm{ATE}}\) and \(\tau_{\mathrm{ATT}}\) as expectations of \(g(1, X, U_Y) - g(0, X, U_Y)\) over the appropriate distribution. Under what condition do they coincide?
  2. Specialize to the linear SEM \(g(t, x, u) = \alpha + \beta t + \gamma x + u\). Show that the unit-level causal effect \(Y_i(1) - Y_i(0)\) is constant across all units, and hence \(\tau_{\mathrm{ATE}} = \tau_{\mathrm{ATT}} = \beta\).
  3. Now consider the heterogeneous SEM \(g(t, x, u) = (\alpha + u)\,t + \gamma x\), where \(U_Y \sim \mathcal{N}(0, \sigma^2)\), \(\mathrm{Cov}(U_Y, T) = \rho\), and \(P(T{=}1) = p \in (0,1)\). Compute \(\tau_{\mathrm{ATE}}\) and \(\tau_{\mathrm{ATT}}\). Show that they differ when \(\rho \neq 0\), and interpret this difference.
  4. In part (c), show that the population OLS slope on \(T\) in the regression of \(Y\) on \((1, T, X)\) equals \(\alpha + \rho/p = \tau_{\mathrm{ATT}}\), not \(\tau_{\mathrm{ATE}} = \alpha\). What additional structure would be needed to identify \(\tau_{\mathrm{ATE}}\)?

4. SWIGs and ignorability. (Requires Appendix B.) Consider the DAG: \(X \to T\), \(X \to Y\), \(T \to Y\), with \(X\) fully observed.

  1. Construct the SWIG \(\mathcal{G}(t)\) by splitting \(T\) into its random and fixed halves. Draw the result, labeling the random half, fixed half, and the potential outcome \(Y(t)\).
  2. In \(\mathcal{G}(t)\), identify all paths between the random half \(T\) and \(Y(t)\). Determine which are blocked and which are open before conditioning.
  3. Use d-separation in \(\mathcal{G}(t)\) to verify that \((Y(t) \indep T \mid X)_{\mathcal{G}(t)}\) holds. Conclude that \(X\) satisfies the back-door criterion, confirming the proposition of Section 4.4.3 (Back-Door Criterion Implies Ignorability).
  4. Now add a hidden common cause \(U \to T\), \(U \to Y\). Draw the revised SWIG. Does the ignorability argument still hold? State the correct conclusion and identify what additional structure would be needed for identification.