1 Introduction
This chapter is an orientation, not a complete treatment. It introduces four ideas — the interventional/observational distinction, the causal trinity, the Gaussian confounded model as a running example, and the identification-versus-estimation paradigm — that will be developed rigorously in the chapters that follow.
1.1 Motivation: Why Causal Inference Is Hard
1.1.1 The Central Question
Causal inference is concerned with a deceptively simple question: what would happen to \(Y\) if we were to set \(T = t\)? This is fundamentally different from the question that standard statistical procedures answer: what is the distribution of \(Y\) among units observed to have \(T = t\)?
1.2 The Causal Trinity: Three Languages for One Idea
The same causal question can be expressed in three closely related languages: the structural equation model (SEM), the directed acyclic graph (DAG) with graph surgery, and the potential outcomes framework. Collectively, we call these the causal trinity. Each language illuminates a different facet of the same underlying idea, and fluency in all three is essential for modern causal reasoning.
1.2.1 Language 1: Structural Equation Models
A SEM represents the causal data-generating mechanism as a system of equations, one per variable, with a joint distribution over exogenous disturbances. A variable is exogenous if it is determined outside the system; a variable is endogenous if its value is produced by a structural equation within the model.
The running setup for this chapter is a three-variable confounded SEM over \((U, T, Y)\): \[T = f_T(U, \delta), \qquad Y = f_Y(T, U, \varepsilon), \tag{1.1}\] where \(\delta\) and \(\varepsilon\) are mutually independent exogenous disturbances and \(U\) is an unobserved exogenous confounder.
Observational regime. Because \(U\) enters both equations, conditioning on \(T\) changes the distribution of \(U\), so \(P(y \mid T{=}t) \neq P(y \mid \doop(T{=}t))\). An analyst who regresses \(Y\) on \(T\) without observing \(U\) faces a residual correlated with \(T\) — even though the structural error \(\varepsilon\) is independent of \(T\) by NPSEM-IE.
Interventional regime. The do-operator replaces the equation for \(T\) with a constant \(t\), yielding the mutilated system: \(U \sim p(u)\), \(T = t\) (fixed), \(Y = f_Y(t, U, \varepsilon)\). We define the potential outcome as \(Y(t) \mathrel{:}= f_Y(t, U, \varepsilon)\); by construction, \(Y(t) \sim P(y \mid \doop(T{=}t))\).
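The equation-surgery operation can be made concrete in a few lines. The sketch below is illustrative only: the linear mechanisms `f_T` and `f_Y` and all parameter values are placeholders, not part of Equation 1.1. Sampling the intact system and conditioning approximates the observational regime, while evaluating \(f_Y\) at a fixed \(t\) with the exogenous draws unchanged gives the potential outcome \(Y(t)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Placeholder structural functions (Equation 1.1 leaves these unspecified).
f_T = lambda u, d: u + d
f_Y = lambda t, u, e: 2.0 * t + u + e

U, delta, eps = rng.normal(size=(3, n))
T = f_T(U, delta)           # factual treatment
Y = f_Y(T, U, eps)          # factual outcome

# Potential outcome Y(t) := f_Y(t, U, eps): same exogenous draws, treatment set to t.
Y_at = lambda t: f_Y(t, U, eps)

# Consistency: evaluating Y(t) at each unit's factual T recovers the factual Y.
assert np.allclose(Y_at(T), Y)

# The do-distribution at t = 1 is simply the distribution of Y(1).
print(Y_at(1.0).mean())     # close to E[Y | do(T=1)] = 2 under these placeholders
```

The `assert` line is the consistency property that links the SEM and potential-outcomes languages: the factual outcome is the potential outcome evaluated at the factual treatment.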
1.2.2 Language 2: Directed Acyclic Graphs
1.2.3 Language 3: Potential Outcomes
1.2.4 Roadmap: How the Trinity Organizes These Notes
Part I — Foundations: the graphical language (Chapters 1–3). The DAG and the do-operator are the primary language here because they make the interventional/observational distinction syntactically explicit: confounding is visible as an open back-door path, and identifying assumptions are visible as graph criteria.
Part II — Identification: designs and research strategies (Chapters 4–9). Part II introduces the potential outcomes framework (Chapter 4) and then studies the specific research designs through which observational and experimental data support identification: randomization and back-door adjustment (Chapter 5), propensity score methods (Chapter 6), instrumental variables (Chapter 7), mediation and front-door identification (Chapter 8), and sensitivity analysis (Chapter 9). Ignorability (\(Y(t) \indep T \mid X\)) is the potential-outcomes counterpart of the back-door criterion.
Part III — Estimation: semiparametric and machine-learning methods (Chapters 10–13). Once identification has been established, Part III turns to estimation. The semiparametric efficiency framework — estimating equations, influence functions, doubly robust estimators, and double machine learning — is largely language-neutral. Chapter 13 applies these tools to estimation under instrumental variables.
1.2.5 Historical Roots of the Causal Trinity
The three languages did not emerge from a single research program but grew independently in different disciplines over more than a century.
Structural Equation Models: genetics and econometrics. Sewall Wright (1921) introduced path analysis to decompose correlations among hereditary traits. The framework was transplanted into economics by Haavelmo (1943), who argued that economic relationships should be modeled as autonomous structural equations robust to policy interventions. The Cowles Commission formalized simultaneous equation systems throughout the 1940s–50s. Lucas’s (1976) critique is essentially a statement of the interventional/observational distinction, decades before that phrase existed.
Potential Outcomes: experimental statistics and epidemiology. Neyman (1923) introduced the notation \(Y_i(t)\) in the context of randomized agricultural experiments. The crucial extension to observational studies was made by Rubin (1974), who formalized the assignment mechanism and stated SUTVA explicitly. Holland’s (1986) JASA paper brought the framework to mainstream statistics. Robins (1986) extended potential outcomes to longitudinal treatments via \(g\)-computation. Imbens and Angrist (1994) connected the framework to instrumental variables and defined the LATE.
DAGs and do-calculus: computer science and philosophy. Pearl developed Bayesian networks through the 1980s. The pivotal step came in Pearl (1995), where the do-operator was introduced. The 2000 monograph Causality synthesized DAGs, the do-calculus, identification theory, and connections to potential outcomes. Spirtes, Glymour, and Scheines (1993) — working from philosophy at Carnegie Mellon — developed constraint-based algorithms for learning causal structure from data.
Convergence. The three frameworks are closely related and often intertranslatable, but exact equivalence requires additional assumptions. These notes work mainly in an NPSEM-IE setting, where translation between the languages is especially clean.
1.3 The Core Distinction
The single most important inequality of this course is: \[\underbrace{P\!\left(y \mid x,\, \doop(T{=}t)\right)}_{\text{interventional}} \;\neq\; \underbrace{P\!\left(y \mid x,\, T{=}t\right)}_{\text{observational}} \qquad \text{whenever } T \text{ is endogenous.} \tag{1.2}\]
1.3.1 Two Scenarios: Observed vs. Unobserved Confounder
For the confounded SEM of Equation 1.1, the interventional density can always be written by integrating over the confounder: \[P\!\left(y \mid x,\, \doop(T{=}t)\right) \;=\; \sum_{u} P(y \mid x, T{=}t, U{=}u)\,p(u \mid x). \tag{1.3}\] Whether this back-door formula is usable depends entirely on whether \(U\) is observed.
Scenario 1: \(U\) is observed (no unmeasured confounding). When every component of \(U\) is recorded, both \(P(y \mid x, T{=}t, U{=}u)\) and \(p(u \mid x)\) can be estimated, and the back-door formula Equation 1.3 is a functional of the observable distribution. A special case is unconfoundedness: if \(U \indep T \mid X\) already holds, then \(p(u \mid x, T{=}t) = p(u \mid x)\), and the formula reduces to \(\E[Y \mid X{=}x, T{=}t]\). This is the key assumption underlying regression adjustment and propensity score methods (Chapters 5 and 6).
Scenario 2: \(U\) is unobserved (unmeasured confounding). When \(U\) is latent, the back-door formula cannot be evaluated from the observed data. The main identification strategies are instrumental variables (Chapter 7), the front-door criterion (Chapter 3, applied in Chapter 8), and sensitivity analysis (Chapter 9).
| | Scenario 1: \(U\) observed | Scenario 2: \(U\) unobserved |
|---|---|---|
| Confounding type | No unmeasured confounding | Unmeasured confounding |
| Back-door formula usable? | Yes: both terms estimable | No: \(U\) latent |
| Identification route | Back-door adjustment | IV, front-door, … |
| Key assumption | \(Y(t) \indep T \mid X, U\) | IV, front-door, or partial-identification assumptions |
| Chapters in this course | Chs. 3–6 (back-door); Chs. 10–11 (IPW, DR) | Ch. 7 (IV); Ch. 8 (front-door); Ch. 9 (sensitivity) |
1.4 The Gaussian Linear Confounded Model
The model specializes Equation 1.1 to linear structural functions with Gaussian disturbances: \[T = \alpha U + \delta, \qquad Y = \beta T + \gamma U + \varepsilon, \qquad U \sim \mathcal{N}(0, \sigma_U^2),\quad \delta \sim \mathcal{N}(0, \sigma_\delta^2),\quad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2), \tag{1.4}\] with \(U\), \(\delta\), and \(\varepsilon\) mutually independent. The endogeneity of \(T\) is immediate: \(\mathrm{Cov}(T,\, \gamma U + \varepsilon) = \gamma\alpha\sigma_U^2 \neq 0\) whenever \(\alpha \neq 0\) and \(\gamma \neq 0\). Consequently, an OLS regression of \(Y\) on \(T\) does not estimate \(\beta\).
1.4.1 Two Explicit Densities
Write \(\sigma_T^2 = \alpha^2\sigma_U^2 + \sigma_\delta^2\) for the marginal variance of \(T\).
Interventional density. Set \(T = t\) externally; the back-door path through \(U\) is severed, so \(U \sim \mathcal{N}(0, \sigma_U^2)\) independently: \[P\!\left(y \mid \doop(T{=}t)\right) = \mathcal{N}\!\left(\beta t,\;\; \gamma^2\sigma_U^2 + \sigma_\varepsilon^2\right). \tag{1.5}\]
Observational density. Since \((Y, T)\) is jointly normal, \(\mathrm{Cov}(Y,T) = \beta\sigma_T^2 + \gamma\alpha\sigma_U^2\), giving: \[P\!\left(y \mid T{=}t\right) = \mathcal{N}\!\left(\left(\beta + \frac{\gamma\alpha\sigma_U^2}{\sigma_T^2}\right)t,\;\; \frac{\gamma^2\sigma_U^2\sigma_\delta^2}{\sigma_T^2} + \sigma_\varepsilon^2\right). \tag{1.6}\]
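Equations 1.5 and 1.6 can be checked by Monte Carlo. A minimal Python sketch (the chapter's lab is in R; parameter values anticipate Section 1.4.3, and conditioning on \(T{=}t_0\) is approximated by a narrow window):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
beta, alpha, gamma = 2.0, 1.0, 1.0   # sigma_U = sigma_delta = sigma_eps = 1

U = rng.normal(size=n)
delta = rng.normal(size=n)
eps = rng.normal(size=n)
T = alpha * U + delta                # T-equation of the linear model
Y = beta * T + gamma * U + eps       # Y-equation of the linear model

t0 = 1.0
# Observational: condition on T in a narrow window around t0.
sel = np.abs(T - t0) < 0.05
print(Y[sel].mean(), Y[sel].var())   # near 2.5 and 1.5, matching Equation 1.6

# Interventional: sever the T-equation and set T = t0; U, eps keep their laws.
Y_do = beta * t0 + gamma * U + eps
print(Y_do.mean(), Y_do.var())       # near 2.0 and 2.0, matching Equation 1.5
```

The observational draws are both shifted (mean 2.5 rather than 2) and narrower (variance 1.5 rather than 2), exactly as the closed forms predict.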
1.4.2 The Endogeneity Gap
| | Interventional | Observational |
|---|---|---|
| Mean | \(\beta t\) | \(\left(\beta + \frac{\gamma\alpha\sigma_U^2}{\sigma_T^2}\right)t\) |
| Variance | \(\gamma^2\sigma_U^2 + \sigma_\varepsilon^2\) | \(\frac{\gamma^2\sigma_U^2\sigma_\delta^2}{\sigma_T^2} + \sigma_\varepsilon^2\) |
| Residual | \((\gamma U{+}\varepsilon) \indep T\) | \((\gamma U{+}\varepsilon)\) correlated with \(T\) |
| Identified by | RCT (or IV, Chapter 7) | OLS |
| Equal when | \(\alpha = 0\) or \(\gamma = 0\) (no confounding) | |
The endogeneity bias of OLS is \(\gamma\alpha\sigma_U^2/\sigma_T^2\). With \(\alpha = \gamma = 1\) and \(\sigma_U^2 = \sigma_\delta^2 = 1\): bias \(= 1/(1+1) = 0.5\).
1.4.3 Lab: Simulating the Two Densities
The analytical results were verified by simulation using \(n = 10{,}000\) draws with parameters \(\beta = 2\), \(\alpha = 1\), \(\gamma = 1\), \(\sigma_U = \sigma_\varepsilon = \sigma_\delta = 1\). (R code: chapter1_lab.R.)
Experiment 1: OLS bias. By the omitted-variable bias formula, the OLS probability limit is \(\beta + \gamma\alpha\sigma_U^2/\sigma_T^2 = 2 + 0.5 = 2.5\).
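The omitted-variable bias calculation can be reproduced in a few lines. A Python sketch of Experiment 1 (the actual lab script chapter1_lab.R is in R; the seed here is arbitrary, so the digits will differ slightly from the table below):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
beta, alpha, gamma = 2.0, 1.0, 1.0   # sigma_U = sigma_delta = sigma_eps = 1

U, delta, eps = rng.normal(size=(3, n))
T = alpha * U + delta
Y = beta * T + gamma * U + eps

# OLS slope of Y on T via the sample covariance formula.
slope = np.cov(T, Y)[0, 1] / np.var(T, ddof=1)
bias = slope - beta
print(slope, bias)                   # slope near 2.5, bias near 0.5
```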
| | Simulated | Theory | Formula |
|---|---|---|---|
| OLS slope | 2.5147 | 2.5000 | \(\beta + \gamma\alpha\sigma_U^2/\sigma_T^2\) |
| True \(\beta\) | — | 2.0000 | \(\beta\) |
| Bias | 0.5147 | 0.5000 | \(\gamma\alpha\sigma_U^2/\sigma_T^2\) |
OLS overestimates the causal effect by 25%.
Experiment 2: Two-sample comparison at \(t_0 = 1\). The exact observational conditional distribution is \(\mathcal{N}(2.5,\, 1.5)\); the interventional distribution is \(\mathcal{N}(2,\, 2)\).
| | Mean (Sim / Theory) | SD (Sim / Theory) |
|---|---|---|
| \(\E[Y \mid T{=}1]\) | 2.497 / 2.500 | 1.229 / \(\sqrt{1.5} = 1.225\) |
| \(\E[Y \mid \doop(T{=}1)]\) | 2.001 / 2.000 | 1.428 / \(\sqrt{2} = 1.414\) |
The observational distribution also has smaller variance: \(\sigma_{\mathrm{obs}}^2/\sigma_{\mathrm{int}}^2 = 1.5/2 = 0.75\). Conditioning on \(T{=}1\) fixes \(\alpha U + \delta\), which partially determines \(U\) and therefore shrinks the spread that the \(\gamma U\) term contributes to \(Y\).
Experiment 3: Density overlay for varying \(\alpha\). As \(\alpha\) increases from 0 to 1.0, the observational density shifts rightward (growing bias) and narrows (reduced variance). The interventional density \(\mathcal{N}(2, 2)\) is invariant to \(\alpha\).
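The shift-and-narrow pattern can be confirmed directly from Equations 1.5 and 1.6 without any plotting. A short Python sketch over a small \(\alpha\) grid (grid values are arbitrary):

```python
beta, gamma = 2.0, 1.0
s2_U = s2_d = s2_e = 1.0
int_var = gamma**2 * s2_U + s2_e    # Equation 1.5 variance: free of alpha

alphas = [0.0, 0.5, 1.0]
slopes, obs_vars = [], []
for alpha in alphas:
    s2_T = alpha**2 * s2_U + s2_d
    slopes.append(beta + gamma * alpha * s2_U / s2_T)       # Eq. 1.6 mean coefficient
    obs_vars.append(gamma**2 * s2_U * s2_d / s2_T + s2_e)   # Eq. 1.6 variance

print([round(s, 3) for s in slopes])    # [2.0, 2.4, 2.5]: mean slope grows with alpha
print([round(v, 3) for v in obs_vars])  # [2.0, 1.8, 1.5]: variance falls below int_var = 2
```

At \(\alpha = 0\) the two densities coincide, as the "Equal when" row of the table in Section 1.4.2 states.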
1.5 The Two-Step Paradigm
Causal inference is a two-step discipline. The two steps are logically distinct and require different tools.
Step 1: Identification is a purely mathematical question about the causal model: can the interventional density be expressed as a function of the observed-data distribution? It does not depend on sample size. The main tools are: the back-door criterion (Chapter 3, applied in Chapters 4–6), the front-door criterion (Chapter 3), the three rules of the do-calculus (Chapter 3), and IV assumptions (Chapter 7).
Step 2: Estimation. Once identification has expressed the target as a functional \(\Psi(P)\) of the observed-data distribution \(P\), the statistical problem is to estimate \(\Psi(P)\) from \(n\) observations as efficiently as possible. The main tools are efficient influence functions, IPW, doubly robust estimators, and double machine learning (all Part III).
1.5.1 Worked Example: Flu Vaccination and Infection (Simpson’s Paradox)
A public health agency observes 1,000 individuals. Each person either received a flu vaccine (\(T=1\)) or not (\(T=0\)), and either became infected (\(Y=1\)) or not. Age group \(X\) (elderly \(= E\), young \(= Y\)) is recorded. Elderly people are both more likely to be vaccinated and more susceptible to infection, so \(X\) confounds the \(T\)–\(Y\) relationship. The causal DAG has paths: \(T \to Y\) (causal) and \(T \leftarrow X \to Y\) (back-door).
Observed data:
| | Vaccinated (\(T=1\)) | Unvaccinated (\(T=0\)) | Total |
|---|---|---|---|
| Elderly (\(X=E\)) | 360, 30% infected | 40, 50% infected | 400 |
| Young (\(X=Y\)) | 60, 5% infected | 540, 10% infected | 600 |
| Total | 420, 26.4% infected | 580, 12.8% infected | 1,000 |
Simpson’s paradox:
| Stratum | Vaccinated | Unvaccinated | Difference | Conclusion |
|---|---|---|---|---|
| Elderly | 30% | 50% | \(-20\) pp | vaccine helps |
| Young | 5% | 10% | \(-5\) pp | vaccine helps |
| Aggregate | 26.4% | 12.8% | \(+13.6\) pp | vaccine harms?? |
Within every stratum the vaccine reduces infection. Yet in the aggregate the vaccinated group has a higher infection rate, because 86% of vaccinated are elderly vs. 7% of unvaccinated.
Step 1 (Identification). The set \(\{X\}\) satisfies the back-door criterion: \[P\!\left(Y{=}1 \mid \doop(T{=}t)\right) = \sum_{x \in \{E,Y\}} P(Y{=}1 \mid T{=}t, X{=}x)\,P(X{=}x).\]
Step 2 (Estimation). With \(\hat{P}(X{=}E) = 0.4\) and \(\hat{P}(X{=}Y) = 0.6\): \[\hat{P}(Y{=}1 \mid \doop(T{=}1)) = 0.30 \times 0.4 + 0.05 \times 0.6 = 0.15,\] \[\hat{P}(Y{=}1 \mid \doop(T{=}0)) = 0.50 \times 0.4 + 0.10 \times 0.6 = 0.26.\]
The estimated causal risk difference is \(0.15 - 0.26 = -0.11\): the vaccine causally reduces infection probability by 11 percentage points.
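Both steps can be reproduced from the observed-data table in a few lines. A Python sketch (the stratum labels and rates are taken directly from the table above):

```python
# Stratum-specific infection rates P(Y=1 | T=t, X=x) from the observed table.
p_inf = {("E", 1): 0.30, ("E", 0): 0.50, ("Y", 1): 0.05, ("Y", 0): 0.10}
p_x = {"E": 0.4, "Y": 0.6}           # P(X=x): 400 elderly and 600 young of 1,000

def p_do(t):
    """Back-door adjustment: sum_x P(Y=1 | T=t, X=x) P(X=x)."""
    return sum(p_inf[(x, t)] * p_x[x] for x in p_x)

risk_diff = p_do(1) - p_do(0)
print(round(p_do(1), 4), round(p_do(0), 4), round(risk_diff, 4))  # 0.15 0.26 -0.11
```

Note that the weights are the marginal \(P(X{=}x)\), not the treatment-specific proportions: it is precisely this reweighting that resolves the paradox.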
1.6 Summary
Interventional vs. observational distribution. Causal inference answers questions about interventions \(P(y \mid \doop(T{=}t))\), not about observations \(P(y \mid T{=}t)\). In the Gaussian confounded model, the two densities coincide only when \(\alpha = 0\) or \(\gamma = 0\).
The causal trinity. The same causal question can be expressed in three languages — SEM (equation surgery), DAG (graph surgery), and potential outcomes (\(Y(t)\)). In the SEM framework, \(Y(t)\) is defined as the solution for \(Y\) in the mutilated system.
Endogeneity bias in the Gaussian confounded model. The coefficient of \(T\) in \(\E[Y \mid T{=}t]\) differs from the causal coefficient \(\beta\) by \(\gamma\alpha\sigma_U^2/\sigma_T^2\). The observational variance \(\gamma^2\sigma_U^2\sigma_\delta^2/\sigma_T^2 + \sigma_\varepsilon^2\) is smaller than the interventional variance \(\gamma^2\sigma_U^2 + \sigma_\varepsilon^2\).
Two-step paradigm. Causal inference is: identification (expressing \(\Psi(P)\) as a functional of observable data — a mathematical question) followed by estimation (constructing \(\hat{\Psi}_n\) efficiently — a statistical question). The two steps are analytically separate.
Causal inference is the study of when the observational distribution contains enough information to recover the interventional distribution.
1.7 Problems
1. Do-notation fundamentals. For each statement below, decide whether it refers to an interventional or an observational distribution, and rewrite it unambiguously using do-notation where appropriate.
- “The probability that a patient recovers given that they took the drug.”
- “The probability that a patient would recover if we prescribed the drug to everyone.”
- “Among students who attended tutoring, the average exam score was 85.”
- “If we enrolled all students in tutoring, the average exam score would be 85.”
2. Deriving the observational density. In the Gaussian confounded model (Equation 1.4), verify Equation 1.6 by carrying out the following steps. Let \(\sigma_T^2 = \alpha^2\sigma_U^2 + \sigma_\delta^2\).
- Show that \((Y, T)\) is jointly normal by writing both as linear combinations of the independent normals \((U, \varepsilon, \delta)\).
- Compute \(\mathrm{Cov}(Y, T)\) and \(\mathrm{Var}(T)\).
- Apply the conditional normal formula to derive \(\E[Y \mid T{=}t]\) and \(\mathrm{Var}[Y \mid T{=}t]\), and hence verify Equation 1.6.
- Show that the OLS probability limit is \(\beta + \gamma\alpha\sigma_U^2/\sigma_T^2\), and interpret this as an omitted-variable bias formula.
3. The causal trinity. Consider the following verbal causal claim: “Aspirin (\(T\)) reduces the risk of heart attack (\(Y\)) because it inhibits platelet aggregation (\(M\)), but patients with pre-existing cardiovascular disease (\(U\), unobserved) are both more likely to take aspirin and more likely to have a heart attack.”
- Draw the causal DAG implied by this description.
- Write down the recursive SEM with appropriate structural functions \(f_T\), \(f_M\), \(f_Y\).
- State the SUTVA assumption and discuss whether it is plausible in this setting.
- Explain why \(\E[Y \mid T{=}1] - \E[Y \mid T{=}0]\) does not identify the causal effect \(\E[Y \mid \doop(T{=}1)] - \E[Y \mid \doop(T{=}0)]\) in this DAG.