1 Introduction
This chapter is an orientation, not a complete treatment. It introduces four ideas — the interventional/observational distinction, the causal trinity, the Gaussian confounded model as a running example, and the identification-versus-estimation paradigm — that will be developed rigorously in the chapters that follow.
1.1 Motivation: Why Causal Inference Is Hard
1.1.1 The Central Question
Causal inference is concerned with a deceptively simple question: what would happen to \(Y\) if we were to set \(T = t\)? This is fundamentally different from the question that standard statistical procedures answer: what is the distribution of \(Y\) among units observed to have \(T = t\)?
1.2 The Causal Trinity: Three Languages for One Idea
The same causal question can be expressed in three closely related languages: the structural equation model (SEM), the directed acyclic graph (DAG) with graph surgery, and the potential outcomes framework. Collectively, we call these the causal trinity. Each language illuminates a different facet of the same underlying idea, and fluency in all three is essential for modern causal reasoning.
1.2.1 Language 1: Structural Equation Models
A SEM represents the causal data-generating mechanism as a system of equations, one per variable, with a joint distribution over exogenous disturbances. A variable is exogenous if it is determined outside the system; a variable is endogenous if its value is produced by a structural equation within the model.
The running setup for this chapter is a three-variable confounded SEM over \((U, T, Y)\): \[T = f_T(U, \delta), \qquad Y = f_Y(T, U, \varepsilon), \tag{1.1}\] where \(\delta\) and \(\varepsilon\) are mutually independent exogenous disturbances and \(U\) is an unobserved exogenous confounder.
Observational regime. Because \(U\) enters both equations, conditioning on \(T\) changes the distribution of \(U\), so \(P(y \mid T{=}t) \neq P(y \mid \doop(T{=}t))\). An analyst who regresses \(Y\) on \(T\) without observing \(U\) faces a residual correlated with \(T\) — even though the structural error \(\varepsilon\) is independent of \(T\) by NPSEM-IE.
Interventional regime. The do-operator replaces the equation for \(T\) with a constant \(t\), yielding the mutilated system: \(U \sim p(u)\), \(T = t\) (fixed), \(Y = f_Y(t, U, \varepsilon)\). We define the potential outcome as \(Y(t) \mathrel{:}= f_Y(t, U, \varepsilon)\); by construction, \(Y(t) \sim P(y \mid \doop(T{=}t))\).
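The equation-surgery operation can be made concrete in a few lines. The sketch below is illustrative only: the linear mechanisms `f_T` and `f_Y` and all parameter values are placeholders, not part of Equation 1.1. Sampling the intact system and conditioning approximates the observational regime, while evaluating \(f_Y\) at a fixed \(t\) with the exogenous draws unchanged gives the potential outcome \(Y(t)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Placeholder structural functions (Equation 1.1 leaves these unspecified).
f_T = lambda u, d: u + d
f_Y = lambda t, u, e: 2.0 * t + u + e

U, delta, eps = rng.normal(size=(3, n))
T = f_T(U, delta)           # factual treatment
Y = f_Y(T, U, eps)          # factual outcome

# Potential outcome Y(t) := f_Y(t, U, eps): same exogenous draws, treatment set to t.
Y_at = lambda t: f_Y(t, U, eps)

# Consistency: evaluating Y(t) at each unit's factual T recovers the factual Y.
assert np.allclose(Y_at(T), Y)

# The do-distribution at t = 1 is simply the distribution of Y(1).
print(Y_at(1.0).mean())     # close to E[Y | do(T=1)] = 2 under these placeholders
```

The `assert` line is the consistency property that links the SEM and potential-outcomes languages: the factual outcome is the potential outcome evaluated at the factual treatment.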
1.2.2 Language 2: Directed Acyclic Graphs
1.2.3 Language 3: Potential Outcomes
1.2.4 Roadmap: How the Trinity Organizes These Notes
Part I — Foundations: the graphical language (Chapters 1–3). The DAG and the do-operator are the primary language here because they make the interventional/observational distinction syntactically explicit: confounding is visible as an open back-door path, and identifying assumptions are visible as graph criteria.
Part II — Identification: designs and research strategies (Chapters 4–9). Part II introduces the potential outcomes framework (Chapter 4) and then studies the specific research designs through which observational and experimental data support identification: randomization and back-door adjustment (Chapter 5), propensity score methods (Chapter 6), instrumental variables (Chapter 7), mediation and front-door identification (Chapter 8), and sensitivity analysis (Chapter 9). Ignorability (\(Y(t) \indep T \mid X\)) is the potential-outcomes counterpart of the back-door criterion.
Part III — Estimation: semiparametric and machine-learning methods (Chapters 10–13). Once identification has been established, Part III turns to estimation. The semiparametric efficiency framework — estimating equations, influence functions, doubly robust estimators, and double machine learning — is largely language-neutral. Chapter 13 applies these tools to estimation under instrumental variables.
1.2.5 Historical Roots of the Causal Trinity
The three languages did not emerge from a single research program but grew independently in different disciplines over more than a century.
Structural Equation Models: genetics and econometrics. Sewall Wright (1921) introduced path analysis to decompose correlations among hereditary traits. The framework was transplanted into economics by Haavelmo (1943), who argued that economic relationships should be modeled as autonomous structural equations robust to policy interventions. The Cowles Commission formalized simultaneous equation systems throughout the 1940s–50s. Lucas’s (1976) critique is essentially a statement of the interventional/observational distinction, decades before that phrase existed.
Potential Outcomes: experimental statistics and epidemiology. Neyman (1923) introduced the notation \(Y_i(t)\) in the context of randomized agricultural experiments. The crucial extension to observational studies was made by Rubin (1974), who formalized the assignment mechanism and stated SUTVA explicitly. Holland’s (1986) JASA paper brought the framework to mainstream statistics. Robins (1986) extended potential outcomes to longitudinal treatments via \(g\)-computation. Imbens and Angrist (1994) connected the framework to instrumental variables and defined the LATE.
DAGs and do-calculus: computer science and philosophy. Pearl developed Bayesian networks through the 1980s. The pivotal step came in Pearl (1995), where the do-operator was introduced. The 2000 monograph Causality synthesized DAGs, the do-calculus, identification theory, and connections to potential outcomes. Spirtes, Glymour, and Scheines (1993) — working from philosophy at Carnegie Mellon — developed constraint-based algorithms for learning causal structure from data.
Convergence. The three frameworks are closely related and often intertranslatable, but exact equivalence requires additional assumptions. These notes work mainly in an NPSEM-IE setting, where translation between the languages is especially clean.
1.3 The Core Distinction
The single most important inequality of this course is: \[\underbrace{P\!\left(y \mid x,\, \doop(T{=}t)\right)}_{\text{interventional}} \;\neq\; \underbrace{P\!\left(y \mid x,\, T{=}t\right)}_{\text{observational}} \qquad \text{whenever } T \text{ is endogenous.} \tag{1.2}\]
1.3.1 Two Scenarios: Observed vs. Unobserved Confounder
For the confounded SEM of Equation 1.1, the interventional density can always be written by integrating over the confounder: \[P\!\left(y \mid x,\, \doop(T{=}t)\right) \;=\; \sum_{u} P(y \mid x, T{=}t, U{=}u)\,p(u \mid x). \tag{1.3}\] Whether this back-door formula is usable depends entirely on whether \(U\) is observed.
Scenario 1: \(U\) is observed (no unmeasured confounding). When every component of \(U\) is recorded, both \(P(y \mid x, T{=}t, U{=}u)\) and \(p(u \mid x)\) can be estimated, and the back-door formula Equation 1.3 is a functional of the observable distribution. A special case is unconfoundedness: if \(U \indep T \mid X\) already holds, then \(p(u \mid x, T{=}t) = p(u \mid x)\), and the formula reduces to \(\E[Y \mid X{=}x, T{=}t]\). This is the key assumption underlying regression adjustment and propensity score methods (Chapters 5 and 6).
Scenario 2: \(U\) is unobserved (unmeasured confounding). When \(U\) is latent, the back-door formula cannot be evaluated from the observed data. The main identification strategies are instrumental variables (Chapter 7), the front-door criterion (Chapter 3, applied in Chapter 8), and sensitivity analysis (Chapter 9).
| | Scenario 1: \(U\) observed | Scenario 2: \(U\) unobserved |
|---|---|---|
| Confounding type | No unmeasured confounding | Unmeasured confounding |
| Back-door formula usable? | Yes: both terms estimable | No: \(U\) latent |
| Identification route | Back-door adjustment | IV, front-door, … |
| Key assumption | \(Y(t) \indep T \mid X, U\) | IV, front-door, or partial-identification assumptions |
| Chapters in this course | Chs. 3–6 (back-door); Chs. 10–11 (IPW, DR) | Ch. 7 (IV); Ch. 8 (front-door); Ch. 9 (sensitivity) |
1.4 The Gaussian Linear Confounded Model
The model specializes Equation 1.1 to linear structural functions with Gaussian disturbances: \[T = \alpha U + \delta, \qquad Y = \beta T + \gamma U + \varepsilon, \qquad U \sim \mathcal{N}(0, \sigma_U^2),\quad \delta \sim \mathcal{N}(0, \sigma_\delta^2),\quad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2), \tag{1.4}\] with \(U\), \(\delta\), and \(\varepsilon\) mutually independent. The endogeneity of \(T\) is immediate: \(\mathrm{Cov}(T,\, \gamma U + \varepsilon) = \gamma\alpha\sigma_U^2 \neq 0\) whenever \(\alpha \neq 0\) and \(\gamma \neq 0\). Consequently, an OLS regression of \(Y\) on \(T\) does not estimate \(\beta\).
1.4.1 Two Explicit Densities
Write \(\sigma_T^2 = \alpha^2\sigma_U^2 + \sigma_\delta^2\) for the marginal variance of \(T\).
Interventional density. Set \(T = t\) externally; the back-door path through \(U\) is severed, so \(U \sim \mathcal{N}(0, \sigma_U^2)\) independently: \[P\!\left(y \mid \doop(T{=}t)\right) = \mathcal{N}\!\left(\beta t,\;\; \gamma^2\sigma_U^2 + \sigma_\varepsilon^2\right). \tag{1.5}\]
Observational density. Since \((Y, T)\) is jointly normal, \(\mathrm{Cov}(Y,T) = \beta\sigma_T^2 + \gamma\alpha\sigma_U^2\), giving: \[P\!\left(y \mid T{=}t\right) = \mathcal{N}\!\left(\left(\beta + \frac{\gamma\alpha\sigma_U^2}{\sigma_T^2}\right)t,\;\; \frac{\gamma^2\sigma_U^2\sigma_\delta^2}{\sigma_T^2} + \sigma_\varepsilon^2\right). \tag{1.6}\]
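Equations 1.5 and 1.6 can be checked by Monte Carlo. A minimal Python sketch (the chapter's lab is in R; parameter values anticipate Section 1.4.3, and conditioning on \(T{=}t_0\) is approximated by a narrow window):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
beta, alpha, gamma = 2.0, 1.0, 1.0   # sigma_U = sigma_delta = sigma_eps = 1

U = rng.normal(size=n)
delta = rng.normal(size=n)
eps = rng.normal(size=n)
T = alpha * U + delta                # T-equation of the linear model
Y = beta * T + gamma * U + eps       # Y-equation of the linear model

t0 = 1.0
# Observational: condition on T in a narrow window around t0.
sel = np.abs(T - t0) < 0.05
print(Y[sel].mean(), Y[sel].var())   # near 2.5 and 1.5, matching Equation 1.6

# Interventional: sever the T-equation and set T = t0; U, eps keep their laws.
Y_do = beta * t0 + gamma * U + eps
print(Y_do.mean(), Y_do.var())       # near 2.0 and 2.0, matching Equation 1.5
```

The observational draws are both shifted (mean 2.5 rather than 2) and narrower (variance 1.5 rather than 2), exactly as the closed forms predict.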
1.4.2 The Endogeneity Gap
| | Interventional | Observational |
|---|---|---|
| Mean | \(\beta t\) | \(\left(\beta + \frac{\gamma\alpha\sigma_U^2}{\sigma_T^2}\right)t\) |
| Variance | \(\gamma^2\sigma_U^2 + \sigma_\varepsilon^2\) | \(\frac{\gamma^2\sigma_U^2\sigma_\delta^2}{\sigma_T^2} + \sigma_\varepsilon^2\) |
| Residual | \((\gamma U{+}\varepsilon) \indep T\) | \((\gamma U{+}\varepsilon)\) correlated with \(T\) |
| Identified by | RCT (or IV, Chapter 7) | OLS |
| Equal when | \(\alpha = 0\) or \(\gamma = 0\) (no confounding) | |
The endogeneity bias of OLS is \(\gamma\alpha\sigma_U^2/\sigma_T^2\). With \(\alpha = \gamma = 1\) and \(\sigma_U^2 = \sigma_\delta^2 = 1\): bias \(= 1/(1+1) = 0.5\).
1.4.3 Lab: Simulating the Two Densities
The analytical results were verified by simulation using \(n = 10{,}000\) draws with parameters \(\beta = 2\), \(\alpha = 1\), \(\gamma = 1\), \(\sigma_U = \sigma_\varepsilon = \sigma_\delta = 1\). (R code: chapter1_lab.R.)
Experiment 1: OLS bias. By the omitted-variable bias formula, the OLS probability limit is \(\beta + \gamma\alpha\sigma_U^2/\sigma_T^2 = 2 + 0.5 = 2.5\).
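The omitted-variable bias calculation can be reproduced in a few lines. A Python sketch of Experiment 1 (the actual lab script chapter1_lab.R is in R; the seed here is arbitrary, so the digits will differ slightly from the table below):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
beta, alpha, gamma = 2.0, 1.0, 1.0   # sigma_U = sigma_delta = sigma_eps = 1

U, delta, eps = rng.normal(size=(3, n))
T = alpha * U + delta
Y = beta * T + gamma * U + eps

# OLS slope of Y on T via the sample covariance formula.
slope = np.cov(T, Y)[0, 1] / np.var(T, ddof=1)
bias = slope - beta
print(slope, bias)                   # slope near 2.5, bias near 0.5
```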
| | Simulated | Theory | Formula |
|---|---|---|---|
| OLS slope | 2.5147 | 2.5000 | \(\beta + \gamma\alpha\sigma_U^2/\sigma_T^2\) |
| True \(\beta\) | — | 2.0000 | \(\beta\) |
| Bias | 0.5147 | 0.5000 | \(\gamma\alpha\sigma_U^2/\sigma_T^2\) |
OLS overestimates the causal effect by 25%.
Experiment 2: Two-sample comparison at \(t_0 = 1\). The exact observational conditional distribution is \(\mathcal{N}(2.5,\, 1.5)\); the interventional distribution is \(\mathcal{N}(2,\, 2)\).
| | Mean (Sim / Theory) | SD (Sim / Theory) |
|---|---|---|
| \(\E[Y \mid T{=}1]\) | 2.497 / 2.500 | 1.229 / \(\sqrt{1.5} = 1.225\) |
| \(\E[Y \mid \doop(T{=}1)]\) | 2.001 / 2.000 | 1.428 / \(\sqrt{2} = 1.414\) |
The observational distribution also has smaller variance: \(\sigma_{\mathrm{obs}}^2/\sigma_{\mathrm{int}}^2 = 1.5/2 = 0.75\). Conditioning on \(T{=}1\) fixes \(\alpha U + \delta\), which partially determines \(U\) and therefore shrinks the spread that the \(\gamma U\) term contributes to \(Y\).
Experiment 3: Density overlay for varying \(\alpha\). As \(\alpha\) increases from 0 to 1.0, the observational density shifts rightward (growing bias) and narrows (reduced variance). The interventional density \(\mathcal{N}(2, 2)\) is invariant to \(\alpha\).
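The shift-and-narrow pattern can be confirmed directly from Equations 1.5 and 1.6 without any plotting. A short Python sketch over a small \(\alpha\) grid (grid values are arbitrary):

```python
beta, gamma = 2.0, 1.0
s2_U = s2_d = s2_e = 1.0
int_var = gamma**2 * s2_U + s2_e    # Equation 1.5 variance: free of alpha

alphas = [0.0, 0.5, 1.0]
slopes, obs_vars = [], []
for alpha in alphas:
    s2_T = alpha**2 * s2_U + s2_d
    slopes.append(beta + gamma * alpha * s2_U / s2_T)       # Eq. 1.6 mean coefficient
    obs_vars.append(gamma**2 * s2_U * s2_d / s2_T + s2_e)   # Eq. 1.6 variance

print([round(s, 3) for s in slopes])    # [2.0, 2.4, 2.5]: mean slope grows with alpha
print([round(v, 3) for v in obs_vars])  # [2.0, 1.8, 1.5]: variance falls below int_var = 2
```

At \(\alpha = 0\) the two densities coincide, as the "Equal when" row of the table in Section 1.4.2 states.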
1.5 The Two-Step Paradigm
Causal inference is a two-step discipline. The two steps are logically distinct and require different tools.
Step 1: Identification is a purely mathematical question about the causal model: can the interventional density be expressed as a function of the observed-data distribution? It does not depend on sample size. The main tools are: the back-door criterion (Chapter 3, applied in Chapters 4–6), the front-door criterion (Chapter 3), the three rules of the do-calculus (Chapter 3), and IV assumptions (Chapter 7).
Step 2: Estimation. Once identification has expressed the target as a functional \(\Psi(P)\) of the observed-data distribution \(P\), the statistical problem is to estimate \(\Psi(P)\) from \(n\) observations as efficiently as possible. The main tools are efficient influence functions, IPW, doubly robust estimators, and double machine learning (all Part III).
1.5.1 Worked Example: Flu Vaccination and Infection (Simpson’s Paradox)
A public health agency observes 1,000 individuals. Each person either received a flu vaccine (\(T=1\)) or not (\(T=0\)), and either became infected (\(Y=1\)) or not. Age group \(X\) (elderly \(= E\), young \(= Y\)) is recorded. Elderly people are both more likely to be vaccinated and more susceptible to infection, so \(X\) confounds the \(T\)–\(Y\) relationship. The causal DAG has paths: \(T \to Y\) (causal) and \(T \leftarrow X \to Y\) (back-door).
Observed data:
| | Vaccinated (\(T=1\)) | Unvaccinated (\(T=0\)) | Total |
|---|---|---|---|
| Elderly (\(X=E\)) | 360, 30% infected | 40, 50% infected | 400 |
| Young (\(X=Y\)) | 60, 5% infected | 540, 10% infected | 600 |
| Total | 420, 26.4% infected | 580, 12.8% infected | 1,000 |
Simpson’s paradox:
| Stratum | Vaccinated | Unvaccinated | Difference | Conclusion |
|---|---|---|---|---|
| Elderly | 30% | 50% | \(-20\) pp | vaccine helps |
| Young | 5% | 10% | \(-5\) pp | vaccine helps |
| Aggregate | 26.4% | 12.8% | \(+13.6\) pp | vaccine harms?? |
Within every stratum the vaccine reduces infection. Yet in the aggregate the vaccinated group has a higher infection rate, because 86% of vaccinated are elderly vs. 7% of unvaccinated.
Step 1 (Identification). The set \(\{X\}\) satisfies the back-door criterion: \[P\!\left(Y{=}1 \mid \doop(T{=}t)\right) = \sum_{x \in \{E,Y\}} P(Y{=}1 \mid T{=}t, X{=}x)\,P(X{=}x).\]
Step 2 (Estimation). With \(\hat{P}(X{=}E) = 0.4\) and \(\hat{P}(X{=}Y) = 0.6\): \[\hat{P}(Y{=}1 \mid \doop(T{=}1)) = 0.30 \times 0.4 + 0.05 \times 0.6 = 0.15,\] \[\hat{P}(Y{=}1 \mid \doop(T{=}0)) = 0.50 \times 0.4 + 0.10 \times 0.6 = 0.26.\]
The estimated causal risk difference is \(0.15 - 0.26 = -0.11\): the vaccine causally reduces infection probability by 11 percentage points.
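Both steps can be reproduced from the observed-data table in a few lines. A Python sketch (the stratum labels and rates are taken directly from the table above):

```python
# Stratum-specific infection rates P(Y=1 | T=t, X=x) from the observed table.
p_inf = {("E", 1): 0.30, ("E", 0): 0.50, ("Y", 1): 0.05, ("Y", 0): 0.10}
p_x = {"E": 0.4, "Y": 0.6}           # P(X=x): 400 elderly and 600 young of 1,000

def p_do(t):
    """Back-door adjustment: sum_x P(Y=1 | T=t, X=x) P(X=x)."""
    return sum(p_inf[(x, t)] * p_x[x] for x in p_x)

risk_diff = p_do(1) - p_do(0)
print(round(p_do(1), 4), round(p_do(0), 4), round(risk_diff, 4))  # 0.15 0.26 -0.11
```

Note that the weights are the marginal \(P(X{=}x)\), not the treatment-specific proportions: it is precisely this reweighting that resolves the paradox.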
1.6 Summary
Interventional vs. observational distribution. Causal inference answers questions about interventions \(P(y \mid \doop(T{=}t))\), not about observations \(P(y \mid T{=}t)\). In the Gaussian confounded model, the two densities coincide only when \(\alpha = 0\) or \(\gamma = 0\).
The causal trinity. The same causal question can be expressed in three languages — SEM (equation surgery), DAG (graph surgery), and potential outcomes (\(Y(t)\)). In the SEM framework, \(Y(t)\) is defined as the solution for \(Y\) in the mutilated system.
Endogeneity bias in the Gaussian confounded model. The coefficient of \(T\) in \(\E[Y \mid T{=}t]\) differs from the causal coefficient \(\beta\) by \(\gamma\alpha\sigma_U^2/\sigma_T^2\). The observational variance \(\gamma^2\sigma_U^2\sigma_\delta^2/\sigma_T^2 + \sigma_\varepsilon^2\) is smaller than the interventional variance \(\gamma^2\sigma_U^2 + \sigma_\varepsilon^2\).
Two-step paradigm. Causal inference is: identification (expressing \(\Psi(P)\) as a functional of observable data — a mathematical question) followed by estimation (constructing \(\hat{\Psi}_n\) efficiently — a statistical question). The two steps are analytically separate.
Causal inference is the study of when the observational distribution contains enough information to recover the interventional distribution.
1.7 Problems
1. Do-notation fundamentals. For each statement below, decide whether it refers to an interventional or an observational distribution, and rewrite it unambiguously using do-notation where appropriate.
- “The probability that a patient recovers given that they took the drug.”
- “The probability that a patient would recover if we prescribed the drug to everyone.”
- “Among students who attended tutoring, the average exam score was 85.”
- “If we enrolled all students in tutoring, the average exam score would be 85.”
2. Deriving the observational density. In the Gaussian confounded model (Equation 1.4), verify Equation 1.6 by carrying out the following steps. Let \(\sigma_T^2 = \alpha^2\sigma_U^2 + \sigma_\delta^2\).
- Show that \((Y, T)\) is jointly normal by writing both as linear combinations of the independent normals \((U, \varepsilon, \delta)\).
- Compute \(\mathrm{Cov}(Y, T)\) and \(\mathrm{Var}(T)\).
- Apply the conditional normal formula to derive \(\E[Y \mid T{=}t]\) and \(\mathrm{Var}[Y \mid T{=}t]\), and hence verify Equation 1.6.
- Show that the OLS probability limit is \(\beta + \gamma\alpha\sigma_U^2/\sigma_T^2\), and interpret this as an omitted-variable bias formula.
3. The causal trinity. Consider the following verbal causal claim: “Aspirin (\(T\)) reduces the risk of heart attack (\(Y\)) because it inhibits platelet aggregation (\(M\)), but patients with pre-existing cardiovascular disease (\(U\), unobserved) are both more likely to take aspirin and more likely to have a heart attack.”
- Draw the causal DAG implied by this description.
- Write down the recursive SEM with appropriate structural functions \(f_T\), \(f_M\), \(f_Y\).
- State the SUTVA assumption and discuss whether it is plausible in this setting.
- Explain why \(\E[Y \mid T{=}1] - \E[Y \mid T{=}0]\) does not identify the causal effect \(\E[Y \mid \doop(T{=}1)] - \E[Y \mid \doop(T{=}0)]\) in this DAG.