5  Randomization and Back-Door Adjustment

NoteLearning Objectives

By the end of this chapter, students should be able to:

  1. Explain in do-calculus terms why a randomized experiment makes \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\), and identify the graphical feature that produces this equality.
  2. State the ignorability assumption in both potential-outcomes and graphical language, and distinguish weak from strong ignorability.
  3. Apply the regression-adjusted estimator to reduce variance in a randomized experiment, and explain why consistency holds even under model misspecification.
  4. Derive and compute the standardization (g-formula) estimator of the ATE from a contingency table or regression output.
  5. Explain the method of stratification (subclassification) and describe when and why it removes confounding bias.
  6. Identify Simpson’s paradox in a numerical example and resolve it using the back-door criterion.
  7. Articulate why propensity score methods are needed when \(X\) is high-dimensional, motivating Chapter 6.

5.1 Randomized Experiments

5.1.1 The Do-Calculus of Randomization

The defining feature of a randomized experiment is that the analyst sets the treatment for each unit, independently of all background variables. In do-calculus terms, the data come from the mutilated distribution \(f(y \mid \doop(T{=}t))\) rather than the observational distribution \(f(y \mid T{=}t)\).

NoteThe Fundamental Equality of Randomized Experiments

In a completely randomized experiment, \(T\) is assigned independently of all pre-treatment variables. In the DAG, this means there are no arrows into \(T\): the treatment node has no parents. By Rule 2 of the do-calculus, with \(X = \varnothing\), \(Z = T\), \(W = \varnothing\), the graphical condition \((Y \indep T)_{\mathcal{G}_{\underline{T}}}\) holds because \(T\) is isolated in \(\mathcal{G}_{\underline{T}}\) (no parents in \(\mathcal{G}\), and the outgoing edge \(T \to Y\) is deleted). Rule 2 therefore gives: \[f\!\bigl(y \mid \doop(T{=}t)\bigr) = f(y \mid T{=}t). \tag{5.1}\] In a randomized experiment, observing \(T = t\) is the same as intervening to set \(T = t\).

What the equality assumes beyond the graph. The fundamental equality (Equation 5.1) relies on more than the deletion of arrows into \(T\). Two further conditions are needed. First, SUTVA (no interference and no hidden treatment versions) must hold, since it licenses the consistency equation \(Y_i = Y_i(T_i)\). Second, the realized treatment must equal the assigned treatment for every unit (full compliance); when this fails, \(f(y \mid T{=}t)\) describes outcomes among those who actually received \(t\) rather than those who were assigned \(t\), and the intent-to-treat versus per-protocol distinction becomes substantive (Chapters 7 and 13).

Graphical argument. In the observational DAG, arrows into \(T\) from observed covariates \(X\) and unobserved confounders \(U\) create back-door paths from \(T\) to \(Y\). Randomization physically severs these arrows: assignment is determined by a coin flip, not by \(X\) or \(U\). The mutilated graph \(\mathcal{G}_{\overline{T}}\) is, in a randomized experiment, the actual data-generating graph. There are no back-door paths to block, because none exist.

[Figure: Randomization as graph surgery. Left: in the observational DAG, both \(U\) and \(X\) have arrows into \(T\), creating back-door paths. Right: randomization severs all arrows into \(T\) (marked ×) and replaces them with an exogenous coin flip. The mutilated graph \(\mathcal{G}_{\overline{T}}\) has no back-door paths.]

Potential outcomes statement. In the potential outcomes language, randomization implies \((Y(0), Y(1)) \indep T\) unconditionally — no covariate adjustment is needed. This is the strongest possible version of ignorability.

5.1.2 Estimation of the ATE in a Randomized Experiment

TipLemma: Identification under Complete Randomization

Under complete randomization, \(\E[Y(t)] = \E[Y \mid T{=}t]\), \(t \in \{0,1\}\).

Proof. By the PO–do equivalence, \(\E[Y(t)] = \E[Y \mid \doop(T{=}t)]\). The treatment node has no parents, so Equation 5.1 gives \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\). \(\square\)

Under complete randomization, the ATE is estimated by the difference-in-means (DIM) estimator: \[\hat\tau_{\mathrm{DIM}} = \bar Y_1 - \bar Y_0 = \frac{1}{n_1}\sum_{i:\, T_i=1} Y_i - \frac{1}{n_0}\sum_{i:\, T_i=0} Y_i.\]
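As a running computational companion, here is a minimal sketch of the DIM estimator (Python with NumPy assumed; the function name is ours):

```python
import numpy as np

def dim_estimator(y, t):
    """Difference-in-means estimate of the ATE from outcomes y and binary assignment t."""
    y, t = np.asarray(y), np.asarray(t)
    return y[t == 1].mean() - y[t == 0].mean()
```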

TipTheorem: Neyman’s Theorem for the Difference-in-Means Estimator

Under complete randomization, let \(\mathcal{F}_N = \{(Y_i(0), Y_i(1))\}\) be the potential outcomes (fixed), and let expectations be over the randomization distribution. Define the finite-population variances \(S_t^2 = \frac{1}{n-1}\sum_i (Y_i(t) - \bar Y(t))^2\), cross-variance \(S_{01} = \frac{1}{n-1}\sum_i (Y_i(0)-\bar Y(0))(Y_i(1)-\bar Y(1))\), and effect variance \(S_\tau^2 = \frac{1}{n-1}\sum_i (\tau_i - \bar\tau_n)^2\). Then:

  1. Unbiasedness. \(\E[\hat\tau_{\mathrm{DIM}} \mid \mathcal{F}_N] = \bar Y(1) - \bar Y(0) = \bar\tau_n\).
  2. Design variance. \[\mathrm{Var}(\hat\tau_{\mathrm{DIM}} \mid \mathcal{F}_N) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\tau^2}{n}. \tag{5.2}\]
  3. Conservative variance estimator. The within-arm sample variance estimator \(\hat V = \hat S_1^2/n_1 + \hat S_0^2/n_0\) satisfies \(\E[\hat V \mid \mathcal{F}_N] - \mathrm{Var}(\hat\tau_{\mathrm{DIM}} \mid \mathcal{F}_N) = S_\tau^2/n \geq 0\). It is exact when the unit-level treatment effects \(\tau_i\) are constant.

Proof. Throughout we condition on \(\mathcal{F}_N\). Under CRE, \(\E[T_i \mid \mathcal{F}_N] = n_1/n\) for all \(i\), and: \[\mathrm{Cov}(T_i, T_j \mid \mathcal{F}_N) = \begin{cases} f_1(1-f_1) & i = j, \\ -f_1(1-f_1)/(n-1) & i \neq j, \end{cases}\] where \(f_1 = n_1/n\). The \(i \neq j\) expression follows from sampling \(n_1\) units without replacement.

(i) Unbiasedness. \(\E[n_1^{-1}\sum_i T_i Y_i(1) \mid \mathcal{F}_N] = \bar Y(1)\) by \(\E[T_i \mid \mathcal{F}_N] = n_1/n\); the control arm is symmetric.

(ii) Variance. Writing \(Z_i(t) = Y_i(t) - \bar Y(t)\) and using the covariance formula: \[\mathrm{Var}\!\left(\frac{1}{n_1}\sum_i T_i Y_i(1)\right) = \frac{f_1(1-f_1)}{n_1^2}\cdot\frac{n}{n-1}\sum_i Z_i(1)^2 = \frac{n_0}{n_1 n} S_1^2,\] and symmetrically for the control arm; the cross-arm covariance is \(\mathrm{Cov}(\bar Y_1, \bar Y_0 \mid \mathcal{F}_N) = -S_{01}/n\), which enters with coefficient \(-2\). Combining gives the first form: \[\mathrm{Var}(\hat\tau_{\mathrm{DIM}} \mid \mathcal{F}_N) = \frac{n_0}{n_1 n} S_1^2 + \frac{n_1}{n_0 n} S_0^2 + \frac{2 S_{01}}{n}.\] Rearranging using \(n_0/(n_1 n) = 1/n_1 - 1/n\), \(n_1/(n_0 n) = 1/n_0 - 1/n\), and \(S_\tau^2 = S_1^2 + S_0^2 - 2S_{01}\) gives the second form, Equation 5.2.

(iii) Conservatism. By finite-population sampling theory, \(\E[\hat S_t^2 \mid \mathcal{F}_N] = S_t^2\). Therefore \(\E[\hat V \mid \mathcal{F}_N] = S_1^2/n_1 + S_0^2/n_0\), and subtracting Equation 5.2 gives \(S_\tau^2/n \geq 0\). \(\square\)
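Part 3 of the theorem translates directly into code. A minimal sketch of the conservative variance estimator and the usual normal-approximation confidence interval (SciPy assumed; `neyman_inference` is our name):

```python
import numpy as np
from scipy import stats

def neyman_inference(y, t, alpha=0.05):
    """DIM estimate, conservative variance V_hat = S1^2/n1 + S0^2/n0,
    and a normal-approximation (1 - alpha) confidence interval."""
    y1, y0 = y[t == 1], y[t == 0]
    tau_hat = y1.mean() - y0.mean()
    # Conservative: overestimates the design variance by S_tau^2 / n.
    v_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    z = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt(v_hat)
    return tau_hat, se, (tau_hat - z * se, tau_hat + z * se)
```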

NoteRemark: Why the Conservatism Cannot Be Removed

The correction term \(S_\tau^2/n\) depends on the unit-level effect variance. Computing \(S_\tau^2\) requires both \(Y_i(0)\) and \(Y_i(1)\) for every unit — precisely what the fundamental problem of causal inference forbids. The estimator \(\hat V\) is not merely pragmatic; it is the sharpest variance estimator based on observed data alone without further modelling assumptions.

NoteRemark: Superpopulation Variance as a Corollary

Under i.i.d. superpopulation draws with arm variances \((\sigma_0^2, \sigma_1^2)\), the total variance has two phases: randomization variance (given \(\mathcal{F}_N\)) and sampling variance (over draws of \(\mathcal{F}_N\)). The law of total variance gives: \[\mathrm{Var}(\hat\tau_{\mathrm{DIM}}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0}. \tag{5.3}\] The Neyman correction \(S_\tau^2/n\) in Equation 5.2 is exactly cancelled by the sampling variance of \(\bar\tau_n\) when averaging over superpopulation draws. See Ding (2024) for a complete treatment.
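To see the cancellation explicitly, write the law of total variance with \(\sigma_\tau^2 = \mathrm{Var}(\tau_i)\) (a one-line sketch using \(\E[S_t^2] = \sigma_t^2\), \(\E[S_\tau^2] = \sigma_\tau^2\), and \(\mathrm{Var}(\bar\tau_n) = \sigma_\tau^2/n\) under i.i.d. draws):

\[
\mathrm{Var}(\hat\tau_{\mathrm{DIM}})
= \E\bigl[\mathrm{Var}(\hat\tau_{\mathrm{DIM}} \mid \mathcal{F}_N)\bigr] + \mathrm{Var}\bigl(\E[\hat\tau_{\mathrm{DIM}} \mid \mathcal{F}_N]\bigr)
= \frac{\sigma_1^2}{n_1} + \frac{\sigma_0^2}{n_0} - \frac{\sigma_\tau^2}{n} + \frac{\sigma_\tau^2}{n}.
\]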

5.1.3 Fisher’s Randomization Inference vs. Neyman’s Repeated Sampling

Fisher’s framework tests the sharp null \(Y_i(1) = Y_i(0)\) for all \(i\): under this null, every missing potential outcome is known, making the exact randomization distribution of any test statistic computable over all \(\binom{n}{n_1}\) assignments. Neyman’s framework studies the repeated-sampling behavior of estimators like \(\hat\tau_{\mathrm{DIM}}\) for average causal effects.

Fisher’s approach is aligned with exact hypothesis testing under a sharp null; Neyman’s is aligned with point estimation and uncertainty quantification. Both rely on the treatment assignment mechanism but answer different questions.
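A minimal sketch of a Fisher randomization test using the absolute DIM as test statistic; for realistic \(n\), the \(\binom{n}{n_1}\) assignments are sampled by Monte Carlo rather than enumerated (NumPy assumed; the names are ours):

```python
import numpy as np

def fisher_p_value(y, t, n_draws=10_000, seed=None):
    """Monte Carlo p-value for the sharp null Y_i(1) = Y_i(0) for all i.
    Under the sharp null, y is fixed; only the assignment vector varies."""
    rng = np.random.default_rng(seed)
    n, n1 = len(t), int(t.sum())
    obs = abs(y[t == 1].mean() - y[t == 0].mean())
    exceed = 0
    for _ in range(n_draws):
        idx = rng.choice(n, size=n1, replace=False)  # one draw from the CRE design
        t_star = np.zeros(n, dtype=int)
        t_star[idx] = 1
        stat = abs(y[t_star == 1].mean() - y[t_star == 0].mean())
        exceed += stat >= obs
    return (exceed + 1) / (n_draws + 1)  # include the observed assignment
```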

NoteRemark: Randomization Licenses Inference Without Population Assumptions

Randomization licenses causal inference without any assumption about a population model. In Fisher’s design-based framework, the potential outcomes \(\{Y_i(0), Y_i(1)\}\) are fixed numbers; the only source of randomness is the assignment vector \(\mathbf{T}\), drawn uniformly from all \(\binom{n}{n_1}\) possible assignments. Probability statements refer to this mechanism — not to repeated draws from a population. Units need not be a random sample; they can be a convenience sample, a census, or a targeted group. Observational methods, by contrast, must invoke either an i.i.d. sampling assumption or a superpopulation framework, because no analogous physical mechanism anchors probability statements.

NoteRemark: Fisher’s Sharp Null — Unidentifiable Effects, Exactly Testable Hypothesis

Individual causal effects \(\tau_i = Y_i(1) - Y_i(0)\) are unidentifiable for any unit. Yet the sharp null \(H_0: Y_i(1) = Y_i(0)\) for all \(i\) is exactly testable. The resolution: under \(H_0\), the sharp null completely specifies the joint potential outcome table — \(Y_i(1) = Y_i(0) = Y_i^{\mathrm{obs}}\) regardless of assignment — so the full \(2n\)-vector of potential outcomes is known, and the permutation distribution of any test statistic is computable from observed data with no estimation step.

NoteRemark: Further Reading

Ding (2024) provides a comprehensive graduate-level treatment of CRE, stratified experiments, and regression adjustment. Li and Ding (2017) works out the asymptotic theory underlying confidence intervals for \(\hat\tau_{\mathrm{DIM}}\) via a finite-population CLT requiring neither i.i.d. observations nor a superpopulation model. Ding et al. (2016) develop randomization-based tests for treatment effect heterogeneity.

5.2 Ignorability

5.2.1 Two Routes to the Same Estimand

Section 5.1 established that randomization achieves Equation 5.1 and that DIM is unbiased for the ATE without covariate adjustment. In observational studies, no such physical severance occurs: treatment is selected on the basis of characteristics that may include unmeasured variables \(U\) affecting the outcome.

Both settings target the same estimand — the ATE \(\E[Y(1) - Y(0)]\) — but reach it by fundamentally different routes:

|  | Randomized experiment | Observational study |
|---|---|---|
| How ignorability arises | By design: researcher severs all arrows into \(T\) | By assumption: analyst asserts no unmeasured confounders |
| \((Y(0),Y(1)) \indep T \mid X\) | Guaranteed — holds unconditionally | Assumed — requires \(X\) to capture every confounder |
| Credibility | As strong as the randomization protocol | As strong as substantive knowledge of the DGP |
| Testability | Can audit the assignment mechanism | Cannot be verified from data alone |
| Failure mode | Protocol violations, non-compliance | Any unmeasured variable affecting both \(T\) and \(Y\) |

Statistical adjustment can implement ignorability once assumed, but cannot create it. No covariate adjustment — however flexible — can close a back-door path through an unmeasured variable. This is why a well-conducted RCT is considered more credible than even the most carefully adjusted observational study.

5.2.2 Terminology

Strong ignorability was defined in Chapter 4: joint unconfoundedness \((Y(0), Y(1)) \indep T \mid X\) together with overlap \(0 < P(T{=}1 \mid X) < 1\) a.s. (Rosenbaum and Rubin 1983). The pointwise (weak) form \(Y(t) \indep T \mid X\) for each \(t\) separately is what identification of \(\E[Y(t)]\) actually uses. Equivalent names in the literature: unconfoundedness, selection on observables, no unmeasured confounders, conditional exchangeability.

5.2.3 Three Languages for Ignorability

| Language | Statement of ignorability |
|---|---|
| SEM | In \(Y = f_Y(T, X, U_Y)\) with \(T = f_T(X, U_T)\), the exogenous factor \(U_Y\) is independent of \(T\) given \(X\) — no unobserved common cause of \(T\) and \(Y\) remains after conditioning on \(X\). |
| DAG / do-calculus | \(X\) satisfies the back-door criterion for \((T, Y)\): \(X\) blocks all back-door paths and contains no descendant of \(T\). |
| Potential outcomes | \(Y(t) \indep T \mid X\) for \(t \in \{0,1\}\) (weak ignorability). |

These three forms are closely aligned under NPSEM-IE semantics together with consistency and no interference.

5.2.4 What Ignorability Requires

WarningIgnorability Is Untestable from Observational Data

No statistical test can confirm or refute unconfoundedness using observed data alone. If an unobserved variable \(U\) affects both \(T\) and \(Y\), the back-door path \(T \leftarrow U \to Y\) is open, and conditioning on any set of observed variables cannot close it. Sensitivity analysis (Rosenbaum 2002) can quantify how large such an unobserved confounder would have to be to overturn a conclusion, but cannot confirm the assumption itself.

5.2.5 From Ignorability to Identification

TipLemma: Identification of \(\E[Y(t)]\)

Suppose: (i) weak ignorability \(Y(t) \indep T \mid X\); (ii) overlap \(P(T{=}t \mid X{=}x) > 0\) a.s.; (iii) consistency \(Y = Y(T)\). Then: \[\E[Y(t)] = \E\bigl[\E[Y \mid T{=}t, X]\bigr] = \int \E[Y \mid T{=}t, X{=}x]\, p(x)\, dx. \tag{5.4}\]

Proof. Apply the law of iterated expectations, then use each assumption in turn: \[\E[Y(t)] \overset{\mathrm{LIE}}{=} \E[\E[Y(t) \mid X]] \overset{(i)}{=} \E[\E[Y(t) \mid T{=}t, X]] \overset{(iii)}{=} \E[\E[Y \mid T{=}t, X]]. \quad\square\] Overlap (ii) guarantees \(\E[Y \mid T{=}t, X{=}x]\) is well-defined everywhere in the support of \(X\).

NoteRemark: Two Languages, One Formula

The identification lemma above and the back-door theorem (Chapter 3) are two proofs of the same formula in different languages. The graphical proof applies Rules 2 and 3 to the mutilated graph; the potential-outcomes proof applies the law of iterated expectations to counterfactual variables. The bridge is the structural causal semantics: under NPSEM-IE, \(Y(t)\) has the same distribution as \(Y\) under \(\doop(T{=}t)\) (Chapter 4). Consistency connects \(\E[Y(t) \mid T{=}t, X]\) to the observed \(\E[Y \mid T{=}t, X]\).

TipTheorem: Identification of the ATE and ATT

Under the same three assumptions: \[\tau_{\mathrm{ATE}} = \int [\E(Y \mid T{=}1, X{=}x) - \E(Y \mid T{=}0, X{=}x)]\, p(x)\, dx, \tag{5.5}\] \[\tau_{\mathrm{ATT}} = \int [\E(Y \mid T{=}1, X{=}x) - \E(Y \mid T{=}0, X{=}x)]\, p(x \mid T{=}1)\, dx. \tag{5.6}\]

Proof. Apply the identification lemma separately to \(t=1\) and \(t=0\) and subtract. For the ATT, replace \(p(x)\) by \(p(x \mid T{=}1)\). \(\square\)

Assumption traceability. Each step in the proof invokes exactly one assumption:

| Proof step | Assumption invoked | Failure mode |
|---|---|---|
| \(\E[Y(t) \mid X] = \E[Y(t) \mid T{=}t, X]\) | Weak ignorability | Unmeasured confounder: \(Y(t)\) depends on \(T\) within \(X\)-strata |
| \(\E[Y \mid T{=}t, X{=}x]\) is well-defined | Overlap | Empty stratum: no units with \(T{=}t\) at \(x\) |
| \(\E[Y(t) \mid T{=}t, X] = \E[Y \mid T{=}t, X]\) | Consistency | Interference or hidden treatment versions |

Overlap is testable from data (check support of \(X \mid T{=}1\) vs. \(X \mid T{=}0\)). Weak ignorability and consistency are not testable from data alone.
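Since overlap is the one testable assumption here, a crude diagnostic for scalar \(X\) is worth sketching (for multivariate \(X\), one inspects an estimated propensity score instead; see Chapter 6):

```python
import numpy as np

def overlap_check(x, t):
    """Compare the support of x across arms and flag units outside the common support."""
    x1, x0 = x[t == 1], x[t == 0]
    lo, hi = max(x1.min(), x0.min()), min(x1.max(), x0.max())
    frac_outside = np.mean((x < lo) | (x > hi))
    return {"common_support": (lo, hi), "frac_outside": frac_outside}
```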

5.2.6 Which Variables to Condition On: The Pre-Treatment Requirement

Before estimating anything, there is a prior graphical question: which variables should be in \(X\)? The answer is not “all available variables.” Conditioning on the wrong variable introduces bias.

WarningNever Condition on a Post-Treatment Variable

A variable \(L\) is post-treatment if \(T \to L\) in the DAG. Including \(L\) in \(X\) can bias the estimated treatment effect through two mechanisms:

  1. Blocking a causal pathway. If \(L\) is a mediator (\(T \to L \to Y\)), conditioning on \(L\) blocks part of the causal effect of \(T\). The result estimates a direct effect, not the total effect.
  2. Opening a collider path. If \(L\) is a collider (\(T \to L \leftarrow U \to Y\)), conditioning on \(L\) opens a previously blocked path, creating a spurious \(T\)–\(Y\) association through the unobserved \(U\).

In both cases, including \(L\) violates Condition 1 of the back-door criterion: \(X\) must contain no descendant of \(T\). The back-door criterion is not merely a sufficiency condition for identification — it is the correct filter for deciding which variables to include.

The practical rule: before running any regression, classify every candidate covariate as pre-treatment or post-treatment using the causal graph. Only pre-treatment variables satisfying the back-door criterion belong in \(X\).

5.3 Regression Adjustment and Standardization

5.3.1 The Common Three-Step Logic

Both regression adjustment and standardization implement the same back-door formula Equation 5.5. They differ only in how they estimate \(\mu(t, x) = \E[Y \mid T{=}t, X{=}x]\).

NoteThe Three-Step Recipe: Outcome Regression / G-Formula
  1. Estimate the outcome model. Fit a model for \(\mu(t,x) = \E[Y \mid T{=}t, X{=}x]\) using observed data. Any regression method may be used: OLS, logistic regression, or a flexible nonparametric estimator.

  2. Predict both potential outcomes for every unit. For each unit \(i\) — regardless of the treatment actually received — compute: \[\hat{Y}_i(1) = \hat\mu(1, X_i), \qquad \hat{Y}_i(0) = \hat\mu(0, X_i).\] For a treated unit, \(\hat{Y}_i(0)\) is the predicted outcome had that unit been assigned to control; for a control unit, \(\hat{Y}_i(1)\) is the predicted outcome had that unit been treated.

  3. Average the individual treatment effect estimates. \[\hat\tau_{\mathrm{OR}} = \frac{1}{n}\sum_{i=1}^n [\hat\mu(1, X_i) - \hat\mu(0, X_i)]. \tag{5.7}\] To estimate the ATT, average only over the \(n_1\) treated units.

This is the outcome regression (OR) or G-computation estimator (Robins 1986). Step 3 averages out \(X\) using the empirical distribution — exactly what Equation 5.5 requires: \(\tau_{\mathrm{ATE}} = \int [\mu(1,x) - \mu(0,x)] p(x)\, dx\).
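A minimal sketch of the three-step recipe with an additive OLS outcome model (statsmodels assumed; any regression can replace step 1):

```python
import numpy as np
import statsmodels.api as sm

def or_estimator(y, t, X):
    """G-computation with the outcome model mu(t, x) = b0 + b1*t + b2'x."""
    t = np.asarray(t, dtype=float)
    X = np.atleast_2d(X).reshape(len(t), -1)
    ones = np.ones(len(t))
    design = np.column_stack([ones, t, X])
    fit = sm.OLS(y, design).fit()                       # step 1: fit the outcome model
    d1 = np.column_stack([ones, ones, X])               # step 2: everyone treated
    d0 = np.column_stack([ones, np.zeros(len(t)), X])   #         everyone control
    return float(np.mean(fit.predict(d1) - fit.predict(d0)))  # step 3: average
```

With this additive model the estimate coincides with the OLS coefficient on \(T\); once \(T \times X\) interactions are added (as in Section 5.3.5), steps 2–3 do real work.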

5.3.2 A Worked Example

Binary covariate \(X\) (e.g., sex), continuous outcome \(Y\) (e.g., earnings). High-\(X\) units are overrepresented in the treated arm: 70% of treated units have \(X{=}1\), versus 30% of controls.

|  | Treated (\(T=1\)) | Control (\(T=0\)) |
|---|---|---|
| \(X=0\) | \(n=30\), \(\bar Y = 42\) | \(n=70\), \(\bar Y = 35\) |
| \(X=1\) | \(n=70\), \(\bar Y = 58\) | \(n=30\), \(\bar Y = 50\) |
| All | \(n=100\), \(\bar Y = 53.2\) | \(n=100\), \(\bar Y = 39.5\) |

Unadjusted difference in means: \(53.2 - 39.5 = 13.7\).

Step 1. \(\hat\mu(1,0) = 42\), \(\hat\mu(1,1) = 58\), \(\hat\mu(0,0) = 35\), \(\hat\mu(0,1) = 50\).

Step 2. Within-stratum effects: \(42 - 35 = 7\) (for \(X{=}0\)), \(58 - 50 = 8\) (for \(X{=}1\)).

Step 3. Averaging over the marginal distribution of \(X\): 50% of the full sample has \(X{=}0\), 50% has \(X{=}1\): \[\hat\tau_{\mathrm{ATE}} = 7 \times 0.5 + 8 \times 0.5 = 7.5.\]

After adjusting for \(X\), the estimated treatment effect is 7.5, not 13.7. The unadjusted comparison conflates the treatment effect with the higher baseline earnings of the \(X{=}1\) units, who happen to be treated more often.
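The arithmetic above is easy to verify mechanically; a sketch of the standardization computation applied to the table (plain Python, numbers from Section 5.3.2):

```python
# Cell means mu[t][x] and cell sizes n[(t, x)] from the worked example.
mu = {1: {0: 42, 1: 58}, 0: {0: 35, 1: 50}}
n = {(1, 0): 30, (1, 1): 70, (0, 0): 70, (0, 1): 30}

total = sum(n.values())  # 200 units
p_x = {x: (n[(1, x)] + n[(0, x)]) / total for x in (0, 1)}  # marginal P(X=x): 0.5 each

tau_ate = sum((mu[1][x] - mu[0][x]) * p_x[x] for x in (0, 1))
print(tau_ate)  # 7.5
```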

5.3.3 Standardization as a Special Case

When \(X\) is categorical, standardization computes \(\hat\mu(t,x)\) as the within-cell sample mean — a fully saturated model: \[\hat\tau_{\mathrm{STD}} = \sum_{x \in \mathcal{X}} [\hat\mu(1,x) - \hat\mu(0,x)]\,\hat{p}(x). \tag{5.8}\] This is algebraically identical to Equation 5.7 when \(X\) is categorical: standardization is regression adjustment with a saturated outcome model. The formula is also known as the G-formula (Robins 1986) and as direct standardization in epidemiology.

|  | Regression adjustment | Standardization |
|---|---|---|
| How \(\hat\mu(t,x)\) is estimated | Parametric model: OLS, logistic, or flexible learner | Cell means: \(\bar Y\) within \((T{=}t, X{=}x)\) |
| Works when \(X\) is | Continuous or high-dimensional | Discrete and low-dimensional |
| Bias if wrong | Model misspecification | Sparse cells (some \((t,x)\) strata empty) |
| Steps 2–3 | Identical: predict both POs, average | Identical: predict both POs, average |

5.3.4 Model Specification and What Can Go Wrong

In an observational study, the OR estimator is consistent only if \(\hat\mu(t,x) \to \mu(t,x)\) as \(n \to \infty\) — i.e., only if the outcome model is correctly specified. (In a randomized experiment, Section 5.3.5 shows that consistency survives misspecification.)

Misspecification bias. If the true \(\mu(t,x)\) is nonlinear but a linear model is used, the estimated treatment effect absorbs the functional form error. Flexible machine learning estimators reduce this risk, at the cost of requiring cross-fitting to avoid overfitting bias (Chapter 11).

Extrapolation. Step 2 predicts \(\hat{Y}_i(0)\) for treated units whose covariate values may lie outside the support of the control group. The propensity score overlap condition (Chapter 6) formalizes when extrapolation is unavoidable.

5.3.5 Regression Adjustment in Randomized Experiments: Lin (2013)

In a randomized experiment, outcome regression can reduce variance even though adjustment is not needed for unbiasedness. The regression-adjusted estimator of Lin (2013) fits the fully interacted OLS model: \[Y_i = \alpha + \beta T_i + \gamma^\top \tilde{X}_i + \delta^\top (T_i \cdot \tilde{X}_i) + \varepsilon_i, \tag{5.9}\] where \(\tilde{X}_i = X_i - \bar{X}\) are mean-centered covariates, and takes \(\hat\beta\) as the ATE estimate.

Why \(\hat\beta\) equals the three-step OR estimator. The interacted regression Equation 5.9 fits arm-specific linear models with centered covariates. Under centering, the OLS intercept equals the regression-adjusted estimator of \(\E[Y(t)]\), so \(\hat\beta = \hat\mu_{1,\mathrm{reg}} - \hat\mu_{0,\mathrm{reg}} = \hat\tau_{\mathrm{reg}}\).
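A minimal sketch of the interacted estimator in Equation 5.9 (statsmodels assumed; pairing it with a heteroskedasticity-robust covariance is our choice here, not something the derivation requires):

```python
import numpy as np
import statsmodels.api as sm

def lin_estimator(y, t, X):
    """Fully interacted OLS: Y ~ 1 + T + Xc + T:Xc, with Xc mean-centered.
    The coefficient on T is the regression-adjusted ATE estimate."""
    t = np.asarray(t, dtype=float)
    X = np.atleast_2d(X).reshape(len(t), -1)
    Xc = X - X.mean(axis=0)                      # centering makes beta_T the ATE estimate
    design = np.column_stack([np.ones(len(t)), t, Xc, t[:, None] * Xc])
    fit = sm.OLS(y, design).fit(cov_type="HC2")  # robust SEs (our choice)
    return fit.params[1], fit.bse[1]
```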

TipTheorem: Design-Consistency of the Regression-Adjusted Estimator (Lin 2013)

Under complete randomization, \(\hat\beta\) from the interacted regression Equation 5.9 is consistent for \(\tau_{\mathrm{ATE}}\) regardless of whether the linear model is correctly specified.

Proof. Work in the design-based framework of Neyman’s theorem (Section 5.1.2): potential outcomes fixed, only \(\mathbf{T}\) random. Let \(\mathbf{B}_1 = (\sum_i \mathbf{x}_i \mathbf{x}_i^\top)^{-1}\sum_i \mathbf{x}_i Y_i(1)\) be the full-population OLS coefficient (fixed under randomization), and let \(e_i(1) = Y_i(1) - \mathbf{x}_i^\top \mathbf{B}_1\). By the normal equations, \(\sum_i e_i(1) = 0\) and \(\frac{1}{N}\sum_i \mathbf{x}_i^\top \mathbf{B}_1 = \bar Y(1)\).

Linearization. A short algebraic manipulation using the normal equations gives: \[\hat\mu_{1,\mathrm{reg}} = \bar Y(1) + \frac{1}{N_1}\sum_{i=1}^N T_i e_i(1) + O_p(N^{-1}). \tag{5.10}\] The remainder is \(O_p(N^{-1})\) because the covariate-balance gap is \(O_p(N^{-1/2})\) by design, and \(\hat{\boldsymbol\beta}_1 - \mathbf{B}_1 = O_p(N^{-1/2})\) by standard within-arm OLS theory.

Consistency. Under CRE, each unit has \(\Pr(T_i=1 \mid \mathcal{F}_N) = N_1/N\), so \(\E[N_1^{-1}\sum_i T_i e_i(1) \mid \mathcal{F}_N] = 0\) by \(\sum_i e_i(1) = 0\). The conditional variance is \(O(N^{-1})\) by finite-population sampling theory; by Chebyshev, \(N_1^{-1}\sum_i T_i e_i(1) \to_p 0\). By Equation 5.10, \(\hat\mu_{1,\mathrm{reg}} - \bar Y(1) \to_p 0\). Symmetrically, \(\hat\mu_{0,\mathrm{reg}} - \bar Y(0) \to_p 0\). Therefore \(\hat\beta \to_p \bar Y(1) - \bar Y(0) = \tau_{\mathrm{ATE}}\).

Crucially, the probability limit of \(\hat{\boldsymbol\beta}_1\) never enters: consistency comes from the randomization distribution, not from correctness of the linear model. \(\square\)

Byproduct: variance reduction. Equation 5.10 shows that the regression estimator is approximately \(\bar\tau + N_1^{-1}\sum_i T_i e_i(1) - N_0^{-1}\sum_i (1-T_i) e_i(0)\). The variance reduction relative to DIM comes from replacing raw outcomes by residuals: when the linear model explains a meaningful share of the variation in \(Y_i(t)\), \(S_{e(t)}^2 < S_{Y(t)}^2\) and the design variance shrinks.

NoteRemark: The Role of the Regression Model in a Randomized Experiment

In a randomized experiment, the regression model serves one purpose only: variance reduction. It does not contribute to unbiasedness. DIM is already design-consistent; adding covariates reduces \(\mathrm{Var}(\hat\tau)\) by absorbing residual variation in \(Y\) unrelated to \(T\). If the model fits well, the residual variance shrinks and the estimator becomes more precise. If misspecified or covariates are weakly predictive, the variance reduction is small or zero, but no bias is introduced. This stands in sharp contrast to the observational setting, where the regression model carries a double burden: it must both adjust for confounding (unbiasedness) and fit the outcome surface (efficiency). Misspecification in an observational study threatens both properties simultaneously.

NoteRemark: Precedence in Survey Sampling

The design-consistency result of Section 5.3.5 — model-agnostic consistency under the randomization distribution — was established in the survey sampling literature decades before Lin (2013). Isaki and Fuller (1982) proved this for the generalized regression (GREG) estimator under general probability sampling designs. A CRE is structurally equivalent to simple random sampling from the finite population of potential outcomes, so their result directly implies the theorem above. Deville and Särndal (1992) extended this to calibration estimators. Lin (2013)’s contribution was restating and proving this within the Neyman–Rubin framework for the experimental causal inference audience.

5.4 Stratification

Stratification (subclassification) is a non-parametric implementation of the back-door formula. Rather than modeling \(\E[Y \mid T, X]\), the analyst divides the sample into strata — subgroups with similar values of \(X\) — and estimates the treatment effect within each stratum.

Within stratum \(\mathcal{S}_k\), the treated and control units have similar covariate distributions. If the stratum is narrow enough, ignorability holds approximately, and the within-stratum DIM: \[\hat\tau_k = \bar Y_{1,k} - \bar Y_{0,k}\] is approximately unbiased for the stratum-specific ATE. The overall ATE is estimated as: \[\hat\tau_{\mathrm{STRAT}} = \sum_{k=1}^K \hat\tau_k \cdot \frac{n_k}{n}. \tag{5.11}\]

Cochran (1968) showed that with \(K = 5\) equal-size strata, roughly 90% of the bias from a single continuous confounder is removed; with \(K = 10\), over 95%.
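A minimal sketch of subclassification on a continuous confounder with \(K\) quantile strata, implementing Equation 5.11 (NumPy assumed; \(K=5\) follows Cochran’s rule of thumb):

```python
import numpy as np

def stratified_ate(y, t, x, K=5):
    """Subclassify on quantiles of x, take the DIM within each stratum,
    and aggregate with weights n_k / n (Equation 5.11)."""
    edges = np.quantile(x, np.linspace(0, 1, K + 1))
    strata = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, K - 1)
    tau_hat = 0.0
    for k in range(K):
        m = strata == k
        y1, y0 = y[m & (t == 1)], y[m & (t == 0)]
        if len(y1) == 0 or len(y0) == 0:
            raise ValueError(f"stratum {k} has an empty arm (overlap failure)")
        tau_hat += (y1.mean() - y0.mean()) * m.mean()  # weight n_k / n
    return tau_hat
```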

NoteRemark: Stratification, Standardization, and Survey Sampling Terminology

When \(X\) is categorical, the stratified estimator Equation 5.11 and the standardization estimator Equation 5.8 are algebraically identical: both compute within-cell treatment effect estimates and aggregate with marginal weights \(n_k/n\). The names reflect disciplinary tradition.

When \(X\) is continuous, stratification estimates \(\hat\mu(t,x)\) by a piecewise-constant step function; regression fits a smooth model. In the survey sampling literature, stratification (design-stage: divide before sampling) and post-stratification (analysis-stage: reweight after sampling) are distinct. Standardization in causal inference corresponds to post-stratification.

Limitations. Stratification faces the curse of dimensionality: with \(p\) binary covariates, there are \(2^p\) cells, many empty in practice. Two solutions: regression adjustment, which imposes parametric structure on \(\hat\mu(t,x)\); and propensity score stratification (Chapter 6), which collapses multivariate \(X\) to a scalar \(\pi(X) = P(T{=}1 \mid X)\).

5.5 Simpson’s Paradox

Simpson’s paradox — the reversal of an association when conditioning on a third variable — was introduced in Chapter 1. This chapter’s development gives it a second layer of interpretation. The back-door formula used in Chapter 1 is precisely the standardization estimator Equation 5.8: within-stratum conditional means averaged over the marginal distribution of the confounder rather than its treatment-conditional distribution. The pooled association does not do this — it weights strata by \(P(X \mid T)\) instead of \(P(X)\).

5.5.1 When Should You Not Condition on \(X\)?

Simpson’s paradox has a mirror image: conditioning can create a spurious association or make a true causal effect disappear. This occurs when \(X\) is a mediator or a collider. The back-door criterion gives the correct prescription: include \(X\) only if it blocks back-door paths without opening collider paths.

WarningConditioning on a Mediator or Collider

Adjusting for a post-treatment variable \(X\) that lies on a causal pathway \(T \to X \to Y\) blocks part of the causal effect, producing a downward-biased estimate of the total effect. Adjusting for a collider \(X\) with \(T \to X \leftarrow Y\) opens a spurious association that does not exist in the population. Neither mistake is detectable from the data alone; the DAG is essential.

5.6 Lab: Simulation Study of the Outcome Regression Estimator

This simulation compares the linear and local-linear OR estimators across two designs and illustrates the bias-variance tradeoff.

Data-generating process. \(X \sim \mathrm{Uniform}(0,1)\), \(\varepsilon \sim \mathcal{N}(0,1)\) independent of \(X\). Potential outcomes: \[Y(t) = t + 3(1+t)X^2 + \varepsilon, \qquad t \in \{0,1\}. \tag{5.12}\] The conditional ATE is \(\tau(x) = 1 + 3x^2\), so \(\tau_{\mathrm{ATE}} = 1 + 3\E[X^2] = 1 + 1 = 2\).

Two assignment mechanisms:

  • CRD: \(T \sim \mathrm{Bern}(0.5)\), independent of \(X\).
  • Observational: \(T \mid X \sim \mathrm{Bern}(\mathrm{expit}(-2 + 5X))\), giving \(\pi(0) \approx 0.12\), \(\pi(0.5) \approx 0.62\), \(\pi(1) \approx 0.95\). Strong confounding: high-\(X\) units are nearly always treated.

Strong ignorability holds in both: under CRD by design; under the observational mechanism because \(T\) depends only on \(X\) through a known stochastic function, leaving no unmeasured variable affecting both \(T\) and \(Y\).

Estimators. Linear OR: within each arm, fit OLS of \(Y\) on \(X\) (misspecified: true conditional mean is quadratic with a \(TX^2\) interaction). Local-linear OR: within each arm, fit local linear regression with Gaussian kernel, bandwidth \(h = 0.5 \cdot n^{-1/5}\).
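A sketch of one replication of the simulation (NumPy only; the local-linear smoother is hand-rolled, so exact values will differ slightly from the table below depending on implementation details such as boundary handling):

```python
import numpy as np

def local_linear(x_tr, y_tr, x_eval, h):
    """Local linear regression with a Gaussian kernel, evaluated pointwise."""
    preds = np.empty(len(x_eval))
    for j, x0 in enumerate(x_eval):
        w = np.exp(-0.5 * ((x_tr - x0) / h) ** 2)
        Z = np.column_stack([np.ones_like(x_tr), x_tr - x0])
        Zw = Z * w[:, None]
        beta = np.linalg.solve(Zw.T @ Z, Zw.T @ y_tr)
        preds[j] = beta[0]                            # intercept = fitted value at x0
    return preds

def one_replication(rng, n=1000, observational=False):
    x = rng.uniform(size=n)
    if observational:
        t = rng.binomial(1, 1 / (1 + np.exp(2 - 5 * x)))  # expit(-2 + 5x)
    else:
        t = rng.binomial(1, 0.5, size=n)
    y = t + 3 * (1 + t) * x**2 + rng.normal(size=n)       # Equation 5.12
    h = 0.5 * n ** (-1 / 5)

    est = {}
    for method in ("linear", "local-linear"):
        mu = {}
        for arm in (0, 1):
            xa, ya = x[t == arm], y[t == arm]
            if method == "linear":
                slope, intercept = np.polyfit(xa, ya, 1)
                mu[arm] = intercept + slope * x
            else:
                mu[arm] = local_linear(xa, ya, x, h)
        est[method] = np.mean(mu[1] - mu[0])
    return est

rng = np.random.default_rng(2024)
print(one_replication(rng, observational=True))
```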

Results (\(n = 1000\), \(B = 2000\) replications, seed 2024):

| Design | Estimator | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|
| CRD | Linear OR | 2.0013 | +0.0013 | 0.0679 | 0.0678 |
| CRD | Local-linear OR | 2.0333 | +0.0333 | 0.0659 | 0.0738 |
| Observational | Linear OR | 1.9602 | −0.0398 | 0.0851 | 0.0939 |
| Observational | Local-linear OR | 2.0240 | +0.0240 | 0.0923 | 0.0953 |

Lesson 1: Under CRD, the linear OR is unbiased regardless of model specification. This confirms the design-consistency theorem of Section 5.3.5: under complete randomization, the linear OR converges to \(\tau_{\mathrm{ATE}}\) even with a wrong outcome model. The randomization distribution, not the model, does the identification work.

Lesson 2: Under an observational design, the misspecified linear OR is biased. The bias grows from \(+0.0013\) under CRD to \(-0.0398\) under the observational design. The omitted \(X^2\) and \(TX^2\) terms cause \(\hat\mu(t,x)\) to misrepresent the outcome surface; because high-\(X\) units are predominantly treated, the misfit is systematically amplified in the direction of confounding. Under CRD, the balanced assignment averages out the same misfit.

Lesson 3: The local-linear OR nearly eliminates bias at the cost of higher variance. Under the observational design, local-linear OR reduces bias from \(-0.0398\) to \(+0.0240\) but its SD is 0.0923 vs. 0.0851 for the linear model. The RMSE comparison (0.0939 vs. 0.0953) favors the linear estimator in MSE, even though it is biased — the classic bias-variance tradeoff.

WarningModel Misspecification and Confounding Interact

The linear OR bias is \(+0.0013\) under CRD and \(-0.0398\) under the observational design. The misspecification is identical in both cases — the difference comes entirely from the assignment mechanism. Under CRD, \(T \indep X\), so \(X\) within each arm is \(\mathrm{Uniform}(0,1)\). The population OLS pointwise error \(b(x) = -\tfrac{1}{2} + 3x - 3x^2\) integrates to exactly zero over \(U(0,1)\): \(\E[b(X)] = \int_0^1 (-\tfrac{1}{2}+3x-3x^2)\, dx = 0\). This cancellation is structural: the population OLS line is the linear projection of \(\tau(X)\) onto \(\{1, X\}\) under \(U(0,1)\), and a linear projection always preserves the mean. Under the observational design, the within-arm distributions of \(X\) are distorted by the propensity score, so the same cancellation fails.
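The integral identity is quick to check numerically, along with the failure of the cancellation when the treated arm’s \(X\)-distribution is tilted by the propensity score (a rough illustration, not an exact bias calculation for the two-arm OR fit):

```python
import numpy as np

b = lambda x: -0.5 + 3 * x - 3 * x**2   # population OLS projection error from the box above
expit = lambda z: 1 / (1 + np.exp(-z))

grid = np.linspace(0, 1, 1_000_001)
print(b(grid).mean())                   # ~0: E[b(X)] = 0 under X ~ U(0,1)

w = expit(-2 + 5 * grid)                # P(T=1 | X=x) in the observational design
print((b(grid) * w).mean() / w.mean())  # E[b(X) | T=1] != 0: the cancellation fails
```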

5.7 Chapter Summary

| Estimand | Identification | Key assumption | Method |
|---|---|---|---|
| ATE under CRD | \(\E[Y(t)] = \E[Y \mid T{=}t]\) | Randomization protocol | DIM, regression-adjusted DIM |
| ATE under ignorability | Equation 5.5 | Ignorability + overlap | OR, standardization, stratification |
| ATT | Equation 5.6 | Same, one-sided overlap for \(t=0\) | Same methods, weighted over treated |
| Propensity score (Ch. 6) | Reduces \(X\) to scalar \(\pi(X)\) | Correctly specified PS model | IPW, matching, PS stratification |
  1. Randomization as graph surgery. A randomized experiment removes all arrows into \(T\), so \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\) and DIM is unbiased for the ATE without covariate adjustment.
  2. Ignorability is the key assumption. Under unconfoundedness \((Y(0),Y(1)) \indep T \mid X\), the back-door adjustment formula Equation 5.5 identifies the ATE. This assumption is substantive, design-determined in RCTs, and untestable from observational data.
  3. Three estimation strategies. Regression adjustment (G-formula), standardization, and stratification all implement Equation 5.5 via different modeling choices. In a randomized experiment, regression adjustment is design-consistent for the ATE even when the outcome model is misspecified (Section 5.3.5); in observational studies, it is consistent only when the outcome model is.
  4. Simpson’s paradox. Pooled associations can reverse within subgroups when a confounder determines treatment selection. The correct causal estimate requires standardization over the marginal covariate distribution, guided by the back-door criterion.
  5. The propensity score dimension reduction. When \(X\) is high-dimensional, direct stratification or cell-by-cell standardization breaks down. The propensity score \(\pi(X)\) provides a scalar sufficient statistic for adjustment. Chapter 6 develops the theory and estimation methods.

5.8 Problems

1. Randomization and the do-calculus. Let the observational DAG be \(\{U \to T, U \to Y, X \to T, T \to Y\}\) with \(U\) unobserved.

  1. Identify all back-door paths from \(T\) to \(Y\).
  2. Does \(X\) satisfy the back-door criterion? Does \(U\)? Explain.
  3. Now suppose treatment is randomized. Draw the modified DAG and identify all back-door paths. Show that \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\) holds using Rule 2 of the do-calculus. (Hint: identify the appropriate \((X, Z, W)\) instantiation, construct \(\mathcal{G}_{\underline{T}}\), and verify \((Y \indep T)_{\mathcal{G}_{\underline{T}}}\).)
  4. Under randomization, is covariate adjustment on \(X\) necessary for unbiasedness? Is it ever beneficial? Explain.

2. Ignorability and the back-door criterion. Consider the DAG \(\{X \to T, X \to Y, T \to Y, U \to Y\}\) with \(U\) unobserved.

  1. Does \(\{X\}\) satisfy the back-door criterion? Write out the identifying formula for \(\tau_{\mathrm{ATE}}\).
  2. Now add the edge \(U \to T\) to the DAG. Does \(\{X\}\) still satisfy the back-door criterion? What does the criterion require when both \(X\) and \(U\) affect \(T\)?
  3. Explain in words what “unconfoundedness given \(X\)” means about the role of \(U\) in the data-generating process.

3. Standardization. A study of job training (\(T\)) and earnings (\(Y\), in thousands) produces the following cell means, with \(P(X{=}0) = 0.4\) and \(P(X{=}1) = 0.6\):

| \(X\) | \(\hat\mu(1,x)\) | \(\hat\mu(0,x)\) | \(n_x/n\) |
|---|---|---|---|
| 0 | 28 | 22 | 0.4 |
| 1 | 35 | 31 | 0.6 |
  1. Compute \(\hat\tau_{\mathrm{ATE}}\) using the standardization formula Equation 5.8.
  2. Compute \(\hat\tau_{\mathrm{ATT}}\), given that all treated units come from \(X{=}1\) — i.e., \(P(X{=}1 \mid T{=}1) = 1\).
  3. The unadjusted difference in means is \(33 - 25 = 8\). Compare to your answers in (a) and (b) and explain the discrepancy.

4. Stratification and Cochran’s rule. You have a binary confounder \(X \in \{0,1\}\) and form two strata. Within stratum \(X{=}0\): \(n_0 = 600\), \(\bar Y_1 = 10\), \(\bar Y_0 = 8\). Within stratum \(X{=}1\): \(n_1 = 400\), \(\bar Y_1 = 15\), \(\bar Y_0 = 12\).

  1. Compute the stratified ATE estimator \(\hat\tau_{\mathrm{STRAT}}\) via Equation 5.11.
  2. Suppose an unadjusted DIM gives \(\hat\tau_{\mathrm{DIM}} = 5.5\). Explain why the two estimates differ and which is the appropriate causal estimate.
  3. Describe one limitation that would arise if \(X\) were a continuous variable with 15 dimensions.

5. Simpson’s paradox. A hospital reports that ICU patients (\(T{=}1\)) have higher mortality than others (\(T{=}0\)): \(P(Y{=}1 \mid T{=}1) = 0.30\) vs. \(P(Y{=}1 \mid T{=}0) = 0.10\).

  1. Construct a numerical example (a \(2 \times 2 \times 2\) table with disease severity \(X\) as the confounder) consistent with the pooled numbers yet showing ICU admission reduces mortality within both severity strata.
  2. Compute the standardized ATE using the marginal distribution of \(X\).
  3. Draw the DAG. Identify the back-door path that the pooled comparison fails to block.
  4. A colleague argues the pooled statistic is the right answer since the hospital treats patients with both mild and severe illness. Explain, using potential outcomes notation, why this argument is incorrect.
Cochran, William G. 1968. “The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies.” Biometrics 24 (2): 295–313. https://doi.org/10.2307/2528036.
Deville, Jean-Claude, and Carl-Erik Särndal. 1992. “Calibration Estimators in Survey Sampling.” Journal of the American Statistical Association 87 (418): 376–82.
Ding, Peng. 2024. A First Course in Causal Inference. CRC Press.
Ding, Peng, Avi Feller, and Luke Miratrix. 2016. “Randomization Inference for Treatment Effect Variation.” Journal of the Royal Statistical Society: Series B 78 (3): 655–71.
Isaki, Cary T., and Wayne A. Fuller. 1982. “Survey Design Under the Regression Superpopulation Model.” Journal of the American Statistical Association 77 (377): 89–96.
Li, Xinran, and Peng Ding. 2017. “General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference.” Journal of the American Statistical Association 112 (520): 1759–69.
Lin, Winston. 2013. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” The Annals of Applied Statistics 7 (1): 295–318.
Robins, James M. 1986. “A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period—Application to Control of the Healthy Worker Survivor Effect.” Mathematical Modelling 7 (9–12): 1393–512.
Rosenbaum, Paul R. 2002. Observational Studies. 2nd ed. Springer. https://doi.org/10.1007/978-1-4757-3692-2.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.