5 Randomization and Back-Door Adjustment
5.1 Randomized Experiments
5.1.1 The Do-Calculus of Randomization
The defining feature of a randomized experiment is that the analyst sets the treatment for each unit, independently of all background variables. In do-calculus terms, the data come from the mutilated distribution \(f(y \mid \doop(T{=}t))\) rather than the observational distribution \(f(y \mid T{=}t)\).
What the equality assumes beyond the graph. The fundamental equality Equation 5.1 relies on more than deletion of arrows into \(T\). Two further conditions are needed. First, SUTVA (no interference and no hidden treatment versions), which licenses the consistency equation \(Y_i = Y_i(T_i)\). Second, the realized treatment must equal the assigned treatment for every unit (full compliance); when this fails, \(f(y \mid T{=}t)\) describes outcomes among those who actually received \(t\) rather than those who were assigned \(t\). When compliance fails, the intent-to-treat versus per-protocol distinction becomes substantive (Chapters 7 and 13).
Graphical argument. In the observational DAG, arrows into \(T\) from observed covariates \(X\) and unobserved confounders \(U\) create back-door paths from \(T\) to \(Y\). Randomization physically severs these arrows: assignment is determined by a coin flip, not by \(X\) or \(U\). The mutilated graph \(\mathcal{G}_{\overline{T}}\) is, in a randomized experiment, the actual data-generating graph. There are no back-door paths to block, because none exist.
Potential outcomes statement. In the potential outcomes language, randomization implies \((Y(0), Y(1)) \indep T\) unconditionally — no covariate adjustment is needed. This is the strongest possible version of ignorability.
5.1.2 Estimation of the ATE in a Randomized Experiment
Proof. By the PO–do equivalence, \(\E[Y(t)] = \E[Y \mid \doop(T{=}t)]\). The treatment node has no parents, so Equation 5.1 gives \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\). \(\square\)
Under complete randomization, the ATE is estimated by the difference-in-means (DIM) estimator: \[\hat\tau_{\mathrm{DIM}} = \bar Y_1 - \bar Y_0 = \frac{1}{n_1}\sum_{i:\, T_i=1} Y_i - \frac{1}{n_0}\sum_{i:\, T_i=0} Y_i.\]
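In code, the DIM estimator is a one-liner. The sketch below simulates a completely randomized experiment with a constant effect of 2; the DGP and all variable names are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative completely randomized experiment with constant effect tau = 2.
n = 10_000
y0 = rng.normal(0.0, 1.0, n)                    # potential outcome Y(0)
y1 = y0 + 2.0                                   # potential outcome Y(1)
t = rng.permutation(np.repeat([0, 1], n // 2))  # exactly n/2 units treated
y = np.where(t == 1, y1, y0)                    # consistency: Y = Y(T)

# Difference-in-means estimator
tau_dim = y[t == 1].mean() - y[t == 0].mean()
```

Because assignment is independent of \((Y(0), Y(1))\), no covariate enters the estimator at all.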
5.1.3 Fisher’s Randomization Inference vs. Neyman’s Repeated Sampling
Fisher’s framework tests the sharp null \(Y_i(1) = Y_i(0)\) for all \(i\): under this null, every missing potential outcome is known, making the exact randomization distribution of any test statistic computable over all \(\binom{n}{n_1}\) assignments. Neyman’s framework studies the repeated-sampling behavior of estimators like \(\hat\tau_{\mathrm{DIM}}\) for average causal effects.
Fisher’s approach is aligned with exact hypothesis testing under a sharp null; Neyman’s is aligned with point estimation and uncertainty quantification. Both rely on the treatment assignment mechanism but answer different questions.
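A minimal Monte Carlo version of Fisher's test, on hypothetical data; with small \(n\) one could instead enumerate all \(\binom{n}{n_1}\) assignments exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical outcomes and assignment. Under the sharp null, y is fixed
# and only the treatment labels vary across re-randomizations.
y = np.array([3.1, 2.4, 5.0, 4.2, 1.9, 6.3, 2.8, 4.9])
t = np.array([1, 0, 1, 0, 0, 1, 0, 1])

def dim(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

obs = dim(y, t)
# Monte Carlo approximation to the exact randomization distribution
# of the DIM statistic under the sharp null.
draws = np.array([dim(y, rng.permutation(t)) for _ in range(10_000)])
p_value = np.mean(np.abs(draws) >= np.abs(obs))
```

The p-value is exact up to Monte Carlo error: no model, no asymptotics, only the known assignment mechanism.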
5.2 Ignorability
5.2.1 Two Routes to the Same Estimand
Section 5.1 established that randomization achieves Equation 5.1 and that DIM is unbiased for the ATE without covariate adjustment. In observational studies, no such physical severance occurs: treatment is selected based on characteristics that may include unmeasured variables \(U\) affecting the outcome.
Both settings target the same estimand — the ATE \(\E[Y(1) - Y(0)]\) — but reach it by fundamentally different routes:
| | Randomized experiment | Observational study |
|---|---|---|
| How ignorability arises | By design: researcher severs all arrows into \(T\) | By assumption: analyst asserts no unmeasured confounders |
| \((Y(0),Y(1)) \indep T \mid X\) | Guaranteed — holds unconditionally | Assumed — requires \(X\) to capture every confounder |
| Credibility | As strong as the randomization protocol | As strong as substantive knowledge of the DGP |
| Testability | Can audit the assignment mechanism | Cannot be verified from data alone |
| Failure mode | Protocol violations, non-compliance | Any unmeasured variable affecting both \(T\) and \(Y\) |
Statistical adjustment can implement ignorability once assumed, but cannot create it. No covariate adjustment — however flexible — can close a back-door path through an unmeasured variable. This is why a well-conducted RCT is considered more credible than even the most carefully adjusted observational study.
5.2.2 Terminology
Strong ignorability was defined in Chapter 4: joint unconfoundedness \((Y(0), Y(1)) \indep T \mid X\) together with overlap \(0 < P(T{=}1 \mid X) < 1\) a.s. (Rosenbaum and Rubin 1983). The pointwise (weak) form \(Y(t) \indep T \mid X\) for each \(t\) separately is what identification of \(\E[Y(t)]\) actually uses. Equivalent names in the literature: unconfoundedness, selection on observables, no unmeasured confounders, conditional exchangeability.
5.2.3 Three Languages for Ignorability
| Language | Statement of ignorability |
|---|---|
| SEM | In \(Y = f_Y(T, X, U_Y)\) with \(T = f_T(X, U_T)\), the exogenous factor \(U_Y\) is independent of \(T\) given \(X\) — no unobserved common cause of \(T\) and \(Y\) remains after conditioning on \(X\). |
| DAG / do-calculus | \(X\) satisfies the back-door criterion for \((T, Y)\): \(X\) blocks all back-door paths and contains no descendant of \(T\). |
| Potential outcomes | \(Y(t) \indep T \mid X\) for \(t \in \{0,1\}\) (weak ignorability). |
These three forms are closely aligned under NPSEM-IE semantics together with consistency and no interference.
5.2.4 What Ignorability Requires
5.2.5 From Ignorability to Identification
Proof. Apply the law of iterated expectations, then use each assumption in turn: \[\E[Y(t)] \overset{\mathrm{LIE}}{=} \E[\E[Y(t) \mid X]] \overset{(i)}{=} \E[\E[Y(t) \mid T{=}t, X]] \overset{(iii)}{=} \E[\E[Y \mid T{=}t, X]]. \quad\square\] Overlap (ii) guarantees \(\E[Y \mid T{=}t, X{=}x]\) is well-defined everywhere in the support of \(X\).
Proof. Apply ?lem-ident separately to \(t=1\) and \(t=0\) and subtract. For the ATT, replace \(p(x)\) by \(p(x \mid T{=}1)\). \(\square\)
Assumption traceability. Each step in the proof invokes exactly one assumption:
| Proof step | Assumption invoked | Failure mode |
|---|---|---|
| \(\E[Y(t) \mid X] = \E[Y(t) \mid T{=}t, X]\) | Weak ignorability | Unmeasured confounder: \(Y(t)\) depends on \(T\) within \(X\)-strata |
| \(\E[Y \mid T{=}t, X{=}x]\) is well-defined | Overlap | Empty stratum: no units with \(T{=}t\) at \(x\) |
| \(\E[Y(t) \mid T{=}t, X] = \E[Y \mid T{=}t, X]\) | Consistency | Interference or hidden treatment versions |
Overlap is testable from data (check support of \(X \mid T{=}1\) vs. \(X \mid T{=}0\)). Weak ignorability and consistency are not testable from data alone.
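A crude single-covariate overlap check on simulated data, borrowing the assignment mechanism from the lab in Section 5.6:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 2000
x = rng.uniform(0.0, 1.0, n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + 5.0 * x))))  # expit(-2 + 5x)

# Compare the supports of X | T=1 and X | T=0; covariate regions outside
# the intersection have no comparable units in the other arm.
lo = max(x[t == 1].min(), x[t == 0].min())
hi = min(x[t == 1].max(), x[t == 0].max())
print(f"common support of X: [{lo:.3f}, {hi:.3f}]")
```

With higher-dimensional \(X\), checking raw supports becomes impractical; Chapter 6's propensity score reduces the check to a single scalar.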
5.2.6 Which Variables to Condition On: The Pre-Treatment Requirement
Before estimating anything, there is a prior graphical question: which variables should be in \(X\)? The answer is not “all available variables.” Conditioning on the wrong variable introduces bias.
The practical rule: before running any regression, classify every candidate covariate as pre-treatment or post-treatment using the causal graph. Only pre-treatment variables satisfying the back-door criterion belong in \(X\).
5.3 Regression Adjustment and Standardization
5.3.1 The Common Three-Step Logic
Both regression adjustment and standardization implement the same back-door formula Equation 5.5. They differ only in how they estimate \(\mu(t, x) = \E[Y \mid T{=}t, X{=}x]\).
This is the outcome regression (OR) or G-computation estimator (Robins 1986). Step 3 averages out \(X\) using the empirical distribution — exactly what Equation 5.5 requires: \(\tau_{\mathrm{ATE}} = \int [\mu(1,x) - \mu(0,x)] p(x)\, dx\).
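The three steps can be sketched with arm-wise OLS as the outcome model. The DGP below is our own illustration (true ATE of 2, confounded assignment), not from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative confounded data: T depends on X, Y depends on T and X.
n = 5000
x = rng.normal(0.0, 1.0, n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 2.0 * t + 1.5 * x + rng.normal(0.0, 1.0, n)   # true ATE = 2

# Step 1: fit mu(t, x) = E[Y | T=t, X=x] -- here OLS within each arm.
def fit_ols(xa, ya):
    A = np.column_stack([np.ones_like(xa), xa])
    return np.linalg.lstsq(A, ya, rcond=None)[0]   # (intercept, slope)

b1 = fit_ols(x[t == 1], y[t == 1])
b0 = fit_ols(x[t == 0], y[t == 0])

# Step 2: predict both potential outcomes for every unit.
mu1 = b1[0] + b1[1] * x
mu0 = b0[0] + b0[1] * x

# Step 3: average the predicted contrasts over the empirical distribution of X.
tau_or = np.mean(mu1 - mu0)
```

Step 3 is the empirical analogue of integrating against \(p(x)\): each unit contributes its own predicted contrast once, regardless of which arm it happened to land in.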
5.3.2 A Worked Example
Binary covariate \(X\) (e.g., sex), continuous outcome \(Y\) (e.g., earnings). High-\(X\) units are overrepresented in the treated arm (70% treated vs. 30% control).
| | Treated (\(T=1\)) | Control (\(T=0\)) | |
|---|---|---|---|
| \(X=0\) | \(n=30\), \(\bar Y = 42\) | \(n=70\), \(\bar Y = 35\) | |
| \(X=1\) | \(n=70\), \(\bar Y = 58\) | \(n=30\), \(\bar Y = 50\) | |
| All | \(n=100\), \(\bar Y = 53.2\) | \(n=100\), \(\bar Y = 39.5\) | Unadjusted diff = 13.7 |
Step 1. \(\hat\mu(1,0) = 42\), \(\hat\mu(1,1) = 58\), \(\hat\mu(0,0) = 35\), \(\hat\mu(0,1) = 50\).
Step 2. Within-stratum effects: \(42 - 35 = 7\) (for \(X{=}0\)), \(58 - 50 = 8\) (for \(X{=}1\)).
Step 3. Averaging over the marginal distribution of \(X\): 50% of the full sample has \(X{=}0\), 50% has \(X{=}1\): \[\hat\tau_{\mathrm{ATE}} = 7 \times 0.5 + 8 \times 0.5 = 7.5.\]
After adjusting for \(X\), the estimated treatment effect is 7.5, not 13.7. The unadjusted comparison conflates the treatment effect with the higher baseline earnings of \(X{=}1\) units who happen to be treated more often.
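The same arithmetic in code, reproducing the cell means from the table above:

```python
# (t, x) -> (cell size, cell mean), as in the worked example.
cells = {
    (1, 0): (30, 42.0), (1, 1): (70, 58.0),
    (0, 0): (70, 35.0), (0, 1): (30, 50.0),
}
n_total = sum(n for n, _ in cells.values())                # 200
p_x = {x: (cells[(1, x)][0] + cells[(0, x)][0]) / n_total  # marginal P(X=x)
       for x in (0, 1)}                                    # both 0.5
tau_ate = sum((cells[(1, x)][1] - cells[(0, x)][1]) * p_x[x] for x in (0, 1))
print(tau_ate)  # 7.5
```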
5.3.3 Standardization as a Special Case
When \(X\) is categorical, standardization computes \(\hat\mu(t,x)\) as the within-cell sample mean — a fully saturated model: \[\hat\tau_{\mathrm{STD}} = \sum_{x \in \mathcal{X}} [\hat\mu(1,x) - \hat\mu(0,x)]\,\hat{p}(x). \tag{5.8}\] This is algebraically identical to Equation 5.7 when \(X\) is categorical: standardization is regression adjustment with a saturated outcome model. The formula is also known as the G-formula (Robins 1986) and as direct standardization in epidemiology.
| | Regression adjustment | Standardization |
|---|---|---|
| How \(\hat\mu(t,x)\) is estimated | Parametric model: OLS, logistic, or flexible learner | Cell means: \(\bar Y\) within \((T{=}t, X{=}x)\) |
| Works when \(X\) is | Continuous or high-dimensional | Discrete and low-dimensional |
| Bias if wrong | Model misspecification | Sparse cells (some \((t,x)\) strata empty) |
| Steps 2–3 | Identical: predict both POs, average | Identical: predict both POs, average |
5.3.4 Model Specification and What Can Go Wrong
The OR estimator is consistent if and only if \(\hat\mu(t,x) \to \mu(t,x)\) as \(n \to \infty\) — i.e., if the outcome model is correctly specified.
Misspecification bias. If the true \(\mu(t,x)\) is nonlinear but a linear model is used, the estimated treatment effect absorbs the functional form error. Flexible machine learning estimators reduce this risk, at the cost of requiring cross-fitting to avoid overfitting bias (Chapter 11).
Extrapolation. Step 2 predicts \(\hat{Y}_i(0)\) for treated units whose covariate values may lie outside the support of the control group. The propensity score overlap condition (Chapter 6) formalizes when extrapolation is unavoidable.
5.3.5 Regression Adjustment in Randomized Experiments: Lin (2013)
In a randomized experiment, outcome regression can reduce variance even though adjustment is not needed for unbiasedness. The regression-adjusted estimator of Lin (2013) fits the fully interacted OLS model: \[Y_i = \alpha + \beta T_i + \gamma^\top \tilde{X}_i + \delta^\top (T_i \cdot \tilde{X}_i) + \varepsilon_i, \tag{5.9}\] where \(\tilde{X}_i = X_i - \bar{X}\) are mean-centered covariates, and takes \(\hat\beta\) as the ATE estimate.
Why \(\hat\beta\) equals the three-step OR estimator. The interacted regression Equation 5.9 fits arm-specific linear models with centered covariates. Under centering, the OLS intercept equals the regression-adjusted estimator of \(\E[Y(t)]\), so \(\hat\beta = \hat\mu_{1,\mathrm{reg}} - \hat\mu_{0,\mathrm{reg}} = \hat\tau_{\mathrm{reg}}\).
Byproduct: variance reduction. Equation 5.10 shows that the regression estimator is approximately \(\bar\tau + N_1^{-1}\sum_i T_i e_i(1) - N_0^{-1}\sum_i (1-T_i) e_i(0)\). The variance reduction relative to DIM comes from replacing raw outcomes by residuals: when the linear model explains a meaningful share of \(Y_i(t)\)’s variation, \(S_{e(t)}^2 < S_{Y(t)}^2\) and the design variance shrinks.
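A sketch of the interacted regression Equation 5.9 via least squares, on a simulated experiment of our own (true ATE is 1 under this illustrative DGP):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 4000
x = rng.normal(0.0, 1.0, (n, 2))
t = rng.permutation(np.repeat([0, 1], n // 2))   # complete randomization
y = t + x @ np.array([1.0, -0.5]) + 0.5 * t * x[:, 0] + rng.normal(0.0, 1.0, n)

xc = x - x.mean(axis=0)                           # mean-center the covariates
# Columns: intercept, T, centered X, T-by-centered-X interactions.
design = np.column_stack([np.ones(n), t, xc, t[:, None] * xc])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
tau_lin = coef[1]                                 # beta-hat on T = ATE estimate
```

Because the covariates are mean-centered, the coefficient on \(T\) coincides with the three-step OR estimator built from arm-specific linear models.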
5.4 Stratification
Stratification (subclassification) is a non-parametric implementation of the back-door formula. Rather than modeling \(\E[Y \mid T, X]\), the analyst divides the sample into strata — subgroups with similar values of \(X\) — and estimates the treatment effect within each stratum.
Within stratum \(\mathcal{S}_k\), the treated and control units have similar covariate distributions. If the stratum is narrow enough, ignorability holds approximately, and the within-stratum DIM: \[\hat\tau_k = \bar Y_{1,k} - \bar Y_{0,k}\] is approximately unbiased for the stratum-specific ATE. The overall ATE is estimated as: \[\hat\tau_{\mathrm{STRAT}} = \sum_{k=1}^K \hat\tau_k \cdot \frac{n_k}{n}. \tag{5.11}\]
Cochran (1968) showed that with \(K = 5\) equal-size strata, roughly 90% of the bias from a single continuous confounder is removed; with \(K = 10\), over 95%.
Limitations. Stratification faces the curse of dimensionality: with \(p\) binary covariates, there are \(2^p\) cells, many empty in practice. Two solutions: regression adjustment, which imposes parametric structure on \(\hat\mu(t,x)\); and propensity score stratification (Chapter 6), which collapses multivariate \(X\) to a scalar \(\pi(X) = P(T{=}1 \mid X)\).
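Equation 5.11 in code, with a five-level confounder (illustrative DGP of our own, true ATE of 2):

```python
import numpy as np

rng = np.random.default_rng(5)

n = 6000
x = rng.integers(0, 5, n)                      # K = 5 strata
t = rng.binomial(1, 0.2 + 0.12 * x)            # treatment more likely at high x
y = 2.0 * t + 1.0 * x + rng.normal(0.0, 1.0, n)

tau_strat = 0.0
for k in range(5):
    m = x == k
    tau_k = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()  # within-stratum DIM
    tau_strat += tau_k * m.mean()                            # weight by n_k / n
```

The loop makes the curse of dimensionality concrete: each stratum needs units in both arms, which fails quickly as the number of cells grows.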
5.5 Simpson’s Paradox
Simpson’s paradox — the reversal of an association when conditioning on a third variable — was introduced in Chapter 1. This chapter’s development gives it a second layer of interpretation. The back-door formula used in Chapter 1 is precisely the standardization estimator Equation 5.8: within-stratum conditional means averaged over the marginal distribution of the confounder rather than its treatment-conditional distribution. The pooled association does not do this — it weights strata by \(P(X \mid T)\) instead of \(P(X)\).
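The weighting difference is easy to see numerically, reusing the cell means from the worked example in Section 5.3.2:

```python
# Cell means mu(t, x) and covariate shares from the worked example in 5.3.2.
mu = {(1, 0): 42.0, (1, 1): 58.0, (0, 0): 35.0, (0, 1): 50.0}
p_x = {0: 0.5, 1: 0.5}                   # marginal P(X)
p_x_given_t = {1: {0: 0.3, 1: 0.7},      # P(X | T=1): treated are mostly high-X
               0: {0: 0.7, 1: 0.3}}      # P(X | T=0)

# Pooled association: each arm averaged over its own covariate distribution.
pooled = (sum(mu[(1, x)] * p_x_given_t[1][x] for x in (0, 1))
          - sum(mu[(0, x)] * p_x_given_t[0][x] for x in (0, 1)))
# Standardization: both arms averaged over the marginal distribution of X.
standardized = sum((mu[(1, x)] - mu[(0, x)]) * p_x[x] for x in (0, 1))
```

The two quantities use the same cell means; only the weights over \(X\) differ, and that difference alone separates the confounded association from the causal contrast.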
5.5.1 When Should You Not Condition on \(X\)?
Simpson’s paradox has a mirror image: conditioning can create a spurious association or make a true causal effect disappear. This occurs when \(X\) is a mediator or a collider. The back-door criterion gives the correct prescription: include \(X\) only if it blocks back-door paths without opening collider paths.
5.6 Lab: Simulation Study of the Outcome Regression Estimator
This simulation compares the linear and local-linear OR estimators across two designs and illustrates the bias-variance tradeoff.
Data-generating process. \(X \sim \mathrm{Uniform}(0,1)\), \(\varepsilon \sim \mathcal{N}(0,1)\) independent of \(X\). Potential outcomes: \[Y(t) = t + 3(1+t)X^2 + \varepsilon, \qquad t \in \{0,1\}. \tag{5.12}\] The conditional ATE is \(\tau(x) = 1 + 3x^2\), so \(\tau_{\mathrm{ATE}} = 1 + 3\E[X^2] = 1 + 1 = 2\).
Two assignment mechanisms:
- CRD: \(T \sim \mathrm{Bern}(0.5)\), independent of \(X\).
- Observational: \(T \mid X \sim \mathrm{Bern}(\mathrm{expit}(-2 + 5X))\), giving \(\pi(0) \approx 0.12\), \(\pi(0.5) \approx 0.62\), \(\pi(1) \approx 0.95\). Strong confounding: high-\(X\) units are nearly always treated.
Strong ignorability holds in both: under CRD by design; under the observational mechanism because \(T\) depends only on \(X\) through a known stochastic function, leaving no unmeasured variable affecting both \(T\) and \(Y\).
Estimators. Linear OR: within each arm, fit OLS of \(Y\) on \(X\) (misspecified: true conditional mean is quadratic with a \(TX^2\) interaction). Local-linear OR: within each arm, fit local linear regression with Gaussian kernel, bandwidth \(h = 0.5 \cdot n^{-1/5}\).
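A condensed sketch of the simulation (linear OR only; the local-linear estimator and the full \(B = 2000\) replications are omitted, so the numbers it prints only approximate those reported below):

```python
import numpy as np

rng = np.random.default_rng(2024)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear_or(x, t, y):
    """Arm-wise OLS of Y on X, then average predicted contrasts over X."""
    def fit(xa, ya):
        A = np.column_stack([np.ones_like(xa), xa])
        return np.linalg.lstsq(A, ya, rcond=None)[0]
    b1, b0 = fit(x[t == 1], y[t == 1]), fit(x[t == 0], y[t == 0])
    return np.mean((b1[0] + b1[1] * x) - (b0[0] + b0[1] * x))

def one_draw(n, design):
    x = rng.uniform(0.0, 1.0, n)
    t = (rng.binomial(1, 0.5, n) if design == "crd"
         else rng.binomial(1, expit(-2.0 + 5.0 * x)))
    y = t + 3.0 * (1 + t) * x**2 + rng.normal(0.0, 1.0, n)  # Equation 5.12
    return linear_or(x, t, y)

results = {}
for design in ("crd", "obs"):
    est = np.array([one_draw(1000, design) for _ in range(200)])
    results[design] = (est.mean(), est.std())
    print(design, round(est.mean(), 3), round(est.std(), 3))
```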
Results (\(n = 1000\), \(B = 2000\) replications, seed 2024):
| Design | Estimator | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|
| CRD | Linear OR | 2.0013 | +0.0013 | 0.0679 | 0.0678 |
| CRD | Local-linear OR | 2.0333 | +0.0333 | 0.0659 | 0.0738 |
| Observational | Linear OR | 1.9602 | −0.0398 | 0.0851 | 0.0939 |
| Observational | Local-linear OR | 2.0240 | +0.0240 | 0.0923 | 0.0953 |
Lesson 1: Under CRD, the linear OR is unbiased regardless of model specification. This confirms ?thm-lin: under complete randomization, the linear OR converges to \(\tau_{\mathrm{ATE}}\) even with a wrong outcome model. The randomization distribution, not the model, does the identification work.
Lesson 2: Under an observational design, the misspecified linear OR is biased. The bias grows from \(+0.0013\) under CRD to \(-0.0398\) under the observational design. The omitted \(X^2\) and \(TX^2\) terms cause \(\hat\mu(t,x)\) to misrepresent the outcome surface; because high-\(X\) units are predominantly treated, the misfit is systematically amplified in the direction of confounding. Under CRD, the balanced assignment averages out the same misfit.
Lesson 3: The local-linear OR nearly eliminates bias at the cost of higher variance. Under the observational design, local-linear OR reduces bias from \(-0.0398\) to \(+0.0240\) but its SD is 0.0923 vs. 0.0851 for the linear model. The RMSE comparison (0.0939 vs. 0.0953) favors the linear estimator in MSE, even though it is biased — the classic bias-variance tradeoff.
5.7 Chapter Summary
| Estimand | Identification | Key assumption | Method |
|---|---|---|---|
| ATE under CRD | \(\E[Y(t)] = \E[Y \mid T{=}t]\) | Randomization protocol | DIM, regression-adjusted DIM |
| ATE in an observational study | Equation 5.5 | Ignorability + overlap | OR, standardization, stratification |
| ATT | Equation 5.6 | Same, one-sided overlap for \(t=0\) | Same methods, weighted over treated |
| Propensity score (Ch. 6) | Reduces \(X\) to scalar \(\pi(X)\) | Correctly specified PS model | IPW, matching, PS stratification |
- Randomization as graph surgery. A randomized experiment removes all arrows into \(T\), so \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\) and DIM is unbiased for the ATE without covariate adjustment.
- Ignorability is the key assumption. Under unconfoundedness \((Y(0),Y(1)) \indep T \mid X\), the back-door adjustment formula Equation 5.5 identifies the ATE. This assumption is substantive, design-determined in RCTs, and untestable from observational data.
- Three estimation strategies. Regression adjustment (G-formula), standardization, and stratification all implement Equation 5.5 via different modeling choices. Under a randomized experiment, regression adjustment is design-consistent for the ATE even when the outcome model is misspecified (?thm-lin); in observational studies, it is consistent only when the outcome model is.
- Simpson’s paradox. Pooled associations can reverse within subgroups when a confounder determines treatment selection. The correct causal estimate requires standardization over the marginal covariate distribution, guided by the back-door criterion.
- The propensity score dimension reduction. When \(X\) is high-dimensional, direct stratification or cell-by-cell standardization breaks down. The propensity score \(\pi(X)\) provides a scalar sufficient statistic for adjustment. Chapter 6 develops the theory and estimation methods.
5.8 Problems
1. Randomization and the do-calculus. Let the observational DAG be \(\{U \to T, U \to Y, X \to T, T \to Y\}\) with \(U\) unobserved.
- Identify all back-door paths from \(T\) to \(Y\).
- Does \(X\) satisfy the back-door criterion? Does \(U\)? Explain.
- Now suppose treatment is randomized. Draw the modified DAG and identify all back-door paths. Show that \(f(y \mid \doop(T{=}t)) = f(y \mid T{=}t)\) holds using Rule 2 of the do-calculus. (Hint: identify the appropriate \((X, Z, W)\) instantiation, construct \(\mathcal{G}_{\underline{T}}\), and verify \((Y \indep T)_{\mathcal{G}_{\underline{T}}}\).)
- Under randomization, is covariate adjustment on \(X\) necessary for unbiasedness? Is it ever beneficial? Explain.
2. Ignorability and the back-door criterion. Consider the DAG \(\{X \to T, X \to Y, T \to Y, U \to Y\}\) with \(U\) unobserved.
- Does \(\{X\}\) satisfy the back-door criterion? Write out the identifying formula for \(\tau_{\mathrm{ATE}}\).
- Now add the edge \(U \to T\) to the DAG. Does \(\{X\}\) still satisfy the back-door criterion? What does the criterion require when both \(X\) and \(U\) affect \(T\)?
- Explain in words what “unconfoundedness given \(X\)” means about the role of \(U\) in the data-generating process.
3. Standardization. A study of job training (\(T\)) and earnings (\(Y\), in thousands) produces the following cell means, with \(P(X{=}0) = 0.4\) and \(P(X{=}1) = 0.6\):
| \(X\) | \(\hat\mu(1,x)\) | \(\hat\mu(0,x)\) | \(n_x/n\) |
|---|---|---|---|
| 0 | 28 | 22 | 0.4 |
| 1 | 35 | 31 | 0.6 |
- Compute \(\hat\tau_{\mathrm{ATE}}\) using the standardization formula Equation 5.8.
- Compute \(\hat\tau_{\mathrm{ATT}}\), given that treated units are concentrated in \(X{=}1\): \(P(X{=}1 \mid T{=}1) = 5/7\).
- The unadjusted difference in means is \(33 - 25 = 8\). Compare to your answers in (a) and (b) and explain the discrepancy.
4. Stratification and Cochran’s rule. You have a binary confounder \(X \in \{0,1\}\) and form two strata. Within stratum \(X{=}0\): \(n_0 = 600\), \(\bar Y_1 = 10\), \(\bar Y_0 = 8\). Within stratum \(X{=}1\): \(n_1 = 400\), \(\bar Y_1 = 15\), \(\bar Y_0 = 12\).
- Compute the stratified ATE estimator \(\hat\tau_{\mathrm{STRAT}}\) via Equation 5.11.
- Suppose an unadjusted DIM gives \(\hat\tau_{\mathrm{DIM}} = 5.5\). Explain why the two estimates differ and which is the appropriate causal estimate.
- Describe one limitation that would arise if \(X\) were a continuous variable with 15 dimensions.
5. Simpson’s paradox. A hospital reports that ICU patients (\(T{=}1\)) have higher mortality than others (\(T{=}0\)): \(P(Y{=}1 \mid T{=}1) = 0.30\) vs. \(P(Y{=}1 \mid T{=}0) = 0.10\).
- Construct a numerical example (a \(2 \times 2 \times 2\) table with disease severity \(X\) as the confounder) consistent with the pooled numbers yet showing ICU admission reduces mortality within both severity strata.
- Compute the standardized ATE using the marginal distribution of \(X\).
- Draw the DAG. Identify the back-door path that the pooled comparison fails to block.
- A colleague argues the pooled statistic is the right answer since the hospital treats patients with both mild and severe illness. Explain, using potential outcomes notation, why this argument is incorrect.