6  Propensity Score Methods

Learning Objectives

By the end of this chapter, students should be able to:

  1. Define the propensity score \(\pi(X) = P(T{=}1 \mid X)\) and explain why it is a balancing score.
  2. State and prove the Balancing Property (\(T \indep X \mid \pi(X)\)), the Coarsest Balancing Score Lemma, and the Propensity Score Theorem.
  3. Describe standard methods for estimating the propensity score and the practical pitfalls of each.
  4. Derive the IPW identification formula for the ATE and explain the role of the Horvitz–Thompson representation.
  5. Distinguish the IPW estimator (targets ATE) from the nearest-neighbor matching estimator (targets ATT).
  6. Distinguish positivity (weak overlap) from strong overlap, describe the practical consequences of near-violations, and apply trimming strategies.
  7. Articulate the fundamental limitation of propensity score methods — the unconfoundedness assumption — and explain why this motivates instrumental variables (Chapter 7).

6.1 Motivation: The Curse of Dimensionality

Chapter 5 established that under strong ignorability, the ATE is identified by the back-door adjustment formula, and developed three estimation strategies: regression adjustment, standardization, and stratification. All three require conditioning on a covariate vector \(X\). When \(X\) is low-dimensional, this is feasible. When \(X\) has many components, it is not.

The problem. Stratification requires cells \(\{X = x\}\) to contain both treated and control units. With \(p\) binary covariates, there are \(2^p\) cells. With \(p = 10\), that is 1024 cells — far more than most datasets can support. Regression adjustment avoids the cell-count problem but requires a correctly specified model for \(\E[Y \mid T, X]\), which becomes harder to specify reliably as \(p\) grows.
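
To see how quickly exact stratification breaks down, here is a minimal sketch (an illustration assumed for this chapter, not part of the original lab) that counts how many of the \(2^{10}\) cells in a sample of \(n = 1{,}000\) contain both treated and control units.

```python
# Illustrative sketch: how many of the 2^p stratification cells contain
# both treated and control units when n = 1,000 and p = 10?
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000, 10
X = rng.integers(0, 2, size=(n, p))   # p binary covariates
T = rng.integers(0, 2, size=n)        # treatment indicator (toy assignment)

cells = X @ (2 ** np.arange(p))       # encode each row as a cell index
occupied = np.unique(cells)
usable = sum((T[cells == c] == 1).any() and (T[cells == c] == 0).any()
             for c in occupied)
print(f"{occupied.size} of {2**p} cells occupied; "
      f"{usable} usable for stratification")
```

Most occupied cells contain a single unit, so only a small fraction supports a within-cell treated-versus-control comparison.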

The solution. Rosenbaum and Rubin (1983) showed that it is sufficient to condition on a single scalar function of \(X\): the propensity score \(\pi(X) = P(T{=}1 \mid X)\). The propensity score reduces the adjustment dimension without changing the identified estimand, provided strong ignorability and positivity hold.

6.2 The Propensity Score and Its Balancing Properties

6.2.1 Balancing Scores and the Propensity Score

Definition: Balancing Score (Rosenbaum and Rubin 1983)

A function \(b(X)\) is called a balancing score if \(T \indep X \mid b(X)\).

The full covariate vector \(X\) is trivially a balancing score. The propensity score is the coarsest balancing score, providing the greatest dimension reduction without sacrificing identification.

Definition: Propensity Score (Rosenbaum and Rubin 1983)

For a binary treatment \(T \in \{0,1\}\) and observed covariates \(X\), the propensity score is \(\pi(X) = P(T{=}1 \mid X)\).

In a completely randomized experiment with equal allocation, \(\pi(X) = 0.5\) for all units; under any randomized design, \(\pi(X)\) is known by construction. In an observational study, \(\pi(X)\) varies with \(X\) and must be estimated from data.

Connection to strong ignorability. Under strong ignorability \((Y(0),Y(1)) \indep T \mid X\), the potential outcomes carry no further information about treatment assignment once \(X\) is given: \(P(T{=}1 \mid X, Y(0), Y(1)) = P(T{=}1 \mid X) = \pi(X)\). This equality is a consequence of the ignorability assumption, not part of the definition of \(\pi(X)\).

Theorem: The Propensity Score Is a Balancing Score (Rosenbaum and Rubin 1983)

\(T \indep X \mid \pi(X)\).

It suffices to show \(P(T{=}1 \mid X, \pi(X)) = P(T{=}1 \mid \pi(X))\). Since \(\pi(X)\) is a deterministic function of \(X\), \(P(T{=}1 \mid X, \pi(X)) = P(T{=}1 \mid X) = \pi(X)\). For the right-hand side, the law of iterated expectations gives: \[P(T{=}1 \mid \pi(X)) = \E[P(T{=}1 \mid X) \mid \pi(X)] = \E[\pi(X) \mid \pi(X)] = \pi(X). \quad\square\]

Remark: Covariate Balance and Design-Stage Analysis

The balancing property implies \(f(X \mid T{=}1, \pi(X)) = f(X \mid T{=}0, \pi(X))\) a.s. Imbens and Rubin (2015) call this design before analysis: covariate balance can be assessed by stratifying on a discretized \(\hat\pi(X)\) and comparing covariate distributions within strata — without consulting the outcome \(Y\), making the diagnostic immune to outcome-fishing.
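
As a concrete illustration of the design-stage check, here is a minimal sketch (a hypothetical helper, assuming arrays `X`, `T`, and fitted scores `ps` are available) that stratifies on quintiles of \(\hat\pi(X)\) and compares covariate means across arms within strata, never touching \(Y\).

```python
# Sketch of a design-stage balance diagnostic: compare covariate means
# between arms within strata of the estimated propensity score.
import numpy as np

def balance_by_strata(X, T, ps, n_strata=5):
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, ps, side="right") - 1,
                     0, n_strata - 1)
    for s in range(n_strata):
        m = strata == s
        if not (m & (T == 1)).any() or not (m & (T == 0)).any():
            print(f"stratum {s}: one arm empty (overlap problem)")
            continue
        diff = X[m & (T == 1)].mean(axis=0) - X[m & (T == 0)].mean(axis=0)
        print(f"stratum {s}: treated-minus-control covariate means "
              f"{np.round(diff, 2)}")
```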

Remark: Observed Covariates Only

The Balancing Property is a statement about the observed covariate distribution. It does not say anything about unobserved confounders \(U\): if \(U\) affects both \(T\) and \(Y\) and is not captured by \(X\), the balancing property fails to close the back-door path through \(U\).

6.2.2 The Propensity Score Theorem

Lemma: The Propensity Score Is the Coarsest Balancing Score (Rosenbaum and Rubin 1983)

If \(b(X)\) is any balancing score — that is, \(T \indep X \mid b(X)\) — then \(\pi(X)\) is a function of \(b(X)\): \(\pi(X) = g(b(X))\) for some measurable \(g\).

Since \(b(X)\) is a balancing score, \(P(T{=}1 \mid X, b(X)) = P(T{=}1 \mid b(X))\). Since \(b(X)\) is a function of \(X\): \(\pi(X) = P(T{=}1 \mid X) = P(T{=}1 \mid X, b(X)) = P(T{=}1 \mid b(X))\). Hence \(\pi(X)\) is a measurable function of \(b(X)\) alone. \(\square\)

The lemma states that \(\pi(X)\) is the smallest sufficient statistic for \(T\) among functions of \(X\): any other balancing score retains at least as much information about \(X\).

Theorem: Propensity Score Theorem (Rosenbaum and Rubin 1983)

Suppose strong ignorability holds: \((Y(0), Y(1)) \indep T \mid X\) and \(0 < \pi(X) < 1\) a.s. Then for any balancing score \(b(X)\): \[(Y(0), Y(1)) \indep T \mid b(X).\] In particular, \((Y(0), Y(1)) \indep T \mid \pi(X)\).

We prove the result for \(b(X) = \pi(X)\). Writing \(\mathbf{Y} = (Y(0), Y(1))\): \[P(T{=}1 \mid \mathbf{Y}, \pi(X)) = \E[P(T{=}1 \mid \mathbf{Y}, X) \mid \mathbf{Y}, \pi(X)] = \E[P(T{=}1 \mid X) \mid \mathbf{Y}, \pi(X)] = \E[\pi(X) \mid \mathbf{Y}, \pi(X)] = \pi(X),\] where the second equality uses ignorability (\(T \indep (Y(0), Y(1)) \mid X\)). Since \(T\) is binary and this conditional probability does not depend on \(\mathbf{Y}\), we conclude \((Y(0), Y(1)) \indep T \mid \pi(X)\). For a general balancing score \(b(X)\), the Coarsest Balancing Score Lemma gives \(\pi(X) = g(b(X))\), and the same chain conditioned on \(b(X)\) yields \(P(T{=}1 \mid \mathbf{Y}, b(X)) = \E[\pi(X) \mid \mathbf{Y}, b(X)] = g(b(X))\), again free of \(\mathbf{Y}\). \(\square\)

Remark: Identification via the Propensity Score

The Propensity Score Theorem gives an alternative identification formula: \(\E[Y(t)] = \E[\E[Y \mid \pi(X), T{=}t]]\). A common misconception is that \(\E[Y \mid \pi(X), T{=}t]\) and \(\E[Y \mid X, T{=}t]\) are equal pointwise — they are not. The theorem guarantees only that their marginal expectations agree: \(\E[\E[Y \mid \pi(X), T{=}t]] = \E[\E[Y \mid X, T{=}t]] = \E[Y(t)]\).

The Dimension Reduction

The Propensity Score Theorem reduces the curse of dimensionality from a \(\dim(X)\)-dimensional problem to a one-dimensional one: \[\tau_{\mathrm{ATE}} = \E[\E[Y \mid T{=}1, \pi(X)] - \E[Y \mid T{=}0, \pi(X)]].\]
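
One direct way to exploit this reduction is subclassification on the estimated score: discretize \(\hat\pi(X)\) into strata and average the within-stratum mean differences. A minimal sketch, assuming arrays `Y`, `T`, and `ps` and adequate overlap within every stratum:

```python
# Subclassification on the scalar score: a sketch assuming every stratum
# contains both arms (i.e., overlap holds within strata).
import numpy as np

def ate_ps_stratification(Y, T, ps, n_strata=5):
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, ps, side="right") - 1,
                     0, n_strata - 1)
    tau = 0.0
    for s in range(n_strata):
        m = strata == s
        gap = Y[m & (T == 1)].mean() - Y[m & (T == 0)].mean()
        tau += gap * m.mean()   # weight by the stratum's share of the sample
    return tau
```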

6.3 Estimation of the Propensity Score

Logistic regression. The classical approach models \(\mathrm{logit}(\pi(X)) = X^\top\beta\) and estimates \(\beta\) by maximum likelihood. The fitted values \(\hat\pi(X_i) = \sigma(X_i^\top\hat\beta)\), where \(\sigma(u) = 1/(1 + e^{-u})\) is the logistic function, are used in place of the true propensity score.

Machine learning estimators. When \(X\) is high-dimensional or the true propensity score is nonlinear, machine learning methods — gradient boosted trees, random forests, regularized logistic regression — can estimate \(\pi(X)\) more flexibly. A critical issue is overfitting: an estimator that perfectly separates treated and control units in-sample will produce estimated propensity scores near 0 or 1 everywhere, destroying overlap. Cross-fitting (Chapter 11) addresses this by estimating nuisance functions on held-out folds.
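
A minimal estimation sketch combining both ideas, using scikit-learn's `cross_val_predict` for out-of-fold probabilities as a simple form of cross-fitting (the clipping constant is an illustrative choice, not a recommendation from the text):

```python
# Sketch: logistic-regression propensity scores with out-of-fold
# prediction (a simple form of cross-fitting) via scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def estimate_ps(X, T, n_folds=5):
    model = LogisticRegression(max_iter=1000)
    ps = cross_val_predict(model, X, T, cv=n_folds,
                           method="predict_proba")[:, 1]
    # Clip numerically extreme scores; the constant is illustrative.
    return np.clip(ps, 1e-3, 1 - 1e-3)
```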

Warning: Propensity Score Estimation Is a Nuisance, Not the Goal

The propensity score is estimated to construct a good estimator of \(\tau_{\mathrm{ATE}}\). Balancing tests diagnose whether the estimated score has achieved covariate balance, but they do not test whether unobserved confounders are balanced.

6.4 Matching on the Propensity Score

The propensity score motivates matching: for each treated unit \(i\), find a control unit \(j\) with \(\hat\pi(X_j) \approx \hat\pi(X_i)\), and use \(Y_j\) as a local estimate of \(\E[Y(0) \mid \pi(X) = \pi(X_i)]\) (Abadie and Imbens 2006, 2016). The matched control outcome is not a substitute for the individual counterfactual \(Y_i(0)\); it approximates the conditional mean: \[\E[Y(0) \mid \pi(X) = \pi(X_i)] = \E[Y(0) \mid \pi(X) = \pi(X_i), T{=}0] = \E[Y \mid \pi(X) = \pi(X_i), T{=}0],\] where the first equality uses the Propensity Score Theorem and the second uses consistency. Averaging across treated units recovers the ATT: \[\hat\tau_{\mathrm{ATT}} = \frac{1}{n_1}\sum_{i:\, T_i=1}[Y_i - Y_{\hat\jmath(i)}], \tag{6.1}\] where \(\hat\jmath(i)\) is the matched control index.

One-to-one nearest-neighbor matching. For each treated unit \(i\): \[\hat\jmath(i) = \arg\min_{j:\, T_j=0}|\hat\pi(X_i) - \hat\pi(X_j)|.\] Matching with replacement reduces bias but increases variance.

Caliper matching. Restricts matches to pairs within a maximum distance \(\delta\), discarding treated units with no close match. This restricts the estimand to a subpopulation with good overlap.
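
A sketch of one-to-one nearest-neighbor matching with replacement and an optional caliper, following Equation 6.1 (the function and argument names are hypothetical):

```python
# Sketch of 1:1 nearest-neighbor matching on the score, with replacement
# and an optional caliper (Equation 6.1).
import numpy as np

def att_nn_matching(Y, T, ps, caliper=None):
    treated = np.where(T == 1)[0]
    control = np.where(T == 0)[0]
    effects = []
    for i in treated:
        d = np.abs(ps[control] - ps[i])
        if caliper is not None and d.min() > caliper:
            continue                      # no acceptable match: drop unit i
        j = control[np.argmin(d)]         # nearest control on the score
        effects.append(Y[i] - Y[j])
    return float(np.mean(effects))
```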

Remark: Matching as Nonparametric Regression on \(\pi(X)\)

Propensity score matching implicitly fits a nonparametric regression of \(Y\) on \(\pi(X)\) separately within each treatment arm. By the Propensity Score Theorem, \(\E[Y(t) \mid \pi(X) = p]\) is identified from observed data for each \(t\). Nearest-neighbor matching approximates this by averaging outcomes of close matches — local constant regression on the scalar propensity score. The dimension reduction from the theorem is what makes nonparametric estimation feasible.

6.5 Inverse Probability Weighting

6.5.1 The IPW Identification Formula

Under strong ignorability, the ATE has an inverse probability weighting (IPW) representation: \[\tau_{\mathrm{ATE}} = \E\!\left[\frac{T \cdot Y}{\pi(X)}\right] - \E\!\left[\frac{(1-T)\cdot Y}{1-\pi(X)}\right].\]

Theorem: IPW Identification

Under strong ignorability (\((Y(0), Y(1)) \indep T \mid X\) and \(0 < \pi(X) < 1\)): \(\E\!\left[TY/\pi(X)\right] = \E[Y(1)]\).

\[\E\!\left[\frac{TY}{\pi(X)}\right] = \E\!\left[\frac{TY(1)}{\pi(X)}\right] = \E\!\left[\E\!\left[\frac{TY(1)}{\pi(X)}\,\middle|\,X\right]\right] = \E\!\left[\frac{Y(1)}{\pi(X)}\,\E[T \mid X]\right] = \E\!\left[\frac{Y(1)}{\pi(X)}\cdot\pi(X)\right] = \E[Y(1)].\] The first equality uses consistency (\(Y = Y(1)\) when \(T{=}1\)); the third uses ignorability (\(Y(1) \indep T \mid X\)). \(\square\)

Remark: Scope of the Ignorability Assumption

The proof invokes only the marginal independence \(Y(1) \indep T \mid X\), not the full joint \((Y(0), Y(1)) \indep T \mid X\). An analogous proof for the control term uses only \(Y(0) \indep T \mid X\). The joint independence is used in this chapter for uniformity with Chapter 5 and because it is needed for estimands involving the joint distribution of potential outcomes.

6.5.2 Horvitz–Thompson and Hájek Estimators

The Horvitz–Thompson (HT) IPW estimator: \[\hat\tau_{\mathrm{HT}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i Y_i}{\hat\pi(X_i)} - \frac{(1-T_i)Y_i}{1-\hat\pi(X_i)}\right]. \tag{6.2}\]

The Hájek (self-normalized) estimator: \[\hat\tau_{\mathrm{HJ}} = \frac{\sum_i T_i Y_i/\hat\pi(X_i)}{\sum_i T_i/\hat\pi(X_i)} - \frac{\sum_i (1-T_i)Y_i/(1-\hat\pi(X_i))}{\sum_i (1-T_i)/(1-\hat\pi(X_i))}. \tag{6.3}\]

The Hájek estimator normalizes the weights to sum to one within each arm, so each unit’s effective weight is its share of the arm’s total; this bounds the influence of any single extreme weight. The efficient doubly robust AIPW estimator of Chapter 10 combines the outcome model and the propensity score; under correct nuisance specification it achieves higher efficiency than either pure IPW or pure outcome regression.
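
Equations 6.2 and 6.3 translate directly into code. A minimal sketch, assuming arrays `Y`, `T`, and `ps` with `ps` strictly inside \((0, 1)\):

```python
# Horvitz-Thompson (Equation 6.2) and Hajek (Equation 6.3) ATE estimators.
import numpy as np

def ate_ht(Y, T, ps):
    return np.mean(T * Y / ps - (1 - T) * Y / (1 - ps))

def ate_hajek(Y, T, ps):
    w1 = T / ps                 # treated-arm inverse weights
    w0 = (1 - T) / (1 - ps)     # control-arm inverse weights
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
```

Note that the Hájek version is invariant to shifting \(Y\) by a constant, which is exactly why it is preferred when outcomes are non-centered.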

ATT estimation. The ATT re-weights the control arm to look like the treated population: \[\hat\tau_{\mathrm{HT,ATT}} = \frac{1}{n_1}\sum_{i:\,T_i=1}Y_i - \frac{1}{n_1}\sum_{i:\,T_i=0}\frac{\hat\pi(X_i)}{1-\hat\pi(X_i)}Y_i, \tag{6.4}\] \[\hat\tau_{\mathrm{HJ,ATT}} = \frac{1}{n_1}\sum_{i:\,T_i=1}Y_i - \frac{\sum_{i:\,T_i=0}\frac{\hat\pi(X_i)}{1-\hat\pi(X_i)}Y_i}{\sum_{i:\,T_i=0}\frac{\hat\pi(X_i)}{1-\hat\pi(X_i)}}. \tag{6.5}\]

Each control unit receives weight proportional to its odds of treatment \(\hat\pi/(1-\hat\pi)\). For a control unit with \(\hat\pi(X_i) = 0.95\) the weight is \(19\); combined with a large baseline outcome this produces extreme variance.
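
The ATT weighting schemes of Equations 6.4 and 6.5 differ only in how the control-arm odds weights are normalized; a companion sketch under the same array conventions:

```python
# ATT estimators (Equations 6.4 and 6.5): control units are re-weighted
# by their treatment odds ps / (1 - ps).
import numpy as np

def att_ht(Y, T, ps):
    n1 = T.sum()
    odds = ps[T == 0] / (1 - ps[T == 0])
    return Y[T == 1].mean() - np.sum(odds * Y[T == 0]) / n1

def att_hajek(Y, T, ps):
    odds = ps[T == 0] / (1 - ps[T == 0])
    return Y[T == 1].mean() - np.sum(odds * Y[T == 0]) / np.sum(odds)
```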

6.6 Overlap and Positivity

6.6.1 Positivity and Strong Overlap

Definition: Positivity (Weak Overlap)

The positivity condition requires \(0 < \pi(X) < 1\) almost surely. Every unit has a positive probability of receiving either treatment or control, regardless of its covariate values.

Together with unconfoundedness, positivity constitutes the strong ignorability assumption of Rosenbaum and Rubin (1983). Positivity is the minimal overlap condition needed for identification of the ATE.

Definition: Strong Overlap

The strong overlap condition requires \(c \le \pi(X) \le 1-c\) a.s. for some \(c \in (0, 1/2)\).

Strong overlap bounds the inverse weights uniformly away from infinity. It is the condition under which \(\sqrt{n}\)-consistent, asymptotically normal inference for the ATE goes through (Chapter 11). Positivity alone permits identification but does not guarantee stable inference: when \(\pi(X)\) approaches 0 or 1, IPW weights blow up even though the parameter is technically identified (Khan and Tamer 2010).

6.6.2 Practical Consequences of Near-Violations

Warning: Near-Overlap Problems

When \(\hat\pi(X_i)\) is close to 0 or 1, the IPW weight is very large. A small number of units with extreme weights can dominate the estimator, increasing variance dramatically.

6.6.3 Trimming Strategies

Trimming by propensity score. Restrict to units with \(\eta \le \pi(X) \le 1-\eta\) for small \(\eta > 0\). The estimand changes to: \[\tau_{\mathrm{trim}} = \E[Y(1) - Y(0) \mid \eta \le \pi(X) \le 1-\eta].\]

The Crump et al. (2009) rule. Crump, Hotz, Imbens, and Mitnik derive the trimming threshold that minimizes the asymptotic variance of the trimmed ATE estimator; the fixed cutoff \(\eta = 0.1\) is a common default approximation to the optimal rule.
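
A trimming helper is a one-liner; the sketch below applies the fixed threshold \(\eta = 0.1\) rather than the data-driven Crump et al. (2009) optimum:

```python
# Sketch: trim to eta <= ps <= 1 - eta before estimation; the retained
# sample defines the trimmed estimand tau_trim.
import numpy as np

def trim(Y, T, ps, eta=0.1):
    keep = (ps >= eta) & (ps <= 1 - eta)
    return Y[keep], T[keep], ps[keep]
```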

6.6.4 Re-targeting the Estimand

When overlap fails, an alternative to trimming is to change the estimand. The ATT requires only \(P(T{=}0 \mid X{=}x) > 0\) for \(x\) in the support of \(X \mid T{=}1\), i.e., \(\pi(X) < 1\) a.s. on the treated support. The ATT tolerates regions with \(\pi(x) = 0\) (no treated units there) but not regions inside the treated support where \(\pi(x) = 1\). The ATC is symmetric. When overlap fails where \(\pi(X) \approx 0\), re-target to the ATT; when overlap fails where \(\pi(X) \approx 1\), re-target to the ATC.

6.7 Lab: Simulation Study of IPW and Matching Estimators

This lab compares five estimators on a five-covariate DGP with treatment-effect heterogeneity. With five continuous confounders, direct stratification is infeasible, making propensity-score methods the natural choice.

6.7.1 Part 1: Correctly Specified Propensity Score

DGP for Lab 6

\(X_j \overset{\mathrm{i.i.d.}}{\sim} N(0,1)\) for \(j = 1,\ldots,5\). All five covariates are genuine confounders. True propensity score: \[\pi(X) = \mathrm{expit}(-0.5 + 0.8X_1 + 0.5X_2 - 0.3X_3 + 0.2X_4 + 0.1X_5), \tag{6.6}\] giving \(P(T{=}1) \approx 0.40\). Potential outcomes: \[Y(t) = (1 + 0.5X_1)\cdot t + 8 + 0.8X_1 + 0.5X_2 + 0.4X_3 + 0.3X_4 + 0.2X_5 + \varepsilon, \quad \varepsilon \sim N(0,1). \tag{6.7}\] The CATE is \(\tau(X) = 1 + 0.5X_1\), so \(\tau_{\mathrm{ATE}} = 1.000\) (exact). Because \(\pi(X)\) is increasing in \(X_1\), treated units have \(\E[X_1 \mid T{=}1] > 0\), giving \(\tau_{\mathrm{ATT}} \approx 1.199\) (oracle, \(n = 10^7\)). The gap \(\approx 0.199\) reflects selection: units most likely to be treated also tend to benefit more.
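
A sketch of this DGP (Equations 6.6 and 6.7) in Python; the oracle ATT quoted above corresponds to a very large draw from the same design:

```python
# Sketch of the Lab 6 DGP (Equations 6.6 and 6.7).
import numpy as np
from scipy.special import expit

def simulate(n, rng):
    X = rng.standard_normal((n, 5))
    ps = expit(-0.5 + 0.8*X[:, 0] + 0.5*X[:, 1] - 0.3*X[:, 2]
               + 0.2*X[:, 3] + 0.1*X[:, 4])          # Equation 6.6
    T = rng.binomial(1, ps)
    tau = 1 + 0.5 * X[:, 0]                          # CATE, E[tau] = 1
    mu0 = (8 + 0.8*X[:, 0] + 0.5*X[:, 1] + 0.4*X[:, 2]
           + 0.3*X[:, 3] + 0.2*X[:, 4])
    Y = mu0 + tau * T + rng.standard_normal(n)       # Equation 6.7
    return X, T, Y, ps
```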

Estimators. Five estimators are compared under both the true and the estimated propensity score. The estimated score comes from a logistic regression of \(T\) on \((X_1,\ldots,X_5)\), which is correctly specified because the true logit is linear. The five estimators are HT-ATE (Equation 6.2) and HJ-ATE (Equation 6.3), both targeting the ATE; and HT-ATT (Equation 6.4), HJ-ATT (Equation 6.5), and NNM (Equation 6.1), all targeting the ATT. The large baseline mean \(\E[Y(0)] = 8\) amplifies the consequences of extreme IPW weights, making Hájek normalization strongly recommended. A skeleton of the replication loop is sketched below.
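
This skeleton reuses the estimator sketches defined earlier in the chapter (assumed to be in scope); exact numbers depend on implementation details and will not reproduce the table to the digit:

```python
# Skeleton of the Monte Carlo loop with the known propensity score.
import numpy as np

rng = np.random.default_rng(42)
draws = []
for _ in range(2_000):
    X, T, Y, ps = simulate(1_000, rng)
    # For the "Estimated" rows, refit a logistic regression to (X, T)
    # each draw and use its fitted probabilities in place of ps.
    draws.append([ate_ht(Y, T, ps), ate_hajek(Y, T, ps),
                  att_ht(Y, T, ps), att_hajek(Y, T, ps),
                  att_nn_matching(Y, T, ps)])
draws = np.array(draws)
print("means:", draws.mean(axis=0).round(3))
print("SDs:  ", draws.std(axis=0).round(3))
```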

Results (\(n = 1{,}000\), \(B = 2{,}000\) replications, seed 42). Bias relative to each estimator’s own target.

| PS | Estimator | Estimand | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|---|
| Known | HT-ATE | ATE | 1.008 | +0.008 | 0.626 | 0.626 |
| Known | HJ-ATE | ATE | 1.003 | +0.003 | 0.145 | 0.145 |
| Known | HT-ATT | ATT | 1.184 | −0.016 | 0.725 | 0.725 |
| Known | HJ-ATT | ATT | 1.203 | +0.003 | 0.131 | 0.131 |
| Known | NNM | ATT | 1.207 | +0.008 | 0.130 | 0.130 |
| Estimated | HT-ATE | ATE | 0.998 | −0.002 | 0.194 | 0.194 |
| Estimated | HJ-ATE | ATE | 1.002 | +0.002 | 0.102 | 0.103 |
| Estimated | HT-ATT | ATT | 1.202 | +0.002 | 0.251 | 0.251 |
| Estimated | HJ-ATT | ATT | 1.202 | +0.002 | 0.101 | 0.101 |
| Estimated | NNM | ATT | 1.205 | +0.006 | 0.121 | 0.121 |

Lesson 1: Hájek normalization is strongly recommended when outcomes are non-centered. With the known PS, HT-ATE achieves SD \(= 0.626\), while HJ-ATE achieves SD \(= 0.145\) — a 4.3-fold reduction. A unit with \(\pi(X_i) = 0.05\) receives raw weight \(20\); multiplied by an outcome near \(\E[Y(0)] = 8\), its contribution is of order \(160\).

Lesson 2: HT-ATT is even more unstable; Hájek normalization is equally beneficial. HT-ATT SD \(= 0.725\). A control unit with \(\hat\pi(X_i) = 0.95\) receives odds-ratio weight \(19\). HJ-ATT reduces SD to \(0.131\) — a 5.5-fold reduction.

Lesson 3: HJ-ATT and NNM converge to the same target with nearly identical efficiency. Under the known PS, HJ-ATT (SD \(= 0.131\)) and NNM (SD \(= 0.130\)) are virtually indistinguishable — a clear “two routes, one estimand” demonstration.

Lesson 4: The estimand gap between ATE and ATT is large and clearly revealed. The ATE estimators converge to \(1.000\); the ATT estimators converge to \(\approx 1.199\). The \(0.199\) gap is not bias. A researcher who applies NNM and reports against the ATE benchmark would conclude the estimator has 20% bias; it is actually unbiased for the correct target.

Lesson 5: Estimated PS collapses HT variance. HT-ATE SD falls from \(0.626\) to \(0.194\) (69% reduction); shrinkage of fitted logistic probabilities toward the sample mean trims extreme weights automatically. HJ-ATT with estimated PS achieves SD \(= 0.101\) — beating NNM (SD \(= 0.121\)).

Warning: The Estimand Must Be Chosen Before the Estimator

In this DGP, \(\tau_{\mathrm{ATE}} = 1.000\) and \(\tau_{\mathrm{ATT}} \approx 1.199\). The gap arises because treatment selection is correlated with the individual treatment effect: high-\(X_1\) units are simultaneously more likely to be treated and more likely to benefit (Imbens and Rubin 2015). Choosing among estimators should be driven by the policy question (ATE or ATT?), not by which produces the preferred point estimate.

6.7.2 Part 2: Robustness to PS Model Misspecification

Modified DGP. Outcome model unchanged. True PS now contains a quadratic term: \[\pi^*(X) = \mathrm{expit}(-0.5 + 0.8X_1 + 0.5X_1^2 + 0.2X_4 + 0.1X_5), \tag{6.8}\] giving a U-shaped propensity surface. \(\tau_{\mathrm{ATE}} = 1.000\) (unchanged); \(\tau_{\mathrm{ATT}}^* \approx 1.152\) (oracle). The estimated PS is still a linear logistic regression on \((X_1,\ldots,X_5)\) — misspecified by omitting \(X_1^2\).
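
The only code change from Part 1 is the propensity surface; a sketch of the new true score and the (deliberately) misspecified working model:

```python
# Part 2: the true score gains a quadratic term (Equation 6.8), while the
# working model below stays linear in (X1, ..., X5), hence misspecified.
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

def true_ps_part2(X):
    return expit(-0.5 + 0.8*X[:, 0] + 0.5*X[:, 0]**2
                 + 0.2*X[:, 3] + 0.1*X[:, 4])

def misspecified_ps(X, T):
    # In-sample linear logistic fit omitting the X1^2 term.
    return LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
```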

Results (\(n = 1{,}000\), \(B = 2{,}000\), seed 42):

| PS | Estimator | Estimand | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|---|
| True | HT-ATE | ATE | 1.028 | +0.028 | 0.824 | 0.825 |
| True | HJ-ATE | ATE | 1.013 | +0.013 | 0.152 | 0.152 |
| True | HT-ATT | ATT | 1.186 | +0.034 | 1.333 | 1.333 |
| True | HJ-ATT | ATT | 1.183 | +0.031 | 0.212 | 0.215 |
| True | NNM | ATT | 1.178 | +0.026 | 0.175 | 0.177 |
| Misspecified | HT-ATE | ATE | 1.478 | +0.478 | 0.147 | 0.501 |
| Misspecified | HJ-ATE | ATE | 0.861 | −0.139 | 0.087 | 0.164 |
| Misspecified | HT-ATT | ATT | 1.780 | +0.628 | 0.162 | 0.649 |
| Misspecified | HJ-ATT | ATT | 1.306 | +0.154 | 0.081 | 0.174 |
| Misspecified | NNM | ATT | 1.172 | +0.020 | 0.138 | 0.139 |

Lesson 6: PS misspecification makes all IPW estimators inconsistent; Hájek cannot fix this. HT-ATE bias is \(+0.478\) and HJ-ATE bias is \(-0.139\), with opposite signs. Hájek removes the level error (weight sums drifting from 1) but not the shape error (misweighting of the population). With Monte Carlo mean weight sums \(w_T \approx 1.045\) and \(w_C \approx 0.972\), the extra HT bias is approximately \((w_T - 1)\E[Y(1)] - (w_C - 1)\E[Y(0)] \approx 0.045 \times 9 + 0.028 \times 8 \approx +0.62\), accounting for the entire gap \(+0.478 - (-0.139) = +0.617\).

Lesson 7: In this simulation, matching is less sensitive to PS misspecification. NNM bias barely changes: \(+0.026\) under the true PS versus \(+0.020\) under misspecification. NNM does not require the PS model to be correct — it only requires that the estimated score roughly orders units so matched pairs are approximately balanced on true confounders. The linear logistic model, though wrong, still captures the dominant linear effects. See Yang et al. (2016) and Yang and Zhang (2023) for theoretical analyses.

Lesson 8: The relative advantage of matching over IPW reverses with model misspecification.

| Estimator | Bias (Part 1: correct PS) | RMSE (Part 1) | Bias (Part 2: misspecified PS) | RMSE (Part 2) |
|---|---|---|---|---|
| HJ-ATT | +0.002 | 0.101 | +0.154 | 0.174 |
| NNM | +0.006 | 0.121 | +0.020 | 0.139 |

Warning: IPW Needs a Correct PS Model; Matching Needs Covariate Balance

IPW is a model-dependent estimator: its consistency relies on the PS model being correctly specified. Matching is a design-dependent estimator: its consistency relies on matched pairs being approximately balanced on the true confounders. This condition can hold even when the PS model is wrong, provided the estimated score still separates units well enough. Neither dominates unconditionally. The recommendation is to assess PS model plausibility via balance diagnostics, or to use both as a sensitivity check (Yang et al. 2016; Yang and Zhang 2023).

6.8 The Limits of Propensity Score Methods

6.8.1 The Untestable Assumption

Every result in this chapter rests on unconfoundedness \((Y(0), Y(1)) \indep T \mid X\): all confounders of the relationship between \(T\) and \(Y\) are observed and included in \(X\). In graphical terms: \(X\) blocks every back-door path from \(T\) to \(Y\).

Design-Based vs. Model-Based Identification

Randomization guarantees ignorability: by construction, no back-door paths exist. Propensity scores assume ignorability: the analyst hopes all confounders have been measured and included in \(X\), but this can never be verified from data alone. This distinction — between identification secured by the study design and identification secured by a modeling assumption — is one of the most important dividing lines in causal inference.

Two complementary responses. Sensitivity analysis (Chapter 9) keeps the back-door framework but quantifies how strong an unmeasured confounder would have to be to overturn the conclusion. The instrumental variable approach (Chapter 7) abandons the unconfoundedness assumption entirely by exploiting an external source of variation in treatment independent of the unobserved confounders.

6.8.2 The Road to Instrumental Variables

| Strategy | Key assumption | Identification mechanism |
|---|---|---|
| Back-door adjustment (PS) | All confounders are observed | Condition on \(X\) to block \(T \leftarrow U \to Y\) |
| Instrumental variables | Some confounders may be unobserved | Exploit exogenous variation \(Z \to T\) |

Under IV assumptions, the ratio of the \(Z\)-induced change in \(Y\) to the \(Z\)-induced change in \(T\) — the Wald estimator — identifies a causal parameter, typically the LATE among compliers. This is derived formally in Chapter 7.

6.9 Chapter Summary

| Symbol | Meaning |
|---|---|
| \(\pi(X)\) | Propensity score \(P(T{=}1 \mid X)\) |
| \(b(X)\) | Generic balancing score; \(\pi(X)\) is the coarsest |
| \(\hat\tau_{\mathrm{HT}}\) | Horvitz–Thompson IPW estimator (Equation 6.2) |
| \(\hat\tau_{\mathrm{HJ}}\) | Hájek (self-normalized) IPW estimator (Equation 6.3) |
| NNM | Nearest-neighbor matching estimator |

  1. The propensity score reduces dimension. Under unconfoundedness, \(\pi(X)\) is a balancing score (\(T \indep X \mid \pi(X)\)). By the Propensity Score Theorem, \((Y(0),Y(1)) \indep T \mid \pi(X)\), so identification requires adjustment for the scalar \(\pi(X)\) alone.
  2. IPW identification. The ATE is identified as \(\E[TY/\pi(X)] - \E[(1-T)Y/(1-\pi(X))]\) under strong ignorability. The HT estimator is consistent but sensitive to extreme weights; the Hájek variant is recommended in practice.
  3. Matching targets the ATT. Nearest-neighbor matching finds control units with similar propensity scores to each treated unit. It provides transparent covariate balance diagnostics but differs from IPW in estimand and methodology.
  4. Positivity is necessary; strong overlap is needed for stable estimation. Positivity (\(0 < \pi(X) < 1\) a.s.) is necessary for ATE identification. Strong overlap (\(c \le \pi(X) \le 1-c\)) ensures stable \(\sqrt{n}\)-inference. Trimming or re-targeting addresses near-violations.
  5. The fundamental limitation. Unconfoundedness is untestable. When unobserved confounders are present, either sensitivity analysis or an instrumental variable is needed.

| Design | Key assumption | Identified estimand |
|---|---|---|
| Randomized experiment | \((Y(0),Y(1)) \indep T\) (by design) | ATE |
| Propensity-score adjustment | \((Y(0),Y(1)) \indep T \mid X\) | ATE or ATT |
| Instrumental variables | Relevance, exogeneity, exclusion | LATE (compliers) |

6.10 Problems

1. Propensity score and balancing. Suppose \(X = (X_1, X_2)\) with \(\mathrm{logit}(\pi(X)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\).

  1. State and prove the Balancing Property \(T \indep X \mid \pi(X)\).
  2. Two units \(i\) and \(j\) have \(X_i = (1, 2.3)\) and \(X_j = (0, 3.5)\) but \(\hat\pi(X_i) = \hat\pi(X_j) = 0.4\). If \(i\) is treated and \(j\) is control, explain why comparing \(Y_i\) and \(Y_j\) is a valid approximation to the counterfactual comparison, and what assumption is required.
  3. Explain why matching on \(\hat\pi(X)\) is not the same as matching on \(X\) directly. Under what conditions do they give the same answer?

2. ATE, ATT, and propensity score weighting. A dataset has \(n = 1000\) observations with \(n_1 = 400\) treated units.

  1. Write the Horvitz–Thompson IPW estimator of the ATE as a weighted sum of observed outcomes \(Y_i\). What weights do treated units receive? What weights do control units receive?
  2. Derive an analogous IPW estimator for the ATT. (Hint: the ATT averages over the treated distribution; the weight for control units should reflect the treatment odds.)
  3. Show that when \(\pi(X) = p\) for all units, the IPW estimator of the ATE reduces to the difference-in-means estimator. Under what experimental design does \(\pi(X) = 0.5\) exactly?

3. Overlap. Let \(\pi(X)\) be the true propensity score and suppose \(\pi(x_0) = 1\) for some \(x_0\).

  1. For each of \(\tau_{\mathrm{ATE}}\) and \(\tau_{\mathrm{ATT}}\), determine whether identification fails when \(\pi(x_0) = 1\), and explain which step of the identification argument breaks down.
  2. Suppose overlap fails only on a set \(\mathcal{S}\) with \(P(X \in \mathcal{S}) = 0.15\). Define the trimmed ATE estimand. How does it differ from the ATE?
  3. A researcher applies the Crump et al. (2009) rule with \(\eta = 0.1\), removing 8% of the sample. List two reasons the trimmed estimator may have lower variance, and state the cost in terms of external validity.

4. Hidden confounding (preview of Chapter 9). Suppose you estimate \(\hat\tau_{\mathrm{ATE}} = 3.2\) using propensity-score methods, and a referee questions whether unconfoundedness holds.

  1. Define what it means for \(U\) to be a “hidden confounder” in the context of the DAG, and explain why its presence invalidates the IPW identification formula.
  2. Explain in words what it means for an estimated effect to be “robust to hidden confounding,” and why such an assessment depends on a quantitative yardstick. (Chapter 9 develops three formal yardsticks: Rosenbaum’s \(\Gamma\), the E-value, and the marginal sensitivity model.)
  3. Why does the IV strategy of Chapter 7 avoid the hidden-confounding problem entirely, and what assumption replaces unconfoundedness?

5. Matching vs. IPW. You have \(n = 500\) observations comparing Hájek IPW targeting the ATE (estimator A) and 1:1 nearest-neighbor matching targeting the ATT (estimator B).

  1. Explain in one sentence why estimators A and B target different estimands even though both use \(\hat\pi(X)\).
  2. After matching, the SMD for covariate \(X_1\) is 0.05 (vs. 0.45 before matching). What does this tell you about the success of matching, and what assumption does balance on observed covariates not verify?
  3. Under what condition on treatment effect heterogeneity are the ATE and ATT equal?
References

Abadie, Alberto, and Guido W. Imbens. 2006. “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica 74 (1): 235–67.
Abadie, Alberto, and Guido W. Imbens. 2016. “Matching on the Estimated Propensity Score.” Econometrica 84 (2): 781–807. https://doi.org/10.3982/ECTA11293.
Crump, Richard K., V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. “Dealing with Limited Overlap in Estimation of Average Treatment Effects.” Biometrika 96 (1): 187–99. https://doi.org/10.1093/biomet/asn055.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Khan, Shakeeb, and Elie Tamer. 2010. “Irregular Identification, Support Conditions, and Inverse Weight Estimation.” Econometrica 78 (6): 2021–42.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Yang, Shu, Guido W. Imbens, Zhanglin Cui, Douglas E. Faries, and Zbigniew Kadziola. 2016. “Propensity Score Matching and Stratification in Observational Studies with Multi-Level Treatments.” Biometrics 72: 1055–65.
Yang, Shu, and Yunshu Zhang. 2023. “Multiply Robust Matching Estimators of Average and Quantile Treatment Effects.” Scandinavian Journal of Statistics 50: 235–65.