6 Propensity Score Methods
6.1 Motivation: The Curse of Dimensionality
Chapter 5 established that under strong ignorability, the ATE is identified by the back-door adjustment formula, and developed three estimation strategies: regression adjustment, standardization, and stratification. All three require conditioning on a covariate vector \(X\). When \(X\) is low-dimensional, this is feasible. When \(X\) has many components, it is not.
The problem. Stratification requires cells \(\{X = x\}\) to contain both treated and control units. With \(p\) binary covariates, there are \(2^p\) cells. With \(p = 10\), that is 1024 cells — far more than most datasets can support. Regression adjustment avoids the cell-count problem but requires a correctly specified model for \(\E[Y \mid T, X]\), which becomes harder to specify reliably as \(p\) grows.
The solution. Rosenbaum and Rubin (1983) showed that it is sufficient to condition on a single scalar function of \(X\): the propensity score \(\pi(X) = P(T{=}1 \mid X)\). The propensity score reduces the adjustment dimension without changing the identified estimand, provided strong ignorability and positivity hold.
6.2 The Propensity Score and Its Balancing Properties
6.2.1 Balancing Scores and the Propensity Score
A function \(b(X)\) is a balancing score if treatment and covariates are independent given it: \(T \indep X \mid b(X)\). The full covariate vector \(X\) is trivially a balancing score. The propensity score is the coarsest balancing score, providing the greatest dimension reduction without sacrificing identification.
In a randomized experiment with equal assignment probabilities, \(\pi(X) = 0.5\) for all units; in any randomized design, \(\pi(X)\) is known by construction. In an observational study, \(\pi(X)\) varies with \(X\) and must be estimated from data.
Connection to strong ignorability. Under strong ignorability \((Y(0),Y(1)) \indep T \mid X\), the potential outcomes carry no further information about treatment assignment once \(X\) is given: \(P(T{=}1 \mid X, Y(0), Y(1)) = P(T{=}1 \mid X) = \pi(X)\). This equality is a consequence of the ignorability assumption, not part of the definition of \(\pi(X)\).
6.2.2 The Propensity Score Theorem
The lemma states that \(\pi(X)\) is the coarsest balancing score: a function \(b(X)\) is a balancing score if and only if \(\pi(X) = f(b(X))\) for some function \(f\). Any other balancing score is therefore at least as fine and retains at least as much information about \(X\) as \(\pi(X)\) does.
6.3 Estimation of the Propensity Score
Logistic regression. The classical approach models \(\mathrm{logit}(\pi(X)) = X^\top\beta\) and estimates \(\beta\) by maximum likelihood. The fitted values \(\hat\pi(X_i) = \sigma(X_i^\top\hat\beta)\) are used in place of the true propensity score.
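A minimal sketch of this fit, assuming scikit-learn; the data-generating lines are purely illustrative placeholders, not the chapter's DGP:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1_000, 5
X = rng.normal(size=(n, p))                       # hypothetical covariates
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # hypothetical assignment

# penalty=None requests plain maximum likelihood (no regularization),
# matching the classical model logit(pi(X)) = X'beta.
fit = LogisticRegression(penalty=None).fit(X, t)
pi_hat = fit.predict_proba(X)[:, 1]   # fitted scores sigma(X_i' beta_hat)
```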
Machine learning estimators. When \(X\) is high-dimensional or the true propensity score is nonlinear, machine learning methods — gradient boosted trees, random forests, regularized logistic regression — can estimate \(\pi(X)\) more flexibly. A critical issue is overfitting: an estimator that perfectly separates treated and control units in-sample will produce estimated propensity scores near 0 or 1 everywhere, destroying overlap. Cross-fitting (Chapter 11) addresses this by estimating nuisance functions on held-out folds.
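A sketch of the cross-fitting idea using out-of-fold predictions, assuming scikit-learn; the clipping threshold below is a judgment call for illustration, not a standard value:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))                   # hypothetical data
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

# Each unit's score is predicted by a model fit on the other folds,
# so the estimator cannot memorize its own assignment in-sample.
clf = GradientBoostingClassifier(random_state=0)
pi_oof = cross_val_predict(clf, X, t, cv=5, method="predict_proba")[:, 1]

# Clip to guard against out-of-fold scores at or near 0/1, which would
# destroy overlap for the IPW estimators of Section 6.5.
pi_oof = np.clip(pi_oof, 0.01, 0.99)
```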
6.4 Matching on the Propensity Score
The propensity score motivates matching: for each treated unit \(i\), find a control unit \(j\) with \(\hat\pi(X_j) \approx \hat\pi(X_i)\), and use \(Y_j\) as a local estimate of \(\E[Y(0) \mid \pi(X) = \pi(X_i)]\) (Abadie and Imbens 2006, 2016). The matched control outcome is not a substitute for the individual counterfactual \(Y_i(0)\); it approximates the conditional mean: \[\E[Y(0) \mid \pi(X) = \pi(X_i)] = \E[Y(0) \mid \pi(X) = \pi(X_i), T{=}0] = \E[Y \mid \pi(X) = \pi(X_i), T{=}0],\] where the first equality uses the Propensity Score Theorem (Section 6.2.2) and the second uses consistency. Averaging across treated units recovers the ATT: \[\hat\tau_{\mathrm{ATT}} = \frac{1}{n_1}\sum_{i:\, T_i=1}[Y_i - Y_{\hat\jmath(i)}], \tag{6.1}\] where \(\hat\jmath(i)\) is the matched control index.
One-to-one nearest-neighbor matching. For each treated unit \(i\): \[\hat\jmath(i) = \arg\min_{j:\, T_j=0}|\hat\pi(X_i) - \hat\pi(X_j)|.\] Matching with replacement reduces bias but increases variance.
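A sketch of the matching estimator of Equation 6.1 with replacement; `nnm_att` is a hypothetical helper, and the brute-force distance matrix assumes moderate sample sizes:

```python
import numpy as np

def nnm_att(y, t, pi_hat):
    """ATT via 1:1 nearest-neighbor matching on pi_hat, with replacement."""
    treated = np.flatnonzero(t == 1)
    control = np.flatnonzero(t == 0)
    # j_hat(i): index of the control whose score is closest to treated i.
    dists = np.abs(pi_hat[treated, None] - pi_hat[None, control])
    j_hat = control[np.argmin(dists, axis=1)]
    # Equation 6.1: average treated-minus-matched-control difference.
    return np.mean(y[treated] - y[j_hat])
```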
Caliper matching. Restricts matches to pairs within a maximum distance \(\delta\), discarding treated units with no close match. This restricts the estimand to a subpopulation with good overlap.
6.5 Inverse Probability Weighting
6.5.1 The IPW Identification Formula
Under strong ignorability, the ATE has an inverse probability weighting (IPW) representation: \[\tau_{\mathrm{ATE}} = \E\!\left[\frac{T \cdot Y}{\pi(X)}\right] - \E\!\left[\frac{(1-T)\cdot Y}{1-\pi(X)}\right].\]
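To see why the first term equals \(\E[Y(1)]\) (the second is symmetric), iterate expectations over \(T\) given \((X, Y(1))\): \[\E\!\left[\frac{T\,Y}{\pi(X)}\right] = \E\!\left[\frac{T\,Y(1)}{\pi(X)}\right] = \E\!\left[\frac{Y(1)}{\pi(X)}\,\E\big[T \mid X, Y(1)\big]\right] = \E\!\left[\frac{Y(1)}{\pi(X)}\,\pi(X)\right] = \E[Y(1)],\] where the first equality uses consistency (\(TY = TY(1)\)), the second iterates expectations, and the third uses unconfoundedness so that \(\E[T \mid X, Y(1)] = \pi(X)\); positivity guarantees \(\pi(X) > 0\), so the ratio is well defined.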
6.5.2 Horvitz–Thompson and Hájek Estimators
The Horvitz–Thompson (HT) IPW estimator: \[\hat\tau_{\mathrm{HT}} = \frac{1}{n}\sum_{i=1}^n\left[\frac{T_i Y_i}{\hat\pi(X_i)} - \frac{(1-T_i)Y_i}{1-\hat\pi(X_i)}\right]. \tag{6.2}\]
The Hájek (self-normalized) estimator: \[\hat\tau_{\mathrm{HJ}} = \frac{\sum_i T_i Y_i/\hat\pi(X_i)}{\sum_i T_i/\hat\pi(X_i)} - \frac{\sum_i (1-T_i)Y_i/(1-\hat\pi(X_i))}{\sum_i (1-T_i)/(1-\hat\pi(X_i))}. \tag{6.3}\]
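A compact sketch of Equations 6.2 and 6.3, assuming NumPy arrays `y`, `t`, and `pi_hat` from a propensity model as above; `ipw_ate` is a hypothetical helper name:

```python
import numpy as np

def ipw_ate(y, t, pi_hat):
    """Horvitz-Thompson (Eq. 6.2) and Hajek (Eq. 6.3) ATE estimates."""
    w1 = t / pi_hat                # raw inverse weights, treated arm
    w0 = (1 - t) / (1 - pi_hat)    # raw inverse weights, control arm
    ht = np.mean(w1 * y) - np.mean(w0 * y)
    hj = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
    return ht, hj
```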
The Hájek estimator normalizes the weights within each arm to sum to one, so each unit’s effective weight is bounded by its share of the arm’s total weight; this suppresses the influence of extreme raw weights. The efficient doubly robust AIPW estimator of Chapter 10 combines the outcome model and the propensity score; under correct nuisance specification it achieves higher efficiency than either pure IPW or pure outcome regression.
ATT estimation. The ATT re-weights the control arm to look like the treated population: \[\hat\tau_{\mathrm{HT,ATT}} = \frac{1}{n_1}\sum_{i:\,T_i=1}Y_i - \frac{1}{n_1}\sum_{i:\,T_i=0}\frac{\hat\pi(X_i)}{1-\hat\pi(X_i)}Y_i, \tag{6.4}\] \[\hat\tau_{\mathrm{HJ,ATT}} = \frac{1}{n_1}\sum_{i:\,T_i=1}Y_i - \frac{\sum_{i:\,T_i=0}\frac{\hat\pi(X_i)}{1-\hat\pi(X_i)}Y_i}{\sum_{i:\,T_i=0}\frac{\hat\pi(X_i)}{1-\hat\pi(X_i)}}. \tag{6.5}\]
Each control unit receives weight proportional to its odds of treatment \(\hat\pi/(1-\hat\pi)\). For a control unit with \(\hat\pi(X_i) = 0.95\) the weight is \(19\); combined with a large baseline outcome this produces extreme variance.
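The ATT analogues of Equations 6.4 and 6.5 in the same style, again with a hypothetical helper name:

```python
import numpy as np

def ipw_att(y, t, pi_hat):
    """HT (Eq. 6.4) and Hajek (Eq. 6.5) ATT estimates."""
    n1 = np.sum(t)
    treated_mean = np.mean(y[t == 1])
    # Each control unit is weighted by its treatment odds pi/(1 - pi).
    odds = pi_hat[t == 0] / (1 - pi_hat[t == 0])
    ht = treated_mean - np.sum(odds * y[t == 0]) / n1
    hj = treated_mean - np.sum(odds * y[t == 0]) / np.sum(odds)
    return ht, hj
```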
6.6 Overlap and Positivity
6.6.1 Positivity and Strong Overlap
Positivity requires \(0 < \pi(X) < 1\) almost surely: every covariate profile has positive probability of receiving each treatment arm. Together with unconfoundedness, positivity constitutes the strong ignorability of Rosenbaum and Rubin (1983), and it is the condition needed for identification.
Strong overlap bounds the inverse weights uniformly away from infinity. It is the condition under which \(\sqrt{n}\)-consistent, asymptotically normal inference for the ATE goes through (Chapter 11). Positivity alone permits identification but does not guarantee stable inference: when \(\pi(X)\) approaches 0 or 1, IPW weights blow up even though the parameter is technically identified (Khan and Tamer 2010).
6.6.2 Practical Consequences of Near-Violations
6.6.3 Trimming Strategies
Trimming by propensity score. Restrict to units with \(\eta \le \pi(X) \le 1-\eta\) for small \(\eta > 0\). The estimand changes to: \[\tau_{\mathrm{trim}} = \E[Y(1) - Y(0) \mid \eta \le \pi(X) \le 1-\eta].\]
Crump et al. (2009) rule. Crump et al. (2009) derive the optimal trimming threshold minimizing the asymptotic variance of the trimmed ATE estimator. The optimal \(\eta\) is data-dependent; the fixed rule \(\eta = 0.1\) — keep units with \(0.1 \le \hat\pi(X) \le 0.9\) — is a common default approximation.
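A one-function sketch of trimming with the \(\eta = 0.1\) default (hypothetical helper name):

```python
import numpy as np

def trim_sample(y, t, pi_hat, eta=0.1):
    """Keep units with eta <= pi_hat <= 1 - eta; the estimand becomes tau_trim."""
    keep = (pi_hat >= eta) & (pi_hat <= 1 - eta)
    return y[keep], t[keep], pi_hat[keep]
```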
6.6.5 Re-targeting the Estimand
When overlap fails, an alternative to trimming is to change the estimand. The ATT requires only \(P(T{=}0 \mid X{=}x) > 0\) for \(x\) in the support of \(X \mid T{=}1\), i.e., \(\pi(X) < 1\) a.s. on the treated support. The ATT tolerates regions with \(\pi(x) = 0\) (no treated units there) but not regions inside the treated support where \(\pi(x) = 1\). The ATC is symmetric. When overlap fails where \(\pi(X) \approx 0\), re-target to the ATT; when overlap fails where \(\pi(X) \approx 1\), re-target to the ATC.
6.7 Lab: Simulation Study of IPW and Matching Estimators
This lab compares five estimators on a five-covariate DGP with treatment-effect heterogeneity. With five continuous confounders, direct stratification is infeasible, making propensity-score methods the natural choice.
6.7.1 Part 1: Correctly Specified Propensity Score
Estimators. Five estimators are compared under both the true and the estimated propensity score. The estimated score uses logistic regression of \(T\) on \((X_1,\ldots,X_5)\) — correctly specified, since the true logit is linear. The five estimators are HT-ATE (Equation 6.2) and HJ-ATE (Equation 6.3), both targeting the ATE; and HT-ATT (Equation 6.4), HJ-ATT (Equation 6.5), and NNM (Equation 6.1), all targeting the ATT. The large baseline mean \(\E[Y(0)] = 8\) amplifies the consequences of extreme IPW weights, making Hájek normalization strongly recommended.
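A condensed sketch of a single replication. The assignment logit is inferred from Equation 6.8 with its quadratic term removed; the outcome equation is a hypothetical placeholder chosen only to match the stated structure (\(\E[Y(0)] = 8\), heterogeneous effect averaging to \(\tau_{\mathrm{ATE}} = 1\)) — the lab’s actual outcome coefficients are not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1_000
X = rng.normal(size=(n, 5))

# Assignment: Equation 6.8 minus its quadratic term (Part 1 truth).
logit = -0.5 + 0.8 * X[:, 0] + 0.2 * X[:, 3] + 0.1 * X[:, 4]
pi_true = 1 / (1 + np.exp(-logit))
t = rng.binomial(1, pi_true)

# Hypothetical outcome: E[Y(0)] = 8, heterogeneous effect with mean 1,
# X1 confounds both assignment and outcome.
tau_i = 1.0 + 0.5 * X[:, 0]
y = 8.0 + X[:, 0] + tau_i * t + rng.normal(size=n)

# Correctly specified working model: logistic regression of T on X1..X5.
pi_hat = LogisticRegression(penalty=None).fit(X, t).predict_proba(X)[:, 1]

# Hajek ATE (Equation 6.3) under the estimated score; the full lab loops
# this over B = 2,000 replications and all five estimators.
w1, w0 = t / pi_hat, (1 - t) / (1 - pi_hat)
hj_ate = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
print(f"HJ-ATE (one replication): {hj_ate:.3f}")
```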
Results (\(n = 1{,}000\), \(B = 2{,}000\) replications, seed 42). Bias relative to each estimator’s own target.
| PS | Estimator | Estimand | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|---|
| Known | HT-ATE | ATE | 1.008 | +0.008 | 0.626 | 0.626 |
| Known | HJ-ATE | ATE | 1.003 | +0.003 | 0.145 | 0.145 |
| Known | HT-ATT | ATT | 1.184 | −0.016 | 0.725 | 0.725 |
| Known | HJ-ATT | ATT | 1.203 | +0.003 | 0.131 | 0.131 |
| Known | NNM | ATT | 1.207 | +0.008 | 0.130 | 0.130 |
| Estimated | HT-ATE | ATE | 0.998 | −0.002 | 0.194 | 0.194 |
| Estimated | HJ-ATE | ATE | 1.002 | +0.002 | 0.102 | 0.103 |
| Estimated | HT-ATT | ATT | 1.202 | +0.002 | 0.251 | 0.251 |
| Estimated | HJ-ATT | ATT | 1.202 | +0.002 | 0.101 | 0.101 |
| Estimated | NNM | ATT | 1.205 | +0.006 | 0.121 | 0.121 |
Lesson 1: Hájek normalization is strongly recommended when outcomes are non-centered. With the known PS, HT-ATE achieves SD \(= 0.626\), while HJ-ATE achieves SD \(= 0.145\) — a 4.3-fold reduction. A unit with \(\pi(X_i) = 0.05\) receives raw weight \(20\); multiplied by an outcome near \(\E[Y(0)] = 8\), its contribution is of order \(160\).
Lesson 2: HT-ATT is even more unstable; Hájek normalization is equally beneficial. HT-ATT SD \(= 0.725\). A control unit with \(\hat\pi(X_i) = 0.95\) receives odds-ratio weight \(19\). HJ-ATT reduces SD to \(0.131\) — a 5.5-fold reduction.
Lesson 3: HJ-ATT and NNM converge to the same target with nearly identical efficiency. Under the known PS, HJ-ATT (SD \(= 0.131\)) and NNM (SD \(= 0.130\)) are virtually indistinguishable — a clear “two routes, one estimand” demonstration.
Lesson 4: The estimand gap between ATE and ATT is large and clearly revealed. The ATE estimators converge to \(1.000\); the ATT estimators converge to \(\approx 1.199\). The \(0.199\) gap is not bias. A researcher who applies NNM and reports against the ATE benchmark would conclude the estimator has 20% bias; it is actually unbiased for the correct target.
Lesson 5: Estimated PS collapses HT variance. HT-ATE SD falls from \(0.626\) to \(0.194\) (69% reduction); shrinkage of fitted logistic probabilities toward the sample mean trims extreme weights automatically. HJ-ATT with estimated PS achieves SD \(= 0.101\) — beating NNM (SD \(= 0.121\)).
6.7.2 Part 2: Robustness to PS Model Misspecification
Modified DGP. Outcome model unchanged. True PS now contains a quadratic term: \[\pi^*(X) = \mathrm{expit}(-0.5 + 0.8X_1 + 0.5X_1^2 + 0.2X_4 + 0.1X_5), \tag{6.8}\] giving a U-shaped propensity surface. \(\tau_{\mathrm{ATE}} = 1.000\) (unchanged); \(\tau_{\mathrm{ATT}}^* \approx 1.152\) (oracle). The estimated PS is still a linear logistic regression on \((X_1,\ldots,X_5)\) — misspecified by omitting \(X_1^2\).
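A sketch of the Part 2 design: generate assignment from the quadratic logit of Equation 6.8, then fit the deliberately misspecified linear working model (outcome generation, unchanged from Part 1, is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 5))

# True propensity score of Equation 6.8: U-shaped in X1.
logit = (-0.5 + 0.8 * X[:, 0] + 0.5 * X[:, 0] ** 2
         + 0.2 * X[:, 3] + 0.1 * X[:, 4])
pi_star = 1 / (1 + np.exp(-logit))
t = rng.binomial(1, pi_star)

# Misspecified working model: linear in (X1, ..., X5), omitting X1^2.
pi_hat = LogisticRegression(penalty=None).fit(X, t).predict_proba(X)[:, 1]
```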
Results (\(n = 1{,}000\), \(B = 2{,}000\), seed 42):
| PS | Estimator | Estimand | Mean | Bias | SD | RMSE |
|---|---|---|---|---|---|---|
| True | HT-ATE | ATE | 1.028 | +0.028 | 0.824 | 0.825 |
| True | HJ-ATE | ATE | 1.013 | +0.013 | 0.152 | 0.152 |
| True | HT-ATT | ATT | 1.186 | +0.034 | 1.333 | 1.333 |
| True | HJ-ATT | ATT | 1.183 | +0.031 | 0.212 | 0.215 |
| True | NNM | ATT | 1.178 | +0.026 | 0.175 | 0.177 |
| Misspecified | HT-ATE | ATE | 1.478 | +0.478 | 0.147 | 0.501 |
| Misspecified | HJ-ATE | ATE | 0.861 | −0.139 | 0.087 | 0.164 |
| Misspecified | HT-ATT | ATT | 1.780 | +0.628 | 0.162 | 0.649 |
| Misspecified | HJ-ATT | ATT | 1.306 | +0.154 | 0.081 | 0.174 |
| Misspecified | NNM | ATT | 1.172 | +0.020 | 0.138 | 0.139 |
Lesson 6: PS misspecification makes all IPW estimators inconsistent; Hájek cannot fix this. HT-ATE bias is \(+0.478\) and HJ-ATE bias is \(-0.139\) — opposite signs. Hájek removes the level error (weight sums drifting from 1) but not the shape error (misweighting of the population). With Monte Carlo means \(w_T \approx 1.045\) and \(w_C \approx 0.972\), the extra HT bias is approximately \((w_T - 1)\E[Y(1)] - (w_C - 1)\E[Y(0)] \approx 0.045 \times 9 + 0.028 \times 8 \approx +0.62\), accounting for the entire gap \(+0.478 - (-0.139) = +0.617\).
Lesson 7: In this simulation, matching is less sensitive to PS misspecification. NNM bias barely changes: \(+0.026\) under the true PS versus \(+0.020\) under misspecification. NNM does not require the PS model to be correct — it only requires that the estimated score roughly orders units so matched pairs are approximately balanced on true confounders. The linear logistic model, though wrong, still captures the dominant linear effects. See Yang et al. (2016) and Yang and Zhang (2023) for theoretical analyses.
Lesson 8: The relative advantage of matching over IPW reverses with model misspecification.
| Estimator | Bias (Part 1: correct PS) | RMSE (Part 1) | Bias (Part 2: misspecified PS) | RMSE (Part 2) |
|---|---|---|---|---|
| HJ-ATT | +0.002 | 0.101 | +0.154 | 0.174 |
| NNM | +0.006 | 0.121 | +0.020 | 0.139 |
6.8 The Limits of Propensity Score Methods
6.8.1 The Untestable Assumption
Every result in this chapter rests on unconfoundedness \((Y(0), Y(1)) \indep T \mid X\): all confounders of the \(T\)–\(Y\) relationship are observed and included in \(X\). In graphical terms: \(X\) blocks every back-door path from \(T\) to \(Y\).
Two complementary responses. Sensitivity analysis (Chapter 9) keeps the back-door framework but quantifies how strong an unmeasured confounder would have to be to overturn the conclusion. The instrumental variable approach (Chapter 7) abandons the unconfoundedness assumption entirely by exploiting an external source of variation in treatment independent of the unobserved confounders.
6.8.2 The Road to Instrumental Variables
| Strategy | Key assumption | Identification mechanism |
|---|---|---|
| Back-door adjustment (PS) | All confounders are observed | Condition on \(X\) to block \(T \leftarrow U \to Y\) |
| Instrumental variables | Some confounders may be unobserved | Exploit exogenous variation \(Z \to T\) |
Under IV assumptions, the ratio of the \(Z\)-induced change in \(Y\) to the \(Z\)-induced change in \(T\) — the Wald estimator — identifies a causal parameter, typically the LATE among compliers. This is derived formally in Chapter 7.
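As a preview of notation that Chapter 7 develops formally, for a binary instrument \(Z\) the Wald estimand can be written explicitly: \[\tau_{\mathrm{Wald}} = \frac{\E[Y \mid Z{=}1] - \E[Y \mid Z{=}0]}{\E[T \mid Z{=}1] - \E[T \mid Z{=}0]},\] the ratio of the instrument’s effect on the outcome to its effect on treatment uptake.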
6.9 Chapter Summary
| Symbol | Meaning |
|---|---|
| \(\pi(X)\) | Propensity score \(P(T{=}1 \mid X)\) |
| \(b(X)\) | Generic balancing score; \(\pi(X)\) is the coarsest |
| \(\hat\tau_{\mathrm{HT}}\) | Horvitz–Thompson IPW estimator Equation 6.2 |
| \(\hat\tau_{\mathrm{HJ}}\) | Hájek (self-normalized) IPW estimator Equation 6.3 |
| NNM | Nearest-neighbor matching estimator |
- The propensity score reduces dimension. Under unconfoundedness, \(\pi(X)\) is a balancing score (\(T \indep X \mid \pi(X)\)). By the Propensity Score Theorem (Section 6.2.2), \((Y(0),Y(1)) \indep T \mid \pi(X)\), so identification requires adjustment for the scalar \(\pi(X)\) alone.
- IPW identification. The ATE is identified as \(\E[TY/\pi(X)] - \E[(1-T)Y/(1-\pi(X))]\) under strong ignorability. The HT estimator is consistent but sensitive to extreme weights; the Hájek variant is recommended in practice.
- Matching targets the ATT. Nearest-neighbor matching finds control units with similar propensity scores to each treated unit and targets the ATT by construction, whereas IPW can target either the ATE or the ATT with appropriate weights. Matching provides transparent covariate-balance diagnostics and, in the lab, was less sensitive to PS model misspecification than IPW.
- Positivity is necessary; strong overlap is needed for stable estimation. Positivity (\(0 < \pi(X) < 1\) a.s.) is necessary for ATE identification. Strong overlap (\(c \le \pi(X) \le 1-c\)) ensures stable \(\sqrt{n}\)-inference. Trimming or re-targeting addresses near-violations.
- The fundamental limitation. Unconfoundedness is untestable. When unobserved confounders are present, either sensitivity analysis or an instrumental variable is needed.
| Design | Key assumption | Identified estimand |
|---|---|---|
| Randomized experiment | \((Y(0),Y(1)) \indep T\) (by design) | ATE |
| Propensity-score adjustment | \((Y(0),Y(1)) \indep T \mid X\) | ATE or ATT |
| Instrumental variables | Relevance, exogeneity, exclusion | LATE (compliers) |
6.10 Problems
1. Propensity score and balancing. Suppose \(X = (X_1, X_2)\) with \(\mathrm{logit}(\pi(X)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\).
- State and prove the Balancing Property \(T \indep X \mid \pi(X)\).
- Two units \(i\) and \(j\) have \(X_i = (1, 2.3)\) and \(X_j = (0, 3.5)\) but \(\hat\pi(X_i) = \hat\pi(X_j) = 0.4\). If \(i\) is treated and \(j\) is control, explain why comparing \(Y_i\) and \(Y_j\) is a valid approximation to the counterfactual comparison, and what assumption is required.
- Explain why matching on \(\hat\pi(X)\) is not the same as matching on \(X\) directly. Under what conditions do they give the same answer?
2. ATE, ATT, and propensity score weighting. A dataset has \(n = 1000\) observations with \(n_1 = 400\) treated units.
- Write the Horvitz–Thompson IPW estimator of the ATE as a weighted sum of observed outcomes \(Y_i\). What weights do treated units receive? What weights do control units receive?
- Derive an analogous IPW estimator for the ATT. (Hint: the ATT averages over the treated distribution; the weight for control units should reflect the treatment odds.)
- Show that when \(\pi(X) = p\) for all units, the IPW estimator of the ATE reduces to the difference-in-means estimator. Under what experimental design does \(\pi(X) = 0.5\) exactly?
3. Overlap. Let \(\pi(X)\) be the true propensity score and suppose \(\pi(x_0) = 1\) for some \(x_0\).
- For each of \(\tau_{\mathrm{ATE}}\) and \(\tau_{\mathrm{ATT}}\), identify the formula that fails to be identified when \(\pi(x_0) = 1\), and explain why.
- Suppose overlap fails only on a set \(\mathcal{S}\) with \(P(X \in \mathcal{S}) = 0.15\). Define the trimmed ATE estimand. How does it differ from the ATE?
- A researcher applies the Crump et al. (2009) rule with \(\eta = 0.1\), removing 8% of the sample. List two reasons the trimmed estimator may have lower variance, and state the cost in terms of external validity.
4. Hidden confounding (preview of Chapter 9). Suppose you estimate \(\hat\tau_{\mathrm{ATE}} = 3.2\) using propensity-score methods, and a referee questions whether unconfoundedness holds.
- Define what it means for \(U\) to be a “hidden confounder” in the context of the DAG, and explain why its presence invalidates the IPW identification formula.
- Explain in words what it means for an estimated effect to be “robust to hidden confounding,” and why such an assessment depends on a quantitative yardstick. (Chapter 9 develops three formal yardsticks: Rosenbaum’s \(\Gamma\), the E-value, and the marginal sensitivity model.)
- Why does the IV strategy of Chapter 7 avoid the hidden-confounding problem entirely, and what assumption replaces unconfoundedness?
5. Matching vs. IPW. You have \(n = 500\) observations comparing Hájek IPW targeting the ATE (estimator A) and 1:1 nearest-neighbor matching targeting the ATT (estimator B).
- Explain in one sentence why estimators A and B target different estimands even though both use \(\hat\pi(X)\).
- After matching, the SMD for covariate \(X_1\) is 0.05 (vs. 0.45 before matching). What does this tell you about the success of matching, and what assumption does balance on observed covariates not verify?
- Under what condition on treatment effect heterogeneity are the ATE and ATT equal?