Appendix A — Graphical Intuition for Conditional Independence and \(d\)-Separation

This appendix provides a gentle introduction to the probabilistic and graphical ideas that underlie Chapters 2 and 3. Our goal is not to give a complete treatment of graphical models, but rather to develop enough intuition so that the formal machinery of \(d\)-separation, back-door adjustment, and intervention graphs does not appear abruptly.

A directed acyclic graph (DAG) is more than a picture. It is a compact language for expressing assumptions about how variables are related. Once those assumptions are represented graphically, the graph tells us which variables may be associated, which paths transmit dependence, and which variables should or should not be conditioned on. These ideas become central in Chapter 2, where we study \(d\)-separation, and in Chapter 3, where we use graph surgery to derive identification results.

The key pedagogical idea of this appendix is simple: before learning the full \(d\)-separation criterion, it is helpful to understand three elementary three-node patterns. Those local patterns explain most of what happens later in larger graphs.

A.1 Conditional Independence: The Probabilistic Language Behind Graphs

Before introducing graphs, we first recall the probabilistic notion that graphs are designed to encode.

Definition: Conditional Independence

Let \(X\), \(Y\), and \(Z\) be random variables. We say that \(X\) and \(Y\) are conditionally independent given \(Z\), written \(X \indep Y \mid Z\), if \[p(x, y \mid z) = p(x \mid z)\,p(y \mid z) \qquad \text{for } P_Z\text{-almost every } z.\]

Equivalently, once \(Z\) is known, learning \(X\) gives no additional information about \(Y\), and learning \(Y\) gives no additional information about \(X\).

In terms of densities (with respect to a dominating measure on \((X, Y, Z)\)), conditional independence admits the following equivalent characterizations, each interpreted almost everywhere: \[X \indep Y \mid Z \;\iff\; f(x,y,z)\,f(z) = f(x,z)\,f(y,z) \;\iff\; \exists\, a, b \colon f(x,y,z) = a(x,z)\,b(y,z).\] The last form is especially useful: it says that the joint density factors into one piece depending on \((x,z)\) and another depending on \((y,z)\), with no cross-term in \(x\) and \(y\).
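As a quick numerical illustration of this last characterization, the following Python sketch (an informal check, not part of the formal development) builds a small discrete joint distribution from factors \(p(z)\), \(p(x \mid z)\), and \(p(y \mid z)\), so that \(X \indep Y \mid Z\) holds by construction, and verifies the identity \(f(x,y,z)\,f(z) = f(x,z)\,f(y,z)\) at every point. All probability tables are arbitrary illustrative choices.

```python
import numpy as np

# Construct a joint p(x, y, z) in which X and Y are conditionally
# independent given Z by building it from p(z), p(x|z), and p(y|z).
p_z = np.array([0.3, 0.7])                     # p(z),   indexed by z
p_x_given_z = np.array([[0.9, 0.1],            # p(x|z), rows indexed by z
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.6, 0.4],            # p(y|z), rows indexed by z
                        [0.5, 0.5]])

# joint[x, y, z] = p(z) * p(x|z) * p(y|z)
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)

# Marginals needed for the characterization f(x,y,z) f(z) = f(x,z) f(y,z).
f_z = joint.sum(axis=(0, 1))        # f(z)
f_xz = joint.sum(axis=1)            # f(x,z)
f_yz = joint.sum(axis=0)            # f(y,z)

lhs = joint * f_z[None, None, :]
rhs = f_xz[:, None, :] * f_yz[None, :, :]
print(np.allclose(lhs, rhs))        # True: X is independent of Y given Z in this joint
```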

Proposition: Fundamental Properties of Conditional Independence (Dawid 1979, 1980)

For random variables \(X\), \(Y\), \(Z\), and \(W\), the following hold:

  • (C1) Symmetry. \(X \indep Y \mid Z \;\Rightarrow\; Y \indep X \mid Z\).
  • (C2) Decomposition. \(X \indep (Y,W) \mid Z \;\Rightarrow\; X \indep Y \mid Z\).
  • (C3) Weak Union. \(X \indep (Y,W) \mid Z \;\Rightarrow\; X \indep Y \mid (Z,W)\).
  • (C4) Contraction. \(X \indep Y \mid Z\) and \(X \indep W \mid (Y,Z) \;\Rightarrow\; X \indep (Y,W) \mid Z\).
  • (C5) Intersection. If \(f(x,y,z,w) > 0\) for all \((x,y,z,w)\), then \(X \indep Y \mid (Z,W)\) and \(X \indep Z \mid (Y,W) \;\Rightarrow\; X \indep (Y,Z) \mid W\).

Properties (C1)–(C4) hold for any probability distribution and are known as the semigraphoid axioms. Property (C5) additionally requires the joint density to be strictly positive; together with (C1)–(C4) it forms the graphoid axioms. These properties are used implicitly throughout the course whenever conditional independence statements are combined or simplified.

Conditional independence is not the same as marginal independence. Two variables may be dependent marginally but independent after conditioning on a third variable. Conversely, two variables may be independent marginally but become dependent after conditioning. Both phenomena occur repeatedly in causal inference.

Example: Ice cream, drowning, and season

Let \(X = \text{ice cream sales}\), \(Y = \text{drowning incidents}\), and \(Z = \text{season}\). Marginally, \(X\) and \(Y\) are positively associated because both tend to be higher in summer. This does not mean that ice cream sales cause drowning. A more plausible explanation is that season is a common cause of both variables. Once season is fixed, the association largely disappears: \(X \indep Y \mid Z\).

This example illustrates a recurring theme in causal inference: an observed association may be induced by a third variable, and conditioning on that variable can remove the spurious dependence.

Remark: The Central Question

At this stage, the main question for the reader is: Does conditioning on a variable remove association, preserve it, or create it? Graphs provide a systematic answer to exactly this question.

A.2 Three Basic Motifs: Chain, Fork, and Collider

Every path in a DAG is built from local three-node configurations. There are three fundamental types: a chain, a fork, and a collider. Their behavior under conditioning is the foundation of \(d\)-separation.

[Figure: the chain \(X_1 \to X_2 \to X_3\), the fork \(X_1 \leftarrow X_2 \to X_3\), and the collider \(X_1 \to X_2 \leftarrow X_3\).] The three fundamental three-node motifs. Chains and forks are blocked by conditioning on the middle node. Colliders are blocked by default but opened by conditioning on the middle node or one of its descendants.

A.2.1 Chain

Consider the pattern \(X_1 \to X_2 \to X_3\), where the middle node \(X_2\) lies on a directed pathway from \(X_1\) to \(X_3\).

Example: Exercise, body weight, and blood pressure

Let \(X_1 = \text{exercise}\), \(X_2 = \text{body weight}\), and \(X_3 = \text{blood pressure}\). Exercise may affect blood pressure partly through its effect on body weight. Marginally, exercise and blood pressure are associated; conditioning on body weight blocks this particular pathway. In the isolated three-node DAG, \(X_1 \indep X_3 \mid X_2\).

Once the middle variable is fixed, the chain no longer transmits additional information from \(X_1\) to \(X_3\).

A.2.2 Fork

Consider the pattern \(X_1 \leftarrow X_2 \to X_3\), where the middle node \(X_2\) is a common cause of \(X_1\) and \(X_3\).

Example: Ice cream, season, and drowning

With \(X_1 = \text{ice cream sales}\), \(X_2 = \text{season}\), and \(X_3 = \text{drowning incidents}\), season affects both variables, so \(X_1\) and \(X_3\) are associated even though neither causes the other. Conditioning on season blocks this path: \(X_1 \indep X_3 \mid X_2\).

A fork is the simplest graphical form of confounding: the middle node creates association, and conditioning on it blocks the path.

A.2.3 Collider

Consider the pattern \(X_1 \to X_2 \leftarrow X_3\), where the middle node \(X_2\) is a common effect of \(X_1\) and \(X_3\).

Example: Talent, legacy status, and college admission

Let \(X_1 = \text{academic talent}\), \(X_3 = \text{legacy status}\), and \(X_2 = \text{admission to an elite university}\). Talent and legacy status may be unrelated in the general applicant pool, but among admitted students they can become statistically associated: learning that an admitted student is not a legacy makes unusually high talent more likely, and vice versa. Symbolically, \(X_1 \indep X_3\), but conditioning on \(X_2\) opens the path, so \(X_1\) and \(X_3\) are typically dependent given \(X_2\).

Unlike chains and forks, a collider blocks the path by default. Conditioning on the collider opens the path and may induce association that was not present marginally.

A.2.4 Summary of the Three Motifs

A chain (\(X_1 \to X_2 \to X_3\)) and a fork (\(X_1 \leftarrow X_2 \to X_3\)) are each open by default and blocked by conditioning on the middle node \(X_2\). A collider (\(X_1 \to X_2 \leftarrow X_3\)) is blocked by default and opened by conditioning on the middle node or any of its descendants.
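The behavior of the three motifs can be checked by simulation. The following sketch uses linear Gaussian data with arbitrary coefficients; the association between \(X_1\) and \(X_3\) is summarized by the coefficient on \(X_1\) in a least-squares regression of \(X_3\) on \(X_1\), with and without \(X_2\) as a control. This is an illustrative sketch, not part of the formal development.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def coef(y, x, controls=None):
    """Coefficient on x when regressing y on x (and optional controls)."""
    cols = [np.ones(n), x] + ([] if controls is None else list(controls))
    X = np.column_stack(cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Chain: X1 -> X2 -> X3
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
print("chain    marginal %.3f  given X2 %.3f" % (coef(x3, x1), coef(x3, x1, [x2])))

# Fork: X1 <- X2 -> X3
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
print("fork     marginal %.3f  given X2 %.3f" % (coef(x3, x1), coef(x3, x1, [x2])))

# Collider: X1 -> X2 <- X3
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.8 * x3 + rng.normal(size=n)
print("collider marginal %.3f  given X2 %.3f" % (coef(x3, x1), coef(x3, x1, [x2])))
```

With these coefficients, the chain and the fork show a clearly nonzero marginal coefficient that collapses toward zero once \(X_2\) is controlled for, while the collider shows the opposite pattern.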

Remark: Path-Blocking versus Conditional Independence

The independence statements in the chain, fork, and collider examples above are read in the isolated three-node DAGs as drawn. In a larger DAG, conditioning on the middle node \(X_2\) of a chain or fork blocks that particular path, but \(X_1\) and \(X_3\) are conditionally independent only if every other path between them is also blocked.

Warning: Conditioning Is Not Always Beneficial

Many adjustment mistakes arise from forgetting this asymmetry. Conditioning on a collider can create bias rather than remove it. The informal rule “control for more variables” is unsafe in causal inference; \(d\)-separation teaches a more precise lesson: condition on the right variables, not simply on many variables.

Chapter 2 develops these three motifs into the full \(d\)-separation criterion; see in particular the blocking rules and the extended treatment of collider bias.

A.3 \(d\)-Separation: When Does Conditioning Block a Path?

The three-node motifs explain what happens locally on a path. The next step is to extend this logic to a general DAG, where two variables may be connected by many paths.

Definition: Path

A path between two nodes is a sequence of distinct nodes such that each consecutive pair is connected by an edge, regardless of edge direction.

Definition: Blocked Path

A path is blocked by a conditioning set \(S\) if at least one of the following holds: (1) the path contains a non-collider node (a chain or fork node on that path) that belongs to \(S\), or (2) the path contains a collider such that neither the collider nor any of its descendants belongs to \(S\). A path is open given \(S\) if it is not blocked.

Definition: \(d\)-Separation

Two nodes \(X\) and \(Y\) are \(d\)-separated by a set \(S\) if every path between \(X\) and \(Y\) is blocked by \(S\). More generally, two disjoint sets of nodes \(\mathbf{X}\) and \(\mathbf{Y}\), each disjoint from \(S\), are \(d\)-separated by \(S\) if every \(X \in \mathbf{X}\) is \(d\)-separated from every \(Y \in \mathbf{Y}\) by \(S\).

Remark: \(d\)-Separation versus Conditional Independence

\(d\)-Separation is a graphical condition; conditional independence is a probabilistic one. The Markov property guarantees only one direction: \(d\)-separation in \(\mathcal{G}\) implies the corresponding conditional independence in every distribution that factorizes according to \(\mathcal{G}\). The reverse implication — that \(d\)-connection forces conditional dependence — requires a faithfulness or no-cancellation assumption. We therefore read an open path as “the graph does not force independence,” not as “dependence is guaranteed.” This distinction is taken up in detail in Chapter 2.

A.3.1 A Confounding Example

Consider the DAG with edges \(X \to T\), \(T \to Y\), and \(X \to Y\). Here \(T\) is the treatment, \(Y\) is the outcome, and \(X\) is a pre-treatment covariate that affects both. There are two paths from \(T\) to \(Y\): the directed causal path \(T \to Y\), and the back-door path \(T \leftarrow X \to Y\).

Without conditioning, the back-door path is open, so the observed association between \(T\) and \(Y\) mixes the causal effect with confounding. Conditioning on \(X\) blocks the fork \(T \leftarrow X \to Y\), thereby isolating the causal path. This confounding graph anticipates the back-door criterion of Chapter 3: the graphical condition that makes adjustment valid is precisely that all back-door paths are blocked.
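A minimal simulation makes the point concrete. In the sketch below, the structural coefficients are arbitrary and the true causal effect of \(T\) on \(Y\) is set to \(1.0\); the unadjusted regression coefficient mixes this effect with the back-door association, whereas adjusting for \(X\) recovers it. The linear Gaussian setup is an illustrative assumption, not part of the graph itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# DAG: X -> T, T -> Y, X -> Y, with a true causal effect of T on Y equal to 1.0.
x = rng.normal(size=n)
t = 1.5 * x + rng.normal(size=n)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)

def ols(y, regressors):
    """Least-squares coefficients for y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("unadjusted:", ols(y, [t])[1])        # biased: mixes the causal effect with the back-door path
print("adjusted  :", ols(y, [t, x])[1])     # close to 1.0: the back-door path T <- X -> Y is blocked
```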

A.3.2 A Collider Warning

Now consider the DAG with edges \(T \to C\), \(U \to C\), and \(U \to Y\). Here \(C\) is a collider on the path \(T \to C \leftarrow U \to Y\). Without conditioning on \(C\), the path is blocked at the collider. Conditioning on \(C\) opens the path, creating a spurious association between \(T\) and \(Y\) through \(U\).

Even more subtly, conditioning on a descendant of \(C\) can also open the path: if additionally \(C \to D\), then conditioning on \(D\) may also induce association between \(T\) and \(Y\) through the collider at \(C\).
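The following sketch illustrates both warnings under an assumed linear Gaussian model in which \(T\) has no causal effect on \(Y\): the unadjusted coefficient is essentially zero, while adjusting for the collider \(C\), or for its descendant \(D\), induces a spurious nonzero coefficient. All coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# DAG: T -> C <- U -> Y, plus C -> D.  T has no causal effect on Y.
t = rng.normal(size=n)
u = rng.normal(size=n)
c = t + u + rng.normal(size=n)
d = c + rng.normal(size=n)
y = u + rng.normal(size=n)

def coef_on_t(controls):
    """Coefficient on t when regressing y on t plus the given controls."""
    X = np.column_stack([np.ones(n), t] + controls)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print("no adjustment:", coef_on_t([]))     # ~0: the path is blocked at the collider C
print("adjust for C :", coef_on_t([c]))    # nonzero: conditioning on C opens T -> C <- U -> Y
print("adjust for D :", coef_on_t([d]))    # nonzero: a descendant of C also opens the path
```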

A.3.3 A Practical Checklist

To decide whether \(X\) and \(Y\) are \(d\)-separated by \(S\), proceed as follows. First, list all paths between \(X\) and \(Y\). On each path, classify each interior node as part of a chain, fork, or collider. Then check whether each path is blocked by \(S\). Finally, conclude that \(X\) and \(Y\) are \(d\)-separated if and only if every path is blocked.
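For small graphs this checklist can be mechanized directly. The sketch below is a brute-force implementation written for this appendix: the DAG is encoded as a dictionary mapping each node to its list of parents (an arbitrary encoding chosen here), all undirected simple paths are enumerated, and each path is tested against the blocking rules. It is meant as a learning aid, not an efficient algorithm.

```python
def descendants(dag, node):
    """All descendants of `node`; the DAG is encoded as {node: [parents]}."""
    children = {v: [c for c, ps in dag.items() if v in ps] for v in dag}
    out, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def all_paths(dag, x, y):
    """All simple paths between x and y, ignoring edge direction."""
    nbrs = {v: set(ps) | {c for c, ps in dag.items() if v in ps}
            for v, ps in dag.items()}
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        for nxt in nbrs[path[-1]] - set(path):
            if nxt == y:
                paths.append(path + [nxt])
            else:
                stack.append(path + [nxt])
    return paths

def path_blocked(dag, path, S):
    """Apply the blocking rules to each interior node of one path."""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        is_collider = prev in dag[node] and nxt in dag[node]
        if is_collider:
            if node not in S and not (descendants(dag, node) & S):
                return True          # collider with nothing conditioned: blocks
        elif node in S:
            return True              # chain or fork node conditioned on: blocks
    return False

def d_separated(dag, x, y, S):
    """x and y are d-separated by S iff every path between them is blocked."""
    S = set(S)
    return all(path_blocked(dag, p, S) for p in all_paths(dag, x, y))

# The collider graph of Section A.3.2: T -> C <- U -> Y, with C -> D.
dag = {"T": [], "U": [], "C": ["T", "U"], "D": ["C"], "Y": ["U"]}
print(d_separated(dag, "T", "Y", set()))    # True : the path is blocked at the collider C
print(d_separated(dag, "T", "Y", {"C"}))    # False: conditioning on C opens the path
print(d_separated(dag, "T", "Y", {"D"}))    # False: a descendant of C also opens it
```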

A.4 DAG Factorization and the Markov Property

Up to this point, we have used DAGs qualitatively, to decide which paths are open or blocked. We now connect the graph to probability algebra.

Definition: DAG Factorization

Let \(\mathcal{G}\) be a DAG with nodes \(V_1, \ldots, V_p\). A joint distribution \(p(v_1, \ldots, v_p)\) is said to factorize according to \(\mathcal{G}\) if \[p(v_1, \ldots, v_p) = \prod_{j=1}^{p} p\!\left(v_j \mid \mathrm{Pa}(V_j)\right),\] where \(\mathrm{Pa}(V_j)\) denotes the set of parents of \(V_j\) in \(\mathcal{G}\).

This factorization implies the local Markov property: once its parents are known, a node is conditionally independent of all variables that are neither its descendants nor its parents.

Example: A simple confounding graph

For the DAG with edges \(X \to T\), \(T \to Y\), and \(X \to Y\), the joint density factorizes as \(p(x, t, y) = p(x)\,p(t \mid x)\,p(y \mid t, x)\).
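The factorization also prescribes how to simulate from the model: draw each variable from its conditional distribution given its parents, in a topological order. A minimal sketch for this graph follows; the particular Bernoulli conditionals are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    """Ancestral sampling for the DAG X -> T, T -> Y, X -> Y:
    each node is drawn from its conditional given its parents,
    mirroring the factorization p(x) p(t | x) p(y | t, x)."""
    x = rng.binomial(1, 0.4, size=n)                 # draw from p(x)
    t = rng.binomial(1, np.where(x == 1, 0.7, 0.2))  # draw from p(t | x)
    y = rng.binomial(1, 0.1 + 0.4 * t + 0.3 * x)     # draw from p(y | t, x)
    return x, t, y

x, t, y = sample(100_000)
print(y[t == 1].mean(), y[t == 0].mean())  # an observed contrast, not a causal one
```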

Definition: Local Markov Property

Let \(\mathrm{Nd}(V_i) = V \setminus \bigl(\{V_i\} \cup \mathrm{De}(V_i)\bigr)\) denote the set of non-descendants of \(V_i\). A distribution \(P\) satisfies the local Markov property with respect to \(\mathcal{G}\) if, for every node \(V_i \in V\), \[V_i \indep \bigl(\mathrm{Nd}(V_i) \setminus \mathrm{Pa}(V_i)\bigr) \mid \mathrm{Pa}(V_i).\]

Remark: Global Markov Property

The global Markov property is the statement that every \(d\)-separation in \(\mathcal{G}\) implies a conditional independence in \(P\). This is studied in detail in Chapter 2.

Remark: Equivalence for DAGs

For DAGs, under the existence of regular conditional distributions, the recursive factorization, the local Markov property, and the global Markov property are all equivalent. The local property is the version most commonly verified in practice, since the factorization makes it immediate.

Example: Education–earnings graph

Consider the DAG with edges \(N \to E\), \(B \to E\), \(E \to Y\), and \(B \to Y\) (\(N\) = neighborhood, \(B\) = family background, \(E\) = education, \(Y\) = earnings). The corresponding factorization is \(p(n,b,e,y) = p(n)\,p(b)\,p(e \mid n,b)\,p(y \mid e,b)\). The parent set of \(Y\) is \(\mathrm{Pa}(Y) = \{E, B\}\), so the local Markov property gives \(Y \indep N \mid \{E, B\}\). The factorization also implies \(N \indep B\): the two nodes share no common ancestor, and every path between them passes through a collider (at \(E\) or at \(Y\)). This marginal independence is itself a substantive modeling assumption.
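To connect this example with the factorization criterion of Section A.1, one can verify \(Y \indep N \mid \{E, B\}\) directly. Dividing the joint density by \(p(e, b) = p(b)\,p(e \mid b)\) gives \[p(n, y \mid e, b) = \frac{p(n)\,p(b)\,p(e \mid n, b)\,p(y \mid e, b)}{p(b)\,p(e \mid b)} = \underbrace{\frac{p(n)\,p(e \mid n, b)}{p(e \mid b)}}_{a(n,\,e,\,b)}\;\underbrace{p(y \mid e, b)}_{b(y,\,e,\,b)},\] which splits into a factor in \((n, e, b)\) and a factor in \((y, e, b)\); by the factorization characterization of conditional independence, \(Y \indep N \mid \{E, B\}\).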

Remark: Edges as Scientific Claims

In a causal DAG, an arrow is typically drawn only when a direct dependence-generating relation is believed to be present. Every arrow represents a substantive scientific claim. Minimality and faithfulness are discussed in Chapter 2.

Optional: Moralization as an Alternative Criterion

Note to Reader

This section may be skipped on a first reading.

There is an alternative graph-theoretic way to check \(d\)-separation based on constructing an undirected graph called the moral graph. To check whether \(X \indep Y \mid S\): First, take the induced subgraph on \(\mathrm{An}(X \cup Y \cup S)\), the ancestral set of all variables under consideration (the nodes themselves together with all of their ancestors). Second, connect any two parents of a common child by an undirected edge. Third, drop all arrow directions. Finally, check whether \(S\) separates \(X\) and \(Y\) in the resulting undirected graph. This procedure yields a criterion equivalent to \(d\)-separation; a small code sketch appears after the example below.

Example: Moral graph of a collider

Consider the DAG \(A \to C \leftarrow B\). For the conditional query \(A \indep B \mid C\), the ancestral set is \(\{A, B, C\}\). Moralization connects the two parents \(A\) and \(B\), and after deleting the conditioned node \(C\) the edge \(A - B\) remains; therefore \(A\) and \(B\) are not separated, matching the fact that conditioning on a collider opens the path. By contrast, for the marginal query \(A \indep B\), the ancestral set is \(\{A, B\}\), so \(C\) is discarded before moralization and no moral edge is added; \(A\) and \(B\) are separated, as expected.
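The moralization procedure is also straightforward to mechanize. The sketch below assumes the networkx library is available (its DiGraph, ancestors, and has_path utilities are used); the function name moral_separated and the step comments are choices made for this appendix, not a library routine.

```python
import itertools
import networkx as nx  # assumed available

def moral_separated(G, x, y, S):
    """Check X independent of Y given S via the moralization criterion on a DiGraph G."""
    S = set(S)
    # 1. Restrict to the ancestral set of {x, y} together with S.
    keep = {x, y} | S
    for v in list(keep):
        keep |= nx.ancestors(G, v)
    A = G.subgraph(keep).copy()
    # 2.-3. "Marry" the parents of every common child and drop directions.
    M = A.to_undirected()
    for v in A.nodes:
        for p1, p2 in itertools.combinations(A.predecessors(v), 2):
            M.add_edge(p1, p2)
    # 4. Remove S and check whether x and y are still connected.
    M.remove_nodes_from(S)
    return not nx.has_path(M, x, y)

# The collider example: A -> C <- B.
G = nx.DiGraph([("A", "C"), ("B", "C")])
print(moral_separated(G, "A", "B", set()))   # True : A and B are separated marginally
print(moral_separated(G, "A", "B", {"C"}))   # False: conditioning on the collider opens the path
```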

Summary

This appendix introduced the graphical ideas that support Chapters 2 and 3. Conditional independence is the probabilistic language that graphs are designed to encode. The three local motifs — chain, fork, and collider — determine how conditioning affects association along a path. \(d\)-Separation extends these local rules to arbitrary graphs: two variables are \(d\)-separated by \(S\) if every path between them is blocked by \(S\). The most important application in causal inference is to distinguish confounding paths from causal paths and to identify valid adjustment sets. Finally, the Markov property gives a probabilistic interpretation to the graph by linking graphical structure to a factorization of the joint distribution.

Dawid, A. Philip. 1979. “Conditional Independence in Statistical Theory.” Journal of the Royal Statistical Society, Series B 41 (1): 1–31.
Dawid, A. Philip. 1980. “Conditional Independence for Statistical Operations.” Annals of Statistics 8 (3): 598–617.