Representing Independences

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\can}[1]{\overline{\text{an}}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\perp_d} \newcommand{\msep}{\perp_m} $

Bayesian networks can represent a probability distribution compactly by exploiting a set of Markov conditions in a digraph. This set of Markov conditions implies (and is implied by) other independences. In this lecture we will see how graphs can be used to derive a sound and nearly complete representation system for independences. The key to achieving this is the concept of d-separation. In fact, a major contribution of the theory of Bayesian networks was to bring forth a representational system that associates (in)dependences with graph (dis)connectedness.

The Logic of Independence

Recall that a set of variables $\mathcal{X}$ is conditionally independent of another set of variables $\mathcal{Y}$ given a set of variables $\mathcal{Z}$ if their conditional joint distribution factorizes as $$ P(\mathcal{X},\mathcal{Y} \mid \mathcal{Z}) = P(\mathcal{X} \mid \mathcal{Z}) P(\mathcal{Y} \mid \mathcal{Z}) \, . $$ We denote such a relation as $$ \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \, . $$ Independence satisfies a number of useful properties: $$ \begin{align*} & \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{Y} \ind \mathcal{X} \mid \mathcal{Z} \, , && \text{[symmetry]}\\ & \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \,, && \text{[decomposition]} \\ & \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \cup \mathcal{W} \,, && \text{[weak union]} \\ & \mathcal{X} \ind \mathcal{W} \mid \mathcal{Y} \cup \mathcal{Z} \text{ and } \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \,. && \text{[contraction]} \end{align*} $$ The above properties hold for any probability distribution and are called the semi-graphoid axioms (for technical reasons we assume that any variable is conditionally independent of the empty set). When the probability distribution is positive, an additional property also holds:1 $$ \begin{align*} & \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \cup \mathcal{W} \text{ and } \mathcal{X} \ind \mathcal{W} \mid \mathcal{Z} \cup \mathcal{Y} \implies \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \, .&& \text{[intersection]} \end{align*} $$ A semi-graphoid satisfying intersection is a graphoid.
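To make the factorization condition concrete, here is a minimal sketch of a numerical test of conditional independence for a discrete joint distribution stored as a NumPy array. The function `conditionally_independent` and its interface are ours, not a library routine; the example at the end uses two independent fair coins $X$, $Y$ and their exclusive-or $Z$, so that $X \ind Y$ holds but $X \ind Y \mid Z$ fails.

```python
import numpy as np

def conditionally_independent(joint, X, Y, Z, tol=1e-9):
    """Test X ⫫ Y | Z for a discrete joint distribution.

    `joint` is a NumPy array whose axes index the variables; X, Y, Z are
    disjoint tuples of axis indices.  The test checks that
    P(X,Y,Z) * P(Z) == P(X,Z) * P(Y,Z) for every configuration, which is
    equivalent to P(X,Y | Z) = P(X | Z) P(Y | Z) wherever P(Z) > 0.
    """
    others = tuple(a for a in range(joint.ndim) if a not in X + Y + Z)
    p_xyz = joint.sum(axis=others, keepdims=True)  # P(X, Y, Z)
    p_xz = p_xyz.sum(axis=Y, keepdims=True)        # P(X, Z)
    p_yz = p_xyz.sum(axis=X, keepdims=True)        # P(Y, Z)
    p_z = p_xz.sum(axis=X, keepdims=True)          # P(Z)
    return np.allclose(p_xyz * p_z, p_xz * p_yz, atol=tol)

# Example: X and Y are independent fair coins and Z = X xor Y.
P = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        P[x, y, x ^ y] = 0.25
print(conditionally_independent(P, X=(0,), Y=(1,), Z=()))    # True:  X ⫫ Y
print(conditionally_independent(P, X=(0,), Y=(1,), Z=(2,)))  # False: X ⊥̸ Y | Z
```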

Markov properties

Decomposition is what allowed us to derive the factorization property of Bayesian networks from the (local) Markov property by means of the ordered Markov property: $$ X \ind \{Y: Y < X\} \mid \pa{X} $$ for any topological ordering $<$. Decomposition also allows us to derive the pairwise Markov property: $$ X \ind Y \mid \pa{X} \quad \text{for all } Y \in \nd{X} - \pa{X} \, . $$ The converse direction (which would require the composition property), however, does not hold in general. Thus, the pairwise Markov property does not suffice to establish the local Markov property (unless the distribution is positive, in which case intersection allows us to prove equivalence). The weak union property allows us to derive the following fact: $$ X \ind \nd{X} - \mathcal{S} \mid \pa{X} \cup \mathcal{S} $$ for any $\mathcal{S} \subseteq \nd{X}$. Using this fact we can show that the ordered Markov property implies the local Markov property: for each variable $X$, select a topological ordering in which $X$ comes after all of its nondescendants.

Incompleteness

Pearl wrongly conjectured that the semi-graphoid axioms formed a complete system for the independence relation, in the sense that any other property could be derived from them. However, Studený showed the existence of a property that independence satisfies but that cannot be derived from the semi-graphoid properties.2 Later, he proved that there is no finite characterization of probabilistic independence, which implies that semi-graphoids cannot be used as an axiomatization of the independence relation.3 Still, the semi-graphoid properties remain useful for understanding the concept of d-separation, which we study next.

Dependence graph

By the semi-graphoid axioms, many independences can be concluded from a set of Markov properties. However, working with these axioms directly is cumbersome. We now investigate a graphical aid for reasoning with the semi-graphoid rules. We take an alternative path and consider the representation of a dependence relation $\not\ind$, rather than an independence relation.

Before formally defining the semantics of directed graphs, let us motivate their definition by discussing some simple scenarios that a (directed) graphical representation of (in)dependence should encode. Recall that our goal is to represent (in)dependences using an acyclic directed graph. We say that $X$ depends on $Y$ if $X$ and $Y$ are probabilistically dependent (note the symmetry). We associate an arc with a member of a dependence relation, so that the graph below represents $A \not\ind B$.

A graph whose arcs encode dependences is called a dependence graph. Since dependence is symmetric, the direction of the arc is arbitrary; we try to resolve the ambiguity by considering multiple (in)dependence relations.

Chain reasoning

The first scenario is chain reasoning, when the dependence of a random variable $C$ on $A$ is mediated by a random variable $B$: $C$ depends on $B$, $B$ depends on $A$, $C$ depends on $A$, and $A$ and $C$ are (possibly) conditionally independent given $B$. The random variable $B$ is called an intermediary or mediating variable. Graphically, this is captured by a serial connection:

Serial connection

We have already encountered this type of knowledge, and referred to it as an indirect influence. For example, in the digraph below of the Asia Bayesian network, the dependence of $D$yspnea on $S$moking is mediated by $B$ronchitis.

The Asia Bayesian network structure

Note that the influence or dependence of $D$ on $S$ is not removed once we are given complete knowledge about $B$, due to the presence of another trail of serial connections, $S \rightarrow L \rightarrow E \rightarrow D$.

We say that a serial connection $A \rightarrow B \rightarrow C$ is unblocked when $B$ is not observed and blocked otherwise, and that $B$ blocks the trail from $A$ to $C$.
This allows us to associate indirect dependence relations with graph connectedness and conditional independence relations with graph separation. That is, any two variables that are connected by a directed path are unconditionally dependent. Similarly, any two variables that are connected only through serial connections are conditionally independent given mediating variables that block those paths. It follows from this reasoning that any variable unconditionally depends on its ancestors and, by symmetry, on its descendants. For instance, in the Asia network $E$ depends on $A$, $S$, $X$ and $D$ (and obviously on $T$ and $L$). Also, $D$ and $S$ are conditionally independent given $B$ and $E$. Note that an arc can never be blocked under these semantics, which formalizes the distinction between a direct and an indirect dependence or influence.

Common cause

There is another scenario where a variable $B$ renders two unconditionally dependent variables $A$ and $C$ independent if observed, but not by mediation: when $B$ is a common source of $A$ and $C$. Graphically, this is captured by a so-called divergent connection:

Divergent connection

In the Asia network, $L$ung cancer and $B$ronchitis are possible unrelated consequences of $S$moking; because of this, exhibiting one increases the chance of exhibiting the other when $S$ is unknown. However, once $S$ is given, this dependence disappears. Another example is the relation between $S$hoe size and $V$ocabulary in schoolchildren.4 Arguably, any correlation between $S$ and $V$ is spurious and caused by their common dependence on age:

Example of spurious correlation caused by a common source
The trail $A \leftarrow B \rightarrow C$ is blocked by $B$ and unblocked otherwise.
As a consequence, any variable depends on any of its non-descendants with which it shares a common ancestor. Also, two variables connected only by means of divergent connections are conditionally independent given their common ancestors. For example, $E$ depends on $B$ in the Asia network (when nothing is observed), and $E$ and $B$ are conditionally independent given $S$, their common ancestor.

Intercausal reasoning

The cases discussed so far suggest that dependence can be captured by undirected connections (i.e., by an undirected graph), as they relied on blocking and unblocking of trails. However, directing the arcs plays a fundamental role in representing a common scenario known as explaining away, which occurs when $A$ and $C$ are unrelated sources of a common effect $B$. Then $A$ and $C$ are unconditionally independent ($A \ind C$) but conditionally dependent given $B$ ($A \not\ind C \mid B$). We represent this type of reasoning as a convergent connection:

Convergent connection

To convince yourself why the causes must be conditionally dependent, consider the extreme case where $B$ is a logical function of $A$ and $C$. Recall that in the Asia network the variable $E$ represents the disjunction of $T$uberculosis and $L$ung cancer: $$ E = T \text{ or } L \, . $$ Then, if we know that $E=1$, any evidence in favor of $T$ should decrease our belief in $L$ (as the probability of both independent events occurring is lower than the probability of only one of them occurring). Similarly, an increase in the probability of $L$ should correspond to a decrease in the probability of $T$.
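A small numerical sketch of this effect is given below, using hypothetical priors for $T$ and $L$ (the Asia network's actual parameters are not given in the text): observing $T=1$ in addition to $E=1$ brings the probability of $L=1$ back down to its prior.

```python
import numpy as np

# Hypothetical priors for T(uberculosis) and L(ung cancer), for illustration only.
p_t, p_l = 0.05, 0.10

# Joint P(T, L, E) with E = T or L and T, L independent a priori.
joint = np.zeros((2, 2, 2))
for t in (0, 1):
    for l in (0, 1):
        e = int(t or l)
        joint[t, l, e] = (p_t if t else 1 - p_t) * (p_l if l else 1 - p_l)

# P(L=1 | E=1) versus P(L=1 | E=1, T=1): observing T "explains away" L.
p_e1 = joint[:, :, 1].sum()
p_l1_given_e1 = joint[:, 1, 1].sum() / p_e1
p_l1_given_e1_t1 = joint[1, 1, 1] / joint[1, :, 1].sum()
print(p_l1_given_e1, p_l1_given_e1_t1)  # the second value is smaller (it equals the prior 0.10)
```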

Intercausal reasoning is perhaps best illustrated by the phenomenon known as collider bias, sampling bias or selection bias. We have seen one possible instance of this phenomenon when we discussed the relation between smoking and the severity of COVID-19. While the problem there was made more complex by the introduction of occupation status, a simpler description of the alternative explanation of the correlation is given by the graph below:

Example of a phenomenon modeled as a convergent connection.

As the reasoning goes, it is possible that $S$moking and $D$isease severity are unrelated in the general population, while being related in the specific sample used in the study, which consisted of patients who tested positive for the disease at a large French hospital. In other words, it might be that $S$ and $D$ are unconditionally independent, and conditionally dependent given $T$. Yet another example can help illustrate the point.

Suppose that a company selects applicants for a job offer by two different tests: a written exam that evaluates logical thinking and a group dynamics session that evaluates social skills. Each exam awards each participant a score on a 1-to-5 scale; the top 20% of participants by the sum of the two scores are selected for an ensuing phase. The data in the plot below was generated by randomly and independently sampling 30 values for each of the two scores from a Gaussian distribution with mean 2.5 and variance 1.0. The top 6 pairs of scores by their sum were then considered approved.
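A minimal simulation of this data-generating process is sketched below (the random seed and exact values will differ from the plot that follows):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
logic = rng.normal(2.5, 1.0, n)    # written exam score (logical thinking)
social = rng.normal(2.5, 1.0, n)   # group dynamics score (social skills)

# Select the top 20% (6 of 30) applicants by the sum of the two scores.
approved = np.argsort(logic + social)[-6:]

print(np.corrcoef(logic, social)[0, 1])                      # near zero: the scores are independent
print(np.corrcoef(logic[approved], social[approved])[0, 1])  # typically strongly negative
```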

Artificial example of data generated with sampling bias (approval), which introduces a spurious correlation between two unrelated quantities.

As one can see, while the overall data shows no clear relation, the data of approved applicants shows a strong negative correlation between scores of the two exams. This is represented by the graphical structure below.

Example of spurious correlation caused by collider bias.
The trail $A \rightarrow B \leftarrow C$ is unblocked given $B$ and blocked otherwise.
Note the difference from the other connections: this type of reasoning can only be captured by taking the direction of the arcs into account. In the Asia network, we have that $A$ and $S$ are unconditionally independent, and are conditionally dependent given $D$.

D-Separation

We now formalize the general definition of blocked and unblocked trails in directed graphs.

We say that a trail from $X$ to $Y$ is blocked by a set of variables $\mathcal{Z}$ if one of the following is true.

  • $X$ or $Y$ is in $\mathcal{Z}$, or
  • there is a serial or divergent connection $A, B, C$ in the trail whose middle node $B$ is in $\mathcal{Z}$ (i.e., the connection is blocked by $B$), or
  • there is a convergent connection $A \rightarrow B \leftarrow C$ in the trail such that $(\{B\} \cup \de{B}) \cap \mathcal{Z} = \emptyset$ (i.e., neither $B$ nor any of its descendants is in $\mathcal{Z}$).

If the trail is not blocked we say that it is unblocked.

Note that a convergent connection can be unblocked by nodes outside the trail (namely, by the descendants of its middle node). For example, in the Asia network, the trail $A,T,E,L,S$ is blocked by $\mathcal{Z}=\{L\}$ and unblocked by $\mathcal{Z}=\{D\}$.

The concepts of blocked and unblocked trails lead us to the definition of d-separation.

Two sets of variables $\mathcal{X}$ and $\mathcal{Y}$ are d-separated by a set of variables $\mathcal{Z}$ if every trail from a variable $X \in \mathcal{X}$ to a variable $Y \in \mathcal{Y}$ is blocked by $\mathcal{Z}$. Otherwise, $\mathcal{X}$ and $\mathcal{Y}$ are d-connected.

We denote that $\mathcal{X}$ and $\mathcal{Y}$ are d-separated by $\mathcal{Z}$ as $$ \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \, . $$
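These definitions can be checked mechanically. The sketch below builds the Asia structure with networkx and verifies some of the (d-)separation statements made above; it assumes a networkx version that provides `d_separated` (available since 2.8, renamed `is_d_separator` in later releases).

```python
import networkx as nx

# Asia network structure as depicted above.
asia = nx.DiGraph([("A", "T"), ("S", "L"), ("S", "B"),
                   ("T", "E"), ("L", "E"), ("E", "X"),
                   ("E", "D"), ("B", "D")])

print(nx.d_separated(asia, {"D"}, {"S"}, {"B", "E"}))  # True:  D is d-separated from S by {B, E}
print(nx.d_separated(asia, {"A"}, {"S"}, set()))       # True:  A is d-separated from S marginally
print(nx.d_separated(asia, {"A"}, {"S"}, {"D"}))       # False: observing D d-connects A and S
```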

Markov property

D-separation abides by the Markov Property of directed graphs:

Any node is d-separated from its nondescendants by its parents.

A nondescendant that is a parent of $X$ belongs to $\pa{X}$ and is thus d-separated from $X$ by the first condition in the definition of blocking. So consider a trail from a nondescendant, non-parent node $N$ to $X$. Suppose first that the trail reaches $X$ through a parent $U \in \pa{X}$. Then it contains either a serial connection $W \rightarrow U \rightarrow X$ or a divergent connection $W \leftarrow U \rightarrow X$, both of which are blocked by $\pa{X}$ (a convergent connection at $U$ is impossible, since the last arc points into $X$). Suppose instead that the trail reaches $X$ through a child, so that it contains some $Y \in \de{X}$. Following the trail from $X$ towards $N$, the direction of the arcs must at some point revert (otherwise $N$ would be a descendant of $X$); at the first reversal the trail contains a convergent connection $Z_1 \rightarrow B \leftarrow Z_2$ with $B \in \de{X}$. Neither $B$ nor any of its descendants can be a parent of $X$, as that would create a directed cycle; hence this convergent connection is blocked by $\pa{X}$. In either case the trail is blocked, which completes the proof.

Thus, d-separation is consistent with the local Markov properties. That is, if we assume that d-separation implies independence then we also assume the local Markov properties in the digraph.

Soundness and completeness

Ideally, we would like to equate d-separation (resp., d-connection) and independence (resp., dependence). That is, we would like to satisfy the following properties:

  • Soundness: $\mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$;

  • Completeness: $\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z}$.

The first property indeed holds:

D-separation is sound for any joint distribution satisfying the local Markov property (viz. $X \ind \nd{X} \mid \pa{X}$).

The soundness condition of d-separation is known as the global Markov property. By the previous proposition (any node is d-separated from its nondescendants by its parents), the global Markov property implies the local Markov property; requiring a distribution to satisfy every d-separation in the structure might thus seem more stringent, but the result above states that the two definitions are equivalent. The proof is obtained by repeated application of the semi-graphoid axioms. Note that the result could also be stated for distributions that factorize according to the structure (which implies that they satisfy the local Markov property).

The second property, however, does not hold for all distributions. In fact, the incompleteness of d-separation is unavoidable, as we show below. To see that d-separation is incomplete for Bayesian networks, consider the following simple model with graph $A \rightarrow B$, $P(A=0)=P(A=1)=1/2$ and $P(B \mid A)$ given below.

|       | $B=0$ | $B=1$ |
|-------|-------|-------|
| $A=0$ | 0.4   | 0.6   |
| $A=1$ | 0.4   | 0.6   |

Clearly, $A$ and $B$ are independent with respect to $P(A,B)=P(A)P(B|A)$. However, the graph $A \rightarrow B$ implies that $A$ and $B$ are d-connected. The following example is a more intricate scenario where d-separation fails to imply independence.

Consider a Bayesian network with root nodes $A$, $B$ and $C$, which are all parents of node $D$. Assume that all variables are binary, and that $D$ is deterministically determined by its parents, such that $D=1$ if either $A=0$ and $B=1$, or $A=1$ and $C=1$. We can write that as a logical expression:

$$ (D=1) \Leftrightarrow (A=0 \wedge B=1) \vee (A=1 \wedge C=1) \, . $$

That is, $A$ acts as the selector of a multiplexer $D$ over the values of $B$ and $C$. Accordingly, we have that $$ B \not\ind C \mid D \, . $$ In fact, if $D=1$ then knowledge that $B=0$ leads to $C=1$ with probability one. However (the reader is encouraged to work out the formulas and show that): $$ B \ind C \mid D, A = 0 \quad \text{and} \quad B \ind C \mid D, A = 1 \, . $$ By definition of conditional independence, we thus have that $$ B \ind C \mid D, A \, , $$ even though $B$ and $C$ are not d-separated by $D$ and $A$. Note that determinism is not required for the counter-example: we could instead specify the conditional distribution $P(D|A,B,C)$ in any way that satisfies the above independences while not implying conditional independence between $B$ and $C$ given $D$.
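We can verify these statements numerically. The sketch below assumes, for illustration, uniform priors on the roots $A$, $B$ and $C$ (the text leaves them unspecified) and reuses the `conditionally_independent` helper sketched earlier.

```python
import numpy as np

# Joint over (A, B, C, D) with hypothetical uniform priors on the roots;
# only P(D | A, B, C), the multiplexer, is fixed by the text.
P = np.zeros((2, 2, 2, 2))  # axes: A, B, C, D
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            d = int((a == 0 and b == 1) or (a == 1 and c == 1))
            P[a, b, c, d] = 1 / 8

# B ⫫ C | D, A  but  B ⊥̸ C | D  (using the checker defined earlier).
print(conditionally_independent(P, X=(1,), Y=(2,), Z=(3, 0)))  # True
print(conditionally_independent(P, X=(1,), Y=(2,), Z=(3,)))    # False
```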

The previous examples required a “hidden” independence among the parents of a variable. The following example shows that incompleteness can also arise between more distant variables.

Consider the following Bayesian network with a Gaussian distributed root variable $A$ and deterministic variables $B$, $C$ and $D$.

Example of incompleteness of d-separation.

For any assignment $A=a$, the induced conditional distribution $P(D|A=a)$ is a degenerate distribution (i.e., it places all mass on the single joint value satisfying the corresponding linear equalities) assigning probability one to $D = 6a + 2(-3a) = 0$. Hence, $P(D|A=a)=P(D|A=a')$ for any $a,a'$. This shows that $D$ and $A$ are independent even though they are d-connected.

Moreover, incompleteness of d-separation is not rare:

For any acyclic directed graph $G$ with at least one arc there is a Bayesian network with structure $G$ and variables $X$ and $Y$ such that $X \ind Y$ and $X \not\dsep Y$.

The proof is based on the example above. Let $Y \in \pa{X}$, and specify $p$ such that $p(X \mid \pa{X})=p(X \mid \pa{X}-\{Y\})$, that is, the local model of $X$ ignores the value of $Y$; the remaining local models can be chosen so that the other parents of $X$ carry no information about $Y$ (e.g., uniform distributions). This distribution satisfies all the Markov properties in $G$, and has $X \ind Y$ (but $X \not\dsep Y$).

The above result shows that d-separation cannot be complete in general. The following result shows that d-separation is “as complete as possible”.

If $G$ is an acyclic directed graph and $X$ and $Y$ are d-connected by $\mathcal{Z}$ in $G$, then there is a Bayesian network $(G,p)$ such that $X$ and $Y$ are dependent given $\mathcal{Z}$.

Since $X$ and $Y$ are d-connected by $\mathcal{Z}$, there is a trail $X,X_1,\dotsc,X_m,Y$ that is unblocked given $\mathcal{Z}$ in $G$. We construct a joint distribution by specifying the conditional distributions (local models) of every variable along this trail so as to ensure that consecutive variables are dependent. For each convergent connection $A \rightarrow B \leftarrow C$ in the trail, we also set the distributions of the descendants of $B$ so as to make each of them dependent on its parents that are descendants of a variable in the trail. We finally set the distributions of the remaining variables to uniforms (this prevents cancellation of the dependences specified thus far).

The above result shows that for every graph there is a distribution that makes d-separation complete for the corresponding Bayesian network. Hence, any other sound and complete representation system would have to coincide with d-separation on these networks; in particular, any system based exclusively on the graph structure would have to coincide with d-separation for all distributions, lest it be unsound or incomplete. The take-home message is that the structure of a Bayesian network should be interpreted as stating that the suggested dependences hold for some parametrization of the model, while the suggested independences hold for every parametrization.

D-separation and semi-graphoids

We have seen that independence satisfies a number of basic properties known as the semi-graphoid axioms. Moreover, independence does not generally satisfy composition and intersection. It can be shown that d-separation satisfies all the semi-graphoid axioms, and also satisfies composition and intersection: $$ \begin{align*} & \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \text{ and } \mathcal{X} \dsep \mathcal{W} \mid \mathcal{Z} \implies \mathcal{X} \dsep \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \, , && \text{[composition]}\\ & \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \cup \mathcal{W} \text{ and } \mathcal{X} \dsep \mathcal{W} \mid \mathcal{Z} \cup \mathcal{Y} \implies \mathcal{X} \dsep \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \, .&& \text{[intersection]} \end{align*} $$ Since not all distributions satisfy composition and intersection, there are distributions whose set of independences cannot be fully characterized by a Bayesian network. That is, any Bayesian network representation of such a distribution either encodes independences that do not hold (it is unsound) or fails to represent some independences (it is incomplete).

Consider four random variables that satisfy the following (in)dependences: $$ \begin{align*} &A \not\ind B \,, && A \not\ind C \,, \\ &B \not\ind D \,, && C \not\ind D \,, \\ &A \ind D \mid B,C \,, && B \ind C \mid A,D \, . \end{align*} $$ We will see later that such a distribution exists. By the first four dependence statements, we must have arcs between $A$ and $B$, between $A$ and $C$, between $B$ and $D$ and between $C$ and $D$. The independence statements imply that any trail between $A$ and $D$ must be blocked by $\{B,C\}$, and that any trail between $B$ and $C$ must be blocked by $\{A,D\}$. One can show (e.g., by listing all structures, as in the sketch below) that no structure satisfies all of these requirements.
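Here is a brute-force sketch of that listing: it enumerates every acyclic orientation over the four variables, checking the dependences via d-connection (a weaker requirement than adjacency) and the two independences via d-separation, again relying on networkx's `d_separated` (an assumption on the available version).

```python
import itertools
import networkx as nx

nodes = ["A", "B", "C", "D"]
pairs = list(itertools.combinations(nodes, 2))  # the 6 unordered pairs

def satisfies(G):
    """Check the six (in)dependence requirements via d-separation."""
    dsep = lambda x, y, z: nx.d_separated(G, {x}, {y}, set(z))
    dependences = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
    return (all(not dsep(x, y, "") for x, y in dependences)
            and dsep("A", "D", "BC") and dsep("B", "C", "AD"))

# Each pair of nodes is either unconnected or joined by an arc in one of two directions.
candidates = []
for orientation in itertools.product((0, 1, 2), repeat=len(pairs)):
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    for (u, v), o in zip(pairs, orientation):
        if o == 1:
            G.add_edge(u, v)
        elif o == 2:
            G.add_edge(v, u)
    if nx.is_directed_acyclic_graph(G) and satisfies(G):
        candidates.append(G)

print(len(candidates))  # expected: 0 -- no structure satisfies all six requirements
```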

Markov Blanket

We are often interested in a minimal set of variables that, when observed, renders a given variable independent of all the remaining variables. For example, we might be interested in predicting the value of a given variable given the values of all others; discarding the irrelevant variables then makes the inference more efficient.

The Markov blanket of a variable $X$ is a set $\mathcal{B}$ such that $$ X \ind \mathcal{V}-\mathcal{B}-\{X\} \mid \mathcal{B} \, . $$

We have the following result.

The set formed by the parents, the children and the parents of the children of a variable $X$ is a Markov blanket of $X$. If the distribution is positive, this set is minimal.
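This set can be read directly off the graph. Below is a minimal sketch (the helper `markov_blanket` is ours) that reuses the `asia` graph built in the d-separation example.

```python
import networkx as nx

def markov_blanket(G, X):
    """Parents, children, and parents of children (co-parents) of X in the DAG G."""
    parents = set(G.predecessors(X))
    children = set(G.successors(X))
    coparents = {p for c in children for p in G.predecessors(c)}
    return (parents | children | coparents) - {X}

# Reusing the `asia` graph from the d-separation sketch above:
print(markov_blanket(asia, "E"))  # parents T, L; children X, D; co-parent B
```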

M-separation

Another graphical criterion for verifying independence is based on undirected graphs.

The moral graph of an acyclic directed graph $G$ is the undirected graph obtained by connecting nodes with a common child and then discarding arc directions.

The undirected graph below is the moral graph of the Asia Bayesian network structure.

Moral graph of the Asia Bayesian network.

Let $M[G]$ denote the moral graph of a fixed acyclic directed graph $G$. Denote by $G[\mathcal{X}]=G-(\mathcal{V}-\mathcal{X})$ the subgraph of $G$ obtained by deleting all nodes outside $\mathcal{X}$ and their corresponding arcs. Also, denote by $\can{\mathcal{X}}$ the set containing $\mathcal{X}$ and every ancestor of a node in $\mathcal{X}$.

We say that the sets of variables $\mathcal{X}$ and $\mathcal{Y}$ are m-separated by the set of variables $\mathcal{Z}$, and write $\mathcal{X} \msep \mathcal{Y} \mid \mathcal{Z}$, if they are disconnected in $M[G[\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}]]-\mathcal{Z}$. Otherwise they are m-connected, denoted by $\mathcal{X} \not\msep \mathcal{Y} \mid \mathcal{Z}$.

The graph $M[G[\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}]]-\mathcal{Z}$ is the moral graph of $G$ with barren nodes and evidence nodes deleted (along with their corresponding edges). The next result justifies our interest in m-separation.

For any three sets of nodes $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, it follows that $$ \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \text{ if and only if } \mathcal{X} \msep \mathcal{Y} \mid \mathcal{Z} \, . $$

Consider the transformation of the directed graph into $$ M'=M[G[\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}]]-\mathcal{Z} \, . $$ Serial and divergent connections are transformed into ordinary two-edge paths, which are blocked if and only if the intermediary node is in $\mathcal{Z}$; thus, d-separation and m-separation coincide on these connections. Consider now a convergent connection $A \rightarrow B \leftarrow C$. Moralization transforms it into a clique:
Thus, the variables $A$ and $C$ are connected in the moral graph. In principle, this could introduce a spurious connection when the convergent connection is blocked, that is, when neither $B$ nor any of its descendants is in $\mathcal{Z}$. Note, however, that the moralizing edge only survives when $B$ belongs to $\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}$; a case analysis on whether $B$ is an ancestor of $\mathcal{Z}$ (in which case the convergent connection is also d-unblocked) or an ancestor of $\mathcal{X} \cup \mathcal{Y}$ (in which case the trail can be rerouted through the corresponding directed path) shows that the two criteria agree.
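The construction used in the definition and in the proof translates directly into an algorithm. Below is a minimal sketch (the function `m_separated` is ours) that restricts the graph to the ancestral closure, moralizes, removes the conditioning set, and checks connectivity; it reuses the `asia` graph built in the earlier d-separation example.

```python
import networkx as nx

def m_separated(G, X, Y, Z):
    """Test whether X and Y are m-separated by Z in the DAG G."""
    # Ancestral closure of X ∪ Y ∪ Z.
    closure = set(X) | set(Y) | set(Z)
    for v in list(closure):
        closure |= nx.ancestors(G, v)
    H = G.subgraph(closure)
    # Moralize: "marry" the parents of every common child, then drop directions.
    M = nx.Graph(H.to_undirected())
    for v in H.nodes:
        parents = list(H.predecessors(v))
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                M.add_edge(parents[i], parents[j])
    # Delete the evidence nodes and check whether X and Y are disconnected.
    M.remove_nodes_from(Z)
    return all(not nx.has_path(M, x, y)
               for x in X for y in Y if x in M and y in M)

# Reusing the `asia` graph from the d-separation sketch: the answers agree with d-separation.
print(m_separated(asia, {"D"}, {"S"}, {"B", "E"}))  # True
print(m_separated(asia, {"A"}, {"S"}, {"D"}))       # False
```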

An important result

The following result will be useful for identifying structures consistent with a data distribution:

The following statements are equivalent for two nodes $X$ and $Y$ in an acyclic directed graph $G$:

  1. $X$ and $Y$ are adjacent.
  2. There is no set $\mathcal{Z}\subseteq\mathcal{V}-\{X,Y\}$ that d-separates $X$ and $Y$.
  3. $X$ and $Y$ are d-connected by $(\an{X} \cup \an{Y})-\{X,Y\}$.
  4. $X$ and $Y$ are d-connected by $(\pa{X} \cup \pa{Y})-\{X,Y\}$.
1 implies 2: no set $\mathcal{Z}\subseteq\mathcal{V}-\{X,Y\}$ can block the trail formed by the arc between $X$ and $Y$, hence no such set d-separates them. 2 implies 3: the set $(\an{X} \cup \an{Y})-\{X,Y\}$ is a subset of $\mathcal{V} - \{X,Y\}$, hence by assertion 2 it does not d-separate $X$ and $Y$. 3 implies 4: the parents of a node form a subset of its ancestors, and one can show that if $X$ and $Y$ were d-separated by $(\pa{X} \cup \pa{Y})-\{X,Y\}$ then they would also be d-separated by $(\an{X} \cup \an{Y})-\{X,Y\}$; by contraposition, d-connection given the ancestors implies d-connection given the parents. 4 implies 1: since the graph is acyclic, at least one of the two nodes is a nondescendant of the other; assume without loss of generality that $X \in \nd{Y}$. If $X$ and $Y$ were not adjacent, then $X \in \nd{Y}-\pa{Y}$ and $\pa{X} \subseteq \nd{Y}$ (a parent of $X$ that descended from $Y$ would create a cycle), so the Markov property $Y \dsep \nd{Y} \mid \pa{Y}$ together with weak union and decomposition (which d-separation satisfies) would give $Y \dsep X \mid (\pa{X} \cup \pa{Y})-\{X,Y\}$, contradicting assertion 4. Hence $X$ and $Y$ must be adjacent.

Summary

D-separation provides an elegant and powerful mechanism for representing (in)dependence relations by acyclic directed graphs. At its core is the translation of dependence into graph connectedness. While powerful and sound, d-separation is incomplete, in the sense that it does not capture all possible independence relations. Yet, as we have shown with several examples, d-separation is very expressive, and allows a large class of interesting models to be represented in an intuitive and mathematically rigorous form. In practice, it is often useful to use m-separation, the undirected counterpart of d-separation, to read off independences from a graph structure.


  1. We say that a joint distribution $P(\mathcal{V})$ is positive if its corresponding joint probability mass function or joint density function is positive (more generally, for mixed discrete-continuous models, we say that it is positive if the product of conditional density and mass functions is positive). ↩︎

  2. Milan Studený. Multiinformation and the Problem of Characterization of Independence Relations, Problems of Control and Information Theory 3, 1989. ↩︎

  3. Milan Studený. Conditional Independence Relations Have No Finite Complete Characterization, Transactions of 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, 1992. ↩︎

  4. David H. Kaye and David A. Freedman, Reference Guide on Statistics, In: Reference Manual on Scientific Evidence, 3rd edition, The National Academies Press, 2011. ↩︎
