# Representing Independences

$$\newcommand{\nb}{\text{ne}(#1)} \newcommand{\pa}{\text{pa}(#1)} \newcommand{\ch}{\text{ch}(#1)} \newcommand{\de}{\text{de}(#1)} \newcommand{\an}{\text{an}(#1)} \newcommand{\can}{\overline{\text{an}}(#1)} \newcommand{\nd}{\text{nd}(#1)} \newcommand{\indeg}{\text{indeg}(#1)} \newcommand{\outdeg}{\text{outdeg}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\!\perp_d\!} \newcommand{\msep}{\!\perp_m\!}$$

Bayesian networks can represent a probability distribution compactly by exploiting a set of Markov conditions on a digraph. These Markov conditions imply (and are implied by) other independences. In this lecture we will see how graphs can be used to derive a sound and nearly complete representation system for independences. The key to achieving this is the concept of d-separation. In fact, a major contribution of the theory of Bayesian networks was to bring forth a representational system that associates (in)dependences with graph (dis)connectedness.

## 1 Semi-graphoids

Even though independence is defined as a property of a probability measure, in practice we often start with a set of independence assumptions that ease the burden of specifying a probability measure. In this sense, independence can be seen as a more primitive concept than probability. Given some set of independences (e.g., the Markov properties of a graph), several other independences can be derived. The set of rules for deriving independences is known as a semi-graphoid due to its resemblance to connectivity in graphs.

### 1.1 Axioms

Recall that a set of variables $$\mathcal{X}$$ is conditionally independent of another set of variables $$\mathcal{Y}$$ given a set of variables $$\mathcal{Z}$$ if their conditional joint distribution factorizes as

\begin{equation*} p(\mathcal{X},\mathcal{Y} | \mathcal{Z}) = p(\mathcal{X}|\mathcal{Z})p(\mathcal{Y}|\mathcal{Z}) \, . \end{equation*}

We denote such a relation by the notation

\begin{equation*} \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \, . \end{equation*}

Independence satisfies a number of useful properties:

\begin{align*} %\label{GS} \tag{GS} & \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{Y} \ind \mathcal{X} \mid \mathcal{Z} \, , && \text{[symmetry]}\\ %\label{GD} \tag{GD} & \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \,, && \text{[decomposition]} \\ %\label{GU} \tag{GU} & \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \cup \mathcal{W} \,, && \text{[weak union]} \\ %\label{GC} \tag{GC} & \mathcal{X} \ind \mathcal{W} \mid \mathcal{Y} \cup \mathcal{Z} \text{ and } \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \,. && \text{[contraction]} \end{align*}

The above conditions hold for any probability distribution and are called the semi-graphoid axioms (for technical reasons we assume that any variable is conditionally independent of the empty set). When the probability distribution is positive, a fifth property also holds:

\begin{align*} %\label{GI} \tag{GI} & \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \cup \mathcal{W} \text{ and } \mathcal{X} \ind \mathcal{W} \mid \mathcal{Z} \cup \mathcal{Y} \implies \mathcal{X} \ind \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \, .&& \text{[intersection]} \end{align*}

A semi-graphoid satisfying intersection is a graphoid.
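The semi-graphoid axioms can be checked numerically on a small example. The sketch below (a hypothetical helper `indep`, using NumPy) builds a random joint distribution over binary $$X,Y,W,Z$$ that satisfies $$X \ind \{Y,W\} \mid Z$$ by construction, and then verifies decomposition and weak union on it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint over binary X, Y, W, Z built to satisfy X ⫫ {Y,W} | Z:
# p(x,y,w,z) = p(z) p(x|z) p(y,w|z)
pz = rng.dirichlet(np.ones(2))                               # p(z)
px_z = rng.dirichlet(np.ones(2), size=2)                     # px_z[z, x] = p(x|z)
pyw_z = rng.dirichlet(np.ones(4), size=2).reshape(2, 2, 2)   # pyw_z[z, y, w]
p = np.einsum('z,zx,zyw->xywz', pz, px_z, pyw_z)             # p[x,y,w,z]

def indep(p, a, b, c):
    """Test A ⫫ B | C on a joint array p by checking
    p(a,b,c) p(c) == p(a,c) p(b,c) cell by cell."""
    axes = set(range(p.ndim))
    marg = lambda keep: p.sum(axis=tuple(axes - set(keep)), keepdims=True)
    return np.allclose(marg(a + b + c) * marg(c), marg(a + c) * marg(b + c))

# axes: X=0, Y=1, W=2, Z=3
assert indep(p, (0,), (1, 2), (3,))   # X ⫫ {Y,W} | Z, by construction
assert indep(p, (0,), (1,), (3,))     # decomposition: X ⫫ Y | Z
assert indep(p, (0,), (1,), (3, 2))   # weak union:    X ⫫ Y | Z ∪ {W}
```

Contraction can be checked the same way; none of these tests can fail for a properly normalized joint distribution, which is precisely the content of the axioms.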

### 1.2 Ordered Markov property

Decomposition is what allowed us to derive the factorization property of Bayesian networks from the (local) Markov property by using the ordered Markov property

\begin{equation*} X \ind \{Y: Y < X\} \mid \pa{X} \quad \text{for any topological ordering } < \, . \end{equation*}

Decomposition also allows us to derive the pairwise Markov property:

\begin{equation*} X \ind Y \mid \pa{X} \quad \text{for all } Y \in \nd{X} - \pa{X} \, . \end{equation*}

The opposite direction (i.e., composition), however, does not hold in general. Thus, the pairwise Markov property does not suffice to establish the local Markov property (unless the distribution is positive, in which case intersection allows us to prove their equivalence). The weak union property allows us to derive the following fact:

\begin{equation*} X \ind \nd{X} - \mathcal{S} \mid \pa{X} \cup \mathcal{S} \end{equation*}

for any $$\mathcal{S} \subseteq \nd{X}$$. Using this fact we can show that the ordered Markov property implies the local Markov property (for each variable select a topological ordering where $$X$$ is greater than all its nondescendants).

### 1.3 Incompleteness

Pearl wrongly conjectured that the semi-graphoid axioms formed a complete system for the independence relation, in the sense that any other property could be derived from them [Pea1988]. However, Studený showed the existence of a property that independence satisfies but that cannot be derived from the semi-graphoid axioms [Stu1989]. Later, he proved that there is no finite characterization of probabilistic independence, which implies that semi-graphoids cannot be used as an axiomatization of the independence relation [Stu1992]. Still, the semi-graphoid properties remain useful for understanding the concept of d-separation, which we study next.

## 2 D-separation

Through the semi-graphoid axioms, it is possible to conclude many independences from a set of Markov properties. However, working with these axioms is cumbersome. We now investigate a graphical aid to reason with semi-graphoid rules.

### 2.1 Dependence graph

Before formally defining d-separation, let us motivate its definition by discussing some simple scenarios that a (directed) graphical representation of (in)dependence should encode. Recall that our goal is to represent (in)dependences using an acyclic directed graph. We say that $$X$$ depends on $$Y$$ if $$X$$ and $$Y$$ are probabilistically dependent. We associate an arc with a dependence relation, so that the graph in Figure 1 represents the relation $$\neg(A \ind B)$$. A graph interpreted under this semantics is called a dependence graph. Since (in)dependence is symmetric, a (direct) dependence can be locally represented in either direction (although other dependences may constrain the direction).

Figure 1: Example of dependence graph.

### 2.2 Chain reasoning

The first scenario is chain reasoning, where $$C$$ depends on $$A$$ through $$B$$: $$C$$ depends on $$B$$, $$B$$ depends on $$A$$, $$C$$ is independent of $$A$$ given $$B$$, and $$C$$ (unconditionally) depends on $$A$$. Here $$B$$ is called an intermediary or mediating variable of the phenomenon. This can be graphically captured by a serial connection. We say that the trail from $$A$$ to $$C$$ in the serial connection is unblocked when $$B$$ is not given and blocked otherwise, and that $$B$$ blocks the trail from $$A$$ to $$C$$. All the (in)dependences can be readily read off the graph by equating an unblocked connection (resp., a blocked connection) with dependence (resp., independence): $$A$$ and $$C$$ are connected (dependent), and $$A$$ and $$C$$ are separated (independent) given $$B$$. Extending this equivalence to longer chains (i.e., serial connections with more than 3 variables) shows that any variable is unconditionally dependent on its ancestors (and, by symmetry, on its descendants). For example, in the Asia network (Fig. 3) $$E$$ depends on $$A$$, $$S$$, $$X$$ and $$D$$ (and obviously on $$T$$ and $$L$$).

Figure 3: Structure of the Asia Bayesian network.
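The serial-connection behavior can be verified numerically. The sketch below builds a chain $$A \rightarrow B \rightarrow C$$ with arbitrary (hypothetical) conditional probability tables and checks that $$A$$ and $$C$$ are dependent unconditionally but independent given $$B$$:

```python
import numpy as np

# Serial connection A -> B -> C over binary variables, hypothetical CPTs.
pa = np.array([0.3, 0.7])                      # p(A)
pb_a = np.array([[0.9, 0.1], [0.2, 0.8]])      # pb_a[a, b] = p(B=b | A=a)
pc_b = np.array([[0.7, 0.3], [0.1, 0.9]])      # pc_b[b, c] = p(C=c | B=b)
p = np.einsum('a,ab,bc->abc', pa, pb_a, pc_b)  # joint p(A,B,C)

# Unconditionally, A and C are dependent: p(a,c) != p(a) p(c).
pac = p.sum(axis=1)
assert not np.allclose(pac, np.outer(pac.sum(axis=1), pac.sum(axis=0)))

# Given B, they become independent: p(a,c|b) = p(a|b) p(c|b).
for b in range(2):
    pac_b = p[:, b, :] / p[:, b, :].sum()
    assert np.allclose(pac_b, np.outer(pac_b.sum(axis=1), pac_b.sum(axis=0)))
```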

### 2.3 Common cause

Another interesting scenario is when $$B$$ is a common cause of otherwise unrelated events $$A$$ and $$C$$. Clearly, $$A$$ depends on $$B$$, and $$C$$ depends on $$B$$. Since $$B$$ is a common cause, $$A$$ and $$C$$ must be unconditionally dependent on each other: observing $$A$$ increases the chances that $$B$$ occurred, which in turn increases the chances that $$C$$ occurred. Graphically, this can be captured by a divergent connection: The trail between $$A$$ and $$C$$ is blocked given $$B$$ and unblocked otherwise. A consequence of this connection is that any variable depends on its non-descendants with which it shares a common ancestor. For example, $$E$$ depends on $$B$$ in the Asia network. In essence, two variables are rendered conditionally independent once all their common causes are observed.

### 2.4 Intercausal reasoning

The cases discussed so far suggest that dependence is better represented without directions (i.e., by an undirected graph). However, directing the arcs plays a fundamental role in representing a common scenario known as explaining away, which occurs when $$A$$ and $$C$$ are unrelated causes of a common effect $$B$$. Then $$A$$ and $$C$$ are unconditionally independent ($$A \ind C$$) but conditionally dependent given $$B$$ ($$\neg(A \ind C \mid B)$$). To convince yourself that the causes must be conditionally dependent, consider the case when $$B$$ is the exclusive-or of $$A$$ and $$C$$ and we observe $$B=1$$. Since exactly one of $$A$$ and $$C$$ equals 1, any increase in the probability of $$A=1$$ must be accompanied by a decrease in the probability of $$C=1$$. We can represent this type of reasoning as a convergent connection: The trail from $$A$$ to $$C$$ (and vice versa) is unblocked when $$B$$ is given and blocked otherwise. Note that this is the opposite of the previous cases (e.g., the divergent connection). This case can only be distinguished by considering the direction of arcs. In the Asia network, $$A$$ and $$S$$ are unconditionally independent and conditionally dependent given $$E$$.
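The exclusive-or model makes this concrete. The sketch below builds the joint distribution of two independent fair coins $$A$$ and $$C$$ with $$B = A \oplus C$$ and checks both claims:

```python
import numpy as np

# A and C are independent fair coins; B = A XOR C.
p = np.zeros((2, 2, 2))                  # p[a, b, c]
for a in range(2):
    for c in range(2):
        p[a, a ^ c, c] = 0.25

# Unconditionally A ⫫ C: p(a,c) = 1/4 = p(a) p(c) for all a, c.
assert np.allclose(p.sum(axis=1), 0.25)

# Conditioning on B = 1 couples A and C: their values must differ.
pac_b1 = p[:, 1, :] / p[:, 1, :].sum()
assert np.allclose(pac_b1, [[0.0, 0.5], [0.5, 0.0]])
```

Given $$B=1$$, learning that $$C=1$$ forces $$A=0$$: the two causes compete to explain the observed effect.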

### 2.5 Definition

We now formalize the definition of (un)blocked trails.

We say that a trail from $$X$$ to $$Y$$ is blocked by a set of variables $$\mathcal{Z}$$ if

• $$X$$ or $$Y$$ is in $$\mathcal{Z}$$, or
• the trail contains a serial or divergent connection whose intermediary node $$B$$ is in $$\mathcal{Z}$$, or
• the trail contains a convergent connection $$A \rightarrow B \leftarrow C$$ with $$(\{B\} \cup \de{B}) \cap \mathcal{Z} = \emptyset$$ (i.e., neither $$B$$ nor any of its descendants is in $$\mathcal{Z}$$).

Otherwise the trail is unblocked.

Note that a convergent connection can be unblocked by nodes outside the trail (namely, the descendants of its middle node). In the Asia network, the trail $$A,T,E,L,S$$ is blocked by $$\mathcal{Z}=\{L\}$$ and unblocked by $$\mathcal{Z}=\{D\}$$.

Two sets of variables $$\mathcal{X}$$ and $$\mathcal{Y}$$ are d-separated by a set of variables $$\mathcal{Z}$$ if every trail from a variable $$X \in \mathcal{X}$$ to a variable $$Y \in \mathcal{Y}$$ is blocked by $$\mathcal{Z}$$. Otherwise, $$\mathcal{X}$$ and $$\mathcal{Y}$$ are d-connected.

We denote that $$\mathcal{X}$$ and $$\mathcal{Y}$$ are d-separated by $$\mathcal{Z}$$ as $\mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \, .$
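The definition can be turned into an algorithm, essentially the Bayes Ball procedure of Exercise 4. The sketch below performs a reachability search over (node, direction) states; the `asia` parent map is an assumption transcribed from Figure 3:

```python
def d_separated(parents, xs, ys, zs):
    """Check whether every trail between xs and ys is blocked by zs.
    `parents` maps each node of an acyclic digraph to its set of parents."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for u in ps:
            children[u].add(v)
    # Nodes in zs or with a descendant in zs (zs ∪ an(zs)): exactly the
    # nodes at which a convergent connection is unblocked.
    an_z, stack = set(), list(zs)
    while stack:
        v = stack.pop()
        if v not in an_z:
            an_z.add(v)
            stack.extend(parents[v])
    # Search over (node, direction): 'down' means entered through a parent,
    # 'up' means entered through a child.
    visited = set()
    frontier = [(x, 'up') for x in xs]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node in ys and node not in zs:
            return False                      # unblocked trail found
        if direction == 'up' and node not in zs:
            frontier += [(p, 'up') for p in parents[node]]
            frontier += [(c, 'down') for c in children[node]]
        elif direction == 'down':
            if node not in zs:                # serial connection passes
                frontier += [(c, 'down') for c in children[node]]
            if node in an_z:                  # convergent connection opens
                frontier += [(p, 'up') for p in parents[node]]
    return True

# Asia structure (assumed from Figure 3): parents of each node.
asia = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
        'B': {'S'}, 'E': {'T', 'L'}, 'X': {'E'}, 'D': {'E', 'B'}}
assert d_separated(asia, {'A'}, {'S'}, {'L'})       # every trail blocked by L
assert not d_separated(asia, {'A'}, {'S'}, {'D'})   # D opens the collider at E
```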

### 2.6 Markov property

Any node $$X$$ is d-separated from its nondescendants by its parents.

Consider a trail from a nondescendant, nonparent node $$N$$ to $$X$$ (see Figure 7 for a visual aid). If the trail contains a parent $$U \in \pa{X}$$, then it contains either a serial connection $$W \rightarrow U \rightarrow X$$ (when $$N$$ is an ancestor of $$X$$), which is blocked by $$\pa{X}$$, or a divergent connection $$W \leftarrow U \rightarrow X$$ (when $$N$$ is not an ancestor of $$X$$), which is also blocked by $$\pa{X}$$. So consider a trail containing no parent of $$X$$. In this case the trail must contain a node $$Y \in \de{X}$$ which is also a descendant of $$N$$ (otherwise $$N$$ would be a descendant of $$X$$). This implies that the trail contains a convergent connection $$Z_1 \rightarrow Y \leftarrow Z_2$$, where $$Z_1 \in \de{N}$$ and $$Z_2 \in \de{X} \cup \{X\}$$. Since $$Y$$ and its descendants are all descendants of $$X$$, and by acyclicity no descendant of $$X$$ can be a parent of $$X$$, this convergent connection is blocked by $$\pa{X}$$, which completes the proof.

Thus, d-separation is consistent with the local Markov properties.

### 2.7 Soundness and completeness

Ideally, we would like to equate d-separation (resp., d-connection) with independence (resp., dependence). That is, we would like to have

\begin{align*} &\mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \, ,&&\text{[soundness]}\\ &\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \implies \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \, .&&\text{[completeness]} \end{align*}

The first desideratum indeed holds:

D-separation is sound for any joint distribution satisfying the local Markov property (viz. $$X \ind \nd{X} \mid \pa{X}$$).

The soundness condition of d-separation is known as the global Markov property. A seemingly more stringent definition of a Bayesian network would require the probability distribution to satisfy every d-separation in the structure (i.e., the global Markov property); according to the previous proposition, the global Markov property implies the local Markov property, and the result above states that the two definitions are in fact equivalent. The proof is obtained through the application of the semi-graphoid axioms. Note that the result could be restated for distributions factorizing according to the structure (which implies that they satisfy the local Markov property).

The second desideratum, however, is not true for all distributions. In fact, the incompleteness of d-separation is unavoidable. To prove that d-separation is incomplete for Bayesian networks, consider the following simple model with graph $$A \rightarrow B$$, $$p(A=0)=p(A=1)=1/2$$, and $$p(B|A)$$ given below.

|         | $$B=0$$ | $$B=1$$ |
|---------|---------|---------|
| $$A=0$$ | 0.4     | 0.6     |
| $$A=1$$ | 0.4     | 0.6     |

Clearly, $$A$$ and $$B$$ are independent with respect to $$p(A,B)=p(A)p(B|A)$$. However, the graph $$A \rightarrow B$$ implies that $$A$$ and $$B$$ are d-connected. Here is a more intricate example that does not exploit regularities inside the conditional probability distributions.
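The first example can be checked directly; a minimal sketch with NumPy:

```python
import numpy as np

# Graph A -> B, with p(B|A) constant in A: the graph says A and B are
# d-connected, yet the resulting distribution makes them independent.
pa = np.array([0.5, 0.5])                    # p(A)
pb_a = np.array([[0.4, 0.6], [0.4, 0.6]])    # p(B|A), identical rows
p = pa[:, None] * pb_a                       # joint p(A,B)

assert np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0)))  # A ⫫ B
```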

Consider the Bayesian network with a Gaussian distributed root variable $$A$$ and deterministic variables $$B=6A$$, $$C=-3A$$ and $$D=B+2C$$ as shown below. For any assignment $$A=a$$, the induced conditional distribution $$p(D|A=a)$$ is a degenerate distribution (i.e., it places all mass on the single value satisfying the corresponding linear equalities) assigning probability one to $$D=6a + 2(-3a)=0$$. Hence, $$p(D|A=a)=p(D|A=a')$$ for any $$a,a'$$. This shows that $$D$$ and $$A$$ are independent even though they are d-connected.

Incompleteness of d-separation is not rare:

For any acyclic directed graph $$G$$ there is a Bayesian network with structure $$G$$ and variables $$X$$ and $$Y$$ such that $$X \ind Y$$ and $$X \not\dsep Y$$.

The proof is based on the previous example. Let $$Y \in \pa{X}$$, and specify $$p$$ such that $$p(X|\pa{X})=p(X|\pa{X}-\{Y\})$$. This distribution satisfies all the Markov properties in $$G$$ and has $$X \ind Y$$ (but $$X \not\dsep Y$$).

The above result shows that d-separation cannot be complete in general. The following result shows that d-separation is "as complete as possible".

If $$G$$ is an acyclic directed graph and $$X$$ and $$Y$$ are d-connected by $$\mathcal{Z}$$ in $$G$$, then there is a Bayesian network $$(G,p)$$ such that $$X$$ and $$Y$$ are dependent given $$\mathcal{Z}$$.

Since $$X$$ and $$Y$$ are d-connected by $$\mathcal{Z}$$, there is a trail $$X,X_1,\dotsc,X_m,Y$$ that is unblocked given $$\mathcal{Z}$$ in $$G$$. We construct a joint distribution by specifying the conditional distributions (local models) of every variable along this trail so as to ensure that consecutive variables are dependent. For each convergent connection $$A \rightarrow B \leftarrow C$$ in the trail, we also set the distributions of the descendants of $$B$$ so as to make each dependent on its parents that are descendants of a variable in the trail. We finally set the distributions of the remaining variables to uniform (this prevents cancellation of the dependences specified thus far).

The above result shows that for every graph there is a distribution that makes d-separation complete for the corresponding Bayesian network. Hence, any other sound representation system would have to coincide with d-separation on these networks; in particular, any sound system based exclusively on the graph structure must coincide with d-separation for all distributions, lest it be incomplete. The take-home message is that the structure of a Bayesian network should be interpreted as stating that the suggested dependences hold for some parametrization of the model, while the suggested independences hold for every parametrization.

## 3 Markov Blanket

We are often interested in a minimal set of variables that, when conditioned on, renders a given variable independent of all the remaining variables. For example, we might be interested in predicting the value of a given variable given the values of all others. Discarding the irrelevant variables then makes the inference more efficient.

The Markov blanket of a variable $$X$$ is a set $$\mathcal{B}$$ such that $X \ind \mathcal{V}-\mathcal{B}-\{X\} | \mathcal{B} \, .$

We have the following result.

The set formed by the parents, the children and the parents of the children of a variable $$X$$ is a Markov blanket of $$X$$. If the distribution is positive, this set is minimal.

Figure 7: The Markov blanket of a variable.
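Reading this blanket off the structure is straightforward. A minimal sketch, again assuming the Asia parent map from Figure 3:

```python
def markov_blanket(parents, x):
    """Parents, children, and the children's other parents (spouses) of x,
    for a DAG given as a dict mapping each node to its set of parents."""
    children = {v for v, ps in parents.items() if x in ps}
    spouses = set().union(*(parents[c] for c in children)) if children else set()
    return (parents[x] | children | spouses) - {x}

# Asia structure (assumed from Figure 3): parents of each node.
asia = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
        'B': {'S'}, 'E': {'T', 'L'}, 'X': {'E'}, 'D': {'E', 'B'}}
assert markov_blanket(asia, 'E') == {'T', 'L', 'X', 'D', 'B'}  # pa ∪ ch ∪ spouses
```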

## 4 D-separation and semi-graphoids

We have seen that independence satisfies a number of basic properties known as the semi-graphoid axioms, and that it does not generally satisfy composition and intersection. It can be shown that d-separation satisfies all the semi-graphoid axioms; in addition, d-separation also satisfies composition and intersection:

\begin{align*} & \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \text{ and } \mathcal{X} \dsep \mathcal{W} \mid \mathcal{Z} \implies \mathcal{X} \dsep \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \, , && \text{[composition]}\\ & \mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \cup \mathcal{W} \text{ and } \mathcal{X} \dsep \mathcal{W} \mid \mathcal{Z} \cup \mathcal{Y} \implies \mathcal{X} \dsep \mathcal{Y} \cup \mathcal{W} \mid \mathcal{Z} \, .&& \text{[intersection]} \end{align*}

Since not all distributions satisfy composition and intersection, there are distributions whose set of independences cannot be fully characterized by a Bayesian network. That is, any Bayesian network representation of such a distribution either encodes independences that are not true (it is unsound) or fails to represent some independences (it is incomplete).

Consider four random variables that satisfy the following dependences and independences:

\begin{align*} &\neg(A \ind B) \,, & \neg(A \ind C) \,, \\ &\neg(B \ind D) \,, & \neg(C \ind D) \,, \\ &A \ind D \mid \{B,C\} \,, & B \ind C \mid \{A,D\} \, . \end{align*}

We will show shortly that such a distribution exists. By the first four dependence statements, we must have arcs between $$A$$ and $$B$$, between $$A$$ and $$C$$, between $$B$$ and $$D$$, and between $$C$$ and $$D$$. The independence statements imply that any trail between $$A$$ and $$D$$ must be blocked by $$\{B,C\}$$, and that any trail between $$B$$ and $$C$$ must be blocked by $$\{A,D\}$$. One can show (e.g., by listing them all) that every structure fails to satisfy all of these requirements.

## 5 M-separation

Another graphical criterion for verifying independence is based on undirected graphs.

The moral graph of an acyclic directed graph $$G$$ is the undirected graph obtained by connecting nodes with a common child and then discarding arc directions.

The graph in Figure 8 below is the moral graph of the Asia Bayesian network in Figure 3.

Figure 8: Moral graph of the Asia Bayesian network.

Let $$M[G]$$ denote the moral graph of a fixed acyclic directed graph $$G$$. Denote by $$G[\mathcal{X}]=G-(\mathcal{V}-\mathcal{X})$$ the subgraph of $$G$$ obtained by deleting all nodes outside $$\mathcal{X}$$ and their corresponding arcs. Also, denote by $$\can{\mathcal{X}}$$ the set containing $$\mathcal{X}$$ and every ancestor of a node in $$\mathcal{X}$$.

We say that the sets of variables $$\mathcal{X}$$ and $$\mathcal{Y}$$ are m-separated by the set of variables $$\mathcal{Z}$$, and write $$\mathcal{X} \msep \mathcal{Y} \mid \mathcal{Z}$$, if they are disconnected in $$M[G[\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}]]-\mathcal{Z}$$. Otherwise they are m-connected, denoted by $$\mathcal{X} \not\msep \mathcal{Y} \mid \mathcal{Z}$$.

The graph $$M[G[\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}]]-\mathcal{Z}$$ is the moral graph of $$G$$ with barren nodes and evidence nodes deleted (along with their corresponding edges). The next result states that m-separation and d-separation are equivalent.

$\mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z} \text{ if and only if } \mathcal{X} \msep \mathcal{Y} \mid \mathcal{Z} \, .$

Consider the transformation of the directed graph into $$M'=M[G[\can{\mathcal{X} \cup \mathcal{Y} \cup \mathcal{Z}}]]-\mathcal{Z}$$. Serial and divergent connections are transformed into undirected paths of three nodes, which are blocked if and only if the intermediary node is in $$\mathcal{Z}$$; thus, d-separation and m-separation coincide for these connections. Now consider a convergent connection $$A \rightarrow B \leftarrow C$$. Moralization transforms it into a clique, so the variables $$A$$ and $$C$$ become connected in the moral graph. In principle, this could introduce a spurious connection when neither $$B$$ nor any of its descendants is in $$\mathcal{Z}$$. In that case, however, $$B$$ is not an ancestor of $$\mathcal{Z}$$, and whenever $$B$$ is also not an ancestor of $$\mathcal{X} \cup \mathcal{Y}$$ it is a barren node, so it is removed together with the moralizing edge.
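The m-separation test is easy to implement, since it reduces to plain graph connectivity after moralizing the ancestral subgraph and deleting $$\mathcal{Z}$$. A minimal sketch (the `asia` parent map is an assumption transcribed from Figure 3):

```python
def m_separated(parents, xs, ys, zs):
    """Test m-separation: moralize the ancestral subgraph of xs ∪ ys ∪ zs,
    delete zs, and check whether xs and ys are disconnected."""
    # Ancestral closure of xs ∪ ys ∪ zs.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents[v])
    # Moralize: marry co-parents, then drop arc directions.
    adj = {v: set() for v in keep}
    for v in keep:
        for u in parents[v]:
            adj[u].add(v)
            adj[v].add(u)
        for u in parents[v]:
            for w in parents[v]:
                if u != w:
                    adj[u].add(w)
    # Delete zs and test connectivity from xs to ys.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        v = stack.pop()
        if v in ys:
            return False          # xs and ys are connected: m-connected
        if v not in seen:
            seen.add(v)
            stack.extend(u for u in adj[v] if u not in zs)
    return True

# Asia structure (assumed from Figure 3): parents of each node.
asia = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'},
        'B': {'S'}, 'E': {'T', 'L'}, 'X': {'E'}, 'D': {'E', 'B'}}
assert m_separated(asia, {'A'}, {'S'}, {'L'})       # agrees with d-separation
assert not m_separated(asia, {'A'}, {'S'}, {'D'})   # D pulls E into the moral graph
```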

## 6 Exercises

1. Consider the Bayesian network structure below.
• List the (local) Markov properties in the graph.
• Express the joint probability distribution in terms of the network parameters (i.e., the local conditional probabilities).
• Assuming that all variables take values 0 and 1, compute $$P(A=0,B=0)$$ and $$P(E=1|A=1)$$ as a function of the network parameters. Use the independences entailed by the model to simplify the expressions.
• Decide whether the following d-separations hold:
• $$A \dsep E \mid \{B,H\}$$
• $$G \dsep E \mid D$$
• $$\{A,B\} \dsep \{G,H\} \mid F$$
2. Prove that the set of parents, children and spouses forms a Markov blanket of a node (you do not need to prove that it is minimal).
3. Prove that every root variable (i.e., one with no parents) in a Bayesian network structure is d-separated from any other root variable (given the empty set).
4. Implement the Bayes Ball algorithm and use it to derive independences from the Asia network.

# Bibliography

• [Pea1988] Judea Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann (1988).
• [Stu1989] Milan Studený, Multiinformation and the Problem of Characterization of Independence Relations, Problems of Control and Information Theory, 3, 3-16 (1989).
• [Stu1992] Milan Studený, Conditional Independence Relations Have No Finite Complete Characterization, in: Transactions of the 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, 377-396 (1992).

Created: January 22, 2020
