# Bayesian Networks

$$\newcommand{\nb}{\text{ne}(#1)} \newcommand{\pa}{\text{pa}(#1)} \newcommand{\ch}{\text{ch}(#1)} \newcommand{\de}{\text{de}(#1)} \newcommand{\an}{\text{an}(#1)} \newcommand{\nd}{\text{nd}(#1)} \newcommand{\indeg}{\text{indeg}(#1)} \newcommand{\outdeg}{\text{outdeg}(#1)} \newcommand{\ind}{\perp\!\!\!\perp}$$

Any probabilistic phenomenon can be described by a joint probability distribution over a set of random variables. However, directly specifying and manipulating multivariate probability distributions is difficult and inefficient. For example, a joint probability distribution over $$n$$ binary random variables is characterized by its joint probability mass function, which in turn is described by $$2^n$$ numbers. Clearly, representing knowledge in this way does not scale to domains with more than a handful of binary variables. So our goal here will be to find a convenient representation of multivariate probability distributions that scales to a large number of random variables. The key to achieving this goal is the correspondence between (conditional) independence and factorization. For example, consider the task of describing the outcomes of the tosses of three coins represented by binary variables $$A,B,C$$. If we assume that the variables are independent, then the joint distribution factorizes as $$p(A,B,C)=p(A)p(B)p(C)$$. Hence, instead of the $$2^3-1=7$$ numbers we need only $$3\cdot(2-1)=3$$ numbers to fully specify the probability function. Full independence is easy to characterize, but what about more complex phenomena such as intercausal reasoning, where $$A$$ and $$B$$ are unconditionally independent but conditionally dependent given $$C$$? This is the purpose of Bayesian networks: to allow for a compact and intuitive language in which to express independence statements and (consequently) factorized probability functions.
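As a quick illustration, the savings from independence can be checked numerically. The following Python sketch (with made-up coin biases) builds the full $$2^3$$-entry joint table from just three numbers:

```python
import itertools

# Hypothetical biases for three independent coins.
p_a = {0: 0.5, 1: 0.5}
p_b = {0: 0.7, 1: 0.3}
p_c = {0: 0.9, 1: 0.1}

# Under full independence the joint factorizes as p(a,b,c) = p(a) p(b) p(c),
# so three numbers generate all 2**3 joint probabilities.
joint = {(a, b, c): p_a[a] * p_b[b] * p_c[c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

assert abs(sum(joint.values()) - 1.0) < 1e-12  # a valid distribution
```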

## 1 A few concepts from graph theory

A directed graph is a pair $$(\mathcal{V},\mathcal{E})$$ where $$\mathcal{V}$$ is a set of vertices or nodes and $$\mathcal{E}$$ is a binary relation on $$\mathcal{V}$$.1 The pairs $$(X,Y)$$ in $$\mathcal{E}$$ are called edges or arcs and represented as $$X \rightarrow Y$$.

If $$(X,Y)$$ is in $$\mathcal{E}$$ we say that $$X$$ is a parent of $$Y$$ and $$Y$$ is a child of $$X$$. The set of parents and children of a node $$X$$ are denoted, respectively, as $$\pa{X}$$ and $$\ch{X}$$. The in-degree of a node $$X$$ is the number of its parents and is written as $$\indeg{X}=|\pa{X}|$$. The out-degree is $$\outdeg{X}=|\ch{X}|$$.

A trail is a sequence of nodes $$X_1,\dotsc,X_k$$ such that for $$i=1,\dotsc,k-1$$ either $$X_i \rightarrow X_{i+1}$$ or $$X_{i+1} \rightarrow X_i$$, and each arc appears at most once. That is, a trail is a way of going from one node to another by following arcs in any direction but without traversing the same arc twice. The length of a trail is the number of edges it contains (the number of nodes minus one). A loop is a trail starting and ending at the same node. Two nodes are connected if there is a trail starting at one of the nodes and ending at the other one; otherwise they are separated. A path is a trail where all nodes are distinct, except that the first and the last nodes may coincide. A directed path is a path which follows the direction of the arcs (i.e., $$X_i, X_{i+1}$$ is in the path only if $$X_i \rightarrow X_{i+1}$$). A cycle is a directed path starting and ending at the same node, for example, $$A \rightarrow B \rightarrow C \rightarrow A$$.

If there is a directed path from $$X$$ to $$Y$$ we say that $$X$$ is an ancestor of $$Y$$ and that $$Y$$ is a descendant of $$X$$. The descendants (ancestors) of $$X$$ are denoted $$\de{X}$$ ($$\an{X}$$). Its complement with respect to $$\mathcal{V}$$ is called the non-descendants and written $$\nd{X}$$.

A directed graph is acyclic if it has no cycles, that is, if no node is an ancestor of itself. A topological ordering of the nodes is a total ordering $$<$$ of the nodes such that $$X < Y$$ whenever $$X$$ is an ancestor of $$Y$$ (equivalently, $$X < Y$$ only if $$Y$$ is not an ancestor of $$X$$). A directed graph has a topological ordering if and only if it is acyclic.
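Topological orderings are easy to compute. A minimal sketch using Kahn's algorithm (a standard method, not taken from this text) that also detects cycles:

```python
from collections import deque

def topological_order(nodes, edges):
    """Return a topological ordering of a directed graph,
    or None if the graph contains a directed cycle (Kahn's algorithm)."""
    indeg = {x: 0 for x in nodes}
    children = {x: [] for x in nodes}
    for x, y in edges:          # edge x -> y
        children[x].append(y)
        indeg[y] += 1
    queue = deque(x for x in nodes if indeg[x] == 0)
    order = []
    while queue:
        x = queue.popleft()
        order.append(x)
        for y in children[x]:
            indeg[y] -= 1
            if indeg[y] == 0:
                queue.append(y)
    # All nodes output iff the graph is acyclic.
    return order if len(order) == len(nodes) else None

# The acyclic chain A -> B -> C has an ordering; adding C -> A creates a cycle.
assert topological_order("ABC", [("A", "B"), ("B", "C")]) == ["A", "B", "C"]
assert topological_order("ABC", [("A", "B"), ("B", "C"), ("C", "A")]) is None
```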

An undirected graph is a pair $$(\mathcal{V},\mathcal{E})$$ where $$\mathcal{E}$$ is a set of unordered pairs of nodes, also called edges. We denote an edge as $$X-Y$$. We say that nodes $$X$$ and $$Y$$ are adjacent or neighbors if $$X-Y$$ is in $$\mathcal{E}$$. The set of neighbors of a node $$X$$ is denoted by $$\nb{X}$$. The degree of a node is the number of neighbors it has. Some of the definitions for directed graphs carry over to undirected graphs: a trail is a sequence of nodes where consecutive nodes are connected by edges and no edge is repeated, and a loop is a trail starting and ending at the same node.

We will be mostly interested in graphs where nodes have few parents (if directed) or neighbors (if undirected). The class of sparse graphs is a set of graphs $$G_n=(\mathcal{V},\mathcal{E})$$ parametrized by $$n=|\mathcal{V}|$$ and such that $$|\mathcal{E}|=O(n)$$. Informally, we say that a graph is sparse if $$|\mathcal{E}| \ll |\mathcal{V}|^2$$.

## 2 Markov property for acyclic directed graphs

We are interested in directed graphs representing probabilistic independences among random variables. Thus, we will consider graphs whose vertex set is a collection of random variables. This allows us to discuss whether two nodes are (probabilistically) independent and whether two variables are connected.

If two sets of random variables $$\mathcal{X}$$ and $$\mathcal{Y}$$ are independent conditional on a set of random variables $$\mathcal{Z}$$ we write

$\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \, .$

So consider a directed graph $$G=(\mathcal{V},\mathcal{E})$$ where $$\mathcal{V}$$ is a collection of random variables defined in a probability space inducing conditional joint probability distributions $$p(\mathcal{X}|\mathcal{Y})$$ for any subsets $$\mathcal{X},\mathcal{Y} \subseteq \mathcal{V}$$ (whenever $$p(\mathcal{Y})>0$$). Our goal here is two-fold: we want to assign a semantics to the graph that captures probabilistic (in)dependence relations, and we want to exploit this correspondence to efficiently represent joint probability distributions. Let us start with the first goal. The following example helps motivate the chosen semantics.

Taken from CDLS1999. The chances of developing dyspnoea are higher in patients who have had tuberculosis, lung cancer or bronchitis. A recent visit to Asia increases the chances of tuberculosis, while smoking is a risk factor for both lung cancer and bronchitis. An x-ray does not distinguish between lung cancer and tuberculosis, and neither does the presence or absence of dyspnoea. The graph in Figure [BROKEN LINK: fig:asia] encodes these probabilistic relations among the variables.

The key probabilistic property represented by the graph structure is the following.

Markov property

We say that a joint probability distribution (or joint probability mass function or joint probability density) satisfies (or respects) the Markov property wrt a directed graph $$G=(\mathcal{V},\mathcal{E})$$ if for every node/random variable $$X$$ it follows that

$X \ind \nd{X} \mid \pa{X} \, .$

That is, a joint distribution $$p$$ satisfies the Markov property if

$p(X|\nd{X}) = p(X|\pa{X}) \, ,$ and $p(X,\nd{X}-\pa{X}|\pa{X}) = p(X|\pa{X})p(\nd{X}-\pa{X}|\pa{X}) \, ,$ whenever these conditional distributions are defined.

According to the Markov property, two variables are independent if they are not connected by a directed path and all of their common ancestors have their values known. The Markov property thus allows us to represent causal reasoning by reading the arcs as causal relations. In this case it simply states that two phenomena that have no directed connection are independent given their common causes.
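The Markov property can be verified numerically on a small example. The sketch below builds a joint distribution over a hypothetical chain $$A \rightarrow B \rightarrow C$$ from made-up local tables and checks that $$C \ind A \mid B$$, i.e., that $$p(C|A,B)$$ does not depend on $$A$$:

```python
import itertools

# Hypothetical chain A -> B -> C with made-up conditional tables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # key (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (c, b)

joint = {(a, b, c): p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def conditional(c, a, b):
    """p(C=c | A=a, B=b) computed directly from the joint table."""
    return joint[(a, b, c)] / sum(joint[(a, b, cc)] for cc in (0, 1))

# Markov property at C: C is independent of its non-descendant A
# given pa(C) = {B}, so p(c | a, b) is the same for a = 0 and a = 1.
for b, c in itertools.product((0, 1), repeat=2):
    assert abs(conditional(c, 0, b) - conditional(c, 1, b)) < 1e-12
```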

## 3 Factorization

The Markov property provides a sensible semantics for a graphical representation of probabilistic (in)dependences. We will see later that this semantics is optimal in a precise sense. It remains to show the connection between this semantics and efficient representation. The following result shows that the joint distribution $$p$$ can be efficiently encoded if $$G$$ is sparse. First, define $$p(X|\pa{X})$$ to be zero at any point where $$p(\pa{X})$$ is zero. Note that if $$p(\mathcal{X})$$ is zero at some point then $$p(\mathcal{X},\mathcal{Y})$$ is also zero.

If a joint probability distribution $$p$$ satisfies the Markov property wrt to an acyclic directed graph $$G=(\mathcal{V},\mathcal{E})$$, then

$p(\mathcal{V}) = \prod_{X \in \mathcal{V}} p(X|\pa{X}) \, ,$

where for convenience we define $$p(X|\emptyset)=p(X)$$.

Before proving that result, we need the following proposition.

Decomposition property

If $$\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$$ and $$\mathcal{S} \subset \mathcal{Y}$$, then $$\mathcal{X} \ind \mathcal{S} \mid \mathcal{Z}$$.

Let $$\mathcal{T}=\mathcal{Y} - \mathcal{S}$$. We have by the total rule and the chain rule for variables that

\begin{align*} p(\mathcal{X},\mathcal{S}|\mathcal{Z}) &= \sum_{\mathcal{T}} p(\mathcal{X}, \mathcal{S}, \mathcal{T}|\mathcal{Z}) \\ &= \sum_{\mathcal{T}} p(\mathcal{X},\mathcal{Y}|\mathcal{Z}) \\ &= \sum_{\mathcal{T}} p(\mathcal{X}|\mathcal{Y},\mathcal{Z}) p(\mathcal{Y}|\mathcal{Z}) \, . \end{align*}

Since $$\mathcal{X}$$ and $$\mathcal{Y}$$ are independent given $$\mathcal{Z}$$, it follows that

$p(\mathcal{X}|\mathcal{Y},\mathcal{Z})=p(\mathcal{X}|\mathcal{Z}) \, .$

Hence,

\begin{align*} p(\mathcal{X},\mathcal{S}|\mathcal{Z}) &= \sum_\mathcal{T} p(\mathcal{X}|\mathcal{Z}) p(\mathcal{Y}|\mathcal{Z}) \\ & = p(\mathcal{X}|\mathcal{Z}) \sum_{\mathcal{T}} p(\mathcal{Y}|\mathcal{Z}) \\ & = p(\mathcal{X}|\mathcal{Z}) p(\mathcal{S}|\mathcal{Z}) \, . \end{align*}

The second equality used the fact that $$p(\mathcal{X}|\mathcal{Z})$$ is constant with respect to $$\mathcal{T}$$.

We can now prove the theorem.

Let $$<$$ be a total topological ordering of the nodes/variables in $$G$$ and for each $$X$$ denote by $$\mathcal{A}_X=\{Y: Y < X\}$$ the set of variables that are smaller than $$X$$. We have by the Chain Rule for variables that

$p(\mathcal{V})=\prod_{X \in \mathcal{V}} p(X|\mathcal{A}_X) \, .$

For each $$X$$, $$\mathcal{A}_X$$ is a subset of $$\nd{X}$$ and contains $$\pa{X}$$. Hence, by applying the Decomposition Property with $$\mathcal{X}=\{X\}$$, $$\mathcal{Y}=\nd{X}-\pa{X}$$, $$\mathcal{Z}=\pa{X}$$ and $$\mathcal{S}=\mathcal{A}_X-\pa{X}$$, it follows that $$p(X|\mathcal{A}_X)=p(X|\pa{X})$$.

The distributions $$p(X|\pa{X})$$ are called local distributions. The Factorization Theorem allows us to specify a discrete joint probability distribution over $$\mathcal{V}$$ by

$\sum_{X \in \mathcal{V}} (|\text{dom}(X)|-1)\prod_{Y \in \pa{X}} |\text{dom}(Y)|$

numbers. This is much smaller than the number of values required to specify the joint distribution directly when $$G$$ is sparse. For instance, if $$k$$ is an upper bound on the in-degree of any node and $$v$$ is an upper bound on the cardinality of (the domain of) any variable, then the number of values required to specify a Bayesian network over $$n$$ variables is $$O(n \cdot v^{k+1})$$, which grows only linearly in $$n$$ for fixed $$k$$ and $$v$$ (compare with the exponential size of the explicit joint).

Any joint probability distribution satisfying the Markov property with respect to the graph in Figure [BROKEN LINK: fig:asia] factorizes as $p(A,S,T,L,E,B,X,D) = p(A)p(S)p(T|A)p(L|S)p(E|T,L)p(B|S)p(X|E)p(D|E,B) \, .$

Considering that all variables are binary, this specification requires $$1 + 1 + 1\cdot 2 + 1\cdot 2 + 1\cdot 2 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 \cdot 2=18$$ values (compare with the $$2^8-1=255$$ values required by the joint distribution).
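The parameter-count formula above is easy to evaluate in code. A sketch that reproduces the count of 18 for the Asia network (variable names and parent sets as in the factorization above):

```python
def n_parameters(domains, parents):
    """Number of values needed by the factorization:
    sum over X of (|dom(X)| - 1) * prod over Y in pa(X) of |dom(Y)|."""
    total = 0
    for x, size in domains.items():
        prod = 1
        for y in parents.get(x, ()):
            prod *= domains[y]
        total += (size - 1) * prod
    return total

# Asia network: all eight variables are binary.
domains = {v: 2 for v in "ASTLEBXD"}
parents = {"T": "A", "L": "S", "E": "TL", "B": "S", "X": "E", "D": "EB"}
assert n_parameters(domains, parents) == 18   # versus 2**8 - 1 == 255
```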

## 4 Bayesian network definition

The Markov property and the factorization theorem give us a graphical representation that allows for a convenient description of probabilistic (in)dependence while enabling a contextual specification of a probability distribution in terms of local models:

A Bayesian network consists of

• an acyclic directed graph $$G=(\mathcal{V},\mathcal{E})$$, where $$\mathcal{V}$$ is a collection of random variables,
• a joint probability distribution $$p$$ over $$\mathcal{V}$$ satisfying the Markov property wrt $$G$$.

In practice, a Bayesian network is represented by its graph $$G$$ and the collection of local conditional probability distributions $p(X|\pa{X}) \text{ for all } X \in \mathcal{V}.$ The graph $$G$$ is often called the structure of the Bayesian network.
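A minimal Python sketch of this representation, with a hypothetical two-node network and made-up numbers; the `joint` method evaluates the factorization $$p(\mathcal{V}) = \prod_{X} p(X|\text{pa}(X))$$ at a full assignment:

```python
class BayesianNetwork:
    """A Bayesian network as a structure plus local conditional distributions.

    `parents` maps each variable to a tuple of its parents; `cpt` maps each
    variable to a dict from (value, parent values...) to a probability."""

    def __init__(self, parents, cpt):
        self.parents = parents
        self.cpt = cpt

    def joint(self, assignment):
        """p(V) at a full assignment, via p(V) = prod_X p(X | pa(X))."""
        prob = 1.0
        for x, pa in self.parents.items():
            key = (assignment[x],) + tuple(assignment[y] for y in pa)
            prob *= self.cpt[x][key]
        return prob

# Tiny network A -> B with made-up local distributions.
net = BayesianNetwork(
    parents={"A": (), "B": ("A",)},
    cpt={"A": {(0,): 0.6, (1,): 0.4},
         "B": {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}})
assert abs(net.joint({"A": 1, "B": 1}) - 0.32) < 1e-12  # 0.4 * 0.8
```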

## 5 Barren nodes

The following concept of barren nodes will help us connect factorization back to the Markov property.

Barren node
A node $$X$$ in an acyclic directed graph $$G$$ is barren wrt a set of variables/nodes $$\mathcal{S}$$ if $$(\{X\} \cup \de{X}) \cap \mathcal{S}=\emptyset$$.

Equivalently, a node is barren wrt $$\mathcal{S}$$ if it is a leaf node not in $$\mathcal{S}$$, either in $$G$$ itself or in a graph obtained from $$G$$ by successively removing barren nodes.

Let $$p_G$$ be a distribution that satisfies the Markov property wrt a graph $$G=(\mathcal{V},\mathcal{E})$$, let $$\mathcal{B}$$ be a set of barren nodes in $$G$$ with respect to a node set $$\mathcal{S}$$, and let $$H=(\mathcal{V}-\mathcal{B},\mathcal{E}')$$ be the graph obtained by removing the nodes (and respective arcs) in $$\mathcal{B}$$ from $$G$$. Then $$p_G(\mathcal{S}) = p_H(\mathcal{S})$$, where $$p_H(\mathcal{V}-\mathcal{B}) = \prod_{X \in \mathcal{V}-\mathcal{B}} p_G(X|\pa{X})$$.

First note that

\begin{align*} p_G(\mathcal{S}) &= \sum_{\mathcal{V}-\mathcal{S}} \prod_{X \in \mathcal{V}} p_G(X|\pa{X}) \\ &= \sum_{\mathcal{V}-\mathcal{S}-\mathcal{B}} \prod_{X \in \mathcal{V}-\mathcal{B}} p_G(X|\pa{X}) \sum_{\mathcal{B}} \prod_{X \in \mathcal{B}} p_G(X|\pa{X}) . \end{align*}

Now for any $$X$$ it follows that $$\sum_X p(X|\pa{X}) = 1$$. Thus, by performing the eliminations $$\sum_{X}$$ for each $$X \in \mathcal{B}$$ in reverse topological order, we have (by an inductive argument) that $$\sum_{\mathcal{B}} \prod_{X \in \mathcal{B}} p_G(X|\pa{X}) = 1$$, which proves the result.
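The telescoping sum-to-one argument can be checked numerically. In the hypothetical chain $$A \rightarrow B \rightarrow C$$ below (made-up tables), $$C$$ is barren wrt $$\{A,B\}$$, and summing it out leaves exactly the factorization of the reduced graph $$H = A \rightarrow B$$:

```python
import itertools

# Chain A -> B -> C with made-up tables; C is barren wrt S = {A, B}.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # key (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (c, b)

for a, b in itertools.product((0, 1), repeat=2):
    # Marginal in G: sum the full factorization over the barren node C ...
    marg_G = sum(p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
                 for c in (0, 1))
    # ... which equals the factorization of the reduced graph H = A -> B,
    # because sum_c p(c | b) = 1 for every b.
    marg_H = p_a[a] * p_b_given_a[(b, a)]
    assert abs(marg_G - marg_H) < 1e-12
```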

A simple corollary of the previous result is that if $$\mathcal{B}$$ is a set of barren nodes wrt $$\mathcal{R} \cup \mathcal{S}$$, then $$p_G(\mathcal{R}|\mathcal{S})=p_H(\mathcal{R}|\mathcal{S})$$, where $$H$$ is the graph obtained by removing $$\mathcal{B}$$. This holds since the Proposition can be applied to both the numerator and the denominator of the definition of conditional distribution. Another simple corollary is the following.

If $$(\mathcal{R},\mathcal{S})$$ is a partition of the nodes of a Bayesian network structure $$G$$ such that $$\mathcal{S}$$ is barren (wrt $$\mathcal{R}$$ in $$G$$) then $$p(\mathcal{R})=\prod_{X \in \mathcal{R}} p(X|\pa{X})$$.

The graph obtained by removing $$\mathcal{S}$$ is a Bayesian network over $$\mathcal{R}$$ and hence encodes a distribution $$p(\mathcal{R})$$ which factorizes as the product of local conditional distributions.

We can now prove the following result.

If the joint distribution factorizes according to an acyclic directed graph $$G$$, then it satisfies the Markov properties represented by $$G$$.

The set $$\de{X}$$ is barren wrt $$\{X\} \cup \nd{X}$$, thus by the Barren Node Corollary,

$p(X,\nd{X}) = p(X|\pa{X}) \prod_{Y \in \nd{X}} p(Y|\pa{Y}) .$

Similarly, $$\{X\} \cup \de{X}$$ is barren wrt $$\nd{X}$$, hence

$p(\nd{X}) = \prod_{Y \in \nd{X}} p(Y|\pa{Y}) .$

Combining these results, we have that

\begin{align*} p(X,\nd{X}|\pa{X}) &= p(X|\pa{X}) \frac{\prod_{Y \in \nd{X}} p(Y|\pa{Y})}{p(\pa{X})} \\ &= p(X|\pa{X}) p(\nd{X}|\pa{X}) . \end{align*}

The last equality states that $$X \ind \nd{X} \mid \pa{X}$$, that is, the Markov property is satisfied.

Thus the Markov property and the factorization property are largely equivalent: one can start with a set of independence assumptions and conclude a factorization of the joint distribution in terms of local conditional distributions, or one can start with a collection of local conditional distributions inducing an acyclic directed graph and derive a joint distribution satisfying the Markov property.2

From now on we will consider Bayesian networks as serving two roles: as a model of probabilistic (in)dependences through its Markov properties, and as an efficient representation of a multivariate joint distribution in terms of local conditional distributions.

## 6 Some common structures

A tree is a graph where each node has at most one parent (note that this definition allows disconnected graphs). A Bayesian network is tree-shaped if its structure is a tree. As we will see later, trees are particularly important as inference (querying) can be performed in linear time on them. A naive Bayes model is a specific type of tree-shaped Bayesian network with a single root variable which has the remaining variables as children, as in the graph on the left in Figure [BROKEN LINK: fig:nets]. Naive Bayes is a common model for building classifiers in many domains, for instance spam detection in emails. The root variable represents the message type (spam or ham) and the leaves indicate textual features such as word frequencies and amount of capitalization. Recall that, by the factorization theorem, the collection of local conditional distributions determines the probabilistic model.
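A sketch of such a classifier with two hypothetical features and made-up probabilities; by the factorization, the likelihood of the features given the class is a product of the leaf distributions:

```python
# Hypothetical two-feature naive Bayes spam model with made-up numbers.
p_spam = 0.3                                    # p(class = spam)
p_feat = {                                      # p(feature = 1 | class)
    "viagra":   {"spam": 0.40, "ham": 0.01},
    "all_caps": {"spam": 0.25, "ham": 0.05},
}

def posterior_spam(features):
    """p(spam | features) via Bayes rule; the leaves are conditionally
    independent given the root, so the likelihood is a product."""
    score = {}
    for cls, prior in (("spam", p_spam), ("ham", 1 - p_spam)):
        s = prior
        for name, present in features.items():
            q = p_feat[name][cls]
            s *= q if present else 1 - q
        score[cls] = s
    return score["spam"] / (score["spam"] + score["ham"])

# Both spam indicators present: the posterior mass shifts heavily to spam.
assert posterior_spam({"viagra": 1, "all_caps": 1}) > 0.9
```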

Bipartite Bayesian networks are Bayesian networks whose nodes can be partitioned into two sets such that all arcs originate in one set and point to a variable in the other set. The network on the right in Figure [BROKEN LINK: fig:nets] is bipartite. Note that a naive Bayes model is the simplest case of a bipartite Bayesian network (in terms of structure). Note also that a bipartite network might contain loops and is therefore not necessarily tree-shaped.

Recall that a path is a sequence of nodes $$X_1,\dotsc,X_n$$, all distinct except possibly the first and the last, such that $$X_i,X_{i+1}$$ is an edge of the graph for each $$i=1,\dotsc,n-1$$. An acyclic directed graph is singly connected if there is at most one path between any two nodes; equivalently, if it contains no loops. A polytree-shaped Bayesian network has a singly connected graph. Note that a bipartite network need not be polytree-shaped: two roots sharing two children already form a loop. Figure [BROKEN LINK: fig:polytree] depicts a polytree-shaped Bayesian network which is not bipartite. The Asia Bayesian network in Figure [BROKEN LINK: fig:asia] is not polytree-shaped as e.g. the node $$D$$ is reachable from node $$S$$ through two different paths.

## 7 Exercises

1. Consider a Bayesian network with four binary variables and structure $X_1 \rightarrow X_2 \rightarrow X_3 \rightarrow X_4 \, .$ Assume that the conditional probability values parametrizing the model are $$p(X_i=1|X_{i-1}=0) = a$$, $$p(X_i=1|X_{i-1}=1)=b$$, for $$i=2,3,4$$, and $$p(X_1=1) = c$$. Compute the following probabilities as a function of $$a$$, $$b$$ and $$c$$:
1. $$p(X_1=1,X_2=0,X_3=1,X_4=0)$$
2. $$p(X_4=1|X_1=1)$$
3. $$p(X_1=1|X_4=1)$$
4. $$p(X_3=1|X_1=0,X_4=1)$$
2. Compute the number of probability values necessary to specify a tree-shaped Bayesian network with $$n$$ binary variables.
3. Compute the number of probability values necessary to specify a bipartite Bayesian network with $$m$$ root variables and $$n$$ leaves, and maximum in-degree $$d$$. The root variables are binary and the leaf variables are ternary.
4. Prove that the leaves of a naive Bayes model are pairwise independent conditional on the root node.
5. A nuclear power plant has an alarm that triggers when the temperature exceeds a certain threshold. A sensor measures the temperature. Call $$A$$ a binary variable that denotes whether the alarm has triggered, $$F_A$$ a binary variable that represents whether the alarm is defective, $$F_G$$ a binary variable representing whether the sensor is defective, $$G$$ a variable representing the possible measurements of the sensor and $$T$$ the temperature. The probability that the sensor is defective increases when the temperature is high and decreases otherwise.
• Draw the structure of a Bayesian network representing the given knowledge.

# Bibliography

• [CDLS1999] Cowell, Dawid, Lauritzen & Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag (1999).

## Footnotes:

1. A binary relation is simply a collection of ordered pairs of elements from a domain; in our case these are the vertices.

2. There are some pathological cases involving continuous sample spaces where there is no joint distribution consistent with a collection of conditional distributions.

Created: January 22, 2020
