# Bayesian Networks

\( \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\indeg}[1]{\text{indeg}(#1)} \newcommand{\outdeg}[1]{\text{outdeg}(#1)} \newcommand{\ind}{\perp\!\!\!\perp} \)

Any probabilistic phenomenon can be described by a joint
probability distribution over a set of random variables.
However, directly specifying and manipulating multivariate
probability distributions is difficult and inefficient. For
example, a joint probability distribution over \(n\) binary
random variables is characterized by its joint probability
mass function, which in turn is described by \(2^n\) numbers.
Clearly, representing knowledge in this way does not scale to
domains with more than a handful of binary variables. So our
goal here will be to find a convenient representation of
multivariate probability distributions that scales to a large
number of random variables. The key to achieving this goal is
the correspondence between (conditional) independence and
factorization. For example, consider the task of describing
the outcomes of the tosses of three coins represented by
binary variables \(A,B,C\). If we *assume* that the variables
are independent, then the joint distribution factorizes as
\(p(A,B,C)=p(A)p(B)p(C)\). Hence, instead of the \(2^3-1=7\)
numbers we need only \(3\cdot(2-1)=3\) numbers to fully
specify the probability function. Full independence is easy to
characterize, but what about more complex phenomena such as
intercausal reasoning, where \(A\) and \(B\) are unconditionally
independent but conditionally *dependent* given \(C\)? This is
the purpose of Bayesian networks: to allow for a compact and
intuitive language in which to express independence statements
and (consequently) factorized probability functions.

## 1 A few concepts from graph theory

A **directed graph** is a pair \((\mathcal{V},\mathcal{E})\) where
\(\mathcal{V}\) is a set of **vertices** or **nodes** and
\(\mathcal{E}\) is a binary relation on \(\mathcal{V}\).^{1}
The pairs \((X,Y)\) in \(\mathcal{E}\) are called **edges** or
**arcs** and represented as \(X \rightarrow Y\).

If \((X,Y)\) is in \(\mathcal{E}\) we say that \(X\) is a parent of \(Y\) and \(Y\) is a child of \(X\). The sets of parents and children of a node \(X\) are denoted, respectively, \(\pa{X}\) and \(\ch{X}\). The in-degree of a node \(X\) is the number of its parents and is written as \(\indeg{X}=|\pa{X}|\). The out-degree is \(\outdeg{X}=|\ch{X}|\).
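These definitions translate directly into code. Below is a minimal sketch (the helper names and the choice of storing a DAG as a map from each node to its parent set are assumptions of this sketch, not prescribed by the text):

```python
# A toy DAG stored as a dict mapping each node to its set of parents.
dag = {
    "A": set(),       # A has no parents (a root)
    "B": {"A"},       # A -> B
    "C": {"A", "B"},  # A -> C, B -> C
}

def parents(g, x):
    """pa(X): the set of parents of node x."""
    return g[x]

def children(g, x):
    """ch(X): all nodes that list x among their parents."""
    return {y for y, pa in g.items() if x in pa}

indeg = len(parents(dag, "C"))    # |pa(C)| = 2
outdeg = len(children(dag, "A"))  # |ch(A)| = 2
```

Storing parent sets (rather than child sets) matches the role parents play in the Markov property and the factorization later on.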

A **trail** is a sequence of nodes \(X_1,\dotsc,X_k\) such that
for \(i=1,\dotsc,k-1\) either \(X_i \rightarrow X_{i+1}\) or
\(X_{i+1} \rightarrow X_i\), and each arc appears at most once.
That is, a trail is a way of going from a node to another by
following arcs in any direction but without passing twice in
the same arc. The **length of a trail** is the number of edges
it contains (the number of nodes minus one). A **loop** is a
trail starting and ending at the same node. Two nodes are
**connected** if there is a trail starting at one of the nodes
and ending at the other one, otherwise they are **separated**. A
**path** is a trail where all but the first and the last nodes
are necessarily distinct (the first and last may
coincide). A **directed path** is a path which follows the
direction of the arcs (i.e., \(X_i, X_{i+1}\) is in the path
only if \(X_i \rightarrow X_{i+1}\)). A **cycle** is a directed
path starting and ending at the same node, for example, \(A
\rightarrow B \rightarrow C \rightarrow A\).

If there is a directed path from \(X\) to \(Y\) we say that \(X\) is
an **ancestor** of \(Y\) and that \(Y\) is a **descendant** of \(X\).
The descendants (ancestors) of \(X\) are denoted \(\de{X}\)
(\(\an{X}\)). The complement of \(\{X\} \cup \de{X}\) with respect
to \(\mathcal{V}\) is called the non-descendants of \(X\) and written \(\nd{X}\).

A directed graph is **acyclic** if it has no directed cycles,
that is, if no node is an ancestor of itself. A **topological
ordering** of the nodes is a total ordering \(<\) of the nodes
such that \(X < Y\) only if \(Y\) is not an ancestor of
\(X\). A directed graph has a topological ordering if and only
if it is acyclic.
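The equivalence between acyclicity and the existence of a topological ordering can be checked constructively. The sketch below is one possible implementation (Kahn's algorithm, reusing the parent-set representation as an assumption):

```python
def topological_order(g):
    """Return a topological ordering of a DAG given as {node: set of parents},
    or None if the graph contains a directed cycle (Kahn's algorithm)."""
    remaining = {x: set(pa) for x, pa in g.items()}
    order = []
    while remaining:
        # Nodes all of whose parents have already been placed in the order.
        roots = [x for x, pa in remaining.items() if not pa]
        if not roots:
            return None  # every remaining node has a parent: a cycle exists
        for x in sorted(roots):  # sorted only to make the output deterministic
            order.append(x)
            del remaining[x]
        for pa in remaining.values():
            pa.difference_update(roots)
    return order

print(topological_order({"A": set(), "B": {"A"}, "C": {"A", "B"}}))  # ['A', 'B', 'C']
print(topological_order({"A": {"B"}, "B": {"A"}}))                   # None (cycle)
```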

An **undirected graph** is a pair \((\mathcal{V},\mathcal{E})\)
where \(\mathcal{E}\) is a set of unordered pairs of nodes also
called **edges**. We denote an edge as \(X-Y\). We say that nodes
\(X\) and \(Y\) are **adjacent** or **neighbors** if \(X-Y\) is in
\(\mathcal{E}\). The set of neighbors of a node \(X\) is denoted
by \(\nb{X}\). The **degree of a node** is the number of neighbors
it has. Some of the definitions for directed graphs apply to
undirected graphs: A trail is a sequence of nodes where
consecutive nodes are connected by edges and no edge is
repeated, and a loop is a trail starting and ending in the
same node.

We will be mostly interested in graphs where nodes have few
parents (if directed) or neighbors (if undirected). The class
of **sparse graphs** is a set of graphs
\(G_n=(\mathcal{V},\mathcal{E})\) parametrized by
\(n=|\mathcal{V}|\) and such that \(|\mathcal{E}|=O(n)\).
Informally, we say that a graph is sparse if \(|\mathcal{E}|
\ll |\mathcal{V}|^2\).

## 2 Markov property for acyclic directed graphs

We are interested in directed graphs representing probabilistic independences among random variables. Thus, we will consider graphs whose vertex set is a collection of random variables. This allows us to discuss whether two nodes are (probabilistically) independent and whether two variables are connected.

If two sets of random variables \(\mathcal{X}\) and \(\mathcal{Y}\) are independent conditional on a set of random variables \(\mathcal{Z}\) we write

\[ \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \, . \]

So consider a directed graph \(G=(\mathcal{V},\mathcal{E})\) where \(\mathcal{V}\) is a collection of random variables defined in a probability space inducing conditional joint probability distributions \(p(\mathcal{X}|\mathcal{Y})\) for any subsets \(\mathcal{X},\mathcal{Y} \subseteq \mathcal{V}\) (whenever \(p(\mathcal{Y})>0\)). Our goal here is two-fold: We want to assign a semantics to the graph that captures probabilistic (in)dependence relations, and we want to exploit this correspondence to efficiently represent joint probability distributions. Let us start with the first goal. The following example helps motivate the chosen semantics.

(Taken from [CDLS1999].) The chances of developing
*dyspnoea* are higher in patients who have had *tuberculosis*,
*lung cancer* or *bronchitis*. A recent *visit to Asia*
increases the chances of tuberculosis, while *smoking* is a
risk factor for both *lung cancer* and *bronchitis*. An
*x-ray* does not distinguish between *lung cancer* and
*tuberculosis*, and neither does the presence or absence of
*dyspnoea*. The graph in Figure [BROKEN LINK: fig:asia] encodes these
probabilistic relations among the variables.

The key probabilistic property represented by the graph structure is the following.

- **Markov property**: We say that a joint probability distribution (or joint probability mass function or joint probability density) satisfies (or respects) the Markov property wrt a directed graph \(G=(\mathcal{V},\mathcal{E})\) if for every node/random variable \(X\) it follows that

\[ X \ind \nd{X} \mid \pa{X} \, . \]

That is, a joint distribution \(p\) satisfies the Markov property if

\[ p(X|\nd{X}) = p(X|\pa{X}) \, , \] and \[ p(X,\nd{X}-\pa{X}|\pa{X}) = p(X|\pa{X})p(\nd{X}-\pa{X}|\pa{X}) \, , \] whenever these conditional distributions are defined.

According to the Markov property, two variables are independent
if they are not connected by any directed path *and* their
common ancestors have their values known. The Markov property
thus allows us to represent causal reasoning by reading the
arcs as causal relations. In this case it simply states that
two phenomena that have no directed connection are independent
given their common causes.
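As a quick sanity check of the Markov property, consider the chain \(A \rightarrow B \rightarrow C\): the non-descendant \(A\) should be independent of \(C\) given the parent \(B\). The sketch below (all probability values invented for illustration) verifies \(p(C|A,B)=p(C|B)\) by brute-force enumeration:

```python
from itertools import product

# Arbitrary local distributions for the chain A -> B -> C (made-up numbers).
pA = {0: 0.6, 1: 0.4}
pB_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}  # pB_given_A[a][b]
pC_given_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}  # pC_given_B[b][c]

# Joint built via p(A,B,C) = p(A) p(B|A) p(C|B).
joint = {(a, b, c): pA[a] * pB_given_A[a][b] * pC_given_B[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def marg(**fixed):
    """Sum the joint over all configurations matching the fixed values."""
    return sum(p for (a, b, c), p in joint.items()
               if all(v == dict(A=a, B=b, C=c)[k] for k, v in fixed.items()))

# Markov property for C: p(C|A,B) = p(C|B) at every configuration.
for a, b, c in product((0, 1), repeat=3):
    lhs = marg(A=a, B=b, C=c) / marg(A=a, B=b)
    rhs = marg(B=b, C=c) / marg(B=b)
    assert abs(lhs - rhs) < 1e-12
```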

## 3 Factorization

The Markov property provides a sensible semantics for a graphical representation of probabilistic (in)dependences. We will see later that this semantics is optimal in a precise sense. It remains to show the connection between this semantics and efficient representation. The following result shows that the joint distribution \(p\) can be efficiently encoded if \(G\) is sparse. First, define \(p(X|\pa{X})\) to be zero at any point where \(p(\pa{X})\) is zero. Note that if \(p(\mathcal{X})\) is zero at some point then \(p(\mathcal{X},\mathcal{Y})\) is also zero there.

If a joint probability distribution \(p\) satisfies the Markov property wrt an acyclic directed graph \(G=(\mathcal{V},\mathcal{E})\), then

\[ p(\mathcal{V}) = \prod_{X \in \mathcal{V}} p(X|\pa{X}) \, , \]

where for convenience we define \(p(X|\emptyset)=p(X)\).

Before proving that result, we need the following proposition.

**Decomposition property**: If \(\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}\) and \(\mathcal{S} \subset \mathcal{Y}\), then \(\mathcal{X} \ind \mathcal{S} \mid \mathcal{Z}\).

Let \(\mathcal{T}=\mathcal{Y} - \mathcal{S}\). We have by the total rule and the chain rule for variables that

\begin{align*} p(\mathcal{X},\mathcal{S}|\mathcal{Z}) &= \sum_{\mathcal{T}} p(\mathcal{X}, \mathcal{S}, \mathcal{T}|\mathcal{Z}) \\ &= \sum_{\mathcal{T}} p(\mathcal{X},\mathcal{Y}|\mathcal{Z}) \\ &= \sum_{\mathcal{T}} p(\mathcal{X}|\mathcal{Y},\mathcal{Z}) p(\mathcal{Y}|\mathcal{Z}) \, . \end{align*}

Since \(\mathcal{X}\) and \(\mathcal{Y}\) are independent given \(\mathcal{Z}\), it follows that

\[ p(\mathcal{X}|\mathcal{Y},\mathcal{Z})=p(\mathcal{X}|\mathcal{Z}) \, . \]

Hence,

\begin{align*} p(\mathcal{X},\mathcal{S}|\mathcal{Z}) &= \sum_\mathcal{T} p(\mathcal{X}|\mathcal{Z}) p(\mathcal{Y}|\mathcal{Z}) \\ & = p(\mathcal{X}|\mathcal{Z}) \sum_{\mathcal{T}} p(\mathcal{Y}|\mathcal{Z}) \\ & = p(\mathcal{X}|\mathcal{Z}) p(\mathcal{S}|\mathcal{Z}) \, . \end{align*}

The second equality used the fact that \(p(\mathcal{X}|\mathcal{Z})\) is constant with respect to \(\mathcal{T}\).

We can now prove the theorem.

Let \(<\) be a total topological ordering of the nodes/variables in \(G\) and for each \(X\) denote by \(\mathcal{A}_X=\{Y: Y < X\}\) the set of variables that are smaller than \(X\). We have by the Chain Rule for variables that

\[ p(\mathcal{V})=\prod_{X \in \mathcal{V}} p(X|\mathcal{A}_X) \, . \]

For each \(X\), \(\mathcal{A}_X\) is a subset of \(\nd{X}\) and contains \(\pa{X}\). Hence, by applying the Decomposition Property with \(\mathcal{X}=\{X\}\), \(\mathcal{Y}=\nd{X}-\pa{X}\), \(\mathcal{Z}=\pa{X}\) and \(\mathcal{S}=\mathcal{A}_X-\pa{X}\), it follows that \(p(X|\mathcal{A}_X)=p(X|\pa{X})\).

The distributions \(p(X|\pa{X})\) are called **local
distributions**. The Factorization Theorem allows us to specify
a discrete joint probability distribution over \(\mathcal{V}\)
by

\[ \sum_{X \in \mathcal{V}} (|\text{dom}(X)|-1)\prod_{Y \in \pa{X}} |\text{dom}(Y)| \]

numbers. When \(G\) is sparse, this is much smaller than the \(\prod_{X \in \mathcal{V}} |\text{dom}(X)| - 1\) values required to specify the joint probability directly. For instance, if \(k\) is an upper bound on the in-degree of any node and \(v\) is an upper bound on the cardinality of (the domain of) any variable, then the number of values required to specify a Bayesian network over \(n\) variables is \(O(n \cdot v^{k+1})\), which for fixed \(k\) and \(v\) grows only linearly in \(n\).

Any joint probability distribution satisfying the Markov property with respect to the graph in Figure [BROKEN LINK: fig:asia] factorizes as \[ p(A,S,T,L,E,B,X,D) = p(A)p(S)p(T|A)p(L|S)p(E|T,L)p(B|S)p(X|E)p(D|E,B) \, . \]

Considering that all variables are binary, this specification requires \(1 + 1 + 1\cdot 2 + 1\cdot 2 + 1\cdot 2 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 \cdot 2=18\) values (compare with the \(2^8-1=255\) values required by the joint distribution).
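The parameter count above can be computed mechanically from the structure. A small sketch (the Asia graph hard-coded from the factorization above, all variables binary; the function name is an assumption of this sketch) reproduces the 18 values:

```python
# Asia structure as {node: set of parents}, read off the factorization above.
asia = {"A": set(), "S": set(), "T": {"A"}, "L": {"S"},
        "E": {"T", "L"}, "B": {"S"}, "X": {"E"}, "D": {"E", "B"}}

def num_parameters(g, card):
    """Number of free values in the local distributions:
    sum over X of (|dom(X)| - 1) * prod over parents Y of |dom(Y)|."""
    total = 0
    for x, pa in g.items():
        rows = 1
        for y in pa:
            rows *= card[y]
        total += (card[x] - 1) * rows
    return total

card = {x: 2 for x in asia}  # all variables binary
print(num_parameters(asia, card))  # 18, versus 2**8 - 1 = 255 for the full joint
```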

## 4 Bayesian network definition

The Markov property and the factorization theorem give us a graphical representation that allows for a convenient description of probabilistic (in)dependence while enabling a contextual specification of a probability distribution in terms of local models:

A **Bayesian network** consists of

- an acyclic directed graph \(G=(\mathcal{V},\mathcal{E})\), where \(\mathcal{V}\) is a collection of random variables,
- a joint probability distribution \(p\) over \(\mathcal{V}\) satisfying the Markov property wrt \(G\).

In practice, a Bayesian network is represented by its graph
\(G\) and the collection of local conditional probability
distributions \[ p(X|\pa{X}) \text{ for all } X \in
\mathcal{V}. \] The graph \(G\) is often called the
**structure** of the Bayesian network.

## 5 Barren nodes

The following concept of barren nodes will help us connect factorization back to the Markov property.

- **Barren node**: A node \(X\) in an acyclic directed graph \(G\) is
**barren** wrt a set of variables/nodes \(\mathcal{S}\) if \((\{X\} \cup \de{X}) \cap \mathcal{S}=\emptyset\).

Equivalently, a node is barren wrt \(\mathcal{S}\) if it is a leaf node not in \(\mathcal{S}\), either in \(G\) itself or in some graph obtained from \(G\) by a sequence of removals of barren nodes.

Let \(p_G\) be a distribution that satisfies the Markov property wrt a graph \(G=(\mathcal{V},\mathcal{E})\), \(\mathcal{B}\) be a set of barren nodes in \(G\) with respect to a node set \(\mathcal{S}\), and \(H=(\mathcal{V}-\mathcal{B},\mathcal{E}')\) be the graph obtained by removing the nodes (and respective arcs) in \(\mathcal{B}\) from \(G\). Then \( p_G(\mathcal{S}) = p_H(\mathcal{S}) \), where \(p_H(\mathcal{V}-\mathcal{B}) = \prod_{X \in \mathcal{V}-\mathcal{B}} p_G(X|\pa{X})\).

First note that

\begin{align*} p_G(\mathcal{S}) &= \sum_{\mathcal{V}-\mathcal{S}} \prod_{X \in \mathcal{V}} p_G(X|\pa{X}) \\ &= \sum_{\mathcal{V}-\mathcal{S}-\mathcal{B}} \prod_{X \in \mathcal{V}-\mathcal{B}} p_G(X|\pa{X}) \sum_{\mathcal{B}} \prod_{X \in \mathcal{B}} p_G(X|\pa{X}) . \end{align*}

Now for any \(X\) it follows that \(\sum_{X} p(X|\pa{X}) = 1\). Thus, by performing the eliminations \(\sum_{X}\) for each \(X \in \mathcal{B}\) in reverse topological order, we have (by an inductive argument) that \(\sum_{\mathcal{B}} \prod_{X \in \mathcal{B}} p_G(X|\pa{X}) = 1\), which proves the result.

A simple corollary of the previous result is that if \(\mathcal{B}\) is a set of barren nodes wrt \(\mathcal{R} \cup \mathcal{S}\), then \(p_G(\mathcal{R}|\mathcal{S})=p_H(\mathcal{R}|\mathcal{S})\), where \(H\) is the graph obtained by removing \(\mathcal{B}\). This holds since the Proposition can be applied to both the numerator and the denominator of the definition of conditional distribution. Another simple corollary is the following.

If \((\mathcal{R},\mathcal{S})\) is a partition of the nodes of a Bayesian network structure \(G\) such that \(\mathcal{S}\) is barren (wrt \(\mathcal{R}\) in \(G\)) then \(p(\mathcal{R})=\prod_{X \in \mathcal{R}} p(X|\pa{X})\).

The graph obtained by removing \(\mathcal{S}\) is a Bayesian network over \(\mathcal{R}\) and hence encodes a distribution \(p(\mathcal{R})\) which factorizes as the product of local conditional distributions.
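Computationally, these results justify pruning barren nodes before answering a query. A minimal sketch (hypothetical function name, reusing the parent-set representation) that repeatedly removes leaves outside the query set:

```python
def prune_barren(g, query):
    """Repeatedly drop leaf nodes outside the query set.
    g maps each node to its set of parents; returns the pruned graph."""
    g = {x: set(pa) for x, pa in g.items()}
    while True:
        # Nodes that appear in some parent set, i.e. have at least one child.
        with_children = set().union(*g.values()) if g else set()
        barren = [x for x in g if x not in query and x not in with_children]
        if not barren:
            return g
        for x in barren:
            del g[x]

# Asia structure: querying p(T) lets us discard everything but A and T.
asia = {"A": set(), "S": set(), "T": {"A"}, "L": {"S"},
        "E": {"T", "L"}, "B": {"S"}, "X": {"E"}, "D": {"E", "B"}}
print(sorted(prune_barren(asia, {"T"})))  # ['A', 'T']
```

By the corollary, the marginal \(p(T)\) computed in the pruned network \(A \rightarrow T\) equals the one in the full network, at a fraction of the cost.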

We can now prove the following result.

If the joint distribution factorizes according to an acyclic directed graph \(G\), then it satisfies the Markov property wrt \(G\).

The set \(\de{X}\) is barren wrt \(\{X\} \cup \nd{X}\), thus by the Barren Node Corollary,

\[ p(X,\nd{X}) = p(X|\pa{X}) \prod_{Y \in \nd{X}} p(Y|\pa{Y}) . \]

Similarly, \(\{X\} \cup \de{X}\) is barren wrt \(\nd{X}\), hence

\[ p(\nd{X}) = \prod_{Y \in \nd{X}} p(Y|\pa{Y}) . \]

Combining these results, we have that

\begin{align*} p(X,\nd{X}|\pa{X}) &= p(X|\pa{X}) \frac{\prod_{Y \in \nd{X}} p(Y|\pa{Y})}{p(\pa{X})} \\ &= p(X|\pa{X}) p(\nd{X}|\pa{X}) \, , \end{align*}

which is precisely the Markov property for \(X\).

Thus the Markov property and the factorization property are
largely equivalent: one can start with a set of independence
assumptions and conclude a factorization of the joint
distribution in terms of local conditional distributions, or
one can start with a collection of local conditional
distributions inducing an acyclic directed graph and derive a
joint distribution satisfying the Markov property.^{2}

From now on we will consider Bayesian networks as serving two roles: as a model of probabilistic (in)dependences through its Markov properties, and as an efficient representation of a multivariate joint distribution in terms of local conditional distributions.

## 6 Some common structures

A **tree** is a graph where each node has at most one parent
(note that this definition allows disconnected graphs). A
Bayesian network is **tree-shaped** if its structure is a tree.
As we will see later, trees are particularly important as
inference (querying) can be performed in linear time on them.
A **naive Bayes model** is a specific type of tree-shaped
Bayesian network with a single root variable which has the
remaining variables as children, as in the graph on the left
in Figure [BROKEN LINK: fig:nets]. Naive Bayes is a common model for building
classifiers in many domains, for instance spam detection in
emails. The root variable represents the message type (spam or
ham) and the leaves indicate textual features such as word
frequencies and amount of capitalization. Classification then
amounts to computing the conditional distribution of the root
variable given the observed leaf variables.
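As an illustration of the naive Bayes factorization \(p(C,F_1,\dotsc,F_n)=p(C)\prod_i p(F_i|C)\), the sketch below classifies a message from two word features (all probabilities are invented for illustration; a real spam filter would estimate them from data):

```python
# Naive Bayes spam sketch: class C in {"spam", "ham"}, binary word features.
p_class = {"spam": 0.3, "ham": 0.7}
# p_feature[f][c] = probability that feature f is present given class c.
p_feature = {"viagra": {"spam": 0.4, "ham": 0.01},
             "meeting": {"spam": 0.05, "ham": 0.3}}

def posterior(present):
    """p(C | features): p(C) * prod_i p(F_i | C), then normalize."""
    scores = {}
    for c, prior in p_class.items():
        s = prior
        for f, p in p_feature.items():
            s *= p[c] if f in present else 1 - p[c]
        scores[c] = s
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

post = posterior({"viagra"})
print(post["spam"] > post["ham"])  # True: classified as spam
```

Note that the normalization constant \(z\) is the marginal probability of the observed features, so no inference machinery beyond the local distributions is needed.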

**Bipartite Bayesian networks** are Bayesian networks where the
nodes can be partitioned into two sets such that all arcs
originate in one set and point to a variable in the other set.
The network on the right in Figure [BROKEN LINK: fig:nets] is bipartite. Note
that a naive Bayes is the simplest case of a bipartite
Bayesian network (in terms of structure). Note also that a
bipartite network might contain loops and is therefore not
necessarily tree-shaped.

Recall that a **path** is a sequence of nodes
\(X_1,\dotsc,X_n\), all distinct except possibly the first and
the last, such that \(X_i,X_{i+1}\) is an edge of the graph for each
\(i=1,\dotsc,n-1\). An acyclic directed graph is **singly
directed** if there is at most one directed path between any
two nodes. A **polytree-shaped Bayesian network** has a
singly directed graph. Note that bipartite networks are also
polytree-shaped. Figure [BROKEN LINK: fig:polytree] depicts a polytree-shaped
Bayesian network which is not bipartite. The Asia Bayesian
network in Figure [BROKEN LINK: fig:asia] is not polytree-shaped as e.g. the
node \(D\) is reachable from node \(S\) through two different
paths.

## 7 Exercises

- Consider a Bayesian network with four binary variables and
structure \[ X_1 \rightarrow X_2 \rightarrow X_3
\rightarrow X_4 \, . \] Assume that the conditional
probability values parametrizing the model are \(
p(X_i=1|X_{i-1}=0) = a \), \( p(X_i=1|X_{i-1}=1)=b \), for
\(i=2,3,4\), and \( p(X_1=1) = c \). Compute the following
probabilities as a function of \(a\), \(b\) and \(c\):
- \(p(X_1=1,X_2=0,X_3=1,X_4=0)\)
- \(p(X_4=1|X_1=1)\)
- \(p(X_1=1|X_4=1)\)
- \(p(X_3=1|X_1=0,X_4=1)\)

- Compute the number of probability values necessary to specify a tree-shaped Bayesian network with \(n\) binary variables.
- Compute the number of probability values necessary to specify a bipartite Bayesian network with \(m\) root variables and \(n\) leaves, and maximum in-degree \(d\). The root variables are binary and the leaf variables are ternary.
- Prove that the leaves of a naive Bayes model are pairwise independent conditional on the root node.
- A nuclear power plant has an alarm that triggers when the
temperature exceeds a certain threshold. A sensor measures
the temperature. Call \(A\) a binary variable that denotes
whether the alarm has triggered, \(F_A\) a binary variable
that represents whether the alarm is defective, \(F_G\) a
binary variable representing whether the sensor is
defective, \(G\) a variable representing the possible
measurements of the sensor and \(T\) the temperature. The
probability that the sensor is defective increases when the
temperature is high and decreases otherwise.
- Draw the structure of a Bayesian network representing the given knowledge.

# Bibliography

- [CDLS1999] Cowell, Dawid, Lauritzen & Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag (1999).

## Footnotes:

^{1}

A binary relation is simply a collection of *ordered* pairs of
elements from a domain; in our case these are the vertices.

^{2}

There are some pathological cases involving continuous sample spaces where there is no joint distribution consistent with a collection of conditional distributions.