Bayesian Networks

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\ind}{⫫} $

Any probabilistic phenomenon can be described by a joint probability distribution over a set of random variables. However, directly specifying and manipulating multivariate probability distributions is difficult and inefficient. For example, a joint probability distribution over $n$ binary random variables is characterized by its joint probability mass function, which in turn is described by $2^n$ numbers. Clearly, representing knowledge in this way does not scale to domains with more than a handful of binary variables. So our goal here will be to find a convenient representation of multivariate probability distributions that scales to a large number of random variables. The key to achieving this goal is the correspondence between (conditional) independence and factorization. For example, consider the task of describing the outcomes of the tosses of three coins represented by binary variables $A,B,C$. If we assume that the variables are independent, then the joint distribution factorizes as $P(A,B,C)=P(A)P(B)P(C)$. Hence, instead of the $2^3-1=7$ numbers we need only $3\cdot(2^1-1)=3$ numbers to fully specify the probability function. Full independence is easy to characterize, but what about more complex phenomena, such as intercausal reasoning, where $A$ and $B$ are unconditionally independent but conditionally dependent given $C$? This is the purpose of Bayesian networks: to provide a compact and intuitive language in which to express independence statements and (consequently) factorized probability functions.
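To make the parameter counting concrete, here is a minimal Python sketch (with made-up marginal probabilities) that reconstructs the full $2^3$-entry joint table of the three coins from the 3 numbers required by the factorized specification.

```python
from itertools import product

# Marginal probabilities of heads for three independent coins (made-up values).
p = {"A": 0.5, "B": 0.3, "C": 0.8}

# The factorized specification needs one number per coin (3 parameters), yet it
# determines all 2**3 = 8 entries (7 free parameters) of the joint table.
joint = {}
for a, b, c in product([0, 1], repeat=3):
    pa = p["A"] if a else 1 - p["A"]
    pb = p["B"] if b else 1 - p["B"]
    pc = p["C"] if c else 1 - p["C"]
    joint[(a, b, c)] = pa * pb * pc  # P(A,B,C) = P(A)P(B)P(C)

assert abs(sum(joint.values()) - 1.0) < 1e-12
print(len(joint), "joint entries reconstructed from", len(p), "numbers")
```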

A few concepts from graph theory

A directed graph (digraph, for short) is a pair $(\mathcal{V},\mathcal{A})$ where $\mathcal{V}$ is a set of vertices or nodes and $\mathcal{A}$ is a binary relation on $\mathcal{V}$.1 The ordered pairs $(X,Y)$ in $\mathcal{A}$ are called arcs and represented as $X \rightarrow Y$. If $X \rightarrow Y$ is an arc of the digraph we say that $X$ is a parent of $Y$, and that $Y$ is a child of $X$. The sets of parents and children of a node $X$ are denoted, respectively, by $\pa{X}$ and $\ch{X}$. The in-degree of a node $X$ is the number of its parents $|\pa{X}|$, and the out-degree is the number of its children $|\ch{X}|$. A node with in-degree zero is called a root, and a node with out-degree zero is called a leaf.

A trail is a sequence of nodes $X_1,\dotsc,X_k$ such that for $i=1,\dotsc,k-1$ either $X_i \rightarrow X_{i+1}$ or $X_{i+1} \rightarrow X_i$, and each arc appears at most once. That is, a trail is a way of going from one node to another by following arcs in any direction but without traversing the same arc twice. The length of a trail is the number of arcs it contains (the number of nodes minus one). A loop is a trail starting and ending at the same node. Two nodes are connected if there is a trail starting at one of the nodes and ending at the other one; otherwise they are separated. A path is a trail in which all nodes are distinct, except that the first and last node may coincide. A directed path is a path that follows the direction of the arcs (i.e., consecutive nodes $X_i, X_{i+1}$ appear in the path only if $X_i \rightarrow X_{i+1}$). A cycle is a directed path starting and ending at the same node, for example, $A \rightarrow B \rightarrow C \rightarrow A$.

If there is a directed path from $X$ to $Y$ we say that $X$ is an ancestor of $Y$ and that $Y$ is a descendant of $X$. The descendants (resp., ancestors) of $X$ are denoted $\de{X}$ (resp., $\an{X}$). The complement of $\{X\} \cup \de{X}$ with respect to $\mathcal{V}$ is called the non-descendants of $X$ and written $\nd{X}$.

A digraph is acyclic if it has no directed cycles, that is, if no node is an ancestor of itself. A topological ordering of the nodes is a total ordering $<$ such that $X < Y$ whenever $X$ is an ancestor of $Y$. A directed graph has a topological ordering if and only if it is acyclic. A directed tree is an acyclic digraph in which one node (the root) has no parents and every other node has exactly one parent.

The figure below shows an example of an acyclic digraph with a topological ordering $A$, $T$, $S$, $L$, $E$, $X$, $B$, $D$. The parents of $E$ are $\pa{E}=\{T,L\}$ and its children are $\ch{E} = \{X,D\}$. There is a loop $S,L,E,D,B,S$ (which is not a cycle). The ancestors of $E$ are $\an{E}=\pa{E} \cup \{A,S\}$. The descendants of $T$ are $\de{T}=\{ E,X,D \}$. Its non-descendants are $\nd{T}=\{ A, L, S, B \}$. $A$ and $S$ are root nodes and $X$ and $D$ are leaf nodes. The in-degree and out-degree of $E$ are both two.

An acyclic directed graph
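As a small aside, a topological ordering such as the one above can be computed with Kahn's algorithm. The Python sketch below runs it on the arc list of the digraph in the figure; the ordering it produces may differ from the one listed in the text, since topological orderings are generally not unique.

```python
# Arcs of the digraph in the figure (the same digraph appears later as the "Asia" network).
arcs = [("A", "T"), ("S", "L"), ("S", "B"), ("T", "E"),
        ("L", "E"), ("E", "X"), ("E", "D"), ("B", "D")]
nodes = {x for arc in arcs for x in arc}

# Kahn's algorithm: repeatedly output a node with in-degree zero (a root) and
# remove its outgoing arcs.  It succeeds if and only if the digraph is acyclic.
indeg = {x: 0 for x in nodes}
children = {x: [] for x in nodes}
for u, v in arcs:
    indeg[v] += 1
    children[u].append(v)

order = []
roots = sorted(x for x in nodes if indeg[x] == 0)
while roots:
    u = roots.pop(0)
    order.append(u)
    for v in children[u]:
        indeg[v] -= 1
        if indeg[v] == 0:
            roots.append(v)

assert len(order) == len(nodes), "the digraph has a directed cycle"
print(order)   # one valid topological ordering, e.g. A, S, T, L, B, E, X, D
```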

An undirected graph (or just graph) is a pair $(\mathcal{V},\mathcal{E})$ where $\mathcal{E}$ is a set of unordered pairs of nodes called edges. We denote an edge as $X-Y$, and say that nodes $X$ and $Y$ are adjacent or neighbors. The set of neighbors of a node $X$ is denoted by $\nb{X}$. The degree of a node is the number of neighbors it has. Some of the definitions for directed graphs apply to undirected graphs: A trail is a sequence of nodes where consecutive nodes are connected by edges and no edge is repeated, and a loop is a trail starting and ending in the same node.

We will be mostly interested in graphs where nodes have few parents (if directed) or neighbors (if undirected). The class of sparse graphs is a set of graphs $G_n=(\mathcal{V},\mathcal{E})$ parametrized by $n=|\mathcal{V}|$ and such that $|\mathcal{E}|=O(n)$. Informally, we say that a graph is sparse if $|\mathcal{E}| \ll |\mathcal{V}|^2$.

Markov property for acyclic directed graphs

We are interested in directed graphs representing probabilistic independences among random variables. Thus, we will consider digraphs whose vertex set is a collection of random variables. This allows us to discuss whether two nodes are (probabilistically) independent and whether two variables are connected, mixing concepts of graph theory and probability theory. As independence is a central concept, we develop a special notation:

If two sets of random variables $\mathcal{X}$ and $\mathcal{Y}$ are independent conditional on a set of random variables $\mathcal{Z}$ we write $$ \mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z} \, . $$

Note that independence is a symmetric property, hence $\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$ implies $\mathcal{Y} \ind \mathcal{X} \mid \mathcal{Z}$.

Consider an acyclic digraph $G=(\mathcal{V},\mathcal{E})$ where $\mathcal{V}$ is a collection of random variables. Our goal here is two-fold: We want to assign a semantics to the graph that captures probabilistic (in)dependence relations in a sensible way, and we want to exploit this correspondence to efficiently represent joint probability distributions. We will start with the first goal.

Directed graphs represent two types of relationship between nodes: a direct relationship, represented by the presence of an arc between nodes, and an indirect relationship, represented by a directed path between nodes. Arguably, the most intuitive way to translate such graphical properties into probabilistic reasoning is to use arcs to represent direct probabilistic influence (i.e., dependence) between two random variables, and to use directed paths to represent indirect probabilistic influences that are mediated by other quantities. This can be understood more clearly with an example.

Let us revisit the example of genetic inheritance of coat color in guinea pigs. The fur color pattern $O$ of a guinea pig is determined by the genetic material $H$ of the specimen, as well as by environmental factors: $D$, the post-conception pre-birth developmental factors, and $E$, the post-birth developmental factors. The genetic material $H$ is probabilistically determined by the genetic materials $H_F$ and $H_M$ of the female and male parents, respectively. These genetic factors in turn determine the color patterns $F$ and $M$ of the female and male parents.

To obtain a directed graphical description of that knowledge, we identify the parents of each node $X$ as the minimal set of random variables we need to know in order to specify its probability. This way, the parents of $X$ are the random variables whose influence is not mediated by any other variable, thus giving a meaning to a direct relationship. For example, the probability of $H$ can be specified given knowledge of $H_F$ and $H_M$, and the influence of other factors is mediated by those variables. Similarly, $O$ depends on $H$, $E$ and $D$, and the influence of other quantities, say $H_F$ and $H_M$, is mediated by these variables. Also, the probability of $E$ can be specified without any further knowledge, and there does not seem to exist a quantity that mediates influence on $E$; we thus specify the parent set of $E$ as empty. Following this reasoning for all the random variables, we obtain the following digraph.

Digraph representation of the inheritance of coat color in guinea pigs

Here is another example, taken from the book by Cowell et al. (1999).

The chances of developing dyspnea ($D$) are higher in patients that had tuberculosis ($T$), lung cancer ($L$) or bronchitis ($B$). A recent visit to Asia ($A$) increases the chances of tuberculosis, while smoking ($S$) is a risk factor for both lung cancer and bronchitis. An X-ray ($X$) does not distinguish between lung cancer and tuberculosis, and neither does the presence or absence of dyspnea. Let $E$ denote the disjunction of $T$ and $L$. Following the same procedure as before (take as parents of each node the minimal set of nodes needed to specify its conditional probability), we reach the acyclic digraph below.

Digraph of the Asia Bayesian network

There are two important points to notice in the previous examples. First, each random variable depends on its parents no matter the context (or observation given). That is, there is no observation of the other random variables that can render a variable independent of its parents. This gives a formal meaning to the idea of a direct probabilistic influence. Second, once the values of the parents are known, the variable becomes independent of any random variable that is not a descendant. This formalizes the idea of an indirect probabilistic influence. This second property is known as the Markov property for directed graphs, and as we will see is crucial for the development of Bayesian networks.

We say that a joint probability distribution (or joint probability mass function or joint probability density) satisfies (or respects) the Markov property with respect to an acyclic digraph $G$ if for every node/random variable $X$ it follows that $$ X \ind \nd{X} \mid \pa{X} \, . $$

That is, a joint distribution $P$ satisfies the Markov property if $$ P(X|\nd{X}) = P(X|\pa{X}) \, , $$ and $$ P(X,\nd{X}-\pa{X}|\pa{X}) = P(X|\pa{X}) P(\nd{X}-\pa{X}|\pa{X}) \, , $$ whenever these conditional distributions are defined.

According to the Markov property any two variables that are not connected by a directed path (hence, one is not a descendant of the other) are conditionally independent once the parents of one of them are known.

These are the Markov properties represented by the digraph of the coat color problem.

$$ \begin{gather*} O \ind F, H_F, H_M, M \mid E, H, D \\ E \ind H,D,F,H_F,M,H_M \\ D \ind H,E,F,H_F,M,H_M \\ H \ind E, D, F, M \mid H_F, H_M \\ F \ind H_M, M, H, E, D, O \mid H_F \\ M \ind H_F, F, H, E, D, O \mid H_M \\ H_F \ind H_M, M, E, D \\ H_M \ind H_F, F, E, D
\end{gather*} $$
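These statements can be generated mechanically from the parent sets. Below is a small Python sketch (using shorthand names such as HF for $H_F$) that computes the descendants of each node from the coat-color digraph and prints the corresponding Markov property; its output reproduces the list above (up to ordering).

```python
# Parent sets of the coat-color digraph, as elicited in the text.
parents = {
    "HF": [], "HM": [], "E": [], "D": [],
    "F": ["HF"], "M": ["HM"], "H": ["HF", "HM"], "O": ["H", "E", "D"],
}

def descendants(x):
    """All nodes reachable from x by a directed path."""
    children = {v: [c for c, ps in parents.items() if v in ps] for v in parents}
    stack, seen = list(children[x]), set()
    while stack:
        y = stack.pop()
        if y not in seen:
            seen.add(y)
            stack.extend(children[y])
    return seen

# Markov property: X is independent of its non-descendants given its parents
# (here the parents are removed from the left-hand side, as in the list above).
for x, pa in parents.items():
    nd = set(parents) - {x} - descendants(x) - set(pa)
    print(f"{x} ⫫ {sorted(nd)} | {sorted(pa)}")
```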

Factorization

The Markov property provides a sensible semantics for a directed graphical representation of probabilistic (in)dependences, one that distinguishes direct influences from indirect influences. We will see later that this semantics is optimal in a precise sense. It remains to show the connection between this semantics and compactness of representation. The following result shows that a joint distribution $P$ satisfying the Markov property with respect to an acyclic digraph $G$ can be compactly encoded as long as $G$ is sparse. To this end, define $P(X|\pa{X})$ to be zero at any point where $P(\pa{X})$ is zero. Note that if $P(X)$ is zero at some point then $P(X,Y)$ is also zero at that point.

If a joint probability distribution $P$ satisfies the Markov property with respect to an acyclic directed graph $G=(\mathcal{V},\mathcal{E})$, then $$ P(\mathcal{V}) = \prod_{X \in \mathcal{V}} P(X|\pa{X}) \, , $$ where for convenience we define $P(X|\emptyset)=P(X)$.

To prove the result, we need the following proposition that states that conditional independence of a larger set of random variables implies conditional independence of any subset.

If $\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$ and $\mathcal{S} \subset \mathcal{Y}$ then $\mathcal{X} \ind \mathcal{S} \mid \mathcal{Z}$.

Let $\mathcal{T}=\mathcal{Y} - \mathcal{S}$. We have that $$ \begin{align*} P(\mathcal{X},\mathcal{S}|\mathcal{Z}) & = \sum_{\mathcal{T}} P(\mathcal{X}, \mathcal{S}, \mathcal{T}|\mathcal{Z}) \\\\ &= \sum_{\mathcal{T}} P(\mathcal{X},\mathcal{Y}|\mathcal{Z}) \\\\ &= \sum_{\mathcal{T}} P(\mathcal{X}|\mathcal{Y},\mathcal{Z}) P(\mathcal{Y}|\mathcal{Z}) \, . \end{align*} $$ Using $\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$, we obtain $$ \begin{align*} P(\mathcal{X},\mathcal{S}|\mathcal{Z}) &= \sum_\mathcal{T} P(\mathcal{X}|\mathcal{Z}) P(\mathcal{Y}|\mathcal{Z}) \\\\ & = P(\mathcal{X}|\mathcal{Z}) \sum_{\mathcal{T}} P(\mathcal{Y}|\mathcal{Z}) \\\\ & = P(\mathcal{X}|\mathcal{Z}) P(\mathcal{S}|\mathcal{Z}) \, . \end{align*} $$ The second equality used the fact that $P(\mathcal{X}|\mathcal{Z})$ is constant with respect to $\mathcal{T}$.

We can now prove the theorem.

Let $<$ be a topological ordering of the nodes/variables in $G$ and for each $X$ denote by $\text{less}(X)=\{Y: Y < X\}$ the set of variables that are smaller than $X$. We have by the chain rule that $$ P(\mathcal{V})=\prod_{X \in \mathcal{V}} P(X|\text{less}(X)) \, . $$ Since $\text{less}(X)$ is a subset of $\nd{X}$ that contains $\pa{X}$, $$ X \ind \nd{X} \mid \pa{X} $$ implies $$ X \ind \text{less}(X) - \pa{X} \mid \pa{X} \, . $$ Thus, $P(X|\text{less}(X))=P(X|\pa{X})$.

The distributions $P(X|\pa{X})$ are called local distributions. The Factorization Theorem allows us to specify a discrete joint probability distribution over $\mathcal{V}$ by $$ \sum_{X \in \mathcal{V}} (|\text{dom}(X)|-1)\prod_{Y \in \pa{X}} |\text{dom}(Y)| $$ probability values $P(X=x|\pa{X}=y)$. When $G$ is sparse, this is much lower than the number of probability values required to specify the joint probability distribution directly. For instance, if $k$ is an upper bound on the in-degree of any node and $v$ is an upper bound on the cardinality of (the domain of) any variable, then the number of probability values required to specify a Bayesian network over $n$ variables is $O(n \cdot v^{k+1})$, which is linear in the number of variables when $k$ and $v$ are bounded. In comparison, specifying a joint probability table requires $O(v^n)$ values. The compactness of specification also extends to non-discrete models. For example, if all $n$ variables are Gaussian, then the joint probability distribution is a multivariate Gaussian whose direct specification requires $O(n^2)$ parameters, while a sparse graph representation requires $O(nk)$ parameters.

A joint probability distribution satisfying the Markov property with respect to the graph of guinea pig coat color problem factorizes as $$ \begin{multline*} P(O,D,E,H,H_F,H_M,F,M) = \\ P(H_F) P(H_M) P(F|H_F) P(M|H_M) P(E) P(D) P(H|H_F,H_M) P(O|D,E,H) \, . \end{multline*} $$ Considering that all variables are binary, this specification requires $1 + 1 + 1\cdot 2 + 1 \cdot 2 + 1 + 1 + 1 \cdot 2 \cdot 2 + 1 \cdot 2 \cdot 2 \cdot 2 = 20$ probability values, against the $2^8-1=255$ values required by the naive specification of the joint distribution.

A joint probability distribution satisfying the Markov property with respect to the graph of the visit to Asia problem factorizes as $$ \begin{multline*} P(A,S,T,L,E,B,X,D) = \\ P(A)P(S)P(T|A)P(L|S)P(E|T,L)P(B|S)P(X|E)P(D|E,B) \, . \end{multline*} $$ Assuming that all variables are binary, this specification requires $1 + 1 + 1\cdot 2 + 1\cdot 2 + 1\cdot 2 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 \cdot 2=18$ probability values, against the $2^8-1=255$ values required by the naive specification of the joint distribution.
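As a quick check of these counts, the short Python sketch below evaluates the formula of the Factorization Theorem's parameter count on the parent sets of the two networks above, with all variables binary.

```python
# Number of free parameters: sum_X (|dom(X)|-1) * prod_{Y in pa(X)} |dom(Y)|,
# here with every variable binary (|dom| = 2), as in the two examples above.
def num_parameters(parents, domain_size=2):
    return sum((domain_size - 1) * domain_size ** len(pa)
               for pa in parents.values())

coat = {"HF": [], "HM": [], "E": [], "D": [], "F": ["HF"], "M": ["HM"],
        "H": ["HF", "HM"], "O": ["D", "E", "H"]}
asia = {"A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
        "E": ["T", "L"], "X": ["E"], "D": ["E", "B"]}

print(num_parameters(coat))   # 20, against 2**8 - 1 = 255 for the full joint
print(num_parameters(asia))   # 18
```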

Bayesian networks

The Markov property and the factorization theorem give us a graphical representation that allows for a convenient description of probabilistic (in)dependence while enabling a compact specification of a joint probability distribution in terms of local models:

A Bayesian network consists of

  • an acyclic directed graph $G=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ is a collection of random variables, and

  • a joint probability distribution over $\mathcal{V}$ satisfying the Markov property with respect to $G$.

In practice, a Bayesian network is represented by its graph $G$ and the collection of local conditional probability distributions $$ P(X|\pa{X}) \text{ for all } X \in \mathcal{V} \, . $$ The graph $G$ is often called the structure of the Bayesian network.

We can build a Bayesian network for survival of the Titanic sinking by assuming that Gender ($G$) and Class ($C$) are independent random variables that directly influence Status ($S$), as shown in the digraph below.

Bayesian network structure for the Titanic sinking.

We can estimate the conditional probabilities as the relative counts:

$G$ $C$ $P(S=1 \mid G,C)$
male 1st 0.32
male 2nd 0.07
female 1st 0.98
female 2nd 0.89

The marginal distributions are $P(G=\text{male}) = 0.56$ and $P(C=\text{1st})= 0.53$. These conditional distributions and the digraph specify a Bayesian network whose joint probability distribution is given by $P(G,C,S)=P(G)P(C)P(S|G,C)$.
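As a small illustration of using the factorization, the following Python sketch computes the marginal probability of survival, $P(S=1)=\sum_{G,C} P(G)P(C)P(S=1 \mid G,C)$, from the numbers above (roughly 0.53 with these values).

```python
# Local distributions of the Titanic network (values from the tables above).
p_male = 0.56
p_first = 0.53
p_surv = {("male", "1st"): 0.32, ("male", "2nd"): 0.07,
          ("female", "1st"): 0.98, ("female", "2nd"): 0.89}

# Marginal probability of survival: sum over G and C of P(G)P(C)P(S=1|G,C).
p_s1 = 0.0
for g, pg in [("male", p_male), ("female", 1 - p_male)]:
    for c, pc in [("1st", p_first), ("2nd", 1 - p_first)]:
        p_s1 += pg * pc * p_surv[(g, c)]

print(round(p_s1, 3))   # about 0.526 with these numbers
```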

This example was taken from Norman Fenton’s note. Several recent studies suggest that smoking may reduce the severity of COVID-19 symptoms. These studies, however, are plagued by selection bias in their data, which originates from collecting information from a non-representative sample of the population. In particular, Miyara et al. collected data from 479 patients that tested positive ($T$) for COVID-19 at a large French hospital. According to the data, 64% of the smokers ($S$) developed severe symptoms ($D$), compared with 73% of the non-smokers. Selection bias in the data was observed by noticing that the sample contained only 6% of smokers, against the much higher rate of smokers in the French population, estimated to be nearly 26%, and that healthcare workers ($O$) were over-represented. The digraph below provides an alternative explanation in which the lower incidence of severe symptoms in smokers arises indirectly, through the influence of the occupation sector on the severity of the disease and on the probability of testing positive.

Bayesian network structure for the relation between Smoking and severity of COVID-19.

Fenton estimated the following conditional distributions using publicly available data and common sense.

$O$ $D$ $P(T = \text{positive} \mid O, D)$ $P(T = \text{negative} \mid O, D)$
healthcare mild 0.1 0.9
healthcare severe 0.99 0.01
other mild 0.01 0.99
other severe 0.15 0.85

$O$ $P(D = \text{mild} \mid O)$ $P(D = \text{severe} \mid O)$
healthcare 0.97 0.03
other 0.99 0.01

$O$ $P(S = \text{yes} \mid O)$ $P(S = \text{no} \mid O)$
healthcare 0.14 0.86
other 0.28 0.72

$P(O = \text{healthcare})$ $P(O = \text{other})$
0.05 0.95

The joint probability distribution is given by $$ P(O,S,D,T) = P(O)P(S|O)P(D|O)P(T|O,D) \, . $$ One can use the network to verify that $$ P(D=\text{severe} | S = \text{no}, T = \text{yes}) = 0.1739 $$ and $$ P(D=\text{severe} | S = \text{yes}, T = \text{yes}) = 0.1548 \, . $$ Thus a higher incidence of severe symptoms among non-smokers is obtained despite the fact that the model assumes no direct relation between smoking and the disease (and no directed path between these two quantities).
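These two conditional probabilities can be reproduced by brute-force enumeration of the factorization above; the short Python sketch below does exactly that with the local distributions from the tables.

```python
from itertools import product

# Local distributions of the smoking/COVID-19 network (from the tables above).
p_o = {"healthcare": 0.05, "other": 0.95}
p_s = {"healthcare": 0.14, "other": 0.28}            # P(S=yes | O)
p_d = {"healthcare": 0.03, "other": 0.01}            # P(D=severe | O)
p_t = {("healthcare", "mild"): 0.10, ("healthcare", "severe"): 0.99,
       ("other", "mild"): 0.01, ("other", "severe"): 0.15}   # P(T=positive | O, D)

def posterior_severe(smoker):
    """P(D=severe | S=smoker, T=positive) by brute-force enumeration."""
    num = den = 0.0
    for o, d in product(p_o, ["mild", "severe"]):
        ps = p_s[o] if smoker == "yes" else 1 - p_s[o]
        pd = p_d[o] if d == "severe" else 1 - p_d[o]
        w = p_o[o] * ps * pd * p_t[(o, d)]            # joint weight with T=positive
        den += w
        if d == "severe":
            num += w
    return num / den

print(round(posterior_severe("no"), 4))    # 0.1739
print(round(posterior_severe("yes"), 4))   # 0.1548
```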

Here is a more intricate model used in a real medical application.

Jochems et al. built a Bayesian network to predict dyspnea ($D$) in patients that received radiotherapy treatment for lung cancer.2 The network used information from tumor location ($T$), lung function tests ($L$), pre-treatment dyspnea ($B$), cardiac comorbidity ($H$) and timing of chemo ($C$). Numerical variables have been discretized. The graphical structure was elicited from medical experts and is shown below.

Bayesian network structure for predicting post-treatment dyspnea in lung cancer patients.

The conditional probability distributions were independently estimated from clinical data of 287 patients of 5 different medical facilities, and then combined to produce the tables below.

$H$ $T$ $P( L = \text{low} \mid H, T)$ $P( L = \text{medium} \mid H, T)$ $P( L = \text{high} \mid H, T)$
no hilar area 0.32 0.08 0.60
no upper lobe 0.14 0.10 0.76
no lower lobe 0.23 0.19 0.58
yes hilar area 0.27 0.16 0.57
yes upper lobe 0.21 0.21 0.58
yes lower lobe 0.08 0.41 0.51

$B$ $C$ $P( D = \text{mild} \mid B, C)$ $P(D = \text{severe} \mid B, C)$
mild none 0.78 0.22
mild sequential 0.61 0.39
mild concurrent 0.76 0.24
severe none 0.25 0.75
severe sequential 0.25 0.75
severe concurrent 0.16 0.84

$T$ $P( B = \text{mild} \mid T)$ $P(B = \text{severe} \mid T)$
hilar area 0.94 0.06
upper lobe 0.95 0.05
lower lobe 0.98 0.02

$P(T = \text{hilar area})$ $P(T = \text{upper lobe})$ $P(T = \text{lower lobe})$
0.22 0.52 0.25

$P(H = \text{no})$ $P(H = \text{yes})$
0.72 0.28

The joint probability distribution is given by $$ P(H,T,L,B,C,D) = P(H)P(T)P(C)P(L|H,T)P(B|T)P(D|B,C) \, , $$ where the marginal distribution $P(C)$ of the timing of chemo is not shown in the tables above.

A main advantage of the use of Bayesian networks in this domain is the ability to estimate, share and fuse the local models among different hospitals without releasing sensitive patient data, with little overhead.

Here is one example involving continuous random variables.

Jayasurya et al. developed a Bayesian network to predict two-year survival ($S$, treated as a continuous variable with observed values 0 and 1) in lung cancer patients treated with radiotherapy.3 The model relied on clinical factors such as performance status ($W$), number of positive lymph nodes on a PET scan ($L$), clinical $T$-stage and $N$-stage, and size of gross tumor volume ($G$). All variables are continuous. Data about 322 patients from a single hospital was available, and contained a significant amount of missing values. For example, $W$ was missing in 11% of the cases. More importantly, the model was validated on data of patients from different hospitals with even higher rates of missing values. For instance, one of the hospitals did not run PET scans (so $L$ was missing for all cases), and $G$ was missing in 15% of the cases of another hospital. The structure of the network was estimated from data, with an initial solution encoding knowledge available in the medical literature. The resulting digraph is shown below.

Bayesian network structure for the 2-year survival domain.

The conditional distributions are specified as Gaussian distributions whose means are linear functions of the values of the parents. The numbers on the arcs in the figure above denote the weights of the linear combinations. This type of model is known as a Gaussian Bayesian network. The means and variances were learned from the data using the EM algorithm, which we will study later in the course. The authors reported higher prediction accuracy (measured by AUC) than an SVM classifier with missing values replaced by their mean value.
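The following Python sketch illustrates the mechanics of a Gaussian Bayesian network on a made-up three-variable fragment: each variable is Gaussian with mean given by a linear combination of its parents. The structure, weights and noise variances below are purely illustrative and are not the ones estimated by Jayasurya et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linear Gaussian fragment: W is a root; G depends on W; S depends on
# W and G.  Each node equals a linear function of its parents plus Gaussian noise.
n = 100_000
W = rng.normal(size=n)                                        # e.g. performance status
G = 1.0 + 0.8 * W + rng.normal(scale=0.5, size=n)             # e.g. tumor volume
S = 0.5 - 0.6 * G + 0.3 * W + rng.normal(scale=0.4, size=n)   # e.g. survival score

# In a Gaussian Bayesian network the implied joint is multivariate normal, so
# means, variances and correlations are fully determined by the local models.
print(np.corrcoef(np.vstack([W, G, S])))
```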

The next example involves a continuous random variable, a discrete random variable with infinite support, and a binary random variable.

Consider an artificial model of overall student performance connecting the number of registered students $R$ in a course, the ability of the teacher $T$ and the number of students $A$ that pass the course. The number of registered students depends on the teacher's ability. The number of students who pass the course depends on both the number of registered students and the teacher's ability. The structure of the Bayesian network is given below.

Bayesian network structure for the number of approved students.

To obtain the conditional distributions, suppose that on average 50% of registered students are approved when the teacher's ability is low, and that 70% are approved when the teacher's ability is high. We can model the conditional distribution of $A$ given $T$ and $R$ as a binomial distribution with number of trials equal to $R$, and success probability $p=0.7$ if $T$ is high and $p=0.5$ if $T$ is low. We can model $R$ as a Gaussian distribution with, say, mean 40 and standard deviation 10 if the teacher's ability is high, and mean 20 and standard deviation 20 if the teacher's ability is low. And we can model the teacher's ability as a uniformly distributed binary random variable. Alternatively, we can assume $A$ is continuous and approximate its distribution by a Gaussian distribution with mean $\mu_\text{low} = 0.5R$ and variance $\sigma_\text{low}^2 = 0.25R$ if $T=\text{low}$, and mean $\mu_\text{high} = 0.7R$ and variance $\sigma_\text{high}^2 = 0.21R$ if $T=\text{high}$. This latter type of model is known as a conditional Gaussian Bayesian network. We can then use the model to answer questions such as: what is the expected number of registered students in a course where $20$ students are approved?
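One way to approximate the answer to that last question is likelihood weighting: sample $T$ and $R$ from their local distributions and weight each sample by the probability of the evidence $A=20$. The Python sketch below does this for the binomial version of the model, rounding the Gaussian sample for $R$ to a non-negative integer (a simplification of the specification above, made only for illustration).

```python
import math
import random

random.seed(0)

def binom_pmf(k, n, p):
    """P(A = k) for a binomial with n trials and success probability p."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def expected_registered_given_approved(a_obs=20, n_samples=100_000):
    """Monte Carlo estimate of E[R | A = a_obs] by likelihood weighting."""
    total_w = total_wr = 0.0
    for _ in range(n_samples):
        t = random.choice(["low", "high"])               # teacher ability, uniform
        mu, sd = (40, 10) if t == "high" else (20, 20)
        r = max(0, round(random.gauss(mu, sd)))          # registered students
        p = 0.7 if t == "high" else 0.5
        w = binom_pmf(a_obs, r, p)                       # weight = P(A=a_obs | R=r, T=t)
        total_w += w
        total_wr += w * r
    return total_wr / total_w

print(expected_registered_given_approved())
```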

Bayesian networks are also very frequently used to build general-domain probabilistic classifiers.

A Naive Bayes Classifier (NBC) is a simple class of Bayesian networks formed by a single root variable with the remaining variables as leaf children. Even though the assumptions made by the model may seem unrealistic, they often lead to a competitive classifier in high dimensional domains with little data. The reason is that, due to its low variance (over data sets), NBC can deliver fairly accurate classifications despite being a poor probability estimator.4 To illustrate the use of the model, we build an NBC for spam detection of mobile messages, using a publicly available dataset. The root variable $C$ represents the message type (spam or ham) and the leaves $X_1,\dotsc,X_n$ indicate the presence or absence of selected keywords. NBC is based on the assumption that the attribute variables are conditionally independent given a class value: $$ P(X_1,\dotsc,X_n|C) = \prod_{i=1}^n P(X_i|C) \, . $$ The structure of NBC is shown below.

Naive Bayesian Classifier network structure.

Typically, NBC is used to classify an object $x_1,\dotsc,x_n$ by selecting the value $c$ which maximizes the class posterior probability $P(C=c|x_1,\dotsc,x_n)$. The conditional probabilities are generally estimated by relative counts from data: $P(X_i=x_i|C=c)=N(X_i=x_i,C=c)/N(C=c)$, where $N(X_i=x_i,C=c)$ denotes the number of instances (say, messages) of class $c$ where $X_i$ takes on value $x_i$, and $N(C=c) = \sum_x N(X_i=x,C=c)$. By taking the logarithm and dispensing with terms that are constant with respect to the choice of class label, the maximum a posteriori class label can be found as: $$ \arg\max_c (1-n)\ln N(C=c) + \sum_{i=1}^n \ln N(X_i=x_i,C=c) \, . $$ Hence, classification takes time linear in the number of attributes and classes, and requires only simple computations, which leads to an extremely lightweight classifier.
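The Python sketch below implements this count-based rule on a handful of made-up toy messages (the actual example in the text uses a public SMS spam dataset and 227 keywords); it is only meant to make the $\arg\max$ formula concrete.

```python
import math
from collections import Counter

# Toy data: (class, set of keywords present in the message).  The messages and
# vocabulary are made up just to exercise the counting formula.
vocab = ["free", "call", "prize", "home", "MONEY", "NUMBER", "love"]
data = [
    ("spam", {"free", "call", "NUMBER"}),
    ("spam", {"free", "prize", "MONEY"}),
    ("spam", {"call", "prize", "MONEY", "NUMBER"}),
    ("ham",  {"call", "home"}),
    ("ham",  {"love", "home"}),
    ("ham",  {"call", "love"}),
]

n_c = Counter(c for c, _ in data)                                   # N(C=c)
n_xc = Counter((c, w) for c, ws in data for w in ws if w in vocab)  # N(X_w=1, C=c)

def count(word, present, c):
    """N(X_word = present, C = c); absences are N(C=c) minus presences."""
    return n_xc[(c, word)] if present else n_c[c] - n_xc[(c, word)]

def classify(message):
    """argmax_c (1-n) ln N(C=c) + sum_i ln N(X_i = x_i, C = c)."""
    n = len(vocab)
    scores = {}
    for c in n_c:
        counts = [count(w, w in message, c) for w in vocab]
        if 0 in counts:            # ln 0: in practice the counts are smoothed
            scores[c] = -math.inf
        else:
            scores[c] = (1 - n) * math.log(n_c[c]) + sum(map(math.log, counts))
    return max(scores, key=scores.get)

print(classify({"free", "MONEY", "call"}))   # -> 'spam' with these toy counts
```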

To simplify matters, we select as keywords all words with at least 30 occurrences for some class. We preprocess the messages to include three special tokens indicating whether the message has a number, whether it has an upper case word, and whether it has a monetary amount. Here are some of the 227 keywords selected:

just with NUMBER come need what time about text cash reply nokia free when this there only like call home guaranteed ALLCAPS stop want from service know love today contact tone have urgent good please mobile MONEY prize customer claim your going still send will that then week mins dont sorry

We then obtain the relative frequencies (counts) of keywords in ham and spam messages, as well as the proportion of spam/ham messages. For example, the most frequent words in ham messages are:

that have ALLCAPS your will just when with what this

The most frequent words in spam messages:

ALLCAPS NUMBER call your MONEY free have from mobile stop

According to the model the probability of observing the word “winner” in a ham/spam message is: $$ P(X_\text{winner}=1|C=\text{ham})=0 \qquad P(X_\text{winner}=1|C=\text{spam})=0.021 \, . $$ The probability of observing ALLCAPS is: $$ P(X_\text{ALLCAPS}=1|C=\text{ham})=0.09 \qquad P(X_\text{ALLCAPS}=1|C=\text{spam})=0.90 \, . $$

We use the classifier to predict the labels for the same dataset used to estimate the counts (a biased approach), and obtain an accuracy of 98%.

Here is an example of another Bayesian network classifier with a more complex structure.

A One-Dependence Bayesian Network Classifier (ODC) extends the Naive Bayes Classifier by allowing each attribute variable to depend also on a distinguished attribute called the superparent (which only depends on the class variable).5 An example of the structure of the network is represented below.

One-Dependence Bayesian Network Classifier structure.

The conditional joint distribution of the model given the class is $$ P(X_1,\dotsc,X_n|C) = P(X_i|C) \prod_{j \neq i} P(X_j|X_i,C) \, , $$ where $X_i$ is the superparent.

The classifier is usually learned by comparing all models obtained with each attribute as superparent and selecting the one with the highest score (log-likelihood, accuracy, etc.). Alternatively, one can dispense with the model selection and consider all possible ODC classifiers. A classification is then made by averaging the inferences of all classifiers. The result is called the Average One-Dependence Classifier (AODC). The conditional joint distribution of AODC given the class is $$ P(X_1,\dotsc,X_n|C) = \frac{1}{n} \sum_{i=1}^n P(X_i|C) \prod_{j\neq i} P(X_j|X_i,C) \, . $$ Note that the above joint distribution has no immediate Bayesian network representation, as it subsumes cycles between the attributes.
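To make the averaging concrete, here is a small self-contained Python sketch that estimates the required conditional probabilities from a made-up binary dataset (with Laplace smoothing, added only so that the tiny sample produces no zero probabilities) and scores a new instance by the class prior times the averaged product above.

```python
# Toy binary data: rows are (x1, x2, x3, class).  Values are made up just to
# make the averaging over superparents concrete.
data = [(1, 1, 0, "a"), (1, 0, 0, "a"), (0, 1, 1, "b"),
        (0, 1, 0, "b"), (1, 1, 1, "a"), (0, 0, 1, "b")]
n_attrs = 3

def prob(predicate, given):
    """Empirical conditional probability with Laplace smoothing."""
    num = sum(1 for row in data if predicate(row) and given(row))
    den = sum(1 for row in data if given(row))
    return (num + 1) / (den + 2)

def aodc_score(x, c):
    """P(c) times (1/n) sum_i P(x_i|c) prod_{j != i} P(x_j | x_i, c)."""
    total = 0.0
    for i in range(n_attrs):                  # average over the choice of superparent
        term = prob(lambda r: r[3] == c, lambda r: True)
        term *= prob(lambda r: r[i] == x[i], lambda r: r[3] == c)
        for j in range(n_attrs):
            if j != i:
                term *= prob(lambda r: r[j] == x[j],
                             lambda r: r[i] == x[i] and r[3] == c)
        total += term
    return total / n_attrs

x = (1, 0, 1)
scores = {c: aodc_score(x, c) for c in ("a", "b")}
print(max(scores, key=scores.get))   # most probable class under the AODC model
```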

Bayesian Network Topologies

A Bayesian network is tree-shaped if its structure is a directed tree. A Naive Bayes network is tree-shaped. As we will see later, trees are particularly important as inference (querying) can be performed in linear time on them.

A polytree is a loop-free connected acyclic digraph. Another way to define it: by dropping the arc directions (i.e., by converting the graph into an undirected graph), we obtain an undirected tree. Every directed tree is a polytree (while the converse is not true). For example, the Bayesian network structure of the guinea pig problem and the Bayesian network structure of post-treatment dyspnea are polytrees which are not directed trees. The Asia Bayesian network, on the other hand, is not a polytree as, for example, there are two directed paths from $S$ to $D$. A connected digraph that is not a polytree is called loopy or multiply connected.

A digraph is bipartite if the nodes can be partitioned into two sets such that all arcs point from nodes in one of the sets to a node in the other set. This makes bipartite digraphs necessarily acyclic. The following is an example of application of bipartite Bayesian networks.

In multilabel classification each object is assigned to one or more relevant class labels. For example, an image can be associated with the objects it contains, and a text document can be associated with topics. Let $C_1,\dotsc,C_m$ denote the presence or absence of each label. A simple approach consists in predicting the presence of each label as a separate classification problem, that is, in building a classifier for the domain $X_1,\dotsc,X_n,C_j$ for each $j=1,\dotsc,m$. While this approach often performs reasonably well, explicitly modeling the correlations among class labels can often improve accuracy (particularly so if the accuracy measure takes class correlations into account). The naive Bayes classifier can be extended to multilabel classification by combining all single label naive Bayes classifiers into a single network. The result is the bipartite digraph below.

Multilabel Naive Bayesian Classifier network structure.

Note that not all arcs are present; some of the class to attribute arcs have been pruned by a feature selection procedure. According to the model, the attributes are conditionally independent given all class labels: $$ P(C_1,\dotsc,C_m,X_1,\dotsc,X_n) = \prod_i P(X_i|\pa{X_i}) \prod_{j=1}^m P(C_j) \, . $$

A similar model was used in the QMR-DT Bayesian network, which was built to support medical diagnosis over about 600 diseases (labels) and 4000 symptoms (attributes). The performance of the network was comparable to that of medical experts.6 A more accurate model can be obtained by also allowing dependences among class labels, and among attributes. Antonucci et al. developed a more sophisticated model where the subgraph over the class labels is an ensemble of Naive Bayes structures.7 By doing so they obtained a very competitive multilabel classifier.

Barren nodes

The following concept of barren nodes will help us connect factorization back to the Markov property.

A node $X$ in an acyclic directed graph $G$ is barren with respect to a set of variables/nodes $\mathcal{S}$ if it is not in $\mathcal{S}$ and has no descendants in $\mathcal{S}$, that is, if $(\{X\} \cup \de{X}) \cap \mathcal{S}=\emptyset$.

The nodes $B$ and $D$ are barren with respect to $\{A,X\}$ in the digraph below.

B and D are barren nodes with respect to X and A

A node is barren with respect to $\mathcal{S}$ if it is a leaf node not in $\mathcal{S}$, either in $G$ or in any graph obtained from $G$ by a sequence of removals of barren nodes. In the example above, $D$ is a leaf node not in $\mathcal{S}=\{A,X\}$. Removing it makes $B$ a new leaf node, also not in $\mathcal{S}$. After removing $B$, the remaining nodes are either in $\mathcal{S}$ or are not leaves, so they are not barren. This iterative leaf-removal procedure is sketched in code below.
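Here is a minimal Python sketch of that procedure. It assumes the digraph in the figure is the Asia graph from earlier (the node names in the figure match it), and it recovers $\{B, D\}$ as the barren nodes with respect to $\{A, X\}$.

```python
# The parent sets below assume the digraph in the figure is the Asia graph
# introduced earlier (the node names match the ones in the figure).
asia = {"A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
        "E": ["T", "L"], "X": ["E"], "D": ["E", "B"]}

def non_barren(parents, keep):
    """Iteratively remove leaves not in `keep`; the survivors are the non-barren nodes."""
    parents = {x: set(ps) for x, ps in parents.items()}
    while True:
        leaves = [x for x in parents
                  if x not in keep and all(x not in ps for ps in parents.values())]
        if not leaves:
            return set(parents)
        for x in leaves:
            del parents[x]   # a leaf has no children, so no parent set mentions it

print(sorted(set(asia) - non_barren(asia, {"A", "X"})))   # ['B', 'D'] are barren
```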

The removal of barren nodes affects neither the Markov properties among the remaining variables nor their marginal distributions:

Let $P_G$ be a joint distribution that satisfies the Markov properties with respect to a graph $G=(\mathcal{V},\mathcal{E})$, $\mathcal{B}$ be a set of barren nodes in $G$ with respect to a node set $\mathcal{S}$, and $H=(\mathcal{V}-\mathcal{B},\mathcal{E}')$ be the graph obtained by removing the nodes (and respective arcs) in $\mathcal{B}$ from $G$. Then $P_G(\mathcal{S}) = P_H(\mathcal{S})$, where $P_H(\mathcal{V}-\mathcal{B}) = \prod_{X \in \mathcal{V}-\mathcal{B}} P_G(X|\pa{X})$.

First note that $$ \begin{align*} P_G(\mathcal{S}) &= \sum_{\mathcal{V}-\mathcal{S}} \prod_{X \in \mathcal{V}} P_G(X|\pa{X}) \\\\ &= \sum_{\mathcal{V}-\mathcal{S}-\mathcal{B}} \prod_{X \in \mathcal{V}-\mathcal{B}} P_G(X|\pa{X}) \sum_{\mathcal{B}} \prod_{X \in \mathcal{B}} P_G(X|\pa{X}) . \end{align*} $$ Now for any $X$ it follows that $\sum_X P_G(X|\pa{X}) = 1$. Thus, by performing the eliminations $\sum_{X}$ for each $X \in \mathcal{B}$ in reverse topological order, we have (by an inductive argument) that $\sum_{\mathcal{B}} \prod_{X \in \mathcal{B}} P_G(X|\pa{X}) = 1$, which proves the result.

A simple corollary of the previous result is that if $\mathcal{B}$ is a set of barren nodes with respect to $\mathcal{R} \cup \mathcal{S}$, then $P_G(\mathcal{R}|\mathcal{S})=P_H(\mathcal{R}|\mathcal{S})$, where $H$ is the graph obtained by removing $\mathcal{B}$. This holds because the theorem can be applied to both the numerator and the denominator of the definition of conditional distribution. For example, in the digraph $G$ above, $P_G(A|X)=P_H(A|X)$, where $H$ is the graph obtained by removing $B$ and $D$. Another simple corollary is the following.

If $(\mathcal{R},\mathcal{S})$ is a partition of the nodes of a Bayesian network structure $G$ such that $\mathcal{S}$ is barren (with respect to $\mathcal{R}$ in $G$) then $P(\mathcal{R})=\prod_{X \in \mathcal{R}} P(X|\pa{X})$.

The graph obtained by removing $\mathcal{S}$ is a Bayesian network over $\mathcal{R}$ and hence encodes a distribution $P(\mathcal{R})$ which factorizes as the product of local conditional distributions.

We can now prove the following result.

If the joint distribution factorizes according to an acyclic directed graph $G$, then it satisfies the Markov properties represented by $G$.

The set $\de{X}$ is barren with respect to $\{X\} \cup \nd{X}$, thus $$ P(X,\nd{X}) = P(X|\pa{X}) \prod_{Y \in \nd{X}} P(Y|\pa{Y}) \, . $$ Similarly, $\{X\} \cup \de{X}$ is barren with respect to $\nd{X}$, hence $$ P(\nd{X}) = \prod_{Y \in \nd{X}} P(Y|\pa{Y}) \, . $$ Combining these results, we have that $$ \begin{align*} P(X,\nd{X}|\pa{X}) &= P(X|\pa{X}) \frac{\prod_{Y \in \nd{X}} P(Y|\pa{Y})}{P(\pa{X})} \\\\ &= P(X|\pa{X}) P(\nd{X}|\pa{X}) \, . \end{align*} $$ which is true if and only if the Markov property is satisfied.

Thus the Markov property and the factorization property are largely equivalent: one can start with a set of independence assumptions and conclude a factorization of the joint distribution in terms of local conditional distributions, or one can start with a collection of local conditional distributions inducing an acyclic directed graph and derive a joint distribution satisfying the Markov property.8

From now on we will consider Bayesian networks as playing two roles: as a model of probabilistic (in)dependences through its Markov properties, and as an efficient representation of a multivariate joint distribution in terms of local conditional distributions.


  1. A binary relation is simply a collection of ordered pairs of elements from a domain; in our case these are the vertices. ↩︎

  2. Arthur Jochems, Timo M. Deist, Johan van Soest, Michael Eble, Paul Bulens, Philippe Coucke, Wim Dries, Philippe Lambin, Andre Dekker. Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital – A real life proof of concept, Radiotherapy and Oncology 121, pp. 459-467, 2016. ↩︎

  3. K. Jayasurya et al. Comparison of Bayesian network and support vector machine models for two-year survival prediction in lung cancer patients treated with radiotherapy, Medical Physics 37:4, 2010. ↩︎

  4. See J. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, volume 1, pp 55-77, 1997, and also P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss., Machine Learning, volume 29, pp. 103-130, 1997. ↩︎

  5. G.I. Webb, J. Boughton, and Z. Wang. Not so naive Bayes: Aggregating one-dependence estimators. Machine Learning 58, 2005. ↩︎

  6. M. Shwe, B. Middleton, D. Heckerman, M. Henrion, E. Horvitz, H. Lehmann, and G. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST1/QMR knowledge base I. The probabilistic model and inference algorithms. Methods of Information in Medicine, 30, 1991. ↩︎

  7. Alessandro Antonucci, Giorgio Corani, Denis D. Mauá and Sandra Gabaglio. An Ensemble of Bayesian Networks for Multilabel Classification. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013. ↩︎

  8. There are some pathological cases involving continuous sample spaces where there is no joint distribution consistent with a collection of conditional distributions. ↩︎
