Score-Based Structure Learning

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\can}[1]{\overline{\text{an}}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\perp_d} \newcommand{\sep}{\perp_u} \newcommand{\msep}{\perp_m} $

Recall that a Bayesian network can be seen from two perspectives: as an efficient encoding of an independence relation, and as an efficient parametrization of a high-dimensional probability distribution by a product of conditional probabilities. Constraint-based approaches to structure learning such as the PC algorithm exploit the first perspective, and attempt to reconstruct a Bayesian network by analyzing the independences observed in data. Such approaches are general in that they only require the ability to test conditional independence between variables; that is, no assumption about the conditional distributions is required. They also offer potential ways to invalidate assumptions such as faithfulness, giving further insight into the data generating distribution. Moreover, they are consistent: given an infinite amount of data respecting the assumptions, the method recovers the true data generating structure.

Constraint-based learning has, however, several drawbacks. The learned structure is very sensitive to the parameters of the statistical tests, in particular to the selected significance level. This becomes even more critical when the sample size is small, or when the number of parent configurations is large (which leads to small effective sample sizes when testing independence). Moreover, errors of the tests (either type I or type II) propagate to later stages, increasing the probability of additional errors. Notably, constraint-based learning does not directly measure the goodness of fit of the learned network as a whole (that is, the accuracy of the learned Bayesian network in representing the data at hand).

In the score-based approach, one builds on the second perspective, and searches for Bayesian networks that appropriately describe the data at hand. The gist of the approach consists in assigning a score value to each acyclic directed graph. The score function specifies a total order (possibly up to equivalences) over the structures in a way that structures that better describe the data are assigned higher values.

Score-Based Structure Learning

Score-based approaches cast the task of recovering a Bayesian structure as a combinatorial optimization problem:

Find an acyclic digraph $G$ that maximizes a given score function $$ s: \mathcal{G} \rightarrow \mathbb{R} \, , $$ where $\mathcal{G}$ is a class of candidate structures.

The class of candidate structures might be simply the set of all acyclic digraphs over $n$ nodes, or more commonly a subset of that set, such as the acyclic digraphs with a bound on the in-degree, or the subset of acyclic digraphs consistent with a given skeleton.

There are some minimal criteria one usually asks of score functions. The first is a type of parsimony:

We say that a score function is succinct if whenever the log-likelihoods of $G$ and $G’$ are equal but $G$ has more parameters than $G’$ then $ s(G) < s(G’)$.

Score functions are usually a function of data. It is reasonable to expect that the true structure maximizes the score function given sufficient data:

Let $(G^{*},P)$ be a Bayesian network, and let $\{ s_N(G): N=1,2,\ldots \}$ be a collection of random variables, each representing the score value of a structure for a dataset of $N$ independent configurations sampled from $P$. We say that a score function is consistent if $$ \lim_{N \rightarrow \infty} P\left( \left| s_N(G^*) - \max_G s_N(G) \right| > \epsilon \right) = 0 $$ for any $\epsilon > 0$.

Consistency and succinctness ensure that “true Bayes nets” can be recovered asymptotically, that is, with infinite data (up to Markov equivalences and assuming faithfulness). For computational reasons we also assume:

A score function is efficient if $s(G)$ is computed in polynomial time in the size of $G$ and $N$.

And finally:

A score function is decomposable if it can be written as $$ s(G)=\sum_{X\in\mathcal{V}} f(X,\pa{X}) \, . $$

In the following we will justify some common score functions that satisfy all of the above desiderata. For simplicity, we consider only discrete random variables, although much of the discussion can be extended to continuous random variables without much effort. We start with an apparent detour through information theory.

Information Theory

Information Theory investigates how to store and transmit data efficiently and accurately. Given the limitations of physical means (noise, bandwidth, etc), it is not surprising that much of the theory relates to uncertainty. What might be more surprising is the fact that the apparatus developed to design efficient mechanisms for communicating data can also be used to perform statistical inference, in particular, to justify criteria for model selection.

The information content of a bit

How much information can a bit contain? To answer this, consider a random variable $X$ taking on four values with probabilities $P(X=0)=1/2$, $P(X=1)=1/4$, $P(X=2)=1/8$ and $P(X=3)=1/8$. How many bits are needed to store/communicate the value of $X$? Since $X$ has four values, we need at least four codewords to unambiguously decode whatever representation we choose to describe $X$. For example, we may represent $X=0$ by the codeword 00, $X=1$ by the codeword 01, $X=2$ by the codeword 10 and $X=3$ by the codeword 11. Can we decrease the length of the description without losing accuracy? Yes, on average, if instead of fixed-size codewords we use codewords of different sizes. The idea is to assign fewer bits to more probable values; for example, we might instead describe $X=0$ by 0, $X=1$ by 10, $X=2$ by 110 and $X=3$ by 111. The average description length in this case is $$ 1 \cdot 1/2 + 2 \cdot 1/4 + 3 \cdot (1/8+1/8) = 1.75 \text{ bits}, $$ which is less than the 2 bits per value required by fixed-size codewords. The key point is that the amount of compression achievable is related to the probability of occurrence: skewed probability distributions allow shorter descriptions via codewords of varying size, while uniform distributions encourage the use of equal-size codewords. Now take this reasoning from the side of the receiver: suppose you receive only the first bit of the codeword; when is this bit most informative about the event $X=x$? If the originating distribution is uniform, then revealing the value of the bit discards half of the possible configurations. On the other hand, if the distribution is very skewed, then revealing one bit may rule out only a single configuration. Following this reasoning, Shannon proposed the functional $\log_2 1/P(x)$ to measure the information content of an event $X=x$ with probability $P(x)$. Thus, the most informative scenario is when $X$ is distributed uniformly, in which case each bit allows us to reduce our uncertainty by half; and the least informative scenario is when the probabilities of $X=x$ decay exponentially, in which case each bit reduces our uncertainty by only a small fraction.
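
As a small illustration, the following Python snippet (using the distribution and codewords of the example above) computes the information content $-\log_2 P(x)$ of each value and the average codeword length of the variable-length code:

```python
import math

# Distribution of X from the example above.
probs = {0: 1/2, 1: 1/4, 2: 1/8, 3: 1/8}

# A prefix-free code that assigns shorter codewords to more probable values.
code = {0: "0", 1: "10", 2: "110", 3: "111"}

for x, p in probs.items():
    # Shannon information content of the event X = x: -log2 P(x).
    print(f"x={x}  P(x)={p:.3f}  information = {-math.log2(p):.2f} bits"
          f"  codeword length = {len(code[x])} bits")

# Average codeword length of the variable-length code (1.75 bits) ...
avg_len = sum(p * len(code[x]) for x, p in probs.items())
# ... versus 2 bits per value for the fixed-length code {00, 01, 10, 11}.
print(f"average length: {avg_len} bits (fixed-length code: 2 bits)")
```

Note that for this particular distribution the codeword lengths coincide with $-\log_2 P(x)$, so the average length equals the entropy of $X$.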

Shannon Entropy

The information content of a random variable is the expected information content of values drawn independently from its distribution:

The Shannon entropy of a set of variables $\mathcal{X}$ under the joint distribution $P(\mathcal{X})$ is:1 $$ H(\mathcal{X}) = - \sum_\mathcal{X} P(\mathcal{X}) \log_2 P(\mathcal{X}) \, . $$

Here is an example.

Consider a binary random variable $X$ taking values $\{a,b\}$. Its Shannon entropy is $P(a)[-\log_2 P(a)] + P(b)[-\log_2 P(b)]$. This quantity can be understood as the expected codeword length (up to rounding) of a message produced by drawing a value $x$ of $X$ and using $-\log_2 P(x)$ bits to encode it. Since $P(b)=1-P(a)$, we can plot the Shannon entropy of $X$ as a function of $P(a)$:

One can see that the entropy is nonnegative, symmetric, maximized when $X$ is distributed uniformly, and vanishes when $X$ is degenerate (i.e., $P(a)=0$ or $P(a)=1$). Also, the entropy is a concave and bounded function.
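
For reference, the plotted curve can be reproduced with a short Python snippet (assuming numpy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), with 0*log2(0) := 0.
p = np.linspace(0.0, 1.0, 201)
with np.errstate(divide="ignore", invalid="ignore"):
    h = (-np.where(p > 0, p * np.log2(p), 0.0)
         - np.where(p < 1, (1 - p) * np.log2(1 - p), 0.0))

plt.plot(p, h)
plt.xlabel("P(a)")
plt.ylabel("entropy (bits)")
plt.title("Entropy of a binary random variable")
plt.show()
```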

As we are usually interested only in the relative order of informativeness, we use $\ln$ instead of $\log_2$ in the definition of entropy, as they differ only by a multiplicative constant, and call this functional the entropy of a random variable:2

$$ H(\mathcal{X}) = - \sum_\mathcal{X} P(\mathcal{X}) \ln P(\mathcal{X}) \, . $$

Entropy satisfies the following properties.

The following statements are true:

  1. $0 \leq H(\mathcal{X}) \leq \ln |\Omega_{\mathcal{X}}|$.

  2. $H(\mathcal{X}) = 0$ if and only if $P(\mathcal{X})$ is degenerate (i.e., it assigns all mass to a single configuration).

  3. $H(\mathcal{X} \cup \mathcal{Y})= H(\mathcal{X}) + H(\mathcal{Y})$ if and only if $\mathcal{X} \ind \mathcal{Y}$.

Mutual Information

Informativeness can be altered by knowledge of other quantities. For example, if $X=Y$ then knowing $Y$ removes uncertainty about $X$. On the other hand, if $X \ind Y$ then knowing $Y$ does not reduce the uncertainty about $X$.

The mutual information of two sets of variables $\mathcal{X}$ and $\mathcal{Y}$ under distribution $P(\mathcal{X} \cup \mathcal{Y})$ is $$ I(\mathcal{X},\mathcal{Y}) = \sum_{\mathcal{X},\mathcal{Y}} P(\mathcal{X} \cup \mathcal{Y}) \ln \frac{P(\mathcal{X} \cup \mathcal{Y})}{P(\mathcal{X})P(\mathcal{Y})} \, . $$

Mutual information measures the average reduction in uncertainty about $\mathcal{X}$ once $\mathcal{Y}$ is known (and vice-versa). For convenience, we define $I(\mathcal{X},\emptyset)=0$. We observe the following properties:

The following statements are true:

  1. $I(\mathcal{X},\mathcal{Y}) = I(\mathcal{Y},\mathcal{X})$.

  2. $I(\mathcal{X},\mathcal{Y}) = H(\mathcal{X}) + H(\mathcal{Y}) - H(\mathcal{X} \cup \mathcal{Y})$.

  3. $I(\mathcal{X},\mathcal{Y}) \geq 0$.

  4. $I(\mathcal{X},\mathcal{Y})=0$ if and only if $\mathcal{X} \ind \mathcal{Y}$.

  5. $I(\mathcal{X},\mathcal{Y}\cup\mathcal{Z}) \geq I(\mathcal{X},\mathcal{Y})$ with equality if and only if $\mathcal{X} \ind \mathcal{Z} \mid \mathcal{Y}$.

The last property is particularly interesting. It says that if $\mathcal{X} \ind \mathcal{Z} \mid \mathcal{Y}$ then $I(\mathcal{X},\mathcal{Y}\cup\mathcal{Z})=I(\mathcal{X},\mathcal{Y})$; otherwise, $I(\mathcal{X},\mathcal{Y}\cup\mathcal{Z}) > I(\mathcal{X},\mathcal{Y})$. It thus provides us with a means to infer probabilistic independence via mutual information. The G-test is a statistical test of independence based on mutual information; it is often preferred to the chi-squared test.
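
As an illustration, here is a minimal Python sketch of the empirical mutual information and the corresponding G statistic, $G = 2N\hat{I}(X,Y)$ (with $\hat{I}$ in nats), which is asymptotically chi-squared distributed under independence. The function names are illustrative and scipy is assumed to be available:

```python
import numpy as np
from scipy.stats import chi2

def empirical_mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete samples."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    counts = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(counts, (xi, yi), 1)          # joint contingency table
    pxy = counts / n
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def g_test(x, y):
    """G-test of independence: G = 2 N I(X,Y), (|X|-1)(|Y|-1) degrees of freedom."""
    g = 2.0 * len(x) * empirical_mutual_information(x, y)
    dof = (len(set(x)) - 1) * (len(set(y)) - 1)
    return g, chi2.sf(g, dof)               # statistic and p-value
```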

Maximum Likelihood Score

Recall that our goal is to associate a score value $s(G)$ to each acyclic digraph that captures the ability of a Bayesian network with structure $G$ to represent the data. The most straightforward measure of model fitness is the so-called log-likelihood.

The log-likelihood of a Bayesian network $(G,P)$ with respect to a fixed dataset is

$$ LL(G,P) = \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} N[X,\pa{X}] \ln P(X|\pa{X}) \, , $$

where $N$ represents the counts of events in the data and $P$ is the distribution specified by the network.

Denote by $\hat{P}$ the empirical distribution of the data, that is, the normalized counts of each event. We can rewrite the log-likelihood as

$$ LL(G,P) = N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} \underset{=N[X,\pa{X}]/N}{\underbrace{\hat{P}(X,\pa{X})}} \ln P(X|\pa{X}) \, . $$

The quantity above depends on the specific parametrization $P$ of a Bayesian network with structure $G$. One typical approach is to consider maximum likelihood estimates of the conditional probabilities. Recall that the maximum likelihood estimate of the distribution of a discrete random variable is given by its empirical distribution. Hence:

$$ \begin{align*} LL^{\text{mle}}(G) &= \max_P LL(G,P) \\ &= N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} \hat{P}(X,\pa{X}) \ln \hat{P}(X|\pa{X}) \\ &= N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} \hat{P}(X,\pa{X}) \ln \frac{ \hat{P}(X,\pa{X})}{ \hat{P}(\pa{X})}\frac{ \hat{P}(X)}{ \hat{P}(X)} \\ &= N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} \hat{P}(X,\pa{X}) \ln \frac{ \hat{P}(X,\pa{X})}{ \hat{P}(X) \hat{P}(\pa{X})}\hat{P}(X) \\ &= N \sum_{X \in \mathcal{V}} \hat{I}(X,\pa{X}) - N \sum_{X \in \mathcal{V}} \hat{H}(X) \, , \end{align*} $$

where $\hat{I}$ and $\hat{H}$ are the empirical mutual information and entropy functions. Note that the last term is constant with respect to the structure $G$. Hence, log-likelihood is equivalent (up to that additive constant) to the sum of the mutual information of each variable and its parents. This makes it easier to obtain the following results.

$LL^{\text{mle}}(G)$ is a consistent score function, that is, given an infinite sample of i.i.d. configurations generated from a Bayesian network with structure $G$, we have (with probability one) that $$ LL^{\text{mle}}(G) \geq LL^{\text{mle}}(G’) $$ for any structure $G’$.

The proof follows from the properties of mutual information and the fact that mutual information is consistent (i.e., $\hat{I}$ converges in probability to $I$ as the sample size grows to infinity).

Let $<$ be a topological ordering consistent with $G'$. One can show that the empirical log-likelihood satisfies $$ \frac{1}{N} LL^{\text{mle}}(G') = -\hat{H}(\mathcal{V}) - \sum_X \hat{I}(X,\{Y : Y < X\} - \pa{X} \mid \pa{X}) \, , $$ where $I(X,Y|Z)$ denotes the conditional mutual information, which satisfies $I(X,Y|Z) = 0$ if and only if $X$ is conditionally independent of $Y$ given $Z$. If $<$ satisfies the Markov property with respect to the true structure $G$, then $I(X,\{Y : Y < X\} - \pa{X} \mid \pa{X})=0$ in the true distribution, and the corresponding empirical terms vanish as the sample grows. Since the first term is constant with respect to $G'$, it follows that the data generating structure $G$ maximizes the empirical log-likelihood.
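
To make the decomposition concrete, here is a minimal Python sketch that computes $LL^{\text{mle}}(G)$ as a sum of local family terms $\sum_{x,\nu} N[x,\nu] \ln (N[x,\nu]/N[\nu])$; the function names, the dictionary representation of the structure, and the use of a pandas DataFrame of categorical columns are illustrative assumptions:

```python
import math
from collections import Counter
import pandas as pd

def family_loglik(data: pd.DataFrame, child: str, parents: list) -> float:
    """Local term: sum over (x, pa) of N[x,pa] * ln( N[x,pa] / N[pa] )."""
    n_xpa = Counter(zip(data[child], *(data[p] for p in parents)))
    n_pa = Counter(zip(*(data[p] for p in parents))) if parents else {(): len(data)}
    return sum(n * math.log(n / n_pa[key[1:]]) for key, n in n_xpa.items())

def loglik_score(data: pd.DataFrame, structure: dict) -> float:
    """LL^mle(G) as a sum of local family scores (decomposability)."""
    return sum(family_loglik(data, x, pa) for x, pa in structure.items())
```

The structure is given as a dictionary mapping each variable to its list of parents, e.g. `loglik_score(data, {"CP": ["SES", "PE"], "PE": ["Sex"], "Sex": [], "IQ": [], "SES": []})` for a hypothetical structure over the variables of the example below.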

Consider again the data compiled by Sewell and Shah (1986) about possible factors that lead high school senior students to attend college or not. The data set contains 10,318 records about the following five categorical variables:

  • Sex: male, female;
  • SES: Socioeconomic status, taking on values low, lower middle, upper middle, high;
  • IQ: discretized as low, lower middle, upper middle, high;
  • PE: Parental encouragement, taking on low and high;
  • CP: College plans, taking on yes or no.

Let us consider the ranking of the local models for the variable $Sex$, as given by the empirical log-likelihood. The local score of that variable when the parent set is empty is $$ LL^\text{mle}(Sex,\emptyset) = -N\hat{H}(Sex) = -7146.42 \, . $$

The figure shows a Hasse diagram of the possible parent sets for variable Sex.

The edges of the figure are annotated with the difference in score value of the larger set with respect to the smaller set. For example, the edge between $\emptyset$ and $CP$ is annotated with $$ LL^\text{mle}(Sex,CP) - LL^\text{mle}(Sex,\emptyset) = -7109.01 - (-7146.42) = 37.41 \, . $$ Thus, the score value of each parent set can be recovered by starting from the score of the empty set and adding the weights along a path from the empty set to that parent set.

Note that the score values monotonically increase with the parent set size, and achieve a maximum when all other variables are in the parent set.

As the previous example shows, the log-likelihood is, however, not succinct:

Let $G’$ be a structure obtained by inserting arcs into $G$ (and maintaining acyclicity), and suppose $LL^\text{mle}$ is estimated from some fixed dataset. Then, $$ LL^{\text{mle}}(G’) \geq LL^{\text{mle}}(G) \, . $$

It follows from $I(X,\mathcal{Y} \cup \mathcal{Z}) \geq I(X,\mathcal{Y})$ with $\mathcal{Y}$ being the parents of $X$ in $G$ and $\mathcal{Z}$ the additional parents in $G'$.

The result above states that, even with a sufficiently large sample size, the true structure is not the single maximizer. With finite samples, $LL^\text{mle}$ favors denser structures over sparser structures, which makes it a poor score function for structure learning.

Chow-Liu Trees

The main drawback of ranking structures by the empirical log-likelihood is that it favors models with more parameters. We can counter that problem by considering a class of structures with fixed complexity. One particularly interesting such class is that of tree-shaped structures, that is, digraphs where each variable has at most one parent. Recall that directed trees are equivalent to undirected trees in the sense that we can represent the same distribution (hence the same independences) by ignoring the direction of the arcs, or by orienting the arcs of an undirected tree such that each node has at most one parent. In other words, the log-likelihood is invariant to orientations of the same skeleton (which represents the equivalence class of directed trees, as they contain no immoralities). The empirical log-likelihood of a tree with skeleton $H=(\mathcal{V},\mathcal{E})$ is therefore

$$ LL^\text{mle}(H) = N \sum_{X - Y \in \mathcal{E}} \hat{I}(X,Y) - N \sum_X \hat{H}(X) \, . $$

Again, note that the last term is invariant to the structure $H$ and can be discarded for the purposes of model selection. Chow and Liu noted that maximizing the expression above can be cast as the problem of finding a maximum weight spanning tree of the complete undirected graph over $\mathcal{V}$ whose edges $X-Y$ are weighted by $\hat{I}(X,Y)$.3 That problem can be solved efficiently by standard algorithms for maximum spanning trees such as Prim’s algorithm, which can be implemented so as to run in time $\mathcal{O}(n^2)$; this is optimal, given that we need quadratic time to compute all the weights $\hat{I}(X,Y)$ from data.
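
As a sketch of this procedure, the snippet below builds the Chow-Liu skeleton by reusing the `empirical_mutual_information` function defined earlier and the networkx library (whose `maximum_spanning_tree` uses Kruskal's algorithm by default rather than Prim's); the resulting undirected tree can then be oriented by choosing an arbitrary root and directing arcs away from it:

```python
import itertools
import networkx as nx
import pandas as pd

def chow_liu_skeleton(data: pd.DataFrame) -> nx.Graph:
    """Maximum-weight spanning tree with empirical mutual information weights."""
    g = nx.Graph()
    g.add_nodes_from(data.columns)
    for x, y in itertools.combinations(data.columns, 2):
        g.add_edge(x, y, weight=empirical_mutual_information(data[x], data[y]))
    return nx.maximum_spanning_tree(g)

def orient(tree: nx.Graph, root) -> nx.DiGraph:
    """Direct every edge away from the chosen root."""
    return nx.bfs_tree(tree, root)
```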

Penalized Log-Likelihood

A more general approach is to modify the log-likelihood score to make it succinct. A simple way is to add a penalization term that accounts for the number of parameters required to estimate the conditional probabilities associated with a structure $G$. This leads to a class of penalized log-likelihood scores:

$$ PLL(G) = LL^{\text{mle}}(G) - \psi(N) \cdot \text{size}(G) \, , $$

where

$$ \text{size}(G) = \sum_{X \in \mathcal{V}} (|\Omega_X|-1)\prod_{Y \in \pa{X}}|\Omega_Y| $$

is the number of free parameters of a Bayesian network with structure $G$, and $\psi(N)$ is a non-decreasing function that weights the penalization according to the sample size $N$.

Different score functions can be obtained by specifying $\psi(N)$. Since the empirical log-likelihood $LL^\text{MLE}$ grows linearly with the sample size $N$, to remain consistent the function $\psi(N)$ must be sublinear in $N$. A common option is to select

$$ \psi(N)=\mathcal{O}(\log N) $$

thus ensuring consistency. With the above function the penalized log-likelihood satisfies our desiderata of succinctness, consistency, efficiency and decomposability.
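
A minimal sketch of a penalized log-likelihood score follows; it reuses the `loglik_score` function from the earlier sketch, and the helper names (`num_params`, `penalized_loglik`) and the cardinality dictionary are assumptions for illustration:

```python
import math

def num_params(card: dict, structure: dict) -> int:
    """size(G): card maps each variable to its number of states |Omega_X|."""
    return sum((card[x] - 1) * math.prod(card[y] for y in pa)
               for x, pa in structure.items())

def penalized_loglik(data, structure: dict, psi) -> float:
    """PLL(G) = LL^mle(G) - psi(N) * size(G), with loglik_score as above."""
    card = {v: data[v].nunique() for v in data.columns}
    return loglik_score(data, structure) - psi(len(data)) * num_params(card, structure)

# The MDL/BIC choice discussed below: psi(N) = ln(N) / 2.
def mdl_score(data, structure):
    return penalized_loglik(data, structure, lambda n: math.log(n) / 2)
```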

Minimum Description Length

Our goal here is to justify the choice of the function $\psi(N)$ in the penalized log-likelihood score by resorting to information theory. Our previous discussion assumed knowledge of the true distribution of the random variable $X$. In practice, we need to estimate a model of the true distribution in order to obtain efficient encodings. This leads to a trade-off between the complexity of the learned model and its accuracy.

To see this, suppose that we have several different probabilistic models of symbol occurrence, $P_1,\dotsc,P_h$, that are shared between the transmitter and the receiver. To transmit information, say a message consisting of a string of letters, we select a model $P_i$ to encode the message and use $-\log_2 P_i(x)$ bits for each symbol $x$ in it. To allow the receiver to decode the message, we need to inform it which model was used. This can be done by adding a $\log_2 h$-bit header to the transmitted message. If the chosen model matches the data-generating distribution, the expected message length is thus $H(X)$ bits per symbol plus the header. Such a model class is quite limited, and for high-dimensional random variables the small overhead introduced by transmitting the model is likely overshadowed by large inaccuracies in encoding the data. This requires extending the model class. More generally, we may have an infinite number of possible encoding models, each parametrized by some vector $\Theta$. Then, to transmit the model, we need to encode which model is used plus the values of its parameters. This leads to the following concept.

The description length represents the average message length used to transmit a random sample of $N$ i.i.d. values of a random variable $X$, and is given by $N \cdot H(X) + \text{size}(P(X))$, where the last term represents the encoding of the information required for the receiver to determine the model used.

We can recast structure learning as a problem of data compression, as follows. The dataset can be seen as a string to be encoded. The symbols of the message are the joint configurations of the random variables. The probabilistic models of symbol/datum occurrence are the candidate Bayesian networks with maximum likelihood estimates of the conditional probabilities. The chosen model is used to encode the data, and is communicated as a header bitstring. The idea is to equate model quality with the ability to compress data. The Minimum Description Length Principle asserts that we should select the model that minimizes the description length.

Joe Suzuki proposed to encode the Bayesian network as a list of independent conditional probabilities, where each probability is encoded with $(\log_2 N)/2$ bits of precision.4 This choice of model encoding is justified as minimizing a form of redundancy. This gives a model encoding of a Bayesian network $(G,P)$ whose size is $$ \text{size}(P(\mathcal{V})) = \frac{\log_2(N)}{2} \sum_X (|\Omega_X| -1) \prod_{Y \in \pa{X}}|\Omega_Y| \, . $$ When the conditional probabilities are the maximum likelihood (empirical) estimates, encoding the data itself requires $$ -\sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} N[X,\pa{X}] \log_2 \hat{P}(X|\pa{X}) = -LL^{\text{mle}}(G) $$ bits (with the log-likelihood computed in base 2). Thus, minimizing the description length is equivalent to maximizing the following score function:

The minimum description length score is $$ MDL(G) = LL^{\text{MLE}}(G) - \frac{\log_2(N)}{2} \cdot \text{size}(G) \, , $$ where $\text{size}(G)$ denotes the number of independent parameters in the joint distribution it parametrizes.

One immediately sees that the MDL score is a form of penalized log-likelihood with $$ \psi(N) = \frac{\log_2(N)}{2} \, . $$

Let us consider again the dataset by Sewell and Shah (1986) about college aspirations. We will rank the parent sets of the variable $Sex$ using the MDL score, but using the natural log instead of log base 2. The figure shows a Hasse diagram of the possible parent sets for variable Sex along with the score values.

One can see from the figure that the score for the empty set is $-7151.04$, and the score for $\{PE\}$ is $$ MDL(Sex,PE) = MDL(Sex,\emptyset) + 73.53 = -7151.04 + 73.53 = -7077.51 \, . $$

Note that the score no longer increases with the parent set size. In fact, one can verify that the maximum score is obtained by the parent set containing only $PE$.

Bayesian Score

An alternative to penalized log-likelihood scores is to cast structure learning as Bayesian inference, where one is interested in obtaining the posterior distribution over acyclic digraphs given the data. By Bayes’ rule, we have that

$$ P(G|\text{data}) = \frac{P(G) P(\text{data}|G)}{P(\text{data})} \, . $$

Since we are only interested here in using the distribution above to rank structures $G$, we can ignore the denominator; this leads to the following generic score function:

The Bayesian score function is $$ \begin{align*} \text{Bayes}(G) &= \underset{\text{prior}}{\ln P(G)} + \underset{\text{marginal loglikelihood}}{\ln P(\text{data}|G)} \\ &= \ln P(G) + \ln \int_\theta P(\text{data}|G,\theta) p(\theta|G) d\theta
\end{align*} $$

The expression above represents a class of score functions obtained by specifying different data likelihoods and prior distributions. We can specify the prior distribution as a Bayesian network, as in parameter learning; this is also known as the prior Bayesian network. The choice of prior and likelihood is required to satisfy the basic desiderata of score functions: consistency, succinctness, efficiency and decomposability. A typical assumption for categorical random variables is to assume unconditionally independent Dirichlet priors with hyperparameters $\alpha_{X|\nu}$ on the conditional probabilities $P(X|\pa{X}=\nu)$, similar to Bayesian parameter learning. The figure below shows the corresponding prior network for a two-variable Bayesian network.

We now describe the derivation of a Bayesian score using independent Dirichlet priors. We first discuss a standard model of Bayesian inference over a single categorical random variable.

Small Detour: Discrete-Dirichlet Distribution

Consider a categorical random variable $X$, whose probability of producing a value $x$ is parametrized by a vector $(\theta_x)_{x \sim X}$ with $\theta_x \geq 0$ and $\sum_x \theta_x=1$. The data likelihood, representing the probability of observing counts $\{ N[X=x]: x \sim X \}$ from i.i.d. draws of $X$, is thus given by

$$ P(N[X]|\theta) = \prod_{x \sim X} \theta_x^{N[X=x]} \, . $$

We assume that $\theta_X$ is distributed according to a Dirichlet distribution with hyperparameters $\alpha_x$; this assigns the following joint probability density to a nonnegative vector $(\theta_x)_{x \sim X}$ such that $\sum_x \theta_x = 1$:

$$ p(\theta|\alpha) = \frac{\Gamma(\sum_x \alpha_x)}{\prod_x \Gamma(\alpha_x)} \prod_x \theta_x^{\alpha_x - 1} \, . $$

The marginal likelihood is then given by: $$ \begin{align*} P(N[X]|\alpha) &= \mathbb{E}\left(\prod_x \theta_x^{N[X=x]} \biggl | \alpha \right) \\ & = \int_\theta P(N[X]|\theta) \cdot p(\theta|\alpha) \,d\theta \\ &= \frac{\Gamma(\sum_x \alpha_x)}{\Gamma\left(\sum_x (\alpha_x + N[X=x])\right)}\prod_x \frac{\Gamma(\alpha_x + N[X=x])}{\Gamma(\alpha_x)} \, . \end{align*} $$

The (log) Gamma functions can be approximated efficiently with reasonable precision.
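
For illustration, the marginal likelihood above can be evaluated in log-space with `scipy.special.gammaln`; the function name and interface below are assumptions:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alphas):
    """ln P(N[X] | alpha) for the categorical-Dirichlet model."""
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (gammaln(alphas.sum()) - gammaln((alphas + counts).sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# e.g. counts N[X=x] = (10, 4) under a uniform Dirichlet prior alpha = (1, 1)
print(log_marginal_likelihood([10, 4], [1.0, 1.0]))
```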

Back to multivariate Bayesian networks

For a multivariate Bayesian network with categorical random variables and independent Dirichlet priors, the marginal likelihood factorizes as the product of local Multinomial-Dirichlet models:

$$ \begin{align*} P(\text{data}|G,\alpha) &= \int P(\text{data}|G,\theta)\, p(\theta|G,\alpha) \, d\theta \\ &= \int \prod_{X \in \mathcal{V}}\prod_{\nu \sim \pa{X}} \underset{\text{Multinomial}}{\underbrace{P(N[X,\pa{X}=\nu]|\theta_{X|\nu})}} \cdot \underset{\text{Dirichlet}}{\underbrace{p(\theta_{X|\nu}|\alpha_{X | \nu})}} \, d\theta \\ &= \prod_{X \in \mathcal{V}}\prod_{\nu \sim \pa{X}} \int P(N[X,\pa{X}=\nu]|\theta_{X|\nu}) \cdot p(\theta_{X|\nu}|\alpha_{X | \nu}) \, d\theta_{X|\nu} \, . \end{align*} $$

The Bayesian Score thus decomposes as:

$$ \begin{multline*} \text{Bayes}(G) = \ln P(G) \\ + \sum_{X\in\mathcal{V}}\sum_{\nu \sim \pa{X}} \biggl( \ln \frac{\Gamma(\sum_x \alpha_{x|\nu})}{\Gamma(N[\pa{X}=\nu]+\sum_x \alpha_{x|\nu})} \\+ \sum_{x \sim X} \ln \frac{\Gamma(N[X=x,\pa{X}=\nu]+\alpha_{x|\nu})}{\Gamma(\alpha_{x|\nu})} \biggr) \, . \end{multline*} $$

To ensure that the score function above is decomposable, we require that

$$ P(G) = \prod_{X \in \mathcal{V}} P(\pa{X}) \, , $$

that is, a priori the parent sets are generated independently. Common choices are

  • $P(\pa{X}) = 1/|\pa{X}|$, and
  • $P(\pa{X}) = \prod_{Y \rightarrow X} P(Y \rightarrow X)$.

Note that the latter prior allows us to encode structural preferences by assigning very high or very low probabilities to individual edges. In practice, however, the term $P(G)$ has little influence on model selection except for very small $N$.

For $N \rightarrow \infty$, the log marginal likelihood is dominated by the maximized log-likelihood; hence the Bayesian score above satisfies consistency, efficiency and decomposability. It can be shown that it is also succinct.

Likelihood Equivalence Principle

In addition to the four basic desiderata, it is often required that two Markov equivalent structures be assigned the same score. The rationale is that two Markov equivalent structures provide the same statistical support for the data (in terms of the independences or constraints they impose), and hence should not be distinguished by the score function. Note that two Markov equivalent structures have the same maximized log-likelihood, thus any difference in the score must originate from one of the prior distributions. The following result establishes necessary and sufficient conditions for the priors to abide by the Markov equivalence principle.

Two Markov equivalent structures with equal structure prior $P(G)$ have the same Bayesian score if and only if

$$ \alpha_{x|\nu} = \alpha \cdot P(X=x,\pa{X}=\nu) \, , $$

where $\alpha$ is the equivalent sample size and $P(X=x,\pa{X}=\nu)$ is a prior probability of observing the joint configuration of $x$ and $\nu$.

The theorem above states that, to satisfy the Markov equivalence principle, the hyperparameters must be derived from a single prior joint distribution scaled by the same equivalent sample size (or prior strength) $\alpha$ in all local Multinomial-Dirichlet models. This choice is known as the BDe prior. A common choice for the prior probabilities is $$ P(x,\nu) = \frac{1}{|\Omega_X|\prod_{Y \in \pa{X}} |\Omega_Y|} = \text{Uniform}(X,\pa{X}) \, , $$ in which case the prior is called the BDeu prior, where u stands for uniform. The equivalent sample size $\alpha$ helps in regulating overfitting; typical values are in the range $[1,2]$, although structure ranking is very sensitive to this value.
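
A sketch of the local term of the resulting BDeu score for a single family $(X,\pa{X})$ is given below; counts are taken from a pandas DataFrame as in the earlier sketches, and the function name and default equivalent sample size are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy.special import gammaln

def bdeu_family_score(data: pd.DataFrame, child: str, parents: list, ess: float = 1.0) -> float:
    """Local BDeu term: alpha_{x|v} = ess / (|Omega_X| * prod_Y |Omega_Y|)."""
    r = data[child].nunique()
    q = int(np.prod([data[p].nunique() for p in parents])) if parents else 1
    alpha = ess / (r * q)
    groups = (data.groupby(parents)[child] if parents
              else data.assign(_all=0).groupby("_all")[child])
    states = data[child].unique()
    score = 0.0
    for _, values in groups:  # parent configurations absent from the data contribute zero
        counts = values.value_counts().reindex(states, fill_value=0).to_numpy()
        score += gammaln(r * alpha) - gammaln(r * alpha + counts.sum())
        score += np.sum(gammaln(alpha + counts) - gammaln(alpha))
    return score
```

The full score is obtained by summing these local terms over all families and adding $\ln P(G)$.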

In sum, the Bayesian score provides an efficient, decomposable, consistent and succinct score function that can be made likelihood equivalent (like LL and MDL) by setting appropriate values of the hyperparameters. Compared with penalized log-likelihood, the Bayesian score allows more flexibility in representing prior knowledge or inductive biases about the structure and parameters of the Bayesian network. Its main disadvantages are the need to select an appropriate value for the equivalent sample size, and its more complicated expression.

Bayesian Information Criterion

When the amount of data is large, the Bayesian score closely approximates the maximum a posteriori (MAP) estimate:

$$ P(G|\text{data}) \overset{N \rightarrow \infty}{\approx} \max_\theta p(G,\theta|\text{data}) \, . $$

This justifies the following so-called Bayesian Information Criterion (BIC) score function:

The BIC score function is: $$ \begin{align*} \text{BIC}(G) &= LL^{\text{MLE}}(G) - \frac{\ln N}{2} \text{size}(G) \\ & = \text{Bayes}(G) + O(1) \quad \text{as } N \rightarrow \infty \, . \end{align*} $$

Note that the BIC score equals the MDL score up to the base of the logarithm. It thus provides additional justification for the choice of $\psi(N)=\log(N)/2$ in penalized log-likelihood scores. The BIC score tends to select simpler structures than the Bayesian score for small $N$, and performs comparably as $N$ increases.

Structure Learning Algorithms

We now revisit the problem of score-based structure learning:

Given a decomposable, succinct and efficient score function $s$, find a structure $G$ in some class $\mathcal{D}$ that maximizes $s(G) = \sum_{i=1}^n f(X_i,\pa{X_i})$.

The problem can be shown to be NP-complete, and current state-of-the-art exact techniques can solve problems with up to around 100 variables. In the following we discuss two of the most common approximate techniques: search in the space of DAGs and search in the space of topological orderings. Both produce reasonable approximate solutions in polynomial time.

Search in the Space of DAGs

The first approach is to perform local modifications in the network structure to improve the score:

  1. Start with some structure $G$ (e.g., the empty graph).
  2. For a fixed number of iterations $i=1,\ldots,T$, obtain a candidate $G’$ by applying one of the following operations to $G$:
    • Add a non-existing arc $X \rightarrow Y$ (as long as it does not introduce a cycle).
    • Remove an existing arc $X \rightarrow Y$.
    • Reverse an existing arc $X \rightarrow Y$ (i.e., remove $X \rightarrow Y$ and add $Y \rightarrow X$), as long as it does not introduce a cycle.
  3. If $s(G’)>s(G)$ then set $G \gets G’$ and go to Step 2.

Due to the decomposability of the score function, these operations can be performed efficiently and in any order, and allow some degree of parallelism. Notably, they do not require a strict bound on the in-degree (except for the computation of the local scores), and they can incorporate prior knowledge about the initial structure and about the local operations. The algorithm can also be adapted to search directly in the space of Markov equivalence classes, which increases the search neighborhood with small overhead. A disadvantage is the need to check acyclicity for each candidate structure $G’$.
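
The following Python sketch illustrates this local search, restricted to arc additions and removals for brevity; `local_score(X, parents)` stands for any decomposable local score (for instance the BIC family score), and the function names are illustrative:

```python
import itertools
import networkx as nx

def hill_climb(variables, local_score, max_iter=100):
    """Greedy search in the space of DAGs (additions and removals only)."""
    variables = list(variables)
    g = nx.DiGraph()
    g.add_nodes_from(variables)
    for _ in range(max_iter):
        best_delta, best_graph = 0.0, None
        for x, y in itertools.permutations(variables, 2):
            h = g.copy()
            if g.has_edge(x, y):
                h.remove_edge(x, y)                      # try removing X -> Y
            else:
                h.add_edge(x, y)                         # try adding X -> Y
                if not nx.is_directed_acyclic_graph(h):  # keep candidates acyclic
                    continue
            # by decomposability, only Y's local term changes
            delta = (local_score(y, list(h.predecessors(y)))
                     - local_score(y, list(g.predecessors(y))))
            if delta > best_delta:
                best_delta, best_graph = delta, h
        if best_graph is None:                           # local maximum reached
            break
        g = best_graph
    return g
```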

Search in the Space of Topological Orders

An alternative approach is to search in the space of topological orderings, thus dispensing with the need to verify acyclicity. Recall that a digraph $G$ is acyclic if and only if it admits a topological ordering $\sigma$ of the nodes. The number of orderings ($n!$) is much smaller than the number of DAGs (which is superexponential in $n$), which justifies the approach.

$$ \begin{align*} \max_G s(G) &= \max_G \sum_{i=1}^n f(X_i,\pa{X_i}) \\ &= \max_{\sigma} \sum_{i=1}^n \max_{\pa{X_i} \subseteq \{X_j: \sigma(j) < \sigma(i)\}} f(X_i,\pa{X_i}) \, . \end{align*} $$

According to the result above, structure learning can be decomposed into two interdependent problems: ordering search and parent set selection. We can thus assign a score value to each topological ordering by: $$ s(\sigma) =\sum_{i=1}^n \max_{\pa{X_i} \subseteq \{X_j: \sigma(j) < \sigma(i)\}} f(X_i,\pa{X_i}) \, . $$

Teyssier & Koller’s algorithm performs a greedy search in the space of orderings:

  1. Start with a random topological ordering $\sigma$ (unless a good guess is available).
  2. Repeat until no improvement is found:
    1. Find $\sigma’$ by swapping the ordering of two consecutive variables, and compute its score.
    2. If $s(\sigma’) > s(\sigma)$ then set $\sigma \gets \sigma’$.

The candidate orderings obtained by swapping two consecutive variables can be evaluated very efficiently, as the swap affects only the parent set choices of the two variables involved. This allows the algorithm to explore a large neighborhood efficiently, and it has been shown in practice to produce superior results.
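
Below is a sketch of ordering-based search in the spirit of Teyssier and Koller's algorithm; for simplicity it bounds the in-degree to keep parent set selection tractable and recomputes the ordering score from scratch rather than incrementally, so it is meant for illustration only (`local_score` is again any decomposable local score):

```python
import itertools
import random

def best_parents(x, candidates, local_score, max_indegree=2):
    """Best-scoring parent set for X among subsets of the candidates."""
    subsets = itertools.chain.from_iterable(
        itertools.combinations(candidates, k) for k in range(max_indegree + 1))
    return max(subsets, key=lambda pa: local_score(x, list(pa)))

def order_score(order, local_score):
    """s(sigma): sum of the best local scores consistent with the ordering."""
    return sum(local_score(x, list(best_parents(x, order[:i], local_score)))
               for i, x in enumerate(order))

def ordering_search(variables, local_score):
    """Greedy search over orderings by swapping adjacent variables."""
    order = list(variables)
    random.shuffle(order)
    score, improved = order_score(order, local_score), True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            candidate = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            s = order_score(candidate, local_score)
            if s > score:
                order, score, improved = candidate, s, True
    return order, score
```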

Let us run the Teyssier-Koller algorithm on the college aspirations dataset, using the BIC score. We start with the following ordering:

$$ Sex < IQ < CP < PE < SES \, . $$

The score of the empty digraph is $-49456.65$, and the score of the initial ordering is $-45643.28$. A candidate improving structure is obtained by finding, for each variable, the best parent set among the variables that precede it in the ordering. For the initial ordering the structure contains the following parent sets:

$$ \begin{align*} Sex & \gets \emptyset & IQ & \gets \emptyset & CP & \gets \{Sex, IQ \} \\ PE & \gets \{Sex, IQ, CP\} & SES &\gets \{CP, PE\} \end{align*} $$

Running the algorithm we find the locally best ordering:

$$ IQ < PE < CP < SES < Sex \, $$

whose associated structure is

$$ \begin{align*} Sex & \gets \{ PE \} & IQ & \gets \emptyset & CP & \gets \{IQ, PE \} \\ PE & \gets \{IQ\} & SES &\gets \{CP, PE\} \end{align*} $$

whose score is $-45609.63$. Running the algorithm 100 times with random initial orderings, we find a best scoring ordering

$$ Sex < SES < PE < CP < IQ \, $$

whose score is $-45609.42$, and best parent sets are

$$ \begin{align*} Sex & \gets \emptyset & IQ & \gets \{PE,CP\} & CP & \gets \{SES, PE \} \\ PE & \gets \{Sex, SES\} & SES &\gets \emptyset \end{align*} $$


  1. We assume that $0 \log 0 = 0$. ↩︎

  2. When using the natural log to measure information content, the quantity $-\ln P(X=x)$ is measured in nats (for natural unit of information) rather than bits. Clearly, 1 nat corresponds to $1/\ln 2$ bits. ↩︎

  3. C. K. Chow and C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory 14, 1968. ↩︎

  4. Joe Suzuki, A construction of Bayesian networks from databases based on the MDL principle, In Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, 1993. ↩︎
