# Score-Based Bayesian Network Structure Learning

## Table of Contents

\( \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\indeg}[1]{\text{indeg}(#1)} \newcommand{\outdeg}[1]{\text{outdeg}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\!\perp_d\!} \newcommand{\msep}{\!\perp_m\!} \newcommand{\sep}{\!\perp_u\!} \)

Recall that a Bayesian network can be seen from two perspectives: as an efficient encoding of an independence relation, and as an efficient encoding of a high-dimensional probability distribution. The constraint-based approach to structure learning exploits the first perspective, and attempts at reconstructing a Bayesian network by analyzing the independencies on data. In the score-based approach, one invests in the second perspective, and searches for Bayesian networks that appropriately describe the data at hand. The heart of the approach consists in assigning a score value \(s(G)\) to each acyclic directed graph \(G\). The score function specifies a total order (up to equivalences) over the structures in a way that structures with better describe the data are assigned a higher value.

## 1 Introduction

### 1.1 Constraint-Based Vs. Score-Based

- Bayesian networks admit a dual view:
- As a model of
**independences** - As a
**parametrization**of joint distribution

- As a model of
**Constraint-Based Structure Learning**attempts to recover the independence relation from data- Exploit locality of Bayes nets (independence tests involve variable and neighbors)
- Errors propagate; (difficulty in choosing significance-level)

**This Lecture:**recover parametrization from data by ranking models according to score function

### 1.2 Score-Based Structure Learning

- Objective

Find an acyclic digraph \(G\) that maximizes \(s: \mathcal{G} \rightarrow \mathbb{R}\)

- Consistency

If \((\bar{G},p)\) is a Bayesian network, and \(\mathcal{D}\) is a collection of \(N\) iid data generated from \(p\), then \[ N \rightarrow \infty \implies \bar{G} = \arg\max_G s(G) \]

- Succinctness

If the log-likelihood of \(G\) and \(G'\) are equal but \(G\) has more parameters than \(G'\) then \(s(G)

Consistency and Succinctness ensure that "true Bayes nets" can be recovered asymptotically (up to Markov equivalences)

For computational reasons we also assume:

- Efficiency

The score \(s(G)\) should be computed in polynomial time in the size of \(G\) and \(N\)

- Decomposability:

\[ s(G)=\sum_{X\in\mathcal{V}} f(X,\pa{X}) \]

## 2 Information Theory

Information Theory investigates how to store and transmit data efficiently and accurately. Given the limitations of physical means (noise, bandwidth, etc), it is not surprising that much of the theory relates to uncertainty. What might be more surprising is the fact that the apparatus developed to design efficient mechanisms for communicating data can also be used to perform statistical inference, in particular, to justify criteria for model selection.

### 2.1 The information content of a bit

How much information can a bit contain? To answer this,
consider a random variable \(X\) taking on four values with
probabilities \(P(X=0)=1/2\), \(P(X=1)=1/4\), \(P(X=2)=1/8\)
and \(P(X=3)=1/8\). How many bits are needed to
store/communicate the value of \(X\)? Since we have four
values for \(X\), we need at least four *codewords* to
*decode* whatever representation we choose to describe \(X\).
For example, we may represent \(X=0\) by the codeword 00,
\(X=1\) by the codeword 01, \(X=2\) by the codeword 10 and
\(X=3\) by the codeword 11. Can we decrease the length of the
description without losing accuracy? Yes, in average, and if
instead of *fixed-size* codewords, we use codewords of
different sizes. The idea is to assign less bits to more
probable values; for example, we might instead describe
\(X=0\) by 0, \(X=1\) by 10, \(X=2\) by 110 and \(X=3\)
by 111. The *average description length* in this case is \(1
\cdot 1/2 + 2 \cdot 1/4 + 3 \cdot (1/8+1/8) = 3.5\) bits, less
than the 4 bits required for fixed size length. Now, take this
reasoning from the side of the receiver. The key point is that
the amount of compression of an information is related to its
probability of occurrence. Skewed probability distributions
allow shorter descriptions by skewed codeword size, while
uniform distributions encourage the used of equal-size
codewords. Now reverse this reasoning: suppose you receive 1
bit; when is this bit most informative? If the originating
distribution is uniform, then revealing the value of the bit
discards half of the possible values. On the other hand, if
the distribution is very skewed then revealing one bit allow
us to reduce information to one less element. Based on this
observations, Shannon propose to use the functional \(\log_2
1/p(x)\) to measure the information content of a event \(X=x\)
with probability \(p(x)\). Thus, the most informative scenario
is when \(X\) is distributed uniformly, in which case each bit
allows us to reduce our uncertainty by half; and the less
informative scenario is when the probabilities for \(X=x\)
decay exponentially, in which case each bit allows us to
reduce our uncertainty by only a small fraction.

### 2.2 Shannon Entropy

The information content of a random variable is the expected information content of values draw independently from its distribution:

The **Shannon entropy** of a set of variables \(\mathcal{X}\) under the
distribution \(p(\mathcal{X})\) is ^{1}
\[
H(\mathcal{X}) = - \sum_\mathcal{X} p(\mathcal{X}) \log_2 p(\mathcal{X})
\]

In practice, we use \(\ln\) instead of \(\log\), as they differ only up to a constant, and call this functional the entropy of a random variable

Consider a binary random variable \(X\) taking values \(\{a,b\}\). Its Shannon Entropy is \(p(a)[-\log_2 p(a)] + p(b)[-\log_2 p(b)]\). This quantity can be understood as the average description size (up to some rounding errors) of the expected codeword length of a message produced by drawing a value \(x\) of \(X\) and using \(\log_2 1/p(x)\) bits to encode. Since \(p(b)=1-p(a)\), we can plot this function, equal to the Shannon entropy of \(X\) in terms of \(p(a)\):

One can see that the entropy is symmetric, maximized when \(X\) is distributed uniformly, and minimized when \(X\) is degenerate. Also, the entropy is a concave and bounded function.

The following statements are true:

- \(0 \leq H_p(\mathcal{X}) \leq \ln |\Omega_{\mathcal{X}}|\)
- \(H_p(\mathcal{X}) = 0\) if and only if \(p\) is degenerate (i.e., it assigns all mass to a single configuration)
- \(H_p(\mathcal{X} \cup \mathcal{Y})= H_p(\mathcal{X})+H_p(\mathcal{Y})\) if and only if \(\mathcal{X} \ind \mathcal{Y}\)

### 2.3 Mutual Information

- Informativeness can be altered by knowledge of other quantities
- If \(X=Y\) then knowing \(Y\) removes uncertainty about \(X\)
- If \(X \ind Y\) then knowing \(Y\) does not reduce uncertainty about \(X\)

The **mutual information** of two sets of
variables \(\mathcal{X}\) and \(\mathcal{Y}\) under distribution
\(p(\mathcal{X} \cup \mathcal{Y})\) is
\[
I(\mathcal{X},\mathcal{Y}) = \sum_{\mathcal{X},\mathcal{Y}} p(\mathcal{X} \cup \mathcal{Y}) \ln \frac{p(\mathcal{X} \cup \mathcal{Y})}{p(\mathcal{X})p(\mathcal{Y})} \, .
\]

- Measures average reduction in uncertainty on \(\mathcal{X}\) once \(\mathcal{Y}\) is known (and vice-versa)
- \(I(\mathcal{X},\emptyset)=0\)

The following statements are true:

- \(I(\mathcal{X},\mathcal{Y}) = I(\mathcal{Y},\mathcal{X})\)
- \(I(\mathcal{X},\mathcal{Y}) = H(\mathcal{X}) + H(\mathcal{Y}) - H(\mathcal{X} \cup \mathcal{Y})\)
- \(I(\mathcal{X},\mathcal{Y}) \geq 0\)
- \(I(\mathcal{X},\mathcal{Y})=0\) if and only if \(\mathcal{X} \ind \mathcal{Y}\) (under \(p\))
- \(I(\mathcal{X},\mathcal{Y}\cup\mathcal{Z}) \geq I(\mathcal{X},\mathcal{Y})\) with equality if only if \(\mathcal{X} \ind \mathcal{Z} \mid \mathcal{Y}\)

The last property is particularly interesting:

- Let \(X\) be some variable and \(\mathcal{Y}\) and \(\mathcal{Z}\) be disjoint sets of variables
- If \(X \ind \mathcal{Z} \mid \mathcal{Y}\) then \(I(X,\mathcal{Y}\cup\mathcal{Z})=I(X,\mathcal{Y})\)
- Otherwise, \(I(X,\mathcal{Y}\cup\mathcal{Z}) > I(X,\mathcal{Y})\)

For convenience, we define \(I(\mathcal{X},\emptyset)=0\).

## 3 Maximum Likelihood Score

### 3.1 Log-Likelihood of Bayesian Network

\[ LL(G,p) = \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} N[X,\pa{X}] \ln p(X|\pa{X}) \]

\[ LL(G,p) = N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} \underset{=N[X,\pa{X}]/N[\pa{X}]}{\underbrace{p_\text{emp}(X,\pa{X})}} \ln p(X|\pa{X}) \]

\[ LL^{\text{MLE}}(G) = LL(G,p^\text{MLE}) = \arg\max_p LL(G,p) = LL(G,p_\text{emp}) \]

\begin{align*} LL^{\text{MLE}}(G) &= N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} p_\text{MLE}(X,\pa{X}) \ln p_\text{MLE}(X|\pa{X}) \\ &= N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} p_\text{MLE}(X,\pa{X}) \ln \frac{p_\text{MLE}(X,\pa{X})}{p_\text{MLE}(\pa{X})}\frac{p_\text{MLE}(X)}{p_\text{MLE}(X)} \\ &= N \sum_{X \in \mathcal{V}} \sum_{X,\pa{X}} p_\text{MLE}(X,\pa{X}) \ln \frac{p_\text{MLE}(X,\pa{X})}{p_\text{MLE}(X)p_\text{MLE}(\pa{X})}p_\text{MLE}(X) \\ &= N \sum_{X \in \mathcal{V}} I_{p_\text{MLE}}(X,\pa{X}) - N \sum_{X \in \mathcal{V}} H_{p_\text{MLE}}(X) \end{align*}### 3.2 MLE Score

Suppose the data were generated by a Bayesian network \((G,p)\). In the data size limit (i.e., when \(N \rightarrow \infty\)), we have that \(p_\text{MLE}\rightarrow p\), and \[ LL^{\text{MLE}}(G) \geq LL^{\text{MLE}}(G') \] for any structure \(G'\)

Let \(G'\) be a structure obtained by inserting arcs into \(G\) (and maintaining acyclicity). Then, \[ LL^{\text{MLE}}(G') \geq LL^{\text{MLE}}(G) \, . \]

Follows from \(I(X,\mathcal{Y} \cup \mathcal{Z}) \geq I(X,\mathcal{Y})\) with \(\mathcal{Y}\) the being the parents of \(X\) in \(G\) and \(\mathcal{Z}\) the additional parents in \(G'\)

- True structure is not single maximizer
- With finite samples, independence fails
- \(LL^\text{MLE}\) prefers denser structures over sparser structures

### 3.3 Penalized Log-Likelihood

Penalize denseness/complexity: \[ PLL(G) = LL^{\text{MLE}}(G) - \psi(N) \cdot \text{size}(G) \] where \[ \text{size}(G) = \sum_{X \in \mathcal{V}} (|\Omega_X|-1)\prod_{Y \in \pa{X}}|\Omega_Y| \] is the number of free-parameters in a (G,p), and \(\psi(N)\) is non-decreasing function that weights the penalization according to \(N\)

\[ PLL(G) = LL^{\text{MLE}}(G) - \psi(N) \cdot \text{size}(G) \]

- Different score functions can be obtained by differing \(\psi(N)\)
- \(LL^\text{MLE}\) grows linearly with \(N\)
- \(\psi(N)=\mathcal{O}(N)\) might lead to inconsistency
- \(\psi(N)=\mathcal{O}(\log N)\) ensures consistency

## 4 Minimum Description Length

### 4.1 Back to Enconding

- Suppose that we have several different models of symbol occurrence \(p_1,\dotsc,p_h\)
- To
**encode**a string: choose a model \(p_i\) and use \(-\log_2 p_i(x)\) bits for symbol \(x\) - To
**decode**message, we need to know the model used: e.g. add \(-\log_2 h\) bit header

Description Lenght represents average encoded message length to encode a random variable \(X\) distributed according to \(p\): \[ H(X) + \text{size}(p) \]

### 4.2 Model Selection as Data Compression

- Dataset can be seen as string to be encoded
- Symbols are configurations (possible instances)
- Probabilistic model of symbol/datum occurrence
- Encode data by encoding model and bitstring codes

**Minimum Description Lenght Principle:** Select model that minimizes description length

### 4.3 Using Bayesian Nets to Encode data

Suzuki 1996's proposal:

- Use Bayes net as model of symbol/datum occurrence
- Encode Bayes net as list of conditional probability tables
- Use \(\log_2 N/2\) bits of precision to encode probabilities

## 5 Bayesian Score

### 5.1 Bayesian Inference

- Posterior Distribution

\[ p(G|\text{data}) = \frac{p(G)p(\text{data}|G)}{p(\text{data})} \]

- Denominator is constant wrt to \(G\)

**Bayesian Score**

### 5.2 Prior Distribution

Assuming Categorical Variables and Dirichlet Parameter Priors

### 5.3 Small Detour: Discrete-Dirichlet Distribution

- Discrete Distribution

Models probability of observing \(\{ N[X=x]: x \sim X \}\) occurrences of a categorical random variable \(X\) \[ p(N[X]|\theta) = \prod_{x \sim X} \theta_x^{N[X=x]} \]

- Dirichlet Distribution

Models probability density of observing a nonegative vector \((\theta_1,\dotsc,\theta_n)\) such that \(\sum_i \theta_i = 1\) \[ p(\theta|\alpha) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_i \theta_i^{\alpha_i - 1} \]

### 5.4 Small Detour: Dirichlet Distribution

- Moments

Given integers \(n_1,\dotsc,n_k\):

\begin{align*} \mathbb{E}\left(\prod_i \theta_i^{n_i} \mid \alpha\right) &= \int_\theta \prod_i \theta_i^{n_i} \cdot \text{Dirichlet}(\theta|\alpha) \,d\theta \\ &= \frac{\Gamma(\sum_i \alpha_i)}{\Gamma(\sum_i \alpha_i + n_i)}\prod_i \frac{\Gamma(\alpha_i + n_i)}{\Gamma(\alpha_i)} \end{align*}

### 5.5 Decomposabability

\[ p(\text{data}|G,\theta) p(\theta|G) = \prod_{X \in \mathcal{V}}\prod_{\nu \sim \pa{X}} \underset{\text{Multinomial}}{\underbrace{p(N[X,\pa{X}=\nu]|\theta_{X|\nu})}} \cdot \underset{\text{Dirichlet}}{\underbrace{p(\theta_{X|\nu})}} \]

- Local Marginal Likelihood is Moment of Dirichlet

\begin{multline*} p(\text{data}[X,\pa{X}]) = \int_{\theta_{X|\nu}} p(N[X,\pa{X}=\nu]|\theta_{X|\nu}) p(\theta_{X|\nu}) d\theta_{X|\nu}\\= \frac{\Gamma(\sum_x \alpha_x)}{\Gamma(N[\pa{X}=\nu]+\sum_x \alpha_x)}\prod_{x \sim X} \frac{\Gamma(N[X=x,\pa{X}=\nu]+\alpha_{x|\nu}}{\Gamma(\alpha_{x|\nu})} \end{multline*}

### 5.6 Bayesian Score

- Prior \(p(G)\)
- Hyperparameters \(\alpha_{x|\nu}\)

### 5.7 Structure Decomposable Prior

\[ p(G) = \prod_X p(\pa{X}) \]

- E.g.: \(p(\pa{X}) = 1/|\pa{X}|\)
- E.g.: \(p(\pa{X}) = \prod_{Y \rightarrow X} p(Y \rightarrow X)\)
- In practice, little influence on model selection (if \(N\) is large)

### 5.8 Hyperparameters

- Likelihood Equivalence Principle

Two Markov equivalent structures should be assigned the same score

Likelihood Equivalence is satisfied iff \[ \alpha_{x|\nu} = \alpha \cdot p(X=x,\pa{X}=\nu) \] where \(\alpha\) is the

**equivalent sample size**Common choice: \[ p(x,\nu) = \frac{1}{|\Omega_X|\prod_{Y \in \pa{X}} |\Omega_Y|} = \text{Uniform}(X,\pa{X}) \]

- \(\alpha\) regulates overfitting; typical values for it are usually in the range \([1,2]\) (structure ranking is very sensitive to this value)

### 5.9 Bayesian Score Summary

- Efficient, decomposable, consistent and succinct
- Can be made Likelihood Equivalent (like LL and MDL)
- Requires Equivalent Sample Size
- Allows flexible prior knowledge about structures

### 5.10 Minimum Description Length

\[ \text{MDL}(G) = LL^{\text{MLE}}(G) - \frac{\log_2 N}{2} \text{size}(G) \]

- Penalized Log-Likelihood with \(\psi(N)=\log_2 N/2\) (therefore consistent and succinct)
- Decomposable and efficient

## 6 Bayesian Information Criterion

\[ p(G|\text{data}) \overset{N \rightarrow \infty}{\approx} \max_G\max_\theta p(G,\theta|\text{data}) \]

**BIC Score:**

- \(\text{BIC}(G) \approx \text{MDL}(G)\) up to log base

## 7 Structure Learning Algorithms

### 7.1 Problem

Given a decomposable, succinct and efficient score function \(s\), find a structure \(G\) maximizing \(s(G)\)

- NP-complete
- Exact techniques solve up to 100 variables
- Approximate techniques:
- Search in space of DAGs
- Search in space of topological orderings

### 7.2 Search in the Space of DAGs

- Start with some structure \(G\)
- Find \(G_{i+1}\) by applying any of the following operations
- Add a non-existing arc \(X \rightarrow Y\) (as long as it does not introduces a cycle)
- Remove an existing arc \(X \rightarrow Y\)
- Reverse an existign arc \(X \rightarrow Y\) (i.e., remove \(X \rightarrow Y\) and add \(Y \rightarrow X)\) as long as it does not introduce a cycle

- If \(s(G')>s(G)\) do \(G \gets G'\) and go to Step 2

Due to the decomposability of the score function, these operations can be performed efficiently and in any order, and allow some degree of parallelism

- Do not require strict bound on in-degree
- Can incorporate prior knowledge on structure
- Can search in Markov Equivalence Class
- Need to check acyclicity for each candidate

### 7.3 Search in the Space of Topological Orders

- Digraph \(G\) is acyclic iff it admits a topological ordering \(\sigma\)
- Number of topological ordering is exponentially smaller than number of DAGs

- Decomposes into
**ordering search**and**parent set selection**

### 7.4 Parent Set Selection

Given ordering \(\sigma\): \[ \max_{\pa{X_i} \subseteq \{X_j: \sigma(j) < \sigma(i)\}} f(X_i,\pa{X_i}) \]

- Usually: bound number of parents in the search
- Parent Set Space can be pruned based on the score function

### 7.5 Tessyer & Koller's Algorithm

\[ s(\sigma) =\sum_{i=1}^n \max_{\pa{X_i} \subseteq \{X_j: \sigma(j) < \sigma(i)\}} f(X_i,\pa{X_i}) \]

- Start with topological ordering \(\sigma\)
- Find \(\sigma'\) by swapping two variables ordering
- If \(s(\sigma') > s(\sigma)\) then \(\sigma \gets \sigma'\) and go to Step 2
- Search for \(\sigma'\) is very efficient as it involves only local scores of swapped variables

## 8 Exercises

The **Kullback-Leibler** (KL) divergence between the
distributions \(p(\mathcal{V})\) and \(q(\mathcal{V})\) is^{2}
\[ KL(p,q) = \sum_\mathcal{V} p(\mathcal{V}) \ln \frac{p(\mathcal{V})}{q(\mathcal{V})} = -H(\mathcal{V}) - \sum_{\mathcal{V}} p(\mathcal{V}) \ln q(\mathcal{V}) \,. \]

The KL divergence measures how close \(q\) approximates \(p\). Note that KL divergence is asymmetric. A information-theoretic interpretation of KL divergence is that \(KL(p,q)\) is the additional number of bits that is needed to encode a document using \(q\) relative to an encoding that uses the true distribution \(p\) (this requires that the log be taken with base 2). More generally, KL divergence can be shown to be the single measure of divergence between distributions that satisfy some desiderata. The KL-divergence can be rewritten as \[ KL(p,q) = -\left( \underset{\text{cross entropy}}{\underbrace{\sum_\mathcal{V} p(\mathcal{V}) \ln q(\mathcal{V})}} + \underset{H(\mathcal{V})}{\underbrace{\sum_\mathcal{V} - p(\mathcal{V}) \ln p(\mathcal{V})}} \right ) \, . \]

You might find the following result useful for solving these exercises.

**Gibbs Inequality:** \(KL(p,q) \geq 0\) for any \(p,q\) and \(KL(p,q) = 0\) if and only if \(p=q\).

- Prove that the following statements are true:
- \(0 \leq H(\mathcal{X}) \leq \ln |\Omega_{\mathcal{X}}|\).
- \(H(\mathcal{X}) = 0\) if and only if \(p\) is degenerate (i.e., it assigns all mass to a single configuration).
- \(H(\mathcal{X} \cup \mathcal{Y})= H(\mathcal{X})+H(\mathcal{Y})\) if and only if \(\mathcal{X} \ind \mathcal{Y}\) (under \(p\)).

Prove that the following statements are true:

- \(I(\mathcal{X},\mathcal{Y}) = I(\mathcal{Y},\mathcal{X})\).
- \(I(\mathcal{X},\mathcal{Y}) = H(\mathcal{X}) + H(\mathcal{Y}) - H(\mathcal{X} \cup \mathcal{Y})\).
- \(I(\mathcal{X},\mathcal{Y}) \geq 0\).
- \(I(\mathcal{X},\mathcal{Y})=0\) if and only if \(\mathcal{X} \ind \mathcal{Y}\) (under \(p\)).
- \(I(\mathcal{X},\mathcal{Y}\cup\mathcal{Z}) \geq I(\mathcal{X},\mathcal{Y})\) with equality if only if \(\mathcal{X} \ind \mathcal{Z} \mid \mathcal{Y}\).

Hint: note that \(I(\mathcal{X},\mathcal{Y})=KL(p(\mathcal{X}\cup\mathcal{Y}),p(\mathcal{X})p(\mathcal{Y}))\).

- Show that adding edges to a structure never decreases its
likelihood, that is, that if \(G\) is the result of adding
edges to \(G'\) then \[ LL^{MLE}(G) \geq LL^{MLE}(G') \, . \]
*Hint:*Note that adding edges increases the number of parameters while still maintaining the previous parameters. - Consider a structure \(G_\emptyset\) over two binary random
variables \(X\) and \(Y\) and with no arcs, and a second
structure \(G_{X \rightarrow Y}\) containing the same node
set and the arc \(X\rightarrow Y\).
- Show that \(\text{BIC}(G_{X \rightarrow Y}) > \text{BIC}(G_\emptyset)\) iff \(I(X,Y) > \ln(N)/(2N)\).
- Explain how the previous result can be used to prove consistency of BIC for two variables.

- Prove the following theorem: Suppose the data were
generated by a Bayesian network \((G,p)\). If \(N \rightarrow
\infty\) it follows that \[ LL^{MLE}(G) \geq LL^{MLE}(G') \,
, \] for any structure \(G'\).
*Hint:*use the Gibbs Inequality with the distribution induced by a structure \(G'\). Note that the empirical distribution matches the data generating distribution \(p\) (since we have infinite data).

## Footnotes:

^{1}

^{2}