# Constraint-Based Bayesian Network Structure Learning

\( \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\indeg}[1]{\text{indeg}(#1)} \newcommand{\outdeg}[1]{\text{outdeg}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\!\perp_d\!} \newcommand{\msep}{\!\perp_m\!} \newcommand{\sep}{\!\perp_u\!} \)

We discuss the problem of inferring the structure of a
Bayesian network from data. There are mainly two approaches to
this problem. The first approach, which we study in this
lecture, is connected to the independence representation view
of Bayesian networks. In this approach, termed
**constraint-based structure learning**, an independence oracle
is assumed available (which tests for conditional independence
of variables), and a Markov equivalent structure is recovered
making a number of calls to that oracle. The second approach,
termed **score-based structure learning**, takes the
distribution view of Bayesian networks, and attempts at
finding a structure which fits the empirical distribution in
the data. This is usually posed and solved as a combinatorial
problem over the space of DAGs guided by a polynomial-time
computable scoring rule. We will leave the study of this
approach for later.

## 1 Bayesian network structure learning

Thus far we have avoided the question "Where do probabilistic
graphical models come from?", and we have taken for granted
that a probabilistic model is available. However, obtaining a
model that meets our requirements is far from trivial. One
possibility is to rely on domain experts through a careful and
painstaking process of **knowledge elicitation**. This includes
training experts on probabilistic graphical modeling,
calibrating experts' opinions and extracting and verifying
knowledge. This process too often leads to disagreement among
the experts and lack of reliability about the model.
Nevertheless, in many domains where data is scarce, this is
one of the main approaches to model construction. It is also
part of the methodology of most scientific disciplines.

Another possibility is to automatically derive the model using
a data set of collected observations. It is this **machine
learning** approach that we adopt here (so that we completely
avoid the very rich field of human knowledge elicitation). To
be fair, model construction is never purely data-driven. A
number of (often hidden) assumptions and constraints need to
be made so that data can be turned into models. It is however
domain agnostic: the machine learning approach can be almost
automatically applied to any domain where data satisfying the
requirements is available.

We will start examining a class of techniques that derives Bayesian network structures from data without addressing the estimation of the numerical parameters (i.e., the conditional probability values). This approach is known as constraint-based structure learning.

## 2 Constraint-based structure learning

The objective of constraint-based structure learning methods is to recover a structure that best captures the independences in a dataset. In other words, we are interested in finding an acyclic digraph that is minimal with respect to the conditional independences observed in the data. We assume the availability of an independence oracle:

An **independence oracle** is a(n abstract) computational device
that answers, in polynomial time, the question \(\mathcal{X}
\ind \mathcal{Y} \mid \mathcal{Z}\) for any sets
\(\mathcal{X}\),\(\mathcal{Y}\) and \(\mathcal{Z}\).
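To make this concrete, when the generating structure is known, d-separation itself implements such an oracle. The sketch below (plain Python, with function names of our choosing) checks d-separation by enumerating all simple undirected paths and testing the usual blocking conditions, so it is only viable for tiny graphs:

```python
def descendants(dag, v):
    """All descendants of v (excluding v) in dag = {node: set of parents}."""
    children = {u: {w for w in dag if u in dag[w]} for u in dag}
    seen, stack = set(), [v]
    while stack:
        for c in children[stack.pop()] - seen:
            seen.add(c)
            stack.append(c)
    return seen

def d_separated(dag, x, y, zs):
    """True iff x and y are d-separated by the set zs in dag = {node: parents}."""
    zs = set(zs)

    def paths(a, b, visited):
        # enumerate all simple undirected paths from a to b
        if a == b:
            yield [a]
            return
        neighbors = dag[a] | {w for w in dag if a in dag[w]}
        for n in neighbors - visited:
            for p in paths(n, b, visited | {n}):
                yield [a] + p

    for path in paths(x, y, {x}):
        blocked = False
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            collider = prev in dag[node] and nxt in dag[node]
            if not collider and node in zs:
                blocked = True   # observed serial/divergent node blocks the path
                break
            if collider and node not in zs and not (descendants(dag, node) & zs):
                blocked = True   # unobserved collider with no observed descendant
                break
        if not blocked:
            return False         # an active path d-connects x and y
    return True
```

For the chain \(A \rightarrow B \rightarrow C\), this oracle answers yes to \(A \ind C \mid B\) and no to \(A \ind C\); for the collider \(A \rightarrow B \leftarrow C\), the answers are reversed.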

Since d-separation is not complete, and since not every independence relation can be represented as a Bayesian network, some assumptions need to be made. The most common is faithfulness, which in effect corresponds to assuming that the data in hand was generated by a Bayesian network from which we can ask conditional independence queries.

- The data has been generated by a Bayesian network whose structure has bounded in-degree \(d\).
- The independence oracle is an exact test of the conditional independences in the generating network.
- D-separation is complete in the originating Bayesian network (i.e., \(X \dsep Y \mid Z\) iff \(X \ind Y \mid Z\)).

All of these assumptions are unrealistic: data is seldom
generated by a stationary Bayesian network distribution
**faithful** to its structure (meaning all graphical dependences
are verified in the distribution), and conditional
independence tests (independence oracles) are imperfect. One
particularly important exception to these assumptions arises
when the generating network contains **latent variables**, that
is, variables which are not observed. For example, consider a
dataset generated by a Bayesian network with structure \(A \rightarrow D \leftarrow B \rightarrow E \leftarrow C\).

Suppose that \(B\) is latent, so that we only observe variables \(A,C,D,E\). Note that \(D\) and \(E\) are not d-separated by any set not containing \(B\). Hence, an independence oracle answering according to the structure above answers affirmatively the queries \(A \ind \{C,E\}\) and \(C \ind \{A,D\}\). The former requires that \(D\) acts as a convergent connection in any digraph representation, while the latter requires that \(E\) acts as a convergent connection. This is clearly impossible to represent with a structure whose skeleton is \(A-D-E-C\).

## 3 Order-based algorithm

So our goal is to obtain a sound structure that is minimal: any d-separation derived from the structure is verified by the independence oracle, and the removal of any edge leads to d-separations which are not true (according to the independence oracle). The simplest mechanism for achieving this is to use the ordered Markov property: \[ X \ind \{ Y: Y < X\} \mid \pa{X} \text{ for any topological ordering } < \, . \] So assume that we are given an ordering of the variables, say \(X_1,\dotsc,X_n\). For \(i=2,\dotsc,n\), find a subset-minimal set \(\mathcal{U} \subseteq \{X_1,\dotsc,X_{i-1}\}\) such that \(X_i \ind \{X_1,\dotsc,X_{i-1}\} - \mathcal{U} \mid \mathcal{U}\). This can be achieved by querying the independence oracle for increasingly larger sets \(\mathcal{U}\). Define \(\mathcal{U}\) as the parents of \(X_i\) and continue the procedure. Note that by construction, the network factorizes according to the variable ordering, and thus satisfies all the d-separations induced by the corresponding structure.

The parent set selected for each variable is not necessarily unique. For example, if \(A=B\) and \(A < B < C\), then either \(\{A\}\) or \(\{B\}\) is a minimal subset for \(C\) (but not \(\{A,B\}\)). When the data-generating distribution is positive, however, one can show that any such set \(\mathcal{U}\) is unique. The structure obtained by this procedure is sensitive to the ordering chosen, and different orderings can result in different structures.
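A minimal sketch of this ordered procedure, assuming the oracle is given as a function `oracle(x, ys, zs)` answering \(X \ind \mathcal{Y} \mid \mathcal{Z}\) (the stub below hard-codes the independences of the chain \(A \rightarrow B \rightarrow C\)):

```python
from itertools import combinations

def learn_parents(order, oracle):
    """For each variable, select a subset-minimal set of predecessors that
    separates it from the remaining predecessors (ordered Markov property)."""
    parents = {order[0]: set()}
    for i in range(1, len(order)):
        x, preds = order[i], order[:i]
        for size in range(len(preds) + 1):   # try smaller sets first
            found = None
            for u in combinations(preds, size):
                if oracle(x, set(preds) - set(u), set(u)):
                    found = set(u)
                    break
            if found is not None:
                parents[x] = found
                break
    return parents

def chain_oracle(x, ys, zs):
    """Stub oracle for the chain A -> B -> C: the only nontrivial
    independence between a variable and a set is A ind C given B."""
    if not ys:
        return True                      # X ind {} | Z holds trivially
    return {x} | ys == {'A', 'C'} and 'B' in zs
```

With ordering \(A < B < C\) this recovers the chain itself; with \(C < B < A\) it recovers the reversed chain \(C \rightarrow B \rightarrow A\), illustrating the sensitivity to the ordering.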

Consider the three structures in the figure below. Assuming the leftmost structure as the independence oracle, the algorithm obtains (i) the same structure if using ordering \(A < B < D < C < E\), (ii) the middle structure if using ordering \(E < D < C < B < A\), and (iii) the rightmost structure if using ordering \(E < A < D < B < C\).

By construction, the variable orderings are topological orderings of the generated structures, which ensures the non-existence of directed cycles. Thus, by repeating the procedure with every possible variable ordering we obtain all structures that are sound with respect to d-separation.

This structure learning method, while conceptually simple, is
very inefficient. First, it requires listing every variable
ordering (there are \(n!\) such orderings for \(n\) variables).
Second, it calls the independence oracle an exponential number
of times (there are \(2^{n-1}\) subsets of \(n\) variables), with
a query involving an arbitrarily large number of variables
(even if the number of parents in the *generating*
distribution is small). Last, some of the structures "learned"
by this method are not *perfect*, that is, they miss
independences represented by other structures found with the
same algorithm. We now turn to an algorithm that solves all
these issues.

## 4 The PC algorithm

By definition, any two Markov equivalent structures induce the
same answers from an independence oracle, so that no algorithm
that is based purely on this oracle can prefer one structure
over the other. It seems reasonable then to learn an
**equivalence class** instead of a single structure. Recall that
an equivalence class contains all Markov equivalent
structures. A **partially directed acyclic graph** (PDAG) of
an equivalence class is a graph containing both directed and
undirected edges, such that an edge appears directed if
and only if the corresponding arc appears in all the
structures in the equivalence class.

The PC (Peter & Clark) algorithm learns a partially directed
acyclic graph representing the (in)dependences in the data
using only *polynomially many calls* to the oracle, each call
involving only a *bounded number of variables* [SM1995].

### 4.1 The Algorithm

The algorithm has the following steps:

- Identify the graph skeleton (= undirected graph) induced by the independence relations answered by the oracle.
- Identify immoralities.
- Identify derived directions.

The output of the algorithm is a PDAG representing the class of Markov equivalent networks.

### 4.2 Skeleton identification

The skeleton identification step is based on the following properties of d-separation in Bayesian networks:

If two nodes \(X\) and \(Y\) are adjacent in an acyclic digraph then there is no set \(\mathcal{Z}\subseteq\mathcal{V}-\{X,Y\}\) that d-separates \(X\) and \(Y\).

If \(X\) and \(Y\) are *not* adjacent in an acyclic digraph
then either \(\pa{X}\) d-separates \(X\) and \(Y\) or
\(\pa{Y}\) d-separates \(X\) and \(Y\).

Due to the faithfulness assumption, these properties translate
to independences: if \(X\) and \(Y\) are *not* adjacent in the
originating network, then either \(X \ind Y \mid \pa{X}\) or \(X
\ind Y \mid \pa{Y}\); conversely, if the variables *are*
adjacent then any query \(X \ind Y \mid \mathcal{Z}\) with \(X,Y
\not\in \mathcal{Z}\) will be answered negatively. Since we
assumed \(|\pa{X}| \leq d\) for any \(X\), then \(X\) and \(Y\)
are adjacent if and only if we cannot find a set \(\mathcal{Z}\)
(not containing \(X\) and \(Y\)) of size at most \(d\) such that
\(X \ind Y \mid \mathcal{Z}\). Call any such set \(\mathcal{Z}\) a
**witness set** (for \(X\) and \(Y\)). Note that there are
\(O(2^d)\) witness sets for a pair of variables, and this is
considered constant with respect to the input size (as \(d\) is
considered constant). Moreover, since a witness set is a
parent set of one of the endpoints, we need only to consider
subsets of variables that have not been deemed independent of
either endpoint. Building on these observations, the algorithm
to identify the skeleton of the PDAG proceeds as follows.

- Start with a complete graph \(H\) over the variables.
- For each edge \(X - Y\), consider all possible witness sets \(\mathcal{Z} \subseteq \nb{X} \cup \nb{Y}\) with size \(|\mathcal{Z}|=0,1,\dotsc,d\); for each such set \(\mathcal{Z}\), query \(X \ind Y \mid \mathcal{Z}\) and remove \(X - Y\) from \(H\) if the query returns "yes" (store \(\mathcal{Z}\) as the witness of that edge).

Note that in the second step, we only need to consider subsets of either \(\nb{X}\) or \(\nb{Y}\) in the current \(H\).
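The skeleton phase can be sketched as follows (a simplification of the actual PC pruning schedule; the pairwise oracle `oracle(x, y, zs)` and the stub encoding the chain \(A \rightarrow B \rightarrow C\) are our assumptions):

```python
from itertools import combinations

def find_skeleton(variables, oracle, d):
    """Start from the complete graph; remove an edge X - Y when some witness
    set of size <= d drawn from the current neighbors separates X and Y."""
    adj = {v: set(variables) - {v} for v in variables}
    witness = {}
    for size in range(d + 1):
        for x in variables:
            for y in sorted(adj[x]):
                candidates = (adj[x] | adj[y]) - {x, y}
                for zs in combinations(sorted(candidates), size):
                    if oracle(x, y, set(zs)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        witness[frozenset({x, y})] = set(zs)
                        break
    return adj, witness

def chain_pair_oracle(x, y, zs):
    # pairwise independences of the chain A -> B -> C
    return {x, y} == {'A', 'C'} and 'B' in zs
```

On the chain this removes only the edge \(A - C\), with witness set \(\{B\}\).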

### 4.3 Finding immoralities

Recall that an equivalence class is characterized by a
skeleton and a set of immoralities. Thus, it remains to devise
a procedure that finds the immoralities. So consider a
skeleton \(S\) and a list of witness sets for each absent edge
in \(S\). A candidate immorality is a triple \(X - Z - Y\) such
that \(X\) and \(Y\) are not adjacent. Consider any such
triple. If it is *not* an immorality in the originating graph
we will have either one of the serial connections \(X \rightarrow Z \rightarrow Y\) and \(X \leftarrow Z \leftarrow Y\), or the divergent connection \(X \leftarrow Z \rightarrow Y\).
In any of the three cases, \(Z\) d-separates \(X\) and \(Y\), and \(X\) and \(Y\) are d-connected if \(Z\) is not given. Thus it follows from the faithfulness assumption that \(X \ind Y \mid \mathcal{Z}\) if and only if \(\mathcal{Z}\) contains \(Z\). On the other hand, if the triple is indeed an immorality (in the originating graph) then \(\neg(X \ind Y \mid \mathcal{Z})\) for any set \(\mathcal{Z}\) containing \(Z\). So if \(X-Z-Y\) is an immorality, then \(Z\) does not appear in any witness set of \(X-Y\); and if \(X-Z-Y\) is not an immorality then \(Z\) appears in every witness set of \(X-Y\). This leads to a simple procedure for determining immoralities:

- For each potential immorality \(X-Z-Y\), orient \(X \rightarrow Z \leftarrow Y\) if \(Z\) is not in the witness set of \(X-Y\).
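A sketch of this step, taking the skeleton adjacency and the stored witness sets in the formats of our earlier sketches (the representations are assumptions, not a fixed API):

```python
def find_immoralities(adj, witness):
    """Return the set of arcs (parent, child) obtained by orienting every
    triple X - Z - Y with X, Y non-adjacent whose witness set excludes Z."""
    arcs = set()
    for z in adj:
        nbrs = sorted(adj[z])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                x, y = nbrs[i], nbrs[j]
                nonadjacent = y not in adj[x]
                if nonadjacent and z not in witness[frozenset({x, y})]:
                    arcs.add((x, z))
                    arcs.add((y, z))
    return arcs
```

For the collider \(A \rightarrow B \leftarrow C\), the witness of \(A - C\) is the empty set, so both arcs into \(B\) are recovered; for the chain, the witness is \(\{B\}\) and no arc is oriented.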

The outcome of the procedure above is not yet the partially directed acyclic graph representing the Markov equivalence class defined by the oracle, because there might be arcs \(X \rightarrow Y\) which appear in every structure in the class and yet are undirected in \(S\). This occurs when a different orientation of that edge in \(S\) would either create a cycle or an immorality. For example, the structure in Figure 5 has a single immorality. This entails a unique orientation for the edge \(C-D\), and excludes the graph on the right from representing the Markov equivalence class of that graph.

Figure 5: A Bayesian network structure which is also the partially directed graph of its equivalence class (left), and a possible output of the procedure described (right).

### 4.4 Orienting arcs

The final step of the PC algorithm is thus the orientation of arcs from the partially directed skeleton \(S\) (obtained from the previous step). This is achieved by recursively applying any of the following rules (in any order) to \(S\) until convergence.

- Prohibited immorality rule: if \(X \rightarrow Z\) and \(Z - Y\) with \(X\) and \(Y\) non-adjacent, orient \(Z \rightarrow Y\).

- Acyclicity rule: if \(X \rightarrow Z \rightarrow Y\) and \(X - Y\), orient \(X \rightarrow Y\).

- Quartet rule: if \(X - Y_1\), \(X - Y_2\), \(X - Z\), \(Y_1 \rightarrow Z\) and \(Y_2 \rightarrow Z\) with \(Y_1\) and \(Y_2\) non-adjacent, orient \(X \rightarrow Z\).

The correctness of the first two rules is straightforward. For the third rule, we can show correctness by contradiction: if we assume that \(X-Y_1\) and \(X-Y_2\) are undirected in the equivalence class, then any other orientation of \(X-Z\) would either create a cycle or an immorality. These rules take at most polynomial time, and hence the overall algorithm takes polynomial time in the number of variables.
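As an illustration, the first rule alone can be sketched as a fixed-point computation (the adjacency and arc representations follow our earlier sketches and are assumptions; the other two rules follow the same pattern):

```python
def apply_prohibited_immorality_rule(adj, arcs):
    """Orient Z - Y as Z -> Y whenever some arc X -> Z exists with X and Y
    non-adjacent, repeating until no rule application fires."""
    arcs = set(arcs)
    changed = True
    while changed:
        changed = False
        for (x, z) in list(arcs):
            for y in sorted(adj[z] - {x}):
                undirected = (z, y) not in arcs and (y, z) not in arcs
                if undirected and y not in adj[x]:
                    arcs.add((z, y))   # any other orientation creates X -> Z <- Y
                    changed = True
    return arcs
```

On the skeleton \(A - B - C\) with \(A \rightarrow B\) already oriented and \(A, C\) non-adjacent, the rule orients \(B \rightarrow C\).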

The overall procedure can be shown to be sound and complete [Meek1995]. That is, the PC algorithm returns the partially directed acyclic graph of the Markov equivalence class represented by the oracle.

### 4.5 Time complexity

The skeleton identification procedure examines all the \(O(n^2)\) pairs of variables, and for each performs \(O(n^d)\) calls to the oracle; hence, it consumes \(O(n^{d+2})\) time. If each oracle call takes polynomial time in the number of variables and in the data set size, the overall time complexity is also polynomial (since \(d\) is assumed constant). The immorality identification can be accomplished in time linear in the size of the skeleton and of the witness sets. Finally, the arc orientation rules only orient edges, and therefore this step necessarily terminates in \(O(n^2)\) steps. Thus, the algorithm finishes in polynomial time. Note the dependence of the procedure on the boundedness of the in-degree.

## 5 Statistical independence testing

The only missing part for a practical algorithm is the construction of an independence oracle that answers conditional independence queries of the form \(X \ind Y \mid \mathcal{Z}\) given a dataset \(\mathcal{D}\) of configurations of the variables assumed generated independently from some (unknown) Bayesian network. To simplify the discussion, fix \(X,Y,\mathcal{Z}\), so that the oracle is a function of only \(\mathcal{D}\); call \(I(\mathcal{D})\) this function, that is, \(I\) is a function that answers yes or no given a data set \(\mathcal{D}\). Assume that \(\mathcal{D}\) was generated by a distribution \(p(X,Y,\mathcal{Z})\). Ideally, we would like to have \[ I(\mathcal{D}) = \text{yes} \Leftrightarrow p(X,Y|\mathcal{Z}) = p(X|\mathcal{Z})p(Y|\mathcal{Z}) \, . \] In practice, however, we do not have access to \(p(X,Y,\mathcal{Z})\), and the oracle will not be perfectly reliable. The mistakes made by the oracle can be classified into two types:

- Type I or false positive, when the oracle *rejects* a true independence statement.
- Type II or false negative, when the oracle *accepts* a false independence statement.

The **type I error rate** is formalized as: \[ \alpha =
P(\{\mathcal{D}' \sim p(X,Y,\mathcal{Z}):
I(\mathcal{D}')=\text{no}, |\mathcal{D}'|=|\mathcal{D}|\} \mid
H_0) \, , \] where \(H_0\) is the so-called **null hypothesis**
that \(X \ind Y \mid \mathcal{Z}\). This probability value is
known as the **significance level**. Note that by *controlling*
the type I error we, in principle, make no commitment to type
II errors (as they are probabilities conditioned on a disjoint
event). In terms of an independence oracle used to learn the
structure of a Bayesian network, the smaller the value of
\(\alpha\) the sparser is the induced graph.

Data-based oracles are often obtained by *thresholding* a data
statistic, that is, by returning yes if and only if \[
f(\mathcal{D}) \leq t \, , \] where \(f\) is a real-valued
function and \(t\) is a real value. The probability of observing
a value of \(f(\mathcal{D}) > t\) under the (null) hypothesis of
independence is known as the **p-value** of the test: \[
\text{p-value}(t) = P(\{ \mathcal{D}' \sim p(X,Y,\mathcal{Z}):
f(\mathcal{D}') > t, |\mathcal{D}'|=|\mathcal{D}|\} \mid H_0) \, .
\]

A common statistic for testing independence of categorical
variables is the Chi-Square statistic: \[ \chi^2(\mathcal{D})
= \sum_{x,y,z}
\frac{\left(N[x,y,z] -
N[x,z]N[y,z]/N[z]\right)^2}{N[x,z]N[y,z]/N[z]}
\, . \] Here \(N[\cdot]\) denotes counts in the dataset. The statistic sums, over all configurations, the squared difference between the observed counts and the counts expected under independence, normalized by the expected counts (i.e., under \(p(X,Y,\mathcal{Z})=p(X|\mathcal{Z})p(Y|\mathcal{Z})p(\mathcal{Z})\)).
When \(X\) and \(Y\) are (conditionally) independent under the
empirical distribution, the \(\chi^2\) statistic returns 0;
otherwise it returns a positive value proportional to the
discrepancy between the observed and the hypothesized
distributions. The Chi-Square statistic follows
*approximately* a Chi-Square distribution with
\((|\Omega_X|-1)(|\Omega_Y|-1)\prod_{Z \in
\mathcal{Z}}|\Omega_Z|\) degrees of freedom under the (null)
hypothesis of independence of variables, and of iid data. The
approximation is fairly accurate when the data set is
moderately large (say, greater than 30 records).
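A sketch of the statistic computed from raw counts (pure Python; the function and data formats are ours, and in practice one would convert the statistic into a p-value with a chi-square tail function such as `scipy.stats.chi2.sf`):

```python
from collections import Counter

def chi_square(data, x, y, zs):
    """Conditional chi-square statistic for X ind Y | Zs; data is a list of
    records, each a dict mapping variable name to value."""
    nxyz, nxz, nyz, nz = Counter(), Counter(), Counter(), Counter()
    for r in data:
        z = tuple(r[v] for v in zs)
        nxyz[(r[x], r[y], z)] += 1
        nxz[(r[x], z)] += 1
        nyz[(r[y], z)] += 1
        nz[z] += 1
    xvals = {r[x] for r in data}
    yvals = {r[y] for r in data}
    stat = 0.0
    for z in nz:                         # one stratum per configuration of Zs
        for vx in xvals:
            for vy in yvals:
                expected = nxz[(vx, z)] * nyz[(vy, z)] / nz[z]
                if expected > 0:
                    stat += (nxyz[(vx, vy, z)] - expected) ** 2 / expected
    return stat
```

For a 2x2 table with counts 20/10/10/20 (and empty \(\mathcal{Z}\)) the expected counts are all 15, giving a statistic of \(4 \cdot 25/15 \approx 6.67\); this exceeds the 3.84 critical value of a 1-degree-of-freedom chi-square at \(\alpha = 0.05\), so independence is rejected.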

Typically, the significance level is set to 0.05, so that an
idealized procedure would make one false rejection in every 20
tests. Of course, the several approximations made usually move
this number away from the ideal. A more serious issue is the
large number of tests employed when identifying the skeleton,
a phenomenon generally called **multiple hypothesis testing**:
the type I error probability increases as the number of
independence tests increases. Controlling for this type of error by a decrease
of the significance level often leads to an increase of type
II errors. This is particularly troublesome when the data set
size is small compared to the maximum (assumed) number of
parents \(d\). In practice, constraint-based structure
learning is most often used to provide a first rough estimate
of the structure, that is then refined by a score-based
structure learning procedure (which is the subject of a future
lecture).

## 6 Exercises

Use the PC algorithm to learn a PDAG using d-separation as oracle in each of the following structures.

- Implement the PC algorithm and use it to learn a PDAG from
  the Breast Cancer and the Primary Tumor datasets.
  Experiment with different significance levels (e.g., 0.01,
  0.05, 0.1).
- What is the effect of the significance level on the learned structures?
- Which conclusions about these domains do you draw from the learned structures?

# Bibliography

- [SM1995] Peter Spirtes & Christopher Meek. Learning Bayesian Networks with Discrete Variables from Data. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 294-300, 1995.
- [Meek1995] Christopher Meek. Causal inference and causal explanation with background knowledge. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 403-410, 1995.
