Constraint-Based Bayesian Network Structure Learning

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\can}[1]{\overline{\text{an}}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\perp_d} \newcommand{\sep}{\perp_u} \newcommand{\msep}{\perp_m} $

We discuss the problem of inferring the structure of a Bayesian network from data. There are two main approaches to this problem. The first approach, which we study in this lecture, is connected to the independence representation view of Bayesian networks. In this approach, termed constraint-based structure learning, an independence oracle is assumed available (which tests for conditional independence of variables), and a Markov equivalent structure is recovered by making a number of calls to that oracle. The second approach, termed score-based structure learning, takes the distribution view of Bayesian networks, and attempts to find a structure that fits the empirical distribution of the data. This is usually posed and solved as a combinatorial problem over the space of DAGs guided by a polynomial-time computable scoring rule. We will leave the study of this approach for later.

Bayesian network structure learning

Thus far we have avoided the question “Where do probabilistic graphical models come from?”, and we have taken for granted that a probabilistic model is available. However, obtaining a model that meets our requirements is far from trivial. One possibility is to rely on domain experts through a careful and painstaking process of knowledge elicitation. This includes training experts on probabilistic graphical modeling, calibrating experts’ opinions and extracting and verifying knowledge. This process too often leads to disagreement among the experts and lack of reliability about the model. Nevertheless, in many domains where data is scarce, this is one of the main approaches to model construction. It is also part of the methodology of most scientific disciplines.

Another possibility is to automatically derive the model using a data set of collected observations. It is this machine learning approach that we adopt here (so that we completely avoid the very rich field of human knowledge elicitation). To be fair, model construction is never purely data-driven. A number of (often hidden) assumptions and constraints need to be made so that data can be turned into models. It is however domain agnostic: the machine learning approach can be almost automatically applied to any domain where data satisfying the requirements is available.

We will start examining a class of techniques that derives Bayesian network structures from data without addressing the estimation of the numerical parameters (i.e., the conditional probability values). This approach is known as constraint-based structure learning.

Constraint-based structure learning

The objective of constraint-based structure learning methods is to recover a structure that best captures the independences in a dataset. In other words, we are interested in finding an acyclic digraph that is minimal with respect to the conditional independences observed in the data. We assume the availability of an independence oracle:

An independence oracle is an (abstract) computational device that answers, in polynomial time, the question $\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$ for any sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$.
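
To fix notation for the sketches that appear later in this lecture, we can picture the oracle as a simple callable; the Python interface below is our own illustration (the name `IndependenceOracle` and its signature are assumptions, not part of any library).

```python
from typing import AbstractSet, Hashable, Protocol

class IndependenceOracle(Protocol):
    """Hypothetical interface of an independence oracle.

    A call oracle(xs, ys, zs) answers, in polynomial time, whether the
    variables in xs are jointly independent of those in ys given zs.
    """

    def __call__(self,
                 xs: AbstractSet[Hashable],
                 ys: AbstractSet[Hashable],
                 zs: AbstractSet[Hashable]) -> bool:
        ...
```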

Since d-separation is not complete, and since not every independence relation can be represented as a Bayesian network, some assumptions need to be made to enable learning. Here are the most common assumptions:

  • The data has been generated by a Bayesian network whose structure has bounded in-degree $d$.

  • The independence oracle is an exact test of the conditional independences in the generating network.

  • D-separation is complete in the originating Bayesian network (i.e., $X \dsep Y \mid Z$ iff $X \ind Y \mid Z$).

All of those assumptions are unrealistic: data is seldom generated by a stationary bounded-degree Bayesian network distribution faithful to its structure (meaning all graphical dependences are verified in the distribution), and practical conditional independence tests are imperfect. One particularly important violation of those assumptions arises when the generating network contains latent variables, that is, variables which are not observed. For example, consider a dataset generated by a Bayesian network with structure:

Suppose that $B$ is latent, so that we only observe variables $A,C,D,E$. Note that $D$ and $E$ are not d-separated by any set not containing $B$. Hence, an independence oracle answering according to the structure above answers affirmatively the queries $A \ind \{C,E\}$ and $C \ind \{A,D\}$. The former requires that $D$ acts as a convergent connection in any digraph representation, while the latter requires that $E$ acts as a convergent connection. This is clearly impossible to represent with a structure whose skeleton is $A-D-E-C$.

When an ordering is known

So our goal is to obtain a sound structure that is minimal: any d-separation derived from the structure is verified by the independence oracle, and the removal of any edge leads to d-separations which are not verified (according to the independence oracle). The simplest mechanism for achieving this is to use the ordered Markov property:

$$ X \ind \{ Y: Y < X \} \mid \pa{X} \text{ for any topological ordering } < \, . $$

So assume that we are given a topological ordering of the variables, say $X_1,\dotsc,X_n$. For $i=2,\dotsc,n$, find a subset-minimal set $$ \mathcal{U} \subseteq \{X_1,\dotsc,X_{i-1}\} $$ such that $$ X_i \ind \{X_1,\dotsc,X_{i-1}\} - \mathcal{U} \mid \mathcal{U} \, . $$

This can be achieved by querying the independence oracle for increasingly larger sets $\mathcal{U}$. Define $\mathcal{U}$ as the parents of $X_i$ and continue the procedure. Note that by construction, the network factorizes according to the variable ordering, and thus satisfies all the d-separations induced by the corresponding structure. The parent set selected for each variable is not necessarily unique. For example, if $A=B$ and $A < B < C$, then either $\{A\}$ or $\{B\}$ is a minimal parent set for $C$ (but not $\{A,B\}$). When the data generating distribution is positive, however, one can show that any such set $\mathcal{U}$ is unique. The structure obtained by this procedure is sensitive to the ordering chosen, and different orderings can result in different structures.
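
As a concrete illustration, here is a minimal Python sketch of the ordering-based procedure just described. It assumes an oracle with the set-valued signature sketched earlier and, as discussed further below, it enumerates exponentially many candidate parent sets.

```python
from itertools import combinations

def learn_with_ordering(variables, oracle):
    """Learn a structure given a topological ordering X_1, ..., X_n.

    `oracle(xs, ys, zs)` answers conditional independence queries
    (the hypothetical interface above).  Returns a dict mapping each
    variable to the parent set chosen for it.
    """
    parents = {variables[0]: set()}
    for i in range(1, len(variables)):
        xi, preds = variables[i], variables[:i]
        chosen = set(preds)  # fallback: all predecessors become parents
        # Try candidate sets U by increasing size; the first U that makes
        # X_i independent of the remaining predecessors has minimum
        # cardinality and is therefore also subset-minimal.
        for size in range(len(preds)):
            minimal = next((set(u) for u in combinations(preds, size)
                            if oracle({xi}, set(preds) - set(u), set(u))),
                           None)
            if minimal is not None:
                chosen = minimal
                break
        parents[xi] = chosen
    return parents
```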

Consider the three structures below.

Assuming that the independence oracle answers according to the leftmost structure, the algorithm obtains

  1. the same structure if using ordering $A < B < D < C < E$,
  2. the middle structure if using ordering $E < D < C < B < A$, and
  3. the rightmost structure if using ordering $E < A < D < B < C$.

By construction, the variable orderings are topological orderings of the generated structures, which ensures the non-existence of directed cycles. Thus, by repeating the procedure with every possible variable ordering we obtain all structures that are sound with respect to d-separation.

This structure learning method, while conceptually simple, is very inefficient. First, it requires listing every variable ordering (there are $n!$ such orderings for $n$ variables). Second, it calls the independence oracle an exponential number of times (there are $2^{n-1}$ candidate subsets of the predecessors of a variable), with queries involving an arbitrarily large number of variables (even if the number of parents in the generating distribution is small). Last, some of the structures “learned” by this method are not perfect, that is, they miss independences represented by other structures found with the same algorithm. We now turn to an algorithm that solves all these issues.

The PC algorithm

By definition, any two Markov equivalent structures induce the same answers from an independence oracle, so that no algorithm based purely on this oracle can prefer one structure over the other. It seems reasonable then to learn an equivalence class instead of a single structure. Recall that an equivalence class contains all Markov equivalent structures, and that a partially directed acyclic graph (PDAG) of an equivalence class is a graph containing both directed and undirected edges, such that an edge appears directed if and only if the corresponding arc appears in all the structures in the equivalence class.

The PC algorithm learns a partially directed acyclic graph representing the (in)dependences in the data using only polynomially many calls to the oracle, each call involving only a bounded number of variables.1 The algorithm has the following steps:

  1. Identify the graph skeleton (an undirected graph) induced by the independence relations reported by the oracle.
  2. Identify immoralities.
  3. Identify derived directions.

The output of the algorithm is a PDAG representing the class of Markov equivalent networks.

Skeleton identification

The skeleton identification step is based on the following properties of d-separation in Bayesian networks:

  1. If two nodes $X$ and $Y$ are adjacent in an acyclic digraph then there is no set $\mathcal{Z}\subseteq\mathcal{V}-\{X,Y\}$ that d-separates $X$ and $Y$.
  2. If $X$ and $Y$ are not adjacent in an acyclic digraph then either $\pa{X}$ d-separates $X$ and $Y$ or $\pa{Y}$ d-separates $X$ and $Y$.

Due to the faithfulness assumption, these properties translate to independences: if $X$ and $Y$ are not adjacent in the originating network, then either $X \ind Y \mid \pa{X}$ or $X \ind Y \mid \pa{Y}$; conversely, if the variables are adjacent then any query $X \ind Y \mid \mathcal{Z}$ with $X,Y \not\in \mathcal{Z}$ will be answered negatively. Since we assumed $|\pa{X}| \leq d$ for any $X$, $X$ and $Y$ are adjacent if and only if we cannot find a set $\mathcal{Z}$ (not containing $X$ and $Y$) of size at most $d$ such that $X \ind Y \mid \mathcal{Z}$. Call any such set $\mathcal{Z}$ a witness set (for $X$ and $Y$). Note that there are $O(2^d)$ candidate witness sets contained in a given parent set, and this is considered constant with respect to the input size (as $d$ is assumed constant). Moreover, since a witness set is a parent set of one of the endpoints, we need only consider subsets of variables that have not been deemed independent of either endpoint. Building on these observations, the algorithm to identify the skeleton of the PDAG proceeds as follows.

  • Start with a complete graph $S$ over the variables.
  • For $i=0,1,\ldots,d$ do:
    • For each edge $X - Y$ do
      • For each candidate witness set $\mathcal{Z} \subseteq \nb{X} \cup \nb{Y}$ with size $|\mathcal{Z}|=i$ do:
        • Query $X \ind Y \mid \mathcal{Z}$ and remove $X - Y$ from $S$ if the query returns “yes” (store $\mathcal{Z}$ as the witness of that edge).

Note that when querying for independence, we only need to consider subsets of either $\nb{X}$ or $\nb{Y}$ in the current $S$.
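
A minimal Python sketch of this skeleton phase, under the hypothetical oracle interface sketched above; for simplicity it draws candidate witness sets from the union of the two current neighborhoods (the remark above notes that each neighborhood alone suffices).

```python
from itertools import combinations

def identify_skeleton(variables, oracle, d):
    """PC skeleton phase: remove edges that admit a witness set of size <= d.

    `variables` is a list of comparable names (e.g., strings) and `oracle`
    follows the hypothetical interface above.  Returns the surviving
    undirected edges (as frozensets) together with a dict mapping each
    removed edge to the witness set that justified its removal.
    """
    adjacency = {v: set(variables) - {v} for v in variables}
    witness = {}
    for size in range(d + 1):
        # Snapshot the current edges; removals take effect immediately.
        current = [(x, y) for x in variables for y in adjacency[x] if x < y]
        for x, y in current:
            if y not in adjacency[x]:
                continue  # edge already removed with a witness of this size
            candidates = sorted((adjacency[x] | adjacency[y]) - {x, y})
            for zs in combinations(candidates, size):
                if oracle({x}, {y}, set(zs)):
                    adjacency[x].discard(y)
                    adjacency[y].discard(x)
                    witness[frozenset((x, y))] = set(zs)
                    break
    edges = {frozenset((x, y)) for x in variables for y in adjacency[x]}
    return edges, witness
```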

Finding immoralities

Recall that an equivalence class is characterized by a skeleton and a set of immoralities. Thus, it remains to devise a procedure that finds the immoralities. Consider a skeleton $S$ obtained by the previous procedure along with a list of witness sets for each absent edge in $S$. A candidate immorality is a trail $X - Y - Z$ such that $X$ and $Z$ are not adjacent in $S$. Consider any such triple; if it is not an immorality in the “true” graph we have either one of the serial connections

or the divergent connection

In any of the three cases above, the trail $X - Y - Z$ is blocked by a conditioning set $\mathcal{W}$ if and only if $\mathcal{W}$ contains $Y$; it thus follows from the faithfulness assumption that any witness set for $X$ and $Z$ must contain $Y$. On the other hand, if the triple is indeed an immorality (in the “true” graph), then conditioning on $Y$ activates the trail, so that $X \not\ind Z \mid \mathcal{W}$ for any set $\mathcal{W}$ containing $Y$, and no witness set can contain $Y$. Combining these two considerations we have that:

  • If $X-Y-Z$ is an immorality, then $Y$ does not appear in any witness set of $X-Z$;
  • If $X-Y-Z$ is not an immorality then $Y$ appears in all witness sets of $X-Z$.

This leads to the following simple procedure for determining immoralities:

For each trail $X-Y-Z$ where $X$ and $Z$ are not adjacent, orient $X \rightarrow Y \leftarrow Z$ if $Y$ is not in the witness set of $X-Z$.
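
A minimal sketch of this step in Python, reusing the skeleton and the witness sets returned by the skeleton sketch above:

```python
from itertools import combinations

def find_immoralities(edges, witness):
    """Orient X -> Y <- Z for every unshielded triple X - Y - Z such that
    Y does not belong to the stored witness set of the pair (X, Z)."""
    nodes = {v for e in edges for v in e}
    neighbors = {v: {w for e in edges if v in e for w in e - {v}}
                 for v in nodes}
    arcs = set()  # directed edges, stored as (tail, head) pairs
    for y in nodes:
        for x, z in combinations(sorted(neighbors[y]), 2):
            if frozenset((x, z)) in edges:
                continue  # shielded triple: X and Z are adjacent
            if y not in witness.get(frozenset((x, z)), set()):
                arcs.add((x, y))
                arcs.add((z, y))
    return arcs
```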

The outcome of the procedure above is not yet the partially directed acyclic graph representing the Markov equivalence class defined by the oracle, because there might be arcs $X \rightarrow Y$ which appear in every structure in the class and yet are undirected in $S$. This occurs when a different orientation of that edge in $S$ would either create a cycle or an immorality. For example, the structure below has a single immorality. This entails a unique orientation for the edge $C-D$, and excludes the graph on the right from representing the Markov equivalence class of that graph.

A Bayesian network structure which is also the partially directed graph of its equivalence class (left), and a possible output of the procedure described (right)

Orienting the arcs

The final step of the PC algorithm is thus the orientation of arcs from the partially directed skeleton $S$ (obtained from the previous step). This is achieved by recursively applying any of the following rules (in any order) to $S$ until convergence.

  1. Prohibited immorality rule
  2. Acyclicity rule
  3. Quartet rule

The correctness of the first two rules is straightforward. For the third rule, correctness follows by contradiction: if the edge $X-Z$ were oriented as $Z \rightarrow X$, then avoiding directed cycles would force the edges $X-Y_1$ and $X-Y_2$ to be oriented towards $X$, creating a new immorality $Y_1 \rightarrow X \leftarrow Y_2$; this contradicts the assumption that $X-Y_1$ and $X-Y_2$ are undirected in the equivalence class. These rules take at most polynomial time, and hence the overall algorithm takes polynomial time in the number of variables.
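
The sketch below applies one common formulation of these three rules (they are usually credited to Meek); it operates on the undirected edges and directed arcs produced by the previous steps, and is a sketch rather than a definitive implementation.

```python
from itertools import combinations

def apply_orientation_rules(edges, arcs):
    """Repeatedly apply the three orientation rules until no edge changes.

    `edges` holds the still-undirected edges (frozensets) and `arcs` the
    directed edges as (tail, head) pairs; both are modified in place.
    """
    nodes = {v for e in edges for v in e} | {v for a in arcs for v in a}

    def adjacent(a, b):
        return frozenset((a, b)) in edges or (a, b) in arcs or (b, a) in arcs

    def orient(a, b):
        edges.discard(frozenset((a, b)))
        arcs.add((a, b))

    changed = True
    while changed:
        changed = False
        for e in list(edges):
            u, v = tuple(e)
            for a, b in ((u, v), (v, u)):
                # Prohibited immorality rule: W -> A and A - B with W, B
                # non-adjacent; orienting B -> A would create an immorality.
                r1 = any(h == a and t != b and not adjacent(t, b)
                         for (t, h) in arcs)
                # Acyclicity rule: A -> W -> B and A - B; orienting B -> A
                # would create a directed cycle.
                r2 = any((a, w) in arcs and (w, b) in arcs for w in nodes)
                # Quartet rule: A - Y1 -> B, A - Y2 -> B, A - B, with Y1, Y2
                # non-adjacent; forces A -> B.
                r3 = any(frozenset((a, y1)) in edges
                         and frozenset((a, y2)) in edges
                         and (y1, b) in arcs and (y2, b) in arcs
                         and not adjacent(y1, y2)
                         for y1, y2 in combinations(sorted(nodes - {a, b}), 2))
                if r1 or r2 or r3:
                    orient(a, b)
                    changed = True
                    break
    return edges, arcs
```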

The overall procedure can be shown to be sound and complete.2 That is, the PC algorithm returns the partially directed acyclic graph of the Markov equivalence class represented by the oracle.

Time complexity

The skeleton identification procedure examines all the $n^2$ pairs of variables, and for each performs $O(n^d)$ calls to the oracle; hence, it makes $O(n^{d+2})$ oracle calls. If each oracle call takes polynomial time in the number of variables and in the data set size, the overall time complexity is also polynomial (since $d$ is assumed constant). The immorality identification can be accomplished in time linear in the size of the skeleton and of the witness sets. Finally, the arc orientation rules only orient edges, so this step necessarily terminates after $O(n^2)$ rule applications. Thus, the algorithm finishes in polynomial time. Note the dependence of the procedure on the boundedness of the in-degree.

Soundness

If all the assumptions are met (faithfulness, bounded in-degree, no latent variables, perfect independence testing), then the PC algorithm recovers the Markov equivalence class of the true data generating structure as the sample size goes to infinity. Note that this has a lot of “ifs”. For finite samples, there are no guarantees that the PC algorithm recovers the true structure even if the data is generated by a Bayesian network with no latent variables and the sample size is large.

Statistical independence testing

The only missing part for a practical algorithm is the construction of an independence oracle that answers conditional independence queries of the form $X \ind Y \mid \mathcal{Z}$ given a dataset $\mathcal{D}$ of configurations of the variables, assumed generated independently from some (unknown) Bayesian network. To simplify the discussion, fix $X$, $Y$ and $\mathcal{Z}$, so that the oracle is a function of only $\mathcal{D}$; call this function $I(\mathcal{D})$, that is, $I$ is a function that answers yes or no given a data set $\mathcal{D}$. Assume that $\mathcal{D}$ was generated by a distribution $P(X,Y,\mathcal{Z})$. Ideally, we would like to have $$ I(\mathcal{D}) = \text{yes} \Leftrightarrow P(X,Y \mid \mathcal{Z}) = P(X \mid \mathcal{Z}) P(Y \mid \mathcal{Z}) \, . $$ In practice, however, we do not have access to $P(X,Y,\mathcal{Z})$, and the oracle will not be perfectly reliable. The mistakes made by the oracle can be classified into two types:

  • Type I or false positive, when the oracle rejects a true independence statement.

  • Type II or false negative, when the oracle accepts a false independence statement.

The type I error rate is formalized as: $$ \alpha = P(\{\mathcal{D}' \sim P(X,Y,\mathcal{Z}): I(\mathcal{D}')=\text{no}, |\mathcal{D}'|=|\mathcal{D}|\} \mid H_0) \, , $$ where $H_0$ is the so-called null hypothesis that $X \ind Y \mid \mathcal{Z}$. This probability value is known as the significance level. Note that by controlling the type I error we, in principle, make no commitment to type II errors (as they are probabilities conditioned on a disjoint event). In terms of an independence oracle used to learn the structure of a Bayesian network, the smaller the value of $\alpha$, the sparser the induced graph.

Data-based oracles are often obtained by thresholding a data statistic, that is, by returning yes if and only if $$ f(\mathcal{D}) \leq t \, , $$ where $f$ is a real-valued function and $t$ is a real value. The probability of observing a value of $f(\mathcal{D}) > t$ under the (null) hypothesis of independence is known as the p-value of the test: $$ \text{p-value}(t) = P(\{ \mathcal{D}' \sim P(X,Y,\mathcal{Z}): f(\mathcal{D}') > t, |\mathcal{D}'|=|\mathcal{D}|\} \mid H_0) \, . $$

A common statistic for testing independence of categorical variables is the Chi-Square statistic: $$ \chi^2(\mathcal{D}) = \sum_{x,y,z} \frac{\left(N[x,y,z] - N[x,z]N[y,z]/N[z]\right)^2}{N[x,z]N[y,z]/N[z]} \, . $$

It measures the discrepancy between the observed counts and the counts expected under the hypothesis that the joint distribution factorizes as

$$ P(X,Y,\mathcal{Z})=P(X|\mathcal{Z})P(Y|\mathcal{Z})P(\mathcal{Z}) \, . $$

When $X$ and $Y$ are (conditionally) independent under the empirical distribution, the $\chi^2$ statistic is 0; otherwise it takes a positive value that grows with the discrepancy between the observed and the hypothesized distributions. Under the (null) hypothesis of independence, and assuming i.i.d. data, the Chi-Square statistic approximately follows a Chi-Square distribution with $$ (|\Omega_X|-1)(|\Omega_Y|-1)\prod_{Z \in \mathcal{Z}}|\Omega_Z| $$ degrees of freedom. The approximation is fairly accurate when the data set is moderately large (say, greater than 30 records).
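
The Python sketch below computes this statistic and its p-value from a list of categorical records; the function name and the record format are our own, and the degrees of freedom are computed from the configurations observed in the data.

```python
from collections import Counter
from scipy.stats import chi2

def chi_square_ci_test(data, x, y, zs):
    """Return the p-value of the Chi-Square test of X ⫫ Y | Z.

    `data` is a list of dicts mapping variable names to categorical values,
    `x` and `y` are variable names, and `zs` is a (possibly empty) list of
    conditioning variable names.
    """
    key = lambda r: tuple(r[z] for z in zs)
    n_xyz = Counter((r[x], r[y], key(r)) for r in data)
    n_xz = Counter((r[x], key(r)) for r in data)
    n_yz = Counter((r[y], key(r)) for r in data)
    n_z = Counter(key(r) for r in data)

    x_vals = {r[x] for r in data}
    y_vals = {r[y] for r in data}
    z_vals = {key(r) for r in data}

    stat = 0.0
    for zv in z_vals:
        for xv in x_vals:
            for yv in y_vals:
                expected = n_xz[(xv, zv)] * n_yz[(yv, zv)] / n_z[zv]
                if expected > 0:
                    stat += (n_xyz[(xv, yv, zv)] - expected) ** 2 / expected

    # Degrees of freedom: (|X|-1)(|Y|-1) times the number of configurations
    # of the conditioning variables (here, those observed in the data).
    dof = (len(x_vals) - 1) * (len(y_vals) - 1) * len(z_vals)
    return chi2.sf(stat, dof)  # upper tail of the Chi-Square distribution
```

An independence oracle with significance level $\alpha$ then answers “yes” exactly when the returned p-value is at least $\alpha$.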

Typically, the significance level is set to 0.05, so that an idealized procedure would make one false rejection in every 20 tests. Of course, the several approximations made usually move this number away from the ideal. A more serious issue is the large number of tests employed when identifying the skeleton, a phenomenon generally called multiple hypothesis testing: the type I error probability increases as the number of independence tests increases.3 Controlling for this type of error by a decrease of the significance level often leads to an increase of type II errors. This is particularly troublesome when the data set size is small compared to the maximum (assumed) number of parents $d$. In practice, constraint-based structure learning is most often used to provide a first rough estimate of the structure, that is then refined by a score-based structure learning procedure (which is the subject of a future lecture).

Examples

Let us review the content we have seen with two concrete examples using real-world data.

We learn a Bayesian network structure from the data compiled by Sewell and Shah (1968) about possible factors that lead high school senior students to attend college or not.4 The data contains 10,318 records about the following five categorical variables:

  • Sex: male, female;
  • SES: Socioeconomic status, taking on values low, lower middle, upper middle, high;
  • IQ: discretized as low, lower middle, upper middle, high;
  • PE: Parental encouragement, taking on low and high;
  • CP: College plans, taking on yes or no.

The dataset used was downloaded from Kevin Murphy’s book supplementary material.

To identify the skeleton, we ran Chi-square independence tests with significance level $0.04$ as follows. We start with a complete graph skeleton and first test for unconditional independence, obtaining the following independences:

  • Sex ⫫ IQ (p-value 0.2409)
  • Sex ⫫ SES (p-value 0.1538)

This removes the corresponding edges. We now test for conditional independence given one neighbor. We find that

  • Sex ⫫ CP | PE (p-value 0.075)

We remove the corresponding edge and assign PE as the respective witness set. We then run conditional tests given two neighbors; no independence is found. The resulting skeleton is depicted below.

We now check for immoralities:

  • Sex - PE - SES is an immorality as PE is not in the witness set of Sex - SES;
  • Sex - PE - CP is not an immorality as PE is in the witness set of Sex - CP;
  • Sex - PE - IQ is an immorality as PE is not in the witness set of Sex - IQ.

We reach the following partial digraph:

We now apply the orientation rules to find new arcs. We first apply the prohibited immorality rule:

The new orientation allows us to apply the acyclicity rule to the following edges:

The final partial digraph representing the equivalence class is

The graph above can be interpreted as stating that any influence of Sex on the education aspirations is mediated by parental encouragement. Moreover, all other factors have both direct influences on college plans and indirect influences via parental encouragement. This seems in line with Sewell and Shah’s findings that “parental encouragement is a powerful intervening variable between socioeconomic class background and intelligence of the child and his educational aspirations”.

The following example shows what happens when data is not faithful to any structure.

This example is based on a similar example appearing in Norman Fenton and Martin Neil’s book.5 There is a long and ongoing debate about institutional racism in the United States’ justice system; in particular, the relationship between race and the death penalty has been pointed out by many authors. Radelet compiled data about 637 homicide trials in the state of Florida between 1976 and 1977.6 The cases were classified in terms of the victim’s race $V$, the defendant’s race $D$, the type of victim-defendant relationship $R$ (primary, if they were friends, relatives, lovers, etc., or nonprimary), the severity of the indictment $I$ (first-degree murder or not), and whether the defendant was sentenced to death ($S$). By law, only first-degree indictments are punishable by a death sentence in Florida.

$V$ $D$ $R$ $I$ $S$ Cases
white white nonprimary first no 105
white white nonprimary first yes 19
white white nonprimary other no 27
white white nonprimary other yes 0
white white primary first no 70
white white primary first yes 3
white white primary other no 61
white white primary other yes 0
white black nonprimary first no 47
white black nonprimary first yes 11
white black nonprimary other no 5
white black nonprimary other yes 0
white black primary first no 1
white black primary first yes 0
white black primary other no 2
white black primary other yes 0
black white nonprimary first no 4
black white nonprimary first yes 0
black white nonprimary other no 5
black white nonprimary other yes 0
black white primary first no 4
black white primary first yes 0
black white primary other no 4
black white primary other yes 0
black black nonprimary first no 50
black black nonprimary first yes 6
black black nonprimary other no 47
black black nonprimary other yes 0
black black primary first no 51
black black primary first yes 0
black black primary other no 115
black black primary other yes 0
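
Before going through the analysis, here is a sketch of how the independence tests used below can be computed from this table. It encodes the rows as weighted records and reuses the `chi_square_ci_test` function sketched earlier; the p-values it produces should match the ones reported below up to rounding.

```python
# Each tuple lists (V, D, R, I, S, number of cases), copied from the table
# above; chi_square_ci_test is the function sketched earlier.
cases = [
    ("white","white","nonprimary","first","no",105), ("white","white","nonprimary","first","yes",19),
    ("white","white","nonprimary","other","no",27),  ("white","white","nonprimary","other","yes",0),
    ("white","white","primary","first","no",70),     ("white","white","primary","first","yes",3),
    ("white","white","primary","other","no",61),     ("white","white","primary","other","yes",0),
    ("white","black","nonprimary","first","no",47),  ("white","black","nonprimary","first","yes",11),
    ("white","black","nonprimary","other","no",5),   ("white","black","nonprimary","other","yes",0),
    ("white","black","primary","first","no",1),      ("white","black","primary","first","yes",0),
    ("white","black","primary","other","no",2),      ("white","black","primary","other","yes",0),
    ("black","white","nonprimary","first","no",4),   ("black","white","nonprimary","first","yes",0),
    ("black","white","nonprimary","other","no",5),   ("black","white","nonprimary","other","yes",0),
    ("black","white","primary","first","no",4),      ("black","white","primary","first","yes",0),
    ("black","white","primary","other","no",4),      ("black","white","primary","other","yes",0),
    ("black","black","nonprimary","first","no",50),  ("black","black","nonprimary","first","yes",6),
    ("black","black","nonprimary","other","no",47),  ("black","black","nonprimary","other","yes",0),
    ("black","black","primary","first","no",51),     ("black","black","primary","first","yes",0),
    ("black","black","primary","other","no",115),    ("black","black","primary","other","yes",0),
]

# Expand the counts into one record per trial (637 records in total).
data = [dict(zip("VDRIS", row[:5])) for row in cases for _ in range(row[5])]

print(chi_square_ci_test(data, "D", "R", []))     # unconditional test of D and R
print(chi_square_ci_test(data, "D", "S", []))     # unconditional test of D and S
print(chi_square_ci_test(data, "V", "S", ["I"]))  # conditional test of V and S given I
```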

Let us try to recover a Bayesian network structure representing the data above. For simplicity we use Chi-square statistical tests for independence, with significance level 0.05. We begin with a complete skeleton $H$ over the five variables. We have the following unconditional independences:

  • $D \ind R$ with p-value 0.3875;
  • $D \ind S$ with p-value 0.2453.

Accordingly, we remove the edges $D - R$ and $D - S$. Next, we perform independence tests given one neighbor. We find the following conditional independence:

  • $V \ind S \mid I$ with p-value 0.0834.

We therefore remove $V-S$ and store $I$ as the respective witness. We now move to conditioning sets of size two, and find that

  • $D \ind I \mid V,R$ with p-value 0.2289.

Thus we associate $\{V,R\}$ as a witness set for the removed edge $D - I$. Since at this point no node has more than three neighbors, we have finished the recovery of the skeleton. The outcome is shown below.

We have the following candidate immoralities:

  • $D - V - I$ is not an immorality since $V$ is in the witness set of $D-I$;
  • $D - V - R$ is an immorality as the witness set of $D - R$ is empty;
  • $V - R - S$ is also an immorality, as $R$ is not in the witness set of $V-S$;
  • $V - I - S$ is not an immorality as $I$ is in the witness set of $V-S$.

We obtain the following partial digraph:

As the partial digraph contains a directed cycle $V \rightarrow R \rightarrow V$ (the edge between $V$ and $R$ receives both orientations), it is not the partial digraph of any equivalence class; this suggests that the data might not have been generated by any distribution that can be faithfully represented as a Bayesian network. It is also possible that the data is faithful, but that the results of some statistical tests are flawed (as they are inexact), which leads to the inconsistent structure above.

We can try and fix the situation by reviewing our premises. If we increase the significance level to, say, 0.1, then the edge between $V$ and $S$ is no longer removed (its p-value, 0.0834, now falls below the significance level) and the inconsistency disappears (as the four-node subgraph over $V$, $R$, $I$ and $S$ would then be fully connected). Another fix is to adopt only one of the orientations of the bidirected arc between $V$ and $R$, and then proceed as usual. Yet another approach is to interpret the inconsistency above as suggesting the existence of an unobserved variable that causes $V$ and $R$ to be connected. As we have discussed previously, the existence of such a variable might render the resulting empirical marginal distribution not representable by a Bayesian network structure. Thus let us assume the existence of a latent random variable $U$ that “explains” the conditional independences observed as follows:

Note that since we cannot test independences with respect to $U$, we cannot apply any of the orientation rules. We therefore need additional assumptions about the independences with respect to $U$.


  1. The algorithm is named after the first names of its inventors, Peter and Clark. ↩︎

  2. Christopher Meek, Causal inference and causal explanation with background knowledge, In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995. ↩︎

  3. A witty illustration of the pitfalls of uncontrolled multiple hypothesis testing is this XKCD comic. ↩︎

  4. W. H. Sewell and V. P. Shah, Social class, parental encouragement, and educational aspirations. American Journal of Sociology, 73:5, 1968. ↩︎

  5. Norman Fenton and Martin Neil, Risk assessment and decision analysis with Bayesian networks, Chapman & Hall/CRC, 2019. ↩︎

  6. M. Radelet, Racial characteristics and imposition of the death penalty, American Sociological Review 46, 1981. ↩︎
