# Constraint-Based Bayesian Network Structure Learning


We discuss the problem of inferring the structure of a Bayesian network from data. There are two main approaches to this problem. The first approach, which we study in this lecture, is connected to the independence representation view of Bayesian networks. In this approach, termed constraint-based structure learning, an independence oracle is assumed available (which tests for conditional independence of variables), and a Markov equivalent structure is recovered by making a number of calls to that oracle. The second approach, termed score-based structure learning, takes the distribution view of Bayesian networks, and attempts to find a structure which fits the empirical distribution in the data. This is usually posed and solved as a combinatorial problem over the space of DAGs guided by a polynomial-time computable scoring rule. We will leave the study of this approach for later.

## 1 Bayesian network structure learning

Thus far we have avoided the question "Where do probabilistic graphical models come from?", taking for granted that a probabilistic model is available. However, obtaining a model that meets our requirements is far from trivial. One possibility is to rely on domain experts through a careful and painstaking process of knowledge elicitation. This includes training experts in probabilistic graphical modeling, calibrating experts' opinions, and extracting and verifying knowledge. This process often leads to disagreement among the experts and to models of questionable reliability. Nevertheless, in many domains where data is scarce, this is one of the main approaches to model construction. It is also part of the methodology of most scientific disciplines.

Another possibility is to automatically derive the model using a data set of collected observations. It is this machine learning approach that we adopt here (so that we completely avoid the very rich field of human knowledge elicitation). To be fair, model construction is never purely data-driven. A number of (often hidden) assumptions and constraints need to be made so that data can be turned into models. It is however domain agnostic: the machine learning approach can be almost automatically applied to any domain where data satisfying the requirements is available.

We will start examining a class of techniques that derives Bayesian network structures from data without addressing the estimation of the numerical parameters (i.e., the conditional probability values). This approach is known as constraint-based structure learning.

## 2 Constraint-based structure learning

The objective of constraint-based structure learning methods is to recover a structure that best captures the independences in a dataset. In other words, we are interested in finding an acyclic digraph that is minimal with respect to the conditional independences observed in the data. We assume the availability of an independence oracle:

An independence oracle is an (abstract) computational device that answers, in polynomial time, the question $$\mathcal{X} \ind \mathcal{Y} \mid \mathcal{Z}$$ for any sets of variables $$\mathcal{X}$$, $$\mathcal{Y}$$ and $$\mathcal{Z}$$.

Since d-separation is not complete (a distribution may satisfy independences not displayed by the graph), and since not every independence relation can be represented as a Bayesian network, some assumptions need to be made. The most common is faithfulness, which in effect corresponds to assuming that the data at hand was generated by a Bayesian network to which we can pose conditional independence queries. More concretely, we assume:

• The data has been generated by a Bayesian network whose structure has bounded in-degree $$d$$.
• The independence oracle is an exact test of the conditional independences in the generating network.
• D-separation is complete in the originating Bayesian network (i.e., $$X \dsep Y \mid Z$$ iff $$X \ind Y \mid Z$$).

All of these assumptions are unrealistic: data is seldom generated by a stationary Bayesian network distribution faithful to its structure (meaning that all graphical dependences are verified in the distribution), and conditional independence tests (independence oracles) are imperfect. One particularly important violation of these assumptions arises when the generating network contains latent variables, that is, variables which are not observed. For example, consider a dataset generated by a Bayesian network with structure $$A \rightarrow D \leftarrow B \rightarrow E \leftarrow C$$.

Suppose that $$B$$ is latent, so that we only observe variables $$A,C,D,E$$. Note that $$D$$ and $$E$$ are not d-separated by any set not containing $$B$$. Hence, an independence oracle answering according to the structure above affirms the queries $$A \ind \{C,E\}$$ and $$C \ind \{A,D\}$$. The former requires that $$D$$ acts as a convergent connection in any digraph representation, while the latter requires that $$E$$ acts as a convergent connection. This is clearly impossible to represent with a structure whose skeleton is $$A-D-E-C$$.

## 3 Order-based algorithm

So our goal is to obtain a sound structure that is minimal: any d-separation derived from the structure is verified by the independence oracle, and the removal of any edge leads to d-separations which are not true (according to the independence oracle). The simplest mechanism for achieving this is to use the ordered Markov property: $X \ind \{ Y: Y < X\} \mid \pa{X} \text{ for any topological ordering } < \, .$ So assume that we are given an ordering of the variables, say $$X_1,\dotsc,X_n$$. For $$i=2,\dotsc,n$$, find a subset-minimal set $$\mathcal{U} \subseteq \{X_1,\dotsc,X_{i-1}\}$$ such that $$X_i \ind \{X_1,\dotsc,X_{i-1}\} - \mathcal{U} \mid \mathcal{U}$$. This can be achieved by querying the independence oracle for increasingly larger sets $$\mathcal{U}$$. Define $$\mathcal{U}$$ as the parents of $$X_i$$ and continue the procedure. Note that by construction, the network factorizes according to the variable ordering, and thus satisfies all the d-separations induced by the corresponding structure. The parent set selected for each variable is not necessarily unique. For example, if $$A=B$$ and $$A < B < C$$, then either $$\{A\}$$ or $$\{B\}$$ is a minimal parent set for $$C$$ (but not $$\{A,B\}$$). When the data-generating distribution is positive, however, one can show that any such set $$\mathcal{U}$$ is unique. The structure obtained by this procedure is sensitive to the ordering chosen, and different orderings can result in different structures.
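The order-based procedure above can be sketched as follows. This is a minimal illustration, not a practical implementation: the function names are ours, and the oracle is a hardcoded stand-in answering d-separation queries for the chain $$A \rightarrow B \rightarrow C$$.

```python
from itertools import combinations

def order_based_structure(order, indep):
    """Recover a parent set for each variable given an ordering and an
    independence oracle indep(X, others, U) answering whether X is
    independent of the set `others` given the set `U`."""
    parents = {order[0]: set()}
    for i in range(1, len(order)):
        X, preds = order[i], order[:i]
        # scan candidate sets U by increasing size: the first U with
        # X independent of preds \ U given U is subset-minimal
        for size in range(len(preds) + 1):
            found = next((set(U) for U in combinations(preds, size)
                          if indep(X, set(preds) - set(U), set(U))), None)
            if found is not None:
                parents[X] = found
                break
    return parents

# toy oracle: d-separation answers for the chain A -> B -> C
def chain_oracle(X, others, U):
    if not others:
        return True  # independence of the empty set holds trivially
    return X == 'C' and others == {'A'} and 'B' in U

result = order_based_structure(['A', 'B', 'C'], chain_oracle)
```

With the ordering $$A < B < C$$ the sketch recovers the chain itself: `{'A': set(), 'B': {'A'}, 'C': {'B'}}`.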

Consider the three structures in the figure below. Assuming the leftmost structure as the independence oracle, the algorithm obtains (i) the same structure if using ordering $$A < B < D < C < E$$, (ii) the middle structure if using ordering $$E < D < C < B < A$$, and (iii) the rightmost structure if using ordering $$E < A < D < B < C$$.

By construction, the variable orderings are topological orderings of the generated structures, which ensures the non-existence of directed cycles. Thus, by repeating the procedure with every possible variable ordering we obtain all structures that are sound with respect to d-separation.

This structure learning method, while conceptually simple, is very inefficient. First, it requires enumerating every variable ordering (there are $$n!$$ orderings of $$n$$ variables). Second, it calls the independence oracle an exponential number of times (each variable $$X_i$$ may require examining all $$2^{i-1}$$ subsets of its predecessors), with queries involving an arbitrarily large number of variables (even if the number of parents in the generating distribution is small). Last, some of the structures "learned" by this method are not perfect, that is, they miss independences represented by other structures found with the same algorithm. We now turn to an algorithm that solves all these issues.

## 4 The PC algorithm

By definition, any two Markov equivalent structures induce the same answers from an independence oracle, so that no algorithm based purely on this oracle can prefer one structure over the other. It seems reasonable then to learn an equivalence class instead of a single structure. Recall that an equivalence class contains all Markov equivalent structures, and that the partially directed acyclic graph (PDAG) of an equivalence class is a graph containing both directed and undirected edges, such that an edge appears directed if and only if the corresponding arc appears with that orientation in all the structures of the equivalence class.

The PC (Peter & Clark) algorithm learns a partially directed acyclic graph representing the (in)dependences in the data using only polynomially many calls to the oracle, each call involving only a bounded number of variables [SM1995].

### 4.1 The Algorithm

The algorithm has the following steps:

1. Identify the graph skeleton (an undirected graph) induced by the independence relations answered by the oracle.
2. Identify immoralities.
3. Identify derived directions.

The output of the algorithm is a PDAG representing the class of Markov equivalent networks.

### 4.2 Skeleton identification

The skeleton identification step is based on the following properties of d-separation in Bayesian networks:

If two nodes $$X$$ and $$Y$$ are adjacent in an acyclic digraph then there is no set $$\mathcal{Z}\subseteq\mathcal{V}-\{X,Y\}$$ that d-separates $$X$$ and $$Y$$.

If $$X$$ and $$Y$$ are not adjacent in an acyclic digraph then either $$\pa{X}$$ d-separates $$X$$ and $$Y$$ or $$\pa{Y}$$ d-separates $$X$$ and $$Y$$.

Due to the faithfulness assumption, these properties translate to independences: if $$X$$ and $$Y$$ are not adjacent in the originating network, then either $$X \ind Y \mid \pa{X}$$ or $$X \ind Y \mid \pa{Y}$$; conversely, if the variables are adjacent then any query $$X \ind Y \mid \mathcal{Z}$$ with $$X,Y \not\in \mathcal{Z}$$ will be answered negatively. Since we assumed $$|\pa{X}| \leq d$$ for any $$X$$, the variables $$X$$ and $$Y$$ are adjacent if and only if we cannot find a set $$\mathcal{Z}$$ (not containing $$X$$ and $$Y$$) of size at most $$d$$ such that $$X \ind Y \mid \mathcal{Z}$$. Call any such set $$\mathcal{Z}$$ a witness set (for $$X$$ and $$Y$$). Note that there are $$O(2^d)$$ candidate witness sets within a parent set of size $$d$$, and this is considered constant with respect to the input size (as $$d$$ is considered constant). Moreover, since a witness set is a parent set of one of the endpoints, we need only consider subsets of variables that have not been deemed independent of either endpoint. Building on these observations, the algorithm to identify the skeleton of the PDAG proceeds as follows.

1. Start with a complete graph $$H$$ over the variables.
2. For each edge $$X - Y$$, consider all possible witness sets $$\mathcal{Z} \subseteq \nb{X} \cup \nb{Y}$$ of size $$|\mathcal{Z}|=0,1,\dotsc,d$$; for each such set $$\mathcal{Z}$$, query $$X \ind Y \mid \mathcal{Z}$$ and remove $$X - Y$$ from $$H$$ if the query returns "yes" (store $$\mathcal{Z}$$ as the witness of that edge).

Note that in Step 2, we only need to consider subsets of either $$\nb{X}$$ or $$\nb{Y}$$ in the current $$H$$.
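The skeleton phase can be sketched as below. This is an illustrative sketch under our own representation choices (adjacency sets in a dictionary, witness sets keyed by frozen pairs); the oracle is again a hardcoded stand-in for the chain $$A \rightarrow B \rightarrow C$$.

```python
from itertools import combinations

def pc_skeleton(variables, indep, d):
    """Skeleton phase of PC: start from the complete graph and delete
    every edge for which a witness set of size at most d exists."""
    adj = {v: set(variables) - {v} for v in variables}
    witness = {}
    for size in range(d + 1):
        for X in variables:
            for Y in list(adj[X]):
                # candidate witnesses: subsets of X's current neighbours
                candidates = adj[X] - {Y}
                for Z in combinations(sorted(candidates), size):
                    if indep(X, Y, set(Z)):
                        adj[X].discard(Y)
                        adj[Y].discard(X)
                        witness[frozenset((X, Y))] = set(Z)
                        break
    return adj, witness

# toy oracle: d-separation answers for the chain A -> B -> C
def chain_oracle(X, Y, Z):
    return frozenset((X, Y)) == frozenset(('A', 'C')) and 'B' in Z

adj, witness = pc_skeleton(['A', 'B', 'C'], chain_oracle, d=1)
```

On the toy oracle the edge $$A-C$$ is removed with witness set $$\{B\}$$, leaving the skeleton $$A-B-C$$.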

### 4.3 Finding immoralities

Recall that an equivalence class is characterized by a skeleton and a set of immoralities. Thus, it remains to devise a procedure that finds the immoralities. So consider a skeleton $$S$$ and a list of witness sets, one for each absent edge in $$S$$. A candidate immorality is a triple $$X - Z - Y$$ such that $$X$$ and $$Y$$ are not adjacent. Consider any such triple. If it is not an immorality in the originating graph, we will have either one of the serial connections $$X \rightarrow Z \rightarrow Y$$ and $$X \leftarrow Z \leftarrow Y$$, or the divergent connection $$X \leftarrow Z \rightarrow Y$$.

In any of the three cases, $$Z$$ d-separates $$X$$ and $$Y$$, and $$X$$ and $$Y$$ are d-connected if $$Z$$ is not given. Thus it follows from the faithfulness assumption that $$X \ind Y \mid \mathcal{Z}$$ if and only if $$\mathcal{Z}$$ contains $$Z$$. On the other hand, if the triple is indeed an immorality (in the originating graph) then $$\neg(X \ind Y \mid \mathcal{Z})$$ for any set $$\mathcal{Z}$$ containing $$Z$$. So if $$X-Z-Y$$ is an immorality, then $$Z$$ appears in no witness set of $$X-Y$$; and if $$X-Z-Y$$ is not an immorality then $$Z$$ appears in every witness set of $$X-Y$$. This leads to a simple procedure for determining immoralities:

• For each potential immorality $$X-Z-Y$$, orient $$X \rightarrow Z \leftarrow Y$$ if $$Z$$ is not in the witness set of $$X-Y$$.
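This rule can be sketched as follows, assuming the skeleton and witness sets from an earlier phase are given as an adjacency dictionary and a map from non-adjacent pairs to their stored witness set (the representation and function name are our own choices):

```python
def find_immoralities(adj, witness):
    """Orient X -> Z <- Y for each pair of non-adjacent variables X, Y
    with a common neighbour Z absent from their witness set."""
    arcs = set()
    for X in sorted(adj):
        for Y in sorted(adj):
            if X >= Y or Y in adj[X]:
                continue  # only non-adjacent pairs can yield immoralities
            sep = witness.get(frozenset((X, Y)), set())
            for Z in adj[X] & adj[Y]:
                if Z not in sep:
                    arcs.add((X, Z))
                    arcs.add((Y, Z))
    return arcs

# skeleton A - B - C with empty witness set for (A, C): collider at B
adj = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}
arcs = find_immoralities(adj, {frozenset(('A', 'C')): set()})
```

Here the empty witness set for $$A-C$$ does not contain $$B$$, so the triple is oriented as the immorality $$A \rightarrow B \leftarrow C$$.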

The outcome of the procedure above is not yet the partially directed acyclic graph representing the Markov equivalence class defined by the oracle, because there might be arcs $$X \rightarrow Y$$ which appear in every structure in the class and yet are undirected in $$S$$. This occurs when a different orientation of that edge in $$S$$ would either create a cycle or an immorality. For example, the structure in Figure 5 has a single immorality. This entails a unique orientation for the edge $$C-D$$, and excludes the graph on the right from representing the Markov equivalence class of that graph.

### 4.4 Orienting arcs

The final step of the PC algorithm is thus the orientation of arcs from the partially directed skeleton $$S$$ (obtained from the previous step). This is achieved by recursively applying any of the following rules (in any order) to $$S$$ until convergence.

1. Prohibited immorality rule
2. Acyclicity rule
3. Quartet rule

The correctness of the first two rules is straightforward. For the third rule, we can show correctness by contradiction: if $$X-Y_1$$ and $$X-Y_2$$ are undirected in the equivalence class, then any orientation of $$X-Z$$ other than $$X \rightarrow Z$$ would either create a cycle or an immorality. These rules take at most polynomial time, and hence the overall algorithm takes polynomial time in the number of variables.
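Since the rules are only named above, here is a hedged sketch of the first two (they coincide with Meek's rules R1 and R2); the quartet rule is omitted, and the representation of the PDAG as an undirected adjacency dictionary plus a set of directed arcs is our own choice:

```python
def apply_orientation_rules(adj, arcs):
    """Repeatedly orient undirected edges of the PDAG until fixpoint.
    adj: undirected adjacency (dict of sets); arcs: directed (tail, head) pairs."""
    def undirected(X, Y):
        return Y in adj[X] and (X, Y) not in arcs and (Y, X) not in arcs

    changed = True
    while changed:
        changed = False
        for X in adj:
            for Y in adj[X]:
                if not undirected(X, Y):
                    continue
                for Z in adj:
                    # R1 (prohibited immorality): Z -> X, X - Y, Z and Y
                    # non-adjacent; orienting Y -> X would create an immorality
                    if (Z, X) in arcs and Z != Y and Y not in adj[Z]:
                        arcs.add((X, Y)); changed = True; break
                    # R2 (acyclicity): X -> Z -> Y with X - Y undirected;
                    # orienting Y -> X would create a directed cycle
                    if (X, Z) in arcs and (Z, Y) in arcs:
                        arcs.add((X, Y)); changed = True; break
    return arcs

# A -> B already directed (say, by an immorality elsewhere) and B - C
# undirected with A, C non-adjacent: R1 forces the orientation B -> C
adj = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}
arcs = apply_orientation_rules(adj, {('A', 'B')})
```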

The overall procedure can be shown to be sound and complete [Meek1995]. That is, the PC algorithm returns the partially directed acyclic graph of the Markov equivalence class represented by the oracle.

### 4.5 Time complexity

The skeleton identification procedure examines all $$O(n^2)$$ pairs of variables, and for each performs $$O(n^d)$$ calls to the oracle; hence, it consumes $$O(n^{d+2})$$ time. If each oracle call takes polynomial time in the number of variables and in the data set size, the overall time complexity is also polynomial (since $$d$$ is assumed constant). The immorality identification can be accomplished in time linear in the size of the skeleton and of the witness sets. Finally, the arc orientation rules only orient edges, so this step necessarily terminates in $$O(n^2)$$ steps. Thus, the algorithm finishes in polynomial time. Note the dependence of the procedure on the boundedness of the in-degree.

## 5 Statistical independence testing

The only missing part for a practical algorithm is the construction of an independence oracle that answers conditional independence queries of the form $$X \ind Y \mid \mathcal{Z}$$ given a dataset $$\mathcal{D}$$ of configurations of the variables, assumed generated independently from some (unknown) Bayesian network. To simplify the discussion, fix $$X,Y,\mathcal{Z}$$, so that the oracle is a function of only $$\mathcal{D}$$; call this function $$I(\mathcal{D})$$, that is, $$I$$ answers yes or no given a data set $$\mathcal{D}$$. Assume that $$\mathcal{D}$$ was generated by a distribution $$p(X,Y,\mathcal{Z})$$. Ideally, we would like to have $I(\mathcal{D}) = \text{yes} \Leftrightarrow p(X,Y|\mathcal{Z}) = p(X|\mathcal{Z})p(Y|\mathcal{Z}) \, .$ In practice, however, we do not have access to $$p(X,Y,\mathcal{Z})$$, and the oracle will not be perfectly reliable. The mistakes made by the oracle can be classified into two types:

• Type I or false positive, when the oracle rejects a true independence statement.
• Type II or false negative, when the oracle accepts a false independence statement.

The type I error rate is formalized as: $\alpha = P(\{\mathcal{D}' \sim p(X,Y,\mathcal{Z}): I(\mathcal{D}')=\text{no}, |\mathcal{D}'|=|\mathcal{D}|\} \mid H_0) \, ,$ where $$H_0$$ is the so-called null hypothesis that $$X \ind Y \mid \mathcal{Z}$$. This probability value is known as the significance level. Note that by controlling the type I error we, in principle, make no commitment to type II errors (as they are probabilities conditioned on a disjoint event). In terms of an independence oracle used to learn the structure of a Bayesian network, the smaller the value of $$\alpha$$, the sparser the induced graph.

Data-based oracles are often obtained by thresholding a data statistic, that is, by returning yes if and only if $f(\mathcal{D}) \leq t \, ,$ where $$f$$ is a real-valued function and $$t$$ is a real value. The probability of observing a value of $$f(\mathcal{D}) > t$$ under the (null) hypothesis of independence is known as the p-value of the test: $\text{p-value}(t) = P(\{ \mathcal{D}' \sim p(X,Y,\mathcal{Z}): f(\mathcal{D}') > t, |\mathcal{D}'|=|\mathcal{D}|\} \mid H_0) \, .$

A common statistic for testing independence of categorical variables is the Chi-Square statistic: $\chi^2(\mathcal{D}) = \sum_{x,y,z} \frac{\left(N[x,y,z] - N[x,z]N[y,z]/N[z]\right)^2}{N[x,z]N[y,z]/N[z]} \, .$ It is a normalized sum of squared differences between the observed counts and the counts expected under the hypothesis of conditional independence (i.e., $$p(X,Y,\mathcal{Z})=p(X|\mathcal{Z})p(Y|\mathcal{Z})p(\mathcal{Z})$$). When $$X$$ and $$Y$$ are conditionally independent under the empirical distribution, the $$\chi^2$$ statistic is 0; otherwise it is positive and grows with the discrepancy between the observed and the hypothesized distributions. Under the (null) hypothesis of independence of the variables and of iid data, the Chi-Square statistic approximately follows a Chi-Square distribution with $$(|\Omega_X|-1)(|\Omega_Y|-1)\prod_{Z \in \mathcal{Z}}|\Omega_Z|$$ degrees of freedom. The approximation is fairly accurate when the data set is moderately large (say, greater than 30 records).
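As an illustration, the statistic and its degrees of freedom can be computed from raw records as follows. This is a sketch for categorical data stored as dictionaries (the function name and data layout are ours); computing the p-value from the Chi-Square distribution, e.g. with `scipy.stats.chi2.sf`, is left out.

```python
from collections import Counter

def chi_square(records, x, y, zs):
    """Chi-Square statistic and degrees of freedom for testing
    X independent of Y given the variables in zs."""
    nxyz, nxz, nyz, nz = Counter(), Counter(), Counter(), Counter()
    for r in records:
        z = tuple(r[v] for v in zs)
        nxyz[(r[x], r[y], z)] += 1
        nxz[(r[x], z)] += 1
        nyz[(r[y], z)] += 1
        nz[z] += 1
    xvals = {r[x] for r in records}
    yvals = {r[y] for r in records}
    stat = 0.0
    for z in nz:
        for vx in xvals:
            for vy in yvals:
                # expected count under conditional independence
                expected = nxz[(vx, z)] * nyz[(vy, z)] / nz[z]
                if expected > 0:
                    stat += (nxyz[(vx, vy, z)] - expected) ** 2 / expected
    dof = (len(xvals) - 1) * (len(yvals) - 1) * len(nz)
    return stat, dof

# X and Y independent: statistic 0; X always equal to Y: statistic 4
independent = [{'X': a, 'Y': b} for a in (0, 1) for b in (0, 1)]
dependent = [{'X': 0, 'Y': 0}, {'X': 1, 'Y': 1}] * 2
```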

Typically, the significance level is set to 0.05, so that an idealized procedure would make one false rejection in every 20 tests. Of course, the several approximations made usually move this number away from the ideal. A more serious issue is the large number of tests employed when identifying the skeleton, a phenomenon generally called multiple hypothesis testing: the type I error probability increases as the number of independence tests increases.1 Controlling for this type of error by a decrease of the significance level often leads to an increase of type II errors. This is particularly troublesome when the data set size is small compared to the maximum (assumed) number of parents $$d$$. In practice, constraint-based structure learning is most often used to provide a first rough estimate of the structure, that is then refined by a score-based structure learning procedure (which is the subject of a future lecture).

## 6 Exercises

1. Use the PC algorithm to learn a PDAG using d-separation as oracle in each of the following structures.

2. Implement the PC algorithm and use it to learn a PDAG from the Breast Cancer and the Primary Tumor datasets. Experiment with different significance levels (e.g., 0.01, 0.05, 0.1).
• What is the effect of the significance level on the learned structures?
• Which conclusions about these domains do you draw from the learned structures?

# Bibliography

• [SM1995] Peter Spirtes & Christopher Meek, Learning Bayesian Networks with Discrete Variables from Data, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 294-300 (1995)
• [Meek1995] Christopher Meek, Causal inference and causal explanation with background knowledge, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 403-410 (1995)

## Footnotes:

1. A witty illustration of the pitfalls of uncontrolled multiple hypothesis testing is this XKCD comic.

Created: 2018-11-07 Wed 18:29
