Variable Elimination

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\sco}[1]{\text{sc}(#1)} \newcommand{\del}[2]{{#2}^{-#1}} $

We resume our study of algorithms for computing the probability of evidence (or the partition function) of a Bayesian network or Markov network. We discuss Variable Elimination, a generic framework for performing exact inference in probabilistic graphical models.

Variable elimination is a simple and intuitive algorithm for manipulating information that seems to have been re-developed many times in different areas such as operational research, constraint satisfaction, relational databases, and probabilistic inference, under many different names (and subtle variations). For instance, it was used for automating pedigree analysis in genetics,1 and for computing conjunctive queries in relational databases;2 it was discussed as a sort of non-sequential dynamic programming paradigm in the operational research field;3 it underlies the basic version of the Davis-Putnam algorithm for solving satisfiability;4 and it was one of the first practical exact algorithms for computing marginals (i.e., belief updating) in Bayesian networks.5 6 Dechter was among the first to note the similarity of all these methods, and to formalize them as a unifying algorithm (with a few minor modifications and under the name of bucket elimination).7 The algorithm’s key insight is to exploit the distributivity of addition (marginalization) over multiplication (product) to decompose the problem into smaller problems in a dynamic programming fashion.

The Sum-Product Problem

Here we will consider a slightly more general version of the problem, stated as follows. Recall that a potential is a nonnegative function $\phi$ of a set of random variables $\sco{\phi}$ called its scope (which we assume to be discrete in the following).

Sum-Product Problem: Given potentials $\phi_1,\dotsc,\phi_m$ and a (possibly empty) set $\mathcal{X} \subseteq \mathcal{V}$, where $\mathcal{V}=\bigcup_{j=1}^m \sco{\phi_j}$, compute:

$$ \phi(\mathcal{X}) = \sum_{\nu \sim \mathcal{V}-\mathcal{X}} \prod_{j=1}^m \phi_j(\nu) \, . $$

For even moderately sized sets $\mathcal{V}$, the expression above is impractical to compute by applying the operations in the order they appear. We have seen that this problem can be solved by sampling algorithms, but these provide no guarantees in finite time. In this lecture, we analyze one of the most popular exact algorithms for this problem, called variable elimination.

For example, the problem of computing the probability of some evidence $P(E_1=e_1,\dotsc,E_m=e_m)$ in a Bayesian network can be posed as a sum-product problem with potentials $P(X|\pa{X})$ for each variable $X \in \mathcal{V}$, and an indicator potential $\delta_{e_i}(E_i)$ for each evidence $E_i=e_i$, where $\delta_{e_i}(E_i)=1$ if $E_i=e_i$ and $\delta_{e_i}(E_i)=0$ otherwise. Then, $$ P(e_1,\dotsc,e_m) = \sum_{\nu \sim \mathcal{V}} \prod_X P(X|\pa{X})(\nu) \prod_{E_i} \delta_{e_i}(E_i)(\nu) \, . $$
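To make the reduction concrete, the snippet below evaluates the sum-product expression above by brute-force enumeration for a tiny two-node network $A \to B$ with evidence $B=1$. It is a minimal sketch with made-up numbers; it also illustrates why evaluating the expression as written is impractical, since the enumeration is exponential in the number of variables.

```python
from itertools import product

# Brute-force evaluation of the sum-product expression for a toy network
# A -> B with evidence B = 1. All numbers are illustrative assumptions.
P_A = {0: 0.6, 1: 0.4}                    # P(A)
P_B_given_A = {(0, 0): 0.9, (0, 1): 0.1,  # P(B | A), keyed by (a, b)
               (1, 0): 0.2, (1, 1): 0.8}
delta = {0: 0.0, 1: 1.0}                  # indicator potential for B = 1

# Sum over all valuations of V = {A, B} of the product of all potentials.
p_evidence = sum(P_A[a] * P_B_given_A[(a, b)] * delta[b]
                 for a, b in product((0, 1), repeat=2))
print(p_evidence)  # 0.6*0.1 + 0.4*0.8 = 0.38
```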

An algebra of potentials

Variable elimination is a generic framework which can be used to solve many computational problems that consist in aggregating pieces of information or knowledge and projecting the result onto a new domain (this includes problems that are conceptually unrelated to probabilistic inference, such as querying a relational database). Before presenting variable elimination over the concrete objects that we manipulate (real-valued potentials), we first analyze the algebraic properties of potentials at an abstract level. To this end, let $\mathcal{V}$ be a fixed finite set of variables.

Abstract potentials

A labeled abstract potential is an object $\phi$ with an associated scope $\mathcal{S} \subseteq \mathcal{V}$. We denote by $\sco{\phi}=\mathcal{S}$ the function that maps labeled abstract potentials to their scopes, and often write a labeled abstract potential as $(\phi,\mathcal{S})$.

Note that the definition above does not commit to any concrete form of abstract potentials such as nonnegative functions. Nonnegative real-valued functions of random variables are an example of a concrete (i.e., non-abstract) labeled potential, if we slightly extend their definition. Call a valuation a mapping $\nu$ from random variables in $\mathcal{V}$ to values in their corresponding co-domains. Then a concrete potential $\phi$ is a mapping from valuations of a set of variables $\sco{\phi}$ to nonnegative real values. This definition is invariant to the order in which variables appear. Another example of concrete labeled potentials are the relations in relational databases, which are finite sets of valuations of the attributes (columns) of a relation (table).

The table below represents a concrete potential. Note that the order in which the variables appear is not relevant, as potentials are defined over valuations.

| A | B | $\phi$ |
|---|---|---|
| 0 | 0 | 10 |
| 1 | 0 | 0.1 |
| 0 | 1 | 0.1 |
| 1 | 1 | 10 |

For convenience, we will refer to labeled abstract potentials simply as potentials (bearing in mind that these need not refer to the concrete potentials we have just defined). Let $\Phi$ be the set of all abstract potentials. Potentials are useful when they can be manipulated to produce new potentials.

A combination is a binary operation $\times$ on $\Phi$ satisfying $\sco{\phi \times \psi}=\sco{\phi} \cup \sco{\psi}$.

In our concrete potentials, combination corresponds to the product of functions. That is, $(\phi \times \psi)(\nu) = \phi(\nu) \cdot \psi(\nu)$.

Consider the following two potentials:

| A | B | $\phi_1$ |
|---|---|---|
| 0 | 0 | 10 |
| 1 | 0 | 0.1 |
| 0 | 1 | 0.1 |
| 1 | 1 | 10 |

| A | C | $\phi_2$ |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 0 | 0.2 |
| 0 | 1 | 5 |
| 1 | 1 | 0.2 |

Their combination, taken as their product, is

| A | B | C | $\phi_1 \times \phi_2$ |
|---|---|---|---|
| 0 | 0 | 0 | 50 |
| 1 | 0 | 0 | 0.02 |
| 0 | 1 | 0 | 0.5 |
| 1 | 1 | 0 | 2 |
| 0 | 0 | 1 | 50 |
| 1 | 0 | 1 | 0.02 |
| 0 | 1 | 1 | 0.5 |
| 1 | 1 | 1 | 2 |
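Combination of concrete potentials is easy to implement. Below is a minimal sketch (not any particular library's API) that represents a concrete potential as a dictionary from valuations, encoded as frozensets of (variable, value) pairs, to nonnegative reals, and reproduces the product table above.

```python
# A concrete potential as a dict from valuations (frozensets of
# (variable, value) pairs) to nonnegative reals. A minimal sketch.

def consistent(u, v):
    """Two valuations are consistent if they agree on shared variables."""
    seen = {}
    return all(seen.setdefault(x, a) == a for x, a in u | v)

def combine(phi, psi):
    """Pointwise product over all consistent pairs of valuations."""
    return {u | v: p * q
            for u, p in phi.items() for v, q in psi.items()
            if consistent(u, v)}

phi1 = {frozenset({("A", a), ("B", b)}): w for (a, b), w in
        {(0, 0): 10, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 10}.items()}
phi2 = {frozenset({("A", a), ("C", c)}): w for (a, c), w in
        {(0, 0): 5, (1, 0): 0.2, (0, 1): 5, (1, 1): 0.2}.items()}

joint = combine(phi1, phi2)
print(joint[frozenset({("A", 0), ("B", 0), ("C", 0)})])  # 50
```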

Another operation is the following.

An elimination is a mapping $(\phi, X) \mapsto \del{X}{\phi}$ from $\Phi \times \mathcal{V}$ to $\Phi$ such that $\sco{\del{X}{\phi}}=\sco{\phi}-\{X\}$.

Elimination of a variable $X$ from a concrete potential $\phi$ is usually defined as sum-marginalization: $$ \del{X}{\phi}(\nu) = \sum_{\nu': \nu'(Y)=\nu(Y) \text{ for all } Y \in \sco{\phi}-\{X\}} \phi(\nu') \, . $$

Consider the following potential:

| A | B | $\phi$ |
|---|---|---|
| 0 | 0 | 10 |
| 1 | 0 | 0.1 |
| 0 | 1 | 5 |
| 1 | 1 | 0.2 |

We have that:

| B | $\del{A}{\phi}$ |
|---|---|
| 0 | 10.1 |
| 1 | 5.2 |

and

| A | $\del{B}{\phi}$ |
|---|---|
| 0 | 15 |
| 1 | 0.3 |
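Elimination is equally simple in the dict-based representation used in the combination sketch above; the code below reproduces the two marginalizations just shown.

```python
# Sum-marginalization of a variable X out of a dict-based potential
# (valuations encoded as frozensets of (variable, value) pairs).
def eliminate(phi, X):
    out = {}
    for u, p in phi.items():
        key = frozenset((x, a) for x, a in u if x != X)  # drop X from valuation
        out[key] = out.get(key, 0.0) + p                 # sum over values of X
    return out

phi = {frozenset({("A", a), ("B", b)}): w for (a, b), w in
       {(0, 0): 10, (1, 0): 0.1, (0, 1): 5, (1, 1): 0.2}.items()}

print(eliminate(phi, "A"))  # B=0 -> 10.1, B=1 -> 5.2 (keys are frozensets)
print(eliminate(phi, "B"))  # A=0 -> 15.0, A=1 -> 0.3
```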

Potentials equipped with combination and elimination form an algebraic system.

An algebra of (labeled abstract) potentials is a system $(\mathcal{V},\Phi,\times,-)$ closed under combination and elimination, and satisfying the following axioms:8

  1. Commutativity of combination: $$ \phi \times \psi = \psi \times \phi $$
  2. Associativity of combination: $$ \phi_1 \times (\phi_2 \times \phi_3) = (\phi_1 \times \phi_2) \times \phi_3 $$
  3. Transitivity of elimination: $$ \del{Y}{(\del{X}{\phi})}=\del{X}{(\del{Y}{\phi})} $$
  4. Distributivity of $-$ over $\times$: $$ X \not\in \sco{\phi} \implies \del{X}{(\phi \times \psi)}=\phi \times (\del{X}{\psi}) $$

One can verify that concrete potentials (i.e., nonnegative functions of categorical random variables) satisfy the axioms above.

Let $\phi_1,\dotsc,\phi_n$ be potentials and $X$ be a variable such that $X \not\in \sco{\phi_1} \cup \dotsb \cup \sco{\phi_m}$, for some $m < n$. These axioms allow us to derive:

$$ \del{X}{(\phi_1 \times \dotsb \times \phi_n)} = \phi_1 \times \dotsb \times \phi_m \times \del{X}{(\phi_{m+1} \times \dotsb \times \phi_n)} $$

Repeated application of the result above suffices to derive a sound algorithm for the elimination of a set of variables from a combination of potentials. Our interest in the algebra of potentials is to perform exact inference in discrete probabilistic graphical models (Bayesian networks and Markov networks). Nevertheless, many other representation devices can be shown to satisfy these axioms, and hence, variable elimination is well-defined on them. Examples include Gaussian distributions, systems of linear equations, convex sets, Boolean algebra, and belief functions.9 Notably, product and maximization of discrete functions (i.e., potentials) satisfy the axioms; we can thus use variable elimination to find maximum probability configurations.
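As an illustration of the last point, swapping the sum in `eliminate` for a maximization yields a valid elimination operation for the max-product problem; a sketch over the same dict-based potentials as above:

```python
# Max-marginalization: the max-product analogue of `eliminate`.
def eliminate_max(phi, X):
    out = {}
    for u, p in phi.items():
        key = frozenset((x, a) for x, a in u if x != X)
        out[key] = max(out.get(key, 0.0), p)  # valid since potentials are nonnegative
    return out
```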

Variable Elimination

The Variable Elimination Algorithm is a procedure for computing the elimination of variables from the combination of a number of potentials. The algorithm takes a set

$$ \Psi = \{\phi_1,\dotsc,\phi_m\} $$

of potentials, and an ordered set of variables $(X_1,\dotsc,X_n)$ to be eliminated, and returns

$$ \del{X_1 \dotsb X_n}{(\phi_1 \times \dotsb \times \phi_m)} \, , $$ where the notation above denotes the successive elimination of the variables $X_1,\dotsc,X_n$, in that order, from the combination $\phi_1 \times \dotsb \times \phi_m$.

The algorithm proceeds as follows (a code sketch is given after the listing).

  1. For $i\gets 1$ to $n$ do:
    1. Collect all potentials $\phi \in \Psi$ such that $X_i \in \sco{\phi}$ and store them in $\Phi_i$
    2. Combine all potentials $\phi \in \Phi_i$ and then eliminate $X_i$ to obtain $\psi_i$
    3. Remove the potentials $\phi \in \Phi_i$ from $\Psi$ and add the potential $\psi_i$
  2. Return the combination of all $\phi \in \Psi$
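Below is a minimal, self-contained sketch of the procedure over the dict-based potentials used earlier (it repeats `combine` and `eliminate` for convenience). It is meant to mirror the steps above, not to be an efficient implementation.

```python
from functools import reduce

# Potentials are dicts from valuations (frozensets of (variable, value)
# pairs) to nonnegative reals, as in the earlier sketches.

def consistent(u, v):
    seen = {}
    return all(seen.setdefault(x, a) == a for x, a in u | v)

def combine(phi, psi):
    return {u | v: p * q for u, p in phi.items() for v, q in psi.items()
            if consistent(u, v)}

def eliminate(phi, X):
    out = {}
    for u, p in phi.items():
        key = frozenset((x, a) for x, a in u if x != X)
        out[key] = out.get(key, 0.0) + p
    return out

def scope(phi):
    return {x for u in phi for x, _ in u}

def variable_elimination(potentials, ordering):
    Psi = list(potentials)
    for X in ordering:                                     # Step 1
        Phi_i = [phi for phi in Psi if X in scope(phi)]    # Step 1.1
        if not Phi_i:
            continue                                       # X appears nowhere
        psi_i = eliminate(reduce(combine, Phi_i), X)       # Step 1.2
        Psi = [phi for phi in Psi if X not in scope(phi)]  # Step 1.3
        Psi.append(psi_i)
    return reduce(combine, Psi)                            # Step 2
```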

Consider running variable elimination with $$ \Psi = \{ \phi(A,B), \phi(A,C), \phi(B,D), \phi(C,D), \phi(D,E) \} $$ and variable ordering $A,B,C,D,E$. We denote by $\Psi^i$ the set of potentials obtained at the end of the $i$th iteration of the main loop. The algorithm thus produces: $$ \begin{align*} \psi_1(B,C) &= \del{A}{(\phi(A,B) \times \phi(A,C))} \\ \Psi^1 &= \{ \psi_1(B,C), \phi(B,D), \phi(C,D), \phi(D,E) \} \\ \psi_2(C,D) &= \del{B}{(\psi_1(B,C) \times \phi(B,D))} \\ \Psi^2 &= \{ \psi_2(C,D), \phi(C,D), \phi(D,E) \} \\ \psi_3(D) &= \del{C}{( \psi_2(C,D) \times \phi(C,D) )} \\ \Psi^3 &= \{ \psi_3(D), \phi(D,E) \} \\ \psi_4(E) &= \del{D}{( \psi_3(D) \times \phi(D,E) )} \\ \Psi^4 &= \{ \psi_4(E) \} \\ \psi_5() &= \del{E}{ \psi_4(E)} \\ \Psi^5 &= \{ \psi_5() \} \end{align*} $$

At step 2, the algorithm terminates and returns $\psi_5$.

Performance analysis

Let us analyze the time complexity of the algorithm. A naive implementation of Step 1.1 requires considering each potential in $\Psi$ and performing a linear search over its scope. Let $\omega$ be the maximum cardinality of a scope in $\Psi$ (at any time). Then, assuming that insertion in $\Phi_i$ takes constant time, this step takes $\mathcal{O}(m \cdot \omega)$ time (since the size of $\Psi$ never increases from one iteration to the next). Consider now Step 1.2. Combining two potentials $\phi$ and $\psi$ can be done by considering all valuations of the variables in $\sco{\phi} \cup \sco{\psi}$ and, for each valuation $\nu$, performing the product $\phi(\nu)\psi(\nu)$. Let $c$ be the maximum cardinality of a variable; assuming multiplication is performed in constant time, combination takes $\mathcal{O}(c^k)$ time, where $k=|\sco{\phi} \cup \sco{\psi}|$. Elimination can be performed similarly with the same asymptotic cost. Hence, Step 1.2 takes time $\mathcal{O}(m_i \cdot c^{\omega_i+1})$, where $m_i=|\Phi_i|$ and $\omega_i=|\sco{\psi_i}|$. Assuming removal from $\Psi$ takes constant time, Step 1.3 can be performed in time $\mathcal{O}(m_i)$. We also need to account for the combination of potentials in the last step and for the size of the output. For simplicity, we will assume that this step takes constant time (as is often the case in practice), which implies that all but a few variables are eliminated. The overall running time is thus

$$ \mathcal{O}\left(m\omega + \sum_{i=1}^n m_i c^{\omega_i+1} \right) = \mathcal{O}\bigl( (m+n)c^{\omega+1} \bigr) \, , $$

where we used the fact that the sum of all $m_i$ is at most $m+n$ (at each iteration we remove $m_i$ potentials and add one, so the total number of potentials ever inserted in $\Psi$, and hence ever removed, is at most $m+n$), and $\omega=\max_i \omega_i$.

Note that while $m$, $n$ and $c$ are parameters of the model, the value $\omega$ is not. Moreover, $\omega$ depends on the variable elimination ordering. To see this, consider binary variables $A,B,C$ and potentials $\phi_1(A,B)$ and $\phi_2(B,C)$. If we perform variable elimination using the ordering $A,B$ we end up with $$ \psi_1(B) = \del{A}{\phi_1(A,B)} \,, \quad \psi_2(C) = \del{B}{(\phi_2(B,C) \times \psi_1(B))} \, , $$

and obtain that $\omega=1$. On the other hand, if we use the ordering $B,A$, the algorithm generates

$$ \psi_1'(A,C) = \del{B}{(\phi_1(A,B) \times \phi_2(B,C))} \,, \quad \psi_2'(C) = \del{A}{\psi_1'(A,C)} \, , $$

and thus $\omega=2$. Can we obtain a bound on $\omega$ as a function of the model (the potentials and variables) and the elimination ordering? Not surprisingly, to answer that question we will represent the model graphically and determine $\omega$ as a function of the graph.

Domain Graph

The domain graph of a set of potentials $\Psi$ is an undirected graph whose nodes are the variables appearing in the scope of some potential in $\Psi$, with an edge connecting two nodes if and only if the corresponding variables appear together in the scope of some potential.

The figure below depicts the domain graph for the potentials $\phi(A,B)$, $\phi(A,C)$, $\phi(B,D)$, $\phi(C,D)$ and $\phi(D,E)$.

The reader should convince herself that the domain graph of a Markov network is the network's own graph, and that the domain graph of a Bayesian network is its moral graph.
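Building the domain graph from the scopes is straightforward; below is a sketch using an adjacency-set representation, with the running example as input.

```python
from itertools import combinations

def domain_graph(scopes):
    """Adjacency-set graph connecting variables that share a scope."""
    graph = {}
    for scope in scopes:
        for X in scope:
            graph.setdefault(X, set())
        for X, Y in combinations(scope, 2):
            graph[X].add(Y)
            graph[Y].add(X)
    return graph

# Running example: edges A-B, A-C, B-D, C-D, D-E.
D = domain_graph([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")])
```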

For the rest of this discussion, we will assume that the elimination ordering is complete, that is, that it contains all the variables appearing in some potential (the following results remain valid if we eliminate all but one variable, and can be adapted to the general case to account for the complexity of the output). Note that when the elimination ordering is complete, the output is a number (instead of a potential).

Elimination Graph

Let $D_0,D_1,\dotsc,D_n$ be the domain graphs of the sets $\Psi$ obtained during the execution of variable elimination. The elimination graph is the undirected graph over the nodes of $D_0$ that has an edge if and only if that edge appears in at least one of $D_0,D_1,\dotsc,D_n$. In other words, the elimination graph is the union of the domain graphs of the sets $\Psi^i$, for $i=0,\dotsc,n$.

Consider running variable elimination with $$ \Psi = \{ \phi(A,B), \phi(A,C), \phi(B,D), \phi(C,D), \phi(D,E) \} $$ and variable ordering $A,B,C,D,E$. The resulting elimination graph is depicted below.

Note that the elimination graph is denser than the domain graph of the initial set $\Psi$.

The elimination graph depends on the elimination ordering used.

Consider again the potentials $\phi_1(A,B)$ and $\phi_2(B,C)$. Their domain graph is the leftmost graph $D$ in the figure below.

After eliminating variable $A$, we are left with potentials $\psi_1(B)$ and $\phi_2(B,C)$, whose domain graph is $D^{-A}$ (ignoring blue nodes and dashed edges). Eliminating $B$ then generates the graph $D^{-A,B}$. The elimination graph of this run is therefore the graph $E^{-A,B}$, which equals the initial domain graph.

Now consider a new run of the algorithm where we first eliminate $B$, obtaining $\psi_1’(A,C)$. The respective domain graph is $D^{-B}$. Note that an edge between $A$ and $C$ was inserted, representing the potential $\psi_1’$. Eliminating variable $A$ generates the graph $D^{-B,A}$ in the figure. The rightmost graph $E^{-B,A}$ is the elimination graph of this ordering; we can see that this graph is more connected than the initial domain graph.

It is not a coincidence that in our example the value of $\omega$ equals the size of the largest clique in the elimination graph of each run minus one.

Consider a graph $G$ and a node $X$. The elimination of node $X$ from $G$ is the graph $G^{-X}$ obtained by pairwise connecting the neighbors of $X$ and removing $X$ and its incident edges.
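In the adjacency-set representation used earlier, node elimination is a direct transcription of this definition:

```python
from itertools import combinations

def eliminate_node(graph, X):
    """Connect the neighbors of X pairwise, then remove X (in place)."""
    neighbors = graph[X]
    for Y, Z in combinations(neighbors, 2):
        graph[Y].add(Z)
        graph[Z].add(Y)
    for Y in neighbors:
        graph[Y].discard(X)
    del graph[X]
    return neighbors  # useful for computing widths and fill-in edges
```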

For example, the graph $D^{-A}$ in the example above is obtained by eliminating node $A$ from the (initial) domain graph $D$. The following theorem shows that elimination of variables can be represented graphically by elimination of nodes in the corresponding domain graph.

The domain graph $D_i$ of the potentials in the $i$th iteration of Variable Elimination equals the graph obtained by eliminating node $X_i$ from the domain graph $D_{i-1}$ of the $(i-1)$th iteration, where $D_0$ is the domain graph of the initial set $\Psi$.

Let $\Psi_i$ denote the set $\Psi$ after the $i$th iteration; then $\Psi_i = (\Psi_{i-1} - \Phi_i) \cup \{\psi_i\}$. By construction, $X_i$ does not appear in $D_i$. First note that the neighbors of $X_i$ in $D_{i-1}$ are the variables in $\bigcup_{\phi \in \Phi_i} \sco{\phi} - \{X_i\}$, and this set is exactly the scope of $\psi_i$. Since $\psi_i$ is in $\Psi_i$, the neighbors of $X_i$ are pairwise connected in $D_i$. It remains to show that no other edge is inserted or removed. Let $Y$ and $Z$ be adjacent variables in $D_{i-1}$ that are not both neighbors of $X_i$. By definition of the domain graph, there must be a potential in $\Psi_{i-1}$ whose scope contains both $Y$ and $Z$; that scope cannot contain $X_i$ (otherwise $Y$ and $Z$ would both be neighbors of $X_i$), so the potential is not in $\Phi_i$ and survives in $\Psi_i$, and $Y$ and $Z$ remain adjacent in $D_i$. Conversely, consider an edge $Y,Z$ in $D_i$: either $Y$ and $Z$ appear together in $\sco{\psi_i}$, and hence are neighbors of $X_i$ in $D_{i-1}$, or they appear together in some potential of $\Psi_{i-1} - \Phi_i$, and hence are adjacent in $D_{i-1}$.

That result extends to the elimination graph (and justifies its name).

The elimination graph equals the graph obtained as follows: Start with $D_0$ and for $i=1,\dotsc,n$ connect any two neighbors $X_j$ and $X_k$ of $X_i$ such that $i < j < k$.

Thus, the elimination graph is obtained by simulating the elimination of each node in sequence, but marking eliminated nodes instead of removing them.

It follows from the previous theorem by noting that the neighbors of $X_i$ in $D_{i-1}$ are precisely the neighbors of $X_i$ in the elimination graph that have not yet been eliminated, hence are variables $X_j$ with $j > i$.

Induced width

We parametrize the class of elimination graphs by the size of the largest clique.

The induced width of an elimination ordering is the size of the largest clique of the corresponding elimination graph minus one.

We can now relate the size of generated potentials with the elimination graph.

The induced width of an elimination ordering equals $\omega=\max_i |\sco{\psi_i}|$.

First note that the scopes of the initial potentials are cliques in the initial domain graph, and hence cliques in the elimination graph. Each set $\sco{\psi_i} \cup \{X_i\}$ is also a clique in the elimination graph, since $\sco{\psi_i}=\nb{X_i}$ is turned into a clique in $D_i$ and $X_i$ is connected to all of its members in $D_{i-1}$. Conversely, every clique of the elimination graph is contained in one of these sets: an edge occurs only between variables that appear in the scope of a common potential, and every scope produced during the run is a subset of $\sco{\psi_i} \cup \{X_i\}$ for some $i$.
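Combining this result with the previous theorem gives a simple way to compute the induced width of an ordering: simulate the node eliminations and record the largest neighborhood encountered. A sketch reusing `domain_graph` and `eliminate_node` from above:

```python
def induced_width(scopes, ordering):
    """Largest neighborhood of X_i in D_{i-1}, over all i."""
    graph = domain_graph(scopes)
    width = 0
    for X in ordering:
        width = max(width, len(graph[X]))
        eliminate_node(graph, X)
    return width

scopes = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(induced_width(scopes, "ABCDE"))  # 2 for the running example
```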

Computational Complexity

We can now state the connection between the graphical representation of the model and the complexity of variable elimination.

The value $\omega_i$ is the size of the neighborhood of $X_i$ in $D_{i-1}$, which is also the number of neighbors of $X_i$ in the elimination graph that appear later in the elimination ordering.

To see that $\omega_i$ is the number of neighbors of $X_i$ in $D_{i-1}$, note that the scope of $\psi_i$ contains exactly the variables which co-appear with $X_i$ in the scope of some potential in $\Psi_{i-1}$ (hence are neighbors of $X_i$ in $D_{i-1}$).

We are finally ready to state the complexity of variable elimination with respect to the graphical representation of its input. Let $b$ be the smallest cardinality of a variable.

Variable elimination takes time $\Omega(m b^\chi)$ and $\mathcal{O}((m+n)c^{\omega+1})$, where $\chi$ is the size of the largest clique in the domain graph, and $\omega$ is the induced width of the elimination ordering.

Finding Elimination Orderings

According to the previous theorem, the complexity of the algorithm is lower bounded by an exponential in the largest clique size of the domain graph and upper bounded by an exponential in the largest clique size of the elimination graph. The former is a property of the input, and cannot be improved. As we have seen by example, the latter depends on the elimination ordering. Thus, we want to find an elimination ordering that minimizes the induced width. This is captured by the concept of the treewidth of a graph.

The treewidth of a graph $G$ is the minimum induced width over all elimination orderings of $G$.

Thus, the optimal implementation of variable elimination selects an elimination ordering whose induced width equals the treewidth. Finding such an ordering is however an NP-hard problem (and deciding whether there is an ordering whose induced width is at most a given integer is NP-complete).10 Instead, we will devise efficient heuristics for finding good elimination orderings (i.e., orderings with relatively small induced width).

Chordalization

Recall that a graph is chordal if every cycle of four or more nodes contains a chord, that is, an edge connecting two non-consecutive nodes of the cycle. There is an interesting connection between elimination orderings and chordalizations of graphs.11

A chordal graph $H$ is a chordalization of a graph $G$ if every edge in $G$ is also in $H$.

The following result shows that an elimination ordering can be seen as producing a chordalization of the (initial) domain graph.

The elimination graph is chordal.

We prove by contradiction. Suppose the elimination graph contains a chordless cycle $X_{i_1},\dotsc,X_{i_k}$ with $k \geq 4$, and assume without loss of generality that $X_{i_1}$ is the first node of the cycle to be eliminated (this is always possible, since a cycle is closed under shifting). Eliminating $X_{i_1}$ generates an edge between its cycle neighbors $X_{i_2}$ and $X_{i_k}$ in the elimination graph, and hence a chord in the cycle, a contradiction.

Hence, variable elimination can be seen as a means of inserting edges in the domain graph so as to make it chordal. For instance, the chordal graph on the right below is the elimination graph of the non-chordal domain graph on the left, obtained by eliminating nodes in the ordering $A,B,C,D,E$.

The edges in the elimination graph which are not in the domain graph are called fill-in edges. Each fill-in edge denotes an increase of the scope of a generated potential with respect to the scopes of the input potentials, which translates into an exponential increase in the running time of the algorithm. The edge between $B$ and $C$ in the graph on the right above is the single fill-in edge.

Min-Fill Heuristic

It can be shown that every chordalization of a graph corresponds to some elimination ordering. Hence, the treewidth of a graph can be equivalently stated in terms of chordalizations: it is the minimum over all chordalizations of the largest clique size minus one. Minimizing the number of fill-in edges inserted is a natural (if inexact) proxy for minimizing the largest clique. This suggests the min-fill heuristic, which starts with the domain graph and then repeatedly selects the node whose elimination introduces the minimum number of fill-in edges, and eliminates that node. Let $\text{fill}(D,X)$ denote the number of fill-in edges introduced by eliminating $X$ from $D$ (i.e., the number of edges needed to turn $\nb{X}$ into a clique). The min-fill heuristic is as follows (a code sketch is given after the listing):

  1. Start with the domain graph $D$ and an empty list $L$
  2. For $i\gets 1$ to $n$ do:
    1. Select a node $X_j$ in $D$ that minimizes $\text{fill}(D,X_j)$
    2. Add $X_j$ to the end of $L$ and update $D \gets D^{-X_j}$
  3. Return $L$
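A sketch of the heuristic over adjacency-set graphs, reusing `eliminate_node` from above; `fill` counts the missing edges among the neighbors of a node.

```python
from itertools import combinations

def fill(graph, X):
    """Number of fill-in edges introduced by eliminating X."""
    return sum(1 for Y, Z in combinations(graph[X], 2) if Z not in graph[Y])

def min_fill_ordering(graph):
    graph = {X: set(N) for X, N in graph.items()}  # work on a copy
    ordering = []
    while graph:
        X = min(graph, key=lambda Y: fill(graph, Y))  # greedy choice
        ordering.append(X)
        eliminate_node(graph, X)
    return ordering
```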

Each iteration takes time linear in the number of nodes and quadratic in the largest degree if the graph is represented with adjacency sets (due to computing the number of fill-in edges, which requires checking whether pairs of neighbors are connected).

Simplicial nodes

While the min-fill heuristic may intuitively seem a good approach, it does not always lead to a minimum-width elimination ordering. It does, however, when the domain graph is already chordal.

A node $X$ is simplicial if $\nb{X}$ is a clique.

Clearly, a node $X$ is simplicial if and only if $\text{fill}(D,X)=0$. Hence, min-fill always prefers a simplicial node over a non-simplicial one.

Perfect Elimination Orderings

An elimination ordering is perfect if it introduces no fill-in edges.

The following statements are true.

  1. Any chordal graph with at least two nodes has at least two simplicial nodes.
  2. A graph is chordal if and only if it admits a perfect elimination ordering.
  3. If $D$ is chordal and $X$ is simplicial, then $D^{-X}$ is chordal.
  4. An elimination ordering that repeatedly eliminates simplicial nodes is perfect.

According to the previous theorem, the min-fill heuristic produces a perfect elimination ordering in chordal domain graphs.
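These statements also yield a simple chordality test: repeatedly eliminate simplicial nodes and check whether we get stuck. A sketch reusing `fill` and `eliminate_node` from above:

```python
def is_chordal(graph):
    """A graph is chordal iff simplicial nodes can be eliminated until empty."""
    graph = {X: set(N) for X, N in graph.items()}  # work on a copy
    while graph:
        X = next((Y for Y in graph if fill(graph, Y) == 0), None)
        if X is None:
            return False  # no simplicial node left: a chordless cycle remains
        eliminate_node(graph, X)
    return True
```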

Min-Degree Heuristic

The following result is useful to compute the induced width of an elimination ordering (recall that the elimination graph is chordal).

Let $D_0,D_1,\dotsc,D_n$ be a sequence of graphs obtained by repeatedly eliminating simplicial nodes $X_1,\dotsc,X_n$. Then the induced width of the ordering is the maximum number of neighbors of a variable $X_i$ in $D_{i-1}$.

The size of the potential $\psi_i$ created at each iteration is related to the number of neighbors of the variable $X_i$ being eliminated. Thus, it is reasonable to eliminate variables of small degree. The min-degree heuristic attempts to find a good elimination ordering by favoring variables with fewer neighbors. The algorithm is very similar to the min-fill heuristic (again with a code sketch after the listing):

  1. Start with the domain graph $D$ and an empty list $L$
  2. For $i\gets 1$ to $n$ do:
    1. Select a node $X_j$ in $D$ that minimizes $|\nb{X_j}|$
    2. Add $X_j$ to the end of $L$ and update $D \gets D^{-X_j}$
  3. Return $L$
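Min-degree is the same greedy loop as min-fill with a cheaper scoring function; a sketch reusing `eliminate_node`:

```python
def min_degree_ordering(graph):
    graph = {X: set(N) for X, N in graph.items()}  # work on a copy
    ordering = []
    while graph:
        X = min(graph, key=lambda Y: len(graph[Y]))  # fewest neighbors first
        ordering.append(X)
        eliminate_node(graph, X)
    return ordering
```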

Weighted heuristics

Both the min-fill and the min-degree heuristics aim at minimizing the maximum size of the scope of the generated potentials $\psi_i$. However, the cost of a potential also depends on the cardinalities of its variables. For instance, it might be better to generate a potential with 10 binary variables (table size $2^{10}=1024$) than one with 2 variables of cardinality 100 (table size $10000$). This leads us to consider weighted versions of these heuristics. The weighted min-fill heuristic selects a node with the smallest sum of weights of fill-in edges, where the weight of an edge is the product of the cardinalities of the variables it connects. The weighted min-degree heuristic selects a node that minimizes the weighted degree, defined as the product of the cardinalities of the node's neighbors.
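Only the scoring functions change; below are sketches of both weighted scores, where `card` maps each variable to its cardinality (an illustrative assumption).

```python
from itertools import combinations
from math import prod

def weighted_degree(graph, card, X):
    """Product of the cardinalities of the neighbors of X."""
    return prod(card[Y] for Y in graph[X])

def weighted_fill(graph, card, X):
    """Sum of the weights of the fill-in edges introduced by eliminating X."""
    return sum(card[Y] * card[Z]
               for Y, Z in combinations(graph[X], 2) if Z not in graph[Y])
```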

Offline computation

Note that the min-fill and min-degree heuristics can be computed dynamically, at each iteration, from the corresponding graph $D_i$. This is usually a good approach when the elimination ordering heuristic is fixed. It might however be worthwhile to obtain an elimination ordering before calling variable elimination, by running several heuristics (possibly each multiple times) and selecting the best ordering found.

Conclusion

Variable elimination is a simple algorithm for computing sum-product inferences in Bayesian networks and Markov networks. Its complexity is exponential in the induced width of the elimination ordering used. Finding an optimal elimination ordering is computationally difficult and greedy heuristics are used. These heuristics often perform well in practice, generating close to optimal orderings.

Variable elimination is particularly efficient in models of small treewidth. This is the case of tree-shaped Bayesian and Markov networks, and of polytree-shaped Bayesian networks of small in-degree. Tree-shaped models are routinely used to represent sequential data such as speech, text, and movies. For these graphs an optimal elimination ordering can be found efficiently.


  1. C. Cannings, E.A. Thompson and H.H. Skolnick, The recursive derivation of likelihoods on complex pedigrees, Advances in Applied Probability 8:4, 1976.

  2. C. Beeri, R. Fagin, D. Maier and M. Yannakakis, On the desirability of acyclic database schemes, Journal of the Association for Computing Machinery 30:3, 1983.

  3. U. Bertelé and F. Brioschi, Nonserial Dynamic Programming, 1972.

  4. M. Davis and H. Putnam, A Computing Procedure for Quantification Theory, Journal of the Association for Computing Machinery 7:4, 1960.

  5. R. Shachter, B. D’Ambrosio and B. Del Favero, Symbolic probabilistic inference in belief networks, Proceedings of the 8th National Conference on Artificial Intelligence, 1990.

  6. N. Zhang and D. Poole, A simple approach to Bayesian network computations, Proceedings of the 10th Biennial Canadian Artificial Intelligence Conference, 1994.

  7. R. Dechter, Bucket elimination: A unifying framework for reasoning, Artificial Intelligence 113:1-2, 1999.

  8. P. Shenoy and G. Shafer, Axioms for probability and belief-function propagation, Proceedings of the 4th Conference on Uncertainty in Artificial Intelligence, 1988.

  9. J. Kohlas, Information Algebras: Generic Structures for Inference, Springer-Verlag, 2003.

  10. S. Arnborg, D.G. Corneil and A. Proskurowski, Complexity of finding embeddings in a k-tree, SIAM Journal on Algebraic Discrete Methods 8:2, 1987.

  11. Chordalization is also called triangulation.
