Sum-Product Algorithms

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\sco}[1]{\text{sc}(#1)} \newcommand{\del}[2]{{#2}^{-#1}} $

As we have seen, variable elimination is sound but inefficient in networks with high treewidth. Kwisthout et al.1 showed that if the sum-product problem could be solved in time subexponential in the treewidth, then the propositional satisfiability problem (SAT) could also be solved in time subexponential in the number of propositions. The latter is believed to be false, implying that the former is probably also false. Thus, exact inference algorithms with polynomial-time behavior in high-treewidth models are very unlikely to exist. This motivates the study of approximate algorithms. We have already seen sampling algorithms, which provide approximate solutions (in polynomial time). In this lecture, we look at another approximate algorithm, called the sum-product algorithm, which is a syntactic variant of the famous loopy belief propagation algorithm for approximate inference in Bayesian networks. We show that the algorithm is exact when the underlying graphical structure is a tree. This leads to a class of exact inference algorithms that re-cast any model as a tree and then run sum-product in that tree. This is the essence of clique-tree algorithms, which we also discuss here.

The Sum-Product Problem

To recap, we are interested in the problem of, given potentials $\phi_1,\dotsc,\phi_m$ over a set of variables $\mathcal{V}$ and a set $\mathcal{X} \subseteq \mathcal{V}$, computing

$$ \sum_{\nu \sim (\mathcal{V}-\mathcal{X})} \prod_j \phi_j(\nu) \, . $$

For $\mathcal{X}=\emptyset$, the problem boils down to computing the partition function of the corresponding graphical model.

Often, we are interested in obtaining updated marginal distributions for every variable in the domain. We can accomplish this by solving the Sum-Product problem separately for each variable (i.e., with $\mathcal{X}=\{X\}$ for each $X \in \mathcal{V}$), for instance, by running Variable Elimination for each variable. However, more efficient approaches can be achieved by exploiting the redundancies in the computations. To this end, we formalize the inference problem as follows.

All-Marginals Problem: For every $X \in \mathcal{V}$ compute:

$$ P(X) = \frac{\sum_{\nu \sim \mathcal{V}-\{X\}} \prod_j \phi_j(\nu)}{\sum_{\nu \sim \mathcal{V}} \prod_j \phi_j(\nu)} $$
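To make the problem concrete, the following brute-force sketch computes all marginals and the partition function by enumerating every joint assignment. It is exponential in the number of variables and meant only for intuition; the dictionary-based factor representation is an assumption of this sketch, not something fixed by the lecture.

```python
from itertools import product

def all_marginals(domains, factors):
    """domains: dict mapping each variable to its list of values.
    factors: list of (scope, table) pairs, where scope is a tuple of variables
    and table maps value tuples (ordered as scope) to nonnegative numbers."""
    variables = list(domains)
    Z = 0.0
    marg = {X: {x: 0.0 for x in domains[X]} for X in variables}
    # enumerate every joint assignment nu and accumulate its weight
    for values in product(*(domains[X] for X in variables)):
        nu = dict(zip(variables, values))
        weight = 1.0
        for scope, table in factors:
            weight *= table[tuple(nu[X] for X in scope)]
        Z += weight
        for X in variables:
            marg[X][nu[X]] += weight
    # divide by the partition function to obtain P(X) for every X
    return {X: {x: marg[X][x] / Z for x in domains[X]} for X in variables}, Z

# Example: two binary variables coupled by a single attractive potential.
domains = {"A": [0, 1], "B": [0, 1]}
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
marginals, Z = all_marginals(domains, [(("A", "B"), phi)])
print(marginals, Z)   # uniform marginals, Z = 8
```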

Variable Elimination in Markov Trees

Variable elimination takes a very peculiar form in tree-shaped Markov networks. Recall that the distribution of a factored Markov tree has the form

$$ P(\mathcal{V}) = \frac{1}{Z} \prod_{X-Y \in \mathcal{E}} \phi(X,Y) \, , $$

where $\mathcal{E}$ is the set of edges of the tree. We can transform any tree into a rooted tree, which is a directed tree where the arcs point away from the root (the node without incoming arcs). As usual, we refer to the neighbors of a node in a tree as the parent (the one that is closer to the root) and children. If $T_R=(\mathcal{V},\mathcal{E})$ is a tree rooted at $R \in \mathcal{V}$, we define the subtree rooted at $S$ as the subgraph $T_S=(\{S\} \cup \de{S},\mathcal{E}_S)$ induced by $S$ and its descendants in $T_R$.

Let $X_n,\dotsc,X_1$ be a topological ordering of the nodes in the tree rooted at $X_n$. Then $X_1,\dotsc,X_n$ is a perfect elimination ordering.

By the time a variable $X_i$ is eliminated, all its children have been eliminated, so that its single neighbor is its parent, making $X_i$ simplicial.

This allows us to re-define variable elimination in trees as follows.

Let $X_n,\dotsc,X_1$ be a topological ordering of the nodes. Compute for $i=1$ to $i=n$:

$$ \psi_i(\pa{X_i}) = \del{X_i}{\Bigl[\phi(X_i,\pa{X_i}) \times \psi_{i_1}(X_i) \times \dotsb \times \psi_{i_k}(X_i)\Bigr]} \, , $$

where $\ch{X_i}=\{X_{i_1},\dotsc,X_{i_k}\}$.

Since $\pa{X_n}=\emptyset$, it follows that $\psi_n=Z$ (as all variables are eliminated). The marginal distribution of $X_n$ can be obtained as

$$ (\psi_{i_1} \times \dotsb \times \psi_{i_k})/Z_n \, , $$

where $X_{i_1},\dotsc,X_{i_k}$ are the children of $X_n$ and $Z_n$ is a normalization constant. The figure below illustrates the computation of variable elimination in a Markov tree with factors $\phi(X,Y)$, $\phi(X,Z)$ and $\phi(X,W)$, using a perfect elimination ordering obtained by rooting at $W$.
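The recursion above is easy to implement. The sketch below assumes that pairwise potentials are stored as dictionaries keyed by (child value, parent value); these data structures are illustrative choices, not part of the lecture.

```python
from math import prod

def tree_ve(order, parent, phi, domains):
    """Variable elimination on a Markov tree. `order` lists the nodes leaves
    first and the root last (a reverse topological ordering); `parent` maps
    each non-root node to its parent; `phi[(child, parent)]` is the edge
    potential with keys (child_value, parent_value)."""
    children = {i: [j for j in order if parent.get(j) == i] for i in order}
    psi = {}
    for i in order:
        if parent.get(i) is None:
            # root: eliminating the root from the product of its children's
            # messages yields the partition function Z
            psi[i] = sum(prod(psi[c][x] for c in children[i]) for x in domains[i])
        else:
            p = parent[i]
            psi[i] = {xp: sum(phi[(i, p)][(xi, xp)] *
                              prod(psi[c][xi] for c in children[i])
                              for xi in domains[i])
                      for xp in domains[p]}
    # The marginal of the root is proportional to prod(psi[c][x] for c in children[root]).
    return psi   # psi[root] is the partition function Z

# Example: the tree with factors phi(X,Y), phi(X,Z), phi(X,W), rooted at W.
domains = {v: [0, 1] for v in "WXYZ"}
phi = {(c, p): {(a, b): 1.0 + (a == b) for a in (0, 1) for b in (0, 1)}
       for (c, p) in [("X", "W"), ("Y", "X"), ("Z", "X")]}
print(tree_ve(["Y", "Z", "X", "W"], {"X": "W", "Y": "X", "Z": "X"}, phi, domains))
```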

Synchronous Message Passing

The choice of $X_n$ as the root of the tree is arbitrary (e.g., $W$ in the example above), and the algorithm computes the marginal and the partition function for any choice of root. This allows us to re-formulate the algorithm as follows:

  1. Initialize $\psi_{i \rightarrow j}^{(0)}(X_j) = 1$, and $t \gets 1$.
  2. Repeat until convergence:
    1. For $i\gets 1$ to $i \gets n$ compute: $$ \psi_{i\rightarrow j}^{(t)}(X_j) = \del{X_i}{\Bigl[ \phi(X_i,X_j) \times \psi_{i_1 \rightarrow i}^{(t-1)}(X_i) \times \dotsb \times \psi_{i_r \rightarrow i}^{(t-1)}(X_i)\Bigr]} $$ for each $X_j \in \nb{X_i}$, where $$ \{X_{i_1},\dotsc,X_{i_r}\} = \nb{X_i} - \{X_j\} \, . $$
    2. Update $t \gets t + 1$.

We call the potential $\psi_{i \rightarrow j}^{(t)}(X_j)$ the $t$-th message from $X_i$ to $X_j$, and the previous algorithm synchronous message-passing. Convergence is verified by computing $\max_{i,j} |\psi_{i\rightarrow j}^{(t)}-\psi_{i\rightarrow j}^{(t-1)}|$.
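A minimal implementation of synchronous message passing in a tree-shaped pairwise Markov network might look as follows; the data structures (e.g. the dictionary of directed messages) are assumptions of this sketch.

```python
from math import prod

def synchronous_mp(nodes, edges, phi, domains, tol=1e-9, max_iter=100):
    """Synchronous message passing on a tree-shaped pairwise Markov network.
    `edges` is a list of pairs; `phi[(i, j)]` is the potential of edge i - j
    with keys (value_of_i, value_of_j), stored once per undirected edge."""
    nb = {i: [] for i in nodes}
    for a, b in edges:
        nb[a].append(b)
        nb[b].append(a)

    def pot(i, j, xi, xj):
        return phi[(i, j)][(xi, xj)] if (i, j) in phi else phi[(j, i)][(xj, xi)]

    # one message per directed arc, initialized to 1
    msg = {(i, j): {xj: 1.0 for xj in domains[j]} for i in nodes for j in nb[i]}
    for t in range(1, max_iter + 1):
        new = {(i, j): {xj: sum(pot(i, j, xi, xj) *
                                prod(msg[(k, i)][xi] for k in nb[i] if k != j)
                                for xi in domains[i])
                        for xj in domains[j]}
               for i in nodes for j in nb[i]}
        delta = max(abs(new[a][x] - msg[a][x]) for a in msg for x in msg[a])
        msg = new
        if delta < tol:      # convergence test: max_{i,j} |psi^(t) - psi^(t-1)|
            break
    return msg

# Partition function and node marginals from the converged messages:
#   Z = sum_{x_i} prod_{j in nb(i)} msg[(j, i)][x_i]
#   P(X_i = x_i) is the same product divided by Z.
# E.g. msg = synchronous_mp(list("WXYZ"), [("X","W"),("Y","X"),("Z","X")], phi, domains)
# with the potentials from the tree VE sketch above.
```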

The figure below illustrates the execution of the algorithm in the same model as before. Note that each arc is associated with two messages, one for each direction (parent to child and child to parent).

Before giving a formal proof of convergence, let us analyze a few iterations of the algorithm. After the first iteration of Step (2), every leaf has sent a message to its single neighbor, as it would have in variable elimination with a consistent reverse topological ordering. This message remains the same in every subsequent iteration (so that leaf nodes converge after one iteration). Now consider a node $X_i$ where all but one of its neighbors are leaves (call the non-leaf neighbor $X_j$). That is, there is at most one path from $X_i$ to a leaf with length greater than one. Then, after 2 iterations, this node has sent the same message to its non-leaf neighbor as it would have had we been performing variable elimination with a consistent reverse topological ordering. Moreover, since the messages $X_i$ receives from its leaf neighbors remain unaltered in future iterations, also the message from $X_i$ to $X_j$ remains unaltered for $t \geq 2$ (so that $X_i$ converges after 2 iterations). In general, message-passing can be seen as a parallel simulation of variable elimination with many perfect elimination orderings and with memory (message) sharing. This enables a more efficient computation of all the marginals of the network (compared to multiple runs of variable elimination).

The formal analysis of convergence uses the height of rooted trees: The height of a rooted tree is the length of the longest path from the root to a leaf. The height of a tree containing a single node is zero. Note that the height of a subtree is always smaller than that of the tree.

Root the tree and let $X_i$ be an arbitrary node with $X_j=\pa{X_i}$. Then $\psi_{i\rightarrow j}^{(t)}=\psi_{i\rightarrow j}^{(h+1)}$ for every $t>h$, where $h$ is the height of the subtree rooted at $X_i$.

We use an inductive argument on the height of the subtree rooted at $X_i$. Let $X_i$ be a leaf (hence $h=0$). After one iteration this node has sent a message to its single neighbor (its parent), as it would have in variable elimination with any consistent reverse topological ordering. This message remains the same in every subsequent iteration. Now consider a node $X_i$ with height $h$, and assume that the result holds for every subtree rooted at one of its children. Then every message $\psi_{j \rightarrow i}^{(t)}$ sent by a child of $X_i$ is unaltered for $t>h-1$. But this implies that the message from $X_i$ to its parent $X_j$ is the same for every $t>h$.

We thus have that:

Synchronous message-passing converges in tree-shaped graphs after $\rho$ iterations, where $\rho$ is the diameter of the tree (the length of the longest path). Moreover, $$ Z=\del{X_i}{\prod_{X_j \in \nb{X_i}}\psi_{j\rightarrow i}^{(\rho)}(X_i)} $$ and $$ P(X_i) \propto \prod_{X_j \in \nb{X_i}}\psi_{j\rightarrow i}^{(\rho)}(X_i) $$ for any node $X_i$.

Root the tree at a node $X_i$ such that the height of the rooted tree is maximized. This height equals the diameter of the tree, and by the previous lemma all messages remain unchanged after $\rho$ iterations. Fix a reverse topological ordering $X_1,\dotsc,X_n$. Then the messages $\psi_{i \rightarrow \pa{X_i}}^{(i)}$ are the same potentials generated during variable elimination with that elimination ordering.

Asynchronous Message-Passing

Message-passing can also be performed asynchronously, that is, a message can be updated and sent to neighbors as soon as any incoming message is received. Moreover, we can also normalize the messages to increase numerical accuracy. The result is the so-called asynchronous message-passing algorithm:

  1. Initialize $\psi_{i \rightarrow j}(X_j) = 1/|\Omega_{X_j}|$
  2. Repeat until convergence:
    1. For $i\gets 1$ to $i \gets n$ compute $$ \psi_{i\rightarrow j}(X_j) \propto \del{X_i}{\Bigl[ \phi(X_i,X_j) \times \psi_{i_1 \rightarrow i}(X_i) \times \dotsb \times \psi_{i_r \rightarrow i}(X_i)\Bigr]} $$ for each $X_j \in \nb{X_i}$, where $$ \{ X_{i_1},\dotsc, X_{i_r} \} = \nb{X_i} - \{X_j\} \, . $$

The messages $\psi_{i \rightarrow j}$ are normalized by dividing by $\del{X_j}{\psi_{i \rightarrow j}}$. It can be shown that the asynchronous version also terminates after a finite number of steps and produces estimates of the marginal probabilities and the partition function. Moreover, since the asynchronous version uses updated messages as soon as they are available, it typically converges in fewer iterations than the synchronous version. Note that the normalization can also be applied to the synchronous version.
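Assuming the neighbor lists, potential lookup and message dictionary from the synchronous sketch above, a single asynchronous (normalized) update could be written as follows; the function and parameter names are illustrative.

```python
from math import prod

def async_update(i, j, nb, pot, domains, msg):
    """Update message i -> j in place, normalized to sum to one. Assumes the
    `nb` and `msg` structures and the `pot` lookup from the synchronous sketch."""
    unnorm = {xj: sum(pot(i, j, xi, xj) *
                      prod(msg[(k, i)][xi] for k in nb[i] if k != j)
                      for xi in domains[i])
              for xj in domains[j]}
    z = sum(unnorm.values())           # normalization improves numerical stability
    msg[(i, j)] = {xj: v / z for xj, v in unnorm.items()}
```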

Sum-Product in Factor Graphs

The (synchronous or asynchronous) message-passing algorithm can be immediately applied to loopy pairwise Markov networks. However, in loopy networks, the algorithm might not converge, and a bound on the number of iterations must be imposed. Even when the algorithm does converge, there is no guarantee on the quality of the estimates induced from the messages. In fact, the algorithm can be understood as implicitly assuming a larger set of independences, which results in either under- or over-estimation of the marginal probabilities. The accuracy of the algorithm need not increase with time (in fact, it is common for the accuracy to oscillate as the number of iterations grows).

Factor Graphs

To further extend the algorithm to arbitrary (non-pairwise) Markov networks, we need to introduce an auxiliary representation.

The factor graph of a set of factors $\mathcal{F}=\{\phi_1,\dotsc,\phi_m\}$ is the bipartite undirected graph $G=(\mathcal{V} \cup \mathcal{F},\mathcal{E})$ where $\mathcal{V}= \bigcup_j \sco{\phi_j}$ is the set of variables and there is an edge $X - \phi_j$ in $\mathcal{E}$ if and only if $X \in \sco{\phi_j}$.

The figure below depicts the factor graph of $$ \mathcal{F} = \{ \phi(A,B),\phi(A,C),\phi(B,D),\phi(C,D),\phi(D,E) \} \, . $$

A factor graph is a more detailed description of the distribution than a Markov network, since the latter makes no (graphical) distinction between different parametrizations of the model (i.e., whether the potentials are over maximal or non-maximal cliques). Thus, while a pairwise Markov network over a complete graph and a Markov network with a single factor over all variables have the same graphical representation, they have very different factor graph representations.

Gibbs Distribution

A factor graph induces a Gibbs distribution over the random variables by $$ P(\mathcal{V}) = \frac{1}{Z} \prod_{j=1}^m \phi_j \, . $$

For example, the factor graph in figure above induces the joint distribution $$ P(A,B,C,D,E)=\frac{\phi(A,B)\phi(A,C)\phi(B,D)\phi(C,D)\phi(D,E)}{\sum_{ABCDE}\phi(A,B)\phi(A,C)\phi(B,D)\phi(C,D)\phi(D,E)} \, . $$
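Here is a hedged sketch of both constructions: building the bipartite adjacency of a factor graph, and evaluating the induced Gibbs distribution by brute force. The triple-based factor representation is an assumption of this sketch.

```python
from itertools import product

def factor_graph(factors):
    """Build the bipartite adjacency of a factor graph. `factors` is a list of
    (name, scope, table) triples; returns variable -> factor-names and
    factor-name -> variables adjacency."""
    var_nb, fac_nb = {}, {}
    for name, scope, _ in factors:
        fac_nb[name] = list(scope)
        for X in scope:
            var_nb.setdefault(X, []).append(name)
    return var_nb, fac_nb

def gibbs_distribution(factors, domains):
    """Normalized Gibbs distribution induced by the factors (brute force)."""
    variables = list(domains)
    table = {}
    for values in product(*(domains[X] for X in variables)):
        nu = dict(zip(variables, values))
        weight = 1.0
        for _, scope, tab in factors:
            weight *= tab[tuple(nu[X] for X in scope)]
        table[values] = weight
    Z = sum(table.values())
    return {v: w / Z for v, w in table.items()}

# Example with two factors sharing variable B:
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
factors = [("phi1", ("A", "B"), {(a, b): 1.0 + (a == b) for a in (0, 1) for b in (0, 1)}),
           ("phi2", ("B", "C"), {(b, c): 1.0 + (b != c) for b in (0, 1) for c in (0, 1)})]
print(factor_graph(factors))
print(gibbs_distribution(factors, domains)[(0, 0, 0)])
```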

Sum-Product Rules

Even though the factor graph is defined as an undirected graph, it is convenient to represent it as a bi-directed graph (i.e., each edge is represented by two symmetric arcs). Let $\phi_1,\dotsc,\phi_r$ denote the neighbors of a variable $X$ in the graph, and $X_1,\dotsc,X_s$ the neighbors of a factor $\phi$. Message-passing in factor graphs is performed by applying the following sum-product rules to each (directed) arc.

  • Variable to Factor:

$$ \psi_{X \rightarrow \phi_j}(X) = \psi_{\phi_1 \rightarrow X} \times \dotsb \times \psi_{\phi_{j-1} \rightarrow X} \times \psi_{\phi_{j+1}\rightarrow X} \times \dotsb \times \psi_{\phi_r\rightarrow X}
$$

  • Factor to Variable

$$ \psi_{\phi \rightarrow X_j}(X_j) = \del{(\nb{\phi} - \{X_j\})}{(\phi \times \psi_{X_1 \rightarrow \phi} \times \dotsb \times \psi_{X_{j-1} \rightarrow \phi} \times \psi_{X_{j+1} \rightarrow \phi} \times \dotsb \times \psi_{X_s \rightarrow \phi})} $$

The messages are initialized with ones and often updated asynchronously (although synchronous update is possible). The algorithm terminates when messages converge or after a maximum number of iterations is reached.

The figure below illustrates the sum-product rules in a simple graphical model with three factors and three variables. Note that the factor graph distinguishes the factors $\phi_1$ and $\phi_2$, which have identical scope.

For example, the messages computed at node $X$ are: $$ \begin{align*} \psi_{X \rightarrow \phi}(X) &= \psi_{\phi_1 \rightarrow X}(X) \times \psi_{\phi_2 \rightarrow X}(X) \\ \psi_{\phi \rightarrow X}(X) &= \del{\{Y,Z\}}{\bigl[\phi(X,Y,Z) \times \psi_{Y \rightarrow \phi}(Y) \times \psi_{Z \rightarrow \phi}(Z)\bigr]} \end{align*} $$
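The two rules translate directly into code. The sketch below assumes messages are stored in dictionaries `msg_fv[(factor, X)]` and `msg_vf[(X, factor)]` mapping variable values to numbers, together with the variable adjacency `var_nb` from the factor-graph sketch above; all of these are illustrative choices, not part of the lecture.

```python
from itertools import product
from math import prod

def variable_to_factor(X, phi_name, var_nb, msg_fv, domains):
    """psi_{X -> phi}: product of the messages from all other neighbouring factors."""
    return {x: prod(msg_fv[(g, X)][x] for g in var_nb[X] if g != phi_name)
            for x in domains[X]}

def factor_to_variable(phi_name, Xj, scope, table, msg_vf, domains):
    """psi_{phi -> Xj}: sum out every scope variable except Xj from the factor
    combined with the messages from the other neighbouring variables."""
    out = {xj: 0.0 for xj in domains[Xj]}
    others = [X for X in scope if X != Xj]
    for values in product(*(domains[X] for X in others)):
        nu = dict(zip(others, values))
        w = prod(msg_vf[(X, phi_name)][nu[X]] for X in others)
        for xj in domains[Xj]:
            nu[Xj] = xj
            out[xj] += table[tuple(nu[X] for X in scope)] * w
    return out
```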

Marginals

The node and clique marginal distributions can be obtained from the respective incoming messages at any iteration by

  • Node marginal:
    $$ P(X) \propto \psi_{\phi_1 \rightarrow X} \times \dotsb \times \psi_{\phi_r \rightarrow X} \, . $$

  • Clique marginal: $$ P(\mathcal{C}) \propto \phi(\mathcal{C}) \times \psi_{X_1 \rightarrow \phi} \times \dotsb \times \psi_{X_s \rightarrow \phi} \, . $$

where $\phi_1,\dotsc,\phi_r$ are the neighbors of $X$ in the factor graph and $X_1,\dotsc,X_s$ are the neighbors of $\phi(\mathcal{C})$.

Note that the marginals need not converge and can oscillate arbitrarily from iteration to iteration.
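For completeness, here is how the two marginal formulas above could be computed from the message dictionaries assumed in the previous sketches (again, just one possible organization).

```python
from itertools import product
from math import prod

def node_marginal(X, var_nb, msg_fv, domains):
    """Node marginal: product of all incoming factor-to-variable messages, normalized."""
    unnorm = {x: prod(msg_fv[(g, X)][x] for g in var_nb[X]) for x in domains[X]}
    Z = sum(unnorm.values())
    return {x: v / Z for x, v in unnorm.items()}

def factor_marginal(phi_name, scope, table, msg_vf, domains):
    """Clique marginal: the factor times all incoming variable-to-factor messages."""
    unnorm = {}
    for values in product(*(domains[X] for X in scope)):
        nu = dict(zip(scope, values))
        unnorm[values] = table[values] * prod(msg_vf[(X, phi_name)][nu[X]] for X in scope)
    Z = sum(unnorm.values())
    return {v: w / Z for v, w in unnorm.items()}
```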

Complexity

Sending a message from a potential $\phi$ to a variable $X$ requires combining $d-1$ incoming messages with $\phi$ and eliminating $d-1$ variables, where $d=|\nb{\phi}|=|\sco{\phi}|$. Thus, the cost of the computation is $O(d \cdot c^d)$, where $c$ is the maximum variable cardinality. Sending a message from $X$ to $\phi$ requires combining $d'-1$ potentials with scope $\{X\}$, taking time $O(d' \cdot c)$, where $d'$ is the number of neighbors of $X$ in the factor graph. Hence, the time is dominated by the complexity of sending messages from potential nodes. At each iteration, two messages are sent on each edge. Thus, if there are $n$ variables and $m$ potentials (factors), an iteration takes time $O((m+n) \cdot c^d)$, where $d$ is the size of the largest scope of a potential in the input. The overall running time is $O((m+n) \cdot c^d \cdot T)$, where $T$ is the maximum number of iterations. Hence, the algorithm takes time polynomial in the size of the input (the number of variables and potentials and the variable cardinalities), independently of the treewidth of the domain graph.

Convergence

It is widely believed that the sum-product algorithm is more accurate when the graph has few short loops (long loops are less problematic);2 and that it is more accurate when the potentials satisfy

$$ \phi(X_i,X_j)=\phi(X_i)\phi(X_j) + \phi'(X_i,X_j) $$

with $|\phi'(X_i,X_j)| < \epsilon$, for some small $\epsilon$.3 The convergence of the algorithm can be investigated as a non-convex optimization problem in which one tries to find a Markov tree approximation to the given model. It can be shown that the fixed points of the sum-product algorithm correspond to stationary points of this optimization.4 5

Clique-Trees

Recall that chordal graphs are the intersection of the classes of distributions represented by d-separation and u-separation. They are also essential in understanding the complexity of variable elimination. Trees are one of the simplest families of chordal graphs, and the family that gives rise to the sum-product algorithm. As we will see in this lecture, chordal graphs generalize trees in many ways. Importantly, (maximal) chordal graphs can be defined recursively. Chordal graphs also decompose: there are sets of nodes which, if removed, split the graph into two chordal graphs. These properties allow a very peculiar parametrization of chordal graphs, one that can be used to compute exact inference. Since any graph can be completed into a chordal graph (e.g., by graphical variable elimination), the algorithm for chordal graphs is generic (i.e., it computes inference in any Bayesian or Markov network), but intractable for some classes of models.

The clique-tree algorithm, which operates in chordal graphs, is most useful when solving the all-marginals problem. The algorithm performs sequential message passing in a tree-like structure called clique tree. As with the sum-product algorithm in trees, clique-tree message passing avoids redundant computations when eliminating variables with different orderings.

Markov Trees Revisited

We begin by revisiting Markov trees, and proving an interesting property.

Let $(G,P)$ be a Markov tree where $G=(\mathcal{V},\mathcal{E})$ and $P$ factorizes as $$ P(\mathcal{V}) = \frac{1}{Z} \prod_{X - Y \in \mathcal{E}} \phi(X,Y) \, . $$

Then,

$$ P(\mathcal{V}) = \prod_{X \in \mathcal{V}}P(X) \prod_{X - Y \in \mathcal{E}}\frac{P(X,Y)}{P(X)P(Y)} = \frac{\prod_{X - Y}P(X,Y)}{\prod_{X} [P(X)]^{\text{deg}(X)-1}} \, , $$

where $\mathcal{V}$ is the set of nodes/variables and $\mathcal{E}$ is the set of edges of $G$.

Root the tree at an arbitrary node $X_1$ and let $X_1,\dotsc,X_n$ be a topological ordering. In a tree, every node is connected to its non-descendants only through its parent. Thus, $$ P(\mathcal{V})=\prod_i P(X_i|X_1,\dotsc,X_{i-1}) = \prod_i P(X_i|\pa{X_i}) = \prod_i \frac{P(X_i,\pa{X_i})}{P(\pa{X_i})} \, . $$ Note that in a rooted tree every non-root node $X_i$ ($i=2,\dotsc,n$) has a single parent, and $\pa{X_1}=\emptyset$. Moreover, every pair $(\pa{X_i},X_i)$ corresponds to an arc $X_j \rightarrow X_i$ in the tree. Therefore, $$ \begin{align*} P(\mathcal{V}) &= P(X_1) \prod_{X_i \rightarrow X_j} \frac{P(X_i,X_j)}{P(X_i)} \\ &= P(X_1) \prod_{X_i \rightarrow X_j} \frac{P(X_i,X_j)}{P(X_i)P(X_j)}P(X_j) \\ &= \prod_i P(X_i) \prod_{X_i \rightarrow X_j} \frac{P(X_i,X_j)}{P(X_i)P(X_j)} \, , \end{align*} $$ where in the last step we used the fact that each non-root node has in-degree 1. The result follows by dropping the direction of the arcs.
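The identity can be checked numerically on a small example. The following sketch verifies it on a three-node chain A - B - C with arbitrary positive pairwise potentials (chosen here only for illustration).

```python
from itertools import product

domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
phi = {("A", "B"): {(a, b): 1.0 + 2 * (a == b) for a in (0, 1) for b in (0, 1)},
       ("B", "C"): {(b, c): 1.0 + (b != c) for b in (0, 1) for c in (0, 1)}}

# joint distribution by brute force
joint = {}
for a, b, c in product((0, 1), repeat=3):
    joint[(a, b, c)] = phi[("A", "B")][(a, b)] * phi[("B", "C")][(b, c)]
Z = sum(joint.values())
joint = {k: v / Z for k, v in joint.items()}

# node and edge marginals
pA = {a: sum(v for (x, _, _), v in joint.items() if x == a) for a in (0, 1)}
pB = {b: sum(v for (_, x, _), v in joint.items() if x == b) for b in (0, 1)}
pC = {c: sum(v for (_, _, x), v in joint.items() if x == c) for c in (0, 1)}
pAB = {(a, b): sum(v for (x, y, _), v in joint.items() if (x, y) == (a, b))
       for a in (0, 1) for b in (0, 1)}
pBC = {(b, c): sum(v for (_, y, z), v in joint.items() if (y, z) == (b, c))
       for b in (0, 1) for c in (0, 1)}

# P(A,B,C) = P(A)P(B)P(C) * [P(A,B)/(P(A)P(B))] * [P(B,C)/(P(B)P(C))]
for (a, b, c), p in joint.items():
    rhs = (pA[a] * pB[b] * pC[c] *
           (pAB[(a, b)] / (pA[a] * pB[b])) * (pBC[(b, c)] / (pB[b] * pC[c])))
    assert abs(p - rhs) < 1e-9
print("tree factorization verified")
```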

According to the previous result, any Markov tree is characterized by its edge marginals $P(X,Y)$ (since the node marginals $P(X)=\sum_Y P(X,Y)$ are themselves determined by the edge marginals). Thus, if $P$ and $Q$ are distributions that factorize over the same tree and satisfy $P(X,Y)=Q(X,Y)$ for each edge $X - Y$, then $P(\mathcal{V})=Q(\mathcal{V})$.

Another consequence of the theorem is that any Markov tree can be easily converted into an equivalent tree-shaped Bayesian network. This fact was actually used to prove the correctness of the result. Hence, Markov trees and tree-shaped Bayesian networks are simply different characterizations of otherwise identical models.

We can see the sum-product algorithm in a tree as a way of reparametrizing the model as the product of marginals. We say that the tree is calibrated when

$$ P(X_i) \propto \prod_{X_j \in \nb{X_i}}\psi_{j\rightarrow i}^{(\rho)}(X_i) $$

at every node. Our goal in this lecture is to extend this result and terminology to chordal graphs.

Decomposable graphs

Recall that an undirected graph is chordal if and only if every cycle of length greater than 3 has a chord, that is, an edge connecting two non-consecutive nodes of the cycle. A directed graph is chordal if its moral graph is chordal. An equivalent notion is that of a decomposable graph.

A graph $G$ is decomposable if it is either complete or its nodes can be partitioned into sets $\mathcal{V}=\mathcal{A} \cup \mathcal{B} \cup \mathcal{C}$ such that

  1. $\mathcal{A}$ and $\mathcal{C}$ are non-empty and $\mathcal{B}$ is a clique;

  2. $\mathcal{B}$ separates $\mathcal{A}$ and $\mathcal{C}$;

  3. $\mathcal{A} \cup \mathcal{B}$ and $\mathcal{B} \cup \mathcal{C}$ induce decomposable graphs.

It is easy to see that trees are decomposable: let $X - Y$ be any edge, let $\mathcal{A}$ be the set of nodes other than $X$ that are connected to $X$ by a path that does not contain $Y$, and similarly, let $\mathcal{C}$ be the set of nodes other than $Y$ that are connected to $Y$ by a path that does not contain $X$. Then $\mathcal{A}$, $\mathcal{B}=\{X,Y\}$ and $\mathcal{C}$ form a decomposition of the graph, since $G[\mathcal{A} \cup \{X,Y\}]$ and $G[\mathcal{C} \cup \{X,Y\}]$ are both trees, hence decomposable, and $\{X,Y\}$ is a clique that separates $\mathcal{A}$ and $\mathcal{C}$.

Tree decomposition

Our interest in decomposable graph lies in their connection with the following data structure:

A tree decomposition of an undirected graph $G=(\mathcal{V},\mathcal{E})$ is a tree $T=(\mathcal{V}_T,\mathcal{E}_T)$ such that

  1. $\mathcal{V}_T$ is a set of subsets of $\mathcal{V}$;

  2. For every clique $\mathcal{C}$ of $G$ there is a clique node $\mathcal{C}_j \in \mathcal{V}_T$ such that $\mathcal{C} \subseteq \mathcal{C}_j$;

  3. Every node $\mathcal{C}_k$ in a path between clique nodes $\mathcal{C}_i$ and $\mathcal{C}_j$ in $\mathcal{V}_T$ contains $\mathcal{C}_i \cap \mathcal{C}_j$.

The width of a tree decomposition is the size of the largest node $\mathcal{C} \in \mathcal{V}_T$ minus one.

Property 2 ensures that the scope of each potential is contained in some clique node of a tree decomposition of its domain graph; it also ensures that every maximal clique of $G$ is contained in a node of $T$ (and some nodes might be non-maximal cliques). Property 3 ensures that a variable in a clique node $\mathcal{C}_j$ on any path of $T$ either appears there for the last time (in that path) or also appears in the next clique node. We often label the edges $\mathcal{C}_i - \mathcal{C}_j$ with the respective separator set $\mathcal{C}_i \cap \mathcal{C}_j$ (this term will be made clearer later).
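Property 3 can equivalently be stated as: for every variable, the clique nodes containing it form a connected subtree of $T$. A small sketch that checks this equivalent condition (with illustrative data structures) is given below.

```python
def running_intersection(clique_nodes, tree_edges):
    """Check Property 3: for every variable, the clique nodes containing it
    must induce a connected subtree. Cliques are frozensets; the tree is an
    edge list over those frozensets."""
    variables = set().union(*clique_nodes)
    nb = {c: [] for c in clique_nodes}
    for a, b in tree_edges:
        nb[a].append(b)
        nb[b].append(a)
    for X in variables:
        containing = [c for c in clique_nodes if X in c]
        # traverse only clique nodes containing X, starting from any of them
        seen, frontier = {containing[0]}, [containing[0]]
        while frontier:
            c = frontier.pop()
            for d in nb[c]:
                if X in d and d not in seen:
                    seen.add(d)
                    frontier.append(d)
        if len(seen) != len(containing):
            return False
    return True

# Example: the clique tree {A,B,C} - {B,C,D} - {C,D,E} satisfies the property.
cliques = [frozenset("ABC"), frozenset("BCD"), frozenset("CDE")]
edges = [(cliques[0], cliques[1]), (cliques[1], cliques[2])]
print(running_intersection(cliques, edges))   # True
```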

The figure below, shows a non-chordal undirected graph (left), a chordalization (center) and a corresponding tree decomposition (right).

We will often be interested in a special case of tree decomposition.

A clique tree is a tree-decomposition of $G$ where every node $\mathcal{C}_j$ of $\mathcal{V}_T$ is a clique in $G$.

The following result relates chordal graphs, decomposable graphs, elimination sequences, and clique trees.

The following statements are equivalent.

  1. $G$ is chordal.

  2. $G$ is decomposable.

  3. $G$ admits a perfect elimination ordering.

  4. $G$ has a clique tree.

  5. There is an orientation of the edges of $G$ that induces a directed acyclic graph whose moral graph is $G$.

As a consequence of the previous result, the treewidth of a chordal graph is the minimum width of a tree decomposition of it.

Let $(G,P)$ be a Markov or Bayesian network whose graph $G$ is chordal, and $T=(\mathcal{V}_T,\mathcal{E}_T)$ be a clique tree for $G$. Then

$$ P(\mathcal{V}) = \frac{\prod_{\mathcal{C} \in \mathcal{V}_T} P(\mathcal{C}) }{\prod_{\mathcal{S} \in \mathcal{E}_T} P(\mathcal{S})} \, . $$

Root the clique tree $T$ at a node $\mathcal{R}$ and let $\mathcal{C}_1,\dotsc,\mathcal{C}_m$ be a topological ordering of the cliques. Then, $$ \begin{align*} P(\mathcal{V}) &= \prod_{j=1}^m P(\mathcal{C}_j|\mathcal{C}_1,\dotsc,\mathcal{C}_{j-1}) \\ &= \prod_{j=1}^m P(\mathcal{C}_j | \mathcal{C}_j \cap \pa{\mathcal{C}_j}) \\ &= \frac{\prod_{\mathcal{C} \in \mathcal{V}_T} P(\mathcal{C}) }{\prod_{\mathcal{S} \in \mathcal{E}_T} P(\mathcal{S})} \, , \end{align*} $$ where the second equality follows from the decomposability of the graph using $\mathcal{B}=\pa{\mathcal{C}_j} \cap \mathcal{C}_j$, $\mathcal{A}$ given by the nodes that are connected to $\mathcal{C}_j \setminus \mathcal{B}$ without passing through $\mathcal{B}$, and $\mathcal{C}$ given by the remaining nodes.

Clique-Tree propagation

The clique-tree algorithm requires the construction of a clique tree. A common method is to:

  1. Chordalize graph (e.g. by graphical VE)

  2. Find (maximal) cliques (e.g. as greater neighbors of a variable according to a perfect elimination ordering)

  3. Build clique graph (with edges weighted by the cardinality of their separating sets)

The figure below depicts the construction of a clique tree using the method above from a simple undirected graph. The dotted edges in the left-side graph denote fill-in edges included in the chordalization of the graph. The weights associated with edges in the right-side graph denote the number of variables the cliques have in common (the size of the separator). The thicker edges form a maximum-weight spanning tree of this graph, and a clique tree for the original graph.
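A sketch of the three-step construction (chordalize with a given elimination ordering, collect the cliques, connect them by a maximum-weight spanning tree) is given below; the graph, ordering and data structures are illustrative, not those of the figure.

```python
from itertools import combinations

def build_clique_tree(adj, order):
    """adj: dict node -> set of neighbours; order: elimination ordering."""
    adj = {v: set(ns) for v, ns in adj.items()}
    pos = {v: k for k, v in enumerate(order)}
    cliques = []
    # graphical variable elimination: connect later neighbours (fill-in edges)
    for v in order:
        later = {u for u in adj[v] if pos[u] > pos[v]}
        clique = frozenset(later | {v})
        if not any(clique <= c for c in cliques):   # keep only maximal cliques
            cliques.append(clique)
        for a, b in combinations(later, 2):
            adj[a].add(b)
            adj[b].add(a)
    # maximum-weight spanning tree of the clique graph (Kruskal + union-find),
    # with weights given by separator sizes
    parent = {c: c for c in cliques}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    candidate_edges = sorted(combinations(cliques, 2),
                             key=lambda e: len(e[0] & e[1]), reverse=True)
    tree = []
    for a, b in candidate_edges:
        if find(a) != find(b):
            parent[find(a)] = find(b)
            tree.append((a, b, a & b))     # edge labelled with its separator
    return cliques, tree

# Example: a 4-cycle A-B-C-D-A eliminated in the order A, B, C, D.
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
print(build_clique_tree(adj, ["A", "B", "C", "D"]))
# yields cliques {A,B,D} and {B,C,D} joined by the separator {B,D}
```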

The width of the clique tree produced is the induced width of the elimination ordering used. Finding a minimum-width clique tree is thus equivalent to finding a minimum induced-width elimination ordering (and is therefore NP-hard).

A clique tree is minimal if no clique node is a proper subset of another node (i.e., if every clique node is a maximal clique). Any clique tree can be transformed into minimal form by removing each non-maximal clique node and reconnecting its neighbors to a clique node that contains it. This can also be achieved while building the clique tree with the procedure described (at each step we check whether a clique is contained in some already-built clique).

Sum-Product in Clique Trees

Take a set of potentials $\phi_1,\dotsc,\phi_m$ and a tree decomposition $T=(\mathcal{V}_T,\mathcal{E}_T)$ such that for each $k=1,\dotsc,m$ there is a node $\mathcal{C}_j \in \mathcal{V}_T$ with $\sco{\phi_k} \subseteq \mathcal{C}_j$ (we say that such a tree covers the potentials).

Initialization

Associate every potential $\phi_j$ with a clique $\mathcal{C}_k$ such that $\sco{\phi_j} \subseteq \mathcal{C}_k$. Note that there might be cliques with multiple potentials assigned and cliques with no potential assigned. Call $\psi_j$ the combination of all potentials associated to clique $\mathcal{C}_j$; if there is no potential associated, set $\psi_j=1$.

Propagation

We classify arcs $\mathcal{C}_i \rightarrow \mathcal{C}_j$ into two classes:

  • Empty: when $\mathcal{C_i}$ has not yet sent a message $\psi_{i\rightarrow j}$ to $\mathcal{C}_j$.

  • Full: when $\mathcal{C_i}$ has already sent a message $\psi_{i\rightarrow j}$ to $\mathcal{C}_j$.

The empty arcs $\mathcal{C}_i \rightarrow \mathcal{C}_j$ can be further divided into

  • Ready: When all adjacent arcs $\mathcal{C}_k \rightarrow \mathcal{C}_i$ other than $\mathcal{C}_j \rightarrow \mathcal{C}_i$ are full.

  • Not ready: When some adjacent arc $\mathcal{C}_k \rightarrow \mathcal{C}_i$ with $k \neq j$ is still empty.

Consider an arc $\mathcal{C_i} \rightarrow \mathcal{C_j}$ and let $\mathcal{C}_{j_1},\dotsc,\mathcal{C}_{j_r}$ be the neighbors of $\mathcal{C}_i$ other than $\mathcal{C}_j$. Message passing in tree decompositions consists in recursively selecting a ready arc $\mathcal{C}_i \rightarrow \mathcal{C}_j$ and computing the message

$$ \psi_{i \rightarrow j} = \del{(\mathcal{C}_i \setminus \mathcal{C}_j)}{\left(\psi_i \times \psi_{j_1 \rightarrow i} \times \dotsb \times \psi_{j_r \rightarrow i} \right)} \, . $$

Initially, only the leaves have ready arcs. The algorithm terminates when every arc is full and no more messages can be sent. With an appropriate scheduling of the messages, this algorithm can be seen as first sending messages toward a designated root clique node $\mathcal{C}_r$, and then sending messages away from that node towards the leaves. These two phases are called the upward and downward passes. In practice, the root node is determined implicitly by the message scheduling.
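A compact sketch of this message-passing scheme is given below. It assumes each initial clique potential $\psi_i$ is stored as a full table over its clique (padding with ones where no potential was assigned); the factor representation and helper names are illustrative, not prescribed by the lecture.

```python
from itertools import product
from math import prod

def assignments(vars_, domains):
    """All assignments (as dicts) over the variables in vars_."""
    return [dict(zip(vars_, vals)) for vals in product(*(domains[v] for v in vars_))]

def evaluate(factor, nu):
    vars_, table = factor
    return table[tuple(nu[v] for v in vars_)]

def combine_and_eliminate(factors, keep, domains):
    """Multiply the factors and sum out every variable not in `keep`."""
    keep = tuple(sorted(keep))
    all_vars = sorted(set().union(*(set(f[0]) for f in factors)))
    out = {}
    for nu in assignments(all_vars, domains):
        key = tuple(nu[v] for v in keep)
        out[key] = out.get(key, 0.0) + prod(evaluate(f, nu) for f in factors)
    return (keep, out)

def clique_tree_propagation(cliques, edges, psi, domains):
    """cliques: list of frozensets of variables; edges: list of index pairs;
    psi[i]: initial clique potential as a factor (vars_tuple, table) whose
    scope is all of cliques[i]. Sends messages over ready arcs until full."""
    nb = {i: [] for i in range(len(cliques))}
    for i, j in edges:
        nb[i].append(j)
        nb[j].append(i)
    arcs = [(i, j) for i in nb for j in nb[i]]
    msg = {}                               # msg[(i, j)] set once arc i -> j is full
    while len(msg) < len(arcs):
        for i, j in arcs:                  # pick any ready arc
            ready = (i, j) not in msg and all((k, i) in msg for k in nb[i] if k != j)
            if ready:
                incoming = [msg[(k, i)] for k in nb[i] if k != j]
                msg[(i, j)] = combine_and_eliminate([psi[i]] + incoming,
                                                    cliques[i] & cliques[j], domains)
    return msg, nb
```

Initially only the arcs leaving leaf cliques are ready, and the loop implicitly realizes the upward and downward passes with respect to whichever clique happens to receive all of its incoming messages first.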

Termination

We say that the tree is calibrated when every arc is full. The process of sending messages until every arc is full is called calibration. The clique and separator marginals can be obtained from the messages once the tree is calibrated. Let $\mathcal{C}_{i_1},\dotsc,\mathcal{C}_{i_r}$ be (all) the neighbors of a clique $\mathcal{C}_j$. Then

$$ \begin{align*} P(\mathcal{C}_j) &\propto \psi_j \times \psi_{i_1 \rightarrow j} \times \dotsb \times \psi_{i_r \rightarrow j} \\ P(\mathcal{C}_i \cap \mathcal{C}_j) &\propto \psi_{i \rightarrow j} \times \psi_{j \rightarrow i} \, . \end{align*} $$

Note that the marginal of a variable $P(X)$ can be obtained from any clique or separator containing $X$.
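Reusing the helpers and the (msg, nb) structures returned by the propagation sketch above, the clique and separator marginals could be read off as follows (again, just one possible organization).

```python
def clique_marginal(j, cliques, nb, psi, msg, domains):
    """P(C_j): clique potential times all incoming messages, normalized."""
    factors = [psi[j]] + [msg[(k, j)] for k in nb[j]]
    vars_, table = combine_and_eliminate(factors, cliques[j], domains)
    Z = sum(table.values())
    return vars_, {key: v / Z for key, v in table.items()}

def separator_marginal(i, j, msg):
    """P(C_i ∩ C_j): product of the two messages on the edge, normalized."""
    vars_, ti = msg[(i, j)]
    _, tj = msg[(j, i)]
    unnorm = {key: ti[key] * tj[key] for key in ti}
    Z = sum(unnorm.values())
    return vars_, {key: v / Z for key, v in unnorm.items()}
```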

Soundness

Let $\mathcal{C}_i$ be any node and $j=\pa{i}$. Define $\mathcal{D}_{i \rightarrow j} = \bigl(\bigcup_{k \in \de{i} \cup \{i\}} \mathcal{C}_k\bigr) \setminus \mathcal{C}_j$ to be the set of variables that appear in the subtree rooted at $i$ but not in $\mathcal{C}_j$. At any iteration of the upward pass, the product of all messages sent to inactive nodes and of all potentials assigned to inactive nodes is proportional to the (unnormalized) marginal over the variables that appear in inactive cliques:

Upward Pass Invariant: The messages sent in the upward pass satisfy:

$$ \psi_{i \rightarrow j} = \del{\mathcal{D}_{i \rightarrow j}}{ \left(\prod_{k \in \de{i} \cup \{i\}} \psi_k \right)} \, . $$

Say that a node of the clique tree is inactive if it has not yet sent a message, and active otherwise. Thus, at some iteration, the active nodes are $\de{i}$ and the remaining ones are inactive. Recall that by the running intersection property of tree decompositions (Property 3), any variable in $\mathcal{C}_i - \mathcal{C}_j$ does not appear in any further clique on the path from $\mathcal{C}_j$ to the root. Hence, no potential assigned to an inactive clique has an eliminated variable in its scope. This suffices to show, by an inductive argument on the height of the subtree, that the result holds.

Let $r$ be the root of the tree, and $i_1,\dotsc,i_s$ be its children. Then, $$ P(\mathcal{C}_r) \propto \psi_r \times \psi_{i_1 \rightarrow r} \times \dotsb \times \psi_{i_s \rightarrow r} \, . $$

By the previous theorem we have that $$ \psi_{i_k \rightarrow r} = \del{\mathcal{D}_{i_k \rightarrow r}}{ \left(\prod_{\ell \in \de{i_k} \cup \{i_k\}} \psi_\ell \right)} \, . $$ Note that by the running intersection property, the sets $\mathcal{D}_{i_k \rightarrow r}$ form a partition of $\mathcal{V} - \mathcal{C}_r$. Hence, $$ \begin{align*} \psi_r \times \psi_{i_1 \rightarrow r} \times \dotsb \times \psi_{i_s \rightarrow r} &= \psi_r \times \prod_k \biggl(\del{\mathcal{D}_{i_k \rightarrow r}}{\Bigl(\prod_{\ell \in \de{i_k} \cup \{i_k\}} \psi_\ell\Bigr)}\biggr) \\ &= \del{\biggl(\bigcup_k \mathcal{D}_{i_k \rightarrow r}\biggr)}{\biggl(\psi_r \times \prod_k \prod_{\ell \in \de{i_k} \cup \{i_k\}} \psi_\ell\biggr)} \\ &\propto P(\mathcal{C}_r) \, , \end{align*} $$ where we used the fact that the sets $\mathcal{D}_{i_k \rightarrow r}$ are disjoint and do not intersect $\mathcal{C}_r$ to distribute the eliminations over the combinations, and that the product of all clique potentials is proportional to $P(\mathcal{V})$.

As a consistency check, note that the clique and separator marginals of a calibrated tree recover the original factorization:

$$ \begin{align*} P(\mathcal{V}) &= \frac{\prod_{j \in \mathcal{V}_T} P(\mathcal{C}_j)}{\prod_{i - j \in \mathcal{E}_T} P(\mathcal{C}_i \cap \mathcal{C}_j)} \\ &\propto \frac{\prod_{j \in \mathcal{V}_T} \psi_j \times \psi_{i_1 \rightarrow j} \times \dotsb \times \psi_{i_r \rightarrow j}}{\prod_{i - j \in \mathcal{E}_T} \psi_{i \rightarrow j} \times \psi_{j \rightarrow i}} \\ &= \prod_{j \in \mathcal{V}_T} \left ( \psi_j \times \frac{\prod_{i \in \nb{j}}\psi_{i \rightarrow j}}{\prod_{i \in \nb{j}} \psi_{i \rightarrow j}} \right ) \\ &= \prod_{k=1}^m \phi_k \, , \end{align*} $$ where we abbreviated $\mathcal{C}_j \in \mathcal{V}_T$ as $j \in \mathcal{V}_T$ and $\mathcal{C}_i - \mathcal{C}_j \in \mathcal{E}_T$ as $i - j \in \mathcal{E}_T$.

Note that in a calibrated tree we have that

$$ P(\mathcal{C}_i \cap \mathcal{C}_j) = \del{(\mathcal{C}_i \setminus \mathcal{C}_j)}{P(\mathcal{C}_i)} = \del{(\mathcal{C}_j \setminus \mathcal{C}_i)}{P(\mathcal{C}_j)} \, . $$

An alternative method of passing messages is by fixed-point iteration: start with $P(\mathcal{C}_i) \propto \psi_i$ and $P(\mathcal{C}_i \cap \mathcal{C}_j)=1$ and pass messages using the equation above until it is satisfied. This is equivalent to passing messages with division, and has the same asymptotic properties as the message-passing algorithm discussed.

Complexity

The cost of passing messages upward and downward (i.e., towards and away from a designated root node) is linear in the clique tree size, which is at least exponential in the treewidth of the graph. Hence, clique tree propagation is asymptotically similar to variable elimination.

Relation to Variable Elimination

Message passing in clique-trees is very similar to VE. The clique tree defines a perfect elimination ordering according to which variables are eliminated. Compared to VE, message passing has the following advantages:

  • All marginals are computed with two passes in the tree, saving computations (compared to multiple runs of VE);

  • Multiple variables are eliminated from a single combined potential when sending a message (in VE, this would result in more computations);

  • Evidence can be inserted and retracted easily.

The downsides of clique-tree propagation are:

  • a small overhead for computing a single marginal (compared to VE), and

  • the fact that it does not exploit the additional independences and irrelevances (barren nodes) introduced by inserting evidence.

The latter can be partly fixed by lazy propagation.6

Historical Background

Loopy belief propagation in Bayesian networks was first proposed by Pearl as a direct generalization of his belief propagation algorithm, which performs exact inference in tree-shaped Bayesian networks. Pearl warned that the algorithm was not guaranteed to converge, and that when it does converge it produces only approximate results (of arbitrary quality).7 A different guise of the algorithm was independently developed in the statistical mechanics community and used for image processing tasks due to its relative efficiency and accuracy (compared to sampling) in models with thousands of variables.8 Later, it was realized that turbo decoding, one of the most efficient error-correcting schemes used in communication theory, could be interpreted as the sum-product algorithm in a particular Markov network.9 This revived the theoretical interest in investigating the properties of the sum-product algorithm. Yedidia, Freeman and Weiss 10 11 showed that the algorithm can be cast as an optimization problem, where the objective is to find the tree-shaped network that minimizes the KL-divergence with respect to the model distribution, constrained so that the node and edge marginals of the two models agree. This led to a deeper understanding of the convergence of the algorithm, to a connection between the algorithm and Bethe free energy minimization, an old and well-studied problem in statistical mechanics, and to the development of convergent variants such as the tree-reweighted belief propagation algorithm.12

Clique-tree algorithms first appeared as methods to compute marginal probabilities in Bayesian networks 13 14 (although many ideas had been developed in database theory earlier).15 Motivated by the fact that computations in tree-shaped graphical models are efficient and straightforward, clique-tree algorithms re-cast any graphical model as a tree. The efficiency of the computations in a clique tree is a function of the resemblance of the original graph to a tree as measured by its treewidth.


  1. J.H.P. Kwisthout and A.H.L. Bodlaender and L.C. van der Gaag, The Necessity of Bounded Treewidth for Efficient Inference in Bayesian Networks, Proceedings of the 19th European Conference on Artificial Intelligence, 2010. ↩︎

  2. Y. Weiss, Correctness of local probability propagation in graphical models with loops, Neural Computation 12:1, 2000. ↩︎

  3. J.M. Mooij and H.J. Kappen, Sufficient Conditions for Convergence of the Sum-Product Algorithm, IEEE Transactions on Information Theory 53:12, 2007. ↩︎

  4. T. Heskes, Stable Fixed Points of Loopy Belief Propagation Are Minima of the Bethe Free Energy, Advances in Neural Information Processing Systems 15, 2002. ↩︎

  5. T. Heskes, On the uniqueness of loopy belief propagation fixed points, Neural Computation 16:11, 2004. ↩︎

  6. A.L. Madsen and F.V. Jensen, Lazy propagation: A junction tree inference algorithm based on lazy evaluation, Artificial Intelligence 113:1-2, 1999. ↩︎

  7. Judea Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann, 1988. ↩︎

  8. M. Coughlan and S.J. Ferreira, Finding deformable shapes using loopy belief propagation, Proceedings of the European Conference on Computer Vision 7, 2002. ↩︎

  9. R. McEliece and D. MacKay and J. Cheng, Turbo decoding as an instance of Pearl’s ‘belief propagation’ algorithm, IEEE Journal on Selected Areas in Comm. 16:2, 1998. ↩︎

  10. J.S. Yedidia and W.T. Freeman and Y. Weiss, Generalized belief propagation, Advances in Neural Information Processing Systems 13, 2001. ↩︎

  11. J.S. Yedidia and W.T. Freeman and Y. Weiss, Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms, IEEE Transactions on Information Theory 51:7, 2005. ↩︎

  12. M.J. Wainwright and T.S. Jaakkola and A.S. Willsky, Tree–based reparameterization analysis of sum–product and its generalizations, IEEE Transactions on Information Theory 49:5, 2003. ↩︎

  13. S.L. Lauritzen and D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, Journal of the Royal Statistical Society (Series B), 1988. ↩︎

  14. P. Shenoy and G. Shafer, Axioms for probability and belief-function propagation, Proceedings of the 4th Conference on Uncertainty in Artificial Intelligence, 1988. ↩︎

  15. C. Beeri and R. Fagin and D. Maier and M. Yannakakis, On the desirability of acyclic database schemes, Journal of the Association for Computing Machinery 30:3, 1983. ↩︎
