# Markov Networks

## Table of Contents

\( \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\indeg}[1]{\text{indeg}(#1)} \newcommand{\outdeg}[1]{\text{outdeg}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\!\perp_d\!} \newcommand{\msep}{\!\perp_m\!} \newcommand{\sep}{\!\perp_u\!} \)

We examine the representation of independence by undirected graphs, resulting in the concept of Markov networks. We also compare Bayesian networks and Markov networks as representational devices for independences.

## 1 A few more concepts from graph theory

Recall that an undirected graph is a pair \(G=(V,E)\) where \(E\)
is a set of *unordered pairs* of vertices. We denote an
(undirected) edge as \(X-Y \in E\). A **loop** (or undirected
cycle) is a trail that starts and ends with the same node. An
undirected graph with no loops is a tree (even if it is not
connected), otherwise it is loopy. A **triangle** is a 3-node
loop. Here is an important class of graphs:

A graph is **chordal** if every loop has
a subsequence which is a triangle (i.e., 3-node loop).

Figure 1 shows a chordal graph in the left (any permutation of the loop \(A,B,C,D\) has loops \(A,B,C\) or \(B,C,D\)) and non-chordal graphs in the center and right (e.g., in the graph in the center, the loop \(A,B,C,D\) has no subsequence which is also a loop). Note that the rightmost graph is non-chordal even though is made up of triangles. The outer loop \(A,B,C,D,E,F\) contains no triangles.

Figure 1: A chordal graph (left) and two non-chordal graphs (center and right).

A **clique** is a set of nodes which are
pairwise connected. A clique is **maximal** if it
is not a proper subset of any other clique.

The maximal cliques of the left graph in Figure 1 are \(\{A,B,C\}\) and \(\{B,C,D\}\) and the non maximal cliques are the edges, the singletons containing each node and the empty set. The maximal cliques of the graph in the center are its edges. The maximal cliques of the graph on the right are the triangles, all containing node \(G\).

## 2 Markov networks

We have seen that even though independence is a symmetric
concept, it can be well captured by an acyclic directed graph
through the semantics of d-separation. Moreover, we saw that
d-separation is equivalent to m-separation, thus connecting
directed and undirected representations. The m-separation
representation, obtained through the relevant moral graph, is
*dynamic* in that it depends on a graph constructed
specifically for the desired independence (or separation)
query. In this section we will look at a undirected
representation of independence that is *static*, that is, it
is represented by a fixed graph.

### 2.1 Separation

Independences in undirected graphs are read as standard separation (i.e., non-connectivity) in graphs:

The sets of variables \(\mathcal{X}\) and \(\mathcal{Y}\) are
**u-separated** by \(\mathcal{Z}\), written
\(\mathcal{X} \sep \mathcal{Y} \mid \mathcal{Z}\) if \(\mathcal{X}\) and \(\mathcal{Y}\) are
separated in \(G-\mathcal{Z}\).

Hence, \(m\)-separation is simply \(u\)-separation in the corresponding relevant moral graph. Note that conditioning can only separate sets, but never connect them (this is the static character of u-separation), in stark contrast to d-separation.

Consider the graph in the center in Figure 1; it encodes the u-separations \[ A \sep D \mid B,C \text{ and } B \sep D \mid A,C \, . \]

### 2.2 Definition

A **Markov network** is a pair \((G,p)\), where \(G\) is an
undirected graph over variables \(\mathcal{V}\) and \(p(\mathcal{V})\) is a joint
distribution for \(\mathcal{V}\) such that

for any subsets \(\mathcal{X},\mathcal{Y},\mathcal{Z} \subseteq \mathcal{V}\).

Markov networks are also called **Markov Random Fields**,
following their usage in Physics.

### 2.3 The Hammersley-Clifford Theorem

Our goal is to connect independences and factorizations, as we have done for directed graphs. This leads to the concept of a potential.

A **potential** is a nonnegative
real-valued function over the joint domain of some set of variables.

A potential is also often called a **factor** and less often a
**valuation**. Note that a (conditional) joint distribution is a
potential (and the converse is not generally true). The
parametrization of a Markov network in terms of potentials
follows from the following result.

*Hammersley-Clifford Theorem:*
Let \(\mathcal{C}_1,\dotsc,\mathcal{C}_m\) be the maximal cliques of a Markov
network structure \(G\). If \(p(\mathcal{V})>0\) then the global Markov
property implies that
\[
p(\mathcal{V}) = \frac{1}{Z}\prod_{j=1}^m \phi_j(\mathcal{C}_j) \, , \quad \text{where} \quad Z = \sum_\mathcal{V} \prod_{j=1}^m \phi_j(\mathcal{C}_j)
\]
is a normalizing term called the **partition function**.

The potentials \(\phi_j\) are called **clique potentials**.

### 2.4 Factorization

It is possible to construct a non-positive distribution that satisfies all the global Markov properties in the graph and yet does not factorize over the cliques. The converse result, however, holds even for non-positive distributions.

If \(p(\mathcal{V})\) is a joint distribution that factorizes as \[ p(\mathcal{V}) = \frac{1}{Z}\prod_{j=1}^m \phi_j(\mathcal{C}_j) \, , \] where \(\mathcal{C}_j\) are the maximal cliques of a graph \(G\), then \((G,p)\) satisfies all the global Markov properties.

The result above uses the following fact about independence:

The sets of variables \(\mathcal{X}\) and \(\mathcal{Y}\) are conditionally independent given the set of variables \(\mathcal{Z}\) if and only if there are potentials \(\phi_1(\mathcal{X},\mathcal{Z})\) and \(\phi_2(\mathcal{Y},\mathcal{Z})\) such that \[ p(\mathcal{X},\mathcal{Y}) = \phi_1(\mathcal{X},\mathcal{Z}) \phi_2(\mathcal{Y},\mathcal{Z}) \, . \]

### 2.5 Markov properties

It is not difficult to show that Global Markov properties imply

\begin{align*} X \sep \mathcal{V} - \nb{X} \mid \nb{X} \text{ only if } X \ind \mathcal{V} - \nb{X} \mid \nb{X} \,,&&\text{[local Markov property]} \end{align*}which in turn implies that

\begin{align*} X-Y \not\in E \text{ only if } X \ind Y \mid \mathcal{V} - \{X,Y\} \,.&&\text{[pairwise Markov property]} \end{align*}If \(p(\mathcal{V})>0\) the following tree assertions are equivalent.

- \(p\) satisfies the pairwise Markov property.
- \(p\) satisfies the local Markov property.
- \(p\) satisfied the global Markov property.

Note that unlike Bayesian networks, we defined Markov networks satisfying the global Markov properties from which we deduced the local Markov properties and the pairwise Markov properties. According to the result above, for positive distributions this choice is arbitrary: we could have defined Markov networks as satisfying the local or pairwise Markov properties, as we did with Bayesian networks. However, there are non-negative distributions which satisfy the pairwise Markov properties and not the local Markov properties, and distributions which satisfy the local Markov properties and not the global ones.

When the distribution is positive, it is also common to
parametrize it using a distribution from the **exponential
family**:

where \(\psi_c\) are clique potentials. A distribution
satisfying the equation above is often called a **Gibbs
distribution**.

### 2.6 Pairwise Markov networks

Here is a common class of Markov networks with quadratic number of parameters:

A **pairwise Markov network** is a Markov network whose maximal
cliques are the edges of the graph. It is commonly parametrized as
\[
p(\mathcal{V}) = \exp\left( -\sum_{X \in V} \psi(X) -\sum_{X-Y} \psi(X,Y) - \ln(Z) \right ) \, ,
\]
where \(\psi(X)\) and \(\psi(X,Y)\) are called the node and edge potentials, respectively.

Pairwise Markov networks are popular models for image processing tasks. For example, in image segmentation one is interested in classifying each pixel in an image according to pre-determined set of labels (e.g., foreground, background). Every pixel is classified according to local characteristics (pixel intensity, color, etc), and a geometric smoothing is enforced (neighboring pixels should be classified alike). This model can be represented as a pairwise Markov where every pixel is a node and two nodes are adjacent if they correspond to adjacent pixels. The graph in Figure 2 is a pairwise (possibly) representing a 3-by-4 pixels image. The node potentials encode the propensity of each pixel to belong to some specific class, and the edge potentials encode the smoothness (e.g., that neighboring nodes need to be classified equally).

Figure 2: A pairwise Markov network.

### 2.7 Reduced Markov networks

A **restriction** of a potential
\(\phi(\mathcal{X},\mathcal{Y})\) with respect to an
assignment \(\mathcal{Y}=y\) is the potential
\(\phi_{\mathcal{Y}=y}(\mathcal{X})\) such that \(\phi(x,y) =
\phi_{\mathcal{Y}=y}(x)\) for every configuration \(x\).

When convenient we write \(\phi(\mathcal{X},\mathcal{Y}=y)\) to denote the restriction \(\phi_{\mathcal{Y}=y}(\mathcal{X})\).

The restriction of the potential

\(X\) | \(Y\) | \(\phi(X,Y)\) |
---|---|---|

0 | 0 | 0.1 |

0 | 1 | 100 |

1 | 0 | 5.2 |

1 | 1 | 0 |

with respect to \(Y=1\) is

\(X\) | \(\phi(X,Y=1)\) |
---|---|

0 | 100 |

1 | 0 |

Markov networks are closed with respect to restrictions:

If \((G,p)\) is a Markov network over \(\mathcal{V}\) such that
\(p\) factorizes as \(\prod_j \phi_j(\mathcal{C}_j)/Z\),
\(\mathcal{Y} \subseteq \mathcal{V}\) and \(y\) is a
configuration for \(\mathcal{Y}\) then \((G-\mathcal{Y},q)\) is
a Markov network over \(\mathcal{V}-\mathcal{Y}\) and \(q\)
factorizes as \(\prod_j \phi_j(\mathcal{C}_j-\mathcal{Y},
\mathcal{Y}=y)/Z(y)\). Moreover,
\(q(\mathcal{V}-\mathcal{Y})=p(\mathcal{V}|\mathcal{Y}=y)\).
This network is called a **reduced Markov network**.

The notation \(Z(y)\) emphasizes that the partition function of the reduced network depends on the restriction. Hence, computing conditional probabilities in Markov networks is the same as restricting every potential with respect to the observation and then computing a marginal probability (which requires computing the partition function). Note that the maximal cliques in the reduced network may be smaller than in the original network, so that more than a potential must be combined to form the clique potentials of the reduced network.

Consider a Markov network with the graph in the left in Figure 1, and its restriction by \(B=1,C=0\). The corresponding reduced network is an empty graph over nodes \(A\) and \(D\), and the clique potentials are \(\phi_1(A)=\phi(A,B=1)\phi(A,C=0)\) and \(\phi_2(D)=\phi(B=1,D)\phi(C=0,D)\). Note that the potential \(\phi(B,C)\) is discarded (since it is a constant in the reduced model). Alternatively, \(\phi(B=1,C=0)\) could be incorporated into any clique potential (or all of them). The reduced network distribution is \(p(A,D|B=1,C=0)\).

## 3 Chordal graphs

Since any Bayesian network factorizes (even the ones with zero probability values), a Bayesian network \((G,p)\) specifies a Markov network \((M[G],p)\) whose partition function is \(Z=1\) (however the set of independences encoded by both representations differ). Hence, from the parametrization point-of-view, Bayesian networks are subclass of Markov networks with the special property that the partition function is one. However, Bayesian networks can represent dynamic independencies (i.e., independencies created by conditioning), whereas Markov networks can represent only static dependencies. So, for example, no Markov network can represent the independencies in a convergent connection. Conversely, there are independence relations that can be represented by a Markov network but not by any Bayesian network. This is the case, e.g., of the network in the center in Figure 1. Any attempt of representing that relation as a directed graph will either result in creating a convergent connection which misses some independence and introduces others, or lack a direct influence (again, misrepresenting independences). Thus, from the representational point-of-view, Bayesian networks and Markov networks are incomparable.

What types of independence relations can be represented either by a directed or an undirected graph? The answer follows from our translation from Bayesian networks to Markov networks. Recall that the equivalence class of a Bayesian network is characterized by the skeleton and the set of immoralities. A Bayesian network without immoralities is thus represented as an undirected graph, in which case d-separation (with any consistent orientation of the arcs) and u-separation coincide. So the class of Bayesian networks such that \(G=M[G]\) is contained in the class of Markov networks. What about the converse? Any triangle-free loop contains independences that cannot be represented by a directed graph (consider the graph in the center in Figure 1). On the other hand, any triangle

can be represented either as the directed structure

or by any homomorphic structure. Combining these arguments,
one can show that the class of Markov networks that can be
represented as Bayesian networks is exactly the class of
**chordal graphs**; any non-chordal graph cannot be represented
(in terms of independences) by a Bayesian network.

Let \(G\) be an undirected graph. The following statements are equivalent:

- G is chordal.
- There is an acyclic directed graph \(D\) such that \(M[D]=G\).
- There is an acyclic directed graph \(D\) such that \(\mathcal{X} \dsep \mathcal{Y} \mid \mathcal{Z}\) if and only if \(\mathcal{X} \sep \mathcal{Y} \mid \mathcal{Z}\).

The equivalences between Bayesian and Markov networks are represented in Figure 5. As is noted in the diagram, trees are particular cases of chordal graphs.

Figure 5: Representational power of different formalisms.

## 4 Clique Tree Reparametrization

We will see later that chordal graphs, the intersection of Markov networks and Bayesian networks are an important family of probabilistic graphical models, one in which inference complexity can be graphically characterized. Chordal graphs have the important property that their cliques can be arranged in the form of a tree with certain properties, and parametrized as \[ p(\mathcal{V}) = \frac{\prod_{j=1}^m p(\mathcal{C}_j)}{\prod_{j=1}^{m-1} p(\mathcal{S}_j)} \, , \] where \(\mathcal{C}_j\) are the cliques of the graph and \(\mathcal{S}_j\) are the intersection of adjacent cliques known as separation sets.

## 5 Representation power

Since the restriction of potentials of a Markov network is also a Markov network, and a Bayesian network parametrizes a Markov network, we can compute conditional probabilities in Bayesian networks by inference in the corresponding restricted Markov network. Let \(\mathcal{Y}=y\) be an evidence (i.e., an observation), and \(\mathcal{X}\) be some target variables. To compute \(p(\mathcal{X}|\mathcal{Y}=y)\) in a Bayesian network with graph \(G\), obtain the potentials \(\phi_i = p_{\mathcal{Y}=y}(X_i|\pa{X_i})\) and compute the marginal \(q(\mathcal{X})\) in the Markov network defined by the factorization \(q(\mathcal{V}-\mathcal{Y})=Z^{-1}\prod_i \phi_i\). The partition function of this network encodes \(Z=p(\mathcal{Y}=y)\), that is, the probability of the evidence. To make things computationally more efficient, we can restrict the Markov network to non-barren nodes wrt \(\mathcal{X}\) and \(\mathcal{Y}\), which are exactly the ancestors of \(\mathcal{X}\) and \(\mathcal{Y}\) in \(G\).

## 6 Exercises

- Prove the following weaker version of the factorization
theorem of Markov networks: If \(p(\mathcal{V})\) is a joint distribution
that factorizes as
\[
p(\mathcal{V}) = \frac{1}{Z}\prod_{j=1}^m \phi_j(\mathcal{C}_j) \, ,
\]
where \(\mathcal{C}_j\) are the maximal cliques of a graph \(G\), then \((G,p)\)
satisfies all the
*local Markov properties*. Hint: use the Proposition that connects independence and factorization. - Prove that tree-shaped Bayesian networks and tree-shaped Markov networks encode the same set of independences.
- Write code that receives a Markov network in UAI file format and computes the partition function. Verify that if the potentials encode a Bayesian network then the partition function is 1.