# Inference By Sampling


\( \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\indeg}[1]{\text{indeg}(#1)} \newcommand{\outdeg}[1]{\text{outdeg}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\!\perp_d\!} \newcommand{\msep}{\!\perp_m\!} \newcommand{\sep}{\!\perp_u\!} \)

Here and in the next few lectures we study practical algorithms for the
**inference problem**, which comes in many forms. To make this
discussion concise we will focus on the computation of the probability
of evidence \(p(E_1=e_1,\dotsc,E_m=e_m)\) where \(E_1,\dotsc,E_m\) are the
evidence variables and the (partial valuation) \(e_1,\dotsc,e_m\) is the
**evidence**. This is also known as the **partition function**, as it
is equivalent to the computation of the partition function in a
(reduced) Markov network. A related problem is **belief updating**
or **posterior probability** which consists in computing
\(p(X_q=x_q|E_1=e_1,\dotsc,E_m=e_m)\). In principle, we can solve this
problem by calling an algorithm that solves the partition problem twice
so as to obtain \(p(X_q=x_q,E_1=e_1,\dotsc,E_m=e_m)\) and
\(p(E_1=e_1,\dotsc,E_m=e_m)\). While this is actually the approach taken
by some algorithms, some techniques employ more efficient ways to obtain
the posterior probability directly (and this is particularly true for
the algorithms we study here). Nevertheless, the adaptations for
producing posterior probabilities are often straightforward and do not
affect the generality of the results we will see later.

## 1 Sampling Algorithms

Computing the probability of evidence by enumeration is
impractical for all but the smallest models, so that more
sophisticated algorithms are needed. In this lecture we
analyze **sampling algorithms**, that is, algorithms that
operate by generating data sets from the probabilistic models
and then using these data sets to generate estimates of the
desired probabilities. This family of algorithms is also known
as **search-based inference algorithms**, as they perform a
search over the space of (partial) valuations of variables,
and **simulation algorithms**, as they simulate the phenomenon
being modeled. We focus on Monte Carlo approaches which
generate samples stochastically (i.e., different runs of the
algorithms produce different solutions). The algorithms we
examine are **approximate**: their solution is only an
approximation of the correct value.

## 2 Evaluation of Approximate Algorithms

Two criteria are generally used to evaluate approximate algorithms: accuracy and efficiency. The former is a measure of the quality of the solutions produced, and the latter is a measure of the amount of computational resources needed. There is often a trade-off between these two quantities, such that increasing one leads to a decrease in the other.

### 2.1 Accuracy

The most common measures of accuracy are absolute error and relative error. Let \(p\) be a (marginal) probability we would like to determine and \(q\) the output of an approximate algorithm. They are defined as

\begin{align*} \epsilon_\text{abs} &= | p-q | && \text{[absolute error]}\\ \epsilon_\text{rel} &= \frac{| p-q |}{p} && \text{[relative error]} \end{align*}

An estimate is an absolute (resp., relative) \(\epsilon\)-approximation if its absolute (resp., relative) error is at most \(\epsilon\). Since the algorithms we will see are stochastic, the accuracy of the algorithm is a random variable. We thus discuss the randomized accuracy, which is the probability \(1-\delta\) of the algorithm producing an \(\epsilon\)-approximation.

An absolute approximation is useful when the desired number is relatively large and we need only finite precision. For example, if \(p(e) > 1/2\), then an absolute error is usually appropriate. It is less appropriate when \(p(e)\) is small, which is usually the case when \(m\) is large. Any relative \(\epsilon\)-approximation is also an absolute \(\epsilon\)-approximation. Moreover, relative \(\epsilon\)-approximations to \(p(E_1=e_1,\dotsc,E_m=e_m)\) and \(p(X_q=x_q,E_1=e_1,\dotsc,E_m=e_m)\) can be combined to produce a relative \(\sqrt{\epsilon}\)-approximation to \(p(X_q=x_q|E_1=e_1,\dotsc,E_m=e_m)\), but this is not true for absolute approximations. These arguments suggest that relative error is preferred as an accuracy criterion. As we will see, satisfying a relative error bound is often more demanding than satisfying an absolute one.

### 2.2 Efficiency

Efficiency is typically measured in terms of the running time and memory consumption of the algorithm. The algorithms we will study consume very little memory, and we will often focus on the running time, measured as a function of the number of elementary operations performed (arithmetic operations, list insertion, etc).

## 3 Monte Carlo Estimation

Figure 1: Monte Carlo approximation of the area of the blue region (the intersection of the circle and the quadrilateral).

Suppose we are interested in determining the proportion \(p(S)\) of the blue area \(S\) (i.e., the intersection of the circle and the quadrilateral) with respect to the circle in Figure 1, and that we have an efficient way of telling whether any given point is in the area (e.g., by verifying that it is both inside the circle and inside the quadrilateral). A popular approximate technique, known as Monte Carlo estimation, is to generate points \(x_1,\dotsc,x_M\) uniformly within the circle and then estimate the desired quantity as the fraction of points that fell inside the target area. Formally, let \(B_i\) be a Boolean random variable such that \(B_i=1\) if \(x_i \in S\) and \(B_i=0\) otherwise. The estimate is thus given by \[ q(S) = \frac{1}{M} \sum_{i=1}^M B_i \, . \] When \(M \rightarrow \infty\), \(q(S)\) converges in probability to \(p(S)\).
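The procedure can be sketched as follows; this is an illustrative implementation (the example region, a half-circle, is a hypothetical stand-in for the blue area). Points are drawn uniformly in the bounding square and rejected if they fall outside the circle, which yields a uniform distribution within the circle.

```python
import random

def monte_carlo_area_fraction(in_region, n_samples=100_000, seed=0):
    """Estimate the fraction of the unit circle covered by `in_region`
    by sampling points uniformly inside the circle."""
    rng = random.Random(seed)
    hits = 0
    drawn = 0
    while drawn < n_samples:
        # Sample uniformly in the bounding square; keep only points in the circle.
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y > 1:
            continue
        drawn += 1
        if in_region(x, y):
            hits += 1
    return hits / n_samples

# Example region: the right half of the circle, whose true fraction is 1/2.
q = monte_carlo_area_fraction(lambda x, y: x > 0)
```

The same scheme works for any region for which membership can be tested efficiently.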

### 3.1 Accuracy

The accuracy of Monte Carlo estimation depends on the number of samples generated, on the distribution from where points are drawn, and on the desired quantity. For instance, for a fixed number of sample points, a better estimator is achieved if \(p(S)\) is larger than if \(p(S)\) is small. This can be formalized using the following probability inequalities.

**Hoeffding Inequality** Let \(B_1,\dotsc,B_M\) be Bernoulli
trials with success probability \(p\), \(q=\sum_i B_i/M\)
and \(\epsilon>0\). Then
\[
P( |p-q| \geq \epsilon) \leq 2 \exp( -2 M \epsilon^2 ) \, .
\]

The Hoeffding inequality shows that the absolute error (additive error) of the estimator \(q\) converges in probability exponentially fast wrt the number of trials.

Thus, in order to guarantee an absolute \(\epsilon\)-approximation with probability at least \(1-\delta\) we need to set

\[ M \geq \ln(2/\delta)/(2\epsilon^2) \, . \]

That bound is independent of the desired value, logarithmic in the inverse confidence \(1/\delta\) and polynomial in the inverse error \(1/\epsilon\).
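A direct transcription of this bound into a small helper makes the sample requirements concrete:

```python
import math

def hoeffding_samples(eps, delta):
    """Number of samples M guaranteeing |p - q| <= eps with probability
    at least 1 - delta, via M >= ln(2/delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# e.g. an absolute error of 0.01 with 95% confidence:
M = hoeffding_samples(0.01, 0.05)  # 18445 samples
```

Note the quadratic growth in \(1/\epsilon\): halving the error quadruples the number of samples.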

The relative error performance can be analyzed with the help of the following result.

**Chernoff Inequality** Let \(B_1,\dotsc,B_M\) be Bernoulli trials
with success probability \(p\), \(q=\sum_i B_i/M\) and
\(\epsilon>0\). Then
\[
P( q \geq p(1 + \epsilon)) < \exp( -M p \epsilon^2/3 )
\]
and
\[
P( q \leq p(1- \epsilon)) < \exp ( -M p\epsilon^2/2 ) \, .
\]

The Chernoff inequality shows that the relative error (multiplicative error) of an estimator \(q\) converges in probability exponentially fast with respect to the number of trials. This bound, however, depends on the true probability \(p\): the exponent scales with \(p\), so for \(p\) close to 0 the bound approaches 1.

Using Chernoff inequality, we can bound the number of samples required to get a relative \(\epsilon\)-approximation estimate with probability at least \(1-\delta\) by

\[ M \geq 3 \ln(2/\delta)/(p \epsilon^2) \, . \]

The upper bound depends on the unknown target probability \(p\). Hence to be used in practice we must assume some lower bound such as \(p \geq 1/2\), which is not always possible, especially when the number of evidence variables is large.
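A direct transcription of the bound, taking the assumed lower bound on \(p\) as a parameter, illustrates how the requirement blows up as the assumed lower bound shrinks:

```python
import math

def chernoff_samples(eps, delta, p_lower):
    """Samples needed for a relative eps-approximation with probability
    at least 1 - delta, assuming the unknown target probability is at
    least p_lower: M >= 3 ln(2/delta) / (p_lower * eps^2)."""
    return math.ceil(3 * math.log(2 / delta) / (p_lower * eps ** 2))

# The requirement grows as the assumed lower bound shrinks:
M = chernoff_samples(0.1, 0.05, 0.01)  # 50x larger than with p_lower = 0.5
```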

### 3.2 Efficiency

The efficiency of Monte Carlo estimation depends on the number of sample points generated, the time taken to generate a sample point and the time to reject or accept a sample point. We usually assume that sampling and verifying a sample point can be done efficiently (e.g., linear time), so that the efficiency is a function of the number of sample points used.

## 4 Logical Sampling

Logical sampling was introduced by Henrion [Hen1986] and is
arguably the simplest non-enumerative algorithm for inference
in Bayesian networks. It relies on the assumption that
sampling a complete instantiation from the joint distribution
can be performed efficiently. This is true in discrete
Bayesian networks and in many directed models whose local
distributions are standard (e.g., Gaussian distributions).
Let \(X_1,\dotsc,X_n\)
be a topological ordering of the variables, and consider the
evidence \(E_1=e_1,\dotsc,E_m=e_m\). We say that two (partial)
configurations are **consistent** if they agree on the value of
every variable they both fix. A high-level description of
logical sampling is shown below.

- Initialize \(N \gets 0\)
- Repeat \(M\) times:
- For \(i=1\) to \(i=n\) generate \(x_i \sim p(X_i|X_1=x_1,\dotsc,X_{i-1}=x_{i-1})\)
- If \(x_1,\dotsc,x_n\) is consistent with \(e_1,\dotsc,e_m\) increase \(N\)

- Return \(N/M\)

Note that since the variables are ordered topologically, the distribution \(p(X_i|x_1,\dotsc,x_{i-1})\) equals \(p(X_i|\pa{X_i})\) with the parents fixed at the values given by \(x_1,\dotsc,x_{i-1}\). The counter \(N\) counts the number of valuations consistent with the evidence, while \(M\) is the total number of valuations generated. We thus have that \[ \lim_{M \rightarrow \infty} N/M = p(E_1=e_1,\dotsc,E_m=e_m) \, . \]
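The steps above can be sketched in Python on a hypothetical two-variable network \(A \rightarrow B\); the CPTs below are illustrative assumptions, not taken from the text.

```python
import random

def logical_sampling(M, seed=0):
    """Estimate p(B=1) in the toy network A -> B by logical sampling.
    Assumed CPTs: p(A=1)=0.3, p(B=1|A=1)=0.8, p(B=1|A=0)=0.1."""
    rng = random.Random(seed)
    N = 0
    for _ in range(M):
        a = 1 if rng.random() < 0.3 else 0   # sample root A from its prior
        pb = 0.8 if a == 1 else 0.1          # p(B=1 | A=a)
        b = 1 if rng.random() < pb else 0    # sample B given A
        if b == 1:                           # consistent with evidence B=1?
            N += 1
    return N / M

# True value: p(B=1) = 0.3*0.8 + 0.7*0.1 = 0.31
estimate = logical_sampling(200_000)
```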

The network below shows an example of logical sampling a complete instantiation from a simple Bayesian network.

### 4.1 Accuracy

The absolute and relative errors for finite samples can be derived using the Hoeffding and Chernoff inequalities, respectively. The absolute error is

\[ P(|N/M - p(e_1,\dotsc,e_m)| \geq \epsilon) \leq 2\exp(-2M\epsilon^2) \]

Hence, to achieve randomized \(\epsilon\)-accuracy with \(\delta\) confidence, we need: \[ M \geq \frac{\ln(2/\delta)}{2\epsilon^2} \] samples.

The relative error is given by \[ P\left(\frac{|N/M - p(e_1,\dotsc,e_m)|}{p(e_1,\dotsc,e_m)} \geq \epsilon\right) \leq 2\exp(-Mp\epsilon^2/3) \]

Thus, to achieve a randomized relative \(\epsilon\)-approximation with confidence \(1-\delta\), we need \[ M \geq 3\frac{\ln(2/\delta)}{p\epsilon^2} \] samples.

### 4.2 Efficiency

What about the efficiency of the algorithm? An inspection of
the code shows that the algorithm takes \(O(M n r \log(d) )\)
time, where \(r\) is the maximum in-degree of a variable, and
\(d\) is the maximum cardinality of a variable. Hence, the
algorithm is polynomial as long as the number of simulations
is polynomially bounded. As we have seen, a polynomial bound
on the number of samples is insufficient to *guarantee*
an accurate estimate when the desired output is small.

The algorithm can be straightforwardly modified to produce several estimates of probabilities from the same set of generated valuations by adding more counters.

## 5 Importance Sampling

Logical sampling is also known as **rejection sampling**, as it
consists in generating samples from the joint distribution
\(p(\mathcal{V})\) and *rejecting* them if they are not consistent
with the evidence. This leads to inefficiency and loss of
accuracy, as samples are rejected with probability
\(1-p(e_1,\dotsc,e_m)\), which is usually high. Ideally, we
would instead like to sample directly from the restricted
distribution
\(p(\mathcal{V}-\{E_1,\dotsc,E_m\},e_1,\dotsc,e_m)\) of
configurations consistent with the evidence; this is in many
cases difficult or inefficient. Importance sampling describes
a class of methods that overcomes such issues by making use
of a **proposal distribution** which approximates the restricted
distribution while allowing for efficient sampling.

Importance sampling proceeds as follows. Let
\(\mathcal{X}=\mathcal{V}-\{E_1,\dotsc,E_m\}\). Generate
(partial) instantiations \(\nu_1,\dotsc,\nu_M\) from a **proposal
distribution** \(q(\mathcal{X})\) and estimate the probability
\(p(e_1,\dotsc,e_m)\) as \[ \frac{1}{M} \sum_{i=1}^M W_i \,,
\quad \text{where } W_i = p(\nu_i,e_1,\dotsc,e_m)/q(\nu_i) \, . \] The
proposal distribution is assumed to satisfy \(q(\nu)>0\) whenever
\(p(\nu,e_1,\dotsc,e_m)>0\). This ensures that every configuration with
positive probability is eventually generated. When this is
true, the asymptotic value of the estimate is \[ \lim_{M
\rightarrow \infty} \frac{1}{M} \sum_i W_i = \sum_{\nu}
\frac{p(\nu,e_1,\dotsc,e_m)}{q(\nu)}q(\nu) = p(e_1,\dotsc,e_m)
\, . \] The finite-sample behavior depends on the variability of the
weights: the closer the proposal is to the restricted distribution,
the smaller the variance of the estimate.
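As a minimal sketch, consider a hypothetical two-variable network \(A \rightarrow B\) with assumed CPTs \(p(A=1)=0.3\), \(p(B=1|A=1)=0.8\), \(p(B=1|A=0)=0.1\), evidence \(B=1\), and a uniform proposal over the non-evidence variable \(A\):

```python
import random

def importance_sampling(M, seed=0):
    """Importance-sampling estimate of p(B=1) in the toy network A -> B
    using a uniform proposal q(A) = 1/2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(M):
        a = 1 if rng.random() < 0.5 else 0              # draw A from uniform q
        joint = (0.3 * 0.8) if a == 1 else (0.7 * 0.1)  # p(A=a, B=1)
        total += joint / 0.5                            # weight W = p(a, e) / q(a)
    return total / M

estimate = importance_sampling(100_000)   # converges to 0.31
```

Unlike rejection sampling, every drawn valuation contributes to the estimate through its weight.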

## 6 Likelihood Weighting

When the proposal distribution is chosen carefully, importance
sampling produces more reliable estimates (in the sense that
they are consistent and have smaller variance). Choosing good
proposal distributions is however not always easy. One
particularly good strategy is **likelihood weighting**, where we
let \(q(\mathcal{X})\) be the joint distribution of the Bayesian
Network obtained by re-defining \(p(E_i=e_i|\pa{E_i})=1\) and
\(p(E_i \neq e_i|\pa{E_i})=0\) for every evidence variable and
leaving the remaining local models unchanged. One can verify
that \[ W_i = \prod_{j=1}^m p(e_j|\nu_i(\pa{E_j})) \, . \] Moreover,
\(q(\nu) \geq p(\nu)\) for every valuation \(\nu\) consistent with
the evidence, so the weights lie in \([0,1]\); this makes likelihood
weighting asymptotically consistent and reduces the variance of the
estimates (which implies faster convergence).

### 6.1 Example

The proposal distribution used in likelihood weighting is
equivalent to the so-called **mutilated Bayesian network**
obtained by removing the arcs entering each \(E_j\) and fixing \(E_j=e_j\).
The network below is the mutilated version of the previous
example for the evidence \(C=1\).

## 7 Bounded-Variance Algorithm

While likelihood weighting often improves on logical sampling
(especially for conditional probabilities), their convergence
rates are similar and might require exponentially many
valuations to guarantee a certain accuracy. The
**bounded-variance algorithm** [DL1997] uses a stopping rule
to dynamically decide the number of valuations based on the
current estimate and is guaranteed to be polynomial when the
number of evidence variables is small. The algorithm assumes
that \(0 < \ell_i \leq p(E_i=e_i|\pa{E_i}) \leq u_i < 1\) and
proceeds as follows. Once more, assume \(X_1,\dotsc,X_n\) are
ordered topologically.

- Define \(U = \prod_i u_i\), \(N^* = 4\ln(2/\delta)(1+\epsilon)/\epsilon^2\)
- Initialize \(N \gets 0, M \gets 0\)
- While \(N < N^*\) repeat:
- \(W \gets 1\)
- For \(i=1\) to \(i=n\): if \(X_i\) is not an evidence variable, generate \(\nu(X_i)=x_i\) with probability \(p(X_i=x_i|X_1=x_1,\dotsc,X_{i-1}=x_{i-1})\); otherwise (\(X_i=E_j\)) set \(\nu(X_i)=e_j\) and do \(W \gets W \cdot p(E_j=e_j|x_1,\dotsc,x_{i-1})\)
- Do \(N \gets N + W/U\) and \(M \gets M + 1\)

- Return \(U \cdot N / M\)

The algorithm is similar in spirit to likelihood weighting, but uses a different estimator that divides by an upper bound in order to reduce (and even bound) the variance.
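The loop above can be sketched on a hypothetical two-variable network \(A \rightarrow B\) with evidence \(B=1\) (the CPTs and the bound \(u=0.8\) on the single evidence CPT row are illustrative assumptions):

```python
import math
import random

def bounded_variance(eps, delta, seed=0):
    """Bounded-variance estimate of p(B=1) in the toy network A -> B.
    Assumed CPTs: p(A=1)=0.3, p(B=1|A=1)=0.8, p(B=1|A=0)=0.1,
    so U = 0.8 upper-bounds the evidence CPT entries."""
    rng = random.Random(seed)
    U = 0.8
    N_star = 4 * math.log(2 / delta) * (1 + eps) / eps ** 2
    N, M = 0.0, 0
    while N < N_star:                        # stopping rule on accumulated weight
        a = 1 if rng.random() < 0.3 else 0   # sample the non-evidence variable A
        W = 0.8 if a == 1 else 0.1           # likelihood weight p(B=1 | A=a)
        N += W / U                           # normalized weight in (0, 1]
        M += 1
    return U * N / M

estimate = bounded_variance(0.05, 0.05)   # converges to 0.31
```

Dividing the weights by \(U\) keeps the accumulated quantity in \((0,1]\), which is what allows the stopping rule to bound the variance.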

So far we have discussed algorithms for inference in Bayesian
networks, where we can generate valuations directly from the
model. This is no longer true in Markov networks. Assume that
we want to compute the partition function of a distribution
\(p(\mathcal{V})=\phi(\mathcal{V})/Z\) as in Markov networks. We
generate valuations \(\nu_1,\dotsc,\nu_M\) from some proposal
distribution \(q(\mathcal{V})\), and compute \[ \hat{Z} =
\frac{1}{M} \sum_{i=1}^M W_i \,, \quad \text{where } W_i =
\phi(\nu_i)/q(\nu_i) \, . \] In the limit, we have that \[
\lim_{M \rightarrow \infty} \frac{1}{M} \sum_i W_i =
\sum_{\nu} \frac{\phi(\nu)}{q(\nu)}q(\nu) = Z \, . \] Hence,
the partition function can be computed much like a probability
of evidence (and vice-versa). Any probability can now be
computed by importance sampling with weights
\([\phi(\nu)/Z]/q(\nu)\). It remains to decide how to choose a
proposal distribution. One possibility is to fix an arbitrary
topological ordering among the variables and locally normalize
every clique potential to turn the Markov network into a
Bayesian network with local models \(p(X_i|\pa{X_i}) \propto
\phi(\mathcal{C}_j)/p(\pa{X_i})\), where \(\mathcal{C}_j\) is a
clique containing \(X_i\). This will often require
*chordalizing* the network. A usually more successful
alternative is Gibbs sampling, which we discuss next.
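Before moving on, the partition-function estimator above can be sketched on a hypothetical two-variable Markov network with a single potential and a uniform proposal (the potential values are illustrative assumptions):

```python
import random

def estimate_partition(M, seed=0):
    """Importance-sampling estimate of the partition function Z of a toy
    two-binary-variable Markov network with one potential phi(x, y),
    using the uniform proposal q(x, y) = 1/4."""
    rng = random.Random(seed)
    phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
    total = 0.0
    for _ in range(M):
        x, y = rng.randint(0, 1), rng.randint(0, 1)  # draw (x, y) from q
        total += phi[(x, y)] / 0.25                  # weight W = phi / q
    return total / M

Z = estimate_partition(100_000)   # converges to 1 + 2 + 3 + 4 = 10
```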

## 8 Gibbs Sampling

Gibbs sampling is a general scheme for sampling from complex
distributions where generating samples from local conditional
models is easy. The algorithm
can be succinctly described as follows. Consider any ordering
of the variables \(X_1,\dotsc,X_n\).

- Initialize \(\nu_0\) with \(\nu_0(E_i)=e_i\) and otherwise arbitrary; set \(N \gets 0\) and \(j \gets 1\)
- Repeat for a number of steps \(M\):
- For \(i=1\) to \(i=n\) generate \(\nu_j(X_i)=x_i\) with probability \(p(X_i=x_i|X_1=\nu_{j}(X_1),\dotsc,X_{i-1}=\nu_{j}(X_{i-1}),X_{i+1}=\nu_{j-1}(X_{i+1}),\dotsc,X_n=\nu_{j-1}(X_n))\)
- If \(\nu_j\) is consistent with \(e_1,\dotsc,e_m\) increment \(N\)
- Increment \(j\)

- Return \(N/M\)

Note that a value \(X_i=x_i\) is generated given a full
valuation of the remaining variables: the values of
\(X_1,\dotsc,X_{i-1}\) are taken from the current sweep \(\nu_j\)
and the values of \(X_{i+1},\dotsc,X_n\) from the previous one
\(\nu_{j-1}\). Hence, the probability used in the sampling step
equals the conditional probability of \(X_i\) given its
Markov blanket. One can show that \(p(X_i|\nb{X_i})\) can be
efficiently computed in Markov networks (note that \(\nb{X_i}\)
is the Markov blanket of \(X_i\)) and used to generate values
\(x_i\). The estimator \(N/M\) is known as the **histogram
estimator**.

Under some mild conditions, as \(M \rightarrow \infty\), the
valuations \(\nu_j\) are generated according to the joint
distribution \(p(\mathcal{V})\) of the Bayesian network or
Markov network. Hence, we can use Gibbs sampling to perform
many sorts of inference problems (computing the partition
function, computing all marginals, computing an expectation,
etc.). When computing conditional probabilities, we can fix
the valuations at the evidence; Gibbs sampling is then
guaranteed asymptotically to generate valuations from
\(p(X_q|e_1,\dotsc,e_m)\). In this case a theoretically improved
estimator is the **mixture estimator**: \[ p(x_q|e_1,\dotsc,e_m)
\approx \frac{1}{M} \sum_{j=1}^M
p(X_q=x_q|\nu_j(\text{MB}(X_q))) \, , \] where \(\text{MB}(X_q)\)
denotes the Markov blanket of \(X_q\).
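A minimal sketch of Gibbs sampling with the histogram estimator on a hypothetical two-variable Markov network, where the Markov blanket conditionals are obtained by locally normalizing the potential (the potential values are illustrative assumptions):

```python
import random

def gibbs_marginal(M, burn_in=1000, seed=0):
    """Gibbs-sampling estimate of p(X1=1) in a toy two-binary-variable
    Markov network with joint p(x1, x2) proportional to phi(x1, x2)."""
    rng = random.Random(seed)
    phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
    x1, x2 = 0, 0
    N = 0
    for step in range(burn_in + M):
        # Resample X1 from p(X1 | x2), proportional to phi(X1, x2).
        p1 = phi[(1, x2)] / (phi[(0, x2)] + phi[(1, x2)])
        x1 = 1 if rng.random() < p1 else 0
        # Resample X2 from p(X2 | x1), proportional to phi(x1, X2).
        p2 = phi[(x1, 1)] / (phi[(x1, 0)] + phi[(x1, 1)])
        x2 = 1 if rng.random() < p2 else 0
        if step >= burn_in and x1 == 1:   # histogram estimator after burn-in
            N += 1
    return N / M

estimate = gibbs_marginal(200_000)   # converges to p(X1=1) = 7/10
```

The burn-in phase discussed below is implemented here by simply discarding the first sweeps.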

Analyzing the finite-sample convergence of Gibbs sampling is
difficult, and generally there are no non-trivial upper bounds
on the number of iterations necessary to guarantee
good performance. There are many convergence tests that can be
made to verify *estimate* convergence. For example, a simple
approach is to run many independent simulations concurrently
(using different initializations) and to decide convergence
when the estimates produced by the different runs become
similar. This is usually accomplished by comparing the
variance of each estimate with respect to the overall estimate
(the one using valuations from all runs). Upon convergence,
this variance should be small (see Box 12.B on page 522 of
KF2009).

The consistency of the samples generated by Gibbs sampling is
only valid asymptotically. For small sample sizes, consecutive
valuations are *correlated*, which can greatly bias the
estimates produced. To decrease correlation between
valuations, practitioners often adopt a **burn-in** phase,
during which the generated valuations are discarded. It is
customary to have a burn-in sample size of thousands of
valuations (these do not contribute to the value of \(N\) or
\(M\)).

Gibbs sampling has weaker theoretical guarantees than logical and importance sampling. On the other hand, its convergence is not generally affected by the probability being sought, which makes it particularly competitive when computing the probability of rare events. There are also many improvements (e.g., block and Rao-Blackwellized Gibbs sampling) that can speed up convergence of the algorithm.

All methods presented here can be used with continuous variables (as long as samples from the local distributions can be efficiently generated) with little additional effort. Sampling methods are usually the method of choice when devising new complex models (e.g., specialized topic models and probabilistic logics).

## 9 Exercises

- Let \((G,p)\) be a Markov network factorizing as \(p(\mathcal{V})=\prod_j \phi_j(\mathcal{C}_j)/Z\). Show that \[ p(X | \nb{X}) = \frac{ \prod_{j: X \in \mathcal{C}_j} \phi_j(\mathcal{C}_j) }{ \sum_{X} \prod_{j: X \in \mathcal{C}_j} \phi_j(\mathcal{C}_j) } \] for any variable \(X \in \mathcal{V}\).
- Obtain a similar expression for \(p(X | MB(X))\) in the case of Bayesian networks, where \(MB(X)\) denotes the Markov Blanket of \(X\) (i.e., parents, spouses and children).
- Implement the likelihood weighting and Gibbs sampling algorithms, and evaluate them on Bayesian networks learned from datasets. Consider aspects such as accuracy (absolute and relative errors), convergence (accuracy vs. number of samples), efficiency (accuracy vs. runtime), effect on amount of evidence, effect of model size (number of variables) and effect of desired inference (the true probability).

# Bibliography

- [Hen1986] Henrion, Propagating Uncertainty in Bayesian Networks by Probabilistic Logic Sampling, in: Proceedings of the 2nd Conference on Uncertainty in Artificial Intelligence, 149-163 (1986).
- [KF2009] Koller & Friedman, Probabilistic Graphical Models, MIT press (2009).
- [Dar2009] Adnan Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press (2009).
- [DL1997] Dagum & Luby, An optimal approximation algorithm for Bayesian inference, *Artificial Intelligence*, **93(1-2)**, 1-27 (1997).