Inference By Sampling

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} $

In this and in the next few lectures we study practical algorithms for the inference problem, which comes in many forms. To make the discussion concise we will focus on the computation of the probability of evidence $P(E_1=e_1,\dotsc,E_m=e_m)$, where $E_1,\dotsc,E_m$ is a subset of the variables called the evidence variables and $e_1,\dotsc,e_m$ are the corresponding observed values, the evidence. This is also known as the partition function problem, as it is equivalent to computing the partition function of a (reduced) Markov network. A related problem is belief updating or posterior probability, which consists in computing $P(X_q=x_q|E_1=e_1,\dotsc,E_m=e_m)$. In principle, we can solve this problem by calling an algorithm for the probability of evidence twice, so as to obtain $P(X_q=x_q,E_1=e_1,\dotsc,E_m=e_m)$ and $P(E_1=e_1,\dotsc,E_m=e_m)$. While this is indeed the approach taken by some algorithms, other techniques employ more efficient ways to obtain the posterior probability directly (and this is particularly true for the algorithms we study here). Nevertheless, the adaptations for producing posterior probabilities are often straightforward and do not affect the generality of the results we will see later.

Approximate Stochastic Algorithms

Computing the probability of evidence by enumeration is impractical for all but the smallest models, so more sophisticated algorithms are needed. In this lecture we analyze sampling algorithms, that is, algorithms that operate by generating data sets from the probabilistic model and then using these data sets to produce estimates of the desired probabilities. This family of algorithms is also known as search-based inference algorithms, as they perform a search over the space of (partial) configurations of the variables, and as simulation algorithms, as they simulate the phenomenon being modeled. We focus on Monte Carlo approaches, which generate samples stochastically (i.e., different runs of the algorithm produce different solutions). The algorithms we examine are approximate: their solution is only an approximation of the correct value.

Two criteria are generally used to evaluate approximate algorithms: accuracy and efficiency. The former is a measure of the quality of the solutions produced, and the latter is a measure of the amount of computational resources needed. There is often a trade-off between these two quantities, such that increasing one leads to a decrease in the other.

The most common measures of accuracy are absolute error and relative error. Let $p$ be a (marginal) probability we would like to determine and $q$ the output of an approximate algorithm. They are defined as

The absolute error of the estimate $q$ with respect to the true value $p$ is $\epsilon_\text{abs} = | p-q |$.

The relative error of the estimate $q$ with respect to the true value $p$ is $\epsilon_\text{rel} = \frac{| p-q |}{p}$.

An estimate is an absolute (resp., relative) $\epsilon$-approximation if its absolute (resp., relative) error is at most $\epsilon$. Since the algorithms we will see are stochastic, the accuracy of the algorithm is a random variable. We thus discuss the randomized accuracy, which is the probability $1-\delta$ of the algorithm producing an $\epsilon$-approximation.

An absolute approximation is useful when the desired number is relatively large and we need only finite precision. For example, if $P(e) > 1/2$, then an absolute error is usually appropriate. It is less appropriate when $P(e_1,\ldots,e_m)$ is small, which is usually the case when $m$ is large. Any relative $\epsilon$-approximation is also an absolute $\epsilon$-approximation. Moreover, relative $\epsilon$-approximations to $P(E_1=e_1,\dotsc,E_m=e_m)$ and $P(X_q=x_q,E_1=e_1,\dotsc,E_m=e_m)$ can be combined to produce a relative $\sqrt{\epsilon}$-approximation to $P(X_q=x_q|E_1=e_1,\dotsc,E_m=e_m)$, but this is not true for absolute approximations. These arguments suggest that relative error is the preferred accuracy criterion. As we will see, however, satisfying a relative error guarantee is often more demanding than satisfying an absolute one.

Efficiency is typically measured in terms of the running time and memory consumption of the algorithm. The algorithms we will study consume very little memory, and we will often focus on the running time, measured as a function of the number of elementary operations performed (arithmetic operations, list insertion, etc).

Monte Carlo Estimation

Monte Carlo approximation of the blue area.

Suppose we are interested in determining the proportion $P(S)$ of the blue area $S$ (i.e., the intersection of the pink circle and the yellow quadrilateral) with respect to the circle in the figure above, and that we have an efficient way of telling whether any given point is in the area (e.g., by verifying that it is both inside the circle and inside the quadrilateral). A popular approximate technique, known as Monte Carlo estimation, is to generate points $x_1,\dotsc,x_M$ uniformly within the circle and then estimate the desired quantity as the fraction of points that fall inside the target area. Formally, let $B_i$ be a Boolean random variable such that $B_i=1$ if $x_i \in S$ and $B_i=0$ otherwise. The estimate is thus given by $$ Q(S) = \frac{1}{M} \sum_{i=1}^M B_i \, . $$

As $M \rightarrow \infty$, $Q(S)$ converges in probability to $P(S)$.
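
To make this concrete, here is a minimal Python sketch of the procedure. The unit circle and the particular rectangle playing the role of the quadrilateral are our own arbitrary choices for illustration.

```python
import random

def in_circle(x, y):
    # Unit circle centered at the origin (our arbitrary choice for illustration).
    return x * x + y * y <= 1.0

def in_quadrilateral(x, y):
    # Hypothetical quadrilateral: an axis-aligned rectangle overlapping the circle.
    return 0.0 <= x <= 1.2 and -0.3 <= y <= 0.9

def monte_carlo_area_ratio(num_points=100_000, rng=random.Random(0)):
    """Estimate P(S), the fraction of the circle covered by the blue region S."""
    hits = 0
    for _ in range(num_points):
        # Draw a point uniformly inside the circle by rejection from the bounding square.
        while True:
            x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
            if in_circle(x, y):
                break
        if in_quadrilateral(x, y):  # B_i = 1 when the point falls in S
            hits += 1
    return hits / num_points

print(monte_carlo_area_ratio())
```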

Accuracy

The accuracy of Monte Carlo estimation depends on the number of samples generated, on the distribution from which points are drawn, and on the quantity being estimated. For instance, for a fixed number of sample points, a better estimate is obtained when $P(S)$ is large than when it is small. This can be formalized using the following probability inequalities.

Hoeffding Inequality: Let $B_1,\dotsc,B_M$ be Bernoulli trials with success probability $p$, $q=\sum_i B_i/M$ be an estimate of $p$, and $\epsilon>0$. Then $$ P( |p-q| \geq \epsilon) \leq 2\exp( -2 M \epsilon^2 ) \, . $$

The Hoeffding inequality shows that the absolute error (additive error) of the estimator $q$ converges in probability exponentially fast with respect to the number of trials. Thus, in order to guarantee an absolute $\epsilon$-approximation with probability at least $1-\delta$ we need to set

$$ M \geq \frac{\ln(2/\delta)}{2\epsilon^2} \, . $$

That bound is independent of the desired value $p$, logarithmic in $1/\delta$, and grows with the inverse square of the absolute error $\epsilon$. For example, to obtain an absolute $10^{-3}$-approximation with 99% confidence, we can choose $$ M = \left\lceil \frac{\ln(2/0.01)}{2 \cdot 10^{-6}} \right\rceil = 2\,649\,159 \, , $$ irrespective of the value of $p$. While the number of samples required is large, it is most often within reasonable computational limits for many models.
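
The calculation above is easy to reproduce; here is a small sketch (the helper name is ours):

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Samples ensuring an absolute epsilon-approximation with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

print(hoeffding_sample_size(1e-3, 0.01))  # 2649159
```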

The relative error performance can be analyzed with the help of the following result.

Chernoff Inequality: Let $B_1,\dotsc,B_M$ be Bernoulli trials with success probability $p$, $q=\sum_i B_i/M$ be an estimate of $p$, and $\epsilon > 0$. Then $$ P( q \geq p(1 + \epsilon)) < \exp( -M p \epsilon^2/3 ) $$ and $$ P( q \leq p(1- \epsilon)) < \exp ( -M p\epsilon^2/2 ) \, . $$

The Chernoff inequality shows that the relative error (multiplicative error) of the estimator $q$ converges in probability exponentially fast with respect to the number of trials. This bound, however, depends on the true probability $p$: the exponent is proportional to $p$, so for $p$ close to 0 the bound approaches 1 and becomes uninformative.

Using the Chernoff inequality, we can bound the number of samples required to get a relative $\epsilon$-approximation estimate with probability at least $1-\delta$ by

$$ M \geq 3 \frac{\ln(2/\delta)}{p \epsilon^2} \, . $$

The bound depends on the unknown target probability $p$. Hence, to be used in practice, we must assume some lower bound such as $p \geq 1/2$, which is not always possible, especially when the number of evidence variables is large. For example, to obtain a relative $10^{-3}$-approximation with 99% confidence for a desired probability $p=1/2$, we might choose $$ M = \left\lceil 3\frac{\ln(2/0.01)}{0.5 \cdot 10^{-6}} \right\rceil = 31\,789\,905 \, , $$ about an order of magnitude larger than the number of samples required for the absolute error, yet still feasible. If, on the other hand, $p=10^{-4}$, then $$ M = \left\lceil 3\frac{\ln(2/0.01)}{10^{-4} \cdot 10^{-6}} \right\rceil = 158\,949\,520\,997 \, , $$ an enormous quantity, most likely out of reach of the available computational resources.
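
The corresponding calculation for the relative-error bound, again only a sketch with a hypothetical helper:

```python
import math

def chernoff_sample_size(epsilon, delta, p_lower):
    """Samples ensuring a relative epsilon-approximation with probability >= 1 - delta,
    assuming the target probability is at least p_lower."""
    return math.ceil(3 * math.log(2 / delta) / (p_lower * epsilon ** 2))

print(chernoff_sample_size(1e-3, 0.01, 0.5))   # 31789905
print(chernoff_sample_size(1e-3, 0.01, 1e-4))  # 158949520997
```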

Efficiency

The efficiency of Monte Carlo estimation depends on the number of sample points generated, the time taken to generate a sample point and the time to reject or accept a sample point. We usually assume that sampling and verifying a sample point can be done efficiently (e.g., linear time), so that the efficiency is a function of the number of sample points used.

Logical Sampling

Logical sampling was introduced by Henrion in 1986 and is arguably the simplest non-enumerative algorithm for inference in Bayesian networks.1 It relies on the assumption that sampling a complete instantiation from the joint distribution can be performed efficiently. This is true in discrete Bayesian networks and in many directed models whose local distributions are standard (e.g., Gaussian distributions).2 Let $X_1,\dotsc,X_n$ be a topological ordering of the variables, and consider the evidence $E_1=e_1,\dotsc,E_m=e_m$. We say that two (partial) configurations are consistent if they agree on the value of every variable they both fix. A high-level description of logical sampling is as follows.

  1. Initialize $N \gets 0$
  2. Repeat $M$ times:
    1. For $i=1$ to $i=n$ generate $x_i \sim P(X_i|X_1=x_1,\dotsc,X_{i-1}=x_{i-1})$
    2. If $x_1,\dotsc,x_n$ is consistent with $e_1,\dotsc,e_m$ increase $N$
  3. Return $N/M$

Note that since the variables are ordered topologically, the distribution $$ P(X_i|X_1=x_1,\dotsc,X_{i-1}=x_{i-1})=P(X_i|\pa{X_i} = \nu) \, , $$ where $\nu$ is the configuration of $\pa{X_i}$ consistent with $x_1,\dotsc,x_{i-1}$. The counter $N$ counts the number of complete joint configurations consistent with the evidence, while $M$ is the total number of configurations generated. We thus have that $$ \lim_{M \rightarrow \infty} N/M = P(E_1=e_1,\dotsc,E_m=e_m) \, . $$
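
The following sketch implements logical sampling for a small discrete Bayesian network. The dictionary-based network representation and the toy CPTs are our own choices for illustration; any representation that supports looking up $P(X_i|\pa{X_i})$ would do.

```python
import random

# A toy network A -> C <- B with binary variables (our own example for illustration).
# Each CPT maps a tuple of parent values to a distribution over the variable's values.
network = {
    "A": {"parents": (), "cpt": {(): [0.6, 0.4]}},
    "B": {"parents": (), "cpt": {(): [0.7, 0.3]}},
    "C": {"parents": ("A", "B"), "cpt": {
        (0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
        (1, 0): [0.4, 0.6], (1, 1): [0.2, 0.8]}},
}
order = ["A", "B", "C"]  # a topological ordering of the variables

def sample_value(dist, rng):
    """Draw an index according to the probabilities in dist."""
    u, acc = rng.random(), 0.0
    for value, prob in enumerate(dist):
        acc += prob
        if u <= acc:
            return value
    return len(dist) - 1

def logical_sampling(evidence, num_samples=100_000, rng=random.Random(0)):
    """Estimate P(evidence) as the fraction of forward samples consistent with it."""
    consistent = 0
    for _ in range(num_samples):
        assignment = {}
        for var in order:  # forward sampling along the topological order
            parents = tuple(assignment[p] for p in network[var]["parents"])
            assignment[var] = sample_value(network[var]["cpt"][parents], rng)
        if all(assignment[var] == val for var, val in evidence.items()):
            consistent += 1
    return consistent / num_samples

print(logical_sampling({"C": 1}))  # true value is P(C=1) = 0.396
```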

The figure below illustrates the process of logically sampling a complete instantiation from a simple Bayesian network, also known as forward sampling.

Accuracy

The absolute and relative errors for finite samples can be derived using the Hoeffding and Chernoff inequalities, respectively, with $q=N/M$ and $p=P(e_1,\dotsc,e_m)$. Hence, to ensure a randomized absolute $\epsilon$-approximation with confidence $1-\delta$, we need $$ M = \left \lceil \frac{\ln(2/\delta)}{2\epsilon^2} \right \rceil $$ samples. Similarly, to ensure a randomized relative $\epsilon$-approximation with confidence $1-\delta$, we need $$ M = \left \lceil 3\frac{\ln(2/\delta)}{p\epsilon^2} \right \rceil $$ samples.

Efficiency

What about the efficiency of the algorithm? An inspection of the algorithm shows that it takes $O(M n r \log d)$ time, where $r$ is the maximum in-degree (number of parents) of a variable and $d$ is the maximum cardinality (number of values) of a variable. Hence, the algorithm runs in polynomial time as long as the number of simulations $M$ is polynomially bounded. As we have discussed, a polynomial bound on the number of samples is insufficient to guarantee an estimate with small relative error when the desired probability value is small.

The algorithm can be straightforwardly modified to produce several estimates of probabilities with the same set of generated configurations by adding more counters.

Importance Sampling

Logical sampling is also known as rejection sampling, as it consists in generating samples from the joint distribution $P(\mathcal{V})$ and rejecting (discarding) those that are not consistent with the evidence. This leads to inefficiency and loss of accuracy, as samples are rejected with probability $1-P(e_1,\dotsc,e_m)$, which is usually high when $m$ is large. Ideally, we would instead like to sample directly from the restricted distribution $$ P(\mathcal{V}-\{E_1,\dotsc,E_m\},e_1,\dotsc,e_m) $$ of configurations consistent with the evidence; this is, however, in many cases difficult or inefficient. Importance sampling describes a class of methods that overcome such issues by making use of a proposal distribution that approximates the restricted distribution while allowing for efficient sampling. Importance sampling proceeds as follows. Let $$ \mathcal{X}=\mathcal{V}-\{E_1,\dotsc,E_m\} \, . $$ Generate (partial) instantiations $\nu_1,\dotsc,\nu_M$ of $\mathcal{X}$ from a proposal distribution $Q(\mathcal{X})$ and estimate the probability $P(e_1,\dotsc,e_m)$ as $$ q = \frac{1}{M} \sum_{i=1}^M W_i \,, \quad \text{where } W_i = P(\nu_i,e_1,\dotsc,e_m)/Q(\nu_i) \, . $$ The proposal distribution is assumed to satisfy $Q(\nu)>0$ wherever $P(\nu,e_1,\dotsc,e_m)>0$. This ensures that every configuration consistent with the evidence that has positive probability according to $P$ is eventually generated. When this is true, the asymptotic value of the estimate is $$ \lim_{M \rightarrow \infty} \frac{1}{M} \sum_i W_i = \sum_{\nu} \frac{P(\nu,e_1,\dotsc,e_m)}{Q(\nu)}Q(\nu) = P(e_1,\dotsc,e_m) \, . $$
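
In code, importance sampling only needs three operations: sampling from $Q$, evaluating $Q$, and evaluating $P$ jointly with the evidence. A generic sketch follows; the function arguments are placeholders to be supplied by the model at hand, and the tiny example at the end is our own.

```python
import random

def importance_sampling(sample_q, q_prob, p_joint_with_evidence,
                        num_samples=100_000, rng=random.Random(0)):
    """Estimate P(e) with a proposal Q over the non-evidence variables X.

    sample_q(rng): draws a configuration nu of X from Q
    q_prob(nu): evaluates Q(nu)
    p_joint_with_evidence(nu): evaluates P(nu, e), i.e., nu completed with the evidence
    """
    total = 0.0
    for _ in range(num_samples):
        nu = sample_q(rng)
        total += p_joint_with_evidence(nu) / q_prob(nu)  # importance weight W_i
    return total / num_samples

# Example: X ~ Bernoulli(0.3), evidence with likelihood P(e|X=x) = [0.9, 0.2][x],
# so P(e) = 0.7*0.9 + 0.3*0.2 = 0.69. Propose X uniformly.
print(importance_sampling(
    sample_q=lambda rng: rng.randint(0, 1),
    q_prob=lambda x: 0.5,
    p_joint_with_evidence=lambda x: [0.7 * 0.9, 0.3 * 0.2][x],
))
```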

The finite-sample behavior depends on the variability of the sample weights $P(\nu,e_1,\dotsc,e_m)/Q(\nu)$: the less they vary across samples, the better the estimate.

Likelihood Weighting

When the proposal distribution is chosen carefully, importance sampling produces more reliable estimates (in the sense that they are consistent and have smaller variance). Choosing good proposal distributions is, however, not always easy. One particularly good strategy is likelihood weighting, where we let $Q(\mathcal{X})$ be the joint distribution of the Bayesian network obtained by re-defining $P(E_j=e_j|\pa{E_j})=1$ and $P(E_j \neq e_j|\pa{E_j})=0$ for every evidence variable and leaving the remaining local models unchanged. One can verify that $$ W_i = \prod_j P(e_j|\pa{E_j}=\nu_i) \, , $$ where the product is over the evidence variables and $\nu_i$ is the $i$th sampled configuration. Moreover, $Q(\nu_i) \geq P(\nu_i,e_1,\dotsc,e_m)$, so the weights are bounded by 1; this makes likelihood weighting consistent with variance no larger than that of logical sampling (which implies no slower convergence).
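
A sketch of likelihood weighting, reusing the toy `network`, `order`, and `sample_value` helpers from the logical sampling sketch above (these are our own illustrative constructions, not part of any particular library):

```python
import random

def likelihood_weighting(evidence, num_samples=100_000, rng=random.Random(0)):
    """Estimate P(evidence): sample only non-evidence variables, weight each sample
    by the product of P(E_j = e_j | parents) over the evidence variables."""
    total_weight = 0.0
    for _ in range(num_samples):
        assignment, weight = {}, 1.0
        for var in order:
            parents = tuple(assignment[p] for p in network[var]["parents"])
            dist = network[var]["cpt"][parents]
            if var in evidence:
                assignment[var] = evidence[var]   # clamp the evidence variable
                weight *= dist[evidence[var]]     # accumulate its likelihood
            else:
                assignment[var] = sample_value(dist, rng)
        total_weight += weight
    return total_weight / num_samples

print(likelihood_weighting({"C": 1}))  # estimates P(C=1) without rejecting any sample
```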

The proposal distribution used in likelihood weighting is equivalent to the so-called mutilated Bayesian network obtained by removing the arcs entering each $E_j$ and fixing $E_j=e_j$. By removing arcs, we introduce new independences that can be exploited when sampling. For example, we can sample independently from each connected component of the mutilated network.

The network below is the mutilated version of the logical sampling example for the evidence $C=1$.

Bounded-Variance Algorithm

While likelihood weighting often improves on logical sampling (especially for conditional probabilities), their convergence rates are similar, and both might require exponentially many samples to guarantee a certain accuracy. The bounded-variance algorithm uses a stopping rule to dynamically decide the number of configurations based on the current estimate, and it is guaranteed to run in polynomial time when the number of evidence variables is small.3 The algorithm assumes that $$ 0 < \ell_i \leq P(E_i=e_i|\pa{E_i}) \leq u_i < 1 $$ and proceeds as follows. Once more, assume $X_1,\dotsc,X_n$ are ordered topologically.

  1. Define $U = \prod_i u_i$, and $N^* = 4\ln(2/\delta)(1+\epsilon)/\epsilon^2$
  2. Initialize $N \gets 0, M \gets 0$
  3. While $N < N^*$ repeat:
    1. $W \gets 1$
    2. For $i=1$ to $i=n$ do:
      1. If $X_i$ is not an evidence variable, generate $x_i \sim P(X_i|x_1,\dotsc,x_{i-1})$
      2. If $X_i$ is an evidence variable with observed value $e_j$, set $x_i \gets e_j$ and $W \gets W \cdot P(X_i=e_j|x_1,\dotsc,x_{i-1})$
    3. Update $N \gets N + W/U$ and $M \gets M + 1$
  4. return $U \cdot N / M$

The algorithm is similar in spirit to likelihood weighting, but uses a different estimator that divides by an upper bound in order to reduce (and even bound) the variance, hence its name.
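
A sketch of the bounded-variance algorithm on the same toy network (again reusing `network`, `order`, and `sample_value` from above; the upper bounds $u_i$ are assumed to be supplied by the user):

```python
import math
import random

def bounded_variance(evidence, upper_bounds, epsilon, delta, rng=random.Random(0)):
    """Sketch of the bounded-variance algorithm on the toy network above.

    upper_bounds[E_j] must upper-bound P(E_j = e_j | parents) over all parent values.
    """
    U = math.prod(upper_bounds[var] for var in evidence)
    n_star = 4 * math.log(2 / delta) * (1 + epsilon) / epsilon ** 2
    N, M = 0.0, 0
    while N < n_star:
        assignment, weight = {}, 1.0
        for var in order:
            parents = tuple(assignment[p] for p in network[var]["parents"])
            dist = network[var]["cpt"][parents]
            if var in evidence:
                assignment[var] = evidence[var]
                weight *= dist[evidence[var]]
            else:
                assignment[var] = sample_value(dist, rng)
        N += weight / U   # stopping rule: accumulate normalized weights
        M += 1
    return U * N / M

print(bounded_variance({"C": 1}, upper_bounds={"C": 0.8}, epsilon=0.05, delta=0.05))
```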

Sampling in Markov Networks

So far we have discussed algorithms for inference in Bayesian networks, where we can generate configurations directly from the model by forward sampling. This is no longer true in Markov networks. Consider computing the partition function of a distribution $P(\mathcal{V})=\phi(\mathcal{V})/Z$ as in Markov networks. We generate configurations $\nu_1,\dotsc,\nu_M$ from some proposal distribution $Q(\mathcal{V})$, and compute $$ Z_e = \frac{1}{M} \sum_{i=1}^M W_i \,, \quad \text{ where } W_i = \phi(\nu_i)/Q(\nu_i) \, . $$

In the sample limit, we have that

$$ \lim_{M \rightarrow \infty} \frac{1}{M} \sum_i W_i = \sum_{\nu} \frac{\phi(\nu)}{Q(\nu)}Q(\nu) = Z \, . $$

Hence, the partition function can be computed much like a probability of evidence (and vice-versa). Once we obtain an estimate of the partition function, any marginal probability can be estimated by importance sampling with weights $[\phi(\nu)/Z_e]/Q(\nu)$. It remains to decide how to choose a proposal distribution for Markov networks. One possibility is to fix an ordering of the variables and locally normalize every clique potential to turn the Markov network into a Bayesian network with local models $P(X_i|\pa{X_i}) \propto \phi(\mathcal{C}_j)$, where $\mathcal{C}_j$ is a clique containing $X_i$. This will often require chordalizing the network. A usually more successful alternative is Gibbs sampling, which we discuss next.
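
As an illustration, the sketch below estimates the partition function of a tiny (hypothetical) Markov network with a single potential over two binary variables, using a uniform proposal:

```python
import random

# Hypothetical Markov network: two binary variables X, Y with a single potential phi.
phi = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}
configs = list(phi)

def estimate_partition_function(num_samples=100_000, rng=random.Random(0)):
    """Importance-sampling estimate of Z using a uniform proposal Q."""
    q = 1.0 / len(configs)           # Q(nu) for the uniform proposal
    total = 0.0
    for _ in range(num_samples):
        nu = rng.choice(configs)     # sample a configuration from Q
        total += phi[nu] / q         # weight W_i = phi(nu) / Q(nu)
    return total / num_samples

print(estimate_partition_function())  # the true value is Z = 1 + 2 + 2 + 4 = 9
```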

Gibbs Sampling

Gibbs sampling is a general scheme for sampling from complex distributions where generating samples from local conditional models is easy.4 The algorithm can be succinctly described as follows. Consider any ordering of the variables $X_1,\dotsc,X_n$.

  1. Initialize $x_1,\dotsc,x_n$ consistent with evidence, and set $N \gets 0$
  2. Repeat for a number of steps $M$:
    1. For $i=1$ to $i=n$ generate $x_i \sim P(X_i|x_1,\dotsc,x_{i-1},x_{i+1},\dotsc,x_n)$
    2. If $x_1,\dotsc,x_n$ is consistent with $e_1,\dotsc,e_m$ increment $N$
  3. Return $N/M$

Note that unlike logical sampling, a value $X_i=x_i$ is generated given a full configuration of the remaining variables. Hence, the probability

$$ P(X_i=x_i|x_1,\dotsc,x_{i-1},x_{i+1},\dotsc,x_n) $$

is taken from the conditional probability of $X_i$ given its Markov blanket. One can show that $P(X_i|\nb{X_i})$ can be efficiently computed in Markov networks, where $\nb{X_i}$ is the Markov blanket of $X_i$, and used to generate values $x_i$. The estimator $N/M$ is known as the histogram estimator.

Under some mild conditions, as $M \rightarrow \infty$, the configurations are generated according to the joint distribution $P(\mathcal{V})$ of the Bayesian network or Markov network. Hence, we can use Gibbs sampling to solve many sorts of inference problems (computing the partition function, computing all marginals, computing an expectation, etc.). When computing conditional marginal probabilities $P(X|e_1,\dotsc,e_m)$, we can fix the configurations of the evidence variables to be consistent with the evidence, thus avoiding rejecting samples; Gibbs sampling is then guaranteed asymptotically to generate configurations from the posterior $P(X|e_1,\dotsc,e_m)$. In this case a theoretically improved estimator is the mixture estimator: $$ P(x|e_1,\dotsc,e_m) \approx \frac{1}{M} \sum_{j=1}^M P(X=x|\nb{X}=\nu^j) \, , $$ where $\nb{X}$ is the Markov blanket of $X$ and $\nu^j$ is its configuration in the $j$th sampled joint configuration.
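
The sketch below runs Gibbs sampling on the toy network from the logical sampling example (reusing `network`, `order`, and `sample_value`, and assuming binary variables) to estimate a posterior marginal with the histogram estimator; the burn-in parameter anticipates the discussion below.

```python
import random

def gibbs_sampling(query_var, evidence, num_samples=50_000, burn_in=5_000,
                   rng=random.Random(0)):
    """Estimate P(query_var = 1 | evidence) on the toy network above with the
    histogram estimator (binary variables assumed)."""
    # Initialize consistently with the evidence; evidence variables stay clamped.
    state = {var: evidence.get(var, 0) for var in order}
    count = 0
    for step in range(burn_in + num_samples):
        for var in order:
            if var in evidence:
                continue
            # Unnormalized conditional of var given all the others: the product of
            # the CPT entries that mention var (i.e., its Markov blanket factors).
            weights = []
            for value in range(2):
                state[var] = value
                w = 1.0
                for other, node in network.items():
                    if other == var or var in node["parents"]:
                        parents = tuple(state[p] for p in node["parents"])
                        w *= node["cpt"][parents][state[other]]
                weights.append(w)
            total = sum(weights)
            state[var] = sample_value([w / total for w in weights], rng)
        if step >= burn_in and state[query_var] == 1:
            count += 1
    return count / num_samples

print(gibbs_sampling("A", {"C": 1}))  # true value is P(A=1 | C=1) = 0.264/0.396 = 2/3
```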

Analyzing the finite-sample convergence of Gibbs sampling is difficult, and generally there are no non-trivial upper bounds on the number of iterations necessary to guarantee good performance. There are, however, many convergence tests that can be used to assess whether the estimates have converged. For example, a simple approach is to run many independent simulations concurrently (using different initializations) and to declare convergence when the estimates produced by the different runs become similar. This is usually accomplished by comparing the variance of each estimate with respect to the overall estimate (the one using configurations from all runs). Upon convergence, this variance should be small.5

The guarantees on the samples generated by Gibbs sampling are only asymptotic. For small sample sizes, the estimates depend on the initialization and consecutive joint configurations are correlated, which can greatly bias the estimates produced. To mitigate the influence of the initialization, practitioners often adopt a burn-in phase, during which the generated configurations are discarded. It is customary to have a burn-in of (hundreds of) thousands of configurations (these do not contribute to the value of $N$ or $M$).

Gibbs sampling has weaker theoretical guarantees than logical and importance sampling. On the other hand, its convergence is generally not affected by the magnitude of the probability being sought, which makes it particularly competitive when computing the probability of rare events. There are also many improvements (e.g., blocked and Rao-Blackwellized Gibbs sampling) that can speed up convergence of the algorithm.

Generating samples

The methods described so far assume an efficient mechanism for generating instantiations from a known model. A very general and reasonably efficient technique is inverse transform sampling, which uses samples from a uniform distribution to generate samples from a distribution with cumulative distribution function $F$. Suppose $X$ is a random variable with density $p$ and cumulative distribution function $F$. Generate a sample $u$ from a uniform distribution on $[0,1]$ (for example, using any pseudorandom number generator) and return $x=F^{-1}(u)$, where $F^{-1}$ denotes the inverse of $F$, hence the name of the procedure. That is, $x$ is a value such that $F(x)=u$. The correctness is given by the theorem below.

Let $X$ be a random variable with a continuous, strictly increasing cumulative distribution function $F$, let $U$ be a random variable that is uniformly distributed in $[0,1]$, and let $Y=F^{-1}(U)$. Then $P(Y \leq x)=F(x)$.

Since $F$ is strictly increasing, we have that $$ \begin{align*} P(Y \leq x) &= P(F(Y) \leq F(x)) \\ & = P(U \leq F(x)) = F(x) \, , \end{align*} $$ where in the last equality we used the fact that $P(U \leq u) = u$.

This procedure can be extended to discrete random variables by partitioning $[0,1]$ into consecutive intervals whose lengths are the probabilities of the values, and returning the value whose interval contains $u$.
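
A small sketch of both the continuous and the discrete case (the exponential distribution is chosen only because its cumulative distribution function has a closed-form inverse):

```python
import math
import random

def sample_exponential(rate, rng=random.Random(0)):
    """Inverse transform sampling for an Exponential(rate) variable:
    F(x) = 1 - exp(-rate*x), hence F^{-1}(u) = -ln(1 - u) / rate."""
    u = rng.random()
    return -math.log(1.0 - u) / rate

def sample_discrete(probs, rng=random.Random(0)):
    """Discrete analogue: partition [0, 1] into intervals of lengths probs[i]
    and return the index of the interval containing u."""
    u, acc = rng.random(), 0.0
    for value, prob in enumerate(probs):
        acc += prob
        if u <= acc:
            return value
    return len(probs) - 1

print(sample_exponential(2.0), sample_discrete([0.2, 0.5, 0.3]))
```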

Continuous variables

All the methods presented here can be used with continuous variables (as long as samples from the local distributions can be generated efficiently) with little additional effort. Sampling methods are usually the method of choice when devising new complex models (e.g., specialized topic models and probabilistic logics). There has been intense research on improving the sample efficiency of sampling-based inference algorithms, especially for models with continuous variables.


  1. M. Henrion, Propagating Uncertainty in Bayesian Networks by Probabilistic Logic Sampling, In Proceedings of the 2nd Conference on Uncertainty in Artificial Intelligence, 1986. ↩︎

  2. To learn how to draw a sample from a discrete probability distribution, see Box 12.A on page 489 of Koller and Friedman 2009’s book and the footnote on page 379 of Darwiche 2009’s book. ↩︎

  3. P. Dagum and M. Luby, An optimal approximation algorithm for Bayesian inference, Artificial Intelligence 93:1-2, 1997. ↩︎

  4. The theoretical underpinnings of the algorithm are out of the scope of this course. ↩︎

  5. See Box 12.B on page 522 of Koller and Friedman 2009’s book. ↩︎
