Inference By Sampling

\( \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\indeg}[1]{\text{indeg}(#1)} \newcommand{\outdeg}[1]{\text{outdeg}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\!\perp_d\!} \newcommand{\msep}{\!\perp_m\!} \newcommand{\sep}{\!\perp_u\!} \)

Here and in the next few lectures we study practical algorithms for the inference problem, which comes in many forms. To make this discussion concise we will focus on the computation of the probability of evidence \(p(E_1=e_1,\dotsc,E_m=e_m)\), where \(E_1,\dotsc,E_m\) are the evidence variables and the (partial) valuation \(e_1,\dotsc,e_m\) is the evidence. This problem is also known as computing the partition function, as it is equivalent to the computation of the partition function in a (reduced) Markov network. A related problem is belief updating or posterior probability computation, which consists in computing \(p(X_q=x_q|E_1=e_1,\dotsc,E_m=e_m)\). In principle, we can solve this problem by calling an algorithm for the probability of evidence twice, so as to obtain \(p(X_q=x_q,E_1=e_1,\dotsc,E_m=e_m)\) and \(p(E_1=e_1,\dotsc,E_m=e_m)\). While this is indeed the approach taken by some algorithms, some techniques employ more efficient ways to obtain the posterior probability directly (and this is particularly true for the algorithms we study here). Nevertheless, the adaptations for producing posterior probabilities are often straightforward and do not affect the generality of the results we will see later.

1 Sampling Algorithms

Computing the probability of evidence by enumeration is impractical for all but the smallest models, so more sophisticated algorithms are needed. In this lecture we analyze sampling algorithms, that is, algorithms that operate by generating data sets from the probabilistic model and then using these data sets to produce estimates of the desired probabilities. This family of algorithms is also known as search-based inference algorithms, as they perform a search over the space of (partial) valuations of the variables, and as simulation algorithms, as they simulate the phenomenon being modeled. We focus on Monte Carlo approaches, which generate samples stochastically (i.e., different runs of the algorithm produce different solutions). The algorithms we examine are approximate: their solution is only an approximation of the correct value.

2 Evaluation of Approximate Algorithms

Two criteria are generally used to evaluate approximate algorithms: accuracy and efficiency. The former is a measure of the quality of the solutions produced, and the latter is a measure of the amount of computational resources needed. There is often a trade-off between these two quantities, such that increasing one leads to a decrease in the other.

2.1 Accuracy

The most common measures of accuracy are absolute error and relative error. Let \(p\) be a (marginal) probability we would like to determine and \(q\) the output of an approximate algorithm. They are defined as

\begin{align*} \epsilon_\text{abs} &= | p-q | && \text{[absolute error]}\\ \epsilon_\text{rel} &= \frac{| p-q |}{p} && \text{[relative error]}\\ \end{align*}

An estimate is an absolute (resp., relative) \(\epsilon\)-approximation if its absolute (resp., relative) error is at most \(\epsilon\). Since the algorithms we will see are stochastic, the accuracy of the algorithm is a random variable. We thus discuss the randomized accuracy, which is the probability \(1-\delta\) of the algorithm producing an \(\epsilon\)-approximation.

An absolute approximation is useful when the desired number is relatively large and we need only finite precision. For example, if \(p(e) > 1/2\), then an absolute error is usually appropriate. It is less appropriate when \(p(e)\) is small, which is usually the case when the number \(m\) of evidence variables is large. Any relative \(\epsilon\)-approximation is also an absolute \(\epsilon\)-approximation. Moreover, for small \(\epsilon\), relative \(\epsilon\)-approximations to \(p(E_1=e_1,\dotsc,E_m=e_m)\) and \(p(X_q=x_q,E_1=e_1,\dotsc,E_m=e_m)\) can be combined to produce a relative \(\sqrt{\epsilon}\)-approximation to \(p(X_q=x_q|E_1=e_1,\dotsc,E_m=e_m)\), but the same is not true for absolute approximations. These arguments suggest that relative error is the preferred accuracy criterion. As we will see, satisfying a relative error bound is often more demanding than satisfying an absolute one.

2.2 Efficiency

Efficiency is typically measured in terms of the running time and memory consumption of the algorithm. The algorithms we will study consume very little memory, so we will often focus on the running time, measured as a function of the number of elementary operations performed (arithmetic operations, list insertions, etc.).

3 Monte Carlo Estimation

montecarlo.png

Figure 1: Monte Carlo approximation of the blue area (the intersection of the circle and the quadrilateral).

Suppose we are interested in determining the proportion \(p(S)\) of the blue area \(S\) (i.e., the intersection of the circle and the quadrilateral) with respect to the circle in Figure 1, and that we have an efficient way of telling whether any given point is in the area (e.g., by verifying that it is both inside the circle and inside the quadrilateral). A popular approximate technique, known as Monte Carlo estimation, is to generate points \(x_1,\dotsc,x_M\) uniformly within the circle and then estimate the desired quantity as the fraction of points that fall inside the target area. Formally, let \(B_i\) be a Boolean random variable such that \(B_i=1\) if \(x_i \in S\) and \(B_i=0\) otherwise. The estimate is thus given by \[ q(S) = \frac{1}{M} \sum_{i=1}^M B_i \, . \] As \(M \rightarrow \infty\), \(q(S)\) converges in probability to \(p(S)\).
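
To make the procedure concrete, here is a minimal sketch in Python. Since the exact shapes of Figure 1 are not specified, the target region is a stand-in (the part of the unit disk lying in the first quadrant, whose true proportion is 1/4); all names and numbers are merely illustrative.

  import random

  def monte_carlo(is_inside, sample_point, M=100_000, seed=0):
      """Estimate the proportion of sampled points that fall in a target region."""
      rng = random.Random(seed)
      hits = sum(is_inside(sample_point(rng)) for _ in range(M))
      return hits / M

  def sample_in_disk(rng):
      # Rejection: draw uniformly in the bounding square until inside the disk.
      while True:
          x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
          if x * x + y * y <= 1:
              return (x, y)

  # Proportion of the unit disk lying in the first quadrant; true value is 1/4.
  print(monte_carlo(lambda p: p[0] >= 0 and p[1] >= 0, sample_in_disk))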

3.1 Accuracy

The accuracy of Monte Carlo estimation depends on the number of samples generated, on the distribution from which points are drawn, and on the desired quantity. For instance, for a fixed number of sample points, a better estimate is achieved when \(p(S)\) is large than when \(p(S)\) is small. This can be formalized using the following probability inequalities.

Hoeffding Inequality Let \(B_1,\dotsc,B_M\) be Bernoulli trials with success probability \(p\), let \(q=\sum_i B_i/M\), and let \(\epsilon>0\). Then \[ P( |p-q| \geq \epsilon) \leq 2\exp( -2 M \epsilon^2 ) \, . \]

The Hoeffding inequality shows that the absolute error (additive error) of the estimator \(q\) converges in probability exponentially fast wrt the number of trials.

Thus, in order to guarantee an absolute \(\epsilon\)-approximation with probability at least \(1-\delta\) we need to set

\[ M \geq \frac{\ln(2/\delta)}{2\epsilon^2} \, . \]

This bound is independent of the target value, logarithmic in \(1/\delta\), and polynomial in \(1/\epsilon\).
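
The bound is straightforward to evaluate. A small Python sketch (the numbers are merely illustrative):

  from math import ceil, log

  def hoeffding_samples(eps, delta):
      # M >= ln(2/delta) / (2 eps^2) suffices for an absolute
      # eps-approximation with probability at least 1 - delta.
      return ceil(log(2 / delta) / (2 * eps ** 2))

  print(hoeffding_samples(0.01, 0.05))  # 18445 samples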

The relative error performance can be analyzed with the help of the following result.

Chernoff Inequality Let \(B_1,\dotsc,B_M\) be Bernoulli trials with success probability \(p\), let \(q=\sum_i B_i/M\), and let \(\epsilon>0\). Then \[ P( q \geq p(1 + \epsilon)) < \exp( -M p \epsilon^2/3 ) \] and \[ P( q \leq p(1- \epsilon)) < \exp ( -M p\epsilon^2/2 ) \, . \]

The Chernoff inequality shows that the relative error (multiplicative error) of the estimator \(q\) converges in probability exponentially fast with respect to the number of trials. This bound however depends on the true probability \(p\): the exponent scales with \(p\), so that for \(p\) close to 0 the bound approaches 1.

Using the Chernoff inequality, we can bound the number of samples required to obtain a relative \(\epsilon\)-approximation with probability at least \(1-\delta\) by

\[ M \geq 3 \ln(2/\delta)/(p \epsilon^2) \, . \]

The bound depends on the unknown target probability \(p\). Hence, to use it in practice we must assume some lower bound on \(p\), such as \(p \geq 1/2\), which is not always possible, especially when the number of evidence variables is large.

3.2 Efficiency

The efficiency of Monte Carlo estimation depends on the number of sample points generated, the time taken to generate a sample point, and the time to accept or reject a sample point. We usually assume that sampling and verifying a sample point can be done efficiently (e.g., in linear time), so that the efficiency is a function of the number of sample points used.

4 Logical Sampling

Logical sampling was introduced by Henrion Hen1986 and is arguably the simplest non-enumerative algorithm for inference in Bayesian networks. It relies on the assumption that sampling a complete instantiation from the joint distribution can be performed efficiently. This is true in discrete Bayesian networks and in many directed models whose local distributions are standard (e.g., Gaussian distributions). 1 Let \(X_1,\dotsc,X_n\) be a topological ordering of the variables, and consider the evidence \(E_1=e_1,\dotsc,E_m=e_m\). We say that two (partial) configurations are consistent if they agree on the value of every variable they both fix. A high-level description of logical sampling is shown below.

  1. Initialize \(N \gets 0\)
  2. Repeat \(M\) times:
    1. For \(i=1\) to \(i=n\) generate \(x_i \sim p(X_i|X_1=x_1,\dotsc,X_{i-1}=x_{i-1})\)
    2. If \(x_1,\dotsc,x_n\) is consistent with \(e_1,\dotsc,e_m\) increase \(N\)
  3. Return \(N/M\)

Note that since the variables are ordered topologically, \(p(X_i|x_1,\dotsc,x_{i-1})=p(X_i|\pa{X_i})\) with the parents fixed at the values assigned by \(x_1,\dotsc,x_{i-1}\). The counter \(N\) counts the number of valuations consistent with the evidence, while \(M\) is the total number of valuations generated. We thus have that \[ \lim_{M \rightarrow \infty} N/M = p(E_1=e_1,\dotsc,E_m=e_m) \, . \]
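
A direct Python sketch of the procedure is shown below (see also the example network that follows). The network representation, the toy numbers, and the helper names are assumptions for illustration; sample_categorical draws a value by inverse-transform sampling (cf. footnote 1).

  import random

  def sample_categorical(dist, rng):
      # Inverse-transform sampling from a distribution given as a list of probabilities.
      u, acc = rng.random(), 0.0
      for value, prob in enumerate(dist):
          acc += prob
          if u <= acc:
              return value
      return len(dist) - 1

  def logical_sampling(variables, parents, cpt, evidence, M=100_000, seed=0):
      """Estimate p(evidence) by counting samples consistent with the evidence."""
      rng, N = random.Random(seed), 0
      for _ in range(M):
          val = {}
          for X in variables:  # variables are assumed topologically ordered
              pa = tuple(val[P] for P in parents[X])
              val[X] = sample_categorical(cpt[X][pa], rng)
          if all(val[E] == e for E, e in evidence.items()):
              N += 1
      return N / M

  # Toy network A -> B: p(A=1)=0.3, p(B=1|A=0)=0.2, p(B=1|A=1)=0.9,
  # so that p(B=1) = 0.7*0.2 + 0.3*0.9 = 0.41.
  variables, parents = ["A", "B"], {"A": (), "B": ("A",)}
  cpt = {"A": {(): [0.7, 0.3]}, "B": {(0,): [0.8, 0.2], (1,): [0.1, 0.9]}}
  print(logical_sampling(variables, parents, cpt, {"B": 1}))  # ~0.41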

The network below illustrates the sampling of a complete instantiation from a simple Bayesian network.

logsamp.png

4.1 Accuracy

The absolute and relative errors for finite samples can be derived using the Hoeffding and Chernoff inequalities, respectively. The absolute error satisfies

\[ P(|N/M - p(e_1,\dotsc,e_m)| \geq \epsilon) \leq 2\exp(-2M\epsilon^2) \, . \]

Hence, to achieve randomized \(\epsilon\)-accuracy with confidence \(1-\delta\), we need \[ M \geq \frac{\ln(2/\delta)}{2\epsilon^2} \] samples.

The relative error satisfies \[ P\left(\frac{|N/M - p(e_1,\dotsc,e_m)|}{p(e_1,\dotsc,e_m)} \geq \epsilon\right) \leq 2\exp(-Mp\epsilon^2/3) \, , \] where \(p\) abbreviates \(p(e_1,\dotsc,e_m)\).

Thus, to achieve randomized relative \(\epsilon\)-accuracy with confidence \(1-\delta\), we need \[ M \geq 3\frac{\ln(2/\delta)}{p\epsilon^2} \] samples.

4.2 Efficiency

What about the efficiency of the algorithm? An inspection of the code shows that the algorithm takes \(O(M n r \log(d) )\) time, where \(r\) is the maximum in-degree of a variable, and \(d\) is the maximum cardinality of a variable. Hence, the algorithm is polynomial as long as the number of simulations is polynomially bounded. As we have seen, a polynomial bound on the number of samples is insufficient to guarantee an accurate estimate when the desired output is small.

The algorithm can be straightforwardly modified to produce estimates of several probabilities from the same set of generated valuations by adding more counters.

5 Importance Sampling

Logical sampling is also known as rejection sampling, as it consists in generating samples from the joint distribution \(p(\mathcal{V})\) and rejecting those that are not consistent with the evidence. This leads to inefficiency and loss of accuracy, as samples are rejected with probability \(1-p(e_1,\dotsc,e_m)\), which is usually high. Ideally, we would instead like to sample directly from the restricted distribution \(p(\mathcal{V}-\{E_1,\dotsc,E_m\},e_1,\dotsc,e_m)\) of configurations consistent with the evidence; this is in many cases difficult or inefficient. Importance sampling describes a class of methods that overcome such issues by making use of a proposal distribution which approximates the restricted distribution while allowing for efficient sampling.

Importance sampling proceeds as follows. Let \(\mathcal{X}=\mathcal{V}-\{E_1,\dotsc,E_m\}\). Generate instantiations \(\nu_1,\dotsc,\nu_M\) of \(\mathcal{X}\) from a proposal distribution \(q(\mathcal{X})\) and estimate the probability \(p(e_1,\dotsc,e_m)\) as \[ \frac{1}{M} \sum_{i=1}^M W_i \,, \quad \text{where } W_i = p(\nu_i,e_1,\dotsc,e_m)/q(\nu_i) \, . \] The proposal distribution is assumed to satisfy \(q(\nu)>0\) whenever \(p(\nu,e_1,\dotsc,e_m)>0\). This ensures that every configuration with positive probability is eventually generated. When this is true, the asymptotic value of the estimate is \[ \lim_{M \rightarrow \infty} \frac{1}{M} \sum_i W_i = \sum_{\nu} \frac{p(\nu,e_1,\dotsc,e_m)}{q(\nu)}q(\nu) = p(e_1,\dotsc,e_m) \, . \] The finite-sample behavior depends on the weights \(p(\nu,e_1,\dotsc,e_m)/q(\nu)\): the closer the proposal is to the restricted distribution, the less the weights vary and the smaller the variance of the estimate.
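
As a sketch, the estimator can be written generically in Python, taking the unnormalized target \(\nu \mapsto p(\nu,e_1,\dotsc,e_m)\) and the proposal as black boxes; the toy numbers below reuse the network from the logical sampling example.

  import random

  def importance_sampling(target, q_sample, q_pmf, M=100_000, seed=0):
      """Average the weights target(nu)/q(nu) over samples nu ~ q."""
      rng = random.Random(seed)
      total = 0.0
      for _ in range(M):
          nu = q_sample(rng)
          total += target(nu) / q_pmf(nu)
      return total / M

  # Toy network A -> B with evidence B = 1: the target maps a to p(A=a, B=1),
  # and the proposal is uniform over the non-evidence variable A.
  target = lambda a: (0.7, 0.3)[a] * ((0.8, 0.2), (0.1, 0.9))[a][1]
  print(importance_sampling(target, lambda rng: rng.randrange(2), lambda a: 0.5))  # ~0.41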

6 Likelihood Weighting

When the proposal distribution is chosen carefully, importance sampling produces more reliable estimates (in the sense that they are consistent and have smaller variance). Choosing good proposal distributions is however not always easy. One particularly good strategy is likelihood weighting, where we let \(q(\mathcal{X})\) be the joint distribution of the Bayesian network obtained by re-defining \(p(E_j=e_j|\pa{E_j})=1\) and \(p(E_j \neq e_j|\pa{E_j})=0\) for every evidence variable and leaving the remaining local models unchanged. One can verify that \[ W_i = \prod_{j=1}^m p(e_j|\nu_i(\pa{E_j})) \, . \] Moreover, \(q(\nu) \geq p(\nu,e_1,\dotsc,e_m)\), so that the weights are at most 1; this makes likelihood weighting consistent and reduces the variance of the estimates (hence speeding up convergence).
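
A sketch of likelihood weighting in Python, using the same toy-network representation as the logical sampling example (the representation and names are illustrative assumptions, not a fixed API); note that no sample is ever rejected.

  import random

  def likelihood_weighting(variables, parents, cpt, evidence, M=100_000, seed=0):
      """Estimate p(evidence) by clamping evidence variables and weighting."""
      rng, total = random.Random(seed), 0.0
      for _ in range(M):
          val, w = {}, 1.0
          for X in variables:  # variables are assumed topologically ordered
              pa = tuple(val[P] for P in parents[X])
              dist = cpt[X][pa]
              if X in evidence:
                  val[X] = evidence[X]
                  w *= dist[evidence[X]]  # factor p(E_j = e_j | parents)
              else:
                  val[X] = rng.choices(range(len(dist)), weights=dist)[0]
          total += w
      return total / M

  # Same toy network A -> B as before, with evidence B = 1; estimates ~0.41.
  variables, parents = ["A", "B"], {"A": (), "B": ("A",)}
  cpt = {"A": {(): [0.7, 0.3]}, "B": {(0,): [0.8, 0.2], (1,): [0.1, 0.9]}}
  print(likelihood_weighting(variables, parents, cpt, {"B": 1}))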

6.1 Example

The proposal distribution used in likelihood weighting is equivalent to the so-called mutilated Bayesian network obtained by removing the arcs entering each \(E_j\) and fixing \(E_j=e_j\). The network below is the mutilated version of the network in the previous example for the evidence \(C=1\).

weightsamp.png

7 Bounded-Variance Algorithm

While likelihood weighting often improves on logical sampling (especially for conditional probabilities), their convergence rates are similar, and both might require exponentially many valuations to guarantee a given accuracy. The bounded-variance algorithm DL1997 uses a stopping rule to dynamically decide the number of valuations based on the current estimate, and is guaranteed to run in polynomial time when the number of evidence variables is small. The algorithm assumes that \(0 < \ell_i \leq p(E_i=e_i|\pa{E_i}) \leq u_i < 1\) for every evidence variable and proceeds as follows. Once more, assume \(X_1,\dotsc,X_n\) are ordered topologically.

  1. Define \(U = \prod_i u_i\), \(N^* = 4\ln(2/\delta)(1+\epsilon)/\epsilon^2\)
  2. Initialize \(N \gets 0, M \gets 0\)
  3. While \(N < N^*\) repeat:
    1. \(W \gets 1\)
    2. For \(i=1\) to \(i=n\): if \(X_i\) is not an evidence variable, generate \(\nu(X_i)=x_i\) with probability \(p(X_i=x_i|X_1=x_1,\dotsc,X_{i-1}=x_{i-1})\); otherwise, if \(X_i = E_j\), set \(\nu(X_i)=e_j\) and do \(W \gets W \cdot p(E_j=e_j|x_1,\dotsc,x_{i-1})\)
    3. Do \(N \gets N + W/U\) and \(M \gets M + 1\)
  4. return \(U \cdot N / M\)

The algorithm is similar in spirit to likelihood weighting, but uses a different estimator that divides by an upper bound in order to reduce (and even bound) the variance.
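
The stopping rule translates almost directly into code. Below is a sketch under the same toy-network assumptions as before; the upper bounds u on each \(p(E_j=e_j|\pa{E_j})\) must be supplied by the user.

  import random
  from math import log

  def bounded_variance(variables, parents, cpt, evidence, eps, delta, u, seed=0):
      """Sketch of the bounded-variance stopping rule (names are illustrative)."""
      U = 1.0
      for E in evidence:
          U *= u[E]  # product of upper bounds on p(E_j = e_j | parents)
      N_star = 4 * log(2 / delta) * (1 + eps) / eps ** 2
      rng, N, M = random.Random(seed), 0.0, 0
      while N < N_star:
          val, w = {}, 1.0
          for X in variables:  # variables are assumed topologically ordered
              pa = tuple(val[P] for P in parents[X])
              dist = cpt[X][pa]
              if X in evidence:
                  val[X] = evidence[X]
                  w *= dist[evidence[X]]
              else:
                  val[X] = rng.choices(range(len(dist)), weights=dist)[0]
          N += w / U  # normalized weights lie in (0, 1]
          M += 1
      return U * N / M

  variables, parents = ["A", "B"], {"A": (), "B": ("A",)}
  cpt = {"A": {(): [0.7, 0.3]}, "B": {(0,): [0.8, 0.2], (1,): [0.1, 0.9]}}
  print(bounded_variance(variables, parents, cpt, {"B": 1}, 0.1, 0.05, {"B": 0.9}))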

So far we have discussed algorithms for inference in Bayesian networks, where we can generate valuations directly from the model. This is no longer true in Markov networks. Assume that we want to compute the partition function of a distribution \(p(\mathcal{V})=\phi(\mathcal{V})/Z\) as in Markov networks. We generate valuations \(\nu_1,\dotsc,\nu_M\) from some proposal distribution \(q(\mathcal{V})\), and compute \[ {Z}_e = \frac{1}{M} \sum_{i=1}^M {W}_i \,, \quad \text{where } {W}_i = \phi(\nu_i)/q(\nu_i) \, . \] In the limit, we have that \[ \lim_{M \rightarrow \infty} \frac{1}{M} \sum_i {W}_i = \sum_{\nu} \frac{\phi(\nu)}{q(\nu)}q(\nu) = Z \, . \] Hence, the partition function can be computed much like a probability of evidence (and vice versa). Any probability can now be computed by importance sampling with weights \([\phi(\nu)/Z]/q(\nu)\). It remains to decide how to choose a proposal distribution. One possibility is to fix an arbitrary ordering of the variables and locally normalize every clique potential to turn the Markov network into a Bayesian network with local models \(p(X_i|\pa{X_i}) \propto \phi(\mathcal{C}_j)/p(\pa{X_i})\), where \(\mathcal{C}_j\) is a clique containing \(X_i\). This will often require chordalizing the network. A usually more successful alternative is Gibbs sampling, which we discuss next.
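
For instance, reusing the importance_sampling sketch from above, the partition function of a toy pairwise model can be estimated as follows (the potential and the uniform proposal are made up for illustration).

  from math import exp

  # Two binary variables X1 - X2 with phi(x1, x2) = e if x1 == x2 else 1,
  # so that Z = 2e + 2 ~ 7.44; the proposal is uniform, q(nu) = 1/4.
  phi = lambda nu: exp(1.0) if nu[0] == nu[1] else 1.0
  Z = importance_sampling(phi, lambda rng: (rng.randrange(2), rng.randrange(2)),
                          lambda nu: 0.25)
  print(Z)  # ~7.44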

8 Gibbs Sampling

Gibbs sampling is a general scheme for sampling from complex distributions where generating samples from local conditional models is easy.2 The algorithm can be succinctly described as follows. Consider any ordering of the variables \(X_1,\dotsc,X_n\).

  1. Initialize \(\nu_0\) with \(\nu_0(E_i)=e_i\) and otherwise arbitrarily, and set \(N \gets 0\), \(j \gets 1\)
  2. Repeat for a number of steps \(M\):
    1. For \(i=1\) to \(i=n\) generate \(\nu_j(X_i)=x_i\) with probability \(p(X_i=x_i|X_1=\nu_{j}(X_1),\dotsc,X_{i-1}=\nu_{j}(X_{i-1}),X_{i+1}=\nu_{j-1}(X_{i+1}),\dotsc,X_n=\nu_{j-1}(X_n))\)
    2. If \(\nu_j\) is consistent with \(e_1,\dotsc,e_m\) increment \(N\)
    3. Increment \(j\)
  3. Return \(N/M\)

Note that a value \(X_i=x_i\) is generated given a full valuation of all other variables (the values of \(X_1,\dotsc,X_{i-1}\) from the current sweep and those of \(X_{i+1},\dotsc,X_n\) from the previous one). Hence, the conditional probability in step 2.1 reduces to the conditional probability of \(X_i\) given its Markov blanket. One can show that \(p(X_i|\nb{X_i})\) can be efficiently computed in Markov networks (note that \(\nb{X_i}\) is the Markov blanket of \(X_i\)) and used to generate values \(x_i\). The estimator \(N/M\) is known as the histogram estimator.

Under some mild conditions, as \(M \rightarrow \infty\), the valuations \(\nu_j\) are generated according to the joint distribution \(p(\mathcal{V})\) of the Bayesian network or Markov network. Hence, we can use Gibbs sampling to solve many sorts of inference problems (computing the partition function, computing all marginals, computing an expectation, etc.). When computing conditional probabilities, we can clamp the valuations at the evidence; Gibbs sampling is then guaranteed asymptotically to generate valuations from \(p(\mathcal{X}|e_1,\dotsc,e_m)\). In this case a theoretically improved estimator is the mixture estimator: \[ p(x_q|e_1,\dotsc,e_m) \approx \frac{1}{M} \sum_{j=1}^M p(X_q=x_q|\nu_j(\nb{X_q})) \, , \] which averages the conditional probability of the query given its Markov blanket over the generated valuations.
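
Below is a minimal Gibbs sketch for binary variables that estimates conditional marginals with the histogram estimator. For simplicity it evaluates the full unnormalized joint at both values of \(X_i\), a slow but correct stand-in for computing \(p(X_i|\nb{X_i})\) from the Markov blanket; the network, evidence, and names are illustrative assumptions.

  import random

  def gibbs_marginals(phi, n_vars, evidence, M=50_000, burn_in=5_000, seed=0):
      """Estimate p(X_i = 1 | evidence) for every i by single-site Gibbs updates."""
      rng = random.Random(seed)
      nu = [evidence.get(i, 0) for i in range(n_vars)]
      counts = [0] * n_vars
      for step in range(M + burn_in):
          for i in range(n_vars):
              if i in evidence:
                  continue  # evidence variables stay clamped
              # Evaluate the unnormalized joint at both values of X_i.
              p1 = phi(tuple(nu[:i] + [1] + nu[i + 1:]))
              p0 = phi(tuple(nu[:i] + [0] + nu[i + 1:]))
              nu[i] = 1 if rng.random() < p1 / (p0 + p1) else 0
          if step >= burn_in:  # burn-in valuations are discarded
              for i in range(n_vars):
                  counts[i] += nu[i]
      return [c / M for c in counts]

  # Toy network A -> B from before with evidence B = 1:
  # p(A=1|B=1) = 0.27/0.41 ~ 0.659 (the clamped B reports 1.0).
  phi = lambda nu: (0.7, 0.3)[nu[0]] * ((0.8, 0.2), (0.1, 0.9))[nu[0]][nu[1]]
  print(gibbs_marginals(phi, 2, {1: 1}))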

Analyzing the finite-sample convergence of Gibbs sampling is difficult, and in general there are no non-trivial upper bounds on the number of iterations necessary to guarantee good performance. There are many convergence tests that can be used to assess the convergence of the estimates. For example, a simple approach is to run many independent simulations concurrently (using different initializations) and to declare convergence when the estimates produced by the different runs become similar. This is usually accomplished by comparing the variance of each estimate with respect to the overall estimate (the one using valuations from all runs). Upon convergence, this variance should be small (see Box 12.B on page 522 of KF2009).

The consistency of the estimates produced by Gibbs sampling is only guaranteed asymptotically. For small sample sizes, consecutive valuations are correlated, which can greatly bias the estimates produced. To decrease the effect of the initialization and the correlation between valuations, practitioners often adopt a burn-in phase, during which the generated valuations are discarded. It is customary to use a burn-in of thousands of valuations (these contribute to neither \(N\) nor \(M\)).

Gibbs sampling has weaker theoretical guarantees than logical sampling and importance sampling. On the other hand, its convergence is generally not affected by the magnitude of the probability being sought, which makes it particularly competitive when computing the probability of rare events. There are also many improvements (e.g., block and Rao-Blackwellized Gibbs sampling) that can speed up convergence of the algorithm.

All methods presented here can be used with continuous variables (as long as samples from the local distributions can be generated efficiently) with little additional effort. Sampling methods are usually the method of choice when devising new complex models (e.g., specialized topic models and probabilistic logics).

9 Exercises

  1. Let \((G,p)\) be a Markov network factorizing as \(p(\mathcal{V})=\prod_j \phi_j(\mathcal{C}_j)/Z\). Show that \[ p(X | \nb{X}) = \frac{ \prod_{j: X \in \mathcal{C}_j} \phi_j(\mathcal{C}_j) }{ \sum_{X} \prod_{j: X \in \mathcal{C}_j} \phi_j(\mathcal{C}_j) } \] for any variable \(X \in \mathcal{V}\).
  2. Obtain a similar expression for \(p(X | MB(X))\) in the case of Bayesian networks, where \(MB(X)\) denotes the Markov Blanket of \(X\) (i.e., parents, spouses and children).
  3. Implement the likelihood weighting and Gibbs sampling algorithms, and evaluate them on Bayesian networks learned from datasets. Consider aspects such as accuracy (absolute and relative errors), convergence (accuracy vs. number of samples), efficiency (accuracy vs. runtime), the effect of the amount of evidence, the effect of model size (number of variables), and the effect of the desired inference (the true probability).

Bibliography

  • [Hen1986] Henrion, Propagating Uncertainty in Bayesian Networks by Probabilistic Logic Sampling, in: Proceedings of the 2nd Conference on Uncertainty in Artificial Intelligence, 149-163 (1986).
  • [KF2009] Koller & Friedman, Probabilistic Graphical Models, MIT Press (2009).
  • [Dar2009] Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press (2009).
  • [DL1997] Dagum & Luby, An optimal approximation algorithm for Bayesian inference, Artificial Intelligence, 93(1-2), 1-27 (1997).

Footnotes:

1
To learn how to draw a sample from a discrete probability distribution, see Box 12.A on page 489 of KF2009 and the footnote on page 379 of Dar2009.
2
The theoretical underpinnings of the algorithm are out of the scope of this course.

Author: Denis D. Mauá

Created: 2018-10-26 Fri 15:27
