Introduction to Probabilistic Graphical Reasoning

Table of Contents

In this lecture, we meet an informal introduction to probabilistic reasoning using graphical models. We discuss compact representation of joint probability distributions, and graphical representation of independences.

1 Motivation

Our knowledge about the aspects of the world is most often uncertain; we are seldom able to determine with precision the chain of events that will follow an action, to pinpoint the cause of an observed fact, or to ascribe the truth of a statement. For many tasks, an improper treatment of uncertainty leads to poorly designed solutions that behave unsatisfactorily. A proper treatment of uncertainty is founded on three cornerstones:

Sound and intuitive representational scheme
We need a sound representational scheme to encode our knowledge base. This scheme needs to allow for describing knowledge intuitively and in a computationally efficient way.
Sound and efficient reasoning
We need a reasoning scheme to derive new facts from our knowledge base, thus making our system more than a simple repository. To be useful, the reasoning scheme needs to produce sound knowledge using a reasonable amount of computational resources.
Useful and efficient learning
We need a learning scheme that can help us to transform data/facts into knowledge, thus easing the knowledge specification burden. This scheme must infer meaningful knowledge, and be able to cope with our background knowledge. Of course, if it is to be used in a real-world application, it needs to be implemented efficiently.

Probabilistic graphical models are frameworks for addressing all the above desiderata. They combine probability theory and graph theory to represent uncertain knowledge in a sound and efficient way. They use state-of-the-art optimization techniques to achieve efficient reasoning and learning. Finally, they are built around a graphical language that facilitates the communication of knowledge and exposes modeling assumptions. As Jordan puts it, "[probabilistic graphical models are] a natural tool for dealing with two problems that occur throughout applied mathematics and engineering—uncertainty and complexity—and, in particular, they are playing an increasingly important role in the design and analysis of machine learning algorithms" Jor1998.

The key insight of probabilistic graphical modeling is the development of tools to handle and exploit probabilistic independence that were advanced in the late 1980's and early 1990's Pea1988,Lau1996,CDLS1999. In particular, Pearl 1988's influential book Pea1988 and the creation of the conference series Uncertainty in Artificial Intelligence lead to the dominance of probability theory as the representational framework for representing uncertainty in AI, to the detriment of other uncertainty formalisms that abounded at the time.

2 Reasoning with uncertain knowledge

Many interesting phenomena can only be adequately described probabilistically, that is, by associating events with their probabilities of occurrence. Every probabilistic information one can represent about a finite domain can be encoded as a joint probability distribution function

\[ p(X_1,X_2,\dotsc,X_n) \, , \]

over a set of categorical random variables, that is, by a function that associates with every realization \(X_1=x_1,\dotsc,X_n=x_n\) a probability value in \([0,1]\) such that the sum over all probability values adds up to one. For the sake of simplicity let us assume that all random variables are binary (i.e., they take on values 0 or 1 only). To get a sense of this type of modeling, consider identifying junk messages (spam) from a collection of sms texts. Table 1 shows examples of sms messages categorized into spam/ham (taken from the dataset available at this site).

Table 1: Sample of sms messages.
ham Hope you are having a good week. Just checking in
ham K..give back my thanks.
ham Am also doing in cbe only. But have to pay.
spam complimentary 4 STAR Ibiza Holiday or L10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out!
spam Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm

Finding a set of rules that perfectly discriminates between spams and hams (i.e., not spams) is too difficult. Instead, we can produce a probability of a message with given text being spam by specifying \( p(C,W_1,\dotsc,W_m) \) where \( C \) denotes whether the message is spam/ham and each \( W_i \) denotes the occurrence of a certain keyword in the message (e.g., money or thanks). Then one can answer inferences such as the conditional probability of an message being spam given its constituent keywords by computing

\[ p(C=1|W_1=w_1,\dotsc,W_m=w_m) = \frac{p(C=1,W_1=w_1,\dotsc,W_m=w_m)}{p(W_1=w_1,\dotsc,W_m=w_m)} \, , \]

and inferences such as finding the most probable keywords of a spam by

\[ \arg\max_{w_1,\dotsc,w_m} p(W_1=w_1,\dotsc,W_m=w_m|C=0) \, . \]

All these inferences can be derived from the joint probability distribution and the rules of probability theory. We can also learn the probability values \(p(C=c,W_1=w_1,\dotsc,W_m=w_m)\) from a dataset of observations using techniques from statistical inference.

3 Modeling independences

A naive approach to eliciting and representing the function \( p(C,W_1,\dotsc,W_m) \) would be to compute all the frequencies of ocurrences of every combination of keywords in a collection of documents of spams and hams and store them in table. This approach is inefective since it requires both a large amount of storage to represent and a very large amount of data to reliably estimate the joint probabilities. We can alleviate either problem by assuming some conditional independences to hold among the variables. For instance, one of the simplest, yet usually highly effective, independence model is the so-called Naive Bayes Classifier, which considers all keyword occurrence variables independent conditional on the class variable. Then, using the chain rule of probability theory, we have, under this assumption, the following factorization:

\[ p(C,W_1,\dotsc,W_m) = p(C)p(W_1|C) \dotsb p(W_m|C) \, . \]

Note that the factorization requires only \(4m+2\) probability values to be estimated and stored, as opposed to the \(2^m\) probability values of the tabular form (in fact, due to the normalization property of probability distributions the factorized form requires \(2m+1\) parameters while the non-factorized requires \(2^m-1\) values).

We can represent this factorization graphically by depicting every variable as a node, and drawing an arc from variable \(A\) to variable \(B\) if there is a factor where \(A\) appears "before" the conditioning bar and \(B\) appears "after" the conditioning bar. The result for the distribution above is shown in Figure 1.

This representation, known as Bayesian network or directed graphical model, is useful for communicating complex independence relationships among variables, and for deriving "causal" explanations (by extending the semantics of the direction of arcs given). The graphical structure also enables efficient computation of inferences for a large class of models, as well as efficient algorithms that recover the structure of independences from data KFL2001,WJ2008,Dar2009,KF2009. The local unconditional and conditional probability distributions \(p(C)\) and \(p(W_i|C)\), \(i=1,\dotsc,n\), are the parameters of our probabilistic model. These parameters can be elicited from domain experts (let us say, linguists in our text categorization example), or learned from a dataset, that is, estimated using statistical inference techniques.

4 Towards more complex independence models

The Naive Bayes Bayesian network performs surprisingly well for detecting spam messages, despite its strong (and largely incorrect) independence assumptions. For more complex tasks, however, these assumptions need to be relaxed to produce more satisfactory results. For instance, suppose we want to build a model of typical sms messages, that is, a probability distribution that assigns high probability for sms messages and small probability for any other text. In other words, we want to specify

\[ p(W_1,\dotsc,W_m) \, . \]

First, note that we can specify the above distribution as a weighted sum of class-conditional keyword distribution:

\[ p(W_1,\dotsc,W_m) = p(W_1,\dotsc,W_m|C=0)p(C=0) + p(W_1,\dotsc,W_m|C=1)p(C=1) \, . \]

We could follow the previous assumption and assume that keywords occur conditionally independently given the class variable (which is now unknown), that is, that

\[ p(W_1,\dotsc,W_m|C) = p(W_1|C)p(W_2|C) \dotsb p(W_m|C) \, . \]

However, assuming independence between keywords introduces a large bias in the model since some keywords are highly correlated in a order-specific way (e.g., the keyword "you" is often followed by keyword "are" but not the converse). So assume instead that \(W_i\) represents the \(i\)-th keyword in the message (note that in this representation different occurrences of a keyword are represented by different variables \(W_i\)), and take values in the set of possible keywords (so the random variables are not binary but take on a finite set of values representing different possible words from the vocabulary). In principle, the number of variables thus depends on the number of occurrences of keywords in the message; to simplify matters, assume that we only consider messages of a fixed length \(n\). A common model that accounts for the ordering of words is to assume conditional independence of a keyword from other keywords given the type of message (spam/ham) and the precedent keyword. In this model, we have:

\[ p(W_1,\dotsc,W_n|C) = p(W_1|C)p(W_2|W_1,C) \dotsb p(W_n|W_{n-1},C) \, . \]


Figure 2: A simple tree-augmented Naive Bayes models of sms messages.

Figure 2 shows a graphical representation of this models for a message with three words. This is an example of a tree-structured naive Bayes model, since the graph obtained by discarding the class variable is a tree (every variable has at most one parent). This model requires \(2(n-1)m(m-1)+2m+1\) parameters, a small increase with respect to the Naive Bayes model (roughly twice the number of parameters).

5 Latent and hidden variables

The previous model increases the representation power at the cost of doubling the number of parameters. We can continue to increase the representation power by allowing longer range dependencies, that is, by modeling conditional independence of words given the two precedent words. The model in this case requires \(8(n-1)m^2(m-1)+2m(m-1)+2(m-1) + 1\). More and more expressive models can be obtained by considering longer range dependencies at the cost of an exponential increase in the number of parameters.

One way to increase the expressivity of the model while retaining its succinctness is to allow for latent variables in our model, that is, variables which are not observed in our dataset. For example, we can assume that each word \(W_i\) is annotated with its grammatical role \(N_i\) in the text (the so-called part-of-speech tags such as noun, verb, etc). Then we can assume that the occurrence of a word is conditionally indepedent of all other words given its grammatical role, and that the grammatical role of a word is conditionally independent of other random variables (roles and words) given its phrase grammatical role

\[ p(W_1,\dotsc,W_n) = \sum_{N_1}\dotsb\sum_{N_n} p(N_1)p(W_1|N_1) \prod_{i=2}^n p(N_i|N_{i-1})p(W_i|N_i) \, , \]

where \(N_i\) are random variables denoting the grammatical role of the \(i\)-th word \(W_i\). Figure 3 shows a graphical representation of this model for a 3-word message. This type of model is usually know as Hidden Markov Model, due to the presence of hidden variables which are not present at inference time.


Figure 3: A hidden Markov model of sms messages.

6 Continuous random variables

So far, we have limited the discussion to categorical random variables, that is, random variables with a finite domain. But this need not to be so. Suppose we want to investigate the relation between salaries, age and education. We can then assume that salary is a (deterministic) linear function of age and education plus some unaccounted factors that introduce variability from individual to individual. That is, if we let \(S\) denote salary, \(A\) denote age (in years) and \(E\) denote education (in years of schooling), we have that

\[ S = \mu_A A + \mu_E E + \mu_0 + U_1 \, , \]

where \(\mu_A\), \(\mu_E\) and \(\mu_0\) are real-valued parameters of our model and \(U_1\) denotes the influence of the unaccounted factors. We can also assume that education is related to age in a nonlinear form:

\[ E = \rho_1\left(1-\exp(\rho_2 A)\right) + U_2 \, , \]

where \(\rho_1\) and \(\rho_2\) are again real-valued parameters, and \(U_2\) represents unaccounted factors. Assume also that these unaccounted factors are distributed according to a Normal (or Gaussian) distribution:

\[ p(U_1 = u) = \frac{1}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{(u-\mu_1)^2}{2\sigma^2}\right) \, , \]

and similarly for \(U_2\). And finally, we can assume some convenient distribution for ages (e.g., a Gamma distribution).


Figure 4: A Bayesian networks representing a causal model of salary, age and education.

This model can be represented graphically as in Figure 4.

7 Mixed random variables

With some care, we can also mix continuous and discrete random variables. Consider again the tree-augmented Naive Bayes model of sms messages. Assume that the words in our vocabulary are represented by integers \(1,2,\dotsc,m\) (so each \(W_i\) takes values in this set). We can assume that the probability of a keyword occurring in a certain position depends only on the previous keyword, on the class, and on parameters \(\theta_{c,j,k}\) such that \[ p(W_i=k|W_{i-1}=j,C=c,\theta) = \theta_{c,j,k} \, , \] where \(\theta\) collects all such \(\theta_{c,j,k}\). In other words, the conditional probability of occurrence of a keyword does not depend on its particular position, but on its relative position to other keyword occurrences. We can further assume these parameters \(\theta_{cjk}\) (which are also random variables in our model) are unconditionally independent and assume a Dirichlet distribution. Thus, we have that

\[ p(W_1,\dotsc,W_n) = \int_\theta \sum_{C} p(C)p(W_1|C)\prod_{i=2}^n \theta_{C,W_{i-1},W_i} d\,\theta \, . \]

Note that the number of parameters is now independent of the length of the message, and can thus be applied to messages of any size.

8 Undirected models

So far, we have only considered models where the independence assumptions lead to a factorization of the joint distribution in terms of local conditional probability models. There are also models where independences lead to factorization in terms of more generic functions. For example, consider a farming field segmented in smaller contiguous regions \(R_A\), \(R_B\), \(R_C\) and \(R_D\) such that \(R_A\) neighbors \(R_B\) and \(R_C\), which are also adjacent to each other and to \(R_D\). The yield of a region correlates only with its neighboring regions, so that the yield of a region is conditionally independent of its non-neighboring regions given the yield of the neighboring regions. Let \(A,B,C,D\) denote the yield of regions \(R_A\), \(R_B\), \(R_C\) and \(R_D\), respectively, take values high and low. According to out assumptions, we have that \(A\) and \(D\) are conditionally independent given \(B\) and \(C\), and \(B\) and \(C\) are conditionally independent given \(A\) and \(C\). Moreover, every two variables are dependent conditional on any other set of variables, and unconditionally. One can show that these independence relations cannot be expressed as a product of conditional probabilities as before. However, we can express them in terms of nonnegative pairwise factors:

\[ p(A,B,C,D) = \frac{1}{Z} \phi(A,B) \phi(A,C) \phi(B,C) \phi(B,D) \phi(C,D) \, , \]


\[ Z = \sum_A \sum_B \sum_C \sum_D \phi(A,B) \phi(A,C) \phi(B,C) \phi(B,D) \phi(C,D) \, , \]

is a normalizing constant. We can represent this model graphically by connecting nodes with an (undirected) edge if they co-occur as arguments of a factor. Figure 5 produces this vizualization for the model above.


Figure 5: A Markov network representing yield of neighboring field regions.

This type of model is called Markov network or undirected graphical model.


  • [Jor1998] Michael Jordan, Learning in Graphical Models, MIT Press (1998).
  • [Pea1988] Judea Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann (1988).
  • [Lau1996] Stephen Lauritzen, Graphical Models, Oxford (1996).
  • [CDLS1999] Cowell, Dawid, Lauritzen & Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag (1999).
  • [KFL2001] Kschischang, Frey & Loeliger, Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory, 47(2), 498-519 (2001).
  • [WJ2008] Martin Wainwright & Michael Jordan, Graphical Models, Exponential Families, and Variational Inference, Foundations and Trends in Machine Learning, 1(1-2), 1-305 (2008).
  • [Dar2009] Adnan Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press (2009).
  • [KF2009] Koller & Friedman, Probabilistic Graphical Models, MIT press (2009).

Author: Denis D. Mauá

Created: 2019-10-11 Fri 11:40