# Introduction to Probabilistic Graphical Reasoning

## Overview

This lecture gives an informal introduction to probabilistic reasoning using graphical models. We discuss the compact representation of joint probability distributions, and the graphical representation of independences.

## 1 Motivation

Our knowledge about the world is most often uncertain; we are seldom able to determine with precision the chain of events that will follow an action, to pinpoint the cause of an observed fact, or to ascertain the truth of a statement. For many tasks, an improper treatment of uncertainty leads to poorly designed solutions that behave unsatisfactorily. A proper treatment of uncertainty rests on three cornerstones:

- **Sound and intuitive representation.** We need a sound representational scheme to encode our knowledge base. This scheme must allow knowledge to be described intuitively and in a computationally efficient way.
- **Sound and efficient reasoning.** We need a reasoning scheme to derive new facts from our knowledge base, thus making our system more than a simple repository. To be useful, the reasoning scheme must produce sound conclusions using a reasonable amount of computational resources.
- **Useful and efficient learning.** We need a learning scheme that can transform data/facts into knowledge, thus easing the knowledge specification burden. This scheme must infer meaningful knowledge and be able to incorporate our background knowledge. Of course, if it is to be used in a real-world application, it must also be implemented efficiently.

Probabilistic graphical models are a framework for addressing all of the above desiderata. They combine probability theory and graph theory to represent uncertain knowledge in a sound and efficient way. They use state-of-the-art optimization techniques to achieve efficient reasoning and learning. Finally, they are built around a graphical language that facilitates the communication of knowledge and exposes modeling assumptions. As Jordan puts it, "[probabilistic graphical models are] a natural tool for dealing with two problems that occur throughout applied mathematics and engineering—uncertainty and complexity—and, in particular, they are playing an increasingly important role in the design and analysis of machine learning algorithms" [Jor1998].

Probabilistic graphical modeling builds on tools for handling and exploiting probabilistic independence that were developed in the late 1980s and early 1990s [Pea1988,Lau1996,CDLS1999]. In particular, Pearl's influential 1988 book [Pea1988] and the creation of the conference series Uncertainty in Artificial Intelligence led to the dominance of probability theory as the framework for representing uncertainty in AI, to the detriment of other uncertainty formalisms that abounded at the time.

## 2 Reasoning with uncertain knowledge

Many interesting phenomena can only be adequately described
probabilistically, that is, by associating events with their
probabilities of occurrence. Any probabilistic information
about a *finite* domain can be encoded as a
**joint probability distribution** function

\[ p(X_1,X_2,\dotsc,X_n) \, , \]

over a set of **categorical random variables**, that is, by a
function that associates with every realization
\(X_1=x_1,\dotsc,X_n=x_n\) a probability value in \([0,1]\)
such that the probability values sum to one.
For the sake of simplicity, let us assume that all random
variables are *binary* (i.e., they take on values 0 or 1
only). To get a sense of this type of modeling, consider
identifying junk messages (spam) in a collection of SMS
texts. Table 1 shows examples of SMS messages
categorized as spam or ham (taken from the dataset available at
this site).

| Class | Message |
|-------|---------|
| ham | Hope you are having a good week. Just checking in |
| ham | K..give back my thanks. |
| ham | Am also doing in cbe only. But have to pay. |
| spam | complimentary 4 STAR Ibiza Holiday or L10,000 cash needs your URGENT collection. 09066364349 NOW from Landline not to lose out! |
| spam | Dear Dave this is your final notice to collect your 4* Tenerife Holiday or #5000 CASH award! Call 09061743806 from landline. TCs SAE Box326 CW25WX 150ppm |

Table 1: Examples of SMS messages labeled as spam or ham.

Finding a set of rules that perfectly discriminates between
spams and hams (i.e., non-spams) is too difficult. Instead, we
can produce the probability that a message with a given text is
spam by specifying \( p(C,W_1,\dotsc,W_m) \), where \( C \)
denotes whether the message is spam or ham and each \( W_i \)
denotes the occurrence of a certain keyword in the message
(e.g., money or thanks). One can then answer **inferences** such
as the **conditional probability** of a message being spam
given its constituent keywords by computing

\[ p(C=1|W_1=w_1,\dotsc,W_m=w_m) = \frac{p(C=1,W_1=w_1,\dotsc,W_m=w_m)}{p(W_1=w_1,\dotsc,W_m=w_m)} \, , \]

and inferences such as finding the most probable keywords of a spam message by

\[ \arg\max_{w_1,\dotsc,w_m} p(W_1=w_1,\dotsc,W_m=w_m|C=1) \, . \]

All these inferences can be derived from the joint probability
distribution and the rules of probability theory. We can also
**learn** the probability values \(p(C=c,W_1=w_1,\dotsc,W_m=w_m)\)
from a dataset of observations using techniques from
statistical inference.
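To make this concrete, the following Python sketch tabulates an invented joint distribution over a class variable \(C\) and two keyword variables \(W_1, W_2\) (all probability values are made up for illustration), and computes the conditional probability of spam exactly as in the formula above.

```python
# A toy joint distribution p(C, W1, W2) over binary variables, stored as a
# table; the probability values are invented for illustration and sum to one.
joint = {
    # (c, w1, w2): probability
    (0, 0, 0): 0.30, (0, 0, 1): 0.14, (0, 1, 0): 0.11, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.10, (1, 1, 1): 0.15,
}

def conditional_spam(w1, w2):
    """p(C=1 | W1=w1, W2=w2): the joint divided by the marginal of the evidence."""
    marginal = sum(joint[(c, w1, w2)] for c in (0, 1))
    return joint[(1, w1, w2)] / marginal

# p(C=1 | W1=1, W2=1) = 0.15 / (0.10 + 0.15 - 0.10 + 0.05) = 0.15 / 0.20 = 0.75
print(conditional_spam(1, 1))
```

The same mechanical pattern extends to any number of variables; the catch, as discussed next, is that the table itself grows exponentially.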

## 3 Modeling independences

A naive approach to eliciting and representing the function \(
p(C,W_1,\dotsc,W_m) \) would be to compute the frequencies
of occurrence of every combination of keywords in a collection
of spam and ham documents and store them in a table. This
approach is ineffective, since it requires both a large amount
of storage and a very large amount of data to
reliably estimate the joint probabilities. We can alleviate
both problems by *assuming* that some **conditional independences**
hold among the variables. For instance, one of the
simplest, yet usually highly effective, independence models is
the so-called **Naive Bayes classifier**, which considers all
keyword occurrence variables independent conditional on the
class variable. Using the **chain rule of probability
theory**, we then have, under this assumption, the following
**factorization**:

\[ p(C,W_1,\dotsc,W_m) = p(C)p(W_1|C) \dotsb p(W_m|C) \, . \]

Note that the factorization requires only \(4m+2\) probability values to be estimated and stored, as opposed to the \(2^{m+1}\) probability values of the tabular form (in fact, due to the normalization property of probability distributions, the factorized form requires only \(2m+1\) free parameters, while the tabular form requires \(2^{m+1}-1\)).
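A minimal sketch of this factorization in Python may help; the prior \(p(C)\) and the conditionals \(p(W_i|C)\) below are made-up values for three hypothetical keywords, so only \(2m+1=7\) numbers are specified.

```python
# A minimal Naive Bayes sketch with invented parameters for m = 3 keywords.
p_c = 0.4                          # p(C=1), prior probability of spam (invented)
p_w = [                            # p(W_i=1 | C=c), indexed [i][c] (invented)
    [0.05, 0.60],                  # hypothetical keyword "free"
    [0.30, 0.10],                  # hypothetical keyword "thanks"
    [0.02, 0.50],                  # hypothetical keyword "urgent"
]

def joint(c, words):
    """p(C=c, W_1=w_1, ..., W_m=w_m) via the Naive Bayes factorization."""
    prob = p_c if c == 1 else 1.0 - p_c
    for i, w in enumerate(words):
        theta = p_w[i][c]
        prob *= theta if w == 1 else 1.0 - theta
    return prob

def posterior_spam(words):
    """p(C=1 | words), by normalizing the two joint values."""
    num = joint(1, words)
    return num / (num + joint(0, words))

print(posterior_spam([1, 0, 1]))   # "free" and "urgent" present, "thanks" absent
```

Note how the classifier never materializes the \(2^{m+1}\)-entry table: every query is answered from the \(2m+1\) local parameters.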

We can represent this factorization graphically by depicting every variable as a node, and drawing an arc from variable \(A\) to variable \(B\) whenever there is a factor in which \(A\) appears after the conditioning bar (as a conditioning variable) and \(B\) appears before it. The result for the distribution above is shown in Figure 1.

This representation, known as a **Bayesian network** or **directed
graphical model**, is useful for communicating complex
independence relationships among variables, and for deriving
"causal" explanations (by extending the semantics of the
direction of the arcs). The graphical structure also enables
efficient computation of inferences for a large class of
models, as well as efficient algorithms that recover the
**structure** of independences from data
[KFL2001,WJ2008,Dar2009,KF2009]. The **local** unconditional
and conditional probability distributions \(p(C)\) and
\(p(W_i|C)\), \(i=1,\dotsc,m\), are the parameters of our
probabilistic model. These parameters can be **elicited** from
domain experts (say, linguists in our text
categorization example), or **learned** from a dataset, that is,
estimated using statistical inference techniques.

## 4 Towards more complex independence models

The Naive Bayes network performs surprisingly well at detecting spam messages, despite its strong (and largely incorrect) independence assumptions. For more complex tasks, however, these assumptions need to be relaxed to produce satisfactory results. For instance, suppose we want to build a model of typical SMS messages, that is, a probability distribution that assigns high probability to SMS messages and low probability to any other text. In other words, we want to specify

\[ p(W_1,\dotsc,W_m) \, . \]

First, note that we can specify the above distribution as a weighted sum of class-conditional keyword distributions:

\[ p(W_1,\dotsc,W_m) = p(W_1,\dotsc,W_m|C=0)p(C=0) + p(W_1,\dotsc,W_m|C=1)p(C=1) \, . \]

We could, as before, assume that keywords occur conditionally independently given the class variable (which is now unobserved), that is, that

\[ p(W_1,\dotsc,W_m|C) = p(W_1|C)p(W_2|C) \dotsb p(W_m|C) \, . \]

However, assuming independence between keywords introduces a
large bias in the model, since some keywords are highly
correlated in an order-specific way (e.g., the keyword "you" is
often followed by the keyword "are", but not the converse). So
assume instead that \(W_i\) represents the \(i\)-th keyword in
the message (note that in this *representation* different
occurrences of a keyword are represented by different
variables \(W_i\)), taking values in the set of possible
keywords (so the random variables are no longer binary, but take on a
finite set of values representing the different possible words
in the vocabulary). In principle, the number of variables
thus depends on the number of keyword occurrences in the
message; to simplify matters, assume that we only consider
messages of a fixed length \(n\). A common model that accounts
for the ordering of words assumes conditional
independence of a keyword from all other keywords given the type
of message (spam/ham) and the preceding keyword. In this
model, we have:

\[ p(W_1,\dotsc,W_n|C) = p(W_1|C)p(W_2|W_1,C) \dotsb p(W_n|W_{n-1},C) \, . \]

Figure 2: A simple tree-augmented Naive Bayes model of SMS messages.

Figure 2 shows a graphical representation of this model
for a message with three words. This is an example of a
**tree-augmented Naive Bayes model**, since the graph obtained
by discarding the class variable is a tree (every variable has
at most one parent). This model requires \(2(n-1)m(m-1)+2(m-1)+1\)
parameters: each bigram factor \(p(W_i|W_{i-1},C)\) has \(m\)
times as many parameters as the corresponding Naive Bayes
factor \(p(W_i|C)\).
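The following sketch evaluates this bigram factorization for an invented 3-word vocabulary with made-up probability tables; unlike Naive Bayes, the resulting likelihood is sensitive to word order.

```python
# A sketch of the bigram factorization p(W_1,...,W_n | C) over a toy
# 3-word vocabulary {0, 1, 2}; all probability tables are invented.
p_first = {                        # p(W_1 = w | C = c), indexed [c][w]
    0: [0.5, 0.3, 0.2],
    1: [0.2, 0.2, 0.6],
}
p_next = {                         # p(W_i = w | W_{i-1} = v, C = c): [c][v][w]
    0: [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]],
    1: [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],
}

def likelihood(words, c):
    """p(W_1=w_1, ..., W_n=w_n | C=c) as a chain of bigram factors."""
    prob = p_first[c][words[0]]
    for prev, cur in zip(words, words[1:]):
        prob *= p_next[c][prev][cur]
    return prob

# The bigram model distinguishes word orderings, unlike Naive Bayes:
print(likelihood([0, 1, 2], c=0), likelihood([2, 1, 0], c=0))
```

Every row of the tables above sums to one, so each factor is a proper conditional distribution.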

## 5 Latent and hidden variables

The previous model increases the representational power at the cost of more parameters. We can continue to increase the representational power by allowing longer-range dependencies, for instance by modeling the conditional independence of each word given the two preceding words. The model in this case requires \(2(n-2)m^2(m-1)+2m(m-1)+2(m-1)+1\) parameters. More and more expressive models can be obtained by considering longer-range dependencies, at the cost of an exponential increase in the number of parameters with the range of the dependency.

One way to increase the expressivity of the model while
retaining its succinctness is to allow for **latent variables**,
that is, variables which are not observed in our
dataset. For example, we can assume that each word \(W_i\) is
annotated with its grammatical role \(N_i\) in the text (the
so-called part-of-speech tags, such as noun, verb, etc.). We
can then assume that the occurrence of a word is conditionally
independent of all other words given its grammatical role, and
that the grammatical role of a word is conditionally
independent of all other random variables (roles and words) given
the grammatical role of the preceding word:

\[ p(W_1,\dotsc,W_n) = \sum_{N_1}\dotsb\sum_{N_n} p(N_1)p(W_1|N_1) \prod_{i=2}^n p(N_i|N_{i-1})p(W_i|N_i) \, , \]

where the \(N_i\) are random variables denoting the grammatical role
of the \(i\)-th word \(W_i\). Figure 3 shows a graphical
representation of this model for a 3-word message. This type
of model is usually known as a **hidden Markov model**, due to
the presence of **hidden variables** that are never observed,
not even at inference time.

Figure 3: A hidden Markov model of SMS messages.
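The sum over all hidden role assignments has exponentially many terms, but it can be computed efficiently left to right (the classic forward pass). Below is a minimal sketch with two invented roles and a 3-word vocabulary; all probability tables are made up.

```python
# A forward-pass sketch for the hidden Markov model: two invented grammatical
# roles (0 and 1) and a 3-word vocabulary; all probability tables are made up.
p_init = [0.6, 0.4]                           # p(N_1 = k)
p_trans = [[0.3, 0.7], [0.8, 0.2]]            # p(N_i = k | N_{i-1} = j), [j][k]
p_emit = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]   # p(W_i = w | N_i = k), [k][w]

def marginal(words):
    """p(W_1=w_1,...,W_n=w_n): sum out the hidden roles with the forward pass."""
    # alpha[k] = p(w_1, ..., w_i, N_i = k), updated left to right
    alpha = [p_init[k] * p_emit[k][words[0]] for k in range(2)]
    for w in words[1:]:
        alpha = [p_emit[k][w] * sum(alpha[j] * p_trans[j][k] for j in range(2))
                 for k in range(2)]
    return sum(alpha)

# Sanity check: the marginals of all length-2 messages sum to one.
total = sum(marginal([w1, w2]) for w1 in range(3) for w2 in range(3))
print(total)
```

The forward pass costs time linear in \(n\) (and quadratic in the number of roles), instead of the exponential cost of the naive sum.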

## 6 Continuous random variables

So far, we have limited the discussion to categorical random variables, that is, random variables with a finite domain. But this need not be so. Suppose we want to investigate the relation between salary, age, and education. We can assume that salary is a (deterministic) linear function of age and education, plus some unaccounted factors that introduce variability from individual to individual. That is, if we let \(S\) denote salary, \(A\) denote age (in years) and \(E\) denote education (in years of schooling), we have that

\[ S = \mu_A A + \mu_E E + \mu_0 + U_1 \, , \]

where \(\mu_A\), \(\mu_E\) and \(\mu_0\) are real-valued parameters of our model and \(U_1\) denotes the influence of the unaccounted factors. We can also assume that education is related to age in a nonlinear form:

\[ E = \rho_1\left(1-\exp(\rho_2 A)\right) + U_2 \, , \]

where \(\rho_1\) and \(\rho_2\) are again real-valued parameters, and \(U_2\) represents unaccounted factors. Assume also that these unaccounted factors are distributed according to a Normal (or Gaussian) distribution:

\[ p(U_1 = u) = \frac{1}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{(u-\mu_1)^2}{2\sigma_1^2}\right) \, , \]

and similarly for \(U_2\). And finally, we can assume some convenient distribution for ages (e.g., a Gamma distribution).

Figure 4: A Bayesian network representing a causal model of salary, age and education.

This model can be represented graphically as in Figure 4.
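Because every variable is defined as a function of its parents plus independent noise, we can draw samples from the model by following the arrows (ancestral sampling). The sketch below uses invented parameter values; in particular, \(\rho_2\) is taken negative so that education saturates with age.

```python
import math
import random

# Ancestral-sampling sketch of the continuous causal model of salary (S),
# age (A) and education (E). All parameter values are invented for
# illustration; the noise terms U1, U2 are zero-mean Gaussians.
mu_A, mu_E, mu_0 = 800.0, 2500.0, 5000.0   # coefficients of the salary equation
rho_1, rho_2 = 18.0, -0.08                 # education saturation curve
random.seed(0)

def sample_individual():
    """Sample (A, E, S) following the arrows: A, then E | A, then S | A, E."""
    age = random.gammavariate(9.0, 4.5)    # a convenient Gamma prior on age
    u2 = random.gauss(0.0, 1.0)            # unaccounted factors U2
    edu = rho_1 * (1.0 - math.exp(rho_2 * age)) + u2
    u1 = random.gauss(0.0, 2000.0)         # unaccounted factors U1
    sal = mu_A * age + mu_E * edu + mu_0 + u1
    return age, edu, sal

samples = [sample_individual() for _ in range(5)]
for age, edu, sal in samples:
    print(f"age={age:5.1f}  education={edu:5.1f}  salary={sal:9.1f}")
```

Sampling in topological order works for any Bayesian network, discrete or continuous, since each variable only needs the already-sampled values of its parents.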

## 7 Mixed random variables

With some care, we can also mix continuous and discrete random variables. Consider again the tree-augmented Naive Bayes model of SMS messages, and assume that the words in our vocabulary are represented by integers \(1,2,\dotsc,m\) (so each \(W_i\) takes values in this set). We can assume that the probability of a keyword occurring in a certain position depends only on the previous keyword, on the class, and on parameters \(\theta_{c,j,k}\) such that

\[ p(W_i=k|W_{i-1}=j,C=c,\theta) = \theta_{c,j,k} \, , \]

where \(\theta\) collects all such \(\theta_{c,j,k}\). In other words, the conditional probability of occurrence of a keyword does not depend on its absolute position, but only on its position relative to other keyword occurrences. We can further assume that these parameters \(\theta_{c,j,k}\) (which are also random variables in our model) are unconditionally independent and follow a Dirichlet distribution. Thus, we have that

\[ p(W_1,\dotsc,W_n) = \int_\theta \sum_{C} p(C)p(W_1|C)\prod_{i=2}^n \theta_{C,W_{i-1},W_i} \,d\theta \, . \]

Note that the number of parameters is now independent of the length of the message, so the model can be applied to messages of any size.
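The Dirichlet prior is conjugate to the categorical distribution, so the integral above can actually be evaluated in closed form; still, a simple Monte Carlo sketch makes the semantics concrete: sample \(\theta\) from the prior, evaluate the inner sum, and average. All distributions below are invented, and Dirichlet samples are obtained from normalized Gamma draws using only the standard library.

```python
import random

# Monte Carlo evaluation of the mixed discrete/continuous model. The class
# prior, first-word distributions, and Dirichlet prior are all invented.
random.seed(1)
M = 3                                          # vocabulary size

def dirichlet(alpha):
    """Sample from Dirichlet(alpha) via normalized Gamma draws (stdlib only)."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

p_c = [0.7, 0.3]                               # p(C = c)
p_first = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]   # p(W_1 = w | C = c)

def marginal_mc(words, n_samples=5000):
    """Estimate p(w_1,...,w_n) by averaging the inner sum over draws of theta."""
    total = 0.0
    for _ in range(n_samples):
        # a fresh Dirichlet(1,1,1) row theta[c][j] per (class, previous word)
        theta = [[dirichlet([1.0] * M) for _ in range(M)] for _ in range(2)]
        inner = 0.0
        for c in (0, 1):
            prob = p_c[c] * p_first[c][words[0]]
            for prev, cur in zip(words, words[1:]):
                prob *= theta[c][prev][cur]
            inner += prob
        total += inner
    return total / n_samples

# For the sequence [0, 1, 2] the two bigram factors use distinct theta rows,
# so the exact value is (0.7*0.5 + 0.3*0.2) / 3**2 = 0.41 / 9.
est = marginal_mc([0, 1, 2])
print(est)
```

The estimate converges to the exact marginal as the number of samples grows; the closed-form (Pólya urn) computation would avoid the sampling entirely.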

## 8 Undirected models

So far, we have only considered models where the independence
assumptions lead to a factorization of the joint distribution
in terms of local conditional probability models. There are
also models where independences lead to a factorization in terms
of more generic functions. For example, consider a farming
field segmented into smaller contiguous regions \(R_A\), \(R_B\),
\(R_C\) and \(R_D\), such that \(R_A\) neighbors \(R_B\) and \(R_C\),
which are in turn adjacent to \(R_D\). The yield
of a region correlates only with the yields of its neighboring
regions, so that the yield of a region is conditionally
independent of the yields of its non-neighboring regions given
the yields of the neighboring regions. Let \(A,B,C,D\) denote
the yields of regions \(R_A\), \(R_B\), \(R_C\) and \(R_D\),
respectively, each taking values high and low. According to our
assumptions, \(A\) and \(D\) are conditionally independent given
\(B\) and \(C\), and \(B\) and \(C\) are conditionally independent
given \(A\) and \(D\). Moreover, every two variables are dependent
conditional on any other set of variables, and also unconditionally.
One can show that these independence relations cannot be jointly
expressed as a product of conditional probabilities as before.
However, we can express them in terms of **nonnegative pairwise
factors**:

\[ p(A,B,C,D) = \frac{1}{Z} \phi(A,B) \phi(A,C) \phi(B,D) \phi(C,D) \, , \]

where

\[ Z = \sum_A \sum_B \sum_C \sum_D \phi(A,B) \phi(A,C) \phi(B,D) \phi(C,D) \]

is a normalizing constant. We can represent this model graphically by connecting two nodes with an (undirected) edge if they co-occur as arguments of a factor. Figure 5 shows this visualization for the model above.

Figure 5: A Markov network representing the yields of neighboring field regions.

This type of model is called **Markov network** or **undirected
graphical model**.
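We can verify the claimed independences numerically by brute-force enumeration. The sketch below uses a single invented factor \(\phi\) (favoring equal yields in neighboring regions) over the neighboring pairs \((A,B)\), \((A,C)\), \((B,D)\), \((C,D)\), and checks that \(B\) and \(C\) are conditionally independent given \(A\) and \(D\).

```python
from itertools import product

# Brute-force check of the pairwise Markov network over the four regions,
# with factors only on neighboring pairs; the factor values are invented.
def phi(x, y):
    """A shared pairwise factor that favors equal yields in neighbors."""
    return 2.0 if x == y else 1.0

def unnorm(a, b, c, d):
    return phi(a, b) * phi(a, c) * phi(b, d) * phi(c, d)

Z = sum(unnorm(*v) for v in product((0, 1), repeat=4))

def p(a, b, c, d):
    return unnorm(a, b, c, d) / Z

# Check that p(B, C | a, d) factors as p(B | a, d) * p(C | a, d).
for a, d in product((0, 1), repeat=2):
    p_ad = sum(p(a, b, c, d) for b, c in product((0, 1), repeat=2))
    for b, c in product((0, 1), repeat=2):
        joint_bc = p(a, b, c, d) / p_ad
        marg_b = sum(p(a, b, cc, d) for cc in (0, 1)) / p_ad
        marg_c = sum(p(a, bb, c, d) for bb in (0, 1)) / p_ad
        assert abs(joint_bc - marg_b * marg_c) < 1e-12
print("verified: B and C are conditionally independent given A and D")
```

The check succeeds because, once \(A\) and \(D\) are fixed, the unnormalized product splits into a function of \(B\) alone times a function of \(C\) alone; the same argument gives \(A \perp D \mid B, C\).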

# Bibliography

- [Jor1998] Michael Jordan, Learning in Graphical Models, MIT Press (1998).
- [Pea1988] Judea Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Morgan Kaufmann (1988).
- [Lau1996] Stephen Lauritzen, Graphical Models, Oxford (1996).
- [CDLS1999] Cowell, Dawid, Lauritzen & Spiegelhalter, Probabilistic Networks and Expert Systems, Springer-Verlag (1999).
- [KFL2001] Kschischang, Frey & Loeliger, Factor graphs and the sum-product algorithm, *IEEE Transactions on Information Theory*, **47(2)**, 498-519 (2001).
- [WJ2008] Martin Wainwright & Michael Jordan, Graphical Models, Exponential Families, and Variational Inference, *Foundations and Trends in Machine Learning*, **1(1-2)**, 1-305 (2008).
- [Dar2009] Adnan Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press (2009).
- [KF2009] Koller & Friedman, Probabilistic Graphical Models, MIT Press (2009).