Parameter Learning

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\can}[1]{\overline{\text{an}}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\perp_d} \newcommand{\sep}{\perp_u} \newcommand{\msep}{\perp_m} $

Probabilistic graphical models consist of a qualitative part, which encodes variable interactions and is represented by a graph, and a quantitative part, which encodes the strength of interactions and is represented by real-valued functions. In this lecture, we assume that the qualitative part of the model (i.e., the graph structure) is given, and discuss the learning of the numerical parameters (e.g., the conditional probabilities associated with the nodes of a Bayesian network). We consider two scenarios according to the type of information available. In the complete data scenario, we assume that all relevant variables are fully observable, meaning that either we have a table of complete instantiations of the variables we deem important, or we discard any variable/instantiation that contains a missing value. In the second scenario, we deal with incomplete data, that is, with partial assignments that leave some of the variables unassigned. This might happen because some values could not be acquired during data acquisition (e.g., the respondent did not answer the question or the sensor failed), or because we assume that there are important quantities that cannot be directly measured and yet we wish to include them in the model. A simple example of the latter is clustering: we assume that the data can be separated into groups of alike objects (as represented by their attributes), and we include a cluster indicator variable that labels each instance/object with its corresponding group.

Statistical inference

Data-driven approaches to parameter estimation are the bread and butter of statistical inference. Most of the discussion here will be based on standard statistical principles such as maximum likelihood and consistency, which can be interpreted as frequentist approaches. There are also Bayesian approaches that essentially cast parameter learning as inference in a Bayesian network with continuous variables (the unknown parameters). For simplicity and uniformity, we use the notation and perspective of Bayesian inference, even when applying maximum likelihood, so that we can represent models of parameter learning as Bayesian networks.

Parameter learning from complete data

We start with the simplest case where there are no missing observations and no hidden (latent) variable.

Maximum Likelihood Estimation

Let us start with the simplest model imaginable: a Bayesian network with a single variable $X$ taking on values 0/1. For instance, $X$ could represent the result of tossing a thumbtack (with the pointy side up or down), or a congressperson’s votes categorized as pro- or anti-government. So suppose we have a list of observations $x_1,\dotsc,x_N$ of the values of $X$ (note that we allow for repetition in this list). This list can itself be seen as a realization of a list of variables $X_1,\dotsc,X_N$, each representing the outcome of an (idealized) experiment where we draw a sample from the distribution of $X$. In this simple case, there is only one possible structure and a single numerical parameter, namely, the probability $\theta=P(X=1)$.1 A simple approach is to guess a value for $\theta$ and to evaluate the likelihood that a model with that parameter assigns to the observed data:2

$$ L(\theta) = P(X_1=x_1,\dotsc,X_N=x_N|\theta) \, . $$

The maximum likelihood principle states that we should prefer a model that maximizes the quantity above. This criterion alone is insufficient for deriving a feasible strategy, and additional assumptions have to be made. For instance, suppose we have data $X_1=1,X_2=1,X_3=1$; then it might appear that $\theta=1$; however, unless additional assumptions are imposed, it might well be that all three observations are deterministically determined by some other (unobserved) variable. The most common assumption is that the observed variables are exchangeable, meaning that the order in which the variables are arranged does not alter their joint probability (for any data set size). In other words, for any $N$ and any permutation $\sigma$ of the indices, it follows that

$$ P(X_{\sigma(1)}=x_{\sigma(1)},\dotsc,X_{\sigma(N)}=x_{\sigma(N)}|\theta) = P(X_1=x_1,\dotsc,X_N=x_N|\theta) \, . $$

For example, in the thumbtack example we can assume that the conditions of the experiment remained unchanged so that the order of the outcomes is irrelevant. The famous de Finetti’s representation theorem assures us that (under some very mild conditions) assuming exchangeability is the same as assuming independent and identically distributed (iid) data (the assumption that $X_1,\dotsc,X_N$ are independent given $\theta$, and that each $X_i$ is generated using the same distribution). Under those assumptions, the likelihood of the thumbtack example is

$$ \begin{align*} L(\theta) &= \prod_{j=1}^N P(X=x_j|\theta) = \prod_{j=1}^N \theta^{x_j}(1-\theta)^{1-x_j} \\ &= \theta^{N[X=1]}(1-\theta)^{N[X=0]} \, , \end{align*} $$

where $N[X=1]=\sum_j x_j$ is the count of $X_j=1$ and $N[X=0]=N-N[X=1]$. We can graphically represent these assumptions by the Bayesian network below.

Figure 1: Parameter-learning Bayesian network structure for a single-variable model

This type of network is known as a parameter-learning Bayesian network, an augmented Bayesian network, or a meta-Bayesian network. Note that for this simple case the parameter network has a simple naive Bayes structure. Note also that the network is parametrized in a special form, where the conditional probability distributions of all observed variables $X_i$ are identical; this additional structure of the model cannot be inferred from its graph.

The likelihood function of the thumbtack example is non-negative, bounded for $\theta \in (0,1)$, and zero at $\theta \in \{0,1\}$. Recall that for any positive function $f$, $$ f(x^*) \geq f(x) $$

if and only if

$$ \ln f(x^*) \geq \ln f(x) \, . $$

Since we are only interested in the maximizer of the likelihood function (and assuming $0 < N[X=1] < N$), we can instead maximize the log-likelihood:

$$ \begin{align*} LL(\theta) &= \ln P(X_1=x_1,\dotsc,X_N=x_N|\theta) \\ &= N[X=1] \ln \theta + N[X=0]\ln (1-\theta) \, . \end{align*} $$

The function above is strictly concave in $\theta$, differentiable, bounded from above in $(0,1)$ and approaches negative infinity at $\{0,1\}$, therefore it has a single maximizer at a stationary point (a point with zero derivative). We call that maximizer the maximum likelihood estimator (MLE). Taking derivatives and setting to zero, we find that the MLE is $$ \theta^{MLE} = \frac{N[X=1]}{N}. $$

Note that the MLE is a function of the data via the sufficient statistics given by the counts $N[X=1]$ and $N[X=0]$. The plot below depicts an example of the likelihood function with $N[X=1]=1$ and $N=4$. The corresponding MLE is $\theta^{MLE}=0.25$, which is clearly seen in the picture.

Figure 2: Likelihood function for the binary single-variable parameter-learning Bayesian network
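
For concreteness, here is a minimal Python sketch (written for illustration, not part of the lecture material) that evaluates the log-likelihood on a grid and recovers the MLE for the counts used in the figure ($N[X=1]=1$, $N=4$):

```python
import numpy as np

# Toy counts from the example: N[X=1] = 1, N = 4.
n1, n = 1, 4
n0 = n - n1

thetas = np.linspace(0.001, 0.999, 999)          # avoid log(0) at the endpoints
loglik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)

theta_mle_closed_form = n1 / n                   # N[X=1] / N
theta_mle_grid = thetas[np.argmax(loglik)]       # numerical maximizer on the grid

print(theta_mle_closed_form, theta_mle_grid)     # both are (approximately) 0.25
```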

Let us now turn to a slightly more complex model: a Bayesian network with two binary variables and an arc $X \rightarrow Y$. The figure below shows such a network (left), a Bayesian network representing the parameter learning model (center), and a more succinct version of the learning model in the form of a plates model (right).

Figure 3: A two-variable Bayesian network (left), its parameter-learning network (center), and the same network in plates representation (right)

The desired parameters are the probabilities $\theta_X = P(X=1)$, $\theta_{Y|0}=P(Y=1|X=0)$ and $\theta_{Y|1}=P(Y=1|X=1)$. Let $\theta=(\theta_X,\theta_{Y|0},\theta_{Y|1})$. The log-likelihood is given (under the same assumptions as before) by:

$$ \begin{align*} LL(\theta) &= \ln P(X_1=x_1,Y_1=y_1,\dotsc,X_N=x_N,Y_N=y_N|\theta) \\ &= \sum_{j=1}^N \ln P(X=x_j,Y=y_j|\theta) = \sum_{j=1}^N \Bigl(\ln P(X=x_j|\theta_X) + \ln P(Y=y_j|X=x_j,\theta_{Y|X}) \Bigr)\\ &=\underset{LL(\theta_X)}{\underbrace{\sum_{x \sim X} N[X=x]\ln P(X=x|\theta_X)}} + \underset{LL(\theta_{Y|0})}{\underbrace{\sum_{y \sim Y} N[Y=y,X=0] \ln P(Y=y|X=0,\theta_{Y|0})}} \\ & \qquad \qquad \qquad \qquad \qquad + \underset{LL(\theta_{Y|1})}{\underbrace{\sum_{y \sim Y} N[Y=y,X=1] \ln P(Y=y|X=1,\theta_{Y|1})}} \end{align*} $$

Note that the overall log-likelihood decomposes as three functionally independent log-likelihood functions over single-variable models $LL(\theta_X)$, $LL(\theta_{Y|0})$ and $LL(\theta_{Y|1})$. Hence,

$$ \begin{align*} \theta^{MLE} &= (\theta_X^{MLE},\theta_{Y|0}^{MLE},\theta_{Y|1}^{MLE}) = \arg\max_\theta LL(\theta_X,\theta_{Y|0},\theta_{Y|1}) \\ & = \left( \arg\max_{\theta_X} LL(\theta_X), \arg\max_{\theta_{Y|0}} LL(\theta_{Y|0}),\arg\max_{\theta_{Y|1}} LL(\theta_{Y|1}) \right) \,. \end{align*} $$

By equating the gradient to zero we find that: $$ \begin{align*} \theta_X^{MLE} &= \frac{N[X=1]}{N} & \theta_{Y|0}^{MLE} &= \frac{N[X=0,Y=1]}{N[X=0]} & \theta_{Y|1}^{MLE} &= \frac{N[X=1,Y=1]}{N[X=1]} \, , \end{align*} $$

where $N[\eta]$ denotes the number of assignments where $\eta$ is true, for example,

$$ N[X=0,Y=1]=\sum_{j=1}^N (1-x_j)y_j \, . $$

It is easy to generalize this result to any arbitrary discrete Bayesian network. Let $$ \theta = ( \theta_{X|\nu}: X \in \mathcal{V}, \nu \sim \pa{X}) \, . $$

The complete data log-likelihood of a Bayesian network over discrete variables $\mathcal{V}$ is $$ LL(\theta) = \sum_{X \in \mathcal{V}} \sum_{x \sim X} \sum_{\nu \sim \pa{X}} N[X=x,\pa{X}=\nu] \ln P(X=x|\pa{X}=\nu,\theta_{X|\nu}) \, . $$

We can easily show that:

The MLE for each conditional probability of a Bayesian network is

$$ \theta_{X|\nu}^{MLE}(x) = \frac{N[X=x,\pa{X}=\nu]}{N[\pa{X}=\nu]} \, . $$

Thus, MLE parameter learning in Bayesian networks with complete data boils down to simple counting of relative frequencies for each configuration of a variable and its parents, a definition that is both intuitive and efficient. Note that the counts $N[X,\pa{X}]$ can be stored in a potential, and the conditional probabilities can be obtained by an operation that normalizes that potential with respect to each restriction $\pa{X}=\nu$.
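
As an illustration, the sketch below (a toy implementation written for these notes, with made-up variable names and data) estimates every conditional probability table of a discrete Bayesian network by counting, exactly as in the formula above:

```python
from collections import Counter

def mle_cpts(parents, data):
    """MLE of P(X=x | pa(X)=nu) by relative-frequency counting.

    parents: dict mapping each variable to a tuple of its parent variables.
    data:    list of complete assignments, each a dict {variable: value}.
    Returns a dict {(X, nu): {x: probability}} with nu a tuple of parent values.
    """
    cpts = {}
    for X, pa in parents.items():
        joint = Counter()   # counts N[X=x, pa(X)=nu]
        margin = Counter()  # counts N[pa(X)=nu]
        for row in data:
            nu = tuple(row[p] for p in pa)
            joint[(row[X], nu)] += 1
            margin[nu] += 1
        for (x, nu), c in joint.items():
            cpts.setdefault((X, nu), {})[x] = c / margin[nu]
    return cpts

# Two-variable network X -> Y with a toy complete data set.
parents = {"X": (), "Y": ("X",)}
data = [{"X": 0, "Y": 1}, {"X": 1, "Y": 1}, {"X": 0, "Y": 0},
        {"X": 1, "Y": 0}, {"X": 0, "Y": 1}]
print(mle_cpts(parents, data))
```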

Pseudo-counts

The MLE for conditional probabilities is defined only when $N[\pa{X}=\nu]>0$. For small data sets, or for relatively large in-degrees, it is often the case that $N[\pa{X}=\nu]=0$ for some $\nu$. A possible solution is to adopt the principle of indifference (also called the principle of insufficient reason), which prescribes that we should assign equal probabilities to events whose likelihoods are indistinguishable to us. This means assigning a uniform conditional distribution whenever $N[\pa{X}=\nu]=0$.3 A more general form of the problem occurs when $N[\pa{X}=\nu]$ is a small positive number. In this scenario, the MLE is well-defined but very sensitive to local perturbations of the data (i.e., to changes in a few instantiations), and thus not reliable (a few extra observations often change the implications of the model). To see this, consider the case where we estimate $\theta=P(X=1)$ for a binary variable $X$. Assume $N[X=1]/N=0.1$ and two scenarios: $N=10$ and $N=1000$. In the first scenario a single flip of an observation makes the estimate double its value (from 0.1 to 0.2) or become degenerate (i.e., $\theta=0$). In the second scenario, the estimate remains virtually unchanged even after flipping ten observations. It is desirable to obtain estimators that are somehow robust to the sample size (meaning their variance is relatively small across different samples of the same population). This can be accomplished by introducing an inertia component known (in the case of discrete distributions) as pseudo-counts. The idea is essentially to add virtual observations to every count, so that the estimator becomes:

$$ \theta_{X|\nu}^\text{Bayes}(x) = \frac{N[X=x,\pa{X}=\nu]+\alpha_{X|\nu}(x)}{\sum_{x’} N[X=x’,\pa{X}=\nu]+\alpha_{X|\nu}(x’)} \, . $$

Each pseudo-count $\alpha_{X|\nu}(x)$ is a non-negative number (fractional counts are allowed) and indicates the amount of inertia. The sum of the pseudo-counts

$$ \sum_x \alpha_{X|\nu}(x) $$

indicates the strength of the belief in the background knowledge, and is known as the equivalent sample size. When all pseudo-counts are set to one, the corresponding prior distribution is uniform over the space of probability distributions $\theta_{X|\nu}$, and the estimator is known as the Laplace correction.
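
A sketch of the smoothed estimator, under the simplifying assumption of a single scalar pseudo-count `alpha` shared by all values (so `alpha = 1` gives the Laplace correction):

```python
def smoothed_cpt(counts, values, alpha=1.0):
    """Pseudo-count estimate of P(X=x | pa(X)=nu) for one parent configuration.

    counts: dict {x: N[X=x, pa(X)=nu]} (missing entries count as zero).
    values: list of all possible values of X.
    alpha:  pseudo-count added to every value (fractional values allowed).
    """
    total = sum(counts.get(x, 0) + alpha for x in values)
    return {x: (counts.get(x, 0) + alpha) / total for x in values}

# With no observations for this parent configuration, the estimate is uniform.
print(smoothed_cpt({}, values=[0, 1], alpha=1.0))          # {0: 0.5, 1: 0.5}
# With counts N[X=0]=3, N[X=1]=1 the Laplace-corrected estimate of P(X=1) is (1+1)/(4+2) = 1/3.
print(smoothed_cpt({0: 3, 1: 1}, values=[0, 1], alpha=1.0))
```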

Bayesian estimation

The pseudo-count approach solves both the large variance problem and the zero data problem, and it can be justified as the predictive Bayesian estimator. The Bayesian estimator for $\theta$ is obtained from the posterior probability:

$$ P(\theta|\text{data}) = \frac{P(\text{data}|\theta) P(\theta)}{P(\text{data})} \, . $$

The term $P(\text{data}|\theta)$ is the data likelihood, $P(\theta)$ is the prior and $P(\text{data})$ is the marginal data likelihood. Note that $P(\theta)$ is actually a probability over a probability, also called a second-order probability. The prior encodes background knowledge of the model (before observations are made), and the posterior is the combination of background knowledge and data likelihood (normalized by the marginal likelihood) as given by Bayes’ rule. When $X$ is a categorical variable and the prior is a Dirichlet distribution, the posterior is also a Dirichlet distribution. For the single-variable model we have that

$$ \begin{align*} \theta^\text{Bayes}(x) &= P(X=x|\text{data}) \\ &= \mathbb{E}(\theta(x)|\text{data}) \\ &= \int \theta \cdot P(\theta|\text{data}) d\theta \\ &= \frac{N[X=x]+\alpha_x}{N+\sum_{x’} \alpha_{x’}} \\ &= \frac{N}{N+\beta}\theta^\text{MLE}(x) + \frac{\beta}{N+\beta}\theta^0(x) \, . \end{align*} $$

The right-most part shows an interesting view of the Bayesian estimator as the weighted average of the MLE and the prior estimate (guess) $\theta^0(x)=\alpha_x/\beta$, according to the prior strength $\beta=\sum_{x’}\alpha_{x’}$. The Bayesian estimator integrates out all possible values of $\theta$, which reduces the variance of the estimator. For example, consider a scenario with $N=10$ and $N[X=1]=3$, and assume that $\theta$ can only take values in $\{0,0.1,0.2,\dotsc,0.9,1\}$ and that the prior is given by

$$ P(\theta=\theta_i) \propto \exp\left( -5\,|0.5-\theta_i|\right) \, . $$

This prior is peaked around 1/2 and decays exponentially fast as we move away from its peak. The figure below shows the prior distribution, the normalized likelihood (equivalent to a posterior with uniform prior), and the posterior distribution.

Figure 4: Prior, normalized likelihood and posterior distributions

The Bayesian estimate is $$ \theta^\text{Bayes}=\sum_{\theta_i} \theta_i \cdot P(\theta_i \mid N[X=1]=3,N=10)=0.392 \, . $$

This is shown as the vertical bar in Figure 4. For comparison, the MLE, obtained at the peak of the normalized likelihood, is $\theta^\text{MLE}=0.3$. The Bayesian estimator biases the estimate according to the prior knowledge, pushing it towards the prior’s peak.
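
These numbers are easy to reproduce; the sketch below (using the discretized prior defined above) computes the posterior over the grid of values of $\theta$ and the posterior-mean (Bayesian) estimate:

```python
import numpy as np

thetas = np.arange(0.0, 1.01, 0.1)                 # theta in {0, 0.1, ..., 1}
prior = np.exp(-5 * np.abs(0.5 - thetas))
prior /= prior.sum()

n1, n = 3, 10                                      # N[X=1] = 3, N = 10
likelihood = thetas**n1 * (1 - thetas)**(n - n1)

posterior = prior * likelihood
posterior /= posterior.sum()

theta_bayes = (thetas * posterior).sum()           # posterior mean, ~0.392
theta_mle = thetas[np.argmax(likelihood)]          # 0.3
print(round(theta_bayes, 3), theta_mle)
```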

Maximum a posteriori estimation

When the data set is sufficiently large, the posterior probability $P(\theta | \text{data})$ will be concentrated around a small region. A suitable approximation is then given by the maximum a posteriori (MAP) estimator:

$$ \theta^\text{MAP} = \arg\max_{\theta} P(\theta \mid \text{data}) \, . $$

In the example in Figure 4, the MAP estimate is very similar to the Bayesian estimate (this is due to the posterior being nearly symmetric). In the case of a categorical variable $X$ with a Dirichlet prior, this reduces to

$$ \theta^\text{MAP}(x) = \frac{N[X=x]+\alpha_x-1}{N-r+\sum_{x’} \alpha_{x’}} \, , $$

where $r=|\Omega_X|$. Note that the MAP estimator equals (for categorical variables) the Bayesian estimator up to a unit shift of the pseudo-counts. In general, the MAP estimator is an approximation with larger variance that converges to the Bayesian estimator only in the limit (i.e., when the amount of data reaches infinity). It is however often used in practice, as it is computationally less demanding.
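
For a categorical variable with a Dirichlet prior, both estimators are simple functions of the counts and hyperparameters. A small sketch with made-up counts (assuming all hyperparameters are at least 1, so the posterior-mode formula applies directly):

```python
def bayes_estimate(counts, alphas):
    """Posterior-mean estimate: (N[x] + alpha_x) / (N + sum of alphas)."""
    total = sum(counts.values()) + sum(alphas.values())
    return {x: (counts[x] + alphas[x]) / total for x in counts}

def map_estimate(counts, alphas):
    """Posterior-mode estimate: (N[x] + alpha_x - 1) / (N - r + sum of alphas)."""
    r = len(counts)
    total = sum(counts.values()) + sum(alphas.values()) - r
    return {x: (counts[x] + alphas[x] - 1) / total for x in counts}

counts = {0: 7, 1: 3}            # N = 10, N[X=1] = 3
alphas = {0: 2, 1: 2}            # symmetric Dirichlet prior, equivalent sample size 4
print(bayes_estimate(counts, alphas))   # {0: 9/14 ~ 0.643, 1: 5/14 ~ 0.357}
print(map_estimate(counts, alphas))     # {0: 8/12 ~ 0.667, 1: 4/12 ~ 0.333}
```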

This approach generalizes to any Bayesian network, but additional assumptions need to be made. The first assumption is that the prior distribution factorizes over each variable and configuration of its parents:

$$ P(\theta) = \prod_{X \in \mathcal{V}} \prod_{\nu \sim \pa{X}} P(\theta_{X|\nu}) \, , $$

where $\theta_{X|\nu}=P(X|\pa{X}=\nu)$. This assumption is known as prior parameter independence. The second assumption is that all variables are categorical and that the $P(\theta_{X|\nu})$ are Dirichlet distributions with hyperparameters $\alpha_{x|\nu}$:

$$ P(\theta_{X|\nu} \mid \alpha_{X|\nu}) \propto \prod_{x \sim X} \theta_{X|\nu}(x)^{\alpha_{x|\nu}-1} \, . $$

The resulting parameter learning model is depicted below for the two-variable network in Figure 3.

Figure 5: Plates model of the parameter-learning Bayesian network with hyperparameters for the two-variable network

Note that if all $X_j,Y_j$ are observed, the nodes $\theta_{X}$ and $\{\theta_{Y|0},\theta_{Y|1}\}$ are d-separated. This is true in general:

$$ P(\theta|\text{data}) = \prod_{X \in \mathcal{V}} P(\theta_{X|\pa{X}}|\text{data}) \, . $$

As the equation above suggests, the parameters $\theta_{X|\nu}$ for different $\nu$ are, however, d-connected once the value of $X$ is observed.

Consistency

We say that an estimator is consistent if it converges to the true value. All three estimators we have seen are consistent, that is, they recover the true distribution (assuming one exists) that generated the data as $N$ goes to infinity. The MLE adapts to the data most quickly (and has the highest variance across samples), while the Bayesian estimator adapts the slowest (and has the smallest variance). In practice the Bayesian estimator often performs best, followed closely by the MAP estimator.

Markov networks

We have seen that parameter learning in Bayesian networks with complete data can be reduced to a problem of computing (smoothed) relative frequencies from the data set. This is a consequence of parameter independence, which allows us to estimate each conditional distribution independently. This was true even when the parameters were integrated out, as in the case of the Bayesian estimator. Conceptually, the same approaches can be used to learn the parameters of a Markov network: maximum likelihood, Bayesian estimation, MAP. In practice, however, the global normalization of Markov networks makes the corresponding problem non-decomposable, and hence much more difficult. To see this, consider a simple Markov network with factors $\phi_1(X,Y)$ and $\phi_2(Y,Z)$. Its log-likelihood is

$$ \begin{align*} LL(\phi) &= \ln P(\text{data}|\phi) \\ &= \underset{\approx LL(\phi_1)}{\underbrace{\sum_{x,y} N[X=x,Y=y] \ln \phi_1(x,y)}} + \underset{\approx LL(\phi_2)}{\underbrace{\sum_{y,z} N[Y=y,Z=z] \ln \phi_2(y,z)}} - N \ln Z(\phi) \, , \end{align*} $$

where

$$ Z(\phi) = \sum_{x,y,z} \phi_1(x,y)\phi_2(y,z) \, . $$

As we can see from the expression above, maximizing the log-likelihood does not decompose into smaller problems due to the presence of the log-partition function $\ln Z(\phi)$, which couples both problems. For this particular network structure (whose domain graph is chordal), we can bypass this problem by noticing that the model is Markov equivalent to any Bayesian network produced by a consistent orientation of the edges, for example: $X \rightarrow Y$, $Y \rightarrow Z$. Since Markov equivalent models have the same log-likelihood, we can learn the corresponding Markov network by learning any consistent Bayesian network, for example: $$ \phi_1(x,y) = \frac{N[X=x,Y=y]}{N} \,, \qquad \phi_2(y,z) = \frac{N[Y=y,Z=z]}{N[Y=y]} \, , $$ so that $\phi_1$ and $\phi_2$ encode the estimates of $P(X,Y)$ and $P(Z|Y)$, and their product is the (already normalized) maximum likelihood joint distribution. This no longer holds for non-chordal graphs, where a Bayesian network-like estimation of the parameters leads to a suboptimal choice (under the maximum likelihood criterion).

Log-linear parametrization

To make the notation simpler, we will use the following log-linear parametrization of a Markov network:

$$ P(\mathcal{V}|\theta) = \exp\left(\sum_k \theta_k f_k(\mathcal{V}) - \ln Z(\theta) \right) \, , $$

where $\theta_k$ are real values, and $f_k$ are real-valued functions. Any positive Markov network can be parametrized in that way; in general the functions $f_k$ will depend only on a subset of the variables (relative to cliques in the network). The log-likelihood is then

$$ LL(\theta) = \sum_k \theta_k \left( \sum_{\nu \sim \mathcal{V}} N[\mathcal{V}=\nu] f_k(\nu) \right) - N \ln Z(\theta) \, . $$

Note that

$$ \sum_{\nu \sim \mathcal{V}} \frac{N[\mathcal{V}=\nu]}{N} f_k(\nu) $$

is the empirical mean value of $f_k$ in the data set, and does not depend on the parameters $\theta$. Remarkably, $\ln Z(\theta)$ is convex in $\theta$ (it is a log-sum-exp of affine functions of $\theta$), hence $LL(\theta)$ is a concave function (the sum of an affine function and a concave function). This guarantees that any stationary point is a global maximizer of the likelihood (an MLE), and the optimality of greedy (gradient-based) search methods. First note that

$$ \frac{\partial}{\partial \theta_k} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_{\nu} \exp\left( \sum_j \theta_j f_j(\nu) \right) f_k(\nu) = \sum_{\nu} f_k(\nu) P(\nu|\theta) \, . $$

The stationary points of $LL(\theta)$ satisfy (for all $k$):

$$ \frac{\partial}{\partial \theta_k} LL(\theta) = \frac{\partial}{\partial \theta_k} \sum_j \theta_j \sum_{\nu} N[\nu] f_j(\nu) - \frac{\partial}{\partial \theta_k} N \ln Z(\theta) = 0 \, , $$ whence, $$ \sum_{\nu} \frac{N[\nu]}{N} f_k(\nu) = \sum_{\nu} f_k(\nu) P(\nu|\theta) \, . $$

In words, the stationary points are the parameters that make the expected value of each $f_k$ under the model equal to its empirical mean.

Parameter optimization

Even though the log-likelihood is a well-behaved (concave) function, its maximization has no closed-form solution, and iterative numerical methods are required to find $\theta$. A popular and relatively simple algorithm for solving such problems (to global optimality) is gradient ascent, which consists of the updates:

$$ \theta_k^{(t+1)} = \theta_k^{(t)} + \lambda_t \frac{\partial}{\partial \theta_k} LL(\theta^{(t)}) = \theta_k^{(t)} + \lambda_t \left( \sum_\nu N[\nu] f_k(\nu) - N \sum_\nu f_k(\nu) P(\nu|\theta^{(t)}) \right) \, . $$

The parameter $\lambda_t$ is a positive value called the learning rate, which regulates the convergence rate of the method. Too small a value leads to slow convergence, while too large a value can prevent convergence. Typically, the learning rate is initialized with a sufficiently large value, which is decreased at each iteration. Note that each iteration of the update above requires performing inference to compute the expected value of $f_k$ (we can simultaneously compute all such values using, for instance, the sum-product or clique-tree algorithms that we will see in future lessons). This makes parameter learning in Markov networks a computationally expensive procedure.

Gradient ascent is conceptually simple to understand and implement, but can be slow to converge in practice. For this reason, more sophisticated algorithms, such as limited-memory BFGS, are often used.
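
The sketch below illustrates the update on a deliberately tiny model in which the partition function is computed by brute force (in a realistic network this is where the inference algorithms mentioned above would be needed); the features and data are made up for illustration, and a fixed learning rate is used instead of a decaying one:

```python
import itertools
import numpy as np

# Tiny log-linear Markov network over three binary variables (X, Y, Z) with
# indicator features on the pairs (X, Y) and (Y, Z): f1 = [X == Y], f2 = [Y == Z].
features = [lambda v: float(v[0] == v[1]), lambda v: float(v[1] == v[2])]
states = list(itertools.product([0, 1], repeat=3))

def model_dist(theta):
    """Brute-force computation of P(v | theta), including the partition function."""
    scores = np.array([np.exp(sum(t * f(v) for t, f in zip(theta, features))) for v in states])
    return scores / scores.sum()

# Made-up complete data set (each tuple is one joint observation of X, Y, Z).
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
emp_mean = np.array([np.mean([f(v) for v in data]) for f in features])

theta = np.zeros(len(features))
for _ in range(500):                         # gradient ascent on the log-likelihood / N
    p = model_dist(theta)
    model_mean = np.array([sum(pi * f(v) for pi, v in zip(p, states)) for f in features])
    theta += 0.5 * (emp_mean - model_mean)   # fixed learning rate 0.5

print(theta)   # at convergence the model's expected features match the empirical means
```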

Pseudo log-likelihood

The computational difficulty with obtaining the MLE for Markov networks comes from the difficulty of computing the probability of a complete instantiation $P(\nu)$, an operation that takes linear time in Bayesian networks. Thus, an alternative to the MLE is to optimize a different function of the data that is based on some easy-to-compute probability. We have already seen that computing $P(X \mid \mathcal{V}-X)$ can be done efficiently if the Markov blanket of $X$ is sufficiently small. This motivates the use of the pseudo log-likelihood:

$$ \text{PLL}(\theta) = \sum_{X \in \mathcal{V}} \sum_{\nu\sim\mathcal{V}} N[\mathcal{V}=\nu] \ln P(X=\nu[X] \mid \mathcal{V}-X=\nu[\mathcal{V}-X],\theta) \, . $$

The critical point here is that the pseudo log-likelihood can be computed without computing the partition function, and it is thus much more efficient. In fact, one can show that $$ \begin{multline*} \ln P(X=\nu[X]\mid\mathcal{V}-X=\nu[\mathcal{V}-X],\theta) =\\ \sum_{k: X \in \text{scope}[f_k]} \theta_k f_k(\nu) - \ln\left(\sum_{x \sim X} \exp\left[ \sum_{k: X \in \text{scope}[f_k]} \theta_k f_k(x,\nu[\mathcal{V}-X]) \right] \right) \, . \end{multline*} $$

Also,

$$ \frac{\partial}{\partial \theta_k} \text{PLL}(\theta) = \sum_{X \in \text{scope}[f_k]} \sum_{\nu \sim \mathcal{V}} N[\mathcal{V}=\nu] \left(f_k(\nu) - \sum_{x \sim X} f_k(x,\nu[\mathcal{V}-X])\, P(X=x\mid\mathcal{V}-X=\nu[\mathcal{V}-X],\theta) \right) \, . $$

The pseudo log-likelihood is a concave function, and its maximizer is a consistent estimator; moreover, for sufficiently large samples, the maximizer of the pseudo log-likelihood coincides with the MLE. This is not true for small data sets (or even moderately large ones), in which case the argument for the PLL is somewhat weaker. The PLL is particularly useful when the model is built so that most of the queries are similar to $P(X|\text{MB}(X))$. This is the case, for instance, in classification with few missing values, or when performing inference by Gibbs sampling (an inference algorithm we will encounter in the future).
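
A sketch of the pseudo log-likelihood for the same kind of tiny log-linear model (made-up features and data); note that it only ever normalizes over the values of a single variable, never over complete joint states:

```python
import numpy as np

# Features given as (scope, function); the scope lists the variable indices the feature uses.
features = [((0, 1), lambda v: float(v[0] == v[1])),
            ((1, 2), lambda v: float(v[1] == v[2]))]

def pseudo_loglik(theta, data, n_vars=3):
    """Pseudo log-likelihood: sum over instances and variables of ln P(X_i | all other variables)."""
    pll = 0.0
    for v in data:
        for i in range(n_vars):
            # Unnormalized log-score of each value of variable i, with the other variables fixed;
            # only features whose scope contains i matter (the others cancel out).
            scores = []
            for x in (0, 1):
                w = list(v)
                w[i] = x
                scores.append(sum(t * f(w) for t, (scope, f) in zip(theta, features) if i in scope))
            scores = np.array(scores)
            pll += scores[v[i]] - np.log(np.exp(scores).sum())
    return pll

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
print(pseudo_loglik(np.array([0.3, -0.2]), data))
```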

Parameter learning from incomplete data

Consider again the two-variable network, and assume we have data $y_1,\dotsc,y_N$. That is, we observed only values for $Y$, and the values for $X$ are missing at random. The observed data log-likelihood in this case is

$$ LL(\theta) = \ln \prod_{j=1}^N P(Y_j=y_j|\theta) = \sum_{j=1}^N \ln \left( \sum_{x_j} P(Y_j=y_j|X_j=x_j) P(X_j=x_j) \right) \, , $$

which is equal to

$$ N[Y=0] \cdot \ln \left( [1-\theta_{Y|0}][1-\theta_{X}]+[1-\theta_{Y|1}]\theta_{X} \right) + N[Y=1] \cdot \ln \left( \theta_{Y|0}[1-\theta_{X}]+\theta_{Y|1}\theta_{X} \right) \, . $$

This is a much more complicated optimization, which has no closed-form solution (the objective is non-concave and may have multiple local maxima). More generally, assume that we have data $x_1,y_1,\dotsc,x_N,y_N$, except that some values $x_j$ or $y_j$ are missing at random (represented as $x_j=?$, $y_j=?$). For example (for now ignore the last column):

| $j$ | $X$ | $Y$ | part |
|---|---|---|---|
| 1 | 0 | 1 | $D_1$ |
| 2 | 1 | 1 | $D_1$ |
| 3 | ? | 1 | $D_2$ |
| 4 | 0 | ? | $D_3$ |
| 5 | 0 | 1 | $D_1$ |
| 6 | ? | ? | $D_4$ |
| 7 | 0 | 0 | $D_1$ |
| 8 | 1 | 0 | $D_1$ |

The data log-likelihood is

$$ \begin{align*} LL(\theta) = & \sum_{j \in D_1} \ln P(X_j=x_j,Y_j=y_j) + \sum_{j \in D_2} \ln \sum_{x_j} P(X_j=x_j,Y_j=y_j) \\ & + \sum_{j \in D_3} \ln \sum_{y_j} P(X_j=x_j,Y_j=y_j) + \sum_{j \in D_4} \ln \sum_{x_j,y_j} P(X_j=x_j,Y_j=y_j) \end{align*} $$

where

  • $D_1$ are the pairs $x_j,y_j$ where both are non-missing;
  • $D_2$ are the pairs $x_j,y_j$ where only $x_j$ is missing;
  • $D_3$ are the pairs $x_j,y_j$ where only $y_j$ is missing;
  • $D_4$ are the pairs $x_j,y_j$ where both are missing.

Note that since $\sum_{x,y} P(X=x,Y=y) = 1$, the last term of the log-likelihood is 0.

As an initial approximation, let us estimate $P(X=1)$ using data sets $D_1$ and $D_3$:

$$ \theta^{(0)}_X = 1/3 \, . $$

Similarly, we can estimate $P(Y=1|X)$ using $D_1$:

$$ \theta^{(0)}_{Y | 0} = 2/3, \qquad \theta^{(0)}_{Y | 1} = 1/2 \, . $$

We can use these estimates to fill in the missing values. One alternative is to replace each missing value with its most probable value. For instance, we can choose

$$ y_4 = \arg\max_{y} P(Y=y|X=0)=1 $$

and

$$ x_3 = \arg\max_{x} P(X=x|Y=1) = \arg\max_{x} P(X=x)\, P(Y=1|X=x) = 0 \, . $$

Expectation-Maximization

The completion approach (aka imputation) does not account for our uncertainty about the completion. An arguably better approach is to consider the expected value of each missing value. For instance,

$$ \mathbb{E}(X_3|Y_3=1) = P(X_3=1|Y_3=1) = 3/11 $$

and

$$ \mathbb{E}(Y_4|X_4=0) = P(Y_4=1|X_4=0) = 2/3 \, . $$

We can interpret these values as fractional counts, and use them to compute the expected log-likelihood:

$$ \begin{align*} ELL(\theta_X) &= \mathbb{E}(N[X=1]) \ln (\theta_X) + \mathbb{E}(N[X=0]) \ln (1-\theta_X) \, , \\ ELL(\theta_{Y|0}) &= \mathbb{E}(N[Y=1,X=0]) \ln (\theta_{Y|0}) + \mathbb{E}(N[Y=0,X=0]) \ln (1-\theta_{Y|0}) \, , \\ ELL(\theta_{Y|1}) &= \mathbb{E}(N[Y=1,X=1]) \ln (\theta_{Y|1}) + \mathbb{E}(N[Y=0,X=1]) \ln (1-\theta_{Y|1}) \, , \\ ELL(\theta) &= ELL(\theta_X) + ELL(\theta_{Y|0}) + ELL(\theta_{Y|1}) \, . \end{align*} $$

Note that the expression above is the same as in the case of complete data, except that the counts have been replaced with expected counts. Thus, we can find new estimates by maximizing the expected log-likelihood:

$$ \theta^{(1)} = \arg\max_{\theta} ELL(\theta) \, . $$

This process can be repeated until convergence (which is guaranteed, although possibly to a local maximum of the likelihood). This is known as the Expectation-Maximization (EM) algorithm, which alternates between an expectation step, where expected counts are computed, and a maximization step, where new estimates are produced by maximizing the expected log-likelihood.

These two steps can be combined into a single update formula for arbitrary (discrete) Bayesian networks:

$$ \theta^{(t+1)}_{X|\nu}(x) = \frac{ \sum_j P(X=x,\pa{X}=\nu|D_j,\theta^{(t)}) }{ \sum_j P(\pa{X}=\nu|D_j,\theta^{(t)}) } \, , $$

where $D_j$ is the assignment to observed variables in row $j$.

To avoid poor local optima, the algorithm is usually run several times with different initial estimates. Note that solving the equation above requires performing inference in the network. As in the complete data case, we can place priors over the parameters and maximize the posterior probability. Again, the result is a difficult non-convex optimization, which can be solved by EM. The update rule is then (for the MAP criterion):

$$ \theta^{(t+1)}_{X | \nu}(x) \propto \sum_j P(X=x, \pa{X}=\nu \mid D_j, \theta^{(t)}) + \alpha_{x|\nu} - 1 \, . $$
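
To tie the pieces together, here is a minimal EM sketch for the two-variable network $X \rightarrow Y$ and the toy data set above (with `None` standing for a missing value); it implements the maximum-likelihood update by accumulating expected counts:

```python
import itertools

# Toy data for X -> Y; None marks a missing value (rows 3, 4 and 6 of the table above).
data = [(0, 1), (1, 1), (None, 1), (0, None), (0, 1), (None, None), (0, 0), (1, 0)]

def em(data, theta_x=0.5, theta_y=(0.5, 0.5), iters=50):
    """theta_x = P(X=1); theta_y[x] = P(Y=1 | X=x)."""
    for _ in range(iters):
        # E-step: accumulate expected counts E(N[X=x]) and E(N[X=x, Y=y]).
        cx = {0: 0.0, 1: 0.0}
        cxy = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
        for xo, yo in data:
            # Posterior over the missing values given the observed ones and the current theta.
            weights = {}
            for x, y in itertools.product((0, 1), repeat=2):
                if (xo is not None and x != xo) or (yo is not None and y != yo):
                    continue
                px = theta_x if x == 1 else 1 - theta_x
                py = theta_y[x] if y == 1 else 1 - theta_y[x]
                weights[(x, y)] = px * py
            z = sum(weights.values())
            for (x, y), w in weights.items():
                cx[x] += w / z
                cxy[(x, y)] += w / z
        # M-step: re-estimate the parameters from the expected counts.
        theta_x = cx[1] / (cx[0] + cx[1])
        theta_y = (cxy[(0, 1)] / cx[0], cxy[(1, 1)] / cx[1])
    return theta_x, theta_y

print(em(data, theta_x=1/3, theta_y=(2/3, 1/2)))   # starts from the initial estimates above
```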

Gibbs sampling

An alternative (and usually less efficient) approach to parameter learning is to perform Gibbs sampling on the parameter-learning Bayesian network. This works with incomplete data, and even with continuous variables in the base model (as long as we can sample from their conditional distributions). However, it usually performs worse than the methods presented above. An improvement is to perform inference in the discrete part of the network conditional on the continuous parameter variables, and to analytically integrate out these variables, leading to a Rao-Blackwellized version of Gibbs sampling, which can be effective in practice, especially when the number of hidden variables is high.

Markov networks

The gradient ascent approach can be adapted to the case of incomplete data. The only change is that many more inferences are required to update a single parameter at each iteration (to compute the expected feature counts given each partially observed instance). An analogous version of Expectation-Maximization can also be used, where the data are completed with expected counts (requiring inference) and new parameters are then found (through, e.g., gradient ascent). It is also possible to replace the latter step with a maximization of the pseudo log-likelihood, thus making the procedure more efficient. Finally, Gibbs sampling can be applied, and it is often competitive with the other approaches (especially if a Rao-Blackwellized version is used).


  1. Note that we also have to estimate $P(X=0)$, but this is defined by $\theta=1-P(X=0)$. The choice of estimating $P(X=1)$ rather than $P(X=0)$ is arbitrary and unimportant. ↩︎

  2. The notation $P(X=x|\theta)$ suggests that we can specify a probability distribution $P(\theta)$; while this is not necessary, it is often desirable, as we will later see. ↩︎

  3. Note that this approach generates inconsistent probability models, with conditionals $P(\alpha|\beta)$ defined for $P(\beta)=0$. Such models lie in the realm of the theory of full conditional probabilities. ↩︎
