Missing Data

$ \newcommand{\nb}[1]{\text{ne}(#1)} \newcommand{\pa}[1]{\text{pa}(#1)} \newcommand{\ch}[1]{\text{ch}(#1)} \newcommand{\de}[1]{\text{de}(#1)} \newcommand{\an}[1]{\text{an}(#1)} \newcommand{\nd}[1]{\text{nd}(#1)} \newcommand{\can}[1]{\overline{\text{an}}(#1)} \newcommand{\ind}{⫫} \newcommand{\dsep}{\perp_d} \newcommand{\msep}{\perp_m} $

When using or estimating models we are often faced with partial information such as when a sensor fails to record its measurement, when survey respondents are unable or unwilling to answer questions, or when certain quantities are simply not directly observable. In this lecture we will discuss how such cases can be captured and handled within the framework of Bayesian networks, taking advantage of d-separations induced by a model of the data missingness mechanism.

Inferences from missing data

Real data sets often come with missing values, that is, values that have not been recorded. Examples of such a phenomenon include carriers of a disease failing to report it (due to fear of prejudice, mistreatment or social norms), physical sensor failures due to environmental conditions (which they might be sensing), user preferences/ratings of items (which suffer selection bias in multiple ways), participants dropping out of a scientific study (due to lack of interest or inability to participate) and even a poorly understood process. Missing data poses several problems for estimating probabilistic models and using them to draw inferences. Consider the construction of a credit score rating system. Financial institutions often build models to predict the probability that an individual, if granted a loan, will default. The model is usually built from data of individuals that were granted loans; this can lead to drastically skewed inferences. For instance, suppose that referral by longstanding customers is a common reason for acceptance of a loan request from young people, but that information is not recorded in the data. Then the dataset associates young people with a smaller risk of default, while the opposite is possibly the case. It is thus important for the model builder to consider the process by which data might be biased, or more specifically, missing.

Yet another example is in recommender systems. Marlin et al. in 2007 conducted a survey with 5400 users of a music streaming service, asking how likely each person would be to rate a song given their opinion of it.1 The results are summarized in the table below.

| Frequency | Hate | Dislike | Neutral | Like | Love |
|---|---|---|---|---|---|
| Never | 7% | 5% | 2% | 0.1% | 0.07% |
| Very seldom | 2% | 4% | 9% | 0.5% | 0.3% |
| Seldom | 2% | 4% | 25% | 1% | 0.2% |
| Often | 12% | 22% | 27% | 25% | 5% |
| Very often | 78% | 64% | 37% | 73% | 94% |

There is a clear trend towards polarization, with users more likely to provide ratings for content they feel strongly positive or strongly negative about. The trend was particularly biased towards high ratings, following a similar pattern found in other domains.2 The researchers then asked the users to rate 10 randomly selected songs from a fixed random sample of 1000 songs that had been rated at least 500 times. The result was compared to the ratings already available in the system (where users choose which items to rate). The table below approximates the results obtained.

| Rating | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Survey | 51% | 25% | 15% | 5% | 2% |
| System | 30% | 13% | 18% | 17% | 22% |

Note that both data sets contain missing data, generated by different processes (random selection versus non-random selection). The results show that the marginal distribution of ratings changes significantly, with the surveyed ratings showing a strong skew towards low ratings and the system's existing ratings displaying a flatter distribution. The effect is arguably intuitive: most songs will be disliked by most users, and users tend to select (and therefore rate) songs they are more likely to like.

To better understand the effect that missing data might have on inference, consider the following synthetic data:

| X | 0 | 0 | 0 | 1 | 1 | ? | ? | ? | ? | ? |
|---|---|---|---|---|---|---|---|---|---|---|
| M | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

Variable $X$ might denote, for example, age (young, old), income (low, high), the status of a smoke alarm (normal, abnormal), whether a loan borrower has defaulted or not, or whether a specific user liked/disliked an item. The variable $M$ simply indicates whether variable $X$ has been observed ($M=0$), or whether its value is missing ($M=1$). If we use the available observed data to estimate $P(X)$ we obtain:

| $X$ | $P(X)$ |
|---|---|
| 0 | 0.6 |
| 1 | 0.4 |

Now suppose that the actual data is:

| X | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|
| M | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |

That is, $X$ is missing only when $X=0$. Then instead of the previous estimates, we get $P(X=1)=0.2$. If on the other hand we had $X$ missing only when $X=1$, we would have $P(X=1)=0.7$. Hence, without additional knowledge we can only infer that

$$ 0.2 \leq P(X=1) \leq 0.7 \, . $$
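The bounding argument can be made concrete with a minimal sketch (function name and data layout are illustrative): the bounds follow from completing the missing entries in the two extreme ways.

```python
# A minimal sketch of the bounding argument: with k missing entries,
# every completion between "all missing are 0" and "all missing are 1"
# is possible, which yields an interval for P(X=1).
def p_x1_bounds(observed, n_missing):
    """Bounds on P(X=1) given observed binary values and n_missing unknowns."""
    n = len(observed) + n_missing
    ones = sum(observed)
    lower = ones / n                 # every missing value is 0
    upper = (ones + n_missing) / n   # every missing value is 1
    return lower, upper

# The synthetic data above: X observed as 0,0,0,1,1 plus five missing values.
print(p_x1_bounds([0, 0, 0, 1, 1], 5))  # (0.2, 0.7)
```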

Let us consider a more concrete example. The table below is extracted from an observational study of the relation between smoking and severity of COVID-19 symptoms.

| | Mild | Severe |
|---|---|---|
| smoker | 8 | 14 |
| non-smoker | 124 | 324 |
| unknown | 7 | 2 |

The data was obtained from self-reported questionnaires applied to patients that tested positive for COVID-19 at a large French hospital. The last row corresponds to unknown values, caused by respondents that were unable (e.g., because they had died) or unwilling to answer the survey. The most important thing to notice is that the dataset contains only data about confirmed positive cases of COVID-19 at that hospital. A complete picture of the population can be conceived as a data set where each instance is an individual of the population (say, from the same region as the survey), with confirmed or unconfirmed COVID-19, reporting the smoking status (smoker, non-smoker) and the severity of the disease. Since we have estimates of smoking prevalence (about 26%) and disease incidence (about 10% at the time of the survey), we might assume that the smoking status of every individual is observed, while the disease severity is not. For example, the complete data set might look something like:

| Patient | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Smoker | yes | yes | no | no |
| Severity | mild | ? | ? | severe |
| Respondent | yes | no | no | yes |

Note that the last row indicates whether the value of the severity is known or not. As with the previous example, the inferences one draws from the observed data may or may not agree with the actual completion of the data set. For instance, it might be that smokers with mild symptoms tend to go to hospitals, and thus are tested more often than non-smokers with mild symptoms.

A final example considers data that is missing by design. One approach to reducing the burden on respondents of a survey or takers of an exam is to randomly assign a subset of the target questions to each participant, so that each participant answers only a few questions, but every question is answered by a significant number of participants. The decision of which questions to select can be based on fully observed data such as participant age, gender and place of birth, to ensure a fair balance or coverage of a certain target population. In this scenario, the probability that a value is missing does not depend on its unmeasured value (but might depend on other variables). For example, the US National Assessment of Educational Progress measures student knowledge of several subject areas (mathematics, geography, reading, etc.) by means of written tests. The exam is applied so that each student answers questions from only one subject area, and each area has sufficient data to reach reliable estimates.

Missingness mechanism

The (ideal) process that causes a value to be missing is known as the missingness process, and is modeled for the univariate case by $P(X_m=1|X)$, where $X_m$ denotes whether $X$ is missing or not. To better represent that process, let $X_o$ denote the fully observable copy of $X$, so that the two variables are related by: $$ X_o = \begin{cases} X & \text{if } X_m = 0,\\ ? & \text{if } X_m = 1. \end{cases} $$
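In code, the proxy is simply a deterministic function of $X$ and $X_m$; a minimal sketch (names and data are illustrative):

```python
# The proxy-variable convention: X_o mirrors X when X_m = 0 and is the
# placeholder '?' when X_m = 1.
def proxy(x, x_m):
    return x if x_m == 0 else "?"

xs = [0, 1, 1, 0]
ms = [0, 0, 1, 1]  # response indicators X_m
print([proxy(x, m) for x, m in zip(xs, ms)])  # [0, 1, '?', '?']
```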

We say that $P(X)$ is recoverable if it can be estimated given (an infinite amount of) data on $X_o$. More generally, recoverability is the ability to ideally obtain an accurate estimate of a parameter from an assumed model of the (incomplete) data generation process (which for our purposes here will be identified with a Bayesian network). Accuracy here is intended as asymptotic convergence to the idealized true value given an infinite supply of data (this is called consistency in frequentist statistical parlance). Thus, recoverability is a minimal requirement for unbiased inference.

Data missing at random

There are situations, however, where we can recover the distribution of $X$ even in the presence of missing data. To see this, suppose that the unobserved values are missing at random, that is, that the ideal missingness mechanism that decides whether a value is observed or missing does not depend on the value itself: $P(X_m|X)=P(X_m)$. We can represent this situation graphically as follows:

Missing at random process of univariate model

According to the structure above, $X$ is d-separated from $X_m$, hence $$ P(X)=P(X|X_m=0)=P(X_o|X_m=0) \, . $$ Note that in the rightmost term both $X_o$ and $X_m$ are observed quantities. For this simple example, assuming a missing at random mechanism is equivalent to discarding missing values. For more intricate examples involving many variables, we might need information from other variables in order to recover a desired probability. This is shown in the following example.

Consider data about students of a school reporting their gender ($G$), age ($A$) and weight ($W$). Due to operational difficulties, older kids tend to have their weight unreported (e.g., participation among teenagers is lower; however, age and gender can be gathered from school records and are available even for non-participants). A possible data generation process where data is missing at random is represented by the graph below.

A multivariate example of missing at random mechanism

Since the variables $G$ and $A$ are fully observable, we can estimate $P(A)$ and $P(G)$ as relative counts from data. Note that this procedure actually utilizes all data records, including ones where $W$ is missing. According to the graph, $A$ d-separates $W$ and $W_m$. Hence $$ P(W|G,A) = P(W|G,A,W_m=0) = P(W_o|G,A,W_m=0) \, . $$ Thus, the whole joint distribution $P(G,W,A)$ can be recovered from data.
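As a concrete sketch of this recovery strategy, the snippet below estimates $P(G)$ and $P(A)$ from all records, and $P(W \mid G, A)$ only from the rows where the weight was recorded; the data frame and its values are illustrative:

```python
# A sketch of the recovery above: P(G) and P(A) use every record, while
# P(W | G, A) = P(W_o | G, A, W_m = 0) uses only rows with W observed.
import pandas as pd

df = pd.DataFrame({
    "G": ["f", "f", "m", "m", "f", "m", "f", "m"],
    "A": ["young", "old", "young", "old", "old", "young", "old", "young"],
    "W": ["low", None, "high", None, "high", "low", None, "low"],
})

p_g = df["G"].value_counts(normalize=True)   # uses all records
p_a = df["A"].value_counts(normalize=True)   # uses all records

observed = df[df["W"].notna()]               # rows with W_m = 0
p_w_given_ga = observed.groupby(["G", "A"])["W"].value_counts(normalize=True)
print(p_g, p_a, p_w_given_ga, sep="\n\n")
```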

One important thing to notice is that missingness at random cannot be falsified from data alone. The reason is that any statistical test of the claim would require knowledge of the very values that are missing. To see this, suppose that we assume that data is missing at random ($X \ind X_m$) and estimate $$ P(X) = P(X_o | X_m = 0) \, . $$ This implies that $$ P(X,X_m) = P(X_o | X_m = 0) P(X_m) = P(X)P(X_m) \, . $$ That means that any data set can be made compatible with a missing at random hypothesis, so the claim cannot be falsified unless additional assumptions are made.3

Data missing not at random

Missing at random is a strong assumption about the missingness process, one that is very often violated in practice. As we have discussed, the probability of a user providing a rating (or conversely, of not providing a rating) depends on the value of the rating itself, thus violating the missing at random assumption. If $X$ is the rating of a certain item, then we represent graphically the situation as follows:

Missing not at random process of univariate model

Because $X$ and $X_m$ are d-connected, we cannot recover $P(X)$ from the observed data $X_o$ and $X_m$ alone.

Here is another example of non-recoverability.

Consider again the data of the study about smoking and disease severity. A possible data generation process that assumes that smokers with mild symptoms are more likely to be tested than non-smokers is shown below.

A missing not at random model for the COVID-19 and smoking data

According to the model above, we cannot recover the conditional distribution of $D$ from the observed data, as $$ P(D|S) \neq P(D|S, D_m = 0, S_m =0) \, . $$ This follows since $D$ and $D_m$ cannot be d-separated by any observation. Note that the non-recoverability remains if we drop the arc from $S$ to $D$.

Missingness graphs

The types of graphs we have used for representing the missingness process are known as missingness graphs, or m-graphs for short.4

A missingness graph is an acyclic directed graph whose node set $\mathcal{V}$ is a collection of random variables partitioned into

  • a set $\mathcal{U}$ of unobserved random variables,
  • a set $\mathcal{O}$ of fully observed random variables,
  • a set $\mathcal{M}$ of partially observed random variables,
  • a set $\mathcal{P}$ of proxy random variables containing a variable $X_o$ for each $X \in \mathcal{M}$, and
  • a set $\mathcal{R}$ of response random variables containing a variable $X_m$ for each variable $X \in \mathcal{M}$.

An unobserved random variable, also called latent or hidden, is one whose values are always missing. A random variable is fully observed if its value is known for every record of the data set. Conversely, a random variable is partially observed if its value is missing in at least one record. A response variable is a binary random variable representing the missingness process that generates a missing value. A proxy variable is an ancillary fully observed deterministic random variable that represents the recorded values of a partially observed variable $X \in \mathcal{M}$ as $$ X_o = \begin{cases} X & \text{if } X_m = 0,\\ ? & \text{if } X_m = 1. \end{cases} $$

The marginal distribution $P(\mathcal{P}, \mathcal{O}, \mathcal{R})$ is called the observed-data distribution, and the marginal distribution $P(\mathcal{O},\mathcal{M},\mathcal{R})$ is called the underlying distribution. The former specifies the distribution over the quantities actually available in the data, while the latter specifies the distribution over the random variables had no missingness occurred.

Recovering a desired parameter $\theta$ amounts to specifying it as a function of the observed-data distribution $P(\mathcal{O},\mathcal{R},\mathcal{P})$, whatever that may be (as long as it assigns positive probability to the observed data). Because we require that $\theta$ is recoverable for any positive observed-data distribution, recoverability is a property of the missingness graph, and not of the data. That is, we can discuss recoverability even in the absence of any data (which can lead to better experimental or survey design).

We can redefine the types of missingness mechanisms according to their graphical representation. Data is missing completely at random if $$ \mathcal{M} \cup \mathcal{U} \cup \mathcal{O} \dsep \mathcal{R} \, . $$ In words, the missingness is entirely independent of both the observed and the partially observed variables. In terms of d-separation, this implies the absence of arcs between response variables and variables in $\mathcal{O} \cup \mathcal{M}$. A typical approach for data missing completely at random is to discard any records that contain missing data. By the definition above, we have that $$ P(\mathcal{M} \cup \mathcal{O}) = P(\mathcal{P} \cup \mathcal{O} \mid \mathcal{R} = 0) \, , $$ which justifies the procedure. Note that the right-hand side contains only fully observed variables. While this procedure is able to recover the parameters, discarding data might be inefficient or even detrimental to estimation from a finite sample.5

Consider the data set below containing missing values for variables $L$ and $E$:

| id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ? | ? | ? | ? | ? | ? | ? | ? |
| E | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ? | ? | ? | ? | ? | ? | ? | ? | 0 | 0 | 0 | 0 | 1 | 1 | ? | ? |
| S | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| D | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

Suppose the data is missing completely at random, and is generated by the graph below.

Let us first consider the approach where we discard all records with missing values, and use only the first 8 records to estimate the parameters of the model. In this case, using the observed relative frequencies as estimates of probabilities, we have $P(S=1) = P(S=1 | L_m=0, E_m=0) = 4/8 = 1/2$ and $$ \begin{align*} P(L=1|S=0) &= P(L_o=1|S=0,L_m=0,E_m=0) = 1/4 \, , \\ P(L=1|S=1) &= P(L_o=1|S=1,L_m=0,E_m=0) = 2/4 = 1/2 \, , \\ P(E=1|L=0) &= P(E_o=1|L_o=0,L_m=0,E_m=0) = 1/5 \, , \\ P(E=1|L=1) &= P(E_o=1|L_o=1,L_m=0,E_m=0) = 3/3 = 1 \, , \\ P(D=1|E=0) &= P(D=1|E_o=0,L_m=0,E_m=0) = 2/4 = 1/2 \, , \\ P(D=1|E=1) &= P(D=1|E_o=1,L_m=0,E_m=0) = 2/4 = 1/2 \, . \end{align*} $$ Note the small sample sizes on which the above estimates are based.

Because $S$ is fully observed, we can estimate its probability using the entire dataset. In this case, we have that $P(S=1) = 10/24 = 5/12$. Likewise, we only need to condition on the necessary response variables: $$ \begin{align*} P(L=1|S=0) &= P(L_o=1|S=0,L_m=0) = 3/8 \, , \\ P(L=1|S=1) &= P(L_o=1|S=1,L_m=0) = 4/8 = 1/2 \, , \\ P(E=1|L=0) &= P(E_o=1|L_o=0,L_m=0,E_m=0) = 1/5 \, , \\ P(E=1|L=1) &= P(E_o=1|L_o=1,L_m=0,E_m=0) = 3/3 = 1 \, , \\ P(D=1|E=0) &= P(D=1|E_o=0,E_m=0) = 4/8 = 1/2 \, , \\ P(D=1|E=1) &= P(D=1|E_o=1,E_m=0) = 3/6 = 1/2 \, . \end{align*} $$

Note that some estimates changed drastically (e.g., $P(L=1|S=0)$) while others remained the same, but are now based on a larger sample size (e.g., $P(D=1|E=1)$). Note also the difference in sample sizes: the estimates for $P(E=1|L=1)$ relied on only 3 records, while the estimate for $P(S)$ is based on all 24 records. We can improve the data efficiency by combining estimates. For example, the estimates for $P(L|S)$ can be obtained from estimates of $P(S,L)$, and the latter factorizes, according to the chain rule, in two possible ways: $$ \begin{align*} P(S,L) &= P(S|L)P(L) = P(S|L_o,L_m=0) P(L_o|L_m=0) \\ &= P(L|S)P(S) = P(L_o|S,L_m=0)P(S) \, . \end{align*} $$ Using the first factorization and the d-separations in the m-graph, we obtain the following estimate for the joint distribution $P(S,L)$:

| | $L=0$ | $L=1$ |
|---|---|---|
| $S=0$ | $\frac{5}{9}\cdot\frac{9}{16} = \frac{5}{16}$ | $\frac{3}{7}\cdot\frac{7}{16} = \frac{3}{16}$ |
| $S=1$ | $\frac{4}{9}\cdot\frac{9}{16} = \frac{1}{4}$ | $\frac{4}{7}\cdot\frac{7}{16} = \frac{1}{4}$ |

For the second factorization we have:

| | $L=0$ | $L=1$ |
|---|---|---|
| $S=0$ | $\frac{5}{8}\cdot\frac{7}{12} = \frac{35}{96}$ | $\frac{3}{8}\cdot\frac{7}{12} = \frac{7}{32}$ |
| $S=1$ | $\frac{1}{2}\cdot\frac{5}{12} = \frac{5}{24}$ | $\frac{1}{2}\cdot\frac{5}{12} = \frac{5}{24}$ |

We can combine the two estimates to obtain a final one. A simple rule is to take their average:

| | $L=0$ | $L=1$ |
|---|---|---|
| $S=0$ | $\frac{65}{192}$ | $\frac{13}{64}$ |
| $S=1$ | $\frac{11}{48}$ | $\frac{11}{48}$ |

We can obtain final estimates for $P(L|S)$ by renormalizing each row of the table above (i.e., dividing each entry by its row sum):

| | $L=0$ | $L=1$ |
|---|---|---|
| $S=0$ | $\frac{5}{8}$ | $\frac{3}{8}$ |
| $S=1$ | $\frac{1}{2}$ | $\frac{1}{2}$ |

In this particular example the combined estimate of $P(L|S)$ coincides with the direct estimate $P(L_o|S,L_m=0)$, since both factorizations renormalize, row by row, to the same conditional; the averaging nonetheless yields a different (and hopefully more stable) estimate of the joint $P(S,L)$. A similar procedure can be applied for estimating the conditional probabilities of the other variables. Note that since we can recover the joint distribution, we can recover any conditional probability (or function of it).
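The arithmetic above can be reproduced mechanically. Here is a minimal sketch that computes both factorization estimates of $P(S,L)$ from the 24-record table and their average, using exact fractions to match the numbers in the text:

```python
# Both factorization estimates of P(S, L) and their simple average.
# '?' marks missing values in the 24-record data set above.
from fractions import Fraction as F

L = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1] + ["?"] * 8
S = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
     0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

def cond(xs, ys, x, y):
    """Relative frequency P(X=x | Y=y) over records where both are observed."""
    pairs = [(a, b) for a, b in zip(xs, ys) if a != "?" and b != "?"]
    return F(sum(1 for a, b in pairs if a == x and b == y),
             sum(1 for _, b in pairs if b == y))

def marg(xs, x):
    """Relative frequency P(X=x) over records where X is observed."""
    seen = [a for a in xs if a != "?"]
    return F(sum(1 for a in seen if a == x), len(seen))

for s in (0, 1):
    for l in (0, 1):
        f1 = cond(S, L, s, l) * marg(L, l)  # P(S|L) P(L)
        f2 = cond(L, S, l, s) * marg(S, s)  # P(L|S) P(S)
        print(f"P(S={s},L={l}) = {f1} or {f2}; average {(f1 + f2) / 2}")
```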

We define data as missing at random if $$ \mathcal{M} \cup \mathcal{U} \dsep \mathcal{R} \mid \mathcal{O} \, . $$ This is a relaxed mechanism where the missingness is independent of the partially observed and unobserved variables given the fully observed variables. In graphical terms, it requires that there are no edges between response variables and partially observed (or unobserved) variables, and that response variables and partially observed variables share no unobserved common ancestor. That assumption implies that $$ P(\mathcal{M} \mid \mathcal{O}) = P(\mathcal{P} \mid \mathcal{O}, \mathcal{R} = 0) \, . $$ That is, we can estimate any function of the partially observed variables from the observed portion. Note that discarding records with missing values does not lead to consistent estimates of arbitrary parameters, since it might not be true that $$ P(\mathcal{O} \mid \mathcal{R} = 0) = P(\mathcal{O} \mid \mathcal{R}=1) \, . $$ Hence, obtaining a consistent estimate of a parameter that depends on both partially observed and fully observed variables requires using the entire dataset to estimate $P(\mathcal{O})$.

Consider again the data of the previous example, but assume now that the missingness process is modeled as follows.

According to the graph, $L$ is missing at random (but not completely at random), while $E$ is missing completely at random. Estimating $P(S)$ from the fully observed portion of the data is inconsistent because $S$ and $L_m$ cannot be d-separated; this implies that $$ P(S) = \sum_L P(S|L)P(L) \neq \sum_L P(S|L,L_m=0)P(L|L_m=0) \, . $$ Instead we obtain a consistent estimate using the entire dataset as before: $$ P(S=1) = 5/12 \, . $$ As for $P(L|S)$, note that $S$ d-separates $L$ from $L_m$. Thus $$ P(L|S) = P(L_o | S, L_m = 0) = P(L_o | S, L_m = 0, E_m = 0) \, . $$ As in the missing completely at random case, we can obtain consistent estimates of $P(L|S)$ either by considering only the records where $L$ is not missing, or by considering only the fully observed records. As discussed, the former is preferred as it makes better use of the data (but both will converge to the same estimate with a large enough data set). Note that for this missingness process, there is only one consistent factorization of $P(L,S)$, since $$ P(S|L)P(L) \neq P(S|L,L_m)P(L|L_m) \, . $$ Consider now the estimate of $P(E|L)$. Since $L$ d-separates $L_m$ and $E$, we get a consistent estimate by: $$ P(E|L) = P(E_o|L_o,L_m=0,E_m=0) \, . $$ We can exploit multiple factorizations of $P(L,E)$ to obtain more reliable estimates. However, since $L$ is not missing completely at random, we need to marginalize over $S$ to render $L$ d-separated from $L_m$: $$ \begin{align*} P(L,E) &= P(E|L)P(L) = P(E_o|L_o,L_m=0,E_m=0) \sum_S P(L_o|S,L_m=0)P(S) \\ &= P(L|E)P(E) = P(E_o|E_m=0) \sum_S P(L_o|E_o,S,E_m=0,L_m=0) P(S|E_o,E_m=0) \, . \end{align*} $$ Note that even for the first factorization, the estimate of $P(L,E)$ makes use of more data than the direct estimate of $P(E|L)$, since it contains the term $P(S)$, which uses all records. The second factorization also benefits from more data, as $P(E)$ is estimated from all the records where $E$ is observed.

For the remaining cases the data is considered missing not at random. This scenario requires more sophisticated inference routines that exploit the graphical structure of the problem. For example, we have the following result.

A query $P(X|Y)$ (or any function of it) is recoverable from the fully observed records alone if $X \dsep \mathcal{R} \mid Y$ in the m-graph. The query (or any function of it) is recoverable from the records where $X$ and $Y$ are both observed if $X \dsep X_m,Y_m \mid Y$ in the m-graph.

In the result above $X$ and $Y$ can denote sets of random variables, with the appropriate modifications. Thus, the m-graph allows us to recover desired quantities even in the presence of missing not at random data. This is shown by the next example.

Consider data generated according to the missingness process below (proxy variables are omitted for clarity):

Example of recoverable model of data missing not at random

Through d-separation, we have the following estimates: $$ \begin{gather*} P(Y) = P(Y_o|Y_m=0) \\ P(X|Y) = P(X_o|Y_o, X_m=0, Y_m=0) \\ P(Z|X,Y) = P(Z_o|X_o, Y_o, X_m=0, Y_m=0, Z_m=0) \\ P(X,Y,Z) = P(Y) P(X|Y) P(Z|X,Y) \end{gather*} $$ From the joint distribution we can recover all the conditional probabilities parametrizing the structure.
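Since the figure's exact arcs are not reproduced here, the sketch below simulates a hypothetical missingness process consistent with the d-separations used in the estimates above (each response indicator depends only on variables that the corresponding estimate conditions on), and checks the decomposition on synthetic binary data:

```python
# Decomposition recovery on synthetic data; the response indicators are
# hypothetical stand-ins consistent with the estimates in the text.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
Y = rng.random(n) < 0.4
X = rng.random(n) < np.where(Y, 0.7, 0.2)
Z = rng.random(n) < np.where(X ^ Y, 0.8, 0.3)
Ym = rng.random(n) < 0.2                      # independent of everything
Xm = rng.random(n) < np.where(Y, 0.3, 0.1)    # depends on partially observed Y
Zm = rng.random(n) < np.where(X, 0.25, 0.05)  # depends on partially observed X

def freq(event, given):
    """Relative frequency of `event` among records satisfying `given`."""
    return (event & given).sum() / given.sum()

p_y1 = freq(Y, ~Ym)                           # P(Y=1) = P(Y_o=1 | Y_m=0)
p_x1_y1 = freq(X, ~Xm & ~Ym & Y)              # P(X=1 | Y=1)
p_z1_x1y1 = freq(Z, ~Xm & ~Ym & ~Zm & X & Y)  # P(Z=1 | X=1, Y=1)
# Chain rule: P(X=1, Y=1, Z=1) = P(Y=1) P(X=1|Y=1) P(Z=1|X=1,Y=1)
print(p_y1 * p_x1_y1 * p_z1_x1y1)  # ≈ 0.4 * 0.7 * 0.3 = 0.084
```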

Note that while MAR is not testable, some of the implications of the m-graph can be tested from the observed data, for example:

  • $Z \ind Y_m \mid Z_m = 0$
  • $Z_m \ind X_m \mid Y, Y_m = 0$

Recoverability is summarized by the following theorem.

Given an m-graph with no arcs between variables in $\mathcal{R}$, the joint distribution $P(\mathcal{M},\mathcal{O})$ is recoverable if and only if there is no variable $X \in \mathcal{M}$ such that

  • $X$ and $X_m$ are connected by an arc, or
  • $X$ and $X_m$ are connected by a path in which all intermediate nodes are colliders and elements of $\mathcal{M} \cup \mathcal{O}$.

If the joint distribution is recoverable, then it can be estimated as $$ P(\mathcal{M},\mathcal{O}) = \frac{P(\mathcal{R}=0,\mathcal{P},\mathcal{O})}{\prod_X P(X_m = 0 \mid \text{mb}(X_m), \mathcal{R}_{X} = 0 )} \, , $$

where $\text{mb}(X)$ denotes the Markov Blanket of variable $X$ and $\mathcal{R}_{X}$ is the set of response variables of the partially observed variables in $\text{mb}(X_m)$.

Consider again the m-graph below.

Example of recoverable model of data missing not at random

Using the theorem, we can recover the joint distribution as $$ P(X,Y,Z) = \frac{P(X_o,Y_o,Z_o,X_m=Y_m=Z_m=0)}{P(X_m=0|Y_o,Y_m=0)P(Y_m=0)P(Z_m=0|X_o,Y_o,X_m=Y_m=0)} $$ We can manipulate the numerator to cast it in terms of observed variables only: $$ \begin{multline*} P(X_o,Y_o,Z_o,X_m=Y_m=Z_m=0) = \\ P(Y_m=0) P(X_m=0|Y_o,Y_m=0) \times \\ P(Z_m=0|X_o,Y_o,X_m=Y_m=0) P(X,Y,Z) \, . \end{multline*} $$ Note that the denominator cancels with the first terms in the expression above, proving the correctness of the procedure.

As the example above shows, a recoverable quantity can be estimated in more than one way. The next example shows a situation where the simple decomposition approach taken previously fails, and the formula given by the Theorem is required.

Consider partially observed variables $X$ and $Y$ with a missing not at random process given below.

Example of recoverable model of data missing not at random

We cannot recover $P(X,Y)$ by decomposition, since $X$ requires $Y$ to be observed to be recovered and $Y$ requires $X$ to be observed to be recovered. That is, $$ P(X|Y) \neq P(X_o|Y_o,Y_m=0,X_m=0) $$ because $X$ is not d-separated from $Y_m$ by $Y$, and $$ P(Y|X) \neq P(Y_o|X_o,Y_m=0,X_m=0) $$ because $Y$ is not d-separated from $X_m$ by $X$. The m-graph however satisfies the conditions of the theorem. In fact, $$ \begin{align*} P(X,Y) &= P(X,Y) \frac{P(X_m=0,Y_m=0 \mid X,Y)}{P(X_m=0,Y_m=0 \mid X,Y)} \\ &= \frac{P(X_m=0,Y_m=0)P(X,Y|X_m=0,Y_m=0)}{P(X_m=0|Y,Y_m=0)P(Y_m=0|X,X_m=0)} \\ &= \frac{P(X_m=0,Y_m=0)P(X_o,Y_o|X_m=0,Y_m=0)}{P(X_m=0|Y_o,Y_m=0)P(Y_m=0|X_o,X_m=0)} \, . \end{align*} $$ Note that all variables in the latter expression are fully observed.
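A simulation sketch of this correction, under a hypothetical parametrization of the process ($Y$ drives $X_m$ and $X$ drives $Y_m$): the fully observed joint is reweighted by the two response probabilities, each estimable from partially complete records, recovering the true $P(X,Y)$.

```python
# Recovering P(X,Y) by inverse reweighting of the fully observed joint.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p_xy = np.array([[0.3, 0.2],   # true joint, rows X, columns Y (illustrative)
                 [0.1, 0.4]])
flat = rng.choice(4, size=n, p=p_xy.ravel())
X, Y = flat // 2, flat % 2
Xm = rng.random(n) < np.where(Y == 1, 0.4, 0.1)  # X_m depends on Y
Ym = rng.random(n) < np.where(X == 1, 0.5, 0.2)  # Y_m depends on X

def p_resp(resp_ok, cond_val, cond_obs):
    """Estimate P(resp_ok | cond_val = v, cond_obs) for v in {0, 1}."""
    return np.array([(resp_ok & (cond_val == v) & cond_obs).sum()
                     / ((cond_val == v) & cond_obs).sum() for v in (0, 1)])

p_xm0_y = p_resp(~Xm, Y, ~Ym)  # P(X_m=0 | Y_o, Y_m=0), indexed by Y's value
p_ym0_x = p_resp(~Ym, X, ~Xm)  # P(Y_m=0 | X_o, X_m=0), indexed by X's value

est = np.zeros((2, 2))
for x in (0, 1):
    for y in (0, 1):
        joint_r0 = ((X == x) & (Y == y) & ~Xm & ~Ym).mean()  # P(x, y, R=0)
        est[x, y] = joint_r0 / (p_xm0_y[y] * p_ym0_x[x])
print(np.round(est, 3))  # should approximate p_xy
```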

The m-graph also allows us to identify non-recoverable cases, as in the following example.

Consider a missingness process represented by the m-graph below.

Example of non-recoverable model of data missing not at random

The query $P(X,Y,Z)$ is not recoverable since $Y \not\ind Y_m$.

Beyond structural properties

It is also possible that some distributions compatible with a given structure allow consistent estimation, while other, equally compatible, distributions do not.

Consider the m-graph below.

Example of data missing not at random

Assume that $X$ and $T$ are binary. We can first recover $P(T|X)$, since $T$ is d-separated from $X_m$ by $X$, and: $$ P(T|X) = P(T|X,X_m=0) = P(T|X_o,X_m=0) \, . $$ We can then obtain $P(X)$ by solving the following system of equations: $$ P(T=t) = \sum_x P(T=t|X=x)P(X=x) \, . $$ This is a system with two unknowns (namely, $P(X=0)$ and $P(X=1)$) and two equations (one for each value of $t$), where $P(T=t)$ and $P(T=t|X=x)$ are estimated from data. Hence, it admits a unique solution unless the two equations are linearly dependent. The latter is the case, for example, when $T$ and $X$ are independent (so that $P(T|X=0)=P(T|X=1)$ and any value of $P(X)$ is a solution to the system). The same reasoning applies to categorical random variables with the appropriate changes.
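A minimal sketch of this recovery with illustrative numbers; `numpy.linalg.solve` finds $P(X)$ from the recovered $P(T|X)$ and the observed $P(T)$:

```python
# Recover P(X) by solving  P(T=t) = sum_x P(T=t | X=x) P(X=x).
import numpy as np

p_t_given_x = np.array([[0.8, 0.3],   # entry [t, x] = P(T=t | X=x)
                        [0.2, 0.7]])
p_t = np.array([0.55, 0.45])          # P(T), estimable since T is fully observed

# Unique solution unless the columns coincide (i.e., T is independent of X).
p_x = np.linalg.solve(p_t_given_x, p_t)
print(p_x)  # [0.5, 0.5]
```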

Summary

Missing data is abundant in real applications, and its generating mechanism can induce biased inferences and produce erroneous conclusions. Yet practitioners often assume, consciously or not, that data is missing at random. Missingness graphs, that is, Bayesian network representations of the missingness process, can help in formalizing and identifying harmful and harmless scenarios, as well as improve the statistical efficiency of inferences with missing data.


  1. Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis and Malcolm Slaney, Collaborative filtering and the missing at random assumption, In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007. ↩︎

  2. Benjamin M. Marlin, Richard S. Zemel, Sam T. Roweis and Malcolm Slaney, Recommender Systems: Missing Data and Statistical Model Estimation, In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011. ↩︎

  3. Geert Molenberghs, Caroline Beunckens, Cristina Sotto and Michael G. Kenward, Every missingness not at random model has a missingness at random counterpart with equal fit, Journal of the Royal Statistical Society Series B 70:2, 2008. ↩︎

  4. Karthika Mohan and Judea Pearl, Graphical Models for Processing Missing Data, Journal of the American Statistical Association (to appear). ↩︎

  5. Guy Van den Broeck, Karthika Mohan, Arthur Choi, Adnan Darwiche and Judea Pearl, Efficient Algorithms for Bayesian Network Parameter Learning from Incomplete Data, In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015. ↩︎
