Probability Theory

Table of Contents

In this lecture, we discuss the different meanings attributed to probability. We show how the Kolmogorov's axioms of probability can be derived as a disposition to avoid sure loss. We then review the basic concepts of probability theory such as algebras of sets, probability measures, random variables, conditional probability, independence, the chain rule, the total rule and Bayes' rule.

1 An unfinished gamble

The idea that an event is more probable than another have long existed. However, it was not until the 17th century that a mathematical theory of uncertainty and randomness was developed. What started such rigorous study of the laws of uncertainty was the so-called problem of points, a somewhat frugal question on how to divide the stake of an unfinished gamble. By the end of the 15th century, some Italian mathematicians became interested in this problem without reaching a satisfactory solution. Then, in 1654, the problem was proposed to Pierre de Fermat and Blaise Pascal, who started a famous correspondence that would culminate in the first formalization of probability theory. The problem they discussed involved some unnecessary complications, but its essence can be put forward by the following example:

A gambling game between two players consists in tossing a coin repeatedly until one of the players accumulates 10 points. One player gets a point when the coin comes up heads, the other player gets a point when the coin comes up tails. Each player puts 50 Francs, making a pot worth of 100 Francs to be awarded to the winner. After playing a few rounds, one of the players is winning by 8 points to 7. Then, the game is suddenly interrupted to never be resumed. The question is: how should the players split the 100 Francs?

The solution proposed by Fermat and Pascal is rather trivial for anyone that has taken a basic course in probability theory, but was at the time very new: the stake should be divided in the same proportion as the number of cases that are favorable to each player, provided that each case is equally likely. Here is the rationale:

The first player needs two points to win the game, while the second player needs three. Thus, the game would be over had they played at most four more rounds. So let us assume that we would play the game for exactly four more rounds, independently of whether someone would have won before that. The following are the possible configurations that might have occurred (we use ➊ for heads and ➋ for tails):

➊➊➊➊ ➊➊➋➊ ➊➊➊➋ ➊➊➋➋
➊➋➊➊ ➊➋➋➊ ➊➋➊➋ ➊➋➋➋
➋➊➊➊ ➋➊➋➊ ➋➊➊➋ ➋➊➋➋
➋➋➊➊ ➋➋➋➊ ➋➋➊➋ ➋➋➋➋

Every configuration with at least two heads indicates a scenario where the first player wins, and there are 11 such configurations. If the stakes were divided according to the proportion of "favorable cases" to each player (i.e., those in which a player wins) then the player with 8 points would get 11/16 of the pot (68.75 Francs), and the other player would get 5/16 (31.25 Francs).

The key to the above solution is to consider only configurations with identical probabilities (whatever these might be). Had we simply listed the possible completions of the game (for example, that two heads straight ends the game in less than four tosses), that would have not been the case, as counting the number of favorable cases would produce an unfair split of the stake. Fermat's ingenious solution was to break down the unequal probability scenarios into equal probability configurations. This procedure is at the heart of the principle of indifference (also called the principle of insufficient reason), which was proposed as a definition of probability nearly two centuries later by Pierre Laplace:

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought.

The principle of indifference can be understood as constructing uniform probabilities over spaces of elementary events that, when combined, form the events in which we are interested. According to Pascal and his followers, these elementary events should be assigned identical probability due to our inability of predicting the unique true event that will occur (or have occurred); this is an immediate conclusion of determinism, the belief that the occurrence of any event in nature can be precisely determined by the laws of mechanics if one possess sufficient knowledge (which is never the case).

2 What is probability?

So what does it mean to say that "the probability of a fair die landing with number 4 up is 1/6" or that "the probability of a fair coin landing heads up is 1/2"?

2.1 Classical probability

The classical interpretation of probability is to consider the probability of an event as the number of favorable cases divided by the total number of cases, where each case is considered equiprobable. This definition faces several shortcomings. First, taken literally, it is only applicable for a finite number of elementary events, so as to exclude probability over infinite domains. Second, it contains a circular character, as it presumes that probability has been defined for the elementary (equiprobable) events. Last, the probability of an event depends on the precise characterization of the elementary events, with different parametrizations leading to different probabilities.1 These difficulties encouraged mathematicians to seek for a non-epistemic interpretation.

2.2 Frequentist probability

The objectivist interpretation states that this number is a measure of the propensity of that event to naturally occur. In other words, by saying that the probability of a fair die landing 4 is 1/6 we are actually saying that if we (conceivably) roll the die infinitely many times, then at the limit 1/6th of rolls will end with the die landing with face 4 up. This interpretation views probability as a physical property of objects in the world, much in the same way as we see height, weight and distance. Unlike height, weight and distance, however, probabilities are assumed not be directly measurable; instead, they should be estimated from real observations. Probabilities need also to be agreed on by everyone, since they rely exclusively on the object and not on the observer. Objective probabilities thus need to be constrained only by the mathematical rules of relative frequencies (or proportions).

But what does it mean to say that the limiting relative frequency of an event (say, a coin landing heads up) is 1/2? The standard approach is to consider a sequence \(a_1,a_2,\dotsc\), where each \(a_i\) denotes the outcome of an experiment performed (ideally) under the exact same conditions (but oddly providing different results).2 Then, provided that this sequence converges, the frequentist probability of an event is taken to be

\[ \lim_{n \rightarrow \infty} \sum_i a_i/n \, . \]

While this definition appeals to intuition, it allows somewhat strange situations when seen from an operational viewpoint. For example, if the experiment is tossing a coin, and the first 100 tosses land heads up but all (infinitely many) tosses after that land tails, then the probability of heads is zero, irrespective of the fact that we have observed it occurring 100 times (of course, we would be dead after infinitely many observations). Also, the limiting frequency is sensitive to rearrangements of the sequence (e.g., compare a sequence of 0s and 1s, where a 1 follows a 0; with a rearrangement where each 1 is followed by 3 other 0s). Most importantly, the frequentist view disallows probabilities to be assigned to propositions that are elusive to repeating experimentation. Take for example the proposition 'The Amazon river is longer than the Nile river'. A person might want to state that this proposition is true with probability 0.8; but the proposition is either true or false!3 It is hard to imagine an experiment where the Amazon river will at times be longer than the Nile while at other times it will be shorter.

2.3 Subjective probability

Issues like these led de Finetti and many others to advocate a subjectivist interpretation of probability, in which an event is a measure of your belief about the occurrence of an event. Probability thus expresses a personal state of knowledge. In this view, events need not be repeatable, and different individuals might disagree on the probability of an event. In fact, instead of asking what is the probability of an event, one should instead ask for your probability. Subjective probabilities are measurable to the extent that an individual might be willing to disclose his or her intentions. Subjective probabilities are constrained by requirements of coherence, to assure that an individual will not hold conflicting or contradictory beliefs. This is usually derived by assuming that a rational agent should be disposed to act according to her beliefs. So consider a gamble that pays one unit of (probability) currency if \(A\) occurs and zero otherwise.4 Disposition to act means that you should be indifferent between entering the bet or receiving \(P(A)\) with certainty. For example, if you are certain that \(A\) will not occur, then your probability for \(A\) should be equal to 0. Similarly, if you are certain that \(A\) will occur, then you should assign \(P(A)=1\).

2.4 Axiomatic probability

Finally, the axiomatic interpretation views probability as an elementary concept, on which a formal theory is built and one that has no meaning per se. This is akin to the concept of a point in geometry. Probability theory, in this interpretation, should be concerned not with the nature of probability, but with consistency and completeness. Axioms might be inspired by the practice of probability (and statistics), but their validity should finally be decided by their ability to produce a system of well-formed and precise statements. The only constraints on probability values are those that enforce a consistent system, and those that allow us to assign probabilities to any object in the system. Within this view, the study of the laws of probability is a purely mathematical endeavor, much as the study of number theory. This view finds its place within the greater program of axiomatic mathematics, which has come to dominate the area.

3 The laws of probability

If probabilities are to be used to express knowledge and inform decisions (as is our general goal here), then it seems wise to investigate the consequences of an agent that acts according to the laws of probabilities. In fact, we can invert this reasoning and ask what should be such laws for an agent that displays some minimum desirable behaviour. To this end, we will adopt the operational definition of probability, and declare that \(P(A)\), the probability of an event \(A\), is the fair price you are willing to pay for a ticket to a lottery that awards $1 if event \(A\) occurs and \(0\) otherwise. An event is simply a set of states from the universe of discourse (which is as narrow or as wide as one so desires), and \(\Omega\) is the sure event, which contains all states.5 Since this price is declared fair, if I ask you to sell me the opportunity to take part in the lottery, you should charge me the same price \(P(A)\) for it. Thus, you ought to be indifferent between accepting a game that either pays you $1 if \(A\) occurs or nothing, and immediately receiving \(P(A)\). If you so desire, you are allowed to equate \(P(A)\) with the limiting relative frequency of a thought experiment, but in principle you are not forced to do so. In fact, your probabilities (or prices) for some event might differ from the probabilities of another person (as is usually the case when probabilities of complex events are elicited from pundits).

3.1 Fair prices and finitely additive probability

So what should we expect from your price? Since you can receive at most $1, it is unreasonable to assign \(P(A) > 1\) for any event \(A\). And since your price is fair (or equivalently, since I can ask you to sell me your ticket for the same price you are willing to buy it), your price for the lottery that will certainly reward you with $1 should be no less than $1. Thus, we have our first law of probability:

\begin{align} \label{org33eaabf} \label{NU} \tag{NU} P(\Omega) & = 1 \, . & & \text{[normalization to unity]} \end{align}

Also, it would be meaningless to assign \(P(A) < 0\) (that would imply that you buy the ticket and still receive some money, or that you pay some amount to transfer the ticket to someone else). We thus have our second law:

For any event \(A\),

\begin{align} \label{orga96533c} \label{NN} \tag{NN} P(A) &\geq 0 \, . & &\text{[nonnegativity]} \end{align}

For the rest of this discussion it will be convenient to overload notation and equate an event \(A\) with the \(\{0,1\}\)-valued function that returns \(A(\omega)=1\) if \(\omega \in A\), and \(0\) otherwise. With this notation we can write that your net gain/loss from buying a ticket to the lottery related to event \(A\) is \(A(\omega)-P(A)\), which is a function from the unknown state of the world \(\omega \in \Omega\). Similarly, your reward/loss for selling the ticket is \(P(A) - A(\omega)\).

So consider disjoint events \(A\) and \(B\). Using our notation, we have that \((A \cup B)(\omega) = A(\omega) + B(\omega)\) for all \(\omega \in \Omega\). Now suppose you judge that \(P(A\cup B) > P(A) + P(B)\). Then if I buy the tickets for lotteries \(A\) and \(B\) from you and sell you the ticket for lottery \(A \cup B\), your overall net loss is

\[ (A \cup B)(\omega) - P(A \cup B) + P(A) - A(\omega) + P(B) - B(\omega) = P(A) + P(B) - P(A \cup B) < 0 \, , \]

whatever the state \(\omega\) happens to be true. On the other hand, if you judge \(P(A \cup B) < P(A) + P(B)\) then I can sell you tickets for \(A\) and \(B\) and buy your ticket for \(A \cup B\), leaving you with a sure loss of \(P(A \cup B) - P(A) - P(B)\). Thus, to avoid losing money under regardless of the state of the world (i.e., being a sure loser), you must specify your probabilities such that:

For any two disjoint events \(A\) and \(B\),

\begin{align} \label{orgdd80683} \label{FA} \tag{FA} P(A \cup B) & = P(A) + P(B)\,. & &\text{[finite additivity]} \end{align}

It turns out that these three rules are all the necessary requirements for avoiding sure loss (i.e., losing money irrespective of the true state of the world).6 In particular, they are consistent with properties of limiting frequencies: if

\[ P(A)= \lim_{n \rightarrow \infty} \sum_i a_i/n \, \text{ and } \, P(B)=\lim_{n \rightarrow \infty} \sum_i b_i/n \, , \]

and \(A\) and \(B\) are disjoint (so \(a_i + b_i \leq 1\)) then \(P(A) \geq 0\), \(P(B) \geq 0\), and

\begin{align} P(A \cup B) &= \lim_{n\rightarrow n} \sum_i (a_i + b_i)/n \\ &= \lim_{n \rightarrow \infty} \sum_i a_i/n + \lim_{n \rightarrow \infty} \sum_i b_i/n \\ &= P(A) + P(B) \, . \end{align}

Also, if \(a_i=1\) for every \(i\) "large enough", then \(P(A)=1\).

3.2 Probability measure

We thus get to the standard definition of probability:

Probability measure
A probability measure is a function from the set \(\mathcal{F}\) of events to the reals satisfying Properties \eqref{NN}, \eqref{NU} and \eqref{FA}.

The set \(\mathcal{F}\) can be any collection of events that you consider useful. When \(\Omega\) is finite, there is no harm in assuming \(\mathcal{F}\) to be the power set of \(\Omega\), since any probability measure on \(\mathcal{F}\) can be extended to the power set by application of Property \eqref{FA} (but you are not constrained to do so!). We will discuss what constraints need to be imposed on \(\mathcal{F}\) for infinite sets later.

3.3 A finite probability function

The following example helps consolidating the terminology and definitions.

The problem of points revisited.

Let us revisit the problem of points. Recall that the first player gets a point when the coin lands heads (\(➊\)), and needs two points to win the game; the second player gains points when the coin lands tails (\(➋\)) and needs three points to win. The state space is

\begin{equation*} \Omega = \left\{ \begin{matrix} ➊➊➊➊,& ➊➊➋➊,& ➊➊➊➋,& ➊➊➋➋, \\ ➊➋➊➊,& ➊➋➋➊,& ➊➋➊➋,& ➊➋➋➋, \\ ➋➊➊➊,& ➋➊➋➊,& ➋➊➊➋,& ➋➊➋➋ , \\ ➋➋➊➊,& ➋➋➋➊,& ➋➋➊➋,& ➋➋➋➋ \\ \end{matrix} \right\} \, . \end{equation*}

If you assume the coin is unbiased, then you should assign equal probabilities to the singleton events \(\{ \omega \}, \omega \in \Omega\) (note that otherwise you lose money by selling tickets for ``underestimated'' probabilities and buying tickets for ``overestimated'' probabilities). Properties \eqref{NN}, \eqref{NU} and \eqref{FA} combined then imply that \(P(\{ \omega \})=1/16\), coinciding with our initial informal analysis.

The event \(A\) representing that the first player wins collects all states with at least two heads:

\begin{equation*} A = \left\{ \begin{matrix} ➊➊➊➊, & ➊➊➋➊, & ➊➊➊➋, \\ ➊➊➋➋, & ➊➋➊➊, & ➊➋➋➊, \\ ➊➋➊➋, & ➋➊➊➊, & ➋➊➋➊, \\ ➋➊➊➋, & ➋➋➊➊ \\ \end{matrix} \right\} \, . \end{equation*}

Adding up the probabilities of each atomic (i.e., singleton) event, we get \(P(A) = 11/16\). Call \(B\) be the event that the second player wins. Since \(A \cup B = \Omega\), it follows from \eqref{NU} and \eqref{FA} that

\[ P(B) = 1 - P(A) = 5/16 \, . \]

As a sanity check note that the event that a heads is obtained in the third toss of the coin has probability \(8/16=1/2\).

An alternative formulation of the problem is to consider only the realizable completions of the game:

\begin{equation*} \Omega = \left\{ \begin{matrix} ➊➊, & ➊➋➊, & ➊➋➋➊, \\ ➊➋➋➋, & ➋➊➊, & ➋➊➋➊, \\ ➋➊➋➋, & ➋➋➊➊, & ➋➋➊➋, \\ ➋➋➋ \end{matrix} \right\} \end{equation*}

In this case, we wold like to assign \(P(\{ \omega \}) = 2^{-|\omega|}\), where \(|\omega|\) denotes the number of symbols (coins) in state \(\omega\). The probability of \(B\) would in this case be given by

\begin{align*} P(\{ ➊➋➋➋, ➋➊➋➋, ➋➋➊➋, ➋➋➋\}) &= 3/16 + 1/8 \\ &= 5/16 \,, \end{align*}

as before. Note that in this formulation events like the probability that the third coin lands heads is not 1/2 (in fact it is \(P(\{➊➋➊,➋➊➊,➋➋➊➋,➋➋➊➊\})=3/8\)), since some configurations of tosses are missing from the state space.

3.4 Countably infinite probability

Here is a more intricate example involving countably infinite outcomes:

Consider a procedure (say, a computer program with access to a random number generator) which generates binary strings as follows: first a length \(\ell > 0\) is generated with probability \(1/2^\ell\), then a string of length \(\ell\) is generated by tossing a fair coin \(\ell\) times. The sure event is thus the countably infinite set \(\Omega=\{0,1,00,01,10,11,001,\dotsc\}\). Let \(\mathcal{F}\) denote the power set of \(\Omega\); define your probability of observing a string \(\omega\) as

\begin{equation} \label{eq:prob-string} P(\{\omega\}) = 2^{-2|\omega|} \text{ for } \omega \in \Omega \, , \end{equation}

where \(|\omega|\) denotes the length of \(\omega\). Since any event \(A \in \mathcal{F}\) is countable, that is, it is either finite or countably finite, we can extend the above definition to any event \(A\) by

\[ P(A) = \sum_{\omega \in A} P(\{\omega\}) \, . \]

It is not difficult to see that this probability measure satisfies \eqref{NN} and \eqref{FA}. For example, the probability that a string has length at most 3 is

\begin{align*} P(\{ \omega: |\omega| \leq 3\}) &= \sum_{i=1}^3 \sum_{\omega: |\omega|=i} P(\{\omega\}) \\ & = \sum_{i=1}^3 2^i 2^{-2i} = 2^{-1} + 2^{-2} + 2^{-3} = 7/8 \, . \end{align*}

To obtain \eqref{NU} from \eqref{eq:prob-string} however we need a stronger property than finite additivity, we need

\[ P(\Omega) = \sum_{i=1}^\infty \sum_{\omega \in \Omega: |\omega|=i} P(\{\omega\}) = \sum_{i=1}^\infty 2^{-i}=1 \, . \]

The property that we required in the last example to derive \eqref{NU} from the probability of basic events is called countable additivity.7 It is defined as follows.

If \(A_1,A_2,\dotsc\) is a sequence of pairwise disjoint events, and \(\cup_i A_i \in \mathcal{F}\), then we say that \(P\) is countably additive if

\begin{align} \label{CA} \tag{CA} P(\cup_{i \in \mathbb{N}} A_i) & = \sum_{i \in \mathbb{N}} P(A_i)\,. & &\text{[countable additivity]} \end{align}

3.5 Discrete probability measure

Countable additivity is stronger than finite additivity in the sense that the former implies the latter but the converse is not true. A countably additive probability measure over a countable domain is known as discrete probability measure. Discrete probability measures can be concisely specified by the probabilities of singleton events:

Let \(\Omega\) be a countable non-empty set, and \(p\) be a nonnegative function from \(\Omega\) to the unit interval such that \(\sum_{\omega \in \Omega} p(\omega) = 1\). Then there is a unique discrete probability measure defined by \(P(A)=\sum_{\omega \in A} p(\omega)\) for every \(A \subseteq \Omega\).

The probability measure in the previous example is discrete. Apart from ease of specification, is there any reason why we should assume a probability measure as countably additive? If we allow you to take part in an infinitely countable number of transactions (or lotteries), then one can show that countable additivity is both sufficient and necessary for avoiding sure loss. However, taking part in an infinite number of transactions removes the operational character of the behaviourist definition of probability. This reason lead de Finetti and many others to condemn countable additivity as a requirement for rationality (though you are still allowed to enforce it if you so desired). On the other hand, most measure-theoretic (i.e., axiomatic) approaches for probability theory assume countable additivity, and it is hard to find useful realistic examples which violate countable additivity. Curiously, countable additivity is not enforced by the properties of limiting relative frequencies. Countable additivity is also required for avoiding dynamic loss, that is, a situation where one infringes you a sure loss by combining your prices before and after the occurrence of some event.

We will in the following assume for convenience that every probability measure is countably additivity.

3.6 The Total Rule Theorem

Here is a very useful property of discrete probability measures:

If \(B_1,B_2,\dotsc\) is a countable partition of \(\Omega\) then \[ P(A) = P(A \cap B_1) + P(A \cap B_2) + \dotsb \] for any event \(A\).

It follows from \eqref{CA} since \(A=\cup_i (A \cap B_i)\).

3.7 Uncountably infinite probability

In our previous examples, we have assumed that \(\mathcal{F}\) is simply the power set of \(\Omega\). For countable number of states (assuming countable additivity), this is no restriction, as we can always extend the probability to the rest of the domain by enforcing \eqref{CA}. Is this also true for uncountable domains? To better motivate this question, consider the following example:

Lebesgue measure.

The simplest probability measure on an uncountable domain is arguably a uniform measure over the unit interval \(\Omega=[0,1]\). By uniform, we mean that \[ P([a,b]) = b-a \, , \] for \(a \leq b\) (i.e., the length of the interval). The above equation includes the degenerate case of a singleton interval, since \[ P([a,a]) = P(a) = 0 \, . \] This example shows that we cannot extend countable additivity to uncountable additivity, as this would imply that \(P(\Omega) = \sum_{0 \leq x \leq 1} P(x) = 0\).8 Countable additivity, on the other hand, seems to work just fine, for example:

\begin{align*} P([a_1,b_1] \cup [a_2,b_2] \cup \dotsb) &= \sum_{i \in \mathbb{N}} P([a_i,b_i]) \\ &= \lim_{n \rightarrow \infty} \sum_{i=1}^n (b_i - a_i) \leq 1 \, , \end{align*}

if \(a_1 \leq b_1 \leq a_2 \leq b_2 \leq \dotsb\), since \(\cup_{i=1}^n [a_i,b_i] \subseteq \Omega\). So far so good, but we would like to extend \(P\) to any subset \(A\) of \(\Omega\). One way of doing this is to consider coverings of \(A\), that is a countable collection of intervals \([a_1,b_1],[a_2,b_2],\dotsb\) (not necessarily disjoint), such that \(A \subseteq \cup_i [a_i,b_i]\). Then we define the probability (or measure) of \(A\) as

\[ P(A) = \inf \left \{ \sum_i (b_i - a_i) : A \subseteq \cup_i [a_i,b_i] \right \} \, . \]

So the question is: is the quantity above well-defined for any \(A\)? It can be show that there are sets that cannot be assigned a probability measure altogether.

The impossibility of assigning a probability value to any subset of the unit interval is not a pathological case; in fact this is true for any uncountable domain:

Let \(\Omega\) be an uncountable infinite set (e.g., the real line). Then any probability measure must be undefined on an uncountable number of events.

Thus, for uncountable domains, one needs to be careful to choose a domain \(\mathcal{F}\). One typical choice is the so called Borel \(\sigma\)-algebra, which is formed by the countable unions and intersections of open sets. For example, the Borel \(\sigma\)-algebra on the reals \(\mathcal{B}\) is the smallest \(\sigma\)-algebra that contains the (open) intervals (i.e., the sets \((a,b)\) for reals \( a < b \)). We will assume that \(\mathcal{F}\) is this algebra when \(\Omega\) is the real line, and denote it by \(\mathcal{B}\). The elements of \(\mathcal{B}\) are called Borel sets.

3.8 Probability distributions

A probability distribution is simply any probability measure on the real line. Possibly the most famous (continuous) distribution is the Gaussian or normal distribution:

Gaussian distribution
Given real-valued parameter \(\mu\) and \(\sigma \neq 0\), we define the Gaussian Distribution as \[ P(A) = \int_{A} \frac{1}{\sigma\sqrt{2\pi}}\exp\left( - \frac{(x-\mu)^2}{2\sigma^2} \right) dx \, , \] for any set \(A\) for which the integral exists.

A very useful distribution distinguishes a single point:

Point-mass distribution
Let \(x\) be any real value: \[ P(A) = \begin{cases} 1 & \text{if } x \in A,\\ 0 & \text{if } x \not\in A, \end{cases} \] for any measurable set \(A\).

And at last:

Uniform distribution
Given real values \(a < b\), let \[ f(x) = \begin{cases} \frac{1}{b-a} & \text{if } a < x \leq b, \\ 0 & \text{otherwise}. \end{cases} \] The Uniform distribution is defined for any Borel set \(A\) as \[ P(A) = \int_{A} f(x) dx \, , \] whenever the integral exists.

4 Conditional beliefs

Usually, a probability is much easier to specify if we assume that the outcome of some event is revealed. This sort of reasoning requires the use of conditional probabilities. We can define conditional probabilities in much the same way as we defined unconditional probabilities, except for the possibility of cancelling of the lottery. So let \(P(A|B)\) be your fair price for participating in a lottery that rewards you $1 if events \(A\) and $0 otherwise, but is cancelled if \(B\) does not occur. By cancelled I mean that you get the amount you paid back, or that you have to refund any buyer the exact amount. Thus, your net gain/loss for this transaction is \(B(\omega)[A(\omega)-P(A|B)]\). How should you set your conditional prices given your unconditional prices?

Your probability measure avoids sure loss if and only if

\[ P(A|B)P(B) = P(A \cap B) \, . \]

4.1 Conditional probability

When \(P(B) > 0\), the theorem above resorts to the usual axiomatic definition of conditional probability

\begin{equation} \tag{CP} \label{CP} P(A|B) = \frac{P(A \cap B)}{P(B)} \, . \end{equation}

The above definition is usually justified as a re-normalization of your beliefs once an event is revealed. For \(P(B)=0\), any price for \(P(A|B)\) will suffice to avoid sure loss (but traditional definition leaves the latter value undefined in this situation).

4.2 Bayes' Theorem

Conditional probabilities satisfy several properties with respect to other conditional and unconditional probabilities. Arguably the most famous of these properties is Bayes' Theorem, which follows by simple arithmetic manipulation of the above equation.

If \(A\) and \(B\) are events such that \(P(A) > 0\) and \(P(B) > 0\), then \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \, . \]

4.3 The chain rule

A property at the heart of the theory of probabilistic graphical models is the so-called Chain Rule:

For any events \(A_{1},\dotsc,A_{n}\) it follows that \[ P(A_{\sigma(1)} \cap \dotsb \cap A_{\sigma(m)}) = P(A_{\sigma(1)})P(A_{\sigma(2)}|A_{\sigma(1)})\dotsb P(A_{\sigma(n)}|A_{\sigma(1)} \cap \dotsb \cap A_{\sigma(n-1)}) \, , \] for any permutation \(\sigma: \{1,\dotsc,n\} \rightarrow \{1,\dotsc,n\} \).

We say that the right-hand side of the equation above is a factorization of the probability on the left-hand side, just like \( 2 \times 3 \) is a factorization of \( 6 \).

5 Independence

Independence is actually an extra-mathematical and qualitative notion, one of utmost importance for the practice of probabilistic modelling. In fact, most of this course is focused on creating a language to describe independence assertions. Intuitively, a set of independent events is such that no combination of a subset of events brings any information to any other event in that set.

Events \(A_1,\dotsc,A_n\) are independent if \[ P( \cap_{i \in S} A_i) = \prod_{i \in S} P(A_i) \, , \] for every nonempty set \(S \subseteq \{1,\dotsc,n\}\).

If \(A\) and \(B\) are events such that \(P(A \cap B) > 0\) (hence \(P(A), P(B) > 0\)), it follows that \(P(A|B)=P(A)\) and \(P(B|A)=P(A)\) if and only if \(A\) and \(B\) are independents. This connects the definition above with our intuitive notion of independence as irrelevance of knowledge. Note that the definition above defines a system of \(2^{n}-n-1\) equations and \(2^{n}-1\) variables; this is because factorization of subsets (e.g., of pairs of variables) does not imply factorization of supersets (e.g., triples of variables):

Pairwise independence does not imply 3-way independence.

Consider an experiment where two fair coins are tossed; if they either lands both with heads up or heads down a third coin is placed tails up, otherwise, the third coin is placed head up. Our state space is thus \(\Omega = \{000,011,101,110\} \), and \(P(\omega)=1/4\) for each \(\omega \in \Omega \). Let \(A=\{101,110\}\), \(B=\{011,110\}\) and \(C=\{011,101\}\) be, respectively, the events that the first, second and third coin lands heads up. Then \(P(A)=P(B)=P(C)=1/2\). Also, knowing the result of one of the coins gives no information about another coin: \(P(A|B) = P(B|A) = P(A|C) = P(C|A) = P(B|C) = P(C|B) = 1/2\). However, the probability of the third coin is completely determined by the values of the first and second coins: \(P(C|A \cap B) = 0 \).

The converse is also true: the factorization of the probability of all events is not sufficient to imply factorization of every subset:

3-way independence does not imply pairwise independence.

Consider a state space \( \Omega = \{1,\dotsc,8\} \) and probabilities

\begin{align*} P(1) &= 2/64 & P(2) &= 7/64 & P(3) &= 7/64 & P(4) &= 0 \\ P(5) &= 7/64 & P(6) &= 32/64 & P(7) &= 0 & P(8) &= 9/64 \, . \end{align*}

Let \( A = \{ 2,4,6,8 \} \), \( B=\{3,6,7,8\} \) and \( C=\{ 4,5,7,8 \} \). Then, \(P(A) = P(B) = 7/64 + 0 + 32/64 + 9/64 = 3/4\) and \(P(C) = 1/4 \). Also, \( P(A \cap B \cap C) = P(\{8\}) = 9/64 = (3/4)^{2}(1/4) = P(A)P(B)P(C) \). However, \( P(A \cap B) = P(\{6,8\}) = 32/64 + 9/64 = 41/64 > 36/64 = (3/4)^{2} = P(A)P(B) \).

6 Random variables

The set-theoretic approach just laid out is sufficient to describe simple scenarios like the examples given, but turns out to be cumbersome to represent more complex knowledge, especially when it involves numerical quantities such as counts, sums, etc. For instance, take the example of the random string generation. To consider quantities such as the average length of a string, we might create a function \(X\) that maps every string to its length. We can then query the probability that a string with length at most 5 is generated by \(P(X \leq 5)=P(\{\omega \in \Omega: X(\omega) \leq 5\})\).

A real-valued function on \(\Omega\) is called a random variable. The standard definition of a random variable \(X\) assumes a probability space \((P,\Omega,\mathcal{F})\) is defined (and fixed), and requires that \(P(X \leq x)=P(\{ \omega: X(\omega) \leq x\})\) be defined for any real \(x\) (i.e., the set \(\{ \omega: X(\omega) \leq x \}\) must be in \(\mathcal{F}\)). In practice, we usually specify the cumulative distribution function of the random variable, and implicitly assume that there is a probability space that generated it (there is always such a space).

Cumulative distribution function
The cumulative distribution function of a random variable \(X\) is a real-valued function \(F\) such that \[ F(x) = P(X \leq x) \, \text{ for all real } x \, . \]

6.1 Probability distribution

The probability distribution of a random variable \(X\) is the function \(p(A) = P(\{ \omega: X(\omega) \in A\})\), defined on \(\mathcal{B}\), the Borel \(\sigma\)-algebra of the reals. Given that random variable, the triple \((\mathbb{R},p,\mathcal{B})\) is a valid probability space, which removes the ambiguity of definition of probability distribution.

Discrete random variable
We say that a random variable is discrete if its range is countable.

A discrete random variable has countable support, that is, a countable subset \(\mathcal{D} \subset \mathbb{R}\) such that \(P(\mathcal{D}) > 0\). If the \(\mathcal{D}\) is also finite, then the random variable is called simple. We write \(\Omega_X\) to denote the range of a variable \(X\), that is, the set \(\{ x: X(\omega) = x, \omega \in \Omega\). If \(X\) is a discrete random variable, then we can define \(p(x)=P(X=x)=P(\{ \omega: X(\omega) = x\})\) for any \(x\) in \(\Omega_X\). The function \(p\) is called the probability mass function of \(X\). By Theorem 3.5, the function \(p\) induces a unique distribution \(p(A)\) defined on the power set of \(\Omega_X\), hence there is no confusion in overloading the notation for distribution and mass function (and we often use both terms interchangeably) when discussing discrete random variables.

Dyadic numbers.

Let \(P\) be the Lebesgue measure on the unit interval \(\Omega=[0,1]\). Define \(D_i\) as the random variable returning the \(i\)-th digit in the infinite dyadic (i.e., binary) expansion of \(\omega=\sum_i D_n(\omega)/2^n\). For example \(D_1(\omega)=1\) only for \(\omega > 1/2\), and \(D_2(\omega)=1\) only for \(\omega \in (1/4,1/2] \cup (3/4,1]\). Each \(D_i\) is a simple random variable with range \(\{0,1\}\) and \(p(D_i=0)=p(D_i=1)=1/2\). The latter follows as \(D_i\) partitions \(\Omega\) in \(2^i\) intervals of length (probability) \(1/2^{i+1}\). Let \(S_n = \sum_{i=1}^n D_i\). Then \(S_n\) is also a simple random variable with \(p(S_n=k)={n \choose k} 2^{-n}\) for \(k \in \{0,\dotsc,n\}\). Note that \(p(S_n)\) is the Binomial distribution with \(p=1-p=1/2\) (we say that \(S_n\) follows a Binomial distribution).

6.2 Common distributions

Random variables are usually characterized by their distributions. For example, the Bernoulli distribution is usually motivated as the probability of observing \(k\) heads in \(n\) tosses of a binary coin:

Binomial distribution
Given parameters \(p \in [0,1]\) and \(n \in \mathbb{N}^+\), the Binomial Distribution is \[ P(X = k) = {n \choose k} p^k(1-p)^{n-k} \ , \] if \(k \in \mathbb{N}\) and \(P(X = k) = 0\) otherwise. The distribution can be extended to any Borel set by \eqref{FA}.

The Poisson random variable is commonly used to mode arrival in queuing theory:

Poisson distribution
Given a real-valued parameter \(\lambda > 0\), the Poisson Distribution is \[ P(X = k) = e^{-\lambda}\lambda^r/r! \ , \] if \(k \in \mathbb{N}\) and \(P(X = k) = 0\) otherwise. The distribution can be extended to any Borel set by \eqref{FA}.

6.3 Functions of random variables

A function of a random variable is (usually) also a random variable. Thus, related random variables can be created as complex expressions of simpler variables. For example, assume that \(X\) is a random variable such that

\begin{equation*} P(a \leq X \leq b) = \begin{cases} b - a , & \text{if } 0 \leq a < b \leq 1 \\ 0, & \text{otherwise}. \end{cases} \end{equation*}

That is, \(X\) follows a uniform distribution on the unit interval. Now let

\[ Y = \lfloor 3X + 1 \rfloor \, . \]

Then, \(P(Y=1)=P(Y=2)=P(Y=3)=1/3\) and \(P(Y=y)=0\) for all \(y \not\in \{1,2,3\}\). Thus, we can use a uniform distribution on the unit interval to create a uniform distribution over three categories by simply composing random variables. Not every composition of random variables is a random variable due to the technicality that there must exist a probability space which "originates" it. However, composition of random variables not resulting in random variables are rare. In fact, arithmetic expressions (e.g., \(2X^2 + 3Y/4 + 1\)) and limits of random variables are always random variables.

6.4 Continuous random variables

Continuous random variable
We say that a random variable \(X\) is continuous if there is a nonnegative (Lebesgue) integrable function \(f\) such that \[ P(a \leq X \leq b) = \int_a^b f(x) dx \, , \] for any reals \(a \leq b\).

The definition above implies that \(P(X = x) = P(x \leq X \leq x) = 0\) for any value \(x\), thus coinciding with our intuition of a continuous random quantity. The support of a continuous random variable is thus uncountable. The function \(f\) (if it exists) is called the probability density function of \(X\), or more simply, the density of \(X\). Note that we allow \(f\) to be discontinuous at a finite number of points (e.g., \(f\) can be piecewise continuous) with probability zero. By construction, if \(F\) is the cumulative distribution of \(X\) then \(F(x)=\int_{-\infty}^x f(u) du\). By the Fundamental Theorem of Calculus, if \(F\) is differentiable then

\[ f(x) = \frac{d}{dx} F(x) \, . \]

Here is a common continuous random variable used for modelling arrival of items:

Exponential distribution

Given a real-valued parameter \(\lambda > 0\), let

\[ f(x) = \lambda e^{-\lambda x} \ , \]

if \(x \geq 0\) and \(f(x) = 0\) otherwise. The Exponential Distribution is defined as

\[ P(A) = \int_{A} f(x) dx \, , \]

whenever the integral exists.

A probability density function can take values larger than one; in fact probability density functions can be unbounded.

Consider a random variable with density

\[ f(x) = \begin{cases} \frac{1}{2\sqrt{x}} & \text{if } 0 < x \leq 1 \, \\ 0 & \text{otherwise}. \end{cases} \]

Then \(f(x) \rightarrow \infty\) as \(x \rightarrow 0\), however

\[ \int_{-\infty}^{\infty} f(x) dx = 1 \, . \]

Continuity of the cumulative distribution function does not suffice to ensure that a random variable is continuous, as there are continuous probability distributions which are zero except on a set of Lebesgue measure zero, yet each point in that set has probability zero (hence the distribution is also not discrete).9 The converse however is true: if \(X\) has a continuous density then its cumulative function is continuous.

So discrete random variables have countable supports and continuous random variables have uncountable support; similarly, apart from some highly pathological cases, continuous random variables are (Lebesgue) integrable functions that assign zero probability to every value. There are however very frequently occurring types of random variables that satisfy none of these descriptions. They naturally occur when measuring physical phenomena. For example, consider the amount of rainfall \(X\) in a certain location (in a certain period). For any value \(x > 0\), it is natural to assume that \(P(x) = 0\), unless some censoring occurs. However, it is most likely that we have \(P(X=0) > 0\), indicating that there is a nonzero probability of not raining. Such types of variables are often denoted mixed-type random variables or yet hybrid random variables. Such variables can still be defined by their cumulative distributions, or more conveniently by their discrete and continuous parts.

Mixed-type random variable
We say that a random variable \(X\) is of mixed-type if there are functions \(f\) and \(g\) such that \(f\) is integrable, \(g(x) > 0\) for some countable set \(\mathcal{D} \subset \Omega_X\), and \[ P(a \leq X \leq b) = \int_a^b f(x) dx + \sum_{x \in \mathcal{D} \cap [a,b]} g(x) \, , \] for any \(a \leq b\).

Mixed-type random variables are simply the combination of discrete and continuous random variables; the functions \(f\) and \(g\) characterize, respectively, the probability density function and probability mass function of the discrete and continuous counter-parts. Mixed-type random variables appear often when one models censoring in measurements as in for example, when measurement can be made within a time limit (which is arguably always the case).

Truncated random variable.

Consider a continuous random variable \(X\) with density \(f\). Given values \(\tau_1 < \tau_2\), obtain a new random variable

\begin{equation*} Y = \begin{cases} \tau_1 & \text{if } X \leq \tau_1 \, ,\\ X & \text{if } \tau_1 < X < \tau_2 \, ,\\ \tau_2 & \text{if } X \geq \tau_2 \, . \end{cases} \end{equation*}

Then \(Y\) has distribution

\[ P(Y \in A) = \int_A f'(y) dy + \sum_{y \in A \cap \mathcal{D}} g(y) \, , \]

where \(f'(y) = f(y)\) if \(\tau_1 < y < \tau_2\) and \(f'(y) = 0\) otherwise,

\[ g(x) = \begin{cases} \int_{-\infty}^{\tau_1} f(x) dx & \text{if } x = \tau_1 \, ,\\ 0 & \text{if } x \not\in \{\tau_1,\tau_2\} \, ,\\ \int_\tau^{\infty} f(x) dx & \text{if } x = \tau_2 \, , \end{cases} \]

and \(\mathcal{D}=\{\tau_1,\tau_2\}\).

6.5 Conditional probability distributions

Similarly to (unconditional) probabilities, we can defined conditional distributions of a random variable \(X\) conditional on another random variable \(Y\), written \(p( X \in A | Y \in B \). As with conditional probability of events, this probability is only defined in \(p(Y \in B)>0\). This works well for discrete random variables, but is unsatisfactory for continuous random variables; often we want to model a phenomenon assuming an observation; if the random variable being assumed observed is continuous, then \(B=\{y\}\) will be a singleton with probability zero. We could instead assume that \(B\) is in fact an infinitesimal region around the observation \(y\), one large enough to have positive probability: \[ \lim_{\Delta \rightarrow 0} p(X \in A| Y \in \{y-\Delta/2,y+\Delta/2) \, . \] We can generalize this idea to the density of \(X\) as well. Assume that \(X\) and \(Y\) have joint density \(f(X,Y)\). Then we define the \define{conditional probability density} \(f(X|Y)\) of \(X\) given \(Y\) as \[ f(x|y) = \lim_{\Delta \rightarrow 0} \frac{d}{dx} p(X \leq x| Y \in \{y-\Delta/2,y+\Delta/2\}) = \frac{f(x,y)}{f(y)} \, . \] The conditional probability of a measurable set \( A \) given \(Y\) is, as desired, \[ p(X \in A | Y = y) = \int_{A} f(x|y) dx \, . \]

7 Multivariate reasoning

In most modeling situations we are interested in not a single but a set of quantities or summaries represented by a set of random variables. For convenience, if \(X\) and \(Y\) are sets of random variables we write \(XY\) to denote their union \(X \cup Y\) whenever there is no ambiguity.

7.1 Assignment

Let \(S\) be a set of random variables. An assignment \(\nu\) to \(S\) is a function that maps every variable \(X \in S\) into a value \(\mu[X]\) in its domain \(\Omega_X\). We write \(\nu \sim S\) to denote an arbitrary assignment of \(S\), or to consider the set of all assignments, as in the expression \(\sum_{\nu \sim S\} 1\), which counts the number of assignments of \(S\).

7.2 TODO Joint probability distribution

Consider a vector \(X=(X_1,\dotsc,X_n)\) of random variables from a probability space \((P,\Omega,\mathcal{F})\). the joint probability distribution of \(S\) is the function $p(A_1,\dotsc,A_n)=P(\{ ω: X_i(ω) ∈ A_i, i=1,\dotsc,n \})\), defined on \(\mathcal{B}^n\), the \(n\)-dimensional Borel \(\sigma\)-algebra. If all random variables in \(X\) are discrete, we define the joint probability mass function as the function \(p(\nu)=P(\{\omega: X_i(\omega)=\nu[X_i]\})\) of assignments of \(S=\{X_1,\dotsc,X_n\}\) to joint probability values. Note that this (somewhat unusual) definition of probability mass function is invariant to the order of the variables in the vector \(X\).

TODO: define joint density functions.

7.3 TODO The total rule for random variables

Define and prove.

7.4 Independence

Since random variables define events, we can generalize event-related concepts.

Random variables \(X_1,\dotsc,X_n\) are fully independent if their joint distribution factorizes:

\begin{equation*} p(X_1 \in A_1,\dotsc,X_n \in A_n) = p(X_1 \in A_i)p(X_2 \in A_2) \dotsb p(X_n \in A_n) \, . \end{equation*}

for any combination of measurable sets \( A_{1},\dotsc,A_{n} \subseteq \mathbb{R} \).

Note that in the definition above we do not require that subsets of variables be independent. This is because we can take any \(A_{i} = \mathbb{R} \), effectively removing it from the equation above.

We say that a set of random variables \(X\) and \(Y\) are conditionally independent given a set of random variables \(\mathcal{Z}\) if

\begin{equation*} p(XY|Z) = p(X|Z)p(Y|Z) \end{equation*}

7.5 TODO The chain rule for random variables

Define and prove.

8 Expectation

We are often interested in aggregate characteristic of uncertain phenomena such as for instances the average length of a randomly generated string, the average value of the roll of a die or the average value of a number randomly picked from the unit interval (of course, things get more interesting when the distributions involved are not uniform, as in some of these examples). Theses quantities are formalized by the notion of expectation or expected value.

Expectation of discrete random variable

The expectation or expected value of a discrete random variable \(X\) (i.e., one taking on countably many values) with probability mass function \(p\) is

\[ E(X) = \sum_{x} x \cdot p(x) \, , \]

provided this sum exists (i.e., if either converges to a number or diverges to \(\pm \infty\)).

The sum in the definition above ranges over all values \(x \in \Omega_X\). To avoid cluttering, we omit the domain of \(x\) is such sums.

In the Binary Strings Example, the expected value of random variable \(L(\omega)=|\omega|\) is \( E(L)=\sum_{i \in \mathbb{N}} i \cdot 2^{-i} = 2 \).

Expectation of continuous random variable
The expectation or expected value of a continuous random variable \(X\) with density \(f\) is \[ E(X) = \int_{-\infty}^\infty x \cdot f(x) \, , \] provided this integral exists (it can be \(\pm \infty\)).

Expectation can also be defined for other types of random variables (other than discrete and continuous), but the definition is rather technical. To keep this discussion simple, we will focus on discrete and continuous random variables, even though most of the theory developed applies to arbitrary random variables.

8.1 Linearity of expectation

The following property (which is true for any random variable) is extremely useful:

Let \(X\) and \(Y\) denote random variables and \(a\) and \(b\) real values. The expectation of a random variable \(aX + bY\), if it exists, satisfies \[ E(aX + bY) = aE(X) + bE(Y) \,. \]

8.2 Non-trivial random variables

A random variable \(X\) is trivial if there is \(x\) such that \(P(X=x)=1\). A trivial random variable satisfies \(E(X)=x\).

A non-trivial bounded random variable \(X\) satisfies \[ \min X < E(X) < \max X \, , \] where \(\min X/\max X\) denote the minimum/maximum value on the range of \(X\).

8.3 Change of variable

Recall that a measurable function of a random variable is also a random variable. Hence, we can also compute the expectation of functions of random variables:

Let \(X\) be a discrete variable with probability mass function \(p\) and \(Y=f(X)\). Then \[ E(Y) = \sum_x f(x) p(x) \, , \] provided the sum exists.

8.4 Conditional expectation

Oftentimes, we are interested in the average value of a random variable, say \(X\), assuming we have observed some other random variable, say \(Y=y\). This notion is captured by conditional expectation.

Let \(X\) and \(Y\) denote discrete random variables. The conditional expectation of \(X\) given \(Y=y\) is

\[ E(X|Y=y) = \sum_x x \cdot P(X=x|Y=y) \, . \]

The conditional expectation can be seen as a new random variable \(Z=E(X|Y=y)\) obtained as a function of \(Y\). Hence, the expectation of the conditional expectation, denoted \(E(E(X|Y))\) is well-defined. We have the following very useful property regarding such random variables, known as the law of iterated expectations:

Let \(X\) and \(Y\) be random variables. Then \(E(E(X|Y)) = E(X)\).

This is property is so important that some authors actually define conditional probability density as the function that satisfies it (if it exists).

8.5 Variance

The variance of a random variable measures the spread of probability mass or density over its domain. It is defined assuming

The variance of a random variable \(X\) is
\begin{equation*} V(X) = E([X^2-E^2(x)]) = E(X^2) - \left( E(X) \right)^2 \, , \end{equation*}

whenever these expectations exist.

9 Limit Laws

The following result is known as the weak law of large numbers.

Let \(X_1,\dotsc,X_n\) be independent and identically distributed random variables with expected value \(\mu\), and define \(S_n = \sum_{i=1}^n X_i / n\). Then,

\[ \lim_{n \rightarrow \infty} P( |S_n - \mu| > \epsilon) = 0 \, , \]

for any \(\epsilon\).

The previous results states that the probability that the mean of independently generated observations (or copies) of a quantity diverges from its expected value converges to zero as the number of observations increases. This is at the heart of techniques of statistical inference that estimate a unknown parameter from data.

The following result is known as the strong law of large numbers, as it subsumes the previous one.

Let \(X_1,\dotsc,X_n\) be independent and identically distributed random variables with expected value \(\mu\) and finite variance, and define \(S_n = \sum_{i=1}^n X_i / n\). Then,

\[ P\left( \lim_{n \rightarrow \infty} |S_n - \mu| = 0\right) = 1 \, . \]

The result above states that the mean of independently generated observations of a quantity converges in probability to its expected value in the limit as the number of observations goes to infinity.

10 To known more

Boole's 1854 treatise is often view as the first investigation of probabilistic modeling Boo1854. The most popular book on the foundations of modern probability theory remains to be the work of Billingsley Bil1995. A more condensed version can be found in the book of Rosenthal Ros2006. Jaynes provides a formalization of probability theory that does not allow infinite sample spaces Jay2003. In Jaynes' view, models over infinite domains should be construed as the limit of a family of models, each over a finite domain. This removes several of the "paradoxes" that appear in Kolmogorov's (i.e., measure-theoretic) probability theory. Kadane presents a complete introduction to probability theory from the Bayesian viewpoint Kad2011. Much of the initial discussion of this work is based on this work. A traditional (frequentist) take on probability theory is given by Feller Fel1968.

11 Exercises

  1. Prove the Bonferroni inequality from the Axioms of Probability Theory: \[ P(A \cap B) \geq P(A) + P(B) - 1 \, . \] for any events \(A\) and \(B\). Hint: Note that \(A \cap B = (A - B) \cup (A \cap B) \cup (B - A)\).
  2. Prove that \[ P(A \cap (B \cup C)) = P(A \cap B) + P(A \cap C) - P(A \cap B \cap C) \, . \] for any three events \(A\), \(B\) and \(C\). Hint: Use Distributivity of \(\cap\) over \(\cup\): \(A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\).
  3. Show that independence among \(n\) random variables implies pairwise independence of any two random variables (i.e., \(p(X_1,\dotsc,X_n)=\prod_i p(X_i)\) implies \(p(X_i,X_j)=p(X_i)p(X_j)\) for \(i\neq j\)).
  4. An exam to test for a serious disease has 99% accuracy meaning that the probability of the exam testing positive for the disease if the person does have the disease is 0.99, which is the same probability of the exam testing negative if the person does not have the disease). Suppose that you test positive for the disease, and that the disease is present in 1 every 10.000 people of your age.
    • What are the odds that you indeed have the disease?
  5. Knuth and Yao investigated how distributions on arbitrary simple random variables (i.e., random variables with a finite support) could be simulated by sequences of random bits, that is, by a sequence of independent binary-valued random variables. A particular interesting special case of their method consists in the procedure illustrated in the diagram in Figure 1, that simulates the toss of a six-sided fair die using random bits (or tosses of a fair coin). Each arc in the diagram represents the outcome of a coin toss, and the leaves represent the outcome of the algorithm (a number between 1 and 6). By convention, assume that the arcs going up denote heads (H) while the arcs going down denote tails (T). Every path from the root to a leaf thus describes a terminating run (i.e., an execution) of the algorithm. Note that there are non-terminating runs of the algorithm, as the diagram contains directed cycles.
    • Formulate a probability space for the problem (sample space, algebra and probability measure) and define which events correspond to outputs of the algorithm (i.e., to tosses of the die). Note that you have to prove that the probability measure satisfies the Axioms of probability theory and that the event space is (at least) an algebra. Hint: consider a countably infinite space of finite and infinite executions of the algorithm. If you get stuck, consider modeling only finite runs of the program up to a certain amount \(n\) of tosses, then take this number to infinity.
    • Specify a (discrete) random variable on the probability space you created that represents the output of the algorithm. The random variable should take values \(i=1,2,\dotsc,6\) with probability \(1/6\). Note that the random variable must map all elements of the sample space, not only those leading to desired outcomes.


Figure 1: Diagram describind Knuth and Yao's algorithm to generate samples of a die using fair coins.


  • [Boo1854] George Boole, An Investigation of the Laws of Thought: On which are Founded the Mathematical Theories of Logic and Probabilities, Walton and Maberly (1854).
  • [Bil1995] Patrick Billingsley, Probability and measure, Wiley-Interscience (1995).
  • [Ros2006] Jeffrey Rosenthal, A First Look at Rigorous Probability Theory, World Scientific Publishing Company (2006).
  • [Jay2003] Edwin Jaynes, Probability theory: the logic of science, Cambridge University Press (2003).
  • [Kad2011] Joseph Kadane, Principles of Uncertainty, Chapman & Hall (2011).
  • [Fel1968] William Feller, An Introduction to Probability Theory, John Wiley & Sons (1968).



This point is usually made by the so-called Bertrand's paradox, which attempts at building a uniform distribution of chords of a circle.


Mathematically, these outcomes can be captured by the notion of a random variable, which maps states of the world into real values.


Probability currency is an imaginary currency that behaves as lottery tickets, so that your utility is linear.


To remain valid as an operational definition, an event is usually constrained to be the result of realizable and unambiguous procedure; for example, the result of tossing one particular coin at a particular time by a designated person.


In general, we assume that you can be involved in a finite number of transactions \(\sum_{i=1}^m \gamma_i |A_i(\omega) - P(A_i)|\), where \(\gamma_i\) is a real value indicating the number of lottery tickets related to event \(A_i\) that you have bought (if \(\gamma_i>0\)) or sold (for \(\gamma_i < 0\)). It can be shown that the three laws are necessary and sufficient to ensure that any such compound transaction is nonnegative for some state \(\omega\), i.e., it avoids sure loss.


Note that this property was not required: we could have simply defined probabilities of infinite sets in a consistent way (e.g. the probability of even length string and odd length strings being 1/2 each).


The meaning of the sum over an uncountable domain can be made precise while still validating the argument.


Some textbooks refer to the above definition of continuous random variables as absolutely continuous, to emphasize this difference.

Author: Denis D. Mauá

Created: 2019-10-11 Fri 11:59