# Probability Theory

## Overview

In this lecture, we discuss the different meanings attributed to probability. We show how Kolmogorov's axioms of probability can be derived from a disposition to avoid sure loss. We then review the basic concepts of probability theory, such as algebras of sets, probability measures, random variables, conditional probability, independence, the chain rule, the rule of total probability and Bayes' rule.

## 1 An unfinished gamble

The idea that an event is more probable than another has long existed. However, it was not until the 17th century that a mathematical theory of uncertainty and randomness was developed. What started the rigorous study of the laws of uncertainty was the so-called problem of points, a somewhat mundane question on how to divide the stake of an unfinished gamble. By the end of the 15th century, some Italian mathematicians had become interested in this problem without reaching a satisfactory solution. Then, in 1654, the problem was posed to Pierre de Fermat and Blaise Pascal, who started a famous correspondence that would culminate in the first formalization of probability theory. The problem they discussed involved some unnecessary complications, but its essence can be conveyed by the following example:

A gambling game between two players consists in tossing a coin repeatedly until one of the players accumulates 10 points. One player gets a point when the coin comes up heads, the other player gets a point when the coin comes up tails. Each player puts in 50 Francs, making a pot of 100 Francs to be awarded to the winner. After playing a few rounds, one of the players is winning by 8 points to 7. Then, the game is suddenly interrupted, never to be resumed. The question is: how should the players split the 100 Francs?

The solution proposed by Fermat and Pascal is rather trivial
for anyone who has taken a basic course in probability
theory, but was very new at the time: the stake should be
divided in the same proportion as the number of cases that are
favorable to each player, *provided that each case is equally
likely*. Here is the rationale:

The first player needs two points to win the game, while the second player needs three. Thus, the game would be over had they played at most four more rounds. So let us assume that we would play the game for exactly four more rounds, independently of whether someone would have won before that. The following are the possible configurations that might have occurred (we use ➊ for heads and ➋ for tails):

➊➊➊➊ ➊➊➋➊ ➊➊➊➋ ➊➊➋➋ ➊➋➊➊ ➊➋➋➊ ➊➋➊➋ ➊➋➋➋ ➋➊➊➊ ➋➊➋➊ ➋➊➊➋ ➋➊➋➋ ➋➋➊➊ ➋➋➋➊ ➋➋➊➋ ➋➋➋➋ Every configuration with at least two heads indicates a scenario where the first player wins, and there are 11 such configurations. If the stakes were divided according to the proportion of "favorable cases" to each player (i.e., those in which a player wins) then the player with 8 points would get 11/16 of the pot (68.75 Francs), and the other player would get 5/16 (31.25 Francs).
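This counting can be verified mechanically. The sketch below (using our own encoding, `H` for ➊ and `T` for ➋) enumerates the 16 equiprobable configurations and counts those favorable to the first player:

```python
# Enumerate the 16 equally likely ways the four hypothetical remaining
# tosses could land, and count those with at least two heads (the cases
# in which the first player wins).
from itertools import product
from fractions import Fraction

configs = list(product("HT", repeat=4))          # 16 equiprobable configurations
favorable = [c for c in configs if c.count("H") >= 2]

share_player1 = Fraction(len(favorable), len(configs))
print(len(configs), len(favorable), share_player1)  # 16 11 11/16
print(float(share_player1) * 100)                   # 68.75 (Francs)
```
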

The key to the above solution is to consider only
configurations with identical probabilities (whatever these
might be). Had we simply listed the possible completions of
the game (where, for example, two straight heads ends the game
in fewer than four tosses), that would not have been the case,
and counting the number of favorable cases would have produced
an unfair split of the stake. Fermat's ingenious solution was to
break down the unequal probability scenarios into equal
probability configurations. This procedure is at the heart of
the **principle of indifference** (also called the principle of
insufficient reason), which was proposed as a definition of
probability nearly two centuries later by Pierre-Simon Laplace:

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought.

The principle of indifference can be understood as
constructing uniform probabilities over spaces of elementary
events that, when combined, form the events in which we are
interested. According to Pascal and his followers, these
elementary events should be assigned identical probability due
to our inability to predict the unique true event that will
occur (or has occurred); this is an immediate conclusion of
**determinism**, the belief that the occurrence of any event in
nature can be precisely determined by the laws of mechanics if
one possesses sufficient knowledge (which is never the case).

## 2 What is probability?

So what does it mean to say that "the probability of a fair die landing with number 4 up is 1/6" or that "the probability of a fair coin landing heads up is 1/2"?

### 2.1 Classical probability

The **classical interpretation** of probability takes
the probability of an event to be the number of favorable cases
divided by the total number of cases, where each case is
considered equiprobable. This definition faces several
shortcomings. First, taken literally, it applies only
to a finite number of elementary events, thus excluding
probability over infinite domains. Second, it has a
circular character, as it presumes that probability has been
defined for the elementary (equiprobable) events. Last, the
probability of an event depends on the precise
characterization of the elementary events, with different
**parametrizations** leading to different
probabilities.^{1} These difficulties
encouraged mathematicians to seek a non-epistemic
interpretation.

### 2.2 Frequentist probability

The **objectivist interpretation** states that this
number is a measure of the propensity of the event to
occur naturally. In other words, by saying that the
probability of a fair die landing 4 is 1/6 we are actually
saying that if we (conceivably) roll the die infinitely many
times, then in the limit 1/6th of the rolls will end with the die
landing with face 4 up. This interpretation views probability
as a physical property of objects in the world, much in the
same way as we see height, weight and distance. Unlike height,
weight and distance, however, probabilities are assumed not to be
directly measurable; instead, they must be estimated from
real observations. Probabilities also need to be agreed on by
everyone, since they rely exclusively on the object and not on
the observer. Objective probabilities thus need to be
constrained only by the mathematical rules of relative
frequencies (or proportions).

But what does it mean to say that the limiting relative
frequency of an event (say, a coin landing heads up) is 1/2?
The standard approach is to consider a sequence
\(a_1,a_2,\dotsc\), where each \(a_i\) denotes the outcome of an
experiment performed (ideally) under the exact same conditions
(but oddly providing different results).^{2}
Then, provided that this sequence converges, the **frequentist
probability** of an event is taken to be

\[ \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n a_i \, . \]

While this definition appeals to intuition, it allows somewhat
strange situations when seen from an operational viewpoint.
For example, if the experiment is tossing a coin, and the
first 100 tosses land heads up but all (infinitely many)
tosses after that land tails, then the probability of heads is
zero, irrespective of the fact that we have observed it
occurring 100 times (of course, we would be dead after
infinitely many observations). Also, the limiting frequency is
sensitive to rearrangements of the sequence (e.g., compare an
alternating sequence of 0s and 1s, where each 1 follows a 0,
with a rearrangement where each 1 is followed by three 0s). Most
importantly, the frequentist view disallows assigning probabilities
to propositions that elude repeated
experimentation. Take for example the proposition 'The Amazon
river is longer than the Nile river'. A person might want to
state that this proposition is true with probability 0.8; but
the proposition is either true or false!^{3} It is hard to imagine an experiment where the Amazon
river will at times be longer than the Nile while at other
times it will be shorter.
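The sensitivity to rearrangement is easy to see numerically. The sketch below compares long finite prefixes of the two arrangements mentioned above:

```python
# The same outcomes (0s and 1s) yield different limiting frequencies
# depending on how they are arranged: alternating gives 1/2, while
# placing three 0s after each 1 gives 1/4.
def freq(seq):
    return sum(seq) / len(seq)

n = 100_000
alternating = [0, 1] * n          # 0,1,0,1,... : frequency of 1s is 1/2
spread_out  = [1, 0, 0, 0] * n    # each 1 followed by three 0s: 1/4

print(freq(alternating))  # 0.5
print(freq(spread_out))   # 0.25
```
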

### 2.3 Subjective probability

Issues like these led de Finetti and many others to advocate a
**subjectivist interpretation** of probability, in which the
probability of an event is a measure of *your* belief about its
occurrence. Probability thus expresses a personal state of
knowledge. In this view, events need not be repeatable, and
different individuals might disagree on the probability of an
event. In fact, instead of asking what is *the* probability of
an event, one should ask for *your* probability.
Subjective probabilities are measurable to the extent that an
individual is willing to disclose his or her intentions.
Subjective probabilities are constrained by requirements of
coherence, to ensure that an individual does not hold
conflicting or contradictory beliefs. This is usually derived
by assuming that a rational agent should be *disposed* to act
according to her beliefs. So consider a gamble that pays one
unit of (probability) currency if \(A\) occurs and zero
otherwise.^{4} Disposition to act means that you should be
indifferent between entering the bet and receiving \(P(A)\)
with certainty. For example, if you are certain that \(A\) will
not occur, then your probability for \(A\) should be equal to 0.
Similarly, if you are certain that \(A\) will occur, then you
should assign \(P(A)=1\).

### 2.4 Axiomatic probability

Finally, the **axiomatic interpretation** views probability as
an elementary concept, on which a formal theory is built and
one that has no meaning per se. This is akin to the concept of
a point in geometry. Probability theory, in this
interpretation, should be concerned not with the nature of
probability, but with consistency and completeness. Axioms
might be inspired by the practice of probability (and
statistics), but their validity should finally be decided by
their ability to produce a system of well-formed and precise
statements. The only constraints on probability values are
those that enforce a consistent system, and those that allow
us to assign probabilities to any object in the system. Within
this view, the study of the laws of probability is a purely
mathematical endeavor, much as the study of number theory.
This view finds its place within the greater program of
axiomatic mathematics, which has come to dominate the area.

## 3 The laws of probability

If probabilities are to be used to express knowledge and
inform decisions (as is our general goal here), then it seems
wise to investigate the consequences of an agent that acts
according to the laws of probability. In fact, we can invert
this reasoning and ask what such laws should be for an agent
that displays some minimal desirable behaviour. To this end,
we will adopt the **operational definition** of probability, and
declare that \(P(A)\), the probability of an event \(A\), is the
**fair price** you are willing to pay for a ticket to a lottery
that awards $1 if event \(A\) occurs and \(0\) otherwise. An
**event** is simply a set of states from the universe of
discourse (which is as narrow or as wide as one so desires),
and \(\Omega\) is the **sure event**, which contains all
states.^{5} Since this price is declared fair, if I ask you to
sell me the opportunity to take part in the lottery, you
should charge me the same price \(P(A)\) for it. Thus, you
ought to be indifferent between accepting a game that either
pays you $1 if \(A\) occurs or nothing, and immediately
receiving \(P(A)\). If you so desire, you are allowed to
equate \(P(A)\) with the limiting relative frequency of a
thought experiment, but in principle you are not forced to do
so. In fact, your probabilities (or prices) for some event
might differ from the probabilities of another person (as is
usually the case when probabilities of complex events are
elicited from pundits).

### 3.1 Fair prices and finitely additive probability

So what should we expect from your price? Since you can
receive at most $1, it is unreasonable to assign \(P(A) > 1\)
for any event \(A\). And since your price is fair (or
equivalently, since I can ask you to sell me your ticket for
the same price you are willing to buy it), your price for the
lottery that will *certainly* reward you with $1 should be no
less than $1. Thus, we have our first law of probability:

For the sure event \(\Omega\),

\begin{align} \label{NU} \tag{NU} P(\Omega) &= 1 \, . & &\text{[normalization]} \end{align}

Also, it would be meaningless to assign \(P(A) < 0\) (that would imply that you buy the ticket and still receive some money, or that you pay some amount to transfer the ticket to someone else). We thus have our second law:

For any event \(A\),

\begin{align} \label{NN} \tag{NN} P(A) &\geq 0 \, . & &\text{[nonnegativity]} \end{align}

For the rest of this discussion it will be convenient to overload notation and equate an event \(A\) with the \(\{0,1\}\)-valued function that returns \(A(\omega)=1\) if \(\omega \in A\), and \(0\) otherwise. With this notation, your net gain/loss from buying a ticket to the lottery related to event \(A\) is \(A(\omega)-P(A)\), which is a function of the unknown state of the world \(\omega \in \Omega\). Similarly, your reward/loss for selling the ticket is \(P(A) - A(\omega)\).

So consider disjoint events \(A\) and \(B\). Using our notation, we have that \((A \cup B)(\omega) = A(\omega) + B(\omega)\) for all \(\omega \in \Omega\). Now suppose you judge that \(P(A\cup B) > P(A) + P(B)\). Then if I buy the tickets for lotteries \(A\) and \(B\) from you and sell you the ticket for lottery \(A \cup B\), your overall net loss is

\[ (A \cup B)(\omega) - P(A \cup B) + P(A) - A(\omega) + P(B) - B(\omega) = P(A) + P(B) - P(A \cup B) < 0 \, , \]

whatever the state \(\omega\) happens to be. On the other
hand, if you judge \(P(A \cup B) < P(A) + P(B)\), then I can sell
you tickets for \(A\) and \(B\) and buy your ticket for \(A \cup
B\), leaving you with a **sure loss** of \(P(A \cup B) - P(A) -
P(B)\). Thus, to avoid losing money regardless of the
state of the world (i.e., being a sure loser), you must
specify your probabilities such that:

For any two disjoint events \(A\) and \(B\),

\begin{align} \label{FA} \tag{FA} P(A \cup B) & = P(A) + P(B)\,. & &\text{[finite additivity]} \end{align}
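The sure-loss argument can be checked numerically. The sketch below uses hypothetical incoherent prices and shows that the combination of trades described above loses you money in every state:

```python
# A numeric sketch of the sure-loss argument with hypothetical prices.
# Suppose your prices for disjoint events A and B violate additivity:
P_A, P_B, P_AuB = 0.3, 0.4, 0.8        # P(A ∪ B) > P(A) + P(B)

# I buy your tickets on A and B, and sell you the ticket on A ∪ B.
# Your net gain in state ω, where a = A(ω) and b = B(ω); disjointness
# rules out a = b = 1, and (A ∪ B)(ω) = a + b on the remaining states.
gains = []
for a, b in [(0, 0), (1, 0), (0, 1)]:
    gains.append(((a + b) - P_AuB) + (P_A - a) + (P_B - b))
print(gains)   # every entry is ≈ -0.1: a sure loss
```
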
It turns out that these three rules are all the necessary
requirements for avoiding sure loss (i.e., losing money
irrespective of the true state of the world).^{6} In
particular, they are consistent with properties of limiting
frequencies: if

\[ P(A)= \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n a_i \, \text{ and } \, P(B)=\lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n b_i \, , \]

and \(A\) and \(B\) are disjoint (so \(a_i + b_i \leq 1\)) then \(P(A) \geq 0\), \(P(B) \geq 0\), and

\begin{align} P(A \cup B) &= \lim_{n\rightarrow \infty} \frac{1}{n}\sum_{i=1}^n (a_i + b_i) \\ &= \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n a_i + \lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n b_i \\ &= P(A) + P(B) \, . \end{align}

Also, if \(a_i=1\) for every \(i\) "large enough", then \(P(A)=1\).

### 3.2 Probability measure

We thus get to the standard definition of probability:

- Probability measure
- A probability measure is a function from the set \(\mathcal{F}\) of events to the reals satisfying Properties \eqref{NN}, \eqref{NU} and \eqref{FA}.

The set \(\mathcal{F}\) can be any collection of events that you consider useful. When \(\Omega\) is finite, there is no harm in assuming \(\mathcal{F}\) to be the power set of \(\Omega\), since any probability measure on \(\mathcal{F}\) can be extended to the power set by application of Property \eqref{FA} (but you are not constrained to do so!). We will discuss what constraints need to be imposed on \(\mathcal{F}\) for infinite sets later.

### 3.3 A finite probability function

The following example helps consolidate the terminology and definitions.

**The problem of points revisited**.

Let us revisit the problem of points. Recall that the first player gets a point when the coin lands heads (\(➊\)), and needs two points to win the game; the second player gains points when the coin lands tails (\(➋\)) and needs three points to win. The state space is

\begin{equation*} \Omega = \left\{ \begin{matrix} ➊➊➊➊,& ➊➊➋➊,& ➊➊➊➋,& ➊➊➋➋, \\ ➊➋➊➊,& ➊➋➋➊,& ➊➋➊➋,& ➊➋➋➋, \\ ➋➊➊➊,& ➋➊➋➊,& ➋➊➊➋,& ➋➊➋➋ , \\ ➋➋➊➊,& ➋➋➋➊,& ➋➋➊➋,& ➋➋➋➋ \\ \end{matrix} \right\} \, . \end{equation*}If you assume the coin is unbiased, then you should assign equal probabilities to the singleton events \(\{ \omega \}, \omega \in \Omega\) (note that otherwise you lose money by selling tickets for "underestimated" probabilities and buying tickets for "overestimated" probabilities). Properties \eqref{NN}, \eqref{NU} and \eqref{FA} combined then imply that \(P(\{ \omega \})=1/16\), coinciding with our initial informal analysis.

The event \(A\) representing that the first player wins collects all states with at least two heads:

\begin{equation*} A = \left\{ \begin{matrix} ➊➊➊➊, & ➊➊➋➊, & ➊➊➊➋, \\ ➊➊➋➋, & ➊➋➊➊, & ➊➋➋➊, \\ ➊➋➊➋, & ➋➊➊➊, & ➋➊➋➊, \\ ➋➊➊➋, & ➋➋➊➊ \\ \end{matrix} \right\} \, . \end{equation*}Adding up the probabilities of each atomic (i.e., singleton) event, we get \(P(A) = 11/16\). Let \(B\) be the event that the second player wins. Since \(A \cup B = \Omega\), it follows from \eqref{NU} and \eqref{FA} that

\[ P(B) = 1 - P(A) = 5/16 \, . \]

As a sanity check, note that the event that heads is obtained in the third toss of the coin has probability \(8/16=1/2\).

An alternative formulation of the problem is to consider only the realizable completions of the game:

\begin{equation*} \Omega = \left\{ \begin{matrix} ➊➊, & ➊➋➊, & ➊➋➋➊, \\ ➊➋➋➋, & ➋➊➊, & ➋➊➋➊, \\ ➋➊➋➋, & ➋➋➊➊, & ➋➋➊➋, \\ ➋➋➋ \end{matrix} \right\} \end{equation*}In this case, we would like to assign \(P(\{ \omega \}) = 2^{-|\omega|}\), where \(|\omega|\) denotes the number of symbols (coin tosses) in state \(\omega\). The probability of \(B\) would in this case be given by

\begin{align*} P(\{ ➊➋➋➋, ➋➊➋➋, ➋➋➊➋, ➋➋➋\}) &= 3/16 + 1/8 \\ &= 5/16 \,, \end{align*}as before. Note that in this formulation the event that the third coin lands heads does not have probability 1/2 (in fact, it is \(P(\{➊➋➊,➋➊➊,➋➋➊➋,➋➋➊➊\})=3/8\)), since some configurations of tosses are missing from the state space.
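Both formulations can be checked mechanically. The sketch below (with `H`/`T` standing for ➊/➋) computes the second player's probability in the two state spaces:

```python
from fractions import Fraction
from itertools import product

# Full space: 16 equiprobable length-4 configurations; player 2 (tails)
# wins when fewer than two heads occur.
full = list(product("HT", repeat=4))
p2_full = Fraction(sum(c.count("H") < 2 for c in full), 16)

# Realizable completions: stop as soon as player 1 reaches 2 heads or
# player 2 reaches 3 tails; each completion ω has probability 2^{-|ω|}.
def completions(seq=""):
    if seq.count("H") == 2 or seq.count("T") == 3:
        yield seq
    else:
        yield from completions(seq + "H")
        yield from completions(seq + "T")

p2_real = sum(Fraction(1, 2 ** len(w))
              for w in completions() if w.count("T") == 3)
print(p2_full, p2_real)  # 5/16 5/16
```
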

### 3.4 Countably infinite probability

Here is a more intricate example involving countably infinite outcomes:

Consider a procedure (say, a computer program with access to a random number generator) which generates binary strings as follows: first a length \(\ell > 0\) is generated with probability \(1/2^\ell\), then a string of length \(\ell\) is generated by tossing a fair coin \(\ell\) times. The sure event is thus the countably infinite set \(\Omega=\{0,1,00,01,10,11,001,\dotsc\}\). Let \(\mathcal{F}\) denote the power set of \(\Omega\); define your probability of observing a string \(\omega\) as

\begin{equation} \label{eq:prob-string} P(\{\omega\}) = 2^{-2|\omega|} \text{ for } \omega \in \Omega \, , \end{equation}where \(|\omega|\) denotes the length of \(\omega\). Since any event \(A \in \mathcal{F}\) is countable, that is, either finite or countably infinite, we can extend the above definition to any event \(A\) by

\[ P(A) = \sum_{\omega \in A} P(\{\omega\}) \, . \]

It is not difficult to see that this probability measure satisfies \eqref{NN} and \eqref{FA}. For example, the probability that a string has length at most 3 is

\begin{align*} P(\{ \omega: |\omega| \leq 3\}) &= \sum_{i=1}^3 \sum_{\omega: |\omega|=i} P(\{\omega\}) \\ & = \sum_{i=1}^3 2^i 2^{-2i} = 2^{-1} + 2^{-2} + 2^{-3} = 7/8 \, . \end{align*}To obtain \eqref{NU} from \eqref{eq:prob-string}, however, we need a property stronger than finite additivity:

\[ P(\Omega) = \sum_{i=1}^\infty \sum_{\omega \in \Omega: |\omega|=i} P(\{\omega\}) = \sum_{i=1}^\infty 2^{-i}=1 \, . \]
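The computations above are easy to reproduce. The sketch below sums the length probabilities \(2^{-\ell}\) exactly with rationals:

```python
from fractions import Fraction

# In the string-generation measure there are 2^ℓ strings of length ℓ,
# each with probability 2^{-2ℓ}, so P(length = ℓ) = 2^ℓ · 2^{-2ℓ} = 2^{-ℓ}.
def p_length(l):
    return Fraction(2 ** l, 2 ** (2 * l))

print(sum(p_length(l) for l in range(1, 4)))    # 7/8, as computed above
print(sum(p_length(l) for l in range(1, 51)))   # partial sums approach 1
```
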

The property that we required in the last example to derive
\eqref{NU} from the probability of basic events is called
**countable additivity**.^{7} It
is defined as follows.

If \(A_1,A_2,\dotsc\) is a sequence of pairwise disjoint events, and \(\cup_i A_i \in \mathcal{F}\), then we say that \(P\) is countably additive if

\begin{align} \label{CA} \tag{CA} P(\cup_{i \in \mathbb{N}} A_i) & = \sum_{i \in \mathbb{N}} P(A_i)\,. & &\text{[countable additivity]} \end{align}

### 3.5 Discrete probability measure

Countable additivity is stronger than finite additivity in the
sense that the former implies the latter but the converse is
not true. A countably additive probability measure over a
countable domain is known as a **discrete probability measure**.
Discrete probability measures can be concisely specified by
the probabilities of singleton events:

Let \(\Omega\) be a countable non-empty set, and \(p\) be a nonnegative function from \(\Omega\) to the unit interval such that \(\sum_{\omega \in \Omega} p(\omega) = 1\). Then there is a unique discrete probability measure defined by \(P(A)=\sum_{\omega \in A} p(\omega)\) for every \(A \subseteq \Omega\).
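As a sketch of this construction (with a small hypothetical pmf), the induced measure can be checked against \eqref{NU} and \eqref{FA}:

```python
from fractions import Fraction

# A pmf p on a countable Ω induces a discrete measure P(A) = Σ_{ω∈A} p(ω).
# The pmf below is a hypothetical example on a three-element state space.
p = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

def P(A):
    return sum(p[w] for w in A)

A, B = {"a"}, {"b", "c"}                 # disjoint events
print(P(p.keys()))                       # 1: normalization (NU)
print(P(A | B) == P(A) + P(B))           # True: finite additivity (FA)
```
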

The probability measure in the previous example is discrete.
Apart from ease of specification, is there any reason why we
should assume a probability measure as countably additive? If
we allow you to take part in an infinitely countable number of
transactions (or lotteries), then one can show that countable
additivity is both sufficient and necessary for avoiding sure
loss. However, taking part in an infinite number of
transactions removes the operational character of the
behaviourist definition of probability. This reason lead de
Finetti and many others to condemn countable additivity as a
requirement for rationality (though you are still *allowed* to
enforce it if you so desired). On the other hand, most
measure-theoretic (i.e., axiomatic) approaches for probability
theory assume countable additivity, and it is hard to find
useful realistic examples which violate countable additivity.
Curiously, countable additivity is not enforced by the
properties of limiting relative frequencies. Countable
additivity is also required for avoiding **dynamic loss**, that
is, a situation where someone inflicts a sure loss on you by
combining your prices before and after the occurrence of some
event.

We will assume in the following, for convenience, that every probability measure is countably additive.

### 3.6 The Total Rule Theorem

A direct consequence of the laws above is the rule of total probability. For any events \(A\) and \(B\), the events \(A \cap B\) and \(A \cap B^c\) (where \(B^c = \Omega \setminus B\) denotes the complement of \(B\)) are disjoint and their union is \(A\); hence, by \eqref{FA},

\[ P(A) = P(A \cap B) + P(A \cap B^c) \, . \]

More generally, if \(B_1,B_2,\dotsc\) is a countable partition of \(\Omega\), then \eqref{CA} gives \(P(A) = \sum_i P(A \cap B_i)\).

### 3.7 Uncountably infinite probability

In our previous examples, we have assumed that \(\mathcal{F}\) is simply the power set of \(\Omega\). For a countable number of states (assuming countable additivity), this is no restriction, as we can always extend the probability to the rest of the domain by enforcing \eqref{CA}. Is this also true for uncountable domains? To better motivate this question, consider the following example:

The simplest probability measure on an uncountable domain is
arguably a uniform measure over the unit interval
\(\Omega=[0,1]\). By uniform, we mean that \[ P([a,b]) = b-a
\, , \] for \(a \leq b\) (i.e., the length of the interval).
The above equation includes the degenerate case of a
singleton interval, since \[ P([a,a]) = P(\{a\}) = 0 \, . \]
This example shows that we cannot extend countable
additivity to uncountable additivity, as this would imply
that \(P(\Omega) = \sum_{0 \leq x \leq 1} P(\{x\}) = 0\).^{8} Countable
additivity, on the other hand, seems to work just fine, for
example:

\[ P\left( \cup_{i} [a_i,b_i] \right) = \sum_i (b_i - a_i) \leq 1 \] if \(a_1 \leq b_1 \leq a_2 \leq b_2 \leq \dotsb\), since \(\cup_{i=1}^n [a_i,b_i] \subseteq \Omega\). So far so good, but we would like to extend \(P\) to any subset \(A\) of \(\Omega\). One way of doing this is to consider coverings of \(A\), that is, countable collections of intervals \([a_1,b_1],[a_2,b_2],\dotsb\) (not necessarily disjoint) such that \(A \subseteq \cup_i [a_i,b_i]\). Then we define the probability (or measure) of \(A\) as

\[ P(A) = \inf \left \{ \sum_i (b_i - a_i) : A \subseteq \cup_i [a_i,b_i] \right \} \, . \]

So the question is: is the quantity above well-defined for any \(A\)? It can be shown that there are sets to which no probability can be consistently assigned at all.

The impossibility of assigning a probability value to any subset of the unit interval is not a pathological case; in fact this is true for any uncountable domain:

Let \(\Omega\) be an uncountably infinite set (e.g., the real line). Then any probability measure must be undefined on an uncountable number of events.

Thus, for uncountable domains, one needs to be careful in
choosing the domain \(\mathcal{F}\). One typical choice is the
so-called Borel \(\sigma\)-algebra, which is formed by the
countable unions and intersections of open sets. For example,
the **Borel \(\sigma\)-algebra on the reals** \(\mathcal{B}\) is
the smallest \(\sigma\)-algebra that contains the (open)
intervals (i.e., the sets \((a,b)\) for reals \( a < b \)). We
will assume that \(\mathcal{F}\) is this algebra when \(\Omega\)
is the real line, and denote it by \(\mathcal{B}\). The elements
of \(\mathcal{B}\) are called **Borel sets**.

### 3.8 Probability distributions

A **probability distribution** is simply any probability measure
on the real line. Possibly the most famous (continuous)
distribution is the Gaussian or normal distribution:

- Gaussian distribution
- Given real-valued parameters \(\mu\) and \(\sigma > 0\), we define the Gaussian distribution as \[ P(A) = \int_{A} \frac{1}{\sigma\sqrt{2\pi}}\exp\left( - \frac{(x-\mu)^2}{2\sigma^2} \right) dx \, , \] for any set \(A\) for which the integral exists.
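Probabilities of intervals under a Gaussian distribution are usually evaluated via the error function rather than by direct integration; a minimal sketch:

```python
import math

# Probability of an interval [a, b] under a Gaussian measure, using the
# standard normal cdf Φ(z) = (1 + erf(z/√2))/2.
def gaussian_prob(a, b, mu=0.0, sigma=1.0):
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(round(gaussian_prob(-1.0, 1.0), 4))    # ≈ 0.6827 (the one-sigma rule)
print(gaussian_prob(-math.inf, math.inf))    # 1.0 (normalization)
```
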

A very useful distribution distinguishes a single point:

- Point-mass distribution
- Let \(x\) be any real value: \[ P(A) = \begin{cases} 1 & \text{if } x \in A,\\ 0 & \text{if } x \not\in A, \end{cases} \] for any measurable set \(A\).


## 4 Conditional beliefs

Usually, a probability is much easier to specify if we assume that the outcome of some event is revealed. This sort of reasoning requires the use of conditional probabilities. We can define conditional probabilities in much the same way as we defined unconditional probabilities, except for the possibility of cancelling the lottery. So let \(P(A|B)\) be your fair price for participating in a lottery that rewards you $1 if event \(A\) occurs and $0 otherwise, but is cancelled if \(B\) does not occur. By cancelled I mean that you get the amount you paid back, or that you have to refund any buyer the exact amount. Thus, your net gain/loss for this transaction is \(B(\omega)[A(\omega)-P(A|B)]\). How should you set your conditional prices given your unconditional prices?

### 4.1 Conditional probability

One can show that avoiding sure loss requires \(P(A \cap B) = P(A|B)P(B)\). When \(P(B) > 0\), this reduces to the usual axiomatic definition of conditional probability:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \, . \]

The above definition is usually justified as a re-normalization of your beliefs once an event is revealed. For \(P(B)=0\), any price for \(P(A|B)\) will suffice to avoid sure loss (but the traditional definition leaves the value undefined in this situation).

### 4.2 Bayes' Theorem

Conditional probabilities satisfy several properties with respect to other conditional and unconditional probabilities. Arguably the most famous of these properties is Bayes' Theorem, which follows by simple arithmetic manipulation of the definition of conditional probability: for events \(A\) and \(B\) with \(P(A) > 0\) and \(P(B) > 0\), \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \, . \]

### 4.3 The chain rule

A property at the heart of the theory of probabilistic graphical models is the so-called Chain Rule:

For any events \(A_{1},\dotsc,A_{n}\) it follows that \[ P(A_{\sigma(1)} \cap \dotsb \cap A_{\sigma(n)}) = P(A_{\sigma(1)})P(A_{\sigma(2)}|A_{\sigma(1)})\dotsb P(A_{\sigma(n)}|A_{\sigma(1)} \cap \dotsb \cap A_{\sigma(n-1)}) \, , \] for any permutation \(\sigma: \{1,\dotsc,n\} \rightarrow \{1,\dotsc,n\}\), provided the conditional probabilities involved are defined.

We say that the right-hand side of the equation above is a
**factorization** of the probability on the left-hand side, just
like \( 2 \times 3 \) is a factorization of \( 6 \).
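The chain rule can be verified numerically. The sketch below uses the 16-state coin-tossing space from the problem of points and three events of our own choosing:

```python
from fractions import Fraction
from itertools import product

# Check P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) on the space of
# four fair tosses; the events A1, A2, A3 are arbitrary examples.
omega = set(product("HT", repeat=4))
P = lambda E: Fraction(len(E & omega), len(omega))
cond = lambda E, F: P(E & F) / P(F)            # P(E|F) = P(E ∩ F)/P(F)

A1 = {w for w in omega if w[0] == "H"}         # first toss heads
A2 = {w for w in omega if w.count("H") >= 2}   # at least two heads
A3 = {w for w in omega if w[3] == "T"}         # last toss tails

lhs = P(A1 & A2 & A3)
rhs = P(A1) * cond(A2, A1) * cond(A3, A1 & A2)
print(lhs, rhs)   # both 3/16
```
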

## 5 Independence

Independence is actually an extra-mathematical and qualitative notion, one of utmost importance for the practice of probabilistic modelling. In fact, most of this course is focused on creating a language to describe independence assertions. Intuitively, a set of independent events is such that no combination of a subset of events brings any information to any other event in that set.

- Independence
- Events \(A_1,\dotsc,A_n\) are independent if \[ P( \cap_{i \in S} A_i) = \prod_{i \in S} P(A_i) \, , \] for every nonempty set \(S \subseteq \{1,\dotsc,n\}\).

If \(A\) and \(B\) are events such that \(P(A \cap B) > 0\) (hence \(P(A), P(B) > 0\)), it follows that \(P(A|B)=P(A)\) and \(P(B|A)=P(B)\) if and only if \(A\) and \(B\) are independent. This connects the definition above with our intuitive notion of independence as irrelevance of knowledge. Note that the definition above defines a system of \(2^{n}-n-1\) equations and \(2^{n}-1\) variables; this is because factorization of subsets (e.g., of pairs of variables) does not imply factorization of supersets (e.g., triples of variables):

**Pairwise independence does not imply 3-way independence.**

Consider an experiment where two fair coins are tossed; if both land heads up or both land tails up, a third coin is placed tails up; otherwise, the third coin is placed heads up. Our state space is thus \(\Omega = \{000,011,101,110\} \), and \(P(\omega)=1/4\) for each \(\omega \in \Omega \). Let \(A=\{101,110\}\), \(B=\{011,110\}\) and \(C=\{011,101\}\) be, respectively, the events that the first, second and third coin lands heads up. Then \(P(A)=P(B)=P(C)=1/2\). Also, knowing the result of one of the coins gives no information about another coin: \(P(A|B) = P(B|A) = P(A|C) = P(C|A) = P(B|C) = P(C|B) = 1/2\). However, the probability of the third coin is completely determined by the values of the first and second coins: \(P(C|A \cap B) = 0 \).
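This example is easily verified by computation; a sketch:

```python
from fractions import Fraction

# Verify the three-coin example: every pair of events factorizes, yet
# the triple does not (P(A ∩ B ∩ C) = 0 while P(A)P(B)P(C) = 1/8).
omega = {"000", "011", "101", "110"}     # equiprobable states
P = lambda E: Fraction(len(E & omega), len(omega))

A = {"101", "110"}                       # first coin heads
B = {"011", "110"}                       # second coin heads
C = {"011", "101"}                       # third coin heads

pairs_ok = all(P(X & Y) == P(X) * P(Y) for X, Y in [(A, B), (A, C), (B, C)])
print(pairs_ok)                          # True: pairwise independence
print(P(A & B & C), P(A) * P(B) * P(C))  # 0 vs 1/8: no mutual independence
```
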

The converse is also true: factorization of the probability of the whole set of events is not sufficient to imply factorization over every subset:

**3-way independence does not imply pairwise independence.**

Consider a state space \( \Omega = \{1,\dotsc,8\} \) and probabilities

\begin{align*} P(1) &= 2/64 & P(2) &= 7/64 & P(3) &= 7/64 & P(4) &= 0 \\ P(5) &= 7/64 & P(6) &= 32/64 & P(7) &= 0 & P(8) &= 9/64 \, . \end{align*}Let \( A = \{ 2,4,6,8 \} \), \( B=\{3,6,7,8\} \) and \( C=\{ 4,5,7,8 \} \). Then, \(P(A) = P(B) = 7/64 + 0 + 32/64 + 9/64 = 3/4\) and \(P(C) = 1/4 \). Also, \( P(A \cap B \cap C) = P(\{8\}) = 9/64 = (3/4)^{2}(1/4) = P(A)P(B)P(C) \). However, \( P(A \cap B) = P(\{6,8\}) = 32/64 + 9/64 = 41/64 > 36/64 = (3/4)^{2} = P(A)P(B) \).
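Again, a short computation confirms the numbers; a sketch:

```python
from fractions import Fraction

# Verify the second example: the full triple factorizes, but the
# pair (A, B) does not.
p = {1: 2, 2: 7, 3: 7, 4: 0, 5: 7, 6: 32, 7: 0, 8: 9}   # numerators over 64
P = lambda E: Fraction(sum(p[w] for w in E), 64)

A, B, C = {2, 4, 6, 8}, {3, 6, 7, 8}, {4, 5, 7, 8}

print(P(A & B & C) == P(A) * P(B) * P(C))   # True: both sides equal 9/64
print(P(A & B), P(A) * P(B))                # 41/64 vs 9/16: no pairwise
```
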

## 6 Random variables

The set-theoretic approach just laid out is sufficient to describe simple scenarios like the examples given, but turns out to be cumbersome to represent more complex knowledge, especially when it involves numerical quantities such as counts, sums, etc. For instance, take the example of the random string generation. To consider quantities such as the average length of a string, we might create a function \(X\) that maps every string to its length. We can then query the probability that a string with length at most 5 is generated by \(P(X \leq 5)=P(\{\omega \in \Omega: X(\omega) \leq 5\})\).

A real-valued function on \(\Omega\) is called a **random
variable**. The standard definition of a random variable \(X\)
assumes a probability space \((P,\Omega,\mathcal{F})\) is
defined (and fixed), and requires that \(P(X \leq x)=P(\{
\omega: X(\omega) \leq x\})\) be defined for any real \(x\)
(i.e., the set \(\{ \omega: X(\omega) \leq x \}\) must be in
\(\mathcal{F}\)). In practice, we usually specify the
cumulative distribution function of the random variable, and
implicitly assume that there is a probability space that
generated it (there is always such a space).

- Cumulative distribution function
- The cumulative distribution function of a random variable \(X\) is a real-valued function \(F\) such that \[ F(x) = P(X \leq x) \, \text{ for all real } x \, . \]
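As a small illustration, the cumulative distribution function of the length variable from the string-generation example (where \(P(X=\ell)=2^{-\ell}\)) is a step function:

```python
import math

# cdf of the length variable X, with P(X = ℓ) = 2^{-ℓ} for ℓ = 1, 2, ...,
# giving F(x) = Σ_{ℓ ≤ x} 2^{-ℓ} = 1 - 2^{-⌊x⌋} for x ≥ 1, and 0 for x < 1.
def F(x):
    if x < 1:
        return 0.0
    return 1.0 - 2.0 ** (-math.floor(x))

print(F(0.5), F(3), F(3.7))   # 0.0 0.875 0.875, a right-continuous step function
```
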

### 6.1 Probability distribution

The **probability distribution of a random variable** \(X\) is the
function \(p(A) = P(\{ \omega: X(\omega) \in A\})\), defined on
\(\mathcal{B}\), the Borel \(\sigma\)-algebra of the reals.
For any random variable, the triple
\((\mathbb{R},p,\mathcal{B})\) is itself a valid probability space,
which removes any ambiguity in the definition of probability
distribution.

- Discrete random variable
- We say that a random variable is discrete if its range is countable.

A discrete random variable has countable **support**, that is, there is a
countable subset \(\mathcal{D} \subset \mathbb{R}\) such that
\(P(X \in \mathcal{D}) = 1\). If \(\mathcal{D}\) is also finite,
then the random variable is called **simple**. We write
\(\Omega_X\) to denote the range of a variable \(X\), that is, the
set \(\{ x: X(\omega) = x \text{ for some } \omega \in \Omega\}\). If \(X\) is a
discrete random variable, then we can define \(p(x)=P(X=x)=P(\{
\omega: X(\omega) = x\})\) for any \(x\) in \(\Omega_X\). The
function \(p\) is called the **probability mass function** of \(X\).
By Theorem 3.5, the function \(p\)
induces a unique distribution \(p(A)\) defined on the power set
of \(\Omega_X\), hence there is no confusion in overloading the
notation for distribution and mass function (and we often use
both terms interchangeably) when discussing discrete random
variables.

Let \(P\) be the Lebesgue measure on the unit interval
\(\Omega=[0,1]\). Define \(D_i\) as the random variable
returning the \(i\)-th digit in the infinite dyadic (i.e.,
binary) expansion of \(\omega=\sum_i D_i(\omega)/2^i\). For
example, \(D_1(\omega)=1\) only for \(\omega > 1/2\), and
\(D_2(\omega)=1\) only for \(\omega \in (1/4,1/2] \cup
(3/4,1]\). Each \(D_i\) is a simple random variable with range
\(\{0,1\}\) and \(p(D_i=0)=p(D_i=1)=1/2\). The latter follows as
\(D_i\) partitions \(\Omega\) into \(2^i\) intervals of length
(probability) \(1/2^i\), on which its value alternates between
0 and 1. Let \(S_n = \sum_{i=1}^n D_i\).
Then \(S_n\) is also a simple random variable with
\(p(S_n=k)={n \choose k} 2^{-n}\) for \(k \in \{0,\dotsc,n\}\).
Note that \(p(S_n)\) is the Binomial distribution with
parameters \(n\) and \(p=1/2\) (we say that \(S_n\) *follows* a
Binomial distribution).
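The claim that \(S_n\) follows a Binomial distribution can be checked by brute force: each length-\(n\) prefix of digits corresponds to an interval of probability \(2^{-n}\), so it suffices to enumerate all prefixes. A small Python sketch:

```python
from itertools import product
from math import comb

# Each length-n bit string (d1,...,dn) fixes the first n dyadic digits and
# corresponds to an interval of omega in [0,1] of length (probability) 2**-n.
n = 10
counts = {}
for bits in product([0, 1], repeat=n):
    s = sum(bits)                       # S_n = D_1 + ... + D_n
    counts[s] = counts.get(s, 0) + 1    # each string has probability 2**-n

# p(S_n = k) = counts[k] * 2**-n should equal C(n, k) * 2**-n.
for k in range(n + 1):
    assert counts.get(k, 0) == comb(n, k)
```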

### 6.2 Common distributions

Random variables are usually characterized by their distributions. For example, the Binomial distribution is usually motivated as the probability of observing \(k\) heads in \(n\) tosses of a coin:

- Binomial distribution
- Given parameters \(p \in [0,1]\) and \(n \in \mathbb{N}^+\), the Binomial Distribution is \[ P(X = k) = {n \choose k} p^k(1-p)^{n-k} \ , \] if \(k \in \mathbb{N}\) and \(P(X = k) = 0\) otherwise. The distribution can be extended to any Borel set by \eqref{FA}.
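As a quick sanity check, the mass function above can be computed with Python's standard `math.comb`; the function name below is our own:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1-p)^(n-k), zero outside {0, ..., n}.
    if not (0 <= k <= n):
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
# The probabilities over {0, ..., n} must sum to one.
assert abs(sum(binomial_pmf(k, n, p) for k in range(n + 1)) - 1.0) < 1e-12
assert binomial_pmf(n + 1, n, p) == 0.0
```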

The Poisson random variable is commonly used to model arrivals in queuing theory:

- Poisson distribution
- Given a real-valued parameter \(\lambda > 0\), the Poisson Distribution is \[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \ , \] if \(k \in \mathbb{N}\) and \(P(X = k) = 0\) otherwise.

### 6.3 Functions of random variables

A function of a random variable is (usually) also a random variable. Thus, related random variables can be created as complex expressions of simpler variables. For example, assume that \(X\) is a random variable such that

\begin{equation*} P(a \leq X \leq b) = b - a \quad \text{whenever } 0 \leq a \leq b \leq 1 \, . \end{equation*}

That is, \(X\) follows a uniform distribution on the unit interval. Now let

\[ Y = \lfloor 3X + 1 \rfloor \, . \]

Then, \(P(Y=1)=P(Y=2)=P(Y=3)=1/3\) and \(P(Y=y)=0\) for all \(y \not\in \{1,2,3\}\). Thus, we can use a uniform distribution on the unit interval to create a uniform distribution over three categories by simply composing random variables. Not every composition of random variables is a random variable, due to the technicality that there must exist a probability space which "originates" it. However, compositions of random variables that fail to be random variables are rare. In fact, arithmetic expressions of random variables (e.g., \(2X^2 + 3Y/4 + 1\)) and limits of random variables are always random variables.
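The composition can be simulated directly; the following Python sketch (sample size and seed are arbitrary choices) draws \(X\) uniformly and checks that each category of \(Y\) receives roughly probability \(1/3\):

```python
import random
from math import floor

random.seed(0)
N = 100_000
counts = {1: 0, 2: 0, 3: 0}
for _ in range(N):
    x = random.random()          # X uniform on [0, 1)
    y = floor(3 * x + 1)         # Y = floor(3X + 1), takes values in {1, 2, 3}
    counts[y] += 1

# Each category should have empirical frequency close to 1/3.
for k in (1, 2, 3):
    assert abs(counts[k] / N - 1 / 3) < 0.01
```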

### 6.4 Continuous random variables

- Continuous random variable
- We say that a random variable \(X\) is continuous if there is a nonnegative (Lebesgue) integrable function \(f\) such that \[ P(a \leq X \leq b) = \int_a^b f(x) dx \, , \] for any reals \(a \leq b\).

The definition above implies that \(P(X = x) = P(x \leq X \leq
x) = 0\) for any value \(x\), thus coinciding with our intuition
of a continuous random quantity. The support of a continuous
random variable is thus uncountable. The function \(f\) (if it
exists) is called the **probability density function** of \(X\),
or more simply, the density of \(X\). Note that \(f\) may be
discontinuous at a set of points of probability zero (e.g.,
\(f\) can be piecewise continuous). By
construction, if \(F\) is the cumulative distribution of \(X\)
then \(F(x)=\int_{-\infty}^x f(u) du\). By the Fundamental
Theorem of Calculus, if \(F\) is differentiable then

\[ f(x) = \frac{d}{dx} F(x) \, . \]

Here is a common continuous random variable used for modeling the arrival of items:

- Exponential distribution
Given a real-valued parameter \(\lambda > 0\), let

\[ f(x) = \lambda e^{-\lambda x} \ , \]

if \(x \geq 0\) and \(f(x) = 0\) otherwise. The Exponential Distribution is defined as

\[ P(A) = \int_{A} f(x) dx \, , \]

whenever the integral exists.

A probability density function can take values larger than one; in fact probability density functions can be unbounded.

Consider a random variable with density

\[ f(x) = \begin{cases} \frac{1}{2\sqrt{x}} & \text{if } 0 < x \leq 1 \, \\ 0 & \text{otherwise}. \end{cases} \]

Then \(f(x) \rightarrow \infty\) as \(x \rightarrow 0\), however

\[ \int_{-\infty}^{\infty} f(x) dx = 1 \, . \]
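This example can also be checked by simulation: if \(U\) is uniform on \((0,1)\), then \(X=U^2\) has cumulative distribution \(P(X \leq x)=\sqrt{x}\) on \((0,1]\), whose derivative is exactly the unbounded density above. A Python sketch (sample size and seed are arbitrary):

```python
import random
from math import sqrt

random.seed(1)
N = 200_000
# X = U^2 has CDF sqrt(x) on (0, 1], i.e., density f(x) = 1/(2*sqrt(x)).
xs = [random.random() ** 2 for _ in range(N)]

# The empirical CDF should match sqrt(x) at a few test points.
for x in (0.01, 0.25, 0.5, 1.0):
    empirical = sum(1 for v in xs if v <= x) / N
    assert abs(empirical - sqrt(x)) < 0.01

# The density exceeds one near zero: f(0.01) = 1/(2*0.1) = 5.
assert 1 / (2 * sqrt(0.01)) == 5.0
```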

Continuity of the cumulative distribution function does not
suffice to ensure that a random variable is continuous: there
are probability distributions whose cumulative distribution
function is continuous and yet which concentrate all mass on a
set of Lebesgue measure zero, each point of which has
probability zero (hence the distribution is also not
discrete).^{9} The converse,
however, is true: if \(X\) has a density then its
cumulative distribution function is continuous.

So discrete random variables have countable support and
continuous random variables have uncountable support;
similarly, apart from some highly pathological cases,
continuous random variables have (Lebesgue) integrable
densities and assign zero probability to every value. There
are, however, very frequently occurring types of random
variables that satisfy none of these descriptions. They
naturally occur when measuring physical phenomena. For
example, consider the amount of rainfall \(X\) in a certain
location (in a certain period). For any value \(x > 0\), it is
natural to assume that \(P(X = x) = 0\), unless some censoring
occurs. However, it is most likely that we have \(P(X=0) > 0\),
indicating that there is a nonzero probability of no rain at all.
Such variables are often called **mixed-type random
variables** or **hybrid random variables**. They
can still be defined by their cumulative distributions, or
more conveniently by their discrete and continuous parts.

- Mixed-type random variable
- We say that a random variable \(X\) is of mixed-type if there are functions \(f\) and \(g\) such that \(f\) is integrable, \(g(x) > 0\) for some countable set \(\mathcal{D} \subset \Omega_X\), and \[ P(a \leq X \leq b) = \int_a^b f(x) dx + \sum_{x \in \mathcal{D} \cap [a,b]} g(x) \, , \] for any \(a \leq b\).

Mixed-type random variables are simply the combination of discrete and continuous random variables; the functions \(f\) and \(g\) are, respectively, the probability density function of the continuous part and the probability mass function of the discrete part. Mixed-type random variables appear often when one models censoring in measurements, for example, when a measurement can only be made within a time limit (which is arguably always the case).

**Truncated random variable.**

Consider a continuous random variable \(X\) with density \(f\). Given values \(\tau_1 < \tau_2\), obtain a new random variable

\begin{equation*} Y = \begin{cases} \tau_1 & \text{if } X \leq \tau_1 \, ,\\ X & \text{if } \tau_1 < X < \tau_2 \, ,\\ \tau_2 & \text{if } X \geq \tau_2 \, . \end{cases} \end{equation*}

Then \(Y\) has distribution

\[ P(Y \in A) = \int_A f'(y) dy + \sum_{y \in A \cap \mathcal{D}} g(y) \, , \]

where \(f'(y) = f(y)\) if \(\tau_1 < y < \tau_2\) and \(f'(y) = 0\) otherwise,

\[ g(x) = \begin{cases} \int_{-\infty}^{\tau_1} f(u) du & \text{if } x = \tau_1 \, ,\\ 0 & \text{if } x \not\in \{\tau_1,\tau_2\} \, ,\\ \int_{\tau_2}^{\infty} f(u) du & \text{if } x = \tau_2 \, , \end{cases} \]

and \(\mathcal{D}=\{\tau_1,\tau_2\}\).
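A quick simulation illustrates the point masses of the truncated variable; here we assume (for illustration only) that \(X\) is Exponential with \(\lambda=1\), \(\tau_1=1/2\) and \(\tau_2=2\):

```python
import random
from math import exp

random.seed(2)
lam, t1, t2 = 1.0, 0.5, 2.0
N = 200_000

def truncate(x):
    return min(max(x, t1), t2)   # clamp X to [t1, t2]

ys = [truncate(random.expovariate(lam)) for _ in range(N)]

# Point masses g(t1) = P(X <= t1) and g(t2) = P(X >= t2).
p1 = sum(1 for y in ys if y == t1) / N
p2 = sum(1 for y in ys if y == t2) / N
assert abs(p1 - (1 - exp(-lam * t1))) < 0.01   # 1 - e^{-1/2}
assert abs(p2 - exp(-lam * t2)) < 0.01         # e^{-2}
```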

### 6.5 Conditional probability distributions

Similarly to (unconditional) probabilities, we can define conditional distributions of a random variable \(X\) conditional on another random variable \(Y\), written \(p( X \in A | Y \in B )\). As with conditional probability of events, this probability is only defined if \(p(Y \in B)>0\). This works well for discrete random variables, but is unsatisfactory for continuous random variables; often we want to model a phenomenon assuming an observation, and if the random variable assumed observed is continuous, then \(B=\{y\}\) will be a singleton with probability zero. We could instead assume that \(B\) is an infinitesimal region around the observation \(y\), one large enough to have positive probability: \[ \lim_{\Delta \rightarrow 0} p(X \in A| Y \in [y-\Delta/2,y+\Delta/2]) \, . \] We can generalize this idea to the density of \(X\) as well. Assume that \(X\) and \(Y\) have joint density \(f(x,y)\). Then we define the **conditional probability density** \(f(x|y)\) of \(X\) given \(Y\) as \[ f(x|y) = \lim_{\Delta \rightarrow 0} \frac{d}{dx} p(X \leq x| Y \in [y-\Delta/2,y+\Delta/2]) = \frac{f(x,y)}{f(y)} \, , \] where \(f(y)\) is the marginal density of \(Y\). The conditional probability of a measurable set \( A \) given \(Y=y\) is, as desired, \[ p(X \in A | Y = y) = \int_{A} f(x|y) dx \, . \]
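The normalization of the conditional density \(f(x|y)=f(x,y)/f(y)\) can be verified numerically; the joint density \(f(x,y)=x+y\) on the unit square used below is an illustrative assumption, not taken from the text:

```python
# For f(x, y) = x + y on the unit square, the marginal is
# f(y) = integral of (x + y) dx over [0, 1] = y + 1/2, so
# f(x|y) = (x + y) / (y + 1/2).  A midpoint Riemann sum checks that
# the conditional density integrates to one for several values of y.
m = 100_000
h = 1.0 / m
for y in (0.1, 0.5, 0.9):
    marginal = y + 0.5
    total = sum(((i + 0.5) * h + y) / marginal for i in range(m)) * h
    assert abs(total - 1.0) < 1e-9
```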

## 7 Multivariate reasoning

In most modeling situations we are interested not in a single quantity but in a set of quantities or summaries represented by a set of random variables. For convenience, if \(X\) and \(Y\) are sets of random variables we write \(XY\) to denote their union \(X \cup Y\) whenever there is no ambiguity.

### 7.1 Assignment

Let \(S\) be a set of random variables. An assignment \(\nu\) to \(S\) is a function that maps every variable \(X \in S\) into a value \(\nu[X]\) in its domain \(\Omega_X\). We write \(\nu \sim S\) to denote an arbitrary assignment of \(S\), or to consider the set of all assignments, as in the expression \(\sum_{\nu \sim S} 1\), which counts the number of assignments of \(S\).

### 7.2 TODO Joint probability distribution

Consider a vector \(X=(X_1,\dotsc,X_n)\) of random variables
from a probability space \((P,\Omega,\mathcal{F})\). The
**joint probability distribution** of \(X\) is the function
\(p(A_1,\dotsc,A_n)=P(\{ \omega: X_i(\omega) \in A_i,
i=1,\dotsc,n \})\), defined on \(\mathcal{B}^n\), the
\(n\)-dimensional Borel \(\sigma\)-algebra. If all random
variables in \(X\) are discrete, we define the **joint
probability mass function** as the function \(p(\nu)=P(\{\omega:
X_i(\omega)=\nu[X_i], i=1,\dotsc,n\})\) mapping assignments of
\(S=\{X_1,\dotsc,X_n\}\) to joint probability values. Note that
this (somewhat unusual) definition of probability mass
function is invariant to the order of the variables in the
vector \(X\).

TODO: define joint density functions.

### 7.3 TODO The total rule for random variables

Define and prove.

### 7.4 Independence

Since random variables define events, we can generalize event-related concepts.

Random variables \(X_1,\dotsc,X_n\) are **fully
independent** if their joint distribution factorizes:

\[ p(A_1,\dotsc,A_n) = \prod_{i=1}^{n} P(X_i \in A_i) \, , \]

for any combination of measurable sets \( A_{1},\dotsc,A_{n} \subseteq \mathbb{R} \).

Note that in the definition above we do not require that subsets of variables be independent. This is because we can take any \(A_{i} = \mathbb{R} \), effectively removing it from the equation above.

We say that sets of random variables \(X\) and \(Y\) are
**conditionally independent** given a set of random variables
\(\mathcal{Z}\) if

\[ P(X \in A, Y \in B \mid \mathcal{Z} \in C) = P(X \in A \mid \mathcal{Z} \in C) \, P(Y \in B \mid \mathcal{Z} \in C) \, , \]

for any measurable sets \(A\), \(B\) and \(C\) with \(P(\mathcal{Z} \in C) > 0\).

### 7.5 TODO The chain rule for random variables

Define and prove.

## 8 Expectation

We are often interested in aggregate characteristics of uncertain phenomena, such as, for instance, the average length of a randomly generated string, the average value of the roll of a die, or the average value of a number randomly picked from the unit interval (of course, things get more interesting when the distributions involved are not uniform, as in some of these examples). These quantities are formalized by the notion of expectation or expected value.

- Expectation of discrete random variable
The expectation or expected value of a discrete random variable \(X\) (i.e., one taking on countably many values) with probability mass function \(p\) is

\[ E(X) = \sum_{x} x \cdot p(x) \, , \]

provided this sum exists (i.e., it either converges to a number or diverges to \(\pm \infty\)).

The sum in the definition above ranges over all values \(x \in \Omega_X\). To avoid cluttering, we omit the domain of \(x\) in such sums.

In the Binary Strings Example, the expected value of random variable \(L(\omega)=|\omega|\) is \( E(L)=\sum_{i \in \mathbb{N}} i \cdot 2^{-i} = 2 \).
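The value of this series is easy to confirm numerically; partial sums converge quickly to 2:

```python
# Partial sums of E(L) = sum_i i * 2**-i; the tail beyond i = 199 is negligible.
expected = sum(i * 2**-i for i in range(1, 200))
assert abs(expected - 2.0) < 1e-12
```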

- Expectation of continuous random variable
- The expectation or expected value of a continuous random variable \(X\) with density \(f\) is \[ E(X) = \int_{-\infty}^\infty x \cdot f(x) \, dx \, , \] provided this integral exists (it can be \(\pm \infty\)).

Expectation can also be defined for other types of random variables (other than discrete and continuous), but the definition is rather technical. To keep this discussion simple, we will focus on discrete and continuous random variables, even though most of the theory developed applies to arbitrary random variables.

### 8.1 Linearity of expectation

The following property (which is true for *any* random
variable) is extremely useful:

Let \(X\) and \(Y\) denote random variables and \(a\) and \(b\) real values. The expectation of a random variable \(aX + bY\), if it exists, satisfies \[ E(aX + bY) = aE(X) + bE(Y) \,. \]
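Linearity can be verified exactly on a small discrete example; the two fair dice below are an illustrative choice (note that linearity does not require independence; independence is assumed here only to pin down a joint pmf):

```python
from fractions import Fraction
from itertools import product

# Two fair dice X and Y; exact expectations via the pmf.
die = {k: Fraction(1, 6) for k in range(1, 7)}
E = lambda pmf: sum(x * p for x, p in pmf.items())

a, b = 2, -3
# Build the pmf of aX + bY from the joint pmf of independent X and Y.
pmf_axby = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    v = a * x + b * y
    pmf_axby[v] = pmf_axby.get(v, Fraction(0)) + px * py

# E(aX + bY) = a E(X) + b E(Y) = 2*(7/2) - 3*(7/2) = -7/2.
assert E(pmf_axby) == a * E(die) + b * E(die)
```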

### 8.2 Non-trivial random variables

A random variable \(X\) is **trivial** if there is \(x\) such that
\(P(X=x)=1\). A trivial random variable satisfies \(E(X)=x\).

A non-trivial bounded random variable \(X\) satisfies \[ \min X < E(X) < \max X \, , \] where \(\min X/\max X\) denote the minimum/maximum value on the range of \(X\).

### 8.3 Change of variable

Recall that a measurable function of a random variable is also a random variable. Hence, we can also compute the expectation of functions of random variables:

Let \(X\) be a discrete variable with probability mass function \(p\) and \(Y=f(X)\). Then \[ E(Y) = \sum_x f(x) p(x) \, , \] provided the sum exists.
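For instance, the expectation of \(Y = X^2\) for a fair die \(X\) can be computed directly from the pmf of \(X\), without ever deriving the pmf of \(Y\) (the die is an illustrative choice):

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}
# E(f(X)) = sum_x f(x) p(x) with f(x) = x^2.
E_Y = sum(x**2 * p for x, p in die.items())
assert E_Y == Fraction(91, 6)   # (1 + 4 + 9 + 16 + 25 + 36) / 6
```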

### 8.4 Conditional expectation

Oftentimes, we are interested in the average value of a random variable, say \(X\), assuming we have observed some other random variable, say \(Y=y\). This notion is captured by conditional expectation.

Let \(X\) and \(Y\) denote discrete random variables. The conditional expectation of \(X\) given \(Y=y\) is

\[ E(X|Y=y) = \sum_x x \cdot P(X=x|Y=y) \, . \]

The conditional expectation can be seen as a new random
variable \(Z=E(X|Y)\), a function of \(Y\) that takes value
\(E(X|Y=y)\) when \(Y=y\). Hence,
the expectation of the conditional expectation, denoted
\(E(E(X|Y))\) is well-defined. We have the following very useful
property regarding such random variables, known as the **law of
iterated expectations**:

Let \(X\) and \(Y\) be random variables. Then \(E(E(X|Y)) = E(X)\).

This property is so important that some authors actually define conditional probability density as the function that satisfies it (if one exists).
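The law of iterated expectations can be verified exactly on a small example; the joint pmf below is an illustrative assumption:

```python
from fractions import Fraction

# An illustrative joint pmf p(x, y) on {0, 1} x {0, 1}.
joint = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
         (1, 0): Fraction(2, 8), (1, 1): Fraction(2, 8)}

# Marginal pmf of Y.
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}
# E(X | Y = y) = sum_x x * p(x, y) / p(y).
E_X_given = {y: sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]
             for y in (0, 1)}
# E(E(X|Y)) averages the conditional expectations over the pmf of Y.
lhs = sum(E_X_given[y] * p_y[y] for y in (0, 1))
rhs = sum(x * p for (x, _), p in joint.items())   # E(X)
assert lhs == rhs == Fraction(1, 2)
```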

### 8.5 Variance

The variance of a random variable measures the spread of probability mass or density over its domain. Assuming \(E(X)\) exists and is finite, it is defined as \[ \mathrm{Var}(X) = E\bigl((X - E(X))^2\bigr) \, . \]

## 9 Limit Laws

The following result is known as the **weak law of large numbers**.

Let \(X_1,\dotsc,X_n\) be independent and identically distributed random variables with expected value \(\mu\), and define \(S_n = \sum_{i=1}^n X_i / n\). Then,

\[ \lim_{n \rightarrow \infty} P( |S_n - \mu| > \epsilon) = 0 \, , \]

for any \(\epsilon > 0\).

The previous result states that the probability that the mean of independently generated observations (or copies) of a quantity deviates from its expected value (by any fixed margin) converges to zero as the number of observations increases. This is at the heart of techniques of statistical inference that estimate an unknown parameter from data.

The following result is known as the **strong law of large
numbers**, as it subsumes the previous one.

Let \(X_1,\dotsc,X_n\) be independent and identically distributed random variables with expected value \(\mu\) and finite variance, and define \(S_n = \sum_{i=1}^n X_i / n\). Then,

\[ P\left( \lim_{n \rightarrow \infty} |S_n - \mu| = 0\right) = 1 \, . \]

The result above states that the mean of independently
generated observations of a quantity converges *almost
surely* (i.e., with probability one) to its expected value as
the number of observations goes to infinity.
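The laws of large numbers are easy to observe by simulation; the sketch below (uniform draws, arbitrary seed and sample sizes) watches the sample mean settle near \(\mu = 1/2\):

```python
import random

random.seed(3)

# Sample mean of n i.i.d. uniform draws on [0, 1), which have mu = 1/2.
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

deviations = [abs(sample_mean(n) - 0.5) for n in (100, 10_000, 1_000_000)]
# For large n the sample mean should be very close to mu.
assert deviations[-1] < 0.01
```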

## 10 To know more

Boole's 1854 treatise is often viewed as the first investigation of probabilistic modeling Boo1854. The most popular book on the foundations of modern probability theory remains the work of Billingsley Bil1995. A more condensed version can be found in the book of Rosenthal Ros2006. Jaynes provides a formalization of probability theory that does not allow infinite sample spaces Jay2003. In Jaynes' view, models over infinite domains should be construed as the limit of a family of models, each over a finite domain. This removes several of the "paradoxes" that appear in Kolmogorov's (i.e., measure-theoretic) probability theory. Kadane presents a complete introduction to probability theory from the Bayesian viewpoint Kad2011; much of the initial discussion in this lecture is based on that work. A traditional (frequentist) take on probability theory is given by Feller Fel1968.

## 11 Exercises

- Prove the Bonferroni inequality from the Axioms of Probability Theory: \[ P(A \cap B) \geq P(A) + P(B) - 1 \, , \] for any events \(A\) and \(B\).
  *Hint:* note that \(A \cup B = (A - B) \cup (A \cap B) \cup (B - A)\).
- Prove that \[ P(A \cap (B \cup C)) = P(A \cap B) + P(A \cap C) - P(A \cap B \cap C) \, , \] for any three events \(A\), \(B\) and \(C\).
  *Hint:* use distributivity of \(\cap\) over \(\cup\): \(A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\).
- Show that independence among \(n\) random variables implies pairwise independence of any two random variables (i.e., \(p(X_1,\dotsc,X_n)=\prod_i p(X_i)\) implies \(p(X_i,X_j)=p(X_i)p(X_j)\) for \(i\neq j\)).
- An exam to test for a serious disease has 99% accuracy,
  meaning that the probability of the exam testing positive
  if the person does have the disease is 0.99, which is the
  same as the probability of the exam testing negative if the
  person does not have the disease. Suppose that you test
  positive for the disease, and that the disease is present
  in 1 in every 10,000 people of your age.
  - What are the odds that you indeed have the disease?

- Knuth and Yao investigated how distributions on arbitrary
simple random variables (i.e., random variables with a
finite support) could be simulated by sequences of random
bits, that is, by a sequence of independent binary-valued
random variables. A particular interesting special case of
their method consists in the procedure illustrated in the
diagram in Figure 1, that simulates the toss of
a six-sided fair die using random bits (or tosses of a fair
coin). Each arc in the diagram represents the outcome of a
coin toss, and the leaves represent the outcome of the
algorithm (a number between 1 and 6). By convention, assume
that the arcs going up denote heads (H) while the arcs
going down denote tails (T). Every path from the root to a
leaf thus describes a terminating run (i.e., an execution)
of the algorithm. Note that there are non-terminating runs
of the algorithm, as the diagram contains directed cycles.
- Formulate a probability space for the problem (sample
space, algebra and probability measure) and define which
events correspond to outputs of the algorithm (i.e., to
tosses of the die). Note that you have to prove that the
probability measure satisfies the Axioms of probability
theory and that the event space is (at least) an algebra.
  *Hint:* consider a countably infinite space of finite and infinite executions of the algorithm. If you get stuck, consider modeling only finite runs of the program up to a certain amount \(n\) of tosses, then take this number to infinity.
- Specify a (discrete) random variable on the probability space you created that represents the output of the algorithm. The random variable should take values \(i=1,2,\dotsc,6\) with probability \(1/6\). Note that the random variable must map all elements of the sample space, not only those leading to desired outcomes.


Figure 1: Diagram describing Knuth and Yao's algorithm to generate samples of a die using fair coins.

# Bibliography

- [Boo1854] George Boole, An Investigation of the Laws of Thought: On which are Founded the Mathematical Theories of Logic and Probabilities, Walton and Maberly (1854).
- [Bil1995] Patrick Billingsley, Probability and measure, Wiley-Interscience (1995).
- [Ros2006] Jeffrey Rosenthal, A First Look at Rigorous Probability Theory, World Scientific Publishing Company (2006).
- [Jay2003] Edwin Jaynes, Probability theory: the logic of science, Cambridge University Press (2003).
- [Kad2011] Joseph Kadane, Principles of Uncertainty, Chapman & Hall (2011).
- [Fel1968] William Feller, An Introduction to Probability Theory, John Wiley & Sons (1968).

## Footnotes:

^{1}

This point is usually made by the so-called Bertrand's paradox, which attempts at building a uniform distribution of chords of a circle.

^{2}

Mathematically, these outcomes can be captured by the notion of a random variable, which maps states of the world into real values.

^{3}

As of this writing, Wikipedia currently lists the Amazon as the longest river.

^{4}

Probability currency is an imaginary currency that behaves as lottery tickets, so that your utility is linear.

^{5}

To remain valid as an operational definition, an event is usually constrained to be the result of realizable and unambiguous procedure; for example, the result of tossing one particular coin at a particular time by a designated person.

^{6}

In general,
we assume that you can be involved in a *finite* number of
transactions \(\sum_{i=1}^m \gamma_i |A_i(\omega) - P(A_i)|\),
where \(\gamma_i\) is a real value indicating the number of
lottery tickets related to event \(A_i\) that you have bought
(if \(\gamma_i>0\)) or sold (for \(\gamma_i < 0\)). It can be
shown that the three laws are necessary and sufficient to
ensure that any such compound transaction is nonnegative for
some state \(\omega\), i.e., it avoids sure loss.

^{7}

Note that this property was not required: we could have simply defined probabilities of infinite sets in a consistent way (e.g. the probability of even length string and odd length strings being 1/2 each).

^{8}

The meaning of the sum over an uncountable domain can be made precise while still validating the argument.

^{9}

Some textbooks refer to the above definition of continuous random variables as absolutely continuous, to emphasize this difference.