<p>Probabilistic Modeling, Asher Spector, 2018-09-21</p><blockquote>
<p>“All models are wrong, but some are useful.” — George E. P. Box</p>
</blockquote>
<h1 id="generative-models">Generative Models</h1>
<p>A <strong>model</strong> is a set of assumptions (and, usually, equations) which frame the
way the world works. For example, you might model a series of dice rolls in a
board game by assuming they are independent of each other - in other words,
the outcome of the first roll does not affect the outcome of the second roll.</p>
<p>However, we’ll usually be interested in <strong>generative models.</strong> A generative
model with respect to some observed data is a model which is capable of <em>fully
simulating</em> the observed data. For example, if you observed a bunch of dice
rolls in a row, as follows:</p>
\[1, \, 3, \, 4, \, 6, \, 2, \, 5, \, 5, \, 3, \, 1, \, 6\]
<p>just knowing that each roll is independent of the others is <em>not</em> enough to
simulate the data. To do that, you’d need to also to specify the <em>probability
distribution</em> of the value of each roll - in other words, the probability that
any roll will land as a $1$ or a $2$ or a $3$, etc.</p>
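<p>As a minimal sketch (with NumPy, and with a fair die assumed purely for illustration), specifying a distribution over the six faces is exactly what makes the model generative - it lets us simulate new rolls:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
faces = [1, 2, 3, 4, 5, 6]
# Any probabilities summing to 1 would do; a fair die is assumed here
probs = [1 / 6] * 6
rolls = rng.choice(faces, size=10, p=probs)  # simulate ten independent rolls
print(rolls)
```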
<p>We like generative models for at least two reasons! First, they let us do fun
(and useful) math - for example, we can estimate the probability that the
observed data occurs under a model. As we’ll see, calculating the likelihood of
observed data is extremely important in techniques like Maximum Likelihood
Estimation and more. Second, generative models can also <em>do</em> cool things: for
example, a generative model for natural language can <a href="https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6">write a chapter of Harry
Potter</a> or even create <a href="https://www.youtube.com/watch?v=VrgYtFhVGmg">fake pictures of real
celebrities</a>, as pictured below:</p>
<p><img src="/assets/images/ML/fake-celebrities.PNG" alt="png" /></p>
<p>These fake celebrities came from a paper at
<a href="https://arxiv.org/abs/1710.10196">arXiv:1710.10196</a>.</p>
<h1 id="the-data-generating-process">The Data Generating Process</h1>
<p>The <strong>Data Generating Process</strong> (DGP) is the “true” generative model. To be more
specific, in most problems we assume there is some underlying <a href="https://amspector100.github.io/mathy/probability-theory-in-15-minutes/#joint-and-conditional-distributions">joint probability
distribution</a> which governs the data we
observe, and the DGP is that underlying distribution. We don’t observe the DGP itself,
but we can use external data to get a rough sense of what’s going on.</p>
<p>In the example in the next section, which will tie together all of this
material, we have a “God’s eye view” and can see the DGP in all its glory, but
only because I literally made up the data. In reality, you will never know the
DGP - the point of statistics is broadly to create models which approximate it. We call
a model <em>correctly specified</em> if it has the same underlying structure as the
DGP.</p>
<h1 id="estimators-parametric-and-nonparametric-models">Estimators, Parametric, and Nonparametric Models</h1>
<p>Like people, models tend to come in families. For example, the normal
distribution is not a single distribution - it’s a family of extremely similar
distributions, each of which depends on two values: a <em>mean</em> and <em>variance.</em> We
generally call values which help index families of models <em>parameters</em>.</p>
<p>Unfortunately, we don’t usually know the values of parameters we’re interested
in. As a result, we have to create <em>estimates</em> for parameter values. We do this
using <em>estimators</em>, which are just functions of random data which we can use to
guess the parameter of interest. Usually, we denote estimators by putting a
little hat on top of some symbol, like $\hat \theta$ or $\hat \sigma$.</p>
<p>To understand all of the modeling terminology we’ve been discussing, consider
the example below.</p>
<div class="notice--warning"> Warning: the term "estimator" is extremely
confusing. The key point to remember is that an <strong> estimate </strong> is
nonrandom, whereas an <strong> estimator </strong> is a function of
random data. Both serve as a guess for a parameter of interest. This is all made
more confusing by the fact that some people refer to values they want to learn
as <strong> estimands </strong>. </div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">210</span><span class="p">)</span>
<span class="c1"># Simulate data with random noise
</span><span class="n">num_samples</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">true_lambda</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">pi</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">poisson</span><span class="p">(</span><span class="n">true_lambda</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">num_samples</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">data</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">num_samples</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.3</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">"Histogram of Observed Data (From the DGP)"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/ipython/2018-09-21-probabilistic-modeling_files/2018-09-21-probabilistic-modeling_1_0.png" alt="png" /></p>
<p>Let’s try fitting two models to this data.</p>
<p>In the first model, we’ll make a big (but correct!) assumption: that the DGP is
primarily a <a href="https://amspector100.github.io/mathy/probability-theory-in-15-minutes/#poisson">Poisson
Distribution</a>. However, we need to find the rate parameter for the
Poisson distribution. We’ll talk more about different types of estimators in the
coming posts, but one commonly used estimator is simply the mean of the data.
This makes intuitive sense, because the mean of a Poisson is its rate parameter.
Then, if we denote our parameter as $\lambda$, our $n$ data points as $X_1,
\dots, X_n$, and our estimator as $\hat \lambda$, we define</p>
\[\hat \lambda = \frac{1}{n} \sum_{i = 1}^n X_i\]
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="c1"># Draw values from model dist
</span><span class="n">hat_lambda</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">model_pmf</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">stats</span><span class="p">.</span><span class="n">poisson</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">hat_lambda</span><span class="p">,</span> <span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float64</span><span class="p">)</span>
<span class="n">model_pmf_values</span> <span class="o">=</span> <span class="n">model_pmf</span><span class="p">(</span><span class="n">x_values</span><span class="p">)</span>
<span class="c1"># Plot
</span><span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.3</span><span class="p">,</span> <span class="n">density</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Observed'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">model_pmf_values</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'green'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Poisson Model'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/ipython/2018-09-21-probabilistic-modeling_files/2018-09-21-probabilistic-modeling_3_0.png" alt="png" /></p>
<p>In our second model, we’ll make fewer assumptions about the underlying
distribution of the data. (This model is also a bit more complicated). We will
model the density function underlying the data by simply calculating the percent
of data which falls into a small bin around the data. In this setting, our
estimand is the shape of the distribution itself: we’re modeling the entire
distribution in one go. (This is a simple example of <strong>kernel density estimation
(KDE)</strong>, which we’ll talk more about later).</p>
<p>To get a little bit more formal, suppose we have observed data $X_1, \dots, X_n$
and we have selected a binsize of $h$ (usually, the more data you have observed,
the smaller you make the binsize). Then, for any real number $y$, our
estimator, called $\hat f$, returns the following expression:</p>
\[\hat f(y) = \frac{1}{h} \sum_{i=1}^n \frac{I_{|X_i - y| < h/2}} {n}\]
<p>This notation may be a bit confusing at first, but remember that $I_{|X_i - y| < h/2}$
is just an indicator random variable which equals $1$ if the
$i$th data observation $X_i$ falls within $\frac{h}{2}$ of the input $y$, and
$0$ otherwise. Let’s see how this model performs below, especially compared
against the Poisson model:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Model 2 - Rectangular KDE
</span><span class="n">bandwidth</span> <span class="o">=</span> <span class="mf">1.06</span> <span class="o">*</span> <span class="n">data</span><span class="p">.</span><span class="n">std</span><span class="p">()</span> <span class="o">*</span> <span class="n">num_samples</span> <span class="o">**</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="o">/</span><span class="mi">5</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">f_hat</span><span class="p">(</span><span class="n">y</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span> <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o"><</span> <span class="n">bandwidth</span> <span class="o">/</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">bandwidth</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))</span>
<span class="n">f_hat</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">f_hat</span><span class="p">)</span>
<span class="n">model2_values</span> <span class="o">=</span> <span class="n">f_hat</span><span class="p">(</span><span class="n">x_values</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">ax</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.3</span><span class="p">,</span> <span class="n">density</span> <span class="o">=</span> <span class="bp">True</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Observed'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">model_pmf_values</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'green'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Poisson Model'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">model2_values</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'orange'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Model 2'</span><span class="p">)</span>
<span class="n">ax</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/ipython/2018-09-21-probabilistic-modeling_files/2018-09-21-probabilistic-modeling_5_0.png" alt="png" /></p>
<p>Both models perform reasonably well, probably because in these kinds of
examples, we have the luxury of actually knowing the DGP and can model
accordingly.</p>
<p>However, there is one key difference to note between the two models. In the
first model, no matter how much data we observe, we only have one parameter: the
rate parameter for the Poisson. On the other hand, in the second model, our
binsize gets smaller and smaller the more data we observe. As a result, the
number of output values we have to estimate for the function actually scales
with the size of the data; so if we observed an infinite amount of data, we’d
have to estimate an infinite number of values, i.e. each value $\hat f(y)$ for
any $y \in \mathbb{R}$.</p>
<p>This difference between the models is so important it has a name. Because the
first model only has a <em>finite</em> number of parameters, it is called a
<strong>parametric model.</strong> On the other hand, as we observe more data, the second
model has an unbounded number of parameters. As a result, it is called a
<strong>nonparametric model.</strong></p>
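<p>To make the contrast concrete: the bandwidth rule of thumb used above ($1.06 \, \hat \sigma \, n^{-1/5}$, a common default) shrinks the binsize as the sample grows, so the nonparametric model keeps gaining resolution while the Poisson model stays at a single parameter:</p>

```python
# The Poisson model always has exactly one parameter; the KDE's
# effective resolution grows because its bandwidth shrinks with n.
sigma = 1.0  # standard deviation of the data, held fixed for illustration
for n in [100, 1000, 100000]:
    bandwidth = 1.06 * sigma * n ** (-1 / 5)
    print(n, round(bandwidth, 3))
```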
<h1 id="supervised-and-unsupervised-learning">Supervised and Unsupervised Learning</h1>
<p>One last important distinction worth reviewing is the difference between
<strong>supervised</strong> and <strong>unsupervised</strong> learning algorithms. (The distinction’s a
bit artificial, but the terminology is so common it’s worth reviewing).</p>
<p>Unsupervised algorithms are designed to automatically detect patterns in data
that has already been observed. For example, an algorithm called a Gaussian
Mixture Model (GMM) can take the following data as an input:</p>
<p><img src="/assets/images/ML/unsup_init_data.png" alt="" /></p>
<p>and the GMM will automatically cluster it into something like the following:</p>
<p><img src="/assets/images/ML/gmm_output.png" alt="" /></p>
<p>The GMM did not require any training data; it simply detected the underlying
clusters in the data.</p>
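<p>A minimal sketch of this idea, using scikit-learn’s <code>GaussianMixture</code> on made-up two-cluster data (the figures above came from a different dataset):</p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two made-up clusters of 2D points
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)  # cluster assignment for each point, no training labels needed
```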
<p>On the other hand, supervised learning algorithms are designed to solve a
different kind of problem. Imagine you have observed a variety of points in
space, each associated with a specific color. We will represent the location of
each point as $X_i$, and the color as $Y_i$. The goal in a supervised learning
problem is to learn to predict $Y$ given $X$: in other words, if you observe the
locations of a bunch of new points in space, predict the new colors of the
points.</p>
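<p>Before looking at the neural-network example below, here is the same setup in miniature, with a k-nearest-neighbors classifier standing in for the network, and with made-up point locations and colors:</p>

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Made-up training data: locations X_i, colors Y_i (0 = red, 1 = blue)
X_train = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
                     rng.normal([3, 3], 0.5, (50, 2))])
Y_train = np.array([0] * 50 + [1] * 50)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, Y_train)
# Predict the colors of new, unseen points from their locations
X_new = np.array([[0.1, -0.2], [2.9, 3.1]])
print(clf.predict(X_new))
```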
<p>For example, a simple feedforward neural network might receive the following
points as training data:
<img src="/assets/images/ML/feedforward_train.png" alt="" />
Then, if you fed the network a series of new points like this:
<img src="/assets/images/ML/feedforward_test.png" alt="" />
it would (hopefully) be able to predict their color.
<img src="/assets/images/ML/feedforward_predict.png" alt="" /></p>
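<p>One caveat, demonstrated below with made-up noisy data and scikit-learn (a sketch, not the network above): a flexible learner such as an unconstrained decision tree can score perfectly on the points it trained on while doing much worse on new ones:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Labels depend on the first coordinate plus irreducible noise
Y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr)
print(tree.score(X_tr, Y_tr))  # 1.0: the tree memorizes its training data
print(tree.score(X_te, Y_te))  # noticeably lower on data it has never seen
```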
<p>Supervised learning algorithms have a habit of <strong>overfitting</strong> their training
data, meaning that the algorithm essentially memorizes the output for every
training input but is unable to generalize to new datasets. To detect and
prevent overfitting, we often train supervised learning algorithms on a
subset of the training data (perhaps ~80%) and then test them on the remaining
held-out portion, which they have never seen before.</p>
<p>Probability Theory In 15 Minutes, Asher Spector, 2018-09-20</p><blockquote>
<p>“Math is the logic of certainty; statistics is the logic of uncertainty.” -
Joe Blitzstein</p>
</blockquote>
<p>Before posting too much, I thought it would be worth reviewing probability
theory, which is really the foundation of all data science. This post is also
intended to serve as a quick (and unfortunately incomplete) reference guide,
from which you can pull specific pieces of information.</p>
<h1 id="-events-and-random-variables"><a name="events"></a> Events and Random Variables</h1>
<p>In statistics, we like to think about the <em>outcomes</em> of <em>random experiments.</em>
For example, one random experiment might be that my roommate rolls a pair of
dice. There are 11 possible outcomes to this experiment: he could roll a $2$, $3$,
$4$, and so on up to $12$. However, we can also group outcomes together into <em>events</em>:
for example, we might denote $A$ as the event that my roommate rolls an odd
number.</p>
<p>More mathematically, we might have a set $\Omega$, called the sample space,
which contains all the possible outcomes of a random experiment. An outcome
$\omega$ is an element of the sample space, i.e. $\omega \in \Omega$, and an
event $A$ is a subset of the sample space, i.e. $A \subset \Omega$. In the
previous example,</p>
<p>\(\Omega = \{ \text{ roll a $2$, roll a $3$, roll a $4$, }\dots, \text{ roll a
$12$ } \}\)
\(\omega = \text{ roll a $2$ }\)
\(A = \{ \text{ roll an odd number } \}\)</p>
<p>Because the experiment is random, we assign each event a <em>probability</em> between
$0$ and $1$. For some event $A$, we usually denote this probability as $P(A)$.</p>
<p>It’s important to note that although each outcome is <em>associated</em> with a number,
each outcome is <em>not</em> a number: it’s a different kind of mathematical object.
However, we might want to <em>represent</em> each outcome with a number,
and this is where random variables come in.</p>
<p>A <em>random variable</em> is a function from the sample space to the real number line.
Before a random event occurs, the random variable is like a black box: we can’t
know precisely what its value will be. After an experiment however, the random
variable crystallizes to some real number associated with an outcome of the
experiment. More formally, $X$ is a random variable if
\(X : \Omega \to \mathbb{R}\)</p>
<p>Note that we tend to represent random variables with <em>capital letters</em>, and
deterministic/nonrandom variables with lowercase letters.</p>
<p>If we wanted to, we could also define random vectors as maps from the sample
space to lists of real numbers, i.e.
\((X_1, \dots, X_n) = \vec{X} : \Omega \to \mathbb{R}^n\)</p>
<p>For now, however, let’s stick to the univariate (non-vector) case. As an
example, let $X$ be the sum of the values that my roommate Alex rolls
on a pair of dice. Then, if Alex rolls a $3$ and a $4$, $X$ would crystallize to
a value of $7$; and if Alex rolls a $6$ and a $2$, $X$ would crystallize to a
value of $8$, et cetera. In this way, random variables are random because their
value depends on the outcome of the random experiment.</p>
<p>It’s important to note that when we write an expression like $X = 4$ or $X \le
3$, these expressions describe <em>events</em>. Recall that an event is simply a
subset of the sample space; and $X = 4$ is true only in a subset of the sample
space (in our example, only if Alex’s rolls sum to $4$).</p>
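<p>Because $X = 4$ is just a subset of the sample space, its probability can be computed by brute-force enumeration of the 36 equally likely ordered rolls (a quick sketch):</p>

```python
from itertools import product

# Sample space: all 36 equally likely ordered outcomes of two dice
omega = list(product(range(1, 7), repeat=2))
event = [o for o in omega if sum(o) == 4]  # the event "X = 4": (1,3), (2,2), (3,1)
p = len(event) / len(omega)
print(p)  # 3/36
```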
<p>Random variables have <em>distributions</em> which describe how likely they are to take
on certain values after the experiment has concluded. For example, our random
variable $X$ from the last paragraph will never take on a value less than $2$ or
greater than $12$ - this is a fact about its distribution. We can simulate the
dice rolls and look at the distribution below:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="c1"># Simulate 10000 rolls
</span><span class="n">dice1</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">)</span>
<span class="n">dice2</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">10000</span><span class="p">)</span>
<span class="n">roll</span> <span class="o">=</span> <span class="n">dice1</span> <span class="o">+</span> <span class="n">dice2</span>
<span class="c1"># Plot
</span><span class="n">roll_counts</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="n">roll</span><span class="p">,</span> <span class="n">return_counts</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">roll_counts</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">roll_counts</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Sum of Two Independent Dice Rolls'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Roll Value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Frequency'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/ipython/2018-09-20-probability-theory-in-15-minutes_files/2018-09-20-probability-theory-in-15-minutes_1_0.png" alt="png" /></p>
<p>Although it’s nice to visualize distributions, we also want to be able to write
them down mathematically. We typically do this in one of three ways
(although these are not the only ways to specify a distribution).</p>
<h2 id="culmulative-distribution-functions">Cumulative Distribution Functions</h2>
<p>First, we can try to work with the <em>Cumulative Distribution Function</em> (CDF).
The CDF of a random variable is a function $F$ which takes in a real number $y$
and returns the probability that the random variable is less than or equal to
$y$.
\(F(y) = P(X \le y)\)</p>
<p>For example, we can plot an empirical CDF of the data above.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Create CDF
</span><span class="k">def</span> <span class="nf">my_cdf</span><span class="p">(</span><span class="n">y</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">roll</span> <span class="o"><=</span> <span class="n">y</span><span class="p">)</span><span class="o">/</span><span class="mi">10000</span>
<span class="k">return</span> <span class="n">result</span>
<span class="n">my_cdf</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">my_cdf</span><span class="p">)</span>
<span class="c1"># Plot
</span><span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">130</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span><span class="o">/</span><span class="mi">10</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">my_cdf</span><span class="p">(</span><span class="n">x_values</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">output</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'CDF for Sum of Two Dice Rolls'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/ipython/2018-09-20-probability-theory-in-15-minutes_files/2018-09-20-probability-theory-in-15-minutes_3_0.png" alt="png" /></p>
<p>CDFs are useful because we can use them to calculate the probability that a
random variable will lie in any arbitrary interval. In particular,</p>
\[P(a < X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a)\]
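<p>As a quick sanity check of this identity, by brute-force enumeration of the two-dice example (so $F$ here is the exact CDF):</p>

```python
from itertools import product

# Exact distribution of the sum of two dice, by enumeration
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

def F(y):
    """Exact CDF: P(X <= y)."""
    return sum(1 for s in sums if s <= y) / len(sums)

a, b = 3, 6
direct = sum(1 for s in sums if a < s <= b) / len(sums)
print(direct, F(b) - F(a))  # the two computations agree: 12/36
```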
<h2 id="probability-mass-functions">Probability Mass Functions</h2>
<p>Some distributions are <em>discrete</em>, in that they can only take on quantized or
“spaced out” values. For example, our example random variable $X$ was discrete
because it could only crystallize to a whole number, but it could never
crystallize to a fraction. We often use something called the Probability Mass
Function, or PMF, to describe the distribution of discrete random variables. If
we denote the PMF as $P_X$, then for any real number $y$,
\(P_X(y) = P(X = y)\)</p>
<p>We’ll plot an empirical PMF of the data above.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Create PMF
</span><span class="k">def</span> <span class="nf">my_pmf</span><span class="p">(</span><span class="n">y</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">roll</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span><span class="o">/</span><span class="mi">10000</span>
<span class="k">return</span> <span class="n">result</span>
<span class="n">my_pmf</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">my_pmf</span><span class="p">)</span>
<span class="c1"># Plot
</span><span class="n">output</span> <span class="o">=</span> <span class="n">my_pmf</span><span class="p">(</span><span class="n">x_values</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">output</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'PMF for Sum of Two Dice Rolls'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/ipython/2018-09-20-probability-theory-in-15-minutes_files/2018-09-20-probability-theory-in-15-minutes_5_0.png" alt="png" /></p>
<p>Note that because $X$ is discrete, the PMF is zero almost everywhere - it only
takes on nonzero values at the whole numbers $2$ through $12$.</p>
<h2 id="probability-density-functions">Probability Density Functions</h2>
<p>Other distributions are <em>continuous</em>, in that they can take on any real value.
We can’t use a PMF to describe continuous distributions, because the probability
that they take on any particular value is $0$! Although we can still use the CDF
to describe the probability that they’ll land in a particular <em>interval</em>, we
might still want something a bit more analogous to the PMF. The solution is to
use a <strong>probability density function</strong>, which is the <strong>derivative of the CDF</strong>.
To understand how this works, let’s look at the distribution of a “mystery”
random variable.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">stats</span>
<span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span><span class="o">/</span><span class="mi">1000</span>
<span class="n">cdf_values</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">cdf</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">cdf_values</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'"Mystery" CDF'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="\assets\images\ipython\2018-09-20-probability-theory-in-15-minutes_files\2018-09-20-probability-theory-in-15-minutes_7_0.png" alt="png" /></p>
<p>Let’s try to interpret this arbitrary CDF. Recall that the CDF $F$ for a random
variable $X$ is simply the probability that $X$ will be less than or equal to
$x$: $F(x) = P(X \le x)$. Here, $F(0)$ is $0$, meaning that our random variable
$X$ will never be less than $0$. However, the slope of the CDF is pretty high:
the CDF increases very quickly, until at $x = 0.1$, $F(0.1) \approx 0.5$. Like
we discussed earlier, this means that $X$ is very likely to appear in the range
between $0$ and $0.1$. On the other hand, the slope of the CDF is very small
(almost $0$) between $0.6$ and $0.9$, implying that $X$ will almost never appear
in that range.</p>
<p>We can take this idea to its limit (get it!) by taking the derivative of the
CDF, which corresponds to the limit of the probability that $X$ will fall into a
small bin around some value, divided by the bin’s width, as the bin shrinks. This
derivative is the PDF, and just as we suspected, it is very high while $x$ is
close to $0$, and drops off as $x$ increases.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">pdf_values</span> <span class="o">=</span> <span class="n">stats</span><span class="p">.</span><span class="n">beta</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax2</span><span class="p">,</span> <span class="n">ax1</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">pdf_values</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'orange'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">cdf_values</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Mystery PDF"</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Mystery CDF"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="\assets\images\ipython\2018-09-20-probability-theory-in-15-minutes_files\2018-09-20-probability-theory-in-15-minutes_9_0.png" alt="png" /></p>
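<p>As a sanity check, we can verify numerically that differentiating the CDF recovers the PDF. Here is a quick sketch (assuming <code>numpy</code> and <code>scipy</code> are available), using <code>np.gradient</code> to approximate the derivative of the Beta(1, 5) CDF:</p>

```python
import numpy as np
from scipy import stats

# Evaluate the Beta(1, 5) CDF and PDF on an interior grid
x = np.linspace(0.01, 0.99, 99)
cdf = stats.beta.cdf(x, 1, 5)
pdf = stats.beta.pdf(x, 1, 5)

# A finite-difference derivative of the CDF approximates the PDF
approx_pdf = np.gradient(cdf, x)
print(np.max(np.abs(approx_pdf[1:-1] - pdf[1:-1])))  # small; shrinks as the grid refines
```

<p>The agreement improves as the grid gets finer, which is exactly the limiting statement above.</p>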
<p>At this point, the difference between a “Probability Mass Function” and a
“Probability Density Function” might be a bit clearer, especially if you’ve ever
done any physics. For a probability <em>mass</em> function of a <em>discrete</em> random
variable, there is actual <em>mass</em>, or a finite probability, associated with many
points.</p>
<p>On the other hand, for a continuous random variable, there’s no finite
probability associated with any point, so instead we associate each point with a
<em>density</em>.</p>
<h2 id="expectation-variance-and-moments">Expectation, Variance, and Moments</h2>
<h3 id="expectation">Expectation</h3>
<p>The <strong>expectation</strong>, or mean, of a distribution is intuitively its “average
value.” Mathematically, we can define the expectation of a random variable,
denoted $E(X)$, as follows. Let $A$ be the <em>support</em> of $X$, i.e. the set of
values $k$ for which $P(X = k)$ is nonzero. In the discrete case, we define
\(E(X) = \sum_{k \in A} k \cdot P(X = k)\)
In the continuous case, since $P(X = k) = 0$ for every $k$, we use its PDF
instead. If $f$ is the pdf of $X$, then we define
\(E(X) = \int_{x \in A} x f(x) \, dx\)
If you’re familiar with the Riemann–Stieltjes integral, we can unify both
definitions. For $F$ the cdf of $X$, we define
\(E(X) = \int_{x \in A} x \, dF(x)\)
although if you don’t know this notation, that’s completely fine.</p>
<p>Because integrals and sums are linear operators, expectation is linear too: in
other words, for any two random variables $X$ and $Y$, and some constants $a, b
\in \mathbb{R}$,
\(E(aX + bY) = aE(X) + bE(Y)\)</p>
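<p>Crucially, linearity holds even when $X$ and $Y$ are dependent. Here is a small simulation sketch (the distributions and constants are arbitrary, purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)
Y = X ** 2 + rng.normal(size=100_000)  # Y is strongly dependent on X
a, b = 3.0, -1.5

# The sample mean is itself linear, so these agree up to rounding
lhs = np.mean(a * X + b * Y)
rhs = a * np.mean(X) + b * np.mean(Y)
print(abs(lhs - rhs))
```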
<p>The Law of the Unconscious Statistician, known as LOTUS, states that we can
extend this definition to compute the expectation of an arbitrary real-valued
function $g$ applied to $X$, using only the distribution of $X$. More specifically,</p>
\[E(g(X)) = \begin{cases} \int_{x \in A} g(x) f(x) \, dx & \text{X is continuous
} \\
\sum_{k \in A} g(k) \cdot P(X = k) & \text{X is
discrete }\end{cases}\]
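<p>For instance, LOTUS lets us compute $E(X^2)$ for a binomial $X$ just by summing $g(k) \cdot P(X = k)$ over the support; a brute-force simulation (an illustrative sketch) agrees:</p>

```python
import numpy as np
from scipy import stats

n, p = 10, 0.3
k = np.arange(n + 1)
lotus = np.sum(k ** 2 * stats.binom.pmf(k, n, p))  # sum of g(k) * P(X = k)

rng = np.random.default_rng(1)
sim = np.mean(rng.binomial(n, p, size=200_000) ** 2.0)
# Both approximate E(X^2) = Var(X) + E(X)^2 = 2.1 + 9 = 11.1
print(lotus, sim)
```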
<h3 id="variance">Variance</h3>
<p>The variance of a random variable measures the “spread” of its distribution.
More precisely, the variance is defined as the expected squared distance between
a random variable and its own expectation. I.e. for any $X$,</p>
\[\text{Var}(X) = E\bigg( (E(X) - X)^2 \bigg)\]
<p>Using the linear properties of expectation, we can simplify to find that
\(\text{Var}(X) = E\bigg( X^2 - 2 X E(X) + E(X)^2 \bigg)\)
\(= E \Big(X^2\Big) - E\Big(2 X E(X)\Big) + E\Big( E(X)^2\Big)\)
Because expectation is linear, and the value $E(X)$ is a constant, we can
rearrange as follows
\(\text{Var}(X) = E(X^2) - 2E(X)^2 + E(X)^2 = E(X^2) - E(X)^2\)</p>
<p>We define the <strong>standard deviation</strong> as the square root of the variance:</p>
\[\text{SD}(X) = \sqrt{\text{Var}(X)}\]
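<p>We can check the identity $\text{Var}(X) = E(X^2) - E(X)^2$ empirically (a sketch with an arbitrarily chosen distribution):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.gamma(shape=2.0, scale=3.0, size=100_000)

direct = np.mean((X - X.mean()) ** 2)          # E((X - E(X))^2)
shortcut = np.mean(X ** 2) - np.mean(X) ** 2   # E(X^2) - E(X)^2
print(direct, shortcut)  # agree up to rounding; the true variance is 2 * 3**2 = 18
```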
<p>I’ve plotted the expectations and variances of a couple of common distributions
below, because visualization aids intuition!</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax3</span><span class="p">,</span> <span class="n">ax2</span><span class="p">,</span> <span class="n">ax1</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">ncols</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">pdf_values</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Beta(1,5) PDF'</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mi">1</span><span class="o">/</span><span class="mi">6</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'Red'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">linestyle</span> <span class="o">=</span> <span class="s">'dashed'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Expectation'</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Variance = {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="mf">0.0198</span><span class="p">))</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">poisson_x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">poisson_x_values</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">poisson</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">poisson_x_values</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Poisson(3) PMF'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'Red'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">linestyle</span> <span class="o">=</span> <span class="s">'dashed'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Expectation'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Variance = {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span> <span class="o">-</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x_values</span> <span class="o">-</span> <span class="mf">0.5</span><span class="p">),</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Standard Normal PDF'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'Red'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">linestyle</span> <span class="o">=</span> <span class="s">'dashed'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Expectation'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Variance = {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span></code></pre></figure>
<p><img src="\assets\images\ipython\2018-09-20-probability-theory-in-15-minutes_files\2018-09-20-probability-theory-in-15-minutes_12_1.png" alt="png" /></p>
<h3 id="moments">Moments</h3>
<p>More generally, we define the $k$th <strong>moment</strong> of $X$ to be the value $E(X^k)$.
Moments can be hard to compute directly, but one way to find them is by using <strong>generating
functions</strong>. The moment generating function $M(t)$ is defined as follows:</p>
\[M(t) = E(e^{tX})\]
<p>Although this does <em>not</em> seem to be related to moments at first glance, it
helps if we think about it as a <a href="https://en.wikipedia.org/wiki/Taylor_series#Exponential_function">Taylor
series</a>.
Specifically, the Taylor series for $e^x$ is
\(e^x = \sum_{n = 0}^\infty \frac{x^n}{n!}\)
So using the linearity of expectation, we can rewrite the moment generating
function as
\(M(t) = E \bigg(\sum_{n = 0}^\infty \frac{(tX)^n}{n!} \bigg) =
\sum_{n=0}^\infty \frac{t^n E(X^n)}{n!}\)</p>
<p>And now we see the moments $E(X^n)$ popping up as part of each term in the sum.
Moment generating functions can be a bit confusing, but they are extremely useful. I may
end up doing an entire post on them for this reason.</p>
<h1 id="conditioning">Conditioning</h1>
<blockquote>
<p>“Conditioning is the soul of statistics.” - Joe Blitzstein</p>
</blockquote>
<p>Often, we have not observed the full result of an experiment, but we might have
observed <em>part</em> of the result, and we want to use this information to help
predict the rest of the experiment. In other words, some event $B$ has
already occurred, and we want to know the probability of some other event $A$
given $B$. We denote this as $P(A | B)$, where the $|$ symbol means “given.”</p>
<p>How can we calculate this probability? Well, one way to think about this is to
reduce the size of our <a href="#events">sample space</a>, and to throw away all the
outcomes where $B$ does not occur, because we know that $B$ did in fact occur.
Then, we can simply check the proportion of outcomes left in which $A$ occurs.
Another way to phrase this is to write
\(P(A|B) = \frac{P(A, B)}{P(B)}\)
where $P(A,B)$ is the probability that <em>both</em> $A$ and $B$ occur. With this
definition, we can also see that
\(P(A|B) P(B) = \frac{P(A, B)}{P(B)} P(B) = P(A, B)\)
and similarly
\(P(B|A) P(A) = P(A,B)\)
due to cancellation; this holds for any number of events and is often referred
to as the <strong>chain rule of probability</strong>. For example, for any four events we
know
\(P(A|B, C, D) P(B|C, D) P(C|D) P(D) = P(A, B, C, D)\)
for the same reason. By this identity, it logically follows that
\(P(A|B) = \frac{P(A,B)}{P(B)} = \frac{P(B|A) P(A)}{P(B)}\)</p>
<p>This result is actually one of the fundamental theorems of statistics: it is
known as <strong>Bayes’ Rule</strong>.</p>
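<p>To see Bayes’ Rule in action, here is a small enumeration sketch with two fair dice, taking $A$ to be “the sum is 8” and $B$ to be “the first die is even”:</p>

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # all 36 equally likely rolls
prob = Fraction(1, len(outcomes))

P_A = sum(prob for d1, d2 in outcomes if d1 + d2 == 8)                    # 5/36
P_B = sum(prob for d1, d2 in outcomes if d1 % 2 == 0)                     # 1/2
P_AB = sum(prob for d1, d2 in outcomes if d1 + d2 == 8 and d1 % 2 == 0)   # 1/12

P_B_given_A = P_AB / P_A
direct = P_AB / P_B              # P(A|B) from the definition
bayes = P_B_given_A * P_A / P_B  # P(A|B) via Bayes' Rule
print(direct, bayes)  # both 1/6
```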
<h2 id="independence">Independence</h2>
<p>Two events $A, B$ are defined to be <strong>independent</strong> if and only if $P(A, B) =
P(A) P(B)$. What does this mean? Remember, $A$ and $B$ are events, or a subset
of outcomes of some experiment. For example, if we think of the wheel of fortune
as a random experiment, $A$ could be the event that you win over 1000 dollars,
and $B$ could be the event that you win some sort of vacation as your prize.</p>
<p>With this in mind, we can construct an intuitive understanding of independence.
Events $A$ and $B$ are independent if knowing that one occurs gives you no
information about whether the other will occur; i.e. perhaps you spin two
separate wheels, one which might reward you with a vacation, and one which might
reward you with cash. On the other hand, if $P(A,B) < P(A)P(B)$, then knowing
that $A$ occurs makes $B$ less likely to occur, and vice versa - perhaps the
game show host only allows you to win <em>either</em> a vacation or cash. Or finally,
if $P(A,B) > P(A)P(B)$, then knowing $A$ occurred makes $B$ more likely to occur,
and vice versa.</p>
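<p>These three cases are easy to verify by enumeration with a single fair die roll (an illustrative sketch): “the roll is even” is independent of “the roll is at most 2”, but negatively related to “the roll is at most 3”:</p>

```python
from fractions import Fraction

outcomes = range(1, 7)  # one fair die roll

def P(event):
    """Probability of an event (a predicate on outcomes) under a fair die."""
    return Fraction(sum(1 for w in outcomes if event(w)), 6)

even = lambda w: w % 2 == 0
at_most_2 = lambda w: w <= 2
at_most_3 = lambda w: w <= 3

# Independent: P(even, <=2) = P({2}) = 1/6 = P(even) * P(<=2) = 1/2 * 1/3
indep = P(lambda w: even(w) and at_most_2(w)) == P(even) * P(at_most_2)
# Negatively related: P(even, <=3) = 1/6 < P(even) * P(<=3) = 1/4
neg = P(lambda w: even(w) and at_most_3(w)) < P(even) * P(at_most_3)
print(indep, neg)  # True True
```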
<p>In the context of random variables, independence is rather similar. Two random
variables $X$ and $Y$ are independent if and only if for any $a, b, c, d \in
\mathbb{R}$,</p>
\[P(a < X < b)P(c < Y < d) = P(a < X < b, c < Y < d)\]
<p>Just like in the case of events, the intuitive way to interpret this is that $X$
and $Y$ are independent if and only if knowing $X$ tells you absolutely nothing
about $Y$.</p>
<p>One last piece of terminology is important: we say that $X$ and $Y$ are
<strong>i.i.d.</strong> if they are independent and have the exact same distribution (i.i.d.
stands for “independent and identically distributed”). We often think about
i.i.d.-ness because it allows us to apply the <a href="#normal">central limit theorem</a>,
which we’ll talk about later.</p>
<h2 id="joint-and-conditional-distributions">Joint and Conditional Distributions</h2>
<p>For a random variable $X$, recall that expressions like $X = \pi$ or $X < 2.5$
correspond to events. For this reason, we can apply the framework of conditional
probability and independence to random variables as well. For example, we might
have observed a random variable $Y$ but not $X$. Then, using Bayes’ Rule, we can
calculate the <em>conditional distribution</em> of $X$ as follows:</p>
\[P(X = n|Y = m) = \frac{P(Y = m | X = n) P(X = n)}{P(Y = m)}\]
<p>Note that if $X$ or $Y$ are continuous, we can simply replace the probability
mass functions above with PDFs. Note that if $X$ and $Y$ are independent, then
$P(Y = m|X = n) = P(Y = m)$ in all cases; and therefore
\(P(X = n | Y = m) = \frac{P(Y = m ) P(X = n)}{P(Y = m)} = P(X = n)\)
This result makes sense, because if $X$ and $Y$ are independent, then knowing
one gives you no information about the other: therefore, knowing $Y$ should not
change $X$’s distribution.</p>
<p>Even if we haven’t observed $X$ or $Y$, we often want to calculate their joint
distribution, meaning we want to know the probability (or density) that $X = n$
and $Y = m$ at the <em>same time</em>. For example, if $X$ and $Y$ are discrete, we
might want to calculate their <em>joint PMF</em>, i.e. $P(X = n, Y = m)$, or
alternatively if they are continuous, their joint density $p_{x, y}(x,y)$.</p>
<p>Remember that in general, just knowing the distributions of $X$ and $Y$ is not
enough to tell us their joint distribution: or more formally, $P(X = n, Y = m)
\ne P(X = n)P(Y=m)$, because $X$ and $Y$ might not be independent. As a simple
example, imagine $X$ is the <a href="#events">indicator random variable</a> for a coin
landing heads, and $Y$ is the indicator random variable for the coin landing
tails. In this case, $P(X = 1, Y = 1) = 0$, because if $X = 1$, then the coin
lands heads, so $Y = 0$. However, $P(X = 1) P(Y = 1) = \frac{1}{2} \cdot
\frac{1}{2} = \frac{1}{4}$.</p>
<p>We often calculate the joint PDF by simply applying the laws of conditional
probability. For example, suppose $X$ and $Y$ are continuous, we know the
distribution $p_x$ of $X$, and we know the conditional distribution $p_{y|x}$ of
$Y$ given $X$. Then, we may calculate
\(p_{x, y}(x, y) = p_x(x) p_{y|x}(y|x)\)</p>
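<p>This factorization also tells us how to <em>sample</em> from a joint distribution: first draw $x$ from $p_x$, then draw $y$ from $p_{y|x}$. A sketch with made-up choices ($X \sim \text{Expo}(1)$ and $Y \mid X \sim \mathcal{N}(X, 1)$, chosen only for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)  # draw from p_x
y = rng.normal(loc=x, scale=1.0)              # draw from p_{y|x}, one y per x
print(x.mean(), y.mean())  # E(X) = 1, and E(Y) = E(E(Y|X)) = E(X) = 1
```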
<h2 id="covariance">Covariance</h2>
<p>The <strong>covariance</strong> of two random variables $X$ and $Y$ is defined as</p>
<p>\(\text{Cov}(X, Y) = E\bigg( (X - E(X)) (Y - E(Y)) \bigg)\)
\(= E(XY) - E(X)E(Y)\)</p>
<p>Although the rightmost expression is often very useful in practice, the middle
expression can help us get a sense of what covariance actually means. $E\bigg(
(X - E(X)) (Y - E(Y)) \bigg)$ will only be large if, when $X$ is
significantly greater than its mean, $Y$ also tends to be greater than its mean, and
vice versa. This is what it means for $X$ and $Y$ to have a “positive
covariance.” On the other hand, $Y$ and $X$ will have a <em>negative</em> covariance if
$X$ being “larger than normal” implies $Y$ will likely be “smaller than normal.”
Let’s look at some very simple examples.</p>
<p>Suppose $X$ is any random variable, and $Y = 2X$. In this case, $X$ and $Y$ have
positive covariance, because if $X$ is larger than its mean, then $Y$ is
guaranteed to be larger than its mean as well, since $Y = 2X$. On the other
hand, if we set $Y = -2X$, then $X$ and $Y$ would have negative covariance,
because if $X$ was quite large, then $Y$ would be highly negative.</p>
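<p>We can check both examples with <code>np.cov</code> (an illustrative sketch; <code>np.cov</code> returns the $2 \times 2$ sample covariance matrix, and the off-diagonal entry is the covariance we want):</p>

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=50_000)  # standard normal, so Var(X) is about 1

cov_pos = np.cov(X, 2 * X)[0, 1]   # covariance of X with Y = 2X
cov_neg = np.cov(X, -2 * X)[0, 1]  # covariance of X with Y = -2X
print(cov_pos, cov_neg)  # roughly +2 and -2, since Cov(X, 2X) = 2 Var(X)
```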
<p>We often think about the <em>covariance matrix</em> of two random vectors, which is
simply a matrix holding the covariance of each of the components of the two
vectors. More specifically, for $X, Y$ which are random vectors in
$\mathbb{R}^n$, define $\text{Cov}(X, Y)$ as:
\(\begin{bmatrix} \text{Cov}(X_1, Y_1) & \text{Cov}(X_1, Y_2) & \dots &
\text{Cov}(X_1, Y_n)
\\ \text{Cov}(X_2, Y_1) & \text{Cov}(X_2, Y_2) & \dots & \text{Cov}(X_2, Y_n)
\\ \vdots & \vdots & \ddots & \vdots
\\ \text{Cov}(X_n, Y_1) & \text{Cov}(X_n, Y_2) & \dots & \text{Cov}(X_n,
Y_n)\end{bmatrix}\)</p>
<p>It’s worth noting that when $Y = X$ (the covariance matrix of a single random
vector), the matrix is symmetric because $\text{Cov}(X_i, X_j) = \text{Cov}(X_j,
X_i)$. If you’re familiar with Linear Algebra, it’s worth knowing that this
covariance matrix is always positive semi-definite as well.</p>
<h2 id="conditional-independence">Conditional Independence</h2>
<p>Imagine that we are interested in the height, education level, and age of a
random person in the United States. Let $X$ be the random variable denoting
their height, let $Y$ denote their education level in some way, and let $Z$
denote their age. To begin with, we should note that $X$ (height) and $Y$
(education level) are <em>not</em> independent: on average, shorter people tend to have
lower education levels than taller people. How do I know this? It’s because
shorter people tend to be <em>younger</em>, and younger people haven’t gone to school
for as long!</p>
<p>Here, $X$ (height) and $Y$ (education level) are not independent, because they
are both influenced by a third random variable, $Z$ (age). However, we still
might feel like there is <em>some</em> kind of independence relation between height and
education level, because other than the fact that age affects both, there
probably isn’t a strong relationship between how tall someone is and whether
they decide to go to school. Formally, we might say that height and education
level are <strong>conditionally</strong> independent given age, or that $X$ and $Y$ are
independent given $Z$.</p>
<p>Mathematically, conditional independence looks very similar to regular
independence, just with an extra piece of conditioning. Specifically, we define
$X$ and $Y$ to be conditionally independent given $Z$ if and only if
\(P(X,Y|Z) = P(X|Z) P(Y|Z)\)
for all values of $X$, $Y$, $Z$. Conditional independence is extremely important
in modeling, because if we have some random variables $Y_1, \dots, Y_n$ which
are all conditionally independent given $X$, then we may factor the joint
distribution as follows:</p>
\[p(x, y_1, \dots, y_n) = p(x) p(y_1, \dots, y_n | x) = p(x) \prod_{i=1}^n
p(y_i|x)\]
<p>which turns out to be an extremely useful identity in pretty much every modeling
setting.</p>
<p>We just saw an example (height, age, education) showing that conditional independence
does not imply unconditional independence - in other words, $X$ and $Y$ might be
conditionally independent given $Z$, but that does not imply they will be
unconditionally independent. Interestingly, it turns out that unconditional
independence (regular independence) does not imply conditional independence either.
As an example, imagine that you flip two coins independently, so that $X$, the
result of the first flip, is independent from $Y$, the result of the second
flip. However, imagine we have some $Z$, which tells us the total number of
heads in both flips! Then, if we know $X$ and $Z$, we can do some subtraction to
find $Y$ as well; so $X$ and $Y$ are unconditionally independent, but they are
<em>not</em> conditionally independent given $Z$.</p>
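<p>The two-coin example is small enough to check by enumeration (an illustrative sketch):</p>

```python
from fractions import Fraction
from itertools import product

outcomes = list(product([0, 1], repeat=2))  # (X, Y): two fair coin flips
P = Fraction(1, 4)  # each of the four outcomes is equally likely

# Unconditionally, X and Y are independent:
P_X1 = sum(P for x, y in outcomes if x == 1)
P_Y1 = sum(P for x, y in outcomes if y == 1)
P_X1_Y1 = sum(P for x, y in outcomes if x == 1 and y == 1)
print(P_X1_Y1 == P_X1 * P_Y1)  # True: 1/4 == 1/2 * 1/2

# But conditioned on Z = X + Y = 1, they become perfectly dependent:
given_Z1 = [(x, y) for x, y in outcomes if x + y == 1]
P_X1_given = Fraction(sum(1 for x, y in given_Z1 if x == 1), len(given_Z1))
P_Y1_given = Fraction(sum(1 for x, y in given_Z1 if y == 1), len(given_Z1))
P_both_given = Fraction(sum(1 for x, y in given_Z1 if x == 1 and y == 1),
                        len(given_Z1))
print(P_both_given == P_X1_given * P_Y1_given)  # False: 0 != 1/4
```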
<h1 id="-fundamental-distributions"><a name="distributions"></a> Fundamental Distributions</h1>
<p>There are “dozens” of fundamental distributions in statistics, but here is a
brief review of five simple but extremely useful distributions. Note that for
some distributions I use the $X \sim \text{Distribution}$ notation to illustrate that a
random variable $X$ is distributed according to that distribution.</p>
<h2 id="bernoulli-and-binomial">Bernoulli and Binomial</h2>
<p>If a random variable $X \sim Bern(p)$, then $X$ has a distribution characterized
by the following PMF:</p>
\[P(X = 1) = p, \, \, P(X = 0) = q = 1 - p\]
<p>Intuitively, this means that $X$ is basically a weighted coin toss: it has a $p$
chance of crystallizing to the value $1$, and otherwise will crystallize to the
value $0$. This is called the <strong>Bernoulli</strong> distribution, which we often use to characterize
binary events (i.e. whether a coin will land heads/tails).</p>
<p>The <strong>binomial</strong> distribution is intimately related to the Bernoulli. From a
mathematical point of view, we say that if $ X \sim Bin(n, p)$, then its
distribution is characterized by the following PMF:</p>
\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\]
<p>This looks a bit complicated, but is actually rather easy to interpret.
Intuitively, if $X \sim Bin(n, p)$, then $X$ can be thought of as the number of
heads you’ll observe if you flip a weighted coin $n$ times, where each flip has
a $p$ probability of landing heads. More formally, a binomially distributed
random variable is identically distributed to the sum of a bunch of independent
Bernoulli random variables:</p>
<p>\(X \sim Bin(n, p) \implies X \sim X_1 + X_2 + \dots + X_n\)
where $X_i \sim \text{Bern}(p) $.</p>
<p>How does this relate to the PMF from above? Well, let’s consider the Binomial as
a sum of Bernoullis. We want to find the probability that there will be $k$
coins landing heads. For any sequence involving $k$ heads, like below:
\(H, \, T, \, T, \, H, \, T, \, H, \dots \, H \, T \text{ (assume there are $k$
heads in this sequence) }\)</p>
<p>the probability of that exact sequence occurring is $p^k (1-p)^{n - k}$, because
$k$ heads and $n-k$ tails must land in precisely that order. However, for each
$k$, there are $\binom{n}{k}$ possible orderings of tails/heads with exactly $k$
heads, so the probability that any one of those sequences will come
to pass is
\(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\)
as before.</p>
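<p>A quick simulation confirms the equivalence: summing $n$ independent Bernoulli draws reproduces the $Bin(n, p)$ PMF (an empirical sketch):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, reps = 10, 0.3, 200_000
# Each row is n independent Bern(p) draws; sum each row
sums = rng.binomial(1, p, size=(reps, n)).sum(axis=1)

empirical = np.bincount(sums, minlength=n + 1) / reps
exact = stats.binom.pmf(np.arange(n + 1), n, p)
print(np.max(np.abs(empirical - exact)))  # small sampling error
```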
<p>From a modeling perspective, these two distributions are pretty clearly
important. In particular, the Bernoulli is often used to model binary events
(i.e. a coin flip), and the Binomial is often used to count the number of
“successful” trials in a fixed number of events.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax3</span><span class="p">,</span> <span class="n">ax2</span><span class="p">,</span> <span class="n">ax1</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span><span class="n">ncols</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Bin(10, 0.7) PMF'</span><span class="p">)</span>
<span class="n">ax3</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'red'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Bin(20, 0.5) PMF'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">binom</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'orange'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Bin(7, 0.8) PMF'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="\assets\images\ipython\2018-09-20-probability-theory-in-15-minutes_files\2018-09-20-probability-theory-in-15-minutes_14_0.png" alt="png" /></p>
<h2 id="-poisson-and-exponential"><a name="poisson"></a> Poisson and Exponential</h2>
<p>Just for a second, imagine you are an avid dragon-watcher. One day, you decide to take a stroll through a magical forest, and while doing so, you make a log of all the dragons you see. The next day, you take a look at the log, and you make two astute observations:</p>
<ol>
<li>You are equally likely to spot a dragon at any time. Seeing lots of dragons one hour does not make you less likely to see dragons the next hour; and dragons are equally prevalent in this magical forest at all hours of the day.</li>
<li>The number of dragons you spot in a given time interval is, on average, proportional to the length of that interval. For example, if you spot (on average) one dragon per hour, then you expect to spot (on average) $k$ dragons in every $k$-hour period.</li>
</ol>
<p>This is a (magical) example of a <strong>Poisson process</strong>. A Poisson process is a time-homogeneous process in which we are interested in a specific kind of random occurrence (here, the event of spotting a dragon). Informally, the assumptions of the Poisson process are that (i) random occurrences are equally likely to occur at any time, and (ii) for any two disjoint time intervals, the numbers of random occurrences in them are independent.</p>
<p>It turns out that these two assumptions are enough to characterize two very special distributions, called the <em>Poisson</em> and the <em>Exponential</em>.</p>
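<p>Before defining either distribution, we can see both emerge from the process itself. The sketch below (an illustration, not part of the original post) simulates dragon sightings by drawing exponential waiting times between sightings and then counting sightings per hour; the hourly counts behave like the Poisson distribution introduced next, with mean equal to variance:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0          # average dragon sightings per hour
n_hours = 10_000

# Draw exponential inter-arrival times and accumulate them into arrival times.
waits = rng.exponential(scale=1 / lam, size=int(lam * n_hours * 2))
arrivals = np.cumsum(waits)
arrivals = arrivals[arrivals < n_hours]

# Count how many sightings fall in each 1-hour window.
counts = np.bincount(arrivals.astype(int), minlength=n_hours)

print(counts.mean())  # ≈ lam
print(counts.var())   # ≈ lam as well: for a Poisson, mean equals variance
```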
<p>The <em>Poisson Distribution</em>, which takes a single parameter, $\lambda$, can be thought of as describing the number of random occurrences in a time interval of a Poisson process. For example, let $Z \sim \text{Pois}(\lambda)$, where $\lambda$ is the “average” number of dragons you expect to see in one hour. We can think of $Z$ as representing the <em>distribution</em> of the number of dragons you see in any given hour. The PMF of the Poisson is as follows (it can be derived from the assumptions of the Poisson process, although that’s another topic for another time):</p>
\[Z \sim \text{Pois}(\lambda) \implies P(Z = k) = \frac{e^{- \lambda} \lambda^k}{k!}\]
<p>The Poisson Distribution is very useful when we are trying to model the <em>number of events</em> in a random process. Another useful fact about the Poisson is that $E(Z) = \text{Var}(Z) = \lambda$.</p>
<p>On the other hand, the <em>Exponential Distribution</em> takes the same parameter $\lambda$ but instead describes the amount of time you must wait before spotting a single random occurrence (i.e. spotting a single dragon). This is a continuous distribution, so we define it in terms of its PDF:</p>
\[W \sim \text{Expo}(\lambda) \implies p_W(x) = \lambda e^{- \lambda x} \quad \text{for } x \ge 0\]
<p>For the exponential distribution, we know $E(W) = \frac{1}{\lambda}$ and $\text{Var}(W) = \frac{1}{\lambda^2}$. The exponential distribution is extremely cool and is connected to all sorts of things in Statistics (for example, it is the only continuous distribution which has the <a href="https://en.wikipedia.org/wiki/Memorylessness">memoryless property</a>), but for now it’s enough to know that it’s very useful in modeling time lengths, i.e. the amount of time it might take for a component in a machine to break down, or the amount of time it might take for a fundamental particle to decay.</p>
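<p>The memoryless property is easy to check empirically: if you have already waited $s$ hours without seeing a dragon, the chance of waiting at least $t$ more hours is the same as the unconditional chance of waiting at least $t$ hours. A minimal sketch (not from the original post):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 1.0
w = rng.exponential(scale=1 / lam, size=1_000_000)

s, t = 1.0, 0.5
p_cond = np.mean(w[w > s] > s + t)   # P(W > s + t | W > s)
p_marg = np.mean(w > t)              # P(W > t)
print(p_cond, p_marg)  # both ≈ exp(-lam * t) ≈ 0.607
```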
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fig</span><span class="p">,</span> <span class="p">(</span><span class="n">ax2</span><span class="p">,</span> <span class="n">ax1</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">poisson</span><span class="p">.</span><span class="n">pmf</span><span class="p">(</span><span class="n">x_values</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'blue'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Pois(1) PMF'</span><span class="p">)</span>
<span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">new_x_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">float32</span><span class="p">)</span><span class="o">/</span><span class="mi">10</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">new_x_values</span><span class="p">,</span> <span class="n">stats</span><span class="p">.</span><span class="n">expon</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">new_x_values</span><span class="p">),</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'orange'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'Expo(1) PDF'</span><span class="p">)</span>
<span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="\assets\images\ipython\2018-09-20-probability-theory-in-15-minutes_files\2018-09-20-probability-theory-in-15-minutes_16_0.png" alt="png" /></p>
<h2 id="-gaussiannormal"><a name="normal"></a> Gaussian/Normal</h2>
<p>The <strong>Gaussian</strong> or <strong>Normal</strong> distribution is probably the most famous
distribution in all of statistics, for good reason. It has a number of unique
properties, but we’ll start by looking at its PDF. Let $X \sim \mathcal{N}(\mu,
\sigma^2)$; then $X$ has the following PDF.</p>
\[g(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}\]
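<p>As a quick sanity check (a sketch, not from the original post), we can write this PDF out by hand and compare it to <code>scipy.stats.norm.pdf</code>. Note that scipy parameterizes the Normal by its standard deviation $\sigma$, not its variance $\sigma^2$:</p>

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma2):
    # The PDF g(x) above, written out directly.
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-3, 3, 7)
mu, sigma2 = 0.5, 2.0
print(np.allclose(normal_pdf(x, mu, sigma2),
                  stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))))  # True
```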
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">110</span><span class="p">)</span>
<span class="n">normal_data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">5000</span><span class="p">)</span>
<span class="n">sns</span><span class="p">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">normal_data</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'cornflowerblue'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Normal(0,1) Distribution'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="\assets\images\ipython\2018-09-20-probability-theory-in-15-minutes_files\2018-09-20-probability-theory-in-15-minutes_18_0.png" alt="png" /></p>
<p>The parameters of a Normal distribution have a simple interpretation: for $X
\sim \mathcal{N}(\mu, \sigma^2)$, $X$’s mean is $\mu$, and its variance is
$\sigma^2$. Moreover, the Normal has the nice property that the <em>mode</em> and
<em>median</em> of the distribution are equal to the mean.</p>
<p>The Normal distribution is particularly famous because of something called the
<strong>Central Limit Theorem</strong>, which states that for a sequence of i.i.d. random
variables $Y_1, \dots, Y_n$, the (suitably standardized) average of the $Y_i$ is
asymptotically normally distributed. More formally, if we let
\(\bar Y = \frac{1}{n} \sum_{i=1}^n Y_i\)
then
\(\frac{\sqrt{n} (\bar Y - E(Y_1))}{\text{SD}(Y_1)} \to^d
\mathcal{N}(0, 1) \quad \text{as } n \to \infty\)</p>
<p>where the $\to^d$ symbol means “converges to in distribution.” The above limit
can look confusing, but it really just means that for sufficiently large $n$:</p>
<ol>
<li>The average value of all the $Y_i$ is distributed approximately normally with its mean at
$E(Y_i)$ (i.e. the mean of any one of the samples);</li>
<li>The variance of the average value of the random samples should be around
$\frac{\text{Var}(Y_i)}{n}$ (equivalently, its standard deviation shrinks like
$\frac{\text{SD}(Y_i)}{\sqrt{n}}$). This means that the variance decreases with
the size of $n$, which makes sense intuitively: as we get more and more random
samples, any outliers in the samples should start to cancel each other out,
decreasing the variance of the average of the samples.</li>
</ol>
<p>Remember, the Central Limit Theorem is not describing the behavior of the
<em>samples,</em> but rather the behavior of their <em>average</em> as $n$ increases.</p>
<p>The Normal distribution has a whole host of other exceedingly interesting
properties, but many of them deserve their own post, and this post is long
enough as it is. That’s all for now!</p>