# Probabilistic Modeling

“All models are wrong, but some are useful.” — George E. P. Box

# Generative Models

A **model** is a set of assumptions (and, usually, equations) which frame the
way the world works. For example, you might model a series of dice rolls in a
board game by assuming they are independent from each other - in other words,
the outcome of the first roll does not affect the outcome of the second roll.

However, we’ll usually be interested in **generative models.** A generative
model with respect to some observed data is a model which is capable of *fully
simulating* the observed data. For example, if you observed a bunch of dice
rolls in a row, as follows:

just knowing that each roll is independent of the others is *not* enough to
simulate the data. To do that, you’d also need to specify the *probability
distribution* of the value of each roll - in other words, the probability that
any roll will land as a $1$ or a $2$ or a $3$, etc.
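As a minimal sketch of what “fully simulating” means here, the snippet below combines the two ingredients: independence between rolls and a distribution for each roll. The fair-die probabilities and the name `simulate_rolls` are assumptions made for illustration, not something observed from data.

```python
import random

# Assumption: a fair six-sided die. Any other six probabilities that sum
# to 1 would define a different generative model.
FACES = [1, 2, 3, 4, 5, 6]
PROBS = [1 / 6] * 6

def simulate_rolls(n, rng):
    """Draw n independent rolls, each from the same distribution PROBS."""
    return rng.choices(FACES, weights=PROBS, k=n)

rng = random.Random(0)  # seeded so the run is repeatable
rolls = simulate_rolls(10, rng)
print(rolls)
```

Without `PROBS`, there would be nothing to sample from, which is why independence alone does not give a generative model.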

We like generative models for at least two reasons! First, they let us do fun
(and useful) math - for example, we can estimate the probability that the
observed data occurs under a model. As we’ll see, calculating the likelihood of
observed data is extremely important in techniques like Maximum Likelihood
Estimation and more. Second, generative models can also *do* cool things: for
example, a generative model for natural language can write a chapter of Harry
Potter or even create realistic pictures of fake
celebrities, as pictured below:

These fake celebrities came from a paper at arXiv:1710.10196.
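To make the first reason concrete, here is a small sketch of computing the (log-)probability of some observed dice rolls under two candidate generative models. The rolls and the “loaded” die probabilities are made up for illustration.

```python
import math

def log_likelihood(rolls, probs):
    """Log-probability of independent rolls under a die with face probabilities probs."""
    return sum(math.log(probs[r]) for r in rolls)

observed = [6, 6, 3, 6, 1, 6, 6, 2]  # made-up rolls, for illustration only

fair = {face: 1 / 6 for face in range(1, 7)}
loaded = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}  # hypothetical loaded die

ll_fair = log_likelihood(observed, fair)
ll_loaded = log_likelihood(observed, loaded)
print(ll_fair, ll_loaded)
```

Because these rolls contain so many sixes, the loaded model assigns them a higher log-likelihood than the fair one. We work with logs because multiplying many small probabilities quickly underflows.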

# The Data Generating Process

The **Data Generating Process** (DGP) is the “true” generative model. To be more
specific, in most problems we assume there is some underlying joint probability
distribution which governs the data we
observe, and the DGP is that underlying distribution. We don’t observe the DGP itself,
but we can use external data to get a rough sense of what’s going on.

In the example in the next section, which will tie together all of this
material, we have a “God’s eye view” and can see the DGP in all its glory, but
only because I literally made up the data. In reality, you will never know the
DGP - the point of statistics is broadly to create models which approximate it. We call
a model *correctly specified* if it has the same underlying structure as the
DGP.

# Estimators, Parametric, and Nonparametric Models

Like people, models tend to come in families. For example, the normal
distribution is not a single distribution - it’s a family of extremely similar
distributions, each of which depends on two values: a *mean* and *variance.* We
generally call values which help index families of models *parameters*.

Unfortunately, we don’t usually know the values of parameters we’re interested
in. As a result, we have to create *estimates* for parameter values. We do this
using *estimators*, which are just functions of random data which we can use to
guess the parameter of interest. Usually, we denote estimators by putting a
little hat on top of some symbol, like $\hat \theta$ or $\hat \sigma$.
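In code, an estimator is just a function that takes data and returns a guess. The sketch below uses the sample mean as an estimator of the mean of a normal distribution; the true values (`TRUE_MEAN`, `TRUE_SD`) are assumptions chosen for the simulation, and we only know them because we generate the data ourselves.

```python
import random
import statistics

def mean_estimator(data):
    """An estimator: a rule that maps any dataset to a guess for the mean."""
    return statistics.fmean(data)

rng = random.Random(1)
TRUE_MEAN, TRUE_SD = 5.0, 2.0  # known only because we simulate the data

# The same estimator applied to three fresh samples gives three different guesses.
estimates = []
for _ in range(3):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(200)]
    estimates.append(mean_estimator(sample))
print(estimates)
```

The function is fixed, but its output changes from sample to sample because the data are random.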

To understand all of the modeling terminology we’ve been discussing, consider the example below.

An **estimate** is nonrandom, whereas an **estimator** is a function of random data. Both serve as a guess for a parameter of interest. This is all made more confusing by the fact that some people refer to values they want to learn as **estimands**.

Let’s try fitting two models to this data.

In the first model, we’ll make a big (but correct!) assumption: that the DGP is primarily a Poisson Distribution. However, we need to find the rate parameter for the Poisson distribution. We’ll talk more about different types of estimators in the coming posts, but one commonly used estimator is simply the mean of the data. This makes intuitive sense, because the mean of a Poisson is its rate parameter. Then, if we denote our parameter as $\lambda$, our $n$ data points as $X_1, \dots, X_n$, and our estimator as $\hat \lambda$, we define

\[\hat \lambda = \frac{1}{n} \sum_{i = 1}^n X_i\]

In our second model, we’ll make fewer assumptions about the underlying
distribution of the data. (This model is also a bit more complicated.) We will
model the density function underlying the data by simply calculating the percent
of data which falls into a small bin around a given point. In this setting, our
estimand is the shape of the distribution itself: we’re modeling the entire
distribution in one go. (This is a simple example of **kernel density estimation
(KDE)**, which we’ll talk more about later.)

To get a little bit more formal, suppose we have observed data $X_1, \dots, X_n$ and we have selected a binsize of $h$ (usually, the more data you have observed, the smaller you make the binsize). Then, for any real number $y$, our estimator, called $\hat f$, returns the following expression:

\[\hat f(y) = \frac{1}{h} \sum_{i=1}^n \frac{I_{|X_i - y| < h/2}}{n}\]

This notation may be a bit confusing at first, but remember that $I_{|X_i - y| < h/2}$ is just an indicator random variable which equals $1$ if the $i$th data observation $X_i$ falls within $\frac{h}{2}$ of the input $y$, and $0$ otherwise. Let’s see how this model performs below, especially compared against the Poisson model:

Both models perform reasonably well, probably because in these kinds of examples, we have the luxury of actually knowing the DGP and can model accordingly.
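As a rough sketch, both estimators translate almost directly into code. The data below are made-up counts for illustration (not the dataset used in this post), and the function names are hypothetical.

```python
import math

def poisson_rate_hat(data):
    """First model: estimate the Poisson rate with the sample mean."""
    return sum(data) / len(data)

def poisson_pmf(k, lam):
    """P(X = k) under a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def f_hat(y, data, h):
    """Second model: share of data within h/2 of y, divided by the binsize h."""
    in_bin = sum(1 for x in data if abs(x - y) < h / 2)
    return in_bin / (len(data) * h)

data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]  # made-up counts, for illustration only
lam_hat = poisson_rate_hat(data)

# Compare the two fitted models at each integer value.
for k in range(7):
    print(k, round(poisson_pmf(k, lam_hat), 3), round(f_hat(k, data, h=1.0), 3))
```

With a binsize of $h = 1$ and integer-valued data, `f_hat(k, data, 1.0)` is just the fraction of observations equal to `k`. The Poisson fit is summarized entirely by the single number `lam_hat`.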

However, there is one key difference to note between the two models. In the first model, no matter how much data we observe, we only have one parameter: the rate parameter for the Poisson. On the other hand, in the second model, our binsize gets smaller and smaller the more data we observe. As a result, the number of output values we have to estimate for the function actually scales with the size of the data; so if we observed an infinite amount of data, we’d have to estimate an infinite number of values, i.e. each value $\hat f(y)$ for any $y \in \mathbb{R}$.

This difference between the models is so important it has a name. Because the
first model only has a *finite* number of parameters, it is called a
**parametric model.** On the other hand, as we observe more data, the second
model has an unbounded number of parameters. As a result, it is called a
**nonparametric model.**

# Supervised and Unsupervised Learning

One last important distinction worth reviewing is the difference between
**supervised** and **unsupervised** learning algorithms. (The distinction’s a
bit artificial, but the terminology is so common it’s worth reviewing).

Unsupervised algorithms are designed to automatically detect patterns in data that has already been observed. For example, the following algorithm (called a Gaussian Mixture Model, or GMM) can take the following data as an input:

and the GMM will automatically cluster it into something like the following:

The GMM did not require any labeled training data; it simply detected the underlying clusters in the data.
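To give a feel for how this works, here is a minimal sketch of a two-component, one-dimensional GMM fit with the EM algorithm in plain Python. The simulated clusters, the initialization rule, and the function names are all assumptions made for illustration; in practice you would probably reach for a library implementation.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    center = sum(data) / n
    overall_var = sum((x - center) ** 2 for x in data) / n
    # Crude initialization (an assumption): start the means at the extremes.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances

def assign(x, weights, means, variances):
    """Index of the component most likely to have generated x."""
    scores = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))

rng = random.Random(0)
# Unlabeled data from two made-up clusters centered at 0 and 6.
data = [rng.gauss(0, 1) for _ in range(150)] + [rng.gauss(6, 1) for _ in range(150)]
weights, means, variances = fit_gmm_1d(data)
print(sorted(round(m, 2) for m in means))
```

Note that the algorithm never sees which cluster each point came from; it only sees the pooled values in `data`.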

On the other hand, supervised learning algorithms are designed to solve a different kind of problem. Imagine you have observed a variety of points in space, each associated with a specific color. We will represent the location of each point as $X_i$, and the color as $Y_i$. The goal in a supervised learning problem is to learn to predict $Y$ given $X$: in other words, if you observe the locations of a bunch of new points in space, predict the colors of those new points.

For example, a simple feedforward neural network might receive the following points as training data:

Then, if you fed the network a series of new points like this:

it would (hopefully) be able to predict their color.
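The prediction task itself doesn’t need a neural network to illustrate. As a minimal sketch, here is a nearest-centroid classifier, a much simpler stand-in for the network described above. The points, colors, and function names are made up for illustration.

```python
import math

def fit_centroids(points, colors):
    """Average the training locations of each color."""
    sums = {}
    for (x, y), c in zip(points, colors):
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    """Predict the color whose centroid is closest to the new point."""
    return min(centroids, key=lambda c: math.dist(centroids[c], point))

# Made-up training data: locations X_i with colors Y_i.
X_train = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
Y_train = ["red", "red", "red", "blue", "blue", "blue"]

centroids = fit_centroids(X_train, Y_train)
print(predict(centroids, (0.5, 0.5)), predict(centroids, (5.5, 4.0)))
```

The key difference from the unsupervised case is that `fit_centroids` is handed the colors `Y_train` along with the locations.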

Supervised learning algorithms have a habit of **overfitting** their training
data, meaning that the algorithm basically memorizes the output for every
training input, but is unable to generalize to new datasets. To detect and
prevent overfitting, we often train supervised learning algorithms on a
subset of the available data (maybe ~80%) and then test them on the remaining
data, which they have never seen before.
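Here is a small sketch of that train/test procedure. It uses a 1-nearest-neighbor classifier, which memorizes its training data by construction, so the gap between training and held-out accuracy is easy to see. The simulated data, the 20% label noise, and the function names are assumptions made for illustration.

```python
import math
import random

def predict_1nn(train, point):
    """Return the label of the closest training point (pure memorization)."""
    nearest = min(train, key=lambda item: math.dist(item[0], point))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in data whose label the 1-NN rule gets right."""
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

rng = random.Random(0)
# Made-up noisy data: the label depends on the first coordinate, but 20% of
# labels are flipped, so no rule can be perfect on new points.
data = []
for _ in range(200):
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    label = "red" if x[0] > 0 else "blue"
    if rng.random() < 0.2:
        label = "blue" if label == "red" else "red"
    data.append((x, label))

rng.shuffle(data)
cut = len(data) * 8 // 10            # 80% for training, 20% held out
train, test = data[:cut], data[cut:]

# On the training set, each point is its own nearest neighbor.
print("train accuracy:", accuracy(train, train))
print("test accuracy:", accuracy(train, test))
```

The training accuracy is perfect only because the model has already seen those points; the held-out accuracy is the more honest measure of how well it generalizes.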