1 Data = Signal + Noise

Introducing data-generating distributions for machine learning

Open the live notebook in Google Colab.

Introduction: Data, Signal, and Noise

In these notes, we’ll expand on the following idea:

Machine learning is the science and practice of building algorithms that distinguish between signal and noise in real-world data.

Consider the following simple data set:

Code

import torch 
from matplotlib import pyplot as plt

scatterplot_kwargs = dict(color = "black", label = "data", facecolors = "none", s = 40, alpha = 0.6)

n_points = 20
x = torch.linspace(0, 10, n_points)
signal = 2.0 * x + 1.0  # underlying pattern (signal)
noise = torch.randn(n_points) * 3.0  # random noise

y = signal + noise

# Plot residual segments
fig, ax = plt.subplots(figsize = (6, 4))
ax.scatter(x.numpy(), y.numpy(), **scatterplot_kwargs)
ax.plot(x.numpy(), signal.numpy(), color = "black", linestyle = "--", label = r"Signal: $f(x_i) = 2x_i + 1$")
for i in range(n_points):
    if i == 0: 
        ax.plot([x[i].item(), x[i].item()], [signal[i].item(), y[i].item()], color = "red", alpha = 0.3, linewidth = 0.8, label = r"Noise: $\epsilon_i$")
    else: 
        ax.plot([x[i].item(), x[i].item()], [signal[i].item(), y[i].item()], color = "red", alpha = 0.3, linewidth = 0.8)
ax.set(xlabel = r"$x$", ylabel = r"$y$")
plt.legend()

Figure 1.1: An illustrative decomposition of a data set (points) into a hypothesized underlying signal (dashed line) and noise (red segments).

We can think of this data as consisting of two components: a signal expressed by a relationship $y \approx f(x)$ (in this case $f(x) = 2x + 1$), and some noise that partially obscures this relationship. Schematically, we can write this relationship as:

\[ \begin{aligned} y_i = f(x_i) + \epsilon_i\;, \end{aligned} \tag{1.1}\]

which says that the $i$th value of the target is equal to some function $f(x_i)$ of the input variable $x_i$, plus some random noise term $\epsilon_i$.

Overfitting

It’s important to emphasize here that the thing we want to learn is not the individual targets $y_i$, but rather the underlying function $f(x)$. To see why, let’s consider an example of what goes wrong if we try to learn the targets $y_i$ exactly. This is called interpolation, and is illustrated by Figure 1.2:

Code

from scipy import interpolate
# Create an interpolating function
f_interp = interpolate.interp1d(x.numpy(), y.numpy(), kind='cubic')
x_dense = torch.linspace(0, 10, 100)
y_interp = f_interp(x_dense.numpy())

fig, ax = plt.subplots(figsize = (6, 4))
ax.scatter(x.numpy(), y.numpy(), **scatterplot_kwargs)
ax.plot(x_dense.numpy(), y_interp, color = "red", linestyle = "--", label = "interpolating fit", zorder = -10)
ax.set(xlabel = r"$x$", ylabel = r"$y$")
plt.legend()
plt.show()

Figure 1.2: A function which exactly interpolates the data, perfectly fitting both signal and noise.

The problem with interpolation is that while we have perfectly fit our training data, we have not learned the underlying signal $f(x)$. If we were to generate new data with the same signal but with different noise, we would likely find that our interpolating function doesn’t actually make very good predictions about that data at all: many of its bends and wiggles don’t have any correspondence to features in the new data.

Code

# Generate new data
y_new = 2.0 * x_dense + 1.0 + torch.randn(100) * 3.0
fig, ax = plt.subplots(figsize = (6, 4))
ax.scatter(x_dense.numpy(), y_new.numpy(), **scatterplot_kwargs)
ax.plot(x_dense.numpy(), y_interp, color = "red", linestyle = "--", label = "interpolating fit", zorder = -10)
ax.set(xlabel = r"$x$", ylabel = r"$y$")
plt.legend()
plt.show()

Figure 1.3: New data plotted alongside the interpolating fit from the previous figure.

We observe a number of irrelevant fluctuations in the interpolating fit that do not correspond to the underlying pattern. This phenomenon is called overfitting: by trying to fit the noise in our training data, we have failed to learn the true signal. We’ll learn how to quantify overfitting once we begin to study measures of model quality.

Modeling the Noise

If we want to learn the signal $f$ in Equation 1.1, it’s helpful to learn how to talk mathematically about the noise term $\epsilon_i$. To do this, we need to step into the language of probability theory. Our standing assumption will be that the noise terms $\epsilon_i$ are random variables drawn independently and identically-distributed from some probability distribution. This means that each time we observe a new data point, we get a different value of $\epsilon_i$ drawn from the same distribution, without any influence from other values of $\epsilon_j$ for $j \neq i$.

The noise distribution we will usually consider is the Gaussian distribution, also called the Gaussian distribution.

Figure 1.4: The probability that a Gaussian random variable lies between two values $a$ and $b$ is given by the area under its PDF between those values.

Definition 1.1 (Gaussian (Normal) Distribution) A random variable $\epsilon$ has a Gaussian distribution with parameters mean $\mu$ and standard deviation $\sigma$ if the probability that $\epsilon$ has a value between $a$ and $b$ is given by:

\[ \begin{aligned} \mathbb{P}(a \leq \epsilon \leq b) = \int_a^b p_\epsilon(x;\mu, \sigma) \, dx\;, \end{aligned} \]

where

\[ \begin{aligned} p_\epsilon(x;\mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\;. \end{aligned} \]

The function $p_\epsilon(x;\mu, \sigma)$ is called the probability density function (PDF) of the Gaussian distribution. We use the shorthand notation $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$ to indicate that the random variable $\epsilon$ has a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.

Gaussian distributions are often called normal distributions in the statistics literature.

Here’s a simple vectorized implementation of the Gaussian PDF, which can be evaluated on a PyTorch tensor of inputs:

pi = 3.141592653589793
def normal_pdf(x, mu, sigma):
    return 1 / (sigma * (2 * pi)**0.5) * torch.exp(-0.5 * ((x - mu) / sigma) ** 2)

normal_pdf(torch.tensor([0.0, 1.0, 2.0]), mu=0.0, sigma=1.0)

tensor([0.3989, 0.2420, 0.0540])

Figure 1.5: Normal distribution PDFs for different values.

The two parameters of the Gaussian distribution describe its “shape:” The mean $\mu$ indicates the “center” of the distribution, while the standard deviation $\sigma$ indicates how “spread out” the distribution is.

“Gaussian noise” refers to random variables drawn from a Gaussian distribution. We can generate Gaussian noise from a given Gaussian distribution using many functions. We’ll focus on PyTorch’s implementation, which allows us to generate tensors of Gaussian noise easily:

mu = 0.0
sigma = 1.0
epsilon = torch.normal(mu, sigma, size=(10,))
epsilon

tensor([ 0.1864,  1.0350,  0.8385,  0.1005, -0.8591,  0.1861, -1.9333, -1.0409,
        -0.1336,  1.1946])

If we take many samples of Gaussian noise and plot a histogram of their values, then we’ll find (via the law of large numbers) that the histogram approximates the PDF of the Gaussian distribution we sampled from, as illustrated in Figure 1.6:

Figure 1.6: Histogram of samples from a Gaussian distribution compared to its PDF.

Properties of the Gaussian Distribution

The Gaussian distribution has a number of useful properties for modeling noise in data. Here are a few:

Theorem 1.1 (Translation) If $\epsilon$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the random variable $\epsilon + c$ (where $c$ is a constant) is normally distributed with mean $\mu + c$ and standard deviation $\sigma$. In other words, if $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$, then $\epsilon + c \sim \mathcal{N}(\mu + c, \sigma^2)$.

Theorem 1.2 (Scaling) If $\epsilon$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the random variable $a \epsilon$ (where $a$ is a constant) is normally distributed with mean $a \mu$ and standard deviation $|a| \sigma$. In other words, if $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$, then $a \epsilon \sim \mathcal{N}(a \mu, (a \sigma)^2)$.

The expectation or mean of a continuous-valued random variable $\epsilon$ with probability density function $p_\epsilon(x)$ is defined as:

\[ \begin{aligned} \mathbb{E}[\epsilon] = \int_{-\infty}^{\infty} x \, p_\epsilon(x) \, dx\;. \end{aligned} \]

The variance of a continuous-valued random variable $\epsilon$ with probability density function $p_\epsilon(x)$ is defined as:

\[ \begin{aligned} \text{Var}(\epsilon) = \mathbb{E}[(\epsilon - \mathbb{E}[\epsilon])^2] = \int_{-\infty}^{\infty} (x - \mathbb{E}[\epsilon])^2 \, p_\epsilon(x) \, dx\;. \end{aligned} \]

We can interpret the variance as a measure of how far away $\epsilon$ “usually, on average” lies from its mean value.

Theorem 1.3 (Mean and Variance of the Gaussian) For a normally distributed random variable with mean $\mu$ and standard deviation $\sigma$:

\[ \begin{aligned} \mathbb{E}[\epsilon] &= \mu \\ \mathrm{Var}[\epsilon] &= \sigma^2. \end{aligned} \]

These two properties can be proven using some integration tricks which are beyond our scope here.

The Standard Gaussian

The standard Gaussian distribution is the Gaussian with mean zero and standard deviation one: $\mathcal{N}(0, 1)$. We often use the symbol $Z$ to Due to the properties above, we can make any Gaussian random variable from a standard Gaussian: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then, using Theorem 1.2 and Theorem 1.1, we can write $X$ as

\[ \begin{aligned} X \sim \sigma Z + \mu\;. \end{aligned} \]

We can check that this random variable has the correct mean and variance:

\[ \begin{aligned} \mathbb{E}[X] = \mathbb{E}[\sigma Z + \mu] = \sigma \mathbb{E}[Z] + \mu = \sigma \cdot 0 + \mu = \mu\;, \end{aligned} \]

where we’ve used linearity of expectation and the fact that $\mathbb{E}[Z] = 0$. To calculate the variance, we use the fact that $\mathbb{E}[Z]= 0$, again, so that

\[ \begin{aligned} \mathrm{Var}[X] = \mathrm{Var}[\sigma Z + \mu] = \mathrm{Var}[\sigma Z] = \sigma^2 \mathrm{Var}[Z] = \sigma^2 \cdot 1 = \sigma^2\;. \end{aligned} \]

We’ve used the variance properties

\[ \begin{aligned} \mathrm{Var}[aX] = a^2\mathrm{Var}[X]& \\ \mathrm{Var}[X + b] = \mathrm{Var}[X]&\; \end{aligned} \]

for constants $a$ and $b$.

Gaussian Noise

With the Gaussian distribution in mind, let’s now return to our signal-plus-noise paradigm:

\[ \begin{aligned} y_i = f(x_i) + \epsilon_i\;, \end{aligned} \]

If we assume that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (i.e., that the noise terms are independent and identically distributed Gaussian random variables with mean zero and standard deviation $\sigma$), then we can use the translation property of the Gaussian distribution (Theorem 1.1) to deduce that

\[ \begin{aligned} y_i &\sim \mathcal{N}(f(x_i), \sigma^2)\;. \end{aligned} \tag{1.2}\]

So, we are modeling each data point $y_i$ as a Gaussian random variable whose mean is given by the underlying signal $f(x_i)$, and whose standard deviation is given by the noise level $\sigma$. This modeling approach is important enough to merit a name:

Technically, the model described below is an additive Gaussian model, but we won’t worry about non-additive models here and therefore won’t bother repeating “additive.”

Definition 1.2 (Gaussian Model) The model

\[ \begin{aligned} y &= f(x_i) + \epsilon_i \\ \epsilon_i &\sim \mathcal{N}(0, \sigma^2) \end{aligned} \tag{1.3}\]

is called a Gaussian data generating model.

Often, as illustrated so far in these notes, we choose $f$ to be a linear function of $x_i$:

Definition 1.3 (Linear-Gaussian Model) A Gaussian model in which

\[ \begin{aligned} f(x_i) = w_0 + w_1 x_i \end{aligned} \tag{1.4}\]

for parameters $w_0, w_1 \in \mathbb{R}$ is called a (1-dimensional) linear-Gaussian model.

Here’s a schematic picture of the linear-Gaussian model.

Code

# from https://stackoverflow.com/questions/47597119/plot-a-vertical-normal-distribution-in-python
def draw_gaussian_at(support, sd=1.0, height=1.0, 
        xpos=0.0, ypos=0.0, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()
    gaussian = torch.exp((-support ** 2.0) / (2 * sd ** 2.0))
    gaussian /= gaussian.max()
    gaussian *= height
    return ax.plot(gaussian + xpos, support + ypos, **kwargs)
    
support = torch.linspace(-10, 10, 1000)
fig, ax = plt.subplots()

ax.plot(x, signal, color = "black", linestyle = "--", label = r"Signal: $f(x_i) = 2x_i + 1$")

for each in x:
    draw_gaussian_at(support, sd=3, height=0.4, xpos=each, ypos=2.0 * each + 1.0, ax=ax, color='C0', alpha=0.4)

ax.scatter(x.numpy(), y.numpy(), **scatterplot_kwargs)

plt.xlabel(r"$x$")
plt.ylabel(r"$y$")
plt.title("Data points modeled as Gaussians around the signal")
plt.legend()
plt.show()

Figure 1.7: Illustration of the Gaussian noise model Equation 1.2, showing data points (black circles) as samples from Gaussian distributions (blue curves) centered on the signal (dashed line).

We are modeling $y_i$ as a noisy sample from a Gaussian distribution centered at the signal value $f(x_i)$, with noise level $\sigma$ controlling how spread out the distribution is. A larger value of $\sigma$ means that the observed data points $y_i$ will tend to be further away from the signal $f(x_i)$, while a smaller value of $\sigma$ means that the data points will tend to be closer to the signal. Larger values of the noise level $\sigma$ create noisier data sets where it is more difficult to discern the underlying signal.

Data Generating Distributions

One of the primary reasons to define models like Equation 1.2 is that they give us principled ways for thinking about what our models should learn – the signal – and what they should ignore – the noise. These models are also very helpful as models of where the data comes from – that is, as data generating distributions.

Definition 1.4 (Data Generating Distribution) A data generating distribution (also called a data generating model) is a probabilistic model that describes how data points (especially targets $y$) are generated in terms of random variables and their distributions.

The linear-Gaussian model is a simple example of a data generating model: it describes a recipe to simulate each target $y_i$ by first computing the signal value $f(x_i)$, then sampling a noise term $\epsilon_i$ from a Gaussian distribution, and finally adding the two together to get $y_i = f(x_i) + \epsilon_i$.

A very useful feature of data generating models is that they also give us tools to measure how well our learned signal fits observed data, via the concept of a likelihood.

Model Likelihood

Given a data-generating distribution, we can compute the likelihood of the observed data under that model. Recall that the PDF of a single data point $y_i$ under the Gaussian noise model Equation 1.2 with predictor value $x_i$ is given by the formula

\[ \begin{aligned} p_{y}(y_i; f(x_i), \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{(y_i - f(x_i))^2}{2\sigma^2}\right)\;. \end{aligned} \]

The likelihood of the complete data set is simply the product of the individual data point PDFs, evaluated at their corresponding observed values. The likelihood is a function of the predictors which we’ll collect into a vector $= (x_1,,x_n)^T, the targets which we’ll collect into a vector $ $\mathbf{y}= (y_1,\ldots,y_n)^T$, and the parameters of the model (in this case, the function $f$ and the noise level $\sigma$):

Definition 1.5 (Gaussian Likelihood, Log-Likelihood) The likelihood of the observed data $(\mathbf{x}, \mathbf{y})$ under a 1d Gaussian model with function $f$ and noise level $\sigma$ is given by:

\[ \begin{aligned} L(\mathbf{x}, \mathbf{y}; f, \sigma) = \prod_{i = 1}^n p_{y}(y_i; f(x_i), \sigma^2) \end{aligned} \]

The log-likelihood is the logarithm of the likelihood:

\[ \begin{aligned} \mathcal{L}(\mathbf{x}, \mathbf{y}; f, \sigma) = \log L(\mathbf{x}, \mathbf{y}; f, \sigma) = \sum_{i = 1}^n \log p_{y}(y_i; f(x_i), \sigma^2)\;. \end{aligned} \tag{1.5}\]

The log-likelihood $\mathcal{L}$ in Equation 1.5 is almost always the tool we work with in applied contexts – it turns products into sums, which is very useful in computational practice.

A First Look: Likelihood Maximization

Let’s now finally fit a machine learning model! We’ll assume that the data is sampled from the linear-Gaussian model specified by Equation 1.3 and Equation 1.4, and try to fit the parameters $w_0, w_1$ of the function $f$ to maximize the likelihood of the observed data. For now, we’ll do this simply by choosing the combination of parameters that achieves the best likelihood from among many candidates:

To see how the likelihood can give us a tool to assess model fit, we can compute the likelihood of our observed data for different choices of the function $f$. This can be done in a simple grid search over possible values of the parameters $w_0, w_1$ in the linear function $f(x) = w_0 + w_1 x$.

sig = 3.0  # assumed noise level
1best_ll = -float('inf')
best_w = None

for w0 in torch.linspace(-5, 5, 20):
    for w1 in torch.linspace(-1, 3, 20):
2        f = lambda x: w0 + w1 * x
        ll = 0.0
3        ll += normal_pdf(y, f(x), sig).log().sum().item()

4        if ll > best_ll:
            best_ll = ll
            best_w = (w0.item(), w1.item())

1: Initialize the best log-likelihood and best parameters.
2: Define the predictor function $f$ with current parameters.
3: Compute the data log-likelihood. The normal_pdf function computes the PDF values for all data points at once, which we then log and sum to get the log-likelihood.
4: Update the best log-likelihood and parameters if the current log-likelihood is better.

Let’s check the predictor function $f$ we learned by heuristically maximizing the log-likelihood against the observed data:

Code

fig, ax = plt.subplots(figsize = (6, 4))

ax.scatter(x.numpy(), y.numpy(), **scatterplot_kwargs)
ax.plot(x.numpy(), (best_w[0] + best_w[1] * x).numpy(), color = "firebrick", linestyle = "--", label = r"$f(x) = w_0 + w_1 x$")

ax.plot(x.numpy(), signal.numpy(), color = "black", linestyle = "--", label = r"True signal: $f(x) = 2x + 1$")
ax.set(xlabel = r"$x$", ylabel = r"$y$")
plt.legend()
plt.title(fr"Best LL: {best_ll:.2f}   $w_1 = {best_w[1]:.2f}, w_0 = {best_w[0]:.2f}$")
plt.show()

Figure 1.8: Comparison of the true signal (black dashed line) and the model fit by maximizing the likelihood (red dashed line).

The model we selected via our maximum likelihood grid-search agrees relatively closely with the true underlying signal.

Soon, we’ll learn how to use the likelihood as a method to learn the function $f$ more systematically from data.