12  Introducing Deep Learning

Learn the features.

Open the live notebook in Google Colab.

The Problem of Features

An ongoing tension throughout our study of machine learning has been the need express nonlinear relationships between the features \(\mathbf{x}_i\) and the target variable \(y_i\) while maintaining a manageable model complexity. Models that don’t have the capacity to express nonlinear relationships (e.g. logistic regression without feature maps) will often underfit nonlinear data, but models that are too flexible (e.g. logistic regression with a high-dimensional feature map) will often overfit.

Our approach so far has been to study problems of the form

\[ \begin{aligned} {\color{blue}\hat{\mathbf{W}}} = \mathop{\mathrm{arg\,min}}_{{\color{blue}\mathbf{W}} \in \mathbb{R}^{d\times k}} \frac{1}{n}\sum_{i=1}^n \ell(\mathbf{\phi}(\mathbf{x}_i){\color{blue}\mathbf{W}}, y_i) + \lambda r({\color{blue}\mathbf{W}})\;, \end{aligned} \tag{12.1}\]

where:

  • \(\mathbf{W}\) is a matrix of parameters that we are trying to learn, and \(\hat{\mathbf{W}}\) is the optimal parameter vector that minimizes the objective function. In the context of regression we had \(\mathbf{W}\in \mathbb{R}^{d\times1}\) where \(d\) is the dimension of the feature map \(\mathbf{\phi}(\mathbf{x}_i)\), while in the context of classification we had \(\mathbf{W}\in \mathbb{R}^{k\times d}\) where \(k\) is the number of classes and \(d\) is again the dimension of the feature map.
  • \(\mathbf{x}_i\) is the feature vector for the \(i\)-th training example, and \(y_i\) is the corresponding label or target value.
  • \(\mathbf{\phi}\) is an optional nonlinear feature map that maps the input features \(\mathbf{x}_i\) to a higher-dimensional space, enabling us to learn nonlinear relationships between the features and the target variable.
  • \(\ell\) is a loss function that measures the discrepancy between the predicted value $ = (_i)$ and the true label \(y_i\).
    • In the context of regression we’ve seen \(\ell(\hat{y},y) = (\hat{y}-y)^2\) (mean-squared error) and \(\ell(\hat{y},y) = \lvert \hat{y}-y \rvert\) (mean-absolute error).
    • In the context of classification we’ve seen \(\ell(\hat{y},y) = y\log \sigma(\hat{y}) + (1-y)\log(1-\sigma(\hat{y}))\) (logistic loss).
  • \(\lambda r({\color{blue}\mathbf{W}})\) is a regularization term that penalizes large values of the parameters \(\mathbf{W}\) to guard against overfitting. We’ve seen examples of regularization terms such as \(\lambda \|\mathbf{W}\|_2^2\) (L2 regularization) and \(\lambda \|\mathbf{W}\|_1\) (L1 regularization).

Too Many Features

One of the reasons this problem is difficult is that we don’t usually know at the outset which feature maps are likely to be helpful for our modeling problem. So, we often generate many features (e.g. all polynomials of degree up to \(k\)) and then hope that some of these will be helpful. Techniques like regularization or model selection via cross-validation can help us to avoid overfitting in this setting, but they don’t solve the problem of having to generate a large number of features in the first place.

Motivating Representation Learning

Wouldn’t it be so much easier if our models could just learn which features are helpful for the problem? That is, what if we could instead solve a problem of the form

\[ \begin{aligned} {\color{blue}\hat{\mathbf{W}}}, {\color{red}\hat{\mathbf{\phi}}} = \mathop{\mathrm{arg\,min}}_{{\color{blue}\mathbf{W}} \in \mathbb{R}^{d\times k}, {\color{red}\mathbf{\phi}} \in \Phi} \frac{1}{n}\sum_{i=1}^n \ell(( {\color{red}\mathbf{\phi}}(\mathbf{x}_i){\color{blue}\mathbf{W}}), y_i) \;, \end{aligned} \tag{12.2}\]

where now we learn the feature map \(\mathbf{\phi}\) rather than having to specify it in advance? If we think of a feature map \(\mathbf{\phi}\) as a function that represents a data point \(\mathbf{x}_i\) in some new way, then what we are asking for is a model that can learn how to represent data points in a way that facilitates prediction of the target variable. This problem is sometimes called representation learning, and is the primary motivation of deep learning.

Interlude: Computational Graphs Again

To begin our study, it’s helpful to start moving beyond equations and into visual representations of our models. This is another setting in which the idea of a computational graph can be very helpful, although this time our graphs will be somewhat less detailed than when we studied automatic differentiation. In Equation 12.1, ignoring the regularization term for simplicity, we can think of the model output $_i = (_i)$ as being generated by a computational graph that takes \(\mathbf{x}_i\) as input, applies the feature map \(\mathbf{\phi}\) to generate a new representation of the data point, and then applies a linear transformation (the multiplication with \(\mathbf{W}\)) to generate the output. Schematically, we can represent this process with the following diagram.

Figure 12.1: Example flow of data through a computational graph for a model of the form in Equation 12.1. The bottom row shows the state of the data as it flows through the graph, while the top row shows the operations applied to the data to move from one stage to the next. The feature map \(\mathbf{\phi}\) is fixed and not learned by the model.

An important feature of this model architecture is that the model weights \(\mathbf{W}\) are trainable, but the feature map \(\mathbf{\phi}\) is not trainable. This means that the model can learn to use the features generated by \(\mathbf{\phi}\) in a way that minimizes the loss, but it cannot tune these features to be more helpful for the task at hand. To move beyond that limitation requires a model that attempts to solve Equation 12.2.

Multilayer Perceptron

In general we can’t possibly choose “the best” feature map \(\mathbf{\phi}\) from the space of all possible functions – there are too many choices for the optimization to be feasible. So, we typically restrict \(\mathbf{\phi}\) to lie within some parameterized family of functions, which we then optimizer over. For our first model that uses representation learning, we’ll let \(\mathbf{\phi}\) have the following two-stage structure:

  1. First, multiply \(\mathbf{x}\) by a matrix \(\mathbf{V}\) of learnable weights.
  2. Then, apply a fixed nonlinear function \(\alpha\), called an activation function, entrywise to each element of the resulting vector.

Common choices of the activation function \(\alpha\) include the sigmoid function \(\alpha(z) = 1/(1+e^{-z})\) and the rectified linear unit (ReLU) function \(\text{ReLU}(z) = \max(0,z)\).

The structure of our feature map is therefore

\[ \begin{aligned} \phi(\mathbf{x}) = \alpha(\mathbf{x}\mathbf{V})\;. \end{aligned} \tag{12.3}\]

The input dimension of the map must be \(d\), the dimension of the original data \(\mathbf{x}\). The output dimension of the feature map is tunable; let’s call it \(p\). The matrix \(\mathbf{V}\) therefore has dimensions \(d \times p\).

Figure 12.2: Example flow of data through a computational graph for a model with feature maps of the form Equation 12.3. The feature map \(\mathbf{\phi}\) is now parameterized by the matrix \(\mathbf{V}\), which is learned by the model.

Importantly, if \(\mathbf{V}\in \mathbb{R}^{d \times p}\), then \(\alpha(\mathbf{x}\mathbf{V}) \in \mathbb{R}^p\). This implies that the next matrix multiplication must have input dimension \(p\) and output dimension \(k\), where \(k\) is the dimension of the final model output. In other words, we must have \(\mathbf{W}\in \mathbb{R}^{p \times k}\), and the output of the model \(\hat{y} = \alpha(\mathbf{x}\mathbf{V})\mathbf{W}\in \mathbb{R}^k\).

Getting started with Deep Models in PyTorch

Let’s go ahead and start implementing these models in PyTorch. We’re going to start with the model illustrated in Figure 12.1 using the programming idioms we’ve developed so far, and then we’ll highlight the small shifts needed to move to fully idiomatic PyTorch. Then we’ll move on to the model illustrated in Figure 12.2.

We’ll illustrate our models on the following synthetic data set:

Code
from sklearn.datasets import make_circles
import torch
import matplotlib.pyplot as plt 

X, y = make_circles(n_samples=100, factor=0.5, noise=0.3, random_state=42)
X = torch.tensor(X, dtype=torch.float32)
X = torch.cat([torch.ones(X.shape[0], 1), X], dim=1) # add bias term
y = torch.tensor(y, dtype=torch.float32)

def scatter_data(X, y, ax):
    markers = ['o', 's']
    for i in range(2): 
        idx = (y.flatten() == i)
        ax.scatter(X[idx, 1], X[idx, 2], c = y.flatten()[idx], label=f'Class {i}', edgecolor='k', marker=markers[i],   cmap = 'BrBG', vmin = -0.5, vmax = 1.5, alpha = 0.7)

fig, ax = plt.subplots()
scatter_data(X, y, ax)


ax.set_aspect('equal')
ax.set_title('Synthetic Data Set')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.legend()
plt.show()

Figure 12.3: Example data for our first steps in deep learning.

We previously implemented logistic regression like this:

class LogisticRegression: 
    def __init__(self, n_features):
        self.W = torch.randn(n_features, 2, requires_grad=True)
    
    def forward(self, X):
        return torch.softmax(X @ self.W, dim=1)

To transition into more extensible code, we’re going to leverage torch’s nn.Module class, which provides a convenient way to define models with learnable parameters. Here’s how we can implement the same logistic regression model using nn.Module:

from torch import nn
class LogisticRegression(nn.Module): 
    def __init__(self, n_features):
        super().__init__()
1        self.layer1 = nn.Linear(n_features, 2)
    
    def forward(self, X):
2        x1 = self.layer1(X)
        return torch.softmax(x1, dim=1)
1
Abstract Linear layer which handles learnable matrix multiplication, including the bias term.
2
Forward pass through the linear layer.

If we try this model on our sample data, we should expect to see the linear decision boundary that we get from logistic regression.

Let’s give this a shot using a standard training loop.

def train_model(model, X, y, n_epochs=2000): 
    opt = torch.optim.SGD(model.parameters(), lr=1)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for epoch in range(n_epochs): 
        opt.zero_grad()
        y_pred = model(X)
        loss = loss_fn(y_pred, y.long())
        losses.append(loss.item())
        loss.backward()
        opt.step()
    return losses

model = LogisticRegression(n_features=X.shape[1])
losses= train_model(model, X, y)
Code
def plot_decision_contours(model, ax, cmap = "BrBG"): 
    x_min, x_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    y_min, y_max = X[:, 2].min() - 0.5, X[:, 2].max() + 0.5
    x = torch.linspace(x_min, x_max, 101)
    y = torch.linspace(y_min, y_max, 101)

    xx, yy = torch.meshgrid(x, y)
    grid_points = torch.cat([torch.ones(xx.shape[0], xx.shape[1], 1), xx.unsqueeze(-1), yy.unsqueeze(-1)], dim=-1)
    with torch.no_grad():
        pred = model(grid_points.reshape(-1, 3))
        pred = pred[:, 1].reshape(xx.shape) # probability of class 1
    ax.contourf(xx.numpy(), yy.numpy(), pred.numpy(), alpha=0.3, cmap = cmap, vmin = -0.5, vmax = 1.5, zorder = -100)
    ax.contour(xx.numpy(), yy.numpy(), pred.numpy(), levels=[0.5], colors='k', linewidths=1, zorder = -100, linestyles='--')

def plot_contours_and_scatter(model, X, y, ax):
    plot_decision_contours(model, ax)
    scatter_data(X, y, ax)
    ax.set_aspect('equal')
    ax.set_xlabel(r"$x_1$")
    ax.set_ylabel(r"$x_2$")
    ax.legend()
Code
fig, ax = plt.subplots()
plot_contours_and_scatter(model, X, y, ax)
t = ax.set_title('Logistic Regression Decision Boundary')

Figure 12.4: Decision boundary learned by logistic regression.

Since we have not baked any nonlinear features into this model, the model simply uses a linear decision boundary with predictably poor results.

Now let’s try a model with multiple linear layers:

class DeepModelAttempt(nn.Module): 
    def __init__(self, n_features, hidden_dim=10, output_dim = 2):
        super().__init__()
        self.layer1 = nn.Linear(n_features, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, X):
        x1 = self.layer1(X) 
        return torch.softmax(self.layer2(x1), dim=1) 

Let’s fit this model to data, making sure to use plenty of hidden dimensions:

Code
model = DeepModelAttempt(n_features=X.shape[1], hidden_dim=500, output_dim = 2)
losses = train_model(model, X, y, n_epochs = 2000)
fig, ax = plt.subplots()

plot_contours_and_scatter(model, X, y, ax)

t = ax.set_title('"Deep" model decision boundary')

Figure 12.5: Decision boundary learned by a “deep” model with two linear layers but no nonlinear activation function.

Oh dear – our classifier has still learned a linear decision boundary. The reason is that we’ve forgotten something important: the nonlinear activation function between the first and second Linear layers. Between our two linear layers we’ve essentially performed the computation \(\mathbf{s}= \mathbf{x}\mathbf{V}\mathbf{W}= \mathbf{x}(\mathbf{W}\mathbf{V})\), which we could have achieved with a single matrix multiplication. So, the model above is equivalent to logistic regression again.

To correctly implement our first deep learning model, we’ll add a nonlinear activation function between the two layers. Following historical precedent, we’ll refer to this model as the multilayer perceptron (MLP).

class MLP(nn.Module): 
    def __init__(self, n_features, hidden_dim=10, output_dim = 2):
        super().__init__()
1        self.feature_layer = nn.Linear(n_features, hidden_dim)
2        self.output_layer = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, X):
3        x1 = torch.relu(self.feature_layer(X))
4        return torch.softmax(self.output_layer(x1), dim=1)
1
First linear layer to learn the feature map.
2
Second linear layer to generate the final output.
3
Apply ReLU activation function to the output of the first layer.
4
Apply softmax to the output of the second layer to get class probabilities.

Let’s train this model and inspect it’s performance. We’ll use a total of 5 hidden features, which means that \(\mathbf{V}\in \mathbb{R}^{n \times 5}\) and \(\mathbf{W}\in \mathbb{R}^{5\times 2}\).

model = MLP(n_features=X.shape[1], hidden_dim=2, output_dim = 2)
losses = train_model(model, X, y, n_epochs = 2000)
fig, ax = plt.subplots()
plot_contours_and_scatter(model, X, y, ax)
t = ax.set_title('MLP Decision Boundary')

Figure 12.6: Decision boundary learned by a multilayer perceptron with one hidden layer and two hidden features.

With a hidden dimension of size 2, our model is already able to learn a nonlinear decision boundary. Let’s explore the impact of increasing the hidden dimension.

Code
torch.manual_seed(12)

fig, ax = plt.subplots(2, 2, figsize=(8, 7))

for i, hidden_dim in enumerate([2, 3, 5, 10]):
    model = MLP(n_features=X.shape[1], hidden_dim=hidden_dim, output_dim = 2)
    losses = train_model(model, X, y, n_epochs = 2000)
    plot_contours_and_scatter(model, X, y, ax[i//2, i%2])
    ax[i//2, i%2].set_title(f'Hidden Dim: {hidden_dim}')

plt.tight_layout()

Figure 12.7: Decision boundaries learned by a multilayer perceptron with one hidden layer and varying hidden dimension sizes.

As we increase the hidden dimension, the model develops a greater capacity to learn more complex decision boundaries, giving a relatively smoother decision boundary.

Tuning Model Complexity

Increasing the hidden dimension size is one way to increase model complexity. Another way is to add more layers, again separated by nonlinear activation functions.

class DeeperMLP(nn.Module): 
    def __init__(self, n_features, hidden_dim=10, output_dim = 2):
        super().__init__()
        self.feature_layer1 = nn.Linear(n_features, hidden_dim) 
        self.feature_layer2 = nn.Linear(hidden_dim, hidden_dim) 
        self.output_layer = nn.Linear(hidden_dim, output_dim) 
    
    def forward(self, X):
        x1 = torch.relu(self.feature_layer1(X)) 
        x2 = torch.relu(self.feature_layer2(x1)) 
        return torch.softmax(self.output_layer(x2), dim=1) 

As usual, models which are excessively complex may begin to show visual signs of overfitting, although techniques like cross-validation should be used to provide a formal assessment.

Code
torch.manual_seed(12345)
model = DeeperMLP(n_features=X.shape[1], hidden_dim=50, output_dim = 2)
losses = train_model(model, X, y, n_epochs = 5000)
fig, ax = plt.subplots()
plot_contours_and_scatter(model, X, y, ax)
plot_decision_contours(model, ax)
ax.set_title('Deeper MLP Decision Boundary')
plt.show()

Figure 12.8: Decision boundary learned by a multilayer perceptron with two layers and 50 hidden features in each layer.

Inspecting Learned Features

Recall that our motivation for deep learning was to learn the features rather than specifying them ahead of time. What features have we learned?

Because of the composable, layered structure of our model, it’s relatively simple to see the outputs of each layer on the data: we just need to call the relevant layer. For example:

with torch.no_grad():
    learned_features1 = model.feature_layer1(X) 
1    learned_features2 = model.feature_layer2(torch.relu(learned_features1))

learned_features1.shape, learned_features2.shape
1
Simulating the second layer.
(torch.Size([100, 50]), torch.Size([100, 50]))

In the code block below, we call each of the two layers of our model on grid so that we can visualize the feature representations.

Code
fig, axarr = plt.subplots(1, 3, figsize=(7, 2.7), sharex=True, sharey=True)

ax = axarr[0]

plot_contours_and_scatter(model, X, y, ax)

ax.set_title('MLP Decision Boundary')


x_min, x_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
y_min, y_max = X[:, 2].min() - 0.5, X[:, 2].max() + 0.5
x_grid = torch.linspace(x_min, x_max, 101)
y_grid = torch.linspace(y_min, y_max, 101)

xx, yy = torch.meshgrid(x_grid, y_grid)
grid_points = torch.cat([torch.ones(xx.shape[0], xx.shape[1], 1), xx.unsqueeze(-1), yy.unsqueeze(-1)], dim=-1)

with torch.no_grad():
    learned_features1 = model.feature_layer1(grid_points) 
    learned_features2 = model.feature_layer2(torch.relu(learned_features1))

ax = axarr[1]
ax.contourf(xx.numpy(), yy.numpy(), learned_features1[:, :, 0].reshape(xx.shape).numpy(), alpha=0.3, cmap = 'Greys')
ax.grid(False)
ax.set_title('Learned Feature (Layer 1)')
ax.set_aspect('equal')
ax.set_xlabel(r"$x_1$")

ax = axarr[2]
axarr[2].contourf(xx.numpy(), yy.numpy(), learned_features2[:, :, 0].reshape(xx.shape).numpy(), alpha=0.3, cmap = 'Greys')
axarr[2].grid(False)
ax.set_title('Learned Feature (Layer 2)')
ax.set_aspect('equal')
ax.set_xlabel(r"$x_1$")

    
plt.tight_layout()

Figure 12.9: Visualization of the decision boundary learned by the multilayer perceptron with two hidden layers, alongside one learned feature from each of the two layers.

The first layer is able to learn only linear features, which individually are of little use. However, by combining many linear features, the second layer is able to learn a nonlinear feature that roughly describes the radial structure of the data set.



© Phil Chodrow, 2025