13 Convolutional Neural Networks and Image Classification

Our first adventure with nontabular data.

Open the live notebook in Google Colab.

import torch
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.ticker as mtick
import torch.nn as nn
from torch.nn import Conv2d, MaxPool2d, Parameter
from torch.nn.functional import relu
from torchvision import models
import torch.optim as optim
from torch.nn import ReLU

!pip install torchinfo
from torchinfo import summary

plt.style.use('seaborn-v0_8-whitegrid')


Our data for today is the Sign Language MNIST data set, which I retrieved from Kaggle. This data set poses a challenge: can we train a model to recognize a letter of American Sign Language from a hand gesture?

train_url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/sign-language-mnist/sign_mnist_train.csv"
test_url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/sign-language-mnist/sign_mnist_test.csv"

df_train = pd.read_csv(train_url)
df_val   = pd.read_csv(test_url)

Natively, this data set comes to us as a data frame in which each column represents a pixel. Each image has 28x28 pixels, so there are 784 pixel columns, plus one column for the label (the letter being signed). Let’s take a look at the data frame:

df_train.head()

	label	pixel1	pixel2	pixel3	pixel4	pixel5	pixel6	pixel7	pixel8	pixel9	...	pixel775	pixel776	pixel777	pixel778	pixel779	pixel780	pixel781	pixel782	pixel783	pixel784
0	3	107	118	127	134	139	143	146	150	153	...	207	207	207	207	206	206	206	204	203	202
1	6	155	157	156	156	156	157	156	158	158	...	69	149	128	87	94	163	175	103	135	149
2	2	187	188	188	187	187	186	187	188	187	...	202	201	200	199	198	199	198	195	194	195
3	2	211	211	212	212	211	210	211	210	210	...	235	234	233	231	230	226	225	222	229	163
4	13	164	167	170	172	176	179	180	184	185	...	92	105	105	108	133	163	157	163	164	179

5 rows × 785 columns

In principle, we’re already able to perform machine learning tasks on data in this format – treat each column as a feature and we’re ready to go. However, this format makes it very difficult to see relationships between these features. Importantly, a key feature of image data is that nearby pixels are often related to each other in important ways. There’s little hope of capturing that idea from this data frame, because we can’t even tell which pixels are near each other. Let’s therefore reshape the data into something more like its native pixel format.

def prep_data(df):
    n, p = df.shape[0], df.shape[1] - 1
    y = torch.tensor(df["label"].values)
    X = df.drop(["label"], axis = 1)
    X = torch.tensor(X.values)
    X = torch.reshape(X, (n, 1, 28, 28))
    X = X / 255

    return X, y

X_train, y_train = prep_data(df_train)
X_val, y_val     = prep_data(df_val)

print("Training data shapes:")
print(X_train.shape, y_train.shape)
print("Validation data shapes:")
print(X_val.shape, y_val.shape)

Training data shapes:
torch.Size([27455, 1, 28, 28]) torch.Size([27455])
Validation data shapes:
torch.Size([7172, 1, 28, 28]) torch.Size([7172])

We’ve shaped the data into a 4-dimensional tensor, with dimensions

\[ \begin{aligned} (\text{image index}, \text{channel}, \text{height}, \text{width})\;. \end{aligned} \]

The channel refers to the number of color channels in the image. Colors represented in Red, Green, and Blue color format (RGB) have 3 channels, one for each of R, G, and B. Grayscale images like the ones we’re working with today have only one channel.

So, to interpret X_train.shape, we have 27,455 images in the training data, each with 1 channel (grayscale), and each image is 28 pixels high and 28 pixels wide.
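To see concretely what the reshape does, here is a small sketch on a made-up "data set" of two images with 2 rows and 3 columns each. The toy sizes and the names `flat` and `images` are illustrative only; the third index runs over rows (height) and the fourth over columns (width).

```python
import torch

# Two made-up "images" of 2 rows x 3 columns, stored flat (one row per image),
# in the same spirit as the pixel columns of the data frame.
flat = torch.tensor([[0, 1, 2, 3, 4, 5],
                     [10, 11, 12, 13, 14, 15]])

# Reshape to (image index, channel, rows, columns).
images = flat.reshape(2, 1, 2, 3)

print(images.shape)    # torch.Size([2, 1, 2, 3])
print(images[0, 0])    # first image: rows [0, 1, 2] and [3, 4, 5]
```

Consecutive entries of each flat row fill one image row at a time, which is why neighboring pixels end up next to each other again after the reshape.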

Beyond data frames

So far in this class, we’ve almost exclusively considered data arranged in 2-dimensional structures like matrices or data frames with dimensions \(n\times d\), where \(n\) is the number of data observations and \(d\) is the number of features. As we move into deep learning, we’re going to see more examples of data sets like this one in which the 2-dimensional representation of data is inappropriate. We’ll therefore need to expand our thinking to more general tensors with more than two dimensions, like the one we have here.


ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def show_images(X, y, rows, cols, channel = 0):

    fig, axarr = plt.subplots(rows, cols, figsize = (2*cols, 2*rows))
    for i, ax in enumerate(axarr.ravel()):
        ax.imshow(X[i, channel].detach().cpu(), cmap = "Greys_r")
        ax.set(title = f"{ALPHABET[y[i]]}")
        ax.axis("off")
    plt.tight_layout()

show_images(X_train, y_train, 5, 5)

Figure 13.1: Example images from the sign-language MNIST data set, with their corresponding class labels.

Let’s take a look at the distribution of characters in the training data:


fig, ax = plt.subplots(1, 1, figsize = (6, 2))
letters, counts = torch.unique(y_train, return_counts = True)
proportions = counts / counts.sum()

ax.scatter(letters, 100*proportions, edgecolor = "steelblue")
ax.set_xticks(range(26))
ax.set_xticklabels(list(ALPHABET))
ax.set(xlabel = "Letter", ylabel = "Frequency")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(decimals = 1))
ax.set_ylim(0, 7.5)

Figure 13.2: Distribution of class labels in the training data. There are no “J”s or “Z”s in this data because these characters require motion rather than a static gesture.

The most frequent letter (“R”) in this data comprises no more than 5% of the entire data set, so a model that always guessed the most frequent letter would be right at most about 5% of the time. As a minimal aim, we would like a model that gets the right character more than 5% of the time.
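The "always guess the most frequent class" baseline is easy to compute directly. Here is a minimal sketch on a handful of made-up labels (the tensor `y` below is illustrative, not the real training labels):

```python
import torch

# Made-up integer class labels, standing in for y_train.
y = torch.tensor([0, 2, 2, 1, 2, 0, 3, 2])

# Count how often each label appears and pick the most frequent one.
labels, counts = torch.unique(y, return_counts=True)
majority_label = labels[counts.argmax()].item()

# Accuracy of a "model" that always predicts the majority label.
baseline_acc = (y == majority_label).float().mean().item()

print(majority_label, baseline_acc)   # 2 0.5
```

Running the same three lines on `y_train` would give the baseline accuracy that any trained model should beat.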

Data Prep

Sidebar: GPU Acceleration in Torch

As we’ve seen from the last several lectures, deep learning models involve a lot of linear algebra in order to compute predictions and gradients. This means that deep models, even more than many other machine learning models, strongly benefit from hardware that is good at doing linear algebra fast. As it happens, graphics processing units (GPUs) are very, very good at fast linear algebra. So, it’s very helpful when running our models to have access to GPUs; using a GPU can speed up training by as much as 10x. While some folks can use GPUs on their personal laptops, another common option for learning purposes is to use a cloud-hosted GPU. My personal recommendation is Google Colab, and I’ll supply links that allow you to open lecture notes in Colab and use their GPU runtimes.

The reason that GPUs are so good at this is that they were originally optimized for rendering complex graphics in e.g. animation and video games, and this involves lots of linear algebra.

The following torch code checks whether there is a GPU available to PyTorch, and if so, sets a variable called device to log this fact. We’ll make sure that both our data and our models fully live on the same device when doing model training.

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}.")

Running on cpu.

Now that we’ve configured the device, we are going to need to place both our data and our models on the device.

X_train, y_train = X_train.to(device), y_train.to(device)
X_val, y_val     = X_val.to(device), y_val.to(device)

Data Loaders

As we saw when studying modern approaches to optimization loops, it’s convenient to loop through our data in batches and perform an optimization step after each batch. The following code defines a function that creates a data loader for our training and validation data sets.

def make_data_loader(X, y, batch_size = 32): 
    return torch.utils.data.DataLoader(
                torch.utils.data.TensorDataset(X, y),
                batch_size = batch_size,  # use the argument rather than a hard-coded 32
                shuffle = True
    )

train_loader = make_data_loader(X_train, y_train)
val_loader   = make_data_loader(X_val, y_val)

As a reminder, it’s possible to loop through elements of the data loader using syntax like:

for X_batch, y_batch in train_loader:
    # do something with X_batch 
    # and y_batch
    pass
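To get a feel for what comes out of such a loop, here is a small self-contained sketch that uses random stand-in data (100 fake "images" and labels, not the sign language data) and records the shape of each batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 100 single-channel 28x28 "images" with labels in 0..25.
X = torch.rand(100, 1, 28, 28)
y = torch.randint(0, 26, (100,))

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Collect the shape of the image tensor in each batch.
shapes = [tuple(X_batch.shape) for X_batch, y_batch in loader]
print(shapes)
```

With 100 observations and a batch size of 32 we get three full batches and one final batch of 4, so code inside the loop should not assume every batch has exactly 32 elements.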

Logistic Regression Baseline

Let’s go ahead and train a logistic regression model on this data. This model will make no use of the spatial structure of the pixels. Relative to our previous logistic regression implementations, the main difference here is that we need to structure the model to accept input data whose single instance has dimensions (1, 28, 28) rather than (784,). We can achieve this by including a Flatten layer in our model, which will take the 1x28x28 pixel input and flatten it into a 784-dimensional vector before applying the linear transformation.

We can make our code a bit more concise by enclosing each of our layers inside an nn.Sequential container, which allows us to treat the entire sequence of layers as a single layer which we then call in the forward method.

Relative to previous implementations of logistic regression, you might notice that this implementation does not contain a call to the logistic sigmoid (or its multiclass generalization, the softmax). The reason for this is that we are going to use torch’s built-in nn.CrossEntropyLoss() as our loss function for training. For reasons of numerical stability, this function is structured to apply the softmax internally to the raw scores, so we don’t need to do it in the model itself.

class LinearModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.pipeline = nn.Sequential(
            nn.Flatten(),              # 1
            nn.Linear(28*28, 26)       # 2
        )

    def forward(self, x):
        return self.pipeline(x)        # 3

model = LinearModel().to(device)       # 4

1: The flatten layer transforms the (1,28,28) image into a (784,) pixel sequence.
2: Apply a linear map (matrix multiplication)
3: Call the entire pipeline with a single call.
4: Instantiate the model and move it to the device.
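The model above outputs raw scores rather than probabilities because nn.CrossEntropyLoss expects raw scores and applies a log-softmax itself. We can check this on made-up scores and labels (the tensors below are random and purely illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
scores = torch.randn(5, 26)            # made-up raw model outputs for 5 inputs
labels = torch.randint(0, 26, (5,))    # made-up true classes

# Built-in loss applied directly to the raw scores.
built_in = nn.CrossEntropyLoss()(scores, labels)

# The same quantity by hand: log-softmax, then the average negative
# log-probability assigned to the true class.
log_probs = torch.log_softmax(scores, dim=1)
manual = -log_probs[torch.arange(5), labels].mean()

print(torch.allclose(built_in, manual))   # True
```

If we also applied a softmax inside the model, it would effectively be applied twice during training.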

Model Inspection

When constructing nontrivial deep learning models, it can be helpful to inspect them in order to get a sense for their structure and complexity level (especially the number of parameters). For example, could you read off from the above model definition how many parameters are included in the model? There are multiple ways to approach this, including several utilities which visualize the computational graph. Here, we’ll use the torchinfo package, which provides a nice summary of the model’s layers and parameters and includes a parameter count. To call the summary function, we need to specify the expected dimension of a single piece of data (including the channel):

summary(model, input_size=(1, 28, 28), device = device)

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
LinearModel                              [1, 26]                   --
├─Sequential: 1-1                        [1, 26]                   --
│    └─Flatten: 2-1                      [1, 784]                  --
│    └─Linear: 2-2                       [1, 26]                   20,410
==========================================================================================
Total params: 20,410
Trainable params: 20,410
Non-trainable params: 0
Total mult-adds (M): 0.02
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.08
Estimated Total Size (MB): 0.09
==========================================================================================

Even though we are looking at a simple logistic regression model, our parameter count is already in the tens of thousands!
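The count of 20,410 can be reproduced by hand: the linear layer has one weight for each (pixel, class) pair plus one bias per class. Here is a quick check against a freshly constructed layer of the same size:

```python
import torch.nn as nn

# Same shape as the linear layer in the model: 784 inputs, 26 outputs.
layer = nn.Linear(28 * 28, 26)

by_hand = 28 * 28 * 26 + 26                               # weights + biases
from_torch = sum(p.numel() for p in layer.parameters())   # count stored entries

print(by_hand, from_torch)   # 20410 20410
```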

The code block below allows us to evaluate a model’s performance by computing its accuracy and a confusion matrix on a specified data loader. There’s also a helper function for plotting the confusion matrix.

def evaluate(model, data_loader, multichannel = False): 
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    
    # confusion matrix
    conf_mat = torch.zeros((26, 26), dtype = torch.int32)    # 1
    loss_fn = nn.CrossEntropyLoss()
    loss = 0

    with torch.no_grad(): 
        for X_batch, y_batch in data_loader:                 # 2
            scores = model(X_batch)                          # 3
            loss += loss_fn(scores, y_batch).item()
            y_pred = torch.argmax(scores, dim = 1)           # 4
            for i in range(len(y_batch)):
                conf_mat[y_batch[i], y_pred[i]] += 1         # 5

    acc = torch.diag(conf_mat).sum() / conf_mat.sum()        # 6

    return acc, loss, conf_mat

1: Initialize a confusion matrix of zeros with dimensions 26x26 (one row and one column for each letter of the alphabet).
2: For each batch in the data loader:
3: Compute the model’s output scores for the batch.
4: Determine the predicted class by taking the index of the maximum score for each instance in the batch.
5: Update the confusion matrix by incrementing the count for the true label and predicted label for each instance in the batch.
6: Compute the overall accuracy by summing the diagonal of the confusion matrix (correct predictions) and dividing by the total number of predictions.
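To make the confusion matrix bookkeeping concrete, here is the same update rule applied to a tiny made-up example with 3 classes and 6 predictions (the labels below are illustrative only):

```python
import torch

# Made-up true and predicted labels for 6 observations and 3 classes.
y_true = torch.tensor([0, 0, 1, 2, 2, 2])
y_pred = torch.tensor([0, 1, 1, 2, 2, 0])

# Rows index the true label, columns index the predicted label.
conf_mat = torch.zeros((3, 3), dtype=torch.int32)
for t, p in zip(y_true, y_pred):
    conf_mat[t, p] += 1

# Correct predictions sit on the diagonal.
acc = torch.diag(conf_mat).sum() / conf_mat.sum()

print(conf_mat)
print(acc.item())   # 4 correct out of 6
```

Off-diagonal entries tell us which classes get confused with which: here, one class-0 image was predicted as class 1 and one class-2 image as class 0.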


def plot_confusion_mat(conf_mat, ax, title = "Confusion Matrix"):
    im = ax.imshow(conf_mat.cpu(), cmap = "Blues", origin = "upper")
    # Show all ticks and label them with the respective list entries
    ax.set_xticks(torch.arange(len(ALPHABET)))
    ax.set_yticks(torch.arange(len(ALPHABET)))
    ax.set_xticklabels(list(ALPHABET))
    ax.set_yticklabels(list(ALPHABET))
    ax.set_xlabel("Predicted Label")
    ax.set_ylabel("True Label")

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(conf_mat.shape[0]):
        for j in range(conf_mat.shape[1]):
            text = ax.text(j, i, conf_mat[i, j].item(),
                           ha="center", va="center", color="black", size = 6)

    ax.set_title(title)
    ax.grid(False)

Figure 13.3: Confusion matrix for the untrained logistic regression model.

Obviously this model does not classify the data very impressively at all, since it hasn’t been trained yet. Let’s implement a training loop. This loop will perform model updates and track the model’s accuracy over time.

It would also be reasonable to track the average loss per batch over time. This would require an extra call to model.forward on the validation set, which we’re not going to do here because it increases the training time substantially.

def train(model, k_epochs = 1, print_every = 2000, **opt_kwargs):

    # loss function is cross-entropy (multiclass logistic)
    loss_fn = nn.CrossEntropyLoss()

    # optimizer is SGD; options such as lr and momentum are passed via opt_kwargs
    optimizer = optim.SGD(model.parameters(), **opt_kwargs)

    # initialize lists to track accuracy and loss after each epoch
    train_accuracy = []
    val_accuracy   = []
    train_loss     = []
    val_loss       = []

    for epoch in range(k_epochs):

        for i, data in enumerate(train_loader):
            X, y = data
            optimizer.zero_grad()
            y_pred = model(X)
            loss   = loss_fn(y_pred, y)
            loss.backward()
            optimizer.step()
            
        train_acc, train_l, train_cm = evaluate(model, data_loader = train_loader)
        val_acc, val_l, val_cm     = evaluate(model, data_loader = val_loader)
        
        train_accuracy += [train_acc]
        val_accuracy   += [val_acc]
        train_loss     += [train_l]
        val_loss       += [val_l]

    return train_accuracy, train_loss, val_accuracy, val_loss

Now let’s train the model and visualize its performance over time:

train_accuracy, train_loss, val_accuracy, val_loss = train(model, k_epochs = 30, lr = 0.001)


fig, axarr = plt.subplots(1, 2, figsize = (8, 3.5))

ax = axarr[0]
ax.plot(train_loss, color = "black", label = "Training")
ax.plot(val_loss, color = "firebrick", label = "Validation")
ax.set_title("Cross-Entropy Loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")

ax = axarr[1]
ax.plot(train_accuracy, color = "black", label = "Training")
ax.plot(val_accuracy, color = "firebrick", label = "Validation")
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy")
ax.set_title("Classification Accuracy")
ax.set(ylim = (0, 1))
plt.tight_layout()
l = ax.legend()

Figure 13.4: Training curves for the logistic regression model. The left panel shows the training and validation loss over time, and the right panel shows the training and validation accuracy over time.

This model is not yet done training, and additional epochs might improve the performance. For the purposes of these notes we won’t push it further than that.

Already, the confusion matrix for the trained model looks much better:


acc, loss, conf_mat = evaluate(model, val_loader)
fig, ax = plt.subplots(figsize = (7, 7))
plot_confusion_mat(conf_mat, ax, title = f"Confusion Matrix (acc = {acc:.2%})")

Figure 13.5: Confusion matrix for the logistic regression model.

Although we can see that the model is often successful, 54% accuracy leaves a lot of room for improvement. Can we do better?

Convolutional Neural Networks

A common approach to feature extraction in images is to apply a convolutional kernel. A convolutional kernel is a component of a vectorization pipeline which is specifically suited to the structure of images. In particular, images are fundamentally spatial. We might want to construct data features which reflect not just the value of an individual pixel, but also the values of pixels nearby that one.

Convolutional kernels are not related in any way to positive-definite kernels used in kernel classifiers, another important ML topic.

The idea of an image convolution is pretty simple. We define a square kernel matrix containing some numbers, and we “slide it over” the input data. At each location, we multiply the data values by the kernel matrix values, and add them together. Here’s an illustrative diagram:

Kernel Convolutions

Figure 13.6: Image from *Dive Into Deep Learning*.

In this example, the value of 19 is computed as \(0\times 0 + 1\times 1 + 3\times 2 + 4\times 3 = 19\).
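We can reproduce this calculation in torch. The sketch below assumes the input is the 3x3 grid of the numbers 0 through 8 and the kernel is the 2x2 grid of 0 through 3; the top-left patch and the kernel are fixed by the arithmetic above, while the remaining input entries are an assumption about the figure.

```python
import torch
import torch.nn.functional as F

# Assumed input: 3x3 grid of 0..8, shaped (batch, channel, height, width).
X = torch.arange(9.0).reshape(1, 1, 3, 3)

# Kernel: 2x2 grid of 0..3, shaped (out channels, in channels, height, width).
K = torch.tensor([[0., 1.],
                  [2., 3.]]).reshape(1, 1, 2, 2)

# Slide the kernel over the input, multiplying and summing at each location.
out = F.conv2d(X, K)

print(out[0, 0])   # tensor([[19., 25.], [37., 43.]])
```

The top-left entry is the 19 computed above; each of the other entries comes from shifting the kernel by one pixel.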

Manual Kernel Convolutions for Feature Extraction

For a long time, a common approach to image classification and related computer vision tasks was to hand-engineer a set of convolutional kernels designed to extract certain specific features of interest from the image. For example, here are some kernels designed to extract vertical, horizontal, and diagonal features from an image:


vertical = torch.tensor([0, 0, 5, 0, 0]).repeat(5, 1) - 1.0
diag1    = torch.eye(5)*5 - 1
horizontal = torch.transpose(vertical, 1, 0)
diag2    = diag1.flip(1)

fig, ax = plt.subplots(1, 4)
for i, kernel in enumerate([vertical, horizontal, diag1, diag2]):
    ax[i].imshow(kernel, vmin = -1.5, vmax = 2)
    ax[i].axis("off")
    ax[i].set(title = f'{["Vertical", "Horizontal", "Diagonal Down", "Diagonal Up"][i]}')

When we apply these convolutional kernels to an image, we obtain a new image in which the value of each pixel corresponds to the “alignment” of the image with the kernel at that point. Here are some examples:


def apply_convolutions(X): 

    # this is actually a neural network layer -- we'll learn how to use these
    # in that context soon 
    conv1 = Conv2d(1, 4, 5).to(device) # 1 input channel, 4 output channels, 5x5 kernels

    # need to disable gradients for this layer
    for p in conv1.parameters():
        p.requires_grad = False

    # replace kernels in layer with our custom ones
    conv1.weight[0, 0] = Parameter(vertical)
    conv1.weight[1, 0] = Parameter(horizontal)
    conv1.weight[2, 0] = Parameter(diag1)
    conv1.weight[3, 0] = Parameter(diag2)

    # apply to input data and disable gradients
    return conv1(X).detach()

def kernel_viz(pipeline):

    fig, ax = plt.subplots(5, 5, figsize = (8, 8))
    

    X_convd = pipeline(X_train).cpu()

    for i in range(5): 
        for j in range(5):
            if i == 0: 
                ax[i,j].imshow(X_train[j, 0].cpu())
            
            else: 
                ax[i, j].imshow(X_convd[j,i-1])
            
            ax[i,j].tick_params(
                        axis='both',      
                        which='both',     
                        bottom=False,     
                        left=False,
                        right=False,         
                        labelbottom=False, 
                        labelleft=False)
            ax[i,j].grid(False)
            ax[i, 0].set(ylabel = ["Original", "Vertical", "Horizontal", "Diag Down", "Diag Up"][i])

kernel_viz(apply_convolutions)

Figure 13.7: Example of four kernel convolutions applied to five sample input images.

In principle, these convolutions could be used to define “scores” for each image: for example, summing up the “horizontal” values could give an image a score reflecting the prevalence of horizontal lines in the image.

A limitation of this approach is that we have to engineer all our kernels in advance. Wouldn’t it be simpler if we could simply initialize the kernels randomly and let the data tell us what the useful kernels might be?

Learnable Kernels

Fortunately, neural networks give us a framework for doing exactly that. The key insight is that the kernel convolution operation is, fundamentally, just a sequence of pairwise multiplications followed by an addition. In other words, the convolution is a linear operation, which means:

Kernel Convolution is a Matrix Multiplication

Actually writing down the matrix multiplication formula for kernel convolution is complex and involves “doubly-indexed matrices,” so we won’t do that here. The key point is that, since the convolution operation is a matrix multiplication, we can treat it with the same framework as we have been using for other linear models – we just need to throw it in as a layer in a neural network.
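Without writing out the full formula, we can still check numerically that a convolution is a matrix multiplication. One way (a sketch, not the only way to organize the computation) is to copy every patch of the image into a column using torch's `unfold` and then multiply by the flattened kernel. The image and kernel below are random and illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1, 1, 6, 6)   # one made-up single-channel 6x6 image
k = torch.rand(1, 1, 3, 3)   # one made-up 3x3 kernel

# Direct convolution: output has shape (1, 1, 4, 4).
direct = F.conv2d(x, k)

# Each column of `patches` is one flattened 3x3 patch: shape (1, 9, 16).
patches = F.unfold(x, kernel_size=3)

# Flattened kernel (1 x 9) times the patch matrix (9 x 16), reshaped to 4x4.
as_matmul = (k.view(1, 9) @ patches).view(1, 1, 4, 4)

print(torch.allclose(direct, as_matmul, atol=1e-6))   # True
```

Because the whole operation is a matrix product in the kernel entries, gradients with respect to those entries can be computed just like for any other linear layer.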

A Minimal Convolutional Model

Just inserting a convolutional layer on its own won’t lead to substantial gains because, as we saw, a composition of linear layers is itself just another linear layer. Instead, we can compose the convolutional layer with a nonlinearity (e.g. ReLU) and a final linear layer to get a minimal convolutional model. Since our output layer is a Linear layer, we still need to Flatten the output of the convolutional layer before feeding it into the linear layer.

class SmallConvNet(nn.Module):
    def __init__(self):
        super().__init__()

        self.pipeline = torch.nn.Sequential(
            nn.Conv2d(1, 4, 5),         # 1
            ReLU(),                     # 2
            nn.Flatten(),               # 3
            nn.Linear(4*24*24, 26)      # 4
        )

    def forward(self, x):
        return self.pipeline(x)

1: Specify that our images have only one input channel (greyscale), 4 output channels (we’ll try learning 4 kernels), and the kernels have shape 5x5 pixels.
2: Apply nonlinearity.
3: Flatten the output of the convolutional layer to feed into the linear layer.
4: Apply the linear layer. The input dimension reflects the presence of 4 channels whose outputs have shape 24x24 pixels, and 26 output classes.
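As a quick sanity check on the numbers in the last annotation: a 5x5 kernel fits into a 28-pixel side in 28 - 5 + 1 = 24 positions, which is where the 24x24 comes from. The sketch below confirms this on a random stand-in image and also counts parameters by hand.

```python
import torch
import torch.nn as nn

# Same convolutional layer as in the model: 1 input channel, 4 kernels of 5x5.
conv = nn.Conv2d(1, 4, 5)

# Pass one random stand-in image through the layer and look at the shape.
out = conv(torch.rand(1, 1, 28, 28))
print(out.shape)   # torch.Size([1, 4, 24, 24])

# Parameter counts by hand.
conv_params = 4 * (1 * 5 * 5) + 4          # 4 kernels of 25 weights, plus 4 biases
linear_params = (4 * 24 * 24) * 26 + 26    # flattened features x classes, plus biases

print(conv_params, linear_params, conv_params + linear_params)   # 104 59930 60034
```

Note that almost all of the parameters live in the final linear layer, not in the convolutional kernels.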

Let’s instantiate a model and take a look.

X_batch = next(iter(train_loader))[0]
model = SmallConvNet().to(device)
summary(model, input_size=X_batch.shape, device = device)

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
SmallConvNet                             [32, 26]                  --
├─Sequential: 1-1                        [32, 26]                  --
│    └─Conv2d: 2-1                       [32, 4, 24, 24]           104
│    └─ReLU: 2-2                         [32, 4, 24, 24]           --
│    └─Flatten: 2-3                      [32, 2304]                --
│    └─Linear: 2-4                       [32, 26]                  59,930
==========================================================================================
Total params: 60,034
Trainable params: 60,034
Non-trainable params: 0
Total mult-adds (M): 3.83
==========================================================================================
Input size (MB): 0.10
Forward/backward pass size (MB): 0.60
Params size (MB): 0.24
Estimated Total Size (MB): 0.94
==========================================================================================

How does this model perform?

k_epochs = 5

train_accuracy, train_loss, val_accuracy, val_loss = train(model, k_epochs = k_epochs,  lr = 0.01,  momentum = 0.9)


fig, axarr = plt.subplots(1, 2, figsize = (8, 3.5))

ax = axarr[0]
ax.plot(train_loss, color = "black", label = "Training")
ax.plot(val_loss, color = "firebrick", label = "Validation")
ax.set_title("Cross-Entropy Loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")

ax = axarr[1]
ax.plot(train_accuracy, color = "black", label = "Training")
ax.plot(val_accuracy, color = "firebrick", label = "Validation")
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy")
ax.set_title("Classification Accuracy")
ax.set(ylim = (0, None))
plt.tight_layout()
l = ax.legend()

Figure 13.8: Training curves for the minimal convolutional neural network.

Figure 13.9: Confusion matrix for the minimal convolutional neural network.

The minimal model is already able to somewhat exceed the performance of the logistic regression model, even after fewer epochs of training.

More Complex Convolutional Models

A common approach to building more complex convolutional models is to stack multiple convolutional layers on top of each other, with nonlinearities in between. This allows the model to learn more complex features at different levels of abstraction. For example, the first convolutional layer might learn to detect edges, while the second convolutional layer might learn to detect combinations of edges that form shapes, and the third convolutional layer might learn to detect combinations of shapes that form objects.

Pooling

An issue with the stacking approach, however, is that the data remains very large throughout the pipeline, with each convolution reducing the data just by a few pixels in each dimension. This also prevents convolutional kernels at later layers from combining data from far away regions in the image. To address this, let’s reduce the data in a nonlinear way. We’ll do this with max pooling. You can think of it as a kind of “summarization” step in which we intentionally make the current output somewhat “blockier.” Technically, it involves sliding a window over the current batch of data and picking only the largest element within that window. Here’s an example of how this looks:

A useful effect of pooling is that it reduces the number of features in our data. In the image above, we reduce the number of features by a factor of \(2\times 2 = 4\).
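Here is a small numerical sketch of max pooling with a 2x2 window, applied to a made-up 4x4 input (the numbers are illustrative):

```python
import torch
import torch.nn as nn

# A made-up 4x4 single-channel input, shaped (batch, channel, height, width).
x = torch.tensor([[1.,  2.,  5.,  6.],
                  [3.,  4.,  7.,  8.],
                  [9.,  10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

# 2x2 window, moved 2 pixels at a time, so the windows do not overlap.
pool = nn.MaxPool2d(2, 2)
out = pool(x)

print(out[0, 0])   # tensor([[ 4.,  8.], [12., 16.]])
```

Each 2x2 block of the input is replaced by its largest entry, so 16 features become 4.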

Let’s now construct a complex model that layers convolutional layers, nonlinearities, and pooling layers.

There’s nothing special about the engineering of this model: I just stacked a few convolution and pooling layers together and then added a few more linear outputs.

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()

        self.pipeline = torch.nn.Sequential(
            nn.Conv2d(1, 100, 5),
            nn.MaxPool2d(2, 2),
            ReLU(),
            nn.Conv2d(100, 50, 3),
            nn.MaxPool2d(2, 2),
            ReLU(),
            nn.Conv2d(50, 50, 3),
            nn.MaxPool2d(2, 2),
            ReLU(),
            nn.Flatten(),
            nn.Linear(50, len(ALPHABET))
        )

    def forward(self, x):
        return self.pipeline(x)

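This model is more complex than the previous one. As the summary below shows, its parameter count is only modestly larger (71,526 versus 60,034), but it performs far more computation per batch (roughly 199 million multiply-adds versus about 4 million).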

X_batch = next(iter(train_loader))[0]
model = ConvNet().to(device)
summary(model, input_size=X_batch.shape, device = device)

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
ConvNet                                  [32, 26]                  --
├─Sequential: 1-1                        [32, 26]                  --
│    └─Conv2d: 2-1                       [32, 100, 24, 24]         2,600
│    └─MaxPool2d: 2-2                    [32, 100, 12, 12]         --
│    └─ReLU: 2-3                         [32, 100, 12, 12]         --
│    └─Conv2d: 2-4                       [32, 50, 10, 10]          45,050
│    └─MaxPool2d: 2-5                    [32, 50, 5, 5]            --
│    └─ReLU: 2-6                         [32, 50, 5, 5]            --
│    └─Conv2d: 2-7                       [32, 50, 3, 3]            22,550
│    └─MaxPool2d: 2-8                    [32, 50, 1, 1]            --
│    └─ReLU: 2-9                         [32, 50, 1, 1]            --
│    └─Flatten: 2-10                     [32, 50]                  --
│    └─Linear: 2-11                      [32, 26]                  1,326
==========================================================================================
Total params: 71,526
Trainable params: 71,526
Non-trainable params: 0
Total mult-adds (M): 198.62
==========================================================================================
Input size (MB): 0.10
Forward/backward pass size (MB): 16.15
Params size (MB): 0.29
Estimated Total Size (MB): 16.53
==========================================================================================
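The spatial sizes in the Output Shape column can be traced by hand. With the default settings used here, a convolution with a k x k kernel shrinks each side from n to n - k + 1, and a 2x2 max pool with stride 2 shrinks it from n to n // 2 (rounding down). The helper names below are hypothetical, introduced just for this sketch.

```python
def conv_out(n, k):
    """Side length after a k x k convolution with no padding and stride 1."""
    return n - k + 1

def pool_out(n, p):
    """Side length after p x p max pooling with stride p (rounds down)."""
    return n // p

# The sequence of convolution and pooling layers in ConvNet.
layers = [("conv", 5), ("pool", 2), ("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2)]

size = 28
sizes = [size]
for kind, k in layers:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    sizes.append(size)

print(sizes)   # [28, 24, 12, 10, 5, 3, 1]
```

By the end, each of the 50 channels has been reduced to a single number, which is why the final linear layer takes just 50 inputs.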

Let’s try training the model and seeing how we do.

train_accuracy, train_loss, val_accuracy, val_loss = train(model, k_epochs = k_epochs,  lr = 0.01,  momentum = 0.9)

Figure 13.11: Training curves for the convolutional neural network.

Figure 13.12: Confusion matrix for the convolutional neural network.

The additional complexity of this model enables it to achieve much higher accuracy than the previous models we’ve discussed, although more thorough training runs would be necessary for a full assessment.

Inspecting Learned Features

As we saw last time, it’s possible to inspect the features learned by a neural network at different levels of abstraction. Let’s see some of the features learned by the model for a single image:

Figure 13.13: Example image used as base for the experiment in Figure 13.14.

In the code block below, we show the outputs of different layers of the model when applied to this original image. Each layer’s output can be thought of as a different “representation” of the original image, with different features extracted at each layer.

Figure 13.14: Sample learned features at varying levels of abstraction from the convolutional model, using Figure 13.13 as a base. Note that the features in each row are not necessarily directly related to each other.
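The plotting code for this figure is not reproduced here. As a rough sketch of how such intermediate outputs could be collected (not necessarily the approach used for the figure), we can pass an input through the layers of an nn.Sequential one at a time and keep each result. The small untrained `pipeline` and the random input below are stand-ins, not the trained model or a real image.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small untrained stand-in for a convolutional pipeline.
pipeline = nn.Sequential(
    nn.Conv2d(1, 4, 5),
    nn.MaxPool2d(2, 2),
    nn.ReLU(),
    nn.Conv2d(4, 8, 3),
    nn.ReLU(),
)

x = torch.rand(1, 1, 28, 28)   # random stand-in for a single image

# Apply the layers one by one, recording each layer's name and output shape.
activations = []
with torch.no_grad():
    for layer in pipeline:
        x = layer(x)
        activations.append((type(layer).__name__, tuple(x.shape)))

for name, shape in activations:
    print(name, shape)
```

Each recorded output has one 2-dimensional array per channel, and each of those arrays could be shown with imshow as one panel of a figure like the one above.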

Interpreting these learned features can be tricky and is not generally recommended without context and many additional experiments.

Onward

In the next lecture, we’ll consider some additional practical considerations that arise when working with spatially-structured data and convolutional neural networks.