Last time, we studied empirical risk minimization (ERM). ERM casts the machine learning problem as an optimization problem: we want to find a vector of weights \(\mathbf{w}\) that minimizes the empirical risk
\[
L(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \ell(\langle \mathbf{w}, \mathbf{x}_i \rangle, y_i)\;,
\]
where \(\mathbf{x}_i \in \mathbb{R}^p\) is the feature vector of the \(i\)th data point and \(\mathbf{y}\) is the vector of targets, which we usually assume to be binary in the context of classification. The per-observation loss function \(\ell: \mathbb{R}\times \mathbb{R}\rightarrow \mathbb{R}\) measures the quality of the score \(s_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\) assigned to data point \(i\) by comparing it to \(y_i\) and outputting a real number \(\ell(s_i, y_i)\).
So, our mathematical task is to find a vector of weights \(\mathbf{w}\) that solves this minimization problem. How do we do it? The modern answer is gradient descent, and this set of lecture notes is about what that means and why it works.
Linear Approximations of Single-Variable Functions
Recall the limit definition of a derivative of a single-variable function. Let \(g:\mathbb{R} \rightarrow \mathbb{R}\). The derivative of \(g\) at point \(w_0\), if it exists, is
\[
\frac{dg(w_0)}{dw} = \lim_{\delta w \rightarrow 0} \frac{g(w_0 + \delta w) - g(w_0)}{\delta w}\;.
\]
Theorem 9.1 (Taylor’s Theorem: Univariate Functions) Let \(g:\mathbb{R}\rightarrow \mathbb{R}\) be differentiable at point \(w_0\). Then, there exists \(a > 0\) such that, if \(\lvert\delta w\rvert < a\), then
\[
g(w_0 + \delta w) = g(w_0) + \frac{dg(w_0)}{dw}\,\delta w + o(\lvert\delta w\rvert)\;.
\]
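For a quick concrete check of what the theorem says, take \(g(w) = w^2\) and \(w_0 = 1\). Then
\[
g(1 + \delta w) = (1 + \delta w)^2 = g(1) + \frac{dg(1)}{dw}\,\delta w + (\delta w)^2\;,
\]
since \(g(1) = 1\) and \(\frac{dg(1)}{dw} = 2\); the leftover term \((\delta w)^2\) is indeed \(o(\lvert\delta w\rvert)\), since it shrinks faster than \(\lvert\delta w\rvert\) as \(\delta w \rightarrow 0\).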
One classical way to look for a minimum of \(g\) is to solve the equation \(\frac{dg(w)}{dw} = 0\) to find critical points – under certain conditions, it is guaranteed that a minimum, if it exists, will be one of these critical points. However, it is not always feasible to solve this equation exactly in practice.
In iterative approaches, we instead imagine that we have a current guess \(\hat{w}\) which we would like to improve by adding some \(\delta \hat{w}\) to it. To this end, consider the casual Taylor approximation
\[
g(\hat{w} + \delta \hat{w}) \approx g(\hat{w}) + \frac{dg(\hat{w})}{dw}\,\delta \hat{w}\;.
\]
In the rest of these notes, we will assume that the term \(o(\lvert \delta w\rvert)\) is small enough to be negligible.
We’d like to update our estimate of \(\hat{w}\). Suppose we make a strategic choice: \(\delta \hat{w} = -\alpha \frac{dg(\hat{w})}{dw}\) for some small \(\alpha > 0\). We therefore decide that we will do the update
\[
\hat{w} \gets \hat{w} - \alpha \frac{dg(\hat{w})}{dw}\;. \tag{9.2}
\]
Plugging this choice of \(\delta \hat{w}\) into the Taylor approximation, we find
\[
g(\hat{w} + \delta \hat{w}) \approx g(\hat{w}) - \alpha \left(\frac{dg(\hat{w})}{dw}\right)^2\;.
\]
This is the big punchline. Let’s look at the second term. If \(\left(\frac{dg(\hat{w})}{dw}\right)^2 = 0\), then that must mean that \(\frac{dg(\hat{w})}{dw} = 0\) and that we are at a critical point, which we could check for being a local minimum. On the other hand, if \(\frac{dg(\hat{w})}{dw} \neq 0\), then \(\left(\frac{dg(\hat{w})}{dw}\right)^2 > 0\). This means that
\[
g(\hat{w} + \delta \hat{w}) \approx g(\hat{w}) - \alpha \left(\frac{dg(\hat{w})}{dw}\right)^2 < g(\hat{w})\;,
\]
provided that \(\alpha\) is small enough for the error terms in Taylor’s Theorem to be small. We have informally derived the following fact:
Single-Variable Gradient Descent Works
Let \(g:\mathbb{R}\rightarrow \mathbb{R}\) be differentiable and assume that \(\frac{dg(\hat{w})}{dw} \neq 0\). Then, if \(\alpha\) is sufficiently small, Equation 9.2 is guaranteed to reduce the value of \(g\).
Let’s see an example of single-variable gradient descent in action:
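For concreteness, here is a minimal sketch of the idea, using the assumed example function \(g(w) = w^2\) (so \(\frac{dg(w)}{dw} = 2w\)), whose global minimum is at \(w = 0\):

def dg(w):
    return 2*w                         # derivative of the example function g(w) = w^2

w_hat = 2.0                            # initial guess
alpha = 0.1                            # learning rate

for _ in range(50):
    w_hat = w_hat - alpha*dg(w_hat)    # the gradient descent update from Equation 9.2

print(w_hat)                           # very close to 0, the global minimum of g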
We can see the updates from gradient descent eventually converging to the point \(w = 0\), which is the global minimum of this function.
Gradient Descent in Multiple Dimensions
Our empirical risk function \(L\) is not a single-variable function; indeed, \(L: \mathbb{R}^p \rightarrow \mathbb{R}\). So, we can’t directly apply the results above. Fortunately, these results extend in a smooth way to this setting. The main thing we need is the definition of the gradient of a multivariate function.
Gradients
We’re not going to talk much about what it means for a function to be multivariate differentiable. You can assume that all the functions we will deal with in this class are differentiable unless I highlight otherwise. For a more rigorous definition, you should check out a multivariable calculus class.
Definition 9.1 (Gradient of a Multivariate Function) Let \(g:\mathbb{R}^p \rightarrow \mathbb{R}\) be a multivariate differentiable function. The gradient of \(g\) evaluated at point \(\mathbf{w}\in \mathbb{R}^p\) is written \(\nabla g(\mathbf{w})\), and has value
\[
\nabla g(\mathbf{w}) = \left(\frac{\partial g(\mathbf{w})}{\partial w_1}, \frac{\partial g(\mathbf{w})}{\partial w_2}, \ldots, \frac{\partial g(\mathbf{w})}{\partial w_p}\right)^T \in \mathbb{R}^p\;.
\]
Here, \(\frac{\partial g(\mathbf{w})}{\partial w_1}\) is the partial derivative of \(g\) with respect to \(w_1\), evaluated at \(\mathbf{w}\). To compute it, take the derivative of \(g\) with respect to the variable \(w_1\), holding all other variables constant, and then evaluate the result at \(\mathbf{w}\).
Example
Let \(p = 3\). Let \(g(\mathbf{w}) = w_2\sin w_1 + w_1e^{2w_3}\). The partial derivatives we need are
\[
\frac{\partial g(\mathbf{w})}{\partial w_1} = w_2\cos w_1 + e^{2w_3}\;, \qquad
\frac{\partial g(\mathbf{w})}{\partial w_2} = \sin w_1\;, \qquad
\frac{\partial g(\mathbf{w})}{\partial w_3} = 2w_1e^{2w_3}\;,
\]
so the gradient is
\[
\nabla g(\mathbf{w}) = \left(w_2\cos w_1 + e^{2w_3},\; \sin w_1,\; 2w_1e^{2w_3}\right)^T\;.
\]
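As a sanity check on this hand computation, here is a small sketch using torch’s automatic differentiation at an arbitrarily chosen point (the specific numbers are just an illustrative assumption):

import torch

# evaluate g(w) = w_2 sin(w_1) + w_1 e^(2 w_3) at an arbitrary point
w = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)  # (w_1, w_2, w_3)
g = w[1]*torch.sin(w[0]) + w[0]*torch.exp(2*w[2])
g.backward()

# w.grad matches (w_2 cos w_1 + e^(2 w_3), sin w_1, 2 w_1 e^(2 w_3))
print(w.grad)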
Taylor’s Theorem extends smoothly to this setting.
Theorem 9.2 (Taylor’s Theorem: Multivariate Functions) Let \(g:\mathbb{R}^p\rightarrow \mathbb{R}\) be differentiable at point \(\mathbf{w}_0 \in \mathbb{R}^p\). Then, there exists \(a > 0\) such that, if \(\lVert \delta \mathbf{w} \rVert < a\), then
\[
g(\mathbf{w}_0 + \delta \mathbf{w}) = g(\mathbf{w}_0) + \langle \nabla g(\mathbf{w}_0), \delta \mathbf{w}\rangle + o(\lVert \delta \mathbf{w}\rVert)\;.
\]
In multiple dimensions, the gradient descent update with learning rate \(\alpha\) is
\[
\hat{\mathbf{w}} \gets \hat{\mathbf{w}} - \alpha \nabla g(\hat{\mathbf{w}})\;. \tag{9.3}
\]
Let’s check that a single update of gradient descent will reduce the value of \(g\) provided that \(\alpha\) is small enough. Here, \(\delta \hat{\mathbf{w}} = -\alpha \nabla g(\hat{\mathbf{w}})\), so the multivariate Taylor approximation gives
\[
g(\hat{\mathbf{w}} + \delta \hat{\mathbf{w}}) \approx g(\hat{\mathbf{w}}) - \alpha \langle \nabla g(\hat{\mathbf{w}}), \nabla g(\hat{\mathbf{w}})\rangle = g(\hat{\mathbf{w}}) - \alpha \lVert \nabla g(\hat{\mathbf{w}})\rVert^2\;.
\]
Since \(\lVert\nabla g(\hat{\mathbf{w}}) \rVert^2 > 0\) whenever \(\nabla g(\hat{\mathbf{w}}) \neq\mathbf{0}\), we conclude that, unless \(\hat{\mathbf{w}}\) is a critical point (where the gradient is zero),
\[
g(\hat{\mathbf{w}} + \delta \hat{\mathbf{w}}) < g(\hat{\mathbf{w}})\;.
\]
In other words, provided that \(\alpha\) is small enough for the Taylor approximation to be a good one, multivariate gradient descent also always reduces the value of the objective function.
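To see this claim numerically, here is a minimal sketch (using an assumed example function and starting point) showing that a single gradient descent step with a small \(\alpha\) reduces the value of \(g\):

import torch

# assumed example function: g(w) = ||w||^2 + sin(w_1)
def g_fn(w):
    return (w**2).sum() + torch.sin(w[0])

w = torch.tensor([1.0, -2.0], requires_grad=True)
g = g_fn(w)
g.backward()                           # fills w.grad with the gradient of g at w

alpha = 0.1
with torch.no_grad():
    w_new = w - alpha*w.grad           # one gradient descent step

print(g.item(), g_fn(w_new).item())    # the second value is smaller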
Gradient of the Empirical Risk
Remember that our big objective here was to solve the empirical risk minimization problem using gradient descent. To do this, we need to be able to calculate \(\nabla L(\mathbf{w})\), where the gradient is with respect to the entries of \(\mathbf{w}\). Fortunately, the specific linear structure of the score function \(s_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\) makes this relatively simple: indeed, we actually only need to worry about the single-variable derivatives of the per-observation loss \(\ell\). To see this, we can compute, using the chain rule,
\[
\nabla L(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\mathbf{w}}\, \ell(\langle \mathbf{w}, \mathbf{x}_i\rangle, y_i) = \frac{1}{n}\sum_{i=1}^{n} \frac{d\ell(s_i, y_i)}{ds}\, \mathbf{x}_i\;,
\]
where each derivative is evaluated at the score \(s_i = \langle \mathbf{w}, \mathbf{x}_i\rangle\).
The good news here is that for linear models, we don’t actually need to be able to compute more gradients: we just need to be able to compute derivatives of the form \(\frac{d\ell(s_i, y_i)}{ds}\) and then plug in \(s_i = \langle \mathbf{w}, \mathbf{x}_i\rangle\).
A useful fact to know about the logistic sigmoid function \(\sigma\) is that \(\frac{d\sigma(s)}{ds} = \sigma(s) (1 - \sigma(s))\). So, using that and the chain rule, the derivative we need for the binary cross entropy loss is
\[
\frac{d\ell(s_i, y_i)}{ds} = \sigma(s_i) - y_i\;,
\]
which gives the gradient of the empirical risk:
\[
\nabla L(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma(s_i) - y_i\right)\mathbf{x}_i\;. \tag{9.5}
\]
An important note about this formula that can easily trip one up: the term \(\left(\sigma(s_i) - y_i\right)\mathbf{x}_i\) looks a bit like a matrix multiplication or dot product, but it isn’t! The quantity \(\sigma(s_i) - y_i\) is just a scalar, so each term in the sum is a scalar multiple of the feature vector \(\mathbf{x}_i\).
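As a quick check of this derivative, here is a small sketch that compares torch’s automatic differentiation against the formula \(\sigma(s) - y\), assuming the standard binary cross entropy loss on the score, \(\ell(s, y) = -y\log\sigma(s) - (1-y)\log(1 - \sigma(s))\) (the specific values of \(s\) and \(y\) are arbitrary):

import torch

# binary cross entropy loss at an arbitrary score s and label y
s = torch.tensor(0.7, requires_grad=True)
y = torch.tensor(1.0)
loss = -y*torch.log(torch.sigmoid(s)) - (1 - y)*torch.log(1 - torch.sigmoid(s))
loss.backward()

# the autograd derivative agrees with sigma(s) - y
print(s.grad.item(), (torch.sigmoid(s) - y).item())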
This gives us all the math that we need in order to learn logistic regression by choosing a learning rate and iterating the update \(\mathbf{w}^{(t+1)} \gets \mathbf{w}^{(t)} - \alpha \nabla L(\mathbf{w}^{(t)})\) until convergence.
Example: Logistic Regression
Let’s see gradient descent in action for logistic regression. Our computational approach is based on the LinearModel class that you previously started implementing. The training loop is also very similar to our training loop for the perceptron. The main difference is that the loss is calculated using the binary_cross_entropy function, and the step function of the GradientDescentOptimizer works differently in a way that we will discuss below.
Starting with the code block below, you won’t be able to follow along with the code in these notes unless you have sneakily implemented logistic regression in a hidden module.
import torch

def classification_data(n_points=300, noise=0.2, p_dims=2):
    # binary labels: first half 0, second half 1
    y = torch.arange(n_points) >= int(n_points/2)
    y = 1.0*y
    # features clustered around the label value, plus a constant column
    X = y[:, None] + torch.normal(0.0, noise, size=(n_points, p_dims))
    X = torch.cat((X, torch.ones((X.shape[0], 1))), 1)
    return X, y

X, y = classification_data(noise=0.5)
The logistic regression training loop relies on a new implementation of opt.step. For gradient descent, here’s the complete implementation: just a quick Python version of the gradient descent update Equation 9.3.
def step(self, X, y, lr=0.01):
    self.model.w -= lr*self.model.grad(X, y)
The method model.grad() is the challenging part of the implementation: this is where we actually need to turn Equation 9.5 into code.
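The hidden module’s implementation isn’t shown here, but here is a minimal sketch of what such a method might look like. It assumes (hypothetically) that the model stores its weight vector in self.w and computes scores as X @ self.w; it is not necessarily the implementation used in the hidden module.

def grad(self, X, y):
    s = X @ self.w                        # scores s_i = <w, x_i>, one per observation
    v = torch.sigmoid(s) - y              # d l(s_i, y_i)/ds for each observation
    return (v[:, None]*X).mean(dim=0)     # (1/n) sum_i v_i x_i, as in Equation 9.5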
Here’s the complete training loop. This loop is very similar to our perceptron training loop – we’re just using a different loss and a different implementation of grad.
from hidden.logistic import LogisticRegression, GradientDescentOptimizer
from matplotlib import pyplot as plt

# instantiate a model and an optimizer
LR = LogisticRegression()
opt = GradientDescentOptimizer(LR)

# for keeping track of loss values
loss_vec = []

for _ in range(100):

    # not part of the update: just for tracking our progress
    loss = LR.loss(X, y)
    loss_vec.append(loss)

    # only this line actually changes the parameter value
    # the whole definition is:
    # self.model.w -= lr*self.model.grad(X, y)
    opt.step(X, y, lr=0.02)

plt.plot(torch.arange(1, len(loss_vec)+1), loss_vec, color="black")
plt.semilogx()
labs = plt.gca().set(xlabel="Number of gradient descent iterations",
                     ylabel="Loss (binary cross entropy)")
Figure 9.1: Evolution of the binary cross entropy loss function in the logistic regression training loop.
The loss quickly levels out to a constant value, indicating that our weight vector \(\mathbf{w}\) has effectively stopped changing: we have (approximately) reached an optimum. Because our theory tells us that the loss function is convex, we know that the value of \(\mathbf{w}\) we have found is the best possible, in the sense of minimizing the loss.
Let’s check our training accuracy:
(1.0*(LR.predict(X) == y)).mean()
tensor(0.9167)
Not too bad!
Recap
In these lecture notes, we introduced gradient descent as a method for minimizing functions, and showed an application of gradient descent to logistic regression. Gradient descent is especially useful when working with convex functions, since in this case it is guaranteed to converge to the global minimum of the empirical risk (provided that the learning rate \(\alpha\) is sufficiently small). The idea of gradient descent – incremental improvement of the weight vector \(\mathbf{w}\) using information about the derivatives of the loss function – is a fundamental one which has led to many variations and improvements.