$$\notag \newcommand{\be}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\br}{\mathbf{r}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} $$

Regression and Gradients

We’ve seen how to fit non-linear functions using only least squares linear regression. From next week we will also see more useful things we can do just by manipulating Gaussian distributions. However, the material so far already raises questions that we can’t answer with the tools we’ve reviewed.

To answer these questions, we’ll need to solve non-linear optimization problems. These won’t have closed-form solutions, and so we will have to approach them numerically. Many, but not all, optimization problems are solved by gradient-based methods, which we begin to look at in this note.

We start by looking at how linear least squares regression can be solved with gradient-based methods, so we can generalize it to other cases. In the next note we’ll then fit a better regression model for classification: logistic regression.

1 Gradients for linear least squares regression

We can create a vector of residuals (differences between observed values and function values) for a linear regression model as follows: \[ \br = \by - X\bw. \] And previously we noticed that NumPy knows how to minimize the sum of the square residuals: \[\begin{align} \br^\top\br &= (\by - X\bw)^\top(\by - X\bw)\\ &= \by^\top\by - 2\bw^\top X^\top\by + \bw^\top X^\top X \bw. \end{align}\] We’ll now minimize this cost function using calculus.
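Before doing so, it may help to confirm that the two forms of the cost really compute the same number. Below is a minimal NumPy sketch with made-up arrays X, yy and ww (none of these values come from a real dataset); it only checks the algebra above.

import numpy as np

# Toy shapes: X is N x D, yy has length N, ww has length D.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
yy = rng.standard_normal(5)
ww = rng.standard_normal(3)

rr = yy - X @ ww      # residuals, r = y - Xw
cost = rr @ rr        # sum of squared residuals, r^T r

# The expanded form gives the same number:
cost2 = yy @ yy - 2 * ww @ (X.T @ yy) + ww @ (X.T @ X @ ww)
assert np.isclose(cost, cost2)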

Our first task is to find the gradient of this cost function with respect to the weights. That is, the vector of partial derivatives: \(\nabla_\bw \br^\top\br\). The gradient vector is a function of the weights. At a given position in weight-space, it points in the direction in which a small movement will increase the cost the most. You were asked to show this fact in the background self-test, question 6ii), which has an answer available if you need to review this material.

We can differentiate small matrix/vector expressions by writing them as sums, and using the elementary differentiation rules for scalars. For example: \[ \pdd{\bx^\top\by}{x_i} = \pdd{\sum_j x_j y_j}{x_i} = y_i, \quad \Rightarrow \quad \nabla_\bx [\bx^\top\by] = \by. \] and \[\begin{align} \pdd{\bx^\top A \bx}{x_i} &= \pdd{\sum_{jk} x_j A_{jk} x_k}{x_i} = \sum_{k} A_{ik} x_k + \sum_{j} x_j A_{ji},\\ \Rightarrow \quad \nabla_\bx [\bx^\top A \bx] &= A\bx + A^\top \bx, \quad \text{or $2A\bx$ if $A$ is symmetric}. \end{align}\] After some experience, you might remember some of these matrix/vector rules. Other such rules can be found in references like The Matrix Cookbook. We will discuss a more systematic approach to differentiate large functions in a later note.
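These identities are easy to check numerically. The sketch below uses an arbitrary random matrix A and vector x (nothing from the notes) and compares the second rule against a central finite-difference estimate.

import numpy as np

rng = np.random.default_rng(1)
D = 4
A = rng.standard_normal((D, D))   # a general (non-symmetric) matrix
xx = rng.standard_normal(D)

# Rule: grad_x [x^T A x] = A x + A^T x
grad_rule = A @ xx + A.T @ xx

# Central finite-difference estimate of the same gradient:
eps = 1e-6
grad_fd = np.zeros(D)
for i in range(D):
    x_plus, x_minus = xx.copy(), xx.copy()
    x_plus[i] += eps
    x_minus[i] -= eps
    grad_fd[i] = (x_plus @ A @ x_plus - x_minus @ A @ x_minus) / (2 * eps)

assert np.allclose(grad_rule, grad_fd, atol=1e-5)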

For now, we can use the two rules we’ve derived to differentiate the cost function above: \[ \nabla_\bw [\br^\top\br] = -2 X^\top\by + 2X^\top X \bw. \]
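Whenever you derive a gradient by hand, it’s worth checking it against a finite-difference estimate. One convenient option is SciPy’s check_grad; the sketch below uses made-up data purely to exercise the expression above.

import numpy as np
from scipy.optimize import check_grad

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))
yy = rng.standard_normal(10)

def cost(ww):
    rr = yy - X @ ww
    return rr @ rr

def grad(ww):
    return -2 * X.T @ yy + 2 * X.T @ (X @ ww)

# check_grad returns the distance between grad() and a finite-difference
# estimate at the given point; it should be tiny.
print(check_grad(cost, grad, rng.standard_normal(3)))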

2 A closed-form solution

The partial derivatives are all zero at an optimum weight vector. If there is a unique best setting of the weights, we can solve for where that happens: \[\begin{align} -2 X^\top\by + 2X^\top X \bw &= \mathbf{0}\\ \Rightarrow\quad X^\top X \bw &= X^\top\by\\ \Rightarrow\quad \bw &= (X^\top X)^{-1}X^\top\by. \end{align}\] If (and only if) there isn’t a unique solution for the weights that give the minimum square error, then \((X^\top X)\) is not invertible and the equation isn’t defined.

The above expression is known as the normal equation solution of the least squares problem. In mathematical working it’s a “closed-form expression”, but to fit weights (to some precision) from data, we still need a numerical algorithm and don’t necessarily get exactly the right answer. You could implement the normal equations solution as follows:

# Solve the normal equations (X^T X) w = X^T y for the weights w:
ww = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, yy))

It’s best practice in implementations to solve linear systems with a dedicated solver,[1] rather than inverting the matrix \(X^\top X\). However, even this code doesn’t always give as accurate an answer as dedicated routines such as lstsq.[2]
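For a small, well-conditioned problem the two approaches agree to many decimal places. The sketch below uses made-up data and is only meant to show how lstsq is called alongside the normal-equations solve.

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))
yy = rng.standard_normal(20)

ww_normal = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, yy))  # normal equations
ww_lstsq = np.linalg.lstsq(X, yy, rcond=None)[0]              # dedicated solver

print(np.max(np.abs(ww_normal - ww_lstsq)))  # tiny here; can differ for badly-conditioned X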

2.1 Uniqueness of the solution

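As a small numerical illustration of the point made above (with made-up data): if a column of \(X\) is repeated, then \(X^\top X\) is singular, the normal-equations solve fails, and there are many weight vectors achieving the same minimum square error. lstsq still returns one of them (the minimum-norm one).

import numpy as np

rng = np.random.default_rng(4)
x1 = rng.standard_normal(10)
X = np.stack([x1, x1, rng.standard_normal(10)], axis=1)  # first two columns identical
yy = rng.standard_normal(10)

print(np.linalg.matrix_rank(np.dot(X.T, X)))   # 2, not 3: X^T X is singular

# np.linalg.solve would raise LinAlgError here, but lstsq still returns a
# (minimum-norm) weight vector achieving the smallest possible square error:
ww = np.linalg.lstsq(X, yy, rcond=None)[0]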

2.2 Another way to derive the solution


3 Iterative methods

There are also generic algorithms that iteratively improve an initial guess of some model parameters using a cost function and its gradient vectors. While linear regression has specialist solvers, we could apply these generic algorithms anyway. If they don’t work in simple cases, they’re not likely to be useful more generally!

A naive way to use the gradient \(\nabla_\bw [\br^\top\br]\) is the steepest-descent method: \[ \bw \leftarrow \bw - \eta \nabla_\bw [\br^\top\br], \] which uses a small step size \(\eta\). The parameters are moved in the direction that gives the biggest immediate decrease in the cost. The rule is applied repeatedly, with the gradient re-evaluated before each update. There are several methods (for example provided by SciPy) that will converge faster, sometimes much faster.
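Here is a minimal sketch of this loop for the linear least-squares cost, on a made-up problem. The step size and iteration count are just values that happen to work for this toy example; they are not general recommendations.

import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
yy = X @ w_true + 0.1 * rng.standard_normal(50)

ww = np.zeros(3)     # initial guess
eta = 0.005          # step size, small enough for this problem
for _ in range(500):
    grad = -2 * X.T @ yy + 2 * X.T @ (X @ ww)
    ww = ww - eta * grad

# Compare with the dedicated least-squares answer:
print(np.max(np.abs(ww - np.linalg.lstsq(X, yy, rcond=None)[0])))  # ~0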

It may help to rewrite the gradient to understand what the steepest descent rule is doing. \[\begin{align} \nabla_\bw [\br^\top\br] &= 2X^\top (X \bw) -2 X^\top\by\\ &= 2X^\top(\bff -\by)\\ &= 2\sum_n \bx^{(n)} (f^{(n)} - y^{(n)}). \end{align}\] For each example we look at the ‘prediction error’, \((f-y)\). The weights are pulled most in the direction of inputs that had large prediction errors, to reduce misfit in those directions.

In Stochastic Gradient Descent (SGD) we take just one example at a time (perhaps chosen at random, or visiting the examples in order). Each example gives a crude one-sample Monte Carlo approximation of the gradient sum: \[ \frac{1}{N}\nabla_\bw [\br^\top\br] \approx 2 \bx^{(n)} (f^{(n)} - y^{(n)}). \] We take a small step in minus this direction, then pick another example and repeat. Each time we see an example, we move the weights in a direction proportional to just one of the input vectors.
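A sketch of one way to implement this, again on made-up data. The constant step size is a simplification: in practice it is usually decayed over time, and the final weights here only match the underlying ones up to some noise.

import numpy as np

rng = np.random.default_rng(6)
N, D = 1000, 3
X = rng.standard_normal((N, D))
yy = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

ww = np.zeros(D)
eta = 0.01
for epoch in range(10):                 # a few sweeps through the data
    for n in rng.permutation(N):        # visit the examples in a random order
        f_n = X[n] @ ww                 # prediction for this one example
        ww = ww - eta * 2 * X[n] * (f_n - yy[n])

print(ww)   # roughly [1, -2, 0.5], up to noise from the constant step size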

If we have 100,000 datapoints, we perform 100,000 updates in one sweep through the dataset (a ‘training epoch’). In the traditional (batch) steepest gradient descent method, we only perform one update after looking at the whole dataset once. In the limit of an infinite stream of data, SGD can fit a model as the data comes in. A traditional batch method never gets started, as it can never compute the gradient.

Once we have a working gradient-based optimization procedure,[3] we can apply it to problems beyond linear regression. We need to identify a suitable cost function, which no longer needs to be least squares, and then obtain its gradients.
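For example, SciPy’s general-purpose optimizers only need a function returning the cost and its gradient. The sketch below applies one to the least-squares cost again, with made-up data; for a real least-squares problem you would of course use the specialist solvers above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 3))
yy = rng.standard_normal(50)

def cost_and_grad(ww):
    rr = yy - X @ ww
    return rr @ rr, -2 * X.T @ rr       # cost and its gradient

res = minimize(cost_and_grad, np.zeros(3), jac=True, method='L-BFGS-B')
print(np.max(np.abs(res.x - np.linalg.lstsq(X, yy, rcond=None)[0])))  # ~0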


4 Further Reading

Bishop covers the least squares solution in Section 3.1.1. His treatment includes a basis function expansion (replacing \(X\) with \(\Phi\)), and motivates the least squares cost from the log-likelihood of a Gaussian model.

Murphy Book 1 covers linear regression in Chapter 11, with a derivation of the least squares solution in Section 11.2.2.1. Murphy’s exposition uses a Gaussian model to motivate the least squares cost function.

Barber Chapter 17 starts with linear regression.

Numerical Recipes (Press et al.) tells you that steepest gradient descent is a bad algorithm, and describes more sophisticated alternatives. See the section on Conjugate Gradient Methods in Multidimensions (section 10.6 of the second edition or 10.8 of the third edition). The books have some nice descriptions, but I would steer clear of the code (better alternatives with free licenses are available). However, they only describe ‘batch’ methods, which look at an entire dataset before making each update. Stochastic (steepest) gradient descent is not such a bad algorithm, especially for machine learning (e.g., this survey https://arxiv.org/abs/1606.04838).

5 Check your understanding

Optional questions for those with time to be thorough on their second reading, or for exam revision later:


  1. The Cholesky-based solver in SciPy is specialized for solving with a positive-definite matrix, and would be faster here.

  2. See Solving Least Squares Problems, Lawson and Hanson (1974), Chapter 19, which argues that it’s better to use a QR decomposition of the data matrix than to lose precision by forming the summary \(X^\top X\). Murphy’s Book 1 (Section 11.2.2.3) discusses using an approach based on a QR decomposition or a Singular Value Decomposition (SVD). I must say I’m not totally convinced this care is usually necessary. For noisy and regularized machine learning problems I’ve tried, the normal equations approach seems fine, and is \(\sim\!10\times\) faster on my machine. However, I still frequently use the least squares solvers: they’re still quite fast, convenient, and I don’t have to keep checking whether the normal equations approach will be accurate.

  3. And we might find one slightly more sophisticated than described above.