$$\notag \newcommand{\bc}{\mathbf{c}} \newcommand{\be}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\nth}{^{(n)}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} \newcommand{\tp}{\!+\!} \newcommand{\ttimes}{\!\times\!} $$

Linear regression

Much of machine learning is about fitting functions to data. That may not sound like an exciting activity, or one that will give us artificial intelligence. However, representing and fitting functions is a core building block of most working machine learning or AI systems. We start with linear functions, both because this idea turns out to be surprisingly powerful, and because it’s a useful starting point for more interesting models and methods.

Video: Affine functions and linear least squares fitting (19 minutes)
Some pictures of affine functions, some notation, and an observation that there are standard routines to fit linear maps to data by least squares. (Ends with the question: why least squares?)

1 Affine functions

An affine function (also called a linear function1) of a vector \(\bx\) takes a weighted sum of each input and adds a constant. For example, for \(D\te3\) inputs \(\bx = [x_1~x_2~x_3]^\top\), a general (scalar) affine function is: \[ f(\bx; \bw, b) = w_1x_1 + w_2x_2 + w_3x_3 + b = \bw^\top\bx + b, \] where \(\bw\) is a \(D\)-dimensional vector of “weights”. The constant offset or “bias weight” \(b\) gives the value of the function at \(\bx = \mathbf{0}\) (the origin). In machine learning, the numbers in \(\bx\) are the input features (also called just “inputs” or “features”) describing a setting that we want our model to consider.
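
As a tiny NumPy sketch (the weights, bias and input below are made-up numbers, purely for illustration), evaluating such a function is a one-liner:

import numpy as np

# Made-up parameters and input, illustrating f(x; w, b) = w^T x + b with D = 3
ww = np.array([0.1, -2.0, 0.5])   # weights w
b = 3.0                           # bias b
xx = np.array([1.0, 0.0, 2.0])    # one input vector x
f = ww @ xx + b                   # scalar value: 0.1*1 - 2.0*0 + 0.5*2 + 3.0 = 4.1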

Some example functions are below. These models are wrong, but you can think about whether they might be useful, and whether you could come up with better ones.

  1. Spammyness of an email: inputs \(\bx\) could be zeros and ones for presence/absence of words.
  2. Predicted strength of a metal alloy: inputs \(\bx\) could be the amounts of different metals we include.
  3. Predicted height a sunflower plant will grow: a simple model might use only one input \(\bx\) (here just a scalar, \(x\)), the amount of fertilizer we give the plant.


2 Fitting to data

[Please see our background page on notation if you have queries.]

We will fit the function to a training set of \(N\) input-output pairs \(\{\bx\nth,y\nth\}_{n=1}^N\). For example, we make \(N\) different alloys, with the amounts of metals described by the inputs \(\{\bx\nth\}\), and measure the target outputs, the true strengths of these alloys \(\{y\nth\}\), which we would like our model to predict.

Expressing the data and function values using linear algebra makes the maths easier in the long run, and makes it easier to write fast array-based code. We stack all of the observed outputs into a column vector \(\by\), and the inputs into an \(N\ttimes D\) “design matrix” \(X\): \[ \by = \left[ \begin{array}{c}y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{array} \right], \qquad X = \left[ \begin{array}{c}\bx^{(1)\top} \\ \bx^{(2)\top} \\ \vdots \\ \bx^{(N)\top} \end{array} \right] = \left[ \begin{array}{cccc} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} \\ \end{array} \right]. \] Elements of these arrays are \(y_n = y^{(n)}\), and \(X_{n,d} = x_d\nth\). Each row of the design matrix gives the features for one training input vector. We can simultaneously evaluate the function at every training input with one matrix-vector multiplication: \[ \bff = X\bw + b, \] where the scalar bias \(b\) is added to each element of the vector \(X\bw\). Each element of the vector on the left-hand side gives the function at one training input: \(f_n = f(\bx\nth;\bw,b) = \bw^\top\bx\nth + b\).

We can compute the total square error of the function values above, compared to the observed training set values: \[ \sum_{n=1}^N \left[ y\nth - f(\bx\nth; \bw, b) \right]^2 = (\by - \bff)^\top (\by - \bff). \] The least-squares fitting problem is finding the parameters that minimize this error.
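
As a minimal sketch of both expressions above (all sizes and values below are made up for illustration), the function values and the total square error each take one line of NumPy:

import numpy as np

# Made-up data and parameters
N, D = 5, 3
X = np.random.randn(N, D)          # design matrix, one input per row
yy = np.random.randn(N)            # observed outputs y
ww = np.random.randn(D)            # weights w
b = 0.7                            # bias b

ff = X @ ww + b                    # function values at all N training inputs
resid = yy - ff                    # residual vector y - f
assert np.isclose(np.sum(resid**2), resid @ resid)  # the two error expressions agree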


2.1 Fitting weights with \(b=0\)

To keep the maths simpler, we will temporarily assume our function goes through the origin. That is, we’ll assume \(b=0\). Thus we are fitting the “linear map”: \[ \by \approx \bff = X\bw. \] Fitting \(\bw\) to this approximate relationship by least-squares is so common that programming languages for scientific computing like Matlab/Octave make fitting the parameters astonishingly easy2:

% Matlab sizes: w_fit is Dx1 if X is NxD and yy is Nx1
w_fit = X \ yy;

In this course we’re using Python+NumPy, so the code to fit the weights isn’t quite so short, but is still one line of code:

# NumPy shapes: w_fit is (D,) if X is (N,D) and yy is (N,)
w_fit = np.linalg.lstsq(X, yy, rcond=None)[0]

Video: Checking basic use of np.linalg.lstsq (7 minutes)
Whenever I’m about to use a new function in code, I tend to quickly try an example, to check I’ve understood how to call it. This video is an example.
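
In that spirit, here is the sort of quick, throwaway check one might run (with made-up sizes and weights): build noise-free data from known weights and confirm that lstsq recovers them.

import numpy as np

# Noise-free synthetic data from known weights; lstsq should recover them
N, D = 100, 3
w_true = np.array([1.5, -2.0, 0.3])
X = np.random.randn(N, D)
yy = X @ w_true                                  # targets lie exactly on a linear map

w_fit = np.linalg.lstsq(X, yy, rcond=None)[0]
assert np.allclose(w_fit, w_true)                # matches up to numerical error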

2.2 Fitting more general functions

Video: Fitting more general functions with linear least squares (12 minutes)
Augmenting the inputs to a regression problem lets us fit affine functions, and much more, using a least squares routine for fitting linear maps.

We’ll return to how the lstsq fitting routine works later. For now, we’ll assume it’s available, and see what we can do with it.

In machine learning, the same code can often solve different tasks simply by applying it to different representations of our data. In the rest of this note, and the next, we will solve multiple tasks with the same linear least-squares primitive operation.

We assumed that our function passed through the origin. We can remove that assumption simply by creating a new version of our design matrix. We add an extra column that contains a one in every row: \[ \tilde{X} = \left[ \begin{array}{cc}\bx^{(1)\top} & 1 \\ \bx^{(2)\top}&1 \\ \vdots&\vdots~~ \\ \bx^{(N)\top}&1 \end{array} \right] = \left[ \begin{array}{ccccc} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} & 1\\ x_1^{(2)} & x_2^{(2)} & \cdots & x_D^{(2)} & 1\\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_1^{(N)} & x_2^{(N)} & \cdots & x_D^{(N)} & 1\\ \end{array} \right]. \] In Python with NumPy imported:

X_bias = np.concatenate([X, np.ones((X.shape[0],1))], axis=1) # (N,D+1)

Then we fit least squares weights exactly as before, just using \(\tilde{X}\) or X_bias, instead of the original design matrix \(X\). If our input was \(D\)-dimensional before, we will now fit \(D\tp1\) weights, \(\tilde{\bw}\). The last of these is always multiplied by one, and so is actually the bias weight \(b\), while the first \(D\) weights give the regression weights for our original design matrix: \[ \tilde{X}\tilde{\bw} = X\tilde{\bw}_{1:D} + \tilde{w}_{D+1} = X\bw + b. \]
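
A minimal sketch of the whole recipe (the data here are synthetic, generated from made-up weights and bias so the answer can be checked): augment the design matrix, fit as before, then split the fitted vector into \(\bw\) and \(b\).

import numpy as np

# Synthetic data from a known affine function
N, D = 100, 3
w_true = np.array([0.5, -1.0, 2.0])
b_true = -3.0
X = np.random.randn(N, D)
yy = X @ w_true + b_true

# Augment with a column of ones, fit as before, then split off the bias weight
X_bias = np.concatenate([X, np.ones((N, 1))], axis=1)   # (N, D+1)
w_tilde = np.linalg.lstsq(X_bias, yy, rcond=None)[0]    # (D+1,) weights
w_fit, b_fit = w_tilde[:-1], w_tilde[-1]
assert np.allclose(w_fit, w_true) and np.isclose(b_fit, b_true)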

2.2.1 Polynomials

We can go further, replacing the design matrix with a new matrix \(\Phi\). Each row of this matrix is an arbitrary vector-valued function of the original input: \(\Phi_{n,:} = \bphi(\bx\nth)^\top\). If the function is non-linear, then our function \(f(\bx)\te\bw^\top\bphi(\bx)\) will be non-linear in \(\bx\). However, we can still use linear-regression code to fit the model, as the function is still a linear map of a known vector, \(\bphi(\bx)\).

The introductory example you’ll see in most textbooks is fitting a polynomial curve to one-dimensional data. Each column of the new design matrix \(\Phi\) is a monomial of the original feature: \[ \Phi = \left[ \begin{array}{ccccc} 1 & x^{(1)} & {(x^{(1)})}^2 & \cdots & {(x^{(1)})}^{K-1} \\ 1 & x^{(2)} & {(x^{(2)})}^2 & \cdots & {(x^{(2)})}^{K-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(N)} & {(x^{(N)})}^2 & \cdots & {(x^{(N)})}^{K-1} \\ \end{array} \right]. \] Using \(\Phi\) as our design matrix or data matrix we can then fit the model \[ y \approx f = \bw^\top\bphi(x) = w_1 + w_2x + w_3x^2 + \dots + w_K x^{K-1}, \] which is a general polynomial of degree \(K\tm1\).
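
For one-dimensional inputs, a sketch of building \(\Phi\) and fitting the polynomial (the data below are made up, generated from a cubic so the fit can be checked; np.vander is one convenient way to stack the powers):

import numpy as np

# Made-up 1D inputs and targets that genuinely come from a cubic
K = 4                                             # K columns give a degree K-1 = 3 polynomial
xx = np.linspace(-1, 1, 20)                       # inputs, shape (N,)
yy = 1.0 + 2.0*xx - 0.5*xx**3                     # targets

Phi = np.vander(xx, K, increasing=True)           # columns: 1, x, x^2, x^3
w_fit = np.linalg.lstsq(Phi, yy, rcond=None)[0]
assert np.allclose(w_fit, [1.0, 2.0, 0.0, -0.5])  # recovers the cubic's coefficients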

We could generalize the transformation for multivariate inputs, \[ \bphi(\bx) = [1~~x_1~~x_2~~x_3~~x_1x_2~~x_1x_3~~x_2x_3~~x_1^2~~\dots]^\top, \] and hence fit a multivariate polynomial function of our original features. Given that a general polynomial includes cross terms like \(x_1x_3\), \(x_1x_2x_3\), the number of columns in \(\Phi\) could be large.
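
As a sketch of how such features could be generated (the helper below is my own illustration, not a standard routine), the degree-2 version already has roughly \(D^2/2\) columns:

import numpy as np
from itertools import combinations_with_replacement

def poly2_features(X):
    # Columns: a constant, each x_d, and every product x_i * x_j with i <= j
    N, D = X.shape
    cols = [np.ones(N)]                                     # constant term
    cols.extend(X[:, d] for d in range(D))                  # linear terms
    cols.extend(X[:, i] * X[:, j]                           # squares and cross terms
                for i, j in combinations_with_replacement(range(D), 2))
    return np.stack(cols, axis=1)

Phi = poly2_features(np.random.randn(5, 3))                 # shape (5, 1 + 3 + 6) = (5, 10)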

Polynomials are usually taught first in introductions to regression, because the idea of fitting a polynomial curve may already be familiar. However, polynomial fits are not actually used very often in machine learning. They’re probably avoided for two reasons: 1) the number of features grows quickly for high-dimensional inputs; 2) polynomials rapidly take on extreme values as the input \(\bx\) moves away from the origin. Of course there are exceptions3.

Video: Basis functions for linear regression (29 minutes)
Moving on from fitting polynomials: Radial Basis Functions (RBFs), sigmoidal basis functions, multivariate polynomials.

2.2.2 Basis functions

Instead of creating monomial features, we can transform our data with any other vector-valued function: \[ \bphi(\bx) = [\phi_1(\bx)~~\phi_2(\bx)~~\dots~~\phi_K(\bx)]^\top. \] By convention, each \(\phi_k\) is called a basis function. The function we fit is a linear combination of these basis functions: \[ f(\bx) = \bw^\top\bphi(\bx) = \sum_k w_k \phi_k(\bx). \] We don’t have to include a constant or bias term in the mathematics, because we can always set one of the \(\phi_k\) functions to a constant.

One possible choice for a basis function \(\phi_k\) is a Radial Basis Function (RBF): \[ \exp(-(\bx-\bc)^\top(\bx-\bc)/h^2), \] where different basis functions can have different parameters \(\bc\) and \(h\). The function is proportional to a Gaussian probability density function (although it is not a probability density in this context). The function has a bell-curve shape centred at \(\bc\), with ‘bandwidth’ \(h\). The bell-curve shape is radially symmetric: the function value only depends on the radial distance from the centre \(\bc\).
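
A sketch of this basis function in NumPy, evaluated at every row of a design matrix at once (the function name and argument names are mine):

import numpy as np

def rbf(X, cc, hh):
    # exp(-||x - c||^2 / h^2) for each row x of X; cc is a D-vector centre, hh a scalar bandwidth
    return np.exp(-np.sum((X - cc)**2, axis=1) / hh**2)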

Another possible choice is a logistic-sigmoid function: \[ \sigma(\bv^\top\bx + b) = \frac{1}{1 + \exp(-\bv^\top\bx-b)}. \] This sigmoidal or “s-shaped” curve saturates at zero and one for extreme values of \(\bx\). The parameters \(\bv\) and \(b\) determine the steepness and position of the curve.
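
Similarly, a sketch of the logistic-sigmoid basis function, again applied to every row of a design matrix (names are mine):

import numpy as np

def sigmoid_basis(X, vv, b):
    # sigma(v^T x + b) for each row x of X; vv is a D-vector, b a scalar
    return 1.0 / (1.0 + np.exp(-(X @ vv + b)))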

For each column of the matrix of basis functions \(\Phi\), we will choose a basis function, along with its free parameters such as \(\{\bc,h\}\) for an RBF, or \(\{\bv,b\}\) for the logistic-sigmoid. After sketching/plotting these basis functions (next note) you should develop some intuitions for setting these parameters by hand. Ultimately we will also want to set these parameters automatically.
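
Putting the pieces together, a sketch (with made-up 1D data and hand-picked centres, bandwidths and slopes, reusing the rbf and sigmoid_basis functions sketched above) of stacking basis-function values into \(\Phi\) and fitting exactly as before:

import numpy as np

# Made-up 1D data; the basis-function parameters below are hand-picked guesses
X = np.linspace(-3, 3, 50)[:, None]                   # design matrix, shape (N, 1)
yy = np.sin(X[:, 0]) + 0.1*np.random.randn(50)        # noisy targets

Phi = np.stack([
    np.ones(50),                                      # constant basis function (bias)
    rbf(X, np.array([-1.0]), 1.0),                    # RBF centred at -1
    rbf(X, np.array([+1.0]), 1.0),                    # RBF centred at +1
    sigmoid_basis(X, np.array([2.0]), 0.0),           # sigmoid with slope 2 through the origin
], axis=1)                                            # Phi has shape (N, 4)

w_fit = np.linalg.lstsq(Phi, yy, rcond=None)[0]       # fit the weights as before
ff = Phi @ w_fit                                      # fitted function values at the inputs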


3 Summary

Using just linear regression, with one line of fitting code, we can fit flexible regression models with many parameters. The trick is to construct a large design matrix, where each column contains one basis function evaluated at every original input. Each basis function should have a different position and/or shape.

Models built with radial-basis functions and sigmoidal functions extrapolate more conservatively than polynomials for extreme inputs. Radial-basis functions change the function close to a reference point, while sigmoidal functions change the function in half of the space. Neither of these families of basis functions has fundamental status, however, and other basis functions are also used.


  1. The Wikipedia page for Linear function explains the terminology.↩︎

  2. For more on what the backslash “\” does in Matlab, you can see the documentation for mldivide.↩︎

  3. One case where I might consider polynomials is when I have sparse binary features. That is, \(x_d\in\{0,1\}\), where only a few of the features are non-zero. If the features are binary, none of the polynomial terms take on extreme values. ‘Interaction terms’, such as \(x_1x_3\), detect whether it’s important for features to be on at the same time. These extra features are more sparse than the original features, and in careful implementations might not create much more work.↩︎