$$\notag \newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\ith}{^{(i)}} \newcommand{\jth}{^{(j)}} \newcommand{\kth}{^{(k)}} \newcommand{\la}{\!\leftarrow\!} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} $$

Class test

This is the class test, as described in the background notes. This test forms 20% of your overall mark (not just the mark for Week 8).

You can edit and resubmit your answers as many times as you like until the deadline.

Deadline: The submission deadline for this class test is Tuesday 10 November at 10am (UK time, UTC). This is a hard deadline: this course does not permit extensions, and any work submitted after the deadline will receive a mark of zero. See the late work policy.

Queries: You may not discuss the class test with others during this 24-hour period. You may not post to Hypothesis or other forums about the test during this period.

Please only answer what’s asked. Markers will reward succinct, to-the-point answers. You can put any other observations under the “Add any extra notes” button (but this is for your own record, or to point out things that seemed strange, not to get extra credit). Some questions ask for discussion and so are open-ended; they probably have no perfect answer. For these, stay within the stated word limits, and limit the amount of time you spend on them (they are a small part of your final mark).

Feedback: We’ll return feedback on your submission via email by Friday 20 November.

Good Scholarly Practice: Please remember the University’s requirements for all assessed work for credit. Furthermore, you are required to take reasonable measures to protect your assessed work from unauthorised access. For example, if you put any such work on a public repository then you must set access permissions appropriately (permitting access only to yourself). You may not publish your solutions after the deadline either.

1 Regularized regression with variable noise variances

A regression task has \(N\) training pairs \(\{(\bx^{(n)},y^{(n)})\}_{n=1}^N\), where \(\bx^{(n)}\) is a \(D\)-dimensional vector of input features and \(y^{(n)}\) is a target output. A different noise variance \(\sigma_{y^{(n)}}^2\) is associated with each target output \(y^{(n)}\). A simple linear regression model is fitted by minimizing the following cost function with respect to the weights \(\bw\), where the noise variances are taken into account by precision factors in the sum: \[ c(\bw,\lambda) = \lambda\bw^\top\bw + \sum_{n=1}^N\frac{1}{\sigma_{y^{(n)}}^2}(\bw^\top\bx^{(n)}-y^{(n)})^2. \]
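For concreteness, this cost function could be evaluated as in the following minimal NumPy sketch (the names `cost`, `X`, `y`, `lam`, and `noise_var` are illustrative, not part of the question):

```python
import numpy as np

def cost(w, lam, X, y, noise_var):
    """Weighted ridge cost c(w, lambda).

    X: (N, D) inputs; y: (N,) targets;
    noise_var: (N,) per-target noise variances sigma_{y^(n)}^2.
    """
    residuals = X @ w - y  # w^T x^(n) - y^(n) for each n
    return lam * w @ w + np.sum(residuals**2 / noise_var)
```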

  1. Describe the simplest reasonable way to choose \(\lambda\) from a set of candidate values (e.g., \(\lambda \in \{0,0.01, 0.1, 1, 10, 100\}\)). Write no more than 3 sentences. [15 marks]

  2. Assume now that we wish to fit a regression model that uses \(K\) basis functions and still takes the target output noise variances into account. Write down a cost function for fitting this model, making sure to define any notation you introduce. [10 marks]

2 Filters for image classification

A gray-scale image is represented by a vector \(\bx\) containing \(D\) pixel intensities. Given two non-overlapping regions \(\mathcal{R}_1\) and \(\mathcal{R}_2\), the average pixel values for these regions are: \[ a_1 = \frac{1}{|\mathcal{R}_1|} \sum_{d \in \mathcal{R}_1} x_d\,, \qquad a_2 = \frac{1}{|\mathcal{R}_2|} \sum_{d \in \mathcal{R}_2} x_d\,. \]
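As a minimal sketch of the averages above (assuming the regions are given as illustrative index arrays `R1` and `R2` into the image vector `x`):

```python
import numpy as np

def region_averages(x, R1, R2):
    """Average intensities a_1 and a_2 over two pixel-index sets."""
    a1 = x[R1].mean()  # mean intensity over region R_1
    a2 = x[R2].mean()  # mean intensity over region R_2
    return a1, a2
```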

  1. Define a weight vector \(\bw\) such that \(\bw^\top\bx = a_1 \tm a_2\), the difference in average pixel value between the two regions. [10 marks]

\(K\) different pairs of regions \(\{(\mathcal{R}_1\kth,\mathcal{R}_2\kth)\}_{k=1}^K\) are selected to construct a set of weights \(\{\bw\kth\}_{k=1}^K\) as above. These weights define a feature vector: \[ \bphi(\bx) = [\bw^{(1)\top}\kern-1pt\bx ~~~~ \bw^{(2)\top}\kern-1pt\bx ~~~\cdots~~~ \bw^{(K)\top}\kern-1pt\bx]^\top. \] A colleague is keen to use an established logistic regression implementation for a specialized image classification task, but finds that using the raw pixels doesn’t work. They then use the hand-crafted features \(\bphi(\bx)\) to fit logistic regression parameters \((\bv,b)\) for the predictor: \[ P(y\te 1\g\bx,\bv,b) = \sigma(\bv^\top\bphi(\bx) + b), \quad \text{where}~~\sigma(a) = 1/(1+\exp(-a)). \]
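As a minimal sketch of this predictor, assuming the weight vectors \(\bw\kth\) are stacked as the rows of an illustrative \(K \times D\) matrix `W` (the names `W`, `v`, and `b` below are assumptions, not fixed by the question):

```python
import numpy as np

def predict_prob(x, W, v, b):
    """P(y=1 | x, v, b) using features phi(x) = W x.

    W: (K, D) matrix whose rows are the w^(k); v: (K,) weights; b: scalar bias.
    """
    phi = W @ x                      # K change-detection features phi(x)
    a = v @ phi + b                  # logistic regression activation
    return 1.0 / (1.0 + np.exp(-a))  # sigma(a)
```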

  1. Explain why using the change-detection features in this way didn’t help. Write no more than 3 sentences. [15 marks]

  2. Suggest an incremental improvement that your colleague could easily adopt and that might immediately improve classification performance. Write no more than 2 sentences. [10 marks]

3 Gaussian process hyperparameters

We have a regression problem where some of the input features \(x_d\) are not useful for predicting the output, but we don’t know which features aren’t useful before seeing the training data. We fit a Gaussian process with kernel: \[ k(\bx\ith, \bx\jth) \;=\; \sigma_f^2\, \exp\!\bigg(\!-{\textstyle\frac{1}{2}} \sum_{d=1}^D (x_d\ith - x_d\jth)^2 / \ell_d^2 \bigg). \] Explain what we hope will happen to the \(\ell_d\) hyperparameters when we maximize the marginal likelihood. Write no more than 3 sentences. [15 marks]
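For reference, a minimal NumPy sketch of evaluating this kernel between two sets of inputs (the names `X1`, `X2`, `sigma_f`, and `ells` are illustrative):

```python
import numpy as np

def kernel(X1, X2, sigma_f, ells):
    """Kernel matrix k(x^(i), x^(j)) between the rows of X1 (N1 x D) and X2 (N2 x D).

    ells: (D,) array of lengthscales ell_d; sigma_f: scalar signal amplitude.
    """
    Z1 = X1 / ells  # scale dimension d by 1/ell_d
    Z2 = X2 / ells
    sq_dists = (np.sum(Z1**2, axis=1)[:, None]
                + np.sum(Z2**2, axis=1)[None, :] - 2 * Z1 @ Z2.T)
    return sigma_f**2 * np.exp(-0.5 * sq_dists)
```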

4 Held-out set vs. marginal likelihood

Jo suggests choosing the hyperparameters of a Gaussian process by picking the settings that give the lowest loss on a held-out validation set. Compare this idea to the more standard approach of optimizing the marginal likelihood. Write no more than 100 words. [25 marks]