$$\notag \newcommand{\ba}{\mathbf{a}} \newcommand{\be}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bg}{\mathbf{g}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bx}{\mathbf{x}} \newcommand{\nth}{^{(n)}} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} $$

Week 10 exercises

This is the seventh page of assessed questions, as described in the background notes. These questions form 70% of your mark for Week 10; the introductory questions in the notes and the Week 10 discussion group task form the remaining 30%.

Unlike the questions in the notes, you’ll not immediately see any example answers on this page. However, you can edit and resubmit your answers as many times as you like until the deadline (Friday 27 November, 6pm UK time, UTC). This is a hard deadline: this course does not permit extensions, and any work submitted after the deadline will receive a mark of zero. See the late work policy.

Queries: Please don’t discuss/query the assessed questions on Hypothesis until after the deadline. If you think there is a mistake in a question this week, please email Iain.

Please only answer what’s asked. Markers will reward succinct, to-the-point answers. You can put any other observations in the “Add any extra notes” box (but this is for your record, or to point out things that seemed strange, not to get extra credit). Some questions ask for discussion, and so are open-ended, and probably have no perfect answer. For these, stay within the stated word limits, and limit the amount of time you spend on them (they are a small part of your final mark).

Feedback: We’ll return feedback on your submission via email by Friday 4 December.

Good Scholarly Practice: Please remember the University requirements for all assessed work for credit. Furthermore, you are required to take reasonable measures to protect your assessed work from unauthorised access. For example, if you put any such work on a public repository then you must set access permissions appropriately (permitting access only to yourself). You may not publish your solutions after the deadline either.

1 Linear and non-linear autoencoders

We centre our data so it has zero mean and fit a linear autoencoder with no bias parameters. The autoencoder is a \(D\)-dimensional vector-valued function \(\bff\), computed from \(D\)-dimensional inputs \(\bx\), using an intermediate \(K\)-dimensional “hidden” vector \(\bh\): \[\begin{align} \notag \bh &= W^{(1)}\bx\\ \notag \bff &= W^{(2)}\bh. \end{align}\] Assume we want to find a setting of the parameters that minimizes the square error \(\|\bff-\bx\|^2 = (\bff - \bx)^\top(\bff-\bx)\), averaged (or summed) over training examples.
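A minimal NumPy sketch of the forward pass and per-example cost may help fix the shapes. The variable names `W1` and `W2` are illustrative assumptions, not notation used elsewhere in the questions:

```python
import numpy as np

def autoencoder_cost(W1, W2, x):
    """Squared error ||f - x||^2 for one input x; W1 is (K, D), W2 is (D, K)."""
    h = W1 @ x   # K-dimensional hidden vector
    f = W2 @ h   # D-dimensional reconstruction
    return (f - x) @ (f - x)

# Example with random parameters:
rng = np.random.default_rng(0)
D, K = 5, 2
W1 = rng.standard_normal((K, D))
W2 = rng.standard_normal((D, K))
x = rng.standard_normal(D)
print(autoencoder_cost(W1, W2, x))
```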

  1. The PCA solution sets \(W^{(1)}\te V^\top\) and \(W^{(2)}\te V\), where the columns of \(V\) contain eigenvectors of the covariance of the inputs. We only really need to fit one matrix to minimize square error: tying the weight matrices together, with \(W^{(1)}\te U^\top\) and \(W^{(2)}\te U\), we can fit a single matrix \(U\).

    Give a series of equations that show how to compute \(\bar{U}\) by reverse-mode differentiation (or “backprop”). Here \(\bar{U}_{ij} = \pdd{c}{U_{ij}}\), where \(c = (\bff - \bx)^\top (\bff - \bx)\) is the cost for a single input.

    List the equations in the order that we would compute them. Assume that we have already performed a forward pass of the network, and have computed and stored \(\bh\) and \(\bff\) for the current input \(\bx\). [15 marks]
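    You don’t need code to answer this question, but a finite-difference check is one way to test the equations you derive. A minimal sketch, assuming a NumPy setup (the helper names are illustrative, and the double loop is slow, so only for small checks):

    ```python
    import numpy as np

    def cost(U, x):
        """Squared reconstruction error for the tied autoencoder f = U U^T x."""
        h = U.T @ x
        f = U @ h
        return (f - x) @ (f - x)

    def fd_grad(U, x, eps=1e-5):
        """Central-difference approximation to Ubar, element by element."""
        G = np.zeros_like(U)
        for i in range(U.shape[0]):
            for j in range(U.shape[1]):
                Up, Um = U.copy(), U.copy()
                Up[i, j] += eps
                Um[i, j] -= eps
                G[i, j] = (cost(Up, x) - cost(Um, x)) / (2 * eps)
        return G

    # Example check with small random matrices:
    rng = np.random.default_rng(0)
    D, K = 5, 2
    U, x = rng.standard_normal((D, K)), rng.standard_normal(D)
    print(fd_grad(U, x))  # compare against the Ubar from your equations
    ```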

  2. For the following questions i-iii, assume some 2-dimensional points lie along the one-dimensional circumference of a semi-circle. You could create such a dataset by drawing one of the features from a uniform distribution between \(-1\) and \(+1\), and setting the other feature based on that: \[\begin{align} \notag x_1\nth &\sim \text{Uniform}[-1,1]\\ \notag x_2\nth &= \sqrt{1- \big(x_1\nth{\big)}^2}. \end{align}\] (A code sketch for generating such a dataset is given after part iii below.)

    1. Explain why these points can’t be perfectly reconstructed when passed through the linear autoencoder defined above with \(K\te1\). [10 marks]

    2. Explain whether the points could be perfectly reconstructed with \(K\te1\) by some non-linear decoder, \(\bff = \bg(h)\), where \(\bg\) could be an arbitrary function, perhaps represented by multiple neural network layers. Assume the encoder is still linear: \(h = W^{(1)}\bx\). [10 marks]

    3. Explain whether the points could be perfectly reconstructed with \(K\te1\) when using some non-linear encoder, \(h = g(\bx)\), where \(g\) could again be an arbitrary function, perhaps represented by multiple neural network layers. Assume the decoder is still linear: \(\bff = W^{(2)}h\). [10 marks]
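    A minimal NumPy sketch of generating such a dataset, following the equations above (\(N\) and the variable names are illustrative assumptions):

    ```python
    import numpy as np

    # Sample N points on the upper unit semi-circle.
    rng = np.random.default_rng(0)
    N = 100
    x1 = rng.uniform(-1, 1, size=N)
    x2 = np.sqrt(1 - x1**2)
    X = np.stack([x1, x2], axis=1)  # (N, 2) array, one point per row
    ```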

  3. Discuss advantages and disadvantages of the four different high-level architectures for an autoencoder (a sketch instantiating the four options is given below):

    1. linear encoder, linear decoder
    2. linear encoder, non-linear decoder
    3. non-linear encoder, linear decoder
    4. non-linear encoder, non-linear decoder

    [25 marks]
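    To make the four options concrete, here is a minimal sketch instantiating each combination. Everything here is an illustrative assumption, not part of the question: \(\tanh\) is an arbitrary choice of nonlinearity, and the sizes and variable names are made up.

    ```python
    import numpy as np

    # Illustrative sizes: D inputs, K-dimensional code, M hidden units.
    rng = np.random.default_rng(0)
    D, K, M = 2, 1, 16

    def lin(z, W):                  # a linear map
        return W @ z

    def nonlin(z, W1, W2):          # a small non-linear neural network
        return W2 @ np.tanh(W1 @ z)

    x = rng.standard_normal(D)
    A = rng.standard_normal((K, D));  B = rng.standard_normal((D, K))
    A1 = rng.standard_normal((M, D)); A2 = rng.standard_normal((K, M))
    B1 = rng.standard_normal((M, K)); B2 = rng.standard_normal((D, M))

    f1 = lin(lin(x, A), B)                    # 1. linear enc., linear dec.
    f2 = nonlin(lin(x, A), B1, B2)            # 2. linear enc., non-linear dec.
    f3 = lin(nonlin(x, A1, A2), B)            # 3. non-linear enc., linear dec.
    f4 = nonlin(nonlin(x, A1, A2), B1, B2)    # 4. non-linear enc., non-linear dec.
    ```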