$$\notag \newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\te}{\!=\!} \newcommand{\ttimes}{\!\times\!} $$

Week 2 exercises

Each week after this one will have a page of assessed questions, as described in the background notes. The questions this week are for practice, and are not assessed. However, you do need to do the programming questions, because we’ll build on them next week (when the exercise will also get more interesting!).

Unlike the questions in the notes, you’ll not immediately see any example answers on this page. However, you can edit and resubmit your answers as many times as you like until the deadline (Friday 2 October 4pm UK time).

Normally these questions will be entirely based on material from the previous week(s). The programming part this week does refer to training/validation/test sets, even though they were only introduced this week. However, we don’t really require any of this week’s material: we just need to split up an array, which involves NumPy skills you should practice now.

Please only answer what’s asked. When the questions are assessed, markers will reward succinct, to-the-point answers. You can put any other observations under the “Add any extra notes” button (but this is for your own record, or to point out things that seemed strange, not to get extra credit).

1 Maths: Linear regression

Alice fits a function \(f(\bx) = \bw^\top\bx\) to a training set of \(N\) datapoints \(\{\bx^{(n)},y^{(n)}\}\) by least squares. The inputs \(\bx\) are \(D\)-dimensional column vectors. You can assume a unique setting of the weights \(\bw\) minimizes the square error on the training set.

Bob has heard that by transforming the inputs \(\bx\) with a vector-valued function \(\bphi\), he can fit an alternative function, \(g(\bx) = \bv^\top\bphi(\bx)\), with the same fitting code. He decides to use a linear transformation \(\bphi(\bx) = A\bx\), where \(A\) is an invertible matrix.

  1. Show that Bob’s procedure will fit the same function as Alice’s original procedure.

    [Guidance (you won’t always get lots of guidance): You don’t have to do any extensive mathematical manipulation. You also don’t need a mathematical expression for the least squares weights. Instead: reason about the sets of functions that Alice and Bob are choosing their functions from. For comparison, we didn’t need to say what the least squares solution was when discussing nested models in the notes.]

  2. A colleague asks whether Bob’s procedure could be better than Alice’s if the matrix \(A\) is not invertible. What do you tell them?

    [If you need a hint, it may help to remind yourself of the discussion involving invertible matrices in the pre-test answers. If you think your colleague’s question is vague or unclear, then you can say why as part of your answer, and then answer what you think the most sensible version of the question is.]

We’ve kept this section short this week, because some of you will still be settling into the course, and catching up on background and tools.

Future weeks will have some open-ended questions to distinguish between the top grades. However, all questions will require fairly short answers to keep the amount of work under control. At times we will be assessing good judgement, without reminding you (in the question) of best practice. For example, later in the course you won’t get a good mark if you make a mistake that a test would have caught, or if you fit a model to a test set.

2 Programming: Modelling audio

Background: Raw audio data is represented as a sequence of amplitudes. Lossless audio compression systems (like flac) use a model to predict each amplitude in turn, given the sequence so far. The residuals of the predictions are compressed and stored, from which the audio file can be reconstructed. We’ll do some initial exploratory analysis of an audio file. Really we’re just using the file as a convenient source of a lot of data points for an exercise.

Download the data here (65 MB).

Programming: You must use Python+NumPy+Matplotlib, and may not use any other libraries (e.g., not SciPy, pandas, or sklearn), or code written by other people. When we suggest functions you could use (e.g., np.load), you can get quick help at the Python prompt with the help function, e.g., help(np.load).

  1. Getting started: Load a long array into Python:
    amp_data = np.load('amp_data.npz')['amp_data']

    1. Plot a line graph showing the sequence in amp_data, and a histogram of the amplitudes in this sequence. Include the code for your plots, with one to three sentences about anything you notice that might be important for modelling these data.

    We will now create a dataset that is convenient for trying different regression models, without some of the complications of responsible time series modelling. Take the vector in amp_data and wrap it into a \(C\ttimes21\) matrix, where each row contains 21 amplitudes that were adjacent in the original sequence. As the vector’s length isn’t a multiple of 21, you’ll have to discard some elements before reshaping the data.

    It should be clear from your plot that the distribution over amplitudes changes over time. Randomly shuffle the rows of the matrix. Then split the data into parts for training (70%), validation (15%), and testing (15%). Each dataset should take the first \(D\te20\) columns as a matrix of inputs X, and take the final column to create a vector of targets y. Name the resulting six arrays: X_shuf_train, y_shuf_train, X_shuf_val, y_shuf_val, X_shuf_test and y_shuf_test. (A rough sketch of one possible approach is given at the end of this question.) The shuffling means that our training, validation and testing datasets all come from the same distribution. Creating this ideal setting can be useful when first learning about different methods, although we should remember that our models might not generalize well to new files with different distributions.

    Useful NumPy functions: np.reshape, np.random.permutation

    In future questions you will need repeated access to your shuffled datasets. You could set the “random seed” for the shuffling operation, so that your code creates the same datasets each time. You may also wish to save temporary copies of the shuffled datasets, but save the random seed regardless.1

    Be careful: if code for pre-processing data is incorrect, then all further experiments will be broken. Do some checking to ensure your code does what you think it does.

    2. Include your code for creating the six arrays above from the original amp_data array. Your answers to future questions should assume these arrays exist, rather than relisting the code from this part.
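
    For reference, here is a minimal sketch of one possible approach to this data preparation, not necessarily the intended solution. It assumes amp_data has been loaded as above; the seed value, number of histogram bins, and intermediate variable names are arbitrary choices.

      import numpy as np
      import matplotlib.pyplot as plt

      # Load the raw amplitude sequence (as in the snippet above).
      amp_data = np.load('amp_data.npz')['amp_data']

      # Part 1: line graph of the sequence and a histogram of the amplitudes.
      plt.figure()
      plt.plot(amp_data)
      plt.xlabel('time step')
      plt.ylabel('amplitude')

      plt.figure()
      plt.hist(amp_data, bins=100)  # 100 bins is an arbitrary choice
      plt.xlabel('amplitude')
      plt.show()

      # Wrap the sequence into a C x 21 matrix, discarding the leftover
      # elements at the end so that the length is a multiple of 21.
      C = amp_data.shape[0] // 21
      wrapped = np.reshape(amp_data[:C * 21], (C, 21))

      # Shuffle the rows, fixing the seed so the split is reproducible.
      np.random.seed(0)  # the seed value is an arbitrary choice
      wrapped = wrapped[np.random.permutation(C)]

      # 70% / 15% / 15% split into training, validation and test rows.
      n_train = int(0.70 * C)
      n_val = int(0.15 * C)
      train_rows = wrapped[:n_train]
      val_rows = wrapped[n_train:n_train + n_val]
      test_rows = wrapped[n_train + n_val:]

      # The first D=20 columns are the inputs, the final column is the target.
      X_shuf_train, y_shuf_train = train_rows[:, :-1], train_rows[:, -1]
      X_shuf_val, y_shuf_val = val_rows[:, :-1], val_rows[:, -1]
      X_shuf_test, y_shuf_test = test_rows[:, :-1], test_rows[:, -1]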

  2. Curve fitting on a snippet of audio:

    Given just one row of inputs, we could fit a curve of amplitude against time through the 20 points, and extrapolate it one step into the future.

    Plot the points in one row of your X_shuf_train data against the numbers \(t\te\frac{0}{20},\frac{1}{20},\frac{2}{20},...,\frac{19}{20}\), representing times. We can fit this sequence with various linear regression models, and extrapolate them to predict the 21st time step at time \(\frac{20}{20}\te1\). Indicate the point you’re predicting from y_shuf_train on the plot at \(t\te1\).

    First fit a straight line to the 20 training points and plot it. Then fit a quartic polynomial, by expanding each time \(t\) to a feature vector \(\bphi(t) = [1~~t~~t^2~~t^3~~t^4]^\top\), and fitting a linear model by least squares. Plot both fits between \(t\te0\) and \(t\te1\). (A rough sketch of one possible approach is given after this question’s parts.)

    1. Include the code for a plot that shows 20 training points, a test point, a straight line fit, and a quartic fit.

    2. Explain why the linear fit might be better if we use only the most recent two points, at times \(t\te\frac{18}{20}\) and \(t\te\frac{19}{20}\), rather than all 20 points. Also explain why the quartic fit might be better with a longer context than is best for the straight line model.

    3. Based on your visual inspection of this snippet of audio data, and maybe one or two other rows of your training dataset, roughly what order of polynomial and what context length do you guess might be best for prediction, and why?
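
    For reference, here is a rough sketch of one way to set up and plot the two fits for a single row, assuming the six arrays from the previous question exist. It is not necessarily the intended solution; the choice of row, the plotting grid, and the use of np.linalg.lstsq are arbitrary choices.

      import numpy as np
      import matplotlib.pyplot as plt

      # Times of the 20 context points, and the row being modelled.
      tt = np.arange(20) / 20.0       # 0/20, 1/20, ..., 19/20
      row = X_shuf_train[0]           # an arbitrary row of training inputs
      target = y_shuf_train[0]        # the amplitude to predict at t = 1

      def phi_linear(t):
          # Design matrix for a straight line: columns [1, t]
          return np.stack([np.ones_like(t), t], axis=1)

      def phi_quartic(t):
          # Design matrix for a quartic polynomial: columns [1, t, t^2, t^3, t^4]
          return np.stack([t**0, t, t**2, t**3, t**4], axis=1)

      # Least-squares fits to the 20 context points.
      w_lin = np.linalg.lstsq(phi_linear(tt), row, rcond=None)[0]
      w_qua = np.linalg.lstsq(phi_quartic(tt), row, rcond=None)[0]

      # Evaluate both fits on a fine grid between t = 0 and t = 1.
      t_grid = np.linspace(0, 1, 101)
      plt.plot(tt, row, 'o', label='20 training points')
      plt.plot(1.0, target, 'x', label='point to predict')
      plt.plot(t_grid, phi_linear(t_grid) @ w_lin, label='straight line fit')
      plt.plot(t_grid, phi_quartic(t_grid) @ w_qua, label='quartic fit')
      plt.xlabel('t')
      plt.ylabel('amplitude')
      plt.legend()
      plt.show()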


  1. When doing research on larger datasets you won’t want to commit them to version control, to have to distribute copies to anyone following your work, or to pay to back up redundant processed copies.↩︎