We expect those wanting to do really well on the course to work through everything here before the tutorial. There are two core questions, with entry boxes on the website, that everyone should attempt. Unlike the in-note questions, you can edit your answers until we release ours, after the final tutorial group has met. Almost all of you should try more than just these parts.
We strongly recommend that you discuss your tutorial work — and the course in general — with your peers. If you can’t do a part, that’s normal; skip it and move on! After attempting what you can, try to meet another student from the class and pool your understanding. Tutorials run more smoothly if you have agreed with someone what you want to talk about. If you haven’t met people in the class yet, exchange details in your first tutorial!
Your tutorial session usually won’t cover every question. Tutorials are for discussion, not for confirming all the answers: detailed answers will be made available after the week’s tutorials, and can be discussed further on Hypothesis.
Linear Regression and linear transformations:
Alice fits a function \(f(\bx) = \bw^\top\bx\) to a training set of \(N\) datapoints \(\{\bx^{(n)},y^{(n)}\}\) by least squares. The inputs \(\bx\) are \(D\)-dimensional column vectors. You can assume a unique setting of the weights \(\bw\) minimizes the square error on the training set.
Bob has heard that by transforming the inputs \(\bx\) with a vector-valued function \(\bphi\), he can fit an alternative function, \(g(\bx) = \bv^\top\bphi(\bx)\), with the same fitting code. He decides to use a linear transformation \(\bphi(\bx) = A\bx\), where \(A\) is an invertible matrix.
Show that Bob’s procedure will fit the same function as Alice’s original procedure.
NB You don’t have to do any extensive mathematical manipulation. You don’t need a mathematical expression for the least squares weights.
Hint: reason about the sets of functions that Alice and Bob are choosing their functions from. If necessary, more help is in the footnote.¹
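Not part of the question, but if you want to see the claim in action before arguing it, here is a quick numerical check (a sketch, not a proof; all the variable names are our own, and a random square matrix is almost surely invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = rng.standard_normal((N, D))    # rows are the inputs x^(n)
y = rng.standard_normal(N)
A = rng.standard_normal((D, D))    # random square A: almost surely invertible

w = np.linalg.lstsq(X, y, rcond=None)[0]        # Alice: f(x) = w^T x
v = np.linalg.lstsq(X @ A.T, y, rcond=None)[0]  # Bob: g(x) = v^T (A x)

print(np.allclose(X @ w, (X @ A.T) @ v))        # fitted functions agree: True
```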
Could Bob’s procedure be better than Alice’s if the matrix \(A\) is not invertible?
[If you need a hint, it may help to remind yourself of the discussion involving invertible matrices in the pre-test answers. Also, think about different possible interpretations of “better”.]
Bonus part: Only do this part this week if you have time. Otherwise review it later.
Alice becomes worried about overfitting, adds a regularizer \(\lambda \bw^\top\bw\) to the least-squares error function, and refits the model. Assuming \(A\) is invertible, can Bob choose a regularizer so that he will still always obtain the same function as Alice?
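Not an answer, but if you want to test a candidate regularizer numerically, the harness below continues the sketch above (it reuses `X`, `y`, `A` and `D` from there; `bobs_M` is a placeholder that you should replace with your own choice):

```python
lam = 0.1                                   # Alice's lambda
w_alice = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

Phi = X @ A.T                               # Bob's transformed inputs
bobs_M = lam * np.eye(D)                    # placeholder! substitute your answer
v_bob = np.linalg.solve(Phi.T @ Phi + bobs_M, Phi.T @ y)

print(np.allclose(X @ w_alice, Phi @ v_bob))  # True only for a correct choice
```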
Bonus part: Only do this part this week if you have time. Otherwise review it later.
Suppose we wish to find the vector \(\bv\) that minimizes the function \[\notag
(\by - \Phi\bv)^\top(\by - \Phi\bv) + \bv^\top M \bv.
\]
Show that \(\bv^\top M \bv = \bv^\top(\frac{1}{2}M + \frac{1}{2}M^\top) \bv\), and hence that we can assume without loss of generality that \(M\) is symmetric.
Why would we usually choose \(M\) to be positive semi-definite in a regularizer, meaning that \(\ba^\top M \ba \ge 0\) for all vectors \(\ba\)?
Assume we can find a factorization \(M = AA^\top\). Can we minimize the function above using a standard routine that can minimize \((\bz-X\bw)^\top(\bz-X\bw)\) with respect to \(\bw\)?
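Whatever scheme you come up with, you can sanity-check it against the closed-form minimizer of this cost. A minimal sketch, assuming \(M\) has already been symmetrized as in the first part:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 5
Phi = rng.standard_normal((N, K))
y = rng.standard_normal(N)
A = rng.standard_normal((K, K))
M = A @ A.T                       # symmetric and positive semi-definite

# Setting the cost's gradient to zero gives (Phi^T Phi + M) v = Phi^T y:
v_ref = np.linalg.solve(Phi.T @ Phi + M, Phi.T @ y)
```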
Radial Basis Functions (RBFs):
In this question we form a linear regression model for one-dimensional inputs: \(f(x) = \bw^\top\bphi(x;h)\), where \(\bphi(x;h)\) evaluates the input at 101 basis functions. The basis functions \[\notag \phi_k(x) = e^{-(x-c_k)^2/h^2} \] share a common user-specified bandwidth \(h\), while the positions of the centers are set to make the basis functions overlap: \(c_k \te (k-51)h/\sqrt{2}\), with \(k = 1\ldots 101\). The free parameters of the model are the bandwidth \(h\) and weights \(\bw\).
The model is used to fit a dataset with \(N\te70\) observations, each with inputs \(x\in[-1,+1]\). Assume that each observation’s output also lies in \(y\in[-1,+1]\). The model is fitted for any particular \(h\) by transforming the inputs using that bandwidth into a feature matrix \(\Phi\), then minimizing the regularized least squares cost: \[\notag C = (\by-\Phi\bw)^\top(\by-\Phi\bw) + 0.1\bw^\top\bw.\]
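To make this fitting procedure concrete, one possible sketch in NumPy is below (the helper names `rbf_features` and `fit_w` are ours, not part of the question):

```python
import numpy as np

def rbf_features(xx, h, K=101):
    """Map 1D inputs xx, shape (N,), to an N x K matrix of basis function values."""
    cc = (np.arange(1, K + 1) - 51) * h / np.sqrt(2)   # centres c_k
    return np.exp(-(xx[:, None] - cc[None, :])**2 / h**2)

def fit_w(xx, yy, h, lam=0.1):
    """Minimize (y - Phi w)^T (y - Phi w) + lam w^T w at this bandwidth."""
    Phi = rbf_features(xx, h)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ yy)
```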
You’ll want to understand the arrangement of the RBFs to be able to answer this question. Make sure you know how to sketch the situation with pen and paper. However, this second and final core question asks you instead for a straightforward plot using Python.
Write code to plot all of the \(K\te101\) basis functions \(\phi_k(x)\) with \(h\te0.2\). Don’t plot data or \(f(x)\) functions.
Explain why many of the weights will be close to zero when \(h\te0.2\), and why even more weights will probably be close to zero when \(h\te1.0\).
It is suggested that we could choose \(h\) by fitting \(\bw\) for each \(h\) in a grid of values, and picking the \(h\) that leads to the fit with the smallest cost \(C\). Explain whether this suggestion is a good idea, or whether you would modify the procedure.
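To be clear about what is being proposed (not whether it is wise), the suggestion in code looks like the following; it reuses the hypothetical `rbf_features` and `fit_w` sketched above, with stand-in data:

```python
import numpy as np

rng = np.random.default_rng(2)
xx = rng.uniform(-1, 1, 70)               # stand-in training inputs
yy = np.sin(3 * xx)                       # stand-in training outputs

def cost(xx, yy, ww, h, lam=0.1):
    resid = yy - rbf_features(xx, h) @ ww
    return resid @ resid + lam * ww @ ww

h_grid = np.logspace(-2, 1, 30)           # grid of candidate bandwidths
costs = [cost(xx, yy, fit_w(xx, yy, h), h) for h in h_grid]
h_best = h_grid[np.argmin(costs)]         # h with the smallest training cost C
```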
Another data set with inputs \(x\!\in\![-1,+1]\) arrives, but now you notice that all of the observed outputs are larger, \(y\!\in\![1000,1010]\). What problem would we encounter if we performed linear regression as above on this data? How could this problem be fixed?
Logistic Sigmoids:
Sketch — with pen and paper — a contour plot of the sigmoidal function \[\notag\phi(\bx) = \sigma(\bv^\top\bx + b),\] for \(\bv \te [1~2]^\top\) and \(b \te 5\), where \(\sigma(a) \te 1/(1\tp\exp(-a))\).
Indicate the precise location of the \(\phi\te 0.5\) contour on your sketch, and give at least a rough indication of some other contours. Also mark the vector \(\bv\) on your diagram, and indicate how its direction is related to the contours.
Hints: What happens to \(\phi\) as \(\bx\) moves orthogonal (perpendicular) to \(\bv\)? What happens to \(\phi\) as \(\bx\) moves parallel to \(\bv\)? To draw the \(\phi\te0.5\) contour, it may help to identify special places where it is easy to show that \(\phi\te0.5\).
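Once you have a hand-drawn sketch, you could optionally check it against a quick matplotlib contour plot like the one below (the plotting range is an arbitrary choice of ours):

```python
import numpy as np
import matplotlib.pyplot as plt

v, b = np.array([1.0, 2.0]), 5.0
x1, x2 = np.meshgrid(np.linspace(-10, 5, 200), np.linspace(-10, 5, 200))
phi = 1 / (1 + np.exp(-(v[0]*x1 + v[1]*x2 + b)))   # sigma(v^T x + b)

cs = plt.contour(x1, x2, phi, levels=[0.1, 0.25, 0.5, 0.75, 0.9])
plt.clabel(cs)
plt.arrow(0, 0, v[0], v[1], head_width=0.3)   # the direction of v
plt.gca().set_aspect('equal')                 # so angles are not distorted
plt.xlabel('$x_1$'); plt.ylabel('$x_2$')
plt.show()
```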
If \(\bx\) and \(\bv\) were three-dimensional, what would the contours of \(\phi\) look like, and how would they relate to \(\bv\)? (A sketch is not expected.)
¹ For any function that Alice might pick, can Bob pick the same function? Can Bob represent any functions that Alice can’t represent? Bob and Alice are both trying to pick a function to minimize square error. Given the options Bob has, why will he pick neither a different function that Alice could have picked, nor a function that Alice couldn’t have picked?
This course doesn’t ask for proofs often, and doesn’t require a really formal presentation. But it does require some clear thinking and explanation, and reasoning about when model additions might help is important. If you don’t manage this question, you won’t be alone, but you will need to be able to adapt the reasoning to new situations.