The two parts with entry boxes in Q1 are the “core questions”, as described in Tutorial 1. We are expecting you to attempt all the tutorial questions. You can seek clarifications and hints on Hypothesis.
Building a toy neural network
Consider the following classification problem. There are two real-valued features \(x_1\) and \(x_2\), and a binary class label. The class label is determined by \[\notag y = \begin{cases} 1 & \text{if}~x_2 \ge |x_1|\\ 0 & \text{otherwise}. \end{cases} \]
Can this function be perfectly represented by logistic regression, or a feedforward neural network without a hidden layer? Why or why not?
Answer:
No. Logistic regression, or a feedforward network with no hidden layer, applies its output non-linearity to a single linear combination of the inputs, so its decision boundary is a straight line (a hyperplane in general). The boundary required here, the ‘V’ shape \(x_2 = |x_1|\), is not a straight line. Concretely, the inputs \((-1, 0.5)\) and \((1, 0.5)\) are both in class 0, but their midpoint \((0, 0.5)\) is in class 1; a linear function takes its average value at the midpoint of two points, so no linear boundary can place the midpoint on the opposite side from both endpoints.
Consider a simpler problem for a moment, the classification problem \[\notag
y = \begin{cases}
1 & \text{if}~x_2 \ge x_1\\
0 & \text{otherwise}.
\end{cases}
\] Implement a Python function h1
that represents a single ‘neuron’ computing: \[
h_1 = \Theta(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + b^{(1)}_1),
\] where \(\Theta\) is the hard threshold function: \[\notag
\Theta(a) = \begin{cases}
1 & \text{if}~a \ge 0\\
0 & \text{otherwise}.
\end{cases}
\] The neuron thus applies the hard threshold function to a linear combination of the \(\bx\) inputs. Pick the weights by hand to solve the above classification problem.
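One possible implementation (a minimal sketch; the function and variable names, and the particular weight values \(W_{11}^{(1)}\te-1\), \(W_{12}^{(1)}\te1\), \(b^{(1)}_1\te0\), are just one choice that works):

```python
def h1(x1, x2, W11=-1.0, W12=1.0, b1=0.0):
    """Single hard-threshold neuron: 1 if W11*x1 + W12*x2 + b1 >= 0, else 0."""
    a = W11 * x1 + W12 * x2 + b1
    return 1 if a >= 0 else 0

# With W11 = -1, W12 = 1, b1 = 0 the activation is x2 - x1, so the neuron
# fires exactly when x2 >= x1, matching the simpler classification problem.
assert h1(0.5, 1.0) == 1   # x2 >= x1
assert h1(1.0, 0.5) == 0   # x2 < x1
```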
Now go back to the classification problem at the beginning of this question. Design a two layer feedforward network (that is, one hidden layer with two layers of weights) that represents this function. Use the hard threshold activation function as in the previous question.
To clarify, we’re not asking for Python code in this question part. Give equations for the computations that the network performs, like Equation (1) for \(h_1\) in the previous question part. And find parameters that produce the required function.
Hints: Use two units in the hidden layer. The unit from the last question will be one of the units, and you will need to design one more. You can then find settings of the weights such that the output unit performs a binary AND operation on the two hidden units.
Answer:
We can define a second neuron \(h_2\), \[ h_2 = \Theta(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + b^{(1)}_2), \] with \[ W_{21}^{(1)} = 1,~~W_{22}^{(1)}=1,~~b^{(1)}_2 = 0. \] Now \(h_2\te1~\Longleftrightarrow~x_1+x_2 \ge 0\), that is, \(x_2 \ge -x_1\).
We can also write: \[ \bh = \Theta(W^{(1)}\bx + \bb^{(1)}). \]
The intersection of the area for which \(h_1\te1\) and the area for which \(h_2\te1\) is the area where \(y\te1\). We can set up a logical AND
with the following output unit: \[
f = \Theta(w_1^{(2)} h_1 + w_2^{(2)} h_2 + b^{(2)})
= \Theta(\bw^{(2)\top}\bh + b^{(2)}),
\] where we choose \[
w_1^{(2)} = 1,~~ w_2^{(2)} = 1,~~ b^{(2)} = -1.1
\] You can show that \(f\te1~\Longleftrightarrow~ x_2 \ge |x_1|\), so the network reproduces the target function. Check this relationship on a few example points to verify that it works (a quick numerical check is sketched below).
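A possible check of the whole network (a sketch; the hidden weights are those chosen above, with \(h_1\) using \(W_{11}^{(1)}\te-1\), \(W_{12}^{(1)}\te1\), \(b^{(1)}_1\te0\), and the test points are arbitrary):

```python
import numpy as np

def hard_threshold(a):
    return (a >= 0).astype(float)

def two_layer_net(x):
    """x is a length-2 array [x1, x2]; returns 1 iff x2 >= |x1| with these weights."""
    W1 = np.array([[-1.0, 1.0],    # h1 fires when x2 >= x1
                   [ 1.0, 1.0]])   # h2 fires when x2 >= -x1
    b1 = np.array([0.0, 0.0])
    w2 = np.array([1.0, 1.0])
    b2 = -1.1                      # output fires only when both hidden units fire
    h = hard_threshold(W1 @ x + b1)
    return hard_threshold(w2 @ h + b2)

for x, target in [([0.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, 0.5], 0),
                  ([2.0, 1.0], 0), ([0.0, 0.0], 1)]:
    assert two_layer_net(np.array(x)) == target
```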
Intended lessons of Q1:
This exercise is intended to step you through how using a hidden layer with a non-linearity can represent functions that linear functions can’t. The hard threshold or step function used in this question was common in early neural networks. It’s used here because it’s easier to reason about than a logistic sigmoid when building a toy network by hand.
Regression with input-dependent noise:
In lectures we turned a representation of a function into a probabilistic model of real-valued outputs by modelling the residuals as Gaussian noise: \[\notag p(y\g\bx,\theta) = \N(y;\, f(\bx;\theta),\, \sigma^2). \] The noise variance \(\sigma^2\) is often assumed to be a constant, but it could be a function of the input location \(\bx\) (a “heteroscedastic” model).
A flexible model could set the variance using a neural network: \[\notag \sigma(\bx)^2 = \exp(\bw^{(\sigma)\top}\bh(\bx;\theta) + b^{(\sigma)}), \] where \(\bh\) is a vector of hidden unit values. These could be hidden units from the neural network used to compute function \(f(\bx;\theta)\), or there could be a separate network to model the variances.
Assume that \(\bh\) is the final layer of the same neural network used to compute \(f\). How could we modify the training procedure for a neural network that fits \(f\) by least squares, to fit this new model?
Answer:
Probabilistic models are often fitted by maximum likelihood (perhaps with regularization), where the cost function to minimize is the negative log-likelihood. (Other cost functions are possible.) I’ll show how to find the derivatives of the cost for one example; these could be used directly in stochastic gradient descent, or added up over a batch of examples and passed to a batch or mini-batch gradient-based optimizer.
I’ll write the variance as \(v\te\sigma^2\), so the square doesn’t confuse me when differentiating: \[ c = -\log\N(y;\,f,\,v) = \frac{1}{2}\log v + \frac{1}{2v}(y-f)^2 + \frac{1}{2}\log(2\pi). \] We can work out the effect on the cost of an example that perturbing the mean or variance would have: \[\begin{align} \bar{f} &= \pdd{c}{f} = \frac{1}{v}(f-y),\\ \bar{v} &= \pdd{c}{v} = \frac{1}{2v} - \frac{1}{2v^2}(y-f)^2 \end{align}\] These signals are then backpropagated to earlier stages of the computation.
I’ll write that: \[\begin{align} f &= \bw^{(f)\top}\bh^{(f)} + b^{(f)}\\ a^{(\sigma)} &= \bw^{(\sigma)\top}\bh^{(\sigma)} + b^{(\sigma)}\\ v &= \exp(a^{(\sigma)}), \end{align}\] where \(\bh^{(f)}\) is the final hidden layer before computing the function, which the least squares neural network will also have. \(\bh^{(\sigma)}\) is a hidden layer used to compute the new variance prediction. Later I assume that the network shares the same hidden layer for both outputs (a common choice, but not necessary): \(\bh^{(\sigma)} = \bh^{(f)} = \bh^{(L)}\).
We backpropagate the variance computation one step: \[ \bar{a}^{(\sigma)} = \bar{v}\exp(a^{(\sigma)}) = \bar{v}v. \] Standard results for backpropagation apply to a generic scalar neural network activation as follows: \[ a = \bw^\top\bh + b \quad\Rightarrow\quad \bar{\bw} = \bar{a}\bh, \quad \bar{\bh} = \bar{a}\bw, \quad \bar{b} = \bar{a}. \] These rules apply to the variance activation: \[ \bar{\bw}^{(\sigma)} = \bar{a}^{(\sigma)}\bh^{(\sigma)}, \quad \bar{\bh}^{(\sigma)} = \bar{a}^{(\sigma)}\bw^{(\sigma)}, \quad \bar{b}^{(\sigma)} = \bar{a}^{(\sigma)} \] and the function value \(f\) (a least squares network also does this part): \[ \bar{\bw}^{(f)} = \bar{f}\bh^{(f)}, \quad \bar{\bh}^{(f)} = \bar{f}\bw^{(f)}, \quad \bar{b}^{(f)} = \bar{f}. \] We now have the gradients for all of the weights producing the function and variance at the final stage of the network.
If we assume that the hidden layers for the variance and function are shared (the same hidden layer), we can combine the signals for the shared final hidden layer: \[ \bar{\bh}^{(L)} = \bar{\bh}^{(\sigma)} + \bar{\bh}^{(f)}, \] and then propagate that back through whatever the neural network is, exactly as the least squares network did. If we don’t share the hidden layers, we’d have to describe where \(\bh^{(\sigma)}\) comes from, and how any extra parameters are fitted.
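A minimal numerical sketch of these per-example computations for the final layer (the function name, the random test numbers, and the finite-difference sanity check are all just illustrative; `h` plays the role of the shared hidden layer \(\bh^{(L)}\)):

```python
import numpy as np

def forward_backward(h, y, wf, bf, ws, bs):
    """Per-example cost c = 0.5*log(v) + 0.5*(y-f)^2/v + const and its gradients,
    for the final layer only."""
    f = wf @ h + bf                       # mean prediction
    a_s = ws @ h + bs                     # variance activation
    v = np.exp(a_s)                       # variance, guaranteed positive
    c = 0.5*np.log(v) + 0.5*(y - f)**2/v + 0.5*np.log(2*np.pi)

    f_bar = (f - y)/v                     # dc/df
    v_bar = 0.5/v - 0.5*(y - f)**2/v**2   # dc/dv
    a_s_bar = v_bar * v                   # back through v = exp(a_s)

    grads = {'wf': f_bar*h, 'bf': f_bar,
             'ws': a_s_bar*h, 'bs': a_s_bar,
             'h': f_bar*wf + a_s_bar*ws}  # signal to send further back
    return c, grads

# Finite-difference sanity check of one gradient, with arbitrary numbers:
rng = np.random.default_rng(0)
h = rng.standard_normal(5); wf = rng.standard_normal(5); ws = rng.standard_normal(5)
bf, bs, y, eps = 0.3, -0.2, 1.7, 1e-5
_, grads = forward_backward(h, y, wf, bf, ws, bs)
c_plus, _ = forward_backward(h, y, wf, bf + eps, ws, bs)
c_minus, _ = forward_backward(h, y, wf, bf - eps, ws, bs)
assert np.isclose((c_plus - c_minus)/(2*eps), grads['bf'], rtol=1e-4, atol=1e-8)
```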
My empirical experience suggests that fitting the neural network in this question is harder than fitting a least squares network; it’s harder to find good learning rates, for example.
There are other possible ways to predict the typical size of a neural network’s mistake. For example we could compute the log of the square residuals on the training set, and then fit a second neural network (using the same least squares code as used to fit the function) to fit the log square residuals.
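A sketch of that alternative, using scikit-learn’s MLPRegressor as a stand-in for whatever least squares network code you already have (the toy data, the small constant inside the log, and the network sizes are arbitrary choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(X[:, 0]) + (0.1 + 0.3*np.abs(X[:, 0])) * rng.standard_normal(500)

f_net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000).fit(X, y)  # fit the mean
resid = y - f_net.predict(X)
log_sq = np.log(resid**2 + 1e-12)             # log squared residuals as targets

noise_net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000).fit(X, log_sq)
sigma = np.sqrt(np.exp(noise_net.predict(X)))  # predicted noise standard deviations
```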
In the suggestion above, the activation \(a^{(\sigma)} = \bw^{(\sigma)\top}\bh + b^{(\sigma)}\) sets the log of the variance of the observations.
Why not set the variance directly to this activation value, \(\sigma^2 = a^{(\sigma)}\)?
Hard: Why not set the variance to the square of the activation, \(\sigma^2 = (a^{(\sigma)})^2\)?
Answer:
A neural network activation is unconstrained, but the model is undefined for negative variances.
Setting \(\sigma^2\te a^2\) would mean that the variances were non-negative, and so in principle this idea could work. But it’s not likely to work well in practice. The variance no longer changes monotonically as a function of the activation, and the cost has a singularity at \(a\te0\), where the variance collapses to zero and the gradients become extreme. There will probably be worse local optima than with the suggested architecture, and gradient-based fitting procedures are more likely to become unstable.
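To make the singularity explicit, write out the per-example cost under this parameterization (the same \(c\) as above, with \(v\te a^2\)): \[ c = \frac{1}{2}\log a^2 + \frac{(y-f)^2}{2a^2} + \text{const} = \log|a| + \frac{(y-f)^2}{2a^2} + \text{const}. \] For any example with a non-zero residual the \(1/a^2\) term dominates and the cost blows up as \(a \to 0\); for an example fitted exactly, the cost instead heads to \(-\infty\). Either way, \(\pdd{c}{a} = \frac{1}{a} - \frac{(y-f)^2}{a^3}\) becomes extreme near \(a\te0\), and the cost is symmetric in \(\pm a\), so gradient-based fitting near the singularity is badly behaved.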
Taking the log of the variance puts problematic zero variances out at activations of negative infinity.
Given a test input \(\bx^{(*)}\), the model above outputs both a guess of an output, \(f(\bx^{(*)})\), and an ‘error bar’ \(\sigma(\bx^{(*)})\), which indicates how wrong the guess could be.
The Bayesian linear regression and Gaussian process models covered in lectures also give error bars on their predictions. What are the pros and cons of the neural network approach in this question? Would you use this neural network to help guide which experiments to run?
Answer:
Model flexibility: The neural network model in this question is “heteroscedastic”: the noise variance depends on the input. Standard Bayesian linear regression and GP models (although they could be generalized) assume a fixed noise variance. Given lots of data, these models will make predictions with the same noise variance everywhere, even if that’s the wrong thing to do.
Costs: It’s easier to fit the neural network model to large datasets than Gaussian processes. One sweep through \(N\) datapoints by stochastic gradient descent with \(H\) hidden units and \(D\)-dimensional inputs costs \(O(NDH)\). Gaussian processes cost \(O(N^3)\). Bayesian linear regression is cubic in the number of basis functions, and it’s hard to cover high-dimensional input spaces. For smaller datasets, the closed-form Gaussian computations of GPs and linear regression are more attractive.
Uncertainties: The Bayesian approaches explicitly model uncertainty in the parameters separately from the noise of the observation process. With Gaussian processes, the error bars grow to the prior function standard deviation for inputs far away from any observations. In the neural network, the variance could extrapolate in arbitrary ways in unseen regions. Some thought could go into regularizing it towards sensible solutions — and Bayesian versions of neural networks are possible. However, Gaussian processes are usually the first choice for experiment design, with neural networks adopted carefully for large problems.
You might have thought of other things… if so, please add them to the Forum.
Classification with unbalanced data
Classification tasks often involve rare outcomes, for example predicting click-through events, fraud detection, and disease screening. We’ll restrict ourselves to binary classification, \(y\in\{0,1\}\), where the positive class is rare: \(P(y\te1)\) is small.
We are likely to see lots of events before observing enough rare \(y\te1\) events to train a good model. To save resources, it’s common to only keep a small fraction \(f\) of the \(y\te0\) ‘negative examples’. A classifier trained naively on this sub-sampled data would predict positive labels more frequently than they actually occur. For generative classifiers we could set the class probabilities based on the original class counts (or domain knowledge). This question explores what to do for logistic regression.
We write that an input-output pair occurs in the world with probability \(P(\bx,y)\), and that the joint probability of an input-output pair (\(\bx,y\)) and keeping it (\(k\)) is \(P(\bx, y, k)\). Because conditional probabilities are proportional to joint probabilities, we can write: \[\notag P(y\g\bx,k)\propto P(\bx,y\g k) \propto P(\bx, y, k) = \begin{cases} P(\bx,y) & y=1\\ f\,P(\bx,y) & y=0.\end{cases} \]
Show that if: \[\notag P(y\g \bx) \propto \begin{cases} c & y=1\\ d & y=0,\end{cases} \] then \(P(y\te1\g \bx) = \sigma(\log c - \log d)\), where \(\sigma(a) = 1/(1+e^{-a})\).
[This core question should be revision, given Tutorial 5, Q1.]
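One way to write out the derivation (just normalizing over the two possible labels and using the definition of \(\sigma\)): \[ P(y\te1\g\bx) = \frac{c}{c+d} = \frac{1}{1 + d/c} = \frac{1}{1 + \exp\!\big(-(\log c - \log d)\big)} = \sigma(\log c - \log d). \]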
We train a logistic regression classifier, with a bias weight, to match subsampled data, so that \(P(y\te1\g \bx, k)\approx \sigma(\bw^\top\bx + b)\). Use the above results to argue that we should add \(\log f\) to the bias parameter to get a model for the real-world distribution \(P(y\g \bx) \propto P(y, \bx)\).
Have we changed the bias in the direction that you would expect?
Answer:
Combining the results above, the fitted model says: \[ P(y\te1\g\bx,k) = \sigma\big(\log P(\bx,y\te1) - \log P(\bx,y\te0) - \log f\big) \approx \sigma(\bw^\top\bx + b). \] Taking the contents of the sigmoids, and adding \(\log f\) to both sides: \[ \bw^\top\bx + (b+\log f) \approx \log P(\bx,y\te1) - \log P(\bx,y\te0). \] Thus the suggested classifier with the corrected bias does what we want: \[ \sigma(\bw^\top\bx + (b+\log f)) \approx \sigma\big(\log P(\bx,y\te1) - \log P(\bx,y\te0)\big) = P(y\te1\g\bx). \]
You should always ask whether your results look plausible, to see if you can spot obvious mistakes. Here \(\log f\) is negative, because \(f \ll 1\) is the fraction of negative examples that we keep. Thus the correction reduces the activation going into the sigmoid, so we predict \(y\te1\) with a smaller probability than the uncorrected model, which was trained on data containing a greater fraction of \(y\te1\) examples than really occur. If \(f\te1\) we make no change, as expected.
[You could also check whether the correction has the same effect on the predictive probability as changing the class probabilities for a generative classifier. Working deliberately omitted; this is something you could investigate for yourself if keen.]
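As a concrete illustration of the correction (a minimal sketch, assuming \(\bw\) and \(b\) have already been fitted to the subsampled data by whatever logistic regression code you use; `keep_fraction` stands for \(f\)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w, b, keep_fraction):
    """P(y=1|x) under the real-world distribution, given weights (w, b) that were
    fitted to subsampled data where only `keep_fraction` of the negatives were kept."""
    return sigmoid(X @ w + b + np.log(keep_fraction))
```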
We now consider a different approach. We may wish to minimize the loss: \[\notag L = -\E_{P(\bx,y)}\!\left[ \log P(y\g\bx) \right]. \] Multiplying the integrand by \(1 \te \frac{P(\bx,\,y\g k)}{P(\bx,\,y\g k)}\) changes nothing, so we can write: \[\notag\begin{align} L &= -\int \sum_y P(\bx,y\g k)\, \frac{P(\bx,y)}{P(\bx,y\g k)} \,\log P(y\g\bx)\;\mathrm{d}\bx \notag\\ &= -\E_{P(\bx,y\g k)}\!\left[\frac{P(\bx,y)}{P(\bx,y\g k)} \,\log P(y\g\bx)\right] \notag\\ &\approx -\frac{1}{N} \sum_{n=1}^N \frac{P(\bx\nth,y\nth)}{P(\bx\nth,y\nth\g k)} \,\log P(y\nth\g\bx\nth),\notag \end{align}\] where \(\bx\nth,y\nth\) come from the subsampled data. This manipulation is a special case of a trick known as importance sampling, which we will see later in lectures in another context. We have converted an expectation under the original data distribution into an expectation under the subsampling distribution. We then replaced the formal expectation with an average over subsampled data.
Use the above idea to justify multiplying the gradients for \(y\te0\) examples by \(1/f\), when training a logistic regression classifier on subsampled data.
Answer:
At the start of the question, \(P(\bx, y\g k)\) was only defined up to a constant (did you notice?). Thus the “importance weights” applied to each log probability are: \[ \frac{P(\bx, y)}{P(\bx,y\g k)} ~\propto~ \begin{cases} \frac{P(\bx,y)}{P(\bx,y)} & y=1\\ \frac{P(\bx,y)}{f\,P(\bx,y)} & y=0\end{cases} ~\propto~ \begin{cases} 1 & y=1\\ 1/f & y=0.\end{cases} \] We can substitute these values of \(1\) or \(1/f\) into the loss. The unknown constant scales the loss surface, but does not change where the minimum is.
The contribution to the loss from a positive example is then \(-\log P(y\te1\g\bx)\) as usual, whereas the contribution from a negative example becomes \(-(1/f)\log P(y\te0\g\bx)\). The gradients for negative examples are therefore also scaled by \(1/f\).
This method says that if we throw away all but 1/1000 of the negative examples, we should up-weight the effect of each remaining negative example by 1000\(\times\) to make up for it.
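A sketch of the corresponding importance-weighted gradient computation for logistic regression (plain NumPy/SciPy; the function name and arguments are illustrative):

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def weighted_gradients(X, y, w, b, keep_fraction):
    """Gradients of the importance-weighted negative log-likelihood:
    each kept negative example (y=0) is up-weighted by 1/keep_fraction."""
    weights = np.where(y == 1, 1.0, 1.0/keep_fraction)
    err = expit(X @ w + b) - y            # dloss/dactivation for each example
    grad_w = X.T @ (weights * err)
    grad_b = np.sum(weights * err)
    return grad_w, grad_b
```

Equivalently, many packages let you pass the \(1\) or \(1/f\) values as per-example sample weights, rather than modifying the gradient code yourself.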
Hard: Compare the two methods that we have considered for training models based on subsampled data, giving some pros and cons of each.
Answer:
Both methods apply beyond logistic regression. Method b) could be modified to adjust probabilities after making predictions, rather than adjusting a bias parameter. Method c) applies to any model whose training procedure can take importance weights on examples – the argument also didn’t rely on using a log-probability loss. We can adjust the gradients of any differentiable loss function, and gradient boosted tree software like XGBoost (non-examinable) will also let you specify weights on examples.
The method proposed in b) is simple, and doesn’t require modification to existing training code: simply change the bias (or adjust probabilities) at the end.
The method proposed in c) multiplies some gradients by \(1/f\), which could be a large factor (such as 100 or 1000). Default step sizes in software might not work well, so it could be more work to get this modified method to work.
In the limit of infinite data, the empirical estimate of the average loss in method c) will become accurate, and so we will minimize the same loss as without subsampling. Method b) won’t usually minimize that same loss: although b) tries to match the same \(P(y\g\bx)\) distribution as c), the inputs are still weighted in the average loss by the modified distribution \(P(\bx\g k)\), whereas in c) the input locations are weighted by \(P(\bx)\).
Thus method b) is increasing the importance of getting accurate probabilities in regions where positive examples occur. The expected log probability at test time could be worse for b) as a result, but other error measures such as “precision” and “recall” (non-examinable, but worth looking up) may be better than c).
Resampling a dataset to make it more balanced often makes classifiers work better (under some measures), regardless of whether you need to save storage. One could argue we’re implicitly changing the loss function, which we should do directly – but pragmatically, it’s often easier to process data and use the software we have.
Intended lessons of Q3:
Subsampling is common practice for large unbalanced datasets. Some of you are likely to need these tricks in the real world. If your dataset is not large, but unbalanced, you might not want to throw data away, but you could consider up-sampling the positive class (including the interesting examples more than once) to create a balanced classification problem. Various classifiers might work better on balanced data, and you can then correct the probabilities (or adjust decision thresholds) at the end.
The techniques and ideas used in this question come up a lot too.
Q1 is based on a previous sheet by Amos Storkey, Charles Sutton, and/or Chris Williams.