$$\notag \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\te}{\!=\!} \newcommand{\ttimes}{\!\times\!} $$

MLPR Tutorial Sheet 3 [1]

The parts with entry boxes in Q1 and Q3 are the “core questions”. As described in Tutorial 1, almost all of you should do more than just these.


  1. A Gaussian classifier:

    A training set consists of 14 one-dimensional examples from two classes. The training inputs for the two classes (\(y\te1\) and \(y\te2\)) are:

    x1 = [0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25]
    x2 = [0.9, 0.8, 0.75, 1.0]
    1. In this part you should write code that plots a figure and prints a single number. Please don’t use any libraries except numpy and matplotlib. If you copy the code between your lines of backticks, it should run if you type %paste into a new ipython session.

      Your code should fit a one-dimensional Gaussian to each class by matching the mean and variance. Then estimate the class probabilities \(\pi_1\) and \(\pi_2\) by matching the observed class fractions. (This procedure fits the model with maximum likelihood: it selects the parameters that give the training data the highest probability.) Then on a single set of axes, plot the scores \(p(x,y) \te P(y)\,p(x\g y)\) for each class
      \(y\), as functions of input location \(x\). (This “quick sketch” to help your tutorial discussion doesn’t need labelled axes, or to be beautiful. But please do make good plots for reports, dissertations, etc.)

      Finally print \(P(y\te1\g x\te0.6)\), the probability that a test point at \(x \te 0.6\) belongs to class 1. (A minimal code sketch for this part appears at the end of this question.)

    2. Roughly where on your plot are the “decision boundaries”, the locations where \(P(\text{class 1}\g x) = P(\text{class 2}\g x) = 0.5\)?

      You are not required to calculate the locations exactly. You’re aiming to be able to discuss in the tutorial why the decisions change where they do.

    3. Are the decisions that the model makes reasonable for very negative \(x\) and very positive \(x\)? Are there any changes we could consider making to the model if we wanted to change its asymptotic behaviour?
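
    For part 1, a minimal sketch of one possible approach is below. It is not the only way to structure an answer; the plotting range and variable names are arbitrary choices, and only numpy and matplotlib are used.

    ```
    import numpy as np
    import matplotlib.pyplot as plt

    # Training inputs for the two classes (copied from the question)
    x1 = np.array([0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25])
    x2 = np.array([0.9, 0.8, 0.75, 1.0])

    # Maximum likelihood Gaussian fits: match each class's mean and variance
    mu1, var1 = np.mean(x1), np.var(x1)
    mu2, var2 = np.mean(x2), np.var(x2)

    # Class probabilities from the observed class fractions
    N = x1.size + x2.size
    pi1, pi2 = x1.size / N, x2.size / N

    def gauss_pdf(x, mu, var):
        """Density of a one-dimensional Gaussian with mean mu and variance var."""
        return np.exp(-0.5*(x - mu)**2 / var) / np.sqrt(2*np.pi*var)

    # Plot the scores p(x, y) = P(y) p(x | y) for each class on one set of axes
    grid = np.linspace(-0.5, 1.5, 500)
    plt.plot(grid, pi1*gauss_pdf(grid, mu1, var1), 'b-')
    plt.plot(grid, pi2*gauss_pdf(grid, mu2, var2), 'r-')
    plt.show()

    # P(y=1 | x=0.6): normalize the two class scores at the test location
    score1 = pi1*gauss_pdf(0.6, mu1, var1)
    score2 = pi2*gauss_pdf(0.6, mu2, var2)
    print(score1 / (score1 + score2))
    ```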

  2. Maximum likelihood and logistic regression:

    Maximum likelihood logistic regression maximizes the log probability of the labels, \[\notag \sum_n \log P(y^{(n)}\g \bx^{(n)}, \bw), \] with respect to the weights \(\bw\). As usual, \(y^{(n)}\) is a binary label at input location \(\bx^{(n)}\). (A short code sketch for evaluating this objective appears after this question.)

    The training data is said to be linearly separable if the two classes can be completely separated by a hyperplane. That means we can find a decision boundary \[\notag P(y^{(n)}\te1\g \bx^{(n)}, \bw,b) = \sigma(\bw^\top\bx^{(n)} + b) = 0.5,\qquad \text{where}~\sigma(a) = \frac{1}{1+e^{-a}},\] such that all the \(y\te1\) labels are on one side (with probability greater than 0.5), and all of the \(y\!\ne\!1\) labels are on the other side.

    1. Show that if the training data is linearly separable with a decision hyperplane specified by \(\bw\) and \(b\), the data is also separable with the boundary given by \(\tilde{\bw}\) and \(\tilde{b}\), where \(\tilde{\bw} \te c\bw\) and \(\tilde{b} \te cb\) for any scalar \(c \!>\! 0\).

    2. What consequence does the above result have for maximum likelihood training of logistic regression for linearly separable data?
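
    As a point of reference for the discussion (and not part of the question), here is a small sketch of one numerically stable way to evaluate the log-likelihood objective above for given parameters. The arrays X, yy and the parameters ww, b are hypothetical placeholders.

    ```
    import numpy as np

    def log_likelihood(ww, b, X, yy):
        """Return sum_n log P(y^(n) | x^(n), ww, b) for binary labels yy in {0, 1}.

        Uses log sigma(a) = -log(1 + exp(-a)) = -logaddexp(0, -a),
        which avoids overflow for large negative activations.
        """
        aa = X @ ww + b        # activations w^T x + b, shape (N,)
        ss = 2*yy - 1          # map labels {0, 1} to {-1, +1}
        return -np.sum(np.logaddexp(0, -ss*aa))

    # Hypothetical toy usage:
    X = np.array([[0.2], [1.5], [-0.7]])
    yy = np.array([0, 1, 0])
    print(log_likelihood(np.array([2.0]), -1.0, X, yy))
    ```

    Evaluating this objective for \(c\bw\) and \(cb\) at a few values of \(c \!>\! 0\) on a separable toy set may help with the discussion of part 2.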

  3. The costs of some classifiers:

    We have a training set of \(N\) examples, with \(D\)-dimensional features and binary labels. This question gets you to revisit what computations you need to do to build and use a classifier.

    Assume the following computational complexities: matrix-matrix multiplication \(AB\) costs \(O(LMN)\) for \(L\ttimes M\) and \(M\ttimes N\) matrices \(A\) and \(B\). Inverting an \(N\ttimes N\) matrix \(G\) and/or finding its determinant costs \(O(N^3)\). [2]

    1. What is the computational complexity of training a generative classifier that models the features of each class by a maximum likelihood Gaussian fit (a Gaussian matching the mean and covariances of the features)?

      State your complexity first, then briefly explain how you got it. Include all of the computations that we could reasonably perform based on the training data before seeing the test data, so that test-time predictions (part b) will be as cheap as possible. (A rough code sketch of one way to organise these training-time computations appears at the end of this question, after the bonus parts.)

    2. What is the computational complexity of assigning class probabilities to a test feature vector?

    Bonus parts (if you’re finding the course difficult, leave these for later review):

    1. In linear discriminant analysis we assume the classes just have different means, and share the same covariance matrix. Show that given its parameters \(\theta\), the log likelihood ratio for a binary classifier, \[\notag \log \frac{p(\bx\g y\te1,\,\theta)}{p(\bx\g y\te0,\,\theta)} \;=\; \log p(\bx\g y\te1,\,\theta) - \log p(\bx\g y\te0,\,\theta), \] is a linear function of \(\bx\) (as opposed to quadratic).

    2. When the log likelihood ratio is linear in \(\bx\), the log posterior ratio is also linear: by Bayes’ rule the log posterior ratio equals the log likelihood ratio plus the constant \(\log\big(P(y\te1)/P(y\te0)\big)\). If the log posterior ratio can be written as a linear equation in the form \[\notag \log P(y\te1\g\bx,\theta) - \log P(y\te0\g \bx,\theta) = \bw^\top\bx + b, \] then it can be shown that the predictions have the same form as logistic regression: \[\notag P(y\te1\g\bx,\,\theta) = \sigma(\bw^\top\bx + b), \] but the parameters \(\bw(\theta)\) and \(b(\theta)\) are fitted differently.

      How do your answers to a) and b) change when the classes share one covariance? What can you say about the cost of the linear discriminant analysis method compared to logistic regression?
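
    For parts a) and b) above (not the bonus parts), here is a rough sketch of one way to organise the computations for a single class, with the dominant cost of each step noted in the comments. It uses explicit inversion to match the cost model stated in the question; as footnote [2] notes, a factorization would usually be preferred in practice.

    ```
    import numpy as np

    def fit_gaussian_class(X):
        """Maximum likelihood Gaussian fit to the N x D rows of X (one class's features).

        Everything that can be computed before seeing test data is done here.
        """
        N, D = X.shape
        mu = np.mean(X, axis=0)                  # O(ND)
        Xc = X - mu
        Sigma = (Xc.T @ Xc) / N                  # O(N D^2)
        Sigma_inv = np.linalg.inv(Sigma)         # O(D^3)
        _, log_det = np.linalg.slogdet(Sigma)    # O(D^3)
        return mu, Sigma_inv, log_det

    def log_gauss_pdf(x, mu, Sigma_inv, log_det):
        """log N(x; mu, Sigma) from the precomputed quantities: O(D^2) per test vector."""
        diff = x - mu
        return -0.5*(diff @ Sigma_inv @ diff + log_det + x.size*np.log(2*np.pi))
    ```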


  [1] Parts of this tutorial sheet are based on previous versions by Amos Storkey, Charles Sutton, and Chris Williams.

  [2] In fact, good implementations usually take a factorization of \(G\) rather than inverting it, which also costs \(O(N^3)\). Given that factorization, we can compute \(G^{-1}H\) in \(O(N^2K)\), where \(H\) is \(N\ttimes K\), and find the (log) determinant in \(O(N)\). I’m using “big-O notation”. A good introduction to this notation in the context of computational complexity is the notes from our second-year Inf2b course. Applied areas like machine learning usually use big-O notation more sloppily than in those notes; it’s rare to see \(\Omega\) or \(\Theta\).