$$\notag \newcommand{\I}{\mathbb{I}} \newcommand{\N}{\mathcal{N}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bmu}{{\boldsymbol{\mu}}} \newcommand{\bnu}{{\boldsymbol{\nu}}} \newcommand{\bx}{\mathbf{x}} \newcommand{\noo}[1]{\frac{1}{#1}} \newcommand{\te}{\!=\!} \newcommand{\ttimes}{\!\times\!} $$

MLPR Tutorial Sheet 2

Reminders: Attempt the tutorial questions before your tutorial. We strongly recommend you discuss these questions — and the course in general — with your peers. Detailed answers will be made available after the week’s tutorials, and can be discussed further on Hypothesis.

The two parts with entry boxes in Q3 are the “core questions”. As described in Tutorial 1, almost all of you should do more than just these.

This tutorial is largely just maths. While we ask for a small numerical experiment in the middle, there’s no real data, and no machine learning. However, throughout the course we will derive models and algorithms that use multivariate Gaussian distributions, and other machine learning methods share some of the same maths. We’ve put this material in a tutorial because it’s useful, and you need to work through it at your own pace. In the meantime, you’re also working on assignment 1, which involves some data!


  1. Warm-up exercise:

    If \(\ba\) and \(\bb\) are \(D\ttimes1\) column vectors and \(M\) is a \(D\ttimes D\) symmetric matrix, show that \[\notag\ba^\top M \bb = \bb^\top M \ba.\] You wouldn’t need to show this result in an exam unless you were explicitly asked to. In some working (like for the next question) you could just state that it’s true for symmetric \(M\).

  2. Identifying a Gaussian:

    As part of a derivation, we may need to identify the distribution of a vector \(\bx\) from its probability density function, known only up to a constant with respect to \(\bx\). For example: \[\notag p(\bx) \propto \exp\left( -\bx^\top A \bx - \bx^\top\bc\right), \] where \(A\) is a symmetric positive-definite (and hence invertible) matrix. As this distribution is proportional to the exponential of a quadratic in \(\bx\), it is a Gaussian: \(p(\bx) = \N(\bx; \bmu, \Sigma)\).

    Identify which Gaussian \(\bx\) comes from by finding the mean \(\bmu\) and covariance \(\Sigma\) in terms of \(A\) and \(\bc\). The easiest method is to compare \(p(\bx)\) to the standard form for the multivariate Gaussian PDF (given in class).

    Advice: If you’re stuck, make sure you’ve done the relevant w1 and w3 in-note questions. If you’re lacking confidence, the answer you should be able to show is in footnote 1. If you’re still stuck, move on: you can discuss it in your tutorial group.
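
    Further advice: whichever route you take, it may help to have the exponent of the standard multivariate Gaussian PDF written out. Using the symmetry of \(\Sigma^{-1}\) (as in Q1), \[\notag -\frac{1}{2}(\bx-\bmu)^\top\Sigma^{-1}(\bx-\bmu) \;=\; -\frac{1}{2}\bx^\top\Sigma^{-1}\bx \;+\; \bx^\top\Sigma^{-1}\bmu \;-\; \frac{1}{2}\bmu^\top\Sigma^{-1}\bmu, \] where the final term is constant with respect to \(\bx\). You can then match terms with the expression for \(p(\bx)\) above.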

  3. Creating a 2D multivariate Gaussian, and a simple experiment:

    The first element of a vector has distribution \(p(x_1) = \N(x_1; m, \sigma^2)\).

    A second element is generated according to the following process: \[\notag x_2 = \alpha x_1 + \nu, \quad \nu\sim \N(0, n^2). \] Here \(x_2\) depends on \(x_1\), but the noise term \(\nu\) is independent of \(x_1\).

    Recall that a linear combination of independent Gaussian values is Gaussian distributed.

    1. The joint distribution of the vector \(\bx\te[x_1~x_2]^\top\) is Gaussian, and so takes the form \(p(\bx)=\N(\bx;\bmu,\Sigma)\). Identify \(\bmu\) and \(\Sigma\). Advice, if needed, in footnote 2.

      Some working is required, but you can just enter the answers into this box:

    2. Turning to a computer: pick a setting for each of the parameters \(m\), \(\sigma\), \(\alpha\), and \(n\), and simulate samples from the above process. Estimate the mean and covariance from the samples. Include your code, and give some example output, stating whether you seem to get close agreement with your theoretical prediction in part a) or not. (If you’re unsure how to set up the simulation, there is a sketch after part 3 below.)

    3. Try to put error bars on your estimates. Putting a standard error on your estimates of the means should be straightforward. You may have to use some creativity to put error bars on your estimates of the covariances (move on, if you’re stuck).
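
    One possible starting point for part b) is sketched below. It is only a minimal sketch, assuming NumPy is available: the parameter values are arbitrary placeholders, and you should choose, vary, and report your own.

    ```python
    import numpy as np

    # Placeholder parameter choices -- pick and report your own.
    m, sigma, alpha, n = 1.0, 2.0, 0.5, 0.3
    N = 100_000  # number of samples

    rng = np.random.default_rng(0)
    x1 = rng.normal(m, sigma, size=N)          # x1 ~ N(m, sigma^2)
    x2 = alpha*x1 + rng.normal(0, n, size=N)   # x2 = alpha*x1 + nu,  nu ~ N(0, n^2)
    X = np.stack([x1, x2], axis=1)             # N x 2 array of samples

    print(np.mean(X, axis=0))                  # estimate of the mean vector
    print(np.cov(X, rowvar=False))             # estimate of the 2x2 covariance matrix
    ```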

  4. Gaussians, eigenvectors and eigenvalues:

    We can sample from a multivariate Gaussian \(\bx\sim\N(\mathbf{0},\Sigma)\) by drawing a vector of standard normals, \(\bnu\sim\N(\mathbf{0},\I)\), and setting \(\bx\te A\bnu\), for any matrix \(A\) where \(AA^\top \te \Sigma\).

    Real symmetric matrices, like covariance matrices, can always be written in the form (footnote 3) \[\notag \Sigma = Q\Lambda Q^\top, \] where \(\Lambda\) is a diagonal matrix of eigenvalues, and the columns of \(Q\) are the eigenvectors of \(\Sigma\) (footnote 4). A quick numerical check of this decomposition is sketched after this question.

    1. Describe how to sample from \(\N(\mathbf{0},\Sigma)\) using this decomposition. Hint in footnote 5.

    2. \(Q\) is an orthogonal matrix, corresponding to a rigid rotation (and possibly a reflection). Describe geometrically (perhaps in 2D) how your sampling process transforms a cloud of points drawn from a standard normal. Hint in footnote 6.
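
    As footnote 4 suggests, you can check the decomposition numerically. Here is a minimal sketch (assuming NumPy), using an arbitrary random positive semi-definite matrix rather than anything from the course:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A_ = rng.standard_normal((3, 3))
    Sigma = A_ @ A_.T                 # a random covariance-like (positive semi-definite) matrix

    # eigh is the specialised routine for symmetric matrices mentioned in footnote 4
    eigvals, Q = np.linalg.eigh(Sigma)
    Lambda = np.diag(eigvals)

    print(np.allclose(Sigma, Q @ Lambda @ Q.T))   # Sigma = Q Lambda Q^T
    print(np.allclose(Q @ Q.T, np.eye(3)))        # the eigenvector matrix Q is orthogonal
    ```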

  5. Bonus: More on sampling Gaussians, and matrix decompositions:

    [Only do this part this week if you have time. Most groups won’t find time to discuss it. However, if you’re interested in computations involving large Gaussians and covariances, exploring different decompositions could be useful, and you may wish to return to this question.]

    The notes introduced the lower-triangular Cholesky decomposition \(\Sigma=LL^\top\), which can be applied to symmetric positive-definite (but not semi-definite) matrices. As well as being useful for sampling, the Cholesky decomposition gives triangular factors, for which common computations (determinants, matrix inverses, solving linear systems) are quick, so many library routines involving Gaussians use it (footnote 7). A small sketch of these routines is given after this question.

    1. Sometimes instead of decomposing the covariance matrix, we have the Cholesky decomposition of the precision matrix, \(\Sigma^{-1} = CC^\top\), where \(C\) is lower-triangular. How could we use \(C\) to sample from \(\N(\mathbf{0},\Sigma)\)?

    2. Yet another possible decomposition is the principal square root (footnote 8): \(\Sigma = \Sigma^{\noo{2}}\Sigma^{\noo{2}}\), where \(\Sigma^{\noo{2}}\) is symmetric. The decompositions discussed so far are, in general, all different from one another. In this part we’ll try to understand how they’re related.

      1. Consider two different decompositions \(\Sigma = AA^\top = BB^\top\). We’ll assume the matrices are full rank so that we can write \(B = AU\). Show that \(UU^\top \te \I\), the identity matrix, which means that \(U\) is an orthogonal matrix.

      2. Explain geometrically why if computing \(A\bnu\) from \(\bnu\sim \N(\mathbf{0},\I)\) is a way to sample from \(\N(\mathbf{0},\Sigma)\), computing \(B\bnu \te AU\bnu\) will be as well.
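
    To make footnote 7 concrete, here is a minimal sketch (assuming NumPy and SciPy) of the routines it mentions, applied to an arbitrary random positive-definite matrix. The slower “reference” computations appear only to check agreement:

    ```python
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular, cho_factor, cho_solve

    rng = np.random.default_rng(1)
    A_ = rng.standard_normal((4, 4))
    Sigma = A_ @ A_.T + 4*np.eye(4)   # a random symmetric positive-definite matrix

    L = cholesky(Sigma, lower=True)   # Sigma = L @ L.T, with L lower-triangular

    # log-determinant from the Cholesky factor: log|Sigma| = 2 * sum_d log L_dd
    log_det = 2*np.sum(np.log(np.diag(L)))
    print(np.allclose(log_det, np.linalg.slogdet(Sigma)[1]))

    # L^{-1} b without forming inv(L)
    b = rng.standard_normal(4)
    print(np.allclose(solve_triangular(L, b, lower=True), np.linalg.inv(L) @ b))

    # Sigma^{-1} c without forming inv(Sigma)
    c = rng.standard_normal(4)
    print(np.allclose(cho_solve(cho_factor(Sigma, lower=True), c), np.linalg.solve(Sigma, c)))
    ```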


  1. \(\Sigma = \frac{1}{2}A^{-1}, ~~ \bmu = -\frac{1}{2}A^{-1}\bc.\)

  2. We assume you’ve worked through the background sheet on expectations and the lecture material on covariances. You might work through each scalar expectation that you need separately, or reduce the problem to shifting and transforming two standard normals, and apply standard results. Either approach is fine (you could try both if really keen!). Do press on to part b) even if, or especially if, you’re not sure about part a).

  3. https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices
    We can use this decomposition even when the determinant is zero: one or more eigenvalues will be zero.

  4. You could do an eigendecomposition with np.linalg.eig, or for covariance matrices the specialised routine np.linalg.eigh, and check for yourself that this decomposition works in practice.

  5. We need to identify a matrix \(A\) such that \(\Sigma \te AA^\top\). But we can’t immediately see \(A\) in our decomposition, because we have three terms \(Q\Lambda Q^\top\). Splitting \(\Lambda\) into the product of two matrices will help you regroup the terms.

  6. When a transformation is a product of matrices, describe the transformation that each matrix does in turn.

  7. We can quickly find the determinant from the Cholesky factor: \(|L| = \prod_d L_{dd}\), where \(|\Sigma|\te|L|^2\); in practice we more often need \(\log |L| \te \sum_d \log L_{dd}\). The product \(L^{-1}\bb\) can be computed more quickly than with inv(L)@b by calling scipy.linalg.solve_triangular. We can also solve linear systems of the form \(\Sigma^{-1}\bc\) faster than by inverting \(\Sigma\); see scipy.linalg.cho_solve.

  8. Non-examinable: \(\Sigma^{\noo{2}} \te Q\Lambda^{\noo{2}}Q^\top\), using \(Q\) and \(\Lambda\) from the eigendecomposition, where \(\Lambda^{\noo{2}}\) simply replaces each eigenvalue on the diagonal with its square root. However, in code it is better to compute the square root with scipy.linalg.sqrtm.
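
    A quick numerical check of this, as a sketch (assuming SciPy, and an arbitrary random positive-definite matrix):

    ```python
    import numpy as np
    from scipy.linalg import sqrtm

    rng = np.random.default_rng(2)
    A_ = rng.standard_normal((3, 3))
    Sigma = A_ @ A_.T + np.eye(3)      # a random symmetric positive-definite matrix

    S = sqrtm(Sigma)                   # principal square root of Sigma
    print(np.allclose(S, S.T))         # S is symmetric
    print(np.allclose(S @ S, Sigma))   # S @ S recovers Sigma
    ```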