$$\notag \newcommand{\E}{\mathbb{E}} \newcommand{\I}{\mathbb{I}} \newcommand{\N}{\mathcal{N}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bc}{\mathbf{c}} \newcommand{\be}{\mathbf{e}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bmu}{{\boldsymbol{\mu}}} \newcommand{\bnu}{{\boldsymbol{\nu}}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bz}{\mathbf{z}} \newcommand{\eye}{\mathbb{I}} \newcommand{\noo}[1]{\frac{1}{#1}} \newcommand{\te}{\!=\!} \newcommand{\tp}{\!+\!} \newcommand{\ttimes}{\!\times\!} $$

MLPR Tutorial Sheet 2 (Answers)

Reminders: Attempt the tutorial questions before your tutorial. We strongly recommend you discuss these questions — and the course in general — with your peers. Detailed answers will be made available after the week’s tutorials, and can be discussed further on Hypothesis.

The two parts with entry boxes in Q3 are the “core questions”. As described in Tutorial 1, almost all of you should do more than just these.

This tutorial is largely just maths. While we ask for a small numerical experiment in the middle, there’s no real data, and no machine learning. However, throughout the course we will derive models and algorithms that use multivariate Gaussian distributions, and other machine learning methods share some of the same maths. We’ve put this material in the tutorial because it’s useful stuff, and you need to work through it at your own pace. In the meantime, you’re also working on assignment 1, which involves some data!


  1. Warm-up exercise:

    If \(\ba\) and \(\bb\) are \(D\ttimes1\) column vectors and \(M\) is a \(D\ttimes D\) symmetric matrix, show that \[\notag\ba^\top M \bb = \bb^\top M \ba.\] You wouldn’t need to show this result in an exam unless you were explicitly asked to. In some working (like for the next question) you could just state that it’s true for symmetric \(M\).


    Answer:

    As \(\ba^\top M \bb\) is a scalar, \(\ba^\top M \bb = (\ba^\top M \bb)^\top = (M \bb)^\top \ba = \bb^\top M^\top \ba\) using the result from the background material crib-sheet \((AB)^\top = B^\top A^\top\) twice.

    Since \(M\) is symmetric, \(M^\top = M\), and hence \(\ba^\top M \bb = \bb^\top M \ba\).

    Alternatively, you could write the quadratic form as a sum of scalars, where reordering is straightforward: \[ \ba^\top M \bb = \sum_{ij} a_i M_{ij} b_j = \sum_{ij} b_j M_{ij} a_i = \sum_{ij} b_j M_{ji} a_i = \bb^\top M \ba, \] where we can swap the indices on \(M\) because it is symmetric: \(M_{ij}=M_{ji}\).
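
    If you’d like reassurance, a quick numerical check is easy. Here is a minimal sketch using NumPy; the size \(D\) and the random test vectors and matrix are arbitrary choices, not part of the question:

    import numpy as np

    D = 4
    a = np.random.randn(D)
    b = np.random.randn(D)
    G = np.random.randn(D, D)
    M = G + G.T   # make an arbitrary symmetric test matrix
    print(a @ M @ b, b @ M @ a)   # the two numbers should agree to numerical precision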


  2. Identifying a Gaussian:

    As part of a derivation, we may need to identify the probability density function of a vector \(\bx\), up to a constant with respect to \(\bx\). For example: \[\notag p(\bx) \propto \exp\left( -\bx^\top A \bx - \bx^\top\bc\right), \] where \(A\) is a symmetric invertible matrix. As this distribution is proportional to the exponential of a quadratic in \(\bx\), it is a Gaussian: \(p(\bx) = \N(\bx; \bmu, \Sigma)\).

    Identify which Gaussian \(\bx\) comes from by identifying the mean \(\bmu\) and covariance \(\Sigma\) in terms of \(A\) and \(\bc\). The easiest method is to compare \(p(\bx)\) to the standard form for the multivariate Gaussian PDF (given in class).

    Advice: If you’re stuck, make sure you’ve done this w1 in-note question and this w3 in-note question. If you’re lacking confidence, the answer you should be able to show is in a footnote.1 If you’re still stuck, move on: you can discuss it in your tutorial group.


    Working for answer:

    We can write any term that doesn’t involve \(\bx\) as a constant (with respect to \(\bx\)). The quadratic form in the multivariate Gaussian takes the form: \[\begin{align} \log \N(\bx;\bmu,\Sigma) &= -\frac{1}{2}(\bx-\bmu)^\top\Sigma^{-1}(\bx-\bmu) + \text{const.}\\ &= -\frac{1}{2}\bx^\top\Sigma^{-1}\bx+\frac{1}{2}\bx^\top\Sigma^{-1}\bmu + \frac{1}{2}\bmu^\top\Sigma^{-1}\bx + \text{const.}\\ &= -\frac{1}{2}\bx^\top\Sigma^{-1}\bx+\bx^\top\Sigma^{-1}\bmu + \text{const.}\quad\text{($\Sigma$ and $\Sigma^{-1}$ are symmetric)} \end{align}\] In comparison, the quadratic form given in the question is: \[ \log p(\bx) = -\bx^\top A \bx - \bx^\top\bc + \text{const.} \] We can read off the mean and covariance by comparing the coefficients of these forms. Comparing the quadratic term: \[ -\frac{1}{2}\Sigma^{-1} = -A \quad\Rightarrow\quad \fbox{$\Sigma = \frac{1}{2}A^{-1}$}\,. \] Comparing the linear term: \[ \Sigma^{-1}\bmu = -\bc \quad\Rightarrow\quad \fbox{$\bmu = -\Sigma \bc = -\frac{1}{2}A^{-1}\bc$}\,. \]

    Some textbooks (including Bishop) will instruct you to ‘complete the square’ — manipulate the given equation for \(p(\bx)\) into the standard quadratic form that appears in a Gaussian PDF, keeping track of the constants. That approach is fine, but more work than necessary.
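
    If you want to sanity-check the identification numerically, one option is to confirm that \(-\bx^\top A\bx - \bx^\top\bc\) and \(\log\N(\bx;\bmu,\Sigma)\) differ by the same constant at several random test points. A sketch (assuming SciPy is available; the random test matrices are arbitrary, and \(A\) is made positive definite so the density is proper):

    import numpy as np
    from scipy.stats import multivariate_normal

    D = 3
    G = np.random.randn(D, D)
    A = G @ G.T + D*np.eye(D)   # arbitrary symmetric positive-definite test matrix
    c = np.random.randn(D)

    Sigma = 0.5 * np.linalg.inv(A)
    mu = -Sigma @ c

    X = np.random.randn(5, D)   # a few arbitrary test points
    log_q = np.array([-x @ A @ x - x @ c for x in X])
    log_p = multivariate_normal(mu, Sigma).logpdf(X)
    print(log_q - log_p)   # every entry should be the same constant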


  3. Creating a 2D multivariate Gaussian, and a simple experiment:

    The first element of a vector has \(p(x_1) = \N(x_1; m, \sigma^2)\).

    A second element is generated according to the following process: \[\notag x_2 = \alpha x_1 + \nu, \quad \nu\sim \N(0, n^2). \] Here \(x_2\) depends on \(x_1\), but the noise term \(\nu\) is independent of \(x_1\).

    Recall that a linear combination of Gaussian values is Gaussian distributed.

    1. The joint distribution of the vector \(\bx\te[x_1~x_2]^\top\) is Gaussian, and so takes the form \(p(\bx)=\N(\bx;\bmu,\Sigma)\). Identify \(\bmu\) and \(\Sigma\). Advice, if needed, in footnote.2

      Some working is required, but you can just enter the answers into this box:


      Answer:

      \[ \bmu = \begin{bmatrix} m \\ \alpha m \\ \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma^2 & \alpha\sigma^2 \\ \alpha\sigma^2 & \alpha^2\sigma^2 \tp n^2 \\ \end{bmatrix}. \]

      Here’s one way to get the above results. We can identify several expectations: \[\begin{align} \E[x_1] &= m\\ \E[x_1^2] &= \sigma^2 + m^2\\ \E[x_2] &= \alpha \E[x_1] + \E[\nu] = \alpha m\\ \E[x_1x_2] &= \E[ \alpha x_1^2 + \nu x_1] = \alpha \E[x_1^2] + \E[\nu]\E[x_1] = \alpha (\sigma^2 + m^2)\\ \E[x_2^2] &= \E[\alpha^2 x_1^2] + \E[2\alpha x_1\nu] + \E[\nu^2] = \alpha^2(\sigma^2 + m^2) + n^2. \end{align}\] Substituting these into the definitions for the mean and covariance gives the answer stated above; for example, \(\mathrm{cov}[x_1,x_2] = \E[x_1x_2] - \E[x_1]\E[x_2] = \alpha(\sigma^2 + m^2) - \alpha m^2 = \alpha\sigma^2\).

      There are other ways. For example, we could reduce the situation to a transformation of standard normals: \(\bx = A\bz + \bmu\), where \(\bz\sim\N(\mathbf{0},\eye)\) and we know \(\Sigma = AA^\top\). To do that, identify how we can generate each element from univariate standard normals: \[\begin{align} x_1 &= \sigma z_1 + m\\ x_2 &= \alpha x_1 + nz_2 = \alpha\sigma z_1 + \alpha m + n z_2 \end{align}\] Then we can read off the constant shift \(\bmu\) as above, and identify \[ A = \begin{bmatrix} \sigma & 0 \\ \alpha\sigma & n \\ \end{bmatrix}, \] from which \(AA^\top\) gives the same covariance \(\Sigma\) as above.
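
      A quick way to check the algebra (not required for the answer) is to plug arbitrary, non-special parameter values into both expressions and compare. A sketch:

      import numpy as np

      sigma, alpha, n = 1.7, -0.6, 0.9   # arbitrary non-special values
      A = np.array([[sigma, 0.0],
                    [alpha*sigma, n]])
      Sigma = np.array([[sigma**2, alpha*sigma**2],
                        [alpha*sigma**2, (alpha*sigma)**2 + n**2]])
      print(A @ A.T - Sigma)   # should be (numerically) zero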


    2. Turning to a computer: pick a setting for each of the parameters \(m\), \(\sigma\), \(\alpha\), and \(n\), and simulate samples from the above process. Estimate the mean and covariance from the samples. Include your code, and give some example output, stating whether you seem to get close agreement with your theoretical prediction in part a) or not.


      Answer:

      To reduce the total length of the answers, we’ll answer part c) here as well:

      from numpy.random import rand, randn
      import numpy as np
      
      # Pick numbers for a test case. You don't have to pick these randomly,
      # but don't pick special numbers (like 0 or 1), or numbers that might
      # be related to each other.
      mm = randn()
      sigma = np.abs(randn())
      alpha = rand()
      nn = np.abs(randn())
      
      # Because the covariance is not a sum of independent terms, we formed
      # num_trials independent unbiased estimates, so we can check agreement
      # based on simple standard errors. There are other reasonable approaches.
      num_trials = 1000
      
      D = 2
      mu_ests = np.zeros((D, num_trials))
      Sigma_ests = np.zeros((D, D, num_trials))
      N = int(1e3) # estimates in each trial formed from N samples
      xx = np.zeros((N, D))
      for tt in range(num_trials):
          xx[:,0] = sigma*randn(N) + mm
          xx[:,1] = alpha*xx[:,0] + nn*randn(N)
          mu_ests[:, tt] = np.mean(xx, 0)
          Sigma_ests[:, :, tt] = np.cov(xx, rowvar=False, ddof=1)
      
      # If any of the numbers are consistently more than a couple of
      # standard errors out, then I will worry. The estimates and the
      # standard error estimates are coupled, so when one number is out by
      # a lot, others are likely to be as well.
      mu_theory = np.array([mm, alpha*mm])
      mu_est = np.mean(mu_ests, 1)
      mu_eb = np.std(mu_ests, 1) / np.sqrt(num_trials)
      mu_num_se_out = (mu_est - mu_theory) / mu_eb
      
      Sigma_theory = np.array([[sigma**2, alpha*(sigma**2)],
              [alpha*(sigma**2), (alpha**2)*(sigma**2)+nn**2]])
      Sigma_est = np.mean(Sigma_ests, 2)
      Sigma_eb = np.std(Sigma_ests, 2) / np.sqrt(num_trials)
      Sigma_num_se_out = (Sigma_est - Sigma_theory) / Sigma_eb
      
      print('mu: theory', mu_theory, 'est', mu_est, '+/-', mu_eb)
      print('std errs out:', mu_num_se_out)
      print('\nSigma: theory\n', Sigma_theory, '\nest\n', Sigma_est, '\n+/-\n', Sigma_eb)
      print('std errs out:\n', Sigma_num_se_out)

      Example output:

      mu: theory [-0.38668893 -0.37571585]
             est [-0.38677515 -0.37547248] +/- [0.00026255 0.00103507]
      std errs out: [-0.3283931   0.23512809]
      
      Sigma: theory
       [[0.07281336 0.07074713]
       [0.07074713 1.13578104]] 
      est
       [[0.07287119 0.07089284]
       [0.07089284 1.13219392]] 
      +/-
       [[0.00010316 0.00030217]
       [0.00030217 0.00161891]]
      std errs out:
       [[ 0.56057531  0.48220215]
       [ 0.48220215 -2.21576254]]

      We get close agreement (about 3 significant figures) from a million samples in total, which is typical: the error of Monte Carlo estimates scales as \(1/\sqrt{N}\). We are usually within \(\pm\) one or two standard errors, so nothing is obviously wrong. It’s unlikely we’d have derived a wrong expression that was this close on a random example by chance, so I wouldn’t normally do any further analysis of the statistics.


    3. Try to put error bars on your estimates. Putting a standard error on your estimates of the means should be straightforward. You may have to use some creativity to put error bars on your estimates of the covariances (move on, if you’re stuck).

  4. Gaussians, eigenvectors and eigenvalues:

    We can sample from a multivariate Gaussian \(\bx\sim\N(\mathbf{0},\Sigma)\) by drawing a vector of standard normals, \(\bnu\sim\N(\mathbf{0},\I)\), and setting \(\bx\te A\bnu\), for any matrix \(A\) where \(AA^\top \te \Sigma\).

    Real symmetric matrices, like covariance matrices, can always be written in the form3 \[\notag \Sigma = Q\Lambda Q^\top, \] where \(\Lambda\) is a diagonal matrix of eigenvalues, and the columns of \(Q\) are the eigenvectors of \(\Sigma\).4

    1. Describe how to sample from \(\N(\mathbf{0},\Sigma)\) using this decomposition. Hint in footnote.5

    2. \(Q\) is an orthogonal matrix, corresponding to a rigid rotation (and possibly a reflection). Describe geometrically (perhaps in 2D) how your sampling process transforms a cloud of points drawn from a standard normal. Hint in footnote.6


    Answer:

    1. We can split up \(\Lambda = \Lambda^{\noo{2}}\Lambda^{\noo{2}}\), by taking the square-root of each eigenvalue on the diagonal, allowing us to regroup the decomposition as “\(AA^\top\)”. Noting that a diagonal matrix is symmetric, we write: \[ \Sigma = Q\Lambda^{\noo{2}}\Lambda^{\noo{2}}Q^\top = (Q\Lambda^{\noo{2}})(Q\Lambda^{\noo{2}})^\top. \] So we can use the standard sampling procedure given in the question, with \(A=Q\Lambda^{\noo{2}}\).

    2. We can compute each \(Q\Lambda^{\noo{2}}\bnu\) by first applying \(\Lambda^{\noo{2}}\), which scales each dimension of \(\bnu\) independently. The circular/spherical cloud of points is stretched by \(\sqrt{\Lambda_{dd}}\) along the \(d\)th axis (for each \(d\)) and becomes an elliptical cloud of points. (In 3D think of something like a squashed rugby ball or American football.) The axes of the ellipse, where its tips point, lie along the coordinate axes. We then apply the rotation \(Q\). The cloud keeps the same elliptical shape, but its axes no longer line up with the coordinate axes: they now point along the eigenvectors stored in the columns of \(Q\).

      Regardless of whether points are actually sampled using an eigendecomposition, we can now describe the shape of a Gaussian distribution. The axes of the ellipsoidal cloud of points are given by the eigenvectors, with the typical spread along each axis given by the square roots of the eigenvalues. This view will be useful when we review Principal Components Analysis (PCA) later in the course.
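
      In code, the whole sampling procedure is only a few lines. A sketch, using an arbitrary example covariance (np.linalg.eigh returns the eigenvalues in a vector w and the eigenvectors as the columns of Q):

      import numpy as np

      Sigma = np.array([[2.0, 0.8],
                        [0.8, 1.0]])   # an arbitrary example covariance
      w, Q = np.linalg.eigh(Sigma)     # Sigma = Q diag(w) Q^T
      A = Q * np.sqrt(w)               # scales column d of Q by sqrt(w[d]), i.e. Q Lambda^{1/2}
      print(np.max(np.abs(A @ A.T - Sigma)))   # ~0: A is a valid "A A^T" factor

      N = 100000
      X = (A @ np.random.randn(2, N)).T        # samples from N(0, Sigma)
      print(np.cov(X, rowvar=False))           # should be close to Sigma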


  5. Bonus: More on sampling Gaussians, and matrix decompositions:

    [Only do this part this week if you have time. Most groups won’t find time to discuss it. However, if you’re interested in computations involving large Gaussians and covariances, exploring different decompositions could be useful, and you may wish to return to this question.]

    The notes introduced the lower-triangular Cholesky decomposition \(\Sigma=LL^\top\), which can be applied to symmetric positive-definite (but not semi-definite) matrices. As well as being useful for sampling, common computations involving triangular matrices (determinants, matrix inverses, solving equations) are quick, so many library routines involving Gaussians use the Cholesky decomposition.7

    1. Sometimes instead of decomposing the covariance matrix, we have the Cholesky decomposition of the precision matrix, \(\Sigma^{-1} = CC^\top\), where \(C\) is lower-triangular. How could we use \(C\) to sample from \(\N(\mathbf{0},\Sigma)\)?


      Answer:

      If \(\Sigma^{-1} = CC^\top\), then \(\Sigma = C^{-\top}C^{-1}\). Thus we can use \(A=C^{-\top}\) in the sampling procedure given in the question. In code we don’t need to invert \(C\): we can apply \(C^{-\top}\) quickly with a triangular solve involving \(C\). A quick check in case I’ve messed up the transposes (I mess up frequently):
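
      One possible version of such a check, sketched with an arbitrary random test covariance and scipy.linalg.solve_triangular to apply \(C^{-\top}\) without forming an explicit inverse:

      import numpy as np
      from scipy.linalg import solve_triangular

      D = 3
      G = np.random.randn(D, D)
      Sigma = G @ G.T + D*np.eye(D)                 # arbitrary test covariance
      C = np.linalg.cholesky(np.linalg.inv(Sigma))  # Sigma^{-1} = C C^T, C lower-triangular
      A = solve_triangular(C, np.eye(D), trans='T', lower=True)  # A = C^{-T}
      print(np.max(np.abs(A @ A.T - Sigma)))        # should be ~0

      (In an actual sampler we would apply the triangular solve directly to \(\bnu\) rather than forming \(C^{-\top}\); the check forms it only for clarity.)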

      (Non-examinable: In some large-scale statistical applications, the precision and its Cholesky are sparse (contain many zeros), but their inverses are not, so we want to avoid explicitly inverting matrices to exploit fast sparse matrix computations. R-INLA is an example software package where such computations are important.)


    2. Yet another possible decomposition is the principal square root8: \(\Sigma = \Sigma^{\noo{2}}\Sigma^{\noo{2}}\), where \(\Sigma^{\noo{2}}\) is symmetric. None of the decompositions discussed so far are the same. In this part we’ll try to understand how they’re related.

      1. Consider two different decompositions \(\Sigma = AA^\top = BB^\top\). We’ll assume the matrices are full rank so that we can write \(B = AU\). Show that \(UU^\top \te \I\), the identity matrix, which means that \(U\) is an orthogonal matrix.

      2. Explain geometrically why if computing \(A\bnu\) from \(\bnu\sim \N(\mathbf{0},\I)\) is a way to sample from \(\N(\mathbf{0},\Sigma)\), computing \(B\bnu \te AU\bnu\) will be as well.


      Answer:

      1. \(B = AU\) means \(U = A^{-1}B\), which we substitute into \[ UU^\top = A^{-1}B(A^{-1}B)^\top = A^{-1}(BB^\top)A^{-\top} = A^{-1}\Sigma A^{-\top} = A^{-1}A A^\top A^{-\top} = \I. \]

      2. If \(\bnu\) comes from a spherically symmetric distribution, then rotating that distribution (applying an orthogonal transformation) does nothing. Individual points move, but the distribution over outcomes \(\bz\te U\bnu\) is still \(\N(\mathbf{0},\I)\). Thus whether or not we apply any orthogonal transformation \(U\) to our standard normal draws before applying \(A\) makes no difference to the distribution defined by the procedure.
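
      As an aside, you can see this relationship concretely by comparing the Cholesky factor with the principal square root. A sketch, using scipy.linalg.sqrtm (mentioned in the footnote) and an arbitrary positive-definite test covariance:

      import numpy as np
      from scipy.linalg import sqrtm

      D = 3
      G = np.random.randn(D, D)
      Sigma = G @ G.T + D*np.eye(D)       # arbitrary positive-definite test covariance

      L = np.linalg.cholesky(Sigma)       # Sigma = L L^T
      S = np.real(sqrtm(Sigma))           # Sigma = S S, with S symmetric
      U = np.linalg.solve(L, S)           # U = L^{-1} S, so S = L U
      print(np.max(np.abs(U @ U.T - np.eye(D))))   # ~0: U is orthogonal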



  1. \(\Sigma = \frac{1}{2}A^{-1}, ~~ \bmu = -\frac{1}{2}A^{-1}\bc.\)

  2. We assume you’ve worked through the background sheet on expectations and the lecture material on covariances. You might work through each scalar expectation that you need separately, or reduce the problem to shifting and transforming two standard normals, and apply standard results. Either approach is fine (you could try both if really keen!). Do press on to part b) even if, or especially if, you’re not sure about part a).

  3. https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices
    We can use this decomposition even when the determinant is zero: one or more eigenvalues will be zero.

  4. You could do an eigendecomposition with np.linalg.eig, or for covariance matrices the specialised routine np.linalg.eigh, and check for yourself that this decomposition works in practice.

  5. We need to identify a matrix \(A\), such that \(\Sigma \te AA^\top\). But we can’t immediately see \(A\) in our decomposition, because we have three terms \(Q\Lambda Q^\top\). Splitting \(\Lambda\) into the product of two matrices will help you regroup the terms.

  6. When a transformation is a product of matrices, describe the transformation that each matrix does in turn.

  7. We can quickly find the determinant from the Cholesky: \(|L| = \prod_d L_{dd}\) where \(|\Sigma|\te|L|^2\), or more frequently \(\log |L| \te \sum_d \log L_{dd}\). A product \(L^{-1}\bb\) can be computed more quickly with a call to scipy.linalg.solve_triangular than with inv(L)@b. We can also solve linear systems of the form \(\Sigma^{-1}\bc\) faster than by inverting \(\Sigma\); see scipy.linalg.cho_solve.
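
    For example, a sketch with an arbitrary test covariance (cho_factor/cho_solve and solve_triangular are the SciPy routines mentioned above):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve, solve_triangular

    D = 3
    G = np.random.randn(D, D)
    Sigma = G @ G.T + D*np.eye(D)     # arbitrary test covariance
    b = np.random.randn(D)

    L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T, L lower-triangular
    log_det_Sigma = 2*np.sum(np.log(np.diag(L)))  # log|Sigma| = 2 log|L|
    x = solve_triangular(L, b, lower=True)        # x = L^{-1} b without forming inv(L)
    y = cho_solve(cho_factor(Sigma), b)           # y = Sigma^{-1} b without forming inv(Sigma)
    print(np.allclose(y, np.linalg.solve(Sigma, b)))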

  8. Non-examinable: \(\Sigma^{\noo{2}} \te Q\Lambda^{\noo{2}}Q^\top\), using \(Q\) and \(\Lambda\) from the eigendecomposition, and \(\Lambda^{\noo{2}}\) simply replaces each eigenvalue on the diagonal with its square root. However, it would be better to compute it with sqrtm.