$$\notag \newcommand{\E}{\mathbb{E}} \newcommand{\I}{\mathbb{I}} \newcommand{\N}{\mathcal{N}} \newcommand{\be}{\mathbf{e}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bmu}{{\boldsymbol{\mu}}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\g}{\,|\,} \newcommand{\intd}{\;\mathrm{d}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} $$

Multivariate Gaussians

[This note assumes that you know the background material on expectations of random variables.]

We’re going to use Gaussian distributions as parts of models of data, and to represent beliefs about models. Most models and algorithms in machine learning involve more than one scalar variable however. (A scalar meaning a single number, rather than a vector of values.) Multivariate Gaussians generalize the univariate Gaussian distribution to multiple variables, which can be dependent.

Video: Multivariate Gaussian Distribution (34 minutes)
Describes a sampling process to transform samples from a standard normal into samples from general multivariate Gaussians. We derive the covariance and PDF of these distributions. Erratum: at 1:35 rand should be randn.

1 Independent Standard Normals

We could sample a vector \(\bx\) by independently sampling each element from a standard normal distribution, \(x_d\sim\N(0,1)\). Because the variables are independent, the joint probability is the product of the individual or marginal probabilities: \[ p(\bx) = \prod_{d=1}^D p(x_d) = \prod_{d=1}^D \N(x_d; 0,1). \] Usually I recommend that you write any Gaussian PDFs in your maths using the \(\N(x;\mu,\sigma^2)\) notation unless you have to expand them. It will be less writing, and clearer. Here, I want to combine the PDFs, so will substitute in the standard equation: \[\begin{align} p(\bx) &= \prod_{d=1}^D \frac{1}{\sqrt{2\pi}}\, e^{-x_d^2/2} = \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}\sum_d x_d^2}\\ &= \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}\bx^\top\bx}. \end{align}\] The PDF is proportional to the Radial Basis Functions (RBFs) we’ve used previously. Here the normalizer \(1/(2\pi)^{D/2}\) means that the PDF integrates to one.

Like an RBF centred at the origin, this density function only depends on the square-distance or radius of \(\bx\) from the origin. Any point in a spherical shell (or a circular shell in 2-dimensions) is equally probable. Therefore if we simulate points in 2-dimensions and draw a scatter plot:

# Python
N = int(1e4); D = 2
X = np.random.randn(N, D)
plt.plot(X[:,0], X[:,1], '.')
plt.axis('square')
plt.show()

We will see a diffuse circular spray of points. The spherical symmetry is a special property of Gaussians. If you were to draw independent samples from, say, a Laplace distribution you would see a non-circular distribution that has more density close to the axes.

2 Covariance

The multivariate generalization of variance, is covariance, which is represented with a matrix. While a variance is often denoted \(\sigma^2\), a covariance matrix is often denoted \(\Sigma\) — not to be confused with a summation \(\sum_{d=1}^D \ldots\)

The elements of the covariance matrix for a random vector \(\bx\) are: \[ \mathrm{cov}[\bx]_{ij} = \E[x_ix_j] - \E[x_i]\E[x_j]. \] On the diagonal, where \(i\te j\), you will see that this definition gives the scalar variances \(\mathrm{var}[x_i]\) for each of the elements of the vector. We can write the whole matrix with a linear algebra expression: \[ \mathrm{cov}[\bx] = \E[\bx\bx^\top] - \E[\bx]\E[\bx]^\top. \]

Question: What is the covariance \(S\) of the spherical distribution of the previous section? (We will reserve \(\Sigma\) for the covariance of the general Gaussian in the next section.)

The video will take you through the answer, so we’re not making you type it in. But by considering each element \(S_{ij}\), you should be able to derive the answer yourself using the results in the review notes on expectations.

 

Answer: The first term is \(\E[x_ix_j] = \E[x_i]\E[x_j]\) if \(x_i\) and \(x_j\) are independent, which they are if \(i\!\ne\!j\). Thus \(S_{i\ne j} = 0\). The diagonal elements \(S_{ii}\) are equal to the variances of the individual variables, which are all equal to one. Therefore, \(S_{ij} = \delta_{ij}\), where \(\delta_{ij}\) is a Kronecker delta. Or as a matrix, \(S = \I\), the identity matrix.

2.1 Empirical covariance

The covariance above is a formal property of a distribution.

An empirical covariance (or sample covariance) is where the expectations in the definition of covariance, are replaced with averages over samples1. In NumPy, np.cov computes a covariance using expectations under a uniform distribution over \(N\) samples. Annoyingly this function requires an input of shape (D,N), so an (N,D) design matrix must be transposed.

If you have any doubt how covariances are computed, you should write your own version of cov from primitive matrix operations, and check agreement with np.cov.

3 Transforming and Rotating: general Gaussians

As with one-dimensional Gaussians, we can generalize the standard zero-mean, unit-variance Gaussian by a linear transformation and a shift. If any of the steps here are unclear, make sure you are comfortable with the univariate Gaussian note first.

If we generated the elements of \(\bx\) from independent \(\N(0,1)\) draws as above, we could form a linear combination of these outcomes: \[ \by = A\bx. \] To keep the discussion simpler, I will assume that \(A\) is square and invertible, so \(\by\) has the same dimensionality as \(\bx\).

Question: What is the covariance \(\Sigma\) of the new variable \(\by\)?

 

Answer: Simply substitute \(\by\) into the definition: \[\begin{align} \mathrm{cov}[\by] &= \E[\by\by^\top] - \E[\by]\E[\by]^\top\\ &= \E[A\bx\bx^\top A^\top] - \E[A\bx]\E[A\bx]^\top\\ &= A\E[\bx\bx^\top]A^\top - A\E[\bx](A\E[\bx])^\top. \end{align}\] Because \(\E[\bx]\) is zero, the second term is zero, and the expectation in the first term is equal to \(\mathrm{cov}[\bx] = \I\). Therefore, \[ \mathrm{cov}[\by] = \Sigma = AA^\top. \]

 

Because we’re assuming \(A\) is invertible, we can compute the original vector from the transformed one: \(\bx = A^{-1}\by\). Substituting that expression into the PDF for \(\bx\) we can see the shape of the new PDF: \[\begin{align} p(\by) &\propto e^{-\frac{1}{2} (A^{-1}\by)^\top(A^{-1}\by)}\\ &\propto e^{-\frac{1}{2} \by^\top A^{-\top}A^{-1}\by}. \end{align}\] As we saw in the univariate Gaussian note, if we stretch out a PDF, we must scale it down so that the distribution remains normalized. If we apply a linear transformation \(A\) to a volume of points, then the volume is multiplied by \(|A|\), the determinant of the matrix.2 Therefore, \[ p(\by) = \frac{1}{|A|(2\pi)^{D/2}}\, e^{-\frac{1}{2} \by^\top A^{-\top}A^{-1}\by}. \] Usually this expression is re-written in terms of the covariance of the vector. Noticing that \[\begin{align} \Sigma^{-1} &= A^{-\top}A^{-1}, ~\text{and}\\ |\Sigma| &= |AA^\top| = |A||A^\top| = |A|^2, \end{align}\] we can write: \[ p(\by) = \frac{1}{|\Sigma|^{1/2}(2\pi)^{D/2}}\, e^{-\frac{1}{2} \by^\top \Sigma^{-1}\by} = |2\pi\Sigma|^{-1/2}\, e^{-\frac{1}{2} \by^\top \Sigma^{-1}\by}. \] As demonstrated above, there are different equivalent ways to write the normalizing constant, and different books will choose different forms.3

Finally, we can shift the distribution to have non-zero mean: \[ \bz = \by + \bmu. \] Shifting the PDF does not change its normalization, so we can simply substitute \(\by=\bz\tm\bmu\) into the PDF for \(\by\): \[ \fbox{$p(\bz) = \N(\bz; \bmu, \Sigma) = \frac{1}{|\Sigma|^{1/2}(2\pi)^{D/2}} \, e^{-\frac{1}{2} (\bz-\bmu)^\top \Sigma^{-1}(\bz-\bmu)}.$} \] Here we’ve generalized the \(\N\) notation for Gaussian distributions to take a mean vector and a matrix of covariances. In one-dimension, these quantities still correspond to the scalar mean, and the variance.

It’s a common mistake to forget the matrix inverse inside the exponential. The inverse covariance matrix \(\Sigma^{-1}\), is also known as the precision matrix.

4 Covariances are positive (semi-)definite

[This section may be tough going on first reading. If so, that’s ok: just keep going to the “check your understanding” section, which you should work through.]

Covariance matrices are always symmetric: in the definition of covariance \(\text{cov}[x_i,x_j] = \text{cov}[x_j,x_i]\,\) or \(\,\Sigma_{ij}=\Sigma_{ji}\). Moreover, just as variances must be positive — or zero if we are careful — there is a positive-like constraint on covariance matrices.

A real4 symmetric matrix \(\Sigma\) is positive definite iff5 it satisfies: \[ \bz^\top \Sigma\,\bz > 0, \quad\text{for all real vectors $\bz\ne\mathbf{0}$.} \] these matrices are always invertible, and the inverse is also positive definite: \[ \bz^\top \Sigma^{-1}\bz > 0, \quad\text{for all real vectors $\bz\ne \mathbf{0}$.} \] Therefore, the exponential term in the probability density of a Gaussian falls as we move away from the mean in any direction iff the covariance is positive definite. If the density didn’t fall in some direction then it wouldn’t be normalized (integrate to one), and so wouldn’t be a valid density.

Edge case: In general, covariances can be positive semi-definite, which means: \[ \bz^\top \Sigma\,\bz \ge 0, \quad\text{for all real vectors $\bz$.} \] However, if \(\bz^\top \Sigma\,\bz\te 0\) for some \(\bz\!\ne\!0\), then the determinant \(|\Sigma|\) will be zero, and the covariance won’t be invertible. Therefore the expression we gave for the probability density is only valid for strictly positive definite covariances.

An example of a Gaussian distribution where the covariance isn’t strictly positive definite can be simulated by drawing \(x_1\sim\N(0,1)\) and deterministically setting \(x_2 = x_1\). You should be able to show that the theoretical covariance of such vectors is: \[ \Sigma = \left[ \begin{array}{cc} 1 & 1 \\ 1 & 1 \\ \end{array} \right], \] and you should be able to simulate this process to confirm numerically6.

In this example, the probability density is zero “almost everywhere”, for any \(\bx\) where \(x_1\!\ne\!x_2\). The only way to make \(\int p(\bx)\intd\bx = 1\) is to make the density infinite along the line \(x_1\te x_2\). Gaussian distributions with zero-determinant covariances generalize the Dirac delta function, where the distribution is constrained to a surface with zero volume, rather than just a point. Care is required with such distributions, both analytically and numerically. We will stick to strictly positive definite covariances whenever we can.

Given a real-valued matrix \(A\), \(\Sigma = AA^\top\) is always positive semi-definite. Moreover, if \(\Sigma\) is symmetric and positive semi-definite, it can always be written in this form. Allowing non-symmetric \(\Sigma\) wouldn’t expand the set of probability densities that can be expressed7. Therefore, the process for sampling from a Gaussian that was described in this document is general: we can sample from any Gaussian by transforming draws from a standard normal, and such a process always generates points from a distribution with a well-defined covariance.

5 Computing \(A\) to sample from \(\N(\mathbf{0}, \Sigma)\)

Given a covariance matrix \(\Sigma\), the transformation \(A\) is not uniquely defined. For example \[ A = \left[ \begin{array}{cc} 2 & 0 \\ 0 & 2 \\ \end{array} \right], \quad\text{and}\quad A = \left[ \begin{array}{cc} \sqrt{3} & 1 \\ -1 & \sqrt{3} \\ \end{array} \right], \] both give the same product \(AA^\top\). However, any \(A\) such that \(\Sigma = AA^\top\), can be used to transform points to sample from \(\N(\mathbf{0}, \Sigma)\).

It’s common to use a fast and standard matrix routine known as the (lower-triangular) Cholesky decomposition, which is easy to call in NumPy:

D = 3; Sigma = np.cov(np.random.randn(D, 3*D))
A = np.linalg.cholesky(Sigma)
Sigma_from_A = A @ A.T  # up to round-off error, matches Sigma

Do try this out, and look at the matrices; this checking step is not just for beginners! I still routinely check that functions do what I expect, because I am still frequently bitten by nasty surprises. For example, SciPy also has a cholesky function:

import scipy.linalg as sla
A = sla.cholesky(Sigma)
Sigma_wrong = A @ A.T  # doesn't match Sigma!

Unlike NumPy, the SciPy version gives the upper-triangular Cholesky decomposition by default — a difference you notice if you look at the matrices for a small example. You need to transpose the result, or use A = sla.cholesky(Sigma, lower=True).

6 Check your understanding

Diagonal transformation: A special case of a general transformation \(A\), is a diagonal matrix, that simply stretches each variable independently: \(\Lambda_{ij} = \delta_{ij}\sigma_i\). (This Kronecker delta notation was used earlier in the note.) For three dimensions the transformation would be: \[ \Lambda = \left[ \begin{array}{ccc} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \\ \end{array} \right]. \]

open above question in new tab (allows annotation)

Investigate a special case: You could use Python to sample some points from different multivariate Gaussians, and see how the covariance affects the cloud of points.

For example, you could use a family of transformations parameterized by \(a\): \[ A = \left[ \begin{array}{cc} 1 & 0 \\ a & 1\tm a \\ \end{array} \right], \] What does this transformation do? Is it clear why the variables are dependent for \(a\!\ne\!0\)? When are the variables maximally dependent8? What happens to the PDF as \(a\rightarrow1\) and why? Does the covariance matrix have an inverse when \(a\te1\)?

open above question in new tab (allows annotation)

Contours: The shape of a two-dimensional Gaussian is often sketched using a contour of its PDF. Just like the Radial Basis Function (RBF) discussed last week, the contours of a radially symmetric Gaussian are circular. So if you compute the \((x_1,x_2)\) coordinates of some points on a circle, these can be joined up to plot a contour of the Gaussian with identity covariance. You can then transform these points, just like sample positions, with a matrix \(A\), to plot a contour of \(\N(\bx;\mathbf{0},AA^\top)\).

open above question in new tab (allows annotation)

7 Further reading

https://en.wikipedia.org/wiki/Multivariate_normal_distribution

Both Bishop Section 2.3 and Barber Section 8.4 start with the definition that this note builds up to, and then works in the reverse direction from there to build up an interpretation. These sections then go further than this note, and both these books have some further exercises. The treatment by Murphy, Section 2.5.2, is rather more terse!

Transforming the PDF of the spherical distribution required getting the normalization correct due to the change of variables. If you would like a more rigorous treatment, or to understand what to do if the transformation is non-linear, I’ll defer to the text books. The maths for transforming a PDF due to a change of variables is quickly reviewed in Barber Section 8.2, Result 8.1. Murphy’s treatment is longer this time, in Section 2.6.


  1. There is also an “\((N\tm1)\)” version of the estimator, just as there is for estimating variances.↩︎

  2. Here \(|A|\) is the “Jacobian of the transformation”, although confusingly “Jacobian” can refer to both a matrix and its determinant. The change of variables might be clearer if we label the different probability density functions: \(p_Y(\mathbf{y}) = p_X(\mathbf{x})/|A| = p_X(A^{-1}\mathbf{y})/|A|\). See also the further reading section.↩︎

  3. Over the years, many students have questioned whether and why \(1/(|\Sigma|^{1/2}(2\pi)^{D/2}) = |2\pi\Sigma|^{-1/2}\). The first thing I do when unsure, is check an example numerically. For example:
    D = 5; Sigma = np.cov(np.random.randn(D, 10*D))
    lhs = 1 / (np.linalg.det(Sigma)**0.5 * (2*np.pi)**(D/2))
    rhs = np.linalg.det(2*np.pi*Sigma)**-0.5
    To understand why they’re equal: for a scalar \(c\) and a matrix \(A\), we can write \(|cA| = |c\I A| = |c\I||A| = c^D|A|\). The transformation \(c\I\) stretches an object in each of \(D\) directions by \(c\). The determinant gives the resulting volume change of \(c^D\).↩︎

  4. I often forget such distinctions, because I rarely deal with complex numbers. Although machine learning systems that use complex numbers have been proposed.↩︎

  5. “iff” means “if and only if”.↩︎

  6. For example: x1 = np.random.randn(10**6); X = np.hstack((x1[:,None], x1[:,None])); np.cov(X.T)↩︎

  7. Because \(\bx^\top M \bx = \bx^\top(\frac{1}{2}M + \frac{1}{2}M^\top) \bx\), for any \(\bx\) and square matrix \(M\). So we can replace a non-symmetric precision \(\Sigma^{-1}\) with the symmetric matrix \((\frac{1}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-\top})\), and the covariance will be symmetric too.↩︎

  8. We don’t need a formal definition of dependence here. If you first understand the transformation, this question has a clear answer and is not ambiguous.↩︎