$$\notag \newcommand{\D}{\mathcal{D}} \newcommand{\N}{\mathcal{N}} \newcommand{\be}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\ith}{^{(i)}} \newcommand{\jth}{^{(j)}} \newcommand{\nth}{^{(n)}} \newcommand{\te}{\!=\!} \newcommand{\tp}{\!+\!} $$

MLPR Tutorial Sheet 5

The two parts with entry boxes are the “core questions”, and the same rules and guidance apply as for Tutorial 1. We are expecting you to attempt all the tutorial questions. You can seek clarifications and hints on Hypothesis.


  1. Recovering the kernel function from a GP prior sample:

    In the lecture notes, we’ve seen the Gaussian kernel function: \[ \notag k(\bx^{(i)}, \bx^{(j)}) = \exp(-\|\bx^{(i)}-\bx^{(j)}\|^2). \] Denoting the Euclidean distance between \(\bx^{(i)}\) and \(\bx^{(j)}\) by \(\Delta_{ij}\), we can rewrite the kernel function as: \[ \notag k(\bx^{(i)}, \bx^{(j)}) = \exp(-\Delta_{ij}^2). \] This is an example of an isotropic kernel: a kernel that depends only on the distance between its arguments. The gp_minimal.py demo contains code for calculating this function with additional parameters ell and sigma_f (line 40 of the file). The function takes \(N\times D\) and \(M\times D\) design matrices, and returns \(N\times M\) kernel values.
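
    For reference, here is a minimal NumPy sketch of this kind of kernel function, assuming the common parameterisation \(k(\bx\ith, \bx\jth) \te \sigma_f^2 \exp\big(-\|\bx\ith-\bx\jth\|^2/(2\ell^2)\big)\); the exact conventions in gp_minimal.py may differ:

    ```python
    import numpy as np

    def gauss_kernel(X1, X2, ell=1.0, sigma_f=1.0):
        """Gaussian kernel between N x D and M x D design matrices.

        Assumes k(x, x') = sigma_f**2 * exp(-||x - x'||**2 / (2*ell**2));
        returns an N x M matrix of kernel values.
        """
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, for all pairs at once
        sq_dists = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
                    - 2 * X1 @ X2.T)
        return sigma_f**2 * np.exp(-sq_dists / (2 * ell**2))
    ```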

    1. Another kernel function mentioned in the lecture notes is the following: \[ \notag k(\bx^{(i)}, \bx^{(j)}) = (1 + \|\bx^{(i)}-\bx^{(j)}\|) \exp(-\|\bx^{(i)}-\bx^{(j)}\|). \] Write code to evaluate this kernel function, taking the same design-matrix arguments as in the GP demo.
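
      A minimal sketch of one possible implementation (the name new_kernel is a placeholder, and the pairwise-distance computation mirrors the sketch above):

      ```python
      import numpy as np

      def new_kernel(X1, X2):
          """Kernel k(x, x') = (1 + ||x - x'||) * exp(-||x - x'||).

          Takes N x D and M x D design matrices and returns an
          N x M matrix of kernel values, as in the GP demo.
          """
          sq_dists = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
                      - 2 * X1 @ X2.T)
          # Clip tiny negative values caused by floating-point round-off
          dists = np.sqrt(np.maximum(sq_dists, 0))
          return (1 + dists) * np.exp(-dists)
      ```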

    2. Given a 1-dimensional grid of locations \(x^{(i)}\) and Gaussian process prior function values \(f_i=f(x^{(i)})\), describe how to estimate the kernel function value for a specific separation \(\Delta\), assuming the kernel is isotropic. We’re looking for a simple estimate; the week 6 material isn’t required.
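
      One simple estimator of the kind intended (a sketch, not the only valid answer): under a zero-mean prior, \(k(\Delta) = \mathbb{E}[f_i f_j]\) for any pair of locations a distance \(\Delta\) apart, so we can average products of function values over all pairs of grid points at that separation: \[ \notag \hat{k}(\Delta) = \frac{1}{|\mathcal{P}_\Delta|} \sum_{(i,j)\in\mathcal{P}_\Delta} f_i f_j, \qquad \mathcal{P}_\Delta = \{(i,j) : |x\ith - x\jth| = \Delta\}. \]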

    3. Why are estimates of the kernel function values likely to become worse as \(\Delta\) increases?

    4. Now, generate a sample from a Gaussian process with the following code:
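
      (The code listing is not reproduced here. The following is a minimal sketch of the kind of sampling code intended, assuming the kernel from the part above and a fine one-dimensional grid; the names X_grid and f_grid match those used below, but the grid and jitter settings are assumptions.)

      ```python
      import numpy as np

      # Sketch of drawing one sample from a GP prior on a fine 1-D grid
      # (the handout's exact grid and kernel choices may differ).
      N_grid = 200
      X_grid = np.linspace(0, 10, N_grid)[:, None]   # N x 1 design matrix
      K_grid = new_kernel(X_grid, X_grid)            # N x N Gram matrix
      L = np.linalg.cholesky(K_grid + 1e-9*np.eye(N_grid))  # jitter for stability
      f_grid = L @ np.random.randn(N_grid)           # one draw: f ~ N(0, K)
      ```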

      For detailed comments about this code, have a look at the GP demo code. Write code to estimate the kernel function from X_grid and f_grid, and plot a comparison of your estimated kernel function with the true kernel function.
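
      One possible implementation, assuming the evenly-spaced X_grid and the single draw f_grid from the sketch above, and restricting separations to multiples of the grid spacing:

      ```python
      import matplotlib.pyplot as plt

      # For each lag, average products of function values over all pairs
      # of grid points separated by Delta = lag * spacing (cf. the
      # estimator sketched above).
      spacing = X_grid[1, 0] - X_grid[0, 0]
      lags = np.arange(N_grid // 2)
      deltas = lags * spacing
      k_est = [np.mean(f_grid[:N_grid - lag] * f_grid[lag:]) for lag in lags]

      k_true = (1 + deltas) * np.exp(-deltas)   # the true kernel used above

      plt.plot(deltas, k_est, label='estimated')
      plt.plot(deltas, k_true, label='true')
      plt.xlabel(r'$\Delta$')
      plt.ylabel(r'$k(\Delta)$')
      plt.legend()
      plt.show()
      ```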

  2. Gaussian processes with non-zero mean:

    In the lecture notes we assumed that the prior over any vector of function values was zero mean: \(\bff\sim\N(\mathbf{0}, K)\). We focussed on the covariance or kernel function \(k(\bx\ith, \bx\jth)\), which gives the elements \(K_{ij}\) of the covariance matrix (also called the ‘Gram matrix’).

    If we know in advance that the distribution of outputs should be centered around some other mean \(\bm\), we could put that into the model by using a Gaussian process prior with mean \(\bm\) such that \(\bff\sim\N(\bm, K)\). Instead, we usually subtract the known mean \(\bm\) from the \(\by\) data, and just use the zero mean model.

    Sometimes we don’t really know the mean \(\bm\), but look at the data to estimate it. A fully Bayesian treatment puts a prior on \(\bm\) and, because it’s an unknown, considers all possible values when making predictions. A flexible prior on the mean vector could be another Gaussian process(!). Our model for our noisy observations is now: \[\begin{align} \notag \bm &\sim \N(\mathbf{0}, K_m), \quad \text{$K_m$ from kernel function $k_m$,}\\ \notag \bff &\sim \N(\bm, K_f), \quad \text{$K_f$ from kernel function $k_f$,}\\ \notag \by &\sim \N(\bff, \,\sigma^2_n\mathbb{I}), \quad \text{noisy observations.} \end{align} \] Show that — despite our efforts — the function values \(\bff\) still come from a function drawn from a zero-mean Gaussian process. That is, when \(\bm\) is not observed, the resulting (marginal) distribution of the function values \(\bff\) is a zero-mean Gaussian. Identify the resulting covariance function of the zero-mean process for \(f\).
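
    If you get stuck, here is a sketch of the key step (not the full argument): the conditional \(\bff\mid\bm \sim \N(\bm, K_f)\) is equivalent to writing \[ \notag \bff = \bm + \be, \qquad \be\sim\N(\mathbf{0}, K_f) \text{ independent of } \bm\sim\N(\mathbf{0}, K_m), \] and sums of independent Gaussian vectors are Gaussian, with both means and covariances adding.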

    Identify the mean’s kernel function \(k_m\) for two restricted types of mean: 1) An unknown constant \(m_i \te b\), with \(b\sim\N(0,\sigma_b^2)\). 2) An unknown linear trend: \(m_i \te m(\bx\ith) \te \bw^\top\bx\ith \tp b\), with Gaussian priors \(\bw\sim\N(\mathbf{0},\sigma_w^2\mathbb{I})\), and \(b\sim\N(0,\sigma_b^2)\).
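
    As a worked example for restricted type 1) (type 2 follows the same pattern): with \(m_i \te b\) for every \(i\) and \(b\sim\N(0,\sigma_b^2)\), \[ \notag k_m(\bx\ith, \bx\jth) = \mathrm{cov}(m_i, m_j) = \mathbb{E}[b^2] = \sigma_b^2, \] a constant kernel.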

    Sketch three typical draws from a GP prior with kernel: \[\notag k(x\ith, x\jth) = 0.1^2\exp\big(-(x\ith-x\jth)^2/2\big) + 1. \] Hints in footnote 1 at the end of this sheet.

  3. Pre-processing for Bayesian linear regression and Gaussian processes:

    We have a dataset of inputs and outputs \(\{\bx\nth,y\nth\}_{n=1}^N\), describing \(N\) preparations of cells from some lab experiments. The output of interest, \(y\nth\), is the fraction of cells that are alive in preparation \(n\). The first input feature of each preparation indicates whether the cells were created in lab A, B, or C. That is, \(\smash{x_1\nth\!\in\! \{\texttt{A},\texttt{B},\texttt{C}\}}\). The other features are real numbers describing experimental conditions such as temperature and concentrations of chemicals and nutrients.

    1. Describe how you might represent the first input feature and the output when learning a regression model to predict the fraction of alive cells in future preparations from these labs. Explain your reasoning.
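
      For the categorical feature, one common choice (given here as an illustration, not the required answer) is a one-hot encoding:

      ```python
      import numpy as np

      # Map lab identities to three binary columns:
      # 'A' -> [1,0,0], 'B' -> [0,1,0], 'C' -> [0,0,1].
      labs = np.array(['A', 'C', 'B', 'A'])   # example lab labels
      one_hot = (labs[:, None] == np.array(['A', 'B', 'C'])).astype(float)
      ```

      You would still need to justify this choice over alternatives, and to decide how to represent the output fraction.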

    2. Compare using the lab identity as an input to your regression (as you’ve discussed above) with two baseline approaches: i) ignore the lab feature, treating the data from all labs as if they came from one lab; ii) split the dataset into three parts, one for lab A, one for B, and one for C, then train three separate regression models.

      Discuss both simple linear regression and Gaussian process regression. Is it possible for these models, when given the lab identity as in part 1, to learn to emulate either or both of the two baselines?

    3. There’s a debate in the lab about how to represent the other input features: should the regression use temperature or log-temperature, and should temperature be measured in Fahrenheit, Celsius, or Kelvin? There’s a similar debate about whether to use concentration or log-concentration as inputs. Discuss ways in which these issues could be resolved.

      Harder: there is also a debate about two different representations of the output. Describe how this debate could be resolved.


  1. You can get the answer to this question by making a tiny tweak to the Gaussian process demo code provided with the class notes. However, please try reasoning about the answer first, and make sure you can explain why the plots look like they do.