The parts with entry boxes in Q1 are the “core questions”. As described in Tutorial 1, almost all of you should do more than just these.
Bayesian regression with multiple data chunks:
In this question, you will generate datasets and calculate posterior distributions. If you have trouble, at least try to work out roughly what should happen for each part.
We will use the probabilistic model for regression from the lectures: \[\notag p(y\g \bx,\bw) = \N(y;\,f(\bx;\bw),\,\sigma_y^2). \]
In this question, we set \(f(\bx;\bw) = \bw^\top \bx\) and assume a diagonal prior covariance matrix: \[\notag p(\bw) = \N(\bw;\,\bw_0,\,\sigma_{\bw}^2\I). \]
From the lecture notes (w4a), we have for the posterior distribution \(p(\bw\g \D) = \N(\bw;\, \bw_N, V_N)\) with: \[\begin{align} V_N &= \sigma_y^2(\sigma_y^2V_0^{-1} + \Phi^\top \Phi)^{-1},\\ \bw_N &= V_N V_0^{-1}\bw_0 + \frac{1}{\sigma_y^2}V_N \Phi^\top\by. \end{align}\] With \(\Phi=X\) and \(V_0 = \sigma_{\bw}^2\I\), we get: \[\begin{align} V_N &= \sigma_y^2(\sigma_y^2\frac{1}{\sigma_{\bw}^2}\I + X^\top X)^{-1},\\ \bw_N &= V_N\frac{1}{\sigma_{\bw}^2}\I\bw_0 + \frac{1}{\sigma_y^2}V_N X^\top\by. \end{align}\]
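For concreteness, these covariance-form equations translate directly into NumPy. Below is a minimal sketch (not prescribed by the question), assuming arrays X, y and a prior mean w0 named after the symbols above; part b) later avoids the explicit matrix inverse by tracking precision matrices instead.

import numpy as np

def posterior_covariance_form(X, y, w0, sigma_w, sigma_y):
    # Sketch: posterior mean and covariance from the equations above
    D = X.shape[1]
    V0_inv = np.eye(D) / sigma_w**2                             # V_0^{-1}
    V_N = sigma_y**2 * np.linalg.inv(sigma_y**2 * V0_inv + X.T @ X)
    w_N = V_N @ (V0_inv @ w0) + (V_N @ (X.T @ y)) / sigma_y**2
    return w_N, V_N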
What distribution with what parameters is \(p(\bw\g \D)\) approaching when we let \(\sigma_{\bw}^2\) approach infinity? Please justify your answer.
For the rest of the question, we assume \(\bw = [w_1 ~~ w_2]^\top\) and \(\bx = [x_1 ~~ x_2]^\top\), and \(\sigma_y^2 = 1^2 = 1\), and \(\bw_0 = [-5 ~~ 0]^\top\), and \(\sigma_{\bw}^2 = 2^2 = 4\).
Generating data: Below, we provide code for generating two synthetic data chunks.
import numpy as np
from numpy.random import randn, uniform
import matplotlib.pyplot as plt
D = 2 # Dimension of the weight space
N_Data_1 = 15 # Number of samples in dataset 1
N_Data_2 = 30 # Number of samples in dataset 2
sigma_w = 2.0
prior_mean = [-5, 0]
prior_precision = np.eye(D) / sigma_w**2
# We summarize distributions using their parameters
prior_par = {'mean': prior_mean, 'precision': prior_precision}
# Here we draw the true underlying w. We do this only once
w_tilde = sigma_w * randn(D) + prior_mean
# Draw the inputs for datasets 1 and 2
X_Data_1 = 0.5 * randn(N_Data_1, D)
X_Data_2 = 0.1 * randn(N_Data_2, D) + 0.5
# Draw the outputs for the datasets
sigma_y = 1.0
y_Data_1 = np.dot(X_Data_1, w_tilde) + sigma_y * randn(N_Data_1)
y_Data_2 = np.dot(X_Data_2, w_tilde) + sigma_y * randn(N_Data_2)
# The complete datasets
Data_1 = {'X': X_Data_1,
          'y': y_Data_1}
Data_2 = {'X': X_Data_2,
          'y': y_Data_2}
Explanation: First, we draw a \(\tilde{\bw}\) from the prior distribution (i.e. draw a sample from the Gaussian). In the whole dataset generation process, we draw this \(\tilde{\bw}\) only once and keep it constant.
We then generate two chunks of data \(\D_1\) and \(\D_2\). For chunk \(\D_1 = \{\bx^{(n)},y^{(n)}\}_{n=1}^{15}\), we generate 15 input-output pairs where each \(\bx^{(n)}\) is drawn from \(\N(\bx^{(n)};\,\mathbf{0},0.5^2\I)\). For chunk \(\D_2 = \{\bx^{(n)},y^{(n)}\}_{n=16}^{45}\), we generate 30 input-output pairs where each \(\bx^{(n)}\) is drawn from \(\N(\bx^{(n)};\,[0.5 ~~ 0.5]^\top,0.1^2\I)\). For both chunks, we draw the \(y^{(n)}\) outputs using the \(p(y\g \bx,\bw)\) observation model above.
We use Python dictionaries to better organise the dataset and parameters: prior_par holds the parameters of the prior distribution, and Data_1 and Data_2 hold the two chunks of data \(\D_1\) and \(\D_2\), respectively. If you haven’t worked with dictionaries before, you can think of them as collections of stored variables which can be accessed by specified keys. For instance, you can access X_Data_1 in Data_1 with the expression Data_1['X']. Here, Data_1 is the dictionary, X_Data_1 is the stored variable and 'X' is the key.
Complete the following function for calculating the posterior parameters. To avoid calculating inverse matrices, we keep track of the precision matrices \(K_0 = V_0^{-1}\) and \(K_N = V_N^{-1}\) and use the update equations: \[\begin{align} K_N &= K_0 + \frac{1}{\sigma_y^2}X^\top X,\\ \bw_N &= K_N^{-1} K_0\bw_0 + \frac{1}{\sigma_y^2}K_N^{-1} X^\top\by. \end{align}\]
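A minimal sketch of one way such a function could look, assuming it takes a data dictionary and a parameter dictionary in the same 'mean'/'precision' format as prior_par (the name posterior_par is an assumption, not prescribed by the sheet):

import numpy as np

def posterior_par(Data, prior_par, sigma_y=1.0):
    # Sketch: Gaussian posterior parameters, tracking precision matrices
    X, y = Data['X'], Data['y']
    K0 = prior_par['precision']
    w0 = np.asarray(prior_par['mean'], dtype=float)
    K_N = K0 + (X.T @ X) / sigma_y**2                 # K_N = K_0 + X^T X / sigma_y^2
    # w_N = K_N^{-1} (K_0 w_0 + X^T y / sigma_y^2), solved without forming K_N^{-1}
    w_N = np.linalg.solve(K_N, K0 @ w0 + (X.T @ y) / sigma_y**2)
    return {'mean': w_N, 'precision': K_N}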
Write a function that takes a mean and a covariance of any 2D multivariate Gaussian as input arguments. The function should visualise the Gaussian by drawing 400 samples from the distribution and showing the samples in a 2D scatter plot. Let the function also show the mean as a bold cross in the same plot. To implement this function, you can use the cholesky and solve_triangular functions from scipy.linalg. Then visualise the prior distribution and discuss the relationship between the parameters and the belief expressed by the prior.
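A minimal sketch of one possible visualisation function follows (the name plot_gaussian and the plotting choices are assumptions). It samples using the Cholesky factor of the covariance; solve_triangular becomes useful if you instead parameterise the Gaussian by its precision matrix.

import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky

def plot_gaussian(mean, cov, n_samples=400):
    # Sketch: scatter samples from a 2D Gaussian and mark its mean with a cross
    mean = np.asarray(mean, dtype=float)
    L = cholesky(cov, lower=True)                     # cov = L L^T
    samples = mean + np.random.randn(n_samples, 2) @ L.T
    plt.scatter(samples[:, 0], samples[:, 1], s=5, alpha=0.5)
    plt.plot(mean[0], mean[1], 'k+', markersize=15, markeredgewidth=3)
    plt.xlabel('$w_1$')
    plt.ylabel('$w_2$')
    plt.show()

# For example, the prior above could be visualised with:
# plot_gaussian(prior_par['mean'], sigma_w**2 * np.eye(D))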
Using the functions from question parts b) and c), calculate and visualise the posterior distributions that you get after observing: (i) only the first data chunk \(\D_1\); and (ii) both data chunks \(\D_1\) and \(\D_2\) together.
Take the posterior that you got in d) for \(p(\bw\g \D_1)\) and now use it as a prior. Calculate the new posterior that you get after observing \(\D_2\). Numerically compare the mean and covariance with those of \(p(\bw\g \D_1,\D_2)\).
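For illustration, one possible workflow for this part, assuming the hypothetical posterior_par sketch above (which tracks precisions rather than covariances):

post_1 = posterior_par(Data_1, prior_par)          # p(w | D_1)
post_12_seq = posterior_par(Data_2, post_1)        # p(w | D_1, D_2) in two steps
Data_12 = {'X': np.vstack([X_Data_1, X_Data_2]),
           'y': np.concatenate([y_Data_1, y_Data_2])}
post_12_joint = posterior_par(Data_12, prior_par)  # p(w | D_1, D_2) in one step
print(np.allclose(post_12_seq['mean'], post_12_joint['mean']))
print(np.allclose(post_12_seq['precision'], post_12_joint['precision']))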
Is it important that the inputs were drawn from Gaussian distributions? What would happen if the \(\bx^{(n)}\) in \(\D_2\) were drawn from a uniform distribution on the unit square?
Now assume that our observations \(y\) are corrupted by additional Gaussian noise: \[\notag p(z\g \bx,\bw) = \N(z;\, y,\sigma_z^2). \] So we now observe datasets \(\D_z = \{\bx^{(n)},z^{(n)}\}_{n=1}^{N}\). What is the new posterior distribution \(p(\bw \g \D_z)\)? Hint: Recall how you did Question 1.a) of this sheet.
More practice with Gaussians:
\(N\) noisy independent observations are made of an unknown scalar quantity \(m\): \[\notag x\nth \sim \N(m, \sigma^2). \]
We don’t give you the raw data, \(\{x\nth\}\), but tell you the mean of the observations: \[\notag \bar{x} = \frac{1}{N} \sum_{n=1}^N x\nth. \] What is the likelihood[1] of \(m\) given only this mean \(\bar{x}\)? That is, what is \(p(\bar{x}\g m)\)?[2]
A sufficient statistic is a summary of some data that contains all of the information about a parameter.
Show that \(\bar{x}\) is a sufficient statistic of the observations for \(m\), assuming we know the noise variance \(\sigma^2\). That is, show that \(p(m\g \bar{x}) \te p(m\g \{x\nth\}_{n=1}^N)\).
If we don’t know the noise variance \(\sigma^2\) or the mean, is \(\bar{x}\) still a sufficient statistic in the sense that \(p(m\g \bar{x}) \te p(m\g \{x\nth\}_{n=1}^N)\)? Explain your reasoning.
Note: In i) we were implicitly conditioning on \(\sigma^2\): \(p(m\g \bar{x}) = p(m\g \bar{x}, \sigma^2)\). In this part, \(\sigma^2\) is unknown, so \(p(m\g \bar{x}) = \int p(m,\sigma^2\g \bar{x})\,\mathrm{d}\sigma^2\). However, no detailed mathematical manipulation (or solving of integrals) is required.
Bonus question: Conjugate priors (This question sets up some intuitions about the larger picture for Bayesian methods.)
A conjugate prior for a likelihood function is a prior where the posterior is a distribution in the same family as the prior. For example, a Gaussian prior on the mean of a Gaussian distribution is conjugate to Gaussian observations of that mean.
The inverse-gamma distribution is a distribution over positive numbers. It’s often used to put a prior on the variance of a Gaussian distribution, because it’s a conjugate prior.
The inverse-gamma distribution has pdf (as cribbed from Wikipedia): \[ p(z\g \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, z^{-\alpha-1}\, \exp\!\left(-{\frac{\beta}{z}}\right), \qquad \text{with}~\alpha>0,~~\beta>0, \] where \(\Gamma(\cdot)\) is the gamma function.[3]
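As a numerical aside (not needed for the derivation), the log of this pdf can be evaluated with gammaln, as footnote [3] mentions. A minimal sketch assuming SciPy:

import numpy as np
from scipy.special import gammaln

def inv_gamma_log_pdf(z, alpha, beta):
    # Log of the inverse-gamma pdf given above, evaluated element-wise
    z = np.asarray(z, dtype=float)
    return alpha * np.log(beta) - gammaln(alpha) - (alpha + 1) * np.log(z) - beta / z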
Assume we obtain \(N\) observations from a zero-mean Gaussian with unknown variance, \[ x\nth \sim \N(0,\, \sigma^2), \quad n = 1\dots N, \] and that we place an inverse-gamma prior with parameters \(\alpha\) and \(\beta\) on the variance. Show that the posterior over the variance is inverse-gamma, and find its parameters.
Hint: you can assume that the posterior distribution is a distribution; it normalizes to one. You don’t need to keep track of normalization constants, or do any integration. Simply show that the posterior matches the functional form of the inverse-gamma, and then you know the normalization (if you need it) by comparison to the pdf given.
If a conjugate prior exists, then the data can be replaced with sufficient statistics. Can you explain why?
[1] We’re using the traditional statistics usage of the word “likelihood”: it’s a function of parameters given data, equal to the probability of the data given the parameters. You should avoid saying “likelihood of the data” (cf. p29 of MacKay’s textbook), although you’ll see that usage too.
[2] The sum of Gaussian outcomes is Gaussian distributed; you only need to identify a mean and variance.
[3] Numerical libraries often come with a gammaln or lgamma function to evaluate the log of the gamma function.