Vidhi Lalchand
Doctoral Student @ Cambridge Bayesian Machine Learning
MIT Kavli Institute for Astrophysics and Space Research
22-03-2023
Research Seminar
- Thomas Garrity
Gaussian processes as a "function" learning paradigm.
Regression with GPs: Both inputs (X) and outputs (Y) are observed.
Latent Variable Modelling with GP: Only outputs (Y) are observed.
Without loss of generality, we are going to assume $X \equiv \{x_n\}_{n=1}^{N}$, $X \in \mathbb{R}^{N \times D}$ and $Y \equiv \{y_n\}_{n=1}^{N}$, $Y \in \mathbb{R}^{N \times 1}$.
Hot take: Almost all machine learning comes down to modelling functions.
We want to learn a function $f$ of inputs $x \in \mathbb{R}^{d}$.
Model selection is a hard problem!
What if we were not forced to decide the complexity of $f$ at the outset? What if $f$ could calibrate its complexity on the fly as it sees the data? This is precisely what is called non-parametric learning.
Gaussian Processes
Gaussian processes are a powerful non-parametric paradigm for performing state-of-the-art regression.
We need to understand the notion of a distribution over functions.
A continuous function $f$ on the real domain $\mathbb{R}^{d}$ can be thought of as an infinitely long vector evaluated at some index set $[x_1, x_2, \ldots]$:
$[f(x_1), f(x_2), f(x_3), \ldots]$
Gaussian processes are probability distributions over functions!
Interpretation of functions
Sticking point: we cannot represent infinite-dimensional vectors on a computer... true, but bear with me.
$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, where $m(x)$ is a mean function and $k(x, x')$ is a covariance function.
What is a GP?
The most intuitive way of understanding GPs is understanding the correspondence between Gaussian distributions and Gaussian processes.
A sample from a $k$-dimensional Gaussian $x \sim \mathcal{N}(\mu, \Sigma)$ is a vector of size $k$: $x = [x_1, \ldots, x_k]$.
The mathematical crux of a GP is that $[f(x_1), f(x_2), f(x_3), \ldots, f(x_N)]$ is just an $N$-dimensional multivariate Gaussian $\mathcal{N}(\mu, K)$.
A GP is an infinite-dimensional analogue of a Gaussian distribution → a sample from it is a vector of infinite length?
But in practice, we only ever need to represent our function $f(x)$ at a finite index set $X = [x_1, \ldots, x_{500}]$. So we are interested in our long function vector only at $[f(x_1), f(x_2), f(x_3), \ldots, f(x_{500})]$.
Function samples from a GP
The kernel function $k(x, x')$ is the heart of a GP; it controls all of the inductive biases of our function space, such as shape, periodicity, and smoothness.
Prior over functions: sample draws from a zero-mean GP prior under different kernel functions.
In reality, they are just draws from a multivariate Gaussian $\mathcal{N}(0, K)$, where the covariance matrix has been evaluated by applying the kernel function to all pairs of data points.
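As a minimal sketch of this (NumPy; the RBF kernel, grid, and constants below are illustrative assumptions), drawing function samples amounts to building $K$ from the kernel and sampling a multivariate Gaussian:

```python
# Minimal sketch: draw GP prior samples by applying an RBF kernel to all pairs
# of inputs to build K, then sampling from N(0, K).
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

x = np.linspace(-5, 5, 500)                     # finite index set [x_1, ..., x_500]
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))    # small jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=5)
# Each row of `samples` is one function draw f(x) evaluated on the index set.
```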
Infinite-dimensional prior: $f(x) \sim \mathcal{GP}(m(x), k_{\theta}(x, x'))$
For a finite set of points $X$: $f(X) \sim \mathcal{N}(m(X), K_X)$
$k_{\theta}(x, x')$ encodes the support and inductive biases in function space.
Gaussian Process Regression
How do we fit functions to noisy data with GPs?
1. Given some noisy data $y = \{y_i\}_{i=1}^{N}$ at input locations $X = \{x_i\}_{i=1}^{N}$.
2. You believe your data comes from a function f corrupted by Gaussian noise.
$y = f(X) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^{2})$
Data likelihood: $y \mid f \sim \mathcal{N}(f(X), \sigma^{2}\mathbb{I})$
Prior over functions: $f \mid \theta \sim \mathcal{GP}(0, k_{\theta})$
(The choice of kernel function $k_{\theta}$ controls how your function space looks.)
…but we still need to fit the kernel hyperparameters $\theta$.
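To make the set-up concrete, here is a tiny sketch that generates toy data exactly as this model assumes (the sinusoidal $f$, $N = 25$ and $\sigma = 0.2$ are illustrative assumptions, not from the talk); it is reused in the sketches below:

```python
# Toy data generated as the model assumes: a latent function corrupted by Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=25)                     # N = 25 input locations
sigma = 0.2                                         # noise standard deviation
y = np.sin(X) + sigma * rng.normal(size=X.shape)    # y = f(X) + eps, eps ~ N(0, sigma^2)
```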
Learning Step:
Learning in Gaussian process models occurs through maximisation of the marginal likelihood w.r.t. the kernel hyperparameters.
$p(y \mid \theta) = \int p(y \mid f)\, p(f \mid \theta)\, \mathrm{d}f$
(the data likelihood integrated against the prior over functions; this marginal likelihood is the denominator of Bayes' rule)
Learning in a GP
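A sketch of this learning step, reusing rbf_kernel and the toy (X, y) from the sketches above (the log-space parameterisation, initial values, and choice of L-BFGS-B are assumptions): fit $\theta$ by minimising the negative log marginal likelihood.

```python
# Maximise log p(y | theta) = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - N/2 log(2 pi),
# with Ky = K_theta + noise_var * I, over the RBF lengthscale, variance and noise.
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    lengthscale, variance, noise_var = np.exp(log_params)   # log-space keeps params positive
    Ky = rbf_kernel(X, X, lengthscale, variance) + (noise_var + 1e-8) * np.eye(len(X))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # Ky^{-1} y
    return (0.5 * y @ alpha                                  # data-fit term
            + np.sum(np.log(np.diag(L)))                     # 1/2 log|Ky|
            + 0.5 * len(X) * np.log(2 * np.pi))

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B")
lengthscale, variance, noise_var = np.exp(res.x)             # fitted hyperparameters theta
```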
Popular Kernels
Usually, the user picks one on the basis of prior knowledge.
Each kernel depends on some hyperparameters $\theta$, which are tuned in the training step.
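For illustration, here are minimal sketches of two other popular kernels to complement the RBF defined earlier, in their standard textbook parameterisations (the hyperparameter names are assumptions): a Matérn-3/2 and a periodic kernel.

```python
# Two further kernel choices; each hyperparameter below is part of theta.
import numpy as np

def matern32(x1, x2, lengthscale=1.0, variance=1.0):
    """Matern-3/2: variance * (1 + sqrt(3) r / l) * exp(-sqrt(3) r / l)."""
    r = np.sqrt(3.0) * np.abs(x1[:, None] - x2[None, :]) / lengthscale
    return variance * (1.0 + r) * np.exp(-r)

def periodic(x1, x2, lengthscale=1.0, variance=1.0, period=1.0):
    """Periodic (exp-sine-squared): variance * exp(-2 sin^2(pi |x - x'| / p) / l^2)."""
    d = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)
```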
Predictions in a GP
We want to infer latent function values $f_{*}$ at arbitrary input locations $X_{*}$, so in a distributional sense we want
$p(f_{*} \mid X_{*}, y, \theta)$
Posterior Predictive Distribution
The predictive posterior is available in closed form (because we are operating in a world of Gaussians):
Joint: $\begin{bmatrix} y \\ f_{*} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K_{XX} + \sigma^{2}\mathbb{I} & K_{X*} \\ K_{*X} & K_{**} \end{bmatrix}\right)$
Conditional: $f_{*} \mid y \sim \mathcal{N}\left(K_{*X}(K_{XX} + \sigma^{2}\mathbb{I})^{-1}y,\; K_{**} - K_{*X}(K_{XX} + \sigma^{2}\mathbb{I})^{-1}K_{X*}\right)$
(the conditional can be derived from the joint using standard Gaussian conditioning and symmetry arguments)
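A minimal sketch of this conditional in code, reusing rbf_kernel, the toy (X, y), and the fitted hyperparameters from the sketches above (the test grid is an illustrative assumption):

```python
# Closed-form GP posterior predictive: mean K_{*X} Ky^{-1} y and
# covariance K_{**} - K_{*X} Ky^{-1} K_{X*}.
import numpy as np

def gp_predict(X_star, X, y, lengthscale, variance, noise_var):
    Ky = rbf_kernel(X, X, lengthscale, variance) + noise_var * np.eye(len(X))
    K_sf = rbf_kernel(X_star, X, lengthscale, variance)        # K_{*X}
    K_ss = rbf_kernel(X_star, X_star, lengthscale, variance)   # K_{**}
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # Ky^{-1} y
    mean = K_sf @ alpha
    V = np.linalg.solve(L, K_sf.T)
    cov = K_ss - V.T @ V
    return mean, cov

X_star = np.linspace(-3, 3, 100)
mean, cov = gp_predict(X_star, X, y, lengthscale, variance, noise_var)
std = np.sqrt(np.diag(cov))   # pointwise predictive uncertainty on f_*
```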
Examples of GP Regression
Ground truth vs. GP reconstruction.
Gaussian processes can also be used in contexts where the observations form a gigantic data matrix $Y \equiv \{y_n\}_{n=1}^{N}$, $y_n \in \mathbb{R}^{D}$. $D$ can be pretty big, $\approx$ 1000s.
Imagine a stack of images, where each image has been flattened into a vector of pixels and the vectors stacked together row-wise in a matrix.
e.g. MNIST: each $28 \times 28$ image flattens into a row of $D = 784$ pixels, and $N$ images stack into an $N \times D$ matrix.
The Gaussian process bridge
Schematic of a Gaussian process Latent Variable Model: a GP bridges the 2d latent space and the high-dimensional ($N \times D$) data space.
Structure / clustering in latent space can reveal insights into the high-dimensional data - for instance, which points are similar.
each cluster is a digit (coloured by labels)
$Z \in \mathbb{R}^{N \times Q}$ (latent variables)
$F \in \mathbb{R}^{N \times D}$ (latent function values)
$Y \in \mathbb{R}^{N \times D}$ ($= F +$ noise)
Mathematical set-up
Data likelihood: $p(Y \mid F) = \prod_{n,d} \mathcal{N}(y_{nd} \mid f_{nd}, \sigma^{2})$, i.e. $Y = F +$ noise.
Prior structure: each column $f_{:,d} \sim \mathcal{GP}(0, k_{\theta}(z, z'))$ over the latent inputs $Z$, with $z_n \sim \mathcal{N}(0, \mathbb{I}_Q)$.
The data are stacked row-wise but modelled column-wise, each column with a GP.
Optimisation objective:
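As a rough sketch of one common choice, assuming a MAP-style GPLVM in which $Z$ and $\theta$ are optimised jointly: sum the GP log marginal likelihood over the $D$ output columns, with the latent $Z$ playing the role of the inputs, and add the standard-normal log prior on $Z$. The objective actually used (e.g. a variational bound over $Z$) may differ; the shared RBF kernel and parameterisation below are assumptions.

```python
# MAP-style GPLVM objective: sum of per-column GP log marginal likelihoods with
# latent inputs Z, plus the log prior on Z; minimise the negative of it over
# the flattened [Z, log-hyperparameters] (e.g. with scipy.optimize.minimize).
import numpy as np

def rbf_kernel_latent(Z1, Z2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel on Q-dimensional latent points.
    sqdist = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def neg_gplvm_objective(flat_params, Y, Q):
    N, D = Y.shape
    Z = flat_params[:N * Q].reshape(N, Q)
    lengthscale, variance, noise_var = np.exp(flat_params[N * Q:])
    Ky = rbf_kernel_latent(Z, Z, lengthscale, variance) + (noise_var + 1e-6) * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))        # Ky^{-1} Y, all D columns at once
    log_marginal = (-0.5 * np.sum(Y * alpha)
                    - D * np.sum(np.log(np.diag(L)))           # -D/2 log|Ky|
                    - 0.5 * N * D * np.log(2 * np.pi))
    log_prior = -0.5 * np.sum(Z ** 2)                          # z_n ~ N(0, I_Q), up to a constant
    return -(log_marginal + log_prior)
```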
Disentanglement of cell cycle and treatment effects: the learned latent space separates treatment, latent batch effects, and cell cycle phase.
Robust to Missing Data: MNIST Reconstruction
Reconstructions with 30% and 60% missing data.
Robust to Missing Data: Motion Capture
Thank you!
vr308@cam.ac.uk
@VRLalchand