Can we scale Gaussian Process Latent Variable Models (GPLVMs) to big data?

Vidhi Lalchand

 

Harvard Computer Science

03-04-2023 

"Functions describe the world"

- Thomas Garrity

Outline

Gaussian processes as a "function" learning paradigm.

Regression with GPs: Both inputs \((X)\) and outputs \( (Y)\) are observed.

Latent Variable Modelling with GP: Only outputs \( (Y)\) are observed.

Without loss of generality, we are going to assume \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times D}\) and \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times 1}\)

Gaussian Processes

Gaussian processes are a powerful non-parametric paradigm for performing state-of-the-art regression.

  • They are probabilistic \( \rightarrow\) user has a sense of prediction uncertainty.
  • They don't have standard parameters \( \rightarrow \) they model the mapping \( f \) directly by placing a prior over the space of functions!

We need to understand the notion of a distribution over functions.

What is a GP?

A sample from a \(k\)-dimensional Gaussian \( \mathbf{x} \sim \mathcal{N}(\mu, \Sigma) \) is a vector of size \(k\). $$ \mathbf{x} = [x_{1}, \ldots, x_{k}] $$

The mathematical crux of a GP is that \( [f(x_{1}), f(x_{2}), \ldots, f(x_{N})]\) is just an \(N\)-dimensional multivariate Gaussian \( \mathcal{N}(\mu, K) \).

\begin{bmatrix} f_{1} \\ \vdots\\ f_{499} \\ f_{500} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \vdots\\ \mu_{499} \\ \mu_{500} \end{bmatrix}, \begin{bmatrix} k(x_{1}, x_{1}) & \ldots & k(x_{1}, x_{500}) \\ \vdots & \ddots & \vdots \\ k(x_{500}, x_{1}) & \ldots & k(x_{500}, x_{500}) \end{bmatrix} \right)

A GP is an infinite-dimensional analogue of a Gaussian distribution \( \rightarrow \) a sample from it is a function, i.e. a vector of infinite length.

f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime}))

In practice, we only ever need to represent our function \( f(x) \) at a finite index set \( \mathcal{X} = [x_{1},\ldots, x_{500}]\), so we only care about the (long) function vector \( [f(x_{1}), f(x_{2}), \ldots, f(x_{500})]\).

Function samples from a GP

f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime}))

The kernel function \( k(x,x')\) is the heart of a GP: it controls the inductive biases of the function space, such as shape, periodicity and smoothness.

Prior over functions: sample draws from a zero-mean GP prior under different kernel functions.

In reality, they are just draws from a multivariate Gaussian \( \mathcal{N}(0, K)\) where the covariance matrix has been evaluated by applying the kernel function to all pairs of data points.
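To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the talk): build \(K\) by evaluating an RBF kernel on all pairs of inputs and draw function samples from the resulting multivariate Gaussian. The kernel choice and hyperparameter values are illustrative.

```python
# A minimal sketch: samples from a zero-mean GP prior with an RBF kernel are just
# draws from N(0, K), where K is built by evaluating the kernel on all pairs of inputs.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2))."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

X = np.linspace(-5, 5, 500)                      # finite index set [x_1, ..., x_500]
K = rbf_kernel(X, X) + 1e-6 * np.eye(len(X))     # jitter for numerical stability

# Each sample is one "function" evaluated at the 500 inputs.
rng = np.random.default_rng(0)
prior_draws = rng.multivariate_normal(mean=np.zeros(len(X)), cov=K, size=5)
print(prior_draws.shape)                         # (5, 500): 5 function draws
```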

Infinite dimensional prior:

 \(f(x) \sim \mathcal{GP}(m(x), k_{\theta}(x,x^{\prime})) \)

For a finite set of points \( X \):

 \(f(X) \sim \mathcal{N}(m(X), K_{X})\)

\( k_{\theta}(x,x^{\prime})\) encodes the support and inductive biases in function space.

Gaussian Process Regression

How do we fit functions to noisy data with GPs?

1. Given some noisy data \( \bm{y} = \lbrace{y_{i}}\rbrace_{i=1}^{N} \) at \( X = \{ x_{i}\}_{i=1}^N\) input locations.

2. You believe your data comes from a function \( f\) corrupted by Gaussian noise.  

$$ \bm{y} = f(X) + \epsilon, \hspace{10pt} \epsilon \sim \mathcal{N}(0, \sigma^{2})$$

Data Likelihood: \( \hspace{10pt}  y|f \sim \mathcal{N}(f(x), \sigma^{2}) \)

Prior over functions: \( f|\theta \sim \mathcal{GP}(0, k_{\theta}) \)

(The choice of kernel function \( k_{\theta}\) controls how your function space looks.)

\begin{aligned} \Rightarrow \hspace{5pt} \bm{f} &\sim \mathcal{N}(0, K_{\theta}) \\ K_{i,j} &= k_{\theta}(x_{i}, x_{j}) \hspace{2mm} \forall i, j \end{aligned}

...but we still need to fit the kernel hyperparameters \( \theta\).

Learning Step:

 

\begin{aligned} p(\bm{y}|\bm{\theta}) &= \int p(\bm{y}|\bm{f})p(\bm{f}|\bm{\theta})d\bm{f}\\ &= \int \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I})\,\mathcal{N}(\bm{f}; \bm{0}, K_{\theta})d\bm{f} \\ &= \mathcal{N}(\bm{0}, K_{\theta} + \sigma^{2}_{n}\mathbb{I}) \end{aligned}
\bm{\theta_{\star}} = \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta})

Learning in Gaussian process models occurs through the maximisation of the marginal likelihood w.r.t. the kernel hyperparameters.

Learning in a GP

\begin{aligned} p(f|y, \theta) &= \dfrac{\overbrace{p(y|f)}^{\textrm{data likelihood}}\,\overbrace{p(f|\theta)}^{\textrm{prior}}}{\underbrace{p(y|\theta)}_{\textrm{denominator of Bayes' rule}}} \end{aligned}

\begin{aligned} \bm{\theta_{*}} &= \argmax_{\bm{\theta}}\mathcal{L}(\bm{\theta}) \\ p(y|\theta) &= \int p(y|f)p(f|\theta)df = \mathcal{N}(0, K_{\theta} + \sigma^{2}_{n}\mathbb{I})\\ \textrm{where } \mathcal{L}(\bm{\theta}) = \log p(y|\theta) &= -\overbrace{\frac{1}{2}y^{T}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y}^{\textrm{model fit}} -\overbrace{\frac{1}{2}\log|K_{\theta} + \sigma^{2}_{n}\mathbb{I}|}^{\textrm{complexity penalty}} - \dfrac{n}{2}\log 2\pi \end{aligned}
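As an illustration of the learning step, here is a minimal NumPy/SciPy toy sketch (an assumed RBF kernel and synthetic data, not the talk's code) that evaluates \( -\log p(\bm{y}|\bm{\theta})\) via a Cholesky factorisation and optimises the lengthscale, signal variance and noise variance with L-BFGS.

```python
# A minimal sketch: maximise the log marginal likelihood w.r.t. the kernel
# hyperparameters (and noise variance). Log-parameters keep everything positive.
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(x1, x2, lengthscale, variance):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def negative_log_marginal_likelihood(log_params, X, y):
    lengthscale, variance, noise = np.exp(log_params)
    N = len(X)
    Ky = rbf_kernel(X, X, lengthscale, variance) + (noise + 1e-6) * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Ky^{-1} y
    # -log p(y|theta) = 1/2 y^T Ky^{-1} y   (model fit)
    #                 + 1/2 log|Ky|          (complexity penalty)
    #                 + N/2 log(2 pi)
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * N * np.log(2 * np.pi))

# Toy data: noisy draws from a sine function.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 40))
y = np.sin(X) + 0.1 * rng.standard_normal(40)

res = minimize(negative_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B")
print("optimised (lengthscale, signal var, noise var):", np.exp(res.x))
```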

Predictions in a GP

Joint:

\begin{aligned} p(f, f_{*}|X, X_{*}, \theta) &= \mathcal{N} \left( 0, \begin{bmatrix} k(X,X) & k(X, X_{*}) \\ k(X_{*}, X) & k(X_{*}, X_{*}) \end{bmatrix} \right) = \mathcal{N}\left(0, \begin{bmatrix} K & K_{*}^T \\ K_{*} & K_{**} \end{bmatrix}\right) \\ p(y, f_{*}|X, X_{*}, \theta) &= \mathcal{N}\left(0, \begin{bmatrix} K + \sigma^{2}_{n}\mathbb{I} & K_{*}^T \\ K_{*} & K_{**} \end{bmatrix}\right) \end{aligned}

Conditional:

\begin{aligned} p(f_{*}|y, X_{*}, X, \theta_{*}) &= \mathcal{N}( \mu_{*}, \hspace{5pt} \Sigma_{*}) \\ \mu_{*} &= K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y \\ \Sigma_{*} &= K_{**} - K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}K_{*}^T \end{aligned}
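Continuing the previous sketch (it assumes `rbf_kernel`, `X`, `y` and `res` are still in scope), the conditional equations above translate directly into code:

```python
# A minimal sketch of GP prediction: the predictive distribution at test inputs X_*
# is the Gaussian conditional of the joint over (y, f_*).
import numpy as np

def gp_predict(X_train, y_train, X_test, lengthscale, variance, noise):
    K = rbf_kernel(X_train, X_train, lengthscale, variance) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train, lengthscale, variance)      # K_*  (test x train)
    K_starstar = rbf_kernel(X_test, X_test, lengthscale, variance)   # K_**

    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_star @ alpha                        # K_* (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K_star.T)
    cov = K_starstar - V.T @ V                 # K_** - K_* (K + sigma^2 I)^{-1} K_*^T
    return mu, cov

X_test = np.linspace(-4, 4, 200)
lengthscale, variance, noise = np.exp(res.x)   # hyperparameters from the learning step
mu, cov = gp_predict(X, y, X_test, lengthscale, variance, noise)
std = np.sqrt(np.diag(cov))                    # pointwise predictive uncertainty
```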

Examples of GP Regression

Gaussian processes can also be used in contexts where the observations form a gigantic data matrix \( Y \equiv \{ \bm{y}_{n}\}_{n=1}^{N}, \bm{y}_{n} \in \mathbb{R}^{D}\). \(D\) can be large, on the order of thousands.

Imagine a stack of images, where each image has been flattened into a vector of pixels and stacked together row-wise in a matrix.

(Example: MNIST images of size 28 × 28, flattened so that \( d = 784\); \( n \) = number of images.)

Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times D}\)

Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)


The Gaussian process bridge

Schematic of a Gaussian process Latent Variable Model: the GP maps a 2d latent space to the high-dimensional data space.

 

F = \begin{bmatrix} \vdots & \vdots & & \vdots \\ f_{1} & f_{2} & \ldots & f_{D} \\ \vdots & \vdots & & \vdots \end{bmatrix}_{N \times D}

Structure / clustering in latent space can reveal insights into the high-dimensional data - for instance, which points are similar.

each cluster is a digit (coloured by labels)

\( X \in \mathbb{R}^{N \times Q} \xrightarrow{\;\text{GP}\;} F \in \mathbb{R}^{N \times D} \xrightarrow{\;+\,\text{noise}\;} Y \in \mathbb{R}^{N \times D}\)
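A minimal NumPy sketch of this generative chain (sizes and kernel settings are illustrative assumptions): sample latents, build a kernel over them, draw each output column from the shared GP prior, then add noise.

```python
# A minimal sketch of the GPLVM generative story: X -> F -> Y.
import numpy as np

rng = np.random.default_rng(0)
N, Q, D = 100, 2, 784                      # points, latent dims, data dims
sigma_noise = 0.1

X = rng.standard_normal((N, Q))            # latent inputs, x_n ~ N(0, I_Q)

# RBF kernel on the latent space (unit lengthscale for simplicity).
sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sqdist) + 1e-6 * np.eye(N)

L = np.linalg.cholesky(K)
F = L @ rng.standard_normal((N, D))        # each column f_d ~ N(0, K), K shared across d
Y = F + sigma_noise * rng.standard_normal((N, D))   # Y = F + noise
print(Y.shape)                             # (N, D)
```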

Mathematical set-up

Data Likelihood:

\begin{aligned} p(Y|F, X) &= \displaystyle \prod_{n=1}^{N}\prod_{d=1}^{D}\mathcal{N}(y_{n,d};\, f_{d}(\bm{x}_{n}), \sigma^{2}_{n}) \end{aligned}

Prior structure:

p(F|X) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(\bm{f}_{d}; 0, K_{d})

F = \begin{bmatrix} \vdots & \vdots & & \vdots \\ f_{1} & f_{2} & \ldots & f_{D} \\ \vdots & \vdots & & \vdots \end{bmatrix}_{N \times D}

The data are stacked row-wise but modelled column-wise, each column with a GP.

(Schematic: in the \(N \times D\) data matrix, a row \( \bm{y}_{n}\) is one observation with latent \( \bm{x}_{n}\); a column \( \bm{y}_{d}\) is one output dimension, modelled by a GP over \( X\).)


Optimisation objective: 

p(Y|X) = \int \ldots \int p(Y|f_{1:D}, X)\,p(f_{1:D}|X)\,df_{1}\ldots df_{D} = \prod_{d=1}^{D}p(\bm{y}_{d}|X) = \prod_{d=1}^{D}\mathcal{N}(\bm{y}_{d}|0, K_{d}+\sigma^{2}_{n}\mathbb{I})
X_{\star}, \theta_{\star} = \text{argmax}_{X, \theta} \sum_{d=1}^{D}\log \mathcal{N}(\bm{y}_{d}|0, K_{d}+\sigma^{2}_{n}\mathbb{I})
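For reference, a minimal NumPy sketch of this objective (a toy illustration under an assumed RBF kernel and fixed hyperparameters), exploiting the fact that all \(D\) output dimensions share the same kernel matrix. It also makes the scaling problem visible: every evaluation needs an \(N \times N\) Cholesky.

```python
# A minimal sketch of the (non-sparse) GPLVM objective:
# sum_d log N(y_d; 0, K_X + sigma^2 I), treated as a function of the latents X and theta.
import numpy as np

def gplvm_log_marginal(X, Y, lengthscale=1.0, variance=1.0, noise=0.1):
    N, D = Y.shape
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = variance * np.exp(-0.5 * sqdist / lengthscale ** 2) + noise * np.eye(N)
    L = np.linalg.cholesky(K)                             # O(N^3) at every evaluation
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # K^{-1} Y, all D columns at once
    quad = np.sum(Y * alpha)                              # sum_d y_d^T K^{-1} y_d
    return -0.5 * quad - 0.5 * D * logdet - 0.5 * N * D * np.log(2 * np.pi)

rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 12))          # toy data with N=50, D=12
X0 = rng.standard_normal((50, 2))          # random initial latents
print(gplvm_log_marginal(X0, Y))
# Gradient-based maximisation w.r.t. X (and theta) gives the classical ML/MAP GPLVM.
```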

Stochastic Variational Sparse GPs: Generative Model

\textbf{F} \equiv \{ \bm{f}_{d} \}_{d=1}^{D}, \quad \textbf{U} \equiv \{ \bm{u}_{d} \}_{d=1}^{D}
p(\textbf{U}) = \prod_{d=1}^{D}p(\bm{u}_{d}|Z) = \prod_{d=1}^{D}\mathcal{N}(\bm{u}_{d}; \textbf{0}, K_{mm})
p(\bm{f}_{d}|\bm{u}_{d}, X, Z) = \mathcal{N}(\bm{f}_{d};\, K_{nm}K_{mm}^{-1}\bm{u}_{d},\, Q_{nn}), \quad \text{where } Q_{nn} = K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}
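A minimal NumPy sketch of this sparse conditional (illustrative sizes, not the paper's implementation): only the \(M \times M\) matrix \(K_{mm}\) is ever factorised, which is where the computational savings come from.

```python
# A minimal sketch of the sparse-GP conditional p(f_d | u_d, X, Z):
# mean K_nm K_mm^{-1} u_d and covariance Q_nn = K_nn - K_nm K_mm^{-1} K_mn.
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    sqdist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

rng = np.random.default_rng(0)
N, M, Q = 200, 32, 2
X = rng.standard_normal((N, Q))            # latents (or inputs)
Z = rng.standard_normal((M, Q))            # M << N inducing locations
u = rng.standard_normal(M)                 # one set of inducing values u_d

K_mm = rbf(Z, Z) + 1e-6 * np.eye(M)
K_nm = rbf(X, Z)
K_nn = rbf(X, X)

A = np.linalg.solve(K_mm, K_nm.T)          # K_mm^{-1} K_mn, shape (M, N)
mean = K_nm @ np.linalg.solve(K_mm, u)     # K_nm K_mm^{-1} u_d
Q_nn = K_nn - K_nm @ A                     # Nystrom-corrected covariance
# Only K_mm (M x M) is factorised: cost O(N M^2) instead of O(N^3).
```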

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022

p(\textbf{F}, \textbf{X}, \textbf{U} |\textbf{Y}) \approx q(\textbf{F}, \textbf{X}, \textbf{U}) = \Big[\prod_{d=1}^{D}p(\bm{f}_{d}|\bm{u}_{d},\textbf{X})q(\bm{u}_{d}) \Big]\prod_{n=1}^{N}\mathcal{N}(\bm{x}_{n}; \bm{\mu}_{n}, s_{n}\mathbb{I}_{Q})

Stochastic variational inference for GP regression was introduced in Hensman et al. (2013). In this work, we use the SVI bound in the latent variable setting with unknown \(X\) and multi-output \(Y\).

GPLVM ELBO:

\begin{aligned} \text{log }p(\textbf{Y}) \geq \mathcal{L}_{1:D} = \sum_{d=1}^{D}\mathcal{L}_{d} &= \sum_{d=1}^{D}\sum_{n=1}^{N} \mathbb{E}_{q(f, \textbf{X}, \textbf{U})}[\log p(y_{n,d}|\bm{f}_{d}, \bm{x}_{n})] - \sum_{d}\textrm{KL}(q(\bm{u}_{d})||p(\bm{u}_{d})) - \sum_{n}\textrm{KL}(q(\bm{x}_{n})||p(\bm{x}_{n})) \end{aligned}

\begin{aligned} =\sum_{d=1}^{D}\sum_{n=1}^{N}\int q_{\phi}(\bm{x}_{n})\underbrace{\int q_{\lambda}(\bm{u}_{d})\int p(\bm{f}_{d}|\bm{u}_{d}, \bm{x}_{n}) \log\mathcal{N}(y_{n,d};\bm{f}_{d}(\bm{x}_{n}), \sigma^{2}_{y})\, d\bm{f}_{d}(\bm{x}_{n})\,d\bm{u}_{d}\,d\bm{x}_{n} - \sum_{d}\textrm{KL}(q(\bm{u}_{d})||p(\bm{u}_{d}))}_{\textrm{SVGP Regression bound}} - \sum_{n}\textrm{KL}(q(\bm{x}_{n})||p(\bm{x}_{n})) \end{aligned}

Non-Gaussian likelihoods · Flexible variational families · Amortised inference · Interdomain inducing variables · Missing data problems

Stochastic Variational Bayesian GPLVM

GPR ELBO:

\begin{aligned} &\text{log } p(\textbf{Y}|\textbf{X}) \geq \sum_{n=1}^{N} \mathbb{E}_{q(f, \bm{u})}[\log p(y_{n}|f, \bm{x}_{n})] - \textrm{KL}(q(\bm{u})||p(\bm{u})) = \mathcal{L}\\ &= \sum_{n=1}^{N}\log \mathcal{N}(y_{n}| k^{T}_{n}K_{mm}^{-1}\bm{m}, \sigma^{2}_{y}) -\dfrac{1}{2\sigma^{2}_{y}} \textrm{Tr}(K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}) - \dfrac{1}{2\sigma^{2}_{y}} \textrm{Tr}(SK_{mm}^{-1} K_{mn}K_{nm}K_{mm}^{-1})- \textrm{KL}(q(\bm{u})||p(\bm{u})) \end{aligned}
\text{log }p(\textbf{Y}) = \text{log} \int p(\textbf{Y}|\textbf{X})p(\textbf{X})d\textbf{X} \geq \int q(\textbf{X})\mathcal{L}d\textbf{X} - \text{KL}(q(\textbf{X})||p(\textbf{X}))

Stochastic Variational Bayesian GPLVM

Final factorised form:

\begin{aligned} \mathcal{L}_{1:D} &= \sum_{n,d}\log \mathcal{N}(y_{n,d}|\underbrace{\langle k^{T}_{n} \rangle _{q(\bm{x}_{n})}}_{\Psi^{(n,\cdot)}_{1}}K_{mm}^{-1}\bm{m}_{d}, \sigma^{2}_{y}) -\dfrac{1}{2\sigma^{2}_{y}} \textrm{Tr}(\underbrace{\langle K_{nn}\rangle_{q(\bm{x}_{n})}}_{\psi_{0}} - K_{mm}^{-1}\underbrace{\langle K_{mn}K_{nm} \rangle_{q(\bm{x}_{n})}}_{\Psi_{2}}) - \dfrac{1}{2\sigma^{2}_{y}} \textrm{Tr}(S_{d}K_{mm}^{-1}\underbrace{\langle K_{mn}K_{nm}\rangle_{q(\bm{x}_{n})}}_{\Psi_{2}}K_{mm}^{-1}) \\ - &\sum_{n}\textrm{KL}(q(\bm{x}_{n})||p(\bm{x}_{n})) - \sum_{d}\textrm{KL}(q(\bm{u}_{d})||p(\bm{u}_{d})) \end{aligned}

Non-Gaussian likelihoods · Flexible variational families · Interdomain inducing variables · Missing data problems

Point Inference   \(\longrightarrow\)  \( \bm{x}_{n} \) is a point estimate; no prior or KL term.

MAP Inference   \(\longrightarrow\)   only the prior term \( p(\bm{x}_{n})\) appears in the objective; \( \bm{x}_{n} \) is still a point estimate.

Mean-field Variational \(\longrightarrow\)  \( q(\bm{x}_{n})\) is a factorised Gaussian (factorising across the latent dimensions).

Amortised Inference \(\longrightarrow\) \( q(\bm{x}_{n})\) is potentially a Gaussian with a full covariance matrix (modelling correlations across latent dimensions); its parameters are produced by an encoder network shared across data points (see the sketch below).
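A minimal PyTorch sketch of the amortised alternative (a diagonal-covariance encoder for brevity; the architecture, sizes and class name are illustrative assumptions, not the paper's exact network): one encoder shared across all data points maps \( \bm{y}_{n}\) to the parameters of \( q(\bm{x}_{n})\), and the reparameterisation trick keeps the sampled latents differentiable.

```python
# A minimal sketch of amortised inference over the latents: instead of free variational
# parameters (mu_n, s_n) per data point, an encoder network predicts them from y_n.
import torch
import torch.nn as nn

class AmortisedLatentEncoder(nn.Module):
    def __init__(self, data_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.log_var_head = nn.Linear(hidden, latent_dim)   # diagonal covariance here

    def forward(self, y):
        h = self.net(y)
        mu, log_var = self.mu_head(h), self.log_var_head(h)
        # Reparameterisation: x_n = mu_n + sigma_n * eps, so gradients reach the encoder.
        x = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return x, mu, log_var

encoder = AmortisedLatentEncoder(data_dim=784, latent_dim=2)
y_batch = torch.randn(16, 784)                 # a minibatch of observations
x_sample, mu, log_var = encoder(y_batch)       # q(x_n) parameters + one sample per point

# KL(q(x_n) || N(0, I)) in closed form for the diagonal-Gaussian case:
kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=-1)
```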


(Figures: mean-field inference over latents vs. amortised inference over latents.)


Oilflow (12d)

          Point            MAP              Bayesian-SVI        AEB-SVI
RMSE      0.341 (0.008)    0.569 (0.092)    0.0925 (0.025)      0.067 (0.0016)
NLPD      4.104 (3.223)    8.16 (1.224)     -11.3105 (0.243)    -11.392 (0.147)

Robust to Missing Data: MNIST Reconstruction

(Figures: reconstructions with 30% and 60% missing pixels.)

Robust to Missing Data: Brendan Faces Reconstruction

Trained on ~40% missing pixels


Robust to Missing Data: Motion Capture 


Summary


  • We present a scalable generalized GPLVM model which leverages SVI for inference and is compatible with non-Gaussian likelihoods, flexible variational distributions and massively missing data.
  • The model is versatile enough to be combined with different priors in latent space; this might be important for better disentanglement of latent representations, and future work will focus on such insights.

Thank you! 

 

vr308@cam.ac.uk

@VRLalchand

Harvard Computer Science
