Vidhi Lalchand
Doctoral Student @ Cambridge Bayesian Machine Learning
Research Highlights
Flatiron Institute, New York
3rd Feb, 2023
This talk is a compilation of research results based on my work in Gaussian processes during the years 2019-2022.
Outline
Gaussian Processes: A generalisation of a Gaussian distribution
A sample from a \(k\)-dimensional Gaussian \( \mathbf{x} \sim \mathcal{N}(\mu, \Sigma) \) is a vector of size \(k\). $$ \mathbf{x} = [x_{1}, \ldots, x_{k}] $$
The mathematical crux of a GP is that \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{n})]\) is just an \(n\)-dimensional multivariate Gaussian \( \mathcal{N}(\mu, K) \) whose covariance matrix has been generated using a positive semi-definite (psd) kernel function.
A GP is an infinite-dimensional analogue of a Gaussian distribution \( \rightarrow \) a sample from it is a vector of infinite length?
In practice, we only ever need to represent our function \( f(x) \) at a finite index set \( \mathcal{X} = [x_{1},\ldots, x_{500}]\), so we are only interested in the long but finite function vector \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{500})]\).
Gaussian processes: As a paradigm for learning
A powerful, Bayesian, non-parametric paradigm for learning functions.
\(f(x) \sim \mathcal{GP}(m(x), k_{\theta}(x,x^{\prime})) \)
For a finite set of points \( X \):
\(f(X) \sim \mathcal{N}(m(X), K_{X})\)
\( k_{\theta}(x,x^{\prime})\) encodes the support and inductive biases in function space.
Visualising the prior predictive function space
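The figure is not reproduced here, but prior function draws of this kind take only a few lines of NumPy. A minimal sketch, assuming an RBF kernel (unit lengthscale and variance) and a zero mean function:

```python
# Minimal sketch: draw samples from a GP prior with an RBF kernel at a finite
# index set, i.e. a draw from N(0, K_X).  Lengthscale/variance values are
# illustrative choices, not from the slides.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

X = np.linspace(0, 10, 500)                      # finite index set [x_1, ..., x_500]
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))     # jitter for numerical stability
f_samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=5)
# each row of f_samples is one prior function draw [f(x_1), ..., f(x_500)]
```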
Training a GP: Learning the kernel hyperparameters
Given observations \( (X,\bm{y}) = \{(x_{i}, y_{i})\}_{i=1}^{N} \), which are noisy realisations of some latent function values \(\bm{f}\), we have the following prior and likelihood model:
$$ \bm{f} \,|\, X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}), \qquad \bm{y} \,|\, \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I}). $$
Learning Step:
Learning in Gaussian process models occurs through the maximisation of the marginal likelihood w.r.t. the kernel hyperparameters.
Type-II Maximum Likelihood / Empirical Bayes
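As an illustration of the learning step, here is a minimal NumPy/SciPy sketch of ML-II on a toy 1d dataset; the RBF kernel, initialisation, and optimiser choice are assumptions for the sketch, not the setup used in the talk:

```python
# Sketch of Type-II maximum likelihood: maximise log p(y | theta) w.r.t. the
# RBF lengthscale, signal variance and noise variance (optimised in log space).
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    lengthscale, variance, noise = np.exp(log_theta)
    sqdist = (X[:, None] - X[None, :]) ** 2
    K = variance * np.exp(-0.5 * sqdist / lengthscale**2) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # K^{-1} y via Cholesky
    # -log p(y|theta) = 0.5 y^T K^{-1} y + 0.5 log|K| + (N/2) log 2*pi
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(X) * np.log(2 * np.pi))

X = np.linspace(0, 10, 50)
y = np.sin(X) + 0.1 * np.random.randn(50)                    # toy noisy observations
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B")
lengthscale_hat, variance_hat, noise_hat = np.exp(res.x)
```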
Closed form posterior predictive
Deriving the posterior predictive just amounts to writing down the conditional from the joint distribution of \( p(f, f_{*})\)!
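For completeness, the standard closed form (assuming a zero mean function and Gaussian noise \( \sigma_{n}^{2} \)):
$$ \bm{f}_{*} \,|\, X_{*}, X, \bm{y}, \bm{\theta} \sim \mathcal{N}(\bm{\mu}_{*}, \bm{\Sigma}_{*}), \qquad \bm{\mu}_{*} = K_{*X}(K_{XX} + \sigma_{n}^{2}\mathbb{I})^{-1}\bm{y}, \qquad \bm{\Sigma}_{*} = K_{**} - K_{*X}(K_{XX} + \sigma_{n}^{2}\mathbb{I})^{-1}K_{X*}. $$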
Behaviour of ML-II
ML-II - does it always work?
Sensitivity to training set size
Left: the bias in hyperparameter learning is exacerbated in weak data regimes; the plot shows the bias in the learned lengthscale for an underlying 1d regression problem. Right: the bias in the learned noise level in weak data regimes. Each point is an average across 100 datasets sampled from a GP prior with the RBF kernel.
A fully Bayesian treatment of GP models would integrate away kernel hyperparameters when making predictions:
$$ p(\bm{f}_{*}| X_{*}, X, \bm{y}) = \int p(\bm{f}_{*}| X_{*}, X, \bm{y}, \bm{\theta})\, p(\bm{\theta}|\bm{y})\, d\bm{\theta}, $$
where \( \bm{f}_{*}| X_{*}, X, \bm{y}, \bm{\theta} \sim \mathcal{N}(\bm{\mu_{*},\Sigma_{*}}) \).
The posterior over hyperparameters is given by
$$ p(\bm{\theta}|\bm{y}) = \frac{p(\bm{y}|\bm{\theta})\, p(\bm{\theta})}{\int p(\bm{y}|\bm{\theta})\, p(\bm{\theta})\, d\bm{\theta}}, $$
where \(p(\bm{y}|\bm{\theta}) \) is the GP marginal likelihood.
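In practice the integral over \( \bm{\theta} \) is intractable and is approximated with samples from the hyperparameter posterior (a standard Monte Carlo mixture, notation as above):
$$ p(\bm{f}_{*}| X_{*}, X, \bm{y}) \approx \frac{1}{M}\sum_{j=1}^{M} \mathcal{N}\big(\bm{\mu}_{*}(\bm{\theta}^{(j)}), \bm{\Sigma}_{*}(\bm{\theta}^{(j)})\big), \qquad \bm{\theta}^{(j)} \sim p(\bm{\theta}|\bm{y}). $$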
Hierarchical GPs
Hierarchical Gaussian Processes: Propagate hyperparameter uncertainty to outputs.
Hyperprior \( \rightarrow \) \( \bm{\theta} \sim p(\bm{\theta}) \)
Prior \( \rightarrow \) \( \bm{f} \,|\, X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}) \)
Model \( \rightarrow \) \( y_{i} = f(x_{i}) + \epsilon_{i}, \quad \epsilon_{i} \sim \mathcal{N}(0, \sigma_{n}^{2}) \)
Likelihood \( \rightarrow \) \( \bm{y} \,|\, \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I}) \)
Hierarchical GPs: Visualising the prior predictive space
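The prior predictive figure is not reproduced here. As a minimal sketch (the log-normal hyperpriors below are illustrative choices, not the paper's), hierarchical prior draws first sample the hyperparameters and then a function conditioned on them:

```python
# Hierarchical prior predictive: theta ~ p(theta), then f | theta ~ N(0, K_theta).
import numpy as np

def rbf_kernel(x1, x2, lengthscale, variance):
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

X = np.linspace(0, 10, 200)
rng = np.random.default_rng(0)
prior_draws = []
for _ in range(10):
    lengthscale = rng.lognormal(mean=0.0, sigma=0.5)   # theta ~ p(theta) (toy hyperprior)
    variance = rng.lognormal(mean=0.0, sigma=0.5)
    K = rbf_kernel(X, X, lengthscale, variance) + 1e-8 * np.eye(len(X))
    prior_draws.append(rng.multivariate_normal(np.zeros(len(X)), K))
# prior_draws now mixes over hyperparameters as well as functions
```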
We adapt a technique frequently used in the physics and astronomy literature to sample from the hyperparameter posterior.
Nested Sampling (Skilling, 2004) is a gradient-free method for Bayesian computation. We give a brief overview in the next few slides before looking at results for fully Bayesian inference in Gaussian processes.
Fergus Simpson*, Vidhi Lalchand*, Carl E. Rasmussen. Marginalised Gaussian processes with Nested Sampling. https://arxiv.org/pdf/2010.16344.pdf
Nested Sampling: The idea
Principle: sample from nested "shells" of the prior bounded by iso-likelihood contours, and weight the samples appropriately to give both the evidence and posterior samples.
Nested Sampling: The principle
Results
Why does a more diffuse predictive interval yield better test predictive density?
Results: 2d pattern extrapolation
When would you use Hierarchical GPs?
The benefits from incorporating hyperparameter uncertainty in predictive distributions are most pronounced in weak data regimes, e.g. small or noisy training sets, where ML-II point estimates of the hyperparameters are biased.
Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N}, Y \in \mathbb{R}^{N \times D}\)
Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)
Lalchand et al. (2022)
Schematic: \( D \) independent Gaussian processes map the low-dimensional latent space to the high-dimensional (\( N \times D \)) data space.
Gaussian Processes for Latent Variable Modelling (at scale)
Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.
GPLVM: Generative Model with Sparse Gaussian processes
where,
Prior over latents: \( p(X) = \prod_{n=1}^{N}\mathcal{N}(\bm{x}_{n}; \bm{0}, \mathbb{I}_{Q}) \)
Prior over inducing variables: \( p(U) = \prod_{d=1}^{D}\mathcal{N}(\bm{u}_{d}; \bm{0}, K_{MM}) \)
Conditional prior: \( p(F \,|\, U, X) = \prod_{d=1}^{D}\mathcal{N}\big(\bm{f}_{d};\, K_{NM}K_{MM}^{-1}\bm{u}_{d},\; K_{NN} - K_{NM}K_{MM}^{-1}K_{MN}\big) \)
Data likelihood: \( p(Y \,|\, F) = \prod_{d=1}^{D}\mathcal{N}(\bm{y}_{d};\, \bm{f}_{d},\, \sigma_{y}^{2}\mathbb{I}_{N}) \)
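As a generative sketch of the model above (using dense GPs for clarity, i.e. ignoring the inducing-point machinery; the kernel and noise settings are illustrative):

```python
# Generative sketch of the GPLVM: latents X ~ N(0, I_Q), one GP-distributed
# column f_d per output dimension, Gaussian observation noise.
import numpy as np

N, Q, D, noise = 200, 2, 12, 0.05
rng = np.random.default_rng(0)

X = rng.standard_normal((N, Q))                         # p(X) = prod_n N(x_n; 0, I_Q)
sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sqdist) + 1e-6 * np.eye(N)            # RBF kernel on the latents
F = rng.multivariate_normal(np.zeros(N), K, size=D).T   # N x D, columns f_d ~ N(0, K)
Y = F + np.sqrt(noise) * rng.standard_normal((N, D))    # Gaussian likelihood p(Y|F)
```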
Stochastic Variational Evidence Lower bound
Stochastic variational inference for GP regression was introduced in Hensman et al. (2013) (the bound \( \mathcal{L}_{1}\)). In this work we extend the SVI bound in two ways: we introduce a variational distribution over the unknown latents \(X\) and make the likelihood multi-output over \(Y\).
ELBO:
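The bound itself is not reproduced here; as a sketch of the standard doubly stochastic form it takes for a sparse GPLVM (notation assumed; see the paper for the exact expression used):
$$ \mathcal{L} = \sum_{n=1}^{N}\sum_{d=1}^{D} \mathbb{E}_{q(\bm{x}_{n})\, q(f_{d}(\bm{x}_{n}))}\big[\log p(y_{n,d} \,|\, f_{d}(\bm{x}_{n}))\big] - \sum_{d=1}^{D}\mathrm{KL}\big(q(\bm{u}_{d})\,\|\, p(\bm{u}_{d})\big) - \sum_{n=1}^{N}\mathrm{KL}\big(q(\bm{x}_{n})\,\|\, p(\bm{x}_{n})\big), $$
where \( q(f_{d}(\bm{x}_{n})) = \int p(f_{d}(\bm{x}_{n}) \,|\, \bm{u}_{d}, \bm{x}_{n})\, q(\bm{u}_{d})\, d\bm{u}_{d} \). The first term decomposes over both \( n \) and \( d \), which is what permits minibatching over data points and output dimensions.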
Non-Gaussian likelihoods, flexible variational families, amortised inference, interdomain inducing variables, missing data problems.
Variational Formulation:
Oilflow (12d)
| | Point | MAP | Bayesian-SVI | AEB-SVI |
|---|---|---|---|---|
| RMSE | 0.341 (0.008) | 0.569 (0.092) | 0.0925 (0.025) | 0.067 (0.0016) |
| NLPD | 4.104 (3.223) | 8.16 (1.224) | -11.3105 (0.243) | -11.392 (0.147) |
Robust to Missing Data: MNIST Reconstruction
Reconstructions with 30% and 60% missing data.
Robust to Missing Data: Motion Capture
Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N}, Y \in \mathbb{R}^{N \times D}\)
Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)
Schematic: Gaussian processes map a 2d latent space (each point represents a cell) to the high-dimensional \( N \times D \) data space (a cell-by-gene matrix of expression counts).
Application: Learn a 2d latent space (a cell atlas) from a single-cell gene expression matrix
\( f_{d} \sim \mathcal{GP}(0, k_{f})\)
The Model: Augmented Kernel Function
where we assume a constant mean \(\mu_{f} \in \mathbb{R} \) for the \(\bm{f}\) process, the design matrix \(\Phi\) of covariates is specified, and \(\zeta_{d}\) encapsulates the mean of the random effects \(B\).
The expression matrix \(Y\) can be interpreted as being driven by a joint process \(\tilde{F}\) whose columns \(\tilde{f}_{d}\) model both latent and random effects and are distributed as individual Gaussian processes.
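The precise augmented kernel is defined in the paper and not reproduced here. Purely as an illustrative sketch of the general idea (not the paper's construction), a latent-space kernel can be combined additively with a linear kernel on the observed covariates \(\Phi\), which is equivalent to placing Gaussian random effects on those covariates:

```python
# Illustrative only: additive combination of a latent-space RBF kernel and a
# linear kernel on observed covariates (equivalent to Gaussian random effects).
# This is NOT the paper's augmented kernel; it is a sketch of the general idea.
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

N, Q, P = 100, 2, 4
rng = np.random.default_rng(0)
X_latent = rng.standard_normal((N, Q))   # latent coordinates x_n (to be learned)
Phi = rng.standard_normal((N, P))        # design matrix of per-cell covariates / metadata
sigma_b2 = 0.5                           # assumed variance of the random-effect weights

# N x N covariance over cells combining latent and covariate-driven effects
K_aug = rbf_kernel(X_latent) + sigma_b2 * Phi @ Phi.T
```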
Inputs: expression data and metadata.
Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Neil D. Lawrence, Sarah A. Teichmann. Modelling Technical and Biological Effects in single-cell RNA-seq data with Scalable Gaussian Process Latent Variable Models (GPLVM). In Machine Learning in Computational Biology (MLCB), 2022
Inference: Stochastic Variational Inference
Canonical GP prior
Augmented GP prior
The augmented kernel formulation allows us to derive an objective (evidence lower bound) which factorises across both cells (\(N\)) and genes (\( D\)).
*Details about the derivation of the objective are in the paper
While not converged do: sample a minibatch of cells (and genes), form a stochastic estimate of the evidence lower bound, and take a gradient step on the variational and kernel parameters.
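Concretely, the factorisation means the data term of the ELBO admits an unbiased minibatch estimate over both cells and genes (the batch notation \(B\), \(G\) is assumed here):
$$ \sum_{n=1}^{N}\sum_{d=1}^{D}\ell_{n,d} \;\approx\; \frac{N}{|B|}\,\frac{D}{|G|}\sum_{n\in B}\sum_{d\in G}\ell_{n,d}, $$
where \( \ell_{n,d} \) denotes the per-cell, per-gene expected log-likelihood term, so each SVI step only touches a small block of the expression matrix.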
Augmented kernel disentangles cell cycle and treatment effects
Latent spaces coloured by treatment, latent batch effect, and cell cycle phase.
COVID-19 scRNA-seq cohort
Latent embeddings of 54,941 cells from 130 patients, comparing PCA, scVI, and the augmented GPLVM.
Modelling biological variation: COVID-19 severity
There is no reason to believe that real data is generated by a kernel with fixed inductive biases, i.e. one whose smoothness properties do not change over the space of covariates/inputs.
We need flexible inductive biases that adapt to the data, but how do we specify such kernels?
Stationary vs Non-stationary Kernel
The hyperparameters are themselves functions of the inputs \( X\).
How do we specify \(\ell_{d}(\bm{x})\)?
Non-stationary kernels: Input-dependent hyperparameters
Gibbs kernel: $$ k_{\text{Gibbs}}(\bm{x}, \bm{x}^{\prime}) = \prod_{d=1}^{D}\sqrt{\frac{2\,\ell_{d}(\bm{x})\,\ell_{d}(\bm{x}^{\prime})}{\ell_{d}(\bm{x})^{2} + \ell_{d}(\bm{x}^{\prime})^{2}}}\;\exp\left(-\sum_{d=1}^{D}\frac{(x_{d} - x_{d}^{\prime})^{2}}{\ell_{d}(\bm{x})^{2} + \ell_{d}(\bm{x}^{\prime})^{2}}\right) $$
It has the same form as the squared exponential kernel, but with input-dependent lengthscale functions \(\ell_{d}(\bm{x})\) in place of constant lengthscales.
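A minimal NumPy sketch of the 1d Gibbs kernel follows; the lengthscale function \( \ell(x) \) used here is an arbitrary toy choice for illustration, not a learned one:

```python
# Gibbs kernel in 1d with an input-dependent lengthscale l(x).
import numpy as np

def lengthscale_fn(x):
    """Toy input-dependent lengthscale l(x); any smooth, positive function works."""
    return 0.3 + 0.2 * np.abs(np.sin(x))

def gibbs_kernel(x1, x2, variance=1.0):
    l1 = lengthscale_fn(x1)[:, None]   # l(x)
    l2 = lengthscale_fn(x2)[None, :]   # l(x')
    denom = l1**2 + l2**2
    prefactor = np.sqrt(2.0 * l1 * l2 / denom)
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * prefactor * np.exp(-sqdist / denom)

X = np.linspace(0, 10, 200)
K = gibbs_kernel(X, X) + 1e-8 * np.eye(len(X))           # valid psd covariance matrix
f = np.random.multivariate_normal(np.zeros(len(X)), K)   # one non-stationary prior draw
```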
Application: Climate processes
(work in progress)
Source: National Oceanic and Atmospheric Administration (https://www.ncei.noaa.gov/cdo-web/)
Hierarchical GP Model: MAP Inference
Model specification (hyperprior, prior, model, likelihood) compared across the classical, stationary hierarchical, and non-stationary hierarchical GP models.
Synthetic Example: 1d
Synthetic Example: 2d
Panels: fit with the stationary kernel, fit with the non-stationary kernel, and the ground-truth function.
Modelling Precipitation in the Upper Indus Basin
Time-series for a single spatial point from each of three distinct regions.
Modelling Precipitation in the Upper Indus Basin: Spatial Component
Modelling Precipitation in the Upper Indus Basin: Temporal Component
Modelling Precipitation in the Upper Indus Basin: Spatio-Temporal
3D regression for spatio-temporal precipitation modelling. The red line demarcates the training and test months. The non-stationary kernel yielded an average test RMSE of 0.9426 across three runs vs. 1.1086 for the stationary kernel.
Modelling Precipitation across the US: Sparse GPs + Non-stationary construction
Baseline: SE-ARD kernel.
Thank you!
(Bonus slides?)
Nested Sampling: The principle
Define a notion of prior volume, $$ X(\lambda) = \int_{\mathcal{L}(\theta) > \lambda} \pi(\theta)d\theta$$
The area/volume of parameter space "inside" an iso-likelihood contour.
One can re-cast the multi-dimensional evidence integral as a 1d function of prior volume \( X\).
$$ \mathcal{Z} = \int \mathcal{L}(\theta)\pi(\theta)d\theta = \int_{0}^{1} \mathcal{L}(X)dX$$
The evidence can then just be estimated by 1d quadrature.
$$ \mathcal{Z} \approx \sum_{i=1}^{M}\mathcal{L_{i}}(X_{i} - X_{i+1})$$
Static Nested Sampling
(Skilling, 2004)
Start with \( N \) "live" points \( \{\theta_{1}, \theta_{2}, \ldots, \theta_{N} \} \) sampled from the prior, \(\theta_{i} \sim \pi(\theta) \); set \( \mathcal{Z} = 0\).
for \( i = 1, \ldots, K\), while stopping criterion is unmet do
Select the live point with the lowest likelihood \( \mathcal{L}_{i} \) and move it to the "saved" list.
Replace it with a new point drawn from the prior subject to \( \mathcal{L}(\theta) > \mathcal{L}_{i} \) \( \rightarrow \) hard problem (why? see "Sampling from the constrained prior").
Set the prior volume \( X_{i} = e^{-i/N} \) and weight \( w_{i} = X_{i-1} - X_{i} \).
Accumulate evidence, \( \mathcal{Z} = \mathcal{Z} + \mathcal{L}_{i}w_{i}\).
Evaluate stopping criterion, if triggered then break;
end
Add the final \( N \) live points to the "saved" list with weights from the final slab of enclosed prior mass, \( X_{K}/N \) each.
return set of saved points \(\{ \theta_{i}\}_{i=1}^{N + K} \), along with importance weights \( \{p_{i} \}_{i=1}^{N + K}\), and evidence \( \mathcal{Z} \).
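To make the procedure concrete, here is a toy implementation (illustrative only: a 1d standard-normal likelihood, a Uniform(-5, 5) prior, and naive rejection sampling for the constrained-prior step, which is exactly the hard problem flagged above):

```python
# Toy static nested sampler: 1d N(0,1) likelihood, Uniform(-5, 5) prior,
# constrained-prior draws by naive rejection (fine for a toy, slow in general).
import numpy as np

rng = np.random.default_rng(0)
N_live, K_iters = 100, 1000

def log_likelihood(theta):
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

def sample_prior(size=None):
    return rng.uniform(-5.0, 5.0, size)

live = sample_prior(N_live)
live_logL = log_likelihood(live)
logZ_terms, X_prev = [], 1.0                       # enclosed prior mass starts at X_0 = 1

for i in range(1, K_iters + 1):
    worst = np.argmin(live_logL)                   # lowest-likelihood live point L_i
    L_i = live_logL[worst]
    X_i = np.exp(-i / N_live)                      # expected prior volume after i steps
    logZ_terms.append(L_i + np.log(X_prev - X_i))  # log(L_i * w_i), w_i = X_{i-1} - X_i
    X_prev = X_i
    while True:                                    # replace with a constrained prior draw
        theta_new = sample_prior()
        if log_likelihood(theta_new) > L_i:
            break
    live[worst], live_logL[worst] = theta_new, log_likelihood(theta_new)

# final slab: add the N live points with the remaining prior mass X_K / N each
logZ_terms.extend(live_logL + np.log(X_prev / N_live))
log_Z = np.logaddexp.reduce(np.array(logZ_terms))
print(np.exp(log_Z))   # analytic answer: ~0.1 (prior density 1/10 over [-5, 5])
```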
Prior Mass Estimation
Why is \( X_{i} \) set to \( e^{-i/N}\), where \(i\) is the iteration number and \(N\) is the number of live points? At each iteration the enclosed prior volume shrinks by a factor \( t_{i} = X_{i}/X_{i-1} \), distributed as the largest of \( N \) uniform draws on \( [0, 1] \), so \( \mathbb{E}[\ln t_{i}] = -1/N \) and \( X_{i} \approx e^{-i/N} \) after \( i \) iterations.
(Images from: An Intro to Dynamic Nested Sampling, Speagle (2017))
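This shrinkage argument is easy to check numerically (a quick Monte Carlo sanity check, not from the slides):

```python
# The prior volume shrinks as X_i = t_i * X_{i-1}, where t_i is the largest of
# N uniform draws, so E[-ln t_i] = 1/N and hence X_i ~ exp(-i/N).
import numpy as np

N = 100                                          # number of live points
rng = np.random.default_rng(0)
t = rng.uniform(size=(50_000, N)).max(axis=1)    # shrinkage factor at each iteration
print(np.mean(-np.log(t)), 1 / N)                # both are approximately 0.01
```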
Sampling from the constrained prior
Dynamic Nested Sampling. Speagle (2019)
At every step we need a new sample \( \theta^{\prime}\) such that \( \mathcal{L}(\theta^{\prime}) > \mathcal{L}_{i}\).
We could keep sampling uniformly from the prior, rejecting proposals until one meets the likelihood condition, but this becomes prohibitively slow once the likelihood threshold is high; there are better ways.
What if we could sample directly from the constrained prior?
New Directions
Neural encoder (parametric)
Gaussian process (non-parametric)