On Hierarchical Gaussian Processes: Methods and Applications

Vidhi Lalchand

Research Highlights

Flatiron Institute, New York

3rd Feb, 2023

This talk is a compilation of research results based on my work in Gaussian processes during the years 2019-2022.

  • Regression with hierarchical GPs: what they are, how to construct them, and how they improve on ordinary GPs.

 

  • Hierarchical GPs for unsupervised learning. 

 

  • Scientific applications & some new directions.

 

 

 

Outline

Gaussian Processes: A generalisation of a Gaussian distribution

A sample from a \(k\)-dimensional Gaussian \( \mathbf{x} \sim \mathcal{N}(\mu, \Sigma) \) is a vector of size \(k\). $$ \mathbf{x} = [x_{1}, \ldots, x_{k}] $$

The mathematical crux of a GP is that \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{n})]\) is just an \(n\)-dimensional multivariate Gaussian \( \mathcal{N}(\mu, K) \) whose covariance matrix is generated by a positive semi-definite (psd) kernel function.

\begin{bmatrix} f_{1} \\ \vdots\\ f_{499} \\ f_{500} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \vdots\\ \mu_{499} \\ \mu_{500} \end{bmatrix}, \begin{bmatrix} k(x_{1}, x_{1}) & \ldots & k(x_{1}, x_{500}) \\ \vdots & \ddots & \vdots \\ k(x_{500}, x_{1}) & \ldots & k(x_{500}, x_{500}) \end{bmatrix} \right)

A GP is an infinite-dimensional analogue of a Gaussian distribution \( \rightarrow \) is a sample from it a vector of infinite length?

f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime}))

But in practice we only ever need to represent the function \( f \) at a finite index set \( \mathcal{X} = [x_{1},\ldots, x_{500}]\), so we are only interested in the (long) function vector \( [f(x_{1}), f(x_{2}), f(x_{3}), \ldots, f(x_{500})]\).
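To make this concrete, here is a minimal numpy sketch (illustrative values, not code from the talk) that draws one such 500-dimensional function vector from a zero-mean GP prior with an RBF kernel:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel on 1d inputs.
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Finite index set of 500 points.
X = np.linspace(0, 10, 500)
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))  # jitter for numerical stability

# A "function" sampled from the GP prior is just a draw from a 500-d Gaussian.
f = np.random.multivariate_normal(mean=np.zeros(len(X)), cov=K)
```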

Gaussian processes: As a paradigm for learning

A powerful, Bayesian, non-parametric paradigm for learning functions.

\(f(x) \sim \mathcal{GP}(m(x), k_{\theta}(x,x^{\prime})) \)

For a finite set of points \( X \):

\(f(X) \sim \mathcal{N}(m(X), K_{X})\)

\( k_{\theta}(x,x^{\prime})\) encodes the support and inductive biases in function space.

Visualising the prior predictive function space

Training a GP: Learning the kernel hyperparameters

Given observations \( {(X,\bm{y})} = \{(x_{i}, y_{i})\}_{i=1}^{N} \), which are noisy realisations of some latent function values \(\bm{f}\), we have the following prior and likelihood model:

\bm{y} = \bm{f} + \bm{\epsilon}, \hskip 20pt \bm{\epsilon} \sim \mathcal{N}(0, \sigma_{n}^{2}\mathbb{I}) \\ \bm{y}| \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I}) \\ \bm{f}| \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}) \hskip 10pt ((K_{\theta})_{i,j} = k_{\theta}(x_{i}, x_{j}))\\

Learning Step:

 

\begin{aligned} p(\bm{y}|\bm{\theta}) &= \int p(\bm{y}|\bm{f})\,p(\bm{f}|\bm{\theta})\,d\bm{f}\\ &= \int \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I})\,\mathcal{N}(\bm{f}; \bm{0}, K_{\theta})\,d\bm{f} \\ &= \mathcal{N}(\bm{0}, K_{\theta} + \sigma^{2}_{n}\mathbb{I}) \end{aligned}

Learning in Gaussian process models occurs through maximisation of the marginal likelihood w.r.t. the kernel hyperparameters.

\bm{\theta_{\star}} = \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta})

Type-II Maximum Likelihood / Empirical Bayes
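A minimal sketch of ML-II in numpy/scipy (an illustrative RBF-kernel example, not the implementation behind these results): hand the negative log marginal likelihood to a generic optimiser and optimise the log-hyperparameters.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, lengthscale, variance):
    sqdist = (X[:, None] - X[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def neg_log_marginal_likelihood(log_params, X, y):
    # Unconstrained parameterisation: (log lengthscale, log signal var, log noise var).
    lengthscale, variance, noise = np.exp(log_params)
    K = rbf_kernel(X, lengthscale, variance) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log N(y | 0, K_theta + sigma_n^2 I)
    log_ml = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)
    return -log_ml

# Toy data: a noisy draw from a smooth function.
X = np.linspace(0, 5, 40)
y = np.sin(X) + 0.1 * np.random.randn(40)

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y), method="L-BFGS-B")
lengthscale_hat, variance_hat, noise_hat = np.exp(res.x)
```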

Closed form posterior predictive

p(\bm{f}_{*}| X_{*}, X, \bm{f}, \theta^{*}) = \mathcal{N}(\mu_{\star}, \Sigma_{\star})

Deriving the posterior predictive just amounts to writing down the conditional from the joint distribution \( p(\bm{f}, \bm{f}_{*})\)!
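For reference, the standard closed-form predictive moments when conditioning on the noisy observations \( \bm{y} \) are:

$$ \begin{aligned} \mu_{\star} &= K_{*X}\,(K_{XX} + \sigma_{n}^{2}\mathbb{I})^{-1}\bm{y} \\ \Sigma_{\star} &= K_{**} - K_{*X}\,(K_{XX} + \sigma_{n}^{2}\mathbb{I})^{-1}K_{X*} \end{aligned} $$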

Behaviour of ML-II  

ML-II - does it always work?

Sensitivity to training set size

Left: the bias in hyperparameter learning is exacerbated in weak-data regimes; the plot shows the bias in the learned lengthscale for an underlying 1d regression problem. Right: the bias in the learned noise level in weak-data regimes. Each point is an average over 100 datasets sampled from a GP prior with the RBF kernel.

A fully Bayesian treatment of GP models would integrate away the kernel hyperparameters when making predictions:

p(\bm{f}_{*}| X_{*}, X, \bm{y}) = \int p(\bm{f}_{*}| X_{*}, X, \bm{y}, \bm{\theta})\textcolor{red}{p(\bm{\theta}|\bm{y})}d\bm{\theta}

where \( \bm{f}_{*}| X_{*}, X, \bm{y}, \bm{\theta} \sim \mathcal{N}(\bm{\mu_{*}},\bm{\Sigma_{*}}) \). The posterior over hyperparameters is given by

\boxed{ \textcolor{red}{p(\bm{\theta}|\bm{y})} \propto p(\bm{y}|\bm{\theta})p(\bm{\theta}) }

where \(p(\bm{y}|\bm{\theta}) \) is the GP marginal likelihood.

Hierarchical GPs

Hierarchical Gaussian Processes: Propagate hyperparameter uncertainty to outputs.

Hyperprior \( \rightarrow \) \( \textcolor{red}{ \bm{\theta} \sim p(\bm{\theta}) } \)

Prior \( \rightarrow \) \( \bm{f}| X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}) \)

Likelihood \( \rightarrow \) \( \bm{y}| \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I}) \)

Model \( \rightarrow \) \( \bm{y} = \bm{f} + \bm{\epsilon}, \hskip 10pt \bm{\epsilon} \sim \mathcal{N}(0, \sigma_{n}^{2}\mathbb{I}) \)

Hierarchical GPs: Visualising the prior predictive space
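A minimal numpy sketch of how such prior predictive draws can be generated (the lognormal hyperprior over the lengthscale is an illustrative choice): sample \( \bm{\theta} \) from the hyperprior, then sample \( \bm{f} \) from the corresponding GP prior.

```python
import numpy as np

def rbf_kernel(X, lengthscale, variance=1.0):
    sqdist = (X[:, None] - X[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

X = np.linspace(0, 10, 200)
samples = []
for _ in range(10):
    # Hyperprior: theta ~ p(theta); here an illustrative lognormal over the lengthscale.
    lengthscale = np.random.lognormal(mean=0.0, sigma=0.5)
    # Prior: f | X, theta ~ N(0, K_theta).
    K = rbf_kernel(X, lengthscale) + 1e-8 * np.eye(len(X))
    samples.append(np.random.multivariate_normal(np.zeros(len(X)), K))
# Each curve in `samples` reflects both function-space and hyperparameter uncertainty;
# plotting them visualises the hierarchical prior predictive space.
```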

We adapt a technique frequently used in the physics and astronomy literature to sample from the hyperparameter posterior.

Nested Sampling (Skilling, 2004) is a gradient-free method for Bayesian computation. We give a brief overview in the next few slides before looking at results for fully Bayesian inference in Gaussian processes.

Fergus Simpson*, Vidhi Lalchand*, Carl E. Rasmussen. Marginalised Gaussian processes with Nested Sampling . https://arxiv.org/pdf/2010.16344.pdf

 Nested Sampling: The idea

Principle: Sample from "nested shells" / iso-likelihood contours of the evidence and weight them appropriately to give posterior samples.

Nested Sampling: The principle

The enclosed prior volumes form a nested sequence: 0 < \ldots < X_{i+1} < X_{i} < X_{i-1} < \ldots < 1
  • Start with N "live" points \( \{\theta_{1}, \theta_{2}, \ldots, \theta_{N} \} \) sampled from the prior, \(\theta_{i} \sim p(\theta) \); set \( \mathcal{Z} = 0\).
  • Compute the minimum likelihood \(\mathcal{L}_{i} = \min(\mathcal{L}(\theta_{1}), \ldots, \mathcal{L}(\theta_{N})) \) from the current set of live points and discard the point \( \theta_{i}\).
  • Sample a new point \( \theta^{\prime}\) from the prior subject to \( \mathcal{L}(\theta^{\prime}) > \mathcal{L}_{i} \).

Results

Why does a more diffuse predictive interval yield better test predictive density?

Fergus Simpson*, Vidhi Lalchand*, Carl E. Rasmussen. Marginalised Gaussian processes with Nested Sampling . https://arxiv.org/pdf/2010.16344.pdf

Results: 2d pattern extrapolation

y = (\cos 2 x_1 \times \cos 2x_2) \sqrt{|x_1 x_2|}

Fergus Simpson*, Vidhi Lalchand*, Carl E. Rasmussen. Marginalised Gaussian processes with Nested Sampling . https://arxiv.org/pdf/2010.16344.pdf

When would you use Hierarchical GPs?

The benefits from incorporating hyperparameter uncertainty in predictive distributions are more pronounced under the following conditions:

  • High epistemic or aleatoric uncertainty.
  • Kernel design induces a large number of hyperparameters, precluding ML-II type optimisation.
  • GPs can be used in the unsupervised setting by learning a non-linear, probabilistic mapping from a latent space \( X \) to the data space \( Y \).
  • We assume the inputs \( X \) are latent (unobserved).

 

Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times D}\)

Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)

Lalchand et al. (2022)

\(D\) independent Gaussian processes map the low-dimensional latent space to the high-dimensional (\(N \times D\)) data space.

Gaussian Processes for Latent Variable Modelling (at scale)

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2022

GPLVM: Generative Model with Sparse Gaussian processes

\textbf{F} \equiv \{ \bm{f}_{d} \}_{d=1}^{D}, \hskip 20pt \textbf{U} \equiv \{ \bm{u}_{d} \}_{d=1}^{D}

Prior over latents: \( p(X) = \prod_{n=1}^{N}\mathcal{N}(\bm{x}_{n}; \bm{0}, \mathbb{I}_{Q}) \)

Prior over inducing variables: \( p(\textbf{U}) = \prod_{d=1}^{D}p(\bm{u}_{d}|Z), \hskip 10pt p(\bm{u}_{d}|Z) = \mathcal{N}(\bm{0}, K_{mm}) \)

Conditional prior: \( p(\bm{f}_{d}|\bm{u}_{d}, X) = \mathcal{N}(K_{nm}K_{mm}^{-1}\bm{u}_{d}, Q_{nn}) \)

Data likelihood: \( p(\bm{y}_{d}|\bm{f}_{d}) = \mathcal{N}(\bm{f}_{d}, \sigma^{2}\mathbb{I}_{N}) \)

where

Q_{nn} = K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}

Stochastic Variational Evidence Lower bound 

Stochastic variational inference for GP regression was introduced in Hensman et al. (2013) (\( \mathcal{L}_{1}\) below). In this work we extend the SVI bound in two ways: we introduce a variational distribution over the unknown \(X\) and make \(Y\) multi-output.

ELBO:

\log p(Y|\bm{\theta}) \geq \mathcal{L}_{1:D}

\begin{aligned} \mathcal{L}_{1:D} = \sum_{d=1}^{D}\mathcal{L}_{d} = \sum_{d=1}^{D}\sum_{n=1}^{N} \mathbb{E}_{q(f, \textbf{X}, \textbf{U})}[\log p(y_{n,d}|\bm{f}_{d}, \bm{x}_{n})] - \sum_{d}\textrm{KL}(q(\bm{u}_{d})||p(\bm{u}_{d})) - \sum_{n}\textrm{KL}(q(\bm{x}_{n})||p(\bm{x}_{n})) \end{aligned}


Non-Gaussian likelihoods · Flexible variational families · Amortised inference · Interdomain inducing variables · Missing data problems

Variational Formulation:

p(F, X, U |Y) \approx q(F, X, U) = \Big [\prod_{d=1}^{D}p(\bm{f}_{d}|\bm{u}_{d},X)q(\bm{u}_{d}) \Big]q(X)

q(X) =\prod_{n=1}^{N}\mathcal{N}(\bm{x}_{n}; \mu_{n}, s_{n}\mathbb{I}_{Q})

\log p(Y|\theta) = \log \int p(Y|X)p(X)dX \geq \int q(X) \big(\mathcal{L}_{1} + \log p(X) - \log q(X)\big)dX
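As a small illustration (numpy, hypothetical values), assuming the usual standard-normal prior \( p(\bm{x}_{n}) = \mathcal{N}(\bm{0}, \mathbb{I}_{Q}) \), the per-point term \( \textrm{KL}(q(\bm{x}_{n})\,\|\,p(\bm{x}_{n})) \) in the bound has a simple closed form:

```python
import numpy as np

def kl_q_p(mu_n, s_n, Q):
    # KL( N(mu_n, s_n I_Q) || N(0, I_Q) ) for a single latent point x_n:
    # 0.5 * ( Q*s_n + ||mu_n||^2 - Q - Q*log(s_n) )
    return 0.5 * (Q * s_n + np.sum(mu_n ** 2) - Q - Q * np.log(s_n))

Q = 2
mu_n = np.array([0.3, -1.2])   # variational mean for x_n (hypothetical)
s_n = 0.5                      # shared per-dimension variance (hypothetical)
print(kl_q_p(mu_n, s_n, Q))
```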

Oilflow (12d)

         Point            MAP              Bayesian-SVI        AEB-SVI
RMSE     0.341 (0.008)    0.569 (0.092)    0.0925 (0.025)      0.067 (0.0016)
NLPD     4.104 (3.223)    8.16 (1.224)     -11.3105 (0.243)    -11.392 (0.147)

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2022

Robust to Missing Data: MNIST Reconstruction

Reconstructions with 30% and 60% of the data missing.

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2022

Robust to Missing Data: Motion Capture 

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2022

  • GPs can be used in the unsupervised setting by learning a non-linear, probabilistic mapping from a latent space \( X \) to the data space \( Y \).
  • We assume the inputs \( X \) are latent (unobserved).

 

Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times D}\)

Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)

Application: Learn a 2d latent space (a cell atlas) from a single-cell gene expression matrix. \(D\) independent Gaussian processes \( f_{d} \sim \mathcal{GP}(0, k_{f})\) map the 2d latent space (each point represents a cell) to the high-dimensional data space (an \(N \times D\) cell-by-gene matrix of expression counts).


Y = \underbrace{F}_{\textrm{\tiny{{Sparse GP}}}} + \underbrace{\Phi}_{\textrm{\tiny{design matrix}}}\times\underbrace{B}_{\textrm{\tiny{random effects}}} + \underbrace{\bm{\epsilon}}_{\textrm{\tiny{noise model}}}
F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \\ \end{bmatrix}_{N \times D}

The Model: Augmented Kernel Function 

\begin{aligned} \tilde{F} = F + \Phi B \\ \end{aligned}
\begin{aligned} \mathbb{E}(\tilde{\bm{f}_{d}}) &= \mathbb{E}(\bm{f}_{d}) + \mathbb{E}(\Phi B_{d}) = {\color{blue}{\mu_{f}}}\mathbb{I}_{N} + \Phi{\color{blue}{\zeta_{d}}}\\ \textrm{Cov}(\tilde{\bm{f}_{d}}) &= \textrm{Cov}(\bm{f}_{d}) + \textrm{Cov}(\Phi B) = {\color{blue}{K_{nn}}} + {\color{blue}{\nu}}\Phi\Phi^{T} \end{aligned}
Y = \underbrace{F + \Phi \times B}_{\textrm{\tiny{GP with augmented kernel}}} + \underbrace{\bm{\epsilon}}_{\textrm{\tiny{noise model}}}

where we assume a constant mean \(\mu_{f} \in \mathbb{R} \) for the \(\bm{f}\) process, the design matrix \(\Phi\) of covariates is given, and \(\zeta_{d}\) encapsulates the mean of the random effects \(B\).

The expression matrix \(Y\) can be interpreted as being driven by a joint process \(\tilde{F}\) with columns \(\tilde{f}_{d}\) which model both latent and random effects, distributed as individual Gaussian processes.


Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Neil D. Lawrence, Sarah A. Teichmann. Modelling Technical and Biological Effects in single-cell RNA-seq data with Scalable Gaussian Process Latent Variable Models (GPLVM). In Machine Learning in Computational Biology (MLCB), 2022

Inference: Stochastic Variational Inference 

Canonical GP prior: \( \bm{f}_{d} \sim \mathcal{N}(\mu_f\mathbb{I}_{N}, K_{nn}) \)

Augmented GP prior: \( \tilde{\bm{f}}_{d} \sim \mathcal{N}(\mu_{f}\mathbb{I}_{N} + \Phi\bm{\zeta}_{d}, K_{nn} + \nu\Phi\Phi^{T}) \)

p(\tilde{F}) = \prod_{d=1}^{D}p(\tilde{\bm{f}}_{d}) = \prod_{d=1}^{D}\mathcal{N}(\mu_{f}\mathbb{I}_{N} + \Phi\bm{\zeta}_{d}, K_{nn} + \nu\Phi\Phi^{T})
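A minimal numpy sketch (shapes, covariates and hyperparameter values are all placeholders) of building the augmented covariance \( K_{nn} + \nu\Phi\Phi^{T} \) and drawing one column \( \tilde{\bm{f}}_{d} \) from the augmented prior:

```python
import numpy as np

N, P = 100, 3                      # cells, covariate columns (illustrative)
X = np.random.randn(N, 2)          # 2d latent coordinates
Phi = np.random.randn(N, P)        # design matrix of known covariates
zeta_d = np.random.randn(P)        # mean of the random effects for output d
mu_f, nu = 0.0, 0.5                # constant GP mean, random-effect variance

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

K_nn = rbf_kernel(X)
K_aug = K_nn + nu * Phi @ Phi.T + 1e-6 * np.eye(N)   # augmented covariance
mean_aug = mu_f * np.ones(N) + Phi @ zeta_d          # augmented mean
f_tilde_d = np.random.multivariate_normal(mean_aug, K_aug)
```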

The augmented kernel formulation allows us to derive an objective (evidence lower bound) which factorises across both cells (\(N\)) and genes (\( D\)).

*Details about the derivation of the objective are in the paper

While not converged do:
  • Choose a random mini-batch \( Y_{B} \subset Y\)
  • Form a mini-batch estimate of the ELBO: \( \log p(Y) \geq \mathcal{L}(Y_{B}) = \dfrac{N}{B}\big(\sum_{b}\sum_{d}\mathcal{L}_{b,d}\big)\) + terms*
  • Gradient step: all hyperparameters and latent variables \( \longrightarrow \texttt{optim}(\mathcal{L}(Y_{B}))\)
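A toy numpy check (purely illustrative, with stand-in per-datapoint terms) that the \( N/B \) scaling makes the mini-batch sum an unbiased estimate of the corresponding full-data sum:

```python
import numpy as np

N, B = 10_000, 128
per_point_terms = np.random.randn(N)          # stand-ins for the per-datapoint terms L_{b,d}
full_sum = per_point_terms.sum()

estimates = []
for _ in range(2_000):
    batch = np.random.choice(N, size=B, replace=False)   # random mini-batch Y_B
    estimates.append((N / B) * per_point_terms[batch].sum())

print(full_sum, np.mean(estimates))   # the two should be close
```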

Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Neil D. Lawrence, Sarah A. Teichmann. Modelling Technical and Biological Effects in single-cell RNA-seq data with Scalable Gaussian Process Latent Variable Models (GPLVM). In Machine Learning in Computational Biology (MLCB), 2022

Augmented kernel disentangles cell cycle and treatment effects (panels: treatment, latent batch effect, cell cycle phase).

Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Neil D. Lawrence, Sarah A. Teichmann. Modelling Technical and Biological Effects in single-cell RNA-seq data with Scalable Gaussian Process Latent Variable Models (GPLVM). In Machine Learning in Computational Biology (MLCB), 2022

COVID-19 scRNA-seq cohort

Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Neil D. Lawrence, Sarah A. Teichmann. Modelling Technical and Biological Effects in single-cell RNA-seq data with Scalable Gaussian Process Latent Variable Models (GPLVM). In Machine Learning in Computational Biology (MLCB), 2022

Latent space comparison: PCA, scVI, and the augmented GPLVM (54,941 cells, 130 patients).

Modelling biological variation: COVID-19 severity

Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Neil D. Lawrence, Sarah A. Teichmann. Modelling Technical and Biological Effects in single-cell RNA-seq data with Scalable Gaussian Process Latent Variable Models (GPLVM). In Machine Learning in Computational Biology (MLCB), 2022

There is no reason to believe that real data are generated by a kernel with fixed inductive biases, i.e. one whose smoothness properties do not change over the space of covariates/inputs.

We need flexible inductive biases that adapt to the data. But how do we specify such kernels?

\begin{aligned} u(x_1, x_2) = &\cos(2\pi x_1) \cos(2\pi x_2) + \sin(4\pi x_1)\sin(4\pi x_2){1}_{(1/4, 3/4)^2}(x_1, x_2) \\& + \sin(8\pi x_1)\sin(8\pi x_2){1}_{(1/2, 3/4)^2}(x_1, x_2) \\ &+ \sin(16\pi x_1)\sin(16\pi x_2)1_{(1/4, 1/2)^2}(x_1, x_2) \end{aligned}

Stationary vs Non-stationary Kernel 

SE-ARD kernel:

k(\bm{x}_{i}, \bm{x}_{j})_{\text{SE-ARD}} = {\color{blue}{\sigma^{2}_{f}}}\exp \Bigg\{-\dfrac{1}{2}\sum_{d=1}^{D}\dfrac{(x^{(d)}_{i} - x^{(d)}_{j})^{2}}{\color{blue}{\ell_{d}^{2}}} \Bigg\}

Gibbs kernel:

k_{\text{Gibbs}}(\bm{x}_{i},\bm{x}_{j}) = \displaystyle\prod_{d=1}^{D}\sqrt{\dfrac{2\ell_{d}(\bm{x}_{i})\ell_{d}(\bm{x}_{j})}{\ell_{d}^{2}(\bm{x}_{i}) + \ell_{d}^{2}(\bm{x}_{j})}}\exp \left\{ - \sum_{d=1}^{D} \dfrac{(x_{i}^{(d)} - x_{j}^{(d)})^{2}}{\ell_{d}^{2}(\bm{x}_{i}) + \ell_{d}^{2}(\bm{x}_{j})}\right \}

The hyperparameters are themselves functions of the inputs \( X \). How do we specify \(\ell_{d}(\bm{x})\)?

Non-stationary kernels: Input dependent hyperparameters

The Gibbs kernel is derived exactly like the squared exponential kernel, but with input-dependent lengthscale functions \(\ell_{d}(\bm{x})\).

 

\log(\ell_{d}) \sim \mathcal{GP}(\mu_{\ell}, k_{\ell})
f \sim \mathcal{GP}(\mu_{f}, k_{f}(\ell(x_{i}),\ell(x_{j})))
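A minimal numpy implementation of the Gibbs kernel above; the lengthscale function \( \ell_{d}(\bm{x}) \) is passed in as a callable, and the particular choice here is arbitrary and purely illustrative:

```python
import numpy as np

def gibbs_kernel(X1, X2, lengthscale_fn):
    # X1: (n, D), X2: (m, D); lengthscale_fn maps (k, D) -> (k, D) positive lengthscales.
    l1 = lengthscale_fn(X1)[:, None, :]          # (n, 1, D)
    l2 = lengthscale_fn(X2)[None, :, :]          # (1, m, D)
    sq_sum = l1 ** 2 + l2 ** 2                   # (n, m, D)
    prefactor = np.prod(np.sqrt(2.0 * l1 * l2 / sq_sum), axis=-1)
    sqdist = (X1[:, None, :] - X2[None, :, :]) ** 2
    return prefactor * np.exp(-np.sum(sqdist / sq_sum, axis=-1))

# Illustrative lengthscale function: smoothly varying with the inputs.
lengthscale_fn = lambda X: 0.5 + 0.4 * np.sin(X) ** 2

X = np.random.rand(50, 2)
K = gibbs_kernel(X, X, lengthscale_fn)
```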

Application: Climate processes

  • A significant amount of climate research suggests that climate processes are non-stationary.
  • For instance, the frequency of extreme temperatures and extreme precipitation events will not only increase, but the patterns of extreme precipitation will become more erratic.
  • Precipitation is intricately linked to climate and is one of the core targets of climate modelling. Very often the data are spatio-temporal, with 3 coordinates (lat, lon, time).

(work in progress)

Source: National Oceanic Atmospheric Administration (https://www.ncei.noaa.gov/cdo-web/)

Hierarchical GP Model: MAP Inference

Classical \( \rightarrow \) ML-II:

\bm{f}| X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}), \hskip 10pt \bm{y}| \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I}), \hskip 10pt \bm{y} = \bm{f} + \bm{\epsilon}, \hskip 5pt \bm{\epsilon} \sim \mathcal{N}(0, \sigma_{n}^{2}\mathbb{I})

\bm{\theta_{\star}} = \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta})

Stationary Hierarchical \( \rightarrow \) MAP:

Hyperprior \( \rightarrow \) \( \textcolor{red}{ \bm{\theta} \sim p(\bm{\theta}) } \), \hskip 10pt \bm{f}| X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}), \hskip 10pt \bm{y}| \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I})

\begin{aligned} \theta_{\text{MAP}} &= \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta})p(\bm{\theta}) \\ &= \argmax_{\bm{\theta}} \log \mathcal{N}(\bm{y}|0, K_{f} + \sigma^{2}_{n}\mathbb{I})\mathcal{N}(\bm{\theta}|0, \Sigma) \end{aligned}

Non-Stationary Hierarchical \( \rightarrow \) MAP:

Hyperprior \( \rightarrow \) \( \textcolor{red}{ \log \bm{\ell}_{d}(\bm{x}) \sim \mathcal{N}(0, K_{\ell}) } \), \hskip 10pt \bm{f}| X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}), \hskip 10pt \bm{y}| \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I})

\begin{aligned} \ell_{\text{MAP}} &= \argmax_{\ell}\log p(\bm{y}|\hat{\ell})p(\hat{\ell}) \\ &= \argmax_{\ell} \log \mathcal{N}(\bm{y}|0, K_{f} + \sigma^{2}_{n}\mathbb{I})\mathcal{N}(\hat{\ell}|0, K_{\ell}) \end{aligned}

Synthetic Example: 1d

Synthetic Example: 2d

Panels: ground truth function, stationary kernel, non-stationary kernel.

Modelling Precipitation in the Upper Indus Basin

Time-series for a single spatial point  from each of three distinct regions.

Modelling Precipitation in the Upper Indus Basin: Spatial Component 

Modelling Precipitation in the Upper Indus Basin: Temporal Component

k_{temporal}(\bm{x}_{i}, \bm{x}_{j}) = k_{\text{SE-ARD}}({\bm{x}_{i}}, {\bm{x}_{j}})\times k_{\text{PER}}({\bm{x}_{i}}, {\bm{x}_{j}})

Modelling Precipitation in the Upper Indus Basin: Spatio-Temporal

\begin{aligned} k_{\text{stat.}}(\bm{x}_{i}, \bm{x_{j}}) &= \underbrace{k_{\text{SE-ARD}}(({\bm{x}^{\text{lat}}_{i}},{\bm{x}^{\text{lon}}_{i}}), ({\bm{x}^{\text{lat}}_{j}},{\bm{x}^{\text{lon}}_{j}}))\times k_{\text{PER}}({\bm{x}^{t}_{i}}, {\bm{x}^{t}_{j}})}_{\text{temporal}}\\ & \hspace{30mm} + \underbrace{k_{\text{SE-ARD}}(({\bm{x}^{\text{lat}}_{i}},{\bm{x}^{\text{lon}}_{i}}),({\bm{x}^{\text{lat}}_{j}},{\bm{x}^{\text{lon}}_{j}}))}_{\text{spatial}} \end{aligned}
\begin{aligned} k_{\text{non-stat.}}(\bm{x}_{i}, \bm{x_{j}}) = \underbrace{k_{\text{SE-ARD}}(({\bm{x}^{\text{lat}}_{i}},{\bm{x}^{\text{lon}}_{i}}), ({\bm{x}^{\text{lat}}_{j}},{\bm{x}^{\text{lon}}_{j}}))\times k_{\text{PER}}({\bm{x}^{t}_{i}}, {\bm{x}^{t}_{j}})}_{\text{temporal}}\\ + \underbrace{k_{\text{Gibbs}}(({\bm{x}^{\text{lat}}_{i}},{\bm{x}^{\text{lon}}_{i}}),({\bm{x}^{\text{lat}}_{j}},{\bm{x}^{\text{lon}}_{j}}))}_{\text{spatial}} \end{aligned}
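A sketch of the stationary composition in numpy (kernel hyperparameters and the monthly period are placeholders, not fitted values):

```python
import numpy as np

def se_ard(X1, X2, lengthscales, variance=1.0):
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

def periodic(t1, t2, period=12.0, lengthscale=1.0, variance=1.0):
    d = np.abs(t1[:, None] - t2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

# X has columns (lat, lon, time); placeholder data and hyperparameters.
X = np.random.rand(100, 3)
space, time = X[:, :2], X[:, 2]

k_temporal = se_ard(space, space, lengthscales=np.array([1.0, 1.0])) * periodic(time, time)
k_spatial = se_ard(space, space, lengthscales=np.array([0.5, 0.5]))
K_stat = k_temporal + k_spatial   # swap the spatial SE-ARD for the Gibbs kernel for k_non-stat.
```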

Modelling Precipitation in the Upper Indus Basin: Spatio-Temporal

3D regression for spatio-temporal precipitation modelling. The red line demarcates the training and test months. The non-stationary kernel yielded an average test RMSE of 0.9426 across three runs vs. 1.1086 for the stationary kernel.

Modelling Precipitation across the US: Sparse GPs + Non-stationary construction

SE-ARD kernel

Thank you! 

(Bonus slides?)

 Nested Sampling: The principle

Define a notion of prior volume, $$ X(\lambda) = \int_{\mathcal{L}(\theta) > \lambda} \pi(\theta)d\theta$$

\( X(\lambda) \) is the area/volume of parameter space "inside" an iso-likelihood contour.

One can re-cast the multi-dimensional evidence integral as a 1d function of prior volume \( X\).

$$ \mathcal{Z} = \int \mathcal{L}(\theta)\pi(\theta)d\theta = \int_{0}^{1} \mathcal{L}(X)dX$$

The evidence can then just be estimated by 1d quadrature.

$$ \mathcal{Z} \approx \sum_{i=1}^{M}\mathcal{L_{i}}(X_{i} - X_{i+1})$$

Static Nested Sampling — Skilling (2004)

Start with N "live" points \( \{\theta_{1}, \theta_{2}, \ldots, \theta_{N} \} \) sampled from the prior, \(\theta_{i} \sim p(\theta) \); set \( \mathcal{Z} = 0\).

while stopping criterion is unmet do (for \( i = 1, \ldots, K\)):

  • Compute the minimum likelihood \(\mathcal{L}_{i} = \min(\mathcal{L}(\theta_{1}), \ldots, \mathcal{L}(\theta_{N})) \) among the current set of live points.
  • Add the point \( \theta_{i}\) associated with the lowest likelihood \( \mathcal{L}_{i}\) to a list of "saved" points.
  • Sample a new point \( \theta^{\prime}\) from the prior subject to \( \mathcal{L}(\theta^{\prime}) > \mathcal{L}_{i} \) \( \rightarrow \) hard problem # why? (see "Sampling from the constrained prior")
  • Assign the estimated prior mass at this step: \(\hat{X}_{i} = e^{-i/N}\)
  • Assign the weight for the saved point: \( w_{i} = \hat{X}_{i-1} - \hat{X}_{i}\) # has to be positive, as the enclosed prior volume is shrinking at each step
  • Accumulate evidence: \( \mathcal{Z} = \mathcal{Z} + \mathcal{L}_{i}w_{i}\)
  • Evaluate the stopping criterion; if triggered, then break

end

Add the final N live points to the "saved" list with:

  • Each remaining point weighted by \( w = \hat{X}_{K}/N\) # the final slab of enclosed prior mass
  • Final evidence given by \( \mathcal{Z} = \sum_{i=1}^{N+K}\mathcal{L}_{i}w_{i} \)
  • Importance weights for each sample given by \( p_{i} = \mathcal{L}_{i}w_{i}/\mathcal{Z}\)

return the set of saved points \(\{ \theta_{i}\}_{i=1}^{N + K} \), along with importance weights \( \{p_{i} \}_{i=1}^{N + K}\), and the evidence \( \mathcal{Z} \)
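A toy, self-contained Python version of the loop above (a 2d Gaussian likelihood under a uniform prior; the constrained-prior step is done by naive rejection sampling, which is exactly the hard step flagged above):

```python
import numpy as np

# Toy problem: uniform prior on [-1, 1]^2, narrow Gaussian likelihood at the origin.
def log_likelihood(theta):
    return -0.5 * np.sum(theta ** 2) / 0.1 ** 2

def sample_prior():
    return np.random.uniform(-1.0, 1.0, size=2)

N, K = 200, 1000                      # live points, iterations
live = [sample_prior() for _ in range(N)]
live_logL = np.array([log_likelihood(t) for t in live])

Z, X_prev = 0.0, 1.0
for i in range(1, K + 1):
    worst = int(np.argmin(live_logL))
    L_i = live_logL[worst]
    X_i = np.exp(-i / N)              # estimated prior mass enclosed at step i
    w_i = X_prev - X_i                # weight of the discarded ("saved") point
    Z += np.exp(L_i) * w_i            # accumulate evidence
    X_prev = X_i
    # Replace the worst point by rejection sampling from the constrained prior L(theta') > L_i.
    while True:
        theta_new = sample_prior()
        logL_new = log_likelihood(theta_new)
        if logL_new > L_i:
            break
    live[worst], live_logL[worst] = theta_new, logL_new

# Final slab of enclosed prior mass: add the remaining live points.
Z += np.sum(np.exp(live_logL)) * (X_prev / N)
print("nested sampling evidence estimate:", Z)
# Posterior samples are recovered by weighting each saved point by p_i = L_i * w_i / Z.
```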

Prior Mass Estimation 

Why is \( X_{i} \) set to \( e^{-i/N}\), where \(i\) is the iteration number and \(N\) is the number of live points?

0 < \ldots < X_{i+1} < X_{i} < X_{i-1} < \ldots < 1
\begin{aligned} X_{i} &\approx \gamma X_{i-1} \\ \mathbb{E}(\gamma) &= 1 - \dfrac{1}{N+1} = \dfrac{N}{N + 1} \\ X_{i} &\approx \left(\dfrac{N}{N + 1}\right) X_{i-1} \\ &\approx \left(\dfrac{N}{N + 1}\right)^{i} X_{0} \\ &\approx \left(\dfrac{N}{N + 1}\right)^{i} \\ &\approx e^{-i/N} \end{aligned}

(images from An Intro to dynamic nested sampling. Speagle (2017) )
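A quick Monte Carlo sanity check of the shrinkage argument: the per-step compression factor \( \gamma \) is distributed as the largest of \( N \) uniforms, whose mean is \( N/(N+1) \).

```python
import numpy as np

N = 50                     # number of live points
draws = np.random.rand(100_000, N)
gamma = draws.max(axis=1)  # largest of N uniforms = per-step compression of prior volume
print(gamma.mean(), N / (N + 1))   # the two should agree closely
```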

Sampling from the constrained prior

Dynamic Nested Sampling. Speagle (2019)

At every step we need a sample \( \theta^{\prime}\) such that \( \mathcal{L_{\theta^{\prime}}} > \mathcal{L}_{i}\)

We could keep sampling uniformly from the prior and rejecting until we find a point that meets the likelihood condition, but this becomes prohibitively slow once the likelihood threshold is high. There is a better way.

What if we could sample directly from the constrained prior?

p_{\lambda}(\theta) = \begin{cases} p(\theta) / X_{\lambda}, & \mathcal{L}(\theta) > \lambda \\ 0, & \mathcal{L}(\theta) \leq \lambda \end{cases}

New Directions

  • Regression to model functions with inhomogeneous smoothness properties: building non-stationarity into kernels.
  • Probabilistic latent variable models + neural encoders.


  • Geometry: latent variables may live on a non-Euclidean manifold? 
  • Building custom generative AI for scientific applications. 

Neural encoder (parametric)

Gaussian process (non-parametric)
