Marginalised Spectral Mixture Kernels with Nested Sampling

3rd Symposium on Advances in Approximate Bayesian Inference

Jan-Feb 2021

Fergus Simpson, Vidhi Lalchand, Carl E. Rasmussen

Background: Gaussian Processes

Gaussian Processes offer a powerful, Bayesian, non-parametric paradigm for learning functions.

\begin{aligned} p(\bm{y}|\bm{\theta}) &= \int p(\bm{y}|\bm{f})\,p(\bm{f}|\bm{\theta})\,d\bm{f}\\ &= \int \mathcal{N}(\bm{y}; \bm{f}, \sigma_{n}^{2}\mathbb{I})\,\mathcal{N}(\bm{f}; \bm{0}, K_{\theta})\,d\bm{f} \\ &= \mathcal{N}(\bm{y}; \bm{0}, K_{\theta} + \sigma^{2}_{n}\mathbb{I}) \end{aligned}

Conventionally, learning occurs via maximisation of the marginal likelihood (ML-II):

\bm{\theta}_{\star} = \argmax_{\bm{\theta}} p(\bm{y}|\bm{\theta})
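A minimal sketch of ML-II in Python (NumPy/SciPy). The RBF kernel, the log-space parameterisation and all variable names are illustrative assumptions, not details from the talk; any covariance function could be substituted.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf_kernel(X1, X2, lengthscale, signal_var):
    """Illustrative RBF kernel; any covariance function could be used here."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_likelihood(theta, X, y):
    """log p(y | theta) = log N(y; 0, K_theta + sigma_n^2 I)."""
    lengthscale, signal_var, noise_var = np.exp(theta)  # work in log-space
    K = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(len(y))
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), y)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

# ML-II: maximise the marginal likelihood over the hyperparameters.
X = np.random.randn(50, 1)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(50)
res = minimize(lambda t: -log_marginal_likelihood(t, X, y), x0=np.zeros(3))
theta_star = np.exp(res.x)  # (lengthscale, signal variance, noise variance)
```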

Objectives

  • Investigate the feasibility of using Nested Sampling to marginalise the hyperparameters of GPs
  • Marginalise the hyperparameters of the spectral mixture kernel

Learning with Spectral Mixture Kernels

The spectral mixture (SM) kernel models the spectral density as a mixture of Q Gaussians, yielding a highly expressive stationary covariance whose likelihood surface is typically multi-modal:

k(\bm{\tau}) = \sum_{i=1}^Q w_i \cos( 2\pi \bm{\tau} \cdot \bm{\mu}_{i}) \prod _{d=1}^{D} \exp\!\big(- 2 \pi^2 \tau_{d}^2 \big(\sigma_{i}^{(d)}\big)^{2}\big)
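A sketch of the SM kernel exactly as written above, in NumPy. The array names and shapes (Q mixture components, D input dimensions) are assumptions for illustration.

```python
import numpy as np

def spectral_mixture_kernel(X1, X2, w, mu, sigma2):
    """k(tau) = sum_i w_i cos(2*pi tau . mu_i) prod_d exp(-2*pi^2 tau_d^2 sigma2[i, d]).

    w:      (Q,)   mixture weights
    mu:     (Q, D) spectral means (frequencies)
    sigma2: (Q, D) spectral variances per dimension
    """
    tau = X1[:, None, :] - X2[None, :, :]                         # (N1, N2, D)
    cos_term = np.cos(2 * np.pi * tau @ mu.T)                     # (N1, N2, Q)
    exp_term = np.exp(-2 * np.pi**2 *
                      np.einsum('nmd,qd->nmq', tau**2, sigma2))   # (N1, N2, Q)
    return np.sum(w * cos_term * exp_term, axis=-1)               # (N1, N2)
```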

Marginalised Gaussian Processes

Generative Model

\begin{aligned} \text{Hyperprior:} & \hspace{2mm} \bm{\theta} \sim p(\bm{\theta}) \, , \\ \text{Prior:} & \hspace{2mm} \bm{f}| X, \bm{\theta} \sim \mathcal{N}(\bm{0}, K_{\theta}) \, , \\ \text{Likelihood:} & \hspace{2mm} \bm{y}| \bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I}) \end{aligned}

Predictions

\begin{aligned} p(\bm{f}^{\star} | \bm{y}) &= \iint p(\bm{f}^{\star}| \bm{f},\bm{\theta})\,p(\bm{f} | \bm{\theta}, \bm{y})\,p(\bm{\theta}|\bm{y})\,d\bm{f}\,d\bm{\theta}\\ &= \int p(\bm{f}^{\star}| \bm{y},\bm{\theta})\,p(\bm{\theta}|\bm{y})\, d \bm{\theta} \, , \\ & \simeq \dfrac{1}{M}\sum_{j=1}^{M}p(\bm{f}^{\star}| \bm{y}, \bm{\theta}_{j}) = \dfrac{1}{M}\sum_{j=1}^{M}\mathcal{N}(\bm{\mu}_{j}^{\star}, \Sigma_{j}^{\star}) \, , \qquad \bm{\theta}_{j} \sim p(\bm{\theta} |\bm{y}) \end{aligned}
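A sketch of the Monte Carlo predictive mixture above. The helper `gp_predict(X_test, theta)` is hypothetical: it stands for whatever routine returns the Gaussian predictive mean and variance of p(f* | y, θ) for a single hyperparameter sample.

```python
import numpy as np

def mixture_predict(theta_samples, X_test, gp_predict):
    """Average the per-sample Gaussian predictives p(f* | y, theta_j).

    theta_samples: (M, P) posterior hyperparameter samples theta_j ~ p(theta | y)
    gp_predict:    callable returning (mean, var) of p(f* | y, theta) at X_test
    """
    means, variances = zip(*(gp_predict(X_test, theta) for theta in theta_samples))
    means, variances = np.stack(means), np.stack(variances)   # each (M, N*)
    mix_mean = means.mean(axis=0)
    # Variance of an equally-weighted mixture of Gaussians (law of total variance).
    mix_var = variances.mean(axis=0) + (means**2).mean(axis=0) - mix_mean**2
    return mix_mean, mix_var
```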

Overview: Nested Sampling

Speagle JS. dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences. Monthly Notices of the Royal Astronomical Society. 2020 Apr;493(3):3132-58.
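Nested sampling evolves a set of live points through nested likelihood contours, estimating the evidence and yielding posterior samples as a by-product. A minimal sketch with dynesty, reusing the `log_marginal_likelihood`, `X` and `y` from the earlier sketch; the uniform prior transform over log-hyperparameters is purely illustrative, not the hyperprior used in the talk.

```python
import numpy as np
from dynesty import NestedSampler

ndim = 3  # e.g. log-lengthscale, log-signal-variance, log-noise-variance

def prior_transform(u):
    # Map the unit cube to the hyperprior; here an illustrative uniform
    # prior over log-hyperparameters in [-5, 5].
    return 10.0 * u - 5.0

def loglike(theta):
    return log_marginal_likelihood(theta, X, y)

sampler = NestedSampler(loglike, prior_transform, ndim, nlive=500)
sampler.run_nested()
results = sampler.results

# Importance weights give posterior samples theta_j ~ p(theta | y) for the
# predictive mixture; the log-evidence is also available in results.logz.
weights = np.exp(results.logwt - results.logz[-1])
idx = np.random.choice(len(weights), size=200, p=weights / weights.sum())
theta_samples = results.samples[idx]
```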

Synthetic Experiments

Time-series prediction (1d)

Time series benchmarks

Why is it better?

Pattern Prediction (2d)

y = (\cos 2 x_1 \times \cos 2x_2) \sqrt{|x_1 x_2|}

Train: 50 points drawn uniformly at random from the [-6, 6] x [-6, 6] square.

Test: a 400-point grid spanning [-10, 10] x [-10, 10].
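A sketch of how the 2-d pattern data described above might be generated; the grid sizes follow the slide, while the random seed and noise-free targets are assumptions.

```python
import numpy as np

def pattern(x1, x2):
    # y = cos(2 x1) * cos(2 x2) * sqrt(|x1 x2|)
    return np.cos(2 * x1) * np.cos(2 * x2) * np.sqrt(np.abs(x1 * x2))

rng = np.random.default_rng(0)

# Train: 50 points drawn uniformly from [-6, 6] x [-6, 6].
X_train = rng.uniform(-6, 6, size=(50, 2))
y_train = pattern(X_train[:, 0], X_train[:, 1])

# Test: a 20 x 20 grid (400 points) on [-10, 10] x [-10, 10].
g = np.linspace(-10, 10, 20)
X_test = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
y_test = pattern(X_test[:, 0], X_test[:, 1])
```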

NLPD (negative log predictive density; lower is better):

  • ML-II: 216
  • Hamiltonian Monte Carlo: 2.56
  • Nested Sampling: 2.62

Summary

  • Nested Sampling offers a powerful, gradient-free method for exploring multi-modal likelihood surfaces such as that of the SM kernel, and for marginalising GP hyperparameters.
  • Marginalised GPs give more robust prediction intervals by accounting for hyperparameter uncertainty; their advantage over ML-II becomes particularly pronounced when expressive kernels with several hyperparameters are used, or when the training data are sparse or noisy.
  • On the practicality of marginalisation: sampling hyperparameters adds some computational overhead, but it integrates easily with inducing-point methods (sparse GPs) and structure-exploiting Kronecker methods.

Thank you! 

AABI - Marginalised GPs with Nested Sampling

By Vidhi Lalchand
