Sparse Gaussian Process Hyperparameters: Optimize or Integrate?

Vidhi Lalchand

Wessel P. Bruinsma

David R. Burt

Carl E. Rasmussen

Motivation 

  • This work is about Bayesian hyperparameter inference in sparse Gaussian process regression.
  • Traditional gradient-based optimisation (ML-II) can be extremely sensitive to starting values.
  • ML-II hyperparameter estimates are subject to high variability and underestimate prediction uncertainty.
  • We propose a novel and computationally efficient scheme for Fully Bayesian inference in sparse GPs.
y_n = f(x_n) + \epsilon_n, \, \, \epsilon_n \sim \mathcal{N}(0, \sigma^2), \, f \sim \mathcal{GP}(0, k_{\theta})
\log p(\bm{y}|\bm{\theta}) = \log \int p(\bm{y}|f)\,p(f|\bm{\theta})\,df = c \smash{ \underbrace{-\tfrac{1}{2}\bm{y}^{T} (K_{\theta} + \sigma^{2}I)^{-1} \bm{y}}_{\textrm{data fit term}} - \underbrace{\tfrac{1}{2}\log|K_{\theta} + \sigma^{2}I|}_{\textrm{complexity penalty}} }
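For reference, below is a minimal NumPy sketch of this exact log marginal likelihood, split into the data-fit and complexity terms. The RBF kernel, the hyperparameter values, and the toy data are illustrative assumptions, not details from the poster.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale, signal_var):
    # Squared-exponential kernel: k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)).
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    N = X.shape[0]
    Ky = rbf_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(N)
    L = np.linalg.cholesky(Ky)                         # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                        # -1/2 y^T (K + sigma^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))           # -1/2 log|K + sigma^2 I|
    const = -0.5 * N * np.log(2.0 * np.pi)
    return data_fit + complexity + const

# Toy illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50)
print(log_marginal_likelihood(X, y, lengthscale=1.0, signal_var=1.0, noise_var=0.01))
```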

Mathematical set-up

  • Inputs: \( X = (\bm{x}_{n})_{n=1}^{N} \subseteq \mathbb{R}^D \)
  • Outputs: \( \bm{y} = (y_{n})_{n=1}^{N} \subseteq \mathbb{R} \)
  • Latent function prior: \( f \sim \mathcal{GP}(0, k_{\bm{\theta}}) \)
  • Factorised Gaussian likelihood: \( p(\bm{y}|f) = \prod_{n=1}^{N}\mathcal{N}(y_{n}|f_{n}, \sigma^2) \)
  • Inducing locations: \( Z = \{\bm{z}_{m}\}_{m=1}^{M}, \; \bm{z}_m \in \mathbb{R}^{D} \)
  • Inducing variables: \( \bm{u} = \{f(\bm{z}_m)\}_{m=1}^{M} \subseteq \mathbb{R} \)
  • Hyperparameter inference via the "collapsed ELBO" \( \mathcal{L}_{\bm{\theta},Z} \): \( \bm{\theta}^* \in {\textstyle\argmax_{\bm{\theta},Z}}\, \mathcal{L}_{\bm{\theta},Z} \)
Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574. PMLR, 2009.

Titsias [2009] showed that, in the case of a Gaussian likelihood, the optimal variational distribution \( q^{*}(\bm{u})\) is Gaussian and can be derived in closed form.

Canonical Inference for \( \theta\) in sparse GPs:

  1. Specify a variational approximation to the posterior over (\( f, \bm{u}\))
  2. Lower bound the GP log-marginal likelihood \( \log p(\bm{y}|\bm{\theta}) \geq \mathcal{L}_{\theta, Z}\) 
  3. Use the closed-form ELBO to learn hyperparameters (\( \theta \)) and inducing locations (\(Z\))
Variational approximation to the posterior:
p(f, \bm{u} | \bm{y}, {\bm{\theta}}) \approx q(f, \bm{u} | {\bm{\theta}}) = p(f | \bm{u}, {\bm{\theta}})\, q(\bm{u})
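The resulting collapsed ELBO has the closed form \( \mathcal{L}_{\bm{\theta},Z} = \log\mathcal{N}(\bm{y}\,|\,\bm{0},\, Q_{nn} + \sigma^2 I) - \tfrac{1}{2\sigma^2}\mathrm{tr}(K_{nn} - Q_{nn}) \) with \( Q_{nn} = K_{nm}K_{mm}^{-1}K_{mn} \). Below is a minimal NumPy sketch of this quantity, assuming an RBF kernel; the helper names and the jitter value are illustrative, not from the poster.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale, signal_var):
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

def collapsed_elbo(X, y, Z, lengthscale, signal_var, noise_var, jitter=1e-6):
    # L_{theta,Z} = log N(y | 0, Q_nn + sigma^2 I) - tr(K_nn - Q_nn) / (2 sigma^2),
    # with Q_nn = K_nm K_mm^{-1} K_mn. For clarity this forms the full N x N matrix;
    # an efficient implementation works with M x M systems at O(N M^2) cost.
    N, M = X.shape[0], Z.shape[0]
    Kmm = rbf_kernel(Z, Z, lengthscale, signal_var) + jitter * np.eye(M)
    Kmn = rbf_kernel(Z, X, lengthscale, signal_var)
    A = np.linalg.solve(np.linalg.cholesky(Kmm), Kmn)   # Q_nn = A^T A
    Qnn = A.T @ A
    Lc = np.linalg.cholesky(Qnn + noise_var * np.eye(N))
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    log_gauss = (-0.5 * y @ alpha
                 - np.sum(np.log(np.diag(Lc)))
                 - 0.5 * N * np.log(2.0 * np.pi))
    knn_diag = np.full(N, signal_var)                   # diagonal of K_nn for the RBF kernel
    trace_term = np.sum(knn_diag - np.diag(Qnn)) / (2.0 * noise_var)
    return log_gauss - trace_term
```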

[Ours] Doubly collapsed Inference for \( \theta\) in sparse GPs:

  1. Specify a variational approximation to the posterior over (\( f, \bm{u}, \bm{\theta}\))
  2. Lower bound the GP log-marginal likelihood   \( \log p(\bm{y}) \geq  \int q(\bm{\theta}) \mathcal{L}_{\bm{\theta}, Z}d\bm{\theta} -  \textrm{KL}(q(\bm{\theta}) || p(\bm{\theta})) \)
  3. Crucially, we can write down the optimal \( q^{*}(\bm{\theta})\) up to a normalising constant (see the sampling sketch below).
Variational approximation to the posterior:
p(f, \bm{u}, \bm{\theta}| \bm{y}) \approx q(f, \bm{u}, {\bm{\theta}}) = p(f | \bm{u}, {\bm{\theta}})\, q(\bm{u}|\bm{\theta})\, q(\bm{\theta})
Hyperparameter inference (the "collapsed ELBO" \( \mathcal{L}_{\bm{\theta}, Z}\) appears inside the bound above):
\textrm{Sample} \hspace{2mm} \bm{\theta}^{*} \sim q^{*}(\bm{\theta})

Algorithm

Overall, the core training algorithm alternates between two steps:

 

By sampling from \( q^{*}(\bm{\theta})\), we side-step the need to sample from the joint \( (\bm{u},\bm{\theta})\)-space, yielding a significantly more efficient algorithm in the case of regression with a Gaussian likelihood.

\begin{aligned} &\textrm{1. Sampling step for $\bm{\theta}$:} \hspace{2mm} \bm{\theta}_{j} \sim q^{\ast}(\bm{\theta}) \propto \exp\!\big(\mathcal{L}(\bm{\theta}, Z_{\textrm{opt}})\big)\, p(\bm{\theta}), \hspace{2mm}\textcolor{orange}{[\textrm{Keep $Z_{\textrm{opt}}$ fixed}]}\\ &\textrm{2. Optimisation step for $Z$:} \hspace{2mm} Z_{\textrm{opt}} \longleftarrow \texttt{optim}(\hat{\mathcal{L}}), \hspace{2mm}\textrm{where} \\ & \hspace{5mm} \hat{\mathcal{L}}(Z) = \mathbb{E}_{q^{\ast}(\bm{\theta})}[\mathcal{L}(\bm{\theta}, Z)] \approx \dfrac{1}{J}\sum_{j=1}^{J}\mathcal{L}(\bm{\theta}_{j}, Z), \hspace{2mm}\textcolor{orange}{[\textrm{Keep $\{\bm{\theta}_{j}\}$ fixed}]}\end{aligned}
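A schematic version of this alternating loop is sketched below, reusing the collapsed_elbo and metropolis_theta helpers from the earlier sketches. The use of SciPy's L-BFGS-B for the Z step and the thinning of the \( \bm{\theta}\) samples are choices made here for brevity, not implementation details from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_doubly_collapsed(X, y, Z_init, log_theta0, n_outer=5, n_mcmc=100, seed=0):
    # Alternate between (1) sampling theta ~ q*(theta) with Z fixed and
    # (2) optimising Z on a Monte Carlo estimate of E_{q*(theta)}[L(theta, Z)].
    Z = np.array(Z_init, dtype=float)
    log_theta = np.asarray(log_theta0, dtype=float)
    for _ in range(n_outer):
        # 1. Sampling step for theta (metropolis_theta from the previous sketch).
        theta_samples = metropolis_theta(X, y, Z, log_theta, n_samples=n_mcmc, seed=seed)
        log_theta = np.log(theta_samples[-1])            # warm start for the next round

        # 2. Optimisation step for Z with the theta samples held fixed.
        def neg_avg_elbo(z_flat):
            Zc = z_flat.reshape(Z.shape)
            vals = [collapsed_elbo(X, y, Zc, *th) for th in theta_samples[::10]]
            return -np.mean(vals)

        Z = minimize(neg_avg_elbo, Z.ravel(), method="L-BFGS-B").x.reshape(Z.shape)
    return Z, theta_samples
```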
\begin{array}{lccc} \textrm{Approach} & \textrm{Time/iter.} & \textrm{Memory/iter.} & \textrm{Sampled variables} \\ \textrm{Non-collapsed [Hensman et al., 2015]} & \textcolor{darkgreen}{\mathcal{O}(m^3)} & \textcolor{darkgreen}{\mathcal{O}(m^2)} & \textcolor{orange}{n_\theta + m} \\ \textrm{Collapsed \textcolor{blue}{(ours)}} & \textcolor{orange}{\mathcal{O}(nm^2)} & \textcolor{darkgreen}{\mathcal{O}(m^2)} & \textcolor{darkgreen}{n_\theta} \\ \end{array}

James Hensman, Alexander G Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015.


\( n_{\theta}\) is the number of hyperparameters  and \(m\) is the number of inducing variables 


1d Synthetic Experiment

f(x) = \sin(3x) + 0.3\cos(\pi x) \\ \textrm{with training inputs restricted to} \hspace{2mm} x < -2 \hspace{2mm} \textrm{or} \hspace{2mm} x > 2.
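A sketch of how such a training set can be generated follows; the noise level, input ranges, and sample counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(3.0 * x) + 0.3 * np.cos(np.pi * x)

# Training inputs lie only outside the central interval (x < -2 or x > 2),
# so the gap (-2, 2) probes the model's predictive uncertainty.
x_train = np.concatenate([rng.uniform(-5.0, -2.0, 100), rng.uniform(2.0, 5.0, 100)])
y_train = f(x_train) + 0.1 * rng.standard_normal(x_train.shape)
x_test = np.linspace(-5.0, 5.0, 400)
```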

Sparse GP Benchmarks 

  • Our method, SGPR + HMC (--), outperforms other fully Bayesian benchmarks such as jointHMC (--) and FBGP (--) in terms of negative log predictive density on unseen data.
  • It is significantly faster than jointHMC and exact GP inference with HMC (--).

 

 

 

Neg. log predictive density (mean \(\pm\) se) on test data, 10 splits.

Thank you!

NeurIPS 2022
