Gaussian Process Parameterised Covariance Kernels for Non-stationary Regression

Vidhi Lalchand$^{1}$, Talay Cheema$^{1}$, Laurence Aitchison$^{2}$, Carl E. Rasmussen$^{1}$

Motivation: Non-Stationary Kernels

Learning a non-stationary kernel (1d)

Reconstructing a 2d non-stationary surface 

A large cross-section of the Gaussian process literature uses universal kernels such as the squared exponential (SE) kernel with automatic relevance determination (ARD) in high dimensions. The ARD framework in covariance kernels operates by pruning away extraneous dimensions through contracting their inverse lengthscales. This work considers probabilistic inference in the factorised Gibbs kernel and the multivariate Gibbs kernel with input-dependent lengthscales. These kernels allow for non-stationary modelling, where samples from the posterior function space ``adapt'' to the varying smoothness structure inherent in the ground truth. We propose parameterising the lengthscale function of the factorised and multivariate Gibbs covariance functions with a latent Gaussian process defined on the same inputs.

We use MAP inference with a GP prior over the lengthscale process to recover the ground-truth kernel (left) from randomly distributed training data points.
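A sketch of the corresponding MAP objective, assuming the log-lengthscales \tilde{\bm{\ell}} = \log \bm{\ell} at the training inputs are optimised jointly with the remaining hyperparameters (the symbol K_{\text{Gibbs}}(\tilde{\bm{\ell}}), the Gibbs covariance matrix evaluated with lengthscales \exp(\tilde{\bm{\ell}}), is our notation rather than the poster's):

\tilde{\bm{\ell}}^{\star} = \arg\max_{\tilde{\bm{\ell}}}\; \log \mathcal{N}\big(\bm{y}\,|\,\bm{0},\, K_{\text{Gibbs}}(\tilde{\bm{\ell}}) + \sigma_{n}^{2}\mathbb{I}\big) + \log \mathcal{N}\big(\tilde{\bm{\ell}}\,|\,\mu_{\ell}\bm{1},\, K_{\ell}\big)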

k(\bm{x}_{i}, \bm{x}_{j}) = \displaystyle\prod_{d=1}^{D}\sqrt{\dfrac{2\ell_{d}(\bm{x}_{i})\ell_{d}(\bm{x}_{j})}{\ell_{d}^{2}(\bm{x}_{i}) + \ell_{d}^{2}(\bm{x}_{j})}}\exp \left\{ - \sum_{d=1}^{D} \dfrac{(x_{i}^{(d)} - x_{j}^{(d)})^{2}}{\ell_{d}^{2}(\bm{x}_{i}) + \ell_{d}^{2}(\bm{x}_{j})}\right\}
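To make the kernel concrete, a minimal NumPy sketch of evaluating the factorised Gibbs kernel for a given lengthscale function (the function and variable names, and the toy lengthscale function, are illustrative assumptions, not the authors' code):

import numpy as np

def factorised_gibbs_kernel(X1, X2, lengthscale_fn):
    # X1: (n, D), X2: (m, D); lengthscale_fn maps (k, D) inputs to (k, D)
    # positive, input-dependent lengthscales, one per dimension.
    L1 = lengthscale_fn(X1)                                      # (n, D)
    L2 = lengthscale_fn(X2)                                      # (m, D)
    denom = L1[:, None, :] ** 2 + L2[None, :, :] ** 2            # (n, m, D)
    # Prefactor: prod_d sqrt(2 l_d(x_i) l_d(x_j) / (l_d(x_i)^2 + l_d(x_j)^2))
    prefactor = np.prod(np.sqrt(2.0 * L1[:, None, :] * L2[None, :, :] / denom), axis=-1)
    # Exponent: -sum_d (x_i^(d) - x_j^(d))^2 / (l_d(x_i)^2 + l_d(x_j)^2)
    sqdist = (X1[:, None, :] - X2[None, :, :]) ** 2              # (n, m, D)
    return prefactor * np.exp(-np.sum(sqdist / denom, axis=-1))

# Toy 1d example: short lengthscales near x = 0.5, longer elsewhere.
X = np.linspace(0.0, 1.0, 50)[:, None]
ls_fn = lambda Z: 0.1 + 0.4 * (1.0 - np.exp(-((Z - 0.5) ** 2) / 0.05))
K = factorised_gibbs_kernel(X, X, ls_fn)                         # (50, 50) Gram matrix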

Modelling Precipitation Across the Continental United States

Approximate posterior predictive means for the 2d surface using 250 inducing points. The factorised Gibbs kernel (FGK) adapts to the shorter-lengthscale behaviour in the central region, while the standard SE-ARD kernel is forced to use a single, input-independent lengthscale per dimension.

University of Cambridge$^{1}$, University of Bristol$^{2}$

The hierarchical GP framework is given below.
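A minimal sketch of the hierarchy, assuming the standard inducing-variable construction with a Gaussian likelihood (the explicit conditional for \bm{f}, and the matrices K_{nm}, K_{nn}, K_{mn} denoting training-input and cross covariances under k_{\theta}, are assumptions consistent with the joint factorisation stated below):

\bm{\theta} \sim p_{\psi}(\bm{\theta}), \qquad \bm{u}\,|\,\bm{\theta} \sim \mathcal{N}(\bm{0}, K_{mm}), \qquad \bm{f}\,|\,\bm{u}, \bm{\theta} \sim \mathcal{N}\big(K_{nm}K_{mm}^{-1}\bm{u},\; K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}\big), \qquad \bm{y}\,|\,\bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I})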

where K_{mm} denotes the covariance matrix computed using the same kernel function k_{\theta} on the inducing locations Z as inputs; the likelihood factorises across data points, p(\bm{y}|\bm{f}) = \prod_{i=1}^{N}p(y_{i}|f_{i}) = \mathcal{N}(\bm{y}|\bm{f}, \sigma_{n}^{2}\mathbb{I}), and \psi denotes the parameters of the hyperprior. The joint model is given by p(\bm{y},\bm{f},\bm{u},\bm{\theta}) = p(\bm{y}|\bm{f})p(\bm{f}|\bm{u},\bm{\theta})p(\bm{u}|\bm{\theta})p(\bm{\theta}).

The standard marginal likelihood p(\bm{y}) = \int p(\bm{y}|\bm{\theta})p(\bm{\theta})d\bm{\theta} is intractable. The inner term p(\bm{y}|\bm{\theta}) is the canonical marginal likelihood \mathcal{N}(\bm{y}|\bm{0}, K + \sigma^{2}_{n}\mathbb{I}) in the exact GP case and is approximated by a closed-form evidence lower bound (ELBO) in the sparse GP case for a Gaussian likelihood. The sparse variational objective in the extended model augments the ELBO with an additional term to account for the prior over hyperparameters, \log p(\bm{y}, \bm{\theta}) \geq \mathcal{L}_{sgp} + \log p_{\psi}(\bm{\theta}).
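For reference, one standard closed form for \mathcal{L}_{sgp} under a Gaussian likelihood is the collapsed bound of Titsias (2009); whether the poster uses this collapsed form or an uncollapsed stochastic variant is an assumption here (K_{nm}, K_{nn}, K_{mn} denote the training-input and cross covariances under k_{\theta}):

\mathcal{L}_{sgp} = \log \mathcal{N}\big(\bm{y}\,|\,\bm{0},\; K_{nm}K_{mm}^{-1}K_{mn} + \sigma_{n}^{2}\mathbb{I}\big) - \dfrac{1}{2\sigma_{n}^{2}}\mathrm{tr}\big(K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}\big)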

\log(\ell_{d}) \sim \mathcal{N}(\mu_{\ell}, K_{\ell})
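As an illustration, a minimal NumPy sketch of drawing one lengthscale function from this prior over a set of 1d inputs; the SE form and hyperparameter values of the latent process, and all names, are illustrative assumptions:

import numpy as np

def sample_lengthscale_process(X, mu_l=np.log(0.2), latent_ls=0.3, latent_var=0.5, jitter=1e-8):
    # X: (n, 1) inputs. Build the latent SE covariance K_l over X,
    # draw log-lengthscales from N(mu_l, K_l), and exponentiate.
    d2 = (X - X.T) ** 2
    K_l = latent_var * np.exp(-0.5 * d2 / latent_ls ** 2)
    chol = np.linalg.cholesky(K_l + jitter * np.eye(len(X)))
    log_l = mu_l + chol @ np.random.randn(len(X))
    return np.exp(log_l)                          # positive lengthscales l_d(x)

X = np.linspace(0.0, 1.0, 100)[:, None]
lengthscales = sample_lengthscale_process(X)      # one draw of the lengthscale function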