Gaussian Process Parameterised Covariance Kernels for Non-stationary Regression

Vidhi Lalchand$^{1}$, Talay Cheema$^{1}$, Laurence Aitchison$^{2}$, Carl E. Rasmussen$^{1}$

Motivation: Non-Stationary Kernels

Learning a non-stationary kernel (1d)

Reconstructing a 2d non-stationary surface 

A large cross-section of the Gaussian process literature uses universal kernels such as the squared exponential (SE) kernel with automatic relevance determination (ARD) in high dimensions. The ARD framework in covariance kernels operates by pruning away extraneous dimensions through contracting their inverse lengthscales. This work considers probabilistic inference in the factorised Gibbs kernel and the multivariate Gibbs kernel with input-dependent lengthscales. These kernels allow for non-stationary modelling, where samples from the posterior function space ``adapt'' to the varying smoothness structure inherent in the ground truth. We propose parameterising the lengthscale function of the factorised and multivariate Gibbs covariance functions with a latent Gaussian process defined on the same inputs.

We use MAP inference with a GP prior over the lengthscale process to recover the ground-truth kernel (left) from randomly distributed training data points.
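A sketch of the corresponding MAP objective, assuming the log-lengthscales \tilde{\bm{\ell}} = \log \bm{\ell} at the training inputs are optimised jointly with the remaining hyperparameters (the symbol K_{\text{Gibbs}}(\tilde{\bm{\ell}}), the Gibbs covariance matrix evaluated with lengthscales \exp(\tilde{\bm{\ell}}), is our notation rather than the poster's):

\tilde{\bm{\ell}}^{\star} = \arg\max_{\tilde{\bm{\ell}}}\; \log \mathcal{N}\big(\bm{y}\,|\,\bm{0},\, K_{\text{Gibbs}}(\tilde{\bm{\ell}}) + \sigma_{n}^{2}\mathbb{I}\big) + \log \mathcal{N}\big(\tilde{\bm{\ell}}\,|\,\mu_{\ell}\bm{1},\, K_{\ell}\big)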

k(\bm{x}_{i}, \bm{x}_{j}) = \displaystyle\prod_{d=1}^{D}\sqrt{\dfrac{2\ell_{d}(\bm{x}_{i})\ell_{d}(\bm{x}_{j})}{\ell_{d}^{2}(\bm{x}_{i}) + \ell_{d}^{2}(\bm{x}_{j})}}\exp \left\{ - \sum_{d=1}^{D} \dfrac{(x_{i}^{(d)} - x_{j}^{(d)})^{2}}{\ell_{d}^{2}(\bm{x}_{i}) + \ell_{d}^{2}(\bm{x}_{j})}\right\}
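To make the kernel concrete, a minimal NumPy sketch of evaluating the factorised Gibbs kernel for a given lengthscale function (the function and variable names, and the toy lengthscale function, are illustrative assumptions, not the authors' code):

import numpy as np

def factorised_gibbs_kernel(X1, X2, lengthscale_fn):
    # X1: (n, D), X2: (m, D); lengthscale_fn maps (k, D) inputs to (k, D)
    # positive, input-dependent lengthscales, one per dimension.
    L1 = lengthscale_fn(X1)                                      # (n, D)
    L2 = lengthscale_fn(X2)                                      # (m, D)
    denom = L1[:, None, :] ** 2 + L2[None, :, :] ** 2            # (n, m, D)
    # Prefactor: prod_d sqrt(2 l_d(x_i) l_d(x_j) / (l_d(x_i)^2 + l_d(x_j)^2))
    prefactor = np.prod(np.sqrt(2.0 * L1[:, None, :] * L2[None, :, :] / denom), axis=-1)
    # Exponent: -sum_d (x_i^(d) - x_j^(d))^2 / (l_d(x_i)^2 + l_d(x_j)^2)
    sqdist = (X1[:, None, :] - X2[None, :, :]) ** 2              # (n, m, D)
    return prefactor * np.exp(-np.sum(sqdist / denom, axis=-1))

# Toy 1d example: short lengthscales near x = 0.5, longer elsewhere.
X = np.linspace(0.0, 1.0, 50)[:, None]
ls_fn = lambda Z: 0.1 + 0.4 * (1.0 - np.exp(-((Z - 0.5) ** 2) / 0.05))
K = factorised_gibbs_kernel(X, X, ls_fn)                         # (50, 50) Gram matrix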

Modelling Precipitation Across the Continental United States

Approximate posterior predictive means for the 2d surface using 250 inducing points. The factorised Gibbs kernel (FGK) adapts to the shorter-lengthscale behaviour in the central region, while the standard SE-ARD kernel is forced to use a single, input-independent lengthscale per dimension.

University of Cambridge$^{1}$, University of Bristol$^{2}$

The hierarchical GP framework is given below.
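A minimal sketch of the hierarchy, assuming the standard inducing-variable construction with a Gaussian likelihood (the explicit conditional for \bm{f}, and the matrices K_{nm}, K_{nn}, K_{mn} denoting training-input and cross covariances under k_{\theta}, are assumptions consistent with the joint factorisation stated below):

\bm{\theta} \sim p_{\psi}(\bm{\theta}), \qquad \bm{u}\,|\,\bm{\theta} \sim \mathcal{N}(\bm{0}, K_{mm}), \qquad \bm{f}\,|\,\bm{u}, \bm{\theta} \sim \mathcal{N}\big(K_{nm}K_{mm}^{-1}\bm{u},\; K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}\big), \qquad \bm{y}\,|\,\bm{f} \sim \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I})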

where K_{mm} denotes the covariance matrix computed using the same kernel function k_{\theta} on the inducing locations Z as inputs; the likelihood factorises across data points, p(\bm{y}|\bm{f}) = \prod_{i=1}^{N}p(y_{i}|f_{i}) = \mathcal{N}(\bm{y}|\bm{f}, \sigma_{n}^{2}\mathbb{I}), and \psi denotes the parameters of the hyperprior. The joint model is given by p(\bm{y},\bm{f},\bm{u},\bm{\theta}) = p(\bm{y}|\bm{f})p(\bm{f}|\bm{u},\bm{\theta})p(\bm{u}|\bm{\theta})p(\bm{\theta}).

The standard marginal likelihood p(\bm{y}) = \int p(\bm{y}|\bm{\theta})p(\bm{\theta})d\bm{\theta} is intractable. The inner term p(\bm{y}|\bm{\theta}) is the canonical marginal likelihood \mathcal{N}(\bm{y}|\bm{0}, K + \sigma^{2}_{n}\mathbb{I}) in the exact GP case and is approximated by a closed-form evidence lower bound (ELBO) in the sparse GP case for a Gaussian likelihood. The sparse variational objective in the extended model augments the ELBO with an additional term to account for the prior over hyperparameters, \log p(\bm{y}, \bm{\theta}) \geq \mathcal{L}_{sgp} + \log p_{\psi}(\bm{\theta}).
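For reference, one standard closed form for \mathcal{L}_{sgp} under a Gaussian likelihood is the collapsed bound of Titsias (2009); whether the poster uses this collapsed form or an uncollapsed stochastic variant is an assumption here (K_{nm}, K_{nn}, K_{mn} denote the training-input and cross covariances under k_{\theta}):

\mathcal{L}_{sgp} = \log \mathcal{N}\big(\bm{y}\,|\,\bm{0},\; K_{nm}K_{mm}^{-1}K_{mn} + \sigma_{n}^{2}\mathbb{I}\big) - \dfrac{1}{2\sigma_{n}^{2}}\mathrm{tr}\big(K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}\big)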

\log(\ell_{d}) \sim \mathcal{N}(\mu_{\ell}, K_{\ell})
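As an illustration, a minimal NumPy sketch of drawing one lengthscale function from this prior over a set of 1d inputs; the SE form and hyperparameter values of the latent process, and all names, are illustrative assumptions:

import numpy as np

def sample_lengthscale_process(X, mu_l=np.log(0.2), latent_ls=0.3, latent_var=0.5, jitter=1e-8):
    # X: (n, 1) inputs. Build the latent SE covariance K_l over X,
    # draw log-lengthscales from N(mu_l, K_l), and exponentiate.
    d2 = (X - X.T) ** 2
    K_l = latent_var * np.exp(-0.5 * d2 / latent_ls ** 2)
    chol = np.linalg.cholesky(K_l + jitter * np.eye(len(X)))
    log_l = mu_l + chol @ np.random.randn(len(X))
    return np.exp(log_l)                          # positive lengthscales l_d(x)

X = np.linspace(0.0, 1.0, 100)[:, None]
lengthscales = sample_lengthscale_process(X)      # one draw of the lengthscale function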