On PCA and Gaussian process latent variable models

Vidhi Lalchand

05-04-2024

Outline

  • PCA and geometric interpretation 
  • Probabilistic PCA & Kernel PCA
  • Gaussian Process Latent variable models 
  • Applications of GPLVM

 

PCA

As a technique for multivariate analysis, PCA acts as a foundational precursor to most unsupervised learning techniques and structure discovery.

PCA projects high-dimensional data onto the axes of maximum variance - these 'axes' or 'directions' are the eigenvectors of the covariance matrix of the data, also called the 'principal components'.

[Figure: two-dimensional data with axes \(y_{1}\) and \(y_{2}\).]

PCA workflow

The PCs are geometrically orthogonal.

  • Standardize the data and compute the sample covariance matrix as \( S = \dfrac{1}{N}\mathbf{Y}^{T}\mathbf{Y}\).
  • Compute the eigendecomposition \( S = \mathbf{U}\Lambda\mathbf{U}^{T}\), where the matrix \( \mathbf{U}\) stores the eigenvectors column-wise and \( \Lambda\) stores the eigenvalues on the diagonal.

 

  • The magnitudes of the eigenvalues give the variance explained by each PC; dimensionality reduction is achieved by taking the top \(Q\) eigenvectors to form the projection matrix \(\mathbf{W}\) of shape \( D \times Q\).
  • The low-dimensional representation is given by \(\mathbf{X} = \mathbf{Y}\mathbf{W}\) (see the sketch below).
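A minimal numpy sketch of this workflow (illustrative; function and variable names are my own):

import numpy as np

def pca(Y, Q):
    """Project the N x D data matrix Y onto its top-Q principal components."""
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)      # standardise
    S = (Y.T @ Y) / Y.shape[0]                    # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # sort by explained variance, descending
    W = eigvecs[:, order[:Q]]                     # D x Q projection matrix
    explained = eigvals[order[:Q]] / eigvals.sum()
    return Y @ W, W, explained                    # X = Y W is N x Q

# Example: 100 points in 5 dimensions reduced to Q = 2
X, W, explained = pca(np.random.randn(100, 5), Q=2)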

Probabilistic PCA 

\mathbf{y}_n = \mathbf{W} \mathbf{x}_n + \varepsilon_n, \qquad \varepsilon_{n} \sim \mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{D})

We assume a data generation model where the high-dimensional data \(\mathbf{Y} = [\mathbf{y}_{1}, \ldots, \mathbf{y}_{N}]^{T} \in \mathbb{R}^{N \times D}\) are linearly related to low-dimensional latents summarised in \( \mathbf{X} \in \mathbb{R}^{N \times Q}\), with \(Q < D\).

The linear map \(\mathbf{W}\) is a \(D \times Q\) matrix and relates each \(\mathbf{x}_{n}\) to \(\mathbf{y}_{n}\).

Prior over latents:

\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{Q})

The data likelihood:

p(\mathbf{y}_{n}|\mathbf{W},\mathbf{x}_{n}) = \mathcal{N}(\mathbf{W}\mathbf{x}_{n}, \sigma^{2}\mathbf{I}_{D})

The marginal likelihood of the data is obtained by integrating out the latents:

p(\mathbf{y}_n \mid \mathbf{W}) = \int p(\mathbf{y}_{n}|\mathbf{x}_{n},\mathbf{W})p(\mathbf{x}_{n})d\mathbf{x}_{n} = \mathcal{N}(\mathbf{0}, \underbrace{\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}}_{\mathbf{C}})

(using standard Gaussian identities: Gaussians are closed under affine transformations and marginalisation)

\begin{aligned} \mathcal{L}(\mathbf{W}, \sigma^{2}) = \log p(\mathbf{Y} \mid \mathbf{W}) &= \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{W}) \\ &= -\frac{ND}{2} \log 2\pi - \frac{N}{2} \log |\mathbf{C}| - \frac{1}{2} \text{tr}(\mathbf{C}^{-1} \mathbf{Y}^{\top} \mathbf{Y}) \end{aligned}
\hat{\mathbf{W}}, \hat{\sigma^{2}} = \text{argmax}_{\mathbf{W}, \sigma^{2}} \mathcal{L}(\mathbf{W}, \sigma^{2})
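A small numpy sketch that evaluates this log marginal likelihood for a given \(\mathbf{W}\) and \(\sigma^{2}\) (assuming centred data; illustrative only, names are my own):

import numpy as np

def ppca_log_marginal(Y, W, sigma2):
    """log p(Y | W, sigma^2) for centred N x D data, with C = W W^T + sigma^2 I."""
    N, D = Y.shape
    C = W @ W.T + sigma2 * np.eye(D)
    sign, logdetC = np.linalg.slogdet(C)
    quad = np.trace(np.linalg.solve(C, Y.T @ Y))   # tr(C^{-1} Y^T Y)
    return -0.5 * (N * D * np.log(2 * np.pi) + N * logdetC + quad)

# Maximising this over a (D x Q) matrix W and sigma2 > 0 is the PPCA fitting problem.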

Probabilistic PCA 

\hat{\mathbf{W}}, \hat{\sigma^{2}} = \text{argmax}_{\mathbf{W}, \sigma^{2}} \mathcal{L}(\mathbf{W}, \sigma^{2})

It turns out analytical solutions exist:

\hat{\mathbf{W}} = \mathbf{U}_{Q}(\Lambda_{Q} - \sigma^{2}\mathbf{I})^{1/2}\mathbf{R}

where \( \mathbf{U}_{Q}\) holds the top Q eigenvectors of the sample covariance matrix \( \mathbf{S} = \dfrac{1}{N}\mathbf{Y}^{T}\mathbf{Y}\), \( \Lambda_{Q}\) is the diagonal matrix of the corresponding Q eigenvalues, and \( \mathbf{R}\) is an arbitrary rotation matrix.

The posterior of the latent variables can also be calculated in closed form:

p(\mathbf{x}_{n}|\mathbf{y}_{n}) = \dfrac{p(\mathbf{y}_{n}|\mathbf{W},\mathbf{x}_{n})p(\mathbf{x}_{n})}{p(\mathbf{y}_{n}|\mathbf{W})} = \mathcal{N}\left((\mathbf{W}^{T}\mathbf{W} + \sigma^{2}\mathbf{I})^{-1}\mathbf{W}^{T}\mathbf{y}_{n}, \; \sigma^{2}(\mathbf{W}^{T}\mathbf{W} + \sigma^{2}\mathbf{I})^{-1}\right)

 

(Note that regular PCA is recovered as \( \sigma^{2} \rightarrow 0\).)
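A numpy sketch of these closed-form solutions, taking \(\mathbf{R} = \mathbf{I}\) and using the standard PPCA estimate of \(\sigma^{2}\) as the average of the discarded eigenvalues (names are my own):

import numpy as np

def ppca_ml(Y, Q):
    """Maximum-likelihood PPCA with R = I, plus the posterior mean of the latents."""
    N, D = Y.shape
    Y = Y - Y.mean(axis=0)
    S = (Y.T @ Y) / N                              # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[Q:].mean()                        # average of the discarded eigenvalues
    W = U[:, :Q] @ np.diag(np.sqrt(lam[:Q] - sigma2))
    # posterior mean: E[x_n | y_n] = (W^T W + sigma^2 I)^{-1} W^T y_n
    M = W.T @ W + sigma2 * np.eye(Q)
    X_mean = np.linalg.solve(M, W.T @ Y.T).T       # N x Q
    return W, sigma2, X_mean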

Probabilistic PCA \( \rightarrow \) GPLVM

\mathbf{y}_n = \mathbf{W} \mathbf{x}_n + \varepsilon_n, \qquad \varepsilon_{n} \sim \mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{D})

Basic idea: swap what is optimised and what is integrated. The linear map \(\mathbf{W}\) is a \(D \times Q\) matrix and relates each \(\mathbf{x}_{n}\) to \(\mathbf{y}_{n}\).

Probabilistic PCA: integrate out the latents \(\mathbf{X}\) and optimise the projection matrix \( \mathbf{W}\).

Prior over latents:

\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{Q})

p(\mathbf{y}_n \mid \mathbf{W}, \sigma^{2}) = \int p(\mathbf{y}_{n}|\mathbf{x}_{n},\mathbf{W})p(\mathbf{x}_{n})d\mathbf{x}_{n} = \mathcal{N}(\mathbf{0}, \underbrace{\mathbf{W}\mathbf{W}^{\top} + \sigma^2 \mathbf{I}}_{\mathbf{C}})

\mathcal{L}(\mathbf{W}, \sigma^{2}) = \log p(\mathbf{Y} \mid \mathbf{W}) = \sum_{n=1}^{N} \log p(\mathbf{y}_n \mid \mathbf{W}), \qquad \hat{\mathbf{W}}, \hat{\sigma^{2}} = \text{argmax}_{\mathbf{W}, \sigma^{2}} \mathcal{L}(\mathbf{W}, \sigma^{2})

GPLVM route (dual probabilistic PCA): integrate out the projection matrix \( \mathbf{W}\) and optimise the latents \(\mathbf{X}\).

Prior over projection (iid over the rows of \(\mathbf{W}\)):

p(\mathbf{W}) = \prod_{d=1}^{D}\mathcal{N}(\mathbf{w}_{d} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}_{Q})

p(\mathbf{y}_d \mid \mathbf{X}, \sigma^{2}) = \int p(\mathbf{y}_{d}|\mathbf{X},\mathbf{w}_{d})p(\mathbf{w}_{d})d\mathbf{w}_{d} = \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{X}\mathbf{X}^{\top} + \sigma^2 \mathbf{I})

\mathcal{L}(\mathbf{X}, \sigma^{2}) = \log p(\mathbf{Y} \mid \mathbf{X}) = \sum_{d=1}^{D} \log p(\mathbf{y}_d \mid \mathbf{X}), \qquad \hat{\mathbf{X}}, \hat{\sigma^{2}} = \text{argmax}_{\mathbf{X}, \sigma^{2}} \mathcal{L}(\mathbf{X}, \sigma^{2})

Here \(\mathbf{y}_{d}\) denotes the \(d\)-th column of \(\mathbf{Y}\) and \(\mathbf{w}_{d}\) the \(d\)-th row of \(\mathbf{W}\).
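A minimal numpy sketch of the dual objective \(\mathcal{L}(\mathbf{X}, \sigma^{2})\) above (illustrative only; the function and variable names are my own):

import numpy as np

def dual_ppca_log_marginal(Y, X, sigma2, alpha=1.0):
    """log p(Y | X, sigma^2) = sum_d log N(y_d | 0, alpha^{-1} X X^T + sigma^2 I)."""
    N, D = Y.shape
    K = (X @ X.T) / alpha + sigma2 * np.eye(N)     # N x N covariance, shared across dims
    sign, logdetK = np.linalg.slogdet(K)
    quad = np.trace(np.linalg.solve(K, Y @ Y.T))   # tr(K^{-1} Y Y^T)
    return -0.5 * (D * N * np.log(2 * np.pi) + D * logdetK + quad)

Replacing the linear covariance \(\alpha^{-1}\mathbf{X}\mathbf{X}^{\top} + \sigma^{2}\mathbf{I}\) with a non-linear kernel matrix evaluated on \(\mathbf{X}\) gives exactly the GPLVM objective that follows.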

The GPLVM methodology facilitates the use of kernels by non-linearising the map between the latents \(\mathbf{X}\) and the data \(\mathbf{Y}\):

 | Non-Probabilistic | Probabilistic (in mapping)
Linear | PCA | Probabilistic PCA
Non-Linear | Kernel PCA | GP-LVM

The GPLVM is probabilistic kernel PCA with a non-linear mapping from a low-dimensional latent space \( \mathbf{X}\) to a high-dimensional data space \(\mathbf{Y}\).  

Understanding the links

Given: high-dimensional training data \( \mathbf{Y} \equiv \{\bm{y}_{n}\}_{n=1}^{N}, \; \mathbf{Y} \in \mathbb{R}^{N \times D}\) (the observed space).

Learn: a low-dimensional latent space \( \mathbf{X} \equiv \{\bm{x}_{n}\}_{n=1}^{N}, \; \mathbf{X} \in \mathbb{R}^{N \times Q}\).

The latents are mapped to the observed space through a function with a GP prior,

\mathbf{X} \xrightarrow{f}\mathbf{Y}, \qquad f \sim \mathcal{GP}(m,k(x, x^{\prime}))

with one GP per data dimension: each row of \(\mathbf{Y}\) is an observation \(\bm{y}_{n}\), each column \(\bm{y}_{d}\) collects one dimension across all \(N\) points, and

y_{n,d} = f_{d}(\bm{x}_{n}) + \epsilon_{n,d}

p(f_{1:D}|\mathbf{X}) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(f_{d}| 0, \mathbf{K}_{\theta})

GPLVM Workflow

GP prior over mappings (per dimension, \(d\)):

p(f_{1:D}|\mathbf{X}) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(f_{d}| 0, \mathbf{K}_{\theta}), \qquad \bm{y}_{:,d} = f_{d}(\mathbf{X}) + \bm{\epsilon}

(Stacking the function values \(f_{d}(\bm{x}_{n})\) gives an \(N \times D\) matrix \(\mathbf{F}\).) The choice of \( \mathbf{K}_{\theta}\) induces the non-linearity.

GP marginal likelihood (the mappings \(f_{d}\) are integrated out in closed form):

\mathcal{L}(\mathbf{X}, \mathbf{\theta}) = \log \prod_{d=1}^{D}p(\bm{y}_{:,d}|\mathbf{X}) = -\frac{DN}{2} \log 2\pi - \frac{D}{2} \log |\mathbf{K}_{\theta}| - \frac{1}{2} \text{tr}(\mathbf{K}_{\theta}^{-1} \mathbf{Y}\mathbf{Y}^{\top})

Optimisation problem:

\hat{\mathbf{X}}, \hat{\theta} = \text{argmax}_{\mathbf{X}, \theta} \mathcal{L}(\mathbf{X}, \theta)

The GPLVM generalises probabilistic PCA - probabilistic PCA is recovered by using a linear kernel. (See the sketch below.)
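A minimal point-estimate sketch of this optimisation problem, assuming an isotropic RBF kernel plus noise, PCA initialisation of the latents and finite-difference gradients for brevity (mature implementations, e.g. in GPy or GPyTorch, use automatic or analytic gradients and ARD kernels); all names here are my own:

import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, variance, lengthscale):
    """Isotropic RBF kernel on the latent points X (N x Q)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def neg_log_marginal(params, Y, Q):
    """-L(X, theta): GP marginal likelihood summed over the D data dimensions."""
    N, D = Y.shape
    X = params[:N * Q].reshape(N, Q)
    variance, lengthscale, noise = np.exp(params[N * Q:])   # positivity via log-params
    K = rbf_kernel(X, variance, lengthscale) + noise * np.eye(N)
    sign, logdetK = np.linalg.slogdet(K)
    quad = np.trace(np.linalg.solve(K, Y @ Y.T))
    return 0.5 * (D * N * np.log(2 * np.pi) + D * logdetK + quad)

def fit_gplvm(Y, Q=2):
    """Point-estimate GPLVM: initialise X with PCA, then optimise X and theta jointly."""
    N, D = Y.shape
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)        # PCA initialisation
    X0 = Yc @ Vt[:Q].T
    params0 = np.concatenate([X0.ravel(), np.log([1.0, 1.0, 0.1])])
    # gradients are finite-differenced here purely for brevity
    res = minimize(neg_log_marginal, params0, args=(Yc, Q), method="L-BFGS-B")
    X_hat = res.x[:N * Q].reshape(N, Q)
    theta_hat = np.exp(res.x[N * Q:])
    return X_hat, theta_hat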

Feature Maps \( \rightarrow \) Kernels

A kernel \( k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} \) computes an inner product in some high-dimensional embedded space where the embedding is not usually explicitly represented.  

k(x,x') = \langle \phi(x), \phi(x') \rangle
\phi: \mathbb{R}^{N} \rightarrow \mathbb{R}^{M}

where usually, M > N.

It is also possible for the feature space to be infinite-dimensional, as is the case for the Gaussian (RBF) kernel. If one can reason about \( \phi \), one can derive the corresponding kernel by computing an inner product; this is a deterministic operation.

\bm{x} = (x_{1}, x_{2}) \rightarrow \phi(\bm{x}) = (x_{1}^{2}, x_{2}^{2}, \sqrt{2}x_{1}x_{2})
\begin{aligned} k(\bm{x}, \bm{y}) = \phi(\bm{x})^{T}\phi(\bm{y}) &= x_{1}^{2}y_{1}^{2} + x_{2}^{2}y_{2}^{2} + 2x_{1}x_{2}y_{1}y_{2} \\ &= (\bm{x}^{T}\bm{y})^{2} \end{aligned}
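A quick numerical check of this identity (illustrative):

import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(y), (x @ y)**2)   # k(x, y) = (x^T y)^2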

The Gaussian process mapping

[Schematic of a Gaussian process Latent Variable Model: a 2d latent space \( Z \in \mathbb{R}^{N \times Q}\) is mapped via \( f_{d} \sim \mathcal{GP}(0,k_{\theta})\), \(d = 1, \ldots, D\), to the high-dimensional data space \( Y \in \mathbb{R}^{N \times D} \; (= F + \text{noise})\).]

\begin{aligned} k_{f}(\bm{x}, \bm{x}^{\prime}) &= {\color{blue}{\sigma^{2}_{f}}}\exp\left\{-\sum_{q=1}^{Q}\frac{(\bm{x}_{q} - \bm{x}^{\prime}_{q})^{2}}{2{\color{blue}{\ell_{q}^{2}}}}\right\} \\ \end{aligned}

The latents are continuous values, hence the most popular choice of kernel function is the RBF kernel with a lengthscale per latent dimension.

 

The behaviour of the lengthscales during training achieves sparsity: redundant dimensions in the latent space are pruned away as their inverse lengthscales shrink towards zero.
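A sketch of the ARD RBF kernel above, and of how dominant dimensions can be read off from trained lengthscales (the lengthscale values below are hypothetical, not results from the paper):

import numpy as np

def ard_rbf(X1, X2, signal_var, lengthscales):
    """k(x, x') = sigma_f^2 exp(-0.5 * sum_q (x_q - x'_q)^2 / l_q^2)."""
    diff = X1[:, None, :] - X2[None, :, :]                 # (N1, N2, Q)
    sq = np.sum((diff / lengthscales)**2, axis=-1)
    return signal_var * np.exp(-0.5 * sq)

# Dominant latent dimensions = largest inverse lengthscales
lengthscales = np.array([0.7, 45.0, 0.9, 60.0])            # hypothetical trained values
dominant = np.argsort(1.0 / lengthscales)[::-1][:2]        # here: dimensions 0 and 2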


Inference modes

1. Point estimation - Learn a latent point \( \bm{x}_{n}\) per \(\bm{y}_{n}\) using the GP marginal likelihood.

2. MAP - Maximum a posteriori, use a prior over the latents \(p(\bm{x}_{n})\) and learn a point estimate.


3. Variational Inference with an Encoder:

 

p(\mathbf{X}|\mathbf{Y}) \approx q(\mathbf{X}) = \prod_{n=1}^{N}\mathcal{N}( \bm{x}_{n}| G_{\phi_{1}}(\bm{y}_{n}), H_{\phi_{2}}(\bm{y_{n}}) H_{\phi_{2}}(\bm{y_{n}})^{T})
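A sketch of what such an encoder could look like in PyTorch, producing the mean \(G_{\phi_{1}}(\bm{y}_{n})\) and a lower-triangular factor \(H_{\phi_{2}}(\bm{y}_{n})\) per data point; the architecture (one tanh hidden layer of width 64) is an assumption for illustration, not the paper's:

import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Amortised q(x_n) = N(G(y_n), H(y_n) H(y_n)^T) with H lower triangular."""
    def __init__(self, D, Q, hidden=64):      # hidden width is an arbitrary choice
        super().__init__()
        self.Q = Q
        self.trunk = nn.Sequential(nn.Linear(D, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, Q)                   # G_{phi_1}
        self.chol_head = nn.Linear(hidden, Q * (Q + 1) // 2)    # H_{phi_2}

    def forward(self, y):
        h = self.trunk(y)
        mean = self.mean_head(h)
        # fill the lower triangle of H, then make the diagonal positive
        H = torch.zeros(y.shape[0], self.Q, self.Q)
        rows, cols = torch.tril_indices(self.Q, self.Q)
        H[:, rows, cols] = self.chol_head(h)
        diag = nn.functional.softplus(torch.diagonal(H, dim1=-2, dim2=-1))
        H = H.tril(-1) + torch.diag_embed(diag)
        return torch.distributions.MultivariateNormal(mean, scale_tril=H)

# Usage: reparameterised samples of the latents feed the GP likelihood term of the ELBO
encoder = LatentEncoder(D=12, Q=10)
qx = encoder(torch.randn(5, 12))
x_samples = qx.rsample()                     # shape (5, 10)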

GPLVM Applications: Dimensionality reduction / Clustering

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

Data: Multiphase oil-flow data consisting of 1000 twelve-dimensional observations belonging to three known classes corresponding to different phases of oil flow.

Plot: showing the two dimensions corresponding to the highest inverse lengthscales.

The probabilistic method (which learns a latent distribution per data point, 3rd row) switches off 7 out of 10 latent dimensions - the point estimation methods are not able to achieve that pruning.

[Figure rows: point estimate, MAP (point estimate with a prior), Bayesian (variational inference); columns: dominant latent dimensions, inverse lengthscale per dimension, ELBO loss.]

GPLVM Applications: Dimensionality reduction / Clustering

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

Data: qPCR data for 48 genes obtained from mice. The data contain 48 columns, each corresponding to (normalized) measurements of one gene. Cells differentiate during their development and these data were obtained at various stages of development, labelled from the 1-cell stage to the 64-cell stage.

[Figure rows: point estimate, MAP (point estimate with a prior), Bayesian; columns: dominant latent dimensions, inverse lengthscale per dimension, ELBO loss, benchmark reconstruction.]

For the 32-cell stage, the data is further differentiated into ‘trophectoderm’ (TE) and ‘inner cell mass’ (ICM). ICM further differentiates into ‘epiblast’ (EPI) and ‘primitive endoderm’ (PE) at the 64-cell stage.

GPLVM Applications: Learning from partially complete / masked data

Training data: MNIST images with masked pixels.

[Figure: masked training images (30% and 60% masking) and their reconstructions; latent space of a model trained with ~40% missing pixels.]

The model achieves disentanglement in the two dominant latent dimensions.

Note: this is different to tasks where missing data is only introduced at test time.

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

GPLVM Applications: Learning from partially complete / masked data

Data: Frey Faces, taken from a video sequence consisting of 1965 images of size 28×20.

[Figure panels: ground truth, test data, reconstruction.]

Vidhi Lalchand, Aditya Ravuri, Neil D. Lawrence. Generalised GPLVM with Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

GPLVM Applications: Learning dynamical data

Summary

  • The GPLVM suite of methods is an alternative benchmark for latent variable learning, dimensionality reduction and working with missing data.
  • The model is versatile enough to be used in conjunction with:
    • Non-Gaussian likelihoods.
    • Missing data - both at training and test time.
    • Data-driven / informative priors in latent space.
  • Interpretable uncertainty - high bias correlates with high predictive variance.
  • Robust to overfitting - redundant dimensions are pruned during training as the inverse lengthscales \(1/\ell_{d} \rightarrow 0.\)

Thank you!

Learning from massively missing data: Movie ratings

Method | PMF | BiasMF | NNMF | GPLVM

MovieLens100K | 0.952 | 0.911 | 0.903 | 0.924 

Test reconstruction accuracy

Interpretable uncertainty: The confidence in predictions is higher when there is more data about a user.

Data: 943 users, 1638 movies (avg user has rated approx 20 movies)
