Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs

Vidhi Lalchand*, Aditya Ravuri*, Emma Dann*, Natsuhiko Kumasaka, Dinithi Sumanaweera, Rik G.H. Lindeboom, Shaista Madad, Sarah A. Teichmann, Neil D. Lawrence

Machine Learning in Computational Biology (MLCB), 2022

Motivation

  • Single-cell RNA-seq datasets are growing in size and complexity, enabling the study of cellular composition in various biological and clinical contexts.

  • Dimension reduction techniques are an essential precursor to downstream tasks like cell type clustering, pseudotime estimation and sub-population identification.

  • Models that account for technical and biological confounders (e.g. batch effect, inter-individual variation, proliferation signatures) are required.

Gaussian Process Latent Variable Models (GPLVMs) for single-cell data

  • Trajectory analysis (GrandPrix, Ahmed et al. 2019)

  • Exploratory analysis (Verma & Engelhardt, 2022)

  • Spatio-temporal modelling (MEFISTO, Velten et al. 2022)


Limitations: poor scalability to large datasets and no explicit modelling of confounders

  • GPs can be used in unsupervised settings by learning a non-linear, probabilistic mapping from the latent space \( X \) to the data space \( Y \).
  • We assume the inputs \( X \) are latent (unobserved).

 

Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times D}\)

Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)
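To make the mapping concrete, the sketch below (a minimal NumPy illustration, not the authors' implementation) draws each of the \( D \) genes as an independent GP over the same latent coordinates \( X \); the RBF kernel, the sizes and the noise level are placeholder assumptions.

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel on latent coordinates.
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

N, D, Q = 200, 1000, 2                # cells, genes, latent dimensions (illustrative)
X = np.random.randn(N, Q)             # latent coordinates, unobserved in practice
K = rbf_kernel(X, X) + 1e-5 * np.eye(N)
L = np.linalg.cholesky(K)
F = L @ np.random.randn(N, D)         # D independent GP function draws over the same X
Y = F + 0.1 * np.random.randn(N, D)   # noisy N x D expression-like observations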

[Figure: a visualisation of the latent space model, in which a 2D latent space (each point represents a cell) is mapped to the high-dimensional data space (an N x D cell-by-gene matrix of expression counts). Application: understanding cellular similarities from a single-cell gene expression matrix.]

The Model: The Kernel function 

The inductive biases of the GP mapping are controlled by a kernel function.

[Figure: D independent Gaussian processes map the low-dimensional latent space (Q dimensions) to the high-dimensional data space (D dimensions), i.e. the N x D expression matrix.]

\( f_{d} \sim \mathcal{GP}(0, k_{f})\)

\begin{aligned} k_{f}(\bm{x}, \bm{x}^{\prime}) &= {\color{blue}{\sigma^{2}_{f}}}\exp\left\{\frac{-2\sin^{2}(|\bm{x}_{1} - \bm{x}_{1}^{\prime}|/2)} {\color{blue}{{l_{1}^{2}}}} \right\} \times\exp\left\{-\sum_{q=2}^{Q}\frac{(\bm{x}_{q} - \bm{x}^{\prime}_{q})^{2}}{2{\color{blue}{l_{q}^{2}}}}\right\} \\ &= k_{per} \times k_{rbf} \end{aligned}
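As a hedged illustration, the product kernel above could be computed as follows (a NumPy sketch; the periodic component acting on the first latent dimension and the RBF on the remaining dimensions follows the equation, while the function name k_f and the default hyperparameter values are illustrative):

import numpy as np

def k_f(X1, X2, sigma_f=1.0, lengthscales=(1.0, 1.0)):
    # Periodic kernel on latent dimension 1 times an RBF kernel on dimensions 2..Q.
    ls = np.asarray(lengthscales)
    # periodic component, e.g. capturing a cyclic process such as the cell cycle
    d1 = np.abs(X1[:, None, 0] - X2[None, :, 0])
    k_per = np.exp(-2.0 * np.sin(d1 / 2.0) ** 2 / ls[0] ** 2)
    # RBF component on the remaining latent dimensions
    sq = ((X1[:, None, 1:] - X2[None, :, 1:]) ** 2 / (2.0 * ls[1:] ** 2)).sum(-1)
    k_rbf = np.exp(-sq)
    return sigma_f ** 2 * k_per * k_rbf

X = np.random.randn(100, 2)   # 100 cells in a Q = 2 latent space
K_nn = k_f(X, X)              # (100, 100) covariance over cells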

The Model: Augmented Kernel Function 

Y = \underbrace{F}_{\textrm{\tiny{Sparse GP}}} + \underbrace{\Phi}_{\textrm{\tiny{design matrix}}}\times\underbrace{B}_{\textrm{\tiny{random effects}}} + \underbrace{\bm{\epsilon}}_{\textrm{\tiny{noise model}}}

F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \\ \end{bmatrix}_{N \times D}

\begin{aligned} \tilde{F} &= F + \Phi B \\ \mathbb{E}(\tilde{\bm{f}}_{d}) &= \mathbb{E}(\bm{f}_{d}) + \mathbb{E}(\Phi B_{d}) = {\color{blue}{\mu_{f}}}\mathbb{I}_{N} + \Phi{\color{blue}{\zeta_{d}}} \\ \textrm{Cov}(\tilde{\bm{f}}_{d}) &= \textrm{Cov}(\bm{f}_{d}) + \textrm{Cov}(\Phi B_{d}) = {\color{blue}{K_{nn}}} + {\color{blue}{\nu}}\Phi\Phi^{T} \end{aligned}

where we assume a constant mean \(\mu_{f} \in \mathbb{R}\) for the \(\bm{f}\) process, the design matrix \(\Phi\) of observed covariates is given, and \(\zeta_{d}\) encapsulates the mean of the random effects \(B\).

The expression matrix \(Y\) is driven by the joint process \(\tilde{F}\), whose columns \(\tilde{\bm{f}}_{d}\) are distributed as individual Gaussian processes.
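A small NumPy sketch of the augmented moments (illustrative only: a plain RBF stands in for \( k_{f} \), and the shapes, \( \nu \) and \( \zeta_{d} \) values are placeholders). The design matrix shifts the GP mean by \( \Phi\zeta_{d} \) and adds a low-rank term \( \nu\Phi\Phi^{T} \) to the covariance.

import numpy as np

N, Q, P = 300, 2, 4                 # cells, latent dims, covariates (illustrative)
X = np.random.randn(N, Q)           # latent coordinates
Phi = np.random.randn(N, P)         # design matrix of observed covariates (metadata)

# Stand-in for K_nn = k_f(X, X); any valid kernel could be used here.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_nn = np.exp(-0.5 * sq)

mu_f, nu = 0.0, 0.5                 # constant GP mean, random-effect variance (placeholders)
zeta_d = np.random.randn(P)         # mean of the random effects for gene d

mean_d = mu_f * np.ones(N) + Phi @ zeta_d   # E[f~_d] = mu_f 1_N + Phi zeta_d
cov_d = K_nn + nu * Phi @ Phi.T             # Cov[f~_d] = K_nn + nu Phi Phi^T
f_tilde_d = np.random.multivariate_normal(mean_d, cov_d + 1e-6 * np.eye(N))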


The Model: Stochastic Variational Inference 

Canonical GP prior: f_{d} \sim \mathcal{N}(\mu_{f}\mathbb{I}_{N}, K_{nn})

Augmented GP prior: \tilde{f}_{d} \sim \mathcal{N}(\mu_{f}\mathbb{I}_{N} + \Phi\zeta_{d}, K_{nn} + \nu\Phi\Phi^{T})

p(\tilde{F}) = \prod_{d=1}^{D}p(\tilde{\bm{f}}_{d}) = \prod_{d=1}^{D}\mathcal{N}(\mu_{f}\mathbb{I}_{N} + \Phi\bm{\zeta}_{d}, K_{nn} + \nu\Phi\Phi^{T})

The augmented kernel formulation allows us to derive an objective (evidence lower bound) which factorises across both cells (\(N\)) and genes (\( D\)).

*Details about the derivation of the objective are in the paper

While not converged do
  • Choose a random mini-batch \( Y_{B} \subset Y \) of \( B \) cells
  • Form a mini-batch estimate of the ELBO: \( \log p(Y) \geq \mathcal{L}(Y_{B}) = \dfrac{N}{B}\left(\sum_{b}\sum_{d}\mathcal{L}_{b,d}\right) \) + terms*
  • Gradient step: all hyperparameters and latent variables \( \longrightarrow \textrm{optim}(\mathcal{L}(Y_{B})) \)
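To show the shape of this loop only, here is a heavily simplified PyTorch sketch: a linear decoder stands in for the sparse-GP ELBO terms \( \mathcal{L}_{b,d} \) (whose derivation is in the paper), and the mini-batch estimate is rescaled by \( N/B \) before a joint gradient step on all parameters.

import torch

torch.manual_seed(0)
N, D, Q, B = 1000, 50, 2, 128
Y = torch.randn(N, D)                            # placeholder expression matrix
X = torch.nn.Parameter(torch.randn(N, Q))        # latent coordinates (learned)
W = torch.nn.Parameter(0.1 * torch.randn(Q, D))  # stand-in decoder weights (not a GP)
log_noise = torch.nn.Parameter(torch.zeros(()))

opt = torch.optim.Adam([X, W, log_noise], lr=1e-2)
for step in range(2000):                   # "while not converged"
    idx = torch.randint(0, N, (B,))        # choose a random mini-batch of B cells
    recon = X[idx] @ W                     # simplified decoder in place of the sparse GP
    # per-cell, per-gene Gaussian log-likelihood terms, summed over the mini-batch
    ll = -0.5 * (((Y[idx] - recon) ** 2) / log_noise.exp() + log_noise).sum()
    loss = -(N / B) * ll                   # rescale the mini-batch estimate by N / B
    opt.zero_grad()
    loss.backward()
    opt.step()

The real objective replaces the Gaussian reconstruction term with the sparse-GP ELBO terms, but the mini-batch rescaling and the joint update of hyperparameters and latent variables proceed as in the pseudocode above.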

Application 1: Reproducing innate immunity analysis 

Kumasaka et al. (2021) Mapping interindividual dynamics of innate immune response at single-cell resolution. bioRxiv


Runtime comparison

Additive GPLVM (Kumasaka et al.): 4.5 hrs

Augmented GPLVM: 30 mins

Augmented kernel disentangles cell cycle and treatment effects

[Figure: latent dimensions coloured by cell cycle phase (G1, S and G2 phases), treatment, and the latent batch effect.]

Application 2: COVID-19 scRNA-seq cohort

Data: Stephenson et al. (2021) Single-cell multi-omics analysis of the immune response in COVID-19. Nature Medicine

54,941 cells, 130 patients

[Figure: latent embeddings of the cohort from PCA, scVI and the augmented GPLVM.]

The augmented GPLVM learns interpretable latent dimensions

[Figure: latent dimensions from scVI and the augmented GPLVM, including correlation to a platelet differentiation signature.]

Modelling biological variation: COVID-19 severity


Latent severity captures variation in days since onset of symptoms 

R2 = 0.25, p-val < 2e-16

GPLVM generative model recovers signatures of COVID-19 severity

[Figure: reported severity alongside the top perturbed genes (highest variance upon perturbation), grouped into cell type markers, viral entry factors and interferon response genes.]
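The "highest variance upon perturbation" idea could be sketched as follows (a hypothetical NumPy illustration; decode, severity_dim and all shapes are placeholder assumptions, not the paper's pipeline): hold a reference cell fixed, sweep only the latent severity coordinate, push each perturbed point through the generative decoder, and rank genes by the variance of their predicted expression.

import numpy as np

def decode(X_latent, W):
    # Placeholder linear decoder standing in for the GPLVM's generative mapping.
    return X_latent @ W

Q, D = 5, 2000                        # latent dims, genes (illustrative)
W = np.random.randn(Q, D)             # placeholder decoder weights
x_ref = np.zeros(Q)                   # a reference cell in latent space
severity_dim = 0                      # assumed index of the latent severity dimension

grid = np.linspace(-2.0, 2.0, 25)     # perturbation values along the severity axis
X_pert = np.tile(x_ref, (grid.size, 1))
X_pert[:, severity_dim] = grid        # perturb only the severity coordinate
E = decode(X_pert, W)                 # predicted expression for each perturbation
per_gene_var = E.var(axis=0)          # variance upon perturbation, per gene
top_genes = np.argsort(per_gene_var)[::-1][:20]   # indices of the most perturbed genes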

Summary

  • We introduce an augmented kernel function that jointly models known and unknown technical and biological covariates in scRNA-seq datasets

  • The formulation is amenable to SVI, which enables application to complex cohort studies

  • Next steps:

    • Further scale-up with amortised inference

    • Modelling complex covariance structure (e.g. genotype effects for eQTL analysis)   

Acknowledgements

Neil Lawrence

Vidhi Lalchand

Aditya Ravuri

Sarah Teichmann

Emma Dann

Dinithi Sumanaweera

Rik Lindeboom

Shaista Madad

 

MLCB 2022

By Vidhi Lalchand
