A Multi-modal Generative Model for Quasar Spectra based on Gaussian Processes

Vidhi Lalchand

24-03-2025

Astro Data Science Reading Group at the Kavli Institute


Smith MJ, Geach JE. Astronomia ex machina: a history, primer and outlook on neural networks in astronomy. Royal Society Open Science. 2023 May 31;10(5):221454.

 

Generative models are powerful tools for embedding discrete objects into a continuous space, thereby allowing one to simulate them.

Latent space

The Basic Notion of Generative Models

In (a), we see a generative model attempting to learn the probability distribution over the latent representation of a dataset containing galaxies and stars. In (b), we see a discriminative model attempting to learn the boundary that separates the star and galaxy classes.


The Basic Notion of Generative Models

Interpolating between the continuous representation of astronomical objects in latent space

Typical autoencoder style architecture of generative models

Motivation 

Aim: To build a joint generative model for quasar spectra and their black hole engines using a non-parametric modelling framework based on Gaussian processes.

For each quasar we have two observation spaces:

Observation space 1: the spectrum.

Observation space 2: the scientific labels [Lbol, Bhm, Eddington ratio, Redshift].

  • The astrophysical application we consider is to predict physical properties of quasars like their black hole mass and other scientific attributes based on their spectral features alone.

\(N\) = ~23,000 quasars, \(D\) = 590 (spectral pixels),  \(L\) = 4 (scientific labels)

  • Understanding the formation, growth and evolution of quasars across cosmic time is an important goal of modern cosmology.

Data:

Dataset & Modelling Challenges

- Heteroscedasticity: The dataset has been constructed by combining measurements from different instruments / telescopes, so the noise properties vary from object to object and pixel to pixel. Moreover, because the quasars lie at different redshifts, they cover slightly different rest-frame wavelength ranges.

- Missing spectral regions: The rectangular dataset is prepared by shifting the quasars into rest-frame wavelength space and re-binning onto a common wavelength grid with a fixed pixel scale. Any unobserved pixels are set to NaN.

 

 

We need a probabilistic model which can handle these challenges in a principled and rigorous way.
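For intuition only, and not necessarily how the model in this talk treats missing data: one principled option is to evaluate the likelihood only over observed (non-NaN) pixels. A minimal NumPy sketch with an independent Gaussian noise model per pixel (function and variable names are illustrative):

```python
import numpy as np

def masked_gaussian_loglik(y_obs, y_pred, noise_var):
    """Gaussian log-likelihood of a spectrum, summed only over observed (non-NaN) pixels."""
    mask = ~np.isnan(y_obs)                       # True where the pixel was actually observed
    resid = y_obs[mask] - y_pred[mask]
    return -0.5 * np.sum(resid**2 / noise_var + np.log(2 * np.pi * noise_var))
```

Unobserved pixels simply drop out of the objective instead of being imputed with arbitrary values.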

[Schematic: hidden/latent variables \( Z \) (one per data point, \( N \times Q \)) are passed through a non-linear mapping (decoder) to produce the observed spectra \( X \) (\( N \times D \)) and labels \( Y \) (\( N \times L \)).]

Gaussian Processes & Latent Variable Models

Gaussian processes are a powerful non-parametric paradigm for performing probabilistic regression.

  • They are probabilistic \( \rightarrow\) user has a sense of prediction uncertainty.
  • They don't have standard parameters \( \rightarrow \) they model the mapping \( f \) directly by placing a prior in the space of functions!

We need to understand the notion of distribution over functions.

What is a GP?

A sample from a \(k\)-dimensional Gaussian \( \mathbf{x} \sim \mathcal{N}(\mu, \Sigma) \) is a vector of size \(k\).

 

$$ \mathbf{x} = [x_{1}, \ldots, x_{k}] $$

The mathematical crux of a GP is that \( [f(x_{1}), f(x_{2}), \ldots, f(x_{n})]\) is just an \(n\)-dimensional multivariate Gaussian \( \mathcal{N}(\mu, K) \).

\begin{bmatrix} f_{1} \\ \vdots\\ f_{499} \\ f_{500} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_{1} \\ \vdots\\ \mu_{499} \\ \mu_{500} \end{bmatrix}, \begin{bmatrix} k(x_{1}, x_{1}) & \ldots & k(x_{1}, x_{500}) \\ \vdots & \ddots & \vdots \\ k(x_{500}, x_{1}) & \ldots & k(x_{500}, x_{500}) \end{bmatrix} \right)

A GP is an infinite-dimensional analogue of a Gaussian distribution \( \rightarrow \) is a sample from it a vector of infinite length?

f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime}))

But at any given point, we only need to represent our function \( f(x) \) at a finite index set \( \mathcal{X} = [x_{1},\ldots, x_{500}]\). So we are interested in our long function vector \( [f(x_{1}), f(x_{2}), f(x_{3}),....., f(x_{500})]\).

Function samples from a 1d GP

f(x) \sim \mathcal{GP}(m(x),k(x, x^{\prime}))

The kernel function \( k(x,x')\) is the heart of a GP: it controls all of the inductive biases in our function space, such as shape, periodicity and smoothness.

prior over functions \( \rightarrow \)

Sample draws from a zero mean GP prior under different kernel functions.

In reality, they are just draws from a multivariate Gaussian \( \mathcal{N}(0, K)\) where the covariance matrix has been evaluated by applying the kernel function to all pairs of data points.
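To make this concrete, here is a minimal NumPy sketch (illustrative, not code from the talk) of drawing such function samples: evaluate an RBF kernel on a grid of inputs and sample from the resulting multivariate Gaussian \( \mathcal{N}(0, K) \).

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel evaluated on all pairs of 1d inputs."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# A finite index set of 500 grid points stands in for the "infinite" function.
x = np.linspace(0.0, 10.0, 500)
K = rbf_kernel(x, x, lengthscale=1.5)

# A GP prior draw is just a sample from N(0, K) on this grid.
jitter = 1e-8 * np.eye(len(x))             # numerical stability for the Cholesky factor
L = np.linalg.cholesky(K + jitter)
samples = L @ np.random.randn(len(x), 3)   # three function draws, shape (500, 3)
```

Swapping in a periodic or Matérn kernel changes the character (periodicity, smoothness) of the sampled functions without changing anything else.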

Gaussian Processes & Regression

1. Given some noisy data \( \bm{y} = \lbrace{y_{i}}\rbrace_{i=1}^{N} \) at \( X = \{ x_{i}\}_{i=1}^N\) input locations.

2. You model your data as coming from a hidden function \( f\) corrupted by Gaussian noise.  

$$ \bm{y} = f(X) + \epsilon, \hspace{10pt} \epsilon \sim \mathcal{N}(0, \sigma^{2})$$

Data Likelihood: \( \hspace{10pt}  y|f \sim \mathcal{N}(f(x), \sigma^{2}) \)

Prior over functions: \( f|\theta \sim \mathcal{GP}(0, k_{\theta}) \)

(The choice of kernel function \( k_{\theta}\) controls how your functions space looks.)

\rightarrow
\begin{aligned} \bm{f} &\sim \mathcal{N}(\bm{0}, K_{\theta}) \\ K_{i,j} &= k_{\theta}(x_{i}, x_{j}) \hspace{2mm} \forall i, j \end{aligned}
\begin{aligned} p(f_{*}|y, X_{*}, X, \theta_{*}) &= \mathcal{N}( \mu_{*}, \hspace{5pt} \Sigma_{*}) \\ \mu_{*} &= K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}y \\ \Sigma_{*} &= K_{**} - K_{*}(K_{\theta} + \sigma^{2}_{n}\mathbb{I})^{-1}K_{*}^{T} \end{aligned}

Learning Step in Regression:

 

\begin{aligned} p(\bm{y}|\bm{\theta}) &= \int p(\bm{y}|\bm{f})p(f|\bm{\theta})d\bm{f}\\ &= \int \mathcal{N}(\bm{f}, \sigma_{n}^{2}\mathbb{I})\mathcal{N}(\bm{0}, K_{\theta})d\bm{f} \\ &= \mathcal{N}(0, K_{\theta} + \sigma^{2}_{n}\mathbb{I}) \end{aligned}
\bm{\theta_{*}} = \argmax_{\bm{\theta}}\log p(\bm{y}|\bm{\theta},X)

Learning in Gaussian process models occurs through the maximisation of the marginal likelihood w.r.t the kernel hyperparameters.

Data likelihood

Prior
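Putting the predictive equations and the marginal likelihood together, here is a hedged NumPy sketch of exact GP regression (names are my own, not from the talk); the kernel hyperparameters and noise variance would be tuned by maximising the returned log marginal likelihood.

```python
import numpy as np

def gp_regression(X, y, X_star, kernel, noise_var):
    """Exact GP regression: posterior mean/covariance at X_star and the log marginal likelihood."""
    N = len(X)
    K = kernel(X, X) + noise_var * np.eye(N)     # K_theta + sigma_n^2 I
    K_s = kernel(X_star, X)                      # K_*
    K_ss = kernel(X_star, X_star)                # K_**

    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_theta + sigma_n^2 I)^{-1} y

    mu_star = K_s @ alpha                                  # posterior mean
    v = np.linalg.solve(L, K_s.T)
    Sigma_star = K_ss - v.T @ v                            # posterior covariance

    # log p(y | theta) = -1/2 y^T alpha - sum(log diag(L)) - N/2 log(2 pi)
    log_marginal = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)
    return mu_star, Sigma_star, log_marginal
```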

Learning Step (when X is hidden/ latent):

 

X_{*}, \bm{\theta_{*}} = \argmax_{X, \bm{\theta}}\log p(\bm{y}|\bm{\theta},X)

Extending GPs to the latent variable set-up introduces two fundamental changes:

Supervised Learning in a Gaussian Process

1. We need to account for the multi-output space: the targets are not scalars but \(D\)-dimensional vectors.

2. The inputs (assumed to be fixed in regression) are hidden and need to be learnt.

Gaussian Processes for unsupervised learning

Gaussian processes can also be used in contexts where the observations are a gigantic data matrix \( Y \equiv \{ y_{n}\}_{n=1}^{N}, y_{n} \in \mathbb{R}^{D}\). \(D\) can be pretty big \(\approx 1000s\).

Imagine a stack of images, where each image has been flattened into a vector of pixels and the vectors stacked together row-wise in a matrix.

[Example: a stack of \(28 \times 28\) images, each flattened into a vector of \(d = 784\) pixels; \(n\) = number of images.]

Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times D}\)

Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)

\( Q << D\)


\mathcal{L}(\mathbf{X}, \mathbf{\theta}) = \log \prod_{d=1}^{D}p(\bm{y}_{:,d}|\mathbf{X}) = -\frac{DN}{2} \log 2\pi - \frac{D}{2} \log |\mathbf{K}_{\theta}| - \frac{1}{2} \text{tr}(\mathbf{K}_{\theta}^{-1} \mathbf{Y}\mathbf{Y}^{\top})

GP prior over mappings 

(per dimension, \(d\))

p(f_{1:D}|X) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(f_{d}| 0, \mathbf{K}_{\theta})
\begin{bmatrix} f_{1}(\bm{x}_{1}) & f_{2}(\bm{x}_{1}) & \ldots & f_{D}(\bm{x}_{1}) \\ \vdots & \vdots & \ddots & \vdots \\ f_{1}(\bm{x}_{N}) & f_{2}(\bm{x}_{N}) & \ldots & f_{D}(\bm{x}_{N}) \\ \end{bmatrix}_{N \times D}
\rightarrow
\bm{y}_{:,d} = f_{d}(\mathbf{X}) + \bm{\epsilon} \\

Choice of \( \mathbf{K}\) induces non-linearity

GP marginal likelihood

\hat{\mathbf{X}}, \hat{\theta} = \text{argmax}_{\mathbf{X}, \theta} \mathcal{L}(\mathbf{X}, \theta)
\rightarrow

Optimisation problem:
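A minimal NumPy sketch of evaluating the objective \( \mathcal{L}(\mathbf{X}, \theta) \) above (illustrative only; in practice \( \mathbf{X} \) and \( \theta \) are optimised jointly with automatic differentiation):

```python
import numpy as np

def gplvm_log_marginal(X_latent, Y, kernel, noise_var):
    """GPLVM marginal likelihood: D independent GPs over the columns of Y sharing the kernel K(X)."""
    N, D = Y.shape
    K = kernel(X_latent, X_latent) + noise_var * np.eye(N)   # K_theta (+ noise on the diagonal)
    L = np.linalg.cholesky(K)
    Kinv_Y = np.linalg.solve(L.T, np.linalg.solve(L, Y))     # K^{-1} Y

    log_det = 2.0 * np.sum(np.log(np.diag(L)))               # log |K|
    return (-0.5 * D * N * np.log(2 * np.pi)
            - 0.5 * D * log_det
            - 0.5 * np.trace(Y.T @ Kinv_Y))                  # tr(K^{-1} Y Y^T)
```

The latent coordinates enter only through the kernel matrix, which is why the same expression serves both as the regression marginal likelihood and as the unsupervised objective.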

The Gaussian process decoder

[Schematic: the 2d latent space \( X \in \mathbb{R}^{N \times Q}\) is decoded into latent function values \( F \in \mathbb{R}^{N \times D}\) and hence the high-dimensional data space \( Y \in \mathbb{R}^{N \times D}\).]

Gaussian Process Latent Variable Model


\begin{aligned} p(Y|f_{1:D}, X) &= \displaystyle \prod_{n=1}^{N}\prod_{d=1}^{D}\mathcal{N}(\bm{y}_{n,d}; f_{d}(\bm{x}_{n}), \sigma^{2}_{n}) \end{aligned}

Data Likelihood:

p(f_{1:D}|X) = \displaystyle \prod_{d=1}^{D}\mathcal{N}(f_{d}; 0, K_{d})

Prior structure:

F = \begin{bmatrix} \vdots & \vdots & \ldots & \ldots & \vdots \\ f_{1} & f_{2} & \ldots & \ldots & f_{D} \\ \vdots & \vdots & \ldots & \ldots & \vdots \\ \end{bmatrix}_{N \times D}

The data are stacked row-wise but modelled column-wise, each column with a GP.

[Schematic: the data matrix \( Y \in \mathbb{R}^{N \times D}\), with one row \( \bm{y}_{n}\) per object (paired with its latent \( \bm{x}_{n} \in X\)) and one column \( \bm{y}_{d}\) per output dimension; entry \( y_{n,d}\).]

Given: High dimensional training data \( Y \equiv \{\bm{y}_{n}\}_{n=1}^{N},  Y \in \mathbb{R}^{N \times D}\)

Learn: Low dimensional latent space \( X \equiv \{\bm{x}_{n}\}_{n=1}^{N}, X \in \mathbb{R}^{N \times Q}\)

(Q << D)

Observed space

Gaussian Process Mapping

Model:

p(X) = \displaystyle \prod_{n=1}^{N}\displaystyle \mathcal{N}(\bm{x}_{n};\bm{0}, \mathbb{I}_{Q})
\bm{y}_{n,d} = f_{d}(\bm{x}_{n}) + \bm{\epsilon}_{n} \\

The Gaussian process mapping

[Schematic: the 2d latent space \( X \in \mathbb{R}^{N \times Q}\) is mapped by the GPs \( f_{d} \sim \mathcal{GP}(0,k_{\theta})\) to the high-dimensional data space \( Y \in \mathbb{R}^{N \times D}\) \((= F + \text{noise})\).]

Role of the Kernel Function

\begin{aligned} k_{f}(\bm{x}, \bm{x}^{\prime}) &= {\color{blue}{\sigma^{2}_{f}}}\exp\left\{-\sum_{q=1}^{Q}\frac{(\bm{x}_{q} - \bm{x}^{\prime}_{q})^{2}}{2{\color{blue}{\ell_{q}^{2}}}}\right\} \\ \end{aligned}

The latents are continuous values; hence, the most popular choice of kernel function is the RBF kernel with a lengthscale per dimension.

 

The behaviour of the lengthscales during training achieves sparsity: redundant dimensions in the latent space are pruned away.
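For illustration, a NumPy sketch of the ARD (automatic relevance determination) RBF kernel written above, with one lengthscale per latent dimension; this is a generic implementation, not the talk's code.

```python
import numpy as np

def ard_rbf_kernel(X1, X2, lengthscales, signal_var=1.0):
    """RBF kernel with a separate lengthscale per latent dimension (ARD)."""
    Z1 = X1 / lengthscales              # scale each latent dimension, shape (N1, Q)
    Z2 = X2 / lengthscales              # shape (N2, Q)
    sqdist = (np.sum(Z1**2, axis=1)[:, None]
              + np.sum(Z2**2, axis=1)[None, :]
              - 2.0 * Z1 @ Z2.T)
    return signal_var * np.exp(-0.5 * np.clip(sqdist, 0.0, None))
```

A dimension \( q \) whose lengthscale \( \ell_{q} \) grows very large contributes \((\bm{x}_{q} - \bm{x}^{\prime}_{q})^{2}/\ell_{q}^{2} \approx 0\) to the exponent, so the kernel, and hence the model, effectively ignores it; this is the pruning behaviour described above.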

Parametric VAEs

Probabilistic Latent Variable Models 

Auto-encoded GPLVM

Schematic & Graphical model


Mathematical Framework

\text{log }p(X,Y|\theta_{x}, \theta_{y}, {\color{blue}{Z}}) = \sum_{d=1}^{D}\text{log } p(\bm{x}_{d}|\theta_{x}, {\color{blue}{Z}}) + \sum_{l=1}^{L}\text{log } p(\bm{y}_{l}|\theta_{y}, {\color{blue}{Z}})

Crux: We use two groups of GPs to model the spectra and the labels, however, they share the same latent space. 

Optimisation objective:

f_{d} \sim \mathcal{GP}(0,k_{\theta_{x}}) \hspace{10mm} f_{l} \sim \mathcal{GP}(0,k_{\theta_{y}})

Nice compartmentalisation of the GP marginal likelihood
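Because the objective splits into two GP marginal likelihoods coupled only through the shared latents \( Z \), a sketch of the joint objective could look like the following (illustrative; it reuses the gplvm_log_marginal sketch from earlier, and the kernel/noise arguments are hypothetical stand-ins for \( \theta_{x}, \theta_{y} \)):

```python
def joint_log_marginal(Z, X_spectra, Y_labels,
                       kernel_x, noise_x, kernel_y, noise_y):
    """Shared-latent objective: spectra GPs + label GPs, both conditioned on the same Z."""
    ll_spectra = gplvm_log_marginal(Z, X_spectra, kernel_x, noise_x)  # D = 590 spectral pixels
    ll_labels  = gplvm_log_marginal(Z, Y_labels,  kernel_y, noise_y)  # L = 4 scientific labels
    return ll_spectra + ll_labels   # maximised jointly w.r.t. Z, theta_x, theta_y
```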

Training Framework

Prediction Framework

Experimental Results

Reconstructing Spectra & Missing Regions of the Spectra

Reconstructing unseen Spectra

higher uncertainty bands at higher wavelengths (epistemic uncertainty)

Quantitatively, we measure the reconstruction accuracy with the error: the deviation of the predicted spectrum (red line) from the observed spectrum (blue).

 

The log predictive density captures how well the ground truth is contained within the uncertainty bounds.
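As a hedged example of this metric (my own helper, not the talk's code), the average Gaussian log predictive density can be computed from the per-pixel predictive means and variances:

```python
import numpy as np

def gaussian_log_predictive_density(y_true, mu_pred, var_pred):
    """Average Gaussian log density of the ground truth under the model's predictive marginals."""
    return np.mean(-0.5 * np.log(2 * np.pi * var_pred)
                   - 0.5 * (y_true - mu_pred)**2 / var_pred)
```

Higher values mean the ground truth sits comfortably inside the predictive uncertainty; the quantitative results later report the negative of this quantity, where lower is better.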

\([f_{1}(z_{i}), f_{2}(z_{i}), f_{3}(z_{i}), \ldots, f_{D}(z_{i})]\)

Prediction for object \(i\)

Evaluation of \(D\) GPs at the latent vector \( \color{blue}z_{i}\)

X_{partial}^{*} \longrightarrow Z^{*} \longrightarrow X_{est}^{*}

Reconstructing Scientific Labels

( X_{gt}^{*}, Y_{gt}^{*}) \longrightarrow Z^{*} \longrightarrow Y_{est}^{*}
X_{gt}^{*} \longrightarrow Z^{*} \longrightarrow Y_{est}^{*}

gt: ground truth

est: estimated

\(X^{*}\)

\(Z^{*}\)

\(Y^{*}\)

spectra

scientific labels

latents

Step 1: Learn latent \(Z^{*}\) from ground truth spectra or both spectra and labels. (Inference step)

Step 2: Decode \(Z^{*}\) using the GPs \(f_{l}\) to predict the labels.

(Cross-modal prediction)
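A hedged sketch of this two-step recipe (not the authors' exact inference procedure; a simplified point-estimate version with illustrative names): optimise a test latent \( z^{*} \) so that the decoded spectrum matches the observed pixels, then decode \( z^{*} \) with the label GPs.

```python
import numpy as np
from scipy.optimize import minimize

def decode(z_star, Z_train, Y_train, kernel, noise_var):
    """Posterior mean of every output GP at a test latent z_star of shape (1, Q)."""
    K = kernel(Z_train, Z_train) + noise_var * np.eye(len(Z_train))
    k_star = kernel(z_star, Z_train)                 # shape (1, N)
    return k_star @ np.linalg.solve(K, Y_train)      # shape (1, D): one value per output GP

def infer_latent(x_partial, mask, Z_train, X_train, kernel, noise_var, Q):
    """Step 1: find z* whose decoded spectrum best matches the observed (non-missing) pixels."""
    def objective(z_flat):
        recon = decode(z_flat.reshape(1, -1), Z_train, X_train, kernel, noise_var)
        return np.sum((recon[0, mask] - x_partial[mask]) ** 2)
    return minimize(objective, np.zeros(Q)).x.reshape(1, -1)

# Step 2 (cross-modal): decode the inferred latent with the label GPs.
# y_est = decode(z_star, Z_train, Y_labels_train, kernel_y, noise_var_y)
```

The full model would also propagate uncertainty over \( z^{*} \) rather than relying on a single point estimate.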

Generating unseen spectra

An experiment demonstrating cross-modal prediction \(Y^{*} \longrightarrow Z^{*} \longrightarrow X^{*}\). The plots show generated quasar spectra for simulated labels of black hole mass, bolometric luminosity and Eddington ratio. In each plot we vary the respective scientific label over a reasonable range (shown on the colorbar) while keeping the other labels fixed.

Quantitative Results

Reconstruction error on unseen objects across all modalities (lower is better).

Uncertainty quantification measured through negative log predictive density on unseen objects (lower is better).

cross-modal

Future Work

  • Accounting for measurement uncertainties per object per pixel \(\sigma^{2}_{x,d}\) and per object per label \(\sigma^{2}_{y,l}\) in the kernel matrix of the GPLVM.
  • Introducing an encoder for even more scalable inference (~millions of objects).
  • Exploring the suitability of astro-foundation models.

Can we learn universal, task-agnostic latent vectors in a self-supervised way? These latents could then be used downstream in smaller, custom models for specific research questions.

Foundation Models

IMAGEN: Text-to-image foundation model

Conditional Generation using a frozen pre-trained foundation model

frozen weights

Thank you! 

Prof. Anna-Christina Eilers

eilers@mit.edu

My contact:

vidrl@mit.edu

vr308@cam.ac.uk

@VRLalchand

  • NN-based approaches like VAEs or GANs can be used to learn a shared latent space model; however, they are not immediately compatible with missing dimensions in the observation space.
  • Our generative model allows us to simultaneously model the spectral properties of the quasars as well as their labels, opening up avenues to understand the spectral dependence of scientific labels like the black hole mass.

Summary 
