From Zero to Generative

IAIFI Fellow, MIT

Carolina Cuesta-Lazaro

Art: "The art of painting" by Johannes Vermeer

Learning Generative Modelling from scratch

p(\mathrm{World}|\mathrm{Prompt})
["Genie 2: A large-scale foundation model" Parker-Holder et al]
p(\mathrm{Drug}|\mathrm{Properties})
["Generative AI for designing and validating easily synthesizable and structurally novel antibiotics" Swanson et al]

Probabilistic ML has made high-dimensional inference tractable

1024x1024xTime

Carolina Cuesta-Lazaro IAIFI/MIT - From Zero to Generative

https://parti.research.google

A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!


BEFORE

Artificial General Intelligence?

AFTER


Scaling laws and emergent abilities

"Scaling Laws for Neural Language Models" Kaplan et al


"Sparks of Artificial General Intelligence: Early experiments with GPT-4" Bubeck et al

Produce Javascript code that creates a random graphical image that looks like a painting of Kandinsky

Draw a unicorn in TikZ


Today's Plan

1. Recap of the Machine Learning building blocks

2. Learning to classify

BREAK

3. Tutorial: Build your first classifier

4. Introduction to Generative Models

5. Tutorial: Build your first generative model

(if time permits)

BREAK


The building blocks: 1. Data

Cosmic Cartography

(Pointclouds)

MNIST

(Images)

Wikipedia

(Text)


1024x1024

The curse of dimensionality

Inductive biases!


The building blocks: 2. Architectures


Image Credit: CS231n Convolutional Neural Networks for Visual Recognition

Pixel 1

Pixel 2

Pixel N


Multilayer Perceptron (MLP)

a^{(l)} = f^{(l)}(W^{(l)}a^{(l-1)} + b^{(l)})

Convolutional Neural Network (CNN)

Inductive bias: Translation Invariance

Data Representation: Images

Deep Sets

Inductive bias: Permutation Invariance

Data Representation: Sets, Pointclouds

f(x) = f(P(x))
f(x) = \oplus_{i=1}^N h_\theta(x_i)

[Figure: inputs combined with + to give 4]
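A minimal sketch of the Deep Sets idea, assuming the pooling operation \oplus is a sum and h_\theta is a single tanh layer. The weights `W` and the function names are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # hypothetical weights of the per-element map h_theta

def h(x):
    """Per-element feature map h_theta (assumed here: one tanh layer)."""
    return np.tanh(x @ W)

def deep_set(points):
    """Sum-pool the per-element features: f(x) = sum_i h_theta(x_i)."""
    return h(points).sum(axis=0)

points = rng.normal(size=(5, 3))        # a set of 5 points in 3D
shuffled = points[rng.permutation(5)]   # the same set in a different order

# Permutation invariance: f(x) = f(P(x))
print(np.allclose(deep_set(points), deep_set(shuffled)))
```

Because the sum does not care about order, any reordering of the points gives the same output.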

Text

Images


Transformers

The Unifying architecture

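Like Deep Sets, Transformers have a permutation symmetry: without positional encodings, self-attention is permutation equivariant (shuffling the input tokens shuffles the outputs in the same way)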


\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Input token

x

W_Q

Q = x*W_Q

QUERY: What is X looking for?

W_K

K = x*W_K

KEY: What token X contains

W_V

V = x*W_V

VALUE: What token X will provide
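The query, key, and value projections can be sketched in a few lines. This is a single-head illustration with random, untrained weights; the sizes `n_tokens`, `d_model`, and `d_k` are arbitrary assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention over the rows (tokens) of x."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 6, 3               # illustrative sizes
x = rng.normal(size=(n_tokens, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(x, W_Q, W_K, W_V)
print(out.shape, weights.sum(axis=1))
```

Shuffling the rows of `x` shuffles the rows of `out` in the same way, which is why an explicit positional signal is needed to tell word orders apart.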

"The dog chased the cat because it was playful."

But, we decide to break the symmetry!

Positional Encodings

  • "Dog bites man"
  • "Man bites dog"


The building blocks: 3. Loss function


The building blocks: 4. The Optimizer

Image Credit: "Complete guide to Adam optimization" Hao Li et al


Tutorial 1: Learning to classify


How do we output a probability?

0 \leq p_i(x) \leq 1
\sum_{i=1}^C p_i(x) = 1
p_i(x) = \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^C e^{z_j}}
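A small sketch of the softmax, turning arbitrary network outputs (logits) into probabilities. The logit values are made up for illustration; subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Map a vector of logits to probabilities that are positive and sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift by the max to avoid overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])  # illustrative network outputs
p = softmax(logits)
print(p, p.sum())
```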


Pixel 1

Pixel 2

Pixel N

p Class 1

p Class 2

p Class 10

Loss function: Cross entropy

How different are two probability distributions?

Model Prediction

y_{i} = 1 if i is the true class
y_{i} = 0 otherwise

L = - \sum_{i=1}^{C} y_{i} \log p_i(x)
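A quick numerical illustration of the cross-entropy loss with a one-hot true label. The three predicted distributions are invented for the example.

```python
import numpy as np

def cross_entropy(y_onehot, p):
    """L = -sum_i y_i log p_i, for a one-hot label y and predicted probabilities p."""
    return -np.sum(y_onehot * np.log(p))

y = np.array([1.0, 0.0, 0.0])                # true class is class 0
confident = np.array([0.90, 0.05, 0.05])     # confident and correct
unsure = np.array([1 / 3, 1 / 3, 1 / 3])     # no preference
wrong = np.array([0.05, 0.90, 0.05])         # confident and wrong

for name, p in [("confident", confident), ("unsure", unsure), ("wrong", wrong)]:
    print(name, cross_entropy(y, p))
```

Only the probability assigned to the true class enters the loss, so the loss grows as that probability shrinks.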


Truth: Class = 0

True class

Predicted probability

import flax.linen as nn

class MLP(nn.Module):
    @nn.compact
    def __call__(self, x):
        # Linear
        x = nn.Dense(features=64)(x)
        # Non-linearity
        x = nn.silu(x)
        # Linear
        x = nn.Dense(features=64)(x)
        # Non-linearity
        x = nn.silu(x)
        # Linear
        x = nn.Dense(features=2)(x)
        return x

model = MLP()

JAX Models

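import jax
import jax.numpy as jnp

# Initialize the parameters from a dummy input with 4 features
example_input = jnp.ones((1, 4))
params = model.init(jax.random.PRNGKey(0), example_input)
# Call the model: parameters are passed explicitly
y = model.apply(params, example_input)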

Architecture

Parameters

Call

A 2D animation of a folk music band composed of anthropomorphic autumn leaves, each playing traditional bluegrass instruments, amidst a rustic forest setting dappled with the soft light of a harvest moon

1024x1024


Learning to Generate by bridging distributions

p(x)
p(y|x)
p(x|y) = \frac{p(y|x)p(x)}{p(y)}
p(x|y)

Generation vs Discrimination


p_\phi(x)

Data

A PDF that we can optimize

Maximize the likelihood of the data

Generative Models


Generative Models 101

Maximize the likelihood of the training samples

\hat \phi = \argmax_\phi \left[ \log p_\phi (x_\mathrm{train}) \right]
x_1
x_2

Parametric Model

p_\phi(x)

Training Samples

x_\mathrm{train}
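As a toy illustration of maximum likelihood, the sketch below assumes a 1D Gaussian as the parametric model p_\phi(x), with \phi = (\mu, \log\sigma), and fits it by gradient ascent on the mean log-likelihood. The synthetic data, learning rate, and number of steps are arbitrary choices.

```python
import numpy as np

def mean_log_likelihood(x, mu, log_sigma):
    """Average log p_phi(x) for a Gaussian with parameters phi = (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    return np.mean(-0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
x_train = rng.normal(loc=2.0, scale=0.5, size=2000)  # synthetic training samples

mu, log_sigma = 0.0, 0.0   # initial guess
lr = 0.1                   # illustrative learning rate
for _ in range(500):
    sigma2 = np.exp(2 * log_sigma)
    # Analytic gradients of the mean log-likelihood
    grad_mu = np.mean(x_train - mu) / sigma2
    grad_log_sigma = np.mean((x_train - mu) ** 2) / sigma2 - 1.0
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma), mean_log_likelihood(x_train, mu, log_sigma))
```

For a Gaussian the optimum is known in closed form (the sample mean and standard deviation), which makes it easy to check that the optimizer lands in the right place.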


x_1
x_2

Trained Model

p_\phi(x)

Evaluate probabilities

Low Probability

High Probability

Generate Novel Samples

Simulator

Generative Model

Generative Model

Simulator

Generative Models: Simulate and Analyze


The Generative Zoo


Bridging two distributions

is p(z) fixed?

Bridge stochastic or deterministic?
SDE or ODE?

is path fixed?


Change of variables

X \sim \mathcal{N}(0,1)

sampled from a Gaussian distribution with mean 0 and variance 1

Y = g(X) = a X + b

How is Y distributed?

P(Y\le y) = P(g(X)\le y) = P(X\le g^{-1}(y)) (for increasing g, i.e. a > 0)
\mathrm{CDF}_Y(y) = \mathrm{CDF}_{X}(g^{-1}(y))
p_Y(y) = p_X(g^{-1}(y)) \left| \frac{dg^{-1}(y)}{dy}\right|
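A numerical check of the change-of-variables formula for the affine map Y = aX + b. The values of `a` and `b` are arbitrary; the density from the formula is compared against samples obtained by pushing Gaussian draws through g.

```python
import numpy as np

def normal_pdf(x):
    """Density of the standard normal N(0, 1)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

a, b = 2.0, 1.0  # illustrative parameters of g(x) = a x + b

def g_inverse(y):
    return (y - b) / a

def p_Y(y):
    """p_Y(y) = p_X(g^{-1}(y)) |d g^{-1}/dy|, with |d g^{-1}/dy| = 1/|a|."""
    return normal_pdf(g_inverse(y)) / abs(a)

# Empirical check: push samples of X through g and compare probabilities
rng = np.random.default_rng(0)
samples = a * rng.normal(size=200_000) + b
grid = np.linspace(0.0, 2.0, 2001)
dx = grid[1] - grid[0]
prob_formula = np.sum(p_Y(grid[:-1])) * dx               # integral of p_Y over [0, 2]
prob_samples = np.mean((samples > 0.0) & (samples < 2.0))
print(prob_formula, prob_samples)
```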

Base distribution

Target distribution

p_X(x) = p_Z(z) \left| \frac{dz}{dx}\right|
Z \sim \mathcal{N} (0,1) \rightarrow g(z) \rightarrow X

Invertible transformation

z \sim p_Z(z)
p_Z(z)

Normalizing flows


U_1, U_2 \sim \mathrm{Uniform}(0,1)
Z_0 = \sqrt{-2 \ln U_1} \cos(2 \pi U_2)
Z_1 = \sqrt{-2 \ln U_1} \sin(2 \pi U_2)
Z_0, Z_1 \sim \mathcal{N}(0,1)

Box-Muller transform

Normalizing flows in 1934
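The Box-Muller transform is short enough to try directly. In this sketch the sample size is arbitrary, and `u1` is drawn from (0, 1] rather than [0, 1) to avoid taking log(0).

```python
import numpy as np

def box_muller(u1, u2):
    """Map two independent Uniform(0, 1) samples to two independent N(0, 1) samples."""
    r = np.sqrt(-2.0 * np.log(u1))
    return r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)

rng = np.random.default_rng(0)
u1 = 1.0 - rng.random(100_000)  # values in (0, 1], so log(u1) is finite
u2 = rng.random(100_000)
z0, z1 = box_muller(u1, u2)

print(z0.mean(), z0.std(), z1.mean(), z1.std())
```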


Normalizing flows

[Image Credit: "Understanding Deep Learning" Simon J.D. Prince]
z \sim p(z)
x \sim p(x)
x = f(z)
p(x) = p(z = f^{-1}(x)) \left| \det J_{f^{-1}}(x) \right|

Bijective

Sample

Evaluate probabilities

Probability mass conserved locally


z_0 \sim p(z)
z_k = f_k(z_{k-1})
\log p(x) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k} (z_{k-1}) \right|
Image Credit: "Understanding Deep Learning" Simon J.D. Prince
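A sketch of a composed flow, assuming each layer f_k is a fixed elementwise affine map so that the log-determinant is simply the sum of log |scale|. The layer values are hypothetical; a real flow would learn them and use more expressive layers.

```python
import numpy as np

def standard_normal_logpdf(z):
    """log p(z_0) for a standard normal base distribution, summed over dimensions."""
    return -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)

# K = 2 elementwise affine layers f_k(z) = scale * z + shift (hypothetical fixed values)
layers = [
    (np.array([2.0, 0.5]), np.array([1.0, -1.0])),
    (np.array([1.5, 3.0]), np.array([0.0, 2.0])),
]

def sample(rng, n):
    """Draw z_0 from the base and push it through f_1, ..., f_K."""
    z = rng.normal(size=(n, 2))
    for scale, shift in layers:
        z = scale * z + shift
    return z

def log_prob(x):
    """log p(x) = log p(z_0) - sum_k log |det J_{f_k}|, inverting the layers in reverse order."""
    z, log_det = x, 0.0
    for scale, shift in reversed(layers):
        z = (z - shift) / scale
        log_det += np.sum(np.log(np.abs(scale)))
    return standard_normal_logpdf(z) - log_det

x = sample(np.random.default_rng(0), 5)
print(log_prob(x))
```

The same stack of layers is used in both directions: forward to sample, inverse to evaluate probabilities.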


Masked Autoregressive Flows

p(x) = \prod_i{p(x_i \,|\, x_{1:i-1})}
p(x_i \,|\, x_{1:i-1}) = \mathcal{N}(x_i \,|\,\mu_i, (\exp\alpha_i)^2)
\mu_i, \alpha_i = f_{\phi_i}(x_{1:i-1})

Neural Network

x_i = z_i \exp(\alpha_i) + \mu_i
z_i = (x_i - \mu_i) \exp(-\alpha_i)

Sample

Evaluate probabilities
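The sampling and density equations above can be sketched for a 2D toy case. Here `conditioner` is a hypothetical stand-in for the neural network: it returns fixed functions of x_1 instead of learned ones, and it only handles two dimensions.

```python
import numpy as np

def conditioner(i, x):
    """Return (mu_i, alpha_i) using only x_{1:i-1}. Toy functions replace the network."""
    if i == 0:
        zeros = np.zeros(x.shape[0])
        return zeros, zeros                       # x_1 has nothing to condition on
    return x[:, 0] ** 2, 0.5 * np.tanh(x[:, 0])   # x_2 depends on x_1

def sample(z):
    """Sequential sampling: x_i = z_i exp(alpha_i) + mu_i."""
    x = np.zeros_like(z)
    for i in range(z.shape[1]):
        mu, alpha = conditioner(i, x)
        x[:, i] = z[:, i] * np.exp(alpha) + mu
    return x

def inverse_and_log_prob(x):
    """z_i = (x_i - mu_i) exp(-alpha_i), and log p(x) = sum_i log N(x_i | mu_i, exp(alpha_i)^2)."""
    z = np.zeros_like(x)
    log_p = np.zeros(x.shape[0])
    for i in range(x.shape[1]):
        mu, alpha = conditioner(i, x)
        z[:, i] = (x[:, i] - mu) * np.exp(-alpha)
        log_p += -0.5 * z[:, i] ** 2 - 0.5 * np.log(2 * np.pi) - alpha
    return z, log_p

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 2))
x = sample(z)
z_rec, log_p = inverse_and_log_prob(x)
print(np.allclose(z, z_rec), log_p)
```

Sampling has to proceed one dimension at a time, while evaluating probabilities needs only the observed x, so all conditionals can be computed at once in a real implementation.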


\theta

Forward Model

Observable

x
\color{darkgray}{\Omega_m}, \color{darkgreen}{w_0, w_a},\color{purple}{f_\mathrm{NL}}\, ...

Dark matter

Dark energy

Inflation

Predict

Infer

Parameters

Inverse mapping

\color{darkgray}{\sigma}, \color{darkgreen}{v}, ...

Fault line stress

Plate velocity

p(\theta|x)


Simulation-based Inference

p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)}

Posterior: p(\theta|x)

Likelihood: p(x|\theta)

Prior: p(\theta)

Evidence: p(x)

Markov Chain Monte Carlo (MCMC)

Hamiltonian Monte Carlo (HMC)

Variational Inference (VI)


p(x) = \int p(x|\theta) p(\theta) d\theta

If we can evaluate the posterior (up to normalization) but cannot sample from it directly

Intractable
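A minimal random-walk Metropolis sketch, one of the simplest MCMC methods: it only needs the posterior up to normalization, so the evidence p(x) never has to be computed. The toy problem (Gaussian likelihood with unit variance, standard normal prior) and the step size are assumptions chosen so the exact posterior is known for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs = rng.normal(loc=1.0, scale=1.0, size=20)  # synthetic observations

def log_unnormalized_posterior(theta):
    """log p(x|theta) + log p(theta), dropping constants (the evidence is never needed)."""
    log_prior = -0.5 * theta**2
    log_likelihood = -0.5 * np.sum((x_obs - theta) ** 2)
    return log_prior + log_likelihood

def metropolis(log_target, n_steps, step_size, theta0, rng):
    """Random-walk Metropolis: propose a Gaussian step, accept with prob min(1, ratio)."""
    samples = np.empty(n_steps)
    theta, log_p = theta0, log_target(theta0)
    for i in range(n_steps):
        proposal = theta + step_size * rng.normal()
        log_p_new = log_target(proposal)
        if np.log(rng.random()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples[i] = theta
    return samples

chain = metropolis(log_unnormalized_posterior, 20_000, 0.5, 0.0, rng)
samples = chain[2_000:]  # discard burn-in
print(samples.mean(), samples.std())
```

For this conjugate toy model the posterior is Gaussian with mean sum(x_obs)/(n+1) and variance 1/(n+1), which the chain should reproduce.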

Unknown likelihoods

Amortized inference

Scaling to high dimensions

Marginalizing over nuisance parameters

 

Invertible functions aren't that common!

Splines


Issues with NFs: Lack of flexibility

  • Invertible functions
  • Tractable Jacobians

 

\frac{dx_t}{dt} = v^\phi_t(x_t)
x_1 = x_0 + \int_0^1 v^\phi_t(x_t) dt
\frac{\partial p_t(x)}{\partial t} = - \nabla \cdot \left( v^\phi_t(x) p_t(x) \right)

In continuous time

Continuity Equation

[Image Credit: "Understanding Deep Learning" Simon J.D. Prince]
x_0 = x_1 + \int_1^0 v^\phi_t(x_t) dt


Chen et al. (2018), Grathwohl et al. (2018)
x_1 = x_0 + \int_0^1 v_\theta (x(t),t) dt

Generate

x_0
x_1


t

Evaluate Probability

\log p_X(x) = \log p_Z(z) - \int_0^1 \mathrm{Tr}\, J_v (x(t))\, dt, with z = x(0) and x = x(1)
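A sketch of a continuous flow in 1D with a hand-picked velocity field v_t(x) = lam * x, integrated with Euler steps from the base sample at t = 0 to the generated sample at t = 1. Along that direction the log-density changes at rate -Tr J_v (the instantaneous change of variables of Chen et al. 2018). The value of `lam` and the step count are arbitrary; a real model would use a neural network for v and an adaptive ODE solver.

```python
import numpy as np

lam = 0.7  # hypothetical linear velocity field v_t(x) = lam * x, so Tr J_v = lam

def velocity(t, x):
    return lam * x

def trace_jacobian(t, x):
    return lam * np.ones_like(x)  # in 1D the trace is just dv/dx

def standard_normal_logpdf(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def generate_with_log_prob(x0, n_steps=1000):
    """Euler-integrate dx/dt = v and d(log p)/dt = -Tr J_v from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    log_p = standard_normal_logpdf(x)
    for k in range(n_steps):
        t = k * dt
        log_p = log_p - trace_jacobian(t, x) * dt
        x = x + velocity(t, x) * dt
    return x, log_p

x0 = np.array([-1.0, 0.0, 0.5])  # samples from the base distribution
x1, log_p1 = generate_with_log_prob(x0)
print(x1, log_p1)
```

For this field the exact answer is known: x_1 = x_0 exp(lam), so x_1 follows a Gaussian with standard deviation exp(lam).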


Loss requires solving an ODE!

Diffusion, Flow matching, Interpolants... All ways to avoid this at training time

\frac{dx_t}{dt} = v_\theta(t,x_t)

Can we regress the velocity field directly?

Interpolant

I(t,x_0,x_1) = x_t = \alpha_t x_0 + \beta_t x_1
e.g. x_t = (1-t)x_0 + t x_1

Stochastic Interpolant

x_t = \alpha_t x_0 + \beta_t x_1 + \gamma_t z

v(t,x) = \mathbb{E}\left[\partial_t I(t,x_0,x_1) \,|\, x_t=x \right]

Expectation over all possible paths that go through x_t

\mathcal{L} = \int_0^1 \mathbb{E}_{x_0,x_1} \left[ \left\| v_\theta(t, x_t) - \partial_t I(t,x_0,x_1) \right\|^2 \right] dt

Turned maximum likelihood into a regression problem!

["Stochastic Interpolants: A Unifying Framework for Flows and Diffusions" Albergo et al arXiv:2303.08797]


Conditional Flow matching

x_t = (1-t) x_0 + t x_1
\mathcal{L}_\mathrm{conditional} = \mathbb{E}_{t,x_0,x_1}\left[\| u_t^\phi(x_t) - u_t(x_0,x_1) \|^2 \right]

Assume a conditional vector field (known at training time)

The loss that we can compute

The gradients of the losses are the same!

\nabla_\phi \mathcal{L}_\mathrm{conditional} = \nabla_\phi \mathcal{L}
x_0
x_1
["Flow Matching for Generative Modeling" Lipman et al]
["Stochastic Interpolants: A Unifying framework for Flows and Diffusions" Albergo et al]
u_t(x) = \int u_t(x|x_1) \frac{p_t(x|x_1) p_1(x_1)}{p_t(x)} \, dx_1
p_t(x) = \int p_t(x|x_1) p_1(x_1) \, dx_1

Intractable


Flow Matching

\frac{dz_t}{dt} = u^\phi_t(z_t)
x = z_0 + \int_0^1 u^\phi_t(z_t) dt

Continuity equation

\frac{\partial p_t(z)}{\partial t} = - \nabla \cdot \left( u^\phi_t(z) p_t(z) \right)
[Image Credit: "Understanding Deep Learning" Simon J.D. Prince]

Sample

Evaluate probabilities


Diffusion Models

Reverse diffusion: Denoise previous step

Forward diffusion: Add Gaussian noise (fixed)

Prompt

A person half Yoda half Gandalf

Denoising = Regression

Fixed base distribution:

Gaussian


["A point cloud approach to generative modeling for galaxy surveys at the field level" Cuesta-Lazaro and Mishra-Sharma, International Conference on Machine Learning (ICML) AI4Astro 2023, Spotlight talk, arXiv:2311.17141]

Base Distribution

Target Distribution

Simulated Galaxy 3d Map

Prompt:

\Omega_m, \sigma_8

Prompt: A person half Yoda half Gandalf


Tutorial 2

x_0

Gaussian

MNIST 

x_1


Flow matching

\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}[0, 1]} \mathbb{E}_{x \sim p_t}\left[\| u_\theta(t, x) - u(t, x) \|^2 \right]

Regress the velocity field directly!

But we need to know u. If we know u, then why learn another one?

Image Credit: "An Introduction to flow matching" Tor Fjelde et al


Conditional Flow matching

x_t = (1-t) x_0 + t x_1
\mathcal{L}_\mathrm{conditional} = \mathbb{E}_{t,x_0,x_1}\left[\| u_\theta(t, x_t) - u(t, x_0,x_1) \|^2 \right]

Regress onto a conditional vector field (known at training time)

The network then approximates the unconditional (marginal) one

The gradients of the losses are the same!

\nabla_\theta \mathcal{L}_\mathrm{conditional} = \nabla_\theta \mathcal{L}
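A sketch of the conditional flow matching loss for the linear interpolant, where the conditional target velocity is simply x_1 - x_0. The `velocity_model` callables are hypothetical stand-ins for the network u_\theta, and the Gaussian "data" is synthetic; in the tutorial x_1 would be MNIST images.

```python
import numpy as np

def cfm_loss(velocity_model, x0, x1, t):
    """Monte Carlo estimate of E || u_theta(t, x_t) - (x_1 - x_0) ||^2 with x_t = (1-t) x_0 + t x_1."""
    t = t[:, None]                       # broadcast time over feature dimensions
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    target = x1 - x0                     # conditional velocity, known at training time
    pred = velocity_model(t, x_t)
    return np.mean(np.sum((pred - target) ** 2, axis=-1))

rng = np.random.default_rng(0)
n, d = 4096, 2
shift = np.array([3.0, -1.0])
x0 = rng.normal(size=(n, d))             # base samples (Gaussian)
x1 = shift + rng.normal(size=(n, d))     # stand-in for data samples
t = rng.random(n)                        # t ~ U[0, 1]

zero_model = lambda t, x: np.zeros_like(x)                   # predicts no motion
constant_model = lambda t, x: np.broadcast_to(shift, x.shape)  # always moves toward the data mean

print(cfm_loss(zero_model, x0, x1, t), cfm_loss(constant_model, x0, x1, t))
```

In practice this loss is minimized over the network parameters with a gradient-based optimizer, and samples are then generated by integrating dx/dt = u_\theta(t, x) from a Gaussian draw at t = 0.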


x_0
x_1

Students at MIT are

Pre-trained on next word prediction

...

OVER-CAFFEINATED

NERDS

SMART

ATHLETIC

Large Language Models


https://www.astralcodexten.com/p/janus-simulators

How do we encode "helpful" in the loss function?


Step 1

Human teaches desired output

Explain RLHF

After training the model...

Step 2

Human scores outputs

+ teaches Reward model to score

it is the method by which ...

Explain means to tell someone...

Explain RLHF

Step 3

Tune the Language Model to produce high rewards!

RLHF: Reinforcement Learning from Human Feedback


BEFORE RLHF

AFTER RLHF



Reasoning


Reasoning


RLVR (Verifiable Rewards)

Examples: Code execution, game playing, instruction following, ...

https://arxiv.org/abs/2308.03688


Agents


References

cuestalz@mit.edu


From zero to generative - Arizona

By carol cuesta
