Umberto Maria Tomasini
1/29
Public defense @
Failure and success of the spectral bias prediction for Laplace Kernel Ridge Regression: the case of low-dimensional data
UMT, Sclocchi, Wyart
ICML 22
How deep convolutional neural networks
lose spatial information with training
UMT, Petrini, Cagnetta, Wyart
ICLR 23 Workshop; Machine Learning: Science and Technology 2023
How deep neural networks learn compositional data:
The random hierarchy model
Cagnetta, Petrini, UMT, Favero, Wyart
PRX 24
How Deep Networks Learn Sparse and Hierarchical Data:
the Sparse Random Hierarchy Model
UMT, Wyart
ICML 24
Clusterized data
Invariance to deformations
Hierarchy
2/29
Humans perform certain tasks almost automatically with enough skill and expertise
3/29
ML model
"Cat"
ML model
"Dog"
A simple task: classifying images based on whether they contain a cat or a dog
4/29
"Dog"
5/29
"Cat"
"Dog"
6/29
"Cat"
"Dog"
These models can be very good!
\(\rightarrow\) Surprising: images are difficult for a computer to process
7/29
[Figure: cat and dog images shown as grids of raw pixel values (76, 92, 27, ...)]
Hard to visualize. With a small change:
How can a ML model compare them?
Pixels!
8/29
[Figure: the dog image as a grid of pixel values]
"Dog"
Goal: learn a function that, given \(d\) pixels, returns the correct label
Difficulty increases with \(d\)
[images idea by Leonardo Petrini]
[Illustration: data in \(d=1\), \(d=2\), \(d=3\) dimensions]
9/29
Simplest technique to learn: to reach error \(\varepsilon\), \(P\propto (1/\varepsilon)^{d}\) training points are needed (the curse of dimensionality)
\(\Rightarrow\) Not the case for ML!
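The scaling \(P\propto(1/\varepsilon)^{d}\) can be made concrete with a short sketch (illustrative values only): covering the unit cube \([0,1]^d\) at resolution \(\varepsilon\) requires \((1/\varepsilon)^d\) grid points, which explodes with the dimension.

```python
# Points needed to cover [0,1]^d at resolution eps: P ~ (1/eps)^d.
def points_to_cover(eps: float, d: int) -> int:
    per_axis = int(round(1.0 / eps))  # grid points along one axis
    return per_axis ** d

# With eps = 0.1: 10 points in d=1, 1000 in d=3;
# for a 32x32 image (d = 1024) this would be 10^1024 points.
for d in [1, 2, 3]:
    print(d, points_to_cover(0.1, d))
```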
10/29
To classify a dog, we do not need every bit of information within the data.
Many irrelevant details can be disregarded to solve the task.
"Dog"
"Dog"
11/29
Key questions motivating this thesis:
12/29
Reducing complexity with depth
Deep networks build increasingly abstract representations with depth (as the brain also does)
Intuition: reduces complexity of the task, ultimately beating curse of dimensionality.
[Zeiler and Fergus 14, Yosinski 15, Olah 17, Doimo 20,
Van Essen 83, Grill-Spector 04]
[Shwartz-Ziv and Tishby 17, Ansuini 19, Recanatesi 19, ]
13/29
Two ways for ML to lose information
by learning invariances
Discrete
Continuous (diffeomorphisms)
[Bruna and Mallat 13, Mallat 16, Petrini 21]
Test error correlates with sensitivity to diffeo
[Petrini21]
[Plot: test error vs. sensitivity to diffeomorphisms]
14/29
[Poggio 17, Mossel 16, Malach 18-20, Schmidt-Hieber 20, Allen-Zhu and Li 24]
3. Data are sparse,
4. The task is stable to transformations.
[Bruna and Mallat 13, Mallat 16, Petrini 21]
How many training points are needed for deep networks?
15/29
Part I: introduce a hierarchical model of data to quantify the gain in number of training points.
Part II: understand why the best networks are the least sensitive to diffeo, in an extension of the hierarchical model.
16/29
Part I: Hierarchical structure
Do deep hierarchical representations exploit the hierarchical structure of data?
[Poggio 17, Mossel 16, Malach 18-20, Schmidt-Hieber 20, Allen-Zhu and Li 24]
How many training points?
Quantitative predictions in a model of data
[Cagnetta, Petrini, UMT, Favero, Wyart, PRX 24]
sofa
[Chomsky 1965]
[Grenander 1996]
17/29
18/29
\(P^*\sim n_c m^L\)
19/29
20/29
To group the synonyms, \(P^*\sim n_c m^L\) points are needed
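A minimal sketch of data generation in the spirit of the Random Hierarchy Model (not the exact construction of [Cagnetta et al., PRX 24]): a class label is expanded over \(L\) levels, each symbol replaced by one of \(m\) synonymous tuples of \(s\) lower-level symbols. For simplicity this sketch shares one vocabulary across levels; parameter names \(n_c, m, s, L\) follow the slides.

```python
import random

def make_rules(vocab, m, s, rng):
    """For each symbol, draw m synonymous productions, each a tuple of s symbols."""
    return {a: [tuple(rng.choice(vocab) for _ in range(s)) for _ in range(m)]
            for a in vocab}

def sample_datum(label, rules_per_level, rng):
    """Expand a class label down L levels; returns the leaf tuple (the input)."""
    symbols = [label]
    for rules in rules_per_level:
        symbols = [t for a in symbols
                   for t in rules[a][rng.randrange(len(rules[a]))]]
    return tuple(symbols)

rng = random.Random(0)
v, n_c, m, s, L = 8, 2, 3, 2, 3
vocab = list(range(v))
rules_per_level = [make_rules(vocab, m, s, rng) for _ in range(L)]
x = sample_datum(rng.randrange(n_c), rules_per_level, rng)
print(len(x))  # the input has s**L leaf symbols
```

In this notation the predicted sample complexity is \(P^*\sim n_c m^L\): here \(2\cdot 3^3 = 54\) points.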
21/29
Key insight: sparsity brings invariance to diffeo
[Tomasini, Wyart, ICML 24]
22/29
Sparse Random Hierarchy Model
Sparsity \(\rightarrow\) invariance to feature displacements (diffeo)
23/29
Analyzing this model:
[Plot: test error vs. sensitivity to diffeo]
24/29
[Diagram: input coordinates \(x_1,\dots,x_4\)]
25/29
With sparsity: to recover the synonyms, a factor \(1/p\) more data is needed:
\(P^*_{\text{LCN}}\sim (1/p) P^*_0\)
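The sparsification step can be sketched as follows (an illustrative construction, not the exact one of the Sparse RHM paper): informative symbols are diluted with an uninformative filler so that only a fraction \(p\) of positions carry signal, and the label does not depend on exactly where the informative symbols land.

```python
import random

FILLER = -1  # uninformative "background" symbol (an assumption of this sketch)

def sparsify(x, p, rng):
    """Dilute the informative symbols of x so only a fraction p of positions
    carry signal; the rest are filler. The informative symbols keep their
    order, so the label is invariant to these small displacements."""
    n = len(x)
    total = int(round(n / p))
    out = [FILLER] * total
    # place the informative symbols at random, order-preserving positions
    positions = sorted(rng.sample(range(total), n))
    for pos, sym in zip(positions, x):
        out[pos] = sym
    return out

rng = random.Random(1)
y = sparsify([3, 1, 4, 1], p=0.5, rng=rng)
print(y)  # length 8, with four filler entries
```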
26/29
[Diagram: input coordinates \(x_1,\dots,x_4\)]
27/29
Takeaways
28/29
The Bigger Picture
"Dog"
29/29
Thanks to the PCSL team!
Thanks to:
BACKUP SLIDES
17/28
At \(P^*\) the task and the synonyms are learnt
[Plot: sensitivity vs. training set size \(P\)]
18/28
Diffeomorphisms learnt with the task; synonyms learnt with the task
The hidden representations become insensitive to the invariances of the task
26/28
\(\varepsilon\sim P^{-\beta}\)
3/28
Error \(\varepsilon\)
of ML model
Number of training points \(P\)
[Diagram: input \(x\) under a diffeomorphism \(\tau\) and under additive noise \(\eta\); network outputs \(f(\tau(x))\) and \(f(x+\eta)\)]
Our model captures the fact that while sensitivity to diffeo decreases, the sensitivity to noise increases
\(\gamma_{k,l}=\mathbb{E}_{c,c'}[\omega^k_{c,c'}\cdot\Psi_l]\)
Few layers become low-pass: spatial pooling
[Architecture diagram: input (with noise), Layer 1, Layer 2, Avg Pooling, ReLU nonlinearities]
\(G_k = \frac{\mathbb{E}_{x,\eta}\| f_k(x+\eta)-f_k(x)\|^2}{\mathbb{E}_{x_1,x_2}\| f_k(x_1)-f_k(x_2)\|^2}\)
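The ratio \(G_k\) can be estimated by Monte Carlo: displacement of the layer-\(k\) representation under input noise, normalized by the typical distance between representations of independent inputs. A sketch for an illustrative one-layer random-feature map (the map \(f_k\) here is my own toy choice, not the networks of the thesis):

```python
import numpy as np

def sensitivity_ratio(f, X, noise_std, rng, n_pairs=1000):
    """Monte-Carlo estimate of G = E||f(x+eta)-f(x)||^2 / E||f(x1)-f(x2)||^2."""
    eta = rng.normal(0.0, noise_std, size=X.shape)
    num = np.mean(np.sum((f(X + eta) - f(X)) ** 2, axis=1))
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    den = np.mean(np.sum((f(X[i]) - f(X[j])) ** 2, axis=1))
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
W = rng.normal(size=(20, 50)) / np.sqrt(20)
relu_features = lambda Z: np.maximum(Z @ W, 0.0)  # toy one-layer map
G = sensitivity_ratio(relu_features, X, noise_std=0.1, rng=rng)
print(G)  # small: noise of scale 0.1 barely moves the representation
```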
Large \(m,\, n_c,\, P\): equating \(\left[m^L/(mv)\right]^{-1/2}\) with \(\left[P/(n_c m v)\right]^{-1/2}\) yields \(P_{corr}\sim n_c m^L\)
Uncorrelated version of RHM:
Curse of Dimensionality even for deep nets
Horizontal lines: random error
We consider a different version of a Convolutional Neural Network (CNN) without weight sharing
Standard CNN:
Locally Connected Network (LCN):
Single input feature!
The learning algorithm
Train loss:
E.g. Laplacian kernel: \(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
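Kernel ridge regression with this kernel reduces to a linear solve: the predictor is \(f(x)=k(x,X)(K+\lambda I)^{-1}y\). A minimal numpy sketch on the toy label \(f^*(x)=\mathrm{sign}(x_1)\) used later in the talk (illustrative sizes and \(\lambda\)):

```python
import numpy as np

def laplace_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-|a_i - b_j| / sigma), Euclidean norm."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def krr_fit_predict(Xtr, ytr, Xte, lam=1e-3, sigma=1.0):
    """Predictor f(x) = k(x, Xtr) (K + lam I)^{-1} ytr."""
    K = laplace_kernel(Xtr, Xtr, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(Xtr)), ytr)
    return laplace_kernel(Xte, Xtr, sigma) @ alpha

rng = np.random.default_rng(0)
Xtr = rng.uniform(-1, 1, size=(200, 1))
ytr = np.sign(Xtr[:, 0])          # the toy label f*(x) = sign(x_1)
Xte = np.array([[-0.5], [0.5]])
pred = krr_fit_predict(Xtr, ytr, Xte)
print(pred)  # near -1 and +1 away from the interface
```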
Fixed Features
Failure and success of Spectral Bias prediction..., [ICML22]
Predicting generalization of KRR
[Canatar et al., Nature (2021)]
General framework for KRR
\(\rightarrow\) KRR learns the first \(P\) eigenmodes of \(K\)
\(\rightarrow\) \(f_P\) is self-averaging with respect to sampling
\(\rightarrow\) what is the validity limit?
Our toy model
Depletion of points around the interface
Data: \(x\in\mathbb{R}^d\)
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Motivation:
evidence for gaps between clusters in datasets like MNIST
Predictor in the toy model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):
(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):
\(d=1\)
Different predictions for
\(\lambda\rightarrow0^+\)
Crossover at:
\(\lambda\)
Spectral bias failure
Spectral bias success
Takeaways and Future directions
For which kind of data does spectral bias fail?
Depletion of points close to decision boundary
Still missing a comprehensive theory for
KRR test error for vanishing regularization
Test error: 2 regimes
For fixed regularizer \(\lambda/P\):
\(\rightarrow\) Predictor controlled by extreme value statistics of \(x_B\)
\(\rightarrow\) Not self-averaging: no replica theory
(2) For small \(P\): predictor controlled by extremal sampled points:
\(x_B\sim P^{-\frac{1}{\chi+d}}\)
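This extreme-value scaling is easy to check numerically in \(d=1\) (an illustrative Monte-Carlo sketch, with my own choice of \(\chi\) and sample sizes): drawing \(|x_1|\) with density vanishing as \(|x_1|^{\chi}\) near the interface (CDF \(=x^{\chi+1}\) on \([0,1]\)), the closest sampled point to the boundary scales as \(P^{-1/(\chi+1)}\).

```python
import numpy as np

def closest_to_boundary(P, chi, rng, n_rep=200):
    """Median over repetitions of min |x_1|, for P points drawn with
    density p(x) ~ |x_1|^chi on [0,1], i.e. CDF = x^(chi+1)."""
    u = rng.uniform(size=(n_rep, P))
    x = u ** (1.0 / (chi + 1.0))  # inverse-CDF sampling
    return np.median(x.min(axis=1))

rng = np.random.default_rng(0)
chi = 2.0
xb_small = closest_to_boundary(500, chi, rng)
xb_large = closest_to_boundary(32_000, chi, rng)
# prediction: x_B ~ P^(-1/(chi+1)), so the ratio should be near
# (32000/500)^(1/3) = 4
ratio = xb_small / xb_large
print(ratio)
```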
The self-averageness crossover
\(\rightarrow\) Comparing the two characteristic lengths \(\ell(\lambda,P)\) and \(x_B\):
Different predictions for
\(\lambda\rightarrow0^+\)
Technical remarks:
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
first oscillations
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) Its solution yields:
Characteristic scale of predictor \(f_P\), \(d>1\)