Umberto Maria Tomasini
1/28
Failure and success of the spectral bias prediction for Laplace Kernel Ridge Regression: the case of low-dimensional data
UMT, Sclocchi, Wyart
ICML 22
How deep convolutional neural networks
lose spatial information with training
UMT, Petrini, Cagnetta, Wyart
ICLR 23 Workshop,
Machine Learning: Science and Technology 2023
How deep neural networks learn compositional data:
The random hierarchy model
Cagnetta, Petrini, UMT, Favero, Wyart
PRX 24
How Deep Networks Learn Sparse and Hierarchical Data:
the Sparse Random Hierarchy Model
UMT, Wyart
ICML 24
Clusterized data
Invariance to deformations
Hierarchy
2/28
The curse of dimensionality occurs when learning structureless data in high dimension \(d\):
\(\varepsilon\sim P^{-\beta}\)
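A rough standard estimate, assuming a generic (e.g. Lipschitz) target with no structure: the exponent is of order \(\beta\sim 1/d\), so reaching a fixed accuracy \(\varepsilon\) requires
\[ P \sim \varepsilon^{-1/\beta} \sim \varepsilon^{-O(d)}, \]
i.e. a number of training points growing exponentially with the dimension \(d\).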
3/28
\(\Rightarrow\) Data must be structured and
Machine Learning should capture such structure.
Key questions motivating this thesis:
4/28
\(\Rightarrow\) Kernel Methods
3. The task depends on a few coordinates,
e.g. \(f(x)=f(x_1)\),
\(\Rightarrow\) Shallow Networks
[Bennett 69, Pope 21, Goldt 20, Smola 98, Rosasco 17, Caponnetto and De Vito 05-07, Steinwart 09, Steinwart 20, Loureiro 21, Cui 21, Richards 21, Xu 21, Patil 21, Kanagawa 18, Jin 21, Bordelon 20, Canatar 21, Spigler 20, Jacot 20, Mei 21, Bahri 24]
[UMT, Sclocchi, Wyart, ICML 22]
Properties 1-3 not enough:
Why? What structure do they learn?
[Barron 93, Bach 17, Bach 20, Schmidt-Hieber 20,
Yehudai and Shamir 19, Ghorbani 19-20, Wei 19, Paccolat 20, Dandi 23]
5/28
Reducing complexity with depth
Deep networks build increasingly abstract representations with depth (also in the brain).
Intuition: this reduces the complexity of the task, ultimately beating the curse of dimensionality.
Two ways of losing information by learning invariances:
Discrete
Continuous
[Zeiler and Fergus 14, Yosinski 15, Olah 17, Doimo 20,
Van Essen 83, Grill-Spector 04]
[Shwartz-Ziv and Tishby 17, Ansuini 19, Recanatesi 19]
[Bruna and Mallat 13, Mallat 16, Petrini 21]
6/28
Test error correlates with sensitivity to diffeo
Test error anti-correlates with sensitivity to noise
[Figure: an image \(x\), its deformed version \(\tau(x)\), and its noisy version \(x+\eta\)]
Can we explain this phenomenology? [Petrini 21]
[Plots: test error vs. sensitivity to diffeo, and test error vs. inverse of sensitivity to noise]
7/28
[Poggio 17, Mossel 16, Malach 18-20, Schmidt-Hieber 20, Allen-Zhu and Li 24]
3. Data are sparse,
4. The task is stable to transformations.
[Bruna and Mallat 13, Mallat 16, Petrini 21]
How many training points are needed for deep networks?
8/28
Part I: analyze mechanistically why the best networks are the most sensitive to noise.
Part II: introduce a hierarchical model of data to quantify the gain in number of training points.
Part III: understand why the best networks are the least sensitive to diffeo, in an extension of the hierarchical model.
9/28
\(d<\xi\,\rightarrow\, y=1\)
\(d>\xi\,\rightarrow\, y=-1\)
\(d\): distance between active pixels
\(\xi\): characteristic scale
[UMT, Petrini, Cagnetta, Wyart, ICLR 23 Workshop,
Machine Learning: Science and Technology 2023]
[Figure: example inputs with \(y=1\) (active pixels closer than \(\xi\)) and \(y=-1\) (active pixels farther apart than \(\xi\))]
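A minimal sketch of how such a toy dataset could be generated (the array size, the range of \(d\), and the boundary handling are illustrative assumptions, not necessarily the paper's exact construction):

import numpy as np

def make_sample(n_pixels=32, xi=8, rng=None):
    # Toy input: two active pixels at distance d; label y = +1 if d < xi, else y = -1.
    rng = rng or np.random.default_rng()
    d = int(rng.integers(1, n_pixels))        # distance between the two active pixels
    i = int(rng.integers(0, n_pixels - d))    # position of the first active pixel
    x = np.zeros(n_pixels)
    x[i] = x[i + d] = 1.0
    y = 1 if d < xi else -1
    return x, y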
10/28
[Diagram: noisy input propagated through a network with Layer 1, Layer 2, ReLU, and Avg Pooling layers; the noise concentrates around a positive mean]
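A minimal calculation consistent with this claim, assuming i.i.d. Gaussian noise \(\eta_i\sim\mathcal{N}(0,\sigma^2)\) passed through a ReLU and then averaged over \(n\) units:
\[ \mathbb{E}[\mathrm{ReLU}(\eta_i)] = \frac{\sigma}{\sqrt{2\pi}} > 0, \qquad \frac{1}{n}\sum_{i=1}^{n}\mathrm{ReLU}(\eta_i) = \frac{\sigma}{\sqrt{2\pi}} + O\!\left(\frac{\sigma}{\sqrt{n}}\right), \]
so average pooling suppresses the fluctuations while the positive mean induced by the ReLU survives.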
11/28
Next:
12/28
Part II: Hierarchical structure
Do deep hierarchical representations exploit the hierarchical structure of data?
[Poggio 17, Mossel 16, Malach 18-20, Schmidt-Hieber 20, Allen-Zhu and Li 24]
How many training points?
Quantitative predictions in a model of data
[Cagnetta, Petrini, UMT, Favero, Wyart, PRX 24]
sofa
[Chomsky 1965]
[Grenander 1996]
13/28
14/28
\(P^*\)
\(P^*\sim n_c m^L\)
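Written in terms of the input size, this sample complexity is only polynomial in the dimension: assuming an \(s\)-ary hierarchy of depth \(L\), so that the input consists of \(d=s^L\) low-level features,
\[ P^*\sim n_c\, m^L = n_c\,\big(s^L\big)^{\log_s m} = n_c\, d^{\,\ln m/\ln s}, \]
in contrast with the exponential scaling expected for unstructured data.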
15/28
How many training points are needed to group synonyms?
16/28
Patch \(\mu\)
Label \(\alpha\)
17/28
At \(P^*\) the task and the synonyms are learnt
[Plot: values from 0.0 to 1.0 vs. training set size \(P\); x-axis around \(10^4\)]
18/28
19/28
Key insight: sparsity brings invariance to diffeo
[Tomasini, Wyart, ICML 24]
20/28
Sparse Random Hierarchy Model
Sparsity \(\rightarrow\) invariance to feature displacements (diffeo)
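Schematically (a simplified illustration, not the full construction): if each informative feature is surrounded by \(s_0\) uninformative positions, the same feature can appear in several spatial arrangements within a patch, e.g. for \(s_0=1\) both \((a,\,0)\) and \((0,\,a)\) encode the feature \(a\); small displacements of the informative features, i.e. discretized diffeomorphisms, then leave the label unchanged.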
21/28
Analyzing this model:
[Plot: test error vs. sensitivity to diffeo]
22/28
[Diagram: inputs \(x_1, x_2, x_3, x_4\)]
23/28
\(\Rightarrow\) To recover the synonyms, a factor \(1/p\) more data is needed:
\(P^*_{\text{LCN}}\sim (s_0+1)^L P^*_0\)
CNN: factor independent of \(L\)
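As a concrete reading of this scaling (illustrative numbers): with \(s_0=1\) and \(L=3\) the factor is \((s_0+1)^L=2^3=8\), and it grows exponentially with the depth \(L\); for the weight-sharing CNN, by contrast, the corresponding factor does not grow with \(L\).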
24/28
[Diagram: inputs \(x_1, x_2, x_3, x_4\)]
25/28
Diffeomorphisms learnt with the task
Synonyms learnt with the task
The hidden representations become insensitive to the invariances of the task
26/28
Takeaways
27/28
Future directions
Thank you!
28/28
BACKUP SLIDES
[Diagram: an input \(x\) transformed either by a diffeomorphism \(\tau\), giving \(f(\tau(x))\), or by additive noise \(\eta\), giving \(f(x+\eta)\)]
Our model captures the fact that while sensitivity to diffeo decreases, the sensitivity to noise increases
\(\gamma_{k,l}=\mathbb{E}_{c,c'}[\omega^k_{c,c'}\cdot\Psi_l]\)
Few layers become low-pass: spatial pooling
[Diagram: noisy input propagated through Layer 1, Layer 2, ReLU, and Avg Pooling layers]
\(G_k = \frac{\mathbb{E}_{x,\eta}\| f_k(x+\eta)-f_k(x)\|^2}{\mathbb{E}_{x_1,x_2}\| f_k(x_1)-f_k(x_2)\|^2}\)
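A minimal numpy sketch of how this ratio could be estimated on a batch of inputs; the callable f_k (returning the flattened activations of layer k), the noise scale sigma, and the random pairing used in the denominator are assumptions for illustration, not the exact protocol of the paper:

import numpy as np

def noise_sensitivity(f_k, X, sigma=0.1, rng=None):
    # Monte Carlo estimate of G_k on a batch X of shape (n_samples, input_dim).
    rng = rng or np.random.default_rng()
    eta = sigma * rng.standard_normal(X.shape)                   # isotropic Gaussian perturbation
    num = np.mean(np.sum((f_k(X + eta) - f_k(X)) ** 2, axis=-1))
    X_perm = X[rng.permutation(len(X))]                          # random pairing of inputs
    den = np.mean(np.sum((f_k(X) - f_k(X_perm)) ** 2, axis=-1))
    return num / den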
\(\sim[m^L/(mv)]^{-1/2}\)
\(\sim[P/(n_c m v)]^{-1/2}\)
Large \(m,\, n_c,\, P\):
\(P_{corr}\sim n_c m^L\)
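Equating the two scales above (reading the first as the signal and the second as the sampling noise) reproduces this threshold:
\[ \left[\frac{P_{\rm corr}}{n_c\, m\, v}\right]^{-1/2} \sim \left[\frac{m^L}{m\, v}\right]^{-1/2} \;\Rightarrow\; P_{\rm corr}\sim n_c\, m^L. \]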
Uncorrelated version of RHM:
Curse of Dimensionality even for deep nets
Horizontal lines: random error
We consider a different version of a Convolutional Neural Network (CNN) without weight sharing
Standard CNN:
Locally Connected Network (LCN):
Single input feature!
The learning algorithm
Train loss:
E.g. the Laplace kernel \(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
Fixed Features
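A minimal sketch of kernel ridge regression with this Laplace kernel (the bandwidth sigma, the regularization lam, and its normalization are illustrative choices, and may differ from the \(\lambda/P\) convention used in these slides):

import numpy as np
from scipy.spatial.distance import cdist

def laplace_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-|x - y| / sigma), with |.| the Euclidean distance.
    return np.exp(-cdist(X, Y) / sigma)

def krr_fit_predict(X_train, y_train, X_test, sigma=1.0, lam=1e-6):
    # Minimize the ridge-regularized train loss: solve (K + lam * I) alpha = y, then predict.
    K = laplace_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return laplace_kernel(X_test, X_train, sigma) @ alpha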
Failure and success of the spectral bias prediction..., [ICML 22]
Predicting generalization of KRR
[Canatar et al., Nature (2021)]
General framework for KRR
\(\rightarrow\) KRR learns the first \(P\) eigenmodes of \(K\)
\(\rightarrow\) \(f_P\) is self-averaging with respect to sampling
\(\rightarrow\) what is the validity limit?
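Schematically, in this framework the predictor's coefficients on the kernel eigenbasis are filtered mode by mode: writing \(f^*=\sum_\rho c_\rho\,\phi_\rho\), the expected learned coefficients take the form
\[ \mathbb{E}\big[\hat c_\rho\big] \approx \frac{\lambda_\rho}{\lambda_\rho+\kappa(\lambda,P)}\, c_\rho, \]
with \(\kappa\) an effective regularization that decreases with \(P\), so that roughly the top \(P\) eigenmodes (those with \(\lambda_\rho\gtrsim\kappa\)) are learned.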
Our toy model
Depletion of points around the interface
Data: \(x\in\mathbb{R}^d\)
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Motivation: evidence for gaps between clusters in datasets like MNIST
Predictor in the toy model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):
(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):
\(d=1\): different predictions for \(\lambda\rightarrow0^+\)
[Plot vs. the regularizer \(\lambda\): crossover between a spectral-bias-failure regime and a spectral-bias-success regime]
Takeaways and Future directions
For which kind of data does spectral bias fail?
Depletion of points close to decision boundary
Still missing: a comprehensive theory of the KRR test error at vanishing regularization
Test error: 2 regimes
For fixed regularizer \(\lambda/P\):
\(\rightarrow\) Predictor controlled by extreme value statistics of \(x_B\)
\(\rightarrow\) Not self-averaging: no replica theory
(2) For small \(P\): predictor controlled by extremal sampled points:
\(x_B\sim P^{-\frac{1}{\chi+d}}\)
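One way to recover this scaling, assuming the density of points vanishes as \(p(x)\sim|x_1|^{\chi}\) near the interface: the expected number of sampled points within distance \(x_B\) of a point on the interface is
\[ \sim P\int_{|x|\le x_B}|x_1|^{\chi}\,d^dx \sim P\, x_B^{\chi+d}, \]
and requiring it to be of order one gives \(x_B\sim P^{-\frac{1}{\chi+d}}\).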
The self-averaging crossover
\(\rightarrow\) Comparing the two characteristic lengths \(\ell(\lambda,P)\) and \(x_B\):
Different predictions for \(\lambda\rightarrow0^+\)
Technical remarks:
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
first oscillations
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schroedinger-like differential equation
\(\rightarrow\) Its solution yields:
Characteristic scale of predictor \(f_P\), \(d>1\)