Based on:
Umberto Maria Tomasini
\(P\): training set size
\(d\): data-space dimension
Which aspects of real data make them learnable?
'The cat is _____' \(\Rightarrow\) grey
'The cat sat on the sofa' / 'The cat sat on the couch'
A property of data that may be leveraged by networks
[Figure: a cat on a sofa; image by Kawar, Zada et al. 2023]
Our approach:
Is hierarchy learnt by networks?
Synonyms
\(m=2\)
\(L\): depth
[Figure: the model's generative tree; each of the \(m=2\) production rules is drawn with probability \(\frac{1}{2}\)]
Rules Level 1
Rules Level 2
\(\Rightarrow\) Sampled examples
Synonyms
\(L\): depth
(here \(L=2\))
\(s\): patch dimension
(input dimension \(s^L\))
Label 1
Label 2
\(m=2\)
Number of classes: \(n_c = 2\)
'How deep convolutional networks learn hierarchical tasks: the Random Hierarchy Model'
Number of features (colors): \(v=2\)
\(P^*\): number of training points needed to learn the task
\(P^*\sim n_c m^L\)
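To make the model concrete, here is a minimal sketch (ours, not the paper's code) of sampling one input from the Random Hierarchy Model with the parameters above; the names `rules` and `sample` are our own.

import numpy as np

rng = np.random.default_rng(0)

# RHM parameters as on the slide: n_c = m = v = s = 2, L = 2.
n_c, m, v, s, L = 2, 2, 2, 2, 2

# One rule table per level: each symbol has m synonymic rules,
# each producing a patch of s lower-level symbols (values in 0..v-1).
rules = [rng.integers(v, size=(v, m, s)) for _ in range(L)]

def sample(label):
    """Sample one input of dimension s**L for a class label (label < n_c <= v)."""
    symbols = [label]
    for level in range(L):
        expanded = []
        for sym in symbols:
            rule = rng.integers(m)  # pick one of the m synonyms uniformly
            expanded.extend(rules[level][sym, rule])
        symbols = expanded
    return np.array(symbols)

print(sample(label=0))  # e.g. [0 1 1 0]: an input of dimension s**L = 4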
Learning the task happens when synonyms are learnt.
\(S_2\propto \langle\|f_2(x)-f_2(p(x))\|^2\rangle_{x,p}\) (\(p\): synonymic exchange)
Synonyms are also learnt at \(P^*\)
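A hedged sketch of how this sensitivity could be measured empirically; `f2` (the layer-2 representation) and `exchange` (a random synonymic substitution) are placeholders we assume are supplied.

import numpy as np

def synonym_sensitivity(f2, xs, exchange, n_samples=100, seed=0):
    """Estimate S_2 ~ <||f2(x) - f2(p(x))||^2> over inputs x and synonymic
    exchanges p; `f2` and `exchange` are assumed given (placeholders here)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for i in rng.integers(len(xs), size=n_samples):
        diff = f2(xs[i]) - f2(exchange(xs[i], rng))
        total += np.sum(diff ** 2)
    return total / n_samples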
Invariance to diffeomorphisms is central in image classification
Questions
Sparse Random Hierarchy Model
\(\textcolor{blue}{s_0=2}\) (\(s_0\): number of uninformative positions per informative feature)
We consider a variant of a Convolutional Neural Network (CNN) without weight sharing:
Standard CNN:
Locally Connected Network (LCN):
\(P^*\sim \textcolor{blue}{(s_0+1)^L}n_c m^L\)
\(\Rightarrow\) To recover the synonyms and then solve the task, many more training points are needed:
\(P^*_{\text{LCN}}\sim (s_0+1)^L P^*_0\)
\(s_0=1\)
Probability of seeing a signal at a given location:
\(p=\frac{1}{(s_0+1)^L}\)
Invariance to diffeomorphisms is learnt with the task
Synonyms are learnt with the task
The hidden representations become insensitive to the invariances of the task
Since a network reconstructs the hierarchical task by leveraging patch-label correlations, synonyms and their spatial rearrangements are grouped together, yielding invariance to diffeomorphisms.
A hierarchical representation, crucial for achieving good performance, is learnt precisely at the same number of training points at which insensitivity to diffeomorphisms is achieved.
Takeaways
Check whether deep networks trained on real data also learn invariant representations together with the task.
Image by [Kawar, Zada et al. 2023]
Thank you!
[Sclocchi, Favero 24]
BACKUP SLIDES
Invariance to diffeomorphisms is learnt with the task
Synonyms are learnt with the task
Learning the task happens when synonyms are learnt.
Number of training points needed: \(P^*\sim (s_0+1)^L n_c m^L\)
If a network reconstructs the hierarchical task by leveraging local feature-label correlations, then it extends this capability to each equivalent location, yielding invariance to diffeomorphisms.
The learning algorithm
Train loss: kernel ridge regression (KRR)
E.g. Laplacian kernel: \(K(x,y)=e^{-\frac{\|x-y\|}{\sigma}}\)
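A minimal, self-contained KRR sketch with this kernel; the data, the ridge `lam`, and `sigma` below are illustrative placeholders (conventions for the ridge vary; here we use \(K+\lambda I\)).

import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y|| / sigma) for all pairs of rows of X and Y."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, sigma=1.0):
    """Kernel ridge regression: f(x) = k(x, X) (K + lam * I)^{-1} y."""
    K = laplacian_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return laplacian_kernel(X_test, X_train, sigma) @ alpha

X = np.random.default_rng(0).standard_normal((50, 3))
y = np.sign(X[:, 0])                  # the toy label used below
print(krr_fit_predict(X, y, X[:5]))  # predictions on 5 training points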
Fixed Features
Failure and success of Spectral Bias prediction..., [ICML22]
Predicting generalization of KRR
[Canatar et al., Nature (2021)]
General framework for KRR
\(\rightarrow\) KRR learns the first \(P\) eigenmodes of \(K\)
\(\rightarrow\) \(f_P\) is self-averaging with respect to sampling
\(\rightarrow\) what is the validity limit?
Our toy model
Depletion of points around the interface
Data: \(x\in\mathbb{R}^d\)
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Motivation: evidence for gaps between clusters in datasets like MNIST
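A sketch of the toy data; the precise depletion profile \(\propto|x_1|^\chi\) near the interface is our assumption, suggested by the exponents appearing on the next slides.

import numpy as np

def sample_toy(P, d, chi=1.0, seed=0):
    """Sample P points in R^d with label y = sign(x_1) and a depletion of
    density ~ |x_1|^chi around the interface x_1 = 0 (our assumption)."""
    rng = np.random.default_rng(seed)
    # Inverse-CDF sampling: |x_1| = u^(1/(chi+1)) has density ~ |x_1|^chi on [0, 1].
    x1 = rng.random(P) ** (1.0 / (chi + 1.0)) * rng.choice([-1.0, 1.0], size=P)
    x_perp = rng.standard_normal((P, d - 1))  # the remaining coordinates
    return np.column_stack([x1, x_perp]), np.sign(x1)

X, y = sample_toy(P=1000, d=3)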
Predictor in the toy model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):
(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):
\(d=1\)
Different predictions for \(\lambda\rightarrow0^+\)
Crossover at the \(\lambda\) where the two characteristic lengths match: \(\ell(\lambda,P)\sim x_B\)
Spectral bias failure
Spectral bias success
Takeaways and Future directions
For which kinds of data does spectral bias fail?
Depletion of points close to decision boundary
Still missing: a comprehensive theory of the KRR test error at vanishing regularization
Test error: 2 regimes
For fixed regularizer \(\lambda/P\):
\(\rightarrow\) Predictor controlled by extreme value statistics of \(x_B\)
\(\rightarrow\) Not self-averaging: no replica theory
(2) For small \(P\): predictor controlled by the extremal sampled points; the distance \(x_B\) of the closest sampled point to the boundary scales as
\(x_B\sim P^{-\frac{1}{\chi+d}}\)
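A quick numerical check (ours) for \(d=1\), where \(\chi+d=\chi+1\): sampling \(|x|\) from a density \(\propto|x|^\chi\) by inverse-CDF (the same assumed profile as above), the minimum over \(P\) samples tracks \(P^{-1/(\chi+d)}\).

import numpy as np

rng = np.random.default_rng(0)
chi, d, trials = 1.0, 1, 500  # d = 1: the case analysed in the proof

for P in [10**2, 10**3, 10**4]:
    # min distance to the boundary over P samples, averaged over trials
    xB = np.mean([np.min(rng.random(P) ** (1 / (chi + 1))) for _ in range(trials)])
    print(P, round(xB, 4), round(P ** (-1 / (chi + d)), 4))  # same scaling, up to a constant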
The self-averageness crossover
\(\rightarrow\) Comparing the two characteristic lengths \(\ell(\lambda,P)\) and \(x_B\):
Different predictions for \(\lambda\rightarrow0^+\)
Technical remarks:
Scaling Spectral Bias prediction
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
First oscillations at:
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) Its solution yields the characteristic scale of the predictor
Characteristic scale of predictor \(f_P\), \(d>1\)
[Figure: an input \(x\), its diffeomorphism \(\tau(x)\) with output \(f(\tau(x))\), and its noisy version \(x+\eta\) with output \(f(x+\eta)\)]
Hypothesis: better nets are more stable to diffeomorphism perturbations.
Is this hypothesis testable?
We define the relative insensitivity to diffeomorphisms as
\(R_f = \frac{\mathbb{E}_{x,\tau}\|f(\tau(x))-f(x)\|^2}{\mathbb{E}_{x,\eta}\| f(x+\eta)-f(x)\|^2} \)
'Relative stability toward diffeomorphisms indicates performance in deep nets', NeurIPS 2021
[Plot: test error vs. sensitivity to diffeomorphisms; axes run from less to more error and from less to more sensitive]
\(R_f=\frac{D_f}{G_f}\), with
\(\textcolor{blue}{D_f \propto \mathbb{E}_{x,\tau}\|f(\tau(x))-f(x)\|^2} \) (\(\tau\): diffeomorphisms)
\(\textcolor{green}{G_f \propto \mathbb{E}_{x,\eta}\| f(x+\eta)-f(x)\|^2}\) (\(\eta\): random transformations)
with \(x\), \(x_1\) and \(x_2\) from the test set.
Performance correlates with stability to diffeomorphisms.
Performance correlates with sensitivity to noise.
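A sketch (ours) of how \(R_f\) could be estimated; `model` and `diffeo` are placeholders (the actual diffeomorphisms are smooth deformations, which we do not reproduce here), and we match the noise norm to the diffeo displacement so the ratio is scale-free.

import numpy as np

def relative_sensitivity(model, xs, diffeo, n_samples=256, seed=0):
    """Estimate R_f = E||f(tau(x)) - f(x)||^2 / E||f(x + eta) - f(x)||^2.
    The noise eta is drawn with the same norm as the diffeo displacement."""
    rng = np.random.default_rng(seed)
    D, G = [], []
    for i in rng.integers(len(xs), size=n_samples):
        x = xs[i]
        xt = diffeo(x, rng)                  # smooth deformation tau(x)
        eta = rng.standard_normal(x.shape)   # isotropic noise, rescaled below
        eta *= np.linalg.norm(xt - x) / np.linalg.norm(eta)
        D.append(np.sum((model(xt) - model(x)) ** 2))
        G.append(np.sum((model(x + eta) - model(x)) ** 2))
    return np.mean(D) / np.mean(G)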
Questions:
Spatial pooling
Channel pooling
Average pooling can be learned by making filters low-pass.
Channel pooling can be learned by adding filters together appropriately (a toy sketch follows the figure below).
[Figure: responses \(w\cdot x\) of filters \(w\): (1.0, 0.2, 1) on the input \(x\) and (0.2, 1.0, 1) on the rotated input]
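A toy illustration (ours) of both mechanisms with made-up 1D 'images': an averaging (low-pass) filter is invariant to the shift, and the sum of two localized filters is too, even though each alone is not.

import numpy as np

x = np.array([1.0, 0.0, 0.0, 0.0])      # toy input: one active pixel
x_rot = np.roll(x, 2)                    # "rotated" (here: translated) input

w_lowpass = np.full(4, 0.25)             # averaging filter = spatial pooling
print(w_lowpass @ x, w_lowpass @ x_rot)  # 0.25 0.25: invariant response

w_a = np.array([1.0, 0.0, 0.0, 0.0])     # localized filter, position a
w_b = np.array([0.0, 0.0, 1.0, 0.0])     # localized filter, position b
# Each filter alone is sensitive (1.0 vs 0.0); their sum is not: channel pooling.
print(w_a @ x + w_b @ x, w_a @ x_rot + w_b @ x_rot)  # 1.0 1.0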
'How deep convolutional neural networks lose spatial information with training',
[ICLR23 Workshop], [Machine Learning: Science and Technology 2023]
\(R_k = \frac{\mathbb{E}_{x,\tau}\|f_k(\tau(x))-f_k(x)\|^2}{\mathbb{E}_{x,\eta}\| f_k(x+\eta)-f_k(x)\|^2} \)
\(\rightarrow\) Neither is the case: both poolings are learnt
We look at \(R_k\) of the representation \(f_k\) at layer \(k\).
\(d\): distance between active pixels
\(\xi\): characteristic scale
\(d<\xi\,\rightarrow\, y=-1\)
\(d>\xi\,\rightarrow\, y=1\)
Why does \(G_k\) increase?
\(R_k=\frac{D_k}{G_k}\)
\(\textcolor{black}{D_k \propto \mathbb{E}_{x,\tau}\|f_k(\tau(x))-f_k(x)\|^2} \)
\(G_k \propto \mathbb{E}_{x,\eta}\| f_k(x+\eta)-f_k(x)\|^2\)
Sensitivity to noise increases layer by layer:
the positive noise piles up, affecting the representation more and more
ReLU
\(x_i\sim\mathcal{N}(0,1),\) \(i\in\{1,...,N\}\)
\(\rightarrow\frac{1}{N}\sum_{i=1}^N x_i\approx 0\)
\(\rightarrow\frac{1}{N}\sum_{i=1}^N \textcolor{red}{|x_i|>0}\)
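A one-line check (ours) of this mechanism: zero-mean noise cancels on average, but after a ReLU (or the \(|x_i|\) highlighted above) it acquires a positive mean that can pile up layer by layer.

import numpy as np

x = np.random.default_rng(0).standard_normal(100_000)  # noise at one layer
print(x.mean())                 # ~ 0.00: zero-mean noise cancels on average
print(np.maximum(x, 0).mean())  # ~ 0.40 (= 1/sqrt(2*pi)): after ReLU it does not
print(np.abs(x).mean())         # ~ 0.80: same effect for |x_i| as on the slide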
Outline:
Key limitation of the RHM: it is not sparse, so it carries no natural notion of smooth deformations (diffeomorphisms).
To overcome it:
We introduce a model where the features are sparse and hierarchically structured.
\(P^*\sim\textcolor{blue}{ (s_0+1)^2} n_c m^L\)
Task
Sensitivities
\(\Rightarrow\) To recover the synonyms, a number of training points of order \(P^*_0\) is needed, up to a prefactor constant in \(L\), which we observe to be \((s_0+1)^2\).
\(P^*_{\text{CNN}}\sim F^{-2/L} [n_c m^L]\)
\(P^*_{\text{LCN}}\sim \frac{1}{F}s^{L/2}[ n_c m^L]\)
\(F\): relevant fraction of the image
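To make the two scalings concrete, a tiny numeric comparison with made-up parameter values (all constants are ours, for illustration only).

# Illustrative comparison of the two sample-complexity scalings above.
n_c, m, s, L, F = 2, 2, 2, 4, 0.5   # made-up values for illustration

P0 = n_c * m ** L                    # dense-RHM scale n_c * m^L
P_cnn = F ** (-2 / L) * P0           # CNN:  F^(-2/L) * n_c * m^L
P_lcn = (1 / F) * s ** (L / 2) * P0  # LCN:  (1/F) * s^(L/2) * n_c * m^L
print(P_cnn, P_lcn)                  # the LCN needs many more samples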
We consider a variant of a Convolutional Neural Network (CNN) without weight sharing:
Standard CNN:
Locally Connected Network (LCN):
\( \textcolor{blue}{P_1}\): synonymic exchange at layer 1
Learning the task correlates with:
\( \textcolor{blue}{S_{2, 1} \propto \langle\|f_{2}(x) - f_{2}(P_1 x)\|^2 \rangle_{x, P_1}}\)
Is this observation robust? How many training data are needed?
To learn the task:
\(P^*\sim\textcolor{blue}{ (s_0+1)^L} n_c m^L\)
\(\Rightarrow\) To recover the synonyms and then solve the task, many more training points are needed:
\(P^*_{\text{LCN}}\sim (s_0+1)^L P^*_0\)
\(s_0=1\)
Probability of seeing a signal at a given location:
\(p=\frac{1}{(s_0+1)^L}\)
\(\Rightarrow\) the sample complexity becomes polynomial in the input dimension \(d=(s(s_0+1))^L\)
\(P^*_{\text{LCN}}\sim \frac{1}{F}s^{L/2}[ n_c m^L]\)
\(F\): relevant fraction of the image