Based on:
Umberto Maria Tomasini
\(P\): training set size
\(d\): data-space dimension
Which aspects of real data make them learnable?
Cat
The cat is _____ \(\Rightarrow\) grey
A property of data that may be leveraged by networks
sofa
Image by [Kawar, Zada et al. 2023]
The cat sat on the sofa
The cat sat on the couch
Our approach:
Is hierarchy learnt by networks?
Synonyms
\(m=2\)
\(L\): depth
sofa
\(P^*\)
\(P^*\sim n_c m^L\)
Learning the task happens when synonyms are learnt.
\(S_2\propto \langle\|f_2(x)-f_2(p(x))\|^2\rangle_{x,p}\)
Also synonyms learnt at \(P^*\)
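As a concrete estimator, here is a minimal numpy sketch; `f2` and `synonym_swap` are hypothetical stand-ins for the layer-2 representation and the synonymic exchange \(p\):

```python
import numpy as np

def synonym_sensitivity(f2, synonym_swap, xs, n_perm=10, rng=None):
    """Estimate S_2 ~ <||f_2(x) - f_2(p(x))||^2>_{x, p}."""
    rng = rng or np.random.default_rng()
    base = f2(xs)                              # f_2(x) on a test batch
    diffs = []
    for _ in range(n_perm):
        swapped = f2(synonym_swap(xs, rng))    # f_2(p(x)) for a fresh exchange p
        diffs.append(np.sum((base - swapped) ** 2, axis=-1))
    return float(np.mean(diffs))               # average over x and p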
Sparse Hierarchical Model in a nutshell
Check whether deep networks trained on real data also learn invariant representations together with the task.
Image by [Kawar, Zada et al. 2023]
Thank you!
BACKUP SLIDES
Does our model capture all the properties of the structure of data?
Sparse Hierarchical Random Model
\(\textcolor{blue}{s_0=2}\)
Diffeomorphisms
learnt with the task
Synonyms learnt with the task
Questions
What we know about diffeomorphisms
Image by [Kawar, Zada et al. 2023]
Learning the task happens when synonyms are learnt.
Number of training points needed: \(P^*\sim (s_0+1)^L n_c m^L\)
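For concreteness, a one-line evaluation of this scaling at the deck's parameters (the proportionality constant is set to 1, which is an assumption):

```python
def p_star(n_c, m, L, s0):
    # P* ~ (s0 + 1)^L * n_c * m^L, up to a constant prefactor
    return (s0 + 1) ** L * n_c * m ** L

print(p_star(n_c=2, m=2, L=2, s0=2))  # 3^2 * 2 * 2^2 = 72
```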
If a network reconstructs the hierarchical task by leveraging local feature-label correlations, then it extends this capability to each equivalent location, yielding invariance to diffeomorphisms.
[Figure: data in the \((x_1,x_2)\) plane]
The learning algorithm
Train loss: \(\frac{1}{P}\sum_{i=1}^{P}\left(f(x_i)-y_i\right)^2+\frac{\lambda}{P}\|f\|_K^2\)
\(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
E.g. Laplacian Kernel
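A self-contained numpy sketch of KRR with this kernel; the normalization of the train loss above (ridge \(\lambda/P\)) is my assumption, chosen so that "fixed regularizer \(\lambda/P\)" below refers to the coefficient in the loss:

```python
import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-|x - y| / sigma), pairwise over rows of X and Y
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, sigma=1.0):
    # The minimizer of (1/P) sum_i (f(x_i) - y_i)^2 + (lam/P) ||f||_K^2
    # is f(x) = sum_i alpha_i K(x, x_i) with alpha = (K + lam I)^{-1} y.
    P = len(X_train)
    K = laplacian_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(P), y_train)
    return laplacian_kernel(X_test, X_train, sigma) @ alpha
```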
Fixed Features
Failure and success of Spectral Bias prediction..., [ICML22]
Predicting generalization of KRR
[Canatar et al., Nature (2021)]
General framework for KRR
\(\rightarrow\) KRR learns the first \(P\) eigenmodes of \(K\)
\(\rightarrow\) \(f_P\) is self-averaging with respect to sampling
\(\rightarrow\) what is the validity limit?
Our toy model
Depletion of points around the interface
Data: \(x\in\mathbb{R}^d\)
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Motivation:
evidence for gaps between clusters in datasets like MNIST
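A sketch of a generator for this toy model, assuming a data density that vanishes as \(|x_1|^\chi\) at the interface (the exponent \(\chi\) entering the scalings below):

```python
import numpy as np

def sample_toy(P, d, chi, rng=None):
    """x in R^d; density along x_1 vanishes as |x_1|^chi at the interface x_1 = 0."""
    rng = rng or np.random.default_rng()
    # inverse-transform sampling of |x_1| on [0, 1] with p(u) ∝ u^chi
    r = rng.uniform(size=P) ** (1.0 / (chi + 1.0))
    x1 = rng.choice([-1.0, 1.0], size=P) * r
    x_perp = rng.uniform(-1.0, 1.0, size=(P, d - 1))  # uniform transverse directions
    X = np.column_stack([x1, x_perp])
    y = np.sign(X[:, 0])                              # label f*(x) = sign[x_1]
    return X, y
```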
Predictor in the toy model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):
(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):
\(d=1\)
[Diagram: as a function of \(\lambda\), a spectral-bias-failure regime and a spectral-bias-success regime separated by a crossover; the two regimes give different predictions for \(\lambda\rightarrow0^+\)]
Takeaways and Future directions
For which kind of data does spectral bias fail?
Depletion of points close to decision boundary
Still missing: a comprehensive theory for the KRR test error at vanishing regularization
Test error: 2 regimes
For fixed regularizer \(\lambda/P\):
\(\rightarrow\) Predictor controlled by extreme value statistics of \(x_B\)
\(\rightarrow\) Not self-averaging: no replica theory
(2) For small \(P\): predictor controlled by extremal sampled points:
\(x_B\sim P^{-\frac{1}{\chi+d}}\)
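A quick Monte Carlo check of this extreme-value scaling for \(d=1\) (a sketch, reusing the \(|x_1|^\chi\) density assumption above): the closest sampled point to the interface should shrink as \(P^{-1/(\chi+1)}\).

```python
import numpy as np

def mean_x_B(P, chi, n_trials=200, rng=None):
    rng = rng or np.random.default_rng(0)
    # distance of the closest sampled |x_1| to the interface, averaged over datasets
    samples = rng.uniform(size=(n_trials, P)) ** (1.0 / (chi + 1.0))
    return samples.min(axis=1).mean()

chi = 2.0
for P in [100, 1000, 10000]:
    # expect mean x_B ∝ P^(-1/(chi+1)) = P^(-1/3) for chi = 2, d = 1
    print(P, mean_x_B(P, chi))
```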
The self-averaging crossover
\(\rightarrow\) Comparing the two characteristic lengths \(\ell(\lambda,P)\) and \(x_B\):
Different predictions for
\(\lambda\rightarrow0^+\)
Technical remarks:
Scaling Spectral Bias prediction
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
first oscillations
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) Its solution yields:
Characteristic scale of predictor \(f_P\), \(d>1\)
[Figure: an input \(x\) is deformed by a diffeomorphism \(\tau\) into \(\tau(x)\), or perturbed by additive noise \(\eta\) into \(x+\eta\); the outputs \(f(\tau(x))\) and \(f(x+\eta)\) are compared to \(f(x)\)]
We define the relative insensitivity to diffeomorphisms as
Is this hypothesis testable?
\(R_f = \frac{\mathbb{E}_{x,\tau}\|f(\tau(x))-f(x)\|^2}{\mathbb{E}_{x,\eta}\| f(x+\eta)-f(x)\|^2} \)
Hypothesis:
better nets are more stable to diffeo perturbations
'Relative stability toward diffeomorphisms indicates performance in deep nets', NeurIPS 2021
[Scatter plot: test error (− Error to + Error) against diffeomorphism sensitivity (− Sensitive to + Sensitive)]
\(\textcolor{blue}{D_f \propto \mathbb{E}_{x,\tau}\|f(\tau(x))-f(x)\|^2} \)
\(\textcolor{green}{G_f \propto \mathbb{E}_{x,\eta}\| f(x+\eta)-f(x)\|^2}\)
with \(x\), \(x_1\) and \(x_2\) drawn from the test set
Performance correlates with stability to diffeo
\(R_f=\frac{D_f}{G_f}\)
\(R_f = \frac{\textcolor{blue}{\mathbb{E}_{x,\tau}\|f(\tau(x))-f(x)\|^2}}{\textcolor{green}{\mathbb{E}_{x,\eta}\| f(x+\eta)-f(x)\|^2}} \)
diffeomorphisms
random transformation
Performance correlates with sensitivity to noise
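A PyTorch-style sketch of this estimator; `sample_diffeo` is a hypothetical helper, and matching the noise scale `noise_std` to the diffeomorphism displacement is left to the caller:

```python
import torch

def relative_sensitivity(f, xs, sample_diffeo, noise_std, n_draws=16):
    """R_f = E||f(tau(x)) - f(x)||^2 / E||f(x + eta) - f(x)||^2."""
    with torch.no_grad():
        base = f(xs)                                  # f(x) on a test batch
        D = G = 0.0
        for _ in range(n_draws):
            D = D + ((f(sample_diffeo(xs)) - base) ** 2).sum(dim=-1).mean()
            eta = noise_std * torch.randn_like(xs)    # isotropic noise eta
            G = G + ((f(xs + eta) - base) ** 2).sum(dim=-1).mean()
    return (D / G).item()
```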
Questions:
Spatial pooling
Channel pooling
Average pooling can be learned by making filters low-pass
Channel pooling can be learned by properly adding filters together
[Figure: two filters \(w\) give overlaps \(w\cdot x = 1.0,\ 0.2\) on the input \(x\) and \(0.2,\ 1.0\) on the rotated input, so their sum responds equally to both]
'How deep convolutional neural networks lose spatial information with training',
[ICLR23 Workshop], [Machine Learning: Science and Technology 2023]
\(R_k = \frac{\mathbb{E}_{x,\tau}\|f_k(\tau(x))-f_k(x)\|^2}{\mathbb{E}_{x,\eta}\| f_k(x+\eta)-f_k(x)\|^2} \)
[Plot: sensitivity \(R_k\) layer by layer, from − Sensitive to + Sensitive]
\(\rightarrow\) Neither is the case: both poolings are learnt
We look at \(R_k\) of the representation \(f_k\) at layer \(k\).
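The same estimator applies to internal representations; a sketch that extracts \(f_k\) with a PyTorch forward hook (the submodule handle `layer` is a placeholder):

```python
import torch

def layer_representation(model, layer, xs):
    """Return the activation f_k(xs) of a submodule, captured by a forward hook."""
    acts = {}
    handle = layer.register_forward_hook(
        lambda mod, inp, out: acts.__setitem__("f_k", out.flatten(start_dim=1))
    )
    with torch.no_grad():
        model(xs)          # the hook fills acts["f_k"] during this pass
    handle.remove()
    return acts["f_k"]
```

Passing `lambda xs: layer_representation(model, layer, xs)` as `f` to the `relative_sensitivity` sketch above then yields \(R_k\).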
\(d\): distance between active pixels
\(\xi\): characteristic scale
\(d<\xi\,\rightarrow\, y=-1\)
\(d>\xi\,\rightarrow\, y=1\)
Why does \(G_k\) increase?
\(R_k=\frac{D_k}{G_k}\)
\(\textcolor{black}{D_k \propto \mathbb{E}_{x,\tau}\|f_k(\tau(x))-f_k(x)\|^2} \)
\(G_k \propto \mathbb{E}_{x,\eta}\| f_k(x+\eta)-f_k(x)\|^2\)
Sensitivity to noise increases layer by layer:
The positive noise piles up, affecting the representation more and more
ReLU
\(x_i\sim\mathcal{N}(0,1),\) \(i\in\{1,...,N\}\)
\(\rightarrow\frac{1}{N}\sum_{i=1}^N x_i\approx 0\)
\(\rightarrow\frac{1}{N}\sum_{i=1}^N \textcolor{red}{|x_i|}>0\)
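A few-line numerical illustration of this effect (a sketch):

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000)  # x_i ~ N(0, 1)
print(x.mean())                  # ≈ 0: zero-mean noise averages out
print(np.abs(x).mean())          # ≈ 0.80 > 0: but |x_i| does not
print(np.maximum(x, 0).mean())   # ≈ 0.40 > 0: the same effect after a ReLU
```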
Outline:
[Diagram: hierarchical generative rules; each of the \(m=2\) alternatives is chosen with probability \(\frac{1}{2}\)]
Rules Level 1
Rules Level 2
\(\Rightarrow\)
Sampled examples
Synonyms
\(L\): depth
(here \(L=2\))
\(s\): patch dimension
(input dimension \(s^L\))
Label 1
Label 2
\(m=2\)
Number of classes \(n_c = 2\)
'How deep convolutional networks learn hierarchical tasks: the Random Hierarchy Model'
Number of features (colors) \(v\)
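A minimal toy sampler for such a hierarchy (my own simplified implementation, not the paper's code; here rules are drawn with replacement, so the \(m\) tuples need not be distinct):

```python
import numpy as np

def make_rules(n_c, v, m, s, L, rng):
    """rules[l][feature] = m alternative s-tuples of level-(l+1) features."""
    rules = []
    n_feats = n_c                      # the top level expands the class label
    for l in range(L):
        rules.append({f: [tuple(rng.integers(v, size=s)) for _ in range(m)]
                      for f in range(n_feats)})
        n_feats = v
    return rules

def sample(rules, label, rng):
    """Expand a class label down the hierarchy into an s^L input string."""
    level = [label]
    for table in rules:
        level = [sym for f in level
                 for sym in table[f][rng.integers(len(table[f]))]]
    return level

rng = np.random.default_rng(0)
rules = make_rules(n_c=2, v=4, m=2, s=2, L=2, rng=rng)
print(sample(rules, label=1, rng=rng))  # a string of s^L = 4 input features
```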
Key limitation of RHM: every input location is informative, so it has none of the spatial sparsity of real images.
To overcome it:
We introduce a model where the features are sparse and hierarchically structured.
\(P^*\sim\textcolor{blue}{ (s_0+1)^2} n_c m^L\)
Task
Sensitivities
\(\Rightarrow\) To recover the synonyms, of order \(P^*_0\) training points are needed, up to an \(L\)-independent constant, which we observe to be \((s_0+1)^2\).
\(P^*_{\text{CNN}}\sim F^{-2/L} [n_c m^L]\)
\(P^*_{\text{LCN}}\sim \frac{1}{F}s^{L/2}[ n_c m^L]\)
\(F\): image relevant fraction
Thank you!
Image by [Kawar, Zada et al. 2023]
We consider a variant of the Convolutional Neural Network (CNN) that removes weight sharing
Standard CNN:
Locally Connected Network (LCN):
\( \textcolor{blue}{P_1}\): synonymic exchange at layer 1
\( \textcolor{blue}{S_{2, 1} \propto \langle\|f_{2}(x) - f_{2}(P_1 x)\|^2 \rangle_{x, P_1}}\)
Learning the task correlates with:
Is this observation robust? How much training data is needed?
To learn the task:
\(P^*\sim\textcolor{blue}{ (s_0+1)^L} n_c m^L\)
\(\Rightarrow\) To recover the synonyms and then solve the task, many more training points are needed:
\(P^*_{\text{LCN}}\sim (s_0+1)^L P^*_0\)
\(s_0=1\)
Probability to see a signal in a given location:
\(p=\frac{1}{(s_0+1)^L}\)
\(\Rightarrow\) the change remains polynomial in the input dimension \(d=(s(s_0+1))^L\)
\(P^*_{\text{LCN}}\sim \frac{1}{F}s^{L/2}[ n_c m^L]\)
\(F\): image relevant fraction