First Workshop on Physics of Data
6 April 2022
Joint work with A. Sclocchi and M. Wyart, ICML 2022
Supervised Machine Learning (ML)
Example: recognize a gondola from a cruise ship
\(\rightarrow\) Assuming only a generic structure for \(f^*\) (e.g. Lipschitz continuity), the test error decays as \(\epsilon\sim P^{-\beta}\) with:
\(\beta\sim 1/d\)
Curse of Dimensionality
\(\rightarrow\) Images are high-dimensional objects:
E.g. \(32\times 32\) images \(\rightarrow\) \(d=1024\)
\(\rightarrow\) Learning would be impossible!
ML is able to capture structure of data
How does \(\beta\) depend on the structure of the data, the task, and the ML architecture?
Very good performance \(\beta\sim 0.07-0.35\)
[Hestness et al. 2017]
In practice: ML works
We lack a general theory for computing \(\beta\) !
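In practice, \(\beta\) is measured by fitting the empirical learning curve. A minimal sketch of such a fit on synthetic data (the values and noise model here are purely illustrative, not from any cited experiment):

```python
import numpy as np

# Estimate the learning-curve exponent beta from test-error measurements,
# assuming the power-law form eps(P) ~ C * P^(-beta).
# Data here are synthetic, generated with a known exponent.
rng = np.random.default_rng(0)
P = np.logspace(2, 5, 10).astype(int)            # training-set sizes
beta_true = 0.35                                 # exponent used to generate data
eps = 2.0 * P.astype(float) ** (-beta_true)
eps *= np.exp(rng.normal(0, 0.02, size=len(P)))  # small multiplicative noise

# A power law is a straight line in log-log scale: log eps = log C - beta * log P
slope, intercept = np.polyfit(np.log(P), np.log(eps), 1)
beta_est = -slope
print(f"estimated beta = {beta_est:.3f}")
```

The log-log linear fit is the standard way such exponents are extracted from experiments.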
Algorithm:
Kernel Ridge Regression (KRR)
Train loss: \(\frac{1}{P}\sum_{i=1}^{P}\left(f(x_i)-y_i\right)^2+\lambda\,\lVert f\rVert_K^2\)
Motivation:
For \(\lambda=0\): equivalent to neural networks of infinite width with a
specific initialization [Jacot et al. 2018].
E.g. Laplacian kernel: \(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
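A minimal KRR sketch with this Laplacian kernel. The \(\lambda P\) normalization of the ridge term in the linear system is one common convention, assumed here for illustration:

```python
import numpy as np

# Kernel Ridge Regression with the Laplacian kernel K(x,y) = exp(-|x-y|/sigma).
# Closed-form predictor: f(x) = k(x)^T (K + lambda*P*I)^{-1} y
# (the lambda*P normalization of the ridge is an assumed convention).
def laplacian_kernel(X, Y, sigma=1.0):
    # pairwise Euclidean distances between rows of X and rows of Y
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-6, sigma=1.0):
    P = len(X_train)
    K = laplacian_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * P * np.eye(P), y_train)
    return laplacian_kernel(X_test, X_train, sigma) @ alpha

# toy 1d example: regress a smooth target near the ridgeless limit
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0])
X_test = np.linspace(-1, 1, 50)[:, None]
pred = krr_fit_predict(X, y, X_test, lam=1e-8)
err = np.mean((pred - np.sin(3 * X_test[:, 0])) ** 2)
print(f"test MSE = {err:.2e}")
```

Taking `lam` small mimics the ridgeless limit \(\lambda\rightarrow 0^+\) discussed below.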
Dataset MNIST:
\(\rightarrow d=784\)
Low-dimensional representation:
\(\rightarrow d=2\)
Gaps between clusters!
[Paccolat, Spigler, Wyart 2020]
Data: isotropic Gaussian
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Depletion of points around the interface
[Tomasini, Sclocchi, Wyart 2022]
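A hedged sketch of such a depleted dataset, assuming for illustration that the density of the first coordinate vanishes at the decision boundary as \(|x_1|^{\chi}\) on \([-1,1]\), with the remaining coordinates isotropic Gaussian (the exact form of the depletion in the cited model may differ):

```python
import numpy as np

# "Depleted interface" dataset sketch: the first coordinate is drawn with
# density p(x1) ∝ |x1|^chi on [-1, 1] (vanishing at the boundary x1 = 0),
# the remaining d-1 coordinates are isotropic Gaussian, and the label is
# sign(x1).  The power-law form of the depletion is an illustrative assumption.
def sample_depleted(P, d=2, chi=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(0, 1, size=P)
    sign = rng.choice([-1.0, 1.0], size=P)
    # inverse-CDF sampling: |x1| = u^(1/(chi+1)) gives density ∝ |x1|^chi
    x1 = sign * u ** (1.0 / (chi + 1.0))
    X = rng.standard_normal((P, d))
    X[:, 0] = x1
    y = np.sign(x1)
    return X, y

X, y = sample_depleted(10_000, d=2, chi=1.0, rng=np.random.default_rng(2))
# far fewer points near the interface than a uniform sample would have
frac_near = np.mean(np.abs(X[:, 0]) < 0.1)
print(f"fraction with |x1| < 0.1: {frac_near:.3f}")
```

For \(\chi=1\) the fraction of points within \(|x_1|<0.1\) is about \(1\%\), versus \(5\%\) for a uniform density, making the depletion visible.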
Simple models: testing the KRR theories of the literature
Deeper understanding with simple models
Relies on eigendecomposition of \(K\):
Eigenvectors \(\{\phi_{\rho}\}\) and eigenvalues \(\{\lambda_{\rho}\}\)
[Canatar et al., Nature (2021)]
Ridgeless limit \(\lambda\rightarrow0^+\)
Spectral Bias:
KRR first learns the \(P\) modes with largest \(\lambda_\rho\)
\(\rightarrow\) the underlying assumption is that \(f_P\) is
self-averaging with respect to sampling
\(\rightarrow\) obtained by replica theory
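A toy numerical check of this picture, using the empirical Gram matrix of a Laplacian kernel as a proxy for the kernel operator (a sketch, not the replica computation): projecting the target on the top-\(k\) eigenmodes should reduce the residual as \(k\) grows, since modes are ordered by decreasing \(\lambda_\rho\).

```python
import numpy as np

# Diagonalize the empirical Gram matrix of a Laplacian kernel and approximate
# a discontinuous target by its projection on the top-k eigenmodes.  If modes
# are learned in order of decreasing eigenvalue, the truncation error should
# drop monotonically with k.
rng = np.random.default_rng(3)
n = 400
x = np.sort(rng.uniform(-1, 1, n))
K = np.exp(-np.abs(x[:, None] - x[None, :]))     # Laplacian kernel, sigma = 1
evals, evecs = np.linalg.eigh(K)                 # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]       # sort descending

y = np.sign(x)                                   # discontinuous target

def truncation_error(k):
    # relative residual of the projection on the k largest-eigenvalue modes
    coeffs = evecs[:, :k].T @ y
    return np.linalg.norm(y - evecs[:, :k] @ coeffs) / np.linalg.norm(y)

errs = {k: truncation_error(k) for k in (5, 20, 80)}
print(errs)
```

This only illustrates the ordering of modes; the spectral-bias prediction for the test error additionally requires the self-averaging assumption above.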
Predictor in the depleted stripe model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):
(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):
Different predictions for
\(\lambda\rightarrow0^+\)
Crossover at:
\(d=1\) and \(\chi=1\)
Conclusions
For small ridge: the spectral-bias prediction is correct only for \(d\rightarrow\infty\).
Thank you for your attention!
For which kind of data does spectral bias fail?
Classification task \(\pm 1\): a discontinuous function
Depletion of points close to decision boundary
Still missing a comprehensive theory for test error
BACKUP SLIDES
Scaling Spectral Bias prediction
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
first oscillations
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) Its solution yields:
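Schematically, with order-one constants omitted (a sketch under the assumption that the RKHS norm of the Laplacian kernel in \(d=1\) satisfies \(\lVert f\rVert_K^2 \propto \int (f^2+\sigma^2 f'^2)\,dx\); not necessarily the exact equation of the paper), the Euler–Lagrange condition of the population loss reads:

```latex
% Minimizing  \int p(x)\,(f-f^*)^2\,dx + \lambda \|f\|_K^2  over f,
% with  \|f\|_K^2 \propto \int (f^2 + \sigma^2 f'^2)\,dx  (assumed form),
% gives the stationarity condition
\[
  -\lambda\,\sigma^2 f''(x) + \bigl(\lambda + p(x)\bigr)\, f(x)
    = p(x)\, f^*(x),
\]
% a non-homogeneous Schrödinger-like ODE: the balance between the
% \lambda\sigma^2 f'' term and p(x) sets the characteristic length
% over which the predictor varies.
```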
Characteristic scale of predictor \(f_P\), \(d>1\)