First Workshop on Physics of Data
6 April 2022
Joint work with A. Sclocchi and M. Wyart [arXiv:2202.03348]
Supervised Machine Learning (ML)
Example: telling a gondola from a yacht
\(\rightarrow\) Assuming only a simple structure (e.g. smoothness) for \(f^*\): the test error decays as \(\epsilon(P)\sim P^{-\beta}\) with
\(\beta\sim 1/d\)
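One standard heuristic behind this exponent: for a Lipschitz \(f^*\), the error at a test point is set by the distance to the nearest training point, which in \(d\) dimensions scales as
\[
\delta(P)\sim P^{-1/d}
\quad\Rightarrow\quad
\epsilon(P)\sim P^{-\beta},\qquad \beta\sim \frac{1}{d}.
\]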
Curse of Dimensionality
\(\rightarrow\) Images are high-dimensional objects:
E.g. \(32\times 32\) images \(\rightarrow\) \(d=1024\)
\(\rightarrow\) Learning would be impossible: with \(\beta\sim 10^{-3}\), reducing the test error tenfold would require \(\sim 10^{1000}\) times more training data!
ML is able to capture the structure of data
How does \(\beta\) depend on the data structure, the task, and the ML architecture?
Very good performance: \(\beta\sim 0.07\)–\(0.35\)
[Hestness et al. 2017]
In practice: ML works
We lack a general theory for computing \(\beta\)!
Algorithm:
Kernel Ridge Regression (KRR)
Train loss: \(\frac{1}{P}\sum_{i=1}^{P}\left(f(x_i)-f^*(x_i)\right)^2+\frac{\lambda}{P}\,\|f\|_K^2\), with \(\|\cdot\|_K\) the RKHS norm of \(K\)
Motivation:
For \(\lambda=0\): equivalent to neural networks of infinite width with a
specific initialization [Jacot et al. 2018]
E.g. the Laplacian kernel:
\(K(x,y)=e^{-\frac{\|x-y\|}{\sigma}}\)
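A minimal NumPy sketch of KRR with the Laplacian kernel (function names and the \(\lambda/P\) ridge convention are illustrative choices made to match the slides):

import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    # Pairwise Laplacian kernel: K(x, y) = exp(-||x - y|| / sigma)
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def krr_fit(X_train, y_train, lam=1e-3, sigma=1.0):
    # Minimize (1/P) sum_i (f(x_i) - y_i)^2 + (lam/P) ||f||_K^2; by the
    # representer theorem f(x) = sum_j alpha_j K(x, x_j), with
    # alpha = (K + lam * Id)^{-1} y.
    K = laplacian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def krr_predict(X_test, X_train, alpha, sigma=1.0):
    # Evaluate the kernel predictor on test points
    return laplacian_kernel(X_test, X_train, sigma) @ alpha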
[Paccolat, Spigler, Wyart 2020]
Data: isotropic Gaussian
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Depletion of points around the interface: density vanishing as \(p(x)\sim |x_1|^{\chi}\) for \(x_1\rightarrow 0\)
[Tomasini, Sclocchi, Wyart 2022]
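A sketch of a data generator for this model (an illustration only: it assumes a pure power-law depletion \(p\propto|x_1|^{\chi}\) with compact support along \(x_1\), rather than the exact distribution of the paper):

import numpy as np

def sample_stripe(P, d, chi, seed=0):
    # x_1: density vanishing as |x_1|^chi at the interface x_1 = 0,
    # sampled on [-1, 1] by inverse transform; x_perp: standard Gaussian.
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=P)
    x1 = rng.choice([-1.0, 1.0], size=P) * u ** (1.0 / (chi + 1.0))
    X = np.column_stack([x1, rng.standard_normal((P, d - 1))])
    y = np.sign(x1)  # label f*(x) = sign(x_1)
    return X, y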
Simple models: testing KRR theories from the literature
[Canatar et al., Nat. Commun. (2021)]
KRR learns the first \(P\) eigenmodes of \(K\)
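In formulas (a standard statement of this spectral-bias prediction, written here at vanishing ridge): decomposing the target on the kernel eigenbasis, the error is carried by the unlearned modes,
\[
\int K(x,y)\,p(y)\,\phi_\rho(y)\,dy=\lambda_\rho\,\phi_\rho(x),\qquad
f^*=\sum_\rho c_\rho\,\phi_\rho
\;\Rightarrow\;
\epsilon(P)\approx\sum_{\rho>P}c_\rho^{2}.
\]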
Test error: 2 regimes (at fixed regularizer \(\lambda/P\)):
(1) For \(P\rightarrow \infty\): predictor controlled by the characteristic length:
\( \ell(\lambda,P) \sim \left(\frac{ \lambda \sigma}{P}\right)^{\frac{1}{1+d+\chi}}\)
In this regime, the replica prediction works.
(2) For small \(P\): predictor controlled by the extremal sampled points:
\(x_B\sim P^{-\frac{1}{\chi+d}}\)
\(\rightarrow\) Predictor controlled by the extreme value statistics of \(x_B\)
\(\rightarrow\) Not self-averaging: no replica theory
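A Monte Carlo sketch of this extreme-value scaling in \(d=1\), where the exponent above reduces to \(1/(\chi+1)\) (assumes the power-law depleted density \(p(x)\sim|x|^{\chi}\)):

import numpy as np

# Closest sampled point to the interface for p(x) ~ |x|^chi on [0, 1]:
# expect x_B ~ P^(-1/(chi+1)).
chi, trials = 1.0, 500
rng = np.random.default_rng(0)
for P in (10**2, 10**3, 10**4):
    # |x| = u^(1/(chi+1)) samples p(x) ~ x^chi; the min over P points is x_B
    x_B = np.mean([rng.uniform(size=P).min() ** (1.0 / (chi + 1.0))
                   for _ in range(trials)])
    print(f"P={P:>6}  <x_B>={x_B:.4f}  theory={P ** (-1.0 / (chi + 1.0)):.4f}")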
The self-averaging crossover
\(\rightarrow\) Comparing the two characteristic lengths \(\ell(\lambda,P)\) and \(x_B\):
different predictions for \(\lambda\rightarrow0^+\) (see the sketch below)
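Equating the two lengths (simple algebra on the formulas above) locates the crossover ridge:
\[
\left(\frac{\lambda\sigma}{P}\right)^{\frac{1}{1+d+\chi}}\sim P^{-\frac{1}{\chi+d}}
\;\Longrightarrow\;
\lambda^{*}\sim\frac{1}{\sigma}\,P^{-\frac{1}{\chi+d}},
\]
so for \(\lambda\ll\lambda^{*}\) the extremal point \(x_B\) controls the predictor (no self-averaging), while for \(\lambda\gg\lambda^{*}\) the replica prediction applies.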
Conclusions
Technical remarks:
Thank you for your attention.
BACKUP SLIDES
Scaling of the spectral-bias prediction
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
First oscillations at \(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\) and \(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
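Where the WKB form above comes from (a sketch, \(d=1\)): the Laplacian kernel is the Green's function of \(1-\sigma^{2}\partial_x^{2}\), which turns the integral eigenproblem into a Schrödinger-like ODE,
\[
\left(1-\sigma^{2}\partial_x^{2}\right)e^{-|x-y|/\sigma}=2\sigma\,\delta(x-y)
\;\Longrightarrow\;
\lambda_\rho\left(1-\sigma^{2}\partial_x^{2}\right)\phi_\rho(x)=2\sigma\,p(x)\,\phi_\rho(x),
\]
whose WKB solution in the oscillatory region \(2\sigma p(x)\gg\lambda_\rho\) is the \(p^{-1/4}\)-modulated sine/cosine above.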
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) a non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) its solution yields the characteristic length \(\ell(\lambda,P)\) (sketch below)
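A sketch of the \(d=1\) computation (assuming the standard RKHS norm of the Laplacian kernel, \(\|f\|_K^{2}=\frac{1}{2\sigma}\int\!\left(f^{2}+\sigma^{2}f'^{2}\right)dx\)): for \(P\rightarrow\infty\) the train loss becomes \(\int p\,(f-f^{*})^{2}\,dx+\frac{\lambda}{P}\|f\|_K^{2}\), and stationarity gives
\[
p(x)\left(f(x)-f^{*}(x)\right)+\frac{\lambda}{2\sigma P}\left(f(x)-\sigma^{2}f''(x)\right)=0 .
\]
With \(p(x)\sim|x|^{\chi}\) near the interface, balancing the two terms at \(x\sim\ell\) gives \(\ell^{\chi+2}\sim\lambda\sigma/P\), i.e. \(\ell\sim(\lambda\sigma/P)^{\frac{1}{\chi+2}}\), which is the exponent \(\frac{1}{1+d+\chi}\) at \(d=1\).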
Characteristic scale of predictor \(f_P\), \(d>1\)