Supervised Machine Learning (ML)
General high-dimensional arguments: very small \(\beta\sim 1/d\)
In practice: very good performance
ML is able to capture the structure of the data
How does \(\beta\) depend on the data structure, the task and the ML architecture?
Curse of Dimensionality
\(\varepsilon_t\sim P^{-\beta}\)
A bridge to Kernel Methods
Neural networks with infinite width, specific initialization
\(\varepsilon_t\) equivalence
[Jacot et al. 2018]
Kernel methods
Kernel Methods: a brief overview
Pros:
Cons:
1. High computational cost
Kernel Ridge Regression (KRR)
True function: \(f^*(x)\)
Data: \(x\sim p(x)\)
Train loss: \(\frac{1}{P}\sum\limits_{i=1}^{P}\big(f(x_i)-f^*(x_i)\big)^2+\frac{\lambda}{P}\,\|f\|_K^2\)
Test error: \(\varepsilon_t=\mathbb{E}_{x\sim p(x)}\big[\big(f_P(x)-f^*(x)\big)^2\big]\)
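A minimal sketch of this setup in code (my illustration, not the talk's implementation; the Gaussian kernel, the target \(f^*\) and all sizes are placeholder choices): solve the standard KRR linear system \((K+\lambda I)^{-1}y\) and estimate \(\varepsilon_t\) on fresh test points.

```python
# Minimal KRR sketch (illustration only; kernel, target and sizes are placeholders).
import numpy as np

def kernel(X, Y, sigma=1.0):
    # Gaussian kernel as a placeholder choice
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
P, lam = 200, 1e-3                       # training-set size and ridge lambda
f_star = np.sin                          # placeholder true function f*(x)
x_tr = rng.standard_normal(P)            # data x ~ p(x), here a standard Gaussian
y_tr = f_star(x_tr)

alpha = np.linalg.solve(kernel(x_tr, x_tr) + lam * np.eye(P), y_tr)  # KRR coefficients

x_te = rng.standard_normal(5000)                                     # fresh test points
eps_t = np.mean((kernel(x_te, x_tr) @ alpha - f_star(x_te)) ** 2)    # test error estimate
print(f"estimated test error: {eps_t:.3e}")
```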
[Paccolat, Spigler, Wyart 2020]
Data: isotropic Gaussian
Label: \(f^*(x_1,x_{\bot})=\text{sign}(x_1)\)
Depletion of points around the interface
[Tomasini, Sclocchi, Wyart 2022]
Simple models: testing KRR theories from the literature
Deeper understanding with simple models
[Bordelon et al. 2020] [Canatar et al. 2020] [Loureiro et al. 2021]
These theories rely on the eigendecomposition of \(K\):
Eigenvectors \(\{\phi_{\rho}\}\) and eigenvalues \(\{\lambda_{\rho}\}\)
\( 2b>(a-1)\)
Replica calculation + Gaussian approximation
Spectral Bias:
KRR first learns the \(P\) modes with largest \(\lambda_\rho\)
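For reference, the decomposition behind this statement (standard Mercer notation, consistent with the \(c_\rho\) used later in the talk): the eigenfunctions are orthonormal with respect to \(p(x)\) and the target is expanded as
\[
f^*(x)=\sum_{\rho}c_\rho\,\phi_\rho(x),\qquad
c_\rho=\int f^*(x)\,\phi_\rho(x)\,p(x)\,dx ,
\]
so the simplified spectral-bias estimate of the test error keeps only the unlearned modes, \(\varepsilon_B\approx\sum_{\rho>P}c_\rho^2\).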
What happens in our context?
\(f^*(x_1)=\text{sign}(x_1)\)
\(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
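A hedged sketch of this \(d=1\) experiment (my own illustration, not the talk's code; the sampling recipe, constants and sizes are assumptions): points are drawn with a density depleted as \(|x|^\chi\) near the interface, labels are \(\mathrm{sign}(x)\), and ridgeless Laplace-kernel KRR is fit at several \(P\) to read off the exponent of \(\varepsilon_t\).

```python
# Sketch of the d=1 experiment (illustration only, not the talk's code).
# Assumptions: data on [-1, 1] with density p(x) ~ |x|^chi near the interface x=0,
# labels f*(x) = sign(x), Laplace kernel, tiny ridge as a proxy for lambda -> 0+.
import numpy as np

rng = np.random.default_rng(1)
chi, sigma, lam = 1.0, 1.0, 1e-6

def sample(n):
    # inverse-CDF sampling: |x| = u^{1/(chi+1)} has density ~ |x|^chi on [0, 1]
    u = rng.random(n)
    return rng.choice([-1.0, 1.0], size=n) * u ** (1.0 / (chi + 1.0))

def laplace(X, Y):
    # Laplace kernel K(x, y) = exp(-|x - y| / sigma)
    return np.exp(-np.abs(X[:, None] - Y[None, :]) / sigma)

x_te = sample(20000)
y_te = np.sign(x_te)

for P in [64, 128, 256, 512, 1024]:
    x_tr = sample(P)
    y_tr = np.sign(x_tr)
    alpha = np.linalg.solve(laplace(x_tr, x_tr) + lam * np.eye(P), y_tr)
    eps_t = np.mean((laplace(x_te, x_tr) @ alpha - y_te) ** 2)
    print(P, eps_t)   # expectation from the talk: eps_t ~ P^{-1} in the ridgeless limit
```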
A physical intuition
For fixed \(\lambda/P\):
Test error for \(\lambda\rightarrow0^+\):
What do we expect?
The predictor \(f_P\) is governed by the two extremal points of the sample, \(x_A\) and \(x_B\): the training points closest to the interface on either side
\(\varepsilon_t \sim \int_0^{x_B}x^\chi dx \sim P^{-1}\)
For \(\lambda\rightarrow0^+\) and large \(P\)
\(\varepsilon_t\sim P^{-1}\) for all \(\chi\ge0\)
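Filling in the scaling step (assuming, as above, a density depleted as \(p(x)\sim|x_1|^\chi\) near the interface): the closest point to the right of the interface sits where the expected number of points is of order one,
\[
P\int_0^{x_B}x^{\chi}\,dx\sim 1
\;\Rightarrow\;
x_B\sim P^{-\frac{1}{\chi+1}},
\qquad
\varepsilon_t\sim\int_0^{x_B}x^{\chi}\,dx\sim x_B^{\chi+1}\sim P^{-1}.
\]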
Spectral Bias prediction: eigendecomposition
Eigenvectors \(\phi_{\rho}\) satisfy a Schrödinger-like differential equation (ODE; see the sketch below); for small \(\lambda_\rho\):
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\)
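A sketch of where this ODE and its WKB form come from (my reconstruction, constants dropped): since \(e^{-|x-y|/\sigma}\) is, up to a factor \(2\sigma\), the Green's function of \(1-\sigma^2\partial_x^2\), applying this operator to the eigenvalue equation \(\int K(x,y)\,\phi_\rho(y)\,p(y)\,dy=\lambda_\rho\phi_\rho(x)\) gives
\[
-\lambda_\rho\,\sigma^2\,\phi_\rho''(x)+\lambda_\rho\,\phi_\rho(x)=2\sigma\,p(x)\,\phi_\rho(x),
\]
and a WKB treatment of this equation for small \(\lambda_\rho\) yields the oscillatory form above.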
Getting the coefficients \(c_\rho\) for small \(\lambda_\rho\):
\(|c_{\rho}|\sim \lambda_\rho^{\frac{\frac{3}{4}\chi+1}{\chi+2}}\)
From the boundary condition \(|\phi_\rho(x)|\rightarrow 0\) for \(|x|\rightarrow \infty\):
\(\lambda_\rho\sim\rho^{-2}\)
Spectral bias prediction:
\(\varepsilon_B\approx \sum\limits_{\rho>P}c^2_{\rho}\approx P^{-1-\frac{\chi}{\chi+2}} \)
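Two consistency checks of these scalings (my own, constants dropped): the WKB quantization of the phase gives the eigenvalue decay, and squaring the coefficient scaling and summing the unlearned modes recovers \(\varepsilon_B\),
\[
\frac{1}{\sqrt{\lambda_\rho}}\int p^{1/2}(z)\,dz\sim\rho
\;\Rightarrow\;\lambda_\rho\sim\rho^{-2},
\qquad
c_\rho^{2}\sim\lambda_\rho^{\frac{\frac{3}{2}\chi+2}{\chi+2}}\sim\rho^{-\frac{3\chi+4}{\chi+2}},
\]
\[
\varepsilon_B\approx\sum_{\rho>P}c_\rho^{2}\sim P^{\,1-\frac{3\chi+4}{\chi+2}}=P^{-1-\frac{\chi}{\chi+2}} .
\]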
Different predictions:
\(\varepsilon_B\sim P^{-1-\frac{\chi}{\chi+2}}\neq \varepsilon_t \sim P^{-1}\)
\(d=1\)
\(\chi=1\)
What happens for larger \(\lambda\)?
Increasing \(\lambda/P\) \(\rightarrow\) Increasing \(\ell(\lambda,P)\)
\(\rightarrow\) The predictor \(f_P\) is controlled by \(\ell(\lambda,P)\) even for small \(P\)
\(\rightarrow\) The predictor \(f_P\) is self-averaging
What happens for larger \(\lambda\)?
The self-averageness crossover
\( \ell(\lambda,P) \sim \left(\frac{ \lambda \sigma}{P}\right)^{\frac{1}{2+\chi}}\)
\(x_B\sim P^{-\frac{1}{\chi+1}}\)
\(\rightarrow\) Crossover:
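The crossover ridge follows from equating the two scales (constants dropped):
\[
\ell(\lambda,P)\sim x_B
\;\Longleftrightarrow\;
\left(\frac{\lambda\sigma}{P}\right)^{\frac{1}{2+\chi}}\sim P^{-\frac{1}{\chi+1}}
\;\Rightarrow\;
\lambda^{*}\sim\sigma^{-1}\,P^{-\frac{1}{\chi+1}} .
\]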
Self-averageness crossover
Higher dimension setting:
Test Error
Self-averageness crossover, \(d>1\)
Comparing \(r_\text{min}\) and \(\ell(\lambda,P)\):
\(\lambda^*_{d,\chi}\sim P^{-\frac{1}{d+\chi}}\)
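A quick consistency check (mine):
\[
d=1:\qquad \lambda^{*}_{1,\chi}\sim P^{-\frac{1}{1+\chi}},
\]
which matches the \(d=1\) crossover obtained from \(\ell(\lambda,P)\sim x_B\).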
Higher dimension setting:
Test Error (ridgeless)
Fitting CIFAR10
Conclusions
To note:
Thank you for your attention.
BACKUP SLIDES
Scaling Spectral Bias prediction
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
first oscillations
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) Its solution yields the characteristic scale \(\ell(\lambda,P)\) (see the sketch below)
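A hedged reconstruction of this step (constants dropped; it assumes the Laplace-kernel RKHS norm \(\|f\|_K^2=\frac{1}{2\sigma}\int\big(\sigma^2 f'(x)^2+f(x)^2\big)dx\)): stationarity of the \(P\to\infty\) loss \(\int (f-f^*)^2\,p\,dx+\frac{\lambda}{P}\|f\|_K^2\) gives
\[
-\frac{\lambda\sigma}{P}\,f''(x)+\frac{\lambda}{\sigma P}\,f(x)+2\,p(x)\big(f(x)-f^*(x)\big)=0 ,
\]
and balancing the derivative term against \(p(x)\sim|x|^{\chi}\) near the interface, \(\frac{\lambda\sigma}{P}\,\ell^{-2}\sim\ell^{\chi}\), gives \(\ell(\lambda,P)\sim\left(\frac{\lambda\sigma}{P}\right)^{\frac{1}{2+\chi}}\), the scale quoted earlier.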
Characteristic scale of predictor \(f_P\), \(d>1\)