Failure and success of the spectral bias prediction
for Kernel Ridge Regression:
the case of low-dimensional data
First Workshop on Physics of Data
6 April 2022
Joint work with A. Sclocchi and M. Wyart, ICML 2022

Supervised Machine Learning (ML)
- Used to learn a rule from data.
- Learn a target function \(f^*\) from \(P\) examples \(\{x_i,f^*(x_i)\}\).

Example: distinguishing a gondola from a cruise ship

Curse of Dimensionality
- Key object: generalization error \(\varepsilon_t\) on new data
- Typically \(\varepsilon_t\sim P^{-\beta}\)
- \(\beta\) quantifies how many samples \(P\) are needed to achieve a certain error \(\varepsilon_t\) (see the fitting sketch below)
\(\rightarrow\) Assuming only a simple structure (e.g. Lipschitz continuity) for \(f^*\): \(\beta\sim 1/d\)
\(\rightarrow\) Images are high-dimensional objects:
E.g. \(32\times 32\) images \(\rightarrow\) \(d=1024\)
\(\rightarrow\) Learning would be impossible!
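A minimal sketch (not from the slides) of how \(\beta\) is typically estimated in practice: a power-law fit of measured test errors against the training-set size. The numbers below are placeholders.

    import numpy as np

    # Placeholder measurements: test error eps_t at several training-set sizes P.
    P = np.array([500, 1000, 2000, 4000, 8000])
    eps_t = np.array([0.30, 0.24, 0.19, 0.15, 0.12])

    # eps_t ~ P^{-beta}  =>  log(eps_t) = -beta * log(P) + const.
    slope, _ = np.polyfit(np.log(P), np.log(eps_t), 1)
    print(f"estimated beta = {-slope:.2f}")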
ML is able to capture structure of data
How does \(\beta\) depend on the data structure, the task and the ML architecture?
Very good performance \(\beta\sim 0.07-0.35\)
[Hestness et al. 2017]
In practice: ML works
We lack a general theory for computing \(\beta\) !
Algorithm:
Kernel Ridge Regression (KRR)
- Predictor \(f_P\) linear in the non-linear kernel \(K\): \(f_P(x)=\sum_{i=1}^{P}\alpha_i K(x,x_i)\)
- Train loss: mean squared error on the \(P\) training points, plus a ridge penalty of strength \(\lambda\)
Motivation:
For \(\lambda=0\): equivalent to neural networks with infinite width at a specific initialization [Jacot et al. 2018].
E.g. Laplacian kernel: \(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\) (a minimal numerical sketch follows below)
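As an illustration (not part of the talk), a minimal KRR sketch with a Laplacian kernel. The ridge convention below (adding \(\lambda P\) to the kernel matrix, matching a \(1/P\)-normalized loss), the target \(\mathrm{sign}(x_1)\) and all sizes are placeholder choices.

    import numpy as np

    def laplacian_kernel(X, Y, sigma=1.0):
        # K(x, y) = exp(-|x - y| / sigma), pairwise over the rows of X and Y.
        dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
        return np.exp(-dists / sigma)

    def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, sigma=1.0):
        # Solve (K + lambda * P * Id) alpha = y, then f_P(x) = sum_i alpha_i K(x, x_i).
        P = len(X_train)
        K = laplacian_kernel(X_train, X_train, sigma)
        alpha = np.linalg.solve(K + lam * P * np.eye(P), y_train)
        return laplacian_kernel(X_test, X_train, sigma) @ alpha

    # Toy usage: learn f*(x) = sign(x_1) on isotropic Gaussian data in d = 2.
    rng = np.random.default_rng(0)
    X_tr, X_te = rng.normal(size=(1000, 2)), rng.normal(size=(2000, 2))
    y_tr = np.sign(X_tr[:, 0])
    mse = np.mean((krr_fit_predict(X_tr, y_tr, X_te) - np.sign(X_te[:, 0])) ** 2)
    print(f"test error: {mse:.3f}")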
Looking for toy models of real data


Dataset MNIST:
- Label: the depicted digit (an integer from 0 to 9)
- 70'000 pictures of size \(28 \times 28\)
\(\rightarrow d=784\)
Low-dimensional representation:
- Method: t-SNE
\(\rightarrow d=2\)
Gaps between clusters!
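A minimal sketch of such a 2d embedding (assuming scikit-learn; the talk does not specify the implementation, and the subsample size is a placeholder):

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.manifold import TSNE

    # Load MNIST (70'000 images, 28x28 = 784 pixels) and subsample for speed.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    idx = np.random.default_rng(0).choice(len(X), size=5000, replace=False)

    # Embed into d = 2: digit clusters appear, separated by depleted regions.
    X_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X[idx] / 255.0)
    print(X_2d.shape)  # (5000, 2)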
Depleted Stripe Model

[Paccolat, Spigler, Wyart 2020]
Data: isotropic Gaussian
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Depletion of points around the interface \(x_1=0\), controlled by an exponent \(\chi\) (sampling sketch below)
[Tomasini, Sclocchi, Wyart 2022]
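A sketch of how such data can be sampled, assuming the depleted density takes the form \(p(x_1)\propto|x_1|^{\chi}e^{-x_1^2/2}\) with the other coordinates standard Gaussian; the exact form used in the references may differ.

    import numpy as np

    def sample_depleted_stripe(P, d, chi, rng=None):
        # Assumed model: p(x_1) ∝ |x_1|^chi * exp(-x_1^2 / 2), x_perp standard Gaussian,
        # label f*(x) = sign(x_1). chi = 0 recovers an isotropic Gaussian.
        rng = rng or np.random.default_rng()
        # |x_1| = sqrt(2 G) with G ~ Gamma((chi+1)/2) has density ∝ |x_1|^chi exp(-x_1^2/2).
        g = rng.gamma(shape=(chi + 1) / 2, scale=1.0, size=P)
        x1 = rng.choice([-1.0, 1.0], size=P) * np.sqrt(2.0 * g)
        x_perp = rng.normal(size=(P, d - 1))
        return np.column_stack([x1, x_perp]), np.sign(x1)

    # chi > 0 depletes points near the interface x_1 = 0.
    X, y = sample_depleted_stripe(P=10_000, d=2, chi=1.0, rng=np.random.default_rng(0))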

Simple models: testing the KRR literature theories
- They work very well on some real data
- Yet, we do not know why or when they work, nor how general their success is.
Deeper understanding with simple models
General framework for KRR (1/2)
Relies on eigendecomposition of \(K\):
Eigenvectors \(\{\phi_{\rho}\}\) and eigenvalues \(\{\lambda_{\rho}\}\)
[Canatar et al., Nature (2021)]
Ridgeless limit \(\lambda\rightarrow0^+\)
Spectral Bias:
KRR first learns the \(P\) modes with largest \(\lambda_\rho\)
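For reference, a standard form of the decomposition assumed here (a sketch; conventions may differ slightly from the talk): the eigenfunctions are defined with respect to the data density \(p(x)\),
\[\int K(x,y)\,\phi_\rho(y)\,p(y)\,dy=\lambda_\rho\,\phi_\rho(x),\qquad f^*(x)=\sum_\rho c_\rho\,\phi_\rho(x),\quad c_\rho=\mathbb{E}_{x\sim p}\!\left[f^*(x)\,\phi_\rho(x)\right],\]
and spectral bias states that, in the ridgeless limit, \(f_P\) essentially retains the \(P\) modes with the largest \(\lambda_\rho\).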
General framework for KRR (2/2)
\(\rightarrow\) the underlying assumption is that \(f_P\) is
self-averaging with respect to sampling
\(\rightarrow\) obtained by replica theory
Predictor in the depleted stripe model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):

(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):
- \(f_P\) is controlled by the statistics of the extremal points \(x_B\)
- spectral bias breaks down.
Different predictions for \(\lambda\rightarrow0^+\):
- For \(\chi=0\): the two predictions coincide
- For \(\chi>0\): they coincide only for \(d\rightarrow\infty\)

Crossover at:
\(d=1\) and \(\chi=1\)
Conclusions
- For which kind of data does spectral bias fail?
- Classification tasks \(\pm 1\): a discontinuous target function
- Depletion of points close to the decision boundary
- For small ridge: the spectral bias prediction is correct only for \(d\rightarrow\infty\).
- Still missing a comprehensive theory for the test error.
Thank you for your attention!
BACKUP SLIDES

Scaling of the Spectral Bias prediction
Fitting CIFAR10


Proof:
- WKB approximation of \(\phi_\rho\) in [\(x_1^*,\,x_2^*\)]:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
- MAF approximation outside [\(x_1^*,\,x_2^*\)]
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
- WKB contribution to \(c_\rho\) is dominant in \(\lambda_\rho\)
- Main source of the WKB contribution: the first oscillations
Formal proof:
- Take training points \(x_1<...<x_P\)
- Find the predictor in \([x_i,x_{i+1}]\)
- Estimate its contribution \(\varepsilon_i\) to \(\varepsilon_t\)
- Sum all the \(\varepsilon_i\) (numerical sketch below)
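A numerical sketch (not the formal proof) of the same bookkeeping in \(d=1\): fit near-ridgeless KRR on sorted training points and accumulate the error contribution \(\varepsilon_i\) of each interval. Kernel width, sizes and the tiny stabilizing ridge are placeholder choices.

    import numpy as np

    rng = np.random.default_rng(0)
    P, sigma, tiny_ridge = 200, 1.0, 1e-10

    # Sorted 1d training points from a standard Gaussian, target f*(x) = sign(x).
    x_tr = np.sort(rng.normal(size=P))
    y_tr = np.sign(x_tr)

    K = np.exp(-np.abs(x_tr[:, None] - x_tr[None, :]) / sigma)
    alpha = np.linalg.solve(K + tiny_ridge * np.eye(P), y_tr)  # (near-)ridgeless KRR

    def f_P(x):
        return np.exp(-np.abs(x[:, None] - x_tr[None, :]) / sigma) @ alpha

    # eps_i: Gaussian-weighted squared error on [x_i, x_{i+1}] (tails outside the
    # extremal training points are ignored in this sketch).
    eps_t = 0.0
    for xl, xr in zip(x_tr[:-1], x_tr[1:]):
        grid = np.linspace(xl, xr, 50)
        weight = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
        eps_t += np.trapz((f_P(grid) - np.sign(grid)) ** 2 * weight, grid)
    print(f"test error (sum over intervals): {eps_t:.4f}")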
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schroedinger-like differential equation
\(\rightarrow\) Its solution yields:
Characteristic scale of predictor \(f_P\), \(d>1\)
- Let's consider the predictor \(f_P\) minimizing the train loss for \(P \rightarrow \infty\).
- With the Green function \(G\) satisfying:
- In Fourier space:
- Two regimes:
- \(G_\eta(x)\) has a scale:
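For context (not from the slides): for the Laplacian kernel \(K(x,y)=e^{-|x-y|/\sigma}\) used above, the standard Fourier-transform result is, up to a \(d\)-dependent constant,
\[\tilde K(k)\;\propto\;\frac{\sigma^{-1}}{\left(\sigma^{-2}+k^2\right)^{\frac{d+1}{2}}}\;\sim\;k^{-(d+1)}\quad\text{for }k\sigma\gg1,\]
i.e. its spectrum decays as a power law at large \(k\).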