Failure and success of the spectral bias prediction
for Kernel Ridge Regression:
the case of low-dimensional data
Supervised Machine Learning (ML)
- Used to learn a rule from data.
- Learn from \(P\) examples \(\{x_i,y_i\}\) a rule \(f_P(x)\).
- Key object: generalization error \(\varepsilon_t\) on new data
- Typically \(\varepsilon_t\sim P^{-\beta}\)
General high-dimensional arguments: very small \(\beta\sim 1/d\)
In practice: very good performance
ML is able to capture the structure of data
How does \(\beta\) depend on the data structure, the task, and the ML architecture?
Curse of Dimensionality
\(\varepsilon_t\sim P^{-\beta}\)
A bridge with Kernel Methods
Neural networks with infinite width and a specific initialization
are equivalent (same \(\varepsilon_t\)) to kernel methods
[Jacot et al. 2018]
Kernel Methods: a brief overview
- Predictor \(f_P\) is linear in the (non-linear) kernel \(K\): \(f_P(x)=\sum_{i=1}^{P}a_i\,K(x_i,x)\)
Pros:
- Coefficients \(a_i\) found with a convex minimisation problem
- Freedom to choose the kernel \(K\)
Cons:
- High computational cost
Kernel Ridge Regression (KRR)
True function: \(f^*(x)\)
Data: \(x\sim p(x)\)
Train loss: \(\sum_{i=1}^{P}\left(f(x_i)-y_i\right)^2+\lambda\,\|f\|_{K}^{2}\)
Test error: \(\varepsilon_t=\mathbb{E}_{x\sim p(x)}\!\left[\left(f_P(x)-f^*(x)\right)^2\right]\)
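For reference, the standard closed-form KRR minimizer (a textbook fact; the normalization of \(\lambda\) is a convention and may differ from the one used here):
\(f_P(x)=\sum_{i=1}^{P}a_i\,K(x,x_i),\qquad a=\left(\mathbb{K}+\lambda\,\mathbb{1}\right)^{-1}y,\qquad \mathbb{K}_{ij}=K(x_i,x_j)\)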
Can we apply KRR
to realistic toy models of data?

- MNIST: t-SNE visualization for dimensionality reduction
- From \(28\times28=784\) dimensions to 2, with \(10^5\) samples (image from towardsdatascience.com)
Depleted Stripe Model

[Paccolat, Spigler, Wyart 2020]
Data: isotropic Gaussian in \(d\) dimensions
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Depletion of points around the interface \(x_1=0\): \(p(x)\sim|x_1|^{\chi}\) for small \(|x_1|\)
[Tomasini, Sclocchi, Wyart 2022]
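A minimal sketch of one way to sample such data (assuming the depleted density \(p(x_1)\propto|x_1|^{\chi}e^{-x_1^2/2}\); the function name and exact normalization are illustrative):
```python
import numpy as np

def sample_depleted_stripe(P, d, chi, rng=np.random.default_rng(0)):
    """Sample P points in d dimensions with density depleted as |x_1|^chi near x_1 = 0."""
    # |x1|^chi * exp(-x1^2/2) on x1 > 0 becomes, with t = x1^2, a chi-square(chi + 1) density,
    # so x1 = (random sign) * sqrt(chi-square sample).
    x1 = rng.choice([-1.0, 1.0], size=P) * np.sqrt(rng.chisquare(chi + 1.0, size=P))
    x_perp = rng.standard_normal((P, d - 1))   # remaining directions: isotropic Gaussian
    X = np.column_stack([x1, x_perp])
    y = np.sign(X[:, 0])                       # label f*(x) = sign(x_1)
    return X, y
```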

Simple models: testing the KRR literature theories
- They work very well on real data
- Yet, their validity limit is not clear
- What are the real-data features that allow their success?
Deeper understanding with simple models
General framework for regression
[Bordelon et al. 2020] [Canatar et al. 2020] [Loureiro et al. 2021]
Relies on eigendecomposition of \(K\):
Eigenvectors \(\{\phi_{\rho}\}\) and eigenvalues \(\{\lambda_{\rho}\}\)
\( 2b>(a-1)\)
- Noiseless setting
- Ridgeless limit \(\lambda\rightarrow0^+\)
Replica calculation + Gaussian approximation
Spectral Bias:
KRR first learns the \(P\) modes with largest \(\lambda_\rho\)
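A sketch of how this prediction can be evaluated numerically: diagonalize the kernel against the data measure (Monte-Carlo with \(N\) reference points), project \(f^*\) on the eigenvectors, and sum the squared coefficients of the modes beyond rank \(P\). Names and normalizations here are illustrative assumptions.
```python
import numpy as np

def laplace_kernel(X, Y, sigma=1.0):
    # Laplace kernel K(x, y) = exp(-|x - y| / sigma)
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dist / sigma)

def spectral_bias_error(X_ref, y_ref, P, sigma=1.0):
    """Spectral-bias estimate: sum of squared coefficients of the modes beyond rank P."""
    N = len(X_ref)
    G = laplace_kernel(X_ref, X_ref, sigma) / N    # Monte-Carlo kernel operator
    lam, U = np.linalg.eigh(G)                     # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]                  # sort modes by decreasing lambda_rho
    U = U[:, order]
    c = U.T @ y_ref / np.sqrt(N)                   # c_rho = <f*, phi_rho> under p(x)
    return np.sum(c[P:] ** 2)
```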
What happens in our context?
Target: \(f^*(x_1)=\text{sign}(x_1)\)
Laplace kernel: \(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
- Rigorous approach for \(d=1\)
- Scaling arguments for \(d>1\)
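A minimal end-to-end sketch for \(d=1\): sample depleted data, run near-ridgeless Laplace-kernel KRR, and estimate the learning-curve exponent \(\beta\) from a log-log fit. Sample sizes, the small jitter standing in for \(\lambda\rightarrow 0^+\), and function names are illustrative assumptions.
```python
import numpy as np

def eps_t(P, chi=1.0, sigma=1.0, n_test=4000, jitter=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    # d = 1 depleted data: x = (random sign) * sqrt(chi-square(chi + 1)), label = sign(x)
    x_tr = rng.choice([-1.0, 1.0], P) * np.sqrt(rng.chisquare(chi + 1.0, P))
    x_te = rng.choice([-1.0, 1.0], n_test) * np.sqrt(rng.chisquare(chi + 1.0, n_test))
    K = np.exp(-np.abs(x_tr[:, None] - x_tr[None, :]) / sigma)
    a = np.linalg.solve(K + jitter * np.eye(P), np.sign(x_tr))   # near-ridgeless KRR
    f_te = np.exp(-np.abs(x_te[:, None] - x_tr[None, :]) / sigma) @ a
    return np.mean((f_te - np.sign(x_te)) ** 2)                  # test error eps_t

Ps = [128, 256, 512, 1024, 2048]
errs = [np.mean([eps_t(P, seed=s) for s in range(10)]) for P in Ps]
beta = -np.polyfit(np.log(Ps), np.log(errs), 1)[0]               # eps_t ~ P^(-beta)
print(f"measured beta ~ {beta:.2f}")
```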


A physical intuition
For fixed \(\lambda/P\):
Test error for \(\lambda\rightarrow0^+\): what do we expect?
The predictor \(f_P\) is governed by two extremal points in the sample
\(x_A\) and \(x_{B}\)
\(\varepsilon_t \sim \int_0^{x_B}x^\chi dx \sim P^{-1}\)

For \(\lambda\rightarrow0^+\) and large \(P\)
\(\varepsilon_t\sim P^{-1}\) for all \(\chi\ge0\)
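A sketch of the scaling argument, using the depleted density \(p(x)\sim x^{\chi}\) near the interface: the closest sample \(x_B\) satisfies
\(P\int_0^{x_B}x^{\chi}\,dx\sim 1\;\Rightarrow\;x_B\sim P^{-\frac{1}{\chi+1}}\),
and the predictor crosses zero somewhere in \([x_A,x_B]\), so the \(O(1)\) error there is weighted by the data mass of that interval:
\(\varepsilon_t\sim\int_0^{x_B}x^{\chi}\,dx\sim x_B^{\chi+1}\sim P^{-1}\).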
Spectral Bias prediction: eigendecomposition
- Eigenvectors \(\phi_{\rho}\) satisfy a Schrödinger-like differential equation (ODE).
- Solve the ODE via the Wentzel–Kramers–Brillouin (WKB) approximation:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\)
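A sketch of where the Schrödinger-like ODE comes from (assuming the standard reduction of the Laplace-kernel integral eigenproblem; constant prefactors such as \(2/\sigma\) are absorbed in the WKB phase above). Differentiating \(\int e^{-|x-y|/\sigma}\,\phi_\rho(y)\,p(y)\,dy=\lambda_\rho\,\phi_\rho(x)\) twice in \(x\) gives
\(\lambda_\rho\,\phi_\rho''(x)=\frac{\lambda_\rho}{\sigma^2}\,\phi_\rho(x)-\frac{2}{\sigma}\,p(x)\,\phi_\rho(x)\);
for \(\lambda_\rho\rightarrow0\) the term \(\propto p(x)/\lambda_\rho\) dominates, and WKB applied to \(\phi_\rho''+\frac{2\,p(x)}{\lambda_\rho\sigma}\,\phi_\rho\approx0\) yields the oscillatory form above.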
Getting the coefficients \(c_\rho\) for small \(\lambda_\rho\)
- Compute \(|c_\rho|=\int_{-\infty}^{\infty}f^*(x)\phi_\rho(x)p(x)\,dx\) at leading order in \(\lambda_\rho\):
\(|c_{\rho}|\sim \lambda_\rho^{\frac{\frac{3}{4}\chi+1}{\chi+2}}\)
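A sketch of this scaling, assuming (as in the backup slides) that the dominant contribution comes from the first WKB oscillation, i.e. from \(x\lesssim x_1^*\sim\lambda_\rho^{1/(\chi+2)}\), where \(\phi_\rho\sim p^{-1/4}\sim x^{-\chi/4}\) and \(p\sim x^{\chi}\):
\(|c_\rho|\sim\int_0^{x_1^*}x^{\chi}\,x^{-\chi/4}\,dx\sim (x_1^*)^{\frac{3}{4}\chi+1}\sim\lambda_\rho^{\frac{\frac{3}{4}\chi+1}{\chi+2}}\)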
Spectral bias prediction
- From the boundary condition \(|\phi_\rho(x)|\rightarrow 0\) for \(|x|\rightarrow \infty\): \(\lambda_\rho\sim\rho^{-2}\)
- \(\varepsilon_B\approx \sum\limits_{\rho>P}c^2_{\rho}\approx P^{-1-\frac{\chi}{\chi+2}}\)
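Combining the two scalings above (a one-line check of the exponent):
\(c_\rho^2\sim\lambda_\rho^{\frac{\frac{3}{2}\chi+2}{\chi+2}}\sim\rho^{-\frac{3\chi+4}{\chi+2}}\;\Rightarrow\;\varepsilon_B\approx\sum_{\rho>P}c_\rho^2\sim P^{\,1-\frac{3\chi+4}{\chi+2}}=P^{-1-\frac{\chi}{\chi+2}}\)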
Different predictions:
\(\varepsilon_B\sim P^{-1-\frac{\chi}{\chi+2}}\neq \varepsilon_t \sim P^{-1}\)

[Figure: \(\varepsilon_t\) vs \(P\), \(d=1\), \(\chi=1\)]
What happens for larger \(\lambda\)?


Increasing \(\lambda/P\) \(\rightarrow\) Increasing \(\ell(\lambda,P)\)
\(\rightarrow\) The predictor \(f_P\) is controlled by \(\ell(\lambda,P)\) even for small \(P\)
\(\rightarrow\) The predictor \(f_P\) is self-averaging
The self-averageness crossover
- Characteristic length of \(f_P\) for \(P\rightarrow \infty\) and \(\frac{\lambda}{P}\) fixed (\( \ell(\lambda,P)\) wins):
\( \ell(\lambda,P) \sim \left(\frac{ \lambda \sigma}{P}\right)^{\frac{1}{2+\chi}}\)
- Characteristic length of \(f_P\) for \(\lambda\rightarrow 0^+\) (\( x_B\) wins):
\(x_B\sim P^{-\frac{1}{\chi+1}}\)
\(\rightarrow\) Crossover: set \(\ell(\lambda^*,P)\sim x_B\) (sketch below)
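Equating the two characteristic lengths (keeping only the scaling in \(P\)):
\(\left(\frac{\lambda^*\sigma}{P}\right)^{\frac{1}{2+\chi}}\sim P^{-\frac{1}{\chi+1}}\;\Rightarrow\;\lambda^*\sim\frac{1}{\sigma}\,P^{-\frac{1}{\chi+1}}\),
consistent with the general \(\lambda^*_{d,\chi}\sim P^{-\frac{1}{d+\chi}}\) at \(d=1\).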

Self-averageness crossover


Higher dimension setting: Test Error
- Ridgeless: \(f_P\) fluctuates on a distance \(r_\text{min}\sim P^{-1/(d+\chi)}\) (heuristic estimate after this list)
- Finite ridge:
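A heuristic estimate of \(r_\text{min}\) (an assumption: take it as the typical distance from a point of the interface to the nearest training point; the depleted direction contributes \(r^{\chi+1}\) to the data mass of a ball of radius \(r\), the \(d-1\) orthogonal directions contribute \(r^{d-1}\)):
\(P\,r_\text{min}^{\,d-1}\,r_\text{min}^{\,\chi+1}\sim1\;\Rightarrow\;r_\text{min}\sim P^{-\frac{1}{d+\chi}}\)
(for \(d=1\) this recovers \(x_B\sim P^{-\frac{1}{\chi+1}}\)).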

Self-averageness crossover, \(d>1\)
Comparing \(r_\text{min}\) and \(\ell(\lambda,P)\):
\(\lambda^*_{d,\chi}\sim P^{-\frac{1}{d+\chi}}\)

Higher dimension setting: Test Error (ridgeless)
- For \(\chi=0\): \(\varepsilon_t\) and the spectral-bias prediction \(\varepsilon_B\) agree
- For \(\chi>0\): they agree only for \(d\rightarrow\infty\)


Fitting CIFAR10
Conclusions
- Replica/Random Matrix Theory predictions work even for small \(d\), for large ridge.
- For small ridge: the spectral bias prediction, if \(\chi>0\), is correct only for \(d\rightarrow\infty\).
- Vanishing density of data points on the boundary: outside the Gaussian universality class.
To note:
- \(\phi_\rho(x)\sim x^{-\chi/4}\) for small \(x\): \(P(\phi)\sim \phi^{-5-\frac{4}{\chi}}\) (sketch below)
- Eigenvectors not independent: all large for small \(x\)
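A sketch of the heavy tail, combining \(\phi_\rho(x)\sim x^{-\chi/4}\) with the depleted density \(p(x)\sim x^{\chi}\) near the interface:
\(x\sim\phi^{-4/\chi}\;\Rightarrow\;P(\phi)\sim p(x)\left|\frac{dx}{d\phi}\right|\sim\phi^{-4}\,\phi^{-\frac{4}{\chi}-1}=\phi^{-5-\frac{4}{\chi}}\)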
Thank you for your attention.
BACKUP SLIDES

Scaling Spectral Bias prediction

Proof:
- WKB approximation of \(\phi_\rho\) in [\(x_1^*,\,x_2^*\)]:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
- MAF approximation outside [\(x_1^*,\,x_2^*\)]
\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)
\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)
- The WKB contribution to \(c_\rho\) is dominant at leading order in \(\lambda_\rho\)
- Main source of the WKB contribution: the first oscillations
Formal proof:
- Take training points \(x_1<...<x_P\)
- Find the predictor in \([x_i,x_{i+1}]\)
- Estimate the contribution \(\varepsilon_i\) to \(\varepsilon_t\)
- Sum all the \(\varepsilon_i\)
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation
\(\rightarrow\) Its solution yields the characteristic scale \(\ell(\lambda,P)\sim\left(\frac{\lambda\sigma}{P}\right)^{\frac{1}{2+\chi}}\)
Characteristic scale of predictor \(f_P\), \(d>1\)
- Let's consider the predictor \(f_P\) minimizing the train loss for \(P \rightarrow \infty\).
- With the Green function \(G\) satisfying:
- In Fourier space:
- Two regimes:
- \(G_\eta(x)\) has a scale: