Failure and success of the spectral bias prediction

for Kernel Ridge Regression:

the case of low-dimensional data

First Workshop on Physics of Data

6 April 2022

Joint work with A. Sclocchi and M. Wyart, ICML 2022

Supervised Machine Learning (ML)

  • Used to learn a rule from data.
  • Learn a target function \(f^*\) from \(P\) examples \(\{x_i,f^*(x_i)\}\).

Example: distinguishing a gondola from a cruise ship

Curse of Dimensionality

  • Key object: generalization error \(\varepsilon_t\) on new data
  • Typically \(\varepsilon_t\sim P^{-\beta}\)
  • \(\beta\) quantifies how many samples \(P\) are needed to achieve a certain error \(\varepsilon_t\)
  • Typical distance between nearest training points: \(\delta\sim P^{-1/d}\)

\(\rightarrow\) Assuming a simple structure for \(f^*\) (e.g. Lipschitz continuity):

\(\beta\sim 1/d\)

\(\rightarrow\) Images are high-dimensional objects:

E.g. \(32\times 32\) images \(\rightarrow\) \(d=1024\)

\(\rightarrow\) Learning would be impossible! With \(\beta\sim 1/d\), halving the error would require \(2^{1024}\) times more data.

ML is able to capture the structure of data

How does \(\beta\) depend on the data structure, the task and the ML architecture?

In practice: ML works

Very good performance: \(\beta\sim 0.07\)–\(0.35\) [Hestness et al. 2017]

We lack a general theory for computing \(\beta\)!

Algorithm:

Kernel Ridge Regression (KRR)

  • Predictor \(f_P\) linear in the non-linear kernel \(K\):
f_P(x)=\sum_{i=1}^P a_i K(x_i,x)
  • Train loss:
\min_{f_P}\left[\sum\limits_{i=1}^P\left|f^*(x_i)-f_P(x_i)\right|^2 +\lambda \|f_P\|_K^2\right]

Motivation:

For \(\lambda=0\): equivalent to neural networks of infinite width with a specific initialization [Jacot et al. 2018].

E.g. Laplacian kernel:

\(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
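For concreteness, a minimal NumPy sketch of KRR with this kernel (function names and defaults are mine, not from the talk; the ridgeless limit corresponds to `lam` \(\rightarrow 0^+\)):

```python
import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    """Laplacian kernel K(x, y) = exp(-|x - y| / sigma)."""
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, sigma=1.0):
    """Minimizing sum_i |f*(x_i) - f_P(x_i)|^2 + lam * ||f_P||_K^2 over
    f_P(x) = sum_i a_i K(x_i, x) gives coefficients a = (K + lam*I)^{-1} y."""
    K = laplacian_kernel(X_train, X_train, sigma)
    a = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return laplacian_kernel(X_test, X_train, sigma) @ a
```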

Looking for toy models of real data

Dataset MNIST:

  • Label: integer number
  • 70'000 pictures of size \(28 \times 28\)

\(\rightarrow d=784\)

Low-dimensional representation:

  • Method: t-SNE

\(\rightarrow d=2\)

 

Gaps between clusters!

Depleted Stripe Model

[Paccolat, Spigler, Wyart 2020]

Data: isotropic Gaussian

Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)

Depletion of points around the interface

p(x_1)= \frac{1}{\mathcal{Z}}\color{red}{|x_1|^\chi}\color{black} e^{-x_1^2}

[Tomasini, Sclocchi, Wyart 2022]
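A minimal sampler for this model (my implementation; it uses that \(x_1^2\sim\text{Gamma}\big(\frac{\chi+1}{2},1\big)\) under \(p(x_1)\propto|x_1|^\chi e^{-x_1^2}\), with Gaussian transverse coordinates of variance \(1/2\) to match the \(e^{-x^2}\) convention, which may differ from the paper's normalization):

```python
import numpy as np

def sample_depleted_stripe(P, d, chi, rng=None):
    """P points with x_1 ~ |x_1|^chi * exp(-x_1^2) (depleted interface),
    the other d-1 coordinates Gaussian, and labels f*(x) = sign(x_1)."""
    rng = rng or np.random.default_rng()
    u = rng.gamma((chi + 1.0) / 2.0, 1.0, size=P)            # u = x_1^2
    x1 = rng.choice([-1.0, 1.0], size=P) * np.sqrt(u)        # symmetric sign
    x_perp = rng.normal(0.0, np.sqrt(0.5), size=(P, d - 1))  # variance 1/2
    return np.column_stack([x1, x_perp]), np.sign(x1)
```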

Simple models: testing KRR theories from the literature

  • They work very well on some real data.
  • Yet, we do not know why or when they work, nor how general their success is.

Deeper understanding with simple models

General framework for KRR (1/2)

Relies on eigendecomposition of \(K\):

\int p(y) K(y,x)\phi_{\rho}(y)dy = \lambda_\rho\phi_{\rho}(x)

Eigenfunctions \(\{\phi_{\rho}\}\) and eigenvalues \(\{\lambda_{\rho}\}\), sorted by decreasing \(\lambda_\rho\).

Decomposing the target on this basis:

f^*(x)=\sum\limits_{\rho=1}^{\infty} \color{blue}{c_{\rho}}\color{black}\phi_{\rho}(x)

Spectral Bias:

KRR first learns the \(P\) modes with largest \(\lambda_\rho\)

In the ridgeless limit \(\lambda\rightarrow0^+\) [Canatar et al., Nature (2021)]:

\varepsilon_B \approx \sum\limits_{\rho>P}\color{blue}c^2_{\rho}
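To make the prediction concrete, a small Monte Carlo sketch (conventions are mine): diagonalize the Gram matrix on a large sample from \(p\) as a Nystrom-style approximation of the eigenproblem above, then evaluate the tail sum:

```python
import numpy as np

def spectral_bias_error(X, y, kernel, P_values):
    """Estimate eps_B ≈ sum_{rho > P} c_rho^2 from a large sample X ~ p:
    (1/n) * Gram matrix approximates the kernel's integral operator."""
    n = len(X)
    lam, U = np.linalg.eigh(kernel(X, X) / n)
    U = U[:, np.argsort(lam)[::-1]]      # modes sorted by decreasing eigenvalue
    c = U.T @ y / np.sqrt(n)             # c_rho, with phi_rho(x_i) ≈ sqrt(n) U[i, rho]
    return {P: float(np.sum(c[P:] ** 2)) for P in P_values}

# e.g. with the sketches above: spectral_bias_error(X, y, laplacian_kernel, [64, 256])
```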

General framework for KRR (2/2)

\(\rightarrow\) the underlying assumption is that \(f_P\) is self-averaging with respect to sampling

\(\rightarrow\) obtained by replica theory

Predictor in the depleted stripe model

(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \(\ell(\lambda,P)\) set by the ratio \(\lambda/P\)

For fixed regularizer \(\lambda/P\):

(2) When the number of sampling points \(P\) is not enough to probe \( \ell(\lambda,P) \):

  1. \(f_P\) is controlled by the statistics of the extremal points \(x_B\)
  2. spectral bias breaks down. 
Different predictions in the ridgeless limit \(\lambda\rightarrow0^+\):

\varepsilon_B \sim P^{-(1+\frac{1}{d})\frac{1+\chi}{1+d+\chi}}\ \neq\ \varepsilon_t \sim P^{-\frac{1+\chi}{d+\chi}}

  1. For \(\chi=0\): the two predictions are equal
  2. For \(\chi>0\): equal only for \(d\rightarrow\infty\)

Crossover at:

\lambda^*_{d,\chi}\sim P^{-\frac{1}{d+\chi}}

(Figure: \(d=1\), \(\chi=1\))
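A rough numerical check of the ridgeless test-error exponent, reusing the sketches above (assumptions mine: a tiny fixed ridge emulates \(\lambda\rightarrow0^+\) below the crossover \(\lambda^*\), and averaging over draws tames sampling noise):

```python
import numpy as np

chi, d, lam = 1.0, 1, 1e-8                 # d = 1, chi = 1; tiny ridge
Ps, errs = [64, 128, 256, 512], []
for P in Ps:
    trials = []
    for _ in range(10):                    # average over training-set draws
        X, y = sample_depleted_stripe(P, d, chi)
        Xt, yt = sample_depleted_stripe(4096, d, chi)
        trials.append(np.mean((krr_fit_predict(X, y, Xt, lam=lam) - yt) ** 2))
    errs.append(np.mean(trials))

beta = -np.polyfit(np.log(Ps), np.log(errs), 1)[0]   # eps_t ~ P^{-beta}
print(f"measured beta ≈ {beta:.2f}; "
      f"extremal-point prediction (1+chi)/(d+chi) = {(1 + chi) / (d + chi):.2f}")
```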

Conclusions

  • For which kind of data does spectral bias fail?
  • Classification task \(\pm 1\): a discontinuous target function
  • Depletion of points close to the decision boundary
  • For small ridge: the spectral bias prediction is correct only for \(d\rightarrow\infty\)
  • Still missing a comprehensive theory for the test error

Thank you for your attention!

BACKUP SLIDES

Scaling Spectral Bias prediction

Fitting CIFAR10

Proof:

  • WKB approximation of \(\phi_\rho\) in [\(x_1^*,\,x_2^*\)]:

 \(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)

  • MAF approximation outside [\(x_1^*,\,x_2^*\)]

\(x_1^*\sim \lambda_\rho^{\frac{1}{\chi+2}}\)

\(x_2^*\sim (-\log\lambda_\rho)^{1/2}\)

  • The WKB contribution to \(c_\rho\) is dominant in \(\lambda_\rho\)
  • Main source of the WKB contribution: the first oscillations

Formal proof:

  1. Take training points \(x_1<...<x_P\)
  2. Find the predictor in \([x_i,x_{i+1}]\)
  3. Estimate its contribution \(\varepsilon_i\) to \(\varepsilon_t\)
  4. Sum all the \(\varepsilon_i\)

Characteristic scale of predictor \(f_P\), \(d=1\)

Minimizing the train loss for \(P \rightarrow \infty\) yields:

\sigma^2 \partial_x^2 f_P(x) =\left(\frac{\sigma}{\lambda/P}p(x)+1\right)f_P(x)-\frac{\sigma}{\lambda/P}p(x)f^*(x)

\(\rightarrow\) A non-homogeneous Schrödinger-like differential equation

\(\rightarrow\) Its solution yields:

\ell(\lambda,P)\sim \left(\frac{\lambda\sigma}{P}\right)^{\frac{1}{(2+\chi)}}
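One heuristic way to read off this scale (my sketch, dropping \(O(1)\) constants): near the interface \(p(x)\sim|x|^\chi\), and \(f_P\) varies over a scale \(\ell\), so \(\partial_x^2 f_P\sim f_P/\ell^2\); balancing the two sides of the equation,

\frac{\sigma^2}{\ell^2} \sim \frac{\sigma}{\lambda/P}\,p(\ell)\sim \frac{\sigma}{\lambda/P}\,\ell^{\chi} \quad\Longrightarrow\quad \ell^{2+\chi}\sim \frac{\lambda\sigma}{P}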

Characteristic scale of predictor \(f_P\), \(d>1\)

  • Let's consider the predictor \(f_P\) minimizing the train loss for \(P \rightarrow \infty\):
f_P(x)=\int d^d\eta\ \frac{p(\eta) f^*(\eta)}{\lambda/P}\, G(x,\eta)
  • With the Green function \(G\) satisfying:
\int d^dy K^{-1}(x-y) G_{\eta}(y) = \frac{p(x)}{\lambda/P} G_{\eta}(x) + \delta(x-\eta)
  • In Fourier space:
\mathcal{F}[K](q)^{-1} \mathcal{F}[G_{\eta}](q) = \frac{1}{\lambda/P} \mathcal{F}[p\ G_{\eta}](q) + e^{-i q \eta}
  • Two regimes:
\begin{aligned} \mathcal{F}[G](q)&\sim q^{-1-d}\ \ \ \text{for}\ \ \ q\gg q_c\\ \mathcal{F}[G](q)&\sim \frac{\lambda}{P}q^\chi\ \ \ \text{for}\ \ \ q\ll q_c\\ \text{with}\ \ \ q_c&\sim \left(\frac{\lambda}{P}\right)^{-\frac{1}{1+d+\chi}} \end{aligned}
  • \(G_\eta(x)\) has a scale:
\ell(\lambda,P)\sim 1/q_c\sim \left(\frac{\lambda}{P}\right)^{\frac{1}{1+d+\chi}}
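The crossover \(q_c\) follows from matching the two regimes at \(q=q_c\) (a one-line check):

q_c^{-1-d} \sim \frac{\lambda}{P}\,q_c^{\chi} \quad\Longrightarrow\quad q_c \sim \left(\frac{\lambda}{P}\right)^{-\frac{1}{1+d+\chi}}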
