@ICML (2022)
Joint work with A. Sclocchi and M. Wyart
The learning algorithm
Train loss: \(\mathcal{L}[f]=\frac{1}{P}\sum_{i=1}^{P}\bigl(f(x_i)-y_i\bigr)^2+\frac{\lambda}{P}\,\|f\|_{K}^2\)
E.g. Laplacian kernel: \(K(x,y)=e^{-\frac{|x-y|}{\sigma}}\)
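A minimal numpy sketch of this setup (variable names and the \(\lambda/P\) ridge convention are mine, matching the loss above):

```python
import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-|x - y| / sigma), with |.| the Euclidean norm
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-d / sigma)

def krr_fit(X_train, y_train, lam=1e-3, sigma=1.0):
    # Minimizes (1/P) sum_i (f(x_i) - y_i)^2 + (lam/P) ||f||_K^2,
    # whose solution is alpha = (K + lam * I)^{-1} y
    P = len(X_train)
    K = laplacian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(P), y_train)

def krr_predict(X_test, X_train, alpha, sigma=1.0):
    # Predictor f(x) = sum_i alpha_i K(x, x_i)
    return laplacian_kernel(X_test, X_train, sigma) @ alpha
```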
Predicting generalization of KRR
[Canatar et al., Nat. Commun. (2021)]
General framework for KRR
\(\rightarrow\) KRR learns the first \(P\) eigenmodes of \(K\)
\(\rightarrow\) \(f_P\) is self-averaging with respect to sampling
\(\rightarrow\) what is this framework's limit of validity?
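A numerical illustration of the first statement, estimating the kernel eigenmodes from a Gram matrix (Nyström) and truncating a target onto its first \(P\) modes; all sizes and the example target are my choices:

```python
# Nystrom estimate of the kernel eigenmodes, then a rank-P truncation of a
# target onto the top modes -- a numerical cartoon of
# "KRR learns the first P eigenmodes of K". Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N, P, sigma = 2000, 50, 1.0                # N reference points, P retained modes
X = rng.standard_normal((N, 1))
f_star = np.sign(X[:, 0])                  # example target (the toy model below)

K = np.exp(-np.abs(X[:, None, 0] - X[None, :, 0]) / sigma)

# Eigenmodes of the kernel operator, estimated on the sample:
evals, evecs = np.linalg.eigh(K / N)       # ascending eigenvalues
evals, evecs = evals[::-1], evecs[:, ::-1] # sort descending

# Coefficients of f* on the modes, and its truncation to the first P:
c = evecs.T @ f_star
f_trunc = evecs[:, :P] @ c[:P]             # the "first P eigenmodes" part of f*
```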
Our toy model
Depletion of points around the interface
Data: \(x\in\mathbb{R}^d\)
Label: \(f^*(x_1,x_{\bot})=\text{sign}[x_1]\)
Motivation:
evidence for gaps between clusters in datasets like MNIST
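A sketch of a sampler for this toy model. The precise density is an assumption here: \(p(x_1)\propto|x_1|^{\chi}e^{-x_1^2/2}\) with depletion exponent \(\chi\) (my notation), and standard Gaussian \(x_{\bot}\):

```python
# Sampler for the toy model. ASSUMPTION: depleted density
# p(x_1) ~ |x_1|^chi * exp(-x_1^2 / 2) near the interface, with standard
# Gaussian orthogonal coordinates.
import numpy as np

def sample_toy(P, d, chi=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # |x_1| has density ~ x^chi e^{-x^2/2}, i.e. x_1^2 ~ Gamma((chi+1)/2, scale=2)
    x1 = np.sqrt(rng.gamma((chi + 1.0) / 2.0, scale=2.0, size=P))
    x1 *= rng.choice([-1.0, 1.0], size=P)       # symmetric about the interface
    x_perp = rng.standard_normal((P, d - 1))    # orthogonal directions
    X = np.column_stack([x1, x_perp])
    y = np.sign(X[:, 0])                        # label f*(x) = sign(x_1)
    return X, y
```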
Predictor in the toy model
(1) Spectral bias predicts a self-averaging predictor controlled by a characteristic length \( \ell(\lambda,P) \propto \lambda/P \)
For fixed regularizer \(\lambda/P\):
(2) When the number of sampling points \(P\) is too small to probe \(\ell(\lambda,P)\), the predictor is controlled by the individual sampled points and the spectral-bias prediction fails (heuristic below).
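A back-of-the-envelope version of this condition (my heuristic, assuming \(p(x_1)\sim|x_1|^{\chi}\) near the interface in \(d=1\)):

```latex
% Typical distance from the interface to the nearest of P samples,
% if p(x_1) \sim |x_1|^{\chi} near x_1 = 0:
\[
  P \int_0^{\delta} x^{\chi}\,\mathrm{d}x \;\sim\; 1
  \quad\Longrightarrow\quad
  \delta(P) \;\sim\; P^{-\frac{1}{\chi+1}} .
\]
% Spectral bias can hold only if the predictor's scale resolves this gap,
\[
  \ell(\lambda,P) \;\gtrsim\; \delta(P),
\]
% which, at fixed P, separates the two regimes at a crossover ridge \lambda.
```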
[Figure: \(d=1\) test error vs. ridge \(\lambda\) — spectral bias and the measured predictor give different predictions for \(\lambda\rightarrow0^+\); spectral bias fails at small \(\lambda\), succeeds at larger \(\lambda\), with a crossover at a characteristic value of \(\lambda\).]
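A numerical sketch of this \(d=1\) crossover: sweep the ridge toward zero and watch the test error of Laplacian KRR on the toy data (all parameter values are arbitrary choices):

```python
# Sweep the ridge toward zero and measure the test error of Laplacian KRR
# on the d=1 toy data; parameter values below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1)
P, sigma, chi = 256, 1.0, 1.0

def sample(n):
    # depleted density |x|^chi e^{-x^2/2}, as in the sampler above
    x = np.sqrt(rng.gamma((chi + 1) / 2, scale=2.0, size=n))
    x *= rng.choice([-1.0, 1.0], size=n)
    return x, np.sign(x)

x_tr, y_tr = sample(P)
x_te, y_te = sample(2048)
K = np.exp(-np.abs(x_tr[:, None] - x_tr[None, :]) / sigma)
K_te = np.exp(-np.abs(x_te[:, None] - x_tr[None, :]) / sigma)

for lam in [1e0, 1e-2, 1e-4, 1e-6, 1e-8]:
    alpha = np.linalg.solve(K + lam * np.eye(P), y_tr)
    print(f"lambda={lam:.0e}  test MSE={np.mean((K_te @ alpha - y_te) ** 2):.4f}")
```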
Conclusions
For small ridge, the spectral-bias prediction is correct only in the limit \(d\rightarrow\infty\).
Thank you for your attention!
For which kinds of data does spectral bias fail?
Those with a depletion of points close to the decision boundary
A comprehensive theory of the KRR test error at vanishing regularization is still missing
BACKUP SLIDES
Scaling of the spectral-bias prediction
Fitting CIFAR10
Proof:
\(\phi_\rho(x)\sim \frac{1}{p(x)^{1/4}}\left[\alpha\sin\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)+\beta \cos\left(\frac{1}{\sqrt{\lambda_\rho}}\int^{x}p^{1/2}(z)\,dz\right)\right]\)
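Where this WKB form comes from (my reconstruction; constants such as \(2/\sigma\) are absorbed into the phase):

```latex
% The Laplace kernel is a Green's function:
\[
  \bigl(-\partial_x^2 + \sigma^{-2}\bigr)\, e^{-|x-y|/\sigma}
  \;=\; \tfrac{2}{\sigma}\,\delta(x-y),
\]
% so applying this operator to the eigenproblem
% \int K(x,y)\, p(y)\, \phi_\rho(y)\, \mathrm{d}y = \lambda_\rho\, \phi_\rho(x)
% turns it into a Schr\"odinger-like equation:
\[
  -\phi_\rho''(x)
  + \Bigl(\sigma^{-2} - \tfrac{2\,p(x)}{\sigma\,\lambda_\rho}\Bigr)\phi_\rho(x)
  \;=\; 0 .
\]
% For \lambda_\rho \to 0 its WKB solution is the oscillatory form above:
% amplitude \propto p(x)^{-1/4}, phase \propto \lambda_\rho^{-1/2}\int^x p^{1/2}(z)\,\mathrm{d}z.
```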
First oscillations of \(\phi_\rho\) at:
\(x_1^{*}\sim \lambda_\rho^{\frac{1}{\chi+2}}\) (near the interface), \(x_2^{*}\sim (-\log\lambda_\rho)^{1/2}\) (in the Gaussian tail)
Formal proof:
Characteristic scale of predictor \(f_P\), \(d=1\)
Minimizing the train loss for \(P \rightarrow \infty\):
\(\rightarrow\) An inhomogeneous Schrödinger-like differential equation (sketch below)
\(\rightarrow\) Its solution yields the characteristic scale of \(f_P\)
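A sketch of that variational problem and the resulting equation (my reconstruction; the RKHS-norm identity for the Laplace kernel in \(d=1\) is standard, and the ridge convention follows the train loss above):

```latex
% P -> infinity limit of the train loss, at fixed regularizer \lambda/P:
\[
  \mathcal{L}_\infty[f]
  = \int \bigl(f(x)-f^*(x)\bigr)^2 p(x)\,\mathrm{d}x
  + \frac{\lambda}{P}\,\|f\|_{K}^2,
  \qquad
  \|f\|_{K}^2 = \frac{1}{2}\int\!\Bigl(\sigma f'(x)^2 + \frac{f(x)^2}{\sigma}\Bigr)\mathrm{d}x
\]
% (the norm identity holds for K = e^{-|x-y|/\sigma} in d = 1).
% Stationarity, \delta\mathcal{L}_\infty/\delta f = 0, gives the inhomogeneous
% Schr\"odinger-like equation:
\[
  -\frac{\lambda}{P}\,\sigma\, f''(x)
  + \Bigl(\frac{\lambda}{P\sigma} + 2\,p(x)\Bigr) f(x)
  \;=\; 2\,p(x)\, f^*(x) .
\]
```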
Characteristic scale of predictor \(f_P\), \(d>1\)