Education 

  • Bachelor's in Physics (UniPD)
    • Thesis on Quantum Mechanics
  • Master's in Physics of Matter (UniPD)
    • Thesis on Statistical Physics and climate models
  • Supported by an Excellence School scholarship

Work Experience

  • PhD in AI (EPFL):
    • Quantifying and interpreting generalization in Deep Learning
    • Research on Protein Design with Diffusion Models
    • 5 papers (4 as first author), presented at 12 venues including ICLR, with 2 ICML spotlights
  • Applied Scientist Intern (AWS AI Labs, California)
    • Improving Large Language Models for complex reasoning
    • One U.S. patent and one paper in preparation (first author)

PhD in AI: explaining the success of DL

  • Machine learning is highly effective across various tasks
  • Scaling laws: More training data leads to better performance

The test error decays as a power law of the number of training points \(P\): \(\varepsilon\sim P^{-\beta}\)

  • Language Modeling: \(\beta\approx 0.1\)
  • Image Classification: \(\beta = 0.3\)–\(0.5\)
  • Speech Recognition: \(\beta\approx 0.3\)

[Kaplan 20, Hestness 17]

VS

Curse of dimensionality, which occurs when learning structureless data in high dimension \(d\):

  • Slow decay: \(\beta=1/d\)
  • The number of training data needed to learn is exponential in \(d\)

\(\Rightarrow\) Data must be structured, and Machine Learning should capture such structure.
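
Inverting the power law gives the number of training points needed to reach a target error \(\varepsilon\) (an illustrative consequence of the exponents above):

\[
\varepsilon \sim P^{-\beta} \;\Longrightarrow\; P(\varepsilon) \sim \varepsilon^{-1/\beta}.
\]

With the curse-of-dimensionality exponent \(\beta=1/d\), this means \(P(\varepsilon)\sim \varepsilon^{-d}\), exponential in the dimension \(d\); with the observed \(\beta\approx 0.1\)–\(0.5\), it means \(P(\varepsilon)\sim \varepsilon^{-10}\) to \(\varepsilon^{-2}\), large but not exponential in \(d\).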

Data must be Structured

Key questions motivating this thesis:

  • What constitutes a learnable structure?
  • How does Machine Learning exploit it?
  • How many training points are required?


Reducing complexity with depth

Deep networks build increasingly abstract representations with depth (as also observed in the brain) [Zeiler and Fergus 14, Yosinski 15, Olah 17, Doimo 20, Van Essen 83, Grill-Spector 04]

  • How many training points are needed?
  • Why are these representations effective?

Intuition: depth reduces the complexity of the task, ultimately beating the curse of dimensionality.

  • Which irrelevant information is lost?

Two ways of losing information, by learning invariances:

  • Discrete [Shwartz-Ziv and Tishby 17, Ansuini 19, Recanatesi 19]
  • Continuous [Bruna and Mallat 13, Mallat 16, Petrini 21]

Hierarchical structure

  • Hierarchical representations simplify the task
  • Do deep hierarchical representations exploit the hierarchical structure of data?

How many training points?

Quantitative predictions in a model of data

(Figure: "sofa" example of hierarchically structured data)

[Chomsky 1965]

[Grenander 1996]


Random Hierarchy Model

  • Classification task with \(n_c\) classes
  • Generative model of data (a minimal generative sketch follows this list):
    • The label generates a patch of \(s\) features
    • Patches are chosen at random among \(m\) unambiguous choices (synonyms), according to random production rules,
      • e.g. \(a\rightarrow (b,c)\); \(a\rightarrow (c,z)\)

 

  • Generation iterated \(L\) times with a fixed tree topology.

 

  • Number of possible data points \(\sim e^{d}\), so memorization is not practical
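
A minimal generative sketch in this spirit (an illustration using the slide's parameters \(n_c\), \(m\), \(s\), \(L\); unlike the actual model, it does not enforce the unambiguity of the rules):

```python
import random

def make_rules(n_symbols, vocab_size, s, m, seed=0):
    """For each of the n_symbols higher-level symbols, draw m random production
    rules, each mapping the symbol to a patch of s lower-level symbols.
    NOTE: the actual Random Hierarchy Model also requires the rules to be
    unambiguous (no patch shared by two symbols); this sketch skips that check."""
    rng = random.Random(seed)
    return {
        sym: [tuple(rng.randrange(vocab_size) for _ in range(s)) for _ in range(m)]
        for sym in range(n_symbols)
    }

def sample_datum(label, rules_per_level, m, rng):
    """Generate one input by expanding the class label through L levels of
    randomly chosen synonymous rules (fixed tree topology)."""
    sequence = [label]
    for rules in rules_per_level:                # one rule set per level, L in total
        expanded = []
        for sym in sequence:
            patch = rules[sym][rng.randrange(m)] # pick one of the m synonyms
            expanded.extend(patch)
        sequence = expanded
    return sequence                              # length s**L

# Example with n_c classes, vocabulary size v at intermediate levels, s=2, m=3, L=3.
n_c, v, s, m, L = 4, 8, 2, 3, 3
rng = random.Random(1)
rules_per_level = [make_rules(n_c if lvl == 0 else v, v, s, m, seed=lvl) for lvl in range(L)]
x = sample_datum(label=rng.randrange(n_c), rules_per_level=rules_per_level, m=m, rng=rng)
print(len(x), x)  # s**L = 8 low-level features
```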


Deep networks beat the curse of dimensionality

Sample complexity: \(P^*\sim n_c\, m^L\)

  • Polynomial in the input dimension \(d=s^L\) (see the derivation after this list)
  • Beating the curse
  • Shallow network \(\rightarrow\) cursed by dimensionality
  • Depth is key to beat the curse
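
Why \(P^*\sim n_c\, m^L\) is polynomial in the input dimension \(d=s^L\) (a short derivation):

\[
P^* \sim n_c\, m^{L} = n_c\,\bigl(s^{L}\bigr)^{\log_s m} = n_c\, d^{\,\log_s m},
\]

i.e. a power law in \(d\) with exponent \(\log_s m\), in contrast with the \(\varepsilon^{-d}\) scaling required for structureless data.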


How do deep networks learn the hierarchy?

  • Intuition: build a hierarchical representation mirroring the hierarchical structure of the data. How can such a representation be built?
  • Start from the bottom and group synonyms: learn which input patches correspond to the same higher-level feature
  • Collapse the representations of synonyms, lowering the dimension from \(s^L\) to \(s^{L-1}\)
  • Iterate \(L\) times to get hierarchy

How many training points are needed to group synonyms?


Grouping synonyms by correlations

  • Synonyms are patches with the same correlation with the label
  • Correlations can be measured by counting patch-label co-occurrences (a counting sketch is given below)
  • A large enough training set is required to overcome the sampling noise: \(P^*\sim n_c m^L\), the same as the observed sample complexity
  • Simple argument: for \(P>P^*\), one step of Gradient Descent uses the synonym-label correlations to collapse the representations of synonyms

 

(Diagram: correlations between an input patch \(\mu\) and the label \(\alpha\))
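
A minimal sketch of the counting estimator (illustrative; `data` and `labels` are assumed to come from a generator like the one sketched earlier, with input patches of size \(s\)):

```python
from collections import Counter, defaultdict

def label_profiles(data, labels, s, patch_index=0):
    """Empirical conditional distribution of the label given the value of one
    input patch, estimated by counting co-occurrences over the training set."""
    joint = defaultdict(Counter)
    for x, y in zip(data, labels):
        patch = tuple(x[patch_index * s:(patch_index + 1) * s])
        joint[patch][y] += 1
    return {
        patch: {y: c / sum(cnt.values()) for y, c in cnt.items()}
        for patch, cnt in joint.items()
    }

# Synonymous patches have the same label profile up to sampling noise, which
# decays as 1/sqrt(P); per the slide's argument, the noise falls below the
# synonym-label signal only once P is of order n_c * m^L (the sample complexity P*).
```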


Testing whether deep representations collapse for synonyms

At \(P^*\), both the task and the synonyms are learned.

  • Sensitivity: how much exchanging first-level synonyms in the input changes the second-layer representation (a measurement sketch is given below)
  • For \(P>P^*\), this sensitivity drops

(Plot: sensitivity to synonym exchanges vs. training set size \(P\))
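
One way such a sensitivity could be estimated (a sketch; `model` and `swap_synonyms` are hypothetical helpers, and this is not necessarily the exact protocol used in the thesis):

```python
import numpy as np

def synonym_sensitivity(model, inputs, swap_synonyms, layer=2, n_swaps=10, rng=None):
    """Relative change of a hidden representation when first-level synonyms in
    the input are exchanged, averaged over inputs and random exchanges.

    Assumed (hypothetical) interfaces:
      model(x, layer=k)      -> layer-k activations of the trained network for input x
      swap_synonyms(x, rng)  -> x with each low-level patch replaced by a synonym
    """
    rng = rng or np.random.default_rng(0)
    changes, norms = [], []
    for x in inputs:
        h = model(x, layer=layer)
        norms.append(np.linalg.norm(h))
        for _ in range(n_swaps):
            h_swapped = model(swap_synonyms(x, rng), layer=layer)
            changes.append(np.linalg.norm(h_swapped - h))
    # Close to 0 when representations collapse for synonyms; of order 1 otherwise.
    return np.mean(changes) / (np.mean(norms) + 1e-12)
```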


Takeaways

  • Deep networks learn hierarchical tasks with a number of data polynomial in the input dimension
  • They do so by developing internal representations that learn the hierarchical structure layer by layer

Future directions


  • Extend the RHM to more realistic models of data (e.g. without a fixed tree topology), and interpret how LLMs and vision models learn them.
  • Probe the hierarchical structure to generate data at different levels of abstraction (Diffusion Models).  

 


Thank you!


Application: generating novel data by probing hierarchical structure

  • Goal: generate new data from existing ones
  • At different levels of abstraction

Generative technique:

  • Diffusion models: add noise + denoise
  • The scale of the changes, i.e. the level of the features that change, depends on the amount of noise (see the sketch after this list)
  • Prediction: there is an intermediate noise level at which this scale is maximal (a phase transition)
  • Validated on images and text.
  • What about proteins?
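
A sketch of the corresponding measurement (illustrative; `noise_and_denoise(x, t, rng)` is a hypothetical wrapper around the forward-noising and denoising passes of a trained diffusion model):

```python
import numpy as np

def change_vs_noise(x, noise_and_denoise, noise_levels, n_samples=20, rng=None):
    """For each noise level t, noise the datum x up to t, denoise it back, and
    record the fraction of tokens (or pixels) that changed, averaged over samples."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x)
    fractions = []
    for t in noise_levels:
        changed = [np.mean(np.asarray(noise_and_denoise(x, t, rng)) != x)
                   for _ in range(n_samples)]
        fractions.append(float(np.mean(changed)))
    return np.array(fractions)

# The hierarchical prediction concerns not only this fraction but also the spatial
# extent of correlated changes, which should peak at an intermediate noise level
# (a phase transition) if the data are hierarchically structured.
```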

Diffusion on protein sequences

  • Model for Discrete Diffusion: EvoDiff-MSA
  • Dataset: J-Domain proteins
  • Note: can be used to generate Intrinsically Disordered Regions (IDRs) by conditioning on the structured regions

 

  • Generate new sequences by adding noise to a protein sequence and then denoising it
  • No phase transition is found in the amount of change as a function of the noise level
  • The change is uniform along the whole sequence
  • This is not consistent with a hierarchical structure
  • Future investigations:
    • A better model
    • The 3D structure space

Are protein sequences hierarchical?

Natural Language Constraint Satisfiability Problems

The problem (NL-CSP):

  • Finding a set of \(n\) objects that satisfy \(m\) constraints given in the prompt
  • Some constraints are commonsense, others are hard constraints
  • A crucial reasoning step in planning and in common user interactions such as querying databases

Our approach:

  • Formalize NL-CSPs as infilling problems
  • Use a combination of formal solvers and LLMs to solve them

[U.S. patent]

LLMs as Theory Solvers

Focus on the LLM part:

  • Open-weight models: improve accuracy at inference time by reweighting the logits (10-20% gain); a generic sketch follows this list
  • Closed-weight models: a new prompting technique (2.5x improvement)
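
A generic illustration of inference-time logit reweighting (a sketch of the general idea only, not the patented method; the set of allowed tokens is assumed to come from an external solver or constraint checker):

```python
import numpy as np

def reweight_logits(logits, allowed_token_ids, bonus=5.0):
    """Add a bonus to the logits of tokens that an external solver/checker marks
    as consistent with the current constraints, then convert to probabilities."""
    adjusted = np.asarray(logits, dtype=float).copy()
    adjusted[list(allowed_token_ids)] += bonus
    probs = np.exp(adjusted - adjusted.max())  # softmax with numerical stabilization
    return probs / probs.sum()

# Example: a toy vocabulary of 6 tokens, of which only tokens {1, 4} are
# consistent with the constraints at this decoding step.
probs = reweight_logits([0.2, 1.0, -0.5, 0.1, 0.9, 0.0], allowed_token_ids={1, 4})
print(probs.round(3))
```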