LSM

Large Science Models:

Foundation Models for
Generalizable Insights Into Complex Systems

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

Proposed Concept

Develop Foundation models of complex systems with
- hundreds to thousands of evolving variables with a priori unknown cross-talk
- no governing equations known a priori
- reflexivity: system changes if observed
Learn intrinsic system geometry from data
Derive equations of motion with variational principles (stationary action on Lagrangian).
Inference under data sparsity
Detect data (in)sufficiency, adapt to model drift
Support forward simulation and perturbation analysis
Digital twins of individuals & groups of entities

Data inference boundaries & limitations

Alignment validation

Complex phenomena

Adaptation to model obsolescence

Precise validation protocols to assess process drift triggering re-calibration/training

Built-in flexibility for changing contexts and non-ergodicity

Scalable to thousands to millions of variables, intrinsic reflexivity

Component LSM predictors enforce statistical significance of splits in recursive partitioning, ensuring precise uncertainty quantification

*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.

emergent macro-structure

Component predictor (Conditional Inference Tree*)

Example: Influenza A HA protein

Recursive LSM forest

LSM Forest of Conditional Inference Trees*

Revealing Emergent Cross-talk

LSM Forest

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

GSS 2018 dataset

Set of conditional inference trees (CIT)
- Strict statistical guarantees: quantifies inference uncertainty
Each tree models exactly one variable as a function of potentially all other variables
Non-leaf nodes are "hyperlinked" to other trees
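
The forest structure listed above can be sketched in a few lines. This is a minimal illustration only, under stated assumptions: each "tree" is reduced to a depth-1 stump, and the significance test of a true conditional inference tree is replaced by a plain Pearson chi-square statistic compared against a fixed cutoff (3.84, the 5% critical value for a 2x2 table). The names `chi2_stat`, `fit_stump`, `fit_forest` and the toy rows are hypothetical and not part of any LSM implementation.

```python
from collections import Counter, defaultdict

def chi2_stat(rows, i, j):
    """Pearson chi-square statistic of the contingency table of columns i and j."""
    n = len(rows)
    count_i = Counter(r[i] for r in rows)
    count_j = Counter(r[j] for r in rows)
    count_ij = Counter((r[i], r[j]) for r in rows)
    stat = 0.0
    for a, n_a in count_i.items():
        for b, n_b in count_j.items():
            expected = n_a * n_b / n
            stat += (count_ij[(a, b)] - expected) ** 2 / expected
    return stat

def fit_stump(rows, target, cutoff):
    """Depth-1 stand-in for one tree: split on the most associated other column,
    but only if the association statistic exceeds the cutoff."""
    best_col, best_stat = None, cutoff
    for j in range(len(rows[0])):
        if j != target:
            stat = chi2_stat(rows, target, j)
            if stat > best_stat:
                best_col, best_stat = j, stat
    leaves = defaultdict(Counter)
    if best_col is not None:
        for r in rows:
            leaves[r[best_col]][r[target]] += 1
    return {
        "target": target,
        "split_on": best_col,      # index of the tree this node would link to
        "leaves": dict(leaves),    # split value -> counts of target values
        "marginal": Counter(r[target] for r in rows),
    }

def fit_forest(rows, cutoff=3.84):
    """One predictor per column, each fitted independently of the others."""
    return [fit_stump(rows, i, cutoff) for i in range(len(rows[0]))]

# Hypothetical toy data: columns 0 and 1 move together, column 2 is unrelated.
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")] * 5
forest = fit_forest(rows)
```

In this toy run the predictors for columns 0 and 1 split on each other, while the predictor for column 2 finds no association above the cutoff and keeps only its marginal distribution.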

Large Science Models

1. How will proposer form and maintain a computationally tractable LSM tree structure given, as proposed, hundreds to thousands of observable variables?

$\checkmark$

GSS 2018 dataset

Each predictor is inferred independently
Can scale up to thousands of variables in the Python implementation
Further scale-up to $10^6$ to $10^8$ variables needs a C/C++ implementation
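
Because each predictor is inferred independently, the fitting loop can be distributed across workers. The sketch below illustrates only that pattern: `fit_marginal` is a hypothetical stand-in for fitting one predictor, and a thread pool is used for brevity. For CPU-bound pure-Python fitting, a process pool or a compiled backend would be needed to obtain a real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fit_marginal(rows, i):
    """Hypothetical stand-in for fitting the predictor of column i."""
    counts = Counter(r[i] for r in rows)
    n = len(rows)
    return {value: c / n for value, c in counts.items()}

def fit_all(rows, workers=4):
    """Fit one predictor per column; no task depends on any other task."""
    n_cols = len(rows[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so result i belongs to column i
        return list(pool.map(lambda i: fit_marginal(rows, i), range(n_cols)))

rows = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "y")]
models = fit_all(rows)
```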

https://34.66.189.202/data/trees2018/

Full Example of Hyperlinked Trees

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow}X = \{x^1, \ldots, x^N\}, \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:} }\quad & \color{gray} x^{-i} = \{x^j : j \ne i\}\\
\text{Crosstalk:} \quad & \forall i \ P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \varnothing \\
%\textbf{Degenerate case:} \quad & \psi^i \text{ is a delta distribution (fully observed)} \\
{\color{gray}\text{Notation:} }\quad & \color{gray}\psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}

	reliten	gunlaw	abany	---	grass
Person 1
Person 2
---
Person m

observables

samples

Distributions over alphabet $\Sigma^i$

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

Example

GSS topic: There should be more gun-control

$\psi^i$

strongly agree

agree

neutral

disagree

strongly disagree

\Sigma^i

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

$\phi$ estimates $\psi$

Examples: GSS, ANES, WVS, ESS, Eurobarometer, Afrobarometer, Asian Barometer, etc.

group

individual

estimate is always a non-empty non-degenerate distribution

missing observation

Large Science Models: Properties

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

where $D_{JS}(P\vert \vert Q)$ is the Jensen-Shannon divergence.
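
The distance $\theta$ can be computed directly from its definition once the component predictors are available. The following is a minimal sketch, assuming each predictor is a callable that maps a state (with `None` for a missing entry) to a distribution stored as a dict. The toy predictors, `TABLE`, and `UNIFORM` are hypothetical and serve only to make the example runnable.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two dict-valued distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || mix); entries with zero mass contribute nothing
        return sum(a[k] * math.log(a[k] / mix[k]) for k in a if a[k] > 0)

    return max(0.0, 0.5 * kl(p) + 0.5 * kl(q))  # clamp tiny negative rounding error

def lsm_distance(predictors, psi, psi_prime):
    """theta(psi, psi'): mean over variables of sqrt(JS) between predicted distributions."""
    total = 0.0
    for phi_i in predictors:
        total += math.sqrt(js_divergence(phi_i(psi), phi_i(psi_prime)))
    return total / len(predictors)

# Hypothetical toy predictors for two binary survey items: each item's
# predicted distribution depends only on the other item's observed value.
TABLE = {"yes": {"yes": 0.9, "no": 0.1}, "no": {"yes": 0.1, "no": 0.9}}
UNIFORM = {"yes": 0.5, "no": 0.5}

def make_predictor(other):
    return lambda state: TABLE.get(state[other], UNIFORM)

predictors = [make_predictor(1), make_predictor(0)]
d = lsm_distance(predictors, ["yes", "yes"], ["no", "no"])
```

With natural logarithms, $D_{JS} \le \ln 2$, so each term and hence $\theta$ is at most $\sqrt{\ln 2}$.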

g_{ij}(\psi) \;=\; \frac{1}{2}\,\frac{\partial^2}{\partial \psi^i\,\partial \psi^j}\,\theta^2(\psi,\psi')\Biggr|_{\psi'=\psi}

\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

Large Deviation Bound*

Induced Riemannian metric tensor

This bound connects "closeness" of samples to the odds of perturbing from one to the other, bridging geometry to dynamics
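
Exponentiating the large deviation bound makes the link between distance and transition odds explicit. The rearrangement below is elementary algebra on the stated inequality, assuming both probabilities are strictly positive; the numeric value in the note after it is a hypothetical illustration.

```latex
\left\lvert \ln \frac{\Pr(\psi\to\psi')}{\Pr(\psi'\to\psi')} \right\rvert \le \beta\,\theta(\psi,\psi')
\;\Longleftrightarrow\;
e^{-\beta\,\theta(\psi,\psi')}\,\Pr(\psi'\to\psi')
\;\le\; \Pr(\psi\to\psi') \;\le\;
e^{\beta\,\theta(\psi,\psi')}\,\Pr(\psi'\to\psi')
```

For instance, if $\beta\,\theta(\psi,\psi') = 2$, the probability of perturbing from $\psi$ to $\psi'$ is at least $e^{-2} \approx 0.135$ times the persistence probability of $\psi'$.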

Ergodic Projection

\psi_\star \triangleq \bigotimes_{i=1}^N\phi^i\left (\prod_{1}^{N-1}\varnothing\right )
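
The ergodic projection amounts to evaluating every predictor on a state whose entries are all missing. A minimal sketch, assuming predictors are callables that fall back to a marginal distribution when their input is missing; `make_predictor`, `TABLE`, and `MARGINALS` are hypothetical toy definitions.

```python
def make_predictor(other, table, marginal):
    """Toy predictor that looks only at the entry `other` of the state."""
    def phi(state):
        # A missing (None) input has no row in the table, so the marginal is used
        return dict(table.get(state[other], marginal))
    return phi

def ergodic_projection(predictors):
    """psi_star: apply every phi^i to the state with all entries missing."""
    null_state = [None] * len(predictors)
    return [phi_i(null_state) for phi_i in predictors]

TABLE = {"yes": {"yes": 0.9, "no": 0.1}, "no": {"yes": 0.1, "no": 0.9}}
MARGINALS = [{"yes": 0.6, "no": 0.4}, {"yes": 0.3, "no": 0.7}]
predictors = [
    make_predictor(1, TABLE, MARGINALS[0]),
    make_predictor(0, TABLE, MARGINALS[1]),
]
psi_star = ergodic_projection(predictors)
```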

(Sanov's Theorem, Pinsker's Inequality)

$\psi$

$\psi'$

$\theta$

"spatial average": average of all plausible worldviews or states

* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400. https://www.science.org/doi/full/10.1126/sciadv.adj0400

persistence probability

Ergodic dispersion

\Psi_\star = \theta(\psi,\psi_\star)

Central to Model Drift Quantification

Start with opinion vector with all entries missing

This is a standard Physics construct, quantifying curvature of the underlying latent geometry

Pr(\psi \rightarrow \psi')

Easily computable in LSM framework!

Apply $\phi^i$

Random variable quantifying dispersion around the spatial average of worldviews

const. scaling as $N^2$

Digital Twin & Fidelity of Simulation

\mathcal{N}_\epsilon(\psi) \triangleq \big\{ \psi': {\color{red}\forall i \ \psi'_i \sim \phi^i\left ( \psi^{-i}\right )} \wedge {\color{yellow} \theta(\psi,\psi') \leqq \epsilon }\big \}

Sample predicted distributions

perturbed state within $\epsilon$ of $\psi$
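
The neighborhood definition suggests a simple rejection sampler: draw every entry from its predicted distribution and keep the draw only if it lies within $\epsilon$ of $\psi$. The sketch below is one straightforward reading of the definition, not necessarily the sampler used in the LSM software; the toy predictors and the function name `sample_neighborhood` are hypothetical.

```python
import math
import random

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two dict-valued distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    kl = lambda a: sum(a[k] * math.log(a[k] / mix[k]) for k in a if a[k] > 0)
    return max(0.0, 0.5 * kl(p) + 0.5 * kl(q))

def lsm_distance(predictors, psi, psi_prime):
    """theta(psi, psi') as the mean sqrt(JS) between predicted distributions."""
    terms = [math.sqrt(js_divergence(f(psi), f(psi_prime))) for f in predictors]
    return sum(terms) / len(terms)

def sample_neighborhood(predictors, psi, eps, n_draws, rng):
    """Draw psi'_i ~ phi^i(psi^{-i}) for every i; keep draws with theta <= eps."""
    dists = [phi_i(psi) for phi_i in predictors]
    accepted = []
    for _ in range(n_draws):
        candidate = tuple(
            rng.choices(list(d), weights=list(d.values()))[0] for d in dists
        )
        if lsm_distance(predictors, psi, candidate) <= eps:
            accepted.append(candidate)
    return accepted

# Hypothetical toy predictors: each binary item is predicted from the other one.
TABLE = {"yes": {"yes": 0.9, "no": 0.1}, "no": {"yes": 0.1, "no": 0.9}}
UNIFORM = {"yes": 0.5, "no": 0.5}
predictors = [lambda s: TABLE.get(s[1], UNIFORM), lambda s: TABLE.get(s[0], UNIFORM)]

psi = ("yes", "yes")
samples = sample_neighborhood(predictors, psi, eps=0.4, n_draws=200,
                              rng=random.Random(0))
```

In this toy setting, flipping one item keeps the state inside the neighborhood (distance about 0.30), while flipping both pushes it outside (distance about 0.61).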

Variable	Masked	Reconstructed
spkcom	allowed	allowed
colcom	not fired	not fired
spkmil	allowed	allowed
colmil	allowed	not allowed
libmil	not remove	not remove
libhomo	not remove	not remove
reliten	strong	no religion
pray	once a day	once a day
bible	inspired word	word of god
abhlth	yes	yes
abpoor	no	no
pillok	agree	agree
intmil	very interested	very interested
abpoorw	always wrong	not wrong at all
godchnge	believe now, always have	believe now, always have
prayfreq	several times a week	several times a week
religcon	strong disagree	disagree
religint	disagree	disagree

Variable	Masked	Reconstructed
spkcom	allowed	allowed
colcom	not fired	not fired
libmil	not remove	not remove
libhomo	not remove	not remove
gunlaw	favor	favor
reliten	no religion	no religion
prayer	approve	approve
bible	book of fables	inspired word
abnomore	yes	yes
abhlth	yes	yes
abpoor	yes	yes
abany	yes	yes
owngun	no	no
intmil	moderately interested	moderately interested
abpoorw	not wrong at all	not wrong at all
godchnge	believe now, didn't used to	believe now, always have
prayfreq	several times a week	several times a week

2018 GSS individual samples

Digital Twin

$\epsilon$-Neighborhood of state $\psi$

Definition

Sample neighborhood to impute missing data

\psi

\epsilon

2018 GSS out-of-sample reconstruction

post-reconstruction error ratio (%)

LSM sampling: sampling the $\epsilon$-neighborhood of a state or worldview allows reconstruction of censored opinions

examples

Predictive ability of LSM quantified as ability to reconstruct censored out-of-sample opinions**
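
A masked-reconstruction check of this kind can be sketched as follows. Two simplifications are assumed: a censored entry is filled with the most probable value under its predictor rather than by sampling the $\epsilon$-neighborhood, and the error ratio is taken to be the percentage of censored entries reconstructed incorrectly. The toy predictors and held-out records are hypothetical.

```python
# Hypothetical toy predictors: each binary item is predicted from the other one.
TABLE = {"yes": {"yes": 0.9, "no": 0.1}, "no": {"yes": 0.1, "no": 0.9}}
UNIFORM = {"yes": 0.5, "no": 0.5}
predictors = [lambda s: TABLE.get(s[1], UNIFORM), lambda s: TABLE.get(s[0], UNIFORM)]

def reconstruct(predictors, record, masked):
    """Censor the entries in `masked`, then fill each with its most probable value."""
    state = [None if i in masked else v for i, v in enumerate(record)]
    filled = list(state)
    for i in masked:
        dist = predictors[i](state)          # prediction uses only uncensored entries
        filled[i] = max(dist, key=dist.get)
    return filled

def error_ratio(predictors, records, masked):
    """Percentage of censored entries that are reconstructed incorrectly."""
    wrong = total = 0
    for record in records:
        filled = reconstruct(predictors, record, masked)
        for i in masked:
            total += 1
            wrong += filled[i] != record[i]
    return 100.0 * wrong / total

held_out = [("yes", "yes"), ("no", "yes")]   # hypothetical out-of-sample records
ratio = error_ratio(predictors, held_out, masked={0})
```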

{\color{Tomato}\psi_\star }\rightarrow \psi \rightarrow \cdots \rightarrow \psi'

Null state (all missing observations)

Valid perturbations/ simulations

LSM sampling allows simulating opinion perturbations

Both individuals and groups may be modeled as digital twins$\dag$

Global Emergent Structure via Clusters & Poles

2018 GSS

\theta_t(\psi_+,\psi_-)

Polar separation over time

2016 Presidential Election Vote Prediction

2004

Variable	Conservative pole ($\psi_+$)	Liberal pole ($\psi_-$)
abany	no	yes
abdefctw	always wrong	not wrong at all
abdefect	no	yes
abhlth	no	yes
abnomore	no	yes
abpoor	no	yes
abpoorw	always wrong	not wrong at all
abrape	no	yes
absingle	no	yes
bible	inspired word	book of fables
colcom	fired	not fired
colmil	not fired	not allowed
comfort	strongly agree	strongly disagree
conlabor	hardly any	a great deal
godchnge	believe now, always have	don't believe now, never have
grass	not legal	legal
gunlaw	oppose	favor
intmil	very interested	not at all interested
libcom	remove	not remove
libmil	not remove	remove
maboygrl	true	false
owngun	yes	no
pillok	agree	strongly agree
pilloky	strongly disagree	strongly agree
polabuse	no	yes
pray	several times a day	never
prayer	disapprove	approve
prayfreq	several times a day	never
religcon	strongly disagree	strongly agree
religint	strongly disagree	strongly agree
reliten	strong	no religion
rowngun	yes	no
shotgun	yes	no
spkcom	not allowed	allowed
spkmil	allowed	not allowed
taxrich	about right	much too low

conservative pole

\psi_+

liberal pole

\psi_-

Clustering LSM distance $\theta(x,y)$ between out-of-sample individuals

conservative

liberal

poles:

partial states aligning with extreme opposing worldviews

Compare across time and different GSS surveys
Derived features for individuals (ideology index)

I(x) = \frac{\theta(x,\psi_+) - \theta(x,\psi_-)}{\theta(\psi_+,\psi_-)}
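
The index is plain arithmetic on three LSM distances. A minimal sketch, assuming the distances have already been computed; the function name and the numeric inputs are hypothetical.

```python
def ideology_index(d_plus, d_minus, d_poles):
    """I(x) = (theta(x, psi_+) - theta(x, psi_-)) / theta(psi_+, psi_-)."""
    if d_poles <= 0:
        raise ValueError("the two poles must be a positive distance apart")
    return (d_plus - d_minus) / d_poles

# Hypothetical distances with the poles 0.6 apart.
at_conservative_pole = ideology_index(0.0, 0.6, 0.6)   # x coincides with psi_+
at_liberal_pole = ideology_index(0.6, 0.0, 0.6)        # x coincides with psi_-
equidistant = ideology_index(0.3, 0.3, 0.6)
```

With the formula as written, an individual at the conservative pole scores $-1$ and one at the liberal pole scores $+1$. If $\theta$ satisfies the triangle inequality, every individual falls within $[-1, 1]$.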

Predict 2016 votes using ideology index

Emergent global structure

Reflexivity and State Collapse on Observation

Emergent Equations of Motion

L \triangleq \frac{1}{2} \sum_i g_{kl} P^k_p \dot{\psi}^p_i P^l_n \dot{\psi}^n_i - \theta(\psi, \phi)

Define Lagrangian*

\frac{d}{dt} \left( \frac{\partial L}{\partial \dot{\psi}^m_i} \right) - \frac{\partial L}{\partial \psi^m_i} = 0

Via the Euler-Lagrange Equations$^\dag$:

\ddot{\psi}^m_i = -g^{km} P^k_m \frac{1}{2N} \sum_j \frac{1}{\sqrt{D_{JS}(\psi^m_j \| \phi^m_j)}} \left[ \ln\left( \frac{2e\psi^m_j}{\psi^m_j + \phi^m_j} \right) - \frac{1}{2(\psi^m_j + \phi^m_j)} \right]

Over-damped Gradient flow Equation*

where $g^{km}$ is the inverse metric tensor
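
The qualitative behavior, in which tension between the actual and predicted distributions decays over time, can be illustrated with a much simpler stand-in than the equation above. The sketch assumes a fixed prediction $\phi$ and a plain mixing step toward it. It does not implement the metric tensor or the projection operators, and all names and numbers are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two dict-valued distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    kl = lambda a: sum(a[k] * math.log(a[k] / mix[k]) for k in a if a[k] > 0)
    return max(0.0, 0.5 * kl(p) + 0.5 * kl(q))

def relax(psi, phi, rate):
    """Move psi a fraction `rate` of the way toward the predicted distribution phi."""
    keys = set(psi) | set(phi)
    return {k: (1 - rate) * psi.get(k, 0.0) + rate * phi.get(k, 0.0) for k in keys}

psi = {"agree": 0.8, "neutral": 0.1, "disagree": 0.1}   # hypothetical actual belief
phi = {"agree": 0.2, "neutral": 0.3, "disagree": 0.5}   # hypothetical prediction

# Track the tension sqrt(D_JS(psi || phi)) along the relaxation path.
tension = [math.sqrt(js_divergence(psi, phi))]
for _ in range(10):
    psi = relax(psi, phi, rate=0.2)
    tension.append(math.sqrt(js_divergence(psi, phi)))
```

Because the Jensen-Shannon divergence is convex in its first argument and vanishes when both arguments coincide, each mixing step strictly lowers the tension while it is positive.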

kinetic energy

state collapse

Response scale (both panels): strongly agree, agree, neutral, disagree, strongly disagree

Query/Observation

$X_i$

Non-local Influence propagation on measurement/observation (QM-like)

\phi^i(\psi^{-i})

potential energy

* Einstein notation used

$^\dag$ Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.

Principle of stationary action

Dynamics

Local potential field eqn

Local Potential Fields

Stable

(captured by local extrema)

Free to move locally towards extrema

Why propaganda works so well

* Bail, Christopher A., et al. "Exposure to opposing views on social media can increase political polarization." PNAS 115, no. 37 (2018): 9216-9221. DOI: 10.1073/pnas.1804840115

GSS 2018 individuals and neighborhoods

Influenza C : strains and their neighborhoods

Even random perturbations will tend to move individuals towards local extrema increasing polarization

Polarization is "easy", can occur via random perturbations (falling into the local well)

Hypotheses

Observation: This lineage (Mississippi lineage) is now extinct since 2022/23

stable lineage

Implications on Social Theory

The LSM tells the latent opinion "space-time" how to curve, the curved "space-time" tells opinions how to change.

Local potential fields can be computed given the LSM and dynamical considerations, which reveal future evolution

De-polarization is "hard", needs specific communication (climbing up from the well)

Data Sufficiency via Conservation of Complexity

%K(x) = K(S) + K(x \vert S_\star) + O(1) = K(S') + K(x \vert S'_\star) +O(1) K(x \vert S_\star) = O(1) = K(S \vert x_\star)

The No-cheating Theorem: Generative models cannot cheat on complexity

Kolmogorov Complexity

Optimal Generative Model

compressed data representation

compressed model representation

Theorem

K(\textrm{data}) = K(\textrm{LSM}) +O(1)

Conservation Law arising from the continuous symmetry of typicality*

\mu_0(X) \triangleq \frac{\delta(\vert \langle S(X) \rangle \vert)}{\delta(\vert \langle X \rangle \vert)} \leq 1

Saturation relation:

Data Sufficiency Statistic $\mu_0$

We need LSM-sampling to calculate this

*Noether's Theorem

For every continuous symmetry of a physical system, there exists a corresponding conserved quantity

\vert \langle X' \rangle \vert \approx \max\{1,\mu_0(X)\} \vert \langle X \rangle \vert

How much more data do we need?

Data saturation

Data deficient

Needed

Current

Empirical Validation

Model Drift Quantification

Ergodic dispersion

\Delta_\star = \theta(\Psi,\psi_\star)

z(\Delta_\star) = \frac{\Delta_\star^{[t]} - \langle \Delta_\star^{[t]} \rangle}{\sigma(\Delta_\star^{[t]} )}

Z-value of dispersion

Do new samples (survey respondents) still conform to the model?

GSS Model drift

ergodic projection (all missing values)

A random belief state (with possibly missing entries)

random variable

normal variate

\zeta(M) = \vert z(\Delta_\star^0) - z(\Delta_\star^{[t]}) \vert

Model drift stochastic process ($\zeta$)

\mathbf{E}(\zeta(M) )

assess if $\zeta$ is stationary: if not then new samples are not conforming to model
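
The two drift quantities reduce to standardization and an absolute difference. The sketch below is one possible reading of the formulas, assuming each dispersion value is standardized against the dispersions of its own cohort (reference year or year $t$) using the sample standard deviation. The function names and numbers are hypothetical.

```python
from statistics import mean, stdev

def z_value(delta, cohort):
    """Standardize one ergodic dispersion against a cohort of dispersions."""
    return (delta - mean(cohort)) / stdev(cohort)

def drift(delta_0, cohort_0, delta_t, cohort_t):
    """zeta: absolute gap between the reference z-value and the z-value at time t."""
    return abs(z_value(delta_0, cohort_0) - z_value(delta_t, cohort_t))

# Hypothetical dispersions theta(Psi, psi_star) for a reference and a later cohort.
reference = [1.0, 2.0, 3.0]
later = [1.0, 2.0, 3.0]
zeta = drift(3.0, reference, 2.0, later)
```

Tracking `zeta` over successive cohorts and testing that series for stationarity is then a separate step, for which any standard stationarity test could be used.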

Example for GSS LSM inferred for year 2000

Large Science Models & Ergodicity

$\checkmark$ 4. Address whether your approach makes assumptions regarding ergodicity, and if so, how these assumptions affect the model's applicability to non-ergodic systems.

No Convergence

(~50% belief mismatch between pairs)

2018 GSS survey belief vectors simulated via LSM sampling

No ergodicity assumption: LSMs are built for non-ergodic systems
Sampling and simulation "remember" the start point (no convergence), demonstrating non-ergodic learned structure
Local potential fields vary across the space
Potential wells may arise, driven by the dynamics at hand, not via assumptions
"change" is driven by non-equilibrium (dissonance)

Embedded Social Theories in LSM

When applied to Social Modeling and Opinion Dynamics

Belief about topic $i$ is expected to align with beliefs about other topics $\displaystyle\psi^{-i}$.
Deviations are exponentially improbable $\Rightarrow $ people/groups seek internal coherence.
Theory Link:
- Cognitive consistency theory – Abelson et al. (1968)
- Constraint satisfaction in beliefs – Read & Marcus-Newhall (1993)

Beliefs evolve to minimize tension between actual state and “expected” state.
Reflexive gradient flow — system reduces internal contradiction.
Theory Link:
- Cognitive Dissonance Theory – Festinger (1957)
- Homeostatic belief adjustment – Gawronski & Strack (2004)

Observing a belief changes it and affects all conditionals.
Direct encoding of feedback loops central to human systems.
Theory Link:
- Reflexivity in social systems – Giddens (1984), Soros (1994)
- Theory of mind / mutual modeling – Premack & Woodruff (1978)

Validation of Social Theory Questions:

Perception changes reality, which changes perception
The Constitution of Society
The Alchemy of Finance
Does a chimpanzee have a theory of mind?

Our system “wants” to reach a low-energy (low-dissonance) state — a direct computational analog of Festinger’s theory.

People strive to align beliefs and attitudes across related domains. Inconsistencies create cognitive discomfort, prompting adjustments across belief clusters to restore harmony.

Exploratory: Belief systems react measurably to exogenous events and shocks

Exploratory: Cross-dependencies between beliefs have observable effects on societal resilience.

Is Polarization an Inevitable Attractor?

Social Identity Theory vs. Belief Proximity

Large Science Models: Broader Applications

A General Framework for modeling Complex Systems

Genomic database: Missing heritability problem

Personalized Clinical Digital Twin, Virtual Patients

Any structured interview, PTSD fabrication

Assess symptom data and co-pathologies

Predict future mutations; which animal strain is closest to jumping to humans

Mental health diagnosis

Microbiome Analysis**

Algorithmic lie detector

Viral emergence

Teomims

Opinion Dynamics

Darkome

Generative model of complex microbial ecosystems, and their impact on health and disease

Data requirements

Tabular data
Potentially large number of features/covariates ($10^2 - 10^8 $)
Sufficient number of samples ($10^3 - 10^6$)
Small number of longitudinal samples (currently, $ < 100$)

Limitation	Mitigation / Response
Conventional time series is currently out-of-scope	Focus on cross-sectional interdependencies and belief geometry; time handled via drift
LSMs model statistical interdependence, not causal mechanisms	Use perturbation-based simulations to infer plausible influence pathways
Limited by observed belief variables	Integrate multiple surveys; use latent proxies and test sensitivity of digital twins
Social theory connections and interpretability may be challenging	Anchor dynamics with theory-driven constructs (e.g., ToM, cognitive dissonance)

LSMs for complex systems

**preliminary study published (https://www.science.org/doi/10.1126/sciadv.adj0400)
