Large Science Models:

Foundation Models for
Generalizable Insights Into Complex Systems

with Psycho-social Application

PI: Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

DARPA-EA-25-02-05-MAGICS-PA-025

HR0011-26-3-E016

Proposed Concept

Develop foundation models of complex systems with
- hundreds to thousands of evolving variables with a priori unknown cross-talk
- no governing equations known a priori
- reflexivity: the system changes when observed
Learn intrinsic system geometry from data
Derive equations of motion from variational principles (stationary action on a Lagrangian)
Inference under data sparsity
Detect data (in)sufficiency, adapt to model drift
Support forward simulation and perturbation analysis
Digital twins of individuals & groups with respect to opinion dynamics
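
For reference, the stationary-action step named above has the standard textbook form sketched below. The generalized coordinates \(q_i\), the time interval \([t_0, t_1]\), and the Lagrangian \(L\) are generic symbols assumed here for illustration; how an LSM constructs \(L\) from data is not specified in these slides.

```latex
\[
S[q] = \int_{t_0}^{t_1} L\bigl(q, \dot{q}, t\bigr)\, dt,
\qquad
\delta S = 0
\;\Longrightarrow\;
\frac{d}{dt}\,\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = 0 .
\]
```

Each coordinate \(q_i\) contributes one Euler-Lagrange equation, so a system with hundreds to thousands of variables would yield a correspondingly large coupled system.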

Proposer Overview

PI: Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

Associate Faculty, Sanders-Brown Center on Aging

University of Kentucky

PI on 4 past DARPA grants
- D3M (Data-driven Discovery of Models, I2O, PM: Wade Shen)
- PAI (Physics of AI, DSO, PM: James Gimlett)
- PREEMPT (PREventing EMerging Pathogenic Threats, BTO, Site-PI, PM: James Gimlett)
- YFA 2020 (Topic: Cognitive Dissonance, PM: Bartlett Russell)

High impact publications (https://scholar.google.com/citations?user=JpUbOmsAAAAJ&hl=en)
Funding from Alzheimer's Association, NIA
Advised 3 postdocs, 2 PhD students, and over 30 graduate and undergraduate students
>40 Invited lectures including at NIH, DoD Facilities, National Labs

Dmytro Onishchenko

Staff Scientist + PhD Student:

ML/AI
C++/Python

Zhuoqun Li

Postdoctoral Associate:

ML/AI
Stochastic processes
C++/Python

Cost & Schedule

Estimated costs	USD
Labor cost	157,227.86
Other direct costs	9,993.00
Total (direct+indirects for 12 months)	257,520.12

Validation Plan Outline

Gantt Chart*

*Milestone definitions in next slide

Dataset Acquisition (10 survey datasets)

LSM inference

LSM predictive ability validation

LSM model drift sensing validation

LSM data sufficiency tracking validation

LSM-mediated social theory analysis
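
The inference task above relies on training several models on different random 50% samples of each dataset. A minimal sketch of generating such repeated half-splits follows; the function name `random_half_splits`, the row count, and the number of splits are hypothetical choices for illustration, not part of the LSM software.

```python
import numpy as np

def random_half_splits(n_rows, n_splits, seed=0):
    """Yield (train_idx, heldout_idx) pairs; each train set is a random 50% sample."""
    rng = np.random.default_rng(seed)
    half = n_rows // 2
    for _ in range(n_splits):
        perm = rng.permutation(n_rows)
        # Sort indices so downstream row selection preserves the original order.
        yield np.sort(perm[:half]), np.sort(perm[half:])

# Hypothetical example: 5 independent splits of a 1,000-respondent survey wave.
splits = list(random_half_splits(n_rows=1000, n_splits=5))
```

Each held-out half can then serve as out-of-sample data for the later validation tasks, and agreement across splits gives a rough sense of how sensitive an inferred model is to the sample it was trained on.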

Milestones

Milestone #	Milestone Title / Detailed Description	Exit Criteria / Deliverable	Due Date (nlt)
1	Kickoff Meeting: A briefing on the technical plan for the effort, including the milestone schedule and the path to accomplishing the objectives of the agreement.	Government acceptance / Kickoff meeting briefing slides	Month 1 after award start
2	Validation plan: Detailed validation plan, including the description, acquisition plan, and justification for the ground truth data, and a description of the metrics and benchmarks to be used to measure performance.	Government acceptance / Technical report as described.	Month 1
3	Milestone Title: Dataset Acquisition and LSM Inference. Technical goal: a) Dataset acquisition (10 social survey datasets acquired: GSS, ANES, CES, Eurobarometer, etc.); b) Infer LSM models for each dataset using 50% random samples, with multiple LSMs trained on different random splits of each dataset.	Government acceptance / Technical report detailing figure/code/data/etc. and all underlying materials generated in support of milestone, regardless of success	Month 2
4	Milestone Title: Masked sample reconstruction. Technical goal: Validate LSM predictive accuracy via censored-sample reconstruction on out-of-sample data from each dataset. Demonstrate a statistically significant reduction of LSM distance post-reconstruction relative to post-masking. Target: Reconstruction error metric improves by at least 50% over 1) random imputation and 2) median imputation.	Government acceptance / Technical report detailing figure/code/data/etc. and all underlying materials generated in support of milestone, regardless of success	Month 4
5	Milestone Title: Model drift sensing validation. Technical goal: Demonstrate that the LSM framework can reliably sense when the underlying model drifts. Assess whether the model drift statistic is stationary for samples drawn from the same survey wave of our datasets, and reliably indicates non-stationary drift for samples from different survey waves. Target: Model drift statistic must be statistically significant at the 5% level for survey waves 5 years apart for at least GSS, CES, and Eurobarometer. Deliverable: detailed documentation on all 10 datasets.	Government acceptance / Technical report detailing figure/code/data/etc. and all underlying materials generated in support of milestone, regardless of success	Month 6
6	Milestone Title: Data sufficiency assessment capability. Technical goal: Use the conservation of complexity principle to show that the LSM framework can sense data deficiency and sufficiency.	Government acceptance / Technical report detailing figure/code/data/etc. and all underlying materials generated in support of milestone, regardless of success. Analysis results on all 10 datasets	Month 8
7	Milestone Title: Social Theory and Competing Hypotheses Adjudication. Technical goal: a) Social theory hypothesis assessment: polarization is an inevitable attractor; b) Investigate the competing hypotheses on whether socio-economic identity, or belief proximity and latent opinion-space geometry, is more predictive of specific opinion/belief outcomes.	Government acceptance / Technical report detailing figure/code/data/etc. and all underlying materials generated in support of milestone, regardless of success	Month 10
8	Final milestone meeting and report (one month prior to award end date): The final briefing and final report should summarize all work completed on the project, highlighting accomplishments, lessons learned, unexpected outcomes, and challenges requiring further research. Technical artifact delivery (software release, evaluation results, source code, models, etc.)	Government acceptance / Technical report as described. For software: GitHub repository with deployable code, complete with example notebooks	Month 11
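
The Milestone 4 target compares reconstruction error against random and median imputation. The sketch below shows one way the two baselines and the 50% improvement check could be computed. It assumes synthetic 5-point responses and uses mean absolute error on censored cells as a stand-in for the LSM distance, which is not defined in these slides; all function names are hypothetical.

```python
import numpy as np

def mask_entries(data, frac, rng):
    """Boolean mask marking a random `frac` of cells as censored."""
    return rng.random(data.shape) < frac

def median_impute(data, mask):
    """Fill censored cells with the per-column median of the uncensored cells."""
    filled = data.astype(float).copy()
    for j in range(data.shape[1]):
        filled[mask[:, j], j] = np.median(data[~mask[:, j], j])
    return filled

def random_impute(data, mask, rng):
    """Fill censored cells with draws from the uncensored cells of the same column."""
    filled = data.astype(float).copy()
    for j in range(data.shape[1]):
        observed = data[~mask[:, j], j]
        filled[mask[:, j], j] = rng.choice(observed, size=int(mask[:, j].sum()))
    return filled

def masked_error(truth, filled, mask):
    """Mean absolute error over censored cells only (stand-in metric, an assumption)."""
    return float(np.abs(truth[mask] - filled[mask]).mean())

def relative_improvement(model_err, baseline_err):
    """Fractional error reduction of the model relative to a baseline."""
    return 1.0 - model_err / baseline_err

def meets_target(model_err, baseline_errs, threshold=0.5):
    """True if the model beats every baseline by at least `threshold`."""
    return all(relative_improvement(model_err, b) >= threshold for b in baseline_errs)

# Synthetic stand-in for a survey wave: 500 respondents, 8 five-point items.
rng = np.random.default_rng(0)
truth = rng.integers(1, 6, size=(500, 8))
mask = mask_entries(truth, 0.2, rng)
err_median = masked_error(truth, median_impute(truth, mask), mask)
err_random = masked_error(truth, random_impute(truth, mask, rng), mask)
```

A model's reconstruction error on the same censored cells would then be passed to `meets_target(model_err, [err_random, err_median])`. The sketch assumes every column keeps at least one uncensored cell, which holds comfortably at this masking rate.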

Problem Focus

A General Framework for modeling Complex Systems with Psycho-social Application

Survey Datasets (Public or available at nominal cost)

Survey	Waves / Years	Avg Participants / Wave	Avg Questions / Wave	Participants (approx)	Data Source / Link
General Social Survey (GSS)	~33 (1972–2024)	~3,000	~1,500	~99,000	NORC GSS Data Explorer
ANES	~25 (election-year)	~3,100	~1,000	~77,500	ANES Data Portal
Cooperative Election Study (CES)	~18 (2006–2024)	~50,000	~200	~900,000	CES Portal
Eurobarometer	~100 (1973–2024, biannual)	~30,000	~100	~3,000,000	European Commission Archive
World Values Survey (WVS)	7 waves (1981–2020)	~2,000 / country	~250	~1,120,000	WVS Website
European Social Survey (ESS)	10 waves (2002–2022)	~2,500 / country	~250	~750,000	ESS Website
Latinobarómetro	~25 waves (1995–2024)	~18,000	~110	~450,000	Latinobarómetro Archive
Afrobarometer	6 rounds (1999–2022)	~1,800 / country	~120	~220,000	Afrobarometer Archive
Arab Barometer	5 waves (2006–2022)	~1,800 / country	~130	~135,000	Arab Barometer Site
Asian Barometer	4 waves (2001–2022)	~1,500 / country	~120	~108,000	Asian Barometer Network

\(\checkmark\) Exploration of Dataset Access Protocols Complete

Datasets

Dataset	Access model	License / use constraints (typical for research use)
General Social Survey (GSS)	Open public download	Free for research use; citation required; no redistribution of modified datasets
American National Election Studies (ANES)	Public-use + restricted-use tiers	Public-use data freely available; restricted-use data requires application and secure handling
Cooperative Election Study (CES)	Public download (common content)	Free for academic research; team modules may have additional citation or use constraints
Eurobarometer	Registration-based access (GESIS)	Free for non-commercial research; user registration required; citation and compliance with GESIS terms
World Values Survey (WVS)	Registration-based download	Free for non-commercial research; attribution required; redistribution restricted
European Social Survey (ESS)	Registration-based download	Free for non-commercial research; strict citation and documentation compliance
Latinobarómetro	Controlled public access	Use subject to project terms; citation required; redistribution limitations apply
Afrobarometer	Public download	Free for research and policy use; attribution required; redistribution limited
Arab Barometer	Form-based access	Free for non-commercial research; short request form; citation required
Asian Barometer	Application-based access	Explicit permission required; usage and redistribution restrictions apply

Access Difficulty

Legend: easy, less easy, least easy

Datasets: Global Coverage

World Values Survey (WVS) is global

\(\checkmark\) Overlapping survey datasets

Datasets

Survey	Region	Survey design	Missingness / data issues	Why Relevant to MAGICS and digital twin construction
General Social Survey (GSS)	United States	Repeated cross-sections with a stable core and rotating topical modules	Item nonresponse varies by topic; skip patterns common; structured missingness from module rotation	Long-horizon US belief drift with controlled module churn; strong testbed for latent reconstruction under partial observability
American National Election Studies (ANES)	United States	Election-year time series; some panel components depending on study	Complex skip logic; panel attrition where applicable; block-missingness across batteries	Links belief geometry to electoral cycles; supports cross-sectional vs panel consistency checks
Cooperative Election Study (CES)	United States	Large-N annual/biannual cross-sections with common content plus team modules	Strong module-induced missingness; very high N offsets sparsity	Stress-tests scalability and conditional belief inference under extreme module sparsity
Eurobarometer	Europe (multi-country)	Repeated cross-sections across multiple survey series (Standard/Special/Flash)	Cross-country harmonization issues; wording drift; topic-specific wave gaps	Ideal for cross-national latent-geometry comparisons and robustness to instrument drift
World Values Survey (WVS)	Global (multi-country)	Multi-year waves; repeated cross-sections with uneven country participation	Country-wave coverage gaps; partial item overlap; translation effects	Enables global worldview geometry and invariance-aware modeling across cultures
European Social Survey (ESS)	Europe (multi-country)	Biennial rounds; repeated cross-sections with rotating modules	High data quality; structured missingness from module rotation; variable country participation	Gold-standard benchmark for calibration, validation, and longitudinal stability
Latinobarómetro	Latin America (multi-country)	Annual/near-annual repeated cross-sections	Variable country-year coverage; evolving batteries; skip-pattern sparsity	Tests transferability to non-US/EU contexts and regime-sensitive belief dynamics
Afrobarometer	Africa (multi-country)	Multi-year rounds; repeated cross-sections	Uneven round participation; battery variation; structured round-level missingness	Robustness tests under irregular sampling and heterogeneous governance contexts
Arab Barometer	Middle East & North Africa	Wave-based repeated cross-sections	Coverage gaps driven by field conditions; variable item sets	Evaluates model stability under volatile sampling and political contexts
Asian Barometer	Asia (multi-country)	Wave/round-based repeated cross-sections	Heterogeneous item availability; access-driven release variation	Strong test of cross-cultural generalization and measurement invariance

\(\checkmark\) Diverse observation contexts

Publications Planned

Kevin Wu, Feng Li, and I. Chattopadhyay, "Emergenet: Digital Twin of Influenza A Emergence From Non-Human Hosts", Military Medicine, In Review
I. Chattopadhyay and Jinyuan Li, "How Good Is Your Synthetic Data", In Preparation
Digital Twins of All Datasets
Opinion Influence and Reflexivity Results

Large Science Models: Broader Applications

A General Framework for modeling Complex Systems

Genomic database: Missing heritability problem

Personalized Clinical Digital Twin, Virtual Patients

Any structured interview, PTSD fabrication

Assess symptom data and co-pathologies

Predict future mutations; which animal strain is closest to jumping to humans

Mental health diagnosis

Microbiome Analysis**

Algorithmic lie detector

Viral emergence

Teomims

Opinion Dynamics

Darkome

Generative model of complex microbial ecosystems, and their impact on health and disease

Data requirements

Tabular data
Potentially large number of features/covariates (\(10^2 - 10^8 \))
Sufficient number of samples (\(10^3 - 10^6\))
Small number of longitudinal samples (currently, \( < 100\))

Limitation	Mitigation / Response
Conventional time series is currently out-of-scope	Focus on cross-sectional interdependencies and belief geometry; time handled via drift
LSMs model statistical interdependence, not causal mechanisms	Use perturbation-based simulations to infer plausible influence pathways
Limited by observed belief variables	Integrate multiple surveys; use latent proxies and test sensitivity of digital twins
Social theory connections and interpretability may be challenging	Anchor dynamics with theory-driven constructs (e.g., ToM, cognitive dissonance)
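
Since time is handled via drift, a drift statistic needs a significance check such as the 5% level named in Milestone 5. The sketch below illustrates one generic way to do this with a permutation test. The drift statistic used here (mean absolute gap between per-question means) is a simple stand-in chosen for illustration and is not the LSM drift statistic; the synthetic waves and all names are likewise assumptions.

```python
import numpy as np

def drift_statistic(a, b):
    """Stand-in drift statistic: mean absolute gap between per-question means."""
    return float(np.abs(a.mean(axis=0) - b.mean(axis=0)).mean())

def permutation_pvalue(a, b, n_perm=500, seed=0):
    """P-value for the observed drift under random relabeling of respondents."""
    rng = np.random.default_rng(seed)
    observed = drift_statistic(a, b)
    pooled = np.vstack([a, b])
    n_a = a.shape[0]
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        stat = drift_statistic(pooled[perm[:n_a]], pooled[perm[n_a:]])
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Synthetic stand-ins for two survey waves with a shifted response distribution.
rng = np.random.default_rng(1)
wave_early = rng.normal(0.0, 1.0, size=(300, 10))
wave_late = rng.normal(0.5, 1.0, size=(300, 10))

p_across = permutation_pvalue(wave_early, wave_late)                 # different waves
p_within = permutation_pvalue(wave_early[:150], wave_early[150:])    # same wave
```

Under this setup, drift would be flagged when `p_across` falls below 0.05, while samples drawn from the same wave should fall below that level only about 5% of the time.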

LSMs for complex systems

**preliminary study published (https://www.science.org/doi/10.1126/sciadv.adj0400)


DARPA-EA-25-02-05-MAGICS-PA-025 University of Kentucky Kickoff
