Large Science Models: Generative AI for Scientific Discovery
Ishanu Chattopadhyay, PhD
Assistant Professor of Internal Medicine
Institute of Biomedical Informatics
University of Kentucky
ZeD Lab Research Thrusts
Modeling & predicting complex social interactions
General framework for inferring digital twins in biology and medicine
Point-of-care test-free screening for complex diseases
AI
Electronic Healthcare Record
IPF
ASD
ADRD
Rapid Universal Point-of-care Screening for ILD/IPF Using Comorbidity Signatures in Electronic Health Records
Flag patients before they (or their doctors) suspect disease
Primary Care
Pulmonologist
Zero-burden Co-morbid Risk Score (ZCoR)
Referral
Prognosis at Point-of-Diagnosis
Patient Journey
Early Diagnosis
Reduce screen failure rates
Holistic health surveillance
Predict antifibrotics continuation
improve outcomes
Interstitial Lung Disease / Pulmonary Fibrosis
Acute Pancreatitis (AP)
(with Darwin Conwell's group)
Aim 1: Map AP Patient Journeys to Identify Risk Patterns in Acute and Recurrent Episodes.
Aim 2: Model Transitions from AP to Type 3c Diabetes for Early Intervention.
Aim 3: Predict ICU Admission in AP Patients Based on Disease Severity Indicators.
Timeline figure: ZCoR screening → ~4 yrs → current clinical DX → current survival ~4 yrs
Onishchenko, D., Marlowe, R.J., Ngufor, C.G. et al. Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records. Nat Med 28, 2107–2116 (2022). https://doi.org/10.1038/s41591-022-02010-y
n=~3M
AUC~90%
Likelihood ratio ~30
Conventional AI/ML attempts to model the physician
AI in IPF Research
ICD administrative codes
IPF
ILD
target codes appear
Past medical history
No target codes appear
case
control
2yrs
2yrs
prediction
Truven MarketScan (IBM) Commercial Claims & Encounters Database 2003-2018
>100M patients visible
>7B individual claims
>87K unique diagnostic codes
>7% Medicare data present
2,053,277 patients included in study
University of Chicago Medical Center 2012-2021
68,658 patients
Random sample from OptumLabs Data Warehouse, courtesy of Mayo Clinic
861,280 patients
2,983,215 patients in total
Data: Onishchenko et al., Nat Med 2022
patient A
patient B
patient C
Beyond "risk factors" to personalized risk patterns
Up to 4-year "signal" resolution
decreases risk
increases risk
Patient Journey: Tracking Risk over time
ZeD Lab: Predictive Screening from Comorbidity Footprints
CELL Reports
| Condition | ZCoR | Competition |
|---|---|---|
| Autism | >83% | "obvious" |
| Alzheimer's Disease | ~90% | 60-70% |
| Idiopathic Pulmonary Fibrosis | ~90% | NA |
| MACE | ~80% | ~70% |
| Bipolar Disorder | ~85% | NA |
| CKD | ~85% | NA |
| Rare Cancers (Bladder, Uterus) | ~75-80% | Low |
| Suicidality (with CAT-SS) | 98% PPV | Low |
How?
Odds ratios combined via ML
1
Data
cases
control
odds ratios for all ICD codes
ML Model
odds-based risk estimator
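The first step, odds ratios for every ICD code computed from cases and controls, can be sketched as follows. This is a minimal illustration, not the ZCoR implementation: the function name `code_odds_ratios`, the 0.5 pseudocount (a Haldane-Anscombe style correction), and the toy patients and codes are all assumptions made for the example.

```python
from collections import Counter

def code_odds_ratios(cases, controls, pseudocount=0.5):
    """Odds ratio of each code's presence in cases versus controls.

    cases, controls: lists of sets of diagnostic codes, one set per patient.
    pseudocount: assumed correction that keeps the ratio finite when a cell
    of the 2x2 table is zero.
    """
    case_counts = Counter(code for patient in cases for code in patient)
    ctrl_counts = Counter(code for patient in controls for code in patient)
    ratios = {}
    for code in set(case_counts) | set(ctrl_counts):
        a = case_counts[code] + pseudocount                  # cases with code
        b = len(cases) - case_counts[code] + pseudocount     # cases without code
        c = ctrl_counts[code] + pseudocount                  # controls with code
        d = len(controls) - ctrl_counts[code] + pseudocount  # controls without code
        ratios[code] = (a * d) / (b * c)
    return ratios

# Toy cohort (illustrative codes only, not study data).
cases = [{"J84.10", "R05"}, {"R05"}, {"J84.10"}]
controls = [{"R05"}, {"Z00.00"}, {"Z00.00"}, set()]
ratios = code_odds_ratios(cases, controls)
```

A ratio above 1 marks a code seen more often among cases, and a ratio below 1 marks one seen more often among controls. A downstream model would then combine these per-code ratios into a single risk estimate.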
Probabilistic Finite State Automata
Map health history to trinary streams
Chattopadhyay, Ishanu, and Hod Lipson. "Abductive learning of quantized stochastic processes with probabilistic finite automata." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371, no. 1984 (2013): 20110543.
2
Longitudinal stochastic patterns
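Mapping a health history to a trinary stream can be sketched as below. The weekly resolution and the symbol meanings (0 for no code in a week, 1 for a code in the category of interest, 2 for codes from other categories only) are assumptions made for this example, as are the function name `trinary_stream` and the sample events.

```python
from datetime import date

def trinary_stream(events, category_codes, start, n_weeks):
    """Encode one patient's history as a weekly stream over {0, 1, 2}.

    events: list of (date, code) pairs.
    category_codes: set of codes belonging to the category of interest.
    Assumed encoding: 0 = no code that week, 1 = a code from the category,
    2 = only codes from other categories.
    """
    stream = [0] * n_weeks
    for when, code in events:
        week = (when - start).days // 7
        if not 0 <= week < n_weeks:
            continue  # event falls outside the observation window
        if code in category_codes:
            stream[week] = 1  # a category code takes precedence
        elif stream[week] == 0:
            stream[week] = 2
    return stream

events = [
    (date(2016, 8, 29), "J01.90"),
    (date(2016, 9, 10), "J01.90"),
    (date(2016, 9, 12), "R05"),
]
stream = trinary_stream(events, {"J01.90", "J01.91"}, date(2016, 8, 29), 4)
print(stream)  # [1, 1, 2, 0]
```

One such stream per disease category gives a set of symbol sequences on which stochastic models such as probabilistic finite automata can be inferred.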
Cloud Deployment
Theoretical formulation
Multi-cohort validation
Launch User-Accessible Platform
3 years
2 years
Data In
[
  {
    "patient_id": "P000038",
    "sex": "F",
    "birth_date": "01-01-2006",
    "DX_record": [
      {"date": "07-31-2006", "code": "Z38.00"},
      {"date": "08-07-2006", "code": "P59.9"},
      {"date": "08-29-2016", "code": "J01.90"},
      {"date": "09-10-2016", "code": "J01.90"},
      {"date": "11-14-2016", "code": "J01.91"}
    ],
    "RX_record": [
      {"date": "10-29-2011", "code": "rxLDA017"},
      {"date": "05-16-2015", "code": "rxIDG004"},
      {"date": "08-08-2015", "code": "rxIDG004"},
      {"date": "06-04-2016", "code": "rxIDD013"}
    ],
    "PROC_record": [
      {"date": "02-05-2007", "code": "90723"},
      {"date": "11-05-2007", "code": "J1100"}
    ]
  }
]
Data Out
{
  "predictions": [
    {
      "error_code": "",
      "patient_id": "P000012",
      "predicted_risk": 0.005794344620009157,
      "probability": 0.8253881317184486
    }
  ],
  "target": "TARGET"
}
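A client working with the input and output payloads shown here might check records before submission and read risks from the response, roughly as follows. This is a sketch under stated assumptions: the helper names `validate_patient` and `risks_by_patient` are hypothetical, dates are assumed to follow MM-DD-YYYY as in the sample, an empty `error_code` is assumed to mean success, and the numeric values are shortened for readability.

```python
import json
from datetime import datetime

DATE_FORMAT = "%m-%d-%Y"  # matches sample dates such as "07-31-2006"
RECORD_KEYS = ("DX_record", "RX_record", "PROC_record")

def validate_patient(patient):
    """Return a list of problems found in one input patient record."""
    problems = []
    for key in ("patient_id", "sex", "birth_date"):
        if key not in patient:
            problems.append(f"missing field: {key}")
    for key in RECORD_KEYS:
        for entry in patient.get(key, []):
            try:
                datetime.strptime(entry["date"], DATE_FORMAT)
            except (KeyError, ValueError):
                problems.append(f"bad date in {key}: {entry}")
            if not entry.get("code"):
                problems.append(f"missing code in {key}: {entry}")
    return problems

def risks_by_patient(response_text):
    """Map patient_id to predicted_risk, skipping entries with an error_code."""
    response = json.loads(response_text)
    return {p["patient_id"]: p["predicted_risk"]
            for p in response["predictions"] if not p["error_code"]}

patient = {
    "patient_id": "P000038", "sex": "F", "birth_date": "01-01-2006",
    "DX_record": [{"date": "07-31-2006", "code": "Z38.00"}],
    "RX_record": [{"date": "13-40-2011", "code": "rxLDA017"}],  # invalid date
}
problems = validate_patient(patient)

response_text = json.dumps({
    "predictions": [{"error_code": "", "patient_id": "P000012",
                     "predicted_risk": 0.0058, "probability": 0.8254}],
    "target": "TARGET",
})
risks = risks_by_patient(response_text)
```

Validating dates and codes on the client side keeps malformed histories from reaching the scoring step, where they would be harder to diagnose.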
Cohort Selection and Risk Analysis Testbed
Misleading Diagnosis of Idiopathic Pulmonary Fibrosis: A Clinical Concern
Javier Ramos-Rossy, MD, Onix Cantres-Fonseca, MD, Ginger Arzon-Nieves, Yomayra Otero-Dominguez, MD, Stella Baez-Corujo, MD, and William Rodríguez-Cintrón, MD
Project 1: ZCoR Dashboard and Implementation Optimization
Research Direction II
Digital Twins
General framework for inferring digital twins in biology and medicine
Stamping Out the Next Pandemic **Before** The First Human Infection
BioNORAD
Chattopadhyay, Ishanu, Kevin Wu, Jin Li, and Aaron Esser-Kahn. "Emergenet: Fast Scalable Pandemic Risk Assessment of Influenza A Strains Circulating In Non-human Hosts." (2023). Under Review in Science
PREEMPT
Predicting Future Mutations for Viral Genomes in the Wild
predict future emergence risk
Hemagglutinin
Neuraminidase
Q-Net
recursive forest
Hyperlinked Nodes
Northern Hemisphere H1N1 2023
$$J \textrm{ is the Jensen-Shannon divergence }$$
q-distance
a biologically informed, adaptive distance between strains
Smaller distances imply a quantitatively high probability of spontaneous jump
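The slides state only that J is the Jensen-Shannon divergence and that the q-distance compares strains. As a rough illustration of how a divergence-based distance between two sequences could be assembled, the sketch below averages the square root of J over sequence positions. That averaging form, the function name `q_distance_sketch`, and the input format (one predicted residue distribution per position for each strain) are assumptions for this example, not the published definition.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    # Clamp at zero to guard against tiny negative values from rounding.
    return max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m))

def q_distance_sketch(dists_x, dists_y):
    """Average sqrt(J) across sequence positions.

    dists_x[i], dists_y[i]: hypothetical predicted residue distributions at
    position i for the two strains; a trained model would supply them.
    """
    terms = [math.sqrt(jensen_shannon(p, q)) for p, q in zip(dists_x, dists_y)]
    return sum(terms) / len(terms)

# Two positions: identical distributions at the first, disjoint at the second.
d = q_distance_sketch([[0.5, 0.5], [1.0, 0.0]], [[0.5, 0.5], [0.0, 1.0]])
print(d)  # 0.5
```

With base-2 logarithms J lies between 0 and 1, so a distance built this way is 0 for identical predictions and 1 when the predictions never overlap.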
Metric Structure
Tangent Bundle
geometry
dynamics
Intrinsic Distance Can Identify the Edge of Emergence
Next Steps: Generalize to new viruses, get experimental evidence
Influenza Risk Assessment Tool (IRAT) scoring for animal strains
slow (months), quasi-subjective, expensive
*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm
24 scores in 14 years
~10,000 strains collected annually
CDC
Emergenet time: 1 second
Project 2: BioNORAD Implementation
Mental health diagnosis
opinion dynamics
microbiome
viral emergence
Digital Twins for complex systems
algorithmic lie detector
teomims
Darkome
What other problems can it solve?
Second Prize 40,000 USD
PREPARE: Pioneering Research for Early Prediction of Alzheimer's and Related Dementias EUREKA Challenge
Phase 1
Phase 2
licensed patient data
digital twin
(generative AI)
teomims
(open cohort)
Project Teomim: Hyperrealistic digital twins of individual health trajectories
Phase 1
Phase 2
Uncorrelated, yet indistinguishable !!
VeRITaAS
Can a Generative AI Tell if You Are Lying?
Vetting Response Integrity from cross-Talk in Adversarial Surveys
Q-Net
Hidden structure of cross-talk between responses to interview items
PTSD diagnostic interview
Beat the test!
200 participants in the US
100 participants in the UK
30 forensic psychiatrists
10
6
1
Can-You-Fake-PTSD Challenge Results
successful attempts
Darkome: Genotype-to-Phenotype Mapping
https://grants.nih.gov/grants/guide/pa-files/PAR-25-255.html
Project Darklight
Project 1: ZCoR Dashboard and Implementation Optimization
Project 2: BioNORAD Implementation
Project 3: Teomim dataset generation: Create validated repository
Project 4: VeRITaAS extension: Digital twins of surveys
Project 5: Darklight: genotype+ to phenotype mapping
Project 6: Cognet: Modeling belief propagation and opinion dynamics
Conservation of complexity!
for digital twins
THE PROBLEM
Assuming a 1,000-species ecosystem and one successful experiment per day, each resolving a single two-way relationship, we would need about 1,368 years to go through all possible pairs.
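The figure follows from counting unordered species pairs. A short check, assuming 365 experiments per year:

```python
from math import comb

species = 1000
pairs = comb(species, 2)  # unordered two-way relationships among 1,000 species
years = pairs / 365       # one successful experiment per day
print(pairs, round(years, 1))  # 499500 1368.5
```

The count grows quadratically with the number of species, which is why exhaustive pairwise experimentation does not scale.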
Digital Twin for the Maturing Human Microbiome
Boston U
U Chicago
Two centers
Ability to "fill in" missing data is equivalent to making trajectory forecasts
predicting neurodevelopmental deficits
forecasting ecosystem trajectories
"test-free" screening?
Lack of Universal Screening at the point of care
Early diagnosis is difficult; late or missed diagnosis costs lives
We lack universal screening for most diseases
Number of possible responses
Minimum Performance (n=624)
Average Time: 3.5 min
No. of questions: 20
AUC > 0.95
PPV > 0.86
NPV > 0.92
At least 83.3% sensitivity at 94% specificity
Minimum AUC = \(0.95 \pm 0.005\)
Cannot be coached or memorized
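The reported operating point (83.3% sensitivity at 94% specificity) can be converted to likelihood ratios and to a post-test probability for a given pre-test prevalence. The sketch below uses the standard formulas. The 10% prevalence is a hypothetical value chosen only for illustration, and the function names are assumptions.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios of a binary test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

def post_test_probability(prevalence, lr):
    """Update a pre-test probability with a likelihood ratio via odds."""
    prior_odds = prevalence / (1 - prevalence)
    post_odds = prior_odds * lr
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.833, 0.94)
# Hypothetical 10% pre-test prevalence, for illustration only.
p_after_positive = post_test_probability(0.10, lr_pos)
```

At this operating point the positive likelihood ratio is about 13.9, so under the assumed 10% prevalence a positive result raises the probability to roughly 61%.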
Datasets for training & validation
1. VA (n=294)
2. Prolific (n=300)
3. Psychiatrists (n=30)
Hyperlinked Nodes
Ohio H3N2 2017
Hyperlinked Nodes
A/Bretagne/24241/2021 H1N2 variant
Off-the-shelf AI does not suffice