Large Science Models: Generative AI for Scientific Discovery

Ishanu Chattopadhyay, PhD

Assistant Professor of Internal Medicine

Institute of Biomedical Informatics

University of Kentucky

Modeling & predicting complex social interactions

ZeDlab Research Thrusts

General framework for inferring digital twins in biology and medicine

Point-of-care test-free screening for complex diseases

Electronic Healthcare Record

IPF

ASD

ADRD

Rapid Universal Point-of-care Screening for ILD/IPF Using Comorbidity Signatures in Electronic Health Records

Flag patients before they (or their doctors) suspect disease

Primary Care

Pulmonologist

Zero-burden Co-morbid Risk Score (ZCoR)

Referral

Prognosis at Point-of-Diagnosis

Optimizing Management

Patient Journey

Continuous Risk Monitoring

Early Diagnosis

Universal Screening

Cohort Selection

Reduce screen failure rates

Holistic health surveillance

Predict antifibrotics continuation

Improve outcomes

Interstitial Lung Disease / Pulmonary Fibrosis

Aim 1: Map AP Patient Journeys to Identify Risk Patterns in Acute and Recurrent Episodes.

Acute Pancreatitis

(with Darwin Conwell's group)

Aim 2: Model Transitions from AP to Type 3c Diabetes for Early Intervention.

Aim 3: Predict ICU Admission in AP Patients Based on Disease Severity Indicators.

~ 4yrs

current survival ~4yrs

~ 4yrs

current clinical DX

ZCoR screening

Onishchenko, D., Marlowe, R.J., Ngufor, C.G. et al. Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records. Nat Med 28, 2107–2116 (2022). https://doi.org/10.1038/s41591-022-02010-y

n=~3M

AUC~90%

Likelihood ratio ~30
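To make the likelihood ratio concrete, the following minimal sketch shows how a positive likelihood ratio of about 30 updates a pre-test probability using Bayes' rule in odds form. The 0.5% pre-test probability is a hypothetical value chosen only for illustration; it is not a figure from the study.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: post-test odds = pre-test odds * likelihood ratio."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical pre-test probability (assumption, not from the slides).
pre_test = 0.005
post_test = post_test_probability(pre_test, likelihood_ratio=30.0)
print(f"pre-test {pre_test:.3f} -> post-test {post_test:.3f}")
```

Under this assumed pre-test probability, a flagged patient's probability rises by more than an order of magnitude, which is the practical meaning of a likelihood ratio of this size.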

Conventional AI/ML attempts to model the physician

AI in IPF Research

Co-morbidity patterns
No additional data demands
Uses whatever data is already in the patient's file

ICD administrative codes

IPF

ILD

target codes appear

Past medical history

No target codes appear

case

control

2yrs

prediction

Truven MarketScan (IBM)
Commercial Claims & Encounters Database
2003-2018

>100M patients visible

>7B individual claims

>87K unique diagnostic codes

>7% Medicare data present

2,053,277 patients included in study

University of Chicago Medical Center 
2012-2021

68,658 patients

Random sample from OptumLabs Data Warehouse, courtesy of Mayo Clinic

861,280 patients

2,983,215 patients in total

Data: Onishchenko et al., Nat Med 2022

patient A

patient B

patient C

Beyond "risk factors" to personalized risk patterns

Up to 4-year "signal" resolution

decreases risk

increases risk

Patient Journey: Tracking Risk over time

ZeD Lab: Predictive Screening from Comorbidity Footprints

Cell Reports

Condition	ZCoR	Competition
Autism	>83%	"obvious"
Alzheimer's Disease	~90%	60-70%
Idiopathic Pulmonary Fibrosis	~90%	NA
MACE	~80%	~70%
Bipolar Disorder	~85%	NA
CKD	~85%	NA
Rare Cancers (Bladder, Uterus)	~75-80%	Low
Suicidality (with CAT-SS)	98% PPV	Low

How?

Odds ratios combined via ML

Data

cases

control

\vdots

odds ratios for all ICD codes

ML Model

odds-based risk estimator
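The pipeline sketched above (case and control histories, per-code odds ratios, then an ML model) can be illustrated with a small example. This is only a sketch under stated assumptions, not the ZCoR implementation: the function names, the 0.5 continuity correction, the summary features, and the toy ICD-10 histories are all hypothetical.

```python
import math
from collections import Counter


def odds_ratios(case_histories, control_histories):
    """Per-code odds ratio of exposure in cases vs. controls.

    Adds 0.5 to every cell (a common continuity correction) so that
    codes absent from one group do not produce a division by zero.
    """
    n_case, n_ctrl = len(case_histories), len(control_histories)
    case_counts = Counter(c for h in case_histories for c in set(h))
    ctrl_counts = Counter(c for h in control_histories for c in set(h))
    ratios = {}
    for code in set(case_counts) | set(ctrl_counts):
        a = case_counts[code] + 0.5            # cases with the code
        b = n_case - case_counts[code] + 0.5   # cases without the code
        c = ctrl_counts[code] + 0.5            # controls with the code
        d = n_ctrl - ctrl_counts[code] + 0.5   # controls without the code
        ratios[code] = (a * d) / (b * c)
    return ratios


def log_odds_features(history, ratios):
    """Hypothetical summary features that a downstream ML model could consume."""
    logs = [math.log(ratios[c]) for c in set(history) if c in ratios]
    if not logs:
        return {"max": 0.0, "mean": 0.0, "count": 0}
    return {"max": max(logs), "mean": sum(logs) / len(logs), "count": len(logs)}


# Toy histories (invented for illustration only).
cases = [["J84.10", "R05"], ["R05", "K21.9"], ["R05"]]
controls = [["K21.9"], ["I10"], ["K21.9", "I10"]]

ratios = odds_ratios(cases, controls)
features = log_odds_features(["R05", "I10"], ratios)
print(ratios)
print(features)
```

Codes with an odds ratio above 1 are over-represented among cases and push the estimated risk up; codes below 1 push it down.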

0: \textrm{healthy}\\ 1: \textrm{infections}\\ 2: \textrm{other}

Probabilistic Finite State

Map health history to trinary streams
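As a minimal sketch of this encoding, the snippet below bins a dated diagnosis history into weekly symbols: 0 for a week with no codes, 1 for a week with a code in the category of interest (infections in the slide), and 2 for a week with only other codes. The weekly bin width, the rule that an in-category code takes precedence within a week, and the prefix test used to decide category membership are assumptions made for illustration.

```python
from datetime import date


def ternary_stream(records, start, n_weeks, in_category):
    """Encode (date, code) records as one symbol per week.

    0: no diagnosis recorded that week
    1: at least one code in the category of interest
    2: only codes outside the category
    """
    stream = [0] * n_weeks
    for day, code in records:
        week = (day - start).days // 7
        if not 0 <= week < n_weeks:
            continue  # record falls outside the observation window
        if in_category(code):
            stream[week] = 1          # in-category code takes precedence
        elif stream[week] == 0:
            stream[week] = 2
    return stream


def is_infection(code):
    """Hypothetical category test: treat ICD-10 codes starting with J01 as infections."""
    return code.startswith("J01")


records = [
    (date(2016, 8, 29), "J01.90"),
    (date(2016, 9, 10), "J01.90"),
    (date(2016, 9, 12), "Z00.129"),
]
stream = ternary_stream(records, start=date(2016, 8, 29), n_weeks=4, in_category=is_infection)
print(stream)
```

One such stream per disease category turns a patient's history into a set of symbol sequences that a probabilistic finite-state model can consume.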

Chattopadhyay, Ishanu, and Hod Lipson. "Abductive learning of quantized stochastic processes with probabilistic finite automata." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371, no. 1984 (2013): 20110543.

Longitudinal stochastic patterns

Cloud Deployment

Theoretical formulation

Multi-cohort validation

Launch User-Accessible Platform

3 years

2 years

[
    {
        "patient_id": "P000038",
        "sex": "F",
        "birth_date": "01-01-2006",
        "DX_record": [
            {"date": "07-31-2006", "code": "Z38.00"},
            {"date": "08-07-2006", "code": "P59.9"},
            {"date": "08-29-2016", "code": "J01.90"},
            {"date": "09-10-2016", "code": "J01.90"},
            {"date": "11-14-2016", "code": "J01.91"}
        ],
        "RX_record": [
            {"date": "10-29-2011", "code": "rxLDA017"},
            {"date": "05-16-2015", "code": "rxIDG004"},
            {"date": "08-08-2015", "code": "rxIDG004"},
            {"date": "06-04-2016", "code": "rxIDD013"}
        ],
        "PROC_record": [
            {"date": "02-05-2007", "code": "90723"},
            {"date": "11-05-2007", "code": "J1100"}
        ]
    }
]

{
  "predictions": [
    {
      "error_code": "",
      "patient_id": "P000012",
      "predicted_risk": 0.005794344620009157,
      "probability": 0.8253881317184486
    }
  ],
  "target": "TARGET"
}

Data In

Data Out
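The input and output samples above suggest a simple client-side workflow: check a patient record before submission, then read risks out of the response. The sketch below uses only the field names visible in the samples; the validation rules and the function names are hypothetical, and no service endpoint or client API is assumed.

```python
import json
from datetime import datetime

DATE_FMT = "%m-%d-%Y"  # dates in the sample look like "07-31-2006"
RECORD_KEYS = ("DX_record", "RX_record", "PROC_record")


def validate_patient(patient):
    """Return a list of problems found in one patient record (empty if none)."""
    problems = []
    for key in ("patient_id", "sex", "birth_date"):
        if key not in patient:
            problems.append(f"missing {key}")
    for key in RECORD_KEYS:
        for entry in patient.get(key, []):
            try:
                datetime.strptime(entry["date"], DATE_FMT)
            except (KeyError, ValueError):
                problems.append(f"bad date in {key}: {entry}")
            if "code" not in entry:
                problems.append(f"missing code in {key}: {entry}")
    return problems


def risks_by_patient(response_text):
    """Map patient_id to predicted_risk, skipping entries that report an error."""
    response = json.loads(response_text)
    return {
        p["patient_id"]: p["predicted_risk"]
        for p in response["predictions"]
        if not p["error_code"]
    }


patient = {
    "patient_id": "P000038",
    "sex": "F",
    "birth_date": "01-01-2006",
    "DX_record": [{"date": "07-31-2006", "code": "Z38.00"}],
    "RX_record": [{"date": "10-29-2011", "code": "rxLDA017"}],
    "PROC_record": [{"date": "02-05-2007", "code": "90723"}],
}
response_text = json.dumps({
    "predictions": [{
        "error_code": "",
        "patient_id": "P000012",
        "predicted_risk": 0.005794344620009157,
        "probability": 0.8253881317184486,
    }],
    "target": "TARGET",
})

problems = validate_patient(patient)
risks = risks_by_patient(response_text)
print(problems)
print(risks)
```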

Cohort Selection and Risk Analysis Testbed

https://paraknowledge.ai/zcor-testbed/

https://paraknowledge.ai/zcor-demo/

Misleading Diagnosis of Idiopathic Pulmonary Fibrosis: A Clinical Concern
Javier Ramos-Rossy, MD, Onix Cantres-Fonseca, MD, Ginger Arzon-Nieves, Yomayra Otero-Dominguez, MD, Stella Baez-Corujo, MD, and William Rodríguez-Cintrón, MD

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6248220/

Project 1: ZCoR Dashboard and Implementation Optimization

Research Direction II

Digital Twins

General framework for inferring digital twins in biology and medicine

Stamping Out the Next Pandemic **Before** The First Human Infection

BioNORAD

Chattopadhyay, Ishanu, Kevin Wu, Jin Li, and Aaron Esser-Kahn. "Emergenet: Fast Scalable Pandemic Risk Assessment of Influenza A Strains Circulating In Non-human Hosts." (2023). Under Review in Science

PREEMPT

Predicting Future Mutations for Viral Genomes in the Wild

predict future emergence risk

Hemagglutinin

Neuraminidase

\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)

Q-Net

recursive forest

Hyperlinked Nodes

Northern Hemisphere H1N1 2023

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i(x_{-i}) , \Phi_i(y_{-i})\right ) \right )

This distance is "special"

\mathbb{J} \textrm{ is the Jensen-Shannon divergence}


q-distance

a biologically informed, adaptive distance between strains
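The definition of \theta(x,y) can be sketched numerically as follows. The Jensen-Shannon divergence and the averaging over positions follow the formula directly; the conditional predictor is a toy stand-in, not a Q-Net. Here Phi_i conditions only on the preceding position and is estimated by smoothed counts from a small invented sample, whereas a Q-Net uses a recursive forest of learned predictors. The alphabet, sample sequences, and function names are assumptions for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions over ALPHABET."""
    def kl(a, b):
        return sum(a[s] * math.log2(a[s] / b[s]) for s in ALPHABET if a[s] > 0)

    m = {s: 0.5 * (p[s] + q[s]) for s in ALPHABET}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def make_phi(sample):
    """Toy stand-in for Phi_i: predicts position i from position i-1 only."""
    def phi(i, seq):
        j = (i - 1) % len(seq)
        counts = Counter(s[i] for s in sample if s[j] == seq[j])
        total = sum(counts.values()) + len(ALPHABET)  # Laplace smoothing
        return {s: (counts[s] + 1) / total for s in ALPHABET}

    return phi


def q_distance(x, y, phi):
    """theta(x, y): mean over positions i of sqrt(JS(Phi_i(x_-i), Phi_i(y_-i)))."""
    terms = [
        math.sqrt(max(js_divergence(phi(i, x), phi(i, y)), 0.0))
        for i in range(len(x))
    ]
    return sum(terms) / len(terms)


# Invented aligned sequences standing in for a strain database.
sample = ["ACGT", "ACGA", "TCGT", "ACTT", "AGGT"]
phi = make_phi(sample)
x, y = "ACGT", "TCGA"

d_xx = q_distance(x, x, phi)
d_xy = q_distance(x, y, phi)
d_yx = q_distance(y, x, phi)
print(d_xx, d_xy, d_yx)
```

Because the distance compares predicted distributions rather than raw symbols, two sequences that differ at positions the model treats as interchangeable come out closer than a plain edit distance would suggest.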


Smaller distances imply a quantifiably higher probability of a spontaneous jump

Metric Structure

Tangent Bundle

geometry

dynamics

\theta(x,y) \sim \log Pr(x \rightarrow y)

\theta

\frac{\delta \theta(x,y)}{\delta y}

Intrinsic Distance Can Identify the Edge of Emergence

Next Steps: Generalize to new viruses, get experimental evidence

Influenza Risk Assessment Tool (IRAT) scoring for animal strains

slow (months), quasi-subjective, expensive

*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm

24 scores in 14 years

~10,000 strains collected annually

CDC

Emergenet time: 1 second
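The throughput gap implied by these figures can be worked out directly. The sketch assumes the 1 second figure applies per strain and that strains are scored one after another; both are assumptions, since the slide does not state them.

```python
# Figures quoted on the slide.
irat_scores, irat_years = 24, 14
strains_per_year = 10_000
emergenet_seconds_per_strain = 1.0  # assumed to be per strain, run sequentially

irat_scores_per_year = irat_scores / irat_years
emergenet_hours_per_year = strains_per_year * emergenet_seconds_per_strain / 3600

print(f"IRAT: about {irat_scores_per_year:.1f} scores per year")
print(f"Emergenet: about {emergenet_hours_per_year:.1f} hours to score a year's strains")
```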


Project 1: ZCoR Dashboard and Implementation Optimization

Project 2: BioNORAD Implementation

Mental health diagnosis

opinion dynamics

microbiome

viral emergence

Digital Twins for complex systems

algorithmic lie detector

teomims

Darkome

What other problems can it solve?

Second Prize 40,000 USD

PREPARE: Pioneering Research for Early Prediction of Alzheimer's and Related Dementias EUREKA Challenge

Phase 1

Phase 2

licensed patient data

digital twin

(generative AI)

teomims

(open cohort)

Project Teomim: Hyperrealistic digital twins of individual health trajectories

Phase 1

Phase 2

Uncorrelated, yet indistinguishable!

VeRITaAS

Can a Generative AI Tell if You Are Lying?

Vetting Response Integrity from cross-Talk in Adversarial Surveys

Q-Net

Hidden structure of cross-talk between responses to interview items

PTSD diagnostic interview

Beat the test!

paraknowledge.ai/veritas

200 participants in

100 participants in

30 forensic psychiatrists

Can-You-Fake-PTSD Challenge Results

successful attempts

Darkome: Genotype to Phenotype Mapping

https://grants.nih.gov/grants/guide/pa-files/PAR-25-255.html

Project Darklight

Project 1: ZCoR Dashboard and Implementation Optimization

Project 2: BioNORAD Implementation

Project 3: Teomim dataset generation: Create validated repository

Project 4: VeRITaAS extension: Digital twins of surveys

Project 5: Darklight: genotype+ to phenotype mapping

Project 6: Cognet: Modeling belief propagation and opinion dynamics

Conservation of complexity!

K(x) = K(S) + K(x \vert S_\star) + O(1)

for digital twins

K(x \vert S_\star) = O(1)

THE PROBLEM

Assuming a 1,000-species ecosystem and one successful experiment per day, each resolving a single two-way relationship, we would need about 1,368 years to go through all pairwise possibilities.
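The 1,368-year figure follows from counting unordered species pairs. A short check, assuming one pair resolved per day and 365-day years:

```python
import math

n_species = 1000
pairs = math.comb(n_species, 2)   # unordered two-way relationships
years = pairs / 365               # one successful experiment per day

print(pairs, int(years))
```

Higher-order interactions among three or more species would push the count far beyond this pairwise estimate.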

Digital Twin for the Maturing Human Microbiome