Large Science Models: Generative AI for Scientific Discovery

Ishanu Chattopadhyay, PhD

Assistant Professor of Internal Medicine

Institute of Biomedical Informatics

University of Kentucky

Modeling & predicting complex social interactions

ZeDlab Research Thrusts

General framework for inferring digital twins in biology and medicine

Point-of-care test-free screening for complex diseases

AI

Electronic Healthcare Record 

IPF

ASD

ADRD

Rapid Universal Point-of-care Screening for ILD/IPF Using Comorbidity Signatures in Electronic Health Records

Flag patients before they (or their doctors) suspect disease

Primary Care

Pulmonologist

Zero-burden Co-morbid Risk Score (ZCoR)

Referral

Prognosis at Point-of-Diagnosis 

  • Optimizing Management

Patient Journey 

  • Continuous Risk Monitoring

Early Diagnosis

  • Universal Screening
  • Cohort Selection

Reduce screen failure rates

Holistic health surveillance

Predict antifibrotics continuation

Improve outcomes

Interstitial Lung Disease / Pulmonary Fibrosis

1

2

3

Aim 1: Map AP Patient Journeys to Identify Risk Patterns in Acute and Recurrent Episodes.

Acute Pancreatitis

(with Darwin Conwell's group)

Aim 2: Model Transitions from AP to Type 3c Diabetes for Early Intervention.

Aim 3: Predict ICU Admission in AP Patients Based on Disease Severity Indicators.

~ 4yrs

current survival ~4 yrs

~ 4yrs

current clinical DX

ZCoR screening

Onishchenko, D., Marlowe, R.J., Ngufor, C.G. et al. Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records. Nat Med 28, 2107–2116 (2022). https://doi.org/10.1038/s41591-022-02010-y

n=~3M

AUC~90%

Likelihood ratio ~30
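One way to read a likelihood ratio of about 30 is as a multiplier on the pre-test odds of disease. The sketch below applies that standard odds-form Bayes update. The 0.5% pre-test prevalence is a hypothetical value chosen only for illustration; it is not a figure from the study.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Bayes update in odds form: post-test odds = pre-test odds * LR."""
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical pre-test prevalence (an assumption, not from the study).
ASSUMED_PREVALENCE = 0.005
# Approximate likelihood ratio quoted on the slide.
LIKELIHOOD_RATIO = 30.0

flagged_risk = post_test_probability(ASSUMED_PREVALENCE, LIKELIHOOD_RATIO)
print(f"post-test probability for a flagged patient: {flagged_risk:.3f}")
```

Under that assumed prevalence, a positive flag moves a patient from well under 1% to roughly 13% probability, which is the sense in which a screen with this likelihood ratio can justify a referral.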

Conventional AI/ML attempts to model the physician

AI in IPF Research

  • Co-morbidity patterns
  • No data demands
  • Use whatever data is already on patient file

ICD administrative codes

IPF

ILD

target codes appear

Past medical history

No target codes appear

case

control

2yrs

2yrs

prediction

Truven MarketScan (IBM)
Commercial Claims & Encounters Database
2003-2018

>100M patients visible 

>7B individual claims

>87K unique diagnostic codes

>7% Medicare data present

2,053,277 patients included in study

University of Chicago Medical Center 
2012-2021

68,658 patients

Random sample from OptumLabs Data Warehouse, courtesy of Mayo Clinic

861,280 patients 

2,983,215 patients

Data: Onishchenko et al., Nat Med (2022)

patient A

patient B

patient C

Beyond "risk factors" to personalized risk patterns

Up to 4-year "signal" resolution

decreases risk

increases risk

Patient Journey: Tracking Risk over time

ZeD Lab: Predictive Screening from Comorbidity Footprints

CELL Reports

| Condition | ZCoR | Competition |
|---|---|---|
| Autism | >83% | "obvious" |
| Alzheimer's Disease | ~90% | 60-70% |
| Idiopathic Pulmonary Fibrosis | ~90% | NA |
| MACE | ~80% | ~70% |
| Bipolar Disorder | ~85% | NA |
| CKD | ~85% | NA |
| Rare Cancers (Bladder, Uterus) | ~75-80% | Low |
| Suicidality (with CAT-SS) | 98% PPV | Low |

How?

Odds ratios combined via ML 

1

Data (cases, controls) → odds ratios for all ICD codes → ML model → odds-based risk estimator
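The first stage of that pipeline, odds ratios for every diagnostic code, can be sketched as follows. This is a minimal illustration, not the ZCoR implementation: the function name, the toy histories, and the 0.5 pseudocount (used to avoid division by zero) are all assumptions, and the downstream ML model is not shown.

```python
from collections import Counter


def odds_ratios(case_histories, control_histories, pseudocount=0.5):
    """Per-code odds ratio of a code appearing in a case vs. a control history.

    Each history is an iterable of diagnostic codes for one patient.
    The pseudocount is an assumption added to keep every ratio finite.
    """
    n_cases, n_controls = len(case_histories), len(control_histories)
    # Count patients (not claims) carrying each code at least once.
    case_counts = Counter(c for h in case_histories for c in set(h))
    control_counts = Counter(c for h in control_histories for c in set(h))
    ratios = {}
    for code in set(case_counts) | set(control_counts):
        a = case_counts[code] + pseudocount                  # cases with code
        b = n_cases - case_counts[code] + pseudocount        # cases without
        c = control_counts[code] + pseudocount               # controls with code
        d = n_controls - control_counts[code] + pseudocount  # controls without
        ratios[code] = (a / b) / (c / d)
    return ratios


# Toy histories (hypothetical, for illustration only).
cases = [["R05", "R06.02"], ["R06.02"], ["R05"]]
controls = [["R05"], ["Z00.00"], ["R05", "Z00.00"]]
ratios = odds_ratios(cases, controls)
for code, value in sorted(ratios.items()):
    print(code, round(value, 2))
```

Codes with a ratio above 1 are over-represented among cases and codes below 1 among controls; a vector of such ratios per patient is the kind of feature a downstream model could combine.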

0: \textrm{healthy}\\ 1: \textrm{infections}\\ 2: \textrm{other}

Probabilistic Finite State

Map health history to trinary streams

Chattopadhyay, Ishanu, and Hod Lipson. "Abductive learning of quantized stochastic processes with probabilistic finite automata." Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371, no. 1984 (2013): 20110543.
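A minimal sketch of the trinary encoding, using the symbol legend from the slide (0: healthy, 1: infections, 2: other). The weekly bin width, the rule that a category code takes precedence within a week, and the helper names are assumptions made for illustration.

```python
from datetime import date


def trinary_stream(records, in_category, start, n_weeks):
    """Encode a diagnosis history as one symbol per week.

    0: no diagnostic code that week
    1: at least one code in the category of interest
    2: only codes outside the category
    """
    stream = [0] * n_weeks
    for day, code in records:
        week = (day - start).days // 7
        if not 0 <= week < n_weeks:
            continue  # event falls outside the observation window
        if in_category(code):
            stream[week] = 1  # assumed rule: a category code takes precedence
        elif stream[week] == 0:
            stream[week] = 2
    return stream


# Toy history: ICD-10 codes starting with "J01" (acute sinusitis) stand in
# for the "infections" category; "R05" is an unrelated code.
records = [
    (date(2016, 8, 29), "J01.90"),
    (date(2016, 9, 10), "J01.90"),
    (date(2016, 9, 12), "R05"),
]
stream = trinary_stream(records, lambda c: c.startswith("J01"), date(2016, 8, 29), 4)
print(stream)
```

One such stream per disease category turns a patient file into a set of symbol sequences, which is the form the probabilistic finite state models consume.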

2

Longitudinal stochastic patterns

Cloud Deployment

Theoretical formulation

Multi-cohort validation

Launch User-Accessible Platform

3 years

2 years

Data In

[
    {
        "patient_id": "P000038",
        "sex": "F",
        "birth_date": "01-01-2006",
        "DX_record": [
            {"date": "07-31-2006", "code": "Z38.00"},
            {"date": "08-07-2006", "code": "P59.9"},
            {"date": "08-29-2016", "code": "J01.90"},
            {"date": "09-10-2016", "code": "J01.90"},
            {"date": "11-14-2016", "code": "J01.91"}
        ],
        "RX_record": [
            {"date": "10-29-2011", "code": "rxLDA017"},
            {"date": "05-16-2015", "code": "rxIDG004"},
            {"date": "08-08-2015", "code": "rxIDG004"},
            {"date": "06-04-2016", "code": "rxIDD013"}
        ],
        "PROC_record": [
            {"date": "02-05-2007", "code": "90723"},
            {"date": "11-05-2007", "code": "J1100"}
        ]
    }
]

Data Out

{
  "predictions": [
    {
      "error_code": "",
      "patient_id": "P000012",
      "predicted_risk": 0.005794344620009157,
      "probability": 0.8253881317184486
    }
  ],
  "target": "TARGET"
}
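Before sending a record in the input format shown, a client could check it locally. The sketch below is an assumption-laden illustration, not the platform's own validation: the field names are taken from the example, the MM-DD-YYYY date convention is inferred from values such as "07-31-2006", and `validate_patient` and its list of required fields are hypothetical.

```python
import json
from datetime import datetime

DATE_FORMAT = "%m-%d-%Y"  # inferred from the example, e.g. "07-31-2006"
RECORD_KEYS = ("DX_record", "RX_record", "PROC_record")


def validate_patient(patient):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = [
        f"missing field: {key}"
        for key in ("patient_id", "sex", "birth_date")
        if key not in patient
    ]
    dated = []
    if "birth_date" in patient:
        dated.append(("birth_date", patient["birth_date"]))
    for key in RECORD_KEYS:
        for i, event in enumerate(patient.get(key, [])):
            if "code" not in event:
                problems.append(f"{key}[{i}] has no code")
            dated.append((f"{key}[{i}].date", event.get("date")))
    for label, value in dated:
        try:
            datetime.strptime(value, DATE_FORMAT)
        except (TypeError, ValueError):
            problems.append(f"bad or missing date at {label}: {value!r}")
    return problems


# Toy payload with one deliberately malformed date (ISO order instead of MM-DD-YYYY).
payload = json.loads("""
[{"patient_id": "P000038", "sex": "F", "birth_date": "01-01-2006",
  "DX_record": [{"date": "07-31-2006", "code": "Z38.00"}],
  "RX_record": [],
  "PROC_record": [{"date": "2007-02-05", "code": "90723"}]}]
""")
issues = validate_patient(payload[0])
print(issues)
```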

Cohort Selection and Risk Analysis Testbed

Misleading Diagnosis of Idiopathic Pulmonary Fibrosis: A Clinical Concern
Javier Ramos-Rossy, MD, Onix Cantres-Fonseca, MD, Ginger Arzon-Nieves, Yomayra Otero-Dominguez, MD, Stella Baez-Corujo, MD, and William Rodríguez-Cintrón, MD

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6248220/

Project 1:  ZCoR Dashboard and Implementation Optimization

Research Direction II

Digital Twins

General framework for inferring digital twins in biology and medicine

Stamping Out the Next Pandemic **Before** The First Human Infection

BioNorad

Chattopadhyay, Ishanu, Kevin Wu, Jin Li, and Aaron Esser-Kahn. "Emergenet: Fast Scalable Pandemic Risk Assessment of Influenza A Strains Circulating In Non-human Hosts." (2023). Under Review in Science

PREEMPT

Predicting Future Mutations for Viral Genomes in the Wild

Predict future emergence risk

Hemagglutinin

Neuraminidase

\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)

Q-Net

recursive forest

Hyperlinked Nodes

Northern Hemisphere H1N1 2023

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i(x_{-i}) , \Phi_i(y_{-i})\right ) \right )

This distance is "special"

$$\mathbb{J} \textrm{ is the Jensen-Shannon divergence}$$


q-distance

a biologically informed, adaptive distance between strains
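The definition of the q-distance can be sketched directly: for each position i, compare the distributions the predictor Φ_i assigns given the rest of each sequence, and average the square root of their Jensen-Shannon divergence. In the sketch below the predictors are a hypothetical toy stand-in, not a trained Q-Net, and the function names and four-letter alphabet are assumptions.

```python
import math

ALPHABET = "ACGT"


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, so values lie in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def q_distance(x, y, predictors):
    """theta(x, y): mean over positions i of sqrt(JS(Phi_i(x_-i), Phi_i(y_-i)))."""
    assert len(x) == len(y) == len(predictors)
    total = 0.0
    for i, phi in enumerate(predictors):
        x_rest = x[:i] + x[i + 1:]  # x with position i removed
        y_rest = y[:i] + y[i + 1:]  # y with position i removed
        # Clamp at zero to guard against tiny negative rounding errors.
        total += math.sqrt(max(js_divergence(phi(x_rest), phi(y_rest)), 0.0))
    return total / len(predictors)


def toy_predictor(rest):
    """Hypothetical stand-in for Phi_i: favors the most frequent symbol in the context."""
    top = max(ALPHABET, key=rest.count)
    return [0.7 if s == top else 0.1 for s in ALPHABET]


x, y = "AACA", "GGCG"
predictors = [toy_predictor] * len(x)
print(q_distance(x, x, predictors), q_distance(x, y, predictors))
```

Because the comparison runs through the learned predictors rather than through raw edit counts, two sequences that differ at many sites can still be close if the model expects the same residues in context, which is what makes the distance adaptive.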


Smaller distances imply a quantifiably higher probability of a spontaneous jump

Metric Structure

Tangent Bundle

geometry

dynamics

\theta(x,y) \sim \log Pr(x \rightarrow y)
\theta
\frac{\delta \theta(x,y)}{\delta y}

Intrinsic Distance Can Identify the Edge of Emergence

Next Steps: Generalize to new viruses, get experimental evidence

Influenza Risk Assessment Tool (IRAT) scoring for animal strains

slow (months), quasi-subjective, expensive

*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm

24 scores in 14 years

~10,000 strains collected annually

CDC

Emergenet time: 1 second


Project 2:  BioNORAD Implementation

Mental health diagnosis

opinion dynamics

microbiome

viral emergence

Digital Twins for complex systems

algorithmic lie detector

teomims

Darkome

What other problems can it solve?

Second Prize 40,000 USD

PREPARE: Pioneering Research for Early Prediction of Alzheimer's and Related Dementias EUREKA Challenge

Phase 1

Phase 2

licensed patient data

digital twin

(generative AI)

teomims

(open cohort)

Project Teomim: Hyperrealistic digital twins of individual health trajectories

Phase 1

Phase 2

Uncorrelated, yet indistinguishable!

VeRITaAS

Can a Generative AI Tell If You Are Lying?

Vetting Response Integrity from cross-Talk in Adversarial Surveys

Q-Net

Hidden structure of cross-talk between responses to interview items

PTSD diagnostic interview

Beat the test!

200 participants in the US

100 participants in the UK

30 forensic psychiatrists

10

6

1

Can-You-Fake-PTSD Challenge Results

successful attempts

Darkome: Genotype to Phenotype Mapping

https://grants.nih.gov/grants/guide/pa-files/PAR-25-255.html

Project Darklight

Project 1:  ZCoR Dashboard and Implementation Optimization

Project 2:  BioNORAD Implementation

Project 3:  Teomim dataset generation: Create validated repository

Project 4:  VeRITaAS extension: Digital twins of surveys

Project 5:  Darklight: genotype+ to phenotype mapping 

Project 6:  Cognet: Modeling belief propagation and opinion dynamics 

Conservation of complexity!

K(x) = K(S) + K(x \vert S_\star) + O(1)

for digital twins

K(x \vert S_\star) = O(1)
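Read together, the two statements imply that the description length of a digital twin's output is essentially that of the model itself. A short worked step, assuming the usual reading in which S is the inferred model and S_\star its shortest description (this reading is an assumption, not stated on the slide):

```latex
K(x) = K(S) + K(x \vert S_\star) + O(1),
\qquad K(x \vert S_\star) = O(1)
\;\Longrightarrow\;
K(x) = K(S) + O(1)
```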

THE PROBLEM

Assuming a 1,000-species ecosystem and one successful experiment per day, each discerning a single two-way relationship, we would need about 1,368 years to go through all possibilities.
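The figure follows from counting unordered species pairs. A quick check of the arithmetic, assuming a 365-day year and one pair resolved per day (the constant names are illustrative):

```python
from math import comb

N_SPECIES = 1000
EXPERIMENTS_PER_DAY = 1
DAYS_PER_YEAR = 365

pairs = comb(N_SPECIES, 2)  # number of distinct two-way relationships
years = pairs / (EXPERIMENTS_PER_DAY * DAYS_PER_YEAR)
print(pairs, int(years))  # prints: 499500 1368
```

The count grows quadratically with the number of species, and it covers only pairwise relationships; higher-order interactions would make exhaustive experimentation even less feasible.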

Digital Twin for the Maturing Human Microbiome 

  • Forecast microbiome maturation trajectories
  • Predict neurodevelopmental deficits

Boston U

U Chicago 

Two centers

Ability to "fill in" missing data is equivalent to making trajectory forecasts

predicting neurodevelopmental deficits

forecasting ecosystem trajectories

"test-free" screening?

  • Autism
  • Idiopathic Pulmonary Fibrosis
  • Alzheimer's Disease and related dementia
  • Suicidality, PTSD
  • Perioperative Cardiac Event
  • Aggressive Melanoma
  • Uterine Cancer
  • Pancreatic Cancer
  • non-existent biomarkers
  • expensive, time-consuming diagnostic tests

Lack of Universal Screening at the point of care

Early diagnosis is difficult; late or missed diagnosis costs lives

We lack universal screening for most diseases

Number of possible responses

Minimum Performance (n=624)

Average Time: 3.5 min

No. of questions: 20

AUC > 0.95

PPV > 0.86

NPV > 0.92

At least 83.3% sensitivity at 94% specificity

Minimum AUC = \(0.95 \pm 0.005\)
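Sensitivity and specificity describe the test itself, while PPV and NPV also depend on how common the condition is in the tested group. The sketch below shows that relationship for the operating point quoted above (83.3% sensitivity at 94% specificity). The 30% prevalence is a hypothetical value chosen for illustration and is not reported on the slide.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from test characteristics and prevalence."""
    tp = sensitivity * prevalence               # true positives (as a fraction)
    fp = (1 - specificity) * (1 - prevalence)   # false positives
    tn = specificity * (1 - prevalence)         # true negatives
    fn = (1 - sensitivity) * prevalence         # false negatives
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical prevalence in the evaluation cohort (an assumption).
ASSUMED_PREVALENCE = 0.30

ppv, npv = predictive_values(0.833, 0.94, ASSUMED_PREVALENCE)
print(f"PPV={ppv:.3f} NPV={npv:.3f}")
```

Rerunning with a lower prevalence lowers the PPV and raises the NPV, so predictive values quoted for one cohort should not be carried over to a population with a different base rate.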

Cannot be coached, or memorized

Datasets for training & validation

1. VA (n=294)

2. Prolific (n=300)

3. Psychiatrists (n=30)

10^{25}

Hyperlinked Nodes

Ohio H3N2 2017

Hyperlinked Nodes

A/Bretagne/24241/2021 H1N2

Variant

Off-the-shelf AI does not suffice

CAAI

By Ishanu Chattopadhyay
