Large Science Models for Predictive Biosurveillance of Zoonotic Emergence

Ishanu Chattopadhyay, PhD

Assistant Professor, Biomedical Informatics & Computer Science

University of Kentucky

ishanu_ch@uky.edu

CURE-KY: Infectious Research Day 2026

March 19, 2026

Are emergence events predictable?

  • 1918 onward: classical swine H1N1 persists in pigs [1,2]
  • 1979: Eurasian avian-like swine H1N1 becomes established in European swine [2]
  • 1998: North American triple-reassortant swine viruses emerge [3]
  • 2008-2009: pandemic genome assembled in swine; 6 segments from North American triple-reassortant swine viruses, 2 segments (NA, M) from Eurasian avian-like swine [3]
  • Mar-Apr 2009: human outbreak recognized in Mexico and the U.S. [3,4]
  • Jun 11, 2009: WHO declares pandemic [5]
  1. Memoli MJ, Tumpey TM, Jagger BW, Dugan VG, Sheng ZM, Qi L, Kash JC, Taubenberger JK. An early “classical” swine H1N1 influenza virus shows similar pathogenicity to the 1918 pandemic virus in ferrets and mice. Virology. 2009;393(2):338-345. doi:10.1016/j.virol.2009.08.021.
  2. Dunham EJ, Dugan VG, Kaser EK, Perkins SE, Brown IH, Holmes EC, Taubenberger JK. Different evolutionary trajectories of European avian-like and classical swine H1N1 influenza A viruses. Journal of Virology. 2009;83(11):5485-5494. doi:10.1128/JVI.02565-08.
  3. Smith GJD, Vijaykrishna D, Bahl J, Lycett SJ, Worobey M, Pybus OG, Ma SK, Cheung CL, Raghwani J, Bhatt S, Peiris JSM, Guan Y, Rambaut A. Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature. 2009;459(7250):1122-1125. doi:10.1038/nature08182.
  4. Centers for Disease Control and Prevention. Outbreak of swine-origin influenza A (H1N1) virus infection, Mexico, March-April 2009. MMWR Morb Mortal Wkly Rep. 2009;58(17):467-470.
  5. World Health Organization. DG Statement following the meeting of the Emergency Committee. 11 June 2009.

[Timeline figure: the 2009 swine-flu pandemic, with markers at 1918, 1979, 1998, 2008, and 2009 (Mar, Jun)]

Let's train a machine learning model to distinguish human from swine influenza HA proteins

High confidence generally

\textrm{Emergence Risk} = \frac{1}{\textrm{classification accuracy}}

Data source: NCBI and GISAID HA sequences collected between 1999 and 2016 globally (HA protein fasta)
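The risk proxy above (the reciprocal of classification accuracy) can be sketched as follows. This is a minimal illustration, not the pipeline behind the slides: the function names and the toy host labels are hypothetical, and the labels are assumed to come from any human-vs-swine classifier evaluated on sequences from one region and time window.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Fraction of sequences whose host label was predicted correctly."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def emergence_risk(true_labels, predicted_labels):
    """Emergence risk = 1 / classification accuracy (infinite if accuracy is 0)."""
    accuracy = classification_accuracy(true_labels, predicted_labels)
    return float("inf") if accuracy == 0 else 1.0 / accuracy


# Hypothetical host labels for four HA sequences (illustration only).
truth = ["human", "human", "swine", "swine"]
guess = ["human", "swine", "swine", "swine"]
print(emergence_risk(truth, guess))  # accuracy is 0.75, so risk is 1 / 0.75
```

A perfectly separable region scores the minimum risk of 1; the harder it is to tell human and swine sequences apart, the higher the score.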

Spatial Risk over time

The signal is clear!

However, it is not very actionable


Pr(x \rightarrow y)

Calculate the quantitative odds of a strain \(x\) giving rise to strain \(y\) in the wild

Preempt which specific strain is poised to cross the edge of emergence

Hemagglutinin (HA): mediates cellular entry

Neuraminidase (NA): mediates cellular exit

Surface structures maximally involved in host interaction

Large Science Models

Learn the emergent constraints from data

Example:

Adjacent sequence blocks do not map to adjacent spatial regions after folding.

  • Non-collinearity: residues that are far apart in sequence can be:

    • physically adjacent in 3D

    • functionally coupled (e.g., receptor binding)

  • Functional dependence is long-range:

    • Mutations in HA1 (head) can affect HA2 (fusion machinery)

    • Antigenicity depends on global structural constraints, not local sequence neighborhoods

Functional coupling

Statistical association in random perturbations

Building A Large Science Model

NCBI + GISAID

>200,000 HA/NA sequences

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow}X = \{x^1, \ldots, x^N\}, \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:} }\quad & \color{gray} x^{-i} = \{x^j : j \ne i\}\\
\text{Crosstalk:} \quad & \forall i \ P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \varnothing \\
{\color{gray}\text{Notation:} }\quad & \color{gray}\psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}
[Figure: data matrix whose rows are samples (strain 1, strain 2, ..., strain m) and whose columns are observables (labeled 222, 223, 224, ..., 560)]

Distributions over alphabet \(\Sigma^i\)

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

Example: HA site 223 on Influenza A, where \(\psi^i\) is a distribution over the residues K, G, Y, S, T drawn from the alphabet \(\Sigma^i\)

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)

population

individual

estimate is always a non-empty non-degenerate distribution

missing observation
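The idea that each site gets its own predictor, and that the estimate stays a full distribution even when the observation is missing, can be illustrated with a toy stand-in. The slides use conditional inference trees; the sketch below swaps in a far simpler technique (conditioning on a single other site, with add-one smoothing), and every name and sequence in it is hypothetical.

```python
from collections import Counter


def site_predictor(strains, target, context, alphabet, pseudocount=1.0):
    """Return a function mapping a context residue to a smoothed distribution
    over `alphabet` at site `target`, estimated from aligned `strains`."""
    counts = {}
    for s in strains:
        counts.setdefault(s[context], Counter())[s[target]] += 1

    def predict(context_residue):
        # Unseen contexts get an empty counter, so smoothing alone decides.
        seen = counts.get(context_residue, Counter())
        total = sum(seen.values()) + pseudocount * len(alphabet)
        return {a: (seen[a] + pseudocount) / total for a in alphabet}

    return predict


# Hypothetical three-site alignment; site 1 is predicted from site 0.
strains = ["KGA", "KGA", "TSA", "TYA"]
predict = site_predictor(strains, target=1, context=0, alphabet="GSY")
print(predict("K"))  # most mass on G, but S and Y keep non-zero probability
print(predict("?"))  # unseen context: falls back to the uniform distribution
```

The pseudocount is what keeps the estimate non-degenerate here; the tree-based predictors in the talk obtain their guarantees differently.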

Large Science Models: Properties

\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

Large Deviation Bound*

This bound connects "closeness" of strains to the odds of perturbing from one to the other, bridging geometry to dynamics

(Sanov's Theorem, Pinsker's Inequality)
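As a worked reading of the bound, exponentiating both sides of the inequality brackets the jump probability between two multiples of the denominator \(\Pr(\psi' \rightarrow \psi')\). Only the symbols already defined in the bound are used:

\Pr(\psi' \rightarrow \psi')\, e^{-\beta\,\theta(\psi,\psi')} \;\le\; \Pr(\psi \rightarrow \psi') \;\le\; \Pr(\psi' \rightarrow \psi')\, e^{\beta\,\theta(\psi,\psi')}

As \(\theta(\psi,\psi') \to 0\) both sides collapse onto \(\Pr(\psi' \rightarrow \psi')\), so a small distance pins the odds of the jump close to that reference probability.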

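In the bound, \(\Pr(\psi' \rightarrow \psi')\) is the persistence probability and \(\beta\) is a constant scaling as \(N^2\).

[Figure: strains \(\psi\) and \(\psi'\) separated by distance \(\theta\), with jump probability \(\Pr(\psi \rightarrow \psi')\)]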

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

 where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.

The distance depends on both the strains and the background circulation.

Smaller distance \( \Rightarrow \) Higher odds of one jumping to the other
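The distance formula above can be sketched directly, assuming the per-site predicted distributions \(\phi^i(\psi^{-i})\) have already been computed for both strains and are passed in as lists of dictionaries. The function names are hypothetical, and the logarithm base (2, which keeps each site's term in \([0, 1]\)) is an assumption, since the slides do not state one.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions given as dicts."""
    support = set(p) | set(q)
    m = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in support}

    def kl(u):
        # Kullback-Leibler divergence of u from the mixture m.
        return sum(u[a] * math.log2(u[a] / m[a]) for a in u if u[a] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def lsm_distance(predicted_a, predicted_b):
    """Average over sites of sqrt(D_JS) between per-site predicted distributions."""
    if len(predicted_a) != len(predicted_b) or not predicted_a:
        raise ValueError("need the same non-zero number of sites for both strains")
    total = sum(math.sqrt(max(js_divergence(p, q), 0.0))
                for p, q in zip(predicted_a, predicted_b))
    return total / len(predicted_a)


# Two sites: identical predictions at the first, disjoint at the second.
strain_a = [{"K": 1.0}, {"G": 1.0}]
strain_b = [{"K": 1.0}, {"S": 1.0}]
print(lsm_distance(strain_a, strain_b))  # → 0.5
```

Because the inputs are model predictions rather than raw residues, retraining the predictors on a different background circulation changes the distance even when the two sequences stay the same.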

H1N1 2023 Influenza A HA

  • Set of conditional inference trees (CIT)
    • Strict statistical guarantees: quantifies inference uncertainty
  • Each tree models exactly one variable as a function of potentially all other variables
  • Non-leaf nodes are "hyperlinked" to other trees

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures


H3N2 2021 Influenza A HA


LSM distance \(\neq \) Edit Distance

Same sequence/strain pair, but different background (times of collection)

LSM distance

Small edit distance (5-8) and large LSM distance

Seasonal predictions for Influenza A

The Vaccine Strain Selection Problem for the seasonal epidemic

Flu vaccine strain recommendation timeline

  • Year-round: WHO GISRS laboratories [1] collect and characterize circulating viruses worldwide.

  • February: WHO issues the recommendation for the next Northern Hemisphere season.

  • September: WHO issues the recommendation for the next Southern Hemisphere season.

  • March (U.S.): FDA/VRBPAC reviews global and U.S. data and issues its recommendation.

  • ~6–9 months before distribution: manufacturers produce vaccine after strain selection. 


  • This is essentially a prediction problem

[1] World Health Organization. GISRS. https://www.who.int/initiatives/global-influenza-surveillance-and-response-system

Can we improve?

x_\star^{t+\delta} = \operatorname*{arg\,min}_{y \in \cup_{\tau \leq t} H^\tau} \left( \sum_{x\in H^t} \theta_{m_t}(x,y) - \vert H^t \vert A \ln \Pr(y \rightarrow y) \right)

\(x_\star^{t+\delta}\): likely strain at time \(t+\delta\)

\(H^\tau\): set of strains circulating at time \(\tau\)

\(\theta_{m_t}(x,y)\): LSM distance of \(x\) to \(y\)

\(A\): constant depending on length of genome
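The arg-min above is a search over a finite candidate pool, so it can be sketched as a direct scan. This is only an illustration under stated assumptions: `distance` and `persistence` are caller-supplied stand-ins for the LSM distance and for \(\Pr(y \rightarrow y)\), and the toy example uses a Hamming fraction and a constant persistence, neither of which comes from the slides.

```python
import math


def predict_dominant_strain(current, candidates, distance, persistence, A):
    """Return the candidate y minimising
    sum_{x in current} distance(x, y) - |current| * A * ln(persistence(y))."""
    def objective(y):
        spread = sum(distance(x, y) for x in current)
        return spread - len(current) * A * math.log(persistence(y))

    return min(candidates, key=objective)


def hamming_fraction(x, y):
    """Toy stand-in for the LSM distance: fraction of mismatched sites."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Hypothetical sequences: three current strains plus one older strain.
current = ["AAAA", "AAAT", "AATT"]
candidates = current + ["TTTT"]
best = predict_dominant_strain(current, candidates, hamming_fraction,
                               persistence=lambda y: 0.9, A=1.0)
print(best)  # → AAAT
```

With a constant persistence term the choice reduces to the candidate closest, in total, to everything currently circulating; a strain-specific persistence would additionally favor candidates that are themselves stable.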


Solve the mathematical optimization problem

"Compute the most likely strain to arise in time \(t+\delta\) given current circulation"


The LSM does better, as measured by edit-distance mismatch from the realized circulation

Caveat: Edit distance closeness might not always reflect immunological response

Measure of Emergence Potential

\rho_t(x) \triangleq -\log \min_{\begin{subarray}{c}y,z \in H^t \\ r \in L_\mathcal{H},\\s \in L_\mathcal{N}\end{subarray}} \sqrt{\theta_r\left (x^{\mathcal{H}},y^{ \mathcal{H} }\right ) \theta_{s}\left (x^{\mathcal{N}},z^{\mathcal{N}}\right )}

Neuraminidase contribution

Hemagglutinin contribution

current animal circulation

Protein-specific LSMs

Which animal strain has the highest odds of giving rise to a strain similar to those circulating in the human population?

  • The same subtype does not need to pre-exist in humans
  • Very few observations suffice in the animal circulation
  • Fewer strains increase quantifiable uncertainty
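The emergence-potential formula can be sketched as below. The function and parameter names are hypothetical; `reference_strains` plays the role of \(H^t\), and each entry of `ha_distances` and `na_distances` stands in for the distance induced by one protein-specific LSM. The natural logarithm is an assumption, and the toy example uses a Hamming fraction rather than a real LSM distance.

```python
import math


def emergence_potential(x_ha, x_na, reference_strains, ha_distances, na_distances):
    """rho = -log min over y, z, r, s of sqrt(theta_r(x_HA, y_HA) * theta_s(x_NA, z_NA)).

    reference_strains: list of (ha, na) sequence pairs forming H^t.
    ha_distances, na_distances: lists of distance functions, one per LSM.
    """
    # y and z range independently and both factors are non-negative,
    # so the joint minimum is the product of the two separate minima.
    best_ha = min(theta(x_ha, y_ha)
                  for theta in ha_distances for y_ha, _ in reference_strains)
    best_na = min(theta(x_na, z_na)
                  for theta in na_distances for _, z_na in reference_strains)
    geometric_mean = math.sqrt(best_ha * best_na)
    return math.inf if geometric_mean == 0 else -math.log(geometric_mean)


def hamming_fraction(x, y):
    """Toy stand-in for an LSM distance: fraction of mismatched sites."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# Hypothetical sequences (illustration only).
reference = [("AAAT", "CCGG"), ("TTTT", "CCCG")]
rho = emergence_potential("AAAA", "CCCC", reference,
                          [hamming_fraction], [hamming_fraction])
print(rho)  # -log(0.25): both proteins are a quarter of the way from a reference strain
```

A larger value means the query strain sits closer to the reference circulation in both proteins at once; a strain close in only one protein is penalized by the geometric mean.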

Optimization Problem

CDC: Influenza Risk Assessment Tool* (IRAT) scoring for animal strains

  • 10 dimensions of assessment
  • slow (months), quasi-subjective, expensive
  • 24 strains assessed in 14 years
  • ~10,000 strains collected annually

*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm

IRAT replication by LSM 

(automated, in seconds)

Regression equation: \(y = 0.47x + 2.77\), \(R = 0.72\), p-value \(= 0.0001\)

Stamping Out the Next Pandemic **Before** The First Human Infection

BioNorad

\frac{\delta \theta(x,y)}{\delta y}

Which sites are driving prediction?

Estimate how LSM distances change as individual sites are perturbed
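One simple way to approximate this per-site sensitivity is a finite-difference scan: substitute each residue in turn and record how far the distance moves. The sketch below is an assumption about how such a scan could look, not the method used in the talk; `site_sensitivity` is a hypothetical name, and the weighted-mismatch distance with its weights is a toy stand-in for the LSM distance.

```python
def site_sensitivity(x, y, distance, alphabet):
    """For each site of y, the largest change in distance(x, y) caused by
    substituting a single residue at that site."""
    base = distance(x, y)
    scores = []
    for i, original in enumerate(y):
        shifts = [abs(distance(x, y[:i] + a + y[i + 1:]) - base)
                  for a in alphabet if a != original]
        scores.append(max(shifts, default=0.0))
    return scores


# Toy stand-in for the LSM distance: mismatches weighted per site.
WEIGHTS = [0.1, 0.7, 0.2]


def weighted_mismatch(x, y):
    return sum(w for w, a, b in zip(WEIGHTS, x, y) if a != b)


scores = site_sensitivity("KGY", "KGY", weighted_mismatch, alphabet="KGYST")
print(scores)  # the middle site moves the distance the most
```

Ranking sites by these scores gives a first indication of which positions drive a prediction.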

Re-drawing the Phylogeny

LSM-distance induces a new phylogeny

phylogeny with post-2020 sequences

Take-home Message

Emergence is not random: it is the delayed observation of structured, constrained evolution that can be inferred from data.

  • PREEMPT: PREventing EMerging Pathogenic Threats

  • MAGICS: Methodological Advancements for Generalizable Insights into Complex Systems*

*The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of DARPA or the U.S. Government.

Acknowledgement

Prof. Li Feng

William Robert Mills Chair in Equine Infectious Disease, MIMG

Prof. Saurabh Chattopadhyay

Microbiology, Immunology, and Molecular Genetics

Collaborators
