Ishanu Chattopadhyay
ML | Data Science Biomedical Informatics | Social Science | Assistant Professor
Ishanu Chattopadhyay, PhD
Assistant Professor, Biomedical Informatics & Computer Science
University of Kentucky
ishanu_ch@uky.edu
CURE-KY: Infectious Research Day 2026
March 19, 2026
Are emergence events predictable?
[Figure: timeline with markers at 1918, 1979, 1998, 2008, and 2009 (Mar, Jun)]
Let's train a machine learning model to distinguish human from swine influenza HA proteins
High confidence generally
Data source: NCBI and GISAID HA sequences collected globally between 1999 and 2016 (HA protein FASTA)
Spatial Risk over time
The signal is clear!
However, it is not very actionable
Hemagglutinin (HA)
Neuraminidase (NA)
Mediates Cellular Entry
Mediates Cellular Exit
Learn the emergent constraints from data
Example:
Adjacent sequence blocks do not map to adjacent spatial regions after folding.
Non-collinearity: residues that are far apart in sequence can be:
physically adjacent in 3D
functionally coupled (e.g., receptor binding)
Functional dependence is long-range:
Mutations in HA1 (head) can affect HA2 (fusion machinery)
Antigenicity depends on global structural constraints, not local sequence neighborhoods
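A minimal sketch of how long-range coupling between sites could be detected from an alignment. It uses mutual information between two alignment columns as a stand-in measure of statistical dependence; this is an illustrative assumption, not the method used in the talk. The function name `mutual_information` and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def mutual_information(aligned_seqs, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(aligned_seqs)
    count_i = Counter(s[i] for s in aligned_seqs)
    count_j = Counter(s[j] for s in aligned_seqs)
    count_ij = Counter((s[i], s[j]) for s in aligned_seqs)
    total = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        total += p_ab * math.log2(p_ab / (p_a * p_b))
    return total

# Toy alignment (made-up residues, not real HA data):
# column 0 and column 4 co-vary perfectly, column 2 varies independently.
toy = ["KAAAE", "KAGAE", "TAAAD", "TAGAD"]
far_pair = mutual_information(toy, 0, 4)
independent_pair = mutual_information(toy, 0, 2)
```

In this toy case the two end columns are far apart in sequence yet carry one full bit of shared information, while the middle column shares none with column 0. Sequence distance alone does not tell you which sites depend on each other.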
NCBI + GISAID
>200,000 HA/NA sequences
| | 222 | 223 | 224 | ... | 560 |
|---|---|---|---|---|---|
| strain 1 | | | | | |
| strain 2 | | | | | |
| ... | | | | | |
| strain m | | | | | |

Columns: observables (sequence sites). Rows: samples (strains).
Distributions over alphabet \(\Sigma^i\)
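As a rough illustration of what a per-site distribution over the alphabet looks like, the sketch below computes the empirical residue frequencies for each column of an aligned set of sequences. The function name `site_distributions` and the toy alignment are hypothetical, and plain frequency counting is an assumption made for illustration only.

```python
from collections import Counter

def site_distributions(aligned_seqs):
    """Return one empirical residue distribution (dict) per alignment column."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")
    dists = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_seqs)
        total = sum(counts.values())
        dists.append({res: n / total for res, n in counts.items()})
    return dists

# Toy alignment (made-up residues, not real HA data): rows are strains,
# columns are sites.
toy = ["KGY", "KSY", "TGY", "KGY"]
dists = site_distributions(toy)
```

Each entry of `dists` is the population-level distribution at one site. A site where all strains agree, like the last column here, collapses to a single residue with probability 1.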
Individual Predictor (CIT)
cross-talk
Tension between predicted and observed distribution drives change
Example: HA Site 223 on Influenza A
\(\psi^i\): distribution over residues K, G, Y, S, T
\(\phi\) estimates \(\psi\)
population
individual
The estimate is always a non-empty, non-degenerate distribution
missing observation
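A hedged sketch of how an estimate can stay non-degenerate even when an observation is missing. It is not the CIT predictor from the slides. It is a simple stand-in that conditions on one other site and applies additive smoothing, so every residue in the alphabet keeps nonzero probability. The function name `smoothed_conditional`, the parameter `alpha`, and the toy data are all assumptions.

```python
from collections import Counter

def smoothed_conditional(aligned_seqs, target, given, observed, alphabet, alpha=1.0):
    """Estimate P(residue at `target` | residue at `given` == observed).

    If `observed` is None (a missing observation), all strains are used.
    Additive smoothing with `alpha` > 0 keeps every residue in `alphabet`
    at nonzero probability, so the result is never degenerate.
    """
    if observed is None:
        rows = aligned_seqs
    else:
        rows = [s for s in aligned_seqs if s[given] == observed]
    counts = Counter(s[target] for s in rows)
    total = sum(counts.get(a, 0) for a in alphabet) + alpha * len(alphabet)
    return {a: (counts.get(a, 0) + alpha) / total for a in alphabet}

# Toy alignment (made-up residues, not real HA data).
toy = ["KGY", "KSY", "TGY", "KGY"]
alphabet = "KGYST"
est = smoothed_conditional(toy, target=1, given=0, observed="K", alphabet=alphabet)
est_missing = smoothed_conditional(toy, target=1, given=0, observed=None, alphabet=alphabet)
```

With the conditioning residue observed, the estimate is specific to that individual strain's context. With it missing, the estimate falls back to a population-level distribution and is still well defined.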
This bound connects "closeness" of strains to the odds of perturbing from one to the other, bridging geometry to dynamics.
(Sanov's theorem, Pinsker's inequality)
persistence probability
const. scaling as \(N^2\)
\(\psi\)
\(\psi'\)
\(\theta\)
where \(D_{JS}(P \,\|\, Q)\) is the Jensen-Shannon divergence.
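For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as below. This is the standard definition, \(D_{JS}(P \,\|\, Q) = \tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)\) with \(M = \tfrac{1}{2}(P + Q)\), written with base-2 logarithms so the value lies in [0, 1]. The function names and the dict representation of distributions are illustrative choices, not taken from the talk.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two dict-based distributions."""
    keys = set(p) | set(q)
    p_full = {k: p.get(k, 0.0) for k in keys}
    q_full = {k: q.get(k, 0.0) for k in keys}
    # The mixture m is positive wherever p or q is, so both KL terms are finite.
    m = {k: 0.5 * (p_full[k] + q_full[k]) for k in keys}
    return 0.5 * kl_divergence(p_full, m) + 0.5 * kl_divergence(q_full, m)

same = js_divergence({"K": 0.5, "G": 0.5}, {"K": 0.5, "G": 0.5})
disjoint = js_divergence({"K": 1.0}, {"G": 1.0})
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and stays finite when the two distributions have different supports.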
H1N1 2023 Influenza A HA
Learn the emergent constraints from data
H3N2 2021 Influenza A HA
Learn the emergent constraints from data
Same sequence/strain pair, but different background (times of collection)
LSM distance
Small edit distance (5-8) and large LSM distance
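For comparison, the edit distance referred to here can be computed with the standard Levenshtein dynamic program sketched below. This only illustrates the edit-distance side of the contrast; it does not compute the LSM distance, which depends on the learned model. The function name `edit_distance` is an illustrative choice.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

d = edit_distance("kitten", "sitting")
```

Edit distance counts changes without regard to where they occur or what background they occur in, which is why two strains can be a handful of edits apart and still be far apart under a model-based distance.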
Flu vaccine strain recommendation timeline
Year-round: WHO GISRS laboratories [1] collect and characterize circulating viruses worldwide.
February: WHO issues the recommendation for the next Northern Hemisphere season.
September: WHO issues the recommendation for the next Southern Hemisphere season.
March (U.S.): FDA/VRBPAC reviews global and U.S. data and issues its recommendation.
~6–9 months before distribution: manufacturers produce vaccine after strain selection.
[1] World Health Organization. Global Influenza Surveillance and Response System (GISRS). https://www.who.int/initiatives/global-influenza-surveillance-and-response-system
Likely strain at time \(t+\delta\)
Set of strains circulating at time \(\tau\)
LSM distance of \(x\) to \(y\)
\(A\): constant depending on length of genome
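The formula these labels annotate is not reproduced in the text, so the sketch below shows only one plausible reading, stated as an assumption: pick the circulating strain that is closest, in total distance, to all other circulating strains. It omits whatever term involves the constant \(A\). It uses Hamming distance purely as a placeholder for the LSM distance, which requires the learned model. The names `hamming` and `most_central_strain` and the toy strains are hypothetical.

```python
def hamming(x, y):
    """Placeholder distance: number of differing positions in aligned strings."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(x, y))

def most_central_strain(strains, distance=hamming):
    """Return the strain y minimizing the summed distance from every strain x to y."""
    if not strains:
        raise ValueError("need at least one strain")
    return min(strains, key=lambda y: sum(distance(x, y) for x in strains))

# Toy circulating set (made-up residues, not real HA data).
circulating = ["KGYST", "KGYSA", "KGTSA", "TGYSA"]
pick = most_central_strain(circulating)
```

The `distance` argument is pluggable, so a model-based distance could be substituted for the Hamming placeholder without changing the selection logic.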
Caveat: closeness in edit distance might not always reflect immunological response
*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm
~10,000 strains collected annually
10 dimensions of assessment
(automated, in seconds)
Estimate how LSM distances change as we perturb individual sites
Emergence is not random—it is the delayed observation of structured, constrained evolution that can be inferred from data.
*The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of DARPA or the U.S. Government.
Prof. Li Feng
William Robert Mills Chair in Equine Infectious Disease, MIMG
Prof. Saurabh Chattopadhyay
Microbiology, Immunology, and Molecular Genetics