Analyzing Influenza C sequences with Large Science Models

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow}X = \{x^1, \ldots, x^N\}, \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:} }\quad & \color{gray} x^{-i} = \{x^j : j \ne i\}\\
\text{Crosstalk:} \quad & \forall i \ P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \varnothing \\
{\color{gray}\text{Notation:} }\quad & \color{gray}\psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}
Data table: columns are observables (e.g. reliten, gunlaw, abany, ..., grass); rows are samples (Person 1, Person 2, ..., Person m)

Distributions over alphabet \(\Sigma^i\)

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

Example

GSS topic: There should be more gun-control

\(\psi^i\)

\(\Sigma^i\) = {strongly agree, agree, neutral, disagree, strongly disagree}

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)

Examples: GSS, ANES, WVS, ESS, Eurobarometer, Afrobarometer, Asian Barometer, etc.

group

individual

estimate is always a non-empty non-degenerate distribution

missing observation

*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased Recursive Partitioning: A Conditional Inference Framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.

emergent macro-structure

Component predictor (Conditional Inference Tree*)

Example: Influenza A HA protein

Recursive

LSM

forest

LSM Forest of Conditional Inference Trees*

Revealing Emergent Cross-talk

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

 where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
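The distance above can be sketched in a few lines. This is a minimal illustration, assuming the per-variable predicted distributions \(\phi^i(\psi^{-i})\) and \(\phi^i(\psi'^{-i})\) have already been computed and are passed in as lists of probability vectors. The function names `js_divergence` and `lsm_distance` are hypothetical, and natural logarithms are assumed.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize defensively
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def lsm_distance(pred_a, pred_b):
    """Average over variables of sqrt(D_JS) between two sets of predicted distributions.

    pred_a[i] and pred_b[i] are assumed to be the outputs of the i-th predictor
    for the two samples being compared (hypothetical inputs).
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("both samples must cover the same variables")
    # max(..., 0.0) guards against tiny negative values from rounding
    terms = [np.sqrt(max(js_divergence(p, q), 0.0)) for p, q in zip(pred_a, pred_b)]
    return float(np.mean(terms))
```

With natural logarithms each term is bounded by \(\sqrt{\ln 2}\), so the averaged distance lies in the same range.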

LSM Forest

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

  • Set of conditional inference trees (CIT)
    • Strict statistical guarantees: quantifies inference uncertainty
  • Each tree models exactly one variable as a function of potentially all other variables
  • Non-leaf nodes are "hyperlinked" to other trees
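The structure described in the bullets above (one predictor per variable, each conditioned on the other variables, with split nodes pointing at other predictors) can be illustrated with a deliberately simplified stand-in. The sketch below does not implement a conditional inference tree: it fits a depth-1 split per variable, chosen by the largest Pearson chi-square statistic, and returns Laplace-smoothed conditional distributions. All names (`chi2_stat`, `fit_stump`, `fit_forest`) are hypothetical.

```python
from collections import Counter, defaultdict

def chi2_stat(xs, ys):
    """Pearson chi-square statistic for two categorical columns."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            stat += (cxy[(a, b)] - expected) ** 2 / expected
    return stat

def fit_stump(data, target, alphabet):
    """Depth-1 stand-in for a CIT: split the target on its most associated covariate.

    Simplification: a real CIT compares permutation-test p-values and recurses;
    here the raw chi-square statistic picks a single split.
    """
    n_vars = len(data[0])
    ys = [row[target] for row in data]
    others = [j for j in range(n_vars) if j != target]
    split = max(others, key=lambda j: chi2_stat([row[j] for row in data], ys))
    counts = defaultdict(Counter)
    for row in data:
        counts[row[split]][row[target]] += 1
    marginal = Counter(ys)

    def predict(row):
        c = counts.get(row[split], marginal)  # unseen split value: use the marginal
        total = sum(c.values()) + len(alphabet)
        # Laplace smoothing keeps every estimate non-degenerate
        return {s: (c[s] + 1) / total for s in alphabet}

    return split, predict

def fit_forest(data, alphabets):
    """One predictor per variable; each split index acts as a link to another tree."""
    return [fit_stump(data, i, alphabets[i]) for i in range(len(alphabets))]
```

The returned split index plays the role of the "hyperlink": the predictor for variable i points at variable j, which has its own predictor in the forest.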

H0

H1

M0

LSM-clustering on human HE sequences

The three bovine sequences are not part of these clusters, which contain only human ICV HE sequences. We can still compute the distance from each human sequence to each of the three bovine strains and identify the cluster that lies closest to them: it is clearly the one labeled M0. The other two clusters are labeled H0 and H1.

Distance of bovine sequences to M0 cluster

'C/Miyagi/2/94',  'C/Saitama/2/2000',  'C/Yamagata/3/2000',  'C/Miyagi/7/93',  'C/Miyagi/4/96',  'C/Saitama/1/2004',  'C/Miyagi/7/96',  'C/Greece/1/79',  'C/Yamagata/5/92',  'C/Miyagi/3/93',  'C/Miyagi/4/93',  'C/Kyoto/41/82',  'C/Nara/82',  'C/Hyogo/1/83',  'C/Miyagi/1/94',  'C/Miyagi/6/93',  'C/Miyagi/3/94',  'C/Mississippi/80',  'C/Yamagata/26/2004',  'C/Mississippi/80'

Variation by Time of collection

Suggests movement from M0 to H0 to H1

Estimation of Cluster Fitness from LSM

\omega(\mathcal{C}) =\frac{1}{\vert \mathcal{C}\vert} \sum_{x \in \mathcal{C}}\log Pr(x \rightarrow x)
M0 -64.251
H0 -32.586
H1 -15.964

Fitness calculations are based on the Emergenet model and correspond to the estimated log-likelihood of a strain not perturbing out of its cluster. H1 is therefore the most "fit" cluster: strains have moved into it over time, and it is also the largest cluster in the data. The overlap in collection times between H0 and H1 implies that this is not simply a collection-bias effect arising from the cluster sizes. This shift has resulted in the strain disappearing from humans, as the virus found a fitter niche on the landscape.
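A minimal sketch of the cluster-fitness average \(\omega(\mathcal{C})\) follows. It assumes, as an illustration only, that \(Pr(x \rightarrow x)\) factorizes over sites as the product of each site predictor's probability of retaining the observed residue. The slides do not spell out this factorization, so treat it as an assumption. The names `log_persistence` and `cluster_fitness`, and the predictor interface (a callable returning a residue-to-probability dict), are hypothetical.

```python
import math

def log_persistence(seq, predictors):
    """log Pr(x -> x), ASSUMING a per-site factorization of the persistence probability.

    predictors[i](seq) is assumed to return a dict mapping each residue to the
    predicted probability at site i.
    """
    return sum(math.log(predict(seq)[seq[i]]) for i, predict in enumerate(predictors))

def cluster_fitness(cluster, predictors):
    """omega(C): mean log-probability that a member sequence stays unchanged."""
    if not cluster:
        raise ValueError("cluster must contain at least one sequence")
    return sum(log_persistence(x, predictors) for x in cluster) / len(cluster)
```

Values closer to zero indicate sequences that the model expects to persist, which matches the ordering M0 < H0 < H1 reported above.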

Maximal Site Contribution to Fitness Delta

8 75 87 97 141 154 165 178 181 183 203 205 211 216 230 252 327 361 506 588

\{i\} = \operatorname*{arg\,max} \, \delta \omega(x)

Local Potential Fields

Given the LSM and dynamical considerations, local potential fields can be computed; these fields indicate the likely direction of future evolution

Stable

(captured by local extrema)

Free to move locally towards extrema

Observation: This lineage (Mississippi lineage) has been extinct since 2022/23

stable lineage

Equations of Motion

L \triangleq \frac{1}{2} \sum_i g_{kl} P^k_p \dot{\psi}^p_i P^l_n \dot{\psi}^n_i - \theta(\psi, \phi)

Define Lagrangian\(\dag\)

\ddot{\psi}^m_i = -g^{km} P^k_m \frac{1}{2N} \sum_j \frac{1}{\sqrt{D_{JS}(\psi^m_j \| \phi^m_j)}} \left[ \ln\left( \frac{2e\psi^m_j}{\psi^m_j + \phi^m_j} \right) - \frac{1}{2(\psi^m_j + \phi^m_j)} \right]

Over-damped Gradient flow Equation\(\dag\)

where \(g^{km}\) is the inverse metric tensor

kinetic energy

potential energy

\(^\dag\)Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.

Principle of stationary action
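For reference, the standard textbook statement of this principle is reproduced below in generic coordinates \(q^k\), which stand in for the components of \(\psi\). The second line assumes, for illustration only, a constant metric \(g_{kl}\) and a Lagrangian of the form kinetic minus potential energy with potential \(V\); neither the generic coordinates nor the constant-metric simplification is taken from the slides.

```latex
\begin{aligned}
\delta \int_{t_0}^{t_1} L(q, \dot{q})\, dt = 0
\quad &\Longrightarrow \quad
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}^k} - \frac{\partial L}{\partial q^k} = 0 \\
L = \tfrac{1}{2}\, g_{kl}\, \dot{q}^k \dot{q}^l - V(q)
\quad &\Longrightarrow \quad
g_{kl}\, \ddot{q}^l = -\frac{\partial V}{\partial q^k}
\end{aligned}
```

Here the role of \(V\) is played by the LSM distance \(\theta\), so motion is driven down the gradient of the tension between observed and predicted distributions.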

Local potential field eqn

Influenza C: strains and their neighborhoods

LSM-InfluenzaC

By Ishanu Chattopadhyay

DARPA-EA-25-02-05-MAGICS-PA-025