Pan-genomic Digital Twins for Predictive Diagnosis of Pulmonary Fibrosis

Ishanu Chattopadhyay, PhD

Assistant Professor of Internal Medicine

Institute of Biomedical Informatics

University of Kentucky

Data

19000 samples

1.2M SNPs, social, clinical

11K SNPs, social, clinical

The variable of interest (among the ~11K) is the PF diagnosis state: PFdx

Data

Digital Twin of Data


We infer a "recursive forest" that yields a generative model of the data

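Predict the distribution of a variable (\(x_i\)) as a function of all the other available variables

Captures the cross-talk and epistatic effects

$$\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)$$

Here \(\Sigma_j\) is the set of values variable \(j\) can take, and \(\mathcal{D}(\Sigma_i)\) is the set of probability distributions over \(\Sigma_i\)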
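To make the signature of \(\Phi_i\) concrete, here is a minimal sketch of a predictor that maps the other variables to a distribution over \(x_i\). It is not the recursive forest from the slides: a smoothed frequency table is swapped in as a stand-in, and the names `fit_conditional`, `alphabet`, and `alpha`, as well as the toy data, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_conditional(samples, i, alphabet, alpha=1.0):
    """Return phi_i: a map from the other variables to a distribution over x_i.

    Stand-in for the recursive forest: a frequency table keyed by the full
    context x_{-i}. alpha is an assumed Laplace smoothing constant.
    """
    table = defaultdict(Counter)
    for x in samples:
        context = tuple(x[:i]) + tuple(x[i + 1:])
        table[context][x[i]] += 1

    def phi_i(x):
        # The value stored at position i of x is ignored.
        context = tuple(x[:i]) + tuple(x[i + 1:])
        counts = table.get(context, Counter())
        total = sum(counts.values()) + alpha * len(alphabet)
        return {a: (counts[a] + alpha) / total for a in alphabet}

    return phi_i

# Toy data: three categorical variables per sample (hypothetical values).
samples = [("A", "G", "1"), ("A", "G", "1"), ("A", "G", "0"), ("C", "T", "0")]
phi_2 = fit_conditional(samples, i=2, alphabet=("0", "1"))
dist = phi_2(("A", "G", "?"))
```

The returned `dist` is a dictionary of probabilities over the values of \(x_i\), which is the only property the rest of the pipeline relies on.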

Viral genome example

$$\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i(x_{-i}) , \Phi_i(y_{-i})\right ) \right )$$

where \(\mathbb{J}\) is the Jensen-Shannon divergence and \(\mathbf{E}_i\) is the expectation over the variables \(i\)

This distance is "special"
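The distance can be sketched in a few lines, assuming each predictor \(\Phi_i\) is available as a function that returns a dictionary of probabilities. Two choices below are assumptions rather than details from the slides: the expectation over \(i\) is taken as a uniform average, and the divergence uses base-2 logarithms. The toy `predictors` are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two probability dicts."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0.0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

def theta(x, y, predictors):
    """Average over variables i of sqrt(JS(Phi_i(x_{-i}), Phi_i(y_{-i})))."""
    total = 0.0
    for phi_i in predictors:
        # max(..., 0.0) guards against tiny negative rounding errors.
        total += math.sqrt(max(js_divergence(phi_i(x), phi_i(y)), 0.0))
    return total / len(predictors)

# Hypothetical toy predictors, one per variable; each ignores its own variable.
predictors = [
    lambda s: {"A": 0.9, "C": 0.1} if s[1] == "G" else {"A": 0.2, "C": 0.8},
    lambda s: {"G": 0.7, "T": 0.3} if s[0] == "A" else {"G": 0.1, "T": 0.9},
]
d_same = theta(("A", "G"), ("A", "G"), predictors)
d_diff = theta(("A", "G"), ("C", "T"), predictors)
```

Two samples that induce the same predicted distributions are at distance zero, even if they differ in raw sequence, which is what makes the distance adaptive rather than a plain edit count.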

q-distance

a biologically informed, adaptive distance between samples

A smaller distance implies a higher probability of a "valid perturbation"

Metric Structure

Tangent Bundle

geometry

dynamics

$$\theta(x,y) \sim \log \Pr(x \rightarrow y)$$

Digital Twin of Transgenome

(limited to 10K PF correlates)

parameters: 975050

variables: 11k

samples: 12k

Inferred path from the MUC5B promoter to the PFdx outcome (nodes shown: MUC5B, JHU_11.119702210_G, PFdx)

inferred network driving PF diagnosis

This is a fraction of the emergent network among our variables that "drives" the PFdx variable

Some qualitative match with what is known. Many "new" relationships.

Need to inspect these for mechanistic meaning

Building the Classifier for PF dx

  • Leverage the generative property
  • The "training set" is imperfectly specified (not all PF-positive patients are diagnosed, and some clinical diagnoses might be "placeholders" or wrong)
  • A no-PF diagnosis does not mean "healthy"; the undiagnosed group is highly heterogeneous
  • A PF diagnosis also does not mean the rest of the data is homogeneous

We make "anchor" perturbation samples for two groups: PF dx and no PF dx

  • set \(PFdx=X\) and perturb to generate "non-PF" samples
  • set \(PFdx=J84111\) and perturb to generate "PF" samples
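One way to read the two steps above is sketched below. The slides do not specify the perturbation procedure, so this is an assumption: PFdx is clamped to the chosen value, and the remaining variables are repeatedly resampled from their predicted distributions. The function `make_anchor`, the parameter `n_steps`, and the toy `predictors` are hypothetical.

```python
import random

def make_anchor(seed_sample, predictors, pf_index, pf_value, n_steps=20, rng=None):
    """Clamp PFdx to pf_value, then resample the other variables one at a time."""
    rng = rng or random.Random(0)
    s = list(seed_sample)
    s[pf_index] = pf_value
    free = [i for i in range(len(s)) if i != pf_index]
    for _ in range(n_steps):
        i = rng.choice(free)
        # Predicted distribution of variable i given the current sample.
        dist = predictors[i](tuple(s))
        values, weights = zip(*dist.items())
        s[i] = rng.choices(values, weights=weights, k=1)[0]
    return tuple(s)

# Hypothetical toy predictors for a sample of the form (snp_a, snp_b, PFdx).
predictors = [
    lambda s: {"A": 0.8, "C": 0.2} if s[2] == "J84111" else {"A": 0.3, "C": 0.7},
    lambda s: {"G": 0.6, "T": 0.4},
    None,  # PFdx is clamped, so its predictor is never called
]
pf_anchor = make_anchor(("C", "T", "X"), predictors, pf_index=2, pf_value="J84111")
non_pf_anchor = make_anchor(("A", "G", "J84111"), predictors, pf_index=2, pf_value="X")
```

Because every resampled value is drawn from a predicted distribution, each step stays within what the model considers plausible given the clamped diagnosis.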
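Risk of PF diagnosis:

$$\rho(s) = \sum_{i=1}^{m}\frac{ \theta(s,H_i)}{\theta(s,J_i)} \qquad (1)$$

where \(H_i\) are the "non-PF" anchor samples, \(J_i\) are the "PF" anchor samples, and \(m\) is the number of anchor pairs. The risk is high when \(s\) is far from the non-PF anchors and close to the PF anchors.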
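Eq. 1 translates directly into code once a distance function is available. In the sketch below a normalized Hamming distance stands in for the q-distance so that the example is self-contained. The `eps` guard against a zero denominator and the toy anchor sets are assumptions, not part of the slides.

```python
def hamming(x, y):
    """Normalized Hamming distance; a toy stand-in for the q-distance theta."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def risk(s, non_pf_anchors, pf_anchors, distance, eps=1e-12):
    """Eq. 1: sum over anchor pairs of distance(s, H_i) / distance(s, J_i)."""
    if len(non_pf_anchors) != len(pf_anchors):
        raise ValueError("Eq. 1 pairs anchors by index, so both sets need m entries")
    # eps is an assumed guard against division by zero when s equals an anchor.
    return sum(distance(s, h) / max(distance(s, j), eps)
               for h, j in zip(non_pf_anchors, pf_anchors))

# Hypothetical anchors: H holds "non-PF" anchors, J holds "PF" anchors.
H = [("A", "G", "T"), ("A", "G", "C")]
J = [("C", "T", "A"), ("C", "T", "G")]

risk_pf_like = risk(("C", "T", "T"), H, J, hamming)  # close to the PF anchors
risk_other = risk(("A", "G", "A"), H, J, hamming)    # close to the non-PF anchors
```

In this toy setting the sample that sits near the PF anchors receives the larger score, which matches the intended reading of the ratio in Eq. 1.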

Convex hulls in Metric embedding of perturbation vectors

We make two sets of "valid perturbation" samples

Risk distribution for samples with positive and negative PF diagnoses

Out-of-sample validation

Take out-of-sample (OOS) samples, censor the PFdx and J84 clinical data, and calculate risk as defined in Eq. 1
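The censoring step can be sketched as follows, assuming each sample is a dictionary of field names to values. The field names, the prefix rule for finding J84-related fields, and the empty-string marker for a hidden value are all hypothetical; the slides do not describe the data layout.

```python
def censor(record, hidden_prefixes=("PFdx", "J84"), missing=""):
    """Return a copy of record with diagnosis-revealing fields blanked out."""
    return {
        key: (missing if key.startswith(hidden_prefixes) else value)
        for key, value in record.items()
    }

# Hypothetical out-of-sample record.
record = {
    "PFdx": "J84111",
    "J84_code_count": "3",
    "snp_hypothetical_1": "G",
    "smoker": "1",
}
censored = censor(record)
```

The risk score is then computed on the censored copy only, so the held-out diagnosis cannot leak into the prediction and remains available for comparison afterwards.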

Results

n=7000
