Pan-genomic Digital Twins for Predictive Diagnosis of Pulmonary Fibrosis

Ishanu Chattopadhyay, PhD

Assistant Professor of Internal Medicine

Institute of Biomedical Informatics

University of Kentucky

Data

19000 samples

1.2M SNPs, social, clinical

11K SNPs, social, clinical

The variable of interest (among the ~11K) is the PF diagnosis state, PFdx

Risk distribution for samples with positive and negative PF diagnoses

Looking ahead: OOS validation

Take OOS samples, censor the PFdx and J84 clinical data, and calculate the risk as defined in Eq. 1

Results

n=7000

Data

Digital Twin of Data

x1	x2	x3

		xi

\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)

We infer a "recursive forest" that yields a generative model of the data

Predict the distribution of a variable ($x_i$) as function of all the other available variables

Captures the cross-talk and epistatic effects
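
A minimal sketch of the idea of one predictor per variable, where each predictor returns a distribution for $x_i$ using only the other variables. The estimator here (smoothed counts weighted by coordinate agreement) is a simple stand-in chosen for illustration, not the recursive forest itself; the names `fit_conditional_estimators`, `alphabet`, and `smoothing` are hypothetical.

```python
def fit_conditional_estimators(samples, alphabet, smoothing=1.0):
    """Return one predictor per variable index i.

    Each predictor maps a full sample x to an estimated distribution over
    `alphabet` for x[i], reading only the other coordinates of x.
    ASSUMPTION: the weighting scheme below is an illustrative stand-in,
    not the tree-based inference described in the slides.
    """
    n_vars = len(samples[0])

    def make_predictor(i):
        def predict(x):
            # Start every symbol with a small pseudo-count so no probability is zero.
            weights = {a: smoothing for a in alphabet}
            for s in samples:
                # Count how many of the *other* coordinates agree with x.
                agree = sum(1 for j in range(n_vars) if j != i and s[j] == x[j])
                weights[s[i]] += agree
            total = sum(weights.values())
            return {a: w / total for a, w in weights.items()}
        return predict

    return [make_predictor(i) for i in range(n_vars)]


# Toy data: three perfectly correlated binary variables.
toy_samples = [("A", "A", "A"), ("A", "A", "A"), ("B", "B", "B"), ("B", "B", "B")]
toy_predictors = fit_conditional_estimators(toy_samples, alphabet=("A", "B"))

# Distribution of variable 0 given that variables 1 and 2 are both "A".
dist_x0 = toy_predictors[0](("B", "A", "A"))
```

Because the predictor for variable $i$ never reads `x[i]`, it can be queried on samples where that variable is unknown or deliberately hidden.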

Viral genome example

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i(x_{-i}) , \Phi_i(y_{-i})\right ) \right )

This distance is "special"

$$\mathbb{J} \textrm{ is the Jensen-Shannon divergence}$$
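
The definition of $\theta(x,y)$ can be sketched in a few lines, assuming the per-variable predictors $\Phi_i$ are available as functions that return a distribution as a dictionary. Two assumptions are made here: the expectation $\mathbf{E}_i$ is taken as a uniform average over variables, and logarithms are base 2 so the divergence lies in $[0, 1]$. The toy predictors `phi_0` and `phi_1` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base-2 logs)."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # KL(a || m), skipping symbols with zero mass under a.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0.0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def q_distance(x, y, predictors):
    """Average over variables of the square root of the JS divergence
    between the predicted distributions for x and for y.
    ASSUMPTION: the expectation over i is a uniform mean."""
    terms = []
    for phi_i in predictors:
        d = js_divergence(phi_i(x), phi_i(y))
        terms.append(math.sqrt(max(d, 0.0)))  # guard against tiny negative rounding
    return sum(terms) / len(terms)


# Hypothetical toy predictors for two binary variables.
# Each one ignores its own coordinate and reads only the other.
def phi_0(x):
    return {"A": 0.9, "B": 0.1} if x[1] == "A" else {"A": 0.2, "B": 0.8}


def phi_1(x):
    return {"A": 0.7, "B": 0.3} if x[0] == "A" else {"A": 0.4, "B": 0.6}


x, y = ("A", "A"), ("B", "B")
theta_xy = q_distance(x, y, [phi_0, phi_1])
```

Note that the distance compares what the model predicts around each sample, not the raw symbols, which is why two samples that differ only at positions the model treats as interchangeable can still be close.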

q-distance

a biologically informed, adaptive distance between samples

A smaller distance implies a higher probability of a "valid perturbation"

Metric Structure

Tangent Bundle

geometry

dynamics

\theta(x,y) \sim \log Pr(x \rightarrow y)

\theta

Digital Twin of Transgenome

(limited to 10K PF correlates)

parameters: 975050

variables: 11k

samples: 12k

Inferred path from the MUC5B promoter to the PFdx outcome

[Figure: node labels MUC5B, JHU_11.119702210_G, PFdx]

inferred network driving PF diagnosis

This is a fraction of the emergent network among our variables that "drives" the PFdx variable

Some qualitative match with what is known. Many "new" relationships.

Need to inspect for mechanistic meaning

Building the Classifier for PFdx

Leverage the generative property
The "training set" is imperfectly specified (not all PF-positive patients are diagnosed, and some clinical diagnoses might be "placeholders" or wrong)
A no-PF diagnosis does not mean "healthy"; that group is highly heterogeneous
A PF diagnosis also does not mean the rest of the patient's data is homogeneous

We make "anchor" perturbation samples