Machine Learning & Advanced Analytics for Biomedicine

BMI 633

Ishanu Chattopadhyay

Assistant Professor of Biomedical Informatics and Computer Science

https://github.com/zeroknowledgediscovery/MLcourse/tree/main/BMI633

ishanu_ch@uky.edu

What is Machine Learning

Learning from machines?

Learning with the help of computers?

Modeling data?

Regression?

data -> (intelligent) automated analysis -> actionable insights

How is Machine Learning different from...

Statistics

Data Mining

Deep Learning

How is Machine Learning different from...

“Machine learning is essentially a form of applied statistics”

“Machine learning is statistics scaled up to big data”

“Machine learning is Statistics minus any checking of models and assumptions.”

“I don’t know what Machine Learning will look like in ten years, but whatever it is I’m sure Statisticians will be whining that they did it earlier and better.”

Approach to a problem differs between mathematicians, statisticians & ML-experts

Mathematicians:
Central Limit Theorem
Measure Theory
Stochastic Processes

Statisticians:
Linear Regression
General Linear Models
What is the "correct" statistical model for a problem/process?
Often the interest is in "describing" data already observed

ML experts:
No model is correct.
The useful ones predict correctly more often than others
ONLY interested in how well a model works on unseen data

Decision Surfaces with Different Classification Algorithms

How Do We Teach Machines To..

Is there any good reason to assume that data that you have not seen yet will share any properties with data you have already seen?

ML Applications in Bio-medicine

Uncharted Possibilities

Predicting future disease
Optimizing interventions
Discovering unknown mechanisms

A new paradigm of scientific discovery
At-scale pattern discovery impossible otherwise

Data

Knowledge

Towards a grand unified theory of data

lots of data!

Classical Science

The age of data

Data

Insight

scientific knowledge

Clinical Decisions

social theory

Let's get down to the basics...

Diagnostic Tests for Diseases

Risk Factors
- Past Diagnoses
Laboratory Tests
Questionnaire
Familial Risks
Life Events

Does the patient have the disorder?

Not Always Obvious

autism

dementia

Diagnostic Tests for Diseases

Risk Factors
- Past Diagnoses
Laboratory Tests
Questionnaire
Familial Risks
Life Events

Does the patient have risk of the disorder ?

Not Always Obvious

autism

dementia

How do we quantify risk?

How do we map risk to severity?

Diagnostic Tests

Sensitivity & Specificity

Confusion Matrix with 2 classes

Performance Metrics

Relationships between Performance Metrics

TPR = \frac{t_p}{P} = \frac{t_p}{t_p+f_n}\\ TNR = \frac{t_n}{N} = \frac{t_n}{t_n+f_p}\\ FPR =1-TNR\\ PPV =\frac{t_p}{t_p+f_p}\\ \rho =\frac{P}{N+P}

t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }

f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }
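The definitions above can be sketched in a few lines of Python. The function name `diagnostic_metrics` and the counts are hypothetical, chosen only for illustration.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    P = tp + fn          # all actual positives
    N = tn + fp          # all actual negatives
    tpr = tp / P         # sensitivity
    tnr = tn / N         # specificity
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": 1 - tnr,
        "PPV": tp / (tp + fp),
        "prevalence": P / (N + P),
    }

# Hypothetical counts (not from any real study).
m = diagnostic_metrics(tp=90, fp=45, tn=855, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```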

Relationships between Performance Metrics

PPV = \frac{t_p/P}{t_p/P + (f_p/N)(N/P)} = \frac{TPR}{TPR + ((N-t_n)/N)(N/P)}

t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }

f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }

s : \textrm{ sensitivity }, c: \textrm{ specificity }

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{\rho}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{\rho} -1)}

Relationships between Performance Metrics

PPV = \frac{t_p/P}{t_p/P + (f_p/N)(N/P)} = \frac{TPR}{TPR + ((N-t_n)/N)(N/P)}

t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }

f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }

s : \textrm{ sensitivity }, c: \textrm{ specificity }

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{\red \rho}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{\red \rho} -1)}

Prevalence is an intrinsic property of the disease, not of the test

Relationships between Performance Metrics

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{\red \rho}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{\red \rho} -1)}

Manic Episode with no Bipolar history

prevalence: ~10%

Relationships between Performance Metrics

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{\red \rho}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{\red \rho} -1)}

Idiopathic Pulmonary Fibrosis

prevalence: ~0.5%
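To see how strongly prevalence drives PPV, here is a minimal sketch that evaluates the two formulas above at the two prevalences mentioned (~10% and ~0.5%). The sensitivity and specificity of 0.9 are assumed values for illustration, not properties of any real test for these conditions.

```python
def ppv(s, c, rho):
    """Positive predictive value from sensitivity s, specificity c, prevalence rho."""
    return s / (s + (1 - c) * (1 / rho - 1))

def npv(s, c, rho):
    """Negative predictive value from sensitivity s, specificity c, prevalence rho."""
    return 1 / (1 + (1 - s) / (c * (1 / rho - 1)))

s, c = 0.9, 0.9  # assumed test characteristics (hypothetical)
for label, rho in [("prevalence ~10%", 0.10), ("prevalence ~0.5%", 0.005)]:
    print(f"{label}: PPV = {ppv(s, c, rho):.3f}, NPV = {npv(s, c, rho):.4f}")
```

With the same test, the PPV drops from 0.5 at 10% prevalence to below 0.05 at 0.5% prevalence, while the NPV stays close to 1.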

Relationships between Performance Metrics

The decision threshold is up to us to decide

Impacts sensitivity & specificity

Sensitivity Specificity Tradeoff

Each choice of a threshold produces a different test
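A small simulation makes the trade-off concrete. This is a sketch on synthetic scores: the two Gaussian score distributions and the list of thresholds are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic test scores: diseased patients tend to score higher (assumed).
neg = rng.normal(0.0, 1.0, 1000)   # scores of patients without the disorder
pos = rng.normal(1.5, 1.0, 1000)   # scores of patients with the disorder

def sens_spec(threshold):
    """Call 'positive' when score >= threshold; return (sensitivity, specificity)."""
    s = np.mean(pos >= threshold)
    c = np.mean(neg < threshold)
    return s, c

thresholds = [-1.0, 0.0, 0.75, 1.5, 2.5]
results = [sens_spec(t) for t in thresholds]
for t, (s, c) in zip(thresholds, results):
    print(f"threshold {t:5.2f}: sensitivity {s:.3f}, specificity {c:.3f}")
```

Raising the threshold can only lower sensitivity and raise specificity, so each threshold is effectively a different test.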

Comparing Tests

Why is a "diagonal ROC" useless?

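Let sensitivity be $s$, specificity be $c$, and prevalence $P/(N+P)$ be $\wp$.

A diagonal ROC means TPR = FPR, i.e. $s = 1-c$. Then:

s=1-c \\ \Rightarrow \frac{t_p}{P} = \frac{f_p}{N} \\ \Rightarrow \frac{t_p}{f_p} = \frac{P}{N} = \frac{\wp}{1-\wp}

Hence, among those who test positive the odds of disease are just the prevalence odds: a test with s=1-c is NO BETTER than a coin toss!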

Comparing Tests

See papers 1-4 in https://github.com/zeroknowledgediscovery/course_notes/tree/master/paper_arxiv

AUC only considers ranks, not actual values
Related to the Mann-Whitney U Test
Shows why AUC is immune to class imbalance

For a randomly chosen positive sample and a randomly chosen negative sample, AUC is the probability that the positive one is ranked higher
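The ranking interpretation can be checked numerically. The sketch below uses synthetic, deliberately imbalanced scores (an assumption) and a hypothetical helper `pairwise_auc`; it compares the fraction of correctly ordered positive/negative pairs with `sklearn.metrics.roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 300)   # synthetic scores, negatives (3x as many)
pos = rng.normal(1.0, 1.0, 100)   # synthetic scores, positives

def pairwise_auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

y_true = np.r_[np.ones(len(pos)), np.zeros(len(neg))]
scores = np.r_[pos, neg]

auc_pairs = pairwise_auc(pos, neg)
auc_sklearn = roc_auc_score(y_true, scores)
# A strictly increasing transform changes the values but not the ranks.
auc_transformed = roc_auc_score(y_true, np.exp(scores))
print(auc_pairs, auc_sklearn, auc_transformed)
```

The pair count in `pairwise_auc` is the Mann-Whitney U statistic of the positive group divided by the number of pairs, which is where the connection to that test comes from.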

Tests are tools to reduce uncertainty

Test Effectiveness

-LR=\frac{f_n}{t_n} \times \frac{1-\rho}{\rho} =\frac{1-s}{c}

+LR=\frac{t_p}{f_p} \times \frac{1-\rho}{\rho} =\frac{s}{(1-c) }

Prove this using Bayes' Theorem
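As a numerical sanity check (not a proof), the sketch below multiplies the pre-test odds by a likelihood ratio and confirms that the result agrees with the PPV and NPV formulas from earlier. The values of `s`, `c` and `rho` are assumed for illustration, and the function names are hypothetical.

```python
def positive_lr(s, c):
    """+LR = s / (1 - c)."""
    return s / (1 - c)

def negative_lr(s, c):
    """-LR = (1 - s) / c."""
    return (1 - s) / c

def post_test_probability(pre_test_prob, lr):
    """Post-test odds = pre-test odds x likelihood ratio; convert back to probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

s, c, rho = 0.9, 0.8, 0.1   # assumed sensitivity, specificity, prevalence

p_after_positive = post_test_probability(rho, positive_lr(s, c))
p_after_negative = post_test_probability(rho, negative_lr(s, c))

# Direct formulas for comparison.
ppv_direct = s / (s + (1 - c) * (1 / rho - 1))
npv_direct = 1 / (1 + (1 - s) / (c * (1 / rho - 1)))
print(p_after_positive, ppv_direct)        # should agree
print(p_after_negative, 1 - npv_direct)    # should agree
```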

Test Effectiveness

$$t_p/f_p$$

$$\frac{\rho}{1-\rho}$$

Test Effectiveness

Choosing Thresholds

Balancing False Positives & False Negatives

Cost	Actual Positive	Actual Negative
Test Positive	$0	$x
Test Negative	$y	$0

Cost Optimization to choose operating point

\textrm{minimize } \zeta = C(f_p)+C(f_n)

Criminal Justice: $$C(f_n) = 0 $$

Healthcare (Covid test?)

$$C(f_p) = 0 $$

naive dichotomy
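One way to pick the operating point is to scan thresholds and keep the one with the lowest total cost. The sketch below assumes the cost is linear in the error counts (cost per false positive x, cost per false negative y, as in the table above) and uses synthetic scores; `best_threshold` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
neg = rng.normal(0.0, 1.0, 5000)   # synthetic scores, actual negatives
pos = rng.normal(2.0, 1.0, 500)    # synthetic scores, actual positives
thresholds = np.linspace(-3.0, 5.0, 161)

def best_threshold(cost_fp, cost_fn):
    """Return the threshold minimizing cost_fp * #FP + cost_fn * #FN."""
    fp = np.array([(neg >= t).sum() for t in thresholds])
    fn = np.array([(pos < t).sum() for t in thresholds])
    total = cost_fp * fp + cost_fn * fn
    return thresholds[np.argmin(total)]

t_balanced = best_threshold(cost_fp=1, cost_fn=1)
t_fn_costly = best_threshold(cost_fp=1, cost_fn=20)   # missing a case is expensive
t_fp_costly = best_threshold(cost_fp=20, cost_fn=1)   # a false alarm is expensive
print(t_fn_costly, t_balanced, t_fp_costly)
```

Making false negatives expensive pushes the threshold down (more positives called); making false positives expensive pushes it up. Setting either cost to exactly zero is the degenerate case on the slide.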

Choosing Thresholds

Overlapping features are harder to classify

How do we formalize these trade-offs?

What happens if we test again?

0.045

1-0.045

0.69

But confirmatory tests might not be always feasible
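A second test simply applies Bayes' theorem again, with the first posterior as the new prior. The sketch below assumes the two test results are conditionally independent given the true disease status, and uses hypothetical values (sensitivity 0.95, specificity 0.98, prevalence 0.1%). With these assumed values the posterior after one and two positive results comes out near 0.045 and 0.69.

```python
def update(prior, s, c, positive=True):
    """Posterior probability of disease after one test result (Bayes' theorem)."""
    like_disease = s if positive else 1 - s        # P(result | disease)
    like_healthy = (1 - c) if positive else c      # P(result | no disease)
    num = like_disease * prior
    return num / (num + like_healthy * (1 - prior))

s, c, prior = 0.95, 0.98, 0.001   # assumed values, for illustration only

p1 = update(prior, s, c)   # after the first positive result
p2 = update(p1, s, c)      # after a second, independent positive result
print(f"after 1 positive: {p1:.3f}, after 2 positives: {p2:.3f}")
```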

Summary of Bayesian Inference

(H)

Bayes' Error

Classification & Decision Theory

\textrm{Estimate a function } f:X \rightarrow Y\\ \textrm{where } Y \textrm{ has finitely many elements}

\textrm{In classification we consider pairs } (x,y)\\ \textrm{ where $x$ is a feature vector, and $y$ is a class label}

x \in \mathcal{X}, y \in \mathcal{Y}

\textrm{Classification of handwritten digits}: \mathcal{X} =\{\textrm{ image vectors representing digits $0$ to $9$} \},\ \mathcal{Y} = \{ 0, \cdots, 9\}

Classification & Decision Theory

\textrm{Consider the joint distribution } P_{XY}

p_0=Pr(X \vert Y=0)

p_1=Pr(X \vert Y=1)

Classification & Decision Theory

Mathematical definition of classifier:

h:X \rightarrow Y

Risk of a classifier:

R(h) \coloneqq P_{XY}(h(X) \neq Y) = \mathbb{E}_{XY}[\mathbb{1}_{\{ h(X) \neq Y \}}]

Bayes Risk (search over all possible classifiers):

R^\star = \inf_h R(h)

A classifier achieving the Bayes risk is a Bayes Optimal Classifier

Classification & Decision Theory

R^\star = \inf_h R(h)

Bayes Risk

A classifier achieving the Bayes risk is a Bayes Optimal Classifier

Bayesian Decision Theory

L(y, \widehat{y}) = \left \{ \begin{array}{ll} 1 & y \neq \widehat{y} \\ 0 & \textrm{otherwise} \end{array}\right.

\textrm{Zero-one Loss}

Minimizing the 0-1 loss is equivalent to minimizing the overall misclassification rate. The 0-1 loss is an example of a symmetric loss function: all errors are penalized equally. In certain applications, asymmetric loss functions are more appropriate.

Recall cost of false negatives vs that of false positives

\textrm{Minimize the expected loss}\\\textrm{with respect to the probability distribution $p(x, y)$}

The expected 0-1 loss is precisely the probability of making a mistake

Defining Bayes Optimal Classifier in terms of the Loss function

Bayesian Decision Theory

f:X \rightarrow Y \textrm{ classifier }

Bayes Optimal Classifier
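A sketch of the standard argument for the 0-1 loss, written with the classifier $f$ and the loss $L$ defined earlier. The only steps assumed here are conditioning on $X$ and minimizing the inner term pointwise for each $x$.

```latex
\mathbb{E}_{XY}\left[ L(Y, f(X)) \right]
  = \mathbb{E}_X\Big[ \sum_{y \in \mathcal{Y}} L(y, f(X))\, P(Y=y \vert X) \Big]
  = \mathbb{E}_X\big[ 1 - P(Y=f(X) \vert X) \big]
\\
\Rightarrow f^\star(x) = \argmax_{y} P(Y=y \vert X=x)
```

The expected loss is smallest when, at every $x$, the classifier picks the label with the largest posterior probability.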

The above derivation is of course only for the 0-1 loss

But this is true in general

Bayes Risk

\textrm{A Bayes classifier is a classifier whose risk}\\\textrm{ $R(h)$ is minimal among all possible classifiers}

\textrm{The minimum risk $R^\star$ is called the Bayes risk}

p_0=Pr(X \vert Y=0)

p_1=Pr(X \vert Y=1)

R_0

R_1

\textrm{Bayes Classifier}\\h^\star(x) = \argmax_y P(Y=y \vert X=x)

Bayes Risk

R(h) = P(Y=y_0)\, P_{XY}(h(X)\neq Y | Y=y_0) + P(Y=y_1)\, P_{XY}(h(X)\neq Y | Y=y_1) \\ = 1 - \sum_i P(Y=y_i)\int_{R_i} P(x \vert Y=y_i)dx

p_0=Pr(X \vert Y=0)

p_1=Pr(X \vert Y=1)

R_0

R_1

\textrm{Bayes Classifier}\\h^\star(x) = \argmax_y P(Y=y \vert X=x)
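For a concrete case, the Bayes risk can be computed when the class-conditional densities are known. The sketch below assumes two 1-D Gaussians with unit variance (means 0 and 2) and equal priors; all of these are illustrative choices. For two classes the risk formula above reduces to the integral of the smaller of the two prior-weighted densities.

```python
import math
import numpy as np

pi0, pi1 = 0.5, 0.5               # assumed class priors P(Y=0), P(Y=1)
mu0, mu1, sigma = 0.0, 2.0, 1.0   # assumed Gaussian class-conditionals p_0, p_1

def gauss_pdf(x, mu):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classifier(x):
    """Pick the class with the larger posterior (equivalently, larger prior * density)."""
    return (pi1 * gauss_pdf(x, mu1) > pi0 * gauss_pdf(x, mu0)).astype(int)

# Bayes risk = integral of min(pi0 * p0, pi1 * p1), via a simple Riemann sum.
x = np.linspace(-10.0, 12.0, 220001)
dx = x[1] - x[0]
bayes_risk = np.sum(np.minimum(pi0 * gauss_pdf(x, mu0), pi1 * gauss_pdf(x, mu1))) * dx

# With equal priors the decision boundary is x = 1, so the risk is Phi(-1).
closed_form = 0.5 * (1 + math.erf(-1 / math.sqrt(2)))

# Monte Carlo estimate of the Bayes classifier's error rate.
rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 2, n)
xs = rng.normal(np.where(y == 1, mu1, mu0), sigma)
mc_error = np.mean(bayes_classifier(xs) != y)
print(bayes_risk, closed_form, mc_error)
```

No classifier can do better than this error rate on data drawn from these two distributions, because the classes genuinely overlap.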

Decision Trees

https://www-users.cs.umn.edu/~kumar001/dmbook/ch4.pdf

https://www.cs.princeton.edu/courses/archive/spr07/cos424/papers/mitchell-dectrees.pdf

https://www.cis.upenn.edu/~danroth/Teaching/CS446-17/LectureNotesNew/dtree/main.pdf

Books

Antonio Criminisi, Jamie Shotton (2013)
- [Decision Forests for Computer Vision and Medical Image Analysis](http://link.springer.com/book/10.1007%2F978-1-4471-4929-3)
Trevor Hastie, Robert Tibshirani, Jerome Friedman (2008)
- [The Elements of Statistical Learning (Chapters 10, 15, and 16)](http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf)
Luc Devroye, Laszlo Gyorfi, Gabor Lugosi (1996)
- A Probabilistic Theory of Pattern Recognition (Chapters 20 and 21)

Overfitting

How do we make decision trees better?

Reduce "bias"

Reduce "variance"

Cannot reduce "irreducible error"

THE TABULAR DATA FORMAT

Summing Up The ML Problem

Naive Bayes Assumption

Vox Populi Vox Dei

OK, Back to Making Decision Trees Better.....

Bias Variance, Reducible & Irreducible Error

\textrm{Let target variable $Y$ and feature $X$ be related as}\\ Y = f(X) + \epsilon \\

\textrm{Data}\\ d = \{ (x_1,y_1), \ldots , (x_m,y_m) \}

\textrm{Estimate of the function $f(x)$, called the hypothesis $h_d(x)$} \\ \textrm{The subscript $d$ reminds us that the hypothesis}\\\textrm{ is a random function that varies over training data sets.}

archive/references/biasVsVariance.pdf

Bias Variance, Reducible & Irreducible Error

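Expected Test Error (averaged over training sets $d$ drawn from $\mathcal{D}$):

E_{\mathcal{D}} E_{X,\epsilon} \left[ (Y(X,\epsilon) - h_{d}(X))^2 \right]

= \underbrace{E_X \left[ \left( f(X) - E_{\mathcal{D}} \left[ h_d(X) \right] \right)^2 \right]}_{\textrm{bias}^2} + \underbrace{E_X \left[ Var_{\mathcal{D}} \left[ h_d(X) \right] \right]}_{\textrm{variance}} + \underbrace{Var_\epsilon \left[ \epsilon \right]}_{\textrm{irreducible error}}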

archive/references/biasVsVariance.pdf
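The bias and variance terms can be estimated by simulation: draw many training sets, fit a model on each, and look at the mean and the spread of the predictions at fixed test points. The sketch below assumes $f(x) = \sin x$, Gaussian noise with standard deviation 0.5, and regression trees of different depth; all of these are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = np.sin                       # assumed true function
sigma = 0.5                      # assumed noise standard deviation
x_test = np.linspace(0, 2 * np.pi, 200)

def bias_variance(max_depth, n_datasets=200, m=50):
    """Estimate (bias^2, variance) of a tree of given depth over random training sets."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 2 * np.pi, m)
        y = f(x) + rng.normal(0, sigma, m)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(x[:, None], y)
        preds[i] = tree.predict(x_test[:, None])
    bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)   # E_X[(f - E_D[h])^2]
    variance = np.mean(preds.var(axis=0))                    # E_X[Var_D[h]]
    return bias2, variance

results = {depth: bias_variance(depth) for depth in (1, 3, None)}
for depth, (b2, var) in results.items():
    print(f"max_depth={depth}: bias^2={b2:.3f}, variance={var:.3f}, noise={sigma**2:.3f}")
```

A depth-1 stump has high bias and low variance; a fully grown tree has low bias but variance close to the noise level, which is the behavior bagging tries to tame.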

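from sklearn import metrics
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X (feature matrix) and y (class labels) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Shallow, class-balanced tree as the base model. It is passed positionally
# because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2.
clf_ = DecisionTreeClassifier(max_depth=4, class_weight='balanced')
clf = BaggingClassifier(clf_, n_estimators=10, oob_score=True)
clf.fit(X_train, y_train)

# predict() takes only the features; compare its output with y_test to score.
y_pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))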

Feature Importance for Bagging Classifier

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier())
clf.fit(X, y)

feature_importances = np.mean([
    tree.feature_importances_ for tree in clf.estimators_
], axis=0)

Average over the feature importance of base models

Median might be better
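A minimal sketch of the median variant, on the same iris data. It assumes the default `max_features=1.0` of `BaggingClassifier`, so that every base tree sees all features in the same order; the renormalization at the end is an optional choice, since medians need not sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
clf.fit(X, y)

# One row of importances per base tree.
per_tree = np.array([tree.feature_importances_ for tree in clf.estimators_])

mean_imp = per_tree.mean(axis=0)
median_imp = np.median(per_tree, axis=0)              # less sensitive to a few odd trees
median_imp_normalized = median_imp / median_imp.sum() # optional: rescale to sum to 1
print(mean_imp)
print(median_imp_normalized)
```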

Results no longer easily interpretable
One can no longer trace the "logic" of an output through a series of decisions based on predictor values

Problem with Ensemble Methods

Instead of one "rule", it is a distribution over rules, or a linear combination of rules

What happens when there is one strongly predicting feature?

How do we avoid this?

We can penalize the number of times a feature is used at a certain depth

We can penalize the number of times a feature can be used

We will only allow some features chosen by some meta analysis

No! Same Bias

features

samples

Bagging

features

samples

Random Forest

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=None, class_weight='balanced',n_estimators=300)
y_pred = clf.fit(X_train, y_train).predict(X_test)
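To illustrate why sampling features matters when one predictor dominates, the sketch below counts how often the root split uses that predictor. The data are synthetic (feature 0 is made strongly predictive by construction), and `max_features=1` is an assumed setting chosen to make the effect obvious; `root_uses_feature0` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 6))
# Feature 0 is strong, features 1-2 are weak, features 3-5 are pure noise.
signal = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.5 * X[:, 2]
y = (signal + rng.normal(size=n) > 0).astype(int)

def root_uses_feature0(trees):
    """Fraction of trees whose root node (node 0) splits on feature 0."""
    return np.mean([t.tree_.feature[0] == 0 for t in trees])

bag = BaggingClassifier(DecisionTreeClassifier(max_depth=4),
                        n_estimators=100, random_state=0).fit(X, y)
rf = RandomForestClassifier(max_depth=4, max_features=1,
                            n_estimators=100, random_state=0).fit(X, y)

frac_bag = root_uses_feature0(bag.estimators_)
frac_rf = root_uses_feature0(rf.estimators_)
print(f"root splits on feature 0: bagging {frac_bag:.2f}, random forest {frac_rf:.2f}")
```

Bagged trees resample only the rows, so nearly all of them start with the same split and stay correlated; restricting the candidate features at each split forces the trees apart.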

https://www.stat.berkeley.edu/~breiman/RandomForests/

HW 1

Theodore H. Tulchinsky, "John Snow, Cholera, the Broad Street Pump; Waterborne Diseases Then and Now", Case Studies in Public Health. 2018 : 77–99 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7150208/

Cholera was a major global scourge in the 19th century, with frequent large-scale epidemics in European cities primarily originating in the Indian subcontinent.
John Snow conducted pioneering investigations on cholera epidemics in England, and particularly in London in 1854, in which he demonstrated that contaminated water was the key source of the epidemics.
His thorough investigation of an epidemic in the Soho district of London led to his conclusion that contaminated water from the Broad Street pump was the source of the disease and, consequently, the removal of the handle led to cessation of the epidemic.

Can you retrace J. Snow's argument?

The prevailing Miasma Theory was that cholera was caused by airborne transmission of poisonous vapors from foul smells due to poor sanitation. At the same time, the competing Germ Theory that inspired Snow was still an unproven minority opinion in medical circles.

1854 London

HW 2: Epidemiology of COVID-19 in the US

# Question 1

+ Train a decision-tree to predict which counties experience a >5% change in daily count
+ Train a decision-tree to predict which counties experience a >5% change in weekly count

# Question 2

+ Redo Question 1 using a Bagging classifier. Try to optimize the number of estimators, and depth of tree in each estimator
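One possible way to organize the search, sketched on synthetic data from `make_classification` rather than the county data. The grid values, the 5-fold cross-validation and the ROC-AUC scoring are all assumptions; substitute the actual features, labels and metric for the homework.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the county-level features and labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

scores = {}
for depth, n_est in product([2, 4, 8], [10, 50]):
    clf = BaggingClassifier(DecisionTreeClassifier(max_depth=depth),
                            n_estimators=n_est, random_state=0)
    # Mean cross-validated AUC for this (depth, n_estimators) pair.
    scores[(depth, n_est)] = cross_val_score(clf, X, y, cv=5,
                                             scoring="roc_auc").mean()

best = max(scores, key=scores.get)
print("best (max_depth, n_estimators):", best, "AUC:", round(scores[best], 3))
```

Building each estimator explicitly in the loop avoids depending on the nested parameter name used by `GridSearchCV`, which differs between scikit-learn versions.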