Recurrent VAE with Gaussian Process Decoders for De Novo Molecular Generation

Vidhi Lalchand\(^{1}\), Dave Lines\(^{2}\), Anna Perdrix Rosell\(^{2}\), Neil D. Lawrence\(^{3}\)

\(^{1}\)Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, \(^{2}\)Sixfold Bioscience, \(^{3}\)University of Cambridge

Motivation

  • Generative models have become an integral part of the discovery process in modern science and engineering. This work leverages generative latent variable models for small-molecule generation, a predominant modality for modern drugs.

 

  • A critical bottleneck in drug design is the identification of novel molecules from an overwhelming search space of \(\approx 10^{60}\) candidates. Hence, a typical workflow is to make a small change to a known molecule and test the modified creation downstream for its chemical or physical properties.

 

  • A generative model can largely side-step this workflow by directly generating small molecules with the greatest potential to exhibit these properties.

 

  • In this work, we propose a paradigm for achieving precisely this with a coupled RNN-based latent variable model that combines multiple decoders to simultaneously predict valid molecules along with chemical properties of interest, such as hydrophobicity, synthetic accessibility, and QED score, all of which are represented as continuous values.


Visualisation of decoded molecules reconstructed from a small neighbourhood around a fixed molecule. The immediate neighbourhood of a known molecule (highlighted in red) yields molecules that share structural similarity; this similarity weakens with increasing Euclidean distance, underscoring the utility of a smooth latent space.
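A minimal sketch of probing such a neighbourhood, assuming a trained model exposing a `decode` method (a hypothetical name; only the sampling logic below is concrete):

```python
import torch

latent_dim = 56                      # illustrative latent dimensionality
z0 = torch.randn(latent_dim)         # stand-in latent code of a known molecule

# Sample a small Gaussian ball around z0; the radius controls how far the
# decoded molecules stray from the anchor molecule.
radius = 0.2
neighbours = z0 + radius * torch.randn(25, latent_dim)

# Structural similarity of the decodings is expected to weaken as the
# Euclidean distance to the anchor grows.
dists = torch.linalg.norm(neighbours - z0, dim=-1)

# Decoding is model-specific; `model.decode` is a hypothetical method that
# maps latent codes back to SMILES strings.
# smiles = [model.decode(z) for z in neighbours]
```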

For each property, we plot predicted against measured values along a 45° line, with points shaded by the ground-truth property value. The grey bars denote 95% confidence intervals, approximately corresponding to a \(2\sigma\) interval around the mean prediction. We also note the robustness of the prediction intervals: inaccurate predictions are accompanied by wider error bars.

Model Architecture [V-RNN]

Regularised 2D subspaces per property-prediction task. The latent points in each plot are shaded by the ground-truth value of the property modelled by the GP.

The plots show, per data point (row), the ground-truth context vector (right) and the context vector learnt during training (left) via an auxiliary fully connected layer that takes the mean of the latents as input.

Gaussian Process Decoders for property prediction

1) We learn independent GPs which map from the latent space to each property target vector.

2) We use independent SE-ARD kernels, inducing automatic pruning of dimensions in latent space.

3) The GPs are trained jointly with the recurrent VAE to yield smooth subspaces (as a function of target properties) which can be used for gradient-based optimisation; a minimal sketch follows this list.
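A minimal sketch of the per-property GP decoders, using scikit-learn's anisotropic RBF as a stand-in SE-ARD kernel (the poster does not name a GP library; all data and sizes below are illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
latent_dim, n = 8, 200

# Stand-ins for encoded latent means and continuous property targets;
# in the real model these come from the trained V-RNN encoder.
Z = rng.normal(size=(n, latent_dim))
properties = {name: rng.normal(size=n) for name in ["logP", "SA", "QED"]}

gps = {}
for name, y in properties.items():
    # An anisotropic RBF is the SE-ARD kernel: one length-scale per latent
    # dimension; dimensions with large learnt length-scales are pruned.
    kernel = ConstantKernel() * RBF(length_scale=np.ones(latent_dim))
    gps[name] = GaussianProcessRegressor(kernel=kernel, alpha=1e-2,
                                         normalize_y=True).fit(Z, y)
    print(name, gps[name].kernel_.k2.length_scale.round(2))

# Predictive mean and std; +/- 2*std gives the ~95% intervals in the plots.
mu, sd = gps["QED"].predict(Z[:5], return_std=True)
```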

Recurrent VAE with learnt context

1) Seq2Seq models typically need a context vector which encodes the context of the whole sequence.

2) This is usually the terminal hidden state of the encoder, but we wish to sample and optimise from the continuous latent space.

3) We learn a context vector during training which minimises the error between itself and the terminal encoder hidden state; a minimal sketch follows this list.
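A minimal PyTorch sketch of the learnt context, with assumed layer names and sizes (not taken from the poster): an auxiliary fully connected layer maps the latent mean to a context vector that is trained to match the terminal encoder hidden state:

```python
import torch
import torch.nn as nn

vocab, emb, hidden, latent_dim = 40, 32, 128, 56   # illustrative sizes

embed = nn.Embedding(vocab, emb)
encoder = nn.GRU(emb, hidden, batch_first=True)
to_mu = nn.Linear(hidden, latent_dim)       # head producing the latent mean
to_context = nn.Linear(latent_dim, hidden)  # auxiliary context layer

tokens = torch.randint(vocab, (16, 30))     # a batch of tokenised SMILES
_, h_T = encoder(embed(tokens))             # terminal encoder hidden state
h_T = h_T.squeeze(0)

mu = to_mu(h_T)                             # mean of the latent code
context = to_context(mu)                    # learnt context vector

# L_MSE: the learnt context mimics the terminal encoder state, so a context
# can be produced from a latent code alone at sampling time. Detaching the
# target is one possible choice, not specified on the poster.
loss_mse = nn.functional.mse_loss(context, h_T.detach())
```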

Dataset        Structure accuracy [V-RNN]    Structure accuracy w/ property prediction [V-RNN + GPs]
QM9            96.7 (0.11)                   94.2 (0.26)
ZINC [250K]    93.8 (0.41)                   91.7 (0.55)

Model & Goals

Recurrent VAE + Learnt Context + Gaussian Process Decoders

Learn a continuous latent manifold encoding small molecules which admits sampling (of novel molecules) and optimisation using gradient-based techniques (see the sketch after these goals).

 

Encode similarities in latent space: for instance, molecules with similar properties should cluster closely.
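A minimal sketch of such gradient-based optimisation through a differentiable GP predictive mean, written directly in PyTorch with toy data (all names and sizes are illustrative, not the poster's implementation):

```python
import torch

torch.manual_seed(0)
d, n = 8, 100
Z_train = torch.randn(n, d)              # stand-in latent codes
y_train = torch.randn(n)                 # stand-in property values
lengthscales = torch.ones(d)             # SE-ARD: one length-scale per dim
noise = 1e-2

def se_ard(a, b):
    # Squared-exponential kernel with per-dimension length-scales.
    diff = (a.unsqueeze(-2) - b.unsqueeze(-3)) / lengthscales
    return torch.exp(-0.5 * (diff ** 2).sum(-1))

K = se_ard(Z_train, Z_train) + noise * torch.eye(n)
alpha = torch.linalg.solve(K, y_train)   # cached weights, K^{-1} y

def gp_mean(z):
    return se_ard(z, Z_train) @ alpha    # differentiable predictive mean

# Ascend the predictive mean of the target property w.r.t. the latent code.
z = torch.zeros(1, d, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    (-gp_mean(z)).sum().backward()       # maximise the predicted property
    opt.step()
```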

Overall loss: \(\mathcal{L} = \mathcal{L}_{\text{V-RNN}} + \mathcal{L}_{\text{MSE}} + \mathcal{L}_{\text{GP}}\)
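One plausible expansion of the three terms, under standard assumptions (the exact parameterisation and weighting are not given on the poster): the V-RNN term is the usual negative ELBO, the MSE term trains the auxiliary context layer, and the GP term is the negative log marginal likelihood summed over the properties:

\[
\mathcal{L}_{\text{V-RNN}} = -\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] + \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right),
\]
\[
\mathcal{L}_{\text{MSE}} = \left\lVert c_\psi(\mu_\phi(x)) - h_T \right\rVert_2^2, \qquad
\mathcal{L}_{\text{GP}} = -\sum_{k}\log p(\mathbf{y}_k \mid \mathbf{Z}),
\]

where \(c_\psi\) is the auxiliary context layer, \(h_T\) the terminal encoder hidden state, \(\mathbf{Z}\) the latent codes, and \(\mathbf{y}_k\) the target vector of the \(k\)-th property.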
