Paper reading (36): Automatic chemical design using a data-driven continuous representation of molecules

Paper title: Automatic chemical design using a data-driven continuous representation of molecules

Google Scholar citations: 438

Pages: 9

Publication date: January 2018

Journal: ACS (American Chemical Society) Central Science

Authors: Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, et al.

Abstract:

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of (why not state the exact number here?) existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.

Conclusions:

  • In our autoencoder model, we observed high fidelity in reconstruction of SMILES strings and the ability to capture characteristic features of a molecular training set.
  • In this work, we used a text-based molecular encoding, but using a graph-based autoencoder would have several advantages.
  • Building a neural network that can output arbitrary graphs is an open problem.
  • Further extensions of this work could use an explicitly defined grammar for SMILES instead of forcing the model to learn one, or could actively learn to generate valid sequences.
  • Several subsequent works have further explored the use of Long Short-Term Memory (LSTM) networks and recurrent networks applied to SMILES strings to generate new molecules and predict the outcomes of organic chemistry reactions. LSTM shows up yet again...
  • The autoencoder sometimes produced molecules that are formally valid as graphs but contain moieties that are not desirable because of stability or synthetic constraints.
  • One option is to train the autoencoder to predict properties related to steric constraints or other structural constraints.

Introduction:

  • The goal of drug and material design is to identify novel molecules that have certain desirable properties. 
  • We view this as an optimization problem, in which we are searching for the molecules that maximize our quantitative desiderata.
  • The search space is large, discrete, and unstructured.
  • Virtual screening can be used to speed up this search. (Haven't we seen virtual screening before?)
  • Current methods either exhaustively search through a fixed library, or use discrete local search methods such as genetic algorithms or similar discrete interpolation techniques. (The three current mainstream approaches.)
  • Fixed libraries are monolithic, costly to fully explore, and require hand-crafted rules to avoid impractical chemistries. (Limitation of the first mainstream approach.)
  • The genetic generation of compounds requires manual specification of heuristics for mutation and crossover rules. (Limitation of the second mainstream approach.)
  • Discrete optimization methods have difficulty effectively searching large areas of chemical space because it is not possible to guide the search with gradients. (Limitation of the third mainstream approach.)
  • A suitable molecular representation is one that can easily be converted into a machine-readable form.
  • (1) Hand-specified mutation rules are unnecessary; (2) a differentiable model that maps from molecular representations to desirable properties enables the use of gradient-based optimization to make larger jumps in chemical space; (3) a data-driven representation can leverage large sets of unlabeled chemical compounds to automatically build an even larger implicit library, and then use the smaller set of labeled examples to build a regression model from the continuous representation to the desired properties. (The three advantages of the new method, each addressing one of the limitations of the three mainstream approaches above.)
  • We apply such generative models to chemical design, using a pair of deep networks trained as an autoencoder to convert molecules represented as SMILES strings into a continuous vector representation.
  • We chose to use the SMILES representation because it can be readily converted into a molecule. (Why the SMILES representation was chosen.)
  • Using this new continuous vector-valued representation, we experiment with the use of continuous optimization to produce novel compounds. 

Paper structure:

1. Introduction

2. Representation and Autoencoder Framework

3. Results and Discussion

3.1 Representation of Molecules in Latent Space

3.2 Property Prediction of Molecules

3.3 Optimization of Molecules via Properties

4. Conclusion

5. Methods

Excerpts from the main text:

2. Representation and Autoencoder Framework

  • The autoencoder is composed of two deep networks: an encoder network to convert each string into a fixed-dimensional vector, and a decoder network to convert vectors back into strings.
  • Key to the design of the autoencoder is the mapping of strings through an information bottleneck.
  • SMILES: Simplified Molecular Input Line Entry Specification.
  • To help ensure that points in the latent space correspond to valid realistic molecules, we chose to use a variational autoencoder (VAE) framework.
  • We employed the open source cheminformatics suite RDKit to validate the chemical structures of output molecules and discard invalid ones (see the sketch after this list).
  • We added a model to the autoencoder that predicts the properties from the latent space representation.
  • An additional multilayer perceptron (MLP) was used to predict the property from the latent vector of the encoded molecule.
  • Two autoencoder systems were trained: one with 108,000 molecules from the QM9 data set of molecules with fewer than 9 heavy atoms, and another with 250,000 drug-like commercially available molecules extracted at random from the ZINC database.
  • We performed random optimization over hyperparameters specifying the deep autoencoder architecture and training.
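
A minimal sketch of the RDKit validity check described in the list above. Only Chem.MolFromSmiles and Chem.MolToSmiles are actual RDKit calls; the helper name and example strings are hypothetical:

```python
from rdkit import Chem

def filter_valid_smiles(decoded_smiles):
    """Keep only decoder outputs that RDKit can parse into a molecule."""
    valid = []
    for smi in decoded_smiles:
        mol = Chem.MolFromSmiles(smi)  # returns None for invalid SMILES
        if mol is not None:
            # Re-canonicalize so duplicates written differently collapse together
            valid.append(Chem.MolToSmiles(mol))
    return valid

# Hypothetical decoder outputs: the last string has an unclosed ring and is discarded
print(filter_valid_smiles(["CCO", "c1ccccc1", "C1CC"]))  # -> ['CCO', 'c1ccccc1']
```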

3. Results and Discussion

3.1 Representation of Molecules in Latent Space

  • Whereas the distribution of data points in each individual dimension shows a slightly different mean and standard deviation, all the distributions are normal as enforced by the variational regularizer.
  • When these resulting SMILES are re-encoded into the latent space, the most frequent decoding also tends to be the one with the lowest Euclidean distance to the original point, indicating the latent space is indeed capturing features relevant to molecules.
  • The probability of decoding from a point in latent space is dependent on how close this point is to the latent representations of other molecules.
  • A continuous latent space allows interpolation of molecules by following the shortest Euclidean path between their latent representations (a sketch follows this list).
  • Despite the fact that the VAE is trained purely on the SMILES strings independently of chemical properties, it is able to generate realistic-looking molecules whose features follow the intrinsic distribution of the training data. 
  • The hand-selected mutations are less able to generate new compounds while at the same time biasing the properties of the set to higher chemical complexity and decreased drug-likeness. 
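
A minimal sketch of the latent-space interpolation mentioned above; encoder and decoder are hypothetical stand-ins for the trained VAE components, and the input SMILES strings are arbitrary examples:

```python
import numpy as np

def interpolate_molecules(encoder, decoder, smiles_a, smiles_b, steps=10):
    """Decode points spaced evenly on the straight line between two latent vectors."""
    z_a = encoder(smiles_a)  # latent vector of the first molecule
    z_b = encoder(smiles_b)  # latent vector of the second molecule
    candidates = []
    for alpha in np.linspace(0.0, 1.0, steps):
        z = (1.0 - alpha) * z_a + alpha * z_b  # point on the shortest Euclidean path
        candidates.append(decoder(z))          # decode back to a SMILES string
    return candidates
```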

3.2 Property Prediction of Molecules

  • The interest in discovering new molecules and chemicals is most often in relation to maximizing some desirable property. 

  • The latent space generated by autoencoders jointly trained with the property prediction task shows a gradient of property values across the distribution of molecules; molecules with high values are located in one region, and molecules with low values are in another (a sketch of the joint training loss follows this list).

  • Autoencoders that were trained without the property prediction task do not show a discernible pattern with respect to property values in the resulting latent representation distribution.

  • Our VAE model shows that property prediction performance for electronic properties (i.e., orbital energies) is similar to that of graph convolutions for some properties; prediction accuracy could be improved with further hyperparameter optimization.
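
A minimal sketch of the joint objective implied by these bullets: the usual VAE loss (reconstruction plus KL term) combined with a property-regression term. The weighting factor, tensor names, and shapes are assumptions rather than values from the paper:

```python
import tensorflow as tf

def joint_vae_loss(x_true, x_decoded, z_mean, z_log_var, y_true, y_pred,
                   property_weight=1.0):
    """VAE loss plus a property-prediction MSE term, as used in joint training."""
    # Reconstruction: per-character cross-entropy between input and decoded SMILES
    recon = tf.reduce_mean(tf.reduce_sum(
        tf.keras.losses.categorical_crossentropy(x_true, x_decoded), axis=-1))
    # KL divergence between the approximate posterior and a unit Gaussian prior
    kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
    # Property-prediction term computed from the same latent code
    prop = tf.reduce_mean(tf.square(y_true - y_pred))
    return recon + kl + property_weight * prop
```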

3.3 Optimization of Molecules via Properties

  • We next optimized molecules in the latent space from the autoencoder which was jointly trained for property prediction.

  • We used a Gaussian process (GP) model to predict the property from the latent representation (a sketch follows this list).

  • Since the training set is smaller, the predictive power of the GP is lower when optimizing in latent space, and as a result the search converges to several local minima instead of a single global optimum.

  • In cases where it is difficult to define an objective that completely describes all the traits desired in a molecule, it may be better to use this localized optimization approach to reach a larger diversity of potential molecules.
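
A minimal sketch of surrogate-based optimization in latent space, using scikit-learn's GaussianProcessRegressor and SciPy's minimize in place of the sparse GP and optimizer used in the paper; latent_vectors, property_values, and decoder are hypothetical inputs:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor

def optimize_in_latent_space(latent_vectors, property_values, decoder, n_starts=5):
    """Fit a GP surrogate on (latent vector, property) pairs and maximize it locally."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(latent_vectors, property_values)

    results = []
    start_idx = np.random.choice(len(latent_vectors), n_starts, replace=False)
    for z0 in latent_vectors[start_idx]:
        # Minimizing the negative GP prediction maximizes the predicted property
        res = minimize(lambda z: -gp.predict(z.reshape(1, -1))[0], z0, method="Nelder-Mead")
        results.append((decoder(res.x), -res.fun))
    # Sort candidates by predicted property value, best first
    return sorted(results, key=lambda t: -t[1])
```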

5. Methods

Autoencoder Architecture

  • We also experimented with convolutional networks for string encoding and observed improved performance. (So did they end up using a CNN?) This is explained by the presence of repetitive, translationally invariant substrings that correspond to chemical substructures, e.g., cycles and functional groups.
  • Our SMILES-based text encoding used a subset of 35 different characters for ZINC and 22 different characters for QM9.
  • The structure of the VAE deep network was as follows: For the autoencoder used for the ZINC data set, the encoder used three 1D convolutional layers of filter sizes 9, 9, 10 and 9, 9, 11 convolution kernels, respectively, followed by one fully connected layer of width 196. (So they did use a CNN; a sketch follows this list.)
  • The decoder consisted of three layers of gated recurrent unit (GRU) networks with a hidden dimension of 488.
  • For the model used for the QM9 data set, the encoder used three 1D convolutional layers of filter sizes 2, 2, 1 and 5, 5, 4 convolution kernels, respectively, followed by one fully connected layer of width 156. The three recurrent neural network layers each had a hidden dimension of 500 neurons. (Weren't there three CNN layers just mentioned? Where does the RNN suddenly come from? These are the decoder's GRU layers, which are RNNs.)
  • The last layer of the RNN decoder defines a probability distribution over all possible characters at each position in the SMILES string.
  • The output GRU layer had one additional input, corresponding to the character sampled from the softmax output of the previous time step, and was trained using teacher forcing.
  • The variational loss was annealed according to a sigmoid schedule after 29 epochs, running for a total of 120 epochs.
  • For property prediction, two fully connected layers of 1000 neurons were used to predict properties from the latent representation, with a dropout rate of 0.20. 
  • To simply shape the latent space, a smaller perceptron of 3 layers of 67 neurons was used for the property predictor, trained with a dropout rate of 0.15. 
  • For the algorithm trained on the ZINC data set, the objective properties include logP, QED, and SAS. (The objective properties.)
  • For the algorithm trained on the QM9 data set, the objective properties include HOMO energies, LUMO energies, and the electronic spatial extent (R2). 
  • The property prediction loss was annealed in at the same time as the variational loss. 
  • We used the Keras and TensorFlow packages to build and train this model and the RDKit package for cheminformatics.
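
A minimal sketch of the ZINC encoder/decoder architecture described in the list above (three 1D convolutions, a 196-wide dense layer, and a three-layer GRU decoder of width 488), written with the Keras API the paper mentions; the maximum string length, activation functions, omission of the variational sampling layer, and the way the latent vector is fed to the decoder are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, N_CHARS, LATENT_DIM = 120, 35, 196  # assumed SMILES length; 35 characters for ZINC

# Encoder: three 1D convolutions over one-hot SMILES, then a dense layer of width 196
enc_in = keras.Input(shape=(MAX_LEN, N_CHARS))
h = layers.Conv1D(9, 9, activation="tanh")(enc_in)  # 9 filters, kernel size 9
h = layers.Conv1D(9, 9, activation="tanh")(h)       # 9 filters, kernel size 9
h = layers.Conv1D(10, 11, activation="tanh")(h)     # 10 filters, kernel size 11
h = layers.Flatten()(h)
z = layers.Dense(LATENT_DIM, activation="tanh")(h)  # variational sampling omitted here
encoder = keras.Model(enc_in, z)

# Decoder: repeat the latent vector at every time step, then three GRU layers of width 488
dec_in = keras.Input(shape=(LATENT_DIM,))
d = layers.RepeatVector(MAX_LEN)(dec_in)
d = layers.GRU(488, return_sequences=True)(d)
d = layers.GRU(488, return_sequences=True)(d)
d = layers.GRU(488, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(N_CHARS, activation="softmax"))(d)
decoder = keras.Model(dec_in, out)
```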
  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值