Paper reading (35): Neural-symbolic machine learning for retrosynthesis and reaction prediction

Paper title: Neural-symbolic machine learning for retrosynthesis and reaction prediction

Google Scholar citations: 99

Pages: 9

Published: 2017.01

Journal: Chemistry – A European Journal

Authors: Marwin H. S. Segler and Mark P. Waller

Abstract:

Reaction prediction and retrosynthesis are the cornerstones of organic chemistry. Rule-based expert systems have been the most widespread approach to computationally solve these two related challenges to date. However, reaction rules often fail because they ignore the molecular context, which leads to reactivity conflicts. Herein, we report that deep neural networks can learn to resolve reactivity conflicts and to prioritize the most suitable transformation rules. We show that by training our model on 3.5 million reactions taken from the collective published knowledge of the entire discipline of chemistry, our model exhibits a top-10 accuracy of 95% in retrosynthesis and 97% for reaction prediction on a validation set of almost 1 million reactions.

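  • Other areas where rule-based approaches are widely used might also be worth trying with DNNs.
  • Top-10? Here it means a prediction counts as correct if the right answer is among the 10 highest-ranked candidates.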

Limitations:

  • Most of the limitations stem from the underlying rules, and not the machine‐learning component. 
  1. A limitation our system shares with other rule‐based systems is that it cannot predict anything outside its rule base. It does not solve the dilemma of rules: either one defines rules that are too general, which generates a lot of noise, or rules that are too specific, which can only predict the substrate used to derive the rule. This is especially problematic for reaction types that only occur a few times. Possible solution: a model of chemical reasoning based on knowledge graphs.
  2. Our system does not take stereochemistry into account. Possible solution: a global model without involving quantum chemistry.
  • we report on a hybrid neural‐symbolic approach for both retrosynthesis and reaction prediction that can be trained with large reaction sets from databases. 
  •  neural networks can learn to which molecular context particular rules can be applied, and can prioritize the rules for both retrosynthesis and reaction prediction using either hand‐coded or automatically extracted rule sets. 
  • We anticipate that neural‐symbolic models will be a key building block in future systems for computer‐aided synthesis design, robot synthesis, virtual chemical space exploration, and de novo drug design.

Introduction:

  • To rationally synthesize new molecules, two intimately related problems, reaction prediction and retrosynthesis, have to be solved. 
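  • Reaction prediction: the task is to infer how a set of molecules (the starting materials) will react and what the products will be.
  • Retrosynthetic analysis, also called retrosynthesis, is an important method for designing organic synthesis routes, and also the simplest and most basic one. In essence it is the disconnection of the target molecule: by analyzing the target's structure, one breaks it down step by step into simpler, more easily synthesized precursors and starting materials, which completes the route design.
  • The standard methodology for retrosynthesis and reaction prediction is the rule‐based expert system.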
  • The rules are applied to the reactants to obtain the product in reaction prediction, or in reverse, to the product, for retrosynthesis. 
  • The great advantage of rules is that they are straightforward to interpret.
  • the rule‐based approach has several drawbacks:
  1. rule‐based expert systems cannot predict anything outside of their knowledge
  2. the rules have to be compiled and curated
  3. lack an inherent ranking mechanism
  • rule‐based expert systems for retrosynthesis have never been rigorously evaluated with large hold out test sets. 
  • Methods tried in prior work: random forests, neural networks, unsupervised pre‐training of self‐organizing maps
  • we propose a novel neural‐symbolic model, which can be used for both reaction prediction and retrosynthesis.
  • We hypothesize that the advantage of combining machine learning with symbolic rules is that we retain the familiar concept of rules, whereas the model learns to prioritize the rules and to estimate selectivity and compatibility from the provided training data, which are successfully performed experiments. 
  • In top‐n accuracy, we examine if the correct reaction rule is among the n highest ranked rules, similar to being on the first page of the results of a search engine.
  • we compare our best neural‐symbolic models, a neural network with one hidden layer (FC512 ELU) and a deep highway network, to a purely rule‐based expert system operating with the same rule set.
  • no hand‐annotated expert systems are free or open source, and the annotations themselves are not published in the public domain making a direct comparison unfeasible.
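
The top‐n accuracy described above can be sketched in a few lines. This is only an illustration, not the authors' evaluation code: the function name `top_n_accuracy`, the score matrix, and the rule indices are all made up for the example.

```python
def top_n_accuracy(scores, true_rules, n):
    """Fraction of reactions whose true rule is among the n highest-scored rules."""
    hits = 0
    for row, true_rule in zip(scores, true_rules):
        # Rule indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_rule in ranked[:n]:
            hits += 1
    return hits / len(true_rules)


# Toy scores for 3 reactions over 4 hypothetical rules (assumed data).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true rule 1 is ranked 1st
    [0.5, 0.1, 0.3, 0.1],  # true rule 2 is ranked 2nd
    [0.4, 0.3, 0.2, 0.1],  # true rule 3 is ranked 4th
]
true_rules = [1, 2, 3]

top1 = top_n_accuracy(scores, true_rules, 1)  # 1 of 3 reactions hit
top2 = top_n_accuracy(scores, true_rules, 2)  # 2 of 3 reactions hit
```

Raising n can only keep or increase the score, which is why the reported top‐10 numbers are higher than the plain (top‐1) accuracies.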

Structure of the paper:

1. Introduction

2. Hand-coded reactions

3. Automatically extracted rules

4. Timing

5. Limitations

6. Experimental Section

6.1 Data

6.2 Reaction rules

6.3 Molecular descriptors

6.4 Neural networks

Excerpts from the paper:

2. Hand-coded reactions

  • the models had to predict the correct rule amongst 103 hand‐coded rules for retrosynthesis and reaction prediction
  • For reaction prediction, the accuracy of the rule‐based system is 0.07, whereas our model reaches an accuracy of 0.92.
  • In retrosynthesis, the expert system yields an accuracy of 0.05 and an MRR of 0.01. Our single‐layer neural network reaches an accuracy of 0.78 and an MRR of 0.87
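
MRR (mean reciprocal rank) is reported next to accuracy in the numbers above. As a rough sketch of how it is usually computed (the helper name and the toy data are assumptions, and ties are resolved in favour of the true rule here, which may differ from the paper's setup):

```python
def mean_reciprocal_rank(scores, true_rules):
    """Average of 1/rank of the true rule over all reactions."""
    total = 0.0
    for row, true_rule in zip(scores, true_rules):
        # Rank = 1 + number of rules scored strictly higher than the true rule.
        rank = 1 + sum(1 for s in row if s > row[true_rule])
        total += 1.0 / rank
    return total / len(true_rules)


# Toy scores for 3 reactions over 4 hypothetical rules (assumed data).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true rule 1 has rank 1
    [0.5, 0.1, 0.3, 0.1],  # true rule 2 has rank 2
    [0.4, 0.3, 0.2, 0.1],  # true rule 3 has rank 4
]
true_rules = [1, 2, 3]

mrr = mean_reciprocal_rank(scores, true_rules)  # (1 + 1/2 + 1/4) / 3
```

Every top‐1 hit contributes a reciprocal rank of 1 and every miss still contributes something positive, so MRR never falls below top‐1 accuracy on the same predictions.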

3. Automatically extracted rules

  • For reaction prediction, the accuracy of the best neural‐symbolic model is 0.78.
  • In the retrosynthesis task, the best neural‐symbolic model reaches an accuracy of 0.64 and a top‐10 accuracy of 0.95.
  • There are several observations to be discussed.
  1. The neural‐symbolic models outperform expert systems in all experiments because the rule‐based system matches tens or hundreds of rules. The neural network has not only learned which functional groups are involved in a rule, but also which molecular contexts are tolerated.
  2. Rule‐based systems without additional information about the molecular context perform only slightly better than random, even if the rule set is small. Reaction‐driven de novo molecular design approaches would therefore benefit from neural‐symbolic models.
  3. The overall metrics for retrosynthesis are lower than in the reaction prediction task, because in retrosynthesis the system has less information available (just the product).
  • The difference in performance between the hand‐coded and the algorithmically extracted rules can be attributed to the size of the rule sets.

4. Timing

  • Training our largest neural networks takes 6 h using an NVIDIA Tesla K80 GPU (graphics processing unit).
  • In contrast, the rule‐based expert system takes 62 min and 24 s (3744 s). The neural‐symbolic model is thus 150 times faster, which works out to roughly 25 s.

6. Experimental Section

6.1 Data

  • As the dataset, we used all reactions with up to three reactants that lead to a single reported product from the Reaxys database, published from 1771 until 2015.
  • This left us with 3 million reactions for the hand‐coded rules, and 4.9 million reactions for the extracted rules.
  • The data were split randomly into a training set, a development set and a validation set (7:1:2).
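
A random 7:1:2 split like the one described above could look as follows. This is a minimal sketch with a made‐up helper name and a fixed seed for reproducibility; how the authors actually shuffled and partitioned the Reaxys data is not stated in these notes.

```python
import random


def split_7_1_2(items, seed=0):
    """Shuffle and split into training, development and validation sets (7:1:2)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10
    n_dev = n // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    validation = shuffled[n_train + n_dev:]  # the remaining ~20%
    return train, dev, validation


# Integers stand in for reaction records here.
train, dev, validation = split_7_1_2(range(1000))
```

With 4.9 million reactions, the same ratios give about 3.4 million for training and just under 1 million for validation, which lines up with the figures quoted in the abstract.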

6.2 Reaction rules

  • The reaction rules were obtained in two different ways.
  1. we entered 103 rules of common reactions by hand.
  2. we extracted very general rules algorithmically, following an established, shell‐based scheme, which is also used by state‐of‐the‐art rule‐based systems
  • Only rules that occurred at least 50, 100, and 5000 times were used to maintain robustness, leading to 17,370, 8,720 and 137 rules, respectively. Rules were assigned with RDKit.
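
The frequency cut‐off on extracted rules can be illustrated with a simple counter. The rule names and the threshold below are invented for the example, and dropping the reactions whose rule was filtered out is an assumption about the pipeline rather than something these notes confirm.

```python
from collections import Counter


def frequent_rules(rule_per_reaction, min_count):
    """Keep only rules that were extracted from at least min_count reactions."""
    counts = Counter(rule_per_reaction)
    return {rule for rule, count in counts.items() if count >= min_count}


# One extracted rule label per reaction (hypothetical labels).
extracted = ["amide_coupling"] * 5 + ["suzuki"] * 3 + ["rare_rearrangement"]

kept = frequent_rules(extracted, min_count=3)
# Assumption: reactions whose rule was dropped no longer have a label
# the classifier could predict, so they are removed from the dataset.
labelled = [rule for rule in extracted if rule in kept]
```

A higher threshold shrinks the rule set (137 rules at 5000 occurrences versus 17,370 at 50), trading coverage for more training examples per class.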

6.3 Molecular descriptors

  • we generate counted Extended‐Connectivity Fingerprints (ECFP4) with CDK 1.5.13.
  • we take the sum of these vectors to obtain a single vector x, which serves as our order‐invariant descriptor.
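
Summing the per‐molecule count vectors is what makes the descriptor independent of reactant order. The sketch below uses tiny hand‐written vectors in place of real ECFP4 counts (which the paper computes with CDK); the function name and the numbers are purely illustrative.

```python
def sum_fingerprints(fingerprints):
    """Element-wise sum of equally long count vectors."""
    length = len(fingerprints[0])
    return [sum(fp[i] for fp in fingerprints) for i in range(length)]


# Stand-ins for the counted fingerprints of two reactants (assumed values).
fp_a = [2, 0, 1, 0]
fp_b = [0, 3, 1, 0]

x_ab = sum_fingerprints([fp_a, fp_b])
x_ba = sum_fingerprints([fp_b, fp_a])  # same reactants, swapped order
```

Because addition is commutative, `x_ab` and `x_ba` are identical, so the network sees the same input no matter how the reactants were listed in the database.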

6.4 Neural networks

  • We define our problem as a multiclass classification.
  • As our classifier, we evaluated different neural network architectures with one or more fully connected hidden layer(s), and Highway Networks.
  • As the non‐linearity, we apply the exponential linear unit developed by Hochreiter and co‐workers.
  • The last layer of the neural network is a softmax, which gives the probability distribution over the reaction rules.
  • Keras was used as the machine‐learning framework.
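
To make the building blocks above concrete, here is a framework‐free sketch of an ELU hidden layer, a highway layer, and a softmax output. It is not the authors' Keras model: the layer sizes, weights, and function names are invented, and the highway layer follows the standard gated form y = T(x)·H(x) + (1 − T(x))·x.

```python
import math


def elu(z, alpha=1.0):
    """Exponential linear unit: identity for z > 0, alpha*(exp(z)-1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn logits into a probability distribution (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases):
    """Fully connected layer without activation: one weight row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Gate t mixes the transformed signal h with the unchanged input x."""
    h = [elu(z) for z in dense(x, w_h, b_h)]
    t = [sigmoid(z) for z in dense(x, w_t, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def predict_rule_probabilities(x, w_hidden, b_hidden, w_out, b_out):
    """One ELU hidden layer followed by a softmax over the reaction rules."""
    hidden = [elu(z) for z in dense(x, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))


# Toy descriptor with 4 features, 2 hidden units, 3 hypothetical rules.
x = [2.0, 3.0, 2.0, 0.0]
w_hidden = [[0.1, -0.2, 0.0, 0.3], [-0.3, 0.1, 0.2, 0.0]]
b_hidden = [0.0, 0.1]
w_out = [[0.5, -0.5], [-0.2, 0.4], [0.1, 0.1]]
b_out = [0.0, 0.0, 0.0]
probs = predict_rule_probabilities(x, w_hidden, b_hidden, w_out, b_out)
```

Sorting `probs` in descending order gives the rule ranking that top‐n accuracy and MRR are computed from.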