Evaluation metrics for generated SMILES and their corresponding molecule depictions (GuacaMol)

GuacaMol: Benchmarking Models for de Novo Molecular Design【2019】


Paper: https://pubs.acs.org/doi/10.1021/acs.jcim.8b00839

Code: https://github.com/BenevolentAI/guacamol

Dataset download: GuacaMol

Baseline code: https://github.com/benevolentAI/guacamol_baselines


Introduction to the GuacaMol dataset

Dataset for the GuacaMol benchmark suite for generative chemistry. This is a subset of molecules extracted from ChEMBL 24. For more information see https://www.ebi.ac.uk/chembl/ 

To generate the final dataset for the benchmarks, ChEMBL is post-processed by:

  • 1. removal of salts.
  • 2. charge neutralization.
  • 3. removal of molecules with SMILES strings longer than 100 characters.
  • 4. removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
  • 5. removal of molecules with an ECFP4 similarity larger than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set. (A rough RDKit sketch of these filters is given below.)
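As a rough illustration only (not the authors' actual post-processing code), the filters above could be approximated with RDKit along the following lines; the omission of the charge-neutralization step and the simple threshold handling are assumptions of this sketch:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.Chem.SaltRemover import SaltRemover

ALLOWED = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}
remover = SaltRemover()  # strips common counter-ions / salts

def keep_molecule(smiles, holdout_fps, sim_threshold=0.323, max_len=100):
    """Return the cleaned canonical SMILES if the molecule passes all filters, else None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = remover.StripMol(mol)                       # 1. salt removal
    # 2. charge neutralization is omitted in this sketch
    canonical = Chem.MolToSmiles(mol)
    if len(canonical) > max_len:                      # 3. SMILES length filter
        return None
    if any(a.GetSymbol() not in ALLOWED for a in mol.GetAtoms()):  # 4. element filter
        return None
    fp = AllChem.GetMorganFingerprint(mol, 2)         # ECFP4-like Morgan fingerprint (radius 2)
    for holdout_fp in holdout_fps:                    # 5. similarity filter vs. the 10 holdout drugs
        if DataStructs.TanimotoSimilarity(fp, holdout_fp) > sim_threshold:
            return None
    return canonical
```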

1. Abstract

De novo design seeks to generate molecules with required property profiles by virtual design-make-test cycles.

With the emergence of deep learning and neural generative models in many application areas, models for molecular design based on neural networks appeared recently and show promising results. However, the new models have not been profiled on consistent tasks, and comparative studies to well established algorithms have only seldom been performed. To standardize the assessment of both classical and neural models for de novo molecular design, we propose an evaluation framework, GuacaMol, based on a suite of standardized benchmarks. The benchmark tasks encompass measuring the fidelity of the models to reproduce the property distribution of the training sets, the ability to generate novel molecules, the exploration and exploitation of chemical space, and a variety of single and multiobjective optimization tasks. 

2. Techniques for assessing de novo design

To profile models for de novo molecular design, we differentiate between their two main use cases:

  1. Given a training set of molecules, a model generates new molecules following the same chemical distribution.【Distribution-Learning Benchmarks】
  2. A model generates the best possible molecules to satisfy a predefined goal.【Goal-Directed Benchmarks】

The collection of benchmarks we propose below assesses both facets defined here. In the following we will refer to these two categories as distribution-learning benchmarks and goal-directed benchmarks, respectively.

3.1) Distribution-Learning Benchmarks

The distribution-learning benchmarks assess how well a model learns to generate molecules similar to the training set; in this work, the training set is ChEMBL.

The distribution-learning benchmarks assess how well models learn to generate molecules similar to a training set, which in this work is a standardized subset of the ChEMBL database (see Appendix 8.1). We consider five benchmarks for distribution learning:

3.1.1. Validity.

Validity checks whether the SMILES strings of the generated molecules are chemically valid.

The validity benchmark assesses whether the generated molecules are actually valid, that is, whether they correspond to a (at least theoretically) realistic molecule. For example, molecules with an incorrect SMILES syntax, or with invalid valence, are penalized.

3.1.2. Uniqueness.

Does the model generate the same molecule over and over again?

Given the high dimensionality of chemical space and the huge number of molecules potentially relevant in medicine, generative models should be able to generate a vast number of different molecules. The uniqueness benchmark assesses whether models are able to generate unique molecules--that is, if a model generates the same molecule multiple times, it will be penalized.

3.1.3. Novelty.

Does the model generate molecules that are already present in the training set?

Since the ChEMBL training set represents only a tiny subset of drug-like chemical space, a model for de novo molecular design with a good coverage of chemical space will rarely generate molecules present in the training set. The novelty benchmark penalizes models when they generate molecules already present in the training set. Therefore, models overfitting the training set will obtain low scores on this task.

3.1.4. Fréchet ChemNet Distance (FCD).

Source: Fréchet ChemNet Distance: A metric for generative models for molecules in drug discovery (https://arxiv.org/abs/1803.09518)

Assesses the similarity between the generated molecules and the molecules in the training set; the more similar the two sets are, the lower the FCD score.

Preuer et al. introduced the Fréchet ChemNet Distance (FCD) as a measure of how close distributions of generated data are to the distribution of molecules in the training set. In the context of drug discovery, the usefulness of this metric stems from its ability to incorporate important chemical and biological features. The FCD is determined from the hidden representation of molecules in a neural network called ChemNet trained for predicting biological activities, similarly to the Fréchet Inception Distance (FID) sometimes applied to generative models for images. More precisely, the means and covariances of the activations of the penultimate layer of ChemNet are calculated for the reference set and for the set of generated molecules. The FCD is then calculated as the Fréchet distance for both pairs of values. Similar molecule distributions are characterized by low FCD values.
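Concretely, for mean vectors μ1, μ2 and covariance matrices Σ1, Σ2 of the two activation sets, the squared Fréchet distance is ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). A minimal numpy/scipy sketch of this step (extracting the ChemNet activations is assumed to have been done separately; in practice the FCD package referenced in Appendix 7.2 handles the full computation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(act_ref, act_gen):
    """Fréchet distance between Gaussians fitted to two activation matrices (n_samples x n_features)."""
    mu1, mu2 = act_ref.mean(axis=0), act_gen.mean(axis=0)
    sigma1 = np.cov(act_ref, rowvar=False)
    sigma2 = np.cov(act_gen, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root of the product
    covmean = covmean.real                                  # drop small imaginary parts from numerics
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```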

3.1.5. KL Divergence.

Measures how well a distribution Q approximates another distribution P.

The Kullback−Leibler (KL) divergence measures how well a probability distribution Q approximates another distribution P. For the KL divergence benchmark, the probability distributions of a variety of physicochemical descriptors for the training set and a set of generated molecules are compared, and the corresponding KL divergences are calculated. Models able to capture the distributions of molecules in the training set will lead to small KL divergence values.

3.2) Goal-Directed Benchmarks

3.2.1. Similarity.

Similarity is one of the core concepts of chemoinformatics.

3.2.2. Rediscovery.

Rediscovery benchmarks are closely related to the similarity benchmarks described above.

3.2.3. Isomers.

For the isomer benchmarks, the task is to generate molecules that correspond to a target molecular formula (e.g., C_{7}H_{8}N_{2}O_{2}).
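A deliberately simplified sketch of such a scoring function, using RDKit's CalcMolFormula and a hard formula match; the actual GuacaMol isomer scorers compare element counts with smoother modifier functions, so this binary version is only an illustrative approximation:

```python
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def isomer_score(smiles, target_formula="C7H8N2O2"):
    """Crude isomer score: 1.0 if the molecular formula matches the target exactly, else 0.0."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    return 1.0 if CalcMolFormula(mol) == target_formula else 0.0
```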

3.2.4. Median Molecules.

In the median molecules benchmarks, the similarity to several molecules must be maximized simultaneously; it is designed to be a conflicting task.
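A hedged sketch of what such a conflicting objective can look like: score a candidate by the geometric mean of its Tanimoto similarities to two target molecules, so that only compounds resembling both targets score highly (the fingerprint choice and the geometric-mean aggregation are illustrative assumptions, not the exact GuacaMol definition):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def median_molecule_score(smiles, target_smiles_1, target_smiles_2):
    """Geometric mean of Tanimoto similarities to two targets; high only if similar to both."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    fp = AllChem.GetMorganFingerprint(mol, 2)
    sims = []
    for target in (target_smiles_1, target_smiles_2):
        target_fp = AllChem.GetMorganFingerprint(Chem.MolFromSmiles(target), 2)
        sims.append(DataStructs.TanimotoSimilarity(fp, target_fp))
    return (sims[0] * sims[1]) ** 0.5
```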

4. METHODS

Our framework for benchmarking models for de novo design, GuacaMol, is available as an open-source Python package and can be downloaded from the Internet at: www.github.com/BenevolentAI/guacamol. GuacaMol provides user-friendly interfaces that allow researchers to couple their models to the benchmarking framework with minimal effort.
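For instance, plugging a model into the distribution-learning benchmarks roughly follows the pattern below; this is a sketch assuming the DistributionMatchingGenerator interface and the assess_distribution_learning entry point described in the GuacaMol README, and argument names may differ slightly between package versions:

```python
from typing import List

from guacamol.distribution_matching_generator import DistributionMatchingGenerator
from guacamol.assess_distribution_learning import assess_distribution_learning


class MyGenerator(DistributionMatchingGenerator):
    """Wrap your own trained model behind GuacaMol's sampling interface."""

    def generate(self, number_samples: int) -> List[str]:
        # Replace this placeholder with real sampling from your model.
        return ["CCO"] * number_samples


assess_distribution_learning(
    MyGenerator(),
    chembl_training_file="chembl_training.smiles",        # path to the GuacaMol training SMILES file
    json_output_file="distribution_learning_results.json",
)
```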

We selected five distribution-learning and 20 goal-directed benchmarks. The exhaustive specifications and implementation details of the distribution-learning and goal-directed benchmarks are given in Appendices 7.2 and 7.3, respectively.

For all chemoinformatics-related operations, such as handling or canonicalization of SMILES strings, physicochemical property (logP, TPSA), or similarity calculations, GuacaMol relies on the RDKit package.

Several of the molecule generation approaches we studied require a training data set. ChEMBL  was used for this purpose. The generation of the data set is discussed and detailed in Appendix 7.1.

In addition to introducing the benchmark suite, the paper also evaluates it on several baselines, with the goal of fairly comparing methods that report state-of-the-art results.
In addition to the introduction of a benchmark suite for generative chemistry, this paper provides its evaluation over several baselines. The goal is to present a fair comparison of some recent methods that seem to achieve state-of-the-art results in the field. These have been selected to represent a variety of methods, ranging from deep learning generative models to established genetic algorithms and more recent Monte Carlo Tree Search (MCTS). Internal molecule representation was also taken into account, so that methods using SMILES strings were represented alongside others using a graph representation of atoms and bonds. For completeness and to provide a lower bound on the benchmark scores, two dummy baselines, performing random sampling (distribution-learning benchmarks) and best-of-dataset picking (goal-directed benchmarks), are also provided. All the baseline models assessed in this work are described in Appendix 7.4. Our implementation of those models is available on GitHub (https://github.com/BenevolentAI/guacamol_baselines).

5. RESULTS AND DISCUSSION

5.1. Distribution-Learning Benchmarks.

Table 1 lists the results for the distribution-learning benchmarks.

  1. The random sampler model is a useful baseline for comparison, because it shows what scores can be expected for good models on the KL divergence and Fréchet benchmarks. Unsurprisingly, its novelty score is zero, since all the molecules it generates are present in the data set. (The random sampler simply draws molecules from the data set; it only serves as a baseline for comparison.)
  2. The SMILES LSTM model sometimes produces invalid molecules. However, they are diverse, as attested by the uniqueness and novelty benchmarks, and closely resemble molecules from ChEMBL.
  3. The Graph MCTS model is characterized by a high degree of valid, unique, and novel molecules. Nevertheless, it is not able to reproduce the property distributions of the underlying training set as precisely as the SMILES LSTM model.
  4. The AAE model obtains excellent uniqueness and novelty scores. The molecules this model produces are, however, not as close to the data set as the ones generated by the SMILES LSTM model.
  5. ORGAN obtains poor scores for all the distribution-learning tasks. More than half the molecules it generates are invalid, and they do not resemble the training set, as shown by the KL divergence and FCD scores. This might indicate mode collapse, which is a phenomenon often observed when training GANs.
  6. The scores of the VAE model are not the best in any category, but this model consistently produces relatively good scores for all the tasks.

Datasets: SMILES strings and their depictions for molecules generated by all six baseline models are available in the Supporting Information or on https://benevolent.ai/guacamol.

7.2. Implementation Details: Distribution-Learning Benchmarks

(1) Validity.

The fraction of 10,000 generated molecules whose SMILES can be successfully parsed by RDKit.

The validity score is calculated as the ratio of valid molecules out of 10 000 generated molecules. Molecules are considered to be valid if their SMILES representation can be successfully parsed by RDKit.
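A minimal sketch of this computation, assuming a hypothetical model object with a sample() method that returns one SMILES string per call:

```python
from rdkit import Chem

def validity_score(model, n=10_000):
    """Fraction of n generated SMILES that RDKit can parse into a molecule."""
    generated = [model.sample() for _ in range(n)]   # model.sample() is a placeholder interface
    n_valid = sum(Chem.MolFromSmiles(s) is not None for s in generated)
    return n_valid / n
```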

(2) Uniqueness.

Number of distinct (non-duplicate) canonical SMILES among 10,000 valid generated molecules, divided by 10,000.

To calculate the uniqueness score, molecules are sampled from the generative model, until 10 000 valid molecules have been obtained. Duplicate molecules are identified by identical canonical SMILES strings. The score is obtained as the number of different canonical SMILES strings divided by 10 000.

(3) Novelty.

The fraction of generated molecules that are not present in the ChEMBL data set.

The novelty score is calculated by generating molecules, until 10 000 different canonical SMILES strings are obtained, and computing the ratio of molecules not present in the ChEMBL data set.
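Both the uniqueness and novelty scores reduce to set operations on canonical SMILES. A sketch under the same hypothetical model.sample() assumption as in the validity example above, with training_smiles standing in for the set of canonicalized ChEMBL training SMILES:

```python
from rdkit import Chem

def sample_canonical(model, n):
    """Sample until n valid molecules are obtained; return their canonical SMILES (duplicates kept)."""
    canonical = []
    while len(canonical) < n:
        mol = Chem.MolFromSmiles(model.sample())      # model.sample() is a placeholder interface
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))
    return canonical

def uniqueness_score(model, n=10_000):
    """Distinct canonical SMILES among n valid samples, divided by n."""
    return len(set(sample_canonical(model, n))) / n

def novelty_score(model, training_smiles, n=10_000):
    """Fraction of n distinct generated molecules not found in the training set."""
    unique = set()
    while len(unique) < n:                            # keep sampling until n distinct molecules
        unique.update(sample_canonical(model, n - len(unique)))
    return sum(s not in training_smiles for s in unique) / len(unique)
```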

(4) Fréchet ChemNet Distance (FCD).

Take a random subset of 10,000 molecules from the ChEMBL data set, sample generated molecules until 10,000 valid ones are obtained, and compute the FCD between the two sets.

To generate the FCD score, a random subset of 10 000 molecules is selected from the ChEMBL data set, and the generative model is sampled, until 10 000 valid molecules are obtained. The FCD between both sets of molecules is calculated with the FCD package available on GitHub (FCD Package: https://github.com/bioinf-jku/FCD), and the final score is given by S = exp(-0.2 · FCD).

(5) KL Divergence.

The following physicochemical descriptors are computed with RDKit for both the sampled and the reference sets:

For this task, the following physicochemical descriptors are calculated with RDKit for both the sampled and the reference sets: BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, and NumAromaticRings. Furthermore, we calculate the distribution of maximum nearest neighbor similarities on ECFP4 fingerprints for both sets. Then, the distribution of these descriptors is computed via kernel density estimation (using the scipy package) for continuous descriptors or as a histogram for discrete descriptors. The KL divergence D_{KL,i} is then computed for each descriptor i and is aggregated to a final score S via

S = (1/k) \sum_{i=1}^{k} exp(-D_{KL,i})

where k is the number of descriptors (in our case k = 9).
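A hedged sketch of the histogram branch for a single discrete descriptor, together with the exp(-D) aggregation; the KDE branch for continuous descriptors and the nearest-neighbour similarity term are omitted, and the descriptor choice and smoothing constant are assumptions of this sketch:

```python
import numpy as np
from scipy.stats import entropy           # entropy(p, q) computes D_KL(P || Q)
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_kl(reference_smiles, generated_smiles, descriptor=Descriptors.NumAromaticRings):
    """KL divergence between the histograms of one discrete RDKit descriptor (SMILES assumed valid)."""
    ref = [descriptor(Chem.MolFromSmiles(s)) for s in reference_smiles]
    gen = [descriptor(Chem.MolFromSmiles(s)) for s in generated_smiles]
    bins = np.arange(0, max(ref + gen) + 2)
    p, _ = np.histogram(ref, bins=bins, density=True)
    q, _ = np.histogram(gen, bins=bins, density=True)
    return entropy(p + 1e-10, q + 1e-10)   # small constant avoids division by zero

def kl_benchmark_score(divergences):
    """Aggregate per-descriptor KL divergences into the final score S = mean(exp(-D))."""
    return float(np.mean([np.exp(-d) for d in divergences]))
```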

### Commonly used evaluation metrics for molecular generative models

#### 1. Diversity and quality
The MOSES platform implements a number of popular molecular generative models and provides a set of metrics for measuring the quality and diversity of the generated molecules. These metrics aim to ensure that the generated molecules are not only chemically reasonable but also cover a broad region of chemical space.

#### 2. Limitations of FID and IS
Although the Fréchet Inception Distance (FID) and the Inception Score (IS) are widely used for evaluating generative models, both have limitations in certain settings: FID may miss subtle structural differences, and IS handles mode collapse poorly.

#### 3. Geometry Score
As an alternative, the Geometry Score has been proposed as a measure that reflects the characteristics of generated samples more precisely. It compares the geometric/topological structure of the real data set and the synthetic data to better characterize the distance between the two distributions.

#### 4. Domain-specific metrics
Beyond these general-purpose measures, drug discovery commonly relies on several task-specific criteria:

- **Validity**: all generated compounds should obey basic rules of organic chemistry;
- **Novelty**: the proportion of novel yet chemically reasonable entities;
- **Uniqueness**: a low duplication rate among the generated molecules;
- **Drug-likeness**: the proportion of molecules satisfying Lipinski's rules or other bioavailability filters.

```python
# Sketch: compute the standard MOSES metrics for a list of generated SMILES strings.
# `smiles_list` is assumed to be defined; moses.get_all_metrics is the documented entry point.
import moses

scores = moses.get_all_metrics(smiles_list)
print(scores)
```