Chapter 15: Artificial Intelligence in Compound Design

Reading notes on "Artificial Intelligence in Drug Design"


1.Introduction

  • De novo design is not restricted by availability of building blocks or a given set of chemical reactions and thus allows a potentially broader sampling of drug-like chemical space. However, synthesis of de novo designed molecules has often proven to be challenging.

2.Materials


3.Methods

3.1.Generation of Chemical Structures

  • For generative networks, SMILES strings have emerged as one of the simplest yet most productive ways of representing and generating molecules.
  • Methods using fragment representations have been in use for a long time, while graph-based methods have emerged more recently.
  • The first methods using 3D-based molecular representations are now becoming available as well.
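
Before a SMILES string can be fed to a generative network, it has to be split into tokens and mapped to integer indices. The following is a minimal sketch of that step, not the tokenizer of any particular tool: the regular expression is a simplified assumption that covers bracket atoms, the two-letter halogens, and common bond and branch symbols, and the names `tokenize`, `build_vocab`, `encode`, and `decode` are hypothetical.

```python
import re

# Simplified token pattern (an assumption, not a full SMILES grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, then
# single letters, ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+\\/.:~@*$]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Map every token seen in the training set to an integer index."""
    tokens = sorted({t for s in smiles_list for t in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}


def encode(smiles, vocab):
    """Turn a SMILES string into the index sequence a network would consume."""
    return [vocab[t] for t in tokenize(smiles)]


def decode(indices, vocab):
    """Turn an index sequence (e.g. sampled from a network) back into SMILES."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in indices)


training_set = ["CC(=O)Oc1ccccc1C(=O)O", "ClCCBr", "c1cc[nH]c1"]
vocab = build_vocab(training_set)
print(tokenize("c1cc[nH]c1"))
```

Treating `Cl`, `Br`, and bracket atoms such as `[nH]` as single tokens keeps the network from having to learn that these characters belong together.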

3.1.1.SMILES-Based Approaches

  • The training set should represent the chemical space of interest with sufficient examples. Typically, several hundred thousand molecules from databases of drug-like molecules are used to train a robust molecule generator. Before training, the sets are typically filtered to remove unwanted chemical fragments and unfavorable physicochemical property ranges, which focuses the generator on the “drug-like” structural space.
  • For successful structure-based virtual screening, enumerating possible tautomers and protomers is strongly suggested as part of the workflow, so that the state with optimal interaction possibilities is included. For SMILES, 2D-fingerprint, and 2D-descriptor based approaches, including similarity searching, machine learning, and training database generation, a consistent treatment based on, for example, neutral molecular states and canonical tautomers might be more useful.
  • During the optimization of lead structures for a particular activity, some parts of the molecules are often kept constant if they contain essential pharmacophoric elements. Other methods therefore define a central scaffold (for example, the central brown box in Fig. 1) and maintain this scaffold during the optimization.

3.1.2.Autoencoder


  • A very promising method is the Junction Tree Variational Autoencoder (JT-VAE). To represent a molecule, the method first decomposes it into a molecular graph and a junction tree.

3.1.3.Scaffold/Fragment-Based Approaches

  • With the rise of deep neural networks and growing computational resources, fragment- and scaffold-based methods have seen a resurgence. In this context, Stahl et al. developed a framework called DeepFMPO.
  • Different rewarding schemes for AI-based design methods will be presented in Subheading 3.3.

3.1.4.Graph-Based Approaches

  • One of the first approaches using graphs for generative models was introduced by Li et al…
  • Another early approach is based on a variational autoencoder combined with a graph representation of the molecules.
  • In addition, Jensen used graph-based methods with genetic algorithms and Monte-Carlo tree search.
  • Very recently, Mercado et al. introduced GraphINVENT, a graph-based design approach for generating optimized molecules.
  • An advantage of graph-based methods is that chemical information is encoded in a more mathematical way, which can be beneficial for neural networks. On the other hand, the representation of a molecule is more abstract, which makes understanding and interpreting the results harder.

3.1.5. 3D-Based Approaches

  • An implementation of a 3D-based approach was introduced by Fabritiis et al.
  • Another approach is to use generative methods based on 2D-molecule representations, while the scoring and selection process for interesting molecules is based on 3D-methods like 3D-shape overlays.

3.2.Training Generative Networks

  • In general, approaches to generate optimized compounds can be divided into three groups.
    • Filter approach: A molecule generator samples molecules from the large chemical space, while a subsequent reward function acts as a filter and selects the most promising candidates.
    • Biasing/Reward approach: A generator is coupled to an optimization/reward routine that biases the generator toward the desired molecular properties. The bias is still unidirectional, which means that the reward function is not influenced by the generator.
    • Combined generation and reward approach: Here, generation and optimization/reward steps are directly combined, that is, there is bidirectional communication between both methods.

3.2.1.Transfer Learning

  • Early applications of SMILES-based approaches used transfer learning to bias the neural networks, an approach that belongs to the first group of methods (“Filter approach”).
  • One main limitation of transfer learning is the inherent requirement of an appropriate set of molecules to learn from.
  • Successful applications for the design of novel molecules have been demonstrated.

3.2.2.Reinforcement Learning (RL)

  • One of the main challenges is the definition of an appropriate, balanced scoring scheme. With an unbalanced scoring, the optimization can be trapped in local minima or driven into undesirable regions of chemical space.
  • First successful applications of reinforcement learning for design were reported by Zhavoronkov et al.

3.2.3.Genetic Algorithms and Global Optimization

  • Key to a successful application of a global optimization approach is a proper way of building and generating new molecules.
  • In a recent study, Jensen explored a genetic algorithm with a graph-based representation of molecules, as well as Monte-Carlo tree search as an optimization method. Both methods perform well in the GuacaMol benchmark.

3.2.4.Generative Adversarial Networks

  • The group of Aspuru-Guzik proposed a GAN framework called ORGAN and applied it to chemistry and material design (called ORGANIC).
  • Another method was proposed by Maziarka et al., who named it Mol-CycleGAN because it is based on the CycleGAN algorithm.

3.2.5.Autoencoders with Optimization Methods

  • Winter et al. further proposed a method that applies particle swarm optimization (PSO) in the latent space to generate compounds with desired properties.
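
To make the idea of particle swarm optimization over a continuous latent space more concrete, here is a minimal sketch of the standard PSO update in plain Python. It is not the implementation of Winter et al.: the three-dimensional "latent space", the `TARGET` point, and `toy_objective` are hypothetical stand-ins for a real autoencoder latent space and a property model, and the hyperparameters `w`, `c1`, and `c2` are common textbook defaults chosen as assumptions.

```python
import random


def pso_minimize(objective, dim, n_particles=20, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, bound=2.0, seed=0):
    """Minimize `objective` over a dim-dimensional continuous space."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # best point per particle
    pbest_score = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_score.__getitem__)
    gbest, gbest_score = pbest[g][:], pbest_score[g]  # best point of the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward own best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            score = objective(pos[i])
            if score < pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score < gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


# Hypothetical stand-in for "distance from the desired property profile":
# squared distance of a latent vector z to an arbitrary target point.
TARGET = [0.5, -1.0, 0.25]


def toy_objective(z):
    return sum((a - b) ** 2 for a, b in zip(z, TARGET))


best_z, best_score = pso_minimize(toy_objective, dim=3)
print(best_z, best_score)
```

In an actual workflow, the objective would score a latent vector with property models, and the best latent points would be decoded back into molecules by the autoencoder.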

3.3.Reward Functions for AI-Based De Novo Design

  • A useful reward function should rapidly and reliably identify virtual molecules showing favorable chemical motifs. Other approaches, which require significantly more computational power, can only be applied at the end of such a workflow to further rank promising proposals.
  • We view scoring as a two-step process: (a) including the most relevant and rapid functions directly in the generative design step and (b) using other scoring functions in a multidimensional fashion to select interesting molecules from the larger ensemble of virtual proposals produced by the AI engine.
  • Typical scoring functions capture molecular features based on similarity to previous lead molecules or statistical models summarizing previous project knowledge. Computational approaches for deriving these functions can be broadly grouped into the following categories:
    • Physicochemical properties filters
    • Ligand-based 2D similarity approaches
    • Ligand-based 3D similarity approaches
    • Machine-learning models from structure-activity data
    • Structure-based 3D approaches

3.3.1.Physicochemical and ADME Properties

  • Typical property filters for scoring use descriptors related to lipophilicity (e.g., logP, logD), surface area fractions (e.g., PSA), key functional group counts (e.g., number of donors or acceptors), and measures of size and flexibility (molecular weight, number of rotatable bonds), as recently reviewed by Mignani et al.
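
A property filter of this kind can be sketched in a few lines. The example below assumes that the descriptors have already been computed elsewhere (for instance with a cheminformatics toolkit) and are available as plain dictionaries. The upper limits follow the widely used Lipinski- and Veber-style cutoffs and are only illustrative; the two candidate entries and their values are made up for the example.

```python
# Illustrative upper limits (Lipinski- and Veber-style cutoffs);
# a real project would tune these to its own target profile.
PROPERTY_LIMITS = {
    "mol_weight": 500.0,       # Da
    "logp": 5.0,
    "h_donors": 5,
    "h_acceptors": 10,
    "rotatable_bonds": 10,
    "psa": 140.0,              # polar surface area, square angstroms
}


def violations(props, limits=PROPERTY_LIMITS):
    """Names of all properties that exceed their upper limit."""
    return [name for name, upper in limits.items() if props[name] > upper]


def property_score(props, limits=PROPERTY_LIMITS):
    """Fraction of properties inside the allowed range, between 0 and 1."""
    return 1.0 - len(violations(props, limits)) / len(limits)


# Hypothetical, precomputed descriptor values for two virtual molecules.
candidates = {
    "mol_a": {"mol_weight": 342.4, "logp": 2.8, "h_donors": 2,
              "h_acceptors": 5, "rotatable_bonds": 4, "psa": 78.0},
    "mol_b": {"mol_weight": 612.7, "logp": 6.1, "h_donors": 3,
              "h_acceptors": 9, "rotatable_bonds": 12, "psa": 95.0},
}

passed = [name for name, props in candidates.items() if not violations(props)]
print(passed, property_score(candidates["mol_b"]))
```

A hard pass/fail filter (`violations`) suits the final selection step, whereas a graded value such as `property_score` is easier to use as one term of a reward function.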

3.3.2. 2D-Similarity Approaches

  • 2D descriptors with a stronger scaffold hopping potential such as topological pharmacophores (e.g., CATS) provide opportunities to drive molecule generation toward chemically different scaffolds with a similar pharmacophoric pattern.

3.3.3. 3D-Similarity Approaches

  • Nonobvious analogies between compound pairs beyond 2D chemical similarity can be obtained by 3D shape based technologies (e.g., ROCS).

3.3.4. Machine Learning Models for Scoring

  • Experimental data from harmonized assay conditions at one laboratory or even at different research sites are essential to provide balanced training sets, while minimizing any associated experimental error.

3.3.5.Validation of Machine-Learning Models

  • Therefore, the domain of applicability for a model should be investigated for each new molecule to be predicted.
  • Valid confidence intervals for predictions as error estimations were provided by Bender et al. using the “Deep Confidence” framework.
  • Many of our internal machine learning models apply a 2D fingerprint-based similarity threshold as applicability domain estimation.
  • Moreover, the direct use of 2D-similarity to query molecules in combination with more advanced machine-learning models might intrinsically also limit the generation of novel virtual molecules to a region in chemical space relevant for lead optimization.
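
A similarity-threshold applicability domain can be sketched as follows. This is an illustration, not the authors' internal implementation: fingerprints are represented as plain sets of "on" bit indices, the example bit sets are hypothetical, and the threshold of 0.4 is an arbitrary assumption that would need to be calibrated per model.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def in_domain(query_fp, training_fps, threshold=0.4):
    """Return (inside_domain, similarity to the nearest training molecule)."""
    nearest = max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)
    return nearest >= threshold, nearest


# Hypothetical fingerprints of the molecules a model was trained on.
training_fps = [{1, 2, 3, 5}, {9, 11, 12}]

print(in_domain({1, 2, 3, 4}, training_fps))   # close to the first training molecule
print(in_domain({20, 21}, training_fps))       # shares no bits with the training set
```

A prediction for a query molecule that falls below the threshold would be flagged as unreliable instead of being passed on to the reward function.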

3.3.6.Multidimensional Compound Optimization

  • A meaningful combination of different property prediction models captures a more complex property profile to guide the generation of new molecules into the desired property space. These concepts, combined with medicinal chemistry inspired structure transformations, were shown to produce novel prioritized proposals following such a multidimensional optimization scheme.
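
Two simple ways of combining several per-property scores into one value are a weighted sum and a product. The sketch below compares them under the assumption that each individual score has already been scaled to the range 0 to 1; the score names and values are hypothetical, and normalizing the sum by the total weight is a choice made here for readability.

```python
import math


def additive_score(scores, weights=None):
    """Weighted average of the individual scores (each assumed in [0, 1])."""
    weights = weights or {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total_weight


def multiplicative_score(scores):
    """Product of the individual scores; a single zero vetoes the molecule."""
    return math.prod(scores.values())


# Hypothetical scores for one proposal: good on three terms, failing one.
scores = {"properties": 0.9, "similarity_2d": 0.8, "shape_3d": 0.7, "qsar": 0.0}

print(additive_score(scores))
print(multiplicative_score(scores))
```

The additive form lets strong terms compensate for a failed one, while the multiplicative form only rewards molecules that do reasonably well on every term. Which behavior is preferable depends on whether any single criterion is a hard requirement.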

3.3.7.Reward Functions in Latent Chemical Space

  • Autoencoder based design approaches work in a latent chemical space. In this setting, reward functions can be implemented into the latent space directly to predict properties for latent continuous vector representations.

3.3.8. 3D-Docking Based Reward Functions

  • The major drawback of docking is the time required for scoring a new proposal to give feedback to the design agent.
  • In particular, docking combined with machine learning has the potential to speed up this process for use in AI-based design.

3.4.Chemical Feasibility

  • One class of approaches compares molecular fragments of the molecule in question with fragments from synthesized molecules. A popular approach in this category is the synthetic accessibility score called SAScore.
  • The synthetic Bayesian accessibility, SYBA, trains a Bayes classifier from easy to synthesize compounds, collected from the ZINC database, and hard to synthesize molecules, which are artificially generated.
  • The SCScore builds on the complexity of the synthetic transformation. A deep feed-forward neural network was trained on 22 million pairs of reactants and products based on the idea that, on average, a chemical reaction typically increases the complexity of a molecule. Retrosynthetic analysis was taken a step further in the retrosynthetic accessibility score, the RAScore.

3.5.Integration of Scoring Functions in Internal Workflows

  • Reinforcement learning provides a robust framework for AI-based de novo design, in which a molecule generator is guided by different reward functions. As reward functions, we have integrated the following scoring functions to guide reinforcement learning:
    • Physicochemical property scoring using corresponding descriptors in RDKit.
    • 2D similarity-based scoring using Tanimoto coefficients to the starting molecule based on the ECFP6 or other fingerprints from RDKit.
    • 3D shape similarity scoring against the starting molecule using ROCS.
    • 2D-QSAR-models trained on target activity data using Cubist regression trees or graph convolutional neural networks using the DeepChem library.
    • A combined score based on addition of individual scoring terms listed above for properties, 2D fingerprints, 3D similarity, and 2D-QSAR.
    • A combined score based on multiplication of individual scoring terms.
  • Additionally, we tailor the search of the chemical space by adding 2D or 3D similarity calculations into the reward function.
  • Transfer learning provides an additional opportunity to bias the exploration of analogues around the actual chemical series.
  • Sufficiently large training and test sets should be selected either randomly or by statistical design. We prefer the latter approach, using the RDKit implementation of the MaxMin algorithm as an efficient way to pick diverse subsets from larger sets.
  • An important step is to use novel data during project progression for model validation, a method known as time-split validation.
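
The diverse-subset picking mentioned above (RDKit exposes it as `MaxMinPicker`) can be illustrated in plain Python. This is a simplified re-implementation for explanation only, not the RDKit code: fingerprints are hypothetical sets of on-bits, the distance is one minus the Tanimoto coefficient, and the first pick is fixed, whereas a real picker would usually choose it at random.

```python
def tanimoto_distance(fp_a, fp_b):
    """1 - Tanimoto coefficient for fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return 1.0 - (len(fp_a & fp_b) / union if union else 0.0)


def maxmin_pick(fps, n_pick, first=0):
    """Greedily pick indices so each new pick is as far as possible
    from the molecules that were already picked."""
    picked = [first]
    # Distance of every molecule to its nearest already-picked molecule.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
    return picked


# Hypothetical fingerprints: three similar molecules and one outlier.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_pick(fps, 3))
```

The picked indices could form a structurally diverse test set, with the remaining molecules used for training.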

3.6.Compound Selection and Filtering

  • There are, however, additional aspects to consider, when analyzing AI-based de novo design results.
    • First, the synthetic accessibility of design proposals is of critical importance for their realization in the chemical laboratory. Simple estimates include the number of chiral centers, complex ring systems, and the availability of motifs in internal building block repositories or vendor catalogues. They are routinely added to focus the selected set of compounds on synthetically desired regions.
    • Next, the quest for novelty is addressed by 2D similarity searches in a project or corporate database and in a database of patented or reported compounds in the scientific literature for a particular project.

3.7.A Practical Example: AI-Design at Work


4.Summary

  • Chemists need to consider the outcome from such a statistical model to capture essential SAR knowledge for the next round of design in a project.

  • To this end, a group at GSK evaluated chemical structures from three AI-design algorithms. Significant differences were observed between these molecule generators in three different performance tests, highlighting the necessity to select the appropriate design engine for the problem at hand. These tests include:

    • the ability to reproduce ideas from an experienced medicinal chemistry team;
    • the ranking of novel AI-generated ideas by medicinal chemists;
    • the generation of relevant structures in legacy discovery projects starting from a single seed molecule.