Paper summary: "Efficient Molecular Sampling Based on 3D Protein Pockets"

Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets.

Pocket2Mol
ICML, 2022

Overview:

In recent years, deep generative models have achieved great success in designing novel drug molecules. A new line of work has shown that taking the structure of the protein pocket into account helps improve the specificity and success rate of in silico drug design. This setting poses a fundamental computational challenge when sampling new compounds, which must satisfy multiple geometric constraints imposed by the pocket. Previous sampling algorithms either sample in graph space or only consider the 3D coordinates of atoms, ignoring other detailed chemical structures such as bond types and functional groups. To address this challenge, the authors developed Pocket2Mol, an E(3)-equivariant generative network composed of two modules: 1) a new graph neural network that captures the spatial and bonding relationships between atoms in the binding pocket; 2) a new, efficient algorithm that samples new drug candidates from a tractable distribution conditioned on the pocket representation, without relying on MCMC. Experimental results show that molecules sampled from Pocket2Mol achieve significantly better binding affinity and other drug properties, such as drug-likeness and synthetic accessibility.

1、Generation Procedure

Pipeline: frontier atoms → focal atom (one of the frontier atoms) → position → element type and valence bonds → next frontier …
The protein pocket is represented as a set of atoms with coordinates $P^{(pro)} = \{(a^{(pro)}_i, r^{(pro)}_i)\}_{i=1}^{N}$, where $a^{(pro)}_i$ and $r^{(pro)}_i$ are the $i$-th heavy atom's identity and coordinate respectively, and $N$ is the number of atoms in the protein pocket. The molecules are sampled atom by atom. The already generated molecular fragment with $n$ atoms is denoted as a graph with coordinates $G^{(mol)}_n = \{(a^{(mol)}_i, r^{(mol)}_i, b^{(mol)}_i)\}_{i=1}^{n}$, where $a^{(mol)}_i$, $r^{(mol)}_i$ and $b^{(mol)}_i$ represent the $i$-th heavy atom (heavy atoms only, i.e., hydrogens are excluded, which reduces interference from the protein and water molecules), its coordinate and its valence bonds with other atoms, respectively. The model is denoted as $\varphi$ and the generation process is defined as follows:
(Equation figure omitted.)
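As an aside, the notation above maps onto a simple data layout; here is a minimal sketch with assumed field names (not the authors' code):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PocketAtom:
    """One heavy atom of the pocket: identity a_i^(pro) and coordinate r_i^(pro)."""
    element: str
    coord: Tuple[float, float, float]

@dataclass
class MolAtom:
    """One heavy atom of the fragment: a_i^(mol), r_i^(mol), and valence
    bonds b_i^(mol), stored as (neighbor_index, bond_type) pairs."""
    element: str
    coord: Tuple[float, float, float]
    bonds: List[Tuple[int, int]] = field(default_factory=list)
```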
For each atom, the generation procedure consists of four major steps (Fig. 1).
(1) First, the model's frontier predictor $f_{fro}$ predicts the frontier atoms of the current molecular fragment. Frontiers are defined as molecular atoms that can covalently connect to new atoms. If no atom is a frontier, the current molecule is complete and the generation process terminates.
(2) Second, the model samples an atom from the frontier set as the focal atom.
(3) Third, based on the focal atom, the model's position predictor $f_{pos}$ predicts the relative position of the new atom.
(4) In the end, the model's atom element predictor $f_{ele}$ and bond type predictor $f_{bond}$ predict the probabilities of the element types and of the bond types with existing atoms, and then sample an element type and valence bonds for the new atom.
In this way, the new atom is successfully added to the current molecular fragment, and the generation process continues until no frontier atom can be found. Note that the generation process is different for the first atom, since there is no molecular atom to be chosen as the frontier yet. For the first atom, all atoms of the protein pocket are used to predict the frontiers, and here frontiers are defined as protein atoms within 4 Å of which new atoms can be generated.
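To make the four steps concrete, here is a minimal Python sketch of the sampling loop. The predictor names follow the paper ($f_{fro}$, $f_{pos}$, $f_{ele}$, $f_{bond}$), but all data structures and signatures are assumptions for illustration, not the authors' actual API.

```python
import numpy as np

def sample_molecule(pocket, model, max_atoms=50):
    """Autoregressive sampling loop following the four steps above. `model`
    is assumed to expose f_fro, f_pos, f_ele and f_bond; every data
    structure and signature here is illustrative, not the authors' API."""
    fragment = []                     # list of (element, coordinate, bonds)
    while len(fragment) < max_atoms:
        # (1) Frontier prediction; for the first atom the candidates are the
        #     pocket atoms (a frontier = pocket atom a new atom fits within 4 A of).
        candidates = fragment if fragment else pocket
        p_fro = model.f_fro(candidates, pocket, fragment)
        frontiers = [i for i, p in enumerate(p_fro) if p > 0.5]
        if not frontiers:             # no frontier left: molecule is complete
            break
        # (2) Sample one focal atom from the frontier set.
        focal = int(np.random.choice(frontiers))
        # (3) Predict the new atom's position relative to the focal atom.
        new_pos = model.f_pos(focal, pocket, fragment)
        # (4) Sample an element type and valence bonds with existing atoms.
        element = model.f_ele(new_pos, pocket, fragment)
        bonds = model.f_bond(new_pos, element, pocket, fragment)
        fragment.append((element, new_pos, bonds))
    return fragment
```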

2、E(3)-Equivariant Neural Network

It has been shown that representing the vertices and edges of a 3D graph with both scalar and vector features helps boost the expressive power of the neural network. In our network, all vertices and edges in the protein pocket $P^{(pro)}$ and the molecular fragment $G^{(mol)}_n$ are associated with both scalar and vector features to better capture the 3D geometric information. In the rest of the text, we use an overhead dot and arrow to explicitly denote scalar and vector features (e.g., $\dot{x}$ and $\vec{x}$).
We adopt the geometric vector perceptrons (Jing et al., 2021) and the vector-based neural network (Deng et al., 2021) to achieve E(3)-equivariance. The geometric vector perceptron (GVP) extends standard dense layers and can propagate information between scalar features and vector features (Jing et al., 2021).
We modify the original GVP by adding a vector nonlinear activation (LeakyReLU) to the vector output of the GVP; the resulting block is denoted as $G_{per}$:
(Equation figure omitted.)
In addition, we define a geometric vector linear (GVL) block, denoted as $G_{lin}$, by dropping the nonlinear activations on both the scalar and vector outputs of the GVP. The modified GVP blocks $G_{per}$ and the GVL blocks $G_{lin}$ are the primary building blocks of our model and enable it to be E(3)-equivariant. Besides, including vector features is crucial for the model to directly and accurately predict atom positions based on the geometric environments provided by the protein pockets.
Note: $G_{per}$ and $G_{lin}$ act somewhat like an encoder and a decoder, helping extract and preserve the important information of the input features during feature propagation.
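For intuition, here is a minimal PyTorch sketch of a $G_{per}$-style block: a GVP (Jing et al., 2021) whose vector output additionally passes through a vector LeakyReLU in the spirit of Vector Neurons (Deng et al., 2021). The hidden sizes, the scalar activation, and the exact form of the vector nonlinearity are assumptions, not the authors' implementation; $G_{lin}$ would be the same block with both activations removed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPer(nn.Module):
    """Sketch of a modified GVP block (G_per) with a vector LeakyReLU."""

    def __init__(self, s_in, v_in, s_out, v_out, alpha=0.2):
        super().__init__()
        h = max(v_in, v_out)
        self.W_h = nn.Linear(v_in, h, bias=False)      # mixes vector channels
        self.W_v = nn.Linear(h, v_out, bias=False)
        self.W_d = nn.Linear(v_out, v_out, bias=False) # learned gate directions
        self.W_s = nn.Linear(s_in + h, s_out)          # scalars see vector norms
        self.alpha = alpha

    def forward(self, s, V):
        # s: (..., s_in) scalar features; V: (..., v_in, 3) vector features.
        V_h = self.W_h(V.transpose(-1, -2)).transpose(-1, -2)    # (..., h, 3)
        V_o = self.W_v(V_h.transpose(-1, -2)).transpose(-1, -2)  # (..., v_out, 3)
        # Scalar path: vector norms are rotation-invariant, so s_o is invariant.
        s_o = F.leaky_relu(self.W_s(torch.cat([s, V_h.norm(dim=-1)], dim=-1)))
        # Vector LeakyReLU: scale the component lying in the negative
        # half-space of a learned direction down to alpha times its value.
        # Only linear maps of V and inner products are used, so the output
        # stays rotation-equivariant.
        d = self.W_d(V_o.transpose(-1, -2)).transpose(-1, -2)
        d_hat = d / (d.norm(dim=-1, keepdim=True) + 1e-8)
        dot = (V_o * d_hat).sum(dim=-1, keepdim=True)
        V_o = V_o + (1 - self.alpha) * torch.clamp(-dot, min=0) * d_hat
        return s_o, V_o
```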

3、Encoder

We represent the protein pocket and the molecular fragment as a k-nearest-neighbor (KNN) graph in which vertices are atoms and each atom is connected to its k nearest neighbors.
Scalar features (edges): the scalar edge features include the distances encoded with Gaussian RBF kernels, the bond types, and a boolean variable indicating whether there is a valence bond on the edge.
Scalar features (atoms): (1) the input scalar features of protein atoms include the element types, the amino acids they belong to, and whether they are backbone or side-chain atoms; (2) the input scalar features of molecular atoms include the element types, the valence, and the numbers of different chemical bonds; (3) all atoms carry one more scalar feature indicating whether they belong to the protein or the molecular fragment.
Vector features: the input vector vertex features are the coordinates of the atoms, while the vector edge features are the unit directional vectors of the edges in 3D space.
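For illustration, the KNN graph and the vector edge features could be constructed as follows; the value of k and the array layout are assumptions:

```python
import numpy as np

def knn_graph(coords, k=32):
    """Build the edge list of a k-nearest-neighbor graph over atoms and the
    unit-direction vector edge features. coords: (N, 3). The value of k is
    an assumed hyperparameter, not necessarily the paper's."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-loops
    dst = np.argsort(d, axis=1)[:, :k].reshape(-1)
    src = np.repeat(np.arange(len(coords)), k)
    vec = coords[dst] - coords[src]              # edge displacement in 3D
    unit_dir = vec / (np.linalg.norm(vec, axis=-1, keepdims=True) + 1e-8)
    return (src, dst), unit_dir
```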
First, multiple embedding layers are applied to embed these features as $(\dot{v}^{(0)}_i, \vec{v}^{(0)}_i)$ for vertices and $(\dot{e}^{(0)}_{ij}, \vec{e}^{(0)}_{ij})$ for edges. Then $L$ message-passing modules $M_l\,(l = 1, \dots, L)$ and update modules $U_l\,(l = 1, \dots, L)$ are stacked in an interleaved way to learn the local structure representations:
(Equation figure omitted.)
v (vertex) + e (edge) → m (message, together with v) → v′. The encoder extracts hidden representations $(\dot{v}, \vec{v})$ of the chemical and geometric attributes of the protein pocket and the molecule. In essence, m is an aggregated feature of vertices and edges (m on its own would violate the equivariance constraint, so it is processed further), and it is then aggregated with the vertex features to produce each vertex's scalar and vector features.
(Equation figures omitted.)
The encoder extracts hidden representations $(\dot{v}^{(L)}_i, \vec{v}^{(L)}_i)$ capturing the chemical and geometric attributes of the protein pockets and molecular fragments. These representations are used by the frontier predictor, the position predictor, and the element-and-bond predictor (we omit the superscript $L$ in the following for simplicity).
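Here is a minimal sketch of one interleaved message-passing/update round over the KNN graph. `M` and `U` stand in for the paper's $M_l$ and $U_l$ modules (built from $G_{per}$/$G_{lin}$ blocks); the aggregation scheme and shapes are assumptions:

```python
import torch

def mp_round(s_v, V_v, s_e, V_e, edges, M, U):
    """One interleaved round: M builds a per-edge message m from the source
    vertex and edge features; messages are summed into each destination
    vertex, and U fuses them with the vertex's own features.
    Assumes M and U return features with the same sizes as the vertex ones."""
    src, dst = edges                  # LongTensors of shape (E,)
    m_s, m_V = M(torch.cat([s_v[src], s_e], dim=-1),
                 torch.cat([V_v[src], V_e], dim=-2))
    agg_s = torch.zeros_like(s_v).index_add_(0, dst, m_s)
    agg_V = torch.zeros_like(V_v).index_add_(0, dst, m_V)
    return U(torch.cat([s_v, agg_s], dim=-1),
             torch.cat([V_v, agg_V], dim=-2))
```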

4、Frontier Predictor

We define a geometric vector MLP (GV-MLP) as a GVP block followed by a GVL block, denoted as $G_{mlp}$. The frontier predictor takes the features of atom $i$ as input and uses one GV-MLP layer to predict the probability $p_{fro}$ of being a frontier, as follows:
(Equation figure omitted.)
GV-MLP: a GVP block (acting as an MLP) followed by a GVL block (acting as a linear layer).
sigmoid($\dot{p}'_{fro}$) — the probability of each existing atom being a frontier atom, matching the binary cross-entropy loss used in training.
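A tiny sketch of the frontier head; `gv_mlp` is a GV-MLP as defined above, assumed to output one scalar channel per atom:

```python
import torch

def frontier_probs(s_v, V_v, gv_mlp):
    """Per-atom frontier probability: GV-MLP -> one logit per atom ->
    sigmoid, matching the binary cross-entropy loss used in training."""
    logits, _ = gv_mlp(s_v, V_v)              # scalar output: (N, 1)
    return torch.sigmoid(logits.squeeze(-1))  # p_fro in [0, 1] per atom
```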

5、Position Predictor

The position predictor takes as input the features of the focal atom $i$ and predicts the relative position of the new atom. Since the vector features in our model are equivariant, they can be directly used to generate the relative coordinates $\Delta r_i$ with respect to the focal atom coordinate $r_i$. We build the output of the position predictor as a Gaussian mixture model with diagonal covariance:
$$p(\Delta r_i) = \sum_{k=1}^{K} \pi_{ik}\, \mathcal{N}\!\left(\Delta r_i \mid \mu_{ik}, \Sigma_{ik}\right)$$
in which parameters are predicted by multiple neural networks as follows:
(Equation figure omitted.)
where $\sigma_{sf}$ is the softmax function. After being processed by the GV-MLP, the means ($\dot{\mu}'_i$), covariances ($\dot{\Sigma}'_i$) and prior probabilities ($\sigma_{sf}(\dot{\pi}'_i)$, passed through the softmax so that they sum to 1, as probabilities must) of the Gaussian components are predicted by three separate GVL blocks. Since the vector features are equivariant, the vector outputs of the GVL blocks can directly represent the mean vectors and covariance vectors.
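Once the GMM parameters are predicted, sampling a position is straightforward; here is a sketch with assumed shapes ($K$ components, diagonal covariance stored as per-axis standard deviations):

```python
import torch

def sample_position(pi_logits, mu, sigma, r_focal):
    """pi_logits: (K,) mixture logits; mu, sigma: (K, 3); r_focal: (3,)."""
    pi = torch.softmax(pi_logits, dim=0)          # priors sum to 1
    k = int(torch.multinomial(pi, 1))             # choose one component
    delta_r = mu[k] + sigma[k] * torch.randn(3)   # diagonal Gaussian sample
    return r_focal + delta_r                      # absolute position of new atom
```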

6、Element-and-Bond Predictor

After predicting the position of the new atom $i$, the element-and-bond predictor predicts the element type of the new atom $i$ and the valence bonds between atom $i$ and all atoms $q$ ($\forall q \in \mathcal{V}^{(mol)}$) in the existing molecular fragment.
(1) Element Predictor
First, we collect the k-nearest-neighbor atoms $j \in \mathrm{KNN}(i)$ among all atoms. Then a message-passing module (Eq. 4, in the Encoder section) is used to integrate the local information from the neighboring atoms into the new atom $i$'s position, forming its representation $(\dot{v}_i, \vec{v}_i)$, from which the element type of atom $i$ is predicted.
Neighbor vertex features of atom $i$ $(\dot{v}_j, \vec{v}_j)$ + features of the edges connecting atom $i$ to its neighbors $(\dot{e}_{ij}, \vec{e}_{ij})$ → representation $(\dot{v}_i, \vec{v}_i)$; that is, the vertex and edge features within the KNN neighborhood of the position output by the position predictor, followed by a GV-MLP → element type (a combined code sketch of both predictors follows the bond-predictor note below).
(2) Bond Predictor
The representation of the edge between atoms $i$ and $q$, denoted as $(\dot{z}_{iq}, \vec{z}_{iq})$, is the concatenation of the features of atom $i$ $(\dot{v}_i, \vec{v}_i)$, the features of atom $q$ $(\dot{v}_q, \vec{v}_q)$, and the processed edge features $(\dot{e}_{iq}, \vec{e}_{iq})$, followed by a GV-MLP block, i.e.
(Equation figure omitted.)
where $(\dot{e}'_{iq}, \vec{e}'_{iq})$ are the input edge features processed by the edge embedding and one GV-MLP block.
The edge representations are fed into an attention module to predict the bond types (no bond is treated as a special bond type).
Note: the attention module (which introduces the molecule's geometric constraints) is omitted here. Based on the inter-atomic relations and features $(\dot{z}_{iq}, \vec{z}_{iq})$, it automatically computes inter-atomic correlations $(\dot{z}'_{iq}, \vec{z}'_{iq})$ and from these predicts the type of bond that may form between the atoms (similar to the frontier-prediction step: $(\dot{z}'_{iq}, \vec{z}'_{iq})$ passes through a GV-MLP and then a softmax).
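Here is a combined sketch of the element predictor (1) and bond predictor (2), skipping the attention refinement; all module names, shapes and signatures are illustrative assumptions:

```python
import torch

def predict_element(pos_new, coords, s_v, V_v, aggregate, gv_mlp_ele, k=32):
    """(1) Gather the k nearest atoms around the queried position, aggregate
    their vertex/edge features into a representation for the new atom
    (Eq. 4-style message passing), then classify its element type.
    A 'Nothing' class is included, as used during training."""
    nbr = (coords - pos_new).norm(dim=-1).topk(k, largest=False).indices
    s_i, V_i = aggregate(pos_new, nbr, s_v, V_v)  # assumed aggregation module
    logits, _ = gv_mlp_ele(s_i, V_i)
    return torch.softmax(logits, dim=-1)          # element-type probabilities

def predict_bonds(s_i, V_i, s_q, V_q, s_e, V_e, gv_mlp_bond):
    """(2) Build each edge representation z_iq by concatenating the new
    atom's features, an existing atom's features, and the processed edge
    features, then classify the bond type ('no bond' is one class)."""
    s_z = torch.cat([s_i.expand(s_q.shape[0], -1), s_q, s_e], dim=-1)
    V_z = torch.cat([V_i.expand(V_q.shape[0], -1, -1), V_q, V_e], dim=-2)
    logits, _ = gv_mlp_bond(s_z, V_z)
    return torch.softmax(logits, dim=-1)          # (Q, n_bond_types + 1)
```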

7、Training

(1) Specifically, for each pocket-ligand pair, we sample a mask ratio from the uniform distribution $U[0, 1]$ and mask the corresponding number of molecular atoms. The remaining molecular atoms that have valence bonds to the masked atoms are defined as frontiers.
(2) Then the position predictor and the element-and-bond predictor try to recover the masked atoms that have valence bonds to the frontiers, by predicting their positions relative to the corresponding frontiers, their element types, and their bonds with the remaining molecular atoms.
(3) If all molecular atoms are masked, the frontiers are defined as protein atoms that have masked atoms within 4 Å, and the masked atoms around those frontiers are to be recovered.
For the element type prediction, similar to (Luo et al., 2021), we add one more element type representing Nothing at the query position. During training, we sample not only the positions of masked atoms for element-type prediction but also negative positions from the ambient space (negative-sample training) and assign them the label Nothing.
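A sketch of the masking scheme used to build training samples; the bond-set representation is an assumption:

```python
import random

def make_training_sample(num_atoms, bonds):
    """Sketch of the masking scheme: sample a mask ratio ~ U[0, 1], mask that
    fraction of molecular atoms, and mark the remaining atoms bonded to a
    masked atom as frontiers. `bonds` is an assumed set of (i, j) pairs."""
    ratio = random.random()
    n_mask = round(ratio * num_atoms)
    masked = set(random.sample(range(num_atoms), n_mask))
    remaining = [i for i in range(num_atoms) if i not in masked]
    frontiers = [i for i in remaining
                 if any((i, j) in bonds or (j, i) in bonds for j in masked)]
    return masked, frontiers
```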
Loss: the loss of the frontier prediction, $L_{fro}$, is the binary cross-entropy loss of the predicted frontiers. The loss of the position predictor, $L_{pos}$, is the negative log-likelihood of the masked atom positions. For the element-type and bond-type predictions, we use cross-entropy losses (since there are multiple classes), denoted as $L_{ele}$ and $L_{bond}$ respectively. The overall loss function is the summation of the above four loss functions:
$$L = L_{fro} + L_{pos} + L_{ele} + L_{bond}$$
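Assuming the four terms are computed as described, combining them is a plain sum; a sketch with equal weights, per the formula above:

```python
import torch.nn.functional as F

def total_loss(fro_logits, fro_labels, pos_nll,
               ele_logits, ele_labels, bond_logits, bond_labels):
    """Combine the four terms with equal weights, per the sum above.
    fro_labels must be floats in {0, 1}; pos_nll is the GMM negative
    log-likelihood already computed by the position predictor."""
    l_fro = F.binary_cross_entropy_with_logits(fro_logits, fro_labels)
    l_ele = F.cross_entropy(ele_logits, ele_labels)
    l_bond = F.cross_entropy(bond_logits, bond_labels)
    return l_fro + pos_nll + l_ele + l_bond
```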

8、Experiments

Data: CrossDocked dataset
Baselines: a CVAE-based model (CVAE) and another auto-regressive generative model (AR)
Sampling: randomly sample 100 molecules for each protein pocket in the test set
Metrics: Vina Score, High Affinity, QED, SA, LogP, Lipinski, Sim. Train, Diversity, Time
9、Results
(Result figures and tables omitted.)
