Paper Notes: "3D Equivariant Diffusion For Target-Aware Molecule Generation and Affinity Prediction"

嘿嘿我跑了

Article link: https://blog.csdn.net/xulomh/article/details/137440101

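This article introduces an end-to-end framework that uses 3D equivariant diffusion to generate molecules conditioned on a protein target while accounting for physical interactions. The model is non-autoregressive and SE(3)-equivariant, relying on a GNN and a shifting-center operation, and the work proposes new evaluation metrics. The framework is also applied to assessing the quality of generated molecules and to improving binding affinity prediction.

Summary generated by CSDN using AI.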

3D Equivariant Diffusion For Target-Aware Molecule Generation and Affinity Prediction

TargetDiff
ICLR 2023

1、Contributions

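* An end-to-end framework for generating molecules conditioned on a protein target, which explicitly considers the physical interaction between proteins and molecules in 3D space.
* To the best of our knowledge, this is the first probabilistic diffusion formulation for target-aware drug design, in which the training and sampling procedures are aligned in a non-autoregressive and SE(3)-equivariant fashion, thanks to a shifting-center operation and an equivariant GNN.
* Several new evaluation metrics and additional insights that allow the generated molecules to be evaluated along many different dimensions. The empirical results show that our model outperforms two other representative baselines.
* An effective way to evaluate the quality of generated molecules based on our framework, where the model can serve either as a scoring function to help ranking or as an unsupervised feature extractor to improve binding affinity prediction.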

2、Problem definition

A protein binding site is represented as a set of atoms $\{(x^{(i)}_P, v^{(i)}_P)\}^{N_P}_{i=1}$, where $N_P$ is the number of protein atoms, $x_P ∈ R^3$ represents the 3D coordinates of an atom, and $v_P ∈ R^{N_f}$ represents protein atom features such as element types and amino acid types. The goal is to generate binding molecules $\{(x^{(i)}_L, v^{(i)}_L)\}^{N_M}_{i=1}$ conditioned on the protein target, where $N_M$ is the number of ligand atoms. For brevity, molecules are denoted as $M = [x, v]$, where $[·, ·]$ is the concatenation operator and $x ∈ R^{N_M×3}$ and $v ∈ R^{N_M×K}$ denote the atom Cartesian coordinates and one-hot atom types respectively.

3、Molecular diffusion process

The paper uses a Gaussian distribution $\mathcal{N}$ to model the continuous atom coordinates $x$ and a categorical distribution $\mathcal{C}$ to model the discrete atom types $v$. The atom types are constructed as a one-hot vector containing information such as element type and membership in an aromatic ring. The molecular distribution is formulated as a product of the atom coordinate distribution and the atom type distribution. At each time step $t$, a small Gaussian noise and a uniform noise across all categories are added to the atom coordinates and atom types separately, according to a Markov chain with fixed variance schedules $β_1, \dots, β_T$. Here $K$ is the number of atom-type categories, so the type noise is a uniform vector over the $K$ classes; in practice, $x$ and $v$ use different schedules.
Denoting $α_t = 1 − β_t$ and $\bar α_t = \prod_{s=1}^{t} α_s$, a desirable property of the diffusion process is that the noisy data distribution $q(M_t|M_0)$ at any time step can be computed directly in closed form.
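The equations for this step were embedded as images in the post and are not available here. A hedged reconstruction, assuming the standard DDPM Gaussian kernel for coordinates and a uniform-mixing categorical kernel for types (the per-atom index is omitted, $K$ is the number of atom-type categories, and $\bar α_t = \prod_{s \le t} α_s$ with $α_s = 1 − β_s$):

$$
\begin{aligned}
q(x_t \mid x_{t-1}) &= \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), &
q(v_t \mid v_{t-1}) &= \mathcal{C}\big(v_t \mid (1-\beta_t)\,v_{t-1} + \beta_t/K\big),\\
q(x_t \mid x_0) &= \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big), &
q(v_t \mid v_0) &= \mathcal{C}\big(v_t \mid \bar\alpha_t\,v_0 + (1-\bar\alpha_t)/K\big).
\end{aligned}
$$

The second row follows from the first by composing the per-step kernels, which is why no simulation of intermediate steps is needed during training.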
Using Bayes' theorem, the normal posterior of the atom coordinates and the categorical posterior of the atom types can both be computed in closed form.
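To make the closed-form noising and the Bayes posterior concrete, here is a minimal NumPy sketch. It is not the authors' code: the function names (`make_schedule`, `q_sample`, `posterior`), the linear β schedule, and the use of a single schedule for both coordinates and types are assumptions made for illustration.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, v0, t, alpha_bars, rng):
    """Draw (x_t, v_t) from q(. | x_0, v_0) in one shot; t is 0-indexed."""
    ab = alpha_bars[t]
    K = v0.shape[-1]
    # Coordinates: scaled clean signal plus Gaussian noise.
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)
    # Types: mix the one-hot vector with the uniform distribution, then sample.
    probs = ab * v0 + (1.0 - ab) / K
    idx = np.array([rng.choice(K, p=p) for p in probs])
    vt = np.eye(K)[idx]
    return xt, vt

def posterior(xt, vt, x0, v0, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0), probabilities of q(v_{t-1} | v_t, v_0)."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    K = v0.shape[-1]
    mean = (np.sqrt(ab_prev) * betas[t] * x0
            + np.sqrt(alphas[t]) * (1.0 - ab_prev) * xt) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    # Categorical posterior: product of the one-step and t-1 step kernels, normalised.
    c_star = (alphas[t] * vt + (1.0 - alphas[t]) / K) * (ab_prev * v0 + (1.0 - ab_prev) / K)
    c = c_star / c_star.sum(axis=-1, keepdims=True)
    return mean, var, c
```

During generation the same `posterior` function can be reused with the network's predicted $[\hat x_0, \hat v_0]$ in place of the true $[x_0, v_0]$, which is how the $x_0$-parameterization described in the next section yields $μ_θ$ and $c_θ$.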

4、Molecular generative process

The generative process runs in reverse: it recovers the ground-truth molecule $M_0$ from the initial noise $M_T$. The reverse distribution is approximated with a neural network parameterized by $θ$ ($t$, $P$ and $M_t$ are all known; the network has to produce $μ_θ$ and $c_θ$).
There are different ways to parameterize $μ_θ([x_t, v_t], t, P)$ and $c_θ([x_t, v_t], t, P)$. Here, the neural network predicts $[x_0, v_0]$, and the prediction is fed through equation 4 to obtain $μ_θ$ and $c_θ$, which define the posterior distributions. The interaction between the ligand molecule atoms and the protein atoms is modeled with an SE(3)-equivariant GNN.
At the $l$-th layer, the atom hidden embedding $h$ and the coordinates $x$ are updated alternately.
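The update equations were an image in the original post. A reconstruction assuming the usual EGNN-style residual updates, with $f_h$ and $f_x$ the learned message functions named in the experiments section and $\mathcal{V}$ the set of all protein and ligand atoms (the exact argument lists are an assumption):

$$
\begin{aligned}
h_i^{l+1} &= h_i^{l} + \sum_{j \in \mathcal{V},\, j \neq i} f_h\big(d_{ij}^{l},\, h_i^{l},\, h_j^{l},\, e_{ij}\big),\\
x_i^{l+1} &= x_i^{l} + \sum_{j \in \mathcal{V},\, j \neq i} \big(x_i^{l} - x_j^{l}\big)\, f_x\big(d_{ij}^{l},\, h_i^{l+1},\, h_j^{l+1},\, e_{ij}\big) \cdot \mathbb{1}_{\text{mol}}.
\end{aligned}
$$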

Here, $d_{ij} = ‖x_i − x_j‖$ is the Euclidean distance between atoms $i$ and $j$, and $e_{ij}$ is an additional feature indicating whether the connection is between protein atoms, between ligand atoms, or between a protein atom and a ligand atom (a pairwise edge feature that can be viewed like an adjacency matrix describing the type of connection between atoms). $\mathbb{1}_{\text{mol}}$ is the ligand molecule mask, since the protein atom coordinates should not be updated. The initial atom hidden embedding $h^0$ is obtained by an embedding layer that encodes the atom information. The final atom hidden embedding $h^L$ is fed into a multi-layer perceptron and a softmax function to obtain $\hat v_0$. Since $\hat x_0$ is rotation equivariant to $x_t$, and $x_{t−1}$ is rotation equivariant to $x_0$ according to equation 4, the desired equivariance of the Markov transition is achieved.
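The equivariance argument can be checked numerically. The sketch below is a toy stand-in, not the paper's network: `egnn_layer` is a hypothetical name, the graph is fully connected, and $f_h$ and $f_x$ are replaced by a single linear map followed by `tanh` (the paper uses graph attention layers). What it does preserve is the structure that matters: messages depend only on invariant quantities, and coordinates move along relative vectors, for ligand atoms only.

```python
import numpy as np

def egnn_layer(h, x, e, mol_mask, Wh, Wx):
    """One residual update of hidden states h (N, D) and coordinates x (N, 3).

    e: (N, N, E) edge-type features; mol_mask: (N,) with 1 for ligand atoms.
    Wh, Wx: toy weight matrices standing in for the learned f_h and f_x.
    """
    N = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]               # (N, N, 3), x_i - x_j
    d = np.linalg.norm(diff, axis=-1, keepdims=True)   # (N, N, 1), rotation invariant
    not_self = 1.0 - np.eye(N)[:, :, None]             # excludes j == i

    def pair_input(hid):
        hi = np.broadcast_to(hid[:, None, :], (N, N, hid.shape[1]))
        hj = np.broadcast_to(hid[None, :, :], (N, N, hid.shape[1]))
        return np.concatenate([d, hi, hj, e], axis=-1)

    # Hidden update: sum of invariant messages over all other atoms.
    m_h = np.tanh(pair_input(h) @ Wh) * not_self       # (N, N, D)
    h_new = h + m_h.sum(axis=1)

    # Coordinate update: relative vectors scaled by an invariant scalar,
    # applied to ligand atoms only via the mask.
    w = np.tanh(pair_input(h_new) @ Wx) * not_self     # (N, N, 1)
    x_new = x + (diff * w).sum(axis=1) * mol_mask[:, None]
    return h_new, x_new
```

Rotating the input coordinates by an orthogonal matrix `Q` (for example from `np.linalg.qr`) leaves `h_new` unchanged and rotates `x_new` by the same `Q`, while rows of `x_new` belonging to protein atoms stay equal to the input.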
Note: the likelihood $p_θ(M_0|P)$ should be invariant to translation and rotation of the protein-ligand complex. Denoting an SE(3)-transformation as $T_g$, an invariant likelihood with respect to $T_g$ on the protein-ligand complex, $p_θ(T_g(M_0|P)) = p_θ(M_0|P)$, can be achieved by shifting the Center of Mass (CoM) of the protein atoms to zero and parameterizing the Markov transition $p(x_{t−1}|x_t, x_P)$ with an SE(3)-equivariant network.
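The shifting-center step handles the translation part of the invariance. A minimal sketch, assuming the center is the unweighted mean of the protein atom coordinates (the function name `shift_to_protein_com` is hypothetical, and a mass-weighted center would be used the same way):

```python
import numpy as np

def shift_to_protein_com(x_protein, x_ligand):
    """Translate the complex so that the protein centre sits at the origin.

    The centre is the unweighted mean of protein atom coordinates
    (an assumption made for this sketch).
    """
    center = x_protein.mean(axis=0, keepdims=True)
    return x_protein - center, x_ligand - center, center
```

Because the same offset is subtracted from both protein and ligand, translating the whole complex before the call produces identical shifted coordinates, so everything downstream sees a translation-free input. The returned `center` can be added back to generated ligand coordinates to place them in the original frame.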

5、Training

The combination of $q$ and $p$ is a variational auto-encoder (Kingma and Welling, 2013), so the model can be trained by optimizing the variational bound on the negative log-likelihood. For the atom coordinate loss, since $q(x_{t−1}|x_t, x_0)$ and $p_θ(x_{t−1}|x_t)$ are both Gaussian distributions, the KL-divergence can be written in closed form as $L^{(x)}_{t−1} = γ_t ‖x_0 − \hat x_0‖^2 + C$, where $γ_t$ is a time-dependent weight and $C$ is a constant. In practice, training the model with an unweighted MSE loss (setting $γ_t = 1$) could also achieve better performance, as Ho et al. (2020) suggested. For the atom type loss, the KL-divergence of the categorical distributions can be computed directly.
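Both loss terms were given as images in the post. A hedged reconstruction, assuming the $x_0$-parameterization described earlier, a fixed reverse variance $σ_t^2$ (an assumed symbol), and $\bar α_t = \prod_{s \le t} α_s$. Substituting the posterior mean into the Gaussian KL gives the coordinate weight, and the type loss is the KL between the true and predicted categorical posteriors $c(v_t, \cdot)$:

$$
\begin{aligned}
\tilde\mu_t(x_t, x_0) - \mu_\theta &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,(x_0 - \hat x_0)
\;\Longrightarrow\;
\gamma_t = \frac{\bar\alpha_{t-1}\,\beta_t^2}{2\sigma_t^2\,(1-\bar\alpha_t)^2},\\
L^{(v)}_{t-1} &= \sum_{k} c(v_t, v_0)_k \,\log\frac{c(v_t, v_0)_k}{c(v_t, \hat v_0)_k}.
\end{aligned}
$$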
The final loss is a weighted sum of the atom coordinate loss and the atom type loss: $L = L^{(x)}_{t−1} + λL^{(v)}_{t−1}$. The overall training and sampling procedures of TargetDiff are summarized in Appendix E of the paper.
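As an illustration of how the two terms combine, here is a NumPy sketch of a single-timestep loss. It is an assumption-laden sketch rather than the authors' training code: the names `categorical_posterior` and `targetdiff_style_loss` are hypothetical, the coordinate term uses the unweighted MSE ($γ_t = 1$), and the default `lam=1.0` is an arbitrary placeholder for $λ$.

```python
import numpy as np

def categorical_posterior(vt, v0, alpha_t, ab_prev):
    """Probabilities of q(v_{t-1} | v_t, v_0) for one-hot or soft v0."""
    K = vt.shape[-1]
    c_star = (alpha_t * vt + (1.0 - alpha_t) / K) * (ab_prev * v0 + (1.0 - ab_prev) / K)
    return c_star / c_star.sum(axis=-1, keepdims=True)

def targetdiff_style_loss(x0, x0_hat, vt, v0, v0_hat, alpha_t, ab_prev, lam=1.0, eps=1e-12):
    """Unweighted coordinate MSE plus lam times the categorical KL, averaged over atoms."""
    loss_x = np.mean(np.sum((x0 - x0_hat) ** 2, axis=-1))
    c_true = categorical_posterior(vt, v0, alpha_t, ab_prev)
    c_pred = categorical_posterior(vt, v0_hat, alpha_t, ab_prev)
    # KL(c_true || c_pred) per atom; eps guards against log(0).
    kl = np.sum(c_true * (np.log(c_true + eps) - np.log(c_pred + eps)), axis=-1)
    loss_v = np.mean(kl)
    return loss_x + lam * loss_v, loss_x, loss_v
```

When the network's predictions match the ground truth exactly, both terms vanish; any mismatch in coordinates or in predicted type probabilities makes the corresponding term positive.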
(1) training

(2) sampling
At the $l$-th layer, the protein-ligand complex is dynamically constructed as a k-nearest neighbors (kNN) graph based on the known protein atom coordinates and the current ligand atom coordinates, which are the output of the $(l−1)$-th layer. The experiments use $k = 32$. The protein atom features include chemical elements, amino acid types and whether the atoms are backbone atoms. The ligand atom types are one-hot vectors consisting of the chemical element types and aromatic information. The edge features are the outer products of the distance embedding and the bond types, where the distance is expanded with radial basis functions located at 20 centers between 0 Å and 10 Å, and the bond type is a 4-dim one-hot vector indicating whether the connection is between protein atoms, ligand atoms, protein-ligand atoms or ligand-protein atoms.
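The graph construction and edge featurization described above can be sketched as follows. This is a brute-force NumPy illustration with hypothetical function names; the Gaussian form of the radial basis and its width (set equal to the center spacing) are assumptions, since the text only specifies 20 centers between 0 Å and 10 Å.

```python
import numpy as np

def knn_edges(x_all, k=32):
    """Link every atom to its k nearest neighbours; returns (src, dst, distance)."""
    n = x_all.shape[0]
    k = min(k, n - 1)
    d = np.linalg.norm(x_all[:, None] - x_all[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # never pick an atom as its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]          # (n, k) neighbour indices
    dst = np.repeat(np.arange(n), k)
    src = nbrs.reshape(-1)
    return src, dst, d[dst, src]

def rbf_expand(dist, d_min=0.0, d_max=10.0, n_centers=20):
    """Gaussian radial basis expansion of distances given in Angstrom."""
    centers = np.linspace(d_min, d_max, n_centers)
    width = centers[1] - centers[0]              # assumed: width equals the centre spacing
    return np.exp(-((dist[:, None] - centers[None, :]) / width) ** 2)

def edge_features(dist, edge_type, n_types=4):
    """Outer product of the distance embedding and a one-hot edge type, flattened."""
    rbf = rbf_expand(dist)                       # (n_edges, 20)
    onehot = np.eye(n_types)[edge_type]          # (n_edges, 4)
    return (rbf[:, :, None] * onehot[:, None, :]).reshape(len(dist), -1)
```

With a 0/1 array `is_lig` marking ligand atoms, one possible encoding of the four edge types is `edge_type = 2 * is_lig[src] + is_lig[dst]`, which distinguishes protein-protein, protein-ligand, ligand-protein and ligand-ligand edges. Since ligand coordinates change from layer to layer, `knn_edges` would be called again on the updated coordinates before each layer.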

6、Experiments

Data: CrossDocked2020
Baselines: liGAN, AR, Pocket2Mol, GraphBP
TargetDiff: The model contains 9 equivariant layers as described in equation 7, where $f_h$ and $f_x$ are implemented as graph attention layers with 16 attention heads and 128 hidden features. The number of atoms for sampling is first decided by drawing from a prior distribution estimated from training complexes with similar binding pocket sizes. After the model finishes the generative process, OpenBabel (O’Boyle et al., 2011) is used to construct the molecule from the individual atom coordinates, as done in AR and liGAN.

7、Results


8、Target Binding Affinity

We first establish the connection between unsupervised generative models and binding affinity ranking / prediction. Under our parameterization, the network predicts the denoised $[\hat x_0, \hat v_0]$. Given the protein-ligand complex, we can feed $φ_θ$ with $[x_0, v_0]$ while freezing the x-update branch (i.e. only the atom hidden embedding $h$ is updated), and finally obtain $h^L$ and $\hat v_0$.
Our assumption is that if the ligand molecule has a good binding affinity to the protein, the flexibility of the atom types should be low, which could be reflected in the entropy of $\hat v_0$ (v_ent). Therefore, it can be used as a scoring function to help ranking, whose effectiveness is justified in the experiments. In addition, $h^L$ also includes useful global information. We found that the binding affinity ranking performance can be greatly improved by utilizing this feature with a simple linear transformation.
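The entropy-based score can be sketched in a few lines. This is an illustration under assumptions, not the paper's exact recipe: the function names are hypothetical, the per-atom entropies are aggregated by a plain mean, and the input is taken to be the softmax output $\hat v_0$ for each ligand.

```python
import numpy as np

def atom_type_entropy(v0_hat_probs, eps=1e-12):
    """Mean per-atom Shannon entropy of predicted atom-type distributions (N, K)."""
    p = np.clip(v0_hat_probs, eps, 1.0)
    ent = -np.sum(p * np.log(p), axis=-1)
    return float(ent.mean())

def rank_by_entropy(ligand_probs):
    """Indices of ligands sorted from lowest to highest entropy, plus the scores."""
    scores = np.array([atom_type_entropy(p) for p in ligand_probs])
    return np.argsort(scores), scores
```

Under the stated assumption, ligands earlier in the returned order (lower v_ent) would be the ones the model considers better fitted to the pocket. The supervised variant mentioned in the text would instead pool $h^L$ over ligand atoms and fit a linear map to measured affinities.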