Generating 3D molecules conditional on receptor binding sites with deep generative models.
liGAN: a conditional VAE model
Chem Sci, 13:2701–2713, Feb 2022.
Pipeline:
data --(VAE)--> atomic density grid --(atom fitting, bond inference)--> molecular conformations
1、Atom typing
First, we assign atom types to molecules using a set of N_p atomic property functions p and value ranges for those properties v, which are listed in Table 1. For a given atom a, the atom type vector t ∈ ℝ^{N_T} is created by concatenating N_p atomic property vectors p through the following:
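p: atomic property function. N_p: number of property functions. v: the value range of a given property function (the values that property can take). N_T: length of the atom type vector, i.e. the number of type channels.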
The atomic properties we used were element, aromaticity, H-bond donor and acceptor status, and formal charge. Different element ranges were represented for receptor atoms and ligand atoms (the set of elements covered differs between receptor and ligand), but the value ranges for all other properties were the same. The process we used to construct value ranges for properties and compare different type schemes is described in the supplement.
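The concatenation described above can be sketched as follows. This is a minimal illustration, assuming each property is one-hot encoded over its value range; the element list and charge range are placeholders (Table 1 is not reproduced in these notes), and the names `Atom`, `PROPERTIES`, and `atom_type_vector` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    element: str
    aromatic: bool
    h_donor: bool
    h_acceptor: bool
    formal_charge: int


# Illustrative value ranges only; the real ranges are those of Table 1 in the paper.
PROPERTIES = [
    (lambda a: a.element, ["C", "N", "O", "S", "F", "Cl"]),
    (lambda a: a.aromatic, [False, True]),
    (lambda a: a.h_donor, [False, True]),
    (lambda a: a.h_acceptor, [False, True]),
    (lambda a: a.formal_charge, [-1, 0, 1]),
]


def one_hot(value, value_range):
    """Property vector: 1.0 at the position of the matching value, 0.0 elsewhere."""
    return [1.0 if value == v else 0.0 for v in value_range]


def atom_type_vector(atom):
    """Concatenate the N_p property vectors into one type vector of length N_T."""
    t = []
    for prop, value_range in PROPERTIES:
        t.extend(one_hot(prop(atom), value_range))
    return t


# An aromatic nitrogen that accepts H-bonds, with no formal charge.
t = atom_type_vector(Atom("N", True, False, True, 0))
```

With these placeholder ranges N_T = 6 + 2 + 2 + 2 + 3 = 15. A value outside its range (for example an element not in the list) leaves that segment all zeros.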
2、Atom gridding
The density value of an atom at a grid point is defined by a kernel function f: ℝ × ℝ → ℝ that takes as input the distance d between the atom coordinate and the grid point and the atomic radius r:
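The kernel equation did not survive in these notes. Assuming the default libmolgrid density (a Gaussian inside the atomic radius, continued by a quadratic that reaches zero at 1.5r), it would read:

```latex
f(d, r) =
\begin{cases}
  e^{-2 d^2 / r^2} & 0 \le d < r \\[4pt]
  \dfrac{4}{e^2 r^2}\, d^2 - \dfrac{12}{e^2 r}\, d + \dfrac{9}{e^2} & r \le d < 1.5\,r \\[4pt]
  0 & d \ge 1.5\,r
\end{cases}
```

Both pieces equal e^{-2} at d = r and have the same slope there, and the quadratic vanishes at d = 1.5r, so the density is smooth and has finite support.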
The radius was fixed at r = 1.0 for all atoms in this work. Grid values are computed by summing the density kernel of each atom at each point on a 3D grid, multiplied by the value of the atom's type vector in the corresponding grid channel. A molecule with N atoms and atom type vectors of length N_T can be represented as a matrix of atom types T ∈ ℝ^{N×N_T} and a matrix of atomic coordinates C ∈ ℝ^{N×3}. The function that computes atomic density grids g: ℝ^{N×N_T} × ℝ^{N×3} → ℝ^{N_T×N_X×N_Y×N_Z} is then defined as follows:
All atoms that fit within the spatial extent of the grid are represented. We used cubic grids with side lengths of 23.5 Å and 0.5 Å resolution, resulting in spatial dimensions N_X = N_Y = N_Z = 48.
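A NumPy sketch of the gridding step: each grid value is the sum over atoms j of T[j, i] · f(‖C[j] − (x, y, z)‖, r). The kernel form is an assumption (Gaussian plus quadratic tail, as in libmolgrid's default), and `density_kernel` and `make_grid` are hypothetical names; this does not use the libmolgrid API.

```python
import numpy as np


def density_kernel(d, r=1.0):
    """Assumed kernel: Gaussian for d < r, quadratic tail up to 1.5 r, zero beyond."""
    d = np.asarray(d, dtype=float)
    gauss = np.exp(-2.0 * d**2 / r**2)
    quad = (4.0 * d**2 / r**2 - 12.0 * d / r + 9.0) / np.e**2
    return np.where(d < r, gauss, np.where(d < 1.5 * r, quad, 0.0))


def make_grid(T, C, center=(0.0, 0.0, 0.0), n=48, resolution=0.5, r=1.0):
    """T: (N, N_T) atom types, C: (N, 3) coordinates -> grid of shape (N_T, n, n, n)."""
    T = np.asarray(T, dtype=float)
    C = np.asarray(C, dtype=float)
    half = (n - 1) * resolution / 2.0  # 48 points at 0.5 A spacing span 23.5 A
    axis = np.linspace(-half, half, n)
    gx, gy, gz = np.meshgrid(
        axis + center[0], axis + center[1], axis + center[2], indexing="ij"
    )
    points = np.stack([gx, gy, gz], axis=-1)  # (n, n, n, 3)
    grid = np.zeros((T.shape[1], n, n, n))
    for t_j, c_j in zip(T, C):
        d = np.linalg.norm(points - c_j, axis=-1)  # distance of every grid point to atom j
        grid += t_j[:, None, None, None] * density_kernel(d, r)
    return grid


# One atom of type channel 0 at the centre of a small 9x9x9 grid.
small = make_grid([[1.0, 0.0]], [[0.0, 0.0, 0.0]], n=9)
```

The loop over atoms keeps the sketch readable; a production gridder would vectorize or run on the GPU.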
3、Atom fitting
The inverse problem of converting a reference density grid Gref into a discrete 3D molecular structure does not have an analytic solution, so we solve it as the following optimization problem:
The grid density g(T, C) should approximate the reference density grid G_ref.
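The objective itself is missing from these notes. Written out under the assumption of a squared L2 distance between the two grids, it would be:

```latex
T^{*}, C^{*} \;=\; \operatorname*{arg\,min}_{T,\,C} \; \bigl\lVert G_{\mathrm{ref}} - g(T, C) \bigr\rVert_2^{2}
```

Here T and C are the atom type and coordinate matrices defined in the gridding section, and the number of atoms N is itself unknown, which is why the problem has no closed-form solution.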
We can detect initial locations of atoms on a grid by selecting from the grid points with the largest density values. libmolgrid allows us to compute the grid representation of an atomic structure and backpropagate a gradient from grid values to atomic coordinates. Therefore, we devised an algorithm that combines iterative atom detection with gradient descent to find a set of atoms that best fits a reference density.
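The idea of alternating atom detection with gradient descent can be illustrated with a heavily simplified sketch. Assumptions: a single type channel, a pure Gaussian kernel without the truncated tail, an analytic gradient in place of libmolgrid's backpropagation, and a fixed density threshold as the stopping rule. All function names are hypothetical, and this is not the authors' algorithm.

```python
import numpy as np


def grid_points(n=17, resolution=0.5):
    """Coordinates of a cubic grid centred at the origin, shape (n, n, n, 3)."""
    half = (n - 1) * resolution / 2.0
    axis = np.linspace(-half, half, n)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)


def density(points, coords, r=1.0):
    """Single-channel density: sum of Gaussian kernels exp(-2 d^2 / r^2)."""
    grid = np.zeros(points.shape[:-1])
    for c in coords:
        grid += np.exp(-2.0 * np.sum((points - c) ** 2, axis=-1) / r**2)
    return grid


def loss_and_grad(points, coords, g_ref, r=1.0):
    """Squared L2 loss and its analytic gradient w.r.t. each atom coordinate."""
    diff = density(points, coords, r) - g_ref
    grads = []
    for c in coords:
        delta = points - c
        k = np.exp(-2.0 * np.sum(delta**2, axis=-1) / r**2)
        # dk/dc = k * (4 / r^2) * (p - c); chain rule through sum(diff^2)
        grads.append(np.sum((2.0 * diff * k * 4.0 / r**2)[..., None] * delta, axis=(0, 1, 2)))
    return np.sum(diff**2), np.array(grads).reshape(-1, 3)


def fit_atoms(points, g_ref, max_atoms=10, threshold=0.5, steps=200, lr=0.01, r=1.0):
    coords = np.zeros((0, 3))
    for _ in range(max_atoms):
        # Detect: place a new atom at the grid point with the largest unexplained density.
        residual = g_ref - density(points, coords, r)
        idx = np.unravel_index(np.argmax(residual), residual.shape)
        if residual[idx] < threshold:
            break
        coords = np.vstack([coords, points[idx]])
        # Refine: gradient descent on all atom coordinates found so far.
        for _step in range(steps):
            _, grad = loss_and_grad(points, coords, g_ref, r)
            coords = coords - lr * grad
    return coords


pts = grid_points()
true_coords = np.array([[0.3, -0.2, 0.1], [1.9, 0.4, -0.3]])
fitted = fit_atoms(pts, density(pts, true_coords))
```

In this toy case the reference grid is generated from two off-grid atoms, so the fit should recover both positions; real generated densities are noisier and also require choosing a type for each detected atom.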
4、Bond inference
Based on Open Babel.
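For intuition only, here is a pure-Python distance heuristic of the kind that connectivity perception starts from: two atoms are bonded when their distance is below the sum of their covalent radii plus a tolerance. This is not Open Babel's procedure, which also assigns bond orders, aromaticity, and valences; the 0.45 Å tolerance and the function name `infer_bonds` are assumptions.

```python
import math
from itertools import combinations

# Single-bond covalent radii in angstroms for a few common elements.
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def infer_bonds(elements, coords, tolerance=0.45):
    """Return index pairs (i, j) whose distance is within r_i + r_j + tolerance."""
    bonds = []
    for i, j in combinations(range(len(elements)), 2):
        cutoff = COVALENT_RADII[elements[i]] + COVALENT_RADII[elements[j]] + tolerance
        if math.dist(coords[i], coords[j]) <= cutoff:
            bonds.append((i, j))
    return bonds


# Heavy atoms of an ethanol-like fragment: C-C-O.
elements = ["C", "C", "O"]
coords = [(0.0, 0.0, 0.0), (1.52, 0.0, 0.0), (2.0, 1.35, 0.0)]
bonds = infer_bonds(elements, coords)
```

Only connectivity is recovered here; deciding single versus double bonds needs the extra information (aromaticity, donor and acceptor flags, formal charge) carried by the atom types.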
5、Conditional VAE
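z: latent feature vector of rec and lig (produced by the input encoder)
c: feature vector of rec (produced by the conditional encoder)
lig_gen: ligand density (produced by the decoder)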
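The data flow between these three quantities can be sketched with toy stand-ins. Assumptions: grids are flattened to short vectors, random linear maps replace the 3D convolutional networks, and all sizes and names (`input_encoder`, `conditional_encoder`, `decoder`, `forward`) are hypothetical. Only the wiring is meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID_DIM, LATENT_DIM, COND_DIM = 64, 8, 16  # toy sizes, not the paper's

# Random linear maps standing in for the trained networks.
W_mu = rng.normal(scale=0.1, size=(LATENT_DIM, 2 * GRID_DIM))
W_logvar = rng.normal(scale=0.1, size=(LATENT_DIM, 2 * GRID_DIM))
W_cond = rng.normal(scale=0.1, size=(COND_DIM, GRID_DIM))
W_dec = rng.normal(scale=0.1, size=(GRID_DIM, LATENT_DIM + COND_DIM))


def input_encoder(rec, lig):
    """Encode the receptor-ligand complex into posterior parameters (mu, log variance)."""
    x = np.concatenate([rec, lig])
    return W_mu @ x, W_logvar @ x


def conditional_encoder(rec):
    """Encode the receptor alone into the conditioning vector c."""
    return np.tanh(W_cond @ rec)


def decoder(z, c):
    """Decode latent sample and condition into a non-negative ligand density."""
    return np.maximum(W_dec @ np.concatenate([z, c]), 0.0)


def forward(rec, lig):
    mu, logvar = input_encoder(rec, lig)
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(LATENT_DIM)
    c = conditional_encoder(rec)
    return decoder(z, c), mu, logvar


rec = rng.random(GRID_DIM)
lig = rng.random(GRID_DIM)
lig_gen, mu, logvar = forward(rec, lig)
```

The receptor enters twice: once with the ligand to form z, and once alone to form c, so that at generation time a ligand density can be decoded from a sampled z and the receptor only.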
6、Training
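L_recon: reconstruction loss; maximizes the probability of decoding latent samples into the true ligand density, given the receptor density.
L_KL: KL divergence loss; encourages the approximate posterior distribution to match the prior distribution.
L_steric: steric clash loss; keeps the generated ligand from colliding or overlapping with the receptor in space, which reduces molecular instability.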
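A minimal NumPy sketch of the three terms for a single sample. The KL expression is the standard closed form for a diagonal Gaussian posterior against a standard normal prior; the squared-error reconstruction term and the product-overlap steric term are assumptions about the exact form, and the function names are hypothetical. The default weights are the initial values quoted in the training paragraph.

```python
import numpy as np


def recon_loss(lig, lig_gen):
    """Squared error between true and generated ligand grids (assumed form)."""
    return 0.5 * np.sum((lig - lig_gen) ** 2)


def kl_loss(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)


def steric_loss(rec, lig_gen):
    """Overlap of total receptor density and total generated ligand density (assumed form).

    Both grids have shape (channels, X, Y, Z); channels are summed before multiplying.
    """
    return np.sum(rec.sum(axis=0) * lig_gen.sum(axis=0))


def total_loss(rec, lig, lig_gen, mu, logvar, w_recon=4.0, w_kl=0.1, w_steric=1.0):
    return (
        w_recon * recon_loss(lig, lig_gen)
        + w_kl * kl_loss(mu, logvar)
        + w_steric * steric_loss(rec, lig_gen)
    )


# Toy grids: receptor density on the left half, ligand density on the right half.
rec = np.zeros((2, 4, 4, 4))
rec[:, :2] = 1.0
lig = np.zeros((3, 4, 4, 4))
lig[:, 2:] = 1.0
```

A perfect reconstruction that does not overlap the receptor, encoded with mu = 0 and unit variance, gives a total loss of zero; moving ligand density into the receptor half raises only the steric term.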
The loss weights were initialized at λ_recon = 4.0, λ_KL = 0.1, and λ_steric = 1.0, though the KL divergence loss weight was gradually ramped up to 1.6 over 200 000 iterations, starting at iteration 450 000. The model was trained using RMSprop with learning rate 10^-5 for 1 000 000 iterations with a batch size of 8.
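The KL weight schedule can be written as a small function. The text only says the weight is "gradually ramped up", so the linear shape used here is an assumption, and `kl_weight` is a hypothetical name.

```python
def kl_weight(iteration, start_weight=0.1, end_weight=1.6,
              ramp_start=450_000, ramp_length=200_000):
    """KL loss weight: constant, then a linear ramp (assumed), then constant again."""
    if iteration <= ramp_start:
        return start_weight
    progress = min((iteration - ramp_start) / ramp_length, 1.0)
    return start_weight + progress * (end_weight - start_weight)


# Weight at the start, mid-ramp, and end of training.
schedule = [kl_weight(i) for i in (0, 550_000, 1_000_000)]
```

Under this assumption the weight stays at 0.1 until iteration 450 000, passes 0.85 at iteration 550 000, and holds at 1.6 from iteration 650 000 onward.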
7、Experiments
Data: CrossDocked2020
Metrics: validity, novelty, uniqueness, fingerprint similarity, per-target diversity, shape similarity, molecular weight and drug-likeness, UFF energy minimization, Vina energy and predicted binding affinity, atom type distributions, bond length distributions, bond angle distributions, torsion angle distributions.
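Sampling methods:
Prior sampling: latent variables are drawn from a standard normal distribution and then decoded by the CVAE.
As the variability factor increases, diversity increases, while energetic stability and favorability in the binding pocket decrease.
Posterior sampling: a real protein-ligand complex is encoded into latent variable parameters, which are then sampled and decoded by the CVAE.
As the variability factor increases, molecular size and complexity increase.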
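The two sampling modes differ only in where the latent mean and standard deviation come from. A sketch, assuming the variability factor simply scales the standard deviation of the latent distribution (the function names are hypothetical):

```python
import numpy as np


def sample_prior(latent_dim, variability=1.0, rng=None):
    """Prior sampling: z ~ N(0, variability^2 I), independent of any reference ligand."""
    rng = rng if rng is not None else np.random.default_rng()
    return variability * rng.standard_normal(latent_dim)


def sample_posterior(mu, sigma, variability=1.0, rng=None):
    """Posterior sampling: z ~ N(mu, (variability * sigma)^2), around an encoded complex."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return mu + variability * sigma * rng.standard_normal(mu.shape)


rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.1, 0.2, 0.3])
z_prior = sample_prior(3, variability=1.0, rng=rng)
z_post = sample_posterior(mu, sigma, variability=1.0, rng=rng)
```

With a variability factor of 0, posterior sampling returns the encoded mean exactly; larger factors move samples further from the reference complex, which matches the trends noted for both modes.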