DeepDTA Model Walkthrough

VictoryZhou_

Last modified 2023-04-03 21:55:29

First published 2022-11-15 22:58:01

Article link: https://blog.csdn.net/VictoryZhou_/article/details/127869958

1. Datasets

(1) Davis

Comprehensive analysis of kinase inhibitor selectivity @ Nat. Biotechnol. 2011
442 proteins, 68 ligands, 30,056 interactions
Dissociation constant ($K_d$) → $pK_d = -\log_{10}\left(\frac{K_d}{10^9}\right)$
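
The $pK_d$ transform above can be sketched in a few lines of Python. This is only an illustration: the function name `kd_nm_to_pkd` is made up for this example, and it assumes the $K_d$ values are given in nanomolar, which is what the division by $10^9$ suggests.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Convert a dissociation constant (assumed to be in nM) to pKd.

    Implements pKd = -log10(Kd / 1e9), i.e. Kd is first converted to molar.
    """
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm / 1e9)


# A weak binder with Kd = 10000 nM maps to a pKd of about 5;
# a stronger binder (smaller Kd) gets a larger pKd.
print(kd_nm_to_pkd(10000.0))
print(kd_nm_to_pkd(10.0))
```

The log transform compresses the wide range of raw $K_d$ values and makes larger numbers mean stronger binding.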

(2) KIBA

Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis @ J. Chem. Inf. Model. 2014
467 targets, 52,498 drugs
Filtering: keep only drugs and targets with at least 10 interactions → 229 proteins, 2,111 drugs, 118,254 interactions
KIBA score: combines Ki, Kd, and IC50 measurements. The score is constructed as in: SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines @ J. Cheminform. 2017
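
The "at least 10 interactions" filter can be sketched as follows. This is a minimal illustration, not the original preprocessing code: the function name `filter_min_interactions` is hypothetical, and a single filtering pass over (drug, target) pairs is an assumption, since the notes do not say whether the filter was applied once or repeated until stable.

```python
from collections import Counter


def filter_min_interactions(pairs, min_count=10):
    """Keep (drug, target) pairs whose drug AND target each occur
    in at least `min_count` pairs. Single pass (an assumption)."""
    drug_counts = Counter(drug for drug, _ in pairs)
    target_counts = Counter(target for _, target in pairs)
    return [
        (drug, target)
        for drug, target in pairs
        if drug_counts[drug] >= min_count and target_counts[target] >= min_count
    ]


# Toy data with a threshold of 2 instead of 10 to keep the example small.
toy_pairs = [("d1", "t1"), ("d1", "t2"), ("d2", "t1"), ("d2", "t2"), ("d3", "t3")]
print(filter_min_interactions(toy_pairs, min_count=2))
```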

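2. Input representation

(1) SMILES

e.g. [C N = C = O] → [1 3 63 1 63 5]
64 labels; the fixed maximum length is 85 (Davis) and 100 (KIBA). These lengths were chosen from the length distributions so that the maximum lengths cover 80% of the proteins and 90% of the compounds.

Sequences longer than the maximum length are truncated; shorter sequences are padded with 0.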

(2) Protein sequences

Label encoding
25 categories; the fixed maximum length is 1200 (Davis) and 1000 (KIBA)
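
Label encoding with truncation and zero-padding works the same way for SMILES strings and protein sequences, and can be sketched as below. The vocabulary here is a hypothetical mini-dictionary that contains only the four symbols from the SMILES example above (the full alphabet has 64 labels). Treating every character as one symbol is also a simplifying assumption.

```python
# Hypothetical mini-vocabulary: only the symbols from the example above.
# Label 0 is never assigned to a symbol because it is used for padding.
SMILES_VOCAB = {"C": 1, "N": 3, "O": 5, "=": 63}


def encode_sequence(sequence, vocab, max_len):
    """Map each character to its integer label, truncate to max_len,
    and right-pad with zeros so the output always has length max_len."""
    labels = [vocab[ch] for ch in sequence[:max_len]]
    return labels + [0] * (max_len - len(labels))


# Short sequence: padded with zeros.
print(encode_sequence("CN=C=O", SMILES_VOCAB, max_len=10))
# Long sequence relative to max_len: truncated.
print(encode_sequence("CN=C=O", SMILES_VOCAB, max_len=4))
```

For the real datasets, `max_len` would be 85 or 100 for SMILES and 1200 or 1000 for proteins, as listed above.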

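3. Model

DeepDTA
Activation function: ReLU, $g(x) = \max(0, x)$
Regression task → MSE loss function

$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n (P_i - Y_i)^2$

$P$ is the prediction vector, $Y$ is the vector of true values, and $n$ is the number of samples.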
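
The MSE formula can be checked by hand with a small plain-Python sketch. The function name is made up for this example; in practice a framework's built-in loss would be used instead.

```python
def mean_squared_error(predictions, targets):
    """MSE = (1/n) * sum_i (P_i - Y_i)^2 over n samples."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / n


# Two exact predictions and one that is off by 2: (0 + 0 + 4) / 3
print(mean_squared_error([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
```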

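Trained for 100 epochs with mini-batch size = 256

Optimizer: Adam with the default learning rate of 0.001

A Keras Embedding layer represents each symbol with a 128-dimensional dense vector

Davis inputs: (85, 128) and (1200, 128)
KIBA inputs: (100, 128) and (1000, 128)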
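
To see where shapes such as (85, 128) come from, here is a plain-Python stand-in for what an embedding layer does: it looks up one 128-dimensional row per integer label. This is not Keras code. The helper names and the random initialization range are arbitrary choices for illustration, and in the real model the table entries are learned during training.

```python
import random


def make_embedding_table(num_labels, dim, seed=0):
    """Create a lookup table with one row per label, plus row 0 for padding."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_labels + 1)]


def embed(label_sequence, table):
    """Replace every integer label with its dense vector."""
    return [table[label] for label in label_sequence]


table = make_embedding_table(num_labels=64, dim=128)
# The example SMILES labels from above, zero-padded to length 85 (Davis).
smiles_labels = [1, 3, 63, 1, 63, 5] + [0] * 79
embedded = embed(smiles_labels, table)
print(len(embedded), len(embedded[0]))  # sequence length, embedding size
```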

Model performance is measured with the Concordance Index (CI) and compared against KronRLS and SimBoost

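4. Experiments and results

(1) Baselines

KronRLS aims to minimize the following function, where $f$ is the prediction function:
$J(f) = \sum_{i=1}^m (y_i - f(x_i))^2 + \lambda \|f\|_k^2$
$\|f\|_k^2$ is the norm of $f$ associated with the kernel function $k$; $\lambda > 0$ is a user-defined regularization hyperparameter.
The minimizer of $J(f)$ can be written as:
$f(x) = \sum_{i=1}^m a_i k(x, x_i)$
where $k$ is the kernel function.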
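
For a fixed kernel matrix $K$ with entries $k(x_i, x_j)$, the coefficients $a_i$ of a regularized least squares minimizer satisfy the linear system $(K + \lambda I) a = y$. The sketch below solves that system in plain Python for a generic kernel. It is only an illustration of kernel RLS: KronRLS itself builds its pairwise kernel as a Kronecker product of a drug kernel and a target kernel, which is not reproduced here, and all function names are made up for this example.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_kernel_rls(K, y, lam):
    """Return coefficients a that solve (K + lam * I) a = y."""
    n = len(y)
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve_linear_system(A, y)


def predict(a, kernel_row):
    """f(x) = sum_i a_i * k(x, x_i); kernel_row holds k(x, x_i) for all i."""
    return sum(a_i * k_i for a_i, k_i in zip(a, kernel_row))


# Toy example: identity kernel matrix, two training points.
K = [[1.0, 0.0], [0.0, 1.0]]
a = fit_kernel_rls(K, y=[2.0, 4.0], lam=1.0)
print(a, predict(a, K[0]))
```

With $\lambda = 1$ the fitted values are shrunk toward zero relative to the targets, which is the effect of the regularization term.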

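Compounds are represented by a similarity matrix computed with the PubChem structure clustering server (Pubchem Sim) @ http://pubchem.ncbi.nlm.nih.gov
For proteins, the Smith-Waterman algorithm is used to build the protein similarity matrix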

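SimBoost
Features are constructed for drugs, targets, and drug-target pairs
These features are fed to a supervised learning method (gradient boosting regression trees)

For any drug-target pair $dt_i$, the predicted binding affinity score $\bar{y}_i$ is:
$\bar{y}_i = \theta(dt_i) = \sum_{m=1}^M f_m(dt_i), \quad f_m \in F$
$M$ is the number of regression trees and $F$ is the space of all possible trees

A regularized objective function is used to learn the parameters of the trees $f_m$:

$R(\theta) = \sum_i l(y_i, \bar{y}_i) + \sum_m \alpha(f_m)$

$l$ is the loss function, which measures the difference between the actual affinity $y_i$ and the predicted value $\bar{y}_i$;
$\alpha$ is a tuning parameter that controls the complexity of the model
As for KronRLS, compounds are represented by a similarity matrix computed with the PubChem structure clustering server (Pubchem Sim) @ http://pubchem.ncbi.nlm.nih.gov
For proteins, the Smith-Waterman algorithm is used to build the protein similarity matrix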
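
The additive prediction and the regularized objective can be sketched as below. Everything here is a toy stand-in rather than SimBoost itself: the two "trees" are hand-written stumps instead of learned regression trees, the squared-error loss and the constant per-tree penalty are assumptions chosen for illustration, and all names are hypothetical.

```python
def predict_affinity(trees, features):
    """Prediction = sum over m of f_m(dt_i), one term per tree."""
    return sum(tree(features) for tree in trees)


def regularized_objective(trees, dataset, loss, penalty):
    """R = sum_i l(y_i, prediction_i) + sum_m penalty(f_m)."""
    data_term = sum(loss(y, predict_affinity(trees, x)) for x, y in dataset)
    complexity_term = sum(penalty(tree) for tree in trees)
    return data_term + complexity_term


# Two hypothetical decision stumps acting on a 2-feature pair description.
def stump_a(x):
    return 1.0 if x[0] > 0.5 else 0.0


def stump_b(x):
    return 2.0 if x[1] > 0.5 else 0.5


def squared_loss(y, y_pred):
    return (y - y_pred) ** 2


def constant_penalty(tree):
    return 0.1  # every tree costs the same in this toy setup


trees = [stump_a, stump_b]
dataset = [((0.9, 0.9), 3.0), ((0.1, 0.1), 1.0)]
print(predict_affinity(trees, (0.9, 0.9)))
print(regularized_objective(trees, dataset, squared_loss, constant_penalty))
```

Gradient boosting adds trees one at a time, each chosen to reduce this objective given the trees already in the ensemble.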

(2) Evaluation method

Concordance Index (CI)

$CI = \frac{1}{Z} \sum_{\delta_i > \delta_j} h(b_i - b_j)$

where $b_i$ is the prediction value for the larger affinity $\delta_i$, $b_j$ is the prediction value for the smaller affinity $\delta_j$, $Z$ is a normalization constant, and $h(x)$ is the step function:
$h(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0.5 & \text{if } x = 0 \\ 0 & \text{if } x < 0 \end{cases}$
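
A direct, quadratic-time implementation of the CI formula might look like the sketch below. It assumes that $Z$ is the number of pairs with $\delta_i > \delta_j$, so that the score lies between 0 and 1. The function names are made up for this example.

```python
def step(x):
    """h(x): 1 if x > 0, 0.5 if x == 0, 0 if x < 0."""
    if x > 0:
        return 1.0
    if x == 0:
        return 0.5
    return 0.0


def concordance_index(affinities, predictions):
    """CI = (1/Z) * sum over pairs with delta_i > delta_j of h(b_i - b_j).

    Z is taken to be the number of such pairs (an assumption).
    """
    total = 0.0
    num_pairs = 0
    n = len(affinities)
    for i in range(n):
        for j in range(n):
            if affinities[i] > affinities[j]:
                total += step(predictions[i] - predictions[j])
                num_pairs += 1
    if num_pairs == 0:
        raise ValueError("need at least one pair with different affinities")
    return total / num_pairs


true_affinities = [5.0, 6.5, 8.0]
print(concordance_index(true_affinities, [0.1, 0.4, 0.9]))  # same ranking
print(concordance_index(true_affinities, [0.9, 0.4, 0.1]))  # reversed ranking
print(concordance_index(true_affinities, [0.3, 0.3, 0.3]))  # all ties
```

Only the ordering of the predictions matters, so CI rewards a model that ranks pairs correctly even if the predicted values are on a different scale from the true affinities.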