Choosing the negative set in machine learning: drug-target interactions (DTIs)

已经变秃何时变强

Categories: #Data selection, #Computational chemistry. Tags: machine learning

Article link: https://blog.csdn.net/weixin_51719414/article/details/111217864

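Contents

Preface
I. Existing methods for selecting a negative set
II. Negative-set selection in the literature
- 1. Negative set for drug pairs
- 2. Negative set for drug-target pairs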

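Preface

In machine learning, the choice of the negative set (the negative samples) affects the accuracy of the results.
Highly reliable negative samples help a classification model learn a clear decision boundary, which in turn helps improve performance.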

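I. Existing methods for selecting a negative set

1. Unknown DTIs

In drug-target interaction prediction models, the known DTIs are usually taken as the positive set, and the unknown DTIs, or a random subset of them, as the negative set.

Drawback: the unknown DTIs may contain true interactions that have not been discovered yet, and an inaccurate negative set can substantially reduce the accuracy of the results.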
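
The random-subset baseline described above can be sketched as follows. This is a minimal illustration, not code from the cited papers; the function name `sample_random_negatives` and the toy drug and target identifiers are hypothetical.

```python
import random
from itertools import product


def sample_random_negatives(drugs, targets, known_pairs, n_negatives, seed=0):
    """Draw n_negatives unobserved (drug, target) pairs uniformly at random."""
    known = set(known_pairs)
    # Every pair that is not a known interaction is treated as "unknown".
    unknown = [pair for pair in product(drugs, targets) if pair not in known]
    rng = random.Random(seed)
    return rng.sample(unknown, min(n_negatives, len(unknown)))


# Toy data (hypothetical identifiers).
drugs = ["d1", "d2", "d3"]
targets = ["t1", "t2"]
positives = [("d1", "t1"), ("d2", "t2")]
negatives = sample_random_negatives(drugs, targets, positives, n_negatives=2)
```

Nothing in this procedure checks whether a sampled pair is a real but undiscovered interaction, which is exactly the drawback noted above.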

ref 1: Chen R, Liu X, Jin S, Lin J, Liu J. Machine Learning for Drug-Target Interaction Prediction. Molecules. 2018;23(9):2208. Published 2018 Aug 31. doi:10.3390/molecules23092208
ref 2: Wang JT, Liu W, Tang H, et al. Screening drug target proteins based on sequence information. J Biomed Inform. 2014;49:269–74.

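2. The two strategies of Wang et al.

Aim: to improve the prediction accuracy in cross-validation (strategy 1) and to filter out as many non-drug-target proteins as possible (strategy 2). In the authors' words:

These two strategies aim at increasing the prediction accuracy in crossvalidation and filtering out as many non-drug-target proteins as possible, respectively.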

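2.1 Strategy 1

The training datasets have two classes. One is called the positive dataset (proteins that are known as DT proteins), and the other is called the negative dataset (proteins that are not DT proteins).

The deviation of a (drug-target) protein is defined using the following notation:

x_ij: the j-th property of the i-th (drug-target) protein
X_i = (x_i1, x_i2, …, x_im): the m properties of protein i
X_j = (x_1j, x_2j, …, x_nj): the vector of property j over the n proteins

In the authors' experiments, proteins with a result > 0.42 (presumably the deviation defined above) were selected as the negative set, because:
[image not available]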

2.2 Strategy 2

The negative dataset (non-DT proteins) was chosen from the proteins whose mean values of protein sequence properties have a larger difference from the positive data.

The probability that an unknown protein i is taken as a negative sample is:
[image not available]

In the authors' experiments, each protein was assumed to have a probability of 0.5 of being considered a negative sample.

ref: Wang JT, Liu W, Tang H, et al. Screening drug target proteins based on sequence information. J Biomed Inform. 2014;49:269–74.

3. Reverse selection based on guilt-by-association

Based on the “guilt-by-association” assumption that similar drugs tend to interact with similar targets, the existing methods have achieved remarkable performance.

Thus it is also reasonable to select reliable negative samples based on its converse negative proposition, i.e., a drug dissimilar to all drugs known to interact with a target is less likely to bind the target and vice versa.
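
One way to read this in code: for a given target, rank the drugs with no known interaction by their highest similarity to any drug that is known to bind the target, and take the least similar ones as negative candidates. The sketch below is only an illustration under that assumption; `rank_negative_candidates`, the similarity dictionary and its values are hypothetical, and the paper's actual scoring is the pairwise measure described in section 5.

```python
def rank_negative_candidates(target, drugs, known_pairs, drug_sim):
    """Rank drugs without a known interaction with `target`, least similar first."""
    binders = [d for d, t in known_pairs if t == target]
    candidates = [d for d in drugs if d not in binders]

    def max_sim(drug):
        # Highest similarity of this candidate to any known binder of the target.
        return max((drug_sim[frozenset((drug, b))] for b in binders), default=0.0)

    return sorted(candidates, key=max_sim)


# Toy similarity values in [0, 1] (made up for illustration).
drugs = ["d1", "d2", "d3", "d4"]
known_pairs = [("d1", "t1"), ("d2", "t1")]
drug_sim = {
    frozenset(("d1", "d2")): 0.80,
    frozenset(("d1", "d3")): 0.90,
    frozenset(("d2", "d3")): 0.20,
    frozenset(("d1", "d4")): 0.10,
    frozenset(("d2", "d4")): 0.15,
    frozenset(("d3", "d4")): 0.30,
}
ranked = rank_negative_candidates("t1", drugs, known_pairs, drug_sim)
```

Here `d4` is dissimilar to every known binder of `t1`, so it is ranked ahead of `d3`, which closely resembles `d1`.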

ref: Zheng Y, Peng H, Zhang X, Zhao Z, Gao X, Li J. Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces. BMC Bioinformatics. 2019;20(Suppl 23):605. Published 2019 Dec 27. doi:10.1186/s12859-019-3238-y

4. OCSVM: inferring the negative set from the positives

One-class Support Vector Machine (OCSVM) [11] has demonstrated its advantages for classification in the absence of positive or negative samples [12].

OCSVM requires one-class data only, thus it is an ideal technique to identify reliable negatives (i.e., outliers) for drug-target prediction where only positives are available.
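
A minimal sketch of the idea using scikit-learn's `OneClassSVM`, assuming each drug-target pair has already been turned into a numeric feature vector. The random toy features, the RBF kernel and `nu=0.05` are assumptions made for illustration, not settings taken from the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Toy stand-ins for the feature vectors of known (positive) drug-target pairs.
X_pos = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
# Two toy unobserved pairs: one close to the positives, one far away.
X_unobserved = np.array([[0.1, -0.2], [10.0, 10.0]])

# nu is an upper bound on the fraction of training errors, so nu=0.05
# aims at keeping roughly 95% of the known positives inside the boundary.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_pos)

# Fraction of known positives that the model accepts (recall on the training set).
train_recall = float(np.mean(model.decision_function(X_pos) >= 0))

# Signed distance to the boundary: negative values fall outside the learned region.
signed_distance = model.decision_function(X_unobserved)
most_negative_like = int(np.argmin(signed_distance))
```

The unobserved pair with the lowest signed distance is the strongest candidate for a reliable negative.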

ref 1: Zheng Y, Peng H, Zhang X, Zhao Z, Gao X, Li J. Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces. BMC Bioinformatics. 2019;20(Suppl 23):605. Published 2019 Dec 27. doi:10.1186/s12859-019-3238-y
ref 2: Xiao Y, Wang H, Xu W. Parameter selection of gaussian kernel for one-class svm. IEEE Trans Cybernet. 2014;45(5):941–53. doi:10.1109/TCYB.2014.2340433
ref 3: Khan SS, Madden MG. A survey of recent trends in one class classification. In: Irish Conference on Artificial Intelligence and Cognitive Science. Dublin: Springer; 2009.

5. Combining the contrapositive of guilt-by-association with OCSVM

In this work, we propose a method to construct highly-reliable negative samples for drug target prediction by a pairwise drug-target similarity measurement and OCSVM with a high-recall constraint.

On one hand, we measure the pair-wise similarity between every two drug-target interactions by combining the chemical similarity between their drugs and the Gene Ontology-based similarity between their targets. Then we calculate the accumulative similarity with all known drug-target interactions for every unobserved drug-target interaction.

On the other hand, we obtain the signed distance using OCSVM learned from the known interactions with high recall (≥0.95) for each unobserved drug-target interaction. Unobserved DTPs (drug-target pairs) with lower accumulative similarities or lower signed distances are less likely to be positives, thus of high-probability to be negatives.

Consequently, we compute the score for each unobserved drug-target interaction via averaging its accumulative similarity and signed distance after normalizing all accumulative similarities and signed distances to the range [0,1].

Unobserved interactions with lower scores are preferentially served as reliable negative samples for the classification algorithms. The specific negative number is determined by the negative sample ratio which will be discussed in the experiment section.
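
The final scoring step (normalise both quantities to [0,1], average them, keep the lowest scores) can be sketched as below. It assumes the accumulative similarities and signed distances have already been computed. The function names, the toy numbers, and the reading of the negative sample ratio as "negatives per positive" are assumptions, not details confirmed by the paper.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_reliable_negatives(pairs, acc_similarity, signed_distance,
                              n_positives, neg_ratio=1.0):
    """Return the lowest-scoring unobserved pairs and all scores."""
    sim_n = min_max(acc_similarity)
    dist_n = min_max(signed_distance)
    # Score = mean of the two normalised quantities; low means "unlike positives".
    scores = [(s + d) / 2 for s, d in zip(sim_n, dist_n)]
    n_neg = min(len(pairs), int(round(neg_ratio * n_positives)))
    order = sorted(range(len(pairs)), key=lambda i: scores[i])
    return [pairs[i] for i in order[:n_neg]], scores


# Toy unobserved pairs with made-up similarity and distance values.
pairs = [("d1", "t2"), ("d2", "t1"), ("d3", "t1"), ("d3", "t2")]
acc_similarity = [2.0, 0.5, 1.0, 3.0]
signed_distance = [0.4, -0.6, -0.1, 0.9]
negatives, scores = select_reliable_negatives(
    pairs, acc_similarity, signed_distance, n_positives=2, neg_ratio=1.0
)
```

With two positives and a ratio of 1.0, the two pairs with the lowest combined score are returned as negatives.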

The paper's code and result data can be downloaded from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6933655/

ref: Zheng Y, Peng H, Zhang X, Zhao Z, Gao X, Li J. Old drug repositioning and new drug discovery through similarity learning from drug-target joint feature spaces. BMC Bioinformatics. 2019;20(Suppl 23):605. Published 2019 Dec 27. doi:10.1186/s12859-019-3238-y

II. Negative-set selection in the literature

1. Negative set for drug pairs

Drug targets were extracted from DrugBank and drug pairs were classified as a “shared-target” pair if they had at least one target in common.
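
The labelling rule can be sketched as follows, assuming the drug-to-target mapping is already available as a dictionary of sets. The mapping below is a toy example with hypothetical identifiers, not data extracted from DrugBank.

```python
from itertools import combinations


def label_drug_pairs(drug_targets):
    """Label every unordered drug pair: True if the two drugs share a target."""
    labels = {}
    for a, b in combinations(sorted(drug_targets), 2):
        # A non-empty intersection means at least one target in common.
        labels[(a, b)] = bool(drug_targets[a] & drug_targets[b])
    return labels


# Toy mapping (hypothetical drugs and targets).
drug_targets = {
    "drugA": {"T1", "T2"},
    "drugB": {"T1"},
    "drugC": {"T3"},
}
pair_labels = label_drug_pairs(drug_targets)
```

Pairs labelled `False` form the non-shared-target class, which serves as the negative set in this setting.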

We used fivefold cross validation to split our set of drug pairs into a test and training set containing 20% and 80% of the drug pairs respectively.

We sub-sampled the two classes (ST and non-ST drug pairs) and required the ratio of true positives (ST pairs) to true negatives (non-ST pairs) to remain the same as the total set.
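
A class-ratio-preserving fivefold split can be sketched in plain Python as below. This illustrates stratified splitting in general and is not the authors' code; `stratified_folds` and the toy label counts are hypothetical.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign item indices to k folds so each fold keeps a similar class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Toy labels: 10 shared-target (1) and 40 non-shared-target (0) drug pairs.
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels, k=5)

splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_idx, test_idx))
```

Each fold then serves once as the 20% test set, with the remaining 80% used for training, and the ratio of positives to negatives in every fold matches the full set.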

ref: Madhukar NS, Khade PK, Huang L, et al. A Bayesian machine learning approach for drug target identification using diverse data types. Nat Commun. 2019;10(1):5221. Published 2019 Nov 19. doi:10.1038/s41467-019-12928-6