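In drug-target interaction (DTI) prediction models, the known DTIs are usually treated as the positive set, and the unknown DTIs, or a random subset of them, as the negative set.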
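The random-subset baseline can be sketched as follows. This is a minimal illustration, not code from any of the cited papers; the function name `sample_negatives`, its parameters, and the toy identifiers are all assumptions.

```python
import random

def sample_negatives(drugs, targets, known_dtis, n_negatives, seed=0):
    """Draw n_negatives unobserved (drug, target) pairs uniformly at random."""
    known = set(known_dtis)
    # Every pair that is not a recorded interaction is a candidate negative.
    unknown = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    if n_negatives > len(unknown):
        raise ValueError("not enough unobserved pairs to sample from")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(unknown, n_negatives)

# Toy data (hypothetical identifiers).
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2"]
known_dtis = [("D1", "T1"), ("D2", "T2")]
negatives = sample_negatives(drugs, targets, known_dtis, n_negatives=2)
```

Because unknown pairs may include undiscovered true interactions, negatives drawn this way can be mislabeled, which is what the strategies below try to reduce.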


2. The two strategies of Wang et al.


These two strategies aim at increasing the prediction accuracy in cross-validation and filtering out as many non-drug-target proteins as possible, respectively.

2.1 Strategy one

The training datasets have two classes. One is called the positive dataset (proteins that are known as DT proteins), and the other is called the negative dataset (proteins that are not DT proteins).

Xi = (xi1, xi2, …, xim): the m properties of protein i
Xj = (x1j, x2j, …, xnj): the vector of property j over the n proteins


2.2 Strategy two

The negative dataset (non-DT proteins) was chosen from the proteins whose mean values of protein sequence properties differ more from those of the positive data.
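One possible reading of this selection rule is sketched below: rank candidate proteins by how far their property vectors lie from the per-property means of the positive set. The distance (sum of absolute differences) and the name `rank_by_difference` are assumptions for illustration; the notes do not state which measure Wang et al. used, and in practice the properties would probably need to be put on a common scale first.

```python
from statistics import mean

def rank_by_difference(positive_props, candidate_props):
    """Order candidates from most to least different from the positive means."""
    n_props = len(next(iter(positive_props.values())))
    # Mean of each property over the positive (DT) proteins.
    pos_mean = [mean(v[j] for v in positive_props.values()) for j in range(n_props)]

    def difference(vec):
        # Assumed measure: sum of absolute differences from the positive means.
        return sum(abs(x - m) for x, m in zip(vec, pos_mean))

    return sorted(candidate_props,
                  key=lambda p: difference(candidate_props[p]),
                  reverse=True)

# Toy property vectors (hypothetical values).
positive_props = {"P1": [1.0, 2.0], "P2": [3.0, 4.0]}
candidate_props = {"C1": [2.0, 3.5], "C2": [10.0, 0.0], "C3": [4.0, 3.0]}
ranked = rank_by_difference(positive_props, candidate_props)
```

The proteins at the head of `ranked` would then be the preferred non-DT candidates.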


In their experiments, the authors assumed that each protein has a probability of 0.5 of being considered a negative sample.

3. Reverse selection based on guilt-by-association

Based on the “guilt-by-association” assumption that similar drugs tend to interact with similar targets, the existing methods have achieved remarkable performance.

Thus it is also reasonable to select reliable negative samples based on its converse negative proposition, i.e., a drug dissimilar to all drugs known to interact with a target is less likely to bind the target and vice versa.
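The drug side of this idea can be sketched as follows: an unobserved pair (d, t) is kept as a reliable negative only if d is dissimilar to every drug known to interact with t. The threshold, the similarity lookup keyed by `frozenset`, and the function name are assumptions; the symmetric target-side check ("vice versa") is omitted for brevity.

```python
def reliable_negatives(drugs, targets, known_dtis, drug_sim, threshold):
    """Unobserved pairs whose drug is dissimilar to all known binders of the target."""
    known = set(known_dtis)
    selected = []
    for t in targets:
        partners = [d for d, tt in known_dtis if tt == t]
        if not partners:
            continue  # no known drugs for t: no evidence either way
        for d in drugs:
            if (d, t) in known:
                continue
            # Highest similarity between d and any drug known to hit t.
            best = max(drug_sim[frozenset((d, p))] for p in partners)
            if best < threshold:
                selected.append((d, t))
    return selected

# Toy similarities in [0, 1] (hypothetical values).
drug_sim = {
    frozenset(("D1", "D2")): 0.9,
    frozenset(("D1", "D3")): 0.1,
    frozenset(("D2", "D3")): 0.2,
}
negatives = reliable_negatives(["D1", "D2", "D3"], ["T1"],
                               [("D1", "T1")], drug_sim, threshold=0.3)
```

Here D2 is too similar to the known binder D1 to be trusted as a negative for T1, while D3 is not.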

4. OCSVM: inferring the negative set from the positives

One-class Support Vector Machine (OCSVM) [11] has demonstrated its advantages for classification in the absence of positive or negative samples [12].

OCSVM requires one-class data only, thus it is an ideal technique to identify reliable negatives (i.e., outliers) for drug-target prediction where only positives are available.
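For reference, the widely used ν-formulation of the one-class SVM is given below. The notes do not say which formulation the cited work uses, so this is the standard textbook form rather than a claim about that paper. Here φ is the kernel feature map, k the kernel, n the number of training positives, and α_i the dual coefficients.

```latex
\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad
\langle w, \phi(x_i)\rangle \ge \rho - \xi_i,\qquad \xi_i \ge 0,

f(x) = \langle w, \phi(x)\rangle - \rho
     = \sum_{i=1}^{n}\alpha_i\,k(x_i, x) - \rho .
```

A sample with f(x) < 0 falls outside the learned region and is treated as an outlier. The parameter ν in (0, 1] upper-bounds the fraction of training points left outside, so a small ν keeps most of the known positives inside.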

5. Combining the converse negative proposition of guilt-by-association with OCSVM

In this work, we propose a method to construct highly-reliable negative samples for drug target prediction by a pairwise drug-target similarity measurement and OCSVM with a high-recall constraint.

On the one hand, we measure the pairwise similarity between every two drug-target interactions by combining the chemical similarity between their drugs and the Gene Ontology-based similarity between their targets. Then we calculate the accumulative similarity with all known drug-target interactions for every unobserved drug-target interaction.
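A minimal sketch of the accumulative similarity follows. The excerpt does not say how the chemical and Gene Ontology similarities are combined, so multiplying them is an assumption here, as are the function names and the nested-dictionary layout of the similarity tables.

```python
def pair_similarity(pair_a, pair_b, drug_sim, target_sim):
    """Similarity between two drug-target pairs."""
    (d1, t1), (d2, t2) = pair_a, pair_b
    # Assumption: combine drug and target similarity by multiplication.
    return drug_sim[d1][d2] * target_sim[t1][t2]

def accumulative_similarity(pair, known_dtis, drug_sim, target_sim):
    """Sum of the pair's similarity to every known interaction."""
    return sum(pair_similarity(pair, k, drug_sim, target_sim) for k in known_dtis)

# Toy similarity tables (hypothetical values).
drug_sim = {"D1": {"D1": 1.0, "D2": 0.5}, "D2": {"D1": 0.5, "D2": 1.0}}
target_sim = {"T1": {"T1": 1.0, "T2": 0.2}, "T2": {"T1": 0.2, "T2": 1.0}}
known_dtis = [("D1", "T1")]

acc_far = accumulative_similarity(("D2", "T2"), known_dtis, drug_sim, target_sim)
acc_near = accumulative_similarity(("D2", "T1"), known_dtis, drug_sim, target_sim)
```

The pair (D2, T2) resembles the known interaction less than (D2, T1) does, so it receives the lower accumulative similarity and is the better negative candidate.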

On the other hand, we obtain the signed distance using OCSVM learned from the known interactions with high recall (≥0.95) for each unobserved drug-target interaction. Unobserved drug-target pairs (DTPs) with lower accumulative similarities or lower signed distances are less likely to be positives, and thus more likely to be negatives.

Consequently, we compute the score for each unobserved drug-target interaction via averaging its accumulative similarity and signed distance after normalizing all accumulative similarities and signed distances to the range [0,1].

Unobserved interactions with lower scores are preferentially chosen as reliable negative samples for the classification algorithms. The number of negatives is determined by the negative sample ratio, which will be discussed in the experiment section.
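The scoring and selection step can be sketched as follows: min-max normalize both quantities to [0, 1], average them, and keep the lowest-scoring pairs. The function names and toy numbers are hypothetical, and the signed distances are taken as given inputs rather than computed by an actual OCSVM.

```python
def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values equal
    return [(v - lo) / (hi - lo) for v in values]

def select_negatives(pairs, acc_sims, signed_dists, n_negatives):
    """Return the n_negatives pairs with the lowest averaged score."""
    sims = min_max(acc_sims)
    dists = min_max(signed_dists)
    scores = [(s + d) / 2 for s, d in zip(sims, dists)]
    order = sorted(range(len(pairs)), key=lambda i: scores[i])
    return [pairs[i] for i in order[:n_negatives]]

# Toy inputs for three unobserved pairs (hypothetical values).
pairs = [("D1", "T2"), ("D2", "T1"), ("D3", "T1")]
acc_sims = [0.8, 0.1, 0.4]
signed_dists = [0.5, -1.5, -0.5]
chosen = select_negatives(pairs, acc_sims, signed_dists, n_negatives=2)
```

In this sketch `n_negatives` stands in for the count implied by the negative sample ratio.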


II. Negative set selection in the literature

1. Negative sets for drug pairs

Drug targets were extracted from DrugBank and drug pairs were classified as a “shared-target” pair if they had at least one target in common.
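The shared-target labeling amounts to a set-intersection test, sketched below. The mapping `drug_targets` is a hypothetical stand-in for target annotations extracted from DrugBank, and the function name is an assumption.

```python
from itertools import combinations

def label_drug_pairs(drug_targets):
    """Map each unordered drug pair to True if the two drugs share a target."""
    labels = {}
    for a, b in combinations(sorted(drug_targets), 2):
        # Shared-target (ST) pair: the target sets intersect.
        labels[(a, b)] = bool(drug_targets[a] & drug_targets[b])
    return labels

# Toy annotations (hypothetical identifiers).
drug_targets = {"D1": {"T1", "T2"}, "D2": {"T2"}, "D3": {"T3"}}
labels = label_drug_pairs(drug_targets)
```

Pairs labeled False form the non-ST class, which serves as the negative set for drug pairs.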

We used fivefold cross validation to split our set of drug pairs into a test and training set containing 20% and 80% of the drug pairs respectively.

We sub-sampled the two classes (ST and non-ST drug pairs) and required the ratio of true positives (ST pairs) to true negatives (non-ST pairs) to remain the same as in the total set.
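Ratio-preserving sub-sampling can be sketched by drawing the same fraction from each class. The function name, the `fraction` parameter, and the toy class sizes are assumptions for illustration.

```python
import random

def subsample_keep_ratio(st_pairs, non_st_pairs, fraction, seed=0):
    """Sample the same fraction of each class so the ST : non-ST ratio is kept."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n_st = round(len(st_pairs) * fraction)
    n_non = round(len(non_st_pairs) * fraction)
    return rng.sample(st_pairs, n_st), rng.sample(non_st_pairs, n_non)

# Toy classes with a 1 : 4 ratio (hypothetical identifiers).
st_pairs = [("S", i) for i in range(10)]
non_st_pairs = [("N", i) for i in range(40)]
st_sub, non_sub = subsample_keep_ratio(st_pairs, non_st_pairs, fraction=0.5)
```

With class sizes that do not divide evenly, rounding makes the ratio only approximately equal to that of the full set.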

2. Negative sets for drug-target pairs


