Protein binding residues: NucBind (SuHong)

Article link: https://blog.csdn.net/Bad_girl_/article/details/105359570

Presentation notes, 2020.3.12
Paper:
Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods

1. Benchmark datasets

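Data download link: http://yanglab.nankai.edu.cn/NucBind/benchmark/
The paper uses three datasets, YFK16, YK17, and MW15, taken from three publications (Miao and Westhof, 2015; Yan et al., 2016; Yan and Kurgan, 2017). All of the data were originally obtained from the PDB (Rose et al., 2017).
Summary of the datasets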

About the 3.5/5 Å cutoffs:
A residue is defined as a DNA-/RNA-binding residue if any of the atomic distances between this residue and the DNA/RNA molecule is smaller than a specified distance cutoff. Two cutoffs were used in the above datasets: 3.5 and 5 Å.
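The definition above can be sketched in a few lines. This is a minimal illustration, not the paper's annotation pipeline: the function name `binding_residues`, the input layout, and the toy coordinates are all assumptions made for the example.

```python
import math

def binding_residues(protein_atoms, nucleic_atoms, cutoff=3.5):
    """Return the IDs of residues with at least one atom closer than
    `cutoff` (in Å) to any atom of the DNA/RNA molecule.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples
    nucleic_atoms: iterable of (x, y, z) tuples
    """
    nucleic_atoms = list(nucleic_atoms)
    binding = set()
    for residue_id, coord in protein_atoms:
        if residue_id in binding:
            continue  # this residue is already labeled as binding
        if any(math.dist(coord, na) < cutoff for na in nucleic_atoms):
            binding.add(residue_id)
    return binding

# Toy coordinates in Å (hypothetical, for illustration only).
protein_atoms = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.5, 0.0, 0.0)),
    (2, (8.0, 0.0, 0.0)),
    (3, (20.0, 0.0, 0.0)),
]
nucleic_atoms = [(4.0, 0.0, 0.0)]

print(sorted(binding_residues(protein_atoms, nucleic_atoms, cutoff=3.5)))  # [1]
print(sorted(binding_residues(protein_atoms, nucleic_atoms, cutoff=5.0)))  # [1, 2]
```

Residue 2 sits exactly 4 Å from the nucleic-acid atom, so it is labeled as binding only under the 5 Å cutoff. This is why the same structures can give different label sets for the two cutoffs.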

1.1 YFK16(Yan et al., 2016)

Structures released before 2010 were used for training, and those released after 2010 for testing.
The sequence identity between the training and test proteins is less than 30%.
A unique feature of this dataset is that the binding annotations for its structures were enriched by transferring annotations from other similar proteins in the PDB.
With the 3.5 Å cutoff there are 309 DNA-training and 158 RNA-training sequences, and 47 DNA-test and 48 RNA-test sequences; with the 5 Å cutoff there are 311 DNA-training and 158 RNA-training sequences, and 48 DNA-test and 17 RNA-test sequences.

1.2 YK17(Yan and Kurgan, 2017)

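Similar to YFK16 but containing more structures; the training/test split rule and the sequence-identity threshold are the same as for YFK16.
Only the 3.5 Å cutoff is used in this dataset: 339 DNA-training and 161 RNA-training sequences, and 49 DNA-test and 33 RNA-test sequences.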

1.3 MW15(Miao and Westhof, 2015)

Collected after 2014, so it is more recent than the two datasets above.
The sequence identity between this dataset and the others used for training the assessed methods is less than 25%.
It is an independent test dataset (test set only, no training set) that includes 31 DNA-binding proteins and 15 RNA-binding proteins.

2. Overall architecture of the NucBind algorithm

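NucBind combines two prediction methods, COACH-D and SVMnuc.
COACH-D is mainly based on template alignment: the sequence is submitted to I-TASSER and the prediction is returned directly. SVMnuc is based on extracting sequence features and training a model: the raw sequence is first run through three programs, PSI-BLAST, PSIPRED, and HHblits, to extract sequence features, and these features are then fed into SVMnuc for prediction.
NucBind compares the two methods and takes the output of whichever one predicts better as the final prediction (both methods are described in more detail below).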
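One way to picture the selection step is sketched below. These notes do not say how NucBind decides which method is "better" for a given query, so the confidence score, the 0.5 threshold, and the function name `combine_predictions` are purely hypothetical stand-ins, not the published rule.

```python
def combine_predictions(template_pred, template_confidence, svm_pred, threshold=0.5):
    """Pick one of two per-residue predictions for the same protein.

    ASSUMPTION: the template-based result carries a confidence score and is
    preferred when that score reaches `threshold`; otherwise the ab initio
    (SVM) result is used. Both the score and the threshold are hypothetical.
    """
    if template_pred is not None and template_confidence >= threshold:
        return template_pred
    return svm_pred

# 1 = predicted binding residue, 0 = non-binding (toy labels).
template_pred = [0, 1, 1, 0, 0]
svm_pred = [0, 1, 0, 0, 1]

print(combine_predictions(template_pred, 0.8, svm_pred))  # template-based result
print(combine_predictions(template_pred, 0.2, svm_pred))  # SVM result
print(combine_predictions(None, 0.0, svm_pred))           # no template found
```

The point of the sketch is only the control flow: a template-based prediction is used when it can be trusted, and the sequence-based prediction covers the remaining cases.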

Template-based prediction by COACH-D:

COACH-D is a general-purpose template-based method for predicting protein–ligand binding residues, which combines five individual methods.
The prediction is made by transferring the binding residues from homologous ligand-binding templates in the BioLiP database (Yang et al., 2013b).
To make a fair comparison with other methods, all structure templates and ligand-binding templates with > 30% sequence identity to the query sequence were excluded, in the structure modeling and binding residue prediction procedures, respectively.
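The exclusion rule in the last point amounts to a simple filter once the sequence identities are known. The sketch below assumes the identities have already been computed by some alignment step; the template IDs, the dictionary layout, and the function name `filter_templates` are made up for illustration.

```python
def filter_templates(templates, max_identity=0.30):
    """Drop templates whose sequence identity to the query exceeds
    `max_identity` (0.30 = 30%), keeping only more distant homologs."""
    return [t for t in templates if t["identity"] <= max_identity]

# Hypothetical templates with precomputed identity to the query sequence.
templates = [
    {"id": "tmplA", "identity": 0.92},
    {"id": "tmplB", "identity": 0.31},
    {"id": "tmplC", "identity": 0.30},
    {"id": "tmplD", "identity": 0.12},
]

kept = filter_templates(templates)
print([t["id"] for t in kept])  # ['tmplC', 'tmplD']
```

Removing close homologs keeps the template-based method from simply copying the annotation of a near-identical protein, which would make the comparison with sequence-only methods unfair.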

3. Feature design for the ab initio method SVMnuc

As binding residues are evolutionarily more conserved than other residues, the protein sequence is first submitted to three programs to generate three complementary sequence profiles.
A comprehensive set of features is extracted from these profiles to encode each residue in a protein.
The resulting feature vectors are finally fed into a support vector machine (SVM) for the prediction of DNA-/RNA-binding residues.

The three programs, PSI-BLAST, PSIPRED, and HHblits, are described below (let L denote the number of residues in a protein).

• PSI-BLAST (dimension L*20)

The query sequence is searched by the sequence-profile alignment tool PSI-BLAST (with parameters ‘-j3 -b 0.001’) against the NCBI non-redundant sequence database, and the resulting sequence profile is represented as a position-specific scoring matrix (PSSM) of dimension L*20. The PSSM values are then normalized to the range (0, 1) with a logistic function.
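The normalization step can be illustrated as follows, assuming the standard logistic function 1 / (1 + e^(-x)) is the one meant. The toy PSSM rows are invented values, not real PSI-BLAST output, and `normalize_pssm` is a name chosen for this example.

```python
import math

def logistic(x):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def normalize_pssm(pssm):
    """Apply the logistic function to every entry of an L x 20 PSSM."""
    return [[logistic(score) for score in row] for row in pssm]

# Two residues (L = 2), 20 columns each; scores are hypothetical.
pssm = [
    [-3, -1, 0, 2, 5] + [0] * 15,
    [4, -2, -2, 0, 1] + [-1] * 15,
]

normalized = normalize_pssm(pssm)
print(len(normalized), len(normalized[0]))  # 2 20
print(round(normalized[0][2], 3))           # a raw score of 0 maps to 0.5
```

A raw score of 0 maps to 0.5, strongly negative scores approach 0, and strongly positive scores approach 1, so all 20 columns end up on the same bounded scale.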

• PSIPRED (L*3)

PSIPRED, one of the most popular tools, was applied to predict the three-state secondary structure (SS) profile. This profile provides the probabilities of each residue folding into one of three states: alpha, beta, and random coil. Thus, the dimension of the SS profile is L*3.

• HHblits (L*20)

Profile hidden Markov models (HMMs) have been used successfully for protein structure prediction based on HMM-HMM alignment.

In this study, the HMM profile was generated by searching the query sequence against the database uniprot20_2015_06 using HHblits. The dimension of the HMM profile is L*30, but only the first 20 columns are used in this study.

The integers in the HMM profile are equal to 1000 times the negative logarithm (base 2) of the amino acid frequency. Thus, each element x in the HMM profile is converted to a frequency by the inverse transform 2^(-0.001*x).
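As a small worked example of this conversion, the sketch below turns a few HMM profile entries back into frequencies. It assumes the entries are read as text tokens from an .hhm file, where an asterisk stands for a frequency of zero; the function name `hhm_to_frequency` and the sample row are illustrative only.

```python
def hhm_to_frequency(value):
    """Convert one HMM profile entry to an amino acid frequency.

    The stored integer is -1000 * log2(frequency), so the inverse is
    2 ** (-value / 1000). An asterisk marks a frequency of zero.
    """
    if value == "*":
        return 0.0
    return 2.0 ** (-int(value) / 1000.0)

# Hypothetical entries from one row of the profile (first 5 of 20 columns).
row = ["0", "1000", "2000", "*", "3322"]
print([round(hhm_to_frequency(v), 3) for v in row])
# [1.0, 0.5, 0.25, 0.0, 0.1]
```

An entry of 1000 therefore corresponds to a frequency of 0.5, and every further 1000 halves it again.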

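The feature construction above yields 43 (= 20 + 3 + 20) features per residue. Because binding residues do not occur in isolation, a sliding window is used so that the influence of neighboring residues is taken into account.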

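With a window of size w, 43*w features are generated for each residue (in the paper, w was chosen on the training set YFK16_DNA_3.5). A radial basis function (RBF) kernel is used, and the optimal c and γ are determined by five-fold cross-validation.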

The support vector machine model is then built with libSVM.
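The sliding-window encoding described above can be sketched as follows. The notes do not say how positions beyond the sequence termini are handled, so zero-padding is an assumption here, as are the function name `window_features` and the toy feature values.

```python
def window_features(per_residue, w):
    """Build one vector per residue by concatenating the features of the
    w residues centered on it (w must be odd).

    ASSUMPTION: window positions that fall outside the sequence are
    filled with zeros.
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd")
    if not per_residue:
        return []
    half = w // 2
    pad = [0.0] * len(per_residue[0])
    encoded = []
    for i in range(len(per_residue)):
        vector = []
        for j in range(i - half, i + half + 1):
            vector.extend(per_residue[j] if 0 <= j < len(per_residue) else pad)
        encoded.append(vector)
    return encoded

# Toy protein with L = 5 residues and 43 features per residue
# (20 PSSM + 3 SS + 20 HMM); values are placeholders.
per_residue = [[float(i + 1)] * 43 for i in range(5)]
encoded = window_features(per_residue, w=3)

print(len(encoded), len(encoded[0]))  # 5 129
```

With w = 3 each residue is described by 43*3 = 129 numbers: its own 43 features plus those of its two neighbors. Vectors of this form are what would then be passed to the RBF-kernel SVM.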

4. Performance evaluation (six metrics)

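The evaluation metrics in the paper comprise four metrics for assessing the binary classification results and two for assessing the predicted propensity scores: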