
MemBrain-contact 2.0: a new two-stage machine learning model for the prediction enhancement of transmembrane protein residue contacts in the full chain

Inter-residue contacts in proteins are widely acknowledged to be valuable for protein 3D structure prediction. Accurate prediction of long-range transmembrane inter-helix residue contacts can significantly improve the quality of simulated membrane protein models.



In this paper, we present an updated MemBrain predictor, which aims to predict transmembrane protein residue contacts. Our new model benefits from an efficient learning algorithm that can mine latent structural features present in the original feature space. The new MemBrain is a two-stage inter-helix contact predictor. The first stage takes sequence-based features as inputs and outputs coarse contact probabilities for each residue pair. In the second stage, these probabilities are fed into a convolutional neural network together with predictions from three direct-coupling analysis approaches. Experimental results on the training dataset show that our method achieves an average accuracy of 81.6% for the top L/5 predictions using a strict sequence-based jackknife cross-validation. Evaluated on the test dataset, MemBrain achieves 79.4% prediction accuracy. Moreover, for the top L/5 predicted long-range loop contacts, the prediction accuracy reaches 56.4%. These results demonstrate that the new MemBrain is promising for contact map prediction of transmembrane proteins.

1 Introduction

Integral membrane proteins play essential functional roles in living organisms and are involved in various crucial cellular processes, such as molecular transport, cell signaling and cell adhesion. More than half of all current drug targets are membrane proteins, and knowing their three-dimensional (3D) structures is valuable for drug design. However, the number of membrane protein structures in the Protein Data Bank (PDB) is relatively small compared with that of soluble proteins because of the experimental difficulties in their study (e.g. they are hard to crystallize). Fortunately, in recent years, several studies have suggested that inter-helix contacts can assist membrane protein structure prediction.


Residue contact prediction has been of long-term interest due to its critical importance and wide applications in protein structural bioinformatics. To date, a large number of methods have been proposed to predict residue contacts based on a machine learning (ML) framework, correlated mutation analysis (CMA) or a combination of the two. Nevertheless, most currently available residue contact predictors were developed for soluble proteins, such as SVMSEQ, CMAPpro, DNCON, PhyCMAP, PconsC2, MetaPSICOV, CoinDCA and R2C. The reason could be that the limited number of membrane protein structures leads to a small training sample size, which hinders the development of high-quality contact prediction models for membrane proteins. Even so, several predictors have been designed to predict inter-helix contacts for transmembrane (TM) proteins.


Inter-helix contacts can be predicted either by ML-based methods or by CMA-based approaches. The ML-based methods, for instance TMHcon, TMhit, MEMPACK, TMhhcp and COMSAT, learn statistical models guided by informative sequence-derived features. These prediction models rely on ML algorithms such as neural networks (NN), support vector machines (SVM) or random forests (RF). The CMA-based approaches aim to detect residue contacts by analysing a multiple sequence alignment (MSA) using local or global algorithms. This type of algorithm includes HelixCorr, mfDCA, PSICOV, plmDCA, GREMLIN and CCMpred. For a protein with sufficient homologous sequences, the CMA-based approaches can give precise predictions, while the ML-based methods perform better on proteins with few homologous sequences. Predictors that combine the two classes of methods, e.g. MemBrain and MemConP, are also available. It has been demonstrated that combining the ML-based and CMA-based methods improves the prediction performance.


Recently, several methods have been developed to improve inter-helix residue contact prediction in TM proteins. MemConP, an updated version of TMHcon, incorporates a series of sequence-based features and correlated mutations generated by Freecontact to train an RF model on a non-redundant dataset. To better handle the case of insufficient homologous sequences, a hybrid method called COMSAT was proposed, which integrates SVM and mixed integer linear programming. When the statistical SVM model fails to predict any contacts, the optimization-based method maximizes the cumulative potential of residue contacts. Despite this significant progress, there is still much room to improve the prediction performance. For instance, the reported prediction accuracy for the top L/5 inter-helix residue contacts is currently 65.6%, which leaves room for further enhancement.

Deep learning has been successfully applied to computer vision, speech recognition, natural language processing and also bioinformatics, because it can learn high-level abstract features from the original inputs and thereby reduce the effect of noise embedded in the original features. For instance, DeepBind uses a deep convolutional neural network (CNN) to predict the sequence specificities of protein binding; RaptorX-Contact uses a deep residual network to predict protein contact maps. In this work, we applied CNN to develop a new TM inter-helix residue contact prediction model and updated our former predictor MemBrain to further improve its performance.

For residue contact prediction, a series of sequence-based features are used to encode a residue pair. Among these features, the correlated mutation score indicates the potential of two residues to form a spatial contact. Usually, the two independent feature vectors of the two target residues are directly concatenated and fed into the prediction model, a combination that may lack a straightforward biophysical meaning. Furthermore, the doubled dimension results in more CNN parameters to be optimized. Therefore, taking all the sequence-based features as inputs to a CNN may not be an optimal choice, especially when there are not sufficient training samples from membrane protein structures to fit the weights. Motivated by these observations, we developed a two-stage prediction framework, where the first stage produces a coarse prediction map and the second stage performs a deep refinement.

2 Materials and methods

2.1 Datasets

To make a fair comparison with previous studies, we selected the same benchmark training and test datasets used in MemConP, which were collected from the PDBTM database in 2015. The original training dataset contains 90 TM proteins. In this work, however, we excluded the proteins 2e74B and 4mt4A from the dataset, because 2e74B has too few inter-helix contacts and 4mt4A is a beta-barrel protein. The remaining 88 alpha-helical TM proteins form the final training dataset, and the test dataset contains 30 alpha-helical TM proteins. All the proteins in the benchmark datasets have at least 3 TM helices, whose locations were extracted from the PDBTM. Supplementary Table S1 lists the details of the training and test datasets. The high-quality benchmark datasets were screened rigorously so that: (1) resolution is less than 3.5 Å; (2) pairwise sequence identity is less than 35%; (3) pairwise TM-score is below 0.5; (4) proteins are from different Pfam families.

In order to evaluate our method with more TM proteins, we prepared a larger independent test dataset by considering only the first two criteria above. First, we downloaded all redundant alpha-helical TM proteins from the PDBTM and removed the proteins appearing in the training and test datasets or having fewer than 3 TM helices. Then, the remaining proteins were culled by running the PISCES server to get a non-redundant dataset. Next, we discarded the proteins in the non-redundant dataset sharing more than 35% sequence identity with any protein from the training dataset. Finally, 175 TM proteins were obtained (denoted as ITD35). We also created another independent test dataset with pairwise sequence identity less than 30% and no more than 30% sequence identity with the training dataset (155 TM proteins collected, denoted as ITD30). These two independent test datasets are listed in Supplementary Tables S2–S3.

2.2 Contact definition

In the literature, there are multiple definitions of residue contact. For instance, in the well-known Critical Assessment of protein Structure Prediction (CASP) competition, the contact definition is based on Cβ atoms, i.e. if the Euclidean distance between the Cβ atoms (Cα for GLY) of two residues is less than 8 Å, then the two residues are said to be in contact. In the case of TM proteins, however, residue contacts are often determined according to the residues' heavy atoms. Concretely, two residues from different TM helices are considered to be in contact if the minimal distance between their side chain or backbone heavy atoms is less than 5.5 Å.
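The two definitions above can be illustrated with a minimal sketch. It assumes that atom coordinates are already available as arrays in Å; the function names `heavy_atom_contact` and `cb_contact` are hypothetical and are not part of MemBrain.

```python
import numpy as np

def heavy_atom_contact(atoms_a, atoms_b, cutoff=5.5):
    """Return True if the minimal heavy-atom distance between two residues
    is below `cutoff` (in Angstrom). Inputs are (n_atoms, 3) coordinate arrays."""
    a = np.asarray(atoms_a, dtype=float)
    b = np.asarray(atoms_b, dtype=float)
    # All pairwise distances between the atoms of residue a and residue b.
    diff = a[:, None, :] - b[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    return bool(dists.min() < cutoff)

def cb_contact(cb_a, cb_b, cutoff=8.0):
    """CASP-style definition: distance between the two C-beta atoms
    (C-alpha for glycine) below `cutoff` (in Angstrom)."""
    delta = np.asarray(cb_a, dtype=float) - np.asarray(cb_b, dtype=float)
    return bool(np.linalg.norm(delta) < cutoff)
```

Because the heavy-atom definition takes the minimum over all atom pairs, two residues with long side chains pointing towards each other can be in contact even when their Cβ atoms are comparatively far apart.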

For a fair comparison, we used the definition based on heavy atoms for inter-helix residue contact prediction, which is the same as in previous studies. Using the above contact definition, we obtained only 19 920 contact residue pairs from the training dataset. To enlarge the positive set, we also took contacts (sequence separation ≥ 6) involving residues from loop regions into account, which resulted in 62 493 contact residue pairs. In the end, three types of contacts were present in the training dataset: (I) contacts between residues from TM helical regions, (II) contacts between residues from loop regions and (III) contacts between residues from TM helical regions and loop regions. By doing so, our new model is able to predict the contact map for the entire TM protein sequence, not just for the TM helical regions.

When residue contact prediction is treated as a classification problem (distinguishing contact pairs from non-contact pairs), it is an imbalanced learning problem, i.e. the non-contact pairs far outnumber the contact ones. Previous statistics have shown that the contact density is approximately 2%–3%. Thus, to balance the positive and negative training samples, we used an under-sampling strategy, where all the positive samples and a subset of the negative samples are used for model training. To determine a proper sampling ratio with respect to the positive samples, we tested eight ratios and found that a ratio of 1:5 gives good and robust results (Supplementary Table S4).
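The under-sampling strategy can be sketched as follows. The sketch assumes that the 1:5 ratio means five negative samples per positive sample and that negatives are drawn uniformly at random; the function name `undersample` and the fixed seed are illustrative choices only.

```python
import numpy as np

def undersample(labels, neg_per_pos=5, seed=0):
    """Keep all positive samples and a random subset of the negatives.

    Returns the sorted indices of the retained samples. `neg_per_pos`
    is the assumed number of negatives kept per positive (1:5 here).
    """
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    rng = np.random.default_rng(seed)
    n_neg = min(len(neg), neg_per_pos * len(pos))
    keep_neg = rng.choice(neg, size=n_neg, replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))
```

With a contact density of 2%, for example, 10 contacts among 500 residue pairs would be reduced to a training set of 60 pairs (10 positives and 50 negatives).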


2.3 Feature extraction

For machine learning algorithms, discriminative features are crucial for model building and for the classification of unknown samples. Ab initio residue contact prediction mainly relies on sequence-derived information. In this work, six different types of input features were extracted for training the MemBrain model: amino acid composition, secondary structure, solvent accessibility, residue conservation score, contact potential and correlated mutation score, all of which are commonly used to build ML models. These features are described in detail below.


Single residue features. Amino acid composition represents the frequencies of the 20 amino acids and of the gap at a given position in the MSA. We used HHblits to search against the bundled UniProt20 database with three iterations to generate the MSA for each protein sequence. The predicted secondary structure and solvent accessibility were calculated by running PSIPRED and SOLVPRED, respectively. Residue conservation measures the probability that a given residue mutates into another.

The conservation score of each column in the MSA was calculated according to the Shannon entropy, which is defined as follows:

$$S = -\sum_{i} f_i \log f_i$$


where $f_i$ is the frequency of a certain residue occurring at the column of interest. For these local features, a sliding window of size 9 was used to encode the current and neighboring information. In addition, for a given residue pair (i, j), a window of size 5 centered at position (i + j)/2 was used to extract extra local features.
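As a minimal illustration of the column entropy and the sliding-window encoding, the following sketch assumes the natural logarithm, treats the gap character as an ordinary symbol and pads positions outside the sequence with zeros; none of these choices is specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy of one MSA column, given as a string or list of symbols."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def window(values, center, size=9, pad=0.0):
    """Values in a window of `size` positions centered at `center`;
    positions outside the sequence are filled with `pad` (an assumption)."""
    half = size // 2
    return [values[k] if 0 <= k < len(values) else pad
            for k in range(center - half, center + half + 1)]
```

A fully conserved column has entropy 0, and the entropy grows as the residue frequencies become more uniform.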

Residue pair features. Contact potential is a mean value computed by averaging the contact energies of all residue pairs that occur in the two given columns of the MSA. The correlated mutation score between two columns in the MSA indicates the potential of that residue pair to form a molecular contact. It is an informative descriptor because the inferred values can be used directly for residue contact prediction. In recent years, some elegant global CMA-based algorithms were proposed to detect coevolving residue pairs in an MSA. When sufficient homologous sequences are available, the resulting contact predictions are quite reliable. There are only a few solved 3D membrane protein structures in the PDB compared with soluble proteins; however, many homologous sequences can be found because membrane proteins constitute approximately 30% of all proteins. In the training dataset, the smallest number of homologous sequences is 48 for the protein 1yewC, and the largest number is 46 701 for the protein 4tpjB. The average number of homologous sequences for the whole training dataset is 3464, which means that the correlated mutation score is very powerful for model training and evaluation. To reduce calculation bias, we used five different algorithms to calculate this type of feature, i.e. MI, MIp, mfDCA, PSICOV and CCMpred. Since residue contacts are densely distributed in native structures, a 9 by 9 window centered at the position (i, j) was applied to extract nearby correlated mutation information.

2.4 MemBrain-contact 2.0 prediction model

MemBrain-contact 2.0 is a hierarchical two-stage residue contact predictor. The first stage is a conventional two-hidden-layer perceptron. The 1084-dimensional sequence-based features (26 × 9 × 2 + 26 × 5 + 6 × 9 × 9) are fed into this neural network, which has 150 units in each of the two hidden layers. The single output indicates the contact potential of a given residue pair. The second stage is the fusion of three CNNs, which have one, two and three convolution layers, respectively. We also tried more convolution layers, but found no improvement due to insufficient training samples. On top of each CNN, a fully connected layer with 150 hidden units is used to predict the final contact probability. For a target residue pair (i, j), the second stage takes as inputs four 25 by 25 patches from the raw contact maps generated by the first stage of MemBrain, mfDCA, PSICOV and CCMpred, where each patch is centered at the position (i, j). These sub contact maps then go through the subsequent convolution and max-pooling layers.
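The input sizes of the two stages can be checked with a short sketch. The dimension arithmetic follows the text directly; the patch extraction assumes zero padding at the borders of the contact map, which the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def feature_dim(per_residue=26, residue_window=9, midpoint_window=5,
                pair_features=6, pair_window=9):
    """Length of the first-stage input vector:
    26 x 9 x 2 (two residues) + 26 x 5 (midpoint window) + 6 x 9 x 9 (pair window)."""
    return (per_residue * residue_window * 2
            + per_residue * midpoint_window
            + pair_features * pair_window * pair_window)

def extract_patch(contact_map, i, j, size=25):
    """Square patch of side `size` centered at (i, j); zero padding is assumed."""
    half = size // 2
    padded = np.pad(np.asarray(contact_map, dtype=float), half, mode="constant")
    # After padding by `half`, original position (i, j) sits at (i + half, j + half).
    return padded[i:i + size, j:j + size]

def stack_patches(maps, i, j, size=25):
    """Stack the patches from several raw contact maps into one (n_maps, size, size) input."""
    return np.stack([extract_patch(m, i, j, size) for m in maps])
```

For example, `stack_patches([membrain_stage1, mfdca, psicov, ccmpred], i, j)` (with four hypothetical L by L arrays) would give the 4 × 25 × 25 input of the second stage for one residue pair.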

The convolution operator is formulated by:

$$(P * F)(x, y) = \sum_{i=1}^{5} \sum_{j=1}^{5} P(x + i - 1,\, y + j - 1)\, F(i, j)$$


where P is a 25 by 25 patch and F is a 5 by 5 filter. Max-pooling is a form of down-sampling that outputs the maximum value of each 2 by 2 patch. Figure 1 illustrates the flow chart of the new MemBrain-contact 2.0 protocol.
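A minimal NumPy sketch of one convolution followed by 2 by 2 max-pooling is given below. It assumes a "valid" convolution without padding and a pooling stride of 2, so that a 25 by 25 patch yields a 21 by 21 feature map and then a 10 by 10 pooled map; the text does not state these details, and the function names are hypothetical.

```python
import numpy as np

def conv2d_valid(patch, filt):
    """Slide `filt` over `patch` without padding and sum the element-wise products."""
    patch = np.asarray(patch, dtype=float)
    filt = np.asarray(filt, dtype=float)
    ph, pw = patch.shape
    fh, fw = filt.shape
    out = np.empty((ph - fh + 1, pw - fw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(patch[x:x + fh, y:y + fw] * filt)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling; trailing rows/columns that do not fill
    a complete `size` x `size` block are dropped."""
    fm = np.asarray(feature_map, dtype=float)
    h = fm.shape[0] // size * size
    w = fm.shape[1] // size * size
    blocks = fm[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
```

In a trained network the filter weights are learned and a non-linearity is normally applied between the convolution and the pooling step; both are omitted here for brevity.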

During the training phase, we used batch gradient descent with batches of 100 training samples over 30 epochs to minimize the cross entropy. We also introduced L2-norm regularization to avoid overfitting.


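The loss function is defined as follows:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] + \lambda \lVert w \rVert_2^2$$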


where N is the number of training samples, $y_i$ is the expected output, $p_i$ is the prediction, w denotes the parameters of the entire model and λ balances the cross entropy and the penalty term; λ is set to 1e-4. The learning rates for the first-stage multilayer perceptron and the second-stage CNN are 0.001 and 0.01, respectively. In the second stage, we trained three CNN models, whose outputs were averaged to give the final predictions.
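The regularized objective and the averaging of the three CNN outputs can be sketched as follows. The sketch assumes a binary cross entropy averaged over the N samples and a penalty of λ times the sum of squared weights; the exact normalization used by the authors is not stated, and the function names are hypothetical.

```python
import numpy as np

def regularized_cross_entropy(y, p, weights, lam=1e-4, eps=1e-12):
    """Mean binary cross entropy plus an L2 penalty on all weight arrays."""
    y = np.asarray(y, dtype=float)
    # Clip predictions to avoid log(0).
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    cross_entropy = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    penalty = lam * sum(np.sum(np.square(w)) for w in weights)
    return cross_entropy + penalty

def ensemble_average(predictions):
    """Average the contact probabilities predicted by several models."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)
```

Averaging the outputs of the one-, two- and three-layer CNNs is a simple form of model fusion that tends to reduce the variance of the individual predictions.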


2.5 Evaluation criteria

The predictions can be separated into four categories, i.e. true positive (TP), false negative (FN), false positive (FP) and true negative (TN). TP is the set of correctly predicted positive samples, FN is the set of positive samples that are mistakenly predicted as negative, FP is the set of negative samples that are mistakenly predicted as positive and TN is the set of correctly predicted negative samples. Based on these quantities, three derived evaluation criteria were used to compare the prediction performance with state-of-the-art methods.


The first performance criterion is accuracy (Acc). It is defined as the fraction of correctly predicted contacts with respect to all the predicted contacts:

$$\mathrm{Acc} = \frac{TP}{TP + FP}$$



where TP and FP are defined above. The accuracy is calculated for the top-ranked predictions, such as the top L/5 or top L, where L is the length of the concatenated TM helices.

The second performance criterion is coverage (Cov), which is also known as sensitivity. It is defined as the fraction of correctly predicted contacts among all the observed true inter-helix contacts in the native structure:

$$\mathrm{Cov} = \frac{TP}{TP + FN}$$

Thus, for the same TP, a larger coverage is obtained on a protein with fewer native contacts.


The last measure is the Matthews correlation coefficient (MCC), which is used to evaluate the performance and robustness of a given predictor. It is formulated as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
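The three criteria can be computed for a ranked list of predictions as sketched below. The sketch assumes that predictions are given as a dictionary from residue pairs to scores and that the total number of candidate pairs is known; the function names are hypothetical.

```python
import math

def top_k(scores, k):
    """The k residue pairs with the highest predicted scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [pair for pair, _ in ranked[:k]]

def confusion_counts(predicted, native, n_candidates):
    """TP, FP, FN and TN for a set of predicted contacts against the native contacts."""
    predicted, native = set(predicted), set(native)
    tp = len(predicted & native)
    fp = len(predicted - native)
    fn = len(native - predicted)
    tn = n_candidates - tp - fp - fn
    return tp, fp, fn, tn

def accuracy(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def coverage(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For the top L/5 evaluation, `k` would be set to L/5 with L the length of the concatenated TM helices; for instance, 42 correct contacts among 46 predictions give `accuracy(42, 4)`, i.e. about 91.3%.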

3 Results

3.1 Influence of neighboring contact pattern

In the current state-of-the-art residue contact predictors, correlated mutations are incorporated into the protocols to enhance the prediction ability. Although CMA-based approaches are limited by the need for sufficient homologous sequences in the MSA, they are still valuable for ML-based methods. CMA-based algorithms are known to perform poorly on CASP hard targets because few homologous sequences can be found for them. In the case of TM proteins, although there are not many solved 3D structures, there are enough homologous sequences to analyse, as stated in the previous section. Inspired by the intrinsic characteristic of protein structures that residue contacts are densely distributed, in the first stage of the MemBrain protocol, correlated mutations were encoded by a 9 by 9 window and then flattened to a feature vector so that they can be fed into a traditional multilayer perceptron. This strategy has been demonstrated to be helpful for improving the prediction performance. Here, to see the influence of the neighboring contact pattern, we trained four one-convolution-layer CNNs with sub contact maps generated by mfDCA, PSICOV, CCMpred and the first stage of MemBrain, respectively, where each patch covers the target and nearby residue pairs.


Figure 2 shows the performance improvement after using the initial sub contact maps to train CNN models. Whatever the input source, the CNN visibly increases the prediction accuracy. From Table 1, we can see that PSICOV achieves an average accuracy of 55.0%/53.2% for the top L/5 predicted inter-helix contacts on the training/test dataset. When we decomposed each contact map from the training dataset into a series of patches and used these sub contact maps to train a CNN model, the prediction accuracy on the test dataset increased to 69.2%, which is 16.0% higher than the initial prediction. Similar conclusions can be drawn for mfDCA, CCMpred and MemBrain. After using the CNN framework, mfDCA/CCMpred/MemBrain achieve 72.0%/73.8%/76.4% prediction accuracy, where the improvements are 14.8%, 11.9% and 5.3%, respectively. From the four CNN models, we can see that the improvement for MemBrain is considerably smaller than that for the CMA-based methods. This is because our ML-based model has already used correlated mutations as neighboring contact information. Even so, the prediction accuracy increases from 71.1% to 76.4%. In addition, given the same MSAs of proteins in the test dataset, CCMpred performs much better than PSICOV, achieving 8.7% higher prediction accuracy. This difference decreases to 4.6% with the help of the CNN. These results demonstrate that the neighboring contact pattern is indeed important for residue contact prediction. However, structural features hidden in sub contact maps are omitted by the traditional serial feature combination. When we take the patches as inputs, the latent structural features can be mined and thus improve the prediction performance.


3.2 Evaluation of inter-helix residue contact prediction

In this section, we compare the performance of MemBrain with the state-of-the-art inter-helix contact predictor MemConP. We also list the results of three representative CMA-based approaches, i.e. mfDCA, PSICOV and CCMpred. Since many homologous sequences can be found for most TM proteins, the CMA-based approaches can also give good predictions. On the training dataset, we used a strict sequence-based jackknife cross-validation to evaluate our method. During validation, each protein of the training dataset was selected in turn to test the model, which was trained using the remaining proteins. Note that MemConP used a 10-fold cross-validation on the training dataset. Table 1 shows the results of different methods for inter-helix residue contact prediction.
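The jackknife procedure amounts to leave-one-protein-out validation and can be sketched as follows. The `train` and `evaluate` callables are hypothetical placeholders for model fitting and per-protein scoring; they are not part of any published MemBrain interface.

```python
def jackknife_splits(protein_ids):
    """Yield (training_ids, held_out_id) with each protein held out exactly once."""
    ids = list(protein_ids)
    for k, held_out in enumerate(ids):
        yield ids[:k] + ids[k + 1:], held_out

def jackknife_score(protein_ids, train, evaluate):
    """Average per-protein score over all leave-one-out rounds.

    `train(training_ids)` returns a model and `evaluate(model, held_out_id)`
    returns a score such as the top L/5 accuracy (both are assumptions).
    """
    scores = []
    for training_ids, held_out in jackknife_splits(protein_ids):
        model = train(training_ids)
        scores.append(evaluate(model, held_out))
    return sum(scores) / len(scores)
```

Splitting at the protein level, rather than at the residue-pair level, keeps all pairs of the held-out protein out of training, which is what makes the validation sequence-based.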


On both the training and test datasets, CCMpred achieves the best performance among the three CMA-based approaches in terms of all evaluation criteria. Compared with the ML-based predictor MemConP, CCMpred gives similar accuracies for the top L/5 predicted inter-helix contacts, where the differences are 5.6% and 3.7% on the training and test datasets, respectively. However, the differences increase to 12.1% and 6.9% for the top L predicted inter-helix contacts. The reason could be that CMA-based predictions are widely scattered, so that evaluating more contacts is more likely to introduce false positives. The first stage of MemBrain achieves 73.3%/71.1% prediction accuracy on the training/test dataset for the top L/5 predicted inter-helix contacts, which is 2.4%/5.5% higher than that of MemConP. When we used the CNN architecture to enhance MemBrain at the second stage, we obtained 81.6%/79.4% accuracy, 12.4%/11.0% coverage and an MCC of 0.308/0.285 on the training/test dataset, which is 10.7%/13.8% higher than MemConP in terms of accuracy. For the top L predicted inter-helix contacts, MemBrain also gives 8.1%/13.0% higher accuracy, 4.1%/7.2% higher coverage and 0.064/0.101 higher MCC on the training/test dataset compared with MemConP. As shown in Supplementary Figure S1, the area under the curve (AUC) of the final MemBrain is 0.915, which is higher than those of the first stage of MemBrain and the other methods.

To gain deeper insight into the difference between the first stage and the second stage of MemBrain, we report in Figure 3 the prediction accuracies of both stages for the 118 TM proteins from the training and test datasets. As can be seen, most targets are predicted with higher accuracy after applying the CNN refinement. Among these targets, the largest improvement occurs for the protein 4mndA, where the top L/5 prediction accuracy increases from 25.0% to 65.0%. Supplementary Figure S2 shows the corresponding comparison for the top L predictions. In Supplementary Table S5, we list the detailed predictions for each of the proteins from the training and test datasets. On the training dataset, 763 true contacts are eliminated from the top L predictions by the second stage of MemBrain. However, 2309 new true contacts are introduced, which improves the average accuracy from 45.6% to 56.4%. On the test dataset, the extra 577 true contacts result in a 10.4% accuracy improvement.

3.3 Evaluation of contact prediction for loop region


Since all TM protein contact predictors focus on the performance of inter-helix contacts, in this section we evaluate long-range type II and III contact residue pairs (one or both residues are from the loop region) to see how reliable the prediction of these types of contacts is. Since MemBrain was trained using the entire native contact maps, it is able to predict type II and III contacts. For these kinds of contacts, we can also use the contact predictors developed for soluble proteins. Here, we show the prediction performance of MetaPSICOV. We also list the performance of mfDCA, PSICOV and CCMpred. Since these three approaches can be viewed as unsupervised learning algorithms, they are also suitable for inferring type II and III contacts. For the purpose of quantitative comparison with inter-helix residue contact prediction, the definition of L is the same as above, i.e. the length of the concatenated TM helices. However, the number of native contacts differs for each type, since it covers only the corresponding type of contacts in the native structure. Table 2 lists the prediction performance of different methods for type II contacts.

CCMpred performs the best among the three CMA-based approaches. The ML-based methods MetaPSICOV and MemBrain work better than CCMpred because they use additional sequence-derived features to predict residue contacts. MetaPSICOV achieves 41.5%/48.0% prediction accuracy and an MCC of 0.173/0.172 for the top L/5 predictions on the training/test dataset, while MemBrain reaches 52.1%/56.4% prediction accuracy, which is 10.6%/8.4% higher than that of MetaPSICOV. MemBrain also gives the best MCC of 0.224/0.217 on the training/test dataset. For the top L predicted loop contacts, MemBrain achieves 30.5% and 35.3% prediction accuracies on the training and test datasets, respectively, which are 7.1% and 5.4% higher than those of MetaPSICOV.

For type III contacts, we evaluate the top L/5 predicted contacts; Supplementary Table S6 shows the prediction performance. MemBrain achieves 45.0%/37.8% prediction accuracy for the top L/5 predicted contacts on the training/test dataset. The results demonstrate that MemBrain is capable of predicting residue contacts for the entire TM protein sequence. Compared with inter-helix residue contact prediction, where MemBrain achieves 79.4% prediction accuracy for the top L/5 predicted contacts on the test dataset, it provides only 56.4%/37.8% prediction accuracy for type II/III contacts. These results show that although MemBrain was trained with relatively few inter-helix residue pairs, it still performs much better on inter-helix contacts. This interesting phenomenon indicates that inter-helix contacts are more conserved and hence easier for a contact predictor to detect. Due to the flexibility of the loop region, modeling the loop contacts remains a challenging task.


3.4 Performance on the bigger independent test dataset


On the training and test datasets, MemBrain achieves 81.6% and 79.4% prediction accuracies for the top L/5 predicted inter-helix contacts, respectively. To evaluate MemBrain with more TM proteins, a larger independent test dataset was prepared without considering the TM-score criterion. Table 3 lists the overall performance of inter-helix residue contact prediction, where MemBrain achieves 84.5%/60.8% prediction accuracy, 12.3%/42.4% coverage and an MCC of 0.311/0.488 for the top L/5 and top L predicted contacts on the ITD35. When tested on the ITD30, where sequence identity is reduced from 35% to 30%, the average accuracy decreases slightly to 84.0% for the top L/5 predicted contacts. This is because the performance of CMA-based approaches is not sensitive to sequence identity but relies on the number of effective sequences in the MSA, and thus the prediction accuracy does not decrease much. From Tables 1 and 3, we can see that MemBrain performs better on the ITD35 and ITD30 than on the test dataset. The three CMA-based approaches also achieve better performance on these two datasets than on the test dataset. This can partially explain why MemBrain gives better predictions on the ITD35 and ITD30, because it takes the CMA-based predictions as input features. Since these two datasets were screened mainly on sequence identity, they could be more redundant than the test dataset with respect to the TM-score criterion.

For each protein from the ITD35, we used TM-align to get the largest TM-score against all the proteins from the training dataset. As can be seen in Figure 4, in general, the larger the TM-score, the higher the prediction accuracy MemBrain gives. There are three cases that have low prediction accuracy (less than 10.0%) despite a TM-score of more than 0.5. The reason is that low-quality MSAs result in unreliable correlated mutations, which lead to poor performance by MemBrain. Of these 175 TM proteins, 42 have a largest TM-score of less than 0.5 against the training dataset. For this non-redundant sub dataset, MemBrain achieves 79.4% and 51.3% prediction accuracies for the top L/5 and L predicted inter-helix contacts, respectively, which is comparable with the performance on the test dataset. The results demonstrate that MemBrain is robust for inter-helix contact prediction. In addition, sequence identity alone is not enough to obtain a non-redundant dataset when removing redundancy among protein sequences. There exist sequence pairs that have low sequence identity but a large TM-score, which is known as the 'twilight zone' phenomenon. Therefore, structural similarity should also be considered to ensure that the proteins of interest are dissimilar.

3.5 Case study

In this section, we use the TM protein 3wajA from the test dataset as an illustrative case to show the effectiveness of the CNN architecture. This protein has 13 TM helices. Among the top L/5 predicted inter-helix contacts, 42 out of 46 residue contacts are correctly predicted by MemBrain, resulting in a prediction accuracy of 91.3%. For the top L predicted contacts, the prediction accuracy drops to 47.4%. Before applying the CNN, the prediction accuracies for the top L/5 and L predicted inter-helix contacts are 69.6% and 33.8%, respectively, which are much worse. Supplementary Figure S3 shows the prediction details, where green, red and blue points represent native contacts, the contacts predicted by the first stage of MemBrain and those predicted by the final MemBrain, respectively. As can be seen, red points within the red ellipses (false positives) are partially or totally eliminated with the help of the CNN. Also, more points are introduced in the blue ellipses (true positives). The results show that, from the CNN's point of view, a residue pair surrounded by more contact pairs has a higher chance of being predicted as a positive pair. This rule is also consistent with the observation that contacts are densely distributed in native structures.


4 Discussions

In recent years, residue contact prediction has reached a high level of performance thanks to machine learning and data mining techniques. For inter-helix residue contact prediction, MemConP used a high-quality dataset to train an RF model and improved the prediction performance. Although few non-redundant membrane protein structures are available, more structures can be obtained now than several years ago to study the characteristics of contacting residue pairs. By observing helix packing, we can see that residue contacts are densely distributed in native structures. This conclusion can also be extended to soluble proteins. A conventional serial combination of neighboring contact potentials as features can make partial use of this kind of information, but the structural relationship within the neighboring contact pattern could be missed. In this work, we used a CNN architecture to mine these latent structural features. MemBrain achieves 79.4% prediction accuracy for the top L/5 predictions on the test dataset, which is a significant improvement for inter-helix residue contact prediction given the limited training samples.


MemBrain was trained with inter-helix residue pairs and also with long-range residue pairs of type II and III. On the one hand, this allows us to better fit the parameters of the CNN; on the other hand, MemBrain can give the entire contact map of the query TM protein sequence and not just the inter-helix contact map. We compared the performance of MemBrain on loop contacts with that of MetaPSICOV and found that MemBrain performs better, achieving 56.4% prediction accuracy for the top L/5 predicted loop contacts. However, this is still worse than inter-helix contact prediction, where a prediction accuracy of 79.4% is reached on the test dataset. Although there are more contact pairs (2.1 times as many) in loop regions to train the MemBrain model, inter-helix contacts are easier to detect. The reason may be that inter-helix contacts are more conserved.

By evaluating MemBrain on the independent test datasets, we show that it is important to consider both sequence identity and structural similarity when removing homologous redundancy. We separated the ITD35 dataset into two sub datasets, one consisting of proteins whose largest TM-score is less than 0.5 and the other of the remaining proteins. On the first sub dataset, MemBrain achieves prediction accuracies of 79.4% and 51.3% for the top L/5 and top L predicted contacts, respectively. On the second sub dataset, the corresponding prediction accuracies increase to 86.1% and 63.8%. Thus, we suggest that both sequence and structural similarity need to be taken into account when dealing with homologous redundancy.


The promising performance of the new MemBrain-contact 2.0 algorithm is due to the new hierarchical design of the prediction model and the CNN’s powerful capability of mining latent structural features that exist in the original feature space. The so-called ‘curse of dimensionality’ is a typical challenge in this study, i.e. more samples are required to build a reliable predictor as the number of feature dimensions increases. Because the number of solved membrane protein structures available for training is relatively small, the original 1084 feature dimensions, which result from combining the features of the two target residues, impose a heavy high-dimensional computational load. Our first-stage model transforms the high-dimensional feature space into a coarse prediction represented by a 2D probability image, to which the CNN is applied to learn the latent spatial structural correlations. Our results show that the final prediction from the hierarchical two-stage model is 8.3% higher than that of the first stage in terms of the top L/5 predictions, demonstrating the efficacy of the new protocol.
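The intuition behind applying a convolution to the 2D probability image can be illustrated with a toy example. The sketch below is not the CNN used in MemBrain: it swaps the learned filters for a single fixed 3x3 averaging kernel, purely to show why a score supported by neighboring high scores survives a convolution while an isolated score is attenuated. The function name and all values are hypothetical.

```python
import numpy as np


def neighborhood_mean(prob_map, size=3):
    """Average each cell with its size x size neighborhood (zero-padded borders).

    A fixed averaging kernel stands in for learned CNN filters; this is an
    assumption made for illustration only.
    """
    pad = size // 2
    padded = np.pad(prob_map, pad, mode="constant", constant_values=0.0)
    rows, cols = prob_map.shape
    out = np.zeros((rows, cols), dtype=float)
    # Sum the shifted copies of the map, one per kernel position.
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + rows, dj:dj + cols]
    return out / (size * size)


# Toy patch of a first-stage probability image (hypothetical values).
coarse = np.zeros((9, 9))
coarse[1:4, 5:8] = 0.8   # a dense cluster of high scores, as in helix packing
coarse[6, 2] = 0.8       # an isolated high score with no supporting neighbors

smoothed = neighborhood_mean(coarse)
cluster_center = smoothed[2, 6]   # stays close to 0.8
isolated_point = smoothed[6, 2]   # drops to about 0.8 / 9
```

A trained CNN learns its own filters rather than a plain average, but the same locality principle explains why densely supported contacts are reinforced and isolated first-stage predictions tend to be suppressed.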

Compared with our previous MemBrain-contact 1.0 model, the new predictor is significantly improved in the following aspects: (1) A more powerful learning algorithm is used. We applied the deep learning algorithm CNN to mine the latent structural features of the neighboring contact pattern and thus enhance the prediction ability. (2) The application scope is extended. Our previous model predicted only inter-helix contacts, whereas MemBrain 2.0 is capable of predicting residue contacts in the full chain. Currently, the MemBrain predictor is built on alpha-helical TM proteins, but it could potentially be extended to beta-barrel membrane proteins. A potential challenge is that even fewer beta-barrel membrane protein structures have been solved than alpha-helical TM protein structures, which will result in a smaller training dataset than the one used in the current study. One important future improvement of MemBrain will be the development of a specific residue contact prediction engine for beta-barrel membrane proteins.

