Scholar citations: 289
Pages: 7
Published: 2010.03
Journal: Bioinformatics
Authors: Pedro J. Ballester, John B. O. Mitchell
Abstract:
Motivation: Accurately predicting the binding affinities of large sets of diverse protein–ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which in turn is an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined theory-inspired functional form for the relationship between the variables that characterize the complex, which also include parameters fitted to experimental or simulation data, and its predicted binding affinity. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions.
Results: We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score's performance was shown to improve dramatically with training set size and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score.
Paper outline:
1. Introduction
2. Materials
2.1 Validation using the PDBbind benchmark
3. Methods
3.1 Intermolecular interaction features
3.2 RFs for regression
3.3 Scoring functions for comparative assessment
4. Results and discussion
4.1 Building RF-Score
4.2 RF-Score on the PDBbind benchmark
4.3 Comparison with the state of the art
5. Conclusions
Excerpts from the paper:
1. Biological Problem: What biological problems have been solved in this paper?
- predicting the binding affinities
- predicting how strongly the docked conformation binds to the target (scoring)
2. Main discoveries: What are the main discoveries in this paper?
- Results show that RF-Score is a very competitive scoring function.
- Importantly, RF-Score's performance was shown to improve dramatically with training set size and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score.
- conclusions: RF-Score has been shown to be particularly effective as a re-scoring function and can be used for virtual screening and lead optimization purposes.
3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?
- a novel scoring function (RF-Score)
- the first application of Random Forests (RFs) to predicting protein–ligand binding affinity.
- The process of training RF to provide a new scoring function (RF-Score) starts by separating the 195 complexes of the core set from the remaining 1105 complexes in the refined set. The former constitutes the test set of the PDBbind benchmark, while the latter is used here as training data.
- The PDBbind benchmark essentially consists of testing the predictions of scoring functions on the 2007 core set, which comprises 195 diverse complexes with measured binding affinities spanning more than 12 orders of magnitude
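The split described above amounts to a set difference on complex identifiers. A minimal sketch, not the authors' code: the identifiers are synthetic placeholders rather than real PDB codes, and `split_refined_and_core` is a hypothetical helper name.

```python
def split_refined_and_core(refined_ids, core_ids):
    """Return (train_ids, test_ids): train = refined set minus core set, test = core set."""
    core = set(core_ids)
    missing = core - set(refined_ids)
    if missing:
        raise ValueError(f"core complexes not in refined set: {sorted(missing)[:5]}")
    train_ids = [pdb_id for pdb_id in refined_ids if pdb_id not in core]
    test_ids = [pdb_id for pdb_id in refined_ids if pdb_id in core]
    return train_ids, test_ids


# Synthetic placeholder IDs standing in for the 1300 refined-set complexes.
refined_ids = [f"cplx{i:04d}" for i in range(1300)]
# Assumption: an arbitrary 195 of them play the role of the core set here.
core_ids = refined_ids[::6][:195]

train_ids, test_ids = split_refined_and_core(refined_ids, core_ids)
print(len(train_ids), len(test_ids))
```

Keeping the core set out of training is what makes the benchmark result a held-out estimate rather than a measure of fit.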
4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?
- circumvents the need for problematic modelling assumptions via non-parametric machine learning
- Random Forest was used to implicitly capture binding effects that are hard to model explicitly.
- RF does not assume any a priori relationship between the descriptors that characterize the complex and binding data, and thus should be sufficiently flexible to account for the wide variety of binding mechanisms observed across diverse protein–ligand complexes.
- RF is particularly suited for this task, as it has been shown to perform very well in non-linear regression.
- RF can also be used to estimate variable importance as a way to identify those protein–ligand contacts that contribute the most to the binding affinity prediction across known complexes.
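One common way to estimate variable importance is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below illustrates the idea only and is not necessarily the exact procedure used in the paper. Breiman's RF variant evaluates on out-of-bag samples, whereas this version uses the whole toy data set, and `toy_predict` is a stand-in for a trained model.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild every row with only feature j replaced by its shuffled value.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: two hypothetical contact-count features; only the first drives the "affinity".
data_rng = random.Random(1)
X = [[data_rng.randint(0, 50), data_rng.randint(0, 50)] for _ in range(200)]
y = [0.1 * row[0] + 2.0 for row in X]


def toy_predict(row):
    # Stand-in for a trained model's prediction; it ignores feature 1 on purpose.
    return 0.1 * row[0] + 2.0


importances = permutation_importance(toy_predict, X, y)
print(importances)
```

A feature the model never uses gets an importance of zero, while shuffling a feature the model relies on raises the error.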
5. Biological Significance: What is the biological significance of these ML methods’ results?
- It is very encouraging that this initial version has already obtained a high correlation with measured binding affinities on such a diverse test set.
- Interpretability is currently a drawback of this approach. However, it is important to realize that, although the terms comprising model-based scoring functions provide a description of protein–ligand binding, such a description is only as good as the accuracy of the scoring function.
- The agreement between predicted and measured binding affinity is quantified through Pearson's correlation coefficient (R), defined as the ratio of the covariance of both variables over the product of their standard deviations (SDs). In this training set, R = 0.953, indicating a very high linear dependence between these variables over the training data. Another commonly reported performance measure is the root mean square error (RMSE).
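The two performance measures named above can be computed directly from their definitions. A minimal sketch using only the standard library; the affinity values are made up for illustration and are not data from the paper.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the two SDs cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values for illustration only.
measured = [2.0, 4.5, 6.1, 7.3, 9.0]
predicted = [2.4, 4.1, 6.0, 7.9, 8.6]
print(round(pearson_r(measured, predicted), 3), round(rmse(measured, predicted), 3))
```

R measures linear association only, so a systematically shifted or rescaled prediction can still have a high R; RMSE complements it by measuring absolute error in the units of the affinity scale.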
6. Prospect: What are the potential applications of these machine learning methods in biological science?
- we plan to study the use of distance-dependent features, which could result in further performance improvements given that the strength of intermolecular interactions naturally depends on atomic separation.
- less coarse atom types will be investigated by considering the atom's hybridization state and bonding environment.
- machine learning-based scoring functions constitute an effective way to assimilate the fast growing volume of high-quality structural and interaction data in the public domain and are expected to lead to more accurate and general predictions of binding affinity
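Assuming the current features are element-pair contact counts within a single distance cutoff (section 3.1), a distance-dependent variant could split each count into distance bins. The sketch below is one hypothetical way to do that, not the authors' planned design: the bin edges, the atom representation and the function name `binned_contact_counts` are all assumptions made for illustration.

```python
import math
from collections import Counter


def binned_contact_counts(protein_atoms, ligand_atoms, bin_edges):
    """Count protein-ligand element pairs per distance bin.

    Atoms are (element, (x, y, z)) tuples; bin_edges are ascending distances in angstroms.
    Bin k covers bin_edges[k] <= d < bin_edges[k + 1]; pairs outside all bins are ignored.
    """
    counts = Counter()
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            d = math.dist(p_xyz, l_xyz)
            for k in range(len(bin_edges) - 1):
                if bin_edges[k] <= d < bin_edges[k + 1]:
                    counts[(p_elem, l_elem, k)] += 1
                    break
    return counts


# Hypothetical toy coordinates, not taken from any real complex.
protein_atoms = [("C", (0.0, 0.0, 0.0)), ("N", (5.0, 0.0, 0.0)), ("O", (20.0, 0.0, 0.0))]
ligand_atoms = [("C", (3.0, 0.0, 0.0)), ("O", (0.0, 7.0, 0.0))]

# Assumed bins: 0-4, 4-8 and 8-12 angstroms.
features = binned_contact_counts(protein_atoms, ligand_atoms, [0.0, 4.0, 8.0, 12.0])
print(dict(features))
```

Each (protein element, ligand element, bin) key would become one input column, so finer bins add resolution at the cost of more, sparser features.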
7. My Questions (Optional)