Paper reading (61): A comprehensive evaluation of MC methods for microarray gene expression ca

Paper title: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis

Scholar citations: 879

Pages: 13

Published: 2004.09

Journal: Bioinformatics

Authors: Alexander Statnikov, Constantin F. Aliferis, ..., Shawn Levy

Abstract:
Motivation: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types.

Results: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.

Paper outline:

1. Introduction

2. Materials and methods

2.1 Support vector machine-based classification methods

        2.1.1 Binary SVMs

        2.1.2 Multiclass SVMs: one-versus-rest (OVR)

        2.1.3 Multiclass SVMs: one-versus-one (OVO)

        2.1.4 Multiclass SVMs: DAGSVM

        2.1.5 Multiclass SVMs: method by Weston and Watkins (WW)

        2.1.6 Multiclass SVMs: method by Crammer and Singer (CS)

2.2 Non-SVM classification methods

        2.2.1 K-nearest neighbors

        2.2.2 Backpropagation neural networks

        2.2.3 Probabilistic neural networks

2.3 Ensemble classification methods

2.4 Parameters for the classification algorithms

2.5 Datasets and data preparatory steps

2.6 Experimental designs for model selection and evaluation

2.7 Gene selection

2.8 Performance metrics

2.9 Overall research design

2.10 Statistical comparison among classifiers

3. Implementations

4. Results and Analysis

4.1 Classification without gene selection

4.2 Classification with gene selection

4.3 Ensemble classification

4.4 Comparison with previously published results

5. Discussion and limitations

6. Conclusions

Excerpts from the paper:

1. Biological Problem: What biological problems have been solved in this paper?

  • multicategory classification
  • performing accurate cancer diagnosis from gene expression data
  • multicategory diagnosis

2. Main discoveries: What are the main discoveries in this paper?

  • Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data.
  • Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms.
  • Ensemble classifiers do not generally improve performance of the best non-ensemble models.
  • MC-SVMs are the best family of algorithms for this type of data and these medical tasks. They outperform other popular non-SVM machine learning techniques by a large margin.
  • Among MC-SVM methods, the ones by Crammer and Singer, Weston and Watkins and OVR have superior classification performance.
  • The performance of both MC-SVM and non-SVM methods can be moderately (for MC-SVMs) or significantly (for non-SVM) improved by gene selection.
  • Ensemble classification does not further improve the classification performance of the best MC-SVM models.
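
The excerpts above do not say which gene selection scores were used, so the following is only a minimal sketch of one common filter, a signal-to-noise score computed for one class against all the others. The function names `s2n_scores` and `top_genes` and the toy expression values are hypothetical and exist only for illustration.

```python
from statistics import mean, pstdev

def s2n_scores(X, y, positive):
    """Signal-to-noise score of every gene for class `positive` versus the rest.

    X is a list of samples, each a list of expression values (one per gene);
    y holds the class label of each sample.
    """
    pos = [row for row, label in zip(X, y) if label == positive]
    neg = [row for row, label in zip(X, y) if label != positive]
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row in pos]
        b = [row[g] for row in neg]
        spread = pstdev(a) + pstdev(b)
        # Constant genes carry no signal; score them 0 instead of dividing by zero.
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores

def top_genes(X, y, positive, k):
    """Indices of the k highest-scoring genes, best first."""
    scores = s2n_scores(X, y, positive)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]

# Toy data (made up): 4 samples x 3 genes; only gene 1 separates the classes.
X = [[1.0, 9.0, 5.0],
     [1.2, 8.5, 4.0],
     [0.9, 1.0, 5.5],
     [1.1, 1.5, 4.5]]
y = ["tumor", "tumor", "normal", "normal"]
selected = top_genes(X, y, "tumor", k=1)
print(selected)
```

To keep performance estimates honest, a ranking like this should be computed on the training portion of each cross-validation split only, never on the held-out samples.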

3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?

  • several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types.
  • Multicategory support vector machines (MC-SVMs)
  • the 11 datasets had 2–26 distinct diagnostic categories, 50–308 samples (patients) and 2308–15 009 variables (genes) after the data preparatory steps outlined above. All datasets are available for download (A.Statnikov, C.Aliferis, I.Tsamardinos, D.Hardin and S.Levy, http://www.gems-system.org).
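
To make the one-versus-rest (OVR) and one-versus-one (OVO) schemes from the outline concrete, here is a rough sketch of their decision rules only. It assumes the binary classifiers are already trained; `predict_ovr`, `predict_ovo`, the class names and the scores are all hypothetical stand-ins, not the paper's implementation.

```python
from collections import Counter
from itertools import combinations

def predict_ovr(scores_by_class):
    """OVR: one binary scorer per class (class vs. rest); pick the largest score."""
    return max(scores_by_class, key=scores_by_class.get)

def predict_ovo(pairwise_winner, classes):
    """OVO: one binary classifier per pair of classes; pick the class with most votes.

    Ties are broken by the order in which classes first receive a vote.
    """
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Hypothetical decision values for one sample (made up for illustration).
scores = {"class_a": 0.3, "class_b": -1.2, "class_c": 1.1}
ovr_label = predict_ovr(scores)

# Stand-in for trained pairwise classifiers: the class with the higher score wins.
def winner(a, b):
    return a if scores[a] > scores[b] else b

ovo_label = predict_ovo(winner, list(scores))
print(ovr_label, ovo_label)
```

With K classes, OVR needs K binary classifiers and OVO needs K(K-1)/2, which matters for datasets with as many as 26 diagnostic categories.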

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

  • The superior classification performance of the SVM-based methods compared to KNN, NN and PNN reflects that SVMs are less sensitive to the curse of dimensionality and more robust to a small number of high-dimensional gene expression samples than other non-SVM techniques.
  • SVM algorithms are fairly stable in the sense that small changes in the training data do not result in large changes in the predictive model's behavior, and stable algorithms do not usually benefit from ensemble classification.

5. Biological Significance: What is the biological significance of these ML methods’ results?

  • we use accuracy and RCI as our performance measures
  • The first metric is accuracy since we wanted to compare our results with the previously published studies that also used this performance metric.
  • The second metric is relative classifier information (RCI), which corrects for differences in prior probabilities of the diagnostic categories, as well as the number of categories. 
  • we decided to use random permutation testing to test that differences in accuracy between the best method (i.e. one with the largest average accuracy) and all remaining algorithms are non-random.
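
The two metrics and the permutation test above can be sketched as follows. This assumes the usual entropy-based formulation of RCI (the share of the prior class uncertainty that is removed by knowing the classifier's output) and a generic paired sign-flip permutation test; neither is guaranteed to match the paper's exact procedure, and all names and toy values are hypothetical.

```python
import random
from math import log2

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def rci(y_true, y_pred):
    """Assumed RCI: (H(true) - H(true | predicted)) / H(true)."""
    n = len(y_true)
    h_prior = entropy([y_true.count(c) for c in set(y_true)])
    if h_prior == 0:
        return 0.0  # only one true class: nothing to explain
    h_cond = 0.0
    for p in set(y_pred):
        members = [t for t, q in zip(y_true, y_pred) if q == p]
        h_cond += len(members) / n * entropy([members.count(c) for c in set(members)])
    return (h_prior - h_cond) / h_prior

def paired_permutation_pvalue(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided sign-flip test on paired accuracy differences (e.g. per dataset)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the two methods are exchangeable,
        # so each difference is equally likely to have either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy labels (made up): a classifier that always predicts the majority class.
y_true = ["a", "a", "a", "b"]
y_majority = ["a", "a", "a", "a"]
print(accuracy(y_true, y_majority), rci(y_true, y_majority))
```

The toy case shows why a second metric is useful: always predicting the majority class reaches 75% accuracy here, yet its RCI is zero because the predictions carry no information about the true category.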

6. Prospect: What are the potential applications of these machine learning methods in biological science?

  • These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures.
  • This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets.
  • GEMS treats the task in the most comprehensive manner and is the first such system to be informed after a rigorous analysis of the available algorithms and datasets.
  • A particularly interesting direction for future research is to improve our existing gene selection procedures with the selection of ‘optimal’ number of genes by cross-validation. 
  • Limitations: One of the limitations of the present study is that we use accuracy and RCI as our performance measures.
  • no study can convincingly answer the central question of this research—what is the best learning algorithm for multicategory cancer diagnosis based on gene expression data?

7. My Question (Optional)

An evaluation of backpropagation neural networks today would probably give different results from the one done in 2004.
