Paper reading (59): Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting

Paper title: Explaining Diversity in Metagenomic Datasets by Phylogenetic-Based Feature Weighting

Scholar citations: 17

Pages: 18

Publication date: 2015.03

Journal: PLOS Computational Biology

Authors: Davide Albanese, Carlotta De Filippo, ..., Claudio Donati

Abstract:

Metagenomics is revolutionizing our understanding of microbial communities, showing that their structure and composition have profound effects on the ecosystem and in a variety of health and disease conditions. Despite the flourishing of new analysis methods, current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schema do not allow an adequate description of the structure of the microbiota. This constitutes a severe limitation to the use of metagenomic data in therapeutic and diagnostic applications. To provide a more robust statistical framework, we introduce a class of feature-weighting algorithms that discriminate the taxa responsible for the classification of metagenomic samples. The method unambiguously groups the relevant taxa into clades without relying on pre-defined taxonomic categories, thus including in the analysis also those sequences for which a taxonomic classification is difficult. The phylogenetic clades are weighted and ranked according to their abundance measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.

Paper structure:

1. Introduction

2. Results

2.1 Definition of the scores

2.2 Correlation between the lineages and identification of the clades

2.3 Applications

2.4 Predictivity of the ranked features in supervised classification problems

3. Discussion

4. Materials and Methods

4.1 The PhyloRelief algorithm

4.2 Predictive classification pipeline

5. Supporting information

Excerpts from the main text:

This task faces several difficulties. On one hand, most of the microorganisms composing the human and environmental microbiota are poorly characterized, difficult to cultivate, and lack a precise taxonomic classification. On the other hand, methods to unambiguously define the microbial taxa that are responsible for these differences are still lacking, and their identification usually relies on a small number of arbitrarily chosen association tests with high-level taxonomic classes, or on statistical learning methods, both evaluating only taxa for which a taxonomic classification is possible.

1. Biological Problem: What biological problems have been solved in this paper?

  • Classification of metagenomic samples.
  • Identification and ranking of the microbial taxa that differentiate the classes of samples.

2. Main discoveries: What are the main discoveries in this paper?

  • Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.
  • PhyloRelief can be applied both to cases in which sequences can be classified according to a known taxonomy, and to cases in which this is not feasible, a common occurrence in metagenomic data analysis given the increasing number of new and uncultivable taxa that are discovered using these technologies.
  • Comparing the performances of the algorithm to LEfSe, MetaPhyl and to Random Forest in a classical supervised classification schema using cross validation, we found that the taxa ranked by PhyloRelief had also a high predictive value, performing as well as—and in some cases outperforming—current gold standard methods.
  • We applied the method to a meta-analysis of two recent datasets of comparative studies of the gut microbiota of European, USA, African and South American healthy individuals, identifying bacterial taxa that are differentially distributed with geography and age.

3. ML(Machine Learning) Methods: What are the ML methods applied in this paper?

  • PhyloRelief is a novel feature-ranking algorithm that integrates the phylogenetic relationships amongst the taxa into a statistical feature weighting procedure.
  • PhyloRelief resolves the problem of relevant taxa identification by applying the Relief strategy of feature ranking in a phylogenetic context.
  • PhyloRelief is a ranking strategy to identify the taxa significantly contributing to the differentiation of groups of amplicon metagenomic samples.
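To make the "Relief strategy of feature ranking" more concrete, here is a minimal sketch of the classic Relief idea on a flat feature table. It is not the PhyloRelief algorithm from the paper: it ignores the phylogenetic tree entirely, and the function name `relief_weights`, the toy data, and the choice to visit every sample instead of drawing random ones are assumptions made only for illustration. Each feature is rewarded when it differs between a sample and its nearest neighbour of another class (nearest miss) and penalized when it differs from the nearest neighbour of the same class (nearest hit).

```python
def relief_weights(samples, labels):
    """Classic Relief on a flat feature table (illustrative, not PhyloRelief).

    Assumes numeric features and at least two samples in every class.
    """
    n_samples = len(samples)
    n_features = len(samples[0])

    # Per-feature range, used to scale differences into [0, 1].
    spans = []
    for f in range(n_features):
        column = [row[f] for row in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    for i, x in enumerate(samples):
        hits = [s for j, s in enumerate(samples)
                if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples)
                  if labels[j] != labels[i]]
        nearest_hit = min(hits, key=lambda s: distance(x, s))
        nearest_miss = min(misses, key=lambda s: distance(x, s))
        for f in range(n_features):
            # Reward separation from the other class, penalize
            # spread within the same class.
            weights[f] += (diff(f, x, nearest_miss)
                           - diff(f, x, nearest_hit)) / n_samples
    return weights


# Toy table (hypothetical): feature 0 separates the classes, feature 1 is noise.
toy_samples = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
toy_labels = [0, 0, 1, 1]
toy_weights = relief_weights(toy_samples, toy_labels)
ranking = sorted(range(len(toy_weights)), key=lambda f: -toy_weights[f])
```

In this toy case the discriminating feature receives a positive weight and the noisy one a negative weight, so sorting by weight gives the ranking. According to the notes above, PhyloRelief applies the same weighting idea in a phylogenetic context, so that what gets ranked are clades of the tree instead of independent columns.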

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

  • Current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schema do not allow an adequate description of the structure of the microbiota.
  • The improvement of this method over existing ones consists in its ability to accomplish a ranking of the microbial clades, defined on the basis of the taxa distribution amongst the samples weighted by phylogenetic information, discovering those that contribute to the differentiation between two or more classes of samples.
  • Importantly, this result is obtained without relying on a predefined set of taxonomic categories that are often hard pressed to describe the complexity of the evolutionary relationships between microorganisms.

5. Biological Significance: What is the biological significance of these ML methods’ results?

  • To identify the number of clades that were more relevant to differentiate the two classes, we performed ANOSIM and PERMANOVA analysis with increasing number of clades ranked according to the PhyloRelief weights.
  • The performances were assessed in terms of average predictive accuracy using the K-category correlation coefficient (KCCC), a multiclass extension of the Matthews Correlation Coefficient (MCC).
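For readers unfamiliar with the K-category correlation coefficient, the sketch below computes it from true and predicted labels using the confusion-matrix form commonly given for the multiclass MCC. This is an illustration under assumptions, not the evaluation code of the paper: the function name `k_category_correlation` is hypothetical, and returning 0.0 when the denominator vanishes is a convention chosen here.

```python
import math


def k_category_correlation(true_labels, predicted_labels):
    """Multiclass MCC (K-category correlation) from two label sequences."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    index = {c: k for k, c in enumerate(classes)}
    k = len(classes)

    # confusion[i][j] counts samples of true class i predicted as class j.
    confusion = [[0] * k for _ in range(k)]
    for t, p in zip(true_labels, predicted_labels):
        confusion[index[t]][index[p]] += 1

    total = len(true_labels)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts)))
    # Convention assumed here: report 0.0 when the score is undefined,
    # for example when every sample is predicted as the same class.
    return numerator / denominator if denominator else 0.0
```

With two classes this reduces to the ordinary MCC, and a perfect prediction scores 1 while a constant prediction scores 0, which is why it is less sensitive to class imbalance than plain accuracy.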

6. Prospect: What are the potential applications of these machine learning methods in biological science?

  • The algorithm is general and does not rely on any specific sequencing technology, as long as a phylogenetic tree of the OTUs and the distribution of the OTUs in the different samples are available. 
  • The algorithm can readily be extended to regression problems.
  • The PhyloRelief class of algorithms fills a significant gap in the growing array of computational methods that are currently used for the analysis of metagenomic data, and will impact importantly on the application of metagenomics to the development of novel diagnostic markers, leading the application of these approaches from the bench to the bedside.

7. My Questions (Optional)
