Scholar citations: 2
Pages: 16
Publication date: October 2018
Journal: Microbiome
Authors: Janko Tackmann, Natasha Arora, ..., Christian von Mering
Abstract:
Background
The identification of body site-specific microbial biomarkers and their use for classification tasks have promising applications in medicine, microbial ecology, and forensics. Previous studies have characterized site-specific microbiota and shown that sample origin can be accurately predicted by microbial content. However, these studies were usually restricted to single datasets with consistent experimental methods and conditions, as well as comparatively small sample numbers. The effects of study-specific biases and statistical power on classification performance and biomarker identification thus remain poorly understood. Furthermore, reliable detection in mixtures of different body sites or with noise from environmental contamination has rarely been investigated thus far. Finally, the impact of ecological associations between microbes on biomarker discovery was usually not considered in previous work.
Results
Here we present the analysis of one of the largest cross-study sequencing datasets of microbial communities from human body sites (15,082 samples from 57 publicly available studies). We show that training a Random Forest Classifier on this aggregated dataset increases prediction performance for body sites by 35% compared to a single-study classifier. Using simulated datasets, we further demonstrate that the source of different microbial contributions in mixtures of different body sites or with soil can be detected starting at 1% of the total microbial community. We apply a biomarker selection method that excludes indirect environmental associations driven by microbe-microbe associations, yielding a parsimonious set of highly predictive taxa including novel biomarkers and excluding many previously reported taxa. We find a considerable fraction of unclassified biomarkers (“microbial dark matter”) and observe that negatively associated taxa have a surprisingly high impact on classification performance. We further detect a significant enrichment of rod-shaped, motile, and sporulating taxa for feces biomarkers, consistent with a highly competitive environment.
Conclusions
Our machine learning model shows strong body site classification performance, both in single-source samples and mixtures, making it promising for tasks requiring high accuracy, such as forensic applications. We report a core set of ecologically informed biomarkers, inferred across a wide range of experimental protocols and conditions, providing the most concise, general, and least biased overview of body site-associated microbes to date.
Main text outline:
1. Background
2. Results
2.1 A large and heterogeneous collection of microbial sequencing samples from human body sites
2.2 Cross-study classifier outperforms single-study model in predictive accuracy
2.3 Even trace amounts of body site microbiomes can be reliably identified in mixtures between body sites or body site and environment
2.4 A parsimonious core set of directly associated microbial biomarkers for human body sites
2.5 Negatively associated microbes are numerous and contribute strongly to sample prediction accuracy
2.6 Previously unreported associations between microbes and body sites
2.7 Aerobicity is the most defining characteristic of microbial biomarkers found in body sites
3. Discussion
3.1 Improved classification accuracy in large cross-study datasets
3.2 A core set of ecologically informed biomarkers
3.3 Taxonomic and phylogenetic patterns of detected biomarkers
3.4 Microbial trait enrichment in particular body sites
3.5 Limitations
4. Conclusion
5. Methods
Excerpts from the main text:
1. Biological Problem: What biological problems have been solved in this paper?
- The task was to identify a microbial community from a target body site in mixtures of communities from the target body site and a background body site, along a gradient of increasing mixture fractions, for all pairs of body sites.
- Classification of body sites
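The mixture task above can be illustrated with a minimal sketch. It assumes that an in silico mixture is a linear blend of two relative-abundance profiles, which is an assumption made for illustration; the notes do not record the paper's exact mixing procedure. The function name `mix_profiles` and the toy count vectors are hypothetical.

```python
import numpy as np

def mix_profiles(target, background, fraction):
    """Blend two abundance profiles; `fraction` is the target site's share.

    Assumption: mixing is linear in relative abundance (not taken from the paper).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    target = np.asarray(target, dtype=float)
    background = np.asarray(background, dtype=float)
    # Normalize raw counts to relative abundances before blending.
    t = target / target.sum()
    b = background / background.sum()
    return fraction * t + (1.0 - fraction) * b

# Toy OTU count vectors for a target and a background body site (hypothetical).
target = [60, 30, 10, 0]
background = [0, 0, 20, 80]

# Gradient of increasing mixture fractions, starting at 1%.
fractions = [0.01, 0.05, 0.25, 0.5]
mixtures = {f: mix_profiles(target, background, f) for f in fractions}

for f, profile in mixtures.items():
    print(f, profile.round(4))
```

Each mixed profile would then be passed to the trained classifier to check whether the target body site is still detected at that fraction.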
2. Main discoveries: What are the main discoveries in this paper?
- This study is, to our knowledge, the most comprehensive cross-study evaluation of human body site classification and the first analysis of ecologically informed biomarkers for human body sites.
- We show that training a Random Forest Classifier on this aggregated dataset increases prediction performance for body sites by 35% compared to a single-study classifier.
- Using simulated datasets, we further demonstrate that the source of different microbial contributions in mixtures of different body sites or with soil can be detected starting at 1% of the total microbial community.
3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?
- 15,082 samples from 57 publicly available studies, 1329 soil samples
- 60,892 operational taxonomic units (OTUs)
- We analyzed a large-scale dataset composed of over 15,000 samples from five human body sites and showed that an RFC model trained on this data (RFC-global) is considerably more accurate for body site prediction than multiple models trained on data from single studies (RFC-single, RFC-single-hmp) used in previous body site classification benchmarks.
- We used the body site dataset for classifier training and evaluated its performance on single-source samples, as well as in silico mixtures of samples from two body sites or from a body site and soil. We compared this performance to a classifier trained on a single study, subject to previous machine learning benchmarks, and demonstrated that the cross-study classifier makes strongly improved predictions.
- We used Generalized Local Learning (GLL, [27]) (Fig. 4a), an approach that has advantages over feature importances reported by Random Forests and decision trees ([31], see Methods).
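The single-study versus cross-study comparison can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn's `RandomForestClassifier` and purely synthetic data; the data generator, the sizes, and every variable name are hypothetical, and it does not reproduce the paper's pipeline or its numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
N_SITES, N_STUDIES, N_TAXA, N_PER_SITE = 3, 4, 20, 30

def simulate_study(study_shift):
    """Return (X, y) for one synthetic study.

    Each body site enriches three taxa; `study_shift` rescales every taxon
    to imitate a study-specific bias (an assumption for illustration).
    """
    X, y = [], []
    for site in range(N_SITES):
        base = np.ones(N_TAXA)
        base[site * 3:(site + 1) * 3] += 5.0  # taxa enriched at this site
        counts = rng.poisson(10 * base * study_shift, size=(N_PER_SITE, N_TAXA))
        X.append(counts / counts.sum(axis=1, keepdims=True))  # relative abundance
        y.extend([site] * N_PER_SITE)
    return np.vstack(X), np.array(y)

studies = [simulate_study(rng.uniform(0.5, 2.0, N_TAXA)) for _ in range(N_STUDIES)]
X_test, y_test = studies[-1]  # held-out study, unseen by both models

# Single-study model: trained on the first study only.
rfc_single = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_single.fit(*studies[0])

# Cross-study model: trained on the first three studies pooled together.
X_pool = np.vstack([X for X, _ in studies[:3]])
y_pool = np.concatenate([y for _, y in studies[:3]])
rfc_global = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_global.fit(X_pool, y_pool)

f1_single = f1_score(y_test, rfc_single.predict(X_test), average="macro")
f1_global = f1_score(y_test, rfc_global.predict(X_test), average="macro")
print("single-study macro F1:", round(f1_single, 3))
print("cross-study macro F1:", round(f1_global, 3))
```

Evaluating both models on a study that neither has seen is the point of the design: it measures how well each one generalizes across experimental protocols rather than how well it fits its own training study.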
4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?
- This increase in data volume and diversity has opened the door for human microbiome studies to apply more advanced statistical and machine learning tools
- The aggregation of sequencing data from different studies into large meta-datasets and their utilization for classifier training could thus lead to more general and predictive models that can reliably classify samples produced under a variety of experimental protocols and from a wide range of subjects.
5. Biological Significance: What is the biological significance of these ML methods’ results?
- Performance was measured through F1 scores, which take into account both precision and recall and are less affected by imbalanced numbers of samples per body site than other metrics.
- Classification performance was measured through the area under the ROC curve (AUC) metric, which similarly to F1 scores is robust to label imbalances, but quantifies predictive performance independent of a decision threshold.
- Limitations of our study include the use of in silico simulations for mixture analysis.
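The two metrics mentioned above can be made concrete with a small worked sketch in plain Python. The averaging scheme (macro averaging over body sites) and the toy labels are assumptions for illustration; the notes do not state which averaging the paper used.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 = 2PR / (P + R) for one class, from precision P and recall R."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

def auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative (ties count one half); no decision threshold is needed."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy, imbalanced example with hypothetical labels.
y_true = ["gut", "gut", "gut", "gut", "skin", "oral"]
y_pred = ["gut", "gut", "gut", "skin", "skin", "oral"]
print(round(macro_f1(y_true, y_pred), 4))  # per-class F1: gut 6/7, skin 2/3, oral 1

# Binary scores for one body site versus the rest.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 3 of 4 positive/negative pairs ranked correctly
```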
6. Prospect: What are the potential applications of these machine learning methods in biological science?
- We report a core set of ecologically informed biomarkers, inferred across a wide range of experimental protocols and conditions, providing the most concise, general, and least biased overview of body site-associated microbes to date.
- We are confident that this data-intense approach will continue to expand our understanding of the human microbiome and lead to generalized insights into our microbial ecosphere.
- Our machine learning model shows strong body site classification performance, both in single-source samples and mixtures, making it promising for tasks requiring high accuracy, such as forensic applications.
7. My Questions (Optional)