Paper reading (43): a phylogenetic tree embedded architecture for CNN for metagenomic data

Paper title: PopPhy-CNN: a phylogenetic tree embedded architecture for convolution neural networks for metagenomic data

Google Scholar citations: 6

Pages: 9

Publication date: Posted January 31, 2018

Authors: Derek Reiman, Ahmed A. Metwally, Yang Dai

Abstract:

Motivation: Accurate prediction of the host phenotype from a metagenomic sample and identification of associated bacterial markers are important in metagenomic studies. We introduce PopPhy-CNN, a novel convolutional neural network (CNN) learning architecture that effectively exploits phylogenetic structure in microbial taxa. PopPhy-CNN provides an input format of a 2D matrix created by embedding the phylogenetic tree that is populated with the relative abundance of microbial taxa in a metagenomic sample. This conversion empowers CNNs to explore the spatial relationship of the taxonomic annotations on the tree and their quantitative characteristics in metagenomic data.

Results: PopPhy-CNN is evaluated using three metagenomic datasets of moderate size. We show the superior performance of PopPhy-CNN compared to random forest, support vector machines, LASSO and a baseline 1D-CNN model constructed with relative abundance microbial feature vectors. In addition, we design a novel scheme of feature extraction from the learned CNN models and demonstrate the improved performance when the extracted features are used to train support vector machines.

Conclusion: PopPhy-CNN is a novel deep learning framework for the prediction of host phenotype from metagenomic samples. PopPhy-CNN can efficiently train models and does not require an excessive amount of data. PopPhy-CNN facilitates not only retrieval of informative microbial taxa from the trained CNN models but also visualization of the taxa on the phylogenetic tree.

CODE: https://github.com/derekreiman/PopPhy-CNN

Conclusion:

  • PopPhy-CNN can be readily used for developing a predictive model from a metagenomic dataset of moderate size.
  • It also facilitates the extraction and visualization of a ranked microbial taxonomic set for biological interpretation of the learned predictive model.

Discussion:

  • The key contribution is leveraging biological knowledge in microbial taxa relative abundance profiles through a phylogenetic tree by our novel propagation and embedding procedure. 
  • The 2D matrix input obtained from this procedure enables CNNs to exploit the topological structure of the phylogenetic tree for developing more accurate predictive models.
  • CNNs can deliver more robust performance without requiring excessively large training sets.
  • The results provide evidence that the activation maps on the first layer of the CNNs maintain the spatial relationship between the microbial taxa on the phylogenetic tree.
  • This implies that PopPhy-CNN benefits from learning informative features on the populated phylogenetic tree embedded in the matrix format.
  • There are several directions for further study.
  • The phylogenetic tree is one of the core components in the PopPhy-CNN learning framework.
  • Different ways of embedding the populated trees into the matrix format may also affect the model performance.
  • If the number of microbial taxa substantially outnumbers that of the learning samples, more effective regularization schemes or algorithms that promote the learning of important features in CNNs are likely necessary.
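
The propagation and embedding procedure mentioned in the discussion can be illustrated with a small sketch. This is a simplified version under stated assumptions, not the authors' implementation: the toy tree, the names `populate` and `embed`, and the placement rule (one matrix row per tree depth, nodes written left to right in breadth-first order, zero padding on the right) are all hypothetical. The exact layout used by PopPhy-CNN may differ.

```python
from collections import deque

# Hypothetical toy tree: parent -> children. Leaves carry observed abundances.
TREE = {
    "root": ["A", "B"],
    "A": ["a1", "a2"],
    "B": ["b1"],
}
LEAF_ABUNDANCE = {"a1": 0.5, "a2": 0.25, "b1": 0.25}


def populate(tree, leaf_abundance, root="root"):
    """Give every node the summed abundance of the leaves below it."""
    populated = {}

    def visit(node):
        children = tree.get(node, [])
        if children:
            value = sum(visit(child) for child in children)
        else:
            value = leaf_abundance.get(node, 0.0)
        populated[node] = value
        return value

    visit(root)
    return populated


def embed(tree, populated, root="root"):
    """Assumed layout: one row per depth, nodes in BFS order, zero-padded."""
    rows = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == len(rows):
            rows.append([])
        rows[depth].append(populated[node])
        for child in tree.get(node, []):
            queue.append((child, depth + 1))
    width = max(len(row) for row in rows)
    return [row + [0.0] * (width - len(row)) for row in rows]


MATRIX = embed(TREE, populate(TREE, LEAF_ABUNDANCE))
for row in MATRIX:
    print(row)
```

In this layout, children sit in the row directly below their parents, so a 2D kernel sliding over the matrix sees related taxa together. That is the intuition behind letting a CNN exploit the tree topology.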

Introduction:

  • A metagenomic sample is usually described by its microbial taxonomic composition, represented as nodes on a phylogenetic tree. The identification of microbial taxa that are associated with the host disease can benefit early diagnosis, the development of microbial reconstitution (e.g., probiotic) therapies, and the understanding of the disease mechanism.
  • Alternative approaches using machine learning models, e.g., Random Forest (RF), LASSO and Support Vector Machines (SVMs), and recently, deep neural networks (DNNs), have demonstrated the potential of developing microbial biomarker signatures for the prediction of disease or phenotype of the host.
  • Deep models are attractive owing to the ability of deep architectures to identify potential interactions of microbial taxa for disease prediction.
  • Drawbacks of DNNs: they require an excessive amount of training data, and they behave as black boxes.
  • It is unclear whether they can outperform the existing models, such as RF, LASSO and SVMs, and whether they can learn a set of informative microbial taxa from metagenomics data.
  • To empower CNNs in metagenomic phenotype prediction, it is important to provide structured input with a certain distance metric among the microbial taxa.
  • The contributions are summarized as follows.
  1. We investigated the effect of up-sampling in addressing the issue of the moderate data size in the current metagenomic study. Our experimental results indicate that learning from the original data is sufficient to achieve the maximum performance.
  2. We conducted a comprehensive evaluation of the performance of our CNN model in comparison with other models (RF, LASSO, SVMs) and a baseline 1D CNN using the vector form of relative abundance profiles. We demonstrate the superior performance of our CNN models using three datasets of moderate size: (1) cirrhosis (114 cases vs. 118 controls); (2) type 2 diabetes (223 cases vs. 217 controls); and (3) obesity (164 cases vs. 89 controls).
  3. We developed a novel procedure to retrieve microbial taxa from the trained CNN models and demonstrated the usefulness of the extracted features for prediction. In addition, we demonstrated a visualization using Cytoscape to facilitate the examination and interpretation of the retrieved taxa on the phylogenetic tree.

Paper structure:

1. Introduction

2. Methods

2.1 Embedding the Phylogenetic Tree

2.2 Architecture of Convolution Neural Network

2.3 Extraction of the informative features from learned CNN models

3. Results

3.1 Datasets

3.2 Model Evaluation

3.3 Extracted Features

3.4 Visualization of Extracted Features

3.5 Evaluation of Extracted Features for Prediction

3.6 Computation time

4. Discussion 

5. Conclusion

Excerpts from the main text:

1. Biological Problem: What biological problems have been solved in this paper?

  • host phenotype prediction

2. Main discoveries: What are the main discoveries in this paper?

  • We show the superior performance of PopPhy-CNN compared to random forest, support vector machines, LASSO and a baseline 1D-CNN model constructed with relative abundance microbial feature vectors.
  • In addition, we design a novel scheme of feature extraction from the learned CNN models and demonstrate the improved performance when the extracted features are used to train support vector machines.

3. ML(Machine Learning) Methods: What are the ML methods applied in this paper?

  • Datasets: three metagenomic datasets of moderate size: (1) cirrhosis (114 cases vs. 118 controls); (2) type 2 diabetes (223 cases vs. 217 controls); and (3) obesity (164 cases vs. 89 controls).
  • The convolutional neural network is composed of three convolutional layers. Each layer contains 64 kernels and uses max-pooling, with ReLU as the activation function. The output from the last convolutional layer is passed to two fully connected layers of 1024 neurons each and then to a softmax output layer with 2 neurons.
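
To get a feel for the architecture described above, the following sketch traces tensor shapes through three convolution plus max-pooling stages and into the dense layers. Only the layer counts (3 convolutional layers, 64 kernels, two dense layers of 1024, 2 outputs) come from the notes. The input size (10 x 100), the 3 x 3 kernels, the 1 x 2 pooling windows, stride 1 and no padding are assumptions chosen for illustration, and `trace_shapes` is a hypothetical helper.

```python
def conv_out(size, kernel):
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_out(size, window):
    """Output length of non-overlapping max-pooling."""
    return size // window


def trace_shapes(height, width, kernel=(3, 3), pool=(1, 2),
                 n_layers=3, n_kernels=64):
    """Return (channels, height, width) after each conv + pool stage."""
    shapes = [(1, height, width)]  # single-channel abundance matrix
    h, w = height, width
    for _ in range(n_layers):
        h, w = conv_out(h, kernel[0]), conv_out(w, kernel[1])
        h, w = pool_out(h, pool[0]), pool_out(w, pool[1])
        shapes.append((n_kernels, h, w))
    return shapes


# Assumed input: a 10 x 100 embedded tree matrix (hypothetical size).
SHAPES = trace_shapes(10, 100)
channels, h, w = SHAPES[-1]

# Flattened conv output -> two dense layers of 1024 -> 2-way softmax.
DENSE_SIZES = [channels * h * w, 1024, 1024, 2]

for shape in SHAPES:
    print(shape)
print(DENSE_SIZES)
```

Under these assumptions most of the weights sit in the first dense layer. This is one reason regularization matters when the number of taxa is large relative to the number of samples.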

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

  • Methods compared against: random forest, support vector machines, LASSO and a baseline 1D-CNN model constructed with relative abundance microbial feature vectors.
  • PopPhy-CNN can efficiently train models and does not require an excessive amount of data.
  • PopPhy-CNN facilitates not only retrieval of informative microbial taxa from the trained CNN models but also visualization of the taxa on the phylogenetic tree.
  • Deep models are attractive owing to the ability of deep architectures to identify potential interactions of microbial taxa for disease prediction.
  • CNNs can deliver more robust performance without requiring excessively large training sets.

5. Biological Significance: What is the biological significance of these ML methods’ results?

  • Extraction of the informative features from learned CNN models

  • The key contribution is leveraging biological knowledge in microbial taxa relative abundance profiles through a phylogenetic tree by our novel propagation and embedding procedure. 
  • The 2D matrix input obtained from this procedure enables CNNs to exploit the topological structure of the phylogenetic tree for developing more accurate predictive models.
  • The results provide evidence that the activation maps on the first layer of the CNNs maintain the spatial relationship between the microbial taxa on the phylogenetic tree.
  • This implies that PopPhy-CNN benefits from learning informative features on the populated phylogenetic tree embedded in the matrix format.

6. Prospect: What are the potential applications of these machine learning methods in biological science?

  • The phylogenetic tree is one of the core components in the PopPhy-CNN learning framework.
  • Different ways of embedding the populated trees into the matrix format may also affect the model performance.
  • If the number of microbial taxa substantially outnumbers that of the learning samples, more effective regularization schemes or algorithms that promote the learning of important features in CNNs are likely necessary.

7. My Questions (Optional)

  • The authors first tried to increase the sample size by re-sampling and adding noise to each new sample.
  • Could a GAN also be tried here?
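
For reference, the re-sampling-plus-noise idea can be sketched as below. The notes do not state the noise model, so Gaussian noise, clipping at zero and renormalization to a relative-abundance profile are assumptions, and `upsample` and the toy data are hypothetical. According to the paper's own summary, training on the original data was already sufficient, so this is an illustration rather than a recommended step.

```python
import random

# Hypothetical toy data: (relative-abundance profile, class label) pairs.
SAMPLES = [
    ([0.50, 0.30, 0.20], 1),
    ([0.10, 0.60, 0.30], 0),
]


def upsample(samples, n_new, noise_sd=0.01, seed=0):
    """Draw samples with replacement and perturb each with Gaussian noise."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        profile, label = rng.choice(samples)  # re-sample with replacement
        # Assumed noise model: additive Gaussian noise, clipped at zero.
        noisy = [max(0.0, x + rng.gauss(0.0, noise_sd)) for x in profile]
        total = sum(noisy)
        if total > 0:
            # Renormalize so the profile still sums to 1.
            noisy = [x / total for x in noisy]
        else:
            noisy = list(profile)
        augmented.append((noisy, label))  # the label is kept unchanged
    return augmented


AUGMENTED = upsample(SAMPLES, n_new=5)
for profile, label in AUGMENTED:
    print(label, [round(x, 3) for x in profile])
```

A GAN would replace the fixed noise model with a learned generator. Whether that helps at sample sizes of a few hundred is an open question the notes do not answer.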