Paper reading (65): Kernel-penalized regression for analysis of microbiome data

盲人骑瞎马5555

Published 2019-11-07 14:52:29

Category: Paper Reading

Article link: https://blog.csdn.net/wxw060709/article/details/102947372

Paper title: Kernel-penalized regression for analysis of microbiome data

Google Scholar citations: 15

Pages: 29

Publication date: 2018.03

Venue: Institute of Mathematical Statistics

Authors: Timothy W. Randolph, Sen Zhao, ..., and Ali Shojaie

Abstract:

The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and a percent fat.

Paper outline:

1. Introduction

2. Kernel Penalized Regression for Microbiome Data

2.1 Background for PCoA and principal component regression

2.2 Penalized regression and DPCoA

2.3 Kernel-based regression with two kernels

2.4 Regression with compositional data

3. Numerical Experiments

3.1 Regression and DPCoA

3.2 Regression and PCoA with respect to a UniFrac kernel

3.3 Regression and PCoA using an edge-matrix kernel

4. Application to an observational study

5. Discussion

Excerpts from the paper:

1. Biological Problem: What biological problems have been solved in this paper?

The analysis of human microbiome data

2. Main discoveries: What are the main discoveries in this paper?

use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome.
how this regression framework can be used to address the compositional nature of multivariate predictors comprised of relative abundances; that is, vectors whose entries sum to a constant.
An interesting feature of the proposed kernel-penalized regression framework is its ability to sidestep some of the problems inherent in compositional data analysis.
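
To make "vectors whose entries sum to a constant" concrete, here is a minimal sketch of the centered log-ratio (CLR) transform, one standard device in compositional data analysis. Whether it matches the paper's exact construction is an assumption on my part; the function name `clr`, the pseudocount value, and the toy counts are all hypothetical.

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of an (n samples x p taxa) count matrix.

    The pseudocount is an illustrative choice to avoid log(0); it is not
    taken from the paper.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    comp = x / x.sum(axis=1, keepdims=True)   # relative abundances, each row sums to 1
    log_comp = np.log(comp)
    # Subtract each row's mean log-abundance, so every transformed row sums to 0
    return log_comp - log_comp.mean(axis=1, keepdims=True)

# Toy data: 2 samples, 4 taxa (made up for illustration)
counts = np.array([[10, 0, 5, 85],
                   [3, 30, 60, 7]])
z = clr(counts)
```

Because only ratios between taxa enter the transform, rescaling a sample's total read count (together with the pseudocount) leaves `z` unchanged, which is the property that makes log-ratio coordinates attractive for relative abundances.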

3. ML (Machine Learning) Methods: What are the ML methods applied in this paper?

describe a framework of high-dimensional regression models that extends these distance-based methods.
A primary motivation for PCoA graphical displays is the ability to incorporate biologically-inclined measures of (dis)similarity.
Proposed method: kernel penalized regression
We show how phylogenetic and other structure can be incorporated via kernel penalized regression in either the primal (p-dimensional) feature space or the dual (n-dimensional) samples space
Previous methods: PCoA? standard (Euclidean-based) statistical models
Dataset: We apply our kernel-penalized regression framework to data from 16S rRNA gene collected in a study of premenopausal women (Hullar et al., 2015). This study investigated aspects of gut microbial communities in stool samples from premenopausal women using 454 pyrosequencing of the 16S rRNA gene. The abundances of 127 species were zero for more than 90% of the subjects and were removed from our analysis. The data set we consider consists of p = 128 species sampled from n = 102 women.
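
The primal (p-dimensional) versus dual (n-dimensional) remark above can be illustrated with a small sketch. It assumes a generalized ridge objective ||y - Xβ||² + λ βᵀQ⁻¹β with a positive-definite p×p kernel Q over the taxa; this is my simplified reading, not necessarily the paper's full formulation, and the function names and random data are hypothetical.

```python
import numpy as np

def kpr_primal(X, y, Q, lam):
    """Solve in the p-dimensional feature space: (X'X + lam * Q^{-1}) beta = X'y."""
    Q_inv = np.linalg.inv(Q)
    return np.linalg.solve(X.T @ X + lam * Q_inv, X.T @ y)

def kpr_dual(X, y, Q, lam):
    """Solve in the n-dimensional sample space: beta = Q X' (X Q X' + lam * I)^{-1} y."""
    n = X.shape[0]
    K = X @ Q @ X.T                              # n x n kernel between samples
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    return Q @ X.T @ alpha

# Random stand-in data with p > n (not the study data)
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
A = rng.normal(size=(p, p))
Q = A @ A.T / p + np.eye(p)                      # arbitrary symmetric positive-definite kernel

beta_primal = kpr_primal(X, y, Q, lam=1.0)
beta_dual = kpr_dual(X, y, Q, lam=1.0)
```

Under this assumed objective the two routes give the same coefficient vector; the dual form only needs an n×n solve, which is cheaper when there are far more taxa than samples.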

4. ML Advantages: Why are these ML methods better than the traditional methods in these biological problems?

traditional methods: dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Principal coordinate analysis
none of these analyses proceed to estimate the individual associations
In contrast, we focus on estimating the coefficient vector, which is a key aspect of any approach used to draw scientific conclusions based on the association of microbial communities with an outcome or phenotype.
Our approach, which differs somewhat from that of Li (2015), may also be viewed as a penalized version of the low-dimensional linear model for compositions by Tolosana-Delgado and van den Boogaart (2011), who use the isometric log-ratio (ILR) coordinates.
for addressing well-known problems that arise from applying standard (Euclidean-based) statistical models to compositional data
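
For reference, the classical PCoA ordination that the paper contrasts itself with can be sketched as follows: double-center the squared distance matrix, then take the leading eigenvectors. This is the textbook procedure, not code from the paper; the function name `pcoa` and the toy points are hypothetical, and a Euclidean distance stands in for an ecological one such as UniFrac.

```python
import numpy as np

def pcoa(D, k=2):
    """Classical principal coordinate analysis of an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gower double-centering
    evals, evecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]          # keep the k largest
    evals, evecs = evals[order], evecs[:, order]
    # Negative eigenvalues can occur for non-Euclidean distances; clip them to 0
    coords = evecs * np.sqrt(np.clip(evals, 0.0, None))
    return coords, evals

# Toy example: four points in the plane
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords, evals = pcoa(D, k=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The output is a set of sample coordinates for plotting. It contains no per-taxon coefficients, which is the gap the regression framework is meant to fill.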

5. Biological Significance: What is the biological significance of these ML methods’ results?

In this analysis, we obtain estimates of associations between microbial species and percent fat measured in premenopausal women, and also provide inference for these estimates by applying a recent significance test in our kernel-penalized regression (KPR) framework.

6. Prospect: What are the potential applications of these machine learning methods in biological science?

the proposed framework also allows us to use existing inference frameworks for high-dimensional regression, and in particular the Grace test (Zhao and Shojaie, 2016), to assess the significance of estimated regression coefficients.