我以前写过临床微生物组的文章,其中数据分析用过microeco包,在这里,将我学到的资源分享给大家。
R语言microeco:一个用于微生物群落生态学数据挖掘的R包。
主要功能R6类;分类群丰度图,维恩图,Alpha多样性,Beta多样性,差异丰度分析,环境数据分析,零模型分析,网络分析,功能分析。
install.packages("microeco")
library(microeco)
library(magrittr)
library(ggplot2)
set.seed(123)
theme_set(theme_bw())
> data(sample_info_16S) #样本信息表
> data(otu_table_16S) #otu表
> data(taxonomy_table_16S) #分类表
> data(phylo_tree_16S) #
> data(env_data_16S) #环境因子
otu_table_16S[1:5, 1:5] S1 S2 S3 S4 S5 OTU_4272 1 0 1 1 0 OTU_236 1 4 0 2 35 OTU_399 9 2 2 4 4 OTU_1556 5 18 7 3 2 OTU_32 83 9 19 8 102
> taxonomy_table_16S[1:5, 1:3] Kingdom Phylum Class OTU_4272 k__Bacteria p__Firmicutes c__Bacilli OTU_236 k__Bacteria p__Chloroflexi c__ OTU_399 k__Bacteria p__Proteobacteria c__Betaproteobacteria OTU_1556 k__Bacteria p__Acidobacteria c__Acidobacteria OTU_32 k__Archaea p__Miscellaneous Crenarchaeotic Group c__
有时,分类学表可能会有一些混乱的信息,如NA、unidentified和unknown。这些信息可以影响接下来的分类丰度计算。因此,有必要使用以下代码清理该文件。该操作的另一个重要部分是统一分类学前缀,例如将门级的D_1__转换为p__。
> taxonomy_table_16S %<>% tidy_taxonomy
> sample_info_16S[1:5, ] SampleID Group Type Saline S1 S1 IW NE Non-saline soil S2 S2 IW NE Non-saline soil S3 S3 IW NE Non-saline soil S4 S4 IW NE Non-saline soil S5 S5 IW NE Non-saline soil
> head(env_data_16S) Latitude Longitude Altitude Temperature Precipitation TOC NH4 NO3 pH Conductivity TN S1 52.96482 122.5635 432 -4.2 445 10.02 18.95966 1.93 5.10 108.7 0.28 S2 52.94877 122.6109 445 -4.3 449 6.94 21.05730 1.46 6.53 31.2 0.22 S3 52.94932 122.6099 430 -4.3 449 8.38 14.31937 2.38 5.13 68.2 0.26 S4 52.94900 122.6104 430 -4.3 449 8.17 20.89645 1.38 5.81 64.4 0.25 S5 52.95292 122.6057 429 -4.3 449 8.41 9.65465 0.77 5.50 33.5 0.24 S6 52.95300 122.6123 429 -4.3 449 8.13 15.94577 1.31 5.37 33.3 0.26
> class(phylo_tree_16S) [1] "phylo"
然后,我们创建一个microtable类的对象。该操作与package phyloseq非常相似,但microeco更简短、更简单。microtable类中的otu_table必须是物种-样本格式:行名必须是OTU名称,colname必须是样本名称。在sample_table的rownames和otu_table的colnames中,所需的样例名称必须相同。
dataset <- microtable$new(sample_table = sample_info_16S, otu_table = otu_table_16S, tax_table = taxonomy_table_16S, phylo_tree = phylo_tree_16S)
> class(dataset) [1] "microtable" "R6"
> print(dataset) microtable-class object: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13628 rows and 7 columns phylo_tree have 14096 tips
为了使物种和样本信息在数据集对象的不同文件中保持一致,我们可以使用函数tidy_dataset()来修剪数据集。
>dataset$tidy_dataset() > print(dataset) microtable-class object: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13628 rows and 7 columns phylo_tree have 13628 tips
然后,我们去除在“k__Archaea”或“k__Bacteria”中未分配的otu。
> dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria") > print(dataset) microtable-class object: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13330 rows and 7 columns phylo_tree have 13628 tips
我们还去除分类上带有“线粒体”或“叶绿体”的otu。
> dataset$filter_pollution(taxa = c("mitochondria", "chloroplast")) Total 34 features are removed from tax_table ... > print(dataset) microtable-class object: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13628 tips
然后,为了使otu_table、tax_table和phylo_tree中的otu相同,我们再次使用tidy_dataset()。
> dataset$tidy_dataset() > print(dataset) microtable-class object: sample_table have 90 rows and 4 columns otu_table have 13296 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13296 tips
然后使用sample_sum()检查每个样本中的序列数。
> dataset$sample_sums() %>% range [1] 10316 37087
有时,为了减少物种数量对多样性测量的影响,我们需要进行重采样,使每个样本的序列数相等。函数rarefy_samples可以在细化之前和之后自动调用函数tidy_dataset。
# As an example, we use 10000 sequences in each sample > dataset$rarefy_samples(sample.size = 10000) 530 features are removed because they are no longer present in any sample after random subsampling ... 530 taxa with 0 abundance are removed from the otu_table ... > dataset$sample_sums() %>% range [1] 10000 10000
然后,我们使用cal_abund()计算每个分类等级上的分类群丰度。这个函数返回一个名为taxa_abund的列表,其中包含每个分类等级的丰度信息的几个数据帧。该列表自动存储在microtable对象中。
> dataset$cal_abund()
# return dataset$taxa_abund
> class(dataset$taxa_abund)
如果希望将分类库丰度文件保存到本地,请使用save_abund()。
> dir.create("taxa_abund")
> dataset$save_abund(dirpath = "taxa_abund")
然后,我们计算alpha分集。结果也自动存储在对象微表中。例如,我们不计算系统发育多样性。
> dataset$cal_alphadiv(PD = FALSE)
# return dataset$alpha_diversity
> class(dataset$alpha_diversity)
> dir.create("alpha_diversity")
> dataset$save_alphadiv(dirpath = "alpha_diversity")
我们还使用函数cal_betadiv()计算beta分集的距离矩阵。我们提供了四种最常用的指数:Bray-curtis、Jaccard、加权Unifrac和未加权Unifrac。
> dir.create("alpha_diversity") > dataset$save_alphadiv(dirpath = "alpha_diversity") > dataset$cal_betadiv(unifrac = TRUE) Accumulate the abundance along the tree branches... Compute pairwise distances ... Completed! The result is stored in object$beta_diversity ... # return dataset$beta_diversity > class(dataset$beta_diversity) [1] "list" > dir.create("beta_diversity") > dataset$save_betadiv(dirpath = "beta_diversity")