一文学会PCA/PCoA相关统计检验(PERMANOVA)和可视化

最新推荐文章于 2024-08-11 10:51:44 发布

生信宝典

最新推荐文章于 2024-08-11 10:51:44 发布

阅读量2.3w

点赞数 36

文章标签： python 机器学习人工智能数据分析 java

本文链接：https://blog.csdn.net/qazplm12_3/article/details/120620477

版权

方差分析中的元和因素

试验中要考察的指标称为试验指标，影响试验指标的条件称为因素，因素所处的状态称为水平 (通常用于3个或更多水平时；如果只有2个水平考虑T-test);若试验中只有一个因素改变则称为单因素试验，若有两个因素改变则称为双因素试验，若有多个因素改变则称为多因素试验。

方差分析就是对试验数据进行分析，检验方差相等的多个正态总体 均值是否相等，进而判断各因素对试验指标的影响是否显著；根据影响试验指标条件的个数可以区分为单因素方差分析、双因素方差分析和多因素方差分析。(来源于：百度百科)

方差分析中的因素

方差分析中的因素通常是人为选定或可控的影响条件，如对样品的人为处理、样品自身的标记属性等。不可控因素如病人的心情、试验操作人的心情等一般不视为因素或不作为关注的因素；(还有一些不可控因素或通常认为不会带来很多影响的因素，如不同的取样时间、不同的RNA提取时间、提取人、细胞所处的分裂周期等；在某些情况下，如果我们记录了这些因素并且关心这些因素时，也会变为方差分析中的因素)。

举个例子，比如病人服用不同浓度药物后基因表达变化试验中：

基因表达是试验指标;
药物浓度是因素，假设有3个水平低浓度、中浓度和高浓度。

这就是单因素方差分析 (one-way ANOVA)，比较病人服用不同浓度药物后基因表达的均值是否相等；

如果同时考虑病人的年龄的影响，则

年龄也是因素，有多个水平比如幼年、青年、成年、老年等。

这就是两因素方差分析 (two-way ANOVA)，比较用药浓度和年龄对基因表达变化的影响，称为“主效应”影响；有时还需要同时比较浓度+年龄组成的新变量对基因表达变化的影响，称为“交互效应”影响。(如果只是比较浓度+年龄组成的新变量对基因表达变化的影响，就又是单因素方差分析了)

如果再考虑病人的籍贯、药物种类、吃药时间、病人Marker突变等的影响，就是多因素方差分析了。

方差分析中的试验指标

试验中要考察的指标称为试验指标。在上面的例子中基因表达是一个试验指标，不过很笼统，默认为是单个基因的表达，称为一元方差分析。

那如果是关注两个基因或所有基因的表达变化整体是否有差异呢？

这就是多元方差分析，每组样本不是只包含一个试验指标而是多个试验指标。

表现在数据形式上：

(一元)方差分析是比较多组向量的均值是否存在显著差异。
多元方差分析是比较多组矩阵的均值是否存在显著差异。

因此，比较多组样本整体基因表达的差异、多组样本整体菌群构成的差异，就需要多元方差分析了。

多元方差分析

在统计学中，多元方差分析 (MANOVA, multivariate analysis of variance) 是一种对多个分组中检测了多个指标变量 (这里的变量等同于上面的指标；如每个样本中每个物种的丰度信息、每个样本中每个基因的表达信息)的样本整体均值的检验方法。作为一个多变量过程，它在有两个或多个因变量时使用，并且通常会分别涉及各个因变量的显着性检验。它有助于回答：

自变量 (因素)的变化是否对因变量 (试验指标)有显着影响？
因变量之间有什么关系？
自变量之间有什么关系？

注: 对应上面 - 所有的因素都是自变量 (independent variable)，而试验指标是因变量 (dependent variable)。这在看英文文献或不同教程时需要注意描述差异。

多元方差分析 (MANOVA, multivariate analysis of variance)的前提假设可类比于一元方差分析 (观测指标值的独立性、正态性、方差齐性)

数据独立性。
每个分组内的检测指标符合多元正态分布。
每个分组内的检测指标的协方差矩阵一致。

但在很多生物、生态和环境数据集中，多元方差分析的前提假设通常难以满足。

一些鲁棒性更强、对数据分布依赖更少的检验方法被提出来并且获得广泛应用，如ANOSIM (analysis of similarities), PERMANOVA (permutational multivariate analysis of variance) (也称为NPMANOVA, non-parametric MNOAVA), 和Mantel test。这些方法都通过一个样本间的距离矩阵或相似性矩阵构建ANOVA分析类似的统计量，然后对每组的观测结果进行随机置换来计算显著性P-value。对于单因素分析，对数据唯一的假设条件就是观察指标数据存在可置换性 (exchangeability)。

下面我们再介绍如何应用PERMANOVA来检验PcOA等的结果的显著性。

PERMANOVA基本原理

PERMANOVA是多元方差分析的非参数变体。它用来比较多组观测样本的统计指标值的异同。

它利用距离矩阵(如欧式距离、Bray-Curtis距离)对总方差进行分解，分析不同分组因素或不同环境因子对样品差异的解释度，并使用置换检验对各个变量解释的统计学意义进行显著性分析。

目的是检测不同分组的响应变量如菌群构成是否有显著差异。因主要用函数adonis进行分析，有时也称为adonis 检验。

The goal of this test is to tell you if there are significant differences in your response variables among your groupings.

原始假设 (null hypothesis)是每组样本在其检测指标构成的检测空间中的中心点 (centroid)和离散度dispersion无差别。

计算出P值小于0.05时拒绝原假设，也就是不同组样品在检测空间的中心点或分布显著不同。

该检验需要预先计算试验样品在检测指标定义的多维空间的距离，如欧式距离、Bray-Curtis距离等。

The null hypothesis that the centroids and dispersion of the groups as defined by measure space are equivalent for all groups. A rejection of the null hypothesis means that either the centroid and/or the spread of the objects is different between the groups.

比如，对宏基因组检测的物种丰度数据进行PCA/NMDS/PCoA降维可视化后，不同组的样品之间存在一些重叠，那怎么判断这些组之间的样品构成是否存在显著差别呢？

这就需要用到PERMANOVA检验了，检验不同组的样品中心点是否重叠。

当然，PERMANOVA并不依赖于某种降维方法，而是依赖于距离矩阵。如果检测出p值大于0.05，表示不同组的物种构成或相对丰度没有显著差异。

PERMANOVA对检测数据的分布没有任何限制，也不受组之间数据协方差不同的影响，对多重共线性和很多0-值不敏感，其依赖的前提假设是：

每个对象的数据具有可交换性 (exchangeable)
可交换的对象（样品）彼此独立
每个样品的检测数据有一致的多变量分布（每组数据的离散程度相近）

PERMANOVA分析等同于分组变量为解释变量矩阵的哑变量时的基于距离的冗余分析 (db-RDA)。

PERMANOVA测试的统计值是伪F值 (pseudo F-ratio)，类比于ANOVA分析的F值。它的计算方式是不同组样品之间的距离（或距离的排序）平方和(图中黄色部分)除以同一组样品之间的距离（或距离的排序)平方和（图中蓝色部分），具体如下面公式。

更大的F值表示更强的组分离。通常这个值的显著性要比这个值本身的大小更有意义。

PERMANOVA采用数据置换的方式计算pseudo F-值的统计显著性，比较随机置换数据获得的pseudo F-值是否高于或等于实际观测到的值。如果多于5%随机置换计算的pseudo-F值高于实际观测值，则表示不同组的样品之间不存在显著差异 (p-value > 0.05)。

It is vital that the correct permutational scheme is defined and only exchangeable units are permuted. In nested studies, this would mean restricting permutations to an appropriate subgroup of the data set. At times, exact permutation tests either cannot be done, or are restricted to so few objects, that they are not useful.

PERMANOVA 实战 (一)

采用vegan包自带的一套数据（也解释了如何自己准备数据）看下PERMANOVA的具体代码和应用。

dune数据集描述

dune是一套包含了20个样品和30个物种丰度数据的统计表。其格式是常见OTU表转置后的格式，每一行是一个样品，每一列是一个物种 (检测指标)。

library(vegan)
data(dune)

dim(dune)

## [1] 20 30

head(dune)

##   Achimill Agrostol Airaprae Alopgeni Anthodor Bellpere Bromhord Chenalbu Cirsarve Comapalu Eleopalu Elymrepe Empenigr Hyporadi
## 1        1        0        0        0        0        0        0        0        0        0        0        4        0        0
## 2        3        0        0        2        0        3        4        0        0        0        0        4        0        0
## 3        0        4        0        7        0        2        0        0        0        0        0        4        0        0
## 4        0        8        0        2        0        2        3        0        2        0        0        4        0        0
## 5        2        0        0        0        4        2        2        0        0        0        0        4        0        0
## 6        2        0        0        0        3        0        0        0        0        0        0        0        0        0

如果我们有一个OTU丰度表，怎么转成这个格式呢？

otu_table <- read.table("otutable_rare",sep="\t", row.names=1, header=T)

as.data.frame(t(otu_table))

##       OTU1 OTU2 OTU3
## Samp1    2   12   22
## Samp2   13   13   10
## Samp3   14    8   14
## Samp4   15   10   11

dune.env是元数据信息，包含数据的分子信息、生存环境信息等，记录了5个因素 (同时包含连续变量信息和分组变量信息)：

A1: 土壤厚度信息 a numeric vector of thickness of soil A1 horizon.
Moisture: 湿度等级信息，分4个等级，1 < 2 < 4 < 5.
Management: 分组信息，不同的管理方式 a factor with levels: BF (Biological farming), HF (Hobby farming), NM (Nature Conservation Management), and SF (Standard Farming).
Use: 一个分组信息 an ordered factor of land-use with levels: Hayfield < Haypastu < Pasture.
Manure: 一个分组信息，0 < 1 < 2 < 3 < 4.

data("dune.env")

head(dune.env)

##    A1 Moisture Management      Use Manure
## 1 2.8        1         SF Haypastu      4
## 2 3.5        1         BF Haypastu      2
## 3 4.3        2         SF Haypastu      4
## 4 4.2        2         SF Haypastu      4
## 5 6.3        1         HF Hayfield      2
## 6 4.3        1         HF Haypastu      2

summary(dune.env)

##        A1         Moisture Management       Use    Manure
##  Min.   : 2.800   1:7      BF:3       Hayfield:7   0:6   
##  1st Qu.: 3.500   2:4      HF:5       Haypastu:8   1:3   
##  Median : 4.200   4:2      NM:6       Pasture :5   2:4   
##  Mean   : 4.850   5:7      SF:6                    3:4   
##  3rd Qu.: 5.725                                    4:3   
##  Max.   :11.500

这个文件就是我们常用的metadata文件，组织格式也一致，每一行是一个样品，每一列对应样品的不同属性。

绘制一个PcOA的图看一下

多维排列 (Multidimensional scaling, `MDS`)是可视化多变量样品(如多个物种丰度、多个基因表达)相似性水平的一种方法。其基于距离矩阵进行一系列的排序分析。

经典的MDS (`CMDS`)分析就是前面提到的`PCoA`分析，也称为度量性MDS分析。PCoA分析原理与PCA类似，都是一样的因式分解、求取特征值和特征向量；只是PCA是依赖于欧式距离（隐式依赖），PCoA可以处理任何距离矩阵（显示计算距离作为输入）。

# 计算加权bray-curtis距离
dune_dist <- vegdist(dune, method="bray", binary=F)

dune_pcoa <- cmdscale(dune_dist, k=3, eig=T)

dune_pcoa_points <- as.data.frame(dune_pcoa$points)
sum_eig <- sum(dune_pcoa$eig)
eig_percent <- round(dune_pcoa$eig/sum_eig*100,1)

colnames(dune_pcoa_points) <- paste0("PCoA", 1:3)

dune_pcoa_result <- cbind(dune_pcoa_points, dune.env)

head(dune_pcoa_result)

##         PCoA1       PCoA2       PCoA3  A1 Moisture Management      Use Manure
## 1 -0.35473182 -0.25667235  0.31129225 2.8        1         SF Haypastu      4
## 2 -0.29462318 -0.18609437  0.03355954 3.5        1         BF Haypastu      2
## 3 -0.07276681 -0.29087086 -0.01169171 4.3        2         SF Haypastu      4
## 4 -0.06925423 -0.26419764 -0.01634735 4.2        2         SF Haypastu      4
## 5 -0.30706200  0.03031589 -0.09124310 6.3        1         HF Hayfield      2
## 6 -0.25302974  0.09420852  0.02814297 4.3        1         HF Haypastu      2

library(ggplot2)

ggplot(dune_pcoa_result, aes(x=PCoA1, y=PCoA2, color=Management)) +
  labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
       y=paste("PCoA 2 (", eig_percent[2], "%)", sep="")) +
  geom_point(size=4
  ) + stat_ellipse(level=0.6) +
  theme_classic()

## Too few points to calculate an ellipse

## Warning: Removed 1 row(s) containing missing values (geom_path).

样品中重复太少了，做不出置信椭圆。换个方式，用ggalt包中的geom_encircle把样品包起来。

# install.packages("ggalt")
library(ggalt)
ggplot(dune_pcoa_result, aes(x=PCoA1, y=PCoA2, color=Management, group = Management)) +
  labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
       y=paste("PCoA 2 (", eig_percent[2], "%)", sep="")) +
  geom_point(size=5) + 
  geom_encircle(aes(fill=Management), alpha = 0.1, show.legend = F) +
  theme_classic() + coord_fixed(1)

那么不同管理风格对物种组成是否有显著影响呢？

关注不同管理风格对物种组成是否有显著影响

假如关注的问题是：不同的管理风格对物种组成是否有显著影响？这就是一个典型的单因素非参多元方差分析。因素就是Management。

基于bray-curtis距离进行PERMANOVA分析，代码和结果如下：

dune是转置后的物种丰度表 (抽平或相对比例都行)
Management是dune.env中的列名字，代表一列信息，可以是任意样品属性信息或分组信息
permutations设置置换次数
method指定距离计算方法
R2值显示Management可以解释总体差异的34.2%，且P<0.05，表示不同的管理风格下的物种组成差异显著。
当然还有65.8%的差异是其它因素造成的。
这通常是我们对PcOA等降维图标记统计检验P值的常用方式。

注意：因为是随机置换，在未指定随机数种子时，每次执行的结果都会略有不同，但通常对结论没有影响。

# 基于bray-curtis距离进行计算
dune.div <- adonis2(dune ~ Management, data = dune.env, permutations = 999, method="bray")

dune.div

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ Management, data = dune.env, permutations = 999, method = "bray")
##            Df SumOfSqs      R2      F Pr(>F)   
## Management  3   1.4686 0.34161 2.7672  0.004 **
## Residual   16   2.8304 0.65839                 
## Total      19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

注意：因为是随机置换，在未指定随机数种子时，每次执行的结果都会略有不同，但通常对结论没有影响。也可以如下设置随机数种子，则结果稳定。

# 基于bray-curtis距离进行计算
set.seed(1)
dune.div <- adonis2(dune ~ Management, data = dune.env, permutations = 999, method="bray")

dune.div

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ Management, data = dune.env, permutations = 999, method = "bray")
##            Df SumOfSqs      R2      F Pr(>F)   
## Management  3   1.4686 0.34161 2.7672  0.002 **
## Residual   16   2.8304 0.65839                 
## Total      19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

把统计检验结果加到PcOA的图上。

dune_adonis <- paste0("adonis R2: ",round(dune.div$R2,2), "; P-value: ", dune.div$`Pr(>F)`)

# install.packages("ggalt")
library(ggalt)
ggplot(dune_pcoa_result, aes(x=PCoA1, y=PCoA2, color=Management, group = Management)) +
  labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
       y=paste("PCoA 2 (", eig_percent[2], "%)", sep=""),
       title=dune_adonis) +
  geom_point(size=5) + 
  geom_encircle(aes(fill=Management), alpha = 0.1, show.legend = F) +
  theme_classic() + coord_fixed(1)

整体有差异了，后面就看看具体那两组之间有差异，哪两组之间无差异~~~

配对Adonis确定不同管理方式两两之间对物种组成差异的影响

adonis分析可以检验某个因素整体对物种组成差异的影响，但不能比较这个因素的多个水平之间两两是否差异显著，如Management中的BF与HM两种方式是否对物种组成差异有显著影响？

这时就需要pairwise.adonis来进行配对检验了。

# devtools::install_github("pmartinezarbizu/pairwiseAdonis/pairwiseAdonis")
library(pairwiseAdonis)

# This is a wrapper function for multilevel pairwise comparison 
# using adonis() from package 'vegan'. 
# The function returns adjusted p-values using p.adjust().
dune.pairwise.adonis <- pairwise.adonis(x=dune, factors=dune.env$Management, sim.function = "vegdist",
  sim.method = "bray",
  p.adjust.m = "BH",
  reduce = NULL,
  perm = 999)

dune.pairwise.adonis

##      pairs Df SumsOfSqs  F.Model        R2 p.value p.adjusted sig
## 1 SF vs BF  1 0.4016624 2.514890 0.2643110   0.055     0.0825    
## 2 SF vs HF  1 0.2828804 1.857489 0.1710790   0.117     0.1404    
## 3 SF vs NM  1 0.7575728 3.425694 0.2551595   0.008     0.0480   .
## 4 BF vs HF  1 0.1617135 1.567531 0.2071390   0.197     0.1970    
## 5 BF vs NM  1 0.5662456 2.715242 0.2794827   0.017     0.0510    
## 6 HF vs NM  1 0.6513088 3.423068 0.2755413   0.031     0.0620

拼一起画个图

library(ggpubr)
library(patchwork)
tab2 <- ggtexttable(dune.pairwise.adonis[,c("pairs","R2","p.value","p.adjusted")], rows = NULL, 
                    theme = ttheme("blank")) %>% 
  tab_add_hline(at.row = 1:2, row.side = "top", linewidth = 1)  %>% 
  tab_add_hline(at.row = nrow(dune.pairwise.adonis)+1, row.side = "bottom", linewidth = 1)  

p2 = p + tab2 
p2

p2 + plot_layout(design=c(area(1,1), area(2,1)))
# p / tab2
# 调布局

ANOSIM和PERMANOVA的pairwise analysis声明：“Pairwise tests are not possible in vegan. My understanding is that the non-R software with such tests makes separate pairwise tests using subsets of data with only two levels of a factor in one test. We don’t provide that in vegan and have no plans to provide this in the future.” (cited by Jari Oksanen, author of anosim and Adonis{vegan} in R)https://stat.ethz.ch/pipermail/r-sig-ecology/2013-June/003865.html

为PERMANOVA/Adonis分析保驾护航，检验数据离散度

非参数检验也不是什么都不需要关注，比如上面提到的因素顺序和方差加和方式是一个需要注意的点。除此之外，非参数多元方差分析应用时还有下面这些注意事项：

PERMANOVA检验没有考虑变量之间的共线性关系，因此也不能够用于探索这种关系。
嵌套或分层设计 (Nested or hierarchical designs)时需要考虑合适的置换策略。
需要明确哪些样品是真正可以交换的 (exchangeable)。
PERMANOVA有个假设是balanced designs (不同分组的样本数相等), 非平衡设计也能处理。
如果不同组的样品在检测指标构成的空间的中心点没有差别，但每个组内检测指标离散度较大，也会导致获得显著性的P值。
在解释结果时，需要同时评估数据离散度的影响。

vegdist评估数据离散度，再解释adonis的结果

前面我们用下面的代码检验了Managment对物种组成差异影响的显著程度，获得P-value=0.002 < 0.05，表示管理方式对物种组成有显著影响。但这一影响是否受到每个分组里面数据离散程度的影响呢？

library(vegan)
data(dune)
data(dune.env)
# 基于bray-curtis距离进行计算
set.seed(1)
dune.div <- adonis2(dune ~ Management, data = dune.env, permutations = 999, method="bray")

dune.div

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = dune ~ Management, data = dune.env, permutations = 999, method = "bray")
##            Df SumOfSqs      R2      F Pr(>F)   
## Management  3   1.4686 0.34161 2.7672  0.002 **
## Residual   16   2.8304 0.65839                 
## Total      19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

我们还需要利用betadisper评估下每组样本物种组成的多元一致性 (Multivariate homogeneity of groups dispersions (variances))。如下代码计算出P=0.168表示不同分组样品检测指标的离散度(方差)没有显著差异。那么，adonis检测出的差异就是因为每组数据在空间的中心点不同造成的，进一步说明Management对物种组成有显著影响。

# 计算加权bray-curtis距离
dune.dist <- vegdist(dune, method="bray", binary=F)

# One measure of multivariate dispersion (variance) for a group of samples 
# is to calculate the average distance of group members to the group centroid 
# or spatial median in multivariate space. 
# To test if the dispersions (variances) of one or more groups are different, 
# the distances of group members to the group centroid are subject to ANOVA. 
# This is a multivariate analogue of Levene's test for homogeneity of variances 
# if the distances between group members and 
# group centroids is the Euclidean distance.
dispersion <- betadisper(dune.dist, group=dune.env$Management)
permutest(dispersion)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df  Sum Sq  Mean Sq      F N.Perm Pr(>F)
## Groups     3 0.13831 0.046104 1.9506    999  0.159
## Residuals 16 0.37816 0.023635

从下面的图上也可以看出，4种管理方式下样品在空间的中心点相距较远。（也可以参考前面如何美化这个图）

plot(dispersion, hull=FALSE, ellipse=TRUE) ##sd ellipse

Q: When running adonis (vegan package) I got an r2 = 0.45, andp = 0.001. When I ran the betadisper and ran a subsequent permutation test I got an F = 1 and p = 0.3.

A: A non-significant result in betadisper is not necessarily related to a significant/non-significant result in adonis. The two tests are testing different hypothesis. The former testshomogeneity of dispersion among groups (regions in your case), which is a condition (assumption) for adonis. The latter tests no difference in ‘location’, that is, tests whether composition among groups is similar or not. You may have the centroids of two groups in NMS at a very similar position in the ordination space, but if theirdispersions are quite different, adonis will give you a significant p-value, thus, the result is heavily influenced not by thedifference in composition between groups but bydifferences in composition within groups (heterogeneous dispersion, and thus a measure of beta diversity). In short, your results are fine, you are meeting the ‘one assumption’ for adonis (homogeneous dispersion) and thus you are certain that results from adonis are ‘real’ and not an artifact of heterogeneous dispersions. For more information you can read Anderson (2006) Biometrics 62(1):245-253 and Anderson (2006) Ecology Letters 9(6):683-693. Hope this helps!

https://stats.stackexchange.com/questions/212137/betadisper-and-adonis-in-r-am-i-interpreting-my-output-correctly

数据离散度不同而中心点一致，adonis也可能显著

下面我们看一个模拟的例子，模拟出3套群体的物种丰度表，群体1、群体2、群体3的物种空间的中心点一致，而物种丰度的离散度依次变小，PERMANOVA检验显著，betadisper结果也显著，这时解释数据时就要小心。这个导致显著的原因是什么。

set.seed(1)
num <- 30
# 控制每个物种的均值
mean <- seq(10,120,by=10)
# 控制离散度
disp <- c(5,50,200)

# 模拟3组样品的数据；直接是转置后的物种丰度表
sites.a <- as.data.frame(mapply(rnbinom, n=num, size=disp[1], mu=mean))
rownames(sites.a) <- paste('site.a', 1:num, sep=".")
colnames(sites.a) <- paste('Species',letters[1:length(mean)], sep=".")

sites.b <- as.data.frame(mapply(rnbinom, n=num, size=disp[1:2], mu=mean))
rownames(sites.b) <- paste('site.b', 1:num, sep=".")
colnames(sites.b) <- paste('Species',letters[1:length(mean)], sep=".")

sites.c <- as.data.frame(mapply(rnbinom, n=num, size=disp, mu=mean))
rownames(sites.c) <- paste('site.c', 1:num, sep=".")
colnames(sites.c) <- paste('Species',letters[1:length(mean)], sep=".")

otu_table_t <- rbind(sites.a,sites.b,sites.c)
otu_table_t[sample(1:90,5),]

##           Species.a Species.b Species.c Species.d Species.e Species.f Species.g Species.h Species.i Species.j
## site.c.22        13        15        43        29        49        72        24       102        75        96
## site.a.26         8        23        46        29        25        15        91        49        58        54
## site.a.13        14        30        47        56        18        77       111       128        90        53
## site.a.14         5        15        17        56        37        75        81        59        63        58
## site.b.21        15        24         8        33        28        42       108        74        76        64
##           Species.k Species.l
## site.c.22       139       142
## site.a.26        87       129
## site.a.13        33        47
## site.a.14       164       183
## site.b.21        52       103

生成Metadata数据，包含样品的分组信息。目的就是检验不同组的物种构成是否有显著差异。

metadata <- data.frame(Sample=rownames(otu_table_t), Group=rep(c("A","B","C"), each=num))
rownames(metadata) <- metadata$Sample
metadata[sample(1:90,5),,drop=F]

##              Sample Group
## site.a.28 site.a.28     A
## site.b.12 site.b.12     B
## site.a.20 site.a.20     A
## site.b.10 site.b.10     B
## site.a.10 site.a.10     A

PCoA和NMDS分析可视化不同组样品物种组成的差异度

统计分析前，先直观的看一下不同组样本在物种定义的空间上的分布。

为什么要画个图：参考 - 什么是安斯库姆四重奏？为什么统计分析之前必须要作图？

# 计算加权bray-curtis距离
otu_dist <- vegdist(otu_table_t, method="bray", binary=F)

otu_pcoa <- cmdscale(otu_dist, k=3, eig=T)

otu_pcoa_points <- as.data.frame(otu_pcoa$points)
sum_eig <- sum(otu_pcoa$eig)
eig_percent <- round(otu_pcoa$eig/sum_eig*100,1)

colnames(otu_pcoa_points) <- paste0("PCoA", 1:3)

otu_pcoa_result <- cbind(otu_pcoa_points, metadata)

从PCoA的结果上来看，A,B,C三个组在第一、第二、第三主坐标轴没有明显的区分开。

library(ggplot2)
library(patchwork)

ggplot(otu_pcoa_result, aes(x=PCoA1, y=PCoA2, color=Group)) +
  labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
       y=paste("PCoA 2 (", eig_percent[2], "%)", sep="")) +
  geom_point(size=4) + stat_ellipse(level=0.9) +
  theme_classic() + coord_fixed() +
  ggplot(otu_pcoa_result, aes(x=PCoA1, y=PCoA3, color=Group)) +
  labs(x=paste("PCoA 1 (", eig_percent[1], "%)", sep=""),
       y=paste("PCoA 3 (", eig_percent[3], "%)", sep="")) +
  geom_point(size=4) + stat_ellipse(level=0.9) +
  theme_classic() + coord_fixed()

从NMDS结果看，A,B,C三组也区分不开。

otu_mds <- metaMDS(otu_table_t, k=5)  #using all the defaults

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1131245 
## Run 1 stress 0.1131233 
## ... New best solution
## ... Procrustes: rmse 0.0003155417  max resid 0.001341899 
## ... Similar to previous best
## Run 2 stress 0.1131243 
## ... Procrustes: rmse 0.0009154324  max resid 0.00352237 
## ... Similar to previous best
## Run 3 stress 0.1131238 
## ... Procrustes: rmse 0.0002307456  max resid 0.001378836 
## ... Similar to previous best
## Run 4 stress 0.1131239 
## ... Procrustes: rmse 0.0002008885  max resid 0.0008441584 
## ... Similar to previous best
## Run 5 stress 0.1131233 
## ... Procrustes: rmse 0.0004594988  max resid 0.00248363 
## ... Similar to previous best
## Run 6 stress 0.1136538 
## Run 7 stress 0.1131231 
## ... New best solution
## ... Procrustes: rmse 6.187922e-05  max resid 0.0002788433 
## ... Similar to previous best
## Run 8 stress 0.1131234 
## ... Procrustes: rmse 0.000457399  max resid 0.002017475 
## ... Similar to previous best
## Run 9 stress 0.1131243 
## ... Procrustes: rmse 0.0003620819  max resid 0.001329571 
## ... Similar to previous best
## Run 10 stress 0.1131235 
## ... Procrustes: rmse 0.0001788438  max resid 0.0008840311 
## ... Similar to previous best
## Run 11 stress 0.1131248 
## ... Procrustes: rmse 0.0004674201  max resid 0.001960981 
## ... Similar to previous best
## Run 12 stress 0.1131231 
## ... New best solution
## ... Procrustes: rmse 0.0003807188  max resid 0.001578129 
## ... Similar to previous best
## Run 13 stress 0.1131238 
## ... Procrustes: rmse 0.0004016239  max resid 0.002178598 
## ... Similar to previous best
## Run 14 stress 0.113123 
## ... New best solution
## ... Procrustes: rmse 0.0001931854  max resid 0.0007886561 
## ... Similar to previous best
## Run 15 stress 0.1176584 
## Run 16 stress 0.1131244 
## ... Procrustes: rmse 0.000621146  max resid 0.002339344 
## ... Similar to previous best
## Run 17 stress 0.1131237 
## ... Procrustes: rmse 0.0004553297  max resid 0.0019548 
## ... Similar to previous best
## Run 18 stress 0.1131236 
## ... Procrustes: rmse 0.000454603  max resid 0.001894929 
## ... Similar to previous best
## Run 19 stress 0.1131241 
## ... Procrustes: rmse 0.0005855289  max resid 0.002455173 
## ... Similar to previous best
## Run 20 stress 0.113124 
## ... Procrustes: rmse 0.0005247607  max resid 0.001899271 
## ... Similar to previous best
## *** Solution reached

otu_mds_scores <- as.data.frame(scores(otu_mds))  

otu_mds_scores <- cbind(otu_mds_scores, metadata)

library(ggplot2)
ggplot(data=otu_mds_scores, aes(x=NMDS1,y=NMDS2,colour=Group)) + 
  geom_point(size=4) + 
  stat_ellipse(level = 0.9) +
  theme_classic()

进行Adonis检验和数据离散度评估

adonis结果显示Pr(>F)<0.05，统计显著；不同组之间的物种组成存在显著差异。这与PCoA和NMDS的结果还是有些不一致的。那这个统计差异是怎么来的呢？

library(vegan)
adon.results<-adonis(otu_dist ~ Group, data=metadata, perm=999)
print(adon.results)

## 
## Call:
## adonis(formula = otu_dist ~ Group, data = metadata, permutations = 999) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs  MeanSqs F.Model      R2 Pr(>F)    
## Group      2   0.10752 0.053760  2.4707 0.05375  0.001 ***
## Residuals 87   1.89300 0.021759         0.94625           
## Total     89   2.00052                  1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

betadisper检验Pr(>F)<0.05表明不同组的数据在空间分布的离散度显著不同。这是导致adonis结果显著的主要原因。不同分组之间物种的构成的显著不同不是体现在物种空间中心点的变化，而是物种空间离散度的变化。

mod <- betadisper(otu_dist, metadata$Group)
permutest(mod)

## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##           Df   Sum Sq  Mean Sq      F N.Perm Pr(>F)    
## Groups     2 0.157498 0.078749 80.188    999  0.001 ***
## Residuals 87 0.085439 0.000982                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

用一组可视化来展示这个差异的成因

把每组样本抽提出来，分别绘制下PCoA的样品分布，可以看出，每组样品在PCoA定义的空间上中心点是很相近的，而样品分散程度不同。也就是说分组内样品的多样性反应到了不同分组的物种构成差异上了，这个“显著”的差异是不是我们关注的，需要自己来判断了。

# extract the centroids and the site points in multivariate space.  
centroids<-data.frame(grps=rownames(mod$centroids),data.frame(mod$centroids))
vectors<-data.frame(group=mod$group,data.frame(mod$vectors))

# to create the lines from the centroids to each point we will put it in a format that ggplot can handle
seg.data<-cbind(vectors[,1:3],centroids[rep(1:nrow(centroids),as.data.frame(table(vectors$group))$Freq),2:3])
names(seg.data)<-c("group","v.PCoA1","v.PCoA2","PCoA1","PCoA2")

# create the convex hulls of the outermost points
grp1.hull<-seg.data[seg.data$group=="A",1:3][chull(seg.data[seg.data$group=="A",2:3]),]
grp2.hull<-seg.data[seg.data$group=="B",1:3][chull(seg.data[seg.data$group=="B",2:3]),]
grp3.hull<-seg.data[seg.data$group=="C",1:3][chull(seg.data[seg.data$group=="C",2:3]),]
all.hull<-rbind(grp1.hull,grp2.hull,grp3.hull)

library(gridExtra)

panel.a<-ggplot() +
  geom_polygon(data=all.hull[all.hull=="A",],aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
  geom_segment(data=seg.data[1:30,],aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) + 
  geom_point(data=centroids[1,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=16) + 
  geom_point(data=seg.data[1:30,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=16) +
  labs(title="A",x="",y="") +
  coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
  theme_classic() + 
  theme(legend.position="none")

panel.b<-ggplot() + 
  geom_polygon(data=all.hull[all.hull=="B",],aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
  geom_segment(data=seg.data[31:60,],aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) + 
  geom_point(data=centroids[2,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=17) + 
  geom_point(data=seg.data[31:60,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=17) +
  labs(title="B",x="",y="") +
  coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
  theme_classic() + 
  theme(legend.position="none")

panel.c<-ggplot() + 
  geom_polygon(data=all.hull[all.hull=="C",],aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
  geom_segment(data=seg.data[61:90,],aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) +
  geom_point(data=centroids[3,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=15) + 
  geom_point(data=seg.data[61:90,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=15) + 
  labs(title="C",x="",y="") +
  coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
  theme_classic() + 
  theme(legend.position="none")

panel.d<-ggplot() + 
  geom_polygon(data=all.hull,aes(x=v.PCoA1,y=v.PCoA2),colour="black",alpha=0,linetype="dashed") +
  geom_segment(data=seg.data,aes(x=v.PCoA1,xend=PCoA1,y=v.PCoA2,yend=PCoA2),alpha=0.30) + 
  geom_point(data=centroids[,1:3], aes(x=PCoA1,y=PCoA2,shape=grps),size=4,colour="red") + 
  geom_point(data=seg.data, aes(x=v.PCoA1,y=v.PCoA2,shape=group),size=2) + 
  labs(title="All",x="",y="") +
  coord_cartesian(xlim = c(-0.2,0.2), ylim = c(-0.25,0.2)) +
  theme_classic() + 
  theme(legend.position="none")

grid.arrange(panel.a,panel.b,panel.c,panel.d,nrow=1)

`PERMANOVA`的作者对这个问题的看法

Marti Anderson: “[…] Although there is also no explicit assumption regarding the homogeneity of spread within each group, PERMANOVA, like ANOSIM (Clarke 1993), will be sensitive to differences in spread (variability) among groups. Thus, if a significant difference between groups is detected using PERMANOVA, then this could be due to differences in location, differences in spread, or a combinationof the two. Perhaps the best approach is to perform a separate test for homogeneity (e.g., using the program PERMDISP) including pair-wise comparisons, as well as examining the average within and between-group distances and associated MDS plots. This will help to determine the nature of the difference between any pair of groups, whether it be due to location, spread, or a combination of the two. […]”

adonis多因素分析需要注意什么？

假如我们关注不同的管理风格 (Management)和土壤厚度 (A1)对物种组成是否有显著影响？，应该怎么检验呢？

library(vegan)
# 数据的解释和准备见前面的推文
data(dune)
data(dune.env)

A1在前，Moisture在后。这个情况下，A1和Moisture都与群体结构有显著关系。A1可以解释16.8%的总体差异，Moisture解释27.6%的总体差异。

adonis(dune ~ A1 + Moisture, data=dune.env, permutations=9999)

## 
## Call:
## adonis(formula = dune ~ A1 + Moisture, data = dune.env, permutations = 9999) 
## 
## Permutation: free
## Number of permutations: 9999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## A1         1    0.7230 0.72295  4.5393 0.16817 0.0003 ***
## Moisture   3    1.1871 0.39569  2.4845 0.27613 0.0061 ** 
## Residuals 15    2.3890 0.15927         0.55571           
## Total     19    4.2990                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Moisture在前，A1在后。这个情况下，只有Moisture与群体结构有显著关系。Moisture可以解释40.2%的总体差异，A1解释0.04%的总体差异。

For adonis and sequential tests in general, the order of terms should be meaningful. If it is not meaningful, the tests are hardly meaningful.

adonis(dune ~ Moisture + A1, data=dune.env, permutations=9999)

## 
## Call:
## adonis(formula = dune ~ Moisture + A1, data = dune.env, permutations = 9999) 
## 
## Permutation: free
## Number of permutations: 9999
## 
## Terms added sequentially (first to last)
## 
##           Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## Moisture   3    1.7282 0.57606  3.6169 0.40199 0.0002 ***
## A1         1    0.1819 0.18186  1.1419 0.04230 0.3181    
## Residuals 15    2.3890 0.15927         0.55571           
## Total     19    4.2990                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

输出结果中Terms added sequentially (first to last)这一句话很关键，表明环境因子的顺序对结果是有影响的，尤其是环境因子之间存在相关性时。

As there is some linear dependency, which ones goes into the model first determines how much variation is left to be explained by the second of the pair of covariates.

这时可以使用dbrda (基于距离的冗余分析)，或者通过adonis2计算边缘概率 (by="margin")。

adonis2(dune ~ Moisture + A1, data=dune.env, permutations=9999, by="margin")

## Permutation test for adonis under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = dune ~ Moisture + A1, data = dune.env, permutations = 9999, by = "margin")
##          Df SumOfSqs      R2      F Pr(>F)   
## Moisture  3   1.1871 0.27613 2.4845 0.0052 **
## A1        1   0.1819 0.04230 1.1419 0.3228   
## Residual 15   2.3890 0.55571                 
## Total    19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

adonis2(dune ~ A1 + Moisture, data=dune.env, permutations=9999, by="margin")

## Permutation test for adonis under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = dune ~ A1 + Moisture, data = dune.env, permutations = 9999, by = "margin")
##          Df SumOfSqs      R2      F Pr(>F)   
## A1        1   0.1819 0.04230 1.1419 0.3257   
## Moisture  3   1.1871 0.27613 2.4845 0.0066 **
## Residual 15   2.3890 0.55571                 
## Total    19   4.2990 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ord <- dbrda(dune ~ A1 + Moisture, data = dune.env, dist = 'bray')
anova(ord, by = 'margin')

## Permutation test for dbrda under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## Model: dbrda(formula = dune ~ A1 + Moisture, data = dune.env, distance = "bray")
##          Df SumOfSqs      F Pr(>F)   
## A1        1  0.18186 1.1419  0.329   
## Moisture  3  1.18708 2.4845  0.009 **
## Residual 15  2.38899                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

在控制A1的情况下，Moisture对菌群的影响是显著的。反之，在控制Moisture的情况下，A1对菌群的影响是不显著的。

adonis和adonis2的区别

vegan包提供了两个函数adonis和adonis2来进行PERMANOVA分析，这两个函数有什么区别呢？

adonis函数对提供的变量执行的是序贯检验 (sequential test)。也就是说变量的顺序会对结果有影响（尤其是变量之间存在相关时）。系统会先评估第一个变量解释的差异比例，再评估后续变量解释的剩余总体差异的比例。后面会有一个例子展示差异。这等同于adonis2使用参数by="terms" (默认参数)。这通常被称为I型误差平方和 (Type I sums of squares)，此时，
- 对于模型Y ~ A + B来讲，变量A的误差平方和为SS(A)；
  变量B的误差平方和是在给定A的基础上的平方和SS(B|A) = SS(A, B) - SS(A)。
- 对于模型Y ~ B + A来讲，变量B的误差平方和为SS(B)；
  变量A的误差平方和是在给定B的基础上的平方和SS(A|B) = SS(A, B) - SS(B)。
如果你希望变量的顺序不影响结果，那么需要使用adonis2，并且设置参数by="margin"。这时计算显著性时会考虑公式中其它所有变量，而不只是当前变量前面的那些变量。这通常被称为II型误差平方和 (Type II sums of squares)，此时
- 对于模型Y ~ A + B来讲，变量A的误差平方和为SS(A|B) = SS(A, B) - SS(B)；
  变量B的误差平方和SS(B|A) = SS(A, B) - SS(A)。
- 对于模型Y ~ B + A来讲，变量A的误差平方和为SS(A|B) = SS(A, B) - SS(B)；
  变量B的误差平方和SS(B|A) = SS(A, B) - SS(A)。
或者你想看整体模型是否显著，也需要使用adonis2，并且设置参数by="null"。

Order does not matter when by="margin" because the significance is tested against a model that includes all other variables not just the ones preceding it in the formula. It seems that strata is now deprecated in favor of defining blocks in the permutations argument now (see adonis help). Anyway, these arguments allow you to specify how to restrict which rows can be exchanged during the permutation procedure used to calculate p values.

adonis performs a sequential test of terms。adonis2 can perform sequential, marginal and overall tests. Function adonis2 also allows using additive constants or squareroot of dissimilarities to avoid negative eigenvalues，but both functions can handle semimetric indices (such as Bray-Curtis) that produce negative eigenvalues. Functionadonis2 can be much slower than adonis, in particular with several terms.

参考

https://www.scribbr.com/frequently-asked-questions/one-way-vs-two-way-anova/
MANOVA的前提假设 https://www.real-statistics.com/multivariate-statistics/multivariate-analysis-of-variance-manova/manova-assumptions/ https://www.statology.org/manova-assumptions/
https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php
https://www.yunbios.net/h-nd-570.html
https://mp.weixin.qq.com/s/v_k4Yhe9rBWM9y9A3P3wQw
https://mp.weixin.qq.com/s?__biz=MzUzMjA4Njc1MA==&mid=2247484678&idx=1&sn=f95418a311e639704e9848545efc7fd7&scene=21#wechat_redirect
https://chrischizinski.github.io/rstats/vegan-ggplot2/
https://chrischizinski.github.io/rstats/adonis/
https://chrischizinski.github.io/rstats/ordisurf/
https://www.rdocumentation.org/packages/vegan/versions/1.11-0/topics/adonis
https://www.jianshu.com/p/dfa689f7cafd
https://stats.stackexchange.com/questions/312302/adonis-in-vegan-order-of-variables-non-nested-with-one-degree-of-freedom-for
https://stats.stackexchange.com/questions/188519/adonis-in-vegan-order-of-variables-or-use-of-strata?noredirect=1
https://github.com/vegandevs/vegan/issues/229
https://stats.stackexchange.com/questions/476256/adonis-vs-adonis2
清晰解释Type I, Type II, Type III https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
清晰解释Type I, Type II, Type III https://stats.stackexchange.com/questions/60362/choice-between-type-i-type-ii-or-type-iii-anova
https://thebiobucket.blogspot.com/2011/08/two-way-permanova-adonis-with-custom.html#more
adonis的前提条件 https://thebiobucket.blogspot.com/2011/04/assumptions-for-permanova-with-adonis.html#more
作者的论文 https://static1.squarespace.com/static/580e3c475016e191c523a0e2/t/5813ba8b5016e1a5b61f454a/1477687949842/Anderson_et_al-2013-ANOSIM+vs.+PERMANOVA.pdf
离散度 adonis https://chrischizinski.github.io/rstats/adonis/