Differential gene expression analysis using DESeq2 (comprehensive tutorial)可以直接使用很全

生信小博士

已于 2023-06-13 15:14:58 修改

阅读量1k

点赞数

文章标签： java servlet 开发语言

于 2022-09-24 22:17:31 首次发布

本文链接：https://blog.csdn.net/qq_52813185/article/details/127031958

版权

Differential gene expression analysis using DESeq2 (comprehensive tutorial)

Renesh Bedre 9 minute read

IntroductionPermalink

Differential gene expression (DGE) analysis is commonly used in the transcriptome-wide analysis (using RNA-seq) for studying the changes in gene or transcripts expressions under different conditions (e.g. control vs infected).
RNA sequencing (bulk and single-cell RNA-seq) using next-generation sequencing (e.g. Illumina short-read sequencing) is a de facto method for quantifying the transcriptome-wide gene or transcript expressions and performing DGE analysis.
There are several computational tools are available for DGE analysis. In this article, I will cover DESeq2 for DGE analysis.

Check DGE analysis using edgeR

DGE analysis using DESeq2Permalink

The standard workflow for DGE analysis involves the following steps

RNA-seq with a sequencing depth of 10-30 M reads per library (at least 3 biological replicates per sample)
aligning or mapping the quality-filtered sequenced reads to respective genome (e.g. HISAT2 or STAR)
quantifying reads that are mapped to genes or transcripts (e.g. featureCounts, RSEM, HTseq)
Raw integer read counts (un-normalized) are then used for DGE analysis using DESeq2

This standard and other workflows for DGE analysis are depicted in the following flowchart,

Note: DESeq2 requires raw integer read counts for performing accurate DGE analysis. The normalized read counts should not be used in DESeq2 analysis. DESeq2 internally normalizes the count data correcting for differences in the library sizes as sequencing depth influence the read counts (sample-specific effect). DESeq2 does not consider gene length for normalization as gene length is constant for all samples (it may not have significant effect on DGE analysis). Read more about DESeq2 normalization

Perform the DGE analysis using DESeq2 for read count matrix,

For DGE analysis, I will use the sugarcane RNA-seq data. The DGE analysis will be performed using the raw integer read counts for control and fungal treatment conditions. The goal here is to identify the differentially expressed genes under infected condition.

Note: This article focuses on DGE analysis using a count matrix. But, If you have gene quantification from Salmon, Sailfish, Kallisto, or RSEM, you can use the tximport package to import the count data to perform DGE analysis using DESeq2. Read more here

Install DESeq2 (if you have not installed before),

# R version 4.1.3 (2022-03-10) 
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

Load DESeq2 and run DGE analysis,

# load library 
# DESeq2 version 1.34.0
library("DESeq2")

# get count dataset
count_matrix <- as.matrix(read.csv("https://reneshbedre.github.io/assets/posts/gexp/df_sc.csv", row.names = "gene"))
# view first two rows
head(count_matrix, 2)
                 ctr1 ctr2 ctr3 trt1 trt2 trt3 length
Sobic.001G000200  338  324  246  291  202  168   1982
Sobic.001G000400   49   21   53   16   16   11   4769

# drop length column
count_matrix <- count_matrix[, -7]
head(count_matrix, 2)
                ctr1 ctr2 ctr3 trt1 trt2 trt3
Sobic.001G000200  338  324  246  291  202  168
Sobic.001G000400   49   21   53   16   16   11

DESeq2 needs sample information (metadata) for performing DGE analysis. Let’s create the sample information (you can also import sample information if you have it in a file),

coldata <- data.frame(
   sample = c( "ctr1", "ctr2", "ctr3", "trt1", "trt2", "trt3" ),
   condition = c( "control", "control",  "control", "infected", "infected", "infected" ), 
  row.names = "sample" )
coldata$condition <- as.factor(coldata$condition)

coldata
     condition
ctr1   control
ctr2   control
ctr3   control
trt1  infected
trt2  infected
trt3  infected

Note: DESeq2 does not support the analysis without biological replicates ( 1 vs. 1 comparison). There is no other recommended alternative for performing DGE analysis without biological replicates. If you do not have any biological replicates, you can analyze log fold changes without any significance analysis.

It is essential to have the name of the columns in the count matrix in the same order as that in name of the samples (rownames in coldata). Check this article for how to reorder column names in a Data Frame.

Let’s check using the below code,

all(rownames(coldata) %in% colnames(count_matrix))
[1] TRUE

all(rownames(coldata) == colnames(count_matrix))
[1] TRUE

Now, construct DESeqDataSet for DGE analysis,

dds <- DESeqDataSetFromMatrix(countData = count_matrix, colData = coldata, 
                              design = ~ condition)

Note: The design formula specifies the experimental design to model the samples. It is used in the estimation of dispersions (spread or variability) and log2 fold changes (LFCs) of the model. The design formula also allows controlling additional factors (other than the variable of interest) in the model such as batch effects, type of sequencing, etc.

DESeq2 for paired sample: If you have paired samples (if the same subject receives two treatments e.g. before and after treatment), then you need to include the subject (sample) and treatment information in the design formula for estimating the treatment effect while considering differences in subjects. If sample and treatments are represented as subjects and condition in coldata table, then the design formula should be design = ~ subjects + condition. The factor of interest such as condition should go at the end of the formula. If you have more than two factors to consider, you should use proper multifactorial design.

Pre-filter the genes which have low counts. Low count genes may not have sufficient evidence for differential gene expression. Furthermore, removing low count genes reduce the load of multiple hypothesis testing corrections.

2. 实操https://www.jianshu.com/p/8aa995149744

经过多次分析和调整，最后用的代码是：

（1）安装DESeq2包

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("DESeq2")

（2）导入数据两两比较

setwd('xxx')
colData <- read.table('group.txt', header=TRUE, row.names=1)
readscount <- read.table('readscount.txt', header=TRUE, row.names=1)
condition <- factor(c(rep("treat1",10),rep("treat2",10)))
individual <- factor(c(rep("idv1",3),"idv2",rep("idv3",3),rep("idv4",3), 
                  rep("idv1",3),rep("idv2",3),rep("idv4",4)))
colData
head(readscount)
condition
individual
library(DESeq2) 
dds <- DESeqDataSetFromMatrix(readscount, colData, design =~ individual + condition)
keep <- rowSums(counts(dds) >= 10) >= 3   #过滤低表达基因，至少在3个样品中都有10个reads 
dds <- dds[keep, ]

（3）PCA（还有全部样品的PCA在另一个脚本）

vsdata <- vst(dds, blind=FALSE)  #归一化
assay(vsdata) <- limma::removeBatchEffect(assay(vsdata), vsdata$individual)  #去批次效应
plotPCA(vsdata, intgroup = "condition") #自带函数

（4）差异分析

dds_norm <- DESeq(dds, minReplicatesForReplace = Inf) #标准化; 不剔除outliers; 与cookscutoff结果相同
dds_norm$condition   #保证是levels是按照后一个比前一个即trt/untrt，否则需在results时指定
res <- results(dds_norm, contrast = c("condition","treat2","treat1"), cooksCutoff = FALSE) #alpha=0.05可指定padj; cookCutoff是不筛选outliers因为太多了
summary(res)  
#resOrdered <- res[order(res$pvalue), ] #排序
sum(res$padj<0.05, na.rm = TRUE)
res_data <- merge(as.data.frame(res),
              as.data.frame(counts(dds_norm,normalize=TRUE)),
              by="row.names",sort=FALSE)
up_DEG <- subset(res_data, padj < 0.05 & log2FoldChange > 1)
down_DEG <- subset(res_data, padj < 0.05 & log2FoldChange < -1)
write.csv(res_data, "all.csv") #全部基因不筛选，做火山图的背景
write.csv(up_DEG, "up.csv")
write.csv(down_DEG, "down.csv")

（5）判断欧氏距离，若有异常样品则不用cooksCutoff；当有上千个异常值时也不用：（完全可以不做）

par(mar=c(8,5,2,2))
boxplot(log10(assays(dds_norm)[["cooks"]]), range=0, las=2)

（6）lfcshrink & MA plot

library(apeglm)  
resultsNames(dds_norm)  #看一下要shrink的维度;shrink数据更加紧凑,少了一项stat，但并未改变padj，但改变了foldchange
res_shrink <- lfcShrink(dds_norm, coef="condition_treat2_vs_treat1", type="apeglm") #最推荐apeglm算法;根据resultsNames(dds)的第5个维度，coef=5，也可直接""指定;apeglm不allow contrast，所以要指定coef
pdf("MAplot.pdf", width = 6, height = 6) 
plotMA(res_shrink, ylim=c(-10,10), alpha=0.1, main="MA plot: ")
dev.off()

（7）火山图

library(ggplot2)
voldata <-read.csv(file = "all.csv",header = TRUE, row.names =1)
pdf("volcano.pdf", width = 6.13, height = 5.18)
ggplot(data=voldata, aes(x=log2FoldChange,y= -1*log10(padj))) +
  geom_point(aes(color=significant)) +
  scale_color_manual(values=c("#546de5", "#d2dae2","#ff4757")) + 
  labs(title="Volcano Plot: ", x=expression(log[2](FC), y=expression(-log[10](padj)))) +
  geom_hline(yintercept=1.3,linetype=4) +  #反对数,代表0.05的线
  geom_vline(xintercept=c(-1,1),linetype=4) +
  theme_bw() + theme(panel.grid = element_blank())  #主次网格线均为空白
dev.off()

3. 心得

必须充分了解自己的数据，分析方法万万千，选最合适的才是最好的！
任何软件教程都不如官网README 和文章，遇到问题时要想“一定不是我一个人”，善于用Bioconductor论坛。
不要钻牛角尖，完整比完美重要，任何分析都有不周到，学会带着遗憾继续走，不必十全十美，因为永远也不可能做到。

1.基本原理

1.1 概述

全称：DESeq2 package for differential analysis of count data;
利用负二项分布广义线性模型( negative binomial generalized linear models），同时，还利用了离散型估计、logFoldChange;
负二项分布是一个离散分布，符合测序reads分布；

1.2 构建dds

要求输入原始 reads count 数；不接受已经做过处理的FPKM/TPM等，因为软件有自己的标准化计算方法；

构建dds。需要设置design公式，即告诉软件你的数据是怎样来的，基本试验设计如何，软件会根据几个变量综合计算；
一般：design =~ variable1 + variable2 + ...；
只有一个变量时：design=~ condition；
很多医学分析会加入年龄、性别等：design=~sex+disease+condition；
可以对应几个变量，但如果没有额外参数，log2FC和p值是默认对design公式中的最后一个变量或者最后一个因子与参考因子进行比较；

1.3 函数与计算

1.3.1 标准化：DESeq函数

不同样品的测序量有差异，最简单的标准化方式是计算counts per million (CPM) = 原始reads count ÷ 总reads数 x 1,000,000；
这种计算方式，易受到极高表达且在不同样品中存在差异表达的基因的影响：这些基因的打开或关闭会影响到细胞中总的分子数目，可能导致这些基因标准化之后就不存在表达差异了，而原本没有差异的基因标准化之后却有差异了；
RPKM、FPKM、TPM 是 CPM按照基因或转录本长度归一化后的表达，都会受到这一影响；
DESeq2的方法：

量化因子 (size factor,SF)，首先计算每个基因在所有样品中表达的几何平均值；每个细胞的SF是所有基因与其在所有样品中的表达值的几何平均值的比值的中位数；由于几何平均值的使用，只有在所有样品中表达都不为0的基因才能用来计算。这一方法又被称为RLE(relative log expression)。
不但考虑了测序深度的问题，还考虑了表达量超高或者极显著差异表达的基因导致count的分布出现偏倚。

DESeq函数分析：

三步：estimation of size factors（estimateSizeFactors)， estimation of dispersion（estimateDispersons)， Negative Binomial GLM fitting and Wald statistics（nbinomWaldTest）；
可以分步运行，也可一步到位，最后返回 results可用的DESeqDataSet对象。