简 介
RNA-seq 是研究选择性剪接和其他形式的选择性异构体表达的有力工具。了解这些过程的调节需要在不同条件、细胞类型或组织之间的比较中对差异异构体丰度进行敏感和特异性的检测。提出 DEXSeq 一种在 RNA-seq 数据中测试差异外显子使用的统计方法。DEXSeq 使用广义线性模型,并通过考虑生物变异提供可靠的错误发现控制。DEXSeq 检测高灵敏度基因,在许多情况下外显子,受差异外显子使用。通过将 DEXSeq 应用于多个数据集来演示它的多功能性。该方法有助于在全基因组范围内研究选择性外显子使用的调控和功能。DEXSeq 的实现可作为R/Bioconductor包提供。
基因模型的平坦化:这个(虚构的)基因有三个注释转录本,涉及三个外显子(浅色阴影),其中一个具有替代边界。我们从所描述的外显子中形成计数箱(暗阴影框);可变长度的外显子被分成两个箱子。
软件包安装
软件包安装非常简单,但是需要注意以下Rtools,R,BiocManager版本号,这种问题经常出现,更新匹配一直就没问题了。
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DEXSeq")
BiocManager::install("pasilla")
数据读取
输入数据需要三个必不可少的类型:
count matrix
dir <- system.file("extdata", package = "pasilla")
dir
## [1] "D:/Program Files/R-4.4.0/library/pasilla/extdata"
files <- list.files(dir, pattern = "fb.txt$", full.names = TRUE)
files
## [1] "D:/Program Files/R-4.4.0/library/pasilla/extdata/treated1fb.txt"
## [2] "D:/Program Files/R-4.4.0/library/pasilla/extdata/treated2fb.txt"
## [3] "D:/Program Files/R-4.4.0/library/pasilla/extdata/treated3fb.txt"
## [4] "D:/Program Files/R-4.4.0/library/pasilla/extdata/untreated1fb.txt"
## [5] "D:/Program Files/R-4.4.0/library/pasilla/extdata/untreated2fb.txt"
## [6] "D:/Program Files/R-4.4.0/library/pasilla/extdata/untreated3fb.txt"
## [7] "D:/Program Files/R-4.4.0/library/pasilla/extdata/untreated4fb.txt"
head(read.table(files[1]))
## V1 V2
## 1 FBgn0000003:001 0
## 2 FBgn0000008:001 0
## 3 FBgn0000008:002 0
## 4 FBgn0000008:003 0
## 5 FBgn0000008:004 1
## 6 FBgn0000008:005 4
gff
gff <- paste(dir, "Dmel.BDGP5.25.62.DEXSeq.chr.gff", sep = "/")
gff
## [1] "D:/Program Files/R-4.4.0/library/pasilla/extdata/Dmel.BDGP5.25.62.DEXSeq.chr.gff"
metadata
sampleTable <- data.frame(row.names = c(paste("treated", 1:3, "fb", sep = ""), paste("untreated",
1:4, "fb", sep = "")), condition = rep(c("treated", "untreated"), c(3, 4)), type = c("single-read",
"paired-end", "paired-end", "single-read", "single-read", "paired-end", "paired-end"))
sampleTable
## condition type
## treated1fb treated single-read
## treated2fb treated paired-end
## treated3fb treated paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated paired-end
## untreated4fb untreated paired-end
实例操作
1. 构建对象
构建对线有三种方式:
1).DEXSeqDataSet()
DEXSeqDataSet 类派生自 DESeqDataSet。因此包含列数据、行数据和一些特定数据的常用访问器函数。DEXSeqDataSet 对象中的核心数据是每个外显子的计数。DEXSeqDataSet的每一行在每一列中都包含给定外显子(“this”)的计数数据以及属于同一基因的其他外显子(“others”)的计数数据。
DEXSeqDataSet( countData, sampleData,
design= ~ sample + exon + condition:exon,
featureID, groupID, featureRanges=NULL,
transcripts=NULL, alternativeCountData=NULL)
2).DEXSeqDataSetFromHTSeq()
countfiles:一个字符向量,包含由脚本'dexseq_count.py'生成的文件的路径。
DEXSeqDataSetFromHTSeq(
countfiles, sampleData,
design= ~ sample + exon + condition:exon,
flattenedfile=NULL )
3).DEXSeqDataSetFromSE()
一个summarizeexperiments对象,使用SummarizeOverlaps函数生成。
DEXSeqDataSetFromSE(SE,
design= ~ sample + exon + condition:exon )
这里我们选择使用函数 DEXSeqDataSetFromHTSeq() 构造对象:
library(DEXSeq)
dxd <- DEXSeqDataSetFromHTSeq(files, sampleData = sampleTable, design = ~sample +
exon + condition:exon, flattenedfile = gff)
这个注释以及关于DEXSeqDataSet的每一列的所有信息都在colData()中指定,counts()获得count matrix,
colData(dxd)
## DataFrame with 14 rows and 4 columns
## sample condition type exon
## <factor> <factor> <character> <factor>
## 1 treated1fb treated single-read this
## 2 treated2fb treated paired-end this
## 3 treated3fb treated paired-end this
## 4 untreated1fb untreated single-read this
## 5 untreated2fb untreated single-read this
## ... ... ... ... ...
## 10 treated3fb treated paired-end others
## 11 untreated1fb untreated single-read others
## 12 untreated2fb untreated single-read others
## 13 untreated3fb untreated paired-end others
## 14 untreated4fb untreated paired-end others
head(counts(dxd), 5)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## FBgn0000003:E001 0 0 1 0 0 0 0 0 0 0 0 0
## FBgn0000008:E001 0 0 0 0 0 0 0 78 46 43 47 89
## FBgn0000008:E002 0 0 0 0 0 1 0 78 46 43 47 89
## FBgn0000008:E003 0 1 0 1 1 1 0 78 45 43 46 88
## FBgn0000008:E004 1 0 1 0 1 0 1 77 46 42 47 88
## [,13] [,14]
## FBgn0000003:E001 0 0
## FBgn0000008:E001 53 27
## FBgn0000008:E002 52 27
## FBgn0000008:E003 52 27
## FBgn0000008:E004 53 26
split(seq_len(ncol(dxd)), colData(dxd)$exon)
## $others
## [1] 8 9 10 11 12 13 14
##
## $this
## [1] 1 2 3 4 5 6 7
sampleAnnotation(dxd)
## DataFrame with 7 rows and 3 columns
## sample condition type
## <factor> <factor> <character>
## 1 treated1fb treated single-read
## 2 treated2fb treated paired-end
## 3 treated3fb treated paired-end
## 4 untreated1fb untreated single-read
## 5 untreated2fb untreated single-read
## 6 untreated3fb untreated paired-end
## 7 untreated4fb untreated paired-end
还可以只访问属于外显子区域的计数的前五行('this')(不显示来自同一基因的其他外显子的计数总和),该表包含有关注释数据的信息,例如基因和外显子id、外显子的基因组坐标以及包含外显子的转录本列表。
head(featureCounts(dxd), 5)
## treated1fb treated2fb treated3fb untreated1fb untreated2fb
## FBgn0000003:E001 0 0 1 0 0
## FBgn0000008:E001 0 0 0 0 0
## FBgn0000008:E002 0 0 0 0 0
## FBgn0000008:E003 0 1 0 1 1
## FBgn0000008:E004 1 0 1 0 1
## untreated3fb untreated4fb
## FBgn0000003:E001 0 0
## FBgn0000008:E001 0 0
## FBgn0000008:E002 1 0
## FBgn0000008:E003 1 0
## FBgn0000008:E004 0 1
数据注释的前3行:
head(rowRanges(dxd), 3)
## GRanges object with 3 ranges and 5 metadata columns:
## seqnames ranges strand | featureID groupID
## <Rle> <IRanges> <Rle> | <character> <character>
## FBgn0000003:E001 chr3R 2648220-2648518 + | E001 FBgn0000003
## FBgn0000008:E001 chr2R 18024494-18024495 + | E001 FBgn0000008
## FBgn0000008:E002 chr2R 18024496-18024531 + | E002 FBgn0000008
## exonBaseMean exonBaseVar transcripts
## <numeric> <numeric> <list>
## FBgn0000003:E001 0.142857 0.142857 FBtr0081624
## FBgn0000008:E001 0.000000 0.000000 FBtr0100521
## FBgn0000008:E002 0.142857 0.142857 FBtr0071763,FBtr0100521
## -------
## seqinfo: 14 sequences from an unspecified genome; no seqlengths
2. 数据标准化
DEXSeq使用与DESeq和DESeq2相同的方法,该方法在函数estimateSizeFactors()中提供。
dxd = estimateSizeFactors(dxd)
3. 分散估计
为了测试不同的外显子使用情况,需要估计数据的可变性。这能够区分技术和生物变异(噪音)从实际影响外显子的使用,由于不同的条件。关于噪声强度的信息是从数据集中的生物重复中推断出来的,并以所谓的分散为特征。在 RNA-Seq 实验中,复制的数量通常太小,无法可靠地估计单个外显子的方差或分散参数,因此,方差信息在外显子和基因之间以强度依赖的方式共享。
dxd = estimateDispersions(dxd)
plotDispEsts(dxd)
4. 差异外显子
testForDEU()函数对每个基因中的每个外显子执行这些测试。从拟合模型的系数中,可以区分整体基因表达效应,即改变所有外显子的计数,以及外显子使用效应,这些效应由相互作用项条件:外显子捕获,并单独影响每个外显子。
dxd = testForDEU(dxd)
dxd = estimateExonFoldChanges(dxd, fitExpToVar = "condition")
dxd
## class: DEXSeqDataSet
## dim: 70463 14
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(70463): FBgn0000003:E001 FBgn0000008:E001 ... FBgn0261575:E001
## FBgn0261575:E002
## rowData names(43): featureID groupID ... untreated
## log2fold_untreated_treated
## colnames: NULL
## colData names(5): sample condition type exon sizeFactor
dxr1 = DEXSeqResults(dxd)
5. 附加的技术或实验变量
指定了两个设计公式,这表明类型因素应被视为阻塞因素:
formulaFullModel = ~sample + exon + type:exon + condition:exon
formulaReducedModel = ~sample + exon + type:exon
dxd = estimateDispersions(dxd, formula = formulaFullModel)
dxd = testForDEU(dxd, reducedModel = formulaReducedModel, fullModel = formulaFullModel)
dxr2 = DEXSeqResults(dxd)
table(dxr2$padj < 0.1)
##
## FALSE TRUE
## 43766 424
table(before = dxr1$padj < 0.1, now = dxr2$padj < 0.1)
## now
## before FALSE TRUE
## FALSE 42792 193
## TRUE 2 231
6. 结果可视化
plotDEXSeq()提供了一种可视化分析结果的方法。
a) 显著差异的外显子
plotDEXSeq(dxr2, "FBgn0010909", legend = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)
b) 所有注释的转录本
wh = (dxr2$groupID == "FBgn0010909")
stopifnot(sum(dxr2$padj[wh] < formals(plotDEXSeq)$FDR) == 1)
plotDEXSeq(dxr2, "FBgn0010909", displayTranscripts = TRUE, legend = TRUE, cex.axis = 1.2,
cex = 1.3, lwd = 2)
c) 每个样品中每个外显子的标准化计数值
plotDEXSeq(dxr2, "FBgn0010909", expression = FALSE, norCounts = TRUE, legend = TRUE,
cex.axis = 1.2, cex = 1.3, lwd = 2)
d) 减去了基因表达的总体变化。
plotDEXSeq(dxr2, "FBgn0010909", expression = FALSE, splicing = TRUE, legend = TRUE,
cex.axis = 1.2, cex = 1.3, lwd = 2)
e) 允许对结果进行更详细的探索
为了生成所有分析结果的易于浏览的详细概览,该包提供了一个HTML报告生成器,在函数DEXSeqHTML()中实现。该函数使用包hwriter (Pau和Huber2009)创建一个结果表,其中包含指向重要结果图的链接,从而允许对结果进行更详细的探索。
DEXSeqHTML(dxr2, FDR = 0.1, color = c("#FF000080", "#0000FF80"))
结果解读
从这个对象,可以看出有多少外显子区域是显著的,错误发现率为10%,有多少基因受到了影响,MA图分析不同外显子使用情况的能力。
mcols(dxr1)$description
## [1] "group/gene identifier"
## [2] "feature/exon identifier"
## [3] "mean of the counts across samples in each feature/exon"
## [4] "exon dispersion estimate"
## [5] "LRT statistic: full vs reduced"
## [6] "LRT p-value: full vs reduced"
## [7] "BH adjusted p-values"
## [8] "exon usage coefficient"
## [9] "exon usage coefficient"
## [10] "relative exon usage fold change"
## [11] "GRanges object of the coordinates of the exon/feature"
## [12] "matrix of integer counts, of each column containing a sample"
## [13] "list of transcripts overlapping with the exon"
table(dxr1$padj < 0.1)
##
## FALSE TRUE
## 42985 233
table(tapply(dxr1$padj < 0.1, dxr1$groupID, any))
##
## FALSE TRUE
## 5220 166
plotMA(dxr1, cex = 0.8)
Reference
Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNA-seq data. Genome Res. 2012 Oct;22(10):2008-17. doi: 10.1101/gr.133744.111. Epub 2012 Jun 21. PMID: 22722343; PMCID: PMC3460195.
桓峰基因,铸造成功的您!
未来桓峰基因公众号将不间断的推出单细胞系列生信分析教程,
敬请期待!!
桓峰基因官网正式上线,请大家多多关注,还有很多不足之处,大家多多指正!http://www.kyohogene.com/
桓峰基因和投必得合作,文章润色优惠85折,需要文章润色的老师可以直接到网站输入领取桓峰基因专属优惠券码:KYOHOGENE,然后上传,付款时选择桓峰基因优惠券即可享受85折优惠哦!https://www.topeditsci.com/