简 介
可变剪接从单个基因中产生多个功能转录物。已知剪接失调与疾病有关,并且是癌症的标志。现有的差分转录使用 (DTU) 分析工具要么缺乏性能,要么不能解释复杂的实验设计,要么不能扩展到大规模的单细胞转录组测序 (scRNA-seq) 数据集。引入了 satuRn,这是一个快速灵活的准二项广义线性建模框架,它与来自大量 RNA-seq 领域的最佳 DTU 方法相当,同时提供了良好的 FDR 控制,解决复杂的实验设计,并扩展到 scRNA-seq 应用。
该软件对6种差分转录本使用(DTU)方法的性能和可扩展性进行评价。
软件包安装
软件包安装非常简单,但是需要注意以下Rtools,R,BiocManager版本号,这种问题经常出现,更新匹配一直就没问题了。
if(!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("satuRn")
BiocManager::install("stageR")
devtools::install_github("statOmics/satuRn")
加载所需要的软件包:
library(satuRn)
library(AnnotationHub)
library(ensembldb)
library(edgeR)
library(SummarizedExperiment)
library(ggplot2)
library(DEXSeq)
library(stageR)
数据读取
输入数据需要三个必不可少的类型:
1. count matrix
Tasic_counts_vignette <- readRDS("Tasic_counts_vignette.RDS")
Tasic_counts_vignette[1:5, 1:3]
## F2S4_160622_013_D01 F2S4_160624_023_C01 F2S4_160622_019_B01
## ENSMUST00000037739 95.994 164.492 12.397
## ENSMUST00000228774 0.000 23.174 0.000
## ENSMUST00000025204 225.737 537.474 266.000
## ENSMUST00000237499 19.263 54.526 0.000
## ENSMUST00000042857 80.139 72.885 0.000
dim(Tasic_counts_vignette)
## [1] 9151 60
2. 基因与转录本ID对应列表
txInfo <- readRDS("txInfo.RDS")
head(txInfo)
## isoform_id gene_id
## ENSMUST00000037739 ENSMUST00000037739 ENSMUSG00000042354
## ENSMUST00000228774 ENSMUST00000228774 ENSMUSG00000042354
## ENSMUST00000025204 ENSMUST00000025204 ENSMUSG00000024346
## ENSMUST00000237499 ENSMUST00000237499 ENSMUSG00000024346
## ENSMUST00000042857 ENSMUST00000042857 ENSMUSG00000038331
## ENSMUST00000114415 ENSMUST00000114415 ENSMUSG00000038331
dim(txInfo)
## [1] 9151 2
3. metadata
data(Tasic_metadata_vignette)
head(Tasic_metadata_vignette)
## sample_name brain_region cluster
## 12332 F2S4_160622_013_D01 VISp L5_IT_VISp_Hsd11b1_Endou
## 12938 F2S4_160624_023_C01 VISp L5_IT_VISp_Hsd11b1_Endou
## 12378 F2S4_160622_019_B01 VISp L5_IT_VISp_Hsd11b1_Endou
## 15610 F2S4_160818_008_D01 VISp L5_IT_VISp_Hsd11b1_Endou
## 15600 F2S4_160818_007_B01 VISp L5_IT_VISp_Hsd11b1_Endou
## 11026 F2S4_160603_025_G01 ALM L5_IT_ALM_Tmem163_Dmrtb1
dim(Tasic_metadata_vignette)
## [1] 60 3
实例操作
1. 构建对象
Tasic_metadata_vignette$group <- paste(Tasic_metadata_vignette$brain_region, Tasic_metadata_vignette$cluster,
sep = ".")
这个注释以及关于 SummarizedExperiment 的每一列的所有信息都在colData()中指定,counts()获得count matrix,
# Specify design formula from colData
sumExp <- SummarizedExperiment(assays = list(counts = Tasic_counts_vignette), colData = Tasic_metadata_vignette,
rowData = txInfo)
metadata(sumExp)$formula <- ~0 + as.factor(colData(sumExp)$group)
sumExp
## class: SummarizedExperiment
## dim: 9151 60
## metadata(1): formula
## assays(1): counts
## rownames(9151): ENSMUST00000037739 ENSMUST00000228774 ...
## ENSMUST00000127554 ENSMUST00000132683
## rowData names(2): isoform_id gene_id
## colnames(60): F2S4_160622_013_D01 F2S4_160624_023_C01 ...
## F2S4_160919_010_B01 F2S4_160915_002_D01
## colData names(4): sample_name brain_region cluster group
2. 拟合拟二项式广义线性模型
sumExp <- satuRn::fitDTU(object = sumExp, formula = ~0 + group, parallel = FALSE,
BPPARAM = BiocParallel::bpparam(), verbose = TRUE)
结果模型拟合现在保存到 SummarizedExperiment 对象的 rowData 中,名称为 fitDTUModels。这些模型可以通过以下方式访问:
rowData(sumExp)[["fitDTUModels"]]$ENSMUST00000037739
## An object of class "StatModel"
## Slot "type":
## [1] "glm"
##
## Slot "params":
## $coefficients
## designgroupALM.L5_IT_ALM_Tmem163_Dmrtb1
## 1.612656
## designgroupALM.L5_IT_ALM_Tnc
## 1.773648
## designgroupVISp.L5_IT_VISp_Hsd11b1_Endou
## 1.232522
##
## $df.residual
## [1] 55
##
## $dispersion
## [1] 28.14375
##
## $vcovUnsc
## designgroupALM.L5_IT_ALM_Tmem163_Dmrtb1
## designgroupALM.L5_IT_ALM_Tmem163_Dmrtb1 0.004760564
## designgroupALM.L5_IT_ALM_Tnc 0.000000000
## designgroupVISp.L5_IT_VISp_Hsd11b1_Endou 0.000000000
## designgroupALM.L5_IT_ALM_Tnc
## designgroupALM.L5_IT_ALM_Tmem163_Dmrtb1 0.000000000
## designgroupALM.L5_IT_ALM_Tnc 0.004363295
## designgroupVISp.L5_IT_VISp_Hsd11b1_Endou 0.000000000
## designgroupVISp.L5_IT_VISp_Hsd11b1_Endou
## designgroupALM.L5_IT_ALM_Tmem163_Dmrtb1 0.000000000
## designgroupALM.L5_IT_ALM_Tnc 0.000000000
## designgroupVISp.L5_IT_VISp_Hsd11b1_Endou 0.004042164
##
##
## Slot "varPosterior":
## [1] 27.73976
##
## Slot "dfPosterior":
## [1] 59.4294
3. DTU测试
建立对比矩阵
group <- as.factor(Tasic_metadata_vignette$group)
design <- model.matrix(~0 + group) # constructs design matrix
colnames(design) <- levels(group)
L <- limma::makeContrasts(Contrast1 = VISp.L5_IT_VISp_Hsd11b1_Endou - ALM.L5_IT_ALM_Tnc,
Contrast2 = VISp.L5_IT_VISp_Hsd11b1_Endou - ALM.L5_IT_ALM_Tmem163_Dmrtb1, levels = design) # constructs contrast matrix
执行测试
接下来,可以使用 testDTU 执行差异使用测试,采用默认设置。
sumExp <- testDTU(object = sumExp, contrasts = L, diagplot1 = TRUE, diagplot2 = TRUE,
sort = FALSE)
DTU可视化
最后,可视化在选定的治疗组中使用前3个差异转录本的情况。通过将转录本和基因参数设置为NULL并指定top。N = 3时,显示 FDR 最小(经验正确)的3个特征。另外,通过分别指定转录本或基因参数,可以可视化感兴趣的转录本或感兴趣基因内的所有转录本。
group1 <- colnames(sumExp)[colData(sumExp)$group == "VISp.L5_IT_VISp_Hsd11b1_Endou"]
group2 <- colnames(sumExp)[colData(sumExp)$group == "ALM.L5_IT_ALM_Tnc"]
plots <- satuRn::plotDTU(object = sumExp, contrast = "Contrast1", groups = list(group1,
group2), coefficients = list(c(0, 0, 1), c(0, 1, 0)), summaryStat = "model",
transcripts = NULL, genes = NULL, top.n = 3)
# to have same layout as in our paper
for (i in seq_along(plots)) {
current_plot <- plots[[i]] + scale_fill_manual(labels = c("VISp", "ALM"), values = c("royalblue4",
"firebrick")) + scale_x_discrete(labels = c("Hsd11b1_Endou", "Tnc"))
print(current_plot)
}
结果解读
head(rowData(sumExp)[["fitDTUResult_Contrast1"]]) # first contras
## estimates se df t pval
## ENSMUST00000037739 -0.5411265 0.4828721 59.4294 -1.1206416 0.26694865
## ENSMUST00000228774 0.5411265 0.4828721 59.4294 1.1206416 0.26694865
## ENSMUST00000025204 0.1929718 0.1952590 61.4294 0.9882864 0.32688926
## ENSMUST00000237499 -0.1929718 0.1952590 61.4294 -0.9882864 0.32688926
## ENSMUST00000042857 -0.8245461 0.4351069 58.4294 -1.8950427 0.06303775
## ENSMUST00000114415 0.8245461 0.4351069 58.4294 1.8950427 0.06303775
## regular_FDR empirical_pval empirical_FDR
## ENSMUST00000037739 0.6450007 0.3952020 0.9769609
## ENSMUST00000228774 0.6450007 0.4148973 0.9797565
## ENSMUST00000025204 0.6977858 0.4727593 0.9834579
## ENSMUST00000237499 0.6977858 0.4515026 0.9808231
## ENSMUST00000042857 0.3566360 0.1579658 0.9152296
## ENSMUST00000114415 0.3566360 0.1685028 0.9203772
head(rowData(sumExp)[["fitDTUResult_Contrast2"]]) # second contrast
## estimates se df t pval
## ENSMUST00000037739 -0.3801339 0.4941514 59.4294 -0.769266 0.444782126
## ENSMUST00000228774 0.3801339 0.4941514 59.4294 0.769266 0.444782126
## ENSMUST00000025204 0.2971434 0.1921038 61.4294 1.546786 0.127051689
## ENSMUST00000237499 -0.2971434 0.1921038 61.4294 -1.546786 0.127051689
## ENSMUST00000042857 -1.4866500 0.4997583 58.4294 -2.974738 0.004257964
## ENSMUST00000114415 1.4866500 0.4997583 58.4294 2.974738 0.004257964
## regular_FDR empirical_pval empirical_FDR
## ENSMUST00000037739 0.7772316 0.56104440 0.9893602
## ENSMUST00000228774 0.7772316 0.58504064 0.9893602
## ENSMUST00000025204 0.5021149 0.26790898 0.9893602
## ENSMUST00000237499 0.5021149 0.25297843 0.9893602
## ENSMUST00000042857 0.1491038 0.03349528 0.9893602
## ENSMUST00000114415 0.1491038 0.03654226 0.9893602
对于每个对比,结果将是一个包含8列的数据框:
-estimates:估计的对数-比值比(log base e);
-se:估计的标准误差;
-df:检验统计量的后验自由度;
-t:t检验统计量,用给定estimates和se的Wald检验计算;
-pval:给定t和df的“原始”p值;
-FDR:错误发现率,使用Benjamini和Hochberg对pval的多重测试修正计算;
-empirical_pval:一种“经验性的”p值,通过经验性地估计检验统计量的零分布来计算;
-empirical_FDR:FDR,使用Benjamini和Hochberg对pval_empirical的多重检验修正计算。
Reference
Gilis Jeroen, Koen Van den Berge & Lieven Clement, Kristoffer Vitting-Seerup. 2021. “satuRn: Scalable Analysis of differential Transcript Usage for bulk and single-cell RNA-sequencing applications.” F1000, May.
桓峰基因,铸造成功的您!
未来桓峰基因公众号将不间断的推出单细胞系列生信分析教程,
敬请期待!!
桓峰基因官网正式上线,请大家多多关注,还有很多不足之处,大家多多指正!http://www.kyohogene.com/
桓峰基因和投必得合作,文章润色优惠85折,需要文章润色的老师可以直接到网站输入领取桓峰基因专属优惠券码:KYOHOGENE,然后上传,付款时选择桓峰基因优惠券即可享受85折优惠哦!https://www.topeditsci.com/