exomePeak2学习&使用记录
文章目录
写在前面
学习和使用一个新的算法或者算法包之前,必须要认真研究相关的帮助文档,以及函数里的相关参数和注释。
关于exomePeak2
exomePeak2为甲基化RNA免疫沉淀测序数据(MeRIP-Seq)提供了有偏倚感知的定量和峰值检测。MeRIP-Seq是一种常用的测序技术,可以测量给定细胞系条件下RNA修饰位点的位置和丰度。然而,MeRIPSeq的定量和峰值呼叫对PCR扩增偏差很敏感,这通常存在于下一代测序(NGS)技术中。此外,由RNA-Seq生成的计数数据在生物复制物之间表现出显著的生物差异。exomePeak2通过引入一系列为MeRIPSeq定制的强大数据科学工具来共同解决这些挑战。使用exomePeak2,用户可以通过一个直接的单步函数来执行峰值调用、修改站点量化和差分分析。或者,可以使用多步功能来生成诊断图和执行自定义分析。
下载exomePeak2
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("SummarizedExperiment","cqn","Rsamtools",
"GenomicAlignments","GenomicRanges","GenomicFeatures",
"DESeq2","ggplot2","mclust",
"genefilter","BSgenome","BiocParallel",
"IRanges","S4Vectors","quantreg",
"reshape2","rtracklayer","apeglm","RMariaDB"))
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("ZW-xjtlu/exomePeak2", build_vignettes = TRUE)
下载出错的一些情况记录…
某个包had non-zero exit status,我最近装什么包都出这个问题…
解决办法:
- 不停的尝试,网络问题,建议使用Rgui安装(悲伤)
- 删掉Rstudio和Rgui安装目录下的包和该包依赖的其他的包,然后重新安装(魔幻操作)
希望大家都能装各种R包顺利。后面希望搞个R里遇见的错误合集…
关于exomePeak2的使用
帮助文档对该算法包的描述:
exomePeak2从MeRIP-seq实验的BAM文件进行峰值呼叫和峰值统计计算。该功能集成了标准MeRIP-seq数据分析流程的以下步骤。
- 使用ScanMeripBAM检查并对BAM文件进行索引。
- 调用使用exome调用的调用修改峰值。
- 用标准化的GC来计算GC含量偏差的偏移因子。
- 利用广义线性模型或glmDM计算微分修正统计量。
- 按exportResults以用户定义的格式导出峰值/站点统计数据。
调用函数
Usage
exomePeak2(
bam_ip = NULL,
bam_input = NULL,
bam_treated_ip = NULL,
bam_treated_input = NULL,
txdb = NULL,
bsgenome = NULL,
genome = NA,
gff_dir = NULL,
mod_annot = NULL,
paired_end = FALSE,
library_type = c("unstranded", "1st_strand", "2nd_strand"),
fragment_length = 100,
binding_length = 25,
step_length = binding_length,
min_peak_width = fragment_length/2,
max_peak_width = fragment_length * 10,
pc_count_cutoff = 5,
bg_count_cutoff = 50,
p_cutoff = 1e-05,
p_adj_cutoff = NULL,
log2FC_cutoff = 0,
parallel = 1,
background_method = c("all", "Gaussian_mixture", "m6Aseq_prior", "manual"),
manual_background = NULL,
correct_GC_bg = TRUE,
qtnorm = FALSE,
glm_type = c("DESeq2", "Poisson", "NB"),
LFC_shrinkage = c("apeglm", "ashr", "Gaussian", "none"),
export_results = TRUE,
export_format = c("CSV", "BED", "RDS"),
table_style = c("bed", "granges"),
save_plot_GC = TRUE,
save_plot_analysis = FALSE,
save_plot_name = "",
save_dir = "exomePeak2_output",
peak_calling_mode = c("exon", "full_tx", "whole_genome")
)
其中需要用到的的参数:
-
bam_ip :(control)IP样本的BAM文件目录的一个字符向量。
-
bam_input:(control)input样本的BAM文件目录的一个字符向量。
-
bam_treated_ip:已处理(treated)的IP示例的BAM文件目录的一个字符向量。
-
bam_treated_input:已处理(treated)的输入样本的BAM文件目录的一个字符向量。如果样本文件不包含处理组,用户应只能填写BAM_ip和BAM_input的参数。
-
genome:UCSC基因组名称的一个字符串,是UCSC可以接受的字符串。例如:“hg19”。默认情况下,该参数=NA,应在bsgenome或/和TxDb对象不可用时提供。
-
gff_dir :该参数可选,指定指向基因注释GFF/GTF文件的目录的字符,在TxDb对象不可用时应用,默认为=空。
-
mod_annot:为用户提供的GRanges对象提供单个的RNA修改注释。如果用户提供基于单一的RNA修改注释,exomePeak2将对所提供的注释执行读取计数和(差异)修改量化。
单基注释的两侧将有length=floor(fragment_length-binding_length/2),以说明序列库的片段长度。
exomePeak2使用方法
无处理组,寻找富集峰[Peak Calling]
#Specify File Directories
GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")
f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)
f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)
# Peak Calling
sep <- exomePeak2(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
paired_end = FALSE)
sep
有处理组时,修正峰的差分修正分析(两种条件的比较)
# Differential Modification Analysis on Modification Peaks (Comparison of Two Conditions)
f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)
sep <- exomePeak2(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
bam_treated_input = TREATED_INPUT_BAM,
bam_treated_ip = TREATED_IP_BAM,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
paired_end = FALSE)
sep
基于单一基础的修改注释的修改级别量化
# Modification Level Quantification on Single Based Modification Annotation
f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")
MOD_ANNO_GRANGE <- readRDS(f2)
sep <- exomePeak2(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
paired_end = FALSE,
mod_annot = MOD_ANNO_GRANGE)
sep
单基修改注释的差分改进分析
# Differential Modification Analysis on Single Based Modification Annotation
sep <- exomePeak2(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
bam_treated_input = TREATED_INPUT_BAM,
bam_treated_ip = TREATED_IP_BAM,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
paired_end = FALSE,
mod_annot = MOD_ANNO_GRANGE)
sep
由于后面两个的mod_annot文件我没有处理过,所以我也不是很清楚具体文件长什么样子…
exomePeakCalling-methods使用方法(分步骤)
描述:从MeRIP-seq数据集调用RNA修改的峰值。exomePeakCalling从由用户提供的文字转录本注释定义的外显子区域上的MeRIP-seq的BAM文件执行峰值调用。如果提供BSgenome对象,将使用GC含量偏置校正进行峰值调用。
在默认设置下,对于每个窗口,exomePeak2将符合负二项式(NB)的GLM,以及在DESeq中开发的过色散参数的调节估计。在每个滑动窗口上对IP/input的Fold Change(LFC)<=0进行Wald检验。选择使用截止p值<0.0001的显著修正峰。
Usage
exomePeakCalling(
merip_bams = NULL,
txdb = NULL,
bsgenome = NULL,
genome = NA,
mod_annot = NULL,
glm_type = c("DESeq2", "NB", "Poisson"),
background_method = c("Gaussian_mixture", "m6Aseq_prior", "manual", "all"),
manual_background = NULL,
correct_GC_bg = TRUE,
qtnorm = FALSE,
gff_dir = NULL,
fragment_length = 100,
binding_length = 25,
step_length = binding_length,
min_peak_width = fragment_length/2,
max_peak_width = fragment_length * 10,
pc_count_cutoff = 5,
bg_count_cutoff = 50,
p_cutoff = 1e-05,
p_adj_cutoff = NULL,
log2FC_cutoff = 0,
parallel = 3,
bp_param = NULL
)
## S4 method for signature 'MeripBamFileList'
exomePeakCalling(
merip_bams = NULL,
txdb = NULL,
bsgenome = NULL,
genome = NA,
mod_annot = NULL,
glm_type = c("DESeq2", "NB", "Poisson"),
background_method = c("all", "Gaussian_mixture", "m6Aseq_prior", "manual"),
manual_background = NULL,
correct_GC_bg = TRUE,
qtnorm = FALSE,
gff_dir = NULL,
fragment_length = 100,
binding_length = 25,
step_length = binding_length,
min_peak_width = fragment_length/2,
max_peak_width = fragment_length * 10,
pc_count_cutoff = 5,
bg_count_cutoff = 50,
p_cutoff = 1e-05,
p_adj_cutoff = NULL,
log2FC_cutoff = 0,
parallel = 1,
bp_param = NULL
)
参数基本上与exomePeak2是一样的,但是如果在使用的时候,还是建议根据自己实验目的仔细选择参数选项。
选择文件[根据自己的需要]
### Define File Directories
GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")
f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)
f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)
f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)
Peak calling的调用
## Peak Calling and Visualization in Multiple Steps
#The exomePeak2 package can achieve peak calling and peak statistics calulation with multiple functions.
IP_BAM=c(f1,f2)
INPUT_BAM=c(f3,f4)
## 1. 检查bam文件
MeRIP_Seq_Alignment <- scanMeripBAM(
bam_ip = IP_BAM,
bam_input = INPUT_BAM,
paired_end = FALSE)
# 同时含有处理组和非处理组
MeRIP_Seq_Alignment <- scanMeripBAM(
bam_ip = IP_BAM,
bam_input = INPUT_BAM,
bam_treated_input = TREATED_INPUT_BAM,
bam_treated_ip = TREATED_IP_BAM,
paired_end = FALSE)
# str函数可以方便快速的查看s4对象的结构和内容
str(MeRIP_Seq_Alignment,max.level = 4)
#2. 使用bam文件进行 peak calling
SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
gff_dir = GENE_ANNO_GTF)
#可选,用来评估数据
SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
mod_annot = MOD_ANNO_GRANGE)
# 查看peak结果
str(SummarizedExomePeaks, max.level = 4)
#3. 计算size factors用来对GC偏倚进行矫正
SummarizedExomePeaks <- normalizeGC(SummarizedExomePeaks)
#4. 使用glmM构造peak统计量
SummarizedExomePeaks <- glmM(SummarizedExomePeaks)
#5. Generate the scatter plot between GC content and log2 Fold Change (LFC).
p <- plotLfcGC(SummarizedExomePeaks)
# 点的大小有点小,取出来数据重新加工
library(ggplot2)
data <- p$data
head(data)
p1 <- ggplot(data,aes(x=GC_idx,y=Log2FC,color=Label)) + geom_point(size=2) + theme_classic()
p1
#6. Generate the bar plot for the sequencing depth size factors.
plotSizeFactors(SummarizedExomePeaks)
#7. Export the modification peaks and the peak statistics with user decided format.
exportResults(SummarizedExomePeaks, format = "CSV")### Peak Calling
MeRIP_Seq_Alignment <- scanMeripBAM(
bam_ip = IP_BAM,
bam_input = INPUT_BAM,
paired_end = FALSE
)
sep <- exomePeakCalling(
merip_bams = MeRIP_Seq_Alignment,
gff_dir = GENE_ANNO_GTF,
genome = "hg19"
)
sep <- normalizeGC(sep)
sep <- glmM(sep)
exportResults(sep)
差分修正分析(两种条件的比较)
### Differential Modification Analysis (Comparison of Two Conditions)
MeRIP_Seq_Alignment <- scanMeripBAM(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
bam_treated_ip = TREATED_IP_BAM,
bam_treated_input = TREATED_INPUT_BAM,
paired_end = FALSE)
sep <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
gff_dir = GENE_ANNO_GTF,
genome = "hg19")
sep <- normalizeGC(sep)
sep <- glmDM(sep)
exportResults(sep)
使用算法遇见的错误
这个是我使用分步骤进行peak calling出错,应该是网络的问题,一旦调用UCSC的“hg38”,就会报上图的错误。取消调用"hg38"就可以了,但是会出现其他以下的问题。
没有genome,bsgenome和txdb的调用,就会导致peak calling没法进行GC含量的修正。
虽然成功了,但是暂时不知道这个报错能用什么解决。。。(知道的话麻烦告诉我hhhh)
使用的文件
-
bam文件,注意区分IP和input的放入
-
gff_dir使用的文件,需要使用GFF/GTF格式的文件,在网上的教程看到都说使用UCSC的文件,具体参考注释取决于你bam文件里的注释基因组是哪个版本的。
但是我使用UCSC下载的好几个版本都不行,还会报错,可能是我理解错误?但是使用GENCODE数据库里的基因组注释文件就正常了。。。
最后
总之,exomePeak2算法包和exomePeak算法包光是下载对我来说就一波三折…更别提后面的注释,人生艰难hhhh,希望自己后面有能力处理到这些无比神奇的报错吧!
欢迎关注我的公众号呀~
参考资料:
- https://github.com/ZW-xjtlu/exomePeak2
- https://www.mac-soft.cn/t/15602