在Windows下完成WGBS-Seq分析

最新推荐文章于 2024-09-20 11:44:46 发布

Nuvolar

最新推荐文章于 2024-09-20 11:44:46 发布

阅读量3.7k

点赞数

文章标签：生物信息学 r语言 windows

本文链接：https://blog.csdn.net/sinat_41244077/article/details/123876185

版权

纯粹就是记录下如何在Window下完成WGBS-Seq,并没有详细讲解WGBS-Seq的原理和每一步背后的含义。想深入了解可以翻生信技能树，生信宝典，组学大讲堂等等公众号。本文里的命令都是用(powershell or cmd, 部分需要cygwin64) 和 R脚本。
本文数据来源
OsChz1 acts as a histone chaperone in modulating chromatin organization and genome function in rice
知乎搬运的图
知乎搬运流程

工具准备

perl5 (需要添加到环境变量, 我装的草莓版的)windows环境安装Perl
java (需要添加到环境变量)[JAVA(windows)安装教程]
R Windows下安装R的详细教程
cygwin64 Windows：安装cygwin教程
SRA Toolkit (需要添加到环境变量，仅用于将sra文件转为fastq)下载地址
bowtie2 (需要添加到环境变量，最新版的还没被编译到windows平台，这里用的是2.3.4版本，名字含mingw的zip就可以。运行bowtie2需要perl在环境变量中)下载地址
Trimmomatic (用来去reads的接头，需要java)下载地址
samtools (用cygwin安装)
FastQC windows 安装 fastqc;
Bismark mannual
R包：
DSS
ChIPseeker
clusterProfiler

数据下载

文献中有项目号： GSE155269。可以直接在GEO里找到数据下载。其中WGBS-seq共四个样本，分别是野生型和突变体，两个重复。作者上传了差异甲基化区域结果和原始reads。这里就重头call开始演示一下。其实这篇文章里WGBS-seq的结果并不多，从sra开始看看能不能得到一些有趣的信息。
下载sra
试验处理
和前边的blog一样，用fastq-dump 转成fastq格式。

在powershell 键入命令

cd F:/WGBS-Seq/
mkdir ./output/log

$SRR="SRR1233930"
for($x=5;$x -lt 9;$x=$x+1)
{
#--split-3 用于双端测序 --gzip 输出结果为fastq.gz格式比较省空间，缺点是比对过程会慢些
fastq-dump --split-3 --gzip ${SRR}${x}
rm ${SRR}${x}
}

数据分析

FastQC

这里用FastQC和multiqc 来查看reads质量。multiqc 能把FastQC的结果汇总，更好看点。

#fastqc
$SRR="SRR1233930"
mkdir ./output/fastqc
for($x=5;$x -lt 9;$x=$x+1)
{
perl D:\RNA-Seq_tools\FastQC\fastqc -t 4 ${SRR}${x}_1.fastq.gz -q -o .\output\fastqc
perl D:\RNA-Seq_tools\FastQC\fastqc -t 4 ${SRR}${x}_2.fastq.gz -q -o .\output\fastqc
sleep -s 150 #隔2.5分钟跑下一个
}

multiqc ./output/fastqc/*fastqc.zip -o ./output/fastqc --export # summary report

multiqc
相比于其他二代测序，BS-seq的GC含量高多了。

Trimmomatic

Trimmomatic是比较常用质控软件。质控后清除没用的文件，省点空间。
Trimmomatic参数参考

mkdir ./output/clean
$SRR="SRR1233930"
for($x=5;$x -lt 9;$x=$x+1)
{
java -jar "D:/RNA-Seq_tools/Trimmomatic-0.39/trimmomatic-0.39.jar" PE -threads 16 -phred33 -trimlog ./output/log/${SRR}${x}_trim.log ${SRR}${x}_1.fastq.gz ${SRR}${x}_2.fastq.gz ./output/clean/${SRR}${x}_R1.clean.fastq.gz ./output/clean/${SRR}${x}_R1.unpaired.fastq.gz ./output/clean/${SRR}${x}_R2.clean.fastq.gz ./output/clean/${SRR}${x}_R2.unpaired.fastq.gz ILLUMINACLIP:Merged.Adapter.fa:2:30:10 SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36
}
rm ./output/log/*_trim.log #clean trim log
rm ./output/clean/*.unpaired.fastq.gz #clean unpaired fastq

alignment

用bismark进行比对，bismark内部还是用bowtie2进行比对的，所以要预先将bowtie2加入环境变量。需要先构建参考基因组引索。
在powershell 键入命令

##build index 
mkdir ./output/ref
### soft link
copy E:/IRGSP-1.0_representative/referent_fasta/all.chrs.fasta ./output/ref/all.chrs.fasta
perl D:/wgc_tools/Bismark-0.23.1/bismark_genome_preparation --bowtie2  --verbose ./output/ref/

构建好引索后进行比对，这步比较费内存，也比较慢。
具体过程就是先将fastq中CG进行转换。MethylExtractBSCR可以评估甲基化转换效率，但需要预知Lambda.fa，但这个项目里没找到。bismark_methylation_extractor 是提取CHH,CHG,CpG的位点，加–CX可以合并三种类型，建议查查具体参数，这个是有重复样本的，bismark_methylation_extractor也可以做单样本，有需要可以看推荐阅读里。bismark2summary 可以汇总多个样本的结果，bismark2report是单个的，样本名中不用带clean啥的后缀，应是pe.bam结尾的。
在cygwin或sh中键入命令

for i in $(seq 5 8)
do
perl D:/wgc_tools/Bismark-0.23.1/bismark --bowtie2 -N 0 -L 20 --quiet --un --ambiguous --sam --nucleotide_coverage --genome ./output/ref/ -o ./output/bam/ -1 ./output/clean/SRR1233930${i}_R1.clean.fastq.gz -2 ./output/clean/SRR1233930${i}_R2.clean.fastq.gz;
#### 评估甲基化效率，Lambda.fa为建库的内参序列
# perl D:/wgc_tools/methylextract/MethylExtractBSCR.pl seqFile=./output/ref/Lambda.fa inFile=./output/bam/SRR1233930${i}_R1.clean_bismark_bt2_pe.sam flagW=99,147 flagC=83,163

#MarkDuplicates WGBS duplicates RRBS not
perl D:/wgc_tools/Bismark-0.23.1/deduplicate_bismark -p --bam --output_dir ./output/bam/ ./output/bam/SRR1233930${i}_R1.clean_bismark_bt2_pe.sam;

rm ./output/bam/SRR1233930${i}_R1.clean_bismark_bt2_pe.sam;
done

#methylation extractor
perl D:/wgc_tools/Bismark-0.23.1/bismark_methylation_extractor -p --CX --comprehensive --no_overlap --bedGraph --counts --report --cytosine_report --multicore 3 -o ./output/result/ --buffer_size 50G --genome_folder F:/WGBS-Seq/output/ref/ ./output/bam/SRR1233930*_R1.clean_bismark_bt2_pe.deduplicated.bam 
#bismark2summary

cp ./output/result/*_splitting_report.txt ./output/bam/
perl D:/wgc_tools/Bismark-0.23.1/bismark2summary ./output/bam/SRR12339305_R1.clean_bismark_bt2_pe.bam  ./output/bam/SRR12339306_R1.clean_bismark_bt2_pe.bam  ./output/bam/SRR12339307_R1.clean_bismark_bt2_pe.bam  ./output/bam/SRR12339308_R1.clean_bismark_bt2_pe.bam

result
这里结果文件居大，有134G，建议留好200G左右的空间。
bismark2summary

DSS

DSS是比较常用的call DML（差异甲基化位点）和DMR（差异甲基化区域）R包。主要是文献里的ViewBS在cygwin里都不能成功运行（我也不知道为啥，有明白的麻烦告诉我一声）。

R代码如下
导入需要的包，并定义好参数。这里只对CHH的结果进行分析，主要是CpG实在是太大了。CHH一共有2600W+个位点，但其实有些位点覆盖的reads比较少，而且没有检测到存在甲基化reads，这些位点可以用二项分布检验来去除，主要就是filter_cov 这个函数，whether_filter_cov <- T就可以启用，大致算了下，只留下10W+个位点，不同p值结果差异很大，输入的A为四个样本的甲基化效率在前面MethylExtractBSCR得到。好一点的电脑估计4个小时左右能过滤完。我这里没有过滤，但2600W+个位点转换BSobj真的很慢慢慢，BSobj为1.8G，希望你们的内存条够用。

rm(list = ls())
library(R.utils)
library(dplyr)
library(stats)
library(DSS)
require(bsseq)

whether_filter_cov <- F
filter_p <- 0.1
filter_method <- "fdr"
filter_cov <- function(data,filter_p,filter_method,p){
    p.value <- lapply(1:nrow(data), FUN=function(x){
        bt <- binom.test(data$X[x],data$N[x],p)$p.value
        return(bt)
    })
    p.value <- unlist(p.value)
    q.value <- p.adjust(p.value,filter_method)
    pos <- which(q.value <= filter_p)
    return(pos)
}

out_dir <- "F:/WGBS-Seq/output/DSS/"
dir.create(out_dir)
rep_num <- 2
smoothing <- T
Param <- SnowParam(workers = 5) #number core
dml_num <- 4e+05 # 5e+05
dml.threshold <- 0.001
dml.delta <- 0.1

dmr.threshold <- 0.001
dmr.delta <- 0.1

library(org.Osativa.eg.db)
rice <- org.Osativa.eg.db
rice_txdb <- loadDb("E:/blog/Chip-seq/rice_2021_10.sqlite") #调用你建好的txdb对象
rice_anno_txdb <- makeTxDbFromGFF("E:/IRGSP-1.0_representative/Oryza_sativa.IRGSP-1.0.51.gff3.gz") # 下载的注释文件(gff3)构建annoDb 注释文件最好要对应

up.flank <- 2000
down.flank <- 2000
group_name <- c("C1","C2", "N1", "N2")

NIP_1 <- "F:/WGBS-Seq/output/result/CHG_context_SRR12339305.gz.bismark.cov.gz"
NIP_2 <- "F:/WGBS-Seq/output/result/CHG_context_SRR12339306.gz.bismark.cov.gz"
oschz1_1 <- "F:/WGBS-Seq/output/result/CHG_context_SRR12339307.gz.bismark.cov.gz"
oschz1_2 <- "F:/WGBS-Seq/output/result/CHG_context_SRR12339308.gz.bismark.cov.gz"
group <- c(NIP_1,NIP_2, oschz1_1, oschz1_2)

A <- c(0.0032,0.0032,0.0032,0.0032) #甲基化效率

obj <- list()

for (i in group) {
    temp_gf <- gzfile(i,"rt")
    temp_data <- read.table(temp_gf, header = F, sep = "\t", quote = "", comment.char = "#")
    temp_data <- data.frame(chr=temp_data[,1],pos=temp_data[,2],N=temp_data[,5]+temp_data[,6],X=temp_data[,5])
    if (whether_filter_cov==T){
         p <- A[which(group %in% i)]
        # p <- 0.1*mean(temp_data$X/temp_data$N)
        pos <- filter_cov(temp_data)
        temp_data <- temp_data[pos,]
    }
    obj[[group_name[which(group %in% i)]]] <- temp_data
}
rm(temp_gf)
rm(temp_data)
BSobj <- makeBSseqData(obj, group_name)
rm(obj)
gc()
temp <- object.size(BSobj)
print(temp, units = "auto")

call DML，如果全部一起运行内存和CPU都会吃不消，所以就分段算，大概算了34个小时。

dmlTest <- data.frame(chr=NA,pos=NA,mu1=NA,mu2=NA,diff=NA,diff.se=NA,stat=NA,pval=NA,fdr=NA)[0,]
# 1:ceiling(length(BSobj)/dml_num)
for(i in 1:ceiling(length(BSobj)/dml_num)){
    if(i >= ceiling(length(BSobj)/dml_num)){
        tempobj <- BSobj[((i-1)*dml_num+1):length(BSobj),]
    }else{
        tempobj <- BSobj[((i-1)*dml_num+1):(i*dml_num),]
    }
    tempdml <- DMLtest(tempobj, group1=group_name[1:rep_num], group2=group_name[(length(group_name)-rep_num+1):length(group_name)], smoothing=smoothing,BPPARAM=Param)
    dmlTest <- rbind(dmlTest,tempdml)
}
rm(tempobj)
rm(tempdml)
gc()
dmlTest$fdr <- p.adjust(dmlTest$pval,"fdr")
write.table(dmlTest,file=paste0(out_dir,"dmlTest.txt"),sep = "\t",append = FALSE,row.names = FALSE, col.names = TRUE, quote = FALSE)
dmls <- callDML(dmlTest,delta=dml.delta, p.threshold=dml.threshold)
write.table(dmls,file=paste0(out_dir,"dmls.txt"),sep = "\t",append = FALSE,row.names = FALSE, col.names = TRUE, quote = FALSE)

用genomation注释dmls，bed文件是用gtf转的bed12格式的文件。

library(genomation)

gene.obj <- readTranscriptFeatures("E:/IRGSP-1.0_representative/IRGSP-1.0_representative_transcript_exon_2022-03-11.bed12",up.flank=up.flank,down.flank=down.flank)
dmls_grangs <- data.frame(chr=dmls$chr,start=dmls$pos,end=dmls$pos,strand="*",fdr=dmls$fdr)
diffgene <- annotateWithGeneParts(as(dmls_grangs,"GRanges"),gene.obj)

genomation
call DMR

dmrs <- callDMR(dmlTest,delta=dmr.delta, p.threshold=dmr.threshold)
write.table(dmrs,file=paste0(out_dir,"dmrs.txt"),sep = "\t",append = FALSE,row.names = FALSE, col.names = TRUE, quote = FALSE)
showOneDMR(dmrs[1,], BSobj)

showDMRs
对DMR进行注释，genomation也可以，这里用chipseeker注释，顺带进行GO和KEGG分析。要注意这里是对所有注释到的gene进行富集分析，但甲基化位于启动子区和外显子区的功能是有差别的，实际操作中应筛选例如TSS1500，TSS2000这种同一特征的进行富集。

library(ChIPseeker)
library(GenomicFeatures)
library(UpSetR)
library(clusterProfiler)

peak_plot <- function(peak,region,up_flank,down_flank){
    tagMatrix <- getTagMatrix(peak,windows=region)
    tagHeatmap(tagMatrix,xlim = c(-up_flank,down_flank),color = "red") # Heatmap of ChIP binding to TSS regions
    plotAvgProf(tagMatrix,xlim = c(-up_flank,down_flank),conf = 0.95) # Average Profile of ChIP binding to TSS region
}

keggplot <- function(i,gene_list,showCategory,outfile,p){
    kk <- enrichKEGG(gene_list[[i]],organism='dosa',keyType = 'kegg',                
                 pvalueCutoff=p, pAdjustMethod='none',                 
                 qvalueCutoff=p)
    print(barplot(kk,showCategory = showCategory,title = ""))
    print(dotplot(kk,showCategory = showCategory,title = ""))
    print(heatplot(kk,showCategory = showCategory))
    write.table(data.frame(kk@result),file = outfile,sep="\t", row.names = FALSE, col.names =TRUE, quote =FALSE) #output 
}
goplot <- function(i,gene_list,showCategory,outfile,p){
    ego.BP <- enrichGO(gene_list[[i]],OrgDb = rice, keyType = "RAP", ont="BP",              
                 pvalueCutoff=p, pAdjustMethod='none',                 
                 qvalueCutoff=p)
    ego.CC <- enrichGO(gene_list[[i]],OrgDb = rice, keyType = "RAP", ont="CC",              
                 pvalueCutoff=p, pAdjustMethod='none',                 
                 qvalueCutoff=p)
    ego.MF <- enrichGO(gene_list[[i]],OrgDb = rice, keyType = "RAP", ont="MF",              
                 pvalueCutoff=p, pAdjustMethod='none',                 
                 qvalueCutoff=p)
    ego <- enrichGO(gene_list[[i]],OrgDb = rice, keyType = "RAP", ont="ALL",              
                 pvalueCutoff=p, pAdjustMethod='none',                 
                 qvalueCutoff=p)
    write.table(data.frame(ego@result),file = outfile,sep="\t", row.names = FALSE, col.names =TRUE, quote =FALSE) #output 
    print(barplot(ego.BP,showCategory = showCategory,title = paste0("BP")))
    print(dotplot(ego.BP,showCategory = showCategory,title = paste0("BP")))
    print(heatplot(ego.BP,showCategory = showCategory))
    print(plotGOgraph(ego.BP))
    print(barplot(ego.CC,showCategory = showCategory,title = paste0("CC")))
    print(dotplot(ego.CC,showCategory = showCategory,title = paste0("CC")))
    print(heatplot(ego.CC,showCategory = showCategory))
    print(plotGOgraph(ego.CC))
    print(barplot(ego.MF,showCategory = showCategory,title = paste0("MF")))
    print(dotplot(ego.MF,showCategory = showCategory,title = paste0("MF")))
    print(heatplot(ego.MF,showCategory = showCategory))
    print(plotGOgraph(ego.MF))
}

write.bed <- function(dmrs,file){
    methy <- data.frame(chrom=unlist(lapply(dmrs$chr, function(x){
        t <- strsplit(x,split = "chr")[[1]][2]
        return(t)
    })),chromStart=dmrs$start,chromEnd=dmrs$end,name=1:nrow(dmrs),score=dmrs$nCG,strand=".",thickStart=dmrs$start,thickEnd=dmrs$end,itemRGB="255,255,255",blockCount=1,blockSizes=dmrs$end-dmrs$start,blockStarts=dmrs$start,signalValue=dmrs$diff.Methy,pValue=-1,qValue=-1)
    write.table(methy,file = file,sep = "\t",row.names = F,col.names = F,quote = F)
}

write.bed(dmrs,paste0(out_dir,"dmrs.bed"))
dmrs_bed<- readPeakFile(paste0(out_dir,"dmrs.bed"))

covplot(dmrs_bed,title="dmrs")

promoter <- getPromoters(TxDb=rice_txdb, upstream=up.flank, downstream=down.flank)
gene <- getBioRegion(TxDb=rice_txdb, upstream=up.flank, downstream=down.flank,by="exon")

peak_plot(dmrs_bed,promoter,up.flank,down.flank)
peak_plot(dmrs_bed,gene,up.flank,down.flank)

dmrs_anno <- annotatePeak(dmrs_bed, annoDb = "rice_anno_txdb",TxDb=rice_txdb,tssRegion=c(-up.flank,down.flank))

plotDistToTSS(dmrs_anno) #tss feature
plotAnnoBar(dmrs_anno) #peaks feature

plotAnnoPie(dmrs_anno) 
vennpie(dmrs_anno)
write.table(data.frame(dmrs_anno),file =paste0(out_dir,"dmrs.anno"),sep="\t", row.names = FALSE, col.names =TRUE, quote =FALSE) 

genes <- data.frame(dmrs_anno)$geneId
genes <-  genes[!duplicated(genes)]
genes <- list(dmrs = genes)
transcript <- data.frame(dmrs_anno)$transcriptId
transcript <- transcript[!duplicated(transcript)]
transcript <- list(dmrs = transcript)

lapply(1, keggplot,showCategory=10,gene_list=transcript,outfile=paste0(out_dir,"dmrs_kegg.txt"),p=1)

lapply(1, goplot,showCategory=10,gene_list=genes,outfile=paste0(out_dir,"dmrs_go.txt"),p=1)