由于运行BAMixChecker软件需要构建过variant的bam文件,直接用RNA-seq STAR输出的bam文件会报字眼为“GATK HaplotypeCaller…”的错误,因此需进行GATK的rnaseq:
即STAR 2-pass --> picard --> gatk的使用
1. 构建基因组
注意:通过这一fasta文件构建的基因组,后续在gatk的参数中也要保持同一个fasta文件,否则会报错找不到某个contig
#!/bin/sh
name=$1
STAR --runMode genomeGenerate \
--runThreadN 15 \
--genomeDir /public/wgs/JG_PA/RNAseq/index_forGATK/ \
--genomeFastaFiles /public/wgs/JG_PA/WES/hg38/Homo_sapiens_assembly38.fasta \
--sjdbGTFfile /public/wgs/JG_PA/RNAseq/HG38/gencode.v43.annotation.gtf \
--sjdbOverhang 149
上述代码保存成“1.getSTAR2pass.sh”后运行:
sbatch -J 29GENO -p research -N 1 -n 16 -e 29g.err --output=29g.out --wrap "sh 1.getSTAR2pass.sh"
*过程文件和耗时:
结果:
2. STAR 2-pass alignment
比对工具:STAR,与TopHat相比,STAR灵敏度更高,使用STAR的two-pass mode比对可以使新型切割片段获得更好的比对结果。
有教程构建两次,在第二次中利用到第一次产生的chrX.out.tab文件。
但其实只需要用two-pass mode一步就可以完成:
time STAR --outFileNamePrefix /public/wgs/JG_PA/RNAseq/BAMixChecker/bamforGATK/${sample}_ \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 1 \
--outSAMstrandField intronMotif \
--genomeDir /public/wgs/JG_PA/RNAseq/index_forGATK/ \
--runThreadN 15 \
--readFilesCommand zcat \
--readFilesIn /public/wgs/JG_PA/RNAseq/FASTQ/${sample}_R1_clean.fastq.gz /public/wgs/JG_PA/RNAseq/FASTQ/${sample}_R2_clean.fastq.gz \
--twopassMode Basic
上述代码保存成“1.getSTAR2pass.sh”后运行:
sbatch -J 29GENO -p research -N 1 -n 16 -e 29g.err --output=29g.out --wrap "sh 1.getSTAR2pass.sh 7864166-T"
*过程文件和耗时:
结果:
3. 去除PCR重复+SplitNCigarReads
#!/bin/sh
sample=$1
opdir="/public/wgs/JG_PA/RNAseq/BAMixChecker/bamforGATK"
gatk='/public/wgs/JG_PA/WES/biosoft/gatk-4.1.9.0/gatk'
GENOME='/public/wgs/JG_PA/WES/hg38/Homo_sapiens_assembly38.fasta'
sambamba markdup --overflow-list-size 600000 --tmpdir='./' -r $opdir/${sample}_Aligned.sortedByCoord.out.bam $opdir/${sample}_rmPCR.bam
$gatk CreateSequenceDictionary -R $GENOME # 最新版gatk,这个步骤非常快
$gatk SplitNCigarReads -R $GENOME \
-I $opdir/${sample}_rmPCR.bam \
-O $opdir/${sample}_rmPCR_split.bam
*过程文件和耗时:
结果:
4. Read groups + Base Quality Recalibration
#!/bin/sh
sample=$1
opdir="/public/wgs/JG_PA/RNAseq/BAMixChecker/bamforGATK"
gatk='/public/wgs/JG_PA/WES/biosoft/gatk-4.1.9.0/gatk'
DBSNP=/public/wgs/JG_PA/WES/hg38/dbsnp_146.hg38.vcf.gz
kgSNP=/public/wgs/JG_PA/WES/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
kgINDEL=/public/wgs/JG_PA/WES/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
GENOME='/public/wgs/JG_PA/WES/hg38/Homo_sapiens_assembly38.fasta'
# 因为我的star比对得到的bam文件里面没有 Read groups
# 参考:https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups
$gatk AddOrReplaceReadGroups \
-I $opdir/${sample}_rmPCR_split.bam \
-O $opdir/${sample}_rmPCR_split_add.bam \
-LB $sample -PL ILLUMINA \
-PU $sample -SM $sample
$gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
-I $opdir/${sample}_rmPCR_split_add.bam \
-R $GENOME \
-O $opdir/${sample}_recal.table \
--known-sites $kgSNP --known-sites $kgINDEL
$gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" ApplyBQSR \
-I $opdir/${sample}_rmPCR_split_add.bam -R $GENOME \
--output $opdir/${sample}_recal.bam \
-bqsr $opdir/${sample}_recal.table
*过程文件和耗时:
5. Variant Calling
这里没有参考生信技能树代码添加 -L参数设置bed文件(-L/–intervals:指定感兴趣的区域,可以是染色体名称、区域坐标或者BED文件),因为会报错。暂未深究,去掉这一参数也可以运行。
$gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
-ERC GVCF -R $GENOME -I $opdir/${sample}_recal.bam \
--dbsnp $DBSNP -O $opdir/${sample}_gatk.gvcf
*过程文件会有warning
*参考以下说法https://www.biostars.org/p/437456/,warning可以忽略
*过程文件和耗时:
结果:
—至此,从fastq文件到得到gatk.vcf文件,大约需要15h
6.BAMixChecker
发现不需要runGATK最后一步,用recal.bam文件即可,可以节约8h左右
python BAMixChecker/BAMixChecker.py -l bam_list.txt -p 15 \
-r /public/wgs/JG_PA/WES/hg38/Homo_sapiens_assembly38.fasta -o BAMixRT
合并后统一运行
#!/bin/sh
# conda activate rnaseq_fastp_htseq_new
sample=$1
#STAR --runMode genomeGenerate \
# --runThreadN 15 \
# --genomeDir /public/wgs/JG_PA/RNAseq/index_forGATK/ \
# --genomeFastaFiles /public/wgs/JG_PA/WES/hg38/Homo_sapiens_assembly38.fasta \
# --sjdbGTFfile /public/wgs/JG_PA/RNAseq/HG38/gencode.v43.annotation.gtf \
# --sjdbOverhang 149
mkdir -p /public/wgs/JG_PA/RNAseq/BAMixChecker/LOG
opdir="/public/wgs/JG_PA/RNAseq/BAMixChecker/bamforGATK"
gatk='/public/wgs/JG_PA/WES/biosoft/gatk-4.1.9.0/gatk'
samtools='/public/home/hpcsongjy/miniconda3/envs/RseQC_BAMixChecker_python2/bin/samtools'
GENOME='/public/wgs/JG_PA/WES/hg38/Homo_sapiens_assembly38.fasta'
DBSNP=/public/wgs/JG_PA/WES/hg38/dbsnp_146.hg38.vcf.gz
kgSNP=/public/wgs/JG_PA/WES/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
kgINDEL=/public/wgs/JG_PA/WES/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
if [ ! -f $opdir/LOG/ok.gvcf.${sample}.status ]; then
time STAR --outFileNamePrefix $opdir/${sample}_ \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 1 \
--outSAMstrandField intronMotif \
--genomeDir /public/wgs/JG_PA/RNAseq/index_forGATK/ \
--runThreadN 15 \
--readFilesCommand zcat \
--readFilesIn /public/wgs/JG_PA/RNAseq/FASTQ/${sample}_R1_clean.fastq.gz /public/wgs/JG_PA/RNAseq/FASTQ/${sample}_R2_clean.fastq.gz \
--twopassMode Basic
#time sambamba markdup --overflow-list-size 600000 --tmpdir='/public/wgs/JG_PA/RNAseq/BAMixChecker/tmp' \
# -r $opdir/${sample}_Aligned.sortedByCoord.out.bam $opdir/${sample}_rmPCR.bam
# 因为我的star比对得到的bam文件里面没有 Read groups
# 参考:https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups
time $gatk AddOrReplaceReadGroups \
-I $opdir/${sample}_Aligned.sortedByCoord.out.bam \
-O $opdir/${sample}_RG.bam \
-LB $sample -PL ILLUMINA \
-PU $sample -SM $sample && echo "** ADD RG done **"
time $gatk MarkDuplicates \
-I $opdir/${sample}_RG.bam \
-M $opdir/${sample}_markdup_metrics.txt \
-O $opdir/${sample}_RG_markdup.bam &&\
time $samtools index $opdir/${sample}_RG_markdup.bam && echo "** ADD RG_markdup.bam index done **"
time $gatk CreateSequenceDictionary -R $GENOME # 最新版gatk,这个步骤非常快
time $gatk SplitNCigarReads -R $GENOME \
-I $opdir/${sample}_RG_markdup.bam \
-O $opdir/${sample}_RG_markdup_split.bam && echo "** ADD SplitNCigarReads done **"
time $gatk BaseRecalibrator \
-I $opdir/${sample}_RG_markdup_split.bam \
-R $GENOME \
-O $opdir/${sample}_recal.table --known-sites $kgSNP --known-sites $kgINDEL
time $gatk ApplyBQSR \
-I $opdir/${sample}_RG_markdup_split.bam -R $GENOME \
--output $opdir/${sample}_recal.bam \
-bqsr $opdir/${sample}_recal.table && echo "** ADD recal done **"
# ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/All_20180418.vcf.gz
# time $gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
# -ERC GVCF -R $GENOME -I $opdir/${sample}_recal.bam --dbsnp $DBSNP -O $opdir/${sample}_gatk.gvcf
if [ $? -eq 0 ]; then
echo "SUCCEED" $sample
touch $opdir/LOG/ok.recal.${sample}.status
rm $opdir/${sample}_RG.bam $opdir/${sample}_RG_markdup.bam $opdir/${sample}_RG_markdup_split.bam
rm $opdir/${sample}_RG_markdup.bam.bai $opdir/${sample}_RG_markdup_split.bai
else
echo "FAILED" $sample
fi # check gvcf status
fi ## check whether we need to process current sampel
参考文章:
主要参考生信技能树 https://zhuanlan.zhihu.com/p/109796467
每一步都很清晰 https://blog.csdn.net/weixin_44649331/article/details/103872706#:~:text=%E5%B0%86%E8%90%BD%E5%9C%A8%E5%A4%96%E6%98%BE%E5%AD%90%E4%B8%8A%E7%9A%84reads%E5%88%86%E7%A6%BB%E5%87%BA%E6%9D%A5%EF%BC%8C%E5%8F%96%E5%87%BAN%E9%94%99%E8%AF%AF%E7%A2%B1%E5%9F%BA%EF%BC%8C%E5%8E%BB%E9%99%A4%E5%86%85%E5%90%AB%E5%AD%90%E5%8C%BA%E5%9F%9F%E7%9A%84reads%E3%80%82%20gatk%20SplitNCigarReads,-R.%2Fgenome%2FchrX.fa%20-I.%2Fstart_2pass%2FchrX_dedup.bam%20-O.%2Fstart_2pass%2FchrX_dedup_split.bam
参数解释的很好 https://www.jianshu.com/p/c8a7ad8584e1