比对工具:STAR
与TopHat相比,STAR灵敏度更高。使用STAR的two-pass mode比对可以使新型切割片段获得更好的比对结果。
第一次建立参考基因组
STAR --runMode genomeGenerate
--runThreadN 8
--genomeDir ./star_index/
--genomeFastaFiles ./genome/chrX.fa
--sjdbGTFfile ./genes/chrX.gtf
--sjdbOverhang INT(read length -1)
第一次序列比对
STAR --runThreadN 8
--genomeDir ./star_index/
--readFilesIn ./samples/chrX_1.fastq,gz ./samples/chrX_2.fastq,gz
--readFilesCommand zcat
--outFileNamePrefix ./star_1pass/chrX
根据第一次比对得到的所有剪切位点,第二次建立参考索引
STAR --runMode genomeGenerate
--runThreadN 8
--genomeDir ./star_index_2pass/
--genomeFastaFiles ./genome/chrX.fa
--sjdbFileChrStartEnd ./star_1pass/chrX.out.tab
第二次序列比对
STAR --runThreadN 8
--genomeDir ./star_index_2_pass/
--readFilesIn ./samples/chrX_1.fastq,gz ./samples/chrX_2.fastq,gz
--readFilesCommand zcat
--outFileNamePrefix ./start_2pass/chrX
比对文件预处理
添加RG标签:AddOrReplaceReadGroups
java - jar picard.jar AddOrReplaceReadGroups
I = ./start_2pass/chrXAligned.out.sam
O = ./start_2pass/chrX_rg_added_sorted.bam
SO = coordinate
RGID = ERR188044
RGLB = rna
RGPL = illumina
RGPU = hiseq
RGSM = ERR188044
标记重复并排序:MarkDuplicates
java -jar picard.jar MarkDuplicates
I = ./start_2pass/chrX_rg_added_sorted.bam
O = ./start_2pass/chrX_dedup.bam
CREATE_INDEX = true
VALIDATION_STRINGENCY = SLIENT
M = ./start_2pass/chrX_dedup.metrics
Call Variants
SplitNCigarReads
将落在外显子上的reads分离出来,取出N错误碱基,去除内含子区域的reads。
gatk SplitNCigarReads
-R ./genome/chrX.fa
-I ./start_2pass/chrX_dedup.bam
-O ./start_2pass/chrX_dedup_split.bam
重新校正碱基质量值:baseRecalibrator、applyBQSR
gatk BaeRecalibrator
-I ./start_2pass/chrX_dedup_split.bam
-O ./start_2pass/chrX_recal_data.table
-R ./genome/chrX.fa
--known-sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf
--known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
gatk ApplyBQSR
-R ./genome/chrX.fa
-I ./start_2pass/chrX_dedup_split.bam
-O ./start_2pass/chrX_BQSR.bam
--bqsr-recal-file ./start_2pass/chrX_recal_data.table
检测变异:HaplotypeCaller
gatk HaplotypeCalller
-R ./genome/chrX.fa
-I ./start_2pass/chrX_BQSR.bam
-O ./start_2pass/chrX.vcf
-dontUseSodtCilppedBases
-stand_call_conf 20.0
--dbsnp ${dbSNP_vcf}
变异过滤:VariantFiltration
RNA-Seq建议使用硬过滤。经过过滤的VCF,其中将通过的变体注释为PASS。
gatk VariantFiltration
-R ./genome/chrX.fa
-V ./start_2pass/chrX.vcf
-O ./star_2pass/chrX_filtered.vcf
--filter-name FS
--filter-expression "FS>30.0"
--filter-name QD
--filter-expression "QD<2.0"
--window 35
--cluster 3