目录
BQSR记录:问题处理
注意:除了人之外,大多数动物是没有dbSNP的。
我下载的dbSNP数据参考基因组与我处理数据比对用的参考基因组版本不一致。遇到问题:
User error: input files reference and features have incompatible contigs
contig名称不一致。
解决办法:改dbSNP里面的contig名。
1. GATK下载:
# 下载:
wget https://github.com/broadinstitute/gatk/releases/download/4.1.9.0/gatk-4.1.9.0.zip
2.BQSR(Base (Quality Score) Recalibration),已知集已经下载;
参考网站: https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-4-Base_recalibration.pdf
①介绍
BQSR处理步骤是对每个样本执行,包括应用机器学习来检测和纠正基本质量分数中的系统误差模型,这些基本质量分数是比对软件为每个碱基给出的置信度分数。在变异发现过程中,基础质量分数在权衡是否是变异等位基因位点方面起着重要作用,因此纠正数据中观察到的任何系统偏差是很重要的。偏差可以起源于在文库准备和测序过程中的生化过程,从芯片的制造缺陷,或测序器的仪器缺陷。重新校准过程包括从数据集中的所有碱基中收集协变量测量值,根据这些统计信息构建模型,并基于结果模型对数据集应用基本质量调整。最初的统计数据收集可以通过分散在基因组坐标上并行化,通常是通过染色体,但如果需要,可以进一步分解以提高吞吐量。然后,每个区域的统计数据必须被收集到一个单一的全基因组共变模型中;这不能并行化,但它在计算上很简单,因此不是瓶颈。最后,将由模型推导出的重新校准规则应用于原始数据集,生成重新校准的数据集。这与最初的统计信息收集是相同的并行方式,在基因组区域上,然后是最后的文件合并操作,以生成每个样本的单个分析准备文件。
注意:BQSR与VQSR相区别;
②成功重新校准的重要因素
Read groups
@RG 标签:对reads数据进行分区,反映reads在测序仪上的lane等
数据量
③运行方法:
# 两个关键步骤
# 1.BaseRecalibrator tool:建立一个基于输入数据和一组已知变量的共变模型,生成一个重新校准文件;
# 2.ApplyBQSR tool:根据模型,调整数据中的基本质量分数,生成一个新的BAM文件;
# 3.有一个可选但强烈推荐的步骤AnalyzeCovariates,生成碱基校正前后的图,以可视化重新校准过程的效果。这对于质量控制是很有用的。
# 如果没有可用的已知集,可对原始的、未经重新校准的数据进行第一轮的变量调用;然后将最信任的变量作为VCF文件提供给BaseRecalibrator,并将其作为已知变量的数据库使用;最后,对重新校准的数据进行一次真正的变量调用。这些步骤可以重复多次,直到趋同;
gatk=/*****/software/gatk-4.1.9.0/gatk
$gatk BaseRecalibrator \
-R /*****/index/ARS-UCD.1.2.fa \
-I /******/Bostaurus_K894-01-R01_good/Bostaurus_K894-01-R01_good_mkdup.bam \
--known-sites /*****/SRR4279978/SRR4279978.snp.vcf.gz \
-O /*****/group_one/bqsr_result
# 出问题,染色体名称不同:
reference contigs = [CM008177.2, CM008178.2, CM008179.2, CM008180.2, CM008181.2, CM008182.2, CM008183.2, CM008184.2, CM008185.2, CM008186.2, CM008168.2, CM008187.2, CM008188.2, CM008189.2, CM008190.2, CM008191.2, CM008192.2, CM008193.2, CM008194.2, CM008195.2, CM008196.2, CM008169.2, CM008170.2, CM008171.2, CM008172.2, CM008173.2, CM008174.2, CM008175.2, CM008176.2, CM008197.2]
features contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, X, Y, MT, NKLS02000031.1, NKLS02000032.1, NKLS02000033.1, NKLS02000034.1, NKLS02000035.1, NKLS02000036.1, NKLS02000037.1, NKLS02000038.1, NKLS02000039.1, NKLS02000040.1, NKLS02000041.1, NKLS02000042.1, NKLS02000043.1, NKLS02000044.1, NKLS02000045.1, NKLS02000046.1, NKLS02000047.1, NKLS02000048.1, NKLS02000049.1, NKLS02000050.1, NKLS02000051.1, NKLS02000052.1, NKLS02000053.1, NKLS02000054.1, NKLS02000055.1, NKLS02000056.1, NKLS02000057.1, NKLS02000058.1, NKLS02000059.1, NKLS02000060.1, NKLS02000061.1, NKLS02000062.1, NKLS02000063.1, NKLS02000064.1, NKLS02000065.1, NKLS02000066.1, NKLS02000067.1, NKLS02000068.1, NKLS02000069.1, NKLS02000070.1, NKLS02000071.1, NKLS02000072.1, NKLS02000073.1, NKLS02000074.1, NKLS02000075.1, NKLS02000076.1, NKLS02000077.1, NKLS02000078.1, ...
根据报错信息,查了很多解决办法,其中一种方法是可以直接改动known-sites数据中的contig名称 。
3. 改known-sites的VCF文件中contig名称:
容易出问题,改名后需要重建索引,所以必须保证符合VCF文件标准。
zcat "/****/BQSR_kownsite/ARS1.2PlusY_BQSR_v3.vcf.gz" | \
awk '$1!~/^#/{
if($1!="X"){
printf("CM0081%d.2\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",$1+67,$2,$3,$4,$5,$6,$7,$8);
}
else{
printf("CM008197.2\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",$2,$3,$4,$5,$6,$7,$8);
}
}
$1~/^##/{
split($1,x,"ID=")
split(x[2],y,",")
if(y[1] > 0){
if(y[1] < 30){
printf("##contig=<ID=CM0081%d.2,%s\n",y[1]+67,y[2]);
}
else if(y[1]=="X"){
printf("##contig=<ID=CM008197.2,%s\n",y[2]);
}
else{
printf("%s\n",$0);
}
}
else{
printf("%s\n",$0);
}
}
$1~/^#CHR/{
printf("%s\n",$0);
}' - > re_chrname.vcf
4. 为更改名称的known-sites VCF文件压缩和建立索引
bgzip -c re_chrname.vcf > re_chrname.vcf.vcf.gz
/home/lifan/software/bcftools/bcftools-1.8/bcftools index -t re_chrname.vcf.vcf.gz.vcf.gz
5. 重新开展BQSR
BQSR 流程图:
图片来自于GATK官方:一般做左边这一步就行了,右边是画出图对比重校正的效果的。
其中,PrintReads 功能被4.0版本之后的ApplyBQSR所替换。
①.BaseRecalibrator tool:建立一个基于输入数据和一组已知变量的共变模型,生成一个重新校准文件;
②.ApplyBQSR tool:根据模型,调整数据中的基本质量分数,生成一个新的BAM文件(即校正后的文件);
③.有一个可选但强烈推荐的步骤AnalyzeCovariates,生成碱基校正前后的图,以可视化重新校准过程的效果。这对于质量控制是很有用的。
[yanglv@localhost bqsr]$ gatk=/home/yanglv/software/gatk-4.1.9.0/gatk
[yanglv@localhost bqsr]$ $gatk BaseRecalibrator \
> -R /*****/index/ARS-UCD.1.2.fa \
> -I /******/Bostaurus_K894-01-R01_good/Bostaurus_K894-01-R01_good_mkdup.bam \
> --known-sites /****/BQSR_kownsite/re_chrname.vcf.vcf.gz.vcf.gz \
> -O /******r/Bostaurus_K894-01-R01_good_recal.table
Using GATK jar /********/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /software/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar BaseRecalibrator -R ARS-UCD.1.2.fa -I Bostaurus_K894-01-R01_good/Bostaurus_K894-01-R01_good_mkdup.bam --known-sites re_chrname.vcf.vcf.gz.vcf.gz -O Bostaurus_K894-01-R01_good_recal.table
15:54:21.085 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/yanglv/software/gatk-4.1.9.0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 15, 2021 3:54:21 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:54:21.532 INFO BaseRecalibrator - ------------------------------------------------------------
15:54:21.533 INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.9.0
15:54:21.533 INFO BaseRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
15:54:21.533 INFO BaseRecalibrator - Executing as yanglv@localhost.localdomain on Linux v3.10.0-1127.19.1.el7.x86_64 amd64
15:54:21.533 INFO BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_262-b10
15:54:21.533 INFO BaseRecalibrator - Start Date/Time: January 15, 2021 3:54:20 PM CST
15:54:21.533 INFO BaseRecalibrator - ------------------------------------------------------------
15:54:21.533 INFO BaseRecalibrator - ------------------------------------------------------------
15:54:21.533 INFO BaseRecalibrator - HTSJDK Version: 2.23.0
15:54:21.533 INFO BaseRecalibrator - Picard Version: 2.23.3
15:54:21.533 INFO BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:54:21.533 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:54:21.533 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:54:21.533 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:54:21.534 INFO BaseRecalibrator - Deflater: IntelDeflater
15:54:21.534 INFO BaseRecalibrator - Inflater: IntelInflater
15:54:21.534 INFO BaseRecalibrator - GCS max retries/reopens: 20
15:54:21.534 INFO BaseRecalibrator - Requester pays: disabled
15:54:21.534 INFO BaseRecalibrator - Initializing engine
15:54:22.033 INFO FeatureManager - Using codec VCFCodec to read file file:///sdb1/usrdata/yanglv/BQSR_kownsite/ARS1.2PlusY_BQSR_v3_shift_chrname_clean.vcf.gz
15:54:22.166 INFO BaseRecalibrator - Done initializing engine
15:54:22.170 INFO BaseRecalibrationEngine - The covariates being used here:
15:54:22.170 INFO BaseRecalibrationEngine - ReadGroupCovariate
15:54:22.170 INFO BaseRecalibrationEngine - QualityScoreCovariate
15:54:22.170 INFO BaseRecalibrationEngine - ContextCovariate
15:54:22.170 INFO BaseRecalibrationEngine - CycleCovariate
15:54:22.172 INFO ProgressMeter - Starting traversal
15:54:22.172 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
15:54:32.238 INFO ProgressMeter - CM008177.2:1626113 0.2 167000 998306.3
15:54:42.258 INFO ProgressMeter - CM008177.2:3590308 0.3 291000 869305.5