Recently I wanted to turn my GATK work into a pipeline to make later analyses easier. The WDL + Cromwell approach looks fairly convenient, and the GATK Best Practices released later are also written in WDL, so learning it pays off. Here I record three examples that are likely to be used most often; see the official site for details.
case2. Writing a multi-step pipeline
Example data download
This task separates the SNPs and indels in the output of haplotypeCaller.
- Workflow
This example involves task aliasing. The select task has to be called twice, but the same task cannot be called twice under the same name in one workflow (the second call would overwrite the results of the first). With aliases, later steps refer to selectSNPs.rawSubset rather than select.rawSubset. The second and third tasks need to pull results from haplotypeCaller, which is done with input: inputname=taskname.outputname, as follows:
workflow SimpleVariantSelection {
call haplotypeCaller { input: }
call select as selectSNPs { input: }
call select as selectIndels { input: }
}
workflow SimpleVariantSelection {
File gatk
File refFasta
File refIndex
File refDict
String name
call haplotypeCaller {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict
}
call select as selectSNPs {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
type="SNP",
rawVCF=haplotypeCaller.rawVCF
}
call select as selectIndels {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
type="INDEL",
rawVCF=haplotypeCaller.rawVCF
}
}
- Task
task select {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
String type
File rawVCF
command {
java -jar ${GATK} \
-T SelectVariants \
-R ${RefFasta} \
-V ${rawVCF} \
-selectType ${type} \
-o ${sampleName}_raw.${type}.vcf
}
output {
File rawSubset = "${sampleName}_raw.${type}.vcf"
}
}
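The aliasing above can be pictured in plain Python: the same function runs twice, but each call's result is stored under its alias, so downstream steps can refer to each output unambiguously. This is only a sketch of the semantics, not how Cromwell works internally; the sample name and raw VCF name below are made up.

```python
# Illustrative only: mimic calling the same task twice under two aliases.
def select(sample_name, variant_type, raw_vcf):
    # Each call produces a distinct output file name, like the WDL select task.
    return f"{sample_name}_raw.{variant_type}.vcf"

outputs = {}
raw_vcf = "NA12878_raw.vcf"  # stands in for haplotypeCaller.rawVCF
outputs["selectSNPs.rawSubset"] = select("NA12878", "SNP", raw_vcf)
outputs["selectIndels.rawSubset"] = select("NA12878", "INDEL", raw_vcf)

print(outputs["selectSNPs.rawSubset"])    # NA12878_raw.SNP.vcf
print(outputs["selectIndels.rawSubset"])  # NA12878_raw.INDEL.vcf
```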
case3. Running a mini variant-discovery pipeline on a single sample
Data download
- Workflow
workflow SimpleVariantDiscovery {
File gatk
File refFasta
File refIndex
File refDict
String name
call haplotypeCaller {...}
call select as selectSNPs {...}
call select as selectIndels {...}
call hardFilterSNP {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
rawSNPs=selectSNPs.rawSubset
}
call hardFilterIndel {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
rawIndels=selectIndels.rawSubset
}
call combine {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
filteredSNPs=hardFilterSNP.filteredSNPs,
filteredIndels=hardFilterIndel.filteredIndels
}
}
- Task
task hardFilterSNP {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File rawSNPs
command {
java -jar ${GATK} \
-T VariantFiltration \
-R ${RefFasta} \
-V ${rawSNPs} \
--filterExpression "FS > 60.0" \
--filterName "snp_filter" \
-o ${sampleName}.filtered.snps.vcf
}
output {
File filteredSNPs = "${sampleName}.filtered.snps.vcf"
}
}
task hardFilterIndel {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File rawIndels
command {
java -jar ${GATK} \
-T VariantFiltration \
-R ${RefFasta} \
-V ${rawIndels} \
--filterExpression "FS > 200.0" \
--filterName "indel_filter" \
-o ${sampleName}.filtered.indels.vcf
}
output {
File filteredIndels = "${sampleName}.filtered.indels.vcf"
}
}
task combine {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File filteredSNPs
File filteredIndels
command {
java -jar ${GATK} \
-T CombineVariants \
-R ${RefFasta} \
-V ${filteredSNPs} \
-V ${filteredIndels} \
--genotypemergeoption UNSORTED \
-o ${sampleName}.filtered.snps.indels.vcf
}
output {
File filteredVCF = "${sampleName}.filtered.snps.indels.vcf"
}
}
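As a rough sketch of what the two filter expressions do: VariantFiltration keeps every record but writes the filter name into the FILTER column when the expression matches (FS > 60.0 for SNPs, FS > 200.0 for indels). A hypothetical Python rendering with made-up records:

```python
def hard_filter(records, threshold, filter_name):
    # Records failing the expression are tagged, not removed,
    # mirroring GATK VariantFiltration behaviour.
    out = []
    for rec in records:
        status = filter_name if rec["FS"] > threshold else "PASS"
        out.append({**rec, "FILTER": status})
    return out

snps = [{"pos": 100, "FS": 75.0}, {"pos": 200, "FS": 3.2}]
filtered = hard_filter(snps, 60.0, "snp_filter")
print([r["FILTER"] for r in filtered])  # ['snp_filter', 'PASS']
```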
case4. Joint-calling genotypes with scatter-gather
Data download. This example uses a toy dataset: NA12878, NA12877, and NA12882, subset to chromosome 20.
This approach is typically used on large cohorts; here we try it on just 3 samples. Starting from inputSamples, each sample is an array of 3 files: the first position, array index 0, is the sample name; array index 1 is the BAM file; array index 2 is the BAM index file. In this example we scatter by sample and gather after the first step; the gathered GVCFs are then run through GenotypeGVCFs to produce a multi-sample VCF.
- Write your script
First, prepare the Array[Array[File]] inputSamples as a file named inputsTSV.txt, with the fields of each input separated by tabs. This file saves you from writing nested array declarations in the inputs JSON.
sampleName inputBam bamIndex
sampleName2 inputBam2 bamIndex2
...etc.
- Workflow
To import the contents of inputSamplesFile and convert them into the Array[Array[File]] format, use read_tsv from the WDL standard library:
File inputSamplesFile
Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
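read_tsv simply turns the TSV into a nested array: one inner array per line, one element per tab-separated field. An equivalent sketch in Python (the file contents and names below are invented to match the inputsTSV.txt layout above):

```python
# Stand-in for the contents of inputsTSV.txt (tab-separated, one sample per line).
tsv = "NA12878\tNA12878.bam\tNA12878.bai\nNA12877\tNA12877.bam\tNA12877.bai\n"

# Equivalent of WDL's read_tsv: each line becomes an inner array.
input_samples = [line.split("\t") for line in tsv.splitlines()]

print(input_samples[0][0])  # NA12878      (sample name, index 0)
print(input_samples[0][1])  # NA12878.bam  (BAM file, index 1)
print(input_samples[0][2])  # NA12878.bai  (BAM index, index 2)
```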
Then we scatter by sample; the task call is wrapped inside the scatter block and invokes HaplotypeCallerERC. After the scatter comes a gather step: here GenotypeGVCFs takes the outputs of HaplotypeCallerERC, i.e. GVCFs=HaplotypeCallerERC.GVCF:
workflow jointCallingGenotypes {
File inputSamplesFile
Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
File gatk
File refFasta
File refIndex
File refDict
scatter (sample in inputSamples) {
call HaplotypeCallerERC {
input: GATK=gatk,
RefFasta=refFasta,
RefIndex=refIndex,
RefDict=refDict,
sampleName=sample[0],
bamFile=sample[1],
bamIndex=sample[2]
}
}
call GenotypeGVCFs {
input: GATK=gatk,
RefFasta=refFasta,
RefIndex=refIndex,
RefDict=refDict,
sampleName="CEUtrio",
GVCFs=HaplotypeCallerERC.GVCF
}
}
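The control flow of the workflow above can be sketched in Python: the scatter is in effect a loop over samples, each iteration producing one GVCF, and the gather hands the collected array to a single joint-genotyping call. The function bodies below are placeholders that only echo the output file names; the BAM paths are invented.

```python
def haplotype_caller_erc(sample):
    # sample is [name, bam, bamIndex], matching index 0/1/2 in the text.
    sample_name, bam, bam_index = sample
    return f"{sample_name}_rawLikelihoods.g.vcf"

def genotype_gvcfs(sample_name, gvcfs):
    return f"{sample_name}_rawVariants.vcf"

input_samples = [
    ["NA12878", "NA12878.bam", "NA12878.bai"],
    ["NA12877", "NA12877.bam", "NA12877.bai"],
    ["NA12882", "NA12882.bam", "NA12882.bai"],
]

# scatter: one call per sample, results collected into an array
gvcfs = [haplotype_caller_erc(s) for s in input_samples]

# gather: the whole array feeds a single joint call
raw_vcf = genotype_gvcfs("CEUtrio", gvcfs)
print(len(gvcfs), raw_vcf)  # 3 CEUtrio_rawVariants.vcf
```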
- Task
HaplotypeCallerERC
task HaplotypeCallerERC {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File bamFile
File bamIndex
command {
java -jar ${GATK} \
-T HaplotypeCaller \
-ERC GVCF \
-R ${RefFasta} \
-I ${bamFile} \
-o ${sampleName}_rawLikelihoods.g.vcf
}
output {
File GVCF = "${sampleName}_rawLikelihoods.g.vcf"
}
}
GenotypeGVCFs
task GenotypeGVCFs {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
Array[File] GVCFs
command {
java -jar ${GATK} \
-T GenotypeGVCFs \
-R ${RefFasta} \
-V ${sep=" -V " GVCFs} \
-o ${sampleName}_rawVariants.vcf
}
output {
File rawVCF = "${sampleName}_rawVariants.vcf"
}
}
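The ${sep=" -V " GVCFs} placeholder joins the array elements with the given separator, so each GVCF ends up behind its own -V flag. What the command fragment expands to, sketched in Python (file names invented):

```python
gvcfs = ["NA12878_rawLikelihoods.g.vcf", "NA12877_rawLikelihoods.g.vcf"]

# WDL's ${sep=" -V " GVCFs} inserts the separator *between* elements,
# so the literal -V already in the command template supplies the first flag.
fragment = "-V " + " -V ".join(gvcfs)
print(fragment)
# -V NA12878_rawLikelihoods.g.vcf -V NA12877_rawLikelihoods.g.vcf
```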
- Running the pipeline
#validate:
java -jar wdltool.jar validate jointCallingGenotypes.wdl
#generate inputs:
java -jar wdltool.jar inputs jointCallingGenotypes.wdl > jointCallingGenotypes_inputs.json
#run locally:
java -jar cromwell.jar run jointCallingGenotypes.wdl jointCallingGenotypes_inputs.json
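The inputs file that wdltool generates is plain JSON keyed by fully-qualified names (workflowName.inputName). A hedged sketch of filling it in with Python; every path below is a placeholder you would replace with your own files.

```python
import json

# Keys follow WDL's workflowName.inputName convention; values are placeholders.
inputs = {
    "jointCallingGenotypes.inputSamplesFile": "inputsTSV.txt",
    "jointCallingGenotypes.gatk": "GenomeAnalysisTK.jar",
    "jointCallingGenotypes.refFasta": "ref.fasta",
    "jointCallingGenotypes.refIndex": "ref.fasta.fai",
    "jointCallingGenotypes.refDict": "ref.dict",
}
print(json.dumps(inputs, indent=2))
```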