WDL学习

最近想把GATK流程化,方便后续工作。看到WDL+Cromwell的方法还是比较方便的。而且后续GATK出来的best practice也是按照WDL写的。就相当于是学习了。这里记录3个用的可能比较多的例子,具体的可见官网

case2. 编写一个多步骤(multi-step)流程

例子数据下载
这里写图片描述
这个任务是要分开从haplotypeCaller得到的SNP和indel。

  • Workflow
    这个例子涉及到task alising。因为select要用2次而cromwell不允许同一个task被调用2次(若有第二次,会覆盖第一次的结果)。如果后续要对结果做下一步工作,会调用selectSNPs.rawSubset而不是select.rawSubset。对于第二个和第三个任务,需要从haplotypeCaller调结果,那么需要用input: inputname=taskname.outputname如下:
workflow SimpleVariantSelection {
  call haplotypeCaller { input: }
  call select as selectSNPs { input: }
  call select as selectIndels { input: }
}
workflow SimpleVariantSelection {
  File gatk
  File refFasta
  File refIndex
  File refDict
  String name

  call haplotypeCaller {
    input: 
      sampleName=name, 
      RefFasta=refFasta, 
      GATK=gatk, 
      RefIndex=refIndex, 
      RefDict=refDict
  }
  call select as selectSNPs {
    input: 
      sampleName=name, 
      RefFasta=refFasta, 
      GATK=gatk, 
      RefIndex=refIndex, 
      RefDict=refDict, 
      type="SNP",
      rawVCF=haplotypeCaller.rawVCF
  }
  call select as selectIndels {
    input: 
      sampleName=name, 
      RefFasta=refFasta, 
      GATK=gatk, 
      RefIndex=refIndex, 
      RefDict=refDict, 
      type="INDEL", 
      rawVCF=haplotypeCaller.rawVCF
  }
}
  • Task
task select {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  String type
  File rawVCF

  command {
    java -jar ${GATK} \
      -T SelectVariants \
      -R ${RefFasta} \
      -V ${rawVCF} \
      -selectType ${type} \
      -o ${sampleName}_raw.${type}.vcf
  }
  output {
    File rawSubset = "${sampleName}_raw.${type}.vcf"
  }
}
case3. 跑一个单样本variant discovery的mini pipeline

数据下载
这里写图片描述

  • workflow
workflow SimpleVariantDiscovery {
    File gatk
    File refFasta
    File refIndex
    File refDict
    String name

    call haplotypeCaller {...}
    call select as selectSNPs {...}
    call select as selectIndels {...}
    call hardFilterSNP {
        input:sampleName=name, 
          RefFasta=refFasta, 
          GATK=gatk, 
          RefIndex=refIndex, 
          RefDict=refDict,  
          rawSNPs=selectSNPs.rawSubset
    }
    call hardFilterIndel {
        input: 
        sampleName=name, 
        RefFasta=refFasta, 
        GATK=gatk, 
        RefIndex=refIndex, 
        RefDict=refDict, 
        rawIndels=selectIndels.rawSubset
    }
    call combine {
        input: 
        sampleName=name, 
        RefFasta=refFasta, 
        GATK=gatk, 
        RefIndex=refIndex, 
        RefDict=refDict, 
        filterSNPs=hardFilterSNP.filteredSNPs,
        filteredIndels=hardFilterIndel.filteredIndels
    }
}
  • Task
task hardFilterSNP {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  File rawSNPs

  command {
    java -jar ${GATK} \
      -T VariantFiltration \
      -R ${RefFasta} \
      -V ${rawSNPs} \
      --filterExpression "FS > 60.0" \
      --filterName "snp_filter" \
      -o ${sampleName}.filtered.snps.vcf
  }
  output {
    File filteredSNPs = "${sampleName}.filtered.snps.vcf"
  }
}
task hardFilterIndel {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  File rawIndels

  command {
    java -jar ${GATK} \
      -T VariantFiltration \
      -R ${RefFasta} \
      -V ${rawIndels} \
      --filterExpression "FS > 200.0" \
      --filterName "indel_filter" \
      -o ${sampleName}.filtered.indels.vcf
  }
  output {
    File filteredIndels = "${sampleName}.filtered.indels.vcf"
  }
}
task combine {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  File filteredSNPs
  File filteredIndels

  command {
    java -jar ${GATK} \
      -T CombineVariants \
      -R ${RefFasta} \
      -V ${filteredSNPs} \
      -V ${filteredIndels} \
      --genotypemergeoption UNSORTED \
      -o ${sampleName}.filtered.snps.indels.vcf
  }
  output {
    File filteredVCF = "${sampleName}.filtered.snps.indels.vcf"
  }
}
case4. 用scatter-gather进行joint call genotypes

数据下载,本例子用了toy dataset: NA12878, NA12877, and NA12882, subset to chromosome 20.
这里写图片描述
这个方法经常用于大的队列,这里只用3个样本进行尝试。从inputSamples开始,每个sample都是一个包含3个文件的数组。第一个文职array index0是样本的名字,array index1是 bam文件,array index2是bam的index文件。例子中,我们根据sample进行scatter,并在第一步后gather。gathered GVCF用GenotypeGVCFs去生成一个多样本VCF。

  • Write your script
    首先要准备Array[Array[File]] inputSamples,命名为inputsTSV.txt。每个input由tab分割。这个文件可以帮助减轻在inputs json的nested array declarations。
sampleName  inputBam  bamIndex
sampleName2 inputBam2 bamIndex2
...etc.
  • Workflow
    为导入inputSamplesFile里的内容,转换为Array[Array[File]]格式,需要用到WDL的标准库read_tsv:
File inputSamplesFile
Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)

然后用samplescatter,我们的call task那部分包含在HaplotypeCallerERC函数里。同时在scatter操作之后,要做一步gather,这里的GenotypeGVCFs包含HaplotypeCallerERC的输出,即GVCFs=HaplotypeCallerERC.GVCF

workflow jointCallingGenotypes {

  File inputSamplesFile
  Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
  File gatk
  File refFasta
  File refIndex
  File refDict

  scatter (sample in inputSamples) {
    call HaplotypeCallerERC {
      input: GATK=gatk, 
        RefFasta=refFasta, 
        RefIndex=refIndex, 
        RefDict=refDict, 
        sampleName=sample[0],
        bamFile=sample[1], 
        bamIndex=sample[2]
    }
  }
  call GenotypeGVCFs {
    input: GATK=gatk, 
      RefFasta=refFasta, 
      RefIndex=refIndex, 
      RefDict=refDict, 
      sampleName="CEUtrio", 
      GVCFs=HaplotypeCallerERC.GVCF
  }
}
  • Task
    HaplotyepCallerERC
task HaplotypeCallerERC {

  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  File bamFile
  File bamIndex

  command {
    java -jar ${GATK} \
        -T HaplotypeCaller \
        -ERC GVCF \
        -R ${RefFasta} \
        -I ${bamFile} \
        -o ${sampleName}_rawLikelihoods.g.vcf
  }
  output {
    File GVCF = "${sampleName}_rawLikelihoods.g.vcf"
  }
}

GenotypeGVCFs

task GenotypeGVCFs {

  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  String sampleName
  Array[File] GVCFs

  command {
    java -jar ${GATK} \
        -T GenotypeGVCFs \
        -R ${RefFasta} \
        -V ${sep=" -V " GVCFs} \
        -o ${sampleName}_rawVariants.vcf
  }
  output {
    File rawVCF = "${sampleName}_rawVariants.vcf"
  }
}
  • Running the pipeline
#validate:
java -jar wdltool.jar validate jointCallingGenotypes.wdl

#generate inputs:
java -jar wdltool.jar inputs jointCallingGenotypes.wdl > jointCallingGenotypes_inputs.json

#run locally:
java -jar cromwell.jar run jointCallingGenotypes.wdl jointCallingGenotypes_inputs.json

  • 2
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值