Recently I wanted to turn my GATK work into a pipeline to make later analyses easier. The WDL + Cromwell approach looks fairly convenient, and the GATK Best Practices released later are also written in WDL, so learning it pays off. Here I record three examples that are likely to be used most often; see the official site for details.
case2. Writing a multi-step pipeline
Example data download
This task separates the SNPs and indels in the output of haplotypeCaller.
- Workflow
This example involves task aliasing. The select task has to be called twice, but the same task cannot be called twice under the same name in one workflow (the second call would overwrite the results of the first). With aliases, later steps refer to selectSNPs.rawSubset rather than select.rawSubset. The second and third tasks need to pull results from haplotypeCaller, which is done with input: inputname=taskname.outputname, as follows:
workflow SimpleVariantSelection {
call haplotypeCaller { input: }
call select as selectSNPs { input: }
call select as selectIndels { input: }
}
workflow SimpleVariantSelection {
File gatk
File refFasta
File refIndex
File refDict
String name
call haplotypeCaller {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict
}
call select as selectSNPs {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
type="SNP",
rawVCF=haplotypeCaller.rawVCF
}
call select as selectIndels {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
type="INDEL",
rawVCF=haplotypeCaller.rawVCF
}
}
- Task
task select {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
String type
File rawVCF
command {
java -jar ${GATK} \
-T SelectVariants \
-R ${RefFasta} \
-V ${rawVCF} \
-selectType ${type} \
-o ${sampleName}_raw.${type}.vcf
}
output {
File rawSubset = "${sampleName}_raw.${type}.vcf"
}
}
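The aliasing above can be pictured in plain Python: the same function runs twice, but each call's result is stored under its alias, so downstream steps can refer to each output unambiguously. This is only a sketch of the semantics, not how Cromwell works internally; the sample name and raw VCF name below are made up.

```python
# Illustrative only: mimic calling the same task twice under two aliases.
def select(sample_name, variant_type, raw_vcf):
    # Each call produces a distinct output file name, like the WDL select task.
    return f"{sample_name}_raw.{variant_type}.vcf"

outputs = {}
raw_vcf = "NA12878_raw.vcf"  # stands in for haplotypeCaller.rawVCF
outputs["selectSNPs.rawSubset"] = select("NA12878", "SNP", raw_vcf)
outputs["selectIndels.rawSubset"] = select("NA12878", "INDEL", raw_vcf)

print(outputs["selectSNPs.rawSubset"])    # NA12878_raw.SNP.vcf
print(outputs["selectIndels.rawSubset"])  # NA12878_raw.INDEL.vcf
```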
case3. Running a mini variant-discovery pipeline on a single sample
Data download
- Workflow
workflow SimpleVariantDiscovery {
File gatk
File refFasta
File refIndex
File refDict
String name
call haplotypeCaller {...}
call select as selectSNPs {...}
call select as selectIndels {...}
call hardFilterSNP {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
rawSNPs=selectSNPs.rawSubset
}
call hardFilterIndel {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
rawIndels=selectIndels.rawSubset
}
call combine {
input:
sampleName=name,
RefFasta=refFasta,
GATK=gatk,
RefIndex=refIndex,
RefDict=refDict,
filteredSNPs=hardFilterSNP.filteredSNPs,
filteredIndels=hardFilterIndel.filteredIndels
}
}
- Task
task hardFilterSNP {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File rawSNPs
command {
java -jar ${GATK} \
-T VariantFiltration \
-R ${RefFasta} \
-V ${rawSNPs} \
--filterExpression "FS > 60.0" \
--filterName "snp_filter" \
-o ${sampleName}.filtered.snps.vcf
}
output {
File filteredSNPs = "${sampleName}.filtered.snps.vcf"
}
}
task hardFilterIndel {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File rawIndels
command {
java -jar ${GATK} \
-T VariantFiltration \
-R ${RefFasta} \
-V ${rawIndels} \
--filterExpression "FS > 200.0" \
--filterName "indel_filter" \
-o ${sampleName}.filtered.indels.vcf
}
output {
File filteredIndels = "${sampleName}.filtered.indels.vcf"
}
}
task combine {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File filteredSNPs
File filteredIndels
command {
java -jar ${GATK} \
-T CombineVariants \
-R ${RefFasta} \
-V ${filteredSNPs} \
-V ${filteredIndels} \
--genotypemergeoption UNSORTED \
-o ${sampleName}.filtered.snps.indels.vcf
}
output {
File filteredVCF = "${sampleName}.filtered.snps.indels.vcf"
}
}
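As a rough sketch of what the two filter expressions do: VariantFiltration keeps every record but writes the filter name into the FILTER column when the expression matches (FS > 60.0 for SNPs, FS > 200.0 for indels). A hypothetical Python rendering with made-up records:

```python
def hard_filter(records, threshold, filter_name):
    # Records failing the expression are tagged, not removed,
    # mirroring GATK VariantFiltration behaviour.
    out = []
    for rec in records:
        status = filter_name if rec["FS"] > threshold else "PASS"
        out.append({**rec, "FILTER": status})
    return out

snps = [{"pos": 100, "FS": 75.0}, {"pos": 200, "FS": 3.2}]
filtered = hard_filter(snps, 60.0, "snp_filter")
print([r["FILTER"] for r in filtered])  # ['snp_filter', 'PASS']
```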
case4. Joint-calling genotypes with scatter-gather
Data download. This example uses a toy dataset: NA12878, NA12877, and NA12882, subset to chromosome 20.
This approach is typically used on large cohorts; here we try it on just 3 samples. Starting from inputSamples, each sample is an array of 3 files: the first position, array index 0, is the sample name; array index 1 is the BAM file; array index 2 is the BAM index file. In this example we scatter by sample and gather after the first step; the gathered GVCFs are then run through GenotypeGVCFs to produce a multi-sample VCF.
- Write your script
First, prepare the Array[Array[File]] inputSamples as a file named inputsTSV.txt, with the fields of each input separated by tabs. This file saves you from writing nested array declarations in the inputs JSON.
sampleName inputBam bamIndex
sampleName2 inputBam2 bamIndex2
...etc.
- Workflow
To import the contents of inputSamplesFile and convert them into the Array[Array[File]] format, use read_tsv from the WDL standard library:
File inputSamplesFile
Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
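read_tsv simply turns the TSV into a nested array: one inner array per line, one element per tab-separated field. An equivalent sketch in Python (the file contents and names below are invented to match the inputsTSV.txt layout above):

```python
# Stand-in for the contents of inputsTSV.txt (tab-separated, one sample per line).
tsv = "NA12878\tNA12878.bam\tNA12878.bai\nNA12877\tNA12877.bam\tNA12877.bai\n"

# Equivalent of WDL's read_tsv: each line becomes an inner array.
input_samples = [line.split("\t") for line in tsv.splitlines()]

print(input_samples[0][0])  # NA12878      (sample name, index 0)
print(input_samples[0][1])  # NA12878.bam  (BAM file, index 1)
print(input_samples[0][2])  # NA12878.bai  (BAM index, index 2)
```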
Then we scatter by sample; the task call is wrapped inside the scatter block and invokes HaplotypeCallerERC. After the scatter comes a gather step: here GenotypeGVCFs takes the outputs of HaplotypeCallerERC, i.e. GVCFs=HaplotypeCallerERC.GVCF:
workflow jointCallingGenotypes {
File inputSamplesFile
Array[Array[File]] inputSamples = read_tsv(inputSamplesFile)
File gatk
File refFasta
File refIndex
File refDict
scatter (sample in inputSamples) {
call HaplotypeCallerERC {
input: GATK=gatk,
RefFasta=refFasta,
RefIndex=refIndex,
RefDict=refDict,
sampleName=sample[0],
bamFile=sample[1],
bamIndex=sample[2]
}
}
call GenotypeGVCFs {
input: GATK=gatk,
RefFasta=refFasta,
RefIndex=refIndex,
RefDict=refDict,
sampleName="CEUtrio",
GVCFs=HaplotypeCallerERC.GVCF
}
}
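The control flow of the workflow above can be sketched in Python: the scatter is in effect a loop over samples, each iteration producing one GVCF, and the gather hands the collected array to a single joint-genotyping call. The function bodies below are placeholders that only echo the output file names; the BAM paths are invented.

```python
def haplotype_caller_erc(sample):
    # sample is [name, bam, bamIndex], matching index 0/1/2 in the text.
    sample_name, bam, bam_index = sample
    return f"{sample_name}_rawLikelihoods.g.vcf"

def genotype_gvcfs(sample_name, gvcfs):
    return f"{sample_name}_rawVariants.vcf"

input_samples = [
    ["NA12878", "NA12878.bam", "NA12878.bai"],
    ["NA12877", "NA12877.bam", "NA12877.bai"],
    ["NA12882", "NA12882.bam", "NA12882.bai"],
]

# scatter: one call per sample, results collected into an array
gvcfs = [haplotype_caller_erc(s) for s in input_samples]

# gather: the whole array feeds a single joint call
raw_vcf = genotype_gvcfs("CEUtrio", gvcfs)
print(len(gvcfs), raw_vcf)  # 3 CEUtrio_rawVariants.vcf
```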
- Task
HaplotypeCallerERC
task HaplotypeCallerERC {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
File bamFile
File bamIndex
command {
java -jar ${GATK} \
-T HaplotypeCaller \
-ERC GVCF \
-R ${RefFasta} \
-I ${bamFile} \
-o ${sampleName}_rawLikelihoods.g.vcf
}
output {
File GVCF = "${sampleName}_rawLikelihoods.g.vcf"
}
}
GenotypeGVCFs
task GenotypeGVCFs {
File GATK
File RefFasta
File RefIndex
File RefDict
String sampleName
Array[File] GVCFs
command {
java -jar ${GATK} \
-T GenotypeGVCFs \
-R ${RefFasta} \
-V ${sep=" -V " GVCFs} \
-o ${sampleName}_rawVariants.vcf
}
output {
File rawVCF = "${sampleName}_rawVariants.vcf"
}
}
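The ${sep=" -V " GVCFs} placeholder joins the array elements with the given separator, so each GVCF ends up behind its own -V flag. What the command fragment expands to, sketched in Python (file names invented):

```python
gvcfs = ["NA12878_rawLikelihoods.g.vcf", "NA12877_rawLikelihoods.g.vcf"]

# WDL's ${sep=" -V " GVCFs} inserts the separator *between* elements,
# so the literal -V already in the command template supplies the first flag.
fragment = "-V " + " -V ".join(gvcfs)
print(fragment)
# -V NA12878_rawLikelihoods.g.vcf -V NA12877_rawLikelihoods.g.vcf
```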
- Running the pipeline
#validate:
java -jar wdltool.jar validate jointCallingGenotypes.wdl
#generate inputs:
java -jar wdltool.jar inputs jointCallingGenotypes.wdl > jointCallingGenotypes_inputs.json
#run locally:
java -jar cromwell.jar run jointCallingGenotypes.wdl jointCallingGenotypes_inputs.json
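The inputs file that wdltool generates is plain JSON keyed by fully-qualified names (workflowName.inputName). A hedged sketch of filling it in with Python; every path below is a placeholder you would replace with your own files.

```python
import json

# Keys follow WDL's workflowName.inputName convention; values are placeholders.
inputs = {
    "jointCallingGenotypes.inputSamplesFile": "inputsTSV.txt",
    "jointCallingGenotypes.gatk": "GenomeAnalysisTK.jar",
    "jointCallingGenotypes.refFasta": "ref.fasta",
    "jointCallingGenotypes.refIndex": "ref.fasta.fai",
    "jointCallingGenotypes.refDict": "ref.dict",
}
print(json.dumps(inputs, indent=2))
```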