摘要
时隔半年,终于把WGS前面的分析用snakemake搭建好了。读者不要嫌我慢,确实是项目不多,流程也不算特别复杂。之前的shell脚本也能用,因此迟迟没有真正搭建。现在项目慢慢多了,考虑到提升工作效率,趁着前几天做了2个WGS的项目,把这个流程梳理出来。
命令行
#vim: set syntax=python
#__author__ = "Yang Xin"
#__copyright__ = "Copyright 2021, Wang lab"
#__email__ = "491861610@qq.com"
#__license__ = "SAU"
"""
A WGS analysis workflow using trim_galore, BWA, Samtools and GATK.
"""
configfile:"config.yaml"
import yaml
rule all:
input:
expand("03.snp/{sample}_stat.csv", sample=config["samples"]),#比较组的突变输出表格
expand("{genome}.bwt", genome=config["reference"]),#参考基因组建立索引,bwa比对需要
expand("{genome}.fai", genome=config["reference"]),#参考基因组建立索引,GATK比对需要
expand("02.align/ref/{genome_profix}.dict", genome_profix=config["profix"]),#参考基因组建立索引,GATK比对需要
"01.data/data_quality_stat.txt",#质控统计
#############################quality_control######################
rule trim_galore:
input:
R1 = "01.data/{sample}_1.fq.gz",
R2 = "01.data/{sample}_2.fq.gz"
output:
dir = directory("01.data/trim/{sample}"),#输出样品质控文件夹
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
threads:8
shell:
"trim_galore -j {threads} -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o {output.dir} --fastqc {input.R1} {input.R2} "
rule fastp:
input:
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz"
output:
temp("01.data/trim/{sample}.json")#生成临时文件.json,下一个rule使用后删除。
threads:16
shell:
"fastp -w {threads} -i {input.R1} -I {input.R2} -j {output} "
rule fastp_stat:
input:
expand("01.data/trim/{sample}.json",sample=config["samples"])
output:
"01.data/fastp_stat.txt"
script:
"scripts/fastp_stat.py"
#根据json文档统计所有样品的质控信息,往期文章有。
rule sort_fastp_stat:
input:
"01.data/fastp_stat.txt"
output:
"01.data/data_quality_stat.txt"
shell:
"sort 01.data/fastp_stat.txt > 01.data/data_quality_stat.txt;sed -i '1i Sample_ID Clean_reads Clean_bases ≥Q30 GC_Content' 01.data/data_quality_stat.txt;rm -rf 01.data/fastp_stat.txt"
#整理格式
#################################align#########################
rule bwa_index:
input:
genome = config["reference"]
output:
"{genome}.bwt",
shell:
"bwa index {input.genome}"
rule bwa_map:
input:
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
genome_index = expand("{genome}.bwt",genome=config["reference"])
output:
temp("02.align/{sample}_align.sam")
params:
genome = config["reference"],
threads:16
shell:
"bwa mem -R \'@RG\\tID:foo\\tSM:bar\\tLB:Abace\' -t {threads} {params.genome} {input.R1} {input.R2} > {output}"
#注意,双引号里面的特殊符号(()''%&等)会产生歧义,需要使用反斜杠符号进行注释。
rule samtools_view:#sam2bam
input:
"02.align/{sample}_align.sam"
output:
temp("02.align/{sample}.tmp.bam")
shell:
"samtools view -bS {input} > {output} "
rule samtools_sort:#排序
input:
"02.align/{sample}.tmp.bam"
output:
"02.align/{sample}.bam",
shell:
"samtools sort {input} -o {output} "
rule samtools_faidx:#建立索引
input:
genome = config["reference"]
output:
"{genome}.fai"
shell:
"samtools faidx {input.genome}"
########################GATK_filter##################
rule GATK_dict:#建立索引
input:
genome = config["reference"]
output:
dict = "02.align/ref/{genome_profix}.dict"
shell:
"gatk CreateSequenceDictionary -R {input.genome} -O {output.dict}"
rule GATK_sort:#使用GATK对bam文件重新排序,去重。
input:
sample = "02.align/{sample}.bam",
output:
sort_bam = temp("02.align/{sample}_sort.bam")
params:
genome = config["reference"],
shell:
"gatk SortSam -I {input.sample} -O {output.sort_bam} -R {params.genome} -SO coordinate --CREATE_INDEX true"
rule GATK_dupli:
input:
sample = "02.align/{sample}_sort.bam",
output:
dup_bam = temp("02.align/{sample}_sort_dup.bam")
params:
genome = config["reference"]
log:
metrics = "02.align/log/{sample}_markdup_metrics.txt"
shell:
"gatk MarkDuplicates -I {input.sample} -O {output.dup_bam} -M {log.metrics}"
rule samtools_index2:
input:
"02.align/{sample}_sort_dup.bam"
output:
"02.align/{sample}_sort_dup.bam.bai"
shell:
"samtools index {input}"
rule Calling:
input:
sample = "02.align/{sample}_sort_dup.bam",
index = "02.align/{sample}_sort_dup.bam.bai"
output:
sample_vcf = temp("03.snp/{sample}.g.vcf.gz")
params:
genome = config["reference"]
shell:
"gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
#注意-ERC参数,小基因组的物种可加可不加,大基因组一定要加上,并且使用-L按照染色体号分别注释,之后再进行合并。
rule GVCF2VCF:#寻找变异位点,这里的结果包括了snp和indel
input:
sample = "03.snp/{sample}.g.vcf.gz",
output:
sample_vcf = "03.snp/{sample}.vcf"
params:
genome = config["reference"]
shell:
"gatk GenotypeGVCFs -R {params.genome} -V {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
rule SNP_calling:#筛选出snp位点
input:
sample = "03.snp/{sample}.vcf",
output:
sample_vcf = "03.snp/{sample}_snp.vcf"
params:
genome = config["reference"]
shell:
"gatk SelectVariants -R {params.genome} -V {input.sample} -O {output.sample_vcf} --select-type-to-include SNP"
rule VCF_stat:#对snp位点进行统计
input:
snp_vcf = "03.snp/{sample}_snp.vcf"
output:
"03.snp/{sample}_stat.csv"
shell:
"python scripts/snp_frequency.py {input.snp_vcf}"
部分统计脚本介绍文档
fastp_stat.py
snp_frequency.py
总结
这个流程已经被证明是可以跑通的。当然,流程的完整度还是不够,比如snp位点还可以再进行筛选,或者把indel也统计出来。同时,里面还有一些文件需要提供,比如config.yaml, fastp_stat.py, snp_frequency.py。有些文档可以在往期文章看到,有些脚本我会把整个流程需要的软件上传到github上面。希望大家能够测试这个流程并反馈使用建议,欢迎大家加VX:bbplayer2021 (木青)进群交流,备注 申请加入生信交流群。