2021.07.30 [WGS/GWAS] | Whole-Genome Analysis: the Complete Workflow (Part 1)


Abstract

Half a year later, I have finally wrapped the upstream WGS analysis into a snakemake workflow. Please don't blame me for being slow: there really weren't many projects, the pipeline is not particularly complex, and the old shell scripts still worked, so I kept putting off building it properly. Now that projects are picking up and I want to work more efficiently, I used two recent WGS projects as an opportunity to write the workflow out.

Command line

#vim: set syntax=python

#__author__ = "Yang Xin"
#__copyright__ = "Copyright 2021, Wang lab"
#__email__ = "491861610@qq.com"
#__license__ = "SAU"

"""
A WGS analysis workflow using trim_galore, BWA, Samtools and GATK.
"""

configfile:"config.yaml"

import yaml

rule all:
    input:
        expand("03.snp/{sample}_stat.csv", sample=config["samples"]),#比较组的突变输出表格
        expand("{genome}.bwt", genome=config["reference"]),#参考基因组建立索引,bwa比对需要
        expand("{genome}.fai", genome=config["reference"]),#参考基因组建立索引,GATK比对需要
        expand("02.align/ref/{genome_profix}.dict", genome_profix=config["profix"]),#参考基因组建立索引,GATK比对需要
        "01.data/data_quality_stat.txt",#质控统计
#############################quality_control######################

rule trim_galore:
    input:
        R1 = "01.data/{sample}_1.fq.gz",
        R2 = "01.data/{sample}_2.fq.gz"
    output:
        dir = directory("01.data/trim/{sample}"),    # per-sample output directory for the trimmed reads
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
    threads:8
    shell:
        "trim_galore -j {threads} -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o {output.dir} --fastqc {input.R1} {input.R2} "

rule fastp:
    input:
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz"
    output:
        temp("01.data/trim/{sample}.json")#生成临时文件.json,下一个rule使用后删除。
    threads:16
    shell:
        "fastp -w {threads} -i {input.R1} -I {input.R2} -j {output} "

rule fastp_stat:
    input:
        expand("01.data/trim/{sample}.json",sample=config["samples"])
    output:
        "01.data/fastp_stat.txt"
    script:
        "scripts/fastp_stat.py"
        # summarise the QC metrics of all samples from the fastp JSON reports;
        # the script comes from an earlier post (a sketch is given in the scripts section below).

rule sort_fastp_stat:
    input:
        "01.data/fastp_stat.txt"
    output:
        "01.data/data_quality_stat.txt"
    shell:
        "sort 01.data/fastp_stat.txt > 01.data/data_quality_stat.txt;sed -i '1i Sample_ID       Clean_reads     Clean_bases     ≥Q30    GC_Content' 01.data/data_quality_stat.txt;rm -rf 01.data/fastp_stat.txt"
        #整理格式

#################################align#########################

rule bwa_index:
    input:
        genome = config["reference"]
    output:
        "{genome}.bwt",
    shell:
        "bwa index {input.genome}"

rule bwa_map:
    input:
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
        genome_index = expand("{genome}.bwt",genome=config["reference"])
    output:
        temp("02.align/{sample}_align.sam")
    params:
        genome = config["reference"],
    threads:16
    shell:
        "bwa mem -R \'@RG\\tID:foo\\tSM:bar\\tLB:Abace\' -t {threads} {params.genome} {input.R1} {input.R2} > {output}"
        # Note: special characters inside the double-quoted shell string (parentheses, quotes, % & etc.) can be misinterpreted and must be escaped with backslashes.

rule samtools_view:  # convert SAM to BAM
    input:
        "02.align/{sample}_align.sam"
    output:
        temp("02.align/{sample}.tmp.bam")
    shell:
        "samtools view -bS {input} > {output} "

rule samtools_sort:  # sort the BAM by coordinate
    input:
        "02.align/{sample}.tmp.bam"
    output:
        "02.align/{sample}.bam",
    shell:
        "samtools sort {input} -o {output} "


rule samtools_faidx:  # build the FASTA index (.fai)
    input:
        genome = config["reference"]
    output:
        "{genome}.fai"
    shell:
        "samtools faidx {input.genome}"

########################GATK_filter##################

rule GATK_dict:  # build the sequence dictionary (.dict)
    input:
        genome = config["reference"]
    output:
        dict = "02.align/ref/{genome_profix}.dict"
    shell:
        "gatk CreateSequenceDictionary -R  {input.genome} -O {output.dict}"

rule GATK_sort:  # re-sort the BAM with GATK; duplicates are marked in the next rule
    input:
        sample = "02.align/{sample}.bam",
    output:
        sort_bam = temp("02.align/{sample}_sort.bam")
    params:
        genome = config["reference"],
    shell:
        "gatk SortSam -I {input.sample} -O {output.sort_bam} -R {params.genome} -SO coordinate --CREATE_INDEX true"

rule GATK_dupli:
    input:
        sample = "02.align/{sample}_sort.bam",
    output:
        dup_bam = temp("02.align/{sample}_sort_dup.bam")
    params:
        genome = config["reference"]
    log:
        metrics = "02.align/log/{sample}_markdup_metrics.txt"
    shell:
        "gatk  MarkDuplicates -I {input.sample} -O {output.dup_bam} -M {log.metrics}"

rule samtools_index2:
    input:
        "02.align/{sample}_sort_dup.bam"
    output:
        "02.align/{sample}_sort_dup.bam.bai"
    shell:
        "samtools index {input}"

rule Calling:
    input:
        sample = "02.align/{sample}_sort_dup.bam",
        index = "02.align/{sample}_sort_dup.bam.bai"
    output:
        sample_vcf = temp("03.snp/{sample}.g.vcf.gz")
    params:
        genome = config["reference"]
    shell:
        "gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
# Note the -ERC parameter: for species with small genomes it is optional, but for large genomes it must be set, calling each chromosome separately with -L and merging the per-chromosome GVCFs afterwards (see the sketch after the workflow).
rule GVCF2VCF:  # genotype the GVCFs; the resulting VCF contains both SNPs and indels
    input:
        sample = "03.snp/{sample}.g.vcf.gz",
    output:
        sample_vcf = "03.snp/{sample}.vcf"
    params:
        genome = config["reference"]
    shell:        
        "gatk GenotypeGVCFs -R {params.genome} -V {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"

rule SNP_calling:  # keep only the SNP sites
    input:
        sample = "03.snp/{sample}.vcf",
    output:
        sample_vcf = "03.snp/{sample}_snp.vcf"
    params:
        genome = config["reference"]
    shell:                
        "gatk SelectVariants -R {params.genome} -V {input.sample} -O {output.sample_vcf} --select-type-to-include SNP"

rule VCF_stat:  # summarise the SNP sites
    input:
        snp_vcf = "03.snp/{sample}_snp.vcf"
    output:
        "03.snp/{sample}_stat.csv"
    shell:
        "python scripts/snp_frequency.py {input.snp_vcf}"

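As noted after the Calling rule, on large genomes HaplotypeCaller is usually run per chromosome with -L and the per-chromosome GVCFs are merged afterwards. The post does not include that step, so the following is only a minimal sketch of how it could look; it assumes a hypothetical chromosomes key in config.yaml listing the chromosome names and would replace the single Calling rule above.

rule Calling_by_chrom:  # call variants on one chromosome at a time (-L) to keep each job small
    input:
        sample = "02.align/{sample}_sort_dup.bam",
        index = "02.align/{sample}_sort_dup.bam.bai"
    output:
        temp("03.snp/{sample}.{chrom}.g.vcf.gz")
    params:
        genome = config["reference"]
    shell:
        "gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -L {wildcards.chrom} -O {output}"

rule Combine_GVCFs:  # merge the per-chromosome GVCFs back into one GVCF per sample
    input:
        gvcfs = expand("03.snp/{{sample}}.{chrom}.g.vcf.gz", chrom=config["chromosomes"])
    output:
        temp("03.snp/{sample}.g.vcf.gz")
    params:
        genome = config["reference"],
        vcfs = lambda wildcards, input: " ".join("-V " + f for f in input.gvcfs)
    shell:
        "gatk CombineGVCFs -R {params.genome} {params.vcfs} -O {output}"

Combine_GVCFs writes one merged GVCF per sample under the same name the original Calling rule produced, so GVCF2VCF and everything downstream can stay unchanged.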

Documentation for some of the statistics scripts

fastp_stat.py
snp_frequency.py
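
fastp_stat.py is only introduced in an earlier post. For orientation, here is a minimal sketch of what such a script could look like; it assumes the standard summary/after_filtering fields of the fastp JSON report and is meant to be run through Snakemake's script: directive, which makes the snakemake object available. It is an illustration, not the original script.

# scripts/fastp_stat.py -- minimal sketch (illustrative, not the original script)
import json
import os

# one tab-separated record per sample; the header line is added later by sort_fastp_stat
with open(snakemake.output[0], "w") as out:
    for json_path in snakemake.input:
        sample = os.path.basename(json_path)[:-len(".json")]  # "01.data/trim/S1.json" -> "S1"
        with open(json_path) as fh:
            report = json.load(fh)
        after = report["summary"]["after_filtering"]  # post-filtering metrics from fastp
        out.write("\t".join([
            sample,
            str(after["total_reads"]),
            str(after["total_bases"]),
            "{:.2f}%".format(after["q30_rate"] * 100),
            "{:.2f}%".format(after["gc_content"] * 100),
        ]) + "\n")

Because the header is prepended later by the sort_fastp_stat rule, the script itself only writes the data rows.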

Summary

This workflow has been shown to run end to end. It is not yet complete, though: the SNP sites could be filtered further, and indels could be summarised as well. A few files also have to be supplied, namely config.yaml, fastp_stat.py and snp_frequency.py. Some of them are covered in earlier posts, and I will upload the scripts and the software needed by the whole pipeline to GitHub. I hope you will test the workflow and send feedback on how it works for you; feel free to add me on WeChat (bbplayer2021, 木青) to join the group chat, with the note "applying to join the bioinformatics discussion group".
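
For reference, a minimal config.yaml consistent with the keys the Snakefile reads (samples, reference, profix) could look like the sketch below. The sample names and paths are placeholders; the reference is assumed to sit under 02.align/ref/ so that the .dict file written by GATK_dict ends up next to it, as GATK expects.

# config.yaml -- minimal sketch; sample names and paths are placeholders
samples:
  - sampleA
  - sampleB
# reference FASTA, kept under 02.align/ref/ so the .dict from GATK_dict sits beside it
reference: 02.align/ref/genome.fa
# basename of the reference, used for the .dict output (key name kept as in the Snakefile)
profix: genome
# chromosomes: [chr1, chr2, chr3]    # only needed for the per-chromosome calling sketch above

With the config in place, the whole workflow runs with the usual command, e.g. snakemake --cores 16 from the directory that holds the Snakefile.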
