2021.07.30【WGS/GWAS】丨全基因组分析全流程(上)

置顶穆易青

已于 2022-03-25 11:45:19 修改

阅读量4.3k

点赞数 5

分类专栏：基因组 GWAS 变异检测文章标签： python

于 2021-07-30 17:05:54 首次发布

本文链接：https://blog.csdn.net/yangl7/article/details/119250723

版权

基因组同时被 3 个专栏收录

24 篇文章 22 订阅

订阅专栏

GWAS

9 篇文章 30 订阅

订阅专栏

变异检测

9 篇文章 8 订阅

订阅专栏

摘要

时隔半年，终于把WGS前面的分析用snakemake搭建好了。读者不要嫌我慢，确实是项目不多，流程也不算特别复杂。之前的shell脚本也能用，因此迟迟没有真正搭建。现在项目慢慢多了，考虑到提升工作效率，趁着前几天做了2个WGS的项目，把这个流程梳理出来。

命令行

#vim: set syntax=python

#__author__ = "Yang Xin"
#__copyright__ = "Copyright 2021, Wang lab"
#__email__ = "491861610@qq.com"
#__license__ = "SAU"

"""
A WGS analysis workflow using trim_galore, BWA, Samtools and GATK.
"""

configfile:"config.yaml"

import yaml

rule all:
    input:
        expand("03.snp/{sample}_stat.csv", sample=config["samples"]),#比较组的突变输出表格
        expand("{genome}.bwt", genome=config["reference"]),#参考基因组建立索引，bwa比对需要
        expand("{genome}.fai", genome=config["reference"]),#参考基因组建立索引，GATK比对需要
        expand("02.align/ref/{genome_profix}.dict", genome_profix=config["profix"]),#参考基因组建立索引，GATK比对需要
        "01.data/data_quality_stat.txt",#质控统计
#############################quality_control######################

rule trim_galore:
    input:
        R1 = "01.data/{sample}_1.fq.gz",
        R2 = "01.data/{sample}_2.fq.gz"
    output:
        dir = directory("01.data/trim/{sample}"),#输出样品质控文件夹
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
    threads:8
    shell:
        "trim_galore -j {threads} -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o {output.dir} --fastqc {input.R1} {input.R2} "

rule fastp:
    input:
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz"
    output:
        temp("01.data/trim/{sample}.json")#生成临时文件.json，下一个rule使用后删除。
    threads:16
    shell:
        "fastp -w {threads} -i {input.R1} -I {input.R2} -j {output} "

rule fastp_stat:
    input:
        expand("01.data/trim/{sample}.json",sample=config["samples"])
    output:
        "01.data/fastp_stat.txt"
    script:
        "scripts/fastp_stat.py"
        #根据json文档统计所有样品的质控信息，往期文章有。

rule sort_fastp_stat:
    input:
        "01.data/fastp_stat.txt"
    output:
        "01.data/data_quality_stat.txt"
    shell:
        "sort 01.data/fastp_stat.txt > 01.data/data_quality_stat.txt;sed -i '1i Sample_ID       Clean_reads     Clean_bases     ≥Q30    GC_Content' 01.data/data_quality_stat.txt;rm -rf 01.data/fastp_stat.txt"
        #整理格式

#################################align#########################

rule bwa_index:
    input:
        genome = config["reference"]
    output:
        "{genome}.bwt",
    shell:
        "bwa index {input.genome}"

rule bwa_map:
    input:
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
        genome_index = expand("{genome}.bwt",genome=config["reference"])
    output:
        temp("02.align/{sample}_align.sam")
    params:
        genome = config["reference"],
    threads:16
    shell:
        "bwa mem -R \'@RG\\tID:foo\\tSM:bar\\tLB:Abace\' -t {threads} {params.genome} {input.R1} {input.R2} > {output}"
        #注意，双引号里面的特殊符号（()''%&等）会产生歧义，需要使用反斜杠符号进行注释。

rule samtools_view:#sam2bam
    input:
        "02.align/{sample}_align.sam"
    output:
        temp("02.align/{sample}.tmp.bam")
    shell:
        "samtools view -bS {input} > {output} "

rule samtools_sort:#排序
    input:
        "02.align/{sample}.tmp.bam"
    output:
        "02.align/{sample}.bam",
    shell:
        "samtools sort {input} -o {output} "


rule samtools_faidx:#建立索引
    input:
        genome = config["reference"]
    output:
        "{genome}.fai"
    shell:
        "samtools faidx {input.genome}"

########################GATK_filter##################

rule GATK_dict:#建立索引
    input:
        genome = config["reference"]
    output:
        dict = "02.align/ref/{genome_profix}.dict"
    shell:
        "gatk CreateSequenceDictionary -R  {input.genome} -O {output.dict}"

rule GATK_sort:#使用GATK对bam文件重新排序，去重。
    input:
        sample = "02.align/{sample}.bam",
    output:
        sort_bam = temp("02.align/{sample}_sort.bam")
    params:
        genome = config["reference"],
    shell:
        "gatk SortSam -I {input.sample} -O {output.sort_bam} -R {params.genome} -SO coordinate --CREATE_INDEX true"

rule GATK_dupli:
    input:
        sample = "02.align/{sample}_sort.bam",
    output:
        dup_bam = temp("02.align/{sample}_sort_dup.bam")
    params:
        genome = config["reference"]
    log:
        metrics = "02.align/log/{sample}_markdup_metrics.txt"
    shell:
        "gatk  MarkDuplicates -I {input.sample} -O {output.dup_bam} -M {log.metrics}"

rule samtools_index2:
    input:
        "02.align/{sample}_sort_dup.bam"
    output:
        "02.align/{sample}_sort_dup.bam.bai"
    shell:
        "samtools index {input}"

rule Calling:
    input:
        sample = "02.align/{sample}_sort_dup.bam",
        index = "02.align/{sample}_sort_dup.bam.bai"
    output:
        sample_vcf = temp("03.snp/{sample}.g.vcf.gz")
    params:
        genome = config["reference"]
    shell:
        "gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
#注意-ERC参数，小基因组的物种可加可不加，大基因组一定要加上，并且使用-L按照染色体号分别注释，之后再进行合并。
rule GVCF2VCF:#寻找变异位点，这里的结果包括了snp和indel
    input:
        sample = "03.snp/{sample}.g.vcf.gz",
    output:
        sample_vcf = "03.snp/{sample}.vcf"
    params:
        genome = config["reference"]
    shell:        
        "gatk GenotypeGVCFs -R {params.genome} -V {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"

rule SNP_calling:#筛选出snp位点
    input:
        sample = "03.snp/{sample}.vcf",
    output:
        sample_vcf = "03.snp/{sample}_snp.vcf"
    params:
        genome = config["reference"]
    shell:                
        "gatk SelectVariants -R {params.genome} -V {input.sample} -O {output.sample_vcf} --select-type-to-include SNP"

rule VCF_stat:#对snp位点进行统计
    input:
        snp_vcf = "03.snp/{sample}_snp.vcf"
    output:
        "03.snp/{sample}_stat.csv"
    shell:
        "python scripts/snp_frequency.py {input.snp_vcf}"

部分统计脚本介绍文档

fastp_stat.py
snp_frequency.py

总结

这个流程已经被证明是可以跑通的。当然，流程的完整度还是不够，比如snp位点还可以再进行筛选，或者把indel也统计出来。同时，里面还有一些文件需要提供，比如config.yaml, fastp_stat.py, snp_frequency.py。有些文档可以在往期文章看到，有些脚本我会把整个流程需要的软件上传到github上面。希望大家能够测试这个流程并反馈使用建议，欢迎大家加VX：bbplayer2021 （木青）进群交流，备注申请加入生信交流群。

穆易青

关注

5
点赞
踩
34

收藏

觉得还不错? 一键收藏
打赏
5
评论
2021.07.30【WGS/GWAS】丨全基因组分析全流程(上)

目录摘要命令行三级目录摘要时隔半年，终于把WGS前面的分析用snakemake搭建好了。读者不要嫌我慢，确实是项目不多，流程也不算特别复杂。之前的shell脚本也能用，因此迟迟没有真正搭建。现在项目慢慢多了，考虑到提升工作效率，趁着前几天做了2个WGS的项目，把这个流程梳理出来。命令行#vim: set syntax=python#__author__ = "Yang Xin"#__copyright__ = "Copyright 2021, Wang lab"#__email__ = "4
复制链接

扫一扫