Building Pipelines with Snakemake: Practical Essentials

心如止水-WTF

Tags: python linux shell

Article link: https://blog.csdn.net/qq_28723681/article/details/107808675

Contents

1. Introduction to Snakemake
2. Installing Snakemake
3. Snakemake command-line options
4. Using Snakemake
References

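1. Introduction to Snakemake

I recently came across a pipeline built with Snakemake, got curious, and spent some time learning it. Here are my notes and takeaways for anyone who wants to get started quickly. To respect your patience, this post sticks to the essentials (code examples).

2. Installing Snakemake

Snakemake is written in Python, and installing it with conda is very convenient.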

# Install Miniconda3
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  # download the Miniconda3 installer
bash Miniconda3-latest-Linux-x86_64.sh  # run the downloaded installer to install Miniconda3
wget https://github.com/snakemake/snakemake-tutorial-data/archive/v5.4.5.tar.gz  # download the archive containing the environment.yaml needed to install Snakemake
tar -xf v5.4.5.tar.gz --strip 1  # unpack
conda env create --name snakemake --file environment.yaml  # create the snakemake conda environment from environment.yaml; conda installs everything Snakemake depends on
conda activate snakemake  # activate the snakemake environment
snakemake --help  # if the help text is printed, the installation succeeded
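
The final check can also be scripted. The following is a minimal sketch, assuming only that the `snakemake` executable ends up on PATH once the environment is activated; the helper name `snakemake_version` is made up for illustration.

```python
import shutil
import subprocess


def snakemake_version(executable="snakemake"):
    """Return the output of `<executable> --version`, or None if it is unavailable."""
    path = shutil.which(executable)
    if path is None:
        # Not on PATH: the conda environment is probably not activated.
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = snakemake_version()
    print(version or "snakemake not found; try `conda activate snakemake` first")
```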

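3. Snakemake command-line options

There are far too many options to cover one by one; see https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options for the full list. Only the commonly used ones are introduced below. Skim them now and look up the details when you need them.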
usage: snakemake \

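Option	Description
-h	Print the help message.
--dry-run, --dryrun, -n	Do not execute anything; only show what would be done. For a very large workflow, --dry-run --quiet prints just a summary.
--profile	The profile used to configure Snakemake.
--snakefile, -s	The workflow definition file. By default, Snakemake searches the current working directory, in order, for "Snakefile", "snakefile", "workflow/Snakefile", and "workflow/snakefile". Use this option if your file is named differently or lives elsewhere.
--cores, --jobs, -j	Use at most N CPU cores/jobs in parallel. If N is omitted or set to "all", the limit is the number of available CPU cores.
--config, -C	Set or overwrite values in the workflow config object, which is accessible inside the workflow as the variable config. Default values can be set by providing a JSON file (see the documentation).
--rerun-incomplete, --ri	Re-run all jobs that did not complete.
--restart-times	Number of times to restart failing jobs (default 0).
--latency-wait, --output-wait, -w	Wait the given number of seconds if a job's output files are not present after the job has finished. This helps if your filesystem suffers from latency (default 5).
--keep-going, -k	Go on with independent jobs if a job fails.
--dag	Do not execute anything; print the directed acyclic graph (DAG) of jobs.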
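
To show how these options combine on one command line, here is a small sketch that assembles an argument list from them. The helper `build_snakemake_cmd` and the file names are hypothetical; only flags listed in the table are used.

```python
import shlex


def build_snakemake_cmd(snakefile=None, cores=1, dry_run=False, keep_going=False,
                        rerun_incomplete=False, restart_times=0,
                        latency_wait=None, config=None):
    """Assemble a snakemake argument list from the options in the table above."""
    cmd = ["snakemake", "--cores", str(cores)]
    if snakefile:
        cmd += ["--snakefile", snakefile]
    if dry_run:
        cmd.append("--dry-run")
    if keep_going:
        cmd.append("--keep-going")
    if rerun_incomplete:
        cmd.append("--rerun-incomplete")
    if restart_times:
        cmd += ["--restart-times", str(restart_times)]
    if latency_wait is not None:
        cmd += ["--latency-wait", str(latency_wait)]
    if config:
        # --config takes KEY=VALUE pairs; it is placed last in this sketch.
        cmd.append("--config")
        cmd += [f"{key}={value}" for key, value in config.items()]
    return cmd


cmd = build_snakemake_cmd(snakefile="workflow/Snakefile", cores=4, dry_run=True,
                          config={"samples": "samples.tsv"})
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`; starting with `dry_run=True` is a cheap way to check a workflow before committing compute time.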

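4. Using Snakemake

4.1 Defining a workflow

Snakemake has its own rule syntax. A rule defines the key information for one step: its name, inputs, outputs, command line, compute resources, and so on. This syntax is the core of Snakemake.
The annotated example below illustrates it: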

# Syntax overview of the available directives; somecommand is a placeholder.
rule snakemake_tutorial:   # define a rule (task) named snakemake_tutorial
    input:    # inputs
        "{dataset}/inputfile",
        expand("{dataset}/a.txt", dataset=DATASETS)  # the dataset values come from DATASETS
    params:
        prefix="somedir/{dataset}"  # extra rule-specific parameters, when needed
    output:   # outputs
        "{dataset}/file.{group}.txt",
        multiext("some/plot", ".pdf", ".svg", ".png")  # several outputs that differ only in extension
    wildcard_constraints:
        dataset=r"\d+"  # constrain a wildcard's format; here dataset must be one or more digits
    threads: 8  # use 8 threads
    resources: mem_mb=100  # request 100 MB of memory
    message: "Executing somecommand with {threads} threads on the following files {input}."  # short status message
    priority: 10  # scheduling priority
    conda: "envs/conda.yaml"  # conda environment (dependencies) required by this rule
    log: log1="logs/abc.log", log2="logs/xyz.log"  # log files
    version: "1.0"  # version information
    shell:
        """
        somecommand --group {wildcards.group} < {input} > {output}
        somecommand --threads {threads} {input} {output}
        somecommand --log {log.log1} METRICS_FILE={log.log2} {input} {output}
        """

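To make the two helper calls in the rule less opaque, here is a plain-Python sketch of what `expand` and `multiext` produce. The functions `expand_sketch` and `multiext_sketch` are simplified stand-ins written for this post, not Snakemake's own implementations, and the `DATASETS` values are invented.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values.

    Simplified stand-in: each value must be a list of strings.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


def multiext_sketch(prefix, *extensions):
    """Append each extension to the same path prefix."""
    return [prefix + ext for ext in extensions]


DATASETS = ["1", "2"]  # assumed example values
print(expand_sketch("{dataset}/a.txt", dataset=DATASETS))
# ['1/a.txt', '2/a.txt']
print(multiext_sketch("some/plot", ".pdf", ".svg", ".png"))
# ['some/plot.pdf', 'some/plot.svg', 'some/plot.png']
```

Both helpers simply return lists of file names, which is why they can be mixed freely with plain strings in `input:` and `output:`.
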
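4.2 Configuration

Snakemake supports configuration files to make workflows more flexible. Global values such as sample names, software paths, and parameters can be set through a config file, usually written in JSON or YAML. A tab-separated file also works, but it has to be read with pandas.read_table.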
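
As an illustration of the two file types, the sketch below writes a small JSON config and a tab-separated sample sheet, then reads both back. The file names, keys, and columns are invented for the example, and the standard-library `csv` module stands in for `pandas.read_table` so the snippet has no dependencies. Inside a Snakefile, `configfile: "config.json"` would expose the same dictionary as `config`.

```python
import csv
import json
import os
import tempfile


def load_config(path):
    """Read a JSON config file into a plain dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_sample_table(path):
    """Read a tab-separated sample sheet into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


with tempfile.TemporaryDirectory() as workdir:
    config_path = os.path.join(workdir, "config.json")
    samples_path = os.path.join(workdir, "samples.tsv")
    # Hypothetical config: path to the sample sheet plus one tool parameter.
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump({"samples": samples_path, "bwa_threads": 8}, handle)
    with open(samples_path, "w", encoding="utf-8") as handle:
        handle.write("sample\tfastq\nA\tsamples/A.fastq\nB\tsamples/B.fastq\n")

    config = load_config(config_path)
    samples = load_sample_table(config["samples"])
    sample_names = [row["sample"] for row in samples]

print(sample_names)
```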

Environment variables the workflow depends on can additionally be declared with the envvars: directive:

envvars:
    "SOME_VARIABLE"

rule do_something:
    output:
        "test.txt"
    params:
        x=os.environ["SOME_VARIABLE"]
    shell:
        "echo {params.x} > {output}"

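The point of declaring variables up front is, as far as I understand it, to fail early when one is missing rather than halfway through a run. A rough plain-Python equivalent of that check is sketched below; `require_envvars` is a made-up helper, not part of Snakemake.

```python
import os


def require_envvars(*names):
    """Return the named environment variables, raising if any is undefined."""
    missing = [name for name in names if name not in os.environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


os.environ["SOME_VARIABLE"] = "hello"  # set here only so the example runs
values = require_envvars("SOME_VARIABLE")
print(values["SOME_VARIABLE"])
```
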
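4.3 Example run

Taking the workflow from the official documentation as an example (bwa -> sort -> call -> stats), the complete workflow (Snakefile) looks like this. If all the required conda environments (Python dependencies) are already installed, the conda directives can be ignored.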

SAMPLES = ["A", "B"]

rule all:
    input:
        "plots/quals.svg"
rule bwa_map:
    input:
        fastq="samples/{sample}.fastq",
        idx=multiext("genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
    conda:
        "environment.yml"
    output:
        "mapped_reads/{sample}.bam"
    params:
        idx=lambda w, input: os.path.splitext(input.idx[0])[0]
    shell:
        "bwa mem {params.idx} {input.fastq} | samtools view -Sb - > {output}"
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    conda:
        "environment.yml"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    conda:
        "environment.yml"
    shell:
        "samtools index {input}"
rule bcftools_call:
    input:
        fa="genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    conda:
        "environment.yml"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    conda:
        "environment.yml"
    script:
        "plot-quals.py"

Once the workflow is defined, run it:

snakemake --google-lifesciences --default-remote-prefix snakemake-testing-data --use-conda --google-lifesciences-region us-west1

Snakemake automatically picks up the Snakefile in the current directory (the rules written above) and runs it. Note that all paths in the rules above are relative.
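
To see why asking for plots/quals.svg is enough to trigger the whole chain, here is a toy resolver that matches a target file against each rule's output pattern and recurses into its inputs. It is a sketch of the idea only, not Snakemake's scheduler: the rule table is copied by hand from the Snakefile above, the genome index inputs of bwa_map are left out for brevity, and `rule all` is skipped because it only names the final target.

```python
import re

SAMPLES = ["A", "B"]

# (rule name, output pattern, input patterns), transcribed from the Snakefile.
RULES = [
    ("plot_quals", "plots/quals.svg", ["calls/all.vcf"]),
    ("bcftools_call", "calls/all.vcf",
     ["genome.fa"]
     + [f"sorted_reads/{s}.bam" for s in SAMPLES]
     + [f"sorted_reads/{s}.bam.bai" for s in SAMPLES]),
    ("samtools_index", "sorted_reads/{sample}.bam.bai", ["sorted_reads/{sample}.bam"]),
    ("samtools_sort", "sorted_reads/{sample}.bam", ["mapped_reads/{sample}.bam"]),
    ("bwa_map", "mapped_reads/{sample}.bam", ["samples/{sample}.fastq"]),
]


def pattern_to_regex(pattern):
    """Turn 'dir/{sample}.bam' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex)


def plan(target, jobs=None):
    """Return (rule, output) jobs in an order that satisfies all dependencies."""
    if jobs is None:
        jobs = []
    for name, output, inputs in RULES:
        match = pattern_to_regex(output).fullmatch(target)
        if match is None:
            continue
        for dep in inputs:
            # Reuse the wildcard values inferred from the target for the inputs.
            plan(dep.format(**match.groupdict()), jobs)
        if (name, target) not in jobs:
            jobs.append((name, target))
        return jobs
    return jobs  # no rule produces it: treated as an existing source file


jobs = plan("plots/quals.svg")
for rule_name, output_file in jobs:
    print(f"{rule_name:15s} -> {output_file}")
```

The sample wildcard is never set explicitly: it is inferred by matching the requested file name against the output pattern, which is the mechanism that lets one rule serve both samples.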

References

Official Snakemake documentation
