snakemake[2:tutorial-advanced]

M型无头脑玩家

已于 2022-12-02 18:46:40 修改

阅读量396

点赞数

分类专栏： snakemake学习文章标签： linux conda bash

于 2022-12-01 17:49:48 首次发布

本文链接：https://blog.csdn.net/qq_44650830/article/details/128130195

版权

snakemake学习专栏收录该内容

3 篇文章 0 订阅

订阅专栏

官方文档的进阶功能

1 指定使用的线程数

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

在代码块中设置threads，并利用大括号传递
前提是：bwa支持线程数的调用
同时snakemake命令中支持设置CPU内核的数量

snakemake --cores 10

--cores [N], --jobs [N], -j [N]
                        Use at most N cores in parallel (default: 1). If N is
                        omitted, the limit is set to the number of available
                        cores.

指定内核的数量优先级大于threads数量。

※使用--forceall可以强制重新执行工作流

  --forceall, -F        Force the execution of the selected (or the first)
                        rule and all rules it is dependent on regardless of
                        already created output.

2 配置文件

可以用JSON或者YAML编写
给出的示例是YAML，给出YAML的基本语法：

语法规则：类似于python
数据组成：
	1.map对象
	键值对，基本格式为--->键：（空格）值
	eg：
	aaa:
	  name: aaa
	  age: 111
	<===>aaa: {name: aaa,age: 111}
	
	2.list对象
	以-开头
	eg:
	-
	 aaa:
	   - nanme: aaa
	   - age: 111
	-
	 bbb:
	   - name: bbb
	   - age: 222	 
 	<===>
 	-
 	 aaa: [{name: aaa},{age: 111}]
 	-
 	 bbb: [{name: bbb},{age: 222}]

创建yaml配置文件：

vim config.yaml

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq

在任务分配前Snakemake将读取该文件并将里面的内容保存为字典格式存入到名为config的全局变量中

print(config["samples"])
===>{'A': 'data/samples/A.fastq', 'B': 'data/samples/B.fastq'}
print(expand("sorted_reads/{sample}.bam", sample=config["samples"]))
===>['sorted_reads/A.bam', 'sorted_reads/B.bam']
#expand 函数可指定替换键值对中的键？？？

修改前：

SAMPLES = ["A", "B"]
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

修改后：

configfile: "config.yaml"
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "(bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.rate} - > {output}) 2> {log}"

3 输入函数

Snakemake的工作流程：

初始化阶段：将解析定义工作流的文件并实例化所有规则。
DAG阶段：通过填充通配符并将输入文件与输出文件匹配来构建所有作业的有向无环依赖关系图。
计划阶段：将执行作业的DAG，并根据可用资源启动作业。

本例的执行顺序：

bwa_map
samtools_sort
samtools_index
bcftools_call
plot_quals

规则bcftools_call的输入文件列表中的expand函数会在初始化阶段执行。我们不知道作业、通配符值和规则依赖关系。因此，我们无法从配置文件中确定规则bwa_map的FASTQ路径，因为我们甚至不知道将从该规则生成哪些作业。
相反，我们需要***将输入文件的确定推迟到DAG阶段***。这可以通过在输入指令内指定输入函数而不是字符串来实现。
所以函数是在第二个阶段进行实现的。
对于规则bwa_map，其工作方式如下：

回忆：wildcards用来访问通配符
函数的第一个参数必须为 wildcards，还可以接受四个参数 input、 output、threads 和 resources，这几个参数的位置可以任意顺序。

def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
...

[wildcards.sample]的sample必须在任务中出现
通配符wildcards是在shell字段中出现的
有点看不懂参考了b站教学

将函数作为输入：
输入文件可以是单个字符串或字符串列表的形式，还可以使用返回值是一个或多个输入文件的函数，例如，使用通配符获取输入文件

def myfunc(wildcards):
    return [... a list of input files depending on given wildcards ...]

rule:
    input:
        myfunc
    output:
        "someoutput.{somewildcard}.txt"
    shell:
        "..."

config[“samples”]是一个字典
config[“samples”][wildcards.sample]可以获得字典的值

print ("{}".format(config["samples"]["A"]))
===>data/samples/A.fastq

wildcards.sample就是"A"和"B"？？？？？？？

configfile: "config.yaml"

def get_bwa_map_input_fastqs(wildcards):
    return config["samples"][wildcards.sample]

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

运行上述命令发生报错：Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
虽然我们知道通配符代表了我们将要输入输出文件的命名范式，但snakemake 并不知道对应哪些文件。因此，这时候我们就需要显式的去指定输出的文件了

snakemake -np -s S1 mapped_reads/A.bam
snakemake -np -s S1 mapped_reads/B.bam
snakemake -np -s S1 mapped_reads/{A,B}.bam
# 都可以运行

根据输出看输入在到shell中进行命令的执行吗？
但整个流程也没有指定AB啊为什么不需要指定直接就可以运行呢？

在整个流程中存在以下代码：
其中利用expand直接访问了config在初始化阶段。按按照总体的运行顺序，先在bcftools中获得了sample然后在到bwa_map中自动匹配。

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    params:
        rate=config["prior_mutation_rate"]
    log:
        "logs/bcftools_call/all.log"
    shell:
        "(bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv -P {params.rate} - > {output}) 2> {log}"

4规则参数

params定义执行命令是用到的额外参数

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    threads: 8
    shell:
        "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"

5 日志记录

每个规则也可以使用 log 关键字定义日志文件

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"

6临时文件和受保护的文件

可以将输出文件标记为temp文件。一旦所有消耗作业（需要它作为输入）都已执行， Snakemake将为您删除标记的文件。
一些比较耗时且大型的文件，我们可能想保护它免受意外删除或覆盖，可以使用 protected 函数

rule bwa_map:
    input:
        "data/genome.fa",
        get_bwa_map_input_fastqs
    output:
        temp("mapped_reads/{sample}.bam")
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    log:
        "logs/bwa_mem/{sample}.log"
    threads: 8
    shell:
        "(bwa mem -R '{params.rg}' -t {threads} {input} | "
        "samtools view -Sb - > {output}) 2> {log}"
        
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"