使用SnakeMake搭建生信流程——02rule书写、config.yaml文件以及常见错误及解决方法

彼岸花128

已于 2022-07-27 10:37:18 修改

阅读量4.7k

点赞数 5

分类专栏： snakemake流程学习文章标签： python 生物信息学

于 2022-07-07 16:34:26 首次发布

本文链接：https://blog.csdn.net/weixin_44616693/article/details/125462010

版权

snakemake流程学习专栏收录该内容

3 篇文章 3 订阅

订阅专栏

snakemake运行

Snakemake工作流程执行的一些原理，它执行分为三个阶段

在初始化阶段，工作流程会被解析，所有规则都会被实例化。
在DAG阶段，也就是生成有向无环图，确定依赖关系的时候，所有的通配符部分都会被真正的文件名代替。
在调度阶段，DAG的任务按照顺序执行。

1. 一个rule种可以有的要素

input：输入文件
output：输出文件或输出目录

output:
	"/home/data/result.bam"
	directory("/home/data/")

log：日志文件（你也可以在output里面指定一个log文件）
params：参数，仅在该rule中使用
shell：需要执行的linux命令。如果有一个shell语句过长或使用多个shell语句，方法很多，列出我的习惯

shell:（单个shell语句）
	" fastp --disable_adapter_trimming -q 20 -u 50 --length_required 150 -w 16 \
	--dedup -R {params.name}_report -j 01cleandata/{params.name}.json -h 01cleandata/{params.name}.html \
	-i {input[0]} -o {output[0]} -I {input[1]} -O {output[1]}"
shell:（多个shell语句）
	"jellyfish count -m 21 -C -s 1G -t 32 {input.cleanread_1} {input.cleanread_2} -o {output.JF} | tee -a {log} && \
	jellyfish histo -t 10 {output.JF} -o {output.HISTO} | tee -a {log} && \
	jellyfish stats {output.JF} -o {output.STAT} | tee -a {log}"
shell:（多个shell语句）
	"jellyfish count -m 21 -C -s 1G -t 32 {input.cleanread_1} {input.cleanread_2} -o {output.JF} | tee -a {log}"
	"jellyfish histo -t 10 {output.JF} -o {output.HISTO} | tee -a {log}"

run：需要执行的python命令
threads：线程数
conda：指定conda的yaml文件

conda:
	"/home/zhaohuiyao/miniconda3/yaml/GenomeScope.yaml"
#有这个要素，那么在执行snakemake时，需添加参数--use-conda。snakemake在运行时会先下载需要的所有conda环境
snakemake -s snakemake.py -p -c 8 --use-conda

resources：执行程序运行的内存等信息
每一个要素可以写python语句。例如，我要用format赋值

CONDA_DIR=config["conda_dir"]
conda:
	"{}/GenomeScope.yaml".format( CONDA_DIR)

2. snakemake文件内容解读

snakemake是基于python的软件，所以python语句在该文件中也会被执行。

import os
workdir: config["workdir"]	
SOAP_CONFIG_FILE=config["soap_config"]			#定义变量

os.system('cp {0} ./soap_config_file'.format( SOAP_CONFIG_FILE))   
#snakemake文件在运行时就会执行这句python语句。你会发现在python标准语句中，调用变量时直接使用变量名

snakemake文件开头通常会定义的各种变量
①在后面的python语句中，直接使用变量名调用该变量。
②如果在rule中的shell要素中使用，则需要{变量名}，否则不能识别。
③如果在rule中的input/output/log要素中使用，你可能需要expand()函数。

SOAPDENOVO2_DIR=config["soapdenovo2_dir"]
SOAP_NAME=config["soapname"]
print ("{}\n{}".format( SOAPDENOVO2_DIR, SOAP_NAME))		#输出SOAPDENOVO2_DIR和SOAP_NAME

rule SOAPdenovo2:
	output:
		expand("03assemble/{name}_SOAPdenovo2.log",name=SOAP_NAME)
		#这个时候name就是wildcards（通配符），后面name=任一变量名。
	shell:
		"{SOAPDENOVO2_DIR}SOAPdenovo-63mer all -s ./soap_config_file -K 21 -R -F -o {output[0]}/{SOAP_NAME} 1>{log[0]} 2>{log[1]}"

3. 写一个config.yaml文件，作为snakemake文件的配置文件

confi.yaml文件需要在snakemake文件中指明

configfile: "config.yaml"    		  #表示config.yaml在snakemake所在目录下
configfile: "./config/config.yaml"    #表示config.yaml在snakemake所在目录的config目录下

snakemake文件指明工作路径

workdir: config["workdir"]			  

#config.yaml文件中					  #默认工作路径是snakemake目录
workdir: ./testresults/			      #配置工作路径是snakemake所在目录的testresults目录

config.yaml文件中的参数内容可以在config.yaml文件中提供，也可以通过命令行提供

#假设在config.yaml文件中用out_dir指明输出目录。之后在snakefile中用config["out_dir"]调用
out_dir: /home/zhaohuiyao/Genome_survey/Snakemake/
#也可以在config.yaml文件中不指明。在执行snakefile时
snakefile -s snakefile.py -p -c 8 --use-conda --config out_dir=/home/zhaohuiyao/Genome_survey/Snakemake/

config.yaml文件被snakemake读入时是一个字典形式。

#config.yaml文件。下面是三种经典变量
kmername: "21kmer_shuxi"
fastp_dir: /home/zhaohuiyao/Biosoft/
samples:
        test_L1:
                ["../testdata/test_L1_1.fq","../testdata/test_L1_2.fq"]
        test_L2:
                ["../testdata/test_L2_1.fq","../testdata/test_L2_2.fq"]
adaptor_seq: ["GATCGGAAGAGCACACGTCTGAACTCCAGTCACACGAGAATGCATCTCGTATGCCGTCTTCTGCTTG","AGATCGGAAGAGCGTCGTGTAGGGAAAGA"]

#因为config.yaml文件被读入时，是字典。字典就有key值和对应的value值。上面的文件中，key值分别是kmername、fastp_dir、samples、adaptor_seq。那么对应的value值表示就是config["kmername"]这样
#snakemake文件

NAME=config["kmername"]				#NAME变量是一个字符串，内容是21kmer_shuxi
FASTP_DIR=config["fastp_dir"]		#FASTP_DIR变量是一个字符串，内容是/home/zhaohuiyao/Biosoft/。因此这个引号""，在只有一个变量是可有可无
SAMPLES=config["samples"]			#SAMPLES变量是一个字典，有两个key值，分别是test_L1和test_L2，有各自的value值
SAMPLES=config["samples"].keys()	#SAMPLES变量是一个字典的keys信息列表。仅包括test_L1和test_L2
ADAPTORS=config["adaptor_seq"]		#ADAPTORS变量是一个列表，长度为2。在之后的shell中调用，{ADAPTORS[0]}和{ADAPTORS[1]}。若在python语句中则只需要ADAPTORS[0]即可。

4. 重点来了，在一个snakemake种需要对一个rule执行多次。这时你可能需要理解下面的内容。

#config.yaml文件信息
samples:
        test_L1:
                ["../testdata/test_L1_1.fq","../testdata/test_L1_2.fq"]
        test_L2:
                ["../testdata/test_L2_1.fq","../testdata/test_L2_2.fq"]

SAMPLES=config["samples"].keys()
print ("{}".format( SAMPLES))	
#dict_keys(['test_L1', 'test_L2'])，SAMPLES变量是一个字典的keys信息列表，但不是列表，不能用下标。仅包括test_L1和test_L2

rule cutadapt:
	input:
		lambda: wildcards: config["samples"][wildcards.sample]		
		#定义一个函数，参数是wildcards，返回值config["samples"][wildcards.sample]
		#lambda不需要return函数。
		#config["samples"]表示samples字典。它有两个key值，我们放在SAMPLES变量中保存。而在文中其他地方也表示了通配符sample表示SAMPLES变量内容。
		#那么拿到value值，就需要config["samples"][wildcards.sample]
		#如果你想具体查看，你可以print ("{}".format( config["samples"]["test_L1"]))，你会得到["../testdata/test_L1_1.fq","../testdata/test_L1_2.fq"]
		#sample表示通配符，wildcards.sample表示通配符内容。{wildcards.sample}用在shell中。[wildcards.sample]这是字典表示的方法
	output:
		"01cleandata/{sample}_1_cutadapt.fq",
		"01cleandata/{sample}_2_cutadapt.fq"
	log:
		"01cleandata/{sample}_cutadapt.log"
	threads:
		32
	conda:
		"conda_yaml_files/cutadapt.yaml"
	shell:
		"cutadapt -a {ADAPTORS[0]} -A {ADAPTORS[1]} --pair-filter=any --discard-trimmed --max-n=0.05 -o {output[0]} -p {output[1]} {input[0]} {input[1]} | tee -a {log}"

rule kmer:
	input:
        cleanread_1=expand("01cleandata/{sample}_1_clean.fq",sample=SAMPLES),
        cleanread_2=expand("01cleandata/{sample}_2_clean.fq",sample=SAMPLES)

通过上述例子，你需要着重理解{sample}，{wildcards.sample}，[wildcards.sample]，SAMPLES的区别。
一个小知识点，snakemake在第二步DAG阶段，才会有wildcards这个变量，因此上面有一个函数定义。但是如果只在shell中用{wildcards.sample}，则不会报错，若你用到了[wildcards.sample]，若是没有函数定义，则会报错。

建议

在书写snakemake中，需要多次修改的地方，尽量用变量，或者多次被使用的软件或文件，也用变量表示
多多练习，碰的错误多了，才能总结经验

3. 报错

下面给出了报错类型和解决方法，如果仍无法解决，请原谅。当然也可以和我讨论

1. invalid syntax

1. SyntaxError in line 85 of /home/zhaohuiyao/snakemake/Genome_survey/snakemake.py

1. TokenError:(‘EOF in multi-line statement’, (113, 0))

检查书写格式。例如缩格，花括号，圆括号，方括号，引号等使用不完整或错误

2. EOL while scanning string literal

这是shell书写不规范导致的错误。下面举了例子，仔细看，和自己的shell书写进行比较

shell:
	"mkdir -p /home/test/ && cd /home/test/"	#正确
	"mkdir -p /home/test/ && \
	cd /home/test/"								#正确
	"mkdir -p /home/test/ \
	cd /home/test/"								#错误
	"mkdir -p /home/test/ && \
	cd /home/test/
	"											#错误（最常犯的宠物）

3. Not all output, log and benchmark files of rule kmer contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.

这是你的output、log要素中使用的通配符不相同。snakemake要求你在output、log要素中使用的通配符必须一致，不能少也不能多。

4. MissingRuleException:

No rule to produce snakemake (if you use input functions make sure that they don’t raise unexpected exceptions).
这是rule all的问题，对rule all的内容进行查看。rule all中通常只使用一个要素input。
rule all，告诉snakemake，你需要输出的结果文件或目录。snakemake会为了拿到该文件或目录，执行需要的所有rule。
snakmake是倒推，你需要这个文件，那么它会在所有的rule里面找，这个文件作为output的rule，然后查看该rule的input，在去找下一个rule，直到input确定）。因此不是按照从上到下依次执行，而是找逻辑。

5. one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!

这也是的shell内容出错，表示你的shell执行结果存在错误。我记得我出现这个错是在使用conda虚拟环境时，我就使用了要素conda，就没有报错

#报这个错误的shell内容
shell:
	"source /home/zhaohuiyao/miniconda3/bin/activate GenomeScope && \
	genomescope2 -i {input} -o {output} -k 21 -n {NAME} | tee -a {log} && \
	conda deactivate"
#修改后，没有报错
conda:
	"./conda_yaml_files/GenomeScope.yaml"
shell:
	"genomescope2 -i {input} -o {output} -k 21 -n {NAME} | tee -a {log}"

6. Wildcards in input files cannot be determined from output files:

这是input的文件中存在通配符不恰当。对你的输出和输入文件进行expand，详细给出通配符信息

7. CreateCondaEnvironmentException:

Could not create conda environment from /home/zhaohuiyao/snakemake/Genome_survey/conda_yaml_files/cutadapt.yaml:
Command:
mamba env create --quiet --file “/home/zhaohuiyao/snakemake/Genome_survey/testresults/.snakemake/conda/62cc0359c233d7053c596cfebac171c6.yaml” --prefix “/home/zhaohuiyao/snakemake/Genome_survey/testresults/.snakemake/conda/62cc0359c233d7053c596cfebac171c6”
Output:
Exceeded max retries, giving up
LockError:
LOCKERROR: It looks like conda is already doing something.
The lock [‘/home/zhaohuiyao/miniconda3/pkgs/cache.pid21497.conda_lock’] was found. Wait for it to finish before continuing.
If you are sure that conda is not running, remove it and try again.
You can also use: $ conda clean --lock

这是conda的错误。手动将/home/zhaohuiyao/miniconda3/pkgs/cache.pid21497.conda_lock给删除即可