The following software is already installed in a conda environment on the Linux system (conda activate snakemake):
Software | Function |
Trimmomatic | quality control |
megahit | assembly |
prodigal | gene prediction |
linclust | clustering |
salmon | quantification |
diamond | annotation |
-
Preparation:
Place the raw reads in ./raw_fq.
Write mapping.txt (/ZYdata2/ys2022/metegenome/script/mapping.txt): keep the first line unchanged, then one sample name per line.
mapping.txt
#SampleID
VAB
VCD
Note: ./ refers to the Workdir set in project_config.yaml.
project_config.yaml
# == current & fastq & mapping file location ==
workdir: /ZYdata2/ys2022/metegenome/V24_LZmedium2
Fqlocation: /ZYdata2/ys2022/metegenome/V24_LZmedium2/raw_fq
mapping_file: /ZYdata2/ys2022/metegenome/script/mapping.txt
# == Software parameters ==
trim:
  method: fastp
  fastp:
    # parameter: -5 -W 5 -M 20 -q 15 -u 40 -l 50  # parameters used by Zhong Shengji (SCNU); this plan uses the defaults, which are stricter than Zhong's
    threads: 120
assembly:
  method: megahit
  threads: 40
  megahit:
    parameter: --k-min 35 --k-max 95 --k-step 20 --min-contig-len 500 -m 0.8
taxa:
  topN: 10
All the scripts live in /ZYdata2/ys2022/metegenome/script; run the programs from that directory.
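Before launching anything, an optional preflight check can save a failed run. A minimal sketch, assuming the mapping.txt and raw_fq paths from the config above and the {sample}.R1/.R2.fq.gz naming used throughout this pipeline:

import os

# preflight: every sample in mapping.txt should have paired fastq files in raw_fq
with open('/ZYdata2/ys2022/metegenome/script/mapping.txt') as mf:
    samples = [line.strip() for line in mf][1:]   # skip the #SampleID header
fq_dir = '/ZYdata2/ys2022/metegenome/V24_LZmedium2/raw_fq'
for s in samples:
    for read in ('R1', 'R2'):
        fq = os.path.join(fq_dir, '{}.{}.fq.gz'.format(s, read))
        if not os.path.exists(fq):
            print('missing:', fq)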
-
Quality control
-
Method 1: fastp
Run the qc.smk file below:
snakemake -s qc.smk --cores 2 -n
(-n is a dry run that only prints the plan; remove it to actually execute.)
Note: --cores here controls how many jobs run in parallel. Since the trim rule declares no threads:, Snakemake counts each job as one core, so set --cores to the number of samples. This does not conflict with fastp's own --thread, which is the thread count a single job uses.
fastp defaults:
-W 4 (window size shared by cut_front, cut_tail and cut_sliding; range 1~1000, default 4)
-M 20 (mean quality requirement shared by cut_front, cut_tail and cut_sliding; range 1~36, default Q20)
-q 15 (quality value for a base to count as qualified; default 15 means phred >= Q15)
-u 40 (percentage of bases allowed to be unqualified, 0~100; default 40%)
-l 15 (reads shorter than this are discarded; default 15)
--cut_front is off by default; it is enabled here with -5.
import os                      # os module, needed for chdir
from os.path import join       # join() builds the fastq paths
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]            # drop the #SampleID header line
Workdir = config["workdir"]
Fqlocation = config["Fqlocation"]
os.chdir(Workdir)
rule all:                      # global targets, i.e. the final output files
    input:
        expand("1.clean_data/{sample}/clean_R1.fq.gz", sample=Sample),
        expand("1.clean_data/{sample}/clean_R2.fq.gz", sample=Sample)
rule trim:                     # quality-control rule
    input:                     # paired-end raw reads; {sample} expands to every sample name
        fq1 = join(Fqlocation, "{sample}.R1.fq.gz"),   # note the trailing comma here
        fq2 = join(Fqlocation, "{sample}.R2.fq.gz")    # no comma on the last entry
    output:                    # output paths and file names
        "1.clean_data/{sample}/clean_R1.fq.gz",
        "1.clean_data/{sample}/clean_R2.fq.gz"
    params:
        threads = config["trim"]["fastp"]["threads"]
    shell:                     # run the software with its parameters; the backslash continues the command
        "fastp --in1 {input.fq1} --in2 {input.fq2} --out1 {output[0]} --out2 {output[1]} --thread {params.threads} -5 -W 5 -M 20 -q 15 -u 40 -l 50 \
        -j ./1.clean_data/{wildcards.sample}/{wildcards.sample}_fastp.json \
        -h ./1.clean_data/{wildcards.sample}/{wildcards.sample}_fastp.html"
-
Method 2: Trimmomatic (the method adopted here; the output data quality is a bit higher)
conda activate snakemake
snakemake -s qctrim.smk --cores 100 -p
import os                       # os module, needed for chdir
from os.path import join        # join() is used below and must be imported
# load the project configuration; config[] needs this
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]             # drop the #SampleID header line
Workdir = config["workdir"]
Fqlocation = config["Fqlocation"]
os.chdir(Workdir)
rule all:                       # global targets, i.e. the final output files
    input:
        expand("1.clean_data/{sample}/clean_R1.fq.gz", sample=Sample),
        expand("1.clean_data/{sample}/clean_R2.fq.gz", sample=Sample),
        expand("1.clean_data/{sample}/output_forward_unpaired.fq.gz", sample=Sample),
        expand("1.clean_data/{sample}/output_reverse_unpaired.fq.gz", sample=Sample)
rule trim:                      # quality-control rule
    input:                      # paired-end raw reads; {sample} expands to every sample name
        fq1 = join(Fqlocation, "{sample}.R1.fq.gz"),
        fq2 = join(Fqlocation, "{sample}.R2.fq.gz")
    output:                     # output paths and file names
        "1.clean_data/{sample}/clean_R1.fq.gz",
        "1.clean_data/{sample}/clean_R2.fq.gz",
        "1.clean_data/{sample}/output_forward_unpaired.fq.gz",
        "1.clean_data/{sample}/output_reverse_unpaired.fq.gz"
    shell:
        "java -jar /ZYdata2/ys2022/metegenome/script/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 10 {input.fq1} {input.fq2} {output[0]} {output[2]} {output[1]} {output[3]} ILLUMINACLIP:/ZYdata2/ys2022/metegenome/script/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10:2:True LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:50"
# check the cleaning quality with FastQC
rule FastQC:
    input:
        "1.clean_data/{sample}/clean_R1.fq.gz",
        "1.clean_data/{sample}/clean_R2.fq.gz"
    output: directory("1.clean_data/{sample}/FastQC")
    shell:
        "mkdir -p {output} && /ZYdata2/ys2022/metegenome/script/FastQC/fastqc {input[0]} {input[1]} -o {output} -t 10"
The FastQC outputs are not requested by rule all, so Snakemake never schedules that rule on its own. Either add expand("1.clean_data/{sample}/FastQC", sample=Sample) to rule all, or run it by hand: enter 1.clean_data/<sample>, create a FastQC directory, and run:
/ZYdata2/ys2022/metegenome/script/FastQC/fastqc clean_R1.fq.gz clean_R2.fq.gz -o ./FastQC -t 10
-
Assembly: megahit
snakemake -s assambly.smk --cores 40 -n
megahit's -t is the core count for a single sample; snakemake's --cores is the total core budget across all samples. For example, with -t 20 and --cores 100, five 20-core assemblies run in parallel while the remaining samples queue.
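A quick sanity check of that arithmetic (a sketch with the hypothetical values from the note above):

# Snakemake schedules jobs so that the sum of their threads stays within --cores
total_cores = 100        # snakemake --cores
threads_per_job = 20     # megahit -t, i.e. the rule's threads: value
print(total_cores // threads_per_job)   # -> 5 assemblies run at once; the rest queue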
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
Fqlocation = config["Fqlocation"]
os.chdir(Workdir)
rule all:
    input:
        expand("2.assembly/{sample}", sample=Sample)
rule megahit:
    input:
        r1 = "1.clean_data/{sample}/clean_R1.fq.gz",
        r2 = "1.clean_data/{sample}/clean_R2.fq.gz"
    output: directory("2.assembly/{sample}")
    params:
        param = config['assembly']['megahit']["parameter"]
    threads: config["assembly"]["threads"]
    shell:
        "megahit -1 {input.r1} -2 {input.r2} -o {output} {params.param} -t {threads}"
Run RenameContig.py to prefix every contig name with its sample name:
python RenameContig.py
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
print('Renaming contigs, please wait ...')
# prefix every contig header with the sample tag
for sample in Sample:
    in_path = os.path.join(os.getcwd(), '2.assembly', sample, "final.contigs.fa")
    out_path = os.path.join(os.getcwd(), '2.assembly', sample, "final.contigsRename.fa")
    # open the output in 'w' mode so a re-run does not append duplicate records
    with open(in_path, 'r') as fa, open(out_path, 'w') as new_fa:
        for line in fa:
            if line.startswith('>'):
                line = line.replace('>', ">" + sample + '_', 1)
            new_fa.write(line)
print('Contig names tagged with their sample labels; saved to 2.assembly/<sample>/final.contigsRename.fa')
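An optional sanity check (a sketch reusing the Sample list read above): every header in the renamed file should now carry its sample tag.

# verify the renaming: count headers that are missing the sample tag
for sample in Sample:
    path = os.path.join('2.assembly', sample, 'final.contigsRename.fa')
    with open(path) as fa:
        untagged = sum(1 for line in fa
                       if line.startswith('>') and not line.startswith('>' + sample + '_'))
    print(sample, 'headers missing the tag:', untagged)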
-
Gene prediction (prodigal), clustering (cd-hit or linclust), quantification (salmon)
-
Method 1: gene prediction with prodigal, clustering with cd-hit, quantification with salmon (abandoned: cd-hit collapses the genes too aggressively)
snakemake -s gene.smk --cores 120 -p
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
os.chdir(Workdir)
rule all:
    input:
        expand("3.gene/prodigal/{sample}.gff", sample=Sample),
        expand("3.gene/prodigal/{sample}_prot.faa", sample=Sample),
        expand("3.gene/prodigal/{sample}_nucl.fna", sample=Sample),
        expand("3.gene/prodigal/log/{sample}.log", sample=Sample),
        "3.gene/prodigal/all_nucl.fna",
        "3.gene/uniq_genes_nucl.fna",
        "3.gene/uniq_genes_prot.faa",
        "3.gene/salmon/index",
        expand("3.gene/salmon/{sample}.quant", sample=Sample),
        "3.gene/gene.TPM",
        "3.gene/gene.count"
# gene prediction with prodigal
rule prodigal:
    input: "2.assembly/{sample}/final.contigsRename.fa"
    output:
        gff = "3.gene/prodigal/{sample}.gff",
        faa = "3.gene/prodigal/{sample}_prot.faa",
        fna = "3.gene/prodigal/{sample}_nucl.fna",
        log = "3.gene/prodigal/log/{sample}.log"
    shell:
        "prodigal -i {input} -o {output.gff} -d {output.fna} -a {output.faa} -p meta &> {output.log}"
# concatenate the gene nucleotide sequences from all samples into one file
rule get_all_nucl:
    input:
        expand("3.gene/prodigal/{sample}_nucl.fna", sample=Sample)
    output:
        "3.gene/prodigal/all_nucl.fna"
    shell:
        "cat {input} > {output}"
# cluster to obtain the non-redundant gene set
# -aS alignment coverage, -c identity, -G 0 local alignment, -g clustering mode, -T threads (0 = all cores), -M memory limit (0 = unlimited)
rule cluster:
    input: "3.gene/prodigal/all_nucl.fna"
    output: "3.gene/uniq_genes_nucl.fna"
    shell:
        "cd-hit-est -i {input} -o {output} -aS 0.9 -c 0.95 -G 0 -g 0 -T 0 -M 0"
# translate the non-redundant genes into proteins
rule trans_seq:
    input: rules.cluster.output
    output: "3.gene/uniq_genes_prot.faa"
    shell:
        "seqkit translate {input} -T 11 > {output}"
# build the salmon index
rule salmon_index:
    input: "3.gene/uniq_genes_nucl.fna"
    output: directory("3.gene/salmon/index")
    shell:
        "salmon index -t {input} -p 9 -i {output}"
# quantification: -l A auto-detects the library type, -p threads, --meta metagenome mode
rule salmon_quant:
    input:
        fq1 = "1.clean_data/{sample}/clean_R1.fq.gz",
        fq2 = "1.clean_data/{sample}/clean_R2.fq.gz",
        index = "3.gene/salmon/index"
    output: directory("3.gene/salmon/{sample}.quant")
    shell:
        "salmon quant -i {input.index} -l A -p 8 --meta \
        -1 {input.fq1} -2 {input.fq2} -o {output}"
rule salmon_quantmerge:
    input: expand("3.gene/salmon/{sample}.quant", sample=Sample)
    output:
        TPM = "3.gene/gene.TPM",
        Count = "3.gene/gene.count"
    shell: '''
        salmon quantmerge --quants 3.gene/salmon/*.quant -o {output.TPM}
        salmon quantmerge --quants 3.gene/salmon/*.quant --column NumReads -o {output.Count}
        '''
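salmon quantmerge writes tab-separated tables keyed by a Name column, one row per gene and one column per sample (the sample columns are named after the quant directories, so it is worth checking them once). A quick peek, as a sketch:

import pandas as pd

# inspect the merged quantification tables; 'Name' holds the unigene IDs
tpm = pd.read_csv('3.gene/gene.TPM', sep='\t')
count = pd.read_csv('3.gene/gene.count', sep='\t')
print(tpm.columns.tolist())   # ['Name', <one column per sample>, ...]
print(tpm.head())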
-
Method 2: gene prediction with prodigal, clustering with linclust, quantification with salmon
snakemake -s geneLinclust1.smk -p --cores 100
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
os.chdir(Workdir)
# debug output
print("Current working directory:", os.getcwd())
print("Sample list:", Sample)
rule all:
    input:
        expand("3.gene/prodigal/{sample}.gff", sample=Sample),
        expand("3.gene/prodigal/{sample}_prot.faa", sample=Sample),
        expand("3.gene/prodigal/{sample}_nucl.fna", sample=Sample),
        expand("3.gene/prodigal/log/{sample}.log", sample=Sample),
        "3.gene/prodigal/all_nucl.fna"
# gene prediction with prodigal
rule prodigal:
    input: "2.assembly/{sample}/final.contigsRename.fa"
    output:
        gff = "3.gene/prodigal/{sample}.gff",
        faa = "3.gene/prodigal/{sample}_prot.faa",
        fna = "3.gene/prodigal/{sample}_nucl.fna",
        log = "3.gene/prodigal/log/{sample}.log"
    shell:
        "prodigal -i {input} -o {output.gff} -d {output.fna} -a {output.faa} -p meta &> {output.log}"
# concatenate the gene nucleotide sequences from all samples into one file
rule get_all_nucl:
    input:
        expand("3.gene/prodigal/{sample}_nucl.fna", sample=Sample)
    output:
        "3.gene/prodigal/all_nucl.fna"
    shell:
        "cat {input} > {output}"
snakemake -s geneLinclust2.smk -p --cores 100 -n
This dry run is only used to print the commands; run each printed command by hand.
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
os.chdir(Workdir)
# debug output
print("Current working directory:", os.getcwd())
print("Sample list:", Sample)
rule all:
    input:   # inputs are plain paths; the directory() flag is only valid for outputs
        "3.gene/linclust/all_nucl.linclustDB",
        "3.gene/linclust/all_nucl.linclust_DB_clu",
        "3.gene/linclust/tmp",
        "3.gene/linclust/DB_clu_re",
        "3.gene/uniq_genes_nucl.fna",
        "3.gene/uniq_genes_prot.faa",
        "3.gene/salmon/index",
        expand("3.gene/salmon/{sample}.quant", sample=Sample),
        "3.gene/gene.TPM",
        "3.gene/gene.count"
# cluster to obtain the non-redundant gene set
# convert all_nucl.fna into an mmseqs DB
rule DB:
    input: "3.gene/prodigal/all_nucl.fna"
    output: "3.gene/linclust/all_nucl.linclustDB"
    shell:
        "mmseqs createdb {input} {output}"
# linclust clustering
# run by hand
rule cluster:
    input: "3.gene/linclust/all_nucl.linclustDB"
    output:
        DB_clu = "3.gene/linclust/all_nucl.linclust_DB_clu",
        tmp = directory("3.gene/linclust/tmp")
    shell:
        "mmseqs linclust {input} {output.DB_clu} {output.tmp} -e 0.001 --min-seq-id 0.9 -c 0.80 --threads 100"
# extract the representative sequences
rule DB_clu_re:
    input:
        DB = "3.gene/linclust/all_nucl.linclustDB",
        DB_clu = "3.gene/linclust/all_nucl.linclust_DB_clu"
    output: "3.gene/linclust/DB_clu_re"
    shell:
        "mmseqs createsubdb {input.DB_clu} {input.DB} {output}"
# convert DB_clu_re back to fasta
rule refasta:
    input: "3.gene/linclust/DB_clu_re"
    output: "3.gene/uniq_genes_nucl.fna"
    shell:
        "mmseqs convert2fasta {input} {output}"
# translate the non-redundant genes into proteins
rule trans_seq:
    input: "3.gene/uniq_genes_nucl.fna"
    output: "3.gene/uniq_genes_prot.faa"
    shell:
        "seqkit translate {input} -T 11 > {output}"
# build the salmon index
rule salmon_index:
    input: "3.gene/uniq_genes_nucl.fna"
    output: directory("3.gene/salmon/index")
    shell:
        "salmon index -t {input} -p 10 -i {output}"
# quantification: -l A auto-detects the library type, -p threads, --meta metagenome mode
rule salmon_quant:
    input:
        fq1 = "1.clean_data/{sample}/clean_R1.fq.gz",
        fq2 = "1.clean_data/{sample}/clean_R2.fq.gz",
        index = "3.gene/salmon/index"
    output: directory("3.gene/salmon/{sample}.quant")
    shell:
        "salmon quant -i {input.index} -l A -p 10 --meta -1 {input.fq1} -2 {input.fq2} -o {output}"
rule salmon_quantmerge:
    input: expand("3.gene/salmon/{sample}.quant", sample=Sample)
    output:
        TPM = "3.gene/gene.TPM",
        Count = "3.gene/gene.count"
    shell: '''
        salmon quantmerge --quants 3.gene/salmon/*.quant -o {output.TPM}
        salmon quantmerge --quants 3.gene/salmon/*.quant --column NumReads -o {output.Count}
        '''
-
NR taxonomic annotation: diamond
-
Method 1 (not recommended)
This method uses DIAMOND's built-in simplified LCA algorithm, which is most likely a naive LCA.
snakemake -s annotionNR.smk --cores 120 -p
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
os.chdir(Workdir)
rule all:
    input:
        "4.annotion/NR/nr.blast.LCA"
rule NR_align:
    input:
        gene = "3.gene/uniq_genes_prot.faa",
        nr_dmnd = "/ZYdata1/ys2022/metegenome/Database/NRtaxonomy.dmnd"
    output: "4.annotion/NR/nr.blast.LCA"
    shell: '''
        diamond blastp -d {input.nr_dmnd} -q {input.gene} -f 102 -p 120 -e 0.00001 -o {output} -b 10 -c 1
        '''
Convert the Taxonomy IDs in the diamond LCA output to lineages.
The lineage table /ZYdata1/ys2022/metegenome/Database/NRfromYs/ncbi_lineages_2022-12-20.csv was generated following GitHub - zyxue/ncbitax2lin: Convert NCBI taxonomy dump into lineages.
python TaxonomyIDtoLineages.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
# step 1: add a header
# read the diamond LCA output (query, NCBI taxonomy ID, e-value)
request = pd.read_csv('./4.annotion/NR/nr.blast.LCA',
                      sep='\t', header=None, names=['Unigene_ID', 'Taxonomy_ID', 'e-value'])
# sort so identical IDs are adjacent, and reset the index so .loc[i] walks the sorted order
request = request.sort_values(by='Taxonomy_ID').reset_index(drop=True)
# add the lineage columns
request[['superkingdom', 'phylum', 'class', 'order',
         'family', 'genus', 'species']] = request.apply(lambda x: ('', '', '', '', '', '', ''), axis=1, result_type='expand')
# load the ncbi_lineages table
data = pd.read_csv('/ZYdata1/ys2022/metegenome/Database/NRfromYs/ncbi_lineages_2022-12-20.csv',
                   usecols=['tax_id', 'superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species'],
                   low_memory=False, index_col='tax_id')
# lineage lookup
# IDs that cannot be found go to a double-check file; if that file is non-empty,
# look the IDs up in the NCBI Taxonomy database, add them to ncbi_lineages_2022-12-20.csv,
# and re-run this step
row_num = len(request)
i = 0
ID1 = ''
a = None
check = open('./4.annotion/NR/neededDoubleCheck', 'w')
while i < row_num:
    ID2 = request.loc[i, 'Taxonomy_ID']
    if ID2 != ID1:                        # a new ID: look it up
        try:                              # found in data: fill in the lineage
            s = data.loc[ID2]
            df = s.to_frame().T
            a = df.reset_index(drop=True)
            request.loc[i, 'superkingdom'] = a.iloc[0, 0]
            request.loc[i, 'phylum'] = a.iloc[0, 1]
            request.loc[i, 'class'] = a.iloc[0, 2]
            request.loc[i, 'order'] = a.iloc[0, 3]
            request.loc[i, 'family'] = a.iloc[0, 4]
            request.loc[i, 'genus'] = a.iloc[0, 5]
            request.loc[i, 'species'] = a.iloc[0, 6]
        except KeyError:                  # not in data: flag it for a double-check
            a = None
            check.write(str(ID2) + '\n')
    else:                                 # same ID as the previous row: reuse the last lookup
        if a is not None:
            request.loc[i, 'superkingdom'] = a.iloc[0, 0]
            request.loc[i, 'phylum'] = a.iloc[0, 1]
            request.loc[i, 'class'] = a.iloc[0, 2]
            request.loc[i, 'order'] = a.iloc[0, 3]
            request.loc[i, 'family'] = a.iloc[0, 4]
            request.loc[i, 'genus'] = a.iloc[0, 5]
            request.loc[i, 'species'] = a.iloc[0, 6]
        # if a is None the ID was already flagged; leave the lineage empty
    ID1 = ID2                             # remember the last ID, then read the next row
    i += 1
check.close()                             # flush before re-reading below
request.to_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv', index=False)   # default mode='w' overwrites
# check whether any IDs need a double-check
with open('./4.annotion/NR/neededDoubleCheck', 'r') as check:
    missing = [line.strip() for line in check if line.strip()]
if len(missing) == 0:
    print('All IDs matched; proceed to Lineages2Quant')
else:
    print(str(len(missing)) + ' IDs need a double-check')
    print('Look these IDs up in the NCBI Taxonomy database, add them to ncbi_lineages_2022-12-20.csv, and re-run this step')
    for ID in missing:
        print(ID + ' needs a double-check')
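The row-by-row lookup above works but is slow on millions of rows. A vectorized sketch of the same join, assuming the same input files and column names:

import pandas as pd

# the same join as the while-loop above, done in one vectorized merge
request = pd.read_csv('./4.annotion/NR/nr.blast.LCA', sep='\t', header=None,
                      names=['Unigene_ID', 'Taxonomy_ID', 'e-value'])
ranks = ['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
data = pd.read_csv('/ZYdata1/ys2022/metegenome/Database/NRfromYs/ncbi_lineages_2022-12-20.csv',
                   usecols=['tax_id'] + ranks, low_memory=False)
merged = request.merge(data, left_on='Taxonomy_ID', right_on='tax_id', how='left')
# IDs absent from the lineage table still need the manual double-check
missing = merged.loc[merged['tax_id'].isna(), 'Taxonomy_ID'].unique()
print(len(missing), 'IDs need a double-check')
merged.drop(columns='tax_id').to_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv', index=False)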
Merge the lineages with the quantification:
python Lineages2Quant.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
# step 3: merge in the quantification
print('Reading nr.blast.LCA2Lineages.csv, please wait ...')
request = pd.read_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv', sep=',')
# build the new column names (one TPM and one count column per sample)
new_TPM_col = []
new_count_col = []
empty_vals = []
for name in Sample:
    new_TPM_col.append(name + '.gene.TPM')
    empty_vals.append('')
    new_count_col.append(name + '.gene.count')
    empty_vals.append('')
# add the new columns
request[new_TPM_col + new_count_col] = request.apply(lambda x: empty_vals, axis=1, result_type='expand')
print('New columns added to request')
# read the quantification tables
print('Reading the quantification files, please wait ...')
TPM = pd.read_csv('./3.gene/gene.TPM', sep='\t', index_col='Name')
count = pd.read_csv('./3.gene/gene.count', sep='\t', index_col='Name')
row_num = len(request)
i = 0
print('Matching, please wait ...')
while i < row_num:
    ID = request.loc[i, 'Unigene_ID']      # read each Unigene_ID from nr.blast.LCA2Lineages.csv
    sTPM = TPM.loc[ID]                     # look the ID up in the TPM table
    # turn the lookup result into a dataframe
    dfTPM = sTPM.to_frame().T
    a = dfTPM.reset_index(drop=True)
    # append the TPM columns to nr.blast.LCA2Lineages
    j = 0
    for new_TPM in new_TPM_col:
        request.loc[i, new_TPM] = a.iloc[0, j]
        j += 1
    # merge in the count results
    scount = count.loc[ID]
    dfcount = scount.to_frame().T
    b = dfcount.reset_index(drop=True)
    k = 0
    for new_count in new_count_col:
        request.loc[i, new_count] = b.iloc[0, k]
        k += 1
    i += 1
print('Matching done; saving, please wait ...')
request.to_csv('./4.annotion/NR/lineages2geneQuant.csv', index=False)
print('NR.LCA2Lineages merged with the quantification; output: ./4.annotion/NR/lineages2geneQuant.csv')
# step 4: split the result file if necessary
# Windows Excel reads at most ~1,048,576 rows, so save lineages2geneQuant.csv in 1,000,000-row chunks
lineages2geneQuant = pd.read_csv('./4.annotion/NR/lineages2geneQuant.csv', sep=',', low_memory=False)
row_num = len(lineages2geneQuant)
step = 1000000
for start in range(0, row_num, step):
    stop = start + step
    filename = './4.annotion/NR/lineages2geneQuant_{}-{}.csv'.format(start, stop)
    d = lineages2geneQuant[start:stop]
    print("Saving file : " + filename + ", data size : " + str(len(d)))
    d.to_csv(filename, index=False)
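As with the lineage lookup, the matching loop above can be replaced by vectorized merges. A sketch, assuming gene.TPM and gene.count still carry salmon's Name key column:

import pandas as pd

# join the lineages with TPM and count in two vectorized merges
lin = pd.read_csv('./4.annotion/NR/nr.blast.LCA2Lineages.csv')
tpm = pd.read_csv('./3.gene/gene.TPM', sep='\t')
cnt = pd.read_csv('./3.gene/gene.count', sep='\t')
merged = (lin.merge(tpm, left_on='Unigene_ID', right_on='Name', how='left')
             .merge(cnt, left_on='Unigene_ID', right_on='Name',
                    how='left', suffixes=('.gene.TPM', '.gene.count')))
merged.to_csv('./4.annotion/NR/lineages2geneQuant.csv', index=False)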
-
Method 2: diamond (alignment) + MEGAN (taxon assignment with weighted LCA)
diamond alignment, writing a .daa file:
snakemake -s annotionNRdaa.smk --cores 120 -p
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
os.chdir(Workdir)
rule all:
    input:
        "4.annotion/NR/nr.blast.daa"
rule NR_align:
    input:
        gene = "3.gene/uniq_genes_prot.faa",
        nr_dmnd = "/ZYdata1/ys2022/metegenome/Database/NRfromYs/NR/NR.dmnd"
    output: "4.annotion/NR/nr.blast.daa"
    shell: '''
        diamond blastp -d {input.nr_dmnd} -q {input.gene} -f 100 -p 120 -e 0.00001 -o {output} -b 10 -c 1
        '''
Convert .daa to .rma; run the following from the 4.annotion/NR directory:
/home/ys202108/megan/tools/daa2rma -i nr.blast.daa -mdb /ZYdata1/ys2022/metegenome/Database/MeganData/megan-map-Feb2022.db -alg weighted -t 50 -o ./nr.blast.rma
Then extract the taxonomy information:
/home/ys202108/megan/tools/rma2info -i ./nr.blast.rma -r2c Taxonomy -n true --paths true --ranks true -mro true -v > ./TaxonomyMegan.txt
Run the Python script to tidy the taxonomy table (method 1). (This script runs far too slowly; the rows carrying a [K] kingdom rank turned out to all be viruses, so they are separated out first. This method has been abandoned.)
python MeganSpilt.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
request = pd.read_csv('./4.annotion/NR/TaxonomyMegan.txt',
                      sep='\t', header=None, names=['Unigene_ID', 'Level', 'Taxonomy'])
request[['superkingdom', 'kingdom', 'phylum', 'class', 'order',
         'family', 'genus', 'species']] = request.apply(lambda x: ('', '', '', '', '', '', '', ''), axis=1, result_type='expand')
request.sort_values(by='Level', ignore_index=True, inplace=True)
row_num = len(request)
i = 0
k = 0
# count rows with an empty (NaN) Taxonomy so they can be skipped
# (assumes these rows end up at the top after the sort)
while i < row_num:
    ID2 = request.loc[i, 'Taxonomy']
    if isinstance(ID2, float):
        k += 1
    i += 1
i = k
# split each MEGAN path such as '[D] Bacteria; [P] ...' into one column per rank
while i < row_num:
    ID2 = request.loc[i, 'Taxonomy']
    taxonomy = ID2.split(';')
    for tox in taxonomy:
        level = tox.split('] ')
        if level != ['']:
            a = level[0].split('[')[1]
            if a == 'D':
                request.loc[i, 'superkingdom'] = level[1]
            if a == 'K':
                request.loc[i, 'kingdom'] = level[1]
            if a == 'P':
                request.loc[i, 'phylum'] = level[1]
            if a == 'C':
                request.loc[i, 'class'] = level[1]
            if a == 'O':
                request.loc[i, 'order'] = level[1]
            if a == 'F':
                request.loc[i, 'family'] = level[1]
            if a == 'G':
                request.loc[i, 'genus'] = level[1]
            if a == 'S':
                request.loc[i, 'species'] = level[1]
    i += 1
request.to_csv('./4.annotion/NR/TaxonomyMeganOutput.csv', index=False)
Run the Python script to tidy the taxonomy table (method 2):
python MeganSpilt.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
# add a header and save as csv
print('Adding a header and saving as csv')
request = pd.read_csv('./4.annotion/NR/TaxonomyMeganWeigh5.txt',
                      sep='\t', header=None, names=['Unigene_ID', 'Level', 'Taxonomy'])
request.to_csv('./4.annotion/NR/TaxonomyMegan.csv', index=False)
# split off the rows that contain a kingdom ([K]) level and save the rest
print('Splitting off rows with a kingdom level')
NoK = open('./4.annotion/NR/TaxonomyMeganNoK.csv', 'w')
K = open('./4.annotion/NR/TaxonomyMeganK.csv', 'w')
with open('./4.annotion/NR/TaxonomyMegan.csv', 'r') as request:
    for line in request:
        if '[K]' in line:                  # kingdom rows (all viruses) go to their own file
            K.write(line)
        if 'Unigene_ID' in line:           # replace the 3-column header with the expanded one
            NoK.write('Unigene_ID,Level,superkingdom,phylum,class,order,family,genus,species' + '\n')
        if '[K]' not in line:
            if 'Unigene_ID' in line:
                pass
            else:                          # turn the ';'-separated MEGAN path into csv columns
                line = line.replace('; ', ',')
                line = line.replace(';', ',')
                NoK.write(line)
K.close()
NoK.close()
Then, in ./4.annotion/NR/, run:
cat TaxonomyMeganK.csv >> TaxonomyMeganNoK.csv
rm TaxonomyMeganK.csv TaxonomyMegan.csv
mv TaxonomyMeganNoK.csv TaxonomyMegan.csv
-
KEGG functional annotation: diamond
snakemake -s annotionKegg.smk --cores 120 -p
import os
from os.path import join
# load the project configuration
configfile: 'project_config.yaml'
mapping_file = config["mapping_file"]
Sample = []
with open(mapping_file, 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
Workdir = config["workdir"]
os.chdir(Workdir)
rule all:
    input:
        '4.annotion/kegg/kegg.blast.txt'
rule kegg_align:
    input:
        gene = "3.gene/uniq_genes_prot.faa",
        kegg_dmnd = "/ZYdata1/ys2022/metegenome/Database/kegg.dmnd"
    output: "4.annotion/kegg/kegg.blast.txt"
    # -f is diamond's short form of --outfmt, so the format is given only once,
    # with the explicit field list
    shell: '''
        diamond blastp -d {input.kegg_dmnd} -q {input.gene} -p 120 -e 0.00001 -o {output} \
        -f 6 qseqid qlen sseqid stitle slen pident length mismatch gapopen qstart qend sstart send evalue bitscore
        '''
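Before the e-value filtering below, a quick check (a sketch, run from the workdir) that the diamond table really has the 15 columns the next script names:

import pandas as pd

# the downstream script assumes exactly these 15 tab-separated columns
cols = ['qseqid', 'qlen', 'sseqid', 'stitle', 'slen', 'pident', 'length', 'mismatch', 'gapopen',
        'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
head = pd.read_csv('4.annotion/kegg/kegg.blast.txt', sep='\t', header=None, nrows=5)
print(head.shape[1] == len(cols))   # expect True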
Take the hit with the smallest e-value as each gene's annotation:
python lowestevalueKegg.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
# three steps in total; run them one at a time
# step 1: read the diamond-vs-KEGG results
print('Reading the KEGG results, please wait ...')
request = pd.read_csv('./4.annotion/kegg/kegg.blast.txt',
                      sep='\t', header=None,
                      names=['qseqid', 'qlen', 'sseqid', 'stitle', 'slen', 'pident', 'length', 'mismatch', 'gapopen',
                             'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore'])
# step 2
# method 1: keep the first hit per query
# (diamond writes the best-scoring hit first within each query, so keep='first' keeps the best one)
print('Taking the smallest e-value, please wait ...')
request.drop_duplicates(subset='qseqid', keep='first', inplace=True)
request.to_csv('./4.annotion/kegg/kegg.bestevalue.csv', index=False)
print('Smallest e-value taken; saved to ./4.annotion/kegg/kegg.bestevalue.csv')
'''
# method 2 (not needed when method 1 is used)
file = open('kegg.blast.csv', 'r')
out = open('kegg.bestevalue.csv', 'w')
qseqid1 = ''
for line in file:
    qseqid2 = line.split(",")[0]
    if qseqid2 != qseqid1:
        out.write(line)
    else:
        pass
    qseqid1 = qseqid2
'''
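keep='first' relies on DIAMOND listing the best hit first within each query. An order-independent sketch that can be run in place of the drop_duplicates call above, on the freshly loaded table:

# pick the row with the smallest e-value per query, regardless of input order
best = request.loc[request.groupby('qseqid')['evalue'].idxmin()]
best.to_csv('./4.annotion/kegg/kegg.bestevalue.csv', index=False)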
Merge kegg.bestevalue.csv with gene.TPM and gene.count (this step has been abandoned):
python kegg2quant.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
# merge kegg.bestevalue.csv with gene.TPM and gene.count
request = pd.read_csv('./4.annotion/kegg/kegg.bestevalue.csv', sep=',', low_memory=False)
# build the new column names (one TPM and one count column per sample)
new_TPM_col = []
new_count_col = []
empty_vals = []
for name in Sample:
    new_TPM_col.append(name + '.gene.TPM')
    empty_vals.append('')
    new_count_col.append(name + '.gene.count')
    empty_vals.append('')
request[new_TPM_col + new_count_col] = request.apply(lambda x: empty_vals, axis=1, result_type='expand')
print('New columns added to request')
print('Reading the quantification files, please wait ...')
TPM = pd.read_csv('./3.gene/gene.TPM', sep='\t', index_col='Name')
count = pd.read_csv('./3.gene/gene.count', sep='\t', index_col='Name')
row_num = len(request)
i = 0
print('Matching, please wait ...')
while i < row_num:
    ID = request.loc[i, 'qseqid']
    sTPM = TPM.loc[ID]
    dfTPM = sTPM.to_frame().T
    a = dfTPM.reset_index(drop=True)
    j = 0
    for new_TPM in new_TPM_col:
        request.loc[i, new_TPM] = a.iloc[0, j]
        j += 1
    scount = count.loc[ID]
    dfcount = scount.to_frame().T
    b = dfcount.reset_index(drop=True)
    k = 0
    for new_count in new_count_col:
        request.loc[i, new_count] = b.iloc[0, k]
        k += 1
    i += 1
request.to_csv('./4.annotion/kegg/kegg.bestevalue2geneQuant.csv', index=False)
print('kegg2quant done; saved to ./4.annotion/kegg/kegg.bestevalue2geneQuant.csv')
# step 4: Windows Excel reads at most ~1,048,576 rows, so save the table in 1,000,000-row chunks
request = pd.read_csv('./4.annotion/kegg/kegg.bestevalue2geneQuant.csv', sep=',', low_memory=False)
row_num = len(request)
step = 1000000
for start in range(0, row_num, step):
    stop = start + step
    filename = './4.annotion/kegg/kegg.bestevalue2geneQuant_{}-{}.csv'.format(start, stop)
    d = request[start:stop]
    print("Saving file : " + filename + ", data size : " + str(len(d)))
    d.to_csv(filename, index=False)
-
Merge NR_taxonomy with kegg2quant by running NRandKeggMerge.py (this step has been abandoned):
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
def merge(file1, file2):
    data1 = pd.read_csv(file1, sep=',', low_memory=False)
    data2 = pd.read_csv(file2, sep=',', low_memory=False)
    result = pd.merge(data1, data2, left_on=['Unigene_ID'], right_on=['qseqid'], how='outer')
    result.to_csv('./4.annotion/NRandKegg/NR2keggQuant.csv', index=False)
NR = r'./4.annotion/NR/TaxonomyMegan.csv'
kegg = r'./4.annotion/kegg/kegg.bestevalue.csv'
merge(NR, kegg)
# split step: Windows Excel reads at most ~1,048,576 rows, so save the merged table in 1,000,000-row chunks
request = pd.read_csv('./4.annotion/NRandKegg/NR2keggQuant.csv', sep=',', low_memory=False)
row_num = len(request)
step = 1000000
for start in range(0, row_num, step):
    stop = start + step
    filename = './4.annotion/NRandKegg/NR2keggQuant_{}-{}.csv'.format(start, stop)
    d = request[start:stop]
    print("Saving file : " + filename + ", data size : " + str(len(d)))
    d.to_csv(filename, index=False)
Bug: with kegg2quant.py (step 5) and NRandKeggMerge.py (step 6), KEGG usually annotates fewer genes than NR, so taking the quantification from kegg2quant leaves some NR-annotated genes without quantification. Both steps are therefore abandoned in favor of the approach below:
-
Merging the NR, KEGG, TPM and count results
-
Union 1: TPM with count (taking count proved unnecessary, so this was dropped)
-
Union 2: the NR and KEGG results
-
Finally, take the union of the TPM table and union 2.
This way every gene is carried through, whether or not it received an NR or KEGG annotation.
Remember to rename the key columns to Unigene_ID first; a helper sketch follows.
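A renaming helper (a sketch): give each input table the Unigene_ID key that allMerge.py merges on, assuming diamond's qseqid column in kegg.bestevalue.csv and salmon's Name column in gene.TPM. Note that it writes the files back in place:

import pandas as pd

# rename the key columns so every table merges on 'Unigene_ID'
kegg = pd.read_csv('./4.annotion/kegg/kegg.bestevalue.csv')
kegg.rename(columns={'qseqid': 'Unigene_ID'}).to_csv('./4.annotion/kegg/kegg.bestevalue.csv', index=False)
tpm = pd.read_csv('./3.gene/gene.TPM', sep='\t')
tpm.rename(columns={'Name': 'Unigene_ID'}).to_csv('./3.gene/gene.TPM', sep='\t', index=False)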
python allMerge.py
import pandas as pd
import os
print('Reading sample names')
# read the sample names from mapping.txt
Sample = []
with open('mapping.txt', 'r') as mf:
    for line in mf:
        Sample.append(line.strip())
Sample = Sample[1:]
print('Reading the work directory and changing into it')
# read the workdir from project_config.yaml and chdir there
with open('project_config.yaml', 'r') as pj:
    for line in pj:
        if 'workdir' in line:
            workdir = line.strip().split(': ')[1]
os.chdir(workdir)
# remember: every key column must already be named Unigene_ID
print('merge NR kegg')
def merge(file1, file2):
    # the NR table is ragged (the appended [K] rows have a different column
    # count), so find the widest row first
    largest_column_count = 0
    with open(file1, 'r') as temp_f:
        for l in temp_f:
            column_count = len(l.split(',')) + 1   # +1 leaves one spare column
            largest_column_count = max(largest_column_count, column_count)
    # name the known columns and pad the rest so pandas can read every row
    known = ['Unigene_ID', 'Level', 'superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
    column_names = known + ['extra_{}'.format(n) for n in range(largest_column_count - len(known))]
    data1 = pd.read_csv(file1, header=None, delimiter=',', names=column_names, skiprows=1)
    data2 = pd.read_csv(file2, sep=',', low_memory=False)
    result = pd.merge(data1, data2, on='Unigene_ID', how='outer')
    result.to_csv('./4.annotion/NRandKegg/NR2kegg.csv', index=False)
NR = r'./4.annotion/NR/TaxonomyMeganSort.csv'
kegg = r'./4.annotion/kegg/kegg.bestevalue.csv'
merge(NR, kegg)
# remember: gene.TPM's key column must also be named Unigene_ID
print('merge NRkegg TPM')
def merge(file1, file2):
    data1 = pd.read_csv(file1, sep=',', low_memory=False)
    data2 = pd.read_csv(file2, sep='\t', low_memory=False)   # gene.TPM is tab-separated
    result = pd.merge(data1, data2, on='Unigene_ID', how='outer')
    result.to_csv('./4.annotion/NRandKegg/NR2kegg2TMP.csv', index=False)
NRkegg = r'./4.annotion/NRandKegg/NR2kegg.csv'
TPM = r'./3.gene/gene.TPM'
merge(NRkegg, TPM)
# final step: Windows Excel reads at most ~1,048,576 rows, so save the merged table in 1,000,000-row chunks
request = pd.read_csv('./4.annotion/NRandKegg/NR2kegg2TMP.csv', sep=',', low_memory=False)
row_num = len(request)
step = 1000000
for start in range(0, row_num, step):
    stop = start + step
    filename = './4.annotion/NRandKegg/NR2kegg2TMP_{}-{}.csv'.format(start, stop)
    d = request[start:stop]
    print("Saving file : " + filename + ", data size : " + str(len(d)))
    d.to_csv(filename, index=False)
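A final sanity check (a sketch): the outer merges should keep every unigene, so the merged table should have at least as many rows as gene.TPM.

import pandas as pd

# no gene should be lost by the outer merges
merged = pd.read_csv('./4.annotion/NRandKegg/NR2kegg2TMP.csv', low_memory=False)
tpm = pd.read_csv('./3.gene/gene.TPM', sep='\t')
print(len(merged) >= len(tpm))   # expect True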