DAS Tool
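DAS Tool is an automated method that integrates the results of multiple binning algorithms to obtain an optimized, non-redundant set of bins from a single assembly. Compared with other methods, it can reconstruct more near-complete genomes from soil metagenomes [1][2].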
Installation
DAS Tool can be installed from the Bioconda repository.
conda install -c bioconda das_tool
Usage
Basic usage
(Example 1) Run DAS Tool on the binning results of MetaBAT, MaxBin, CONCOCT, and tetraESOM.
$ ./DAS_Tool -i \
sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,\
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,\
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv \
-l concoct,maxbin,metabat,tetraESOM \
-c sample_data/sample.human.gut_contigs.fa \
-o sample_output/DASToolRun1
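Here -i specifies the bin tables produced by the different binning tools; -l specifies the labels, i.e. the name of the tool that produced each binning result; -c specifies the contigs used for this binning, in FASTA format; and -o specifies the prefix of the output files.
Note: the list passed to -i must not end with a comma after the last file name, and it must not contain spaces between the entries.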
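One way to avoid a stray trailing comma or an accidental space in the -i list is to build the list programmatically. The following is a minimal sketch; join_by_comma is a hypothetical helper introduced here for illustration and is not part of DAS Tool.

```shell
# Hypothetical helper (not part of DAS Tool): join all arguments with commas.
# The result has no spaces between entries and never ends with a comma.
# The body runs in a subshell so that changing IFS does not leak to the caller.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Build the -i argument from two of the sample tables used in Example 1.
bins=$(join_by_comma \
    sample_data/sample.human.gut_concoct_scaffolds2bin.tsv \
    sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv)
echo "$bins"
```

The resulting string can then be passed as -i "$bins".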
Input files
Bins
Comma-separated list of bin tables:
-i, --bins methodA.scaffolds2bin,...,methodN.scaffolds2bin
Each table lists scaffold IDs and bin IDs separated by a tab ("\t"), for example:
Scaffold_1 bin.01
Scaffold_8 bin.01
Scaffold_42 bin.02
Scaffold_49 bin.03
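Because a table that uses spaces instead of tabs looks identical on screen, it can help to check the format before running DAS Tool. The sketch below uses a hypothetical function, check_scaffolds2bin, which is not part of DAS Tool. It assumes that every line must contain exactly two non-empty tab-separated fields and that a scaffold should appear only once per table.

```shell
# Hypothetical check (not part of DAS Tool).
# Reports lines that do not have exactly two non-empty tab-separated fields,
# and scaffolds that are listed more than once (assumed to be an error).
# Returns a non-zero exit status if any problem was found.
check_scaffolds2bin() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 tab-separated fields\n", NR
            bad = 1
            next
        }
        seen[$1]++ {
            printf "line %d: scaffold %s listed more than once\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, check_scaffolds2bin my_scaffolds2bin.tsv prints nothing and returns 0 for a well-formed table.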
Contigs
Contigs in FASTA format:
-c, --contigs contigs.fa
This is the assembly file that was used for binning, for example:
>Scaffold_1
ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA...
>Scaffold_2
GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG...
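Since the bin tables and the contigs file must refer to the same assembly, a quick consistency check can catch mismatched inputs early. The sketch below uses a hypothetical function, missing_scaffolds, which is not part of DAS Tool. It assumes that a scaffold ID corresponds to the FASTA header text after ">" up to the first whitespace.

```shell
# Hypothetical helper (not part of DAS Tool).
# Usage: missing_scaffolds TABLE CONTIGS_FASTA
# Prints every scaffold ID from TABLE that has no header in CONTIGS_FASTA.
# Assumption: the ID is the header text after ">" up to the first whitespace.
missing_scaffolds() {
    awk -F '\t' '
        FILENAME == ARGV[1] {
            # First file: the FASTA. Collect the ID of each header line.
            if (/^>/) {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)
                have[id] = 1
            }
            next
        }
        # Second file: the table. Report IDs not seen in the FASTA.
        !($1 in have) { print $1 }
    ' "$2" "$1"
}
```

An empty output from missing_scaffolds my_scaffolds2bin.tsv contigs.fa means every scaffold in the table was found in the FASTA file.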
(Optional) Proteins
Previously predicted protein sequences:
--proteins proteins.faa
The format is as follows:
>Scaffold_1_1
MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG...
>Scaffold_1_2
MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN...
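Output files
The output files include:
- A summary of the output bins, including quality and completeness estimates (_DASTool_Summary.txt).
- The scaffolds-to-bin table of the bins selected by DAS Tool (_DASTool_scaffolds2bin.txt): a TSV file without a header, with the contig name in the first column and the bin name in the second column, in the same format as the input tables.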
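- Optional:
  - If --write_bin_evals is set to 1 (default: 1), the quality and completeness estimates of each input bin set are written (_[method].eval).
  - If --create_plots is set to 1 (default: 1), plots of the number of high-quality bins and of the score distribution per method are created (_DASTool_hqBins.pdf, _DASTool_scores.pdf).
  - If --write_bins is set to 1 (default: 0), the bins are exported in FASTA format (DASTool_Bins).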
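Because the output scaffolds-to-bin table is a plain two-column TSV file, it is easy to summarize with standard tools. The sketch below counts the number of contigs assigned to each bin; bin_sizes is a hypothetical helper introduced here and is not part of DAS Tool.

```shell
# Hypothetical helper (not part of DAS Tool).
# Usage: bin_sizes SCAFFOLDS2BIN_TABLE
# Prints one line per bin: bin name, a tab, and the number of contigs in it.
bin_sizes() {
    awk -F '\t' '
        { n[$2]++ }
        END { for (b in n) printf "%s\t%d\n", b, n[b] }
    ' "$1" | sort
}
```

With the output prefix from Example 1, the table described above would be sample_output/DASToolRun1_DASTool_scaffolds2bin.txt.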
Detailed description
DAS_Tool -i methodA.scaffolds2bin,...,methodN.scaffolds2bin
-l methodA,...,methodN -c contigs.fa -o myOutput
-i, --bins Comma separated list of tab separated scaffolds to bin tables.
-c, --contigs Contigs in fasta format.
-o, --outputbasename Basename of output files.
-l, --labels Comma separated list of binning prediction names. (optional)
--search_engine Engine used for single copy gene identification [blast/diamond/usearch].
(default: usearch)
--write_bin_evals Write evaluation for each input bin set [0/1]. (default: 1)
--create_plots Create binning performance plots [0/1]. (default: 1)
--write_bins Export bins as fasta files [0/1]. (default: 0)
--proteins Predicted proteins in prodigal fasta format (>scaffoldID_geneNo).
Gene prediction step will be skipped if given. (optional)
--score_threshold Score threshold until selection algorithm will keep selecting bins [0..1].
(default: 0.5)
--duplicate_penalty Penalty for duplicate single copy genes per bin (weight b).
Only change if you know what you're doing. [0..3]
(default: 0.6)
--megabin_penalty Penalty for megabins (weight c). Only change if you know what you're doing. [0..3]
(default: 0.5)
--db_directory Directory of single copy gene database. (default: install_dir/db)
--resume Use existing predicted single copy gene files from a previous run [0/1]. (default: 0)
--debug Write debug information to log file.
-t, --threads Number of threads to use. (default: 1)
-v, --version Print version number and exit.
-h, --help Show this message.
Example 2: Run DAS Tool again with different parameters. Use the proteins predicted in Example 1 to skip the gene prediction step, disable writing of bin evaluations, set the number of threads to 2 and score threshold to 0.6. Output files will start with the prefix DASToolRun2:
$ ./DAS_Tool -i sample_data/sample.human.gut_concoct_scaffolds2bin.tsv,\
sample_data/sample.human.gut_maxbin2_scaffolds2bin.tsv,\
sample_data/sample.human.gut_metabat_scaffolds2bin.tsv,\
sample_data/sample.human.gut_tetraESOM_scaffolds2bin.tsv \
-l concoct,maxbin,metabat,tetraESOM \
-c sample_data/sample.human.gut_contigs.fa \
-o sample_output/DASToolRun2 \
--proteins sample_output/DASToolRun1_proteins.faa \
--write_bin_evals 0 \
--threads 2 \
--score_threshold 0.6
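Preparing input files
Not all binning tools output a tab-separated file of scaffold IDs and bin IDs. DAS Tool therefore also provides a helper script, Fasta_to_Contigs2Bin, which converts a set of bins in FASTA format into a "scaffolds2bin" table that can be used as DAS Tool input.
Usage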
$ src/Fasta_to_Contigs2Bin.sh -h
Fasta_to_Scaffolds2Bin: Converts genome bins in fasta format to scaffolds-to-bin table.
Usage: Fasta_to_Contigs2Bin.sh -e fasta > my_scaffolds2bin.tsv
-e, --extension Extension of fasta files. (default: fasta)
-i, --input_folder Folder with bins in fasta format. (default: ./)
-h, --help Show this message.
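Thanks to @Sophilingsky in the comments for pointing out that the script previously named Fasta_to_Scaffolds2Bin.sh, which has the same function, has been renamed to Fasta_to_Contigs2Bin.sh.
Example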
$ ls /maxbin/output/folder
maxbin.001.fasta maxbin.002.fasta maxbin.003.fasta...
$ src/Fasta_to_Contigs2Bin.sh -i /maxbin/output/folder -e fasta > maxbin.scaffolds2bin.tsv
$ head maxbin.scaffolds2bin.tsv
NODE_10_length_127450_cov_375.783524 maxbin.001
NODE_27_length_95143_cov_427.155298 maxbin.001
NODE_51_length_78315_cov_504.322425 maxbin.001
NODE_84_length_66931_cov_376.684775 maxbin.001
NODE_87_length_65653_cov_460.202156 maxbin.001
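To make the conversion concrete, the following is a rough sketch of what such a conversion amounts to. It is not the actual Fasta_to_Contigs2Bin.sh script; fasta_bins_to_table is a hypothetical function, and it assumes that the bin name is the file name without its extension and that the scaffold ID is the header text up to the first whitespace.

```shell
# Hypothetical re-implementation for illustration only;
# this is NOT the script shipped with DAS Tool.
# Usage: fasta_bins_to_table FOLDER [EXTENSION]   (EXTENSION defaults to "fasta")
# For every FASTA header in FOLDER/*.EXTENSION, prints "scaffold ID<TAB>bin name".
fasta_bins_to_table() {
    dir=$1
    ext=${2:-fasta}
    for f in "$dir"/*."$ext"; do
        [ -e "$f" ] || continue              # skip if the pattern matched nothing
        bin=$(basename "$f" ".$ext")         # e.g. maxbin.001.fasta -> maxbin.001
        awk -v bin="$bin" '
            /^>/ {
                id = substr($0, 2)           # drop the leading ">"
                sub(/[ \t].*/, "", id)       # keep text up to the first whitespace
                printf "%s\t%s\n", id, bin
            }
        ' "$f"
    done
}
```

Calling fasta_bins_to_table /maxbin/output/folder fasta > maxbin.scaffolds2bin.tsv would produce a table in the same two-column layout as shown above.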
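Issues
- The output directory DASTool_output/ has to be created manually; otherwise nothing is written when the run finishes.
- The following odd errors appeared: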
mv: cannot stat ‘DASTool_output/_proteins.faa.scg’: No such file or directory
mv: cannot stat ‘DASTool_output/_proteins.faa.scg’: No such file or directory
rm: cannot remove ‘DASTool_output/_proteins.faa.findSCG.b6’: No such file or directory
rm: cannot remove ‘DASTool_output/_proteins.faa.scg.candidates.faa’: No such file or directory
rm: cannot remove ‘DASTool_output/_proteins.faa.all.b6’: No such file or directory
The run succeeded after adding --search_engine diamond.
DAS_Tool \
-i MetaBat.scaffolds2bin.tsv,MaxBin.scaffolds2bin.tsv,CONCOCT.scaffolds2bin.tsv \
-l MetaBat,MaxBin,CONCOCT \
-c ../scaffold.fa -o DASTool_output/ \
--write_bins 1 --search_engine diamond --score_threshold 0 \
-t ${THREAD} \
--debug
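Since the output directory is not created automatically, it can be created from the -o prefix before the run. The sketch below uses a hypothetical helper, ensure_output_dir, which is not part of DAS Tool. It assumes that a prefix ending in "/" names a directory, and that otherwise everything before the last "/" is the directory part.

```shell
# Hypothetical helper (not part of DAS Tool).
# Usage: ensure_output_dir OUTPUT_PREFIX
# Creates the directory that the -o prefix points into, if it does not exist.
ensure_output_dir() {
    prefix=$1
    case "$prefix" in
        */)  dir=${prefix%/} ;;    # prefix is itself a directory, e.g. DASTool_output/
        */*) dir=${prefix%/*} ;;   # directory part of dir/basename
        *)   dir=. ;;              # bare basename: use the current directory
    esac
    [ -n "$dir" ] || dir=/         # the prefix was "/" or "/basename"
    mkdir -p "$dir"
}
```

For the command above, ensure_output_dir DASTool_output/ would be run first; for Example 1, ensure_output_dir sample_output/DASToolRun1 would create sample_output.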
[1] https://www.baidu.com/link?url=JbN0z_QhZbcz05SXOmXghq4KtVaCf00Tbp6YBX3qm3O6AB-yyFw2gN9XISe880jE3sylTvZ4mTI3k-XvDwzTg9D8mefZI0koVLxEVn_M6gk_jaRX6x8BXgfeRqsWaQmH&wd=&eqid=f554c6c4000ce067000000065eca2021 (DAS Tool for Genome Reconstruction from Metagenomes)
[2] https://doi.org/10.1038/s41564-018-0171-1 (Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, Matthias Hess, Susannah G. Tringe & Jillian F. Banfield (2018). Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nature Microbiology.)