最新Virsorter2-病毒组序列分析工具安装及使用20241126，真牛皮

2401_84140140

于 2024-05-12 19:56:39 发布

阅读量238

点赞数 3

分类专栏：程序员文章标签：运维学习面试

本文链接：https://blog.csdn.net/2401_84140140/article/details/138765167

版权

程序员专栏收录该内容

147 篇文章 1 订阅

订阅专栏

最全的Linux教程，Linux从入门到精通

======================

linux从入门到精通(第2版)
Linux系统移植
Linux驱动开发入门与实战
LINUX 系统移植第2版
Linux开源网络全栈详解从DPDK到OpenFlow

华为18级工程师呕心沥血撰写3000页Linux学习笔记教程

第一份《Linux从入门到精通》466页

====================

内容简介

====

本书是获得了很多读者好评的Linux经典畅销书**《Linux从入门到精通》的第2版**。本书第1版出版后曾经多次印刷，并被51CTO读书频道评为“最受读者喜爱的原创IT技术图书奖”。本书第﹖版以最新的Ubuntu 12.04为版本，循序渐进地向读者介绍了Linux 的基础应用、系统管理、网络应用、娱乐和办公、程序开发、服务器配置、系统安全等。本书附带1张光盘，内容为本书配套多媒体教学视频。另外,本书还为读者提供了大量的Linux学习资料和Ubuntu安装镜像文件，供读者免费下载。

华为18级工程师呕心沥血撰写3000页Linux学习笔记教程

本书适合广大Linux初中级用户、开源软件爱好者和大专院校的学生阅读，同时也非常适合准备从事Linux平台开发的各类人员。

需要《Linux入门到精通》、《linux系统移植》、《Linux驱动开发入门实战》、《Linux开源网络全栈》电子书籍及教程的工程师朋友们劳烦您转发+评论

网上学习资料一大堆，但如果学到的知识不成体系，遇到问题时只是浅尝辄止，不再深入研究，那么很难做到真正的技术提升。

需要这份系统化的资料的朋友，可以点击这里获取！

一个人可以走的很快，但一群人才能走的更远！不论你是正从事IT行业的老鸟或是对IT行业感兴趣的新人，都欢迎加入我们的的圈子（技术交流、学习资源、职场吐槽、大厂内推、面试辅导），让我们一起学习成长！

更多参数设置说明：

choosing viral groups (`--include-groups`)

The available options are dsDNAphage, NCLDV, RNA, ssDNA virus, and lavidaviridae. The default is dsDNAphage and ssDNA (changed from all groups since version 2.2), suitable for those only interested in phage. If you are only interested in RNA virus, you can run:

rm -rf test.out
virsorter run -w test.out -i test.fa --include-groups RNA -j 4 all

re-run with different score cutoff (`--min-score` and `--classify`)

VirSorter2 takes one positional argument, all or classify. The default is all, which means running the whole pipeline, including 1) preprocessing, 2) annotation (feature extraction), and 3) classification. The main computational bottleneck is the annotation step, taking about 95% of CPU time. In case you just want to re-run with different score cutoff (–min-score), classify argument can skip the annotation steps, and only re-run only the classify step.

virsorter run -w test.out -i test.fa --include-groups "dsDNAphage,ssDNA" -j 4 --min-score 0.8 classify

The above overwrites the previous final output files. If you want to keep previous results, you can use --label to add a prefix to the new final output files.

virsorter run -w test.out -i test.fa --include-groups "dsDNAphage,ssDNA" -j 4 --min-score 0.9 --label rerun classify

speed up a run (`--provirus-off`)

In case you need to have some results quickly, there are two options: 1) turn off provirus step with --provirus-off; this reduces sensitivity on sequences that are only partially viral; 2) subsample ORFs from each sequence with --max-orf-per-seq; This option subsamples ORFs if a sequence has more ORFs than the number provided. Note that this option is only availale when --provirus-off is used.

rm -rf test.out
virsorter run -w test.out -i test.fa --provirus-off --max-orf-per-seq 20 all

其他选项

You can run virsorter run -h to see all options. VirSorter2 is a wrapper around snakemake, a great pipeline management tool designed for reproducibility, and running on computer clusters. All snakemake options still work with VirSorter2, and users can simply append those snakemake option to virsorter options (after all or classify). For example, the --forceall snakemake option can be used to re-run the pipeline.

virsorter run -w test.out -i test.fa --provirus-off --max-orf-per-seq 20 all --forceall

When you re-run any VirSorter2 command, it will pick up at the step (rule in snakemake term) where it stopped last time. It will do nothing if it succeeded last time. The --forceall option can be used to enforce the re-run.

DRAMv compatibility

点这里查看DRAMv工具的使用：DRAM（Distilling and Refining Annotations of Metabolism，提取和精练代谢注释）工具安装和使用-CSDN博客文章浏览阅读251次。默认使用conda安装吧，也建议使用conda，pip安装其实都差不多，但容易破坏当前环境。所以其实这里还需要安装virsorter软件，virsorter的操作使用后面再补上。基于virsorter的结果进行注释，相当于过滤了。看不懂的百度翻译吧，反正需求内存不小。基因组完整注释和提取。外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传 https://blog.csdn.net/zrc_xiaoguo/article/details/134578766?spm=1001.2014.3001.5502

DRAMv is a tool for annotating viral contigs identified by VirSorter. It needs two input files from VirSorter: 1) viral contigs, 2) affi-contigs.tab that have info on viral/nonviral and hallmark genes along contigs. In VirSorter2, these files can be generated by --prep-for-dramv flag.

rm -rf test.out
virsorter run --prep-for-dramv -w test.out -i test.fa -j 4 all
ls test.out/for-dramv

Detailed description on output files

final-viral-combined.fa

identified viral sequences, including two types:

full sequences identified as viral (identified with suffix ||full);
partial sequences identified as viral (identified with suffix ||{i}_partial); here {i} can be numbers starting from 0 to max number of viral fragments found in that contig;
short (less than two genes) sequences with hallmark genes identified as viral (identified with suffix ||lt2gene);

final-viral-score.tsv

This table can be used for further screening of results. It includes the following columns:

sequence name
score of each viral sequences across groups (multiple columns)
max score across groups
max score group
contig length
hallmark gene count
viral gene %
nonviral gene %

NOTE

Note that classifiers of different viral groups are not exclusive from each other, and may have overlap in their target viral sequence space, which means this information should not be used or considered as reliable taxonomic classification. We limit the purpose of VirSorter2 to viral idenfication only.

final-viral-boundary.tsv

only some of the columns in this file might be useful:

seqname: original sequence name
trim_orf_index_start, trim_orf_index_end: start and end ORF index on orignal sequence of identified viral sequence
trim_bp_start, trim_bp_end: start and end position on orignal sequence of identified viral sequence
trim_pr: score of final trimmed viral sequence
partial: full sequence as viral or partial sequence as viral; this is defined when a full sequence has score > score cutoff, it is full (0), or else any viral sequence extracted within it is partial (1)
pr_full: score of the original sequence
hallmark_cnt: hallmark gene count
group: the classifier of viral group that gives high score; this should NOT be used as reliable classification

NOTE

VirSorter2 tends to sometimes overestimate the size of viral sequence during provirus extraction procedure in order to achieve better sensitity. We recommend cleaning these provirus predictions to remove potential host genes on the edge of the predicted viral region, e.g. using a tool like CheckV (Bitbucket).

Training customized classifiers

VirSorter2 currently has classifiers of five viral groups (dsDNAphage, NCLDV, RNA, ssNA virus, and lavidaviridae). It’s designed for easy addition of more classifiers. The information of classifiers are store in the database (-d) specified during setup step. For each viral group, it needs four files below:

model

random forest classifier model for identifying viral sequences

customized.hmm (optional)

a collection of viral HMMs for gene annotation; if not specified, the one in db/hmm/viral/combined.hmm is used.

hallmark-gene.list (optional)

names of hallmark gene hmm in the above viral hmm database file; These hallmark gene hmms can be collected by literature search or identified by comparing hallmark gene sequences (protein) against HMMs database above with hmmsearch; if not specified, no hallmark genes are counted in feature table

rbs-prodigal-train.db (optional)

prodigal RBS (ribosomal binding site) motif training model; this can be produced with -t option in prodigal; This is useful feature for NCLDV due to large genome size for training; For other viral groups, it’s OK to skip this file.

In this tutorial, I will show how to make model for the autolykiviridae family.

First, prepare the dataset needed: 1) high quality viral genomes 2) protein sequence of hallmark gene; and install two more dependecies.

# download genome sequences
wget https://github.com/jiarong/small-dataset/raw/master/autolyki/vibrio_autolyki.fna.gz -O autolyki.fna.gz
# download hallmark gene seqs
wget https://raw.githubusercontent.com/jiarong/small-dataset/master/autolyki/DJR.fa -O DJR.fa
# download source code
git clone https://github.com/jiarong/VirSorter2.git
# install two more dependencies
conda install -c bioconda -y screed hmmer

Then identify hallmark gene HMMs by protein sequences of hallmark genes

Note that we will need the VirSorter2 database here. If you skip the tutorial above, you can download the database by virsorter setup -d db -j 4. This will take 10+ mins.

# compare all HMMs and protein sequences of hallmark gene
# this will take 10+ mins due to large hmm database file
hmmsearch -T 50 --tblout DJR.hmmtbl --cpu 4 -o /dev/null db/hmm/viral/combined.hmm DJR.fa
# get HMMs names that are signicant hits with protein sequences of hallmark gene
python VirSorter2/virsorter/scripts/prepdb-train-get-seq-from-hmm-domtbl.py 50 DJR.hmmtbl > hallmark-gene.list

With hallmark-gene.list and the high quality genomes autolyki.fna.gz, you can train the features that are used for the classifier model.

# train feature file
virsorter train-feature --seqfile autolyki.fna.gz --hallmark hallmark-gene.list --hmm db/hmm/viral/combined.hmm --frags-per-genome 5 --jobs 4 -w autolyki-feature.out 
# check output
ls autolyki-feature.out

In the output directory (autolyki-feature.out), all.pdg.ftr is the feature file needed for next step.

To make the classifier model, we also need a feature file from cellular organisms. This can be done by collecting genomes from cellular organisms and repeat the above step. Note number of cellular genomes are very large (>200K). Here I will re-use the feature file I have prepared before.

# fetch feature file for cellular organisms
wget https://zenodo.org/record/3823805/files/nonviral-common-random-fragments.ftr.gz?download=1 -O nonviral.ftr.gz
gzip -d nonviral.ftr.gz
# train the classifier model
virsorter train-model --viral-ftrfile autolyki-feature.out/all.pdg.ftr --nonviral-ftrfile nonviral.ftr --balanced --jobs 4 -w autolyki-model.out

In autolyki-model.out, feature-importances.tsv shows the importance of each feature used. model is the classifier model we need. Then put the model and hallmark-gene.list in database directory as the existing viral groups. Note that only letters are allowed for group directory under db/group/.

### attention: only letters (both upper and lower case) are allowed in group names
mkdir db/group/autolykiviridae
cp autolyki-model.out/model db/group/autolykiviridae
cp hallmark-gene.list db/group/autolykiviridae/

Now you can try this new classifier on the testing dataset, and compare with dsDNAphage classifier:

# download the testing dataset
wget -O test.fa https://raw.githubusercontent.com/jiarong/VirSorter2/master/test/8seq.fa
# identify viral sequences in testing dataset; it takes 10+ mins;
virsorter run -w autolyki-model-test.out -i test.fa --include-groups "dsDNAphage,autolykiviridae" -j 4 --min-score 0.8 all
# check the scores in two classifiers
cat autolyki-model-test.out/final-viral-score.tsv

FAQ

Q: How should I pick a score cutoff?

A: Generally, those with score >0.9 are high confidence. Those with score between 0.5 and 0.9 could be a mixture of viral and non-viral. It’s hard to find a optimal score separating viral and non-viral since it depends on % of host sequence and unknown sequences. So we recommend using the default cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV to for removing false positives (other than predicting completeness). Here is the viral identification SOP in the Sullivan Lab icon-default.png?t=N7T8 https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-btv8nn9w.这里还是给大家推荐一下官方的SOP分析流程，前面已经写了，大家参考使用即可