VirSorter2: Installing and Using a Virome Sequence Analysis Tool (20231126)

小果运维

Last modified 2023-11-26 22:12:31

Category: 生信分析-bioinfo. Tags: VirSorter2, DRAMv, metagenomic viral gene extraction

First published 2023-11-26 21:43:44

Article link: https://blog.csdn.net/zrc_xiaoguo/article/details/134633181

Before using the tool, it is worth reading the papers that introduce it: VirSorter: mining viral signal from microbial genomic data [PeerJ]

VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses - PubMed

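If you can reach GitHub without trouble and read English comfortably, go straight to the repository. This post covers the installation and main usage of VirSorter2 based on what I actually use: GitHub - jiarong/VirSorter2: customizable pipeline to identify viral sequences from (meta)genomic data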

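1. Installation

Installing via bioconda

Installation uses mamba inside a conda setup. Note that the command is mamba, not conda; it is used exactly like conda and only the name differs, but you have to install and configure it yourself. Be careful here: install mamba in a fresh environment first, because installing it into an existing environment can break later updates and installs in that conda environment.

### Install VirSorter2 on its own; the official docs recommend the mamba command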
mamba create -n vs2 -c conda-forge -c bioconda virsorter=2
mamba activate vs2

### Install together with checkV and DRAMv to integrate gene annotation; skip this if you only use VirSorter2 on its own.
# install VirSorter2, checkV and DRAMv 
conda create -n viral-id-sop virsorter=2 checkv dram 
# activate env 
conda activate viral-id-sop

# vs2 db: db-vs2
virsorter setup -d db-vs2 -j 4

# checkv db: checkv-db-v1.0
checkv download_database .

# DRAMv: db-dramv
DRAM-setup.py prepare_databases --skip_uniref --output_dir db-dramv


########################################################
## The following is the officially recommended SOP workflow
#Run VirSorter2
#First, run VirSorter2 with a loose cutoff of 0.5 for maximal sensitivity. We are only interested in phages (dsDNA and ssDNA phage). A minimal length of 5000 bp is chosen since it is the minimum required by downstream viral classification. You can adjust the "-j" option based on the availability of CPU cores.
#Note that the "--keep-original-seq" option preserves the original sequence of circular and (near) fully viral contigs (score >0.8 as a whole sequence), and we pass them to checkV to trim possible host genes left at the ends and to handle duplicate segments of circular contigs.

virsorter run --keep-original-seq -i 5seq.fa -w vs2-pass1 --include-groups dsDNAphage,ssDNA --min-length 5000 --min-score 0.5 -j 28 all

#Run checkV
#There could be some non-viral sequences or regions in the VirSorter2 results with a minimal score cutoff of 0.5. Here we use CheckV to quality control the VirSorter2 results and also to trim potential host regions left at the ends of proviruses. You can adjust the "-t" option based on the availability of CPU cores.
#The "-d" path below comes from the SOP authors' system; replace it with the location of your own CheckV database.
checkv end_to_end vs2-pass1/final-viral-combined.fa checkv -t 28 -d /fs/project/PAS1117/jiarong/db/checkv-db-v1.0


#Run VirSorter2 again
#Then we run the checkV-trimmed sequences through VirSorter2 again to generate "affi-contigs.tab" files needed by DRAMv to identify AMGs. You can adjust the "-j" option based on the availability of CPU cores. Note the "--seqname-suffix-off" option preserves the original input sequence name since we are sure there is no chance of getting >1 proviruses from the same contig in this second pass, and the "--viral-gene-enrich-off" option turns off the requirement of having more viral genes than host genes to make sure that VirSorter2 is not doing any screening at this step. The above two options require VirSorter2 version >=2.2.1.

cat checkv/proviruses.fna checkv/viruses.fna > checkv/combined.fna
virsorter run --seqname-suffix-off --viral-gene-enrich-off --prep-for-dramv -i checkv/combined.fna -w vs2-pass2 --include-groups dsDNAphage,ssDNA --min-length 5000 --min-score 0.5 -j 28 all

#Run DRAMv
#Then run DRAMv to annotate the identified sequences, which can be used for manual curation. You can adjust the "--threads" option based on availability of CPU cores.

# step 1 annotate
DRAM-v.py annotate -i vs2-pass2/for-dramv/final-viral-combined-for-dramv.fa -v vs2-pass2/for-dramv/viral-affi-contigs-for-dramv.tab -o dramv-annotate --skip_trnascan --threads 28 --min_contig_size 1000
# step 2 summarize annotations
DRAM-v.py distill -i dramv-annotate/annotations.tsv -o dramv-distill
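
The SOP above is five tool invocations plus one file concatenation. As a minimal sketch, the steps can be written down as argument lists in Python so that the thread count and input paths are set in one place. The function names build_sop_commands and concat_files are hypothetical helpers made up for this illustration; the flags and paths are copied from the commands above, and nothing is executed unless you pass the lists to subprocess yourself.

```python
import shlex
from typing import List


def build_sop_commands(input_fa: str, checkv_db: str, threads: int = 4) -> List[List[str]]:
    """Return the SOP steps as argv lists, in execution order.

    Hypothetical helper: it only assembles the commands shown in the SOP.
    """
    jobs = str(threads)
    # Options shared by both VirSorter2 passes.
    common = [
        "--include-groups", "dsDNAphage,ssDNA",
        "--min-length", "5000", "--min-score", "0.5",
        "-j", jobs, "all",
    ]
    pass1 = ["virsorter", "run", "--keep-original-seq",
             "-i", input_fa, "-w", "vs2-pass1"] + common
    checkv = ["checkv", "end_to_end", "vs2-pass1/final-viral-combined.fa",
              "checkv", "-t", jobs, "-d", checkv_db]
    pass2 = ["virsorter", "run", "--seqname-suffix-off",
             "--viral-gene-enrich-off", "--prep-for-dramv",
             "-i", "checkv/combined.fna", "-w", "vs2-pass2"] + common
    annotate = ["DRAM-v.py", "annotate",
                "-i", "vs2-pass2/for-dramv/final-viral-combined-for-dramv.fa",
                "-v", "vs2-pass2/for-dramv/viral-affi-contigs-for-dramv.tab",
                "-o", "dramv-annotate", "--skip_trnascan",
                "--threads", jobs, "--min_contig_size", "1000"]
    distill = ["DRAM-v.py", "distill",
               "-i", "dramv-annotate/annotations.tsv", "-o", "dramv-distill"]
    return [pass1, checkv, pass2, annotate, distill]


def concat_files(sources: List[str], dest: str) -> None:
    """Python equivalent of `cat a b > dest`, used between checkV and pass 2."""
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as handle:
                out.write(handle.read())


if __name__ == "__main__":
    # The database path is a placeholder; point it at your own CheckV database.
    for cmd in build_sop_commands("5seq.fa", "/path/to/checkv-db-v1.0", threads=28):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually run the workflow, each list can be handed to subprocess.run(cmd, check=True), with concat_files(["checkv/proviruses.fna", "checkv/viruses.fna"], "checkv/combined.fna") called after the checkV step and before the second VirSorter2 pass.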

Manual installation of the development version (the one with the latest features)

mamba create -n vs2 -c conda-forge -c bioconda "python>=3.6,<=3.10" scikit-learn=0.22.1 imbalanced-learn pandas seaborn hmmer==3.3 prodigal screed ruamel.yaml "snakemake>=5.18,<=5.26" click "conda-package-handling<=1.9"
mamba activate vs2
git clone https://github.com/jiarong/VirSorter2.git
cd VirSorter2
pip install -e .

If GitHub is hard to reach, the latest version of VirSorter2 can also be downloaded here: https://download.csdn.net/download/zrc_xiaoguo/88571289?spm=1001.2014.3001.5501

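2. Configuring the database

Automatic setup

### On a fresh installation this command downloads and configures the database into the db directory automatically. An absolute path is recommended for the db directory; -j sets the number of threads.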
virsorter setup -d db -j 4

### If this is not the first attempt, or a previous attempt failed, delete the directory first, then download and configure again
rm -rf db
virsorter setup -d db -j 4

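Manual setup

### First download the database archive db.tgz; opening this URL in a browser also starts the download: https://osf.io/v46sc/download
wget -O db.tgz https://osf.io/v46sc/download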
tar -xzf db.tgz
virsorter config --init-source --db-dir=./db

3. Using the tool

One caveat first, quoted from the official documentation below: it is best to process the results with CheckV to remove potential host genes from the predicted viral regions.

VirSorter2 tends to sometimes overestimate the size of viral sequence during provirus extraction procedure in order to achieve better sensitivity. We recommend cleaning these provirus predictions to remove potential host genes on the edge of the predicted viral region, e.g. using a tool like CheckV (Bitbucket).

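Quick start

# Fetch a small test file; you can of course test with your own sequence file
wget -O test.fa https://raw.githubusercontent.com/jiarong/VirSorter2/master/test/8seq.fa
# After activating the vs2 environment, run with 4 threads (-j 4); the all target produces all results, and -w sets the output directory
virsorter run -w test.out -i test.fa --min-length 1500 -j 4 all
# Inspect the output directory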
ls test.out

### The main files in the output directory are final-viral-combined.fa, final-viral-score.tsv, and final-viral-boundary.tsv; the official description follows
Due to the large HMM database that VirSorter2 uses, this small dataset takes a few mins to finish. In the output directory (test.out), three files are useful:

final-viral-combined.fa: identified viral sequences
final-viral-score.tsv: table with score of each viral sequences across groups and a few more key features, which can be used for further filtering
final-viral-boundary.tsv: table with boundary information; this is an intermediate file that 1) might have extra records compared to the other two files, which should be ignored; 2) does not include the viral sequences with < 2 genes but >= 1 hallmark gene; 3) has group and trim_pr columns that are intermediate results and might not match max_group and max_score, respectively, in final-viral-score.tsv

Quality control:

The default score cutoff (0.5) works well for known viruses (RefSeq). For real environmental data, we can expect to get false positives (non-viral) with the default cutoff. Generally, samples with more host (e.g. bulk metaG) and unknown sequences (e.g. soil) tend to have more false positives. We find that a score cutoff of 0.9 works well for high confidence hits, but there are also many viral hits with score <0.9. It is difficult to separate the viral and non-viral hits by score alone. So we recommend using the default score cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV. Here is a tutorial of viral identification SOP used in Sullivan Lab.

Detailed description on output files

final-viral-combined.fa
identified viral sequences, including three types:
- full sequences identified as viral (identified with suffix ||full);
- partial sequences identified as viral (identified with suffix ||{i}_partial); here {i} can be numbers starting from 0 to max number of viral fragments found in that contig;
- short (less than two genes) sequences with hallmark genes identified as viral (identified with suffix ||lt2gene);
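
The three suffixes above can be separated from the original contig name with a small parser. This is a minimal sketch, not part of VirSorter2: split_vs2_name is a hypothetical helper, and it assumes it receives only the sequence identifier (the FASTA header without the leading ">" and without any trailing description).

```python
import re
from typing import Optional, Tuple

# Matches "<name>||full", "<name>||lt2gene" or "<name>||<i>_partial".
_SUFFIX = re.compile(r"^(?P<name>.*)\|\|(?P<kind>full|lt2gene|(?P<idx>\d+)_partial)$")


def split_vs2_name(seqname: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Split a VirSorter2 output name into (original name, type, fragment index).

    The type is "full", "partial", "lt2gene", or None when no suffix is present.
    The fragment index is only set for partial sequences.
    """
    match = _SUFFIX.match(seqname)
    if match is None:
        return seqname, None, None
    if match.group("idx") is not None:
        return match.group("name"), "partial", int(match.group("idx"))
    return match.group("name"), match.group("kind"), None


if __name__ == "__main__":
    for name in ["contig_1||full", "contig_2||0_partial", "contig_3||lt2gene"]:
        print(split_vs2_name(name))
```

Names without a recognized suffix are returned unchanged, which is what you get when a run uses the --seqname-suffix-off option mentioned in the SOP.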
final-viral-score.tsv
This table can be used for further screening of results. It includes the following columns:
- sequence name
- score of each viral sequences across groups (multiple columns)
- max score across groups
- max score group
- contig length
- hallmark gene count
- viral gene %
- nonviral gene %
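
As a rough illustration of using this table for further screening, the sketch below sorts rows into tiers by their maximum score, following the guidance that scores above 0.9 are high confidence and scores between 0.5 and 0.9 need a quality check. The column names seqname, max_score and max_score_group, and the sample rows, are assumptions made for this example; check the header line of your own final-viral-score.tsv and adjust score_col if it differs.

```python
import csv
import io
from typing import Dict, Iterable, List

# Made-up rows for illustration; the column names are assumptions.
SAMPLE = (
    "seqname\tmax_score\tmax_score_group\n"
    "contig_1||full\t0.97\tdsDNAphage\n"
    "contig_2||0_partial\t0.62\tssDNA\n"
    "contig_3||full\t0.41\tdsDNAphage\n"
)


def read_score_table(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read a tab-separated score table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def tier_by_score(rows: Iterable[Dict[str, str]], score_col: str = "max_score",
                  high: float = 0.9, low: float = 0.5) -> Dict[str, List[str]]:
    """Group sequence names into tiers by their maximum score."""
    tiers: Dict[str, List[str]] = {"high_confidence": [], "needs_qc": [], "below_cutoff": []}
    for row in rows:
        score = float(row[score_col])
        if score > high:
            tiers["high_confidence"].append(row["seqname"])
        elif score >= low:
            tiers["needs_qc"].append(row["seqname"])
        else:
            tiers["below_cutoff"].append(row["seqname"])
    return tiers


if __name__ == "__main__":
    rows = read_score_table(io.StringIO(SAMPLE))
    for tier, names in tier_by_score(rows).items():
        print(tier, names)
```

For a real run, replace io.StringIO(SAMPLE) with open("test.out/final-viral-score.tsv", newline=""). The tiers are only a first pass; the needs_qc group is the one that benefits most from the checkV step.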

NOTE

Note that classifiers of different viral groups are not exclusive from each other, and may have overlap in their target viral sequence space, which means this information should not be used or considered as reliable taxonomic classification. We limit the purpose of VirSorter2 to viral identification only.

final-viral-boundary.tsv
only some of the columns in this file might be useful:
- seqname: original sequence name
- trim_orf_index_start, trim_orf_index_end: start and end ORF index on original sequence of identified viral sequence
- trim_bp_start, trim_bp_end: start and end position on original sequence of identified viral sequence
- trim_pr: score of final trimmed viral sequence
- partial: full sequence as viral or partial sequence as viral; this is defined when a full sequence has score > score cutoff, it is full (0), or else any viral sequence extracted within it is partial (1)
- pr_full: score of the original sequence
- hallmark_cnt: hallmark gene count
- group: the classifier of viral group that gives high score; this should NOT be used as reliable classification

Training customized classifiers

VirSorter2 currently has classifiers of five viral groups (dsDNAphage, NCLDV, RNA, ssDNA virus, and lavidaviridae). It is designed for easy addition of more classifiers. The information of classifiers is stored in the database (-d) specified during the setup step. Each viral group needs the four files below:

- model: random forest classifier model for identifying viral sequences
- customized.hmm (optional): a collection of viral HMMs for gene annotation; if not specified, the one in db/hmm/viral/combined.hmm is used
- hallmark-gene.list (optional): names of hallmark gene HMMs in the above viral HMM database file; these hallmark gene HMMs can be collected by literature search or identified by comparing hallmark gene sequences (protein) against the HMM database above with hmmsearch; if not specified, no hallmark genes are counted in the feature table
- rbs-prodigal-train.db (optional): prodigal RBS (ribosomal binding site) motif training model; this can be produced with the -t option in prodigal; this is a useful feature for NCLDV because their large genome size provides enough sequence for training; for other viral groups, it is OK to skip this file

In this tutorial, I will show how to make model for the autolykiviridae family.

First, prepare the dataset needed: 1) high quality viral genomes; 2) protein sequences of the hallmark gene. Then install two more dependencies.

# download genome sequences
wget https://github.com/jiarong/small-dataset/raw/master/autolyki/vibrio_autolyki.fna.gz -O autolyki.fna.gz
# download hallmark gene seqs
wget https://raw.githubusercontent.com/jiarong/small-dataset/master/autolyki/DJR.fa -O DJR.fa
# download source code
git clone https://github.com/jiarong/VirSorter2.git
# install two more dependencies
conda install -c bioconda -y screed hmmer

Then identify hallmark gene HMMs by protein sequences of hallmark genes

Note that we will need the VirSorter2 database here. If you skipped the tutorial above, you can download the database with virsorter setup -d db -j 4. This will take 10+ mins.

# compare all HMMs and protein sequences of hallmark gene
# this will take 10+ mins due to large hmm database file
hmmsearch -T 50 --tblout DJR.hmmtbl --cpu 4 -o /dev/null db/hmm/viral/combined.hmm DJR.fa
# get names of HMMs that are significant hits with protein sequences of hallmark gene
python VirSorter2/virsorter/scripts/prepdb-train-get-seq-from-hmm-domtbl.py 50 DJR.hmmtbl > hallmark-gene.list

With hallmark-gene.list and the high quality genomes autolyki.fna.gz, you can train the features that are used for the classifier model.

# train feature file
virsorter train-feature --seqfile autolyki.fna.gz --hallmark hallmark-gene.list --hmm db/hmm/viral/combined.hmm --frags-per-genome 5 --jobs 4 -w autolyki-feature.out 
# check output
ls autolyki-feature.out

In the output directory (autolyki-feature.out), all.pdg.ftr is the feature file needed for next step.

To make the classifier model, we also need a feature file from cellular organisms. This can be done by collecting genomes from cellular organisms and repeating the above step. Note that the number of cellular genomes is very large (>200K). Here I will re-use the feature file I have prepared before.

# fetch feature file for cellular organisms
wget "https://zenodo.org/record/3823805/files/nonviral-common-random-fragments.ftr.gz?download=1" -O nonviral.ftr.gz
gzip -d nonviral.ftr.gz
# train the classifier model
virsorter train-model --viral-ftrfile autolyki-feature.out/all.pdg.ftr --nonviral-ftrfile nonviral.ftr --balanced --jobs 4 -w autolyki-model.out

In autolyki-model.out, feature-importances.tsv shows the importance of each feature used, and model is the classifier model we need. Put the model and hallmark-gene.list in the database directory, following the layout of the existing viral groups. Note that only letters are allowed in the group directory name under db/group/.

### attention: only letters (both upper and lower case) are allowed in group names
mkdir db/group/autolykiviridae
cp autolyki-model.out/model db/group/autolykiviridae
cp hallmark-gene.list db/group/autolykiviridae/
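
Because only letters are allowed in group names, it can help to check a name before creating the directory. The sketch below is one way to do that and then perform the same copy steps as the commands above. Both is_valid_group_name and install_group are hypothetical helpers written for this post, not VirSorter2 functions, and the check assumes "letters" means ASCII A-Z and a-z.

```python
import re
import shutil
from pathlib import Path


def is_valid_group_name(name: str) -> bool:
    """Return True when the name consists of ASCII letters only (either case)."""
    return re.fullmatch(r"[A-Za-z]+", name) is not None


def install_group(db_dir: str, name: str, model: str, hallmark_list: str) -> Path:
    """Create db/group/<name> and copy the model and hallmark list into it."""
    if not is_valid_group_name(name):
        raise ValueError(f"group name must contain letters only: {name!r}")
    group_dir = Path(db_dir) / "group" / name
    group_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(model, group_dir)          # keeps the file name "model"
    shutil.copy(hallmark_list, group_dir)  # keeps "hallmark-gene.list"
    return group_dir


if __name__ == "__main__":
    for candidate in ["autolykiviridae", "dsDNAphage", "autolyki_2", "group-1"]:
        print(candidate, is_valid_group_name(candidate))
```

With the files from this tutorial, the call would be install_group("db", "autolykiviridae", "autolyki-model.out/model", "hallmark-gene.list").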

Now you can try this new classifier on the testing dataset, and compare with dsDNAphage classifier:

# download the testing dataset
wget -O test.fa https://raw.githubusercontent.com/jiarong/VirSorter2/master/test/8seq.fa
# identify viral sequences in testing dataset; it takes 10+ mins;
virsorter run -w autolyki-model-test.out -i test.fa --include-groups "dsDNAphage,autolykiviridae" -j 4 --min-score 0.8 all
# check the scores in two classifiers
cat autolyki-model-test.out/final-viral-score.tsv

FAQ

Q: How should I pick a score cutoff?

A: Generally, those with score >0.9 are high confidence. Those with score between 0.5 and 0.9 could be a mixture of viral and non-viral. It is hard to find an optimal score separating viral and non-viral since it depends on the % of host sequences and unknown sequences. So we recommend using the default cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV to remove false positives (in addition to predicting completeness). Here is the viral identification SOP in the Sullivan Lab: https://www.protocols.io/view/viral-sequence-identification-sop-with-virsorter2-btv8nn9w. I again recommend the official SOP workflow, which is reproduced in the installation section above, as the reference to follow.

Q: Why does virsorter work when running interactively but not when submitted as a batch script (e.g. showing error `No module named screed`)?

A: This is usually caused by the incompatibility between two different package/environment managing tools: Modules (module load) and conda (conda activate). There are two solutions: 1) install conda on your own (user level) instead of using the system conda, thus avoiding module load; 2) sometimes server system admins discourage users from installing conda at user level. If so, you can remove the module load or module use lines from batch scripts, and instead run them interactively in the terminal to load the necessary packages before submitting the batch scripts.

Q: Why is there installation error with macOS?

A: macOS is not supported currently. VirSorter2 runs are typically computationally expensive, and should be run on servers (typically Linux). VirSorter2 leverages large viral protein HMM reference databases to achieve its high sensitivity, and the flip side is that it is computationally expensive.

Q: How can I speed up the run?

A: Here are a few ways: 1) use more CPU cores (-j); 2) filter your contigs on length (>1500 or >5000) with --min-length; 3) reduce the viral groups in --include-groups. For most people interested in phage, only dsDNAphage and ssDNA are needed, which is the default since version 2.2; 4) increase the threads for hmmsearch (the default is 2) with virsorter config --set HMMSEARCH_THREADS=4. Usually the IO is the bottleneck, not the CPU, though.

Q: How can I tell if an identified viral sequence is provirus?

A: Only partially viral sequences (ending with _partial) can be confirmed as provirus. Fully viral sequences (ending with ||full) in VirSorter2 are defined as contigs with significant viral signal (score >=0.8) as a whole sequence. Thus some could be provirus too: 1) they could be a fragment from within a provirus; 2) the whole sequence has strong viral signal in spite of some host genes at the ends.

Q: Why are there host genes left at ends of predicted viral sequences?

A: The provirus boundary detection algorithm in VirSorter2 tends to overextend into host regions. VirSorter2 estimates boundaries by looking at the peak score of sub-sequences and then overextends a bit (within 95% of peak score by default). This is a design decision so that predicted viral sequences can be further passed to a more specialized provirus boundary prediction tool.
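
To make the "within 95% of peak score" idea concrete, here is a toy sketch. It is not VirSorter2's implementation: how candidate sub-sequences are enumerated and scored is not described in this post, so the example simply assumes a list of already-scored candidate segments (start gene index, end gene index, score) and picks the longest one whose score is at least 95% of the best score. The function name pick_overextended_segment and the numbers are made up for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (start gene index, end gene index, score)


def pick_overextended_segment(segments: List[Segment], fraction: float = 0.95) -> Segment:
    """Pick the longest candidate whose score is within `fraction` of the peak score."""
    if not segments:
        raise ValueError("no candidate segments given")
    peak = max(score for _, _, score in segments)
    threshold = fraction * peak
    eligible = [seg for seg in segments if seg[2] >= threshold]
    # Prefer the longest segment; break ties by the higher score.
    return max(eligible, key=lambda seg: (seg[1] - seg[0], seg[2]))


if __name__ == "__main__":
    candidates = [
        (10, 20, 0.98),  # tight segment with the peak score
        (8, 22, 0.95),   # slightly longer, still within 95% of the peak
        (5, 25, 0.80),   # longest, but too far below the peak
    ]
    print(pick_overextended_segment(candidates))
```

In this example the chosen segment is the middle one: it scores a little lower than the peak but is longer, which is the kind of overextension into flanking genes that a tool like checkV is then expected to trim.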

Q: Why am I getting many false positives (non-viral sequences)?

A: The default score cutoff (0.5) has high sensitivity but also brings in many non-viral sequences. For phages, we recommend using checkV to remove those non-viral sequences following the viral identification SOP. See more details in the answer to "How should I pick a score cutoff?".

Q: Why are fully viral sequences (ending with ||full) trimmed?

A: There are three situations in which a fully viral sequence can be trimmed. 1) VirSorter2 is based on genes called by prodigal. A few bases of overhang beyond the first and last gene are ignored by prodigal and also ignored by VirSorter2 by default; 2) Circular sequences are usually split in the middle of a gene and have duplicate segments. VirSorter2 trims the duplicate segments and fixes the split gene by moving the partial gene from the start to the end; 3) "fully viral" only means the whole sequence has significant viral signal (score >=0.8 by default, consistent with the definition given earlier), but VirSorter2 still applies an end trimming step (10% of genes on each end) to find the optimal viral segment (longest within 95% of peak score by default). Again, the "full" sequences trimmed by the end trimming step should not be interpreted as provirus, since genes that have low impact on the score, such as unknown genes or genes shared by host and virus, could be trimmed. If you prefer the full sequences (ending with ||full) not to be trimmed and want to leave that to specialized tools such as checkV, you can use the --keep-original-seq option.