InterProScan Installation and Debugging

道曼曼

Last modified: 2023-06-27 22:59:44

Tags: python

First published: 2023-06-22 10:51:23

Article link: https://blog.csdn.net/Kun_98/article/details/131338662

Requirements

64-bit Linux
Perl 5 (default on most Linux distributions)
Python 3 (InterProScan 5.30-69.0 onwards)
Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
Environment variables set
- $JAVA_HOME should point to the location of the JVM
- $JAVA_HOME/bin should be added to the $PATH
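
As a minimal sketch of the two environment settings above, the lines below could go into ~/.bashrc or a similar startup file. The JDK path /usr/lib/jvm/java-11-openjdk is only an assumed location and will differ between distributions; an existing JAVA_HOME is left untouched.

```shell
# Assumed JDK location (hypothetical path); adjust to where Java 11 is installed.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-11-openjdk}"

# Prepend the JVM's bin directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$JAVA_HOME/bin:"*) ;;
  *) export PATH="$JAVA_HOME/bin:$PATH" ;;
esac
```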

Checking the configuration

perl -version
python3 --version
java -version
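
The version commands above only print what is installed. A small sketch like the following could also check that Java meets the version 11 requirement; `java_major` is a hypothetical helper name, and the parsing assumes the usual `java -version` layout in which the version is the first quoted string.

```shell
# Extract the major Java version from a string such as "11.0.19" or "1.8.0_371".
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_371 -> 8
    *)   echo "$1" | cut -d. -f1 ;;   # modern scheme: 11.0.19 -> 11
  esac
}

# `java -version` writes to stderr, so redirect it before parsing.
if command -v java >/dev/null 2>&1; then
  ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  if [ "$(java_major "$ver")" -ge 11 ] 2>/dev/null; then
    echo "Java $ver OK"
  else
    echo "Java ${ver:-unknown} does not satisfy the Java 11 requirement" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```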

Getting the software

mkdir my_interproscan
cd my_interproscan
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.62-94.0/interproscan-5.62-94.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.62-94.0/interproscan-5.62-94.0-64-bit.tar.gz.md5

# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.62-94.0-64-bit.tar.gz.md5
# Must return *interproscan-5.62-94.0-64-bit.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.

tar -pxvzf interproscan-5.62-94.0-*-bit.tar.gz

# Build the indexes (the tarball extracts to interproscan-5.62-94.0)
cd interproscan-5.62-94.0
python3 setup.py -f interproscan.properties

Using the precalculated match lookup service (optional)

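By default, InterProScan goes online and uses a remote database for matching: it is configured (in the interproscan.properties file) to use a web service hosted at EBI.

InterProScan uses this service to retrieve precalculated matches, which reduces the amount of computation needed on your server and speeds up the run.

The service can also be turned off so that all matches are calculated locally. To do so, either pass the -dp command-line option or edit interproscan.properties and comment out or delete the following line: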

precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
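
The edit to interproscan.properties can also be scripted. The sketch below comments out the line with sed and keeps a .bak backup; `disable_lookup` is a hypothetical helper name, and it assumes the property starts at the beginning of a line.

```shell
# Comment out the lookup-service URL in a properties file (backup saved as FILE.bak).
disable_lookup() {
  sed -i.bak 's|^precalculated\.match\.lookup\.service\.url=|#&|' "$1"
}
```

Called as `disable_lookup interproscan.properties` from the InterProScan directory, it prefixes the line with `#` and leaves every other setting unchanged.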

Test run

./interproscan.sh -i test_all_appl.fasta -f tsv -dp
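
To get a quick look at what the test run produced, the matches can be tallied per analysis. This is a sketch that assumes the standard InterProScan TSV layout, where column 4 holds the analysis name, and that the result was written to test_all_appl.fasta.tsv (input file name plus format extension); `hits_per_analysis` is a hypothetical helper name.

```shell
# Count matches per analysis (column 4 of the TSV), sorted by analysis name.
hits_per_analysis() {
  awk -F'\t' '{ n[$4]++ } END { for (a in n) print a "\t" n[a] }' "$1" | sort
}
```

For example, `hits_per_analysis test_all_appl.fasta.tsv` prints one line per analysis with its match count, which makes it easy to spot an analysis that returned nothing.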

help

# Analyses (member databases) to run, comma separated; see the list of available analyses below
-appl,--applications <ANALYSES>                           Optional, comma separated list of analyses.  If this option
                                                           is not set, ALL analyses will be run.

# Base name for the output files; defaults to the input file name, with the format extension appended
 -b,--output-file-base <OUTPUT-FILE-BASE>                  Optional, base output filename (relative or absolute path).
                                                           Note that this option, the --output-dir (-d) option and the
                                                           --outfile (-o) option are mutually exclusive.  The
                                                           appropriate file extension for the output format(s) will be
                                                           appended automatically. By default the input file path/name
                                                           will be used.

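# Number of CPU cores to use. Note that if the JVM is started with java -XX:+UseParallelGC -XX:ParallelGCThreads=2, the parallel garbage-collection threads stay capped at 2 even when a larger value is given here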
 -cpu,--cpu <CPU>                                          Optional, number of cores for inteproscan.

# Output directory; mutually exclusive with --outfile (-o) and --output-file-base (-b)
 -d,--output-dir <OUTPUT-DIR>                              Optional, output directory.  Note that this option, the
                                                           --outfile (-o) option and the --output-file-base (-b) option
                                                           are mutually exclusive. The output filename(s) are the same
                                                           as the input filename, with the appropriate file extension(s)
                                                           for the output format(s) appended automatically .

# Do not use precalculated matches; run all calculations locally
-dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                           service.  All match calculations will be run locally.

# Exclude sites from the XML and JSON output
 -dra,--disable-residue-annot                              Optional, excludes sites from the XML, JSON output

# Include sites in the TSV output
 -etra,--enable-tsv-residue-annot                          Optional, includes sites in TSV output

# Analyses to exclude
 -exclappl,--excl-applications <EXC-ANALYSES>              Optional, comma separated list of analyses you want to
                                                           exclude.
# Output file format(s)
 -f,--formats <OUTPUT-FORMATS>                             Optional, case-insensitive, comma separated list of output
                                                           formats. Supported formats are TSV, XML, JSON, GFF3, HTML and
                                                           SVG. Default for protein sequences are TSV, XML and GFF3, or
                                                           for nucleotide sequences GFF3 and XML.
# Enable GO annotation so that GO terms are included in the final results
 -goterms,--goterms                                        Optional, switch on lookup of corresponding Gene Ontology
                                                           annotation (IMPLIES -iprlookup option)

 -help,--help                                              Optional, display help information
# Input file; in CONVERT mode, an InterProScan 5 XML result file to convert to another format
 -i,--input <INPUT-FILE-PATH>                              Optional, path to fasta file that should be loaded on Master
                                                           startup. Alternatively, in CONVERT mode, the InterProScan 5
                                                           XML file to convert.

# Deprecated analyses run only when listed with this option; they are off by default
 -incldepappl,--incl-dep-applications <INC-DEP-ANALYSES>   Optional, comma separated list of deprecated analyses that
                                                           you want included.  If this option is not set, deprecated
                                                           analyses will not run.

# Add the corresponding InterPro annotation to the output
 -iprlookup,--iprlookup                                    Also include lookup of corresponding InterPro annotation in
                                                           the TSV and GFF3 output formats.

# Minimum ORF length threshold (nucleotide input only)
 -ms,--minsize <MINIMUM-SIZE>                              Optional, minimum nucleotide size of ORF to report. Will only
                                                           be considered if n is specified as a sequence type. Please be
                                                           aware of the fact that if you specify a too short value it
                                                           might be that the analysis takes a very long time!

# Explicit output file name
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>                   Optional explicit output file name (relative or absolute
                                                           path).  Note that this option, the --output-dir (-d) option
                                                           and the --output-file-base (-b) option are mutually
                                                           exclusive. If this option is given, you MUST specify a single
                                                           output format using the -f option.  The output file name will
                                                           not be modified. Note that specifying an output file name
                                                           using this option OVERWRITES ANY EXISTING FILE.
# Annotate proteins with their corresponding pathways
 -pa,--pathways                                            Optional, switch on lookup of corresponding Pathway
                                                           annotation (IMPLIES -iprlookup option)

# Input sequence type; protein by default
 -t,--seqtype <SEQUENCE-TYPE>                              Optional, the type of the input sequences (dna/rna (n) or
                                                           protein (p)).  The default sequence type is protein.
# Directory for temporary files
 -T,--tempdir <TEMP-DIR>                                   Optional, specify temporary file directory (relative or
                                                           absolute path). The default location is temp/.

# Show more verbose log output
 -verbose,--verbose                                        Optional, display more verbose log output

# Version
 -version,--version                                        Optional, display version number

# Verbosity level of the log output
 -vl,--verbose-level <VERBOSE-LEVEL>                       Optional, display verbose log output at level specified.


 -vtsv,--output-tsv-version                                Optional, includes a TSV version file along with any TSV
                                                           output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.

Available analyses:
                      TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
                         SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
                      PANTHER (15.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
                       Gene3D (4.2.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
                        Hamap (2020_05) : High-quality Automated and Manual Annotation of Microbial Proteomes.
                        Coils (2.2.1) : Prediction of coiled coil regions in proteins.
              ProSiteProfiles (2019_11) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                        SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
                          CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
                       PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
              ProSitePatterns (2019_11) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                         Pfam (33.1) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
                   MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
                        PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:
                      Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
                  SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
        SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
        SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
                        TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model

**-appl / --applications application_name (optional)**

The following databases are available:
CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM

**-goterms / --goterms (optional)**

Option that provides mappings to the Gene Ontology (GO). These mappings are based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option)

**-b / --output-file-base file_name (optional)**

Optionally, you can supply a path and base name (without the file extension) for the results files.

The appropriate file extension will be added to each output file, depending upon the format(s) requested. (It is therefore recommended that you do not include a file extension yourself.)

Note that using this option will not overwrite existing files. If a file with the required name exists at the path specified, the provided file name will have ‘underscore_number’ appended in front of the file extension.

**-t / --seqtype (optional)**

InterProScan supports analysis of both protein and nucleic acid sequences (DNA/RNA). Your input sequences are interpreted as protein sequences by default. If you want to scan nucleotide sequences, you must set the -t option (-t n).

**-T / --tempdir (optional)**

Optionally, you can specify the location of the InterProScan temporary directory. This directory is used as a working directory. The default temporary directory will be in the same directory as the InterProScan script file (interproscan.sh). By default, this directory is completely cleaned up after InterProScan has finished all analyses successfully.

Running

Protein (amino acid) sequences

interproscan.sh -cpu 40 -d anno.dir -dp -i protein.fa

Nucleotide sequences

interproscan.sh -cpu 40 -d anno.dir -dp -t n -i transcripts.fa
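
The two invocations above differ only in the sequence type, so they can be wrapped in one small function. This is a sketch, not part of InterProScan: `run_ipr` and the environment variables `IPRSCAN` (path to the launcher) and `CPUS` (core count) are hypothetical names introduced here. Creating the output directory first is a harmless precaution.

```shell
# usage: run_ipr <fasta> <outdir> [p|n]   (p = protein, the default; n = nucleotide)
run_ipr() {
  mkdir -p "$2" &&
    "${IPRSCAN:-interproscan.sh}" -cpu "${CPUS:-4}" -d "$2" -dp -t "${3:-p}" -i "$1"
}
```

With `CPUS=40` exported, `run_ipr protein.fa anno.dir` and `run_ipr transcripts.fa anno.dir n` correspond to the protein and nucleotide commands shown above.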

Troubleshooting

/lib64/libm.so.6: version `GLIBC_2.27' not found (required by bin/prosite/pfscanV3)

# Check which OS release the server is running
lsb_release -a

# Switch to replacement pfscanV3/pfsearchV3 binaries:
# in the interproscan.properties file, under
# "Binary file locations (required for setup.py)", add these two lines
binary.prosite.pfscanv3.path=${bin.directory}/prosite/centos7.9/pfscanV3
binary.prosite.pfsearchv3.path=${bin.directory}/prosite/centos7.9/pfsearchV3
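
Before swapping binaries, it may help to confirm that the installed glibc really is older than the 2.27 named in the error message. The sketch below assumes a glibc-based Linux system (where `getconf GNU_LIBC_VERSION` is available) and GNU `sort -V`; `version_ge` is a hypothetical helper name.

```shell
# Succeed if version $1 >= version $2 (relies on version sort, `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# On glibc systems this prints e.g. "glibc 2.17"; keep only the number.
have=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')
if version_ge "${have:-0}" 2.27; then
  echo "glibc ${have} is new enough for the bundled pfscanV3"
else
  echo "glibc ${have:-unknown} is older than 2.27: point the properties at the replacement binaries"
fi
```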

Reference:

InterProScan documentation — interproscan-docs documentation