Requirements
- 64-bit Linux
- Perl 5 (default on most Linux distributions)
- Python 3 (InterProScan 5.30-69.0 onwards)
- Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
- Environment variables set:
  - $JAVA_HOME should point to the location of the JVM
  - $JAVA_HOME/bin should be added to $PATH
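As a minimal sketch of the two environment settings above: the JDK location /usr/lib/jvm/java-11-openjdk is only a hypothetical example, so substitute the path where Java 11 lives on your own system.

```shell
# Hypothetical JDK path (an assumption); an already exported JAVA_HOME is kept as-is.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-11-openjdk}"

# Prepend $JAVA_HOME/bin to PATH only if it is not already listed.
case ":$PATH:" in
  *":$JAVA_HOME/bin:"*) ;;
  *) export PATH="$JAVA_HOME/bin:$PATH" ;;
esac
```

Putting these lines in ~/.bashrc (or the equivalent for your shell) makes the setting persistent across logins.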
Checking the configuration
perl -version
python3 --version
java -version
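Reading the Java version by eye is easy to get wrong, because older releases report themselves as 1.x (Java 8 prints 1.8.0_...). The sketch below automates the check against the Java 11 requirement listed above. The helper name java_major is made up for illustration, and the parsing assumes the usual `version "X.Y.Z"` line in the output of java -version.

```shell
# Extract the major Java version from a string such as "11.0.19" or "1.8.0_292".
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.0_292 means Java 8
  esac
  echo "${v%%[._-]*}"     # keep everything before the first '.', '_' or '-'
}

# java -version writes to stderr; the version is the first double-quoted token.
if command -v java >/dev/null 2>&1; then
  found="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  major="$(java_major "$found")"
  if [ "${major:-0}" -ge 11 ] 2>/dev/null; then
    echo "Java $found meets the Java 11 requirement"
  else
    echo "Java '$found' looks too old; InterProScan 5.37-76.0 onwards needs Java 11" >&2
  fi
fi
```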
Getting the software
mkdir my_interproscan
cd my_interproscan
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.62-94.0/interproscan-5.62-94.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.62-94.0/interproscan-5.62-94.0-64-bit.tar.gz.md5
# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.62-94.0-64-bit.tar.gz.md5
# Must return *interproscan-5.62-94.0-64-bit.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
tar -pxvzf interproscan-5.62-94.0-*-bit.tar.gz
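The checksum step above can be folded into a small helper so that extraction only happens when the download is intact. This is a sketch, not part of InterProScan: verify_md5 is a hypothetical name, and it assumes GNU md5sum and a .md5 file sitting next to the archive.

```shell
# Verify an archive against the .md5 file beside it.
# Returns 0 when the checksum matches, non-zero otherwise.
verify_md5() {
  local archive="$1"
  # The .md5 file stores a relative file name, so run the check from the archive's directory.
  ( cd "$(dirname "$archive")" && md5sum -c --status "$(basename "$archive").md5" )
}

# Usage, with the file name from the steps above:
#   verify_md5 interproscan-5.62-94.0-64-bit.tar.gz \
#     && tar -pxvzf interproscan-5.62-94.0-64-bit.tar.gz
```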
# Build the indexes
cd interproscan-5.62-94.0
python3 setup.py -f interproscan.properties
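Using the precalculated match lookup service (optional)
By default, InterProScan goes online and queries a remote database of matches.
Specifically, InterProScan is configured by default (in the interproscan.properties file) to use a web service hosted at the EBI.
It uses this service to retrieve precalculated matches, which reduces the amount of computation needed on your server and speeds up the run.
To turn the service off and calculate all matches locally, either use the -dp command-line option or edit interproscan.properties and comment out or delete the following line: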
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
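If you prefer to script the properties edit rather than do it by hand, something like the following would work. It is only a sketch: the helper name disable_lookup is made up for illustration, and it assumes the property appears uncommented at the start of a line, as shown above.

```shell
# Comment out the lookup-service URL in a properties file.
# sed -i.bak edits in place and leaves the original as <file>.bak.
disable_lookup() {
  local props="$1"
  sed -i.bak 's|^precalculated\.match\.lookup\.service\.url=|#&|' "$props"
}

# Usage, from the InterProScan directory:
#   disable_lookup interproscan.properties
```

Passing -dp on the command line has the same effect for a single run and leaves the file untouched.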
Test run
./interproscan.sh -i test_all_appl.fasta -f tsv -dp
help
# Analyses (member databases) to run, comma separated; the available analyses are listed below
-appl,--applications <ANALYSES> Optional, comma separated list of analyses. If this option
is not set, ALL analyses will be run.
# Base name for the output files; defaults to the input file name (e.g. <input>.tsv)
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path).
Note that this option, the --output-dir (-d) option and the
--outfile (-o) option are mutually exclusive. The
appropriate file extension for the output format(s) will be
appended automatically. By default the input file path/name
will be used.
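# Number of CPU cores to use. Note that the JVM options may cap this: if Java is launched with -XX:+UseParallelGC -XX:ParallelGCThreads=2, the author found threads limited to 2 even when a larger value is set here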
-cpu,--cpu <CPU> Optional, number of cores for inteproscan.
# Output directory; mutually exclusive with --outfile (-o) and --output-file-base (-b)
-d,--output-dir <OUTPUT-DIR> Optional, output directory. Note that this option, the
--outfile (-o) option and the --output-file-base (-b) option
are mutually exclusive. The output filename(s) are the same
as the input filename, with the appropriate file extension(s)
for the output format(s) appended automatically .
# Do not use precalculated matches; compute everything locally
-dp,--disable-precalc Optional. Disables use of the precalculated match lookup
service. All match calculations will be run locally.
# Exclude sites from the XML and JSON output
-dra,--disable-residue-annot Optional, excludes sites from the XML, JSON output
# Include sites in the TSV output
-etra,--enable-tsv-residue-annot Optional, includes sites in TSV output
# Exclude some analyses
-exclappl,--excl-applications <EXC-ANALYSES> Optional, comma separated list of analyses you want to
exclude.
# Output file format(s)
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output
formats. Supported formats are TSV, XML, JSON, GFF3, HTML and
SVG. Default for protein sequences are TSV, XML and GFF3, or
for nucleotide sequences GFF3 and XML.
# Enable GO annotation: include GO terms in the final results
-goterms,--goterms Optional, switch on lookup of corresponding Gene Ontology
annotation (IMPLIES -iprlookup option)
-help,--help Optional, display help information
# Input file; in CONVERT mode this can instead be an InterProScan 5 XML result file to convert to another format
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on Master
startup. Alternatively, in CONVERT mode, the InterProScan 5
XML file to convert.
# Deprecated analyses are off by default and run only when listed with this option
-incldepappl,--incl-dep-applications <INC-DEP-ANALYSES> Optional, comma separated list of deprecated analyses that
you want included. If this option is not set, deprecated
analyses will not run.
# Include the corresponding InterPro annotation
-iprlookup,--iprlookup Also include lookup of corresponding InterPro annotation in
the TSV and GFF3 output formats.
# Minimum ORF length threshold
-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will only
be considered if n is specified as a sequence type. Please be
aware of the fact that if you specify a too short value it
might be that the analysis takes a very long time!
# Explicit output file name
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute
path). Note that this option, the --output-dir (-d) option
and the --output-file-base (-b) option are mutually
exclusive. If this option is given, you MUST specify a single
output format using the -f option. The output file name will
not be modified. Note that specifying an output file name
using this option OVERWRITES ANY EXISTING FILE.
# Annotate the pathways associated with each protein
-pa,--pathways Optional, switch on lookup of corresponding Pathway
annotation (IMPLIES -iprlookup option)
# Input sequence type; protein by default
-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n) or
protein (p)). The default sequence type is protein.
# Directory for temporary files
-T,--tempdir <TEMP-DIR> Optional, specify temporary file directory (relative or
absolute path). The default location is temp/.
# Show more verbose log output
-verbose,--verbose Optional, display more verbose log output
# Version
-version,--version Optional, display version number
# Verbosity level of the log output
-vl,--verbose-level <VERBOSE-LEVEL> Optional, display verbose log output at level specified.
-vtsv,--output-tsv-version Optional, includes a TSV version file along with any TSV
output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.
Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
PANTHER (15.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (4.2.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
Hamap (2020_05) : High-quality Automated and Manual Annotation of Microbial Proteomes.
Coils (2.2.1) : Prediction of coiled coil regions in proteins.
ProSiteProfiles (2019_11) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
ProSitePatterns (2019_11) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
Pfam (33.1) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Deactivated analyses:
Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
-appl / --applications application_name (optional)
The following analyses can be selected:
CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM
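Because -appl takes a single comma-separated value, it can be convenient to assemble it from individual names when scripting runs. A small sketch follows; join_appl is a hypothetical helper, not an InterProScan command.

```shell
# Join analysis names with commas, the form that -appl expects.
join_appl() {
  local IFS=','
  echo "$*"    # "$*" joins the arguments using the first character of IFS
}

# Usage, restricting the test run to three analyses from the list above:
#   ./interproscan.sh -i test_all_appl.fasta -f tsv -appl "$(join_appl Pfam SMART CDD)"
```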
-goterms / --goterms (optional)
Option that provides mappings to the Gene Ontology (GO). These mappings are based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option)
-b / --output-file-base file_name (optional)
Optionally, provide a path and base name (without the file extension) for the result files.
The appropriate file extension will be added to each output file, depending upon the format(s) requested. (It is therefore recommended that you do not include a file extension yourself.)
Note that using this option will not overwrite existing files. If a file with the required name exists at the path specified, the provided file name will have ‘underscore_number’ appended in front of the file extension.
-t / --seqtype (optional)
InterProScan supports analysis of both protein and nucleic acid sequences (DNA/RNA). Input sequences are interpreted as protein sequences by default. To scan nucleotide sequences, set the -t option to n.
-T / --tempdir (optional)
Optionally, you can specify the location of the InterProScan temporary directory, which is used as a working directory. The default temporary directory is in the same directory as the InterProScan script file (interproscan.sh). By default, this directory is cleaned up completely once InterProScan has finished all analyses successfully.
Running
Protein sequences
interproscan.sh -cpu 40 -d anno.dir -dp -i protein.fa
Nucleotide sequences
interproscan.sh -cpu 40 -d anno.dir -dp -t n -i transcripts.fa
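Once a run like the ones above finishes, a quick sanity check is to count how many matches each analysis produced. The sketch below assumes the tab-separated layout described in the InterProScan documentation, where column 4 holds the analysis name (e.g. Pfam), and assumes the TSV lands at anno.dir/protein.fa.tsv, following the naming rule in the -d help text. count_by_analysis is a hypothetical helper name.

```shell
# Count matches per analysis; column 4 of the TSV is taken to be the analysis name.
count_by_analysis() {
  awk -F'\t' '{ n[$4]++ } END { for (a in n) print a "\t" n[a] }' "$1" | sort
}

# Usage:
#   count_by_analysis anno.dir/protein.fa.tsv
```

An analysis that you expected but that is missing from the counts is worth checking against the "Deactivated analyses" section of the help output.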
Troubleshooting
/lib64/libm.so.6: version `GLIBC_2.27' not found (required by bin/prosite/pfscanV3)
# Check which OS release the server is running
lsb_release -a
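# Replace the pfscanV3/pfsearchV3 binaries (here with builds placed under prosite/centos7.9)
# Then, in interproscan.properties, add the following two lines
# under "Binary file locations (required for setup.py)":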
binary.prosite.pfscanv3.path=${bin.directory}/prosite/centos7.9/pfscanV3
binary.prosite.pfsearchv3.path=${bin.directory}/prosite/centos7.9/pfsearchV3
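To find out in advance whether a machine is likely to hit the GLIBC_2.27 error above, the installed glibc version can be compared with 2.27. This is a sketch under a few assumptions: a glibc-based system where getconf GNU_LIBC_VERSION is available, GNU sort with -V, and a made-up helper name version_ge.

```shell
# True if version $1 >= version $2, using GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# On glibc systems getconf prints something like "glibc 2.17".
have="$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')"
if [ -n "$have" ] && ! version_ge "$have" 2.27; then
  echo "glibc $have is older than 2.27; bin/prosite/pfscanV3 may fail as shown above" >&2
fi
```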
References:
InterProScan documentation — interproscan-docs documentation