Installing, configuring, and using InterProScan locally (including batch operations)

Xsisikk

Last modified: 2023-08-08 15:56:38

First published: 2023-03-30 14:41:58

Article link: https://blog.csdn.net/weixin_58269397/article/details/129757971

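InterProScan is a gene annotation tool that annotates sequences against multiple databases in a single run. It accepts either protein or nucleotide sequences as input, and the output format is configurable.

Pros and cons of running InterProScan locally:
Pros: with local databases it works offline and, given enough computing resources, annotates faster; a local installation can process many sequences at once; a local installation can run InterPro annotation on DNA sequences.
Cons: the local installation is complex and time-consuming; the local databases need to be updated from time to time; local runs consume substantial computing resources.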

Installation requirements

64-bit Linux
Perl 5 (default on most Linux distributions)
Python 3 (InterProScan 5.30-69.0 onwards)
Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
Environment variables set
$JAVA_HOME should point to the location of the JVM
$JAVA_HOME/bin should be added to the $PATH
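See the official documentation for details on configuring the installation environment.

1. Check the installed Perl version (most Linux distributions install Perl by default)

perl -version
# Output: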
This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Copyright 1987-2021, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

2. Check the Python version (most Linux distributions also ship with Python; it was already present on the author's system)

python3 --version
# Python 3.9.7

3. Check the Java version

java -version
# Output:
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
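
The version string above, 1.8.0_362, uses the legacy numbering scheme and means Java 8, which is below the Java 11 listed in the requirements. The following is a minimal sketch of how that check could be scripted. The function names java_major and java_ok_for_interproscan are hypothetical helpers, not part of InterProScan, and the sketch assumes that versions newer than 11 are also acceptable.

```shell
#!/usr/bin/env bash
# Extract the Java major version from a version string such as
# "1.8.0_362" (legacy scheme, major = 8) or "11.0.18" (major = 11).
java_major() {
    local v="$1"
    v="${v#1.}"        # drop the legacy "1." prefix used up to Java 8
    v="${v%%[._-]*}"   # keep only the leading number
    printf '%s\n' "$v"
}

# Hypothetical helper: succeed only if the java found on PATH is version 11 or newer.
java_ok_for_interproscan() {
    local line ver
    command -v java >/dev/null 2>&1 || return 1
    # "java -version" writes to stderr; the first line carries the quoted version.
    line="$(java -version 2>&1 | head -n 1)"
    ver="$(printf '%s\n' "$line" | sed -n 's/.*"\(.*\)".*/\1/p')"
    [ -n "$ver" ] && [ "$(java_major "$ver")" -ge 11 ]
}

java_major "1.8.0_362"   # prints 8
java_major "11.0.18"     # prints 11
```

Calling java_ok_for_interproscan before launching a long job gives an early failure instead of the "Java version 11 is required" message at run time.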

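Configure the Java environment

1) Download the JDK package: this post uses the .tar.gz archive (no installer required)

JDK 11 download: official link

Choose the build that matches your system. The author's system is 64-bit, so the x64 Compressed Archive is the right choice; otherwise the system cannot run the executables.

The archive jdk-11.0.18_linux-x64_bin.tar.gz was downloaded to /media/aa/DATA/SZQ2.

2) Upload the archive to the installation directory on the server and extract it

tar -zxvf jdk-11.0.18_linux-x64_bin.tar.gz

3) Extraction creates the jdk-11.0.18 directory; add its path to the ~/.profile file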

JAVA_HOME=/media/aa/DATA/SZQ2/jdk-11.0.18
PATH=$JAVA_HOME/bin:$PATH
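# dt.jar and tools.jar were removed in JDK 9, so this CLASSPATH line is unnecessary on JDK 11 (kept as in the original setup)
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar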
export JAVA_HOME
export PATH
export CLASSPATH

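4) Run source ~/.profile

source ~/.profile
# Note: .profile lives in the home directory. Running "source ./.profile" from any other
# directory fails with "bash: ./.profile: No such file or directory".

5) Run java -version to check that the change took effect. If the JDK version is printed, the installation and the environment variables are set up correctly.

java -version
# Output: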
java version "11.0.18" 2023-01-17 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.18+9-LTS-195)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.18+9-LTS-195, mixed mode)

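If the export lines were only typed into the current shell and never saved to ~/.profile, the Java environment has to be set up again in every new session.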
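
One way to avoid repeating the setup is to store the exports in the profile file once. The sketch below appends each line only if it is not already present, so it is safe to run more than once. The helper names add_line_once and persist_java_env are hypothetical, and JDK_DIR defaults to the path used in this post; adjust it to your own location.

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already there.
# Usage: add_line_once FILE LINE
add_line_once() {
    local file="$1" line="$2"
    touch "$file"
    # -F: fixed string, -x: whole-line match, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# PROFILE defaults to ~/.profile; JDK_DIR is the JDK path used in this post.
PROFILE="${PROFILE:-$HOME/.profile}"
JDK_DIR="${JDK_DIR:-/media/aa/DATA/SZQ2/jdk-11.0.18}"

# Hypothetical helper: persist JAVA_HOME and PATH in the profile file.
persist_java_env() {
    add_line_once "$PROFILE" "export JAVA_HOME=$JDK_DIR"
    # Single quotes keep $JAVA_HOME and $PATH unexpanded until the profile is read.
    add_line_once "$PROFILE" 'export PATH=$JAVA_HOME/bin:$PATH'
}
```

After calling persist_java_env, run source ~/.profile (or open a new login shell) and confirm with java -version.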

Download and install InterProScan

1. Download the archive

cd /media/aa/DATA/SZQ2
mkdir my_interproscan
cd my_interproscan
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/interproscan-5.61-93.0-64-bit.tar.gz
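# Grant permissions, to avoid incomplete HMM models during extraction
chmod 777 interproscan-5.61-93.0-64-bit.tar.gz
# Because the file is compressed, checking it with md5sum is strongly recommended to confirm the download completed without errors
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/interproscan-5.61-93.0-64-bit.tar.gz.md5

# The checksum verifies the integrity of the download. The InterProScan archive is large, and verifying it avoids trouble from missing files later:
md5sum -c interproscan-5.61-93.0-64-bit.tar.gz.md5
# Must return *interproscan-5.61-93.0-64-bit.tar.gz: OK*
# If it reports FAILED, download the archive again.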

2. Extract

tar -pxvzf interproscan-5.61-93.0-64-bit.tar.gz

# where:
#     p = preserve the file permissions
#     x = extract files from an archive
#     v = verbosely list the files processed
#     z = filter the archive through gzip
#     f = use archive file

3. Index the HMM models

Before running InterProScan for the first time, run the following commands:

Running InterProScan — interproscan-docs documentation

cd interproscan-5.61-93.0
python3 setup.py -f interproscan.properties
# This step takes a while

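4. PANTHER models

Recent releases no longer require a separate PANTHER model download.

5. Run

Once the archive is extracted, InterProScan can be run directly from the command line.
Run the provided shell script. Running it without arguments prints the usage instructions.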

./interproscan.sh
# If you see:
Java version 11 is required to run InterProScan.
Detected version 1.8.0_362
Please install the correct version.
# then set the Java environment again:
JAVA_HOME=/media/aa/DATA/SZQ2/jdk-11.0.18
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export PATH
export CLASSPATH
# Check the version
java -version
# Run again
./interproscan.sh
# Output:
25/03/2023 14:34:06:730 Welcome to InterProScan-5.61-93.0
25/03/2023 14:34:06:731 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
            -Xmx2048M -jar interproscan-5.jar


Please give us your feedback by sending an email to

interhelp@ebi.ac.uk

 -appl,--applications <ANALYSES>                           Optional, comma separated list of analyses.  If this option
                                                           is not set, ALL analyses will be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>                  Optional, base output filename (relative or absolute path).
                                                           Note that this option, the --output-dir (-d) option and the
                                                           --outfile (-o) option are mutually exclusive.  The
                                                           appropriate file extension for the output format(s) will be
                                                           appended automatically. By default the input file path/name
                                                           will be used.
 -cpu,--cpu <CPU>                                          Optional, number of cores for inteproscan.
 -d,--output-dir <OUTPUT-DIR>                              Optional, output directory.  Note that this option, the
                                                           --outfile (-o) option and the --output-file-base (-b) option
                                                           are mutually exclusive. The output filename(s) are the same
                                                           as the input filename, with the appropriate file extension(s)
                                                           for the output format(s) appended automatically .
 -dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                           service.  All match calculations will be run locally.
 -dra,--disable-residue-annot                              Optional, excludes sites from the XML, JSON output
 -etra,--enable-tsv-residue-annot                          Optional, includes sites in TSV output
 -exclappl,--excl-applications <EXC-ANALYSES>              Optional, comma separated list of analyses you want to
                                                           exclude.
 -f,--formats <OUTPUT-FORMATS>                             Optional, case-insensitive, comma separated list of output
                                                           formats. Supported formats are TSV, XML, JSON, and GFF3.
                                                           Default for protein sequences are TSV, XML and GFF3, or for
                                                           nucleotide sequences GFF3 and XML.
 -goterms,--goterms                                        Optional, switch on lookup of corresponding Gene Ontology
                                                           annotation (IMPLIES -iprlookup option)
 -help,--help                                              Optional, display help information
 -i,--input <INPUT-FILE-PATH>                              Optional, path to fasta file that should be loaded on Master
                                                           startup. Alternatively, in CONVERT mode, the InterProScan 5
                                                           XML file to convert.
 -incldepappl,--incl-dep-applications <INC-DEP-ANALYSES>   Optional, comma separated list of deprecated analyses that
                                                           you want included.  If this option is not set, deprecated
                                                           analyses will not run.
 -iprlookup,--iprlookup                                    Also include lookup of corresponding InterPro annotation in
                                                           the TSV and GFF3 output formats.
 -ms,--minsize <MINIMUM-SIZE>                              Optional, minimum nucleotide size of ORF to report. Will only
                                                           be considered if n is specified as a sequence type. Please be
                                                           aware of the fact that if you specify a too short value it
                                                           might be that the analysis takes a very long time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>                   Optional explicit output file name (relative or absolute
                                                           path).  Note that this option, the --output-dir (-d) option
                                                           and the --output-file-base (-b) option are mutually
                                                           exclusive. If this option is given, you MUST specify a single
                                                           output format using the -f option.  The output file name will
                                                           not be modified. Note that specifying an output file name
                                                           using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways                                            Optional, switch on lookup of corresponding Pathway
                                                           annotation (IMPLIES -iprlookup option)
 -t,--seqtype <SEQUENCE-TYPE>                              Optional, the type of the input sequences (dna/rna (n) or
                                                           protein (p)).  The default sequence type is protein.
 -T,--tempdir <TEMP-DIR>                                   Optional, specify temporary file directory (relative or
                                                           absolute path). The default location is temp/.
 -verbose,--verbose                                        Optional, display more verbose log output
 -version,--version                                        Optional, display version number
 -vl,--verbose-level <VERBOSE-LEVEL>                       Optional, display verbose log output at level specified.
 -vtsv,--output-tsv-version                                Optional, includes a TSV version file along with any TSV
                                                           output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.

Available analyses:
                      TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
                       FunFam (4.3.0) : Prediction of functional annotations for novel, uncharacterized sequences.
                         SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
                      PANTHER (17.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
                       Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
                        Hamap (2021_04) : High-quality Automated and Manual Annotation of Microbial Proteomes.
                       PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
              ProSiteProfiles (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                        Coils (2.2.1) : Prediction of coiled coil regions in proteins.
                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
                        SMART (9.0) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs). 
                          CDD (3.20) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
                        PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
              ProSitePatterns (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                      AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
                         Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
                   MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
                        PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:
        SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
        SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
                      Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
                  SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
                        TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model

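At the end of the option list, some analyses may be reported as deactivated because their binaries are missing (SignalP, Phobius, and TMHMM in the output above). These tools must be obtained separately from their own websites after registering, then placed in the corresponding InterProScan directories.

Input format

1) Submitted protein sequences must not contain asterisks (*)

2) Input is a FASTA file of protein or nucleotide sequences; the sequences must not contain illegal characters such as ·, - or *.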
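
Before starting a long run, it can help to check the input for such characters. The following is a small sketch, assuming that any non-letter character on a sequence line counts as illegal; the function name count_bad_seq_lines is a hypothetical helper, and header lines (starting with ">") are ignored.

```shell
#!/usr/bin/env bash
# Count sequence lines (non-header lines) that contain any character
# other than a letter, e.g. "*", "-" or ".".
# Usage: count_bad_seq_lines FILE
count_bad_seq_lines() {
    awk '!/^>/ && /[^A-Za-z]/ { n++ } END { print n + 0 }' "$1"
}
```

A result of 0 means no sequence line contains a non-letter character. Note that files with Windows line endings would be flagged too, since the carriage return is not a letter.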

./interproscan.sh -i test_all_appl.fasta -f tsv

05.InterProScan

1) Remove the asterisks from the protein files

cat /media/aa/DATA/SZQ2/bj/functional_annotation/pep70/$i.fasta | perl -pe 's/\*//g' > $i.noStar.fasta

Batch operations

Batch: pep70

for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
    echo "cat /media/aa/DATA/SZQ2/bj/functional_annotation/pep70/$i.fasta | perl -pe 's/\*//g' > $i.noStar.fasta"
done > command.noStar.list

ParaFly -c command.noStar.list -CPU 48
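
The same command list can also be produced with a while-read loop, which avoids word splitting on the list file and skips blank lines. This is only a sketch of an alternative; make_nostar_cmds is a hypothetical helper, and it drops the cat pipe by passing the input file directly to perl.

```shell
#!/usr/bin/env bash
# Print one asterisk-removal command per ID listed in LIST.
# Usage: make_nostar_cmds LIST IN_DIR > command.noStar.list
make_nostar_cmds() {
    local list="$1" in_dir="$2" i
    # "|| [ -n "$i" ]" also handles a last line without a trailing newline
    while IFS= read -r i || [ -n "$i" ]; do
        [ -n "$i" ] || continue   # skip blank lines
        # The perl expression is passed through %s so it is printed literally
        printf "perl -pe '%s' '%s/%s.fasta' > '%s.noStar.fasta'\n" \
            's/\*//g' "$in_dir" "$i" "$i"
    done < "$list"
}
```

The resulting file can be passed to ParaFly -c exactly as above.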

Batch: pepmy

for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
    echo "cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy/$i.fasta | perl -pe 's/\*//g' > $i.noStar.fasta"
done > command.noStar.list

ParaFly -c command.noStar.list -CPU 48

protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar/$i.noStar.fasta

protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy_noStar/$i.noStar.fasta

The "|" in headers such as ">Amath1|Amath100001" must be replaced with "_", otherwise InterProScan reports an error.

protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar_clean/$i.noStar.clean.fasta

protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy_noStar_clean/$i.noStar.clean.fasta
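
The post does not show the command used for this header cleanup. One possible way to do it, as a sketch, is a sed expression restricted to header lines; clean_headers is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Replace every "|" with "_" on FASTA header lines only (lines starting with ">").
# Sequence lines are passed through unchanged.
# Usage: clean_headers IN.fasta > OUT.clean.fasta
clean_headers() {
    sed '/^>/ s/|/_/g' "$1"
}
```

For example, clean_headers $i.noStar.fasta > $i.noStar.clean.fasta would turn ">Amath1|Amath100001" into ">Amath1_Amath100001".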

Create a working directory

mkdir 05.InterProScan && cd 05.InterProScan
# Root privileges are required
sudo su
# Enter the password when prompted

2) Annotate with the local InterProScan

/media/aa/DATA/SZQ2/my_interproscan/interproscan-5.61-93.0/interproscan.sh -i /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar_clean/$i.noStar.clean.fasta -f tsv -o $i.interpro.tsv
# Consolidate the results
grep IPR $i.interpro.tsv | cut -f 1,12,13 | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > $i.Interpro.txt
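
In the TSV output, column 1 holds the protein accession, column 12 the InterPro accession, and column 13 the InterPro description, which is why cut -f 1,12,13 is used above. The sketch below does the same selection with awk but tests column 12 specifically, so a stray "IPR" elsewhere on a line cannot match. The function name extract_ipr is a hypothetical helper. According to the help text shown earlier, -iprlookup adds the InterPro annotation to TSV output, so if these columns come out empty, adding that option may be needed.

```shell
#!/usr/bin/env bash
# Print protein accession, InterPro accession and InterPro description
# for rows whose 12th column holds an InterPro accession (IPR...).
# Usage: extract_ipr FILE.tsv
extract_ipr() {
    awk -F'\t' 'BEGIN { OFS = "\t" } $12 ~ /^IPR/ { print $1, $12, $13 }' "$1"
}
```

Its output can be piped into gene_annotation_from_table.pl in place of the grep and cut pair.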

Batch operations

Batch: pep70

for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
    echo "/media/aa/DATA/SZQ2/my_interproscan/interproscan-5.61-93.0/interproscan.sh -i /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar_clean/$i.noStar.clean.fasta -f tsv -o $i.interpro.tsv"
done > command.InterProScan.list
ParaFly -c command.InterProScan.list -CPU 70

Afterwards, the "_" in IDs such as "Amath1_Amath100001" must be changed back to "|".

Batch-replacing file contents on Ubuntu - slldxmm's blog, CSDN

# Braces are required: "$i_$i" would be parsed as the (unset) variable "i_" followed by "$i"
find . -name "$i.interpro.tsv" | xargs sed -i "s/${i}_${i}/${i}|${i}/g"
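
The substitution itself can be tried on a single file before running it in place. A minimal sketch, in which restore_pipe is a hypothetical helper that writes to standard output rather than editing the file:

```shell
#!/usr/bin/env bash
# Turn "ID_ID..." back into "ID|ID..." for one species ID.
# ${id} needs braces: in "$id_$id" the shell would look up a variable named "id_".
# Usage: restore_pipe ID FILE > FILE.restored
restore_pipe() {
    local id="$1" file="$2"
    sed "s/${id}_${id}/${id}|${id}/g" "$file"
}
```

With ID set to Amath1, a line starting with "Amath1_Amath100001" becomes "Amath1|Amath100001", while other underscores are left alone.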

# Replace
# Single quotes inside the double-quoted echo are printed literally, so the generated commands need no manual editing
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
    echo "find . -name '$i.interpro.tsv' | xargs sed -i 's/${i}_${i}/${i}|${i}/g'"
done > command.rename.list
ParaFly -c command.rename.list -CPU 48

# Collect the results
mkdir Interpro.txt && cd Interpro.txt
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
    echo "grep IPR ../$i.interpro.tsv | cut -f 1,12,13 | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > $i.Interpro.txt"
done > command.grep.list
ParaFly -c command.grep.list -CPU 20

Batch: pepmy

for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
    echo "/media/aa/DATA/SZQ2/my_interproscan/interproscan-5.61-93.0/interproscan.sh -i /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy_noStar_clean/$i.noStar.clean.fasta -f tsv -o $i.interpro.tsv"
done > command.InterProScan.list
ParaFly -c command.InterProScan.list -CPU 48

# Replace
# Single quotes inside the double-quoted echo are printed literally, so the generated commands need no manual editing
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
    echo "find . -name '$i.interpro.tsv' | xargs sed -i 's/${i}_${i}/${i}|${i}/g'"
done > command.rename.list
ParaFly -c command.rename.list -CPU 48

# Collect the results
mkdir Interpro.txt && cd Interpro.txt
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
    echo "grep IPR ../$i.interpro.tsv | cut -f 1,12,13 | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > $i.Interpro.txt"
done > command.grep.list
ParaFly -c command.grep.list -CPU 20

This completes the workflow.

The results can be used for downstream GO and TF annotation.