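InterProScan is a gene annotation tool that can annotate against multiple databases in a single run. You can submit either protein or nucleotide sequences, and you can choose the output format.
Pros and cons of a local InterProScan installation:
Pros: it uses local databases, so it works offline and, given enough computing resources, annotates faster; a local installation can process many sequences at once; and it can run InterPro annotation on DNA sequences. Cons: installing InterProScan locally is complicated and time-consuming; the local databases need periodic updates; and local runs consume a lot of computing resources.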
Reference:
Installing and configuring the JDK on Ubuntu 20.04 (夏小悠's blog, CSDN)
Requirements
- 64-bit Linux
- Perl 5 (default on most Linux distributions)
- Python 3 (InterProScan 5.30-69.0 onwards)
- Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
- Environment variables set
$JAVA_HOME should point to the location of the JVM
$JAVA_HOME/bin should be added to the $PATH
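See the official documentation for details on setting up the environment.
1. Check the installed Perl version (Linux distributions normally ship with Perl preinstalled)
perl -version
# Output: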
This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi
Copyright 1987-2021, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.
2. Check the Python version (Linux distributions normally ship with Python preinstalled; mine did)
python3 --version
# Python 3.9.7
3. Check the Java version
java -version
# Output:
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)
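The output above reports Java 1.8, which is older than the version 11 listed in the requirements. As a rough sketch, the check can be scripted by parsing the first line of the `java -version` output. The function name `java_major` is my own, and treating any version of 11 or above as acceptable is an assumption, not something the requirements list states.

```shell
#!/usr/bin/env bash
# Extract the major Java version from `java -version` output.
# Handles the legacy "1.8.0_362" scheme (major = 8) and the modern "11.0.18" scheme.
java_major() {
    local version
    # Pull out the quoted version string, e.g. 1.8.0_362 or 11.0.18
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$version" in
        1.*) printf '%s\n' "$version" | cut -d. -f2 ;;   # legacy scheme: 1.8 -> 8
        *)   printf '%s\n' "$version" | cut -d. -f1 ;;   # modern scheme: 11.0.18 -> 11
    esac
}

# Usage sketch: `java -version` writes to stderr, hence the 2>&1
# major=$(java_major "$(java -version 2>&1)")
# [ "$major" -ge 11 ] || echo "Java 11 is required, found $major" >&2
```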
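Configuring the Java environment
1) Download the JDK 11 package. This guide uses the tar.gz archive, which does not need an installer.
Download location: the official link.
Choose the build that matches your system. Mine is 64-bit, so I picked the x64 Compressed Archive; with a mismatched build the system cannot run the executables.
I saved jdk-11.0.18_linux-x64_bin.tar.gz under /media/aa/DATA/SZQ2.
2) Upload the archive to the installation directory on the server and extract it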
tar -zxvf jdk-11.0.18_linux-x64_bin.tar.gz
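3) Extraction creates the jdk-11.0.18 directory. Add its path to your ~/.profile
export JAVA_HOME=/media/aa/DATA/SZQ2/jdk-11.0.18
export PATH=$JAVA_HOME/bin:$PATH
# CLASSPATH is not set here: dt.jar and tools.jar were removed in JDK 9, so JDK 11 does not need it
4) Run source ~/.profile
source ~/.profile
# Note: source ./.profile only works from the home directory; anywhere else it fails with
# bash: ./.profile: No such file or directory
5) Run java -version to check that the change took effect. If the JDK version number is printed, the installation and environment setup succeeded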
java -version
# Output:
java version "11.0.18" 2023-01-17 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.18+9-LTS-195)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.18+9-LTS-195, mixed mode)
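You will probably need to set up the Java environment again in each new shell session, unless the exports were saved to ~/.profile and that file is read at login.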
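To avoid retyping the exports, they can be written to ~/.profile once. Below is a minimal sketch that appends a line only if it is not already there, so running it twice does no harm. The helper name `append_once` is hypothetical. Note that bash reads ~/.profile only for login shells, and skips it when ~/.bash_profile exists.

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
append_once() {
    local line=$1 file=$2
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage sketch (JDK path taken from the steps above):
# append_once 'export JAVA_HOME=/media/aa/DATA/SZQ2/jdk-11.0.18' "$HOME/.profile"
# append_once 'export PATH=$JAVA_HOME/bin:$PATH' "$HOME/.profile"
```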
Downloading and installing InterProScan
1. Download the archive
cd /media/aa/DATA/SZQ2
mkdir my_interproscan
cd my_interproscan
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/interproscan-5.61-93.0-64-bit.tar.gz
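# Add permissions, to avoid incomplete HMM models after extraction
chmod 777 interproscan-5.61-93.0-64-bit.tar.gz
# For compressed files, checking with md5sum that the download finished without errors is strongly recommended
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/interproscan-5.61-93.0-64-bit.tar.gz.md5
# The checksum verifies that the download is complete. InterProScan is large, so verifying now avoids trouble from missing files later:
md5sum -c interproscan-5.61-93.0-64-bit.tar.gz.md5
# Must return *interproscan-5.61-93.0-64-bit.tar.gz: OK*
# If it reports FAILED, download the archive again.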
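The checksum step can be wrapped so that a script stops before extracting a damaged archive. This is only a sketch: the function name `verify_md5` is my own, and it relies on GNU `md5sum` (the `-c` and `--quiet` options).

```shell
#!/usr/bin/env bash
# Return 0 if the files listed in a .md5 file match their checksums, non-zero otherwise.
verify_md5() {
    local md5_file=$1
    # md5sum -c resolves file names relative to the current directory,
    # so run it from the directory that holds the checksum file
    ( cd "$(dirname "$md5_file")" && md5sum -c --quiet "$(basename "$md5_file")" )
}

# Usage sketch:
# if ! verify_md5 interproscan-5.61-93.0-64-bit.tar.gz.md5; then
#     echo "Checksum mismatch, download the archive again" >&2
# fi
```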
2. Extract
tar -pxvzf interproscan-5.61-93.0-64-bit.tar.gz
# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file
3. Index the HMM models
Before running InterProScan for the first time, run the following commands:
cd interproscan-5.61-93.0
python3 setup.py -f interproscan.properties
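# This takes quite a while
4. PANTHER models
Recent releases do not need a separate PANTHER model download.
5. Run
Once extracted, InterProScan can be run directly from the command line.
Run the supplied shell script. If you run it without arguments, it prints the usage instructions.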
./interproscan.sh
# If you see
Java version 11 is required to run InterProScan.
Detected version 1.8.0_362
Please install the correct version.
# then set the Java environment again
export JAVA_HOME=/media/aa/DATA/SZQ2/jdk-11.0.18
export PATH=$JAVA_HOME/bin:$PATH
# CLASSPATH is not needed for JDK 11 (dt.jar and tools.jar were removed in JDK 9)
# Check the version
java -version
# Run again
./interproscan.sh
# Output:
25/03/2023 14:34:06:730 Welcome to InterProScan-5.61-93.0
25/03/2023 14:34:06:731 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
-Xmx2048M -jar interproscan-5.jar
Please give us your feedback by sending an email to
interhelp@ebi.ac.uk
-appl,--applications <ANALYSES> Optional, comma separated list of analyses. If this option
is not set, ALL analyses will be run.
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path).
Note that this option, the --output-dir (-d) option and the
--outfile (-o) option are mutually exclusive. The
appropriate file extension for the output format(s) will be
appended automatically. By default the input file path/name
will be used.
-cpu,--cpu <CPU> Optional, number of cores for inteproscan.
-d,--output-dir <OUTPUT-DIR> Optional, output directory. Note that this option, the
--outfile (-o) option and the --output-file-base (-b) option
are mutually exclusive. The output filename(s) are the same
as the input filename, with the appropriate file extension(s)
for the output format(s) appended automatically .
-dp,--disable-precalc Optional. Disables use of the precalculated match lookup
service. All match calculations will be run locally.
-dra,--disable-residue-annot Optional, excludes sites from the XML, JSON output
-etra,--enable-tsv-residue-annot Optional, includes sites in TSV output
-exclappl,--excl-applications <EXC-ANALYSES> Optional, comma separated list of analyses you want to
exclude.
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output
formats. Supported formats are TSV, XML, JSON, and GFF3.
Default for protein sequences are TSV, XML and GFF3, or for
nucleotide sequences GFF3 and XML.
-goterms,--goterms Optional, switch on lookup of corresponding Gene Ontology
annotation (IMPLIES -iprlookup option)
-help,--help Optional, display help information
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on Master
startup. Alternatively, in CONVERT mode, the InterProScan 5
XML file to convert.
-incldepappl,--incl-dep-applications <INC-DEP-ANALYSES> Optional, comma separated list of deprecated analyses that
you want included. If this option is not set, deprecated
analyses will not run.
-iprlookup,--iprlookup Also include lookup of corresponding InterPro annotation in
the TSV and GFF3 output formats.
-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will only
be considered if n is specified as a sequence type. Please be
aware of the fact that if you specify a too short value it
might be that the analysis takes a very long time!
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute
path). Note that this option, the --output-dir (-d) option
and the --output-file-base (-b) option are mutually
exclusive. If this option is given, you MUST specify a single
output format using the -f option. The output file name will
not be modified. Note that specifying an output file name
using this option OVERWRITES ANY EXISTING FILE.
-pa,--pathways Optional, switch on lookup of corresponding Pathway
annotation (IMPLIES -iprlookup option)
-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n) or
protein (p)). The default sequence type is protein.
-T,--tempdir <TEMP-DIR> Optional, specify temporary file directory (relative or
absolute path). The default location is temp/.
-verbose,--verbose Optional, display more verbose log output
-version,--version Optional, display version number
-vl,--verbose-level <VERBOSE-LEVEL> Optional, display verbose log output at level specified.
-vtsv,--output-tsv-version Optional, includes a TSV version file along with any TSV
output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.
Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
FunFam (4.3.0) : Prediction of functional annotations for novel, uncharacterized sequences.
SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
PANTHER (17.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
Hamap (2021_04) : High-quality Automated and Manual Annotation of Microbial Proteomes.
PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
ProSiteProfiles (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
Coils (2.2.1) : Prediction of coiled coil regions in proteins.
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
SMART (9.0) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
CDD (3.20) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
ProSitePatterns (2022_05) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Deactivated analyses:
SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
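At the end of the usage text, some analyses may be listed as deactivated because their binaries are missing (here SignalP, Phobius and TMHMM). You have to register on the official site of each tool, download it yourself, and place it in the corresponding InterProScan directory.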
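To see at a glance which analyses are switched off, the usage text can be filtered. The sketch below prints the first field of every line after the "Deactivated analyses:" marker; the function name `list_deactivated` is hypothetical, and the line layout is assumed to match the output shown above.

```shell
#!/usr/bin/env bash
# Print the names of analyses listed under "Deactivated analyses:" in the usage text on stdin.
list_deactivated() {
    awk '
        /Deactivated analyses:/ { in_block = 1; next }   # start of the section
        in_block && NF > 0      { print $1 }             # first field is the analysis name
    '
}

# Usage sketch (2>&1 in case the usage text goes to stderr):
# ./interproscan.sh 2>&1 | list_deactivated
```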
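Input format
1) Submitted protein sequences must not contain asterisks.
2) Input must be protein or nucleotide sequences in FASTA format, and the sequences must not contain illegal characters such as ·, - or *.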
./interproscan.sh -i test_all_appl.fasta -f tsv
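Before submitting a file, it can help to count how many sequence lines contain one of the characters named above. A small sketch, with `count_illegal_lines` as a made-up helper name; header lines (starting with ">") are skipped, since the rule is about the sequences themselves.

```shell
#!/usr/bin/env bash
# Count sequence lines (headers excluded) that contain *, - or ·
count_illegal_lines() {
    # grep -c prints the count; "|| true" keeps a count of zero from being treated as a failure
    grep -v '^>' "$1" | grep -c -e '[*]' -e '-' -e '·' || true
}

# Usage sketch:
# count_illegal_lines test_all_appl.fasta
```

A result of 0 means none of the three characters were found in the sequence lines.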
05.InterProScan
1) Remove the * characters from the protein files
cat /media/aa/DATA/SZQ2/bj/functional_annotation/pep70/$i.fasta | perl -pe 's/\*//g' > $i.noStar.fasta
Batch processing
Batch: pep70
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
echo "cat /media/aa/DATA/SZQ2/bj/functional_annotation/pep70/$i.fasta | perl -pe 's/\*//g' > $i.noStar.fasta"
done > command.noStar.list
ParaFly -c command.noStar.list -CPU 48
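If ParaFly is not installed, a command list like the one above can also be run with GNU xargs. This is a swapped-in alternative, not the method used in this post, and the function name `run_command_list` is my own. Unlike a dedicated job runner, this sketch keeps no record of finished or failed commands.

```shell
#!/usr/bin/env bash
# Run every line of a command list as its own shell command, several at a time.
run_command_list() {
    local list=$1 jobs=${2:-4}
    # -d '\n' : one command per line; quotes inside the line are left untouched (GNU xargs)
    # -P      : number of commands to run in parallel
    # -I{}    : substitute the whole line into the sh -c argument
    xargs -d '\n' -P "$jobs" -I{} sh -c '{}' < "$list"
}

# Usage sketch:
# run_command_list command.noStar.list 48
```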
Batch: pepmy
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
echo "cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy/$i.fasta | perl -pe 's/\*//g' > $i.noStar.fasta"
done > command.noStar.list
ParaFly -c command.noStar.list -CPU 48
protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar/$i.noStar.fasta
protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy_noStar/$i.noStar.fasta
The "|" in headers such as ">Amath1|Amath100001" has to be replaced with "_", otherwise an error is reported
protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar_clean/$i.noStar.clean.fasta
protein.fa: /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy_noStar_clean/$i.noStar.clean.fasta
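The post does not show the command used for the "|" to "_" replacement, so here is one possible way to do it with sed, restricted to header lines so that sequence lines stay untouched. The function name `clean_headers` is hypothetical.

```shell
#!/usr/bin/env bash
# Replace every "|" with "_" on FASTA header lines only; sequence lines are left as they are.
clean_headers() {
    sed '/^>/ s/|/_/g' "$1"
}

# Usage sketch (file names follow the naming used above):
# clean_headers "$i.noStar.fasta" > "$i.noStar.clean.fasta"
```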
Create a working directory
mkdir 05.InterProScan && cd 05.InterProScan
# Root privileges are required
sudo su
# Enter your password when prompted
2) Annotate with the local InterProScan
/media/aa/DATA/SZQ2/my_interproscan/interproscan-5.61-93.0/interproscan.sh -i /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar_clean/$i.noStar.clean.fasta -f tsv -o $i.interpro.tsv
# Consolidate
grep IPR $i.interpro.tsv | cut -f 1,12,13 | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > $i.Interpro.txt
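`grep IPR` keeps any row that contains the text "IPR" anywhere, including inside a description. A slightly stricter sketch tests only column 12, which, as far as I understand the documented TSV layout, holds the InterPro accession, with the InterPro description in column 13. The function name `extract_ipr` is my own, and it covers only the filtering step, not gene_annotation_from_table.pl.

```shell
#!/usr/bin/env bash
# Keep rows whose column 12 starts with "IPR" and print columns 1, 12 and 13, tab-separated.
extract_ipr() {
    awk -F '\t' -v OFS='\t' '$12 ~ /^IPR/ { print $1, $12, $13 }' "$1"
}

# Usage sketch:
# extract_ipr "$i.interpro.tsv" | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > "$i.Interpro.txt"
```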
Batch processing
Batch: pep70
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
echo "/media/aa/DATA/SZQ2/my_interproscan/interproscan-5.61-93.0/interproscan.sh -i /media/aa/DATA/SZQ2/bj/functional_annotation/pep70_noStar_clean/$i.noStar.clean.fasta -f tsv -o $i.interpro.tsv"
done > command.InterProScan.list
ParaFly -c command.InterProScan.list -CPU 70
The "_" in IDs such as "Amath1_Amath100001" has to be changed back to "|"
# ${i}_ needs the braces: a bare $i_ would be read as the unset variable "i_"
find . -name "$i.interpro.tsv" -exec sed -i "s/${i}_${i}/${i}|${i}/g" {} +
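The variable-name pitfall is easy to trip over, so here is the substitution as a small standalone sketch. The function name `restore_pipe` is hypothetical, `sed -i` assumes GNU sed, and the sample name is used as a regular expression, so names containing characters such as "." would match more loosely than intended.

```shell
#!/usr/bin/env bash
# Turn "<name>_<name>" back into "<name>|<name>" in a result file, editing it in place.
restore_pipe() {
    local name=$1 file=$2
    # ${name}_ needs the braces: a bare $name_ would be read as the unset variable "name_"
    sed -i "s/${name}_${name}/${name}|${name}/g" "$file"
}

# Usage sketch:
# restore_pipe "$i" "$i.interpro.tsv"
```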
# Replace
# Single quotes inside the double-quoted echo are kept literally, so the list needs no manual editing
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
echo "find . -name '$i.interpro.tsv' -exec sed -i 's/${i}_${i}/${i}|${i}/g' {} +"
done > command.rename.list
ParaFly -c command.rename.list -CPU 48
# Tidy up
mkdir Interpro.txt && cd Interpro.txt
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/70list.txt`
do
echo "grep IPR ../$i.interpro.tsv | cut -f 1,12,13 | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > $i.Interpro.txt"
done > command.grep.list
ParaFly -c command.grep.list -CPU 20
Batch: pepmy
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
echo "/media/aa/DATA/SZQ2/my_interproscan/interproscan-5.61-93.0/interproscan.sh -i /media/aa/DATA/SZQ2/bj/functional_annotation/pepmy_noStar_clean/$i.noStar.clean.fasta -f tsv -o $i.interpro.tsv"
done > command.InterProScan.list
ParaFly -c command.InterProScan.list -CPU 48
# Replace
# Single quotes inside the double-quoted echo are kept literally, so the list needs no manual editing
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
echo "find . -name '$i.interpro.tsv' -exec sed -i 's/${i}_${i}/${i}|${i}/g' {} +"
done > command.rename.list
ParaFly -c command.rename.list -CPU 48
# Tidy up
mkdir Interpro.txt && cd Interpro.txt
for i in `cat /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt`
do
echo "grep IPR ../$i.interpro.tsv | cut -f 1,12,13 | /media/aa/DATA/SZQ2/command/clf/bin/gene_annotation_from_table.pl - > $i.Interpro.txt"
done > command.grep.list
ParaFly -c command.grep.list -CPU 20
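With this many batch jobs, it is worth checking that every sample actually produced a result. A minimal sketch, assuming the `<name>.Interpro.txt` naming used above; the function name `missing_outputs` is my own.

```shell
#!/usr/bin/env bash
# Print every sample name from the list whose result file is missing or empty.
missing_outputs() {
    local list=$1 dir=$2 name
    while IFS= read -r name; do
        [ -n "$name" ] || continue                               # skip blank lines
        [ -s "$dir/$name.Interpro.txt" ] || printf '%s\n' "$name" # -s: exists and is non-empty
    done < "$list"
}

# Usage sketch:
# missing_outputs /media/aa/DATA/SZQ2/bj/functional_annotation/pepmylist.txt .
```

Any name it prints can be rerun on its own instead of repeating the whole batch.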
That completes the InterProScan step.
The results can be used for the downstream GO and TF annotation.