Building an hmmscan database with HMMER

zxx11542

Article link: https://blog.csdn.net/zxx11542/article/details/117652375

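This post shows how to build a custom HMM database with HMMER's hmmpress and search sequences against it with hmmscan. First, download the database archive from the given URL, extract it, and concatenate all the HMM files. Next, run hmmpress on the merged file to compress and index it into a database that hmmscan can read. Finally, run hmmscan with the output options and an E-value threshold to get the results. The workflow is useful in bioinformatics, especially for sequence searches and functional annotation.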


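A couple of days ago this academic noob wanted to search some sequences with hmmsearch. The Pfam-A.hmm file that hmmsearch is usually run against is a single HMM file, but the database I depend on, VOGdb, does not ship as one complete HMM file, so I had no idea how to set it up...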

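Then I found hmmscan. (The academic noob had not read the HMMER User's Guide carefully. Damn.) The guide explains that before running hmmscan, all the profiles have to sit in a single HMM file that has been prepared for searching, and the tool that does the preparation is hmmpress: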

Constructs binary compressed datafiles for hmmscan, starting from a profile database
hmmfile in standard HMMER3 format. The hmmpress step is required for hmmscan to work.
Four files are created: hmmfile.h3m, hmmfile.h3i, hmmfile.h3f, and hmmfile.h3p. The
hmmfile.h3m file contains the profile HMMs and their annotation in a binary format.
The hmmfile.h3i file is an SSI index for the hmmfile.h3m file. The hmmfile.h3f file contains precomputed data structures for the fast heuristic filter (the MSV filter). The
hmmfile.h3p file contains precomputed data structures for the rest of each profile.
hmmfile may not be ’-’ (dash); running hmmpress on a standard input stream rather
than a file is not allowed.

hmmpress [options] hmmfile
Here is how I used it.
If you still need to download the database (how you get it doesn't matter), this is just for reference:
curl -LO http://fileshare.csb.univie.ac.at/vog/latest/vog.hmm.tar.gz
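If you already have the database, go straight to creating the folder, extracting the archive, merging the HMM files, and pressing the result: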
mkdir vog
tar -C vog -xf vog.hmm.tar.gz
cat vog/* > VOGs.hmms
hmmpress VOGs.hmms

That completes building your own HMM database.
After that, just call hmmscan:
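
Before moving on, it can be worth confirming that hmmpress really produced the four index files the User's Guide lists (.h3m, .h3i, .h3f, .h3p). The following is a minimal sketch, not part of HMMER: `check_pressed_db` is a made-up helper name, and it only tests that the files exist and are non-empty, not that they are valid.

```shell
#!/usr/bin/env bash
# check_pressed_db is a hypothetical helper, not an HMMER command.
# It returns 0 only if all four hmmpress index files sit next to the HMM file
# and are non-empty; it does not validate their contents.
check_pressed_db() {
    local hmmfile=$1 ext missing=0
    for ext in h3m h3i h3f h3p; do
        if [ ! -s "${hmmfile}.${ext}" ]; then
            echo "missing or empty: ${hmmfile}.${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the database built above this would be called as `check_pressed_db VOGs.hmms && echo "ready for hmmscan"`. If stale index files are in the way, `hmmpress -f VOGs.hmms` overwrites them.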

hmmscan [options] hmmdb seqfile
hmmdb: the pressed HMM database to search against (here, VOGs.hmms)
seqfile: your sequence file
-h: list all options

I ran it in the background:
nohup hmmscan -o vog_out/output_19292.txt --tblout vog_out/output_19292_pro.tbl \
    --domtblout vog_out/output_19292_pro.dom -E 1e-5 \
    VOGs.hmms my_data/1kbvotu/virsorter_vhmm_result_prot.fasta &
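
With a job pushed to the background like this, it is easy to start parsing a table that is still being written. HMMER 3 tabular outputs normally end with a `# [ok]` comment line once the run completes; treat that as an assumption and check it against the output of your own HMMER version. Under that assumption, a rough completion check could look like this (`job_finished` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# job_finished is a hypothetical helper, not an HMMER command.
# Assumption: a completed --tblout / --domtblout file ends with the line "# [ok]".
job_finished() {
    local tbl=$1
    [ -f "$tbl" ] && [ "$(tail -n 1 "$tbl")" = "# [ok]" ]
}
```

Usage would be something like `job_finished vog_out/output_19292_pro.tbl && echo "hmmscan is done"`.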

-o FILE

Write the main output to FILE. By default it goes to standard output.

--tblout FILE

Write a table of protein family hits, one line per matching family, to FILE. Not written by default.

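--domtblout FILE

Write a table of per-domain hits to FILE. Not written by default. Each row includes the start and end coordinates of the match on both the query sequence and the target profile.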

--acc

Show accessions (for Pfam, the PF numbers) in the output instead of protein family names, which are the default.

--noali

Leave the alignments out of the output, which makes the output file smaller.

-E FLOAT (default: 10.0)

Set the E-value threshold. 1e-5 is the commonly recommended setting (quite a few of the articles I have seen use 1e-5).

-T FLOAT (I have not seen this one set very often)

Set the bit score threshold.

--domE FLOAT (default: 10.0)

Set an E-value threshold. This works like -E, except that it applies to individual domain hits.

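--cpu INT

Number of CPU worker threads. The default should be greater than 1 (the User's Guide lists 2), so multithreading is on out of the box. In practice, though, a single hmmscan process probably only uses about 150% of one CPU. Also, if you launch several hmmscan processes in parallel, going above 4 concurrent runs can fail with: Fatal exception (source file esl_threads.c, line 129). Setting --cpu to 1 fixes that.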
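
To tie the --tblout and -E options together: once the run is done, the table can be reduced to one best hit per query sequence. The sketch below assumes the hmmscan --tblout column layout ($1 target profile name, $3 query sequence name, $5 full-sequence E-value, $6 full-sequence bit score) and whitespace-free sequence names; `best_hits` is a made-up helper name, and the default cutoff of 1e-5 simply mirrors the value used above.

```shell
#!/usr/bin/env bash
# best_hits is a hypothetical helper, not an HMMER command.
# Assumed hmmscan --tblout columns:
#   $1 = target (profile) name     $3 = query (sequence) name
#   $5 = full-sequence E-value     $6 = full-sequence bit score
# Prints: query <TAB> best profile <TAB> E-value <TAB> score, sorted by query.
best_hits() {
    local tbl=$1 max_evalue=${2:-1e-5}
    awk -v max="$max_evalue" '
        /^#/ { next }                      # skip header and footer comments
        ($5 + 0) > (max + 0) { next }      # drop hits above the E-value cutoff
        !($3 in score) || ($6 + 0) > score[$3] {
            score[$3] = $6 + 0             # remember the highest score per query
            line[$3]  = $3 "\t" $1 "\t" $5 "\t" $6
        }
        END { for (q in line) print line[q] }
    ' "$tbl" | sort
}
```

For the run above this would be `best_hits vog_out/output_19292_pro.tbl 1e-5 > best_vog_hits.tsv` (the output file name is just an example). Picking the hit with the highest bit score is one reasonable choice; ranking by E-value instead would be equally defensible.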

p.s.
HMMER says hmmsearch is the more efficient choice. Everybody else probably knew that already, but what can I say, I'm an academic noob!

Either hmmsearch or hmmscan can compare a set of profiles to a set of sequences. Due to disk access patterns of the two tools, it is usually more efficient to use hmmsearch, unless the number of profiles greatly exceeds the number of sequences.
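
The guide's rule of thumb can be turned into a quick check by counting profiles and sequences. This is only a sketch: `pick_tool` is a made-up helper, it assumes each profile in an HMMER3 text file has exactly one line starting with `NAME`, and the factor of 10 standing in for "greatly exceeds" is an arbitrary assumption, not a value from the guide.

```shell
#!/usr/bin/env bash
# pick_tool is a hypothetical helper, not an HMMER command.
# Assumptions: one "NAME" line per profile in the HMMER3 text file,
# one ">" header per sequence in the FASTA file, and an arbitrary
# ratio (default 10) as the stand-in for "greatly exceeds".
pick_tool() {
    local hmmfile=$1 seqfile=$2 ratio=${3:-10}
    local n_profiles n_seqs
    n_profiles=$(grep -c '^NAME ' "$hmmfile")
    n_seqs=$(grep -c '^>' "$seqfile")
    if [ "$n_profiles" -gt $(( n_seqs * ratio )) ]; then
        echo hmmscan
    else
        echo hmmsearch
    fi
}
```

For example, `pick_tool VOGs.hmms my_data/1kbvotu/virsorter_vhmm_result_prot.fasta`. If it prints hmmsearch, note that hmmsearch reads the concatenated HMM file directly, with no hmmpress step, and takes its arguments in the same order (hmmfile, then sequence file). In the hmmsearch tables the roles are swapped: the profile is the query and the sequence is the target.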