CompareM-平均氨基酸一致性（AAI）计算

Neptuneyut

已于 2024-01-10 21:55:07 修改

阅读量2.3k

点赞数 3

文章标签： CompareM

于 2023-11-12 20:16:07 首次发布

本文链接：https://blog.csdn.net/qq_42491125/article/details/122469638

版权

文章目录

Comparem简介
安装
使用
- 基于基因组计算氨基酸一致性
- 基于基因组蛋白计算氨基酸一致性
结果
转变成矩阵
问题
- terminate called after throwing an instance of 'File_write_exception'
参考

Comparem简介

CompareM 是一个支持进行大规模基因组比较分析的软件工具包。它提供跨基因组（如氨基酸一致性）和单个基因组（如密码子使用率）的统计计算。同时可以并行化，以便能够扩展到数千个基因组。主要功能：

比较基因组统计

基因组之间的平均氨基酸一致性（AAI）
通过计算查询基因组与参考数据库之间的 AAI 进行物种分类

基因组使用模式

密码子使用
氨基酸使用
k <= 8 的 kmer 使用情况（如四核苷酸）
终止密码子使用

其他

识别基因侧向转移（LGT）的二核苷酸和密码子使用模式
使用差异矩阵、分层聚类树和热图进行数据探索

安装

# 安装
(base)$ mamba install -c bioconda comparem

使用

基于基因组计算氨基酸一致性

使用aai_wf 流程计算氨基酸一致性:

# 待比较基因组序列
(base) [yutao@myosin test]$ ls
GCA_001780165.1_genomic.fa  HTR8_metabat2_bin.67.fa GCA_003235575.1_genomic.fa

# 运行aai_wf
(base) [yutao@myosin test]$ time comparem aai_wf -c 30 -x fa . aaiwf_out &>aaiwf.log                 
real    0m23.745s
user    0m40.652s
sys     0m1.594s
# -c 线程
# -x 基因组后缀
# . 带比较基因组目录
# aaiwf_out 输出目录

(base) [yutao@myosin test]$ cat aaiwf.log
[2022-01-24 11:25:04] INFO: CompareM v0.1.2
[2022-01-24 11:25:04] INFO: comparem aai_wf -c 30 -x fa . aaiwf_out
[2022-01-24 11:25:04] INFO: Identifying genes within genomes:
  Finished processing 3 of 3 (100.00%) genomes. # 共3个基因组，总共3种AAI比较方式
[2022-01-24 11:25:19] INFO: Identified genes written to: aaiwf_out/genes # 对基因组预测蛋白
[2022-01-24 11:25:19] INFO: Appending genome identifiers to query genes.
[2022-01-24 11:25:19] INFO: Creating DIAMOND database (be patient!).
[2022-01-24 11:25:19] INFO: Performing self similarity sequence between genomes (be patient!). # 在基因组之间进行相似性比较
[2022-01-24 11:25:19] INFO: Sorting table with hits (be patient!).
[2022-01-24 11:25:20] INFO: Sequence similarity results written to: aaiwf_out/similarity
[2022-01-24 11:25:20] INFO: Calculating length of genes.
[2022-01-24 11:25:20] INFO: Indexing sorted hit table.
[2022-01-24 11:25:20] INFO: Calculating AAI between all 3 pairs of genomes: #计算3对基因组的AAI
  Finished processing 3 of 3 (100.00%) pairs.
[2022-01-24 11:25:20] INFO: Summarizing AAI results.
[2022-01-24 11:25:20] INFO: AAI between genomes written to: aaiwf_out/aai/aai_summary.tsv

其中当前目录包含一组FASTA格式的基因组，结果被写入一个名为aai_output的目录，30个处理器应被用于计算结果。
可以看到comparem最终是对所有基因组两两之间进行比较（不考虑顺序），所以是一个组合情况，可以通过R的```chose(n, k)``得到最终的组合数，例如，3个基因组的最终需要进行3次比较，16个基因组最终需要进行120次比较。
还可以指定一些可选的参数。这包括用于定义基因组间互为最佳命中（即同源物）的序列相似性参数。默认情况下，e值（–evalue）、序列同一性百分比（–per_identity）和比对长度百分比（–per_aln_len）参数被设置为1e-5、30%和70%。当指定要处理的基因组目录时，CompareM只处理扩展名为fna的文件。这可以用-x(–file_ext)参数来改变。此外，如果基因组已经由氨基酸蛋白质序列表示（相对于基因组核苷酸序列），这必须用–蛋白质标志来指定。否则，将使用Prodigal从头识别基因。通过使用-cpus参数指定的多个处理器，可以大大减少计算所有成对AAI值的时间。

基于基因组蛋白计算氨基酸一致性

对于基因组序列：默认使用prodigal预测基因；<input_file>参数表示要比较的基因组集合，可以是i）一个文本文件，其中每一行表示一个基因组的位置，或者ii）一个包含所有要比较的基因组/蛋白（氨基酸）的目录。基因组/蛋白的序列必须是FASTA格式。<output_dir>表示所有输出文件目录。
对于蛋白序列：直接使用faa氨基酸序列

nohup time comparem aai_wf --proteins -c  30 -x gz GTDBr214_479_B.anthracis_gene GTDBr214_479_B.anthracis_gene_aai &>aaiwf.log &
# aai_wf AAI工作流程
# --proteins 指定输入文件是蛋白序列
# -c 线程
# -x 输入蛋白序列后缀名称

结果

(base) [yutao@myosin Two_new_classes]$ ls aaiwf_out/
aai  comparem.log  genes  similarity
(base) [yutao@myosin aaiwf_out]$ head aai/aai_summary.tsv
#Genome A       Genes in A      Genome B        Genes in B      # orthologous genes     Mean AAI        StdOrthologous fraction (OF)
HTR8_metabat2_bin.67    2502    GCA_001780165.1_genomic 3086    497     47.82   10.13   19.86
HTR8_metabat2_bin.67    2502    GCA_003235575.1_genomic 2464    430     47.97   10.59   17.45
GCA_001780165.1_genomic 3086    GCA_003235575.1_genomic 2464    965     52.99   11.96   39.16
(base) [yutao@myosin aaiwf_out]$

成对的AAI统计数据在输出文件./<output_dir>/aai/aai_summary.tsv中提供。该文件由8列组成，具体含义如下，其中第6列即是AAI值。

1-第一个基因组的标识符
2-第一个基因组中的基因数
3-第二个基因组的标识符
4-第二个基因组中的基因数
5-两个基因组之间确定的直系同源基因的数量
6-直系同源基因的平均氨基酸一致性（AAI）。
7-直系同源基因的AAI的标准偏差
8-两个基因组之间的直系亲缘关系（OF），定义为直系亲缘关系的基因数除以其中一个基因组的最小基因数。

转变成矩阵

上述长列表数据，可以通过如下代码转换为矩阵，随后可按照如下方式可视化（20个基因组左右时适合）。

if(!requireNamespace('pacman')){install.packages('pacman')}
pacman::p_load(igraph, corrplot, ggsci, ggplot2 )
# 用于将长表变成宽表
setwd('/Users/yut/Documents/12个海洋元基因组/MAGs_gene_functions/Two_novel_class/')
f = 'Eisenbacteria_aai_summary(1).tsv' # 读入compareM aai_wf输出长格式表
f ='Krumholzibacteriota_aai_summary(1).tsv'
d <- read.table(f, header = F)[c(1, 3, 6)] # 取出第一个基因组，第二个基因组及平均AAI
g = graph.data.frame(d, directed = FALSE) #使用igraph读入数据框
mat <- get.adjacency(g, attr = 'V6' # 显示的AAI值
                     , spars = F # 非稀疏，没有的填充0
                     )
mat[mat == 0] <- 100 # 自身AAI 100
#mat <- round(mat / 100, 2) # 将百分比转化成小数，便于作图
mat

# 设置颜色映射
my.col <- colorRampPalette(c('#4e9cb8' #最小值的颜色
                             ,'#f2f1f1' # 中间颜色
                             , '#de6589' # 最大值颜色
                             ))(10) # 取10个连续色
# corrplot
pdf('Eisenbacteria_aai_summary.pdf', height = 15, width = 15)
pdf('Krumholzibacteriota_aai_summary.pdf', height = 15, width = 15)
p <- corrplot(mat
         , method = 'circle' # cell shape
         , type = 'lower' # lower triangle
         , order = 'hclust' # 排序方式为层级聚类，"original", "AOE", "FPC", "hclust", "alphabet"
         , hclust.method = 'complete' # 聚类方法： ward', 'ward.D', 'ward.D2', 'single', 'complete', 'average', 'mcquitty', 'median' or 'centroid'
         , col.lim = c(0, 100) # 设置颜色mapping值的范围
         , is.corr = F # 非相关系数的值
         ,col = my.col # 颜色
         , tl.col = 'black' # x/y坐标字体颜色
         , addCoef.col = T # 显示值
         , title = 'AAI (%)'
         )
dev.off()
# 不支持ggsave(plot = p, 'Eisenbacteria_aai_summary.pdf', height = 12, width = 12, device = 'pdf')

在这里插入图片描述

问题

terminate called after throwing an instance of ‘File_write_exception’

缓存磁盘写满

[2024-01-10 21:33:57] INFO: Performing self similarity sequence between genomes (be patient!).
No space left on device
terminate called after throwing an instance of 'File_write_exception'
  what():  Error writing file /tmp/diamond-tmp-ZmehKz
[2024-01-10 21:38:48] INFO: Sorting table with hits (be patient!).
[2024-01-10 21:38:48] INFO: Sequence similarity results written to: Fungi_5417_genome_protein_AAI_out/similarity
[2024-01-10 21:38:48] INFO: Calculating length of genes.
[2024-01-10 21:41:23] INFO: Indexing sorted hit table.

mkdir tmp
--tmp_dir tmp #对于很多序列一定要使用（需要自己先创建），否则默认输出中间文件在/tmp/,一旦磁盘写满就会报错

参考

comparem github