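A Brief Walkthrough of NCBI's ANI Report for All Prokaryotes
I have not worked out the full background yet, so I will start from the results. Here is a link to them: https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/
The file ANI_report_prokaryotes.txt in that directory is the final report.
The accompanying README_ANI_report_prokaryotes.txt tells us that:
1. The file is updated continuously.
2. It contains ANI information for every prokaryotic genome submitted to GenBank.
3. ANI is computed with the method described in the article the README cites.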
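As a starting point, here is a minimal sketch for fetching the two files with the Python standard library. It assumes both files sit directly under the directory linked above; `report_url` and `download` are illustrative helper names, not part of any NCBI tooling.

```python
import urllib.request
from typing import Optional

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/"
REPORT_NAME = "ANI_report_prokaryotes.txt"
README_NAME = "README_ANI_report_prokaryotes.txt"


def report_url(name: str = REPORT_NAME) -> str:
    """Build the URL of a file, assuming it lives directly under BASE_URL."""
    return BASE_URL + name


def download(name: str = REPORT_NAME, dest: Optional[str] = None) -> str:
    """Download one file into the working directory and return the local path."""
    dest = dest or name
    urllib.request.urlretrieve(report_url(name), dest)
    return dest


# Usage (performs a network request, so it is left commented out):
# download(README_NAME)
# download(REPORT_NAME)
```

The report is refreshed regularly, so it is worth recording the download date alongside any analysis.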
ANI
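ANI stands for average nucleotide identity, a measure of how closely two genomes are related at the nucleotide level. It is defined as the mean base identity across the homologous fragments shared by two microbial genomes, and its main strength is that it discriminates well between closely related species. [1]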
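The definition can be made concrete with a toy calculation. The sketch below assumes the homologous fragments have already been found and aligned to equal length, which is the hard part that real ANI tools handle; `fragment_identity` and `toy_ani` are hypothetical names, and this is not the method NCBI uses.

```python
def fragment_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned, equal-length fragments."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def toy_ani(fragment_pairs) -> float:
    """Mean percent identity over pre-aligned homologous fragment pairs."""
    if not fragment_pairs:
        raise ValueError("at least one fragment pair is required")
    identities = [fragment_identity(a, b) for a, b in fragment_pairs]
    return 100.0 * sum(identities) / len(identities)


# Two made-up fragment pairs: one with a single mismatch in 8 bases, one identical.
pairs = [("ACGTACGT", "ACGTACGA"), ("GGCCTTAA", "GGCCTTAA")]
print(toy_ani(pairs))  # 93.75
```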
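The results themselves
First, here is what each column of the report means, according to the README:
Columns 0-8: basic assembly information
0. GenBank accession of the assembly | 1. RefSeq accession of the assembly | 2. Taxonomy ID of the assembly | 3. Taxonomy ID of the species the assembly belongs to (differs from column 2 when the assembly is at the subspecies level, or comes from an older strain that has its own taxonomy ID) | 4. Organism name of the assembly, corresponding to column 2 | 5. Species name of the assembly, corresponding to column 3 | 6. Assembly name, the identifier for this particular assembly | 7. If the assembly comes from a type strain, its type category: "type", "neotype", "pathovar", "reftype", "syntype" or "suspected-type"; "na" if it does not come from a type strain | 8. Reasons the assembly was excluded from RefSeq; "na" if the assembly is sound |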
---|---|---|---|---|---|---|---|---|
genbank-accession | refseq-accession | taxid | species-taxid | organism-name | species-name | assembly-name | assembly-type-category | excluded-from-refseq |
Additional notes on column 7:
type - the sequences in the genome assembly were derived from type material
neotype - the sequences in the genome assembly were derived from neotype material
pathovar - the sequences in the genome assembly were derived from pathovar material
reftype - the sequences in the genome assembly were derived from reference material where type material never was available and is not likely to ever be available
syntype - the sequences in the genome assembly were derived from synonym type material
suspected-type - the type is one of the types listed above but because it does not match other type-strain assemblies for the same species, or cannot be vetted for some other reason, it is not used to make taxid changes even though it is used to generate ANI data.
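Additional notes on columns 7 and 8:
Any type-strain assembly that is untrustworthy as type will have “na” in the assembly-type-category column.
In other words, when an assembly derived from a type strain is excluded from RefSeq for a reason listed in column 8, and that reason makes the assembly untrustworthy, column 7 is also set to "na".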
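A small sketch of how the type category might be checked in code. It assumes each report row has already been loaded into a dict keyed by the column names shown above; `is_type_strain_assembly` is an illustrative helper name.

```python
# Type categories listed in the README notes for assembly-type-category.
TYPE_CATEGORIES = {
    "type", "neotype", "pathovar", "reftype", "syntype", "suspected-type",
}


def is_type_strain_assembly(row: dict) -> bool:
    """True when the row carries one of the listed type categories.

    Rows with "na" are either not from a type strain or were judged
    untrustworthy as type, so both cases return False.
    """
    return row["assembly-type-category"] in TYPE_CATEGORIES


print(is_type_strain_assembly({"assembly-type-category": "neotype"}))  # True
print(is_type_strain_assembly({"assembly-type-category": "na"}))       # False
```

Note that "suspected-type" is kept in the set because the README says such assemblies are still used to generate ANI data; drop it if only vetted type material is wanted.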
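Columns 9-14: declared-type-assembly match results
9. The type-strain assembly of the declared species that best matches this assembly, or "no-type" if the species has no type-strain assembly. If this assembly itself comes from a type strain, the best-matching other type-strain assembly, or "same" if it is the only type-strain assembly | 10. Organism name of the assembly in column 9 | 11. Type category of the assembly in column 9, labeled as in column 7; "no-type" if the species has no type-strain assembly, or "na" if this assembly is the only type-strain assembly | 12. ANI between this assembly and the type-strain assembly of its species; "na" if the species has no type-strain assembly, or if column 13 or column 14 is below 10% | 13. Percentage of this assembly covered by the type-strain assembly in column 9 | 14. Percentage of the type-strain assembly in column 9 covered by this assembly |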
---|---|---|---|---|---|
declared-type-assembly | declared-type-organism-name | declared-type-category | declared-type-ANI | declared-type-qcoverage | declared-type-scoverage |
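Because these columns mix numbers with sentinel strings, a little care is needed when reading them. The following is a rough sketch, assuming rows are dicts keyed by column name; `parse_number` and `declared_type_summary` are hypothetical helpers, and the example values are made up.

```python
from typing import Optional


def parse_number(value: str) -> Optional[float]:
    """Convert a numeric field to float, mapping the "na" sentinel to None."""
    value = value.strip()
    if value in ("na", ""):
        return None
    return float(value)


def declared_type_summary(row: dict) -> str:
    """Describe the declared-type match of one row in plain words."""
    target = row["declared-type-assembly"]
    if target == "no-type":
        return "species has no type-strain assembly"
    if target == "same":
        return "this is the only type-strain assembly for the species"
    ani = parse_number(row["declared-type-ANI"])
    if ani is None:
        # Per the column description: query or subject coverage is below 10%.
        return "ANI not reported (coverage below 10%)"
    return f"ANI {ani:.2f} against {target}"


# Made-up row for illustration only.
row = {"declared-type-assembly": "GCA_999999999.1", "declared-type-ANI": "98.7"}
print(declared_type_summary(row))  # ANI 98.70 against GCA_999999999.1
```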
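Columns 15-24: best-match-type-assembly match results
15. The best-matching type-strain assembly according to ANI; "none-found" if no type-strain assembly matches this assembly | 16. Species taxonomy ID of the assembly in column 15 | 17. Species name of the assembly in column 15 | 18. Type category of the assembly in column 15, labeled as in column 7 | 19. ANI between this assembly and the assembly in column 15 | 20. Percentage of this assembly covered by the assembly in column 15 | 21. Percentage of the assembly in column 15 covered by this assembly | 22. Status of the best match between this assembly and the assembly in column 15 | 23. Comment on the match (see the notes on column 23) | 24. Taxonomy check status derived from columns 22 and 23, with three levels: "OK", "Inconclusive" and "Failed" |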
---|---|---|---|---|---|---|---|---|---|
best-match-type-assembly | best-match-species-taxid | best-match-species-name | best-match-type-category | best-match-type-ANI | best-match-type-qcoverage | best-match-type-scoverage | best-match-status | comment | taxonomy-check-status |
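With all 25 column names in hand, the report can be read into dicts. This is a sketch under two assumptions that should be checked against the real file: that it is tab-delimited, and that its header line either starts with "#" or repeats the first column name. `read_ani_report` is an illustrative name, and the sample row is entirely made up.

```python
import csv
import io

COLUMNS = [
    "genbank-accession", "refseq-accession", "taxid", "species-taxid",
    "organism-name", "species-name", "assembly-name",
    "assembly-type-category", "excluded-from-refseq",
    "declared-type-assembly", "declared-type-organism-name",
    "declared-type-category", "declared-type-ANI",
    "declared-type-qcoverage", "declared-type-scoverage",
    "best-match-type-assembly", "best-match-species-taxid",
    "best-match-species-name", "best-match-type-category",
    "best-match-type-ANI", "best-match-type-qcoverage",
    "best-match-type-scoverage", "best-match-status", "comment",
    "taxonomy-check-status",
]


def read_ani_report(handle):
    """Yield one dict per data row, keyed by the column names above."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and the header (assumed to start with "#").
        if not fields or fields[0].startswith("#") or fields[0] == COLUMNS[0]:
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        yield dict(zip(COLUMNS, fields))


# Made-up header and row, for illustration only.
values = [
    "GCA_000000000.1", "GCF_000000000.1", "111", "111",
    "Examplea fictiva", "Examplea fictiva", "ASM0v1", "na", "na",
    "GCA_999999999.1", "Examplea fictiva", "type", "99.1", "95.0", "94.0",
    "GCA_999999999.1", "111", "Examplea fictiva", "type", "99.1", "95.0",
    "94.0", "species-match", "na", "OK",
]
sample = "# " + "\t".join(COLUMNS) + "\n" + "\t".join(values) + "\n"
rows = list(read_ani_report(io.StringIO(sample)))
print(rows[0]["best-match-status"])  # species-match
```

For the real file, pass an open text handle instead of the `io.StringIO` sample, for example `open("ANI_report_prokaryotes.txt", newline="")`.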
Additional notes on column 22:
Values that indicate the species declared for the query assembly is OK:
- species-match
- the query assembly matches a type-strain assembly for the declared species.
- subspecies-match
- the query assembly matches a type-strain assembly for the declared species and both are the same subspecies.
- synonym-match
- the query assembly matches a type-strain assembly for a synonym of the declared species. A specialized synonymy list is used to handle difficult cases of typing.
- derived-species-match
- the query assembly matches a type-strain assembly for a subspecies of the declared species.
- genus-match
- the query assembly has an informal species name (usually “sp.” format), and the best-matching type-strain assembly shares the same genus.
- approved-mismatch
- the query assembly best matches a type-strain assembly from a different species above ANI threshold, but the mismatch was manually reviewed and the declared species was accepted.
Values that indicate the species declared for the query assembly is incorrect:
- mismatch
- the query assembly best matches a type-strain assembly from a different species, above ANI threshold, even though a type-strain assembly for the declared species is available. GenBank will address the mismatch when high coverage values provide high confidence in the mismatch result, i.e. query coverage and subject coverage are both over 80%.
Values that indicate the ANI data are inconclusive:
- below-threshold-match
- the query assembly matches a type-strain assembly for the declared species but the ANI is below the species ANI threshold.
- below-threshold-mismatch
- the query assembly best matches a type-strain assembly from a different species but the ANI is below the species ANI threshold.
- low-coverage
- the query assembly did not match the best-matching type-strain assembly above 10% query-coverage and/or 10% subject-coverage.
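The three groups of best-match-status values can be written down as a lookup. The sketch below also encodes the 80% coverage rule quoted for mismatches; it assumes coverage values are percentages, and the group labels and function names are illustrative choices rather than anything defined by NCBI.

```python
OK_STATUSES = {
    "species-match", "subspecies-match", "synonym-match",
    "derived-species-match", "genus-match", "approved-mismatch",
}
INCORRECT_STATUSES = {"mismatch"}
INCONCLUSIVE_STATUSES = {
    "below-threshold-match", "below-threshold-mismatch", "low-coverage",
}


def status_group(status: str) -> str:
    """Map a best-match-status value to the group it is listed under."""
    if status in OK_STATUSES:
        return "ok"
    if status in INCORRECT_STATUSES:
        return "incorrect"
    if status in INCONCLUSIVE_STATUSES:
        return "inconclusive"
    return "unknown"


def is_high_confidence_mismatch(status: str, qcov: float, scov: float,
                                threshold: float = 80.0) -> bool:
    """True for a mismatch whose query and subject coverage both exceed 80%."""
    return status == "mismatch" and qcov > threshold and scov > threshold


print(status_group("synonym-match"))                       # ok
print(is_high_confidence_mismatch("mismatch", 92.0, 85.5))  # True
print(is_high_confidence_mismatch("mismatch", 92.0, 60.0))  # False
```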
Additional notes on column 23:
- Assembly is the type-strain, no match is expected
- the assembly is the only type-strain assembly for the species, hence it is expected that it may not match any other type-strain assembly.
- Assembly is the type-strain, mismatch is within genus and expected
- the assembly is the only type-strain assembly for the species, hence it is expected that its best match may be to a type-strain assembly from another species on the same genus but with ANI below 98%.
- Assembly is type-strain, failed to match other type-strains on its species
- a type-strain assembly is expected to match all other type-strain assemblies on the species.
Additional notes on column 24:
OK
- the ANI result is consistent with the declared species
The best-match-status is species-match, subspecies-match, derived-species-match, synonym-match, genus-match, approved-mismatch, or the comment indicates either that the assembly is the type-strain and no match is expected, or that the assembly is the type-strain, the mismatch is within genus and is expected.
Inconclusive
- the ANI result is inconclusive
The best-match-status is low-coverage, below-threshold-match, below-threshold-mismatch, na, or the comment indicates that the assembly is a type-strain that failed to match other type-strains on its species.
Failed
- the ANI result is inconsistent with the declared species
The best-match-status is mismatch and the comment is na.
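The rules for column 24 can be sketched as a single function of columns 22 and 23. This is an interpretation of the README text above, not NCBI's actual code: it assumes the comment strings contain the phrases quoted in the notes on column 23, and `taxonomy_check_status` is a hypothetical name.

```python
OK_STATUSES = {
    "species-match", "subspecies-match", "derived-species-match",
    "synonym-match", "genus-match", "approved-mismatch",
}
INCONCLUSIVE_STATUSES = {
    "low-coverage", "below-threshold-match", "below-threshold-mismatch", "na",
}


def taxonomy_check_status(best_match_status: str, comment: str) -> str:
    """Derive OK / Inconclusive / Failed from best-match-status and comment."""
    text = comment.strip().lower()
    # OK: a matching status, or a type-strain comment that explains the result.
    if best_match_status in OK_STATUSES:
        return "OK"
    if ("no match is expected" in text
            or "mismatch is within genus and expected" in text):
        return "OK"
    # Inconclusive: weak evidence, or a type strain that disagrees with its peers.
    if (best_match_status in INCONCLUSIVE_STATUSES
            or "failed to match other type-strains" in text):
        return "Inconclusive"
    # Failed: a plain mismatch with no explanatory comment.
    if best_match_status == "mismatch" and text in ("na", ""):
        return "Failed"
    return "unknown"


print(taxonomy_check_status("species-match", "na"))  # OK
print(taxonomy_check_status("low-coverage", "na"))   # Inconclusive
print(taxonomy_check_status("mismatch", "na"))       # Failed
```

Comparing this function's output with the taxonomy-check-status column on real rows is a quick way to confirm that the rules have been read correctly.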
References
1. 基因组相似性计算:ANI (Genome similarity calculation: ANI), 星空Idealist