Parsing the KEGG database with Biopython


生信修炼手册


Article link: https://blog.csdn.net/weixin_43569478/article/details/112386873


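This article shows how to use Biopython's Bio.KEGG module together with the KEGG API to download and parse data from Python, focusing on the KEGG pathway and enzyme databases. Example code shows how to find human genes related to DNA repair, and highlights how combining the API with Python's own logic makes customized analyses possible.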



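KEGG, often described as an encyclopedia of genes and genomes, is a comprehensive database made up of multiple sub-databases such as gene and pathway. To make KEGG data easier to query, the maintainers provide an official API.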

Biopython's Bio.KEGG module wraps the official KEGG API so that it can be called from Python. KEGG API requests map to Python calls as follows:

/list/hsa:10458+ece:Z5100 -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
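The mapping above follows a simple pattern: the operation name comes first, each argument becomes one path segment, and a list of identifiers is joined with `+`. The sketch below illustrates only that pattern. `kegg_url` and `KEGG_BASE` are hypothetical names introduced here, not part of Biopython, and the base URL is an assumption about where the KEGG REST service is hosted.

```python
from typing import Iterable, Union

# Assumed base URL of the KEGG REST service (not taken from the article).
KEGG_BASE = "https://rest.kegg.jp"


def kegg_url(operation: str, *args: Union[str, Iterable[str]]) -> str:
    """Hypothetical helper: build a KEGG REST URL from an operation and its arguments.

    A plain string becomes one path segment; a list of identifiers
    is joined with "+" into a single segment.
    """
    parts = [operation]
    for arg in args:
        if isinstance(arg, str):
            parts.append(arg)
        else:
            parts.append("+".join(arg))
    return "/".join([KEGG_BASE] + parts)


print(kegg_url("list", ["hsa:10458", "ece:Z5100"]))
print(kegg_url("find", "compound", "300-310", "mol_weight"))
print(kegg_url("get", ["hsa:10458", "ece:Z5100"], "aaseq"))
```

Printing the URLs rather than fetching them keeps the sketch offline; in real use the `REST.kegg_*` functions shown above issue the request for you.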

The REST module can download any type of data that the API supports. Taking a pathway as an example:

>>> from Bio.KEGG import REST
>>> pathway = REST.kegg_get('hsa00010')

The query returns a file-like handle; its read method converts the content to plain text, for example:

>>> pathway = REST.kegg_get('hsa00010')
>>> res = pathway.read().split("\n")
>>> res[0]
'ENTRY hsa00010 Pathway'
>>> res[1]
'NAME Glycolysis / Gluconeogenesis - Homo sapiens (human)'
>>> res[2]
'DESCRIPTION Glycolysis is the process of converting glucose into pyruvate and generating small amounts of ATP (energy) and NADH (reducing power). It is a central pathway that produces important precursor metabolites: six-carbon compounds of glucose-6P and fructose-6P and three-carbon compounds of glycerone-P, glyceraldehyde-3P, glycerate-3P, phosphoenolpyruvate, and pyruvate [MD:M00001]. Acetyl-CoA, another important precursor metabolite, is produced by oxidative decarboxylation of pyruvate [MD:M00307]. When the enzyme genes of this pathway are examined in completely sequenced genomes, the reaction steps of three-carbon compounds from glycerone-P to pyruvate form a conserved core module [MD:M00002], which is found in almost all organisms and which sometimes contains operon structures in bacterial genomes. Gluconeogenesis is a synthesis pathway of glucose from noncarbohydrate precursors. It is essentially a reversal of glycolysis with minor variations of alternative paths [MD:M00003].'
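Indexing into the split lines works for the first few fields, but a record can also be grouped by keyword. The following is a minimal sketch, assuming the usual KEGG flat-file layout in which the field keyword occupies the first 12 columns, continuation lines start with blanks, and `///` ends the record. `parse_kegg_fields` is a hypothetical helper, not a Biopython function, and the sample text is hand-written and abbreviated to mimic that layout rather than downloaded from KEGG.

```python
from collections import defaultdict


def parse_kegg_fields(text):
    """Hypothetical helper: group the lines of one KEGG flat-file record by keyword."""
    fields = defaultdict(list)
    current = None
    for line in text.splitlines():
        if line.startswith("///"):  # end-of-record marker
            break
        if not line.strip():
            continue
        # Assumed layout: keyword in columns 1-12, value in the rest of the line.
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword  # a new field starts here
        if current is not None and value:
            fields[current].append(value)  # continuation lines reuse the last keyword
    return dict(fields)


# Hand-written, abbreviated sample that mimics the layout (illustrative only).
sample = "\n".join([
    f"{'ENTRY':<12}hsa00010                    Pathway",
    f"{'NAME':<12}Glycolysis / Gluconeogenesis - Homo sapiens (human)",
    f"{'GENE':<12}3101  HK3; hexokinase 3",
    f"{'':<12}3098  HK1; hexokinase 1",
    "///",
])

record = parse_kegg_fields(sample)
print(record["NAME"][0])
print(len(record["GENE"]), "gene lines")
```

With a live connection, the same function could be applied to `REST.kegg_get('hsa00010').read()` in place of `sample`.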

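String parsing like this is enough to pull out a pathway's identifier, name, description, and other fields. Biopython also ships dedicated parsers for KEGG data, but they are incomplete and currently cover only sub-databases such as compound, map, and enzyme. Taking the enzyme database as an example, usage is as follows: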