Parsing the GO database's obo file into a Pandas DataFrame with Python

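A project I am working on needs a large network of biological mechanisms, which in turn needs the relationships between GO terms. The file I could download from the official site is in obo format, and its structure looks roughly like this: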

format-version: 1.2
data-version: releases/2023-01-01
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_pir
subset: goslim_plant
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
namespace: molecular_function
def: "OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]
comment: This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.
synonym: "ribosomal chaperone activity" EXACT []
is_obsolete: true
consider: GO:0042254
consider: GO:0044183
consider: GO:0051082

......
......

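The top of the file is a header describing the release. After it, every block that starts with [Term] holds the information for one GO term, with one tag and its value per line. The only fields I need here are id, name, namespace, and the parent GO terms that a term is linked to through is_a or relationship.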
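As a small illustration of the "one tag per line" layout, here is a minimal sketch that splits the lines of a single [Term] block into tag/value pairs. The sample lines are copied from the excerpt above; the helper name `split_tag` and the `term` dictionary are made up for this illustration and are not part of the script that follows.

```python
# Sample lines copied from the first [Term] block in the excerpt above.
sample = """[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""


def split_tag(line):
    """Split 'tag: value' at the first colon (hypothetical helper)."""
    tag, _, value = line.partition(':')
    return tag.strip(), value.strip()


# A tag such as is_a can repeat, so each tag maps to a list of values.
term = {}
for line in sample.splitlines():
    if not line or line.startswith('['):
        continue  # skip blank lines and the [Term] marker itself
    tag, value = split_tag(line)
    term.setdefault(tag, []).append(value)

print(term['id'])
print(term['is_a'])
```

Splitting at the first colon only matters because GO ids contain a colon themselves; splitting on every colon would cut `GO:0000001` in two.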

# Parse the [Term] blocks of the obo file into an edge table:
# one row per is_a / relationship link from a term to its parent.
from tqdm import tqdm
import pandas as pd

with open('GO_GO_in_GO.obo.txt', encoding='utf-8') as f:
    lines = f.readlines()

rows = []          # one tuple per (term, parent) link
in_term = False    # True only while inside a [Term] block
node1 = name1 = namespace = None

for line in tqdm(lines):
    line = line.strip()
    if line.startswith('['):
        # A new block starts. Only [Term] blocks are parsed; blocks such as
        # [Typedef] also contain id: and is_a: lines but do not describe GO terms.
        in_term = (line == '[Term]')
        node1 = name1 = namespace = None
    elif not in_term:
        continue
    elif line.startswith('id:'):
        node1 = line[3:].strip()
    elif line.startswith('name:'):
        name1 = line[5:].strip()
    elif line.startswith('namespace:'):
        namespace = line[10:].strip()
    elif line.startswith('is_a:'):
        # Format: is_a: <GO id> ! <name>
        node2, _, name2 = line[5:].partition('!')
        rows.append((node1, name1, namespace, 'is a', node2.strip(), name2.strip()))
    elif line.startswith('relationship:'):
        # Format: relationship: <type> <GO id> ! <name>
        target, _, name2 = line[13:].partition('!')
        relationship, node2 = target.split()[:2]
        rows.append((node1, name1, namespace, relationship, node2, name2.strip()))

GO_GO_data = pd.DataFrame(
    rows,
    columns=['GO_id1', 'GO_name1', 'type', 'relationship', 'GO_id2', 'GO_name2'],
)

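The code above parses the obo file into a pandas DataFrame with one row per link between two terms and the columns GO_id1, GO_name1, type, relationship, GO_id2 and GO_name2.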

This makes it easy to see, for every GO term, which other terms it is connected to, and the table can also be passed to networkx to build the network.
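As a rough sketch of that last step, the edge table can be turned into a directed graph with `networkx.from_pandas_edgelist`. The three rows below are typed in by hand from the excerpt above so the example runs on its own; with the real data you would pass `GO_GO_data` instead. Pointing each edge from the term to its parent is an assumption made here, not something the original script prescribes.

```python
import networkx as nx
import pandas as pd

# Three rows in the same layout as GO_GO_data, taken from the excerpt above.
edges = pd.DataFrame(
    [
        ('GO:0000001', 'mitochondrion inheritance', 'biological_process',
         'is a', 'GO:0048308', 'organelle inheritance'),
        ('GO:0000001', 'mitochondrion inheritance', 'biological_process',
         'is a', 'GO:0048311', 'mitochondrion distribution'),
        ('GO:0000002', 'mitochondrial genome maintenance', 'biological_process',
         'is a', 'GO:0007005', 'mitochondrion organization'),
    ],
    columns=['GO_id1', 'GO_name1', 'type', 'relationship', 'GO_id2', 'GO_name2'],
)

# Directed graph: each edge points from a term to its parent and keeps
# the relationship type as an edge attribute.
G = nx.from_pandas_edgelist(
    edges,
    source='GO_id1',
    target='GO_id2',
    edge_attr='relationship',
    create_using=nx.DiGraph,
)

print(G.number_of_nodes(), G.number_of_edges())
print(sorted(G.successors('GO:0000001')))
```

Keeping `relationship` as an edge attribute lets you filter the graph later, for example to keep only the is_a links.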

I hope this post helps anyone with a similar need.
