Parsing the GO Database's OBO File into a Pandas DataFrame with Python

Latest recommended article published on 2023-03-31 14:29:37

qw213e

Tags: python

Article link: https://blog.csdn.net/qw213e/article/details/128856842

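A project I am working on requires building a large network of biological mechanisms, and that network needs data on the relationships between GO (Gene Ontology) terms. The official site only offers this data as an OBO-format file, whose structure looks roughly like this: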

format-version: 1.2
data-version: releases/2023-01-01
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_pir
subset: goslim_plant
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000005
name: obsolete ribosomal chaperone activity
namespace: molecular_function
def: "OBSOLETE. Assists in the correct assembly of ribosomes or ribosomal subunits in vivo, but is not a component of the assembled ribosome when performing its normal biological function." [GOC:jl, PMID:12150913]
comment: This term was made obsolete because it refers to a class of gene products and a biological process rather than a molecular function.
synonym: "ribosomal chaperone activity" EXACT []
is_obsolete: true
consider: GO:0042254
consider: GO:0044183
consider: GO:0051082

......
......

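As you can see, the top of the file is a header describing the data set. Each [Term] stanza that follows holds the information for one GO term, with one tag and its value per line. The only fields I need here are id, name, namespace, and the parent GO terms that a term is linked to through is_a or relationship.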
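To make the stanza layout concrete, here is a minimal sketch that groups the lines of a small in-memory sample into one dictionary per [Term] stanza. It is only an illustration of the file structure, not the parsing approach used in this post. The names `SAMPLE` and `read_term_stanzas` are hypothetical, and the sample text is a shortened copy of the excerpt above.

```python
from collections import defaultdict

# Shortened copy of the excerpt above (header trimmed to two lines)
SAMPLE = """\
format-version: 1.2
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
is_a: GO:0007005 ! mitochondrion organization
"""


def read_term_stanzas(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    stanzas = []
    current = None  # None while in the header or in a non-Term stanza
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('['):
            # A stanza header: start collecting only for [Term]
            current = defaultdict(list) if line == '[Term]' else None
            if current is not None:
                stanzas.append(current)
        elif current is not None and ':' in line:
            # Split on the first colon only, since values such as GO:0000001 contain colons
            tag, _, value = line.partition(':')
            current[tag].append(value.strip())
    return stanzas


terms = read_term_stanzas(SAMPLE)
for term in terms:
    print(term['id'][0], term['name'][0], len(term['is_a']))
```

Each tag maps to a list because tags such as is_a can appear several times within one stanza.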

from tqdm import tqdm
import pandas as pd

# Read the whole OBO file; the context manager closes it afterwards
with open('GO_GO_in_GO.obo.txt', encoding='utf-8') as f:
    lines = f.readlines()

rows = []        # one row per (term, parent term) edge
in_term = False  # True only while inside a [Term] stanza

for line in tqdm(lines):
    line = line.strip()
    if line.startswith('['):
        # A new stanza begins; skip anything that is not a [Term],
        # e.g. [Typedef] stanzas, whose ids are not GO ids
        in_term = (line == '[Term]')
        continue
    if not in_term:
        continue
    if line.startswith('id:'):
        node1 = line[3:].strip()
    elif line.startswith('name:'):
        name1 = line[5:].strip()
    elif line.startswith('namespace:'):
        namespace = line[10:].strip()
    elif line.startswith('is_a:'):
        # e.g. "is_a: GO:0048308 ! organelle inheritance"
        target, _, name2 = line[5:].partition('!')
        node2 = target.split()[0]
        rows.append([node1, name1, namespace, 'is a', node2, name2.strip()])
    elif line.startswith('relationship:'):
        # Format: "relationship: <relation type> <GO id> ! <GO name>"
        target, _, name2 = line[13:].partition('!')
        relationship, node2 = target.split()[:2]
        rows.append([node1, name1, namespace, relationship, node2, name2.strip()])

GO_GO_data = pd.DataFrame(
    rows,
    columns=['GO_id1', 'GO_name1', 'type', 'relationship', 'GO_id2', 'GO_name2'],
)
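To see what is extracted from the two edge-bearing line types, the splitting can be tried on single lines without opening the file. This is a minimal standalone sketch: `split_edge_line` is a hypothetical helper, not part of the script. The is_a line comes from the excerpt earlier in the post, and the relationship line is an illustrative example written for this sketch, not copied from the file.

```python
def split_edge_line(line):
    """Split an is_a / relationship line into (relationship, parent_id, parent_name)."""
    tag, _, rest = line.strip().partition(':')
    # Everything after '!' is the human-readable name of the parent term
    target, _, parent_name = rest.partition('!')
    fields = target.split()
    if tag == 'is_a':
        relationship, parent_id = 'is a', fields[0]
    elif tag == 'relationship':
        # The relation type comes first, followed by the GO id
        relationship, parent_id = fields[0], fields[1]
    else:
        raise ValueError(f'not an edge line: {line!r}')
    return relationship, parent_id, parent_name.strip()


print(split_edge_line('is_a: GO:0048308 ! organelle inheritance'))
# Illustrative line, assumed to follow the usual "relationship: <type> <id> ! <name>" layout
print(split_edge_line('relationship: part_of GO:0005634 ! nucleus'))
```

Splitting on whitespace instead of fixed character offsets keeps the helper independent of the length of the relation type.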

The parsing script above turns the OBO file into a pandas DataFrame that looks roughly like this: