Biopython —— 你不知道的 NCBI 访问方式

最新推荐文章于 2024-08-24 09:51:44 发布

名本无名

最新推荐文章于 2024-08-24 09:51:44 发布

阅读量387

点赞数 2

分类专栏：生信工具文章标签： python

本文链接：https://blog.csdn.net/dxs18459111694/article/details/134012901

版权

生信工具专栏收录该内容

1 篇文章 0 订阅

订阅专栏

介绍

Biopython 是由 Python 编写的一组免费可用的计算生物学工具，解决一些生物信息学需求。

Biopython 致力于通过创造高质量和可重复利用的 Python 模块及类，使 Python 在生物信息学中的应用变得更加容易。

利用这个库，我们可以编写脚本，让程序自动搜索下载数据库中的信息

不用为搜索几十上百个基因的相关文献发愁，也不用再为寻找几十个基因的序列、转录本信息等愁得焦头烂额了

Biopython 能干嘛？

将生物信息学文件解析为 Python 数据结构
- Blast 比对结果
- FASTA
- Clustalw
- GenBank
- PubMed 和 Medline
- Enzyme 和 Prosite 等 ExPASy 文件
- SCOP，包括dom和lin文件
- UniGene
- SwissProt
支持的文件格式能够被迭代遍历或通过字典来索引
处理常用的生物信息学数据库
- NCBI 中的 Blast、Entrez 和 PubMed 数据库
- ExPASy 中的 Swiss-Prot 和 Prosite entries，以及 Prosite searches
- KEGG 数据库
生物信息学常用软件的接口
- NCBI 中的 Standalone Blast
- Clustalw 中的比对程序
- EMBOSS 命令行工具
能处理序列、ID 和序列特征
对序列的常规操作工具，如转录、翻译及重量计算
利用 k-最近邻、Bayes 或 SVM 对数据进行分类
处理比对，能够创建和处理打分矩阵
多进程处理

安装

pip install biopython

访问 NCBI Entrez 简单示例

from Bio import Entrez

# 使用email参数，这样如果遇到什么问题，
# NCBI可以通过邮件联系到你
# 千万不要胡乱的填写邮件地址，不填写都比这要好
Entrez.email = "A.N.Other@example.com"

1. EInfo: 获取 Entrez 数据库的信息

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> handle = Entrez.einfo()
>>> result = handle.read()

result 得到的是 XML 格式的数据库列表

>>> print(result)
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
        <DbName>pubmed</DbName>
        <DbName>protein</DbName>
        <DbName>nucleotide</DbName>
        <DbName>nuccore</DbName>
        <DbName>nucgss</DbName>
        ...
</DbList>
</eInfoResult>

使用 Bio.Entrez 的解析器，可以直接将这个 XML 读取成一个 Python 对象：

>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)

查看数据库字典对象

>>> record.keys()
[u'DbList']

>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']

获取特定数据库的信息

>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
'17989604'

2. ESearch: 搜索 Entrez 数据库

我们可以使用 Bio.Entrez.esearch() 来搜索任意的数据库。

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> handle = Entrez.esearch(db="pubmed", term="biopython[title]", retmax="40" )
>>> record = Entrez.read(handle)
>>> "19304878" in record["IdList"]
True
>>> print(record["IdList"])
['22909249', '19304878']

我们可以看到输出了七个 PubMed IDs，有 7 篇文献与 Biopython 有关

再来个例子，在 GenBank 中搜索matK

>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
>>> record = Entrez.read(handle)
>>> record["Count"]
'348'
>>> record["IdList"]
['JQ660909.1', 'JQ660908.1', 'JQ660907.1', 'JQ660906.1', ..., 'JQ660890.1']

Cypripedioideae[Orgn] 表示在搜索的时候加上特定的物种名字

也可以在搜索的时候使用 NCBI 的 taxon ID，如 txid158330[Orgn]

3. EFetch: 从 Entrez 下载更多的记录

使用 Bio.Entrez.efetch() 从 Entrez 通过指定 rettype 和或者 retmode 这些可选的参数下载某种特定的格式数据

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
>>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text")
>>> print handle.read()
LOCUS       EU490707                1302 bp    DNA     linear   PLN 05-MAY-2008
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
            chloroplast.
ACCESSION   EU490707

FEATURES             Location/Qualifiers
     source          1..1302
                     /organism="Selenipedium aequinoctiale"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="FLAS:Blanco 2475"
                     /db_xref="taxon:256374"
     gene            <1..>1302
                     /gene="matK"
     CDS             <1..>1302
                     /gene="matK"
                     /codon_start=1
ORIGIN
        1 attttttacg aacctgtgga aatttttggt tatgacaata aatctagttt agtacttgtg
       61 aaacgtttaa ttactcgaat gtatcaacag aattttttga tttcttcggt taatgattct
      121 aaccaaaaag gattttgggg gcacaagcat tttttttctt ctcatttttc ttctcaaatg
      181 gtatcagaag gttttggagt cattctggaa attccattct cgtcgcaatt agtatcttct
      241 cttgaagaaa aaaaaatacc aaaatatcag aatttacgat ctattcattc aatatttccc
//

访问 KEGG API

1. 解析 KEGG 记录

例如，KEGG 数据库中的一条记录 phosphoglucomutase 变位酶
ec_5.4.2.2.txt

>>> from Bio.KEGG import Enzyme
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

使用 read 解析器

>>> from Bio.KEGG import Enzyme
>>> record = Enzyme.read(open("ec_5.4.2.2.txt"))
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

2. 查询 KEGG API

除了能够解析 KEGG 记录文件，也可以直接使用 KEGG API，

>>> from Bio.KEGG import REST
>>> from Bio.KEGG import Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> open("ec_5.4.2.2.txt", "w").write(request.read())
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

下面是一个比较实用的例子，获取与 DNA 修复有关的通路的基因 symbol

from Bio.KEGG import REST

# 获取 KEGG 人类通路信息
human_pathways = REST.kegg_list("pathway", "hsa").read()

# 获取 repair 通路的 entry id
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    # 获取通路信息
    pathway_file = REST.kegg_get(pathway).read()

    # 解析每条通路的信息，获取通路中的基因的 symbol
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        # 基因信息在前 12 列
        section = line[:12].strip()
        if not section == "":
            current_section = section

        if current_section == "GENE":
            gene_identifiers, gene_description = line[12:].split("; ")
            gene_id, gene_symbol = gene_identifiers.split()

            if not gene_symbol in repair_genes:
                repair_genes.append(gene_symbol)

print(
    "There are %d repair pathways and %d repair genes. The genes are:"
    % (len(repair_pathways), len(repair_genes))
)
print(", ".join(repair_genes))

3. 对应关系

API 与 REST 方法的对应关系

/list/hsa:10458+ece:Z5100          -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight  -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq     -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
/conv/eco/ncbi-geneid              -> REST.kegg_conv('eco', 'ncbi-geneid')
/link/pathway/hsa                  -> REST.kegg_link('pathway', 'hsa')