Automating sequence file downloads with eutils (Python implementation)


Bio大恐龙


Article link: https://blog.csdn.net/ouyangk1026/article/details/126903167


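Hi everyone ✨, this is bio🦖. As biology students we sometimes need to download sequence files. Clicking through with the mouse is fine for a handful of downloads, but what about a hundred thousand, or a million? Even an auto-clicker would weep. Every problem has its solution, though, so this post shows how to use eutils to download sequence files automatically. After reading it you will have learned:
1. How to use eutils
2. Basic use of the Python requests library
3. What to consider when automating downloads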

1. Introduction to `eutils`

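eutils (Entrez Programming Utilities) is a set of nine server-side programs that provide a stable interface to NCBI's query and database system. eutils uses a flexible URL syntax to translate a standard set of input parameters into the values NCBI needs to run a search, and returns the data you asked for. In other words, eutils is the interface to Entrez, the online retrieval system of the US National Center for Biotechnology Information. At the time of writing, that system covered 38 databases, including nucleotide and protein sequences, gene annotations, 3D molecular structures, and biomedical literature.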

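Before using eutils, you should apply for an NCBI API_Key, a unique string that you include in your HTTP requests. Applying is simple: register an NCBI account --> log in --> click your avatar and choose account settings --> generate an API_Key. Without an API_Key you are limited to 3 requests per second; with one, the limit rises to 10 requests per second. (If you run into problems here, leave a comment or send me a private message.)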

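When the official documentation explains why NCBI imposes this limit, though, there is one passage that left me rather speechless: "Without limits, a single user making too many requests would slow the service down for everyone else. Limiting the request rate of users without an API_Key therefore keeps the service fast for everyone." (I may have misunderstood it; original text: the eutils introduction)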

Figure 1. Where to find the API_Key
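
The request limits quoted in this section translate directly into a minimum delay between calls. Here is a minimal sketch of spacing requests out accordingly. Note that `min_interval` and `Throttle` are hypothetical helpers written for this illustration, not part of eutils or requests.

```python
import time

def min_interval(has_api_key):
    """Smallest delay in seconds between requests under the quoted limits."""
    requests_per_second = 10 if has_api_key else 3
    return 1.0 / requests_per_second

class Throttle:
    """Sleep just long enough to keep calls at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(min_interval(has_api_key=True))
for _ in range(3):
    throttle.wait()  # call this right before each requests.get(...)
```

Calling `throttle.wait()` before every request keeps a loop under the limit even when individual responses come back very quickly.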

2. Feasibility analysis

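Given a search term, eutils returns the list of matching UIDs (see Figure 2).
All the eutils share the same base URL, https://eutils.ncbi.nlm.nih.gov/entrez/eutils/, which means we will use it to build every query URL from here on.
A basic search needs the database to query and a search term, in the form esearch.fcgi?db=<database>&term=<query>. NCBI gives an example: find articles about breast cancer published in Science in 2008. The link is shown below. It selects the pubmed database and puts the search terms after term.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer+AND+2008[pdat]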

Figure 2. Example of the returned UIDs
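
The search URL structure can also be assembled in code instead of by hand. This is a minimal sketch using only the standard library. The function name `build_esearch_url` is a hypothetical helper, and `urlencode` percent-encodes the square brackets and turns spaces into `+`, which is an equivalent encoding of the same query.

```python
from urllib.parse import urlencode

# Base URL shared by all the eutils
BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

def build_esearch_url(database, query, api_key=None):
    """Build an esearch URL of the form esearch.fcgi?db=<database>&term=<query>."""
    params = {'db': database, 'term': query}
    if api_key:
        params['api_key'] = api_key
    return BASE_URL + 'esearch.fcgi?' + urlencode(params)

# The NCBI example query, written with plain spaces
url = build_esearch_url('pubmed', 'science[journal] AND breast cancer AND 2008[pdat]')
print(url)
```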

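What this post needs, however, is to download nucleotide sequence files by accession number. So first try switching the database to the nucleotide database and using an accession number as the search term. Build a simple URL and see whether the query works: it does return an ID. Then, following NCBI's basic download structure efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>&retmode=<retrieval_mode>, we can build a download URL that does what we want. Let's plug the ID into that structure, choosing the nucleotide database and the fasta file type, and see whether the download succeeds. Yes! It works, as shown in the block below.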

Search URL
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=KY549147

Download URL
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=119311760&rettype=fasta

>EE975162.1 Q56400A FNM Bos taurus cDNA clone Q5640 5', mRNA sequence
GCCCCAGGACCCCAGCCATGAAGCTCTTCGTCCCCGCCCTGCTGTCCCTTGGAGCCCTTGGACTGTGTCT
GGCTGCCCCGAGGAAAAACGTTCGATGGTGTACCATCTCCCAACCCGAGTGGTTCAAATGCCGCCGATGG
CAGTGGAGGATGAAGAAGCTGGGTGCTCCCTCTATCACCTGTGTGAGGAGGGCCTTTGCCTTGGAATGTA
TCCGGGCCATCGCGGAGAAAAAGGCGGATGCTGTGACCCTGGATGGTGGCATGGTGTTTGAGGCGGGCCG
GGACCCCTACAAACTGCGGCCAGTAGCAGCAGAGATCTATGGGACGAAAGAGTCTCCCCAAACCCACTAT
TATGCTGTGGCCGTCGTGAAGAAGGGCAGCAACTTTCAGCTGGACCAGCTGCAAGGCCGGAAGTCCTGCC
ATACGGGCCTTGGCAGGTCCGCTGGGTGGGTCATCCCTATGGGAATCCTTCGCCCGTACTTGAGCTGGAC
AGAGTCACTCGAGCCCCTCCAGGGAGCTGTGGCTAAATTCTTCTCTGCCAGCTGTGTTCCCTGCATTGAT
AGACAAGCATACCCCAACCTGTGTCAACTGTGCA


Figure 3. Trial search
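
The two steps in this section, reading the UIDs out of the search result and building the download URL from them, can be sketched without touching the network. `SAMPLE_XML` below is an abbreviated, illustrative response (a real esearch reply carries more fields), and `extract_uids` and `build_efetch_url` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

# Abbreviated, illustrative esearch response; not a verbatim server reply
SAMPLE_XML = """<eSearchResult>
  <Count>1</Count>
  <IdList>
    <Id>119311760</Id>
  </IdList>
</eSearchResult>"""

def extract_uids(esearch_xml):
    """Return the text of every <Id> element in an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.iter('Id')]

def build_efetch_url(database, uids, rettype='fasta'):
    """Build an efetch URL of the form efetch.fcgi?db=...&id=...&rettype=..."""
    params = {'db': database, 'id': ','.join(uids), 'rettype': rettype}
    return BASE_URL + 'efetch.fcgi?' + urlencode(params)

uids = extract_uids(SAMPLE_XML)
print(build_efetch_url('nucleotide', uids))
```

Parsing the XML with an XML parser is one alternative to matching `<Id>` tags with a regular expression; both approaches yield the same list for a response shaped like the sample.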

3. Building the functions

3.1 The download function

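Python's requests library lets us fetch information by URL the way a web crawler does. Here we only need its simplest function, get; introductions to get are easy to find online. We use get to fetch the search result page, match all the IDs with a regular expression from the re library, call get again to fetch the download page, and finally write the result to a file. The download here is done in chunks; don't worry if that part is unclear, since you can use a wheel without knowing how it was built. The code is shown below. (It cannot be run as-is: you need to put your own API_Key into the URLs.)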

import re
import sys

import requests

# Directory where downloaded files are stored (first command-line argument)
location = sys.argv[1]

def get_info_from_url(segment, num):
    """Look up the UIDs for a batch of accessions and save their FASTA records."""

    # Search the nucleotide database; the response XML lists one <Id> per hit
    term_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&api_key=YOUR_API_KEY&term=' + ",".join(segment)
    # The (connect, read) timeout keeps a stalled request from hanging the script
    term_info = requests.get(term_url, timeout=(6, 60)).text

    id_pattern = re.compile(r'<Id>(\d+)</Id>')
    accession_id = re.findall(id_pattern, term_info)

    # Fetch the FASTA records for those UIDs and stream them to disk in 256 KiB chunks
    fasta_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id={}&api_key=YOUR_API_KEY'.format(','.join(accession_id))
    with open('{}/sequence_{}.fasta'.format(location, num), 'wb') as f, requests.get(fasta_url, stream=True, timeout=(6, 60)) as info:
        for chunk in info.iter_content(chunk_size=256*1024):
            if not chunk:
                break
            f.write(chunk)

3.2 The recording functions

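Downloads inevitably run into all kinds of trouble, for example a poor network connection that makes a request time out and crashes the script. To let the script run to completion, we can use try except to sidestep such failures. But once the script has finished, how do we know which data downloaded successfully and which failed? Write another checking script? A recording function is a good idea.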
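
Beyond catching the exception once, a transient timeout can often be survived by simply trying again. The sketch below shows that idea as an optional extension; it is not part of the script in this post. `with_retries` is a hypothetical helper, and `flaky_download` is a stand-in that simulates two timeouts before succeeding.

```python
import time

def with_retries(action, attempts=3, delay=1.0):
    """Call action(); on an exception wait and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # wait a little longer each time

# Stand-in for a flaky download: fails twice, then succeeds
calls = {'count': 0}

def flaky_download():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('simulated request timeout')
    return 'ok'

result = with_retries(flaky_download, attempts=3, delay=0.01)
print(result, calls['count'])
```

If every attempt fails, the last exception is re-raised, so an outer try except can still record the failure.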

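First, write a function that records successful downloads. On the next run this avoids downloading the same data again, saving resources and time.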

def record_download(num):
    with open('{}/sequence_{}.fasta'.format(location, num), 'r') as seq:
        info = seq.read()
    pattern_accession_record = re.compile(r'>[A-Z]{1,2}\d+')
    download_record = re.findall(pattern_accession_record, info)
    with open('/YOUR_WORK_PATH/download_record.txt', 'a') as download:
        download.writelines(list(map(lambda x:x.strip('>')+'\n', download_record)))

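Next we would write a function that records failed downloads, so that the next run only has to fetch those, and so that it is easy to check how much data failed. With try except, however, we can simply write to a record file inside the except clause, which avoids writing extra code.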
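
The two record files work together as follows: failed batch numbers are logged inside except, and the next run works out what is still missing from the success record. This is a minimal sketch in a temporary directory, with `fake_download` standing in for the real download. The accession strings are made-up placeholders, and `remaining_accessions` and `process_batches` are hypothetical helper names.

```python
import os
import tempfile

def remaining_accessions(all_accessions, record_path):
    """Return the accessions not yet listed in the download record, in order."""
    if os.path.exists(record_path):
        with open(record_path) as rec:
            done = {line.strip() for line in rec if line.strip()}
    else:
        done = set()
    return [acc for acc in all_accessions if acc not in done]

def process_batches(batches, download, error_path):
    """Run download(batch) per batch, logging the index of any batch that raises."""
    for num, batch in enumerate(batches):
        try:
            download(batch)
        except Exception:
            with open(error_path, 'a') as err:
                err.write(str(num) + '\n')

workdir = tempfile.mkdtemp()
record_path = os.path.join(workdir, 'download_record.txt')
error_path = os.path.join(workdir, 'error_record.txt')

def fake_download(batch):
    # Simulate a failure for one placeholder accession, success otherwise
    if 'AB000003' in batch:
        raise TimeoutError('simulated failure')
    with open(record_path, 'a') as rec:
        rec.writelines(acc + '\n' for acc in batch)

accessions = ['AB000001', 'AB000002', 'AB000003', 'AB000004']  # placeholders
batches = [accessions[i:i + 2] for i in range(0, len(accessions), 2)]
process_batches(batches, fake_download, error_path)
print(remaining_accessions(accessions, record_path))
```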

4. Complete code

import re
import requests
import sys


# data storage location
location = sys.argv[1]

# range of batch indices to process (start inclusive, end exclusive)
start = int(sys.argv[2])
end = int(sys.argv[3])

# this function is designed to get sequence information
def get_info_from_url(segment, num):

    # Using requests function get the id from term url
    term_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&api_key=YOUR_API_KEY&term=' + ",".join(segment)
    term_info = requests.get(term_url,timeout=(6,60)).text # to avoid script running but ineffective

    id_pattern = re.compile(r'<Id>(\d+)</Id>')
    accession_id = re.findall(id_pattern, term_info)

    # get sequencing information from id
    fasta_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&id={}&api_key=YOUR_API_KEY'.format(','.join(accession_id))
    with open('{}/sequence_{}.fasta'.format(location, num),'wb') as f, requests.get(fasta_url,stream=True, timeout=(6,60)) as info:
        for chunk in info.iter_content(chunk_size=256*1024):
            if not chunk:
                break
            f.write(chunk)


# this function is designed to check whether the file includes 20 sequences
def check_intergrity(num):
    with open('{}/sequence_{}.fasta'.format(location, num), 'r') as seq:
        info = seq.read()

    pattern_len = re.compile(r'>')
    len_num = len(re.findall(pattern_len, info))

    if len_num != 20:
        with open('/YOUR_WORK_PATH/check_record.txt', 'a') as rec:
            rec.write(str(num)+'\n')
    
# this function is designed to record downloaded accession
def record_download(num):
    with open('{}/sequence_{}.fasta'.format(location, num), 'r') as seq:
        info = seq.read()
    pattern_accession_record = re.compile(r'>[A-Z]{1,2}\d+')
    download_record = re.findall(pattern_accession_record, info)
    with open('/YOUR_WORK_PATH/download_record.txt', 'a') as download:
        download.writelines(list(map(lambda x:x.strip('>')+'\n', download_record)))

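# main loop: download the accessions batch by batch

# remain_data.txt lists the accessions still to download, one per line
with open('/YOUR_WORK_PATH/remain_data.txt', 'r') as rem:
    data = rem.read().strip('\n').split('\n')

for num in range(start, end):
    # each batch holds 20 accessions
    segment = data[num*20:(num+1)*20]

    # 'a+' creates the record file if it is missing, but opens it positioned
    # at the end, so rewind before reading
    with open('/YOUR_WORK_PATH/download_record.txt', 'a+') as download:
        download.seek(0)
        download_data = download.read().strip('\n').split('\n')

    try:
        # skip the batch if every accession in it is already recorded
        if set(download_data) >= set(segment):
            continue
        else:
            get_info_from_url(segment, num)
            check_intergrity(num)
            record_download(num)
    except Exception:
        # record the failed batch number so it can be retried later
        with open('/YOUR_WORK_PATH/error_record.txt', 'a') as err:
            err.write(str(num)+'\n')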
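
One detail worth considering in the integrity check is that it counts every `>` character and expects exactly 20, so a final batch with fewer accessions, or a header that happens to contain `>`, would be flagged. Below is a minimal sketch of a variant that counts header lines and compares against the batch size. `count_fasta_records` and `is_complete` are hypothetical helpers, and the sample text is made-up toy data.

```python
def count_fasta_records(fasta_text):
    """Count FASTA records by counting lines that begin with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith('>'))

def is_complete(fasta_text, segment):
    """True when the file holds one record per requested accession."""
    return count_fasta_records(fasta_text) == len(segment)

# Toy data: two records, the first header contains an extra '>' character
sample = '>seq1 toy header with a > inside\nACGT\n>seq2 toy header\nGGCC\n'
print(count_fasta_records(sample), is_complete(sample, ['seq1', 'seq2']))
```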