Batch-downloading FASTA files by NCBI accession number

Maiya19724

Tags: bioinformatics, linux, experience sharing, ubuntu

Article link: https://blog.csdn.net/Maiya19724/article/details/124126346

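This pair of scripts combines bash and Python to batch-download nucleotide sequences from the NCBI website by accession number. The shell script first extracts the unique accession numbers from orgin_list.txt. For each one, the Python script uses the Pyppeteer library to open the sequence page and read the internal sequence ID, and curl then saves the sequence in FASTA format. The whole process is automated: lookups that fail are retried, and each successful lookup is printed and logged as it completes.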

Working files:

1.autoget.sh

2.autoget.py

3.orgin_list.txt (the third column contains the NCBI accession numbers)

To start the download, run autoget.sh.
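The first step of the shell script (take column 3, sort, de-duplicate) can also be sketched in Python, which may help when checking what will be queried before starting a long run. This is only an illustration: the function name `unique_accessions` and the sample rows are hypothetical, and unlike the awk pipeline it skips lines that have fewer than three columns.

```python
def unique_accessions(lines, column=3):
    """Return the sorted, de-duplicated values of one whitespace-separated column."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= column:  # skip blank or short lines
            seen.add(fields[column - 1])
    return sorted(seen)


if __name__ == "__main__":
    # Hypothetical rows; only the third column matters here.
    sample = [
        "s1 human ACC0002.1",
        "s2 mouse ACC0001.1",
        "s3 human ACC0002.1",
        "",
    ]
    for accession in unique_accessions(sample):
        print(accession)
```

For a real input file, the same function could be called as `unique_accessions(open("orgin_list.txt"))`.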

autoget.sh

#!/bin/bash
# Start with empty working files
>list.txt
>res.txt
# Column 3 of orgin_list.txt holds the NCBI accession numbers; adjust this command to match your own data
awk '{print $3}' orgin_list.txt | sort -u > list.txt
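# Repeat until list.txt is empty, i.e. every accession has a line in res.txt.
# Note: an accession that can never be resolved keeps this loop running.
while [ -s list.txt ]
do
    # autoget.py prints "<accession> <id>" on success; tee appends it to res.txt
    while read -r file
    do
        python autoget.py "$file" | tee -a res.txt
    done < list.txt
    # Collect the accessions that still have no result
    >list.bad
    while read -r file
    do
        if ! grep -qF -- "$file" res.txt; then
            echo "$file" >> list.bad
        fi
    done < list.txt
    # The failed accessions become the list for the next round
    cp -p list.bad list.txt
done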
# Each line of res.txt is "<accession> <id>"; save the sequence as <accession>.fasta
while read -r code id
do
    curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=${id}&db=nuccore&report=fasta&extrafeat=null&conwithfeat=on&hide-cdd=on&retmode=html&withmarkup=on&tool=portal&log\$=seqview&maxdownloadsize=100000000" -o "$code.fasta"
done < res.txt
# Empty the input list and remove the temporary files
>orgin_list.txt
rm -rf list.txt res.txt list.bad
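The retry loop in autoget.sh runs until every accession has succeeded, so a single invalid accession would make it loop forever. As a minimal sketch of the same idea with an upper bound on the number of rounds, the following Python function retries only the failed items. It is not part of the original scripts: `fetch_all`, `lookup`, and `max_rounds` are hypothetical names, and `lookup` stands in for whatever performs the actual query (for example a call to autoget.py).

```python
def fetch_all(accessions, lookup, max_rounds=5):
    """Call lookup(acc) for each accession, retrying failures for up to max_rounds rounds.

    lookup is a hypothetical callable that returns an ID or raises on failure.
    Returns (results, still_failed).
    """
    pending = list(accessions)
    results = {}
    for _ in range(max_rounds):
        if not pending:
            break
        failed = []
        for acc in pending:
            try:
                results[acc] = lookup(acc)
            except Exception:
                failed.append(acc)  # try again in the next round
        pending = failed
    return results, pending


if __name__ == "__main__":
    attempts = {}

    def flaky_lookup(acc):
        # Simulated lookup: "BAD" always fails, everything else fails once.
        attempts[acc] = attempts.get(acc, 0) + 1
        if acc == "BAD" or attempts[acc] < 2:
            raise RuntimeError(acc)
        return "id-" + acc

    done, failed = fetch_all(["A1", "BAD"], flaky_lookup, max_rounds=3)
    print(done, failed)
```

The items left in the second return value play the role of list.bad: they can be reported to the user instead of being retried indefinitely.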

autoget.py

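## Python 3.8
import sys
import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq
from lxml import etree

item = sys.argv[1]  # NCBI accession number passed in by autoget.sh
baseUrl = "https://www.ncbi.nlm.nih.gov/nuccore/{}?report=fasta".format(item)

async def main():
    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(baseUrl, {'timeout': 10000*6})  # timeout in ms (60 s)
        Html = pq(await page.content()).html()
        # The val attribute of this node holds the ID used in the download URL
        html_data = etree.HTML(Html).xpath('/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[5]/div[2]/div[1]/@val')
        link_id = html_data[0]  # raises IndexError if the node is missing
    finally:
        await browser.close()  # close the browser even when the lookup fails
    # One line per success: "<accession> <id>", parsed later by autoget.sh
    print(item, link_id)

asyncio.get_event_loop().run_until_complete(main())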
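The line printed by autoget.py is what autoget.sh later splits into an accession and an ID and inserts into the curl URL. A minimal sketch of that last step in Python is shown below, using the same base URL and query parameters as the curl command. The helper names `parse_result_line` and `fasta_url` and the ID `12345` are hypothetical. One difference to be aware of: `urlencode` percent-encodes the `$` in `log$` as `%24`, which should decode to the same parameter name on the server side.

```python
from urllib.parse import urlencode

SVIEWER = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"


def parse_result_line(line):
    """Split one res.txt line of the form '<accession> <id>'."""
    code, link_id = line.split()
    return code, link_id


def fasta_url(link_id):
    """Build the download URL with the parameters used in autoget.sh."""
    params = {
        "id": link_id,
        "db": "nuccore",
        "report": "fasta",
        "extrafeat": "null",
        "conwithfeat": "on",
        "hide-cdd": "on",
        "retmode": "html",
        "withmarkup": "on",
        "tool": "portal",
        "log$": "seqview",
        "maxdownloadsize": "100000000",
    }
    return SVIEWER + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical result line; 12345 is not a real sequence ID.
    code, link_id = parse_result_line("ACC0001.1 12345\n")
    print(code + ".fasta", fasta_url(link_id))
```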