Introduction
Phytozome v13 now offers a Command line download option, so if you only need a few files, the built-in command-line download is the simplest route.
The methods below are still worth a look; if you need to download the entire Phytozome dataset, the script in Method 4 automates exactly that.
Method 1
Log in to your account
curl 'https://signon.jgi.doe.gov/signon/create' --data-urlencode 'login=*****' --data-urlencode 'password=*****' -c cookies > /dev/null
# replace the ***** placeholders with your JGI username and password
Download the list of all files
curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=PhytozomeV12' -b cookies > files.xml
All url values in files.xml are relative paths; prefix them with https://genome.jgi.doe.gov.
Download files
files.xml records each file's size, storage path, md5 checksum, file type, and so on.
For example, the entry below describes the Arabidopsis CDS file. Extract the contents of its url="…" attribute, replace every "&amp;" with a plain "&", prepend https://genome.jgi.doe.gov, and fetch the result with curl (remember to pass the cookie file).
<file label="PhytozomeV12" filename="Athaliana_167_TAIR10.cds_primaryTranscriptOnly.fa.gz" size="10 MB" sizeInBytes="11041833" timestamp="Wed Jan 08 16:38:08 PST 2014" url="/portal/ext-api/downloads/get_tape_file?blocking=true&amp;url=/PhytozomeV12/download/_JAMO/585474407ded5e78cff8c47a/Athaliana_167_TAIR10.cds_primaryTranscriptOnly.fa.gz" project="" library="" md5="6085fd39ad3327c727838f9da4f4b222" fileType="Assembly" />
Below is a test download of the Arabidopsis data files. For bulk downloads this is still fairly tedious; you can inspect files.xml and put one such curl line per file into a bash script (a sketch that generates such a script automatically follows the examples).
curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/PhytozomeV12/download/_JAMO/585474407ded5e78cff8c47a/Athaliana_167_TAIR10.cds_primaryTranscriptOnly.fa.gz' -b cookies > Athaliana_167_TAIR10.cds_primaryTranscriptOnly.fa.gz
curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/PhytozomeV12/download/_JAMO/587b0adf7ded5e4229d885ab/Athaliana_447_TAIR10.fa.gz' -b cookies > Athaliana_447_TAIR10.fa.gz
curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/PhytozomeV12/download/_JAMO/587b0ade7ded5e4229d885aa/Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa.gz' -b cookies > Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa.gz
curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/PhytozomeV12/download/_JAMO/587b0ade7ded5e4229d885a8/Athaliana_447_Araport11.gene.gff3.gz' -b cookies > Athaliana_447_Araport11.gene.gff3.gz
curl 'https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?blocking=true&url=/PhytozomeV12/download/_JAMO/587b0adb7ded5e4229d885a1/Athaliana_447_Araport11.cds_primaryTranscriptOnly.fa.gz' -b cookies > Athaliana_447_Araport11.cds_primaryTranscriptOnly.fa.gz
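Here is a minimal sketch of that batch approach. It assumes files.xml and the cookies file from the login step sit in the current directory; the grep/sed extraction is a heuristic that happens to fit this XML (a real XML parser would be more robust), and batch_download.sh is just an example output name.
#!/usr/bin/env bash
# Pull every url="..." attribute out of files.xml, unescape &amp;,
# prepend the site root, and emit one curl command per file.
grep -o 'url="[^"]*"' files.xml \
  | sed -e 's/^url="//' -e 's/"$//' -e 's/&amp;/\&/g' \
  | while read -r path; do
      name=$(basename "$path")  # tape-file URLs end in the actual file name
      echo "curl -b cookies 'https://genome.jgi.doe.gov${path}' > ${name}"
    done > batch_download.sh
bash batch_download.sh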
Method 2 | Get JGI Genomes
This method is well suited to bulk downloads.
Download
git clone https://hub.fastgit.org/guyleonard/get_jgi_genomes.git
Usage
Usage:
get_jgi_genomes [-u <username> -p <password>] | [-c <cookies>] [-f | -a | -P 12 | -m 3] (-i) (-l) (-A) (-C) (-g) (-t) (-q)
Required:
-u <username>
-p <password>
or
-c <cookie file>
Portal Choice:
-f Mycocosm aka fungi
-a Phycocosm aka algae
-P <version> PhytozomeV aka plants
-m <version> MetazomeV aka metazoans
Portal File Options:
-A get assembly
-C get CDS
-g get GFF
-t get transcripts
JGI Taxa ID:
-i <id> JGI ID of Genome Project
Other:
-l list only, no downloads
Examples
# Log in:
./bin/get_jgi_genomes -u your.email@address.com -p password
# After logging in, get a list of all protein files from Mycocosm:
./bin/get_jgi_genomes -c signon.cookie -f -l
# After logging in, download all CDS files from Phycocosm:
./bin/get_jgi_genomes -c signon.cookie -a -C
# After logging in, download all assembly files from Phytozome V12:
./bin/get_jgi_genomes -c signon.cookie -P 12 -A
Method 3 | jgi-query
{% note purple no-icon %}
This is a script written in Python; full usage information is available in the project repository linked below.
{% endnote %}
Download
git clone https://github.com/glarue/jgi-query.git
Usage
usage: jgi-query.py [-h] [-x [XML]] [-c] [-s] [-f] [-u] [-n RETRY_N]
[-l logfile] [-r REGEX] [-a]
[organism_abbreviation]
This script will list and retrieve files from JGI using the curl API. It will
return a list of all files available for download for a given query organism.
positional arguments:
organism_abbreviation
organism name formatted per JGI's abbreviation. For
example, 'Nematostella vectensis' is abbreviated by
JGI as 'Nemve1'. The appropriate abbreviation may be
found by searching for the organism on JGI; the name
used in the URL of the 'Info' page for that organism
is the correct abbreviation. The full URL may also be
used for this argument (default: None)
optional arguments:
-h, --help show this help message and exit
-x [XML], --xml [XML]
specify a local xml file for the query instead of
retrieving a new copy from JGI (default: None)
-c, --configure initiate configuration dialog to overwrite existing
user/password configuration (default: False)
-s, --syntax_help
-f, --filter_files filter organism results by config categories instead
of reporting all files listed by JGI for the query
(work in progress) (default: False)
-u, --usage print verbose usage information and exit (default:
False)
-n RETRY_N, --retry_n RETRY_N
number of times to retry downloading files with errors
(0 to skip such files) (default: 4)
-l logfile, --load_failed logfile
retry downloading from URLs listed in log file
(default: None)
-r REGEX, --regex REGEX
Regex pattern to use to auto-select and download files
(no interactive prompt) (default: None)
-a, --all Auto-select and download all files for query (no
interactive prompt) (default: False)
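For a concrete starting point, here is a small, hedged example using the Nemve1 abbreviation from the usage text above (substitute your own organism; the script asks for your JGI credentials via its configuration dialog, see -c):
# Interactive: list the files available for Nematostella vectensis
# and choose what to download at the prompt
python3 jgi-query.py Nemve1
# Non-interactive: auto-select and download every file for the query
python3 jgi-query.py Nemve1 -a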
Method 4
The script below can download all of the Phytozome plant data: it logs in, fetches the portal's XML index, lets you pick a Phytozome version, mirrors that version's whole directory tree locally, and verifies each file against its md5 checksum.
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 30 20:33:58 2019
@author: Bohan
"""
import requests, json, hashlib, os
from pathlib import Path
from time import time, perf_counter
from xml.etree import ElementTree

# One requests session is reused for login and all downloads so cookies persist.
session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}
# A browser-like User-Agent helps avoid being rejected as a bot.
data = {}
data['login'] = '*****@******.com'  # replace with your phytozome.jgi.doe.gov account name
data['password'] = '*******'        # replace with your phytozome.jgi.doe.gov password
data['commit'] = 'Sign In'
def sign_in():
    global cookies_dict  # dict-form cookies, kept for later inspection
    global cookies_str
    cookies_dict = {}
    url = 'https://signon.jgi.doe.gov/signon/create'  # current JGI sign-in endpoint
    session.post(url, headers=headers, data=data)
    cookies_dict = requests.utils.dict_from_cookiejar(session.cookies)
    cookies_str = json.dumps(cookies_dict)
    with open('cookies.txt', 'w') as f:
        f.write(cookies_str)
# Sign in to Phytozome and save the session cookies to cookies.txt.
def cookies_read():
    with open('cookies.txt', 'r') as cookies_txt:
        cookies_dict = json.loads(cookies_txt.read())
    return requests.utils.cookiejar_from_dict(cookies_dict)
# Read the saved cookies back from cookies.txt (useful for re-runs without
# logging in again; not called in the main flow below).
def md5sum(filepath):
    with open(filepath, 'rb') as fd:
        fcont = fd.read()
    return str(hashlib.md5(fcont).hexdigest())
# Return the md5 checksum of a file. Note this reads the whole file into
# memory; very large archives could be hashed in chunks instead.
def createpath(file_path):
    try:
        if not os.path.exists(file_path):
            print('Folder', file_path, 'does not exist; creating it')
            os.makedirs(file_path)
    except IOError as e:
        print('File operation failed:', e)
    except Exception as e:
        print('Error:', e)
# Create the target directory (including parents) if it does not exist yet.
def getxml():
    global fileurl
    fileurl = []
    PHYTOALL = 'Phytozome'
    xmldata = session.get('https://genome.jgi.doe.gov/portal/ext-api/downloads/get-directory?organism=Phytozome&organizedByFileType=false')
    # Query the download API for the directory of the portal named above.
    with open('./' + PHYTOALL + '.xml', 'wb') as xf:
        xf.write(xmldata.content)
    # Save, then parse, the official XML index with ElementTree.
    xmlDoc = ElementTree.parse('./' + PHYTOALL + '.xml')
    folderl1 = xmlDoc.findall('folder')  # top-level folders are the Phytozome versions
    print('The database currently contains these versions:\n')
    number = 1
    for folderl1x in folderl1:  # list the available versions
        print(str(number) + '. ' + folderl1x.attrib['name'])
        number = number + 1
    pick = input('Please choose the version you want by entering its number (e.g. 2), then press Enter:\n')
    folderl1name = folderl1[int(pick) - 1]
    folderl2 = folderl1name.findall('folder')  # second-level folders
    folderl2f = folderl1name.findall('file')
    for folderl2fname in folderl2f:
        folderpathl2 = "./" + str(folderl1name.get('name')) + "/"
        fileurl.append(folderpathl2)
        fileurl.append(folderl2fname.get('filename'))
        fileurl.append('https://genome.jgi.doe.gov' + folderl2fname.get('url'))
        fileurl.append(folderl2fname.get('md5'))
    for folderl2name in folderl2:  # walk the second level
        folderl3 = folderl2name.findall('folder')  # third-level folders
        folderl3f = folderl2name.findall('file')
        for folderl3fname in folderl3f:
            folderpathl3 = "./" + str(folderl1name.get('name')) + "/" + str(folderl2name.get('name')) + "/"
            fileurl.append(folderpathl3)
            fileurl.append(folderl3fname.get('filename'))
            fileurl.append('https://genome.jgi.doe.gov' + folderl3fname.get('url'))
            fileurl.append(folderl3fname.get('md5'))
        for folderl3name in folderl3:  # walk the third level
            folderl4 = folderl3name.findall('folder')  # fourth-level folders
            folderl4f = folderl3name.findall('file')
            for folderl4fname in folderl4f:
                folderpathl4 = "./" + str(folderl1name.get('name')) + "/" + str(folderl2name.get('name')) + "/" + str(folderl3name.get('name')) + "/"
                fileurl.append(folderpathl4)
                fileurl.append(folderl4fname.get('filename'))
                fileurl.append('https://genome.jgi.doe.gov' + folderl4fname.get('url'))
                fileurl.append(folderl4fname.get('md5'))
            for folderl4name in folderl4:  # walk the fourth level
                folderl5 = folderl4name.findall('folder')  # fifth level: only files are collected below
                folderl5f = folderl4name.findall('file')
                for folderl5fname in folderl5f:
                    folderpathl5 = "./" + str(folderl1name.get('name')) + "/" + str(folderl2name.get('name')) + "/" + str(folderl3name.get('name')) + "/" + str(folderl4name.get('name')) + "/"
                    fileurl.append(folderpathl5)
                    fileurl.append(folderl5fname.get('filename'))
                    fileurl.append('https://genome.jgi.doe.gov' + folderl5fname.get('url'))
                    fileurl.append(folderl5fname.get('md5'))
    with open("./genome.links", "w") as file:
        file.write(str(fileurl))
    return fileurl
# Parse the official XML and record, for every file, its local path, filename,
# URL and md5 into genome.links, stored as a flat list in repeating groups of
# four values: 1) path, 2) filename, 3) URL, 4) md5.
def gettasklist():
    global tasklist
    tasklist = {}
    for i in range(int(len(fileurl) / 4)):
        onefilelist = []
        onefilelist.append(fileurl[i * 4 + 2])                   # URL
        onefilelist.append(fileurl[i * 4] + fileurl[i * 4 + 1])  # path + filename
        onefilelist.append(fileurl[i * 4 + 3])                   # md5
        tasklist[i] = onefilelist
    with open("./task.lists", "w") as file:
        file.write(str(tasklist))
    return tasklist
# Merge path and filename into a task list; each entry is [URL, path+filename, md5].
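# For illustration only (hypothetical values), one tasklist entry looks like:
# tasklist[0] = ['https://genome.jgi.doe.gov/portal/ext-api/downloads/get_tape_file?...',
#                './PhytozomeV12/Athaliana/Athaliana_167_TAIR10.cds_primaryTranscriptOnly.fa.gz',
#                '6085fd39ad3327c727838f9da4f4b222']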
def download_file_from_url(dl_url, file_name, md5, headers):
    file_path = Path(__file__).parent.joinpath(file_name)
    if file_path.exists():
        dl_size = file_path.stat().st_size  # bytes already on disk, for resuming
    else:
        dl_size = 0
    headers = dict(headers)  # copy so the Range header does not leak between calls
    headers['Range'] = f'bytes={dl_size}-'
    # The session already carries the login cookies; the Range header must be
    # passed to the request itself, otherwise resuming never actually happens.
    response = session.get(dl_url, headers=headers, stream=True)
    print('\n\n' + '*' * 30 + 'Download info' + '*' * 30)
    try:
        total_size = int(response.headers['content-length'])  # present => resumable
        print(f'\n\nFile: {file_name}\t\tAlready downloaded: {dl_size / 1024 / 1024:.2f}M\t\tTotal size: {total_size / 1024 / 1024:.2f}M\n\nThis file supports resuming.\n')
        start = perf_counter()
        data_count = 0
        count_tmp = 0
        start_time = time()
        with open(file_path, 'ab') as fp:  # append mode: continue where we left off
            for chunk in response.iter_content(chunk_size=512):
                data_count += len(chunk)
                now_pross = (data_count / total_size) * 100
                mid_time = time()
                if mid_time - start_time > 0.1:
                    speed = (data_count - count_tmp) / 1024 / (mid_time - start_time)
                    start_time = mid_time
                    count_tmp = data_count
                    print(f"\rDownloading.........{now_pross:.2f}%\t{data_count // 1024}Kb/{total_size // 1024}Kb\tCurrent speed: {speed:.2f}Kb/s", end='')
                fp.write(chunk)
        end = perf_counter()
        diff = end - start
        speed = total_size / 1024 / diff
        print(f'\n\nDone! Took {diff:.2f}s, average speed {speed:.2f}Kb/s.\nFile path: {file_path}\n')
    except KeyError:  # no content-length => server will not resume; restart the file
        print(f'\n\nFile: {file_name}\t\tAlready downloaded: {dl_size / 1024 / 1024:.2f}M\t\tThe server does not support resuming this file; restarting the download.\n')
        start = perf_counter()
        data_count = 0
        count_tmp = 0
        start_time = time()
        with open(file_path, 'wb') as fp:  # write mode: start over from byte 0
            for chunk in response.iter_content(chunk_size=512):
                data_count += len(chunk)
                mid_time = time()
                if mid_time - start_time > 0.1:
                    speed = (data_count - count_tmp) / 1024 / (mid_time - start_time)
                    start_time = mid_time
                    count_tmp = data_count
                    print(f"\rDownloading.........\t{data_count // 1024}Kb\tCurrent speed: {speed:.2f}Kb/s", end='')
                fp.write(chunk)
        end = perf_counter()
        diff = end - start
        speed = data_count / 1024 / diff
        print(f'\n\nDone! Took {diff:.2f}s, average speed {speed:.2f}Kb/s.\nFile path: {file_path}\n')
    fmd5 = md5sum(file_path)  # verify integrity against the md5 from files.xml
    if fmd5 == md5:
        print('Checksum OK!')
    else:
        print('Checksum FAILED')
def paralleldownload():
    # Despite the name, this downloads sequentially: for each task, skip the
    # file if it already exists with a matching md5, otherwise (re)download it.
    for j in range(int(len(tasklist))):
        try:
            if md5sum(tasklist[j][1]) != tasklist[j][2]:
                download_file_from_url(tasklist[j][0], tasklist[j][1], tasklist[j][2], headers)
            else:
                print('File ' + str(j + 1) + ' already exists and matches its checksum; skipping')
        except FileNotFoundError:
            print('Of ' + str(int(len(tasklist))) + ' files in total, now starting file ' + str(j + 1))
            download_file_from_url(tasklist[j][0], tasklist[j][1], tasklist[j][2], headers)
sign_in()
getxml()       # fetch and parse the XML index
gettasklist()  # build the download task list
for i in range(int(len(fileurl) / 4)):
    createpath(fileurl[i * 4])  # create every subdirectory recorded in the XML index
paralleldownload()
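To use the script, fill in your JGI credentials in the data dictionary near the top, then run it with python3 from the directory where the data should be mirrored. Because files already on disk with a matching md5 are skipped, the script can safely be re-run after an interruption, and servers that honor Range requests will resume partially downloaded files rather than restarting them.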