Today I looked into writing scripts to batch-download .svs files from TCGA. There are two parts: a Python script using requests and a shell script using wget. The shell script is the better choice; the Python one can hang.
Any download tool is just something a person wrote under a given set of rules; there is nothing magical about it. Thinking + practice + optimization = tool.
requests
The code below is a simple modification of the script linked in the docstring. It mainly adds .svs file-name output and a download-speed display.
# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder the downloaded files are saved to (it is best to create a new folder for the downloaded data).
The tool supports resuming: after the program is interrupted it can be restarted, and it will continue from the last downloaded file. Note that the original tool saved each download as a single file named by its TCGA UUID rather than in the per-file folder layout. Press ctrl+c to terminate the program if necessary.
author: chenwi
date: 2018/07/10
mail: chenwi4323@gmail.com
@cp
* save files under their original names, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs,
  instead of the UUID name 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time
print(__doc__)
requests.packages.urllib3.disable_warnings()
def download(url, file_path):
    r = requests.get(url, stream=True, verify=False)
    total_size = int(r.headers['content-length'])
    print(f"{total_size / 1024 / 1024:.2f}MB")
    temp_size = 0
    size = 0
    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                done = int(50 * temp_size / total_size)
                # refresh the progress bar and speed estimate every 5 seconds
                if time.time() - time1 > 5:
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done),
                                      100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()
def get_UUID_list(manifest_path):
    # the manifest is tab-separated with 'id' and 'filename' columns
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    return list(manifest['id']), list(manifest['filename'])
def get_last_UUFN(file_path):
    # most recently modified file in the save folder, or None if it is empty
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]
def get_lastUUFN_index(UUFN_list, last_UUFN):
    # index of the last downloaded file in the manifest; 0 restarts from the beginning
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0
def quit(signum, frame):
    # Ctrl+C quit
    print()
    print('You choose to stop me.')
    sys.exit()
if __name__ == '__main__':
    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="folder the downloaded files are saved to")
    args = parser.parse_args()

    link = 'https://api.gdc.cancer.gov/data/'
    manifest_path = args.M
    save_path = args.S
    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last download file {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # plain concatenation; os.path.join would break on Windows
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
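The resume logic above picks the most recently modified file in the save folder and restarts from its position in the manifest. A minimal standalone sketch of that idea (the file names here are made up):

```python
import os
import tempfile
import time

def newest_file(folder):
    """Return the most recently modified file name in folder, or None if empty."""
    names = os.listdir(folder)
    if not names:
        return None
    return max(names, key=lambda n: os.path.getmtime(os.path.join(folder, n)))

# simulate three finished downloads
tmp = tempfile.mkdtemp()
for name in ["a.svs", "b.svs", "c.svs"]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write("x")
    time.sleep(0.05)  # ensure distinct modification times

print(newest_file(tmp))  # the file written last
```

Because only modification time is used, a partially downloaded last file is re-downloaded from its manifest index on restart.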
wget
The advantage of wget is that it supports unlimited automatic reconnects and resuming interrupted downloads, which is very friendly for large files: there is no need to worry about the network dropping halfway through. The script below batch-downloads svs files; it will keep being updated for a while.
#!/usr/bin/env bash
# file: wget_download.sh
# download, resuming and retrying automatically
#location="cervix_uteri"
location=$1
if [ ! -d "${location}" ]
then
    mkdir "${location}"
fi
echo "run script: $0 $*" > "download_${location}.log" 2>&1
manifest_file="gdc_manifest_${location}.txt"
row_num=$(awk 'END{print NR}' "${manifest_file}")
file_num=$((row_num - 1))   # the first manifest row is the header
echo "file_num is ${file_num}" >> "download_${location}.log" 2>&1
uuid_array=($(awk '{print $1}' "${manifest_file}"))
uufn_array=($(awk '{print $2}' "${manifest_file}"))
# resume: if N files already exist, index N is the last (possibly partial) one,
# which wget -c will complete
start=$(ls "./${location}" | wc -l)
if [ "${start}" -eq 0 ];then
    start=1
fi
echo "start from ${start}" >> "download_${location}.log" 2>&1
for k in $(seq "${start}" "${file_num}")
do
    echo "${uuid_array[$k]} ${uufn_array[$k]}" >> "download_${location}.log" 2>&1
    # -c resume partial files, -t 0 retry forever
    wget -c -t 0 -O "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"
done
To use it, run:
chmod +x wget_download.sh
./wget_download.sh cervix_uteri
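One subtlety in the script: `uuid_array` and `uufn_array` are built from the whole manifest, so index 0 holds the header fields (`id`, `filename`), and starting the loop at 1 is what skips them. A quick check with a made-up two-row manifest:

```shell
# hypothetical manifest: header line plus two data rows
printf 'id\tfilename\nu1\tf1.svs\nu2\tf2.svs\n' > /tmp/demo_manifest.txt
arr=($(awk '{print $1}' /tmp/demo_manifest.txt))
echo "${arr[0]}"   # the header field "id", skipped by the loop
echo "${arr[1]}"   # the first real UUID, "u1"
```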
TODO
Think about how the downloads could be sped up:
Axel, a multi-connection download tool
mwget, a multi-threaded version of wget
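Another option, without new tools, is to run several wget processes across files with xargs -P, keeping -c for resume. The sketch below is a rough illustration only: the manifest contents are made up, and echo stands in for the real download command:

```shell
# hypothetical one-row manifest
printf 'id\tfilename\n66fab868\tTCGA-XX.svs\n' > /tmp/demo2_manifest.txt
# skip the header (NR>1) and run up to 4 fetches in parallel;
# replace echo with 'wget -c -t 0 -O ...' for real use
awk 'NR>1 {print $1}' /tmp/demo2_manifest.txt \
  | xargs -I{} -P 4 echo "would fetch https://api.gdc.cancer.gov/data/{}"
```

Parallelism across files helps when the server caps per-connection speed, which is the same gap Axel and mwget fill with multiple connections per file.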