Today I looked into writing scripts to batch-download .svs files from TCGA. There are two parts: a Python script using requests and a shell script using wget. The shell script is the better choice; the Python one can hang.
Any download tool is just something a person wrote under a given set of rules; there is nothing magical about it. Thinking + practice + optimization = tool.
requests
The code below is a simple modification of the script linked in the docstring. It mainly adds .svs file-name output and a download-speed display.
# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder the downloaded files are saved to (it is best to create a new folder for the downloaded data).
The tool supports resuming: after the program is interrupted it can be restarted, and it will continue from the last downloaded file. Note that the original tool saved each download as a single file named by its TCGA UUID rather than in the per-file folder layout. Press ctrl+c to terminate the program if necessary.
author: chenwi
date: 2018/07/10
mail: chenwi4323@gmail.com
@cp
* save files under their original names, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs,
  instead of the UUID name 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time
print(__doc__)
requests.packages.urllib3.disable_warnings()
def download(url, file_path):
    r = requests.get(url, stream=True, verify=False)
    total_size = int(r.headers['content-length'])
    print(f"{total_size / 1024 / 1024:.2f}MB")
    temp_size = 0
    size = 0
    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                done = int(50 * temp_size / total_size)
                # refresh the progress bar and speed estimate every 5 seconds
                if time.time() - time1 > 5:
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done),
                                      100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()
def get_UUID_list(manifest_path):
    # the manifest is tab-separated with 'id' and 'filename' columns
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    return list(manifest['id']), list(manifest['filename'])
def get_last_UUFN(file_path):
    # most recently modified file in the save folder, or None if it is empty
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]
def get_lastUUFN_index(UUFN_list, last_UUFN):
    # index of the last downloaded file in the manifest; 0 restarts from the beginning
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0
def quit(signum, frame):
    # Ctrl+C quit
    print()
    print('You choose to stop me.')
    sys.exit()
if __name__ == '__main__':
    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="folder the downloaded files are saved to")
    args = parser.parse_args()

    link = 'https://api.gdc.cancer.gov/data/'
    manifest_path = args.M
    save_path = args.S
    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last download file {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # plain concatenation; os.path.join would break on Windows
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
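The resume logic above picks the most recently modified file in the save folder and restarts from its position in the manifest. A minimal standalone sketch of that idea (the file names here are made up):

```python
import os
import tempfile
import time

def newest_file(folder):
    """Return the most recently modified file name in folder, or None if empty."""
    names = os.listdir(folder)
    if not names:
        return None
    return max(names, key=lambda n: os.path.getmtime(os.path.join(folder, n)))

# simulate three finished downloads
tmp = tempfile.mkdtemp()
for name in ["a.svs", "b.svs", "c.svs"]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write("x")
    time.sleep(0.05)  # ensure distinct modification times

print(newest_file(tmp))  # the file written last
```

Because only modification time is used, a partially downloaded last file is re-downloaded from its manifest index on restart.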
wget
The advantage of wget is that it supports unlimited automatic reconnects and resuming interrupted downloads, which is very friendly for large files: there is no need to worry about the network dropping halfway through. The script below batch-downloads svs files; it will keep being updated for a while.
#!/usr/bin/env bash
# file: wget_download.sh
# download, resuming and retrying automatically
#location="cervix_uteri"
location=$1
if [ ! -d "${location}" ]
then
    mkdir "${location}"
fi
echo "run script: $0 $*" > "download_${location}.log" 2>&1
manifest_file="gdc_manifest_${location}.txt"
row_num=$(awk 'END{print NR}' "${manifest_file}")
file_num=$((row_num - 1))   # the first manifest row is the header
echo "file_num is ${file_num}" >> "download_${location}.log" 2>&1
uuid_array=($(awk '{print $1}' "${manifest_file}"))
uufn_array=($(awk '{print $2}' "${manifest_file}"))
# resume: if N files already exist, index N is the last (possibly partial) one,
# which wget -c will complete
start=$(ls "./${location}" | wc -l)
if [ "${start}" -eq 0 ];then
    start=1
fi
echo "start from ${start}" >> "download_${location}.log" 2>&1
for k in $(seq "${start}" "${file_num}")
do
    echo "${uuid_array[$k]} ${uufn_array[$k]}" >> "download_${location}.log" 2>&1
    # -c resume partial files, -t 0 retry forever
    wget -c -t 0 -O "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"
done
To use it, run:
chmod +x wget_download.sh
./wget_download.sh cervix_uteri
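One subtlety in the script: `uuid_array` and `uufn_array` are built from the whole manifest, so index 0 holds the header fields (`id`, `filename`), and starting the loop at 1 is what skips them. A quick check with a made-up two-row manifest:

```shell
# hypothetical manifest: header line plus two data rows
printf 'id\tfilename\nu1\tf1.svs\nu2\tf2.svs\n' > /tmp/demo_manifest.txt
arr=($(awk '{print $1}' /tmp/demo_manifest.txt))
echo "${arr[0]}"   # the header field "id", skipped by the loop
echo "${arr[1]}"   # the first real UUID, "u1"
```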
TODO
Think about how the downloads could be sped up:
Axel, a multi-connection download tool
mwget, a multi-threaded version of wget
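Another option, without new tools, is to run several wget processes across files with xargs -P, keeping -c for resume. The sketch below is a rough illustration only: the manifest contents are made up, and echo stands in for the real download command:

```shell
# hypothetical one-row manifest
printf 'id\tfilename\n66fab868\tTCGA-XX.svs\n' > /tmp/demo2_manifest.txt
# skip the header (NR>1) and run up to 4 fetches in parallel;
# replace echo with 'wget -c -t 0 -O ...' for real use
awk 'NR>1 {print $1}' /tmp/demo2_manifest.txt \
  | xargs -I{} -P 4 echo "would fetch https://api.gdc.cancer.gov/data/{}"
```

Parallelism across files helps when the server caps per-connection speed, which is the same gap Axel and mwget fill with multiple connections per file.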