A Python script for downloading TCGA data: very simple, and open source

Preface

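The official TCGA download tool, gdc-client, is sometimes slow and often dies partway through, so I decided to write a small downloader of my own. The script below uses the TCGA API to download the data. Like gdc-client, it needs a manifest file that you download beforehand.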

Environment

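Further down, the program is also packaged as an EXE, in both a command-line and a GUI version, so that people without Python can use it too.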

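Environment: Python 3.6
Packages (os, sys, argparse and signal ship with Python; pandas and requests must be installed):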

  • os
  • pandas
  • requests
  • sys
  • argparse
  • signal

Code

# coding:utf-8
'''
This tool simplifies the steps for downloading TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder where the downloaded files are saved (it is best to create a new folder for the downloaded data).
This tool supports resuming. After the program is interrupted, restart it with the same arguments and it will continue from the most recently downloaded file. Note that this tool saves each file directly as a txt file instead of the earlier one-folder-per-file layout. The file name is the UUID of the file in TCGA. If necessary, press Ctrl+C to terminate the program.
author: chenwi
date: 2018/07/10
mail: chenwi4323@gmail.com
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal

print(__doc__)

requests.packages.urllib3.disable_warnings()


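def download(url, file_path):
    # Stream the response so large files are never held in memory at once.
    # NOTE: verify=False skips TLS certificate checks, as in the original script.
    r = requests.get(url, stream=True, verify=False)
    r.raise_for_status()  # fail on HTTP errors instead of saving an error page as data
    # content-length can be absent (e.g. chunked responses); 0 means "size unknown"
    total_size = int(r.headers.get('content-length', 0))
    temp_size = 0

    with open(file_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                if total_size:
                    # Redraw a 50-character progress bar on the same console line.
                    done = int(50 * temp_size / total_size)
                    sys.stdout.write("\r[%s%s] %d%%" % ('#' * done, ' ' * (50 - done), 100 * temp_size / total_size))
                    sys.stdout.flush()
    print()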


def get_UUID_list(manifest_path):
    UUID_list = pd.read_table(manifest_path, sep='\t', encoding='utf-8')['id']
    UUID_list = list(UUID_list)
    return UUID_list


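def get_last_UUID(file_path):
    # Only consider the <UUID>.txt files this script writes; ignore anything else in the folder.
    dir_list = [x for x in os.listdir(file_path) if x.endswith('.txt')]
    if not dir_list:
        return None
    # The most recently modified file is the one that was being downloaded last.
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1][:-4]  # strip the '.txt' extension to get the UUID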


def get_lastUUID_index(UUID_list, last_UUID):
    for i, UUID in enumerate(UUID_list):
        if UUID == last_UUID:
            return i
    return 0


def quit(signum, frame):
    # Signal handler for Ctrl+C (SIGINT) and SIGTERM: print a message, then exit.
    print()
    print('You chose to stop me.')
    sys.exit()


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list = get_UUID_list(manifest_path)
    last_UUID = get_last_UUID(save_path)
    print("Last download file {}".format(last_UUID))
    last_UUID_index = get_lastUUID_index(UUID_list, last_UUID)

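    # Start at the last file on disk (it may be incomplete) and re-download it.
    for UUID in UUID_list[last_UUID_index:]:
        # Plain concatenation: os.path.join would insert '\' into the URL on Windows.
        url = link + UUID
        file_path = os.path.join(save_path, UUID + '.txt')
        download(url, file_path)
        print(f'{UUID} has been downloaded')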
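
The script resumes by modification time, so it can only notice an interrupted file if that file is the newest one in the folder. As a possible alternative, the sketch below decides what still needs downloading by checking each saved file against the manifest. It assumes the manifest is tab-separated with `id` and `md5` columns, as GDC manifests normally are, and that files are saved as `<UUID>.txt` as in the script above. The helper names `md5_of` and `pending_uuids` are hypothetical and are not part of the original tool.

```python
import csv
import hashlib
import os


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pending_uuids(manifest_path, save_dir):
    """List UUIDs whose <UUID>.txt is missing or does not match the manifest MD5.

    Assumption: a tab-separated manifest with 'id' and 'md5' columns.
    """
    pending = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            target = os.path.join(save_dir, row["id"] + ".txt")
            if not os.path.isfile(target) or md5_of(target) != row["md5"]:
                pending.append(row["id"])
    return pending
```

The main loop could then iterate over `pending_uuids(manifest_path, save_path)` instead of slicing `UUID_list`. The trade-off is that every file already on disk is hashed again at start-up.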

Usage

Run the script from the command line:

python tcga_download.py -m manifest-xx.txt -s xxx

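Explanation:
manifest-xx.txt is the path to the manifest file you downloaded.
xxx is the folder where you want the downloaded files to be saved (preferably a newly created, empty folder).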
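
To make the inputs concrete, here is a small sketch of what the script does with the manifest: it reads the `id` column and appends each UUID to the GDC data endpoint used in the code above. The two-row manifest is made-up sample data, `build_urls` is a hypothetical helper name, and the standard-library `csv` module stands in for pandas so the snippet runs without extra packages.

```python
import csv
import io

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data/"  # same endpoint as the script

# Made-up sample manifest for illustration; the UUIDs and checksums are not real.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\texample_a.txt\tabc\t10\treleased\n"
    "00000000-0000-0000-0000-000000000002\texample_b.txt\tdef\t20\treleased\n"
)


def build_urls(manifest_text):
    """Return one download URL per manifest row, in manifest order."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [GDC_DATA_ENDPOINT + row["id"] for row in reader]


for url in build_urls(SAMPLE_MANIFEST):
    print(url)
```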


Packaging the program as an EXE

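Finally, if you do not have Python installed, you can download TCGA data with tcga_download.exe, the tool I packaged. It is simple and convenient, somewhat like gdc-client, haha, though something you wrote yourself does feel more rewarding. Later I plan to build a Qt GUI version where a few mouse clicks are enough.
tcga_download.exe is on a cloud drive; download it if you need it.
Link: https://pan.baidu.com/s/1AGyZ5cAyPUK06zqiQGx-nQ Password: 3os4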


GUI download EXE

A little point-and-click EXE that downloads with a few mouse clicks:
Download: https://github.com/chenwi/TCGAD

