A Program for Batch-Downloading TCGA Data

Today I worked out how to write scripts to batch-download the .svs files in TCGA. There are two parts: a Python script using requests, and a shell script using wget. The shell script works better; the Python one can hang.

Any download tool is just something a person wrote against a set of rules; there is nothing magical about it. Thinking + practice + optimization = a tool.

requests

Starting from the code in the CSDN post linked in the docstring, I made small modifications: the main additions are saving files under their original .svs filenames and displaying the download speed.

# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool is to simplify the steps to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the location where the downloaded file is to be saved (it is best to create a new folder for the downloaded data).
This tool supports resuming after interruption: restart the program and it will continue from the last downloaded file. Note that instead of the GDC's one-folder-per-file layout, it writes each download as a single flat file, originally named by the file's TCGA UUID. Press Ctrl+C to terminate the program if necessary.
author: chenwi
date: 2018/07/10
mail: chenwi4323@gmail.com

@cp
* save files under their original names, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs, instead of
  the UUID name 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time

print(__doc__)

requests.packages.urllib3.disable_warnings()  # silence the InsecureRequestWarning caused by verify=False


def download(url, file_path):
    r = requests.get(url, stream=True, verify=False)
    total_size = int(r.headers['content-length'])
    print(f"{total_size / 1024 / 1024:.2f} MB")
    temp_size = 0
    size = 0

    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                # refresh the progress bar and speed reading every 5 seconds
                if time.time() - time1 > 5:
                    done = int(50 * temp_size / total_size)
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done),
                                      100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()


def get_UUID_list(manifest_path):
    # read the manifest once; it is tab-separated with 'id' and 'filename' columns
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    return list(manifest['id']), list(manifest['filename'])

def get_last_UUFN(file_path):
    # the most recently modified file is the last (possibly partial) download
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]


def get_lastUUFN_index(UUFN_list, last_UUFN):
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0


def quit(signum, frame):
    # Ctrl+C / SIGTERM quit
    print()
    print('You chose to stop me.')
    sys.exit()


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="folder to save the downloaded files to (best to use a new folder)")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last download file {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # join with plain concatenation; os.path.join is for filesystem paths, not URLs
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
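Both scripts assume the standard GDC manifest layout: a tab-separated file whose header row contains at least the `id` and `filename` columns. Below is a minimal, stdlib-only sketch of the parsing and resume logic used above; the toy manifest's extra columns and all of its values are made up for illustration:

```python
import csv
import io

# A toy manifest in the GDC tab-separated format (all values invented).
MANIFEST = """id\tfilename\tmd5\tsize\tstate
uuid-aaa\tTCGA-AA-0001.svs\td41d8\t123\treleased
uuid-bbb\tTCGA-BB-0002.svs\td41d9\t456\treleased
uuid-ccc\tTCGA-CC-0003.svs\td41da\t789\treleased
"""

def parse_manifest(text):
    """Return parallel lists of UUIDs and filenames from manifest text."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return [r["id"] for r in rows], [r["filename"] for r in rows]

def resume_index(filenames, last_downloaded):
    """Index of the last downloaded file, so it is fetched again in case
    it was only partially written; 0 if nothing matches."""
    for i, name in enumerate(filenames):
        if name == last_downloaded:
            return i
    return 0

uuids, names = parse_manifest(MANIFEST)
print(resume_index(names, "TCGA-BB-0002.svs"))  # 1
```

This mirrors why the scripts deliberately restart from the last file rather than the next one: the most recent download may be incomplete.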

wget

The advantage of using wget is that it supports unlimited automatic reconnection (-t 0) and resuming interrupted transfers (-c), which is very friendly for large files: no need to worry about the network dropping halfway through. The script below batch-downloads the svs files; it will keep being updated for a while.

#!/usr/bin/env bash
# 文件名:wget_download.sh

# download and re-download
#location="cervix_uteri"
location=$1
if [ ! -d "${location}" ]
then
    mkdir "${location}"
fi

echo run script: $0 $*  > download_${location}.log  2>&1

manifest_file=gdc_manifest_${location}.txt
row_num=$(awk 'END{print NR}' "${manifest_file}")
file_num=$((row_num - 1))                    # subtract the header row
echo "file_num is ${file_num}" >> download_${location}.log 2>&1
uuid_array=($(awk '{print $1}' "${manifest_file}"))   # index 0 holds the header "id"
uufn_array=($(awk '{print $2}' "${manifest_file}"))   # index 0 holds the header "filename"
start=$(ls "./${location}" | wc -l)          # restart at the last (possibly partial) file
if [ "${start}" -eq 0 ]; then
    start=1
fi

echo start from ${start} >> download_${location}.log  2>&1
for k in $(seq "${start}" "${file_num}")
do
    echo "${uuid_array[$k]} ${uufn_array[$k]}" >> download_${location}.log 2>&1
    # -c: resume partial files; -t 0: retry indefinitely
    wget -c -t 0 -O "./${location}/${uufn_array[$k]}" "https://api.gdc.cancer.gov/data/${uuid_array[$k]}"
done

To use it, run:

chmod +x wget_download.sh
./wget_download.sh cervix_uteri

TODO

Think about how the downloads could be sped up:
Axel, a multi-threaded download tool
mwget, a multi-threaded version of wget
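One way to speed things up without new tools is to keep one connection per file but download several files concurrently. A sketch with Python's thread pool follows; `parallel_download` and `fake_fetch` are illustrative names I introduce here, and the injected `fetch` callable stands in for whatever single-file downloader you use (the `download()` function above, or a `wget` subprocess):

```python
from concurrent.futures import ThreadPoolExecutor

GDC_DATA = "https://api.gdc.cancer.gov/data/"

def parallel_download(pairs, fetch, workers=4):
    """Download (uuid, filename) pairs concurrently.

    fetch(url, path) performs the actual transfer; it is injected so the
    pool logic stays independent of requests/wget. Results are returned
    in submission order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, GDC_DATA + uuid, name)
                   for uuid, name in pairs]
        return [f.result() for f in futures]

# A fake fetch that just records the URL, to show the call shape:
log = []
def fake_fetch(url, path):
    log.append(url)
    return path

done = parallel_download([("u1", "a.svs"), ("u2", "b.svs")], fake_fetch)
print(done)  # ['a.svs', 'b.svs']
```

Note that the GDC API may throttle aggressive parallelism, so a small worker count is a safer starting point.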
