TCGA data download: downloading clinical data with TCGAbiolinks
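Retrieving clinical information
TCGAbiolinks also provides functions for querying, downloading, and parsing clinical data. The GDC database holds several kinds of clinical information, mainly:
- indexed clinical: refined clinical data created from the XML files
- XML: the raw clinical data
- BCR Biotab: TSV files obtained by parsing the XML files
There are two main differences between the indexed data and the raw XML data:
- The XML contains more information, such as radiation, drug, outcome, and sample information; the indexed data is only a subset of the XML.
- The indexed data contains updated outcome information. If a patient was alive at one follow-up and was found to have died at the next, the patient's status is updated to dead, whereas the XML file adds a new entry that records the information of that follow-up.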
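To make the second difference concrete, here is a minimal sketch in base R on a made-up follow-up table. The column names `patient`, `followup_day`, and `vital_status` are hypothetical and are not the actual GDC schema. The table mimics XML-style records, with one row per follow-up, and is collapsed to one up-to-date row per patient, which is roughly the view the indexed data gives.

```r
# Toy XML-style follow-up records: one row per follow-up visit (hypothetical columns)
followups <- data.frame(
  patient      = c("P1", "P1", "P2"),
  followup_day = c(100, 400, 250),
  vital_status = c("Alive", "Dead", "Alive"),
  stringsAsFactors = FALSE
)

# Sort by patient and follow-up day, then keep the last record of each patient
ordered <- followups[order(followups$patient, followups$followup_day), ]
latest  <- ordered[!duplicated(ordered$patient, fromLast = TRUE), ]
rownames(latest) <- NULL
print(latest)
```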
Other clinical information can also be retrieved, such as
- tissue slide images
- pathology reports
BCR Biotab
Clinical
Here we retrieve the breast cancer clinical information in BCR Biotab format
library(TCGAbiolinks)
query <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Clinical",
  data.type = "Clinical Supplement",
  data.format = "BCR Biotab"
)
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
Check which clinical tables are included; there are drug, follow-up, radiation, and other tables
names(clinical.BCRtab.all)
# [1] "clinical_omf_v4.0_brca" "clinical_follow_up_v1.5_brca"
# [3] "clinical_follow_up_v4.0_nte_brca" "clinical_patient_brca"
# [5] "clinical_nte_brca" "clinical_follow_up_v4.0_brca"
# [7] "clinical_follow_up_v2.1_brca" "clinical_radiation_brca"
# [9] "clinical_drug_brca"
Inspect the patient table
patient_info <- clinical.BCRtab.all$clinical_patient_brca
dim(patient_info)
# [1] 1099 112
patient_info[1:3, 1:6]
# # A tibble: 3 × 6
# bcr_patient_uuid bcr_patient_barcode form_completion_date prospective_collection
# <chr> <chr> <chr> <chr>
# 1 bcr_patient_uuid bcr_patient_barcode form_completion_date tissue_prospective_col…
# 2 CDE_ID: CDE_ID:2003301 CDE_ID: CDE_ID:3088492
# 3 6E7D5EC6-A469-467C-B748-237353C23416 TCGA-3C-AAAU 2014-1-13 NO
# # ℹ 2 more variables: retrospective_collection <chr>, birth_days_to <chr>
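In the output above, the first two rows are not patients: row 1 repeats a descriptive name for each column and row 2 holds the CDE identifiers. Below is a minimal sketch of dropping them, shown on a small stand-in data frame so that it runs without downloading anything. The helper `drop_biotab_header_rows` is hypothetical and is not part of TCGAbiolinks.

```r
# Stand-in for the first rows of a BCR Biotab patient table
toy <- data.frame(
  bcr_patient_uuid    = c("bcr_patient_uuid", "CDE_ID:",
                          "6E7D5EC6-A469-467C-B748-237353C23416"),
  bcr_patient_barcode = c("bcr_patient_barcode", "CDE_ID:2003301", "TCGA-3C-AAAU"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose barcode looks like a TCGA patient barcode
drop_biotab_header_rows <- function(df, barcode_col = "bcr_patient_barcode") {
  keep <- grepl("^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}$", df[[barcode_col]])
  df[keep, , drop = FALSE]
}

patients_only <- drop_biotab_header_rows(toy)
print(patients_only$bcr_patient_barcode)
```

The same call on `patient_info` should remove the two header rows, but it is worth checking the result with `dim()` before relying on it.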
To get the ER (estrogen receptor) status of the breast cancer patients, we can run
library(tidyverse)
patient_info %>%
dplyr::select(starts_with("er")) %>%
head(3)
# # A tibble: 3 × 6
# er_status_by_ihc er_status_ihc_Percen…¹ er_positivity_scale_…² er_ihc_score er_positivity_scale_…³
# <chr> <chr> <chr> <chr> <chr>
# 1 breast_carcinoma_e… er_level_cell_percent… breast_carcinoma_immu… immunohisto… positive_finding_estr…
# 2 CDE_ID:2957359 CDE_ID:3128341 CDE_ID:3203081 CDE_ID:2230… CDE_ID:3086851
# 3 Positive 50-59% [Not Available] [Not Availa… [Not Available]
# # ℹ abbreviated names: ¹er_status_ihc_Percent_Positive, ²er_positivity_scale_used,
# # ³er_positivity_scale_other
# # ℹ 1 more variable: er_positivity_method <chr>
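Missing values in these tables appear as bracketed placeholders such as `[Not Available]` rather than `NA`. The sketch below, on a toy vector, shows one way to convert them before counting; treating every bracketed value as missing is an assumption that should be checked against the actual columns.

```r
# Toy ER status values, including bracketed placeholders
er_status <- c("Positive", "Negative", "[Not Available]", "Positive", "[Not Evaluated]")

# Treat any value wrapped in square brackets as missing
er_clean <- ifelse(grepl("^\\[.*\\]$", er_status), NA_character_, er_status)

# Count each status while keeping the missing values visible
er_counts <- table(er_clean, useNA = "ifany")
print(er_counts)
```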
Biospecimen
Retrieve the biospecimen (sampling) information
query.biospecimen <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Biospecimen",
  data.type = "Biospecimen Supplement",
  data.format = "BCR Biotab"
)
GDCdownload(query.biospecimen)
biospecimen.BCRtab.all <- GDCprepare(query.biospecimen)
List all the tables that are included
names(biospecimen.BCRtab.all)
# [1] "biospecimen_analyte_brca" "ssf_tumor_samples_brca"
# [3] "biospecimen_slide_brca" "biospecimen_portion_brca"
# [5] "biospecimen_sample_brca" "biospecimen_diagnostic_slides_brca"
# [7] "ssf_normal_controls_brca" "biospecimen_aliquot_brca"
# [9] "biospecimen_protocol_brca" "biospecimen_shipment_portion_brca"
Indexed
The indexed clinical data is retrieved with the GDCquery_clinic function
clinical <- GDCquery_clinic(project = "TCGA-BRCA", type = "clinical")
dim(clinical)
# [1] 1098 70
clinical[1:3, 1:5]
# project submitter_id synchronous_malignancy ajcc_pathologic_stage days_to_diagnosis
# 1 TCGA-BRCA TCGA-A2-A04N No Stage IA 0
# 2 TCGA-BRCA TCGA-EW-A1OW No Stage IIA 0
# 3 TCGA-BRCA TCGA-AC-A2FO No Stage IIB 0
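A common next step with the indexed data is to derive an overall-survival time and event flag. The sketch below works on a toy data frame and assumes the columns `vital_status`, `days_to_death`, and `days_to_last_follow_up` exist; that is an assumption about the returned table, so check `colnames(clinical)` first. The patient IDs and values are made up.

```r
# Toy rows with the assumed column names (made-up values)
toy <- data.frame(
  submitter_id           = c("PATIENT-1", "PATIENT-2", "PATIENT-3"),
  vital_status           = c("Alive", "Dead", "Alive"),
  days_to_death          = c(NA, 694, NA),
  days_to_last_follow_up = c(4354, NA, 1920),
  stringsAsFactors = FALSE
)

# Event indicator: 1 if the patient died, 0 if censored
toy$os_event <- as.integer(toy$vital_status == "Dead")

# Survival time: days to death for events, days to last follow-up otherwise
toy$os_time <- ifelse(toy$os_event == 1, toy$days_to_death, toy$days_to_last_follow_up)
print(toy[, c("submitter_id", "os_time", "os_event")])
```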
Retrieve clinical information for other projects
clinical <- GDCquery_clinic(project = "TARGET-RT", type = "clinical")
dim(clinical)
# [1] 69 38
clinical[1:6, 14:17]
# project submitter_id disease_type primary_site
# 1 TARGET-RT TARGET-52-PASZYE Complex Mixed and Stromal Neoplasms Kidney
# 2 TARGET-RT TARGET-52-PATAFT Complex Mixed and Stromal Neoplasms Liver and intrahepatic bile ducts
# 3 TARGET-RT TARGET-52-PATBLF Complex Mixed and Stromal Neoplasms Lip
# submitter_id.1
# 1 TARGET-52-PASZYE
# 2 TARGET-52-PATAFT
# 3 TARGET-52-PATBLF
XML
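Handling clinical data in XML format takes two steps:
- Use GDCquery and GDCdownload to query and download the Biospecimen or Clinical XML files
- Use GDCprepare_clinic to parse the files
Note: patients and clinical records are in a one-to-many relationship (a single patient may, for example, receive several rounds of chemotherapy), so only one table can be parsed at a time. Use the clinical.info argument to choose the table.

query <- GDCquery(
  project = "TCGA-BRCA",
  data.category = "Clinical",
  data.format = "bcr xml",
  barcode = c("TCGA-3C-AAAU", "TCGA-4H-AAAK")
)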
GDCdownload(query)
Parse the patient table
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
clinical[,1:4]
# bcr_patient_barcode additional_studies tumor_tissue_site tumor_tissue_site_other
# 1 TCGA-3C-AAAU NA Breast NA
# 2 TCGA-4H-AAAK NA Breast NA
Get the drug information
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
clinical.drug[,1:4]
# bcr_patient_barcode tx_on_clinical_trial regimen_number bcr_drug_barcode
# 1 TCGA-3C-AAAU YES NA TCGA-3C-AAAU-D60350
# 2 TCGA-4H-AAAK NO NA TCGA-4H-AAAK-D68065
# 3 TCGA-4H-AAAK NO NA TCGA-4H-AAAK-D68067
# 4 TCGA-4H-AAAK NO NA TCGA-4H-AAAK-D68072
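The drug table above has several rows for one patient, which is the one-to-many relationship mentioned in the note. The sketch below rebuilds the two tables by hand from the barcodes printed above, so it runs offline, and shows that a plain merge repeats the patient row, while summarizing first keeps one row per patient. The column name `n_drug_records` is introduced here for illustration.

```r
# Hand-built stand-ins using the barcodes shown in the output above
patients <- data.frame(
  bcr_patient_barcode = c("TCGA-3C-AAAU", "TCGA-4H-AAAK"),
  tumor_tissue_site   = c("Breast", "Breast"),
  stringsAsFactors = FALSE
)
drugs <- data.frame(
  bcr_patient_barcode = c("TCGA-3C-AAAU", rep("TCGA-4H-AAAK", 3)),
  bcr_drug_barcode    = c("TCGA-3C-AAAU-D60350", "TCGA-4H-AAAK-D68065",
                          "TCGA-4H-AAAK-D68067", "TCGA-4H-AAAK-D68072"),
  stringsAsFactors = FALSE
)

# A direct merge yields one row per drug record, so patients are repeated
merged <- merge(patients, drugs, by = "bcr_patient_barcode", all.x = TRUE)

# Summarize first to keep exactly one row per patient
n_drugs <- aggregate(bcr_drug_barcode ~ bcr_patient_barcode, data = drugs, FUN = length)
names(n_drugs)[2] <- "n_drug_records"
per_patient <- merge(patients, n_drugs, by = "bcr_patient_barcode", all.x = TRUE)
print(per_patient)
```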
Because neither of these two patients has radiation records, NULL is returned
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
clinical.radiation
# NULL
Other data
Tissue slide images (SVS format)
query.harmonized <- GDCquery(
  project = "TCGA-OV",
  data.category = "Biospecimen",
  data.type = "Slide Image"
)
View the query results
getResults(query.harmonized)[1:6,1:4]
# id data_format cases access
# 1 1c610d80-8b2f-40dc-986b-605771f66e99 SVS TCGA-57-1586 open
# 2 d13aa745-0794-4374-98fd-7c67cd35e3c4 SVS TCGA-25-1878 open
# 3 adfc7bae-6299-4949-b9c7-377a09558898 SVS TCGA-25-1878 open
# 4 f65e450e-4883-439a-93a1-e4d506c8c73a SVS TCGA-36-2543 open
# 5 3488aeb5-1a84-49a2-ba0e-c3f44610f861 SVS TCGA-36-2534 open
# 6 8573e04c-2614-4478-9dc6-5e3d2b5fda90 SVS TCGA-36-2534 open
Diagnostic slides (SVS format)
query.harmonized <- GDCquery(
  project = "TCGA-COAD",
  data.category = "Biospecimen",
  data.type = "Slide Image",
  experimental.strategy = "Diagnostic Slide",
  barcode = c("TCGA-RU-A8FL", "TCGA-AA-3972")
)
View the query results
getResults(query.harmonized)[,1:4]
# id data_format cases access
# 130 b339b9d1-af19-46e3-94cf-eb21c391da0e SVS TCGA-AA-3972 open
Filter functions
There are also functions for filtering samples by type, for example
TCGAquery_SampleTypes:

bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",
"TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
"TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
"TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
"TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
"TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
"TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
"TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
"TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
"TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
"TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
"TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
"TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
"TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
"TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
"TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
"TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")
S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")
# Return samples of type TP or NB
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))
S
# [1] "TCGA-G9-6336-01A-11R-1789-07" "TCGA-AO-A0J5-01A-11R-1789-07" "TCGA-G9-6380-01A-11R-1789-07"
# [4] "TCGA-G9-6340-01A-11R-1789-07"
S2
# [1] "TCGA-G9-7036-10A-11R-1789-07" "TCGA-BH-A1ES-10A-11R-1789-07" "TCGA-BH-A1F0-10A-11R-1789-07"
SS
# [1] "TCGA-G9-6336-01A-11R-1789-07" "TCGA-AO-A0J5-01A-11R-1789-07" "TCGA-G9-6380-01A-11R-1789-07"
# [4] "TCGA-G9-6340-01A-11R-1789-07" "TCGA-G9-7036-10A-11R-1789-07" "TCGA-BH-A1ES-10A-11R-1789-07"
# [7] "TCGA-BH-A1F0-10A-11R-1789-07"
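The sample type is encoded in characters 14 and 15 of a TCGA barcode (01 is a primary solid tumor, 10 is blood-derived normal, 11 is solid tissue normal). The base R sketch below illustrates the idea behind this kind of filter on a few of the barcodes above. It is not the TCGAbiolinks implementation, the helper `sample_type_of` is hypothetical, and its lookup table covers only four codes.

```r
# Hypothetical helper: map the two-digit sample code to a short type label
sample_type_of <- function(barcodes) {
  codes  <- substr(barcodes, 14, 15)
  lookup <- c("01" = "TP", "02" = "TR", "10" = "NB", "11" = "NT")
  unname(lookup[codes])  # codes missing from the lookup become NA
}

bar_small <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
               "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7036-10A-11R-1789-07",
               "TCGA-G9-6332-60A-11R-1789-07")

types   <- sample_type_of(bar_small)
tp_only <- unique(bar_small[types %in% "TP"])
print(data.frame(barcode = bar_small, type = types))
```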
TCGAquery_MatchedCoupledSampleTypes
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))
# Return the barcodes of patients that have both NT and TP samples
SSS
# [1] "TCGA-G9-6336-11A-11R-1789-07" "TCGA-G9-6380-11A-11R-1789-07" "TCGA-G9-6340-11A-11R-1789-07"
# [4] "TCGA-G9-6336-01A-11R-1789-07" "TCGA-G9-6380-01A-11R-1789-07" "TCGA-G9-6340-01A-11R-1789-07"
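Matching tumor and normal samples comes down to grouping barcodes by patient (the first 12 characters) and keeping the patients that have both sample codes. Here is a base R sketch of that idea on four of the barcodes above; it is an illustration only and may differ from what TCGAquery_MatchedCoupledSampleTypes does internally.

```r
bar_small <- c("TCGA-G9-6336-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
               "TCGA-G9-7336-11A-11R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07")

patient <- substr(bar_small, 1, 12)   # patient-level barcode
code    <- substr(bar_small, 14, 15)  # sample type code

# Patients that have a primary tumor (01) and a solid tissue normal (11)
has_tp <- unique(patient[code == "01"])
has_nt <- unique(patient[code == "11"])
paired_patients <- intersect(has_tp, has_nt)

# Keep only the tumor and normal barcodes of those patients
paired <- bar_small[patient %in% paired_patients & code %in% c("01", "11")]
print(paired)
```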
Downloading all clinical data
The code below retrieves the clinical data of all TCGA projects
library(data.table)
library(dplyr)
library(regexPipes)
# Fetch the indexed clinical data of every TCGA project
clinical <- TCGAbiolinks::getGDCprojects()$project_id %>%
  regexPipes::grep("TCGA", value = TRUE) %>%
  sort %>%
  plyr::alply(1, GDCquery_clinic, .progress = "text") %>%
  rbindlist(fill = TRUE)
readr::write_csv(clinical, file = "~/Downloads/all_clin_indexed.csv")
# Parse the XML files and extract the corresponding tables
getclinical <- function(proj) {
  message(proj)
  result <- NULL
  attempt <- 1
  max_attempts <- 5 # maximum number of attempts
  while (attempt <= max_attempts) {
    result <- tryCatch({
      query <- GDCquery(project = proj, data.category = "Clinical", data.format = "bcr xml")
      GDCdownload(query)
      clinical <- GDCprepare_clinic(query, clinical.info = "patient")
      for (i in c("admin", "radiation", "follow_up", "drug", "new_tumor_event")) {
        message(i)
        aux <- GDCprepare_clinic(query, clinical.info = i)
        if (is.null(aux) || nrow(aux) == 0) next
        # Rename columns (other than the join key) that already exist in `clinical`
        replicated <- which(colnames(aux) != "bcr_patient_barcode" &
                              colnames(aux) %in% colnames(clinical))
        colnames(aux)[replicated] <- paste0(colnames(aux)[replicated], ".", i)
        clinical <- merge(clinical, aux, by = "bcr_patient_barcode", all = TRUE)
      }
      # Save the clinical data to a CSV file
      readr::write_csv(clinical, file = paste0("~/Downloads/", proj, "_clinical_from_XML.csv"))
      clinical
    }, error = function(e) {
      message(paste0("Error clinical: ", proj, " Attempt: ", attempt))
      attempt <<- attempt + 1 # count the failed attempt
      NULL
    })
    # Leave the loop as soon as the data has been retrieved
    if (!is.null(result)) break
  }
  # If every attempt failed, warn and return NULL
  if (is.null(result)) {
    warning(paste0("Failed to get clinical data for project: ", proj, " after ", max_attempts, " attempts."))
  }
  return(result)
}
# Patient information
clinical <- TCGAbiolinks::getGDCprojects()$project_id %>%
  regexPipes::grep("TCGA", value = TRUE) %>%
  sort %>%
  plyr::alply(1, getclinical, .progress = "text") %>%
  rbindlist(fill = TRUE) %>%
  setDF %>%
  dplyr::distinct() # drop fully duplicated rows
readr::write_csv(clinical, file = "~/Downloads/all_clin_XML.csv")
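The retry loop in getclinical can also be factored into a small reusable helper. The sketch below is one possible shape in base R. `with_retry` and its arguments are hypothetical names, and the optional pause between attempts is an addition that the code above does not have. A deliberately flaky function stands in for the download so the example runs offline.

```r
# Hypothetical helper: call fun() until it succeeds or max_attempts is reached
with_retry <- function(fun, max_attempts = 5, wait_seconds = 0) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fun(), error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  warning("Giving up after ", max_attempts, " attempts")
  NULL
}

# Demo: a function that fails twice and then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
out <- with_retry(flaky, max_attempts = 5)
print(out)
```

Like the loop above, this treats a `NULL` return value as a failure, so it only suits functions that return something non-NULL on success.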
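In my opinion, it is simpler to download already curated clinical data directly from a database such as UCSC Xena.

There, select the clinical data listed under phenotype.

Alternatively, download the clinical data or other data from the page provided by the TCGA Pan-Cancer analysis.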
