Introduction
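An earlier post introduced TCGAbiolinks. Since then TCGA has been restructured and the package has been updated to match, so this post covers the new version of TCGAbiolinks from the start.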
TCGAbiolinks is an R package that queries, downloads, and analyzes data from the TCGA database through the GDC API.
The package's functionality falls into three main areas:
- Data query and download
- Routine data analysis
- Visualization
The stable release can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
Alternatively, install the development version from GitHub:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
Load the package:
library(TCGAbiolinks) # version 2.30.4
Querying data
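TCGAbiolinks provides functions for querying and downloading data from GDC. The data falls into two kinds:
- Harmonized: newer data, aligned to the GRCh38 (hg38) genome build and processed with the GDC pipeline
- Legacy: presumably generated earlier, aligned to the GRCh37 (hg19) genome build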
Use the GDCquery function to query GDC data. Its parameters are:
GDCquery(
    project,
    data.category,
    data.type,
    workflow.type,
    access,
    platform,
    barcode,
    data.format,
    experimental.strategy,
    sample.type
)
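To make the signature concrete, here is a minimal sketch of one query against TCGA-BRCA. The data.type and workflow.type values ("Gene Expression Quantification", "STAR - Counts") are assumptions that are not taken from this post, and run_gdc_query is a hypothetical helper name. The actual call is left commented out because it needs TCGAbiolinks installed and network access to the GDC API.

```r
# Arguments for one concrete query. The data.type and workflow.type
# values are illustrative assumptions, not taken from the post.
query_args <- list(
  project       = "TCGA-BRCA",
  data.category = "Transcriptome Profiling",
  data.type     = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)

# Hypothetical helper: forwards the argument list to GDCquery().
# Calling it requires TCGAbiolinks and a connection to the GDC API.
run_gdc_query <- function(args) {
  if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) {
    stop("TCGAbiolinks is not installed")
  }
  do.call(TCGAbiolinks::GDCquery, args)
}

# Uncomment to run the query against the live GDC service:
# query <- run_gdc_query(query_args)
```

Keeping the arguments in a named list makes it easy to change one field, for example the project, and rerun the same query.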
project: this parameter accepts many values. The following command lists every available project:
> TCGAbiolinks:::getGDCprojects()$project_id
[1] "HCMI-CMDC" "TCGA-BRCA" "TARGET-ALL-P3"
[4] "EXCEPTIONAL_RESPONDERS-ER" "CGCI-HTMCP-LC" "CPTAC-2"
[7] "CMI-MBC" "TARGET-ALL-P2" "OHSU-CNL"
[10] "TARGET-ALL-P1" "MMRF-COMMPASS" "ORGANOID-PANCREATIC"
[13] "NCICCR-DLBCL" "TCGA-SARC" "TCGA-ACC"
[16] "WCDT-MCRPC" "TCGA-UCEC" "MP2PRT-ALL"
[19] "TCGA-KIRC" "CGCI-HTMCP-CC" "CMI-ASC"
[22] "CGCI-HTMCP-DLBCL" "BEATAML1.0-CRENOLANIB" "CDDP_EAGLE-1"
[25] "APOLLO-LUAD" "CMI-MPC" "FM-AD"
[28] "MATCH-Z1D" "MATCH-Y" "MATCH-N"
[31] "MATCH-Q" "MP2PRT-WT" "TCGA-LAML"
[34] "VAREPOP-APOLLO" "TCGA-SKCM" "TRIO-CRU"
[37] "TCGA-PAAD" "TCGA-TGCT" "TCGA-CESC"
[40] "TCGA-ESCA" "TCGA-THCA" "TCGA-LIHC"
[43] "TCGA-PRAD" "TCGA-READ" "MATCH-I"
[46] "MATCH-W" "MATCH-B" "MATCH-H"
[49] "TCGA-OV" "TCGA-UVM" "MATCH-Z1A"
[52] "MATCH-U" "BEATAML1.0-COHORT" "TCGA-BLCA"
[55] "CGCI-BLGSP" "CTSP-DLBCL1" "MATCH-S1"
[58] "MATCH-R" "MATCH-Z1I" "CPTAC-3"
[61] "TCGA-CHOL" "TCGA-GBM" "MATCH-S2"
[64] "TCGA-UCS" "TCGA-PCPG" "TCGA-MESO"
[67] "TARGET-CCSK" "TARGET-WT" "TARGET-RT"
[70] "TCGA-DLBC" "TARGET-OS" "TCGA-COAD"
[73] "REBC-THYR" "TCGA-STAD" "TCGA-KIRP"
[76] "TCGA-THYM" "TCGA-KICH" "TCGA-LGG"
[79] "TARGET-AML" "TCGA-LUSC" "TCGA-LUAD"
[82] "TCGA-HNSC" "TARGET-NBL"
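The list mixes TCGA projects with those of other programs (TARGET, CPTAC, MATCH and so on). If only the TCGA projects are of interest, a prefix filter is one way to narrow it down. The sketch below uses a handful of IDs copied from the output above so that it runs offline; in a live session the vector would come from the getGDCprojects() call instead.

```r
# A few IDs copied from the output above. In a live session this vector
# would come from TCGAbiolinks:::getGDCprojects()$project_id.
project_ids <- c("HCMI-CMDC", "TCGA-BRCA", "TARGET-ALL-P3",
                 "CPTAC-2", "TCGA-SARC", "TCGA-ACC")

# Keep only the IDs that start with "TCGA-" and sort them alphabetically.
tcga_projects <- sort(grep("^TCGA-", project_ids, value = TRUE))
print(tcga_projects)
```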
data.category: the data categories available for a project, here TCGA-BRCA, can be listed as follows:
> TCGAbiolinks:::getProjectSummary("TCGA-BRCA")
$file_count
[1] 61173
$data_categories
   file_count case_count                data_category
 1      17337       1098  Simple Nucleotide Variation
 2       9281       1098             Sequencing Reads
 3       5316       1098                  Biospecimen
 4       2288       1098                     Clinical
 5      12292       1098        Copy Number Variation
 6       4876       1097      Transcriptome Profiling
 7       3714       1097              DNA Methylation
 8        919        881           Proteome Profiling
 9        226        101 Somatic Structural Variation
10       4924       1095         Structural Variation
$case_count
[1] 1098
$file_size
[1] 6.245362e+14
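The summary is a list whose data_categories element is a data frame, so the category names that data.category accepts can be pulled out directly. The sketch below rebuilds a small part of that structure by hand (four rows copied from the output above) so that it runs without network access. The assumption is that the live return value of getProjectSummary() has the shape shown in the printed output.

```r
# Hand-built stand-in for part of the structure printed above; the real
# object would come from TCGAbiolinks:::getProjectSummary("TCGA-BRCA").
project_summary <- list(
  file_count = 61173,
  data_categories = data.frame(
    file_count    = c(17337, 2288, 12292, 4876),
    case_count    = c(1098, 1098, 1098, 1097),
    data_category = c("Simple Nucleotide Variation", "Clinical",
                      "Copy Number Variation", "Transcriptome Profiling"),
    stringsAsFactors = FALSE
  ),
  case_count = 1098
)

# Order the categories by number of files, largest first,
# and extract the names usable as data.category values.
categories <- project_summary$data_categories
categories <- categories[order(categories$file_count, decreasing = TRUE), ]
available <- categories$data_category
print(available)
```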
The new version can only retrieve harmonized data, which falls mainly into 7 categories:
- Bio