Using the cBioPortal Database API
Introduction
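cBioPortal offers many interactive visualizations for exploring genomic information in multi-omics cancer datasets. For deeper, customized analysis, however, we need to download the data ourselves.
cBioPortal has no one-click option for bulk-downloading every dataset; data can only be downloaded one study at a time. When an analysis spans several cancer types, or is a pan-cancer analysis, this quickly becomes tedious.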
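For this reason, cBioPortal provides a REST API that lets us download data in bulk from code. The API is described with the Swagger/OpenAPI specification, a standard for describing REST APIs: the description serves as documentation for developers, and tools can also use it to generate client code automatically.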
API documentation: https://www.cbioportal.org/api/api-docs
The document is in JSON format and describes each operation and its parameters, including their formats and usage, in detail.
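To make "a machine can read this document" concrete, here is a minimal sketch of walking a Swagger-style JSON description and listing its operations. `SPEC_FRAGMENT` is a hand-written, heavily abbreviated stand-in for the real document (an assumption for illustration; the real file is far larger and its contents may differ), and `list_operations` is a hypothetical helper name.

```python
import json

# Hand-written, abbreviated stand-in for the Swagger JSON document.
# It is NOT the real cBioPortal document; it only mimics its general shape.
SPEC_FRAGMENT = """
{
  "swagger": "2.0",
  "paths": {
    "/studies": {
      "get": {
        "operationId": "getAllStudiesUsingGET",
        "parameters": [
          {"name": "pageSize", "in": "query", "type": "integer"}
        ]
      }
    }
  }
}
"""

HTTP_METHODS = {"get", "post", "put", "delete", "patch"}


def list_operations(spec):
    """Return (operationId, METHOD, path, parameter names) for every operation."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as a shared "parameters" entry
            params = [p["name"] for p in op.get("parameters", [])]
            operations.append((op.get("operationId"), method.upper(), path, params))
    return operations


spec = json.loads(SPEC_FRAGMENT)
operations = list_operations(spec)
for op_id, method, path, params in operations:
    print(op_id, method, path, params)
```

Client libraries that understand OpenAPI do essentially this walk and then expose each `operationId` as a callable function.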
There are several ways to use this API, including from R and Python:
R
cBioPortalData: the recommended package
rapiclient: an alternative that parses the API document directly
library(rapiclient)
client <- get_api(url = "https://www.cbioportal.org/api/api-docs")
CGDSR: not recommended, as it is being deprecated
Python
bravado
from bravado.client import SwaggerClient

cbioportal = SwaggerClient.from_url(
    'https://www.cbioportal.org/api/api-docs',
    config={
        "validate_requests": False,
        "validate_responses": False,
    },
)
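Below we cover the cBioPortalData package and the bravado package in turn. If you work in another programming language, use any package that can parse an API document written to the OpenAPI specification.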
cBioPortalData
Installation and loading
BiocManager::install("cBioPortalData")
library(cBioPortalData)
1. Data structure
cBioPortalData stores each study as a MultiAssayExperiment object (an S4 class).
For example, the cBioDataPack function downloads and parses all the data in luad_tcga:
luad <- cBioDataPack(cancer_study_id = "luad_tcga", ask = FALSE)
The resulting luad is a MultiAssayExperiment object:
> class(luad)
[1] "MultiAssayExperiment"
attr(,"package")
[1] "MultiAssayExperiment"
We can list the slots the object contains:
> slotNames(luad)
[1] "ExperimentList" "colData" "sampleMap" "drops" "metadata"
The most important ones are:
ExperimentList: holds all the assay data, such as mutations, copy number, and expression
> experiments(luad)
ExperimentList class object of length 17:
[1] cna_hg19.seg: RaggedExperiment with 81799 rows and 518 columns
[2] CNA: SummarizedExperiment with 24776 rows and 516 columns
[3] expression_median: SummarizedExperiment with 17814 rows and 32 columns
[4] linear_CNA: SummarizedExperiment with 24776 rows and 516 columns
[5] methylation_hm27_normals: SummarizedExperiment with 1788 rows and 24 columns
[6] methylation_hm27: SummarizedExperiment with 1788 rows and 126 columns
[7] methylation_hm450_normals: SummarizedExperiment with 16556 rows and 32 columns
[8] methylation_hm450: SummarizedExperiment with 16556 rows and 460 columns
[9] mRNA_median_all_sample_Zscores: SummarizedExperiment with 17814 rows and 32 columns
[10] mRNA_median_Zscores: SummarizedExperiment with 16617 rows and 32 columns
[11] mutations_extended: RaggedExperiment with 72541 rows and 230 columns
[12] mutations_mskcc: RaggedExperiment with 72541 rows and 230 columns
[13] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 517 columns
[14] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 517 columns
[15] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 517 columns
[16] rppa_Zscores: SummarizedExperiment with 222 rows and 365 columns
[17] rppa: SummarizedExperiment with 223 rows and 365 columns
colData: clinical and sample information, one row per patient
> colData(luad)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
PATIENT_ID SAMPLE_ID OTHER_SAMPLE_ID SPECIMEN_CURRENT_WEIGHT DAYS_TO_COLLECTION
<character> <character> <character> <character> <character>
TCGA-05-4244 TCGA-05-4244 TCGA-05-4244-01 bac0b02d-ac3b-4784-b.. [Not Available] [Not Available]
TCGA-05-4249 TCGA-05-4249 TCGA-05-4249-01 80f196fe-1eaf-40cb-a.. [Not Available] [Not Available]
TCGA-05-4250 TCGA-05-4250 TCGA-05-4250-01 8f274178-7a8e-46b6-8.. [Not Available] [Not Available]
TCGA-05-4382 TCGA-05-4382 TCGA-05-4382-01 cce6d71f-369e-467f-b.. [Not Available] [Not Available]
sampleMap: the mapping between samples and the assay data
> sampleMap(luad)
DataFrame with 5029 rows and 3 columns
assay primary colname
<factor> <character> <character>
1 cna_hg19.seg TCGA-05-4244 TCGA-05-4244-01
2 cna_hg19.seg TCGA-05-4249 TCGA-05-4249-01
3 cna_hg19.seg TCGA-05-4250 TCGA-05-4250-01
4 cna_hg19.seg TCGA-05-4382 TCGA-05-4382-01
5 cna_hg19.seg TCGA-05-4384 TCGA-05-4384-01
... ... ... ...
5025 rppa TCGA-NJ-A55O TCGA-NJ-A55O-01
5026 rppa TCGA-NJ-A55R TCGA-NJ-A55R-01
5027 rppa TCGA-NJ-A7XG TCGA-NJ-A7XG-01
5028 rppa TCGA-O1-A52J TCGA-O1-A52J-01
5029 rppa TCGA-S2-AA1A TCGA-S2-AA1A-01
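The role of sampleMap is easier to see in a small model. The sketch below is not how MultiAssayExperiment is implemented; it is a plain-Python illustration, using a few rows copied from the printout above, of how the (assay, primary, colname) triples let us find which assay columns belong to which patient. `columns_by_patient` is a hypothetical helper name.

```python
from collections import defaultdict

# Rows copied from the sampleMap printout above: (assay, primary, colname).
SAMPLE_MAP = [
    ("cna_hg19.seg", "TCGA-05-4244", "TCGA-05-4244-01"),
    ("cna_hg19.seg", "TCGA-05-4249", "TCGA-05-4249-01"),
    ("rppa", "TCGA-NJ-A55O", "TCGA-NJ-A55O-01"),
    ("rppa", "TCGA-NJ-A55R", "TCGA-NJ-A55R-01"),
]


def columns_by_patient(sample_map):
    """Group assay column names by patient: {primary: {assay: [colname, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for assay, primary, colname in sample_map:
        grouped[primary][assay].append(colname)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {primary: dict(assays) for primary, assays in grouped.items()}


lookup = columns_by_patient(SAMPLE_MAP)
print(lookup["TCGA-05-4244"])
```

Here `primary` matches a row name in colData and `colname` matches a column name in one assay, which is what ties the clinical table to each data matrix.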
How do we get at the assay data? There are several ways. The assays function returns a list of all the assay data:
> exp.all <- assays(luad)
> names(exp.all)
[1] "cna_hg19.seg" "CNA"
[3] "expression_median" "linear_CNA"
[5] "methylation_hm27_normals" "methylation_hm27"
[7] "methylation_hm450_normals" "methylation_hm450"
[9] "mRNA_median_all_sample_Zscores" "mRNA_median_Zscores"
[11] "mutations_extended" "mutations_mskcc"
[13] "RNA_Seq_v2_expression_median" "RNA_Seq_v2_mRNA_median_all_sample_Zscores"
[15] "RNA_Seq_v2_mRNA_median_Zscores" "rppa_Zscores"
[17] "rppa"
Get the copy number (CNA) data:
> exp.all$CNA[1:5, 1:5]
TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3 -1 -1 -1 0 0
ACTRT2 -1 -1 -1 0 0
AGRN -1 -1 -1 0 0
ANKRD65 -1 -1 -1 0 0
ATAD3A -1 -1 -1 0 0
Alternatively, call the assay function with the assay name:
> assay(luad, "CNA")[1:5, 1:5]
TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3 -1 -1 -1 0 0
ACTRT2 -1 -1 -1 0 0
AGRN -1 -1 -1 0 0
ANKRD65 -1 -1 -1 0 0
ATAD3A -1 -1 -1 0 0
We will not go into further detail here; the focus is on downloading and parsing the data.
2. API
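So far we have downloaded and parsed all the data for a single study. Sometimes, though, we care less about one study and more about how one type of data behaves across cancers.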
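For example, to summarize CNA alterations across all TCGA studies, downloading every study in full and then extracting the CNA data would be cumbersome and slow. Instead, we use the REST API to fetch only the data type we need.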
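As a rough sketch of the first step in such a pan-cancer workflow, the snippet below picks out TCGA studies from a list of study IDs and builds one per-study request URL for each. Two things here are assumptions to check against the API document: that TCGA study IDs contain the substring `_tcga`, and that `/studies/{studyId}/molecular-profiles` is the path that lists a study's molecular profiles. Only `luad_tcga` comes from the text; the other ID is made up. No request is sent.

```python
from urllib.parse import quote

BASE_URL = "https://www.cbioportal.org/api"

# "luad_tcga" comes from the text; the second ID is made up for illustration.
STUDY_IDS = ["luad_tcga", "example_other_study"]


def is_tcga(study_id):
    """Assumption: TCGA study IDs contain the substring '_tcga'."""
    return "_tcga" in study_id


def molecular_profiles_url(study_id):
    """Build the (assumed) URL that lists the molecular profiles of one study."""
    return f"{BASE_URL}/studies/{quote(study_id, safe='')}/molecular-profiles"


# Keep only TCGA studies and prepare one URL per study (nothing is fetched here).
tcga_urls = [molecular_profiles_url(s) for s in STUDY_IDS if is_tcga(s)]
for url in tcga_urls:
    print(url)
```

From the profiles of each study one would then select the CNA profile and request only that data, rather than downloading the whole study.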
First, initialize the API:
cbio <- cBioPortal()
> class(cbio)
[1] "cBioPortal"
attr(,"package")
[1] "cBioPortalData"
Use getStudies to retrieve all studies, including their IDs:
studies <- getStudies(cbio)
> head(studies)
# A tibble: 6 × 12
name description publicStudy pmid citation groups status importDate allSampleCount studyId cancerTypeId
<chr> <chr> <lgl> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr>
1 Pan-Lun… "Whole-exome s… TRUE 27158… TCGA, N… "" 0 2021-04-0… 1144 nsclc_… nsclc
2 Head an… "TCGA Head and… TRUE NA NA "PUBL… 0 2021-04-2… 530 hnsc_t… hnsc
3 Breast … "Whole-exome s… TRUE 26451… TCGA, C… "PUBL… 0 2021-04-2… 817 brca_t… brca
4 Ovarian… "Whole exome s… TRUE 21720… TCGA, N… "PUBL… 0 2021-04-2… 489 ov_tcg… hgsoc
5 Uterine… "Whole exome s… TRUE 23636… TCGA, N… "PUBL… 0 2021-04-2… 373 ucec_t… ucec
6 Bladder… "Whole-exome s… TRUE 28988… Roberts… "PUBL… 0 2021-04-2… 413 blca_t… blca
# … with 1 more variable: referenceGenome <chr>
Alternatively, call the method on the cBioPortal object directly:
resp <- cbio$getAllStudiesUsingGET()
This method returns a response object, which we parse with the httr::content function:
parsedResponse <- httr::content(resp)
The result is a list, so we can count the studies:
> length(parsedResponse)
[1] 318
> dim(studies)
[1] 318 12
Count the cancer types and the total number of samples these studies cover:
> length(unique(studies$cancerTypeId))
[1] 94
> sum(studies$allSampleCount)
[1] 133449
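The same three summaries can be computed from the raw JSON in any language. The sketch below does it in Python on a tiny stand-in for the response body: `RESPONSE_TEXT` contains only the `cancerTypeId` and `allSampleCount` values of the six rows shown in the table above, not the full response, so the totals differ from the real ones.

```python
import json

# Abbreviated stand-in for the JSON body returned by the studies endpoint.
# Values are copied from the first six rows of the table above; all other
# fields are omitted.
RESPONSE_TEXT = """
[
  {"cancerTypeId": "nsclc", "allSampleCount": 1144},
  {"cancerTypeId": "hnsc",  "allSampleCount": 530},
  {"cancerTypeId": "brca",  "allSampleCount": 817},
  {"cancerTypeId": "hgsoc", "allSampleCount": 489},
  {"cancerTypeId": "ucec",  "allSampleCount": 373},
  {"cancerTypeId": "blca",  "allSampleCount": 413}
]
"""

studies = json.loads(RESPONSE_TEXT)

n_studies = len(studies)                                    # number of studies
n_cancer_types = len({s["cancerTypeId"] for s in studies})  # distinct cancer types
total_samples = sum(s["allSampleCount"] for s in studies)   # samples over all studies

print(n_studies, n_cancer_types, total_samples)  # 6 6 3766
```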
You may wonder how I knew about the getStudies function. We can use ls to list all the functions the package exports:
> ls("package:cBioPortalData")
[1] "allSamples" "cBioCache" "cBioDataPack" "cBioPortal"
[5] "cBioPortalData" "clinicalData" "downloadStudy" "genePanelMolecular"
[9] "genePanels" "geneTable" "getDataByGenePanel" "getGenePanel"
[13] "getGenePanelMolecular" "getSampleInfo" "getStudies" "loadStudy"
[17] "molecularData" "molecularProfiles" "mutationData" "removeDataCache"
[21] "removePackCache" "sampleLists" "samplesInSampleLists" "searchOps"
[25] "setCache" "studiesTable" "untarStudy"
For example, to get the sampleListId values for a study:
> sampleLists(cbio, studyId = "luad_tcga")
# A tibble: 11 × 5
category name description sampleListId studyId
<chr> <chr> <chr> <chr> <chr>
1 all_cases_with_methylation_data Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
2 all_cases_with_mutation_data Samples with mu… Samples with mutation d… luad_tcga_seque… luad_t…
3 all_cases_with_rppa_data Samples with pr… Samples protein data (R… luad_tcga_rppa luad_t…
4 all_cases_with_methylation_data Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
5 all_cases_with_mutation_and_cna_data Samples with mu… Samples with mutation a… luad_tcga_cnaseq luad_t…
6 all_cases_with_mrna_array_data Samples with mR… Samples with mRNA expre… luad_tcga_mrna luad_t…
7 all_cases_with_methylation_data Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
8 all_cases_in_study All samples All samples (586 sample… luad_tcga_all luad_t…
9 all_cases_with_cna_data Samples with CN…