Using the cBioPortal Database API

Latest recommended article published on 2024-08-21 08:44:04

名本无名


Article link: https://blog.csdn.net/dxs18459111694/article/details/139080439


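Preface

cBioPortal provides many interactive charts for displaying and exploring genomic information across multi-omics cancer datasets. For deeper analysis, however, we need to download the data ourselves and run a custom analysis.

cBioPortal has no one-click feature for bulk-downloading all datasets; you have to select each study and download it step by step. When the analysis spans several cancer types, or when we want to do a pan-cancer analysis, this approach is tedious.

For this reason cBioPortal offers a REST API, so we can bulk-download data from code. The API is described with the Swagger/OpenAPI specification, which is designed for describing REST APIs: the description can be read by developers as documentation, and tools can use it to generate client code automatically.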

API documentation: https://www.cbioportal.org/api/api-docs

The document is in JSON format and describes each operation, its parameters, and how to use them.
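To make "a JSON document that describes operations and parameters" concrete, here is a minimal Python sketch that walks a Swagger-style document and lists its operations. The embedded fragment is a hand-written, heavily trimmed stand-in, not the real cBioPortal document; the second operation name and both paths are illustrative assumptions.

```python
import json

# Hypothetical, trimmed fragment in the style of a Swagger 2.0 document.
# The real document is far larger; paths and the second operationId are
# illustrative only.
SPEC_TEXT = """
{
  "swagger": "2.0",
  "paths": {
    "/studies": {
      "get": {
        "operationId": "getAllStudiesUsingGET",
        "parameters": [
          {"name": "projection", "in": "query", "required": false, "type": "string"}
        ]
      }
    },
    "/studies/{studyId}/sample-lists": {
      "get": {
        "operationId": "getAllSampleListsInStudyUsingGET",
        "parameters": [
          {"name": "studyId", "in": "path", "required": true, "type": "string"}
        ]
      }
    }
  }
}
"""


def list_operations(spec):
    """Return {operationId: (HTTP method, path, [parameter names])}."""
    ops = {}
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            params = [p["name"] for p in op.get("parameters", [])]
            ops[op["operationId"]] = (method.upper(), path, params)
    return ops


operations = list_operations(json.loads(SPEC_TEXT))
for op_id, (method, path, params) in sorted(operations.items()):
    print(op_id, method, path, params)
```

Client generators such as the packages listed below do essentially this lookup for you and expose each operationId as a callable.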

There are several ways to use this API, including from R and Python:

R

cBioPortalData: the recommended package
rapiclient: also works; it parses the API document

library(rapiclient)
client <- get_api(url = "https://www.cbioportal.org/api/api-docs")

CGDSR: not recommended; it is being deprecated

Python

bravado

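from bravado.client import SwaggerClient

# Build a client from the API document; request/response validation is
# switched off so that schema mismatches do not raise errors.
cbioportal = SwaggerClient.from_url(
    'https://www.cbioportal.org/api/api-docs',
    config={
        "validate_requests": False,
        "validate_responses": False
    }
)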

We will cover the cBioPortalData and bravado packages in turn. If you use another programming language, any package that can parse an OpenAPI-format API document will do.

cBioPortalData

Installation and loading

BiocManager::install("cBioPortalData")

library(cBioPortalData)

1. Data structure

cBioPortalData stores each study as a MultiAssayExperiment object (an S4 class)

For example, we use the cBioDataPack function to download and parse all the data in luad_tcga

luad <- cBioDataPack(cancer_study_id = "luad_tcga", ask = FALSE)

Here luad is a MultiAssayExperiment object

> class(luad)
[1] "MultiAssayExperiment"
attr(,"package")
[1] "MultiAssayExperiment"

We can list the slots the object contains

> slotNames(luad)
[1] "ExperimentList" "colData"        "sampleMap"      "drops"          "metadata"

The most important ones are

ExperimentList: holds all the assay data, such as mutations, copy number, and expression

> experiments(luad)
ExperimentList class object of length 17:
 [1] cna_hg19.seg: RaggedExperiment with 81799 rows and 518 columns
 [2] CNA: SummarizedExperiment with 24776 rows and 516 columns
 [3] expression_median: SummarizedExperiment with 17814 rows and 32 columns
 [4] linear_CNA: SummarizedExperiment with 24776 rows and 516 columns
 [5] methylation_hm27_normals: SummarizedExperiment with 1788 rows and 24 columns
 [6] methylation_hm27: SummarizedExperiment with 1788 rows and 126 columns
 [7] methylation_hm450_normals: SummarizedExperiment with 16556 rows and 32 columns
 [8] methylation_hm450: SummarizedExperiment with 16556 rows and 460 columns
 [9] mRNA_median_all_sample_Zscores: SummarizedExperiment with 17814 rows and 32 columns
 [10] mRNA_median_Zscores: SummarizedExperiment with 16617 rows and 32 columns
 [11] mutations_extended: RaggedExperiment with 72541 rows and 230 columns
 [12] mutations_mskcc: RaggedExperiment with 72541 rows and 230 columns
 [13] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 517 columns
 [14] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 517 columns
 [15] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 517 columns
 [16] rppa_Zscores: SummarizedExperiment with 222 rows and 365 columns
 [17] rppa: SummarizedExperiment with 223 rows and 365 columns

colData: sample information (one row per patient)

> colData(luad)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
               PATIENT_ID       SAMPLE_ID        OTHER_SAMPLE_ID SPECIMEN_CURRENT_WEIGHT DAYS_TO_COLLECTION
              <character>     <character>            <character>             <character>        <character>
TCGA-05-4244 TCGA-05-4244 TCGA-05-4244-01 bac0b02d-ac3b-4784-b..         [Not Available]    [Not Available]
TCGA-05-4249 TCGA-05-4249 TCGA-05-4249-01 80f196fe-1eaf-40cb-a..         [Not Available]    [Not Available]
TCGA-05-4250 TCGA-05-4250 TCGA-05-4250-01 8f274178-7a8e-46b6-8..         [Not Available]    [Not Available]
TCGA-05-4382 TCGA-05-4382 TCGA-05-4382-01 cce6d71f-369e-467f-b..         [Not Available]    [Not Available]

sampleMap: the mapping between samples and the experiment data

> sampleMap(luad)
DataFrame with 5029 rows and 3 columns
            assay      primary         colname
         <factor>  <character>     <character>
1    cna_hg19.seg TCGA-05-4244 TCGA-05-4244-01
2    cna_hg19.seg TCGA-05-4249 TCGA-05-4249-01
3    cna_hg19.seg TCGA-05-4250 TCGA-05-4250-01
4    cna_hg19.seg TCGA-05-4382 TCGA-05-4382-01
5    cna_hg19.seg TCGA-05-4384 TCGA-05-4384-01
...           ...          ...             ...
5025         rppa TCGA-NJ-A55O TCGA-NJ-A55O-01
5026         rppa TCGA-NJ-A55R TCGA-NJ-A55R-01
5027         rppa TCGA-NJ-A7XG TCGA-NJ-A7XG-01
5028         rppa TCGA-O1-A52J TCGA-O1-A52J-01
5029         rppa TCGA-S2-AA1A TCGA-S2-AA1A-01
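How the three slots fit together can be sketched in a few lines of Python. This is only a toy model using plain dicts and lists, not how the R object is stored; the IDs are copied from the output above, and the helper names columns_for and assays_for are made up for illustration.

```python
# Toy stand-ins for colData and sampleMap. The IDs come from the output
# above, but this is a tiny subset and the layout is only illustrative.
col_data = {
    "TCGA-05-4244": {"SAMPLE_ID": "TCGA-05-4244-01"},
    "TCGA-05-4249": {"SAMPLE_ID": "TCGA-05-4249-01"},
}

sample_map = [
    {"assay": "cna_hg19.seg", "primary": "TCGA-05-4244", "colname": "TCGA-05-4244-01"},
    {"assay": "cna_hg19.seg", "primary": "TCGA-05-4249", "colname": "TCGA-05-4249-01"},
    {"assay": "rppa", "primary": "TCGA-NJ-A55O", "colname": "TCGA-NJ-A55O-01"},
]


def columns_for(primary, assay):
    """Column names in `assay` that belong to patient `primary`."""
    return [
        row["colname"]
        for row in sample_map
        if row["primary"] == primary and row["assay"] == assay
    ]


def assays_for(primary):
    """Sorted names of the assays that contain data for patient `primary`."""
    return sorted({row["assay"] for row in sample_map if row["primary"] == primary})


print(columns_for("TCGA-05-4244", "cna_hg19.seg"))
print(assays_for("TCGA-NJ-A55O"))
```

The point is that sampleMap is the join table: primary links to a row of colData, and assay plus colname identify a column in one experiment.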

How do we get at the experiment data? There are several ways; assays returns a list of all the experiment data

> exp.all <- assays(luad)
> names(exp.all)
 [1] "cna_hg19.seg"                              "CNA"                                      
 [3] "expression_median"                         "linear_CNA"                               
 [5] "methylation_hm27_normals"                  "methylation_hm27"                         
 [7] "methylation_hm450_normals"                 "methylation_hm450"                        
 [9] "mRNA_median_all_sample_Zscores"            "mRNA_median_Zscores"                      
[11] "mutations_extended"                        "mutations_mskcc"                          
[13] "RNA_Seq_v2_expression_median"              "RNA_Seq_v2_mRNA_median_all_sample_Zscores"
[15] "RNA_Seq_v2_mRNA_median_Zscores"            "rppa_Zscores"                             
[17] "rppa"

Get the copy-number (CNA) data

> exp.all$CNA[1:5, 1:5]
        TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3                -1              -1              -1               0               0
ACTRT2               -1              -1              -1               0               0
AGRN                 -1              -1              -1               0               0
ANKRD65              -1              -1              -1               0               0
ATAD3A               -1              -1              -1               0               0

Alternatively, use the assay function and pass the experiment name

> assay(luad, "CNA")[1:5, 1:5]
        TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3                -1              -1              -1               0               0
ACTRT2               -1              -1              -1               0               0
AGRN                 -1              -1              -1               0               0
ANKRD65              -1              -1              -1               0               0
ATAD3A               -1              -1              -1               0               0

We won't go into more detail here; our focus is on downloading and parsing the data

2. API

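So far we have downloaded and parsed the data of a single study. Sometimes, though, we are not interested in one study but in how a particular data type behaves across cancers.

For example, suppose we want to summarize CNA alterations across all TCGA studies. Downloading every study in full and then extracting the CNA data from each would be cumbersome and slow, so we use the REST API to fetch only the data type we need.

First, initialize the API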

cbio <- cBioPortal()

> class(cbio)
[1] "cBioPortal"
attr(,"package")
[1] "cBioPortalData"

Use getStudies to list all studies, including their IDs

studies <- getStudies(cbio)

> head(studies)
# A tibble: 6 × 12
  name     description     publicStudy pmid   citation groups status importDate allSampleCount studyId cancerTypeId
  <chr>    <chr>           <lgl>       <chr>  <chr>    <chr>   <int> <chr>               <int> <chr>   <chr>       
1 Pan-Lun… "Whole-exome s… TRUE        27158… TCGA, N… ""          0 2021-04-0…           1144 nsclc_… nsclc       
2 Head an… "TCGA Head and… TRUE        NA     NA       "PUBL…      0 2021-04-2…            530 hnsc_t… hnsc        
3 Breast … "Whole-exome s… TRUE        26451… TCGA, C… "PUBL…      0 2021-04-2…            817 brca_t… brca        
4 Ovarian… "Whole exome s… TRUE        21720… TCGA, N… "PUBL…      0 2021-04-2…            489 ov_tcg… hgsoc       
5 Uterine… "Whole exome s… TRUE        23636… TCGA, N… "PUBL…      0 2021-04-2…            373 ucec_t… ucec        
6 Bladder… "Whole-exome s… TRUE        28988… Roberts… "PUBL…      0 2021-04-2…            413 blca_t… blca        
# … with 1 more variable: referenceGenome <chr>

Alternatively, call a method on the cBioPortal object

resp <- cbio$getAllStudiesUsingGET()

This method returns a response object, which we parse with the httr::content function

parsedResponse <- httr::content(resp)

The result is a list, so we can count the number of studies

> length(parsedResponse)
[1] 318
> dim(studies)
[1] 318  12
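For readers outside R, the same request and parse step can be sketched with the Python standard library alone. This assumes the studies endpoint lives at /studies under https://www.cbioportal.org/api and accepts a projection query parameter; the function names are made up for this sketch, and the sample body is hand-written with illustrative values rather than captured from the server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public API.
BASE_URL = "https://www.cbioportal.org/api"


def studies_url(projection="SUMMARY"):
    """Build the URL for the list-all-studies request."""
    return f"{BASE_URL}/studies?{urlencode({'projection': projection})}"


def parse_studies(body):
    """Decode a JSON array of study objects into a list of dicts."""
    studies = json.loads(body)
    if not isinstance(studies, list):
        raise ValueError("expected a JSON array of studies")
    return studies


def fetch_studies(projection="SUMMARY", timeout=30):
    """Perform the request. Needs network access; not called in this sketch."""
    with urlopen(studies_url(projection), timeout=timeout) as resp:
        return parse_studies(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written body shaped like the JSON reply.
# The field values are illustrative.
sample_body = '[{"studyId": "luad_tcga", "cancerTypeId": "luad", "allSampleCount": 586}]'
demo = parse_studies(sample_body)
print(studies_url())
print(len(demo), demo[0]["studyId"])
```

Calling fetch_studies() would play the role of getAllStudiesUsingGET followed by httr::content: one HTTP GET, then JSON decoding into a list.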

Count the cancer types and the total number of samples covered by these studies

> length(unique(studies$cancerTypeId))
[1] 94
> sum(studies$allSampleCount)
[1] 133449
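The same two statistics are easy to compute from parsed JSON records. The sketch below uses only the six rows visible in the head(studies) output earlier, so its totals are for that subset and not for all 318 studies; the summarize helper is a made-up name.

```python
# cancerTypeId and allSampleCount of the six studies shown by head(studies).
records = [
    {"cancerTypeId": "nsclc", "allSampleCount": 1144},
    {"cancerTypeId": "hnsc", "allSampleCount": 530},
    {"cancerTypeId": "brca", "allSampleCount": 817},
    {"cancerTypeId": "hgsoc", "allSampleCount": 489},
    {"cancerTypeId": "ucec", "allSampleCount": 373},
    {"cancerTypeId": "blca", "allSampleCount": 413},
]


def summarize(studies):
    """Return (number of distinct cancer types, total sample count)."""
    cancer_types = {s["cancerTypeId"] for s in studies}
    total_samples = sum(s["allSampleCount"] for s in studies)
    return len(cancer_types), total_samples


n_types, n_samples = summarize(records)
print(n_types, n_samples)  # 6 3766
```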

You may wonder how I knew about the getStudies function. We can use ls to list all the functions the package exports

> ls("package:cBioPortalData")
 [1] "allSamples"            "cBioCache"             "cBioDataPack"          "cBioPortal"           
 [5] "cBioPortalData"        "clinicalData"          "downloadStudy"         "genePanelMolecular"   
 [9] "genePanels"            "geneTable"             "getDataByGenePanel"    "getGenePanel"         
[13] "getGenePanelMolecular" "getSampleInfo"         "getStudies"            "loadStudy"            
[17] "molecularData"         "molecularProfiles"     "mutationData"          "removeDataCache"      
[21] "removePackCache"       "sampleLists"           "samplesInSampleLists"  "searchOps"            
[25] "setCache"              "studiesTable"          "untarStudy"

For example, to get the sampleListId values

> sampleLists(cbio, studyId = "luad_tcga")
# A tibble: 11 × 5
   category                                      name             description              sampleListId     studyId
   <chr>                                         <chr>            <chr>                    <chr>            <chr>  
 1 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 2 all_cases_with_mutation_data                  Samples with mu… Samples with mutation d… luad_tcga_seque… luad_t…
 3 all_cases_with_rppa_data                      Samples with pr… Samples protein data (R… luad_tcga_rppa   luad_t…
 4 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 5 all_cases_with_mutation_and_cna_data          Samples with mu… Samples with mutation a… luad_tcga_cnaseq luad_t…
 6 all_cases_with_mrna_array_data                Samples with mR… Samples with mRNA expre… luad_tcga_mrna   luad_t…
 7 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 8 all_cases_in_study                            All samples      All samples (586 sample… luad_tcga_all    luad_t…
 9 all_cases_with_cna_data                       Samples with CN…