TCGAbiolinks的使用


更新历史

created:2019

当初使用TCGAbiolinks由官方文档翻译过来,目前仍有相关专业初学者在使用此工具故此更新部分内容,注意查阅最新官方文档;

updated:2022

目前此工具本人已不再使用,补充下当时写的一段代码,内有本人注释,方便初学者直观理解;

介绍

一、官方文档介绍

TCGAbiolinks is able to access The National Cancer Institute (NCI) Genomic Data Commons (GDC) thorough itsGDC Application Programming Interface (API) to search, download and prepare relevant data for analysis in R.

二、安装

可以从bioconductor上下载安装稳定版或开发版的TCGA插件。
1.版本安装
1.1 稳定版

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")

1.2 开发版(如果稳定版使用存在问题,尝试使用开发版)

devtools::install_github('BioinformaticsFMRP/TCGAbiolinks')

2.1 可使用Docker image
如果想使用docker image(一个极度精简版的Linux程序运行环境),也是适用于TCGA。
For more information please check: https://docs.docker.com/ and https://www.bioconductor.org/help/docker/

3.1 需要的库

根据实际情况导入库,例如:
library(TCGAbiolinks)
library(dplyr)
library(DT)

4.1 Session info

version
packageVersion("TCGAbiolinks")

三、GDC clinical data的相关介绍

在GDC数据库里,可以从不同的源来获得clinical data。
1)Indexed clinical:从XML文件里提炼出来的clinical data(提炼指indexed数据是更新过的数据);
2)XML:原数据(包含各种信息如radiation, drugs information, follow-ups, biospecimen, etc. );
3)BCR Biotab:从XML里面解析出来的tsv文件(格式化过的数据);

1.1 从BCR Biotab里面抓取数据

query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
names(clinical.BCRtab.all)

query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", `在这里插入代码片`
                  data.format = "BCR Biotab",
                   file.type = "radiation")
GDCdownload(query)
clinical.BCRtab.radiation <- GDCprepare(query)
clinical.BCRtab.all$clinical_drug_acc  %>% 
  head  %>% 
  datatable(options = list(scrollX = TRUE, keys = TRUE))

1.2 从indexed data抓取数据

clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
clinical %>%
  datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

1.3从XML clinical data中抓取数据
1)使用GDCqueryGDCDownload 函数来查找和下载biospecimen 或 clinical XML files ;
GDCquery参数:

?projectA list of valid project (see table below)
data.categoryA valid project (see list with TCGAbiolinks:::getProjectSummary(project))
data.typeA data type to filter the files to download
workflow.typeGDC workflow type
legacySearch in the legacy repository
accessFilter by access type. Possible values: controlled, open
platformExample:
CGH- 1x1M_G4447AIlluminaGA_RNASeqV2
AgilentG4502A_07IlluminaGA_mRNA_DGE
Human1MDuoHumanMethylation450
HG-CGH-415K_G4124AIlluminaGA_miRNASeq
HumanHap550IlluminaHiSeq_miRNASeq
ABIH-miRNA_8x15K
HG-CGH-244ASOLiD_DNASeq
IlluminaDNAMethylation_OMA003_CPIIlluminaGA_DNASeq_automated
IlluminaDNAMethylation_OMA002_CPIHG-U133_Plus_2
HuEx- 1_0-st-v2Mixed_DNASeq
H-miRNA_8x15Kv2IlluminaGA_DNASeq_curated
MDA_RPPA_CoreIlluminaHiSeq_TotalRNASeqV2
HT_HG-U133AIlluminaHiSeq_DNASeq_automated
diagnostic_imagesmicrosat_i
IlluminaHiSeq_RNASeqSOLiD_DNASeq_curated
IlluminaHiSeq_DNASeqCMixed_DNASeq_curated
IlluminaGA_RNASeqIlluminaGA_DNASeq_Cont_automated
IlluminaGA_DNASeqIlluminaHiSeq_WGBS
pathology_reportsIlluminaHiSeq_DNASeq_Cont_automated
Genome_Wide_SNP_6bio
tissue_imagesMixed_DNASeq_automated
HumanMethylation27Mixed_DNASeq_Cont_curated
IlluminaHiSeq_RNASeqV2Mixed_DNASeq_Cont
file.typeTo be used in the legacy database for some platforms, to define which file types to be used.
barcodeA list of barcodes to filter the files to download
experimental.strategyFilter to experimental stratey. Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array. Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, WXS,CGH array, VALIDATION, Gene expression array,WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
sample.typeA sample type to filter the files to download

2)使用 GDCprepare_clinic 函数来解析 XML files;
3)通过参数clinical.info来解析 病例个人数据表中的某一个类别的信息。

data.categoryclinical.info
Clinicaldrug
Clinicaladmin
Clinicalfollow_up
Clinicalradiation
Clinicalpatient
Clinicalstage_event
Clinicalnew_tumor_event
Biospecimensample
Biospecimenbio_patient
Biospecimenanalyte
Biospecimenaliquot
Biospecimenprotocol
Biospecimenportion
Biospecimenslide
Othermsi

例如下面列举的抓取几种数据类别:

query <- GDCquery(project = "TCGA-COAD", 
                  data.category = "Clinical", 
                  file.type = "xml", 
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
clinical %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
clinical.drug %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
clinical.radiation %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")
clinical.admin %>% 
datatable(filter = 'top', 
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),  
          rownames = FALSE)

1.4 Microsatellite data
1.5 Tissue slide image (SVS format)
1.6 Diagnostic Slide (SVS format)
1.7 Pathology report (PDF)
1.8 Tissue slide image (SVS format)
1.9 Clinical Supplement (XML format)
1.10 Clinical data (Biotab format)

1.11 Filter functions

1)TCGAquery_SampleTypes
The function TCGAquery_SampleTypes will filter barcodes based on a type the argument typesample;

ArgumentDescription
barcodeis a list of samples as TCGA barcodes
typesamplea character vector indicating tissue type to query. Example:
TPPRIMARY SOLID TUMOR
TRRECURRENT SOLID TUMOR
TBPrimary Blood Derived Cancer-Peripheral Blood
TRBMRecurrent Blood Derived Cancer-Bone Marrow
TAPAdditional-New Primary
TMMetastatic
TAMAdditional Metastatic
THOCHuman Tumor Original Cells
TBMPrimary Blood Derived Cancer-Bone Marrow
NBBlood Derived Normal
NTSolid Tissue Normal
NBCBuccal Cell Normal
NEBVEBV Immortalized Normal

例子如下:

bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")

S <- TCGAquery_SampleTypes(bar,"TP")
S2 <- TCGAquery_SampleTypes(bar,"NB")

# Retrieve multiple tissue types  NOT FROM THE SAME PATIENTS
SS <- TCGAquery_SampleTypes(bar,c("TP","NB"))

# Retrieve multiple tissue types  FROM THE SAME PATIENTS
SSS <- TCGAquery_MatchedCoupledSampleTypes(bar,c("NT","TP"))

四、如何使用TCGAbiolinks下载和预处理处理数据以便于分析

1. 下载方式:有两种方式从GDC下载
a. 客户端:通过创建MANIFEST文件建立任务,稳定但速度较api慢;

[GDC Data Transfer Tool]
(https://gdc.cancer.gov/access-data/gdc-data-transfer-tool)

b. API:通过创建MANIFEST文件建立任务,数据下载后打包成tar.gz ,如果数据过大容易出现打包错误中断需要重新执行,可以使用参数 files.per.chunk 将大文件拆分成块。

[GDC Application Programming Interface (API)]
(https://gdc.cancer.gov/developers/gdc-application-programming-interface-api)

2. 数据预处理

2.1 通过SummarizedExperiment对象可以获得三个主要矩阵;

1) Sample matrix:via colData(data): stores sample information

2) Assay matrix : via assay(data): stores molecular data

3) Feature matrix :via rowRanges(data): stores metadata about the features, including their genomic ranges

2.2 参数
1)GDCdownload

| Argument |: Description   
| query |:GDCquery 查询函数;
| token.file | :文件标记后下载数据(仅适用于 method = "client") ;
| method | :选择下载方式: "api"或"client";
| directory | :指定数据下载路径 目录/文件夹, 默认路径: GDCdata ;
| files.per.chunk | :使用API方式下载时设定同时下载的数量(例如 files.per.chunk = 6) ,会减少下载数据量过大时出现问题的可能;

2)GDCprepare

| Argument | :Description 
| query |:GDCquery 查询函数;
| save | :是否作为RData对象保存;
| save.filename | :指定数据保存的文件名,默认则自动创建;
| directory | :指定数据下载路径 目录/文件夹, 默认路径: GDCdata ;
| summarizedExperiment | :是否创建一个summarizedExperiment对象,默认 TRUE (if possible) ;
| remove.files.prepared | :删除读取的文件,默认: FALSE (这个参数在| save | 参数设为TRUE时使用);
| add.gistic2.mut |: If a list of genes (gene symbol) is given, columns with gistic2 results from GDAC firehose (hg19) and a column indicating if there is or not mutation in that gene (hg38) (TRUE or FALSE - use the MAF file for more information) will be added to the sample matrix in the summarized Experiment object. 
| mut.pipeline |: If add.gistic2.mut is not NULL this field will be taken in consideration. Four separate variant calling pipelines are implemented for GDC data harmonization. Options: muse, varscan2, somaticsniper, MuTect2. For more information: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/ ;
| mutant_variant_classification | :List of mutant_variant_classification that will be consider a sample mutant or not. Default: "Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation", "Nonsense_Mutation", "Splice_Site", "In_Frame_Del", "In_Frame_Ins", "Translation_Start_Site", "Nonstop_Mutation" ;

2.3 使用GDC api从legacy database遗产数据库中查找下载

例如从 harmonized database (data aligned against genome of reference hg38)下载gene expression quantification:

query <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Gene expression quantification",
                           platform = "Illumina HiSeq", 
                           file.type  = "normalized_results",
                           experimental.strategy = "RNA-Seq",
                           barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                           legacy = TRUE)
GDCdownload(query, method = "api", files.per.chunk = 10)
data <- GDCprepare(query)
data <- gbm.exp.legacy
# Gene expression aligned against hg19.
datatable(as.data.frame(colData(data)), 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = FALSE)
# Only first 100 to make render faster
datatable(assay(data)[1:100,], 
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
              rownames = TRUE)

rowRanges(data)

2.4 GDCprepare: Outputs
这个函数正在开发中;
在这里插入图片描述
1)Harmonized data

Data.categoryData.typeWorkflow TypeStatus
Transcriptome ProfilingGene Expression QuantificationHTSeq - CountsData frame or SE (losing 5% of information when mapping to genomic regions)
HTSeq - FPKM-UQReturning only a (losing 5% of information when mapping to genomic regions)
HTSeq - FPKMReturning only a (losing 5% of information when mapping to genomic regions)
Isoform Expression QuantificationNot needed
miRNA Expression QuantificationNot neededReturning only a dataframe for the moment
Copy number variationCopy Number SegmentReturning only a dataframe for the moment
Masked Copy Number SegmentReturning only a dataframe for the moment
Gene Level Copy Number ScoresReturning only a dataframe for the moment
Simple Nucleotide Variation
Raw Sequencing Data
BiospecimenSlide Image
BiospecimenBiospecimen Supplement
Clinical

2)Legacy data

Data.categoryData.typePlatformfile.typeStatus
Transcriptome Profiling
Copy number variation-Affymetrix SNP Array 6.0nocnv_hg18.segWorking
-Affymetrix SNP Array 6.0hg18.segWorking
-Affymetrix SNP Array 6.0nocnv_hg19.segWorking
-Affymetrix SNP Array 6.0hg19.segWorking
-Illumina HiSeqSeveralWorking
Simple Nucleotide VariationSimple somatic mutation
Raw Sequencing Data
Biospecimen
Clinical
Protein expressionMDA RPPA Core-Working
Gene expressionGene expression quantificationIllumina HiSeqnormalized_resultsWorking
Illumina HiSeqresultsWorking
HT_HG-U133A-Working
AgilentG4502A_07_2-Data frame only
AgilentG4502A_07_1-Data frame only
HuEx-1_0-st-v2FIRMA.txtNot Preparing
gene.txtNot Preparing
Isoform expression quantification
miRNA gene quantification
Exon junction quantification
Exon quantification
miRNA isoform quantification
DNA methylationIllumina Human Methylation 450Not usedWorking
Illumina Human Methylation 27Not usedWorking
Illumina DNA Methylation OMA003 CPINot usedWorking
Illumina DNA Methylation OMA002 CPINot usedWorking
Illumina Hi SeqNot working
Raw Microarray Data
Structural Rearrangement
Other

2.5 完整的例子(https://gist.github.com/tiagochst/a701bad3fa3800ade7063760755e0aad)

# ------------------------------------------------------------------
# Updating TCGAbiolinks to work with GDC data
# --------------------------------------------------------------------
# Install last version from the github (this is a development version)
devtools::install_github("BioinformaticsFMRP/TCGAbiolinks")
library(TCGAbiolinks)


####################### Working harmonized data ###########################
#                   Data.category: clinical and biospecimen
############################################################################
# Clinical information
# https://gdc.nci.nih.gov/about-data/data-harmonization-and-generation/clinical-data-harmonization
clin <- GDCquery_clinic("TCGA-ACC", type = "clinical", save.csv = TRUE)
clin <- GDCquery_clinic("TCGA-ACC", type = "biospecimen", save.csv = TRUE)

#-----------------------------------------------------------------------------
#                   Data.category: MAF files
#-----------------------------------------------------------------------------
mut <- GDCquery_Maf(tumor = "ACC")
clin <- GDCquery_clinic("TCGA-ACC","clinical")
clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")]
TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:20],
                        filename = "onco.pdf",
                        annotation = clin,
                        color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"),
                        rows.font.size=10,
                        heatmap.legend.side = "right",
                        dist.col = 0,
                        label.font.size = 10)


#-----------------------------------------------------------------------------
#                   Data.category: Copy number variation
#-----------------------------------------------------------------------------
query <- GDCquery(project = "TCGA-ACC",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)

query <- GDCquery("TCGA-ACC",
                  "Copy Number Variation",
                  data.type = "Masked Copy Number Segment",
                  sample.type = c("Primary solid Tumor")) # query$results[[1]]$cases
GDCdownload(query)
data <- GDCprepare(query)

#-----------------------------------------------------------------------------
#                   Data.category: Transcriptome Profiling
#-----------------------------------------------------------------------------
workflow.type <- c("HTSeq - Counts", "HTSeq - FPKM","HTSeq - FPKM-UQ")
for(i in workflow.type){
    print(i)
    query <- GDCquery(project = "TARGET-AML",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = i,
                      barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R"))
    GDCdownload(query)
    data <- GDCprepare(query)
}
#data.type <- c("miRNA Expression Quantification","Isoform Expression Quantification")
data.type <- c("miRNA Expression Quantification")
for(i in data.type){
    print(i)
    query <- GDCquery(project = "TARGET-AML",
                      data.category = "Transcriptome Profiling",
                      data.type = i,
                      workflow.type = "BCGSC miRNA Profiling",
                      barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R"))
    GDCdownload(query)
    data <- GDCprepare(query)
    print(head(data))
}


####################### Working with Legacy data ###########################
#                   Data.category: Copy number variation
############################################################################
query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "nocnv_hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)

query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "nocnv_hg18.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)


query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg19.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)

query <- GDCquery(project = "TCGA-LGG",
                  barcode = c("TCGA-HT-7476-10A-01D-2022-02", "TCGA-FG-6689-01A-11D-1891-02"),
                  data.category =  "Copy number variation", platform = "Illumina HiSeq", legacy = TRUE)
GDCdownload(query)
z <- GDCprepare(query)

query <- GDCquery(project = "TCGA-ACC",
                  data.category =  "Copy number variation",
                  legacy = TRUE,
                  file.type = "hg18.seg",
                  barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01", "TCGA-OR-A5LJ-10A-01D-A29K-01"))
GDCdownload(query)
z <- GDCprepare(query)
####################### Working with Legacy data ###########################
#                   Data.category: DNA methylation & Protein expression
############################################################################
# Function to get two samples to test the function
legacyPipeline <- function(project, data.category, platform){
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE)
    cases <- query$results[[1]]$cases[1:2]
    query <- GDCquery(project = project,
                      data.category = data.category,
                      platform = platform,
                      legacy = TRUE,
                      barcode = cases)
    GDCdownload(query)
    data <- GDCprepare(query)
    return(data)
}
# DNA methylation
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 27")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina Human Methylation 450")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA003 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")
data <- legacyPipeline("TCGA-GBM","DNA methylation","Illumina DNA Methylation OMA002 CPI")

# Protein expression
data <- legacyPipeline("TCGA-GBM","Protein expression","MDA_RPPA_Core")

五、如何使用TCGAbiolinks查找下载来可视化mutation data

1. 查找和下载
1.1 TCGAbiolinks 提供两种方式从GDC下载mutation data:
1)Use GDCquery_Maf which will download MAF (mutation annotation files) aligned against hg38;
2)Use GDCquery, GDCdownload and GDCpreprare to downoad MAF aligned against hg19;

1.2 Mutation data (hg38)

maf <- GDCquery_Maf("CHOL", pipelines = "muse")

maf <- chol_maf@data

# Only first 50 to make render faster
datatable(maf[1:20,],
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)

Pipelines options:muse, varscan2, somaticsniper, mutect.

1.3 Mutation data (hg38)

query.maf.hg19 <- GDCquery(project = "TCGA-CHOL", 
                           data.category = "Simple nucleotide variation", 
                           data.type = "Simple somatic mutation",
                           access = "open", 
                           legacy = TRUE)
                           
# Check maf availables
datatable(dplyr::select(getResults(query.maf.hg19),-contains("cases")),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 10), 
          rownames = FALSE)
          
query.maf.hg19 <- GDCquery(project = "TCGA-CHOL", 
                           data.category = "Simple nucleotide variation", 
                           data.type = "Simple somatic mutation",
                           access = "open", 
                           file.type = "bcgsc.ca_CHOL.IlluminaHiSeq_DNASeq.1.somatic.maf",
                           legacy = TRUE)
GDCdownload(query.maf.hg19)
maf <- GDCprepare(query.maf.hg19)

data <- bcgsc.ca_CHOL.IlluminaHiSeq_DNASeq.1.somatic.maf

# Only first 50 to make render faster
datatable(maf[1:20,],
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)

1.4 可视化数据
使用Bioconductor包maftools;

library(maftools)
library(dplyr)
maf <- GDCquery_Maf("CHOL", pipelines = "muse") %>% read.maf

library(maftools)
library(dplyr)
maf <- chol_maf

datatable(getSampleSummary(maf),
          filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5), 
          rownames = FALSE)
plotmafSummary(maf = maf, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE)

oncoplot(maf = maf, top = 10, removeNonMutated = TRUE)
titv = titv(maf = maf, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = titv)

Demo Code

R Code

#-----------------------------------------------------------------------------
#                  安装TCGAbiolinks
#-----------------------------------------------------------------------------
#安装稳定版biolinks
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(version = "3.10")
BiocManager::install("TCGAbiolinks")

#安装开发版biolinks
#devtools::install_github('BioinformaticsFMRP/TCGAbiolinks')

#-----------------------------------------------------------------------------
#                   手动安装示例区域(自己看着办)
#-----------------------------------------------------------------------------
#常规联网查找安装,
install.packages("example_name")
#通过bioconductor资源安装
BiocManager::install("example_name")
#to search through available packages programmatically
BiocManager::available("example_name")
#to flag packages that are either out-of-date or too new for your version of Bioconductor. 
BiocManager::valid()
#-----------------------------------------------------------------------------
#                   加载库
#-----------------------------------------------------------------------------
#biolinks使用
library(SummarizedExperiment)
library(TCGAbiolinks)

#分析绘图使用
BiocManager::install("edgeR")
install.packages("gplots")
BiocManager::install("ggplot2")
library(edgeR)
library(gplots)
library(ggplot2)
require(gplot2)
#-----------------------------------------------------------------------------
#                   创建工作目录
#-----------------------------------------------------------------------------
#创建新目录
dir.create("c:/TCGA");
dir.create("c:/TCGA/CESC");
#目录
work_dir <- "c:/TCGA/CESC"
#将当前目录指向工作目录,并返回显示
setwd(work_dir)
getwd()

#-----------------------------------------------------------------------------
#                   创建全局变量(根据实际填写,非全部必须)
#-----------------------------------------------------------------------------
##### GDCquery #####
project <- "TCGA-CESC"
data_category <- "Transcriptome Profiling"
data_type <- "Gene Expression Quantification"
workflow_type <- "HTSeq - Counts"
# bar <- c()
legacy <- FALSE

##### GDCdownload #####
# TokenFile <- 
method <- 'api'
download_dir <- paste0(work_dir,"/GDC/",gsub("-","_",project))
FilesChunk <- 6
  
##### GDCprepare #####
save <- TRUE
SaveName <- paste0(download_dir,"_","RNAsqe_HTSeq_Counts",".rda")
# summarizedExperiment <- FASLSE
  
#-----------------------------------------------------------------------------
#                   查询(根据实际情况调整参数)
#-----------------------------------------------------------------------------
query <- GDCquery(project = project,
                  data.category = data_category,
                  data.type = data_type,
                  workflow.type = workflow_type,
                  legacy = legacy 
                  )

#-----------------------------------------------------------------------------
#                   下载(根据实际情况调整参数)
#-----------------------------------------------------------------------------
GDCdownload(query = query,
            directory = download_dir,
            files.per.chunk = FilesChunk,
            method = method
            )

#-----------------------------------------------------------------------------
#                   预处理(根据实际情况调整参数)
#-----------------------------------------------------------------------------
data <- GDCprepare(query,
                   directory = download_dir,
                   save = save,
                   save.filename = SaveName
                   )

#-----------------------------------------------------------------------------
#                   提取
#-----------------------------------------------------------------------------
#获取表达矩阵
data_expr <- assay(data)

#查看表达矩阵的维度
dim(data_expr)

#创建表达矩阵文件路径expr_file
expr_file <- paste0(download_dir,"_","All_HTSeq_Counts",".txt")

#输出到表达矩阵文件中,file参数为""可以输出到console中
write.table(data_expr,file = expr_file,sep = "\t",row.names = T,quote = F)

#提取Gene表达量矩阵
gene_info = read.table("C:/Jagger/fannie/test/gene_info.txt",header = 1)
cat("Total Gene annotation:",dim(gene_info)[1])
gene_selected <- row.names(data_expr)%in%gene_info$Gene_id
gene_expr <- data_expr[gene_selected,]
cat("Total Gene Selected:",dim(gene_expr)[1])
gene_expr_file <- paste0(download_dir,"_","Gene_HTSeq_Counts",".txt")
write.table(gene_expr,file = gene_expr_file,sep="\t",row.names = T,quote = F)

#-----------------------------------------------------------------------------
#                   分析
#-----------------------------------------------------------------------------

##### Terminal模式中运行时使用此交互 #####
args <- commandArgs(TRUE)
working_dir <- args[1]
rawdata_file <- args[2]
sample_info_file <- args[3]

##### 编辑器中使用 #####
#工作目录
working_dir <- "C:/TCGA/Analysis"
#基因表达矩阵文件路径
rawdata_file <- "C:/TCGA/CESC/GDC/TCGA_CESC_Gene_HTSeq_Counts.txt"
#
#sample_info_file <- 
  
##### 设置工作目录并返回 #####
setwd(working_dir)
getwd()
  
#设置FDR和 fold change 阈值
fdr = 0.01
fold_change = 2

#读取基因表达矩阵
rawdata=read.table(rawdata_file,header = TRUE,stringsAsFactors = FALSE,row.names = 1)

#原始数据的基因数和样本量
dim(rawdata)

#针对基因的表达量进行过滤,过滤标准设置为:至少有25%的样本,基因的表达量大于2
quant <- apply(rawdata,1,quantile,0.75)
keep <- which((quant >= 2) == 1)
rawdata <- rawdata[keep,]
dim(rawdata)

#样分组信息,可以通过文件传入,也可以采用下面的简单方法,基于barcode判断是癌症还是正常组织
#sample_info =read.table(sample_info_file,header=TRUE,stringAsFactors=FALSE,row.names=1)

#基于样本的编号,根据第14位判断是癌症还是正常样本
#0为癌症样本,1为正常样本

#Barcode 实例: TCGA-W5-AA34-11A-11R-A41I-07

group <- factor(substr(colnames(rawdata),14,14))
summary(group)

#获得基因名称列表
genes=rownames(rawdata)

#制作DEGlist对象
y <- DGEList(counts=rawdata,genes=genes,group=group)
#计算有多少行
nrow(y)

#TMM标准化(基于count数据的标准化)
y <- calcNormFactors(y)

#分布估计(这步较慢)
y <- estimateCommonDisp(y,verbose=TRUE)
y <- estimateTagwiseDisp(y)

#差异分析检验(T检验)
et <- exactTest(y)

#针对FDR,fold change 对基因的显著性表达进行‘多重检验’
#对p value 进行BH检验
et$table$FDR <- p.adjust(et$table$PValue,method = "BH")
#summary显示共上调多少基因,下调多少基因(数据较大则重新调整FDR)
summary(de <- decideTestsDGE(et,adjust.method = "BH",p.value = fdr, lfc = log2(fold_change)))

#标识基因的上下调情况
et$table$regulate = as.factor(ifelse(et$table$FDR < fdr & abs(et$table$logFC) >=log2(fold_change),ifelse(et$table$logFC>log2(fold_change),'Up','Down'),'Normal'))

summary(et$table$regulate)

#绘制前调整右下角大小,过小会报错
plotMD(et)
print('et:')
et

#筛选出差异表达的基因
diff_expr_out <- et$table[et$table$regulate !='Normal',]

#输出结果

#保存结果
write.table(diff_expr_out,file = "DE_genes.txt",quote = FALSE, row.names = T,sep = "\t")

#绘制火山图
#设置FDR,Fold_Change 的辅助线
fdr_line = -log10(fdr)
fc_line = log2(fold_change)

#采用ggplot进行绘图
gp1 <- ggplot(et$table,aes(x=logFC,y=-log10(FDR),colour=regulate))+xlab("log2Fold Change")+ylab("-log10 FDR")+ylim(0,30)
#基于上下调关系,制定不同的颜色
gp2 <- gp1 + geom_point() +scale_color_manual(values = c("green","black","red"))
#增加两条辅助线
gp3 <- gp2 + geom_hline(yintercept = fdr_line,linetype=4)+geom_vline(xintercept = c(-fc_line,fc_line),linetype=4)

#run gp3可直接得到图
gp3

#保存图片
ggsave("DE_gene_Vocano.pdf",width = 16,height = 9)

##################
#绘制聚类热图
normData=y$pseudo.counts
heatmapData <- normData[rownames(diff_expr_out),]

#有些表达量为0,无法进行log转换,增加一个小背景:0.001
heatmapExp=log2(heatmapData+0.001)
heatmapMat=as.matrix(heatmapExp)

#输出的热图的储存途径
pdf(file = "DE_gene_heatmap.pdf",width = 60,height = 90)
par(oma=c(10,3,3,7))
heatmap.2(heatmapMat,col='greenred',trace='none')
dev.off()

#样品相关性绘图
pdf(file="gene_cor_cluster2.pdf",width = 60,height = 90)
par(oma=c(10,3,3,7))

#基于pearson相关性对样品进行聚类
heatmap(cor(normData))
dev.off()

#-----------------------------------------------------------------------------
#                   PPI
#-----------------------------------------------------------------------------
# 安装STRINGdb 软件包
BiocManager::install("STRINGdb")
BiocManager::install("rlist")
############################################################
library(STRINGdb)
library("rlist")
# 设置程序参数
work_dir <- "C:/TCGA/CESC/Analysis" 
deg_file <- "C:/TCGA/CESC/Analysis/DE_genes.txt"
setwd(work_dir)


# 获取物种的分类编号
# get_STRING_species(version="10", species_name=NULL) 
# NCBI数据库中9606 代表人类,700为可信度分值
#version已更新到11,但只有10不报错
string_db <- STRINGdb$new(version="10", species=9606,score_threshold=700, input_directory= work_dir)

# 读取差异表达的文件,获得差异表达基因列表
degs = read.table(deg_file,header=T,comment.char = "",check.names=F)
degs$gene <- rownames(degs)
head(degs)

# 查看有多少差异表达的基因需要分析 
cat("Total deg genes:", dim(degs)[1])

# 将基因的ID map 到string 数据库中, 不一定每个基因都能map上,很慢很慢
deg_mapped <- string_db$map( degs, "gene", removeUnmappedRows = TRUE )

# 查看有多少ID map 上了 
cat("Total String id mapped :", dim(deg_mapped)[1])

#-----------------------------------------------------------------------------
#                   GO KEGG
#-----------------------------------------------------------------------------
BiocManager::install("org.Hs.eg.db")
BiocManager::install("clusterProfiler")
BiocManager::install("pathview")
BiocManager::install("topGO")
install.packages("SparseM")
BiocManager::install("enrichMap")

library(org.Hs.eg.db)
library(clusterProfiler)
library(pathview)
require(pathview)
library(topGO)

#args <- commandArgs(TRUE)
#work_dir <-  args[1]
#deg_file <- args[2]


work_dir <- "C:/TCGA/Annotation"
deg_file <- "C:/TCGA/Analysis/DE_genes.txt"
setwd(work_dir)

# 读取差异表达的文件,获得差异表达基因列表
degs = read.table(deg_file,header=T,comment.char = "",check.names=F)
DEG_list <- rownames(degs)
head(DEG_list)
#GO 富集
# ont:  MF,BP,CC
ego <- enrichGO(gene          = DEG_list,
                keyType       = "ENSEMBL",
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

write.table(ego, file = "GO_enrichment_stat.txt",sep="\t", row.names =F, quote = F)


ego_results<-summary(ego)
ego_results
# 绘制相关图
pdf(file = "ego_barplot.pdf")
barplot(ego, showCategory=20, x = "GeneRatio")
dev.off()

# 点图
pdf(file = "ego_dotplot.pdf")
dotplot(ego,showCategory=20)
dev.off()

pdf(file = "enrich_map.pdf")
enrichMap(ego)
dev.off()

# 绘制topGO图
pdf(file = "topgo.pdf")
plotGOgraph(ego)
dev.off()


# KEGG pathway 富集
gene_ids <- bitr(DEG_list, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
kegg <- enrichKEGG(gene         = gene_ids$ENTREZID,
                   organism     = 'hsa',
                   pvalueCutoff = 0.05)
head(kegg)
write.table(kegg, file = "KEGG_enrichment_stat.txt",sep="\t", row.names =F, quote = F)

# 点图
png(file = "kegg_dotplot.png")
dotplot(kegg,title="Enrichment KEGG_dot")
dev.off()

# 合并表达信息和基因信息
deg_info <- degs['logFC']
deg_info['ENSEMBL'] <- rownames(deg_info)
gene_ids_merge <- merge(gene_ids, deg_info,by='ENSEMBL')
# 去掉ENTREZID 重复的基因
index<-duplicated(gene_ids_merge$ENTREZID)
gene_ids_merge<-gene_ids_merge[!index,]

# 提取FC的值,在map上进行颜色标注
map_ids <- as.matrix(gene_ids_merge['logFC'])
rownames(map_ids) <- gene_ids_merge$ENTREZID

                
#  绘制富集的pathway 以前三个为例
pathway_id <- "hsa04110"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa03030"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04022"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04115"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04151"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)
pathway_id <- "hsa04010"
map <- pathview(gene.data  = map_ids[,1],
                pathway.id = pathway_id,
                species    = "hsa", kegg.native = TRUE)

###第二种:MF
ego <- enrichGO(gene          = DEG_list,
                keyType       = "ENSEMBL",
                OrgDb         = org.Hs.eg.db,
                ont           = "MF",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

write.table(ego, file = "GO_enrichment_stat.txt",sep="\t", row.names =F, quote = F)
ego_results<-summary(ego)
ego_results
# 绘制相关图
pdf(file = "ego_barplot_MF.pdf")
barplot(ego, showCategory=20, x = "GeneRatio")
dev.off()
# 点图
pdf(file = "ego_dotplot_MF.pdf")
dotplot(ego,showCategory=20)
dev.off()
#效果不好
#pdf(file = "enrich_map.pdf")
#enrichMap(ego)
#dev.off()
# 绘制topGO图
pdf(file = "topgo_MF.pdf")
plotGOgraph(ego)
dev.off()

###第三种:CC
ego <- enrichGO(gene          = DEG_list,
                keyType       = "ENSEMBL",
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

write.table(ego, file = "GO_enrichment_stat.txt",sep="\t", row.names =F, quote = F)
ego_results<-summary(ego)
ego_results
# 绘制相关图
pdf(file = "ego_barplot_CC.pdf")
barplot(ego, showCategory=20, x = "GeneRatio")
dev.off()
# 点图
pdf(file = "ego_dotplot_CC.pdf")
dotplot(ego,showCategory=20)
dev.off()
#效果不好
#pdf(file = "enrich_map.pdf")
#enrichMap(ego)
#dev.off()
# 绘制topGO图
pdf(file = "topgo_CC.pdf")
plotGOgraph(ego)
dev.off()

Perl Code

#!/usr/bin/perl -w
use strict;

## Perl Code
my $biotype_file = "biotype_classs.txt";
my $gtf = "gencode.v33.annotation.gtf";
my $biotype = "gene";


my %biotype_list;
open my $fh1, $biotype_file or die;
while (<$fh1>) {
    chomp;
    my @array = split /\t/, $_;
#    print "$array[2], $array[0]\n";
    if($array[2]eq $biotype){
        $biotype_list{$array[0]} = 1;
    }
}
close $fh1;


open my $out, ">${biotype}_info.txt" or die;
print $out "Gene_id\tGene_id_info\tgene_name\tbiotype\n";
open my $fh2, $gtf or die;
while (<$fh2>) {
    chomp;
    next if /^#/;
    my @array = split /\t/, $_;
    next unless ($array[2] eq "gene");
    $array[8] =~ /gene_id\s+"(\S+?)";.*gene_type\s+"(\S+?)";.*gene_name\s+"(\S+?)";/;
    my $geneid = $1;
    my $genebiotype = $2;
    my $genename = $3;
    my $gene_id_norm=(split("\\.",$geneid))[0];
#    print "$genebiotype\n";
    if ($biotype_list{$genebiotype}) {
        print $out "$gene_id_norm\t$geneid\t$genename\t$genebiotype\n";
    }
}
close $fh2;
my $biotype_file = "biotype_classs.txt";
my $gtf = "gencode.v33.annotation.gtf";
my $biotype = "test";

print ($biotype_file);

my %biotype_list;
open (my $fh1, "<$biotype_file") or die;
while (<$fh1>) {
    chomp;
    my @array = split /\t/, $_;
#    print "$array[2], $array[0]\n";
    if($array[2]eq $biotype){
        $biotype_list{$array[0]} = 1;
    }
}
close $fh1;


open my $out, ">${biotype}_info.txt" or die;
print $out "Gene_id\tGene_id_info\tgene_name\tbiotype\n";
open my $fh2, $gtf or die;
while (<$fh2>) {
    chomp;
    next if /^#/;
    my @array = split /\t/, $_;
    next unless ($array[2] eq "gene");
    $array[8] =~ /gene_id\s+"(\S+?)";.*gene_type\s+"(\S+?)";.*gene_name\s+"(\S+?)";/;
    my $geneid = $1;
    my $genebiotype = $2;
    my $genename = $3;
    my $gene_id_norm=(split("\\.",$geneid))[0];
#    print "$genebiotype\n";
    if ($biotype_list{$genebiotype}) {
        print $out "$gene_id_norm\t$geneid\t$genename\t$genebiotype\n";
    }
}
close $fh2;

Project shot

在这里插入图片描述

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值