甲基化特异性区域的计算鉴别

多形性成胶质细胞瘤(GBM)甲基化区域的计算鉴别

 

目的:找出胶质细胞瘤特异性甲基化区域,为临床诊断提供理论依据

 

步骤:

1、  查找数据:下载TCGA中GBM的RNA-seq和甲基化数据

2、  甲基化数据分析,正常肿瘤对比,进行差异甲基化分析,找出肿瘤样本中高甲基化区域

3、  对RNA-seq数据进行分析,正常肿瘤对比,差异表达基因的筛选,找出肿瘤样本中低表达基因。

4、  结合甲基化和RNA-seq数据,将高甲基化和低表达基因取交集,这些基因很可能属于抑癌基因,与抑癌基因取交集,再结合promoter区域的CpG整合分析,寻找候选靶标。

5、  对找出的靶标进行验证,利用pubmed以及其他数据库,反向验证靶标的可靠性

一、数据下载

首先进入TCGA下载数据GBM的RNA-seq和甲基化数据,从下表可见GBM共有172套RNA-seq数据以及437套DNA甲基化数据,由于TCGA提供Infinium HumanMethylation27 BeadChip和Infinium HumanMethylation450 BeadChip两种芯片平台的数据,为了避免后续不同芯片平台间数据合并的困难,仅下载HumanMethylation450的芯片数据,共计154套。

图表 1TCGA数据汇总

二、初步整理数据

使用TCGA-Assembler.2.0.5进行GBM数据批量下载与初步整理,并且绘制RNA-seq基因表达量盒型图以及甲基化芯片数据盒型图,由于数据量较大,此处不贴图。

三、整体可视化

首先对于甲基化数据,选取ID为TCGA.06.AABW.11A.31D.A368.05的数据,查看总体甲基化程度。由于每个位点真实情况只存在:甲基化/非甲基化两种,所以对全部位点甲基化程度进行统计,也应该是大部分位点处于“完全甲基化”(Methylation state=1)和“完全非甲基化”(Methylation state=0)两种状态,下图绘制了数据的频数柱状图,可以明显看出形状处于“两头高,中间低”,反向说明芯片数据质量较好。

图表 2单个样本CpG甲基化程度统计

接下来,对多个样本绘制CpG甲基化程度小提琴图,同一行是同一个病人,左边样本来源于Primary Solid Tumor,右边样本来源于Recurrent Solid Tumor,除了甲基化程度大部分分布于0和1附近外,还能看出来源于同一病人肿瘤的甲基化程度依旧会有略微差异。

TCGA barcode:https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

图表 3小提琴图

 

同样的,对于RNA-seq数据也可以进行一些初步可视化,除了数据下载后绘制的盒型图,亦可以进行PCA初步查看数据分布,下图左为PCA陡坡图,反映了第一主成分、第二主成分…等等所拥有信息量的比例,下图右为使用PCA1和PCA2绘制的散点图,可以发现5个正常样本距离较近,从侧面反映数据可信度较好。

最后,对于RNA-seq表达谱数据,使用系统聚类方法,绘制树状图,可以发现5个正常样本距离也是很近,数据质量还行。

 

四、差异甲基化区域筛选

为了更加科学高效地筛选差异甲基化位点,参考bioconductor中甲基化芯片的分析流程,使用minfi包进行差异甲基化分析,得到差异甲基化位点。

http://www.bioconductor.org/help/workflows/methylationArrayAnalysis/

在检测的526733个CpG位点中,共有4927个CpG位点P值<0.01,且在肿瘤样本中保持着甲基化程度高于0.7,对应2054个基因。

五、差异表达基因筛选

由于数据源自RNA-seq,最主流的分析方法当然是基于负二项分布模型的DESeq2包。

先用MA-plot查看差异表达基因大致分布。意外的是,图形左侧有大概七条线状条纹,最初我怀疑这是sample之间有batch effect导致,需要用其他更好normalize的方法,后来用identify方法挨个找出每条线上的基因名及其对应的表达量,发现这些基因在172套样本中表达量几乎全为0,仅有一两个样本有一点点表达,这种数据的存在导致这些线状条纹的产生。

图表 4 MA-plot

然后, 选取p值最小的差异表达基因,绘制其在不同组间表达量,确实差异很显著。

图表 5表达量散点图

接着,绘制差异表达基因在不同组间的表达量热图,正常样本是图片最左边的五列,当然如果需要解释具体的生物学问题,需要将聚类出来的每一类,将差异表达基因进行GO以及KEGG注释,结合有关的生物学表型,探讨其分子机制及意义

图表 6表达量热图

最后,选取筛选条件为p值小于0.01且log2FoldChange<-2的差异表达基因,在肿瘤样本低表达的基因,共计1657个

六、抑癌基因的获取

在pubmed中查询研究人员整理的tumor suppressor genes,果然在Nucleic Acids Res发表过TSGene数据库,存储了抑癌基因的列表。https://bioinfo.uth.edu/TSGene/download.cgi

下载全部1217个人类抑癌基因的列表。

All the 1217 human tumor suppressor genes with basic annotations

图表 7 TSGene database

七、数据整合

对于甲基化数据中,肿瘤样本高甲基化CpG附近的基因,RNA-seq中肿瘤样本低表达的基因,以及TSGene数据库中下载的抑癌基因列表,三者做overlap,找出特异性的候选靶标,为后续分析做准备,下图为三者overlap的韦恩图。

图表 8数据整合韦恩图

共计找出12个候选靶标基因。

八、靶标筛选

之前筛选选择的单个CpG的差异甲基化,而实际临床检测应用时候,可能需要多个CpG作为对照,因此统计了12个候选靶标基因TSS前1.5kb内所有CpG的甲基化程度,然后绘制热图,可以明显发现,虽然当初用CpG的差异甲基化位点筛出来的基因都是肿瘤样本高甲基化的,可是统计TSS前1.5kb内所有CpG的甲基化程度,这些基因却有很多在所有样本中都是低甲基化状态,而看上去很靠谱的是NUAK1基因,其正常样本在TSS前1.5kb内低甲基化,肿瘤样本中对应区域高甲基化。

图表 9 methylation heatmap

NUAK1基因TSS前1.5kb内共检测了7个CpG,这7个CpG在154个样本中检测出来的甲基化程度如下图,可以明显看出来这7个CpG在Tumor组织中甲基化程度都相对高,而在Normal组织中甲基化程度相对较低。

图表 10 NUAK1的TSS区CpG甲基化程度

使用Cpgplot预测CpG island位置,这7个CpG在promoter5’到3’序列图上相对位置如下

1035 1094 1408 1413 1444 1448 1471

图表 11 CpG island预测

参数使用:

NUAK1promoter from 1 to 1500

     Observed/Expected ratio > 0.40

     Percent C + Percent G > 40.00

     Length > 100

CpG island详细信息: Length 101 (1086..1186)   Length 105 (1366..1470)

这七个CpG基本都在CpGisland中,具体序列见附录

 

九、靶标基因相关讨论

进入Gene数据库搜索NUAK1相关内容,可以发现基因全称NUAK family kinase 1,还是个激酶,激酶的话就对调控会有很大作用了,而在HPA RNA-seq normal tissues项目中,又看出来这个激酶在脑中表达量明显高于其他组织,这又与发生在脑部的GBM不谋而合。

图表 12 NUAK1相关讨论

 

十、分子机制探讨

对于肿瘤组织中高甲基化CpG附近的,并且在肿瘤样本中低表达的intersect共计274个基因,使用Gene Ontology进行富集分析,可以明显发现在GO biological process生物学过程中的“神经系统发育”、“化学性突触传递”和“细胞膜的组织”等部分里面有着富集,特别是“中枢神经系统的髓鞘形成”,富集程度达到26.95倍,这又与研究的多发生于脑补的GBM有着密切的联系,反向验证实验结果的正确性。

图表 13 GO富集分析

十一、FurtherMore

根据生物学知识可以得到,CpG的甲基化会调控基因的转录,因此,Transcript Start Site(TSS)附近的甲基化程度值得进行一番深入研究,选用人类基因组hg19版本,对23056基因共计46489个转录起始位点,进行转录起始位点富集甲基化程度统计。

统计TSS前后5000bp内CpG甲基化程度,并且使用曲线进行拟合,可以发现TSS处的CpG Methylation水平明显降低,这也与科学常识相吻合。

图表 14 TSS附近甲基化程度

附录

NUAK1promoter区CpG island用蓝色标注,检测的CpG用加粗横线下标标注。

>NUAK1promoter

TATGAAAGGAGAAGGGGGAGCTTTGGAACTGGACAGGTAGGGTTTAAATCCTGGTCCTGC

CATTTACAAACTGTGTAACCTCTGGGAAATTACTTAACCTTTCTGATACGGTTTCTTCAA

TTGAAAATAGGGATTGTAAACAGCTACTTTACAGAGGAGGGTTTACTGTCATAAAACAGT

ACCAGCCTATGGTAGATGATGCTGTTGATTCAATAGATACTGATGAAGTCCCACATATCT

GGGAGTAACACTATCAGCCAGAATAAGCCAGGTTATGCTGCTTTAACAAATCCCACTGAC

TTAACATAAATAAAGTTTAATATTTGCTGACACTACTTTTCCAATGCAGATTGGCTGGGA

TTTTCTGCCATCTTTACACAGAGACTCAGGCTGGCTTGGGGAGGCTCCAACTTTGTACCA

CCATGATTGTCAAGATAGGAAAAAGAAGACATGGGGATTTGCTCACTGGCTCCTAAAGGT

TCCAAGTGAAAGCAACACAATCACTTCTGCTCTCATTCCATTGGCCTAAGCAAGTCCGAT

GGTCTCATCTAACTTCAAGAGGATGGGGTAGTGGATTTCTACCACATCGCAGGGGAGGGG

AAGAAGTAGAAAATATTTTAAAATACTACTGTAGACACAATGTGTTTATCTCTACAATAG

ATCTGCTAAATCCATATCTTAAGAGAAAACCGAAGCTCTGAGAGGTTATGTCATTTGACC

AAGGCCACACAGCCAATCCTCTGGGAAGCCAGGACTCAACCTGCATGCCACTCATCCTGA

CTGTGGGATCTATAGGCACAACGTCCACAAACTGTATAGAAGCAGAACAGTTTCAGGATG

GGGGTGGGTGGCAGGGAGGCCCTGCAAATGCGGTGGACAGAATAGGAATTGAATAGGCAT

GATGCGCCTTTGTGTCTGTGTTCTGTCATTGCTCCTGGGGAAGGGAAGAAGGGGCAGGGA

AGTTTGTGTGAAATATCAGTAGAAGGAAAGTTTGGACAGGTAAGAATATCACGCGTCAGA

GTACAGCAGAGACACGTGTGGAGGATGAGGGCAGTGTTTCAGGCCATTACTCTGGCAGCA

GTGAGGAGGCTCCCGGGGAGGGTTGGGGAGAGCGGGCTGTTGCTGGAGTCCGGGGTAAGG

TGACCAGGGGTAGACAGGAATGAAGGCGGCGGGAATGTAGAGGAAGGGCTGCACCTGAGA

GCTGCAGAGGAGGCCCCAGTGAGGTCCACTGGTGGAAAAGAACCGCCATGCGATCTGCAG

AACACCAGACCTTCTTCAGCCTTCACTCTTCCCACCCTAGTCTGGTACATTGCACCACTT

GTTAAAAAAAAAAATTTCCCTAAGACCTCTTTTTCTCTAGCCTTCTTCCTTATTTTTCAT

GGTCTCTTCTTAGAACGGCGGCAGCCACGCCGCGGTGGGAGGCCCTTCCTGCCTGACCCT

TACCGTGCGGGGTACCGTTCCTGTCACCATCGCCAGGATCTGGCCCTTTCAGTGAAAGGA


diffFind.R

setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
load("GBM__methylation_450.rda")
setwd("G:/AllShare/SkillTrainHomework")
DATAFRAME<- data.frame(Data)

TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
  TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
  TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
  if(as.integer(substr(TCGASampletpye[i],1,2))<10){
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
  }else{
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
  }
}

library(ggplot2)
#h<- DATAFRAME$TCGA.06.0125.01A.01D.A45W.05
h<- DATAFRAME$TCGA.06.AABW.11A.31D.A368.05
pdf("CpGmethlation.pdf")
ggplot(NULL,aes(x=h))+
  geom_histogram(binwidth = 0.02)+
  labs(x = "% methylation per base",title = "Histogram of % CpG methylation\nSampleID:TCGA.06.AABW.11A.31D.A368.05")
dev.off()


library(reshape2)
tmp <- Data[,1:8]
tmp <- data.frame(tmp)
colnames(tmp) <- TCGASampleID[1:8]
tmp2 <- stack(tmp)
tmp2 <- data.frame(tmp2)
colnames(tmp2) <- c("CpG methylation","SampleID")
pdf("MethlationViolin.pdf",width =12,height = 10 )
ggplot(tmp2,aes(x=SampleID,y=`CpG methylation`,fill=SampleID))+
  geom_violin()+
  facet_wrap(~SampleID,ncol=2)
dev.off()


library(minfi)
dmp <- dmpFinder(Data, pheno = TCGASampletpyeResult, type = "categorical")
save(dmp,file = "diff.Rdata")
load("diff.Rdata")

meanTumorData<- apply(Data[,TCGASampletpyeResult=="Tumor"],1,mean)

#&!is.na(meanTumorData)

# head(meanTumorData)
# head(dmp)
# head(signifidmp,10)
# summary(dmp)
# rownames(dmp)
dmp$genelistID <- as.integer(rownames(dmp))
o<- order(dmp[,"genelistID"])
dmp<- dmp[o,]
#筛选出p<0.01且无空值的CpG,并且正常样本甲基化程度<0.3,即筛选肿瘤中高甲基化的基因
signifidmp <- dmp[meanTumorData>0.7 & !is.na(meanTumorData) & dmp$pval<0.01
                  &!is.na(dmp$pval) & !is.na(dmp$qval) & !is.na(dmp$intercept),]
#signifidmp <- signifidmp[signifidmp$intercept<0.3,]dmp <- 
signifiData<- Data[as.integer(rownames(signifidmp)),]
signifiDes<- Des[as.integer(rownames(signifidmp)),]

# Data[Des[,1]=="ch.3.2438620R",]
# meanTumorData[Des[,1]=="ch.3.2438620R"]>0.7
# mean(Data[103429,TCGASampletpyeResult=="Normal"])
# mean(Data[103429,TCGASampletpyeResult=="Tumor"])
# summary(Data[Des[,1]=="ch.3.2438620R",])

# apply((head(signifiData,10)[,TCGASampletpyeResult=="Tumor"]),1,mean)

signifiGene <- signifiDes[,2]

length(unique(signifiGene))

head(signifidmp)

#intercept正常样本甲基化程度
# Data[362831,118]
# Data[362831,48]
# Data[362831,]
# 
# Data[310369,118]
# Data[310369,48]
# Data[310369,]



#以下是关于CpG的一些计算


#因为Des有部分空缺,取出非空部分,生成Desfull
Desfull<- Des[!is.na(Des[,3]),]
#同样取出Data里非空部分,计算mean
methtstat<- apply(Data[!is.na(Des[,3]),],1,mean,na.rm=TRUE)
#去掉计算mean值以后NA的部分对应的Desfull
Desfull<- Desfull[!is.na(methtstat),]
#再去掉methtstat的mean值中NA的部分
methtstatresult<- methtstat[!is.na(methtstat)]
grCpG <- GRanges(seqnames = paste("chr",Desfull[,3],sep = ""),
              ranges = IRanges(start = as.integer(Desfull[,4]), width = 1))
grCpG$value <- methtstatresult


library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
trans<- trans[trans$seqnames %in% c("chrX","chrY",paste("chr",1:22,sep="")),]
grTSS <- GRanges(seqnames = trans$seqnames,
                 ranges = IRanges(start = trans$start-5000,end = trans$start+5000))

grCpG
grTSS
hitObj<- findOverlaps(grTSS,grCpG)
CpGRelativeSite <- c()
CpGRelativeMeth <- c()
for(i in 1:length(hitObj)){
  #取出对应CpG真实位置
  CpGsite <- grCpG[hitObj[i]@to]@ranges@start
  #计算回TSS真实位置
  TSSsite<- mean(grTSS[hitObj[i]@from]@ranges)
  #计算CpG相对位置并且存储
  CpGRelativeSite <- c(CpGRelativeSite,TSSsite - CpGsite)
  #取出CpG平均甲基化程度
  CpGRelativeMeth <- c(CpGRelativeMeth,grCpG[hitObj[i]@to]$value)
}
save.image(file = "diffresult.Rdata")

load("diffresult.Rdata")

CpGResult <- c()
for(i in -5000:5000){
  CpGResult <- c(CpGResult,mean(CpGframe[CpGframe$CpGRelativeSite==i,"CpGRelativeMeth"]))
}
CpGframe <- data.frame(-5000:5000,CpGResult)
colnames(CpGframe) <- c("CpGRelativeSite","MethState")
ggplot(CpGframe, aes(x=CpGRelativeSite, y=MethState))+
  #geom_point()+
  geom_smooth()
ggsave(filename="TSS附近甲基化程度.pdf",width = 12,height=8)





library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
trans <- as.data.frame(transcripts(txdb_hg19))
grTSS <- GRanges(seqnames = trans$seqnames,
                 ranges = IRanges(start = trans$start-5000,end = trans$start+5000))


genesall<- genes(txdb_hg19)
SERPIND1<- genesall[genesall$gene_id=="3053"]
SERPIND1promoter<- promoters(SERPIND1,upstream = 1500,downstream = 0)
SERPIND1promoter
SERPIND1hit<- findOverlaps(SERPIND1promoter,grCpG)
SERPIND1hit@to
grCpG[SERPIND1hit@to]
#将Data里对应数据取出
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]

NUAK1<- genesall[genesall$gene_id=="9891"]
NUAK1promoter<- promoters(NUAK1,upstream = 1500,downstream = 0)
NUAK1promoter

NUAK1hit<- findOverlaps(NUAK1promoter,grCpG)
NUAK1hit@to
grCpG[NUAK1hit@to]
#将Data里对应数据取出
Datafull<- Data[!is.na(Des[,3]),]
Datafull<- Datafull[!is.na(methtstat),]


apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalNUAK1<- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Normal"]))
tumorNUAK1 <- data.frame(stack(Datafull[NUAK1hit@to,TCGASampletpyeResult=="Tumor"]))
normalNUAK1$sampletype <- "Normal"
normalNUAK1$site <- 1:7
tumorNUAK1$sampletype <- "Tumor"
tumorNUAK1$site <- 1:7
frame1<- data.frame(normalNUAK1$NA..1,normalNUAK1$sampletype,normalNUAK1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorNUAK1$NA..1,tumorNUAK1$sampletype,tumorNUAK1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) + 
  scale_x_discrete(limits=1:7)+
  geom_point(position="jitter")
ggsave(filename="TargetCpGNUAK1.pdf",width = 8,height=6)


apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"],1,mean)
apply(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"],1,mean)
library(ggplot2)
normalSERPIND1<- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Normal"]))
tumorSERPIND1 <- data.frame(stack(Datafull[SERPIND1hit@to,TCGASampletpyeResult=="Tumor"]))
normalSERPIND1$sampletype <- "Normal"
normalSERPIND1$site <- 1:2
tumorSERPIND1$sampletype <- "Tumor"
tumorSERPIND1$site <- 1:2
frame1<- data.frame(normalSERPIND1$NA..1,normalSERPIND1$sampletype,normalSERPIND1$site)
colnames(frame1) <- c("MethState","sampletype","site")
frame2<- data.frame(tumorSERPIND1$NA..1,tumorSERPIND1$sampletype,tumorSERPIND1$site)
colnames(frame2) <- c("MethState","sampletype","site")
resultframe <- rbind(frame1,frame2)
ggplot(resultframe, aes(x=site, y=MethState,colour=sampletype)) + 
  scale_x_discrete(limits=1:2)+
  geom_point(position="jitter")
ggsave(filename="TargetCpGSERPIND1.pdf",width = 8,height=6)

library("BSgenome.Hsapiens.UCSC.hg19")
library(seqinr)
genome <- BSgenome.Hsapiens.UCSC.hg19
grCpG[NUAK1hit@to]
promoterSeq<- getSeq(genome,NUAK1promoter)
write.fasta(promoterSeq$`9891`,"NUAK1promoter","promoterSeq.txt")


normalNUAK1$NA..1
tumorNUAK1$NA..1

dim(Datafull)
dim(Desfull)
# 尝试用mice包补缺失值,由于数据量太大而取消
# library(mice)
# library(VIM)
# #md.pattern(Data)
# tmp <- Data[,1:3]
# aggr_plot <- aggr(tmp, col = c('navyblue', 'red'), numbers=TRUE, sortVars=TRUE, 
#                   labels=names(tmp),cex.axis=.7, gap=2, 
#                   ylab=c("Histogram of missing data", "Pattern"))
# tempData <- mice(Data,m=1,maxit=50,meth='pmm',seed=500)
# 
# 尝试T检验,由于有缺失值而取消
# Ttest<- t.test(Data[1,1:3],Data[1,c(48,118)])
# 
# Data[1,118]
# Data[1,48]
# tmp<- Data[1,]
# myfun <-function(x){
#   Ttest<- t.test(x[c(1:47,49:117,119:154)],x[c(48,118)])
#   return(Ttest$p.value)
# }
# myfun(Data[1,])

# result<- apply(Data,1,myfun)
# group <- factor(TCGASampletpyeResult,levels=c("Normal","Tumor"))
# design <- model.matrix(~-1+group)
# design
# fit.reduced <- lmFit(Data,design)
# fit.reduced <- eBayes(fit.reduced)
# summary(decideTests(fit.reduced))
# 
# 尝试使用missMethyl包,最后designMatrix设置有错误而差异区域识别错误,差异不明显
# top<-topTable(fit.reduced,coef=1)
# top
# cpgs <- as.integer(rownames(top))
# Data[cpgs[10],]
# Data[cpgs[10],48]
# Data[cpgs[10],118]
# 
# par(mfrow=c(2,2))
# for(i in 1:4){
#   stripchart(Data[rownames(Data)==cpgs[i],]~design[,4],method="jitter",
#              group.names=c("Normal","Tumor"),pch=16,cex=1.5,col=c(4,2),ylab="Beta values",
#              vertical=TRUE,cex.axis=1.5,cex.lab=1.5)
#   title(cpgs[i],cex.main=1.5)
# }


Downloaded.R

setwd("G:/AllShare/SkillTrainHomework/TCGA-Assembler.2.0.5/TCGA-Assembler")
#' Load functions
source("Module_A.R")
source("Module_B.R")
setwd("G:/AllShare/SkillTrainHomework")
#' choose a cancer type
#' 可查看网址https://tcga-data.nci.nih.gov/docs/publications/tcga/
sCancer <- "GBM"
sPath1 <- "./DownloadedData"
sPath2 <- "./BasicDataProcessingResult"
sPath3 <- "./AdvancedDataProcessingResult"
#下载RNA-Seq数据
path_geneExp <- DownloadRNASeqData(cancerType = sCancer,
                        assayPlatform = "gene.normalized_RNAseq",
                        saveFolderName = sPath1)
#' Download DNA methylation 450 data
#' 使用Illumina的甲基化分析芯片测出来的甲基化数据
path_methylation_450 <- DownloadMethylationData(cancerType = sCancer,
                        assayPlatform = "methylation_450",
                        saveFolderName = sPath1)
# 下载出来的格式说明:
# TCGA-3C-AAAU-01A-11R-A41B-07
# 前三位TCGA-3C-AAAU表示病人ID
# 第四位01A表示肿瘤类型Tumor types range from 01 - 09,
# normal types from 10 - 19 and control samples from 20 - 29.
# 第五位Order of portion in a sequence of 100 - 120 mg sample portions
# 可查看网址https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode

#' Process gene expression data
#' 对RNA-seq下载结果进行处理,将基因名进行处理
list_geneExp <-
  ProcessRNASeqData(inputFilePath = path_geneExp[1],
                    outputFileName = paste(sCancer,
                                           "geneExp",
                                           sep = "__"),
                    dataType = "geneExp",
                    outputFileFolder = sPath2)
#处理methylation数据
list_methylation_450 <-
  ProcessMethylation450Data(inputFilePath =
                    path_methylation_450[1], outputFileName = paste(sCancer,
                    "methylation_450", sep = "__"), outputFileFolder = sPath2)

#'Perform advanced data processing using Module B functions
#'Advanced进一处理methylation数据
list_methylation_450_OverallAverage <-
  CalculateSingleValueMethylationData(input = list_methylation_450,
                    regionOption = c("TSS1500", "TSS200"), DHSOption = "Both",
                    outputFileName = paste(sCancer, "methylation_450", sep = "__"),
                    outputFileFolder = sPath3,
                    chipAnnotationFile = "./SupportingFiles/MethylationChipAnnotation.rda")
save.image("tmpdata.Rdata")
load("tmpdata.Rdata")


RNAdiff.R

#setwd("G:/AllShare/SkillTrainHomework/BasicDataProcessingResult")
#load("GBM__geneExp.rda")
setwd("G:/AllShare/SkillTrainHomework")
load("RNAdiff.RData")
DATAFRAME<- data.frame(Data)
TCGASampleName<- colnames(DATAFRAME)
TCGASamplelist <- strsplit(TCGASampleName,split=".",fixed = TRUE)
TCGASampletpye <- c()
TCGASampleID <- c()
for(i in 1:length(TCGASamplelist)){
  TCGASampletpye <- c(TCGASampletpye,TCGASamplelist[[i]][4])
  TCGASampleID <- c(TCGASampleID,paste(TCGASamplelist[[i]][2],TCGASamplelist[[i]][3],sep = "."))
}
TCGASampletpyeResult <- c()
for(i in 1:length(TCGASampletpye)){
  if(as.integer(substr(TCGASampletpye[i],1,2))<10){
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Tumor")
  }else{
    TCGASampletpyeResult <- c(TCGASampletpyeResult,"Normal")
  }
}
phenotype <- data.frame(TCGASampletpyeResult)
rownames(phenotype) <- colnames(Data)
rownames(Data) <- Des[,2]
library(DESeq2)
Data<- floor(Data)
dds <- DESeqDataSetFromMatrix(countData = Data,
                              colData = phenotype,
                              design = ~ TCGASampletpyeResult)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
resSig <- subset(resOrdered, padj < 0.01)
res001 <- results(dds, alpha=0.01)
pdf(file = "MA-plot.pdf",width = 15,height = 9)
plotMA(res001, ylim=c(-3,3))
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
res[idx,]
Data[c("100131561","677811","414899","677846"),]
Data[c("9271","140893"),]
Data[c("390077","144742"),]
dev.off()
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="TCGASampletpyeResult", 
                returnData=TRUE)

library("ggplot2")
ggplot(d, aes(x=TCGASampletpyeResult, y=count)) + 
  geom_point(position=position_jitter(w=0.1,h=0)) + 
  scale_y_log10(breaks=c(25,100,400))+
  labs(title=paste("EntrezID:",res@rownames[which.min(res$padj)]))+
  theme(plot.title = element_text(hjust = 0.5))
ggsave(filename="p值最小的基因不同表达量.pdf",width = 8,height=6)

ntd <- normTransform(dds)
library("pheatmap")
select <- rownames(resOrdered)[1:20]
df <- as.data.frame(colData(dds)[,"TCGASampletpyeResult"])
pdf(file = "热图.pdf",width = 20,height = 9)
pheatmap(assay(ntd)[select,],show_rownames=TRUE)
dev.off()

RnaSigGeneID<- rownames(resSig)
RnaSigGeneName<- Des[Des[,2]%in%RnaSigGeneID,1]

intersect(RnaSigGeneName,signifiGene)

#log2FoldChange为负数,正常高表达;log2FoldChange为正数,肿瘤高表达
#筛选log2FoldChange为负数,即肿瘤低表达的部分
resSigdown<- resSig[resSig$log2FoldChange<(-2),]
#dim(resSigdown)
RnaSigGeneIDdown<- rownames(resSigdown)
RnaSigGeneNamedown<- Des[Des[,2]%in%RnaSigGeneIDdown,1]

# Data["2498",c(38:41,76)]
# Data["2498",]
# Data["7153",c(38:41,76)]
# Data["7153",]

#与抑癌基因取交集
Human_TSGs <- read.table("Human_TSGs.txt",header = TRUE,sep = "\t")
#TSGene <- read.table("TSGene-LOFdataset.txt",header = TRUE,sep = "\t")
#sig_exp <- read.table("sig_exp.txt",header = TRUE,sep = "\t")
TSGs <- Human_TSGs$GeneSymbol
#TSGs <- TSGene$GeneName
#TSGs <- sig_exp$Symbol
result1<- intersect(RnaSigGeneNamedown,TSGs)
result2<- intersect(result1,signifiGene)#三者交集
resultmuch <- intersect(RnaSigGeneNamedown,signifiGene)
write.table(resultmuch,"候选靶标基因多多多.txt",quote = FALSE,row.names = FALSE,col.name = FALSE)

#"CCHCR1"%in%signifiGene




#此段代码需要methylation的TCGASampletpyeResult对象
setwd("G:/AllShare/SkillTrainHomework/AdvancedDataProcessingResult")
load("GBM__methylation_450__TSS1500-TSS200__Both.rda")
MethTssFrame<- data.frame(Data)
rownames(MethTssFrame) <- Des[,1]
targetFrame<- MethTssFrame[result2,]
colnames(targetFrame) <- TCGASampletpyeResult
targetFrame<-na.omit(targetFrame)
library(pheatmap)
pdf(file = "target热图.pdf",width = 20,height = 6)
pheatmap(targetFrame,show_rownames=TRUE)
dev.off()

library (VennDiagram)
draw.triple.venn(area1=5, area2=5, area3=5
                 ,n12=3, n23=3, n13=3, n123=3
                 ,category = c('A','B','C'))
pdf("交集.pdf",width = 10,height = 10)
T<-venn.diagram(list(A=RnaSigGeneNamedown,B=TSGs,C=signifiGene),filename=NULL
                ,lwd=1,lty=2,category = c('RNA-seq down','Tumor suppressor gene','Hypermethylation'))
grid.draw(T)
dev.off()


#save.image("RNAdiff.RData")


PCAdata<-t(Data)
#作主成分分析
PCAdata.pr<-prcomp(PCAdata,scale=FALSE)
#作预测
PCA_eset<- predict(PCAdata.pr)
colnames(PCA_eset)
pdf(file = "陡坡图.pdf",width = 14,height = 7)
screeplot(PCAdata.pr,type="lines")
dev.off()

data.hc <- hclust( dist(PCAdata))
pdf(file = "树状图.pdf",width = 22,height = 12)
plot(data.hc, hang = -1)
dev.off()

#plot(PCA_eset[,1:2])

library("ggplot2")
ggplot(NULL, aes(x=PCA_eset[,1], y=PCA_eset[,2], colour=TCGASampletpyeResult)) + 
  geom_point() + 
  guides(color=guide_legend(title=NULL)) +
  labs(x = "PCA1 34.7%",y = "PCA2 15.3%",title = "RNA-seq Principal components analysis")
ggsave(filename="PCA RNA-seq.pdf",width = 8,height=6)


GOterm <- read.table("analysis.txt",header = TRUE,sep = "\t")
library(ggplot2)
ggplot(GOterm,aes(x=GO.biological.process,y=upload_1..over.under.,fill = P.value))+
  geom_bar(stat="identity")+
  labs(x = "GO terms",y = "Fold Enrichment",title = "GO biological process analysis")+
  coord_flip()
ggsave(filename="GO analysis.pdf",width = 8,height=6)



发布了79 篇原创文章 · 获赞 140 · 访问量 52万+
展开阅读全文

没有更多推荐了,返回首页

©️2019 CSDN 皮肤主题: 编程工作室 设计师: CSDN官方博客

分享到微信朋友圈

×

扫一扫,手机浏览