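Database overview
A few key points about the CCLE database
GEO accession for the CCLE cell line expression profiles: GSE36133
CCLE data download link
Data processing
Processing the sample information
The data are processed in R, so some basic familiarity with R is required.
The downloaded data cover many cancer types, so the first step is to go through the cell line information and select the cell lines you need for the downstream analysis.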
rm(list = ls())
sample<-read.csv(file="sample_info.csv")
colnames(sample)
# [1] "DepMap_ID" "cell_line_name"
# [3] "stripped_cell_line_name" "CCLE_Name"
# [5] "alias" "COSMICID"
# [7] "sex" "source"
# [9] "Achilles_n_replicates" "cell_line_NNMD"
# [11] "culture_type" "culture_medium"
# [13] "cas9_activity" "RRID"
# [15] "WTSI_Master_Cell_ID" "sample_collection_site"
# [17] "primary_or_metastasis" "primary_disease"
# [19] "Subtype" "age"
# [21] "Sanger_Model_ID" "depmap_public_comments"
# [23] "lineage" "lineage_subtype"
# [25] "lineage_sub_subtype" "lineage_molecular_subtype"
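As shown above, the sample information contains 26 columns. Pick whichever ones you need for downstream processing; here I select the sample ID, the cell line name, primary or metastasis, the primary disease type, and the subtype.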
sample_info<-sample[,c(1,3,17,18,19)]
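Selecting columns by position, as above, silently picks the wrong columns if the layout of sample_info.csv ever changes. A minimal sketch of the same selection by column name follows. The toy data frame and its values are made up for illustration; only the column names are taken from the listing above.

```r
# Toy stand-in for sample_info.csv; the values are hypothetical,
# only the column names mirror the real file.
sample_toy <- data.frame(
  DepMap_ID = c("ACH-000001", "ACH-000002"),
  cell_line_name = c("Line A", "Line B"),
  stripped_cell_line_name = c("LINEA", "LINEB"),
  primary_or_metastasis = c("Primary", "Metastasis"),
  primary_disease = c("Liver Cancer", "Lung Cancer"),
  Subtype = c("Hepatocellular Carcinoma", "NSCLC"),
  stringsAsFactors = FALSE
)

wanted <- c("DepMap_ID", "stripped_cell_line_name",
            "primary_or_metastasis", "primary_disease", "Subtype")

# Fail early if a column is missing, instead of selecting the wrong one.
stopifnot(all(wanted %in% colnames(sample_toy)))

sample_info_toy <- sample_toy[, wanted]
print(colnames(sample_info_toy))
```

With the real data, the same idea would be `sample[, wanted]` in place of `sample[,c(1,3,17,18,19)]`.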
### List the available disease types so that one can be chosen
unique(sample_info$primary_disease)
# [1] "Ovarian Cancer" "Leukemia"
# [3] "Colon/Colorectal Cancer" "Skin Cancer"
# [5] "Lung Cancer" "Bladder Cancer"
# [7] "Kidney Cancer" "Breast Cancer"
# [9] "Pancreatic Cancer" "Myeloma"
# [11] "Brain Cancer" "Sarcoma"
# [13] "Lymphoma" "Bone Cancer"
# [15] "Fibroblast" "Gastric Cancer"
# [17] "Engineered" "Thyroid Cancer"
# [19] "Neuroblastoma" "Prostate Cancer"
# [21] "Rhabdoid" "Gallbladder Cancer"
# [23] "Endometrial/Uterine Cancer" "Head and Neck Cancer"
# [25] "Bile Duct Cancer" "Esophageal Cancer"
# [27] "Liver Cancer" "Cervical Cancer"
# [29] "Unknown" "Eye Cancer"
# [31] "Adrenal Cancer" "Liposarcoma"
# [33] "Embryonal Cancer" "Teratoma"
# [35] "Non-Cancerous"
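There are 35 distinct values of primary_disease. Most are cancer types, although a few (such as "Fibroblast", "Engineered", "Non-Cancerous" and "Unknown") are not. Pick the one you want to study; here I choose liver cancer.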
which(sample_info$primary_disease=="Liver Cancer")
cell_lines<-sample_info[which(sample_info$primary_disease=="Liver Cancer"),]
save(cell_lines,sample,file="Data1_sample_information.Rdata")
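This gives us the cancer type we want to study, together with the names and related information of its cell lines. Save it before moving on.
Gene expression data
First, read in the expression data we downloaded.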
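Before settling on a disease type, it can help to see how many cell lines each one has. The sketch below does this with `table()` on a toy data frame (the IDs and labels are made up), and also shows why the filter above is wrapped in `which()`: a plain `==` comparison returns NA for missing labels, and indexing with NA produces all-NA rows.

```r
# Toy sample table; IDs and labels are hypothetical.
sample_info_toy <- data.frame(
  DepMap_ID = c("ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004"),
  primary_disease = c("Liver Cancer", "Lung Cancer", "Liver Cancer", NA),
  stringsAsFactors = FALSE
)

# Number of cell lines per disease label, counting missing labels too.
disease_counts <- table(sample_info_toy$primary_disease, useNA = "ifany")
print(disease_counts)

# which() drops the NA comparison, so only true matches are kept.
liver_which <- sample_info_toy[which(sample_info_toy$primary_disease == "Liver Cancer"), ]

# %in% never returns NA, so it gives the same rows without which().
liver_in <- sample_info_toy[sample_info_toy$primary_disease %in% "Liver Cancer", ]

# Without which(), the NA comparison yields an extra all-NA row.
liver_plain <- sample_info_toy[sample_info_toy$primary_disease == "Liver Cancer", ]
print(nrow(liver_plain))
```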
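# Read the expression matrix: rows are cell lines (DepMap IDs), columns are genes
exp<-read.csv(file="CCLE_expression.csv")
# Use the first column (the DepMap IDs) as row names, then drop it
rownames(exp)<-exp[,1]
exp<-exp[,-1]
# Preview the top-left corner of the matrix
exp[1:3,1:3]
# TSPAN6..7105. TNMD..64102. DPM1..8813.
# ACH-001113 4.990501 0.0000000 7.273702
# ACH-001289 5.209843 0.5459684 7.070604
# ACH-001339 3.779260 0.0000000 7.346425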
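Gene names such as `TSPAN6..7105.` in the preview above are what `read.csv()` produces when its default `check.names = TRUE` rewrites a header into a syntactically valid R name. The sketch below assumes the original headers have the form "SYMBOL (EntrezID)", which is inferred from those mangled names rather than stated in the post. It reads a small inline CSV with `row.names = 1` and `check.names = FALSE`, so the IDs become row names in one step and the headers are kept unchanged.

```r
# Inline toy CSV; the header format "SYMBOL (EntrezID)" is an assumption
# inferred from the mangled names, and the values are made up.
csv_text <- paste(
  ',"TSPAN6 (7105)","TNMD (64102)"',
  "ACH-000001,4.99,0.00",
  "ACH-000002,5.21,0.55",
  sep = "\n"
)

# row.names = 1 uses the first column as row names;
# check.names = FALSE keeps the headers exactly as written.
exp_toy <- read.csv(text = csv_text, row.names = 1, check.names = FALSE)
print(colnames(exp_toy))

# With the default check.names = TRUE, the headers are passed
# through make.names(), which replaces spaces and brackets with dots.
mangled <- make.names("TSPAN6 (7105)")
print(mangled)
```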
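## Choose the liver cancer samples from the expression matrix
ids<-as.character(cell_lines$DepMap_ID)
## %in% flags, for each row of exp, whether its DepMap ID is among the selected cell lines
in_exp<-rownames(exp) %in% ids
## sum() counts the matched rows; length() would only give the total number of rows in exp
sum(in_exp)
exp_liver<-exp[in_exp,]
## Keep only the cell lines that actually have expression data,
## then put the expression rows in the same order as cell_lines
cell_lines<-cell_lines[cell_lines$DepMap_ID %in% rownames(exp_liver),]
exp_liver<-exp_liver[as.character(cell_lines$DepMap_ID),]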
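The matching step is easy to get wrong because the sample table and the expression matrix neither contain the same IDs nor list them in the same order. The sketch below shows one way to align them, using `intersect()` and `match()` on toy data. All IDs, names and values are made up, and the variable names are hypothetical.

```r
# Toy sample table: one ID (ACH-000009) has no expression data.
meta_toy <- data.frame(
  DepMap_ID = c("ACH-000003", "ACH-000001", "ACH-000009"),
  stripped_cell_line_name = c("LINEC", "LINEA", "LINEX"),
  stringsAsFactors = FALSE
)

# Toy expression matrix: one ID (ACH-000002) is not in the sample table.
exp_toy <- matrix(
  c(1, 2,
    3, 4,
    5, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ACH-000001", "ACH-000002", "ACH-000003"),
                  c("GENE1", "GENE2"))
)

# IDs present in both objects, in the order of the sample table.
shared_ids <- intersect(meta_toy$DepMap_ID, rownames(exp_toy))

# Subset and reorder both objects by the shared IDs.
meta_aligned <- meta_toy[match(shared_ids, meta_toy$DepMap_ID), ]
exp_aligned <- exp_toy[shared_ids, , drop = FALSE]

# cbind() pairs rows by position, so verify the alignment first.
stopifnot(identical(meta_aligned$DepMap_ID, rownames(exp_aligned)))
merged_toy <- cbind(meta_aligned, exp_aligned)
print(merged_toy)
```

Checking the IDs with `stopifnot()` before binding turns a silent mismatch between names and expression values into an immediate error.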
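We now have the expression matrix we need, but its rows are labelled with DepMap IDs rather than cell line names, so the next step is to match each ID to its name.
rownames(cell_lines)<-cell_lines[,1]
# cbind() pairs rows by position, not by ID, so first reorder the
# sample information to follow the rows of exp_liver
merge<-cbind(cell_lines[rownames(exp_liver),],exp_liver)
save(merge,file="input_sample_and_exp.Rdata")
rownames(merge)<-merge$stripped_cell_line_name
# Drop the five sample-information columns, keeping only the expression values
matrix<-merge[,-c(1:5)]
# Transpose so that rows are genes and columns are cell lines
matrix<-t(matrix)
# Copy the gene names into an explicit first column
d<-rownames(matrix)
class(d)
d<-as.matrix(d)
matrix<-cbind(d,matrix)
# After the round trip through CSV, the row names and the copied gene-name
# column come back as the columns X and X.1
write.csv(matrix,file="liver_exp.csv")
matrix<- read.csv(file="liver_exp.csv")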
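Remove NA values and tidy the expression matrix
sum(is.na(matrix))
newdata<-na.omit(matrix)
# X.1 holds the gene names; count repeated genes and keep the first occurrence of each
sum(duplicated(newdata$X.1))
h<-newdata[duplicated(newdata$X.1),]
mydata<-newdata[!duplicated(newdata$X.1),]
rownames(mydata)<-mydata$X.1
# Drop the two gene-name columns (X and X.1) by name, so that no cell line column is removed
mydata<-mydata[,!(colnames(mydata) %in% c("X","X.1")),drop=FALSE]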
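The row names of the tidied matrix still look like `TSPAN6..7105.`, that is, a gene symbol followed by what appears to be its Entrez ID. If plain symbols are preferred, a regular expression can strip the suffix. This is a sketch on toy data and an optional extra step, not part of the workflow above. Note that a symbol containing a hyphen or other special character has already been altered by `read.csv()` and cannot be recovered this way.

```r
# Mangled gene names as they appear in the expression preview.
mangled <- c("TSPAN6..7105.", "TNMD..64102.", "DPM1..8813.")

# Remove the trailing "..<digits>." to keep only the gene symbol.
symbols <- sub("\\.\\.[0-9]+\\.$", "", mangled)
print(symbols)

# Toy genes-by-cell-lines matrix; the values are made up.
expr_toy <- matrix(
  c(4.99, 5.21,
    0.00, 0.55,
    7.27, 7.07),
  nrow = 3, byrow = TRUE,
  dimnames = list(mangled, c("LINEA", "LINEB"))
)

# Stripping the IDs could in principle create repeated symbols,
# so keep only the first occurrence of each before renaming.
keep <- !duplicated(symbols)
expr_toy <- expr_toy[keep, , drop = FALSE]
rownames(expr_toy) <- symbols[keep]
print(expr_toy)
```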