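R for bioinformatics: obtaining one-to-one matched tumor and adjacent-normal clinical samples from TCGA, using triple-negative breast cancer as an example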

生信从负数开始学

Last modified: 2023-09-26 16:41:49


Tags: R language, data analysis

First published: 2023-09-26 16:37:28

Article link: https://blog.csdn.net/aydayUP111/article/details/133314833


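The code below is almost entirely my own, and most of it was worked out by trial and error. I later found that it does the opposite of what I actually needed, so I am posting it here as a record that can be looked up when the need arises. The first step is to obtain the gene expression matrix and the clinical information (the phenotype). The steps below then build the clinical information for patients whose tumor and adjacent-normal samples match one-to-one by patient ID; the corresponding expression matrix can be extracted afterwards for downstream analysis.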
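
To make the barcode positions used in the code below concrete, here is a minimal sketch of how a TCGA barcode splits into the pieces this post relies on. The helper `parse_barcode()` is hypothetical (it is not part of any package), and the barcodes are only illustrative inputs, not a statement about which samples exist in the dataset.

```r
# Illustrative barcodes in the standard TCGA layout:
# TCGA-<site>-<participant>-<sample type + vial>-<portion + analyte>-<plate>-<center>
barcodes <- c("TCGA-A7-A13E-01A-11R-A277-07",
              "TCGA-A7-A13E-11A-61R-A12P-07",
              "TCGA-A7-A0DB-06A-11R-A00Z-07")

# Hypothetical helper: cut the fixed-width fields out of each barcode
parse_barcode <- function(x) {
  data.frame(
    barcode     = x,
    patient_id  = substr(x, 9, 12),               # participant code
    sample_type = as.integer(substr(x, 14, 15)),  # 01-09 tumor, 10-19 normal
    vial        = substr(x, 16, 16),              # vial letter, e.g. A, B, C
    stringsAsFactors = FALSE
  )
}

parsed <- parse_barcode(barcodes)
parsed$group <- ifelse(parsed$sample_type < 10, "tumor", "normal")
print(parsed)
```

Note that a metastatic sample (type 06) also falls under "tumor" with this rule, which is why the post filters `-06A` out separately.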

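library(stringr)  # str_sub()
library(dplyr)    # %>% and filter()
# Assumes exp1 is the expression matrix (TCGA barcodes as column names)
# and clinical is the clinical table with a patient_id column
# Characters 14-15 of a TCGA barcode are the sample type code: below 10 is tumor, 10 and above is normal
table(str_sub(colnames(exp1),14,15))
group_list <- ifelse(as.numeric(str_sub(colnames(exp1),14,15))<10,'tumor','normal')
group_list <- factor(group_list,levels = c("normal","tumor"))
table(group_list)
# group_list records the tissue type of every sample in the expression matrix, in the same order as its columns
# Build the mapping between sample barcode and tumor/normal group
colData <- data.frame(TCGAid = colnames(exp1),group_list=group_list)
patient_id <- substr(colData$TCGAid,9,12) # characters 9-12 of the TCGA barcode are the patient ID
# Rebuilding the data frame this way names the columns colData.TCGAid, colData.group_list and patient_id
colData <- data.frame(colData$TCGAid,colData$group_list,patient_id)
# Drop 01B and 11B vials and 06A (metastatic) samples
colData <- colData %>% filter(!grepl('-01B|-11B|-06A',colData.TCGAid))
# colData now maps each sample barcode in the expression matrix to its tissue type (normal or tumor)
clinical_nt <- merge(clinical,colData,by = "patient_id")
colnames(clinical_nt)[colnames(clinical_nt)=="colData.TCGAid"] <- "TCGAID"
colnames(clinical_nt)[colnames(clinical_nt)=="colData.group_list"] <- "group_list"
# Pick the phenotype columns that are needed
colnames(clinical_nt)
phe_clinical <- clinical_nt[,c("patient_id","days_to_birth","days_to_death",
                               "bcr_patient_barcode","TCGAID","group_list","stage_event",
                               "breast_carcinoma_estrogen_receptor_status",
                               "breast_carcinoma_progesterone_receptor_status",
                               "lab_proc_her2_neu_immunohistochemistry_receptor_status")]
# phe_clinical holds the selected phenotype data plus TCGAID, patient_id and the tumor/normal label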

# Next, get the sample IDs for which tumor and adjacent-normal samples match
normal_group <- colData[which(colData$colData.group_list=="normal"),]
nrow(normal_group) # number of normal samples
tumor_group <- colData[which(colData$colData.group_list=="tumor"),]
nrow(tumor_group)  # number of tumor samples
# Keep only tumor samples from patients who also have a normal sample
nt_group <- tumor_group[tumor_group$patient_id %in% normal_group$patient_id,]
# Check which patients have more than one tumor sample and decide which to drop
nt_group[duplicated(nt_group$patient_id),]
normal_group[which(normal_group$patient_id=="A13E"),]
# Filtering with != is safe even if the barcode is absent, unlike -which(), which then drops every row
nt_group1 <- nt_group[nt_group$colData.TCGAid!="TCGA-A7-A13E-01A-11R-A277-07",]
normal_group[which(normal_group$patient_id=="A0DB"),]
nt_group1[which(nt_group1$patient_id=="A0DB"),]
nt_group_2 <- nt_group1[nt_group1$colData.TCGAid!="TCGA-A7-A0DB-01A-11R-A00Z-07",]
nt_group_3 <- nt_group_2[nt_group_2$colData.TCGAid!="TCGA-A7-A0DB-01C-02R-A277-07",]
normal_group_new <- normal_group[normal_group$patient_id%in%nt_group_3$patient_id,]
# Use sum() to check that the pairing is complete; it should equal nrow(nt_group_3)
sum(nt_group_3$patient_id%in%normal_group_new$patient_id)
# normal_group_new and nt_group_3 hold the matched adjacent-normal and tumor sample IDs with their TCGA IDs,
# 98 samples in total; 01B tumor samples (treated here as paraffin-embedded) and 06A metastatic samples were removed
# Stack the two tables row-wise into one data frame to make it easier to select the matching clinical phenotype
sample_normal_tumor <- rbind(nt_group_3,normal_group_new)
# Select the clinical records whose TCGAID belongs to a matched tumor/normal pair
colnames(sample_normal_tumor)[colnames(sample_normal_tumor)=="colData.TCGAid"] <- "TCGAID"
colnames(sample_normal_tumor)[colnames(sample_normal_tumor)=="colData.group_list"] <- "group_list"
sample_phe_clinical <- phe_clinical[phe_clinical$TCGAID %in% sample_normal_tumor$TCGAID,]
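
The code above removes duplicate tumor samples by hard-coding their barcodes. As a rough alternative, the pairing can be sketched without hard-coded IDs. This is only an illustration on made-up barcodes: `pair_samples()` is a hypothetical helper, and its rule of keeping the first barcode in sort order per patient is an assumption that does not reproduce the manual choices made above. The last lines show how the matched IDs could be used to subset an expression matrix, here a toy one.

```r
# Toy sample table using the same column names as colData (barcodes are made up)
toy <- data.frame(
  colData.TCGAid = c("TCGA-XX-0001-01A-11R-0000-07",
                     "TCGA-XX-0001-01C-02R-0000-07",
                     "TCGA-XX-0001-11A-11R-0000-07",
                     "TCGA-XX-0002-01A-11R-0000-07",
                     "TCGA-XX-0003-01A-11R-0000-07",
                     "TCGA-XX-0003-11A-11R-0000-07"),
  stringsAsFactors = FALSE
)
toy$patient_id <- substr(toy$colData.TCGAid, 9, 12)
toy$colData.group_list <- ifelse(
  as.numeric(substr(toy$colData.TCGAid, 14, 15)) < 10, "tumor", "normal")

# Hypothetical helper: keep one tumor and one normal sample per patient,
# and only patients that have both
pair_samples <- function(df) {
  df <- df[order(df$colData.TCGAid), ]
  tumor  <- df[df$colData.group_list == "tumor", ]
  normal <- df[df$colData.group_list == "normal", ]
  # Assumption: the first barcode in sort order is the one to keep
  tumor  <- tumor[!duplicated(tumor$patient_id), ]
  normal <- normal[!duplicated(normal$patient_id), ]
  shared <- intersect(tumor$patient_id, normal$patient_id)
  rbind(tumor[tumor$patient_id %in% shared, ],
        normal[normal$patient_id %in% shared, ])
}

paired <- pair_samples(toy)
print(paired)

# Toy expression matrix: genes in rows, samples in columns
toy_exp <- matrix(seq_len(2 * nrow(toy)), nrow = 2,
                  dimnames = list(c("geneA", "geneB"), toy$colData.TCGAid))
# Keep only the matched samples, in the same order as the paired table
paired_exp <- toy_exp[, paired$colData.TCGAid, drop = FALSE]
```

Which duplicate to keep is a judgment call that depends on the samples, so the sort-order rule should be checked against the actual duplicates before it replaces the manual step.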

Next, filter the clinical phenotype for triple-negative breast cancer (TNBC).

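# List the clinical columns whose names contain "receptor_status"
colnames(clinical)[grep("receptor_status",colnames(clinical),ignore.case = TRUE)]
# grep() returns the indices of the matching column names; ignore.case = TRUE makes the match case-insensitive
# Next, count how many samples are triple-negative
table(sample_phe_clinical$breast_carcinoma_estrogen_receptor_status == 'Negative' &
        sample_phe_clinical$breast_carcinoma_progesterone_receptor_status == 'Negative' &
        sample_phe_clinical$lab_proc_her2_neu_immunohistochemistry_receptor_status == 'Negative')
# Result reported by the author: 116 triple-negative breast cancer patients
# Note: the table above counts rows of sample_phe_clinical, while the subset below is taken from the full clinical table
# which() drops rows where a receptor status is NA instead of returning all-NA rows
TNBC_samples_clinical <- clinical[which(clinical$breast_carcinoma_estrogen_receptor_status == 'Negative' &
                                    clinical$breast_carcinoma_progesterone_receptor_status == 'Negative' &
                                    clinical$lab_proc_her2_neu_immunohistochemistry_receptor_status == 'Negative'),]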
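
The triple-negative filter can also be written so that missing values are handled explicitly. The sketch below runs on a made-up clinical table whose column names follow the code above. `is_tnbc()` is a hypothetical helper, and treating a missing or non-"Negative" status (for example "Equivocal") as not triple-negative is an assumption of this sketch.

```r
# Toy clinical table (values are made up); column names follow the code above
toy_clinical <- data.frame(
  patient_id = c("0001", "0002", "0003", "0004"),
  breast_carcinoma_estrogen_receptor_status =
    c("Negative", "Negative", "Positive", NA),
  breast_carcinoma_progesterone_receptor_status =
    c("Negative", "Negative", "Negative", "Negative"),
  lab_proc_her2_neu_immunohistochemistry_receptor_status =
    c("Negative", "Equivocal", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

receptor_cols <- c("breast_carcinoma_estrogen_receptor_status",
                   "breast_carcinoma_progesterone_receptor_status",
                   "lab_proc_her2_neu_immunohistochemistry_receptor_status")

# Hypothetical helper: TRUE only when every receptor column is exactly "Negative".
# A missing status is counted as "not confirmed negative" (assumption).
is_tnbc <- function(df, cols) {
  neg <- df[, cols, drop = FALSE] == "Negative"  # logical matrix, NA where status is missing
  rowSums(neg, na.rm = TRUE) == length(cols)
}

tnbc_flag <- is_tnbc(toy_clinical, receptor_cols)
table(tnbc_flag)
toy_tnbc <- toy_clinical[tnbc_flag, ]
print(toy_tnbc)
```

With the real tables, one way to restrict the matched samples to TNBC patients would be `sample_normal_tumor[sample_normal_tumor$patient_id %in% TNBC_samples_clinical$patient_id, ]`, assuming `patient_id` uses the same four-character code in both tables.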
