单细胞多组学数据介绍①——甲基化数据

一、甲基化数据格式介绍

我们拿到的数据是经过处理后的二手数据,来自文章“Multi-omics profiling of mouse gastrulation at single cell resolution”。其中甲基化数据分两个level,分别是cpg level和feature level。

1.cpg level data

cpg_level:在CpG水平上测量每个细胞的DNA甲基化
如图是1140个样本的cpg_level甲基化数据,测了小鼠22条染色体近一亿个碱基位点的cpg level。

我们随意找一个“.tsv.gz”用记事本打开,数据呈现如下分布。“chr”代表1-22号染色体,“pos”代表碱基,“met_reads”代表“甲基化” reads,rate=met_reads/(met_reads+nonmet_reads)123243454

2.feature level data

feature_level:在基因组特征水平(启动子、增强子等)聚集的每个细胞的DNA甲基化测量结果。
feature_level是在不同基因组特征标识下进行的ChIP-seq。这涉及到不少的组蛋白修饰的背景知识,通过组蛋白修饰来区分不同的基因组元件,比如H3K4me1可以作为增强子的标志,H3K4me3可作为启动子标志。包括其他的长散在元件,短散在元件等等都有各自的标志。放一张feature level data 的图。

3. loading data

1.确定loading路径

> io <- list()
> io$sample.metadata <- "scnmt_gastrulation/sample_metadata.txt"
> io$feature_data <- "scnmt_gastrulation/met/feature_level"
> io$cpg_data <- "scnmt_gastrulation/met/cpg_level"
##大家的路径不同要根据自己的文件位置调整哦

2.确定要导入哪些基因特征annotation下的数据
这里我们任意选四个annotation,“=”前面就是上面提到的feature level下的文件名去掉后缀,“=”后面是对应的基因元件

> opts$annos <- c(
  "prom_2000_2000"="Promoters",
  "H3K27ac_distal_E7.5_Mes_intersect12"="Mesoderm enhancers",
  "H3K27ac_distal_E7.5_Ect_intersect12"="Ectoderm enhancers",
  "H3K27ac_distal_E7.5_End_intersect12"="Endoderm enhancers"
)

3.Load DNA methylation data

##Load DNA methylation data on feature level
> feature_data <- lapply(names(opts$annos), function(n)
>  fread(sprintf("%s/%s.tsv.gz",io$feature_data,n), showProgress=F) %>%
    setnames(c("id_met","id","anno","Nmet","Ntotal","rate"))
) %>% rbindlist

二、分析方法

1./QC/: 质量控制(cpg_level)

1.load R package

> library(data.table)
> library(purrr)
> library(ggplot2)

2. Load sample metadata

# Define which cells to use
> opts <- list()
> opts$cells <- fread("scnmt_eb/sample_metadata.txt") %>%
 .[!is.na(id_met),id_met]
# coverage阈值
> opts$met_coverage_threshold <- 5e4
> sample_metadata <- fread("scnmt_eb/sample_metadata.txt") %>% 
  .[id_met%in%opts$cells]
> colnames(sample_metadata)
 [1] "sample"                    "id_rna"                    "id_met"                    "id_acc"                   
 [5] "day"                       "day2"                      "genotype"                  "lineage10x"               
 [9] "lineage10x_2"              "stage.mapped"              "celltype.multinomial.prob" "pass_rnaQC"               
[13] "pass_metQC"                "pass_accQC"               
> 

3.Load methylation data

###-- Load methylation data and calculate QC statistics per sample -->
> stats <- data.table(id_met=opts$cells, coverage=as.numeric(NA))
> for (cell in opts$cells) {
  if (file.exists(sprintf("%s/%s.tsv.gz","scnmt_eb/met/cpg_level",cell))) {
    tmp <- fread(cmd=sprintf("zcat < %s/%s.tsv.gz","scnmt_eb/met/cpg_level",cell), sep="\t", verbose=F, showProgress=F) %>% .[,c(1,2,5)] %>% setnames(c("chr","pos","rate"))
    stats[id_met==cell,coverage:=nrow(tmp)]
    stats[id_met==cell,mean:=mean(tmp$rate, na.rm=T)]
  } else {
    print(sprintf("Sample %s not found",cell))
  }
}
> stats <- stats[!is.na(coverage)]
> stats[1:5,1:3]
                         id_met coverage      mean
1: TET_embryoid_body_Plate1_E01   263182 0.7163446
2: TET_embryoid_body_Plate1_D01   239636 0.4828949
3: TET_embryoid_body_Plate1_G01    50872 0.6656707
4: TET_embryoid_body_Plate1_F01   210232 0.7768180
5: TET_embryoid_body_Plate1_A01    26647 0.7687545


4.Plot QC statistics

> tmp <- stats[,c("id_met","coverage")] %>% setkey(coverage) %>% .[,id_met:=factor(id_met,levels=id_met)]
> tmp$cellcolor <- c("black","red")[as.numeric(tmp$coverage < opts$met_coverage_threshold)+1]
> p1 <- ggplot(tmp, aes(x=id_met, y=coverage)) +
  geom_bar(stat="identity", position="dodge", fill="#F8766D", color="#F8766D") +
  labs(title="", x="", y="Number of observed CpG sites") +
  geom_hline(yintercept=opts$met_coverage_threshold, colour="black", linetype="dashed") +
  barplot_theme() +
  scale_y_continuous(expand=c(0,0), limits=c(0,4e+6)) +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )
> print(p1)


虚线为阈值,我们设置的是coverage</5000。

2. /dimensionality_reduction/:降维

单细胞甲基化数据存在大量缺失值,具体的降维方法可参考另一篇介绍R语言下多组学因子分析MOFA ——单细胞甲基化分析

3. /distributions/:绘制全基因组分布

因为整体甲基化水平并不能很具体的说明一些生物学问题,所以以各种基因元件为背景的甲基化水平在各个胚层细胞和不同时期细胞之间的动态变化是目前单细胞甲基化研究的主要方式,也就是前面提到的“feature_level ”。

先放一张我们要复现的目标图像
在这里插入图片描述
作者的结论是:与中胚层和内胚层增强子相反,外胚层增强子早在E4.5的表皮细胞中就已打开并去甲基化。在中胚层命运的细胞中,外胚层增强子才被部分抑制。

1.导入数据

load R package

> library(data.table)
> library(purrr)
> library(ggplot2)
> library(ggpubr)
> library(stringi)

Define path

> io <- list()
> io$sample.metadata <- "scnmt_gastrulation/sample_metadata.txt"
> io$data <- "scnmt_gastrulation/met/feature_level"

Define which stage and lineages to use (use NULL for all)

> opts <- list()
> opts$stage_lineage <- NULL
# Define which annotations to use
> opts$annos <- c(
  "prom_2000_2000"="Promoters",
  "H3K27ac_distal_E7.5_Mes_intersect12"="Mesoderm enhancers",
  "H3K27ac_distal_E7.5_Ect_intersect12"="Ectoderm enhancers",
  "H3K27ac_distal_E7.5_End_intersect12"="Endoderm enhancers"
)

Define which cells to use

> tmp <- fread(io$sample.metadata) %>% .[!is.na(id_met)] 
> if (is.null(opts$stage_lineage)) {
  opts$cells <- tmp[,id_met]
} else {
  opts$cells <- tmp %>%
    .[,stage_lineage:=paste(stage,lineage10x_2,sep="_")] %>%
    .[pass_metQC==T & stage_lineage%in%opts$stage_lineage,id_met]
}
> dim(tmp)
[1] 1140   12
> colnames(tmp)
 [1] "sample"       "id_rna"       "id_met"       "id_acc"       "embryo"       "plate"        "pass_rnaQC"   "pass_metQC"

Load sample metadata and DNA methylation data

> sample_metadata <- fread(io$sample.metadata) %>% 
  .[,c("id_met","stage","lineage10x_2")] %>%
  .[id_met%in%opts$cells] %>%
  .[,stage_lineage:=paste(stage,lineage10x_2,sep="_")]
> dim(sample_metadata)
[1] 1140    4
> sample_metadata[1:5,1:4]
                    id_met stage       lineage10x_2           stage_lineage
1: E4.5-5.5_new_Plate3_E09  E4.5           Epiblast           E4.5_Epiblast
2: E4.5-5.5_new_Plate3_G09  E4.5           Epiblast           E4.5_Epiblast
3: E4.5-5.5_new_Plate3_H09  E4.5           Epiblast           E4.5_Epiblast
4: E4.5-5.5_new_Plate3_B09  E4.5           Epiblast           E4.5_Epiblast
5: E4.5-5.5_new_Plate4_F01  E4.5 Primitive_endoderm E4.5_Primitive_endoderm

> data <- lapply(names(opts$annos), function(n)
  fread(sprintf("%s/%s.tsv.gz",io$data,n), showProgress=F) %>%
    setnames(c("id_met","id","anno","Nmet","Ntotal","rate"))
) %>% rbindlist

> data[1:5,1:6]
                    id_met                 id           anno Nmet Ntotal rate
1: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000001 prom_2000_2000    1      3   33
2: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000028 prom_2000_2000    2     29    7
3: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000037 prom_2000_2000    1      1  100
4: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000049 prom_2000_2000    1     22    5
5: E4.5-5.5_new_Plate1_A02 ENSMUSG00000000058 prom_2000_2000    0      5    0

2. 解析数据

合并methylation data 和 sample metadata

> data <- data %>% merge(sample_metadata, by="id_met")

计算rate值(rate=100*(sum(Nmet)/sum(Ntotal)))

> data.pseudobulk <- data %>% 
  .[,.(rate=100*(sum(Nmet)/sum(Ntotal))),by=c("stage","id","anno")]

plot

> p <- gghistogram(data.pseudobulk, x = "rate", y="..density..", fill = "#F37A71", color = "black", alpha=0.75) +
  facet_wrap(~stage+anno, nrow=4, scales="fixed") +
  theme(
    axis.text.y = element_text(size=rel(0.9))
  )
> print(p)

结果如图:
在这里插入图片描述

  • 3
    点赞
  • 12
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值