cfDNAPro｜cfDNA片段数据生物学表征及可视化的R包

本文介绍了cfDNAPro这款工具，用于分析cfDNA的数据特性，包括片段长度的可视化、分布比较、模态长度、振荡模式以及利用ggplot2进行数据美化。这些功能有助于非侵入性诊断和肿瘤研究中的数据解读。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

文章目录

前言

cfDNA（无细胞DNA，游离DNA，Circulating free DNA or Cell free DNA）是指在血液循环中存在的DNA片段。这些DNA片段不属于任何细胞，因此被称为“无细胞”或“游离”的。cfDNA来源广泛，可以来自正常细胞和病变细胞（如肿瘤细胞）的死亡和分解过程。cfDNA的长度通常在160-180碱基对左右，这与核小体保护的DNA片段长度相符。

cfDNA的研究对于非侵入性诊断、疾病监测、早期检测以及了解生理和病理状态具有重要意义。特别是在肿瘤学领域，通过分析循环肿瘤DNA（ctDNA），即来源于肿瘤细胞的cfDNA，可以获取肿瘤的遗传信息，从而指导癌症的诊断、治疗选择和治疗效果监测。

cfDNAPro

主要功能：

数据表征：计算片段大小分布的整体、中位数和众数，以及片段大小轮廓中的峰和谷，还有振荡周期性。
数据可视化：提供了多种函数来可视化这些数据，包括整体到单个片段的可视化、度量可视化、模式和摘要可视化等。

demo

1.片段长度可视化

上图：横轴表示片段长度，范围为30bp至500bp。纵轴表示具有特定读取长度的读取比例。这里的线并不是平滑曲线，而是连接不同数据点的直线。
下图：首先统计长度小于或等于30bp的读取数量（例如N），然后将其归一化为比例。重复这一过程，直至处理完所有片段长度（即30bp, 31bp, …, 500bp），然后以线图的形式呈现。与非累积图一样，这里的线也是连接各个数据点，而不是平滑曲线。

library(scales)
library(ggpubr)
library(ggplot2)
library(dplyr)


# Define a list for the groups/cohorts.
grp_list<-list("cohort_1"="cohort_1",
               "cohort_2"="cohort_2",
               "cohort_3"="cohort_3",
               "cohort_4"="cohort_4")

# Generating the plots and store them in a list.
result<-sapply(grp_list, function(x){
  result <-callSize(path = data_path) %>% 
    dplyr::filter(group==as.character(x)) %>% 
    plotSingleGroup()
}, simplify = FALSE)
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.

# Multiplexing the plots in one figure
suppressWarnings(
  multiplex <-
    ggarrange(result$cohort_1$prop_plot + 
              theme(axis.title.x = element_blank()),
            result$cohort_4$prop_plot + 
              theme(axis.title = element_blank()),
            result$cohort_1$cdf_plot,
            result$cohort_4$cdf_plot + 
              theme(axis.title.y = element_blank()),
            labels = c("Cohort 1 (n=5)", "Cohort 4 (n=4)"),
            label.x = 0.2,
            ncol = 2,
            nrow = 2))

multiplex

2.片段长度分布比较

callMetrics：计算了每个组的中位片段大小分布
上图：每个队列中位数片段大小分布的比例。y轴显示读取比例，x轴显示片段大小。图中显示的线不是平滑的曲线，而是连接不同数据点的线
下图：中位数累积分布函数(CDF)的图形。y轴显示累积比例，x轴仍然显示片段大小。这是一个逐步上升的图形，反映了不同片段大小下读取的累积分布情况。

# Set an order for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generate plots.
compare_grps<-callMetrics(data_path) %>% plotMetrics(order=order)
#> setting default input_type to picard.

# Modify plots.
p1<-compare_grps$median_prop_plot +
  ylim(c(0, 0.028)) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size=12,face="bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text( size = 11),
        legend.title = element_blank())

p2<-compare_grps$median_cdf_plot +
  scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
  theme(axis.title=element_text(size=12,face="bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text( size = 11),
        legend.title = element_blank())

# Finalize plots.
suppressWarnings(
  median_grps<-ggpubr::ggarrange(p1,
                       p2,
                       label.x = 0.3,
                       ncol = 1,
                       nrow = 2
                       ))


median_grps

3.可视化DNA片段模态长度

柱状图：这里的模态片段大小是指在样本中出现次数最多的DNA片段长度

# Set an order for your groups, it will affect the group order along x axis!
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode bin chart.
mode_bin <- callMode(data_path) %>% plotMode(order=order,hline = c(167,111,81))
#> setting default mincount as 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(mode_bin))

堆叠柱状图：可以看到每个组中不同长度片段的分布

# Set an order for your groups, it will affect the group order along x axis.
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode stacked bar chart. You could specify how to stratify the modes
# using 'mode_partition' arguments. If other modes exist other than you 
# specified, an 'other' group will be added to the plot.

mode_stacked <- 
  callMode(data_path) %>% 
  plotModeSummary(order=order,
                  mode_partition = list(c(166,167)))
#> setting default input_type to picard.

# Modify the plot using ggplot syntax.
mode_stacked <- mode_stacked + theme(legend.position = "top")

# Show the plot.
suppressWarnings(print(mode_stacked))

4.片段化振荡模式比较

间峰距离：通过测量和比较间距距离（峰值之间的距离），比较不同队列中的10bp周期性振荡模式

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-peak distances.

  inter_peak_dist<-callPeakDistance(path = data_path,  limit = c(50, 135)) %>%
  plotPeakDistance(order = order) +
  labs(y="Fraction") +
  theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0.
#>  setting the xlim to c(7,13). 
#>  setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.


# Show the plot.
suppressWarnings(print(inter_peak_dist))

间谷距离：与之前介绍的间峰距离可视化相比，间谷距离的可视化重点在于表示读取次数下降的区域，而不是上升的区域。这两个图表的区别在于它们关注的是碎片大小谱的不同特点，一个是峰点（即频率的局部最高点），另一个是谷点（即频率的局部最低点）。

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_valley_dist<-callValleyDistance(path = data_path,  
                                      limit = c(50, 135)) %>%
  plotValleyDistance(order = order) +
  labs(y="Fraction") +
  theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting the mincount to 0. 
#>  setting the xlim to c(7,13). 
#>  setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_valley_dist))

5. ggplot2美化

library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path=exam_path) %>% plotSingleGroup(vline = NULL)
#> setting default outfmt to df.
#> setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop + 
  coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
  geom_vline(xintercept=peaks$insert_size, colour="red",linetype="dashed") +
  geom_vline(xintercept = valleys$insert_size,colour="blue")

# Show the plot.
suppressWarnings(print(exam_plot_prop))

# Label peaks and valleys with dots.
exam_plot_prop_dot<- exam_plot_all$prop + 
  coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
  geom_point(data= peaks, 
             mapping = aes(x= insert_size, y= prop),
             color="blue",alpha=0.5,size=3) +
  geom_point(data= valleys, 
             mapping = aes(x= insert_size, y= prop),
             color="red",alpha=0.5,size=3) 
# Show the plot.
suppressWarnings(print(exam_plot_prop_dot))