使用coloc进行QTL数据的共定位分析

摘要#

共定位分析旨在确定两个性状在给定基因组区域中可能共享的因果变异,本文中所说的共定位是基于贝叶斯推断的共定位分析,使用的软件是R语言中的coloc包。


抛砖引玉-共定位的原理与算法#

官方对于coloc的介绍如下:

The coloc package can be used to perform genetic colocalisation analysis of two potentially related phenotypes, to ask whether they share common genetic causal variant(s) in a given region

因此,共定位的目的是为了检验输入的两种表型在给定的区域内是否共享同一个因果变异,本文中的共定位是基于贝叶斯推断的共定位,关于贝叶斯推断的原理,请参考以下的资料:

  1. 文献1:Bayesian Test for Colocalisation between Pairs of Genetic Association Studies Using Summary Statistics

  2. 文献2:Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses

  3. coloc官方文档

共定位分析的准备工作#

相信你在阅读官方文档时注意到了这段话:

A single genomic region does not correspond to the whole genome, nor to a whole chromosome. Coloc also does not automatically split chromosomes or a genome into regions. It is assumed the user can look at their data, identify a region with overlapping GWAS signals between two studies, and decide on the set of SNPs to include.

意思是软件不会为你自动选择共定位区域,而是把共定位区域的选择工作交给用户来完成,而且强调了共定位区域不能为整条染色体或者全基因组。因此,定义共定位区域是共定位分析中特别重要的一部分,而共定位区域是为了共定位数据对(也就是你要检验的两种共定位表型)而服务的,所以一般的流程是先确定要检验的表型,再确定要检验的区域

待检验表型的选择#

所谓待检验的表型,指的是最后用来计算共定位概率的配对数据。在GWAS分析中,一般只针对一个性状进行关联分析,而在QTL分析中,往往同时对很多表型进行关联,例如在eQTL分析中,每一个基因都代表一个表型,在caQTL中,每一个开放区域都代表一个表型。此时,我们检验的表型就是每一个基因-开放区域配对数据,因此,我们需要首先确定所有的配对数据,然后为它们分别指定共定位区域。

在这里,我参考了coloc官网的推荐文献,进行了简单的整理,感兴趣的学者可以阅读一下原文进行细节的探究:

共定位类型文章地址共定位区域
eQTL-meQTLNature communications 2018leadSNP 上下游 250kb
eQTL-GWASNature 2017独立 GWAS 上下 1mb
eQTL-pQTLNature 2018按照遗传距离划分区域
meQTL-GWASAJRCCM 2018甲基化位点上下游 500kb
pQTL-eQTLNature Communications 2018lead-pSNP 上下 1mb
caQTL-GWAS/eQTLNature Communications 2018其他研究确定的区域
GWAS-GWASInflammatory Bowel Diseases 2018每一个 SNP 的上下 50kb

这些文献中,很多都提到了一个概念独立位点,独立位点指的是一个区域中,没有其他SNP与此SNP的连锁程度超过给定阈值,这个阈值一般是r2<0.2r2<0.2),选择连锁强度最高的lead-meSNP对应的甲基化探针

  • 使用每一个基因的上下游各1mb范围作为共定位区域进行分析
  • 下载练习数据#

    下面,我将使用公共数据库中的数据进行共定位,所使用的数据分别是来自PancanQTL[1]的cis-eQTL数据和来自Pancan-meQTL[2]的cis-meQTL数据,这里使用乳腺癌的数据,如果想根据这篇教程练习,请先下载相关数据,下载地址如下:

    处理原始数据#

    推荐大家使用data.table进行数据的读取,使用tidyverse进行数据处理,代码简洁优雅,功能强大。假设大家已经把数据导入,并命名为eQTLmeQTL,下面的分析都以此为基础进行分析。

    suppressMessages(library(tidyverse))
    suppressMessages(library(data.table))
    
    数据预处理#

    此处的MAF是用TCGA的数据进行计算的,这里提供下载,下载地址

    # 导入MAF信息
    maf = fread("BRCA_MAF.txt")
    meQTL = fread("BRCA_tumor.cis_meQTL.tsv") %>%
        rename(
            pvalue   = `p-value`,
            `t-stat` = status
        ) %>%
        mutate(varbeta = (beta / `t-stat`) ^ 2) %>%
        separate(
            col    = "alleles",
            into   = c("ref", "alt"),
            sep    = "/",
            fill   = "right",
            remove = TRUE
        ) %>%
        separate(
            col    = "snp_position",
            into   = c("chr", "position"),
            sep    = ":",
            fill   = "right",
            remove = TRUE
        ) %>%
        mutate_at(
            "position",
            as.numeric
        ) %>%
        left_join(
            y = maf,
            by = c("snp", "position", "ref", "alt")
        ) %>%
        select(snp, chr, position, ref, alt, maf, probes, beta, varbeta, pvalue)
    eQTL = fread("BRCA_cis_eQTL_fdr005.tsv") %>%
        rename(pvalue  = `p-value`) %>%
        mutate(varbeta = (beta / `t-stat`) ^ 2) %>%
        separate(
            col    = "alleles",
            into   = c("ref", "alt"),
            sep    = "/",
            fill   = "right",
            remove = TRUE
        ) %>%
        left_join(
            y  = maf,
            by = c("snp", "position", "ref", "alt")
        ) %>%
        select(snp, chr, position, ref, alt, maf, gene, beta, varbeta, pvalue)
    
    提取leadQTL#
    lead_eSNP = eQTL %>%
        group_by(gene) %>%
        arrange(pvalue) %>%
        distinct(gene, .keep_all = TRUE) %>%
        rename_at(
            vars(c("beta", "varbeta", "pvalue")))
    
    lead_meSNP = meQTL %>%
        group_by(probes) %>%
        arrange(pvalue) %>%
        distinct(probes, .keep_all = TRUE)
    
    定义共定位数据对#
    coloc_pair_list = apply(
        lead_eSNP %>%
            mutate(gene2 = gene) %>%
            column_to_rownames("gene2") %>%
            mutate_at(
                .vars = vars(c("position", ends_with("_eqtl"))),
                .funs = as.numeric
            ),
        MARGIN = 1,
        FUN = function(x) {
            result                   = as.list(x)
            result[["maf"]]          = as.numeric(result[["maf"]])
            result[["position"]]     = as.numeric(result[["position"]])
            result[["beta_eqtl"]]    = as.numeric(result[["beta_eqtl"]])
            result[["varbeta_eqtl"]] = as.numeric(result[["varbeta_eqtl"]])
            result[["pvalue_eqtl"]]  = as.numeric(result[["pvalue_eqtl"]])
            return(result)
        },
        simplify = FALSE
    )
    
    找出meQTL中与每一个lead-eSNP重合的meSNP#
    overlapped_meSNP = lapply(
        X = coloc_pair_list,
        FUN = function(x) {
            meQTL %>% filter(snp %in% x[["snp"]])
        }
    )
    
    计算连锁强度并确定最高连锁强度的甲基化探针#

    这一步明白原理就可以了,计算LD的方法有很多种,这里选择了LDlinkR软件包,这个工具的在线网页服务地址见我的友情链接。下面的代码会将连锁强度最高的甲基化探针以及对应的统计量信息输出。

    lead_probe = mapply(
        FUN = function(coloc_list, meqtl_overlap) {
            # 没有重叠的情况直接输出空结果
            if(length(meqtl_overlap[["snp"]]) == 0) {
                return(list(
                    probe        = NA,
                    beta_meqtl   = NA,
                    varbeta      = NA,
                    pvalue_meqtl = NA
                ))
            }
            # 多个meSNP对应到同一个探针时,直接输出探针
            if(length(unique(meqtl_overlap[["probes"]])) == 1) {
                return(
                    list(
                        probe         = meqtl_overlap[["probes"]][1],
                        beta_meqtl    = meqtl_overlap[["beta"]][1],
                        varbeta_meqtl = meqtl_overlap[["varbeta"]][1],
                        pvalue_meqtl  = meqtl_overlap[["pvalue"]][1]
                    )
                )
            } else {
                lead_esnp = coloc_list[["snp"]]
                # 获得重合meSNP对应的探针的lead—meSNP用来计算LD
                lead_mesnps_query = unique(lead_meSNP$snp[lead_meSNP$probes %in% meqtl_overlap$probes])
                ld_with_lead_esnp = sapply(
                    X = lead_mesnps_query,
                    FUN = function(x) {
                        if (length(x) == 0) {
                            return(0)
                        } else {
                            tryCatch({
                                ld_matrix = LDlinkR::LDpair(
                                    var1  = x,
                                    var2  = lead_esnp,
                                    pop   = "CEU",
                                    # 这里的token需要自己申请
                                    token = "自己申请"
                                )
                                return(ld_matrix[["r2"]][1])
                            },
                            error = function(error) {
                                print(paste(x, lead_esnp, error, sep = "\t"))
                                return(0)
                            })
                        }
                    },
                    simplify = "array"
                )
                max_ld = max(ld_with_lead_esnp)
                if(max_ld == 0 | is.na(max_ld)) {
                    return(list(
                        probe        = NA,
                        beta_meqtl   = NA,
                        varbeta      = NA,
                        pvalue_meqtl = NA
                    ))
                } else {
                    max_ld_snp = names(ld_with_lead_esnp)[which(ld_with_lead_esnp == max_ld)]
                    # 如果多个probe都有最大连锁,取最显著的
                    max_ld_probe   = (lead_meSNP %>% filter(snp %in% max_ld_snp) %>% arrange(pvalue))$probes[1]
                    max_ld_beta    = (lead_meSNP %>% filter(snp %in% max_ld_snp) %>% arrange(pvalue))$beta[1]
                    max_ld_varbeta = (lead_meSNP %>% filter(snp %in% max_ld_snp) %>% arrange(pvalue))$varbeta[1]
                    max_ld_pvalue  = (lead_meSNP %>% filter(snp %in% max_ld_snp) %>% arrange(pvalue))$pvalue[1]
                    return(
                        list(
                            probe        = max_ld_probe,
                            beta_meqtl   = max_ld_beta,
                            varbeta      = max_ld_varbeta,
                            pvalue_meqtl = max_ld_pvalue
                        )
                    )
                }
            }
        },
        coloc_pair_list,
        overlapped_meSNP,
        SIMPLIFY = FALSE
    )
    
    提取共定位区域内的所有SNP#

    我们定义共定位区域为每一个基因的顺式关联窗口,也就是上下游各1mb,由于我们使用的eQTL的关联距离也是1mb,因此我们直接提取与每一个基因关联的所有cis-eQTL,它们都位于我们的共定位区域中。

    # 提取共定位区域内的所有eSNP
    eSNP_by_gene = sapply(
        X = unique(eQTL[["gene"]]),
        FUN = function(egene) {
            eQTL %>% filter(gene == egene)
        },
        simplify = FALSE,
        USE.NAMES = TRUE
    )
    
    # 提取共定位区域内的所有meSNP
    meSNP_by_gene = lapply(
        X = eSNP_by_gene,
        FUN = function(esnps) {
            meQTL %>% filter(snp %in% esnps[["snp"]])
        }
    )
    
    进行共定位分析#

    共定位的时候需要注意,有很多可能导致数据出问题的情况,比如没有与eQTL重叠的meQTL,所使用的SNP不是单方向突变而无法计算LD等等,在写函数的时候需要设计一个异常捕获来处理这些情况。

    coloc_result = mapply(
        FUN = function(df_1, df_2, n_1, n_2, type_1 = "quant", type_2 = "quant") {
        # 如果没有重叠的SNP,就跳过共定位
            if (nrow(df_1) == 0 | nrow(df_2) == 0) {
                return(
                    list(
                        n_snps = 0,
                        PP3    = 0,
                        PP4    = 0
                    )
                )
            }
            df_1 = df_1 %>%
                rename(MAF = maf, p = pvalue) %>%
                arrange(p) %>%
                # coloc要求不能有重复的SNP,所以只保留更显著的
                distinct(snp, .keep_all = TRUE) %>%
                select(snp, position, p, beta, varbeta, MAF) %>%
                filter(!is.na(MAF)) %>%
                as.list()
            df_1[["N"]] = n_1
            df_1[["type"]] = type_1
    
            df_2 = df_2 %>%
                rename(MAF = maf, p = pvalue) %>%
                arrange(p) %>%
                distinct(snp, .keep_all = TRUE) %>%
                select(snp, position, p, beta, varbeta, MAF) %>%
                filter(!is.na(MAF)) %>%
                as.list()
            df_2[["N"]] = n_2
            df_2[["type"]] = type_2
    
            if (length(df_1[["snp"]])== 0 | length(df_2[["snp"]]) == 0) {
                return(
                    list(
                        n_snps = 0,
                        PP3    = 0,
                        PP4    = 0
                    )
                )
            }
    
            if (is.null(coloc::check_dataset(df_1)) & is.null(coloc::check_dataset(df_2))) {
                invisible(capture.output({
                    coloc_result = coloc::coloc.abf(
                        dataset1 = df_1,
                        dataset2 = df_2
                    )
                }))
                return(
                    list(
                        n_snps = coloc_result[["summary"]][["nsnps"]],
                        PP3    = coloc_result[["summary"]][["PP.H3.abf"]],
                        PP4    = coloc_result[["summary"]][["PP.H4.abf"]]
                    )
                )
            } else {
                return(
                    list(
                        n_snps = 0,
                        PP3    = 0,
                        PP4    = 0
                    )
                )
            }
        },
        df_1      = eSNP_by_gene,
        df_2      = meSNP_by_gene,
        n_1       = 1092,
        n_2       = 664,
        SIMPLIFY  = FALSE,
        USE.NAMES = TRUE
    )
    
    合并所有数据#

    在提取lead-eSNP的过程中,我们获得了共定位的基因,然后我们通过计算连锁强度获得了共定位的甲基化探针,最后我们进行了共定位检验,现在我们将这些信息合并到一个列表中,最终生成一个数据表方便输出,现在我们有三个列表,分别保存着SNP的信息与eQTL的基因信息、甲基化探针信息、共定位结果

    # 把所有的必要信息组合起来
    final_coloc_result_list = mapply(
        FUN = function(coloc_pairs_eqtl, coloc_pairs_meqtl, coloc_result) {
            return(c(
                coloc_pairs_eqtl,
                coloc_pairs_meqtl,
                coloc_result
            ))
        },
        coloc_pairs_eqtl  = coloc_pair_list,
        coloc_pairs_meqtl = lead_probe,
        coloc_result      = coloc_result,
        SIMPLIFY          = FALSE
    )
    
    # 将列表合并成数据框
    final_coloc_result_table = do.call(rbind, final_coloc_result_list) %>%
        as.data.frame() %>%
        # 转换后数据类型全部丢失,需要手动设置回来
        mutate_at(
            .vars = vars(c("snp", "chr", "ref", "alt", "gene", "probe")),
            .funs = as.character
        ) %>%
        mutate_at(
            .vars = vars(-c("snp", "chr", "ref", "alt", "gene", "probe")),
            .funs = as.numeric
        ) %>%
        mutate_all(
            .funs = function(x) {
                ifelse(is.na(x) | x == "NA", NA, x)
            }
        ) %>%
        # 如果探针缺失,则共定位信号为0
        mutate(PP4 = ifelse(is.na(probe), 0, PP4)) %>%
        arrange(desc(PP4))
    

    合并后的部分共定位结果如下所示:

    snpchrpositionrefaltgenebeta_eqtlvarbeta_eqtlmafpvalue_eqtlprobebeta_meqtlvarbeta_meqtlpvalue_meqtln_snpsPP3PP4
    rs2239961chr2221363960AGFLJ395820.670.00153158850.22023813.21e-58cg17353431-0.820.00106220032807521.4e-97401
    rs9896243chr1744826056CGLRRC37A20.70.00179763670.158882781.01e-54cg015701820.860.003811488990434691.01e-38201
    rs4982912chr1424903284CTCBLN3-0.50.0012311480.317307692.57e-42cg131059040.640.001733852755433211.42e-45301
    rs9897978chr1713900256GTCDRT15P0.490.00126626410.347985357.77e-40cg11395062-0.490.002009795345745911.25e-25101
    rs11868472chr1774701165ACMXRA70.340.00077794770.435439564.19e-32cg27546012-0.420.00147119534621881.1e-25101
    rs4751321chr10132897429ACTCERG1L-0.450.00156640510.246794872.39e-28cg09858951-0.450.00210419998316642.85e-21101
    rs16831518chr1160146645CTATP1A40.380.00160337370.194139191.45e-20cg19308497-0.510.001953119133508953.91e-2834.70379291075399e-151
    rs9896243chr1744826056CGLRRC37A0.380.00188604080.158882788.53e-18cg015701820.860.003811488990434691.01e-3821.47792889038308e-150.999999999999929
    rs7132019chr126992122AGSPSB2-0.490.00064793390.379578754.09e-71cg26269324-0.670.001912634894232012.26e-45501.21303855863666e-130.999999999999886
    rs2992756chr118807339TCKLHDC7A0.390.00093564010.471495339.11e-35cg05040210-0.560.001718166930655869.17e-37199.57129487693459e-130.999999999999034

    参考资料#


    1. Jing Gong, Shufang Mei, Chunjie Liu, Yu Xiang, Youqiong Ye, Zhao Zhang, Jing Feng, Renyan Liu, Lixia Diao, An-Yuan Guo, Xiaoping Miao, Leng Han, PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D971–D976, https://doi.org/10.1093/nar/gkx861 ↩︎

    2. Jing Gong, Hao Wan, Shufang Mei, Hang Ruan, Zhao Zhang, Chunjie Liu, An-Yuan Guo, Lixia Diao, Xiaoping Miao, Leng Han, Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer, Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D1066–D1072, https://doi.org/10.1093/nar/gky814 ↩︎

  • 12
    点赞
  • 68
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值