你需要理解一下“偏相关系数”及R语言实现-CSDN博客

本文链接：https://blog.csdn.net/nixiang_888/article/details/122120926

本文通过一个具体的案例，探讨了在存在混杂因素的情况下，如何使用偏相关系数来更准确地评估变量间的相关性。首先介绍了偏相关系数的概念，并通过冷饮销售量与游泳人数的相关性分析，展示了偏相关系数的实际应用价值。接着详细解释了偏相关系数的计算过程，并提供了R语言实现的方法。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

一、背景

提起相关系数，我们最常见的是“Pearson”，“Spearman"等相关系数。但是，我们有时候常常忽略“偏”相关系数。其实，最近在做医学相关的项目时候，遇到这样的问题。

比如，在需求糖尿病患者中，异常代谢无与病人发病时间的关联时，由于糖尿病受到各种其他临床因素的影响，这时在进行相关性分析时，不得不考虑“混杂”因素的影响。

这里，我们举一个更简单的示例：游泳可以促进冷饮的销售，即游泳的人越多，冷饮的销售量也越多。
统计数据如下：
在这里插入图片描述

第一种：Pearson检测

结果如下：
在这里插入图片描述

如上表，”冷饮销售量“和”游泳人数“两个变量的简单相关分析，结果显示Pearson相关系数为”0.972“，P值小于0.01，相关关系具有统计学意义。

但事实上真是如此吗？

第二种：偏相关性检验

为什么要用偏相关呢？
因为u冷饮销售量增加和游泳人数地增加都可能与“天气热”相联系，所以，很有可能是因为天气热导致吃冷饮的人和游泳的人同时增加。

基于此，我们再纳入一个变量——“气温”，把气温作为控制变量，然后再进行“冷饮销售量”和“游泳人数”的相关分析。

基于此，我们再纳入一个变量——“气温”，把气温作为控制变量，然后再进行“冷饮销售量”和“游泳人数”的相关分析。
如上图偏相关分析的结果，纳入“气温”这个控制变量后，“冷饮销售量”和“游泳人数”的相关系数低至0.215，而且P值大于0.05，相关关系不再具有统计学意义。因此，我们就有充分的理由怀疑上文两变量的直接相关系数是不准确的。

二、偏相关的原理

题设：

统计 $x_1$ 和 $x_2$ 之间的相关性，其存在混杂（协变量）因素 $z_1, z_2, ..., z_k$ （可以多个）

计算：“先回归，求残差，再相关”

以 $x_1$ 和 $x_2$ 为因变量，以 $z_1, z_2, ..., z_k$ 为自变量，进行多重线性回归，获得 $x_1$ 和 $x_2$ 的预测值： $\widehat{x_1}$ 和 $\widehat{x_2}$
计算残差值： $\widehat{e_1}=x_1-\widehat{x_1}$ $\widehat{e_2}=x_2-\widehat{x_2}$
计算偏相关系数： $pcor=corr(\widehat{e_1}, \widehat{e_2})$

三、R语言实战

# partial_Spearman: Partial Spearman's Rank Correlation
# Description
partial.Spearman computes the partial Spearman\'s rank correlation between variable X and variable Y adjusting for other variables, Z. The basic approach involves fitting a specified model of X on Z, a specified model of Y on Z, obtaining the probability-scale residuals from both models, and then calculating their Pearson\'s correlation. X and Y can be any orderable variables, including continuous or discrete variables. By default, partial.Spearman uses cumulative probability models (also referred as cumulative link models in literature) for both X on Z and Y on Z to preserve the rank-based nature of Spearman\'s correlation, since the model fit of cumulative probability models only depends on the order information of variables. However, for some specific types of variables, options of fitting parametric models are also available. See details in fit.x and fit.y

# Usage
partial.Spearman(formula, data, fit.x = "orm", fit.y = "orm",
  link.x = c("logit", "probit", "cloglog", "loglog", "cauchit", "logistic"),
  link.y = c("logit", "probit", "cloglog", "loglog", "cauchit", "logistic"),
  subset, na.action = getOption("na.action"), fisher = TRUE,
  conf.int = 0.95)

# Details
To compute the partial Spearman's rank correlation between X and Y adjusting for Z, formula is specified as X | Y ~ Z. This indicates that models of X ~ Z and Y ~ Z will be fit.

For Examples:

# NOT RUN {
data(PResidData)
#### fitting cumulative probability models for both Y and W
partial.Spearman(c|w ~ z,data=PResidData)
#### fitting a cumulative probability model for W and a poisson model for c
partial.Spearman(c|w~z, fit.x="poisson",data=PResidData)
partial.Spearman(c|w~z, fit.x="poisson", fit.y="lm.emp", data=PResidData )
# }