Pearson Correlation Score

Excerpt from "Programming Collective Intelligence"...


A slightly more sophisticated way to determine the similarity between people’s
interests is to use a Pearson correlation coefficient. The correlation coefficient is a
measure of how well two sets of data fit on a straight line. The formula for this is more
complicated than the Euclidean distance score, but it tends to give better results in
situations where the data isn’t well normalized, for example, if critics’ movie
rankings are routinely more harsh than average.




You can also see a straight line on the chart. This is called the best-fit line because it
comes as close to all the items on the chart as possible. If the two critics had
identical ratings for every movie, this line would be diagonal and would touch every item
in the chart, giving a perfect correlation score of 1. In the case illustrated, the critics
disagree on a few movies, so the correlation score is about 0.4. Figure 2-3 shows an
example of a much higher correlation, one of about 0.75.




One interesting aspect of using the Pearson score, which you can see in the figure, is
that it corrects for grade inflation. In this figure, Jack Matthews tends to give higher
scores than Lisa Rose, but the line still fits because they have relatively similar
preferences. If one critic is inclined to give higher scores than the other, there can still be
perfect correlation if the difference between their scores is consistent. The Euclidean
distance score described earlier will say that two critics are dissimilar because one is
consistently harsher than the other, even if their tastes are very similar. Depending
on your particular application, this behavior may or may not be what you want.
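
To make the grade-inflation point concrete, here is a minimal sketch. The two critics and their ratings are hypothetical (they are not the book's data), and the helper names euclidean_similarity and pearson_similarity are made up for this illustration. The distance-based score is assumed to take the form 1 / (1 + d), where d is the Euclidean distance; other variants exist.

```python
from math import sqrt

# Hypothetical ratings: critic_b always scores exactly 1.0 higher than critic_a.
critic_a = {"Movie A": 2.5, "Movie B": 3.5, "Movie C": 3.0, "Movie D": 1.5}
critic_b = {title: score + 1.0 for title, score in critic_a.items()}


def euclidean_similarity(r1, r2):
    """Distance-based similarity in (0, 1]; assumed form 1 / (1 + distance)."""
    shared = [t for t in r1 if t in r2]
    if not shared:
        return 0.0
    distance = sqrt(sum((r1[t] - r2[t]) ** 2 for t in shared))
    return 1 / (1 + distance)


def pearson_similarity(r1, r2):
    """Pearson correlation over the items both critics rated."""
    shared = [t for t in r1 if t in r2]
    n = len(shared)
    if n == 0:
        return 0.0
    mean1 = sum(r1[t] for t in shared) / n
    mean2 = sum(r2[t] for t in shared) / n
    # Covariance-like numerator and the two spread terms (all unnormalized).
    cov = sum((r1[t] - mean1) * (r2[t] - mean2) for t in shared)
    var1 = sum((r1[t] - mean1) ** 2 for t in shared)
    var2 = sum((r2[t] - mean2) ** 2 for t in shared)
    if var1 <= 0 or var2 <= 0:
        return 0.0  # a critic with constant ratings has no defined correlation
    return cov / sqrt(var1 * var2)


print("Euclidean-based:", euclidean_similarity(critic_a, critic_b))
print("Pearson:        ", pearson_similarity(critic_a, critic_b))
```

With this data the distance-based score is well below 1 (about 0.33), while the Pearson score is 1, because the constant offset between the critics disappears once each critic's mean is subtracted.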


The code for the Pearson correlation score first finds the items rated by both critics.
It then calculates the sums and the sum of the squares of the ratings for the two
critics, and calculates the sum of the products of their ratings. Finally, it uses these
results to calculate the Pearson correlation coefficient, shown in bold in the code
below. Unlike the distance metric, this formula is not very intuitive, but it does tell
you how much the variables change together divided by the product of how much
they vary individually.
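
The book's listing is not reproduced in this excerpt. The following is a sketch that follows the steps described above (shared items, sums, sums of squares, sum of products), not the author's exact code. The function name sim_pearson, the nested-dictionary layout of prefs, and the sample ratings are assumptions made for illustration.

```python
from math import sqrt


def sim_pearson(prefs, p1, p2):
    """Pearson correlation between two people in a {person: {item: rating}} dict."""
    # Step 1: find the items rated by both people.
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0.0  # nothing in common, so no basis for a score

    # Step 2: sums and sums of squares of each person's ratings.
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1_sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2_sq = sum(prefs[p2][it] ** 2 for it in shared)

    # Step 3: sum of the products of the paired ratings.
    p_sum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)

    # Step 4: how much the ratings change together (numerator), divided by
    # the product of how much each varies individually (denominator).
    num = p_sum - (sum1 * sum2 / n)
    den_sq = (sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n)
    if den_sq <= 0:
        return 0.0  # one person gave identical ratings; correlation undefined
    return num / sqrt(den_sq)


# Hypothetical ratings, for illustration only.
prefs = {
    "a": {"x": 1.0, "y": 2.0, "z": 3.0},
    "b": {"x": 2.0, "y": 4.0, "z": 6.0},  # same ordering as "a", scaled up
    "c": {"x": 3.0, "y": 2.0, "z": 1.0},  # opposite ordering to "a"
}
print(sim_pearson(prefs, "a", "b"))
print(sim_pearson(prefs, "a", "c"))
```

Returning 0 when there are no shared items or when the denominator is zero is a design choice in this sketch; depending on the application, raising an error or returning None may be more appropriate.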



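### Answer 1:

The Pearson correlation coefficient is a statistic that measures the strength of the linear relationship between two variables. Its value ranges from -1 to 1: a value of 1 means the two variables are perfectly positively correlated, -1 means they are perfectly negatively correlated, and 0 means there is no linear relationship between them.

### Answer 2:

The Pearson correlation coefficient is a statistical measure of the degree of linear correlation between two variables. It is normally used for two continuous variables rather than for categorical variables.

Its value ranges from -1 to 1. The closer the value is to 1 or -1, the stronger the correlation; the closer it is to 0, the weaker the correlation. A value of 1 indicates perfect positive correlation: as one variable increases, the other increases with it. A value of -1 indicates perfect negative correlation: as one variable increases, the other decreases. A value of 0 indicates that there is no linear correlation between the two variables.

The coefficient is computed as the sample covariance divided by the product of the two variables' standard deviations. It can be calculated with tools such as SPSS or Excel.

The Pearson correlation coefficient has several advantages: it is simple and easy to compute and interpret; it is sensitive to linear relationships and so reflects the degree of linear correlation well; and it helps analysts understand how variables in the data relate to each other, which provides a basis for further modeling.

It also has drawbacks: it reflects only linear correlation and captures nonlinear relationships poorly; it is sensitive to outliers, which can lead to misleading results; and it says nothing about causation, only about correlation.

In summary, the Pearson correlation coefficient is a widely used measure of linear correlation between two variables that helps analysts understand relationships in their data, but its limitations should be kept in mind to avoid misinterpretation.

### Answer 3:

The Pearson correlation coefficient is a measure of the degree of linear correlation between two continuous variables. It was proposed by the British statistician Karl Pearson at the end of the 19th century and is commonly used in data analysis, data mining, statistics, and related fields to assess the linear correlation between two variables; it is very widely applied in research.

Its value ranges from -1 to 1, where -1 indicates perfect negative correlation, 0 indicates no linear correlation, and 1 indicates perfect positive correlation. A positive coefficient means the two variables tend to change in the same direction; a negative coefficient means they tend to change in opposite directions. The closer the absolute value is to 1, the stronger the correlation.

The calculation effectively standardizes each variable: each value has the variable's mean subtracted and is then divided by its standard deviation. This removes the effect of the two variables having different units or scales, so the resulting coefficients are comparable.

Besides the Pearson correlation coefficient, other coefficients can be used to measure the relationship between two variables, such as Spearman's rank correlation coefficient and other rank- or scale-based coefficients. Different coefficients suit different types of variables and relationships, so the appropriate one should be chosen for the situation at hand.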