r² or R²: When to Use What

Picture this: you are a stock analyst responsible for predicting Walmart’s stock price ahead of its quarterly earnings report. You are hard at work when your data scientist walks in saying they have discovered a little-known data stream of daily Walmart parking lot occupancy that seems well correlated with Walmart’s historic revenues. You are understandably excited. You ask them to use the parking lot data alongside other standard metrics in a machine learning model to forecast Walmart’s stock price.

So far so good.

The data scientist returns in a few hours claiming that, after careful validation of the model, its predictions are strongly correlated with the true stock price. Do you accept the model without any further investigation?

I hope not.

Correlations are good for identifying patterns in data, but almost meaningless for quantifying a model’s performance, especially for complex models (like machine learning models). This is because a correlation only tells you whether two things follow each other (e.g., parking lot occupancy and Walmart’s stock price), not how closely they match each other (e.g., predicted and actual stock price). For that, model performance metrics like the coefficient of determination (R²) can help.

In this article, we will learn:

  1. What is the correlation coefficient (r) and its square (r²)?
  2. What is the coefficient of determination (R²)?
  3. When to use each of the above?

1. Correlation coefficient: “How good is this predictor?”

[Figure: The shorter the sum of the blue lines, the closer the correlation coefficient is to +1. Image by author.]

Correlation coefficients help quantify mutual relationships or connections between two things. Some well-known correlated quantities are weight and height of humans, house value and its area, and, as we saw in the above example, a store’s revenue and its parking lot occupancy.

One of the most widely used correlation coefficients is the Pearson correlation coefficient (usually denoted by r). Graphically, this can be understood as “how close is the data to the line of best fit?”

[Figure: r ranges from −1 to +1. The grey line is the line that best fits the data. Image by author.]

  1. If the points are very far from the line, r is close to 0
  2. If the points are very close to the line and the line slopes upward, r is close to +1
  3. If the points are very close to the line and the line slopes downward, r is close to −1

Notice how the figure above has no numbers on the axes? That is because the Pearson correlation coefficient is independent of the magnitude of the numbers; it is sensitive to relative changes only. This property is usually desirable since variables rarely have the same magnitudes. For example, Walmart’s stock price is in the tens of dollars, whereas the number of cars parked in front of its stores is in the thousands.

However, because it is insensitive to actual magnitude, the Pearson correlation coefficient can give a false sense of confidence when the two quantities are expected to have the same magnitude, as predicted and observed values are.
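
To make that insensitivity concrete, here is a minimal sketch using only the Python standard library. The prices are made-up numbers, and `pearson_r` is a helper written for this illustration (libraries such as NumPy offer equivalents like `numpy.corrcoef`). The "predictions" are off by hundreds of dollars, yet r is still essentially +1.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical numbers, for illustration only.
observed = [48.0, 50.0, 51.0, 53.0, 55.0]          # "true" stock price, dollars
predicted = [10.0 * v + 200.0 for v in observed]   # same trend, wrong magnitude

r = pearson_r(observed, predicted)
print(f"r = {r:.3f}")  # essentially +1 despite the huge errors
```

Rescaling or shifting one variable leaves r unchanged, which is exactly why it cannot tell a good prediction from a badly biased one.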

To make matters worse, some people square the Pearson correlation coefficient to bring it between 0 and +1 and call it r². This is not to be confused with the coefficient of determination (R²), which is explained below.

2. Coefficient of determination: “How good is this model?”

[Figure: The longer the sum of the orange lines, the lower the coefficient of determination. Image by author.]

Unlike the Pearson correlation coefficient, the coefficient of determination measures how well the predicted values match (and not just follow) the observed values. It depends on the distance between the points and the 1:1 line (and not the best-fit line), as shown above. The closer the data are to the 1:1 line, the higher the coefficient of determination.

The coefficient of determination is often denoted by R². Despite the notation, it is not the square of anything. It can take any value from negative infinity up to +1.
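
For reference, the standard textbook definition (not spelled out in the original article) is given below. The symbols are introduced here: y_i are the observed values, ŷ_i the predicted values, and ȳ the mean of the observed values.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

The numerator sums the squared gaps between each prediction and its observation, and the denominator sums the squared gaps between each observation and the mean. The ratio has no upper limit, which is why R² has no lower bound.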

[Figure: R² can range from negative infinity to +1. The grey line is where the quantities on both axes are equal (also known as the 1:1 line). Image by author.]

  1. R² = +1 indicates that the predictions match the observations perfectly
  2. R² = 0 indicates that the predictions are no better than always guessing the mean of the observed values
  3. Negative R² indicates that the predictions are worse than always guessing that mean
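
The three cases above can be sketched in a few lines of plain Python. This assumes the usual definition of R² as one minus the ratio of the residual sum of squares to the total sum of squares. The data are invented, and `r_squared` is a helper written for this example (scikit-learn provides `sklearn.metrics.r2_score` for the same purpose).

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical stock prices, for illustration only.
observed = [48.0, 50.0, 51.0, 53.0, 55.0]
mean_obs = sum(observed) / len(observed)

perfect = list(observed)                  # predictions equal observations
always_mean = [mean_obs] * len(observed)  # predict the mean every time
reversed_trend = observed[::-1]           # predict the opposite trend

r2_perfect = r_squared(observed, perfect)
r2_mean = r_squared(observed, always_mean)
r2_reversed = r_squared(observed, reversed_trend)

print(r2_perfect, r2_mean, r2_reversed)  # +1, 0, and a negative number
```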

Since R² indicates the distance of points from the 1:1 line, it does depend on the magnitude of the numbers (unlike r²).

3. When to use what?

The Pearson correlation coefficient (r) is used to identify patterns in data, whereas the coefficient of determination (R²) is used to measure how well a model’s predictions match observations.

Squaring r gives the squared Pearson correlation coefficient (r²), which is completely different from the coefficient of determination (R²) except in very specific cases. One such case is ordinary least-squares linear regression evaluated on the data it was fit to, where the grey lines from the two figures above merge and the blue and orange lines become equivalent.
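
Here is a rough check of that special case, assuming a one-variable least-squares fit with an intercept, evaluated on its own training data. The occupancy and price numbers are invented, and both helper functions are written for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: parking lot occupancy (cars) and stock price (dollars).
occupancy = [1200.0, 1500.0, 1700.0, 2100.0, 2600.0]
price = [48.0, 50.0, 51.0, 53.0, 55.0]

# Ordinary least-squares fit of price = intercept + slope * occupancy.
mean_x = sum(occupancy) / len(occupancy)
mean_y = sum(price) / len(price)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(occupancy, price))
         / sum((a - mean_x) ** 2 for a in occupancy))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * a for a in occupancy]

r_sq_pearson = pearson_r(price, fitted) ** 2
r_sq_determination = r_squared(price, fitted)
print(f"r² = {r_sq_pearson:.6f}, R² = {r_sq_determination:.6f}")  # the two agree
```

The agreement depends on the fitted values coming from a least-squares line on the same data. Predictions from any other source, or on new data, carry no such guarantee.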

Thus, the Pearson correlation coefficient or its square should rarely be used to evaluate a model’s performance. This is explained using 3 examples in the figure below.

[Figure: Predictions from 3 different models for Walmart’s stock price. Image by author.]

  1. Model 1: R² = 0.99 indicates that it predicts stock prices almost perfectly.
  2. Model 2: R² = 0.59 indicates that it predicts stock prices poorly. However, if you had looked at r² only, you would have been overly optimistic. This kind of biased prediction is extremely common with machine learning models. It is thus all the more important to visualize your predictions rather than just summarize them using statistics.
  3. Model 3: R² = −0.98 indicates that it is worse than simply guessing about $50 every time. But again, if you had just looked at r², you might have lost all your money! Side note: believe it or not, stock predictions opposite to actual trends are quite common. This has also given rise to a whole field called Contrarian Investing.
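
The same contrast can be reproduced with a toy example. The three "models" below are hypothetical and do not reproduce the numbers in the figure. They mimic the three situations only: a close match, a biased prediction with a compressed range, and a reversed trend.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed prices with a mean of $50.
observed = [40.0, 45.0, 50.0, 55.0, 60.0]

models = {
    "close match": [40.5, 44.5, 50.0, 55.5, 59.5],
    "biased (compressed range)": [0.5 * v + 25.0 for v in observed],
    "reversed trend": [100.0 - v for v in observed],
}

# Compute (r², R²) for every model.
results = {
    name: (pearson_r(observed, pred) ** 2, r_squared(observed, pred))
    for name, pred in models.items()
}

for name, (r_sq, big_r_sq) in results.items():
    print(f"{name}: r² = {r_sq:.3f}, R² = {big_r_sq:.3f}")
```

In this sketch r² stays at or near 1 for all three models, while R² falls from nearly 1 to 0.75 and then to −3. That gap is what the squared correlation hides.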

Recap

  1. Correlations are useful for finding patterns and relationships in data, but mostly useless for evaluating predictions.
  2. To evaluate predictions, use metrics like the coefficient of determination, which captures how well predictions match observations, or how much of the variation in the observed data is explained by the predictions.
  3. The squared Pearson correlation coefficient is usually not equal to the coefficient of determination (r² ≠ R² in general).

If you want a math-y explanation of the difference between r² and R², check out this excellent article by Deepak Khandelwal.

Translated from: https://towardsdatascience.com/r%C2%B2-or-r%C2%B2-when-to-use-what-4968eee68ed3
