Using the Predictive Power Score in R

weixin_26750481

Original article: https://towardsdatascience.com/using-the-predictive-power-score-in-r-26c43d05dc01

A few months ago, Florian Wetschoreck published a story on the Towards Data Science Medium channel that caught the attention of many data scientists on LinkedIn thanks to its provocative title: “RIP correlation. Introducing the Predictive Power Score”. Let’s see what the score is and how to use it in R.

Definition of Predictive Power Score

The Predictive Power Score (PPS) is a normalized index (it ranges from 0 to 1) that tells us how well the variable x (numerical or categorical) can be used to predict the variable y (numerical or categorical). The higher the PPS, the more useful x is for predicting y.

The conceptual similarity of PPS with the correlation coefficient is evident, although the following differences exist:

PPS also detects non-linear relationships between x and y.
PPS is not a symmetric index: in general, PPS(x, y) ≠ PPS(y, x). In other words, the fact that x predicts y does not imply that y predicts x.
PPS accepts both numerical and categorical variables.

Basically, PPS is an asymmetric nonlinear index that is applicable to all types of variables for predictive purposes.
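
The first two differences can be illustrated with a small standard-library sketch. It is not the PPS algorithm: the helper `lookup_mae` is a hypothetical stand-in that predicts the median target seen for each feature value and is scored on the same data it was fitted on, whereas PPS relies on a cross-validated decision tree. On y = x², the Pearson correlation is zero even though x determines y exactly, and predicting x from y is much harder than predicting y from x.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lookup_mae(feature, target):
    """MAE of a lookup-table predictor (hypothetical stand-in, not PPS itself).

    For each distinct feature value, predict the median target observed
    for that value, then measure the mean absolute error on the same data.
    """
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    prediction = {f: median(ts) for f, ts in groups.items()}
    return mean(abs(t - prediction[f]) for f, t in zip(feature, target))


x = list(range(-5, 6))   # -5, -4, ..., 5
y = [v * v for v in x]   # y is fully determined by x, but not linearly

r = pearson(x, y)              # linear correlation between x and y
mae_x_to_y = lookup_mae(x, y)  # error when predicting y from x
mae_y_to_x = lookup_mae(y, x)  # error when predicting x from y

print(f"Pearson r  = {r:.3f}")
print(f"MAE x -> y = {mae_x_to_y:.3f}")
print(f"MAE y -> x = {mae_y_to_x:.3f}")
```

Here each value of x maps to exactly one y, so the x → y error vanishes, while each positive y corresponds to two values of x with opposite signs, so the y → x error stays large. A correlation coefficient cannot express that difference in direction.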

Behind the scenes, it uses decision trees as the learning algorithm because they are robust to outliers and to poor data pre-processing.

The score is calculated on the test sets of a cross-validation (4 folds by default) run with the scikit-learn cross_val_score function. The metric depends on the type of problem (regression or classification) defined by the target variable:

Regression: Mean Absolute Error (MAE), normalized to the [0, 1] interval against a baseline of “naïve” predictions of y, calculated as the median of the target variable.
Classification: weighted F1, normalized to the [0, 1] interval against a baseline of “naïve” predictions of y, calculated as either the most common value of the target variable or a random value (a random value sometimes achieves a higher F1 than the most common value).
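
The regression case can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's code: the function name `normalized_mae_score` is hypothetical, the model is a per-value median lookup instead of a decision tree, the folds are interleaved rather than produced by scikit-learn, and the normalization is assumed to be 1 − MAE_model / MAE_naive, floored at 0.

```python
from collections import defaultdict
from statistics import mean, median


def normalized_mae_score(feature, target, folds=4):
    """Cross-validated MAE of a stand-in model, normalized against a median baseline.

    Assumption: score = max(0, 1 - MAE_model / MAE_naive), where MAE_naive is
    the error of always predicting the median of the target.
    """
    rows = list(zip(feature, target))
    fold_maes = []
    for k in range(folds):
        # Interleaved split: row i belongs to test fold i % folds.
        train = [row for i, row in enumerate(rows) if i % folds != k]
        test = rows[k::folds]

        # Stand-in model (not a decision tree): median target per feature value,
        # falling back to the training median for unseen feature values.
        fallback = median(t for _, t in train)
        groups = defaultdict(list)
        for f, t in train:
            groups[f].append(t)
        model = {f: median(ts) for f, ts in groups.items()}

        fold_maes.append(mean(abs(t - model.get(f, fallback)) for f, t in test))

    mae_model = mean(fold_maes)

    # Naive baseline: always predict the median of the target.
    naive = median(target)
    mae_naive = mean(abs(t - naive) for t in target)
    if mae_naive == 0:
        return 0.0  # a constant target leaves nothing to predict
    return max(0.0, 1 - mae_model / mae_naive)


feature = [i % 5 for i in range(40)]
target = [f * f for f in feature]   # target fully determined by the feature
constant = [0] * 40                 # feature that carries no information

informative_score = normalized_mae_score(feature, target)
uninformative_score = normalized_mae_score(constant, target)
print(informative_score, uninformative_score)
```

An informative feature drives the model error towards zero and the score towards 1, while a feature that does no better than the median baseline scores 0. The classification case follows the same pattern with weighted F1 in place of MAE.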

You can dig into the details of the Python code because the Predictive Power Score project was released as open source on GitHub.

Pearson Correlation vs PPS

The two indexes come from different domains, so some fundamental distinctions must be made:

Pearson correlation is given by the normalized covariance between two numerical variables. Covariance depends on the deviations of the two variables from their respective means, so it’s a statistical measure. Given two numerical variables, Pearson correlation is a descriptive index well defined mathematically that gives the goodness-of-fit for the best possible linear function describing the relation between the variables.
PPS tries to solve the issues of only linear correlations and only numeric variables being measured in a correlation analysis by applying decision tree estimation. It is derived from performance metrics of that estimation. At the time this post was written (version 1.1.0), it gets by default a random sample (stratified if needed) of 5000 rows from the input dataset in order to speed up the calculations (the size of the sample can be modified using a proper parameter). No tuning is done to get the opti
