Solve Binary Classification with PointBiserialCorrSelector

The majority of machine learning problems in today's world are classification problems. Data scientists and machine learning engineers mostly use correlations such as Pearson or Spearman to find the features that correlate the most with the predicted value. However, these types of correlations work best on continuous-continuous pairs of features. That's why we at Sigmoid decided to add to our feature selection library, Kydavra, a method that also works on dichotomous data (a series that has only 2 values).

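For intuition, here is a minimal sketch, using scipy rather than Kydavra, of what a point-biserial correlation between a dichotomous series and a continuous one looks like; the toy data below is made up purely for illustration:

from scipy.stats import pointbiserialr

# A dichotomous series (only 2 values) paired with a continuous one
binary = [0, 1, 0, 1, 1, 0, 1, 0]
continuous = [1.2, 3.4, 0.9, 4.1, 3.8, 1.5, 4.4, 1.1]

# pointbiserialr returns the correlation coefficient and its p-value
r, p_value = pointbiserialr(binary, continuous)
print(r, p_value)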

Using PointBiserialCorrSelector from the Kydavra library

As always, for those who are here mostly just for the solution to their problem, here are the commands and the code:

To install kydavra, just type the following line in the terminal or command line:

pip install kydavra

Now you can import the selector and apply it on your data set as follows:

from kydavra import PointBiserialCorrSelector

selector = PointBiserialCorrSelector()
new_columns = selector.select(df, 'target')

PointBiserialCorrSelector has the following parameters:

  • min_corr: the minimal correlation for a feature to be considered important (default = 0.5)
  • max_corr: the maximal correlation for a feature to be considered important (default = 0.8)
  • last_level: the number of correlation levels the selector will take into account; it is recommended not to change it (default = 2)
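
For example, assuming these parameters are passed to the selector's constructor (as the list above suggests), a stricter configuration would look like this; the threshold values here are arbitrary and only for illustration:

selector = PointBiserialCorrSelector(min_corr=0.6, max_corr=0.8)
new_columns = selector.select(df, 'target')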

So let's test it on the Heart Disease UCI Dataset. Note that the dataset was cleaned beforehand.

import pandas as pd
from kydavra import PointBiserialCorrSelector

df = pd.read_csv('cleaned.csv')
selector = PointBiserialCorrSelector()
new_columns = selector.select(df, 'target')
print(new_columns)

The result:

['cp', 'thalach', 'exang', 'oldpeak', 'sex', 'slope', 'ca', 'thal']
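
Since the selector returns the names of the kept columns (assuming, as the printed output above suggests, a plain Python list), the reduced DataFrame is obtained with the usual pandas subsetting, keeping the target alongside:

df_selected = df[new_columns + ['target']]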

Note that the features are ordered in descending order by their Point-Biserial Correlation value.

To see the impact of feature selection on different types of models, I decided to train 3 models (LogisticRegression as the linear one, DecisionTreeClassifier as the non-linear one, and SVC with a Gaussian kernel). Before feature selection we had the following cross_val_score results:

LINEAR - 0.8346127946127947
TREE - 0.7681481481481482
SVC - 0.8345454545454546

After applying feature selection, the scores were:

LINEAR - 0.838047138047138
TREE - 0.7718518518518518
SVC - 0.8418181818181818
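
For reference, here is a minimal sketch of how such a comparison can be run; the cv=5 setting and the model hyperparameters are my assumptions, so the exact scores may differ:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

df = pd.read_csv('cleaned.csv')
X, y = df.drop(columns='target'), df['target']

models = {
    'LINEAR': LogisticRegression(max_iter=1000),
    'TREE': DecisionTreeClassifier(),
    'SVC': SVC(kernel='rbf'),  # Gaussian kernel
}

for name, model in models.items():
    # Swap X for X[new_columns] to score only the selected features
    print(name, '-', cross_val_score(model, X, y, cv=5).mean())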

So we gained some accuracy on the Tree and SVC models (almost 1%), and we are now using only 8 features instead of 13, which is a respectable trade-off.

Created with ❤ by Sigmoid.

If you want to dive deeper into how Point-Biserial Correlation works, I highly recommend the links at the end of the article. If you tried kydavra, I invite you to leave some feedback and share your experience with it by responding to this form.

Useful links:

Translated from: https://medium.com/analytics-vidhya/solve-binary-classification-with-pintbiserialcorrselector-406565328e35
