Solve Binary Classification with PointBiserialCorrSelector

The majority of machine learning problems in today's world are classification problems. Data scientists and machine learning engineers mostly use correlations such as Pearson or Spearman to find the features that correlate the most with the predicted value. However, these types of correlations work best on continuous-continuous pairs of features. That's why we at Sigmoid decided to add to our feature selection library, Kydavra, a method that also works on dichotomous data (a series that has only 2 values).

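For intuition, here is a minimal sketch, using scipy rather than Kydavra, of what a point-biserial correlation between a dichotomous series and a continuous one looks like; the toy data below is made up purely for illustration:

from scipy.stats import pointbiserialr

# A dichotomous series (only 2 values) paired with a continuous one
binary = [0, 1, 0, 1, 1, 0, 1, 0]
continuous = [1.2, 3.4, 0.9, 4.1, 3.8, 1.5, 4.4, 1.1]

# pointbiserialr returns the correlation coefficient and its p-value
r, p_value = pointbiserialr(binary, continuous)
print(r, p_value)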

Using PointBiserialCorrSelector from the Kydavra library

As always, for those who are here mostly just for the solution to their problem, here are the commands and the code:

To install kydavra, just type the following line in the terminal or command line:

pip install kydavra

Now you can import the selector and apply it on your data set as follows:

from kydavra import PointBiserialCorrSelector

selector = PointBiserialCorrSelector()
new_columns = selector.select(df, 'target')

PointBiserialCorrSelector has the following parameters:

  • min_corr: the minimal correlation for a feature to be considered important (default = 0.5)
  • max_corr: the maximal correlation for a feature to be considered important (default = 0.8)
  • last_level: the number of correlation levels the selector will take into account; it is recommended not to change it (default = 2)
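
For example, assuming these parameters are passed to the selector's constructor (as the list above suggests), a stricter configuration would look like this; the threshold values here are arbitrary and only for illustration:

selector = PointBiserialCorrSelector(min_corr=0.6, max_corr=0.8)
new_columns = selector.select(df, 'target')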

So let's test it on the Heart Disease UCI Dataset. Note that the dataset was cleaned beforehand.

import pandas as pd
from kydavra import PointBiserialCorrSelector

df = pd.read_csv('cleaned.csv')
selector = PointBiserialCorrSelector()
new_columns = selector.select(df, 'target')
print(new_columns)

The result:

['cp', 'thalach', 'exang', 'oldpeak', 'sex', 'slope', 'ca', 'thal']
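
Since the selector returns the names of the kept columns (assuming, as the printed output above suggests, a plain Python list), the reduced DataFrame is obtained with the usual pandas subsetting, keeping the target alongside:

df_selected = df[new_columns + ['target']]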

Note that the features are ordered in descending order by their Point-Biserial Correlation value.

To see the impact of feature selection on different types of models, I decided to train 3 models (LogisticRegression as the linear one, DecisionTreeClassifier as the non-linear one, and SVC with a Gaussian kernel). Before feature selection we had the following cross_val_score results:

LINEAR - 0.8346127946127947
TREE - 0.7681481481481482
SVC - 0.8345454545454546

After applying feature selection, the scores were:

LINEAR - 0.838047138047138
TREE - 0.7718518518518518
SVC - 0.8418181818181818
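
For reference, here is a minimal sketch of how such a comparison can be run; the cv=5 setting and the model hyperparameters are my assumptions, so the exact scores may differ:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

df = pd.read_csv('cleaned.csv')
X, y = df.drop(columns='target'), df['target']

models = {
    'LINEAR': LogisticRegression(max_iter=1000),
    'TREE': DecisionTreeClassifier(),
    'SVC': SVC(kernel='rbf'),  # Gaussian kernel
}

for name, model in models.items():
    # Swap X for X[new_columns] to score only the selected features
    print(name, '-', cross_val_score(model, X, y, cv=5).mean())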

So we gained some accuracy on the Tree and SVC models (almost 1%), and we are now using only 8 features instead of 13, which is a respectable trade-off.

Created with ❤ by Sigmoid.

If you want to dive deeper into how Point-Biserial Correlation works, I highly recommend the links at the end of the article. If you tried kydavra, I invite you to leave some feedback and share your experience with it by responding to this form.

Useful links:

Translated from: https://medium.com/analytics-vidhya/solve-binary-classification-with-pintbiserialcorrselector-406565328e35
