Beyond One-Hot: Incremental Improvements in Categorical Encoding


The beyond-one-hot project has started to grow up. Last fall, I wrote a couple of posts comparing different methods of encoding categorical variables for machine learning problems. You can check them out here and here.

Those posts were pretty well received, so the hacky little script that made the plots got some more work and eventually became a pip-installable Python library built on scikit-learn style objects. I’m now happy to say that the library is being used in production in at least two large systems that I know of, and has reached something that resembles stability.

Beyond stability, we’ve added some useful functionality in the past few months, including:

  • A drop_invariant option on the transformers, which checks for features with zero variance at the fit() step and reliably drops those features from the output at transform()
  • A return_df option on all transformers, which lets the user choose whether transform() returns a pandas DataFrame or a numpy array
  • If cols is passed as [], nothing is encoded and the dataset is passed through unchanged
  • If cols is passed as None, the dataset passed to fit() is inspected to infer which columns should be encoded, and those are used. Any column typed as ‘object’ in the pandas DataFrame representation is considered appropriate for encoding.
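
To make the options above concrete, here is a minimal sketch of how they could fit together. It is not the library's code: SketchOrdinalEncoder is a hypothetical toy class written only for illustration, it uses plain pandas and numpy, and it assumes a simple ordinal (integer) encoding so the behavior of cols, drop_invariant, and return_df is easy to see.

```python
import numpy as np
import pandas as pd


class SketchOrdinalEncoder:
    """Hypothetical toy encoder for illustration; not the library's implementation."""

    def __init__(self, cols=None, drop_invariant=False, return_df=True):
        self.cols = cols
        self.drop_invariant = drop_invariant
        self.return_df = return_df

    def fit(self, X):
        # cols=None: infer the columns to encode from the 'object' dtype.
        # cols=[]: encode nothing, so the data passes through unchanged.
        if self.cols is None:
            self.cols_ = [c for c in X.columns if X[c].dtype == object]
        else:
            self.cols_ = list(self.cols)
        # Map each category to an integer code, in order of first appearance.
        self.mapping_ = {
            c: {v: i for i, v in enumerate(pd.unique(X[c]))} for c in self.cols_
        }
        # Decide once, at fit time, which output columns have zero variance.
        encoded = self._encode(X)
        if self.drop_invariant:
            self.drop_cols_ = [c for c in encoded.columns if encoded[c].nunique() <= 1]
        else:
            self.drop_cols_ = []
        return self

    def _encode(self, X):
        out = X.copy()
        for c in self.cols_:
            out[c] = out[c].map(self.mapping_[c])
        return out

    def transform(self, X):
        # Drop exactly the columns flagged during fit(), then pick the return type.
        out = self._encode(X).drop(columns=self.drop_cols_)
        return out if self.return_df else out.to_numpy()


df = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype="object"),
    "constant": [1, 1, 1],
    "size": [10, 20, 30],
})

default = SketchOrdinalEncoder().fit(df).transform(df)
passthrough = SketchOrdinalEncoder(cols=[]).fit(df).transform(df)
trimmed = SketchOrdinalEncoder(drop_invariant=True, return_df=False).fit(df).transform(df)
```

In this sketch, default encodes only the object-typed color column, passthrough is identical to the input, and trimmed is a numpy array without the constant column. Deciding which columns to drop during fit() rather than transform() is what keeps the output shape the same for every later batch.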

In the past few months a number of people have expressed interest in contributing, and there is still plenty to help with. If you are interested, leave a comment below or find me on GitHub and get involved. We need help with documentation, new encoders, benchmarking of computational performance, and most importantly, getting the library into production so we can find out where it’s useful and where it’s lacking.

So if you haven’t already, check out categorical_encoding on GitHub, and let me know what you think.

Translated from: https://www.pybloggers.com/2016/06/beyond-one-hot-incremental-improvements-in-categorical-encoding/