Paper: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/1992-Kerber-ChimErge-AAAI92.pdf
Kerber, Randy. “ChiMerge: Discretization of numeric attributes.” Proceedings of the tenth national conference on Artificial intelligence. 1992.
Detailed walkthrough of the ChiMerge algorithm (in English): https://medium.com/@nithin_rajan/data-discretization-using-chimerge-55c8ade3cfda
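ChiMerge is built on the chi-square test. The idea is:
- Start by treating every distinct value as its own interval.
- Repeatedly merge adjacent intervals: compute the chi-square value for each adjacent pair and merge a pair when its value is below 4.6. (4.6 is the chi-square critical value at the 90% confidence level, i.e. the 10% significance level, with 2 degrees of freedom, which corresponds to 3 classes. The degrees of freedom equal the number of classes minus one, so the threshold changes with the number of classes.)
- Stop once the desired number of intervals is reached or every chi-square value is greater than 4.6.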
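The three steps above can be sketched in plain Python. This is only a minimal illustration, not the code from the paper or from any library. The names `chi2_pair`, `chimerge`, `target_intervals` and `CHI2_90_DF2` are made up for this example. Cells whose expected count is 0 are simply skipped, which is a simplification. The default threshold uses the fact that, for 2 degrees of freedom, the chi-square critical value at significance level alpha is -2·ln(alpha), which gives about 4.605 for alpha = 0.10.

```python
import math
from collections import Counter

# Critical value for 2 degrees of freedom at the 10% significance level.
# For df = 2 the chi-square CDF is 1 - exp(-x / 2), so x = -2 * ln(alpha).
CHI2_90_DF2 = -2 * math.log(0.10)  # about 4.605


def chi2_pair(a, b, classes):
    """Chi-square statistic of two adjacent intervals given as class-count Counters."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # simplification: skip cells with zero expected count
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(values, labels, threshold=CHI2_90_DF2, target_intervals=1):
    """Return the sorted lower bounds of the intervals that remain after merging."""
    classes = sorted(set(labels))
    # Step 1: one interval per distinct value, each holding its class counts.
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    bounds = sorted(counts)
    bins = [counts[v] for v in bounds]

    # Step 3 (first half): stop when the desired number of intervals is reached.
    while len(bins) > max(target_intervals, 1):
        stats = [chi2_pair(bins[i], bins[i + 1], classes) for i in range(len(bins) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        # Step 3 (second half): stop when even the smallest chi-square reaches the threshold.
        if stats[i] >= threshold:
            break
        # Step 2: merge the adjacent pair with the smallest chi-square.
        bins[i] = bins[i] + bins[i + 1]
        del bins[i + 1], bounds[i + 1]
    return bounds


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = ["a", "a", "a", "b", "b", "b"]
    # Two classes means 1 degree of freedom; 2.706 is the 90% critical value for df = 1.
    print(chimerge(xs, ys, threshold=2.706))
```

Merging the pair with the smallest chi-square first means the most similar neighbours are always combined first. The threshold should be chosen for the actual number of classes; the default here only fits the 3-class case.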
Using the ChiMerge algorithm
For convenience and efficiency, we use the third-party package scorecardbundle.
Install it first:
pip install -i https://pypi.org/simple --upgrade scorecardbundle
Scorecard-Bundle GitHub page: https://github.com/Lantianzz/Scorecard-Bundle
Example code
from scorecardbundle.feature_discretization.ChiMerge import ChiMerge
from sklearn.datasets import make_classification
import pandas as pd

if __name__ == '__main__':
    # Synthetic 4-class dataset; only the first feature is discretized here.
    data_x, data_y = make_classification(n_samples=100, n_classes=4, n_features=10, n_informative=8, random_state=0)
    x_value = data_x[:, 0]
    y_value = data_y
    # Fit ChiMerge on that single feature (the DataFrame column is named 0).
    trans_cm = ChiMerge(max_intervals=10, min_intervals=2, decimal=3, output_dataframe=True)
    result_cm = trans_cm.fit_transform(pd.DataFrame(x_value), y_value)
    print("Boundaries:", trans_cm.boundaries_[0])
    # pd.cut assigns code -1 to any value that falls outside the given boundaries.
    print("Binning result:", pd.cut(x_value, trans_cm.boundaries_[0]).codes)
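Python implementations of the algorithm (see the following articles)
ChiMerge (Ker92): https://gist.github.com/alanzchen/17d0c4a45d59b79052b1cd07f531689e
The ChiMerge algorithm: chi-square test + ChiMerge + Python: https://www.yanxishe.com/blogDetail/25070
Note: so far I have tested four reimplementations and none of them ran correctly, which is strange. Their approaches also differ from one another in places. If you have one that runs, please share it in the comments so we can learn from each other.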