Python data binning: how to implement optimal binning of data in Python


Monotonic Binning with Python

Monotonic binning is a data preparation technique widely used in scorecard development and is usually implemented in SAS. Below is an attempt to do monotonic binning with Python.

Python Code:

# import packages
import pandas as pd
import numpy as np
from scipy import stats  # the scipy.stats.stats namespace is deprecated

# import data
data = pd.read_csv("/home/liuwensui/Documents/data/accepts.csv", sep = ",", header = 0)

# define a binning function
def mono_bin(Y, X, n = 20):
    # fill missings with median (Series.median skips NaN; np.median would return NaN)
    X2 = X.fillna(X.median())
    r = 0
    # repeat until the bin means of X and Y are perfectly rank-correlated
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X2, "Y": Y, "Bucket": pd.qcut(X2, n)})
        d2 = d1.groupby('Bucket', as_index = True)
        r, p = stats.spearmanr(d2.mean().X, d2.me
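
The listing above breaks off mid-statement. As a separate, minimal sketch of the same idea (not a reconstruction of the author's code), the following lowers the number of quantile bins until the bin-level means of X and Y have a Spearman correlation of ±1. The name `mono_bin_sketch`, the `n_max` parameter, the summary columns, and the synthetic data are all assumptions made for illustration; the sketch also assumes X has at least two distinct values.

```python
import numpy as np
import pandas as pd
from scipy import stats


def mono_bin_sketch(y, x, n_max=20):
    """Shrink the number of quantile bins until the bin means of x and y are monotonic."""
    # Series.median() skips missing values, so the fill value is well defined
    frame = pd.DataFrame({"x": x.fillna(x.median()), "y": y})
    grouped = None
    for n in range(n_max, 1, -1):
        # duplicates="drop" merges bins whose quantile edges coincide
        buckets = pd.qcut(frame["x"], n, duplicates="drop").rename("bucket")
        grouped = frame.groupby(buckets, observed=True)
        means = grouped.mean()
        if len(means) < 2:
            continue
        # |r| == 1 means the bin means of y rise (or fall) strictly with x
        r, _ = stats.spearmanr(means["x"], means["y"])
        if np.isclose(abs(r), 1.0):
            break
    # One row per bin: range of x, number of rows, mean of y
    return grouped.agg(
        min_x=("x", "min"),
        max_x=("x", "max"),
        count=("y", "size"),
        rate=("y", "mean"),
    ).reset_index()


# Hypothetical data: a binary outcome whose probability rises with x
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=2000))
x.iloc[::50] = np.nan
y = pd.Series((rng.random(2000) < 1 / (1 + np.exp(-x.fillna(0)))).astype(int))
summary = mono_bin_sketch(y, x)
print(summary)
```

Using a bounded `for` loop instead of `while` guarantees termination: with two bins the rank correlation is ±1 whenever the two bin means differ.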

The following is a CART-based implementation of optimal binning in Python, which can be used to bin continuous variables:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def binning_continuous_var(data, target, min_samples_leaf=50, max_bins=10, return_bins=False):
    """Bin every numeric column of `data` with a regression tree fitted on `target`."""
    binned = data.copy()
    all_bins = {}  # bin edges per column
    cont_cols = data.select_dtypes(include=[np.number]).columns.tolist()
    for col in cont_cols:
        binned[col], all_bins[col] = bin_continuous_var(
            data, col, target, min_samples_leaf, max_bins
        )
    # Append the target only after binning, so the target itself is never binned
    binned = pd.concat([binned, target], axis=1)
    if return_bins:
        return binned, all_bins
    return binned


def bin_continuous_var(data, col, target, min_samples_leaf, max_bins):
    """Bin one column; return the per-bin target means and the bin edges."""
    if data[col].max() == data[col].min():
        # A constant column cannot be split
        return data[col], []
    tree_model = DecisionTreeRegressor(
        criterion='squared_error',  # 'mse' was renamed in scikit-learn 1.0
        min_samples_leaf=min_samples_leaf,
        max_leaf_nodes=max_bins,  # caps the number of leaves, i.e. bins
        random_state=42,
    )
    tree_model.fit(data[[col]], target)
    # Only internal nodes hold split thresholds; leaves have children_left == -1
    tree = tree_model.tree_
    thresholds = np.sort(tree.threshold[tree.children_left != -1])
    bins = [data[col].min()] + thresholds.tolist() + [data[col].max()]
    # right=True matches the tree's rule that x <= threshold goes left
    bin_index = np.digitize(data[col], thresholds, right=True)
    # Replace each value with the mean target of its bin
    binned_col = target.groupby(bin_index).transform('mean').round(4)
    return binned_col, bins
```

In this code, `binning_continuous_var` is the main function. Its parameters are the data to bin, the target variable, the minimum number of samples per leaf, the maximum number of bins, and a flag for returning the bin boundaries. It loops over the continuous variables, calls `bin_continuous_var` on each one, and writes the binned result back into the data. With `return_bins=True` it returns the data set together with the bin boundaries of each column.

`bin_continuous_var` bins a single continuous variable. Its parameters are the data, the column name of the continuous variable, the target variable, the minimum number of samples per leaf, and the maximum number of bins. It fits a regression tree with the CART algorithm, with the number of leaves capped at `max_bins`, and takes the tree's split thresholds as the bin boundaries. It then replaces each value of the variable with the mean target value of its bin and returns the binned column and the list of bin boundaries.
To use the code, pass the data to bin and the target variable to `binning_continuous_var`, for example:

```python
# Generate test data
data = pd.DataFrame({
    'col1': np.random.rand(1000),
    'col2': np.random.rand(1000),
    'col3': np.random.rand(1000),
    'target': np.random.randint(0, 2, 1000)
})

# Run the optimal binning
data_binned = binning_continuous_var(
    data.drop('target', axis=1),
    data['target'],
    min_samples_leaf=50,
    max_bins=10,
    return_bins=False
)
```

`data_binned` holds the binned feature columns together with the target column.
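
Bin boundaries learned on one data set are often reused on new data. One way to do that, sketched here under assumptions, is to read the split thresholds off a fitted tree and pass them to `pd.cut`. The helper name `tree_bin_edges`, the open-ended outer edges, and the toy data are hypothetical and not part of the code above; the tree's default splitting criterion is used.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def tree_bin_edges(x, y, max_bins=5, min_samples_leaf=50):
    """Return sorted bin edges learned by a regression tree on one feature."""
    model = DecisionTreeRegressor(
        max_leaf_nodes=max_bins,
        min_samples_leaf=min_samples_leaf,
        random_state=42,
    )
    model.fit(x.to_frame(), y)
    tree = model.tree_
    # Leaves are marked by children_left == -1; only internal nodes hold thresholds
    thresholds = np.sort(tree.threshold[tree.children_left != -1])
    # Open-ended outer edges so unseen values outside the training range still get a bin
    return [-np.inf, *thresholds.tolist(), np.inf]


# Toy training data: the target switches from 0 to 1 at x = 0.6
rng = np.random.default_rng(0)
train_x = pd.Series(rng.random(1000), name="col1")
train_y = (train_x > 0.6).astype(int)
edges = tree_bin_edges(train_x, train_y)

# Apply the learned edges to new values; pd.cut bins are right-closed,
# which matches the tree's "x <= threshold goes left" rule
new_x = pd.Series([0.05, 0.5, 0.95])
new_bins = pd.cut(new_x, bins=edges)
print(edges)
print(new_bins)
```

Keeping the edges separate from the binned values means the same boundaries can be stored and applied at scoring time without refitting the tree.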
