A Brief Introduction to Using the scorecardpy Library

Latest recommended article published on 2022-01-29 17:53:24

weixin_39687814


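Two libraries are commonly used for building credit scorecards in Python: scorecardpy and Toad. scorecardpy was developed by Dr. Shichen Xie (谢士晨), and Toad is a standard scorecard library incubated inside the risk-control team at 厚本金融. I covered Toad in an earlier tutorial, so this post looks at scorecardpy.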

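scorecardpy is the Python version of the R package scorecard. Its goal is to make developing traditional credit risk scorecard models easier and more efficient by providing functions for common tasks. The package's features and the corresponding functions are:

Data splitting (split_df)

Variable selection (iv, var_filter)

Variable binning (woebin, woebin_plot, woebin_adj, woebin_ply)

Score scaling (scorecard, scorecard_ply)

Model evaluation (perf_eva, perf_psi)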

First, load the germancredit dataset.

import scorecardpy as sc

dat = sc.germancredit()

Variable selection

dt_s = sc.var_filter(dat, y="creditability")

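This function filters variables by the given criteria, such as IV, missing rate, and identical-value rate. Its signature is:

def var_filter(dt, y, x=None, iv_limit=0.02, missing_limit=0.95,
               identical_limit=0.95, var_rm=None, var_kp=None,
               return_rm_reason=False, positive='bad|1')

Some of its parameters:

var_rm: names of variables to force-remove

var_kp: names of variables to force-keep

return_rm_reason: whether to return the reason each variable was removed

positive: the label of the bad samples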
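To make the three filter criteria concrete, here is a minimal pure-Python sketch. It is not scorecardpy's code: the helper names `information_value` and `keep_variable` are made up for illustration, the thresholds mirror the defaults in the signature above, and it assumes every bin contains at least one good and one bad sample.

```python
import math

def information_value(bins):
    """bins: list of (good_count, bad_count) per bin. Returns the IV."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        good_distr = good / total_good
        bad_distr = bad / total_bad
        # Each bin contributes (bad% - good%) * ln(bad% / good%).
        iv += (bad_distr - good_distr) * math.log(bad_distr / good_distr)
    return iv

def keep_variable(bins, missing_rate, identical_rate,
                  iv_limit=0.02, missing_limit=0.95, identical_limit=0.95):
    """Hypothetical helper: True if the variable passes all three checks."""
    return (information_value(bins) >= iv_limit
            and missing_rate <= missing_limit
            and identical_rate <= identical_limit)

# A variable whose two bins separate good and bad samples well.
strong = [(80, 20), (20, 80)]
# A variable whose bins carry no information at all.
useless = [(50, 50), (50, 50)]

strong_kept = keep_variable(strong, missing_rate=0.0, identical_rate=0.5)
useless_kept = keep_variable(useless, missing_rate=0.0, identical_rate=0.5)
```

Here `strong_kept` is True and `useless_kept` is False, because the second variable has an IV of zero.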

Data splitting

train, test = sc.split_df(dt_s, 'creditability').values()

def split_df(dt, y=None, ratio=0.7, seed=186)

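ratio defaults to 0.7, which splits the dataset 7:3 into train and test. Other values can be used, and ratio can also be given as a list such as [0.5, 0.2].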
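The idea behind a seeded 7:3 split can be sketched in plain Python. This is only an illustration under my own assumptions, not the library's implementation: `split_rows` is a hypothetical helper, and it assumes the split is done within each label group so that both parts keep the same bad rate.

```python
import random

def split_rows(rows, y, ratio=0.7, seed=186):
    """Hypothetical helper: stratified train/test split of a list of dicts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[y], []).append(row)
    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * ratio)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return {"train": train, "test": test}

# 100 toy rows, 30 of them bad (creditability == 1).
rows = [{"id": i, "creditability": 1 if i % 10 < 3 else 0} for i in range(100)]
train, test = split_rows(rows, "creditability").values()
```

Returning a dict and unpacking `.values()` mirrors the way the result of `sc.split_df` is unpacked above.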

Variable binning

bins = sc.woebin(dt_s, y="creditability")

def woebin(dt, y, x=None,
           var_skip=None, breaks_list=None, special_values=None,
           stop_limit=0.1, count_distr_limit=0.05, bin_num_limit=8,
           # min_perc_fine_bin=0.02, min_perc_coarse_bin=0.05, max_num_bin=8,
           positive="bad|1", no_cores=None, print_step=0, method="tree",
           ignore_const_cols=True, ignore_datetime_cols=True,
           check_cate_num=True, replace_blank=True,
           save_breaks_list=None, **kwargs):

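It returns a dict holding the binning result of each variable.

woebin supports decision-tree binning, chi-square (ChiMerge) binning, and user-defined binning. By default WOE is computed as ln(bad distribution / good distribution); which class counts as bad can be changed through the positive parameter. If a bin contains only good samples or only bad samples, the count of the missing class is set to 0.99 so that the WOE can still be computed. The key parameters are:

var_skip: variables that should not be binned.

breaks_list: a list of break points. Used when adjusting bins, and allows user-defined binning.

special_values: values that get a bin of their own.

count_distr_limit: the minimum share of samples in any bin. Defaults to 0.05.

stop_limit: with method="tree", splitting stops when the gain in IV falls below stop_limit; with method="chimerge", merging stops once the chi-square of every pair of adjacent bins exceeds qchisq(1-stop_limit, 1).

bin_num_limit: the maximum number of bins.

method: the binning method, either "tree" or "chimerge".

ignore_const_cols: whether to ignore constant columns.

check_cate_num: check whether a categorical variable has more than 50 categories.

replace_blank: replace blank values with None.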
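The WOE calculation and the zero-count adjustment described above can be sketched as follows. This is an illustration, not the library's code: `woe_per_bin` is a hypothetical helper, the 0.99 replacement follows the description in this post, and computing the totals after the replacement is my own assumption.

```python
import math

def woe_per_bin(bins):
    """bins: list of (good_count, bad_count). Returns ln(bad% / good%) per bin."""
    # Replace a zero count with 0.99 so the logarithm stays finite.
    adjusted = [(good if good > 0 else 0.99, bad if bad > 0 else 0.99)
                for good, bad in bins]
    total_good = sum(good for good, _ in adjusted)
    total_bad = sum(bad for _, bad in adjusted)
    return [math.log((bad / total_bad) / (good / total_good))
            for good, bad in adjusted]

# Two ordinary bins: the first is mostly good, the second mostly bad.
regular = woe_per_bin([(80, 20), (20, 80)])

# The first bin has no bad samples at all; the adjustment keeps WOE finite.
with_empty_class = woe_per_bin([(10, 0), (10, 20)])
```

With this sign convention a negative WOE marks a bin with a lower share of bad samples than the population, and a positive WOE marks a riskier bin.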

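sc.woebin_plot() draws the bivariate plot of each binned variable. The plot shows the good and bad counts, the share of samples, and the bad rate of every bin, which makes the result easy to read.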

sc.woebin_plot(bins)

Bin adjustment

breaks_adj = sc.woebin_adj(dt_s, "creditability", bins)

def woebin_adj(dt, y, bins, adj_all_var=False, special_values=None, method="tree", save_breaks_list=None, count_distr_limit=0.05)

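Key parameter: adj_all_var: whether to also show the variables whose WOE is already monotonic.

The other parameters are the same as in woebin(). I did not study the bin-adjustment code in depth, and it raised errors when I ran it. My guess is that it adjusts the bins toward a monotonic bad rate.

The bins can also be adjusted manually: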

breaks_adj = {

'age.in.years': [26, 35, 40],

'other.debtors.or.guarantors': ["none", "co-applicant%,%guarantor"]

}

bins_adj = sc.woebin(dt_s, y="creditability", breaks_list=breaks_adj)
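The sketch below shows one way to read the manual breaks above. It rests on two assumptions of mine that the post does not state: numeric break points define left-closed, right-open intervals, and "%,%" joins the categories that share one bin. The helper names are hypothetical.

```python
from bisect import bisect_right

def numeric_bin(value, breaks):
    """Return the interval label that a numeric value falls into."""
    edges = [float("-inf")] + list(breaks) + [float("inf")]
    # bisect_right puts a value equal to a break point into the bin
    # that starts at that break point (left-closed intervals).
    i = bisect_right(breaks, value)
    return f"[{edges[i]},{edges[i + 1]})"

def categorical_bin(value, breaks):
    """Return the group label containing a category, or None if absent."""
    for group in breaks:
        if value in group.split("%,%"):
            return group
    return None

age_breaks = [26, 35, 40]
debtor_breaks = ["none", "co-applicant%,%guarantor"]

young = numeric_bin(25, age_breaks)
on_edge = numeric_bin(26, age_breaks)
oldest = numeric_bin(40, age_breaks)
merged = categorical_bin("guarantor", debtor_breaks)
```

Three break points therefore produce four age bins, and "co-applicant" and "guarantor" end up in the same bin.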

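WOE transformation

After binning, the raw values of each variable are converted to the WOE value of their bin. The model is then trained on the WOE values instead of the raw values.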

train_woe = sc.woebin_ply(train, bins_adj)

test_woe = sc.woebin_ply(test, bins_adj)
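Conceptually the transformation is a table lookup. The sketch below uses a made-up WOE table for a single categorical variable and a hypothetical helper `apply_woe`. The `_woe` column suffix is what I understand the library to produce, but treat it as an assumption.

```python
def apply_woe(rows, woe_tables, y):
    """Replace each raw value by the WOE of its bin; columns get a _woe suffix."""
    result = []
    for row in rows:
        converted = {y: row[y]}  # the target column is carried over unchanged
        for var, table in woe_tables.items():
            converted[var + "_woe"] = table[row[var]]
        result.append(converted)
    return result

# Hypothetical WOE table for one categorical variable (values are made up).
woe_tables = {"housing": {"rent": 0.40, "own": -0.19, "for free": 0.47}}
rows = [
    {"housing": "own", "creditability": 0},
    {"housing": "rent", "creditability": 1},
]
woe_rows = apply_woe(rows, woe_tables, "creditability")
```

Applying the same table to train and test keeps the two sets on the same scale, which is why both calls above pass the same bins_adj.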

Model building

y_train = train_woe.loc[:,'creditability']

X_train = train_woe.loc[:,train_woe.columns != 'creditability']

y_test = test_woe.loc[:,'creditability']

X_test = test_woe.loc[:,test_woe.columns != 'creditability']

# logistic regression ------

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)

lr.fit(X_train, y_train)

# lr.coef_

# lr.intercept_

# predicted probability

train_pred = lr.predict_proba(X_train)[:,1]

test_pred = lr.predict_proba(X_test)[:,1]

Model evaluation

train_perf = sc.perf_eva(y_train, train_pred, title = "train")

test_perf = sc.perf_eva(y_test, test_pred, title = "test")

def perf_eva(label, pred, title=None, groupnum=None, plot_type=["ks", "roc"], show_plot=True, positive="bad|1", seed=186)

perf_eva() can evaluate the model with KS, AUC (ROC), lift, and precision-recall curves. The plot_type parameter selects which of them to draw; the options are "ks", "lift", "roc", and "pr".
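For readers who want to see what the two headline numbers measure, here is a small pure-Python sketch of KS and AUC. It is not how perf_eva is implemented; both function names are my own, labels are assumed to be 1 for bad and 0 for good, and both classes must be present.

```python
def ks_statistic(labels, scores):
    """Largest gap between the cumulative bad and good shares."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all samples that share the same score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_bad += pairs[j][1]
            cum_good += 1 - pairs[j][1]
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks

def auc(labels, scores):
    """Chance that a random bad sample scores above a random good one."""
    bad = [s for s, l in zip(scores, labels) if l == 1]
    good = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]  # predicted probability of bad
ks_value = ks_statistic(labels, scores)
auc_value = auc(labels, scores)
```

The pairwise AUC is quadratic in the sample size, which is fine for a toy example but too slow for real data.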

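Score scaling

After the model has been evaluated, the predicted probabilities are mapped to scorecard points. The result includes each customer's total score and the points of each individual variable.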

card = sc.scorecard(bins_adj, lr, X_train.columns)

scorecard() returns a dict holding the base points and the points of each variable.

def scorecard(bins, model, xcolumns, points0=600, odds0=1/19, pdo=50, basepoints_eq0=False)

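The parameters of scorecard() are:

bins: the binning information, as returned by woebin().

model: the fitted model object.

points0: the target points, 600 by default.

odds0: the target odds p/(1-p) at which a customer receives points0, where p is the probability of being bad. The default is 1/19, that is, one bad for every 19 good.

pdo: points to double the odds, 50 by default.

basepoints_eq0: if True, the base points are spread across the variables.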
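The roles of points0, odds0 and pdo are easiest to see in the standard scaling formula score = a - b * ln(odds), with b = pdo / ln(2) and a = points0 + b * ln(odds0). The sketch below assumes that this usual formula applies and leaves out the rounding a real scorecard would do; the function names are hypothetical.

```python
import math

def scaling_constants(points0=600, odds0=1/19, pdo=50):
    """Offset a and factor b of the linear map from log-odds to points."""
    b = pdo / math.log(2)
    a = points0 + b * math.log(odds0)
    return a, b

def score_from_probability(p_bad, points0=600, odds0=1/19, pdo=50):
    """Total score of a customer with predicted bad probability p_bad."""
    a, b = scaling_constants(points0, odds0, pdo)
    return a - b * math.log(p_bad / (1 - p_bad))

def points_breakdown(intercept, coef_woe_pairs, points0=600, odds0=1/19, pdo=50):
    """Split the score into base points and one contribution per variable."""
    a, b = scaling_constants(points0, odds0, pdo)
    base = a - b * intercept
    per_variable = [-b * coef * woe for coef, woe in coef_woe_pairs]
    return base, per_variable

at_target = score_from_probability(0.05)      # odds 1/19, the target odds
at_double = score_from_probability(2 / 21)    # odds 2/19, twice the target

# Made-up intercept and (coefficient, WOE) pairs for two variables.
base, per_variable = points_breakdown(-2.0, [(0.8, 0.5), (1.1, -0.3)])
```

A customer at the target odds gets points0, and doubling the bad odds lowers the score by pdo. The base points plus the per-variable points add up to the same total as the probability-based formula.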

Score stability evaluation: PSI

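# total score of each customer under the scorecard
train_score = sc.scorecard_ply(train, card, print_step=0)
test_score = sc.scorecard_ply(test, card, print_step=0)

sc.perf_psi(
    score = {'train':train_score, 'test':test_score},
    label = {'train':y_train, 'test':y_test}
)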
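PSI compares how the scores of two samples are distributed over the same score bands. A minimal sketch of the standard formula, sum of (actual% - expected%) * ln(actual% / expected%), looks like this. The function `psi` and its `edges` argument are my own naming, and the sketch assumes that no band is empty in either sample.

```python
import math

def psi(expected, actual, edges):
    """expected/actual: lists of scores; edges: ascending interior cut points."""
    def distribution(scores):
        counts = [0] * (len(edges) + 1)
        for s in scores:
            # The band index is the number of cut points at or below the score.
            counts[sum(s >= e for e in edges)] += 1
        return [c / len(scores) for c in counts]

    expected_distr = distribution(expected)
    actual_distr = distribution(actual)
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_distr, actual_distr))

train_scores = [420, 480, 530, 560, 610, 650]
same = psi(train_scores, train_scores, edges=[500, 600])

# Two bands: the first sample is split 50/50, the second 25/75.
shifted = psi([1, 1, 3, 3], [1, 3, 3, 3], edges=[2])
```

Identical distributions give a PSI of zero, and the value grows as the two samples drift apart.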

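Summary: For beginners who want to learn modeling, a working script is a quick way to build a first scorecard. While teaching myself I read nearly every scorecard script I could find online, and the most practical one was still the semi-automated modeling package I found earlier on GitHub. I also tried writing a scorecard script of my own, and with that I finally consider myself past the beginner stage. scorecardpy is well packaged, but once something goes wrong at run time, debugging it is still a lot of trouble. By comparison, I find the semi-automated modeling package more practical. I attach the link below for anyone interested.

[Author]: Labryant

[Original WeChat official account]: 风控猎人

[About]: Strategy analyst at a startup, motivated and working hard to improve. Nothing is settled yet, and any of us could be the dark horse.

[Reposting]: Please credit the source when reposting. Thank you!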
