PSI: Population Stability Index (Continuous Variables)
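Introduction to PSI
- PSI stands for Population Stability Index. In risk control, a model usually stays in production for a long time after launch (typically more than a year). An unstable model directly undermines the soundness of the decisions built on it, so stability comes first.
- PSI measures how far the distribution of a validation sample across bins has drifted from the distribution of the modeling sample. It is commonly used to screen feature variables and to assess model stability:
  - Input variables: check that each variable stays stable (variable monitoring).
  - Model score: check that the score stays stable (model monitoring).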
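When building a model, the usual convention is:
- The training sample (In Sample, INS) serves as the expected distribution.
- The validation sample serves as the actual distribution. Validation samples include:
  - Out-of-sample data (Out of Sample, OOS)
  - Out-of-time data (Out of Time, OOT)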
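The three samples above can be sketched as follows. This is a minimal illustration, not the author's procedure: the column names `apply_date` and `x`, the cut-off date, and the 20% hold-out fraction are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: `apply_date` and `x` are made-up column names.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "apply_date": pd.date_range("2023-01-01", periods=365, freq="D"),
    "x": rng.normal(size=365),
})

cutoff = pd.Timestamp("2023-10-01")  # assumed time cut-off

# OOT: everything observed on or after the cut-off.
oot = data[data["apply_date"] >= cutoff]

# Modeling window: everything before the cut-off.
in_window = data[data["apply_date"] < cutoff]

# OOS: a random hold-out from the modeling window; INS: the rest.
oos = in_window.sample(frac=0.2, random_state=0)
ins = in_window.drop(oos.index)

print(len(ins), len(oos), len(oot))
```

INS would then play the role of the expected distribution, while OOS and OOT are each compared against it as actual distributions.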
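PSI formula
- PSI is computed as:
$$
\begin{aligned}
PSI &= \sum (\text{actual share} - \text{expected share}) \ln\left(\frac{\text{actual share}}{\text{expected share}}\right) \\
&= \sum\limits_{buckets} (actual\_pct - expect\_pct)\ln\left(\frac{actual\_pct}{expect\_pct}\right)
\end{aligned}
$$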
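where `actual_pct` is the share of the actual sample that falls in the current bin and `expect_pct` is the share of the expected sample in the same bin. In the code below these are the columns `origin_percent` and `new_percent`, respectively. Each term keeps its value when the two shares are swapped, so PSI does not depend on which sample is labeled "actual" and which "expected".
- Note: `np.log` uses base e by default. Information theory usually uses base 2, so information is measured in bits; machine learning usually uses the natural base e, in which case the unit is called nats.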
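As a worked illustration of the formula and of the note on the logarithm base, here is a minimal sketch. The bin shares are hypothetical numbers chosen for the example, and `psi_from_shares` is a helper name introduced here, not part of the original code.

```python
import numpy as np

# Hypothetical bin shares (each array sums to 1).
expect_pct = np.array([0.10, 0.20, 0.40, 0.20, 0.10])  # expected, e.g. training
actual_pct = np.array([0.08, 0.18, 0.38, 0.22, 0.14])  # actual, e.g. validation


def psi_from_shares(actual, expect, log=np.log):
    """Sum of (actual - expect) * log(actual / expect) over all bins."""
    actual = np.asarray(actual, dtype=float)
    expect = np.asarray(expect, dtype=float)
    return float(np.sum((actual - expect) * log(actual / expect)))


psi_nats = psi_from_shares(actual_pct, expect_pct)           # natural log
psi_bits = psi_from_shares(actual_pct, expect_pct, np.log2)  # base-2 log

print(psi_nats, psi_bits)
# Swapping the two distributions leaves the value unchanged.
print(np.isclose(psi_nats, psi_from_shares(expect_pct, actual_pct)))
# Changing the base only rescales the result by 1 / ln(2).
print(np.isclose(psi_bits, psi_nats / np.log(2)))
```

Because the base only rescales the result, thresholds quoted for PSI implicitly assume one particular base; the code in this article uses `np.log`, that is, base e.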
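PSI value ranges

| PSI range | Stability | Recommendation |
| --- | --- | --- |
| 0 ~ 0.1 | Good | Little or no change |
| 0.1 ~ 0.25 | Slightly unstable | Some change; keep monitoring subsequent shifts |
| > 0.25 | Unstable | Large change; analyze the features |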
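The thresholds above can be wrapped in a small helper for monitoring reports. This is only a sketch: `psi_stability` is a name introduced here, and assigning the boundary values 0.1 and 0.25 to the lower band is an assumption, since the ranges above do not say which side they belong to.

```python
def psi_stability(psi):
    """Map a PSI value to a stability label.

    Assumption: the boundary values 0.1 and 0.25 count toward the lower band.
    """
    if psi <= 0.1:
        return "good"
    if psi <= 0.25:
        return "slightly unstable"
    return "unstable"


for value in (0.02, 0.11, 0.30):
    print(value, psi_stability(value))
```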
Computing PSI for a continuous random variable

Code:

    import pandas as pd
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from tqdm import notebook

    # Build the data set
    cancer = load_breast_cancer()
    df = pd.DataFrame(cancer.data, columns=['_'.join(i.split()) for i in cancer.feature_names])
    df['y'] = cancer.target
    # No random_state is set, so the split (and the PSI values below) vary between runs.
    X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df['y'], test_size=.2)
    print(X_train.shape)  # (455, 30)
    print(X_test.shape)   # (114, 30)

PSI calculation code

    import scorecardpy as sc  # only needed for buckets_type='tree' / 'chimerge'


    def psi_calculate(origin, new, feature_name, origin_y=None, y_name=None,
                      buckets_type='cut', bins_num=10):
        """Compute the PSI of a single continuous variable.

        `origin` is treated as the actual data and `new` as the expected data.
        Unsupervised binning ('cut', 'qcut') does not need the target variable.
        Supervised binning ('tree', 'chimerge') relies on the scorecardpy library.

        Parameters
        ----------
        origin : DataFrame, actual data
        new : DataFrame, expected data
        feature_name : str, column to compute PSI for (continuous)
        origin_y : Series, target values (supervised binning only)
        y_name : str, name of the target variable
        buckets_type : str, binning method: 'cut', 'qcut', 'tree' or 'chimerge'
        bins_num : int, number of bins

        Returns
        -------
        psi : float, PSI value
        psi_df : DataFrame, per-bin PSI details

        Examples
        --------
        Equal-width / equal-frequency binning (unsupervised):
        >>> psi, psi_df = psi_calculate(origin=X_train, new=X_test,
        ...                             feature_name='mean_radius',
        ...                             buckets_type='qcut', bins_num=10)

        Decision-tree / ChiMerge binning (supervised):
        >>> psi, psi_df = psi_calculate(origin=X_train, new=X_test,
        ...                             feature_name='mean_radius',
        ...                             origin_y=y_train, y_name='y',
        ...                             buckets_type='chimerge')
        """
        origin = origin[[feature_name]]
        new = new[[feature_name]]

        if buckets_type == 'cut':  # equal-width binning
            origin_min = origin[feature_name].min()
            origin_max = origin[feature_name].max()
            binlen = (origin_max - origin_min) / bins_num  # width of each bin
            bins = [origin_min + i * binlen for i in range(1, bins_num)]
            # Open-ended outer bins, so out-of-range values in `new` are still counted.
            bins.insert(0, -float("inf"))
            bins.append(float("inf"))
            origin_cut = pd.cut(origin[feature_name], bins=bins).value_counts(sort=False).reset_index()
            new_cut = pd.cut(new[feature_name], bins=bins).value_counts(sort=False).reset_index()
            origin_cut.columns = ['buckets', 'origin_cnt']
            new_cut.columns = ['buckets', 'new_cnt']
        elif buckets_type == 'qcut':  # equal-frequency binning
            qcut_data = pd.qcut(origin[feature_name], q=bins_num, duplicates='drop')
            origin_cut = (origin[feature_name]
                          .groupby(qcut_data, observed=False)
                          .count().rename('origin_cnt'))
            # Reuse the interval categories: the edges returned by retbins=True
            # suffer from floating-point error. Values of `new` outside the range
            # of `origin` fall into no interval and are not counted.
            qcut_bins = origin_cut.index.categories
            origin_cut = origin_cut.reset_index()
            new_cut = (new[feature_name]
                       .groupby(pd.cut(new[feature_name], bins=qcut_bins), observed=False)
                       .count().rename('new_cnt').reset_index())
            origin_cut.columns = ['buckets', 'origin_cnt']
            new_cut.columns = ['buckets', 'new_cnt']
        elif buckets_type in ['tree', 'chimerge']:
            # Supervised binning via scorecardpy (decision tree or ChiMerge);
            # both need the target variable.
            origin_cut_data = pd.concat([origin[feature_name], origin_y], axis=1)
            origin_cut = sc.woebin(origin_cut_data, y=y_name, method=buckets_type)[feature_name]
            break_list = [float(i) for i in origin_cut.breaks.tolist()]
            break_list.insert(0, -np.inf)
            origin_cut = origin_cut[['bin', 'count']].copy()
            new_cut = (new[feature_name]
                       .groupby(pd.cut(new[feature_name], bins=break_list, right=False),  # [left, right)
                                observed=False)
                       .count().rename('new_cnt').reset_index())
            new_cut[feature_name] = new_cut[feature_name].astype('str').str.replace(' ', '', regex=False)
            origin_cut.columns = ['buckets', 'origin_cnt']
            new_cut.columns = ['buckets', 'new_cnt']
            # Bin labels can differ by floating-point error, e.g.
            # [-inf,0.14999999999999997) vs [-inf,0.15), which would break the merge
            # below. Copy the labels over; this relies on both tables listing the
            # bins in the same order.
            origin_cut['buckets'] = new_cut['buckets']
        else:
            raise ValueError("buckets_type must be one of 'cut', 'qcut', 'tree', 'chimerge'")

        origin_cut['feature'] = feature_name
        new_cut['feature'] = feature_name
        origin_cut = origin_cut[['feature', 'buckets', 'origin_cnt']]
        new_cut = new_cut[['feature', 'buckets', 'new_cnt']]
        psi_df = pd.merge(origin_cut, new_cut, on=['feature', 'buckets'])

        # Add 1 to each count so that no share is zero (a zero share would make
        # the log term undefined). The totals in the denominators cannot be zero here.
        psi_df['origin_percent'] = (psi_df['origin_cnt'] + 1) / psi_df['origin_cnt'].sum()
        psi_df['new_percent'] = (psi_df['new_cnt'] + 1) / psi_df['new_cnt'].sum()
        psi_df['minus'] = psi_df['origin_percent'] - psi_df['new_percent']
        psi_df['log'] = np.log(psi_df['origin_percent'] / psi_df['new_percent'])
        psi_df['psi_bucket'] = psi_df['minus'] * psi_df['log']
        psi = psi_df['psi_bucket'].sum()
        psi_df['psi'] = psi
        return psi, psi_df

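The function above adds 1 to every bin count before dividing by the total. The sketch below shows why some form of smoothing is needed and what this particular choice does to the shares. The counts are hypothetical and `psi_from_counts` is a helper introduced only for this illustration.

```python
import numpy as np

# Hypothetical bin counts; the last bin is empty in the new sample.
origin_cnt = np.array([50, 30, 20])
new_cnt = np.array([12, 8, 0])


def psi_from_counts(origin_cnt, new_cnt, add=0):
    """PSI from raw counts, adding `add` to every count before dividing."""
    origin_pct = (origin_cnt + add) / origin_cnt.sum()
    new_pct = (new_cnt + add) / new_cnt.sum()
    return float(np.sum((origin_pct - new_pct) * np.log(origin_pct / new_pct)))


# Without smoothing, the empty bin gives a zero share and the log term blows up.
with np.errstate(divide="ignore"):
    raw = psi_from_counts(origin_cnt, new_cnt)

# With the +1 adjustment every share is positive and the result is finite.
smoothed = psi_from_counts(origin_cnt, new_cnt, add=1)

print(raw, smoothed)
# Side effect: the smoothed shares no longer sum to exactly 1.
print(((new_cnt + 1) / new_cnt.sum()).sum())
```

With only 20 observations in the new sample, the extra counts shift the shares noticeably; on larger samples the distortion is much smaller.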
PSI for a single variable (equal-frequency binning)

    # PSI of a single feature
    psi, psi_df = psi_calculate(X_train, X_test, buckets_type='qcut', feature_name='mean_radius')
    psi

Output:

    0.11023008607241508

PSI for a single variable (ChiMerge binning as an example)

    psi, psi_df = psi_calculate(origin=X_train,
                                new=X_test,
                                feature_name='mean_radius',
                                origin_y=y_train,
                                y_name='y',
                                buckets_type='chimerge')
    psi

Output:

    0.02009471667376726

PSI for all variables (ChiMerge binning as an example)

    # PSI of every feature
    psi_list = []
    psi_df_list = []
    for feature in notebook.tqdm(X_train.columns):
        print(feature)
        psi, psi_df = psi_calculate(origin=X_train,
                                    new=X_test,
                                    feature_name=feature,
                                    origin_y=y_train,
                                    y_name='y',
                                    buckets_type='chimerge')
        psi_list.append((feature, psi))
        psi_df_list.append(psi_df)

    psi = pd.DataFrame(psi_list, columns=['feature', 'psi'])  # one row per feature
    psi_df = pd.concat(psi_df_list, ignore_index=False)       # per-bin details for all features