Building a Customer Churn Model with WOE & IV

Sarah_07

Tags: machine learning, python, data analysis, algorithms

Article link: https://blog.csdn.net/Sarah_07/article/details/126336430

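Traffic dividends keep shrinking and the cost of acquiring new customers keeps rising. For example, if a campaign costs 100,000 yuan and brings in 1,000 new customers, the acquisition cost is 100 yuan per customer. If the average order value is 50 yuan, the campaign loses money overall whenever a customer places fewer than 2 orders on average over the whole lifecycle. Given that rising acquisition costs are a fact we cannot change, the way to improve ROI is to increase the customer's lifetime value, i.e. the LTV. A churn model is one of the methods that serve this goal, and the concept is simple and direct: predict each user's churn risk, and if it exceeds some threshold, judge that the user is very likely to churn. Such users then need an intervention through private-domain marketing, such as sending a coupon, mailing a free gift, or inviting them to an offline event.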
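The break-even reasoning can be checked with a few lines of arithmetic. This is a minimal sketch that assumes revenue per order equals the average order value (margins are ignored); the helper name `net_value_per_customer` is hypothetical.

```python
def net_value_per_customer(cac, avg_order_value, orders):
    """Lifetime revenue minus acquisition cost for one customer (margins ignored)."""
    return avg_order_value * orders - cac

cac = 100.0             # acquisition cost per customer, in yuan
avg_order_value = 50.0  # average order value, in yuan

# Number of orders a customer must place before the acquisition cost is recovered.
break_even_orders = cac / avg_order_value
print('break-even orders:', break_even_orders)

# Fewer orders than the break-even point loses money, more orders earns money.
print('net value after 1 order :', net_value_per_customer(cac, avg_order_value, 1))
print('net value after 3 orders:', net_value_per_customer(cac, avg_order_value, 3))
```

Raising the average number of orders per customer is what a churn intervention tries to achieve.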

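Many algorithms can be used for a churn model, such as LR (logistic regression), GBDT, and NN. I plan to use two posts to briefly introduce how to predict churn risk with LR.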

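Put simply, building the model follows data cleaning -> EDA -> feature selection -> model establishment -> deployment. Besides the final predictive model, colleagues in operations or the boss sometimes also care about what actually made users churn. Findings from the EDA stage can answer this kind of question, and WOE is a very effective tool for uncovering which characteristics make users prone to churn. This post introduces how to use WOE and Information Value.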

WOE and IV basics

What is WOE

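WOE stands for weight of evidence. It analyzes the influence of a factor on the target variable by binning the factor. For each bin, $\mathrm{WOE} = \ln\left(\frac{\%\ \text{of non-events}}{\%\ \text{of events}}\right)$, where % of non-events is the share of all non-events that fall into the bin, and % of events is the share of all events that fall into the bin.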

If we define customer churn as the event, WOE > 0 means the users in that bin are more likely not to churn, and WOE < 0 means they are more likely to churn.
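To make the sign convention concrete, here is a small sketch that computes WOE for three bins. The counts are made up for illustration and do not come from the dataset used later in this post.

```python
import numpy as np

# Hypothetical counts per bin: non-churned (non-events) and churned (events) customers.
non_events = np.array([100, 300, 600])
events = np.array([150, 100, 50])

pct_non_events = non_events / non_events.sum()   # share of all non-events in each bin
pct_events = events / events.sum()               # share of all events in each bin
woe = np.log(pct_non_events / pct_events)

for i, w in enumerate(woe):
    tendency = 'less likely to churn' if w > 0 else 'more likely to churn'
    print(f'bin {i}: WOE = {w:+.2f} ({tendency})')
```

The first bin holds 10% of the non-churned customers but 50% of the churned ones, so its WOE is negative; the last bin is the opposite case.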

Using WOE for feature processing has the following benefits:

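1. With reasonable binning, the WOE values are monotonically increasing or decreasing across the bins. Take the continuous variable Age: if we split it into four groups, <20, 20-35, 35-50, and >50, the WOE values computed for the groups should form a monotonic sequence. If they do not, the binning is unreasonable.
2. For categorical variables with very many levels, levels with similar WOE values can be regrouped into one group. This reduces the number of levels and avoids the impact of too many features on the model.
3. It follows from point 1 that WOE is robust to outliers, because an extreme value simply falls into the outermost bin.
4. Missing values are usually put into a bin of their own, so WOE also handles missing values well.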
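The outlier and missing-value points can be illustrated with the Age example. This is a sketch on made-up values, assuming open-ended outer bins; the bin labels are illustrative.

```python
import numpy as np
import pandas as pd

age = pd.Series([18, 25, 40, 62, 999, np.nan])   # 999 is an outlier, NaN is missing

# Open-ended outer edges, so extreme values still land in an existing bin.
edges = [-np.inf, 20, 35, 50, np.inf]
labels = ['<=20', '20-35', '35-50', '>50']
age_bin = pd.cut(age, bins=edges, labels=labels)

# Treat missing values as a bin of their own.
age_bin = age_bin.cat.add_categories('MISSING').fillna('MISSING')
print(age_bin.tolist())
```

After binning, the outlier 999 is indistinguishable from 62, and the missing value becomes an ordinary category that receives its own WOE.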

What is IV

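IV stands for information value; it is used for feature selection. Summing over the bins $i$ of a feature: $\mathrm{IV} = \sum_i \left(\%\ \text{of non-events}_i - \%\ \text{of events}_i\right) \times \mathrm{WOE}_i$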

IV	Predictive Power
<0.02	useless for prediction
0.02-0.1	weak predictor
0.1-0.3	medium predictor
0.3-0.5	strong predictor
>0.5	too good to be true
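As a quick illustration of the formula and the table, the sketch below computes IV for two made-up features. The function name `information_value` and all counts are hypothetical, and the function assumes every bin has non-zero counts.

```python
import numpy as np

def information_value(non_events, events):
    """IV = sum((% of non-events - % of events) * WOE) over bins with non-zero counts."""
    pct_non_events = np.asarray(non_events, dtype=float) / np.sum(non_events)
    pct_events = np.asarray(events, dtype=float) / np.sum(events)
    woe = np.log(pct_non_events / pct_events)
    return float(np.sum((pct_non_events - pct_events) * woe))

# Same distribution of events and non-events across bins: the feature carries no information.
iv_flat = information_value([100, 200, 300], [10, 20, 30])
# Events are over-represented in the first bin: the feature separates the two groups.
iv_skewed = information_value([300, 350, 350], [150, 100, 50])

print('flat feature IV  :', round(iv_flat, 4))
print('skewed feature IV:', round(iv_skewed, 4))
```

Each term of the sum is non-negative, because the difference and the WOE always share the same sign, so IV only grows as the two distributions move apart.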

Demo Code

import pandas as pd
import numpy as np
import scipy.stats as stats
from pandas.api.types import is_numeric_dtype
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns',None)

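# df is assumed to be loaded already, e.g. df = pd.read_csv('telco_churn.csv')  # hypothetical file name
df.head(2)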

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1,889.50	No

# Strip thousands separators and convert to numeric; values that cannot be parsed (e.g. blanks) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'].astype(str).str.replace(',', '', regex=False), errors='coerce')

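# Lower-case the first letter of every column name, e.g. 'TotalCharges' -> 'totalCharges'
df.columns = [c[0].lower()+c[1:] for c in df.columns]
# label: 1 = not churned ("good"), 0 = churned ("bad")
df['label'] = df['churn'].map({'Yes':0,'No':1})
# Turn the 0/1 flag into a categorical feature
df['seniorCitizen'] = df['seniorCitizen'].map({1: 'Yes', 0: 'No'})
df.drop(['customerID', 'churn'], axis=1, inplace=True)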

Create one class each for CategoricalFeature and ContinuousFeature to compute WOE

class CategoricalFeature():
    def __init__(self,df,feature):
        self.df = df
        self.feature = feature
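    @property
    def df_bins(self):
        # Same property name as in ContinuousFeature, since Analysis.group_by_feature reads feat.df_bins
        df_bin = self.df.copy()  # copy so that the 'bin' column is not added to the caller's DataFrame
        df_bin['bin'] = df_bin[self.feature].fillna('MISSING')
        return df_bin[['bin','label']]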

class ContinuousFeature():
    def __init__(self, df, feature):
        self.df = df
        self.feature = feature
        self.bin_min_size = int(len(self.df) * 0.05)

    def __generate_bins(self, bins_num):
        df = self.df[[self.feature, 'label']].copy()  # copy so that adding 'bin' does not touch self.df
        df['bin'] = pd.qcut(df[self.feature], bins_num, duplicates='drop') \
                    .apply(lambda x: x.left) \
                    .astype(float)
        return df

    def __generate_correct_bins(self, bins_max=20):
        for bins_num in range(bins_max, 1, -1):
            df = self.__generate_bins(bins_num)
            df_grouped = pd.DataFrame(df.groupby('bin') \
                                      .agg({self.feature: 'count',
                                            'label': 'sum'})) \
                                      .reset_index()
            # Rank correlation between the bin's lower edge and its number of good rows
            r, p = stats.spearmanr(df_grouped['bin'], df_grouped['label'])

            if (
                    abs(r)==1 and                                                        # good counts are strictly monotonic across bins (proxy for monotonic WOE)
                    df_grouped[self.feature].min() > self.bin_min_size                   # every bin holds more than 5% of the rows
                    and not (df_grouped[self.feature] == df_grouped['label']).any()      # no bin consists only of good (label == 1) rows
            ):
                break

        return df

    @property
    def df_bins(self):
        df_bin = self.__generate_correct_bins()
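        # Rows without a value get their own bin; cast to object first so floats and the string can coexist
        df_bin['bin'] = df_bin['bin'].astype(object).fillna('MISSING')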
        return df_bin[['bin', 'label']]
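One pass of the bin search can be tried on synthetic data to see what the three checks look at. This is a sketch, not the class itself: the data is randomly generated, the column name `x` is made up, and the lower bin edge is taken from the edges returned by `pd.qcut(..., retbins=True)` instead of from each interval's `left` attribute.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=2000)
# Synthetic labels: the chance of label == 1 rises with x.
label = (rng.uniform(size=2000) < 0.2 + 0.006 * x).astype(int)
toy = pd.DataFrame({'x': x, 'label': label})

# Quantile bins; labels=False returns integer bin codes, retbins=True also returns the edges.
codes, edges = pd.qcut(toy['x'], 5, labels=False, retbins=True, duplicates='drop')
toy['bin'] = edges[codes.to_numpy()]   # lower edge of the bin each row falls into

grouped = toy.groupby('bin').agg(count=('x', 'count'), good=('label', 'sum')).reset_index()
r, p = stats.spearmanr(grouped['bin'], grouped['good'])

print(grouped)
print('Spearman r:', r)   # |r| == 1 means the good counts are monotonic across bins
print('smallest bin:', grouped['count'].min(), '| 5% threshold:', int(len(toy) * 0.05))
print('bins with only good rows:', int((grouped['count'] == grouped['good']).sum()))
```

The class repeats this for 20 bins, then 19, and so on, and stops at the first bin count for which all three conditions hold.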

Let's take totalCharges as an example to see how WOE is computed

df['totalCharges'] = df['totalCharges'].astype(float)

cols = ['totalCharges']
feats_dict = {}
for col in cols:
    if is_numeric_dtype(df[col]):
        feats_dict[col] = ContinuousFeature(df, col)
    else:
        feats_dict[col] = CategoricalFeature(df, col)

Create a class to handle IV

class Analysis():
    def seq_palette(self, n_colors):
        return sns.cubehelix_palette(n_colors, start=.5, rot=-.75, reverse=True)

    def group_by_feature(self, feat):
        df = feat.df_bins \
                            .groupby('bin') \
                            .agg({'label': ['count', 'sum']}) \
                            .reset_index()
        df.columns = [feat.feature, 'count', 'good']
        df['bad'] = df['count'] - df['good']
        return df
    
class IV(Analysis):
    @staticmethod
    def __perc_share(df, group_name):
        return df[group_name] / df[group_name].sum()

    def __calculate_perc_share(self, feat):
        df = self.group_by_feature(feat)
        df['perc_good'] = self.__perc_share(df, 'good')
        df['perc_bad'] = self.__perc_share(df, 'bad')
        df['perc_diff'] = df['perc_good'] - df['perc_bad']
        return df

    def __calculate_woe(self, feat):
        df = self.__calculate_perc_share(feat)
        df['woe'] = np.log(df['perc_good']/df['perc_bad'])
        df['woe'] = df['woe'].replace([np.inf, -np.inf], np.nan).fillna(0)
        return df

    def calculate_iv(self, feat):
        df = self.__calculate_woe(feat)
        df['iv'] = df['perc_diff'] * df['woe']
        return df, df['iv'].sum()

    def draw_woe(self, feat):
        iv_df, iv_value = self.calculate_iv(feat)
        fig, ax = plt.subplots(figsize=(10,6))
        sns.barplot(x=feat.feature, y='woe', data=iv_df, palette=self.seq_palette(len(iv_df.index)))
        ax.set_title('WOE visualization for: ' + feat.feature)
        plt.show()

    @staticmethod
    def interpretation(iv):
        if iv < 0.02:
            return 'useless'
        elif iv < 0.1:
            return 'weak'
        elif iv < 0.3:
            return 'medium'
        elif iv < 0.5:
            return 'strong'
        else:
            return 'suspicious'

    def interpret_iv(self, feat):
        _, iv = self.calculate_iv(feat)
        return self.interpretation(iv)

    def print_iv(self, feat):
        _, iv = self.calculate_iv(feat)
        print('Information value: %0.2f' % iv)
        print('%s is a %s predictor' % (feat.feature.capitalize(), self.interpretation(iv)))

iv = IV()

feat_charges = feats_dict['totalCharges']
iv.group_by_feature(feat_charges)

	totalCharges	count	good	bad
0	18.80	1172	599	573
1	197.95	1172	819	353
2	678.37	1172	889	283
3	1,397.47	1172	899	273
4	2,745.37	1172	944	228
5	4,919.84	1172	1013	159
6	MISSING	11	11	0

iv_df, iv_value = iv.calculate_iv(feat_charges)
display(iv_df)
print('Information value: ', iv_value)

	totalCharges	count	good	bad	perc_good	perc_bad	perc_diff	woe	iv
0	18.80	1172	599	573	0.12	0.31	-0.19	-0.97	0.19
1	197.95	1172	819	353	0.16	0.19	-0.03	-0.18	0.01
2	678.37	1172	889	283	0.17	0.15	0.02	0.13	0.00
3	1,397.47	1172	899	273	0.17	0.15	0.03	0.17	0.00
4	2,745.37	1172	944	228	0.18	0.12	0.06	0.40	0.02
5	4,919.84	1172	1013	159	0.20	0.09	0.11	0.83	0.09
6	MISSING	11	11	0	0.00	0.00	0.00	0.00	0.00

Information value:  0.3152298256031499
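The table can be reproduced from the good and bad counts alone, which is a handy way to check the formulas. The sketch below copies the counts from the group_by_feature output and repeats the steps of the IV class without using the class.

```python
import numpy as np
import pandas as pd

# good/bad counts copied from the group_by_feature output
counts = pd.DataFrame({
    'totalCharges': ['18.80', '197.95', '678.37', '1,397.47', '2,745.37', '4,919.84', 'MISSING'],
    'good': [599, 819, 889, 899, 944, 1013, 11],
    'bad': [573, 353, 283, 273, 228, 159, 0],
})

counts['perc_good'] = counts['good'] / counts['good'].sum()
counts['perc_bad'] = counts['bad'] / counts['bad'].sum()

# A bin without any bad rows gives log(x / 0) = inf; it is set to 0, as in the IV class.
woe = np.log(counts['perc_good'] / counts['perc_bad'])
counts['woe'] = woe.replace([np.inf, -np.inf], np.nan).fillna(0)
counts['iv'] = (counts['perc_good'] - counts['perc_bad']) * counts['woe']

iv_total = counts['iv'].sum()
print(counts.round(2))
print('Information value:', round(iv_total, 4))
```

Note that the MISSING bin contributes nothing to the total only because its infinite WOE is replaced by 0.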

iv.draw_woe(feat_charges)

[Figure: WOE visualization for: totalCharges (bar chart of WOE per bin)]

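The figure above shows that the WOE turns from negative to positive at the bin starting at 678.367. In preprocessing we defined not churned as 1, i.e. good, and churned as 0, so WOE > 0 means that users with totalCharges > 678.367 tend not to churn. The takeaway for operations colleagues is that users with totalCharges < 678.367 should be nudged to purchase again through proactive outreach. This is the higher-ROI way to reduce churn: the WOE does rise noticeably between the 1397.474 and 2745.367 bins, but getting a customer to go from spending a little over 1,000 to nearly 3,000 is considerably harder.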
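For readers who prefer churn rates to WOE values, the same conclusion can be read from the counts. In the preprocessing code, label 0 (bad) marks churned customers, so the churn rate of a bin is bad / count. A sketch using the counts printed earlier (the column name `totalCharges_from` is made up):

```python
import pandas as pd

# Counts copied from the group_by_feature output; the MISSING bin (11 rows, 0 bad) is left out of the per-bin view.
bins = pd.DataFrame({
    'totalCharges_from': [18.80, 197.95, 678.37, 1397.47, 2745.37, 4919.84],
    'count': [1172] * 6,
    'bad': [573, 353, 283, 273, 228, 159],   # bad = churned (label 0)
})
bins['churn_rate'] = bins['bad'] / bins['count']

# Overall churn rate over all customers, including the 11 rows with missing totalCharges.
overall = bins['bad'].sum() / (bins['count'].sum() + 11)

# A bin has WOE < 0 exactly when its churn rate is above the overall churn rate.
bins['above_overall'] = bins['churn_rate'] > overall
print(bins.round(3))
print('overall churn rate:', round(overall, 3))
```

Only the two lowest bins churn more than average, which matches the two negative WOE values.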

iv.print_iv(feat_charges)

Information value: 0.32
Totalcharges is a strong predictor

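This post mainly introduced the basic concepts of WOE and IV and how to implement them in Python. The next post will show how to combine WOE with an LR model to build a complete churn model.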