客户生命周期可分为四个阶段:潜在客户阶段、响应客户阶段、既得客户阶段、流失客户阶段
本章整体是一个客户价值预测的案例,背景是某信用卡公司在地推活动之后,获取了大量客户的信用卡申请信息,其中一个部分客户顺利开卡,并且有月消费记录,而另外一部分客户没有激活信用卡。公司的营销部门希望对潜在消费能力高的客户进行激活卡普的影响活动。在营销活动之前,需要对客户的潜在价值进行预测,或分析不同客户特征对客户价值的影响程度,这就涉及数值预测的相关方法,比较常用的就是线性回归模型
1. 线性回归
1.1 简单线性回归
使用statsmodel实现简单线性回归
本案例使用汽车贷款数据,相应字段如下
字段名 | 中文含义 |
---|---|
id | id |
Acc | 是否开卡(1=已开通) |
avg_exp | 月均信用卡支出(元) |
avg_exp_ln | 月均信用卡支出的自然对数 |
gender | 性别(男=1) |
Age | 年龄 |
Income | 年收入(万元) |
Ownrent | 是否自有住房(有=1;无=0) |
Selfempl | 是否自谋职业(1=yes, 0=no) |
dist_home_val | 所住小区房屋均价(万元) |
dist_avg_income | 当地人均收入 |
high_avg | 高出当地平均收入 |
edu_class | 教育等级:小学及以下开通=0,中学=1,本科=2,研究生=3 |
本案例需要对用户月均信用卡支出建立预测模型,考虑到该变量与收入可能存在线性关系,因此可以先使用收入Income
作为解释变量,建立简单线性回归模型
引入相应的包
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
from pandas.core import datetools
#用于设置pandas数据框最大的显示列数,超过该显示列数会使用省略号来代替
pd.set_option('display.max_columns', None)
读取数据
raw = pd.read_csv('creditcard_exp.csv', skipinitialspace=True)
# skipinitialspace,忽略分隔符后的空白(默认为False,即不忽略)
raw.head()
输出结果
通过简单的观察和分析,可以得到以下数据清洗策略:
1)月均信用卡支出avg_exp
为空值的是没有开卡(Acc = 0)的客户,因此使用avg_exp
非空的数据进行建模,使用模型预测那些avg_exp
为空的数据
2)Acc
是用于代表用户时候开卡,只有开卡用户才会有信用卡支出,因此Acc
不能进入模型
3)age2
是Age
的平方,在这个案例意义不大,因此可以删除
4)avg_exp_In
是月均信用卡支出avg_exp
的对数值
5)字段字段id
显然不能进入模型
数据清洗和描述性统计分析
exp = raw[raw['avg_exp'].notnull()].copy().iloc[:,2:].drop('age2',axis=1)
exp_new = raw[raw['avg_exp'].isnull()].copy().iloc[:,2:].drop('age2',axis=1)
# 进行变量的描述性统计分析
exp.describe(include='all')
相关性分析
要进行线性回归分析,首先要确认解释变量和被解释变量之间存在线性关系(或进过转换后存在线性关系)
#散点图
exp.plot('Income', 'avg_exp', kind='scatter')
plt.show()
exp[['Income', 'avg_exp', 'Age', 'dist_home_val']].corr(method='pearson')
相关性分析可以判断出avg_exp
与Income
确实存在较高的相关性,可以用简单线性回归来建立模型
lm_s = ols('avg_exp ~ Income', data=exp).fit()
# 输出一元线性回归系数
print(lm_s.params)
本段使用了statsmodel中的ols函数,该函数需要传入一个字符串作为参数(也被称为formula公式),该字符串使用波浪线“~”来分隔被解释变量和解释变量
ols函数返回一个模型(最小二乘法),该模型使用fit方法进行训练。训练完成的模型使用params属性保存解释变量的系数与方程的截距
## 线性回归算法
### 简单线性回归
In[6]:
lm_s = ols(‘avg_exp ~ Income+Age+dist_home_val’, data=exp).fit()
lm_s.summary()
Predict-在原始数据集上得到预测值和残差
In[7]:
lm_s.summary()
In[8]:
pd.DataFrame([lm_s.predict(exp), lm_s.resid], index=[‘predict’, ‘resid’]
).T.head()
在待预测数据集上得到预测值
In[9]:
lm_s.predict(exp_new)[:5]
### 多元线性回归
In[10]:
lm_m = ols(‘avg_exp ~ Age + Income + dist_home_val + dist_avg_income’,
data=exp).fit()
lm_m.summary()
### 多元线性回归的变量筛选
In[11]:
‘’‘forward select’’’
def forward_select(data, response):
remaining = set(data.columns)
remaining.remove(response)
selected = []
current_score, best_new_score = float(‘inf’), float(‘inf’)
while remaining:
aic_with_candidates=[]
for candidate in remaining:
formula = “{} ~ {}”.format(
response,’ + '.join(selected + [candidate]))
aic = ols(formula=formula, data=data).fit().aic
aic_with_candidates.append((aic, candidate))
aic_with_candidates.sort(reverse=True)
best_new_score, best_candidate=aic_with_candidates.pop()
if current_score > best_new_score:
remaining.remove(best_candidate)
selected.append(best_candidate)
current_score = best_new_score
print (‘aic is {},continuing!’.format(current_score))
else:
print (‘forward selection over!’)
break
formula = "{} ~ {} ".format(response,' + '.join(selected))
print('final formula is {}'.format(formula))
model = ols(formula=formula, data=data).fit()
return(model)
In[12]:
data_for_select = exp[[‘avg_exp’, ‘Income’, ‘Age’, ‘dist_home_val’,
‘dist_avg_income’]]
lm_m = forward_select(data=data_for_select, response=‘avg_exp’)
print(lm_m.rsquared)
# 线性回归的诊断
### 残差分析
In[13]:
ana1 = lm_s
In[14]:
exp[‘Pred’] = ana1.predict(exp)
exp[‘resid’] = ana1.resid
exp.plot(‘Pred’, ‘resid’,kind=‘scatter’)
plt.show()
In[15]:
遇到异方差情况,教科书上会介绍使用加权最小二乘法,但是实际上最常用的是对被解释变量取对数
ana1 = ols(‘avg_exp ~ Income’, exp).fit()
ana1.summary()
In[15]:
ana2 = ols(‘avg_exp_ln ~ Income’, exp).fit()
exp[‘Pred’] = ana2.predict(exp)
exp[‘resid’] = ana2.resid
exp.plot(‘Income’, ‘resid’,kind=‘scatter’)
#ana2.summary()
#plt.show()
取对数会使模型更有解释意义
In[16]:
exp[‘Income_ln’] = np.log(exp[‘Income’])
In[17]:
ana3 = ols(‘avg_exp_ln ~ Income_ln’, exp).fit()
exp[‘Pred’] = ana3.predict(exp)
exp[‘resid’] = ana3.resid
exp.plot(‘Income_ln’, ‘resid’,kind=‘scatter’)
plt.show()
#%%
ana3.summary()
寻找最优的模型
In[18]:
r_sq = {‘exp~Income’:ana1.rsquared, ‘ln(exp)~Income’:ana2.rsquared,
‘ln(exp)~ln(Income)’:ana3.rsquared}
print(r_sq)
### 强影响点分析
In[19]:
exp[‘resid_t’] = (exp[‘resid’] - exp[‘resid’].mean()) / exp[‘resid’].std()
Find outlier:
In[20]:
exp[abs(exp[‘resid_t’]) > 2]
Drop outlier
In[21]:
exp2 = exp[abs(exp[‘resid_t’]) <= 2].copy()
ana4 = ols(‘avg_exp_ln ~ Income_ln’, exp2).fit()
exp2[‘Pred’] = ana4.predict(exp2)
exp2[‘resid’] = ana4.resid
exp2.plot(‘Income’, ‘resid’, kind=‘scatter’)
plt.show()
In[22]:
ana4.rsquared
statemodels包提供了更多强影响点判断指标
In[23]:
from statsmodels.stats.outliers_influence import OLSInfluence
OLSInfluence(ana3).summary_frame().head()
### 增加变量
经过单变量线性回归的处理,我们基本对模型的性质有了一定的了解,接下来可以放入更多的连续型解释变量。在加入变量之前,要注意变量的函数形式转变。比如当地房屋均价、当地平均收入,其性质和个人收入一样,都需要取对数
In[24]:
exp2[‘dist_home_val_ln’] = np.log(exp2[‘dist_home_val’])
exp2[‘dist_avg_income_ln’] = np.log(exp2[‘dist_avg_income’])
ana5 = ols(’’‘avg_exp_ln ~ Age + Income_ln +
dist_home_val_ln + dist_avg_income_ln’’’, exp2).fit()
ana5.summary()
### 多重共线性分析
In[25]:
Step regression is not always work.
In[26]:
ana5.bse # The standard errors of the parameter estimates
The function “statsmodels.stats.outliers_influence.variance_inflation_factor” uses “OLS” to fit data, and it will generates a wrong rsquared. So define it ourselves!
In[27]:
def vif(df, col_i):
cols = list(df.columns)
cols.remove(col_i)
cols_noti = cols
formula = col_i + ‘~’ + ‘+’.join(cols_noti)
r2 = ols(formula, df).fit().rsquared
return 1. / (1. - r2)
In[28]:
exog = exp2[[‘Income_ln’, ‘dist_home_val_ln’,
‘dist_avg_income_ln’]]
for i in exog.columns:
print(i, ‘\t’, vif(df=exog, col_i=i))
Income_ln与dist_avg_income_ln具有共线性,使用“高出平均收入的比率”代替其中一个
In[29]:
exp2[‘high_avg_ratio’] = exp2[‘high_avg’] / exp2[‘dist_avg_income’]
In[30]:
exog1 = exp2[[‘high_avg_ratio’, ‘dist_home_val_ln’,
‘dist_avg_income_ln’]]
for i in exog1.columns:
print(i, ‘\t’, vif(df=exog1, col_i=i))
In[31]:
var_select = exp2[[‘avg_exp_ln’, ‘high_avg_ratio’,
‘dist_home_val_ln’, ‘dist_avg_income_ln’]]
ana7 = forward_select(data=var_select, response=‘avg_exp_ln’)
print(ana7.rsquared)
In[32]:
formula8 = ‘’’
avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln +
C(gender) + C(Ownrent) + C(Selfempl) + C(edu_class)
‘’’
ana8 = ols(formula8, exp2).fit()
ana8.summary()
In[33]:
formula9 = ‘’’
avg_exp_ln ~ dist_avg_income_ln + dist_home_val_ln +
C(Selfempl) + C(gender):C(edu_class)
‘’’
ana9 = ols(formula9, exp2).fit()
ana9.summary()
## 正则算法
### 岭回归
In[34]:
lmr = ols(‘avg_exp ~ Income + dist_home_val + dist_avg_income’,
data=exp).fit_regularized(alpha=1, L1_wt=0)
lmr.summary()
L1_wt参数为0则使用岭回归,为1使用lasso
### LASSO算法
In[35]:
lmr1 = ols(‘avg_exp ~ Age + Income + dist_home_val + dist_avg_income’,
data=exp).fit_regularized(alpha=1, L1_wt=1)
lmr1.summary()
### 使用scikit-learn进行正则化参数调优
In[36]:
from sklearn.preprocessing import StandardScaler
continuous_xcols = [‘Age’, ‘Income’, ‘dist_home_val’,
‘dist_avg_income’] # 抽取连续变量
scaler = StandardScaler() # 标准化
X = scaler.fit_transform(exp[continuous_xcols])
y = exp[‘avg_exp_ln’]
In[37]:
from sklearn.linear_model import RidgeCV
alphas = np.logspace(-2, 3, 100, base=10)
Search the min MSE by CV
rcv = RidgeCV(alphas=alphas, store_cv_values=True)
rcv.fit(X, y)
In[38]:
print(‘The best alpha is {}’.format(rcv.alpha_))
print(‘The r-square is {}’.format(rcv.score(X, y)))
Default score is rsquared
In[39]:
X_new = scaler.transform(exp_new[continuous_xcols])
np.exp(rcv.predict(X_new)[:5])
In[40]:
cv_values = rcv.cv_values_
n_fold, n_alphas = cv_values.shape
cv_mean = cv_values.mean(axis=0)
cv_std = cv_values.std(axis=0)
ub = cv_mean + cv_std / np.sqrt(n_fold)
lb = cv_mean - cv_std / np.sqrt(n_fold)
plt.semilogx(alphas, cv_mean, label=‘mean_score’)
plt.fill_between(alphas, lb, ub, alpha=0.2)
plt.xlabel("
a
l
p
h
a
\\alpha
alpha")
plt.ylabel(“mean squared errors”)
plt.legend(loc=“best”)
plt.show()
In[41]:
rcv.coef_
手动选择正则化系数——根据业务判断
岭迹图
In[42]:
from sklearn.linear_model import Ridge
ridge = Ridge()
coefs = []
for alpha in alphas:
ridge.set_params(alpha=alpha)
ridge.fit(X, y)
coefs.append(ridge.coef_)
In[43]:
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale(‘log’)
plt.xlabel(‘alpha’)
plt.ylabel(‘weights’)
plt.title(‘Ridge coefficients as a function of the regularization’)
plt.axis(‘tight’)
plt.show()
In[44]:
ridge.set_params(alpha=40)
ridge.fit(X, y)
ridge.coef_
In[45]:
ridge.score(X, y)
预测
In[46]:
np.exp(ridge.predict(X_new)[:5])
lasso
In[54]:
from sklearn.linear_model import LassoCV
lasso_alphas = np.logspace(-3, 0, 100, base=10)
lcv = LassoCV(alphas=lasso_alphas, cv=10) # Search the min MSE by CV
lcv.fit(X, y)
print(‘The best alpha is {}’.format(lcv.alpha_))
print(‘The r-square is {}’.format(lcv.score(X, y)))
Default score is rsquared
In[49]:
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso_coefs = []
for alpha in lasso_alphas:
lasso.set_params(alpha=alpha)
lasso.fit(X, y)
lasso_coefs.append(lasso.coef_)
In[50]:
ax = plt.gca()
ax.plot(lasso_alphas, lasso_coefs)
ax.set_xscale(‘log’)
plt.xlabel(‘alpha’)
plt.ylabel(‘weights’)
plt.title(‘Lasso coefficients as a function of the regularization’)
plt.axis(‘tight’)
plt.show()
In[51]:
lcv.coef_
弹性网络
In[52]:
from sklearn.linear_model import ElasticNetCV
l1_ratio = [.1, .5, .7, .9, .95, .99]
encv = ElasticNetCV(l1_ratio=l1_ratio)
encv.fit(X, y)
In[53]:
print(‘The best l1_ratio is {}’.format(encv.l1_ratio_))
print(‘The best alpha is {}’.format(encv.alpha_))
#%%