Logistic Regression

This post walks through building a logistic regression model end to end: data cleaning, variable selection, model fitting, prediction, and evaluation. Data cleaning covers encoding string variables, handling erroneous values, and imputing missing values; variable selection uses information value (IV); after fitting, the model is tuned with cross-validation and evaluated with accuracy, a confusion matrix, and the ROC curve.

A Logistic Regression Modeling Case Study

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
Field         Meaning
ID            Customer ID
Suc_flag      Successful-activation flag
ARPU          ARPU after activation
PromCnt12     Marketing contacts in the past 12 months
PromCnt36     Marketing contacts in the past 36 months
PromCntMsg12  Marketing SMS messages in the past 12 months
PromCntMsg36  Marketing SMS messages in the past 36 months
Class         Customer importance tier (based on spending with the previous carrier)
Age           Age
Gender        Gender
HomeOwner     Home ownership
AvgARPU       Local average ARPU
AvgHomeValue  Local average home value
AvgIncome     Local average per-capita income
telecom = pd.read_csv('teleco_camp.csv',skipinitialspace=True)
telecom.head()
    ID  Suc_flag  ARPU  PromCnt12  PromCnt36  PromCntMsg12  PromCntMsg36  Class   Age Gender HomeOwner    AvgARPU  AvgHomeValue  AvgIncome
0   12         1  50.0       5.65       9.50           1.6           3.0      4  79.0      M         H  49.894904         33400      39460
1   53         0   NaN       4.50       9.00           1.4           3.6      3  71.0      M         H  48.574742         37600      33545
2   67         1  25.0       6.40      11.00           2.0           3.6      1  79.0      F         H  49.272646        100400      42091
3   71         1  80.0       7.15      10.25           2.4           3.6      1  63.0      F         H  47.334953         39900      39313
4  142         1  15.0       5.90      10.50           2.0           3.8      1   NaN      F         U  47.827404         47500          0
telecom.describe(include='all')
(output shown transposed)
               count  unique  top  freq           mean           std        min        25%        26%→50%       75%          max
ID              9686     NaN  NaN   NaN   97975.474086  56550.171120  12.000000  48835.500   99106.000  148538.750   191779.000
Suc_flag        9686     NaN  NaN   NaN       0.500000      0.500026   0.000000      0.000       0.500       1.000        1.000
ARPU            4843     NaN  NaN   NaN      78.121722     62.225686   5.000000     50.000      65.000     100.000     1000.000
PromCnt12       9686     NaN  NaN   NaN       3.447212      1.231890   0.750000      2.900       3.250       3.650       15.150
PromCnt36       9686     NaN  NaN   NaN       7.337059      1.952436   1.000000      6.250       7.750       8.250       19.500
PromCntMsg12    9686     NaN  NaN   NaN       1.178402      0.287226   0.200000      1.000       1.200       1.400        3.600
PromCntMsg36    9686     NaN  NaN   NaN       2.390935      0.914314   0.400000      1.400       2.600       3.200        5.600
Class           9686     NaN  NaN   NaN       2.424530      1.049047   1.000000      2.000       2.000       3.000        4.000
Age             7279     NaN  NaN   NaN      59.150845     16.516400   0.000000     47.000      60.000      73.000       87.000
Gender          9686       3    F  5223            NaN           NaN        NaN        NaN         NaN         NaN          NaN
HomeOwner       9686       2    H  5377            NaN           NaN        NaN        NaN         NaN         NaN          NaN
AvgARPU         9686     NaN  NaN   NaN      52.905156      4.993775  46.138968     49.760116   50.876672   54.452822    99.444787
AvgHomeValue    9686     NaN  NaN   NaN  110986.299814  98670.855450   0.000000  52300.000   76900.000  128175.000   600000.000
AvgIncome       9686     NaN  NaN   NaN   40491.444249  28707.494146   0.000000  24464.000   43100.000   56876.000   200001.000

Data Cleaning

  • Encode string variables
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
telecom['Gender'] = le.fit_transform(telecom['Gender'])    # automatic encoding for Gender

telecom['HomeOwner'].replace({'H': 0, 'U': 1}, inplace=True)   # manual encoding for HomeOwner
  • Handle erroneous values
for col in ['AvgIncome', 'Age', 'AvgHomeValue']:
    telecom[col].replace({0: np.NaN, }, inplace=True)      # zero values in AvgIncome, Age, and AvgHomeValue are actually missing
  • Impute missing values
from sklearn.preprocessing import Imputer   # in scikit-learn >= 0.22, use sklearn.impute.SimpleImputer instead

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)   # impute with the column mean
telecom1 = pd.DataFrame(imputer.fit_transform(telecom), columns=telecom.columns)
  • Handle extreme values
def blk(floor, root):        # 'blk' returns a capping function
    def f(x):
        if x < floor:
            x = floor
        elif x > root:
            x = root
        return x
    return f

q1 = telecom1['Age'].quantile(0.01)          # 1st and 99th percentiles
q99 = telecom1['Age'].quantile(0.99)
blk_tot = blk(floor=q1, root=q99)      # 'blk_tot' is a capping function for Age
telecom1['Age'] = telecom1['Age'].map(blk_tot)
telecom1.describe()
(output shown transposed)
              count           mean            std          min         25%           50%          75%          max
ID             9686   97975.474086   56550.171120    12.000000   48835.500   99106.000000  148538.750   191779.000
Suc_flag       9686       0.500000       0.500026     0.000000       0.000       0.500000       1.000        1.000
ARPU           9686      78.121722      43.997933     5.000000      65.000      78.121722      78.121722   1000.000
PromCnt12      9686       3.447212       1.231890     0.750000       2.900       3.250000       3.650       15.150
PromCnt36      9686       7.337059       1.952436     1.000000       6.250       7.750000       8.250       19.500
PromCntMsg12   9686       1.178402       0.287226     0.200000       1.000       1.200000       1.400        3.600
PromCntMsg36   9686       2.390935       0.914314     0.400000       1.400       2.600000       3.200        5.600
Class          9686       2.424530       1.049047     1.000000       2.000       2.000000       3.000        4.000
Age            9686      59.230106      14.046835    21.000000      51.000      59.158972      69.000       86.000
Gender         9686       0.516312       0.600716     0.000000       0.000       0.000000       1.000        2.000
HomeOwner      9686       0.444869       0.496977     0.000000       0.000       0.000000       1.000        1.000
AvgARPU        9686      52.905156       4.993775    46.138968      49.760116   50.876672      54.452822     99.444787
AvgHomeValue   9686  112179.202755   97997.592632  7500.000000   53500.000   78450.000000  128175.000   600000.000
AvgIncome      9686   53513.457361   17227.468161  2499.000000   42775.000   53513.457361   56876.000   200001.000

Variable Selection

Screening with models

  • Screen with a model that reports feature importances
from sklearn import ensemble

X = telecom1.loc[:, 'PromCnt12':]  # explanatory variables (ARPU only exists after activation, so it must be excluded)
y = telecom1['Suc_flag']  # response variable

crf = ensemble.RandomForestClassifier()
crf.fit(X=X, y=y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
pd.Series(X.columns, index=crf.feature_importances_).sort_index(ascending=False)
0.356981       PromCnt12
0.304656    PromCntMsg12
0.073231       PromCnt36
0.069373    PromCntMsg36
0.052378         AvgARPU
0.042860    AvgHomeValue
0.037396       AvgIncome
0.031021             Age
0.015768           Class
0.009526          Gender
0.006809       HomeOwner
dtype: object
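Note that the line above stores the importances in the Series *index* and the feature names as values, which works but is unusual. The conventional idiom puts names in the index and importances in the values, so the result stays numeric and sortable. A sketch on toy data (names and data are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
toy_X = pd.DataFrame(rng.rand(100, 3), columns=['a', 'b', 'c'])
toy_y = (toy_X['a'] > 0.5).astype(int)  # only 'a' is informative

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)
# feature names as index, importances as values
importances = pd.Series(clf.feature_importances_, index=toy_X.columns).sort_values(ascending=False)
```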
  • Screen with IV (information value) — can surface nonlinear relationships
# Function to compute the IV between a target and a predictor
def IV_between(y, x):  # y and x are pd.Series
    all_i = y.groupby(x).count()
    bad_i = y.groupby(x).sum()  # assumes 1 = bad, 0 = good
    good_i = all_i - bad_i
    p1 = bad_i / bad_i.sum()
    p0 = good_i / good_i.sum()
    woe = np.log((p1 + 1e-5) / (p0 + 1e-5))  # 1e-5 keeps the log well-defined
    IV = (p1 - p0) * woe
    return IV.sum()
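To see the formula at work, here is a hand-checkable toy example (the function is repeated so the snippet is self-contained): with two levels whose bad-rate shares are 2/3 and 1/3, each level's WOE is ±ln 2 (up to the 1e-5 smoothing), so IV ≈ 2 × (2/3 − 1/3) × ln 2 ≈ 0.462:

```python
import math
import numpy as np
import pandas as pd

def IV_between(y, x):  # same function as above
    all_i = y.groupby(x).count()
    bad_i = y.groupby(x).sum()
    good_i = all_i - bad_i
    p1 = bad_i / bad_i.sum()
    p0 = good_i / good_i.sum()
    woe = np.log((p1 + 1e-5) / (p0 + 1e-5))
    IV = (p1 - p0) * woe
    return IV.sum()

toy_y = pd.Series([1, 1, 0, 0, 0, 1])           # 1 = bad, 0 = good
toy_x = pd.Series(['a', 'a', 'a', 'b', 'b', 'b'])
iv = IV_between(toy_y, toy_x)                   # ≈ (2/3) * ln(2) ≈ 0.462
```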
IV = pd.Series()

for i in X.columns:
    if len(X[i].unique()) > 10 and X[i].dtype != np.object:  # discretize variables with more than 10 levels before computing IV
        try:
            tmp = pd.qcut(X[i], 5)  # equal-frequency binning
        except ValueError:
            tmp = pd.cut(X[i], 5)   # equal-width binning (fallback when quantile edges are not unique)
        IV = IV.append(pd.Series([i], index=[IV_between(y, tmp)]))
    else:
        IV = IV.append(pd.Series([i], index=[IV_between(y, X[i])]))  # variables with at most 10 levels are treated as discrete
        
sorted_IV = IV.sort_index(ascending=False)
sorted_IV
0.479255       PromCnt12
0.326262    PromCntMsg12
0.040798           Class
0.032807         AvgARPU
0.031443    PromCntMsg36
0.016009       PromCnt36
0.014777    AvgHomeValue
0.013139             Age
0.004153       AvgIncome
0.000263       HomeOwner
0.000030          Gender
dtype: object

There are many ways to select variables; as an example, this post keeps the 8 variables with the highest IV.

selected_features = ['Suc_flag', ]
selected_features.extend(sorted_IV.iloc[:8])
telecom2 = telecom1[selected_features]
telecom2.head()
   Suc_flag  PromCnt12  PromCntMsg12  Class    AvgARPU  PromCntMsg36  PromCnt36  AvgHomeValue        Age
0       1.0       5.65           1.6    4.0  49.894904           3.0       9.50       33400.0  79.000000
1       0.0       4.50           1.4    3.0  48.574742           3.6       9.00       37600.0  71.000000
2       1.0       6.40           2.0    1.0  49.272646           3.6      11.00      100400.0  79.000000
3       1.0       7.15           2.4    1.0  47.334953           3.6      10.25       39900.0  63.000000
4       1.0       5.90           2.0    1.0  47.827404           3.8      10.50       47500.0  59.158972

Train/Test Split

  • Create dummy variables for the categorical predictor
telecom3 = telecom2.join(pd.get_dummies(telecom['Class'])).drop('Class', axis=1)
telecom3.head()
   Suc_flag  PromCnt12  PromCntMsg12    AvgARPU  PromCntMsg36  PromCnt36  AvgHomeValue        Age  1  2  3  4
0       1.0       5.65           1.6  49.894904           3.0       9.50       33400.0  79.000000  0  0  0  1
1       0.0       4.50           1.4  48.574742           3.6       9.00       37600.0  71.000000  0  0  1  0
2       1.0       6.40           2.0  49.272646           3.6      11.00      100400.0  79.000000  1  0  0  0
3       1.0       7.15           2.4  47.334953           3.6      10.25       39900.0  63.000000  1  0  0  0
4       1.0       5.90           2.0  47.827404           3.8      10.50       47500.0  59.158972  1  0  0  0
from sklearn.model_selection import train_test_split

data = telecom3.iloc[:, 1:]
target = telecom3['Suc_flag']
train_data, test_data, train_target, test_target = train_test_split(
    data, target, test_size=0.4, train_size=0.6, random_state=123) 
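Suc_flag here is perfectly balanced (mean 0.5), so a plain random split works; with an imbalanced target, passing `stratify` keeps the class proportions equal across the two sets. A sketch on toy data (the 10%-positive target is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(200).reshape(-1, 1)
toy_y = np.array([0] * 180 + [1] * 20)         # 10% positive class

# stratify=toy_y preserves the 10% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.4, stratify=toy_y, random_state=123)
```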

Feature Scaling

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(train_data)

scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)

Fitting the Logistic Regression Model

from sklearn.linear_model import LogisticRegression

logistic_model = LogisticRegression()
logistic_model.fit(scaled_train_data, train_target)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Prediction

Predicting class labels

train_predict = logistic_model.predict(scaled_train_data)
test_predict = logistic_model.predict(scaled_test_data)

Predicting probabilities

train_proba = logistic_model.predict_proba(scaled_train_data)[:, 1]  # probability of each class; keep the probability of label 1
test_proba = logistic_model.predict_proba(scaled_test_data)[:, 1]

Evaluation

  • Mean accuracy
logistic_model.score(scaled_test_data, test_target)
0.7685161290322581
  • Confusion matrix
from sklearn import metrics

print(metrics.confusion_matrix(test_target, test_predict, labels=[0, 1]))  # confusion matrix
print(metrics.classification_report(test_target, test_predict))  # per-class evaluation metrics
[[1510  395]
 [ 502 1468]]
             precision    recall  f1-score   support

        0.0       0.75      0.79      0.77      1905
        1.0       0.79      0.75      0.77      1970

avg / total       0.77      0.77      0.77      3875
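The report's numbers follow directly from the confusion matrix. Taking class 1 as the positive class, TP = 1468, FP = 395, FN = 502:

```python
tp, fp, fn = 1468, 395, 502          # read off the confusion matrix above

precision = tp / (tp + fp)           # 0.787... -> printed as 0.79
recall = tp / (tp + fn)              # 0.745... -> printed as 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean -> 0.77
```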
  • ROC curve and AUC
fpr_test, tpr_test, th_test = metrics.roc_curve(test_target, test_proba)
fpr_train, tpr_train, th_train = metrics.roc_curve(train_target, train_proba)

plt.figure(figsize=[6, 6])
plt.plot(fpr_test, tpr_test, 'b-')
plt.plot(fpr_train, tpr_train, 'r-')
plt.title('ROC curve')
print('AUC = %6.4f' %metrics.auc(fpr_test, tpr_test))
AUC = 0.8304

[Figure: ROC curves — test set (blue) vs. training set (red)]

Cross-Validation Tuning

from sklearn.linear_model import LogisticRegressionCV

lrcv = LogisticRegressionCV(Cs=10, cv=4)   # each C is the inverse of the regularization strength
lrcv.fit(scaled_train_data, train_target)
lrcv.scores_
{1.0: array([[0.50550206, 0.54470426, 0.59009629, 0.6781293 , 0.76753783,
         0.78129298, 0.78404402, 0.78404402, 0.78404402, 0.78404402],
        [0.50584997, 0.55333792, 0.60289057, 0.68891948, 0.75430145,
         0.76944253, 0.76324845, 0.76324845, 0.76256022, 0.76256022],
        [0.50550964, 0.54063361, 0.58333333, 0.64049587, 0.74104683,
         0.75550964, 0.76033058, 0.76033058, 0.76033058, 0.76033058],
        [0.50550964, 0.54752066, 0.58402204, 0.70661157, 0.75757576,
         0.76584022, 0.76033058, 0.76101928, 0.76101928, 0.76101928]])}
lrcv.C_  # the C with the best cross-validation score
array([2.7825594])
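The selected value 2.7825594 is not arbitrary: when `Cs` is an integer, LogisticRegressionCV searches that many C values log-spaced between 1e-4 and 1e4, and 2.7826 is the sixth point on the 10-value grid:

```python
import numpy as np

Cs_grid = np.logspace(-4, 4, 10)  # the grid searched when Cs=10
# the selected lrcv.C_ = 2.7825594 corresponds to Cs_grid[5]
```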
test_cv_proba = lrcv.predict_proba(scaled_test_data)[:, 1]
fpr_test, tpr_test, th_test = metrics.roc_curve(test_target, test_cv_proba)

print('AUC = %6.4f' %metrics.auc(fpr_test, tpr_test))
AUC = 0.8311