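Data source:
DC Competition, a big-data competition platform (www.dcjingsai.com)
Task:
Given information about enterprise customers, build a classification model that predicts whether a customer will churn.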
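Data fields:
- (1) ID: record number
- (2) Contract: contract type (Month-to-month, One year, or Two year)
- (3) Dependents: whether the customer has dependents
- (4) DeviceProtection: whether the customer has device protection
- (5) InternetService: internet service type (DSL, Fiber optic, or No)
- (6) MonthlyCharges: monthly charges
- (7) MultipleLines: whether the customer has multiple lines
- (8) Partner: whether the customer has a partner
- (9) PaymentMethod: payment method
- (10) PhoneService: whether the customer has phone service
- (11) SeniorCitizen: whether the customer is a senior citizen
- (12) TVProgram: whether the customer has a TV program subscription
- (13) TotalCharges: total charges
- (14) gender: customer gender
- (15) tenure: length of tenure (the field description says years; in the sample rows TotalCharges is roughly MonthlyCharges × tenure, which suggests the unit is months)
- (16) Churn: whether the customer churned (this column is named Label in the data files)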
Scoring:
Submissions are scored with accuracy_score from sklearn.metrics; the higher the score, the better the model.
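As a rough illustration of the metric, accuracy is the share of predictions that match the true labels. The sketch below computes it by hand on made-up labels; the helper name `accuracy` and the sample data are hypothetical, and in practice `sklearn.metrics.accuracy_score` would be called instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration only
y_true = ["No", "Yes", "No", "No", "Yes"]
y_pred = ["No", "No", "No", "No", "Yes"]
score = accuracy(y_true, y_pred)
print(score)  # 4 of the 5 predictions match
```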
Loading the data and importing common libraries:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('客户流失判断train.csv')
test = pd.read_csv('客户流失判断test_noLabel.csv')
Data preview:
train.head()
# Output:
  | ID | Contract | Dependents | DeviceProtection | InternetService | MonthlyCharges | MultipleLines | Partner | PaymentMethod | PhoneService | SeniorCitizen | TVProgram | TotalCharges | gender | tenure | Label
0 | 0 | One year | No | No internet service | No | 24.150000 | Yes | Yes | Bank transfer (automatic) | Yes | 0 | No internet service | 1505.900000 | Male | 60 | No
1 | 1 | Month-to-month | No | No | Fiber optic | 76.142284 | Yes | No | Electronic check | Yes | 0 | No | 946.581518 | Female | 12 | Yes
2 | 2 | Month-to-month | Yes | No internet service | No | 26.200000 | Yes | Yes | Electronic check | Yes | 0 | No internet service | 1077.500000 | Female | 40 | No
3 | 3 | Two year | Yes | No internet service | No | 24.650000 | Yes | Yes | Bank transfer (automatic) | Yes | 0 | No internet service | 1138.800000 | Female | 45 | No
4 | 4 | Month-to-month | Yes | No internet service | No | 19.150000 | No | Yes | Mailed check | Yes | 0 | No internet service | 477.600000 | Male | 25 | No
test.head()
print('train{},test{}'.format(train.shape, test.shape))
# Output:
train(5227, 16),test(1307, 15)
train.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5227 entries, 0 to 5226
Data columns (total 16 columns):
ID 5227 non-null int64
Contract 5227 non-null object
Dependents 5227 non-null object
DeviceProtection 5227 non-null object
InternetService 5227 non-null object
MonthlyCharges 5227 non-null float64
MultipleLines 5227 non-null object
Partner 5227 non-null object
PaymentMethod 5227 non-null object
PhoneService 5227 non-null object
SeniorCitizen 5227 non-null int64
TVProgram 5227 non-null object
TotalCharges 5227 non-null float64
gender 5227 non-null object
tenure 5227 non-null int64
Label 5227 non-null object
dtypes: float64(2), int64(3), object(11)
memory usage: 653.5+ KB
test.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1307 entries, 0 to 1306
Data columns (total 15 columns):
ID 1307 non-null int64
Contract 1307 non-null object
Dependents 1307 non-null object
DeviceProtection 1307 non-null object
InternetService 1307 non-null object
MonthlyCharges 1307 non-null float64
MultipleLines 1307 non-null object
Partner 1307 non-null object
PaymentMethod 1307 non-null object
PhoneService 1307 non-null object
SeniorCitizen 1307 non-null int64
TVProgram 1307 non-null object
TotalCharges 1307 non-null float64
gender 1307 non-null object
tenure 1307 non-null int64
dtypes: float64(2), int64(3), object(10)
memory usage: 153.3+ KB
Neither the training set nor the test set has missing values, so no imputation is needed.
Descriptive statistics:
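The same conclusion can be checked directly rather than read off the `info()` output. The following is a minimal sketch on a made-up frame (the name `toy` and its values are hypothetical); on the real data the same calls would be applied to `train` and `test`.

```python
import pandas as pd

# Hypothetical toy frame with two deliberate gaps
toy = pd.DataFrame({
    "MonthlyCharges": [24.15, None, 26.20],
    "Contract": ["One year", "Month-to-month", None],
    "tenure": [60, 12, 40],
})

missing_per_column = toy.isnull().sum()        # number of missing cells per column
has_missing = bool(missing_per_column.any())   # True if any column has a gap
print(missing_per_column)
print(has_missing)
```

If every count is zero, as the `info()` output indicates for this dataset, no imputation step is required.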
train.describe(include = 'all')
# Output:
       | ID | Contract | Dependents | DeviceProtection | InternetService | MonthlyCharges | MultipleLines | Partner | PaymentMethod | PhoneService | SeniorCitizen | TVProgram | TotalCharges | gender | tenure | Label
count  | 5227.000000 | 5227 | 5227 | 5227 | 5227 | 5227.000000 | 5227 | 5227 | 5227 | 5227 | 5227.000000 | 5227 | 5227.000000 | 5227 | 5227.000000 | 5227
unique | NaN | 3 | 2 | 3 | 3 | NaN | 3 | 2 | 4 | 2 | NaN | 3 | NaN | 2 | NaN | 2
top    | NaN | Month-to-month | No | No | Fiber optic | NaN | No | No | Electronic check | Yes | NaN | No | NaN | Female | NaN | No
freq   | NaN | 3386 | 4049 | 2777 | 2803 | NaN | 2542 | 3014 | 2517 | 4857 | NaN | 2292 | NaN | 2650 | NaN | 3280
mean   | 2613.000000 | NaN | NaN | NaN | NaN | 66.823765 | NaN | NaN | NaN | NaN | 0.118615 | NaN | 2084.477153 | NaN | 28.775971 | NaN
std    | 1509.049259 | NaN | NaN | NaN | NaN | 28.862749 | NaN | NaN | NaN | NaN | 0.323366 | NaN | 2183.825066 | NaN | 24.293077 | NaN
min    | 0.000000 | NaN | NaN | NaN | NaN | 18.250000 | NaN | NaN | NaN | NaN | 0.000000 | NaN | 18.800000 | NaN | 0.000000 | NaN
25%    | 1306.500000 | NaN | NaN | NaN | NaN | 45.000000 | NaN | NaN | NaN | NaN | 0.000000 | NaN | 292.979609 | NaN | 5.000000 | NaN
50%    | 2613.000000 | NaN | NaN | NaN | NaN | 74.200000 | NaN | NaN | NaN | NaN | 0.000000 | NaN | 1218.650000 | NaN | 23.000000 | NaN
75%    | 3919.500000 | NaN | NaN | NaN | NaN | 89.900000 | NaN | NaN | NaN | NaN | 0.000000 | NaN | 3373.825000 | NaN | 51.000000 | NaN
max    | 5226.000000 | NaN | NaN | NaN | NaN | 118.600000 | NaN | NaN | NaN | NaN | 1.000000 | NaN | 8564.750000 | NaN | 72.000000 | NaN
# Print the distinct values of every string-typed column
for col in list(train.columns):
    if type(train[col].unique()[0]) is str:
        print(col, train[col].unique())
# Output:
Contract ['One year' 'Month-to-month' 'Two year']
Dependents ['No' 'Yes']
DeviceProtection ['No internet service' 'No' 'Yes']
InternetService ['No' 'Fiber optic' 'DSL']
MultipleLines ['Yes' 'No' 'No phone service']
Partner ['Yes' 'No']
PaymentMethod ['Bank transfer (automatic)' 'Electronic check' 'Mailed check'
'Credit card (automatic)']
PhoneService ['Yes' 'No']
TVProgram ['No internet service' 'No' 'Yes']
gender ['Male' 'Female']
Label ['No' 'Yes']
Features with only two values are encoded with a label encoder; features with more than two values are one-hot encoded:
# Find the string-typed features that take only two values
object_cols = [col for col in train.columns if train[col].dtype == 'object']
labelcol = [col for col in object_cols if train[col].nunique() == 2]
# Label-encode them: fit on the training set and reuse the same mapping on the
# test set (which has no Label column), so both sets share one encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in labelcol:
    train[col] = label_encoder.fit_transform(train[col])
    if col in test.columns:
        test[col] = label_encoder.transform(test[col])
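# Find the features that need one-hot encoding
onehotcol = [col for col in object_cols if train[col].nunique() > 2]
# One-hot encode
dum_train = pd.get_dummies(train[onehotcol])
dum_train = dum_train.astype('int')
dum_test = pd.get_dummies(test[onehotcol])
# Align the test dummies to the training columns in case a category is absent from the test set
dum_test = dum_test.reindex(columns=dum_train.columns, fill_value=0)
dum_test = dum_test.astype('int')
# Merge the dummy columns back in
train.drop(onehotcol, axis=1, inplace=True)
test.drop(onehotcol, axis=1, inplace=True)
train = pd.concat([dum_train, train], axis=1)
test = pd.concat([dum_test, test], axis=1)
# Check the result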
train.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5227 entries, 0 to 5226
Data columns (total 29 columns):
Contract_Month-to-month 5227 non-null int64
Contract_One year 5227 non-null int64
Contract_Two year 5227 non-null int64
DeviceProtection_No 5227 non-null int64
DeviceProtection_No internet service 5227 non-null int64
DeviceProtection_Yes 5227 non-null int64
InternetService_DSL 5227 non-null int64
InternetService_Fiber optic 5227 non-null int64
InternetService_No 5227 non-null int64
MultipleLines_No 5227 non-null int64
MultipleLines_No phone service 5227 non-null int64
MultipleLines_Yes 5227 non-null int64
PaymentMethod_Bank transfer (automatic) 5227 non-null int64
PaymentMethod_Credit card (automatic) 5227 non-null int64
PaymentMethod_Electronic check 5227 non-null int64
PaymentMethod_Mailed check 5227 non-null int64
TVProgram_No 5227 non-null int64
TVProgram_No internet service 5227 non-null int64
TVProgram_Yes 5227 non-null int64
ID 5227 non-null int64
Dependents 5227 non-null int64
MonthlyCharges 5227 non-null float64
Partner 5227 non-null int64
PhoneService 5227 non-null int64
SeniorCitizen 5227 non-null int64
TotalCharges 5227 non-null float64
gender 5227 non-null int64
tenure 5227 non-null int64
Label 5227 non-null int64
dtypes: float64(2), int64(27)
memory usage: 1.2 MB
test.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1307 entries, 0 to 1306
Data columns (total 28 columns):
Contract_Month-to-month 1307 non-null int64
Contract_One year 1307 non-null int64
Contract_Two year 1307 non-null int64
DeviceProtection_No 1307 non-null int64
DeviceProtection_No internet service 1307 non-null int64
DeviceProtection_Yes 1307 non-null int64
InternetService_DSL 1307 non-null int64
InternetService_Fiber optic 1307 non-null int64
InternetService_No 1307 non-null int64
MultipleLines_No 1307 non-null int64
MultipleLines_No phone service 1307 non-null int64
MultipleLines_Yes 1307 non-null int64
PaymentMethod_Bank transfer (automatic) 1307 non-null int64
PaymentMethod_Credit card (automatic) 1307 non-null int64
PaymentMethod_Electronic check 1307 non-null int64
PaymentMethod_Mailed check 1307 non-null int64
TVProgram_No 1307 non-null int64
TVProgram_No internet service 1307 non-null int64
TVProgram_Yes 1307 non-null int64
ID 1307 non-null int64
Dependents 1307 non-null int64
MonthlyCharges 1307 non-null float64
Partner 1307 non-null int64
PhoneService 1307 non-null int64
SeniorCitizen 1307 non-null int64
TotalCharges 1307 non-null float64
gender 1307 non-null int64
tenure 1307 non-null int64
dtypes: float64(2), int64(26)
memory usage: 286.0 KB
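Here the two frames end up with matching dummy columns, but that is not guaranteed when each frame is one-hot encoded on its own. The sketch below, on made-up data, shows how a category missing from one frame produces fewer columns and how reindexing to the training columns restores the layout. The frame names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the test frame lacks the 'Two year' category
toy_train = pd.DataFrame({"Contract": ["One year", "Two year", "Month-to-month"]})
toy_test = pd.DataFrame({"Contract": ["One year", "Month-to-month"]})

dum_train = pd.get_dummies(toy_train).astype(int)
dum_test = pd.get_dummies(toy_test).astype(int)
print(dum_train.shape, dum_test.shape)  # the test frame has one column fewer

# Reindex to the training columns; categories unseen in test become all-zero columns
dum_test_aligned = dum_test.reindex(columns=dum_train.columns, fill_value=0)
print(list(dum_test_aligned.columns))
```

Comparing `train.info()` and `test.info()` as above serves the same purpose: the feature columns must agree in name and order before a model fitted on one frame is applied to the other.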
Generate a correlation matrix to inspect the relationships between features
sns.set(style="white")
corr = train.corr()
# Mask the upper triangle so that each pair of features is shown only once
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(20, 15))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
plt.title('Correlation Matrix', fontsize=18)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)
plt.show()
Drop the features that correlate only weakly with Label
dropcol = ['MultipleLines_No phone service', 'MultipleLines_Yes', 'MultipleLines_No',
           'TVProgram_Yes', 'ID', 'SeniorCitizen',
           'PhoneService', 'gender']
train.drop(dropcol, axis = 1,inplace = True)
test.drop(dropcol, axis = 1, inplace = True)
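One way to make the choice of weakly correlated features explicit is to rank the absolute correlation with Label and apply a cut-off. The sketch below does this on a made-up frame; the 0.1 threshold and the toy data are assumptions, not the criteria used to build the drop list above.

```python
import pandas as pd

# Hypothetical toy frame with an already-encoded 0/1 Label
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 4, 5, 6, 7, 8],
    "gender": [0, 1, 0, 1, 0, 1, 0, 1],
    "Label":  [1, 1, 1, 1, 0, 0, 0, 0],
})

# Absolute Pearson correlation of every feature with Label, weakest first
corr_with_label = toy.corr()["Label"].drop("Label").abs().sort_values()
threshold = 0.1  # hypothetical cut-off
weak_features = corr_with_label[corr_with_label < threshold].index.tolist()
print(corr_with_label)
print(weak_features)
```

Pearson correlation only captures linear association, so a feature that scores low here can still be useful to a tree-based model.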
Split into training and validation sets
X = train.drop('Label', axis=1)
y = train['Label']
Check the Label ratio
num_yes = train.Label.sum()
num_yes
# Output:
1947
num_no = train.shape[0]-num_yes
num_no
# Output:
3280
The classes are imbalanced (1947 Yes vs. 3280 No), so the cross-validation below uses StratifiedKFold for stratified sampling
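Because the metric is accuracy, the class counts also give a useful reference point: a model that always predicts the majority class already reaches a certain accuracy. The arithmetic below uses only the two counts printed above; the variable names are illustrative.

```python
num_yes = 1947  # churned customers, from the count above
num_no = 3280   # retained customers, from the count above
total = num_yes + num_no

share_yes = num_yes / total
share_no = num_no / total
# Accuracy of a model that always predicts the majority class "No"
baseline_accuracy = max(share_yes, share_no)
print(round(share_yes, 4), round(share_no, 4), round(baseline_accuracy, 4))
```

Any model accuracy is best read against this majority-class baseline of roughly 0.63 rather than against 0.5.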
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2)
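The call above splits at random, so the class ratio in the validation set can drift slightly from that of the full data and the split changes from run to run. A possible variant, shown here on made-up data, passes `stratify` and `random_state` to `train_test_split`; this is a sketch of an alternative, not the split that produced the scores reported in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 37% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 370 + [0] * 630)

# stratify keeps the class ratio in both parts; random_state makes the split repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)
print(y_tr.mean(), y_va.mean())
```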
Import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
Build the model
def model_select(X_train, y_train, clf, clf_name, parameters, kfold):
    # Scale the features and fit the classifier in a single pipeline
    pipeline = Pipeline([('scaler', MinMaxScaler()), (clf_name, clf)])
    # The classes are imbalanced, so use StratifiedKFold for stratified sampling
    folder = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=0)
    grid_search = GridSearchCV(estimator=pipeline,
                               param_grid=parameters,
                               cv=folder,
                               scoring='accuracy')
    gs = grid_search.fit(X_train, y_train)
    print('Best parameters {}, best score {}'.format(gs.best_params_, gs.best_score_))
    return gs
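To illustrate what the stratified folds do, the sketch below splits a made-up label vector with the same `StratifiedKFold` settings and records the share of positives in each validation fold. The toy labels are hypothetical; the point is only that every fold keeps the overall class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 40 positives and 60 negatives
y_toy = np.array([1] * 40 + [0] * 60)
X_toy = np.zeros((len(y_toy), 1))  # the features do not affect how the folds are drawn

folder = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_positive_rates = [
    y_toy[valid_idx].mean() for _, valid_idx in folder.split(X_toy, y_toy)
]
print(fold_positive_rates)
```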
from sklearn.metrics import accuracy_score
#Decision Tree
no_folds = 10
dt = DecisionTreeClassifier(random_state = 1)
dt_parameters ={'dt__max_depth':[4,5,6,7,9]}
dt_model = model_select(X_train,y_train,dt,'dt',dt_parameters,no_folds)
y_pred = dt_model.predict(X_valid)
print(accuracy_score(y_valid,y_pred))
# Output:
Best parameters {'dt__max_depth': 5}, best score 0.7711073905764171
0.7743785850860421
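The grid key `dt__max_depth` follows scikit-learn's pipeline naming rule: the step name, two underscores, then the estimator's parameter name. The short sketch below rebuilds a pipeline like the one in `model_select` and sets the parameter through that key; it is an illustration of the naming rule only and does not fit any data.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("dt", DecisionTreeClassifier(random_state=1)),
])

# The same key that appears in the parameter grid sets the step's parameter
pipeline.set_params(dt__max_depth=5)
depth = pipeline.named_steps["dt"].max_depth

# Every tunable parameter of the 'dt' step is exposed under the 'dt__' prefix
dt_keys = [key for key in pipeline.get_params() if key.startswith("dt__")]
print(depth)
print(dt_keys)
```

This is why the step name passed as `clf_name` must match the prefix used in each parameter grid.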
#Random Forest
no_folds = 10
rf = RandomForestClassifier(random_state = 1)
rf_parameters = {'rf__max_depth':[4,5,6,7,8]}
rf_model = model_select(X_train,y_train,rf,'rf',rf_parameters,no_folds)
y_pred = rf_model.predict(X_valid)
print(accuracy_score(y_valid,y_pred))
# Output:
Best parameters {'rf__max_depth': 6}, best score 0.7727816311887108
0.7762906309751434
#SVM
no_folds = 10
svc = SVC(random_state = 1)
svc_parameters = {'svc__C':[0.01,0.03,0.1,1,1.5], 'svc__gamma':[0.01,0.1,1,1.5]}
svc_model = model_select(X_train,y_train,svc,'svc',svc_parameters,no_folds)
y_pred = svc_model.predict(X_valid)
print(accuracy_score(y_valid,y_pred))
# Output:
Best parameters {'svc__C': 1.5, 'svc__gamma': 0.01}, best score 0.7574742884477398
0.7724665391969407
#xgboost
no_folds = 10
xgb = XGBClassifier(random_state = 1)
xgb_parameters = {'xgboost__max_depth':[4,5,6,7,8,9],
'xgboost__learning_rate':[0.001,0.01,0.02,0.03,1]}
xgb_model = model_select(X_train,y_train,xgb,'xgboost',xgb_parameters,no_folds)
y_pred = xgb_model.predict(X_valid)
print(accuracy_score(y_valid,y_pred))
# Output:
Best parameters {'xgboost__learning_rate': 0.01, 'xgboost__max_depth': 5}, best score 0.7727816311887108
0.7762906309751434
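With four models evaluated, it helps to put the numbers side by side. The snippet below copies the scores printed above into a dictionary (the dictionary itself and the display names are illustrative) and lists the models with the highest validation accuracy.

```python
# Scores copied from the outputs above: (best cross-validation score, validation accuracy)
scores = {
    "Decision Tree": (0.7711073905764171, 0.7743785850860421),
    "Random Forest": (0.7727816311887108, 0.7762906309751434),
    "SVM": (0.7574742884477398, 0.7724665391969407),
    "XGBoost": (0.7727816311887108, 0.7762906309751434),
}

best_valid = max(valid for _, valid in scores.values())
best_models = [name for name, (_, valid) in scores.items() if valid == best_valid]
print(best_models)
```

The random forest and XGBoost report identical figures on both measures, so the comparison yields a tie rather than a single winner.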
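XGBClassifier has the joint-highest score (it ties with the random forest on both the cross-validation score and the validation accuracy), so XGBClassifier is used to predict the test set
# max_depth and learning_rate are set by hand here; the grid search above selected max_depth=5 and learning_rate=0.01
xgb_model = XGBClassifier(random_state = 1, max_depth = 6, learning_rate = 0.03)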
xgb_model.fit(X,y)
preds = xgb_model.predict(test)
final_DF = pd.DataFrame()
get_ID = pd.read_csv('客户流失判断test_noLabel.csv')
final_DF['ID'] = get_ID['ID']
final_DF['Label'] = pd.Series(preds).map({1:'Yes',0:'No'})
Name = '客户流失判断结果.csv'
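# Write the submission file; index=False (an assumption about the expected format) leaves out the row index
final_DF.to_csv(Name, index=False)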
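Before uploading, it can be worth reading the file back to confirm its layout. The sketch below builds a tiny submission from made-up IDs and predictions, writes it to a temporary directory, and reloads it; the file name, IDs, and predictions are hypothetical, and the two-column ID/Label layout is assumed from the training data rather than taken from the competition rules.

```python
import os
import tempfile

import pandas as pd

# Hypothetical IDs and 0/1 predictions standing in for the real ones
ids = [0, 1, 2]
preds = [1, 0, 0]

submission = pd.DataFrame({"ID": ids})
submission["Label"] = pd.Series(preds).map({1: "Yes", 0: "No"})

# Write to a temporary directory, then read back to check the layout
out_path = os.path.join(tempfile.mkdtemp(), "submission_example.csv")
submission.to_csv(out_path, index=False)
round_trip = pd.read_csv(out_path)
print(round_trip)
```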
Submit the results