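Case study: a customer churn prediction model for the telecom industry
Project background:
In the telecom industry, customers can choose among many service providers. Customer churn is defined as a customer ceasing to do business with a company or service.
The task is to build a model that predicts customer churn from the remaining fields in the provided data.
Dataset overview:
The dataset contains a telecom company's customer records, including service usage, demographics, and whether each customer churned.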
Code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder,LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
target = 'Churn'
y = data[target]
X = data.drop(target, axis=1)
# Encode the target as integers unless it is already an integer column
if not pd.api.types.is_integer_dtype(y):
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
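As a small illustration of what the encoding step does, here is a sketch on toy labels. It assumes the 'Churn' column holds the strings 'Yes' and 'No'; the array below is made up and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Churn' column (assumed to be 'Yes'/'No').
toy_labels = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ lists the original labels in the order of their integer codes.
mapping = {label: code for code, label in enumerate(encoder.classes_.tolist())}
print(mapping)
print(encoded.tolist())
```

LabelEncoder assigns codes in sorted label order, so under this assumption the churned class ends up as 1.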
numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(include=[object]).columns.tolist()
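One thing worth checking before splitting columns by dtype: in the public copy of this file, 'customerID' appears to be a per-row identifier and 'TotalCharges' appears to be stored as text with blank entries, so both would land in categorical_cols and be one-hot encoded. The sketch below shows one possible cleanup on a made-up toy frame; the column contents are assumptions, so verify them against your own copy of the data.

```python
import pandas as pd

# Toy frame mimicking three columns assumed to exist in the Telco file.
toy = pd.DataFrame({
    'customerID': ['0001-A', '0002-B', '0003-C'],
    'TotalCharges': ['29.85', ' ', '108.15'],  # blank string = missing value
    'Contract': ['Month-to-month', 'One year', 'Two year'],
})

# Drop the identifier: it is unique per row, so it cannot generalize.
toy = toy.drop(columns=['customerID'])

# Coerce text to numbers; unparseable entries become NaN for the imputer.
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

print(toy.dtypes)
print(int(toy['TotalCharges'].isna().sum()))
```

Applied to X before select_dtypes, the same two steps would move 'TotalCharges' into the numerical branch, where the mean imputer can fill the gaps.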
# Preprocessor Pipeline setup
numerical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
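To see what this preprocessor produces, here is a minimal sketch that runs the same two branches on a tiny made-up frame. The column names 'tenure' and 'Contract' and their values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric column with a gap, one categorical column.
toy = pd.DataFrame({
    'tenure': [1.0, 34.0, np.nan, 45.0],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month', 'One year'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['tenure']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['Contract']),
])

transformed = toy_preprocessor.fit_transform(toy)
# One scaled numeric column plus one indicator column per category.
print(transformed.shape)
```

The output has one column for the scaled numeric feature and one per distinct category, which is why the feature count grows after one-hot encoding.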
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training pipeline
models = {
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
    'CatBoost': CatBoostClassifier(verbose=0)
}
pipeline_results = {}
for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_results = cross_val_score(pipeline, X_train, y_train, cv=kf, scoring='accuracy')
    pipeline_results[name] = cv_results.mean()
# print results
print(pipeline_results)
All three models reach roughly 80% accuracy; CatBoost performs best, followed by Random Forest, with XGBoost last.
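Accuracy figures like these are easier to interpret next to a majority-class baseline, because churn labels are usually imbalanced. The sketch below uses synthetic data with an assumed 75/25 class split (not the real churn rate) and compares a DummyClassifier with a random forest under stratified cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data: 75% class 0, 25% class 1 (assumed split).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25],
    flip_y=0, random_state=42,
)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline_acc = cross_val_score(
    DummyClassifier(strategy='most_frequent'), X_demo, y_demo,
    cv=cv, scoring='accuracy').mean()
forest_acc = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X_demo, y_demo,
    cv=cv, scoring='accuracy').mean()

print(f'Majority-class baseline: {baseline_acc:.3f}')
print(f'Random forest:           {forest_acc:.3f}')
```

If a model's accuracy sits only a few points above the baseline, metrics such as recall on the churned class may say more about its usefulness.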
# Select the best model
best_model_name = max(pipeline_results, key=pipeline_results.get)
best_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', models[best_model_name])
])
best_pipeline.fit(X_train, y_train)
# Printing the accuracy of the best model
y_pred = best_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Best Model: {best_model_name}')
print(f'Accuracy: {accuracy}')
On the test set, accuracy is about 81%, using the CatBoost model.
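A single accuracy number hides how the errors are split between churned and retained customers. As a hedged sketch on synthetic data (the class split and the label names 'No churn' and 'Churn' are assumptions), a confusion matrix and per-class report can be produced like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
print(classification_report(y_te, y_hat, target_names=['No churn', 'Churn']))
```

With the real pipeline, the same two calls would take y_test and y_pred from the code above.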
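# Get column names from the fitted preprocessor inside the best pipeline
fitted_preprocessor = best_pipeline.named_steps['preprocessor']
numerical_cols_out = fitted_preprocessor.named_transformers_['num'].get_feature_names_out(input_features=numerical_cols)
categorical_cols_encoded = fitted_preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(input_features=categorical_cols)
all_columns = list(numerical_cols_out) + list(categorical_cols_encoded)
# get feature importance from the fitted best model
# (the loop variable `model` is only fitted if it happens to be the best one)
feature_importance = best_pipeline.named_steps['classifier'].feature_importances_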
# dataframe
feature_importance_df = pd.DataFrame({'Feature': all_columns, 'Importance': feature_importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df= feature_importance_df.head(10)
# visualise top 10 features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
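Summary
Model performance can be improved further through feature selection, parameter optimization, and similar steps, for example:
- Identify relevant features: determine which data fields matter most for churn analysis
- Correlation analysis: examine the correlations between features to gauge how strongly each one relates to churn
- Feature importance: analyze feature importance to understand which factors are most critical for predicting churn
The churn analysis process can differ from one business to another; the specific methods and model choices depend on the nature of the business, the data available, and the goals of the analysis.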
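The parameter optimization mentioned above could be sketched with a grid search over a pipeline. This is a minimal example on synthetic data; the grid values are arbitrary assumptions, not tuned recommendations, and a random forest stands in for whichever model was selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# Pipeline parameters are addressed as '<step name>__<parameter name>'.
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [3, None],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search wraps the whole pipeline, preprocessing is refitted inside every fold, which avoids leaking validation data into the scaler or encoder.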