Data-Driven Member Operations (Part 2)

Latest recommended article published on 2024-10-05 11:27:00

m0_74988308

Tags: big data, data mining, data analysis, machine learning, artificial intelligence

Article link: https://blog.csdn.net/m0_74988308/article/details/132281580

Case study: marketing response prediction with a composite data workflow built from nested Pipeline and FeatureUnion

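Background:

This is a practical application of member prediction. When planning member marketing, the membership department wants to use data to predict which members will respond to the next campaign, together with each member's response probability, and to build targeted marketing strategies from that list.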

Technical focus:

Several feature-processing steps are combined into a single feature-engineering pipeline, and that pipeline is then combined with RandomForestClassifier to form a composite Pipeline.

part 1 Import libraries

import time #records model running time under different parameter settings
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis #dimensionality-reduction transform
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier #ensemble algorithms
#RandomForestClassifier: trains the final classification model and makes predictions
#ExtraTreesClassifier: used with RFE to extract important features
from sklearn.feature_selection import RFE #used together with ExtraTreesClassifier
from sklearn.model_selection import cross_val_score,StratifiedKFold #cross-validation
from sklearn.pipeline import Pipeline,FeatureUnion #pipelines
from imblearn.over_sampling import SMOTE #over-sampling library for balancing the sample
from sklearn.metrics import f1_score,accuracy_score,precision_score #model evaluation metrics

part 2 Read data

raw_data = pd.read_excel('order.xlsx',sheet_name=0)

part 3 Data review

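#Basic shape (the target column is not counted as a feature):
print('records:{0} features:{1}'.format(raw_data.shape[0],(raw_data.shape[1]-1)))

#Missing values:
print('NaN records count:',raw_data.isnull().any(axis=1).sum()) #sum() counts rows containing NaN; count() would count every row
na_cols = raw_data.isnull().any()
print('NaN cols:',na_cols[na_cols].index.tolist()) #names of the columns that contain NaN

#Class balance:
print('sample distribution:',raw_data['value_level'].groupby(raw_data['response']).count())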

————————————————————————————————————————————————————————————————————————————
sample distribution: response
0    30415
1     9584

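groupby groups the rows by response and counts the value_level entries in each group, which gives the number of samples per class: 30415 non-responders against 9584 responders.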
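The counting calls in this step are easy to mix up, so here is a minimal sketch on a hypothetical toy frame (not the data in order.xlsx). It shows that `count()` on a boolean Series counts every row while `sum()` counts only the rows flagged True, and that a groupby count skips NaN in the counted column whereas `value_counts` on the target does not depend on other columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from order.xlsx
toy = pd.DataFrame({
    'age': [25, np.nan, 40, 31],
    'value_level': [1, 2, np.nan, 3],
    'response': [0, 1, 0, 0],
})

row_has_nan = toy.isnull().any(axis=1)   # one boolean per row
n_rows = row_has_nan.count()             # counts every row: 4
n_nan_rows = row_has_nan.sum()           # counts only the True values: 2

col_has_nan = toy.isnull().any()         # one boolean per column
nan_cols = col_has_nan[col_has_nan].index.tolist()

# count() ignores NaN in value_level, so class 0 shows 2 rows instead of 3
grouped_count = toy['value_level'].groupby(toy['response']).count()

# value_counts() on the target itself counts all rows of each class
class_sizes = toy['response'].value_counts().sort_index()

print(n_rows, n_nan_rows, nan_cols)
print(grouped_count.to_dict(), class_sizes.to_dict())
```

If the counted column may contain missing values, counting the target column directly is the safer way to check class balance.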

part 4 Data preprocessing

#NaN handling (mean for continuous fields, median for coded categorical fields):
na_rules = {'age' : raw_data['age'].mean(),
            'total_pageviews' :  raw_data['total_pageviews'].mean(),
            'edu' : raw_data['edu'].median(),
            'edu_ages' : raw_data['edu_ages'].median(),
            'user_level' : raw_data['user_level'].median(),
            'industry' : raw_data['industry'].median(),
            'act_level' : raw_data['act_level'].median(),
            'sex' : raw_data['sex'].median(),
            'red_money' : raw_data['red_money'].mean(),
            'region' : raw_data['region'].median()
            }
raw_data = raw_data.fillna(na_rules)
print('Check NA exists:', raw_data.isnull().any().sum())
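As a small illustration of how the rule dictionary works, the sketch below fills a hypothetical toy frame column by column and then reuses the same rules on a new row, which is what part 7 does with na_rules. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'age' is continuous, 'edu' is a coded category
toy = pd.DataFrame({'age': [20.0, np.nan, 40.0, 30.0],
                    'edu': [1.0, 3.0, np.nan, 3.0],
                    'sex': [0.0, 1.0, 1.0, 0.0]})

rules = {'age': toy['age'].mean(),     # mean of 20, 40, 30 -> 30.0
         'edu': toy['edu'].median()}   # median of 1, 3, 3 -> 3.0
filled = toy.fillna(rules)             # columns missing from the dict are left untouched

# The same rules can be applied to data that arrives later
new_rows = pd.DataFrame({'age': [np.nan], 'edu': [np.nan], 'sex': [1.0]})
new_filled = new_rows.fillna(rules)

print(filled)
print(new_filled)
```

Reusing the rules computed on the training data keeps the new data on the same scale as the data the model was fitted on.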

#Separate features from target, then use the first 70% of rows for training and the rest for testing:
num = int(0.7*raw_data.shape[0])
x,y = raw_data.drop('response',axis=1),raw_data['response']
x_train,x_test = x.iloc[:num,:],x.iloc[num:,:]
y_train,y_test = y.iloc[:num],y.iloc[num:]
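The positional split above works as long as the rows are not ordered in a way that is related to the target. As a hedged alternative, the sketch below uses scikit-learn's `train_test_split` with `stratify` on hypothetical data that is deliberately sorted by class, to show how the two approaches can differ. It is an illustration under that assumption, not a claim about how order.xlsx is ordered.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 25% positive, sorted so positives come last
x = pd.DataFrame({'f1': np.arange(100)})
y = pd.Series([0] * 75 + [1] * 25, name='response')

# Positional split: first 70% for training, last 30% for testing
num = int(0.7 * len(x))
y_train_pos, y_test_pos = y.iloc[:num], y.iloc[num:]

# Stratified random split: both parts keep roughly the same positive rate
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)

print('positional:', y_train_pos.mean(), y_test_pos.mean())
print('stratified:', y_tr.mean(), y_te.mean())
```

On this sorted toy data the positional training part contains no positive rows at all, while the stratified split keeps the positive rate close to 25% in both parts.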

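#Balance the sample (training data only):
model_smote = SMOTE()
x_smote_resampled,y_smote_resampled = model_smote.fit_resample(x_train,y_train)
#Note: the later steps fit on x_train,y_train; pass x_smote_resampled,y_smote_resampled there to train on the balanced sample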
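To make the idea behind SMOTE concrete, here is a minimal NumPy sketch of its core step: a synthetic minority sample is placed at a random position on the line between a minority sample and one of its nearest minority neighbours. The function `smote_like` and the sample points are hypothetical and written only for illustration; this is not imblearn's implementation, which handles more cases.

```python
import numpy as np

def smote_like(x_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(x_min[:, None, :] - x_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours
    base = rng.integers(0, len(x_min), size=n_new)            # random starting points
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                   # position along the segment, in [0, 1)
    return x_min[base] + gap * (x_min[pick] - x_min[base])

# Hypothetical minority-class samples with two features
x_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0], [2.5, 2.5]])
synthetic = smote_like(x_minority, n_new=10)
print(synthetic.shape)  # (10, 2)
```

Because every synthetic row lies between two existing minority rows, the new samples stay inside the range of the observed minority data.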

part 5 Model training

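model_etc = ExtraTreesClassifier() #base estimator that supplies feature importances to RFE
model_rfe = RFE(model_etc) #RFE extracts the important features
model_lda = LinearDiscriminantAnalysis() #LDA model object
model_rf = RandomForestClassifier() #final classifier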

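Two kinds of pipeline are used:

The first is FeatureUnion, which combines the outputs of several transformers so that later steps can work on the combined features. Here the combined feature set comes from RFE and LDA.

The second is Pipeline, which chains feature transformation with the model that follows it. Here it joins the FeatureUnion with RandomForestClassifier.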

# Build the nested pipeline
pipelines = Pipeline([
    ('feature_union', FeatureUnion(  # feature-combining pipeline
        transformer_list=[
            ('model_rfe', model_rfe),  # features selected by RFE
            ('model_lda', model_lda),  # features extracted by LDA
        ],
        transformer_weights={  # multiplicative weight for each transformer's output
            'model_rfe': 1,  # weight of the RFE features
            'model_lda': 0.8,  # weight of the LDA features
        },
    )),
    ('model_rf', model_rf),  # random forest model object
])
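To see what the feature-union step hands to the classifier, the sketch below fits a similar FeatureUnion on synthetic data from `make_classification`, which stands in for the member data. The sizes (10 input features, 4 kept by RFE) are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import FeatureUnion

# Synthetic stand-in for the member data: 200 rows, 10 features, binary target
x_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=4, n_redundant=0,
                                     random_state=0)

union = FeatureUnion(
    transformer_list=[
        ('rfe', RFE(ExtraTreesClassifier(n_estimators=10, random_state=0),
                    n_features_to_select=4)),
        ('lda', LinearDiscriminantAnalysis(n_components=1)),
    ],
    transformer_weights={'rfe': 1, 'lda': 0.8},
)

# Each transformer is fitted on the same input; outputs are joined column-wise
combined = union.fit_transform(x_demo, y_demo)
print(combined.shape)  # (200, 5): 4 columns kept by RFE + 1 LDA component
```

The transformers run side by side rather than one after another, so the classifier sees the selected original features and the LDA projection together.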

#Set pipeline parameters (path format: step name__nested step name__parameter)
pipelines.set_params(feature_union__model_rfe__estimator__n_estimators = 20)
pipelines.set_params(feature_union__model_rfe__estimator__n_jobs = -1)
pipelines.set_params(feature_union__model_rfe__n_features_to_select = 20)
pipelines.set_params(feature_union__model_lda__n_components = 1)
pipelines.set_params(feature_union__n_jobs = -1)
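The long parameter names above follow scikit-learn's double-underscore convention for nested estimators. The sketch below rebuilds a smaller pipeline with the same step names (a DecisionTreeClassifier is swapped in as the RFE estimator purely to keep the example light) and shows that valid paths can be listed with `get_params` instead of being typed from memory.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.tree import DecisionTreeClassifier

demo = Pipeline([
    ('feature_union', FeatureUnion([('model_rfe', RFE(DecisionTreeClassifier()))])),
    ('model_rf', RandomForestClassifier()),
])

# Step name, then nested step name, then parameter, joined by double underscores
demo.set_params(feature_union__model_rfe__n_features_to_select=3,
                feature_union__model_rfe__estimator__max_depth=2,
                model_rf__n_estimators=50)

# List the valid parameter paths under the RFE step
rfe_keys = sorted(k for k in demo.get_params()
                  if k.startswith('feature_union__model_rfe__'))
print(rfe_keys[:5])
print(demo.named_steps['model_rf'].n_estimators)  # 50
```

Several parameters can be set in one `set_params` call, which is equivalent to the separate calls used above.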

#Cross-validate the pipeline
cv = StratifiedKFold(3) #3-fold stratified cross-validation
score_list = list() #fold scores for each parameter value
time_list = list() #running time for each parameter value
n_estimators = [10,50,100] #candidate numbers of trees for the random forest
for parameter in n_estimators : 
    t1 = time.time()
    print('set parameters: %s ' %parameter)
    pipelines.set_params(model_rf__n_estimators = parameter)
    score_tmp = cross_val_score(pipelines,x_train,y_train,scoring='accuracy',cv=cv,n_jobs=1) #cross-validated accuracy per fold
    time_list.append(time.time() - t1)
    score_list.append(score_tmp)

#Combine the cross-validation scores with the timing details
time_pd = pd.DataFrame.from_dict({'n_estimators' : n_estimators,'time' :time_list})
score_cols = [''.join(['score',str(i+1)]) for i in range(cv.get_n_splits())] #one column per fold
score_pd = pd.DataFrame(score_list,columns=score_cols)
pd_merge = pd.concat((time_pd,score_pd),axis=1)
pd_merge['score_mean'] = pd_merge[score_cols].mean(axis=1) #mean over all folds
pd_merge['score_std'] = pd_merge[score_cols].std(axis=1) #standard deviation over all folds
print(pd_merge.head())

#Set the best parameter on the model and train the pipeline
pipelines.set_params(model_rf__n_estimators = 50)
pipelines.fit(x_train,y_train)

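#Cross-validation scores and details, as recorded:
   n_estimators       time    score1    score2    score3  score_mean  score_std
0            10  26.566431  0.881281  0.877424  0.878924    0.879353   0.002728
1            50  27.172450  0.889103  0.890710  0.888675    0.889907   0.001136
2           100  26.910383  0.887925  0.891139  0.888246    0.889532   0.002273

In this recorded output, score_mean and score_std were computed from score1 and score2 only. Averaging all three fold scores gives about 0.8792, 0.8895 and 0.8891, so n_estimators=50 still has the highest mean.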

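A higher score_mean is better, because it means more accurate predictions. A lower score_std is better, because it means the results vary less from fold to fold. A lower time is better, because it means a shorter run. Here n_estimators=50 has the highest score_mean and the lowest score_std, with a running time close to the other two settings, so it is used for the final model.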
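The same three quantities (mean score, score spread, time) can also be collected with scikit-learn's `GridSearchCV`, which runs the loop over candidate values itself. The sketch below is a hedged alternative on synthetic data with a simplified two-step pipeline. The step name `model_rf` mirrors the one above, while the `scale` step and the data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('model_rf', RandomForestClassifier(random_state=0))])

search = GridSearchCV(pipe,
                      param_grid={'model_rf__n_estimators': [10, 50]},
                      scoring='accuracy',
                      cv=StratifiedKFold(3))
search.fit(x_demo, y_demo)  # cross-validates each value, then refits the best one

results = search.cv_results_
for n, mean, std, secs in zip(results['param_model_rf__n_estimators'],
                              results['mean_test_score'],
                              results['std_test_score'],
                              results['mean_fit_time']):
    print(n, round(mean, 4), round(std, 4), round(secs, 3))
print('best:', search.best_params_)
```

After `fit`, `search.best_estimator_` is already trained with the best setting, so the separate `set_params` and `fit` calls are not needed in this variant.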

part 6 Model evaluation

pre_test = pipelines.predict(x_test)
score = [i(y_test,pre_test) for i in [f1_score,accuracy_score,precision_score]]
print('scores result: f1: {0} , accuracy: {1} , precision: {2}'.format(score[0],score[1],score[2]))

————————————————————————————————————————————————————————————————————————————————————
scores result: f1: 0.7596589878469073 , accuracy: 0.8895833333333333 , precision: 0.790785498489426

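Accuracy on the test set is about 0.8896, and the mean cross-validation accuracy is about 0.89. The two are essentially the same, which suggests that the model performs as well on held-out data as it did during cross-validation.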
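Accuracy alone says little about the minority class, so it helps to see how precision, recall and F1 relate. The sketch below derives them from a confusion matrix on hypothetical labels, then uses the identity F1 = 2PR/(P+R) to back out the recall implied by the F1 and precision reported above (recall itself is not printed in the post).

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = responded, 0 = did not respond
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # share of predicted responders who really responded
recall = tp / (tp + fn)      # share of actual responders that the model found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# Recall implied by the F1 and precision reported in the post
reported_f1, reported_precision = 0.7596589878469073, 0.790785498489426
implied_recall = reported_f1 * reported_precision / (2 * reported_precision - reported_f1)
print(round(implied_recall, 3))  # about 0.731
```

An implied recall of roughly 0.73 means that about a quarter of the actual responders in the test set are missed, which matters when the goal is a campaign target list.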

part 7 Predict on the new dataset

#Basic data processing:
new_data = pd.read_excel('order.xlsx',sheet_name=1)
print('records:{0} features:{1}'.format(new_data.shape[0],(new_data.shape[1])))

print('NaN records count:',new_data.isnull().any(axis=1).sum()) #sum() counts rows containing NaN
new_data_fillna = new_data.fillna(na_rules) #reuse the fill rules computed on the training data

#Predict labels and probabilities (pro1 = probability of class 0, pro2 = probability of class 1):
pre_labels = pd.DataFrame(pipelines.predict(new_data_fillna), columns=['labels'])
pre_pro = pd.DataFrame(pipelines.predict_proba(new_data_fillna),columns=['pro1','pro2'])
predict_pd = pd.concat((pre_labels,pre_pro),axis=1)
print(predict_pd.head())

********************************
   labels  pro1  pro2
0       0  0.92  0.08
1       0  0.92  0.08
2       0  0.98  0.02
3       0  1.00  0.00
4       0  0.88  0.12
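Since the business goal is a list of members with their response probabilities, a natural next step is to rank the predictions. The sketch below is one possible way to do that on synthetic data. It names the probability columns from `classes_` so the column order is explicit, and `member_row` is a hypothetical stand-in for a real member identifier.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model
x_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_demo, y_demo)

# predict_proba columns follow the order of model.classes_, here 0 then 1
proba = pd.DataFrame(model.predict_proba(x_demo[:20]),
                     columns=['pro_{}'.format(c) for c in model.classes_])
proba['labels'] = model.predict(x_demo[:20])
proba['member_row'] = range(20)   # hypothetical stand-in for a member id

# Members most likely to respond come first
ranked = proba.sort_values('pro_1', ascending=False).reset_index(drop=True)
print(ranked.head())
```

Sorting by the probability of class 1 lets the marketing team pick a cut-off that fits the campaign budget instead of relying only on the 0/1 label.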

#Save to file
with pd.ExcelWriter('order_predict_result.xlsx') as writer: #the context manager saves and closes the file
    predict_pd.to_excel(writer,sheet_name='Sheet1')