User Behavior Value Purchase-Rate Prediction: A Binary Classification Problem (2)

信仰一跃的淡水鱼

Last modified: 2023-05-15 17:20:19

Categories: data crawling, machine learning. Tags: big data, data analysis, python

First published: 2022-04-06 10:32:47

Article link: https://blog.csdn.net/qq_37673434/article/details/117963560

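Preface:

This post continues from the previous one, "User Behavior Value Purchase-Rate Prediction: A Binary Classification Problem (1)".
It covers splitting the data, building models, making predictions, and evaluating them.

Main text:

First, import the required packages: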

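#----------General--------------#
import pandas as pd                                         # used below as pd
import matplotlib.pyplot as plt                             # used below as plt

#----------Dataset handling--------------#
from sklearn.model_selection import train_test_split        # split into training and validation sets
from sklearn.model_selection import KFold,StratifiedKFold   # k-fold cross-validation
from imblearn.combine import SMOTETomek,SMOTEENN            # combined over- and under-sampling
from imblearn.over_sampling import SMOTE                    # over-sampling
from imblearn.under_sampling import RandomUnderSampler      # under-sampling

#----------Data processing--------------#
from sklearn.preprocessing import StandardScaler # standardization
from sklearn.preprocessing import OneHotEncoder  # one-hot encoding
from sklearn.preprocessing import OrdinalEncoder # ordinal encoding
from scipy.stats import chi2_contingency       # test for categorical features: relationship between feature and label
from scipy.stats import f_oneway,ttest_ind     # tests for numerical features: relationship between feature and label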

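Splitting the dataset:

One-hot encode the discrete fields of the dataset:

# One-hot encode first, then split. Note that only city_num is encoded; every column in str_features is then dropped.
train_con_dummy=train_con.join(pd.get_dummies(train_con[str_features].city_num)).drop(str_features,axis=1)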
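To make the encoding step concrete, here is a minimal sketch on a made-up frame. Only the column names mirror the post; the values and the `city_num_` prefix are assumptions added for illustration.

```python
import pandas as pd

# Toy frame; the values are made up, only the column names mirror the post.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "city_num": ["A", "B", "A", "C"],
    "result": [0, 1, 0, 0],
})

# One indicator column per distinct city; the prefix keeps the new names readable.
dummies = pd.get_dummies(toy["city_num"], prefix="city_num")

# Attach the indicator columns and drop the original string column.
encoded = toy.join(dummies).drop(columns=["city_num"])
print(encoded.columns.tolist())
```

Using a prefix is optional, but it avoids column names that are bare category values, which can collide with existing columns.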

Since no test set is provided, we need to split one off from the dataset ourselves:

from sklearn.model_selection import train_test_split
def train_test_val_split(data, ratio_train, ratio_test):
    train, test = train_test_split(data, train_size=ratio_train, test_size=ratio_test)
    return train, test
train_f, test= train_test_val_split(train_con_dummy, 0.7, 0.3)
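Because the label is heavily imbalanced, a purely random split can leave one part with very few positive samples. A possible variation, not used in the original code, is to pass `stratify` so that both parts keep the same class ratio, and `random_state` so the split is reproducible. The sketch below uses synthetic data with an assumed 5% positive rate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 5% positives.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "result": np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)],
})

# stratify keeps the share of positives (almost) identical in both parts.
train_part, test_part = train_test_split(
    data, test_size=0.3, stratify=data["result"], random_state=42)

print(train_part["result"].mean(), test_part["result"].mean())
```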

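Run a chi-square test on the discrete features:

# Chi-square test of independence at the 5% significance level:
# if p > 0.05, there is no significant association between the feature and the label.
for col in str_features:
    obs=pd.crosstab(train_con['result'],
                    train_con[col],
                    rownames=['result'],
                    colnames=[col])
    chi2, p, dof, expect = chi2_contingency(obs)
    print("{} chi-square test p-value: {:.4f}".format(col,p))  # no p-value exceeds 0.05, so both features are associated with the label and are kept

first_order_time chi-square test p-value: 0.0000
city_num chi-square test p-value: 0.0000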
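For readers who want to see what the test consumes, here is a small sketch on made-up data: a contingency table of label against one categorical feature, passed to `chi2_contingency`. The `group` column and its counts are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: group "b" buys far more often than group "a".
toy = pd.DataFrame({
    "group": ["a"] * 100 + ["b"] * 100,
    "result": [1] * 10 + [0] * 90 + [1] * 40 + [0] * 60,
})

# Rows are label values, columns are feature values, cells are counts.
obs = pd.crosstab(toy["result"], toy["group"])
chi2, p, dof, expected = chi2_contingency(obs)

print("chi2={:.2f}, p={:.6f}, dof={}".format(chi2, p, dof))
print(expected)  # counts expected if feature and label were independent
```

With very large samples the p-value tends toward zero even for weak associations, so a p-value printed as 0.0000 says the association is detectable, not that it is strong.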

Use analysis of variance (ANOVA) on the continuous variables for feature selection:

from sklearn.feature_selection import SelectKBest,f_classif  # ANOVA F-test based feature selection

f,p=f_classif(train_con[num_features],train_con['result'])
k = f.shape[0] - (p > 0.05).sum()  # number of features whose p-value is at most 0.05
selector = SelectKBest(f_classif, k=k)
selector.fit(train_con[num_features],train_con['result'])

print('scores_:',selector.scores_)
print('pvalues_:',selector.pvalues_)
print('selected index:',selector.get_support(True))

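The code above prints the score of each continuous variable.
No features are removed here based on those scores.
Next, separate the label result and drop user_id, an identifier that carries no predictive information: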

y=train_f['result']
train_f=train_f.drop(['user_id','result'],axis=1)
test=test.drop('user_id',axis=1)

Then standardize the continuous variables:

standardScaler=StandardScaler()  # standardize the continuous variables; the scaler is fitted on the training data only
ss=standardScaler.fit(train_f.loc[:,num_features])
train_f.loc[:,num_features]=ss.transform(train_f.loc[:,num_features])
test.loc[:,num_features]=ss.transform(test.loc[:,num_features])
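The point of fitting on the training data and only transforming the test data is that the test set must not influence the scaling statistics. A minimal sketch with synthetic numbers (the means and scales are arbitrary assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
train_values = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test_values = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler().fit(train_values)   # statistics come from the training data only
train_scaled = scaler.transform(train_values)
test_scaled = scaler.transform(test_values)   # reuses the training mean and standard deviation

# Training columns end up with mean 0 and standard deviation 1;
# test columns are close to that, but not exactly.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```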

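Next, split the training data again into a training set and a validation set. The test set was held out earlier because the task did not provide one; the validation set created here is used to judge how good a model is during development:

x=train_f
x_train,x_valid,y_train,y_valid=train_test_split(x,y,test_size=0.2,random_state=2020)  # split into training and validation sets
# test_e is not defined in this post, so the held-out test frame is used; user_id was already dropped from it above
y_test=test['result']
x_test=test.drop('result',axis=1)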

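As mentioned in the previous post, this dataset is extremely imbalanced, so the class with few samples needs special handling. Here we use combined sampling, which over-samples the minority class and then cleans up with under-sampling:

smote_tomek = SMOTETomek(random_state=115)  # combined sampling, because the dataset is imbalanced
x_resampled, y_resampled = smote_tomek.fit_resample(x_train, y_train)  # resample the training split only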

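The next step is to build the models, predict, and evaluate.

Model building, prediction, and evaluation:

First import the model packages. sklearn bundles many machine learning models and evaluation tools that can be imported and used directly: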

from sklearn.model_selection import GridSearchCV  # hyperparameter search
#----------Models----------#
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
#----------Model evaluation tools----------#
from sklearn.metrics import confusion_matrix # confusion matrix
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score,f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

Define a function that plots the ROC curve and returns the threshold at which recall minus false positive rate is largest:

def get_rocauc(X,y,clf):  # plot the ROC curve
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve,roc_auc_score
    proba=clf.predict_proba(X)[:,1]  # predicted probability of class 1
    FPR,recall,thresholds=roc_curve(y,proba,pos_label=1)
    area=roc_auc_score(y,proba)
    maxindex=np.argmax(recall-FPR)  # ROC point farthest above the diagonal
    threshold=thresholds[maxindex]
    plt.figure()
    plt.plot(FPR,recall,color='red',label='ROC curve (area = %0.2f)'%area)
    plt.plot([0,1],[0,1],color='black',linestyle='--')
    plt.scatter(FPR[maxindex],recall[maxindex],c='black',s=30)
    plt.xlim([-0.05,1.05])
    plt.ylim([-0.05,1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('Recall')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc='lower right')
    plt.show()
    return threshold
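The post does not show the returned threshold being used. One plausible use, sketched here under stated assumptions, is to replace the default 0.5 cut-off when turning probabilities into class labels. The data is synthetic, and the model and variable names are hypothetical stand-ins, not the author's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the real set.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)[:, 1]

# Same criterion as get_rocauc: maximize recall (TPR) minus FPR.
fpr, tpr, thresholds = roc_curve(y_va, proba)
best = np.argmax(tpr - fpr)
threshold = thresholds[best]

pred_default = (proba >= 0.5).astype(int)      # what predict() does
pred_tuned = (proba >= threshold).astype(int)  # custom cut-off

print("threshold:", threshold)
print("recall at 0.5:", recall_score(y_va, pred_default))
print("recall at tuned threshold:", recall_score(y_va, pred_tuned))
```

On imbalanced data the selected threshold is often well below 0.5, which trades precision for recall. A threshold picked and scored on the same validation data gives an optimistic picture, so the final check belongs on the held-out test set.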

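Next, pick a model and feed it the data for training, for example a random forest:

clf=RandomForestClassifier(n_estimators=100,max_features=0.74,min_samples_leaf=2,random_state=33, verbose=True)  # random forest
clf=clf.fit(x_resampled,y_resampled)  # train on the resampled training set produced by SMOTETomek above
y_pred_clf_smote=clf.predict(x_valid)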

Check the model's per-class precision, recall, and F1-score:

print(classification_report(y_valid, y_pred_clf_smote))  # the F1-score is the harmonic mean of precision and recall

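[Figure: classification report of the random forest on the validation set]
The results show that the scores for class 0 are high because it has plenty of samples, while the scores for class 1 are poor. This means that handling the class imbalance still needs a lot more work (I did not take this into account during the competition).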
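A tiny worked example shows why overall accuracy hides this problem. The labels below are made up: 10 samples with 2 positives.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Made-up labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)                   # 7 1 1 1

print(accuracy_score(y_true, y_hat))    # 0.8
print(precision_score(y_true, y_hat))   # 0.5, precision for class 1
print(recall_score(y_true, y_hat))      # 0.5, recall for class 1
print(f1_score(y_true, y_hat))          # 0.5
```

Accuracy is 0.8 even though only half of the positives are found, which is why the class 1 row of the report is the one to watch on this kind of data.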

y_pred_clf_pro=clf.predict_proba(x_test)

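Use the trained random forest to predict on the test set and save the results.
Next, train a logistic regression model to see how it performs (grid search is used here to tune the hyperparameters):

param = {"penalty": ["l1", "l2", ], "C": [0.1, 1, 10], "solver": ["liblinear","saga"]}  # logistic regression hyperparameters to search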
gs = GridSearchCV(estimator=LogisticRegression(), param_grid=param, cv=2, scoring="roc_auc",verbose=10) 
gs.fit(x_resampled,y_resampled)  
print(gs.best_params_)
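The cost of a grid search grows with the product of the option counts. As a quick check, `ParameterGrid` from sklearn can enumerate the same grid without fitting anything:

```python
from sklearn.model_selection import ParameterGrid

param = {"penalty": ["l1", "l2"], "C": [0.1, 1, 10], "solver": ["liblinear", "saga"]}

n_candidates = len(ParameterGrid(param))  # 2 * 3 * 2 combinations
cv = 2
n_fits = n_candidates * cv                # one fit per combination per fold

print(n_candidates, n_fits)               # 12 24
```

With the default `refit=True`, GridSearchCV adds one more fit of the best combination on the full training data.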

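Adding more parameters to search also increases the training time, since every combination has to be fitted.
Use the best parameters to predict:

y_pred = gs.best_estimator_.predict(x_valid)  # logistic regression on the validation set
print(classification_report(y_valid, y_pred))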

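These results also show that the score for class 1 is 0, which again comes down to class imbalance. I was still a beginner at the time of the competition and did not solve this problem. Looking back now, handling the class imbalance correctly would probably have produced a much better result.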
