Otto Product Classification: Decision Tree Model (Kaggle 2015)

Article link: https://blog.csdn.net/weixin_38664232/article/details/87854610

Training

Approach: original features plus tfidf features

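We use the data from the Otto Group Product Classification Challenge, held on Kaggle in 2015, as an example. We first fit a CART decision tree with default parameters, then tune its hyperparameters with CART + GridSearchCV.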

1. Tools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the tuning plots further down

from sklearn.model_selection import GridSearchCV

2. Reading the data


# Read the data
dpath='./data/'

# Use the original features plus the tf-idf features
train1=pd.read_csv(dpath+"Otto_FE_train_org.csv")
train2=pd.read_csv(dpath+"Otto_FE_train_tfidf.csv")

# Drop the redundant id and target columns from the second frame
train2=train2.drop(["id","target"],axis=1)
train=pd.concat([train1,train2],axis=1,ignore_index=False)
train.head()

del train1
del train2

3. Preparing the data

y_train=train['target']
X_train=train.drop(["id","target"],axis=1)

# Keep the feature names for later use
feat_names=X_train.columns


# Convert to a sparse matrix
from scipy.sparse import csr_matrix
X_train=csr_matrix(X_train)

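4. Decision tree with default parameters

from sklearn.tree import DecisionTreeClassifier
DT1=DecisionTreeClassifier()


# Cross-validation is used both to estimate model performance and to tune parameters (model selection)
# For classification tasks, cross-validation uses StratifiedKFold by default
# The dataset is fairly large, so use 3-fold cross-validation
from sklearn.model_selection import cross_val_score
loss=cross_val_score(DT1,X_train,y_train,cv=3,scoring="neg_log_loss")

print('logloss of each fold is:',-loss)
print('cv logloss is:',-loss.mean())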

logloss of each fold is: [10.1700857 9.86630808 9.74333791]
cv logloss is: 9.926577231188997
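
A logloss near 10 is far larger than the error rate alone would suggest. One likely explanation is that a fully grown tree has pure leaves, so its predicted probabilities are exactly 0 or 1 and every misclassified sample receives the maximum clipped penalty. The sketch below illustrates the effect on made-up numbers; the helper `multiclass_logloss`, the clipping constant `eps`, and the toy arrays are assumptions for illustration and are not taken from the original post.

```python
import numpy as np

def multiclass_logloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability of the true class (hypothetical helper)."""
    proba = np.clip(proba, eps, 1 - eps)                 # avoid log(0)
    proba = proba / proba.sum(axis=1, keepdims=True)     # renormalise rows
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 1, 2, 1])

# Hard 0/1 probabilities, as produced by pure leaves; the last sample is wrong.
hard = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [1, 0, 0]], dtype=float)

# Softer probabilities with the same argmax predictions.
soft = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8],
                 [0.6, 0.3, 0.1]])

hard_loss = multiclass_logloss(y_true, hard)
soft_loss = multiclass_logloss(y_true, soft)
print("hard probabilities:", hard_loss)
print("soft probabilities:", soft_loss)
```

Both sets of predictions make the same single mistake, yet the hard version is penalised far more heavily, which is why limiting tree growth matters so much for this metric.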

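5. Tuning the decision tree hyperparameters

The hyperparameters of a decision tree include:

max_depth (depth of the tree) or max_leaf_nodes (number of leaf nodes)
min_samples_leaf (minimum number of samples in a leaf), min_samples_split (minimum number of samples an internal node must hold before it can be split)
min_weight_fraction_leaf (minimum fraction of the total sample weight required in a leaf)
min_impurity_split (impurity threshold below which a node is not split; later scikit-learn releases deprecate it in favor of min_impurity_decrease)
max_features (maximum number of features considered at each split)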
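
To make the role of these parameters concrete, here is a minimal sketch comparing an unconstrained tree with one limited by max_depth and min_samples_leaf. The synthetic dataset from `make_classification` is a stand-in assumed for illustration; it is not the Otto data, and the specific parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 3 classes, 20 features).
X, y = make_classification(n_samples=500, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Default parameters: the tree grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained tree: at most 3 levels, at least 10 samples per leaf.
small_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                    random_state=0).fit(X, y)

print("unconstrained: depth", full_tree.get_depth(),
      "leaves", full_tree.get_n_leaves())
print("constrained:   depth", small_tree.get_depth(),
      "leaves", small_tree.get_n_leaves())
```

A depth limit of 3 caps the tree at 8 leaves, so both constraints shrink the model and smooth the class probabilities it predicts.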

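In scikit-learn, parameter tuning follows the same steps for every estimator:

Define the parameter search ranges
Create a GridSearchCV instance (with those parameters)
Call the fit method of GridSearchCV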

from sklearn.model_selection import GridSearchCV

# Parameters to tune
max_depth=range(10,100,10)
min_samples_leaf=range(1,10,2)
tuned_parameters=dict(max_depth=max_depth,min_samples_leaf=min_samples_leaf)

DT2=DecisionTreeClassifier()
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring="neg_log_loss")
grid.fit(X_train,y_train)

# best_score_ is the negated logloss, so flip its sign; best_params_ is a dict
print('Best score: %f using %s'%(-grid.best_score_,grid.best_params_))

Output: [printed best score and parameters]

test_means=-grid.cv_results_['mean_test_score']

test_scores=np.array(test_means).reshape(len(max_depth),len(min_samples_leaf))

for i,value in enumerate(max_depth):
    plt.plot(min_samples_leaf,test_scores[i],label='test_max_depth:'+str(value))

plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()

Result: [plot of logloss against min_samples_leaf, one curve per max_depth value]
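
Reshaping the mean test scores into a (max_depth, min_samples_leaf) table works because GridSearchCV evaluates candidates in the order produced by ParameterGrid: parameter names are sorted alphabetically and the last name varies fastest. A small sketch of that ordering follows; the parameter values and the `fake_scores` array are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

max_depth = [10, 20]
min_samples_leaf = [1, 3, 5]

# Same construction as tuned_parameters, with shorter lists.
grid_points = list(ParameterGrid(dict(max_depth=max_depth,
                                      min_samples_leaf=min_samples_leaf)))
for point in grid_points:
    print(point)

# Placeholder scores standing in for cv_results_['mean_test_score'].
fake_scores = np.arange(len(grid_points))

# Row i holds the scores for max_depth[i]; column j for min_samples_leaf[j].
table = fake_scores.reshape(len(max_depth), len(min_samples_leaf))
print(table)
```

If a grid used parameter names whose alphabetical order differed from the intended row/column order, the reshape would need its dimensions swapped accordingly.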

It looks like max_depth is best at 10. Let us look more closely at how performance changes with min_samples_leaf when max_depth is 10.

plt.plot(min_samples_leaf,test_scores[0],label='test_max_depth'+str(10))
plt.show()

Output: [plot of logloss against min_samples_leaf for max_depth=10]

Performance improves as min_samples_leaf grows (possibly because the number of samples is fairly large). The next step is to decrease max_depth further while increasing min_samples_leaf.
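
The same trend can be reproduced on a small scale. The sketch below compares the cross-validated logloss of a tree with min_samples_leaf=1 against one with min_samples_leaf=25; the synthetic data, the label-noise level `flip_y`, and the two leaf sizes are assumptions chosen for illustration rather than values from the Otto experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, flip_y=0.1, random_state=0)

losses = {}
for leaf in (1, 25):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    # neg_log_loss is the negated logloss, so flip the sign back.
    scores = cross_val_score(tree, X, y, cv=3, scoring="neg_log_loss")
    losses[leaf] = -scores.mean()
    print("min_samples_leaf =", leaf, "cv logloss =", losses[leaf])
```

Larger leaves average the class frequencies over more samples, so the predicted probabilities are less extreme and wrong predictions are penalised less.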

max_depth=range(3,10,2)
min_samples_leaf=range(11,20,2)
tuned_parameters=dict(max_depth=max_depth,min_samples_leaf=min_samples_leaf)

DT2=DecisionTreeClassifier()
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring='neg_log_loss')
grid.fit(X_train,y_train)

print('Best score:%f using %s'%(-grid.best_score_,grid.best_params_))

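Printed score (the values shown lie outside the ranges searched in this step, so this line appears to come from the first search):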

Best score:1.206972 using {'max_depth': 10, 'min_samples_leaf': 9}

test_means=-grid.cv_results_['mean_test_score']

test_scores=np.array(test_means).reshape(len(max_depth),len(min_samples_leaf))
for i,value in enumerate(max_depth):
    plt.plot(min_samples_leaf,test_scores[i],label='test_max_depth'+str(value))

plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()

Output: [plot of logloss against min_samples_leaf, one curve per max_depth value]

plt.plot(min_samples_leaf,test_scores[3],label='test_max_depth:'+str(9))
plt.show()

Output: [plot of logloss against min_samples_leaf for max_depth=9]

Increase both max_depth and min_samples_leaf

from sklearn.model_selection import GridSearchCV

# Parameters to tune
max_depth=range(10,20,2)
min_samples_leaf=range(20,30,2)
tuned_parameters=dict(max_depth=max_depth,min_samples_leaf=min_samples_leaf)


DT2=DecisionTreeClassifier()
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring='neg_log_loss')
grid.fit(X_train,y_train)

print('Best score:%f using %s'%(-grid.best_score_,grid.best_params_))

Output: [printed best score and parameters]

Judging from the results, max_depth can be fixed at 10, but min_samples_leaf still needs further tuning.

test_means=-grid.cv_results_['mean_test_score']

test_scores=np.array(test_means).reshape(len(max_depth),len(min_samples_leaf))


for i,value in enumerate(max_depth):
    plt.plot(min_samples_leaf,test_scores[i],label='test_max_depth:'+
                                str(value))

plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()

Output: [plot of logloss against min_samples_leaf, one curve per max_depth value]

Fix max_depth at 10 and increase min_samples_leaf

from sklearn.model_selection import GridSearchCV

# Parameters to tune
# max_depth=10
min_samples_leaf=range(30,40,2)
tuned_parameters=dict(min_samples_leaf=min_samples_leaf)

DT2=DecisionTreeClassifier(max_depth=10)
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring='neg_log_loss')
grid.fit(X_train,y_train)


test_means=-grid.cv_results_['mean_test_score']
plt.plot(min_samples_leaf,test_means,label='test_max_depth'+str(10))

plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()

Plot: [logloss against min_samples_leaf for max_depth=10]

# Print the best score
print('Best score: %f using %s'%(-grid.best_score_,grid.best_params_))

Printed score: [printed best score and parameters]

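Other CART model-complexity parameters:

max_leaf_nodes plays a role similar to max_depth, so tuning either one is enough;
min_samples_split and min_samples_leaf are usually related as well, so tuning either one is enough;
min_weight_fraction_leaf: since no class or sample weights are set in this task, it acts much like min_samples_leaf and needs no separate tuning;
max_features: in principle larger is better; all features are used here, so no further tuning is needed.

Of course, if computing resources allow, these parameters can be tuned too, but the expected gain from doing so is small.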

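Save the model for later testing

# cPickle exists only in Python 2; in Python 3 use the standard pickle module
import pickle
with open("Otto_CART_org_tfidf.pkl",'wb') as f:
    pickle.dump(grid.best_estimator_,f)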
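
As a quick sanity check on this save-and-load step, the following sketch round-trips a small tree through pickle in memory and confirms that the restored model predicts the same probabilities. The synthetic data and the model settings are assumptions for illustration; the original workflow writes to a file instead of using `pickle.dumps`.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Serialise to bytes and back, mirroring pickle.dump / pickle.load on a file.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print("restored model gives identical probabilities:", same)
```

A pickled estimator is generally only safe to load with the same scikit-learn version that saved it, so training and testing should share one environment.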

Inspect the feature importances:

DT3=grid.best_estimator_
df=pd.DataFrame({"columns":list(feat_names),"importance":
                list(DT3.feature_importances_.T)})
df=df.sort_values(by=['importance'],ascending=False)

print(df)
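
When reading this table, it helps to know that a fitted tree's feature_importances_ are non-negative and normalised to sum to 1, so each value can be read as a share of the total impurity reduction. A minimal sketch on synthetic data, where the feature names `feat_0` ... `feat_11` are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           n_classes=3, random_state=0)
names = [f"feat_{i}" for i in range(X.shape[1])]   # hypothetical names

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Index the importances by feature name and sort in descending order.
imp = pd.Series(tree.feature_importances_, index=names)
imp = imp.sort_values(ascending=False)

print(imp.head(5))
print("sum of importances:", imp.sum())
```

Features never used in a split get an importance of exactly 0, which is common in a shallow tree.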

Testing: decision tree


# Read the data
dpath='./data/'

# Use the original features plus the tf-idf features
# (test file names are assumed by analogy with the training files)
test1=pd.read_csv(dpath+"Otto_FE_test_org.csv")
test2=pd.read_csv(dpath+"Otto_FE_test_tfidf.csv")

# Drop the redundant id
test2=test2.drop(["id"],axis=1)
test=pd.concat([test1,test2],axis=1,ignore_index=False)
test.head()

del test1
del test2

# Keep the ids for the submission file; the test set has no target column
test_id=test['id']
X_test=test.drop(["id"],axis=1)

# Keep the feature names for later use
feat_names=X_test.columns


# Convert to a sparse matrix
from scipy.sparse import csr_matrix
X_test=csr_matrix(X_test)

# Load the trained model (pickle replaces the Python 2 cPickle module)
import pickle
with open("Otto_CART_org_tfidf.pkl",'rb') as f:
    CART_best=pickle.load(f)

# Predict the probability of each class
y_test_pred=CART_best.predict_proba(X_test)

print(y_test_pred.shape)


# Build the submission file
out_df=pd.DataFrame(y_test_pred)

columns=np.empty(9,dtype=object)
for i in range(9):
    columns[i]='Class_'+str(i+1)
out_df.columns=columns

out_df=pd.concat([test_id,out_df],axis=1)

out_df.to_csv("CART_org_tfidf.csv",index=False)
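
The hard-coded column names above rely on predict_proba returning its columns in the order Class_1 ... Class_9. scikit-learn orders those columns by the fitted estimator's classes_ attribute, which is sorted; with nine single-digit labels that sorted order matches. A slightly more defensive sketch takes the names from classes_ directly; the toy data, the three-class setup, and the made-up ids are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data with string labels shaped like the Otto targets (assumption).
rng = np.random.RandomState(0)
X = rng.rand(90, 4)
y = np.array([f"Class_{i % 3 + 1}" for i in range(90)])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:5])

# Name the probability columns after the classes the model actually learned.
out = pd.DataFrame(proba, columns=clf.classes_)
out.insert(0, "id", np.arange(1, 6))   # made-up ids for the example

print(out)
```

Using classes_ keeps the submission columns aligned with the probabilities even if the label set or its ordering changes.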

Results with both the original and tfidf features

Logistic regression: score of 0.59817 on the Kaggle Private Leaderboard (rank 2243)

RBF-kernel SVM (tfidf features only): 0.48947 (rank 1254)

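CART: 1.07144 (the cross-validation estimate of the test error was reasonably close, but the performance itself is poor); a single decision tree does not perform well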