Otto Product Classification with a Random Forest

Dataset Overview

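The Otto Group dataset comes from the Kaggle competition "Otto Group Product Classification Challenge". The Otto Group is one of the world's largest e-commerce companies, with subsidiaries in more than 20 countries. It sells millions of products worldwide every day, and several thousand new products are added to its product line.

A consistent analysis of product performance is crucial for the company. However, because its global infrastructure is so diverse, many identical products end up classified differently. The quality of the product analysis therefore depends heavily on the ability to group similar products accurately: the better the classification, the more insight the company gains into its product range.

For the competition, the organizers provided a dataset with 93 features for more than 200,000 products. The goal is to build a predictive model that can distinguish between the main product categories. The winning models are to be open sourced.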

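Summary of the Otto Group product classification dataset:

  • Target: 9 product categories (Class_1 to Class_9)
  • Features: 93 integer-valued features (feat_1 to feat_93)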
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV
%matplotlib inline

Loading the Data

Check the current working directory:

os.path.abspath('.')

Read the training set:

data = pd.read_csv("./otto-group-product-classification-challenge/train.csv")
data.head()
   id  feat_1  feat_2  feat_3  feat_4  feat_5  feat_6  feat_7  feat_8  feat_9  ...  feat_85  feat_86  feat_87  feat_88  feat_89  feat_90  feat_91  feat_92  feat_93   target
0   1       1       0       0       0       0       0       0       0       0  ...        1        0        0        0        0        0        0        0        0  Class_1
1   2       0       0       0       0       0       0       0       1       0  ...        0        0        0        0        0        0        0        0        0  Class_1
2   3       0       0       0       0       0       0       0       1       0  ...        0        0        0        0        0        0        0        0        0  Class_1
3   4       1       0       0       1       6       1       5       0       0  ...        0        1        2        0        0        0        0        0        0  Class_1
4   5       0       0       0       0       0       0       0       0       0  ...        1        0        0        0        0        1        0        0        0  Class_1

5 rows × 95 columns

# Shape of the data: (rows, columns)
data.shape
(61878, 95)

Exploratory Feature Analysis

# Descriptive statistics
data.describe()
                 id        feat_1        feat_2        feat_3        feat_4        feat_5        feat_6        feat_7        feat_8        feat_9  ...       feat_84       feat_85       feat_86       feat_87       feat_88       feat_89       feat_90       feat_91       feat_92       feat_93
count  61878.000000   61878.00000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  ...  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000
mean   30939.500000       0.38668      0.263066      0.901467      0.779081      0.071043      0.025696      0.193704      0.662433      1.011296  ...      0.070752      0.532306      1.128576      0.393549      0.874915      0.457772      0.812421      0.264941      0.380119      0.126135
std    17862.784315       1.52533      1.252073      2.934818      2.788005      0.438902      0.215333      1.030102      2.255770      3.474822  ...      1.151460      1.900438      2.681554      1.575455      2.115466      1.527385      4.597804      2.045646      0.982385      1.201720
min        1.000000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000  ...      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
25%    15470.250000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000  ...      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
50%    30939.500000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000  ...      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
75%    46408.750000       0.00000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      1.000000      0.000000  ...      0.000000      0.000000      1.000000      0.000000      1.000000      0.000000      0.000000      0.000000      0.000000      0.000000
max    61878.000000      61.00000     51.000000     64.000000     70.000000     19.000000     10.000000     38.000000     76.000000     43.000000  ...     76.000000     55.000000     65.000000     67.000000     30.000000     61.000000    130.000000     52.000000     19.000000     87.000000

8 rows × 94 columns

# Plot the number of samples per class
sns.countplot(x=data.target)
<AxesSubplot:xlabel='target', ylabel='count'>

[Figure: count plot of the number of samples in each target class]

The plot shows that the classes are imbalanced.
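To put a number on the imbalance instead of reading it off the plot, one option is to compute each class's share of the samples and the ratio between the largest and smallest class. The sketch below uses only the standard library and a small made-up label list (`toy_labels`, `class_shares`, and `imbalance_ratio` are hypothetical names, and the counts are not the Otto counts); on the real data the same idea would apply to `data.target`.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


def imbalance_ratio(labels):
    """Return the size of the largest class divided by the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels for illustration only, not the real class counts
toy_labels = ["Class_1"] * 2 + ["Class_2"] * 6 + ["Class_3"] * 2

shares = class_shares(toy_labels)
ratio = imbalance_ratio(toy_labels)
print(shares)
print(ratio)
```

A ratio close to 1 would indicate balanced classes; the larger it gets, the more the majority classes dominate plain accuracy.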

Data Preprocessing

# Feature matrix
x = data.drop(["id","target"], axis=1)
# Target labels
y = data["target"]

x.head()
   feat_1  feat_2  feat_3  feat_4  feat_5  feat_6  feat_7  feat_8  feat_9  feat_10  ...  feat_84  feat_85  feat_86  feat_87  feat_88  feat_89  feat_90  feat_91  feat_92  feat_93
0       1       0       0       0       0       0       0       0       0        0  ...        0        1        0        0        0        0        0        0        0        0
1       0       0       0       0       0       0       0       1       0        0  ...        0        0        0        0        0        0        0        0        0        0
2       0       0       0       0       0       0       0       1       0        0  ...        0        0        0        0        0        0        0        0        0        0
3       1       0       0       1       6       1       5       0       0        1  ...       22        0        1        2        0        0        0        0        0        0
4       0       0       0       0       0       0       0       0       0        0  ...        0        1        0        0        0        0        1        0        0        0

5 rows × 93 columns

y.value_counts().sort_index()

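# The dataset is fairly large and the classes are imbalanced, so random
# undersampling could be used to shrink it (not applied in this walkthrough)
# from imblearn.under_sampling import RandomUnderSampler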
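The commented-out import points at `RandomUnderSampler` from imbalanced-learn, but no resampling is actually applied in this walkthrough. Purely to illustrate the idea, here is a minimal standard-library sketch of random undersampling in which every class is cut down to the size of the smallest one. The function name `random_undersample` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(labels, seed=0):
    """Return sorted indices that keep an equal number of samples per class."""
    rng = random.Random(seed)

    # Group the sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the size of the smallest class
    n_keep = min(len(indices) for indices in by_class.values())

    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n_keep))
    return sorted(kept)


# Toy labels for illustration only
toy_labels = [0] * 8 + [1] * 3 + [2] * 5

kept_idx = random_undersample(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept_idx)
print(balanced_counts)
```

On the real data the returned indices could be used as `x.iloc[kept_idx]` and `y.iloc[kept_idx]`. The trade-off is that undersampling throws away majority-class rows, which may cost accuracy on those classes.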

Convert the class labels to integers

y = LabelEncoder().fit_transform(y)
y
array([0, 0, 0, ..., 8, 8, 8])
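`LabelEncoder` assigns integer codes according to the sorted order of the distinct labels, which is why the `Class_1` rows at the top of the file become 0 and the rows at the end become 8. A rough pure-Python equivalent, shown only to make the mapping explicit (`fit_label_mapping` is a hypothetical helper, not part of scikit-learn):

```python
def fit_label_mapping(labels):
    """Map each distinct label to its rank in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# Toy labels for illustration only
toy_labels = ["Class_2", "Class_1", "Class_9", "Class_1"]

mapping = fit_label_mapping(toy_labels)
encoded = [mapping[label] for label in toy_labels]

# Invert the mapping to get the original class names back
inverse = {code: label for label, code in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping)
print(encoded)
print(decoded)
```

With scikit-learn itself, keeping a reference to the fitted encoder (instead of calling `LabelEncoder().fit_transform(y)` inline) makes it possible to map predictions back to class names later with `inverse_transform`.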

Splitting the Data

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train.shape, y_train.shape, y_test.shape, x_test.shape
((49502, 93), (49502,), (12376,), (12376, 93))
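Because the classes are imbalanced, a purely random split can leave the test set with slightly different class proportions than the training set, and without a fixed seed the numbers below change on every run. `train_test_split` accepts `stratify=y` and `random_state` to address both. The standard-library sketch below illustrates what stratification does, assuming a hypothetical helper `stratified_split` and toy labels: each class is shuffled and split separately so that both parts keep the same class mix.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so that each class contributes test_size of its samples."""
    rng = random.Random(seed)

    # Group the sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels for illustration only: 50%, 30% and 20% of the samples
toy_labels = [0] * 50 + [1] * 30 + [2] * 20

train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(test_counts)
```

The test part keeps the 50/30/20 class mix of the full toy list, which is the property `stratify=y` is meant to provide on the real data.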

Model Training

from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(oob_score=True)
rf_model.fit(x_train, y_train)
RandomForestClassifier(oob_score=True)
y_pred = rf_model.predict(x_test)

Model Evaluation

# Accuracy on the training set
rf_model.score(x_train, y_train)
0.9999797987960083
# Accuracy on the test set
rf_model.score(x_test, y_test)
0.8089043309631545
# Out-of-bag (OOB) estimate
rf_model.oob_score_
0.7993818431578522
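The near-perfect training accuracy largely reflects fully grown trees fitting their own training data, so the test and OOB scores are the ones that say something about generalization. The OOB estimate is possible because each tree is fit on a bootstrap sample (drawn with replacement and, by default, as large as the training set), so a given row is missing from any one tree with probability (1 - 1/n)^n, which approaches 1/e. A quick check with the training-set size reported above:

```python
import math

n_train = 49502  # number of training rows, from x_train.shape above

# Probability that one specific row is never drawn in n draws with replacement
p_out_of_bag = (1 - 1 / n_train) ** n_train
limit = math.exp(-1)

print(f"P(row is out of bag for one tree) = {p_out_of_bag:.4f}")
print(f"1/e                               = {limit:.4f}")
```

Each tree therefore leaves out roughly 37% of the training rows, and `oob_score_` is the accuracy obtained when every row is predicted only by the trees that never saw it. That is why it lands close to the test accuracy rather than the training accuracy.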
# One-hot encode the true and predicted labels
# (in scikit-learn 1.2 and later this argument is called sparse_output)
encoder = OneHotEncoder(sparse=False)
y_test = encoder.fit_transform(y_test.reshape(-1,1))
# Reuse the fitted encoder so both arrays share the same column order
y_pred = encoder.transform(y_pred.reshape(-1,1))
y_test
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])
y_pred
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
# Log loss computed on the hard (one-hot) predictions
log_loss(y_test, y_pred, eps=1e-15, normalize=True)
6.600210582899472
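The value 6.60 is not a meaningful log loss; it is an artifact of scoring hard 0/1 predictions. After clipping to `eps`, every misclassified sample contributes about -log(1e-15) and every correct one contributes essentially nothing, so the result is just the error rate times that penalty. The check below reproduces the number from the test accuracy reported earlier:

```python
import math

test_accuracy = 0.8089043309631545  # rf_model.score(x_test, y_test) from above
eps = 1e-15                         # clipping value passed to log_loss

# Penalty paid for each sample whose true class was given probability 0
penalty_per_error = -math.log(eps)

# Correct samples contribute roughly 0, so only the errors matter
hard_label_log_loss = (1 - test_accuracy) * penalty_per_error

print(penalty_per_error)
print(hard_label_log_loss)
```

Log loss is designed to score probabilities, so it should be computed from `predict_proba` output rather than from one-hot encoded class predictions.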
# Predict class probabilities instead of hard labels
y_pred_proba = rf_model.predict_proba(x_test)
y_pred_proba
array([[0.  , 0.2 , 0.77, ..., 0.  , 0.02, 0.  ],
       [0.02, 0.48, 0.16, ..., 0.06, 0.  , 0.  ],
       [0.03, 0.02, 0.03, ..., 0.3 , 0.32, 0.02],
       ...,
       [0.12, 0.01, 0.05, ..., 0.08, 0.11, 0.53],
       [0.01, 0.56, 0.32, ..., 0.01, 0.02, 0.  ],
       [0.18, 0.09, 0.01, ..., 0.1 , 0.2 , 0.14]])
# Log loss computed on the predicted probabilities
log_loss(y_test, y_pred_proba, eps=1e-15, normalize=True)
0.6232249914857839
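For reference, multi-class log loss is the average negative log of the probability assigned to the true class, which is why it rewards well-calibrated probabilities rather than hard labels. A minimal standard-library sketch follows; the helper `multiclass_log_loss` and the toy inputs are hypothetical, and the function is meant as an illustration, not a drop-in replacement for `sklearn.metrics.log_loss`.

```python
import math


def multiclass_log_loss(true_classes, probabilities, eps=1e-15):
    """Mean of -log(p), where p is the probability given to the true class."""
    total = 0.0
    for true_class, row in zip(true_classes, probabilities):
        # Clip to avoid log(0), then renormalize so the row sums to 1
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        p_true = clipped[true_class] / sum(clipped)
        total -= math.log(p_true)
    return total / len(true_classes)


# Toy data for illustration only: three samples, three classes
toy_true = [0, 1, 2]
toy_proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]

soft_loss = multiclass_log_loss(toy_true, toy_proba)
uniform_loss = multiclass_log_loss(toy_true, [[1 / 3] * 3] * 3)
print(soft_loss)
print(uniform_loss)
```

A model that always predicts a uniform distribution over k classes scores ln(k), which is about 2.197 for the 9 Otto classes, so the 0.62 obtained from `predict_proba` is well below that baseline.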
