otto-group-product-classification-challenge: Neural Network Classification Training


1. Dataset: a product classification dataset from a Kaggle competition.

200,000 products (the number of samples)

93 features (the input feature dimension)

The goal is to build a predictive model that can distinguish between the 9 product categories (the output dimension).

Dataset download links:

Kaggle competition page (registration and login required; if that's a hassle, use my cloud drive link below):

Otto Group Product Classification Challenge | Kaggle

Of the files, sampleSubmission.csv shows the submission format. (The deadline has passed, but the competition still makes good classification practice; after submitting you can see where you rank among some very strong competitors!)

train.csv is the training set; besides the 93 features, it also contains the class labels used for model training.

test.csv is the test set; it contains only the features, and the classes must be predicted with the trained model.

Cloud drive link:

链接: https://pan.baidu.com/s/1MHh-RHTX38vqMrE83lAWZg?pwd=tw1q

提取码: tw1q
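
Once downloaded, a quick look at the files confirms the numbers above (a minimal sketch; it assumes the CSVs sit in a dataset/ folder, matching the code in section 3, and that the label column is named target, as in the Kaggle files):

import pandas as pd

# Load both CSVs and inspect their shapes and the label distribution.
train = pd.read_csv("dataset/train.csv", index_col="id")
test = pd.read_csv("dataset/test.csv", index_col="id")
print(train.shape, test.shape)          # 93 feature columns (+1 target column in train)
print(train["target"].value_counts())   # sample counts for each of the 9 classes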

2. Model:

This post uses the fully connected neural network shown in the figure above: three linear layers (93 → 128 → 64 → 9) with ReLU activations in between, where N refers to the batch_size.
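
For a quick sanity check of the tensor shapes, the same architecture can be written compactly (a minimal nn.Sequential sketch; the layer sizes match the model class in section 3):

import torch
from torch import nn

# Equivalent nn.Sequential version of the network, for shape checking only.
net = nn.Sequential(
    nn.Linear(93, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 9),
)
x = torch.randn(64, 93)  # a dummy batch with N = batch_size = 64
print(net(x).shape)      # torch.Size([64, 9]) -- one logit per class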

3. Python code

import torch
import numpy as np
import pandas as pd
from tensorboardX import SummaryWriter
from torch import nn, optim
from torch.nn import Linear, ReLU, CrossEntropyLoss
from torch.nn.functional import softmax
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Read the training and test sets; both are pandas DataFrames
train_set = pd.read_csv("dataset/train.csv", index_col='id')
test_set = pd.read_csv("dataset/test.csv", index_col='id')

# Collect all the distinct class labels that appear in the training set
classes = train_set.iloc[:, -1].unique()

# Build the mappings between class indices and product class names
idx_to_class = {i: x for i, x in enumerate(classes)}
class_to_idx = {x: i for i, x in idx_to_class.items()}

# Normalize the datasets (zero mean, unit variance per feature)
# Training set
for col in train_set.columns[: -1]:
    mean = train_set[col].mean()
    std = train_set[col].std()
    train_set[col] = train_set[col].apply(lambda x: (x - mean) / std)

# Test set (it has no label column, so normalize every column, and reuse the
# training set's statistics so that both sets share the same scaling)
for col in test_set.columns:
    mean = train_set[col].mean()
    std = train_set[col].std()
    test_set[col] = test_set[col].apply(lambda x: (x - mean) / std)


# Build the Dataset class; note that the training set returns feature/class-index
# pairs, while the test set returns features only
class data_set(Dataset):
    def __init__(self, df, train=True):
        self.df = df
        self.train = train
        if self.train:
            self.X = torch.from_numpy(np.array(self.df.iloc[:, :-1]))
            self.Y = [class_to_idx[x] for x in self.df.iloc[:, -1]]
            self.Y = torch.from_numpy(np.array(self.Y))
        else:
            self.X = torch.from_numpy(np.array(self.df))
            self.Y = torch.Tensor([])

    def __getitem__(self, item):
        if self.train:
            return self.X[item], self.Y[item]
        else:
            return self.X[item]

    def __len__(self):
        return len(self.df)


# Instantiate the dataset classes
train_data_set = data_set(train_set, train=True)
test_data_set = data_set(test_set, train=False)

# Randomly split into training and validation indices (80% / 20%)
num = len(train_data_set)
indices = list(range(num))
np.random.shuffle(indices)
split = int(np.floor(0.2 * num))
train_idx, valid_idx = indices[split:], indices[:split]

# Set up the random subset samplers
train_sample = SubsetRandomSampler(train_idx)
valid_sample = SubsetRandomSampler(valid_idx)

batch_size = 64

# DataLoaders yield mini-batches
train_loader = DataLoader(train_data_set, batch_size, sampler=train_sample)
valid_loader = DataLoader(train_data_set, batch_size, sampler=valid_sample)
test_loader = DataLoader(test_data_set, batch_size)


# Model definition
class model(nn.Module):
    def __init__(self):
        super(model, self).__init__()
        self.Linear1 = Linear(93, 128, bias=True)
        self.Linear2 = Linear(128, 64, bias=True)
        self.Linear3 = Linear(64, 9, bias=True)
        self.activate = ReLU()

    def forward(self, x):
        x = self.activate(self.Linear1(x))
        x = self.activate(self.Linear2(x))
        x = self.Linear3(x)
        return x


# Instantiate the model and move it to the device
my_model = model()
my_model.to(device)

# Loss function (CrossEntropyLoss averages over the batch by default;
# the deprecated size_average argument is no longer needed)
criterion = CrossEntropyLoss()
criterion = criterion.to(device)

# SGD
optimizer = optim.SGD(my_model.parameters(), lr=0.01, momentum=0.1)

# Number of training epochs
epoch_num = 100
valid_loss_min = np.inf

# TensorBoard logging
writer = SummaryWriter("logs")

for epoch in range(epoch_num):
    print("epoch:{}".format(epoch))
    # Training phase
    my_model.train()
    train_loss = 0
    train_iter = 0
    for data in train_loader:
        train_iter += 1
        x, y = data
        x, y = x.to(device), y.to(device)
        y_pred = my_model(x.float())
        loss = criterion(y_pred, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print("train_loss:{}".format(train_loss / train_iter))
    writer.add_scalar("train_loss", train_loss / train_iter, epoch)

    # Validation phase
    my_model.eval()
    valid_loss = 0
    valid_iter = 0
    with torch.no_grad():
        for data in valid_loader:
            valid_iter += 1
            x, y = data
            x, y = x.to(device), y.to(device)
            y_pred = my_model(x.float())
            loss = criterion(y_pred, y)
            valid_loss += loss.item()

    # Checkpoint whenever the average validation loss reaches a new minimum
    valid_loss /= valid_iter
    if valid_loss < valid_loss_min:
        torch.save(my_model.state_dict(), 'model.pt')
        valid_loss_min = valid_loss
        print("model saved to path: model.pt")

    print("valid_loss:{}".format(valid_loss))
    writer.add_scalar("valid_loss", valid_loss, epoch)

writer.close()

# Load the best saved model parameters
my_model.load_state_dict(torch.load("model.pt", map_location=device))

# Create a DataFrame to collect the prediction results
test_df = pd.DataFrame(0, index=np.arange(test_set.shape[0]), columns=np.concatenate([np.array(["id"]), classes]))
test_df['id'] = test_set.index

# Run inference on the test set
my_model.eval()
with torch.no_grad():
    counter = 0
    for data in test_loader:
        data = data.to(device)
        pred = my_model(data.float())
        # pred holds the raw linear-layer outputs (logits); softmax turns them into probabilities
        row = softmax(pred, dim=1)
        # Note: rounding to 1 decimal place loses precision; drop np.around for a better log-loss
        fin_row = np.around(row.to('cpu').numpy(), decimals=1)
        test_df.iloc[counter * batch_size:(counter + 1) * batch_size, 1:] = fin_row.copy()
        counter += 1

# Write the results to submission.csv
test_df.to_csv('submission.csv', index=False)
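
Once the script finishes, a quick sanity check on the output file is worthwhile (a minimal sketch; note that because the probabilities were rounded above, rows may not sum to exactly 1):

import pandas as pd

# The submission should have one row per test sample and 10 columns (id + 9 classes),
# with each row of probabilities close to a valid distribution.
sub = pd.read_csv("submission.csv")
print(sub.shape)
print(sub.iloc[:, 1:].sum(axis=1).describe())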

I'm just getting started and there was a lot I didn't understand; fortunately the Kaggle community's code is open source, and typing it out for myself, following their work, taught me a great deal.

4. Visualization results:

The model converges steadily on the training set. On the validation set the loss first decreases and then rises again: training for too many epochs leads to overfitting. That is why the checkpointing above keeps the parameters with the best generalization, i.e., those at the minimum validation loss.
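
The checkpoint-on-best-validation-loss logic above is a form of early stopping. It can be made explicit with a patience counter (a minimal sketch; the patience value of 10 and the train_one_epoch/evaluate helpers are hypothetical, standing in for the training and validation loops from section 3):

# Early stopping: halt once the validation loss has not improved for
# `patience` consecutive epochs. train_one_epoch() and evaluate() are
# hypothetical helpers wrapping the loops shown in section 3.
patience, bad_epochs = 10, 0
valid_loss_min = float("inf")
for epoch in range(epoch_num):
    train_one_epoch(my_model, train_loader)
    valid_loss = evaluate(my_model, valid_loader)
    if valid_loss < valid_loss_min:
        valid_loss_min, bad_epochs = valid_loss, 0
        torch.save(my_model.state_dict(), "model.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print("stopping early at epoch {}".format(epoch))
            break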

5. Finally, let's keep improving together!
