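Introduction
The previous article walked through the general flow of one training run in PyTorch. This article puts that flow into practice with a small hands-on exercise.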
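Dataset
Public bicycle usage prediction.
The training set contains 10,000 samples and the prediction (test) set contains 7,000 samples.
Dataset download: http://sofasofa.io/competition.php?id=1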
Code
# Public bicycle usage prediction
# Written in a Jupyter notebook; running it in PyCharm may need small changes, but nothing major
import torch
import pandas as pd
import numpy as np
train_data = pd.read_csv("E:\\李宏毅深度学习\\sofa_data\\公共自行车使用量预测\\data\\train.csv")
test_data = pd.read_csv("E:\\李宏毅深度学习\\sofa_data\\公共自行车使用量预测\\data\\test.csv")
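# Show the first five rows of train_data to get a feel for its structure
train_data.head()
# Select the target column y
train_y = train_data['y']
# Select every column except y
train_x = train_data.loc[:, train_data.columns != 'y']
# Drop the id column; it carries no predictive information
# (reassigning avoids the SettingWithCopyWarning that an in-place drop on a slice can trigger)
train_x = train_x.drop('id', axis=1)
# Inspect the columns of train_x
train_x.info()
"""
Split the data: divide the training data into a training set and a validation set
"""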
# 0 to 999
# valid_x = train_x[:1000]
# valid_y = train_y[:1000]
# 9000 to 9999
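# x = train_x[9000:]
"""Training"""
x_tensor = torch.tensor(train_x[:9000].values, dtype=torch.float32)
y_tensor = torch.tensor(train_y[:9000].values.reshape(-1, 1), dtype=torch.float32)
"""Validation: the last 1000 rows are used as the validation set here. This is not ideal; a random split or k-fold cross-validation would be better."""
valid_x = torch.tensor(train_x[9000:].values, dtype=torch.float32)
valid_y = torch.tensor(train_y[9000:].values.reshape(-1, 1), dtype=torch.float32)
# Why torch.tensor? pandas gives us a DataFrame, and .values turns it into a NumPy array, which PyTorch cannot train on directly
# We convert the NumPy array into a tensor so the model can compute with it later
# This also shows what a tensor essentially is (my personal understanding):
# a data structure created for model computation. Mathematically it is still linear algebra; a tensor is an implementation of linear algebra on an actual computer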
from torch.utils.data import DataLoader
batch_size = 128
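# No custom class is defined here; we use a class that torch already provides, mainly because the dataset is small and already fully loaded into memory
# The essence is the same: it still comes down to the three Dataset methods (__init__, __len__, __getitem__)
"""Training dataset"""
MyDataset = torch.utils.data.TensorDataset(x_tensor, y_tensor)
"""Validation dataset"""
valid_Dateset = torch.utils.data.TensorDataset(valid_x, valid_y)
# Create the iterators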
dataloader = DataLoader(MyDataset, batch_size, shuffle=True)
valid_dataloader = DataLoader(valid_Dateset, batch_size, shuffle=False)
# Define the model
import torch.nn as nn
net = nn.Sequential(
    nn.Linear(7, 12),
    nn.Sigmoid(),  # do not swap in ReLU here; in my runs it performed much worse
    nn.Linear(12, 1)  # one output per sample, matching the (N, 1) shape of y
)
# Learning rate
lr = 0.01
# Loss function
loss = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr)
# Start training
n_epochs = 5
total_loss = 0
loss_list = []
valid_loss_list = []
for epoch in range(n_epochs):
    net.train()
    print("train begin====================")
    running_loss = 0.0
    # The standard introductory training loop
    for x, y in dataloader:
        optimizer.zero_grad()
        pred = net(x)
        ls = loss(pred, y)
        running_loss += ls.item()
        ls.backward()
        optimizer.step()
    # running_loss sums per-batch mean losses, so dividing by the sample count
    # gives a value roughly batch_size times smaller than the per-sample MSE
    epoch_loss = running_loss / len(dataloader.dataset)
    print("train_loss: ", epoch_loss)
    loss_list.append(epoch_loss)
    print("train over====================")
    # On the last epoch, predict on the test set and write the predictions to a CSV for submission
    if epoch == n_epochs - 1:
        print("预测:")  # "预测" means "prediction"
        test_x = test_data.loc[:, test_data.columns != 'id']
        test_x_tensor = torch.tensor(test_x.values, dtype=torch.float32)
        test_pred = net(test_x_tensor)
        test_pred = test_pred.detach().numpy()
        test_pred = pd.DataFrame(test_pred)
        test_pred.to_csv('mypred.csv')
    # Validation
    print("valid begin-------------------")
    net.eval()
    vaild_running_loss = 0.0
    with torch.no_grad():
        for x, y in valid_dataloader:
            pred = net(x)
            ls = loss(pred, y)
            vaild_running_loss += ls.item()
    vaild_epoch_loss = vaild_running_loss / len(valid_dataloader.dataset)
    valid_loss_list.append(vaild_epoch_loss)
    print("valid loss:", vaild_epoch_loss)
    print("valid over-------------------")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++")
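The comments in the code raise two points worth making concrete: a Dataset comes down to three methods, and the validation set should be drawn at random rather than taken from the last 1000 rows. The sketch below is one way to do both. It assumes synthetic random tensors in place of the real CSV data, and `BikeDataset` and the 90/10 split are hypothetical choices for illustration, not part of the original code.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class BikeDataset(Dataset):
    """Hypothetical stand-in for TensorDataset: keeps features and targets in memory."""

    def __init__(self, features, targets):
        assert len(features) == len(targets)
        self.features = features
        self.targets = targets

    def __len__(self):
        # Number of samples in the dataset
        return len(self.features)

    def __getitem__(self, index):
        # One (features, target) pair
        return self.features[index], self.targets[index]


# Synthetic data with the same layout as the article: 7 features, 1 target (assumption)
torch.manual_seed(0)
features = torch.randn(100, 7)
targets = torch.randn(100, 1)
full_dataset = BikeDataset(features, targets)

# Random 90/10 split instead of slicing off the last rows
generator = torch.Generator().manual_seed(0)
train_set, valid_set = random_split(full_dataset, [90, 10], generator=generator)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=32, shuffle=False)

x, y = next(iter(train_loader))
print(x.shape, y.shape)  # torch.Size([32, 7]) torch.Size([32, 1])
```

With the real data, `features` and `targets` would be the tensors built from the full `train_x` and `train_y`, and the split sizes would be 9000 and 1000.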
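Finally
That completes the small hands-on exercise. This model and dataset train quickly even on a CPU, so feel free to tune the parameters and experiment. The net shown here is the one I ended up with after tuning, and it works reasonably well.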
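For anyone who wants to try the tuning, here is a minimal sketch of a loop over a few settings. It assumes synthetic data and full-batch training to stay short; `train_once`, the candidate hidden sizes, and the candidate learning rates are all hypothetical choices, not values from the experiment above.

```python
import torch
import torch.nn as nn


def train_once(x, y, hidden_size, lr, n_epochs=20):
    """Train a small regression net with full-batch SGD and return the final MSE."""
    torch.manual_seed(0)  # same seed for every configuration, for a fairer comparison
    net = nn.Sequential(
        nn.Linear(x.shape[1], hidden_size),
        nn.Sigmoid(),
        nn.Linear(hidden_size, 1),
    )
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return loss_fn(net(x), y).item()


# Synthetic regression problem: 7 features, target is their sum (assumption)
torch.manual_seed(0)
x = torch.randn(256, 7)
y = x.sum(dim=1, keepdim=True)

# Try every combination of the candidate settings
results = {}
for hidden_size in (6, 12):
    for lr in (0.01, 0.1):
        results[(hidden_size, lr)] = train_once(x, y, hidden_size, lr)

# Print the configurations from best to worst final loss
for (hidden_size, lr), final_loss in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"hidden={hidden_size:2d} lr={lr:<5} loss={final_loss:.4f}")
```

On the real data, the comparison should use the validation loss rather than the training loss, since the goal is to pick settings that generalize.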
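Here are my results, that is, the output from my final round of tuning. These numbers do not prove anything; the point is to encourage you to try it yourself and see what different parameters produce.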
train begin====================
train_loss: 11.88190345594618
train over====================
valid begin-------------------
valid loss: 13.094645629882812
valid over-------------------
++++++++++++++++++++++++++++++++++++++++++++++++++
train begin====================
train_loss: 11.448072102864584
train over====================
valid begin-------------------
valid loss: 11.402728637695313
valid over-------------------
++++++++++++++++++++++++++++++++++++++++++++++++++
train begin====================
train_loss: 11.460680121527778
train over====================
valid begin-------------------
valid loss: 11.051776245117187
valid over-------------------
++++++++++++++++++++++++++++++++++++++++++++++++++
train begin====================
train_loss: 11.513427659776475
train over====================
valid begin-------------------
valid loss: 11.308153442382812
valid over-------------------
++++++++++++++++++++++++++++++++++++++++++++++++++
train begin====================
train_loss: 11.462217176649306
train over====================
预测:
valid begin-------------------
valid loss: 11.4285546875
valid over-------------------
++++++++++++++++++++++++++++++++++++++++++++++++++
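One detail to keep in mind when reading these numbers: the code adds up the mean loss of each batch and then divides by the number of samples, so the printed values are much smaller than the actual per-sample MSE. The sketch below compares that normalization with two common alternatives. It runs on synthetic data with an untrained `nn.Linear`, both of which are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the validation set: 1000 samples, 7 features (assumption)
torch.manual_seed(0)
x = torch.randn(1000, 7)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=128, shuffle=False)

net = nn.Linear(7, 1)   # untrained model, only used to produce some predictions
loss_fn = nn.MSELoss()  # default reduction="mean": one averaged value per batch

sum_of_batch_means = 0.0
sum_of_squared_errors = 0.0
with torch.no_grad():
    for xb, yb in loader:
        batch_mean = loss_fn(net(xb), yb).item()
        sum_of_batch_means += batch_mean
        # Weight each batch mean by its size to recover the summed squared error
        sum_of_squared_errors += batch_mean * xb.size(0)
    exact_mse = loss_fn(net(x), y).item()

article_style = sum_of_batch_means / len(loader.dataset)      # what the article prints
mean_batch_loss = sum_of_batch_means / len(loader)            # average over batches
per_sample_mse = sum_of_squared_errors / len(loader.dataset)  # true per-sample MSE

print("article-style value:", article_style)
print("mean batch loss:    ", mean_batch_loss)
print("per-sample MSE:     ", per_sample_mse)
print("MSE in one pass:    ", exact_mse)
```

Because the same normalization is used for training and validation, the two printed curves are still comparable with each other; only the absolute scale differs from the MSE.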