1. Task Requirements
Train a classification model on the features and labels of all training-set samples, feed the test-set samples' features into the trained model to obtain predictions, and upload the predictions to the competition site for scoring. The higher the accuracy score, the better.
2. Data Description
Three CSV files are downloaded from the competition site.
train.csv: the training set, 42001 rows × 785 columns. The first row is a header indicating that the first column holds the sample labels and the remaining 784 columns hold the sample features. The training set contains 42000 samples; each sample has 784 features, obtained by flattening a 28×28 grayscale image.
test.csv: the test set (unlabeled), 28001 rows × 784 columns. The first row is a header indicating that each column is one feature of a sample, 784 features in total. The test set contains 28000 samples.
sample_submission.csv: the submission template, 28001 rows × 2 columns. The first row is a header; the first column holds the index of each test-set sample, and the second column holds the labels to be filled in (all 0 initially). Overwrite this column with the prediction for each test-set sample; once filled in, the file can be submitted as-is.
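The label/feature split that the training dataset class later performs can be sketched on a tiny made-up stand-in for train.csv (the pixel column names follow Kaggle's files; the values here are random):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train.csv: 5 samples, a label column plus 784 pixel columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 256, size=(5, 784)),
                  columns=[f"pixel{k}" for k in range(784)])
df.insert(0, "label", [3, 1, 4, 1, 5])

labels = df.iloc[:, 0].values     # first column -> labels, shape (5,)
features = df.iloc[:, 1:].values  # remaining 784 columns -> features, shape (5, 784)
print(labels.shape, features.shape)
```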
3. Complete Code
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm


class CSVTrainDataset(Dataset):
    """Training set: the first CSV column is the label, the rest are pixel features."""
    def __init__(self, train_path):
        train_data = pd.read_csv(train_path)
        self.labels = train_data.iloc[:, 0].values
        self.features = train_data.iloc[:, 1:].values

    def __len__(self):
        # return the length of the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        sample = torch.from_numpy(self.features[idx]).float(), torch.tensor(self.labels[idx]).long()
        return sample


class CSVTestDataset(Dataset):
    """Test set: every CSV column is a pixel feature; there are no labels."""
    def __init__(self, test_path):
        self.features = pd.read_csv(test_path).values

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        sample = self.features[idx]
        sample = torch.from_numpy(sample).float()
        return sample


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(in_features=1600, out_features=128)
        self.fc2 = nn.Linear(in_features=128, out_features=10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = nn.functional.max_pool2d(x, 2)
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = nn.functional.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = nn.functional.log_softmax(x, dim=1)
        return output


show = 0
train_dataset = CSVTrainDataset(train_path=r"..\data\train.csv")
test_dataset = CSVTestDataset(test_path=r"..\data\test.csv")
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
model = CNN()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()
num_epochs = 5

for epoch in range(num_epochs):
    print("training")
    for i, (images, labels) in tqdm(enumerate(train_loader), total=len(train_loader)):
        images = images.reshape(-1, 1, 28, 28)
        if show == 1 and epoch == 1:
            if i == 3:
                images_subset = images[25:35].squeeze()
                labels_subset = labels[25:35]
                fig, axes = plt.subplots(2, 5, figsize=(12, 6))
                fig.suptitle('training set display')
                for j in range(10):  # j avoids shadowing the batch index i
                    row, col = divmod(j, 5)
                    ax = axes[row, col]
                    ax.imshow(images_subset[j], cmap='gray')
                    ax.title.set_text(f'Labels:{labels_subset[j].item()}')
                    ax.axis('off')
                plt.show()
        model.train()
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f'Epoch[{epoch+1}/{num_epochs}], Loss:{loss.item():.4f}')

with torch.no_grad():
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
    model.eval()
    print("predicting")
    for i, images in enumerate(test_loader):
        images = images.reshape(-1, 1, 28, 28)
        pred = model(images)
        pred = pred.argmax(dim=1)
        if show == 1:
            if i == 3:
                images_test_subset = images[25:35].squeeze()
                labels_test_subset = pred[25:35]
                fig, axes = plt.subplots(2, 5, figsize=(12, 6))
                fig.suptitle('test set display')
                for j in range(10):  # j avoids shadowing the batch index i
                    row, col = divmod(j, 5)
                    ax = axes[row, col]
                    ax.imshow(images_test_subset[j], cmap='gray')
                    ax.title.set_text(f'Labels:{labels_test_subset[j].item()}')
                    ax.axis('off')
                plt.show()
        if i == 0:
            result = pred
        else:
            result = torch.cat((result, pred), dim=0)
    result = result[0:28000]
    print("take a look at the result")
    print(result)
    numpy_array = result.numpy()
    csv_file_path = r"..\data\sample_submission.csv"
    df = pd.read_csv(csv_file_path)
    df.iloc[:, 1] = numpy_array.flatten()
    df.to_csv(csv_file_path, index=False)
    print("been saved successfully")
The folders must be laid out to match the relative paths in the code (the three CSV files in a data directory that sits next to the code directory) for the code to run correctly.
Note: the code can be copied into a new main.py file and run directly. The results (the predicted classes for the test set) are written straight into the corresponding column of sample_submission.csv. Once the program prints "been saved successfully", it has finished; submit sample_submission.csv on the Kaggle site to receive a score and ranking.
4. Approach Overview
1. Data preprocessing
(1) Create a batched data loader
To speed up training, multiple samples can be combined into one batch; the hyperparameter batch_size controls how many samples go into each batch.
Example:
batch_size = 64
produces a (64, 28, 28) tensor, enabling parallel computation.
If convolutions follow, a channel dimension must also be added, giving (64, 1, 28, 28).
Because these are single-channel grayscale images, the channel dimension is 1; an RGB image (three channels) would instead give (64, 3, 28, 28).
(2) Sample reshaping
Each raw sample is a one-dimensional array. To apply the two-dimensional convolutions common in computer vision, the one-dimensional array must be reshaped into a two-dimensional array, i.e. restored to image form.
Example: the data changes from (64, 784) to (64, 28, 28).
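In the code both steps happen in a single call, images.reshape(-1, 1, 28, 28). A minimal sketch, with random data standing in for real pixels:

```python
import torch

batch = torch.randn(64, 784)           # one batch of flattened samples
images = batch.reshape(-1, 1, 28, 28)  # restore the 28x28 grid and add the channel dim
print(images.shape)
```

The reshape only reinterprets the layout; no pixel values are changed.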
2. Build the convolutional neural network model
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(in_features=1600, out_features=128)
        self.fc2 = nn.Linear(in_features=128, out_features=10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = nn.functional.max_pool2d(x, 2)
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = nn.functional.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = nn.functional.log_softmax(x, dim=1)
        return output
This is a simple CNN with two convolutional layers. The forward pass runs through: 2-D convolution, activation, 2-D max pooling, 2-D convolution, activation, 2-D max pooling, dropout, flattening, a fully connected layer, activation, dropout, a second fully connected layer, and log-softmax normalization.
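The fc1 input size of 1600 follows from the shapes: 28 → 26 after the first 3×3 convolution, → 13 after pooling, → 11 after the second convolution, → 5 after pooling, and 64 × 5 × 5 = 1600. A quick shape check with a dummy input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # dummy single-image batch
x = nn.functional.max_pool2d(nn.functional.relu(nn.Conv2d(1, 32, 3)(x)), 2)   # -> (1, 32, 13, 13)
x = nn.functional.max_pool2d(nn.functional.relu(nn.Conv2d(32, 64, 3)(x)), 2)  # -> (1, 64, 5, 5)
print(torch.flatten(x, 1).shape)  # -> (1, 1600), matching fc1's in_features
```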
5. Step-by-Step Debugging of the Data Flow
show = 1
Setting show to 1 displays some training-set and test-set images during the run; setting it to 0 skips the display.
train_dataset = CSVTrainDataset(train_path=r"..\data\train.csv")
test_dataset = CSVTestDataset(test_path=r"..\data\test.csv")
Load the data from the CSV files.
Key data at this point:
test_dataset{CSVTestDataset:28000}
features{ndarray:(28000,784)}
train_dataset{CSVTrainDataset:42000}
features{ndarray:(42000,784)}
labels{ndarray:(42000,)}
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
Instantiate the training data loader.
Key data at this point:
test_dataset{CSVTestDataset:28000}
features{ndarray:(28000,784)}
train_dataset{CSVTrainDataset:42000}
features{ndarray:(42000,784)}
labels{ndarray:(42000,)}
train_loader{DataLoader:657}
model = CNN()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()
Instantiate the model, the optimizer, and the loss function.
num_epochs = 5
Set the number of training epochs; each epoch trains on the entire training set.
for i, (images, labels) in tqdm(enumerate(train_loader), total=len(train_loader)):
Fetch a batch of feature data and label data from the training-set data loader.
Key data at this point:
train_loader{DataLoader:657}
images{Tensor:(64,784)}
labels{Tensor:(64,)}
images = images.reshape(-1, 1, 28, 28)
Reshape each sample's features from a one-dimensional array back into a two-dimensional array and add the channel dimension.
Key data at this point:
train_loader{DataLoader:657}
images{Tensor:(64,1,28,28)}
labels{Tensor:(64,)}
images_subset = images[25:35].squeeze()
labels_subset = labels[25:35]
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
fig.suptitle('training set display')
for j in range(10):
    row, col = divmod(j, 5)
    ax = axes[row, col]
    ax.imshow(images_subset[j], cmap='gray')
    ax.title.set_text(f'Labels:{labels_subset[j].item()}')
    ax.axis('off')
plt.show()
Display some images from the training set.
model.train()
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
Run the training step.
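The training step above follows the standard PyTorch recipe: zero the gradients, forward pass, compute the loss, backpropagate, update the parameters. A minimal self-contained sketch, with a stand-in linear model and random tensors replacing the CNN and real data:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(784, 10)  # stand-in for the CNN
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.NLLLoss()

images = torch.randn(8, 784)
labels = torch.randint(0, 10, (8,))

model.train()
optimizer.zero_grad()                                      # clear old gradients
outputs = nn.functional.log_softmax(model(images), dim=1)  # NLLLoss expects log-probabilities
loss = criterion(outputs, labels)
loss.backward()                                            # accumulate gradients
optimizer.step()                                           # update parameters
```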
def forward(self, x):
    x = self.conv1(x)
    x = nn.functional.relu(x)
    x = nn.functional.max_pool2d(x, 2)
    x = self.conv2(x)
    x = nn.functional.relu(x)
    x = nn.functional.max_pool2d(x, 2)
    x = self.dropout1(x)
    x = torch.flatten(x, 1)
    x = self.fc1(x)
    x = nn.functional.relu(x)
    x = self.dropout2(x)
    x = self.fc2(x)
    output = nn.functional.log_softmax(x, dim=1)
    return output
Entering the model.
Key data at this point:
x{Tensor:(64,1,28,28)}
x = self.conv1(x)
x{Tensor:(64,32,26,26)}
x = nn.functional.relu(x)
x = nn.functional.max_pool2d(x, 2)
x{Tensor:(64,32,13,13)}
x = self.conv2(x)
x{Tensor:(64,64,11,11)}
x = nn.functional.relu(x)
x = nn.functional.max_pool2d(x, 2)
x = self.dropout1(x)
x{Tensor:(64,64,5,5)}
x = torch.flatten(x,1)
x{Tensor:(64,1600)}
x = self.fc1(x)
x{Tensor:(64,128)}
x = nn.functional.relu(x)
x = self.dropout2(x)
x = self.fc2(x)
x{Tensor:(64,10)}
output = nn.functional.log_softmax(x,dim =1)
output{Tensor:(64,10)}
After multiple epochs of training, the model is ready.
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
Key data at this point:
test_loader{DataLoader:438}
for i, images in enumerate(test_loader):
Fetch a batch of feature data from the test-set data loader.
Key data at this point:
test_loader{DataLoader:438}
images{Tensor:(64,784)}
images = images.reshape(-1, 1, 28, 28)
Reshape each sample's features from a one-dimensional array back into a two-dimensional array and add the channel dimension.
Key data at this point:
test_loader{DataLoader:438}
images{Tensor:(64,1,28,28)}
pred = model(images)
Classify with the trained model.
Key data at this point:
test_loader{DataLoader:438}
pred{Tensor:(64,10)}
pred = pred.argmax(dim=1)
Extract the predicted class for each test sample.
Key data at this point:
pred{Tensor:(64,)}
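argmax(dim=1) selects, for each row of the (64, 10) log-probability matrix, the column index with the largest value, which is the predicted digit. In miniature:

```python
import torch

# Two samples over three classes: each row holds (log-)scores per class.
logits = torch.tensor([[0.1, 2.0, -1.0],
                       [3.0, 0.0,  0.5]])
pred = logits.argmax(dim=1)  # index of the largest entry in each row
print(pred)  # tensor([1, 0])
```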
images_test_subset = images[25:35].squeeze()
labels_test_subset = pred[25:35]
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
fig.suptitle('test set display')
for j in range(10):
    row, col = divmod(j, 5)
    ax = axes[row, col]
    ax.imshow(images_test_subset[j], cmap='gray')
    ax.title.set_text(f'Labels:{labels_test_subset[j].item()}')
    ax.axis('off')
plt.show()
Display some images from the test set with their predicted classes.
if i == 0:
    result = pred
else:
    result = torch.cat((result, pred), dim=0)
After iterating over the whole test loader, all batch predictions are concatenated together.
Key data at this point:
result{Tensor:(28000,)}
result = result[0:28000]
Trim to the first 28000 entries. This is only a safeguard: since the DataLoader's final batch contains just 32 samples, the concatenated result already has exactly 28000 entries and the slice changes nothing.
Key data at this point:
result{Tensor:(28000,)}
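As a sanity check on these sizes: with 28000 test samples and batch_size = 64, the loader yields 438 batches, the last containing only 32 samples, so the concatenated predictions number exactly 28000. The arithmetic:

```python
n_test, batch_size = 28000, 64

n_batches = -(-n_test // batch_size)                # ceiling division: 437.5 rounds up
last_batch = n_test - (n_batches - 1) * batch_size  # size of the final partial batch
total = (n_batches - 1) * batch_size + last_batch   # concatenated prediction count
print(n_batches, last_batch, total)
```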
This is the final classification result; next it is saved into the submission template CSV file.
print("take a look at the result")
print(result)
numpy_array = result.numpy()
csv_file_path = r"..\data\sample_submission.csv"
df = pd.read_csv(csv_file_path)
df.iloc[:, 1] = numpy_array.flatten()
df.to_csv(csv_file_path, index=False)
print("been saved successfully")
ImageId,Label
1,2
2,0
3,9
4,9
5,3
6,7
7,0
8,3
9,0
10,3
11,5
12,7
13,4
14,0
15,4
16,3
17,3
18,1
After submitting, the score was 0.98935 (with num_epochs = 100).