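I've been working through Liu Er Daren's (刘二大人) PyTorch tutorial on Bilibili. This post records my solution to the homework for Lecture 8.
I'm a beginner too: my code often has bugs and isn't very concise, and I drew on other students' homework solutions.
If you happen to read this, please go easy on me, and let's discuss it together~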
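0. Preparing the dataset
You may need a VPN to reach the site. If you can get there, the page also has some code showing simple ways to process the dataset.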
https://www.kaggle.com/competitions/titanic/overview
1. Outline of the approach
1. Data processing
① Convert sex (Sex) from a string to an integer
male is represented as 1, female as 0
df = pd.read_csv(filepath, header=0)
df.replace('male', 1, inplace=True)
df.replace('female', 0, inplace=True)
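This approach is borrowed from another source.
② The age (Age) column has missing values
Fill them with the column mean (numeric_only=True makes pandas skip text columns such as Name)
df = df.fillna(df.mean(numeric_only=True))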
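As a small illustration of the two cleaning steps above, here is a sketch on a made-up three-row frame (the values are invented, not taken from train.csv). It encodes Sex with `Series.map` so that only that one column is touched, which is a variation on the `df.replace` calls rather than the method used in this post, and it passes `numeric_only=True` to `mean()` on the assumption that your pandas version refuses to average text columns.

```python
import pandas as pd

# Toy data for illustration only; the real file is train.csv from Kaggle.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Age": [20.0, None, 40.0],
})

# Encode Sex as integers: male -> 1, female -> 0 (only this column is changed).
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Fill missing numeric values with each column's mean; text columns are skipped.
toy = toy.fillna(toy.mean(numeric_only=True))

print(toy)
```

The missing age is filled with the mean of the two known ages, and the Name column is left alone.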
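2. Converting the data format
pandas DataFrame → NumPy array (float32) → PyTorch tensor
[Dear dataset, your conversions made my code so complicated and messy QAQ]
xy = df.iloc[:, [1,2,4,5,6,7,9]]  # select the numeric columns we need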
xy = (np.array(xy)).astype(np.float32)
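To show what the conversion produces, and why the label column is later sliced with a list index, here is a sketch on a hand-made frame (values invented). It covers only the DataFrame → float32 array step and the slicing; the final `torch.from_numpy` call is left out so the snippet needs nothing beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Invented rows laid out like the selected columns: label first, then features.
toy = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Fare": [7.25, 71.28]})

xy = np.array(toy).astype(np.float32)  # DataFrame -> float32 ndarray

x = xy[:, 1:]        # features, shape (2, 2)
y_col = xy[:, [0]]   # list index keeps a 2-D column, shape (2, 1)
y_flat = xy[:, 0]    # plain index drops to 1-D, shape (2,)

print(xy.dtype, x.shape, y_col.shape, y_flat.shape)
```

The (N, 1) column form lines up with the output of a final `Linear(..., 1)` layer, which is what a loss such as BCELoss expects when comparing predictions with targets.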
2. The code
# 0. Imports
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import matplotlib.pyplot as plt
# 1. Prepare the dataset
class Train_Dataset(Dataset):
    def __init__(self, filepath):
        df = pd.read_csv(filepath, header=0)  # read the data
        df.replace('male', 1, inplace=True)
        df.replace('female', 0, inplace=True)  # encode the sex strings as 1/0
        # fill missing values with the column mean; numeric_only=True skips text columns
        df = df.fillna(df.mean(numeric_only=True))
        xy = df.iloc[:, [1,2,4,5,6,7,9]]  # check train.csv to see exactly which columns these are
        xy = (np.array(xy)).astype(np.float32)  # convert to np.array with float32 elements
        self.len = xy.shape[0]  # number of rows in xy
        self.x_train = torch.from_numpy(xy[:,1:])
        # load all data straight into memory; the [0] list index matters, it keeps y as shape (N, 1)
        self.y_train = torch.from_numpy(xy[:,[0]])

    def __getitem__(self, index):
        return self.x_train[index], self.y_train[index]  # return the tuple (x, y)

    def __len__(self):
        return self.len
class Test_Dataset(Dataset):
    def __init__(self, filepath):
        df1 = pd.read_csv(filepath, header=0)
        df1.replace('male', 1, inplace=True)
        df1.replace('female', 0, inplace=True)
        # same mean fill as for the training set; numeric_only=True skips text columns
        df1 = df1.fillna(df1.mean(numeric_only=True))
        xy1 = df1.iloc[:,[1,3,4,5,6,8]]  # test.csv has no Survived column, so the indices shift
        xy1 = (np.array(xy1)).astype(np.float32)
        self.len = xy1.shape[0]
        self.x_test = torch.from_numpy(xy1[:,:])

    def __getitem__(self, index):
        return self.x_test[index]

    def __len__(self):
        return self.len
train_dataset = Train_Dataset('./titanic/train.csv')  # put the titanic folder in PyCharm's working directory
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=1,
                          shuffle=True,  # batches in random order
                          num_workers=0)
test_dataset = Test_Dataset('./titanic/test.csv')
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=1,
                         shuffle=False,  # batches in file order
                         num_workers=0)
# 2. Design the model
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(6,4)  # 6 input features
        self.linear2 = torch.nn.Linear(4,2)
        self.linear3 = torch.nn.Linear(2,1)  # 1 output: probability of survival
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x

model = Model()
# 3. Construct the loss and the optimizer
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(),lr=0.01)
# 4. Training and testing
def train(epoch):
    for batch_idx, data in enumerate(train_loader, 0):
        inputs, target = data
        optimizer.zero_grad()
        # forward + backward + update
        outputs = model(inputs)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

def test():
    y = []
    with torch.no_grad():
        for data in test_loader:
            features = data
            y_test = model(features)
            if y_test > 0.5:  # 1 = survived
                y.append(1)
            else:  # 0 = died
                y.append(0)
    print(y)
    test_idx = pd.read_csv('./titanic/test.csv')
    output = pd.DataFrame({'PassengerId': test_idx.PassengerId, 'Survived':np.array(y)})
    output.to_csv('my_predict_p8h1.csv', index=False)
if __name__ == '__main__':
    for epoch in range(1000):
        train(epoch)
        if epoch % 1000 == 999:  # once training reaches epoch 999, output the test-set predictions once
            test()
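The 0.5 cutoff in test() is applied inside an `if`, which only works because the test loader yields one sample per batch. As a rough sketch of the same rule applied to several outputs at once, here it is in NumPy on invented probabilities; this is an illustration of the decision rule, not the code used for the submission.

```python
import numpy as np

# Invented sigmoid outputs for three passengers, shape (3, 1) like the model's output.
probs = np.array([[0.2], [0.7], [0.5]], dtype=np.float32)

# Same rule as in test(): strictly greater than 0.5 means survived (1), otherwise 0.
preds = (probs > 0.5).astype(int).ravel()

print(preds.tolist())
```

Note that a probability of exactly 0.5 falls on the "died" side, just as in the `if`/`else` above.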
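3. Checking the accuracy of the test predictions
Again, you may need a VPN
https://www.kaggle.com/competitions/titanic/overview
See the section "How to Submit your Prediction to Kaggle" on that page. Mind the submission format: two columns, one with the id and one with the prediction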
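A quick sanity check before uploading can save a rejected submission. The sketch below uses invented ids and predictions (in the real script they come from test.csv and test()) and only confirms that the CSV text has the two expected columns and no extra index column.

```python
import pandas as pd

# Invented ids and predictions, for illustration only.
output = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# With no path given, to_csv returns the CSV as a string.
# index=False keeps pandas from writing a third, unnamed index column.
csv_text = output.to_csv(index=False)

lines = csv_text.splitlines()
header = lines[0]
print(header)
print(len(lines) - 1, "data rows")
```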
With the code above I scored a little over 71, much better than the 30 of my first attempt and the 60 of my second QAQ