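[PyTorch study notes] Building a neural network with PyTorch for the Titanic dataset

I have recently been following the PyTorch tutorial by 刘二大人 (Liu Er Da Ren) on Bilibili. This post records my solution to the homework for Lecture 8.
I am a beginner myself: I often write bugs and my code is not very concise, so I consulted other students' homework solutions.
If you happen to read this, please go easy on me, and let's discuss it together~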

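0. Preparing the dataset

The dataset is on Kaggle, which may require a VPN or proxy to reach. If you can access it, the page also has some code showing simple ways to preprocess the data.
https://www.kaggle.com/competitions/titanic/overview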

1. Outline of the approach

1.1 Data processing

① Convert the sex (Sex) strings to integers
male is encoded as 1, female as 0

df = pd.read_csv(filepath, header=0)
df.replace('male', 1, inplace=True)
df.replace('female', 0, inplace=True)

This method is borrowed from here
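
One thing to keep in mind: a DataFrame-wide `replace` rewrites every cell equal to 'male' or 'female', in any column. The following is a minimal sketch of a column-scoped alternative using `Series.map`. The toy frame and its values are made up for illustration and only stand in for train.csv.

```python
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
df = pd.DataFrame({
    "Name": ["male", "Alice", "Bob"],   # a text column that happens to contain "male"
    "Sex": ["male", "female", "male"],
})

# Map only the Sex column; every other column is left untouched.
df["Sex"] = df["Sex"].map({"male": 1, "female": 0})

print(df["Sex"].tolist())   # [1, 0, 1]
print(df["Name"].tolist())  # ['male', 'Alice', 'Bob']
```

For this dataset both approaches should give the same Sex column; the scoped version just makes the intent explicit.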

② The age (Age) column has missing values
Fill them with the column mean

df = df.fillna(df.mean(numeric_only=True))
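
A small sketch of mean imputation on made-up data, to show what the line does. `numeric_only=True` is passed so that `mean()` skips text columns such as Name; without it, older pandas versions emit a warning and pandas 2.x raises an error on text columns.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing Age value.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [20.0, np.nan, 40.0],
})

# df.mean(numeric_only=True) is a Series indexed by column name,
# so fillna fills each numeric column with its own mean.
df = df.fillna(df.mean(numeric_only=True))

print(df["Age"].tolist())  # [20.0, 30.0, 40.0]
```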

1.2 Data format conversion
pandas DataFrame → NumPy array (float32) → PyTorch tensor
[These dataset conversions made my code so complicated and messy QAQ]

xy = df.iloc[:, [1,2,4,5,6,7,9]]	# select the numeric columns we need (label + 6 features)
xy = (np.array(xy)).astype(np.float32)
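
The whole chain can be sketched on a tiny frame as follows. The three rows are made up and the column names are only a subset of the real ones; the point is the dtype and the resulting shapes.

```python
import numpy as np
import pandas as pd
import torch

# Made-up rows; the first column plays the role of the label.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
})

xy = df.to_numpy(dtype=np.float32)   # DataFrame -> float32 ndarray
x = torch.from_numpy(xy[:, 1:])      # features, shape (3, 2)
y = torch.from_numpy(xy[:, [0]])     # labels, shape (3, 1); [0] keeps the column axis

print(x.shape, x.dtype)
print(y.shape)
```

Indexing with `[0]` rather than `0` keeps the labels two-dimensional, which matches the (N, 1) output shape of a network whose last layer has one unit.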

2. The code

# 0. Imports
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import matplotlib.pyplot as plt

# 1. Prepare the datasets
class Train_Dataset(Dataset):
    def __init__(self, filepath):
        df = pd.read_csv(filepath, header=0)  # read the data
        df.replace('male', 1, inplace=True)
        df.replace('female', 0, inplace=True)  # encode the sex strings as 0/1
        df = df.fillna(df.mean(numeric_only=True))  # fill missing values with column means; numeric_only=True skips text columns
        xy = df.iloc[:, [1,2,4,5,6,7,9]]  # check train.csv to see which columns these are
        xy = (np.array(xy)).astype(np.float32)  # convert to an np.array with float32 elements
        self.len = xy.shape[0]  # number of rows in xy
        self.x_train = torch.from_numpy(xy[:,1:])
        self.y_train = torch.from_numpy(xy[:,[0]])  # load everything into memory; the [] matters (keeps y two-dimensional)

    def __getitem__(self, index):
        return self.x_train[index], self.y_train[index]   # return the tuple (x, y)

    def __len__(self):
        return self.len

class Test_Dataset(Dataset):
    def __init__(self, filepath):
        df1 = pd.read_csv(filepath,header=0)
        df1.replace('male', 1, inplace=True)
        df1.replace('female', 0, inplace=True)
        df1 = df1.fillna(df1.mean(numeric_only=True))
        xy1 = df1.iloc[:,[1,3,4,5,6,8]]
        xy1 = (np.array(xy1)).astype(np.float32)
        self.len = xy1.shape[0]
        self.x_test = torch.from_numpy(xy1[:,:])

    def __getitem__(self, index):
        return self.x_test[index]

    def __len__(self):
        return self.len

train_dataset = Train_Dataset('./titanic/train.csv')  # put the titanic folder in PyCharm's working directory
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=1,
                          shuffle=True,  # batches in random order
                          num_workers=0)
test_dataset = Test_Dataset('./titanic/test.csv')
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=1,
                         shuffle=False,  # batches in file order
                         num_workers=0)

# 2. Design the model
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(6,4)
        self.linear2 = torch.nn.Linear(4,2)
        self.linear3 = torch.nn.Linear(2,1)
        self.sigmoid = torch.nn.Sigmoid()
    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x
model = Model()

# 3. Build the loss and the optimizer
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(),lr=0.01)

# 4. Training and testing
def train(epoch):
    for batch_idx, data in enumerate(train_loader, 0):
        inputs, target = data
        optimizer.zero_grad()

        #forward+backward+update
        outputs = model(inputs)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()

def test():
    y = []
    with torch.no_grad():
        for data in test_loader:
            features = data
            y_test = model(features)
            if y_test > 0.5:  # survived: 1
                y.append(1)
            else:  # died: 0
                y.append(0)
    print(y)
    test_idx = pd.read_csv('./titanic/test.csv')
    output = pd.DataFrame({'PassengerId': test_idx.PassengerId, 'Survived':np.array(y)})
    output.to_csv('my_predict_p8h1.csv', index=False)

if __name__ == '__main__':
    for epoch in range(1000):
        train(epoch)
        if epoch % 1000 == 999:  # after the final epoch (epoch == 999), write the test-set predictions once
            test()
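
A note on the prediction step: `if y_test > 0.5` relies on batch_size=1, because a tensor with more than one element cannot be used directly as a condition. If one wanted a larger batch size, the same rule could be applied to a whole batch at once. The sketch below uses made-up sigmoid outputs rather than the trained model.

```python
import torch

# Made-up sigmoid outputs for a batch of 4 passengers, shape (4, 1).
probs = torch.tensor([[0.91], [0.12], [0.50], [0.77]])

# Same rule as the loop: strictly greater than 0.5 means survived (1), otherwise 0.
labels = (probs > 0.5).long().squeeze(1).tolist()

print(labels)  # [1, 0, 0, 1]
```

Inside the test loop, a list built this way could be appended to `y` with `y.extend(labels)`.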

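3. Checking the accuracy on the test set

Again, you may need a VPN or proxy to reach Kaggle.
https://www.kaggle.com/competitions/titanic/overview
See "How to Submit your Prediction to Kaggle" on that page and mind the submission format: two columns, one with the passenger ID and one with the predicted value.
The code above scored a little over 71 for me, much better than my first attempt (30) and my second (60) QAQ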
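
Since test.csv has no labels, the score is only visible after submitting. A rough local estimate is possible by holding out part of train.csv. The sketch below is not part of my homework code: it uses random stand-in tensors and a hypothetical untrained one-layer model, and only illustrates the split and the accuracy computation.

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)

# Stand-in data: 10 samples with 6 features and a 0/1 label (random, illustration only).
x = torch.rand(10, 6)
y = torch.randint(0, 2, (10, 1)).float()
dataset = TensorDataset(x, y)

# Hold out 20% of the labelled data for validation.
train_part, val_part = random_split(dataset, [8, 2])

def accuracy(model, subset):
    """Fraction of samples whose thresholded prediction equals the label."""
    xs = torch.stack([subset[i][0] for i in range(len(subset))])
    ys = torch.stack([subset[i][1] for i in range(len(subset))])
    with torch.no_grad():
        preds = (model(xs) > 0.5).float()
    return (preds == ys).float().mean().item()

# Hypothetical untrained stand-in with the same 6 -> 1 shape as the model above.
model = torch.nn.Sequential(torch.nn.Linear(6, 1), torch.nn.Sigmoid())

print(len(train_part), len(val_part))
print(accuracy(model, val_part))
```

With the real data, `dataset` would be the Train_Dataset instance, training would use only `train_part`, and `accuracy` would be called on `val_part` with the trained model.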
