The code uses the PyTorch library, but the overall approach is much the same in any framework. The current score is 0.7751; there is still room for optimization, and this is only a first version. (*/ω\*)
1. Import the required libraries
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
Since I clean the data with pandas, I did not use Dataset and DataLoader here; instead I use `shuffle` from sklearn.utils.
`from sklearn.model_selection import train_test_split` is used to split the data. The Titanic competition only allows 10 submissions per day, so I split the training data into a training part and a validation part to monitor accuracy locally and avoid pointless submissions. If you don't need this part, you can skip the related imports and code.
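For comparison, the standard PyTorch alternative to the manual shuffle-and-slice loop used below is `TensorDataset` plus `DataLoader`. A minimal sketch with toy tensors (the names `X_demo`/`y_demo` and the 100-sample size are made up; 7 features matches the model defined later):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy tensors standing in for the real features/labels (hypothetical data;
# 7 features per sample, matching the first Linear layer of the model).
X_demo = torch.randn(100, 7)
y_demo = torch.randint(0, 2, (100,)).float()

# shuffle=True reshuffles the data automatically at the start of every epoch,
# which replaces the manual sklearn shuffle + slicing.
loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

n_batches = 0
n_samples = 0
for xb, yb in loader:
    n_batches += 1
    n_samples += len(xb)
print(n_batches, n_samples)  # 4 batches (32+32+32+4), 100 samples total
```

Both approaches do the same job; the DataLoader version just scales better if you later add workers or custom datasets.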
2. Data cleaning
I have seen people concatenate the test set vertically onto the training set and then clean both together. In my view this risks contaminating the data: since the model is trained on the training set, the test set should not influence anything used during training, so I chose to process the two sets separately.
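The leakage concern described above is usually addressed by fitting any statistics (means, standard deviations) on the training set only and reusing them on the test set. A minimal sketch with toy numbers (the demo arrays are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for the cleaned Titanic columns.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 20.0]])

# Fit the scaler on the training data only, then reuse the same
# means/stds on the test data -- no test-set statistics leak in.
scaler = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(X_test_scaled)  # the test row equals the train mean, so it maps to zeros
```

Note that the code below instead fits a second scaler on the test set; `fit` on train followed by `transform` on test, as sketched here, is the more conventional pattern.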
data = pd.read_csv("train.csv")
print("Train_data")
data.info()
# Drop columns that are not used as features
data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)
# Fill missing ages with the mean, missing embarkation ports with the mode
data["Age"].fillna(data["Age"].mean(), inplace=True)
data["Embarked"].fillna(data["Embarked"].mode()[0], inplace=True)
# Encode the categorical columns as integers
le = LabelEncoder()
data["Sex"] = le.fit_transform(data["Sex"])
data["Embarked"] = le.fit_transform(data["Embarked"])
X = data.drop("Survived", axis=1)
y = data["Survived"]
# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
test_Data = pd.read_csv("test.csv")
print("Test_data")
test_Data.info()
# Keep the passenger ids for the submission file
y_test_data_0 = test_Data["PassengerId"].to_numpy()
test_Data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)
# Fill ages with the training-set mean so both sets use the same statistic
test_Data["Age"].fillna(data["Age"].mean(), inplace=True)
test_Data["Embarked"].fillna(test_Data["Embarked"].mode()[0], inplace=True)
# test.csv also has a missing Fare value, which would otherwise break the scaler
test_Data["Fare"].fillna(test_Data["Fare"].median(), inplace=True)
le = LabelEncoder()
test_Data["Sex"] = le.fit_transform(test_Data["Sex"])
test_Data["Embarked"] = le.fit_transform(test_Data["Embarked"])
scaler = StandardScaler()
x_test_Data = scaler.fit_transform(test_Data)
x_test_Data = torch.from_numpy(x_test_Data).float()
Here I dropped PassengerId, Name, Ticket, and Cabin. Dropping the id goes without saying. I have seen the argument that the title in a passenger's name reflects social status, which might affect survival; I think Fare already captures class reasonably well, though there may be exceptions, so I will analyze that later. Cabin has too many missing values to be useful, so out it goes!
Next the missing values are filled in: the mean for Age, and of course the mode for Embarked.
Finally, Sex and Embarked are converted to numbers.
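For anyone who wants to follow up on the title idea mentioned above: in the Kaggle Name format the title sits between the comma and the period, so it can be pulled out with one regex. A sketch using typical Kaggle-format names (the rows here are just illustrative examples):

```python
import pandas as pd

# A few names in the Kaggle "Last, Title. First" format (examples only).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Palsson, Master. Gosta Leonard",
])

# Capture whatever sits between the comma and the first period.
titles = names.str.extract(r",\s*([^\.]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Master']
```

The extracted titles could then be label-encoded like Sex and Embarked and added as an extra feature.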
3. Train/validation split and accuracy analysis — the accuracy check mentioned earlier. It can itself overfit, so this part is optional.
# Split the data into a training part and a hold-out validation part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert everything to tensors
X_train = torch.from_numpy(X_train).float()
X_test = torch.from_numpy(X_test).float()
y_train = torch.from_numpy(y_train.to_numpy()).float()
y_test = torch.from_numpy(y_test.to_numpy()).float()
print(X_train)
print(y_train)
print(data)
# Full training set as tensors, used later for the final training run
X = torch.from_numpy(X).float()
y = torch.from_numpy(y.to_numpy()).float()
The last two lines are used for the final training run.
4. Building the model and predicting on the hold-out split
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(7, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, 8)
        self.fc4 = nn.Linear(8, 1)
        self.dropout = nn.Dropout(0.2)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.relu(self.fc3(x))
        x = self.dropout(x)
        x = self.sigmoid(self.fc4(x))
        return x
model = Net()
criterion = nn.BCELoss(reduction='sum')
optimizer = optim.Adam(model.parameters(), weight_decay=0.01)

# Train the model
epochs = 50
batch_size = 32
for epoch in range(epochs):
    running_loss = 0.0
    # Reshuffle the training data each epoch
    X_train, y_train = shuffle(X_train, y_train)
    for i in range(0, len(X_train), batch_size):
        inputs = X_train[i:i+batch_size]
        labels = y_train[i:i+batch_size]
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, loss: {running_loss / len(X_train)}")
# Switch to eval mode so dropout is disabled during evaluation
model.eval()
with torch.no_grad():
    outputs = model(X_test)
    outputs = outputs.squeeze()
    outputs[outputs >= 0.5] = 1.0
    outputs[outputs < 0.5] = 0.0
    accuracy = torch.sum(outputs == y_test) / len(y_test)
    print("Test accuracy:", accuracy.item())
Here I chose to increase model capacity, guard against overfitting with dropout, and add non-linearity. Frankly, adding L2 regularization in the optimizer on top of dropout is probably unnecessary: this version of the code scores 0.762, worse than before, so feel free to drop one of them. I kept both because when I later changed the capacity the loss dropped by a whole decimal place, so I suspect a better fit may still lead to a better result; I am still experimenting.
I also went with Adam as an "automatic" optimizer. Really the others are worth trying too, but I was too lazy; maybe later.
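For the "try the other optimizers" idea above, a loop over optimizer classes keeps the comparison cheap. A minimal sketch on made-up data (the tiny `nn.Sequential` model, learning rate, and step count are illustrative choices, not tuned values):

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Made-up data with the same shape as the Titanic features (7 columns).
torch.manual_seed(0)
X_demo = torch.randn(64, 7)
y_demo = torch.randint(0, 2, (64, 1)).float()

results = {}
for name, opt_cls in [("SGD", optim.SGD), ("Adam", optim.Adam), ("RMSprop", optim.RMSprop)]:
    torch.manual_seed(0)  # identical weight init for a fair comparison
    model = nn.Sequential(nn.Linear(7, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
    optimizer = opt_cls(model.parameters(), lr=0.01)
    criterion = nn.BCELoss()
    for _ in range(50):
        optimizer.zero_grad()
        loss = criterion(model(X_demo), y_demo)
        loss.backward()
        optimizer.step()
    results[name] = loss.item()  # final training loss per optimizer

print(results)
```

On the real data you would compare hold-out accuracy rather than training loss, but the loop structure is the same.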
5. Final training
print("Starting the final training run")
# Make sure we are back in training mode (re-enables dropout)
model.train()
for epoch in range(1000):
    running_loss = 0.0
    for i in range(0, len(X), batch_size):
        inputs = X[i:i+batch_size]
        labels = y[i:i+batch_size]
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, loss: {running_loss / len(X)}")
6. Output the results and save them as a CSV
# Predict on the real test set with dropout disabled
model.eval()
with torch.no_grad():
    y_test_data = model(x_test_Data)
y_test_data = y_test_data.numpy()
# Threshold the probabilities at 0.5
y_test_data = (y_test_data >= 0.5).astype(np.int32)
print(y_test_data)
print(y_test_data_0)
y_test_data = y_test_data.squeeze()
df = pd.DataFrame({"PassengerId": y_test_data_0.tolist(), "Survived": y_test_data.tolist()})
# index=False: the submission must contain exactly these two columns
df.to_csv("gender_submission.csv", index=False)
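Before submitting, it is worth a quick round-trip check that the saved file has exactly the two expected columns and no stray index column. A sketch with dummy predictions (the file name `demo_submission.csv` and the three rows are made up):

```python
import pandas as pd

# Dummy submission standing in for the real predictions.
df = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
df.to_csv("demo_submission.csv", index=False)  # index=False: no extra column

check = pd.read_csv("demo_submission.csv")
print(list(check.columns), len(check))  # ['PassengerId', 'Survived'] 3
```

For the real file you would also expect 418 rows, one per test-set passenger.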