TensorFlow Titanic Passenger Survival Probability Prediction

Problem Overview

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this problem, we need to build a predictive model that uses the passenger attributes to predict which people were more likely to survive.

Dataset

What Data Will I Use in This Competition?

In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.

The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.

File        Dataset        Rows
train.csv   training set   891
test.csv    test set       418

Dataset download: https://www.kaggle.com/c/titanic/data

Importing Libraries and Loading the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(len(train_data), len(test_data))
train_data.head()

891 418

   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S

We can see the training set has 891 rows and the test set (the data we will predict on) has 418.

The meaning of each attribute can be looked up in the table below.

Variable    Definition                       Key
survival    Survival                         0 = No, 1 = Yes
pclass      Ticket class                     1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
age         Age
sibsp       # of siblings/spouses aboard
parch       # of parents/children aboard
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of embarkation              C = Cherbourg, Q = Queenstown, S = Southampton

Shuffling the Training Data

Shuffle the dataset to break up any ordering in the rows, reducing correlation between neighbouring samples and improving the accuracy of the trained model.

This uses the sample function:

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

See the pandas documentation for the full details of sample.

Here I only need the frac parameter, the fraction of rows to return in the shuffled sample, so I set it to 1.
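As a quick sketch (on a hypothetical five-row frame, not the real data), frac=1 keeps every row and only changes the order; random_state, not used in this article, would make the shuffle reproducible:

```python
import pandas as pd

# Toy stand-in for train_data (hypothetical rows, not the real dataset)
df = pd.DataFrame({'PassengerId': [1, 2, 3, 4, 5],
                   'Survived':    [0, 1, 1, 1, 0]})

# frac=1 returns every row in random order; random_state makes the
# shuffle reproducible for demonstration
shuffled = df.sample(frac=1, random_state=42)

# All rows are still present, only reordered
print(sorted(shuffled['PassengerId']))  # → [1, 2, 3, 4, 5]
```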

train_data = train_data.sample(frac=1)
train_data.head()
     PassengerId  Survived  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket  Fare     Cabin  Embarked
696  697          0         3       Kelly, Mr. James                              male    44.0  0      0      363592  8.0500   NaN    S
467  468          0         1       Smart, Mr. John Montgomery                    male    56.0  0      0      113792  26.5500  NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1      0      113803  53.1000  C123   S
388  389          0         3       Sadlier, Mr. Matthew                          male    NaN   0      0      367655  7.7292   NaN    Q
303  304          1         2       Keane, Miss. Nora A                           female  NaN   0      0      226593  12.3500  E101   Q

Inspecting Missing Values and Preprocessing

Concatenate the training and test sets to inspect the combined data:

all_data = pd.concat([train_data, test_data], ignore_index=True)
all_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

From this output we can see:

  • There are 1309 rows in total

  • Survived is missing for 418 rows (these are the test-set labels that the network has to predict)

  • Age, Fare, Cabin, and Embarked all have missing values

    Preprocessing strategy

    • Cabin is missing for most rows, so I drop this column entirely.

    • Age is filled with its mean; Embarked is filled with its mode.

    • Fare is missing for only one row. Fare is clearly related to ticket class, so I fill the missing value with the mean fare of the same Pclass, computed with a pivot table:

      pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)

      See the pandas documentation for the full details of pivot_table.

    Only three of its parameters are needed to compute the per-class means:

    • index is the grouping attribute and becomes the row labels of the pivot table
    • values is the attribute to aggregate and becomes the table's values
    • aggfunc is the aggregation method; here, the mean
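The per-class mean idea can be sketched on a hypothetical five-row fare table (values are made up for illustration):

```python
import pandas as pd

# Hypothetical fares for the three ticket classes
df = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3],
                   'Fare':   [80.0, 60.0, 20.0, 8.0, 12.0]})

# index groups the rows, values selects the column to aggregate,
# aggfunc picks the aggregation method (the mean here)
table = df.pivot_table(index='Pclass', values='Fare', aggfunc='mean')

# Looking up a class returns its mean fare
print(table.loc[3, 'Fare'])  # → 10.0
```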

Preprocessing

# Fill missing ages with the mean age
all_data['Age'] = all_data['Age'].fillna(all_data['Age'].mean())
# Fill missing embarkation ports with the mode (mode() returns a Series, so take its first element)
all_data['Embarked'] = all_data['Embarked'].fillna(all_data['Embarked'].mode()[0])
# Encode sex as 1 (male) / 0 (female)
all_data['Sex'] = [1 if x == 'male' else 0 for x in all_data.Sex]
# Reduce SibSp and Parch to presence/absence flags to limit sparsity
all_data['SibSp'] = [0 if x == 0 else 1 for x in all_data.SibSp]
all_data['Parch'] = [0 if x == 0 else 1 for x in all_data.Parch]
# Fill the missing fare with the mean fare of the same ticket class
class_fare = all_data.pivot_table(index='Pclass', values='Fare', aggfunc='mean')
missing_fare = all_data['Fare'].isnull()
all_data.loc[missing_fare, 'Fare'] = class_fare.loc[all_data.loc[missing_fare, 'Pclass'], 'Fare'].values
# One-hot encode ticket class and embarkation port
all_data['p1'] = np.array(all_data['Pclass'] == 1).astype(np.int32)
all_data['p2'] = np.array(all_data['Pclass'] == 2).astype(np.int32)
all_data['p3'] = np.array(all_data['Pclass'] == 3).astype(np.int32)
all_data['e1'] = np.array(all_data['Embarked'] == 'S').astype(np.int32)
all_data['e2'] = np.array(all_data['Embarked'] == 'C').astype(np.int32)
all_data['e3'] = np.array(all_data['Embarked'] == 'Q').astype(np.int32)
# Min-max normalize fare and age into [0, 1]
all_data['Age'] = (all_data['Age'] - all_data['Age'].min()) / (all_data['Age'].max() - all_data['Age'].min())
all_data['Fare'] = (all_data['Fare'] - all_data['Fare'].min()) / (all_data['Fare'].max() - all_data['Fare'].min())
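A quick sanity check, sketched here on a hypothetical three-row frame, confirms the mean/mode filling leaves no gaps (note that mode() returns a Series, so its first element is used):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Embarked
df = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                   'Embarked': ['S', 'C', None]})

df['Age'] = df['Age'].fillna(df['Age'].mean())                    # mean fill
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # mode fill

# No missing values remain in either column
print(int(df.isnull().sum().sum()))  # → 0
```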

Feature Selection

I selected ticket class, embarkation port, sex, fare, and age. In my tests, SibSp and Parch had little effect on the result, so they were left out.

all_data_selected = all_data[['p1', 'p2', 'p3', 'e1', 'e2', 'e3', 'Sex', 'Fare', 'Age']].values  # 'SibSp', 'Parch' excluded
train_data = all_data_selected[:len(train_data)]
test_data = all_data_selected[len(train_data):]
train_label1 = all_data[['Survived']].values
train_lable = train_label1[:len(train_data)].reshape(-1, 1)

train_data[0:5,:]
array([[0.        , 1.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.02537431, 0.22334962],
       [0.        , 0.        , 1.        , 0.        , 1.        ,
        0.        , 1.        , 0.01411046, 0.210823  ],
       [0.        , 1.        , 0.        , 1.        , 0.        ,
        0.        , 1.        , 0.02732618, 0.6743079 ],
       [0.        , 0.        , 1.        , 1.        , 0.        ,
        0.        , 1.        , 0.01571255, 0.37366905],
       [0.        , 0.        , 1.        , 1.        , 0.        ,
        0.        , 1.        , 0.01571255, 0.26092948]])
train_x = tf.cast(train_data, tf.float32)
train_y = tf.cast(train_lable, tf.int16)
test_x = tf.cast(test_data, tf.float32)
print(train_x.shape, train_y.shape)
(891, 9) (891, 1)

Model Configuration and Training

Model: fully connected neural network

  • First hidden layer: 16 nodes, ReLU activation
  • Second hidden layer: 8 nodes, ReLU activation
  • Output layer: sigmoid activation, outputting a survival probability in (0, 1)

Training: mini-batch training

  • 32 rows per mini-batch
  • 60 epochs
  • 20% of the training set split off as validation data
model = tf.keras.Sequential()
try:
    # Reuse a previously trained model if one has been saved
    model = tf.keras.models.load_model("Titanic.h5")
    model.summary()
except:
    # Hidden layer: fully connected, 16 nodes, ReLU activation, L2 regularization
    model.add(tf.keras.layers.Dense(16, activation="relu", kernel_regularizer=tf.keras.regularizers.l2()))
    model.add(tf.keras.layers.Dense(8, activation="relu"))
    # Output layer: fully connected, 1 node, sigmoid activation
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

    # Configure training: Adam optimizer, binary cross-entropy loss,
    # binary accuracy as the metric
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.07),
                  loss='binary_crossentropy',
                  metrics=['binary_accuracy'])

    # Train on the training set, holding out 20% as validation data
    # evaluated after each epoch; mini-batches of 32 rows, 60 epochs
    history = model.fit(train_x, train_y,
                        batch_size=32,
                        epochs=60,
                        validation_split=0.2)
    model.summary()
    model.save("Titanic.h5", overwrite=True, save_format=None)
Epoch 1/60
23/23 [==============================] - 2s 46ms/step - loss: 0.6572 - binary_accuracy: 0.6516 - val_loss: 0.5127 - val_binary_accuracy: 0.8045
Epoch 2/60
23/23 [==============================] - 0s 7ms/step - loss: 0.5755 - binary_accuracy: 0.7653 - val_loss: 0.5096 - val_binary_accuracy: 0.8101
Epoch 3/60
23/23 [==============================] - 0s 6ms/step - loss: 0.5329 - binary_accuracy: 0.7950 - val_loss: 0.4760 - val_binary_accuracy: 0.8101
Epoch 4/60
23/23 [==============================] - 0s 7ms/step - loss: 0.4905 - binary_accuracy: 0.7873 - val_loss: 0.5239 - val_binary_accuracy: 0.8045
Epoch 5/60
23/23 [==============================] - 0s 7ms/step - loss: 0.4977 - binary_accuracy: 0.8053 - val_loss: 0.4804 - val_binary_accuracy: 0.8101
..........

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                160       
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 9         
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
model.evaluate(train_x,train_y)
28/28 [==============================] - 0s 3ms/step - loss: 0.4559 - binary_accuracy: 0.8092



[0.45588964223861694, 0.8092031478881836]

Visualization

# accuracy history
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('model accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend(['train', 'validation'], loc='upper left')
plt.ylim((0, 1))
plt.show()
 
# loss history
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['train', 'validation'], loc='upper left')
plt.ylim((0, 1))
plt.show()

[Figures: model accuracy and model loss curves]

Evaluating on the Test Set

gender_submission.csv is a set of predictions that assumes all, and only, female passengers survived; Kaggle provides it as an example of what a submission file should look like.

So the accuracy measured against it is not a true evaluation of the test predictions.

test_y = pd.read_csv('gender_submission.csv')
test_y = np.array(test_y['Survived'])
model.evaluate(test_x,test_y)
14/14 [==============================] - 0s 3ms/step - loss: 0.2923 - binary_accuracy: 0.8900

[0.2923037111759186, 0.8899521827697754]

Saving the Predictions

  • Save the predicted survival outcome

    Using the where function, passengers with survival probability greater than 0.5 are marked as survivors.

  • Save the predicted survival probability
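The thresholding step can be sketched with NumPy on a hypothetical batch of predicted probabilities (np.where follows the same condition/value semantics as the tf.where call used here):

```python
import numpy as np

# Hypothetical probabilities as returned by model.predict: shape (n, 1)
probs = np.array([[0.10], [0.70], [0.50], [0.93]])

# where(cond, a, b) selects a where the condition holds and b elsewhere;
# note 0.5 itself is not strictly greater than 0.5, so it maps to 0
labels = np.where(probs > 0.5, 1, 0).reshape(-1)

print(labels.tolist())  # → [0, 1, 0, 1]
```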

pred = model.predict(test_x)
result = pd.DataFrame({'PassengerId': np.arange(892, 892 + 418),
                       'Survived': tf.where(pred > 0.5, 1, 0).numpy().reshape(-1)})
result.to_csv("titanic_survived_predictions.csv", index=False)
result = pd.DataFrame({'PassengerId': np.arange(892, 892 + 418),
                       'Survived': pred.reshape(-1)})
result.to_csv("titanic_psurvived_predictions.csv", index=False)

Final Kaggle score after uploading the submission:

[Figure: Kaggle submission score]
