TensorFlow Titanic Survival Probability Prediction
Problem Overview
The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this problem, we need to build a predictive model that uses the passenger attributes to predict which people were more likely to survive.
The Dataset
What Data Will I Use in This Competition?
In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv
and the other is titled test.csv
.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The test.csv
dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
| File | Dataset | Rows |
|---|---|---|
| train.csv | training set | 891 |
| test.csv | test set | 418 |
Dataset download: https://www.kaggle.com/c/titanic/data
Importing Libraries and Loading the Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(len(train_data), len(test_data))
train_data.head()
891 418
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
We can see that the training set has 891 rows and the test set (the data we will predict on) has 418.
The meaning of each attribute is given in the table below.
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard | |
| parch | # of parents/children aboard | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Shuffling the Training Data
Shuffle the dataset to reduce correlation between adjacent rows and improve training accuracy.
This uses the sample function:
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Here I only need the frac parameter, the fraction of rows to return; setting it to 1 shuffles the entire set.
train_data = train_data.sample(frac=1)
train_data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 696 | 697 | 0 | 3 | Kelly, Mr. James | male | 44.0 | 0 | 0 | 363592 | 8.0500 | NaN | S |
| 467 | 468 | 0 | 1 | Smart, Mr. John Montgomery | male | 56.0 | 0 | 0 | 113792 | 26.5500 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 388 | 389 | 0 | 3 | Sadlier, Mr. Matthew | male | NaN | 0 | 0 | 367655 | 7.7292 | NaN | Q |
| 303 | 304 | 1 | 2 | Keane, Miss. Nora A | female | NaN | 0 | 0 | 226593 | 12.3500 | E101 | Q |
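On a toy frame the effect of `sample(frac=1)` is easy to check; `random_state` (an extra parameter, not used above) makes the shuffle reproducible, and `reset_index(drop=True)` discards the old row labels, which `sample` otherwise keeps:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# frac=1 returns every row, in random order; random_state fixes the permutation
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(sorted(shuffled['a'].tolist()))  # [1, 2, 3, 4, 5] - same values, new order
print(list(shuffled.index))            # [0, 1, 2, 3, 4] - index rebuilt from 0
```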
Inspecting Missing Data and Preprocessing
Concatenate the training and test sets to inspect the combined data:
all_data = pd.concat([train_data, test_data], ignore_index=True)
all_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Survived 891 non-null float64
2 Pclass 1309 non-null int64
3 Name 1309 non-null object
4 Sex 1309 non-null object
5 Age 1046 non-null float64
6 SibSp 1309 non-null int64
7 Parch 1309 non-null int64
8 Ticket 1309 non-null object
9 Fare 1308 non-null float64
10 Cabin 295 non-null object
11 Embarked 1307 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
From this output we can see:
- There are 1309 rows in total.
- Survived is missing for 418 rows (these are the test set's labels, which the network will predict).
- Age, Fare, Cabin, and Embarked all have missing values.
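These counts can also be read off directly with `isnull().sum()`; a minimal sketch on a toy frame standing in for `all_data`:

```python
import numpy as np
import pandas as pd

# toy stand-in for all_data, with a few missing values
df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
})

# per-column count of missing entries
missing = df.isnull().sum()
print(missing['Age'])    # 1
print(missing['Cabin'])  # 2
```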
Preprocessing Plan
- Cabin is missing for most rows, so I drop it entirely.
- Fill Age with its mean and Embarked with its mode (the mean is undefined for a categorical column).
- Fare is missing for only one row. Fare is clearly strongly related to ticket class, so I fill the gap with the mean fare of the same Pclass, computed with a pivot table:
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)
Only three of these parameters are needed to get the per-class means:
- index: the grouping column, used as the row labels of the returned table
- values: the column to aggregate, forming the table's values
- aggfunc: the aggregation method; here, the mean
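A minimal illustration of those three parameters, computing the mean fare per ticket class on a toy frame (column names chosen to match the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Fare':   [80.0, 60.0, 8.0, 10.0],
})

# index -> row labels, values -> column to aggregate, aggfunc -> how to aggregate
means = df.pivot_table(index='Pclass', values='Fare', aggfunc='mean')

print(means.loc[1, 'Fare'])  # 70.0
print(means.loc[3, 'Fare'])  # 9.0
```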
Preprocessing
# Fill missing ages with the mean age
all_data['Age'] = all_data['Age'].fillna(all_data['Age'].mean())
# Fill missing embarkation ports with the mode (mode() returns a Series, so take element 0)
all_data['Embarked'] = all_data['Embarked'].fillna(all_data['Embarked'].mode()[0])
# Encode sex as 1 (male) / 0 (female)
all_data['Sex'] = [1 if x == 'male' else 0 for x in all_data.Sex]
# Reduce SibSp and Parch to presence/absence to limit sparsity
all_data['SibSp'] = [0 if x == 0 else 1 for x in all_data.SibSp]
all_data['Parch'] = [0 if x == 0 else 1 for x in all_data.Parch]
# Fill the missing fare with the mean fare of the same ticket class
fare_by_class = all_data.pivot_table(index='Pclass', values='Fare', aggfunc='mean')['Fare']
all_data['Fare'] = all_data['Fare'].fillna(all_data['Pclass'].map(fare_by_class))
# One-hot encode ticket class and embarkation port
all_data['p1'] = np.array(all_data['Pclass'] == 1).astype(np.int32)
all_data['p2'] = np.array(all_data['Pclass'] == 2).astype(np.int32)
all_data['p3'] = np.array(all_data['Pclass'] == 3).astype(np.int32)
all_data['e1'] = np.array(all_data['Embarked'] == 'S').astype(np.int32)
all_data['e2'] = np.array(all_data['Embarked'] == 'C').astype(np.int32)
all_data['e3'] = np.array(all_data['Embarked'] == 'Q').astype(np.int32)
# Min-max normalize fare and age to [0, 1]
all_data['Age'] = (all_data['Age'] - all_data['Age'].min()) / (all_data['Age'].max() - all_data['Age'].min())
all_data['Fare'] = (all_data['Fare'] - all_data['Fare'].min()) / (all_data['Fare'].max() - all_data['Fare'].min())
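As an aside, the six manual comparisons above could also be produced with `pd.get_dummies`; a sketch on toy data, using an `e`-style prefix to match the column naming above:

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One column per category (sorted alphabetically), 1 where the row matches.
# Recent pandas returns bools, so cast to int for 0/1 columns.
onehot = pd.get_dummies(df['Embarked'], prefix='e').astype(int)

print(onehot.columns.tolist())  # ['e_C', 'e_Q', 'e_S']
print(onehot['e_S'].tolist())   # [1, 0, 0, 1]
```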
Feature Selection
I selected ticket class, embarkation port, sex, fare, and age. In testing, SibSp and Parch had little effect on the result, so they are left out.
all_data_selected = all_data[['p1', 'p2', 'p3', 'e1', 'e2', 'e3', 'Sex', 'Fare', 'Age']].values  # 'SibSp', 'Parch' excluded
train_count = len(train_data)
train_data = all_data_selected[:train_count]
test_data = all_data_selected[train_count:]
train_label = all_data[['Survived']].values[:train_count].reshape(-1, 1)
train_data[0:5, :]
array([[0.        , 1.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.02537431, 0.22334962],
       [0.        , 0.        , 1.        , 0.        , 1.        ,
        0.        , 1.        , 0.01411046, 0.210823  ],
       [0.        , 1.        , 0.        , 1.        , 0.        ,
        0.        , 1.        , 0.02732618, 0.6743079 ],
       [0.        , 0.        , 1.        , 1.        , 0.        ,
        0.        , 1.        , 0.01571255, 0.37366905],
       [0.        , 0.        , 1.        , 1.        , 0.        ,
        0.        , 1.        , 0.01571255, 0.26092948]])
train_x = tf.cast(train_data, tf.float32)
train_y = tf.cast(train_label, tf.int16)
test_x = tf.cast(test_data, tf.float32)
print(train_x.shape, train_y.shape)
(891, 9) (891, 1)
Model Configuration and Training
Model: fully connected neural network
- First hidden layer: 16 nodes, ReLU activation
- Second hidden layer: 8 nodes, ReLU activation
- Output layer: sigmoid activation, producing a survival probability in (0, 1)
Training: mini-batch training
- Batch size 32
- 60 epochs
- 20% of the training set held out as validation data
model = tf.keras.Sequential()
try:
    # Reuse a previously trained model if one has been saved
    model = tf.keras.models.load_model("Titanic.h5")
    model.summary()
except:
    # Hidden layer 1: fully connected, 16 nodes, ReLU activation, L2 regularization
    model.add(tf.keras.layers.Dense(16, activation="relu", kernel_regularizer=tf.keras.regularizers.l2()))
    # Hidden layer 2: fully connected, 8 nodes, ReLU activation
    model.add(tf.keras.layers.Dense(8, activation="relu"))
    # Output layer: fully connected, sigmoid activation
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Configure training: Adam optimizer, binary cross-entropy loss, binary accuracy metric
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.07),
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])
# Train on the training set, holding out 20% as validation data to evaluate the
# model after each epoch; mini-batches of 32, 60 epochs
history = model.fit(train_x, train_y,
                    batch_size=32,
                    epochs=60,
                    validation_split=0.2)
model.summary()
model.save("Titanic.h5", overwrite=True, save_format=None)
Epoch 1/60
23/23 [==============================] - 2s 46ms/step - loss: 0.6572 - binary_accuracy: 0.6516 - val_loss: 0.5127 - val_binary_accuracy: 0.8045
Epoch 2/60
23/23 [==============================] - 0s 7ms/step - loss: 0.5755 - binary_accuracy: 0.7653 - val_loss: 0.5096 - val_binary_accuracy: 0.8101
Epoch 3/60
23/23 [==============================] - 0s 6ms/step - loss: 0.5329 - binary_accuracy: 0.7950 - val_loss: 0.4760 - val_binary_accuracy: 0.8101
Epoch 4/60
23/23 [==============================] - 0s 7ms/step - loss: 0.4905 - binary_accuracy: 0.7873 - val_loss: 0.5239 - val_binary_accuracy: 0.8045
Epoch 5/60
23/23 [==============================] - 0s 7ms/step - loss: 0.4977 - binary_accuracy: 0.8053 - val_loss: 0.4804 - val_binary_accuracy: 0.8101
..........
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 16) 160
_________________________________________________________________
dense_1 (Dense) (None, 8) 136
_________________________________________________________________
dense_2 (Dense) (None, 1) 9
=================================================================
Total params: 305
Trainable params: 305
Non-trainable params: 0
_________________________________________________________________
model.evaluate(train_x,train_y)
28/28 [==============================] - 0s 3ms/step - loss: 0.4559 - binary_accuracy: 0.8092
[0.45588964223861694, 0.8092031478881836]
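Overall accuracy hides how the errors split between false positives and false negatives. A small NumPy sketch of a 2x2 confusion matrix, on toy labels; in the notebook, `y_true` and `y_prob` would come from `train_y` and `model.predict(train_x)`:

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows: true class 0/1; columns: predicted class 0/1."""
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

# toy stand-ins for the true labels and thresholded predicted probabilities
y_true = np.array([0, 0, 1, 1, 1])
y_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.8])
y_pred = (y_prob > 0.5).astype(int)

print(confusion_matrix_2x2(y_true, y_pred))
# [[1 1]
#  [1 2]]
```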
Visualization
# accuracy history
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('model accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend(['train', 'validation'], loc='upper left')
plt.ylim((0, 1))
plt.show()
# loss history
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['train', 'validation'], loc='upper left')
plt.ylim((0, 1))
plt.show()
Evaluating on the Test Set
gender_submission.csv is a set of predictions that assumes all and only female passengers survived, provided by Kaggle as an example of what a submission file should look like.
Evaluating against it is therefore only a rough sanity check, not a true accuracy measure.
test_y = pd.read_csv('gender_submission.csv')
test_y = np.array(test_y['Survived'])
model.evaluate(test_x,test_y)
14/14 [==============================] - 0s 3ms/step - loss: 0.2923 - binary_accuracy: 0.8900
[0.2923037111759186, 0.8899521827697754]
Saving the Results
- Save the predicted survival outcomes: using tf.where, anyone with a predicted survival probability above 0.5 is marked as a survivor.
- Save the raw predicted survival probabilities.
pred = model.predict(test_x)
result = pd.DataFrame({'PassengerId':np.arange(892,892+418), 'Survived':tf.where(pred>0.5,1,0).numpy().reshape(-1)})
result.to_csv("titanic_survived_predictions.csv", index=False)
result = pd.DataFrame({'PassengerId':np.arange(892,892+418), 'Survived':pred.reshape(-1)})
result.to_csv("titanic_psurvived_predictions.csv", index=False)