The Titanic sank on April 15, 1912; of the 2,224 people aboard, 1,502 died. This article first walks through the data preprocessing, then uses an MLP to predict each passenger's probability of survival.
1 Download the Dataset
import urllib.request
import os
url="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls"
filepath="data/titanic3.xls"
os.makedirs("data", exist_ok=True)  # make sure the data folder exists
if not os.path.isfile(filepath):  # download only if not already present
    result=urllib.request.urlretrieve(url,filepath)
    print('downloaded:',result)
Output
downloaded: ('data/titanic3.xls', <http.client.HTTPMessage object at 0x7f7bf0b8d198>)
After running this, titanic3.xls is in the data folder.
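If the Vanderbilt URL is ever slow or unreachable, a small guard keeps the script usable; this is a minimal sketch, and you would supply your own mirror or local copy of titanic3.xls:

import urllib.request, urllib.error
import os
def fetch_titanic(url, filepath):
    # reuse a local copy when one already exists
    if os.path.isfile(filepath):
        return filepath
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    try:
        urllib.request.urlretrieve(url, filepath)
    except urllib.error.URLError as e:
        # the original host may be down; fall back to a mirror or a manual copy
        raise RuntimeError('download failed: ' + str(e)) from e
    return filepath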
2 Read the Data with a Pandas DataFrame
import numpy
import pandas as pd
# read the Excel file
all_df = pd.read_excel(filepath)
# look at the first two rows
all_df[:2]
Output
- survived: 0 = died, 1 = survived
- pclass: 1 = first class, 2 = second class, 3 = third class
- sibsp: number of siblings or spouses also aboard
- parch: number of parents or children aboard
- ticket: ticket number
- fare: passenger fare
- cabin: cabin number
- embarked: port of embarkation, one of C, Q, S
survived is the label.
Select the useful fields into a DataFrame. For example, ticket and cabin have little bearing on the prediction, so they can be dropped:
cols=['survived','name','pclass' ,'sex', 'age', 'sibsp',
'parch', 'fare', 'embarked']
all_df=all_df[cols]
all_df[:2]
Count the null values (cells left empty in the Excel sheet):
all_df.isnull().sum()
Output
survived 0
name 0
pclass 0
sex 0
age 263
sibsp 0
parch 0
fare 1
embarked 2
dtype: int64
Preprocessing plan:
- name: not needed for training, drop it
- age: 263 entries are null, replace them with the mean
- fare: one entry is null, replace it with the mean
- sex: text values male and female, convert to 0 and 1
- embarked: C, Q, or S, needs One-Hot Encoding
3 Preprocess the Data with Pandas DataFrame
3.1 Drop name
df=all_df.drop(['name'], axis=1)
df[:2]
Output
3.2 Replace the nulls in age and fare with the mean
# fillna fills nulls with the given value
age_mean = df['age'].mean()
df['age'] = df['age'].fillna(age_mean)
fare_mean = df['fare'].mean()
df['fare'] = df['fare'].fillna(fare_mean)
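Using the global mean is the book's choice. As an alternative, one could impute age with the median within each pclass, since typical ages differ across classes; a hedged sketch:

# sketch: per-class median imputation instead of the global mean
age_by_class = df.groupby('pclass')['age'].transform('median')
df['age'] = df['age'].fillna(age_by_class)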
3.3 Convert sex to 0 and 1
df['sex']= df['sex'].map({'female':0, 'male': 1}).astype(int)
df[:2]
3.4 One-Hot Encode the embarked field
Use the pd.get_dummies() method:
x_OneHot_df = pd.get_dummies(data=df,columns=["embarked" ])
x_OneHot_df[:2]
Output
The original embarked column becomes three columns: embarked_C, embarked_Q, and embarked_S.
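To see what get_dummies does in isolation, here is a minimal sketch on a toy column (the toy data is made up for illustration):

# sketch: one-hot encoding a tiny example column
toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
pd.get_dummies(toy, columns=['embarked'])
# each distinct port becomes its own indicator column
# (0/1 integers or booleans, depending on the pandas version)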
4 Convert the DataFrame to an Array
ndarray = x_OneHot_df.values
ndarray.shape
Output
(1309, 10)
4.1 Extract the features and the label
Label = ndarray[:,0]
Features = ndarray[:,1:]
The first column is the label; the second through last columns are the features. Take a look at both:
Label[:2]
Output
array([ 1., 1.])
Features[:2]
Output
array([[ 1. , 0. , 29. , 0. , 0. , 211.3375,
0. , 0. , 1. ],
[ 1. , 1. , 0.9167, 1. , 2. , 151.55 ,
0. , 0. , 1. ]])
The feature scales differ widely, which makes learning harder, so we normalize them to the range 0-1.
4.2 Normalize the ndarray feature columns
from sklearn import preprocessing
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaledFeatures=minmax_scale.fit_transform(Features)
scaledFeatures[:2]
Output
array([[0. , 0. , 0.36116884, 0. , 0. ,
0.41250333, 0. , 0. , 1. ],
[0. , 1. , 0.00939458, 0.125 , 0.22222222,
0.2958059 , 0. , 0. , 1. ]])
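MinMaxScaler applies the min-max formula column by column: scaled = (x - min) / (max - min). A quick sketch that reproduces the scaled age values by hand (age is column index 2 in Features):

# sketch: reproduce MinMaxScaler by hand for the age column
age_col = Features[:, 2]
age_scaled = (age_col - age_col.min()) / (age_col.max() - age_col.min())
age_scaled[:2]  # should match scaledFeatures[:2, 2]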
4.3 Split the data into training and test sets
Split roughly 8:2 with a random mask:
msk = numpy.random.rand(len(all_df)) < 0.8
train_df = all_df[msk]
test_df = all_df[~msk]
print('total:',len(all_df),
'train:',len(train_df),
'test:',len(test_df))
Output
total: 1309 train: 1075 test: 234
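numpy.random.rand draws one uniform value per row, so each row lands in the training set with probability 0.8 and the exact split sizes vary between runs. To make the split reproducible, seed the generator first; a minimal sketch (42 is an arbitrary example seed):

numpy.random.seed(42)  # fix the RNG so the mask is the same every run
msk = numpy.random.rand(len(all_df)) < 0.8
train_df = all_df[msk]
test_df = all_df[~msk]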
4.4 Wrap the above steps into a function
def PreprocessData(raw_df):
    # drop name
    df=raw_df.drop(['name'], axis=1)
    # replace nulls in age with the mean
    age_mean = df['age'].mean()
    df['age'] = df['age'].fillna(age_mean)
    # replace nulls in fare with the mean
    fare_mean = df['fare'].mean()
    df['fare'] = df['fare'].fillna(fare_mean)
    # convert sex to 0 and 1
    df['sex']= df['sex'].map({'female':0, 'male': 1}).astype(int)
    # one-hot encode embarked
    x_OneHot_df = pd.get_dummies(data=df,columns=["embarked"])
    # convert the DataFrame to an array
    ndarray = x_OneHot_df.values
    # the first column is the label
    Label = ndarray[:,0]
    # the remaining columns are the features
    Features = ndarray[:,1:]
    # normalize the features to the range 0-1
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
    scaledFeatures=minmax_scale.fit_transform(Features)
    return scaledFeatures,Label
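One caveat: PreprocessData fits a fresh MinMaxScaler (and recomputes the means) on whatever split it receives, so the test set is scaled with its own statistics rather than the training set's. A common alternative, sketched below, fits the scaler on the training features only and reuses it for the test features; train_raw_features and test_raw_features are hypothetical names for the unscaled feature arrays:

# sketch: fit the scaler on the training split only, then reuse it
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train_raw_features)  # fit and transform on train
test_scaled = scaler.transform(test_raw_features)        # transform only on test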
Call it:
# read the Excel file
all_df = pd.read_excel(filepath)
# keep the key fields
cols=['survived','name','pclass' ,'sex', 'age', 'sibsp',
'parch', 'fare', 'embarked']
all_df=all_df[cols]
# 8:2 train/test split
msk = numpy.random.rand(len(all_df)) < 0.8
train_df = all_df[msk]
test_df = all_df[~msk]
print('total:',len(all_df),
'train:',len(train_df),
'test:',len(test_df))
# run the preprocessing function
train_Features,train_Label=PreprocessData(train_df)
test_Features,test_Label=PreprocessData(test_df)
Check the result:
train_Features[:2]
Output
array([[0. , 0. , 0.35714259, 0. , 0. ,
0.41250333, 0. , 0. , 1. ],
[0. , 1. , 0.00315126, 0.125 , 0.22222222,
0.2958059 , 0. , 0. , 1. ]])
train_Label[:2]
Output
array([1., 1.])
5 Build the Model
With preprocessing done, build an MLP to predict the outcome.
from keras.models import Sequential
from keras.layers import Dense,Dropout
model = Sequential()
model.add(Dense(units=40, input_dim=9,
kernel_initializer='uniform',
activation='relu'))
model.add(Dense(units=30,
kernel_initializer='uniform',
activation='relu'))
model.add(Dense(units=1,
kernel_initializer='uniform',
activation='sigmoid'))
print(model.summary())
Output
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_4 (Dense) (None, 40) 400
_________________________________________________________________
dense_5 (Dense) (None, 30) 1230
_________________________________________________________________
dense_6 (Dense) (None, 1) 31
=================================================================
Total params: 1,661
Trainable params: 1,661
Non-trainable params: 0
_________________________________________________________________
None
Parameter counts (each Dense layer has input_dim × units weights plus units biases):
9*40+40 = 400
40*30+30 = 1230
30*1+1 = 31
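Note that Dropout is imported above but never used. If the model overfits, dropout between the Dense layers is a common remedy; a minimal sketch of the same MLP with dropout added (model_dropout is an illustrative name, and the rate 0.3 is an arbitrary example):

# sketch: the same MLP with dropout after each hidden layer
from keras.models import Sequential
from keras.layers import Dense,Dropout
model_dropout = Sequential()
model_dropout.add(Dense(units=40, input_dim=9,
                        kernel_initializer='uniform', activation='relu'))
model_dropout.add(Dropout(0.3))  # randomly zeroes 30% of activations during training
model_dropout.add(Dense(units=30,
                        kernel_initializer='uniform', activation='relu'))
model_dropout.add(Dropout(0.3))
model_dropout.add(Dense(units=1,
                        kernel_initializer='uniform', activation='sigmoid'))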
6 Train and Evaluate
6.1 Train the model
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
train_history =model.fit(x=train_Features,
y=train_Label,
validation_split=0.1,
epochs=30,
batch_size=30,verbose=2)
For an explanation of these parameters, see 【Keras-MLP】MNIST.
The input to fit has shape (1034, 9) and the output shape (1034, 1), matching the 930 training + 104 validation samples shown below.
Output
Train on 930 samples, validate on 104 samples
Epoch 1/30
- 9s - loss: 0.6898 - acc: 0.5871 - val_loss: 0.6711 - val_acc: 0.7885
………………
………………
………………
Epoch 24/30
- 0s - loss: 0.4501 - acc: 0.7882 - val_loss: 0.4188 - val_acc: 0.8365
Epoch 25/30
- 0s - loss: 0.4493 - acc: 0.7892 - val_loss: 0.4203 - val_acc: 0.7981
Epoch 26/30
- 0s - loss: 0.4492 - acc: 0.7935 - val_loss: 0.4202 - val_acc: 0.8173
Epoch 27/30
- 0s - loss: 0.4498 - acc: 0.7946 - val_loss: 0.4197 - val_acc: 0.8173
Epoch 28/30
- 0s - loss: 0.4484 - acc: 0.7957 - val_loss: 0.4197 - val_acc: 0.8173
Epoch 29/30
- 0s - loss: 0.4476 - acc: 0.7989 - val_loss: 0.4199 - val_acc: 0.8173
Epoch 30/30
- 0s - loss: 0.4471 - acc: 0.7968 - val_loss: 0.4191 - val_acc: 0.8173
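The validation metrics plateau after roughly epoch 25, so the last few epochs add little. An EarlyStopping callback can end training automatically once val_loss stops improving; a minimal sketch (patience=5 is an arbitrary example):

# sketch: stop training when val_loss has not improved for 5 epochs
from keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=5)
train_history = model.fit(x=train_Features, y=train_Label,
                          validation_split=0.1, epochs=30,
                          batch_size=30, verbose=2,
                          callbacks=[early_stop])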
Visualize the training history:
import matplotlib.pyplot as plt
def show_train_history(train_history,train,validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
Call it:
show_train_history(train_history,'acc','val_acc')
Output
Now the loss:
show_train_history(train_history,'loss','val_loss')
Output
6.2 Evaluate model accuracy
scores = model.evaluate(x=test_Features,
y=test_Label)
Output
275/275 [==============================] - 0s 95us/step
Check the accuracy:
scores[1]
Output
0.8000000004334883
Note: scores[0] is the loss.
7 Add Jack and Rose's Data
7.1 Preprocessing
They are fictional characters, so we make some assumptions:
- Jack: third class, male, age 23, fare 5
- Rose: first class, female, age 20, fare 100
The column order is 'survived', 'name', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'.
Jack = pd.Series([0 ,'Jack',3, 'male' , 23, 1, 0, 5.0000,'S'])
Rose = pd.Series([1 ,'Rose',1, 'female', 20, 1, 0, 100.0000,'S'])
Create a DataFrame for them:
JR_df = pd.DataFrame([list(Jack),list(Rose)],
columns=['survived', 'name','pclass', 'sex',
'age', 'sibsp','parch', 'fare','embarked'])
Append them to the dataset:
all_df=pd.concat([all_df,JR_df])
Inspect the appended rows with all_df[-2:]:
survived name pclass sex age sibsp parch fare embarked
0 0 Jack 3 male 23.0 1 0 5.0 S
1 1 Rose 1 female 20.0 1 0 100.0 S
7.2 Predict
# preprocess
all_Features,Label=PreprocessData(all_df)
# predict
all_probability=model.predict(all_Features)
View part of the predictions:
all_probability[:10]
Output
array([[0.9759661 ],
[0.59928054],
[0.97479546],
[0.38840103],
[0.97282785],
[0.2548555 ],
[0.9483398 ],
[0.31095958],
[0.9491472 ],
[0.3147983 ]], dtype=float32)
These values are the predicted survival probabilities.
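If hard 0/1 predictions are needed instead of probabilities, threshold at 0.5; a minimal sketch:

# sketch: turn the probabilities into 0/1 survival predictions
predicted = (all_probability > 0.5).astype(int)
predicted[:10]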
Merge the probabilities back into the dataset:
result_df=all_df  # use a new name; reusing pd would shadow the pandas alias
result_df.insert(len(all_df.columns),
'probability',all_probability)
Look at Jack and Rose's survival probabilities:
result_df[-2:]
Output
8 A Moving Story from the Titanic
Look for passengers with a high predicted survival probability who nonetheless died:
result_df[(result_df['survived']==0) & (result_df['probability']>0.9)]
Output
Showing the first five rows:
The real story: the Allison family had four members, a 35-year-old father, a 25-year-old mother, a two-year-old daughter Loraine, and a baby boy Trevor not yet one year old, returning to Montreal, Canada, together with a nurse, Alice Cleaver.
Under "women and children first", the mother could have boarded a lifeboat with her daughter and the baby, but the baby was nowhere to be found, so she refused to board and kept searching. In fact, the nurse had already carried the baby onto a lifeboat. In the end the whole family perished; only the nurse and the baby survived.
Going through the passengers with a high predicted survival rate who nonetheless died turns up many similarly moving stories. To learn more, read up on the history, or have a look at the book credited below.
Disclaimer
Disclaimer: the code comes from《TensorFlow+Keras深度学习人工智能实践应用》by 林大贵. Please credit the source when quoting or reposting. If the book interests you, buy a copy and read it!