Hands-On Case Study | Predicting Titanic Survivors (Data Preprocessing)

Today, let's get some practice with a Kaggle project: predicting which Titanic passengers survived. First, the project's basic description:

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Now let's look at the dataset description:

The data has been split into two groups:

training set (train.csv)

test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

Data Dictionary

Variable    Definition                                    Key
survival    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

02

Data Preprocessing: Feature Engineering

First, use pandas to look at the first 5 rows of the data. The Survived column is the label: 1 means the passenger survived, 0 means they did not.

[Figure: first 5 rows of the training set]

Call pandas' describe to see summary statistics for the training data:

[Figure: output of train.describe()]
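As a minimal, self-contained sketch of these two calls (using a tiny synthetic frame rather than the real train.csv, which isn't bundled here — the real training set has 891 rows and more columns):

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the Kaggle training data.
train = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1],
    'Age': [22.0, 38.0, np.nan, 35.0, 27.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
})

print(train.head())      # first 5 rows
print(train.describe())  # count / mean / std / min / quartiles / max per numeric column

# describe() counts only non-null values, which is exactly how missing
# Age entries reveal themselves (count < number of rows).
print(train['Age'].count())  # 4, because one Age is NaN
```

This is how the 714 non-null Age values versus 891 rows show up in the real data.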

We can see the training set has 891 rows in total, but the Age column has only 714 non-null values, meaning some entries are missing. So first we clean these values (modifying the Age column in both the training and test sets): fill each NaN with a random value drawn from the interval [mean - std, mean + std], then convert Age from float64 to int, as follows:

full_data = [train, test]

for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    # One random integer in [mean - std, mean + std) per missing entry
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    # "Cannot convert non-finite values (NA or inf) to integer" -- the NaN
    # fill above must happen before this cast
    dataset['Age'] = dataset['Age'].astype(int)

Feature engineering is extremely important in machine learning: the analysis of individual features plays a decisive role in the final prediction, so spend enough time studying them — think about how features relate to each other, whether some can be merged into a new feature, whether some can be dropped, and so on.

With the age-related processing done — age was analyzed first because of the project's nature and common knowledge ("women and children first", as the movie line goes) — we move on.

Next, look at the two family-related features: SibSp (siblings/spouses) and Parch (parents/children). From these we derive two features that may be more useful: FamilySize, the total number of family members aboard, and IsAlone, whether the passenger traveled alone.

for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

Besides age, sex, and family, survival likely also relates to social status — e.g. whether someone was nobility or traveled first class. So we extract a Title feature from each passenger's Name:

def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
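As a quick sanity check, here is how get_title behaves on names in the Titanic format (the helper is restated so this snippet is self-contained; the example names follow the dataset's "Surname, Title. Given names" convention):

```python
import re

def get_title(name):
    # Grab the first word followed by a period, e.g. "Mr." in "Braund, Mr. Owen Harris"
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

print(get_title('Braund, Mr. Owen Harris'))  # Mr
print(get_title('Heikkinen, Miss. Laina'))   # Miss
print(get_title('no title here'))            # empty string
```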

A few miscellaneous features round things out: the length of the name, whether the passenger had a cabin, and filling the missing values of Embarked. These are simple; add them or leave them out as you see fit.

train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)

# A missing Cabin is NaN, which is a float, so a float-typed entry means no cabin
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test['Cabin'].apply(lambda x: 0 if type(x) == float else 1)

for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

03

Data Preprocessing: Data Cleaning

To hand the problem to a machine learning algorithm, every column of the preprocessed data must be numeric. After the steps in section 02, the column dtypes look like this:

[Figure: column dtypes after feature engineering]

That won't do — the non-numeric columns need converting. Before converting, let's do some feature selection and drop the columns we no longer need: PassengerId, Name (already converted to Name_length), Ticket, Cabin (converted to Has_Cabin), and SibSp (folded into FamilySize):

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis=1)
test = test.drop(drop_elements, axis=1)

After dropping these columns, the data looks like this:

[Figure: remaining columns after dropping unused features]

Next, convert the object columns to numeric types:

for dataset in full_data:
    dataset['Sex'] = dataset['Sex'].map({'female': 0, 'male': 1}).astype(int)

    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

Then discretize the continuous Fare and Age columns into categorical bins:

for dataset in full_data:
    dataset.loc[dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[dataset['Age'] > 64, 'Age'] = 4
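The same binning can also be written with pandas' pd.cut, a common alternative to chained .loc assignments. A small sketch, with the bin edges copied from the thresholds above and tiny hypothetical input Series standing in for the real columns:

```python
import pandas as pd

# Fare: <=7.91 -> 0, (7.91, 14.454] -> 1, (14.454, 31] -> 2, >31 -> 3
fares = pd.Series([5.0, 10.0, 20.0, 100.0])
fare_bins = [-1, 7.91, 14.454, 31, float('inf')]
fare_codes = pd.cut(fares, bins=fare_bins, labels=[0, 1, 2, 3]).astype(int)
print(fare_codes.tolist())  # [0, 1, 2, 3]

# Age: <=16 -> 0, (16, 32] -> 1, (32, 48] -> 2, (48, 64] -> 3, >64 -> 4
ages = pd.Series([10, 20, 40, 60, 70])
age_bins = [-1, 16, 32, 48, 64, float('inf')]
age_codes = pd.cut(ages, bins=age_bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_codes.tolist())   # [0, 1, 2, 3, 4]
```

pd.cut uses right-closed intervals by default, so the edges line up exactly with the <= comparisons in the loop above.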

After cleaning, the first 5 rows look like this:

[Figure: first 5 rows after cleaning]

All columns are now numeric:

[Figure: column dtypes after cleaning — all numeric]

With feature extraction and data cleaning wrapped up, we move on to visualization.

04

Data Preprocessing: Visualization

Plot the Pearson correlations of all the features above, with the help of the seaborn library:

colormap = plt.cm.RdBu
plt.figure(figsize=(14, 12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
            square=True, cmap=colormap, linecolor='white', annot=True)
plt.show()

[Figure: Pearson correlation heatmap of the features]

This plot lets you compare the pairwise correlations between features: the closer a value is to 1 or -1, the stronger the correlation; the closer to 0, the weaker.

Next, draw a pair plot to see how the data is distributed from one feature to another:

[Figure: pair plot of the features]

That completes the data preprocessing for the Titanic survival prediction task. In the next post, we'll feed this data into machine learning algorithms to train a prediction model, see how it performs on the test set, and look at how to optimize it.
