Notes on the Kaggle Titanic Practice Competition
Tags: Kaggle
Original post: http://blog.csdn.net/xuelabizp/article/details/52886054
Reference: https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic/comments
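1. Introduction to the Titanic Practice Competition
Kaggle hosts several kinds of competitions, including Featured, Research, Playground, and 101. Featured and Research competitions offer prize money, while Playground and 101 competitions are for practice. After you register a new Kaggle account, the site suggests the Titanic practice competition as a starting point.
The task in the Titanic practice competition is to predict whether a passenger survived. The training set contains several features for each passenger, such as age and sex, together with the survival outcome. You train a model on the training set and then use it to predict the survival of the passengers in the test set. The original description reads: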
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
2. Feature Analysis and Selection
The passenger features in the training set are PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
Judging by common sense, PassengerId, Name, Ticket, and Embarked are irrelevant features, while Pclass, Sex, Age, SibSp, Parch, Fare, and Cabin are relevant ones.
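In principle, Cabin should be a very important feature: the accident happened at night, when most passengers were asleep, so the distance from a cabin to the escape routes would directly affect the chance of survival. However, this feature has too many missing values in the training set, so it is left out of the initial analysis.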
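The claim about missing Cabin values can be checked by computing the fraction of missing entries per column. The following is a minimal sketch, not the author's code: the helper name missing_ratio and the toy DataFrame are assumptions made for illustration, and in practice the frame would be loaded from the competition's training file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

def missing_ratio(df):
    """Return the fraction of missing values per column, highest first.

    Hypothetical helper: isnull() marks missing cells, and the mean of
    the resulting booleans is the share of missing entries per column.
    """
    return df.isnull().mean().sort_values(ascending=False)

ratios = missing_ratio(toy)
print(ratios)
```

A column whose ratio is close to 1 carries little usable information without imputation, which is the reason given above for setting Cabin aside at first.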
The remaining features are analyzed below in descending order of presumed relevance: Sex, Age, Fare, SibSp, Parch, Pclass.
2.1 Sex
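In the movie Titanic, the line "women and children first" leaves a deep impression, so let us look at the survival counts and survival rates of women and men.
The figure above shows the number and proportion of survivors for each sex. The survival rate of male passengers is below 20%, while that of female passengers is above 70%. The gap is large, so Sex is clearly a very important feature. In fact, predicting that every male in the test set died and every female survived already achieves 76.555% accuracy.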
import matplotlib.pyplot as plt

def sex_analysis(trainDf):
    fig = plt.figure(figsize=(8, 6))
    maleDf = trainDf[trainDf['Sex'] == 'male']
    femaleDf = trainDf[trainDf['Sex'] == 'female']
    # Boolean masks marking the survivors within each sex
    maleSurvived = maleDf['Survived'] == 1
    femaleSurvived = femaleDf['Survived'] == 1
    # Survival count of men and women
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.set_title('survival count of both sexes')
    ax1.set_xticks([0, 1])
    ax1.set_xticklabels(['male', 'female'])
    ax1.set_xlabel("Sex")
    ax1.set_ylabel("Survival count")
    ax1.grid()
    ax1.bar(0, maleSurvived.sum(), align="center", color='b', alpha=0.5)
    ax1.bar(1, femaleSurvived.sum(), align="center", color='r', alpha=0.5)
    # Survival rate of men and women
    ax2 = fig.add_subplot(1, 2, 2)
    ax2.set_title('survival rate of both sexes')
    ax2.set_xticks([0, 1])
    ax2.set_xticklabels(['male', 'female'])
    ax2.set_xlabel("Sex")
    ax2.set_ylabel("Survival rate")
    ax2.grid()
    ax2.bar(0, float(maleSurvived.sum()) / len(maleDf.index), align="center", color='b', alpha=0.5)
    ax2.bar(1, float(femaleSurvived.sum()) / len(femaleDf.index), align="center", color='r', alpha=0.5)
    plt.show()
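The same per-sex numbers can also be obtained without plotting, and the gender-only rule mentioned above (all men die, all women survive) is easy to express as a baseline. The sketch below is an illustration under stated assumptions, not the author's code: the function names survival_by_sex and gender_baseline and the toy data are hypothetical, and the accuracy here is measured on the toy frame, not on the Kaggle test set.

```python
import pandas as pd

# Toy stand-in for the training set (values are made up for illustration).
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Survived": [0, 1, 0, 1, 1],
})

def survival_by_sex(df):
    """Survivor count ('sum') and survival rate ('mean') for each sex."""
    return df.groupby("Sex")["Survived"].agg(["sum", "mean"])

def gender_baseline(df):
    """Predict 1 (survived) for every woman and 0 (died) for every man."""
    return (df["Sex"] == "female").astype(int)

stats = survival_by_sex(toy)
# Share of rows where the baseline prediction matches the true label.
accuracy = (gender_baseline(toy) == toy["Survived"]).mean()
print(stats)
print(accuracy)
```

Because Survived is coded as 0/1, its mean within a group is exactly the survival rate, which is why a single groupby covers both panels of the plot.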
2.2 Age
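In a disaster, children generally have a higher survival rate, because others give up their chance of escape to them.
The figure above shows the histogram of survivors' ages and the survival rate per age group, using 5-year intervals: [0,5), [5,10), …, [80,85). Children under 15 have a relatively high survival rate. Although the number of middle-aged survivors is high, their survival rate is comparatively low.
The survival rate in the [80,85) interval is 100%, but this is really an outlier: the interval contains only one passenger, so the rate is 100% if he survived and 0% if he died.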
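One way to make the small-sample issue visible is to report the number of passengers in each age interval next to the survival rate. The following sketch uses pandas' cut with left-closed 5-year bins to mirror the intervals described above. It is an assumption-laden illustration, not the author's method: the function name survival_by_age_bin, its parameters, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [2.0, 4.0, 31.0, 33.0, 34.0, 80.0, np.nan],
    "Survived": [1, 1, 0, 1, 0, 1, 0],
})

def survival_by_age_bin(df, width=5, max_age=85):
    """Passenger count and survival rate per half-open age interval."""
    known = df[df["Age"].notnull()]                  # drop rows with missing Age
    edges = list(range(0, max_age + width, width))   # 0, 5, ..., 85
    bins = pd.cut(known["Age"], bins=edges, right=False)  # [0,5), [5,10), ...
    grouped = known.groupby(bins, observed=True)["Survived"]
    return grouped.agg(count="count", rate="mean")

table = survival_by_age_bin(toy)
print(table)
```

A rate that comes from a bin with a count of 1, as in the [80,85) interval here, should be read as an artifact of sample size and not as evidence about that age group.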
In summary, age is a relevant feature, and a Child feature derived from age should be added to the dataset: for example, set Child to 1 when a passenger is younger than 15, and to 0 otherwise.
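The Child feature described above could be derived as in the sketch below. The helper name add_child_feature and the toy data are assumptions for illustration. Note one consequence of this particular sketch that the text does not specify: passengers with a missing age get Child = 0, because a comparison against a missing value evaluates to False.

```python
import numpy as np
import pandas as pd

def add_child_feature(df, threshold=15):
    """Return a copy of df with a Child column: 1 if Age < threshold, else 0.

    Hypothetical helper; rows with a missing Age are assigned 0 here,
    which is an assumption and not something stated in the text.
    """
    out = df.copy()  # leave the caller's DataFrame unchanged
    out["Child"] = (out["Age"] < threshold).astype(int)
    return out

# Toy ages (made up for illustration), including a missing value.
toy = pd.DataFrame({"Age": [4.0, 14.9, 15.0, 40.0, np.nan]})
with_child = add_child_feature(toy)
print(with_child)
```

If missing ages are later imputed, the Child column should be recomputed afterwards so that the two stay consistent.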
def age_analysis(trainDf):
    fig = plt.figure(figsize=(10, 6))
    # Age has missing values; keep only the rows where Age is present.
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
    ageDf = trainDf[trainDf['Age'].notnull()].copy()
    # Convert ages to integers, then select the passengers who survived
    ageDf['Age'] = ageDf['Age'].astype(int)
    survivedDf = ageDf[ageDf['Survived'] == 1]
    # Frequency histogram of survivors' ages
    survivedAge = []
    for i in range(1, 18):  # all ages lie within [0, 80]; 5-year half-open bins give 17 intervals
        survivedAge.append(len(survivedDf[(survivedDf['Age'] >= (i-1)*5) & (survivedDf['Age'] < i*5)]))
    ax1 = fig.add_subplot(1, 2, 1)
    ax1.set_title('age distribution of survivors')
    ax1.set_xticks(range(5, 85, 5))
    ax1.set_xlabel("Age")
    ax1.set_ylabel("Survival count")
    ax1.set_ylim(0, 45)
    ax1.grid()
    ax1.bar(range(5, 90, 5), survivedAge