The data comes from Kaggle.
The training set has shape $891\times 12$ and the test set has shape $418\times 11$.
Loading the data
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
train.shape,test.shape
((891, 12), (418, 11))
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Feature engineering
Age distribution of survivors
- Plot the number of survivors in each age band
train['AgeBand'] = pd.cut(train['Age'], 5)
y=train[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).sum()
y.plot.bar(x='AgeBand', rot=45)  # rot sets the rotation angle of the x tick labels
plt.title('survived number')
plt.show()
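To make the binning step concrete, here is a minimal sketch of what `pd.cut(..., 5)` does. The `ages` series is hypothetical toy data, not values from `train.csv`; only `pd.cut` and the bin count come from the code above.

```python
import pandas as pd

# Hypothetical toy ages, not taken from train.csv
ages = pd.Series([2, 15, 22, 38, 41, 67, 80])

# bins=5 splits the observed range [2, 80] into 5 equal-width intervals
bands = pd.cut(ages, 5)

# Each interval is (80 - 2) / 5 = 15.6 years wide; pandas lowers the
# first edge slightly so that the minimum value falls inside the first bin
print(bands.cat.categories)

# Number of toy passengers in each band, listed in band order
print(bands.value_counts(sort=False))
```

Because the edges depend on the minimum and maximum of the data, the bands computed on the real `Age` column will differ from the ones printed here.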
- Plot the age distributions of survivors and victims
# FacetGrid creates its own figure, so a preceding plt.figure() call would only leave an empty figure behind
g = sns.FacetGrid(train, col='Survived')
g.map(plt.hist, 'Age', bins=20)
g.add_legend()
- Plot a stacked histogram of the age distributions of survivors and victims
Survivors and victims are drawn in different colors.
# Drop missing ages before binning; green = survived, red = died
plt.hist(x=[train.loc[train['Survived'] == 1, 'Age'].dropna(),
            train.loc[train['Survived'] == 0, 'Age'].dropna()],
         stacked=True, color=['g', 'r'], label=['Survived', 'Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
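The charts above show that:
- Elderly passengers had the highest proportion of deaths.
- Young adults account for most of the victims, because they make up the largest share of all passengers.
- Young adults also account for most of the survivors.
- Children aged 0-10 had the highest survival rate.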
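These observations mix counts ("most of the survivors") with proportions ("highest survival rate"). The sketch below puts the two side by side for each age band. The `toy` frame and the band edges `[0, 10, 40, 80]` are assumptions made for illustration, not values from the post; on the real data you would group `train` by its `AgeBand` column instead.

```python
import pandas as pd

# Hypothetical toy data standing in for train[['Age', 'Survived']]
toy = pd.DataFrame({
    'Age':      [4, 8, 25, 28, 30, 33, 35, 62, 70, 74],
    'Survived': [1, 1,  1,  0,  0,  1,  0,  0,  0,  1],
})

# Hand-picked band edges (an assumption; the post uses pd.cut(train['Age'], 5))
toy['AgeBand'] = pd.cut(toy['Age'], bins=[0, 10, 40, 80])

summary = toy.groupby('AgeBand', observed=False)['Survived'].agg(
    survivors='sum',   # number of survivors in the band
    total='count',     # number of passengers in the band
    rate='mean',       # fraction of the band that survived
)
print(summary)
```

In this toy data the youngest and the middle band have the same number of survivors but very different survival rates, which is why a count-based bar chart alone cannot support a claim about proportions.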
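Ticket class distribution of survivors
- Plot the survival rate for each ticket class (sns.barplot shows the mean of Survived per class)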
sns.barplot(x = 'Pclass', y = 'Survived', order=[1,2,3], data=train)
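By default `sns.barplot` draws the mean of `y` within each `x` group, so for a 0/1 column such as `Survived` the bar heights are survival rates. The sketch below reproduces those numbers with a plain `groupby`. The `toy` frame is hypothetical data used only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for train[['Pclass', 'Survived']]
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = fraction of passengers in the class who survived
rate_by_class = toy.groupby('Pclass')['Survived'].mean()
print(rate_by_class)

# Raw survivor counts per class, for comparison with the rates
survivors_by_class = (
    toy.loc[toy['Survived'] == 1, 'Pclass'].value_counts().sort_index()
)
print(survivors_by_class)
```

Running the same two expressions on `train` would give the values behind the bar chart and the survivor counts per class.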
- Plot the ticket class distribution of survivors