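# Titanic: Machine Learning from Disaster
Question
The task is to build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (such as name, age, sex, and socio-economic class).
1. Import libraries and datasets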
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
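import seaborn as sns
Important: in a Kaggle notebook, remove the first '.' inside the quotes of pd.read_csv("./kaggle/input/titanic/train.csv"), so that the path becomes "/kaggle/input/titanic/train.csv".
Both the training set and the test set need to be read in.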
train = pd.read_csv("./kaggle/input/titanic/train.csv")
test = pd.read_csv("./kaggle/input/titanic/test.csv")
allData = pd.concat([train, test], ignore_index=True)
# dataNum = train.shape[0]
# featureNum = train.shape[1]
train.info()
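The path note above can be handled without editing the code by hand. The following is a minimal sketch, assuming the competition files sit in one of the two directories mentioned in this post; the helper name `find_data_dir` and the candidate list are illustrative and not part of the original notebook.

```python
from pathlib import Path

# Candidate locations in priority order. Both are assumptions about where
# the competition files live; adjust them to your own setup.
CANDIDATE_DIRS = ("/kaggle/input/titanic", "./kaggle/input/titanic")


def find_data_dir(candidates=CANDIDATE_DIRS):
    """Return the first candidate directory that contains train.csv, else None."""
    for candidate in candidates:
        directory = Path(candidate)
        if (directory / "train.csv").is_file():
            return directory
    return None


data_dir = find_data_dir()
if data_dir is None:
    print("train.csv not found in any candidate directory")
else:
    print(f"Using data from {data_dir}")
    # With pandas imported as pd, the files could then be read like this:
    # train = pd.read_csv(data_dir / "train.csv")
    # test = pd.read_csv(data_dir / "test.csv")
```

Checking for the file itself, rather than for an environment flag, keeps the same cell usable both on Kaggle and on a local machine.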
2. Data overview
Overview
Run train.info() to see summary information about the dataset:
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
Run train.head() to see sample rows.
Features
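The train.info() output above shows that Age, Cabin and Embarked have fewer than 891 non-null entries. As a small worked example, the number and share of missing values can be derived directly from those printed counts; the variable names here are illustrative.

```python
# Non-null counts copied from the train.info() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing entries per column and their share of all rows.
missing = {col: total_rows - count for col, count in non_null.items()}
missing_share = {col: m / total_rows for col, m in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({missing_share[col]:.1%})")

# With the DataFrame loaded, train.isnull().sum() reports the same counts.
```

Cabin is missing for most passengers, while Embarked lacks only two values, which is worth keeping in mind when deciding how to fill each column.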

Variable | Definition | Key
:-:|:-:|:-:
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic (collateral relatives) |
parch | # of parents / children aboard the Titanic (lineal relatives) |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

3. Visual data analysis
Sex feature (Sex)
Women survived at a much higher rate than men.
# Sex
sns.countplot(x='Sex', hue='Survived', data=train)
plt.show()
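The count plot shows raw counts, so it can help to put a number on the difference as well. Below is a minimal sketch on a tiny made-up sample (not the real Titanic rows); on the real data the same call would be applied to train.

```python
import pandas as pd

# Tiny made-up sample (not the real Titanic data), just to show the call.
sample = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so its mean within each group is that group's survival rate.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```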
Passenger class feature (Pclass)
The higher the passenger class, the higher the survival rate.
# Pclass
sns.barplot(x='Pclass', y="Survived", data=train)
plt.show()
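Family size feature: FamilySize = SibSp + Parch + 1 (the passenger is counted too)
Passengers with a moderate family size have a higher survival rate.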
# FamilySize = SibSp + Parch + 1
allData['FamilySize'] = allData['SibSp'] + allData['Parch'] + 1
sns.barplot(x='FamilySize', y='Survived', data=allData)
plt.show()
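Because allData also contains the test rows, which have no Survived label, it is worth checking how many labelled passengers stand behind each bar. The sketch below uses made-up rows, with the last two mimicking unlabelled test passengers; it only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

# Made-up rows: the last two mimic test-set passengers with no Survived label.
sample = pd.DataFrame({
    "SibSp": [0, 1, 1, 0, 3, 0, 1],
    "Parch": [0, 0, 2, 0, 2, 0, 1],
    "Survived": [0, 1, 1, 1, 0, np.nan, np.nan],
})

# Same definition as in the post: relatives aboard plus the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# mean skips NaN labels; count reports how many labelled rows each size has.
summary = sample.groupby("FamilySize")["Survived"].agg(["mean", "count"])
print(summary)
```

A family size with a count of zero has no labelled passengers at all, so its mean is NaN and it says nothing about survival.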
Port of embarkation feature (Embarked)
The survival rate differs by port of embarkation.
# Embarked
sns.countplot(x='Embarked', hue='Survived', data=train)
plt.show()
Age feature (Age)
Young children and adults in their prime have higher survival rates.
# Age
sns.stripplot(x="Survived", y="Age", data=train, jitter=True)
plt.show()
- Survival density by age
facet = sns.FacetGrid(train, hue="Survived",aspect=2)
# fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlabel('Age')
plt.ylabel('density')
plt.show()
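Children have a noticeably different survival rate from the other age groups.
The author treats passengers aged 10 and under as children and gives them a separate label.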
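One way the child label could be built is sketched below. The column name `IsChild` and the choice to label passengers with an unknown age as 0 are assumptions for illustration; the post does not show the author's exact code at this point.

```python
import numpy as np
import pandas as pd

# Made-up ages, including a missing one, to show the edge cases.
sample = pd.DataFrame({"Age": [4.0, 10.0, 10.5, 35.0, np.nan]})

CHILD_MAX_AGE = 10  # "aged 10 and under", as stated in the text

# A comparison with NaN evaluates to False, so an unknown age becomes 0 here.
sample["IsChild"] = (sample["Age"] <= CHILD_MAX_AGE).astype(int)
print(sample)
```

If missing ages are filled in later, the label should be recomputed afterwards so that imputed children are not left as 0.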
Fare feature (Fare)
The higher the fare, the higher the survival rate.
# Fare
sns.stripplot(x="Survived", y="Fare", data=train, jitter=True)
plt.show()
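A strip plot makes the trend visible, but a survival rate per fare band makes it easier to compare. The following sketch bins made-up fares into quartiles with pd.qcut; the column name `FareBand`, the band labels and the data are all illustrative assumptions, not the author's method.

```python
import pandas as pd

# Made-up fares and outcomes (not the real Titanic data).
sample = pd.DataFrame({
    "Fare": [7.25, 7.9, 8.05, 13.0, 26.0, 31.3, 71.3, 263.0],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Split the fares into four equally populated bands (quartiles).
sample["FareBand"] = pd.qcut(sample["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Survival rate within each band.
rate_by_band = sample.groupby("FareBand", observed=True)["Survived"].mean()
print(rate_by_band)
```

Quantile bands are used here instead of fixed-width bins because fares are heavily skewed, and a few very expensive tickets would otherwise leave most bins nearly empty.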
Name feature (Name)
Title feature (Title)
Titles are classified by the honorific that precedes the given name.
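To make the extraction concrete, here is a minimal sketch that applies the same split logic to single strings. It assumes names follow the "Surname, Title. Given names" layout; the function name `extract_title` is illustrative.

```python
def extract_title(name):
    """Return the honorific between the first comma and the following period."""
    return name.split(",")[1].split(".")[0].strip()


# Example strings in the "Surname, Title. Given names" layout.
examples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

for name in examples:
    print(f"{name!r} -> {extract_title(name)!r}")
```

A name without a comma would raise an IndexError in this sketch, so it relies on every row following that layout.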
# Name
allData['Title'] = allData['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
pd.crosstab(allData['Title'], allData['Sex'])
Statistical analysis
TitleClassification = {'Officer': ['Capt', 'Col', 'Major', 'Dr', 'Rev'],
                       'Royalty': ['Don', 'Sir', 'the Countess', 'Dona', 'Lady'],
                       'Mrs': ['Mme', 'Ms', 'Mrs'],
                       'Miss': ['Mlle', 'Miss'],
                       'Mr': ['Mr'],
                       'Master': ['Master', 'Jonkheer']}
for title in TitleClassification.keys():
    cnt = 0
    for name in TitleClassification[title]:
        c