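This is an open project on Kaggle. The train set records information about the Titanic's passengers, including name, age, sex, cabin, and family status, together with the final survival outcome (0 means died, 1 means survived). By cleaning and exploring train and building a suitable model, we predict the survival outcome of the passengers in the test set. In the train set, Embarked is the port of embarkation, Pclass is the ticket class (a proxy for socio-economic status), Cabin is the cabin number, and Sex is the passenger's sex; the remaining columns are not described here.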
# 1. Import the base packages and load the data
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")
train = pd.read_csv("F:/data/train.csv")
test = pd.read_csv("F:/data/test.csv")
# Inspect basic information about the train set
print(train.shape)
print(train.describe())
print(train.head(10))
# Inspect basic information about the test set
print(test.shape)
print(test.describe())
print(test.head(5))
# 2. Clean the data
trainPassID = train["PassengerId"]
testPassID = test["PassengerId"]
## Handle the missing Embarked values in train
traintotal = train.isnull().sum().sort_values(ascending=False).astype(float)
trainpercent = train.isnull().sum().sort_values(ascending=False)/len(train)*100
traintp = pd.concat([traintotal,trainpercent],axis=1,keys=["Total","Percent"])
print(traintp)
# The output above lists the count and percentage of missing values per column in train
testtotal = test.isnull().sum().sort_values(ascending = False)
testpercent = test.isnull().sum().sort_values(ascending = False)/len(test)*100
testtp = pd.concat([testtotal, testpercent], axis = 1,keys= ["Total", "Percent"])
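print(testtp)
# The output above lists the count and percentage of missing values per column in test
print(train[train["Embarked"].isnull()])
# The output above shows the rows of train where Embarked is missing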
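The train and test summaries above repeat the same three steps, so they could be folded into one helper. The following is a minimal sketch, not the original author's code: the function name `missing_summary` and the toy DataFrame are hypothetical and only stand in for the real `train` and `test` files.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    total = df.isnull().sum().sort_values(ascending=False)
    percent = total / len(df) * 100
    return pd.concat([total, percent], axis=1, keys=["Total", "Percent"])


# Toy frame with made-up values, standing in for train or test.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})
summary = missing_summary(toy)
print(summary)
```

With a helper like this, `missing_summary(train)` and `missing_summary(test)` would produce the two tables with a single definition.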
Now handle the missing values in the Cabin column.
sur = train['Survived']
train = train.drop(['Survived'],axis=1)
alldata = pd.concat([train,test],ignore_index=False)
alldata['Cabin'] = alldata['Cabin'].fillna('N')
alldata['Cabin'] = [i[0] for i in alldata['Cabin']]
missing = alldata[alldata['Cabin'] == 'N']
notmissing = alldata[alldata['Cabin'] != 'N']
# Mean fare per cabin deck (first letter of Cabin; "N" marks a missing cabin)
print(alldata.groupby('Cabin')['Fare'].mean())
A 41.244314
B 122.383078
C 107.926598
D 53.007339
E 54.564634
F 18.079367
G 14.205000
N 19.132707
T 35.500000
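This gives the mean fare for each cabin deck across the train and test files combined. These means are then used to infer the deck of each passenger whose cabin is missing.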
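To make the idea of inferring a deck from the fare concrete, here is a small sketch that assigns the deck whose mean fare is closest to the passenger's fare. This nearest-mean lookup is an assumption for illustration and is a different technique from the hand-picked fare ranges in `cabin_filler`; the name `nearest_deck` and the toy passengers are hypothetical, while the mean fares are the ones listed above.

```python
import pandas as pd

# Mean fare per deck as listed above ("N", the missing-cabin marker, is left out).
deck_mean_fare = pd.Series({
    "A": 41.244314, "B": 122.383078, "C": 107.926598, "D": 53.007339,
    "E": 54.564634, "F": 18.079367, "G": 14.205000, "T": 35.500000,
})


def nearest_deck(fare, means=deck_mean_fare):
    """Return the deck whose mean fare is closest to `fare`."""
    return (means - fare).abs().idxmin()


# Hypothetical passengers; "N" marks an unknown cabin.
toy = pd.DataFrame({
    "Cabin": ["N", "N", "C", "N"],
    "Fare": [7.25, 120.0, 90.0, 36.0],
})
mask = toy["Cabin"] == "N"
# Only rows with an unknown cabin are filled; known decks are left untouched.
toy.loc[mask, "Cabin"] = toy.loc[mask, "Fare"].apply(nearest_deck)
print(toy)
```

A nearest-mean rule needs no manual boundaries, but it is sensitive to decks with very few passengers, whose mean fare may not be representative.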
def cabin_filler(i):
    # Map a fare value i to a cabin deck letter using fixed fare ranges
    a = 0
    if i < 16:
        a = "G"
    elif i >= 16 and i < 27:
        a = "F"
    elif i >= 27 and i < 38:
        a = "T"
    elif i >= 38 and i < 47:
        a = "A"
    elif i >= 47 and i < 53:
        a = "E"
    elif i >= 53 and i < 54:
        a = "D"
    elif i >= 54 and i < 116:
        a = '