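A First Look at the Kaggle Titanic Project

I. Project Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. Although there was some element of luck in surviving, some groups of people appear to have been more likely to survive than others. In this challenge, you are asked to build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (name, age, gender, socio-economic class, etc.).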

II. Project Steps

1. Import libraries

import numpy as np              # scientific computing
import pandas as pd             # data analysis
import seaborn as sns           # data visualization
import matplotlib.pyplot as plt # data visualization

%matplotlib inline
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
train.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

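PassengerId--passenger ID

Survived-----survival flag (0 = died, 1 = survived)

Pclass-------ticket class (1st / 2nd / 3rd)

Name---------passenger name

Sex----------sex

Age----------age in years

SibSp--------number of siblings and spouses aboard

Parch--------number of parents and children aboard

Ticket-------ticket number

Fare---------ticket fare

Cabin--------cabin number

Embarked-----port of embarkation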

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

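2. Data preprocessing

Filling in missing values is the most common data-cleaning technique. As a rule of thumb, categorical data is filled with the mode, while continuous data can be filled with the median or the mean.
So we fill the port of embarkation with the mode and the age with the median. Columns with a large share of missing values (Cabin here) are set aside for now. Continuous values can also be imputed by fitting a model to the data.

# Categorical column: fill with the mode of the training set
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
# Continuous column: fill with the training-set median (reused for the test set)
train['Age'] = train['Age'].fillna(train['Age'].median())
test['Age'] = test['Age'].fillna(train['Age'].median())
# Missing Fare in the test set is filled with the training-set mode here (the median would also fit the rule above)
test['Fare'] = test['Fare'].fillna(train['Fare'].mode()[0])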
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
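
The remark above that continuous values can also be filled by fitting the data can be sketched as follows. This is only a minimal illustration on made-up numbers: using Fare as the single predictor for Age, the toy values, and the straight-line fit are all assumptions, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv).
toy = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Age":  [20.0, 25.0, np.nan, 35.0, 40.0],
})

known = toy["Age"].notna()

# Fit a straight line Age ~ slope * Fare + intercept on the rows where Age is known.
slope, intercept = np.polyfit(
    toy.loc[known, "Fare"].to_numpy(),
    toy.loc[known, "Age"].to_numpy(),
    deg=1,
)

# Predict Age only for the rows where it is missing.
toy.loc[~known, "Age"] = slope * toy.loc[~known, "Fare"] + intercept

print(toy)
```

On real data a single linear predictor is unlikely to be enough, so treat this as a pattern (fit on known rows, predict the missing ones) rather than a recommended model.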

3. Feature engineering

3.1 Age distribution of survivors and victims

plt.hist(x = [train[train['Survived']==1]['Age'], train[train['Survived']==0]['Age']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x291c2f02b38>

[Figure: Age Histogram by Survival]

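From the chart above:

1. Elderly passengers have the highest proportion of deaths.

2. Young adults account for most of the deaths, because they make up the largest share of all passengers.

3. Young adults also account for most of the survivors.

4. Children aged 0-10 have the highest survival rate.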
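
Statements about proportions are hard to read off a stacked histogram of counts, so it may help to compute the rates directly. The sketch below uses a few made-up rows in place of train[['Age', 'Survived']]; the bin edges and the column name age_band are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values) standing in for train[['Age', 'Survived']].
toy = pd.DataFrame({
    "Age":      [4, 8, 25, 30, 35, 45, 70, 65],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Bin edges are an assumption; pd.cut builds right-closed intervals such as (0, 10].
bins = [0, 10, 20, 40, 60, 80]
toy["age_band"] = pd.cut(toy["Age"], bins=bins)

# The mean of a 0/1 column is the survival rate; count gives the group size.
rates = toy.groupby("age_band", observed=True)["Survived"].agg(["count", "mean"])
print(rates)
```

Running the same groupby on the real training frame would show whether the four observations hold as rates and not only as bar heights.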

3.2 Passenger class and survival

plt.hist(x = [train[train['Survived']==1]['Pclass'], train[train['Survived']==0]['Pclass']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.xticks([1,2,3])
plt.title('Pclass Histogram by Survival')
plt.xlabel('Pclass ')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x291c2fbdb00>

[Figure: Pclass Histogram by Survival]

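From the chart above:

1. Third-class passengers have the highest death rate and the most deaths, and they are also the largest group.

2. First-class passengers have the lowest death rate, the highest survival rate, and the most survivors.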
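
To separate "most deaths" (a count) from "highest death rate" (a share), a cross-tabulation is a convenient check. The rows below are made up for illustration; only the column names Pclass and Survived come from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values) standing in for train[['Pclass', 'Survived']].
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Raw counts per class: this is what the stacked histogram displays.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# Row-normalised shares: each row sums to 1, so column 1 is the survival rate.
shares = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(shares.round(2))
```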

3.3 Family size and survival

train['family_size']=train['SibSp']+train['Parch']+1
train.family_size.unique()
y=train[['family_size', 'Survived']].groupby(['family_size'],as_index=False).sum()
y.plot.bar(x='family_size',rot=45)
<matplotlib.axes._subplots.AxesSubplot at 0x291c3026f98>

plt.hist(x = [train[train['Survived']==1]['family_size'], train[train['Survived']==0]['family_size']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
# plt.xticks([1,  2,  3,  4,  5,  6,  7,  8, 11])
plt.title('family_size Histogram by Survival')
plt.xlabel('family_size ')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x291bf207ba8>

[Figure: family_size Histogram by Survival]

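From the chart above:

1. Passengers travelling alone are the largest group and account for both the most deaths and the most survivors.

2. Families of more than four have the highest death rate and a low chance of survival.

3. Families of four have the highest survival rate.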
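
The bar chart in this section sums Survived per family size, which gives the number of survivors, not the rate. A hedged sketch of how counts and rates could be reported side by side is shown below; the rows are invented and the output column names (passengers, survivors, survival_rate) are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same columns the notebook combines.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 1, 2, 4],
    "Parch":    [0, 0, 0, 0, 2, 2, 1, 2],
    "Survived": [0, 0, 1, 1, 1, 1, 1, 0],
})
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1

# count = group size, sum = number of survivors, mean = survival rate.
summary = toy.groupby("family_size")["Survived"].agg(
    passengers="count", survivors="sum", survival_rate="mean"
)
print(summary)
```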

3.4 Port of embarkation and survival

plt.hist(x = [train[train['Survived']==1]['Embarked'], train[train['Survived']==0]['Embarked']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.title('Embarked Histogram by Survival')
plt.xlabel('Embarked')
plt.ylabel('# of Passengers')
plt.legend()
<matplotlib.legend.Legend at 0x291c31041d0>

[Figure: Embarked Histogram by Survival]

train=train.drop(['family_size'],axis=1)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
train.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

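4. Feature processing

4.1 Merge the datasets for feature processing

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
full=pd.concat([train,test],ignore_index=True,sort=False)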
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

Split Cabin into two classes: value present ("Yes") and value missing ("No")

def set_Cabin_type(df):
    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"
    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
    return df
full = set_Cabin_type(full)
full.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0.0       3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   No     S
1  2            1.0       1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  Yes    C
2  3            1.0       3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   No     S
3  4            1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  Yes    S
4  5            0.0       3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   No     S
full=full.drop(['Ticket','PassengerId','Name'],axis=1)
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
Survived    891 non-null float64
Pclass      1309 non-null int64
Sex         1309 non-null object
Age         1309 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
Fare        1309 non-null float64
Cabin       1309 non-null object
Embarked    1309 non-null object
dtypes: float64(3), int64(3), object(3)
memory usage: 92.1+ KB
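
The Cabin column is first labelled "Yes"/"No" and only later mapped to 1/0. As a minimal sketch, the two steps could be collapsed into one vectorized expression. The column name has_cabin and the toy values are hypothetical, and this is an alternative formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with the same missing-value pattern as Cabin.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "C123", np.nan]})

# notna() is True where a cabin was recorded; astype(int) turns that into 1/0.
toy["has_cabin"] = toy["Cabin"].notna().astype(int)

print(toy)
```

Writing to a new column also keeps the raw cabin string available in case it is needed for other features later.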

4.2 Data transformation and discretization

Encode Sex and Cabin as integers:

set_map={'male':1,
        'female':0}
full['Sex']=full['Sex'].map(set_map)
set_map={'Yes':1,
        'No':0}
full['Cabin']=full['Cabin'].map(set_map)
full.head()
   Survived  Pclass  Sex  Age   SibSp  Parch  Fare     Cabin  Embarked
0  0.0       3       1    22.0  1      0      7.2500   0      S
1  1.0       1       0    38.0  1      0      71.2833  1      C
2  1.0       3       0    26.0  0      0      7.9250   0      S
3  1.0       1       0    35.0  1      0      53.1000  1      S
4  0.0       3       1    35.0  0      0      8.0500   0      S
# One-hot encode Pclass; dtype=int keeps the columns as 0/1 (newer pandas defaults to bool)
pclass=pd.get_dummies(full['Pclass'],prefix='Pclass',dtype=int)
pclass.head()
   Pclass_1  Pclass_2  Pclass_3
0  0         0         1
1  1         0         0
2  0         0         1
3  1         0         0
4  0         0         1
full=pd.concat([full,pclass],axis=1)
full=full.drop(['Pclass'],axis=1)

# One-hot encode Embarked; dtype=int keeps the columns as 0/1 (newer pandas defaults to bool)
embarked=pd.get_dummies(full['Embarked'],prefix='Embarked',dtype=int)
embarked.head()
   Embarked_C  Embarked_Q  Embarked_S
0  0           0           1
1  1           0           0
2  0           0           1
3  0           0           1
4  0           0           1
full=pd.concat([full,embarked],axis=1)
full=full.drop(['Embarked'],axis=1)
full.head()
   Survived  Sex  Age   SibSp  Parch  Fare     Cabin  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  Embarked_S
0  0.0       1    22.0  1      0      7.2500   0      0         0         1         0           0           1
1  1.0       0    38.0  1      0      71.2833  1      1         0         0         1           0           0
2  1.0       0    26.0  0      0      7.9250   0      0         0         1         0           0           1
3  1.0       0    35.0  1      0      53.1000  1      1         0         0         0           0           1
4  0.0       1    35.0  0      0      8.0500   0      0         0         1         0           0           1
family=pd.DataFrame()
family['family_size']=full['SibSp']+full['Parch']+1
family['family_sigle']=family['family_size'].map(lambda s: 1 if s==1 else 0)
family['family_small']=family['family_size'].map(lambda s:1 if 2<=s<=4 else 0)
family['family_large']=family['family_size'].map(lambda s:1 if s>=5 else 0)
family.head()
   family_size  family_sigle  family_small  family_large
0  2            0             1             0
1  2            0             1             0
2  1            1             0             0
3  2            0             1             0
4  1            1             0             0
full=pd.concat([full,family],axis=1)
full=full.drop(['SibSp','Parch','family_size'],axis=1)
full.head()
   Survived  Sex  Age   Fare     Cabin  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  Embarked_S  family_sigle  family_small  family_large
0  0.0       1    22.0  7.2500   0      0         0         1         0           0           1           0             1             0
1  1.0       0    38.0  71.2833  1      1         0         0         1           0           0           0             1             0
2  1.0       0    26.0  7.9250   0      0         0         1         0           0           1           1             0             0
3  1.0       0    35.0  53.1000  1      1         0         0         0           0           1           0             1             0
4  0.0       1    35.0  8.0500   0      0         0         1         0           0           1           1             0             0
age=pd.DataFrame()
age['child']=full['Age'].map(lambda s:1 if 0<s<=6 else 0)
age['teen']=full['Age'].map(lambda s:1 if 6<s<=18 else 0)
age['younth']=full['Age'].map(lambda s:1 if 18<s<=40 else 0)
age['mid']=full['Age'].map(lambda s:1 if 40<s<=60 else 0)
age['old']=full['Age'].map(lambda s:1 if s>60 else 0)
age.head()
   child  teen  younth  mid  old
0  0      0     1       0    0
1  0      0     1       0    0
2  0      0     1       0    0
3  0      0     1       0    0
4  0      0     1       0    0
full=pd.concat([full,age],axis=1)
full=full.drop(['Age'],axis=1)
full.head()
   Survived  Sex  Fare     Cabin  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  Embarked_S  family_sigle  family_small  family_large  child  teen  younth  mid  old
0  0.0       1    7.2500   0      0         0         1         0           0           1           0             1             0             0      0     1       0    0
1  1.0       0    71.2833  1      1         0         0         1           0           0           0             1             0             0      0     1       0    0
2  1.0       0    7.9250   0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
3  1.0       0    53.1000  1      1         0         0         0           0           1           0             1             0             0      0     1       0    0
4  0.0       1    8.0500   0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
import sklearn.preprocessing as preprocessing
# Standardize Fare to zero mean and unit variance (fitted on the combined train + test rows)
scaler = preprocessing.StandardScaler()
full['Fare'] = scaler.fit_transform(full['Fare'].values.reshape(-1,1)).ravel()
full.head()
   Survived  Sex  Fare       Cabin  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  Embarked_S  family_sigle  family_small  family_large  child  teen  younth  mid  old
0  0.0       1    -0.503176  0      0         0         1         0           0           1           0             1             0             0      0     1       0    0
1  1.0       0    0.734809   1      1         0         0         1           0           0           0             1             0             0      0     1       0    0
2  1.0       0    -0.490126  0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
3  1.0       0    0.383263   1      1         0         0         0           0           1           0             1             0             0      0     1       0    0
4  0.0       1    -0.487709  0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
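
For reference, the transformation StandardScaler applies with its default settings is z = (x - mean) / std, where std is the population standard deviation (ddof=0). The sketch below reproduces that formula with NumPy on the five Fare values shown earlier by train.head(). Because the notebook fits the scaler on all 1309 rows, the numbers printed here will not match the table above.

```python
import numpy as np

# First five Fare values from train.head(); a deliberately tiny sample.
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Subtract the mean, then divide by the population standard deviation (ddof=0).
z = (fares - fares.mean()) / fares.std(ddof=0)

print(z.round(4))
```

After the transformation the values have mean 0 and standard deviation 1, which keeps a wide-ranged feature such as Fare from dominating distance-based models like KNN and SVM.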

4.3 Split back into the training set and the prediction set

train=full.loc[:890]
test_=full.loc[891:]
x_train=train.drop(['Survived'],axis=1)
x_train.head()
   Sex  Fare       Cabin  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  Embarked_S  family_sigle  family_small  family_large  child  teen  younth  mid  old
0  1    -0.503176  0      0         0         1         0           0           1           0             1             0             0      0     1       0    0
1  0    0.734809   1      1         0         0         1           0           0           0             1             0             0      0     1       0    0
2  0    -0.490126  0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
3  0    0.383263   1      1         0         0         0           0           1           0             1             0             0      0     1       0    0
4  1    -0.487709  0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
y_train=train['Survived'].astype(int)
y_train.head()
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int32
test_=test_.drop(['Survived'],axis=1)
test_.head()
     Sex  Fare       Cabin  Pclass_1  Pclass_2  Pclass_3  Embarked_C  Embarked_Q  Embarked_S  family_sigle  family_small  family_large  child  teen  younth  mid  old
891  1    -0.491978  0      0         0         1         0           1           0           1             0             0             0      0     1       0    0
892  0    -0.508010  0      0         0         1         0           0           1           0             1             0             0      0     0       1    0
893  1    -0.456051  0      0         1         0         0           1           0           1             0             0             0      0     0       0    1
894  1    -0.475868  0      0         0         1         0           0           1           1             0             0             0      0     1       0    0
895  0    -0.405784  0      0         0         1         0           0           1           0             1             0             0      0     1       0    0

5. Build models

from sklearn.model_selection import train_test_split  # data splitting utilities

from sklearn.linear_model import LogisticRegression   # linear model: logistic regression
from sklearn.tree import DecisionTreeClassifier       # tree model: decision tree
from sklearn.ensemble import RandomForestClassifier   # ensemble model: random forest
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score   # cross-validation scores
from sklearn.metrics import confusion_matrix,precision_score,accuracy_score,mean_squared_error,classification_report  # evaluation metrics

# Split the labelled data into a 70% training part and a 30% hold-out part
t1_x,t2_x,t1_y,t2_y=train_test_split(x_train,y_train,test_size=0.3,random_state=11)

# Candidate models
models=[LogisticRegression(),DecisionTreeClassifier(),RandomForestClassifier(),
        XGBClassifier(),LGBMClassifier(),KNeighborsClassifier(),SVC()]
# evaluate models by using cross-validation
names=['LR','Tree','RF','XGBC','LGBC','KNN','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,t1_x,t1_y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))
LR:0.7960387096774193,[0.768      0.832      0.856      0.80645161 0.71774194]
Tree:0.8057419354838709,[0.776      0.808      0.856      0.83064516 0.75806452]
RF:0.7993032258064515,[0.752      0.832      0.848      0.80645161 0.75806452]
XGBC:0.8137290322580645,[0.8        0.832      0.856      0.80645161 0.77419355]
LGBC:0.8217806451612903,[0.808      0.824      0.864      0.80645161 0.80645161]
KNN:0.7799870967741935,[0.76       0.816      0.832      0.7983871  0.69354839]
SVM:0.8073161290322581,[0.792      0.84       0.832      0.81451613 0.75806452]
# repeat the cross-validation on the 30% hold-out split (t2_x, t2_y)
names=['LR','Tree','RF','XGBC','LGBC','KNN','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,t2_x,t2_y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))
LR:0.8354996505939901,[0.83333333 0.90740741 0.85185185 0.83018868 0.75471698]
Tree:0.7908455625436759,[0.7962963  0.85185185 0.77777778 0.73584906 0.79245283]
RF:0.7986722571628231,[0.77777778 0.7962963  0.77777778 0.79245283 0.8490566 ]
XGBC:0.7986023759608665,[0.81481481 0.81481481 0.74074074 0.77358491 0.8490566 ]
LGBC:0.8245981830887491,[0.77777778 0.90740741 0.7962963  0.81132075 0.83018868]
KNN:0.801956673654787,[0.83333333 0.85185185 0.7962963  0.71698113 0.81132075]
SVM:0.8579315164220824,[0.85185185 0.92592593 0.87037037 0.81132075 0.83018868]
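
Each list of five numbers above is one score per fold. As a rough sketch of what cv=5 does, the helper below splits row indices into five consecutive folds and uses each fold once for validation. The function name k_fold_indices is hypothetical, and this is a simplification: for classifiers scikit-learn uses stratified folds by default, which also balance the class proportions in every fold.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for plain, unshuffled k-fold splitting."""
    indices = np.arange(n_samples)
    # array_split tolerates n_samples that is not divisible by k.
    folds = np.array_split(indices, k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

# 623 rows is the size of t1_x (70% of the 891 labelled rows).
splits = list(k_fold_indices(623, 5))
print([len(valid) for _, valid in splits])  # [125, 125, 125, 124, 124]
```

These fold sizes agree with the reported scores: 0.768 is 96 correct out of 125, and 0.80645161 is 100 correct out of 124.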

6. Model ensembling

from sklearn.ensemble import VotingClassifier
LR = LogisticRegression()
Tree = DecisionTreeClassifier()
RF = RandomForestClassifier()
XGBC = XGBClassifier()
LGBC = LGBMClassifier()
KNN = KNeighborsClassifier()
SVM = SVC()
eclf=VotingClassifier([('LR',LR),('Tree',Tree),('RF',RF),('XGBC',XGBC),('LGBC',LGBC),
                       ('KNN',KNN),('SVM',SVM)],voting='hard',n_jobs=-1)
eclf.fit(t1_x,t1_y)
eclf.score(t2_x,t2_y)
0.8582089552238806
result=eclf.predict(test_)
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": result
    })
submission.to_csv('submission.csv', index=False)
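
With voting='hard', each fitted model predicts a label and the ensemble returns the most frequent one per passenger. The idea can be sketched in a few lines of plain Python. The helper hard_vote and the three prediction lists are hypothetical, and the sketch says nothing about how scikit-learn implements it internally.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label per sample from a list of per-model predictions."""
    voted = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for votes in zip(*predictions_per_model):
        label, _ = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted

# Three hypothetical models predicting four passengers (1 = survived, 0 = died).
preds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
print(hard_vote(preds))  # [1, 0, 1, 0]
```

With an odd number of models and two classes, as in the seven-model ensemble above, a tie cannot occur.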