A First Look at the Kaggle Titanic Project

L_彳亍

Article link: https://blog.csdn.net/weixin_42143615/article/details/108632658

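I. Project Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. While some luck was involved in surviving, it seems that some groups of people were more likely to survive than others. In this challenge, we are asked to build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data (name, age, sex, socio-economic class, and so on).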

II. Project Steps

1. Import Libraries

import numpy as np              # numerical computing
import pandas as pd             # data analysis
import seaborn as sns           # data visualization
import matplotlib.pyplot as plt # data visualization

%matplotlib inline

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')

train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

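PassengerId--Passenger ID

Pclass-------Ticket class (1st/2nd/3rd)

Name---------Passenger name

Sex----------Sex

Age----------Age

SibSp--------Number of siblings/spouses aboard

Parch--------Number of parents/children aboard

Ticket-------Ticket number

Fare---------Fare

Cabin--------Cabin number

Embarked-----Port of embarkation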

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
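
The `info()` output above shows that Age, Cabin and Embarked have fewer than 891 non-null entries. A quick way to read off the gaps directly is to count missing values per column. The snippet below is only a sketch on a made-up toy frame (the values and the names `toy` and `summary` are assumptions for illustration); in the article the frame would be `train`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", None, "S"],
})

missing = toy.isna().sum()            # number of missing values per column
missing_ratio = missing / len(toy)    # share of rows that are missing
summary = pd.DataFrame({"missing": missing, "ratio": missing_ratio})
print(summary.sort_values("ratio", ascending=False))
```

Sorting by the ratio puts the sparsest column first, which helps decide which columns to fill and which to set aside.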

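2. Data Preprocessing

The most common data-cleaning technique is filling in missing values. As a rule of thumb, categorical data should be filled with the mode, while continuous data can be filled with the median or the mean.
So we fill the port of embarkation with the mode and age with the median. Columns with a large share of missing values, such as Cabin, are set aside for now. Missing continuous values can also be imputed by fitting a model to the data.

# Assign the result back instead of calling fillna(inplace=True) on a column, which newer pandas versions warn about
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
train['Age'] = train['Age'].fillna(train['Age'].median())
# Fill the test set with statistics computed on the training set
test['Age'] = test['Age'].fillna(train['Age'].median())
# Note: Fare is continuous but is filled with the training-set mode here; the median would match the rule above
test['Fare'] = test['Fare'].fillna(train['Fare'].mode()[0])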
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
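
The text above mentions that continuous values can also be filled by fitting a model. The article itself does not do this, so the following is only a minimal sketch of the idea under stated assumptions: a plain least-squares fit of Age on two numeric columns, run on a made-up toy frame. The helper name `impute_by_linear_fit` and the choice of predictors are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy data; in the article the frame would be `train`
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "SibSp":  [0, 1, 0, 1, 0, 4, 0, 0],
    "Age":    [40.0, 38.0, 30.0, np.nan, 22.0, 4.0, np.nan, 45.0],
})

def impute_by_linear_fit(df, target, predictors):
    """Fill NaNs in `target` with an ordinary least-squares fit on `predictors`."""
    out = df.copy()
    known = out[target].notna().to_numpy()
    # Design matrix with an intercept column
    X = np.column_stack([np.ones(len(out)), out[predictors].to_numpy(dtype=float)])
    # Fit only on the rows where the target is known
    coef, *_ = np.linalg.lstsq(X[known], out.loc[known, target].to_numpy(), rcond=None)
    # Predict the target for the rows where it is missing
    out.loc[~known, target] = X[~known] @ coef
    return out

filled = impute_by_linear_fit(toy, "Age", ["Pclass", "SibSp"])
print(filled)
```

Known ages are left untouched and only the missing ones receive fitted values. Whether this beats the median on the real data would have to be checked with cross-validation.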

3. Feature Engineering

3.1 Age Distribution of Survivors and Victims

plt.hist(x = [train[train['Survived']==1]['Age'], train[train['Survived']==0]['Age']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x291c2f02b38>

[Figure: Age Histogram by Survival]

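From the chart above we can see:

1. The elderly have the highest proportion of deaths

2. Young adults make up most of the victims, because young adults are the largest group of passengers (the 177 missing ages that were filled with the median also fall in this range)

3. Young adults also make up most of the survivors

4. Children aged 0-10 have the highest survival rate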
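
Proportions are hard to read off a stacked histogram, so it can help to compute the survival rate per age bin directly. The sketch below assumes 10-year bins and uses made-up toy data; the column name `age_bin` is hypothetical and not part of the article's code.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "Age":      [4, 8, 25, 30, 35, 28, 65, 70],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0],
})

# 10-year-wide bins, closed on the right: (0, 10], (10, 20], ...
bins = list(range(0, 90, 10))
toy["age_bin"] = pd.cut(toy["Age"], bins=bins)

# count = passengers in the bin; the mean of a 0/1 column is the survival rate
rate = toy.groupby("age_bin", observed=True)["Survived"].agg(["count", "mean"])
print(rate)
```

Showing the count next to the rate makes it clear when a high or low rate rests on only a handful of passengers.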

3.2 Passenger Class and Survival

plt.hist(x = [train[train['Survived']==1]['Pclass'], train[train['Survived']==0]['Pclass']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.xticks([1,2,3])
plt.title('Pclass Histogram by Survival')
plt.xlabel('Pclass ')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x291c2fbdb00>

[Figure: Pclass Histogram by Survival]

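From the chart above we can see:

1. Third-class passengers have the highest death rate, the most deaths, and are also the largest group;

2. First-class passengers have the lowest death rate, the highest survival rate, and the most survivors;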
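
Statements about rates and counts per class can be checked with a cross-tabulation. This is a sketch on made-up toy data, not the article's own code; `pd.crosstab` with `normalize="index"` turns each row into proportions.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Raw counts: rows are classes, columns are Survived = 0 / 1
counts = pd.crosstab(toy["Pclass"], toy["Survived"])
# Row-wise proportions: each class sums to 1
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(3))
```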

3.3 Family Size and Survival

train['family_size']=train['SibSp']+train['Parch']+1
train.family_size.unique()
y=train[['family_size', 'Survived']].groupby(['family_size'],as_index=False).sum()
y.plot.bar(x='family_size',rot=45)

<matplotlib.axes._subplots.AxesSubplot at 0x291c3026f98>

plt.hist(x = [train[train['Survived']==1]['family_size'], train[train['Survived']==0]['family_size']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
# plt.xticks([1,  2,  3,  4,  5,  6,  7,  8, 11])
plt.title('family_size Histogram by Survival')
plt.xlabel('family_size ')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x291bf207ba8>

[Figure: family_size Histogram by Survival]

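From the chart above we can see:

1. Passengers travelling alone are the largest group, with both the most deaths and the most survivors;

2. Families of more than 4 people have the highest death rate and are less likely to survive;

3. Families of 4 have the highest survival rate;

3.4 Port of Embarkation and Survival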

plt.hist(x = [train[train['Survived']==1]['Embarked'], train[train['Survived']==0]['Embarked']], 
         stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.title('Embarked Histogram by Survival')
plt.xlabel('Embarked')
plt.ylabel('# of Passengers')
plt.legend()

<matplotlib.legend.Legend at 0x291c31041d0>

[Figure: Embarked Histogram by Survival]

train=train.drop(['family_size'],axis=1)
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

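4. Feature Processing

4.1 Merge the Datasets for Easier Feature Processing

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
full=pd.concat([train,test],ignore_index=True,sort=False)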
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

Split Cabin into two categories: has a value and has no value

def set_Cabin_type(df):
    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"
    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
    return df

full = set_Cabin_type(full)
full.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0.0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	No	S
1	2	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	Yes	C
2	3	1.0	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	No	S
3	4	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	Yes	S
4	5	0.0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	No	S
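
The "Yes"/"No" labels are mapped to 1/0 in a later step, so the two steps can also be collapsed into one. A minimal sketch on a made-up toy frame; the column name `HasCabin` is hypothetical, and the article itself overwrites `Cabin` in place instead.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", "E46"]})

# notna() yields booleans; astype(int) turns True/False into 1/0
toy["HasCabin"] = toy["Cabin"].notna().astype(int)
print(toy)
```

Writing to a new column keeps the original cabin strings available in case the deck letter is wanted as a feature later.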

full=full.drop(['Ticket','PassengerId','Name'],axis=1)
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
Survived    891 non-null float64
Pclass      1309 non-null int64
Sex         1309 non-null object
Age         1309 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
Fare        1309 non-null float64
Cabin       1309 non-null object
Embarked    1309 non-null object
dtypes: float64(3), int64(3), object(3)
memory usage: 92.1+ KB

4.2 Data Transformation and Discretization

Encode Sex and Cabin as 0/1:

set_map={'male':1,
        'female':0}
full['Sex']=full['Sex'].map(set_map)
set_map={'Yes':1,
        'No':0}
full['Cabin']=full['Cabin'].map(set_map)
full.head()

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
0	0.0	3	1	22.0	1	0	7.2500	0	S
1	1.0	1	0	38.0	1	0	71.2833	1	C
2	1.0	3	0	26.0	0	0	7.9250	0	S
3	1.0	1	0	35.0	1	0	53.1000	1	S
4	0.0	3	1	35.0	0	0	8.0500	0	S

pclass=pd.DataFrame()
pclass=pd.get_dummies(full['Pclass'],prefix='Pclass')
pclass.head()

	Pclass_1	Pclass_2	Pclass_3
0	0	0	1
1	1	0	0
2	0	0	1
3	1	0	0
4	0	0	1

full=pd.concat([full,pclass],axis=1)
full=full.drop(['Pclass'],axis=1)

embarked=pd.DataFrame()
embarked=pd.get_dummies(full['Embarked'],prefix='Embarked')
embarked.head()

	Embarked_C	Embarked_Q	Embarked_S
0	0	0	1
1	1	0	0
2	0	0	1
3	0	0	1
4	0	0	1

full=pd.concat([full,embarked],axis=1)
full=full.drop(['Embarked'],axis=1)
full.head()

	Survived	Sex	Age	SibSp	Parch	Fare	Cabin	Pclass_1	Pclass_2	Pclass_3	Embarked_C	Embarked_Q	Embarked_S
0	0.0	1	22.0	1	0	7.2500	0	0	0	1	0	0	1
1	1.0	0	38.0	1	0	71.2833	1	1	0	0	1	0	0
2	1.0	0	26.0	0	0	7.9250	0	0	0	1	0	0	1
3	1.0	0	35.0	1	0	53.1000	1	1	0	0	0	0	1
4	0.0	1	35.0	0	0	8.0500	0	0	0	1	0	0	1
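
The encode, concat and drop sequence used above for Pclass and Embarked can also be done in one call. A minimal sketch on a made-up toy frame: `pd.get_dummies` with `columns=` replaces the listed columns with their one-hot versions, and `dtype=int` asks for 0/1 integers (recent pandas versions return booleans by default).

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2],
    "Embarked": ["S", "C", "S", "S", "Q"],
    "Fare":     [7.25, 71.2833, 7.925, 53.1, 13.0],
})

# One-hot encode both columns at once; untouched columns are kept as they are
encoded = pd.get_dummies(toy, columns=["Pclass", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```

Each group of dummy columns has exactly one 1 per row, which is a cheap sanity check after encoding.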

family=pd.DataFrame()
family['family_size']=full['SibSp']+full['Parch']+1
family['family_sigle']=family['family_size'].map(lambda s: 1 if s==1 else 0)
family['family_small']=family['family_size'].map(lambda s:1 if 2<=s<=4 else 0)
family['family_large']=family['family_size'].map(lambda s:1 if s>=5 else 0)
family.head()

	family_size	family_sigle	family_small	family_large
0	2	0	1	0
1	2	0	1	0
2	1	1	0	0
3	2	0	1	0
4	1	1	0	0

full=pd.concat([full,family],axis=1)
full=full.drop(['SibSp','Parch','family_size'],axis=1)
full.head()

	Survived	Sex	Age	Fare	Cabin	Pclass_1	Pclass_2	Pclass_3	Embarked_C	Embarked_Q	Embarked_S	family_sigle	family_small	family_large
0	0.0	1	22.0	7.2500	0	0	0	1	0	0	1	0	1	0
1	1.0	0	38.0	71.2833	1	1	0	0	1	0	0	0	1	0
2	1.0	0	26.0	7.9250	0	0	0	1	0	0	1	1	0	0
3	1.0	0	35.0	53.1000	1	1	0	0	0	0	1	0	1	0
4	0.0	1	35.0	8.0500	0	0	0	1	0	0	1	1	0	0

age=pd.DataFrame()
age['child']=full['Age'].map(lambda s:1 if 0<s<=6 else 0)
age['teen']=full['Age'].map(lambda s:1 if 6<s<=18 else 0)
age['younth']=full['Age'].map(lambda s:1 if 18<s<=40 else 0)
age['mid']=full['Age'].map(lambda s:1 if 40<s<=60 else 0)
age['old']=full['Age'].map(lambda s:1 if s>60 else 0)
age.head()

	child	teen	younth	mid	old
0	0	0	1	0	0
1	0	0	1	0	0
2	0	0	1	0	0
3	0	0	1	0	0
4	0	0	1	0	0

full=pd.concat([full,age],axis=1)
full=full.drop(['Age'],axis=1)
full.head()

	Survived	Sex	Fare	Cabin	Pclass_1	Pclass_2	Pclass_3	Embarked_C	Embarked_Q	Embarked_S	family_sigle	family_small	family_large	child	teen	younth	mid	old
0	0.0	1	7.2500	0	0	0	1	0	0	1	0	1	0	0	0	1	0	0
1	1.0	0	71.2833	1	1	0	0	1	0	0	0	1	0	0	0	1	0	0
2	1.0	0	7.9250	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0
3	1.0	0	53.1000	1	1	0	0	0	0	1	0	1	0	0	0	1	0	0
4	0.0	1	8.0500	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0
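
The five age indicators built with `map` and lambdas above use right-closed intervals, so the same binning can be expressed with `pd.cut`. This is a sketch on made-up ages; the variable names `edges`, `group` and `age_dummies` are hypothetical, while the labels reuse the article's column names (including the spelling `younth`).

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 6, 6.5, 18, 28, 40, 41, 60, 61, 80], name="Age")

labels = ["child", "teen", "younth", "mid", "old"]   # same names as the article's columns
edges = [0, 6, 18, 40, 60, np.inf]

# right=True is the default, giving (0, 6], (6, 18], (18, 40], (40, 60], (60, inf)
group = pd.cut(ages, bins=edges, labels=labels)
age_dummies = pd.get_dummies(group, dtype=int)
print(age_dummies)
```

Keeping the edges in one list makes it easier to try other cut points than editing five separate lambdas.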

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# fit_transform learns the mean and standard deviation and scales in one step, so the separate fit() call is not needed
full['Fare'] = scaler.fit_transform(full[['Fare']]).ravel()
full.head()

	Survived	Sex	Fare	Cabin	Pclass_1	Pclass_2	Pclass_3	Embarked_C	Embarked_Q	Embarked_S	family_sigle	family_small	family_large	child	teen	younth	mid	old
0	0.0	1	-0.503176	0	0	0	1	0	0	1	0	1	0	0	0	1	0	0
1	1.0	0	0.734809	1	1	0	0	1	0	0	0	1	0	0	0	1	0	0
2	1.0	0	-0.490126	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0
3	1.0	0	0.383263	1	1	0	0	0	0	1	0	1	0	0	0	1	0	0
4	0.0	1	-0.487709	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0
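
Standardization replaces each value x with (x - mean) / std. The article fits the scaler on the combined train and test rows; a common alternative is to take the mean and standard deviation from the training rows only and apply them to both parts. The sketch below shows that variant with plain NumPy on made-up fares; the split point `n_train` is an assumption for illustration.

```python
import numpy as np

# Made-up fares: the first four play the role of training rows, the last two of test rows
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05, 13.0])
n_train = 4

mu = fare[:n_train].mean()
sigma = fare[:n_train].std()   # population std (ddof=0), the same convention StandardScaler uses

# Apply the training statistics to every row, training and test alike
fare_scaled = (fare - mu) / sigma
print(fare_scaled.round(3))
```

After scaling, the training rows have mean 0 and standard deviation 1, while the test rows are merely expressed on the same scale.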

4.3 Separate the Training and Prediction Sets

train=full.loc[:890]
test_=full.loc[891:]
x_train=train.drop(['Survived'],axis=1)
x_train.head()

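	Sex	Fare	Cabin	Pclass_1	Pclass_2	Pclass_3	Embarked_C	Embarked_Q	Embarked_S	family_sigle	family_small	family_large	child	teen	younth	mid	old
0	1	-0.503176	0	0	0	1	0	0	1	0	1	0	0	0	1	0	0
1	0	0.734809	1	1	0	0	1	0	0	0	1	0	0	0	1	0	0
2	0	-0.490126	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0
3	0	0.383263	1	1	0	0	0	0	1	0	1	0	0	0	1	0	0
4	1	-0.487709	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0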

y_train=train['Survived'].astype(int)
y_train.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int32

test_=test_.drop(['Survived'],axis=1)
test_.head()

	Sex	Fare	Cabin	Pclass_1	Pclass_2	Pclass_3	Embarked_C	Embarked_Q	Embarked_S	family_sigle	family_small	family_large	child	teen	younth	mid	old
891	1	-0.491978	0	0	0	1	0	1	0	1	0	0	0	0	1	0	0
892	0	-0.508010	0	0	0	1	0	0	1	0	1	0	0	0	0	1	0
893	1	-0.456051	0	0	1	0	0	1	0	1	0	0	0	0	0	0	1
894	1	-0.475868	0	0	0	1	0	0	1	1	0	0	0	0	1	0	0
895	0	-0.405784	0	0	0	1	0	0	1	0	1	0	0	0	1	0	0
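
The split above relies on the hard-coded row label 890. Since the test rows are exactly the ones without a label, the same split can be derived from the `Survived` column. A minimal sketch on a made-up toy frame; the names `full_toy`, `train_part` and `test_part` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up combined frame: three labeled rows followed by two unlabeled ones
full_toy = pd.DataFrame({
    "Survived": [0.0, 1.0, 1.0, np.nan, np.nan],
    "Sex":      [1, 0, 0, 1, 0],
})

is_train = full_toy["Survived"].notna()     # labeled rows came from the training file
train_part = full_toy[is_train]
test_part = full_toy[~is_train].drop(columns="Survived")

print(len(train_part), len(test_part))
```

This keeps working if the number of training rows changes, for example after dropping outliers.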

5. Build Models

from sklearn.model_selection import train_test_split  # data splitting utilities

from sklearn.linear_model import LogisticRegression  # linear model: logistic regression
from sklearn.tree import DecisionTreeClassifier  # tree model: decision tree
from sklearn.ensemble import RandomForestClassifier  # ensemble model: random forest
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score  # cross-validation scores
from sklearn.metrics import confusion_matrix,precision_score,accuracy_score,mean_squared_error,classification_report  # model evaluation metrics

# split the labeled data into a training part and a validation part
t1_x,t2_x,t1_y,t2_y=train_test_split(x_train,y_train,test_size=0.3,random_state=11)

# candidate models
models=[LogisticRegression(),DecisionTreeClassifier(),RandomForestClassifier(),
        XGBClassifier(),LGBMClassifier(),KNeighborsClassifier(),SVC()]

# evaluate models by using cross-validation
names=['LR','Tree','RF','XGBC','LGBC','KNN','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,t1_x,t1_y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))

LR:0.7960387096774193,[0.768      0.832      0.856      0.80645161 0.71774194]
Tree:0.8057419354838709,[0.776      0.808      0.856      0.83064516 0.75806452]
RF:0.7993032258064515,[0.752      0.832      0.848      0.80645161 0.75806452]
XGBC:0.8137290322580645,[0.8        0.832      0.856      0.80645161 0.77419355]
LGBC:0.8217806451612903,[0.808      0.824      0.864      0.80645161 0.80645161]
KNN:0.7799870967741935,[0.76       0.816      0.832      0.7983871  0.69354839]
SVM:0.8073161290322581,[0.792      0.84       0.832      0.81451613 0.75806452]

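# repeat the 5-fold cross-validation on the held-out 30% split (each model is refit on folds of t2_x, so nothing trained on t1_x is scored here)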
names=['LR','Tree','RF','XGBC','LGBC','KNN','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,t2_x,t2_y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))

LR:0.8354996505939901,[0.83333333 0.90740741 0.85185185 0.83018868 0.75471698]
Tree:0.7908455625436759,[0.7962963  0.85185185 0.77777778 0.73584906 0.79245283]
RF:0.7986722571628231,[0.77777778 0.7962963  0.77777778 0.79245283 0.8490566 ]
XGBC:0.7986023759608665,[0.81481481 0.81481481 0.74074074 0.77358491 0.8490566 ]
LGBC:0.8245981830887491,[0.77777778 0.90740741 0.7962963  0.81132075 0.83018868]
KNN:0.801956673654787,[0.83333333 0.85185185 0.7962963  0.71698113 0.81132075]
SVM:0.8579315164220824,[0.85185185 0.92592593 0.87037037 0.81132075 0.83018868]
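
The second loop above cross-validates inside the held-out split rather than scoring models trained on the training split. A more usual use of a held-out split is to fit once on the training part and score once on the held-out part. The sketch below shows that pattern with logistic regression on synthetic data; the data, the variable names and the choice of model are assumptions for illustration, not results from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Same 70/30 split settings as in the article
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.3, random_state=11)

model = LogisticRegression().fit(X_fit, y_fit)               # train on the 70% part only
holdout_acc = accuracy_score(y_hold, model.predict(X_hold))  # score on the unseen 30%
print(round(holdout_acc, 3))
```

In the article's setting the same two lines would be `model.fit(t1_x, t1_y)` followed by scoring on `t2_x, t2_y`, which is what the voting classifier in the next section does.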

6. Model Ensembling

from sklearn.ensemble import VotingClassifier
LR = LogisticRegression()
Tree = DecisionTreeClassifier()
RF = RandomForestClassifier()
XGBC = XGBClassifier()
LGBC = LGBMClassifier()
KNN = KNeighborsClassifier()
SVM = SVC()
eclf=VotingClassifier([('LR',LR),('Tree',Tree),('RF',RF),('XGBC',XGBC),('LGBC',LGBC),
                       ('KNN',KNN),('SVM',SVM)],voting='hard',n_jobs=-1)
eclf.fit(t1_x,t1_y)
eclf.score(t2_x,t2_y)

0.8582089552238806
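
With `voting='hard'`, each classifier casts one vote per passenger and the most frequent label wins. The following NumPy sketch illustrates that rule on made-up predictions; the helper `hard_vote` is hypothetical and is not how scikit-learn implements it internally.

```python
import numpy as np

# Rows = classifiers, columns = passengers; made-up 0/1 predictions
preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
])

def hard_vote(predictions):
    """Return the most frequent label in each column."""
    voted = []
    for column in predictions.T:
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[np.argmax(counts)])   # on a tie, the smallest label wins
    return np.array(voted)

print(hard_vote(preds))
```

With the seven classifiers used above and two classes, a tie cannot occur, so the tie rule only matters when the number of voters is even.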

result=eclf.predict(test_)
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": result
    })
submission.to_csv('submission.csv', index=False)
