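A First Exploration of Titanic Survival Prediction

Titanic Survival Prediction

Contents

  1. Define the problem (Business Understanding)
  2. Understand the data (Data Understanding)
    • Collect the data
    • Import the data
    • Inspect the dataset
  3. Clean the data (Data Preparation)
    • Data preprocessing
    • Feature engineering
  4. Build the model (Modeling)
  5. Evaluate the model (Evaluation)
  6. Deploy the solution (Deployment)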

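1. Define the problem

What kinds of passengers were more likely to survive the sinking of the Titanic?

2. Understand the data

1) Collect the data

2) Import the data

3) Inspect the dataset

2.1 Collect the data

Download the Titanic dataset from Kaggle.

2.2 Import the data

We merge the training data and the test data so that both can be cleaned in one pass.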

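# Import the data-processing packages
import numpy as np
import pandas as pd
# Load the data
path='C:/Users/Titanic'
# Training dataset
train=pd.read_csv(path+'/train.csv')
# Test dataset
test=pd.read_csv(path+'/test.csv')
# Keep in mind for later: the training dataset has 891 rows
print('Training dataset:',train.shape,'Test dataset:',test.shape)
Training dataset: (891, 12) Test dataset: (418, 11)
rowNum_train=train.shape[0]
rowNum_test=test.shape[0]
print('Rows in the Kaggle training dataset:',rowNum_train,
     'Rows in the Kaggle test dataset:',rowNum_test)
Rows in the Kaggle training dataset: 891 Rows in the Kaggle test dataset: 418
# Merge the datasets so that both can be cleaned together.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here;
# sort=True orders the columns alphabetically, as in the outputs shown below.
full=pd.concat([train,test],ignore_index=True,sort=True)
print('Merged dataset:',full.shape)
Merged dataset: (1309, 12)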
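One consequence of the merge is worth checking: the test rows carry no Survived label, so the merged column holds NaN for them and is stored as float. A minimal sketch on toy frames (the values are made up and are not taken from the Kaggle files):

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Age": [22.0, 38.0, 26.0], "Survived": [0, 1, 1]})
test = pd.DataFrame({"PassengerId": [4, 5], "Age": [35.0, None]})

full = pd.concat([train, test], ignore_index=True, sort=True)

# Row counts add up after the merge
assert len(full) == len(train) + len(test)
# Test rows have no label, so Survived is NaN there and the column becomes float
n_missing_labels = full["Survived"].isna().sum()
print(full["Survived"].dtype, n_missing_labels)
```

This is why the merged Survived column later reports only 891 non-null values out of 1309.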

2.3 Inspect the dataset

# Preview the data
full.head()
   Age   Cabin  Embarked  Fare     Name                                             Parch  PassengerId  Pclass  Sex     SibSp  Survived  Ticket
0  22.0  NaN    S         7.2500   Braund, Mr. Owen Harris                          0      1            3       male    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      2            1       female  1      1.0       PC 17599
2  26.0  NaN    S         7.9250   Heikkinen, Miss. Laina                           0      3            3       female  0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)     0      4            1       female  1      1.0       113803
4  35.0  NaN    S         8.0500   Allen, Mr. William Henry                         0      5            3       male    0      0.0       373450

Embarked: port of embarkation
(S = Southampton, England; C = Cherbourg, France; Q = Queenstown, Ireland)
Fare: ticket fare
Parch: number of parents/children aboard (direct relatives of a different generation)
SibSp: number of siblings/spouses aboard (direct relatives of the same generation)
Pclass: passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
# Descriptive statistics for the numeric columns
full.describe()
       Age          Fare         Parch        PassengerId  Pclass       SibSp        Survived
count  1046.000000  1308.000000  1309.000000  1309.000000  1309.000000  1309.000000  891.000000
mean   29.881138    33.295479    0.385027     655.000000   2.294882     0.498854     0.383838
std    14.413493    51.758668    0.865560     378.020061   0.837836     1.041658     0.486592
min    0.170000     0.000000     0.000000     1.000000     1.000000     0.000000     0.000000
25%    21.000000    7.895800     0.000000     328.000000   2.000000     0.000000     0.000000
50%    28.000000    14.454200    0.000000     655.000000   3.000000     0.000000     0.000000
75%    39.000000    31.275000    0.000000     982.000000   3.000000     1.000000     1.000000
max    80.000000    512.329200   9.000000     1309.000000  3.000000     8.000000     1.000000
# Check each column's data type and non-null count
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

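The dataset has 1309 rows in total.
Some columns have missing values.
Numeric columns:
* Age: 1046 values, 263 missing, missing rate 263/1309 = 20%
* Fare: 1308 values, 1 missing
String columns:
* Embarked: 1307 values, 2 missing
* Cabin: 295 values, 1309-295 = 1014 missing, missing rate 1014/1309 = 77.5%, which is severe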
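The counts above were read off the full.info() output by hand. As a rough sketch, the same summary can be computed directly; the toy frame below is a hypothetical stand-in for full, and the name missing is illustrative only:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `full` (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Missing count per column, and the same count as a share of all rows
missing = pd.DataFrame({
    "missing": df.isna().sum(),
    "rate": df.isna().mean().round(3),
}).sort_values("missing", ascending=False)
print(missing)
```

Sorting by the missing count puts the most incomplete column first, which is where Cabin would land on the real data.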

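3. Clean the data

3.1 Data preprocessing

Handling missing values

'''
Start with the numeric columns Age and Fare.
The simplest way to handle missing values is to fill them with the column mean.
'''
print('Before:')
full.info()
# Age
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
print('After:')
full.info()
Before: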
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
After:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
# Check the data
full.head()
   Age   Cabin  Embarked  Fare     Name                                             Parch  PassengerId  Pclass  Sex     SibSp  Survived  Ticket
0  22.0  NaN    S         7.2500   Braund, Mr. Owen Harris                          0      1            3       male    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      2            1       female  1      1.0       PC 17599
2  26.0  NaN    S         7.9250   Heikkinen, Miss. Laina                           0      3            3       female  0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)     0      4            1       female  1      1.0       113803
4  35.0  NaN    S         8.0500   Allen, Mr. William Henry                         0      5            3       male    0      0.0       373450
'''
Handle the string columns with missing values: Embarked and Cabin
'''
# Embarked (port of embarkation): count the values in this column
from collections import Counter
Counter(full['Embarked'])
Counter({'C': 270, 'Q': 123, 'S': 914, nan: 2})
'''
Only two values are missing, so we fill them with the most frequent value, S
'''
full['Embarked']=full['Embarked'].fillna('S')
# Cabin: count the values in this column
Counter(full['Cabin'])
# Many values are missing and the cabin numbers are messy, so missing values are filled with U (unknown)
full['Cabin']=full['Cabin'].fillna('U')
# Check that the fill worked as expected
full.head()
   Age   Cabin  Embarked  Fare     Name                                             Parch  PassengerId  Pclass  Sex     SibSp  Survived  Ticket
0  22.0  U      S         7.2500   Braund, Mr. Owen Harris                          0      1            3       male    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      2            1       female  1      1.0       PC 17599
2  26.0  U      S         7.9250   Heikkinen, Miss. Laina                           0      3            3       female  0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)     0      4            1       female  1      1.0       113803
4  35.0  U      S         8.0500   Allen, Mr. William Henry                         0      5            3       male    0      0.0       373450
# Review the result of the missing-value handling
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
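The fill value S for Embarked was read off the Counter output by hand. As a hedged alternative, the most frequent value can be computed with Series.mode. The sketch below uses a toy column with hypothetical values rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy column standing in for full['Embarked'] (hypothetical values)
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# value_counts(dropna=False) lists the NaN count alongside the categories
print(embarked.value_counts(dropna=False))

# mode() ignores NaN by default; [0] takes the first mode if there is a tie
most_frequent = embarked.mode()[0]
filled = embarked.fillna(most_frequent)
```

Computing the fill value keeps the step correct if the data changes, at the cost of hiding the value from the reader.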

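3.2 Feature extraction

3.2.1 Classifying the data

full.info() shows the data type of every column. Data is usually divided into three kinds: numeric, time series, and categorical. String columns with no obvious categories, such as the name and the cabin number, are also treated as categorical here; later we consider whether features can be extracted from them.

1. Numeric:
PassengerId, Age, Fare, SibSp (same-generation relatives), Parch (different-generation relatives)
2. Time series: none
3. Categorical:
1) With explicit categories:
Sex, Embarked (port of embarkation), Pclass (passenger class)
2) Other string columns:
Name, Cabin (cabin number), Ticket (ticket number)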
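A first pass at this classification can be made from the column dtypes. The sketch below uses a toy frame with hypothetical values; note that dtype alone is not enough, since Pclass is stored as an integer but is treated as categorical in this walkthrough.

```python
import pandas as pd

# Toy frame with a few Titanic-like columns (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, 38.0],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
})

# Split the columns into numeric and non-numeric by dtype
numeric_cols = df.select_dtypes(include="number").columns.tolist()
string_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, string_cols)
```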

Categorical data: explicit categories

Sex

'''
Map the Sex values to numbers:
male maps to 1, female maps to 0
'''
sex_mapDict={'male':1,'female':0}
# map transforms each element of the Series; given a dict, it looks each value up in it
full['Sex']=full['Sex'].map(sex_mapDict)
full.head()
   Age   Cabin  Embarked  Fare     Name                                             Parch  PassengerId  Pclass  Sex  SibSp  Survived  Ticket
0  22.0  U      S         7.2500   Braund, Mr. Owen Harris                          0      1            3       1    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      2            1       0    1      1.0       PC 17599
2  26.0  U      S         7.9250   Heikkinen, Miss. Laina                           0      3            3       0    0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)     0      4            1       0    1      1.0       113803
4  35.0  U      S         8.0500   Allen, Mr. William Henry                         0      5            3       1    0      0.0       373450
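Mapping with a dict silently turns any value that is not a key into NaN, so a quick check after the mapping can be useful. A minimal sketch with a deliberately misspelled toy value (the data and the names sex_map and unmapped are hypothetical):

```python
import pandas as pd

# "Male" is a deliberate typo that the dict does not cover
sex = pd.Series(["male", "female", "Male", "female"])
sex_map = {"male": 1, "female": 0}

mapped = sex.map(sex_map)
# Values absent from the dict become NaN; list the originals that failed to map
unmapped = sex[mapped.isna()].unique().tolist()
print(unmapped)
```

The same check applies to any later dict-based mapping, such as grouping titles into categories.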

Port of embarkation (Embarked)

'''
One-hot encode with get_dummies to create dummy variables; the column-name prefix is Embarked
'''
# Holds the extracted features
embarkedDf=pd.DataFrame()

embarkedDf=pd.get_dummies(full['Embarked'],prefix='Embarked')
embarkedDf.head()
   Embarked_C  Embarked_Q  Embarked_S
0  0           0           1
1  1           0           0
2  0           0           1
3  0           0           1
4  0           0           1
# Add the dummy variables to the Titanic dataset full
full=pd.concat([full,embarkedDf],axis=1)
# Drop the original Embarked column
full.drop('Embarked',axis=1,inplace=True)
full.head()
   Age   Cabin  Fare     Name                                             Parch  PassengerId  Pclass  Sex  SibSp  Survived  Ticket            Embarked_C  Embarked_Q  Embarked_S
0  22.0  U      7.2500   Braund, Mr. Owen Harris                          0      1            3       1    1      0.0       A/5 21171         0           0           1
1  38.0  C85    71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      2            1       0    1      1.0       PC 17599          1           0           0
2  26.0  U      7.9250   Heikkinen, Miss. Laina                           0      3            3       0    0      1.0       STON/O2. 3101282  0           0           1
3  35.0  C123   53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)     0      4            1       0    1      1.0       113803            0           0           1
4  35.0  U      8.0500   Allen, Mr. William Henry                         0      5            3       1    0      0.0       373450            0           0           1

Passenger class (Pclass)

# One-hot encode Pclass in the same way, with the prefix Pclass
pclassDf=pd.DataFrame()
pclassDf=pd.get_dummies(full['Pclass'],prefix='Pclass')
pclassDf.head()
   Pclass_1  Pclass_2  Pclass_3
0  0         0         1
1  1         0         0
2  0         0         1
3  1         0         0
4  0         0         1
# Add the Pclass dummy variables to the dataset
full=pd.concat([full,pclassDf],axis=1)
# Drop the original Pclass column
full.drop('Pclass',axis=1,inplace=True)
full.head()
   Age   Cabin  Fare     Name                                             Parch  PassengerId  Sex  SibSp  Survived  Ticket            Embarked_C  Embarked_Q  Embarked_S  Pclass_1  Pclass_2  Pclass_3
0  22.0  U      7.2500   Braund, Mr. Owen Harris                          0      1            1    1      0.0       A/5 21171         0           0           1           0         0         1
1  38.0  C85    71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th…  0      2            0    1      1.0       PC 17599          1           0           0           1         0         0
2  26.0  U      7.9250   Heikkinen, Miss. Laina                           0      3            0    0      1.0       STON/O2. 3101282  0           0           1           0         0         1
3  35.0  C123   53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)     0      4            0    1      1.0       113803            0           0           1           1         0         0
4  35.0  U      8.0500   Allen, Mr. William Henry                         0      5            1    0      0.0       373450            0           0           1           0         0         1
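The encode, concat, and drop steps used for Embarked and Pclass can also be done in a single call. The sketch below is an alternative on a toy frame with hypothetical values, not the approach taken in this walkthrough, which keeps embarkedDf and pclassDf as separate frames for later feature selection:

```python
import pandas as pd

# Toy frame (hypothetical values)
df = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q"],
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# columns= encodes the listed columns and drops the originals in one call;
# dtype=int yields 0/1 instead of the True/False that newer pandas versions emit
encoded = pd.get_dummies(df, columns=["Embarked", "Pclass"], dtype=int)
print(encoded.columns.tolist())
```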

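Categorical data: string columns

Features extracted from string columns are also treated as categorical data. This group covers:

1. Name
2. Cabin (cabin number)
3. Ticket (ticket number)

Extracting the title from the name

'''
Looking at the names, every one contains a form of address or title, which can be extracted
'''
full['Name'].head()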
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
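'''
Each name has the form "surname, title. given names",
so we can split the string with split to get the title
'''
def getTitle(name):
    str1=name.split(',')[1]  # text after the comma, e.g. ' Mr. Owen Harris'
    str2=str1.split('.')[0]  # text before the first period, e.g. ' Mr'
    str3=str2.strip()  # strip leading and trailing whitespace
    return str3
# Holds the extracted features
titleDf=pd.DataFrame()
titleDf['Title']=full['Name'].map(getTitle)
titleDf.head()
   Title
0  Mr
1  Mrs
2  Miss
3  Mrs
4  Mr
# Inspect the extracted titles
titleDf['Title'].unique()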
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer', 'Dona'], dtype=object)
'''
Define the following title categories:
Officer: military officers, clergy, and doctors
Royalty: royalty and nobility
Mr: adult man
Mrs: married woman
Miss: unmarried woman or girl
Master: boy
and map each extracted title to one of them
'''
title_mapDict={  
                'Mr':               'Mr', 
                'Mrs':             'Mrs',
                'Miss':           'Miss',
                'Master':       'Master',
                'Don':         'Royalty',
                'Rev':         'Officer', 
                'Dr':          'Officer', 
                'Mme':             'Mrs', 
                'Ms':              'Mrs',
                'Major':       'Officer', 
                'Lady':        'Royalty',
                'Sir':         'Royalty', 
                'Mlle':           'Miss',
                'Col':         'Officer', 
                'Capt':        'Officer', 
                'the Countess':'Royalty',
                'Jonkheer':    'Royalty',
                'Dona':        'Royalty'
}
titleDf['Title']=titleDf['Title'].map(title_mapDict)
# One-hot encode
titleDf=pd.get_dummies(titleDf['Title'])
titleDf.head()
   Master  Miss  Mr  Mrs  Officer  Royalty
0  0       0     1   0    0        0
1  0       0     0   1    0        0
2  0       1     0   0    0        0
3  0       0     0   1    0        0
4  0       0     1   0    0        0
# Likewise, add the dummy variables derived from the name to the dataset
full=pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)
full.head()
   Age   Cabin  Fare     Parch  PassengerId  Sex  SibSp  Survived  Ticket            Embarked_C  ...  Embarked_S  Pclass_1  Pclass_2  Pclass_3  Master  Miss  Mr  Mrs  Officer  Royalty
0  22.0  U      7.2500   0      1            1    1      0.0       A/5 21171         0           ...  1           0         0         1         0       0     1   0    0        0
1  38.0  C85    71.2833  0      2            0    1      1.0       PC 17599          1           ...  0           1         0         0         0       0     0   1    0        0
2  26.0  U      7.9250   0      3            0    0      1.0       STON/O2. 3101282  0           ...  1           0         0         1         0       1     0   0    0        0
3  35.0  C123   53.1000  0      4            0    1      1.0       113803            0           ...  1           1         0         0         0       0     0   1    0        0
4  35.0  U      8.0500   0      5            1    0      0.0       373450            0           ...  1           0         0         1         0       0     1   0    0        0

5 rows × 21 columns
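The split-based title extraction can also be written as one vectorized regular expression. This is a sketch of an alternative, not the method used above; the sample names follow the dataset's "surname, title. given names" format, and the pattern is an assumption that relies on that format holding for every row:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the first comma and the first period after it
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A name that does not match the pattern yields NaN instead of raising an error, so it is worth checking for missing titles afterwards.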

Extracting the cabin category from the cabin number

# The cabin category is the first letter of the cabin number, so map each value to that letter
full['Cabin']=full['Cabin'].map(lambda c: c[0])
# One-hot encode, with the prefix Cabin
cabinDf=pd.DataFrame()
cabinDf=pd.get_dummies(full['Cabin'],prefix='Cabin')
cabinDf.head()
   Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0  0        0        0        0        0        0        0        0        1
1  0        0        1        0        0        0        0        0        0
2  0        0        0        0        0        0        0        0        1
3  0        0        1        0        0        0        0        0        0
4  0        0        0        0        0        0        0        0        1
# Add to the dataset
full=pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace=True)
full.head()
   Age   Fare     Parch  PassengerId  Sex  SibSp  Survived  Ticket            Embarked_C  Embarked_Q  ...  Royalty  Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0  22.0  7.2500   0      1            1    1      0.0       A/5 21171         0           0           ...  0        0        0        0        0        0        0        0        0        1
1  38.0  71.2833  0      2            0    1      1.0       PC 17599          1           0           ...  0        0        0        1        0        0        0        0        0        0
2  26.0  7.9250   0      3            0    0      1.0       STON/O2. 3101282  0           0           ...  0        0        0        0        0        0        0        0        0        1
3  35.0  53.1000  0      4            0    1      1.0       113803            0           0           ...  0        0        0        1        0        0        0        0        0        0
4  35.0  8.0500   0      5            1    0      0.0       373450            0           0           ...  0        0        0        0        0        0        0        0        0        1

5 rows × 29 columns

Building the family size and family category

familyDf=pd.DataFrame()
'''
Family size = different-generation relatives (Parch) + same-generation relatives (SibSp) + the passenger
'''
familyDf['FamilySize']=full['Parch']+full['SibSp']+1
'''
Family categories:
Travelling alone (Family_Single): family size = 1
Small family (Family_Small): 2 <= family size <= 4
Large family (Family_Large): family size >= 5
(dummy variables set by hand as needed)
'''
familyDf['Family_Single']=familyDf['FamilySize'].map(lambda s: 1 if s==1 else 0)
familyDf['Family_Small']=familyDf['FamilySize'].map(lambda s: 1 if 2<=s<=4 else 0)
familyDf['Family_Large']=familyDf['FamilySize'].map(lambda s: 1 if 5<=s else 0)
familyDf.head()
   FamilySize  Family_Single  Family_Small  Family_Large
0  2           0              1             0
1  2           0              1             0
2  1           1              0             0
3  2           0              1             0
4  1           1              0             0
# Add the variables to the dataset
full=pd.concat([full,familyDf],axis=1)
full.head()
   Age   Fare     Parch  PassengerId  Sex  SibSp  Survived  Ticket            Embarked_C  Embarked_Q  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  FamilySize  Family_Single  Family_Small  Family_Large
0  22.0  7.2500   0      1            1    1      0.0       A/5 21171         0           0           ...  0        0        0        0        0        1        2           0              1             0
1  38.0  71.2833  0      2            0    1      1.0       PC 17599          1           0           ...  0        0        0        0        0        0        2           0              1             0
2  26.0  7.9250   0      3            0    0      1.0       STON/O2. 3101282  0           0           ...  0        0        0        0        0        1        1           1              0             0
3  35.0  53.1000  0      4            0    1      1.0       113803            0           0           ...  0        0        0        0        0        0        2           0              1             0
4  35.0  8.0500   0      5            1    0      0.0       373450            0           0           ...  0        0        0        0        0        1        1           1              0             0

5 rows × 33 columns
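The three hand-written lambdas implement a binning of FamilySize. As a sketch of an equivalent formulation, pd.cut can express the same thresholds in one place; the sample sizes and the labels Single, Small, and Large are chosen here for illustration:

```python
import numpy as np
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 11], name="FamilySize")

# Bins are right-inclusive: (0, 1], (1, 4], (4, inf)
category = pd.cut(family_size, bins=[0, 1, 4, np.inf], labels=["Single", "Small", "Large"])
# prefix= produces Family_Single, Family_Small, Family_Large
family_dummies = pd.get_dummies(category, prefix="Family", dtype=int)
print(family_dummies)
```

Keeping the thresholds in a single bins list makes it harder for the three categories to overlap or leave a gap.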

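3.3 Feature selection

Correlation-coefficient method: compute the correlation coefficient between each pair of features

# Correlation matrix
# numeric_only=True (pandas >= 1.5) skips string columns such as Ticket,
# which newer pandas versions refuse to correlate
corrDf = full.corr(numeric_only=True)
corrDf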
                        Age          Fare         Parch   PassengerId           Sex         SibSp      Survived    Embarked_C    Embarked_Q    Embarked_S  ...       Cabin_D       Cabin_E       Cabin_F       Cabin_G       Cabin_T       Cabin_U    FamilySize Family_Single  Family_Small  Family_Large
Age                1.000000      0.171521     -0.130872      0.025731      0.057397     -0.190747     -0.070323      0.076179     -0.012718     -0.059153  ...      0.132886      0.106600     -0.072644     -0.085977      0.032461     -0.271918     -0.196996      0.116675     -0.038189     -0.161210
Fare               0.171521      1.000000      0.221522      0.031416     -0.185484      0.160224      0.257307      0.286241     -0.130054     -0.169894  ...      0.072737      0.073949     -0.037567     -0.022857      0.001179     -0.507197      0.226465     -0.274826      0.197281      0.170853
Parch             -0.130872      0.221522      1.000000      0.008942     -0.213125      0.373587      0.081629     -0.008635     -0.100943      0.071881  ...     -0.027385      0.001084      0.020481      0.058325     -0.012304     -0.036806      0.792296     -0.549022      0.248532      0.624627
PassengerId        0.025731      0.031416      0.008942      1.000000      0.013406     -0.055224     -0.005007      0.048101      0.011585     -0.049836  ...      0.000549     -0.008136      0.000306     -0.045949     -0.023049      0.000208     -0.031437      0.028546      0.002975     -0.063415
Sex                0.057397     -0.185484     -0.213125      0.013406      1.000000     -0.109609     -0.543351     -0.066564     -0.088651      0.115193  ...     -0.057396     -0.040340     -0.006655     -0.083285      0.020558      0.137396     -0.188583      0.284537     -0.255196     -0.077748
SibSp             -0.190747      0.160224      0.373587     -0.055224     -0.109609      1.000000     -0.035322     -0.048396     -0.048678      0.073709  ...     -0.015727     -0.027180     -0.008619      0.006015     -0.013247      0.009064      0.861952     -0.591077      0.253590      0.699681
Survived          -0.070323      0.257307      0.081629     -0.005007     -0.543351     -0.035322      1.000000      0.168240      0.003650     -0.149683  ...      0.150716      0.145321      0.057935      0.016040     -0.026456     -0.316912      0.016639     -0.203367      0.279855     -0.125147
Embarked_C         0.076179      0.286241     -0.008635      0.048101     -0.066564     -0.048396      0.168240      1.000000     -0.164166     -0.778262  ...      0.107782      0.027566     -0.020010     -0.031566     -0.014095     -0.258257     -0.036553     -0.107874      0.159594     -0.092825
Embarked_Q        -0.012718     -0.130054     -0.100943      0.011585     -0.088651     -0.048678      0.003650     -0.164166      1.000000     -0.491656  ...     -0.061459     -0.042877     -0.020282     -0.019941     -0.008904      0.142369     -0.087190      0.127214     -0.122491     -0.018423
Embarked_S        -0.059153     -0.169894      0.071881     -0.049836      0.115193      0.073709     -0.149683     -0.778262     -0.491656      1.000000  ...     -0.056023      0.002960      0.030575      0.040560      0.018111      0.137351      0.087771      0.014246     -0.062909      0.093671
Pclass_1           0.362587      0.599956     -0.013033      0.026495     -0.107371     -0.034256      0.285904      0.325722     -0.166101     -0.181800  ...      0.275698      0.242963     -0.073083     -0.035441      0.048310     -0.776987     -0.029656     -0.126551      0.165965     -0.067523
Pclass_2          -0.014193     -0.121372     -0.010057      0.022714     -0.028862     -0.052419      0.093349     -0.134675     -0.121973      0.196532  ...     -0.037929     -0.050210      0.127371     -0.032081     -0.014325      0.176485     -0.039976     -0.035075      0.097270     -0.118495
Pclass_3          -0.302093     -0.419616      0.019521     -0.041544      0.116562      0.072610     -0.322308     -0.171430      0.243706     -0.003805  ...     -0.207455     -0.169063     -0.041178      0.056964     -0.030057      0.527614      0.058430      0.138250     -0.223338      0.155560
Master            -0.363923      0.011596      0.253482      0.002254      0.164375      0.329171      0.085221     -0.014172     -0.009091      0.018297  ...     -0.042192      0.001860      0.058311     -0.013690     -0.006113      0.041178      0.355061     -0.265355      0.120166      0.301809
Miss              -0.254146      0.092051      0.066473     -0.050027     -0.672819      0.077564      0.332795     -0.014351      0.198804     -0.113886  ...     -0.012516      0.008700     -0.003088      0.061881     -0.013832     -0.004364      0.087350     -0.023890     -0.018085      0.083422
Mr                 0.165476     -0.192192     -0.304780      0.014116      0.870678     -0.243104     -0.549199     -0.065538     -0.080224      0.108924  ...     -0.030261     -0.032953     -0.026403     -0.072514      0.023611      0.131807     -0.326487      0.386262     -0.300872     -0.194207
Mrs                0.198091      0.139235      0.213491      0.033299     -0.571176      0.061643      0.344935      0.098379     -0.100374     -0.022950  ...      0.080393      0.045538      0.013376      0.042547     -0.011742     -0.162253      0.157233     -0.354649      0.361247      0.012893
Officer            0.162818      0.028696     -0.032631      0.002231      0.087288     -0.013813     -0.031316      0.003678     -0.003212     -0.001202  ...      0.006055     -0.024048     -0.017076     -0.008281     -0.003698     -0.067030     -0.026921      0.013303      0.003966     -0.034572
Royalty            0.059466      0.026214     -0.030197      0.004400     -0.020408     -0.010787      0.033391      0.077213     -0.021853     -0.054250  ...     -0.012950     -0.012202     -0.008665     -0.004202     -0.001876     -0.071672     -0.023600      0.008761     -0.000073     -0.017542
Cabin_A            0.125177      0.020094     -0.030707     -0.002831      0.047561     -0.039808      0.022287      0.094914     -0.042105     -0.056984  ...     -0.024952     -0.023510     -0.016695     -0.008096     -0.003615     -0.242399     -0.042967      0.045227     -0.029546     -0.033799
Cabin_B            0.113458      0.393743      0.073051      0.015895     -0.094453     -0.011569      0.175095      0.161595     -0.073613     -0.095790  ...     -0.043624     -0.041103     -0.029188     -0.014154     -0.006320     -0.423794      0.032318     -0.087912      0.084268      0.013470
Cabin_C            0.167993      0.401370      0.009601      0.006092     -0.077473      0.048616      0.114652      0.158043     -0.059151     -0.101861  ...     -0.053083     -0.050016     -0.035516     -0.017224     -0.007691     -0.515684      0.037226     -0.137498      0.141925      0.001362
Cabin_D            0.132886      0.072737     -0.027385      0.000549     -0.057396     -0.015727      0.150716      0.107782     -0.061459     -0.056023  ...      1.000000     -0.034317     -0.024369     -0.011817     -0.005277     -0.353822     -0.025313     -0.074310      0.102432     -0.049336
Cabin_E            0.106600      0.073949      0.001084     -0.008136     -0.040340     -0.027180      0.145321      0.027566     -0.042877      0.002960  ...     -0.034317      1.000000     -0.022961     -0.011135     -0.004972     -0.333381     -0.017285     -0.042535      0.068007     -0.046485
Cabin_F           -0.072644     -0.037567      0.020481      0.000306     -0.006655     -0.008619      0.057935     -0.020010     -0.020282      0.030575  ...     -0.024369     -0.022961      1.000000     -0.007907     -0.003531     -0.236733      0.005525      0.004055      0.012756     -0.033009
Cabin_G           -0.085977     -0.022857      0.058325     -0.045949     -0.083285      0.006015      0.016040     -0.031566     -0.019941      0.040560  ...     -0.011817     -0.011135     -0.007907      1.000000     -0.001712     -0.114803      0.035835     -0.076397      0.087471     -0.016008
Cabin_T            0.032461      0.001179     -0.012304     -0.023049      0.020558     -0.013247     -0.026456     -0.014095     -0.008904      0.018111  ...     -0.005277     -0.004972     -0.003531     -0.001712      1.000000     -0.051263     -0.015438      0.022411     -0.019574     -0.007148
Cabin_U           -0.271918     -0.507197     -0.036806      0.000208      0.137396      0.009064     -0.316912     -0.258257      0.142369      0.137351  ...     -0.353822     -0.333381     -0.236733     -0.114803     -0.051263      1.000000     -0.014155      0.175812     -0.211367      0.056438
FamilySize        -0.196996      0.226465      0.792296     -0.031437     -0.188583      0.861952      0.016639     -0.036553     -0.087190      0.087771  ...     -0.025313     -0.017285      0.005525      0.035835     -0.015438     -0.014155      1.000000     -0.688864      0.302640      0.801623
Family_Single      0.116675     -0.274826     -0.549022      0.028546      0.284537     -0.591077     -0.203367     -0.107874      0.127214      0.014246  ...     -0.074310     -0.042535      0.004055     -0.076397      0.022411      0.175812     -0.688864      1.000000     -0.873398     -0.318944
Family_Small      -0.038189      0.197281      0.248532      0.002975     -0.255196      0.253590      0.279855      0.159594     -0.122491     -0.062909  ...      0.102432      0.068007      0.012756      0.087471     -0.019574     -0.211367      0.302640     -0.873398      1.000000     -0.183007
Family_Large      -0.161210      0.170853      0.624627     -0.063415     -0.077748      0.699681     -0.125147     -0.092825     -0.018423      0.093671  ...     -0.049336     -0.046485     -0.033009     -0.016008     -0.007148      0.056438      0.801623     -0.318944     -0.183007      1.000000

32 rows × 32 columns

# Focus on the correlation with survival (Survived); ascending=False sorts in descending order
corrDf['Survived'].sort_values(ascending=False)
Survived         1.000000
Mrs              0.344935
Miss             0.332795
Pclass_1         0.285904
Family_Small     0.279855
Fare             0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
FamilySize       0.016639
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
SibSp           -0.035322
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Single   -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64

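Among the positive correlations, the title Mrs has the largest coefficient with survival (0.34), which is a moderate rather than a strong association. The largest coefficients in absolute value are negative: Mr (-0.55) and Sex (-0.54, where male = 1).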
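Because a strongly negative coefficient is as informative as a strongly positive one, ranking by absolute value can be a useful complement to the signed ordering. A minimal sketch on toy data (the values and the Noise column are hypothetical):

```python
import pandas as pd

# Toy data: Sex is negatively related to Survived, Fare positively, Noise weakly
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Sex":      [1, 1, 0, 0, 0, 1],
    "Fare":     [7.0, 8.0, 50.0, 30.0, 9.0, 10.0],
    "Noise":    [1, 2, 1, 2, 1, 2],
})

corr = toy.corr()["Survived"].drop("Survived")
# key=abs sorts by magnitude while keeping the signed values in the result
ranked = corr.sort_values(key=abs, ascending=False)
print(ranked)
```

On the real data this ordering would place Mr and Sex at the top, ahead of Mrs.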

Here we choose title (titleDf), passenger class (pclassDf), family size (familyDf), fare (Fare), cabin (cabinDf), port of embarkation (embarkedDf), and sex (Sex) as the model inputs.

# Feature selection
full_X=pd.concat([titleDf,
                 pclassDf,
                 familyDf,
                 full['Fare'],
                 cabinDf,
                 embarkedDf,
                 full['Sex']
                 ],axis=1)
full_X.head()
   Master  Miss  Mr  Mrs  Officer  Royalty  Pclass_1  Pclass_2  Pclass_3  FamilySize  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  Embarked_C  Embarked_Q  Embarked_S  Sex
0  0       0     1   0    0        0        0         0         1         2           ...  0        0        0        0        0        1        0           0           1           1
1  0       0     0   1    0        0        1         0         0         2           ...  0        0        0        0        0        0        1           0           0           0
2  0       1     0   0    0        0        0         0         1         1           ...  0        0        0        0        0        1        0           0           1           0
3  0       0     0   1    0        0        1         0         0         2           ...  0        0        0        0        0        0        0           0           1           0
4  0       0     1   0    0        0        0         0         1         1           ...  0        0        0        0        0        1        0           0           1           1

5 rows × 27 columns

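4. Build the model

4.1 Create the training and test datasets

# The original (labelled) dataset has 891 rows
sourceRow=891
# Original dataset: features
source_X=full_X.loc[0:sourceRow-1,:]
# Original dataset: labels
source_Y=full.loc[0:sourceRow-1,'Survived']
# Prediction dataset: features
pred_X=full_X.loc[sourceRow:,:]
# Confirm the selected datasets
print('Rows in the original dataset:',source_X.shape[0])
print('Rows in the prediction dataset:',pred_X.shape[0])
Rows in the original dataset: 891
Rows in the prediction dataset: 418
# Hold-out split; train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Create the training and test datasets used to build the model
# No random_state is set, so the split and the score reported later vary from run to run
train_X,test_X,train_Y,test_Y=train_test_split(source_X,source_Y,train_size=.8)
# Print the dataset sizes
print ('Original dataset features:',source_X.shape, 
       'Training dataset features:',train_X.shape ,
      'Test dataset features:',test_X.shape)

print ('Original dataset labels:',source_Y.shape, 
       'Training dataset labels:',train_Y.shape ,
      'Test dataset labels:',test_Y.shape)
Original dataset features: (891, 27) Training dataset features: (712, 27) Test dataset features: (179, 27)
Original dataset labels: (891,) Training dataset labels: (712,) Test dataset labels: (179,)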
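The split above is random, so repeated runs give different partitions. As a sketch of a reproducible variant, random_state fixes the shuffle and stratify keeps the class ratio equal in both parts. The arrays below are synthetic stand-ins, and the seed 0 is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and imbalanced labels (hypothetical): 60 zeros, 40 ones
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify=y keeps the 60/40 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```

Stratifying matters more on small test sets, where an unlucky split can shift the share of survivors noticeably.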

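4.2 Choose a machine learning algorithm

# Use logistic regression
# Step 1: import the algorithm
from sklearn.linear_model import LogisticRegression
# Step 2: create the model: logistic regression
model=LogisticRegression()

4.3 Train the model

# Step 3: train the model
model.fit(train_X,train_Y)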
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
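Once a logistic regression is fitted, its coefficients indicate which inputs push the prediction towards survival and which push against it. The sketch below fits a toy model where the outcome depends only on Sex; the data, the Noise column, and the name coefs are hypothetical and are not results from the Titanic model:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): survival depends on Sex only, Noise is unrelated
X = pd.DataFrame({"Sex": [1, 1, 1, 0, 0, 0, 1, 0], "Noise": [0, 1, 0, 1, 0, 1, 1, 0]})
y = [0, 0, 0, 1, 1, 1, 0, 1]

model = LogisticRegression().fit(X, y)
# coef_ has shape (1, n_features) for a binary problem; pair it with the column names
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

A negative coefficient for Sex (male = 1) means being male lowers the predicted probability of survival in this toy setup.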

5. Evaluate the model

# score returns the model's accuracy on the test dataset
model.score(test_X,test_Y)
0.8268156424581006
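A single hold-out score depends on which rows happened to land in the test set. As a hedged complement, k-fold cross-validation averages the accuracy over several splits. The sketch below uses synthetic data from make_classification as a stand-in for source_X and source_Y, so its numbers say nothing about the Titanic model; max_iter=1000 is chosen here only to avoid convergence warnings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for source_X / source_Y (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# cv=5 trains and scores the model on five different train/validation splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The standard deviation gives a rough sense of how much a single hold-out accuracy could move between runs.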

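6. Deploy the solution

Use the prediction dataset to generate predictions and save them to a CSV file.

# Use the trained model to predict survival for the prediction dataset
pred_Y=model.predict(pred_X)
'''
The predicted values are floats,
but Kaggle requires integers,
so convert the data type
'''
pred_Y=pred_Y.astype(int)
# Passenger IDs
passenger_id=full.loc[sourceRow:,'PassengerId']
# DataFrame: passenger ID and predicted survival
predDf=pd.DataFrame(
    {'PassengerId':passenger_id,
    'Survived':pred_Y})
predDf.shape
predDf.head()
# Save the result
predDf.to_csv(path+'/titanic_pred.csv',index=False)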
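Before uploading, it can help to read the file back and check its shape and types. The sketch below writes to an in-memory buffer instead of disk and uses three hypothetical predictions; the checks reflect the two-column layout built above, not an official Kaggle validator:

```python
import io
import pandas as pd

# Hypothetical predictions for three test passengers
pred = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0.0, 1.0, 0.0]})
pred["Survived"] = pred["Survived"].astype(int)

# Write to an in-memory buffer instead of disk, then read it back
buf = io.StringIO()
pred.to_csv(buf, index=False)
buf.seek(0)
check = pd.read_csv(buf)

# The file should hold exactly two columns, 0/1 labels, and unique IDs
assert list(check.columns) == ["PassengerId", "Survived"]
assert set(check["Survived"]) <= {0, 1}
assert check["PassengerId"].is_unique
```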