titanic_kaggle

Predicting Titanic Survival with Logistic Regression

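Contents

  1. Define the problem
  2. Understand the data
  • Collect the data
  • Import the data
  • Inspect the dataset
  3. Clean the data
  • Data preprocessing
  • Feature engineering
  4. Build the model
  5. Evaluate the model
  6. Deploy the solution
  • Submit the results to Kaggle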

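1. Define the problem

What kinds of passengers were more likely to survive the sinking of the Titanic?

2. Understand the data

2.1 Collect the data

Download the data from the Kaggle Titanic competition page: https://www.kaggle.com/c/titanic

2.2 Import the data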

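import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the training set
train = pd.read_csv("/Users/qxh/Desktop/titanic/train.csv")
# Load the test set
test = pd.read_csv("/Users/qxh/Desktop/titanic/test.csv")
print('Training set shape:',train.shape)
print('Test set shape:',test.shape)
Training set shape: (891, 12)
Test set shape: (418, 11)
# Combine the training and test sets so both are preprocessed the same way.
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# sort=True orders the columns alphabetically, as in the outputs below.
full = pd.concat([train, test], ignore_index=True, sort=True)
print('Combined dataset shape:',full.shape)
Combined dataset shape: (1309, 12)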

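2.3 Inspect the dataset

# Look at the data and what each feature means:
'''
Age: age
Cabin: cabin number
Embarked: port of embarkation
Fare: ticket fare
Name: passenger name
Parch: number of relatives of a different generation aboard (parents, children)
PassengerId: passenger ID
Pclass: ticket class
Sex: sex
SibSp: number of relatives of the same generation aboard (siblings, spouse)
Survived: whether the passenger survived
Ticket: ticket number
'''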
full.head()
    Age  Cabin  Embarked     Fare  Name                                               Parch  PassengerId  Pclass     Sex  SibSp  Survived  Ticket
0  22.0    NaN         S   7.2500  Braund, Mr. Owen Harris                                0            1       3    male      1       0.0  A/5 21171
1  38.0    C85         C  71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th...      0            2       1  female      1       1.0  PC 17599
2  26.0    NaN         S   7.9250  Heikkinen, Miss. Laina                                 0            3       3  female      0       1.0  STON/O2. 3101282
3  35.0   C123         S  53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)           0            4       1  female      1       1.0  113803
4  35.0    NaN         S   8.0500  Allen, Mr. William Henry                               0            5       3    male      0       0.0  373450
# Summary statistics for the numeric columns
full.describe()
               Age         Fare        Parch  PassengerId       Pclass        SibSp     Survived
count  1046.000000  1308.000000  1309.000000  1309.000000  1309.000000  1309.000000   891.000000
mean     29.881138    33.295479     0.385027   655.000000     2.294882     0.498854     0.383838
std      14.413493    51.758668     0.865560   378.020061     0.837836     1.041658     0.486592
min       0.170000     0.000000     0.000000     1.000000     1.000000     0.000000     0.000000
25%      21.000000     7.895800     0.000000   328.000000     2.000000     0.000000     0.000000
50%      28.000000    14.454200     0.000000   655.000000     3.000000     0.000000     0.000000
75%      39.000000    31.275000     0.000000   982.000000     3.000000     1.000000     1.000000
max      80.000000   512.329200     9.000000  1309.000000     3.000000     8.000000     1.000000
# Data type and non-null count of each column
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

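3. Clean the data

3.1 Data preprocessing

Handling missing values

The combined dataset has 1309 rows.
The columns with missing data are:

  • Age: 1046 values present, 263 missing; filled with the mean.
  • Fare: 1308 values present, 1 missing; filled with the mean.
  • Embarked (port of embarkation): 1307 values present, 2 missing; filled with the most frequent value.
  • Cabin (cabin number): 295 values present, 1014 missing; because so many are missing, they are filled with a new marker 'U' (unknown).
# Age: fill with the mean
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare: fill with the mean
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
# Embarked: find the most frequent value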
full['Embarked'].describe()
count     1307
unique       3
top          S
freq       914
Name: Embarked, dtype: object
full['Embarked']=full['Embarked'].fillna('S')
# Cabin: too many values are missing, so fill with 'U' (unknown)
full['Cabin']=full['Cabin'].fillna('U')
# Check the dataset again after filling the missing values
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
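
As a rough illustration of the fill strategy in this section, the sketch below counts missing values per column before and after filling. The toy DataFrame and its values are made up for the example; only the fill rules (mean for Age and Fare, most frequent value for Embarked, the 'U' marker for Cabin) come from the text.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `full`; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, "C123"],
})

# Number of missing values in each column before filling.
missing_before = toy.isnull().sum()

# Mean for the numeric columns, most frequent value for Embarked,
# and the placeholder 'U' for Cabin.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].mean())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
toy["Cabin"] = toy["Cabin"].fillna("U")

# Number of missing values in each column after filling.
missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

Using mode()[0] instead of a hard-coded 'S' keeps the fill value tied to the data, which may be convenient if the dataset changes.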

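3.2 Feature engineering

Group the 12 columns into the following categories by data type:

  1. Numeric:
  • Passenger ID (PassengerId)
  • Age (Age)
  • Fare (Fare)
  • Number of same-generation relatives aboard (SibSp)
  • Number of different-generation relatives aboard (Parch)
  2. Time series: none
  3. Categorical (already in categories):
  • Sex (Sex): male, female
  • Port of embarkation (Embarked): departure port S = Southampton, England; stop 1: C = Cherbourg, France; stop 2: Q = Queenstown, Ireland
  • Ticket class (Pclass): 1 = first class, 2 = second class, 3 = third class
  4. Categorical (free-form strings): features may be extracted from these
  • Passenger name (Name)
  • Cabin number (Cabin)
  • Ticket number (Ticket)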
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
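
One way to get a first cut of the grouping described in this section is to let pandas split the columns by dtype and then look at how many distinct values each non-numeric column has. This is only a sketch on a made-up three-row frame, not the procedure used in this post; the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows with a few of the Titanic columns.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina"],
})

# Numeric columns versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# A low number of distinct values (Sex) suggests a ready-made category;
# a value that is different in every row (Name) suggests free-form text.
cardinality = toy[other_cols].nunique()
print(numeric_cols, other_cols)
print(cardinality)
```

Note that dtype alone would put Pclass with the numeric columns, so the final grouping still needs a manual look at what each column means.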

3.2.1 Categorical data (already in categories)

For Sex, Embarked, and Pclass, take each category label and encode it with 0 and 1.

  • Sex (Sex)
sex_mapDict = {'male':1,'female':0}
# map: look up each value of the Series in the dictionary
full['Sex'] = full['Sex'].map(sex_mapDict)
  • Port of embarkation (Embarked)
embarkedDf = pd.DataFrame()
# One-hot encode with get_dummies
embarkedDf = pd.get_dummies(full['Embarked'],prefix='Embarked')
embarkedDf.head()
   Embarked_C  Embarked_Q  Embarked_S
0           0           0           1
1           1           0           0
2           0           0           1
3           0           0           1
4           0           0           1
full = pd.concat([full,embarkedDf],axis=1)
full.drop('Embarked',axis=1,inplace=True)
  • Ticket class (Pclass)
pcalssDf = pd.DataFrame()
pcalssDf = pd.get_dummies(full['Pclass'],prefix='Pclass')
pcalssDf.head()
   Pclass_1  Pclass_2  Pclass_3
0         0         0         1
1         1         0         0
2         0         0         1
3         1         0         0
4         0         0         1
full = pd.concat([full,pcalssDf],axis=1)
full.drop('Pclass',axis=1,inplace=True)
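
To make the one-hot encoding step concrete, here is a minimal sketch on a four-value Series. The sample values are made up; the dtype=int argument is an addition for this example, since recent pandas versions return boolean columns by default rather than the 0/1 integers shown in this post.

```python
import pandas as pd

# Four made-up embarkation codes.
ports = pd.Series(["S", "C", "S", "Q"], name="Embarked")

# One column per category; each row has exactly one 1.
dummies = pd.get_dummies(ports, prefix="Embarked", dtype=int)
print(dummies)
```

Each row sums to 1 because a passenger boards at exactly one port.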

3.2.2 Categorical data (free-form strings)

  • Extract the title from the name
full['Name'].head()
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
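# Extract the title from a name such as "Braund, Mr. Owen Harris"
def get_title(name):
    str1 = name.split(',')[1]   # the part after the surname: " Mr. Owen Harris"
    str2 = str1.split('.')[0]   # the part before the first period: " Mr"
    str3 = str2.strip()         # strip() removes leading and trailing whitespace
    return str3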
titleDf = pd.DataFrame()
titleDf['Title'] = full['Name'].map(get_title) 
titleDf.groupby('Title').count()
Title
Capt
Col
Don
Dona
Dr
Jonkheer
Lady
Major
Master
Miss
Mlle
Mme
Mr
Mrs
Ms
Rev
Sir
the Countess
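'''
Define the following title categories:
Officer: officers and professionals (Capt, Col, Major, Dr, Rev)
Royalty: royalty and nobility
Mr: adult man
Mrs: married woman
Miss: young or unmarried woman
Master: boy
'''
# Mapping from the title string found in the name to the title category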
title_mapDict = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"
                    }

titleDf['Title'] = titleDf['Title'].map(title_mapDict)
titleDf = pd.get_dummies(titleDf['Title'])
titleDf.head()
   Master  Miss  Mr  Mrs  Officer  Royalty
0       0     0   1    0        0        0
1       0     0   0    1        0        0
2       0     1   0    0        0        0
3       0     0   0    1        0        0
4       0     0   1    0        0        0
full = pd.concat([full,titleDf],axis=1)
full.drop('Name',axis=1,inplace=True)
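
The title extraction in this section splits each name on the comma and the first period. A vectorized alternative is sketched below with a regular expression; it is not the method used in this post, and the sample names and the pattern are assumptions chosen for the example.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())

# Titles missing from a mapping dictionary become NaN after .map(),
# so it is worth checking for them before one-hot encoding.
partial_map = {"Mr": "Mr", "Miss": "Miss"}
unmapped = titles[titles.map(partial_map).isnull()].tolist()
print(unmapped)
```

The second half shows why the mapping dictionary has to cover every title that occurs: any title that is left out silently turns into a missing value.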
  • Cabin number (Cabin)
full['Cabin'].head()
0       U
1     C85
2       U
3    C123
4       U
Name: Cabin, dtype: object
# The first letter of the cabin number is the cabin category
cabinDf = pd.DataFrame()
full['Cabin'] = full['Cabin'].map(lambda c : c[0])
full['Cabin'].head()
0    U
1    C
2    U
3    C
4    U
Name: Cabin, dtype: object
cabinDf = pd.get_dummies( full['Cabin'] , prefix = 'Cabin' )
cabinDf.head()
   Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0        0        0        0        0        0        0        0        0        1
1        0        0        1        0        0        0        0        0        0
2        0        0        0        0        0        0        0        0        1
3        0        0        1        0        0        0        0        0        0
4        0        0        0        0        0        0        0        0        1
full = pd.concat([full,cabinDf],axis=1)
full.drop('Cabin',axis=1,inplace= True)
full.head()
    Age     Fare  Parch  PassengerId  Pclass  Sex  SibSp  Survived            Ticket  Embarked_C  ...  Royalty  Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U
0  22.0   7.2500      0            1       3    1      1       0.0         A/5 21171           0  ...        0        0        0        0        0        0        0        0        0        1
1  38.0  71.2833      0            2       1    0      1       1.0          PC 17599           1  ...        0        0        0        1        0        0        0        0        0        0
2  26.0   7.9250      0            3       3    0      0       1.0  STON/O2. 3101282           0  ...        0        0        0        0        0        0        0        0        0        1
3  35.0  53.1000      0            4       1    0      1       1.0            113803           0  ...        0        0        0        1        0        0        0        0        0        0
4  35.0   8.0500      0            5       3    1      0       0.0            373450           0  ...        0        0        0        0        0        0        0        0        0        1

5 rows × 27 columns

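3.2.3 Numeric data

  • Family size and family category
# DataFrame to hold the family features
familyDf = pd.DataFrame()

'''
Family size = different-generation relatives (Parch) + same-generation relatives (SibSp) + the passenger
(the passenger is also a member of the family, so add 1)
'''
familyDf[ 'family_size' ] = full[ 'Parch' ] + full[ 'SibSp' ] + 1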

familyDf['family_size'].describe()
count    1309.000000
mean        1.883881
std         1.583639
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: family_size, dtype: float64
%matplotlib notebook
familyDf['family_size'].plot()
[Figure: line plot of family_size against row index]
'''
Family categories:
Single (family_single): family size = 1
Small family (family_small): 2 <= family size <= 4
Large family (family_large): family size >= 5
'''

familyDf['family_single'] = familyDf['family_size'].map(lambda s : 1 if s==1 else 0)
familyDf['family_small'] = familyDf['family_size'].map(lambda s : 1 if 2<=s<=4 else 0)
familyDf['family_large'] = familyDf['family_size'].map(lambda s : 1 if s>4 else 0)
familyDf.head()
   family_size  family_single  family_small  family_large
0            2              0             1             0
1            2              0             1             0
2            1              1             0             0
3            2              0             1             0
4            1              1             0             0
full = pd.concat([full,familyDf],axis=1)
full.drop([ 'Parch','SibSp','family_size' ],axis=1, inplace=True)
full.head()
    Age     Fare  PassengerId  Pclass  Sex  Survived            Ticket  Embarked_C  Embarked_Q  Embarked_S  ...  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  family_single  family_small  family_large
0  22.0   7.2500            1       3    1       0.0         A/5 21171           0           0           1  ...        0        0        0        0        0        0        1              0             1             0
1  38.0  71.2833            2       1    0       1.0          PC 17599           1           0           0  ...        1        0        0        0        0        0        0              0             1             0
2  26.0   7.9250            3       3    0       1.0  STON/O2. 3101282           0           0           1  ...        0        0        0        0        0        0        1              1             0             0
3  35.0  53.1000            4       1    0       1.0            113803           0           0           1  ...        1        0        0        0        0        0        0              0             1             0
4  35.0   8.0500            5       3    1       0.0            373450           0           0           1  ...        0        0        0        0        0        0        1              1             0             0

5 rows × 28 columns
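
The three family categories are built in this section with one lambda per column. As an alternative sketch, pd.cut can assign the category in a single step and get_dummies can then produce the same three indicator columns. The sample sizes are made up, and the bin edges are chosen to match the definitions given earlier (1, 2 to 4, 5 or more).

```python
import pandas as pd

# Made-up family sizes covering each category.
family_size = pd.Series([1, 2, 4, 5, 11], name="family_size")

# Bins are closed on the right: (0, 1], (1, 4], (4, inf).
category = pd.cut(family_size,
                  bins=[0, 1, 4, float("inf")],
                  labels=["single", "small", "large"])

# One indicator column per category, named like the columns in this post.
family_dummies = pd.get_dummies(category, prefix="family", dtype=int)
print(family_dummies)
```

Keeping the bin edges in one list makes it harder for the three conditions to overlap or leave a gap.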

  • Age (Age) and fare (Fare)

The ranges of Age and Fare are much wider than those of the other features, which only take the values 0 and 1, so both are standardized with StandardScaler to zero mean and unit variance. The scaled values are centered on 0 but are not confined to [-1, 1].

import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
# Series.reshape is deprecated; reshape the underlying NumPy array instead
len(full['Age'].values.reshape(-1,1))
1309
# Fit the scaler on Age and transform it in one step
full['Age_scaled'] = scaler.fit_transform(full['Age'].values.reshape(-1,1)).ravel()
full.head()
    Age     Fare  PassengerId  Pclass  Sex  Survived            Ticket  Embarked_C  Embarked_Q  Embarked_S  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  family_single  family_small  family_large  Age_scaled
0  22.0   7.2500            1       3    1       0.0         A/5 21171           0           0           1  ...        0        0        0        0        0        1              0             1             0   -0.611972
1  38.0  71.2833            2       1    0       1.0          PC 17599           1           0           0  ...        0        0        0        0        0        0              0             1             0    0.630431
2  26.0   7.9250            3       3    0       1.0  STON/O2. 3101282           0           0           1  ...        0        0        0        0        0        1              1             0             0   -0.301371
3  35.0  53.1000            4       1    0       1.0            113803           0           0           1  ...        0        0        0        0        0        0              0             1             0    0.397481
4  35.0   8.0500            5       3    1       0.0            373450           0           0           1  ...        0        0        0        0        0        1              1             0             0    0.397481

5 rows × 29 columns

full.drop([ 'Age'],axis=1, inplace=True)
full.head()
      Fare  PassengerId  Pclass  Sex  Survived            Ticket  Embarked_C  Embarked_Q  Embarked_S  Master  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  family_single  family_small  family_large  Age_scaled
0   7.2500            1       3    1       0.0         A/5 21171           0           0           1       0  ...        0        0        0        0        0        1              0             1             0   -0.611972
1  71.2833            2       1    0       1.0          PC 17599           1           0           0       0  ...        0        0        0        0        0        0              0             1             0    0.630431
2   7.9250            3       3    0       1.0  STON/O2. 3101282           0           0           1       0  ...        0        0        0        0        0        1              1             0             0   -0.301371
3  53.1000            4       1    0       1.0            113803           0           0           1       0  ...        0        0        0        0        0        0              0             1             0    0.397481
4   8.0500            5       3    1       0.0            373450           0           0           1       0  ...        0        0        0        0        0        1              1             0             0    0.397481

5 rows × 28 columns

# Standardize Fare in the same way
full['Fare_scaled'] = scaler.fit_transform(full['Fare'].values.reshape(-1,1)).ravel()
full.head()
      Fare  PassengerId  Pclass  Sex  Survived            Ticket  Embarked_C  Embarked_Q  Embarked_S  Master  ...  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  family_single  family_small  family_large  Age_scaled  Fare_scaled
0   7.2500            1       3    1       0.0         A/5 21171           0           0           1       0  ...        0        0        0        0        1              0             1             0   -0.611972    -0.503595
1  71.2833            2       1    0       1.0          PC 17599           1           0           0       0  ...        0        0        0        0        0              0             1             0    0.630431     0.734503
2   7.9250            3       3    0       1.0  STON/O2. 3101282           0           0           1       0  ...        0        0        0        0        1              1             0             0   -0.301371    -0.490544
3  53.1000            4       1    0       1.0            113803           0           0           1       0  ...        0        0        0        0        0              0             1             0    0.397481     0.382925
4   8.0500            5       3    1       0.0            373450           0           0           1       0  ...        0        0        0        0        1              1             0             0    0.397481    -0.488127

5 rows × 29 columns

full.drop([ 'Fare'],axis=1, inplace=True)
full.head()
   PassengerId  Pclass  Sex  Survived            Ticket  Embarked_C  Embarked_Q  Embarked_S  Master  Miss  ...  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  family_single  family_small  family_large  Age_scaled  Fare_scaled
0            1       3    1       0.0         A/5 21171           0           0           1       0     0  ...        0        0        0        0        1              0             1             0   -0.611972    -0.503595
1            2       1    0       1.0          PC 17599           1           0           0       0     0  ...        0        0        0        0        0              0             1             0    0.630431     0.734503
2            3       3    0       1.0  STON/O2. 3101282           0           0           1       0     1  ...        0        0        0        0        1              1             0             0   -0.301371    -0.490544
3            4       1    0       1.0            113803           0           0           1       0     0  ...        0        0        0        0        0              0             1             0    0.397481     0.382925
4            5       3    1       0.0            373450           0           0           1       0     0  ...        0        0        0        0        1              1             0             0    0.397481    -0.488127

5 rows × 28 columns

# Feature information after all processing
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 28 columns):
PassengerId      1309 non-null int64
Pclass           1309 non-null int64
Sex              1309 non-null int64
Survived         891 non-null float64
Ticket           1309 non-null object
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
Master           1309 non-null uint8
Miss             1309 non-null uint8
Mr               1309 non-null uint8
Mrs              1309 non-null uint8
Officer          1309 non-null uint8
Royalty          1309 non-null uint8
Cabin_A          1309 non-null uint8
Cabin_B          1309 non-null uint8
Cabin_C          1309 non-null uint8
Cabin_D          1309 non-null uint8
Cabin_E          1309 non-null uint8
Cabin_F          1309 non-null uint8
Cabin_G          1309 non-null uint8
Cabin_T          1309 non-null uint8
Cabin_U          1309 non-null uint8
family_single    1309 non-null int64
family_small     1309 non-null int64
family_large     1309 non-null int64
Age_scaled       1309 non-null float64
Fare_scaled      1309 non-null float64
dtypes: float64(3), int64(6), object(1), uint8(18)
memory usage: 125.4+ KB
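
To show what the standardization of Age and Fare does to a column, the sketch below applies StandardScaler to five made-up ages. It illustrates that the result has mean 0 and standard deviation 1, and that individual values can still fall outside [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five made-up ages as a single-column 2-D array, the shape scikit-learn expects.
ages = np.array([[2.0], [22.0], [26.0], [35.0], [80.0]])

scaler = StandardScaler()
ages_scaled = scaler.fit_transform(ages)

# Mean is (numerically) 0 and standard deviation is 1 ...
print(ages_scaled.mean(), ages_scaled.std())
# ... but the smallest and largest values lie outside [-1, 1].
print(ages_scaled.min(), ages_scaled.max())
```

If a bounded range is actually required, scikit-learn's MinMaxScaler with feature_range=(-1, 1) is one option; that is a different transformation from the one used in this post.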

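3.3 Feature selection

Compute the correlation coefficient between each feature and Survived, and select the features related to survival.

# Feature selection: correlation matrix of the numeric columns
# (on pandas >= 2.0, call full.corr(numeric_only=True) so the string column Ticket is skipped)
corrDf = full.corr()
corrDf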
                 PassengerId        Pclass           Sex      Survived    Embarked_C    Embarked_Q    Embarked_S        Master          Miss            Mr           ...       Cabin_E       Cabin_F       Cabin_G       Cabin_T       Cabin_U family_single  family_small  family_large    Age_scaled   Fare_scaled
PassengerId         1.000000     -0.038354      0.013406     -0.005007      0.048101      0.011585     -0.049836      0.002254     -0.050027      0.014116           ...     -0.008136      0.000306     -0.045949     -0.023049      0.000208      0.028546      0.002975     -0.063415      0.025731      0.031416
Pclass             -0.038354      1.000000      0.124617     -0.338481     -0.269658      0.230491      0.091320      0.095257      0.024487      0.121492           ...     -0.225649      0.013122      0.052133     -0.042750      0.713857      0.147393     -0.218303      0.127306     -0.366371     -0.558477
Sex                 0.013406      0.124617      1.000000     -0.543351     -0.066564     -0.088651      0.115193      0.164375     -0.672819      0.870678           ...     -0.040340     -0.006655     -0.083285      0.020558      0.137396      0.284537     -0.255196     -0.077748      0.057397     -0.185484
Survived           -0.005007     -0.338481     -0.543351      1.000000      0.168240      0.003650     -0.149683      0.085221      0.332795     -0.549199           ...      0.145321      0.057935      0.016040     -0.026456     -0.316912     -0.203367      0.279855     -0.125147     -0.070323      0.257307
Embarked_C          0.048101     -0.269658     -0.066564      0.168240      1.000000     -0.164166     -0.778262     -0.014172     -0.014351     -0.065538           ...      0.027566     -0.020010     -0.031566     -0.014095     -0.258257     -0.107874      0.159594     -0.092825      0.076179      0.286241
Embarked_Q          0.011585      0.230491     -0.088651      0.003650     -0.164166      1.000000     -0.491656     -0.009091      0.198804     -0.080224           ...     -0.042877     -0.020282     -0.019941     -0.008904      0.142369      0.127214     -0.122491     -0.018423     -0.012718     -0.130054
Embarked_S         -0.049836      0.091320      0.115193     -0.149683     -0.778262     -0.491656      1.000000      0.018297     -0.113886      0.108924           ...      0.002960      0.030575      0.040560      0.018111      0.137351      0.014246     -0.062909      0.093671     -0.059153     -0.169894
Master              0.002254      0.095257      0.164375      0.085221     -0.014172     -0.009091      0.018297      1.000000     -0.110595     -0.258902           ...      0.001860      0.058311     -0.013690     -0.006113      0.041178     -0.265355      0.120166      0.301809     -0.363923      0.011596
Miss               -0.050027      0.024487     -0.672819      0.332795     -0.014351      0.198804     -0.113886     -0.110595      1.000000     -0.585809           ...      0.008700     -0.003088      0.061881     -0.013832     -0.004364     -0.023890     -0.018085      0.083422     -0.254146      0.092051
Mr                  0.014116      0.121492      0.870678     -0.549199     -0.065538     -0.080224      0.108924     -0.258902     -0.585809      1.000000           ...     -0.032953     -0.026403     -0.072514      0.023611      0.131807      0.386262     -0.300872     -0.194207      0.165476     -0.192192
Mrs                 0.033299     -0.179945     -0.571176      0.344935      0.098379     -0.100374     -0.022950     -0.093887     -0.212435     -0.497310           ...      0.045538      0.013376      0.042547     -0.011742     -0.162253     -0.354649      0.361247      0.012893      0.198091      0.139235
Officer             0.002231     -0.137341      0.087288     -0.031316      0.003678     -0.003212     -0.001202     -0.029567     -0.066899     -0.156611           ...     -0.024048     -0.017076     -0.008281     -0.003698     -0.067030      0.013303      0.003966     -0.034572      0.162818      0.028696
Royalty             0.004400     -0.104916     -0.020408      0.033391      0.077213     -0.021853     -0.054250     -0.015002     -0.033945     -0.079466           ...     -0.012202     -0.008665     -0.004202     -0.001876     -0.071672      0.008761     -0.000073     -0.017542      0.059466      0.026214
Cabin_A            -0.002831     -0.202143      0.047561      0.022287      0.094914     -0.042105     -0.056984     -0.000711     -0.035697      0.015372           ...     -0.023510     -0.016695     -0.008096     -0.003615     -0.242399      0.045227     -0.029546     -0.033799      0.125177      0.020094
Cabin_B             0.015895     -0.353414     -0.094453      0.175095      0.161595     -0.073613     -0.095790     -0.017168      0.035069     -0.096776           ...     -0.041103     -0.029188     -0.014154     -0.006320     -0.423794     -0.087912      0.084268      0.013470      0.113458      0.393743
Cabin_C             0.006092     -0.430044     -0.077473      0.114652      0.158043     -0.059151     -0.101861     -0.047456     -0.013418     -0.068072           ...     -0.050016     -0.035516     -0.017224     -0.007691     -0.515684     -0.137498      0.141925      0.001362      0.167993      0.401370
Cabin_D             0.000549     -0.265341     -0.057396      0.150716      0.107782     -0.061459     -0.056023     -0.042192     -0.012516     -0.030261           ...     -0.034317     -0.024369     -0.011817     -0.005277     -0.353822     -0.074310      0.102432     -0.049336      0.132886      0.072737
Cabin_E            -0.008136     -0.225649     -0.040340      0.145321      0.027566     -0.042877      0.002960      0.001860      0.008700     -0.032953           ...      1.000000     -0.022961     -0.011135     -0.004972     -0.333381     -0.042535      0.068007     -0.046485      0.106600      0.073949
Cabin_F             0.000306      0.013122     -0.006655      0.057935     -0.020010     -0.020282      0.030575      0.058311     -0.003088     -0.026403           ...     -0.022961      1.000000     -0.007907     -0.003531     -0.236733      0.004055      0.012756     -0.033009     -0.072644     -0.037567
Cabin_G            -0.045949      0.052133     -0.083285      0.016040     -0.031566     -0.019941      0.040560     -0.013690      0.061881     -0.072514           ...     -0.011135     -0.007907      1.000000     -0.001712     -0.114803     -0.076397      0.087471     -0.016008     -0.085977     -0.022857
Cabin_T            -0.023049     -0.042750      0.020558     -0.026456     -0.014095     -0.008904      0.018111     -0.006113     -0.013832      0.023611           ...     -0.004972     -0.003531     -0.001712      1.000000     -0.051263      0.022411     -0.019574     -0.007148      0.032461      0.001179
Cabin_U             0.000208      0.713857      0.137396     -0.316912     -0.258257      0.142369      0.137351      0.041178     -0.004364      0.131807           ...     -0.333381     -0.236733     -0.114803     -0.051263      1.000000      0.175812     -0.211367      0.056438     -0.271918     -0.507197
family_single       0.028546      0.147393      0.284537     -0.203367     -0.107874      0.127214      0.014246     -0.265355     -0.023890      0.386262           ...     -0.042535      0.004055     -0.076397      0.022411      0.175812      1.000000     -0.873398     -0.318944      0.116675     -0.274826
family_small        0.002975     -0.218303     -0.255196      0.279855      0.159594     -0.122491     -0.062909      0.120166     -0.018085     -0.300872           ...      0.068007      0.012756      0.087471     -0.019574     -0.211367     -0.873398      1.000000     -0.183007     -0.038189      0.197281
family_large       -0.063415      0.127306     -0.077748     -0.125147     -0.092825     -0.018423      0.093671      0.301809      0.083422     -0.194207           ...     -0.046485     -0.033009     -0.016008     -0.007148      0.056438     -0.318944     -0.183007      1.000000     -0.161210      0.170853
Age_scaled          0.025731     -0.366371      0.057397     -0.070323      0.076179     -0.012718     -0.059153     -0.363923     -0.254146      0.165476           ...      0.106600     -0.072644     -0.085977      0.032461     -0.271918      0.116675     -0.038189     -0.161210      1.000000      0.171521
Fare_scaled         0.031416     -0.558477     -0.185484      0.257307      0.286241     -0.130054     -0.169894      0.011596      0.092051     -0.192192           ...      0.073949     -0.037567     -0.022857      0.001179     -0.507197     -0.274826      0.197281      0.170853      0.171521      1.000000

27 rows × 27 columns

'''
Look at the correlation of each feature with survival (Survived);
ascending=False sorts in descending order
'''
corrDf['Survived'].sort_values(ascending =False)
Survived         1.000000
Mrs              0.344935
Miss             0.332795
family_small     0.279855
Fare_scaled      0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Master           0.085221
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
Age_scaled      -0.070323
family_large    -0.125147
Embarked_S      -0.149683
family_single   -0.203367
Cabin_U         -0.316912
Pclass          -0.338481
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
full_x = pd.concat([ titleDf,# title
                     pcalssDf,# ticket class
                     familyDf,# family size
                     full['Fare_scaled'],# fare
                     full['Age_scaled'],# age
                     cabinDf,# cabin
                     embarkedDf,# port of embarkation
                     full['Sex']# sex
                    ],axis=1)
full_x.head()
   Master  Miss  Mr  Mrs  Officer  Royalty  Pclass_1  Pclass_2  Pclass_3  family_size  ...  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  Cabin_U  Embarked_C  Embarked_Q  Embarked_S  Sex
0       0     0   1    0        0        0         0         0         1            2  ...        0        0        0        0        0        1           0           0           1    1
1       0     0   0    1        0        0         1         0         0            2  ...        0        0        0        0        0        0           1           0           0    0
2       0     1   0    0        0        0         0         0         1            1  ...        0        0        0        0        0        1           0           0           1    0
3       0     0   0    1        0        0         1         0         0            2  ...        0        0        0        0        0        0           0           0           1    0
4       0     0   1    0        0        0         0         0         1            1  ...        0        0        0        0        0        1           0           0           1    1

5 rows × 28 columns
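
The correlation ranking used for feature selection sorts by signed value, so strongly negative features such as Sex and Mr end up at the bottom of the list. The sketch below ranks by absolute correlation instead. It runs on a made-up six-row frame and assumes pandas 1.5 or later for the numeric_only argument.

```python
import pandas as pd

# Made-up rows; Sex uses the same encoding as this post (1 = male, 0 = female).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex":      [1, 0, 0, 0, 1, 1],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
    "Ticket":   ["A/5 21171", "PC 17599", "STON/O2. 3101282",
                 "113803", "373450", "330877"],
})

# numeric_only=True skips string columns such as Ticket.
corr_with_target = (toy.corr(numeric_only=True)["Survived"]
                    .drop("Survived")
                    .sort_values(key=abs, ascending=False))
print(corr_with_target)
```

Ranking by magnitude puts the most informative features first regardless of the direction of the relationship.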

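4. Build the model

Fit a machine learning algorithm to the training data to obtain a model, then evaluate the model on test data.

4.1 Create the training and test datasets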

sourceRow = 891

source_x = full_x.loc[0:sourceRow-1, :]
source_y = full.loc[0:sourceRow-1,'Survived']

pred_x = full_x.loc[sourceRow:,:]
print('Training set shape:',source_x.shape)
Training set shape: (891, 28)
print('Kaggle test set shape:',pred_x.shape)
Kaggle test set shape: (418, 28)
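'''
Split the original labeled data (source) into a training set (used to fit the model)
and a test set (used to evaluate the model).
train_test_split randomly splits the samples into train and test subsets by proportion.
First argument: the feature matrix to split
Second argument: the labels to split
train_size / test_size: a float is the proportion of samples, an integer is the number of samples
'''

# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split

# Training and test data used to build the model
train_x, test_x, train_y, test_y = train_test_split(source_x ,
                                                    source_y,
                                                    train_size=.8)
# Print the dataset sizes
print ('Source features:',source_x.shape, 
       'training features:',train_x.shape ,
      'test features:',test_x.shape)

print ('Source labels:',source_y.shape, 
       'training labels:',train_y.shape ,
      'test labels:',test_y.shape)
Source features: (891, 28) training features: (712, 28) test features: (179, 28)
Source labels: (891,) training labels: (712,) test labels: (179,)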
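
The split in this section is random, so the rows in each subset, and therefore the score reported later, change from run to run. The sketch below shows two optional arguments that may help: random_state fixes the shuffle, and stratify keeps the class proportions similar in both subsets. The arrays are made up, and these arguments are not used in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each; 12 of class 0 and 8 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# 80/20 split, reproducible and stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With a fixed random_state, repeated runs produce the same split, which makes a score easier to compare across changes to the features.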

4.2 Choose a machine learning algorithm

# Logistic regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

4.3 Train the model

model.fit(train_x, train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
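
For readers who want to see what the fitted model computes, the sketch below reproduces the predicted probability of a binary logistic regression by hand: a linear score from the coefficients and intercept, passed through the sigmoid function. It uses synthetic data from make_classification rather than the Titanic features, and max_iter=1000 is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the Titanic features).
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score z = w . x + b, then the sigmoid 1 / (1 + exp(-z)).
z = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
p_sklearn = clf.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

The predicted label is 1 where this probability exceeds 0.5, which is the same as the linear score being positive.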

5. Evaluate the model

model.score(test_x , test_y )
0.82681564245810057
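
The score reported here is the accuracy on one random 20% hold-out, so it depends on which rows happened to land in that subset. A k-fold cross-validation gives a less split-dependent estimate; the sketch below shows the idea on synthetic data. The dataset, cv=5, and max_iter=1000 are assumptions for the example and are not part of this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for source_x / source_y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)

# Accuracy on each of 5 folds; every sample is used for testing exactly once.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single hold-out accuracy can vary.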

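6. Deploy the solution

# Predict on the Kaggle test set; the result is uploaded to Kaggle
pred_y = model.predict(pred_x)

'''
The predictions are floats (0.0, 1.0),
but Kaggle requires integers (0, 1),
so convert the data type
'''
pred_y=pred_y.astype(int)

# Passenger IDs
passenger_id = full.loc[sourceRow:,'PassengerId']
# DataFrame with the passenger ID and the predicted survival value
predDf = pd.DataFrame( 
    { 'PassengerId': passenger_id ,      'Survived': pred_y } )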

predDf.shape
(418, 2)
predDf.head()
     PassengerId  Survived
891          892         0
892          893         1
893          894         0
894          895         0
895          896         1
predDf.to_csv( '/Users/qxh/Desktop/titanic/titanic_pred.csv' , index = False )
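
Before uploading, it can help to check that the file has the layout this section builds: a PassengerId column and an integer Survived column, with no index column. The sketch below does this on three made-up rows and writes the CSV to a string instead of a file; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up IDs and float predictions, as produced when the labels were floats.
passenger_id = pd.Series([892, 893, 894], name="PassengerId")
pred_y = np.array([0.0, 1.0, 0.0])

submission = pd.DataFrame({
    "PassengerId": passenger_id.to_numpy(),
    "Survived": pred_y.astype(int),   # convert 0.0 / 1.0 to 0 / 1
})

# With no path argument, to_csv returns the CSV text.
csv_text = submission.to_csv(index=False)
lines = csv_text.splitlines()
print(lines[:2])
```

Passing index=False matters here: without it the row index would be written as an extra first column.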