Feature Engineering (8) Case Study (2): Predicting Titanic Survival with Logistic Regression

undo_try

Published 2023-04-18 21:27:41

Article link: https://blog.csdn.net/qq_44665283/article/details/130231639

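The Titanic divided its passengers into three classes: first, second, and third. Class determined the safety equipment, amenities, and dining available to a passenger, so it plausibly affected survival.
It was an age of gentlemen: during the disaster many men gave up their chance to escape so that women and children could go first, which makes sex and age likely factors as well.
From this background, a first guess is that cabin class, age, and sex all influenced survival.

Some groups were more likely to survive than others, such as women, children, and the upper class. What kind of passenger was most likely to survive the Titanic?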

The data can be downloaded from:
https://www.kaggle.com/competitions/titanic/data

1. Import the data

import warnings
warnings.filterwarnings('ignore')

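# Import the data-processing packages
import numpy as np
import pandas as pd

# Load the data
train_data = pd.read_csv("./titanic_data/train.csv")
test_data = pd.read_csv("./titanic_data/test.csv")

print('Training set:',train_data.shape,'Test set:',test_data.shape)

Training set: (891, 12) Test set: (418, 11)

# Combine the two datasets so both can be cleaned in one pass.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
full_data = pd.concat([train_data,test_data],ignore_index=True)
print('Combined dataset:',full_data.shape)

Combined dataset: (1309, 12)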

2. Inspect the dataset

# Preview the data
full_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0.0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1.0	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0.0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

# Descriptive statistics for the numeric columns
full_data.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	1309.000000	891.000000	1309.000000	1046.000000	1309.000000	1309.000000	1308.000000
mean	655.000000	0.383838	2.294882	29.881138	0.498854	0.385027	33.295479
std	378.020061	0.486592	0.837836	14.413493	1.041658	0.865560	51.758668
min	1.000000	0.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	328.000000	0.000000	2.000000	21.000000	0.000000	0.000000	7.895800
50%	655.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	982.000000	1.000000	3.000000	39.000000	1.000000	0.000000	31.275000
max	1309.000000	1.000000	3.000000	80.000000	8.000000	9.000000	512.329200

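describe() only summarizes numeric columns; other types, such as the string columns Name and Cabin, are left out.
That makes sense: descriptive statistics are computed from numbers, so a column has to be numeric to be included.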
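String columns can still be summarized by asking describe() to skip the numeric ones. A minimal sketch, assuming a small made-up frame named `toy` rather than the real dataset:

```python
import pandas as pd

# Toy frame standing in for full_data (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [None, "C85", None, "C123"],
})

# Default behaviour: only numeric columns are summarized
numeric_summary = toy.describe()

# Excluding numeric columns yields count / unique / top / freq for the string columns
object_summary = toy.describe(exclude=["number"])
print(object_summary)
```

For string columns the summary reports the number of non-null values, the number of distinct values, and the most frequent value with its frequency.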

# Show each column's data type and non-null count
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

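The dataset has 1309 rows in total.

Among the numeric columns, Age and Fare have missing values:

1) Age has 1046 non-null values, so 1309-1046=263 are missing, a missing rate of 263/1309≈20%
2) Fare has 1308 non-null values, so only 1 value is missing

Among the string columns:

1) Embarked has 1307 non-null values, so only 2 are missing, which is very few
2) Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309≈77.5%, which is large

This tells us where to focus the data cleaning: only once we know which columns have gaps can we handle them deliberately.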
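The counts and rates above were read off the info() output by hand; they can also be computed directly. A minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with a few gaps (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.28, None, 8.05],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() gives a boolean mask; sum() counts the gaps, mean() gives the missing rate
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "rate": toy.isnull().mean(),
}).sort_values("rate", ascending=False)
print(missing)
```

Run against full_data, the same two expressions should reproduce the 263 / 1 / 2 / 1014 counts listed above.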

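3. Data cleaning (Data Preparation)

3.1 Data preprocessing

Handling missing values:

In the data-understanding step above, we saw that the dataset has 1309 rows.

Numeric columns with missing values: Age and Fare.
String columns with missing values: Embarked and Cabin.

Many machine learning algorithms cannot be trained on features that contain null values, so the gaps have to be filled first. Common strategies:

1) For numeric data, fill with the mean
2) For categorical data, fill with the most frequent category
3) Predict the missing values with a model, for example K-NN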
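This article only uses the first two strategies. For the third, scikit-learn ships a `KNNImputer`; the following is a minimal sketch on made-up numbers, not the method used in the rest of this post (`toy` and `n_neighbors=2` are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame with one missing Age (illustrative values only)
toy = pd.DataFrame({
    "Age":   [22.0, 38.0, np.nan, 35.0, 4.0],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 21.07],
    "SibSp": [1, 1, 0, 1, 3],
})

# Each gap is filled with the mean of that column over the 2 nearest rows,
# where distance is measured on the columns that are present
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Because the distance depends on column scales, features such as Fare would normally be standardized before K-NN imputation.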

# 1. Age and Fare are both numeric, so fill their missing values with the column mean
full_data['Age'] = full_data['Age'].fillna(full_data['Age'].mean())

full_data['Fare'] = full_data['Fare'].fillna(full_data['Fare'].mean())

# Age and Fare now have no null values
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

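# 2. Fill the Embarked (port of embarkation) column
'''
Departure port: S = Southampton, England
Stop 1:         C = Cherbourg, France
Stop 2:         Q = Queenstown, Ireland
'''
# S turns out to be the most common category, so missing values are filled with it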
full_data['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

# Fill missing values with the most frequent value, S
full_data['Embarked'] = full_data['Embarked'].fillna('S')

# Embarked now has no null values
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
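The fill above hard-codes 'S' after reading value_counts() by eye. A small sketch of letting pandas pick the most frequent value instead, using a made-up Series (`embarked` here is a stand-in, not the real column):

```python
import pandas as pd

# Stand-in for full_data['Embarked'] with two gaps (illustrative values only)
embarked = pd.Series(["S", "C", None, "S", "Q", None])

# mode() ignores missing values and returns the most frequent value(s), most common first
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```

This keeps the code correct if the data changes and a different port becomes the most common one.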

# 3. Fill the Cabin (cabin number) column
full_data['Cabin'].value_counts()

C23 C25 C27        6
G6                 5
B57 B59 B63 B66    5
C22 C26            4
F33                4
                  ..
A14                1
E63                1
E12                1
E38                1
C105               1
Name: Cabin, Length: 186, dtype: int64

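# Too many values are missing, so fill them with U for unknown
full_data['Cabin'] = full_data['Cabin'].fillna('U')


# No feature column has null values now. Survived is the label column; its 418 nulls are the test-set rows, so it is left as is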
full_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        1309 non-null   object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

# Check that the data looks right
full_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0.0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	U	S
1	2	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1.0	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	U	S
3	4	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0.0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	U	S

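3.2 Feature extraction

The columns fall into three kinds of data. Categorical data is handled by replacing categories with numbers and applying one-hot encoding.

(1) Numeric:
PassengerId, Age, Fare, SibSp (number of siblings and spouses aboard, relatives of the same generation), Parch (number of parents and children aboard, relatives of a different generation)

(2) Time series: none
(3) Categorical:

1) Columns with explicit categories

  Sex: male, female
  Embarked (port of embarkation): departure port S = Southampton, England; stop 1: C = Cherbourg, France; stop 2: Q = Queenstown, Ireland
  Pclass (cabin class): 1 = first class, 2 = second class, 3 = third class

2) String columns: features may be extracted from these, so they are also treated as categorical

  Name (passenger name)
  Cabin (cabin number)
  Ticket (ticket number)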

3.2.1 Categorical data with explicit categories

# 1. Map Sex to numbers: male -> 1, female -> 0
sex_dict = {
    'male':1,
    'female':0
}

full_data['Sex'] = full_data['Sex'].map(sex_dict)
full_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0.0	3	Braund, Mr. Owen Harris	1	22.0	1	0	A/5 21171	7.2500	U	S
1	2	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1.0	3	Heikkinen, Miss. Laina	0	26.0	0	0	STON/O2. 3101282	7.9250	U	S
3	4	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	0	113803	53.1000	C123	S
4	5	0.0	3	Allen, Mr. William Henry	1	35.0	0	0	373450	8.0500	U	S

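# 2. One-hot encode Embarked
'''
Use get_dummies for one-hot encoding, which produces dummy variables
'''
# dtype=int keeps the 0/1 output on pandas 2.x, where get_dummies defaults to bool
embarkedDf = pd.get_dummies(full_data['Embarked'],prefix='Embarked',dtype=int)
embarkedDf.head()

	Embarked_C	Embarked_Q	Embarked_S
0	0	0	1
1	1	0	0
2	0	0	1
3	0	0	1
4	0	0	1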

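# Append the dummy variables produced by one-hot encoding to the original dataset
full_data = pd.concat([full_data,embarkedDf],axis=1)

'''
Embarked has been one-hot encoded into dummy variables, so the original Embarked column can be dropped.

How drop removes a column:
drop(name, axis=1) looks up the label name along axis 1, which is the column axis,
so the whole column with that label is removed (axis=0 would look for a row label instead).
In short, to drop several columns use: drop([col1, col2], axis=1)
'''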
full_data.drop('Embarked',axis=1,inplace=True)

full_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked_C	Embarked_Q	Embarked_S
0	1	0.0	3	Braund, Mr. Owen Harris	1	22.0	1	0	A/5 21171	7.2500	U	0	0	1
1	2	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	0	PC 17599	71.2833	C85	1	0	0
2	3	1.0	3	Heikkinen, Miss. Laina	0	26.0	0	0	STON/O2. 3101282	7.9250	U	0	0	1
3	4	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	0	113803	53.1000	C123	0	0	1
4	5	0.0	3	Allen, Mr. William Henry	1	35.0	0	0	373450	8.0500	U	0	0	1
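The encode, concatenate, drop sequence used for Embarked can be tried end to end on a tiny frame. A minimal sketch, assuming a made-up three-row frame named `toy`; `drop(columns=...)` is an equivalent spelling of `drop(..., axis=1)`:

```python
import pandas as pd

# Toy frame with one passenger per port (illustrative only)
toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Embarked": ["S", "C", "Q"]})

# One column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Attach the dummy columns, then remove the original string column
encoded = pd.concat([toy, dummies], axis=1).drop(columns="Embarked")
print(encoded)
```

Each row has exactly one 1 across the dummy columns, which is what "one-hot" refers to.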

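# 3. One-hot encode Pclass (cabin class)
# Pclass: 1 = first class, 2 = second class, 3 = third class


# dtype=int keeps the 0/1 output on pandas 2.x, where get_dummies defaults to bool
pclassDf = pd.get_dummies(full_data['Pclass'],prefix='Pclass',dtype=int)
pclassDf.head()

	Pclass_1	Pclass_2	Pclass_3
0	0	0	1
1	1	0	0
2	0	0	1
3	1	0	0
4	0	0	1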

# Append the dummy variables produced by one-hot encoding to the original dataset
full_data = pd.concat([full_data,pclassDf],axis=1)

full_data.drop('Pclass',axis=1,inplace=True)

full_data.head()

	PassengerId	Survived	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked_C	Embarked_Q	Embarked_S	Pclass_1	Pclass_2	Pclass_3
0	1	0.0	Braund, Mr. Owen Harris	1	22.0	1	0	A/5 21171	7.2500	U	0	0	1	0	0	1
1	2	1.0	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	0	PC 17599	71.2833	C85	1	0	0	1	0	0
2	3	1.0	Heikkinen, Miss. Laina	0	26.0	0	0	STON/O2. 3101282	7.9250	U	0	0	1	0	0	1
3	4	1.0	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	0	113803	53.1000	C123	0	0	1	1	0	0
4	5	0.0	Allen, Mr. William Henry	1	35.0	0	0	373450	8.0500	U	0	0	1	0	0	1

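3.2.2 Categorical data from string columns

# 1. Extract the title from the Name column
'''
Passenger names have one very noticeable feature:
every name contains a form of address, or title. Once extracted, it makes a useful new variable for prediction.
'''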
full_data['Name'].head(10)

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

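'''
Define a function that extracts the title from a name
'''
def getTitle(name):
    # "Braund, Mr. Owen Harris" -> " Mr. Owen Harris" -> " Mr" -> "Mr"
    after_comma = name.split(',')[1]
    title = after_comma.split('.')[0]
    return title.strip()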
titleDf = pd.DataFrame()
titleDf['Title'] = full_data['Name'].map(getTitle)
titleDf

	Title
0	Mr
1	Mrs
2	Miss
3	Mrs
4	Mr
...	...
1304	Mr
1305	Dona
1306	Mr
1307	Mr
1308	Master

1309 rows × 1 columns
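The same extraction can be written without a Python-level function by using a regular expression on the whole column. A minimal sketch, assuming the "Surname, Title. Given names" layout seen in the sample above; the pattern is my own and is not taken from the original code:

```python
import pandas as pd

# Three names copied from the sample output above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture everything between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A name that does not match the pattern yields NaN rather than raising an error, which makes malformed rows easy to find with isnull().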

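'''
Define the following title categories:
Officer  military officers, doctors, and clergy (Capt, Col, Major, Dr, Rev)
Royalty  royalty and nobility
Mr       adult man
Mrs      married woman
Miss     unmarried woman or girl
Master   boy (a title then used for young males)
'''

# Mapping from the title string found in a name to the title category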
title_dict = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Don": "Royalty",
    "Sir": "Royalty",
    "Jonkheer": "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Royalty",
    "Dona": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty"
}


titleDf['Title'] = titleDf['Title'].map(title_dict)

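# One-hot encode (dtype=int keeps the 0/1 output on pandas 2.x, where get_dummies defaults to bool)
titleDf = pd.get_dummies(titleDf['Title'],dtype=int)
titleDf.head()

	Master	Miss	Mr	Mrs	Officer	Royalty
0	0	0	1	0	0	0
1	0	0	0	1	0	0
2	0	1	0	0	0	0
3	0	0	0	1	0	0
4	0	0	1	0	0	0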
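One thing to watch with `map` and a dictionary: any title missing from the dictionary silently becomes NaN, and get_dummies then gives that row all zeros. A small sketch of a check for this, using a deliberately shortened copy of the mapping (`short_dict` and `raw` are made up for illustration):

```python
import pandas as pd

# Deliberately incomplete copy of the mapping, for illustration only
short_dict = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Mlle": "Miss"}
raw = pd.Series(["Mr", "Mlle", "Sir", "Mrs"])

mapped = raw.map(short_dict)

# Titles that were not covered by the dictionary end up as NaN after map()
unmapped = raw[mapped.isnull()].unique().tolist()
print(unmapped)
```

Running the same check against the full title_dict before encoding confirms that every title in the data is covered.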

# Append the one-hot columns to full_data and drop the Name column
full_data = pd.concat([full_data,titleDf],axis=1)

full_data.drop('Name',axis=1,inplace=True)
full_data

	PassengerId	Survived	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked_C	...	Embarked_S	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	Royalty
0	1	0.0	1	22.000000	1	0	A/5 21171	7.2500	U	0	...	1	0	0	1	0	0	1	0	0	0
1	2	1.0	0	38.000000	1	0	PC 17599	71.2833	C85	1	...	0	1	0	0	0	0	0	1	0	0
2	3	1.0	0	26.000000	0	0	STON/O2. 3101282	7.9250	U	0	...	1	0	0	1	0	1	0	0	0	0
3	4	1.0	0	35.000000	1	0	113803	53.1000	C123	0	...	1	1	0	0	0	0	0	1	0	0
4	5	0.0	1	35.000000	0	0	373450	8.0500	U	0	...	1	0	0	1	0	0	1	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1304	1305	NaN	1	29.881138	0	0	A.5. 3236	8.0500	U	0	...	1	0	0	1	0	0	1	0	0	0
1305	1306	NaN	0	39.000000	0	0	PC 17758	108.9000	C105	1	...	0	1	0	0	0	0	0	0	0	1
1306	1307	NaN	1	38.500000	0	0	SOTON/O.Q. 3101262	7.2500	U	0	...	1	0	0	1	0	0	1	0	0	0
1307	1308	NaN	1	29.881138	0	0	359309	8.0500	U	0	...	1	0	0	1	0	0	1	0	0	0
1308	1309	NaN	1	29.881138	1	1	2668	22.3583	U	1	...	0	0	0	1	1	0	0	0	0	0

1309 rows × 21 columns

# 2. Extract the deck letter from the Cabin column (the first character of the cabin number)
full_data['Cabin'] = full_data['Cabin'].map(lambda c:c[0])
full_data.head()

	PassengerId	Survived	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked_C	...	Embarked_S	Pclass_1	Pclass_2	Pclass_3	Master	Miss	Mr	Mrs	Officer	Royalty
0	1	0.0	1	22.0	1	0	A/5 21171	7.2500	U	0	...	1	0	0	1	0	0	1	0	0	0
1	2	1.0	0	38.0	1	0	PC 17599	71.2833	C	1	...	0	1	0	0	0	0	0	1	0	0
2	3	1.0	0	26.0	0	0	STON/O2. 3101282	7.9250	U	0	...	1	0	0	1	0	1	0	0	0	0
3	4	1.0	0	35.0	1	0	113803	53.1000	C	0	...	1	1	0	0	0	0	0	1	0	0
4	5	0.0	1	35.0	0	0	373450	8.0500	U	0	...	1	0	0	1	0	0	1	0	0	0

5 rows × 21 columns
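Taking the first character also works for passengers with several cabins, since entries such as "C23 C25 C27" all start with the same deck letter. A minimal sketch using the vectorized `.str[0]` accessor on made-up input (the Series `cabin` is a stand-in for the real column):

```python
import pandas as pd

# Stand-in for the Cabin column; the values appear in the value_counts() output above
cabin = pd.Series(["C85", None, "C23 C25 C27", "G6"])

# Fill gaps with U (unknown), then keep only the first character
deck = cabin.fillna("U").str[0]
print(deck.tolist())
```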

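# One-hot encode (dtype=int keeps the 0/1 output on pandas 2.x, where get_dummies defaults to bool)
cabinDf = pd.get_dummies(full_data['Cabin'],prefix='Cabin',dtype=int)
cabinDf.head()

	Cabin_A	Cabin_B	Cabin_C	Cabin_D	Cabin_E	Cabin_F	Cabin_G	Cabin_T	Cabin_U
0	0	0	0	0	0	0	0	0	1
1	0	0	1	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	1
3	0	0	1	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	1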

full_data = pd.concat([full_data,cabinDf],axis=1)

full_data.drop('Cabin',axis=1,inplace=True)
full_data.head()

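	PassengerId	Survived	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked_C	Embarked_Q	...	Royalty	Cabin_A	Cabin_B	Cabin_C	Cabin_D	Cabin_E	Cabin_F	Cabin_G	Cabin_T	Cabin_U
0	1	0.0	1	22.0	1	0	A/5 21171	7.2500	0	0	...	0	0	0	0	0	0	0	0	0	1
1	2	1.0	0	38.0	1	0	PC 17599	71.2833	1	0	...	0	0	0	1	0	0	0	0	0	0
2	3	1.0	0	26.0	0	0	STON/O2. 3101282	7.9250	0	0	...	0	0	0	0	0	0	0	0	0	1
3	4	1.0	0	35.0	1	0	113803	53.1000	0	0	...	0	0	0	1	0	0	0	0	0	0
4	5	0.0	1	35.0	0	0	373450	8.0500	0	0	...	0	0	0	0	0	0	0	0	0	1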

5 rows × 29 columns

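# 3. Build family size and family category features
familyDf = pd.DataFrame()

'''
Family size = parents and children aboard (Parch, a different generation) + siblings and spouses aboard (SibSp, the same generation) + the passenger
'''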


familyDf['FamilySize'] = full_data['Parch'] + full_data['SibSp'] + 1

familyDf.head()

	FamilySize
0	2
1	2
2	1
3	2
4	1

'''
Family categories
Family_Small  (small family):  family size = 1
Family_Middle (medium family): 2 <= family size <= 4
Family_Large  (large family):  family size >= 5
'''


familyDf['Family_Small']  =  familyDf['FamilySize'].map(lambda cnt: 1 if cnt == 1 else 0 )
familyDf['Family_Middle'] =  familyDf['FamilySize'].map(lambda cnt: 1 if 2 <= cnt <= 4 else 0 )
familyDf['Family_Large']  =  familyDf['FamilySize'].map(lambda cnt: 1 if cnt >= 5 else 0 )


familyDf.head()

	FamilySize	Family_Small	Family_Middle	Family_Large
0	2	0	1	0
1	2	0	1	0
2	1	1	0	0
3	2	0	1	0
4	1	1	0	0
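The three lambda-based columns amount to binning FamilySize into ranges. As an alternative sketch (not the approach used in this post), `pd.cut` can express the same bins in one call; the Series `family_size` is made up for illustration:

```python
import pandas as pd

# Illustrative family sizes covering every bin and both bin edges
family_size = pd.Series([1, 2, 4, 5, 11])

# Right-inclusive bins: (0, 1] small, (1, 4] medium, (4, inf) large
category = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Family_Small", "Family_Middle", "Family_Large"],
)

# One 0/1 column per category, in the order the labels were given
family_dummies = pd.get_dummies(category, dtype=int)
print(family_dummies)
```

Keeping the bin edges in a single list makes it harder to leave a gap or an overlap between categories than with three separate conditions.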

# Append to full_data
full_data = pd.concat([full_data,familyDf],axis=1)

full_data.head()

	PassengerId	Survived	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked_C	Embarked_Q	...	Cabin_D	Cabin_E	Cabin_F	Cabin_G	Cabin_T	Cabin_U	FamilySize	Family_Small	Family_Middle	Family_Large
0	1	0.0	1	22.0	1	0	A/5 21171	7.2500	0	0	...	0	0	0	0	0	1	2	0	1	0
1	2	1.0	0	38.0	1	0	PC 17599	71.2833	1	0	...	0	0	0	0	0	0	2	0	1	0
2	3	1.0	0	26.0	0	0	STON/O2. 3101282	7.9250	0	0	...	0	0	0	0	0	1	1	1	0	0
3	4	1.0	0	35.0	1	0	113803	53.1000	0	0	...	0	0	0	0	0	0	2	0	1	0
4	5	0.0	1	35.0	0	0	373450	8.0500	0	0	...	0	0	0	0	0	1	1	1	0	0

5 rows × 33 columns

# Current shape of the feature table
full_data.shape

(1309, 33)

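3.3 Feature selection

# Correlation matrix. Ticket is still a string column, and pandas 2.0+ raises an error
# on it instead of silently skipping it, so drop it before calling corr().
corrDf = full_data.drop('Ticket',axis=1).corr()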
corrDf

	PassengerId	Survived	Sex	Age	SibSp	Parch	Fare	Embarked_C	Embarked_Q	Embarked_S	...	Cabin_D	Cabin_E	Cabin_F	Cabin_G	Cabin_T	Cabin_U	FamilySize	Family_Small	Family_Middle	Family_Large
PassengerId	1.000000	-0.005007	0.013406	0.025731	-0.055224	0.008942	0.031416	0.048101	0.011585	-0.049836	...	0.000549	-0.008136	0.000306	-0.045949	-0.023049	0.000208	-0.031437	0.028546	0.002975	-0.063415
Survived	-0.005007	1.000000	-0.543351	-0.070323	-0.035322	0.081629	0.257307	0.168240	0.003650	-0.149683	...	0.150716	0.145321	0.057935	0.016040	-0.026456	-0.316912	0.016639	-0.203367	0.279855	-0.125147
Sex	0.013406	-0.543351	1.000000	0.057397	-0.109609	-0.213125	-0.185484	-0.066564	-0.088651	0.115193	...	-0.057396	-0.040340	-0.006655	-0.083285	0.020558	0.137396	-0.188583	0.284537	-0.255196	-0.077748
Age	0.025731	-0.070323	0.057397	1.000000	-0.190747	-0.130872	0.171521	0.076179	-0.012718	-0.059153	...	0.132886	0.106600	-0.072644	-0.085977	0.032461	-0.271918	-0.196996	0.116675	-0.038189	-0.161210
SibSp	-0.055224	-0.035322	-0.109609	-0.190747	1.000000	0.373587	0.160224	-0.048396	-0.048678	0.073709	...	-0.015727	-0.027180	-0.008619	0.006015	-0.013247	0.009064	0.861952	-0.591077	0.253590	0.699681
Parch	0.008942	0.081629	-0.213125	-0.130872	0.373587	1.000000	0.221522	-0.008635	-0.100943	0.071881	...	-0.027385	0.001084	0.020481	0.058325	-0.012304	-0.036806	0.792296	-0.549022	0.248532	0.624627
Fare	0.031416	0.257307	-0.185484	0.171521	0.160224	0.221522	1.000000	0.286241	-0.130054	-0.169894	...	0.072737	0.073949	-0.037567	-0.022857	0.001179	-0.507197	0.226465	-0.274826	0.197281	0.170853
Embarked_C	0.048101	0.168240	-0.066564	0.076179	-0.048396	-0.008635	0.286241	1.000000	-0.164166	-0.778262	...	0.107782	0.027566	-0.020010	-0.031566	-0.014095	-0.258257	-0.036553	-0.107874	0.159594	-0.092825
Embarked_Q	0.011585	0.003650	-0.088651	-0.012718	-0.048678	-0.100943	-0.130054	-0.164166	1.000000	-0.491656	...	-0.061459	-0.042877	-0.020282	-0.019941	-0.008904	0.142369	-0.087190	0.127214	-0.122491	-0.018423
Embarked_S	-0.049836	-0.149683	0.115193	-0.059153	0.073709	0.071881	-0.169894	-0.778262	-0.491656	1.000000	...	-0.056023	0.002960	0.030575	0.040560	0.018111	0.137351	0.087771	0.014246	-0.062909	0.093671
Pclass_1	0.026495	0.285904	-0.107371	0.362587	-0.034256	-0.013033	0.599956	0.325722	-0.166101	-0.181800	...	0.275698	0.242963	-0.073083	-0.035441	0.048310	-0.776987	-0.029656	-0.126551	0.165965	-0.067523
Pclass_2	0.022714	0.093349	-0.028862	-0.014193	-0.052419	-0.010057	-0.121372	-0.134675	-0.121973	0.196532	...	-0.037929	-0.050210	0.127371	-0.032081	-0.014325	0.176485	-0.039976	-0.035075	0.097270	-0.118495
Pclass_3	-0.041544	-0.322308	0.116562	-0.302093	0.072610	0.019521	-0.419616	-0.171430	0.243706	-0.003805	...	-0.207455	-0.169063	-0.041178	0.056964	-0.030057	0.527614	0.058430	0.138250	-0.223338	0.155560
Master	0.002254	0.085221	0.164375	-0.363923	0.329171	0.253482	0.011596	-0.014172	-0.009091	0.018297	...	-0.042192	0.001860	0.058311	-0.013690	-0.006113	0.041178	0.355061	-0.265355	0.120166	0.301809
Miss	-0.050027	0.332795	-0.672819	-0.254146	0.077564	0.066473	0.092051	-0.014351	0.198804	-0.113886	...	-0.012516	0.008700	-0.003088	0.061881	-0.013832	-0.004364	0.087350	-0.023890	-0.018085	0.083422
Mr	0.014116	-0.549199	0.870678	0.165476	-0.243104	-0.304780	-0.192192	-0.065538	-0.080224	0.108924	...	-0.030261	-0.032953	-0.026403	-0.072514	0.023611	0.131807	-0.326487	0.386262	-0.300872	-0.194207
Mrs	0.033299	0.344935	-0.571176	0.198091	0.061643	0.213491	0.139235	0.098379	-0.100374	-0.022950	...	0.080393	0.045538	0.013376	0.042547	-0.011742	-0.162253	0.157233	-0.354649	0.361247	0.012893
Officer	0.002231	-0.031316	0.087288	0.162818	-0.013813	-0.032631	0.028696	0.003678	-0.003212	-0.001202	...	0.006055	-0.024048	-0.017076	-0.008281	-0.003698	-0.067030	-0.026921	0.013303	0.003966	-0.034572
Royalty	0.004400	0.033391	-0.020408	0.059466	-0.010787	-0.030197	0.026214	0.077213	-0.021853	-0.054250	...	-0.012950	-0.012202	-0.008665	-0.004202	-0.001876	-0.071672	-0.023600	0.008761	-0.000073	-0.017542
Cabin_A	-0.002831	0.022287	0.047561	0.125177	-0.039808	-0.030707	0.020094	0.094914	-0.042105	-0.056984	...	-0.024952	-0.023510	-0.016695	-0.008096	-0.003615	-0.242399	-0.042967	0.045227	-0.029546	-0.033799
Cabin_B	0.015895	0.175095	-0.094453	0.113458	-0.011569	0.073051	0.393743	0.161595	-0.073613	-0.095790	...	-0.043624	-0.041103	-0.029188	-0.014154	-0.006320	-0.423794	0.032318	-0.087912	0.084268	0.013470
Cabin_C	0.006092	0.114652	-0.077473	0.167993	0.048616	0.009601	0.401370	0.158043	-0.059151	-0.101861	...	-0.053083	-0.050016	-0.035516	-0.017224	-0.007691	-0.515684	0.037226	-0.137498	0.141925	0.001362
Cabin_D	0.000549	0.150716	-0.057396	0.132886	-0.015727	-0.027385	0.072737	0.107782	-0.061459	-0.056023	...	1.000000	-0.034317	-0.024369	-0.011817	-0.005277	-0.353822	-0.025313	-0.074310	0.102432	-0.049336
Cabin_E	-0.008136	0.145321	-0.040340	0.106600	-0.027180	0.001084	0.073949	0.027566	-0.042877	0.002960	...	-0.034317	1.000000	-0.022961	-0.011135	-0.004972	-0.333381	-0.017285	-0.042535	0.068007	-0.046485
Cabin_F	0.000306	0.057935	-0.006655	-0.072644	-0.008619	0.020481	-0.037567	-0.020010	-0.020282	0.030575	...	-0.024369	-0.022961	1.000000	-0.007907	-0.003531	-0.236733	0.005525	0.004055	0.012756	-0.033009
Cabin_G	-0.045949	0.016040	-0.083285	-0.085977	0.006015	0.058325	-0.022857	-0.031566	-0.019941	0.040560	...	-0.011817	-0.011135	-0.007907	1.000000	-0.001712	-0.114803	0.035835	-0.076397	0.087471	-0.016008
Cabin_T	-0.023049	-0.026456	0.020558	0.032461	-0.013247	-0.012304	0.001179	-0.014095	-0.008904	0.018111	...	-0.005277	-0.004972	-0.003531	-0.001712	1.000000	-0.051263	-0.015438	0.022411	-0.019574	-0.007148
Cabin_U	0.000208	-0.316912	0.137396	-0.271918	0.009064	-0.036806	-0.507197	-0.258257	0.142369	0.137351	...	-0.353822	-0.333381	-0.236733	-0.114803	-0.051263	1.000000	-0.014155	0.175812	-0.211367	0.056438
FamilySize	-0.031437	0.016639	-0.188583	-0.196996	0.861952	0.792296	0.226465	-0.036553	-0.087190	0.087771	...	-0.025313	-0.017285	0.005525	0.035835	-0.015438	-0.014155	1.000000	-0.688864	0.302640	0.801623
Family_Small	0.028546	-0.203367	0.284537	0.116675	-0.591077	-0.549022	-0.274826	-0.107874	0.127214	0.014246	...	-0.074310	-0.042535	0.004055	-0.076397	0.022411	0.175812	-0.688864	1.000000	-0.873398	-0.318944
Family_Middle	0.002975	0.279855	-0.255196	-0.038189	0.253590	0.248532	0.197281	0.159594	-0.122491	-0.062909	...	0.102432	0.068007	0.012756	0.087471	-0.019574	-0.211367	0.302640	-0.873398	1.000000	-0.183007
Family_Large	-0.063415	-0.125147	-0.077748	-0.161210	0.699681	0.624627	0.170853	-0.092825	-0.018423	0.093671	...	-0.049336	-0.046485	-0.033009	-0.016008	-0.007148	0.056438	0.801623	-0.318944	-0.183007	1.000000

32 rows × 32 columns

'''
Correlation of each feature with Survived, sorted in descending order
'''
corrDf['Survived'].sort_values(ascending=False)

Survived         1.000000
Mrs              0.344935
Miss             0.332795
Pclass_1         0.285904
Family_Middle    0.279855
Fare             0.257307
Cabin_B          0.175095
Embarked_C       0.168240
Cabin_D          0.150716
Cabin_E          0.145321
Cabin_C          0.114652
Pclass_2         0.093349
Master           0.085221
Parch            0.081629
Cabin_F          0.057935
Royalty          0.033391
Cabin_A          0.022287
FamilySize       0.016639
Cabin_G          0.016040
Embarked_Q       0.003650
PassengerId     -0.005007
Cabin_T         -0.026456
Officer         -0.031316
SibSp           -0.035322
Age             -0.070323
Family_Large    -0.125147
Embarked_S      -0.149683
Family_Small    -0.203367
Cabin_U         -0.316912
Pclass_3        -0.322308
Sex             -0.543351
Mr              -0.549199
Name: Survived, dtype: float64
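In the listing above the strongest predictors sit at both ends, because a large negative coefficient (Mr, Sex) is as informative as a large positive one. A minimal sketch of ranking by absolute value instead, on a made-up frame in which Sex is constructed as the exact opposite of Survived:

```python
import pandas as pd

# Toy data (illustrative only): Sex is constructed as 1 - Survived
toy = pd.DataFrame({
    "Survived":    [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex":         [1, 0, 0, 0, 1, 1, 0, 1],
    "Fare":        [7.3, 71.3, 7.9, 53.1, 8.1, 8.5, 51.9, 21.1],
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Correlation of every feature with the label, excluding the label itself
corr = toy.corr()["Survived"].drop("Survived")

# Order by strength of association, keeping the original sign in the values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to corrDf['Survived'], this ordering would place Mr and Sex at the top and features such as PassengerId and Embarked_Q at the bottom.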

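Based on the size of each feature's correlation with Survived, we select the following features as model inputs:

title (the titleDf built earlier), cabin class (pclassDf), family size (familyDf), fare (Fare), cabin deck (cabinDf), port of embarkation (embarkedDf), and sex (Sex)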

full_X = pd.concat(
    [
        titleDf,
        pclassDf,
        familyDf,
        full_data['Fare'],
        cabinDf,
        embarkedDf,
        full_data['Sex']
    ],axis=1
)

full_X.head()

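	Master	Miss	Mr	Mrs	Officer	Royalty	Pclass_1	Pclass_2	Pclass_3	FamilySize	...	Cabin_D	Cabin_E	Cabin_F	Cabin_G	Cabin_T	Cabin_U	Embarked_C	Embarked_Q	Embarked_S	Sex
0	0	0	1	0	0	0	0	0	1	2	...	0	0	0	0	0	1	0	0	1	1
1	0	0	0	1	0	0	1	0	0	2	...	0	0	0	0	0	0	1	0	0	0
2	0	1	0	0	0	0	0	0	1	1	...	0	0	0	0	0	1	0	0	1	0
3	0	0	0	1	0	0	1	0	0	2	...	0	0	0	0	0	0	0	0	1	0
4	0	0	1	0	0	0	0	0	1	1	...	0	0	0	0	0	1	0	0	1	1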

5 rows × 27 columns

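4. Build the model

The Titanic test set is what we ultimately submit to Kaggle. It contains no survival labels, so it cannot be used to evaluate the model.
We therefore treat the training set provided by the Kaggle Titanic competition as our source dataset (source) and split it into a training set (train, used to fit the model) and a test set (test, used to evaluate the model).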

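# The source dataset has 891 rows
source_row = 891


# Features of the source dataset (.loc slices include the end label, hence the -1)
source_X = full_X.loc[0:source_row-1,:]
# Labels of the source dataset
source_y = full_data.loc[0:source_row-1,'Survived']


# Features of the prediction dataset
pred_X = full_X.loc[source_row:,:]


print('Source dataset size:',source_X.shape[0])
print('Prediction dataset size:',pred_X.shape[0])

Source dataset size: 891
Prediction dataset size: 418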
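The `source_row-1` in the slice is needed because `.loc` slices by label and includes the end label, unlike ordinary Python slicing. A small sketch of the difference on a made-up six-row frame (`df` and `n` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"x": range(6)})   # default integer index 0..5
n = 4

by_label = df.loc[0:n - 1]   # label-based slice, end label included: rows 0..3
by_pos = df.iloc[:n]         # position-based slice, end excluded: rows 0..3
rest = df.loc[n:]            # everything from label n onward: rows 4..5
print(len(by_label), len(by_pos), len(rest))
```

With the default RangeIndex produced by `ignore_index=True`, `full_X.iloc[:source_row]` would select the same rows as the `.loc` form used above.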

# 1. Split the source dataset
from sklearn.model_selection import train_test_split


train_X,test_X,train_y,test_y  = train_test_split(
    source_X,
    source_y,
    test_size=0.2,
    train_size=0.8
)
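The split above is random, so the score reported further down will vary from run to run. A minimal sketch of making the split repeatable and keeping the class ratio equal in both parts, using made-up arrays; `random_state=42` and `stratify` are my additions and are not part of the original code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 25 samples, 15 of class 0 and 10 of class 1
X = np.arange(50).reshape(25, 2)
y = np.array([0] * 15 + [1] * 10)

# A fixed random_state makes the shuffle repeatable; stratify=y preserves the 60/40 class ratio
split_a = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

train_X, test_X, train_y, test_y = split_a
print(test_X.shape, test_y.sum())
```

Passing only `test_size` is enough; `train_size` defaults to the complement.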



# 2. Choose a machine learning algorithm; we use basic logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()


# 3. Train the model
lr.fit(train_X,train_y)

# 4. Evaluate the model; score() returns the mean accuracy on the given data (not precision)
lr.score(test_X,test_y)

0.8156424581005587
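The number above is accuracy, the share of all predictions that are correct. Precision and recall answer different questions and can be computed with `sklearn.metrics`. A small sketch on hand-made labels (`y_true` and `y_pred` are invented for illustration, not model output):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-made labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / all   = 7 / 10
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)    = 3 / 4
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)    = 3 / 5
print(cm, accuracy, precision, recall)
```

On the real split, the same functions would be called as `accuracy_score(test_y, lr.predict(test_X))` and so on.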

5. Submit to Kaggle

# Predict on the prediction dataset
pred_y = lr.predict(pred_X)

# Kaggle expects integer labels, so convert from float
pred_y = pred_y.astype(int)


# Passenger IDs
passenger_id = full_data.loc[source_row:,'PassengerId']

predDf = pd.DataFrame(
    {
        'PassengerId':passenger_id,
        'Survived':pred_y
    }
)

predDf.head()

	PassengerId	Survived
891	892	0
892	893	1
893	894	0
894	895	0
895	896	1

# Save the results
predDf.to_csv('./titanic_data/titanic_pred.csv',index=False)