动手学数据分析—5.数据建模及模型评估
引言&复习
本章将开始数据建模。
过程将综合使用所学知识:特征工程、模型搭建与模型评估。
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei'] # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False # 用来正常显示负号
plt.rcParams['figure.figsize'] = (10, 6) # 设置输出图片大小
# 读取训练数集
train = pd.read_csv('train.csv')
train.shape
(891, 12)
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
一、 特征工程
本步骤旨在通过对数据进行适当处理以达到供建模使用的目的。
1.1缺失值填充
- 对分类变量缺失值:填充某个缺失值字符(NA)、用最多类别的进行填充
- 对连续变量缺失值:填充均值、中位数、众数
# 对分类变量进行填充
train['Cabin'] = train['Cabin'].fillna('NA')
train['Embarked'] = train['Embarked'].fillna('S')
# 对连续变量进行填充
train['Age'] = train['Age'].fillna(train['Age'].mean())
# 检查缺失值比例
train.isnull().mean().sort_values(ascending=False)
Embarked 0.0
Cabin 0.0
Fare 0.0
Ticket 0.0
Parch 0.0
SibSp 0.0
Age 0.0
Sex 0.0
Name 0.0
Pclass 0.0
Survived 0.0
P