Titanic Survivor Prediction: A Classification Tree Application

Article link: https://blog.csdn.net/weixin_42122125/article/details/106493134

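This article walks through preprocessing the Titanic data in Python, building a decision tree model, and validating it. After analyzing the data, it drops irrelevant features, converts categorical features, and tunes the model parameters with cross-validation and grid search. It ends with the best decision tree found, predictions on the test set, and an analysis of the model's performance.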


Author: Irain
QQ: 2573396010
WeChat: 18802080892
GitHub project link: Titanic Survivor Prediction

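1 Environment Setup

Tools: Python 3.7.2, Jupyter Notebook
Third-party Python libraries: pandas, sklearn (scikit-learn), matplotlib, numpy, graphviz
Graphviz setup for rendering decision trees:
Install Graphviz
Configure the Graphviz environment variables
Fix the error "execute dot, -Tsvg, make sure the Graphviz executables are on your systems PATH"
Data download link: https://www.kaggle.com/c/titanic/data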
Install the third-party Python libraries (scikit-learn is installed under the package name scikit-learn and imported as sklearn):
pip install jupyter -i https://pypi.douban.com/simple/
pip install pandas -i https://pypi.douban.com/simple/
pip install scikit-learn -i https://pypi.douban.com/simple/
pip install matplotlib -i https://pypi.douban.com/simple/
pip install numpy -i https://pypi.douban.com/simple/
pip install graphviz -i https://pypi.douban.com/simple/
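
Before going further, it can help to confirm that the libraries are importable and that the Graphviz `dot` executable is actually on the PATH, since the PATH error mentioned above only shows up later when a tree is rendered. The following is a minimal sketch, not part of the original post; the helper name `check_environment` is made up for illustration.

```python
import importlib.util
import shutil

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "sklearn", "matplotlib", "numpy", "graphviz"]


def check_environment(modules=REQUIRED):
    """Hypothetical helper: report which modules can be found and whether `dot` is on PATH."""
    # find_spec returns None when a top-level module is not installed.
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # The Python graphviz package only wraps the `dot` binary, which is installed separately.
    status["dot executable"] = shutil.which("dot") is not None
    return status


for name, ok in check_environment().items():
    print(f"{name:15s} {'OK' if ok else 'MISSING'}")
```

If `dot executable` is reported as missing while the `graphviz` module is found, the Python wrapper is installed but the Graphviz binaries are not on the PATH.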

2 Reading and Analyzing the Data

2.1 Reading and viewing the data

import pandas as pd
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv
data = pd.read_csv(r"data.csv", index_col='PassengerId') # Read the CSV file into a DataFrame, using PassengerId as the index
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head
data.head() # Return the first n rows (n: int, default 5)
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html?highlight=data%20info#pandas.DataFrame.info
data.info() # Print the index dtype, column dtypes, non-null counts, and memory usage
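
To see what `index_col` does without downloading anything, here is a small sketch that reads a few rows from an in-memory string instead of `data.csv`. The three rows are made up for illustration and only mimic the column layout of the Kaggle file.

```python
import io

import pandas as pd

# Three made-up rows in the style of the Kaggle training file (not real passengers).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Embarked
1,0,3,male,22,S
2,1,1,female,38,C
3,1,3,female,,S
"""

# index_col moves PassengerId out of the columns and into the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(sample.head(2))  # first two rows
sample.info()          # Age reports 2 non-null values because one cell is empty
```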

2.2 Analyzing the data

2.2.1 Overall information about the data

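2.2.1.1 Column headers

PassengerId is the index, Survived is the label, and the other 10 columns are features.
PassengerId: passenger number
Survived: whether the passenger survived (0 = no, 1 = yes)
Pclass: ticket class (1 is the best, 2 is next, 3 is the lowest)
SibSp: number of siblings and spouses traveling with the passenger
Parch: number of parents and children traveling with the passenger
Ticket: ticket number
Fare: ticket price
Cabin: the passenger's cabin number
Embarked: port where the passenger boarded (one of three ports: S, C, Q)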

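2.2.1.2 Features with missing values

Age (partly missing): consider filling in the missing values
Cabin (mostly missing): consider dropping the feature
Embarked (very few missing): consider dropping the rows with missing values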
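
The three cases above can be told apart by counting missing values per column. The sketch below uses a made-up four-row frame rather than the real data; the same two calls (`isnull().sum()` and `isnull().mean()`) would apply to `data`.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with the same three problem columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Embarked": ["S", "C", None, "S"],
})

missing = toy.isnull().sum()   # number of missing cells per column
ratio = toy.isnull().mean()    # fraction of missing cells per column

summary = pd.DataFrame({"missing": missing, "ratio": ratio})
print(summary)
```

A column whose ratio is close to 1 is a candidate for dropping, while a column with only a handful of missing cells is cheaper to handle by dropping those rows.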

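2.2.2 Analyzing the data content

PassengerId is set as the row index when the data is read.
The features Name, Ticket, and Cabin are judged to have little relation to the label Survived, so they are dropped.
The features Sex and Embarked hold strings; they are converted to numeric values for easier processing.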

3 Data Preprocessing

3.1 Dropping low-value columns

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
# Drop the columns unrelated to the target y (Name, Ticket) and the column with too many missing values (Cabin)
data.drop(columns=['Name','Ticket','Cabin'], inplace=True) # columns= already selects the column axis, so axis=1 is not needed
data
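
As a side note, `drop` can also return a new frame instead of modifying the original in place, which makes it easier to rerun a notebook cell without errors about already-deleted columns. This is a sketch on a made-up two-row frame, not the author's code.

```python
import pandas as pd

# Made-up toy frame containing the three columns to be removed.
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Name": ["A", "B"],
    "Ticket": ["t1", "t2"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})

# Without inplace=True, drop returns a new DataFrame and leaves `toy` unchanged.
reduced = toy.drop(columns=["Name", "Ticket", "Cabin"])

print(list(reduced.columns))  # ['Survived', 'Fare']
print(list(toy.columns))      # the original still has all five columns
```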

3.2 Filling the column with few missing values

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html?highlight=mean#pandas.DataFrame.mean
print("Mean age:", data['Age'].mean()) # mean of Age; missing values are skipped
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html?highlight=fillna#pandas.Series.fillna
data['Age'] = data['Age'].fillna(data['Age'].mean()) # fill the missing Age values with the mean
data.info()
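
The effect of mean imputation is easy to check by hand on a tiny Series. The values below are made up; the point is only that `mean()` ignores NaN, so the fill value is the average of the observed ages.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([20.0, np.nan, 40.0, np.nan], name="Age")

mean_age = ages.mean()          # NaN is skipped: (20 + 40) / 2 = 30.0
filled = ages.fillna(mean_age)  # every NaN is replaced by 30.0

print(mean_age)         # 30.0
print(filled.tolist())  # [20.0, 30.0, 40.0, 30.0]
```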

3.3 Dropping the few rows with missing values

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dropna.html?highlight=dropna#pandas.Series.dropna
data = data.dropna() # drop the few rows that still contain a missing value
data.info()
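
Assuming the earlier steps have been applied (Cabin dropped, Age filled), Embarked is the only column that can still contain missing values, so a plain `dropna()` removes exactly those rows. A more explicit variant, sketched here on made-up data, names the column through the `subset` parameter so that missing values elsewhere cannot silently remove rows.

```python
import pandas as pd

# Made-up toy frame: only Embarked still has a missing value.
toy = pd.DataFrame(
    {"Age": [22.0, 30.0, 30.0], "Embarked": ["S", None, "Q"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Drop only the rows where Embarked is missing.
cleaned = toy.dropna(subset=["Embarked"])

print(len(toy), "->", len(cleaned))  # 3 -> 2
print(list(cleaned.index))           # [1, 3]
```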

3.4 Converting string-valued categorical variables to numeric

3.4.1 Binary category: Sex

The comparison data['Sex'] == 'female' yields a bool Series (False, True). Casting it to int gives the numeric values (0, 1), which are then assigned back to data['Sex'].

# Convert the two-valued Sex feature to integers (0 and 1)
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html?highlight=astype#pandas.Series.astype
print("dtype of (data['Sex'] == 'female'):", (data['Sex'] == 'female').dtype) # bool can be cast directly to a numeric type
data['Sex'] = (data['Sex'] == 'female').astype("int") # cast the bool Series to int: female -> 1, male -> 0
data
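
The same comparison-then-cast step can be followed on a small made-up Series, which makes the intermediate bool values visible:

```python
import pandas as pd

# Made-up values for the Sex column.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

is_female = sex == "female"         # element-wise comparison gives a bool Series
encoded = is_female.astype("int")   # False -> 0, True -> 1

print(is_female.tolist())  # [False, True, True, False]
print(encoded.tolist())    # [0, 1, 1, 0]
```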

3.4.2 Three-way category: port of embarkation, Embarked

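Series.unique returns the distinct values of Embarked as an ndarray, and ndarray.tolist converts that array into a Python list named labels. Series.apply with a lambda then replaces each Embarked value with its position in labels, looked up with list.index.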
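
The mapping just described can be sketched on a small made-up Series. The `pd.factorize` call at the end is not part of the original post; it is shown only as a built-in alternative that produces the same codes.

```python
import pandas as pd

# Made-up values for the Embarked column.
embarked = pd.Series(["S", "C", "S", "Q", "C"], name="Embarked")

# unique() keeps the order of first appearance, so labels is ['S', 'C', 'Q'].
labels = embarked.unique().tolist()

# Replace each value with its position in labels.
codes = embarked.apply(lambda x: labels.index(x))

print(labels)          # ['S', 'C', 'Q']
print(codes.tolist())  # [0, 1, 0, 2, 1]

# Alternative (an addition, not the author's method): factorize does both steps at once.
alt_codes, alt_labels = pd.factorize(embarked)
print(alt_codes.tolist())  # [0, 1, 0, 2, 1]
```

The integers depend on the order in which the ports first appear in the data, so they carry no ranking among the ports.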

# Convert the categorical variable to a numeric variable
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html?highlight=unique#pandas.Series.unique
print("Type of data[\"Embarked\"].unique():", type(data["Embarked"].unique())) # Series.unique returns the unique values as a NumPy array
#https://numpy.org/doc/1.18/reference/generated/numpy.ndarray.tolist.html?highlight=tolist#numpy.ndarray.tolist
labels = data["Embarked"].unique().tolist() # copy the array into a Python list
print("labels:", labels)
#https://realpython.com/python-lambda/#lambda-calculus
print("lambda function object:", lambda x: labels.index(x)) # prints the function object itself; nothing is called here
print("Type of the lambda:", type(lambda x: labels.index(x)))
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html?highlight=apply#pandas.Series.apply
data[