Author: Irain
QQ: 2573396010
WeChat: 18802080892
GitHub project link: Titanic survivor prediction
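1 Environment setup
Tools: Python 3.7.2, Jupyter Notebook
Third-party Python libraries: pandas, scikit-learn (sklearn), matplotlib, numpy, graphviz
Graphviz setup for rendering decision trees:
Install Graphviz
Add Graphviz to the PATH environment variable
Resolve the error "execute dot, -Tsvg, make sure the Graphviz executables are on your systems PATH"
Data download link: https://www.kaggle.com/c/titanic/data
Install the third-party Python libraries: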
pip install jupyter -i https://pypi.douban.com/simple/
pip install pandas -i https://pypi.douban.com/simple/
pip install scikit-learn -i https://pypi.douban.com/simple/
pip install matplotlib -i https://pypi.douban.com/simple/
pip install numpy -i https://pypi.douban.com/simple/
pip install graphviz -i https://pypi.douban.com/simple/
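After installing, it can help to confirm that the current interpreter actually sees each library, and that the Graphviz `dot` executable is on PATH. The following is a minimal sketch, not part of the original project: the helper name `find_missing` is hypothetical, and the import names are assumed to be the usual ones (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import shutil

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "sklearn", "matplotlib", "numpy", "graphviz"]

def find_missing(names):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
print("missing libraries:", missing if missing else "none")

# The graphviz Python package only wraps the Graphviz executables,
# so also check that "dot" can be found on PATH.
print("dot executable:", shutil.which("dot") or "not found on PATH")
```

If `dot` is reported as not found, the PATH step above is the likely cause of the `-Tsvg` error.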
2 Reading and analyzing the data
2.1 Reading and viewing the data
import pandas as pd
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv
data = pd.read_csv(r"data.csv", index_col='PassengerId') # Read the data and use PassengerId as the DataFrame index. Read a comma-separated values (csv) file into DataFrame.
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head
data.head() # Return the first n rows.(nint, default 5)
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html?highlight=data%20info#pandas.DataFrame.info
data.info() # This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
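To see what `index_col` does without downloading the file, the same calls can be tried on a small in-memory CSV. This is only a sketch: the three rows below are made up for illustration and are not taken from the Kaggle data, and the name `sample` is hypothetical.

```python
import io
import pandas as pd

# Three made-up rows in the shape of the Kaggle file (illustrative only).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,
3,1,3,female,26
"""

# index_col moves PassengerId out of the columns and into the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(sample.head(2))  # first two rows
sample.info()          # dtypes and non-null counts per column
```

The empty Age field in the second row is read as NaN, which is why `info()` reports fewer non-null values for Age than for the other columns.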
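2.2 Analyzing the data
2.2.1 Overall information
2.2.1.1 Column headers
PassengerId is the index, Survived is the label, and the other 10 columns are features.
PassengerId: passenger ID
Survived: whether the passenger survived (0 = no, 1 = yes)
Pclass: ticket class (1 = highest, 2 = middle, 3 = lowest)
SibSp: number of siblings or spouses traveling aboard with the passenger
Parch: number of parents and children traveling aboard with the passenger
Ticket: ticket number
Fare: ticket price
Cabin: the passenger's cabin number
Embarked: port where the passenger boarded (one of three ports: S, C, Q)
2.2.1.2 Features with missing values
Age (partly missing): consider filling in the missing values
Cabin (mostly missing): consider dropping the feature
Embarked (very few missing): consider dropping the rows with missing values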
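The missing-value picture above is read off `data.info()`. A more direct way to get the same numbers is to count nulls per column. The sketch below uses a made-up four-row frame (the name `toy` and its values are illustrative, not real passenger data).

```python
import numpy as np
import pandas as pd

# Made-up values that mimic the three situations described above.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],   # partly missing
    "Cabin": [None, None, None, "C85"],    # mostly missing
    "Embarked": ["S", "C", None, "S"],     # few missing
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_ratio = toy.isnull().mean()  # share of missing values per column

print(pd.DataFrame({"missing": missing_count, "ratio": missing_ratio}))
```

The ratio column is what motivates the three different treatments: fill, drop the column, or drop the rows.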
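2.2.2 Analyzing the data contents
When reading the data, PassengerId is set as the sample index.
The features Name, Ticket and Cabin have little relation to the label Survived, so they are dropped.
The features Sex and Embarked hold string values; convert them to numeric values so they are easier to process.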
3 Data preprocessing
3.1 Dropping low-value columns
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
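# Drop the columns unrelated to the prediction target y (Name, Ticket) and the column with too many missing values (Cabin)
data.drop(columns=['Name','Ticket','Cabin'], inplace=True) # Drop specified labels from rows or columns. columns= already selects the axis, so axis=1 is not needed.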
data
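As an alternative to `inplace=True`, `drop` can return a new frame and leave the original untouched, which makes it easier to re-run a notebook cell safely. A minimal sketch on a made-up two-row frame (the names `toy` and `trimmed` are hypothetical):

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Ticket": ["t1", "t2"],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.28],
})

# Without inplace=True, drop returns a new DataFrame.
trimmed = toy.drop(columns=["Name", "Ticket", "Cabin"])

print(trimmed.columns.tolist())  # remaining columns
print(toy.columns.tolist())      # the original frame is unchanged
```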
3.2 Filling columns with few missing values
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html?highlight=mean#pandas.DataFrame.mean
print("Mean age:", data['Age'].mean()) # mean of the Age column
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html?highlight=fillna#pandas.Series.fillna
data['Age'] = data['Age'].fillna(data['Age'].mean()) # fill the missing values with the column mean
data.info()
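The effect of mean imputation is easy to verify on a tiny example. The sketch below uses three made-up ages (the name `ages` is hypothetical); `mean()` skips NaN by default, so the fill value is the mean of the observed values only.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([20.0, np.nan, 40.0], name="Age")

mean_age = ages.mean()         # NaN is skipped: (20 + 40) / 2
filled = ages.fillna(mean_age) # replace NaN with the mean

print(mean_age)         # 30.0
print(filled.tolist())  # [20.0, 30.0, 40.0]
```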
3.3 Dropping the few rows with missing values
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dropna.html?highlight=dropna#pandas.Series.dropna
data = data.dropna() # drop the few rows that still contain missing values
data.info()
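`dropna()` with no arguments removes every row that has a missing value in any column. If the intent is to drop only rows missing Embarked, the `subset` parameter makes that explicit. A minimal sketch on made-up rows (the frame `toy` and its values are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up rows: one is missing Embarked, another is missing Age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", "C", None],
})

dropped_any = toy.dropna()                         # drops rows with any NaN
dropped_embarked = toy.dropna(subset=["Embarked"]) # drops only rows missing Embarked

print(len(dropped_any), len(dropped_embarked))
```

In the tutorial's flow Age has already been filled, so both calls would give the same result there; `subset` mainly protects against dropping more rows than intended if the steps are reordered.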
3.4 Converting string categorical variables to numeric
3.4.1 Binary category: Sex
The comparison (data['Sex'] == 'female') yields a bool result (False, True); cast it to the integers (0, 1) and assign it back to data['Sex'].
# Convert the two-valued Sex column to integers (0, 1)
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html?highlight=astype#pandas.Series.astype
print("dtype of (data['Sex'] == 'female'):", (data['Sex'] == 'female').dtypes) # bool can be cast directly to a numeric type
data['Sex'] = (data['Sex'] == 'female').astype("int") # Cast a pandas object to a specified dtype.
data
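The boolean cast above encodes female as 1 and everything else as 0. An equivalent approach, shown here only as a sketch on a made-up Series, is an explicit mapping with `Series.map`, which states the encoding in one place and turns unexpected values into NaN instead of silently mapping them to 0.

```python
import pandas as pd

# Made-up values for illustration.
sex = pd.Series(["male", "female", "female"])

via_bool = (sex == "female").astype("int")     # comparison, then bool -> int
via_map = sex.map({"male": 0, "female": 1})    # explicit dictionary mapping

print(via_bool.tolist())  # [0, 1, 1]
print(via_map.tolist())   # [0, 1, 1]
```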
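3.4.2 Three-valued category: port of embarkation (Embarked)
Use Series.unique and ndarray.tolist to get the distinct values of Embarked as a list named labels. Then use Series.apply with a lambda and list.index to replace each Embarked value with its position in labels.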
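The steps just described can be sketched on a small made-up Series (the name `embarked` and its four values are illustrative). For comparison, `pd.factorize` is shown as well; it also numbers values in order of first appearance, so it yields the same codes here.

```python
import pandas as pd

# Made-up port codes for illustration.
embarked = pd.Series(["S", "C", "S", "Q"])

# Distinct values in order of first appearance, as a plain Python list.
labels = embarked.unique().tolist()

# Replace each value with its position in labels.
codes = embarked.apply(lambda x: labels.index(x))

print(labels)          # ['S', 'C', 'Q']
print(codes.tolist())  # [0, 1, 0, 2]

# pandas built-in with the same first-appearance numbering.
factor_codes, uniques = pd.factorize(embarked)
print(factor_codes.tolist())  # [0, 1, 0, 2]
```

Note that the resulting numbers depend on the order in which values first appear in the column, so they carry no ranking between ports.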
# Convert the categorical variable to a numeric variable
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html?highlight=unique#pandas.Series.unique
print("type of data[\"Embarked\"].unique():", type(data["Embarked"].unique())) # Return unique values of Series object as a NumPy array. See Notes.
#https://numpy.org/doc/1.18/reference/generated/numpy.ndarray.tolist.html?highlight=tolist#numpy.ndarray.tolist
labels = data["Embarked"].unique().tolist() # Return a copy of the array data as a (nested) Python list.
print("labels:", labels)
#https://realpython.com/python-lambda/#lambda-calculus
print("lambda object:", lambda x: labels.index(x)) # prints the function object itself; the lambda is not called here
print("type of the lambda:", type(lambda x: labels.index(x)))
#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html?highlight=apply#pandas.Series.apply
data[