Predicting Titanic survival with a decision tree
import pandas as pd
train=pd.read_csv("D:\\AA\\C\\deepmind\\case\\ti\\train.csv")
test=pd.read_csv("D:/AA/C/deepmind/case/ti/test.csv")
data=pd.concat([train,test])
data
Initial data exploration
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
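The non-null counts above can also be read as missing-value counts: Age has 1046 non-null entries out of 1309 rows, so 263 are missing. A minimal sketch of getting those counts directly, using a small made-up frame (the toy values are assumptions, not rows from the Titanic files):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# Number of missing entries per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
# The same information as a fraction of all rows
missing_ratio = toy.isna().mean()

print(missing)
print(missing_ratio)
```

Running the same two calls on `data` should give the per-column gaps without subtracting from the row total by hand.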
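'Age': '年龄' (age)
'Cabin': '船舱号' (cabin number)
'Embarked': '登船港口' (port of embarkation)
'Fare': '票价' (fare)
'Name': '姓名' (name)
'Parch': '是否有父母' (number of parents/children aboard)
'PassengerId': '乘客的编号' (passenger ID)
'Pclass': '哪个舱位' (ticket class: 1, 2 or 3)
'Sex': '性别' (sex)
'SibSp': '是否有兄弟姐妹' (number of siblings/spouses aboard)
'Survived': '是否生还' (survived: 0 = died, 1 = survived)
'Ticket': '票的数量' (ticket number)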
name_dict = {'Age':'年龄',
'Cabin':'船舱号',
'Embarked':'登船港口',
'Fare':'票价',
'Name':'姓名',
'Parch':'是否有父母',
'PassengerId':'乘客的编号',
'Pclass':'哪个舱位',
'Sex':'性别',
'SibSp':'是否有兄弟姐妹',
'Survived':'是否生还',
'Ticket':'票的数量'}
data.rename(columns=name_dict,inplace=True)
data
 | 年龄 | 船舱号 | 登船港口 | 票价 | 姓名 | 是否有父母 | 乘客的编号 | 哪个舱位 | 性别 | 是否有兄弟姐妹 | 是否生还 | 票的数量 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | NaN | NaN | S | 8.0500 | Spector, Mr. Woolf | 0 | 1305 | 3 | male | 0 | NaN | A.5. 3236 |
414 | 39.0 | C105 | C | 108.9000 | Oliva y Ocana, Dona. Fermina | 0 | 1306 | 1 | female | 0 | NaN | PC 17758 |
415 | 38.5 | NaN | S | 7.2500 | Saether, Mr. Simon Sivertsen | 0 | 1307 | 3 | male | 0 | NaN | SOTON/O.Q. 3101262 |
416 | NaN | NaN | S | 8.0500 | Ware, Mr. Frederick | 0 | 1308 | 3 | male | 0 | NaN | 359309 |
417 | NaN | NaN | C | 22.3583 | Peter, Master. Michael J | 1 | 1309 | 3 | male | 1 | NaN | 2668 |
1309 rows × 12 columns
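Problem definition
We want to predict survival, so the features we work with are ticket class, age and sex, and the target is whether the passenger survived. We leave the other columns alone. OK, let's select the features: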
x = data[["哪个舱位", "年龄", "性别"]]
y = data["是否生还"]
With the features chosen, the next step is to preprocess the data.
Data preprocessing
# Initial look at the selected features
x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 3 columns):
哪个舱位 1309 non-null int64
年龄 1046 non-null float64
性别 1309 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 40.9+ KB
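Only age has missing values. The previous section handled missing values by dropping them; here we look at how to fill them in instead.
Filling missing values
Age is a numeric variable, so we fill the gaps with the mean (mean()).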
# Reassign x rather than filling in place: x is a slice of data, so an in-place fill triggers SettingWithCopyWarning
x = x.fillna({'年龄': x['年龄'].mean()})
x
 | 哪个舱位 | 年龄 | 性别 |
---|---|---|---|
0 | 3 | 22.000000 | male |
1 | 1 | 38.000000 | female |
2 | 3 | 26.000000 | female |
3 | 1 | 35.000000 | female |
4 | 3 | 35.000000 | male |
... | ... | ... | ... |
413 | 3 | 29.881138 | male |
414 | 1 | 39.000000 | female |
415 | 3 | 38.500000 | male |
416 | 3 | 29.881138 | male |
417 | 3 | 29.881138 | male |
1309 rows × 3 columns
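Rows 413, 416 and 417 now show 29.881138, which is the mean of the known ages. A minimal sketch of the same fill on a made-up column, small enough to check by hand (the toy ages are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy ages, made up for illustration; NaN marks a missing value
ages = pd.DataFrame({"年龄": [20.0, np.nan, 40.0, np.nan, 30.0]})

# mean() skips NaN, so this is (20 + 40 + 30) / 3
mean_age = ages["年龄"].mean()

# Assign the filled column back instead of relying on inplace=True
ages["年龄"] = ages["年龄"].fillna(mean_age)

print(mean_age)
print(ages["年龄"].tolist())
```

Filling with the mean leaves the column mean unchanged but shrinks its variance, which is one reason the median or a model-based fill is sometimes preferred.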
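Sex is stored as a string. Models are generally trained on numeric input, so this string variable needs to be converted to numbers.
import numpy as np
# np.unique returns the labels in sorted order, so female -> 0 and male -> 1
# (named sex_map to avoid shadowing the built-in map)
sex_map = {label: idx for idx, label in enumerate(np.unique(x['性别']))}
x['性别'] = x['性别'].map(sex_map)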
 | 哪个舱位 | 年龄 | 性别 |
---|---|---|---|
0 | 3 | 22.000000 | 1 |
1 | 1 | 38.000000 | 0 |
2 | 3 | 26.000000 | 0 |
3 | 1 | 35.000000 | 0 |
4 | 3 | 35.000000 | 1 |
... | ... | ... | ... |
413 | 3 | 29.881138 | 1 |
414 | 1 | 39.000000 | 0 |
415 | 3 | 38.500000 | 1 |
416 | 3 | 29.881138 | 1 |
417 | 3 | 29.881138 | 1 |
1309 rows × 3 columns
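In the table above male became 1 and female became 0, because `np.unique` returns the labels in sorted order. A small sketch of the same mapping on a made-up column, plus an inverse map for translating codes back to labels (the `inverse_map` helper is an addition for illustration, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration
sex = pd.Series(["male", "female", "female", "male"])

# np.unique returns the distinct labels in sorted order
sex_map = {label: idx for idx, label in enumerate(np.unique(sex))}
encoded = sex.map(sex_map)

# Inverse mapping, handy when reading predictions or tree plots later
inverse_map = {idx: label for label, idx in sex_map.items()}
decoded = encoded.map(inverse_map)

print(sex_map)
print(encoded.tolist())
```

With only two categories a single 0/1 column is enough; for a column with more than two unordered categories, one-hot encoding is usually the safer choice for tree splits and especially for linear models.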
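With the data cleaned up, we can move on to feature engineering.
Feature engineering
from sklearn.model_selection import train_test_split
# Rows that came from test.csv have no label (NaN in 是否生还), so only labeled rows are split
labeled = y.notna().to_numpy()
x_train, x_test, y_train, y_test = train_test_split(x[labeled], y[labeled], test_size=0.2)
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
# Fit the vectorizer on the training split only, then reuse it on the test split
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
X_test = vec.transform(x_test.to_dict(orient='records'))
Defining the decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_test)
Model evaluation
Improving on the decision tree with gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)
print(gbc.score(X_test, y_test))
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, gbc_y_pred))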
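0.8044692737430168
 | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.82 | 0.85 | 0.84 | 105 |
1 | 0.77 | 0.74 | 0.76 | 74 |
accuracy | | | 0.80 | 179 |
macro avg | 0.80 | 0.80 | 0.80 | 179 |
weighted avg | 0.80 | 0.80 | 0.80 | 179 |
This report came from a run that passed the predictions as the first argument of classification_report, so its precision and recall columns are swapped relative to the usual reading, and support counts predicted labels rather than true ones.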
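The walkthrough stores the decision tree's predictions in `y_predict` but only scores the gradient boosting model. A minimal sketch of scoring both models the same way follows. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers it prints say nothing about the Titanic data) and fixes `random_state` so the run is repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic set), three features like the post
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, scores[name])
    # True labels go first, predictions second
    print(classification_report(y_test, y_pred))
```

Because a single 80/20 split is noisy, the two accuracy figures from the notebook (0.80 and 0.79 on different runs) are within the variation one would expect from re-splitting alone.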
Full code and visualization
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Only train.csv is loaded here, so every row has a label
data = pd.read_csv("D:\\AA\\C\\deepmind\\case\\ti\\train.csv")
name_dict = {'Age':'年龄',
'Cabin':'船舱号',
'Embarked':'登船港口',
'Fare':'票价',
'Name':'姓名',
'Parch':'是否有父母',
'PassengerId':'乘客的编号',
'Pclass':'哪个舱位',
'Sex':'性别',
'SibSp':'是否有兄弟姐妹',
'Survived':'是否生还',
'Ticket':'票的数量'}
data.rename(columns=name_dict, inplace=True)
# Copy the selected columns so the edits below do not act on a slice of data
x = data[["哪个舱位", "年龄", "性别"]].copy()
y = data["是否生还"]
# Fill missing ages with the mean age
x['年龄'] = x['年龄'].fillna(x['年龄'].mean())
# Encode sex as integers; np.unique sorts the labels, so female -> 0 and male -> 1
sex_map = {label: idx for idx, label in enumerate(np.unique(x['性别']))}
x['性别'] = x['性别'].map(sex_map)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
vec = DictVectorizer(sparse=False)
# Fit the vectorizer on the training split only, then reuse it on the test split
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
X_test = vec.transform(x_test.to_dict(orient='records'))
print(vec.feature_names_)
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_test)
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)
print(gbc.score(X_test, y_test))
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, gbc_y_pred))
# Write the fitted decision tree to a Graphviz .dot file
tree.export_graphviz(dtc, out_file='./tree.dot', feature_names=vec.feature_names_)
['哪个舱位', '年龄', '性别']
0.7877094972067039
 | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.87 | 0.81 | 0.84 | 120 |
1 | 0.66 | 0.75 | 0.70 | 59 |
accuracy | | | 0.79 | 179 |
macro avg | 0.76 | 0.78 | 0.77 | 179 |
weighted avg | 0.80 | 0.79 | 0.79 | 179 |
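The feature names printed at the top of this output come from `DictVectorizer`. Because sex had already been mapped to integers, the vectorizer passes the three numeric columns through unchanged. Its one-hot behaviour only shows up when a string column is left in place, as in this small sketch (the toy records and English keys are assumptions for illustration):

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records, made up for illustration; sex is left as a string on purpose
train_records = [
    {"pclass": 3, "age": 22.0, "sex": "male"},
    {"pclass": 1, "age": 38.0, "sex": "female"},
]
test_records = [{"pclass": 2, "age": 30.0, "sex": "female"}]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_records)  # learns the feature layout
X_test = vec.transform(test_records)        # reuses that layout

print(vec.feature_names_)
print(X_train)
print(X_test)
```

Calling `transform` rather than `fit_transform` on the test records matters here: refitting on the test split could produce a different set or order of columns whenever a category is missing from it.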
The site below renders the tree directly: paste the contents of the generated tree.dot file into the page.
http://webgraphviz.com/
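If pasting into an external site is inconvenient, scikit-learn can also print a fitted tree as plain text with `sklearn.tree.export_text`. A minimal sketch, assuming a shallow tree trained on synthetic data (the data and the English feature labels are hypothetical stand-ins, not the Titanic columns):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with three features
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Hypothetical labels for the three synthetic columns
feature_names = ["pclass", "age", "sex"]

# A shallow tree keeps the printed rules short
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

text = export_text(clf, feature_names=feature_names)
print(text)
```

Limiting `max_depth` is also worth trying on the real tree: an unrestricted tree on a continuous feature like age tends to grow too large to read in any renderer.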