Titanic survival prediction + model optimization + visualization

Predicting Titanic survival with a decision tree

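import pandas as pd
# Load the training and test files and stack them into one frame;
# the rows that come from test.csv have no Survived value
train=pd.read_csv("D:\\AA\\C\\deepmind\\case\\ti\\train.csv")
test=pd.read_csv("D:/AA/C/deepmind/case/ti/test.csv")
data=pd.concat([train,test])
data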
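
One thing worth keeping in mind about the concat above: the stacked frame keeps each file's own row index, and rows from a file without a Survived column get NaN there. A minimal sketch with two made-up frames (toy_train and toy_test are hypothetical stand-ins, not the real files):

```python
import pandas as pd

# Two tiny stand-ins for train.csv and test.csv (made-up values)
toy_train = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]})
toy_test = pd.DataFrame({"Age": [34.5]})  # no Survived column, like test.csv

toy_data = pd.concat([toy_train, toy_test])

# The row from toy_test gets NaN in Survived and keeps its original index 0
print(toy_data)
n_unlabeled = int(toy_data["Survived"].isna().sum())
index_values = toy_data.index.tolist()
print(n_unlabeled, index_values)
```

This is why the info() output reports fewer non-null Survived values than rows, and why the index runs "0 to 417" rather than 0 to 1308.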

A first look at the data

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
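
info() reports non-null counts, so the number of missing values has to be worked out by subtraction (for example, Age has 1046 non-null values out of 1309 rows). As a rough sketch, isna() gives the missing counts directly; the small frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column
print(missing_counts)
print(missing_share)
```

On the real frame the same two calls would be data.isna().sum() and data.isna().mean().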

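'Age': age
'Cabin': cabin number
'Embarked': port of embarkation
'Fare': ticket fare
'Name': name
'Parch': number of parents/children aboard
'PassengerId': passenger ID
'Pclass': passenger class (1, 2 or 3)
'Sex': sex
'SibSp': number of siblings/spouses aboard
'Survived': whether the passenger survived (0 = died, 1 = survived)
'Ticket': ticket number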

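# Rename the columns to Chinese labels; the rest of the walkthrough uses these names
name_dict = {'Age':'年龄',              # age
'Cabin':'船舱号',                       # cabin number
'Embarked':'登船港口',                  # port of embarkation
'Fare':'票价',                          # fare
'Name':'姓名',                          # name
'Parch':'是否有父母',                   # parents/children aboard
'PassengerId':'乘客的编号',             # passenger ID
'Pclass':'哪个舱位',                    # passenger class
'Sex':'性别',                           # sex
'SibSp':'是否有兄弟姐妹',               # siblings/spouses aboard
'Survived':'是否生还',                  # survived (0/1)
'Ticket':'票的数量'}                    # ticket number
data.rename(columns=name_dict,inplace=True)
data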
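    | 年龄 | 船舱号 | 登船港口 | 票价 | 姓名 | 是否有父母 | 乘客的编号 | 哪个舱位 | 性别 | 是否有兄弟姐妹 | 是否生还 | 票的数量
0   | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171
1   | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599
2   | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282
3   | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803
4   | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
413 | NaN | NaN | S | 8.0500 | Spector, Mr. Woolf | 0 | 1305 | 3 | male | 0 | NaN | A.5. 3236
414 | 39.0 | C105 | C | 108.9000 | Oliva y Ocana, Dona. Fermina | 0 | 1306 | 1 | female | 0 | NaN | PC 17758
415 | 38.5 | NaN | S | 7.2500 | Saether, Mr. Simon Sivertsen | 0 | 1307 | 3 | male | 0 | NaN | SOTON/O.Q. 3101262
416 | NaN | NaN | S | 8.0500 | Ware, Mr. Frederick | 0 | 1308 | 3 | male | 0 | NaN | 359309
417 | NaN | NaN | C | 22.3583 | Peter, Master. Michael J | 1 | 1309 | 3 | male | 1 | NaN | 2668

1309 rows × 12 columns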

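Problem definition

We want to predict survival, so the features we work with are passenger class, age, and sex; the other columns are left out for now, haha. OK, let's pick out the features and the target.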

x = data[["哪个舱位", "年龄", "性别"]]
y = data["是否生还"]

With the features chosen, the next step is to preprocess the data.

Data preprocessing

# A first look at the selected features
x.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 3 columns):
哪个舱位    1309 non-null int64
年龄      1046 non-null float64
性别      1309 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 40.9+ KB
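Only the age column has missing values. The previous installment covered handling missing values by deleting them; here we look at how to fill them in.

Filling in missing values

Age is a numeric variable, so we fill the gaps with the mean (mean()).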

x = x.copy()  # work on an independent copy, so pandas does not warn about writing to a slice of data
x['年龄'] = x['年龄'].fillna(x['年龄'].mean())
x
    | 哪个舱位 | 年龄 | 性别
0   | 3 | 22.000000 | male
1   | 1 | 38.000000 | female
2   | 3 | 26.000000 | female
3   | 1 | 35.000000 | female
4   | 3 | 35.000000 | male
... | ... | ... | ...
413 | 3 | 29.881138 | male
414 | 1 | 39.000000 | female
415 | 3 | 38.500000 | male
416 | 3 | 29.881138 | male
417 | 3 | 29.881138 | male

1309 rows × 3 columns
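
In the output above, the rows whose age was missing (413, 416, 417) now all show 29.881138, which is the column mean. The same mechanics on a made-up four-value column, as a minimal sketch:

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps
ages = pd.Series([20.0, np.nan, 40.0, np.nan], name="Age")

mean_age = ages.mean()            # NaN values are skipped: (20 + 40) / 2
filled = ages.fillna(mean_age)    # every gap receives the same mean value
print(mean_age)
print(filled.tolist())
```

Swapping ages.mean() for ages.median() would be one alternative when a column is skewed; which of the two suits the Titanic ages better is not something this walkthrough tests.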

Sex is stored as a string, but models are generally trained on numeric input, so this string variable needs to be converted to numbers.

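import numpy as np
# np.unique returns the sorted labels, so 'female' maps to 0 and 'male' to 1
sex_map = {label: idx for idx, label in enumerate(np.unique(x['性别']))}  # avoid shadowing the built-in map()
x['性别'] = x['性别'].map(sex_map)
x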
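    | 哪个舱位 | 年龄 | 性别
0   | 3 | 22.000000 | 1
1   | 1 | 38.000000 | 0
2   | 3 | 26.000000 | 0
3   | 1 | 35.000000 | 0
4   | 3 | 35.000000 | 1
... | ... | ... | ...
413 | 3 | 29.881138 | 1
414 | 1 | 39.000000 | 0
415 | 3 | 38.500000 | 1
416 | 3 | 29.881138 | 1
417 | 3 | 29.881138 | 1

1309 rows × 3 columns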
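
The encoding above turns female into 0 and male into 1 because np.unique returns the distinct labels in sorted order. A small sketch on a made-up column shows the dictionary that gets built (sex and sex_codes are illustrative names):

```python
import numpy as np
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])

# np.unique gives the sorted distinct labels: 'female' first, then 'male'
sex_codes = {str(label): idx for idx, label in enumerate(np.unique(sex))}
encoded = sex.map(sex_codes)

print(sex_codes)
print(encoded.tolist())
```

Writing the dictionary out by hand, e.g. {"female": 0, "male": 1}, would give the same result and makes the coding explicit.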

With the data cleaned up, we can move on to feature engineering!

Feature engineering

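from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Rows that came from test.csv have no label (NaN), so keep only the labelled rows
labeled = y.notna().to_numpy()
x_train, x_test, y_trian, y_test = train_test_split(x[labeled], y[labeled], test_size=0.2)

# Fit the vectorizer on the training split only, then reuse it on the test split
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
X_test = vec.transform(x_test.to_dict(orient='records'))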
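
To make clearer what DictVectorizer does: it turns a list of dicts into a numeric array, passing numbers through and one-hot encoding string values. Since sex was already mapped to 0/1 above, the walkthrough's data passes through unchanged; the sketch below leaves sex as a string to show the encoding. The records and English field names are made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up passenger records; Sex is left as a string on purpose
train_records = [
    {"Pclass": 3, "Age": 22.0, "Sex": "male"},
    {"Pclass": 1, "Age": 38.0, "Sex": "female"},
]
test_records = [{"Pclass": 2, "Age": 30.0, "Sex": "female"}]

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(train_records)  # learns the feature columns
X_test = vec.transform(test_records)        # reuses the same columns

feature_names = list(vec.feature_names_)
print(feature_names)
print(X_test)
```

Calling transform (not fit_transform) on the test records keeps the column layout identical to the training array, even if the test split happens to lack one of the categories.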

Defining the decision tree

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_trian)
y_predict = dtc.predict(x_test)

Model evaluation
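
The walkthrough fits the decision tree but only prints metrics for the boosted model. As a rough sketch of how a single tree could be scored, the example below uses synthetic data with a made-up survival rule; toy_x, toy_y and the rule itself are assumptions for illustration, not the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
toy_x = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "age": rng.uniform(1, 80, size=n),
    "sex": rng.integers(0, 2, size=n),  # 0 = female, 1 = male
})
# Made-up rule: women and first-class passengers survive, with 10% label noise
toy_y = ((toy_x["sex"] == 0) | (toy_x["pclass"] == 1)).astype(int)
flip = rng.random(n) < 0.1
toy_y = toy_y.where(~flip, 1 - toy_y)

xa, xb, ya, yb = train_test_split(toy_x, toy_y, test_size=0.2, random_state=0)
toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(xa, ya)
pred = toy_tree.predict(xb)

toy_acc = accuracy_score(yb, pred)
print(toy_acc)
print(classification_report(yb, pred))  # argument order is (y_true, y_pred)
```

With the walkthrough's own variables, the equivalent calls would be accuracy_score(y_test, y_predict) and classification_report(y_test, y_predict).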

Improving on the decision tree with gradient boosting

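from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
gbc=GradientBoostingClassifier()
gbc.fit(x_train,y_trian)
gbc_y_pred=gbc.predict(x_test)
# Score on the same x_test frame the model predicts on
print(gbc.score(x_test,y_test))
# classification_report expects (y_true, y_pred). The report printed below came from
# a call with the two swapped, so its precision and recall columns are interchanged.
print(classification_report(y_test,gbc_y_pred))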
0.8044692737430168
              precision    recall  f1-score   support

           0       0.82      0.85      0.84       105
           1       0.77      0.74      0.76        74

    accuracy                           0.80       179
   macro avg       0.80      0.80      0.80       179
weighted avg       0.80      0.80      0.80       179
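
A single 80/20 split gives a score that moves from run to run (this post shows 0.804 in one run and 0.788 in another). One way to compare the two models more steadily is k-fold cross-validation. The sketch below uses synthetic data with a made-up rule, so its numbers say nothing about the Titanic data itself:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 500
# Synthetic columns: class (1-3), age, sex (0/1); all values are made up
X = np.column_stack([
    rng.integers(1, 4, size=n),
    rng.uniform(1, 80, size=n),
    rng.integers(0, 2, size=n),
])
y = ((X[:, 2] == 0) | (X[:, 0] == 1)).astype(int)
noise = rng.random(n) < 0.15
y = np.where(noise, 1 - y, y)  # flip 15% of the labels

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
# Five accuracy scores per model, one per fold
cv_scores = {name: cross_val_score(m, X, y, cv=5) for name, m in models.items()}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Passing random_state to train_test_split and to the models is another simple way to make a single run repeatable.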

Full code + visualization

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Only train.csv is used here, so every row has a label
data=pd.read_csv("D:\\AA\\C\\deepmind\\case\\ti\\train.csv")
name_dict = {'Age':'年龄',
'Cabin':'船舱号',
'Embarked':'登船港口',
'Fare':'票价',
'Name':'姓名',
'Parch':'是否有父母',
'PassengerId':'乘客的编号',
'Pclass':'哪个舱位',
'Sex':'性别',
'SibSp':'是否有兄弟姐妹',
'Survived':'是否生还',
'Ticket':'票的数量'}
data.rename(columns=name_dict,inplace=True)

# Features and target; .copy() avoids SettingWithCopyWarning on the edits below
x = data[["哪个舱位", "年龄", "性别"]].copy()
y = data["是否生还"]
x['年龄'] = x['年龄'].fillna(x['年龄'].mean())
# np.unique sorts the labels: 'female' -> 0, 'male' -> 1
sex_map = {label: idx for idx, label in enumerate(np.unique(x['性别']))}
x['性别'] = x['性别'].map(sex_map)

x_train,x_test,y_trian,y_test = train_test_split(x,y,test_size=0.2)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(x_train.to_dict(orient='records'))
X_test = vec.transform(x_test.to_dict(orient='records'))  # reuse the fitted vectorizer
print(vec.feature_names_)

dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_trian)
y_predict = dtc.predict(x_test)

gbc=GradientBoostingClassifier()
gbc.fit(x_train,y_trian)
gbc_y_pred=gbc.predict(x_test)
print(gbc.score(x_test,y_test))
# classification_report expects (y_true, y_pred). The report printed below came from
# a call with the two swapped, so its precision and recall columns are interchanged.
print(classification_report(y_test,gbc_y_pred))

# Write the fitted decision tree to a Graphviz DOT file
tree.export_graphviz(dtc, out_file='./tree.dot', feature_names=list(x_train.columns))
['哪个舱位', '年龄', '性别']
0.7877094972067039
              precision    recall  f1-score   support

           0       0.87      0.81      0.84       120
           1       0.66      0.75      0.70        59

    accuracy                           0.79       179
   macro avg       0.76      0.78      0.77       179
weighted avg       0.80      0.79      0.79       179


The site below renders the tree directly: just paste the contents of the generated tree.dot file into the text box on that page.

http://webgraphviz.com/
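
For readers who would rather not go through a file, a minimal sketch of two alternatives: export_graphviz returns the DOT source as a string when out_file=None, and export_text prints a plain-text view of the tree. The six training rows and the English feature names below are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz, export_text

# Made-up rows: [pclass, age, sex] with sex coded 0 = female, 1 = male
toy_X = [[3, 22.0, 1], [1, 38.0, 0], [3, 26.0, 0],
         [1, 35.0, 0], [3, 35.0, 1], [2, 54.0, 1]]
toy_y = [0, 1, 1, 1, 0, 0]
names = ["pclass", "age", "sex"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# out_file=None makes export_graphviz return the DOT source as a string,
# which can be pasted into a Graphviz viewer
dot_source = export_graphviz(toy_tree, out_file=None, feature_names=names)
print(dot_source)

# Plain-text rendering of the same tree, no Graphviz needed
text_view = export_text(toy_tree, feature_names=names)
print(text_view)
```

With the model from this post, the same calls would take dtc and list(x_train.columns) in place of toy_tree and names.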

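Our team aims to build and publish high-quality technical articles and a technical community, in order to develop and place talent. If you are interested in big data and AI, you are welcome to join our QQ group~ (We regularly share free courses, and there is a technical sharing session every week that everyone is welcome to join.)

Our technical community: https://discourse.qingxzd.com/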
