Getting started with Kaggle: Titanic: Machine Learning from Disaster (decision tree + random forest)

Background

Our data warehousing and data mining course recently assigned a project: analyze the Titanic dataset, one of the introductory competitions on Kaggle.
I'm still very much a beginner. I've read a few books on deep learning and taken a machine learning course, but it hasn't changed the fact that I'm a rookie among rookies. My self-study has been scattered and a little impatient: nothing systematic, and after all this time I still feel like I haven't really gotten started. I mostly just call library functions, not even fluently, and what I know is fragmented. So when this assignment came along, I decided to use it as a chance to work through the problem properly (with plenty of reference to other people's blog posts) and, while I'm at it, finally get started with Kaggle (yes, embarrassingly, this is my first time using it).

Walkthrough

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer  # feature transformer: one-hot encodes dict-style samples
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import tree
import seaborn as sns
%matplotlib inline

Let's take a look at the dataset

df = pd.read_csv("titanic_data.csv")       # titanic_data.csv is just train.csv from the Kaggle competition page
df.head(10)

[figure: df.head(10), the first 10 rows of the dataset]
The column descriptions from the Kaggle data page:
[figure: Kaggle's data dictionary for the Titanic columns]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

df.isnull().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
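
Age, Cabin, and Embarked are the columns with gaps. As a quick aside (my addition, not part of the original workflow), the counts above can be expressed as percentages to see how severe each gap is; a minimal sketch:

# Share of missing values per column, largest first (derived from the counts shown above)
missing_ratio = df.isnull().sum().sort_values(ascending=False) / len(df)
print(missing_ratio[missing_ratio > 0])
# Roughly: Cabin ~0.77, Age ~0.20, Embarked ~0.002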

Counts of survivors and fatalities

g = sns.countplot(x="Survived", data=df)
g.set_title("Survived")

[figure: bar chart of survivor vs. fatality counts]

df.groupby(['Sex','Survived'])['Survived'].count()

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

f,ax=plt.subplots(1,2)
df[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot(x='Sex', hue='Survived', data=df, ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()

[figure: survival rate by sex (left) and survival counts by sex (right)]
There are more men than women on board, but women survived at a far higher rate, so sex is clearly an important predictor of whether a passenger survived.
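
The same comparison can also be read off numerically; a small sketch (my addition, not in the original post) using pd.crosstab:

# Survival rate within each sex; rows sum to 1
# From the counts above: female ~0.74, male ~0.19
print(pd.crosstab(df['Sex'], df['Survived'], normalize='index'))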

Numbers of survivors and fatalities by age

survive_df = df[df['Survived']==1]
died_df = df[df['Survived']==0]
plt.figure(figsize=(30,10))
# people who died and survived
plt.subplot(121)
sns.distplot(survive_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='blue',axlabel='Survived_Age')
plt.subplot(122)
sns.distplot(died_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='red',axlabel='Died_Age')
plt.show()

[figure: age histograms, survivors (left, blue) and fatalities (right, red)]
Let's also overlay the two distributions in a single plot:

plt.figure(figsize=(30,10))
sns.distplot(survive_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='blue',axlabel='Survived_Age')
sns.distplot(died_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='red',axlabel='Died_Age')
plt.show()

[figure: overlaid age histograms of survivors and fatalities]
Infants aged 0-6 were rescued at a high rate, while children aged 6-12 were more likely to die. Young adults were far more likely to die than to survive. Older passengers under 65 still had some chance of being rescued, whereas those over 65 all died. We can use these cut points to bin Age into categories later.
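
As an aside, one way to turn these cut points into a categorical feature is pd.cut. This is only a sketch of an alternative to the chained-lambda binning used later in this post, and its bin edges match that lambda only approximately at the boundaries:

# Bin Age at the cut points read off the histograms (alternative to the map/lambda used later)
age_bins = pd.cut(df['Age'],
                  bins=[0, 6, 16, 35, 65, 75, 120],
                  labels=['baby', 'child', 'youth', 'adult', 'old', 'tooold'])
print(age_bins.value_counts(dropna=False))   # NaN ages are still unfilled at this point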

Relationship between each attribute and the survival rate

plt.figure(figsize=(30,40))
plt.subplot(421)
sns.barplot(x='Sex', y='Survived', data=df)
plt.subplot(422)
sns.barplot(x='Pclass', y='Survived', data=df)
plt.subplot(423)
sns.barplot(x='Embarked', y='Survived', data=df)
plt.subplot(424)
sns.barplot(x='Parch', y='Survived', data=df)
plt.subplot(425)
sns.barplot(x='SibSp', y='Survived', data=df)
plt.subplot(426)
survival_df = df[df['Survived']==1]
death_df = df[df['Survived']==0]
sns.distplot(survival_df['Fare'].dropna().values, kde=False, color='blue')
sns.distplot(death_df['Fare'].dropna().values, kde=False, color='red', axlabel='Fare')

plt.subplot(427)
df['Family'] = df['Parch'] + df['SibSp']

sns.barplot(x='Family', y='Survived', data=df)
plt.subplot(428)
sns.barplot(x='Embarked', y='Survived', data=df)

[figure: survival rate by Sex, Pclass, Embarked, Parch, SibSp and Family, plus Fare histograms]
The plots above show that the survival rate is clearly related to Sex, Pclass, Embarked, and Family (SibSp + Parch; combining them into one feature shows an even stronger relationship).
A histogram is not a great fit for Fare, so let's try a violin plot instead.

sns.violinplot(x='Survived',y='Fare',data=df)

[figure: violin plot of Fare by Survived]

df['Fare'] = df['Fare'].map(lambda x:np.log(x+1))
sns.violinplot(x='Survived',y='Fare',data=df)

[figure: violin plot of log-transformed Fare by Survived]
When log(Fare) < 2.5 the chance of survival is noticeably lower.
Bin Family and Fare into categories:

df['Fare']=df['Fare'].map(lambda x: 'poor' if x<2.5 else 'rich')
df['Family'] = df['SibSp']+df['Parch']
df['Family']  = df['Family'].map(lambda x:'small' if x<=3 else 'big')
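
A quick sanity check (my addition) that the new buckets actually separate the classes:

# Survival rate within each new bucket; 'rich' should be clearly higher than 'poor',
# and 'small' families higher than 'big', matching the plots above
print(df.groupby('Fare')['Survived'].mean())
print(df.groupby('Family')['Survived'].mean())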

The Age column still has quite a few missing values. Before deciding how to fill them, let's check which attributes Age is most strongly correlated with.

df['BoolSex'] = df['Sex']=='male'
plt.figure(figsize=(15,10))
corr = sns.heatmap(df.drop(['PassengerId'], axis=1).corr(numeric_only=True), annot=True)  # numeric_only skips the string columns (needed on newer pandas)

[figure: correlation heatmap of the numeric columns]
Age is fairly strongly correlated with SibSp, Parch, and Pclass, so we fill missing ages with the median age of passengers in the same Pclass/SibSp/Parch group.
Embarked has only two missing values, so we simply fill them with 'C'.
The filling therefore looks like this:

group = df.groupby(['Pclass','SibSp','Parch']).Age
df['Age'] = group.transform(lambda x: x.fillna(x.median()))   # median age of the passenger's Pclass/SibSp/Parch group
df['Embarked'] = df['Embarked'].fillna('C')
df['Age'] = df['Age'].fillna(df['Age'].mean())                # groups that were entirely missing fall back to the overall mean
df['Age'] = df['Age'].map(lambda x: 'baby' if x<=6 else 'child' if x<16 else 'youth' if x<35 else 'adult' if x<65 else 'old' if x<75 else 'tooold')
df.isnull().sum()

PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
Family 0
BoolSex 0
dtype: int64

Only Cabin still has missing values, and it is missing for most rows (687 of 891), so we drop it rather than try to impute it.
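
Dropping it simply means leaving Cabin out of the feature list below; if you prefer to remove the column explicitly, a one-line sketch:

# Optional: drop Cabin outright (it is ~77% missing and is not used as a feature)
df = df.drop(columns=['Cabin'])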

Predicting survival with a decision tree

selected_columns = ['Pclass','Sex','Age','Family','Fare','Embarked']
x = df[selected_columns]
y = df['Survived']
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))   # one-hot encode the categorical columns
X_test = vec.transform(X_test.to_dict(orient='records'))
print(vec.feature_names_)

['Age=adult', 'Age=baby', 'Age=child', 'Age=old', 'Age=tooold', 'Age=youth', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Family=big', 'Family=small', 'Fare=poor', 'Fare=rich', 'Pclass', 'Sex=female', 'Sex=male']
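
DictVectorizer one-hot encodes the string-valued columns and passes the numeric Pclass through unchanged. To see what that means for a single passenger, a small sketch (my addition) printing the non-zero entries of the first encoded training row:

# Map feature names to values for the first row and keep only the non-zero entries
first_row = dict(zip(vec.feature_names_, X_train[0]))
print({name: value for name, value in first_row.items() if value != 0})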

Here we run 10-fold cross-validation on the training split and keep the model that scores best on the held-out test split.

from sklearn.model_selection import KFold
import copy
from sklearn import metrics
cv = KFold(n_splits=10)
tree_model = DecisionTreeClassifier(criterion='gini')
valid_dt_accs = []
test_dt_accs = []
optimal_dt_class = None
maxi = 0.0
for train_index, valid_index in cv.split(X_train):
    train_x, valid_x = X_train[train_index], X_train[valid_index]
    train_y, valid_y = y_train.iloc[train_index], y_train.iloc[valid_index]
    model = tree_model.fit(train_x, train_y)
    valid_acc = model.score(valid_x, valid_y)   # accuracy on this fold's validation split
    test_acc = model.score(X_test, y_test)      # accuracy on the fixed held-out test split
    if test_acc > maxi:
        maxi = test_acc
        optimal_dt_class = copy.deepcopy(model)
        print("maxi:",maxi)
    valid_dt_accs.append(valid_acc)
    test_dt_accs.append(test_acc)
    print(classification_report(y_test,model.predict(X_test)))
    
print("Validate accuracy per fold: ", valid_dt_accs, "\n")
print("Test accuracy per fold: ", test_dt_accs, "\n")

print("Average accuracy: ", sum(valid_dt_accs)/len(valid_dt_accs))
print("Average accuracy: ", sum(test_dt_accs)/len(test_dt_accs))


   precision    recall  f1-score   support

           0       0.87      0.93      0.90       117
           1       0.85      0.73      0.78        62

    accuracy                           0.86       179
   macro avg       0.86      0.83      0.84       179
weighted avg       0.86      0.86      0.86       179

              precision    recall  f1-score   support

           0       0.86      0.91      0.88       117
           1       0.80      0.73      0.76        62

    accuracy                           0.84       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.86      0.95      0.90       117
           1       0.88      0.71      0.79        62

    accuracy                           0.87       179
   macro avg       0.87      0.83      0.84       179
weighted avg       0.87      0.87      0.86       179

              precision    recall  f1-score   support

           0       0.85      0.91      0.88       117
           1       0.80      0.71      0.75        62

    accuracy                           0.84       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.86      0.91      0.88       117
           1       0.81      0.71      0.76        62

    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.87      0.95      0.91       117
           1       0.88      0.73      0.80        62

    accuracy                           0.87       179
   macro avg       0.87      0.84      0.85       179
weighted avg       0.87      0.87      0.87       179

              precision    recall  f1-score   support

           0       0.87      0.94      0.90       117
           1       0.87      0.73      0.79        62

    accuracy                           0.87       179
   macro avg       0.87      0.83      0.85       179
weighted avg       0.87      0.87      0.86       179

              precision    recall  f1-score   support

           0       0.83      0.80      0.82       117
           1       0.65      0.69      0.67        62

    accuracy                           0.77       179
   macro avg       0.74      0.75      0.74       179
weighted avg       0.77      0.77      0.77       179

              precision    recall  f1-score   support

           0       0.86      0.91      0.88       117
           1       0.81      0.71      0.76        62

    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.86      0.95      0.90       117
           1       0.88      0.71      0.79        62

    accuracy                           0.87       179
   macro avg       0.87      0.83      0.84       179
weighted avg       0.87      0.87      0.86       179

Validation accuracy per fold:  [0.7777777777777778, 0.875, 0.8873239436619719, 0.7605633802816901, 0.8169014084507042, 0.8450704225352113, 0.7464788732394366, 0.6901408450704225, 0.8169014084507042, 0.7464788732394366] 

Test accuracy per fold:  [0.8603351955307262, 0.8435754189944135, 0.8659217877094972, 0.8379888268156425, 0.8435754189944135, 0.8715083798882681, 0.8659217877094972, 0.7653631284916201, 0.8435754189944135, 0.8659217877094972] 

Average validation accuracy:  0.7962636932707355
Average test accuracy:  0.846368715083799
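
To get a feel for how stable the folds are, a small sketch (not in the original post) that plots the per-fold validation and test accuracies collected above:

# Decision tree: validation-split accuracy vs. fixed held-out test accuracy per fold
folds = range(1, len(valid_dt_accs) + 1)
plt.plot(folds, valid_dt_accs, marker='o', label='validation')
plt.plot(folds, test_dt_accs, marker='s', label='test')
plt.xlabel('fold')
plt.ylabel('accuracy')
plt.legend()
plt.show()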

Saving a plot of the decision tree as a PDF

import graphviz
from sklearn import tree
import pydotplus

import os
 
os.environ["PATH"] += os.pathsep + 'D:/Program Files/Graphviz/bin'
dot_data = tree.export_graphviz(optimal_dt_class, out_file=None, feature_names=vec.feature_names_,  # feature_names expects the column names, not importances
                                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("Decision_Tree.pdf")
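
If Graphviz is not available, scikit-learn's built-in tree.plot_tree renders the same tree with matplotlib; a sketch (my addition, with the drawing depth limited purely for readability):

# Alternative to Graphviz: draw the fitted tree with matplotlib
plt.figure(figsize=(40, 20))
tree.plot_tree(optimal_dt_class, feature_names=vec.feature_names_,
               class_names=['died', 'survived'], filled=True, max_depth=3)   # max_depth only limits the drawing
plt.savefig("Decision_Tree.png", dpi=150)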

Predicting survival with a random forest

from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn import ensemble
import copy
cv = KFold(n_splits=10)
tree_model = ensemble.RandomForestClassifier(n_estimators=500, random_state=666)
valid_accs = []
test_accs = []
optimal_rf_class = None
maxi = 0.0
for train_index, valid_index in cv.split(X_train):
    train_x, valid_x = X_train[train_index], X_train[valid_index]
    train_y, valid_y = y_train.iloc[train_index], y_train.iloc[valid_index]
    model = tree_model.fit(train_x, train_y)
    valid_acc = model.score(valid_x, valid_y)   # accuracy on this fold's validation split
    test_acc = model.score(X_test, y_test)      # accuracy on the fixed held-out test split
    if test_acc > maxi:
        maxi = test_acc
        optimal_rf_class = copy.deepcopy(model)
    valid_accs.append(valid_acc)
    test_accs.append(test_acc)
    print(classification_report(y_test,model.predict(X_test)))
    
print("Validate accuracy per fold: ", valid_accs, "\n")
print("Test accuracy per fold: ", test_accs, "\n")

print("Average validation accuracy: ", sum(valid_accs)/len(valid_accs))
print("Average test accuracy: ", sum(test_accs)/len(test_accs))


Output:

  precision    recall  f1-score   support

           0       0.87      0.93      0.90       117
           1       0.85      0.73      0.78        62

    accuracy                           0.86       179
   macro avg       0.86      0.83      0.84       179
weighted avg       0.86      0.86      0.86       179

              precision    recall  f1-score   support

           0       0.86      0.91      0.88       117
           1       0.80      0.73      0.76        62

    accuracy                           0.84       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.87      0.95      0.91       117
           1       0.88      0.74      0.81        62

    accuracy                           0.88       179
   macro avg       0.88      0.85      0.86       179
weighted avg       0.88      0.88      0.87       179

              precision    recall  f1-score   support

           0       0.85      0.91      0.88       117
           1       0.80      0.71      0.75        62

    accuracy                           0.84       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.86      0.92      0.89       117
           1       0.83      0.71      0.77        62

    accuracy                           0.85       179
   macro avg       0.84      0.82      0.83       179
weighted avg       0.85      0.85      0.85       179

              precision    recall  f1-score   support

           0       0.86      0.91      0.89       117
           1       0.82      0.73      0.77        62

    accuracy                           0.85       179
   macro avg       0.84      0.82      0.83       179
weighted avg       0.85      0.85      0.85       179

              precision    recall  f1-score   support

           0       0.87      0.94      0.90       117
           1       0.87      0.73      0.79        62

    accuracy                           0.87       179
   macro avg       0.87      0.83      0.85       179
weighted avg       0.87      0.87      0.86       179

              precision    recall  f1-score   support

           0       0.84      0.81      0.83       117
           1       0.67      0.71      0.69        62

    accuracy                           0.78       179
   macro avg       0.75      0.76      0.76       179
weighted avg       0.78      0.78      0.78       179

              precision    recall  f1-score   support

           0       0.86      0.91      0.88       117
           1       0.81      0.71      0.76        62

    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support

           0       0.86      0.95      0.90       117
           1       0.88      0.71      0.79        62

    accuracy                           0.87       179
   macro avg       0.87      0.83      0.84       179
weighted avg       0.87      0.87      0.86       179

Validation accuracy per fold:  [0.7777777777777778, 0.8611111111111112, 0.8169014084507042, 0.7746478873239436, 0.8028169014084507, 0.8591549295774648, 0.7605633802816901, 0.704225352112676, 0.8169014084507042, 0.7464788732394366] 

Test accuracy per fold:  [0.8603351955307262, 0.8435754189944135, 0.8770949720670391, 0.8379888268156425, 0.8491620111731844, 0.8491620111731844, 0.8659217877094972, 0.776536312849162, 0.8435754189944135, 0.8659217877094972] 

Average validation accuracy:  0.792057902973396
Average test accuracy:  0.8469273743016759
print("Optimal Random Forest Accuracy:",optimal_rf_class.score(X_test,y_test))
optimal_preds = optimal_rf_class.predict(X_test)
print(classification_report(optimal_preds,y_test,target_names=['died','survived']))

Output:

Optimal Random Forest Accuracy: 0.8770949720670391
              precision    recall  f1-score   support

        died       0.95      0.87      0.91       127
    survived       0.74      0.88      0.81        52

    accuracy                           0.88       179
   macro avg       0.85      0.88      0.86       179
weighted avg       0.89      0.88      0.88       179
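
It is also worth checking which encoded features the best random forest actually relies on; a short sketch (my addition) using feature_importances_:

# Rank the one-hot encoded features by importance in the best random forest
importances = sorted(zip(vec.feature_names_, optimal_rf_class.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(f"{name:<15s} {score:.3f}")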

Predicting survival for the passengers in test.csv

my_test_df = pd.read_csv("test.csv")
my_test_group = my_test_df.groupby(['Pclass','SibSp','Parch']).Age
my_test_df['Age'] = my_test_group.transform(lambda x:x.fillna(x.median()))
my_test_df['Embarked']=my_test_df['Embarked'].fillna('C')
my_test_df['Age'] = my_test_df['Age'].fillna(my_test_df['Age'].mean())
my_test_df['Fare'] = my_test_df['Fare'].fillna(my_test_df['Fare'].mean())

my_test_df.isnull().sum()

PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 327
Embarked 0
dtype: int64

my_test_df['Fare'] = my_test_df['Fare'].map(lambda x:np.log(x+1))
my_test_df['Fare'] = my_test_df['Fare'].map(lambda x: 'poor' if x<2.5 else 'rich')
my_test_df['Family'] = my_test_df['SibSp']+my_test_df['Parch']
my_test_df['Family']  = my_test_df['Family'].map(lambda x:'small' if x<=3 else 'big')
my_test_df['Age']=my_test_df['Age'].map(lambda x:'baby' if x<=6 else 'child' if x<16 else 'youth' if x<35 else 'adult' if x<65 else 'old' if x<75 else 'tooold' if x>=75 else 'null')
my_test_df = my_test_df[selected_columns]
my_test = vec.transform(my_test_df.to_dict(orient='records'))
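
Before predicting, a quick sanity check (my addition) that the vectorized test matrix lines up with the training features and contains no leftover NaNs:

# The encoded test set must have the same columns as the training set, and no NaNs
assert my_test.shape[1] == X_train.shape[1]
assert not np.isnan(my_test).any()
print(my_test.shape)   # should be (418, 16): 418 test passengers, 16 encoded features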

Save the predictions as a CSV file and submit to Kaggle

PassengerId = pd.read_csv('test.csv')['PassengerId']
my_rf_preds = optimal_rf_class.predict(my_test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": my_rf_preds.astype(np.int32)})
submission.to_csv("predicted_submission2.csv", index=False)
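
One last look at the file before uploading (my addition), just to confirm it has the expected two columns and 418 rows:

# Quick check of the submission file
check = pd.read_csv("predicted_submission2.csv")
print(check.shape)                        # should be (418, 2)
print(check['Survived'].value_counts())   # rough split of predicted classes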

Best score
[figure: best Kaggle leaderboard score for this submission]
