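Background
Our Data Warehousing and Data Mining course recently assigned a project: analyze the Titanic dataset, an introductory competition on Kaggle.
I'm a beginner. I've read a few books on deep learning and taken a machine learning course, but that hasn't changed the fact that I'm still very much a novice. My self-study has been haphazard and a bit impatient rather than systematic. After all that time I still feel I haven't really gotten started: I can only call library functions, and not even fluently, and what I know is very fragmented. So I decided to use this assignment to do things properly (consulting and borrowing from many blog posts along the way) and to get started with Kaggle while I'm at it (embarrassingly, this is my first time using it).
Main text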
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer  # feature vectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn import tree
import seaborn as sns
%matplotlib inline
Let's see what the dataset looks like
df = pd.read_csv("titanic_data.csv")  # titanic_data.csv is train.csv from the Kaggle site
df.head(10)
Data description from Kaggle:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Let's look at the number of survivors and deaths
g = sns.countplot(x="Survived",data=df)  # recent seaborn versions require x= as a keyword
g.set_title("Survived")
df.groupby(['Sex','Survived'])['Survived'].count()
Sex Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64
f,ax=plt.subplots(1,2)
df[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot(x='Sex',hue='Survived',data=df,ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()
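There are more men than women in total, but the survival rate of women is far higher than that of men, so sex is an important factor in predicting whether a passenger survived.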
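The survival rates behind that observation can be computed directly from the groupby counts printed above. A minimal sketch in plain Python; the dictionary simply restates those counts, and the names `counts` and `survival_rate` are illustrative:

```python
# Counts copied from the groupby output above: (sex, survived) -> passengers
counts = {
    ("female", 0): 81,
    ("female", 1): 233,
    ("male", 0): 468,
    ("male", 1): 109,
}

def survival_rate(sex):
    """Fraction of passengers of the given sex who survived."""
    survived = counts[(sex, 1)]
    total = counts[(sex, 0)] + counts[(sex, 1)]
    return survived / total

for sex in ("female", "male"):
    print(f"{sex}: {survival_rate(sex):.3f}")
```

Roughly 74% of the women in the training data survived, against about 19% of the men.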
Let's look at the number of survivors and victims at different ages
survive_df = df[df['Survived']==1]
died_df = df[df['Survived']==0]
plt.figure(figsize=(30,10))
# people who died and survived
plt.subplot(121)
sns.distplot(survive_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='blue',axlabel='Survived_Age')
plt.subplot(122)
sns.distplot(died_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='red',axlabel='Died_Age')
plt.show()
Hmm, let's overlay the two in a single plot
plt.figure(figsize=(30,10))
sns.distplot(survive_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='blue',axlabel='Survived_Age')
sns.distplot(died_df['Age'].dropna().values,bins=range(0,80,1),kde=False,color='red',axlabel='Died_Age')
plt.show()
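We can see that young children aged 0 to 6 have a high proportion of survivors, while children aged 6 to 12 were more likely to die. Among young and middle-aged adults, deaths far outnumber survivals. Elderly passengers under 65 still had some chance of being rescued, while in this plot those over 65 all died. We can use these patterns later to bin age into categories.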
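The age bands suggested by the histogram can be written as a small function. This sketch uses the cut points from the post's own binning code further down (6, 16, 35, 65, 75); the function name `age_band` is made up for illustration:

```python
def age_band(age):
    """Map a numeric age to a categorical band."""
    if age != age:  # NaN is the only value not equal to itself
        return "null"
    if age <= 6:
        return "baby"
    if age < 16:
        return "child"
    if age < 35:
        return "youth"
    if age < 65:
        return "adult"
    if age < 75:
        return "old"
    return "tooold"

print([age_band(a) for a in (4, 10, 28, 50, 70, 80)])
```

Writing the bands as a function makes the boundary cases explicit: an age of exactly 6 counts as "baby", while exactly 16 already counts as "youth".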
Let's look at how each attribute relates to the survival rate
plt.figure(figsize=(30,40))
# x and y are passed by keyword; recent seaborn versions reject positional data columns
plt.subplot(421)
sns.barplot(x='Sex',y='Survived',data=df)
plt.subplot(422)
sns.barplot(x='Pclass',y='Survived',data=df)
plt.subplot(423)
sns.barplot(x='Embarked',y='Survived',data=df)
plt.subplot(424)
sns.barplot(x='Parch',y='Survived',data=df)
plt.subplot(425)
sns.barplot(x='SibSp',y='Survived',data=df)
plt.subplot(426)
survival_df = df[df['Survived']==1]
death_df = df[df['Survived']==0]
sns.distplot((survival_df['Fare'].dropna().values),kde=False,color='blue')
sns.distplot((death_df['Fare'].dropna().values),kde=False,color='red',axlabel='Fare')
plt.subplot(427)
df['Family'] = df['Parch']+df['SibSp']  # family size = parents/children + siblings/spouses
sns.barplot(x='Family',y='Survived',data=df)
plt.subplot(428)
sns.barplot(x='Embarked',y='Survived',data=df)
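From the plots above, the survival rate is related to Sex, Pclass, Embarked, and Family (SibSp + Parch; combining the two shows a clearer relationship).
A bar-style plot doesn't seem well suited to Fare, so let's try a violin plot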
sns.violinplot(x='Survived',y='Fare',data=df)
df['Fare'] = df['Fare'].map(lambda x:np.log(x+1))
sns.violinplot(x='Survived',y='Fare',data=df)
We can see that passengers with log(Fare + 1) < 2.5 were less likely to survive.
Bin Family and Fare into categories:
df['Fare']=df['Fare'].map(lambda x: 'poor' if x<2.5 else 'rich')
df['Family'] = df['SibSp']+df['Parch']
df['Family'] = df['Family'].map(lambda x:'small' if x<=3 else 'big')
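To see what the 2.5 cut-off means in the original fare units, the log transform can be inverted. A small sketch using only the standard library; `to_log_fare` and `fare_class` are illustrative names, not part of the original code:

```python
import math

def to_log_fare(fare):
    """Same transform as above: log(fare + 1)."""
    return math.log(fare + 1)

def fare_class(fare, threshold=2.5):
    """Label a raw fare 'poor' or 'rich' using the log-scale threshold."""
    return "poor" if to_log_fare(fare) < threshold else "rich"

# Invert the transform: the raw fare at which log(fare + 1) == 2.5
raw_threshold = math.exp(2.5) - 1
print(round(raw_threshold, 2))
print(fare_class(7.25), fare_class(71.28))
```

In other words, "poor" here means a raw fare below roughly 11.18.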
By the way, Age still has quite a few missing values. How should they be filled? Let's see which attributes Age is strongly correlated with
df['BoolSex'] = df['Sex']=='male'
plt.figure(figsize=(15,10))
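# Fare and Family are now strings, so keep only numeric/boolean columns before computing correlations
corr = sns.heatmap(df.drop(['PassengerId'],axis=1).select_dtypes(include=['number','bool']).corr(),annot=True)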
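We can see that Age is most strongly correlated with SibSp, Parch, and Pclass
Embarked also has two missing values, which are simply filled with 'C' here (strictly speaking, the most frequent port in train.csv is 'S', not 'C')
So the missing values can be filled as follows: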
group = df.groupby(['Pclass','SibSp','Parch']).Age
df['Age'] = group.transform(lambda x:x.fillna(x.median()))
df['Embarked']=df['Embarked'].fillna('C')
df['Age']=df['Age'].fillna(df['Age'].mean())
df['Age']=df['Age'].map(lambda x:'baby' if x<=6 else 'child' if x<16 else 'youth' if x<35 else 'adult' if x<65 else 'old' if x<75 else 'tooold' if x>=75 else 'null')
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 0
Family 0
BoolSex 0
dtype: int64
Apart from Cabin there are no missing values left. Cabin has far too many missing values, so I decided to drop it
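How the group-wise fill behaves is easier to see on a toy table. The sketch below uses made-up data with a single grouping column (the post groups by Pclass, SibSp and Parch) and shows why the second, overall-mean fill is still needed: a group whose ages are all missing has no median to borrow.

```python
import numpy as np
import pandas as pd

# Made-up data: class 3 has no known ages at all
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3],
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan, np.nan],
})

# Step 1: fill each missing age with the median of its own group
toy["Age"] = toy.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.median()))
still_missing = int(toy["Age"].isnull().sum())  # the class-3 row is still NaN

# Step 2: fall back to the overall mean for anything still missing
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
print(toy)
```

The group median keeps the imputed ages close to those of similar passengers, and the overall mean only acts as a last resort.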
Predicting passenger survival with a decision tree
selected_columns = ['Pclass','Sex','Age','Family','Fare','Embarked']
x = df[selected_columns]
y = df['Survived']
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
vec = DictVectorizer(sparse=False)
# orient must be spelled 'records'; the abbreviation 'record' is rejected by recent pandas
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
print(vec.feature_names_ )
['Age=adult', 'Age=baby', 'Age=child', 'Age=old', 'Age=tooold', 'Age=youth', 'Embarked=C', 'Embarked=Q', 'Embarked=S', 'Family=big', 'Family=small', 'Fare=poor', 'Fare=rich', 'Pclass', 'Sex=female', 'Sex=male']
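The feature names above follow a simple rule: each string-valued column becomes one 0/1 column per category, named `column=value`, while numeric columns such as Pclass pass through unchanged. A simplified pure-Python sketch of that idea (it is not scikit-learn's implementation, and `one_hot_records` is a hypothetical helper):

```python
def one_hot_records(records):
    """Encode a list of dicts: strings become 'key=value' indicator columns,
    numbers keep their own column. Returns (feature_names, rows)."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for rec in records:
        row = [0.0] * len(feature_names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return feature_names, rows

records = [
    {"Pclass": 3, "Sex": "male", "Fare": "poor"},
    {"Pclass": 1, "Sex": "female", "Fare": "rich"},
]
names, rows = one_hot_records(records)
print(names)
print(rows)
```

Note that Pclass stays a single numeric column, so the tree treats it as an ordered value rather than as three separate categories.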
Here 10-fold cross-validation is used, and the model that scores best on the held-out test set is kept. Because the test set is used for this selection, the test accuracy of the chosen model is an optimistic estimate.
from sklearn.model_selection import KFold
import copy
from sklearn import metrics
cv = KFold(n_splits=10)
tree_model = DecisionTreeClassifier(criterion='gini')  # criterion must be passed by keyword in recent scikit-learn
valid_dt_accs = []
test_dt_accs = []
optimal_dt_class = None
maxi = 0.0
for train_index, valid_index in cv.split(X_train):
    # split the training data into this fold's training and validation parts
    train_x,test_x = X_train[train_index],X_train[valid_index]
    train_y,test_y= y_train.iloc[train_index], y_train.iloc[valid_index]
    model = tree_model.fit(train_x,train_y)
    valid_acc = model.score(test_x,test_y)
    test_acc = model.score(X_test,y_test)
    # keep a copy of the model with the best test accuracy so far
    if test_acc > maxi:
        maxi = test_acc
        optimal_dt_class = copy.deepcopy(model)
        print("maxi:",maxi)
    valid_dt_accs.append(valid_acc)
    test_dt_accs.append(test_acc)
    print(classification_report(y_test,model.predict(X_test)))
print("Validation accuracy per fold: ", valid_dt_accs, "\n")
print("Test accuracy per fold: ", test_dt_accs, "\n")
print("Average validation accuracy: ", sum(valid_dt_accs)/len(valid_dt_accs))
print("Average test accuracy: ", sum(test_dt_accs)/len(test_dt_accs))
Output:
              precision    recall  f1-score   support
           0       0.87      0.93      0.90       117
           1       0.85      0.73      0.78        62
    accuracy                           0.86       179
   macro avg       0.86      0.83      0.84       179
weighted avg       0.86      0.86      0.86       179

              precision    recall  f1-score   support
           0       0.86      0.91      0.88       117
           1       0.80      0.73      0.76        62
    accuracy                           0.84       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.86      0.95      0.90       117
           1       0.88      0.71      0.79        62
    accuracy                           0.87       179
   macro avg       0.87      0.83      0.84       179
weighted avg       0.87      0.87      0.86       179

              precision    recall  f1-score   support
           0       0.85      0.91      0.88       117
           1       0.80      0.71      0.75        62
    accuracy                           0.84       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.86      0.91      0.88       117
           1       0.81      0.71      0.76        62
    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.87      0.95      0.91       117
           1       0.88      0.73      0.80        62
    accuracy                           0.87       179
   macro avg       0.87      0.84      0.85       179
weighted avg       0.87      0.87      0.87       179

              precision    recall  f1-score   support
           0       0.87      0.94      0.90       117
           1       0.87      0.73      0.79        62
    accuracy                           0.87       179
   macro avg       0.87      0.83      0.85       179
weighted avg       0.87      0.87      0.86       179

              precision    recall  f1-score   support
           0       0.83      0.80      0.82       117
           1       0.65      0.69      0.67        62
    accuracy                           0.77       179
   macro avg       0.74      0.75      0.74       179
weighted avg       0.77      0.77      0.77       179

              precision    recall  f1-score   support
           0       0.86      0.91      0.88       117
           1       0.81      0.71      0.76        62
    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.86      0.95      0.90       117
           1       0.88      0.71      0.79        62
    accuracy                           0.87       179
   macro avg       0.87      0.83      0.84       179
weighted avg       0.87      0.87      0.86       179

Validation accuracy per fold: [0.7777777777777778, 0.875, 0.8873239436619719, 0.7605633802816901, 0.8169014084507042, 0.8450704225352113, 0.7464788732394366, 0.6901408450704225, 0.8169014084507042, 0.7464788732394366]
Test accuracy per fold: [0.8603351955307262, 0.8435754189944135, 0.8659217877094972, 0.8379888268156425, 0.8435754189944135, 0.8715083798882681, 0.8659217877094972, 0.7653631284916201, 0.8435754189944135, 0.8659217877094972]
Average validation accuracy: 0.7962636932707355
Average test accuracy: 0.846368715083799
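The validation accuracies above come from folds of slightly different sizes. Assuming 712 training rows (891 minus the 179 test rows shown in the reports), KFold without shuffling gives the first `n % k` folds one extra sample. A sketch of that rule in plain Python; `kfold_sizes` is an illustrative helper, not a scikit-learn function:

```python
def kfold_sizes(n_samples, n_splits):
    """Validation-fold sizes: the first n_samples % n_splits folds
    get one extra sample, the rest get n_samples // n_splits."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = kfold_sizes(891 - 179, 10)
print(sizes)

# The first reported validation accuracy, 0.7777..., matches 56 correct out of 72
print(56 / sizes[0])
```

With only about 71 samples per validation fold, a single prediction shifts the fold accuracy by roughly 1.4 percentage points, which helps explain the spread between folds.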
Save the decision tree diagram as a PDF
import graphviz
from sklearn import tree
import pydotplus
import os
os.environ["PATH"] += os.pathsep + 'D:/Program Files/Graphviz/bin'  # local Graphviz install path; adjust for your machine
# feature_names must be the column names produced by the vectorizer, not the feature importances
dot_data = tree.export_graphviz(optimal_dt_class, out_file=None, feature_names=vec.feature_names_,
                                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("Decision_Tree.pdf")
Predicting passenger survival with a random forest
from sklearn.model_selection import KFold
from sklearn import metrics
from sklearn import ensemble
import copy
cv = KFold(n_splits=10)
tree_model = ensemble.RandomForestClassifier(n_estimators=500, random_state=666)
valid_accs = []
test_accs = []
optimal_rf_class = None
maxi = 0.0
for train_index, valid_index in cv.split(X_train):
    # split the training data into this fold's training and validation parts
    train_x,test_x = X_train[train_index],X_train[valid_index]
    train_y,test_y= y_train.iloc[train_index], y_train.iloc[valid_index]
    model = tree_model.fit(train_x,train_y)
    valid_acc = model.score(test_x,test_y)
    test_acc = model.score(X_test,y_test)
    # keep a copy of the model with the best test accuracy so far
    if test_acc > maxi:
        maxi = test_acc
        optimal_rf_class = copy.deepcopy(model)
    valid_accs.append(valid_acc)
    test_accs.append(test_acc)
    print(classification_report(y_test,model.predict(X_test)))
print("Validate accuracy per fold: ", valid_accs, "\n")
print("Test accuracy per fold: ", test_accs, "\n")
print("Average validation accuracy: ", sum(valid_accs)/len(valid_accs))
print("Average test accuracy: ", sum(test_accs)/len(test_accs))
Output:
              precision    recall  f1-score   support
           0       0.87      0.93      0.90       117
           1       0.85      0.73      0.78        62
    accuracy                           0.86       179
   macro avg       0.86      0.83      0.84       179
weighted avg       0.86      0.86      0.86       179

              precision    recall  f1-score   support
           0       0.86      0.91      0.88       117
           1       0.80      0.73      0.76        62
    accuracy                           0.84       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.87      0.95      0.91       117
           1       0.88      0.74      0.81        62
    accuracy                           0.88       179
   macro avg       0.88      0.85      0.86       179
weighted avg       0.88      0.88      0.87       179

              precision    recall  f1-score   support
           0       0.85      0.91      0.88       117
           1       0.80      0.71      0.75        62
    accuracy                           0.84       179
   macro avg       0.83      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.86      0.92      0.89       117
           1       0.83      0.71      0.77        62
    accuracy                           0.85       179
   macro avg       0.84      0.82      0.83       179
weighted avg       0.85      0.85      0.85       179

              precision    recall  f1-score   support
           0       0.86      0.91      0.89       117
           1       0.82      0.73      0.77        62
    accuracy                           0.85       179
   macro avg       0.84      0.82      0.83       179
weighted avg       0.85      0.85      0.85       179

              precision    recall  f1-score   support
           0       0.87      0.94      0.90       117
           1       0.87      0.73      0.79        62
    accuracy                           0.87       179
   macro avg       0.87      0.83      0.85       179
weighted avg       0.87      0.87      0.86       179

              precision    recall  f1-score   support
           0       0.84      0.81      0.83       117
           1       0.67      0.71      0.69        62
    accuracy                           0.78       179
   macro avg       0.75      0.76      0.76       179
weighted avg       0.78      0.78      0.78       179

              precision    recall  f1-score   support
           0       0.86      0.91      0.88       117
           1       0.81      0.71      0.76        62
    accuracy                           0.84       179
   macro avg       0.84      0.81      0.82       179
weighted avg       0.84      0.84      0.84       179

              precision    recall  f1-score   support
           0       0.86      0.95      0.90       117
           1       0.88      0.71      0.79        62
    accuracy                           0.87       179
   macro avg       0.87      0.83      0.84       179
weighted avg       0.87      0.87      0.86       179

Validate accuracy per fold: [0.7777777777777778, 0.8611111111111112, 0.8169014084507042, 0.7746478873239436, 0.8028169014084507, 0.8591549295774648, 0.7605633802816901, 0.704225352112676, 0.8169014084507042, 0.7464788732394366]
Test accuracy per fold: [0.8603351955307262, 0.8435754189944135, 0.8770949720670391, 0.8379888268156425, 0.8491620111731844, 0.8491620111731844, 0.8659217877094972, 0.776536312849162, 0.8435754189944135, 0.8659217877094972]
Average validation accuracy: 0.792057902973396
Average test accuracy: 0.8469273743016759
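Since the fold model is chosen by its test accuracy, it is worth checking which fold would have been chosen by validation accuracy instead. This sketch only re-reads the two lists printed above; it does not retrain anything, and `argmax` is a small helper written for the example:

```python
# Per-fold accuracies copied from the random forest output above
valid_accs = [0.7777777777777778, 0.8611111111111112, 0.8169014084507042,
              0.7746478873239436, 0.8028169014084507, 0.8591549295774648,
              0.7605633802816901, 0.704225352112676, 0.8169014084507042,
              0.7464788732394366]
test_accs = [0.8603351955307262, 0.8435754189944135, 0.8770949720670391,
             0.8379888268156425, 0.8491620111731844, 0.8491620111731844,
             0.8659217877094972, 0.776536312849162, 0.8435754189944135,
             0.8659217877094972]

def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

best_by_valid = argmax(valid_accs)
best_by_test = argmax(test_accs)

print("picked by validation:", best_by_valid, "test acc:", round(test_accs[best_by_valid], 4))
print("picked by test:", best_by_test, "test acc:", round(test_accs[best_by_test], 4))
```

Selecting on validation accuracy would have picked the second fold (index 1, test accuracy about 0.844) rather than the third (index 2, about 0.877), which gives a sense of how much the best reported score benefits from looking at the test set.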
print("Optimal Random Forest Accuracy:",optimal_rf_class.score(X_test,y_test))
optimal_preds = optimal_rf_class.predict(X_test)
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test,optimal_preds,target_names=['died','survived']))
Output:
Optimal Random Forest Accuracy: 0.8770949720670391
              precision    recall  f1-score   support
        died       0.87      0.95      0.91       117
    survived       0.88      0.74      0.81        62
    accuracy                           0.88       179
   macro avg       0.88      0.85      0.86       179
weighted avg       0.88      0.88      0.87       179
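Precision and recall in such a report follow from four counts. The counts below are an inference, reconstructed from figures in this post (117 died and 62 survived in the test split according to the per-fold reports, and 0.877 × 179 = 157 correct predictions); they were not printed by the original notebook:

```python
# Inferred confusion counts for the 'survived' class (an assumption, see above)
tp = 46   # survived, predicted survived
fn = 16   # survived, predicted died
fp = 6    # died, predicted survived
tn = 111  # died, predicted died

def precision(tp, fp):
    """Of everything predicted positive, the fraction that is truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, the fraction that was predicted positive."""
    return tp / (tp + fn)

accuracy = (tp + tn) / (tp + fn + fp + tn)
print("accuracy:", round(accuracy, 4))
print("survived precision:", round(precision(tp, fp), 2))
print("survived recall:", round(recall(tp, fn), 2))

# Passing (y_pred, y_true) instead of (y_true, y_pred) transposes the matrix
# (fp and fn trade places), which exchanges precision and recall
print("precision with swapped arguments:", round(precision(tp, fn), 2))
```

This is why the argument order of `classification_report` matters: accuracy is unaffected by a swap, but the per-class precision and recall columns trade places.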
Predicting survival for passengers in the test set
my_test_df = pd.read_csv("test.csv")
my_test_group = my_test_df.groupby(['Pclass','SibSp','Parch']).Age
my_test_df['Age'] = my_test_group.transform(lambda x:x.fillna(x.median()))
my_test_df['Embarked']=my_test_df['Embarked'].fillna('C')
my_test_df['Age'] = my_test_df['Age'].fillna(my_test_df['Age'].mean())
my_test_df['Fare'] = my_test_df['Fare'].fillna(my_test_df['Fare'].mean())
my_test_df.isnull().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 327
Embarked 0
dtype: int64
my_test_df['Fare'] = my_test_df['Fare'].map(lambda x:np.log(x+1))
my_test_df['Fare'] = my_test_df['Fare'].map(lambda x: 'poor' if x<2.5 else 'rich')
my_test_df['Family'] = my_test_df['SibSp']+my_test_df['Parch']
my_test_df['Family'] = my_test_df['Family'].map(lambda x:'small' if x<=3 else 'big')
my_test_df['Age']=my_test_df['Age'].map(lambda x:'baby' if x<=6 else 'child' if x<16 else 'youth' if x<35 else 'adult' if x<65 else 'old' if x<75 else 'tooold' if x>=75 else 'null')
my_test_df = my_test_df[selected_columns]
my_test = vec.transform(my_test_df.to_dict(orient='records'))  # reuse the vectorizer fitted on the training data
Save the predictions as a CSV file and submit it to Kaggle
PassengerId = pd.read_csv('test.csv')['PassengerId']
my_rf_preds = optimal_rf_class.predict(my_test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": my_rf_preds.astype(np.int32)})
submission.to_csv("predicted_submission2.csv", index=False)
Best score