Preface
I saw people online recommending Kaggle projects for practicing data analysis; the competitions under Getting Started are a good fit for beginners. So I registered a Kaggle account and started with the classic Titanic dataset. After working on it for a couple of days, I'm writing up the analysis here as a summary.
Data overview
The dataset contains information about the passengers on the Titanic, with the columns 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', and 'Embarked'. The task is to use this data to predict 'Survived' for the test data, which does not include that column.
In plain terms: use passenger records that include survival outcomes to predict whether passengers without recorded outcomes survived. This is a classic binary classification problem.
## Import the basic packages
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"]
os.chdir("D:/kaggle/titanic/")
Now load the data. It can be downloaded from the Kaggle website, or you can work directly in the notebook environment Kaggle provides, without downloading anything.
train_data=pd.read_csv("train.csv")
test_data=pd.read_csv("test.csv")
train_data.shape,test_data.shape
((891, 12), (418, 11))
train_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
test_data.columns
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
- The training data train_data has one column more than the test data test_data: Survived, which is the label column Y. The remaining columns of train_data are the feature columns.
- PassengerId is just a passenger identifier and is irrelevant to survival; it will be dropped later.
- Name is the passenger's name, also irrelevant to the prediction, and will be dropped as well.
- Ticket is the ticket number, not useful either; drop it.
### Overall summary statistics
train_data.describe()
There are missing values. Also, since describe() only summarizes numeric columns, the columns absent from its output must be categorical (string) features.
### Check the missing values in detail
train_data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Cabin has far too many missing values, so add it to the list of columns to drop; Age and Embarked need to be imputed.
test_data.isnull().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
### Pclass: the passenger's socio-economic class
train_data["Pclass"].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
There are three classes.
#### Survival rate by class
train_data[["Pclass","Survived"]].groupby("Pclass",as_index=False).mean()
The higher the class, the higher the survival rate, so keep this feature.
### Sex: passenger gender
train_data["Sex"].value_counts()
male 577
female 314
Name: Sex, dtype: int64
train_data[["Sex","Survived"]].groupby("Sex",as_index=False).mean()
Women's survival rate is much higher than men's, so keep this feature. Then convert the categorical values to numeric codes.
train_data["Sex"]=train_data["Sex"].map({"male":1,"female":2}).astype(int)
test_data["Sex"]=test_data["Sex"].map({"male":1,"female":2}).astype(int)
#### SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard)
train_data["SibSp"].value_counts()
0 608
1 209
2 28
4 18
3 16
8 7
5 5
Name: SibSp, dtype: int64
train_data["Parch"].value_counts()
0 678
1 118
2 80
5 5
3 5
4 4
6 1
Name: Parch, dtype: int64
train_data[["SibSp","Survived"]].groupby("SibSp",as_index=False).mean()
train_data[["Parch","Survived"]].groupby("Parch",as_index=False).mean()
From these tables, both features influence survival: larger values correspond to lower survival probability. We now add the two together to form a new feature, the total number of relatives aboard.
train_data["Familynum"]=train_data["SibSp"]+train_data["Parch"]
test_data["Familynum"]=test_data["SibSp"]+test_data["Parch"]
train_data[["Familynum","Survived"]].groupby("Familynum",as_index=False).mean()
### Age: has missing values; fill them with the mode first
train_data["Age"].fillna(train_data["Age"].mode().values[0],inplace=True)
test_data["Age"].fillna(train_data["Age"].mode().values[0],inplace=True)
fig= plt.figure(figsize=(8,6))
ax=fig.add_subplot(111)
ax.boxplot(train_data["Age"],labels=["Age"],capprops={"linewidth":2},boxprops={"linewidth":1})
ax.set_yticks(np.linspace(0,81,21))
plt.show()
There are outliers at both ends of the boxplot: very young and very old passengers are rare, which matches reality.
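The outliers flagged by the boxplot come from Tukey's 1.5×IQR rule, which matplotlib uses by default: points beyond a quartile by more than 1.5 times the interquartile range are drawn as fliers. A minimal sketch of that rule (the ages below are made up, standing in for train_data["Age"]):

```python
import numpy as np

# Made-up ages standing in for train_data["Age"]; the real data comes from train.csv.
ages = np.array([18, 20, 22, 24, 26, 28, 30, 32, 34, 80])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
# Tukey's rule, the same cutoff matplotlib's boxplot uses for its fliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]  # here only the 80-year-old
```

Anything outside [lower, upper] is plotted as an individual point rather than inside the whiskers.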
#### Fare: ticket price
test_data["Fare"].fillna(test_data["Fare"].mean(),inplace=True)
fig= plt.figure(figsize=(4,4))
ax=fig.add_subplot(111)
ax.boxplot(train_data["Fare"],labels=["Fare"],capprops={"linewidth":2},boxprops={"linewidth":1})
ax.set_yticks(np.linspace(0,510,21))
plt.show()
Fares reach about 510 at the top while the cheapest tickets are free; most fares are concentrated below 76.
#### Embarked: port of embarkation
train_data["Embarked"].isnull().sum()
2
train_data["Embarked"].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
Fill the missing values with the mode (S):
train_data["Embarked"].fillna("S",inplace=True)
train_data[["Embarked","Survived"]].groupby("Embarked",as_index=False).mean()
## Encode the categorical features as numbers
train_data["Embarked"]=train_data["Embarked"].map({"C":1,"Q":2,"S":3}).astype(int)
test_data["Embarked"]=test_data["Embarked"].map({"C":1,"Q":2,"S":3}).astype(int)
### Drop the unused original columns and the intermediate columns produced during the analysis
train_data.drop(["PassengerId","Name","Ticket","Cabin"],axis=1,inplace=True)
test_data.drop(["PassengerId","Name","Ticket","Cabin"],axis=1,inplace=True)
### Training features and training labels
X_train=train_data.drop("Survived",axis=1)
Y_train=train_data["Survived"]
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
## Split the labeled training data into two parts: a training set and a local hold-out test set
X_train,X_test,Y_train,Y_test=train_test_split(X_train,Y_train,test_size=0.3,random_state=0)
### Model training: we try seven classifiers — SVM, k-NN, naive Bayes, CART decision tree, random forest, GBDT, and XGBoost.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
SVM
model1=SVC(kernel="rbf")
C=[c for c in range(1,10)]
gamma=[g for g in range(1,20)]
param_grid=dict(C=C,gamma=gamma)
grid_search = GridSearchCV(model1,param_grid,scoring="accuracy")  ### grid search over hyperparameters
grid_search.fit(X_train,Y_train)
bestmodel1 = grid_search.best_estimator_
bestmodel1.score(X_train,Y_train)
0.9646869983948636
bestmodel1.score(X_test,Y_test)
0.6716417910447762
The gap between the training and hold-out scores shows the model is overfitting.
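One plausible contributor to that gap, besides the wide integer gamma grid: an RBF-kernel SVC is sensitive to feature scale, and here Fare (roughly 0–512) dwarfs features like Pclass (1–3). A hedged sketch of standardizing inside a Pipeline so the grid search sees scaled features (the synthetic X and y below are placeholders, not the Titanic columns):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: one small-range feature (like Pclass) and one
# large-range feature (like Fare); the label depends only on the first.
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(1, 4, 200), rng.uniform(0, 500, 200)])
y = (X[:, 0] == 1).astype(int)

# Scaling as a pipeline step keeps the RBF kernel from being dominated by
# the large-range feature; grid keys are prefixed with the step name.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
grid = GridSearchCV(pipe,
                    {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1, 1]},
                    scoring="accuracy", cv=5)
grid.fit(X, y)
```

Because the scaler lives inside the Pipeline, it is re-fit on each cross-validation training fold, so no information leaks from the validation folds into the scaling.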
k-NN
model1=KNeighborsClassifier(n_neighbors=3)
n=[n for n in range(1,9)]
param_grid=dict(n_neighbors=n)
grid_search=GridSearchCV(model1,param_grid,scoring="accuracy")
grid_search.fit(X_train,Y_train)
bestmodel1=grid_search.best_estimator_
bestmodel1.score(X_train,Y_train)
0.8218298555377207
Y_pre=bestmodel1.predict(X_test)
print(metrics.classification_report(Y_test,Y_pre))
              precision    recall  f1-score   support

           0       0.75      0.80      0.77       168
           1       0.62      0.54      0.58       100

    accuracy                           0.71       268
   macro avg       0.68      0.67      0.68       268
weighted avg       0.70      0.71      0.70       268
GaussianNB
model3 = GaussianNB(priors=None)
model3.fit(X_train,Y_train)
model3.score(X_train,Y_train)
0.7890011223344556
Decision tree
model4 = DecisionTreeClassifier()
param_grid={"max_depth":[1,2,3,4,5,6,7,8,9],"min_samples_split":[2,3,4]}
grid_search=GridSearchCV(model4,param_grid,scoring="accuracy")
grid_search.fit(X_train,Y_train)
bestmodel4=grid_search.best_estimator_
bestmodel4.score(X_train,Y_train)
0.8507223113964687
Y_pre=bestmodel4.predict(X_test)
print(metrics.classification_report(Y_test,Y_pre))
              precision    recall  f1-score   support

           0       0.81      0.90      0.86       168
           1       0.80      0.65      0.72       100

    accuracy                           0.81       268
   macro avg       0.81      0.78      0.79       268
weighted avg       0.81      0.81      0.80       268
Random forest
model5 = RandomForestClassifier()
param_grid = {"n_estimators":[i for i in range(5,15)],"min_samples_split":[i for i in range(2,4)]}
gridsearch=GridSearchCV(model5,param_grid,scoring="accuracy")
gridsearch.fit(X_train,Y_train)
bestmodel5=gridsearch.best_estimator_
bestmodel5.score(X_train,Y_train)
0.942215088282504
Y_pre=bestmodel5.predict(X_test)
print(metrics.classification_report(Y_test,Y_pre))
              precision    recall  f1-score   support

           0       0.86      0.86      0.86       168
           1       0.77      0.76      0.76       100

    accuracy                           0.82       268
   macro avg       0.81      0.81      0.81       268
weighted avg       0.82      0.82      0.82       268
GBDT
model6=GradientBoostingClassifier(random_state=10)
param_grid={"learning_rate":[i for i in np.arange(0.1,1,0.1)],"max_depth":[2,3,4]}
gridsearch=GridSearchCV(model6,param_grid,scoring="accuracy")
gridsearch.fit(X_train,Y_train)
bestmodel6=gridsearch.best_estimator_
bestmodel6.score(X_train,Y_train)
0.9373996789727127
Y_pre=bestmodel6.predict(X_test)
print(metrics.classification_report(Y_test,Y_pre))
              precision    recall  f1-score   support

           0       0.84      0.90      0.87       168
           1       0.81      0.72      0.76       100

    accuracy                           0.83       268
   macro avg       0.83      0.81      0.82       268
weighted avg       0.83      0.83      0.83       268
XGBoost
params1={"max_depth":list(range(3,20,4)),"min_child_weight":list(range(1,10,2))}
grid_search1=GridSearchCV(estimator=XGBClassifier(learning_rate=0.1,n_estimators=10,
gamma=0,subsample=0.8,colsample_bytree=0.8,
objective="binary:logistic",nthread=4,
scale_pos_weight=1,seed=27),param_grid=params1,
scoring="roc_auc",cv=5)
grid_search1.fit(X_train,Y_train)
grid_search1.best_params_,grid_search1.best_score_
({'max_depth': 15, 'min_child_weight': 1}, 0.8760627725970771)
params2={"gamma":[(i+1)/10.0 for i in range(0,5)]}
grid_search2=GridSearchCV(estimator=XGBClassifier(learning_rate=0.1,n_estimators=10,
max_depth=15,min_child_weight=1,
subsample=0.8,colsample_bytree=0.8,
objective="binary:logistic",nthread=4,
scale_pos_weight=1,seed=27),param_grid=params2,
scoring="roc_auc",cv=5)
grid_search2.fit(X_train,Y_train)
grid_search2.best_params_,grid_search2.best_score_
({'gamma': 0.2}, 0.8762017157341404)
params3={"subsample":[i/100.0 for i in range(75,90,5)],
"colsample_bytree":[i/100.0 for i in range(75,90,5)]}
grid_search3=GridSearchCV(estimator=XGBClassifier(learning_rate=0.1,n_estimators=10,
max_depth=15,min_child_weight=1,gamma=0.2,
objective="binary:logistic",nthread=4,
scale_pos_weight=1,seed=27),param_grid=params3,
scoring="roc_auc",cv=5)
grid_search3.fit(X_train,Y_train)
grid_search3.best_params_,grid_search3.best_score_
({'colsample_bytree': 0.75, 'subsample': 0.8}, 0.8762017157341404)
params4={"reg_alpha":[1e-5,1e-2,0.1,2,3]}
grid_search4=GridSearchCV(estimator=XGBClassifier(learning_rate=0.1,n_estimators=10,
max_depth=15,min_child_weight=1,gamma=0.2,
objective="binary:logistic",nthread=4,
subsample=0.8,colsample_bytree=0.75,
scale_pos_weight=1,seed=27),param_grid=params4,
scoring="roc_auc",cv=5)
grid_search4.fit(X_train,Y_train)
grid_search4.best_params_,grid_search4.best_score_
({'reg_alpha': 1e-05}, 0.8762017157341404)
params5={"learning_rate":[i/100.0 for i in range(1,21,1)]}
grid_search5=GridSearchCV(estimator=XGBClassifier(n_estimators=10,
max_depth=15,min_child_weight=1,gamma=0.2,
objective="binary:logistic",nthread=4,
subsample=0.8,colsample_bytree=0.75,reg_alpha=1e-05,
scale_pos_weight=1,seed=27),param_grid=params5,
scoring="roc_auc",cv=5)
grid_search5.fit(X_train,Y_train)
grid_search5.best_params_,grid_search5.best_score_
({'learning_rate': 0.1}, 0.8762017157341404)
model7=XGBClassifier(learning_rate=0.1,n_estimators=80,max_depth=15,min_child_weight=1,gamma=0.2,
subsample=0.8,colsample_bytree=0.75,objective="binary:logistic",nthread=4,
scale_pos_weight=1,seed=27)
model7.fit(X_train,Y_train)
model7.score(X_train,Y_train)
0.9454253611556982
Y_pre=model7.predict(X_test)
print(metrics.classification_report(Y_test,Y_pre))
              precision    recall  f1-score   support

           0       0.84      0.88      0.86       168
           1       0.77      0.72      0.75       100

    accuracy                           0.82       268
   macro avg       0.81      0.80      0.80       268
weighted avg       0.82      0.82      0.82       268
Predict on the test set, save the results in the required format, and submit them.
Y_test_pre=model7.predict(test_data)
##Use the XGBoost model for the final prediction, since it gave the best submission score: 0.76076
test_p = pd.DataFrame({"PassengerId":np.arange(892,1310),"Survived":Y_test_pre})
test_p.to_csv("D:/kaggle/titanic/test_pre.csv",index=False)
Summary: the final score is not great; the main room for improvement is probably in feature engineering.
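One direction worth trying: the Name column dropped earlier actually carries a useful signal — the title (Mr/Mrs/Miss/Master), which correlates with age, sex, and status. A sketch of extracting it before dropping Name (the sample rows and the rare-title list are illustrative, not taken from the data):

```python
import pandas as pd

# Illustrative rows in the Kaggle Name format; the real values come from train.csv.
names = pd.Series(["Braund, Mr. Owen Harris",
                   "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                   "Heikkinen, Miss. Laina"])

# The title is the word ending in a period after the comma.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Collapse rare titles into one bucket before numeric encoding
# (the list of rare titles here is illustrative).
titles = titles.replace(["Dr", "Rev", "Col", "Major", "Capt"], "Rare")
```

The resulting column can then be mapped to integers the same way Sex and Embarked were.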