Predicting Titanic Survival with a Random Forest in Python

tags:

  • Random Forest
  • Kaggle
  • Data Mining
    categories: Data Mining
    mathjax: true

Preface:

  • Kaggle data-mining competition: predicting Titanic survival with a random forest

Data source: Kaggle

1 Data Preprocessing

1.1 Reading the Data

import pandas as pd
data_train = pd.read_csv(r'train.csv')
data_test = pd.read_csv(r'test.csv')
data_train.head()
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                             male    22.0      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                              female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0      1      0  113803            53.1000  C123   S
4            5         0       3  Allen, Mr. William Henry                            male    35.0      0      0  373450             8.0500  NaN    S

1.2 Training and Test Sets

data_test.head()
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0          892       3  Kelly, Mr. James                              male    34.5      0      0  330911    7.8292  NaN    Q
1          893       3  Wilkes, Mrs. James (Ellen Needs)              female  47.0      1      0  363272    7.0000  NaN    S
2          894       2  Myles, Mr. Thomas Francis                     male    62.0      0      0  240276    9.6875  NaN    Q
3          895       3  Wirz, Mr. Albert                              male    27.0      0      0  315154    8.6625  NaN    S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875  NaN    S

1.2.1 Checking Data Completeness

data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

There are 891 rows in total, but Age has only 714 non-null values and Cabin only 204, across 12 columns.

The columns are: passenger ID, survival, ticket class, passenger name, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin, and port of embarkation.

1.2.2 Summary Statistics of the Training Data

data_train.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

mean is the average of each column; the overall survival rate is 0.383838.
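The 0.383838 figure is just the mean of the 0/1 Survived column. The same idea extends to per-group rates with a groupby; a minimal sketch on a toy frame (the rows below are made up for illustration, not from train.csv):

```python
import pandas as pd

# Tiny stand-in for data_train; the rows are invented for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 1, 0, 1, 1],
})

# Mean of a 0/1 column per group = survival rate per group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```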

1.3.1 Binning the Age Column

def simplify_ages(df):
    #Fill the missing values first so every row falls into a bin
    df.Age = df.Age.fillna(-0.5)

    #Discretize Age into eight intervals: (-1, 0] for unknown, then 0-5, 5-12, ..., 60-120;
    #the matching labels are listed in group_names
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

    #pandas.cut does the actual discretization
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df
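As a quick check of what pd.cut does with these bins, a self-contained sketch (the sample ages below are my own, with -0.5 standing in for a filled missing value):

```python
import pandas as pd

bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']

# Each value falls into a (left, right] interval and gets that interval's label
ages = pd.Series([-0.5, 4, 22, 70])
labels = pd.cut(ages, bins, labels=group_names)
print(list(labels))
```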

Simplify Cabin by keeping just its first letter:

def simplify_cabin(df):
    df.Cabin = df.Cabin.fillna('N')
    df.Cabin = df.Cabin.apply(lambda x:x[0])
    return df

Simplify Fare by binning it into quartile-style groups:

def simplify_fare(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

Drop columns we won't use:

def simplify_drop(df):
    return df.drop(['Name','Ticket','Embarked'],axis=1)

Chain everything together into one transform:

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_cabin(df)
    df = simplify_fare(df)
    df = simplify_drop(df)
    return df

Re-read the data and apply the transform:

#Must re-read the raw csv first, or this errors out; the test set has to be simplified too, so train and test end up with the same feature names
data_train = pd.read_csv(r'train.csv')
data_train = transform_features(data_train)
data_test = transform_features(data_test)
data_train.head()
   PassengerId  Survived  Pclass  Sex     Age          SibSp  Parch  Fare        Cabin
0            1         0       3  male    Student          1      0  1_quartile  N
1            2         1       1  female  Adult            1      0  4_quartile  C
2            3         1       3  female  Young Adult      0      0  1_quartile  N
3            4         1       1  female  Young Adult      1      0  4_quartile  C
4            5         0       3  male    Young Adult      0      0  2_quartile  N
#data_train=data_train.drop(["PassengerId","Cabin","Name"],axis=1)
data_train.head(200)
     Survived  Pclass  Sex     Age   SibSp  Parch  Ticket                Fare  Embarked
0           0       3  male    22.0      1      0  A/5 21171           7.2500  S
1           1       1  female  38.0      1      0  PC 17599           71.2833  C
2           1       3  female  26.0      0      0  STON/O2. 3101282    7.9250  S
3           1       1  female  35.0      1      0  113803             53.1000  S
4           0       3  male    35.0      0      0  373450              8.0500  S
..        ...     ...  ...      ...    ...    ...  ...                    ...  ...
195         1       1  female  58.0      0      0  PC 17569          146.5208  C
196         0       3  male     NaN      0      0  368703              7.7500  Q
197         0       3  male    42.0      0      1  4579                8.4042  S
198         1       3  female   NaN      0      0  370370              7.7500  Q
199         0       2  female  24.0      0      0  248747             13.0000  S

200 rows × 9 columns

Select the columns we need as input. I dropped the ticket number and the name, since the name isn't much use.

cols = ['PassengerId','Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
data_tr=data_train[cols].copy()
data_tr.head()
   PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch     Fare  Embarked
0            1         0       3  male    22.0      1      0   7.2500  S
1            2         1       1  female  38.0      1      0  71.2833  C
2            3         1       3  female  26.0      0      0   7.9250  S
3            4         1       1  female  35.0      1      0  53.1000  S
4            5         0       3  male    35.0      0      0   8.0500  S
cols = ['PassengerId','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
data_te=data_test[cols].copy()
data_te.head()
   PassengerId  Pclass  Sex     Age   SibSp  Parch     Fare  Embarked
0          892       3  male    34.5      0      0   7.8292  Q
1          893       3  female  47.0      1      0   7.0000  S
2          894       2  male    62.0      0      0   9.6875  Q
3          895       3  male    27.0      0      0   8.6625  S
4          896       3  female  22.0      1      1  12.2875  S
data_tr.isnull().sum()
data_te.isnull().sum()
PassengerId     0
Pclass          0
Sex             0
Age            86
SibSp           0
Parch           0
Fare            1
Embarked        0
dtype: int64

Fill in the missing values:

age_mean = data_tr['Age'].mean()
data_tr['Age'] = data_tr['Age'].fillna(age_mean)
data_tr['Embarked'] = data_tr['Embarked'].fillna('S')
data_tr.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
data_tr.head()
   PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch     Fare  Embarked
0            1         0       3  male    22.0      1      0   7.2500  S
1            2         1       1  female  38.0      1      0  71.2833  C
2            3         1       3  female  26.0      0      0   7.9250  S
3            4         1       1  female  35.0      1      0  53.1000  S
4            5         0       3  male    35.0      0      0   8.0500  S

Encode Sex and Embarked (S/C/Q) as integers, because the random forest needs numeric input; strings won't do:

age_mean = data_te['Age'].mean()
data_te['Age'] = data_te['Age'].fillna(age_mean)
fare_mean = data_te['Fare'].mean()
data_te['Fare'] = data_te['Fare'].fillna(fare_mean)
data_te['Sex'] = data_te['Sex'].map({'female': 0, 'male': 1}).astype(int)
data_te['Embarked'] = data_te['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
data_te.head()
   PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0          892       3    1  34.5      0      0   7.8292         2
1          893       3    0  47.0      1      0   7.0000         0
2          894       2    1  62.0      0      0   9.6875         2
3          895       3    1  27.0      0      0   8.6625         0
4          896       3    0  22.0      1      1  12.2875         0
data_tr['Sex'] = data_tr['Sex'].map({'female': 0, 'male': 1}).astype(int)
data_tr['Embarked'] = data_tr['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
data_tr.head()
#data_tr = pd.get_dummies(data_tr, columns=['Embarked'])
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0            1         0       3    1  22.0      1      0   7.2500         0
1            2         1       1    0  38.0      1      0  71.2833         1
2            3         1       3    0  26.0      0      0   7.9250         0
3            4         1       1    0  35.0      1      0  53.1000         0
4            5         0       3    1  35.0      0      0   8.0500         0
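The commented-out line above points at one-hot encoding, an alternative to the manual map() calls; a sketch on a toy frame (the data is made up, only the column name mirrors the post):

```python
import pandas as pd

toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# get_dummies expands one categorical column into one 0/1 column per category
encoded = pd.get_dummies(toy, columns=['Embarked'])
print(encoded.columns.tolist())
```

Unlike the integer mapping, this avoids implying an order between S, C and Q.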

2 Data Visualization

2.1 Port of Embarkation and Survival

import seaborn as sns
sns.barplot(x='Embarked', y='Survived', hue='Sex', data=data_train)
<matplotlib.axes._subplots.AxesSubplot at 0x7fee5875e3c8>

[figure: survival rate by port of embarkation, grouped by sex]

  • Females survived at a higher rate than males (presumably the men were being gentlemen, leaving the last chances to the women even in the face of death; a movie-worthy thought)
  • Survival is highest for both sexes at port C; males fare worst at Q and females worst at S
  • The male survival rate is less than a third of the female rate

2.2 Ticket Class and Survival

sns.pointplot(x='Pclass', y='Survived', hue='Sex', data=data_train,
              palette={'male': 'blue', 'female': 'pink'},
              markers=['*', 'o'], linestyles=['-', '--'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fee586f70b8>

[figure: survival rate by ticket class, grouped by sex]

  • The higher the class, the higher the survival rate
  • Females above males throughout

2.3 Age Group and Survival

sns.barplot(x = 'Age',y = 'Survived',hue='Sex',data = data_train)
<matplotlib.axes._subplots.AxesSubplot at 0x7fee587238d0>

[figure: survival rate by age group, grouped by sex]

  • Females above males in every age group
  • Student has the lowest survival rate and Baby the highest
sns.barplot(x = 'Cabin',y = 'Survived',hue='Sex',data = data_train)
<matplotlib.axes._subplots.AxesSubplot at 0x7fee585b0748>

[figure: survival rate by cabin letter, grouped by sex]

sns.barplot(x = 'Fare',y = 'Survived',hue='Sex',data = data_train)
<matplotlib.axes._subplots.AxesSubplot at 0x7fee5852b390>

[figure: survival rate by fare quartile, grouped by sex]

3 Building the Model

3.1 Random Forest

from sklearn.model_selection import train_test_split
X_all = data_tr.drop(['PassengerId','Survived'],axis=1)#the passenger ID carries no signal, so drop it too
y_all = data_tr['Survived']
p = 0.2 #hold out 20% of the data as a test set

X_train,X_test, y_train, y_test = train_test_split(X_all,y_all,test_size=p, random_state=23)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV

#Pick the classifier. I haven't tried other models; in this case others have found
#that a random forest works best, so I went with it. I'll try alternatives next time.
clf = RandomForestClassifier()

#Constraining the tree parameters limits tree size and helps prevent overfitting;
#pruning would also work, but sklearn's decision trees don't currently support post-pruning
parameters = {'n_estimators': [4, 6, 9], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],        #split criterion: entropy or Gini impurity
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1,5,8]
             }

#Scoring used to compare parameter settings; make_scorer turns accuracy_score into a scorer
acc_scorer = make_scorer(accuracy_score)

#GridSearchCV automates the tuning: feed it the parameter grid and it
#systematically tries every combination, using cross-validation to pick the best one
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)

#Set clf to the best parameter combination found
clf = grid_obj.best_estimator_

#Refit the best estimator on the training data
clf.fit(X_train,y_train)
/home/wvdon/anaconda3/envs/weidong/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
/home/wvdon/anaconda3/envs/weidong/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=5, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, n_estimators=4,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
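A side note not in the original: a fitted RandomForestClassifier exposes feature_importances_, a quick way to see which columns drive the predictions. A sketch on synthetic data (make_classification here, not the Titanic frame):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 7 features, like the Titanic input after dropping the IDs
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One non-negative weight per feature; they sum to 1
importances = forest.feature_importances_
print(importances)
```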

3.2 Prediction

predictions = clf.predict(X_test)
print(accuracy_score(y_test,predictions))
data_tr
0.8268156424581006
     PassengerId  Survived  Pclass  Sex        Age  SibSp  Parch     Fare  Embarked
0              1         0       3    1  22.000000      1      0   7.2500         0
1              2         1       1    0  38.000000      1      0  71.2833         1
2              3         1       3    0  26.000000      0      0   7.9250         0
3              4         1       1    0  35.000000      1      0  53.1000         0
4              5         0       3    1  35.000000      0      0   8.0500         0
..           ...       ...     ...  ...        ...    ...    ...      ...       ...
886          887         0       2    1  27.000000      0      0  13.0000         0
887          888         1       1    0  19.000000      0      0  30.0000         0
888          889         0       3    0  29.699118      1      2  23.4500         0
889          890         1       1    1  26.000000      0      0  30.0000         1
890          891         0       3    1  32.000000      0      0   7.7500         2

891 rows × 9 columns
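The 0.8268 above comes from a single 80/20 split, which can be noisy; cross-validation averages the score over several splits. A hedged sketch on synthetic data (with the real frame you would pass X_all and y_all instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; 5-fold CV returns one accuracy score per fold
X, y = make_classification(n_samples=300, n_features=7, random_state=1)
scores = cross_val_score(RandomForestClassifier(n_estimators=20, random_state=1), X, y, cv=5)
print(scores.mean())
```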

3.3 Predicting the test File

predictions = clf.predict(data_te.drop('PassengerId',axis=1))
#Kaggle expects a PassengerId column and no extra index column in the submission file
output = pd.DataFrame({'PassengerId':data_te['PassengerId'],'Survived':predictions})
output.to_csv(r'test1.csv', index=False)
output.head()
   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         0

3.4 Submitting to Kaggle

The score is 0.77990.
Hahaha, still pretty happy with that.
Next time I'll try deep learning.

