[Python] Kaggle_Titanic_prediction 2 -- Random Forest Prediction

Kaggle Titanic shipwreck survival prediction.
Summary: the best accuracy from the random forest model was 0.785, giving a Kaggle rank of 4158/10972 (top 38%) as of 2019-04-11.
That falls short of the 0.81 reported in some veteran posts. Perhaps the features could still be engineered better, or widely dispersed variables such as Fare should be scaled first to reduce the model's error.
[Screenshot: Kaggle Titanic leaderboard ranking]

I don't want to spend too long on the Titanic project, so scaling is left for the next project if the chance comes up.

On to the next project: predict-future-sales.
URL: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview/evaluation

Below: the Kaggle_Titanic_practice2 random forest prediction walkthrough.

# Earlier we tried logistic regression; its best accuracy was 0.775.
# Next, let's try a random forest.

# Import the usual data modules
import pandas as pd
import numpy as np
# Load the training set
train=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/train2.csv")
train.head(5)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked  Family age_group  age_group0  Sex0  Embarked0
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S       1      青年            3     2          1
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C       1      中年            4     1          2
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S       0      青年            3     1          1
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S       1      青年            3     1          1
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S       0      青年            3     2          1
# Load the test set
test=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test2.csv")
test.head(5)
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked  Family age_group  age_group0  Sex0  Embarked0
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q       0      青年            3     2          2
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S       1      中年            4     1          1
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q       0    中老年            5     2          2
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S       0      青年            3     2          1
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S       2      青年            3     1          1
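The extra columns above (Family, age_group, age_group0, Sex0, Embarked0) come from the feature engineering done in the earlier logistic-regression post; age_group holds Chinese labels (青年 young adult, 中年 middle-aged, 中老年 older adult). The exact encodings live in that post, but a purely illustrative reconstruction might look like the sketch below. The bin edges and category codes are guesses read off the printed rows, not the author's original code.

# Hypothetical reconstruction of the engineered columns in train2.csv / test2.csv.
# The real code is in the earlier logistic-regression post; all mappings here are assumptions.
for df in (train, test):
    df["Family"] = df["SibSp"] + df["Parch"]                        # number of relatives aboard
    df["Sex0"] = df["Sex"].map({"female": 1, "male": 2})            # numeric sex code
    df["Embarked0"] = df["Embarked"].map({"S": 1, "C": 2, "Q": 2})  # port code as it appears in the printed rows
    df["age_group0"] = pd.cut(df["Age"], bins=[0, 15, 35, 60, 100],
                              labels=[2, 3, 4, 5]).astype(float)    # coarse age buckets (illustrative cut points)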
# Try the random forest algorithm.
# First, throw in every variable that looks even loosely related.
from sklearn.ensemble import RandomForestClassifier
x1 = train[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y1 = train["Survived"]
x_test1 = test[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x1, y1)

Y_pred = random_forest.predict(x_test1)
score_randomforest = random_forest.score(x1, y1)
score_randomforest

# The fit on the training data is 0.94 -- much higher than the ~0.8 we got from logistic regression.
# But we still need to submit the predictions to Kaggle to see the real accuracy.
# (A quicker sanity check using the out-of-bag estimate is sketched after the submission below.)
0.9438832772166106
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final7-randomforest.csv",index=False)

# Kaggle score 0.78468 -- an improvement over logistic regression's best of about 0.77.
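The 0.94 printed earlier is accuracy on the training data itself, so it is optimistic; the Kaggle score of 0.78 is much closer to reality. Since oob_score=True was passed, the fitted forest already carries an out-of-bag estimate, which is a cheaper sanity check than a submission, and feature_importances_ shows how much each of the six variables contributes. Neither check is in the original run; this is just a sketch reusing the x1 and random_forest defined above.

# Out-of-bag accuracy: estimated on samples each tree never saw during training,
# so it usually tracks the leaderboard much better than the training-set score.
print(random_forest.oob_score_)

# Relative importance of each of the six input variables.
print(pd.Series(random_forest.feature_importances_, index=x1.columns).sort_values(ascending=False))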
# Next, tune the random forest parameters and see what the optimum is.
# Keep the variables the same, change only the parameters.
x1 = train[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y1 = train["Survived"]
x_test1 = test[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]

random_forest = RandomForestClassifier(oob_score=True, n_estimators=500)
random_forest.fit(x1, y1)

Y_pred = random_forest.predict(x_test1)
score_randomforest = random_forest.score(x1, y1)
score_randomforest

# Also fairly high.
0.9438832772166106
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final9-randomforest.csv",index=False)

# kaggle score 0.77990
# Keep the variables the same, change only the parameters.
x1 = train[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y1 = train["Survived"]
x_test1 = test[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]

random_forest = RandomForestClassifier(oob_score=True, n_estimators=1200)
random_forest.fit(x1, y1)

Y_pred = random_forest.predict(x_test1)
score_randomforest = random_forest.score(x1, y1)
score_randomforest

Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final10-randomforest.csv",index=False)

# kaggle score 0.77990
# In RandomForestClassifier, n_estimators was also tried at 500, 800, 1200 and 1500; none of them beat 1000.
# (A cheaper way to compare these values than submitting each run is cross-validation, sketched below.)
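Rather than submitting every n_estimators value to Kaggle, a 5-fold cross-validation on the training data gives a rough local comparison. This is not part of the original notebook, just a sketch reusing x1 and y1 from above.

# Compare several forest sizes by 5-fold cross-validation on the training set.
from sklearn.model_selection import cross_val_score
for n in [200, 500, 800, 1000, 1200, 1500]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(rf, x1, y1, cv=5)
    print(n, round(scores.mean(), 4))   # mean accuracy across the five folds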
# So from here on, keep the parameters fixed and change only the variables.
# Based on the best combination found with logistic regression, use the three variables ["Pclass","Family","Sex0"].
x2 = train[["Pclass","Family","Sex0"]]
y2 = train["Survived"]
x_test2 = test[["Pclass","Family","Sex0"]]

random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x2, y2)

Y_pred = random_forest.predict(x_test2)
score_randomforest = random_forest.score(x2, y2)
score_randomforest

# This value is actually lower. But from earlier experience it does not correlate strongly with the Kaggle score.
# (A held-out validation split, sketched after the Kaggle score below, tracks the leaderboard more closely.)
0.813692480359147
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final8-randomforest.csv",index=False)

# kaggle score 0.77033
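As noted above, the score on the training data itself does not track the leaderboard well. A held-out validation split usually lands much closer to the Kaggle score; this is not in the original notebook, just a sketch reusing x2 and y2 from above.

# Hold out 20% of the training data and score the forest on the unseen part.
from sklearn.model_selection import train_test_split
x_tr, x_val, y_tr, y_val = train_test_split(x2, y2, test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=1000, random_state=0)
rf.fit(x_tr, y_tr)
print(rf.score(x_val, y_val))   # validation accuracy, a rough proxy for the leaderboard score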
# From the full set of candidate variables, try dropping the fare variable "Fare".
x3 = train[["Pclass","Family","age_group0","Sex0","Embarked0"]]
y3 = train["Survived"]
x_test3 = test[["Pclass","Family","age_group0","Sex0","Embarked0"]]

random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x3, y3)

Y_pred = random_forest.predict(x_test3)
score_randomforest = random_forest.score(x3, y3)
score_randomforest

0.8552188552188552
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final11-randomforest.csv",index=False)

# With n_estimators=200, the Kaggle score was 0.76555 -- surprisingly, back down to the sample-submission level...
# With n_estimators=200, the Kaggle score was 0.77033 -- also not high.
# So does Fare really have that much influence? Let's try keeping Fare.
# Keep the fare variable "Fare" (this time dropping Pclass).
x4 = train[["Fare","Family","age_group0","Sex0","Embarked0"]]
y4 = train["Survived"]
x_test4 = test[["Fare","Family","age_group0","Sex0","Embarked0"]]

random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x4, y4)

Y_pred = random_forest.predict(x_test4)
score_randomforest = random_forest.score(x4, y4)
score_randomforest

# Nice surprise -- back up to the level we got with all variables included.
0.9438832772166106
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final12-randomforest.csv",index=False)

# Still, the Kaggle score is no higher -- around 0.77.
# Perhaps the features could still be engineered better;
# or widely dispersed variables such as Fare should be scaled first to reduce the model's error.
# Mainly, I don't want to spend too long on this project; I'll apply scaling in the next one if needed.
# (A sketch of what that scaling could look like follows.)
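For reference, scaling Fare would only take a couple of lines; below is a minimal sketch with sklearn's StandardScaler, not actually run here. Worth noting: tree ensembles such as random forests are largely insensitive to monotonic rescaling, so scaling matters more for models like logistic regression than for the forest above.

# Standardize Fare (zero mean, unit variance); fit on train, reuse the same transform on test.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train["Fare_scaled"] = scaler.fit_transform(train[["Fare"]]).ravel()
test["Fare_scaled"] = scaler.transform(test[["Fare"]]).ravel()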

# OK, that wraps up the Titanic project for now.
# Kaggle rank 4158/10972 (top 38%), 2019-04-11.

# On to the next project: predict-future-sales
# Project URL: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview/evaluation
# Let's get to it.