Kaggle Titanic Shipwreck Survival Prediction
Summary: the random forest's best accuracy came out at 0.785, for a Kaggle rank of 4158/10972 (top 38%) as of 2019-04-11.
That falls short of the 0.81 reported in some expert write-ups. The features could likely be engineered further; alternatively, widely dispersed variables such as Fare should probably be scaled first to reduce the model's error (a quick sketch of that idea follows below).
I don't want to spend much longer on Titanic, so scaling is left for the next project if the chance comes up.
On to the next project: predict-future-sales.
URL: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview/evaluation
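For reference, the scaling step would look something like the sketch below, assuming scikit-learn's StandardScaler (this was not run in this notebook; the Fare_scaled column name is made up for illustration):

# Minimal sketch of scaling Fare; an assumption for future work, not run here.
# Fit the scaler on the training data only, then apply it to both sets,
# so the test set never leaks into the scaling statistics.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train["Fare_scaled"] = scaler.fit_transform(train[["Fare"]])
test["Fare_scaled"] = scaler.transform(test[["Fare"]])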
Below: Kaggle_Titanic_practice2, the random forest prediction walkthrough.
# Earlier, logistic regression topped out at an accuracy of 0.775.
# Next up: a random forest.
# Import the usual data-handling modules
import pandas as pd
import numpy as np
# Load the training set
train=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/train2.csv")
train.head(5)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family | age_group | age_group0 | Sex0 | Embarked0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 青年 | 3 | 2 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | 中年 | 4 | 1 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 青年 | 3 | 1 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | 青年 | 3 | 1 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 青年 | 3 | 2 | 1 |
# Load the test set
test=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/titanic/test2.csv")
test.head(5)
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family | age_group | age_group0 | Sex0 | Embarked0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 | 青年 | 3 | 2 | 2 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 1 | 中年 | 4 | 1 | 1 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 | 中老年 | 5 | 2 | 2 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 | 青年 | 3 | 2 | 1 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 2 | 青年 | 3 | 1 | 1 |
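A note on the preprocessed columns: train2.csv and test2.csv come from the earlier logistic regression notebook. Family appears to be SibSp + Parch, age_group holds Chinese age-band labels (青年 young, 中年 middle-aged, 中老年 older), and age_group0, Sex0, Embarked0 are numeric codes. The exact mappings are not shown here, and the sample rows above suggest they may even disagree between the two files (Embarked0 is 2 for C in train but 2 for Q in test), which would be worth re-checking. A hypothetical sketch of one consistent encoding applied to both frames:

# Hypothetical re-encoding sketch; the real mappings were built earlier.
# Applying one shared dict to both frames avoids train/test mismatches.
sex_map = {"female": 1, "male": 2}        # matches the sample rows above
embarked_map = {"S": 1, "C": 2, "Q": 3}   # assumed order, for illustration
for df in (train, test):
    df["Sex0"] = df["Sex"].map(sex_map)
    df["Embarked0"] = df["Embarked"].map(embarked_map)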
# Let's try the random forest algorithm.
# First, throw in every variable that looks even loosely related.
from sklearn.ensemble import RandomForestClassifier
x1 = train[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y1 = train["Survived"]
x_test1 = test[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x1, y1)
Y_pred = random_forest.predict(x_test1)
score_randomforest = random_forest.score(x1, y1)
score_randomforest
# Training-set fit is 0.94: very high, well above the roughly 0.8 from logistic regression.
# Still, the predictions need to be submitted to Kaggle before judging real accuracy.
0.9438832772166106
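One caveat: score(x1, y1) measures accuracy on the same rows the forest was trained on, so 1000 trees will almost inevitably inflate it. Since oob_score=True was passed above, a more honest estimate is already available for free:

# Out-of-bag accuracy: each sample is scored only by the trees whose
# bootstrap draw excluded it, so it behaves like held-out validation.
print(random_forest.oob_score_)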
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final7-randomforest.csv",index=False)
# Kaggle score 0.78468; an improvement over logistic regression's best of 0.77.
# Next, tune the random forest parameters to find the optimum.
# Same variables, different parameters.
x1 = train[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y1 = train["Survived"]
x_test1 = test[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=500)
random_forest.fit(x1, y1)
Y_pred = random_forest.predict(x_test1)
score_randomforest = random_forest.score(x1, y1)
score_randomforest
# Also fairly high.
0.9438832772166106
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final9-randomforest.csv",index=False)
# kaggle score 0.77990
# Same variables, different parameters.
x1 = train[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
y1 = train["Survived"]
x_test1 = test[["Pclass","Fare","Family","age_group0","Sex0","Embarked0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1200)
random_forest.fit(x1, y1)
Y_pred = random_forest.predict(x_test1)
score_randomforest = random_forest.score(x1, y1)
score_randomforest
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final10-randomforest.csv",index=False)
# kaggle score 0.77990
# n_estimators was also tried at 500, 800, 1200, and 1500; none beat 1000 (see the sweep sketch below).
# So from here the parameters stay fixed and only the variables change.
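For a more systematic sweep than re-running the cell by hand, a small loop over candidate values can compare out-of-bag scores without burning Kaggle submissions; a minimal sketch:

# Compare n_estimators settings by OOB score instead of by submission.
for n in (500, 800, 1000, 1200, 1500):
    rf = RandomForestClassifier(oob_score=True, n_estimators=n)
    rf.fit(x1, y1)
    print(n, rf.oob_score_)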
# Based on the best combination from the earlier logistic regression, pick the three variables ["Pclass","Family","Sex0"].
x2 = train[["Pclass","Family","Sex0"]]
y2 = train["Survived"]
x_test2 = test[["Pclass","Family","Sex0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x2, y2)
Y_pred = random_forest.predict(x_test2)
score_randomforest = random_forest.score(x2, y2)
score_randomforest
# This value actually dropped. But based on earlier experience, it does not correlate strongly with the Kaggle score.
0.813692480359147
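That weak correlation is expected, because training accuracy mostly rewards memorization. For a local number that tracks the leaderboard more closely, cross-validation on the training set is the usual tool; a minimal sketch:

from sklearn.model_selection import cross_val_score
# Mean 5-fold accuracy is a better stand-in for the Kaggle score than
# random_forest.score(x2, y2) on the training rows.
cv = cross_val_score(RandomForestClassifier(n_estimators=1000), x2, y2, cv=5)
print(cv.mean())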
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final8-randomforest.csv",index=False)
# kaggle score 0.77033
# Starting from all candidate variables, try dropping the Fare (ticket price) variable.
x3 = train[["Pclass","Family","age_group0","Sex0","Embarked0"]]
y3 = train["Survived"]
x_test3 = test[["Pclass","Family","age_group0","Sex0","Embarked0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x3, y3)
Y_pred = random_forest.predict(x_test3)
score_randomforest = random_forest.score(x3, y3)
score_randomforest
0.8552188552188552
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final11-randomforest.csv",index=False)
# With n_estimators=200, Kaggle score 0.76555; back down to the sample submission's score, of all things...
# With n_estimators=1000 (the run above), Kaggle score 0.77033; not high either.
# So does Fare really carry that much magic? Let's try keeping Fare, then.
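Rather than guessing at Fare's magic, the fitted forest can report it directly; a minimal sketch of feature importances over the full six-variable model:

# Per-feature importance from a forest fit on all six variables.
rf_full = RandomForestClassifier(n_estimators=1000)
rf_full.fit(x1, y1)
for name, imp in zip(x1.columns, rf_full.feature_importances_):
    print(name, round(imp, 3))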
# Keep the Fare (ticket price) variable
x4 = train[["Fare","Family","age_group0","Sex0","Embarked0"]]
y4 = train["Survived"]
x_test4 = test[["Fare","Family","age_group0","Sex0","Embarked0"]]
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(x4, y4)
Y_pred = random_forest.predict(x_test4)
score_randomforest = random_forest.score(x4, y4)
score_randomforest
# Wow, that's high; a nice surprise! Back to the level of the first run with all variables included.
0.9438832772166106
Final = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": Y_pred.astype(int)})
Final.to_csv(r"D:/2018_BigData/Python/Kaggle_learning/Titanic Machine Learning from Disaster/Final12-randomforest.csv",index=False)
# Yet the Kaggle score is still only around 0.77.
# Hmm. The features could likely be engineered further;
# or widely dispersed variables such as Fare should be scaled first to reduce the model's error.
# Mainly, I don't want to linger on this project; scaling can be applied in the next one if needed.
# OK, that wraps up the Titanic project for now.
# Kaggle rank: 4158/10972 (top 38%), as of 2019-04-11.
# On to the next project: predict-future-sales
# Next project URL: https://www.kaggle.com/c/competitive-data-science-predict-future-sales/overview/evaluation
# Let's get to it.