Kaggle web traffic competition: https://www.kaggle.com/c/web-traffic-time-series-forecasting/data
Kaggle web traffic competition, 1st place code: https://github.com/Arturus/kaggle-web-traffic
Kaggle web traffic competition, 1st place code walkthrough: https://blog.csdn.net/uwr44uouqcnsuqb60zk2/article/details/78794503
Kaggle web traffic competition, 2nd place code: https://github.com/jfpuget/Kaggle/tree/master/WebTrafficPrediction
Kaggle web traffic competition, 2nd place write-up: https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/39395
Kaggle web traffic competition, 6th place code (Python 3): https://github.com/alanshu2018/web-traffic-forecasting
Kaggle web traffic competition, 6th place code (Python 2): https://github.com/sjvasquez/web-traffic-forecasting
Detailed walkthrough of the 2nd place code:
1.Clone the Kaggle repository
2.Download competition data into the Kaggle/input directory
3.Go to the Kaggle/WebTrafficPrediction directory
4.Run the keras-kf-12-stage2-sept-10.ipynb notebook. This trains the base deep learning model and computes predictions from it.
It should produce files in the Kaggle/submissions directory, including:
- keras_kf_12_stage2_sept_10_train.csv
- keras_kf_12_stage2_sept_10_test.csv
5.The file keras_kf_12_stage2_sept_10_test.csv is my first submission. It scores 36.91121 and would have got the 4th rank overall.
- train_x:     median_x_y | max_x_y | median_diff_x_y | median_diff7m_x_y | site | agent
- train_y:     wx_dy_norm (w0_d4 ~ w8_d3)
- train_all_x: median_x_y | max_x_y | median_diff_x_y | median_diff7m_x_y | site | agent
- train_all_y: NaN
Conclusion: first use the neural network model to obtain a prediction, which serves as the stage-1 result.
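As a rough illustration of the inputs listed above, the per-period median/max features and the site/agent fields could be derived from the wide competition table as follows. This is a minimal sketch, assuming the standard train_2.csv layout; the exact windows, diff features, and normalization used in keras-kf-12-stage2-sept-10.ipynb are not reproduced here.

```python
import numpy as np
import pandas as pd

train = pd.read_csv('../input/train_2.csv')            # rows: pages, columns: Page + one column per day
visits = np.log1p(train.drop(columns='Page').values)   # work in log1p space

# (x, y) = "from x to y weeks before the end of the training window"
periods = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
n_days = visits.shape[1]

features = pd.DataFrame({'Page': train['Page']})
for x, y in periods:
    window = visits[:, n_days - 7 * y: n_days - 7 * x]
    features[f'median_{x}_{y}'] = np.nanmedian(window, axis=1)
    features[f'max_{x}_{y}'] = np.nanmax(window, axis=1)

# site and agent are parsed from the Page string, e.g.
# "2NE1_zh.wikipedia.org_all-access_spider" -> site zh.wikipedia.org, agent all-access_spider
parts = train['Page'].str.rsplit('_', n=3)
features['site'] = parts.str[1]
features['agent'] = parts.str[2] + '_' + parts.str[3]
```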
-------------------------------------------------------------------------First version---------------------------------------------------------------------------------
6.Run the Pred_11-stage2-sept-10.ipynb notebook. This creates a median-based model and computes predictions out of it. It should produce files in the Kaggle/submissions directory, including:
- pred_10_stage2_sept_10_train.csv
- pred_10_stage2_sept_10_test.csv
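A minimal sketch of a median-based baseline in the spirit of Pred_11-stage2-sept-10.ipynb: predict each page's future visits as the median of its recent history. The 8-week window and the file path are assumptions, not the notebook's exact settings.

```python
import numpy as np
import pandas as pd

def median_baseline(train_csv='../input/train_2.csv', weeks=8):
    """Predict every future day of a page as the median of its last `weeks` of visits."""
    train = pd.read_csv(train_csv)
    visits = train.drop(columns='Page').values
    recent = visits[:, -7 * weeks:]                          # last `weeks` weeks of daily visits
    preds = np.nan_to_num(np.nanmedian(recent, axis=1))      # pages with no data fall back to 0
    return pd.DataFrame({'Page': train['Page'], 'Visits': preds})
```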
7.Run the first_stage2.ipynb notebook. It computes the first date at which a page's data is non-zero. It should create a file in the Kaggle/data directory:
- first.csv
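The computation in step 7 can be sketched as below; the output column names are assumptions, and first.csv's actual schema may differ.

```python
import numpy as np
import pandas as pd

train = pd.read_csv('../input/train_2.csv')
dates = train.columns[1:]                     # daily date columns after 'Page'
values = train[dates].values

# index of the first day that is neither NaN nor zero; always-empty pages get -1
nonzero = (~np.isnan(values)) & (values != 0)
first_idx = np.where(nonzero.any(axis=1), nonzero.argmax(axis=1), -1)

first = pd.DataFrame({'Page': train['Page'],
                      'first_date': [dates[i] if i >= 0 else None for i in first_idx]})
first.to_csv('../data/first.csv', index=False)
```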
8.Run the xgb_23_keras_7_2_stage2-sept-10-2.ipynb notebook. This creates the final model by running xgboost on the residuals of the neural network predictions. It uses the past visits plus the above two notebook outputs as features (a sketch of the residual-fitting idea follows the feature list below). It should produce files in the Kaggle/submissions directory, including:
periods = [(0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), (7,8),
(0,2), (0,4), (0,8), (0,12), (0,16), (0, 20)]
features:
['WeekDay', 'YearDay', 'Month', 'WeekEnd',
 'Visits_pred_10',        # log(1 + NN prediction) - log(1 + Huber prediction)
 'Visits_keras_kf_3',     # log(1 + NN prediction)
 'AllVisits',             # page-level median over the whole training window
 'median_x_y',            # median of a period, e.g. the first week (0, 1)
 'median_x_y_ratio',      # median_x_y - AllVisits
 'median_day_x_y',        # median over weekdays
 'median_day_x_y_ratio',  # median_day_x_y - AllVisits
 'mean_x_y',              # mean of a period, e.g. the first week (0, 1)
 'mean_x_y_ratio',        # mean_x_y - AllVisits
 'mean_day_x_y',          # mean over weekdays
 'mean_day_x_y_ratio',    # mean_day_x_y - AllVisits
 'SiteLabel',             # label encoding of the site category
 'firstval',              # first date with non-zero data (from first.csv)
 'AllVar',                # variance
 'AllMax',                # maximum
]
- xgb_1_2017-09-12-19-14-14_test.csv
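A minimal sketch of the residual-fitting idea, assuming the feature matrix and NN predictions are already assembled; the argument names are placeholders and the hyperparameters are illustrative, not the notebook's settings.

```python
import numpy as np
import xgboost as xgb

def fit_residual_model(X_train, y_train, nn_pred_train, X_test, nn_pred_test):
    """Fit xgboost on the log-space residuals of the NN predictions and
    return corrected predictions for the test period."""
    residual = np.log1p(y_train) - np.log1p(nn_pred_train)

    model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1,
                             subsample=0.8, colsample_bytree=0.8)
    model.fit(X_train, residual)

    # final prediction = NN prediction + learned correction, mapped back to visit counts
    log_pred = np.log1p(nn_pred_test) + model.predict(X_test)
    return np.expm1(log_pred).clip(min=0)
```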
9.This file is my second submission. It scores 36.78499 and got me second place.
-----------------------------------------------------------------Second version-----------------------------------------------------------------------------------------
10.Kaggle asks to provide a simpler model that delivers 90% of the performance, if possible. Such a model is provided in the file keras_simple.ipynb. Its feature set is much simpler: basically the median of visits for each of the last 8 weeks of training data, plus the site (e.g. es.wikipedia.org) and the agent-access method. Its output scores 37.58692 and would have got the 9th rank.
keras_simple.ipynb
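A condensed sketch of what a model in the spirit of keras_simple.ipynb could look like: embed the site and agent, concatenate them with the 8 weekly medians, and train a small MLP on log1p visits. Layer sizes, embedding widths and the loss are guesses rather than the notebook's actual settings.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_simple_model(n_sites, n_agents, n_medians=8):
    medians = keras.Input(shape=(n_medians,), name='weekly_medians')   # log1p medians of the last 8 weeks
    site = keras.Input(shape=(1,), dtype='int32', name='site')         # integer-encoded site
    agent = keras.Input(shape=(1,), dtype='int32', name='agent')       # integer-encoded agent/access

    site_emb = layers.Flatten()(layers.Embedding(n_sites, 4)(site))
    agent_emb = layers.Flatten()(layers.Embedding(n_agents, 4)(agent))

    x = layers.Concatenate()([medians, site_emb, agent_emb])
    x = layers.Dense(200, activation='relu')(x)
    x = layers.Dense(100, activation='relu')(x)
    out = layers.Dense(1)(x)                                           # predicted log1p visits for one target day

    model = keras.Model([medians, site, agent], out)
    model.compile(optimizer='adam', loss='mae')                        # MAE on log1p is a common SMAPE proxy
    return model
```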
------------------------------------------------------------------Suggested simplified version------------------------------------------------------------------------------
Dataset split:
train 2016.3.14~2016.9.10
test 2016.9.13~2016.11.14
train_all 2017.3.14~2017.9.10
test_all 2017.9.13~2017.11.14
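Assuming the wide train_2.csv layout (one column per date), these windows can be cut out by column name; a small sketch:

```python
import pandas as pd

train = pd.read_csv('../input/train_2.csv')

def window(df, start, end):
    """Return Page plus all daily columns between start and end (inclusive)."""
    cols = [c for c in df.columns[1:] if start <= c <= end]
    return df[['Page'] + cols]

train_x     = window(train, '2016-03-14', '2016-09-10')   # validation-time inputs
test_x      = window(train, '2016-09-13', '2016-11-14')   # validation-time targets
train_all_x = window(train, '2017-03-14', '2017-09-10')   # inputs for the final model
# 2017-09-13 ~ 2017-11-14 is the stage-2 prediction period, for which no labels exist
```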