- About the competition
- Problem: https://www.kaggle.com/c/LANL-Earthquake-Prediction
- The competition runs for five months, with two submissions allowed per day and free team formation; team merging closes in the final week, so merge early if you plan to team up: https://www.kaggle.com/c/LANL-Earthquake-Prediction/overview/timeline
- The Kaggle platform provides Kernels (you can create scripts and notebooks)
- Drawbacks:
- From mainland China you need a VPN, and while a kernel is running a commit may fail to go through because of network problems.
- Write your own output log file (see the debugOutputFile helper in the code further down): if you rely only on print inside the program, you cannot see the full output once the run fails.
- Runs that take too long, e.g. more than X hours, are automatically failed.
- Advantages:
- Once other people's notebooks are made public, their approaches are shared and can be a source of inspiration.
- Scripts can run code (both Python and R are supported), you can add Python packages ("Add a custom package" at the bottom right of the script editor), you can enable a GPU, and you can look at other people's code and their input/output files.
- Working through the problem
- Understand the problem statement, pin down the inputs and outputs, and be clear about the problem type [data visualization deepens your understanding of the data]
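As a starting point for the visualization step, here is a minimal sketch that plots a downsampled view of the signal against the target, assuming the competition's train.csv with its acoustic_data and time_to_failure columns; the file path, nrows limit and downsampling step are illustrative choices, not values from the original notebook.

```python
# Sketch: downsampled overview of the signal vs. time_to_failure.
# Path, nrows and the ::1000 step are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("../input/train.csv",
                    nrows=10_000_000,
                    dtype={"acoustic_data": "int16", "time_to_failure": "float64"})

sampled = train.iloc[::1000]  # keep every 1000th row, enough for a quick look

fig, ax1 = plt.subplots(figsize=(12, 4))
ax1.plot(sampled.index, sampled["acoustic_data"], color="tab:blue")
ax1.set_ylabel("acoustic_data")
ax2 = ax1.twinx()
ax2.plot(sampled.index, sampled["time_to_failure"], color="tab:red")
ax2.set_ylabel("time_to_failure")
plt.title("Signal vs. time to failure (downsampled)")
plt.savefig("signal_overview.png")
```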
- Denoise the raw data
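The notes do not say which denoising method was used; as one possible approach, here is a minimal sketch that removes the mean and applies a high-pass Butterworth filter with scipy.signal. The cutoff frequency and sampling rate below are illustrative assumptions, not values taken from the competition data.

```python
# Sketch of one possible denoising step: mean removal plus a high-pass filter.
# cutoff_hz and fs are illustrative assumptions only.
import numpy as np
from scipy import signal

def highpass_denoise(x, cutoff_hz=1000.0, fs=4_000_000.0, order=4):
    """Remove the mean and filter out low-frequency drift from the signal."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    b, a = signal.butter(order, cutoff_hz / (0.5 * fs), btype="high")
    return signal.filtfilt(b, a, x)

# Hypothetical usage: denoised = highpass_denoise(train["acoustic_data"].values)
```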
- Choose a model [you can try several, then compare and combine them; a comparison sketch follows the code below]
import datetime
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import GridSearchCV

# Append timestamped messages to log.txt so output survives a failed kernel run.
def debugOutputFile(content):
    with open("log.txt", "a+") as f:
        f.writelines(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + ": " + str(content) + "\n")

def mae_score(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred)

mae_scorer = make_scorer(mae_score, greater_is_better=False)

# Grid-search gamma for the XGBoost regressor, scored by MAE.
# (Newer xgboost versions expect eval_metric / early_stopping_rounds in the
# constructor rather than in fit().)
clf = xgb.XGBRegressor(**params)
cv_params = {'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
optimized_GBM = GridSearchCV(estimator=clf, param_grid=cv_params, scoring=mae_scorer,
                             cv=5, verbose=0, n_jobs=4)
optimized_GBM.fit(X, y, eval_set=[(X, y)], eval_metric='mae',
                  verbose=0, early_stopping_rounds=200)
debugOutputFile(optimized_GBM.best_params_)
debugOutputFile(optimized_GBM.cv_results_)
debugOutputFile(optimized_GBM.best_score_)
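The model-selection bullet above suggests trying several models and comparing or combining them; a minimal sketch of that idea, assuming X, y and X_test have already been built elsewhere in the kernel. LightGBM and the random forest are used purely as illustrative alternatives, not necessarily what was tried in the competition.

```python
# Sketch: compare a few regressors by cross-validated MAE, then blend them by
# simple averaging. Assumes X, y, X_test exist; model choices are illustrative.
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

models = {
    "xgb": xgb.XGBRegressor(max_depth=4, n_estimators=500, learning_rate=0.01),
    "lgb": lgb.LGBMRegressor(num_leaves=31, n_estimators=500, learning_rate=0.01),
    "rf": RandomForestRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5)
    print(name, "CV MAE:", -scores.mean())

# Naive blend: fit each model on all the data and average the test predictions.
blend = np.mean([m.fit(X, y).predict(X_test) for m in models.values()], axis=0)
```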
- Select features (the extracted feature data can be saved for reuse)
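On saving the extracted features: a minimal sketch that turns fixed-length signal segments into one row of statistical features each and writes them to CSV. It assumes the full train.csv has been loaded into a DataFrame called train; the segment length of 150,000 rows mirrors the length of the test segments, and the particular features are illustrative only.

```python
# Sketch: one feature row per 150,000-row segment, saved so the feature
# matrix does not have to be rebuilt every run. Feature set is illustrative.
import numpy as np
import pandas as pd

def extract_features(segment):
    x = segment["acoustic_data"].values
    return {
        "mean": x.mean(),
        "std": x.std(),
        "max": x.max(),
        "min": x.min(),
        "q95": np.quantile(x, 0.95),
        "q05": np.quantile(x, 0.05),
        "abs_mean": np.abs(x).mean(),
    }

rows, targets = [], []
seg_len = 150_000
for start in range(0, len(train) - seg_len + 1, seg_len):
    seg = train.iloc[start:start + seg_len]
    rows.append(extract_features(seg))
    targets.append(seg["time_to_failure"].values[-1])  # label = value at segment end

X = pd.DataFrame(rows)
y = pd.Series(targets, name="time_to_failure")
X.to_csv("train_features.csv", index=False)
y.to_csv("train_target.csv", index=False)
```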
- How to judge feature importance: correlation analysis (xgboost also has its own feature-importance function)
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RandomizedLasso  # RandomizedLasso requires scikit-learn < 0.21
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Scale a vector of importance scores to [0, 1] and pair it with feature names.
def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order * np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks))

# Rank features with several linear/tree models, average the ranks,
# save the full table to ranks.csv, and return the top-k feature names.
def getSomeFeatures(X, Y, k):
    names = X.columns
    ranks = {}

    lr = LinearRegression(normalize=True)
    lr.fit(X, Y)
    ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)

    ridge = Ridge(alpha=1)
    ridge.fit(X, Y)
    ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)

    lasso = Lasso(alpha=.05)
    lasso.fit(X, Y)
    ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)

    rlasso = RandomizedLasso(alpha=0.04)
    rlasso.fit(X, Y)
    ranks["Corr."] = rank_to_dict(np.abs(rlasso.scores_), names)

    # Stop the search when 5 features are left (they will get equal scores).
    rfe = RFE(lr, n_features_to_select=5)
    rfe.fit(X, Y)
    ranks["RFE"] = rank_to_dict(list(map(float, rfe.ranking_)), names, order=-1)

    rf = RandomForestRegressor()
    rf.fit(X, Y)
    ranks["RF"] = rank_to_dict(rf.feature_importances_, names)

    # Mean rank of each feature across all methods.
    r = {}
    for name in names:
        r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)
    methods = sorted(ranks.keys())
    ranks["Mean"] = r
    methods.append("Mean")

    ranks = pd.DataFrame(ranks)
    ranks.to_csv('ranks.csv', index=True)

    # Return the k features with the highest mean rank.
    r = sorted(r.items(), key=lambda d: d[1], reverse=True)
    res = []
    for i in range(k):
        res.append(r[i][0])
    return res
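The bullet above also mentions correlation analysis and xgboost's built-in feature importance; a minimal sketch of both, assuming a feature DataFrame X, a target Series y, and an already fitted XGBRegressor named clf.

```python
# Sketch: two quick importance views. Assumes X (DataFrame), y (Series) and a
# fitted XGBRegressor clf already exist in the kernel.
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

# 1) Absolute correlation of each feature with the target.
corr = X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)
print(corr.head(20))

# 2) xgboost's own importance scores.
importance = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(20))
xgb.plot_importance(clf, max_num_features=20)
plt.savefig("xgb_importance.png")
```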
- Cross-validated training
- Parameter tuning: GridSearchCV (see the grid-search example under model selection above)
import numpy as np
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import KFold

folds = KFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros((len(X), 1))        # out-of-fold predictions on the training set
test_preds = np.zeros((len(X_test), 1))  # averaged predictions on the test set

params = {
    'max_depth': 4,
    'learning_rate': 0.01,
    'n_estimators': 500,
    'min_child_weight': 1,
    'colsample_bytree': 0.7,
    'subsample': 0.7,
    'nthread': 12,
    'random_state': 42,
    'seed': 27,
    'scale_pos_weight': 1,
    'gamma': 0.5,
}
debugOutputFile(params)

for fold_, (trn_, val_) in enumerate(folds.split(X)):
    print("Current Fold: {}".format(fold_))
    trn_x, trn_y = X.iloc[trn_], y.iloc[trn_]
    val_x, val_y = X.iloc[val_], y.iloc[val_]

    clf = xgb.XGBRegressor(**params)
    clf.fit(
        trn_x, trn_y,
        eval_set=[(trn_x, trn_y), (val_x, val_y)],
        eval_metric='mae',
        verbose=0,
        early_stopping_rounds=200
    )

    val_pred = clf.predict(val_x)
    test_fold_pred = clf.predict(X_test)

    debugOutputFile("MAE = {}".format(metrics.mean_absolute_error(val_y, val_pred)))
    oof_preds[val_, :] = val_pred.reshape((-1, 1))
    test_preds += test_fold_pred.reshape((-1, 1))

test_preds /= 5  # average over the five folds
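To turn the averaged fold predictions into a submission file, a minimal sketch assuming the competition's sample_submission.csv with seg_id and time_to_failure columns, and that the rows of X_test were built in the same seg_id order; the path is illustrative.

```python
# Sketch: write the averaged test predictions into a submission file.
# Assumes sample_submission.csv has 'seg_id' and 'time_to_failure' columns
# and that X_test rows follow the same seg_id order.
import pandas as pd

submission = pd.read_csv("../input/sample_submission.csv", index_col="seg_id")
submission["time_to_failure"] = test_preds.ravel()
submission.to_csv("submission.csv")
print("submission written:", submission.shape)
```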
- Reflections
- Read more of other people's scripts and notebooks to broaden your thinking.
- Pay attention to the small details; every problem comes with its own domain label.
- [For example, this earthquake-prediction problem inherently carries a signal-processing label, so denoising should have come to mind earlier.]
- Appendix
- Reflections from discussions with teammates
- Data processing (we did not know much about signal processing);
- Feature extraction and selection (we lacked a systematic approach, with no clear step one and step two, and did not know how to pick the important features);
- Coding standards and habits; we did not properly preserve the intermediate results of the competition;
- We were exposed to too little related information; read the literature and search for material;
- Optimize the time-consuming steps as early as possible;
- Strengthen fundamentals and theory, such as machine learning and probability;
- The team needs someone in charge, regular discussions, and a fixed shared teamwork time, and the lead should push a bit, haha;