学习笔记为阿里云天池龙珠计划机器学习训练营的学习内容,学习链接为:https://tianchi.aliyun.com/competition/entrance/231702/introduction?spm=5176.20222472.J_3678908510.8.8f5e67c2RKrT98
总体思路:分别使用LightGBM,xgboost,gbdt,catboost建立多个个体学习器(加入bagging的策略,对数据随机采样),对最终学习器的输出使用岭回归进一步提升精度。代码如下。
改进点:
1.可以在详细分析一下字段,可以考虑对字段进行特殊处理。
2.超参数还可以调,我没有使用网格搜索,只是简单的进行的调参。
3.如果单纯为了提高精度,可以更高随机种子,多试几次
import pandas as pd
import numpy as np
df = pd.read_csv("happiness_train_complete.csv",encoding="GB2312")
df = df.sample(frac=1,replace=False,random_state=11)
df.reset_index(inplace=True)
df = df[df["happiness"]>0]
Y = df["happiness"]
df["survey_month"] = df["survey_time"].map(lambda line:line.split(" ")[0].split("/")[1]).astype("int64")
df["survey_day"] = df["survey_time"].map(lambda line:line.split(" ")[0].split("/")[2]).astype("int64")
df["survey_hour"] = df["survey_time"].map(lambda line:line.split(" ")[1].split(":")[0]).astype("int64")
X = df.drop(columns=["id","index","happiness","survey_time","edu_other","property_other","invest_other"])
from sklearn.model_selection import train_test_split
from lightgbm.sklearn import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.externals import joblib
from sklearn.model_selection import KFold
kfold = KFold(n_splits=15, shuffle = True, random_state= 12)
model = LGBMRegressor(n_jobs=-1,learning_rate=0.051,
n_estimators=400,
num_leaves=11,
reg_alpha=2.0,
reg_lambda=2.1,
min_child_samples=6,
min_split_gain=0.5,
colsample_bytree=0.2
)
mse = []
i=0
for train, test in kfold.split(X):
X_train = X.iloc[train]
y_train = Y.iloc[train]
X_test = X.iloc[test]
y_test = Y.iloc[test]
model.fit(X_train,y_train)
# model2.fit(model.predict(X_train,pred_leaf=True),y_train)
# y_pred = model2.predict(model.predict(X=X_test,pred_leaf=True))
y_pred = model.predict(X=X_test)
e = mean_squared_error(y_true=y_test,y_pred=y_pred)
mse.append(e)
print(e)
joblib.dump(filename="light"+str(i),value=model)
i+=1
print("lightgbm",np.mean(mse),mse)
#CatBoostRegressor
import pandas as pd
import numpy as np
df = pd.read_csv("happiness_train_complete.csv",encoding="GB2312")
df = df.sample(frac=1,replace=False,random_state=11)
df.reset_index(inplace=True)
df = df[df["happiness"]>0]
Y = df["happiness"]
df["survey_month"] = df["survey_time"].map(lambda line:line.split(" ")[0].split("/")[1]).astype("int64")
df["survey_day"] = df["survey_time"].map(lambda line:line.split(" ")[0].split("/")[2]).astype("int64")
df["survey_hour"] = df["survey_time"].map(lambda line:line.split(" ")[1].split(":")[0]).astype("int64")
X = df.drop(columns=["id","index","happiness","survey_time","edu_other","property_other","invest_other"])
from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.externals import joblib
kfold = KFold(n_splits=15, shuffle = True, random_state= 12)
model = CatBoostRegressor(colsample_bylevel=0.1,thread_count=6,silent=Tru