Longzhu Training Camp Machine Learning: Task 04

a_little_pig_

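These study notes cover material from the machine learning training camp of the Alibaba Cloud Tianchi Longzhu Program. Course link: https://tianchi.aliyun.com/competition/entrance/231702/introduction?spm=5176.20222472.J_3678908510.8.8f5e67c2RKrT98
Overall approach: build several individual learners with LightGBM, XGBoost, GBDT, and CatBoost (adding a bagging strategy by randomly sampling the data), then apply ridge regression to the learners' outputs to further improve accuracy. The code is below.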
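
The overall approach described above (several boosted learners whose outputs are blended by ridge regression) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in this post: it uses synthetic data from `make_regression`, scikit-learn's `GradientBoostingRegressor` as a stand-in for all four libraries, and a hypothetical helper `out_of_fold_predictions` with an 80% row subsample as the bagging-style step.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for the happiness data (assumption: all features are numeric).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)


def out_of_fold_predictions(make_model, X, y, n_splits=5, seed=12):
    """Hypothetical helper: return one out-of-fold prediction per row, so the
    meta-learner never sees a prediction from a model trained on that row."""
    oof = np.zeros(len(y))
    rng = np.random.default_rng(seed)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Bagging-style step (assumption): fit on a random 80% of the training fold.
        sub = rng.choice(train_idx, size=int(0.8 * len(train_idx)), replace=False)
        model = make_model()
        model.fit(X[sub], y[sub])
        oof[test_idx] = model.predict(X[test_idx])
    return oof


# Two illustrative base learners that differ only in depth and seed.
base_learners = [
    lambda: GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0),
    lambda: GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=1),
]

# Each column of the meta-feature matrix holds one base learner's predictions.
meta_features = np.column_stack(
    [out_of_fold_predictions(make, X, y) for make in base_learners]
)

# Ridge regression learns how much weight each base learner should get.
meta_model = Ridge(alpha=1.0)
meta_model.fit(meta_features, y)
blended = meta_model.predict(meta_features)
print("ridge weights:", meta_model.coef_)
```

Feeding the ridge model out-of-fold predictions rather than in-sample ones is a common choice here, since in-sample predictions from boosted trees tend to look better than they really are.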

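Possible improvements:
1. Analyze the fields in more detail and consider special handling for some of them.
2. The hyperparameters can be tuned further. I did not use grid search, only some simple manual tuning.
3. If the only goal is a better score, change the random seed and try several more times.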
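
For the second point, a grid search could look roughly like the sketch below. It is an assumption-laden example rather than a tuned setup: the data is synthetic, `GradientBoostingRegressor` stands in for the boosted models, and the grid values are arbitrary placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data so the example runs on its own (not the happiness dataset).
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Placeholder grid: 2 x 2 = 4 candidate settings.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, so MSE is negated
    cv=KFold(n_splits=3, shuffle=True, random_state=12),
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_params, best_mse)
```

The same `GridSearchCV` wrapper should accept `LGBMRegressor` or `CatBoostRegressor` as the estimator, since both follow the scikit-learn estimator interface, but the cost grows quickly with the number of grid points and folds.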

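import pandas as pd
import numpy as np

# Load the training data; the file is encoded in GB2312.
df = pd.read_csv("happiness_train_complete.csv", encoding="GB2312")
# Shuffle the rows (frac=1 without replacement keeps every row exactly once).
df = df.sample(frac=1, replace=False, random_state=11)
# reset_index keeps the old index as an "index" column, which is dropped below.
df.reset_index(inplace=True)
# Keep only rows whose happiness label is positive.
df = df[df["happiness"] > 0]
Y = df["happiness"]
# Extract month, day, and hour from survey_time (date parts split on "/", time parts on ":").
df["survey_month"] = df["survey_time"].map(lambda line: line.split(" ")[0].split("/")[1]).astype("int64")
df["survey_day"] = df["survey_time"].map(lambda line: line.split(" ")[0].split("/")[2]).astype("int64")
df["survey_hour"] = df["survey_time"].map(lambda line: line.split(" ")[1].split(":")[0]).astype("int64")
# Drop the identifiers, the target, the raw timestamp, and the *_other columns.
X = df.drop(columns=["id", "index", "happiness", "survey_time", "edu_other", "property_other", "invest_other"])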
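
As an aside, the three time features above could also be derived with pandas' datetime parsing instead of string splitting. The sketch below assumes that `survey_time` follows a "year/month/day hour:minute" layout, which is what the split-based code implies; the two sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the layout the split-based code appears to expect.
sample = pd.DataFrame({"survey_time": ["2015/8/4 14:18", "2015/7/21 9:05"]})

# Parse once, then read the parts through the .dt accessor.
ts = pd.to_datetime(sample["survey_time"], format="%Y/%m/%d %H:%M")
sample["survey_month"] = ts.dt.month
sample["survey_day"] = ts.dt.day
sample["survey_hour"] = ts.dt.hour
print(sample)
```

Parsing to a real timestamp also makes further features such as the weekday (`ts.dt.dayofweek`) easy to add, and it fails loudly on malformed rows instead of silently picking the wrong field.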

from sklearn.model_selection import train_test_split
from lightgbm.sklearn import LGBMRegressor
from sklearn.metrics import mean_squared_error
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import KFold

kfold = KFold(n_splits=15, shuffle=True, random_state=12)
model = LGBMRegressor(n_jobs=-1, learning_rate=0.051,
                      n_estimators=400,
                      num_leaves=11,
                      reg_alpha=2.0,
                      reg_lambda=2.1,
                      min_child_samples=6,
                      min_split_gain=0.5,
                      colsample_bytree=0.2
                      )
mse = []
i = 0
for train, test in kfold.split(X):
    X_train = X.iloc[train]
    y_train = Y.iloc[train]
    X_test = X.iloc[test]
    y_test = Y.iloc[test]
    model.fit(X_train, y_train)
    # model2.fit(model.predict(X_train,pred_leaf=True),y_train)
    # y_pred = model2.predict(model.predict(X=X_test,pred_leaf=True))
    y_pred = model.predict(X=X_test)
    e = mean_squared_error(y_true=y_test, y_pred=y_pred)
    mse.append(e)
    print(e)
    # Save this fold's fitted model as light0, light1, ...
    joblib.dump(filename="light" + str(i), value=model)
    i += 1
print("lightgbm", np.mean(mse), mse)
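
When only the per-fold MSE is needed, the manual loop above can be approximated with `cross_val_score`. The sketch below is a stand-in under assumptions: it runs on synthetic data with scikit-learn's `GradientBoostingRegressor`, and unlike the loop above it does not keep or save the fitted fold models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data so the example is self-contained.
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=12)
model_demo = GradientBoostingRegressor(n_estimators=30, random_state=0)

# scikit-learn reports negated MSE, so flip the sign to get one MSE per fold.
neg_mse = cross_val_score(model_demo, X_demo, y_demo, cv=kfold_demo,
                          scoring="neg_mean_squared_error")
fold_mse = -neg_mse
print("mean MSE:", np.mean(fold_mse), "std:", np.std(fold_mse))
```

Because the fold models are discarded, the explicit loop remains the better fit when the saved models are needed later for blending.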
#CatBoostRegressor
import pandas as pd
import numpy as np
df = pd.read_csv("happiness_train_complete.csv",encoding="GB2312")
df = df.sample(frac=1,replace=False,random_state=11)
df.reset_index(inplace=True)

df = df[df["happiness"]>0]
Y = df["happiness"]
df["survey_month"] = df["survey_time"].map(lambda line:line.split(" ")[0].split("/")[1]).astype("int64")
df["survey_day"] = df["survey_time"].map(lambda line:line.split(" ")[0].split("/")[2]).astype("int64")
df["survey_hour"] = df["survey_time"].map(lambda line:line.split(" ")[1].split(":")[0]).astype("int64")
X = df.drop(columns=["id","index","happiness","survey_time","edu_other","property_other","invest_other"])

from sklearn.model_selection import train_test_split
from catboost import Pool, CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
kfold = KFold(n_splits=15, shuffle=True, random_state=12)
model = CatBoostRegressor(colsample_bylevel=0.1,thread_count=6,silent=Tru