[6 (4) Machine Learning - Regression Task: Hands-On Abalone Age Prediction with XGBoost and LightGBM]

Article Navigation

[1. A Concise Roadmap for Advancing in Data Analysis (Article Navigation)]

I. Introduction to XGBoost

XGBoost (Extreme Gradient Boosting) is an optimized distributed gradient boosting library built on gradient-boosted decision trees. Designed for large-scale parallel tree boosting, it is an efficient, portable, and flexible machine learning algorithm for many data science problems, including classification, regression, and ranking.

XGBoost achieves faster training through parallelization and curbs overfitting through algorithmic optimizations. Its objective includes built-in regularization terms that control model complexity and thus guard against overfitting. It also supports column subsampling, which both reduces overfitting and cuts computation.
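
A minimal sketch of where these controls live in the scikit-learn style API (the parameter values are illustrative assumptions, not tuned results):

from xgboost import XGBRegressor

# Illustrative regularization and subsampling settings; the values are
# assumptions for demonstration, not tuned results
model = XGBRegressor(reg_alpha=0.1,          # L1 regularization on leaf weights
                     reg_lambda=1.0,         # L2 regularization on leaf weights
                     colsample_bytree=0.8,   # sample 80% of columns per tree
                     n_jobs=-1)              # train on all CPU cores in parallel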

XGBoost is extremely popular in data science competitions and has powered many winning solutions, for example on Kaggle. Its efficiency and accuracy make it a go-to tool for large datasets.

Advantages:
Efficiency: algorithmic optimizations and parallelization let XGBoost handle large datasets and finish training quickly.
Flexibility: XGBoost supports many types of objective functions and covers classification, regression, ranking, and more (see the sketch after this list).
Robustness: built-in regularization and column subsampling effectively prevent overfitting and improve generalization.
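
As a quick illustration of that flexibility, the objective parameter switches the same library between task types (a sketch, not an exhaustive list):

from xgboost import XGBRegressor, XGBClassifier, XGBRanker

reg = XGBRegressor(objective='reg:squarederror')   # regression
clf = XGBClassifier(objective='binary:logistic')   # binary classification
rnk = XGBRanker(objective='rank:pairwise')         # learning to rank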

II. Introduction to LightGBM

LightGBM is a gradient boosting framework built on decision tree algorithms, developed and open-sourced by Microsoft. It is efficient, fast, and highly scalable, and well suited to machine learning problems with large datasets and high-dimensional features. It has performed strongly in many data competitions and industrial applications, making it one of the most widely favored models in machine learning.

At its core, LightGBM follows the gradient boosting framework, iteratively training decision trees that successively approximate the optimum of the objective function. Compared with traditional gradient-boosted decision trees (GBDT), LightGBM introduces a histogram-based algorithm that discretizes feature values into bins, lowering computational complexity and speeding up training. It also introduces Exclusive Feature Bundling (EFB) and Gradient-based One-Side Sampling (GOSS), which further improve accuracy and generalization.
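
A minimal sketch of where these ideas surface in the parameter dictionary (the values are illustrative assumptions; note the version caveat in the comments):

import lightgbm as lgb

# Illustrative parameters for the techniques above; values are assumptions
params = {
    'objective': 'regression',
    'max_bin': 255,           # histogram algorithm: number of bins per feature
    'enable_bundle': True,    # Exclusive Feature Bundling (on by default)
    'boosting_type': 'goss',  # GOSS sampling; LightGBM >= 4.0 prefers
                              # data_sample_strategy='goss' instead
}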

Advantages:
Efficiency: fast training and prediction, especially on large datasets.
Low memory usage: the histogram-based algorithm and leaf-wise tree growth reduce memory consumption, making it suitable for memory-constrained environments (see the sketch after this list).
High accuracy: algorithmic optimizations and feature selection improve model accuracy.
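
Leaf-wise growth is governed mainly by the leaf count rather than tree depth; a small sketch of the usual guardrails (values are illustrative assumptions):

import lightgbm as lgb

# Typical guardrails for leaf-wise tree growth; values are not tuned
params = {
    'objective': 'regression',
    'num_leaves': 31,        # primary capacity control in leaf-wise growth
    'max_depth': -1,         # -1 means no explicit depth limit
    'min_data_in_leaf': 20,  # limits overly small leaves to curb overfitting
}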

III. Code Implementation

1. Import libraries

# Import libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  

import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler

from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold

from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin
from sklearn.base import clone
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge
from sklearn.kernel_ridge import KernelRidge
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
import xgboost as xgb
from bayes_opt import BayesianOptimization

# Configure matplotlib to render Chinese characters (SimHei font)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Make pandas display all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

2. Load the data

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

train.head()
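
It helps to confirm the schema before encoding; a quick inspection sketch:

train.info()                 # column dtypes and non-null counts
train['Sex'].value_counts()  # the lone categorical column in this dataset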

3. Preprocess the categorical feature

le_sex = LabelEncoder()
train['Sex'] = le_sex.fit_transform(train['Sex'])
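
To see which integer each category received, the fitted encoder can be inspected (a small sketch):

# Show the category-to-integer mapping learned by the encoder
print(dict(zip(le_sex.classes_, le_sex.transform(le_sex.classes_))))
# for the abalone Sex column this is typically {'F': 0, 'I': 1, 'M': 2}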

4. Train/test split, model initialization, hyperparameter optimization, and model saving

# Split the training data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(train.drop(columns=['id','Rings']), train['Rings'], test_size=0.2,
                                                    random_state=42)
# Build the LightGBM datasets (the validation set references the training set)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'mean_squared_error'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# Train a baseline LightGBM model; in LightGBM >= 4.0 early stopping is
# supplied as a callback rather than the old early_stopping_rounds argument
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1000,
                valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])
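
# Optional sanity check: record the baseline validation MSE of this untuned
# model so the Bayesian-optimized version below has a reference point
baseline_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
print(f"Baseline LightGBM MSE: {mean_squared_error(y_test, baseline_pred):.4f}")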


# Objective function for the Bayesian optimization of LightGBM
def lgb_evaluate(num_leaves, learning_rate, feature_fraction, bagging_fraction, bagging_freq):
    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'mean_squared_error'},
        'num_leaves': int(num_leaves),
        'learning_rate': learning_rate,
        'feature_fraction': max(min(feature_fraction, 1), 0),
        'bagging_fraction': max(min(bagging_fraction, 1), 0),
        'bagging_freq': int(bagging_freq),
        'verbose': 0
    }

    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=1000,
                    valid_sets=[lgb_eval],
                    callbacks=[lgb.early_stopping(stopping_rounds=10)])

    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    mse = mean_squared_error(y_test, y_pred)
    return -mse  # negated because BayesianOptimization maximizes its objective


# Search space for each hyperparameter
pbounds = {'num_leaves': (10, 50),
           'learning_rate': (0.01, 0.1),
           'feature_fraction': (0.1, 0.9),
           'bagging_fraction': (0.1, 0.9),
           'bagging_freq': (1, 10)}

# Run the Bayesian optimization search
optimizer = BayesianOptimization(f=lgb_evaluate, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=10, n_iter=20)

# Retrieve the best parameters found by the optimizer
params = optimizer.max['params']

# Cast the integer-valued hyperparameters back to int and restore the fixed
# settings that were not part of the search space
params['num_leaves'] = int(params['num_leaves'])
params['bagging_freq'] = int(params['bagging_freq'])
params.update({'boosting_type': 'gbdt',
               'objective': 'regression',
               'metric': 'mean_squared_error',
               'verbose': 0})

# Retrain LightGBM with the tuned parameters
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1000,
                valid_sets=[lgb_eval],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])

# Train a baseline XGBoost model with default hyperparameters
model = XGBRegressor()
model.fit(X_train, y_train)
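
# Optional sanity check: baseline validation MSE for the untuned XGBoost model
baseline_xgb_pred = model.predict(X_test)
print(f"Baseline XGBoost MSE: {mean_squared_error(y_test, baseline_xgb_pred):.4f}")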


# Objective function for the Bayesian optimization of XGBoost
def xgb_evaluate(max_depth, learning_rate, subsample, colsample_bytree, gamma):
    params = {
        'max_depth': int(max_depth),
        'learning_rate': learning_rate,
        'subsample': max(min(subsample, 1), 0),
        'colsample_bytree': max(min(colsample_bytree, 1), 0),
        'gamma': max(gamma, 0),
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse'
    }

    model = XGBRegressor(**params)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return -mse


# Search space for each hyperparameter
pbounds = {'max_depth': (1, 10),
           'learning_rate': (0.01, 0.1),
           'subsample': (0.1, 1),
           'colsample_bytree': (0.1, 1),
           'gamma': (0, 10)}

# Run the Bayesian optimization search
optimizer = BayesianOptimization(f=xgb_evaluate, pbounds=pbounds, random_state=42)
optimizer.maximize(init_points=10, n_iter=20)

# Retrieve the best parameters and cast max_depth back to an integer
params = optimizer.max['params']
params['max_depth'] = int(params['max_depth'])

# Retrain XGBoost with the tuned parameters
model = XGBRegressor(**params)
model.fit(X_train, y_train)
# Evaluate the tuned models on the validation split
# LightGBM predictions
lgb_y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
lgb_mse = mean_squared_error(y_test, lgb_y_pred)
print(f"LightGBM MSE: {lgb_mse:.4f}")

# XGBoost predictions
xgb_y_pred = model.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_y_pred)
print(f"XGBoost MSE: {xgb_mse:.4f}")

# Compare the two models side by side
print(f"LightGBM MSE: {lgb_mse:.4f}, XGBoost MSE: {xgb_mse:.4f}")

# Plot the top-10 feature importances for the LightGBM model
lgb.plot_importance(gbm, max_num_features=10)

# Plot the top-10 feature importances for the XGBoost model
xgb.plot_importance(model, max_num_features=10)
plt.show()

# Save the trained models
import joblib

# Save the LightGBM model
joblib.dump(gbm, 'lgb_model.pkl')

# Save the XGBoost model
joblib.dump(model, 'xgb_model.pkl')

5. Predict on the test set

# Reuse the encoder fitted on the training data; transform (not fit_transform)
# keeps the category-to-integer mapping consistent between train and test
test['Sex'] = le_sex.transform(test['Sex'])

# Load the saved LightGBM model
model = joblib.load('lgb_model.pkl')

df_new = test.drop(columns=['id'])

# Predict on the test features
y_pred = model.predict(df_new)

# Attach the predictions to the test table
test['Rings'] = y_pred

test[['id','Rings']].to_csv('20240406_001.csv',index=False)
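
A quick sanity check on the written file before submitting:

# Confirm the submission has the expected columns and row count
submission = pd.read_csv('20240406_001.csv')
print(submission.shape)
print(submission.head())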
