Dataset Description
train.csv - the training set; price is the continuous target
test.csv - the test set; your task is to predict the value of price for each row
sample_submission.csv - a sample submission file in the correct format
The goal of this competition is to predict the price of used cars from their various attributes.
The dataset comes from a Kaggle competition. The training set contains 188,533 used-car records and 13 columns (an id plus 12 attributes). It covers cars of many brands and models and provides a range of attributes, including brand, model, model_year (production year), milage (mileage driven), fuel_type, engine, transmission, ext_col (exterior color), int_col (interior color), accident (accident history), clean_title (whether the vehicle has a clean title), and price.
Source Code and Explanation
import pandas as pd
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
Read in the training and test datasets.
# Remove duplicate records
train_data = train_data.drop_duplicates()
# Columns whose missing values need to be filled
columns_to_fill = ['fuel_type', 'accident', 'clean_title']
# Fill the missing values of the specified columns in train_data and test_data
for column in columns_to_fill:
    train_data[column] = train_data[column].fillna('unknow')
    test_data[column] = test_data[column].fillna('unknow')
Deduplication: dropping duplicate records from the training set reduces redundant data and helps keep the model from overfitting to repeated rows.
Missing-value filling: replacing the missing values in fuel_type, accident, and clean_title with the placeholder 'unknow' keeps the dataset complete and prevents NaNs from causing errors during model training.
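As a quick sanity check (a minimal sketch, not part of the original code), you can confirm that the filled columns no longer contain missing values:
# Remaining missing values per column; the three filled columns should now show 0
print(train_data.isna().sum())
print(test_data.isna().sum())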
from datetime import datetime
# Get the current year
current_year = datetime.now().year
# Create a new column holding the current year
train_data['year'] = current_year
# Make sure the column is an integer type
train_data['year'] = train_data['year'].astype(int)
train_data['diff_year'] = train_data['year'] - train_data['model_year']
# Create the same column in the test set
test_data['year'] = current_year
# Make sure the column is an integer type
test_data['year'] = test_data['year'].astype(int)
test_data['diff_year'] = test_data['year'] - test_data['model_year']
train_data.head()
This step computes the difference between the current year and the car's production year; vehicle age is an important driver of a used car's price.
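The intermediate year column is not strictly needed; an equivalent, slightly more compact form of the same step (a sketch, assuming current_year is defined as above) would be:
# Equivalent computation without the helper 'year' column
train_data['diff_year'] = current_year - train_data['model_year']
test_data['diff_year'] = current_year - test_data['model_year']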
# Loop over all object-type columns and print their unique values
for column in train_data.select_dtypes(include=['object']).columns:
    unique_values = train_data[column].unique()
    print(f"Unique values in '{column}': {unique_values}")
Unique values in 'brand': ['MINI' 'Lincoln' 'Chevrolet' 'Genesis' 'Mercedes-Benz' 'Audi' 'Ford'
'BMW' 'Tesla' 'Cadillac' 'Land' 'GMC' 'Toyota' 'Hyundai' 'Volvo'
'Volkswagen' 'Buick' 'Rivian' 'RAM' 'Hummer' 'Alfa' 'INFINITI' 'Jeep'
'Porsche' 'McLaren' 'Honda' 'Lexus' 'Dodge' 'Nissan' 'Jaguar' 'Acura'
'Kia' 'Mitsubishi' 'Rolls-Royce' 'Maserati' 'Pontiac' 'Saturn' 'Bentley'
'Mazda' 'Subaru' 'Ferrari' 'Aston' 'Lamborghini' 'Chrysler' 'Lucid'
'Lotus' 'Scion' 'smart' 'Karma' 'Plymouth' 'Suzuki' 'FIAT' 'Saab'
'Bugatti' 'Mercury' 'Polestar' 'Maybach']
Unique values in 'model': ['Cooper S Base' 'LS V8' 'Silverado 2500 LT' ... 'e-Golf SE'
'Integra w/A-Spec Tech Package' 'IONIQ Plug-In Hybrid SEL']
Unique values in 'fuel_type': ['Gasoline' 'E85 Flex Fuel' 'unknow' 'Hybrid' 'Diesel' 'Plug-In Hybrid'
'–' 'not supported']
Unique values in 'engine': ['172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel'
'252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel'
'320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capability' ...
'78.0HP 1.2L 3 Cylinder Engine Gasoline Fuel'
'139.0HP 1.6L 4 Cylinder Engine Plug-In Electric/Gas'
'313.0HP 2.0L 4 Cylinder Engine Plug-In Electric/Gas']
Unique values in 'transmission': ['A/T' 'Transmission w/Dual Shift Mode' '7-Speed A/T' '8-Speed A/T'
'10-Speed Automatic' '1-Speed A/T' '6-Speed A/T' '10-Speed A/T'
'9-Speed A/T' '8-Speed Automatic' '9-Speed Automatic' '5-Speed A/T'
'Automatic' '7-Speed Automatic with Auto-Shift' 'CVT Transmission'
'5-Speed M/T' 'M/T' '6-Speed M/T' '6-Speed Automatic' '4-Speed Automatic'
'7-Speed M/T' '2-Speed A/T' '1-Speed Automatic' 'Automatic CVT'
'4-Speed A/T' '6-Speed Manual' 'Transmission Overdrive Switch'
'8-Speed Automatic with Auto-Shift' '7-Speed Manual' '7-Speed Automatic'
'9-Speed Automatic with Auto-Shift' '6-Speed Automatic with Auto-Shift'
'6-Speed Electronically Controlled Automatic with O' 'F' 'CVT-F'
'8-Speed Manual' 'Manual' '–' '2' '6 Speed At/Mt' '5-Speed Automatic'
'2-Speed Automatic' '8-SPEED A/T' '7-Speed' 'Variable'
'Single-Speed Fixed Gear' '8-SPEED AT'
'10-Speed Automatic with Overdrive' '7-Speed DCT Automatic'
'SCHEDULED FOR OR IN PRODUCTION' '6-Speed' '6 Speed Mt']
Unique values in 'ext_col': ['Yellow' 'Silver' 'Blue' 'Black' 'White' 'Snowflake White Pearl Metallic'
'Gray' 'Green' 'Santorini Black Metallic' 'Purple'
'Ebony Twilight Metallic' 'Red' 'Magnetite Black Metallic'
'Diamond Black' 'Vega Blue' 'Beige' 'Gold' 'Platinum White Pearl'
'Metallic' 'White Frost Tri-Coat' 'Firecracker Red Clearcoat'
'Phytonic Blue Metallic' 'Blu' 'Orange' 'Brown'
'Brilliant Silver Metallic' 'Black Raven' 'Black Clearcoat' 'Firenze Red'
'Agate Black Metallic' 'Glacial White Pearl' 'Majestic Plum Metallic'
'designo Diamond White Metallic' 'Oxford White' 'Black Sapphire Metallic'
'Mythos Black' 'Granite Crystal Clearcoat Metallic'
'White Diamond Tri-Coat' 'Magnetite Gray Metallic'
'Carpathian Grey Premium Metallic' 'designo Diamond White Bright'
'Phantom Black Pearl Effect / Black Roof' 'Nebula Gray Pearl'
'Deep Crystal Blue Mica' 'Flame Red Clearcoat' 'Lunar Blue Metallic'
'Bright White Clearcoat' 'Rapid Red Metallic Tinted Clearcoat' 'Caviar'
'Dark Ash Metallic' 'Velvet Red Pearlcoat' 'Silver Zynith' 'Super Black'
'Antimatter Blue Metallic' 'Dark Moon Blue Metallic' 'Summit White'
'Ebony Black' '–' 'Black Cherry' 'Delmonico Red Pearlcoat'
'Platinum Quartz Metallic' 'Ultra White' 'Python Green'
'Garnet Red Metallic' 'Snow White Pearl' 'Cajun Red Tintcoat'
'Midnight Black Metallic' 'Diamond White' 'Mythos Black Metallic'
'Soul Red Crystal Metallic' 'Atomic Silver' 'Obsidian'
'Magnetic Metallic' 'Twilight Blue Metallic' 'Star White' 'Stormy Sea'
'Tango Red Metallic' 'Hyper Red' 'Portofino Gray'
'MANUFAKTUR Diamond White Bright' 'Snowflake White Pearl'
'Patriot Blue Pearlcoat' 'Tungsten Metallic' 'Chronos Gray Metallic'
'Silver Ice Metallic' 'Daytona Gray Pearl Effect'
'Ruby Red Metallic Tinted Clearcoat' 'Alpine White' 'Eminent White Pearl'
'Manhattan Noir Metallic' 'Quicksilver Metallic' 'Stellar Black Metallic'
'Sparkling Silver' 'Blueprint' 'Crystal Black Silica' 'Black Noir Pearl'
'Arancio Borealis' 'Typhoon Gray' 'Ibis White' 'Graphite Grey'
'Mineral White' 'Midnight Black' 'Northsky Blue Metallic' 'Alta White'
'Brilliant Black' 'Jet Black Mica'
'Daytona Gray Pearl Effect w/ Black Roof' 'Redline Red'
'Glacier Silver Metallic' 'Magnetic Black' 'Chronos Gray'
'Red Quartz Tintcoat' 'Nero Noctis' 'Firenze Red Metallic'
'Iridescent Pearl Tricoat' 'Twilight Black' 'Radiant Red Metallic II'
'Blue Metallic' 'Glacier White' 'Daytona Gray' 'Rosso Mars Metallic'
'Wolf Gray' 'Santorin Black' 'Designo Magno Matte'
'Emerald Green Metallic' 'Ruby Flare Pearl' 'Lunar Silver Metallic'
'Eiger Grey Metallic' 'Quartzite Grey Metallic' 'Barcelona Red'
'Beluga Black' 'Matador Red Metallic' 'Billet Silver Metallic Clearcoat'
'Anodized Blue Metallic' 'Black Forest Green' 'Ice Silver Metallic'
'Sandstone Metallic' 'Magnetic Gray Clearcoat' 'Crystal Black Pearl'
'Pacific Blue Metallic' 'Stone Gray Metallic' 'Iconic Silver Metallic'
'Dark Sapphire' 'Onyx' 'Aventurine Green Metallic' 'China Blue'
'Majestic Black Pearl' 'Midnight Silver Metallic' 'Sting Gray Clearcoat'
'Glacier Blue Metallic' 'BLACK' 'Chalk' 'Dark Matter Metallic'
'Infrared Tintcoat' 'Iridium Metallic' 'Fuji White' 'Alfa White'
'Kodiak Brown Metallic' 'Aurora Black' 'Onyx Black'
'Nightfall Gray Metallic' 'Obsidian Black Metallic' 'Phantom Black'
'Remington Red Metallic' 'designo Diamond White' 'Lizard Green'
'Rosso Corsa' 'Shadow Gray Metallic' 'Florett Silver' 'Quartz White'
'DB Black Clearcoat' 'Yulong White' 'Eiger Grey' 'Custom Color'
'Electric Blue Metallic' 'Tempest' 'Lunar Rock' 'Mosaic Black Metallic'
'Gecko Pearlcoat' 'White Clearcoat' 'BLU ELEOS'
'Granite Crystal Metallic Clearcoat' 'Rich Garnet Metallic'
'Graphite Grey Metallic' 'Bianco Icarus Metallic' 'Satin Steel Metallic'
'BLUE' 'Moonlight Cloud' 'Matador Red Mica' 'Emin White'
'Machine Gray Metallic' 'White Platinum Tri-Coat Metallic'
'Cobra Beige Metallic' 'Cayenne Red Tintcoat' 'Shoreline Blue Pearl'
'Vik Black' 'Shimmering Silver' 'Bianco Monocerus'
'Carbonized Gray Metallic' 'Carrara White Metallic' 'Dark Slate Metallic'
'Dark Graphite Metallic' 'Sonic Silver Metallic'
'White Knuckle Clearcoat' 'Titanium Silver' 'Anthracite Blue Metallic'
'Black Obsidian' 'Polymetal Gray Metallic' 'Orca Black Metallic'
'Wind Chill Pearl' 'Blue Reflex Mica' 'Dark Moss'
'Selenite Grey Metallic' 'Kemora Gray Metallic' 'Nightfall Mica'
'Liquid Platinum' 'Mountain Air Metallic' 'Kinetic Blue'
'Santorini Black' 'Carbon Black Metallic' 'Gentian Blue Metallic'
'Red Multi' 'Super White' 'Pearl White' 'Typhoon Gray Metallic'
'Navarra Blue Metallic' 'Bianco Isis' 'Navarra Blue'
'Volcano Grey Metallic' 'Arctic Gray Metallic' 'Pure White' 'Baltic Gray'
'Glacier White Metallic' 'Frozen Dark Silver Metallic'
'Magnetic Gray Metallic' 'Gun Metallic' 'Siren Red Tintcoat'
'Deep Blue Metallic' 'Cirrus Silver Metallic' 'Deep Black Pearl Effect'
'Granite' 'Sunset Drift Chromaflair' 'Oryx White Prl'
'Dark Gray Metallic' 'Bayside Blue' 'Pink' 'Ice' 'Mango Tango Pearlcoat'
'Burnished Bronze Metallic' 'Verde' 'Arctic White'
'Portofino Blue Metallic' 'Dazzling White' 'Nero Daytona'
'Nautical Blue Pearl' 'Imperial Blue Metallic' 'Vulcano Black Metallic'
'Silver Radiance' 'Hellayella Clearcoat' 'Jungle Green' 'C / C' 'Yulong'
'Pristine White' 'Silky Silver' 'Caspian Blue' 'Sangria Red'
'Donington Grey Metallic' 'Apex Blue' 'Rift Metallic' 'Fountain Blue'
'Balloon White' 'Matte White' 'Frozen White' 'Pacific Blue' 'Rosso'
'Ironman Silver' 'Octane Red Pearlcoat' 'Selenite Gray Metallic'
'Hydro Blue Pearlcoat' 'Ingot Silver Metallic' 'Quartz Blue Pearl'
'Lunare White Metallic' 'Ember Pearlcoat' 'Brands Hatch Gray Metallic'
'Navarre Blue' 'Midnight Blue Metallic' 'Shadow Black' 'Go Mango!'
'Maximum Steel Metallic' 'Silver Flare Metallic'
'Billet Clearcoat Metallic' 'Hampton Gray' 'Red Obsession' 'Silver Mist'
'Scarlet Ember' 'Crimson Red Tintcoat' 'Tan' 'Isle of Man Green Metallic'
'Crystal Black' 'Glacier' 'Iridium Silver Metallic'
'Bronze Dune Metallic' 'Maroon' 'Platinum Gray Metallic' 'Passion Red'
'Silician Yellow' 'Volcanic Orange' 'Crystal White Pearl' 'Reflex Silver'
'Blue Caelum' 'Thunder Gray' 'Ultra Black' 'Indus Silver' 'Horizon Blue'
'Grigio Nimbus' 'Carpathian Grey' 'Ametrin Metallic' 'Jupiter Red'
'GT SILVER']
Unique values in 'int_col': ['Gray' 'Beige' 'Black' '–' 'Blue' 'White' 'Red' 'Brown' 'Dark Galvanized'
'Parchment.' 'Boulder' 'Orange' 'Medium Earth Gray' 'Ebony'
'Canberra Beige' 'Jet Black' 'Silver' 'Light Platinum / Jet Black'
'Macchiato/Magmagrey' 'Gold' 'Cloud' 'Rioja Red' 'Global Black' 'Green'
'Medium Stone' 'Navy Pier' 'Dark Ash' 'BLACK' 'Portland' 'Sandstone'
'Canberra Beige/Black' 'Diesel Gray / Black' 'Sarder Brown' 'Black Onyx'
'White / Brown' 'Black/Gun Metal' 'Slate' 'Satin Black'
'Macchiato Beige/Black' 'Charcoal' 'Black / Express Red' 'Cappuccino'
'Aragon Brown' 'Parchment' 'Oyster W/Contrast' 'Adrenaline Red' 'Ebony.'
'Shara Beige' 'Graystone' 'Pearl Beige' 'Nero Ade' 'Graphite'
'Tan/Ebony/Ebony' 'Charcoal Black' 'Medium Ash Gray' 'Ebony Black'
'Light Titanium' 'Sakhir Orange' 'Tan' 'Rock Gray' 'Brandy'
'Carbon Black' 'Amber' 'Black w/Red Stitching' 'Hotspur' 'Chateau' 'Ice'
'Deep Garnet' 'Blk' 'Grace White' 'Oyster/Black' 'Mesa' 'Espresso'
'Black/Graphite' 'Ebony / Ebony Accents' 'Tan/Ebony' 'Ceramic'
'Medium Dark Slate' 'Graphite w/Gun Metal' 'Obsidian Black'
'Cocoa / Dune' 'Roast' 'Yellow' 'Hotspur Hide' 'Gray w/Blue Bolsters'
'Chestnut' 'Saiga Beige' 'ORANGE' 'Charles Blue' 'Walnut' 'Ivory / Ebony'
'Caramel' 'Pimento Red w/Ebony' 'Saddle Brown' 'Dark Gray'
'Silk Beige/Espresso Brown' 'Black / Brown' 'Ebony/Light Oyster Stitch'
'Ebony / Pimento' 'Mistral Gray / Raven' 'Giallo Taurus / Nero Ade'
'Tension' 'Medium Pewter' 'Black / Saddle' 'Camel Leather'
'Black/Saddle Brown' 'Macchiato' 'Anthracite' 'Mocha' 'Whisper Beige'
'Titan Black / Quarzit' 'Sahara Tan' 'Porpoise' 'Black/Red' 'Titan Black'
'AMG Black' 'Deep Cypress' 'Light Slate' 'Red / Black' 'Beluga Hide'
'Tupelo' 'Gideon' 'Rhapsody Blue' 'Medium Light Camel' 'Almond Beige'
'Black / Gray' 'Nero' 'Agave Green' 'Deep Chestnut' 'Dark Auburn' 'Shale'
'Silk Beige/Black' 'BEIGE' 'Magma Red' 'Linen' 'Black / Stone Grey'
'Sand Beige' 'Red/Black' 'Bianco Polar' 'Light Gray' 'Platinum' 'Sport'
'Ash' 'Black / Graphite' 'Nougat Brown' 'Camel' 'Mountain Brown'
'Pimento / Ebony' 'Classic Red' 'Sakhir Orange/Black' 'Cobalt Blue'
'Very Light Cashmere' 'Kyalami Orange' 'Orchid' 'Beluga' 'WHITE']
Unique values in 'accident': ['None reported' 'At least 1 accident or damage reported' 'unknow']
Unique values in 'clean_title': ['Yes' 'unknow']
This is actually the trickiest part of the dataset: almost every column is of object type, and several of them have a very large number of distinct values. Label encoding, ordinal encoding, and one-hot encoding are all poor fits for such high-cardinality columns; one-hot encoding in particular would massively inflate the dimensionality of the dataset and slow down model training.
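Before choosing an encoding, it helps to quantify how many distinct values each categorical column actually has; a minimal sketch:
# Number of distinct values per object-type column, highest first
print(train_data.select_dtypes(include=['object']).nunique().sort_values(ascending=False))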
# Drop the 'price', 'id', 'ext_col', 'transmission', 'brand', 'int_col', 'model_year', 'year' and 'model' columns from train_data to obtain train_data1
train_data1 = train_data.drop(['price', 'id', 'ext_col', 'transmission', 'brand', 'int_col', 'model_year', 'year', 'model'], axis=1)
# Drop the same columns (minus 'price') from test_data to obtain test_data1
test_data1 = test_data.drop(['id', 'ext_col', 'transmission', 'brand', 'int_col', 'model_year', 'year', 'model'], axis=1)
# Extract the 'price' column from train_data as the target variable
label = train_data['price']
# Inspect the shapes (rows, columns) of train_data1 and test_data1
train_data1.shape, test_data1.shape
These columns are dropped here as redundant features, and only the engine column is processed further below. You are encouraged to work on the dropped columns yourself and extract more useful information from them; one possibility is sketched right after this paragraph.
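For example, the transmission strings printed above contain a speed count and an automatic/manual indicator that could be turned into features. A hypothetical sketch (the n_speeds and is_manual columns are illustrative names, not part of the original pipeline):
# Hypothetical features parsed from the 'transmission' column
train_data['n_speeds'] = pd.to_numeric(
    train_data['transmission'].str.extract(r'(?i)(\d+)[- ]speed', expand=False),
    errors='coerce')
train_data['is_manual'] = train_data['transmission'].str.contains(
    r'M/T|Manual', case=False, na=False).astype(int)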
# Extract horsepower and displacement
train_data1['hp'] = train_data1['engine'].str[0:3]
train_data1['l'] = train_data1['engine'].str[8:11]
# Convert the two extracted columns to numeric types
train_data1['hp'] = pd.to_numeric(train_data1['hp'], errors='coerce')
train_data1['l'] = pd.to_numeric(train_data1['l'], errors='coerce')
# Extract horsepower and displacement
test_data1['hp'] = test_data1['engine'].str[0:3]
test_data1['l'] = test_data1['engine'].str[8:11]
# Convert the two extracted columns to numeric types
test_data1['hp'] = pd.to_numeric(test_data1['hp'], errors='coerce')
test_data1['l'] = pd.to_numeric(test_data1['l'], errors='coerce')
# Inspect column dtypes and non-null counts
train_data1.info()
Horsepower and displacement are extracted from the engine column (look closely and you will find that many text features contain some usable information) and converted to numeric variables. Note that the fixed-position slicing above only works for strings that follow the dominant layout; anything else simply becomes NaN after pd.to_numeric. A more robust, pattern-based variant is sketched below.
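A regex-based extraction (a sketch, not the approach used above) anchors on the 'HP' and 'L' markers visible in the engine values and is less sensitive to where the numbers sit in the string:
# Sketch: regex-based extraction of horsepower and displacement
for df in (train_data1, test_data1):
    df['hp'] = pd.to_numeric(df['engine'].str.extract(r'(\d+(?:\.\d+)?)HP', expand=False), errors='coerce')
    df['l'] = pd.to_numeric(df['engine'].str.extract(r'(\d+(?:\.\d+)?)L\b', expand=False), errors='coerce')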
from sklearn.preprocessing import OneHotEncoder
# Select the object-type columns of the training set for one-hot encoding
object_columns = train_data1.select_dtypes(include=['object']).columns
# Create the OneHotEncoder; handle_unknown='ignore' copes with categories that only appear in the test set
encoder = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
# fit_transform on the training set to learn the encoding and produce the encoded array
train_data_encoded = encoder.fit_transform(train_data1[object_columns])
# transform the test set with the encoding learned from the training set
test_data_encoded = encoder.transform(test_data1[object_columns])
# Convert the encoded arrays back into DataFrames
train_data_encoded_df = pd.DataFrame(train_data_encoded, columns=encoder.get_feature_names_out(object_columns))
test_data_encoded_df = pd.DataFrame(test_data_encoded, columns=encoder.get_feature_names_out(object_columns))
# Inspect the shapes of the encoded training and test sets
print(train_data_encoded_df.shape)
print(test_data_encoded_df.shape)
One-hot encoding is applied to the remaining categorical variables; because some of the redundant columns were already dropped, this does not add much computation. Note that the modeling code below uses train_data_encoded_df alone as its feature matrix, so as written the numeric columns (milage, diff_year, hp, l) are not passed to the model; combining them is sketched next.
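If you do want the numeric features alongside the encoded categoricals, a minimal sketch (assuming both frames can be aligned on a fresh integer index) would be:
# Sketch: concatenate the numeric columns with the one-hot encoded columns
numeric_cols = train_data1.select_dtypes(exclude=['object']).columns
train_features = pd.concat([train_data1[numeric_cols].reset_index(drop=True), train_data_encoded_df], axis=1)
test_features = pd.concat([test_data1[numeric_cols].reset_index(drop=True), test_data_encoded_df], axis=1)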
Next, we can build a model and see how it performs. LightGBM is used here because the dataset is fairly large and LightGBM trains quickly, uses little memory, and is still quite accurate; Optuna is used for hyperparameter optimization. To learn more about LightGBM and Optuna, see my introductory Kaggle competition write-up "Spaceship Titanic LightGBM + Optuna hyperparameter tuning".
import optuna
import lightgbm as lgb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import logging
# Keep log output at the warning level (LightGBM's own chatter is silenced via verbose=-1 below)
logging.basicConfig(level=logging.WARNING)
# Features and target
X = train_data_encoded_df
y = label
# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit mean/std on the training split and transform it
X_valid_scaled = scaler.transform(X_valid)  # transform the validation split with the training-split statistics
# Record the RMSE of every trial
rmse_list = []
# Define the Optuna objective function
def objective(trial):
    # Let Optuna suggest the hyperparameters
    params = {
        'objective': 'regression',  # regression task
        'boosting_type': 'gbdt',  # gradient-boosted decision trees
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),  # maximum number of leaves per tree
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-1),  # learning rate (sampled uniformly over this range)
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),  # number of trees
        'max_depth': trial.suggest_int('max_depth', 3, 15),  # maximum tree depth
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),  # row sampling rate
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),  # feature sampling rate
        # L1 and L2 regularization
        'lambda_l1': trial.suggest_float('lambda_l1', 0.0, 10.0),  # L1 regularization strength
        'lambda_l2': trial.suggest_float('lambda_l2', 0.0, 10.0)  # L2 regularization strength
    }
    # Create the model; verbose=-1 suppresses LightGBM's log output
    model = lgb.LGBMRegressor(**params, verbose=-1)
    # Train with early stopping on the validation set
    model.fit(X_train_scaled, y_train, eval_set=[(X_valid_scaled, y_valid)], callbacks=[lgb.early_stopping(stopping_rounds=10)])
    # Predict on the validation set
    y_pred = model.predict(X_valid_scaled)
    # Compute the RMSE (root mean squared error)
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
    # Record this trial's RMSE
    rmse_list.append(rmse)
    return rmse  # Optuna minimizes the RMSE to find the best hyperparameters
# Create the Optuna study
study = optuna.create_study(direction='minimize')  # minimize RMSE
# Run the hyperparameter search
study.optimize(objective, n_trials=100)  # 100 trials
# Print the best hyperparameters and the corresponding RMSE
print(f"Best trial: {study.best_trial.params}")
print(f"Best RMSE: {study.best_value}")
# Retrain the final model with the best hyperparameters
best_params = study.best_trial.params
final_model = lgb.LGBMRegressor(**best_params, verbose=-1)
final_model.fit(X_train_scaled, y_train)  # retrain on the training split
# Predict on the validation set and compute the RMSE
y_pred_final = final_model.predict(X_valid_scaled)
final_rmse = np.sqrt(mean_squared_error(y_valid, y_pred_final))
print(f"Final RMSE on validation set: {final_rmse}")
# Plot the RMSE of each trial
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(rmse_list) + 1), rmse_list, marker='o', linestyle='-', color='b')
plt.xlabel('Trial Number')
plt.ylabel('RMSE')
plt.title('RMSE Progression During Optuna Optimization')
plt.grid(True)
plt.show()
Best trial: {'num_leaves': 99, 'learning_rate': 0.09357485283802834, 'n_estimators': 491, 'max_depth': 15, 'subsample': 0.7112748373683647, 'colsample_bytree': 0.925675143315549, 'lambda_l1': 0.25437338486941125, 'lambda_l2': 0.22301041044693493}
Best RMSE: 69755.9963939882
Final RMSE on validation set: 69757.25759815409
The RMSE progression plot from the Optuna run shows that 100 trials are not really needed; the search has essentially converged after roughly 20-odd trials.
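A quick way to confirm this (a small follow-up sketch, not part of the original run) is to check which trial produced the best result; if it came early, a smaller trial budget is enough next time:
# Index of the trial that achieved the best RMSE (0-based)
print(study.best_trial.number)
# If the best trial came early, a smaller budget is usually sufficient, e.g.:
# study.optimize(objective, n_trials=30)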
Optuna was used here to tune the LightGBM hyperparameters, and it returned the best combination it found, which worked well for this model. The settings are as follows:
- num_leaves (99): controls the number of leaves per tree, one of LightGBM's most important parameters. More leaves make the model more expressive and able to capture finer detail, but also more prone to overfitting; 99 indicates a moderately complex model.
- learning_rate (0.0936): the step size of each boosting update. A lower learning rate helps guard against overfitting but usually needs more iterations; 0.0936 is a moderate value that speeds up training without losing too much information per update.
- n_estimators (491): the number of boosting iterations (base learners). 491 trees is on the high side, but with early stopping applied during tuning it is a reasonable setting.
- max_depth (15): the maximum depth of each tree. A depth of 15 allows fairly complex trees that capture more feature interactions, at the cost of a higher overfitting risk.
- subsample (0.711): the fraction of training rows sampled for each tree. Roughly 71% of the data is randomly drawn per tree, which helps reduce overfitting.
- colsample_bytree (0.926): the fraction of features sampled per tree. 0.926 means nearly all features are used, which supports model performance.
- lambda_l1 (0.254) and lambda_l2 (0.223): the L1 and L2 regularization coefficients, which limit model complexity and guard against overfitting; both values are in a reasonable range and help damp the influence of unhelpful features.
- Best RMSE and Final RMSE
Best RMSE: 69755.9964 is the best root mean squared error (RMSE) reached on the held-out validation split during the Optuna search. RMSE measures a regression model's prediction error; the smaller the value, the better the model.
Final RMSE on validation set: 69757.2576 is the RMSE of the final model, retrained with the best parameters, on the validation set. It is nearly identical to the best RMSE (a gap of only about 1.26), which indicates that training was stable and the tuned parameters carry over well.
- Summary
Tuning LightGBM with Optuna produced a model with good generalization: the final RMSE is essentially the same as the best RMSE found during the search, so the model behaves on the validation set just as it did during tuning, with no obvious overfitting or underfitting. Further gains could come from additional feature engineering, model ensembling, and similar techniques.
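For completeness, here is a minimal sketch (not part of the original code, and assuming the test set has gone through exactly the same preprocessing as the training set) of turning the final model's predictions into a submission file matching sample_submission.csv:
# Sketch: scale the test features with the training scaler and write a submission file
test_scaled = scaler.transform(test_data_encoded_df)
submission = pd.DataFrame({'id': test_data['id'], 'price': final_model.predict(test_scaled)})
submission.to_csv('submission.csv', index=False)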
That's all for this walkthrough; thanks for reading.