Study Notes: Electricity Consumption Forecasting Model
1. Environment Setup
Install LightGBM and import the required Python libraries:
```python
!pip install lightgbm==3.3.0
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
```
2. Loading the Data
Read the training and test sets:
```python
train = pd.read_csv('/mnt/data/train.csv')
test = pd.read_csv('/mnt/data/test.csv')
```
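Before plotting, it can help to confirm that the files loaded as expected. The following is a minimal sketch, assuming the columns `id`, `dt`, `type`, and `target` that the later steps rely on. The helper `summarize` and the toy frame are hypothetical stand-ins for the real CSV files.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return basic facts about a frame: row count, id count, dt range, missing cells."""
    return {
        "rows": len(df),
        "n_ids": df["id"].nunique(),
        "dt_min": int(df["dt"].min()),
        "dt_max": int(df["dt"].max()),
        "missing": int(df.isna().sum().sum()),
    }

# Toy data standing in for train.csv (values are made up for illustration)
toy_train = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "dt": [11, 12, 11, 12],
    "type": [0, 0, 1, 1],
    "target": [1.0, 2.0, 3.0, 4.0],
})
summary = summarize(toy_train)
print(summary)
```

Calling `summarize(train)` and `summarize(test)` on the real frames would show whether the `dt` ranges match the split used later in these notes.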
3. Data Visualization
Use a bar chart and a line chart for a first look at the data:
```python
# Bar chart of the mean target for each type
type_target_df = train.groupby('type')['target'].mean().reset_index()
plt.figure(figsize=(8, 4))
plt.bar(type_target_df['type'], type_target_df['target'], color=['blue', 'green'])
plt.xlabel('Type')
plt.ylabel('Average Target Value')
plt.title('Bar Chart of Target by Type')
plt.show()
# Line chart of target values for ID '00037f39cf'
specific_id_df = train[train['id'] == '00037f39cf']
plt.figure(figsize=(10, 5))
plt.plot(specific_id_df['dt'], specific_id_df['target'], marker='o', linestyle='-')
plt.xlabel('DateTime')
plt.ylabel('Target Value')
plt.title("Line Chart of Target for ID '00037f39cf'")
plt.show()
```
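4. Data Preprocessing
Concatenate the training and test data, sort by `id` and `dt`, then build lag (historical shift) features and a rolling-window statistic: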
```python
# Concatenate the training and test data, then sort (largest dt first within each id)
data = pd.concat([train, test], axis=0, ignore_index=True)
data = data.sort_values(['id', 'dt'], ascending=False).reset_index(drop=True)
# Lag features: the target from 1 to 10 rows earlier within the same id
for i in range(1, 11):
    data[f'last{i}_target'] = data.groupby('id')['target'].shift(i)
# Window statistic: mean of the three most recent lags
data['win3_mean_target'] = data[['last1_target', 'last2_target', 'last3_target']].mean(axis=1)
```
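To see what the sort order and `shift` do together, here is a small worked example on made-up data. It assumes, as the later split suggests, that a larger `dt` lies further in the past and that test rows carry no `target`. The two-lag window `win2_mean_target` is a shortened illustration, not a feature from the notes.

```python
import numpy as np
import pandas as pd

# Toy series for one id; dt 1-2 play the role of test rows with unknown target
toy = pd.DataFrame({
    "id": ["a"] * 5,
    "dt": [1, 2, 3, 4, 5],
    "target": [np.nan, np.nan, 30.0, 40.0, 50.0],
})

# Same ordering as the notes: largest dt (oldest day) first within each id
toy = toy.sort_values(["id", "dt"], ascending=False).reset_index(drop=True)

# shift(1) pulls the target from the previous row, i.e. from dt + 1
toy["last1_target"] = toy.groupby("id")["target"].shift(1)
toy["last2_target"] = toy.groupby("id")["target"].shift(2)

# Row-wise mean skips NaN by default and is NaN only when every lag is missing
toy["win2_mean_target"] = toy[["last1_target", "last2_target"]].mean(axis=1)
print(toy)
```

In this toy frame the row with `dt` = 1 has no `last1_target`, because the target at `dt` = 2 is itself unknown. Lags shorter than the forecast horizon are therefore only partly available for test rows, which is worth keeping in mind when choosing the lag range.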
5. Splitting the Dataset
Split the combined data back into training and test sets. Rows with `dt` <= 10 form the test set:
```python
train = data[data['dt'] > 10].reset_index(drop=True)
test = data[data['dt'] <= 10].reset_index(drop=True)
```
6. Feature Selection
Choose the input features by dropping the columns that should not be fed to the model:
```python
train_cols = [col for col in data.columns if col not in ['id', 'target', 'dt']]
```
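The list comprehension keeps every column except `id`, `target`, and `dt`, so any string column left in the data would also become a feature. LightGBM generally expects numeric, boolean, or categorical pandas columns, so a quick dtype check may save a failed training run. This is a minimal sketch: the helper `non_numeric_features` and the `note` column are hypothetical and not part of the original data.

```python
import pandas as pd

def non_numeric_features(df: pd.DataFrame, cols: list) -> list:
    """Return the feature columns whose dtype is not numeric (bool counts as numeric)."""
    return [c for c in cols if not pd.api.types.is_numeric_dtype(df[c])]

toy = pd.DataFrame({
    "id": ["a", "b"],
    "dt": [11, 11],
    "type": [0, 1],
    "target": [1.0, 2.0],
    "last1_target": [0.5, 1.5],
    "note": ["x", "y"],  # hypothetical string column, added only for this example
})

# Same selection rule as in the notes
train_cols = [col for col in toy.columns if col not in ["id", "target", "dt"]]
bad_cols = non_numeric_features(toy, train_cols)
print(train_cols)
print(bad_cols)
```

Any column reported in `bad_cols` would need to be encoded, cast to `category`, or dropped before training.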
7. Model Definition and Training
Define and train a LightGBM model, then predict on the validation and test sets:
```python
def time_model(lgb, train_df, test_df, cols):
    # Time-based split: older rows (dt > 40) for training, dt 11-40 for validation
    trn_x, trn_y = train_df[train_df['dt'] > 40][cols], train_df[train_df['dt'] > 40]['target']
    val_x, val_y = train_df[train_df['dt'] <= 40][cols], train_df[train_df['dt'] <= 40]['target']
    # Build the LightGBM datasets
    train_matrix = lgb.Dataset(trn_x, label=trn_y)
    valid_matrix = lgb.Dataset(val_x, label=val_y)
    # LightGBM parameters
    lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': 'mse',
        'min_child_weight': 5,
        'num_leaves': 32,
        'lambda_l2': 10,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 4,
        'learning_rate': 0.05,
        'seed': 2024,
        'nthread': 16,
        'verbosity': -1
    }
    # Train the model. early_stopping_rounds and verbose_eval are accepted by the
    # pinned LightGBM 3.3.0; LightGBM 4.x replaces them with callbacks.
    model = lgb.train(
        lgb_params,
        train_matrix,
        num_boost_round=1000,
        valid_sets=[train_matrix, valid_matrix],
        early_stopping_rounds=50,
        verbose_eval=50
    )
    # Predict on the validation and test sets with the best iteration
    val_pred = model.predict(val_x, num_iteration=model.best_iteration)
    test_pred = model.predict(test_df[cols], num_iteration=model.best_iteration)
    # Offline score on the validation set
    score = mean_squared_error(val_y, val_pred)
    print('Validation MSE:', score)
    return val_pred, test_pred

# Train and predict
lgb_oof, lgb_test = time_model(lgb, train, test, train_cols)
```
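A validation MSE is easier to judge next to a simple reference point. The sketch below is not part of the original notes: it computes the MSE of a naive baseline that predicts each `id`'s mean target from the training window, using the same `dt` > 40 split. The function name `mean_baseline_mse` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def mean_baseline_mse(train_df: pd.DataFrame, split_dt: int = 40) -> float:
    """MSE on the validation window when predicting each id's mean target from older rows."""
    trn = train_df[train_df["dt"] > split_dt]
    val = train_df[train_df["dt"] <= split_dt]
    id_mean = trn.groupby("id")["target"].mean()
    # Fall back to the global training mean for ids absent from the training window
    pred = val["id"].map(id_mean).fillna(trn["target"].mean())
    return float(np.mean((val["target"] - pred) ** 2))

# Toy data: two ids, two validation days (dt 39-40) and two training days (dt 41-42)
toy = pd.DataFrame({
    "id": ["a"] * 4 + ["b"] * 4,
    "dt": [39, 40, 41, 42] * 2,
    "target": [3.0, 5.0, 4.0, 4.0, 10.0, 10.0, 8.0, 12.0],
})
baseline = mean_baseline_mse(toy)
print("Baseline MSE:", baseline)
```

If the LightGBM validation MSE is not clearly below such a baseline, the lag features or the parameters probably deserve another look.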
8. Saving the Results
Write the predictions to a local file:
```python
test['target'] = lgb_test
test[['id', 'dt', 'target']].to_csv('submit.csv', index=False)
```
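Summary and Suggested Improvements
1. **Parameter adjustment**: Try different parameter settings, especially `num_leaves`, `learning_rate`, and `min_child_weight`.
2. **Feature engineering**: Add more features, such as periodic features and type-based features.
3. **Cross-validation**: Use K-fold cross-validation to assess how stable the model is.
4. **Hyperparameter tuning**: Use grid search or Bayesian optimization to tune the parameters systematically.
These steps can improve the model's predictive performance and should help achieve a better score in the competition.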
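Suggestions 2 and 3 can be sketched roughly as follows. Two assumptions are made here. First, `dt` is taken to count consecutive days, so `dt % 7` can act as a rough weekly-position feature. Second, because the data is ordered in time, the sketch swaps plain K-fold for time-ordered folds in which each validation window is predicted only from older rows. The names `add_weekly_position` and `time_folds`, and the default window sizes, are hypothetical.

```python
import pandas as pd

def add_weekly_position(df: pd.DataFrame) -> pd.DataFrame:
    """Add dt modulo 7 as a rough day-of-week proxy (assumes dt counts consecutive days)."""
    out = df.copy()
    out["dt_mod7"] = out["dt"] % 7
    return out

def time_folds(df: pd.DataFrame, val_days: int = 30, n_folds: int = 3):
    """Yield (train_index, val_index) pairs. Each fold validates on a window of
    val_days consecutive days and trains only on strictly older rows (larger dt)."""
    start = int(df["dt"].min())
    for k in range(n_folds):
        lo = start + k * val_days   # most recent day in this validation window
        hi = lo + val_days - 1      # oldest day in this validation window
        val_mask = df["dt"].between(lo, hi)
        trn_mask = df["dt"] > hi
        # Skip folds that would have no validation rows or no training rows
        if val_mask.any() and trn_mask.any():
            yield df.index[trn_mask], df.index[val_mask]

# Toy frame: one id with 120 days of history (dt 11..130), targets made up
toy = pd.DataFrame({"id": "a", "dt": range(11, 131), "target": 1.0})
toy = add_weekly_position(toy)

folds = list(time_folds(toy))
for trn_idx, val_idx in folds:
    print(len(trn_idx), "training rows,", len(val_idx), "validation rows")
```

Training one model per fold and comparing the validation scores gives a sense of stability. A large spread between folds suggests the model is sensitive to the period it is evaluated on.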