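Data Analysis Case Study: Energy Data Analysis
1. Project Background
With the growth of renewable energy and smart grids, electricity demand forecasting and energy management have become core tasks for the energy industry. Understanding how load varies over time helps grid operators plan dispatch and generation, and it also supports demand response, energy saving and emission reduction, and dynamic pricing. Using electricity load and weather data from one urban area as an example, this case study shows how to use Pandas to clean and explore energy data and build features, and then how to fit a load forecasting model on top of them, providing data support for smart dispatch and operations decisions.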
2. Data Loading and Preprocessing
2.1 Data Description
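Assume a file named energy_data.csv is available with the following fields:
- timestamp: timestamp (YYYY-MM-DD HH:MM:SS)
- consumption_kwh: electricity consumption during the period (one hour), in kilowatt-hours
- temperature_c: average air temperature during the period (℃)
- is_holiday: whether the day is a public holiday (0/1)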
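Before any cleaning, it can help to confirm that a loaded frame actually matches the field list above. The following is a minimal sketch, not part of the original workflow: the helper name check_schema and the two-row sample frame are hypothetical and exist only to illustrate the check.

```python
import pandas as pd

REQUIRED_COLUMNS = ["timestamp", "consumption_kwh", "temperature_c", "is_holiday"]


def check_schema(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # Without the columns, the remaining checks cannot run
        problems.append(f"missing columns: {missing}")
        return problems
    if not pd.api.types.is_datetime64_any_dtype(df["timestamp"]):
        problems.append("timestamp is not a datetime column")
    if not df["is_holiday"].dropna().isin([0, 1]).all():
        problems.append("is_holiday contains values other than 0/1")
    return problems


# Hand-made sample rows, purely illustrative (not real measurements)
sample = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 01:00:00"]),
    "consumption_kwh": [120.5, 118.0],
    "temperature_c": [3.2, 2.9],
    "is_holiday": [1, 1],
})
print(check_schema(sample))
```

On a real file, the same function could be called on the frame returned by pd.read_csv with parse_dates=['timestamp'].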
2.2 Loading and Cleaning
import pandas as pd
# Load the data and parse the timestamp column
df = pd.read_csv('energy_data.csv', parse_dates=['timestamp'])
print("Raw record count:", len(df))
# Drop rows with missing values and invalid readings
df = df.dropna(subset=['timestamp', 'consumption_kwh', 'temperature_c'])
df = df[df['consumption_kwh'] >= 0]  # remove negative consumption values
# Sort by timestamp
df = df.sort_values('timestamp').reset_index(drop=True)
print(df.head())
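Dropping rows can leave holes in the hourly sequence, and raw exports sometimes repeat a timestamp. Both matter later, because lag features assume that consecutive rows are consecutive hours. The sketch below is one possible check under that assumption; the helper name timestamp_gaps and the toy timestamps are made up for illustration.

```python
import pandas as pd


def timestamp_gaps(df: pd.DataFrame):
    """Return (number of duplicated timestamps, hourly slots missing from the data)."""
    ts = df["timestamp"]
    n_duplicates = int(ts.duplicated().sum())
    # Every hour we would expect between the first and last observation
    expected = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    missing = expected.difference(pd.DatetimeIndex(ts))
    return n_duplicates, missing


# Toy frame: 01:00 appears twice and 02:00 is absent
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00",
        "2024-01-01 01:00:00",
        "2024-01-01 01:00:00",
        "2024-01-01 03:00:00",
    ]),
    "consumption_kwh": [100.0, 98.0, 98.0, 95.0],
})
n_duplicates, missing = timestamp_gaps(toy)
print("duplicates:", n_duplicates)
print("missing hours:", list(missing))
```

How to handle what this reports (dropping duplicates, interpolating short gaps, or leaving them) is a judgment call that depends on the data source.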
3. Exploratory Data Analysis (EDA)
3.1 Consumption over Time
import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))
plt.plot(df['timestamp'], df['consumption_kwh'], linewidth=0.8)
plt.title('Electricity consumption over time')
plt.xlabel('Time')
plt.ylabel('Consumption (kWh)')
plt.tight_layout()
plt.show()
3.2 Average Load by Hour of Day
df['hour'] = df['timestamp'].dt.hour
hourly_mean = df.groupby('hour')['consumption_kwh'].mean()
plt.figure(figsize=(6,4))
hourly_mean.plot(marker='o')
plt.title('Average consumption by hour of day')
plt.xlabel('Hour')
plt.ylabel('Average consumption (kWh)')
plt.grid(True)
plt.tight_layout()
plt.show()
3.3 Correlation between Temperature and Consumption
plt.figure(figsize=(6,4))
plt.scatter(df['temperature_c'], df['consumption_kwh'], alpha=0.3)
plt.title('Temperature vs. consumption')
plt.xlabel('Temperature (℃)')
plt.ylabel('Consumption (kWh)')
plt.tight_layout()
plt.show()
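A scatter plot can be backed up with numbers. The sketch below computes a Pearson correlation and the mean consumption per temperature band. The values are invented to mimic a U-shaped relationship (higher load in both cold and hot weather), which is an assumption for illustration and not a finding from the case-study data; the band edges are arbitrary as well.

```python
import pandas as pd

# Invented values that rise at both temperature extremes (illustration only)
toy = pd.DataFrame({
    "temperature_c": [-5, 0, 5, 10, 15, 20, 25, 30, 35],
    "consumption_kwh": [180, 160, 140, 120, 110, 115, 135, 165, 190],
})

# A linear correlation coefficient can be near zero even when the
# relationship is strong but non-linear
pearson = toy["temperature_c"].corr(toy["consumption_kwh"])

# Mean consumption per temperature band exposes the shape instead
bins = [-10, 5, 15, 25, 40]  # arbitrary band edges in ℃
toy["temp_band"] = pd.cut(toy["temperature_c"], bins=bins)
band_mean = toy.groupby("temp_band", observed=True)["consumption_kwh"].mean()

print(f"Pearson r: {pearson:.3f}")
print(band_mean)
```

If the real data showed a similar pattern, a tree-based model could pick it up from temperature_c directly, whereas a linear model would need an extra feature such as the distance from a comfort temperature.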
4. Feature Engineering
4.1 Time Features
df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0 = Monday
df['month'] = df['timestamp'].dt.month
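4.2 Lag and Rolling Features
# Consumption in the previous hour; the first row has no predecessor and stays NaN
# (fillna(method='bfill') is deprecated in recent pandas and would copy the row's own target into its feature)
df['lag_1'] = df['consumption_kwh'].shift(1)
# Rolling mean over the previous 24 hours; shift(1) keeps the current hour (the target) out of the window
df['roll_24_mean'] = df['consumption_kwh'].shift(1).rolling(window=24, min_periods=1).mean()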
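When the target is the current hour's consumption, a rolling window that includes the current row leaks the answer into the feature. The toy series below, with a window of 3 instead of 24 so the numbers are easy to follow, contrasts a window that includes the current row with one built only from earlier rows. The values are made up.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="consumption_kwh")

# Window that includes the current row: the feature already "sees" the target
leaky = s.rolling(window=3, min_periods=1).mean()

# Shift first, so the window covers only strictly earlier rows
past_only = s.shift(1).rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"target": s, "leaky": leaky, "past_only": past_only}))
```

In the past_only column the first row is NaN because nothing precedes it, which is why a later dropna() is needed before modeling.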
4.3 Final Feature Set
The features are: hour, day_of_week, month, is_holiday, temperature_c, lag_1, and roll_24_mean.
5. Model Building and Evaluation
A random forest regressor is used to forecast the load one hour ahead.
5.1 Train/Test Split
from sklearn.model_selection import train_test_split
# Drop any rows that still contain NaNs (for example, the earliest rows of the lag features)
df_model = df.dropna().reset_index(drop=True)
features = ['hour', 'day_of_week', 'month', 'is_holiday',
            'temperature_c', 'lag_1', 'roll_24_mean']
X = df_model[features]
y = df_model['consumption_kwh']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False  # keep chronological order
)
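A single chronological split gives one estimate of out-of-sample error. As an optional extension that the original workflow does not use, scikit-learn's TimeSeriesSplit produces several folds in which the training indices always precede the test indices. The 12-row array below is a stand-in for real features.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 12 chronologically ordered rows (illustration only)
X_demo = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_demo))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    # Each fold trains on an expanding prefix and tests on the block right after it
    print(f"fold {i}: train rows {train_idx.min()}..{train_idx.max()}, "
          f"test rows {test_idx.min()}..{test_idx.max()}")
```

The same splitter can be passed as the cv argument of cross_val_score to average an error metric over the folds.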
5.2 Training the Random Forest
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
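After fitting, a random forest exposes impurity-based feature importances, which give a rough indication of which inputs the trees rely on. The sketch below uses synthetic data in which the target depends on one column only; the column names signal and noise are hypothetical and unrelated to the case-study features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only: the target depends on `signal`, not on `noise`
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=200),
    "noise": rng.uniform(0, 10, size=200),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(0, 0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo, y_demo)

# Importances are normalized to sum to 1 across features
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

For the model in this section, the equivalent call would be pd.Series(rf.feature_importances_, index=features). Impurity-based importances can overstate correlated or high-cardinality features, so they are best read as a rough guide.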
5.3 Evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
import numpy as np
y_pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f} kWh")
print(f"MAPE: {mape:.2%}")
Plot the predictions against the actual values:
test_time = df_model['timestamp'].iloc[len(X_train):]  # timestamps of the test period
plt.figure(figsize=(12,4))
plt.plot(test_time, y_test.values, label='Actual')
plt.plot(test_time, y_pred, label='Predicted', alpha=0.8)
plt.legend()
plt.title('Predicted vs. actual consumption')
plt.xlabel('Time')
plt.ylabel('Consumption (kWh)')
plt.tight_layout()
plt.show()
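RMSE and MAPE are easier to interpret next to a naive reference. A common one for hourly load is the persistence forecast, which predicts that each hour equals the previous hour. The sketch below computes both metrics for that baseline on a short made-up series; the helper names rmse and mape are defined here and are not library functions.

```python
import numpy as np
import pandas as pd


def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error as a fraction; assumes y_true contains no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up hourly load values (illustration only)
load = pd.Series([100.0, 110.0, 120.0, 115.0, 105.0])

# Persistence forecast: the prediction for hour t is the observed value at t-1
naive_pred = load.shift(1)
mask = naive_pred.notna()  # the first hour has no previous value

baseline_rmse = rmse(load[mask], naive_pred[mask])
baseline_mape = mape(load[mask], naive_pred[mask])
print(f"Persistence RMSE: {baseline_rmse:.2f} kWh, MAPE: {baseline_mape:.2%}")
```

In this case study the lag_1 column already holds the previous hour's consumption, so X_test['lag_1'] can serve as the persistence forecast for the test period. A model that does not clearly beat it adds little value.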
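6. Business Applications
- Short-term load forecasting
  - Forecast load one hour ahead to schedule generating units and storage equipment
- Demand response
  - Trigger dynamic pricing or load-limiting measures during high-load periods
- Resource optimization
  - Adjust equipment maintenance and overhaul plans according to seasonal and temperature changes
- Energy efficiency analysis
  - Analyze how holidays and weather affect consumption to inform energy saving and emission reduction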
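To make the demand response item concrete, forecast hours can be flagged when they exceed a chosen threshold. The sketch below uses the 90th percentile of the forecast itself as that threshold, which is an arbitrary assumption; a real trigger rule would come from the operator. The function name flag_peak_hours and the forecast values are hypothetical.

```python
import pandas as pd


def flag_peak_hours(forecast: pd.Series, quantile: float = 0.9) -> pd.Series:
    """Mark hours whose forecast load exceeds the given quantile of the series."""
    threshold = forecast.quantile(quantile)
    return forecast > threshold


# Made-up forecast for ten consecutive hours (illustration only)
hours = pd.date_range("2024-07-01 10:00", periods=10, freq=pd.Timedelta(hours=1))
forecast = pd.Series(
    [100, 105, 98, 110, 150, 160, 155, 120, 108, 102],
    index=hours,
    dtype=float,
)

peaks = flag_peak_hours(forecast)
print(forecast[peaks])
```

The flagged hours could then feed a pricing or load-limiting rule; that logic is outside the scope of this case study.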
7. Complete Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
# 1. Load and clean
df = pd.read_csv('energy_data.csv', parse_dates=['timestamp'])
df = df.dropna(subset=['timestamp', 'consumption_kwh', 'temperature_c'])
df = df[df['consumption_kwh'] >= 0]
df = df.sort_values('timestamp').reset_index(drop=True)
# 2. Feature engineering
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
# Previous hour's consumption; the first row stays NaN and is dropped below
df['lag_1'] = df['consumption_kwh'].shift(1)
# shift(1) keeps the current hour (the target) out of the 24-hour window
df['roll_24_mean'] = df['consumption_kwh'].shift(1).rolling(24, min_periods=1).mean()
df_model = df.dropna().reset_index(drop=True)
features = ['hour', 'day_of_week', 'month', 'is_holiday',
            'temperature_c', 'lag_1', 'roll_24_mean']
X = df_model[features]
y = df_model['consumption_kwh']
# 3. Split and train (shuffle=False keeps chronological order)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# 4. Evaluate
y_pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f} kWh, MAPE: {mape:.2%}")
# 5. Plot predictions against actual values
test_time = df_model['timestamp'].iloc[len(X_train):]
plt.figure(figsize=(12,4))
plt.plot(test_time, y_test.values, label='Actual')
plt.plot(test_time, y_pred, label='Predicted', alpha=0.8)
plt.legend()
plt.title('Predicted vs. actual consumption')
plt.xlabel('Time')
plt.ylabel('Consumption (kWh)')
plt.tight_layout()
plt.show()
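8. Summary
This article walked through an end-to-end workflow for energy data analysis and load forecasting: data cleaning, time-series feature construction, and random forest regression modeling and evaluation. Short-term load forecasts can support decisions in grid dispatch, demand response, and energy efficiency management. Possible next steps include adding more weather variables, modeling long-term trends, or trying deep learning methods to further improve forecast accuracy and practical value.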