Data Analysis Case Study: Energy Data Analysis (CSDN Blog)

Article link: https://blog.csdn.net/qq_42568323/article/details/147476246

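1. Project Background

With the growth of renewable energy and smart grids, electricity demand forecasting and energy management have become core tasks for the energy industry. Understanding how load varies over time helps grid operators plan dispatch and generation, and it supports demand response, energy saving and emissions reduction, and dynamic pricing. Using electricity load and weather data for an urban area as an example, this case study shows how to use Pandas to clean and explore energy data, engineer features, and build a load forecasting model that can inform smart dispatch and operations decisions.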

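2. Data Loading and Preprocessing

2.1 Data Description

Assume we already have a file energy_data.csv with the following fields:

timestamp: timestamp (YYYY-MM-DD HH:MM:SS)
consumption_kwh: electricity consumption during that hour (kWh)
temperature_c: average air temperature during that hour (℃)
is_holiday: whether the day is a holiday (0/1)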
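
The article assumes energy_data.csv already exists. For readers without such a file, the following is a minimal sketch that builds a toy dataset with the same four columns. The function name `make_synthetic_energy_data`, the start date, and every number in it (base load, amplitudes, noise level, the 18 ℃ comfort point, weekends standing in for holidays) are assumptions made up for illustration, not properties of any real grid.

```python
import numpy as np
import pandas as pd


def make_synthetic_energy_data(n_hours: int = 24 * 60, seed: int = 0) -> pd.DataFrame:
    """Build a toy hourly dataset with the four columns described above.

    All constants here are invented for illustration only.
    """
    rng = np.random.default_rng(seed)
    ts = pd.date_range("2024-01-01", periods=n_hours, freq="h")
    hour = ts.hour.to_numpy()

    # Temperature: a daily cycle around 15 degrees C plus random noise.
    temperature = 15 + 8 * np.sin(2 * np.pi * (hour - 9) / 24) + rng.normal(0, 2, n_hours)

    # Weekends are treated as "holidays" purely as a stand-in for a real calendar.
    is_holiday = (ts.dayofweek >= 5).astype(int)

    # Load: daily cycle, extra demand away from 18 degrees C, lower on holidays.
    consumption = (
        500
        + 150 * np.sin(2 * np.pi * (hour - 6) / 24)
        + 5 * np.abs(temperature - 18)
        - 60 * is_holiday
        + rng.normal(0, 20, n_hours)
    )

    return pd.DataFrame({
        "timestamp": ts,
        "consumption_kwh": np.clip(consumption, 0, None).round(2),
        "temperature_c": temperature.round(1),
        "is_holiday": is_holiday,
    })


demo = make_synthetic_energy_data()
print(demo.head())
```

Calling `demo.to_csv('energy_data.csv', index=False)` would write the file in the layout the rest of the article expects.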

2.2 Loading and Cleaning

import pandas as pd

# Read the data and parse the timestamp column
df = pd.read_csv('energy_data.csv', parse_dates=['timestamp'])
print("Raw record count:", len(df))

# Drop rows with missing values and remove anomalies
df = df.dropna(subset=['timestamp','consumption_kwh','temperature_c'])
df = df[df['consumption_kwh'] >= 0]  # remove negative readings

# Sort by timestamp
df = df.sort_values('timestamp').reset_index(drop=True)
print(df.head())
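
The lag and rolling features built later treat consecutive rows as exactly one hour apart, so it can be worth checking for duplicated or missing timestamps after cleaning. The following is a minimal sketch of one way to do that on a made-up four-row frame; the variable names (`toy`, `regular`, `missing_hours`) and the sample values are hypothetical, and keeping the first duplicate is just one possible policy.

```python
import pandas as pd

# Toy hourly data with one duplicated and one missing timestamp (made-up values).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00", "2024-01-01 01:00:00", "2024-01-01 01:00:00",
        "2024-01-01 03:00:00",
    ]),
    "consumption_kwh": [410.0, 395.0, 395.0, 380.0],
})

# Keep the first record for each timestamp.
toy = toy.drop_duplicates(subset="timestamp", keep="first")

# Reindex onto a complete hourly grid; missing hours show up as NaN rows.
full_index = pd.date_range(toy["timestamp"].min(), toy["timestamp"].max(), freq="h")
regular = toy.set_index("timestamp").reindex(full_index)
missing_hours = regular.index[regular["consumption_kwh"].isna()]

print("missing hours:", list(missing_hours))
```

How to handle the gaps (interpolate, forward-fill, or drop the affected windows) depends on the data and is left open here.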

3. Exploratory Data Analysis (EDA)

3.1 Consumption Over Time

import matplotlib.pyplot as plt

plt.figure(figsize=(12,4))
plt.plot(df['timestamp'], df['consumption_kwh'], linewidth=0.8)
plt.title('Electricity Consumption Over Time')
plt.xlabel('Time')
plt.ylabel('Consumption (kWh)')
plt.tight_layout()
plt.show()

3.2 Average Load by Hour of Day

df['hour'] = df['timestamp'].dt.hour
hourly_mean = df.groupby('hour')['consumption_kwh'].mean()

plt.figure(figsize=(6,4))
hourly_mean.plot(marker='o')
plt.title('Average Consumption by Hour')
plt.xlabel('Hour')
plt.ylabel('Average consumption (kWh)')
plt.grid(True)
plt.tight_layout()
plt.show()

3.3 Temperature vs. Consumption

plt.figure(figsize=(6,4))
plt.scatter(df['temperature_c'], df['consumption_kwh'], alpha=0.3)
plt.title('Temperature vs. Consumption')
plt.xlabel('Temperature (℃)')
plt.ylabel('Consumption (kWh)')
plt.tight_layout()
plt.show()
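
A scatter plot can be complemented with a correlation coefficient. The sketch below uses made-up data in which load rises on both sides of an assumed 18 ℃ comfort temperature, to illustrate why a plain Pearson correlation can understate a V-shaped relationship. The data, the 18 ℃ value, and the `temp_dev` column are all assumptions for illustration; whether real data behaves this way has to be checked on the actual scatter plot.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data: load grows with distance from an assumed comfort temperature.
temp = rng.uniform(-5, 35, 500)
load = 400 + 6 * np.abs(temp - 18) + rng.normal(0, 10, 500)
toy = pd.DataFrame({"temperature_c": temp, "consumption_kwh": load})

# Pearson correlation with raw temperature (misses the V shape).
pearson_raw = toy["temperature_c"].corr(toy["consumption_kwh"])

# Pearson correlation with the distance from the assumed comfort temperature.
toy["temp_dev"] = (toy["temperature_c"] - 18).abs()
pearson_dev = toy["temp_dev"].corr(toy["consumption_kwh"])

print(f"raw temperature: {pearson_raw:.2f}, deviation from 18 C: {pearson_dev:.2f}")
```

Tree-based models such as the random forest used later can pick up this kind of non-monotonic effect from raw temperature, so a derived column like this matters mostly for exploration and for linear models.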

4. Feature Engineering

4.1 Time Features

df['day_of_week'] = df['timestamp'].dt.dayofweek  # 0 = Monday
df['month']       = df['timestamp'].dt.month

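4.2 Lag and Rolling Features

# Consumption in the previous hour; the first row has no predecessor and stays NaN
# (it is removed by dropna() before modeling)
df['lag_1'] = df['consumption_kwh'].shift(1)

# Rolling mean over the previous 24 hours; shift(1) keeps the current hour's
# consumption (the prediction target) out of its own feature
df['roll_24_mean'] = df['consumption_kwh'].shift(1).rolling(window=24, min_periods=1).mean()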
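
To make the behavior of shift and rolling concrete, here is a small sketch on a five-value series with a window of 3 instead of 24. The values and column names are made up. It puts a rolling mean that includes the current row next to one computed only from earlier rows; which of the two is appropriate depends on what is actually known at prediction time.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="consumption_kwh")

demo = pd.DataFrame({
    "consumption_kwh": s,
    # Value of the previous row; NaN for the first row.
    "lag_1": s.shift(1),
    # Window covers the current row and the two before it.
    "roll_3_incl_current": s.rolling(window=3, min_periods=1).mean(),
    # Window covers only the three rows before the current one.
    "roll_3_past_only": s.shift(1).rolling(window=3, min_periods=1).mean(),
})
print(demo)
```

In the last row the window that includes the current value averages 30, 40 and 50, while the past-only window averages 20, 30 and 40, so only the latter is free of the value being predicted.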

4.3 Final Feature Set

The features are: hour, day_of_week, month, is_holiday, temperature_c, lag_1, roll_24_mean

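5. Model Building and Evaluation

We use a random forest regressor to predict the next hour's load. Each row's target is that hour's consumption, and its consumption-based inputs come from earlier hours, so the model works as a one-hour-ahead forecast. Note that temperature_c is the temperature of the target hour itself, so in live use it would have to come from a weather forecast.

5.1 Train/Test Split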

from sklearn.model_selection import train_test_split

# Drop any rows that still contain NaNs
df_model = df.dropna().reset_index(drop=True)
features = ['hour','day_of_week','month','is_holiday',
            'temperature_c','lag_1','roll_24_mean']
X = df_model[features]
y = df_model['consumption_kwh']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False  # keep chronological order
)
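
The split above yields a single chronological hold-out. As an optional alternative that the original article does not use, scikit-learn's `TimeSeriesSplit` produces several expanding-window folds in which the training rows always precede the test rows. The sketch below only prints the fold boundaries on a made-up array of 20 rows; `X_demo` and `folds` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twenty made-up rows standing in for a chronologically sorted feature matrix.
X_demo = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold}: train rows 0..{train_idx[-1]}, "
          f"test rows {test_idx[0]}..{test_idx[-1]}")
```

Averaging a metric over such folds gives a less split-dependent estimate than one hold-out, at the cost of training the model several times.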

5.2 Training the Random Forest

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
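
Once a random forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs it relies on. The sketch below shows the pattern on made-up data so that it runs on its own; the toy features, coefficients, and the `noise_feature` column are assumptions for illustration and say nothing about the importances on real load data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Made-up features: two that drive the target and one that is pure noise.
X_toy = pd.DataFrame({
    "lag_1": rng.normal(500, 50, n),
    "temperature_c": rng.uniform(0, 35, n),
    "noise_feature": rng.normal(0, 1, n),
})
y_toy = 0.9 * X_toy["lag_1"] + 2.0 * X_toy["temperature_c"] + rng.normal(0, 5, n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the article's model, the same two lines would use `rf.feature_importances_` and the `features` list. Impurity-based importances can be misleading for correlated inputs, so they are best read as a rough guide.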

5.3 Evaluation

from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
import numpy as np

y_pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)

print(f"RMSE: {rmse:.2f} kWh")
print(f"MAPE: {mape:.2%}")

Plot the predictions against the actual values:

plt.figure(figsize=(12,4))
plt.plot(df_model['timestamp'].iloc[len(X_train):], y_test.values, label='Actual')
plt.plot(df_model['timestamp'].iloc[len(X_train):], y_pred,   label='Predicted', alpha=0.8)
plt.legend(); plt.title('Predicted vs. Actual'); plt.xlabel('Time'); plt.ylabel('Consumption (kWh)')
plt.tight_layout(); plt.show()
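
RMSE and MAPE are easier to judge against a simple reference. A common one for hourly load is the persistence forecast, which predicts each hour with the previous hour's value. The sketch below writes out both metrics by hand and applies them to a persistence forecast on five made-up values; the helper names `rmse` and `mape` and the numbers are hypothetical. This `mape` returns a fraction rather than a percentage and is undefined when an actual value is zero.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, in the same unit as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up actual loads and the persistence forecast (previous hour's value).
y_true = np.array([100.0, 110.0, 120.0, 115.0, 105.0])
persistence = np.array([95.0, 100.0, 110.0, 120.0, 115.0])

baseline_rmse = rmse(y_true, persistence)
baseline_mape = mape(y_true, persistence)
print(f"persistence RMSE: {baseline_rmse:.2f} kWh, MAPE: {baseline_mape:.2%}")
```

With the article's variables, the persistence forecast on the test set is simply `X_test['lag_1']`. A model that does not clearly beat it adds little over repeating the last observation.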

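6. Business Applications

Short-term load forecasting
- Forecast load one hour ahead to schedule generating units and storage
Demand response
- Trigger dynamic pricing or load-curtailment measures during high-load periods
Resource optimization
- Adjust equipment maintenance and overhaul schedules according to seasonal and temperature changes
Energy efficiency analysis
- Analyze how holidays and weather affect consumption, to inform energy saving and emissions reduction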
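
As one illustration of the demand-response item, the sketch below flags forecast hours whose load reaches a threshold. The forecast values, the dates, and the 600 kWh threshold are made up, and the names `forecast`, `threshold_kwh` and `peak_hours` are hypothetical. A real trigger rule would depend on grid capacity and tariff policy, which the article does not specify.

```python
import pandas as pd

# Made-up hourly load forecast for one afternoon and evening.
forecast = pd.Series(
    [420.0, 455.0, 610.0, 640.0, 590.0, 480.0],
    index=pd.date_range("2024-07-01 15:00", periods=6, freq="h"),
    name="forecast_kwh",
)

threshold_kwh = 600.0  # hypothetical trigger level

# Hours whose forecast load reaches or exceeds the threshold.
peak_hours = forecast[forecast >= threshold_kwh]
print(peak_hours)
```

Here the 17:00 and 18:00 slots are flagged, and those are the hours in which a pricing or curtailment signal would be sent under this simple rule.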

7. Complete Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

# 1. Load and clean
df = pd.read_csv('energy_data.csv', parse_dates=['timestamp'])
df = df.dropna(subset=['timestamp','consumption_kwh','temperature_c'])
df = df[df['consumption_kwh']>=0]
df = df.sort_values('timestamp').reset_index(drop=True)

# 2. Feature engineering
df['hour']        = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month']       = df['timestamp'].dt.month
# Both consumption features use only earlier hours, so the target never leaks into them
df['lag_1']       = df['consumption_kwh'].shift(1)
df['roll_24_mean']= df['consumption_kwh'].shift(1).rolling(24, min_periods=1).mean()
df_model = df.dropna().reset_index(drop=True)

features = ['hour','day_of_week','month','is_holiday',
            'temperature_c','lag_1','roll_24_mean']
X = df_model[features]
y = df_model['consumption_kwh']

# 3. Split and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# 4. Evaluate
y_pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"RMSE: {rmse:.2f} kWh, MAPE: {mape:.2%}")

# 5. Plot predictions against actual values
plt.figure(figsize=(12,4))
plt.plot(df_model['timestamp'].iloc[len(X_train):], y_test.values, label='Actual')
plt.plot(df_model['timestamp'].iloc[len(X_train):], y_pred,   label='Predicted', alpha=0.8)
plt.legend(); plt.title('Predicted vs. Actual'); plt.xlabel('Time'); plt.ylabel('Consumption (kWh)')
plt.tight_layout(); plt.show()