Predicting solar station efficiency with the Kaggle European solar generation dataset

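   We keep a single station, forecast one month ahead with basic scikit-learn ML models, and forecast one to two days ahead with deep learning in TensorFlow.
   Performance metric: root mean squared error (RMSE). Exploratory analysis shows that the dataset is clean: no outliers, no duplicate rows, and no missing values.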
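Since every model below is scored with RMSE, here is a minimal sketch of the metric computed with NumPy. The helper name `rmse` and the toy values are assumptions for illustration; the article itself uses `np.sqrt(mean_squared_error(...))` from scikit-learn, which should give the same number.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only, not taken from the dataset.
actual = [0.0, 0.2, 0.5, 0.3]
predicted = [0.0, 0.1, 0.4, 0.5]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses weigh more than many small ones, which suits a target such as station efficiency that stays within a narrow range.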

1. Baseline model

The baseline model's results serve as the reference against which the results of all other models are compared.

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from dataprepare import dataset_con
from visualize import plot_scores,plot_predictions
import warnings
warnings.filterwarnings("ignore")
pd.options.display.max_columns = 300

## Raw data
df = pd.read_csv("dataset/solar_generation_by_station.csv")
train_data,test_data = dataset_con(df)

model_instances, model_names, rmse_train, rmse_test = [], [], [], []

# Build the training and test sets
x_train, y_train = train_data.drop(columns=['time']), train_data['FR10']
x_test, y_test = test_data.drop(columns=['time']), test_data['FR10']

# Baseline model, used as the reference
def mean_df(d, h):
    "return the hourly mean of a specific day of the year"
    res = x_train[(x_train['day'] == d) & (x_train['hour'] == h)]['FR10'].mean()
    return res
# Add the predicted values to the datasets
x_train['pred'] = x_train.apply(lambda x: mean_df(x.day, x.hour), axis=1)
x_test['pred'] = x_test.apply(lambda x: mean_df(x.day, x.hour), axis=1)
model_names.append("base_line")
# RMSE of the baseline predictions on the training and test sets
rmse_train.append(np.sqrt(mean_squared_error(x_train['FR10'], x_train['pred'])))
rmse_test.append(np.sqrt(mean_squared_error(x_test['FR10'], x_test['pred'])))
# Show last month's predictions (orange) against the actual values (blue)
plot_predictions(data=x_test[['FR10', 'pred']])
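The baseline predicts, for each row, the training-set mean of FR10 for the same day and hour. As a hedged alternative to calling a function once per row, the sketch below computes each (day, hour) mean once with `groupby` and joins it back. The toy frames and the names `train`, `test`, `hourly_mean`, and `test_pred` are assumptions for illustration and are not the output of `dataset_con`.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train/test split; values are illustrative only.
train = pd.DataFrame({
    "day":  [1, 1, 1, 1, 2, 2],
    "hour": [12, 12, 13, 13, 12, 12],
    "FR10": [0.4, 0.6, 0.5, 0.7, 0.2, 0.4],
})
test = pd.DataFrame({
    "day":  [1, 1, 2],
    "hour": [12, 13, 12],
    "FR10": [0.5, 0.5, 0.3],
})

# One mean per (day, hour) pair, computed once from the training data.
hourly_mean = (
    train.groupby(["day", "hour"])["FR10"].mean().rename("pred").reset_index()
)

# Left join so every test row receives the mean of its (day, hour) pair.
test_pred = test.merge(hourly_mean, on=["day", "hour"], how="left")

baseline_rmse = float(
    np.sqrt(np.mean((test_pred["FR10"] - test_pred["pred"]) ** 2))
)
print(test_pred)
print(baseline_rmse)
```

A (day, hour) pair that never occurs in the training data would get a missing prediction after the join, so such rows would need handling before scoring.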


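2. Regression models

Several regression models are tried below. Comparing their performance on the test set shows which model is the most effective.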

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from dataprepare import dataset_con
from visualize import plot_scores,plot_predictions
import warnings
warnings.filterwarnings("ignore")
pd.options.display.max_columns = 300
## Raw data
df = pd.read_csv(r"F:\mygithub\Big_Data_Renewable_energies-master\dataset\solar_generation_by_station.csv")
train_data,test_data = dataset_con(df)
model_instances, model_names, rmse_train, rmse_test = [], [], [], []
# Build the training and test sets
X_train, y_train = train_data[['month', 'week', 'day', 'hour']], train_data['FR10']
X_test, y_test = test_data[['month', 'week', 'day', 'hour']], test_data['FR10']
# Models to train
from sklearn.neighbors import KNeighborsRegressor  # k-nearest neighbors
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet  # linear, ridge, Lasso, and elastic-net regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import LinearSVR
from sklearn.svm import SVR
import xgboost as xgb
# Print the scores
def get_rmse(reg, model_name):
    """Print the RMSE of the given model and return its training/test scores."""
    y_train_pred, y_pred = reg.predict(X_train), reg.predict(X_test)
    err_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
    err_test = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{model_name} \t - RMSE on Training  = {err_train:.2f} / RMSE on Test = {err_test:.2f}")

    return err_train, err_test
# List of all the basic models tried initially
model_list = [
    LinearRegression(), Lasso(), Ridge(), ElasticNet(),
    RandomForestRegressor(), GradientBoostingRegressor(), ExtraTreesRegressor(),
    xgb.XGBRegressor(), KNeighborsRegressor()
             ]
# Model names, taken from the class names
new_names = [type(m).__name__ for m in model_list]
# Train and score every model, recording its name and scores together
for model, name in zip(model_list, new_names):
    model.fit(X_train, y_train)
    sc_train, sc_test = get_rmse(model, name)
    model_names.append(name)
    rmse_train.append(sc_train)
    rmse_test.append(sc_test)

Comparison of results

base_line 	 - RMSE on Training  = 0.21 / RMSE on Test = 0.15
LinearRegression 	 - RMSE on Training  = 0.21 / RMSE on Test = 0.15
Lasso 	 - RMSE on Training  = 0.21 / RMSE on Test = 0.15
Ridge 	 - RMSE on Training  = 0.21 / RMSE on Test = 0.15
ElasticNet 	 - RMSE on Training  = 0.10 / RMSE on Test = 0.10
RandomForestRegressor 	 - RMSE on Training  = 0.11 / RMSE on Test = 0.09
GradientBoostingRegressor 	 - RMSE on Training  = 0.10 / RMSE on Test = 0.10
ExtraTreesRegressor 	 - RMSE on Training  = 0.11 / RMSE on Test = 0.09
XGBRegressor 	 - RMSE on Training  = 0.10 / RMSE on Test = 0.10
LGBMRegressor 	 - RMSE on Training  = 0.10 / RMSE on Test = 0.10
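To read the comparison at a glance, the printed scores can be loaded into a DataFrame and ranked by test RMSE. This is only a sketch: the numbers are copied from the output above, while the names `scores`, `ranked`, and `best` are assumptions introduced for illustration.

```python
import pandas as pd

# Scores copied from the printed results above.
scores = pd.DataFrame({
    "model": [
        "base_line", "LinearRegression", "Lasso", "Ridge", "ElasticNet",
        "RandomForestRegressor", "GradientBoostingRegressor",
        "ExtraTreesRegressor", "XGBRegressor", "LGBMRegressor",
    ],
    "rmse_train": [0.21, 0.21, 0.21, 0.21, 0.10, 0.11, 0.10, 0.11, 0.10, 0.10],
    "rmse_test":  [0.15, 0.15, 0.15, 0.15, 0.10, 0.09, 0.10, 0.09, 0.10, 0.10],
})

# A stable sort keeps the original order among models with equal test RMSE.
ranked = scores.sort_values("rmse_test", kind="stable").reset_index(drop=True)

# Every model that reaches the lowest test RMSE.
best = ranked.loc[ranked["rmse_test"] == ranked["rmse_test"].min(), "model"].tolist()
print(ranked)
print(best)
```

At two decimal places the two tree ensembles RandomForestRegressor and ExtraTreesRegressor share the lowest test RMSE, so the printed precision is too coarse to separate them.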

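3. Deep learning

Here we try to predict one hourly FR10 value from all features (the efficiencies of all the other stations) over the past 2 days (48 hours).

3.1 Dataset construction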

df = pd.read_csv("dataset/solar_generation_by_station.csv")
df = df[sorted([c for c in df.columns if 'FR' in c])]
# Keep only the last 4 years of data for the FR stations
df = df[-24*365*4:]
# Windowing function: takes the DataFrame and the lookback, and returns an array
# whose elements are the consecutive 48-hour windows over the 4 years
def process_data(data, past):
    X = []
    for i in range(len(data)-past-1):
        X.append(data.iloc[i:i+past].values)
    return np.array(X)
# Predict one hourly value from the features of the past 2 days
lookback = 48
# Only station FR10 is predicted. X[i] holds rows i .. i+lookback-1,
# and its target y[i] is the FR10 value at row i+lookback+1
y = df['FR10'][lookback+1:]
X = process_data(df, lookback)
from sklearn.model_selection import train_test_split
# Split into training and test sets without shuffling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=False)
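To make the window and target indexing easy to check, here is a small sketch on a toy array using NumPy's `sliding_window_view`. The helper `make_windows` and its `horizon` parameter are assumptions introduced for illustration and are not part of the article's code; with `horizon=2` it should reproduce the `lookback+1` offset used above, while `horizon=1` targets the hour immediately after the window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(values, past, horizon=1):
    """Return (X, y): X[i] is rows i..i+past-1, y[i] is column 0 at row i+past-1+horizon."""
    values = np.asarray(values)
    n_samples = len(values) - past - horizon + 1
    # sliding_window_view puts the window axis last; move it to axis 1
    # so that X has shape (samples, time steps, features).
    windows = sliding_window_view(values, window_shape=past, axis=0)
    X = np.transpose(windows, (0, 2, 1))[:n_samples]
    y = values[past - 1 + horizon:, 0][:n_samples]
    return X, y


# Toy series: 10 time steps, 3 features; column 0 plays the role of the target station.
toy = np.arange(30, dtype=float).reshape(10, 3)
X_toy, y_toy = make_windows(toy, past=4, horizon=1)
print(X_toy.shape, y_toy.shape)
```

The resulting `(samples, time steps, features)` layout is the shape that Keras recurrent layers expect as input.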

Building, training, and testing the RNN, GRU, and LSTM models

'''
RNN
'''
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, Embedding, Dropout
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import SimpleRNN, Dense, Embedding, Dropout


def my_RNN():
    my_rnn = Sequential()
    my_rnn.add(SimpleRNN(units=32, return_sequences=True, input_shape=(lookback,22)))
    my_rnn.add(SimpleRNN(units=32, return_sequences=True))
    my_rnn.add(SimpleRNN(units=32, return_sequences=False))
    my_rnn.add(Dense(units=1, activation='linear'))
    return my_rnn


rnn_model = my_RNN()
rnn_model.compile(optimizer='adam', loss='mean_squared_error')
rnn_model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64)

y_pred_train, y_pred_test = rnn_model.predict(X_train), rnn_model.predict(X_test)
err_train_rnn, err_test_rnn = np.sqrt(mean_squared_error(y_train, y_pred_train)), np.sqrt(mean_squared_error(y_test, y_pred_test))

def append_results(model_name,err_train,err_test):
    model_names.append(model_name)
    rmse_train.append(err_train)
    rmse_test.append(err_test)

append_results("RNN",err_train_rnn,err_test_rnn)


plot_evolution(X_train,y_train,X_test,y_test,y_pred_test)
rnn_res = pd.DataFrame(zip(list(y_test), list(np.squeeze(y_pred_test))), columns =['FR10', 'pred'])
plot_predictions(data=rnn_res[-30*24:])

'''
GRU
'''

from keras.layers import GRU

def my_GRU(input_shape):
    my_GRU = Sequential()
    my_GRU.add(GRU(units=32, return_sequences=True, activation='relu', input_shape=input_shape))
    my_GRU.add(GRU(units=32, activation='relu', return_sequences=False))
    my_GRU.add(Dense(units=1, activation='linear'))
    return my_GRU

gru_model = my_GRU(X.shape[1:])
gru_model.compile(optimizer='adam', loss='mean_squared_error')
gru_model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32)

y_pred_train, y_pred_test = gru_model.predict(X_train), gru_model.predict(X_test)
err_train_gru, err_test_gru = np.sqrt(mean_squared_error(y_train, y_pred_train)), np.sqrt(mean_squared_error(y_test, y_pred_test))

append_results("GRU",err_train_gru,err_test_gru)
plot_evolution(X_train,y_train,X_test,y_test,y_pred_test)

gru_res = pd.DataFrame(zip(list(y_test), list(np.squeeze(y_pred_test))), columns =['FR10', 'pred'])
plot_predictions(data=gru_res[-30*24:])

'''
LSTM
'''

from keras.layers import LSTM

def my_LSTM(input_shape):
    my_LSTM = Sequential()
    my_LSTM.add(LSTM(units=32, return_sequences=True, activation='relu', input_shape=input_shape))
    my_LSTM.add(LSTM(units=32, activation='relu', return_sequences=False))
    my_LSTM.add(Dense(units=1, activation='linear'))
    return my_LSTM

lstm_model = my_LSTM(X.shape[1:])
lstm_model.compile(optimizer='adam', loss='mean_squared_error')
lstm_model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32)

y_pred_train, y_pred_test = lstm_model.predict(X_train), lstm_model.predict(X_test)
err_train_lstm, err_test_lstm = np.sqrt(mean_squared_error(y_train, y_pred_train)), np.sqrt(mean_squared_error(y_test, y_pred_test))
append_results("LSTM",err_train_lstm,err_test_lstm)
plot_evolution(X_train,y_train,X_test,y_test,y_pred_test)

lstm_res = pd.DataFrame(zip(list(y_test), list(np.squeeze(y_pred_test))), columns =['FR10', 'pred'])
plot_predictions(data=lstm_res[-30*24:])

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
plot_scores(model_names,rmse_train,rmse_test)


df_score = pd.DataFrame({'model_names' : model_names, 'rmse_test' : rmse_test})

plt.figure(figsize=(12, 8))
sns.barplot(y="model_names", x="rmse_test", data=df_score, palette="Blues_d")
plt.title("Comparison of errors for each model", fontsize=20)
plt.xlabel('RMSE - the smaller it is, the better the model', fontsize=16)
plt.ylabel('models tried', fontsize=16)
plt.show()

Results for all models