SEED Jiangsu Big Data Competition: New Energy Track (Part 1)

I. Task Overview and Guidance

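The task is to forecast the electricity demand for each day of the coming week, which makes it a typical regression problem. The inputs are historical station operation data, station charging volume data, and static station data. Time series forecasting problems like this one allow a flexible choice of methods: traditional time series models, machine learning, and deep learning can all be used.
1. Statistical strategies: compute the forecast directly from the most recent observations using a mean, a median, time-decay weighting, and so on. This is simple and gives a result quickly.
2. Time series models: common choices include exponential smoothing, grey forecasting models, ARIMA, seasonal ARIMA (SARIMA), and VAR. In their basic form they model only the series itself and cannot use other information during training, such as categorical features.
3. Machine learning models: common choices are lightgbm, xgboost, and catboost. These require building a large number of time-related features.
4. Deep learning models: common choices are rnn, lstm, cnn, and transformer. These take the sequence as direct input and do not require many hand-crafted features.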
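
To make the first option concrete, here is a minimal sketch of a statistical baseline. It is not the method used later in this post. The function names, the 7-day window, the `decay` value, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import mean, median


def decay_weighted_mean(values, decay=0.8):
    """Weighted mean where the most recent value has weight 1 and each
    step further back is multiplied by `decay` (an assumed parameter)."""
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def naive_forecast(history, horizon=7, window=7, method="mean", decay=0.8):
    """Repeat one summary statistic of the last `window` days for `horizon` days."""
    recent = history[-window:]
    if method == "mean":
        level = mean(recent)
    elif method == "median":
        level = median(recent)
    elif method == "decay":
        level = decay_weighted_mean(recent, decay)
    else:
        raise ValueError(f"unknown method: {method}")
    return [level] * horizon


# Hypothetical daily charging volumes for one station (illustrative numbers only)
history = [100.0, 120.0, 110.0, 130.0, 90.0, 95.0, 105.0, 400.0]
for method in ("mean", "median", "decay"):
    print(method, naive_forecast(history, method=method)[0])
```

The median is less sensitive than the mean to a one-off spike such as the last value in the sample, while the decay-weighted mean follows recent changes more closely. Which of the three suits this dataset would need to be checked against a validation week.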

II. Initial Implementation, with Added Geographic Information and Tuning

1. Set up the environment and import the required dependencies
# 1
# Import the required libraries
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import tqdm
import sys
import os
import gc
import argparse
import warnings
warnings.filterwarnings('ignore')
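2. Organize the data and build the training and test sets

The stub_info data files were rebuilt from the H3 codes:
train_stub_add_info.csv; test_stub_add_info.csv
These files add the latitude and longitude of each charging station and map each station to a specific city.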

# 2 Read the data
train_power_forecast_history = pd.read_csv(r'.\1009\train\power_forecast_history.csv')
train_power = pd.read_csv(r'.\1009\train\power.csv')
# train_stub_add_info.csv and test_stub_add_info.csv add the latitude and
# longitude of each charging station and map each station to a specific city
train_stub_info = pd.read_csv(r'.\train_stub_add_info.csv')

test_power_forecast_history = pd.read_csv(r'.\1009\test\power_forecast_history.csv')
test_stub_info = pd.read_csv(r'.\test_stub_add_info.csv')
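2.1 Parse the H3 codes to obtain geographic information
  • Use the h3 library to decode each H3 code into latitude and longitude
  • Apply for a Baidu Maps API key and call the reverse geocoding endpoint to get the address for each coordinate pair: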

![[Pasted image 20231018174245.png]]

  • Extract the city field from each address:
    ![[Pasted image 20231018174432.png]]
# encoding:utf-8
import requests
from h3 import h3  # h3-py 3.x API; in 4.x the equivalent function is h3.cell_to_latlng

def getloc(h3_address):
    # Convert the H3 code to the (latitude, longitude) of the cell center
    lat, lng = h3.h3_to_geo(h3_address)
    # Join into the "lat,lng" string expected by the location parameter
    lat_lng = str(lat) + ',' + str(lng)
    # Endpoint URL
    url = "https://api.map.baidu.com/reverse_geocoding/v3"
    # Fill in the AK obtained from Console > Application Management after creating an application
    ak = "*************************"
    params = {
        "ak":       ak,
        "output":    "json",
        "coordtype":    "wgs84ll",
        "extensions_poi":    "0",
        "location":    lat_lng,
    }
    response = requests.get(url=url, params=params)
    if response:
        # Parse the JSON body once and reuse it
        address = response.json()['result']['formatted_address']
        print(address)
        return address
    else:
        return ''

# One API request per row
train_stub_info["address"] = train_stub_info["h3"].apply(getloc)
test_stub_info["address"] = test_stub_info["h3"].apply(getloc)

# Take the text before '市' and drop the 3-character province prefix (e.g. '江苏省')
test_stub_info['city'] = test_stub_info["address"].apply(lambda x : x.split('市')[0][3:])
test_stub_info.city.value_counts()
  • Convert the city names to a numeric code, city_flag, and save the result as the new stub_info data file
    ![[Pasted image 20231018174726.png]]
test_stub_info['city_flag'] = test_stub_info['city'].map({'南京':0,'苏州':1,'常州':2,'无锡':3,'南通':4,'宿迁':5,'盐城':6,'泰州':7,'徐州':8,'镇江':9,'扬州':10,'淮安':11,'连云港':12,'宿州':13,'':14,})
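# index=False keeps the row index from being read back later as an extra 'Unnamed: 0' column
test_stub_info.to_csv(r".\test_stub_add_info.csv", index=False)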
test_stub_info.city_flag.value_counts()
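
The city extraction and the hand-written code dictionary can also be sketched in a more general form. The version below is only an illustration: it assumes addresses shaped like "<province>省<city>市...", and `extract_city`, `build_city_codes`, and the sample addresses are hypothetical names and values that do not appear in the original code.

```python
import re

# Assumed address shape: an optional "<province>省" prefix followed by "<city>市"
CITY_PATTERN = re.compile(r"^(?:.*?省)?(.+?)市")


def extract_city(address):
    """Return the city name without the trailing '市', or '' if none is found."""
    match = CITY_PATTERN.search(address or "")
    return match.group(1) if match else ""


def build_city_codes(*city_lists):
    """Assign integer codes in order of first appearance across all inputs,
    so the training and test sets share a single mapping."""
    codes = {}
    for cities in city_lists:
        for city in cities:
            codes.setdefault(city, len(codes))
    return codes


# Hypothetical addresses, for illustration only
train_cities = [extract_city(a) for a in ["江苏省南京市玄武区", "江苏省苏州市姑苏区", ""]]
test_cities = [extract_city(a) for a in ["江苏省苏州市吴中区", "安徽省宿州市"]]
codes = build_city_codes(train_cities, test_cities)
print(codes)
```

Building the mapping from the data avoids the NaN that `Series.map` produces for any city missing from a hard-coded dictionary. The trade-off is that the codes depend on row order, so the mapping should be built once and reused for both files.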
3. Data cleaning and preprocessing
# 3. Aggregate the data by day and drop the hour column
# head(1) keeps the first row of each (station, day) group
train_df = train_power_forecast_history.groupby(['id_encode','ds']).head(1)
del train_df['hour']
test_df = test_power_forecast_history.groupby(['id_encode','ds']).head(1)
del test_df['hour']
# Aggregate by station and day to get the total daily charging volume;
# as_index=False returns a DataFrame with columns id_encode, ds, power
tmp_df = train_power.groupby(['id_encode','ds'], as_index=False)['power'].sum()
# 4. Merge in the charging volume
train_df = train_df.merge(tmp_df, on=['id_encode','ds'], how='left')
# Merge in the static station data
train_df = train_df.merge(train_stub_info, on='id_encode', how='left')
test_df = test_df.merge(test_stub_info, on='id_encode', how='left')
# 5. Preprocessing: encode the flag column as integers
train_df['flag'] = train_df['flag'].map({'A':0,'B':1})
test_df['flag'] = test_df['flag'].map({'A':0,'B':1})
train_df.head(5)
4. Extract date and time features
# 6
def get_time_feature(df, col):
    df_copy = df.copy()
    prefix = col + "_"
    # Parse the YYYYMMDD integer column into a temporary datetime column
    df_copy['new_'+col] = df_copy[col].astype(str)
    col = 'new_'+col
    df_copy[col] = pd.to_datetime(df_copy[col], format='%Y%m%d')
    df_copy[prefix + 'year'] = df_copy[col].dt.year
    df_copy[prefix + 'month'] = df_copy[col].dt.month
    df_copy[prefix + 'day'] = df_copy[col].dt.day
    # df_copy[prefix + 'weekofyear'] = df_copy[col].dt.weekofyear
    df_copy[prefix + 'dayofweek'] = df_copy[col].dt.dayofweek
    # dayofweek is 0 (Monday) to 6 (Sunday), so // 6 flags Sunday only;
    # (dayofweek >= 5) would flag Saturday as well
    df_copy[prefix + 'is_wknd'] = df_copy[col].dt.dayofweek // 6
    df_copy[prefix + 'quarter'] = df_copy[col].dt.quarter
    df_copy[prefix + 'is_month_start'] = df_copy[col].dt.is_month_start.astype(int)
    df_copy[prefix + 'is_month_end'] = df_copy[col].dt.is_month_end.astype(int)
    # Drop the temporary datetime column
    del df_copy[col]
    return df_copy

train_df = get_time_feature(train_df, 'ds')
test_df = get_time_feature(test_df, 'ds')
# Feature columns: everything except the raw date, the target, and the text fields
cols = [f for f in test_df.columns if f not in ['ds','power','h3','address','city']]
cols
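
As a quick check of the weekend feature, the sketch below uses only the standard library to compare the `dayofweek // 6` rule from the feature code with a Saturday-or-Sunday rule. The function name `weekend_flags` and the choice of dates are assumptions made for illustration.

```python
from datetime import datetime


def weekend_flags(ds):
    """Return (dayofweek, sunday_only, sat_or_sun) for a YYYYMMDD integer or string."""
    day = datetime.strptime(str(ds), "%Y%m%d")
    dow = day.weekday()          # Monday=0 ... Sunday=6, same convention as pandas dt.dayofweek
    sunday_only = dow // 6       # the rule used in get_time_feature
    sat_or_sun = int(dow >= 5)   # alternative that also flags Saturday
    return dow, sunday_only, sat_or_sun


for ds in (20220415, 20220416, 20220417):
    print(ds, weekend_flags(ds))
```

For these three dates (a Friday, a Saturday, and a Sunday) the two rules differ only on the Saturday. Whether flagging Saturday helps the model is an open question that would need to be checked with validation scores.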

Sample of the training data:
train_df.head(5)

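| id_encode | ele_price | ser_price | after_ser_price | total_price | f1 | f2 | f3 | ds | power | dc_equipment_kw | city_flag | ds_year | ds_month | ds_day | ds_dayofweek | ds_is_wknd | ds_quarter | ds_is_month_start | ds_is_month_end |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.64 | 0.95 | 0.31 | 0.95 | 0.0 | 0.0 | 1.0 | 20220415 | 2288.2240 | 1440.0 | 2 | 2022 | 4 | 15 | 4 | 0 | 2 | 0 | 0 |
| 0 | 0.64 | 0.95 | 0.31 | 0.95 | 0.0 | 0.0 | 1.0 | 20220416 | 2398.5730 | 1440.0 | 2 | 2022 | 4 | 16 | 5 | 0 | 2 | 0 | 0 |
| 0 | 0.64 | 0.95 | 0.31 | 0.95 | 0.0 | 0.0 | 1.0 | 20220417 | 2313.0330 | 1440.0 | 2 | 2022 | 4 | 17 | 6 | 1 | 2 | 0 | 0 |
| 0 | 0.64 | 0.95 | 0.31 | 0.95 | 0.0 | 0.0 | 1.0 | 20220418 | 2095.3259 | 1440.0 | 2 | 2022 | 4 | 18 | 0 | 0 | 2 | 0 | 0 |
| 0 | 0.64 | 0.95 | 0.31 | 0.95 | 0.0 | 0.0 | 1.0 | 20220419 | 1834.3590 | 1440.0 | 2 | 2022 | 4 | 19 | 1 | 0 | 2 | 0 | 0 |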
5. Model training
# 7. Model training and validation
# Train and validate the model with K-fold cross-validation
def cv_model(clf, train_x, train_y, test_x, seed=2023):
    # Set the number of folds and initialize KFold
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    # Initialize the out-of-fold (oof) predictions and the test set predictions
    oof = np.zeros(train_x.shape[0])
    test_predict = np.zeros(test_x.shape[0])
    cv_scores = []
    # K-fold cross-validation
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('****** {} ******'.format(str(i+1)))
        # Use positional indexing for both the features and the labels
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]
        # Convert the data to the lightgbm Dataset format
        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        # lightgbm parameters
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.8,
            'bagging_freq': 4,
            'learning_rate': 0.1,
            'seed': 2023,
            'nthread' : 16,
            'verbose' : -1,
            # 'device':'gpu'
        }
        # Train the model. No early stopping is configured, so all 5000
        # rounds are trained and used for prediction.
        model = clf.train(params, train_matrix, 5000, valid_sets=[train_matrix, valid_matrix], categorical_feature=[])
        # Predict on the validation fold and on the test set
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        oof[valid_index] = val_pred
        # Average the test predictions over the folds
        test_predict += test_pred / kf.n_splits
        # Compute and print the RMSE of the folds so far
        score = np.sqrt(mean_squared_error(val_y, val_pred))
        cv_scores.append(score)
        print(cv_scores)
    return oof, test_predict
lgb_oof, lgb_test = cv_model(lgb, train_df[cols], train_df['power'], test_df[cols])
************************************ 1 ************************************ 
[266.5664166030034] 
************************************ 2 ************************************
 [266.5664166030034, 269.84825857493894] 
************************************ 3 ************************************ 
[266.5664166030034, 269.84825857493894, 264.3883436065406] 
************************************ 4 ************************************
[266.5664166030034, 269.84825857493894, 264.3883436065406, 264.34605684604895] 
************************************ 5 ************************************ 
[266.5664166030034, 269.84825857493894, 264.3883436065406, 264.34605684604895, 264.8575381718458]
Second run
[264.90523495044135, 268.80980558659286, 263.1386917495806, 263.18896925379823, 262.9298495826915]    ~~~~~ 241.81216518914164
Third run
[262.2219098528978, 266.69901543666936, 261.9101838643308, 260.94503659181794, 258.97461635980414]    ~~~~~ 244.17117702343822
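
Because the target is a future week, a shuffled KFold places later days in the training folds of earlier ones, so the fold scores may look better than a true forecast would. One possible alternative, which was not tried in this post, is to hold out the most recent days as a validation set. The sketch below shows only the index split. The function name `time_based_split`, the 7-day holdout, and the sample dates are assumptions.

```python
def time_based_split(dates, n_valid_days=7):
    """Split row positions so the last `n_valid_days` distinct dates form the
    validation set and everything earlier forms the training set."""
    unique_days = sorted(set(dates))
    if len(unique_days) <= n_valid_days:
        raise ValueError("not enough distinct dates for the requested holdout")
    valid_days = set(unique_days[-n_valid_days:])
    train_idx = [i for i, d in enumerate(dates) if d not in valid_days]
    valid_idx = [i for i, d in enumerate(dates) if d in valid_days]
    return train_idx, valid_idx


# Hypothetical ds column: two stations, ten consecutive days each
dates = [20220401 + k for k in range(10)] * 2
train_idx, valid_idx = time_based_split(dates, n_valid_days=7)
print(len(train_idx), len(valid_idx))
```

The returned positions could be passed to `train_df.iloc[...]` in place of the KFold indices, for example with `dates = train_df['ds'].tolist()`. Whether this gives a validation score closer to the submitted score on this dataset is untested.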
6. Output the results in the submission format
test_df['power'] = lgb_test
# Charging volume cannot be negative, so clip negative predictions to 0
test_df['power'] = test_df['power'].clip(lower=0)
test_df[['id_encode','ds','power']].to_csv('result.csv', index=False)
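
Before uploading, it can help to run a quick sanity check on the file. The sketch below is one way to do that with the standard library. It assumes the three columns written above (id_encode, ds, power) and 7 forecast days per station. The function name `check_submission` and the sample rows are hypothetical, and the competition's actual validation rules may differ.

```python
import csv
import io
import math


def check_submission(csv_text, expected_days=7):
    """Return a list of problems found in a submission with columns id_encode, ds, power."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    seen = set()
    days_per_station = {}
    for row in rows:
        key = (row["id_encode"], row["ds"])
        if key in seen:
            problems.append(f"duplicate row for {key}")
        seen.add(key)
        days_per_station.setdefault(row["id_encode"], set()).add(row["ds"])
        try:
            power = float(row["power"])
        except ValueError:
            power = math.nan  # an empty or non-numeric cell counts as missing
        if math.isnan(power) or power < 0:
            problems.append(f"invalid power {row['power']!r} for {key}")
    for station, days in days_per_station.items():
        if len(days) != expected_days:
            problems.append(f"station {station} has {len(days)} days, expected {expected_days}")
    return problems


# Hypothetical two-day sample with one negative prediction
sample = "id_encode,ds,power\n0,20230415,12.5\n0,20230416,-1.0\n"
print(check_submission(sample, expected_days=2))
```

For the real file, the text can be read with `open('result.csv', encoding='utf-8').read()`. An empty list means none of the three checks found a problem.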

![[Pasted image 20231018180025.png]]

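III. Current Progress

Status: adding the geographic information improved the results somewhat. Parameter tuning has been hit-or-miss: the more I tune, the worse the results get, or the model overfits.

Next steps: build more fine-grained features, try other models, and study how the parameters work before tuning further.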
