Sales Forecasting with Alibaba Cloud PAI

Sarah_07

Tags: Alibaba Cloud, cloud computing, machine learning, artificial intelligence

Article link: https://blog.csdn.net/Sarah_07/article/details/126811881

1.Business Background

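Store-level target management is central to retail management. Before this project, the sales target was broken down in three steps. First, at the end of the previous fiscal year, the finance team allocated the target to each store and fiscal month. Second, during execution, at the end of each quarter the regions adjusted the next quarter's targets and reallocated them to each store and fiscal month. Finally, toward the end of each fiscal month, the planning team broke the monthly targets down to the day. Daily targets make it possible to track sales attainment day by day and to compare stores, and stores can also schedule staff shifts according to how high each day's target is.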

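shop operates many stores in China. Splitting targets manually has two drawbacks. First, it considers few factors (only sales in the same period of previous years), so the accuracy of the split leaves room for improvement. Second, it requires a large amount of manual effort. This project uses machine learning to forecast daily sales per store for the coming fiscal month, taking historical sales, weather, holidays, and other factors into account, and uses the forecast to support target breakdown.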

2.Model Framework

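The shop forecasting model is an ensemble of two base models, SARIMAX and LSTM, whose outputs are combined by weight to produce the final forecast. Both base models take the same kinds of input: historical sales, holiday, and weather data.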
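The weighted combination can be sketched as follows. The source does not say how the weights are derived, so this sketch assumes inverse-error weighting on a validation window (the model with the lower mean absolute error gets the larger weight). The function names `inverse_error_weights` and `combine_forecasts` are hypothetical and do not come from the project's scripts.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def inverse_error_weights(actual, sarimax_pred, lstm_pred):
    """Assumed weighting scheme: weights proportional to 1 / MAE, summing to 1."""
    err_sarimax = mean_absolute_error(actual, sarimax_pred)
    err_lstm = mean_absolute_error(actual, lstm_pred)
    if err_sarimax == 0 and err_lstm == 0:
        return 0.5, 0.5  # both models are perfect on this window
    if err_sarimax == 0:
        return 1.0, 0.0
    if err_lstm == 0:
        return 0.0, 1.0
    inv_sarimax, inv_lstm = 1.0 / err_sarimax, 1.0 / err_lstm
    total = inv_sarimax + inv_lstm
    return inv_sarimax / total, inv_lstm / total


def combine_forecasts(sarimax_pred, lstm_pred, w_sarimax, w_lstm):
    """Day-by-day weighted sum of the two base forecasts."""
    return [w_sarimax * s + w_lstm * l for s, l in zip(sarimax_pred, lstm_pred)]


if __name__ == "__main__":
    actual = [100.0, 100.0]
    sarimax_val = [90.0, 110.0]   # validation-window forecasts (made-up numbers)
    lstm_val = [70.0, 130.0]
    w_s, w_l = inverse_error_weights(actual, sarimax_val, lstm_val)
    print(w_s, w_l)
    print(combine_forecasts(sarimax_val, lstm_val, w_s, w_l))
```

Under this assumption the weights are computed once per training run and then reused unchanged when combining the next month's predictions.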

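SARIMAX and LSTM are both time-series models, so they need the historical features to span a sufficient period: generally the longer the better, and at least one full cycle. Stores are classified by status as OG, relocated, newly renovated, or newly opened. Because of this coverage requirement, we run the time-series models only for OG stores and derive the daily sales ratio from their forecasts. For stores in any other status, we aggregate the OG-store forecasts to estimate the daily sales ratio.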

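OG stores: use the sales forecast directly at shop level to calculate the daily sales ratio.
Non-OG & City existed: non-OG stores whose city has OG stores. First aggregate the sales of that city's OG stores to city level, then calculate the daily sales ratio at city level. The city's daily sales ratio is used as the daily sales ratio for these stores.
Non-OG & City not existed: non-OG stores whose city has no OG stores. First aggregate the sales of the OG stores in the store's region to region level, then calculate the daily sales ratio at region level. The region's daily sales ratio is used as the daily sales ratio for these stores.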
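The three cases above can be sketched in plain Python. The source does not define the daily sales ratio explicitly; this sketch assumes it is each day's forecast sales divided by the total over the forecast period. The data layout (`forecasts`, `stores`) and all function names are hypothetical, and a store is treated as OG simply because it has a forecast.

```python
from collections import defaultdict


def daily_ratio(daily_sales):
    """Turn {day: sales} into {day: share of the period total}."""
    total = sum(daily_sales.values())
    if total == 0:
        return {day: 0.0 for day in daily_sales}
    return {day: value / total for day, value in daily_sales.items()}


def aggregate(forecasts, stores, key):
    """Sum OG-store forecasts per group (key is "city" or "region") and day."""
    grouped = defaultdict(lambda: defaultdict(float))
    for store_id, daily_sales in forecasts.items():
        group = stores[store_id][key]
        for day, value in daily_sales.items():
            grouped[group][day] += value
    return grouped


def ratios_for_all_stores(forecasts, stores):
    """forecasts: {og_store_id: {day: forecast sales}}
    stores: {store_id: {"city": ..., "region": ...}} for every store."""
    city_sales = aggregate(forecasts, stores, "city")
    region_sales = aggregate(forecasts, stores, "region")
    result = {}
    for store_id, info in stores.items():
        if store_id in forecasts:              # OG store: shop level
            result[store_id] = daily_ratio(forecasts[store_id])
        elif info["city"] in city_sales:       # non-OG, city has OG stores
            result[store_id] = daily_ratio(city_sales[info["city"]])
        else:                                  # non-OG, fall back to region
            result[store_id] = daily_ratio(region_sales[info["region"]])
    return result


if __name__ == "__main__":
    stores = {
        "A": {"city": "Shanghai", "region": "East"},
        "B": {"city": "Shanghai", "region": "East"},
        "C": {"city": "Shanghai", "region": "East"},   # non-OG
        "D": {"city": "Nanjing", "region": "East"},    # non-OG, no OG store in city
    }
    forecasts = {"A": {"d1": 10.0, "d2": 30.0}, "B": {"d1": 30.0, "d2": 30.0}}
    print(ratios_for_all_stores(forecasts, stores))
```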

3.Model Pipeline

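The project has two pipelines, one for training and one for running predictions. Both run once a month. The training pipeline runs first and produces new SARIMAX and LSTM models together with the weights for combining their results. The prediction pipeline then runs and outputs the sales forecast for the next month.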

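In both pipelines, weather data is fetched by a Function Compute function that calls an API and stores the results in OSS; holiday data is uploaded to OSS manually; and historical sales data is stored in MaxCompute as daily increments.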

The models are stored in OSS and run on Ali PAI DLC, and all tasks are scheduled through DataWorks.

The results are ultimately displayed in Tableau.

3.1 Training Pipeline

[Figure: training pipeline diagram]

3.2 Running Pipeline

[Figure: running pipeline diagram]

4.Model Script Deep Dive

4.1 Repository Structure

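The model and related documentation are stored in the Azure repository shop_sales_forecasting, currently on the shop_fcst_final branch:

Ali PAI DSW and Azure Git Repository interaction: describes how DSW interacts with the repository
forecasting_model: contains the forecasting model scripts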

4.1.1 Script of training pipeline

[Figure: scripts of the training pipeline]

4.1.2 Schedule of scripts in training pipeline

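Order	Schedule Timing	Script	Output
1	Friday, Saturday, and Sunday of the second-to-last week of the fiscal month	FC_Weather_getHistory.py	weather_his_YYYYMMdd.csv
2	Saturday of the second-to-last week of the fiscal month	train_sarimax.py	tb_dlc_fcst_lstm,tb_dlc_fcst_sarimax
3	Saturday of the second-to-last week of the fiscal month	train_score.py	tb_dlc_fcst_score,tb_dlc_fcst_train_combine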

4.1.3 Script of running pipeline

[Figure: scripts of the running pipeline]

4.1.4 Schedule of scripts in running pipeline

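Order	Schedule Timing	Script	Output
1	Friday, Saturday, and Sunday of the second-to-last week of the fiscal month	FC_Weather_getFuture.py	weather_40_YYYYMMdd.csv
2	Sunday of the second-to-last week of the fiscal month	predict_sarimax.py	tb_dlc_fcst_lstm_predict,tb_dlc_fcst_sarimax_predict
3	Sunday of the second-to-last week of the fiscal month	train_score.py	tb_dlc_fcst_predict_combine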

4.2 Explanations of important functions

4.2.1 Weather/FC_Weather_getHistory

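Fetches daily district-level weather data for the current and previous fiscal months, including wind_degree, max_temperature, min_temperature, and wea (weather type). Weather types are reduced to a simple set of enumerated values: heavy rain, blizzard, heavy snow, moderate rain, light snow, light rain, showers, snow showers, sleet, sunny, and other.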

4.2.2 Prod_Train/utils/Fetch_and_Push

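Contains three functions:

get_max_compute_table_sql: reads data from MaxCompute with a SQL query and converts it to a dataframe
get_max_compute_table_df: reads data from MaxCompute by table name and converts it to a dataframe
push_data_maxcompute: writes data back to the corresponding MaxCompute table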

4.2.3 Prod_Train/utils/preprossing

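The main functions are:

transform_df: applies light processing to df_netsales, df_weather, df_holiday, and df_date, mainly to unify the column names. For weather, it bins wind_degree (wind force) into levels 1-2, 3-4, 5-7, 8-9, and above10.
get_start_end_date_nm: gets the first and last day of the current fiscal month.
generate_dataset:
- Historical sales: fill in the missing dates and set their sales to 0.
- Weather: fill missing values (fillna) with the previous day's temperature.
- Time fields: add features such as week_of_year, week_of_month, and weekday.
- Consolidate historical sales, weather, and holidays into a single dataframe.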
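The preprocessing steps listed above can be illustrated with a dependency-free sketch. It is not the project's code: the function names, the row layout, the treatment of wind level 0 (placed in the "1-2" bin), the reading of "above10" as level 10 and higher, and the definition of week_of_month (7-day blocks of the calendar month rather than the fiscal month) are all assumptions made for illustration.

```python
from datetime import date, timedelta


def wind_level_bin(level):
    """Map a numeric wind-force level to the bins named in the text."""
    if level <= 2:
        return "1-2"      # assumption: level 0 also lands here
    if level <= 4:
        return "3-4"
    if level <= 7:
        return "5-7"
    if level <= 9:
        return "8-9"
    return "above10"      # assumption: level 10 and higher


def fill_missing_dates(sales_by_day, start, end):
    """Return one row per calendar day; days without sales get 0."""
    rows = []
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        rows.append({"date": day, "netsales": sales_by_day.get(day, 0.0)})
    return rows


def forward_fill(values):
    """Replace None with the most recent earlier value (previous day's reading)."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def add_time_features(row):
    """Add weekday, week_of_year, and week_of_month to a row with a 'date' key."""
    day = row["date"]
    row["weekday"] = day.weekday()                  # Monday = 0
    row["week_of_year"] = day.isocalendar()[1]      # ISO week number
    row["week_of_month"] = (day.day - 1) // 7 + 1   # assumed definition
    return row


if __name__ == "__main__":
    sales = {date(2022, 9, 1): 5.0, date(2022, 9, 3): 7.0}
    rows = fill_missing_dates(sales, date(2022, 9, 1), date(2022, 9, 3))
    temps = forward_fill([26.0, None, 24.0])
    for row, temp in zip(rows, temps):
        row["max_temperature"] = temp
        print(add_time_features(row))
```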

4.2.4 Prod_Train/model/sarimax_model

sarimax_model_v1
- Handle null values.
- Process the weather, holiday, week, and weekend features.
- Normalize the feature values 'is_close', 'max_temperature', 'weathernetsales', 'weekend_sea',
  'netsales_weather_ad', 'holidaynetsales', 'windnetsales', and 'y'.
- Output the forecast for the next fiscal month.
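Because the target 'y' is scaled together with the features, the model's output has to be mapped back to sales units before it is used. The source does not say which scaler the scripts use; the sketch below assumes simple min-max scaling, and its function names are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling range from training data only."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values into [0, 1] using a previously fitted range."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]   # constant column carries no signal
    return [(v - lo) / span for v in values]


def unscale(scaled, lo, hi):
    """Map scaled predictions back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


if __name__ == "__main__":
    train_y = [10.0, 20.0, 30.0]          # made-up daily net sales
    lo, hi = fit_min_max(train_y)
    print(scale(train_y, lo, hi))
    print(unscale([0.5], lo, hi))
```

Fitting the range on the training window and reusing it at prediction time keeps the two pipelines consistent.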

4.2.5 Prod_Train/utils/prepare_dataframe_v1

Only one function is covered here:

split_sequences

Builds the LSTM training set by using the previous 60 observations to predict the next day's value.
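The windowing idea can be sketched as follows. The project's actual split_sequences is not shown in the source, so `split_sequences_sketch` is a hypothetical stand-in; it assumes each observation is a list of features for one day and that the target is the last column.

```python
def split_sequences_sketch(rows, n_steps=60, target_index=-1):
    """Slide a window of n_steps days over rows.

    rows: list of per-day feature lists, ordered by date.
    Returns (X, y), where X[i] holds n_steps consecutive days and
    y[i] is the target value of the day right after that window.
    """
    X, y = [], []
    for end in range(n_steps, len(rows)):
        X.append(rows[end - n_steps:end])
        y.append(rows[end][target_index])
    return X, y


if __name__ == "__main__":
    # 65 made-up days with two columns: a feature and the target.
    rows = [[i, i * 10] for i in range(65)]
    X, y = split_sequences_sketch(rows, n_steps=60)
    print(len(X), len(X[0]), y[0])
```

With 65 days of history and a 60-day window, only 5 training samples result, which is one reason the text asks for long history coverage.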

4.2.6 Prod_Train/model/lstm_model

lstm_model_v1
- Handle null values.
- Process the weather, holiday, week, and weekend features.
- Normalize the feature values 'is_close', 'max_temperature', 'weathernetsales', 'weekend_sea',
  'netsales_weather_ad', 'holidaynetsales', 'windnetsales', and 'y'.
- Output the forecast for the next fiscal month.

5 Model Deployment

5.1 DataWorks Workflow

The workflow lives in the Edw_cl7733 project workspace and is named DLC_EDW.

The main nodes are shell nodes. Each shell node does two things:

Retrieve the access_key and access_secret from KMS
Create the job in DLC
##@resource_reference{"kms_demo-1.0-SNAPSHOT-release.jar"}
#!/bin/bash
#********************************************************************#
# Fetch the credentials from KMS; the last line of output is "<access_id>:<access_key>".
java -cp kms_demo-1.0-SNAPSHOT-release.jar com.gen.kms.CacheClientEnvironmentSample > ram.info
ram_info=$(awk 'END {print}' ram.info)
id=$(echo "$ram_info" | awk -F ':' '{print $1}')
key=$(echo "$ram_info" | awk -F ':' '{print $2}')
rm ram.info

# Write the DLC job definition. The ${para[i][j]} placeholders are left exactly as in the
# original; they are assumed to be filled in by the scheduler before the script runs.
cat << EOF > jobfile
name=train-job${para[0][2]}
workers=1
worker_spec=${para[0][9]}
worker_image=registry-vpc.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15PAI-cpu-py36-ubuntu18.04
command=python /root/data/shop_fcst_demo/train_sarimax.py ${para[0][1]} ${para[0][4]} ${para[0][5]} train-job${para[0][2]} $id $key
data_sources=XXXXXX
EOF

# Submit the job to PAI DLC.
/home/admin/usertools/tools/dlc submit tfjob \
    --access_id="$id" \
    --access_key="$key" \
    --endpoint=pai-dlc.cn-shanghai.aliyuncs.com \
    --region=cn-shanghai \
    --job_file=./jobfile \
    --thirdparty_lib_dir=/root/data/shop_fcst_demo \
    --interactive