Data Analysis and Demand Forecasting for Product Orders

DA_杨

Last modified 2023-11-23 01:07:19

Tags: data analysis, data mining

First published 2023-08-04 15:56:05

Article link: https://blog.csdn.net/m0_71150367/article/details/132088262

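11th "Teddy Cup" Data Mining Challenge, Problem B: Data Analysis and Demand Forecasting for Product Orders

Problem link

Background

In recent years the external environment for businesses has grown more uncertain, and this volatility creates many difficulties for supply chains. Demand forecasting is the first line of defense in a supply chain, yet it is affected by many factors and its accuracy is generally low, so better algorithms are needed. A demand forecast is a reasoned conclusion drawn from historical data and expectations about the future. It helps a company plan material purchasing, control inventory, improve production efficiency and manage production schedules, and it also helps the company gauge latent market demand and assess its current operations and future trends. It gives management a reference for sales and operations plans, targets and budgets, supports purchasing and production planning, and reduces exposure to business fluctuations. Without a forecast, or with an inaccurate one, decisions on sales, purchasing and budgeting can only rest on experience, which leads to misjudging the market, to excess or insufficient inventory and capital, and to higher inventory costs.

Mining objectives

The goal is to forecast demand so that management has a reference for sales and operations plans, targets, budgets, purchasing plans and production schedules. This article builds a demand forecasting model from historical data and an LSTM recurrent neural network and uses it to reach a reasoned view of future demand.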

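Problem 1: In-depth analysis of several factors in the training data of Attachment 1 (order_train1.csv).

How different product prices affect demand;
How the sales region affects demand, and how demand differs across regions;
How demand differs between sales channels (online and offline);
Differences and similarities in demand across product categories;
How demand behaves in different parts of the month (for example the start, middle and end of the month);
How holidays affect demand;
How promotions (such as 618 and Double 11) affect demand;
How seasonality affects demand.

Problem 2: Forecast the monthly demand for the next three months (January, February and March 2019) for the products in Attachment 2 (predict_sku1.csv).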

Problem 1

Exploratory data analysis (EDA)

Load the data, preprocess it (missing values, duplicates, outliers), then consolidate it

# Import libraries

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# 读取数据

data=pd.read_csv('../data/order_train1.csv',encoding = 'gbk')

data.head(5)

	order_date	sales_region_code	item_code	first_cate_code	second_cate_code	sales_chan_name	item_price	ord_qty
0	2015-09-01	104	22069	307	403	offline	1114.0	19
1	2015-09-01	104	20028	301	405	offline	1012.0	12
2	2015-09-02	104	21183	307	403	online	428.0	109
3	2015-09-02	104	20448	308	404	online	962.0	3
4	2015-09-02	104	21565	307	403	offline	1400.0	3

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 597694 entries, 0 to 597693
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   order_date         597694 non-null  object 
 1   sales_region_code  597694 non-null  int64  
 2   item_code          597694 non-null  int64  
 3   first_cate_code    597694 non-null  int64  
 4   second_cate_code   597694 non-null  int64  
 5   sales_chan_name    597694 non-null  object 
 6   item_price         597694 non-null  float64
 7   ord_qty            597694 non-null  int64  
dtypes: float64(1), int64(5), object(2)
memory usage: 36.5+ MB
None

#缺失值获取
print('每个特征缺失的数目：\n',data.isnull().sum())

每个特征缺失的数目：
 order_date           0
sales_region_code    0
item_code            0
first_cate_code      0
second_cate_code     0
sales_chan_name      0
item_price           0
ord_qty              0
dtype: int64

#重复的数据
print('全部有重复：\n', data[data.duplicated()])

全部有重复：
  .......
[312 rows x 8 columns]

print('前7列有重复：\n',data.iloc[:,:7][data.iloc[:,:7].duplicated()])

前7列有重复：
    ......
[11894 rows x 7 columns]

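There are 312 fully duplicated rows. We then checked for duplicates on the first 7 columns only (every column except ord_qty) and examined the matching rows. They share the same order date, product, sales channel, sales region and price, with order quantities that may or may not be equal. We consider this consistent with normal market activity rather than duplicated data entry, so the duplicates are kept and not dropped.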

#描述性统计
print(data.describe())
data.describe(include=['object'])

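Next, the order quantities for the same product on the same day, in the same region and sales channel and at the same price are summed to form a consolidated shipment dataset. The analysis below uses this dataset.

# Sum ord_qty over rows that share date, region, product, categories, channel and price

data1=data.groupby(by=['order_date','sales_region_code','item_code','first_cate_code','second_cate_code',
                       'sales_chan_name','item_price'],as_index=False).agg({'ord_qty':'sum'})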
print(data1.head(5))

   order_date  sales_region_code  item_code  first_cate_code  \
0  2015-09-01                104      20028              301   
1  2015-09-01                104      22069              307   
2  2015-09-02                101      20657              303   
3  2015-09-02                102      20323              305   
4  2015-09-02                102      20457              305   

   second_cate_code sales_chan_name  item_price  ord_qty  
0               405         offline      1012.0       12  
1               403         offline      1114.0       19  
2               410         offline      2996.0       18  
3               412         offline        99.0      502  
4               412         offline       164.0      308

#异常值
# 去除item_price和ord_qty小于0的值
data1 = data1[(data1['item_price'] >= 0) & (data1['ord_qty'] >= 0)]

# 分别画出item_price和ord_qty图
fig, axs = plt.subplots(2, 1, figsize=(10, 8))
axs[0].scatter(range(len(data1)), data1['item_price'],s=1)
axs[0].set_title('item_price scatter plot')
axs[0].set_xlabel('Index')
axs[0].set_ylabel('item_price')
axs[1].scatter(range(len(data1)), data1['ord_qty'],s=1)
axs[1].set_title('ord_qty scatter plot')
axs[1].set_xlabel('Index')
axs[1].set_ylabel('ord_qty')
plt.savefig('../tmp/item_price和ord_qty散点图.png')
plt.show()

[Figure: scatter plots of item_price and ord_qty]

#数据处理后保存为新文件，后续以此数据分析
pd.DataFrame(data1).to_csv('../data/order_train.csv',index=False)

#读取整合后的数据
df = pd.read_csv('../data/order_train.csv', encoding='gbk')

How different product prices affect demand

# 按item_price分组，求对应价格的需求平均值
one_data=df.groupby('item_price',as_index=False).agg({'ord_qty':np.mean})
print('item_price和ord_qty描述性统计:\n',one_data.describe())

item_price和ord_qty描述性统计:
           item_price       ord_qty
count   14365.000000  14365.000000
mean     2205.158618     40.786167
std      5326.057554    118.997335
min         1.000000      1.000000
25%       709.000000      6.750000
50%      1305.000000     11.000000
75%      2587.000000     39.571429
max    260014.000000   9874.000000

print(np.round(one_data.corr(method='spearman'),2))

# Scatter plot of the mean demand at each price
one_data.plot(x='item_price',y='ord_qty',kind='scatter',s=15,figsize=(11,8))
plt.ylabel('ord_qty')
from matplotlib.pyplot import MultipleLocator
plt.gca().xaxis.set_major_locator(MultipleLocator(20000))
plt.gca().yaxis.set_major_locator(MultipleLocator(1000))
plt.savefig('../tmp/1价格对应的需求平均值1.png')
plt.show()

[Figure: mean demand at each price]

# 计算销售额
one_data['sales_volume'] = one_data['item_price'] * one_data['ord_qty']
print(one_data)
one_data.to_excel('../data/sales_volume.xlsx')

plt.figure(figsize=(11,8))
plt.scatter(one_data['item_price'],one_data['sales_volume'],s=15)
plt.gca().xaxis.set_major_locator(MultipleLocator(20000))
plt.xlabel('价格')
plt.ylabel('销售额')
plt.savefig('../tmp/1销售额2.png')
plt.show()

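The analysis shows clear gaps along the price axis of the mean-demand plot, so we split the prices into five intervals, [1, 15963.38], [21009, 38334], [46006, 60007], [93532, 98016] and {260006, 260014}, and label them as the low, lower-mid, mid, upper-mid and high price tiers.

# Assign the five price tiers [1,15963.38],[21009,38334],[46006,60007],[93532,98016],{260006,260014}; the bin edges fall in the gaps between them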
data1['price_range'] = pd.cut(data1['item_price'], bins=[0,20000,46005,60007,98017,260014])
data1['sales_volume'] = data1['item_price'] * data1['ord_qty']
item_price=['低','较低','中','较高','高']
print(data1.head(5))
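
As a small sketch of the tiering step, `pd.cut` can attach the tier names directly through its `labels` argument. The sample prices and the English tier names below are made up for illustration; only the bin edges come from the code above.

```python
import pandas as pd

# Bin edges from the article; each edge sits in a gap between two price clusters
bins = [0, 20000, 46005, 60007, 98017, 260014]
tier_names = ["low", "lower-mid", "mid", "upper-mid", "high"]  # illustrative names

# Hypothetical prices, one per tier
prices = pd.Series([99.0, 25000.0, 50000.0, 95000.0, 260006.0])

# Intervals are right-closed by default: (0, 20000], (20000, 46005], ...
tiers = pd.cut(prices, bins=bins, labels=tier_names)
print(tiers.tolist())
```

Because the intervals are right-closed, a price exactly equal to an edge (for example 20000) falls into the lower tier.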

##各个区间的总需求平均量
average_ord_qty = data1.groupby('price_range', as_index=False)['ord_qty'].mean()['ord_qty'].tolist()
print('总需求平均量:',average_ord_qty)
plt.figure(figsize=(10, 9))
plt.bar(item_price, average_ord_qty, width=0.5, color='b')
#plt.title('不同价格等级的总需求平均量')
#plt.savefig('../tmp/1不同价格等级的总需求平均量3.png')
plt.ylabel('总需求平均量')
plt.xlabel('价格等级')
plt.show()

总需求平均量: [93.52652903031938, 6.282608695652174, 6.485714285714286, 6.105263157894737, 12.0]

[Figure: mean order quantity by price tier]

#销售平均额
average_sales = data1.groupby('price_range', as_index=False)['sales_volume'].mean()['sales_volume'].tolist()
print('销售平均额:',average_sales)

plt.figure(figsize=(11,8))
plt.bar(item_price, average_sales, width=0.5, color='b')
#plt.title('不同价格等级的平均销售额')
plt.xlabel('价格等级')
plt.ylabel('平均销售额')
#plt.savefig('../tmp/1平均销售额4.png')
plt.show()

销售平均额: [71835.17936636858, 218169.91304347827, 312551.28571428574, 577424.9526315789, 3120108.0]

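[Figure: mean sales amount by price tier]

Analysis: Mean demand for low-tier products is far higher than for any other tier, while the mean sales amount is far higher for the high tier. Treating the high tier as premium products, the premium market is small, draws less consumer attention and has low demand, but it brings in high revenue and should not be ignored. Low-tier products have high mean demand and hold a large market share, which may reflect a low-margin, high-volume pattern. The company could hold its position in the low-end market while moderately expanding in the premium market to raise order volume there.

How the sales region affects demand, and how demand differs across regions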

# 按 sales_region_code分组,求不同区域的总需求量
two_data=df.groupby('sales_region_code',as_index=False)['ord_qty'].sum()
print(two_data)

…

# 画饼图
two_data.plot.pie(x='sales_region_code',y='ord_qty',labels=['101','102','103','104','105'],explode=(0.03,0.02,0.01,0,0.05),
           pctdistance=0.7,autopct='%.2f%%',wedgeprops=dict(width=0.6,edgecolor="w"),shadow=True,figsize=(8,8))
plt.show()

[Figure: share of total demand by sales region]

#不同区域的按产品分组的平均需求量
region1=data.loc[data['sales_region_code']==101]
region2=data.loc[data['sales_region_code']==102]
region3=data.loc[data['sales_region_code']==103]
region4=data.loc[data['sales_region_code']==104]
region5=data.loc[data['sales_region_code']==105]
b1=region1[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()
b2=region2[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()
b3=region3[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()
b4=region4[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()
b5=region5[['item_code','ord_qty']].groupby(by = 'item_code',as_index=False).mean()

print('101',b1.describe())
print('102',b2.describe())
print('103',b3.describe())
print('104',b4.describe())
print('105',b5.describe())

101           item_code      ord_qty
count   1294.000000  1294.000000
mean   21030.212519    63.507840
std      606.170322    95.017748
min    20001.000000     1.000000
25%    20509.500000    14.927885
50%    21016.500000    35.793026
75%    21548.750000    75.402151
max    22084.000000  1473.969697
102           item_code      ord_qty
......

print('101平均需求量排名前五的产品为：',b1.sort_values('ord_qty',ascending=False).head(5))
print('102平均需求量排名前五的产品为：',b2.sort_values('ord_qty',ascending=False).head(5))
print('103平均需求量排名前五的产品为：',b3.sort_values('ord_qty',ascending=False).head(5))
print('104平均需求量排名前五的产品为：',b4.sort_values('ord_qty',ascending=False).head(5))
print('105平均需求量排名前五的产品为：',b5.sort_values('ord_qty',ascending=False).head(5))

101平均需求量排名前五的产品为：       item_code      ord_qty
1283      22066  1473.969697
717       21120   970.062500
375       20588   961.000000
917       21469   806.000000
1086      21758   794.000000
......

fig, axs = plt.subplots(3, 2, figsize=(15, 10))
axs[0][0].scatter(b1['item_code'], b1['ord_qty'], color='orangered',s=15)
axs[0][0].set_title('101')
axs[0][1].scatter(b2['item_code'], b2['ord_qty'], color='blueviolet',s=15,marker='*')
axs[0][1].set_title('102')
axs[1][0].scatter(b3['item_code'],b3['ord_qty'], color='green',s=15,marker='+')
axs[1][0].set_title('103')
axs[1][1].scatter(b4['item_code'], b4['ord_qty'], color='blue',s=15)
axs[1][1].set_title('104')
axs[2][0].scatter(b5['item_code'], b5['ord_qty'], color='red',s=15)
axs[2][0].set_title('105')
axs[2][1].remove()
plt.show()

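[Figure: mean demand per product in each region]

Analysis: Region 105 has the most products (1354) and region 104 the fewest (only 257). In region 104, product 21367 has the highest demand, while most other products have comparatively low demand. In region 103 demand is similar across products, with no single product standing out, whereas regions 101, 102, 104 and 105 each have some products with notably high demand.

How demand differs between sales channels (online and offline)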

#按sales_chan_name分组,求线上线下的总需求量
data1=df.groupby(by = 'sales_chan_name',as_index=False).agg({'ord_qty':np.sum})

data1.plot.barh(x='sales_chan_name',y='ord_qty',color=['#1E90FF','cyan'], figsize=(11,8)).legend_.remove()
plt.xlabel('ord_qty')
for y,x in enumerate(data1['ord_qty']):
    plt.text(x+0.1,y,"%s"%round(x,1),va='center')
#plt.title('不同销售方式和对应的总需求量条形图')
#plt.savefig('../tmp/3不同销售方式和对应的总需求量条形图1.png')
plt.show()

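[Figure: total demand by sales channel, horizontal bar chart]

# groupby sorts the channels alphabetically (offline, online), so take the labels from the data
data1.plot.pie(x='sales_chan_name',y='ord_qty',labels=data1['sales_chan_name'].tolist(),explode=(0.05,0.02),
               autopct='%.2f%%',wedgeprops=dict(width=0.6,edgecolor="w"),shadow=True,figsize=(8,8))
#plt.title('不同销售方式和对应的总需求量饼图')
#plt.savefig('../tmp/3不同销售方式和对应的总需求量饼图2.png')
plt.show()

[Figure: share of total demand by sales channel, ring chart]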

# 线上和线下总需求量排名前五
top=df.groupby(by = ['sales_chan_name','item_code'],as_index=False).agg({'ord_qty':np.sum})
online_top5 =top[top['sales_chan_name'] == 'online'].sort_values(by='ord_qty', ascending=False)
print('线上总需求量排名前五的产品为：\n',online_top5.head(5))
offline_top5 =top[top['sales_chan_name'] == 'offline'].sort_values(by='ord_qty', ascending=False)
print('线下总需求量排名前五的产品为：\n',offline_top5.head(5))

线上总需求量排名前五的产品为：
      sales_chan_name  item_code  ord_qty
2532          online      21619   895494
2588          online      21715   663160
2264          online      21061   603813
2127          online      20820   408975
2728          online      21986   362735
线下总需求量排名前五的产品为：
      sales_chan_name  item_code  ord_qty
1034         offline      21271  2310551
797          offline      20973  1617680
1309         offline      21619  1033017
817          offline      20996   960703
1126         offline      21394   684014

Effect of price and region on demand, split by online and offline channel

#线上线下对应的不同区域

quyu=df.groupby(by =['sales_chan_name','sales_region_code']).agg({'ord_qty':np.sum}).unstack()
print(quyu)
quyu.plot(y='ord_qty',kind='bar',figsize=(12, 10))
#plt.savefig('tmp/3线上线下区域和总需求量条形图3.png')
plt.show()

                    ord_qty                                       
sales_region_code       101       102       103      104       105
sales_chan_name                                                   
offline            11542949  13634154  10173394   131335   1492361
online               860019    335994   1348792  2256318  13003641

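Analysis: Offline demand is highest in region 102, followed by regions 101 and 103, which are close to each other and somewhat below region 102. Regions 105 and 104 are far below the other three, with region 104 the lowest. Online demand is concentrated in region 105, far above the other four regions, and region 102 has the lowest online demand.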
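
The ranking in this analysis can be checked directly from the printed table. The sketch below re-enters those totals by hand (the variable names are illustrative) and computes the top region per channel and the online share per region.

```python
import pandas as pd

# Totals copied from the channel-by-region table printed above
quyu_demo = pd.DataFrame(
    {
        101: [11542949, 860019],
        102: [13634154, 335994],
        103: [10173394, 1348792],
        104: [131335, 2256318],
        105: [1492361, 13003641],
    },
    index=["offline", "online"],
)

# Region with the highest and lowest demand for each channel
top_region = quyu_demo.idxmax(axis=1)
bottom_region = quyu_demo.idxmin(axis=1)

# Share of each region's demand that comes from the online channel
online_share = quyu_demo.loc["online"] / quyu_demo.sum(axis=0)

print(top_region)
print(bottom_region)
print(online_share.round(3))
```

Looking at shares rather than totals makes the split visible: regions 101 to 103 are almost entirely offline, while regions 104 and 105 are mostly online.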

#线上线下对应的不同价格

price=df.groupby(by =['item_price','sales_chan_name']).agg({'ord_qty':np.mean}).unstack()
fu= -1 * price['ord_qty']['offline']
ax = price['ord_qty']['online'].plot(kind='line',marker='*',figsize=(11,8))
fu.plot(kind='line',figsize=(11,8), color='blueviolet',ax=ax)
plt.xlabel('item_price')
plt.legend(['online','offline'])
plt.ylabel('ord_qty')
#plt.title('线上线下价格和平均需求量折线图')
#plt.savefig('tmp/3线上线下价格和平均需求量折线图4.png')
plt.show()

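[Figure: mean demand by price for the online and offline channels, with offline plotted as negative values]

Analysis: Overall, online sales cover a narrower price range than offline sales, and at the same price online demand is higher than offline demand.

Differences and similarities in demand across product categories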

#按大类，细类分组，求对应总需求

data1=df.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})
data3=df.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum}).unstack()
data3.plot(kind='barh',y='ord_qty',stacked=True,figsize=(11,8))
#plt.title('不同品类与对应的总需求量')
plt.show()

[Figure: total demand by category and subcategory]

Sales channel, region and price by category and subcategory

# 销售方式

name=df.groupby(by=['first_cate_code','second_cate_code','sales_chan_name'],as_index=False).agg({'ord_qty':np.sum})
print(name)

    first_cate_code  second_cate_code sales_chan_name   ord_qty
0               301               405         offline    445434
1               301               405          online   1141300
2               302               408         offline   3988155
........

name1=df.groupby(by=['first_cate_code','second_cate_code','sales_chan_name']).agg({'ord_qty':np.sum}).unstack()
name1.plot(y='ord_qty',kind='bar',figsize=(12, 10))
#plt.title('不同品类下销售方式与和对应的总需求量')
plt.show()

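[Figure: total demand by category, subcategory and sales channel]

Analysis: The chart and table show that, in both sales channels, total demand for category 306 / subcategory 407 is far higher than for any other category, while three subcategories of category 303 have comparatively very low total demand. Category 306 / subcategory 407 therefore has a large market and strong consumer interest. Overall, sales across categories lean towards the offline channel.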

#区域
region=df.groupby(by=['first_cate_code','second_cate_code','sales_region_code'],as_index=False).agg({'ord_qty':np.sum})
print(region)

    first_cate_code  second_cate_code  sales_region_code  ord_qty
0               301               405                101    51961
1               301               405                102   223807
2               301               405                103   166150
3               301               405                104   489238
......

region.plot.scatter(x='second_cate_code',y='ord_qty', c='sales_region_code', cmap="viridis", s=20,figsize=(11,8))
#plt.title('不同品类下区域与和对应的总需求量')
plt.show()

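[Figure: total demand by subcategory, colored by sales region]

Subcategories 402, 403, 404, 405, 406 and 409 have their highest demand in region 105. Subcategory 407 has the highest total demand of all subcategories in every region where it is sold, while subcategories 406, 410 and 411 have low demand in all of their regions.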

# 价格
price=data.groupby(by=['item_price','second_cate_code'],as_index=True).agg({'ord_qty':np.mean}).unstack()
print(price.head(5))

price.plot(kind='line',y='ord_qty', figsize=(20,10))
#plt.title('不同品类下价格的总平均需求量')
plt.show()

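[Figure: mean demand by price for each subcategory]

Analysis: The orange line fluctuates sharply within a narrow price range and reaches the highest peak, meaning that subcategory 402 has higher demand than the other subcategories at similar prices. The light blue line, subcategory 410, is split into separate bands: it spans a wide price range yet still draws small order quantities throughout, which we take as a sign of high price elasticity.

How demand behaves in different parts of the month (for example the start, middle and end of the month)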

data=df.copy()
data['order_date']= pd.to_datetime(data['order_date'])#转换时间格式
# 将日期列设置为索引
data.set_index('order_date', inplace=True)

The first 7 days of each month are defined as the start of the month, days 12 to 18 as the middle, and the last 7 days as the end.
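
To make the three windows concrete, here is a minimal sketch that labels individual dates. The helper name `month_period` and the sample dates are hypothetical and not part of the original analysis; the day thresholds follow the definition above.

```python
import pandas as pd

def month_period(dates: pd.Series) -> pd.Series:
    """Label each date as 'start', 'middle', 'end' or 'other' within its month."""
    day = dates.dt.day
    days_in_month = dates.dt.days_in_month
    label = pd.Series("other", index=dates.index)
    label[day <= 7] = "start"                       # first 7 days
    label[(day >= 12) & (day <= 18)] = "middle"     # days 12 to 18
    label[day >= days_in_month - 6] = "end"         # last 7 days, month-length aware
    return label

sample = pd.Series(pd.to_datetime(
    ["2018-02-03", "2018-02-15", "2018-02-22", "2018-02-10", "2018-03-25"]
))
labels = month_period(sample)
print(labels.tolist())
```

Using `days_in_month` means the end window is days 22 to 28 in a 28-day February and days 25 to 31 in a 31-day month, and days 8 to 11 and the days between the middle and end windows belong to none of the three periods.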

# 每月前7天为月头
first= data[data.index.day <= 7]
# 每月12-18为月中
middle= data[(data.index.day >= 12) & (data.index.day <= 18)]
# 每月最后7天为月末
last= data[data.index.day >= (data.index.days_in_month - 6)]

# 分别计算每个月的总需求量
mean_first= first.groupby(first.index.month)['ord_qty'].sum()
mean_middle = middle.groupby(middle.index.month)['ord_qty'].sum()
mean_last = last.groupby(last.index.month)['ord_qty'].sum()

# 绘制总需求量图表
plt.figure(figsize=(10,8))
plt.plot(mean_first.index, mean_first.values, label='月头')
plt.plot(mean_middle.index, mean_middle.values, label='月中')
plt.plot(mean_last.index, mean_last.values, label='月末')
plt.legend()
plt.xlabel('月份')
plt.ylabel('总需求量')
#plt.title('每个月月头、月中、月末总需求量')
#plt.savefig('../tmp/5每个月月头、月中、月末总需求量1.png')
plt.show()

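[Figure: total demand at the start, middle and end of each month]

Analysis: Across the three years, total demand at the start, middle and end of the month does not differ much between adjacent months and fluctuates within a limited range. Peaks for all three periods generally fall in March and in October to November, and troughs in January to February and in July. The three periods therefore follow a regular pattern: consumer demand varies with the time of year.

Price, region, sales channel and category within each part of the month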

# 价格
price_first=first.groupby('item_price').agg({'ord_qty':np.mean})
price_middle=middle.groupby('item_price').agg({'ord_qty':np.mean})
price_last=last.groupby('item_price').agg({'ord_qty':np.mean})
# 绘制平均价格分布散点图
fig, axs = plt.subplots(3, 1, figsize=(10, 15))
axs[0].scatter(price_first.index, price_first.values, s=5,label='月头')
axs[0].set_title('月头')
axs[1].scatter(price_middle.index, price_middle.values, s=5,label='月中')
axs[1].set_title('月中')
axs[2].scatter(price_last.index,price_last.values,s=5, label='月末')
axs[2].set_title('月末')
plt.show()

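[Figure: mean demand by price at the start, middle and end of the month]

Analysis: At the start, middle and end of the month, demand is concentrated in the [0, 500] range and prices are concentrated in the low tier, so low-end products are in high demand throughout the month. The start and end of the month also show demand at high-tier prices, so the premium market should not be overlooked.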

# 区域

quyu_first=first.groupby('sales_region_code',as_index=False).agg({'ord_qty':np.sum})
print('月头:\n',quyu_first)

quyu_middle=middle.groupby('sales_region_code',as_index=False).agg({'ord_qty':np.sum})
print('月中:\n',quyu_middle)

quyu_last=last.groupby('sales_region_code',as_index=False).agg({'ord_qty':np.sum})
print('月末:\n',quyu_last)

plt.figure(figsize=(10,8))
plt.bar(quyu_first['sales_region_code'], quyu_first['ord_qty'], label='月头')
plt.bar(quyu_middle['sales_region_code'], quyu_middle['ord_qty'], label='月中', bottom=quyu_first['ord_qty'])
plt.bar(quyu_last['sales_region_code'], quyu_last['ord_qty'], label='月末', bottom=quyu_first['ord_qty']+quyu_middle['ord_qty'])
plt.legend()
plt.xlabel('区域')
plt.ylabel('总需求量')
plt.show()

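[Figure: total demand by region at the start, middle and end of the month]

Region 105 has the highest total demand in all three periods, regions 101, 102 and 103 are relatively close to one another, and region 104 is far below the other regions in every period.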

# 销售方式

xiao_first=first.groupby('sales_chan_name').agg({'ord_qty':np.sum})
print('月头销售:\n',xiao_first) 
xiao_middle=middle.groupby('sales_chan_name').agg({'ord_qty':np.sum})
print('月中销售:\n',xiao_middle)
xiao_last=last.groupby('sales_chan_name').agg({'ord_qty':np.sum})
print('月末销售:\n',xiao_last)

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
axs[0].pie(xiao_first['ord_qty'], labels=xiao_first.index, autopct='%1.1f%%', wedgeprops=dict(width=0.5),shadow=True)
axs[0].set_title('月头')
axs[1].pie(xiao_middle['ord_qty'], labels=xiao_middle.index, autopct='%1.1f%%', wedgeprops=dict(width=0.5),shadow=True)
axs[1].set_title('月中')
axs[2].pie(xiao_last['ord_qty'], labels=xiao_last.index, autopct='%1.1f%%', wedgeprops=dict(width=0.5),shadow=True)
axs[2].set_title('月末')
plt.show()

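[Figure: share of total demand by sales channel at the start, middle and end of the month]

Comparing the online and offline shares of total demand in the three periods (start, middle and end of the month) shows that offline demand is much larger than online demand in all three.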

#大类——细类

lei_first=first.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})
print('月头品类:\n',lei_first)
lei_middle=middle.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})
print('月中品类:\n',lei_middle)
lei_last=last.groupby(by=['first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})
print('月末品类:\n',lei_last)

fig, axs = plt.subplots(3, 1, figsize=(10, 15), sharex=True)
lei_first.plot(kind='barh', ax=axs[0])
axs[0].set_title('月头')
lei_middle.plot(kind='barh', color='cyan', ax=axs[1])
axs[1].set_title('月中')
lei_last.plot(kind='barh', color='#1E90FF', ax=axs[2])
axs[2].set_title('月末')
plt.xlabel('总需求量')
plt.show()

…

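[Figure: total demand by category at the start, middle and end of the month]

Analysis: Category 306 / subcategory 407 has far higher total demand than any other category in all three periods (start, middle and end of the month), and the chart shows that each category's demand differs little across the three periods.

How holidays affect demand

We identified all public holidays in the period covered by the data (2015-09-01 to 2018-12-20) from the State Council's holiday arrangements and compiled them into the file 法定节假日.csv.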
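
One simple way to quantify the holiday effect, sketched here on invented data, is to flag each day as holiday or not and compare the mean daily demand of the two groups. The column name `date` matches the holiday file read below; the dates and quantities are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical orders and holiday list, for illustration only
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2018-09-30", "2018-10-01", "2018-10-01", "2018-10-02", "2018-10-08"]
    ),
    "ord_qty": [10, 4, 6, 5, 20],
})
holidays = pd.DataFrame({"date": pd.to_datetime(["2018-10-01", "2018-10-02"])})

# Total demand per day, then a holiday / other flag
daily = orders.groupby("order_date", as_index=False)["ord_qty"].sum()
daily["day_type"] = np.where(daily["order_date"].isin(holidays["date"]), "holiday", "other")

# Mean daily demand for each kind of day
mean_by_type = daily.groupby("day_type")["ord_qty"].mean()
print(mean_by_type)
```

Comparing means per day avoids the bias of comparing totals, since there are far fewer holidays than ordinary days.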


# 原数据
data=df.copy()
data['order_date']= pd.to_datetime(data['order_date'])
self_data=data.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

#提取出数据含有节假日部分
data1=pd.read_csv('../data/法定节假日.csv',encoding = 'gbk')
data1['date']= pd.to_datetime(data1['date'])
jieri=data.loc[data['order_date'].isin(data1['date'])]

jieri1=jieri.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

plt.figure(figsize=(11,8))
plt.plot(self_data['order_date'],self_data['ord_qty'].values)
plt.scatter(jieri1['order_date'],jieri1['ord_qty'].values,color='r',linewidths=0.001)
plt.xlabel('日期')
plt.ylabel('总需求量')
plt.legend(['蓝线-原数据','红点-节假日'])
#plt.title('节假日影响')
plt.show()

[Figure: daily total demand with holidays marked in red]

Price, region, sales channel, category and part of the month on holidays

…

How promotions (such as 618 and Double 11) affect demand

# 原数据
data=df.copy()
data['order_date']= pd.to_datetime(data['order_date'])
self_data=data.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

#促销日
cuxiao=pd.read_csv('../data/促销日.csv',encoding = 'gbk')
cuxiao['order_date']= pd.to_datetime(cuxiao['order_date'])
cuxiao1=cuxiao.groupby(by='order_date',as_index=False).agg({'ord_qty':np.sum})

# Compare the full daily series with the promotion days

plt.figure(figsize=(11,8))
# Draw the full series as a line so that it matches the legend entry below
plt.plot(self_data['order_date'],self_data['ord_qty'].values)
plt.scatter(cuxiao1['order_date'],cuxiao1['ord_qty'].values,color='r',linewidths=0.001)
plt.xlabel('日期')
plt.ylabel('总需求量')
plt.legend(['蓝线-原数据','红点-促销日'])
#plt.title('促销日影响')
plt.show()

…

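Price, region, sales channel, category and part of the month on promotion days

# Category

# Total promotion-day demand by month and subcategory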
cuxiao.index=cuxiao['order_date']
pinlei=cuxiao.groupby(by=[cuxiao.index.month,'second_cate_code'],as_index=True).agg({'ord_qty':np.sum}).unstack()
pinlei1=cuxiao.groupby(by=[cuxiao.index.month,'second_cate_code'],as_index=True).agg({'ord_qty':np.sum})
print(pinlei1)

                             ord_qty
order_date second_cate_code         
1          401                 65980
           402                 44430
           403                106740
           404                 88467
           405                 22335
           406                   503
           407                394343
           408                 89263
           409                 26952
           410                  1109
           411                   189
           412                117964
3          401                 42322
           402                 42633
           403                 77491
           404                 71224

           .....

#按品类分组
pinlei2=cuxiao.groupby('second_cate_code',as_index=True).agg({'ord_qty':np.sum})
print('品类:\n',pinlei2)

品类:
                   ord_qty
second_cate_code         
401                478775
402                445061
403                787329
.......

pinlei.plot(y='ord_qty',kind='bar',figsize=(12, 10))
plt.show()

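[Figure: promotion-day demand by month and subcategory]

Analysis: On promotion days from January to October, demand for subcategory 407 is far higher than for the other subcategories. The earlier analysis showed that subcategory 407 has the highest demand overall, so the subcategory with the highest overall demand also has the highest demand on promotion days.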

#区域
quyu=cuxiao.groupby(by='sales_region_code').agg({'ord_qty':np.sum})
print(quyu)

quyu.plot.bar(figsize=(12, 10))
plt.show()

[Figure: promotion-day total demand by region]

#销售方式
xiao=cuxiao.groupby(by='sales_chan_name').agg({'ord_qty':np.sum})
print(xiao)

xiao.plot(kind='pie',y='ord_qty',figsize=(8, 8), autopct='%1.1f%%',shadow=True)
plt.show()

[Figure: promotion-day demand share by sales channel]

#不同时间段
copy_cuxiao=cuxiao.copy()
copy_cuxiao.set_index('order_date', inplace=True)
# 每月前7天为月头
first= copy_cuxiao[copy_cuxiao.index.day <= 7]
# 每月12-18为月中
middle= copy_cuxiao[(copy_cuxiao.index.day >= 12) & (copy_cuxiao.index.day <= 18)]
# 每月最后7天为月末
last= copy_cuxiao[copy_cuxiao.index.day >= (copy_cuxiao.index.days_in_month - 6)]
sum_first= first.groupby(first.index.month)['ord_qty'].sum()
sum_middle = middle.groupby(middle.index.month)['ord_qty'].sum()
sum_last = last.groupby(last.index.month)['ord_qty'].sum()
print('月头的促销日总需求\n',sum(first['ord_qty']))
print('月中的促销日总需求\n',sum(middle['ord_qty']))
print('月末的促销日总需求\n',sum(last['ord_qty']))

月头的促销日总需求
 3556003
月中的促销日总需求
 1148008
月末的促销日总需求
 193211

plt.figure(figsize=(8,8))
plt.scatter(sum_first.index,sum_first.values,marker='*',s=50, label='月头')
plt.scatter(sum_middle.index, sum_middle.values,s=50, label='月中')
plt.scatter(sum_last.index, sum_last.values, label='月末')
plt.legend()
plt.xlabel('月份')
plt.ylabel('总需求量')
#plt.title('每个月月头、月中、月末总需求量')
plt.show()

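[Figure: promotion-day total demand at the start, middle and end of each month]

Analysis: The month-start points for November and December correspond to the Double 11 and Double 12 promotions, and their demand is far higher than that of the other promotion days in any part of the month.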

#价格

jiage=cuxiao.groupby('item_price',as_index=False).agg({'ord_qty':np.sum}).sort_values(by='ord_qty', ascending=False)
print(jiage.head(5))
code=cuxiao.groupby('item_code',as_index=False).agg({'ord_qty':np.sum}).sort_values(by='ord_qty', ascending=False)
print(code.head(5))
fig,axs=plt.subplots(1,2,figsize=(10,5))
jiage.head(5).plot(kind='barh', x='item_price', y='ord_qty',ax=axs[0])
axs[0].set_title('促销日总需求前五的价格')
axs[0].invert_yaxis()
code.head(5).plot(kind='barh', x='item_code', y='ord_qty',ax=axs[1])
axs[1].set_title('促销日总需求前五的产品')
axs[1].invert_yaxis()
plt.show()

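[Figure: top five prices and top five products by promotion-day demand]

# Holidays
data1=pd.read_csv('../data/法定节假日.csv',encoding = 'gbk')
data1['date']= pd.to_datetime(data1['date'])
# Promotion-day rows that also fall on a public holiday
jieri=cuxiao.loc[cuxiao['order_date'].isin(data1['date'])]

How seasonality affects demand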

data=df.copy()
data['order_date']= pd.to_datetime(data['order_date'])
data.index=data['order_date']

import seaborn as sns

#四个季度与对应总需求

season=data.groupby(data.index.quarter).agg({'ord_qty':np.sum})
a=pd.DataFrame({"season":['第一季度','第二季度','第三季度','第四季度'],"ord_qty":season['ord_qty'].T})
print(a)

sns.barplot(data=a, x="season", y="ord_qty")
#plt.savefig('tmp/8不同季度与对应总需求条形图1.png')
plt.show()

[Figure: total demand by quarter]

#分每年，四个季度与对应总需求

season1=data.groupby(by=[ data.index.quarter,data.index.year]).agg({'ord_qty':np.sum}).unstack()
print(season1)

season1.plot(figsize=(11,8))
plt.ylabel('ord_qty')
plt.xlabel('order_date')
plt.tick_params(labelsize=9)
plt.show()

              ord_qty                                 
order_date       2015       2016       2017       2018
order_date                                            
1                 NaN  3342190.0  4353418.0  5189048.0
2                 NaN  3134103.0  4554570.0  3559035.0
3            832725.0  2791345.0  4568376.0  3861418.0
4           3315098.0  4404605.0  6006351.0  4866675.0

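[Figure: quarterly total demand for each year]

Analysis: The 2015 data starts in the third quarter. In 2016 the first two quarters are flat, the third quarter is the trough (the off season) and the fourth quarter is the peak (the busy season). In 2017 the first three quarters are flat and the fourth quarter is again the peak. In 2018 the peak is in the first quarter and the trough in the second, after which demand rises gradually from the third quarter. The overall trend is upward.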
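
The peak and trough quarters can be read off the printed table programmatically. This sketch re-enters the quarterly totals shown above by hand and skips 2015, which only covers part of the year.

```python
import numpy as np
import pandas as pd

# Quarterly totals copied from the table printed above (rows are quarters)
season_demo = pd.DataFrame(
    {
        2015: [np.nan, np.nan, 832725, 3315098],
        2016: [3342190, 3134103, 2791345, 4404605],
        2017: [4353418, 4554570, 4568376, 6006351],
        2018: [5189048, 3559035, 3861418, 4866675],
    },
    index=pd.Index([1, 2, 3, 4], name="quarter"),
)

# Use only the years with all four quarters present
full_years = season_demo[[2016, 2017, 2018]]
peak_quarter = full_years.idxmax()
trough_quarter = full_years.idxmin()

print(peak_quarter)
print(trough_quarter)
```

In this table the fourth quarter is the peak in 2016 and 2017, while 2018 peaks in the first quarter.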

Price, region, sales channel, category, part of the month, holidays and promotions by quarter

#区域
quyu=data.groupby(by=[data.index.quarter,'sales_region_code']).agg({'ord_qty':np.sum}).unstack()
quyu1=data.groupby(by=[data.index.quarter,'sales_region_code']).agg({'ord_qty':np.sum})
print(quyu1)


quyu.plot(y='ord_qty',kind='bar',figsize=(12, 10))
#plt.savefig('tmp/8季度区域总需求条形图3.png')
plt.show()

                              ord_qty
order_date sales_region_code         
1          101                3086061
           102                3493961
           103                2854377
           104                 358229
           105                3092028
2          101                2689544
           102                2900771
        ......

[Figure: total demand by quarter and region]

#销售方式

xiao=data.groupby(by=[data.index.quarter,'sales_chan_name']).agg({'ord_qty':np.sum}).unstack()
print(xiao)


xiao.plot(kind='barh',y='ord_qty',figsize=(8, 8),stacked=True)
#plt.savefig('tmp/8季度-销售总需求条形图4.png')
plt.show()

                  ord_qty         
sales_chan_name   offline   online
order_date                        
1                 9561373  3323283
2                 7670729  3576979
3                 7669057  4384807
4                12073034  6519695

[Figure: total demand by quarter and sales channel]

#品类
pinlei=data.groupby(by=['first_cate_code','second_cate_code',data.index.quarter]).agg({'ord_qty':np.sum}).unstack()
pinlei1=data.groupby(by=[data.index.quarter,'first_cate_code','second_cate_code']).agg({'ord_qty':np.sum})
print(pinlei1)

pinlei.plot(y='ord_qty',kind='bar',figsize=(12, 10))
#plt.savefig('tmp/8品类总需求量5.png')
plt.show()

                                             ord_qty
order_date first_cate_code second_cate_code         
1          301             405                371630
           302             408               1420866
           303             401                827342
                           406                  8188
                           410                 19499
                           411                  5249
           304             409                146351
           305             412               1467238
           306             402                672730
                           407               5025566
           307             403               1484381
           308             404               1435616
2          301             405                324169
           302             408               1319117
           303             401                789257
           ......

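[Figure: total demand by category and quarter]

Analysis: In every quarter category 306 / subcategory 407 remains far above the other categories, and three subcategories of category 303 have the lowest demand. Apart from those three, which show no clear change, every category grows to some degree in the fourth quarter, while demand in the other quarters differs little. Product categories therefore follow the same quarterly pattern as total demand.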

# 价格
price = data.groupby(by=['item_price', data.index.quarter]).agg(
    {'ord_qty': np.sum}).unstack()
print(price)
print('四个季度价格计数多少个：\n', price.count())

           ord_qty                   
order_date       1      2     3     4
item_price                           
1.00           NaN   86.0   NaN   2.0
1.01           3.0   28.0   5.0  72.0
2.00           NaN  125.0  44.0  13.0
2.01           6.0   50.0  16.0  64.0
3.00           NaN  114.0  10.0  51.0
...            ...    ...   ...   ...
[14365 rows x 4 columns]
四个季度价格计数多少个：
          order_date
ord_qty  1             5386
         2             6451
         3             7539
         4             8532
dtype: int64

Problem 2

Data preprocessing and preparation

Further data preparation and processing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

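Some product codes in the data to be predicted do not appear in the training set and cannot be predicted directly, so only the product codes that do appear are modeled here.
For the remaining products, the mean of the predictions grouped by 'sales_region_code', 'first_cate_code' and 'second_cate_code' is used instead.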
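
The fallback described here can be sketched with two left merges. Everything below is invented for illustration (the frames, the `pred_qty` column and the product codes); it only shows the idea of filling an unseen product with its group mean and is not the author's code.

```python
import pandas as pd

group_keys = ["sales_region_code", "first_cate_code", "second_cate_code"]

# Hypothetical model output for products that exist in the training data
pred = pd.DataFrame({
    "sales_region_code": [101, 101, 102],
    "item_code": [20001, 20002, 20003],
    "first_cate_code": [303, 303, 305],
    "second_cate_code": [406, 406, 412],
    "pred_qty": [10.0, 30.0, 8.0],
})

# Hypothetical rows to predict; item 29999 never appears in the training data
to_predict = pd.DataFrame({
    "sales_region_code": [101, 101, 102],
    "item_code": [20001, 29999, 20003],
    "first_cate_code": [303, 303, 305],
    "second_cate_code": [406, 406, 412],
})

# Mean prediction per region and category group
group_mean = (
    pred.groupby(group_keys, as_index=False)["pred_qty"].mean()
        .rename(columns={"pred_qty": "group_mean"})
)

# Attach item-level predictions where available, then fall back to the group mean
result = (
    to_predict
    .merge(pred[["sales_region_code", "item_code", "pred_qty"]],
           on=["sales_region_code", "item_code"], how="left")
    .merge(group_mean, on=group_keys, how="left")
)
result["pred_qty"] = result["pred_qty"].fillna(result["group_mean"])
print(result[["item_code", "pred_qty"]])
```

A group with no predicted products at all would still come out as NaN here and would need a coarser fallback, for example the regional mean.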

df1 = pd.read_csv('../data/order_train1.csv', encoding='gbk')
df2 = pd.read_csv('../data/predict_sku1.csv', encoding='gbk')
# 选取df1中'sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'列中与df2中相同的行
# 因为需要预测的数据有些在训练集中没有，我们先提取出有的部分来预测
data = df1[df1[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1).isin(
    df2[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1))]
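
An alternative sketch of the same filter uses an inner merge on the key columns instead of building a tuple for every row, which is usually faster on a frame of this size. The two small frames are invented; note that the merge resets the row index, unlike the `isin` version.

```python
import pandas as pd

keys = ['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']

# Hypothetical stand-ins for df1 (training orders) and df2 (products to predict)
train_demo = pd.DataFrame({
    'sales_region_code': [101, 101, 102, 103],
    'item_code': [20002, 20003, 20002, 20010],
    'first_cate_code': [303, 303, 303, 305],
    'second_cate_code': [406, 406, 406, 412],
    'ord_qty': [4, 7, 2, 9],
})
predict_demo = pd.DataFrame({
    'sales_region_code': [101, 103, 105],
    'item_code': [20002, 20010, 29999],
    'first_cate_code': [303, 305, 301],
    'second_cate_code': [406, 412, 405],
})

# Keep only training rows whose key combination appears in the prediction list
wanted = predict_demo[keys].drop_duplicates()
matched = train_demo.merge(wanted, on=keys, how='inner')
print(matched)
```

Dropping duplicates from the key list first matters: a repeated key in the prediction file would otherwise duplicate the matching training rows.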

##数据准备
# print(data.dtypes)
# print(data['ord_qty'].describe())
# 去除'ord_qty'异常值

def remove_outliers(df, col_name):
   q1 = df[col_name].quantile(0.25)
   q3 = df[col_name].quantile(0.75)
   iqr = q3 - q1
   lower_bound = q1 - 1.5 * iqr
   upper_bound = q3 + 1.5 * iqr
   df = df[(df[col_name] >= lower_bound) & (df[col_name] <= upper_bound)]
   return df

data_without_outliers = remove_outliers(data, 'ord_qty')
# 去掉异常值前、后的箱线图
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
axs[0].boxplot(data['ord_qty'])
axs[0].set_title('ord_qty')
axs[1].boxplot(data_without_outliers['ord_qty'])
axs[1].set_title('ord_qty without outliers')
#plt.savefig('tmp/第二问箱线图1.png')
plt.show()
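
As a worked example of the IQR rule used in `remove_outliers`, the following sketch applies the same bounds to five made-up quantities, so the numbers can be checked by hand.

```python
import pandas as pd

# Hypothetical order quantities with one extreme value
qty = pd.Series([1, 2, 3, 4, 100])

q1 = qty.quantile(0.25)           # 2.0
q3 = qty.quantile(0.75)           # 4.0
iqr = q3 - q1                     # 2.0
lower_bound = q1 - 1.5 * iqr      # -1.0
upper_bound = q3 + 1.5 * iqr      # 7.0

# Keep only the values inside [lower_bound, upper_bound]
kept = qty[(qty >= lower_bound) & (qty <= upper_bound)]
print(kept.tolist())
```

Order quantities are typically right-skewed, so this rule can also drop genuine large orders. Whether that is acceptable depends on whether those orders should influence the forecast.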

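[Figure: box plots of ord_qty before and after outlier removal]

# Demand over time: parse order_date itself, because the integer row index is not a date
data_without_outliers = data_without_outliers.copy()
data_without_outliers['order_date'] = pd.to_datetime(data_without_outliers['order_date'])
plt.figure(figsize=(8,4))
plt.plot(data_without_outliers['order_date'],data_without_outliers['ord_qty'])
plt.show()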

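[Figure: order quantity over time after outlier removal]

Clearly, this many data points in one plot does not help the analysis that follows. Given what Problem 2 asks for, we resample by day, week and month.

# Total demand resampled by day, week and month
# NOTE: this restarts from `data`, so the rows dropped by remove_outliers are included again below
data_without_outliers = data.copy()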
data_without_outliers['order_date'] = pd.to_datetime(
    data_without_outliers['order_date'])
data_without_outliers = data_without_outliers.set_index('order_date')

fig = plt.figure(figsize=(18,16))
fig.subplots_adjust(hspace=.2)
ax1 = fig.add_subplot(3,1,1)
ax1.plot(data_without_outliers['ord_qty'].resample('D').sum(),linewidth=1)
ax1.set_title('Total demand per day')
ax1.tick_params(axis='both', which='major')
ax2 = fig.add_subplot(3,1,2, sharex=ax1)
ax2.plot(data_without_outliers['ord_qty'].resample('W').sum(),linewidth=1)
ax2.set_title('Total demand per week')
ax2.tick_params(axis='both', which='major')
ax3 = fig.add_subplot(3,1,3, sharex=ax1)
ax3.plot(data_without_outliers['ord_qty'].resample('M').sum(),linewidth=1)
ax3.set_title('Total demand per month')
ax3.tick_params(axis='both', which='major')
#plt.savefig('tmp/第二问按天周月3.png')
plt.show()
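
As a small illustration of what `resample(...).sum()` does to a date-indexed series, the sketch below uses four made-up orders. Days without any order appear with a total of 0, and the `'W'` rule labels each bin by the Sunday that ends the week.

```python
import pandas as pd

# Four made-up orders; 2018-01-01 is a Monday.
toy = pd.DataFrame({
    'order_date': ['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-10'],
    'ord_qty': [5, 3, 2, 4],
})
toy['order_date'] = pd.to_datetime(toy['order_date'])
toy = toy.set_index('order_date')

daily = toy['ord_qty'].resample('D').sum()    # one row per calendar day, gaps become 0
weekly = toy['ord_qty'].resample('W').sum()   # weeks ending on Sunday

print(daily)
print(weekly)
```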

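Group the preprocessed data by 'sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code', resample each group by day, week, and month to get the total demand, and drop the rows whose total demand is 0. The results are saved as 'day.csv', 'week.csv', and 'month.csv', which serve as the datasets for forecasting at the daily, weekly, and monthly granularity respectively.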

# Daily, weekly, and monthly granularity
# Group by 'sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code', resample the total demand by day, week, and month, and drop the rows equal to 0
d = data_without_outliers.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])[
    'ord_qty'].resample('D').sum().reset_index()
d = d.loc[d['ord_qty'] != 0]
print(d)
w = data_without_outliers.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])[
    'ord_qty'].resample('W').sum().reset_index()
w = w.loc[w['ord_qty'] != 0]
m = data_without_outliers.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])[
    'ord_qty'].resample('M').sum().reset_index()
m = m.loc[m['ord_qty'] != 0]

pd.DataFrame(d).to_csv('../data/day.csv', index=False)
pd.DataFrame(w).to_csv('../data/week.csv', index=False)
pd.DataFrame(m).to_csv('../data/month.csv', index=False)
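
The grouped resampling above fills every gap inside each group's own date range with a zero total, which is why the zero rows are filtered out afterwards. A minimal sketch on toy data (grouping by a single hypothetical key, `item_code`, for brevity):

```python
import pandas as pd

toy = pd.DataFrame({
    'item_code': [20001, 20001, 20002],
    'order_date': pd.to_datetime(['2018-01-01', '2018-01-04', '2018-01-02']),
    'ord_qty': [5, 3, 7],
}).set_index('order_date')

# Each group is resampled over its own first-to-last date.
per_item_daily = toy.groupby('item_code')['ord_qty'].resample('D').sum().reset_index()

# Item 20001 gets rows for Jan 1 to Jan 4 (two of them zero); item 20002 gets one row.
nonzero = per_item_daily.loc[per_item_daily['ord_qty'] != 0]

print(per_item_daily)
print(nonzero)
```

Dropping the zero rows keeps the files small, but it also means the saved series no longer has one row per period, which matters if lag-based features are added later.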

Building the gradient boosting tree model

Daily granularity
Weekly granularity
Monthly granularity

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import joblib

Daily granularity

df=pd.read_csv('../data/day.csv')
df

	sales_region_code	item_code	first_cate_code	second_cate_code	order_date	ord_qty
0	101	20002	303	406	2017-08-04	4
1	101	20002	303	406	2018-03-14	2
2	101	20002	303	406	2018-03-16	3
3	101	20002	303	406	2018-03-25	3
4	101	20002	303	406	2018-03-31	9
...

257038 rows × 6 columns

Separating features and target

# Data preprocessing
# Separate the features X from the target y, then standardize y
X = df.drop('ord_qty', axis=1)
y = df['ord_qty']
c=y.mean()
d=y.std()
# Save c and d (needed later to undo the standardization)
np.save('../tmp/c.npy', c)
np.save('../tmp/d.npy', d)
y= (y-c)/d

Feature processing

# One-hot encode the categorical columns
X= pd.get_dummies(X, columns=['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])
# Turn the date column into numeric features
X['order_date'] = pd.to_datetime(X['order_date'])
X['year'] = X['order_date'].dt.year
X['month'] = X['order_date'].dt.month
X['quarter']=X['order_date'].dt.quarter
X.drop('order_date', axis=1, inplace=True)

a=X.mean()
b=X.std()
# Save a and b (np.save stores the values only, not the column names)
np.save('../tmp/a.npy', a)
np.save('../tmp/b.npy', b)
X= (X-a)/b

Building the gradient boosting tree model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the gradient boosting tree model
model = GradientBoostingRegressor(n_estimators=150, learning_rate=0.08, max_depth=6, random_state=42)
model.fit(X_train, y_train)

# Save the model
joblib.dump(model, '../tmp/model.pkl')
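
The split above is random, so test rows can come from earlier dates than training rows. For forecasting, a common alternative is a chronological holdout. The sketch below is not the method used in this post; it only shows the idea on toy monthly data, and `cutoff` is a hypothetical date chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'order_date': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31',
                                  '2018-04-30', '2018-05-31']),
    'ord_qty': [10, 12, 9, 15, 14],
})

# Hypothetical cutoff: everything before it is used for training, the rest for testing.
cutoff = pd.Timestamp('2018-04-01')
train_part = toy[toy['order_date'] < cutoff]
test_part = toy[toy['order_date'] >= cutoff]

print(len(train_part), len(test_part))
```

Evaluated this way, the test score reflects how the model handles periods it has never seen, which is closer to the actual forecasting task.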

Model evaluation

# Model evaluation (all metrics are computed on the standardized target)
y_pred_train = model.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = np.sqrt(mse_train)
r_train = r2_score(y_train, y_pred_train)

y_pred_test = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = np.sqrt(mse_test)
r_test = r2_score(y_test, y_pred_test)

print('Training MSE:', mse_train)
print('Training RMSE:', rmse_train)
print('Training R2:', r_train)
print('Test MSE:', mse_test)
print('Test RMSE:', rmse_test)
print('Test R2:', r_test)

Training MSE: 0.734888685818914
Training RMSE: 0.8572564877671758
Training R2: 0.2635732896957512
Test MSE: 0.7696230124579693
Test RMSE: 0.8772816038524741
Test R2: 0.23672780093650492
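
Because the target was standardized before training, the MSE and RMSE above are expressed in standard deviations of `ord_qty`, not in order units (a test RMSE of about 0.877 means roughly 0.877 standard deviations). Multiplying the RMSE by the saved standard deviation converts it back. A small check of that identity on synthetic numbers (the arrays below are made up, not the competition data):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(1, 200, size=50).astype(float)   # synthetic demand
y_hat = y_true + rng.normal(0, 10, size=50)            # synthetic predictions

mu, sigma = y_true.mean(), y_true.std()
z_true = (y_true - mu) / sigma
z_hat = (y_hat - mu) / sigma

rmse_standardized = np.sqrt(np.mean((z_true - z_hat) ** 2))
rmse_original = np.sqrt(np.mean((y_true - y_hat) ** 2))

# The two differ exactly by the factor sigma.
print(rmse_standardized * sigma, rmse_original)
```

R² is unaffected by this rescaling, so it can be compared directly across the daily and monthly models.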

# Predicted vs. true values on the test set (converted back to the original scale)

y_test = y_test * d + c
y_pred = y_pred_test *d + c
plt.plot(y_test.values, label='true')
plt.plot(y_pred, label='pred')
plt.legend()
plt.show()

The above trains the model at the daily granularity; the weekly and monthly granularities follow the same steps.

Monthly granularity

df=pd.read_csv('../data/month.csv')
df

	sales_region_code	item_code	first_cate_code	second_cate_code	order_date	ord_qty
0	101	20002	303	406	2017-08-31	4
1	101	20002	303	406	2018-03-31	17
2	101	20002	303	406	2018-04-30	124
3	101	20002	303	406	2018-05-31	110
4	101	20002	303	406	2018-06-30	77
...	...	...	...	...	...	...

30824 rows × 6 columns

# Data preprocessing
# Separate the features X from the target y, then standardize y
X = df.drop('ord_qty', axis=1)
y = df['ord_qty']
c1=y.mean()
d1=y.std()
# Save c1 and d1
np.save('../tmp/c1.npy', c1)
np.save('../tmp/d1.npy', d1)
y= (y-c1)/d1

# One-hot encode the categorical columns
X= pd.get_dummies(X, columns=['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])
# Turn the date column into numeric features
X['order_date'] = pd.to_datetime(X['order_date'])
X['year'] = X['order_date'].dt.year
X['month'] = X['order_date'].dt.month
X['quarter']=X['order_date'].dt.quarter
X.drop('order_date', axis=1, inplace=True)

a1=X.mean()
b1=X.std()
# Save a1 and b1
np.save('../tmp/a1.npy', a1)
np.save('../tmp/b1.npy', b1)
X= (X-a1)/b1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the gradient boosting tree model
model = GradientBoostingRegressor(n_estimators=150, learning_rate=0.08, max_depth=6, random_state=42)
model.fit(X_train, y_train)

# Save the model
joblib.dump(model, '../tmp/model_month.pkl')

['../tmp/model_month.pkl']

# Model evaluation
y_pred_train = model.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
rmse_train = np.sqrt(mse_train)
r_train = r2_score(y_train, y_pred_train)

y_pred_test = model.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
rmse_test = np.sqrt(mse_test)
r_test = r2_score(y_test, y_pred_test)

print('Training MSE:', mse_train)
print('Training RMSE:', rmse_train)
print('Training R2:', r_train)
print('Test MSE:', mse_test)
print('Test RMSE:', rmse_test)
print('Test R2:', r_test)

Training MSE: 0.3238078735863478
Training RMSE: 0.5690411879524608
Training R2: 0.6913676251507997
Test MSE: 0.40153638374400513
Test RMSE: 0.6336689859414023
Test R2: 0.49983854868617916

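# Predicted vs. true values on the test set
# Use the monthly statistics c1 and d1 (not the daily c and d) to undo the standardization
y_test = y_test * d1 + c1
y_pred = y_pred_test * d1 + c1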
plt.plot(y_test.values, label='true')
plt.plot(y_pred, label='pred')
plt.legend()
plt.show()

Prediction

# Read the data to be predicted
pred_df = pd.read_csv('../data/predict_sku1.csv',encoding='gbk')
df = pd.read_csv('../data/month.csv',encoding='gbk')

# Keep the rows of pred_df whose ('sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code') combination also appears in df
pred1= pred_df[pred_df[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1).isin(df[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1))]

# The remaining rows of pred_df, i.e. combinations with no history in month.csv
no = pred_df.loc[~pred_df[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1).isin(pred1[['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code']].apply(tuple, axis=1))]

Data preprocessing

# One-hot encode the categorical columns
pred=pred1.copy()
pred = pd.get_dummies(pred, columns=['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code'])

# Load the monthly a, b, c, and d values
a=np.load('../tmp/a1.npy')
b=np.load('../tmp/b1.npy')
c = np.load('../tmp/c1.npy')
d = np.load('../tmp/d1.npy')
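
Since `a1.npy` and `b1.npy` hold plain arrays without column names, the standardization of the prediction frame only lines up if its one-hot columns are exactly the training columns, in the same order. One way to guard against a mismatch is to keep the training column list and reindex the prediction frame to it. A minimal sketch, assuming a hypothetical `train_columns` list captured at training time:

```python
import pandas as pd

# Training-time one-hot encoding: three item codes were seen.
train_X = pd.get_dummies(pd.DataFrame({'item_code': [20001, 20002, 20003]}),
                         columns=['item_code'])
train_columns = list(train_X.columns)   # hypothetical: stored alongside the model

# Prediction-time encoding: only two of the codes occur, in a different order.
new_X = pd.get_dummies(pd.DataFrame({'item_code': [20003, 20001]}),
                       columns=['item_code'])

# Reindex to the training layout; columns missing at prediction time are filled with 0.
aligned = new_X.reindex(columns=train_columns, fill_value=0)
print(aligned.astype(int))
```

After this step the frame has the same width and column order as the training matrix, so position-based operations such as `(pred - a) / b` refer to the intended columns.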

# April
pred4=pred.copy()
pred4['year'] = 2019
pred4['month'] = 4
pred4['quarter'] = 2
pred4=(pred4-a)/b

# May
pred5=pred.copy()
pred5['year'] = 2019
pred5['month'] = 5
pred5['quarter'] = 2
pred5=(pred5-a)/b

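# June
# June 2019 is in Q2; the outputs shown further down were produced with year=2018, quarter=4 here
pred6=pred.copy()
pred6['year'] = 2019
pred6['month'] = 6
pred6['quarter'] = 2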
pred6=(pred6-a)/b

# Load the model
model = joblib.load('../tmp/model_month.pkl')

# Predict the demand for the three months and undo the standardization
y_pred4 = model.predict(pred4)
y_pred4 = y_pred4 *d + c
print('April:\n',y_pred4)
y_pred5 = model.predict(pred5)
y_pred5 = y_pred5 *d + c

y_pred6 = model.predict(pred6)
y_pred6 = y_pred6 *d + c

April:
 [399.07802768 646.041552   748.03332473 ... 682.8277281  779.00830241
 700.16161931]

# Collect the predictions in a result table
result_df =pred1.copy()
result_df['2019年4月预测需求量'] = y_pred4
result_df['2019年5月预测需求量'] = y_pred5
result_df['2019年6月预测需求量'] = y_pred6

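For products whose item code has no history in the training data, the forecast is replaced by the mean of the predictions within the same 'sales_region_code', 'first_cate_code', 'second_cate_code' group.

# Fill in products without history using the group mean of the predictions
forecast_cols = ['2019年4月预测需求量', '2019年5月预测需求量', '2019年6月预测需求量']
group_cols = ['sales_region_code', 'first_cate_code', 'second_cate_code']
buchong = result_df.groupby(group_cols)[forecast_cols].mean().reset_index()
# Inner join: products whose group has no prediction at all are dropped here
bu = pd.merge(no, buchong, on=group_cols)
bu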

	sales_region_code	item_code	first_cate_code	second_cate_code	2019年4月预测需求量	2019年5月预测需求量	2019年6月预测需求量
0	101	20011	303	401	616.096452	616.096452	483.859698
1	101	20198	303	401	616.096452	616.096452	483.859698
2	101	20254	303	401	616.096452	616.096452	483.859698
3	101	20324	303	401	616.096452	616.096452	483.859698
...	...	...	...	...	...	...	...

432 rows × 7 columns
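
Because the merge above is an inner join, any product whose region and category group has no predicted sibling silently disappears from the result. A left join with `indicator=True` makes such rows visible. The sketch below uses toy frames; `toy_missing`, `toy_group_mean` and the `forecast` column are hypothetical stand-ins for `no`, `buchong` and the forecast columns.

```python
import pandas as pd

group_keys = ['sales_region_code', 'first_cate_code', 'second_cate_code']

toy_missing = pd.DataFrame({
    'sales_region_code': [101, 101, 105],
    'item_code':         [20011, 20198, 29999],
    'first_cate_code':   [303, 303, 399],
    'second_cate_code':  [401, 401, 499],
})
toy_group_mean = pd.DataFrame({
    'sales_region_code': [101],
    'first_cate_code':   [303],
    'second_cate_code':  [401],
    'forecast':          [616.1],
})

# A left join keeps every product; '_merge' says whether a group mean was found.
checked = toy_missing.merge(toy_group_mean, on=group_keys, how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(checked)
print(unmatched['item_code'].tolist())
```

Rows reported as `left_only` would need another fallback, for example a mean over a coarser grouping.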

# Combine both parts, round to whole units, and export
he=pd.concat([result_df,bu]).drop(['first_cate_code', 'second_cate_code'], axis=1)
he = he.round({'2019年4月预测需求量': 0, '2019年5月预测需求量': 0, '2019年6月预测需求量': 0})
print(he)
he.to_excel('../result2.xlsx', index=False)

     sales_region_code  item_code  2019年4月预测需求量  2019年5月预测需求量  2019年6月预测需求量
0                  101      20002         399.0         399.0         261.0
1                  101      20003         646.0         646.0         508.0
2                  101      20006         748.0         748.0         615.0
4                  101      20014        1209.0        1209.0         906.0
5                  101      20016         540.0         540.0         401.0
..                 ...        ...           ...           ...           ...
431                105      21867         890.0         974.0         694.0

[2619 rows x 5 columns]

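Summary: overall, the model's predictions are still poor. Among the time granularities, the monthly one works best (test R2 of about 0.50, compared with about 0.24 at the daily granularity). To improve the model, one could extract more suitable time-series features, tune the hyperparameters with grid search, or switch to a more complex model (such as a deep learning model).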
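
As one example of the additional time-series features mentioned in the summary, lagged demand and a rolling mean can be derived per product with pandas. This is only a sketch on toy data; the feature names `lag_1` and `roll_mean_2` are made up for illustration, and the example groups by `item_code` alone rather than by all four keys.

```python
import pandas as pd

toy = pd.DataFrame({
    'item_code': [20001, 20001, 20001, 20002, 20002],
    'order_date': pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31',
                                  '2018-01-31', '2018-02-28']),
    'ord_qty': [10, 12, 9, 4, 6],
}).sort_values(['item_code', 'order_date'])

# Demand of the previous row within the same product.
toy['lag_1'] = toy.groupby('item_code')['ord_qty'].shift(1)

# Mean of the two previous rows; shifting first keeps the current value out of its own feature.
toy['roll_mean_2'] = toy.groupby('item_code')['ord_qty'].transform(
    lambda s: s.shift(1).rolling(2).mean())

print(toy)
```

Note that 'month.csv' has the zero-demand months removed, so a row-based lag like this refers to the previous non-zero month rather than the previous calendar month unless the gaps are filled back in first.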