Store Sales Forecasting (Regression & Random Forest)

技术小坤

Category: Case Studies. Tags: machine learning, artificial intelligence, random forest, linear regression, python, deep learning, neural networks

Article link: https://blog.csdn.net/qq_45791012/article/details/136276691

1. Problem Overview

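In this Kaggle competition, the task is to apply time series forecasting to data from Corporación Favorita, a large Ecuadorian grocery retailer, and build a model that accurately predicts the unit sales of items sold at different stores. Accurate forecasts can reduce the food waste that comes with overstocking and improve customer satisfaction.

Of the six available data files, we analyzed three: train, test, and stores. We did not study the effect of daily oil prices or holiday events in this project, but we hope to spend more time beyond this course to learn and grow further.

In our analysis we explored two time series forecasting models: linear regression and random forest. While preparing the data for linear regression we found several interesting patterns: sales rise on weekends, when public sector wages are paid, and in November and December. We also noticed a sharp drop in sales in 2014 and 2015. Both the increases and the decreases may be due to store promotions, holidays, oil prices, or world events, none of which we could investigate in this analysis.

For linear regression, we removed product families that are not sold at a given store, clustered the stores, and grouped low-selling products into a single product family. We investigated and removed outliers and assessed the seasonality of the training data.

For random forest, we processed the categorical variables and removed outliers.

Linear regression proved inefficient and exhibited exponential behavior, whereas random forest proved to be the easier model to implement. However, without the data processing done for linear regression we would not have found the insights above, because random forest is a black-box model.

We also include some closing thoughts on feature importance, hyperparameter optimization, and further findings from residual plots.

In summary, random forest is the better model for predicting store sales.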

#This Python 3 environment comes with many helpful analytics libraries installed
#It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
#For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#Input data files are available in the read-only "../input/" directory
#For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

#You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
#You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/store-sales-time-series-forecasting/oil.csv
/kaggle/input/store-sales-time-series-forecasting/sample_submission.csv
/kaggle/input/store-sales-time-series-forecasting/holidays_events.csv
/kaggle/input/store-sales-time-series-forecasting/stores.csv
/kaggle/input/store-sales-time-series-forecasting/train.csv
/kaggle/input/store-sales-time-series-forecasting/test.csv
/kaggle/input/store-sales-time-series-forecasting/transactions.csv

2. Importing Packages and Datasets

#Import packages
#BASE
# ------------------------------------------------------
import numpy as np
import pandas as pd
import os
import gc
import warnings

#Machine Learning
# ------------------------------------------------------
import statsmodels.api as sm
import sklearn

#Data Visualization
# ------------------------------------------------------
#import altair as alt
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from plotly.offline import init_notebook_mode, iplot

warnings.filterwarnings('ignore')
init_notebook_mode(connected=True)

#Import datasets
train = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/train.csv",parse_dates=['date'])
test = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/test.csv",parse_dates=['date'])
stores = pd.read_csv("/kaggle/input/store-sales-time-series-forecasting/stores.csv")

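We check the data type of each column in the datasets below. To do time series forecasting with the packages we imported, the dates must be parsed as datetimes. Later, we will convert the "object" columns to categories. We also look over the datasets to see whether any cleaning or manipulation is needed.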
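A minimal sketch of the dtype handling described above, run on a tiny made-up frame rather than the competition files. The frame `toy` and its values are illustrative assumptions; `pd.to_datetime` is used here to get the same result that `parse_dates=['date']` gives at read time.

```python
import pandas as pd

# Tiny made-up frame standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-02"],
    "store_nbr": [1, 1, 2],
    "family": ["AUTOMOTIVE", "BEAUTY", "AUTOMOTIVE"],
    "sales": [0.0, 2.0, 5.5],
})

# Parse the date strings into a datetime column
toy["date"] = pd.to_datetime(toy["date"])

# Convert the text column to a categorical dtype
toy["family"] = toy["family"].astype("category")

# Inspect the resulting column types
print(toy.dtypes)
```

A categorical column stores each distinct label once, which can matter for a column like family that repeats a small set of values over millions of rows.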

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype         
---  ------       -----         
 0   id           int64         
 1   date         datetime64[ns]
 2   store_nbr    int64         
 3   family       object        
 4   sales        float64       
 5   onpromotion  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 137.4+ MB

train.head()

	id	date	store_nbr	family	sales
0	0	2013-01-01	1	AUTOMOTIVE
1	1	2013-01-01	1	BABY CARE
2	2	2013-01-01	1	BEAUTY
3	3	2013-01-01	1	BEVERAGES
4	4	2013-01-01	1	BOOKS

stores.head()

	store_nbr	city	state	type	cluster
0	1	Quito	Pichincha	D	13
1	2	Quito	Pichincha	D	13
2	3	Quito	Pichincha	D	8
3	4	Quito	Pichincha	D	9
4	5	Santo Domingo	Santo Domingo de los Tsachilas	D	4

3. Data Processing

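From train.csv and transactions.csv:

date - the date the data was recorded.
store_nbr - identifies the store where the products are sold.
family - identifies the type of product sold.
sales - the total sales for a product family at a particular store on a given date. Fractional values are possible.
onpromotion - the total number of items in a product family that were being promoted at a store on a given date.
transactions - the total number of transactions that took place in a store on a given date.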

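Removing sales of product families that a store does not sell

A quick look at the training set shows a lot of zeros. Some stores may not sell certain products because they are not the right store for them. In that case we remove these values; when forecasting, they should not have any sales.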

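#sum sales per id, store_nbr and family; id is unique per row, so every group is a single row
zeros = train.groupby(['id', 'store_nbr', 'family']).sales.sum().reset_index().sort_values(['family','store_nbr'])
#keep only the groups whose sales are zero
zeros = zeros[zeros.sales == 0]
zeros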

	id	store_nbr	family	sales
0	0	1	AUTOMOTIVE	0.0
10692	10692	1	AUTOMOTIVE	0.0
30294	30294	1	AUTOMOTIVE	0.0
40986	40986	1	AUTOMOTIVE	0.0
53460	53460	1	AUTOMOTIVE	0.0
...	...	...	...	...
2981153	2981153	54	SEAFOOD	0.0
2984717	2984717	54	SEAFOOD	0.0
2986499	2986499	54	SEAFOOD	0.0
2993627	2993627	54	SEAFOOD	0.0
2998973	2998973	54	SEAFOOD	0.0
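If the aim is to find only the store/family pairs that never sold anything over the whole period, one option is to leave id out of the grouping keys so that the sum runs across all dates. A minimal sketch on made-up data follows; the frame `toy` and the names `totals` and `never_sold` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up daily sales (illustrative only): store 1 never sells BOOKS
toy = pd.DataFrame({
    "id": range(6),
    "store_nbr": [1, 1, 1, 1, 2, 2],
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY", "BOOKS", "BOOKS"],
    "sales": [0.0, 0.0, 0.0, 3.0, 0.0, 1.0],
})

# Sum over all rows per (store_nbr, family); id is not a grouping key
totals = toy.groupby(["store_nbr", "family"], as_index=False)["sales"].sum()

# Keep only the pairs whose total over the whole period is zero
never_sold = totals[totals["sales"] == 0]
print(never_sold)
```

In this toy data, store 1 has a zero-sales day for DAIRY and store 2 has one for BOOKS, but both pairs have a non-zero total and are therefore not selected.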

#full outer join of train with the zero rows, then keep only the rows that did not match
join = train.merge(zeros[zeros.sales == 0].drop(columns="sales"), how='outer', indicator=True)
train1 = join[join._merge != 'both'].drop(columns=['id', '_merge', 'onpromotion']).reset_index(drop=True)
train1

	date	store_nbr	family	sales
0	2013-01-01	25	BEAUTY	2.000
1	2013-01-01	25	BEVERAGES	810.000
2	2013-01-01	25	BREAD/BAKERY	180.589
3	2013-01-01	25	CLEANING	186.000
4	2013-01-01	25	DAIRY	143.000
...	...	...	...	...
2061753	2017-08-15	9	POULTRY	438.133
2061754	2017-08-15	9	PREPARED FOODS	154.553
2061755	2017-08-15	9	PRODUCE	2419.729
2061756	2017-08-15	9	SCHOOL AND OFFICE SUPPLIES	121.000
2061757	2017-08-15	9	SEAFOOD	16.000
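Merging with indicator=True and discarding the matched rows is commonly called an anti-join. A minimal sketch on made-up data follows; the frames `left` and `to_drop` are illustrative assumptions, and the sketch uses how="left" rather than the outer join above, which selects the same rows whenever every key in the drop table also appears in the left table.

```python
import pandas as pd

# Made-up rows (illustrative only)
left = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2],
    "family": ["BOOKS", "DAIRY", "BOOKS", "DAIRY"],
    "sales": [0.0, 3.0, 1.0, 4.0],
})

# Keys of the rows to remove
to_drop = pd.DataFrame({"store_nbr": [1], "family": ["BOOKS"]})

# indicator=True adds a _merge column holding 'left_only', 'right_only' or 'both'
merged = left.merge(to_drop, on=["store_nbr", "family"], how="left", indicator=True)

# Rows marked 'left_only' had no match in to_drop, so they are the ones to keep
kept = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
print(kept)
```

Passing the key columns explicitly through on= makes it visible which columns decide a match, instead of relying on whatever column names the two frames happen to share.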