Multi Time Series Forecasting in Spark

This article, translated from a post on Medium, describes how to run multiple time series forecasts in Spark, exploring time series analysis in machine learning and AI with Spark.

Spark is a great platform for parallelising machine learning algorithms. Algorithms like clustering and random forests already have PySpark implementations, available mainly under the ml library (previously known as mllib).


However, when it comes to time series forecasting, the options available in Spark may not be obvious at first look. Very often, we may want to run multiple time series models simultaneously. In the absence of an inbuilt time series library in Spark, the workaround is a Spark pandas UDF. The best part is that you can parallelise your model using just a few lines of PySpark code, while most of your algorithm code stays in plain Python. Use cases include sales forecasting for multiple items, demand forecasting for multiple stores, or even fraud detection across multiple time series.


To demonstrate, I will use a Spark pandas UDF to forecast the sales of multiple items at individual store level in parallel.


Data

The data is publicly available at https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data


It contains the weekly sales data for 45 Walmart stores across 81 departments. We need to forecast the department sales at each of these stores.

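As a minimal sketch of the setup (assuming the Kaggle train.csv has been downloaded locally; the file path and session name are illustrative, not from the original code), the data can be loaded into a Spark DataFrame like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('walmart_forecasting').getOrCreate()

# Illustrative path; the Kaggle file contains Store, Dept, Date and Weekly_Sales columns
time_series_data = spark.read.csv('train.csv', header=True, inferSchema=True)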

In this example, I have explored using a Spark pandas UDF to parallelise time series forecasting.


Code

The code has just two major components:


  1. Create the Python time series pandas UDF to be run on grouped data
  2. Group the Spark DataFrame based on the keys and aggregate the results in the form of a new Spark DataFrame

1. Creating the Pandas UDF

A pandas UDF is like any normal Python function. It allows you to perform any operation that you would normally apply to a pandas DataFrame. In our use case, it means we can access Python time series libraries like statsmodels or pmdarima, which are otherwise inaccessible in Spark.


A pandas UDF is initialised within the Spark environment using pandas_udf as a decorator.


Also, before defining the function, it is important to specify the schema of the UDF's output. A StructType object defines the schema of the output DataFrame.


import numpy as np
import pandas as pd
import statsmodels.tsa.api as sm
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Schema of the DataFrame returned by the UDF
schema = StructType([StructField('Store', StringType(), True),
                     StructField('Dept', StringType(), True),
                     StructField('weekly_forecast_1', DoubleType(), True),
                     StructField('weekly_forecast_2', DoubleType(), True)])


# Using the Holt-Winters time series algorithm for forecasting weekly sales
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def holt_winters_time_series_udf(data):
    data.set_index('Date', inplace=True)
    time_series_data = data['Weekly_Sales']

    # Fit an exponential smoothing model with an additive trend
    model_monthly = sm.ExponentialSmoothing(np.asarray(time_series_data), trend='add').fit()

    # Forecast the next two weeks
    forecast_values = pd.Series(model_monthly.forecast(2), name='fitted_values')

    # Return one row per (Store, Dept) group
    return pd.DataFrame({'Store': [str(data.Store.iloc[0])],
                         'Dept': [str(data.Dept.iloc[0])],
                         'weekly_forecast_1': [forecast_values[0]],
                         'weekly_forecast_2': [forecast_values[1]]})

Pandas UDF for time series — an example


2. Aggregate the Results

The next step is to split the Spark DataFrame into groups using DataFrame.groupBy, then apply the UDF to each group. The results are combined into a new Spark DataFrame.


forecasted_spark_df = time_series_data \
    .groupBy(['Store', 'Dept']) \
    .apply(holt_winters_time_series_udf)
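
As a quick usage illustration (not part of the original code), the transformation is lazy, so an action is needed to materialise and inspect the forecasts:

# Trigger the computation and inspect a few forecast rows
forecasted_spark_df.show(5)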

You can find the entire code at https://github.com/maria-alphonsa-thomas/time_series_pyspark_pandas_udf/tree/master


Closing Thoughts

The major benefit of using pandas UDFs is that they allow you to take advantage of Spark's big data processing capabilities with just a few lines of PySpark code. Given that Spark doesn't have an inbuilt time series library, this can be especially useful for data scientists wanting to run time series forecasting across multiple groups.


Word of caution: while a pandas UDF can be used to implement any function, it can lead to out-of-memory exceptions if the group sizes are skewed. This is because all data for a single group is loaded into memory before the function is applied. For most time series forecasting applications this is rarely an issue, since a single forecast doesn't usually take much memory.

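As a quick sanity check before applying the UDF (a minimal sketch; the sort and row count shown are illustrative), you can inspect the group sizes to spot skew:

# Count rows per (Store, Dept) group; unusually large groups may blow up executor memory
group_sizes = time_series_data.groupBy(['Store', 'Dept']).count()
group_sizes.orderBy('count', ascending=False).show(5)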

Apache Spark is also bringing another major integration between pandas and Spark, called Koalas. It is currently in beta, but it could be another game changer. Check it out here: https://koalas.readthedocs.io/en/latest/

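As a minimal sketch of what Koalas offers (assuming the databricks-koalas package is installed; the DataFrame and column names here are illustrative), pandas-style syntax runs on a distributed Spark backend:

import databricks.koalas as ks

# Convert a Spark DataFrame to a Koalas DataFrame and use pandas-style syntax
koalas_df = forecasted_spark_df.to_koalas()
print(koalas_df['weekly_forecast_1'].mean())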

Translated from: https://medium.com/walmartglobaltech/multi-time-series-forecasting-in-spark-cc42be812393
