Kaggle in Practice: Store Sales - Time Series Forecasting

(A beginner's write-up; experienced readers, please go easy on me.)

Data Overview


train.csv

The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.

  • store_nbr identifies the store at which the products are sold.
  • family identifies the type of product sold.
  • sales gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
  • onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.

test.csv

The test data, having the same features as the training data. You will predict the target sales for the dates in this file.

The dates in the test data are for the 15 days after the last date in the training data.

sample_submission.csv

A sample submission file in the correct format.

stores.csv

Store metadata, including city, state, type, and cluster.

cluster is a grouping of similar stores.

oil.csv

Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

holidays_events.csv

Holidays and Events, with metadata

  • NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was actually celebrated, look for the corresponding row where type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. (A small pandas sketch of this transferred/Transfer pairing follows the list below.)
  • Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
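
To make the transferred/Transfer pairing concrete, here is a minimal pandas sketch that lists each transferred holiday next to the date it was actually celebrated. The column names come from the file description above; stripping a 'Traslado ' prefix to match descriptions is an assumption about how the Transfer rows are named, not something stated here.

import pandas as pd

holidays = pd.read_csv('store-sales-time-series-forecasting/holidays_events.csv')

# Holidays whose official calendar day was moved elsewhere (they behave like normal days)
moved = holidays[holidays['transferred'].astype(str) == 'True']
moved = moved[['date', 'description']].rename(columns={'date': 'official_date'})

# Rows of type 'Transfer' mark the dates those holidays were actually celebrated on.
# Stripping the 'Traslado ' prefix so the descriptions line up is an assumption.
celebrated = holidays[holidays['type'] == 'Transfer']
celebrated = pd.DataFrame({
    'celebrated_date': celebrated['date'],
    'description': celebrated['description'].str.replace('Traslado ', '', regex=False),
})

# One row per transferred holiday, with the day it was actually celebrated
pairs = moved.merge(celebrated, on='description', how='left')
print(pairs)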

Additional Notes

  • Wages in the public sector are paid every two weeks on the 15th and on the last day of the month. Supermarket sales could be affected by this.
  • A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts donating water and other first need products which greatly affected supermarket sales for several weeks after the earthquake.

Summary

To briefly summarize, the data currently available is as follows:

In train.csv

Each row has these features:

  • store_nbr: the store where the products are sold
  • family: the product family (type of product)
  • sales: the total sales of that product family at that store on that date
  • onpromotion: the number of items of that product family on promotion at that store on that date

In stores.csv

Each store has the following features:

  • city: the city the store is located in
  • state: the state the store is located in
  • type: the store type
  • cluster: a grouping of similar stores (though it is not stated in what sense they are similar…)

In oil.csv

The daily oil price, dcoilwtico.

In holidays_events.csv

  • date: the date
  • type: the type, e.g. Holiday, Event, and so on
  • locale: the scope of the event: national, local, etc.
  • locale_name: the place the scope refers to; Ecuador for national events, or the specific locality for local holidays
  • description: the description, i.e. the name of the holiday
  • transferred: whether the holiday was moved to another date; if so, this calendar day behaves more like an ordinary day and matters less

In transactions.csv

The daily number of transactions at each store.

Other files

  • test.csv: once the model is trained, these are the rows to generate predictions for

  • sample_submission.csv: an example of the required submission format

Data Cleaning and Feature Engineering

First, read in the data:

import pandas as pd

def ReadInData(file):
    # All of the competition CSV files live in this local folder
    path = 'store-sales-time-series-forecasting/'
    return pd.read_csv(path + file)

df_holidays_events = ReadInData('holidays_events.csv')
df_oil = ReadInData('oil.csv')
df_stores = ReadInData('stores.csv')
df_train = ReadInData('train.csv')
df_test = ReadInData('test.csv')
df_transactions = ReadInData('transactions.csv')

Handling Missing Values

The dcoilwtico column in oil.csv has some missing values. First, plot the oil price over time:

import matplotlib.pyplot as plt

def DrawLine(X_data, Y_data):
    # Quick helper to draw a single line plot
    fig = plt.figure()
    axes = fig.add_subplot(1, 1, 1)
    axes.plot(X_data, Y_data)
    plt.show()

DrawLine(df_oil['date'], df_oil['dcoilwtico'])

The price turns out to fluctuate quite a bit.
Filling the gaps with the mean therefore would not be very meaningful; instead, forward-fill each missing value with the previous day's price:

df_oil_withoutNA = df_oil.fillna(method="pad")
DrawLine(df_oil_withoutNA['date'], df_oil_withoutNA['dcoilwtico'])
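
A side note: fillna(method="pad") is deprecated in recent pandas releases. A small equivalent sketch with the current API (the interpolation line is only an alternative idea, not part of this pipeline):

# Equivalent forward fill with the non-deprecated pandas API; the trailing bfill()
# covers the edge case of a missing value at the very start, which ffill cannot fill
df_oil_withoutNA = df_oil.ffill().bfill()

# Alternative idea (not used here): linear interpolation between the known prices
# df_oil_interp = df_oil.assign(dcoilwtico=df_oil['dcoilwtico'].interpolate())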


Merging in Features

Looking around, the other files contain several features that can be merged into the training set (train.csv) and the test set (test.csv):

  • city, state, type and cluster from stores.csv
  • dcoilwtico from oil.csv
  • locale, locale_name, description and transferred from holidays_events.csv
  • the daily transactions from transactions.csv

So first rename the columns in the other files to keep their origin clear, then join everything with merge:

def Add_Feature():
    # Rename the holiday and store columns so their origin stays obvious after the merges
    df_holidays_events.rename(columns={'type': 'Daily_holiday_type',
                                       'locale': 'Daily_holiday_locale',
                                       'locale_name': 'Daily_holiday_locale_name',
                                       'description': 'Daily_holiday_description',
                                       'transferred': 'Daily_holiday_transferred'},
                              inplace=True)
    df_stores.rename(columns={'city': 'stores_city',
                              'state': 'store_state',
                              'type': 'store_type',
                              'cluster': 'store_cluster'},
                     inplace=True)
    # Note: without inplace=True this rename has no effect, so the column keeps its
    # original name 'transactions' (which the rest of the code relies on)
    df_transactions.rename(columns={'transactions': 'Daily_transactions'})
    # Left-join every auxiliary table onto the train and test sets
    DfTrainNew = pd.merge(df_train, df_holidays_events, how='left', on='date')
    DfTestNew = pd.merge(df_test, df_holidays_events, how='left', on='date')
    DfTrainNew = pd.merge(DfTrainNew, df_oil_withoutNA, how='left', on='date')
    DfTestNew = pd.merge(DfTestNew, df_oil_withoutNA, how='left', on='date')
    DfTrainNew = pd.merge(DfTrainNew, df_stores, how='left', on='store_nbr')
    DfTestNew = pd.merge(DfTestNew, df_stores, how='left', on='store_nbr')
    DfTrainNew = pd.merge(DfTrainNew, df_transactions, how='left', on=['date', 'store_nbr'])
    DfTestNew = pd.merge(DfTestNew, df_transactions, how='left', on=['date', 'store_nbr'])
    return DfTrainNew, DfTestNew


res = Add_Feature()
df_train_New = res[0]
df_test_New = res[1]

After merging, count the non-NA values in each column to check how the merge went:

def LookIn(DF_In):
    # Report, for every column, the share of rows that are not NA
    print("length:{}".format(len(DF_In)))
    for i in DF_In.columns:
        a = DF_In[i].describe()
        print("Name:{} Rate:{}%".format(i, 100 * a['count'] / len(DF_In[i])))


print("df_train_New:>>>>>>")
LookIn(df_train_New)
print("df_test_New:>>>>>>")
LookIn(df_test_New)
print("df_test:>>>>>>")
LookIn(df_test)
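
Incidentally, the same coverage numbers can be obtained in one line with notna().mean(); a tiny equivalent sketch:

# Share of non-NA values per column, as a percentage (same information as LookIn)
print((df_train_New.notna().mean() * 100).round(2))
print((df_test_New.notna().mean() * 100).round(2))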

The result is a little surprising:

df_train_New:>>>>>>
length:3054348
Name:id Rate:100.0%
Name:date Rate:100.0%
Name:store_nbr Rate:100.0%
Name:family Rate:100.0%
Name:sales Rate:100.0%
Name:onpromotion Rate:100.0%
Name:Daily_holiday_type Rate:16.45274212368728%
Name:Daily_holiday_locale Rate:16.45274212368728%
Name:Daily_holiday_locale_name Rate:16.45274212368728%
Name:Daily_holiday_description Rate:16.45274212368728%
Name:Daily_holiday_transferred Rate:16.45274212368728%
Name:dcoilwtico Rate:71.17852975495916%
Name:stores_city Rate:100.0%
Name:store_state Rate:100.0%
Name:store_type Rate:100.0%
Name:store_cluster Rate:100.0%
Name:transactions Rate:91.84385669216475%
df_test_New:>>>>>>
length:28512
Name:id Rate:100.0%
Name:date Rate:100.0%
Name:store_nbr Rate:100.0%
Name:family Rate:100.0%
Name:onpromotion Rate:100.0%
Name:Daily_holiday_type Rate:6.25%
Name:Daily_holiday_locale Rate:6.25%
Name:Daily_holiday_locale_name Rate:6.25%
Name:Daily_holiday_description Rate:6.25%
Name:Daily_holiday_transferred Rate:6.25%
Name:dcoilwtico Rate:75.0%
Name:stores_city Rate:100.0%
Name:store_state Rate:100.0%
Name:store_type Rate:100.0%
Name:store_cluster Rate:100.0%
Name:transactions Rate:0.0%
df_test:>>>>>>
length:28512
Name:id Rate:100.0%
Name:date Rate:100.0%
Name:store_nbr Rate:100.0%
Name:family Rate:100.0%
Name:onpromotion Rate:100.0%

It is easy to see that after merging in transactions, none of the rows in the test set ends up with a transactions value (the rate is 0%), so this feature probably should not be used this way.

Looking more closely at the training data then revealed a bigger problem: after adding the features, the number of rows in the training set had grown. It took an entire afternoon to figure out that a single date can have several holidays, so merging on date duplicates those rows.

Rather frustrating. For now, just crudely deduplicate; a better approach can come later:

df_train_New.drop_duplicates(subset='id', keep='first', inplace=True)
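
A cleaner alternative (only a sketch, not applied here) would be to deduplicate holidays_events by date before the merge, so the left join on date can never multiply rows in the first place; which holiday to keep when several share a date is an arbitrary choice:

# Keep one holiday row per date before the merge (keeping the first one is arbitrary),
# so that a left join on 'date' cannot increase the number of training rows
df_holidays_unique = df_holidays_events.sort_values('date').drop_duplicates(subset='date', keep='first')
DfTrainNew = pd.merge(df_train, df_holidays_unique, how='left', on='date')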

There is also the problem of a large number of missing values. For now, simply drop those rows:

df_train_New.dropna(axis=0, inplace=True)
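
Dropping every row with a missing value throws away a lot of data, since most dates are simply not holidays at all. Another option (again just a sketch, not applied here) would be to treat "no holiday" as its own category and fill rather than drop:

# Treat 'no holiday' as an explicit category instead of discarding those rows;
# the holiday columns are NaN on ordinary days, which is not really missing data
holiday_cols = ['Daily_holiday_type', 'Daily_holiday_locale', 'Daily_holiday_locale_name',
                'Daily_holiday_description', 'Daily_holiday_transferred']
df_train_New[holiday_cols] = df_train_New[holiday_cols].fillna('None')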

Next, we can try building a model first and iterate from there.

But before modeling, the data needs to be mapped to numbers; otherwise it cannot be fed into a machine-learning model.

Data Mapping

def PreWork():
    # Repeat the cleanup, in case the cells above were not run
    df_train_New.dropna(axis=0, inplace=True)
    df_train_New.drop_duplicates(subset='id', keep='first', inplace=True)
    # Turn 'YYYY-MM-DD' date strings into integers of the form YYYYMMDD
    df_train_New['date'] = df_train_New['date'].apply(lambda X: int(str(X).split('-')[0] +
                                                                    str(X).split('-')[1] +
                                                                    str(X).split('-')[2]))
    # Encode every categorical column as integer codes with pd.factorize
    df_train_New['family'] = pd.factorize(df_train_New['family'])[0].astype(int)
    df_train_New['Daily_holiday_type'] = pd.factorize(df_train_New['Daily_holiday_type'])[0].astype(int)
    df_train_New['Daily_holiday_locale'] = pd.factorize(df_train_New['Daily_holiday_locale'])[0].astype(int)
    df_train_New['Daily_holiday_locale_name'] = pd.factorize(df_train_New['Daily_holiday_locale_name'])[0].astype(int)
    df_train_New['Daily_holiday_description'] = pd.factorize(df_train_New['Daily_holiday_description'])[0].astype(int)
    df_train_New['Daily_holiday_transferred'] = pd.factorize(df_train_New['Daily_holiday_transferred'])[0].astype(int)
    df_train_New['stores_city'] = pd.factorize(df_train_New['stores_city'])[0].astype(int)
    df_train_New['store_state'] = pd.factorize(df_train_New['store_state'])[0].astype(int)
    df_train_New['store_type'] = pd.factorize(df_train_New['store_type'])[0].astype(int)


PreWork()
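
One thing to keep in mind: df_test_New still contains raw date strings, unencoded categories and NaNs at this point, so it cannot go straight into the model in the forecasting step below. A minimal sketch of mirroring the mapping on the test set (coding the test columns independently here is a simplification; sharing one code table between train and test would be more careful):

def PreWorkTest():
    # Same date conversion as in PreWork: 'YYYY-MM-DD' -> integer YYYYMMDD
    df_test_New['date'] = df_test_New['date'].apply(lambda X: int(str(X).replace('-', '')))
    # Integer-code the categorical columns; pd.factorize maps NaN to -1, which also
    # takes care of the rows without any holiday information
    for col in ['family', 'Daily_holiday_type', 'Daily_holiday_locale',
                'Daily_holiday_locale_name', 'Daily_holiday_description',
                'Daily_holiday_transferred', 'stores_city', 'store_state', 'store_type']:
        df_test_New[col] = pd.factorize(df_test_New[col])[0].astype(int)
    # dcoilwtico is missing on non-trading days in the test period; forward-fill it
    df_test_New['dcoilwtico'] = df_test_New['dcoilwtico'].ffill()
    # transactions is entirely missing in the test set (see the coverage check above),
    # so it has to be filled or dropped from the feature list
    df_test_New['transactions'] = df_test_New['transactions'].fillna(0)


PreWorkTest()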

Modeling

The problem provides a lot of features for each record and asks us to predict the sales of another set of records. Given that shape, a decision tree or a random forest seems worth trying first. As for the time-series forecasting the title mentions… still a bit lost on that for now.

Try both models on the training set and evaluate the error with the mean absolute error (MAE):

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error


def BuildDecisionTree(x_F, Y_F, DataLog):
    # Hold out part of the training data, fit a decision tree, and report the MAE
    x_ = DataLog[x_F]
    Y_ = DataLog[Y_F]
    model = DecisionTreeRegressor(random_state=1, max_depth=100)
    a_x, b_x, a_y, b_y = train_test_split(x_, Y_, random_state=1)
    model.fit(a_x, a_y)
    predictions = model.predict(b_x)
    delta = mean_absolute_error(b_y, predictions)
    print("DecisionTree:mean_absolute_error delta:{}".format(delta))


def BuildRandomForest(x_F, Y_F, DataLog):
    # Same procedure with a random forest
    x_ = DataLog[x_F]
    Y_ = DataLog[Y_F]
    model = RandomForestRegressor(random_state=1, max_depth=100)
    a_x, b_x, a_y, b_y = train_test_split(x_, Y_, random_state=1)
    model.fit(a_x, a_y)
    predictions = model.predict(b_x)
    delta = mean_absolute_error(b_y, predictions)
    print("RandomForest:mean_absolute_error delta:{}".format(delta))


X_Feature = ['id', 'date', 'store_nbr', 'family', 'onpromotion',
             'Daily_holiday_type', 'Daily_holiday_locale',
             'Daily_holiday_locale_name', 'Daily_holiday_description',
             'Daily_holiday_transferred', 'dcoilwtico', 'stores_city', 'store_state',
             'store_type', 'store_cluster', 'transactions']
Y_Feature = 'sales'
BuildRandomForest(X_Feature, Y_Feature, df_train_New)
BuildDecisionTree(X_Feature, Y_Feature, df_train_New)

As expected, the random forest is somewhat more accurate:

RandomForest:mean_absolute_error delta:73.51907510982585
DecisionTree:mean_absolute_error delta:97.08605686543322
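
One caveat about the metric: the competition leaderboard is scored with root mean squared logarithmic error (RMSLE) rather than MAE, so it may be worth tracking that as well. A small helper sketch:

import numpy as np
from sklearn.metrics import mean_squared_log_error


def Rmsle(y_true, y_pred):
    # Root mean squared logarithmic error; negative predictions are clipped to 0
    # so that the logarithm is defined
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))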

So use the random forest model to predict the test set:


def Forecast():
    # Refit the random forest on the full training set and predict the test set
    # (df_test_New must already be encoded and NaN-free at this point, see above)
    model = RandomForestRegressor(random_state=1, max_depth=100)
    model.fit(df_train_New[X_Feature], df_train_New[Y_Feature])
    Aim = df_test_New[X_Feature]
    predictions = model.predict(Aim)
    res = pd.DataFrame(predictions)
    path = "store-sales-time-series-forecasting/submission.csv"
    res.to_csv(path)


Forecast()

Then adjust the index and columns of the prediction file into the submission format and upload it to Kaggle:
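
For reference, a sketch of writing the file directly in the id/sales layout that sample_submission.csv uses (this assumes Forecast() is changed to return its predictions; it is not how the file was produced above):

def WriteSubmission(predictions):
    # sample_submission.csv has exactly two columns: id and sales
    submission = pd.DataFrame({'id': df_test_New['id'], 'sales': predictions})
    submission.to_csv('store-sales-time-series-forecasting/submission.csv', index=False)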

Woo-hoo!

To be continued.
