Kaggle Competition Project: Rossmann Sales Forecasting, Top 1%

Latest recommended article published on 2023-12-12 13:28:23

aicanghai_smile

Tags: Kaggle, sales forecasting, xgboost, time series, top1%

Article link: https://blog.csdn.net/aicanghai_smile/article/details/80987666


# coding: utf-8

# Development environment: Windows 10, Anaconda 3.5, Jupyter Notebook, Python 3.6
# Libraries: numpy, pandas, matplotlib, seaborn, xgboost, time
# Run time: about 8 h on an i7-6700HQ CPU

# Project name: Rossmann sales forecasting

# 1. Data analysis

# In[1]:


# Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import xgboost as xgb
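from time import time
from IPython.display import display  # explicit import so display() also works outside a notebook cell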


# In[2]:


# Load the data
train = pd.read_csv('train.csv',parse_dates=[2])
test = pd.read_csv('test.csv',parse_dates=[3])
store = pd.read_csv('store.csv')


# In[3]:


# Inspect the training set (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
pd.concat([train.head(), train.tail()])


# In[4]:


# Inspect the test set
pd.concat([test.head(), test.tail()])


# In[5]:


# Inspect the store information
pd.concat([store.head(), store.tail()])


# In[6]:


# Count the missing values in each table
display(train.isnull().sum(),test.isnull().sum(),store.isnull().sum())


# In[7]:


# Missing-value analysis
# Missing values in the test set
test[pd.isnull(test.Open)]


# - All of the missing values come from store 622, on Monday through Saturday with no holidays, so we assume the store was open as usual on those days.

# In[8]:


# Missing values in the store table
store[pd.isnull(store.CompetitionDistance)]


# In[9]:


store[pd.isnull(store.CompetitionOpenSinceMonth)].head(10)


# In[10]:


# Check whether the Promo2-related fields are missing because the store does not take part in Promo2
NoPW = store[pd.isnull(store.Promo2SinceWeek)]
NoPW[NoPW.Promo2 != 0].shape


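# - The reason the competitor fields are missing is unknown, and there are quite a few such rows. We can fill them with the median or with 0; later experiments showed that filling with 0 works better.
# - The promotion fields are missing because those stores do not take part in the promotion, so we fill them with 0.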
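
# The check behind the second point can be sketched on a small made-up table. This is only an illustration under stated assumptions: toy_store and its values are hypothetical and merely mimic the Promo2 columns of store.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for store.csv (values are made up for illustration).
toy_store = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Promo2": [0, 1, 0, 1],
    "Promo2SinceWeek": [np.nan, 13.0, np.nan, 40.0],
    "Promo2SinceYear": [np.nan, 2010.0, np.nan, 2014.0],
})

missing = toy_store["Promo2SinceWeek"].isnull()

# Rows where the field is missing although the store does run Promo2.
unexplained = toy_store[missing & (toy_store["Promo2"] != 0)]

# Rows where the store skips Promo2 but the field is still filled in.
inconsistent = toy_store[~missing & (toy_store["Promo2"] == 0)]

print(len(unexplained), len(inconsistent))
```

# If both counts are zero, the gaps are fully explained by Promo2 == 0, and a constant such as 0 is a reasonable placeholder.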

# In[11]:


# Analyze how a store's sales change over time
strain = train[train.Sales > 0]
strain.loc[strain['Store'] == 1, ['Date', 'Sales']] \
    .plot(x='Date', y='Sales', title='Store1', figsize=(16, 4))


# In[12]:


# Analyze the store's sales from June through September 2014
strain = train[train.Sales > 0]
strain.loc[strain['Store'] == 1, ['Date', 'Sales']] \
    .plot(x='Date', y='Sales', title='Store1', figsize=(8, 2), xlim=['2014-6-1', '2014-7-31'])
strain.loc[strain['Store'] == 1, ['Date', 'Sales']] \
    .plot(x='Date', y='Sales', title='Store1', figsize=(8, 2), xlim=['2014-8-1', '2014-9-30'])


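# - The plots above show that the store's sales vary periodically. Within a year, sales in November and December are higher than in other months, possibly because of seasonal effects or promotions.
# - In addition, the 2014 data shows that the sales trend in June and July resembles the one in August and September. The 6 weeks we need to predict fall in August and September 2015, so we can use the most recent 6 weeks of data, in June and July 2015, as a hold-out set for tuning and validating the model.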
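
# One way to check the seasonal claim numerically, rather than by eye, is to average sales per calendar month. The sketch below runs on synthetic data, not the Rossmann data: the frame toy and its November/December bump are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for one store (NOT the Rossmann data): a flat
# baseline with a bump in November and December.
dates = pd.date_range("2014-01-01", "2014-12-31", freq="D")
sales = np.where(dates.month >= 11, 6000.0, 5000.0)
toy = pd.DataFrame({"Date": dates, "Sales": sales})

# Average sales per calendar month (index 1..12).
monthly_mean = toy.groupby(toy["Date"].dt.month)["Sales"].mean()
print(monthly_mean)
```

# On the real data the same groupby could be applied to the rows of a single store with Sales > 0.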

# 2. Data preprocessing

# In[13]:


# Missing-value handling
# Fill the missing Open values in test with 1, i.e. the store is open
test.fillna(1, inplace=True)
#store['CompetitionDistance'].fillna(store['CompetitionDistance'].median(), inplace = True)
#store['CompetitionOpenSinceYear'].fillna(store['CompetitionDistance'].median(), inplace = True)
#store['CompetitionOpenSinceMonth'].fillna(store['CompetitionDistance'].median(), inplace = True)

# Most missing values in store relate to competitors and promotions. In our experiments, median imputation of the competitor fields did not work well, so we fill everything with 0 here
store.fillna(0, inplace=True)
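
# A blanket fillna fills every column with the same value. A more targeted variant, sketched here on made-up frames, names the columns explicitly. toy_test and toy_store are hypothetical, and this is an alternative formulation, not the original notebook's code.

```python
import numpy as np
import pandas as pd

toy_test = pd.DataFrame({
    "Store": [622, 622, 1],
    "Open": [np.nan, 1.0, 0.0],
    "Promo": [1, 0, 0],
})
toy_store = pd.DataFrame({
    "Store": [1, 622],
    "CompetitionDistance": [1270.0, np.nan],
    "Promo2SinceWeek": [np.nan, 13.0],
})

# Fill only the column that is known to have gaps.
toy_test["Open"] = toy_test["Open"].fillna(1)

# A dict maps each column to its own fill value.
toy_store = toy_store.fillna({"CompetitionDistance": 0, "Promo2SinceWeek": 0})

print(toy_test["Open"].tolist())
print(toy_store.isnull().sum().sum())
```

# Naming the columns keeps an unexpected gap in some other column visible instead of silently overwriting it.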


# In[14]:


# Check whether any missing values remain
display(train.isnull().sum(),test.isnull().sum(),store.isnull().sum())


# In[15]:


# Merge the store information into train and test
train = pd.merge(train, store, on='Store')
test = pd.merge(test, store, on='Store')
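
# The merge relies on store having exactly one row per Store id. A defensive variant, sketched on made-up frames, states that expectation through the how and validate parameters of pd.merge. toy_train and toy_store are hypothetical and the values are placeholders.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Store": [1, 2, 1, 2],
    "Sales": [5263, 6064, 5020, 4782],
})
toy_store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
})

# how="left" keeps every row of toy_train in its original order;
# validate="many_to_one" raises a MergeError if a Store id appears
# more than once in toy_store.
merged = pd.merge(toy_train, toy_store, on="Store",
                  how="left", validate="many_to_one")

print(merged)
```

# With a left join the row count of the result always equals that of the left table, which makes an accidental duplication or loss of rows easy to spot.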


# In[16]:


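# Hold out the most recent 6 weeks as the hold-out set for testing
# 6*7*1115 rows = 6 weeks x 7 days x 1115 stores (assumes one row per store per day in that period)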
train = train.sort_values(['Date'],ascending = False)
ho_test = train[:6*7*1115]
ho_train = train[6*7*1115:]
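
# The row-count slice works when every store has a row on every day of the hold-out period. A date-based split, sketched below on a made-up frame, does not depend on that. The names toy, cutoff, ho_test_toy and ho_train_toy are hypothetical, and this is an alternative to the slicing above rather than the original method.

```python
import pandas as pd

# Toy frame: 2 stores, 60 consecutive days (values are placeholders).
dates = pd.date_range("2015-06-02", periods=60, freq="D")
toy = pd.DataFrame({
    "Store": [s for s in (1, 2) for _ in dates],
    "Date": list(dates) * 2,
})

# The last 6 weeks, counted back from the most recent date.
cutoff = toy["Date"].max() - pd.Timedelta(weeks=6)
ho_test_toy = toy[toy["Date"] > cutoff]
ho_train_toy = toy[toy["Date"] <= cutoff]

print(len(ho_test_toy), len(ho_train_toy))
```

# Here the hold-out covers 42 distinct dates for each store, regardless of how many rows each store contributes.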


# In[17]:


# Records with zero sales are not counted in the score, so we train only on rows where the store is open and sales are greater than 0
ho_test = ho_tes
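
# The scoring rule mentioned above can be sketched as follows. This assumes the competition metric is the root mean square percentage error (RMSPE) computed only over rows with nonzero true sales. The helper name rmspe and the sample numbers are hypothetical and are not taken from the original notebook.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean square percentage error over rows with nonzero true sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # zero-sales rows are excluded from the score
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

# The third row has zero true sales, so its prediction does not matter.
score = rmspe([100.0, 200.0, 0.0], [110.0, 180.0, 50.0])
print(score)
```

# Because such rows contribute nothing to the metric, dropping them from training keeps the model focused on the rows that are actually scored.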
