机器学习训练营(入群联系qq:2279055353)—— 机器学习案例详解的直播互动平台
下期直播案例预告:大数据预测商品的销售量波动趋势
案例介绍
像沃尔玛、家乐福这样的超市零售商,使用销售预报系统和工具补充商品。一套完善的预报系统有助于赢得其它的供应链渠道。如果你擅长预测商品未来一段时间的销量,你就能合理地安排你的库存盘点和分类。
该案例要求根据提供的2013年至2017年商店-商品的销售数据,使用时间序列技术,预测10家不同的商店的50种不同的商品在未来3个月的销售量。
-
数据来源:Kaggle 竞赛
-
代码实现:R 语言
数据文件
训练集train.csv
与检验集test.csv
. 请到以下百度网盘下载:
链接:https://pan.baidu.com/s/1j2FCvikgR2aPkfW2bDtLsA 密码:mmby
加载R包
首先,加载必需的R包。
rm(list=ls())
suppressMessages(library(data.table))
suppressMessages(library(DT))
suppressMessages(library(timeSeries))
suppressMessages(library(tidyverse))
suppressMessages(library(reshape))
suppressMessages(library(stringr))
suppressMessages(library(doBy))
suppressMessages(library(formattable))
suppressMessages(library(gridExtra))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(corrplot))
suppressMessages(library(wesanderson))
suppressMessages(library(RColorBrewer))
suppressMessages(library(gridExtra))
suppressMessages(library(zoo))
suppressMessages(library(forecast))
suppressMessages(library(prophet))
set.seed(2018)
加载数据集
train=fread("e:/kaggle_exercises/forecast/input/train.csv")
sprintf("The train data set has %d rows and %d columns", nrow(train), ncol(train) )
str(train)
test <- fread("e:/kaggle_exercises/forecast/input/test.csv")
sprintf("The test data set has %d rows and %d columns", nrow(test), ncol(test) )
str(test)
print("the summary of train sales is:")
summary(train$sales)
从变量date
里提取“年、月”变量
# Extraction of Year and Month of Year :
train$Year=year(train$date) #returns the year from date i.e. 2013, 2014 etc.
train$Month=as.yearmon(train$date) #this yearmon() function is coming from zoo package returns the month of an year i.e Jan 2013, Feb 2015 etc
head(train)
缺失值检查
colSums(is.na(train))
# Function 1 : For ploting missing value
plot_missing <- function(data, title = NULL, ggtheme = theme_gray(), theme_config = list("legend.position" = c("bottom"))) {
## Declare variable first to pass R CMD check
feature <- num_missing <- pct_missing <- group <- NULL
## Check if input is data.table
is_data_table <- is.data.table(data)
## Detect input data class
data_class <- class(data)
## Set data to data.table
if (!is_data_table) data <- data.table(data)
## Extract missing value distribution
missing_value <- data.table(
"feature" = names(data),
"num_missing" = sapply(data, function(x) {sum(is.na(x))})
)
missing_value[, feature := factor(feature, levels = feature[order(-rank(num_missing))])]
missing_value[, pct_missing := num_missing / nrow(data)]
missing_value[pct_missing < 0.05, group := "Good"]
missing_value[pct_missing >= 0.05 & pct_missing < 0.4, group := "OK"]
missing_value[pct_missing >= 0.4 & pct_missing < 0.8, group := "Bad"]
missing_value[pct_missing >= 0.8, group := "Remove"][]
## Set data class back to original
if (!is_data_table) class(missing_value) <- data_class
## Create ggplot object
output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
scale_fill_manual("Group", values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"), breaks = c("Good", "OK", "Bad", "Remove")) +
scale_y_continuous(labels = comma) +
coord_flip() +
xlab("Features") + ylab("Number of missing rows") +
ggtitle(title) +
ggtheme + theme_linedraw()+
do.call(theme, theme_config)
## Print plot
print(outpu