Pork Price Prediction
- Support vector machine regression
- Random forest regression
- MLP neural network regression
Background
"Pigs and grain steady the nation": hogs have long occupied a central place in China's economy and people's livelihoods, and pork is an indispensable item in the "vegetable basket" of urban and rural households. Since the African swine fever outbreak in 2018, however, the hog industry has suffered a heavy blow: market prices have fluctuated sharply, causing large economic losses for producers and considerable trouble for consumers. The sudden arrival of COVID-19 in 2020 dealt a further setback to the industry's gradual recovery.
(The choice of indicators here is debatable; this analysis is exploratory.)
Loading the Data
# Package-installation setup
# Set a CRAN mirror with the command below
options(repos = 'http://mirrors.ustc.edu.cn/CRAN/')
# Check that the mirror was changed
getOption('repos')
‘http://mirrors.ustc.edu.cn/CRAN/’
# Install packages if needed
#install.packages('h2o')
# Set the working directory
setwd("D:/LengPY")
# Load the data
library(readxl)
data<- read_excel("liudata.xlsx")
head(data)
日期 | 活猪 | 仔猪 | 去骨牛肉 | 带骨羊肉 | 豆粕 | 小麦麸 | 玉米 | 育肥猪综合饲料 | 非洲猪瘟 | 新冠疫情 |
---|---|---|---|---|---|---|---|---|---|---|
<dttm> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <lgl> | <lgl> |
2006-01-01 | 7.56 | 9.89 | 18.60 | 19.09 | 2.64 | 1.24 | 1.26 | 1.81 | NA | NA |
2006-02-01 | 7.11 | 9.48 | 18.65 | 18.76 | 2.75 | 1.24 | 1.27 | 1.83 | NA | NA |
2006-03-01 | 6.68 | 8.85 | 18.37 | 18.25 | 2.69 | 1.23 | 1.28 | 1.83 | NA | NA |
2006-04-01 | 6.21 | 7.82 | 18.33 | 18.41 | 2.60 | 1.21 | 1.28 | 1.82 | NA | NA |
2006-05-01 | 5.96 | 6.98 | 18.31 | 18.35 | 2.56 | 1.21 | 1.34 | 1.84 | NA | NA |
2006-06-01 | 6.08 | 6.84 | 18.32 | 18.23 | 2.54 | 1.20 | 1.39 | 1.86 | NA | NA |
data1 <- data[,2:9]  # keep the eight numeric price and feed columns
head(data1)
活猪 | 仔猪 | 去骨牛肉 | 带骨羊肉 | 豆粕 | 小麦麸 | 玉米 | 育肥猪综合饲料 |
---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
7.56 | 9.89 | 18.60 | 19.09 | 2.64 | 1.24 | 1.26 | 1.81 |
7.11 | 9.48 | 18.65 | 18.76 | 2.75 | 1.24 | 1.27 | 1.83 |
6.68 | 8.85 | 18.37 | 18.25 | 2.69 | 1.23 | 1.28 | 1.83 |
6.21 | 7.82 | 18.33 | 18.41 | 2.60 | 1.21 | 1.28 | 1.82 |
5.96 | 6.98 | 18.31 | 18.35 | 2.56 | 1.21 | 1.34 | 1.84 |
6.08 | 6.84 | 18.32 | 18.23 | 2.54 | 1.20 | 1.39 | 1.86 |
## Visualize the correlations between the features
library(corrplot)
cor_matrix <- cor(data1)  # avoid masking the base function cor()
corrplot.mixed(cor_matrix, tl.col = "black", tl.pos = "lt",
               tl.cex = 0.8, number.cex = 0.7)
The correlation matrix gives a first look at how the variables relate to one another.
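To complement the plot, the strongest pairwise correlations can also be listed numerically. A minimal sketch (the names `cor_mat`, `pairs`, and `ord` are illustrative):

```r
# List the most strongly correlated feature pairs, largest |r| first
cor_mat <- cor(data1)
pairs <- which(upper.tri(cor_mat), arr.ind = TRUE)   # index each pair once
ord <- order(-abs(cor_mat[pairs]))                   # sort by |correlation|
head(data.frame(var1 = rownames(cor_mat)[pairs[ord, 1]],
                var2 = colnames(cor_mat)[pairs[ord, 2]],
                r    = cor_mat[pairs][ord]))
```

Highly correlated predictor pairs flagged this way are candidates for dropping or combining before modelling.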
1. Support Vector Machine
## Support vector machine regression model
library(e1071)
library(caret)
library(Metrics)
library(readr)
system.time(
svmreg <- svm(活猪~.,data =data1,kernel = "radial")
)
user system elapsed
0.02 0.00 0.02
summary(svmreg)
Call:
svm(formula = 活猪 ~ ., data = data1, kernel = "radial")
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 1
gamma: 0.1428571
epsilon: 0.1
Number of Support Vectors: 111
set.seed(123)
index <- sample(nrow(data1),round(0.7*nrow(data1)))
train_data <- data1[index,]
test_data <-data1[-index,]
# Check the dimensions of the training set
dim(train_data)
- 128
- 8
## Error on the training set
train_pre <- predict(svmreg,train_data)
train_mae <- mae(train_data$活猪,train_pre)
sprintf("Training-set MAE: %f",train_mae)
‘Training-set MAE: 1.007821’
## Error on the test set
test_pre <- predict(svmreg,test_data)
test_mae <- mae(test_data$活猪,test_pre)  # compare against the test labels, not the training labels
sprintf("Test-set MAE: %f",test_mae)
Note that svmreg was fit on the full data1 before the split, so the rows in test_data were seen during training and this error is optimistic; refitting on train_data alone would give an honest held-out estimate.
total_pre <- predict(svmreg,data1[2:8])
total_mse <- mse(data1$活猪,total_pre)
sprintf("MSE on the full dataset: %f",total_mse)
‘MSE on the full dataset: 1.651221’
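The SVM above uses the default cost and gamma; e1071 also offers a small grid search. A sketch that fits on the training split only (the grid values and the names `tuned`/`svmbest` are illustrative):

```r
# Grid-search cost and gamma on the training split, then score the winner
library(e1071)
library(Metrics)
set.seed(123)
tuned <- tune.svm(活猪 ~ ., data = train_data,
                  cost  = c(0.5, 1, 2, 4),
                  gamma = c(0.05, 0.1, 0.2))
svmbest <- tuned$best.model            # model refit with the best parameters
mae(test_data$活猪, predict(svmbest, test_data))
```

Because the search itself only sees train_data, the test-set MAE reported at the end is a fair comparison against the untuned model.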
Output the corresponding predictions:
data5 <- data.frame(total_pre,data1$活猪)
colnames(data5) <- c('Predicted','Actual')
data5
Predicted | Actual | |
---|---|---|
<dbl> | <dbl> | |
1 | 7.081429 | 7.56 |
2 | 7.326892 | 7.11 |
3 | 7.125876 | 6.68 |
4 | 6.871381 | 6.21 |
5 | 6.957324 | 5.96 |
6 | 7.221186 | 6.08 |
7 | 7.400246 | 6.47 |
8 | 7.658352 | 7.17 |
9 | 7.709680 | 7.84 |
10 | 7.599010 | 7.93 |
11 | 7.786543 | 8.33 |
12 | 8.612998 | 9.18 |
13 | 8.892967 | 9.55 |
14 | 9.065229 | 9.20 |
15 | 9.556271 | 8.91 |
16 | 9.566126 | 9.02 |
17 | 10.184165 | 10.20 |
18 | 11.079956 | 11.37 |
19 | 12.172093 | 13.12 |
20 | 13.362894 | 14.27 |
21 | 14.095941 | 13.60 |
22 | 13.871400 | 13.21 |
23 | 14.403343 | 14.13 |
... | ... | ... |
171 | 34.98679 | 35.93 |
172 | 34.07102 | 33.41 |
173 | 34.44107 | 29.50 |
174 | 34.54299 | 30.97 |
175 | 33.63167 | 35.73 |
176 | 32.10181 | 36.50 |
177 | 29.42965 | 35.20 |
178 | 31.75152 | 30.93 |
179 | 32.81464 | 29.23 |
180 | 32.59779 | 32.40 |
181 | 29.80336 | 35.17 |
182 | 29.01780 | 31.43 |
183 | 29.59932 | 27.90 |
2. Random Forest
library(readr)
library(VIM)
library(rpart)
library(rpart.plot)
library(Metrics)
library(ROCR)
library(tidyr)
library(randomForest)
library(ggRandomForests)
set.seed(123)
index <- sample(nrow(data1),round(0.7*nrow(data1)))
train_data <- data1[index,]
test_data <-data1[-index,]
rfreg <- randomForest(活猪~.,data = train_data,ntree=500)
summary(rfreg)
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 128 -none- numeric
mse 500 -none- numeric
rsq 500 -none- numeric
oob.times 128 -none- numeric
importance 7 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 128 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
## Visualize how the OOB error changes as trees are added
par(family = "STKaiti")
plot(rfreg,type = "l",col = "red",main = "Random forest regression")
The OOB error flattens out after roughly 30 trees, so a modest forest already performs well.
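The per-tree OOB error is stored in the fitted object, so the flattening point can also be checked numerically (a sketch; `best_ntree` is an illustrative name):

```r
# rfreg$mse holds the OOB MSE after each successive tree is added
best_ntree <- which.min(rfreg$mse)
best_ntree                 # tree count with the lowest OOB error
rfreg$mse[best_ntree]      # the corresponding OOB MSE
```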
## Visualize the error with the ggRandomForests package
plot(gg_error(rfreg))+labs(title = "Random forest regression")
## Variable importance
importance(rfreg)
IncNodePurity | |
---|---|
仔猪 | 1938.7910 |
去骨牛肉 | 1623.8613 |
带骨羊肉 | 1522.7004 |
豆粕 | 143.6410 |
小麦麸 | 517.8845 |
玉米 | 215.0726 |
育肥猪综合饲料 | 406.2389 |
varImpPlot(rfreg,pch = 20, main = "Importance of Variables")
# Piglet, beef, and mutton prices influence the hog price most. Beef and mutton are substitutes for pork and piglets are an upstream input, so the ranking is plausible.
## Predict on the test set and compute the mean squared error
rfpre <- predict(rfreg,test_data)
sprintf("MSE: %f",mse(test_data$活猪,rfpre))
‘MSE: 1.646604’
## Tune randomForest for the optimal mtry parameter, to train a better model
set.seed(1234)
# note: this search runs on the held-out split; running it on train_data would avoid information leakage
rftune <- tuneRF(x = test_data[,2:8],y = test_data$活猪,
                 stepFactor=1.5,ntreeTry = 500)
mtry = 2 OOB error = 2.785136
Searching left ...
Searching right ...
mtry = 3 OOB error = 2.63388
0.05430821 0.05
mtry = 4 OOB error = 2.531242
0.03896842 0.05
print(rftune)
mtry OOBError
2 2 2.785136
3 3 2.633880
4 4 2.531242
## Over the searched range the OOB error keeps decreasing as mtry grows (2.79 → 2.63 → 2.53), so a larger mtry = 6 is used for the refit below
## 建立优化后的随机森林回归模型
rfregbest <- randomForest(活猪~.,data = train_data,ntree=500,mtry = 6)
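Since the tuneRF search above ran on the held-out split, the same search can be repeated on train_data so that no test information leaks into the parameter choice. A minimal sketch (`rftune_train`, `best_mtry`, and `rfreg_cv` are illustrative names):

```r
# Repeat the mtry search on the training split only
library(randomForest)
set.seed(1234)
rftune_train <- tuneRF(x = train_data[, 2:8], y = train_data$活猪,
                       stepFactor = 1.5, ntreeTry = 500)
# tuneRF returns a matrix with columns "mtry" and "OOBError"
best_mtry <- rftune_train[which.min(rftune_train[, "OOBError"]), "mtry"]
rfreg_cv <- randomForest(活猪 ~ ., data = train_data,
                         ntree = 500, mtry = best_mtry)
```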
## Visualize how the OOB error of both models changes as trees are added
rfregerr <- as.data.frame(plot(rfreg))
colnames(rfregerr) <- "rfregerr"
rfregbesterr <- as.data.frame(plot(rfregbest))
colnames(rfregbesterr) <- "rfregbesterr"
plotrfdata <- cbind.data.frame(rfregerr,rfregbesterr)
plotrfdata$ntree <- 1:nrow(plotrfdata)
plotrfdata <- gather(plotrfdata,key = "Type",value = "Error",1:2)
ggplot(plotrfdata,aes(x = ntree,y = Error))+
    geom_line(aes(linetype = Type,colour = Type),size = 0.9)+
    theme(legend.position = "top")+
    ggtitle("Random forest regression")+
    theme(plot.title = element_text(hjust = 0.5))
## Predict on the test set with the tuned model and compute the mean squared error
rfprebest <- predict(rfregbest,test_data[,2:8])
sprintf("Tuned-model MSE: %f",mse(test_data$活猪,rfprebest))
‘Tuned-model MSE: 1.660115’
## Predict on the full dataset with the tuned model and compute the mean squared error
total <- predict(rfregbest,data1[,2:8])
sprintf("Tuned-model MSE on the full data: %f",mse(data1$活猪,total))
‘Tuned-model MSE on the full data: 0.773364’
# Predictions
totalpre<-data.frame(data1$活猪,total)
totalpre
data1.活猪 | total | |
---|---|---|
<dbl> | <dbl> | |
1 | 7.56 | 7.745717 |
2 | 7.11 | 7.473767 |
3 | 6.68 | 6.802603 |
4 | 6.21 | 6.484864 |
5 | 5.96 | 6.340616 |
6 | 6.08 | 6.334053 |
7 | 6.47 | 6.463621 |
8 | 7.17 | 7.035551 |
9 | 7.84 | 6.900711 |
10 | 7.93 | 6.985457 |
11 | 8.33 | 7.875041 |
12 | 9.18 | 9.065969 |
13 | 9.55 | 9.289838 |
14 | 9.20 | 9.244121 |
15 | 8.91 | 9.259116 |
16 | 9.02 | 9.250085 |
17 | 10.20 | 9.805307 |
18 | 11.37 | 10.590388 |
19 | 13.12 | 13.012034 |
20 | 14.27 | 13.206428 |
21 | 13.60 | 13.403372 |
22 | 13.21 | 13.176563 |
23 | 14.13 | 13.715155 |
24 | 15.46 | 15.510558 |
... | ... | ... |
172 | 33.41 | 33.70117 |
173 | 29.50 | 31.42532 |
174 | 30.97 | 32.10734 |
175 | 35.73 | 34.39926 |
176 | 36.50 | 35.47128 |
177 | 35.20 | 34.27110 |
178 | 30.93 | 32.03528 |
179 | 29.23 | 31.95318 |
180 | 32.40 | 32.44584 |
181 | 35.17 | 33.15130 |
182 | 31.43 | 32.34892 |
183 | 27.90 | 32.95256 |
## Data preparation
index <- order(test_data$活猪)   # sort the test rows by actual price
X <- seq_along(index)            # x-axis position after sorting
Y1 <- test_data$活猪[index]
rfpre2 <- rfpre[index]
rfprebest2 <- rfprebest[index]
plotdata <- data.frame(X = X,Y1 = Y1,rfpre =rfpre2,rfprebest = rfprebest2)
plotdata <- gather(plotdata,key="model",value="value",c(-X))
## Visualize the prediction error of the two models
ggplot(plotdata,aes(x = X,y = value))+
    geom_line(aes(linetype = model,colour = model),size = 0.8)+
    theme(legend.position = c(0.1,0.8),
          plot.title = element_text(hjust = 0.5))+
    ggtitle("Random forest regression")
The predictions track the actual prices well overall, but deviate more during the swine fever period, likely because the direct impact of the epidemic is not quantified in the features; this is a direction for improvement.
3. MLP Neural Network
library(RSNNS)
set.seed(123)
index <- sample(nrow(data1),round(0.7*nrow(data1)))
train_data <- data1[index,]
test_data <-data1[-index,]
dim(train_data)
- 128
- 8
## Min-max normalize the features to the range 0-1
data1[,2:8] <- normalizeData(data1[,2:8],"0_1")
## Normalize the hog price to 0-1 and keep the normalization parameters for later de-normalization
price <- normalizeData(data1$活猪,type = "0_1")
NormParameters <- getNormParameters(price)
head(data1)
活猪 | 仔猪 | 去骨牛肉 | 带骨羊肉 | 豆粕 | 小麦麸 | 玉米 | 育肥猪综合饲料 |
---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
7.56 | 0.029987219 | 0.0046354825 | 0.013367831 | 0.2395437 | 0.07368421 | 0.000000000 | 0.000000000 |
7.11 | 0.025956150 | 0.0053378283 | 0.008624407 | 0.2813688 | 0.07368421 | 0.006451613 | 0.011049724 |
6.68 | 0.019762069 | 0.0014046917 | 0.001293661 | 0.2585551 | 0.06315789 | 0.012903226 | 0.011049724 |
6.21 | 0.009635237 | 0.0008428150 | 0.003593503 | 0.2243346 | 0.04210526 | 0.012903226 | 0.005524862 |
5.96 | 0.001376462 | 0.0005618767 | 0.002731062 | 0.2091255 | 0.04210526 | 0.051612903 | 0.016574586 |
6.08 | 0.000000000 | 0.0007023458 | 0.001006181 | 0.2015209 | 0.03157895 | 0.083870968 | 0.027624309 |
hist(price)
## Split the data
set.seed(123)
datasplist <- splitForTrainingAndTest(data1[,2:8],price,
ratio = 0.3)
## MLP regression model
system.time(
mlpreg <- mlp(datasplist$inputsTrain,datasplist$targetsTrain,maxit = 200,
              size = c(100,50,100,50),learnFunc = "Rprop",
              inputsTest=datasplist$inputsTest,
              targetsTest = datasplist$targetsTest)
)
Warning message in snnsObject$setUnitName(num, iNames[[i]]):
"SNNS error message in setUnitName : SNNS-Kernel Error: Symbol pattern invalid (must match [A-Za-z][^|, ]*)"
user system elapsed
13.86 0.00 14.25
## Visualize the model fit
plotIterativeError(mlpreg,main = "MLP Iterative Error")
plotRegressionError(datasplist$targetsTrain,mlpreg$fitted.values,
main="MLP train fit")
plotRegressionError(datasplist$targetsTest,mlpreg$fittedTestValues,
main="MLP test fit")
## Error on the training set
mlppre_train <- denormalizeData(mlpreg$fitted.values,NormParameters)
mlp_train <- denormalizeData(datasplist$targetsTrain,NormParameters)
train_mae <- mae(mlp_train,mlppre_train)
sprintf("Training-set MAE: %f",train_mae)
‘Training-set MAE: 0.456312’
## Error on the test set
mlppre_test <- denormalizeData(mlpreg$fittedTestValues,NormParameters)
mlp_test <- denormalizeData(datasplist$targetsTest,NormParameters)
test_mae <- mae(mlp_test,mlppre_test)
sprintf("Test-set MAE: %f",test_mae)
‘Test-set MAE: 6.967324’
The test error is large and the model needs improvement, for example by adding the African swine fever and COVID-19 indicators to the feature set, or by combining the model with sequence methods such as RNNs for more accurate results.
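As a first step toward that improvement, the two event columns already present in the raw table could be turned into 0/1 features. A sketch, assuming an NA in 非洲猪瘟 or 新冠疫情 simply means the event had not yet occurred at that date (`raw` and `data2` are illustrative names):

```r
# Encode the epidemic flags as numeric dummies (NA treated as "not yet occurred")
library(readxl)
raw <- read_excel("liudata.xlsx")
raw$非洲猪瘟 <- as.numeric(!is.na(raw$非洲猪瘟) & raw$非洲猪瘟)
raw$新冠疫情 <- as.numeric(!is.na(raw$新冠疫情) & raw$新冠疫情)
data2 <- raw[, 2:11]   # the eight numeric columns plus the two new dummies
head(data2)
```

data2 could then replace data1 in any of the three models above to test whether the epidemic dummies reduce the error during the swine fever period.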