钻石与价格预测

最新推荐文章于 2024-01-18 12:13:41 发布

木南花

最新推荐文章于 2024-01-18 12:13:41 发布

阅读量4.4k

点赞数 2

分类专栏： R

本文链接：https://blog.csdn.net/weixin_41100555/article/details/80033035

版权

R 专栏收录该内容

40 篇文章

订阅专栏

1、price和carat的散点图

library(ggplot2)
data('diamonds')
ggplot(aes(x=carat,y=price),data=diamonds)+
  geom_point()+
  xlim(0,quantile(diamonds$carat,0.99))+
  ylim(0,quantile(diamonds$price,0.99))

这里写图片描述
注意：1、xlim和ylim可以使用scale_x_continuous和scale_y_continuous.
2、为了图像好看，可以加参数，color以及shape

ggplot(aes(x=carat,y=price),data=diamonds)+
  geom_point(fill=I('#f79420'),color=I('black'),shape=21)+
  scale_x_continuous(lim=c(0,quantile(diamonds$carat,0.99)))+
  scale_y_continuous(lim=c(0,quantile(diamonds$price,0.99)))

这里写图片描述
看到以上图形的时候，我们可以看到非线性关系，可能是指数关系或者其他的什么。可以看出关系的偏差和变化，也会随着克拉的增加而增加。只是快速观察数据，我们就可以发现价格和克拉大小之间函数关系的重要内容。
我们可以向这幅图中添加一条线性的裁切线，方法是使用统计平滑函数，方法等于lm。

ggplot(aes(x=carat,y=price),data=diamonds)+
  geom_point(fill=I('#f79420'),color=I('black'),shape=21)+
  scale_x_continuous(lim=c(0,quantile(diamonds$carat,0.99)))+
  scale_y_continuous(lim=c(0,quantile(diamonds$price,0.99)))+
  stat_smooth(method='lm')

这里写图片描述
我们可以看到线性的趋势图，在一些关键区域不会穿过数据的中心，如果我们尝试用这个来做预测，我们可能会在所显示的现有数据内部和外部错过一些关键区域。
2、了解钻石的历史
你卖过钻石吗？
De Beers 的钻石垄断
3、ggpairs函数
使用GG配对函数对主要变量间进行绘图。这个函数以成对的方式绘制每个变量之间的关系图。
首先确定安装一下包。

install.packages('GGally')  #使用GGally来作图
install.packages('scales')  #使用标度来表现很多东西
install.packages('memisc')  #用于汇总递归
install.packages('lattice')  #用于其他方面
install.packages('MASS')  #MASS用于各种函数
install.packages('car')  #car用于重写变量代码
install.packages('reshape')  #reshape改造在改造，整理你的数据
install.packages('plyr')  #plyr用于创造有意思的汇总以及传输

然后加载一下程序包

library(ggplot2)
library(GGally)
library(scales)
library(memisc)

设计随机种子数，然后从众选择10000的样本。

set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))

ggpairs 函数的“params”参数用于改变在图形矩阵中绘制的点的形状，以便更容易查看这些点。GGally 1.0 改变了这些绘图参数的语法，使它们不再是 params 参数的一部分，而是让用户能如下所示指定这些参数：

ggpairs(diamond_samp,
lower = list(continuous = wrap(“points”, shape = I(‘.’))),
upper = list(combo = wrap(“box”, outlier.shape = I(‘.’))))
lower 和upper控制这个矩阵图的下半部分和上半部分的样式

set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, 
  lower = list(continuous = wrap("points", shape = I('.'))), 
  upper = list(combo = wrap("box", outlier.shape = I('.'))))

这里写图片描述
我们发现对价格影响最大的还是克拉数。
4、对钻石的需求
对数转换
在需求侧，需要价格低尺寸小的用户，较比较富有的客户，可能对价格更加敏感.比较价格和价格对数的直方图

install.packages('gridExtra')
library(gridExtra)
plot1 <- qplot(x=price,data=diamonds) + 
  ggtitle('Price')

plot2 <- qplot(x=log10(price),data=diamonds) +
  ggtitle('Price (log10)')

grid.arrange(plot1,plot2)

注意：使用qplot之前要先运行ggplot2包
这里写图片描述
5、将需求和价格分布联系起来
看上面的两幅图，特别注意log10的峰值处。
看第一幅图发现，砖石的价格可能被严重扭曲，但是如果把这个价格放到常用对数标度，就好看多了，更加接近于正态分布的钟形曲线，我们在这个常用对数的标度上甚至可以看到一点双峰性的迹象，这符合我们对钻石消费者特点的富人卖家和穷人买家两个类别推测
6、散点图转换
现在我们对变量以及钻石的整体需求有了更好的了解，我们来重新绘制数据图。

> qplot(x=carat,y=log10(price),data=diamonds)+
+     ggtitle('Price (log10) by Carat')

等价代码：

qplot(x=carat,y=price,data=diamonds)+
  scale_y_continuous(trans = log10_trans())+
  ggtitle('Price (log10) by Carat')

这里写图片描述
这时候，价格在克拉大小和价格的高端离散较小，但是其实我们可以做的更好。
我们可以尝试克拉的立方根，这样的话，我们首先需要一个函数来进行立方根计算。

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

这里写图片描述
由此可以看出，图形基本上呈线性相关。

7、复习过度绘制

> head(sort(table(diamonds$carat),decreasing = T))

 0.3 0.31 1.01  0.7 0.32    1 
2604 2249 2242 1981 1840 1558 
> head(sort(table(diamonds$price),decreasing = T))

605 802 625 828 776 698 
132 127 126 125 124 121

看上面的数据，我们可以看到有很多的个数高值，这样会使图形重叠在一起。

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha=0.5,size=0.75,position = 'jitter') + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

这里写图片描述

8、除了克拉还会有其他因素影响价格，考虑-净度，calary

install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = carat, y = price,color=clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')

这里写图片描述
scale_colour_brewer
ColorBrewer 调色板和安全色
 Legends
从上面图形可以看出，净度与价格相关。

9、价格与克拉和切工

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  #注意title后面的cut要加引号
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and cut')

这里写图片描述

10、价格与克拉和颜色

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'color', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and color')

这里写图片描述
钻石科普：颜色

11、构建线性模型
在r中我们可以使用lm函数创建模型，我们需要按照y~x格式提供一个公式。
R 中的线性模型和运算符

> m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
> m2 <- update(m1, ~ . + carat)
> m3 <- update(m2, ~ . + cut)
> m4 <- update(m3, ~ . + color)
> m5 <- update(m4, ~ . + clarity)
> mtable(m1, m2, m3, m4, m5)

Calls:
m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
    data = diamonds)
m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
    clarity, data = diamonds)

============================================================================================
                       m1             m2             m3             m4            m5        
--------------------------------------------------------------------------------------------
  (Intercept)          2.821***       1.039***       0.874***      0.932***       0.415***  
                      (0.006)        (0.019)        (0.019)       (0.017)        (0.010)    
  I(carat^(1/3))       5.558***       8.568***       8.703***      8.438***       **9.144*****  
                      (0.007)        (0.032)        (0.031)       (0.028)        (0.016)    
  carat                              -1.137***      -1.163***     -0.992***      **-1.093*****  
                                     (0.012)        (0.011)       (0.010)        (0.006)    
  cut: .L                                            0.224***      0.224***       0.120***  
                                                    (0.004)       (0.004)        (0.002)    
  cut: .Q                                           -0.062***     -0.062***      -0.031***  
                                                    (0.004)       (0.003)        (0.002)    
  cut: .C                                            0.051***      0.052***       0.014***  
                                                    (0.003)       (0.003)        (0.002)    
  cut: ^4                                            0.018***      0.018***      -0.002     
                                                    (0.003)       (0.002)        (0.001)    
  color: .L                                                       -0.373***      -0.441***  
                                                                  (0.003)        (0.002)    
  color: .Q                                                       -0.129***      -0.093***  
                                                                  (0.003)        (0.002)    
  color: .C                                                        0.001         -0.013***  
                                                                  (0.003)        (0.002)    
  color: ^4                                                        0.029***       0.012***  
                                                                  (0.003)        (0.002)    
  color: ^5                                                       -0.016***      -0.003*    
                                                                  (0.003)        (0.001)    
  color: ^6                                                       -0.023***       0.001     
                                                                  (0.002)        (0.001)    
  clarity: .L                                                                     0.907***  
                                                                                 (0.003)    
  clarity: .Q                                                                    -0.240***  
                                                                                 (0.003)    
  clarity: .C                                                                     0.131***  
                                                                                 (0.003)    
  clarity: ^4                                                                    -0.063***  
                                                                                 (0.002)    
  clarity: ^5                                                                     0.026***  
                                                                                 (0.002)    
  clarity: ^6                                                                    -0.002     
                                                                                 (0.002)    
  clarity: ^7                                                                     0.032***  
                                                                                 (0.001)    
--------------------------------------------------------------------------------------------
  R-squared            0.924          0.935          0.939         0.951          0.984     
  adj. R-squared       0.924          0.935          0.939         0.951          0.984     
  sigma                0.280          0.259          0.250         0.224          0.129     
  F               652012.063     387489.366     138654.523     87959.467     173791.084     
  p                    0.000          0.000          0.000         0.000          0.000     
  Log-likelihood   -7962.499      -3631.319      -1837.416      4235.240      34091.272     
  Deviance          4242.831       3613.360       3380.837      2699.212        892.214     
  AIC              15930.999       7270.637       3690.832     -8442.481     -68140.544     
  BIC              15957.685       7306.220       3761.997     -8317.942     -67953.736     
  N                53940          53940          53940         53940          53940         
============================================================================================

这里写图片描述

当实际中用该模型进行预测时，发现预测价格适中偏低，原因是数据采用的是2018年的数据，而且期间受到经济的影响，价格偏低。而之后钻石市场发展良好，可以加上钻石价格的增长率进行预测，而此时我们采用购买钻石的新婚夫妇可以进行预测。而同时不同克拉的价格增长并不均匀，意味着我们最初的模型，不能简单的用通胀进行调整。

让我们将模型放到更大的环境中进行讨论。假设数据没有遭到损坏，而且我们没有过度违反线性回归的一些关键假设（例如违反 IID 假设，即数据集中有大量重复的观测值），那些，该模型会有哪些问题？在使用该模型时，还有什么是我们应该考虑的？
不要急着回答这个问题，对钻石市场做一些定性研究。查看以下链接，开始研究吧。
近几年的钻石价格s
2013年全球钻石报告
如果你想了解更多关于线性模型以及如何解释回归系数的信息，请参阅以下文章。
R Bloggers 上的解释 R 中的回归系数
Analysis Factor 博客上的解释回归系数
 拟合与解释线性模型（作者：ŷhat）
Stats StackExchange 上的线性模型中因子系数的另一种解释
 供给下降与需求增长：上海夫妇喜爱钻戒

12、采用更大更好的数据集
安装对应数据包

install.packages('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')

你可以从 https://github.com/solomonm/diamonds-data 下载数据集。点击 BigDiamonds.Rda 链接，然后点击“原始数据”按钮开始下载。下载完成后，你就可以通过命令 load(“BigDiamonds.rda”) 加载数据了。

> getwd()
[1] "C:/Users/Administrator/Documents"
> setwd('C:\\Users\\Administrator\\Downloads')
> load("BigDiamonds.Rda")

suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))

diamondsbig$logprice <- log(diamondsbig$price)
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = subset(diamondsbig,(diamondsbig$cert=="GIA")&(diamondsbig$price<=10000)))/m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$cert=="GIA" & diamondsbig$price<10000,])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
Calls:
m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = subset(diamondsbig, 
    (diamondsbig$cert == "GIA") & (diamondsbig$price <= 10000)))
m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = subset(diamondsbig, 
    (diamondsbig$cert == "GIA") & (diamondsbig$price <= 10000)))
m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = subset(diamondsbig, 
    (diamondsbig$cert == "GIA") & (diamondsbig$price <= 10000)))
m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
    data = subset(diamondsbig, (diamondsbig$cert == "GIA") & 
        (diamondsbig$price <= 10000)))
m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
    clarity, data = subset(diamondsbig, (diamondsbig$cert == 
    "GIA") & (diamondsbig$price <= 10000)))

================================================================================================
                        m1              m2             m3             m4              m5        
------------------------------------------------------------------------------------------------
  (Intercept)           2.670***        1.333***       0.949***       0.528***       -0.464***  
                       (0.003)         (0.012)        (0.012)        (0.010)         (0.009)    
  I(carat^(1/3))        5.839***        8.243***       8.633***       8.111***        8.321***  
                       (0.004)         (0.022)        (0.021)        (0.017)         (0.012)    
  carat                                -1.061***      -1.223***      -0.782***       -0.763***  
                                       (0.009)        (0.009)        (0.007)         (0.005)    
  cut: V.Good                                          0.120***       0.090***        0.071***  
                                                      (0.002)        (0.001)         (0.001)    
  cut: Ideal                                           0.211***       0.181***        0.132***  
                                                      (0.002)        (0.001)         (0.001)    
  color: K/L                                                          0.123***        0.117***  
                                                                     (0.004)         (0.003)    
  color: J/L                                                          0.312***        0.318***  
                                                                     (0.003)         (0.002)    
  color: I/L                                                          0.451***        0.469***  
                                                                     (0.003)         (0.002)    
  color: H/L                                                          0.569***        0.602***  
                                                                     (0.003)         (0.002)    
  color: G/L                                                          0.633***        0.665***  
                                                                     (0.003)         (0.002)    
  color: F/L                                                          0.687***        0.723***  
                                                                     (0.003)         (0.002)    
  color: E/L                                                          0.729***        0.757***  
                                                                     (0.003)         (0.002)    
  color: D/L                                                          0.812***        0.827***  
                                                                     (0.003)         (0.002)    
  clarity: I1                                                                         0.301***  
                                                                                     (0.006)    
  clarity: SI2                                                                        0.607***  
                                                                                     (0.006)    
  clarity: SI1                                                                        0.727***  
                                                                                     (0.006)    
  clarity: VS2                                                                        0.836***  
                                                                                     (0.006)    
  clarity: VS1                                                                        0.891***  
                                                                                     (0.006)    
  clarity: VVS2                                                                       0.935***  
                                                                                     (0.006)    
  clarity: VVS1                                                                       0.995***  
                                                                                     (0.006)    
  clarity: IF                                                                         1.052***  
                                                                                     (0.006)    
------------------------------------------------------------------------------------------------
  R-squared             0.889           0.892          0.899          0.937           0.969     
  adj. R-squared        0.889           0.892          0.899          0.937           0.969     
  sigma                 0.289           0.284          0.275          0.216           0.154     
  F               2701226.664     1406702.591     754499.134     423369.679      521229.735     
  p                     0.000           0.000          0.000          0.000           0.000     
  Log-likelihood   -60161.015      -54019.480     -43361.147      37816.844      154117.575     
  Deviance          28303.752       27296.479      25632.748      15877.442        7994.026     
  AIC              120328.031      108046.959      86734.294     -75605.688     -308191.151     
  BIC              120360.232      108089.894      86798.696     -75455.417     -307955.010     
  N                338968          338968         338968         338968          338968         
================================================================================================

13、预测

> thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
+                          color = "I", clarity="VS1")
> modelEstimate = predict(m5, newdata = thisDiamond,
+                         interval="prediction", level = .95)
> modelEstimate
       fit      lwr      upr
1 8.525284 8.224276 8.826292
> exp(modelEstimate)
       fit      lwr      upr
1 5040.617 3730.418 6810.984