Study Notes - Smoothers and Generalised Additive Models

These notes work through linear regression, polynomial regression, spline regression, and generalised additive models (GAMs). When modelling the Boston data set, linear and polynomial models underfit the non-linear relationship, whereas a GAM with spline smoothing captures the pattern in the data more accurately, lowering the root mean squared error and raising R2. GAMs are particularly convenient because they select the smoothness of the smooth terms automatically, avoiding the need to specify knots by hand.

Preliminaries

In this practical we will do some model checking and model choice in R.
We need the following packages

  • ggplot2 - Package to implement the ggplot language for graphics in R.
  • tidyverse - This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step
  • MASS - Functions and datasets to support Venables and Ripley, “Modern Applied Statistics with S”(4th edition, 2002).
  • caret - For easy machine learning workflow
  • splines - Regression spline functions and classes (e.g. bs() and ns()) used to construct spline bases.
  • mgcv - Generalized additive (mixed) models, some of their extensions and other generalized ridge regression, with multiple smoothing parameter estimation by (Restricted) Marginal Likelihood, Generalized Cross Validation and others.

Make sure that these packages are downloaded and installed in R. We use the require() function to load
them into the R session. Note, this does the same job as library() here.
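
A minimal sketch of why require() can be handy: it returns TRUE or FALSE (and only warns) when a package is missing, whereas library() stops with an error, so it can be used to install anything that is not yet available.

# Minimal sketch: install any listed package that is not yet available.
# require() returns FALSE with a warning for a missing package, whereas
# library() would stop with an error.
for (pkg in c("ggplot2", "tidyverse", "MASS", "caret", "splines", "mgcv")) {
    if (!require(pkg, character.only = TRUE)) {
        install.packages(pkg)
        library(pkg, character.only = TRUE)
    }
}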

We will use the Boston data set in the MASS package to predict the median house value (medv), in Boston
suburbs, based on the explanatory variable lstat (percentage of the population of lower status).

We want to build some models and then assess how well they do. For this we are going to randomly split
the data into a training set (80%, for building each predictive model) and a test set (20%, for evaluating the
model).

As we work through the models we will calculate the usual metrics of model fit, e.g. R2 and RMSE, on the
test data set, i.e. we will see how well each model does at predicting 'new' data (out-of-sample validation).
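
For reference, a minimal sketch of what these two metrics compute, assuming pred and obs are hypothetical vectors of predicted and observed medv values in the test set (caret's RMSE() and R2() helpers, used below, report the same kind of quantities, with R2() based on the squared correlation by default):

# Minimal sketch of the two out-of-sample metrics
# (pred and obs are hypothetical vectors of predicted and observed values).
rmse_manual <- function(pred, obs) sqrt(mean((obs - pred)^2)) # root mean squared error
r2_manual <- function(pred, obs) cor(obs, pred)^2             # squared correlation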

options(warn = -1) # suppress warnings (this also removes the red 'invalid input' messages in Jupyter)
require(ggplot2)
require(MASS)
require(caret)
require(splines)
require(tidyverse)
require(mgcv)
require(splines2)
# load the data
data("Boston")
head(Boston)
A data.frame: 6 × 14
      crim    zn indus  chas   nox    rm   age    dis   rad   tax ptratio  black lstat medv
     <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl> <int> <dbl>   <dbl>  <dbl> <dbl> <dbl>
1  0.00632    18  2.31     0 0.538 6.575  65.2 4.0900     1   296    15.3 396.90  4.98 24.0
2  0.02731     0  7.07     0 0.469 6.421  78.9 4.9671     2   242    17.8 396.90  9.14 21.6
3  0.02729     0  7.07     0 0.469 7.185  61.1 4.9671     2   242    17.8 392.83  4.03 34.7
4  0.03237     0  2.18     0 0.458 6.998  45.8 6.0622     3   222    18.7 394.63  2.94 33.4
5  0.06905     0  2.18     0 0.458 7.147  54.2 6.0622     3   222    18.7 396.90  5.33 36.2
6  0.02985     0  2.18     0 0.458 6.430  58.7 6.0622     3   222    18.7 394.12  5.21 28.7
# Split the data into training and test sets
set.seed(123)
# createDataPartition() creates the split: the outcome is Boston$medv, p = 0.8 keeps 80% of the
# rows for training, and list = FALSE returns the selected row indices as a matrix rather than a list.
training.samples <- Boston$medv %>%
    createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
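
As an optional sanity check, the row counts of the two sets should reflect the (roughly) 80/20 split of the 506 observations:

# Optional check: roughly 80% of the rows should be in the training set
nrow(train.data)
nrow(test.data)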

First, let's have a look at the relationship between the two variables.

ggplot(train.data, aes(x = lstat, y = medv)) +
    geom_point() +
    geom_smooth(method = 'loess', formula = y ~ x)

[Figure: scatter plot of medv against lstat in the training data, with a loess smooth]

This suggests a non-linear relationship between the two variables.

Linear regression

The standard linear regression model equation can be written as $medv = \beta_0 + \beta_1 \cdot lstat$.

# Fit the model
model1 <- lm(medv ~ lstat, data = train.data)
summary(model1)

Call:
lm(formula = medv ~ lstat, data = train.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.218  -4.011  -1.123   2.025  24.459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  34.6527     0.6230   55.62   <2e-16 ***
lstat        -0.9561     0.0428  -22.34   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.144 on 405 degrees of freedom
Multiple R-squared:  0.5521,	Adjusted R-squared:  0.551 
F-statistic: 499.2 on 1 and 405 DF,  p-value: < 2.2e-16
# Make predictions on the test data
predictions <- model1 %>%
    predict(test.data)
# Model performance on the test set
model1_performance <- data.frame(
    RMSE = RMSE(predictions, test.data$medv), # root mean squared error of the predictions
    # R-squared is the proportion of the total variation in the response explained by the model;
    # it lies between 0 and 1, and values closer to 1 indicate a better fit.
    R2 = R2(predictions, test.data$medv)
)
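
Printing the data frame shows the out-of-sample RMSE and R2 for the linear model; we can compare these values against those of the more flexible models fitted later.

# Out-of-sample performance of the simple linear model
model1_performance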