Chapter 2. Regression -- 04. A Simple Regression Simulation (Part 1), Translated

Hi. So Cynthia has been discussing the theory of regression in general, and linear regression in particular, and she's also talked a bit about how we evaluate regression models. In this sequence we're going to use some R code to create simulated data and play around with some different scenarios for regression models, so you can get a feel for how regression models work in practice.
The first thing we have to do is simulate our data. I'm going to create a data frame here, and I've created a noise vector from a random normal with mean 0 and a standard deviation given by this argument. We're going to create n points, x and y, and then we add w, the noise vector, to the y. We're going to run our line from (0, 0) to (10, 10) in x and y, with 50 points and a standard deviation of 1. So let me run that little bit there, and you see my x values and my y values, as advertised.
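The demo itself is in R; as a rough Python sketch of the same simulation step (the function name `sim_reg_data` and the fixed seed are my own assumptions, not from the original code):

```python
import numpy as np

def sim_reg_data(x1, y1, x2, y2, n, sd, seed=42):
    """Simulate n points on the line from (x1, y1) to (x2, y2),
    then add the noise vector w ~ Normal(0, sd) to the y values."""
    rng = np.random.default_rng(seed)
    x = np.linspace(x1, x2, n)
    w = rng.normal(0.0, sd, n)        # noise: mean 0, given standard deviation
    y = np.linspace(y1, y2, n) + w    # y = line + noise
    return x, y

# The case from the transcript: line from (0, 0) to (10, 10), 50 points, sd = 1.
x, y = sim_reg_data(0, 0, 10, 10, n=50, sd=1)
print(x[:3], y[:3])
```

With a small standard deviation the points should hug the line y = x, which is what the quick plot below shows.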
We can just do a quick ggplot2 plot here to show x versus y, and you see those points pretty much fall on a straight line.
So let's compute and evaluate a regression model. We're going to evaluate it in a very simple manner: we'll just look at the slope, which we know should be 1 based on how we synthesized the data; the intercept, which we know should be 0, again based on how we synthesized the data; and one other statistic that Cynthia has talked about, the adjusted r-squared. R-squared is just 1 minus the sum of squared residuals over the total sum of squares, so if your regression model explains none of the variance in your data set, r-squared is 0; if you explain everything, a perfect fit, r-squared approaches 1. The "adjusted" part is just this (n - 1)/(n - 2) factor, which is a bias adjustment for the sum of squared errors.
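To make that statistic concrete, here is a small Python sketch of the computation (the helper name `adj_r_squared` is mine; the transcript's (n - 1)/(n - 2) adjustment corresponds to one predictor plus an intercept):

```python
import numpy as np

def adj_r_squared(y, y_hat, n_params=2):
    """Adjusted R-squared: 1 - (SS_res / SS_tot) * (n - 1) / (n - n_params).
    With one predictor plus an intercept (n_params = 2), the factor is
    (n - 1) / (n - 2), the bias adjustment described above."""
    n = len(y)
    ss_res = np.sum((y - y_hat) ** 2)        # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - n_params)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(adj_r_squared(y, y))  # a perfect fit gives exactly 1.0
```

Note that a model no better than predicting the mean has SS_res equal to SS_tot, so the adjusted value can dip below 0, unlike plain r-squared.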
OK, so I'm going to run this code here. We're going to create a regression model in R using the lm function, which fits a general linear model, and we're going to model y versus x. We'll use the predict method on that model to score it, and we'll compute the residuals, which are nothing but the actual y values minus the predicted scores. Then we'll make a plot of the data with the fitted line on it, a histogram of those residuals, and finally we'll print the intercept, the slope, and the adjusted r-squared. Let me run that.
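The fit-and-score step uses R's `lm` and `predict`; a rough Python equivalent (substituting numpy's `polyfit` for `lm`, with illustrative data in place of the original data frame) looks like this:

```python
import numpy as np

# Simulate data as before: y = x plus unit-variance noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x + rng.normal(0, 1, 50)

# Fit a degree-1 least-squares polynomial, i.e. an ordinary linear regression.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept   # predicted values ("scores")
residuals = y - y_hat           # actual y minus predicted y

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Because the fit includes an intercept, the residuals sum to zero by construction; their spread is what the histogram in the demo visualizes.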
First off, the intercept is pretty close to 0, at 0.22 or 0.23, and the slope is really close to 1.00. We can look at those plots, and you see the line really looks good: it's really close to the theoretical line from our simulation through (0, 0). The residuals, which are just the vertical distance from each point down to the line, are generally pretty small. And here's the distribution of the residuals. There are only 50 points, so you're not going to get much of a bell-shaped curve, but you can see they fall in a pretty narrow range, about -2 to 2. So generally everything is looking pretty well-behaved. But what if we increase the dispersion of this data, that is, we increase how much that noise jumps around?
So let's do that. We just did a case where the standard deviation of the noise was 1, so now we're going to do 1, 5, and 10. We'll just loop over the same functions we used to generate the data, create the regression model, and plot the results; we're just going to do that three times.
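That loop over noise levels can be sketched in Python like so (again a stand-in for the R code; the helper `fit_and_score` is hypothetical and bundles the simulate, fit, and adjusted r-squared steps from above):

```python
import numpy as np

def fit_and_score(sd, n=50, seed=1):
    """Simulate y = x + Normal(0, sd) noise, fit a line, and return
    the slope, intercept, and adjusted r-squared of the fit."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 10, n)
    y = x + rng.normal(0, sd, n)
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (ss_res / ss_tot) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2

# The three dispersion scenarios from the demo.
for sd in (1, 5, 10):
    slope, intercept, adj_r2 = fit_and_score(sd)
    print(f"sd={sd:2d}: slope={slope:.2f} intercept={intercept:+.2f} adj R2={adj_r2:.2f}")
```

The exact numbers depend on the random draw, but the pattern matches the demo: adjusted r-squared falls steadily as the noise standard deviation grows.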
In this first case you can see the intercept is fairly close, the slope is almost spot-on, and the adjusted r-squared is 82%, so we're explaining eighty-two percent of the variance. It looks very much like our first run; we've got this one point that just by chance came out here, but overall the residuals are pretty nicely behaved, and it's a pretty nicely behaved regression line.

Now we've increased the dispersion of the data to a standard deviation of 5. The intercept, instead of being 0, is now about 0.3; the slope is 0.93 instead of 1; and the adjusted r-squared has dropped a lot, to only forty percent, so we're only explaining forty percent of the variance in the data with our model. The plot shows it: some of these residuals, if you measure that vertical distance, are really quite large on both the positive and the negative side. And look at the dispersion in the histogram, from about minus seven or eight up to maybe nine. So, much larger residuals and poorer performance.

Now let's take this to an extreme, with a standard deviation of 10. The intercept is no longer really close to zero, it's about -0.8; the slope is not that close to 1, it's 0.15; and the adjusted r-squared, at 0.13, is basically telling us this model doesn't predict anything. The residuals span almost the same range as the data, from about -25 to +25, and there's our line. With very large dispersion around the slope and intercept of that line, the regression model just can't handle it, and that's not peculiar to linear regression; it's true of any
regression. So, one last case I'd like to show you uses some outliers. We're going to go back to a standard deviation of 1 here, and we're going to add an outlier at (0, 10), then at (0, -10), and then at (5, 10). We'll just loop through those three cases and see what each one does to the behavior of our regression line.
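A minimal Python sketch of that outlier experiment (the helper `fit_with_outlier` is hypothetical; it appends one outlier point to otherwise well-behaved data and refits the line):

```python
import numpy as np

def fit_with_outlier(ox, oy, n=50, sd=1, seed=2):
    """Simulate y = x + Normal(0, sd) noise, append one outlier at
    (ox, oy), refit the least-squares line, and return slope and intercept."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 10, n)
    y = x + rng.normal(0, sd, n)
    x = np.append(x, ox)   # outlier x coordinate
    y = np.append(y, oy)   # outlier y coordinate
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# The three outlier cases from the demo.
for ox, oy in [(0, 10), (0, -10), (5, 10)]:
    slope, intercept = fit_with_outlier(ox, oy)
    print(f"outlier ({ox:2d},{oy:3d}): slope={slope:.2f} intercept={intercept:+.2f}")
```

As the demo goes on to show, an outlier near the end of the x range levers the fitted line much more than one in the middle.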
In our first case you can see the regression line, and the outlier there, and you see the intercept has been pulled up to almost 0.5; the slope is a bit less, about 0.94; and the adjusted r-squared is a bit less than we originally had. That's because we have this one very large residual: when we plot the histogram, everything else looks pretty well-behaved, but that one point has clearly affected our performance.

Now we have an outlier down here, and you can see it in the histogram. The intercept has moved again, but now it's been pulled down by about 0.5; the slope is pretty close to 1; and the adjusted r-squared is still not too bad at seventy percent, but it's a little off, and it's because of this one big residual.

Finally, we put the outlier here, at 5 on the x axis and 10 on the y. It's pulled the intercept up, the slope is still pretty close to 1, so our slope has only changed a little bit, and the adjusted r-squared is actually pretty high. You can see that this residual isn't as severe, because the outlier sits in the middle of the x range rather than at the end like the others.
So I hope this little demo has given you a feel for the behavior of regression models in general, using just this simple simulated data, and a little bit of insight into how we translate those concepts into linear regression models in particular.

