Chapter 2: Regression -- 04. A Simple Regression Simulation (Part 2), transcript

So Cynthia has been discussing the theory of regression: how we build regression models and how we evaluate them. In this demo I'm going to show you, using some Python code, how we build a linear regression model, and we'll look at some of the general properties of regression models using simulated data, just to keep things simple.
So first off, we do need to simulate some data, and this is very simple. What I'm doing here is stepping over a set of x values and building a line, but I'm going to add some noise to that line, drawn from a normal distribution; I'm using numpy.random.normal to do that. We're going to have the origin of the line at (0, 0) and the end of the line at (10, 10), we're going to use 50 points, and we're going to use a standard deviation of one for that noise. Then we'll just look at the result. Let me run that for you; you can see we've just got some x, y values. That's all there is to it.
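The simulation step described above can be sketched like this; the helper name `sim_line` and the use of NumPy's newer `default_rng` (the video calls `numpy.random.normal` directly) are my own choices, not the notebook's exact code:

```python
import numpy as np

def sim_line(x0=0.0, y0=0.0, x1=10.0, y1=10.0, n=50, sd=1.0, seed=42):
    """Simulate n points along the line from (x0, y0) to (x1, y1),
    adding Gaussian noise with standard deviation sd to the y values."""
    rng = np.random.default_rng(seed)          # seeded for reproducibility
    x = np.linspace(x0, x1, n)
    slope = (y1 - y0) / (x1 - x0)
    y = y0 + slope * (x - x0) + rng.normal(0.0, sd, n)
    return x, y

x, y = sim_line()
print(x[:3], y[:3])                            # a few of the simulated values
```

With sd=1 the points scatter tightly around the true line y = x, which is what the demo's first run shows.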
And we can plot that using standard Python plotting; we're just going to make a scatter plot, nothing special there. And there are those points: you can see they pretty much do fall on a straight line.
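A minimal version of that scatter plot, using matplotlib with a headless backend so it runs anywhere; the filename `scatter.png` is just for this sketch:

```python
import matplotlib
matplotlib.use("Agg")                 # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
x = np.linspace(0, 10, 50)
y = x + np.random.normal(0, 1, 50)    # the line plus noise, as simulated above

fig, ax = plt.subplots()
ax.scatter(x, y)                      # the simulated points
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("scatter.png")
```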
Remember, we're working with a slope of 1, from (0, 0) to (10, 10), but we've added noise, so we'll have what we call residuals when we run our regression model. There are various ways to measure the residuals; in this case we're just going to use three simple metrics: how close we came to getting the intercept at 0, how close we came to getting the slope at 1, and what we call the adjusted r-squared, which is 1 minus the sum of squared residuals over the total sum of squares, scaled by this (N - 1) over (N - 2) factor. That adjusted part is just a bias correction for the squared errors. OK.
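As a worked sketch of that metric, assuming the (N - 1)/(N - 2) bias adjustment for a single predictor (the general form divides by N - p - 1 for p predictors):

```python
import numpy as np

def adjusted_r2(y, y_pred, p=1):
    """Adjusted R-squared: 1 - (SS_res / SS_tot) * (n - 1) / (n - p - 1).
    For simple regression (p = 1) the adjustment is (n - 1)/(n - 2),
    which is the factor described in the video."""
    n = len(y)
    ss_res = np.sum((y - y_pred) ** 2)        # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1.0 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)
```

A perfect fit gives exactly 1, and a model no better than predicting the mean gives a value at or below 0.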
So this code, which is somewhat voluminous, computes a regression model. It uses linear_model from scikit-learn, so we've got linear_model.LinearRegression; we fit that model, we create a prediction from that model, and we do some sorting so we can plot it. One thing to keep in mind, and we saw this before: when you're working with scikit-learn you have to reshape things and set them to a matrix type, so they have to be a numpy matrix, they can't be a pandas dataframe. That's all I'm doing here.
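A sketch of that fitting step is below; the notebook's exact code differs (it also sorts the predictions for plotting). The essential requirement is a 2-D feature array, which is what the reshape provides (current scikit-learn versions do also accept pandas DataFrames):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(1)
x = np.linspace(0, 10, 50)
y = x + np.random.normal(0, 1, 50)     # same kind of simulated data as above

X = x.reshape(-1, 1)                   # scikit-learn wants a 2-D feature array
model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)               # fitted values for plotting

print(model.intercept_, model.coef_[0])
```

The intercept should come out near 0 and the slope near 1, up to the noise.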
Then I create a scatter plot of my x, y data, and over that I plot a line, which is my regression line; you see I use lm_y, which was my predicted values. I'm also going to create a histogram of the residuals. The residuals are just the differences between the value of the line at some x value, that is, the fitted y value, and the original data. So we'll see a histogram of those. And then this is where we compute our adjusted r-squared and display our intercept and slope. So what I'm actually doing is not too complicated, just a few steps.
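The residual computation can be sketched like this; I use `numpy.histogram` rather than a plot so the counts are easy to inspect (the bin count of 10 is my own choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(2)
x = np.linspace(0, 10, 50)
y = x + np.random.normal(0, 1, 50)

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))   # observed minus fitted

# np.histogram returns the same counts plt.hist would draw
counts, edges = np.histogram(residuals, bins=10)
print(counts)
```

Because ordinary least squares with an intercept forces the residuals to sum to zero, the histogram is centered on 0 by construction.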
OK, so remember our intercept should be 0, and it's actually 0.5, pretty close; our slope should be 1, and it's effectively 0.99; and our adjusted r-squared is about 0.84. That means, if you think about the formula up here, that we're explaining more or less eighty-four percent of the original variance of the data with our regression line. So scroll down and look at the plots; it looks pretty good. Here we've got the line that starts almost at (0, 0) and goes to (10, 10). For a point like this one, the residual would be the distance from the point up to the line. And here's the histogram of those residuals, in a pretty tight range from less than -2 to less than 2. It should be roughly a bell-shaped curve, almost normal; given that we only have 50 points it's not quite, but it looks pretty good.
But what happens if we increase the dispersion? Basically, by that I mean we're adding more noise: we're making the standard deviation of the noise's normal distribution much larger, so we expect larger residuals. But what happens to our ability to estimate the intercept and slope? All we're going to do is loop over the same code you just saw, but with standard deviations of one (which is what you just saw), five, and ten, so significant increases in the dispersion of the data.
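That loop over noise levels might look like the following sketch; the helper name `fit_metrics` and the seed are mine, so the printed numbers won't match the video's run:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_metrics(sd, n=50, seed=0):
    """Fit a line to noisy data; return (intercept, slope, adjusted R^2)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 10, n)
    y = x + rng.normal(0, sd, n)                  # same line, more noise
    X = x.reshape(-1, 1)
    model = LinearRegression().fit(X, y)
    r2 = model.score(X, y)                        # ordinary R^2
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)     # bias adjustment
    return model.intercept_, model.coef_[0], adj_r2

for sd in (1, 5, 10):
    b0, b1, ar2 = fit_metrics(sd)
    print(f"sd={sd:2d}  intercept={b0:6.2f}  slope={b1:5.2f}  adj R^2={ar2:5.2f}")
```

As the noise grows the adjusted r-squared drops sharply, which is the pattern the demo reports.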
So let's just look. We start out with an intercept of 0.26 for the first case, but then you see with the higher-dispersion data that it jumps to around 0.24, and then goes back down to around 0.26. The slope bounces around too, around 0.98 and then 1.05, but then it's way off here at 1.25. Notice that what really changes is the adjusted r-squared: it's 0.85, then 0.26, and then 0.26 again. So we're really not explaining much of the variance, only around twenty-six or twenty-seven percent, when we get to those high dispersions. And here's the plot; it looks very much like the one we had before. It's a little bit further off, but that's just the luck of the draw.
We simulate new data each time, so we don't quite go through (0, 0) here. Then we have the histogram of the residuals again, but with the greater dispersion we get larger residuals and larger errors; you can see the histogram now runs from -10 to over 10. And remember, the slope of the line especially was quite different in this case, where we were really explaining very little of the dispersion. The line seems to go about the right way, but the range of errors in the residuals is quite large if you measure those distances down here, and that's reflected in the wide spread of this histogram.
There's just one last thing I'd like to show you with regression here, which is the effect of adding some outliers. The simulation we're going to do is more or less the same as what you saw before, except that we're going to add an outlier at (0, 10), at (0, -10), and at (5, 10). So we're going to do three different regressions, each with one outlier; the standard deviation of our data is otherwise just one, so this is like the first regression we did, but in each case we're adding a single outlier. OK, let me run that for you.
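The outlier experiment could be sketched like this; `fit_with_outlier` is a hypothetical helper, and with a different random seed the exact estimates won't match the video's:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_with_outlier(outlier, n=50, sd=1.0, seed=3):
    """Simulate the usual noisy line, append one outlier point,
    and return the fitted (intercept, slope)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 10, n)
    y = x + rng.normal(0, sd, n)
    x = np.append(x, outlier[0])          # add the single outlier
    y = np.append(y, outlier[1])
    model = LinearRegression().fit(x.reshape(-1, 1), y)
    return model.intercept_, model.coef_[0]

for pt in [(0, 10), (0, -10), (5, 10)]:
    b0, b1 = fit_with_outlier(pt)
    print(pt, b0, b1)
```

An outlier high above the left end of the line pulls the intercept up and the slope down; one far below the left end does the opposite, which is what the three runs in the demo illustrate.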
So, it's a little hard to see, but here's that first outlier at (0, 10). If you look at the intercept, it's now at 0.92; the slope has been affected a fair amount and is 0.95; and the adjusted r-squared is down a little bit, around 0.75. And you can see why: the residuals are all pretty tightly grouped, but wow, we've got that one outlier.
And we've got another outlier here at (0, -10). You see that's pulled the line down and tilted the slope; you can see that in the intercept, now at 0.89, and the slope, around 0.11, and the adjusted r-squared is quite a bit lower yet again, at 0.74. So you see, it's pulled that down.
And here's that one big outlier. In our last case we just have the one outlier in the middle here, so it's pulled the line a little bit this way; you can see that outlier isn't as extreme as the others. If we look at the intercept, it's pretty close to zero, closer to 0 than the other cases, which were almost one; and the slope is also close to one, not a whole lot different from those cases. So basically it's pulled the line up a bit but hasn't affected the slope, and the adjusted r-squared is a respectable 0.82 or 0.81.
So I hope this little video has given you some insight into working with regression models, and also into how they behave as the data changes: with additional dispersion, or with the addition of outliers, you'll see reduced performance from your regression model. That's true not just for these linear models but for a lot of the different models you'll work with.

