
Linear Regression Simulation to Understand Slope Sensitivity

Introduction

Over the next few minutes, I'll send you on your way to leveraging linear regression for a bit more than explanation or prediction; rather, you'll use it for the sake of inference.

We will leverage simulation for inference in three ways:


  • Understanding model sensitivity
  • p-values
  • confidence intervals

In this post, we'll mostly be exploring the first one. It will be foundational to my next posts, which use simulation to determine p-values and confidence intervals.

Traditional Regression

If you’re not familiar with how linear regression works in general, jump over to this post.


You can jump over here to find posts on different variations of linear regression, from creating them to understanding and explaining them.

Increasing Confidence

Traditionally we use linear regression to assess the relationships among a variety of variables. On top of that assessment, what we are going to learn here is how you can adjust the inputs of various regression models to drive a deeper understanding of the sensitivity, or variability, of the relationship between your explanatory & response variables.

So how might we go about determining the variability of the relationship of two variables?


Think about it like this…


What is the key output of a linear regression? If you guessed a line, then you've got it right! The regression output is effectively the equation of a line, and the slope of that equation serves as the indication of the relationship between X & Y. When seeking to understand the variation of the relationship between response & explanatory variables... it's the slope that we're after. Let's say you ran your linear regression over different samples... the questions we would have are: does our slope vary? How much does it vary? Is it positive sometimes and negative others? And so on.

The Punchline We're After

We've done a bit of exposition to get to the punchline here, but hopefully this serves to give you a solid foundational footing to really understand and use this in practice.

To sum up our introduction, it comes down to this:


We want to understand the variability, and sensitivity to variability, of the relationship between two variables when we vary the sample driving the model.

Let's Get Our First Slope!

The dataset we're working with is a Seattle home prices dataset. I've used this dataset many times before and find it particularly flexible for demonstration. Each record is a home, with details including price, square footage, number of beds, number of baths, and so forth.

Through the course of this post, we'll be trying to explain price as a function of square footage.

There is certainly a lot of exploratory data analysis (EDA) you'd want to engage in before jumping right into this section. There are also certain data prerequisites you'd confirm, but for the sake of this illustration, let's dive in.

# create the log-transformed columns used throughout
housing <- housing %>%
  mutate(sqft_living_log = log(sqft_living),
         price_log = log(price))

fit <- lm(price_log ~ sqft_living_log, data = housing)
summary(fit)

Perfect! We've got ourselves a linear model; let's go ahead and visualize it. Also keep in mind that I have taken the log of both variables to standardize their distributions.
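The slope itself can be pulled straight out of the fitted model; it's simply the coefficient on the explanatory variable:

```r
# The coefficient on sqft_living_log is the slope we care about
coef(fit)["sqft_living_log"]
```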

housing %>%
  mutate(sqft_living_log = log(sqft_living),
         price_log = log(price)) %>%
  ggplot(aes(x = sqft_living_log, y = price_log)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

In this dataset, we’re just working with a sample of 4600 homes. This is not an exhaustive population. As such, we are going to use a certain sampling technique to generate many “perspectives”. Said perspectives will drive how we go about understanding the sensitivity of our response and explanatory variables.


Sampling variability creates difficulty when trying to draw conclusions about an underlying population. These many perspectives or samples of the data we have are how we eliminate the potentially adverse effects of sampling variability.


So above we have one line… but what we need is many lines, for many situations.


What we're going to do next is sample our housing data in smaller groups and give each group its own regression model.

First things first, we are going to use the rep_sample_n function to randomly select a group of 100 homes... we'll repeat that process a total of 100 times.


library(infer)  # provides rep_sample_n()

samples <- housing %>%
  rep_sample_n(size = 100, reps = 100)

Now that we have our samples dataset, let's visualize it much as we did before. Only in this case, we are going to group our visualization by replicate. This is pertinent so that we can distinguish, point to point, which replicate each observation pertains to. As you can see in the above code, there will be 100 replicates of 100 records each.

ggplot(samples, aes(x = sqft_living_log, y = price_log, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

What you'll see above are the various regression lines fit to each of the disparate samples of 100. As you can see, there are cases where slopes are higher or lower. This is the foundation of our being able to understand the range of 'slope' that applies to the underlying population.

As you'd imagine, how we draw our samples is going to vary the amount of variation in slope. Below I've run the same code, but am drawing only 10 random homes for each replicate.
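For reference, the smaller-sample version is the same pipeline with size = 10 (samples_small is just an illustrative name):

```r
# Each replicate now contains only 10 homes, so slopes vary more
samples_small <- housing %>%
  rep_sample_n(size = 10, reps = 100)

ggplot(samples_small, aes(x = sqft_living_log, y = price_log, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```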


So here you have the visualization, but you don’t yet have the actual details of the linear regression itself.


We’re going to need to run a separate regression for each replicate.


Since we already have our generated simulated dataset, we just need to group by replicate; in this case not for the sake of aggregation, but rather to model at the group level. Once we declare our group_by, we leverage the do function to indicate our group action. For the group action, we want to run a separate model for each replicate.

Now what we have is 100 regression outputs.


While there are many relevant pieces of the output, we are targeting the term for our explanatory variable.


Take a look at the code below!


library(broom)  # tidy() turns model output into a tidy data frame

coefs <- samples %>%
  group_by(replicate) %>%
  do(lm(price_log ~ sqft_living_log, data = .) %>%
       tidy()) %>%
  filter(term == "sqft_living_log")

We now have a dataframe with each replicate and the corresponding coefficient for our term of interest.



Let’s take a peek at our distribution of slopes.


ggplot(coefs, aes(x = estimate)) +
  geom_histogram()

We can see a mostly normal distribution. If we ran it with more replicates, it would look smoother.
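As a sketch of that, the same pipeline with reps = 1000 rather than 100 would produce a smoother histogram:

```r
# More replicates -> a smoother sampling distribution of the slope
housing %>%
  rep_sample_n(size = 100, reps = 1000) %>%
  group_by(replicate) %>%
  do(tidy(lm(price_log ~ sqft_living_log, data = .))) %>%
  filter(term == "sqft_living_log") %>%
  ggplot(aes(x = estimate)) +
  geom_histogram(bins = 30)
```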

One thing for you to keep in mind: I'm not suggesting that every time you run a linear regression, you need to arbitrarily run 100 of them over various samples of your data. For many business applications, your data may be representative of the entire population. But even in cases when you don't have the entire population, the purposes of the two approaches are different. Here we are leveraging simulation and many linear regression models to eventually make inferential claims about the underlying population. It still makes sense to leverage linear regression in other formats for things like modeling for explanation/description, or prediction.

Variation in Slope

As we seek to understand the distribution of slope coefficients, it can be very helpful to vary the data that supports that distribution. As displayed above, altering the sample size of each replicate lends greater understanding of how slope variation shrinks or grows with different samples.

Another thing that will drive greater variation in our slope is a reduction in variation in our explanatory variable. It may come as a bit of a surprise, but with a broader range of explanatory datapoints, our model has more information with which to explain the relationship.
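To make the first effect concrete, here's a sketch of a hypothetical helper, slope_sd, that reruns the pipeline for a given replicate size and returns the spread of the resulting slope estimates (it assumes the housing data and the infer/broom/dplyr functions loaded above):

```r
# Hypothetical helper: how spread out are the slopes for replicates of size n?
slope_sd <- function(n, reps = 100) {
  housing %>%
    rep_sample_n(size = n, reps = reps) %>%
    group_by(replicate) %>%
    do(tidy(lm(price_log ~ sqft_living_log, data = .))) %>%
    filter(term == "sqft_living_log") %>%
    pull(estimate) %>%
    sd()
}

# Expect the spread to shrink as each replicate grows
slope_sd(10)
slope_sd(100)
```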

Conclusion

We have done a lot in a short amount of time. It's easy to get lost when dissecting statistics concepts like inference. My hope is that you now have a strong foundational understanding of the need for, and execution of, simulation to better understand the relationship between our response and explanatory variables.

If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!


Translated from: https://towardsdatascience.com/linear-regression-simulation-to-understand-slope-sensitivity-ab6887d45fe1
