Learning Statistics | 1.1 Simple Linear Regression

sunwx98

Last modified 2024-07-24 17:03:10

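Tags: data analysis

First published 2024-07-24 15:34:39

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/sunwx98/article/details/140664558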

Companion video tutorial: ISLP [EP013]

Residual Sum of Squares (RSS)

The total squared discrepancy between the actual outcomes and the fitted values.

$RSS=e_{1}^{2}+e_{2}^{2}+\cdots+e_{n}^{2}$

$e_{i}=y_{i}-\hat{y}_{i}$ , where $\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}$

$e_{i}$ is the residual: the discrepancy between the actual outcome and the fit (the predicted outcome).

The Least Squares Approach

The least squares line is the line that minimizes $\sum e_{i}^{2}$ , that is, the RSS.

$\hat{\beta}_{1}$ is the slope; $\hat{\beta}_{0}$ is the intercept.
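To make the least squares idea concrete, here is a minimal sketch in plain Python. It uses the standard closed-form estimates (slope = sum of cross-deviations divided by the sum of squared x-deviations, intercept = mean of y minus slope times mean of x). The data and the helper names `fit_line` and `rss` are made up for illustration and do not come from the video.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares of the line intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
best = rss(x, y, b0, b1)
# Any other line, for example one with a slightly different slope, has a larger RSS.
perturbed = rss(x, y, b0, b1 + 0.1)
print(b0, b1, best, perturbed)
```

Nudging the slope away from the fitted value always increases the RSS, which is exactly what "least squares" means.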

Assessing the Accuracy of the Coefficient Estimates (Confidence Intervals)

Standard error of the slope and intercept

$SE(\hat{\beta}_{1})^{2}=\frac{\sigma^2}{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}$

$\sigma^2=Var(\varepsilon)$ is the variance of the errors (the noise variance).

To reduce the SE, increase $\sum_{i=1}^{n}(x_{i}-\bar{x})^2$ , which means the x values should be more spread out.

*Use of the SE, part 1: computing confidence intervals

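3-sigma rule: roughly 68%, 95%, and 99.7% of a normal distribution lies within 1, 2, and 3 standard deviations of the mean, so an approximate 95% confidence interval for the slope is $\hat{\beta}_{1}\pm 2\cdot SE(\hat{\beta}_{1})$ .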
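The standard error and a rough confidence interval can be computed by hand. The sketch below rests on several assumptions: the data are made up, $\sigma^2$ is unknown and is estimated by $RSS/(n-2)$, the interval uses the "plus or minus two standard errors" approximation, and the intercept's standard error uses the standard formula $SE(\hat{\beta}_{0})^{2}=\sigma^{2}\left[\frac{1}{n}+\frac{\bar{x}^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\right]$ , which is not written out in the notes above.

```python
import math

# Toy data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares estimates of slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# sigma^2 is unknown in practice; estimate it by RSS / (n - 2).
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma2_hat = rss / (n - 2)

# Standard errors of the slope and the intercept.
se_b1 = math.sqrt(sigma2_hat / s_xx)
se_b0 = math.sqrt(sigma2_hat * (1 / n + x_bar ** 2 / s_xx))

# Approximate 95% interval: estimate plus or minus 2 standard errors.
ci_b1 = (b1 - 2 * se_b1, b1 + 2 * se_b1)
print(se_b1, se_b0, ci_b1)
```

With only five points the factor 2 is a crude stand-in for the exact t quantile, so treat the interval as a rough guide and not as a precise 95% interval.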

Hypothesis Tests

*Use of the SE, part 2: performing hypothesis tests

Null hypothesis:

$H_{0}$ : $\beta_{1}=0$ : there is no relationship between x and y

Alternative hypothesis:

$H_{a}$ : $\beta_{1}\neq 0$ : there is some relationship between x and y

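To test $H_{0}$ , we compute the t-statistic:

$t=\frac{\hat{\beta}_{1}-0}{SE(\hat{\beta}_{1})}$ , that is, the estimated slope divided by its standard error.

Assuming $\beta_{1}=0$ , this statistic follows a t-distribution with (n-2) degrees of freedom, which can be looked up in a table.

p-value: the probability, assuming $H_{0}$ is true, of observing a value equal to |t| or larger in absolute value.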
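Instead of looking the value up in a table, the p-value can be approximated numerically. The sketch below is one possible way to do it with the standard library alone: it integrates the t-distribution density with Simpson's rule. The data and the helper names `t_pdf` and `two_sided_p_value` are assumptions made for illustration; in practice a statistics library would normally do this step.

```python
import math


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), using Simpson's rule on [0, |t_stat|]; steps must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    half_central_area = total * h / 3  # probability of 0 <= T <= |t_stat|
    return 1 - 2 * half_central_area


# Toy data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(rss / (n - 2) / s_xx)

t_stat = (b1 - 0) / se_b1            # slope divided by its standard error
p_value = two_sided_p_value(t_stat, df=n - 2)
print(t_stat, p_value)
```

For this toy data the p-value is well above 0.05, so with only five points the null hypothesis $\beta_{1}=0$ would not be rejected at the usual 5% level, even though the fitted slope is positive.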

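**Self-study notes

1. t-test: essentially a way to judge how significant the difference between two groups of data is. The degree of significance is reflected by the t-statistic.

2. t-statistic: a number that reflects the size of the difference between two groups of data. Under the null hypothesis, the t-statistic follows a t-distribution.

3. The t-distribution is a probability distribution that describes the values the t-statistic takes when the two groups do not really differ. Its shape depends on the degrees of freedom (which are determined by the sample size), so the table lookup has to use the right degrees of freedom.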

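4. In simple linear regression, $y=\beta_{0}+\beta_{1}x+\varepsilon$

Here the t-test checks whether the regression coefficient $\beta_{1}$ is significantly different from zero, which is the hypothesis test described above.

The hypothesis test compares two cases: under H0 there is no model and y is predicted by its mean; under Ha the regression model is used. The test asks whether the difference between the two is significant.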

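5. When to use a t-test:

1) Independent two-sample t-test: compare the means of two independent samples and judge whether they differ significantly.

2) Paired-sample t-test: compare the means of the same individuals under different conditions, to judge whether the change in condition significantly affects the outcome.

3) One-sample t-test: compare a sample mean with a known population mean.

4) Significance tests for regression coefficients:

Simple LR: a t-test of H0 against Ha for the slope.

Multiple LR: in backward selection, t-tests decide which variable to remove.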

6. Follow-up question: which method is used to compare differences among more than two groups?

ANOVA (analysis of variance)

Assessing the Overall Accuracy of the Model

RSE, Residual Standard Error

$RSE=\sqrt{(\frac{1}{n-2} )RSS}$

$RSS=e_{1}^{2}+e_{2}^{2}+\cdots+e_{n}^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2$ : the error of the fitted model
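As a quick check of the RSE formula, the following sketch computes it for a small made-up data set. The data and variable names are assumptions for illustration only.

```python
import math

# Toy data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

# Residuals of the fitted line, then RSS and RSE.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)
rse = math.sqrt(rss / (n - 2))
print(rss, rse)
```

The RSE is measured in the units of y, so it can be read as the typical size of a residual.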

R-squared / fraction of variance explained

$R^{2}=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}$

TSS, total sum of squares

$TSS=\sum_{i=1}^{n}(y_{i}-\bar{y})^2$ : the error with no model (a model with slope = 0)

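In simple linear regression, $R^{2}=r^{2}$

r: the correlation (linear correlation coefficient) between x and y

$R^2$ is also called the coefficient of determination. It expresses how much of the variation in the dependent variable is explained by the independent variable.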

$R^{2}=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS}=1-\frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^2}{\sum_{i=1}^{n}(y_{i}-\bar{y})^2}$ , where the denominator is a fixed value for a given data set.
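The relationship between $R^2$ and the correlation r can be verified numerically. This is a minimal sketch on made-up data; the variable names are assumptions for illustration.

```python
import math

# Toy data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# RSS: error of the fitted model. TSS: error of the "no model" baseline (predict the mean).
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - rss / tss

# Sample correlation between x and y.
r = s_xy / math.sqrt(s_xx * tss)
print(r_squared, r, r ** 2)
```

In simple linear regression the two numbers agree up to floating-point rounding, which is why $R^2$ and $r^2$ can be used interchangeably there.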

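As a rule of thumb, $r\in \left[0.75,1\right]$ indicates a strong positive correlation between the two variables, and $r\in \left[-1,-0.75\right]$ a strong negative correlation.

$r\in \left[-0.25,0.25\right]$ indicates only a weak correlation: weakly positive for r > 0, weakly negative for r < 0.

The larger |r| is, the better the fit.

$R^2$ lies between 0 and 1. The closer it is to 1, the better the fit.

A better fit means higher model accuracy, which shows up in several ways:

small residuals $e_{i}$ ; large |r|; $R^2$ close to 1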
