Notes on ISLR Chapter 3
Common linear models
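Simple linear regression
Y = β0 + β1X
Multiple linear regression
Y = β0 + β1X1 + β2X2 + ...
Extended linear regression (with an interaction term)
Y = β0 + β1X1 + β2X2 + β3X1X2
The interaction term β3X1X2 relaxes the additive assumption of the multiple linear model, under which the effect of X1 on Y does not depend on X2.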
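A minimal sketch of what the interaction term changes. The coefficient values and the helper names `predict` and `slope_in_x1` are hypothetical, chosen only for illustration. Without the interaction term, the effect of a one-unit increase in X1 is the same at every X2; with it, that effect becomes β1 + β3X2.

```python
def predict(x1, x2, b0, b1, b2, b3=0.0):
    """Linear model with an optional interaction term b3 * x1 * x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def slope_in_x1(x2, b0, b1, b2, b3=0.0):
    """Change in the prediction when x1 goes from 0 to 1, holding x2 fixed."""
    return predict(1.0, x2, b0, b1, b2, b3) - predict(0.0, x2, b0, b1, b2, b3)


# Hypothetical coefficients, not estimated from any data set.
coefs = dict(b0=2.0, b1=0.5, b2=1.0)

# Additive model (b3 = 0): the slope in x1 does not depend on x2.
additive = [slope_in_x1(x2, **coefs) for x2 in (0.0, 10.0)]

# Interaction model (b3 = 0.25): the slope in x1 is b1 + b3 * x2.
interacting = [slope_in_x1(x2, b3=0.25, **coefs) for x2 in (0.0, 10.0)]

print(additive)
print(interacting)
```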
Evaluation metrics for linear models
F-statistic
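The F-statistic tests whether the response (here, sales) is related to at least one of the predictors. When there is no relationship, F is expected to be close to 1; when there is one, F is expected to be greater than 1. For a fixed sample size, a larger F-statistic is stronger evidence of a relationship. For a value only slightly above 1, whether it is significant depends on the sample size and the number of predictors, and can be checked against the F-distribution (an F table or the associated p-value). Here the F-statistic is 570, so we conclude that a relationship exists.
RSE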
RSE stands for residual standard error. The smaller the RSE, the more closely the model fits the training data. In the book's words: "The RSE is considered a measure of the lack of fit of the model (3.5) to the data."
R² statistic
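Unlike the RSE, which is measured in the units of Y, R² always lies between 0 and 1. The larger R² is, the larger the proportion of the variability in Y that the model explains.
p-value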
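The smaller a predictor's p-value, the stronger the evidence that this X is associated with Y. In this model, newspaper shows no significant association with sales, so it is a candidate to drop when refining the model (p. 81).
The magnitude of a coefficient alone does not tell you whether a predictor is related to the response (p. 134, 3(c)).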
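A small sketch of why coefficient size is not a measure of association: changing the units of a predictor rescales its coefficient but leaves its t-statistic unchanged. The data below are made up for illustration, and `fit_simple` is a hypothetical helper that implements the standard least-squares formulas for simple linear regression.

```python
import math


def fit_simple(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b1, se_b1, t)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))      # residual standard error
    se_b1 = rse / math.sqrt(sxx)        # standard error of the slope
    return b1, se_b1, b1 / se_b1


# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

b1_units, _, t_units = fit_simple(x, y)
# The same predictor expressed in units 1000 times smaller.
b1_scaled, _, t_scaled = fit_simple([xi * 1000 for xi in x], y)

print(b1_units, b1_scaled)   # the coefficient shrinks by a factor of 1000
print(t_units, t_scaled)     # the t-statistic is the same up to rounding
```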
Consequently, it is a simple matter to compute the probability of observing any value equal to |t| or larger, assuming β1 = 0.
We call this probability the p-value. Roughly speaking, we interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.
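The probability described in the quote can be sketched as follows. This is an approximation, not the book's exact computation: it assumes the sample is large enough that the t-distribution is close to a standard normal, which lets it use only `statistics.NormalDist` from the Python standard library. An exact value would use the t-distribution with the model's residual degrees of freedom. The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(t):
    """Approximate P(|T| >= |t|) under H0: beta1 = 0, treating T as standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# t = -0.18 is the newspaper t-statistic quoted in these notes.
p_newspaper = two_sided_p_value(-0.18)
print(round(p_newspaper, 2))   # large: no evidence of an association

# A t-statistic far from zero gives a very small p-value.
print(two_sided_p_value(5.0) < 0.001)
```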
t-statistic
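The t-statistic can be positive or negative, so compare absolute values: a large absolute value indicates that the predictor X is associated with Y. Here the t-statistic for newspaper is -0.18, so newspaper shows no significant association with sales.
SE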
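The standard error (SE) of a coefficient is used to compute its confidence interval. The commonly used 95% confidence interval is approximately
[coefficient − 2 × SE, coefficient + 2 × SE]
Studentized residuals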
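Studentized residuals are used to detect outliers in the data. As a rule of thumb, an observation whose studentized residual exceeds 3 in absolute value is flagged as a possible outlier and needs handling, for example removal.
TSS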
TSS (total sum of squares) measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed.
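To tie TSS to the other metrics in these notes, here is a minimal sketch for simple linear regression (one predictor, p = 1). The toy data and the helper name `fit_metrics` are hypothetical; the formulas are the standard ones: R² = 1 − RSS/TSS, RSE = sqrt(RSS/(n − p − 1)), and F = ((TSS − RSS)/p) / (RSS/(n − p − 1)).

```python
import math


def fit_metrics(x, y):
    """Fit y = b0 + b1 * x by least squares and return common fit metrics."""
    n, p = len(x), 1  # p = number of predictors
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar

    # Variability in y before the regression (TSS) and left over after it (RSS).
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

    rse = math.sqrt(rss / (n - p - 1))
    se_b1 = rse / math.sqrt(sxx)
    return {
        "b1": b1,
        "TSS": tss,
        "RSS": rss,
        "R2": 1.0 - rss / tss,
        "RSE": rse,
        "F": ((tss - rss) / p) / (rss / (n - p - 1)),
        "t_b1": b1 / se_b1,
        # Approximate 95% interval: coefficient plus or minus 2 * SE.
        "CI95_b1": (b1 - 2.0 * se_b1, b1 + 2.0 * se_b1),
    }


# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

m = fit_metrics(x, y)
for name, value in m.items():
    print(name, value)
```

With a single predictor, the F-statistic equals the square of the slope's t-statistic, which gives a quick sanity check on the output.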