[Regression Analysis] Goodness-of-fit tests in logistic regression

萝卜丝皮尔

Category: Statistics. Tags: regression analysis, logistic regression, goodness-of-fit

Article link: https://blog.csdn.net/qq_43448491/article/details/109461359

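This article discusses how to assess a logistic regression model. It covers the model assumptions (independence, linearity, and no interaction) and goodness-of-fit tests, such as computing G1 from −2ln(likelihood) to measure the gap between the fitted model and the saturated model.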

Reference: Regression Analysis (回归分析) course by Professor 黄冠华, National Chiao Tung University, Taiwan

Goal: to test how well the model fits the observed data.
In linear regression, the coefficient of determination $R^2$, which represents the fraction of the total variation in the data explained by the model, can be used as a goodness-of-fit measure.
In logistic regression, the coefficient of determination is not a valid goodness-of-fit measure, so we need a different quantity for the goodness-of-fit test.
Model assumptions in logistic regression
Independence: this assumption enters through the likelihood function. If the observations are not independent, the standard errors and significance tests derived from the likelihood function are all wrong.
Linearity: the log odds $\ln(p/(1-p))$ is linear in the parameters.
Here linearity is on the logit scale, meaning that $\ln(p/(1-p))$ increases by the same amount for every unit increase in x. Equivalently, the odds ratio between x and x+1 is the same no matter what x is.
No interaction: assuming no interaction effects means assuming that the odds ratio for one variable is constant across the levels of the other.
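The linearity assumption can be checked numerically. The sketch below assumes a one-covariate logistic model with hypothetical coefficients `beta0` and `beta1` (not taken from the source) and shows that the odds ratio between x and x+1 does not depend on x.

```python
import math

def logistic(z):
    """Inverse of the logit: maps a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -2.0, 0.7

ratios = []
for x in range(0, 5):
    p_x = logistic(beta0 + beta1 * x)
    p_next = logistic(beta0 + beta1 * (x + 1))
    # Odds ratio comparing x + 1 with x.
    ratios.append(odds(p_next) / odds(p_x))

# Each ratio agrees with exp(beta1) up to floating-point error.
print(ratios)
print(math.exp(beta1))
```

If the true relationship were not linear on the logit scale, these ratios would change with x, which is the kind of departure a goodness-of-fit test is meant to pick up.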
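The saturated model is the most complex model. It describes the original data completely and needs no further assumptions, so predicting with it is the same as predicting with the raw data.
How can we tell whether a model is saturated?
If the number of groups in the data equals the number of unknown parameters in the model, the model is saturated.
Recall that a goodness-of-fit test asks whether the new model fits the original data closely. Since the saturated model fits the data exactly, the question becomes whether the new model is close to the saturated model. If it is, the new model is good; otherwise it is not.
For grouped data, goodness of fit amounts to comparing the model we have with the saturated model (since the data can be exactly reproduced by the saturated model).
This is then equivalent to testing whether enough interaction effects have been included in the model (since the saturated model is the model with all possible interactions).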
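The counting rule for a saturated model can be sketched in a few lines. The layout here is an assumption made for illustration: one binary covariate `x1` and one three-level factor coded by two dummies `D21` and `D31`. The term lists are hypothetical labels, not output from any fitting routine.

```python
from itertools import product

# Assumed layout: x1 is binary; a three-level factor is coded by (D21, D31).
x1_levels = [0, 1]
factor_levels = [(0, 0), (1, 0), (0, 1)]  # (D21, D31) for levels 1, 2, 3

# Each distinct covariate combination is one group.
groups = list(product(x1_levels, factor_levels))
n_groups = len(groups)

main_effects = ["intercept", "x1", "D21", "D31"]
interactions = ["x1*D21", "x1*D31"]

def is_saturated(terms):
    """A model is saturated when it has one parameter per group."""
    return len(terms) == n_groups

print(n_groups)
print(is_saturated(main_effects))                 # main effects only
print(is_saturated(main_effects + interactions))  # all interactions added
```

Under this layout the main-effects model has fewer parameters than groups, and adding every interaction term closes the gap, which matches the statement that the saturated model contains all possible interactions.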
How do we measure the gap between two models?
Use $-2\ln(\text{likelihood function})$.
Two forms of goodness-of-fit test are commonly used with logistic regression, where sums are taken over risk factor-confounder combinations:
form1: $G_1 = 2\ln L(\text{saturated model}) - 2\ln L(\text{fitted model}) = 2\sum_i O_i \ln(O_i/E_i)$
where $O_i$ is the observed number of observations in cell $i$, and $E_i$ is the number of observations predicted for that cell by the fitted model.
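A minimal sketch of the $G_1$ computation for grouped data follows. The group sizes, event counts, and fitted probabilities are hypothetical numbers invented for illustration. Each group contributes two cells (y = 1 and y = 0), and the code checks that the cell-based sum agrees with the difference of the two log-likelihoods.

```python
import math

# Hypothetical grouped data: (group size n, number with y = 1, fitted Pr(y = 1)).
groups = [
    (50, 10, 0.18),
    (40, 14, 0.37),
    (60, 33, 0.56),
]

def o_log_o_over_e(o, e):
    """O * ln(O / E), using the convention 0 * ln(0) = 0."""
    return 0.0 if o == 0 else o * math.log(o / e)

def g1_from_cells(groups):
    total = 0.0
    for n, y, p in groups:
        total += o_log_o_over_e(y, n * p)            # cell with y = 1
        total += o_log_o_over_e(n - y, n * (1 - p))  # cell with y = 0
    return 2.0 * total

def binomial_loglik(groups, probs):
    """Log-likelihood without the binomial coefficient, which cancels in G1."""
    ll = 0.0
    for (n, y, _), p in zip(groups, probs):
        if y > 0:
            ll += y * math.log(p)
        if n - y > 0:
            ll += (n - y) * math.log(1 - p)
    return ll

fitted_probs = [p for _, _, p in groups]
saturated_probs = [y / n for n, y, _ in groups]  # saturated model reproduces the data

g1_cells = g1_from_cells(groups)
g1_loglik = 2 * binomial_loglik(groups, saturated_probs) - 2 * binomial_loglik(groups, fitted_probs)
print(g1_cells, g1_loglik)
```

The two printed values coincide, which is the identity stated in form1. A larger value means the fitted model is further from the saturated model.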
form2:

A small example: below, x1 is a binary random variable, D21 and D31 are dummy variables, and y is the binary outcome. Write out the saturated model and the new model to be tested.

[Figure: the saturated model and the fitted model for this example]
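Use the fitted model to compute the predicted probability of y=1 in each cell. Multiplying it by the cell size gives the expected count $E_i$ in the formula.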
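The above applies to grouped data. Goodness of fit for individual data works as follows.
The distinguishing feature is that the covariates are continuous, so the data are not grouped.
Approach: use the Hosmer and Lemeshow method to form groups.
Outline of the Hosmer and Lemeshow grouping:
Using the fitted model above and the known covariates of each case, compute the predicted probability $\Pr(y = 1)$ for that case.
Then sort the cases by predicted probability, from smallest to largest.
With 10 groups, the cases in the lowest 0%-10% of the ordering form group 1, those in 10%-20% form group 2, ..., and those in 90%-100% form group 10.
After grouping, how is $E_i$ computed for the first group? Average $\Pr(y = 1)$ over the first group and multiply by that group's sample size.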
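The grouping steps above can be sketched as follows. The data are simulated, and the coefficients `beta0` and `beta1` stand in for a fitted model; both are assumptions made for illustration. The function `hosmer_lemeshow_groups` is a hypothetical helper name, and it only builds the observed and expected counts per group, not the test statistic itself.

```python
import math
import random

random.seed(0)

# Stand-in for a fitted model: illustrative coefficients, not estimated here.
beta0, beta1 = -1.0, 0.8

# Simulated individual-level cases: (predicted Pr(y = 1), observed y).
cases = []
for _ in range(200):
    x = random.uniform(-3, 3)  # continuous covariate
    p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
    y = 1 if random.random() < p else 0
    cases.append((p, y))

def hosmer_lemeshow_groups(cases, n_groups=10):
    """Sort cases by predicted probability and split them into equal-sized groups."""
    ordered = sorted(cases, key=lambda c: c[0])
    n = len(ordered)
    table = []
    for g in range(n_groups):
        chunk = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(chunk)
        mean_p = sum(p for p, _ in chunk) / size
        observed = sum(y for _, y in chunk)  # O: cases with y = 1
        expected = mean_p * size             # E: mean probability times group size
        table.append((size, observed, expected))
    return table

table = hosmer_lemeshow_groups(cases)
for size, observed, expected in table:
    print(size, observed, round(expected, 2))
```

Because the cases are sorted before splitting, the expected counts rise from the first group to the last, and comparing them with the observed counts group by group shows where the predicted probabilities drift away from the data.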

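The goodness-of-fit test above is an overall test. Failing to reject the null hypothesis suggests that the model fits well, and rejecting it suggests that the model fits poorly, but the test does not say where the fit is poor or which assumption is violated. This motivates residual analysis.
The residual plots used in ordinary linear regression no longer work in logistic regression. Because the outcome takes only the values 0 and 1, the plotted residuals fall into separate bands instead of scattering evenly, so the plot cannot be read in the usual way. Pearson residuals are used instead.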
[Figure: definition of the Pearson residuals]
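As a rough illustration, the sketch below computes Pearson residuals for grouped data using the usual textbook definition, (observed − expected) divided by the binomial standard deviation. This definition is an assumption here, since the source gives its formula only as an image, and the data are hypothetical.

```python
import math

# Hypothetical grouped data: (group size n, number with y = 1, fitted Pr(y = 1)).
groups = [
    (50, 10, 0.18),
    (40, 14, 0.37),
    (60, 33, 0.56),
]

def pearson_residual(n, y, p):
    """(observed - expected) / sqrt(binomial variance), assumed textbook form."""
    expected = n * p
    return (y - expected) / math.sqrt(n * p * (1 - p))

residuals = [pearson_residual(n, y, p) for n, y, p in groups]
for (n, y, p), r in zip(groups, residuals):
    print(n, y, p, round(r, 3))
```

Unlike the overall test, each residual belongs to one group, so a group with a large absolute residual points to the covariate pattern where the model fits poorly.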