A Study of R Source Code: Logistic Regression

Latest recommended article published on 2023-03-08 21:36:02

KIDxu

Tags: R logistic regression, R source code

Article link: https://blog.csdn.net/KIDxu/article/details/86494996

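This article explores the source-code implementation of logistic regression in R, focusing on the glm function and iteratively reweighted least squares (IWLS). By debugging sample code, it shows how glm.fit uses the IRLS method and how this differs from the traditional approach of solving via the log-likelihood function. It also describes the role of C code in the logistic regression computation, such as the C_logit_linkinv function, as well as the Fortran file dqrls.f and the subroutines it depends on, tracing the complete flow of a logistic regression model from initialization to solution.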

Overview

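I have used logistic regression in R many times, and recently wanted to dig into its source code to understand the algorithm better. This article records my study and understanding of the source code for logistic regression in R. Readers are welcome to point out any problems or errors.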

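Suggestion: this article focuses on the code implementation, with theory interspersed along the way. Readers may want a rough understanding of logistic regression theory before reading it.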

Logistic regression in R can be performed by calling the glm function. R's documentation gives its usage and description as follows:

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, contrasts = NULL, ...)

glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

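So glm is actually a function for fitting generalized linear models, and logistic regression is obtained by specifying the appropriate arguments (logistic regression is in fact one kind of generalized linear model). A brief introduction to generalized linear models: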

$g(y)=\boldsymbol{\beta}^\text{T}\cdot \boldsymbol{x}$

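where g(y) is called the link function. Here y stands for the mean of the response, which for logistic regression is the probability of the positive class. The link function for logistic regression is

$g(y)=\ln\left(\frac{y}{1-y}\right)$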
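
As a quick numerical check of this link function, the sketch below evaluates the logit by hand and compares it with base R's qlogis (the logistic quantile function, which computes the same quantity) and with its inverse plogis. The variable names and probability values are chosen here purely for illustration and do not come from the original article.

```r
# Illustrative probabilities (not from the original article).
p <- c(0.1, 0.5, 0.9)

# Logit link written out from the formula g(y) = ln(y / (1 - y)).
logit_manual <- log(p / (1 - p))

# Base R equivalents: qlogis() is the logit, plogis() is its inverse.
logit_builtin <- qlogis(p)
p_back <- plogis(logit_manual)

print(logit_manual)
print(all.equal(logit_manual, logit_builtin))  # manual and built-in should agree
print(all.equal(p_back, p))                    # the inverse should recover p
```

The inverse direction, from the linear predictor back to a probability, is what turns a fitted linear combination into a predicted probability.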

Example

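Below is a code sample that fits a logistic regression; family = binomial(link = "logit") is the statement that specifies the link function. Setting the other arguments aside for now, we use this simple example to study how the fit is implemented.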

fit <- glm(label ~., family = binomial(link="logit"), data= train.yx,
           control = list(maxit = 5000, epsilon = 0.00000001))
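
The data frame train.yx is not shown in the article. To make the call reproducible, the sketch below builds a small simulated stand-in: the column names x1 and x2, the sample size, and the true coefficients are all assumptions made here for illustration. It then runs the same glm call and inspects a few components of the returned object.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical stand-in for the article's train.yx: a binary label
# and two numeric predictors (x1 and x2 are names invented here).
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
label <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1 - 0.8 * x2))
train.yx <- data.frame(label = label, x1 = x1, x2 = x2)

# Same call as in the article.
fit <- glm(label ~ ., family = binomial(link = "logit"), data = train.yx,
           control = list(maxit = 5000, epsilon = 1e-8))

print(coef(fit))      # estimated coefficients
print(fit$converged)  # whether the fit met the convergence criterion
print(fit$iter)       # number of iterations used
print(fit$method)     # name of the fitting function used by glm
```

On well-behaved data like this the fit typically converges in a handful of iterations, so a maxit as large as 5000 acts only as a generous upper bound.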

The family object created by binomial(link = "logit") is a list with the following components, as described in R's documentation for family:

| Component | Description |
| --- | --- |
| `family` | character: the family name. |
| `link` | character: the link name. |
| `linkfun` | function: the link. |
| `linkinv` | function: the inverse of the link function. |
| `variance` | function: the variance as a function of the mean. |
| `dev.resids` | function giving the deviance residuals as a function of `(y, mu, wt)`. |
| `aic` | function giving the AIC value if appropriate (but `NA` for the quasi- families). See `logLik` for the assumptions made about the dispersion parameter. |
| `mu.eta` | function: derivative `function(eta)` dμ/dη. |
| `initialize` | expression. This needs to set up whatever data objects are needed for the family as well as `n` (needed for AIC in the binomial family) and `mustart` (see `glm`). |
| `validmu` | logical function. Returns `TRUE` if a mean vector `mu` is within the domain of `variance`. |
| `valideta` | logical function. Returns `TRUE` if a linear predictor `eta` is within the domain of `linkinv`. |
| `simulate` | (optional) function `simulate(object, nsim)` to be called by the `"lm"` method of `simulate`. It will normally return a matrix with `nsim` columns and one row for each fitted value, but it can also return a list of length `nsim`. Clearly this will be missing for ‘quasi-’ families. |
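
To see these components in practice, the following sketch builds the binomial family object and calls a few of its functions directly. The eta values are arbitrary illustrative inputs. For the logit link, the derivative dμ/dη equals μ(1 − μ), which is also the binomial variance function, so the last two printed vectors should match.

```r
fam <- binomial(link = "logit")

print(class(fam))   # "family"
print(names(fam))   # component names of the family object

eta <- c(-2, 0, 2)          # illustrative linear-predictor values
mu  <- fam$linkinv(eta)     # inverse link: eta -> mean (probability)

print(mu)
print(fam$linkfun(mu))      # the link maps the mean back to eta
print(fam$mu.eta(eta))      # derivative d mu / d eta
print(fam$variance(mu))     # variance function, mu * (1 - mu)
```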

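Debugging the glm call above steps into the glm function. We skip the code that only assigns values or passes arguments and focus on the core code that does the fitting. The following line of code calls the function named by method, followed by that function's arguments. The R documentation describes the method argument as follows: "the method to be used in fitting the model. The default method "glm.fit" uses iteratively reweighted least squares (IWLS)". Since we did not specify method, it is glm.fit here. Note the second half of that description: the method uses iteratively reweighted least squares (IWLS), also known as IRLS. This is where glm's implementation of logistic regression in R differs from the usual presentation. Books and articles typically describe solving logistic regression with the log-likelihood as the cost function, whereas glm.fit uses the IWLS method, that is, iteratively reweighted least squares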