Exponential Families and Generalized Linear Models


Exponential Families

Definition

A distribution belongs to the exponential family if its probability density function (for a continuous distribution) or probability mass function (for a discrete distribution) can be written in the form
$$f(y;\theta)=s(y)\,t(\theta)\,e^{a(y)b(\theta)}$$
or, equivalently,
$$f(y;\theta)=\exp[a(y)b(\theta)+c(\theta)+d(y)]$$
where $s(y)=\exp d(y)$ and $t(\theta)=\exp c(\theta)$.

When $a(y)=y$, the distribution is said to be in canonical (standard) form, and $b(\theta)$ is called the natural parameter.

Examples

Normal distribution

The probability density function of the normal distribution can be written as
$$f(y;\mu)=\frac{1}{(2\pi\sigma^{2})^{1/2}}\exp\left[-\frac{1}{2\sigma^{2}}(y-\mu)^{2}\right]=\exp\left[-\frac{y^{2}}{2\sigma^{2}}+\frac{y\mu}{\sigma^{2}}-\frac{\mu^{2}}{2\sigma^{2}}-\frac{1}{2}\log\left(2\pi\sigma^{2}\right)\right]$$
so $a(y)=y$ and $b(\mu)=\mu/\sigma^{2}$.

Binomial distribution

The probability mass function of the binomial distribution can be written as
$$f(y;\pi)=\binom{n}{y}\pi^{y}(1-\pi)^{n-y}=\exp\left[y\log\pi-y\log(1-\pi)+n\log(1-\pi)+\log\binom{n}{y}\right]$$
so $a(y)=y$ and $b(\pi)=\log\pi-\log(1-\pi)=\log[\pi/(1-\pi)]$.

Poisson distribution

The probability mass function of the Poisson distribution can be written as
$$f(y;\theta)=\frac{\theta^{y}e^{-\theta}}{y!}=\exp(y\log\theta-\theta-\log y!)$$
so $a(y)=y$ and $b(\theta)=\log\theta$.

Summary
  1. The normal, binomial, and Poisson distributions all belong to the exponential family, and each can be written directly in canonical form, i.e. $a(y)=y$. Their natural parameters $b(\cdot)$ and the other two components $c(\cdot)$ and $d(\cdot)$ are summarized in the table below (a numeric check of these decompositions follows this list):

| Distribution | $b$ (natural parameter) | $c$ | $d$ |
| --- | --- | --- | --- |
| Normal | $\mu/\sigma^{2}$ | $-\dfrac{\mu^{2}}{2\sigma^{2}}-\dfrac{1}{2}\log(2\pi\sigma^{2})$ | $-\dfrac{y^{2}}{2\sigma^{2}}$ |
| Binomial | $\log[\pi/(1-\pi)]$ | $n\log(1-\pi)$ | $\log\binom{n}{y}$ |
| Poisson | $\log\theta$ | $-\theta$ | $-\log y!$ |

  2. Only a few of the most common and simplest members of the exponential family are listed here; the family contains many other distributions, such as the gamma distribution.
  3. Only the univariate case is covered here, including the definition; the multivariate case is analogous.
  4. If $Y$ follows an exponential-family distribution, then $a(Y)$ is a sufficient statistic (for $\theta$): given $a(Y)$, the conditional distribution of $Y$ does not depend on $\theta$. Readers interested in sufficient statistics can consult 陈希孺's 《高等数理统计学》 (Advanced Mathematical Statistics) or other references.
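
As a quick numeric sanity check of the table (a minimal sketch, not part of the original post; scipy is used only for the reference pmf), the binomial decomposition $\exp[a(y)b(\pi)+c(\pi)+d(y)]$ can be compared against the standard pmf:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

# Binomial(n, pi) written as exp[a(y)b(pi) + c(pi) + d(y)] with a(y) = y
n, pi = 10, 0.3
y = np.arange(n + 1)
b = np.log(pi / (1 - pi))                                  # natural parameter
c = n * np.log(1 - pi)
d = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)   # log C(n, y)

print(np.allclose(np.exp(y * b + c + d), binom.pmf(y, n, pi)))  # True
```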

Properties

  1. The expectation of the sufficient statistic $a(Y)$; when $a(Y)=Y$, this is simply the expectation of $Y$ (a symbolic check of this formula and the variance formula below appears after this list):
    $$\mathrm{E}[a(Y)]=-c'(\theta)/b'(\theta)$$
    Proof: since $f(y;\theta)$ is a density (or mass) function,
    $$\int f(y;\theta)\,dy=1$$
    Differentiating both sides with respect to $\theta$ gives
    $$\frac{d}{d\theta}\int f(y;\theta)\,dy=\frac{d}{d\theta}1=0$$
    Under suitable regularity conditions, differentiation and integration can be interchanged, so
    $$\int\frac{df(y;\theta)}{d\theta}\,dy=\int\left[a(y)b'(\theta)+c'(\theta)\right]f(y;\theta)\,dy=0$$
    and hence
    $$\mathrm{E}[a(Y)]=-c'(\theta)/b'(\theta)$$

  2. The variance of the sufficient statistic $a(Y)$; when $a(Y)=Y$, this is simply the variance of $Y$:
    $$\operatorname{var}[a(Y)]=\frac{b''(\theta)c'(\theta)-c''(\theta)b'(\theta)}{[b'(\theta)]^{3}}$$
    Proof: again, since $f(y;\theta)$ is a density (or mass) function,
    $$\int f(y;\theta)\,dy=1$$
    Differentiating both sides twice with respect to $\theta$ and interchanging differentiation and integration (under suitable regularity conditions) gives
    $$\int\frac{d^{2}f(y;\theta)}{d\theta^{2}}\,dy=0$$
    On the other hand,
    $$\frac{d^{2}f(y;\theta)}{d\theta^{2}}=\left[a(y)b''(\theta)+c''(\theta)\right]f(y;\theta)+\left[a(y)b'(\theta)+c'(\theta)\right]^{2}f(y;\theta)$$
    Using property 1, the second term can be written as
    $$[b'(\theta)]^{2}\{a(y)-\mathrm{E}[a(Y)]\}^{2}f(y;\theta)$$
    Integrating therefore yields
    $$b''(\theta)\mathrm{E}[a(Y)]+c''(\theta)+[b'(\theta)]^{2}\operatorname{var}[a(Y)]=0$$
    and thus
    $$\operatorname{var}[a(Y)]=\frac{b''(\theta)c'(\theta)-c''(\theta)b'(\theta)}{[b'(\theta)]^{3}}$$

  3. Higher-order moments of the sufficient statistic $a(Y)$ can be computed in the same way; interested readers can derive a general formula.

  4. In Bayesian theory the exponential family has a pleasant property: if the prior on the parameter belongs to the exponential family and, given the parameter, the random variable $X$ also follows an exponential-family distribution (i.e. the likelihood is exponential-family), then the posterior again belongs to the exponential family; in other words, the exponential family is conjugate to itself. The proof is a direct verification using the formulas for conditional probability.
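
The following sketch (not from the original post; a minimal check assuming sympy is available) verifies properties 1 and 2 symbolically for the Poisson case, where $b(\theta)=\log\theta$ and $c(\theta)=-\theta$, so both the mean and the variance should come out as $\theta$:

```python
import sympy as sp

theta = sp.symbols("theta", positive=True)

# Poisson in exponential-family form: a(y) = y, b = log(theta), c = -theta
b = sp.log(theta)
c = -theta

mean = -sp.diff(c, theta) / sp.diff(b, theta)   # property 1: -c'/b'
var = (sp.diff(b, theta, 2) * sp.diff(c, theta)
       - sp.diff(c, theta, 2) * sp.diff(b, theta)) / sp.diff(b, theta) ** 3

print(sp.simplify(mean))  # theta, the Poisson mean
print(sp.simplify(var))   # theta, the Poisson variance
```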

References

  1. 陈希孺, 《高等数理统计学》 (Advanced Mathematical Statistics)
  2. Exponential Families.pdf

Generalized Linear Models

Definition

Given training data $(x_1,y_1),(x_2,y_2),\cdots,(x_n,y_n)$, a generalized linear model (GLM) is defined as a model satisfying the following three conditions:

  1. Random component: The random component of a GLM consists of a response variable $y$ with independent observations $(y_1,\cdots,y_n)$ from a distribution having probability density or mass function for $y_i$ of the form
    $$f(y_{i};\theta_{i},\phi)=\exp\left\{\left[y_{i}\theta_{i}-b(\theta_{i})\right]/a(\phi)+c(y_{i},\phi)\right\}$$
    For $y$, the probability density or mass function is
    $$f(y;\theta,\phi)=\exp\left\{\left[y\theta-b(\theta)\right]/a(\phi)+c(y,\phi)\right\}$$
    Here $\theta$ is called the canonical parameter.

    It is easy to show, using the exponential-family properties derived above, that (a symbolic check appears after this list)
    $$E(Y)=\mu=b'(\theta),\qquad\operatorname{var}(Y)=b''(\theta)\cdot a(\phi)$$

  2. Linear predictor: For a parameter vector $\boldsymbol{\beta}=(\beta_{1},\beta_{2},\ldots,\beta_{p})^{\mathrm{T}}$ and an $n\times p$ model matrix $\boldsymbol{X}$ that contains the values of $p$ explanatory variables for the $n$ observations, the linear predictor is $\boldsymbol{X}\boldsymbol{\beta}$.

  3. Link function: This is a function $g$ applied to each component of $E(\boldsymbol{y})$ that relates it to the linear predictor,
    $$g[E(\boldsymbol{y})]=\boldsymbol{X}\boldsymbol{\beta}$$
    A function $g(\cdot)$ such that
    $$g[E(\boldsymbol{y})]=\theta\ (\text{the canonical parameter})$$
    is called the canonical link function. GLMs usually use the canonical link function because it makes the maximum likelihood (MLE) equations simpler than other link functions do, as shown in the MLE section below.
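
For instance, the two moment identities can be checked symbolically for the Bernoulli case worked out later in this post, where $b(\theta)=\log(1+e^{\theta})$ and $a(\phi)=1$ (a minimal sketch, assuming sympy is available):

```python
import sympy as sp

theta = sp.symbols("theta", real=True)

# Bernoulli: b(theta) = log(1 + e^theta), a(phi) = 1
b = sp.log(1 + sp.exp(theta))
pi = sp.exp(theta) / (1 + sp.exp(theta))   # mean parameter

mu = sp.diff(b, theta)        # E(Y)   = b'(theta)
var = sp.diff(b, theta, 2)    # var(Y) = b''(theta) * a(phi), with a(phi) = 1

print(sp.simplify(mu - pi))               # 0, i.e. E(Y) = pi
print(sp.simplify(var - pi * (1 - pi)))   # 0, i.e. var(Y) = pi(1 - pi)
```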

MLE

Log-likelihood for a GLM

Let $L_{i}=\log f(y_{i};\theta_{i},\phi)$; then
$$L_{i}=\left[y_{i}\theta_{i}-b(\theta_{i})\right]/a(\phi)+c(y_{i},\phi)$$
so the entire log-likelihood is
$$L(\boldsymbol{\beta})=\sum_{i=1}^{n}L_{i}=\sum_{i=1}^{n}\log f(y_{i};\theta_{i},\phi)=\sum_{i=1}^{n}\frac{y_{i}\theta_{i}-b(\theta_{i})}{a(\phi)}+\sum_{i=1}^{n}c(y_{i},\phi)$$

Likelihood Equations for a GLM

For a GLM we have $\eta_{i}=\sum_{j}\beta_{j}x_{ij}=g(\mu_{i})$ with link function $g$.

By the chain rule,
$$\frac{\partial L_{i}}{\partial\beta_{j}}=\frac{\partial L_{i}}{\partial\theta_{i}}\frac{\partial\theta_{i}}{\partial\mu_{i}}\frac{\partial\mu_{i}}{\partial\eta_{i}}\frac{\partial\eta_{i}}{\partial\beta_{j}}=\frac{(y_{i}-\mu_{i})}{a(\phi)}\frac{a(\phi)}{\operatorname{var}(y_{i})}\frac{\partial\mu_{i}}{\partial\eta_{i}}x_{ij}=\frac{(y_{i}-\mu_{i})x_{ij}}{\operatorname{var}(y_{i})}\frac{\partial\mu_{i}}{\partial\eta_{i}}$$
where we have used $\mu_{i}=b'(\theta_{i})$ and $\operatorname{var}(y_{i})=b''(\theta_{i})\,a(\phi)$.

Summing over the $n$ observations yields the likelihood equations
$$\frac{\partial L(\boldsymbol{\beta})}{\partial\beta_{j}}=\sum_{i=1}^{n}\frac{(y_{i}-\mu_{i})x_{ij}}{\operatorname{var}(y_{i})}\frac{\partial\mu_{i}}{\partial\eta_{i}}=0,\quad j=1,2,\ldots,p$$
where $\eta_{i}=\sum_{j}\beta_{j}x_{ij}=g(\mu_{i})$ with link function $g$.

In matrix form, these equations read
$$\boldsymbol{X}^{\mathrm{T}}\boldsymbol{D}\boldsymbol{V}^{-1}(\boldsymbol{y}-\boldsymbol{\mu})=\mathbf{0}$$
where $\boldsymbol{V}$ denotes the diagonal matrix of variances of the observations and $\boldsymbol{D}$ denotes the diagonal matrix with elements $\partial\mu_{i}/\partial\eta_{i}$.

When we take $g$ to be the canonical link function, we have
$$\eta_{i}=\theta_{i}=\sum_{j=1}^{p}\beta_{j}x_{ij}$$
and thus
$$\partial\mu_{i}/\partial\eta_{i}=\partial\mu_{i}/\partial\theta_{i}=\partial b'(\theta_{i})/\partial\theta_{i}=b''(\theta_{i})$$
So we obtain
$$\frac{\partial L_{i}}{\partial\beta_{j}}=\frac{(y_{i}-\mu_{i})}{\operatorname{var}(y_{i})}b''(\theta_{i})x_{ij}=\frac{(y_{i}-\mu_{i})x_{ij}}{a(\phi)}$$
and the likelihood equations become
$$\sum_{i=1}^{n}x_{ij}y_{i}=\sum_{i=1}^{n}x_{ij}\mu_{i},\quad j=1,2,\ldots,p$$
This is why using the canonical link function in a GLM makes the maximum likelihood equations simpler than other link functions do.

Remark

Given the likelihood equations (or the likelihood function itself), the model can be fitted by approximate root-finding methods such as Newton–Raphson iteration, or by gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood); a small sketch follows below. We do not discuss GLM fitting in further detail here; interested readers can consult the references.
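
As an illustration (a minimal sketch under the canonical link, not from the original post), Newton–Raphson for the Bernoulli/logit case solves the likelihood equations $X^{\mathrm{T}}y=X^{\mathrm{T}}\mu$ by iterating $\beta\leftarrow\beta+(X^{\mathrm{T}}WX)^{-1}X^{\mathrm{T}}(y-\mu)$ with $W=\operatorname{diag}(b''(\theta_i))$:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for a canonical-link GLM (Bernoulli response, logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                     # linear predictor = canonical parameter
        mu = 1.0 / (1.0 + np.exp(-eta))    # mu = b'(theta)
        W = mu * (1.0 - mu)                # b''(theta) = pi(1 - pi)
        step = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:     # likelihood equations solved
            break
    return beta

# toy usage with simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([-0.5, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
print(fit_logistic_newton(X, y))  # roughly recovers beta_true
```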

Examples

Ordinary linear regression

When $Y$ follows a normal distribution, the resulting GLM is the linear model in the usual sense,
$$Y=X\beta+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\sigma^{2})$$
Proof: the normal density can be written as
$$f(y;\mu)=\frac{1}{(2\pi\sigma^{2})^{1/2}}\exp\left[-\frac{1}{2\sigma^{2}}(y-\mu)^{2}\right]=\exp\left[-\frac{y^{2}}{2\sigma^{2}}+\frac{y\mu}{\sigma^{2}}-\frac{\mu^{2}}{2\sigma^{2}}-\frac{1}{2}\log\left(2\pi\sigma^{2}\right)\right]$$
so $\theta=\mu$, $b(\theta)=b(\mu)=\mu^{2}/2$, $a(\phi)=\sigma^{2}$, hence $E(Y)=\mu$, $D(Y)=\sigma^{2}$, and the canonical link function is the identity map.

The GLM is therefore
$$E(Y)=X\beta,\quad D(Y)=\sigma^{2}$$
and since $Y$ is normal, this is equivalent to
$$Y=X\beta+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\sigma^{2})$$
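
This equivalence can be seen numerically (a small sketch using statsmodels, not part of the original derivation): a Gaussian GLM with the identity link produces the same coefficients as ordinary least squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=100)

# Gaussian GLM with its canonical (identity) link ...
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
# ... gives the same coefficients as ordinary least squares
ols_fit = sm.OLS(y, X).fit()

print(np.allclose(glm_fit.params, ols_fit.params))  # True
```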

Logistic regression

When $Y$ follows a Bernoulli distribution, the resulting GLM is the logistic regression model (log-odds model) in the usual sense.

Proof: the probability mass function of the Bernoulli distribution can be written as
$$f(y;\pi)=\pi^{y}(1-\pi)^{1-y}=\exp\left[y\log\pi-y\log(1-\pi)+\log(1-\pi)\right]$$
so $\theta=\log[\pi/(1-\pi)]$, $b(\theta)=-\log(1-\pi)=\log(1+e^{\theta})$, $a(\phi)=1$, hence
$$E(Y)=\frac{e^{\theta}}{1+e^{\theta}}=\pi,\quad D(Y)=\pi(1-\pi)$$
The canonical link function is $g(x)=\log[x/(1-x)]$, so the GLM is
$$\log[\pi/(1-\pi)]=X\beta,\quad D(Y)=\pi(1-\pi)$$
that is,
$$\pi=\frac{1}{1+e^{-X\beta}}$$
The model can then be fitted by building the likelihood and maximizing it, e.g. by gradient descent on the negative log-likelihood, or by solving the likelihood equations with Newton's method (see the Newton–Raphson sketch above); we do not discuss the details here. A library-based fit is sketched below.
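
For reference (a sketch assuming statsmodels is available; not from the original post), fitting this model on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(500, 2)))
beta_true = np.array([0.3, 1.5, -2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# Bernoulli GLM with its canonical logit link
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)  # roughly recovers beta_true
```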

Poisson regression

When $Y$ follows a Poisson distribution, the resulting GLM is the Poisson regression model in the usual sense, also known as the log-linear model.

Proof: the probability mass function of the Poisson distribution can be written as
$$f(y;\lambda)=\frac{\lambda^{y}e^{-\lambda}}{y!}=\exp(y\log\lambda-\lambda-\log y!)$$
so $\theta=\log(\lambda)$, $b(\theta)=\lambda=e^{\theta}$, $a(\phi)=1$, hence
$$E(Y)=e^{\theta}=\lambda,\quad D(Y)=e^{\theta}=\lambda$$
The canonical link function is $g(x)=\log(x)$, so the GLM is
$$\log(\lambda)=X\beta,\quad D(Y)=\lambda$$
that is,
$$E(Y)=e^{X\beta},\quad D(Y)=\lambda=e^{X\beta}$$
Again, the model can be fitted by maximizing the likelihood (gradient methods) or by solving the likelihood equations with Newton's method; a library-based fit is sketched below.
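
As with the logistic case (a sketch assuming statsmodels; not from the original post):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(500, 2)))
beta_true = np.array([0.5, 0.8, -0.4])
y = rng.poisson(np.exp(X @ beta_true))

# Poisson GLM with its canonical log link
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)  # roughly recovers beta_true
```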

Softmax regression

When $Y=(y_1,\cdots,y_n)'$ follows the distribution in which the outcome $y_1=0,\cdots,y_{i-1}=0,y_i=1,y_{i+1}=0,\cdots,y_n=0$ occurs with probability $\pi_i$, where $\sum_i\pi_i=1$ (a categorical distribution over $n$ classes), the resulting GLM is the softmax regression model in the usual sense. Write $Y_{n-1}=(y_1,\cdots,y_{n-1})'$.

Proof: the distribution of $Y$, or rather of $Y_{n-1}$ (it suffices to consider $Y_{n-1}$), can be written as
$$f(y;\pi)=\pi_{1}^{y_{1}}\cdots\pi_{n}^{y_{n}}=\exp\left[\sum_{i=1}^{n-1}y_{i}\log\frac{\pi_{i}}{\pi_{n}}+\log\pi_{n}\right]=\exp\left[\theta'Y_{n-1}+\log\pi_{n}\right]$$
where $\theta=(\log(\pi_{1}/\pi_{n}),\cdots,\log(\pi_{n-1}/\pi_{n}))'$.

This is a multivariate exponential family; the corresponding GLM is defined analogously to the univariate case, so we do not spell out the details.

Thus $\theta=(\log(\pi_{1}/\pi_{n}),\cdots,\log(\pi_{n-1}/\pi_{n}))'$, $b(\theta)=-\log(\pi_{n})=-\log(1-\sum_{i=1}^{n-1}\pi_{i})=\log(1+\sum_{i=1}^{n-1}e^{\theta_{i}})$, and $a(\phi)=1$, from which we obtain
$$\pi_{i}=\frac{e^{\theta_{i}}}{1+\sum_{j=1}^{n-1}e^{\theta_{j}}},\ i=1,\cdots,n-1,\qquad\pi_{n}=\frac{1}{1+\sum_{j=1}^{n-1}e^{\theta_{j}}}$$
and, differentiating $b$,
$$\frac{\partial b(\theta)}{\partial\theta_{i}}=\frac{e^{\theta_{i}}}{1+\sum_{j=1}^{n-1}e^{\theta_{j}}}=\pi_{i},\quad i=1,\cdots,n-1$$

Hence $E(Y)=(\pi_{1},\cdots,\pi_{n})$ ($D(Y)$ is omitted here; interested readers can compute it).
The canonical link function is $g(\pi_{1},\cdots,\pi_{n-1})=(\log(\pi_{1}/\pi_{n}),\cdots,\log(\pi_{n-1}/\pi_{n}))'=\theta$, so the GLM is
$$\theta=X\beta,\quad\text{i.e.}\quad\theta_{i}=x_{i}'\beta$$
and therefore
$$\pi_{i}=\frac{e^{x_{i}'\beta}}{1+\sum_{j=1}^{n-1}e^{x_{j}'\beta}},\ i=1,\cdots,n-1,\qquad\pi_{n}=\frac{1}{1+\sum_{j=1}^{n-1}e^{x_{j}'\beta}}$$
In particular, when $n=2$ this reduces to the logistic regression model.

The model can then be fitted by maximizing the likelihood, e.g. by gradient descent on the negative log-likelihood or by Newton's method on the likelihood equations, as in the sketch below; we do not discuss further details here.
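
A minimal fitting sketch (not from the original post; it parametrizes one coefficient column per class and pins the last class's column to zero, matching the convention $\pi_{n}=1/(1+\sum_{j}e^{\theta_{j}})$ above):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax(X, labels, n_classes, lr=0.5, n_iter=5000):
    """Gradient descent on the average negative multinomial log-likelihood."""
    B = np.zeros((X.shape[1], n_classes))     # one coefficient column per class
    Y = np.eye(n_classes)[labels]             # one-hot encoding of labels
    for _ in range(n_iter):
        P = softmax(X @ B)
        B -= lr * X.T @ (P - Y) / X.shape[0]  # gradient of the average NLL
        B[:, -1] = 0.0                        # reference class pinned to zero
    return B

# toy usage
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
B_true = np.array([[0.0, 1.0, 0.0],
                   [2.0, -1.0, 0.0],
                   [-1.0, 0.5, 0.0]])
labels = np.array([rng.choice(3, p=pr) for pr in softmax(X @ B_true)])
print(fit_softmax(X, labels, 3).round(2))     # roughly recovers B_true
```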

References

  1. GLM.pdf
  2. Alan Agresti, *Foundations of Linear and Generalized Linear Models*