[cs229] Generalized Linear Models

Generalized Linear Models

The exponential family

The exponential family of distributions has the form:

$$p(y;\eta) = b(y)\,\exp\!\left(\eta^T T(y) - a(\eta)\right)$$

In most cases $T(y) = y$. Here $\eta$ is the natural parameter, $a(\eta)$ is the log partition function, and $b(y)$ is the base measure. With $T$, $a$, and $b$ fixed, varying $\eta$ traces out a family of distributions, and many common distributions can be written in this form.

Bernoulli distribution

Below we rewrite the Bernoulli distribution in exponential-family form. Let the mean be $\phi$ and $y \in \{0,\ 1\}$. Then:

$$\begin{aligned} p(y=1;\phi) &= \phi \\ p(y=0;\phi) &= 1-\phi \end{aligned}$$

which can be written compactly as:

$$p(y;\phi) = \phi^y (1-\phi)^{1-y}$$

Taking the exponential of the logarithm:

$$\begin{aligned} p(y;\phi) &= \exp\!\left(\ln\!\left(\phi^y (1-\phi)^{1-y}\right)\right) \\ &= \exp\!\left(y\ln\phi + (1-y)\ln(1-\phi)\right) \\ &= \exp\!\left(y\ln\frac{\phi}{1-\phi} + \ln(1-\phi)\right) \end{aligned}$$

Since $\ln\frac{\phi}{1-\phi}$ depends only on $\phi$ (not on $y$), set $\eta = \ln\frac{\phi}{1-\phi}$, which gives:

$$\begin{aligned} e^\eta &= \frac{\phi}{1-\phi} \\ e^\eta &= (1+e^\eta)\,\phi \\ \phi &= \frac{e^\eta}{1+e^\eta} = \frac{1}{1+e^{-\eta}} \end{aligned}$$

Then $p(y;\phi)$ can be rewritten as:

$$p(y;\phi) = \exp\!\left(\eta^T y + \ln\frac{1}{1+e^\eta}\right)$$

This is exactly the exponential-family form, with:

$$\begin{aligned} b(y) &= 1 \\ T(y) &= y \\ a(\eta) &= -\ln\frac{1}{1+e^\eta} = \ln(1+e^\eta) \end{aligned}$$
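As a quick numerical sanity check, the sketch below (not part of the original notes; the function names and the value of `phi` are made up for illustration) evaluates the Bernoulli pmf both directly and via the exponential-family form and confirms they agree:

```python
import math

def bernoulli_pmf(y, phi):
    """Direct Bernoulli pmf: phi^y * (1 - phi)^(1 - y)."""
    return phi**y * (1 - phi)**(1 - y)

def bernoulli_exp_family(y, phi):
    """Same pmf written as b(y) * exp(eta * T(y) - a(eta))."""
    eta = math.log(phi / (1 - phi))   # natural parameter
    a = math.log(1 + math.exp(eta))   # log partition function
    b = 1.0                           # base measure
    return b * math.exp(eta * y - a)

phi = 0.3  # arbitrary example mean
for y in (0, 1):
    assert abs(bernoulli_pmf(y, phi) - bernoulli_exp_family(y, phi)) < 1e-12
```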

Gaussian distribution

Recall from linear regression that, in the probabilistic derivation, the Gaussian variance $\sigma^2$ has no effect on the resulting $\boldsymbol\theta$, so here we may set it to $1$. The Gaussian density is then:

$$\begin{aligned} p(y;\mu) &= \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{(y-\mu)^2}{2}\right) \\ &= \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{y^2}{2}\right)\exp\!\left(\mu y - \frac{1}{2}\mu^2\right) \end{aligned}$$

This is again the exponential-family form, with:

$$\begin{aligned} b(y) &= \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{y^2}{2}\right) \\ T(y) &= y \\ \eta &= \mu \\ a(\eta) &= \frac{1}{2}\mu^2 = \frac{1}{2}\eta^2 \end{aligned}$$
Many common distributions can be converted into this exponential-family form; it is the shared structure on which generalized linear models are built.
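The Gaussian decomposition above can be checked the same way. This is a minimal sketch (function names and test points are made up); it confirms that $b(y)\exp(\eta y - a(\eta))$ with $\eta = \mu$ reproduces the $N(\mu, 1)$ density:

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density, written directly."""
    return math.exp(-(y - mu)**2 / 2) / math.sqrt(2 * math.pi)

def gaussian_exp_family(y, mu):
    """Same density as b(y) * exp(eta * T(y) - a(eta)) with eta = mu."""
    b = math.exp(-y**2 / 2) / math.sqrt(2 * math.pi)  # base measure
    eta = mu                                          # natural parameter
    a = eta**2 / 2                                    # log partition function
    return b * math.exp(eta * y - a)

for y, mu in [(0.0, 0.0), (1.5, -0.7), (-2.0, 3.0)]:
    assert abs(gaussian_pdf(y, mu) - gaussian_exp_family(y, mu)) < 1e-12
```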

Constructing GLMs

We begin with three assumptions:

1. $y \mid \boldsymbol x; \boldsymbol\theta \sim \text{ExponentialFamily}(\eta)$
2. Given $\boldsymbol x$, our goal is to predict the expected value of $T(y)$. This means we would like the learned hypothesis $h$ to satisfy $h(\boldsymbol x) = E[y \mid \boldsymbol x]$.
3. $\eta = \boldsymbol\theta^T \boldsymbol x$. (If $\eta$ is vector-valued, then $\eta_i = \boldsymbol\theta_i^T \boldsymbol x$.)
Starting from these three assumptions, we derive the model forms of linear regression and logistic regression.

Ordinary Least Squares

In linear regression we assumed that $y$ is a strictly linear function of $\boldsymbol x$ plus Gaussian random noise, so $y \mid \boldsymbol x; \boldsymbol\theta \sim N(\mu, \sigma^2)$. We showed above that the Gaussian can be written in exponential-family form, so by assumptions 2 and 3:

$$\begin{aligned} h_{\boldsymbol\theta}(\boldsymbol x) &= E[y \mid \boldsymbol x] \\ &= \mu \\ &= \eta \\ &= \boldsymbol\theta^T \boldsymbol x \end{aligned}$$
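In code, the Gaussian GLM hypothesis is just the identity link applied to $\boldsymbol\theta^T \boldsymbol x$. A minimal sketch (the parameter values are made up for illustration):

```python
def h_linear(theta, x):
    """GLM hypothesis for a Gaussian response: identity link, h(x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

# hypothetical parameters and input, for illustration only
theta = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]   # x[0] = 1 plays the role of the intercept term
assert h_linear(theta, x) == 2.0 - 3.0 + 2.0   # = 1.0
```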

Logistic Regression

Using the earlier derivation together with assumptions 2 and 3:

$$\begin{aligned} h_{\boldsymbol\theta}(\boldsymbol x) &= E[y \mid \boldsymbol x] \\ &= \phi \\ &= \frac{1}{1+e^{-\eta}} \\ &= \frac{1}{1+e^{-\boldsymbol\theta^T \boldsymbol x}} \end{aligned}$$
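The Bernoulli canonical link derived above is exactly the sigmoid. A minimal sketch (the parameter values are hypothetical):

```python
import math

def h_logistic(theta, x):
    """GLM hypothesis for a Bernoulli response: sigmoid of theta^T x."""
    z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical parameters and input, for illustration only
theta = [0.5, -1.2, 2.0]
x = [1.0, 0.3, 0.8]   # x[0] = 1 is the intercept term
p = h_logistic(theta, x)
assert 0.0 < p < 1.0  # output is a valid probability
```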

Softmax Regression

In classification, when there are $k$ outcomes rather than two, the Bernoulli distribution no longer suffices. We instead assume $y \mid \boldsymbol x; \boldsymbol\theta$ follows a multinomial distribution, with $\phi_1, \phi_2, \cdots, \phi_k$ denoting the probabilities of the $k$ outcomes. Since $\sum_{i=1}^k \phi_i = 1$, we can set:

$$\phi_k = 1 - \sum_{i=1}^{k-1}\phi_i$$

Define the indicator function, whose braces enclose a logical expression:

$$1\{\text{True}\} = 1, \qquad 1\{\text{False}\} = 0$$

Then:

$$p(y;\phi) = \prod_{i=1}^k \phi_i^{1\{y=i\}}$$

Let $T$ map $y$ to a column vector with $k-1$ rows whose $y$-th entry is $1$ and all other entries are $0$; if $y = k$, every entry is $0$:

$$T(1) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\ T(2) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\ \cdots,\ T(k-1) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix},\ T(k) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \mathbb{R}^{(k-1)\times 1}$$

Then $p(y;\phi)$ can be further rewritten as:

$$\begin{aligned} p(y;\phi) &= \phi_1^{1\{y=1\}}\phi_2^{1\{y=2\}}\cdots\phi_{k-1}^{1\{y=k-1\}}\phi_k^{1-\sum_{i=1}^{k-1}1\{y=i\}} \\ &= \phi_1^{(T(y))_1}\phi_2^{(T(y))_2}\cdots\phi_{k-1}^{(T(y))_{k-1}}\phi_k^{1-\sum_{i=1}^{k-1}(T(y))_i} \\ &= \exp\!\left((T(y))_1\ln\phi_1 + (T(y))_2\ln\phi_2 + \cdots + (T(y))_{k-1}\ln\phi_{k-1} + \Bigl(1-\sum_{i=1}^{k-1}(T(y))_i\Bigr)\ln\phi_k\right) \\ &= \exp\!\left((T(y))_1\ln\frac{\phi_1}{\phi_k} + \cdots + (T(y))_{k-1}\ln\frac{\phi_{k-1}}{\phi_k} + \ln\phi_k\right) \\ &= b(y)\exp\!\left(\eta^T T(y) - a(\eta)\right) \end{aligned}$$
So:

$$\begin{aligned} b(y) &= 1 \\ \eta &= \begin{bmatrix} \ln\frac{\phi_1}{\phi_k} \\ \vdots \\ \ln\frac{\phi_{k-1}}{\phi_k} \end{bmatrix} \\ a(\eta) &= -\ln(\phi_k) \end{aligned}$$
From this we can derive (defining $\eta_k = \ln\frac{\phi_k}{\phi_k} = 0$ for convenience, so the sum can run over all $k$ terms):

$$\begin{aligned} \eta_i &= \ln\frac{\phi_i}{\phi_k} \\ \phi_k e^{\eta_i} &= \phi_i \\ \phi_k \sum_{i=1}^k e^{\eta_i} &= \sum_{i=1}^k \phi_i = 1 \\ \phi_k &= \frac{1}{\sum_{i=1}^k e^{\eta_i}} \end{aligned}$$
Substituting $\phi_k$ back into $\phi_k e^{\eta_i} = \phi_i$ gives:

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}}$$
By assumptions 2 and 3:

$$\begin{aligned} p(y=i \mid \boldsymbol x; \boldsymbol\theta) &= \phi_i \\ &= \frac{e^{\eta_i}}{\sum_{j=1}^k e^{\eta_j}} \\ &= \frac{e^{\boldsymbol\theta_i^T \boldsymbol x}}{\sum_{j=1}^k e^{\boldsymbol\theta_j^T \boldsymbol x}} \end{aligned}$$

$$\begin{aligned} h_{\boldsymbol\theta}(\boldsymbol x) &= E[T(y) \mid \boldsymbol x; \boldsymbol\theta] \\ &= \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\exp(\boldsymbol\theta_1^T \boldsymbol x)}{\sum_{j=1}^k \exp(\boldsymbol\theta_j^T \boldsymbol x)} \\ \frac{\exp(\boldsymbol\theta_2^T \boldsymbol x)}{\sum_{j=1}^k \exp(\boldsymbol\theta_j^T \boldsymbol x)} \\ \vdots \\ \frac{\exp(\boldsymbol\theta_{k-1}^T \boldsymbol x)}{\sum_{j=1}^k \exp(\boldsymbol\theta_j^T \boldsymbol x)} \end{bmatrix} \end{aligned}$$
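The softmax hypothesis above can be sketched directly. This minimal version (all parameter values are hypothetical) returns the probabilities of all $k$ classes for convenience, rather than only the first $k-1$, and subtracts the maximum score before exponentiating, a standard numerical-stability trick that cancels in the ratio:

```python
import math

def h_softmax(thetas, x):
    """GLM hypothesis for a multinomial response: softmax over theta_i^T x.
    thetas holds one parameter vector per class (theta_k may be fixed to zeros)."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical parameters for a 3-class problem; last class pinned at zero
thetas = [[0.2, 1.0], [-0.5, 0.3], [0.0, 0.0]]
x = [1.0, 2.0]
phis = h_softmax(thetas, x)
assert abs(sum(phis) - 1.0) < 1e-12   # probabilities sum to one
assert all(p > 0 for p in phis)
```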

Deriving the Loss Function

The likelihood $L(\boldsymbol\theta)$ of the observed samples is:

$$L(\boldsymbol\theta) = \prod_{i=1}^m p(y^{(i)} \mid \boldsymbol x^{(i)}; \boldsymbol\theta)$$

Taking logs on both sides:

$$\begin{aligned} \ln L(\boldsymbol\theta) &= \sum_{i=1}^m \ln p(y^{(i)} \mid \boldsymbol x^{(i)}; \boldsymbol\theta) \\ &= \sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)}=k\}\ln\phi_k^{(i)} \\ &= \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)}\ln\phi_k^{(i)} \end{aligned}$$

where $y_k^{(i)} = 1\{y^{(i)}=k\}$ is the one-hot encoding of $y^{(i)}$ and $\phi_k^{(i)}$ is the predicted probability of class $k$ for sample $i$.
By maximum likelihood estimation, our goal is to maximize $L$:

$$\boldsymbol\theta = \arg\max L(\boldsymbol\theta)$$

So the loss function $J(\boldsymbol\theta)$ is:

$$J(\boldsymbol\theta) = -\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)}\ln\phi_k^{(i)}$$

This is also known as the cross-entropy loss.
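The cross-entropy loss can be sketched in a few lines. In this illustration the one-hot labels `Y` and predicted probabilities `Phi` are made-up toy data, not outputs of any fitted model:

```python
import math

def cross_entropy(y_onehot, phis):
    """Loss contribution of one sample: -sum_k y_k * ln(phi_k)."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, phis) if y > 0)

def total_loss(Y, Phi):
    """J(theta) = -sum_i sum_k y_k^(i) * ln(phi_k^(i)) over all m samples."""
    return sum(cross_entropy(y, p) for y, p in zip(Y, Phi))

# toy example: two samples, three classes (predicted phis are hypothetical)
Y = [[1, 0, 0], [0, 0, 1]]
Phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
loss = total_loss(Y, Phi)
# only the true-class probabilities contribute: -ln(0.7) - ln(0.6)
assert abs(loss - (-math.log(0.7) - math.log(0.6))) < 1e-12
```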
