Choi H. I. Lecture 4: Exponential family of distributions and generalized linear model (GLM).
Definition
Definition: a distribution belongs to the exponential family if its density function has the form
$$
f_{\theta}(x) = \frac{1}{Z(\theta)} h(x) e^{\langle T(x), \theta \rangle}.
$$
Here $x \in \mathbb{R}^m$, $T(x) = (T_1(x), T_2(x), \cdots, T_k(x)) \in \mathbb{R}^k$, $\theta = (\theta_1, \theta_2, \cdots, \theta_k)$ is the unknown parameter, and $Z(\theta) = \int h(x) e^{\langle T(x), \theta \rangle} \mathrm{d}x$ is the normalizing constant.
Setting $C(x) = \log h(x)$ and $A(\theta) = \log Z(\theta)$ gives
$$
f_{\theta}(x) = \exp \left(\langle T(x), \theta \rangle - A(\theta) + C(x)\right).
$$
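The form above can be evaluated directly once $T$, $A$ and $C$ are supplied. The following is a minimal sketch, not a reference implementation: the helper name `exp_family_density` and the `bern_*` functions are hypothetical, and the Bernoulli instance (with $T(x) = x$, $A(\theta) = \log(1 + e^{\theta})$, $C(x) = 0$) is used only as an illustration.

```python
import math

def exp_family_density(x, theta, T, A, C):
    """Evaluate f_theta(x) = exp(<T(x), theta> - A(theta) + C(x))."""
    inner = sum(t * th for t, th in zip(T(x), theta))
    return math.exp(inner - A(theta) + C(x))

# Bernoulli written as a one-parameter exponential family (illustrative choice).
def bern_T(x):
    return (x,)

def bern_A(theta):
    return math.log1p(math.exp(theta[0]))  # log(1 + e^theta)

def bern_C(x):
    return 0.0  # h(x) = 1, so C(x) = log h(x) = 0

p = 0.3
theta = (math.log(p / (1 - p)),)  # natural parameter: the log-odds of p
probs = [exp_family_density(x, theta, bern_T, bern_A, bern_C) for x in (0, 1)]
```

Under these assumptions `probs` should agree with $(1 - p, p)$ up to floating-point error.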
The exponential family also has a more general form:
$$
f_{\theta}(x) = \exp \left(\frac{\langle T(x), \theta \rangle - A(\theta)}{\phi} + C(x, \phi)\right),
$$
or, more generally still,
$$
f_{\theta}(x) = \exp \left(\frac{\langle T(x), \lambda(\theta) \rangle - A(\theta)}{\phi} + C(x, \phi)\right),
$$
where $\phi$ controls the shape of the distribution.
Properties
$A(\theta)$
Proposition 1:
$$
\nabla_{\theta}A(\theta) = \int f_{\theta}(x) T(x) \mathrm{d}x = \mathbb{E}[T(X)].
$$
Proof:
We know that
$$
\int f_{\theta}(x) \mathrm{d}x = \int \exp \left(\frac{\langle T(x), \theta \rangle - A(\theta)}{\phi} + C(x, \phi)\right) \mathrm{d}x = 1.
$$
Taking the gradient of both sides with respect to $\theta$, and then using $\int f_{\theta}(x) \mathrm{d}x = 1$ on the $\nabla_{\theta} A(\theta)$ term, gives
$$
\int f_{\theta}(x) \frac{T(x) - \nabla_{\theta} A(\theta)}{\phi} \mathrm{d}x = 0 \Rightarrow \nabla_{\theta} A(\theta) = \mathbb{E}[T(X)].
$$
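Proposition 1 can be checked numerically on a concrete member of the family. The sketch below assumes the Bernoulli distribution with $\phi = 1$, $T(x) = x$ and $A(\theta) = \log(1 + e^{\theta})$, and compares a central finite difference of $A$ with $\mathbb{E}[T(X)]$. The function names and the step size `h` are illustrative choices, not part of the lecture.

```python
import math

def A(theta):
    """Log-partition function of the Bernoulli family: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def expected_T(theta):
    """E[T(X)] = 0 * P(X=0) + 1 * P(X=1), with P(X=1) = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return 0 * (1 - p) + 1 * p

def central_diff(f, t, h=1e-6):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

thetas = [-2.0, 0.0, 1.5]
# Gap between the numerical derivative of A and E[T(X)] at each theta.
gaps = [abs(central_diff(A, t) - expected_T(t)) for t in thetas]
```

Each entry of `gaps` should be tiny (limited only by finite-difference and rounding error), which is consistent with $\nabla_{\theta} A(\theta) = \mathbb{E}[T(X)]$.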
Proposition 2:
$$
D^2_{\theta} A = \left(\frac{\partial^2 A}{\partial\theta_i \partial \theta_j}\right) = \frac{1}{\phi}\mathrm{Cov}(T(X), T(X)) = \frac{1}{\phi}\mathrm{Cov}(T(X)).
$$
Proof:
By Proposition 1,
$$
\frac{\partial A}{\partial \theta_i} = \int \exp \left(\frac{\langle T(x), \theta \rangle - A(\theta)}{\phi} + C(x, \phi)\right) T_i(x) \mathrm{d}x.
$$
Differentiating with respect to $\theta_j$, and using $\int f_{\theta}(x) \left(T_j(x) - \frac{\partial A}{\partial \theta_j}\right) \mathrm{d}x = 0$ to subtract the constant $\frac{\partial A}{\partial \theta_i}$ in the second step:
$$
\begin{aligned}
\frac{\partial^2 A}{\partial \theta_i \partial \theta_j}
&= \int f_{\theta}(x) \frac{T_j (x) - \frac{\partial A}{\partial \theta_j}}{\phi} T_i(x) \mathrm{d}x \\
&= \frac{1}{\phi}\int f_{\theta}(x) \left(T_j(x) - \frac{\partial A}{\partial \theta_j}\right) \left(T_i(x) - \frac{\partial A}{\partial \theta_i}\right)\mathrm{d}x \\
&= \frac{1}{\phi}\mathrm{Cov}(T_i(X), T_j(X)).
\end{aligned}
$$
Corollary 1: $A(\theta)$ is a convex function of $\theta$, since its Hessian is positive semidefinite (a covariance matrix scaled by $1/\phi$, assuming $\phi > 0$).
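As a rough numerical illustration of Proposition 2 and Corollary 1, the sketch below takes the exponential distribution with $\phi = 1$, $T(x) = -x$ and $A(\theta) = -\log \theta$, and compares a second finite difference of $A$ with $\mathrm{Var}(T(X))$ computed by midpoint integration. The function names, the truncation point `upper` and the grid size `n` are assumptions made for this example only.

```python
import math

def A(theta):
    """Log-partition function of the exponential distribution: -log(theta)."""
    return -math.log(theta)

def second_diff(f, t, h=1e-4):
    """Second-order central finite difference approximating f''(t)."""
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

def var_T(theta, upper=60.0, n=100_000):
    """Var(T(X)) for T(x) = -x, by midpoint integration on [0, upper / theta]."""
    dx = upper / theta / n
    m1 = 0.0  # accumulates E[T(X)]
    m2 = 0.0  # accumulates E[T(X)^2]
    for k in range(n):
        x = (k + 0.5) * dx
        w = theta * math.exp(-theta * x) * dx  # density times cell width
        m1 += -x * w
        m2 += x * x * w
    return m2 - m1 ** 2

thetas = [0.5, 2.0]
hessians = [second_diff(A, t) for t in thetas]
variances = [var_T(t) for t in thetas]
```

The two lists should agree closely, and every entry of `hessians` is positive, which matches the convexity of $A$ in this one-dimensional case.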
Maximum likelihood estimation
Given $n$ samples $\{x^i\}_{i=1}^n$, the log-likelihood is
$$
l(\theta) = \frac{1}{\phi}\left[\left\langle \theta, \sum_{i=1}^n T(x^i) \right\rangle - nA(\theta)\right] + \sum_{i=1}^n C(x^i, \phi).
$$
Because $A(\theta)$ is convex, the log-likelihood is concave in $\theta$, so any stationary point is a global maximizer. Since
$$
\nabla_{\theta} l(\theta) = \frac{1}{\phi}\left[\sum_{i=1}^n T(x^i) - n \nabla_{\theta}A(\theta)\right],
$$
the maximum is attained where
$$
\nabla_{\theta}A(\theta) = \frac{1}{n} \sum_{i=1}^n T(x^i).
$$
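The stationarity condition $\nabla_{\theta} A(\theta) = \frac{1}{n}\sum_i T(x^i)$ can be solved numerically when no closed form is convenient. The sketch below applies Newton's method in the scalar case, using the exponential distribution ($T(x) = -x$, $A(\theta) = -\log\theta$) as a test case where the answer $1/\bar{x}$ is known. The helper `mle_newton`, the sample values and the starting point are all made up for illustration.

```python
def mle_newton(t_values, grad_A, hess_A, theta0, steps=50):
    """Solve grad_A(theta) = mean(T(x^i)) for scalar theta by Newton's method."""
    t_bar = sum(t_values) / len(t_values)
    theta = theta0
    for _ in range(steps):
        # Newton step on g(theta) = grad_A(theta) - t_bar, with g' = hess_A.
        theta -= (grad_A(theta) - t_bar) / hess_A(theta)
    return theta

# Exponential distribution: T(x) = -x, A(theta) = -log(theta).
samples = [0.5, 1.5, 2.0, 4.0]              # made-up data with mean 2.0
t_values = [-x for x in samples]
theta_hat = mle_newton(
    t_values,
    grad_A=lambda th: -1.0 / th,            # A'(theta)
    hess_A=lambda th: 1.0 / th ** 2,        # A''(theta)
    theta0=0.1,
)
closed_form = len(samples) / sum(samples)   # 1 / sample mean
```

Note that $\phi$ cancels out of the condition, so it does not appear in the solver. In this particular example Newton's method only converges for starting points in $(0, 2/\bar{x})$, so the choice `theta0=0.1` matters.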
Maximum entropy
Exponential family distributions are in fact maximum entropy distributions, that is, the distributions obtained when no preference of any kind is imposed. The goal is
$$
\max_{f} \quad H(f) = -\int f(x)\log f(x) \mathrm{d} x,
$$
which is equivalent to the minimization
$$
\min_f \int f(x)\log f(x) \mathrm{d}x.
$$
Usually some statistical information is already known, typically expressed as expectations:
$$
\int f(x) h_i(x) \mathrm{d}x = c_i, \quad i=1,2,\cdots, s.
$$
The actual problem is therefore
$$
\begin{aligned}
\min_f \quad & \int f(x)\log f(x) \mathrm{d}x \\
\mathrm{s.t.} \quad & \int f(x) h_i(x) \mathrm{d}x = c_i, \quad i=0,1,\cdots, s,
\end{aligned}
$$
where $h_0 = 1, c_0 = 1$, i.e. the density must satisfy $\int f(x) \mathrm{d} x = 1$.
Introducing Lagrange multipliers gives
$$
J(f,\lambda) = \int f(x)\log f(x) \mathrm{d}x + \lambda_0 \left(1 - \int f(x) \mathrm{d}x\right) + \sum_{i=1}^s \lambda_i \left[c_i - \int f(x) h_i(x) \mathrm{d}x\right].
$$
The optimality condition is that the variation of $J$ with respect to $f$ vanishes, i.e.
$$
1 + \log f(x) - \lambda_0 - \sum_{i=1}^s \lambda_i h_i(x) = 0.
$$
Solving for $f$ and writing $Z = e^{1 - \lambda_0}$, we get
$$
f(x) = \frac{1}{Z} \exp\left(\sum_{i=1}^s \lambda_i h_i(x)\right),
$$
which belongs to the exponential family.
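To make the maximum entropy solution concrete, the sketch below works through a discrete analogue, with sums in place of integrals: a distribution on $\{1, \dots, 6\}$ with a single constraint $h_1(x) = x$, $c_1 = 4.5$. The solution then has the form $f(x) \propto e^{\lambda_1 x}$, and $\lambda_1$ is found by bisection on the mean constraint. The support, the target mean, the bracketing interval and all function names are assumptions chosen for illustration.

```python
import math

def maxent_probs(lam, support):
    """f(x) = exp(lam * x) / Z on a finite support."""
    weights = [math.exp(lam * x) for x in support]
    Z = sum(weights)  # normalizing constant
    return [w / Z for w in weights]

def mean_under(lam, support):
    """Expectation of h_1(x) = x under the maximum entropy form."""
    return sum(x * p for x, p in zip(support, maxent_probs(lam, support)))

def solve_lambda(target_mean, support, lo=-5.0, hi=5.0, iters=100):
    """Bisection on lam; the mean is increasing in lam, so the root is unique."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_under(mid, support) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = [1, 2, 3, 4, 5, 6]
lam = solve_lambda(4.5, support)
probs = maxent_probs(lam, support)
```

Because the target mean exceeds the uniform mean of 3.5, the fitted `lam` is positive and `probs` increases with $x$. With no constraint beyond normalization ($\lambda_1 = 0$) the same form reduces to the uniform distribution.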
Examples
Bernoulli
$$
P(x) = p^x (1-p)^{1-x} = \exp\left[x\log\frac{p}{1-p} + \log (1 - p)\right].
$$
$$
\begin{aligned}
\theta &= \log \frac{p}{1-p}, \\
T(x) &= x, \\
A(\theta) &= \log (1 + e^{\theta}), \\
h(x) &= 1.
\end{aligned}
$$
Exponential distribution
$$
p(x) = \lambda \cdot e^{-\lambda x}=\exp[-\lambda x +\log \lambda ], \quad x \ge 0.
$$
$$
\begin{aligned}
\theta &= \lambda, \\
T(x) &= -x, \\
A(\theta) &= \log \frac{1}{\lambda} = -\log\theta, \\
h(x) &= \mathbb{I}(x\ge0).
\end{aligned}
$$
Normal distribution
$$
p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left[-\frac{(x-\mu)^2}{2\sigma^2}\right].
$$
Treating $\sigma$ as a known parameter:
$$
p(x) = \exp \left[\frac{-\frac{1}{2}x^2 + x\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{1}{2}\log (2\pi \sigma^2)\right].
$$
$$
\begin{aligned}
\theta &= (\mu, 1), \\
T(x) &= \left(x, -\frac{1}{2}x^2\right), \\
\phi &= \sigma^2, \\
A(\theta) &= \frac{1}{2}\mu^2, \\
C(x, \phi) &= -\frac{1}{2} \log (2\pi \sigma^2).
\end{aligned}
$$
Treating $\sigma$ as an unknown parameter:
$$
p(x) = \exp \left[-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2 - \log \sigma - \frac{1}{2}\log 2\pi\right].
$$
$$
\begin{aligned}
T(x) &= \left(x, \frac{1}{2}x^2\right), \\
\theta &= \left(\frac{\mu}{\sigma^2}, -\frac{1}{\sigma^2}\right), \\
A(\theta) &= \frac{\mu^2}{2\sigma^2} + \log\sigma, \\
C(x) &= -\frac{1}{2}\log(2\pi).
\end{aligned}
$$
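The parameterization with unknown $\sigma$ can be sanity-checked by comparing it against the usual normal density. In the sketch below, $A$ is computed from the natural parameters alone via $A(\theta) = -\frac{\theta_1^2}{2\theta_2} - \frac{1}{2}\log(-\theta_2)$, which is the same quantity as $\frac{\mu^2}{2\sigma^2} + \log\sigma$ rewritten in terms of $\theta$. The function names and the test values of $\mu$, $\sigma$ and $x$ are arbitrary illustrative choices.

```python
import math

def normal_pdf(x, mu, sigma):
    """Standard form of the normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_exp_family(x, mu, sigma):
    """Same density written as exp(<T(x), theta> - A(theta) + C(x))."""
    T = (x, 0.5 * x ** 2)
    theta = (mu / sigma ** 2, -1.0 / sigma ** 2)
    # A expressed through theta only: equals mu^2 / (2 sigma^2) + log(sigma).
    A = -theta[0] ** 2 / (2 * theta[1]) - 0.5 * math.log(-theta[1])
    C = -0.5 * math.log(2 * math.pi)
    return math.exp(T[0] * theta[0] + T[1] * theta[1] - A + C)

mu, sigma = 1.2, 0.7
xs = [-1.0, 0.0, 1.2, 3.0]
# Relative gap between the two forms of the density at each test point.
rel_gaps = [
    abs(normal_exp_family(x, mu, sigma) - normal_pdf(x, mu, sigma)) / normal_pdf(x, mu, sigma)
    for x in xs
]
```

All entries of `rel_gaps` should be at the level of floating-point rounding, which confirms that the two forms describe the same density for these values.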