The exponential family
A class of distributions is in the exponential family if it can be written in the form
p(y;η) = b(y) exp(η^T T(y) − a(η))
where:
- η : the natural parameter (also called the canonical parameter)
- T(y) : the sufficient statistic (it is often the case that T(y) = y)
- a(η) : the log partition function (e^(−a(η)) plays the role of a normalization constant)
In the exponential family, fixing the functional forms of T, a, and b determines a family of distributions parameterized by η.
Bernoulli distribution family:
p(y;ϕ) = ϕ^y (1−ϕ)^(1−y) = exp(y log ϕ + (1−y) log(1−ϕ)) = exp(log(ϕ/(1−ϕ)) · y + log(1−ϕ))
thus we have:
- η = log(ϕ/(1−ϕ))
- ϕ = 1/(1+e^(−η)) (the Sigmoid function!)
- T(y) = y
- a(η) = −log(1−ϕ) = log(1+e^η)
- b(y) = 1
The Bernoulli distribution is thus an example of the exponential family. Notably, when we write the Bernoulli distribution in exponential-family form and express the probability ϕ of y=1 in terms of the parameter η, the logistic function arises naturally: ϕ = 1/(1+e^(−η)). We will elaborate on this result when we study GLMs below.
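As a quick numerical check (a minimal sketch, not part of the original notes), the standard Bernoulli pmf and its exponential-family form above agree for any ϕ:

```python
import math

def bernoulli_pmf(y, phi):
    """Standard Bernoulli pmf: phi^y * (1-phi)^(1-y)."""
    return phi**y * (1 - phi)**(1 - y)

def bernoulli_exp_family(y, phi):
    """Same pmf written as b(y) * exp(eta*T(y) - a(eta))."""
    eta = math.log(phi / (1 - phi))   # natural parameter eta = log(phi/(1-phi))
    a = math.log(1 + math.exp(eta))   # log partition function a(eta)
    b = 1.0                           # b(y) = 1
    T = y                             # T(y) = y
    return b * math.exp(eta * T - a)

phi = 0.3  # arbitrary example value
for y in (0, 1):
    assert abs(bernoulli_pmf(y, phi) - bernoulli_exp_family(y, phi)) < 1e-12
```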
Gaussian distribution family (for simplicity we set σ^2 = 1):
p(y;μ) = (1/√(2π)) exp(−(1/2)(y−μ)^2) = (1/√(2π)) exp(−(1/2)y^2) · exp(μy − (1/2)μ^2)
thus we have:
- η = μ
- T(y) = y
- a(η) = μ^2/2 = η^2/2
- b(y) = (1/√(2π)) exp(−y^2/2)
The Gaussian distribution is also an example of the exponential family. For the Gaussian, however, the mean μ (which is also the y we wish to predict) is exactly the natural parameter η of the corresponding exponential-family form. We will see below why it is useful to write these distributions as exponential-family distributions parameterized by η.
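The Gaussian factorization above can likewise be checked numerically (a minimal sketch with arbitrary example values):

```python
import math

def gaussian_pdf(y, mu):
    """N(mu, 1) density in its standard form."""
    return (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * (y - mu)**2)

def gaussian_exp_family(y, mu):
    """Same density written as b(y) * exp(eta*T(y) - a(eta))."""
    eta = mu                                                # eta = mu
    a = eta**2 / 2                                          # a(eta) = eta^2/2
    b = (1 / math.sqrt(2 * math.pi)) * math.exp(-y**2 / 2)  # b(y)
    return b * math.exp(eta * y - a)                        # T(y) = y

for y, mu in [(0.5, 1.0), (-1.2, 0.3)]:  # arbitrary test points
    assert abs(gaussian_pdf(y, mu) - gaussian_exp_family(y, mu)) < 1e-12
```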
Constructing GLMs
Motivation: given the distribution family of the response variable (such as the Bernoulli or Gaussian distribution), how can we construct a regression/classification hypothesis?
Three assumptions for constructing a Generalized Linear Model:
- p(y|x;θ)∼ExponentialFamily(η)
- h(x)=E[T(y)|x] (in most cases T(y)=y, which gives h(x)=E[y|x])
- η=θTx (design choice)
A model h(x) obtained from these three assumptions is called a Generalized Linear Model. As we will see, GLMs constructed this way have many elegant properties that make learning simpler and more efficient.
Derivation of Ordinary Least Squares (OLS):
- probabilistic assumption: p(y|x) ∼ N(μ, σ^2) ∼ ExponentialFamily(η)
- canonical response function: g(η) = E[T(y)|x;η] = μ = η
- hypothesis: h_θ(x) = g(θ^T x) = θ^T x
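Since the canonical response function for the Gaussian case is the identity, the OLS hypothesis reduces to a dot product. A minimal sketch with made-up example parameters:

```python
def h_linear(theta, x):
    """GLM hypothesis for a Gaussian response: g(eta) = eta,
    so h(x) = g(theta^T x) = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))

theta = [2.0, -1.0, 0.5]  # hypothetical example parameters
x = [1.0, 3.0, 4.0]       # x[0] = 1 serves as the intercept term
assert h_linear(theta, x) == 2.0 - 3.0 + 2.0  # = 1.0
```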
Derivation of Logistic Regression:
- probabilistic assumption: p(y|x) ∼ Bernoulli(ϕ) ∼ ExponentialFamily(η)
- canonical response function: g(η) = E[T(y)|x;η] = ϕ = 1/(1+e^(−η))
- hypothesis: h_θ(x) = g(θ^T x) = 1/(1+e^(−θ^T x))
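For the Bernoulli case, the canonical response function is the sigmoid, so the hypothesis is the sigmoid of a dot product. A minimal sketch with hypothetical parameters:

```python
import math

def h_logistic(theta, x):
    """GLM hypothesis for a Bernoulli response:
    h(x) = g(theta^T x) = 1 / (1 + e^(-theta^T x))."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    return 1 / (1 + math.exp(-eta))

theta = [0.0, 1.0]  # hypothetical example parameters
# eta = 0 at this input, so the predicted probability is exactly 0.5
assert h_logistic(theta, [1.0, 0.0]) == 0.5
```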
Both linear regression and logistic regression are special cases of the generalized linear model, which also implies a deep connection between their learning algorithms.
Derivation of Softmax Regression:
- multi-class classification problem
- probabilistic assumption:
p(y|x) ∼ Multinomial(ϕ_1, ..., ϕ_(k−1)) ∼ ExponentialFamily(η)
with:
- T(y) ∈ R^(k−1), where T(y)_i = 1{y = i} (that is, 1 if y = i and 0 if y ≠ i)
- η ∈ R^(k−1), where η_i = log(ϕ_i/ϕ_k)
- a(η) = −log(ϕ_k) = −log(1 − Σ_(i=1)^(k−1) ϕ_i)
- b(y) = 1
- canonical response function:
g(η)_i = E[T(y)_i|x;η] = ϕ_i = e^(η_i) / (1 + Σ_(j=1)^(k−1) e^(η_j))
which is called the softmax function
- hypothesis:
[h_θ(x)]_i = g(θ^T x)_i = e^(θ_i^T x) / (1 + Σ_(j=1)^(k−1) e^(θ_j^T x))
which is called softmax regression
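The softmax hypothesis can be sketched directly from the formula above; note that only k−1 parameter vectors are needed, since class k receives the remaining probability mass ϕ_k = 1/(1 + Σ_j e^(η_j)). A minimal sketch with hypothetical parameters for a 3-class problem:

```python
import math

def softmax_hypothesis(thetas, x):
    """[h(x)]_i = e^(theta_i^T x) / (1 + sum_j e^(theta_j^T x))
    for i = 1..k-1; class k gets the remaining probability mass."""
    etas = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    denom = 1 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]
    probs.append(1 / denom)  # phi_k = 1 - sum of the other phi_i
    return probs

# hypothetical parameters: k = 3 classes, so k-1 = 2 parameter vectors
thetas = [[1.0, 0.0], [0.0, 1.0]]
x = [0.2, 0.5]
p = softmax_hypothesis(thetas, x)
assert abs(sum(p) - 1.0) < 1e-12 and all(q > 0 for q in p)
```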