What is the relationship between the loss functions we commonly use in machine learning, such as MLE (maximum likelihood estimation), KL divergence, cross-entropy (H), and mean squared error? This post gives a brief explanation of how they relate.
Mean Squared Error vs. MLE
If your data come from an exponential-family distribution, the two are equivalent. The detailed proof is omitted here; see the following paper.
Charnes, A., Frome, E. L., Yu, P. L. (1976).
The equivalence of generalized least squares and
maximum likelihood estimates in the exponential family.
Journal of the American Statistical Association, 71(353), 169–171.
https://doi.org/10.1080/01621459.1976.10481508
Why no proof here? Because, honestly, I have not read the paper myself.
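We can at least check the claim numerically in the simplest exponential-family case. The sketch below (my own toy example, not the paper's proof) assumes Gaussian data with known unit variance and fits only the mean: minimizing the MSE and minimizing the negative Gaussian log-likelihood pick out the same parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)  # toy data, true mean = 3

mus = np.linspace(0.0, 6.0, 601)  # candidate values for the mean parameter

# Mean squared error for each candidate mean
mse = np.array([np.mean((x - mu) ** 2) for mu in mus])

# Negative Gaussian log-likelihood (sigma fixed at 1) for each candidate mean
nll = np.array([0.5 * np.sum((x - mu) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)
                for mu in mus])

# nll = (N/2) * mse + const, so both criteria have the same minimizer
print(mus[np.argmin(mse)], mus[np.argmin(nll)])
```

Since the negative log-likelihood is just an affine transform of the MSE here, the two curves are minimized at the same point (the sample mean).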
KL vs. Cross-Entropy
$$
\begin{aligned}
KL(\theta^t,\theta)&=\int P(x|\theta^t)\log \frac{P(x|\theta^t)}{P(x|\theta)}\,dx\\
H(\theta^t, \theta)&=-\int P(x|\theta^t)\log P(x|\theta)\,dx\\
H(\theta^t)&=-\int P(x|\theta^t)\log P(x|\theta^t)\,dx
\end{aligned}
$$
$$
H(\theta^t,\theta)=KL(\theta^t,\theta)+H(\theta^t)
$$
where $\theta^t$ denotes the parameters of the true (data-generating) distribution, and $\theta$ the parameters of the model we are estimating.
Since $H(\theta^t)$ does not depend on $\theta$, minimizing the cross-entropy and minimizing the KL divergence are effectively the same thing.
$$
\arg\min_{\theta} H(\theta^t,\theta) \equiv \arg\min_\theta KL(\theta^t,\theta)
$$
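The decomposition above is easy to verify numerically for a small discrete distribution. A minimal sketch (the two distributions below are made up for illustration):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution P(x | theta^t)
q = np.array([0.4, 0.4, 0.2])   # model distribution P(x | theta)

kl = np.sum(p * np.log(p / q))          # KL(theta^t, theta)
cross_entropy = -np.sum(p * np.log(q))  # H(theta^t, theta)
entropy = -np.sum(p * np.log(p))        # H(theta^t)

# H(theta^t, theta) = KL(theta^t, theta) + H(theta^t)
print(np.isclose(cross_entropy, kl + entropy))  # True
```

Because `entropy` is a constant with respect to `q`, any `q` that minimizes `cross_entropy` also minimizes `kl`.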
KL vs MLE
Now let us look at the relationship between maximum likelihood estimation and the KL divergence.
The second line subtracts the constant $\sum_{i=1}^{N}\log P(x_i|\theta^t)$, which does not depend on $\theta$ and therefore does not change the maximizer; the last line rescales by the constant $1/N$, which likewise changes nothing.
$$
\begin{aligned}
L(\theta)&=\arg\max_\theta\sum_{i=1}^{N} \log P(x_i|\theta)\\
&=\arg\max_\theta \left[\sum_{i=1}^{N}\log P(x_i|\theta)-\sum_{i=1}^{N} \log P(x_i|\theta^t)\right]\\
&=\arg\max_\theta \sum_{i=1}^{N}\log \frac{P(x_i|\theta)}{P(x_i|\theta^t)}\\
&=\arg\min_\theta \sum_{i=1}^{N}\log \frac{P(x_i|\theta^t)}{P(x_i|\theta)}\\
&=\arg\min_\theta\frac{1}{N} \sum_{i=1}^{N}\log \frac{P(x_i|\theta^t)}{P(x_i|\theta)}
\end{aligned}
$$
As $N \rightarrow +\infty$, the law of large numbers tells us that the sample average in the last line converges to the expectation of $\log \frac{P(x|\theta^t)}{P(x|\theta)}$ under $P(x|\theta^t)$.
$$
\begin{aligned}
\arg\min_\theta \mathbb{E}\left[\log \frac{P(x|\theta^t)}{P(x|\theta)}\right] &= \arg\min_\theta\int P(x|\theta^t) \log \frac{P(x|\theta^t)}{P(x|\theta)}\,dx\\
&= \arg\min_\theta KL(\theta^t,\theta)
\end{aligned}
$$
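The limiting argument can be sanity-checked with Monte Carlo: for two unit-variance Gaussians (my own illustrative choice, since their KL divergence has a closed form), the sample average of the log-likelihood ratio over draws from the true distribution approaches the exact KL divergence.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_t, mu = 0.0, 1.0  # theta^t (true) and theta (model), unit-variance Gaussians

x = rng.normal(mu_t, 1.0, size=200_000)  # samples from P(x | theta^t)

# log P(x|theta^t) - log P(x|theta); the normalizing constants cancel
log_ratio = -0.5 * (x - mu_t) ** 2 + 0.5 * (x - mu) ** 2

# Empirical average (1/N) * sum_i log [P(x_i|theta^t) / P(x_i|theta)]
mc_kl = log_ratio.mean()

# Closed form: KL(N(mu_t,1) || N(mu,1)) = (mu_t - mu)^2 / 2
exact_kl = 0.5 * (mu_t - mu) ** 2

print(mc_kl, exact_kl)  # the two values agree closely for large N
```

With 200,000 samples the Monte Carlo estimate lands within a couple of standard errors of the exact value $0.5$.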
Thus MLE is also equivalent to minimizing the KL divergence.
Summary
In summary: minimizing the KL divergence, minimizing the cross-entropy, and MLE are equivalent in general, and mean squared error joins this family when the data follow an exponential-family (e.g. Gaussian) distribution.