Proof

Claim: the minimum value of $\operatorname{Ent}(D)$ is $0$ and the maximum value is $\log_2|\mathcal{Y}|$, where
$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log_{2} p_{k}$$
with $0\le p_k\le 1$ and $\sum_{k=1}^{|\mathcal{Y}|}p_k=1$.
Maximum
Let $|\mathcal{Y}|=n$ and $p_k=x_k$. The entropy $\operatorname{Ent}(D)$ can then be viewed as an $n$-variable real-valued function:
$$\operatorname{Ent}(D)=f\left(x_{1}, \ldots, x_{n}\right)=-\sum_{k=1}^{n} x_{k} \log_{2} x_{k}$$
First consider only the equality constraint $\sum_{k=1}^{n}x_k=1$ (setting aside $0\le x_k\le 1$ for now). Maximizing $\operatorname{Ent}(D)$ is then equivalent to the following minimization problem:
$$\begin{aligned}&\min \sum_{k=1}^{n} x_{k} \log_{2} x_{k} \\&\text{ s.t. } \sum_{k=1}^{n} x_{k}=1\end{aligned}$$
Since $x\log_2 x$ is convex on $0\le x_k\le 1$, this is a convex optimization problem, and for a convex problem any point satisfying the KKT conditions is a global optimum. Because this minimization problem contains only an equality constraint, the KKT points are exactly the points at which the first-order partial derivatives of the Lagrangian are $0$.
By the method of Lagrange multipliers, the Lagrangian of this problem is
$$L\left(x_{1}, \ldots, x_{n}, \lambda\right)=\sum_{k=1}^{n} x_{k} \log_{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)$$
Taking the first-order partial derivative of the Lagrangian with respect to each of $x_1,x_2,\ldots,x_n,\lambda$ and setting it to $0$ gives
$$\frac{\partial L\left(x_{1}, \ldots, x_{n}, \lambda\right)}{\partial x_{1}}=\frac{\partial}{\partial x_{1}}\left[\sum_{k=1}^{n} x_{k} \log_{2} x_{k}+\lambda\left(\sum_{k=1}^{n} x_{k}-1\right)\right]=0$$
$$\Rightarrow \lambda=-\log_{2} x_{1}-\frac{1}{\ln 2}$$
Similarly,
$$\lambda=-\log_{2} x_{1}-\frac{1}{\ln 2}=-\log_{2} x_{2}-\frac{1}{\ln 2}=\ldots=-\log_{2} x_{n}-\frac{1}{\ln 2}$$
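The stationarity condition above can be checked symbolically (a minimal sketch using `sympy`; not part of the original derivation): the derivative of the objective term $x\log_2 x$ is $\log_2 x+\frac{1}{\ln 2}$, so $\frac{\partial L}{\partial x_k}=0$ yields the same $\lambda=-\log_2 x_k-\frac{1}{\ln 2}$ for every $k$.

```python
import sympy as sp

x = sp.symbols('x', positive=True)

# Derivative of one objective term x * log2(x).
deriv = sp.diff(x * sp.log(x, 2), x)

# Stationarity dL/dx_k = deriv + lambda = 0 means
# lambda = -log2(x_k) - 1/ln(2), identical for every k.
expected = sp.log(x, 2) + 1 / sp.log(2)
assert sp.simplify(deriv - expected) == 0
```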
Combined with the constraint $\sum_{k=1}^{n}x_k=1$, this yields $x_{1}=x_{2}=\ldots=x_{n}=\frac{1}{n}$.
Substituting back gives
$$f\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)=-\sum_{k=1}^{n} \frac{1}{n} \log_{2} \frac{1}{n}=-n \cdot \frac{1}{n} \log_{2} \frac{1}{n}=\log_{2} n$$
Since this solution also satisfies $0\le x_k\le 1$, it is feasible, and the maximum value is $\log_{2} n$, i.e. $\log_{2}|\mathcal{Y}|$.
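This maximum can also be confirmed numerically (a minimal sketch; the helper `ent`, the choice $n=4$, and the sample count are illustrative assumptions): the uniform distribution attains exactly $\log_2 n$, and randomly drawn distributions never exceed it.

```python
import math
import random

def ent(p):
    """Entropy in bits, with the convention 0 * log2(0) = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

n = 4
uniform = [1.0 / n] * n
# The uniform distribution attains log2(n) exactly.
assert abs(ent(uniform) - math.log2(n)) < 1e-12

# Random distributions over n outcomes never exceed log2(n).
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    p = [x / sum(w) for x in w]
    assert ent(p) <= math.log2(n) + 1e-12
```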
Minimum
Now consider each summand of the entropy separately and define
$$g\left(x_{k}\right)=-x_{k} \log_{2} x_{k}$$
Taking the second derivative of $g(x_1)$ with respect to $x_1$:
$$g^{\prime \prime}\left(x_{1}\right)=\frac{d\left(g^{\prime}\left(x_{1}\right)\right)}{d x_{1}}=\frac{d\left(-\log_{2} x_{1}-\frac{1}{\ln 2}\right)}{d x_{1}}=-\frac{1}{x_{1} \ln 2}$$
Since $g^{\prime\prime}(x)<0$ everywhere on its domain, and $g^{\prime}(x)>0$ as $x$ approaches $0$, $g(x)$ is a concave (downward-opening) function on its domain, so its minimum must be attained at the boundary points $0$ and $1$:
$g(0)=0$ (when computing entropy one adopts the convention that if $x_k=0$ then $x_k\log_2 x_k=0$)

$g(1)=0$

Therefore, setting some $x_k=1$ and all the others to $0$ attains the minimum value $0$.
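The boundary case can likewise be checked numerically (a minimal sketch; the helper `ent` is an illustrative assumption): a one-hot distribution attains entropy $0$ under the convention $0\log_2 0=0$.

```python
import math

def ent(p):
    """Entropy in bits, with the convention 0 * log2(0) = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# One x_k = 1 and the rest 0: the minimum value 0 is attained.
assert ent([1.0, 0.0, 0.0, 0.0]) == 0.0

# Any distribution with mass on more than one outcome is strictly positive.
assert ent([0.9, 0.1]) > 0.0
```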
In summary, the minimum value of $\operatorname{Ent}(D)$ is $0$ and the maximum value is $\log_{2}|\mathcal{Y}|$, which completes the proof.