6.1 The Bayes Decision Rule

The Bayes decision rule: to minimize the overall risk, we only need to choose, for each sample, the class label that minimizes the conditional risk $R(c \mid \boldsymbol{x})$, i.e.

$$h^{*}(\boldsymbol{x})=\underset{c \in \mathcal{Y}}{\arg \min }\, R(c \mid \boldsymbol{x})$$

Here $h^{*}$ is called the Bayes optimal classifier.

[Note: both $R$ and $h^{*}$ here operate on a single input sample. That is, for a single sample $\boldsymbol{x}$, $h^{*}(\boldsymbol{x})$ outputs the one class label $c$ at which $R$ attains its minimum.]
The conditional risk $R(c \mid \boldsymbol{x})$ is given by

$$R\left(c_{i} \mid \boldsymbol{x}\right)=\sum_{j=1}^{N} \lambda_{i j} P\left(c_{j} \mid \boldsymbol{x}\right)$$

As in the Watermelon Book, we assume there are $N$ possible class labels, $\mathcal{Y}=\left\{c_{1}, c_{2}, \ldots, c_{N}\right\}$, and $\lambda_{ij}$ is the loss incurred by misclassifying a sample whose true label is $c_{j}$ as $c_{i}$. If the goal is to minimize the classification error rate, the misclassification loss $\lambda_{ij}$ is the 0/1 loss, i.e.

$$\lambda_{i j}=\begin{cases} 0, & \text{if } i=j \\ 1, & \text{otherwise} \end{cases}$$
Under this loss, the conditional risk $R(c \mid \boldsymbol{x})$ expands to

$$\begin{aligned} R\left(c_{i} \mid \boldsymbol{x}\right) &=1 \times P\left(c_{1} \mid \boldsymbol{x}\right)+\ldots+1 \times P\left(c_{i-1} \mid \boldsymbol{x}\right)+0 \times P\left(c_{i} \mid \boldsymbol{x}\right)+1 \times P\left(c_{i+1} \mid \boldsymbol{x}\right)+\ldots+1 \times P\left(c_{N} \mid \boldsymbol{x}\right) \\ &=P\left(c_{1} \mid \boldsymbol{x}\right)+\ldots+P\left(c_{i-1} \mid \boldsymbol{x}\right)+P\left(c_{i+1} \mid \boldsymbol{x}\right)+\ldots+P\left(c_{N} \mid \boldsymbol{x}\right) \end{aligned}$$
Here every $\lambda$ equals 1 except $\lambda_{ii}=0$. Since $\sum_{j=1}^{N} P\left(c_{j} \mid \boldsymbol{x}\right)=1$, we obtain

$$R\left(c_{i} \mid \boldsymbol{x}\right)=1-P\left(c_{i} \mid \boldsymbol{x}\right)$$

which is Eq. (7.5) of the Watermelon Book.
Therefore, the Bayes optimal classifier that minimizes the error rate is

$$h^{*}(\boldsymbol{x})=\underset{c \in \mathcal{Y}}{\arg \min }\, R(c \mid \boldsymbol{x})=\underset{c \in \mathcal{Y}}{\arg \min }\,(1-P(c \mid \boldsymbol{x}))=\underset{c \in \mathcal{Y}}{\arg \max }\, P(c \mid \boldsymbol{x})$$
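This equivalence is easy to verify numerically. The sketch below uses a hypothetical posterior vector (my own made-up numbers) to compute the conditional risks under 0/1 loss and confirm that minimizing $R(c \mid \boldsymbol{x})$ picks the same class as maximizing $P(c \mid \boldsymbol{x})$:

```python
import numpy as np

# Hypothetical posteriors P(c_j | x) for N = 4 classes on one sample x.
posterior = np.array([0.1, 0.5, 0.3, 0.1])

# 0/1 loss matrix: lambda_ij = 0 if i == j, else 1.
N = len(posterior)
loss = np.ones((N, N)) - np.eye(N)

# Conditional risk R(c_i | x) = sum_j lambda_ij * P(c_j | x).
risk = loss @ posterior

# Under 0/1 loss, R(c_i | x) = 1 - P(c_i | x) ...
assert np.allclose(risk, 1.0 - posterior)
# ... so arg-min risk coincides with arg-max posterior.
assert risk.argmin() == posterior.argmax()
```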
6.2 Maximum Likelihood Estimation for the Multivariate Normal Distribution
The log-likelihood is

$$LL\left(\boldsymbol{\theta}_{c}\right)=\sum_{\boldsymbol{x} \in D_{c}} \log P\left(\boldsymbol{x} \mid \boldsymbol{\theta}_{c}\right)$$

which is Eq. (7.10) of the Watermelon Book. To simplify the subsequent computation, we take the base of the logarithm to be $e$, so the log-likelihood becomes

$$LL\left(\boldsymbol{\theta}_{c}\right)=\sum_{\boldsymbol{x} \in D_{c}} \ln P\left(\boldsymbol{x} \mid \boldsymbol{\theta}_{c}\right)$$
Since $P\left(\boldsymbol{x} \mid \boldsymbol{\theta}_{c}\right)=P(\boldsymbol{x} \mid c) \sim \mathcal{N}\left(\boldsymbol{\mu}_{c}, \boldsymbol{\Sigma}_{c}\right)$, we have

$$P\left(\boldsymbol{x} \mid \boldsymbol{\theta}_{c}\right)=\frac{1}{\sqrt{(2 \pi)^{d}\left|\boldsymbol{\Sigma}_{c}\right|}} \exp \left(-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_{c}\right)\right)$$

where $d$ is the dimension of $\boldsymbol{x}$, $\boldsymbol{\Sigma}_{c}$ is the symmetric positive definite covariance matrix, and $\left|\boldsymbol{\Sigma}_{c}\right|$ denotes its determinant. Substituting this into the log-likelihood gives
$$LL\left(\boldsymbol{\theta}_{c}\right)=\sum_{\boldsymbol{x} \in D_{c}} \ln \left[\frac{1}{\sqrt{(2 \pi)^{d}\left|\boldsymbol{\Sigma}_{c}\right|}} \exp \left(-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_{c}\right)\right)\right]$$

Let $\left|D_{c}\right|=N$; the log-likelihood can then be rewritten as:
$$\begin{aligned} LL\left(\boldsymbol{\theta}_{c}\right) &=\sum_{i=1}^{N} \ln \left[\frac{1}{\sqrt{(2 \pi)^{d}\left|\boldsymbol{\Sigma}_{c}\right|}} \exp \left(-\frac{1}{2}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right)\right] \\ &=\sum_{i=1}^{N} \ln \left[\frac{1}{\sqrt{(2 \pi)^{d}}} \cdot \frac{1}{\sqrt{\left|\boldsymbol{\Sigma}_{c}\right|}} \cdot \exp \left(-\frac{1}{2}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right)\right] \\ &=\sum_{i=1}^{N}\left\{\ln \frac{1}{\sqrt{(2 \pi)^{d}}}+\ln \frac{1}{\sqrt{\left|\boldsymbol{\Sigma}_{c}\right|}}+\ln \left[\exp \left(-\frac{1}{2}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right)\right]\right\} \\ &=\sum_{i=1}^{N}\left\{-\frac{d}{2} \ln (2 \pi)-\frac{1}{2} \ln \left|\boldsymbol{\Sigma}_{c}\right|-\frac{1}{2}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right\} \\ &=-\frac{N d}{2} \ln (2 \pi)-\frac{N}{2} \ln \left|\boldsymbol{\Sigma}_{c}\right|-\frac{1}{2} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right) \end{aligned}$$
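As a quick sanity check on this expansion (a numerical sketch with randomly generated data, not part of the derivation), one can compare the summed log-density, evaluated directly from the Gaussian density, against the closed form in the last line:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))            # hypothetical samples x_1, ..., x_N
mu = rng.normal(size=d)                # an arbitrary mean mu_c
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)        # a symmetric positive definite Sigma_c
Sigma_inv = np.linalg.inv(Sigma)
det = np.linalg.det(Sigma)

# Quadratic form (x_i - mu)^T Sigma^{-1} (x_i - mu) for every sample.
diff = X - mu
quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)

# Left-hand side: sum_i ln N(x_i | mu, Sigma), from the density directly.
lhs = np.sum(np.log(1.0 / np.sqrt((2 * np.pi) ** d * det)
                    * np.exp(-0.5 * quad)))

# Right-hand side: the expanded closed form from the last line above.
rhs = (-N * d / 2 * np.log(2 * np.pi)
       - N / 2 * np.log(det)
       - 0.5 * quad.sum())

assert np.isclose(lhs, rhs)
```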
Since the maximum likelihood estimate $\hat{\boldsymbol{\theta}}_{c}$ of the parameter $\boldsymbol{\theta}_{c}$ is

$$\hat{\boldsymbol{\theta}}_{c}=\underset{\boldsymbol{\theta}_{c}}{\arg \max }\, LL\left(\boldsymbol{\theta}_{c}\right)$$

it remains to find the $\hat{\boldsymbol{\mu}}_{c}$ and $\hat{\boldsymbol{\Sigma}}_{c}$ that maximize the log-likelihood $LL\left(\boldsymbol{\theta}_{c}\right)$; these together give $\hat{\boldsymbol{\theta}}_{c}$.
Taking the partial derivative of $LL\left(\boldsymbol{\theta}_{c}\right)$ with respect to $\boldsymbol{\mu}_{c}$:

$$\begin{aligned} \frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\mu}_{c}} &=\frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[-\frac{N d}{2} \ln (2 \pi)-\frac{N}{2} \ln \left|\boldsymbol{\Sigma}_{c}\right|-\frac{1}{2} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \\ &=\frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[-\frac{1}{2} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \\ &=-\frac{1}{2} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \\ &=-\frac{1}{2} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[\left(\boldsymbol{x}_{i}^{\mathrm{T}}-\boldsymbol{\mu}_{c}^{\mathrm{T}}\right) \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \\ &=-\frac{1}{2} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[\left(\boldsymbol{x}_{i}^{\mathrm{T}}-\boldsymbol{\mu}_{c}^{\mathrm{T}}\right)\left(\boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}-\boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}\right)\right] \\ &=-\frac{1}{2} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}-\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}-\boldsymbol{\mu}_{c}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}+\boldsymbol{\mu}_{c}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}\right] \end{aligned}$$
Since $\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}$ is a scalar, and $\boldsymbol{\Sigma}_{c}$ is symmetric ($\boldsymbol{\Sigma}_{c}^{\mathrm{T}}=\boldsymbol{\Sigma}_{c}$), we have

$$\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}=\left(\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}\right)^{\mathrm{T}}=\boldsymbol{\mu}_{c}^{\mathrm{T}}\left(\boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}} \boldsymbol{x}_{i}=\boldsymbol{\mu}_{c}^{\mathrm{T}}\left(\boldsymbol{\Sigma}_{c}^{\mathrm{T}}\right)^{-1} \boldsymbol{x}_{i}=\boldsymbol{\mu}_{c}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}$$
so the derivative can be further simplified to

$$\frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\mu}_{c}}=-\frac{1}{2} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\mu}_{c}}\left[\boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}-2 \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}+\boldsymbol{\mu}_{c}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}\right]$$
By the matrix differentiation identities $\dfrac{\partial \boldsymbol{a}^{\mathrm{T}} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a}$ and $\dfrac{\partial \boldsymbol{x}^{\mathrm{T}} \mathbf{B} \boldsymbol{x}}{\partial \boldsymbol{x}}=\left(\mathbf{B}+\mathbf{B}^{\mathrm{T}}\right) \boldsymbol{x}$, we get
$$\begin{aligned} \frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\mu}_{c}} &=-\frac{1}{2} \sum_{i=1}^{N}\left[0-\left(2 \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}}+\left(\boldsymbol{\Sigma}_{c}^{-1}+\left(\boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}}\right) \boldsymbol{\mu}_{c}\right] \\ &=-\frac{1}{2} \sum_{i=1}^{N}\left[-2\left(\boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}} \boldsymbol{x}_{i}+\left(\boldsymbol{\Sigma}_{c}^{-1}+\left(\boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}}\right) \boldsymbol{\mu}_{c}\right] \\ &=-\frac{1}{2} \sum_{i=1}^{N}\left[-2 \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}+2 \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}\right] \\ &=\sum_{i=1}^{N} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}-N \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c} \end{aligned}$$
Setting the derivative to zero:

$$\begin{gathered} \frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\mu}_{c}}=\sum_{i=1}^{N} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i}-N \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}=0 \\ N \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}=\sum_{i=1}^{N} \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{x}_{i} \\ N \boldsymbol{\Sigma}_{c}^{-1} \boldsymbol{\mu}_{c}=\boldsymbol{\Sigma}_{c}^{-1} \sum_{i=1}^{N} \boldsymbol{x}_{i} \\ N \boldsymbol{\mu}_{c}=\sum_{i=1}^{N} \boldsymbol{x}_{i} \end{gathered}$$

$$\boldsymbol{\mu}_{c}=\frac{1}{N} \sum_{i=1}^{N} \boldsymbol{x}_{i} \quad \Rightarrow \quad \hat{\boldsymbol{\mu}}_{c}=\frac{1}{N} \sum_{i=1}^{N} \boldsymbol{x}_{i}$$

This is Eq. (7.12) of the Watermelon Book.
Taking the partial derivative of $LL\left(\boldsymbol{\theta}_{c}\right)$ with respect to $\boldsymbol{\Sigma}_{c}$:

$$\begin{aligned} \frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\Sigma}_{c}} &=\frac{\partial}{\partial \boldsymbol{\Sigma}_{c}}\left[-\frac{N d}{2} \ln (2 \pi)-\frac{N}{2} \ln \left|\boldsymbol{\Sigma}_{c}\right|-\frac{1}{2} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \\ &=\frac{\partial}{\partial \boldsymbol{\Sigma}_{c}}\left[-\frac{N}{2} \ln \left|\boldsymbol{\Sigma}_{c}\right|-\frac{1}{2} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \\ &=-\frac{N}{2} \cdot \frac{\partial}{\partial \boldsymbol{\Sigma}_{c}}\left[\ln \left|\boldsymbol{\Sigma}_{c}\right|\right]-\frac{1}{2} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\Sigma}_{c}}\left[\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\right] \end{aligned}$$
By the matrix differentiation identities $\dfrac{\partial|\mathbf{X}|}{\partial \mathbf{X}}=|\mathbf{X}| \cdot\left(\mathbf{X}^{-1}\right)^{\mathrm{T}}$ and $\dfrac{\partial \boldsymbol{a}^{\mathrm{T}} \mathbf{X}^{-1} \boldsymbol{b}}{\partial \mathbf{X}}=-\mathbf{X}^{-\mathrm{T}} \boldsymbol{a} \boldsymbol{b}^{\mathrm{T}} \mathbf{X}^{-\mathrm{T}}$, we get
$$\begin{aligned} \frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\Sigma}_{c}} &=-\frac{N}{2} \cdot \frac{1}{\left|\boldsymbol{\Sigma}_{c}\right|} \cdot\left|\boldsymbol{\Sigma}_{c}\right| \cdot\left(\boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}}-\frac{1}{2} \sum_{i=1}^{N}\left[-\boldsymbol{\Sigma}_{c}^{-\mathrm{T}}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-\mathrm{T}}\right] \\ &=-\frac{N}{2}\left(\boldsymbol{\Sigma}_{c}^{-1}\right)^{\mathrm{T}}-\frac{1}{2} \sum_{i=1}^{N}\left[-\boldsymbol{\Sigma}_{c}^{-\mathrm{T}}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-\mathrm{T}}\right] \\ &=-\frac{N}{2} \boldsymbol{\Sigma}_{c}^{-1}+\frac{1}{2} \sum_{i=1}^{N}\left[\boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\right] \end{aligned}$$

where the last step uses the symmetry of $\boldsymbol{\Sigma}_{c}$, so that $\boldsymbol{\Sigma}_{c}^{-\mathrm{T}}=\boldsymbol{\Sigma}_{c}^{-1}$.
Setting this derivative to zero:

$$\frac{\partial LL\left(\boldsymbol{\theta}_{c}\right)}{\partial \boldsymbol{\Sigma}_{c}}=-\frac{N}{2} \boldsymbol{\Sigma}_{c}^{-1}+\frac{1}{2} \sum_{i=1}^{N}\left[\boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\right]=0$$

$$\begin{gathered} -\frac{N}{2} \boldsymbol{\Sigma}_{c}^{-1}=-\frac{1}{2} \sum_{i=1}^{N}\left[\boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\right] \\ N \boldsymbol{\Sigma}_{c}^{-1}=\sum_{i=1}^{N}\left[\boldsymbol{\Sigma}_{c}^{-1}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \boldsymbol{\Sigma}_{c}^{-1}\right] \\ N \boldsymbol{\Sigma}_{c}^{-1}=\boldsymbol{\Sigma}_{c}^{-1}\left[\sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}}\right] \boldsymbol{\Sigma}_{c}^{-1} \\ N \mathbf{I}=\boldsymbol{\Sigma}_{c}^{-1}\left[\sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}}\right] \end{gathered}$$

$$\boldsymbol{\Sigma}_{c}=\frac{1}{N} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)\left(\boldsymbol{x}_{i}-\boldsymbol{\mu}_{c}\right)^{\mathrm{T}} \quad \Rightarrow \quad \hat{\boldsymbol{\Sigma}}_{c}=\frac{1}{N} \sum_{i=1}^{N}\left(\boldsymbol{x}_{i}-\hat{\boldsymbol{\mu}}_{c}\right)\left(\boldsymbol{x}_{i}-\hat{\boldsymbol{\mu}}_{c}\right)^{\mathrm{T}}$$

This is Eq. (7.13) of the Watermelon Book.
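These closed-form estimates match what numerical libraries compute. A minimal sketch with synthetic data (the data and names are my own): the sample mean and the *biased* sample covariance (divide by $N$, not $N-1$) are exactly $\hat{\boldsymbol{\mu}}_{c}$ and $\hat{\boldsymbol{\Sigma}}_{c}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))   # hypothetical class-c samples: N = 100, d = 2

# MLE of the mean: mu_hat = (1/N) * sum_i x_i
mu_hat = X.mean(axis=0)

# MLE of the covariance: Sigma_hat = (1/N) * sum_i (x_i - mu_hat)(x_i - mu_hat)^T
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)

# np.cov with bias=True also divides by N, so it reproduces Sigma_hat exactly.
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```

Note that `np.cov` divides by $N-1$ by default (the unbiased estimator); `bias=True` selects the MLE derived above.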
6.3 The Naive Bayes Classifier
The Bayes optimal classifier that minimizes the classification error rate is

$$h^{*}(\boldsymbol{x})=\underset{c \in \mathcal{Y}}{\arg \max }\, P(c \mid \boldsymbol{x})$$
By Bayes' theorem,

$$P(c \mid \boldsymbol{x})=\frac{P(\boldsymbol{x}, c)}{P(\boldsymbol{x})}=\frac{P(c) P(\boldsymbol{x} \mid c)}{P(\boldsymbol{x})}$$

so, since $P(\boldsymbol{x})$ does not depend on $c$,

$$h^{*}(\boldsymbol{x})=\underset{c \in \mathcal{Y}}{\arg \max } \frac{P(c) P(\boldsymbol{x} \mid c)}{P(\boldsymbol{x})}=\underset{c \in \mathcal{Y}}{\arg \max }\, P(c) P(\boldsymbol{x} \mid c)$$
The attribute conditional independence assumption states that

$$P(\boldsymbol{x} \mid c)=P\left(x_{1}, x_{2}, \ldots, x_{d} \mid c\right)=\prod_{i=1}^{d} P\left(x_{i} \mid c\right)$$

[where $d$ is the dimension of $\boldsymbol{x}$]

Therefore

$$h^{*}(\boldsymbol{x})=\underset{c \in \mathcal{Y}}{\arg \max }\, P(c) \prod_{i=1}^{d} P\left(x_{i} \mid c\right)$$

This is exactly the decision rule of the naive Bayes classifier.
$P(c)$ is the proportion of each class in the sample space. By the law of large numbers, when the training set contains sufficiently many i.i.d. samples, $P(c)$ can be estimated by the frequency of each class:

$$P(c)=\frac{\left|D_{c}\right|}{|D|}$$

where $D$ is the training set, $|D|$ is the number of samples in $D$, $D_{c}$ is the set of class-$c$ samples in $D$, and $\left|D_{c}\right|$ is the number of samples in $D_{c}$.
For $P\left(x_{i} \mid c\right)$: if the $i$-th attribute $x_{i}$ takes continuous values, we assume its values follow a normal distribution, i.e.

$$P\left(x_{i} \mid c\right) \sim \mathcal{N}\left(\mu_{c, i}, \sigma_{c, i}^{2}\right) \Rightarrow P\left(x_{i} \mid c\right)=\frac{1}{\sqrt{2 \pi} \sigma_{c, i}} \exp \left(-\frac{\left(x_{i}-\mu_{c, i}\right)^{2}}{2 \sigma_{c, i}^{2}}\right)$$

The parameters of this normal distribution can be derived by maximum likelihood estimation: $\mu_{c, i}$ and $\sigma_{c, i}^{2}$ are the mean and variance of the $i$-th attribute over the class-$c$ samples.
For $P\left(x_{i} \mid c\right)$: if the $i$-th attribute $x_{i}$ takes discrete values, then again by maximum likelihood estimation we use the observed frequency as the estimate of the probability, i.e.

$$P\left(x_{i} \mid c\right)=\frac{\left|D_{c, x_{i}}\right|}{\left|D_{c}\right|}$$

where $D_{c, x_{i}}$ is the set of samples in $D_{c}$ whose $i$-th attribute takes the value $x_{i}$.
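Putting the pieces together, a naive Bayes classifier for discrete attributes can be sketched as follows (the toy weather data and all names are my own invention; no smoothing is applied, matching the plain frequency estimates above):

```python
from collections import Counter

# Toy training set D: each entry is (attribute tuple x, class label c).
data = [(('sunny', 'hot'),  'no'),  (('sunny', 'mild'), 'no'),
        (('rainy', 'mild'), 'yes'), (('rainy', 'cool'), 'yes'),
        (('sunny', 'cool'), 'yes'), (('rainy', 'hot'),  'no')]

labels = [c for _, c in data]
# Prior P(c) = |D_c| / |D|, estimated by class frequency.
prior = {c: n / len(data) for c, n in Counter(labels).items()}

def cond_prob(i, value, c):
    """P(x_i | c) = |D_{c,x_i}| / |D_c| (frequency estimate, no smoothing)."""
    D_c = [x for x, label in data if label == c]
    return sum(x[i] == value for x in D_c) / len(D_c)

def predict(x):
    """h*(x) = argmax_c P(c) * prod_i P(x_i | c)."""
    score = {}
    for c in prior:
        p = prior[c]
        for i, value in enumerate(x):
            p *= cond_prob(i, value, c)
        score[c] = p
    return max(score, key=score.get)

print(predict(('rainy', 'mild')))   # prints "yes"
```

With frequency estimates alone, a zero count for some attribute value wipes out the whole product; the Watermelon Book handles this with Laplacian smoothing, which is omitted here to keep the sketch aligned with the formulas above.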
Example: a six-sided die is tossed 10 times, and the outcomes are 2, 3, 2, 5, 4, 6, 1, 3, 4, 2. Estimate, from these tosses, the probability of the die showing each face.

Solution: let $P_i$ denote the probability that the die shows face $i$. By maximum likelihood estimation, the likelihood function is

$$L(\theta)=P_{1} \times P_{2}^{3} \times P_{3}^{2} \times P_{4}^{2} \times P_{5} \times P_{6}$$
and the corresponding log-likelihood is

$$\begin{aligned} LL(\theta) &=\ln L(\theta)=\ln \left(P_{1} \times P_{2}^{3} \times P_{3}^{2} \times P_{4}^{2} \times P_{5} \times P_{6}\right) \\ &=\ln P_{1}+3 \ln P_{2}+2 \ln P_{3}+2 \ln P_{4}+\ln P_{5}+\ln P_{6} \end{aligned}$$
Since the $P_i$ must satisfy the constraint

$$P_{1}+P_{2}+P_{3}+P_{4}+P_{5}+P_{6}=1$$

maximizing the log-likelihood is a constrained optimization problem:

$$\begin{array}{ll} \max & LL(\theta)=\ln P_{1}+3 \ln P_{2}+2 \ln P_{3}+2 \ln P_{4}+\ln P_{5}+\ln P_{6} \\ \text { s.t. } & P_{1}+P_{2}+P_{3}+P_{4}+P_{5}+P_{6}=1 \end{array}$$
Theorem: for an optimization problem

$$\begin{array}{ll} \min & f(x) \\ \text { s.t. } & g_{i}(x) \leq 0 \quad(i=1, \ldots, m) \\ & h_{j}(x)=0 \quad(j=1, \ldots, n) \end{array}$$

if $f(x)$, $g_{i}(x)$, $h_{j}(x)$ are first-order continuously differentiable, $f(x)$ and $g_{i}(x)$ are convex, and $h_{j}(x)$ is linear, then any point satisfying the following KKT conditions is an optimal solution of the problem.
$$\left\{\begin{array}{l} \nabla_{x} L\left(\boldsymbol{x}^{*}, \boldsymbol{\mu}^{*}, \boldsymbol{\lambda}^{*}\right)=\nabla f\left(\boldsymbol{x}^{*}\right)+\sum_{i=1}^{m} \mu_{i}^{*} \nabla g_{i}\left(\boldsymbol{x}^{*}\right)+\sum_{j=1}^{n} \lambda_{j}^{*} \nabla h_{j}\left(\boldsymbol{x}^{*}\right)=0 \\ h_{j}\left(\boldsymbol{x}^{*}\right)=0 \\ g_{i}\left(\boldsymbol{x}^{*}\right) \leq 0 \\ \mu_{i}^{*} \geq 0 \\ \mu_{i}^{*} g_{i}\left(\boldsymbol{x}^{*}\right)=0 \end{array}\right.$$

[Reference: 王燕军, 梁治安. 最优化基础理论与方法[M]. 复旦大学出版社, 2011.]
By the method of Lagrange multipliers, the Lagrangian is

$$\mathcal{L}(\theta, \lambda)=\ln P_{1}+3 \ln P_{2}+2 \ln P_{3}+2 \ln P_{4}+\ln P_{5}+\ln P_{6}+\lambda\left(P_{1}+P_{2}+P_{3}+P_{4}+P_{5}+P_{6}-1\right)$$
Taking the partial derivative of the Lagrangian $\mathcal{L}(\theta, \lambda)$ with respect to each $P_i$ and setting it to zero; for $P_1$:

$$\begin{aligned} \frac{\partial \mathcal{L}(\theta, \lambda)}{\partial P_{1}} &=\frac{\partial}{\partial P_{1}}\left[\ln P_{1}+3 \ln P_{2}+2 \ln P_{3}+2 \ln P_{4}+\ln P_{5}+\ln P_{6}+\lambda\left(P_{1}+P_{2}+P_{3}+P_{4}+P_{5}+P_{6}-1\right)\right] \\ &=\frac{\partial}{\partial P_{1}}\left(\ln P_{1}+\lambda P_{1}\right) \\ &=\frac{1}{P_{1}}+\lambda=0 \quad \Rightarrow \quad \lambda=-\frac{1}{P_{1}} \end{aligned}$$
Similarly, for the other $P_i$:

$$\lambda=-\frac{1}{P_{1}}=-\frac{3}{P_{2}}=-\frac{2}{P_{3}}=-\frac{2}{P_{4}}=-\frac{1}{P_{5}}=-\frac{1}{P_{6}}$$
Combining this with the constraint

$$P_{1}+P_{2}+P_{3}+P_{4}+P_{5}+P_{6}=1$$

gives $\lambda=-10$, and therefore

$$P_{1}=\frac{1}{10},\; P_{2}=\frac{3}{10},\; P_{3}=\frac{2}{10},\; P_{4}=\frac{2}{10},\; P_{5}=\frac{1}{10},\; P_{6}=\frac{1}{10}$$

The estimated probability of each face equals its observed frequency.
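The same answer falls out of a direct numerical check. The sketch below counts the observed tosses, forms the frequency estimate, and verifies that its log-likelihood beats a competing candidate (the uniform distribution, chosen by me for comparison):

```python
from collections import Counter
import math

tosses = [2, 3, 2, 5, 4, 6, 1, 3, 4, 2]
counts = Counter(tosses)

# MLE: P_i equals the observed frequency of face i.
p_mle = {i: counts.get(i, 0) / len(tosses) for i in range(1, 7)}
assert abs(sum(p_mle.values()) - 1.0) < 1e-12

def log_likelihood(p):
    """LL(theta) = sum_i k_i * ln P_i, where k_i is the count of face i."""
    return sum(counts[i] * math.log(p[i]) for i in counts)

# The frequency estimate attains a higher likelihood than, e.g., uniform.
uniform = {i: 1 / 6 for i in range(1, 7)}
assert log_likelihood(p_mle) > log_likelihood(uniform)
```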