1.1
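Describe the three elements of statistical learning (model, strategy, algorithm) in maximum likelihood estimation and in Bayesian estimation of the Bernoulli model. The Bernoulli model is a probability distribution defined on a random variable taking the values 0 and 1. Suppose we observe $n$ independent outcomes generated by a Bernoulli model, of which $k$ are 1; the probability of outcome 1 can then be estimated by maximum likelihood estimation or by Bayesian estimation.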
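1) Maximum likelihood estimation
Model: Bernoulli distribution
Strategy: empirical risk minimization under the log loss, which amounts to maximizing the likelihood
Algorithm: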
Let
$$\sum_{i=1}^{n} x_i = k$$
The likelihood function is
$$L(x, p) = p(x \mid p) = \prod_{i=1}^{n} p(x_i \mid p) = p^{k} (1 - p)^{n - k}$$
Taking the logarithm,
$$l(x, p) = \ln L(x, p) = k \ln p + (n - k) \ln (1 - p)$$
Differentiating with respect to $p$ and setting the derivative to zero,
$$\frac{\partial l}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p} = \frac{k - np}{p(1 - p)} = 0$$
The maximum likelihood estimate is
$$\hat{p} = \frac{k}{n} = \bar{x}$$
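The closed form can be checked numerically. The following is a minimal sketch, assuming hypothetical data ($n = 10$, $k = 7$) and a simple grid search; the function names are illustrative and not part of the original solution.

```python
import math

def bernoulli_log_likelihood(p, k, n):
    """Log-likelihood k*ln(p) + (n-k)*ln(1-p), valid for 0 < p < 1."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def mle(k, n):
    """Closed-form maximum likelihood estimate k / n."""
    return k / n

# Hypothetical data (an assumption for illustration): 7 ones in 10 trials.
n, k = 10, 7

# Grid over the open interval (0, 1), step 0.001.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, k, n))
p_hat = mle(k, n)

print(p_hat, p_grid)
```

The grid maximizer should agree with $k/n$ up to the grid resolution, since the log-likelihood is strictly concave on $(0, 1)$ when $0 < k < n$.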
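2) Bayesian estimation
Model: Bernoulli distribution, with a Beta prior on $p$
Strategy: structural risk minimization, realized here as maximizing the posterior density
Algorithm: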
Assume $p$ follows a $\mathrm{Beta}(a, b)$ distribution.
The density of $p$ is
$$\pi(p) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, p^{a - 1} (1 - p)^{b - 1}$$
where $\Gamma$ is the gamma function. The likelihood of the observations is
$$P(x \mid p) = p^{k} (1 - p)^{n - k}$$
The joint density is
$$h(x, p) = \pi(p)\, P(x \mid p) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, p^{k + a - 1} (1 - p)^{n + b - k - 1}$$
The marginal density of $x$ is
$$m(x) = \int_0^1 h(x, p)\, \mathrm{d}p = \int_0^1 \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, p^{k + a - 1} (1 - p)^{n + b - k - 1}\, \mathrm{d}p = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} \cdot \frac{\Gamma(k + a)\,\Gamma(n + b - k)}{\Gamma(a + b + n)}$$
Dividing the joint density by the marginal gives the posterior density
$$\pi(p \mid x) = \frac{h(x, p)}{m(x)} = \frac{\Gamma(a + b + n)}{\Gamma(k + a)\,\Gamma(n + b - k)}\, p^{k + a - 1} (1 - p)^{n + b - k - 1}$$
The posterior therefore follows a $\mathrm{Beta}(k + a,\; n + b - k)$ distribution.
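The conjugate update can be sanity-checked numerically. The sketch below assumes hypothetical values ($a = 2$, $b = 3$, $n = 10$, $k = 7$), approximates the marginal with a midpoint rule, and compares the normalized product of prior and likelihood against the $\mathrm{Beta}(k + a,\; n + b - k)$ density at one point. The helper names are illustrative.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density at 0 < p < 1, using log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def posterior_params(a, b, k, n):
    """Parameters of the Beta posterior after k ones in n trials."""
    return a + k, b + n - k

def joint_density(p, a, b, k, n):
    """Prior density times Bernoulli likelihood: h(x, p)."""
    return beta_pdf(p, a, b) * p**k * (1 - p) ** (n - k)

# Hypothetical prior and data (assumptions for illustration).
a, b, n, k = 2.0, 3.0, 10, 7

# Midpoint-rule approximation of the marginal m(x) = integral of h(x, p) over (0, 1).
steps = 20000
mids = [(i + 0.5) / steps for i in range(steps)]
m_x = sum(joint_density(p, a, b, k, n) for p in mids) / steps

a_post, b_post = posterior_params(a, b, k, n)
p0 = 0.6
numeric = joint_density(p0, a, b, k, n) / m_x   # posterior via Bayes' rule
closed = beta_pdf(p0, a_post, b_post)           # closed-form Beta posterior

print(a_post, b_post, numeric, closed)
```

The two density values should agree up to the integration error, which is one way to confirm that the normalizing constant in the derivation is right.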
Let $f(p) = p^{k + a - 1} (1 - p)^{n + b - k - 1}$, the part of the posterior density that depends on $p$; the optimum is attained where $f'(p) = 0$.
Differentiating gives
$$f'(p) = p^{k + a - 2} (1 - p)^{n + b - k - 2} \bigl((k + a - 1)(1 - p) - p(n + b - k - 1)\bigr)$$
Setting $f'(p) = 0$ and dropping the factor $p^{k + a - 2} (1 - p)^{n + b - k - 2}$, which is nonzero for $0 < p < 1$, gives
$$(k + a - 1)(1 - p) - p(n + b - k - 1) = k + a - 1 - p(n + a + b - 2) = 0$$
The Bayesian estimate, taken here as the mode of the posterior, is
$$\hat{p} = \frac{k + a - 1}{n + a + b - 2}$$
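As a rough check of this formula, the sketch below compares the closed-form posterior mode with a grid search over $\ln f(p)$, again with hypothetical values ($a = 2$, $b = 3$, $n = 10$, $k = 7$). It also illustrates that a uniform prior, $a = b = 1$, recovers the maximum likelihood estimate $k/n$. Function names are illustrative.

```python
import math

def map_estimate(k, n, a, b):
    """Posterior mode (k + a - 1) / (n + a + b - 2) of Beta(k + a, n + b - k)."""
    return (k + a - 1) / (n + a + b - 2)

def log_f(p, k, n, a, b):
    """Logarithm of f(p) = p^(k+a-1) * (1-p)^(n+b-k-1), for 0 < p < 1."""
    return (k + a - 1) * math.log(p) + (n + b - k - 1) * math.log(1 - p)

# Hypothetical prior and data (assumptions for illustration).
a, b, n, k = 2.0, 3.0, 10, 7

grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_f(p, k, n, a, b))
p_map = map_estimate(k, n, a, b)

# With a uniform prior Beta(1, 1), the posterior mode equals the MLE k / n.
p_uniform = map_estimate(k, n, 1.0, 1.0)

print(p_map, p_grid, p_uniform)
```

Compared with $k/n$, the estimate is pulled toward the prior mode $(a - 1)/(a + b - 2)$, and the pull weakens as $n$ grows.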
1.2
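Derive maximum likelihood estimation from empirical risk minimization. Prove that when the model is a conditional probability distribution and the loss function is the log loss, empirical risk minimization is equivalent to maximum likelihood estimation.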
Empirical risk minimization is
$$\min \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$
When the model is a conditional probability distribution $P(y \mid x)$ and the loss is the log loss $L(y, P(y \mid x)) = -\ln P(y \mid x)$, the empirical risk minimization problem becomes
$$\min \frac{1}{N} \sum_{i=1}^{N} \bigl(-\ln P(y_i \mid x_i)\bigr) = \min \frac{1}{N} \Bigl(-\ln \prod_{i=1}^{N} P(y_i \mid x_i)\Bigr) = \max \frac{1}{N} \sum_{i=1}^{N} \ln P(y_i \mid x_i)$$
Maximum likelihood estimation solves
$$\max L = \max \prod_{i=1}^{N} P(y_i \mid x_i)$$
Taking the logarithm, which is monotonically increasing and therefore preserves the maximizer, gives
$$\max \sum_{i=1}^{N} \ln P(y_i \mid x_i)$$
This is $\max l$ with $l = \ln L$, which is equivalent to $\max L$, i.e., maximum likelihood estimation. The positive constant factor $\frac{1}{N}$ does not change the maximizer, so the two problems have the same solution.
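Therefore, when the model is a conditional probability distribution and the loss function is the log loss, empirical risk minimization is equivalent to maximum likelihood estimation.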
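The equivalence can be illustrated on a toy problem. The sketch below assumes a hypothetical one-parameter conditional model $P(y = 1 \mid x) = 1 / (1 + e^{-wx})$ and a small made-up data set; neither comes from the exercise. It evaluates the empirical risk under the log loss and the log-likelihood over the same candidate grid and compares the selected parameters.

```python
import math

def prob_y_given_x(y, x, w):
    """Hypothetical conditional model: P(y=1 | x) = sigmoid(w * x)."""
    p1 = 1.0 / (1.0 + math.exp(-w * x))
    return p1 if y == 1 else 1.0 - p1

def empirical_risk(w, data):
    """Average log loss: (1/N) * sum of -ln P(y_i | x_i)."""
    return sum(-math.log(prob_y_given_x(y, x, w)) for x, y in data) / len(data)

def log_likelihood(w, data):
    """Log-likelihood: sum of ln P(y_i | x_i)."""
    return sum(math.log(prob_y_given_x(y, x, w)) for x, y in data)

# Made-up (x, y) pairs, not linearly separable, so the optimum is finite.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]

# Candidate parameter values w in [-3, 3], step 0.01.
candidates = [i / 100 for i in range(-300, 301)]
w_erm = min(candidates, key=lambda w: empirical_risk(w, data))
w_mle = max(candidates, key=lambda w: log_likelihood(w, data))

print(w_erm, w_mle)
```

Because the empirical risk equals the negative log-likelihood divided by $N$, both searches pick the same candidate.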