Part I: Supervised Learning
Chapter 1: Introduction to Statistical Learning and Supervised Learning
Theorem 1.1: Generalization Error Bound
Generalization error

If the learned model is $\hat{f}$, the error incurred when this model predicts on unknown data is called the generalization error. It measures the model's predictive ability on unseen data; in fact, the generalization error is exactly the expected risk of the learned model:

$$\begin{aligned} R_{\exp}(\hat{f}) &= E_{P}[L(Y, \hat{f}(X))] \\ &= \int_{\mathcal{X} \times \mathcal{Y}} L(y, \hat{f}(x))\, P(x, y)\,\mathrm{d}x\,\mathrm{d}y \end{aligned}$$
Expected risk:

$$R(f) = E[L(Y, f(X))]$$

Empirical risk:

$$\hat{R}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$$
The generalization error bound theorem

Theorem 1.1 (generalization error bound). For a binary classification problem, when the hypothesis space is a finite set of functions $\mathcal{F} = \{f_1, f_2, \cdots, f_d\}$, then for any function $f \in \mathcal{F}$, with probability at least $1 - \delta$ ($0 < \delta < 1$) the following inequality holds:
$$R(f) \leqslant \hat{R}(f) + \varepsilon(d, N, \delta)$$

where

$$\varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\left(\log d + \log \frac{1}{\delta}\right)}$$

Here $d$ is the number of candidate models in the hypothesis space and $N$ is the sample size.
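As a quick numerical illustration (a sketch added here, not part of the original notes), the bound term $\varepsilon(d, N, \delta)$ can be evaluated directly; it shrinks as the sample size $N$ grows and grows only logarithmically with the number of candidate models $d$:

```python
import math

def epsilon(d: int, N: int, delta: float) -> float:
    """Generalization error bound term: sqrt((log d + log(1/delta)) / (2N))."""
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))

# More samples tighten the bound; a larger hypothesis space loosens it,
# but only logarithmically.
print(epsilon(d=10, N=100, delta=0.05))      # ~0.163
print(epsilon(d=10, N=10000, delta=0.05))    # ~0.016
print(epsilon(d=10000, N=10000, delta=0.05)) # ~0.025
```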
Hoeffding's inequality

Let $x_1, x_2, \ldots, x_n$ be a sequence of independent random variables with $x_i \in [a_i, b_i]$ (each $x_i$ takes values between $a_i$ and $b_i$), let $S_n = \sum\limits_{i=1}^{n} x_i$ be their sum, and let $E(S_n) = E(\sum\limits_{i=1}^{n} x_i)$ be its expectation. Then for any $t > 0$ the following inequalities hold:

$$P(S_n - E(S_n) \geqslant t) \leqslant \exp\left(\frac{-2t^2}{\sum\limits_{i=1}^{n}(b_i - a_i)^2}\right)$$

and

$$P(E(S_n) - S_n \geqslant t) \leqslant \exp\left(\frac{-2t^2}{\sum\limits_{i=1}^{n}(b_i - a_i)^2}\right)$$

Here $(b_i - a_i)^2$ can be treated as a constant.
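The inequality can be checked by simulation. Below is a minimal sketch (my own, with uniform variables on $[0, 1]$, so that $\sum_{i=1}^{n}(b_i - a_i)^2 = n$); the empirical exceedance frequency stays below the Hoeffding bound:

```python
import math
import random

random.seed(0)

n, t, trials = 100, 5.0, 20000
bound = math.exp(-2.0 * t ** 2 / n)  # Hoeffding bound with sum (b_i - a_i)^2 = n

exceed = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))  # x_i ~ Uniform[0, 1], E(S_n) = n/2
    if s - n * 0.5 >= t:
        exceed += 1
empirical = exceed / trials

print(empirical, bound)  # the empirical frequency stays below the bound
```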
Proof of the generalization error bound for binary classification
For a binary classification problem, let $x_1, x_2, \ldots, x_n$ be a sequence of independent random variables with $S_n = \sum\limits_{i=1}^{n} x_i$. If the loss function takes values in $[0, 1]$, i.e. $x_i \in [0, 1]$, and $\bar{X}_n$ is the sample mean, i.e. $\bar{X}_n = \frac{S_n}{n} = \frac{1}{n}\sum\limits_{i=1}^{n} x_i$, then $E(\bar{X}_n) = \frac{1}{n}E(S_n)$ and the following holds:

$$P(\bar{X}_n - E(\bar{X}_n) \geq t) = P\left(\frac{S_n}{n} - \frac{E(S_n)}{n} \geq t\right) = P(S_n - E(S_n) \geq nt)$$

Then by Hoeffding's inequality,

$$\begin{aligned} P(\bar{X}_n - E(\bar{X}_n) \geq t) &= P(S_n - E(S_n) \geq nt) \\ &\leq \exp\left(\frac{-2n^2t^2}{\sum\limits_{i=1}^{n}(b_i - a_i)^2}\right) \end{aligned}$$
Since $(b_i - a_i)^2$ can be treated as a constant (here it equals 1, because $x_i \in [0, 1]$), the right-hand side is of order $e^{-n}$, which tends to 0 as $n \to \infty$. In other words, when the sample size is large, the probability that the sample mean deviates from its expectation by at least some amount $t$ is very small (it tends to 0). Now pick any candidate model $f$ from the hypothesis space $\mathcal{F}$ of the classification problem ($\mathcal{F} = \{f_1, f_2, \cdots, f_d\}$ is a finite set). Its empirical risk on the training set is

$$\hat{R}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$$

and its expected risk (over the true data distribution) is

$$R(f) = E[L(Y, f(X))]$$

Applying the inequality derived above with $t$ replaced by $\varepsilon$ and the interval $[a_i, b_i]$ replaced by $[0, 1]$, the exponent becomes $-\frac{2N^2\varepsilon^2}{N} = -2N\varepsilon^2$, so

$$P(R(f) - \hat{R}(f) \geqslant \varepsilon) \leqslant \exp(-2N\varepsilon^2)$$
The inequality above holds for a single model chosen from the hypothesis space. The hypothesis space contains $d$ candidate models, and we do not know in advance which one we will end up using, so we want the gap between the training-set empirical risk $\hat{R}(f)$ and the expected risk $R(f)$ to be small for every candidate. That is, we want the probability that this gap is at least some (sufficiently small) number $\varepsilon$ for at least one model in the hypothesis space to be very low. By the union bound:

$$\begin{aligned} P(\exists f \in \mathcal{F}: R(f) - \hat{R}(f) \geqslant \varepsilon) &= P\Big(\bigcup_{f \in \mathcal{F}} \{R(f) - \hat{R}(f) \geqslant \varepsilon\}\Big) \\ &\leqslant \sum_{f \in \mathcal{F}} P(R(f) - \hat{R}(f) \geqslant \varepsilon) \\ &\leqslant d\exp(-2N\varepsilon^2) \end{aligned}$$
Considering the complementary event: for any model we pick, the gap between the two risks is smaller than a sufficiently small number $\varepsilon$ with high probability. Equivalently, for all $f \in \mathcal{F}$,

$$P(\forall f \in \mathcal{F}: R(f) - \hat{R}(f) < \varepsilon) \geqslant 1 - d\exp(-2N\varepsilon^2)$$
Let

$$\delta = d\exp(-2N\varepsilon^2)$$

Then

$$P(R(f) < \hat{R}(f) + \varepsilon) \geqslant 1 - \delta$$

That is, with probability at least $1 - \delta$ we have $R(f) < \hat{R}(f) + \varepsilon$, where

$$\begin{aligned} \delta &= d\exp(-2N\varepsilon^2) \\ \ln\delta &= \ln d - 2N\varepsilon^2 \\ 2N\varepsilon^2 &= \ln d - \ln\delta \\ 2N\varepsilon^2 &= \ln d + \ln\frac{1}{\delta} \\ \varepsilon &= \sqrt{\frac{1}{2N}\left(\log d + \log\frac{1}{\delta}\right)} \end{aligned}$$
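The algebra between $\delta$ and $\varepsilon$ can be sanity-checked numerically: solving for $\varepsilon$ from $\delta$ and substituting back should recover $\delta$. A small sketch (the helper names are mine):

```python
import math

def eps_from_delta(d: int, N: int, delta: float) -> float:
    # epsilon = sqrt((ln d + ln(1/delta)) / (2N))
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))

def delta_from_eps(d: int, N: int, eps: float) -> float:
    # delta = d * exp(-2 N eps^2)
    return d * math.exp(-2.0 * N * eps ** 2)

d, N, delta = 50, 1000, 0.05
eps = eps_from_delta(d, N, delta)
print(eps, delta_from_eps(d, N, eps))  # the round trip recovers delta = 0.05
```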
The discussion above covers only the generalization error bound for a hypothesis space containing finitely many functions; finding a generalization error bound for a general hypothesis space is not so simple.
Maximum likelihood estimation and Bayesian estimation (the coin-tossing problem)
Maximum likelihood estimation

In a coin-tossing experiment, let 1 denote heads and 0 denote tails:

$$x_i = \begin{cases} 1, & \text{heads} \\ 0, & \text{tails} \end{cases}$$

Let $\theta$ be the probability of heads, so the probability of tails is $1 - \theta$ and $x_i \sim B(1, \theta)$, with probability mass function

$$P(X = x) = \theta^x(1-\theta)^{1-x} = \begin{cases} P(x = 0) = 1 - \theta \\ P(x = 1) = \theta \end{cases}$$
Likelihood function:

$$\begin{aligned} L(\theta) &= P(X_1 = x_1 \mid \theta) \cdots P(X_n = x_n \mid \theta) \\ &= \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} \end{aligned}$$

Log-likelihood function:

$$\begin{aligned} \ln L(\theta) &= \ln \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} \\ &= \sum_{i=1}^{n}\left[\ln\theta^{x_i} + \ln(1-\theta)^{1-x_i}\right] \\ &= \sum_{i=1}^{n} x_i \ln\theta + \sum_{i=1}^{n}(1-x_i)\ln(1-\theta) \\ &= \sum_{i=1}^{n} x_i \ln\theta + \Big(n - \sum_{i=1}^{n} x_i\Big)\ln(1-\theta) \end{aligned}$$
Objective: maximize $\ln L(\theta)$. Taking the partial derivative with respect to $\theta$:

$$\frac{\partial \ln L(\theta)}{\partial \theta} = \frac{\sum\limits_{i=1}^{n} x_i}{\theta} - \frac{n - \sum\limits_{i=1}^{n} x_i}{1-\theta}$$

Setting the derivative to zero,

$$\frac{\sum\limits_{i=1}^{n} x_i}{\theta} = \frac{n - \sum\limits_{i=1}^{n} x_i}{1-\theta}$$

which yields

$$\hat{\theta} = \frac{1}{n}\sum\limits_{i=1}^{n} x_i$$
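The closed form $\hat{\theta} = \frac{1}{n}\sum x_i$ can be verified against a brute-force grid search over the log-likelihood. A sketch of mine (simulated tosses; the true heads probability 0.7 is an illustrative choice):

```python
import math
import random

random.seed(1)

# Simulate n tosses with an assumed true heads probability of 0.7.
true_theta, n = 0.7, 1000
xs = [1 if random.random() < true_theta else 0 for _ in range(n)]
heads = sum(xs)

# Closed-form MLE: the sample mean.
theta_hat = heads / n

# Grid search over the log-likelihood h*ln(theta) + (n-h)*ln(1-theta).
def log_likelihood(theta: float) -> float:
    return heads * math.log(theta) + (n - heads) * math.log(1.0 - theta)

grid = [i / 1000.0 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_hat, theta_grid)  # agree up to the grid resolution of 0.001
```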
Bayesian estimation

Assume the prior is a known Beta distribution:

$$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$

Compute the posterior $P(\theta \mid x_1, x_2, \ldots, x_n)$:

$$\begin{aligned} P(\theta \mid x_1, x_2, \ldots, x_n) &= \frac{P(\theta, x_1, x_2, \ldots, x_n)}{P(x_1, x_2, \ldots, x_n)} \\ &= \frac{\pi(\theta)\, p(x_1 \mid \theta) \cdots p(x_n \mid \theta)}{\int P(\theta, x_1, x_2, \ldots, x_n)\,\mathrm{d}\theta} \\ &\propto \pi(\theta)\, p(x_1 \mid \theta) \cdots p(x_n \mid \theta) \\ &= \theta^{\alpha-1}(1-\theta)^{\beta-1}\prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} \\ &= \theta^{\sum x_i + \alpha - 1}(1-\theta)^{n - \sum x_i + \beta - 1} \end{aligned}$$
Notes:
- The integral $\int P(\theta, x_1, x_2, \ldots, x_n)\,\mathrm{d}\theta$ integrates $\theta$ out, so it does not depend on $\theta$ and is a constant;
- $\propto$ means "proportional to";
- $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ is also a constant and can be ignored;
- $\theta^{\sum x_i + \alpha - 1}(1-\theta)^{n - \sum x_i + \beta - 1}$ is the kernel of a Beta distribution with parameters $\sum x_i + \alpha$ and $n - \sum x_i + \beta$.
The posterior kernel $L(\theta)$ is then:

$$L(\theta) = \theta^{\sum x_i + \alpha - 1}(1-\theta)^{n - \sum x_i + \beta - 1}$$

Log-likelihood:

$$\ln L(\theta) = \Big(\sum\limits_{i=1}^{n} x_i + \alpha - 1\Big)\ln\theta + \Big(n - \sum\limits_{i=1}^{n} x_i + \beta - 1\Big)\ln(1-\theta)$$
Taking the partial derivative with respect to $\theta$:

$$\frac{\partial \ln L(\theta)}{\partial \theta} = \frac{\sum\limits_{i=1}^{n} x_i + \alpha - 1}{\theta} - \frac{n - \sum\limits_{i=1}^{n} x_i + \beta - 1}{1-\theta}$$

Setting the derivative to zero,

$$\frac{\sum\limits_{i=1}^{n} x_i + \alpha - 1}{\theta} = \frac{n - \sum\limits_{i=1}^{n} x_i + \beta - 1}{1-\theta}$$

which yields

$$\hat{\theta} = \frac{\sum\limits_{i=1}^{n} x_i + \alpha - 1}{n + \alpha + \beta - 2}$$
Summary: maximum likelihood vs. Bayesian estimation
- Comparing the maximum likelihood estimate $\hat{\theta} = \frac{1}{n}\sum\limits_{i=1}^{n} x_i$ with the Bayesian estimate $\hat{\theta} = \frac{\sum\limits_{i=1}^{n} x_i + \alpha - 1}{n + \alpha + \beta - 2}$: as the sample size $n$ tends to infinity, the two estimates converge to the same value;
- Bayesian estimation incorporates prior information about the parameter; when $n$ is large enough, the prior information becomes negligible relative to the sample information, so the result approximates what would be obtained from the sample information alone;
- In the extreme case $n = 1$, the maximum likelihood estimate is either 0 or 1, whereas the Bayesian estimate is $\frac{\alpha}{\alpha+\beta-1}$ or $\frac{\alpha-1}{\alpha+\beta-1}$. This is the advantage of Bayesian estimation when the sample size is small.
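The $n = 1$ comparison above can be sketched in a few lines of code (the prior parameters $\alpha = \beta = 2$ are my own illustrative choice):

```python
def mle(xs):
    # Maximum likelihood estimate: the sample mean.
    return sum(xs) / len(xs)

def bayes_map(xs, alpha, beta):
    # Posterior-mode estimate under a Beta(alpha, beta) prior:
    # (sum x_i + alpha - 1) / (n + alpha + beta - 2)
    return (sum(xs) + alpha - 1) / (len(xs) + alpha + beta - 2)

# One toss: the MLE jumps to an extreme, the Bayesian estimate stays moderate.
print(mle([1]), bayes_map([1], alpha=2, beta=2))  # 1.0 vs 2/3
print(mle([0]), bayes_map([0], alpha=2, beta=2))  # 0.0 vs 1/3
```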
Deriving the maximum likelihood and Bayesian estimates of the mean of a normal distribution

Problem: derive the maximum likelihood and Bayesian estimates of the mean of the following normal distribution. The data $x_1, x_2, \ldots, x_n$ come from a normal distribution $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known:
- Based on the sample $x_1, x_2, \ldots, x_n$, write the maximum likelihood estimate of $\mu$;
- Assuming the prior distribution of $\mu$ is $\mathcal{N}(0, \tau^2)$, based on the sample $x_1, x_2, \ldots, x_n$, write the Bayesian estimate of $\mu$.
1. Maximum likelihood estimate of $\mu$ based on the sample $x_1, x_2, \ldots, x_n$

The density of each sample point is

$$f(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right), \quad i = 1, 2, \ldots, n$$

Likelihood function:

$$\begin{aligned} L(x_i; \mu) &= \prod_{i=1}^{n} f(x_i; \mu) \\ &= (\sqrt{2\pi}\,\sigma)^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum\limits_{i=1}^{n}(x_i-\mu)^2\right) \end{aligned}$$

Log-likelihood function:

$$\begin{aligned} \ln L(x_i; \mu) &= -n\ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\sum\limits_{i=1}^{n}(x_i-\mu)^2 \\ \Rightarrow \quad \frac{\partial \ln L(x_i; \mu)}{\partial \mu} &= \frac{1}{\sigma^2}\sum\limits_{i=1}^{n}(x_i-\mu) \\ &= \frac{1}{\sigma^2}\Big(\sum\limits_{i=1}^{n} x_i - n\mu\Big) \end{aligned}$$

Setting $\frac{1}{\sigma^2}\big(\sum\limits_{i=1}^{n} x_i - n\mu\big) = 0$ yields

$$\hat{\mu} = \frac{1}{n}\sum\limits_{i=1}^{n} x_i$$
2. Assuming the prior distribution of $\mu$ is $\mathcal{N}(0, \tau^2)$, write the Bayesian estimate of $\mu$ based on the sample $x_1, x_2, \ldots, x_n$

Prior density:

$$f(\mu) = \frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{\mu^2}{2\tau^2}\right)$$

Posterior:

$$\begin{aligned} P(\mu \mid x_1, x_2, \ldots, x_n) &= \frac{P(\mu, x_1, x_2, \ldots, x_n)}{P(x_1, x_2, \ldots, x_n)} \\ &= \frac{f(\mu)\, p(x_1 \mid \mu) \cdots p(x_n \mid \mu)}{\int P(\mu, x_1, x_2, \ldots, x_n)\,\mathrm{d}\mu} \\ &\propto f(\mu)\, p(x_1 \mid \mu) \cdots p(x_n \mid \mu) \\ &= \frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{\mu^2}{2\tau^2}\right)\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) \end{aligned}$$

The posterior kernel $L(\mu)$ is then:

$$L(\mu) = \frac{1}{\sqrt{2\pi}\,\tau}\exp\left(-\frac{\mu^2}{2\tau^2}\right)\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
Log-posterior:

$$\begin{aligned} \ln P(\mu \mid x_1, x_2, \ldots, x_n) &= -\ln(\sqrt{2\pi}\,\tau) - \frac{\mu^2}{2\tau^2} - n\ln(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\sum\limits_{i=1}^{n}(x_i-\mu)^2 \\ \Rightarrow \quad \frac{\partial \ln P(\mu \mid x_1, x_2, \ldots, x_n)}{\partial \mu} &= -\frac{\mu}{\tau^2} + \frac{1}{\sigma^2}\sum\limits_{i=1}^{n}(x_i-\mu) \\ &= \frac{1}{\sigma^2}\Big(\sum\limits_{i=1}^{n} x_i - n\mu\Big) - \frac{\mu}{\tau^2} \end{aligned}$$

Setting $\frac{1}{\sigma^2}\big(\sum\limits_{i=1}^{n} x_i - n\mu\big) - \frac{\mu}{\tau^2} = 0$ gives $\frac{1}{\sigma^2}\big(\sum\limits_{i=1}^{n} x_i - n\mu\big) = \frac{\mu}{\tau^2}$, and therefore

$$\hat{\mu} = \frac{\tau^2\sum\limits_{i=1}^{n} x_i}{\sigma^2 + n\tau^2} = \frac{\sum\limits_{i=1}^{n} x_i}{n + \frac{\sigma^2}{\tau^2}}$$

When $n$ is small, the Bayesian estimate tends to be more accurate than the maximum likelihood estimate.
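The two estimators of $\mu$ can be compared in code. A sketch of mine (the parameter values are illustrative): the Bayesian estimate shrinks the sample mean toward the prior mean 0 by a factor $n/(n + \sigma^2/\tau^2)$, so the two agree as $n$ grows.

```python
import random

random.seed(2)

def mu_mle(xs):
    # Maximum likelihood estimate: the sample mean.
    return sum(xs) / len(xs)

def mu_bayes(xs, sigma2, tau2):
    # Posterior mode under a N(0, tau^2) prior: sum(x_i) / (n + sigma^2/tau^2)
    return sum(xs) / (len(xs) + sigma2 / tau2)

mu_true, sigma2, tau2 = 3.0, 4.0, 1.0
for n in (1, 10, 1000):
    xs = [random.gauss(mu_true, sigma2 ** 0.5) for _ in range(n)]
    print(n, mu_mle(xs), mu_bayes(xs, sigma2, tau2))
```

For small $n$ the Bayesian estimate sits noticeably closer to the prior mean 0 than the sample mean does; by $n = 1000$ the shrinkage factor is $1000/1004$ and the two estimates are nearly identical.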