2.1 Overview
2.2 Bayesian Decision Theory
Probability basics:
- Probability of an event A: $0\leq P(A)\leq 1$
- Conditional probability: $P(A|B)=\frac{P(AB)}{P(B)}$, $P(B|A)=\frac{P(AB)}{P(A)}$
- Product rule: $P(AB)=P(A|B)P(B)=P(B|A)P(A)$
- Law of total probability: if $B_{1}\cup B_{2}\cup \dots \cup B_{n}=\Omega$ and $B_{i}\cap B_{j}=\varnothing$ for $i\neq j$, then $P(A)=\sum_{i=1}^{n}P(A|B_{i})P(B_{i})$
- Bayes' formula:
  - $P(A_{i}|B)=\frac{P(B|A_{i})P(A_{i})}{\sum_{j=1}^{n}P(B|A_{j})P(A_{j})}$
  - $P(A_{i}|B)\propto P(B|A_{i})P(A_{i})$
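As a quick numeric sketch (the partition, priors, and likelihoods below are made up for illustration), Bayes' formula with the total-probability denominator can be checked in a few lines of Python:

```python
# Hypothetical partition A1, A2, A3 with priors P(Ai) and likelihoods P(B|Ai)
prior = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3); must sum to 1
likelihood = [0.1, 0.4, 0.7]     # P(B|A1), P(B|A2), P(B|A3)

# Law of total probability: P(B) = sum_j P(B|Aj) P(Aj)
p_b = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' formula: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]

print(round(p_b, 2))  # 0.31
print(posterior)      # posteriors sum to 1
```

Note that the posteriors automatically normalize to 1, which is why the proportional form $P(A_{i}|B)\propto P(B|A_{i})P(A_{i})$ suffices for comparing hypotheses.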
Bayes decision:
- Bayes' formula in terms of observed features and classes:
- $P(\omega_{i}|x)=\frac{p(x|\omega_{i})P(\omega_{i})}{p(x)}$ $\left(\text{posterior}=\frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\right)$
- $=\frac{p(x|\omega_{i})P(\omega_{i})}{\sum_{j}p(x|\omega_{j})P(\omega_{j})}$
- $P(\omega_{i}|x)\propto p(x|\omega_{i})P(\omega_{i})$ $(\text{posterior}\propto \text{likelihood}\times\text{prior})$
- Bayes decision rule:
  - $\text{Decide}=\begin{cases} \omega_{1} & p(\omega_{1}|x)>p(\omega_{2}|x) \\ \omega_{2} & \text{otherwise} \end{cases} \Rightarrow \begin{cases} \omega_{1} & p(x|\omega_{1})p(\omega_{1})>p(x|\omega_{2})p(\omega_{2}) \\ \omega_{2} & \text{otherwise} \end{cases}$
  - Likelihood-ratio form: $\begin{cases} \omega_{1} & \frac{p(x|\omega_{1})}{p(x|\omega_{2})}>\frac{p(\omega_{2})}{p(\omega_{1})} \\ \omega_{2} & \text{otherwise} \end{cases}$
- Class similarity (discriminant) functions:
  - $g_{i}(x)=p(\omega_{i}|x)=\frac{p(x|\omega_{i})p(\omega_{i})}{\sum_{j=1}^{c}p(x|\omega_{j})p(\omega_{j})}$
  - $g_{i}(x)=p(x|\omega_{i})p(\omega_{i})$
  - $g_{i}(x)=\ln p(x|\omega_{i})+\ln p(\omega_{i})$
- Decision function (two-class case):
  - $g(x)=g_{1}(x)-g_{2}(x)$
  - $g(x)=p(\omega_{1}|x)-p(\omega_{2}|x)$
  - $g(x)=\ln\frac{p(x|\omega_{1})}{p(x|\omega_{2})}+\ln\frac{p(\omega_{1})}{p(\omega_{2})}$
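A minimal sketch of the log-form decision function $g(x)$, assuming (hypothetically) 1-D Gaussian class-conditional densities with hand-picked means, a shared variance, and made-up priors:

```python
import math

# Hypothetical 1-D Gaussian class-conditional density p(x|w_i)
def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

prior1, prior2 = 0.6, 0.4        # assumed priors p(w1), p(w2)
mu1, mu2, sigma = 0.0, 2.0, 1.0  # assumed class means and shared std

# g(x) = ln[p(x|w1)/p(x|w2)] + ln[p(w1)/p(w2)]; decide w1 if g(x) > 0
def g(x):
    return (math.log(gaussian_pdf(x, mu1, sigma) / gaussian_pdf(x, mu2, sigma))
            + math.log(prior1 / prior2))

def decide(x):
    return 1 if g(x) > 0 else 2

print(decide(0.5), decide(1.8))   # 1 2
```

With equal variances this $g(x)$ reduces to a linear function of $x$, so the decision boundary is the single point where $g(x)=0$.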
2.3 Bayesian Classifiers
Bayesian classifiers:
- Naive Bayes classifier: assumes the attributes (dimensions) of the feature vector $x$ in $p(x|c)$ are mutually independent
  - i.e., it adopts the "attribute conditional independence assumption"
  - $p(c|x)=\frac{p(c)p(x|c)}{p(x)}\propto p(c)p(x|c)=p(c)\prod_{i=1}^{d}p(x_{i}|c)$
  - Key problem: learn the class-conditional probabilities $p(x_{i}|c)$ and the class priors $p(c)$ from the training samples
    - With $k$ classes and $d$ attributes, there are $1+k\cdot d$ probability distributions to estimate (one prior distribution over the classes plus $k\cdot d$ class-conditional distributions)
  - Estimating the class priors: $p(c)=\frac{|D_{c}|}{|D|}$
  - Estimating the class-conditional probabilities:
    - Discrete $x_{i}$: $p(x_{i}|c)=\frac{|D_{c,x_{i}}|}{|D_{c}|}$
    - Continuous $x_{i}$: $p(x_{i}|c)=\frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}\exp\left(-\frac{(x_{i}-\mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)$ (the class-conditional probability is estimated by assuming a parametric distribution, e.g. a Gaussian)
  - Learning process:
    - Estimate the class priors
    - Estimate the class-conditional probabilities
  - Decision process:
    - Evaluate the estimated class priors
    - Evaluate the estimated class-conditional probabilities
    - Apply the Bayes decision rule $h(x)=\underset{c\in \mathcal{Y}}{\operatorname{argmax}}\ P(c)\prod_{i=1}^{d}P(x_{i}|c)$
- Semi-naive Bayes classifier: assumes dependencies exist among some attributes of $x$ in $p(x|c)$
- Gaussian Bayes classifier: assumes the class-conditional density $p(x|c;\theta)$ follows a normal distribution
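The naive Bayes learning and decision steps above can be sketched for discrete attributes with the counting estimates $p(c)=\frac{|D_c|}{|D|}$ and $p(x_i|c)=\frac{|D_{c,x_i}|}{|D_c|}$ (the toy dataset is made up for illustration; Laplace smoothing is omitted for brevity):

```python
from collections import Counter, defaultdict

# Toy training set D: (attribute vector, class label); values are hypothetical
data = [((0, 1), 'a'), ((0, 0), 'a'), ((1, 1), 'b'), ((1, 0), 'b'), ((0, 1), 'a')]

# Learning: class priors p(c) = |Dc|/|D| and conditionals p(xi|c) = |Dc,xi|/|Dc|
n = len(data)
class_count = Counter(c for _, c in data)
prior = {c: cnt / n for c, cnt in class_count.items()}
cond = defaultdict(Counter)          # cond[(c, i)][v] = count of xi = v in class c
for x, c in data:
    for i, v in enumerate(x):
        cond[(c, i)][v] += 1

# Decision: h(x) = argmax_c P(c) * prod_i P(xi|c)
def h(x):
    def score(c):
        s = prior[c]
        for i, v in enumerate(x):
            s *= cond[(c, i)][v] / class_count[c]
        return s
    return max(prior, key=score)

print(h((0, 1)))   # 'a'
```

Without smoothing, an unseen attribute value zeroes out the whole product; in practice the counting estimate is usually replaced by a smoothed one.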
2.4 Bayesian Learning and Parameter Estimation
Bayesian learning
Updating the model prior with the likelihood of the observed data yields the posterior distribution:
$p(\theta|D,\alpha)\propto p(D|\theta)p(\theta|\alpha)$
where $\alpha$ is a hyperparameter, not a parameter to be estimated.
Maximum likelihood estimation
- Maximize the probability of the observed data
- $p(\theta|D,\alpha)\propto p(D|\theta)p(\theta|\alpha)$
- Maximize $p(D|\theta)$
- Likelihood function:
  - $p(D|\theta)=p(x_{1},\dots,x_{n}|\theta)=\prod_{i=1}^{n}p(x_{i}|\theta)$
- Maximum likelihood:
  - $\hat{\theta}=\arg\max_{\theta}p(D|\theta)$
This is equivalent to maximizing the log-likelihood:
$\hat{\theta}=\arg\max_{\theta}\sum_{i=1}^{n}\log p(x_{i}|\theta)$
Solution: set the gradient of the log-likelihood to zero,
$\sum_{i=1}^{n}\nabla_{\theta}\log p(x_{i}|\theta)=0$
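For a 1-D Gaussian likelihood, setting this gradient to zero gives the familiar closed-form estimates $\hat{\mu}=\frac{1}{n}\sum_{i}x_{i}$ and $\hat{\sigma}^{2}=\frac{1}{n}\sum_{i}(x_{i}-\hat{\mu})^{2}$; a minimal check on made-up observations:

```python
# MLE for a 1-D Gaussian: the solution of sum_i grad_theta log p(xi|theta) = 0
samples = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations D

n = len(samples)
mu_hat = sum(samples) / n                               # MLE of the mean
var_hat = sum((x - mu_hat) ** 2 for x in samples) / n   # MLE of the variance (1/n, biased)

print(mu_hat, var_hat)   # 2.5 1.25
```

Note the $1/n$ variance estimate is the maximum-likelihood one; the unbiased sample variance would divide by $n-1$ instead.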
Maximum a posteriori (MAP) estimation
Problem statement
- Find the model or parameter ($\theta$) that maximizes the posterior probability
- $p(\theta|D,\alpha)\propto p(D|\theta)p(\theta|\alpha)$
- Maximize $p(\theta|D,\alpha)$
- By Bayes' formula:
  - $p(\theta|D,\alpha)=\frac{P(D|\theta)P(\theta|\alpha)}{P(D|\alpha)}$
  - $\hat{\theta}_{MAP}:\ \frac{\partial}{\partial\theta}p(\theta|D,\alpha)=0 \ \text{ or }\ \frac{\partial}{\partial\theta}p(D|\theta)p(\theta|\alpha)=0$
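As a hedged sketch: for a Gaussian likelihood $x_{i}\sim N(\mu,\sigma^{2})$ with known $\sigma^{2}$ and a Gaussian prior $\mu\sim N(\mu_{0},\tau^{2})$ playing the role of the hyperparameter $\alpha=(\mu_{0},\tau^{2})$, setting $\frac{\partial}{\partial\mu}\left[\log p(D|\mu)+\log p(\mu|\alpha)\right]=0$ yields the closed form $\hat{\mu}_{MAP}=\frac{\tau^{2}\sum_{i}x_{i}+\sigma^{2}\mu_{0}}{n\tau^{2}+\sigma^{2}}$ (all numbers below are made up):

```python
# MAP estimate of a Gaussian mean under a Gaussian prior on mu
samples = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations D
sigma2 = 1.0                     # assumed known likelihood variance
mu0, tau2 = 0.0, 1.0             # assumed prior hyperparameters alpha

n = len(samples)
# Zero of d/dmu [log p(D|mu) + log p(mu|alpha)], in closed form:
mu_map = (tau2 * sum(samples) + sigma2 * mu0) / (n * tau2 + sigma2)

mu_mle = sum(samples) / n
print(mu_map, mu_mle)   # 2.0 2.5
```

The MAP estimate (2.0) is the MLE (2.5) shrunk toward the prior mean $\mu_{0}=0$; as $n\to\infty$ or $\tau^{2}\to\infty$ the prior's influence vanishes and MAP converges to the MLE.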