Classifiers Based on Bayes Decision Theory

This post introduces Bayes decision theory for the two-class problem: prior probabilities, likelihood functions, computation of posterior probabilities, and classification rules that minimize the classification error probability or the average risk. Examples illustrate Bayesian classification of normally distributed classes using Gaussian densities, as well as nonparametric approaches such as histograms and the k-nearest-neighbor method.

Reference:

Sections 2.1–2.6 of Pattern Recognition, by Sergios Theodoridis and Konstantinos Koutroumbas (2009)

Slides of CS4220, TUD

Bayes Decision Theory

We will initially focus on the two-class case.

Let $\omega_1, \omega_2$ be the two classes to which our patterns belong.

  • Prior probabilities: $P(\omega_1), P(\omega_2)$

    Assumed to be known. If not, they can easily be estimated as $P(\omega_1)\approx N_1/N$, $P(\omega_2)\approx N_2/N$, where $N_i$ is the number of training patterns from class $\omega_i$ and $N$ is the total number of training patterns.

  • Likelihood functions: $p(\mathbf x|\omega_1), p(\mathbf x|\omega_2)$

    Assumed to be known. If not, they can also be estimated from the available training data.

  • Posterior probabilities: $P(\omega_1|\mathbf x), P(\omega_2|\mathbf x)$

    Can be computed by the Bayes rule (see the sketch after this list):
    $$P(\omega_i|\mathbf x)=\frac{p(\mathbf x|\omega_i)P(\omega_i)}{p(\mathbf x)}\tag{BD.1}$$
    where $p(\mathbf x)$ is the pdf of $\mathbf x$, for which we have
    $$p(\mathbf x)=\sum_{i=1}^{2} p(\mathbf x|\omega_i)P(\omega_i)\tag{BD.2}$$
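As a concrete illustration of $(\text{BD.1})$–$(\text{BD.2})$, here is a minimal sketch that assumes 1-D Gaussian class-conditional densities and hypothetical prior values (neither is specified above):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: known priors and 1-D Gaussian likelihoods p(x|w1), p(x|w2).
priors = np.array([0.6, 0.4])                  # P(w1), P(w2)
likelihoods = [norm(loc=0.0, scale=1.0),       # p(x|w1)
               norm(loc=2.0, scale=1.0)]       # p(x|w2)

def posteriors(x):
    """Return [P(w1|x), P(w2|x)] via the Bayes rule (BD.1)-(BD.2)."""
    joint = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()                 # divide by p(x) = sum_i p(x|wi) P(wi)

print(posteriors(0.5))                         # posteriors for a single observation x = 0.5
```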

The Bayes classification rule can now be stated as
$$\begin{aligned} &\text{If } P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{1}\\ &\text{If } P(\omega_{1}|\mathbf x)<P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{2} \end{aligned}\tag{BD.3}$$
Using $(\text{BD.1})$, and noting that $p(\mathbf x)$ is positive and the same for both classes, the decision can equivalently be based on the inequalities
$$p(\mathbf x|\omega_1)P(\omega_1)\gtrless p(\mathbf x|\omega_2)P(\omega_2)\tag{BD.4}$$
which can also be written as
$$p(\mathbf x|\omega_1)P(\omega_1)-p(\mathbf x|\omega_2)P(\omega_2)\gtrless 0$$
$$\frac{p(\mathbf x|\omega_1)P(\omega_1)}{p(\mathbf x|\omega_2)P(\omega_2)}\gtrless 1$$
$$\log\big(p(\mathbf x|\omega_1)P(\omega_1)\big)-\log\big(p(\mathbf x|\omega_2)P(\omega_2)\big)\gtrless 0$$
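Continuing the same hypothetical Gaussian setup, the rule $(\text{BD.4})$ amounts to comparing two numbers per observation; the log form is often preferred numerically because pdf values can be very small. A sketch:

```python
import numpy as np
from scipy.stats import norm

# Same hypothetical setup as before.
priors = np.array([0.6, 0.4])                  # P(w1), P(w2)
likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)] # p(x|w1), p(x|w2)

def classify(x):
    """Bayes rule (BD.3)/(BD.4): pick the class with the larger p(x|wi) P(wi).
    Implemented in log form, i.e. the last inequality above."""
    scores = [np.log(lik.pdf(x)) + np.log(pr) for lik, pr in zip(likelihoods, priors)]
    return 1 if scores[0] > scores[1] else 2   # class label, 1 or 2

print([classify(x) for x in (-1.0, 0.9, 3.0)]) # -> [1, 1, 2] for this setup
```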


Minimizing the classification error probability

We will show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.

Define the classification error probability as
$$P(\mathrm{error})=\sum_{i=1}^{C} P(\mathrm{error}\,|\,\omega_i)P(\omega_i)\tag{BD.5}$$
where $C$ is the number of classes ($C=2$ here).

[Figure: the two class-conditional pdfs with a fixed decision boundary (green line); the shaded yellow area is the classification error probability.]

For a given decision boundary (the green line in the figure), it can happen that an $\mathbf x$ belonging to $\omega_2$ is assigned to $\omega_1$, or that an $\mathbf x$ belonging to $\omega_1$ is assigned to $\omega_2$. Let $R_i$ be the region of the feature space in which we decide in favor of $\omega_i$. The classification error probability can then be written as
$$\begin{aligned} P_{e}&=P(\omega_1)\int_{R_{2}} p(\mathbf x|\omega_1)\, d\mathbf x+P(\omega_2)\int_{R_{1}} p(\mathbf x|\omega_2)\, d\mathbf x \\&=\int_{R_{2}} P(\omega_{1}|\mathbf x)\, p(\mathbf x)\, d\mathbf x+\int_{R_{1}} P(\omega_{2}|\mathbf x)\, p(\mathbf x)\, d\mathbf x \end{aligned}\tag{BD.6}$$
which is the marked (yellow) area in the figure above. It is then intuitive that the optimal decision boundary is the point where the two posteriors are equal,
$$\mathbf x_0:\; P(\omega_{1}|\mathbf x_0)=P(\omega_{2}|\mathbf x_0)\tag{BD.7}$$

[Figure: the decision boundary placed at the point $\mathbf x_0$ of (BD.7).]

so that
$$\begin{array}{l} R_{1}: P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x) \\ R_{2}: P(\omega_{2}|\mathbf x)>P(\omega_{1}|\mathbf x) \end{array}\tag{BD.8}$$
By evaluating $(\text{BD.6})$ over the regions defined by $(\text{BD.7})$–$(\text{BD.8})$, we obtain the Bayes error $\varepsilon^*$, the minimum classification error probability achievable. It does not depend on the classification rule we apply, only on the distribution of the data. In practice we cannot compute $\varepsilon^*$, since we do not know the true distributions, and the high-dimensional integrals are difficult to evaluate.
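For a low-dimensional toy problem where the densities are known, we can approximate the integrals in $(\text{BD.6})$ numerically and see what $\varepsilon^*$ looks like. A minimal 1-D sketch with hypothetical equal-prior Gaussian classes (parameters not taken from the text):

```python
import numpy as np
from scipy.stats import norm

priors = (0.5, 0.5)
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)            # p(x|w1), p(x|w2)

# At each x the Bayes rule keeps the class with the larger P(wi) p(x|wi),
# so the error contribution at x is the smaller of the two (cf. BD.6-BD.8).
x = np.linspace(-8.0, 10.0, 20001)
err_density = np.minimum(priors[0] * p1.pdf(x), priors[1] * p2.pdf(x))
bayes_error = err_density.sum() * (x[1] - x[0])    # simple Riemann-sum approximation
print(f"estimated Bayes error: {bayes_error:.4f}") # ~0.1587 for this setup
```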


Minimizing the average risk

Sometimes misclassifying a pattern from class A as class B is much more costly (or dangerous) than the reverse error. In such cases it is more appropriate to assign a penalty term that weights each type of error, and to minimize the resulting average risk:
$$r=\lambda_{12}P(\omega_1)\int_{R_{2}} p(\mathbf x|\omega_1)\, d\mathbf x+\lambda_{21}P(\omega_2)\int_{R_{1}} p(\mathbf x|\omega_2)\, d\mathbf x\tag{BD.9}$$
where $\lambda_{12}$ is the penalty for assigning a pattern from $\omega_1$ to $R_2$ (i.e., deciding $\omega_2$), and $\lambda_{21}$ the penalty for the reverse error.
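With zero penalty for correct decisions, the rule that minimizes $(\text{BD.9})$ assigns $\mathbf x$ to $\omega_1$ whenever $\lambda_{12}P(\omega_1)p(\mathbf x|\omega_1) > \lambda_{21}P(\omega_2)p(\mathbf x|\omega_2)$ (a standard result, not derived above). A minimal sketch, again with hypothetical 1-D Gaussian classes:

```python
import numpy as np
from scipy.stats import norm

priors = (0.5, 0.5)
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)   # p(x|w1), p(x|w2)
lam12, lam21 = 5.0, 1.0   # hypothetical: calling a w1 pattern "w2" is 5x more costly

def risk_classify(x):
    """Assign x to w1 iff lam12 * P(w1) * p(x|w1) > lam21 * P(w2) * p(x|w2)."""
    return 1 if lam12 * priors[0] * p1.pdf(x) > lam21 * priors[1] * p2.pdf(x) else 2

# The costly w1 -> w2 error pushes the boundary from x = 1 to about x = 1.8,
# enlarging the region that is labelled w1.
print([risk_classify(x) for x in (0.5, 1.5, 2.5)])   # -> [1, 1, 2]
```

Setting $\lambda_{12}=\lambda_{21}$ recovers the error-minimizing rule $(\text{BD.4})$.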
