Reference:
- Sections 2.1-2.6 of Pattern Recognition, Sergios Theodoridis and Konstantinos Koutroumbas (2009)
- Slides of CS4220, TUD
Bayes Decision Theory
We will initially focus on the two-class case.
Let $\omega_1, \omega_2$ be the two classes to which our patterns belong.
- a priori probabilities: $P(\omega_1), P(\omega_2)$
  Assumed to be known. If not, they can easily be estimated as $P(\omega_1)\approx N_1/N$, $P(\omega_2)\approx N_2/N$, where $N_i$ is the number of training samples from class $\omega_i$ and $N$ is the total number of training samples.
- likelihood functions: $p(\mathbf x|\omega_1), p(\mathbf x|\omega_2)$
  Assumed to be known. If not, they can also be estimated from the available training data.
- a posteriori probabilities: $P(\omega_1|\mathbf x), P(\omega_2|\mathbf x)$
  Can be computed via Bayes' rule (a short numerical example is given after this list):
$$P(\omega_i|\mathbf x)=\frac{p(\mathbf x|\omega_i)P(\omega_i)}{p(\mathbf x)}\tag{BD.1}$$
  where $p(\mathbf x)$ is the pdf of $\mathbf x$, given by
$$p(\mathbf x)=\sum_{i=1}^{2} p(\mathbf x|\omega_i)P(\omega_i)\tag{BD.2}$$
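As a quick numerical illustration with made-up numbers: suppose $P(\omega_1)=0.6$, $P(\omega_2)=0.4$, and for some measured $\mathbf x$ we have $p(\mathbf x|\omega_1)=0.2$ and $p(\mathbf x|\omega_2)=0.5$. Then $(BD.2)$ and $(BD.1)$ give
$$p(\mathbf x)=0.2\cdot 0.6+0.5\cdot 0.4=0.32,\qquad P(\omega_1|\mathbf x)=\frac{0.12}{0.32}=0.375,\qquad P(\omega_2|\mathbf x)=\frac{0.20}{0.32}=0.625,$$
and the two posteriors sum to one, as they must.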
The Bayes classification rule can now be stated as
$$\begin{aligned} &\text{If } P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{1}\\ &\text{If } P(\omega_{1}|\mathbf x)<P(\omega_{2}|\mathbf x), \quad \mathbf x \text{ is classified to } \omega_{2} \end{aligned}\tag{BD.3}$$
Using $(BD.1)$, and since $p(\mathbf x)$ is positive and the same for both classes, the decision can equivalently be based on the inequalities
$$p(\mathbf x|\omega_1)P(\omega_1)\gtrless p(\mathbf x|\omega_2)P(\omega_2)\tag{BD.4}$$
Equivalently, the rule can be written in any of the following forms:
$$\begin{aligned} p(\mathbf x|\omega_1)P(\omega_1)-p(\mathbf x|\omega_2)P(\omega_2) &\gtrless 0\\ \frac{p(\mathbf x|\omega_1)P(\omega_1)}{p(\mathbf x|\omega_2)P(\omega_2)} &\gtrless 1\\ \log\big(p(\mathbf x|\omega_1)P(\omega_1)\big)-\log\big(p(\mathbf x|\omega_2)P(\omega_2)\big) &\gtrless 0 \end{aligned}$$
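A minimal sketch of this rule in Python, assuming (purely for illustration) one-dimensional Gaussian likelihoods; the means, standard deviations, and priors below are made up and not part of the text:

```python
from scipy.stats import norm

# Illustrative two-class problem: 1-D Gaussian class-conditional densities.
# All parameters here are assumptions made for the example.
P1, P2 = 0.5, 0.5                  # priors P(w1), P(w2)
like1 = norm(loc=0.0, scale=1.0)   # p(x|w1)
like2 = norm(loc=2.0, scale=1.0)   # p(x|w2)

def posteriors(x):
    """Posteriors via (BD.1)-(BD.2); p(x) is the sum of the two joint terms."""
    g1 = like1.pdf(x) * P1
    g2 = like2.pdf(x) * P2
    px = g1 + g2
    return g1 / px, g2 / px

def classify(x):
    """Bayes rule (BD.4): pick the class with the larger p(x|wi) * P(wi)."""
    return 1 if like1.pdf(x) * P1 > like2.pdf(x) * P2 else 2

print(classify(0.3))     # -> 1, x lies close to the mean of class 1
print(posteriors(1.0))   # -> (0.5, 0.5), x is exactly halfway between the means
```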
Minimizing the classification error probability
We will show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.
Define the classification error probability as
$$P(error)=\sum_{i=1}^{C} P(error|\omega_i)P(\omega_i)\tag{BD.5}$$
where $C$ is the number of classes (here $C=2$).
For a given decision boundary (the green line in the figure), it can happen that an $\mathbf x$ from $\omega_2$ is assigned to $\omega_1$, and that an $\mathbf x$ from $\omega_1$ is assigned to $\omega_2$. Let $R_i$ be the region of the feature space in which we decide in favor of $\omega_i$. The classification error probability can be written as
$$\begin{aligned} P_{e}&=P(\omega_1)\int_{R_{2}} p(\mathbf x|\omega_1)\, d\mathbf x+P(\omega_2)\int_{R_{1}} p(\mathbf x|\omega_2)\, d\mathbf x \\ &=\int_{R_{2}} P(\omega_{1}|\mathbf x)\, p(\mathbf x)\, d\mathbf x+\int_{R_{1}} P(\omega_{2}|\mathbf x)\, p(\mathbf x)\, d\mathbf x \end{aligned}\tag{BD.6}$$
which corresponds to the yellow shaded area in the figure. It is then intuitive that the optimal decision boundary should be placed at
$$\mathbf x_0:\quad P(\omega_{1}|\mathbf x_0)=P(\omega_{2}|\mathbf x_0)\tag{BD.7}$$
so that
$$\begin{array}{l} R_{1}: P(\omega_{1}|\mathbf x)>P(\omega_{2}|\mathbf x) \\ R_{2}: P(\omega_{2}|\mathbf x)>P(\omega_{1}|\mathbf x) \end{array}\tag{BD.8}$$
Evaluating $(BD.6)$ with the regions defined by $(BD.7)$ and $(BD.8)$ gives the Bayes error $\varepsilon^*$, the minimum classification error probability that can be achieved. It does not depend on the classification rule we apply, only on the distributions of the data. In practice we cannot compute $\varepsilon^*$, since we do not know the true distributions and the high-dimensional integrals are difficult to evaluate.
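For a toy problem where the true densities are known, $\varepsilon^*$ can be approximated numerically. A minimal sketch, assuming the same illustrative 1-D Gaussian setup as above (all parameters are made up): at each $x$ the Bayes rule keeps the larger of the two joint terms, so the error contribution is the smaller one, and integrating it over the axis approximates $(BD.6)$ at its minimum.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D Gaussian setup (assumed, not from the text).
P1, P2 = 0.5, 0.5
like1 = norm(loc=0.0, scale=1.0)
like2 = norm(loc=2.0, scale=1.0)

# Approximate (BD.6) on a dense grid: the error density at each x is the
# smaller of the two joint terms p(x|wi) * P(wi), because the Bayes rule
# always decides for the larger one.
x = np.linspace(-10.0, 12.0, 200_001)
dx = x[1] - x[0]
err_density = np.minimum(like1.pdf(x) * P1, like2.pdf(x) * P2)
bayes_error = err_density.sum() * dx
print(bayes_error)   # ~0.1587 for these parameters, i.e. norm.cdf(-1)
```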
Minimizing the average risk
Sometimes misclassifying a pattern from class A as class B is much more serious than misclassifying one from class B as class A. In such cases it is more appropriate to assign a penalty term that weighs each type of error:
$$r=\lambda_{12}P(\omega_1)\int_{R_{2}} p(\mathbf x|\omega_1)\, d\mathbf x+\lambda_{21}P(\omega_2)\int_{R_{1}} p(\mathbf x|\omega_2)\, d\mathbf x\tag{BD.9}$$
where $\lambda_{12}$ is the penalty for assigning a pattern from $\omega_1$ to $R_2$, and $\lambda_{21}$ the penalty for assigning a pattern from $\omega_2$ to $R_1$.
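Minimizing $r$ pointwise leads to a loss-weighted version of $(BD.4)$: assign $\mathbf x$ to $\omega_1$ when $\lambda_{12}\,p(\mathbf x|\omega_1)P(\omega_1) > \lambda_{21}\,p(\mathbf x|\omega_2)P(\omega_2)$. A minimal sketch, again with the assumed 1-D Gaussian setup and made-up penalty values:

```python
from scipy.stats import norm

# Illustrative parameters (assumed, not from the text).
P1, P2 = 0.5, 0.5
like1 = norm(loc=0.0, scale=1.0)
like2 = norm(loc=2.0, scale=1.0)
lam12, lam21 = 1.0, 5.0   # misclassifying a class-2 pattern is 5x more costly

def classify_min_risk(x):
    """Risk-weighted rule: x -> w1 iff lam12*p(x|w1)*P(w1) > lam21*p(x|w2)*P(w2)."""
    return 1 if lam12 * like1.pdf(x) * P1 > lam21 * like2.pdf(x) * P2 else 2

# With equal penalties the boundary would sit at x = 1; the larger lam21
# pushes it toward class 1, so this point is now assigned to w2.
print(classify_min_risk(0.8))   # -> 2
```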