Classification
An input x is fed into a function, which produces an output; the output is assigned to one of several classes. For example, in credit scoring the inputs are income, savings, profession, and so on, and the output is accept or refuse. Other examples include medical diagnosis, handwritten character recognition, and face recognition.
How to do Classification
An ideal alternative: the loss function counts the number of training examples that f classifies incorrectly:

L(f)=\sum_{n}\delta(f(x^n)\neq \hat{y}^n)
This loss cannot be differentiated, because it is not continuous, so we cannot minimize it directly with gradient descent.
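As a small sketch, the 0/1 loss above just counts mismatches between predictions and labels; the arrays below are made-up illustrative data:

```python
import numpy as np

# Hypothetical predictions f(x^n) and true labels \hat{y}^n
y_pred = np.array([1, 0, 1, 1, 0])
y_true = np.array([1, 1, 1, 0, 0])

# delta(f(x^n) != \hat{y}^n) is 1 per misclassified example, 0 otherwise
loss = int(np.sum(y_pred != y_true))
print(loss)  # 2 misclassified examples
```

The comparison `y_pred != y_true` is a step function of the predictions, which is why this loss has no useful gradient.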
Two Classes
We all know Bayes' rule:

P(C_1|X)=\frac{P(X|C_1)P(C_1)}{P(X)}
We use the training data to estimate these probabilities.
Generative Model
P(X)=P(X|C_1)P(C_1)+P(X|C_2)P(C_2)
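A toy numeric check of Bayes' rule with the law of total probability; all the probabilities here are made up for illustration:

```python
# Made-up priors and class-conditional likelihoods
p_c1, p_c2 = 0.6, 0.4          # P(C1), P(C2)
p_x_c1, p_x_c2 = 0.2, 0.05     # P(X|C1), P(X|C2)

# Total probability: P(X) = P(X|C1)P(C1) + P(X|C2)P(C2)
p_x = p_x_c1 * p_c1 + p_x_c2 * p_c2

# Posterior via Bayes' rule
p_c1_x = p_x_c1 * p_c1 / p_x
print(round(p_c1_x, 3))
```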
Prior
P(C_1) and P(C_2) are the priors.
Feature
For each class, we analyze the distribution of the training data's features.
Gaussian Distribution
Like this:
f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}
The shape of this function is determined by the mean \mu and the covariance matrix \Sigma. \mu acts like the center of the contours, and \Sigma like their radius. The features of the training data can be regarded as samples from this Gaussian: a point close to \mu is sampled easily, while a point far away is sampled rarely. How do we get \mu and \Sigma?
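The density above can be evaluated directly; `gaussian_pdf` is an illustrative helper (not from the original notes), shown here on a 2-dimensional standard Gaussian:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density f_{mu,Sigma}(x) for a D-dim point x."""
    d = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.eye(2)

# The density is highest at the mean and decays with distance from it.
print(gaussian_pdf(np.array([0.0, 0.0]), mu, sigma))  # 1/(2*pi)
print(gaussian_pdf(np.array([3.0, 3.0]), mu, sigma))  # much smaller
```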
Maximum Likelihood
The key is to find the maximum likelihood. A Gaussian with any mean \mu and covariance matrix \Sigma could have generated the sample points, but each choice gives a different likelihood. The likelihood of a Gaussian with mean \mu and covariance matrix \Sigma is the probability that it samples x^1,x^2,...,x^{79}:
L(\mu,\Sigma)=f_{\mu,\Sigma}(x^1)f_{\mu,\Sigma}(x^2)\cdots f_{\mu,\Sigma}(x^n)
We assume x^1,x^2,x^3,...,x^n are generated from the Gaussian (\mu^*,\Sigma^*) that has the maximum likelihood:
\mu^*,\Sigma^*=\arg\max_{\mu,\Sigma}L(\mu,\Sigma)
How to quickly calculate the parameters?
The closed-form solution is the sample average and sample covariance:
\mu^*=\frac{1}{n}\sum_{i=1}^{n}x^i
\Sigma^*=\frac{1}{n}\sum_{i=1}^{n}(x^i-\mu^*)(x^i-\mu^*)^T
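The two estimators have a direct NumPy translation. The data below is synthetic (79 two-dimensional points, echoing the x^{79} sample count above); the key detail is dividing the covariance by n, not n-1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training features for one class: 79 two-dim points
x = rng.normal(loc=[1.0, 2.0], scale=1.0, size=(79, 2))

# Closed-form maximum-likelihood estimates
mu_star = x.mean(axis=0)             # sample mean
diff = x - mu_star
sigma_star = diff.T @ diff / len(x)  # sample covariance (divide by n, not n-1)

print(mu_star)     # should be close to [1, 2]
print(sigma_star)  # should be close to the identity
```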
Now we can do classification!
If P(C_1|x)>0.5, then x belongs to class 1. Since we now know each class's \mu and \Sigma, we can compute the posterior:
P(C_1|X)=\frac{P(X|C_1)P(C_1)}{P(X|C_1)P(C_1)+P(X|C_2)P(C_2)}
We know P(C_1) and P(C_2) from the class proportions in the training data, and P(X|C_1) and P(X|C_2) come from the two Gaussians.
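Putting the pieces together, here is a minimal sketch of the whole generative classifier; the data, the class sizes (60 and 80, matching the counts used in the next section), and the helper names `fit_gaussian` and `posterior_c1` are all illustrative assumptions:

```python
import numpy as np

def fit_gaussian(x):
    """MLE mean and covariance for one class's training features."""
    mu = x.mean(axis=0)
    diff = x - mu
    return mu, diff.T @ diff / len(x)

def gaussian_pdf(x, mu, sigma):
    d = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.linalg.det(sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

rng = np.random.default_rng(1)
x1 = rng.normal([0.0, 0.0], 1.0, size=(60, 2))  # hypothetical class-1 samples
x2 = rng.normal([3.0, 3.0], 1.0, size=(80, 2))  # hypothetical class-2 samples

mu1, s1 = fit_gaussian(x1)
mu2, s2 = fit_gaussian(x2)
p1, p2 = 60 / 140, 80 / 140                     # priors from class counts

def posterior_c1(x):
    """P(C1|x) via Bayes' rule with one Gaussian per class."""
    a = gaussian_pdf(x, mu1, s1) * p1
    b = gaussian_pdf(x, mu2, s2) * p2
    return a / (a + b)

# A point near class 1's mean gets posterior > 0.5, so it is labeled class 1.
print(posterior_c1(np.array([0.0, 0.0])) > 0.5)
```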
Modifying Model
We can make different classes share the same covariance matrix to reduce the number of parameters. The likelihood function then becomes:
L(\mu^1,\mu^2,\Sigma)=f_{\mu^1,\Sigma}(x^1)\cdots f_{\mu^1,\Sigma}(x^n)f_{\mu^2,\Sigma}(x^{n+1})\cdots f_{\mu^2,\Sigma}(x^m)
How do we get the shared \Sigma? Suppose the two classes have 60 and 80 examples in the training data. We combine the two covariance matrices with a weighted average:
\Sigma=\frac{60}{140}\Sigma^1+\frac{80}{140}\Sigma^2
With a shared covariance matrix, the decision boundary becomes linear rather than curved, so this is called a linear model.
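The weighted average of the two covariance matrices can be sketched directly; the synthetic data below just mirrors the 60/80 split in the formula:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal([0.0, 0.0], 1.0, size=(60, 2))  # 60 class-1 examples (hypothetical)
x2 = rng.normal([3.0, 3.0], 1.0, size=(80, 2))  # 80 class-2 examples (hypothetical)

def mle_cov(x):
    """Per-class MLE covariance (divide by n)."""
    diff = x - x.mean(axis=0)
    return diff.T @ diff / len(x)

s1, s2 = mle_cov(x1), mle_cov(x2)

# Shared covariance: weight each class's covariance by its share of the data
sigma = (60 / 140) * s1 + (80 / 140) * s2
print(sigma.shape)  # (2, 2)
```

Because both class likelihoods now use the same \Sigma, the quadratic terms in the log-posterior cancel, which is exactly why the boundary turns linear.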
Three Steps
Function Set(Model):
P(C_1|x)=\frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}
(If P(C_1|x)>0.5, output class 1; otherwise, output class 2.)
Then, goodness of a function: the mean \mu and covariance \Sigma that maximize the likelihood. Finding them gives the best function.