Gaussian Discriminant Analysis (GDA)
Multidimensional Gaussian Model
$z \sim N(\vec\mu, \Sigma)$
$z \in R^n,\ \vec\mu \in R^n,\ \Sigma \in R^{n \times n}$
$z$ – random variable
$\vec\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}$ – mean vector
$\Sigma$ – covariance matrix. In GDA, all the class-conditional Gaussian models share one covariance matrix.
$E(z) = \vec\mu,\quad Cov(z) = E[(z-\vec\mu)(z-\vec\mu)^T] = E(zz^T) - E(z)E(z)^T$
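As a quick sanity check (a minimal NumPy sketch; the variable names and the particular $\vec\mu$, $\Sigma$ are my own), we can sample from a multivariate Gaussian and verify the mean and covariance identities empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Draw many samples z ~ N(mu, Sigma)
z = rng.multivariate_normal(mu, Sigma, size=200_000)

# E(z) should be close to mu
emp_mean = z.mean(axis=0)

# Cov(z) = E(zz^T) - E(z)E(z)^T
emp_cov = (z.T @ z) / len(z) - np.outer(emp_mean, emp_mean)

print(np.round(emp_mean, 2))  # close to [1, -2]
print(np.round(emp_cov, 2))   # close to Sigma
```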
Intro
GDA assumes:
$x|y=0 \sim N(\mu_0, \Sigma)$
$x|y=1 \sim N(\mu_1, \Sigma)$
$y \sim Ber(\phi),\ \phi = P(y=1)$
GDA model (binary classification)
Multivariate Gaussian distribution:
$P(x) = \frac{1}{(2\pi)^{\frac d2}|\Sigma|^{\frac12}}\exp\left(-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
$|\Sigma|$ is the determinant of $\Sigma$.
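The density formula above translates directly into code (a sketch; `gaussian_pdf` is my own helper name). As a sanity check, a 1D standard normal evaluated at 0 should give $\frac{1}{\sqrt{2\pi}} \approx 0.3989$:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density P(x) as given above."""
    d = len(mu)
    diff = x - mu
    # Normalizing constant: (2*pi)^(d/2) * |Sigma|^(1/2)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

p = gaussian_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]]))
print(round(p, 4))  # 0.3989
```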
Parameters: $\mu_0, \mu_1, \Sigma, \phi$
$P(y) = \phi^y(1-\phi)^{1-y}$
$\phi$ is the prior probability of class 1; it depends on the proportion of the two classes.
Joint likelihood:
$L(\phi, \mu_0, \mu_1, \Sigma) = \prod\limits_{i=1}^m P(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \prod\limits_{i=1}^m P(x^{(i)}|y^{(i)})P(y^{(i)})$
MLE:
$\arg\max\limits_{\phi, \mu_0, \mu_1, \Sigma} \ell(\phi, \mu_0, \mu_1, \Sigma)$, where $\ell = \log L$ is the log-likelihood.
$\phi = \frac{\sum\limits_{i=1}^m y^{(i)}}{m} = \frac{\sum\limits_{i=1}^m 1\{y^{(i)}=1\}}{m}$
$\mu_k = \frac{\sum\limits_{i=1}^m 1\{y^{(i)}=k\}x^{(i)}}{\sum\limits_{i=1}^m 1\{y^{(i)}=k\}},\ k \in \{0,1\}$
$\Sigma = \frac1m\sum\limits_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T$
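The closed-form MLE estimates above can be sketched in a few lines of NumPy (the function name `fit_gda` and the synthetic data are my own; this is an illustration, not a reference implementation):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for the GDA parameters phi, mu_0, mu_1, Sigma."""
    m = len(y)
    phi = y.mean()                       # fraction of samples with y == 1
    mu0 = X[y == 0].mean(axis=0)         # class-0 mean
    mu1 = X[y == 1].mean(axis=0)         # class-1 mean
    # Shared covariance: average outer product of samples centered
    # at their own class mean mu_{y^{(i)}}
    mus = np.where(y[:, None] == 1, mu1, mu0)
    diff = X - mus
    Sigma = diff.T @ diff / m
    return phi, mu0, mu1, Sigma

# Synthetic two-class data
rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], 1.0, size=(300, 2))
X1 = rng.normal([3, 3], 1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 200)

phi, mu0, mu1, Sigma = fit_gda(X, y)
print(round(phi, 2))  # 0.4
```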
Based on the two fitted Gaussian models, we can draw a decision boundary.
Prediction
$\arg\max\limits_y P(y|x) = \arg\max\limits_y \frac{P(x|y)P(y)}{P(x)} = \arg\max\limits_y P(x|y)P(y)$
($P(x)$ is a constant with respect to $y$.)
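The prediction rule can be sketched as follows (`log_gaussian`, `predict`, and the example parameters are my own; I compare $\log P(x|y) + \log P(y)$ for the two classes, which is equivalent to the argmax above but numerically safer):

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """Log of the multivariate Gaussian density."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.inv(Sigma) @ diff)

def predict(x, phi, mu0, mu1, Sigma):
    """argmax_y P(x|y)P(y), computed in log space."""
    score0 = log_gaussian(x, mu0, Sigma) + np.log(1 - phi)
    score1 = log_gaussian(x, mu1, Sigma) + np.log(phi)
    return int(score1 > score0)

# Hypothetical fitted parameters for two well-separated classes
phi = 0.5
mu0, mu1 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.eye(2)

print(predict(np.array([0.2, -0.1]), phi, mu0, mu1, Sigma))  # 0
print(predict(np.array([2.9, 3.2]), phi, mu0, mu1, Sigma))   # 1
```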
GDA & Logistic Regression
(The figure is from my own notes.)
The figure shows that when the data is 1D, the posterior $P(y=1|x)$ looks like the sigmoid function. In fact, it is exactly a sigmoid, and this also holds in higher dimensions; I won't prove it in full here.
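A short sketch of why the posterior is a sigmoid under the shared-$\Sigma$ assumption: by Bayes' rule,

$$P(y=1|x) = \frac{P(x|y=1)\phi}{P(x|y=1)\phi + P(x|y=0)(1-\phi)} = \frac{1}{1+\exp(-(\theta^T x + \theta_0))}$$

because when we expand the two Gaussian densities, the quadratic terms $x^T\Sigma^{-1}x$ cancel (they share $\Sigma$), leaving the linear form with

$$\theta = \Sigma^{-1}(\mu_1-\mu_0),\quad \theta_0 = \frac12\left(\mu_0^T\Sigma^{-1}\mu_0 - \mu_1^T\Sigma^{-1}\mu_1\right) + \log\frac{\phi}{1-\phi}.$$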
GDA makes stronger assumptions than logistic regression, because the class-conditional data has to follow a Gaussian distribution.
When the data follows a Gaussian distribution, or the dataset is very large (so the assumption approximately holds by the central limit theorem), GDA tends to work better than logistic regression.
Also, since the MLE solutions above are given in closed form, the model has no problem with local optima.