Logistic Regression
1. Problems of Linear Regression When Applied to a Classification Problem
1) $h_{\theta}(x)$ may fall outside the range $[0, 1]$
2) a few unusual (outlier) feature values can shift the fitted line and break the classification threshold
2. Logistic Regression Model
1) $h_{\theta}(x)=g(\theta^{T}x) = P(y=1 \mid x ; \theta)$
where $g(z)=\frac{1}{1+e^{-z}}$ is called the Sigmoid Function / Logistic Function
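A minimal NumPy sketch of the hypothesis (the helper names `sigmoid` and `hypothesis` are illustrative, not from the notes):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}); maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)
    return sigmoid(x @ theta)
```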
3. Decision Boundary
y=1 → $h_{\theta}(x)>0.5$ → $\theta^{T}x>0$
y=0 → $h_{\theta}(x)<0.5$ → $\theta^{T}x<0$
decision boundary: $h_{\theta}(x)=0.5$ → $\theta^{T}x=0$ (may be nonlinear)
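Since the predicted label depends only on the sign of $\theta^{T}x$, a predictor never needs to evaluate the sigmoid; a small sketch on top of the setup above (the name `predict` is assumed):

```python
def predict(theta, X):
    # theta^T x >= 0  <=>  h_theta(x) >= 0.5  =>  predict y = 1
    return (X @ theta >= 0).astype(int)
```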
4. Cost Function
$Cost(h(x),y)=\begin{cases} -\log(h(x)), & y=1\\ -\log(1-h(x)), & y=0 \end{cases} = -y\log(h(x))-(1-y)\log(1-h(x))$
$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h(x^{(i)}),y^{(i)}) = \frac{1}{m}\left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right)$
where $h=g(X\theta)$ (the vectorized form already sums over the examples, so the explicit $\sum_{i}$ drops out)
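A sketch of the vectorized cost, assuming `X` already includes the column of ones and reusing the `sigmoid` helper above:

```python
def cost(theta, X, y):
    # J(theta) = (1/m) * ( -y^T log(h) - (1-y)^T log(1-h) ), with h = g(X theta)
    m = y.shape[0]
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - ((1 - y) @ np.log(1 - h))) / m
```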
5. Iteration Formula
$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})\,x_j^{(i)}$
vectorized formula:
$\theta:=\theta-\alpha\frac{1}{m}X^{T}(g(X\theta)-y)$
(identical in form to the linear regression update, except that $h$ uses the sigmoid)
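The vectorized update translates directly into a loop; a sketch (the learning rate and iteration count are arbitrary placeholders):

```python
def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # theta := theta - alpha * (1/m) * X^T (g(X theta) - y)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta
```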
6. Some Optimization Algorithms
Conjugate Gradient / BFGS / L-BFGS
No need to pick $\alpha$ manually and they usually converge faster, but they are more complex
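As one possible way to use such a solver, SciPy's `minimize` can run BFGS given the cost and its gradient; this is only an illustration (not the course's Octave `fminunc` call) and reuses the `cost` and `sigmoid` sketches above:

```python
from scipy.optimize import minimize

def gradient(theta, X, y):
    # dJ/dtheta = (1/m) * X^T (g(X theta) - y)
    m = y.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def fit_bfgs(X, y):
    # BFGS chooses its own step sizes, so there is no alpha to tune
    res = minimize(cost, np.zeros(X.shape[1]), args=(X, y), jac=gradient, method="BFGS")
    return res.x
```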
7. Multiclass Classification: one-vs-all
Train a classifier $h_{\theta}^{(i)}(x)$ for every class $i$.
When predicting, use $\max_{i}(h_{\theta}^{(i)}(x))$, i.e. choose the class whose classifier outputs the highest probability.
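A one-vs-all sketch built on the earlier `gradient_descent` helper, assuming the classes are labeled $0,\dots,K-1$:

```python
def one_vs_all(X, y, num_classes):
    # Column i holds the parameters of the binary classifier "class i vs the rest"
    return np.column_stack([gradient_descent(X, (y == i).astype(float))
                            for i in range(num_classes)])

def predict_one_vs_all(all_theta, X):
    # Pick the class whose classifier reports the highest probability
    return np.argmax(sigmoid(X @ all_theta), axis=1)
```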
8. Overfitting Problems
underfit: high bias, usually too few features
overfit: high variance, too many features; fits the training set well but fails to predict new examples
Two solutions:
- Reduce the number of features
- Regularization: keep all features but shrink the magnitudes of the parameters $\theta_j$
9. Regularization
Adding $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}$ to $J(\theta)$.
Note that it does not contain $\theta_{0}$!
$\lambda$: regularization parameter, making $\theta$ small
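For logistic regression the penalty is simply added to the cost from section 4; a sketch reusing the `cost` helper (the slice `theta[1:]` is what skips $\theta_0$):

```python
def cost_regularized(theta, X, y, lam):
    # J(theta) + lambda/(2m) * sum_{j>=1} theta_j^2   (theta_0 is not penalized)
    m = y.shape[0]
    return cost(theta, X, y) + lam / (2 * m) * np.sum(theta[1:] ** 2)
```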
10. Regularized Linear Regression
(1) Cost Function and Gradient Descent
$J(\theta) = \frac{1}{2m}\left(\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^{2}+\lambda\sum_{j=1}^{n}\theta_{j}^{2}\right)$
Note that it does not contain $\theta_{0}$!
$\theta_{j}:=\theta_{j}-\alpha\left(\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})\,x_j^{(i)}+\frac{\lambda}{m}\theta_{j}\right)$ for $j\neq 0$
which is also
$\theta_{j}:=\theta_{j}(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})\,x_j^{(i)}$ for $j\neq 0$
($\theta_{0}$ keeps the ordinary, unregularized update)
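A sketch of one regularized update for linear regression (here $h(x)=\theta^{T}x$, so no sigmoid; `gd_step_regularized` is an assumed helper name):

```python
def gd_step_regularized(theta, X, y, alpha, lam):
    # grad_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij, plus (lambda/m) * theta_j for j >= 1
    m = y.shape[0]
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]   # theta_0 is left unregularized
    return theta - alpha * grad
```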
(2) Normal Equation
$\theta=(X^{T}X+\lambda\,\mathrm{diag}(0,1,1,\dots,1,1))^{-1}X^{T}y$ where the size of $\mathrm{diag}(\cdot)$ is $(n+1)\times(n+1)$
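A direct translation of the regularized normal equation, using `np.linalg.solve` instead of an explicit inverse (a common numerical choice, not something the notes prescribe):

```python
def normal_equation_regularized(X, y, lam):
    # theta = (X^T X + lambda * diag(0, 1, ..., 1))^{-1} X^T y
    L = np.eye(X.shape[1])   # (n+1) x (n+1) identity
    L[0, 0] = 0.0            # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```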