Note on Machine Learning By Andrew Ng (3)
Click here for previous note.
Logistic Regression
—A classification algorithm.
Classification
Binary classification: $y \in \{0, 1\}$
0: “Negative Class”
1: “Positive Class”
If we want to fit a line to predict something (like whether a tumor is malignant), we could set the threshold for the classifier output $h_\theta(x)$ at 0.5.
If $h_\theta(x) \geq 0.5$, predict $y = 1$;
if $h_\theta(x) < 0.5$, predict $y = 0$.
You may think linear regression works for classification, but if the data changes a little, it can give a really bad prediction.
Also, classification requires $y = 0$ or $1$, but the $h_\theta(x)$ from linear regression can be $> 1$ or $< 0$.
So, by using logistic regression, we will generate an $h_\theta(x)$ that always lies in $[0, 1]$.
Hypothesis Representation
Find an $h_\theta(x)$ that fits our need, which maps $\mathbb{R}$ to $[0, 1]$.
$$
h_\theta(x) = g(\theta^T x) \\
g(z) = \frac{1}{1 + e^{-z}} \\
\rightarrow h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
$$
We call $g(z)$ the sigmoid function or logistic function.
$h_\theta(x) = P(y = 1 \mid x; \theta)$
Probability that $y = 1$, given $x$, parameterized by $\theta$.
$P(y = 1 \mid x; \theta) + P(y = 0 \mid x; \theta) = 1$
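A minimal sketch of this hypothesis in Python with NumPy (the function and variable names are my own illustration, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x) for every row of X.

    X: (m, n) design matrix whose first column is all ones.
    theta: (n,) parameter vector.
    Returns an (m,) vector of probabilities P(y = 1 | x; theta).
    """
    return sigmoid(X @ theta)
```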
Decision Boundary
The decision boundary is a property of the hypothesis and its parameters.
It separates the two classes by a line or curve.
If $h_\theta(x) \geq 0.5$, predict $y = 1$;
if $h_\theta(x) < 0.5$, predict $y = 0$.
When $z \geq 0$, $g(z) \geq 0.5$. So $h_\theta(x) = g(\theta^T x) \geq 0.5$ exactly when $\theta^T x \geq 0$.
Non-linear decision boundaries
Use polynomial terms: e.g. $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ can give a circular decision boundary.
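A rough sketch of such a non-linear boundary (the feature mapping and parameter values here are my own illustration): with features $[1, x_1, x_2, x_1^2, x_2^2]$ and $\theta = [-1, 0, 0, 1, 1]$, the boundary $\theta^T x = 0$ is the unit circle $x_1^2 + x_2^2 = 1$.

```python
import numpy as np

def map_features(x1, x2):
    """Polynomial feature vector [1, x1, x2, x1^2, x2^2] for one example."""
    return np.array([1.0, x1, x2, x1**2, x2**2])

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])  # boundary: x1^2 + x2^2 = 1

def predict(x1, x2):
    """Predict y = 1 exactly when theta^T x >= 0 (here: outside the circle)."""
    return 1 if map_features(x1, x2) @ theta >= 0 else 0

print(predict(0.2, 0.3))  # inside the unit circle  -> 0
print(predict(1.5, 0.0))  # outside the unit circle -> 1
```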
Cost Function
How do we choose parameters $\theta$?
For linear regression, $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. We can rewrite it as $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{cost}(h_\theta(x^{(i)}), y^{(i)})$, where $\mathrm{cost}(h_\theta(x^{(i)}), y^{(i)}) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$.
But with the sigmoid hypothesis plugged in, this squared-error cost is non-convex, so we need a new cost function that is convex and has a single global minimum.
Logistic regression cost function
For $y = 1$, the cost is $-\log(h_\theta(x))$, so
$$\mathrm{cost} = 0 \quad \text{if} \quad y = 1,\ h_\theta(x) = 1$$
But as
$$h_\theta(x) \rightarrow 0, \quad \mathrm{cost} \rightarrow \infty$$
This captures the intuition that if $h_\theta(x) = 0$ (i.e. we predict $P(y = 1 \mid x; \theta) = 0$) but in fact $y = 1$, we penalize the learning algorithm by a very large cost.
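A quick numeric illustration of that penalty (the values of $h_\theta(x)$ below are made up for illustration); for $y = 1$ the per-example cost is $-\log(h_\theta(x))$:

```python
import numpy as np

# Per-example cost when y = 1; it blows up as the prediction approaches 0.
for h in [1.0, 0.5, 0.1, 1e-6]:
    print(f"h = {h:g}, cost = {-np.log(h):.2f}")
# h = 1      -> cost 0.00   (correct, confident prediction)
# h = 0.5    -> cost 0.69
# h = 0.1    -> cost 2.30
# h = 1e-06  -> cost 13.82  (confident but wrong: huge penalty)
```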
Simplified cost function and gradient descent
$$\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$
$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right]$$
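A vectorized sketch of this cost in Python (names are my own; a small epsilon is added inside the logs to avoid $\log(0)$):

```python
import numpy as np

def compute_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1 - y)*log(1 - h) ]."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for all examples
    return -(1.0 / m) * np.sum(
        y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)
    )
```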
To find the parameters $\theta$, minimize the cost:
$$\min_\theta J(\theta)$$
To make a prediction given a new $x$: output $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.
Gradient Descent
Repeat{
$\theta_j := \theta_j - \alpha \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Just like linear regression!
}
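A minimal batch gradient descent sketch implementing this update (the learning rate and iteration count are arbitrary choices of mine):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Repeat theta_j := theta_j - alpha * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    updating all theta_j simultaneously."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # predictions for all examples
        theta -= alpha * (X.T @ (h - y))        # vectorized update of every theta_j
    return theta
```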
Advanced optimization
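Instead of hand-written gradient descent, more sophisticated optimizers such as conjugate gradient, BFGS, and L-BFGS can minimize $J(\theta)$; they choose the step size automatically and often converge faster (the course demonstrates this with Octave's fminunc). A rough Python equivalent using SciPy, with made-up data just to show the call, might look like:

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta, X, y):
    """Return J(theta) and its gradient, as an off-the-shelf optimizer expects."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    J = -(1.0 / m) * np.sum(y * np.log(h + 1e-12) + (1 - y) * np.log(1 - h + 1e-12))
    grad = (1.0 / m) * (X.T @ (h - y))
    return J, grad

# Tiny made-up dataset, only to illustrate the call.
X = np.array([[1.0, 0.5], [1.0, 2.5], [1.0, 1.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
result = minimize(cost_and_grad, x0=np.zeros(X.shape[1]),
                  args=(X, y), jac=True, method="BFGS")
print(result.x)  # fitted theta
```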
Multi-class Classification (One-vs-all)
Multiclass classification
Email tagging: Work, Friends, Family, Hobby
Weather: Sunny, Cloudy, Rain, Snow
One-vs-all
Separate the problem into many two-class classification problems.
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$, i.e. $\max\limits_{i} h_\theta^{(i)}(x)$.
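A rough one-vs-all sketch in Python (the helper names and hyperparameters are my own; each class reuses the same binary logistic regression training loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, num_iters=2000):
    """Fit one binary classifier per class; row i of Theta holds theta^(i)."""
    m, n = X.shape
    Theta = np.zeros((num_classes, n))
    for c in range(num_classes):
        y_c = (y == c).astype(float)          # relabel: 1 for class c, 0 otherwise
        theta = np.zeros(n)
        for _ in range(num_iters):
            h = sigmoid(X @ theta)
            theta -= alpha * (X.T @ (h - y_c)) / m   # averaged gradient step
        Theta[c] = theta
    return Theta

def predict_one_vs_all(Theta, X):
    """Pick, for each row of X, the class i with the largest h_theta^(i)(x)."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)
```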
Please click here to see the next note.