Hypothesis Representation

$h_\theta(x)$: estimated probability that $y = 1$ on input $x$

$$\large h_\theta(x) = P(y=1\,|\,x;\theta) = g(\theta^Tx)$$

- sigmoid function

$$\large g(z) = \frac{1}{1+e^{-z}}$$
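A minimal NumPy sketch of the hypothesis above (assuming `theta` and `x` are 1-D arrays of the same length):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated P(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

# g(0) = 0.5; g(z) -> 1 as z -> +inf and -> 0 as z -> -inf
print(sigmoid(0.0))  # 0.5
```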
Decision Boundary
$$\large y = \begin{cases} 1, & h_\theta(x) \geq 0.5, \text{ i.e. } \theta^Tx \geq 0 \\ 0, & h_\theta(x) < 0.5, \text{ i.e. } \theta^Tx < 0 \end{cases}$$
- Non-Linear Decision Boundary
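A non-linear boundary comes from feeding polynomial features to the same hypothesis. The parameter values below are hypothetical, chosen so that predicting $y=1$ when $\theta^Tx \geq 0$ yields the circular boundary $x_1^2 + x_2^2 = 1$:

```python
import numpy as np

# Hypothetical parameters for
#   h_theta(x) = g(theta0 + theta1*x1 + theta2*x2 + theta3*x1^2 + theta4*x2^2)
# With theta = [-1, 0, 0, 1, 1], theta^T x >= 0 exactly when x1^2 + x2^2 >= 1,
# so the decision boundary is the unit circle.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return 1 if theta @ features >= 0 else 0

print(predict(2.0, 2.0))  # 1: outside the unit circle
print(predict(0.0, 0.0))  # 0: inside the unit circle
```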
Logistic regression cost function
- cost

$$\large \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1-h_\theta(x)) & \text{if } y = 0 \end{cases}$$

- simplified version

$$\large \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$$
- $J(\theta)$

$$\large J(\theta) = \frac{1}{m}\sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
- to fit parameters $\theta$:

$$\large \min_\theta J(\theta)$$
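The cost $J(\theta)$ can be computed in vectorized form. A minimal sketch, assuming `X` is the $(m, n)$ design matrix (with a bias column) and `y` a vector of 0/1 labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy check: with theta = 0, h = 0.5 for every example, so J = -log(0.5) = log 2
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
print(cost(np.zeros(2), X, y))  # ~0.6931
```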
Advanced Optimization
Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Advantages
- No need to manually pick $\alpha$
- Often faster than gradient descent
Disadvantage
- More complex
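In practice these algorithms come from an optimization library rather than being written by hand. A sketch using SciPy's `minimize` with `method="BFGS"` (the data here is a hypothetical toy set; any gradient-based method from the list above works the same way):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    m = len(y)
    return (X.T @ (sigmoid(X @ theta) - y)) / m

# Toy data with overlapping classes so J has a finite minimizer
X = np.array([[1.0, -2.0], [1.0, 1.0], [1.0, -1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# BFGS picks its own step sizes: no learning rate alpha to tune by hand
res = minimize(cost, x0=np.zeros(2), args=(X, y), method="BFGS", jac=gradient)
print(res.x)  # fitted theta
```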
Multiclass classification
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

On a new input $x$, to make a prediction, pick the class $i$ that maximizes

$$\large \max_i h_\theta^{(i)}(x)$$
$$\large h_\theta^{(i)}(x) = P(y=i \,|\, x;\theta) \quad (i = 1, 2, 3)$$
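The one-vs-all prediction step can be sketched as follows. The parameter matrix `Theta` here is hypothetical (one row per class, as if each classifier had already been trained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained parameters: row i-1 holds theta for classifier h^(i)
Theta = np.array([
    [1.0, 2.0, 0.0],    # class 1
    [0.0, -1.0, 1.0],   # class 2
    [-1.0, 0.0, -2.0],  # class 3
])

def predict_one_vs_all(Theta, x):
    """Pick the class i whose classifier reports the highest P(y = i | x)."""
    probs = sigmoid(Theta @ x)   # h_theta^(i)(x) for every class at once
    return np.argmax(probs) + 1  # classes are numbered from 1

x = np.array([1.0, 0.5, -0.5])   # input with bias term x0 = 1
print(predict_one_vs_all(Theta, x))  # 1
```

Because the sigmoid is monotonic, taking the argmax of `Theta @ x` directly would give the same class; applying `sigmoid` just makes the scores interpretable as probabilities.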