5 Octave Tutorial
(This Octave tutorial applies equally to MATLAB. I have decided to do the implementations in Python instead; sections 5 and 6 are worth watching whichever language you use.)
5-1 Basic operations
5-2 Moving data around
5-3 Computing on data
5-4 Plotting data
5-5 for, while, if statements, and functions
5-6 Vectorization
Vectorization example

$$h_\theta(x)=\sum_{j=0}^n \theta_j x_j=\theta^T x$$
Matlab
%% Unvectorized implementation
prediction = 0.0;
for j = 1:n+1
    prediction = prediction + theta(j) * x(j);
end

%% Vectorized implementation
prediction = theta' * x;
C++
// Unvectorized implementation (theta has n+1 entries, indexed 0..n)
double prediction = 0.0;
for (int j = 0; j <= n; j++)
    prediction += theta[j] * x[j];

// Vectorized implementation (e.g. with a linear algebra library such as Eigen)
double prediction = theta.transpose() * x;
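Since these notes plan to use Python, here is a sketch of the same comparison with NumPy, assuming `theta` and `x` are 1-D arrays of length n+1 (the values below are made-up examples):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.5, -1.0])  # x[0] = 1 is the bias term

# Unvectorized implementation: explicit loop over the n+1 terms
prediction = 0.0
for j in range(len(theta)):
    prediction += theta[j] * x[j]

# Vectorized implementation: theta^T x as a single dot product
prediction_vec = theta @ x

print(prediction, prediction_vec)  # both give -1.0 for these values
```

The vectorized form is both shorter and faster, since the loop runs in optimized native code instead of the interpreter.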
Gradient descent
6 Logistic Regression
6-1 Classification
Classification
$$y \in \{0,1\} \qquad \begin{matrix} 0:\text{``Negative Class''}\\ 1:\text{``Positive Class''} \end{matrix}$$
With linear regression, $h_\theta(x)$ can be $>1$ or $<0$.
Logistic Regression: $0 \le h_\theta(x) \le 1$
6-2 Hypothesis Representation
Logistic Regression Model
Want $0 \le h_\theta(x) \le 1$
Sigmoid function/Logistic function
$$h_\theta(x)=g(\theta^T x)$$

$$g(z)=\frac{1}{1+e^{-z}}$$

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$
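The sigmoid function is straightforward to implement; a minimal pure-Python sketch:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))  # 0.5: the midpoint, where theta^T x = 0
```

Note that $g(0)=0.5$, $g(z)\to 1$ as $z\to\infty$, and $g(z)\to 0$ as $z\to-\infty$, which is exactly what lets $h_\theta(x)$ be read as a probability.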
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated probability that $y=1$ on input $x$, i.e. the probability that $y=1$, given $x$, parameterized by $\theta$.
6-3 Decision boundary
$$h_\theta(x)=g(\theta^T x)=\frac{1}{1+e^{-\theta^T x}}=P(y=1\mid x;\theta)$$

Predict "y=1" if $\theta^T x \geq 0$ (equivalently $h_\theta(x)\geq 0.5$)
Predict "y=0" if $\theta^T x < 0$ (equivalently $h_\theta(x) < 0.5$)
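The threshold rule can be sketched in pure Python (assuming `theta` and `x` are lists with `x[0] = 1`; the example parameters $\theta=(-3,1,1)$, giving the boundary $x_1+x_2=3$, are an illustration, not from these notes):

```python
def predict(theta, x):
    """Predict y = 1 iff theta^T x >= 0, i.e. h_theta(x) >= 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]          # hypothetical: boundary is x1 + x2 = 3
print(predict(theta, [1, 2, 2]))  # x1 + x2 = 4 >= 3, predicts 1
print(predict(theta, [1, 1, 1]))  # x1 + x2 = 2 <  3, predicts 0
```

Notice the prediction never needs the sigmoid itself: the decision only depends on the sign of $\theta^T x$.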
Decision Boundary
Non-linear decision boundaries
$$h_\theta(x)=g(\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_1^2+\theta_4 x_2^2)$$
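For instance, with $\theta=(-1,0,0,1,1)$ (a hypothetical choice for illustration) the hypothesis above predicts $y=1$ exactly when $x_1^2+x_2^2 \geq 1$, a circular decision boundary:

```python
def predict_circle(x1, x2):
    # theta = (-1, 0, 0, 1, 1) applied to features (1, x1, x2, x1^2, x2^2)
    z = -1 + 0 * x1 + 0 * x2 + x1 ** 2 + x2 ** 2
    return 1 if z >= 0 else 0

print(predict_circle(2.0, 0.0))  # outside the unit circle -> 1
print(predict_circle(0.5, 0.5))  # inside the unit circle  -> 0
```

The boundary is a property of the hypothesis and its parameters, not of the training set: higher-order polynomial features allow more complex boundaries.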
6-4 Cost function
Training set:
$$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$$

m examples

$$x \in \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \quad x_0=1,\ y \in \{0,1\}$$

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$
Cost function
Linear regression
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m \frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2=\frac{1}{m}\sum_{i=1}^m \operatorname{Cost}(h_\theta(x^{(i)}),y^{(i)})$$
non-convex $\longrightarrow$ convex (with the sigmoid $h_\theta$, the squared-error cost is non-convex in $\theta$; the log cost below makes $J(\theta)$ convex, so gradient descent can reach the global minimum)
Logistic regression function
$$\operatorname{Cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \end{cases}$$
Cost $=0$ if $y=1$ and $h_\theta(x)=1$. But as $h_\theta(x)\to 0$, Cost $\to\infty$.
This captures the intuition that if $h_\theta(x)=0$ (predicting $P(y=1\mid x;\theta)=0$) but in fact $y=1$, we penalize the learning algorithm by a very large cost.
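This intuition is easy to check numerically; a minimal sketch of the per-example cost:

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

print(cost(1.0, 1))   # confident and correct: cost 0
print(cost(1e-6, 1))  # confident and wrong: cost is huge (~13.8)
```

A confident wrong prediction ($h\to 0$ when $y=1$, or $h\to 1$ when $y=0$) is punished without bound, while a confident correct one costs nothing.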
6-5 Simplified cost function and gradient descent
Logistic regression cost function
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m \operatorname{Cost}(h_\theta(x^{(i)}),y^{(i)})$$
$$\operatorname{Cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \end{cases}$$
Note: $y=0$ or $1$ always

$$\downarrow$$
$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
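The combined cost can be sketched in pure Python (assuming `X` is a list of feature lists with $x_0=1$ and `y` a list of 0/1 labels; the two-example dataset in the test is made up):

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + e^(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def J(theta, X, y):
    """Logistic regression cost: average negative log-likelihood over m examples."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        hi = h(theta, xi)
        total += yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
    return -total / m
```

A quick sanity check: with $\theta=0$ every prediction is $h=0.5$, so each example contributes $-\log(0.5)=\log 2 \approx 0.693$ regardless of its label.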
To fit parameters $\theta$:
$$\min_\theta J(\theta)$$
To make a prediction given new x:
Output $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$
Gradient Descent
Repeat{
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) = \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
}
The algorithm looks identical to linear regression!
But $h_\theta(x)$ is different: here it is the sigmoid of $\theta^T x$, not $\theta^T x$ itself.
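The update rule above can be sketched as a plain-Python loop (the tiny one-feature dataset, learning rate, and iteration count are all hypothetical choices for illustration):

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training set: each x is (1, feature), labels separate around 2
X = [[1, 0.0], [1, 1.0], [1, 3.0], [1, 4.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
alpha = 0.5
m = len(y)
for _ in range(2000):
    # Compute all partial derivatives, then update every theta_j simultaneously
    grads = [sum((h(theta, X[i]) - y[i]) * X[i][j] for i in range(m)) / m
             for j in range(len(theta))]
    theta = [t - alpha * g for t, g in zip(theta, grads)]
```

After training, the fitted hypothesis assigns low probability to the $y=0$ examples and high probability to the $y=1$ examples. In practice one would vectorize this with NumPy, exactly as section 5-6 advocates.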
6-6 Advanced optimization
Optimization algorithm
Cost function $J(\theta)$. Want $\min_\theta J(\theta)$.
Given $\theta$, we have code that can compute $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ (for $j=0,1,\ldots,n$).
Optimization algorithms
-Gradient descent
-Conjugate gradient
-BFGS
-L-BFGS
Advantages:
- No need to manually pick $\alpha$
- Often faster than gradient descent
Disadvantages:
- More complex
function [jVal,gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);

options = optimset('GradObj','on','MaxIter',100);
initialTheta = zeros(2,1);
[optTheta,functionVal,exitFlag] = fminunc(@costFunction,initialTheta,options);
Note: $\theta_0$ is written as theta(1) in Octave, since Octave indexing starts at 1.
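A pure-Python analogue of the costFunction above, with the role of fminunc played by a plain gradient-descent loop (in real Python code one would hand this cost/gradient pair to an optimizer such as scipy.optimize.minimize; the step size 0.1 here is an arbitrary choice):

```python
def cost_function(theta):
    """J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2 and its gradient."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

theta = [0.0, 0.0]
for _ in range(100):
    _, grad = cost_function(theta)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]

print(theta)  # converges to the minimizer (5, 5)
```

The point of the interface is the same as in the Octave version: the optimizer only needs a routine that, given $\theta$, returns $J(\theta)$ and its gradient.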
6-7 Multi-class classification : One-vs-all
Multiclass classification
*example:* Email foldering/tagging: Work, Friends, Family, Hobby
$$h_\theta^{(i)}(x)=P(y=i\mid x;\theta)\quad(i=1,2,3,\cdots)$$
One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes
$$\max_i h_\theta^{(i)}(x)$$
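The prediction step can be sketched as follows, given one trained parameter vector per class (the three $\theta$ vectors below are hypothetical, not trained on real data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class i whose classifier h_theta^(i)(x) is most confident."""
    probs = [sigmoid(sum(t * xi for t, xi in zip(theta, x)))
             for theta in thetas]
    return probs.index(max(probs))

# Hypothetical classifiers for classes 0, 1, 2; x = (1, feature)
thetas = [[2.0, -1.0], [0.0, 0.0], [-2.0, 1.0]]
print(predict_one_vs_all(thetas, [1, 5.0]))  # class 2 wins for large feature
```

Each classifier is trained separately (class $i$ relabeled as 1, all others as 0); only at prediction time are their probabilities compared.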