5 Octave Tutorial
(This Octave tutorial applies equally to MATLAB. I have decided to do the implementations in Python instead; sections 5 and 6 are worth watching whichever language you use.)
5-1 Basic operations
5-2 Moving data around
5-3 Computing on data
5-4 Plotting data
5-5 for, while, if statements, and functions
5-6 Vectorization
Vectorization example

$$h_\theta(x)=\sum_{j=0}^n \theta_j x_j=\theta^T x$$
Matlab
%% Unvectorized implementation
prediction = 0.0;
for j = 1:n+1
    prediction = prediction + theta(j) * x(j);
end

%% Vectorized implementation
prediction = theta' * x;
C++
// Unvectorized implementation (theta has n+1 entries, indexed 0..n)
double prediction = 0.0;
for (int j = 0; j <= n; j++)
    prediction += theta[j] * x[j];

// Vectorized implementation (e.g. with a linear algebra library such as Eigen)
double prediction = theta.transpose() * x;
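Since these notes plan to use Python, here is a sketch of the same comparison with NumPy, assuming `theta` and `x` are 1-D arrays of length n+1 (the values below are made-up examples):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.5, -1.0])  # x[0] = 1 is the bias term

# Unvectorized implementation: explicit loop over the n+1 terms
prediction = 0.0
for j in range(len(theta)):
    prediction += theta[j] * x[j]

# Vectorized implementation: theta^T x as a single dot product
prediction_vec = theta @ x

print(prediction, prediction_vec)  # both give -1.0 for these values
```

The vectorized form is both shorter and faster, since the loop runs in optimized native code instead of the interpreter.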
Gradient descent
6 Logistic Regression
6-1 Classification
Classification
$$y \in \{0,1\} \qquad \begin{matrix} 0:\text{``Negative Class''}\\ 1:\text{``Positive Class''} \end{matrix}$$
With linear regression, $h_\theta(x)$ can be $>1$ or $<0$.
Logistic Regression: $0 \le h_\theta(x) \le 1$
6-2 Hypothesis Representation
Logistic Regression Model
Want $0 \le h_\theta(x) \le 1$
Sigmoid function/Logistic function
$$h_\theta(x)=g(\theta^T x)$$

$$g(z)=\frac{1}{1+e^{-z}}$$

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$
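The sigmoid function is straightforward to implement; a minimal pure-Python sketch:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))  # 0.5: the midpoint, where theta^T x = 0
```

Note that $g(0)=0.5$, $g(z)\to 1$ as $z\to\infty$, and $g(z)\to 0$ as $z\to-\infty$, which is exactly what lets $h_\theta(x)$ be read as a probability.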
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated probability that $y=1$ on input $x$, i.e. the probability that $y=1$, given $x$, parameterized by $\theta$.
6-3 Decision boundary
$$h_\theta(x)=g(\theta^T x)=\frac{1}{1+e^{-\theta^T x}}=P(y=1\mid x;\theta)$$

Predict "y=1" if $\theta^T x \geq 0$ (equivalently $h_\theta(x)\geq 0.5$)
Predict "y=0" if $\theta^T x < 0$ (equivalently $h_\theta(x) < 0.5$)
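The threshold rule can be sketched in pure Python (assuming `theta` and `x` are lists with `x[0] = 1`; the example parameters $\theta=(-3,1,1)$, giving the boundary $x_1+x_2=3$, are an illustration, not from these notes):

```python
def predict(theta, x):
    """Predict y = 1 iff theta^T x >= 0, i.e. h_theta(x) >= 0.5."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]          # hypothetical: boundary is x1 + x2 = 3
print(predict(theta, [1, 2, 2]))  # x1 + x2 = 4 >= 3, predicts 1
print(predict(theta, [1, 1, 1]))  # x1 + x2 = 2 <  3, predicts 0
```

Notice the prediction never needs the sigmoid itself: the decision only depends on the sign of $\theta^T x$.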
Decision Boundary
Non-linear decision boundaries
$$h_\theta(x)=g(\theta_0+\theta_1 x_1+\theta_2 x_2+\theta_3 x_1^2+\theta_4 x_2^2)$$
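For instance, with $\theta=(-1,0,0,1,1)$ (a hypothetical choice for illustration) the hypothesis above predicts $y=1$ exactly when $x_1^2+x_2^2 \geq 1$, a circular decision boundary:

```python
def predict_circle(x1, x2):
    # theta = (-1, 0, 0, 1, 1) applied to features (1, x1, x2, x1^2, x2^2)
    z = -1 + 0 * x1 + 0 * x2 + x1 ** 2 + x2 ** 2
    return 1 if z >= 0 else 0

print(predict_circle(2.0, 0.0))  # outside the unit circle -> 1
print(predict_circle(0.5, 0.5))  # inside the unit circle  -> 0
```

The boundary is a property of the hypothesis and its parameters, not of the training set: higher-order polynomial features allow more complex boundaries.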
6-4 Cost function
Training set:
$$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$$

m examples

$$x \in \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \quad x_0=1,\ y \in \{0,1\}$$

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$
Cost function
Linear regression
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m \frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2=\frac{1}{m}\sum_{i=1}^m \operatorname{Cost}(h_\theta(x^{(i)}),y^{(i)})$$
non-convex $\longrightarrow$ convex (with the sigmoid $h_\theta$, the squared-error cost is non-convex in $\theta$; the log cost below makes $J(\theta)$ convex, so gradient descent can reach the global minimum)
Logistic regression function
$$\operatorname{Cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \end{cases}$$
Cost $=0$ if $y=1$ and $h_\theta(x)=1$. But as $h_\theta(x)\to 0$, Cost $\to\infty$.
This captures the intuition that if $h_\theta(x)=0$ (predicting $P(y=1\mid x;\theta)=0$) but in fact $y=1$, we penalize the learning algorithm by a very large cost.
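This intuition is easy to check numerically; a minimal sketch of the per-example cost:

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

print(cost(1.0, 1))   # confident and correct: cost 0
print(cost(1e-6, 1))  # confident and wrong: cost is huge (~13.8)
```

A confident wrong prediction ($h\to 0$ when $y=1$, or $h\to 1$ when $y=0$) is punished without bound, while a confident correct one costs nothing.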
6-5 Simplified cost function and gradient descent
Logistic regression cost function
$$J(\theta)=\frac{1}{m}\sum_{i=1}^m \operatorname{Cost}(h_\theta(x^{(i)}),y^{(i)})$$
$$\operatorname{Cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \end{cases}$$
Note: $y=0$ or $1$ always

$$\downarrow$$
$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
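The combined cost can be sketched in pure Python (assuming `X` is a list of feature lists with $x_0=1$ and `y` a list of 0/1 labels; the two-example dataset in the test is made up):

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + e^(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def J(theta, X, y):
    """Logistic regression cost: average negative log-likelihood over m examples."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        hi = h(theta, xi)
        total += yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
    return -total / m
```

A quick sanity check: with $\theta=0$ every prediction is $h=0.5$, so each example contributes $-\log(0.5)=\log 2 \approx 0.693$ regardless of its label.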
To fit parameters $\theta$:
$$\min_\theta J(\theta)$$
To make a prediction given new x:
Output $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$
Gradient Descent
Repeat{
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta) = \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
}
The algorithm looks identical to linear regression!
But $h_\theta(x)$ is different: here it is the sigmoid of $\theta^T x$, not $\theta^T x$ itself.
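The update rule above can be sketched as a plain-Python loop (the tiny one-feature dataset, learning rate, and iteration count are all hypothetical choices for illustration):

```python
import math

def h(theta, x):
    """Sigmoid hypothesis h_theta(x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training set: each x is (1, feature), labels separate around 2
X = [[1, 0.0], [1, 1.0], [1, 3.0], [1, 4.0]]
y = [0, 0, 1, 1]

theta = [0.0, 0.0]
alpha = 0.5
m = len(y)
for _ in range(2000):
    # Compute all partial derivatives, then update every theta_j simultaneously
    grads = [sum((h(theta, X[i]) - y[i]) * X[i][j] for i in range(m)) / m
             for j in range(len(theta))]
    theta = [t - alpha * g for t, g in zip(theta, grads)]
```

After training, the fitted hypothesis assigns low probability to the $y=0$ examples and high probability to the $y=1$ examples. In practice one would vectorize this with NumPy, exactly as section 5-6 advocates.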
6-6 Advanced optimization
Optimization algorithm
Cost function $J(\theta)$. Want $\min_\theta J(\theta)$.
Given $\theta$, we have code that can compute $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ (for $j=0,1,\ldots,n$).
Optimization algorithms
-Gradient descent
-Conjugate gradient
-BFGS
-L-BFGS
Advantages:
- No need to manually pick $\alpha$
- Often faster than gradient descent
Disadvantages:
- More complex
function [jVal,gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);

options = optimset('GradObj','on','MaxIter',100);
initialTheta = zeros(2,1);
[optTheta,functionVal,exitFlag] = fminunc(@costFunction,initialTheta,options);
Note: $\theta_0$ is written as theta(1) in Octave, since Octave indexing starts at 1.
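A pure-Python analogue of the costFunction above, with the role of fminunc played by a plain gradient-descent loop (in real Python code one would hand this cost/gradient pair to an optimizer such as scipy.optimize.minimize; the step size 0.1 here is an arbitrary choice):

```python
def cost_function(theta):
    """J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2 and its gradient."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

theta = [0.0, 0.0]
for _ in range(100):
    _, grad = cost_function(theta)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]

print(theta)  # converges to the minimizer (5, 5)
```

The point of the interface is the same as in the Octave version: the optimizer only needs a routine that, given $\theta$, returns $J(\theta)$ and its gradient.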
6-7 Multi-class classification : One-vs-all
Multiclass classification
*example:* Email foldering/tagging: Work, Friends, Family, Hobby
$$h_\theta^{(i)}(x)=P(y=i\mid x;\theta)\quad(i=1,2,3,\cdots)$$
One-vs-all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.
On a new input $x$, to make a prediction, pick the class $i$ that maximizes
$$\max_i h_\theta^{(i)}(x)$$
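The prediction step can be sketched as follows, given one trained parameter vector per class (the three $\theta$ vectors below are hypothetical, not trained on real data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class i whose classifier h_theta^(i)(x) is most confident."""
    probs = [sigmoid(sum(t * xi for t, xi in zip(theta, x)))
             for theta in thetas]
    return probs.index(max(probs))

# Hypothetical classifiers for classes 0, 1, 2; x = (1, feature)
thetas = [[2.0, -1.0], [0.0, 0.0], [-2.0, 1.0]]
print(predict_one_vs_all(thetas, [1, 5.0]))  # class 2 wins for large feature
```

Each classifier is trained separately (class $i$ relabeled as 1, all others as 0); only at prediction time are their probabilities compared.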