Chapter 2 Logistic Regression
1 Logistic Regression
- Logistic regression is a classification algorithm, despite the word "regression" in its name
1.1 Differences between Logistic Regression and Linear Regression
| Logistic Regression | Linear Regression |
| --- | --- |
| classification algorithm | regression algorithm |
| $0 \le h_\theta(x) \le 1$ | $h_\theta(x)$ can be $>1$ or $<0$ |
1.2 Model
- Hypothesis:
$$
\begin{aligned}
h_\theta(x)&=P(y=1\mid x;\theta)=g(\theta^Tx)&&\text{【estimated probability that $y=1$, given $x$, parameterized by $\theta$】}\\
g(z)&=\frac{1}{1+e^{-z}}&&\text{【Sigmoid Function / Logistic Function】}
\end{aligned}
$$
Suppose we predict:
- "$y=1$" if $h_\theta(x)\ge 0.5$ (equivalently, $\theta^Tx\ge 0$)
- "$y=0$" if $h_\theta(x)<0.5$ (equivalently, $\theta^Tx<0$)
```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1 / (1 + np.exp(-z))
```
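A minimal companion sketch of the 0.5 threshold rule above; the helper name `predict` is an illustrative assumption, not from the original notes:

```python
def predict(theta, X):
    # Predict y = 1 exactly when h_theta(x) >= 0.5, i.e. theta^T x >= 0
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```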
- Parameters: $\theta$
- Decision Boundary: the boundary between classes; it is a property of the hypothesis and its parameters, not of the training set. For example, with $\theta=(-3,1,1)^T$ the boundary is the line $x_1+x_2=3$: predict $y=1$ whenever $x_1+x_2\ge 3$.
- Cost Function:
The squared-error cost function used in linear regression would be non-convex with the sigmoid hypothesis, so logistic regression uses a log cost instead:
$$
\begin{aligned}
J(\theta)&=\frac{1}{m}\sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})\\
\mathrm{Cost}(h_\theta(x),y)&=\begin{cases}-\log(h_\theta(x)),&\text{if $y=1$}\\-\log(1-h_\theta(x)),&\text{if $y=0$}\end{cases}\\
\text{equivalently: }\mathrm{Cost}(h_\theta(x),y)&=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))&&\text{【$y=1$ or $0$】}
\end{aligned}
$$
```python
import numpy as np

def cost(theta, X, y):
    # Cross-entropy cost J(theta), averaged over the m training examples
    h = sigmoid(X @ theta)                       # predicted probabilities, shape (m,)
    first = np.multiply(-y, np.log(h))           # contributes when y = 1
    second = np.multiply(1 - y, np.log(1 - h))   # contributes when y = 0
    return np.sum(first - second) / len(X)
```
- Goal (objective function): $\mathop{\text{minimize}}\limits_{\theta} J(\theta)$
1.3 Gradient Descent for $J(\theta)$
- Repeat {
  $$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_j^{(i)}$$
  (simultaneously update all $\theta_j$)
  }
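A minimal vectorized sketch of this update rule, reusing the `sigmoid` helper above; the function name `gradient_descent` and the default values for `alpha` and `num_iters` are illustrative assumptions:

```python
import numpy as np

def gradient_descent(theta, X, y, alpha=0.1, num_iters=1000):
    # Batch gradient descent: one vectorized step per iteration updates
    # all theta_j simultaneously, as the update rule above requires.
    m = len(X)
    for _ in range(num_iters):
        error = sigmoid(X @ theta) - y      # h_theta(x^(i)) - y^(i), shape (m,)
        theta = theta - (alpha / m) * (X.T @ error)
    return theta
```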
1.4 Advanced Optimization (alternatives to Gradient Descent)
- Optimization Algorithms:
  - Conjugate gradient
  - BFGS (Broyden–Fletcher–Goldfarb–Shanno)
  - L-BFGS (limited-memory BFGS)
- Advantages:
  - No need to manually pick the learning rate $\alpha$
  - Often faster than gradient descent
- Disadvantages: more complex
Octave code:

```octave
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
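For comparison, a Python sketch of the same idea using `scipy.optimize.minimize` with L-BFGS, reusing the `sigmoid` and `cost` helpers above; the `gradient` helper and the toy data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def gradient(theta, X, y):
    # Vectorized gradient of J(theta): (1/m) * X^T (h_theta(x) - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(X)

# Toy data: a bias column of ones plus one feature
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([0., 0., 0., 1., 1., 1.])

# L-BFGS minimizes J(theta) without a manually chosen learning rate
result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=gradient, method='L-BFGS-B')
optTheta = result.x
```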
2 Multi-class Classification: One-vs-all
- One-versus-all Classification / One-versus-rest
- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$
- On a new input $x$, predict the class $i$ that maximizes $h_\theta^{(i)}(x)$, i.e. $\mathop{\text{max}}\limits_{i} h_\theta^{(i)}(x)$
- $n$ classes require $n$ classifiers
That is, when solving a multi-class problem, each classifier singles out one class A and treats all the remaining classes as a single class B, as in the sketch below.
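A minimal one-vs-all sketch built on the `sigmoid` and `gradient_descent` helpers above; the function names `one_vs_all` and `predict_one_vs_all` are illustrative assumptions:

```python
import numpy as np

def one_vs_all(X, y, num_labels, alpha=0.1, num_iters=1000):
    # Train one logistic-regression classifier per class i, relabeling
    # y as 1 for class i (class A) and 0 for everything else (class B).
    all_theta = np.zeros((num_labels, X.shape[1]))
    for i in range(num_labels):
        y_i = (y == i).astype(float)
        all_theta[i] = gradient_descent(np.zeros(X.shape[1]), X, y_i,
                                        alpha, num_iters)
    return all_theta

def predict_one_vs_all(all_theta, X):
    # For each input x, pick the class i whose classifier outputs
    # the largest h_theta^(i)(x)
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)
```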
3 References
- Andrew Ng (吴恩达), Machine Learning, Coursera
- Huang Haiguang (黄海广), Machine Learning Notes (机器学习笔记)