Gradient Descent
- feature scaling + mean normalization
- learning rate:
  - small $\alpha$: slow convergence.
  - large $\alpha$: $J(\theta)$ may not decrease on every iteration and thus may not converge.
- To choose $\alpha$, try:
  …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
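A minimal numpy sketch of feature scaling with mean normalization (the toy data is hypothetical):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each column: subtract the mean, divide by the std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# hypothetical data: house size (sq ft) and number of bedrooms
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
X_scaled, mu, sigma = scale_features(X)
# each column of X_scaled now has mean 0 and std 1
```

At prediction time the same `mu` and `sigma` must be reused to scale new inputs.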
Linear regression
- loss function:
  $L(\widehat{y}^{(i)}, y^{(i)})=\frac{1}{2}(\widehat{y}^{(i)}-y^{(i)})^{2}$
- Cost function:
  $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(\widehat{y}^{(i)}-y^{(i)})^{2}$
- optimization algorithms
- Gradient Descent
    $\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$
  - Normal Equation
    $\theta=(X^{T}X)^{-1}X^{T}Y$
  - Normal Equation Noninvertibility
    np.linalg.pinv(np.dot(X.T, X))
If $X^{T}X$ is noninvertible, the common causes might be:
- redundant features, where two features are very closely related (i.e. they are linearly dependent)
- too many features (e.g. m ≤ n); in this case, delete some features or use "regularization"
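The `pinv` call above only forms the pseudo-inverse of $X^{T}X$; a fuller sketch of solving the normal equation with it, on a hypothetical toy dataset:

```python
import numpy as np

# toy design matrix with a bias column of ones (hypothetical data, y = 1 + x)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

# normal equation; pinv stays well-defined even when X^T X is noninvertible
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
# theta recovers [1.0, 1.0], i.e. intercept 1 and slope 1
```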
- Gradient Descent vs. Normal Equation

| Gradient Descent | Normal Equation |
|---|---|
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no need to iterate |
| works well even when $n$ is large | slow if $n$ is very large: computing the inverse of $X^{T}X$ is $O(n^{3})$ |
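The gradient-descent update rule above can be sketched in vectorized numpy form (learning rate and iteration count are hypothetical choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression with h = X @ theta."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # (1/m) * sum over i of (h_theta(x_i) - y_i) * x_ij, for all j at once
        grad = (X.T @ (X @ theta - y)) / m
        theta -= alpha * grad
    return theta

# same toy data as the normal-equation example (hypothetical)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = gradient_descent(X, y)
# converges toward theta ≈ [1.0, 1.0]
```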
Logistic regression
- Hypothesis Representation
  The logistic function is the sigmoid function.
  $\widehat{y}=h_{\theta}(x)=p(y=1|x;\theta)$
  - $z>0$: predict $y=1$
  - $z<0$: predict $y=0$ (equivalent to the sign(x) function)
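A quick sketch of the sigmoid hypothesis and the 0.5 decision threshold (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); z > 0 gives p > 0.5, z < 0 gives p < 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """y_hat = h_theta(x) = p(y=1 | x; theta), thresholded at 0.5."""
    return sigmoid(x @ theta) >= 0.5
```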
- loss function
  $L(\widehat{y}^{(i)}, y^{(i)})= \left\{ \begin{array}{ll} -\log(\widehat{y}^{(i)}) & \textrm{if $y=1$}\\ -\log(1-\widehat{y}^{(i)}) & \textrm{if $y=0$} \end{array} \right. = -y^{(i)}\log(\widehat{y}^{(i)})-(1-y^{(i)})\log(1-\widehat{y}^{(i)})$
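The two-branch form and the single cross-entropy expression agree numerically; a small check (the probability values are hypothetical):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss: -y*log(y_hat) - (1-y)*log(1-y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# y = 1: the expression reduces to -log(y_hat)
assert np.isclose(loss(0.9, 1), -np.log(0.9))
# y = 0: the expression reduces to -log(1 - y_hat)
assert np.isclose(loss(0.2, 0), -np.log(0.8))
```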
- optimization algorithms
  - gradient descent
    $\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$
    The update rule looks identical to linear regression's, but here $h_{\theta}(x)$ is the sigmoid of $\theta^{T}x$.
  - other optimization algorithms
- Conjugate gradient
- BFGS
- L-BFGS
    - advantages:
      - no need to manually pick $\alpha$
      - often faster than gradient descent
    - disadvantage:
      - more complex
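As a sketch of the "no need to pick $\alpha$" point, L-BFGS via `scipy.optimize.minimize` fits the logistic model on a hypothetical toy set, choosing its own step sizes internally:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Logistic-regression cost J(theta) and its gradient, both at once."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# tiny linearly separable toy set (hypothetical data)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# L-BFGS line-searches its own step sizes; no alpha to tune
res = minimize(cost_and_grad, np.zeros(2), args=(X, y),
               jac=True, method="L-BFGS-B")
theta = res.x
# the fitted model classifies all four training points correctly
```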