Gradient Descent
- feature scaling + mean normalization
- learning rate:
  - small $\alpha$: slow convergence.
  - large $\alpha$: $J(\theta)$ may not decrease on every iteration and thus may not converge.
- To choose $\alpha$, try:
  …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
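A minimal numpy sketch of feature scaling with mean normalization (the toy data is hypothetical):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each column: subtract the mean, divide by the std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# hypothetical data: house size (sq ft) and number of bedrooms
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])
X_scaled, mu, sigma = scale_features(X)
# each column of X_scaled now has mean 0 and std 1
```

At prediction time the same `mu` and `sigma` must be reused to scale new inputs.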
Linear regression
- loss function:
  $L(\widehat{y}^{(i)}, y^{(i)})=\frac{1}{2}(\widehat{y}^{(i)}-y^{(i)})^{2}$
- Cost function:
  $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(\widehat{y}^{(i)}-y^{(i)})^{2}$
- optimization algorithms
- Gradient Descent
    $\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$
  - Normal Equation
    $\theta=(X^{T}X)^{-1}X^{T}Y$
  - Normal Equation Noninvertibility
    np.linalg.pinv(np.dot(X.T, X))
If $X^{T}X$ is noninvertible, the common causes might be:
- redundant features, where two features are very closely related (i.e. they are linearly dependent)
- too many features (e.g. m ≤ n); in this case, delete some features or use "regularization"
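The `pinv` call above only forms the pseudo-inverse of $X^{T}X$; a fuller sketch of solving the normal equation with it, on a hypothetical toy dataset:

```python
import numpy as np

# toy design matrix with a bias column of ones (hypothetical data, y = 1 + x)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

# normal equation; pinv stays well-defined even when X^T X is noninvertible
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
# theta recovers [1.0, 1.0], i.e. intercept 1 and slope 1
```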
- Gradient Descent vs. Normal Equation

| Gradient Descent | Normal Equation |
|---|---|
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no need to iterate |
| works well even when $n$ is large | slow if $n$ is very large: computing the inverse of $X^{T}X$ is $O(n^{3})$ |
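The gradient-descent update rule above can be sketched in vectorized numpy form (learning rate and iteration count are hypothetical choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent for linear regression with h = X @ theta."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        # (1/m) * sum over i of (h_theta(x_i) - y_i) * x_ij, for all j at once
        grad = (X.T @ (X @ theta - y)) / m
        theta -= alpha * grad
    return theta

# same toy data as the normal-equation example (hypothetical)
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = gradient_descent(X, y)
# converges toward theta ≈ [1.0, 1.0]
```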
Logistic regression
- Hypothesis Representation
  The logistic function is the sigmoid function.
  $\widehat{y}=h_{\theta}(x)=p(y=1|x;\theta)$
  - $z>0$: predict $y=1$
  - $z<0$: predict $y=0$ (equivalent to the sign(x) function)
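A quick sketch of the sigmoid hypothesis and the 0.5 decision threshold (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); z > 0 gives p > 0.5, z < 0 gives p < 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """y_hat = h_theta(x) = p(y=1 | x; theta), thresholded at 0.5."""
    return sigmoid(x @ theta) >= 0.5
```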
- loss function
  $L(\widehat{y}^{(i)}, y^{(i)})= \left\{ \begin{array}{ll} -\log(\widehat{y}^{(i)}) & \textrm{if $y=1$}\\ -\log(1-\widehat{y}^{(i)}) & \textrm{if $y=0$} \end{array} \right. = -y^{(i)}\log(\widehat{y}^{(i)})-(1-y^{(i)})\log(1-\widehat{y}^{(i)})$
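The two-branch form and the single cross-entropy expression agree numerically; a small check (the probability values are hypothetical):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss: -y*log(y_hat) - (1-y)*log(1-y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# y = 1: the expression reduces to -log(y_hat)
assert np.isclose(loss(0.9, 1), -np.log(0.9))
# y = 0: the expression reduces to -log(1 - y_hat)
assert np.isclose(loss(0.2, 0), -np.log(0.8))
```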
- optimization algorithms
  - gradient descent
    $\theta_{j}=\theta_{j}-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}$
    The update rule looks identical to linear regression's, but here $h_{\theta}(x)$ is the sigmoid of $\theta^{T}x$.
  - other optimization algorithms
- Conjugate gradient
- BFGS
- L-BFGS
    - advantages:
      - no need to manually pick $\alpha$
      - often faster than gradient descent
    - disadvantage:
      - more complex
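As a sketch of the "no need to pick $\alpha$" point, L-BFGS via `scipy.optimize.minimize` fits the logistic model on a hypothetical toy set, choosing its own step sizes internally:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Logistic-regression cost J(theta) and its gradient, both at once."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# tiny linearly separable toy set (hypothetical data)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# L-BFGS line-searches its own step sizes; no alpha to tune
res = minimize(cost_and_grad, np.zeros(2), args=(X, y),
               jac=True, method="L-BFGS-B")
theta = res.x
# the fitted model classifies all four training points correctly
```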