ML

Gradient Descent

  1. feature scaling + mean normalization
  2. learning rate:
    • small $\alpha$: slow convergence.
    • large $\alpha$: the cost may not decrease on every iteration and thus may not converge.
      To choose $\alpha$, try:
      …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, … (see the sketch below)
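
A minimal NumPy sketch of both tips, using linear regression (defined in the next section) as the example objective. The toy data and helper names (`scale_features`, `gradient_descent`) are illustrative assumptions, not from the original notes:

```python
import numpy as np

def scale_features(X):
    """Mean normalization + feature scaling: (x - mean) / std per column."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def gradient_descent(X, y, alpha, iters=100):
    """Plain batch gradient descent for linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iters):
        err = X @ theta - y                  # h_theta(x) - y
        costs.append((err @ err) / (2 * m))  # J(theta) before this update
        theta -= alpha * (X.T @ err) / m     # simultaneous update of all theta_j
    return theta, costs

# Try the suggested grid of learning rates and watch whether J keeps decreasing.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 2)) * [1, 1000]  # badly scaled features
y = X_raw @ np.array([2.0, 0.003]) + 5
X, mu, sigma = scale_features(X_raw)
X = np.hstack([np.ones((len(X), 1)), X])       # add intercept column
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]:
    _, costs = gradient_descent(X, y, alpha)
    print(f"alpha={alpha}: final cost {costs[-1]:.4f}")
```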

Linear Regression

  • Loss function:
    $L(\widehat{y}^{(i)}, y^{(i)}) = \frac{1}{2}(\widehat{y}^{(i)} - y^{(i)})^{2}$
  • Cost function:
    $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(\widehat{y}^{(i)} - y^{(i)})^{2}$
  • optimization algorithms (see the sketch after the comparison table below)
    1. Gradient Descent
      $\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$
    2. Normal Equation
      $\theta = (X^{T}X)^{-1}X^{T}Y$
    3. Normal Equation Noninvertibility
      theta = np.linalg.pinv(X.T @ X) @ X.T @ y  # pinv handles a singular X^T X
      If $X^{T}X$ is noninvertible, the common causes are:
      • Redundant features, where two features are very closely related (i.e. they are linearly dependent)
      • Too many features (e.g. $m \le n$). In this case, delete some features or use regularization
| Gradient Descent | Normal Equation |
| --- | --- |
| need to choose $\alpha$ | no need to choose $\alpha$ |
| needs many iterations | no need to iterate |
| works well even when $n$ is large | slow if $n$ is very large: computing the inverse of $X^{T}X$ is $O(n^{3})$ |
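
A minimal sketch comparing the two solvers on the same data, assuming NumPy; the toy data and the `cost` helper are made up for illustration:

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum((h - y)^2)."""
    err = X @ theta - y
    return (err @ err) / (2 * len(y))

# Toy data: y = 4 + 3*x plus noise, with an intercept column in X.
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=50)
y = 4 + 3 * x + rng.normal(scale=0.1, size=50)
X = np.column_stack([np.ones_like(x), x])

# 1. Gradient descent: repeat the simultaneous update many times.
theta_gd = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    theta_gd -= alpha * (X.T @ (X @ theta_gd - y)) / len(y)

# 2. Normal equation: closed form, no alpha and no iterations,
#    but inverting X^T X costs O(n^3) when n is large.
theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y

print(theta_gd, cost(X, y, theta_gd))  # both should be close to [4, 3]
print(theta_ne, cost(X, y, theta_ne))
```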

Logistic Regression

  • Hypothesis Representation
    The logistic function is the sigmoid function, $g(z) = \frac{1}{1+e^{-z}}$ with $z = \theta^{T}x$.
    $\widehat{y} = h_{\theta}(x) = g(\theta^{T}x) = p(y=1 \mid x;\theta)$

    • $z > 0$: predict $y = 1$
    • $z < 0$: predict $y = 0$; the decision rule is equivalent to the sign function
  • Loss function
    $L(\widehat{y}^{(i)}, y^{(i)}) = \begin{cases} -\log(\widehat{y}^{(i)}) & \text{if } y^{(i)} = 1 \\ -\log(1-\widehat{y}^{(i)}) & \text{if } y^{(i)} = 0 \end{cases} = -y^{(i)}\log(\widehat{y}^{(i)}) - (1-y^{(i)})\log(1-\widehat{y}^{(i)})$

  • optimization algorithms (a minimal sketch follows this list)

    • gradient descent
      $\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}$
      The update rule looks identical to linear regression's, but $h_{\theta}(x)$ is now the sigmoid of $\theta^{T}x$.
    • other optimization algorithms
      • Conjugate gradient
      • BFGS
      • L-BFGS
    1. advantages:
      no need to manually pick $\alpha$
      often faster than gradient descent
    2. disadvantage:
      more complex
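
A minimal NumPy sketch of logistic regression trained with gradient descent, then the same cost handed to SciPy's L-BFGS-B implementation (an L-BFGS variant); the toy data and the `cost_and_grad` helper are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Cross-entropy cost J(theta) and its gradient, vectorized over m examples."""
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(x) = g(theta^T x)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Toy binary data with an intercept column.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = (X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)

# 1. Gradient descent: same update rule as linear regression,
#    but h_theta is the sigmoid.
theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    _, grad = cost_and_grad(theta, X, y)
    theta -= alpha * grad

# 2. L-BFGS: no alpha to pick, usually converges in far fewer steps.
#    jac=True tells SciPy that the function returns (cost, gradient).
res = minimize(cost_and_grad, np.zeros(2), args=(X, y),
               method="L-BFGS-B", jac=True)

print("GD:     ", theta, cost_and_grad(theta, X, y)[0])
print("L-BFGS: ", res.x, res.fun)
```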