Machine Learning — Notes on Andrew Ng's Course (NetEase Cloud Classroom)


  1. Machine Learning: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

  2. Supervised learning, Unsupervised learning; Regression, Classification.

  3. Linear regression

    • Hypothesis function: $h_\theta(x) = \theta_0 + \theta_1 x$
    • Cost function (squared error function): $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$
    • Goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\ J(\theta_0, \theta_1)$
  4. Gradient descent

    • Used to minimize the cost function. The linear regression model above serves as the running example below.

    • Algorithm:

      repeat until convergence {
          $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$  (for $j = 0$ and $j = 1$)
      }

      Here $\alpha$ is the learning rate: the larger $\alpha$, the bigger each descent step; the smaller $\alpha$, the smaller each step.

      If $\alpha$ is too small, gradient descent converges slowly; if $\alpha$ is too large, it may overshoot the minimum, fail to converge, or even diverge.

      Note: $\theta_0$ and $\theta_1$ must be updated simultaneously, like so:

      $temp_0 := \theta_0 - \alpha\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1)$
      $temp_1 := \theta_1 - \alpha\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1)$
      $\theta_0 := temp_0$
      $\theta_1 := temp_1$

    • Gradient descent for linear regression
      $j = 0:\ \frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})$
      $j = 1:\ \frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}$

    • This variant is also called batch gradient descent — "batch" because each step of gradient descent uses all the training examples.

    • Drawbacks of gradient descent: $\alpha$ must be chosen, and many iterations are needed.
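As an illustration, here is a minimal NumPy sketch of batch gradient descent for one-variable linear regression; the toy data and hyperparameters are made up for the example:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x          # predictions on all m examples
        grad0 = (h - y).sum() / m        # dJ/dtheta0
        grad1 = ((h - y) * x).sum() / m  # dJ/dtheta1
        theta0 -= alpha * grad0          # simultaneous update:
        theta1 -= alpha * grad1          # both gradients computed first
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                        # data generated with theta = (1, 2)
t0, t1 = gradient_descent(x, y)
```

With a small enough $\alpha$, the cost decreases every iteration and $(\theta_0, \theta_1)$ approaches the generating parameters $(1, 2)$.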

  5. Linear regression with multiple variables

    • Hypothesis function: $h_\theta(x) = \theta^Tx = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_nx_n$

    • Cost function: $J(\theta_0, \theta_1, \cdots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$

    • Goal: $\underset{\theta_0, \theta_1, \cdots, \theta_n}{\text{minimize}}\ J(\theta_0, \theta_1, \cdots, \theta_n)$

    • Gradient descent

      repeat {
          $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1, \dots, \theta_n)$  (simultaneously update for every $j = 0, \dots, n$)
      }

      where $\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}$

    • Feature scaling: when different features have very different ranges, rescale them so that all features fall in roughly the same range; this speeds up gradient descent. **Mean normalization** is the usual way to do it.

    • Normal equation: solving directly gives $\theta = (X^TX)^{-1}X^Ty$. Note: because $(X^TX)^{-1}$ must be computed, this runs very slowly when the number of features is large.
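A quick sketch of the normal equation in NumPy (the data is hypothetical; `np.linalg.solve` is used instead of an explicit inverse, which computes the same $\theta$ more stably):

```python
import numpy as np

# Normal equation: theta = (X^T X)^{-1} X^T y,
# solved here as the linear system (X^T X) theta = X^T y.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])            # first column is x0 = 1 (intercept term)
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by theta = (1, 2)

theta = np.linalg.solve(X.T @ X, X.T @ y)
```

No learning rate and no iterations, which is exactly the trade-off the note above describes.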

  6. Logistic regression

    • Logistic regression is a classification algorithm.

    • Hypothesis: $h_\theta(x) = g(\theta^Tx)$, where $g(z) = \frac{1}{1 + e^{-z}}$.

    • Predict "$y = 1$" if $h_\theta(x) \geq 0.5$; predict "$y = 0$" if $h_\theta(x) < 0.5$.

    • Cost function

      $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)})$

      $\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

      Note: the squared error function from linear regression is not used here because $g(z) = \frac{1}{1 + e^{-z}}$ is nonlinear; with squared error the cost function would be non-convex, and gradient descent could easily get stuck in a local optimum.

      Combining the two cases gives $\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$, so $J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}(y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})))]$

    • Gradient descent

      want $\underset{\theta}{\min}\ J(\theta)$:

      repeat {
          $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$  (simultaneously update all $\theta_j$)
      }

      where $\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}$
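The update rule above can be sketched in NumPy; the tiny one-feature dataset and the hyperparameters are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.5, iters=2000):
    """Gradient descent on the logistic cost; X includes the x0 = 1 column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / m   # simultaneous update of all theta_j
    return theta

# Toy 1-D problem: examples with x < 2 are labeled 0, the rest 1.
X = np.column_stack([np.ones(6), [0.0, 1.0, 1.5, 2.5, 3.0, 4.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = logistic_gd(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
```

Note that the update has the same algebraic form as for linear regression, but $h_\theta$ is now the sigmoid of $\theta^Tx$.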

  7. Regularization

    • Used to reduce overfitting.
    • Ways to address overfitting: reduce the number of features; regularization.
    • Regularized linear regression: $J(\theta) = \frac{1}{2m}[\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^n\theta_j^2]$
    • Regularized logistic regression: $J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}(y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})))] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
  8. Neural Network

    • Neuron: $h_\theta(x) = g(\theta^Tx)$, where $g(z) = \frac{1}{1 + e^{-z}}$

    • The basic neural network model

      Forward propagation:

      $a_1^{(2)} = g(z_1^{(2)}),\quad z_1^{(2)} = \Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3$
      $a_2^{(2)} = g(z_2^{(2)}),\quad z_2^{(2)} = \Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3$
      $a_3^{(2)} = g(z_3^{(2)}),\quad z_3^{(2)} = \Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3$
      $h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$

      With $x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$ and $z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$, this vectorizes to $z^{(2)} = \Theta^{(1)}x$, $a^{(2)} = g(z^{(2)})$.

    • For a fuller picture of the model, here is a slightly more complex example.

      Forward propagation (vectorized):

      $a^{(1)} = x$
      $z^{(2)} = \Theta^{(1)}a^{(1)}$
      $a^{(2)} = g(z^{(2)})$  (add $a_0^{(2)}$)
      $z^{(3)} = \Theta^{(2)}a^{(2)}$
      $a^{(3)} = g(z^{(3)})$  (add $a_0^{(3)}$)
      $z^{(4)} = \Theta^{(3)}a^{(3)}$
      $a^{(4)} = h_\Theta(x) = g(z^{(4)})$
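The vectorized forward pass can be sketched as follows; the layer sizes (3 inputs, one 4-unit hidden layer, 1 output) and the random weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Vectorized forward propagation; thetas[l] maps layer l+1 to layer l+2."""
    a = x
    for theta in thetas:
        a = np.concatenate([[1.0], a])   # add the bias unit a_0 = 1
        z = theta @ a                    # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                   # a^(l+1) = g(z^(l+1))
    return a                             # the output layer, h_Theta(x)

# Hypothetical weights for a 3-4-1 network (shapes 4x4 and 1x5, bias included).
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(4, 4)), rng.normal(size=(1, 5))]
x = np.array([0.5, -1.0, 2.0])
h = forward(x, thetas)
```

Each iteration of the loop is one line of the vectorized equations above: append the bias, multiply by $\Theta^{(l)}$, apply $g$.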

    • Cost function

      For multi-class classification with $K$ classes:

      $h_\Theta(x) \in \mathbb{R}^K, \quad (h_\Theta(x))_i = i^{th}\text{ output}$

      $J(\Theta) = -\frac{1}{m}[\sum_{i=1}^{m}\sum_{k=1}^{K}(y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1 - y_k^{(i)})\log(1 - (h_\Theta(x^{(i)}))_k))] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$

      where $L$ = total no. of layers in the network, and $s_l$ = no. of units (not counting the bias unit) in layer $l$.

    • Backpropagation (BP) algorithm

      • Note: its main job is to compute the partial derivatives.

      • First, some intermediate quantities

        Intuition: $\delta_j^{(l)}$ = "error" of node $j$ in layer $l$.

        For each output unit (layer $L = 4$):

        $\delta^{(4)} = a^{(4)} - y$

        $\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \mathbin{.*} g'(z^{(3)})$

        $\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \mathbin{.*} g'(z^{(2)})$

        (Ignoring regularization, one can show that $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}$.)

      • The backpropagation algorithm, step by step:

        Training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

        Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$)

        For $i = 1$ to $m$:

          Set $a^{(1)} = x^{(i)}$

          Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$

          Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$

          Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$

          $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$  (vectorized: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$)

        $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \lambda\Theta_{ij}^{(l)} \quad$ if $j \neq 0$

        $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} \quad\quad\quad\ \ $ if $j = 0$

        Finally, $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)}$

        With the partial derivatives in hand, continue with gradient descent or another advanced optimization algorithm.

    • Gradient checking: use $\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$ to approximate the derivative numerically and check that the computed derivatives are correct, which in turn checks that forward and backward propagation are implemented correctly.
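A minimal sketch of gradient checking. It is verified here against a function whose gradient is known in closed form — $J(\theta) = \sum_j \theta_j^2$, chosen for the example rather than taken from the course:

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided difference approximation of dJ/dtheta_j for each j."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# For J = sum(theta^2), the analytic gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda t: (t ** 2).sum(), theta)
analytic = 2 * theta
```

In a real network, `J` would be the cost as a function of the unrolled parameters, and `analytic` would come from backpropagation; the check is run once and then disabled, since the numerical estimate is far too slow to use during training.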

    • Finally, a summary of the steps for training a neural network:

      • Randomly initialize the weights (they must not all be initialized to 0)

      • Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$

      • Implement code to compute the cost function $J(\Theta)$

      • Implement backprop to compute the partial derivatives $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$

      • Use gradient checking to compare $\frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta)$ computed using backpropagation vs. a numerical estimate of the gradient of $J(\Theta)$.

        Then disable the gradient checking code.

      • Use gradient descent or an advanced optimization method with backpropagation to minimize $J(\Theta)$ as a function of the parameters $\Theta$

  9. Advice for applying machine learning

    • Split the data 6:2:2 into a training set, a cross-validation set, and a test set. Train models on the training set, use the cross-validation set to pick the model with the lowest cross-validation error, and use the test set to estimate the model's generalization error.

    • High bias -> underfitting: both the training error and the cross-validation error are large.

      High variance -> overfitting: the training error is small but the cross-validation error is large.

    • A small regularization parameter $\lambda$ makes overfitting more likely; a large $\lambda$ makes underfitting more likely.

    • Learning curves: plot error against the number of training examples. Normally, the training error grows with the training-set size while the cross-validation error shrinks.

      High bias (underfitting): the cross-validation error first falls and then flattens; the training error first rises and then flattens, ending up very close to the cross-validation error (the model has too few parameters to fit the data, so both errors stay large). In this regime, adding training examples does not help much.

      High variance (overfitting): the training error rises a little but stays small; the cross-validation error falls a little but remains large (the hallmark is a wide gap between the two curves, with plenty of room for the cross-validation error to keep dropping). In this regime, adding training examples does help.

    • Improving a learning algorithm

      For high variance: get more training examples; use fewer features; increase the regularization parameter $\lambda$.

      For high bias: use more features; add polynomial features; decrease the regularization parameter $\lambda$.

    • Skewed classes: the numbers of positive and negative examples differ wildly. In that case plain error is a poor measure of model quality, which motivates precision and recall.

      Precision: true positives (TP) / (true positives (TP) + false positives (FP));

      Recall: true positives (TP) / (true positives (TP) + false negatives (FN)).

      $F_1$ score: $2\frac{PR}{P + R}$; use $F_1$ to score the model with a single number.
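The three metrics follow directly from the counts above; the label vectors below are made up for illustration:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 3 positives, 5 negatives: the classifier finds 2 of the 3 positives
# and raises 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
```

Note that always predicting the majority class would score 5/8 accuracy here but 0 recall, which is exactly why these metrics matter for skewed classes.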

  10. SVM

    • The SVM cost function is derived from, and similar to, the logistic regression cost function.

    • Optimization objective: $\underset{\theta}{\min}\ C\sum_{i=1}^m[y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})] + \frac{1}{2}\sum_{j=1}^n\theta_j^2$

      (The lecture shows $\mathrm{cost}_1(z)$ in a left figure and $\mathrm{cost}_0(z)$ in a right figure; the plots are not reproduced here.)

      The ideal assumption: if $y = 1$, we want $\theta^Tx \geq 1$ (not just $\geq 0$); if $y = 0$, we want $\theta^Tx \leq -1$ (not just $< 0$).

      If the coefficient $C$ is very large, the first term of the objective must be driven to 0, which requires the "ideal assumption" above to hold, and the objective reduces to

      $\underset{\theta}{\min}\ \frac{1}{2}\sum_{j=1}^n\theta_j^2 = \frac{1}{2}\|\theta\|^2$
      s.t. $\theta^Tx^{(i)} \geq 1$ if $y^{(i)} = 1$
           $\theta^Tx^{(i)} \leq -1$ if $y^{(i)} = 0$

    • Kernels: define new feature variables from landmarks via a kernel function (commonly the Gaussian kernel), which makes it possible to train complex non-linear models.

      Gaussian kernel: $k(x, l) = \exp(-\frac{\|x - l\|^2}{2\sigma^2})$

      A simple example:

      Given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$,

      choose landmarks $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \dots, l^{(m)} = x^{(m)}$.

      The new feature vector is $f = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \dots \\ f_m \end{bmatrix}$

      For a training example $(x^{(i)}, y^{(i)})$ (the Gaussian kernel is the usual choice of similarity function):

        $f_1^{(i)} = \mathrm{similarity}(x^{(i)}, l^{(1)})$

        $f_2^{(i)} = \mathrm{similarity}(x^{(i)}, l^{(2)})$

        $\dots$

        $f_m^{(i)} = \mathrm{similarity}(x^{(i)}, l^{(m)})$

      Hypothesis: given $x$, compute features $f \in \mathbb{R}^{m+1}$

        Predict "y = 1" if $\theta^Tf \geq 0$

      Training: $\underset{\theta}{\min}\ C\sum_{i=1}^m[y^{(i)}\mathrm{cost}_1(\theta^Tf^{(i)}) + (1 - y^{(i)})\mathrm{cost}_0(\theta^Tf^{(i)})] + \frac{1}{2}\sum_{j=1}^m\theta_j^2$

    • Large C : Lower bias, higher variance; Small C : Higher bias, lower variance.

    • Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance;

      Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
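The landmark-feature construction above can be sketched in plain NumPy (a full SVM solver is out of scope here; the two example points are hypothetical):

```python
import numpy as np

def gaussian_kernel(x, l, sigma2=1.0):
    """similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

def kernel_features(X, landmarks, sigma2=1.0):
    """Map each example to f = (f_1, ..., f_m), one feature per landmark."""
    return np.array([[gaussian_kernel(x, l, sigma2) for l in landmarks]
                     for x in X])

X = np.array([[1.0, 1.0], [3.0, 0.0]])
F = kernel_features(X, landmarks=X)   # landmarks l^(i) = x^(i), as above
```

An example sitting exactly on a landmark gets feature value 1, and the value decays toward 0 with distance at a rate set by $\sigma^2$ — which is the bias/variance trade-off described in the bullet above.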

  11. K-means

    • Input: $K$ (number of clusters); training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}^n$

      Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$

      Repeat {

        for $i = 1$ to $m$
          $c^{(i)}$ := index (from 1 to $K$) of the cluster centroid closest to $x^{(i)}$

        for $k = 1$ to $K$
          $\mu_k$ := average (mean) of the points assigned to cluster $k$

      }

    • Optimization objective:

      $J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m}\sum_{i=1}^m\|x^{(i)} - \mu_{c^{(i)}}\|^2$

      $\underset{c^{(1)}, \dots, c^{(m)},\ \mu_1, \dots, \mu_K}{\min}\ J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$

      where $c^{(i)}$ = index of the cluster ($1, 2, \dots, K$) to which example $x^{(i)}$ is currently assigned,

      $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$),

      $\mu_{c^{(i)}}$ = centroid of the cluster to which example $x^{(i)}$ has been assigned.

    • To keep the algorithm from settling into a local optimum, run it with several random centroid initializations and keep the clustering with the smallest distortion.

    • Choosing the number of clusters: usually done by inspecting the data. The "elbow method" can sometimes help, but don't expect it to give a clear answer every time.
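The alternating loop above can be sketched in NumPy; the toy data, the fixed iteration count, and the single random initialization are simplifications for the example:

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain K-means: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # random initial centroids
    for _ in range(iters):
        # c[i]: index of the closest centroid for each example
        c = np.argmin(((X[:, None] - mu[None]) ** 2).sum(axis=2), axis=1)
        # mu[k]: mean of the points assigned to cluster k
        for k in range(K):
            if (c == k).any():                     # keep old centroid if empty
                mu[k] = X[c == k].mean(axis=0)
    return c, mu

# Two well-separated groups, around (0, 0) and (10, 10).
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
c, mu = kmeans(X, K=2)
```

A production version would also run several random initializations and keep the lowest-distortion result, as the note above recommends.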

  12. Dimensionality reduction

    • Dimensionality reduction serves two purposes: compressing the data and visualizing it.

    • Principal component analysis (PCA): project the data onto a lower-dimensional plane.

      • Preprocessing: apply mean normalization and feature scaling to the raw data.
      • Compute the covariance matrix: $Sigma = \frac{1}{m}\sum_{i=1}^m(x^{(i)})(x^{(i)})^T$
      • Compute its eigenvectors: `[U, S, V] = svd(Sigma)` (svd = singular value decomposition)
      • `svd` yields an $n \times n$ matrix $U$; take its first $k$ columns as the $k$ direction vectors of the $k$-dimensional plane: `Ureduce = U(:, 1:k)`
      • The reduced data is then computed as $z = U_{reduce}' * x$
    • Reconstruction (decompressing the data): $x_{approx} = U_{reduce} * z$

    • Choosing the number of principal components $k$ in PCA (the principle)

      • Average squared projection error: $\frac{1}{m}\sum_{i=1}^m\|x^{(i)} - x_{approx}^{(i)}\|^2$

      • Total variation in the data: $\frac{1}{m}\sum_{i=1}^m\|x^{(i)}\|^2$

      • Choose $k$ to be the smallest value such that $\frac{\frac{1}{m}\sum_{i=1}^m\|x^{(i)} - x_{approx}^{(i)}\|^2}{\frac{1}{m}\sum_{i=1}^m\|x^{(i)}\|^2} \leq 0.01$ (1%)

        "99% of variance is retained"

    • Choosing $k$ in practice

      • `[U, S, V] = svd(Sigma)`

      • Pick the smallest value of $k$ for which $1 - \frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \leq 0.01$, i.e. $\frac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \geq 0.99$

        "99% of variance is retained"
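The whole pipeline above, sketched with NumPy on synthetic, nearly one-dimensional data (the data itself is made up; feature scaling is skipped since both features here are on the same scale):

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=50)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=50)])  # ~1-D data in 2-D
X = X - X.mean(axis=0)                    # mean normalization

Sigma = (X.T @ X) / len(X)                # covariance matrix
U, S, Vt = np.linalg.svd(Sigma)           # U holds the principal directions
k = 1
U_reduce = U[:, :k]                       # first k columns of U
Z = X @ U_reduce                          # compressed data, z = U_reduce' * x
X_approx = Z @ U_reduce.T                 # reconstruction, x_approx = U_reduce * z

retained = S[:k].sum() / S.sum()          # fraction of variance retained
```

Because the second feature is almost exactly twice the first, a single component retains well over 99% of the variance and the reconstruction is close to the original.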

  13. Anomaly detection

    • Gaussian (normal) distribution

      • $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x - \mu)^2}{2\sigma^2})$
      • Parameter estimation: $\mu = \frac{1}{m}\sum_{i=1}^mx^{(i)}$, $\sigma^2 = \frac{1}{m}\sum_{i=1}^m(x^{(i)} - \mu)^2$
    • Anomaly detection algorithm

      • Choose features $x_i$ that you think might be indicative of anomalous examples.

      • Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$:

        $\mu_j = \frac{1}{m}\sum_{i=1}^mx_j^{(i)} \quad\quad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m(x_j^{(i)} - \mu_j)^2$

      • Given a new example $x$, compute $p(x)$:

        $p(x) = \prod_{j=1}^{n}p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_j}\exp(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2})$

        Flag an anomaly if $p(x) < \epsilon$
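A sketch of the algorithm on synthetic data; the threshold $\epsilon$ and the two query points are arbitrary choices for the example:

```python
import numpy as np

def fit_gaussian(X):
    """Per-feature parameter estimates mu_j and sigma_j^2."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def p(x, mu, sigma2):
    """p(x) = product over features of the univariate Gaussian density."""
    dens = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return dens.prod()

# Synthetic "normal" data: feature 1 ~ N(0, 1), feature 2 ~ N(5, 0.5^2).
rng = np.random.default_rng(2)
X = rng.normal(loc=[0.0, 5.0], scale=[1.0, 0.5], size=(500, 2))
mu, sigma2 = fit_gaussian(X)

eps = 1e-4
normal_point = np.array([0.1, 5.1])
far_point = np.array([6.0, 1.0])
is_anom = [p(normal_point, mu, sigma2) < eps, p(far_point, mu, sigma2) < eps]
```

The point near the bulk of the data gets a density well above $\epsilon$, while the far point's density collapses toward zero and is flagged.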

    • When to use anomaly detection vs. supervised learning:

      Anomaly detection: very many negative examples and very few positives; many different types of anomalies; future anomalies may look unlike anything seen so far.

      Supervised learning: plenty of both positive and negative examples.

    • Note: if some feature's distribution is not Gaussian, transforming that feature (e.g. a log or power transform) to make it more Gaussian usually makes the algorithm work better.

    • Multivariate Gaussian distribution

      • $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu))$, where $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n\times n}$
      • Parameter estimation: $\mu = \frac{1}{m}\sum_{i=1}^mx^{(i)}$, $\Sigma = \frac{1}{m}\sum_{i=1}^m(x^{(i)} - \mu)(x^{(i)} - \mu)^T$
  14. Recommender systems

    • Content-based recommendations

      $r(i, j) = 1$ if user $j$ has rated movie $i$ (0 otherwise)

      $y^{(i,j)}$ = rating given by user $j$ to movie $i$ (if defined)

      $\theta^{(j)}$ = parameter vector for user $j$

      $x^{(i)}$ = feature vector for movie $i$

      For user $j$ and movie $i$, the predicted rating is $(\theta^{(j)})^T(x^{(i)})$

      $m^{(j)}$ = no. of movies rated by user $j$

      To learn $\theta^{(j)}$ (the parameters for user $j$): $\underset{\theta^{(j)}}{\min}\ \frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{k=1}^n(\theta_k^{(j)})^2$

      To learn $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$:

        $\underset{\theta^{(1)},\dots,\theta^{(n_u)}}{\min}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$

      Gradient descent:

        $\theta_k^{(j)} := \theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)}$  (for $k = 0$)

        $\theta_k^{(j)} := \theta_k^{(j)} - \alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)} + \lambda\theta_k^{(j)})$  (for $k \neq 0$)

    • Collaborative filtering

      • Optimization objective

        Given $x^{(1)}, \dots, x^{(n_m)}$, estimate $\theta^{(1)}, \dots, \theta^{(n_u)}$:

          $\underset{\theta^{(1)},\dots,\theta^{(n_u)}}{\min}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$

        Given $\theta^{(1)}, \dots, \theta^{(n_u)}$, estimate $x^{(1)}, \dots, x^{(n_m)}$:

          $\underset{x^{(1)},\dots,x^{(n_m)}}{\min}\ \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2$

        Minimizing over $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ simultaneously:

          $J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$

          $\underset{x^{(1)},\dots,x^{(n_m)},\ \theta^{(1)},\dots,\theta^{(n_u)}}{\min}\ J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})$

      • Algorithm

        Initialize $x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}$ to small random values.

        Minimize $J(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g., for every $j = 1,\dots,n_u$ and $i = 1,\dots,n_m$:

          $x_k^{(i)} := x_k^{(i)} - \alpha(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})\theta_k^{(j)} + \lambda x_k^{(i)})$

          $\theta_k^{(j)} := \theta_k^{(j)} - \alpha(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})x_k^{(i)} + \lambda\theta_k^{(j)})$

        For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^Tx$.

    • Refinement: if some user $j$ has rated no movies at all, collaborative filtering as stated learns $\theta^{(j)} = 0$, so every predicted rating $(\theta^{(j)})^T(x^{(i)})$ is 0 and no movie can be recommended to that user. The fix is to mean-normalize all the rating data first, then run collaborative filtering, and predict ratings as $(\theta^{(j)})^T(x^{(i)}) + \mu_i$, where $\mu_i$ is the average rating of movie $i$.
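The joint cost $J$ can be sketched directly from its formula; the tiny rating matrix below is hypothetical, and $Y$ simply stores 0 where no rating exists since $R$ masks those entries out:

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost J, summed over all (i, j) with r(i, j) = 1."""
    err = (X @ Theta.T - Y) * R          # only rated entries contribute
    return (0.5 * (err ** 2).sum()
            + lam / 2 * ((X ** 2).sum() + (Theta ** 2).sum()))

# 3 movies, 2 users, 2 latent features (all values made up).
Y = np.array([[5.0, 0.0], [0.0, 4.0], [3.0, 3.0]])   # ratings y^(i,j)
R = np.array([[1, 0], [0, 1], [1, 1]])               # r(i, j) indicator
X = np.full((3, 2), 0.5)                              # movie features x^(i)
Theta = np.full((2, 2), 0.5)                          # user parameters theta^(j)
J = cofi_cost(X, Theta, Y, R, lam=0.0)
```

In practice this cost (and its gradients with respect to `X` and `Theta`) would be handed to gradient descent or an advanced optimizer, as in the algorithm above.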

  15. Large-scale machine learning

    • Large-scale machine learning works with huge amounts of data, which raises computational problems.

    • Stochastic gradient descent

      • $\mathrm{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$, $\quad J_{train}(\theta) = \frac{1}{m}\sum_{i=1}^m\mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))$

      • Randomly shuffle (reorder) the training examples, then:

        Repeat {
          for $i := 1, \dots, m$ {
            $\theta_j := \theta_j - \alpha(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$  (for every $j = 0, \dots, n$)
          }
        }

    • Mini-batch gradient descent

      Say $b = 10$, $m = 1000$.

      Repeat {
        for $i = 1, 11, 21, 31, \dots, 991$ {
          $\theta_j := \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}(h_\theta(x^{(k)}) - y^{(k)})x_j^{(k)}$  (for every $j = 0, \dots, n$)
        }
      }
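Both variants can be sketched with one function, since taking $b = 1$ recovers stochastic gradient descent; the linear-regression data and hyperparameters below are synthetic:

```python
import numpy as np

def minibatch_gd(X, y, b=10, alpha=0.05, epochs=50, seed=0):
    """Mini-batch gradient descent for linear regression; b = 1 gives SGD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)               # randomly shuffle the examples
        for start in range(0, m, b):
            idx = order[start:start + b]         # the next b examples
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= alpha * grad
    return theta

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=1000)
X = np.column_stack([np.ones(1000), x])
y = 1.0 + 2.0 * x                                # generated with theta = (1, 2)
theta = minibatch_gd(X, y)
```

Each parameter update touches only $b$ examples instead of all $m$, which is the whole point at large scale.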

    • Map-reduce: distribute a large dataset across multiple machines and compute in parallel.

  16. Photo OCR (photo optical character recognition): text detection; character segmentation; character recognition. Typically a sliding window detects the regions of an image that contain text, a second sliding window splits the text into characters, and a neural network (or another algorithm) recognizes each character.

  17. Artificial data synthesis: create data from scratch, or generate more data from an existing dataset by applying some transformation.
