[ML of Andrew Ng] Week 5 Neural Networks: Learning

Cost function

Classification

Examples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

$L$ = total number of layers in the network
$s_l$ = number of units (not counting the bias unit) in layer $l$

  • Binary classification:
    $y = 0$ or $1$

  • Multi-class classification ($K$ classes):
    $y \in \mathbb{R}^K$, e.g. $[1\ 0\ 0\ 0]^T, [0\ 1\ 0\ 0]^T, [0\ 0\ 1\ 0]^T, [0\ 0\ 0\ 1]^T$

  • In MATLAB we often need to transform y from a class label into a one-hot vector:

    Y = zeros(m, num_labels);
    for i = 1:m,
        Y(i, y(i)) = 1;
    end
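
  Equivalently (a minimal vectorized sketch, assuming the same y and num_labels as above), the loop can be replaced by indexing into an identity matrix:

    % One-hot encoding without a loop: row i of Y is row y(i) of the identity matrix
    I = eye(num_labels);
    Y = I(y, :);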

Forward propagation algorithm

We compute the activations from $a^{(1)}$ to $a^{(L)}$:

$a^{(1)} = x$

Given $a^{(i)}$, first add the bias unit $a^{(i)}_0 = 1$, then compute

$z^{(i+1)} = \Theta^{(i)} a^{(i)}$

with dimensions $\big(s_{i+1} \times (s_i + 1)\big) \cdot \big((s_i + 1) \times 1\big) = s_{i+1} \times 1$, and

$a^{(i+1)} = g(z^{(i+1)})$

Repeating this up to layer $L$ gives $a^{(L)}$, and the hypothesis is

$h_\Theta(x) = a^{(L)}$
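
As an illustration (a minimal sketch, not the only way to implement it), here is a vectorized forward pass for a 3-layer network, assuming an m-by-n data matrix X and weight matrices Theta1 (size $s_2 \times (n+1)$) and Theta2 (size $K \times (s_2+1)$):

    % Vectorized forward propagation for a 3-layer network (all m examples at once)
    g  = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation
    a1 = [ones(m, 1) X];            % add bias column: m x (n+1)
    z2 = a1 * Theta1';              % m x s2
    a2 = [ones(m, 1) g(z2)];        % add bias column: m x (s2+1)
    z3 = a2 * Theta2';              % m x K
    a3 = g(z3);                     % h_Theta(X): m x K

This a3 is the quantity used in the MATLAB cost-function snippet below.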

$\Theta^{(l)}_{ij}$ maps from node $j$ in layer $l$ to node $i$ in layer $l+1$, where $i = 1, 2, \dots, s_{l+1}$ and $j = 0, 1, \dots, s_l$.

Cost function

$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y^{(i)}_k \log\big(h_\Theta(x^{(i)})\big)_k + \big(1 - y^{(i)}_k\big)\log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta^{(l)}_{ji}\big)^2$

where $h_\Theta(x) \in \mathbb{R}^K$ and $(h_\Theta(x))_i$ denotes the $i$-th output.

E.g. in MATLAB:

    J = 1/m * sum(sum(-Y.*log(a3) - (1-Y).*log(1-a3))) ...
        + lambda/2/m * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));

Backpropagation algorithm

Similarly, to minimize $J(\Theta)$ we need to compute $J(\Theta)$ and the partial derivatives $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta)$.

Gradient computation

Intuition: $\delta^{(l)}_j$ = "error" of node $j$ in layer $l$.

The error terms are computed backwards from $\delta^{(L)}$ to $\delta^{(2)}$ (there is no $\delta^{(1)}$):

$\delta^{(L)} = a^{(L)} - y$

Given $\delta^{(i)}$:

$\delta^{(i-1)} = \big(\Theta^{(i-1)}\big)^T \delta^{(i)} \odot g'(z^{(i-1)})$, discarding the bias component $\delta^{(i-1)}_0$

with dimensions $\big(s_i \times (s_{i-1} + 1)\big)^T \cdot \big(s_i \times 1\big) = (s_{i-1} + 1) \times 1$, which becomes $s_{i-1} \times 1$ after dropping the bias entry,

where $\odot$ is the element-wise product and the sigmoid derivative is

$g'(z^{(i)}) = g(z^{(i)}) \odot \big(1 - g(z^{(i)})\big)$

Accumulating over the training examples gives $\Delta^{(l)}$:

$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \big(a^{(l)}\big)^T$

with dimensions $(s_{l+1} \times 1) \cdot \big((s_l + 1) \times 1\big)^T = s_{l+1} \times (s_l + 1)$

Then the gradients are

$D^{(l)} = \frac{1}{m}\Delta^{(l)} + \frac{\lambda}{m}\Theta^{(l)}$ for the columns with $j \neq 0$, and $D^{(l)} = \frac{1}{m}\Delta^{(l)}$ for the bias column $j = 0$ (the bias terms are not regularized), so that

$\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = D^{(l)}_{ij}$
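
As a sketch (assuming the a1, a2, a3, z2, g, Y, Theta1, Theta2, m and lambda from the forward-pass and cost snippets above), a vectorized backward pass for the same 3-layer network could look like this:

    % Backpropagation for the 3-layer network above (vectorized over all m examples)
    d3 = a3 - Y;                                              % output-layer delta: m x K
    d2 = (d3 * Theta2(:, 2:end)) .* (g(z2) .* (1 - g(z2)));   % drop bias column: m x s2
    Delta1 = d2' * a1;                                        % s2 x (n+1)
    Delta2 = d3' * a2;                                        % K  x (s2+1)
    Theta1_grad = Delta1 / m;                                 % unregularized gradients
    Theta2_grad = Delta2 / m;
    Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
    Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + (lambda/m) * Theta2(:, 2:end);

Theta1_grad and Theta2_grad can then be unrolled into DVec for gradient checking and for the optimizer (see below).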


Implementation note: Unrolling parameters

Advanced optimization routines expect the parameters (and gradients) as a single vector, so we unroll the weight matrices into one long vector and reshape them back when needed.

Example

$\Theta^{(1)} \in \mathbb{R}^{10 \times 11}$, $\Theta^{(2)} \in \mathbb{R}^{10 \times 11}$, $\Theta^{(3)} \in \mathbb{R}^{1 \times 11}$

Unrolling

    thetaVec = [Theta1(:); Theta2(:); Theta3(:)];

Reshape

    Theta1 = reshape(thetaVec(1:110), 10, 11);
    Theta2 = reshape(thetaVec(111:220), 10, 11);
    Theta3 = reshape(thetaVec(221:231), 1, 11);

Gradient checking

For each parameter $\theta_i$, approximate the partial derivative with the two-sided difference $\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta + \epsilon\, e_i) - J(\theta - \epsilon\, e_i)}{2\epsilon}$ (where $e_i$ is the $i$-th unit vector):

    for i = 1:n,
        thetaPlus = theta;
        thetaPlus(i) = thetaPlus(i) + EPSILON;
        thetaMinus = theta;
        thetaMinus(i) = thetaMinus(i) - EPSILON;
        gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
    end;

Implementation Note:

  • Implement backprop to compute DVec (the unrolled gradient vector).
  • Implement the numerical gradient check to compute gradApprox.
  • Make sure they give similar values (see the sketch below).
  • Turn off gradient checking; use the backprop code for learning.
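
A minimal way to perform this comparison (assuming DVec and gradApprox computed as above) is a relative-difference check; the 1e-9 threshold is a common rule of thumb rather than a hard requirement:

    % Relative difference between the numerical and the backprop gradient
    diff = norm(gradApprox(:) - DVec(:)) / norm(gradApprox(:) + DVec(:));
    fprintf('Relative difference: %g (expect something like 1e-9 or smaller)\n', diff);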

Important:

  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(...)), your code will be very slow.

Random initialization

With zero initialization, after each update the parameters corresponding to the inputs going into each of two hidden units remain identical, so the hidden units all compute the same function.

We therefore use random initialization to break the symmetry: initialize each $\Theta^{(l)}_{ij}$ to a random value in $[-\epsilon, \epsilon]$. E.g.

    Theta1 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;
    Theta2 = rand(1,11) * (2*INIT_EPSILON) - INIT_EPSILON;

Training a neural network

Pick a network architecture (connectivity pattern between neurons):

- No. of input units: dimension of the features $x^{(i)}$
- No. of output units: number of classes

Reasonable default: 1 hidden layer; if using more than 1 hidden layer, use the same number of hidden units in every layer (usually, the more units the better).

Steps

1. Randomly initialize the weights.
2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$.
3. Implement code to compute the cost function $J(\Theta)$.
4. Implement backprop to compute the partial derivatives $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta)$:

       for i = 1:m
           Perform forward propagation and backpropagation using example (x(i), y(i))
           (get the activations a(l) and the delta terms delta(l) for l = 2, ..., L).

5. Use gradient checking to compare $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta)$ computed using backpropagation with the numerical estimate of the gradient of $J(\Theta)$. Then disable the gradient checking code.
6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize $J(\Theta)$ as a function of the parameters $\Theta$ (see the sketch below).
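
As a sketch of step 6 (assuming a hypothetical nnCostFunction(nn_params, ...) that returns [J, grad] with the gradient unrolled like DVec, and hypothetical size variables input_layer_size and hidden_layer_size), an advanced optimizer such as fminunc could be wired up like this:

    % Sketch: minimize J(Theta) with an advanced optimizer (fminunc).
    % nnCostFunction is a hypothetical cost handler for the unrolled parameter vector.
    initial_nn_params = [Theta1(:); Theta2(:)];   % randomly initialized, unrolled
    costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    [nn_params, cost] = fminunc(costFunc, initial_nn_params, options);
    % Reshape the learned parameters back into Theta1 and Theta2
    Theta1 = reshape(nn_params(1:hidden_layer_size*(input_layer_size+1)), ...
                     hidden_layer_size, input_layer_size+1);
    Theta2 = reshape(nn_params(hidden_layer_size*(input_layer_size+1)+1:end), ...
                     num_labels, hidden_layer_size+1);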