Machine Learning Notes, Week 5: Neural Network Learning

5.1 Cost Function

Let’s first define a few variables that we will need to use:

  • L = total number of layers in the network
  • s_l = number of units (not counting the bias unit) in layer l
  • K = number of output units/classes

Recall that in neural networks, we may have many output nodes. We denote h_Θ(x)_k as the hypothesis that results in the k-th output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

For neural networks, it is going to be slightly more complicated:

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) + (1 - y_k^{(i)})\log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

Note:
- the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
- the triple sum simply adds up the squares of all the individual Θs in the entire network.
- the i in the triple sum does not refer to training example i
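
As a minimal Octave sketch of how this cost might be computed once the network's outputs are in hand (the names h, Y, Theta1, Theta2, and lambda are assumptions for illustration, not reference code):

% h: m x K matrix whose (i,k) entry is h_Theta(x^(i))_k, assumed precomputed
% Y: m x K matrix of one-hot labels, so Y(i,k) = y_k^(i)
J = -(1/m) * sum(sum( Y .* log(h) + (1 - Y) .* log(1 - h) ));
% Regularization: squares of all weights except the bias columns
J += (lambda/(2*m)) * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));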

5.2 Backpropagation Algorithm

“Backpropagation” is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:

\min_\Theta J(\Theta)

That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we’ll look at the equations we use to compute the partial derivative of J(Θ):

\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)

To do so, we use the following algorithm:


Backpropagation algorithm

Given training set \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}

  • Set Δ_{i,j}^{(l)} := 0 for all (l, i, j) (hence you end up with a matrix full of zeros for each layer)

For training example t = 1 to m:
  1. Set a^{(1)} := x^{(t)}
  2. Perform forward propagation to compute a^{(l)} for l = 2, 3, \dots, L


  3. Using y^{(t)}, compute δ^{(L)} = a^{(L)} - y^{(t)}

    Where L is our total number of layers and a^{(L)} is the vector of outputs of the activation units for the last layer. So our “error values” for the last layer are simply the differences between our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

  4. Compute δ^{(L-1)}, δ^{(L-2)}, \dots, δ^{(2)} using δ^{(l)} = ((Θ^{(l)})^T δ^{(l+1)}) .* a^{(l)} .* (1 - a^{(l)})

    The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g’, or g-prime, which is the derivative of the activation function g evaluated with the input values given by z^{(l)}.

    The g-prime derivative terms can also be written out as:

    g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})

  5. Δ_{i,j}^{(l)} := Δ_{i,j}^{(l)} + a_j^{(l)} δ_i^{(l+1)}, or with vectorization, Δ^{(l)} := Δ^{(l)} + δ^{(l+1)} (a^{(l)})^T

Hence we update our new Δ matrix. After looping over all the training examples, we compute:
- D_{i,j}^{(l)} := (1/m)(Δ_{i,j}^{(l)} + λ Θ_{i,j}^{(l)}), if j ≠ 0
- D_{i,j}^{(l)} := (1/m) Δ_{i,j}^{(l)}, if j = 0

The capital-delta matrix Δ is used as an “accumulator” to add up our values as we go along, and D holds the resulting partial derivatives. Thus we get ∂J(Θ)/∂Θ_{i,j}^{(l)} = D_{i,j}^{(l)}.
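
As a minimal Octave sketch of the loop above (assuming a three-layer network with sigmoid activations; the names X, Y, lambda, and the sigmoid helper are assumptions for illustration, not the course’s reference code):

% Assumes: X is m x n inputs, Y is m x K one-hot labels,
% Theta1 (s2 x (n+1)) and Theta2 (K x (s2+1)) already exist,
% and sigmoid(z) = 1 ./ (1 + exp(-z)).
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m,
  a1 = [1; X(t,:)'];                         % layer 1 activations with bias unit
  z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];  % forward propagate to layer 2
  z3 = Theta2 * a2;  a3 = sigmoid(z3);       % a3 = h_Theta(x^(t))
  delta3 = a3 - Y(t,:)';                     % delta^(L) = a^(L) - y^(t)
  delta2 = (Theta2' * delta3) .* a2 .* (1 - a2);
  delta2 = delta2(2:end);                    % drop the bias unit's delta
  Delta2 = Delta2 + delta3 * a2';            % accumulate the gradients
  Delta1 = Delta1 + delta2 * a1';
end;
D1 = Delta1 / m;  D1(:, 2:end) += (lambda/m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) += (lambda/m) * Theta2(:, 2:end);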

5.3 Backpropagation Intuition

Recall that the cost function for a neural network is:

J(\Theta) = -\frac{1}{m}\sum_{t=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(t)}\log\big((h_\Theta(x^{(t)}))_k\big) + (1 - y_k^{(t)})\log\big(1 - (h_\Theta(x^{(t)}))_k\big) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2

If we consider simple non-multiclass classification (k = 1) and disregard regularization, the cost is computed with:

cost(t) = y^{(t)}\log(h_\Theta(x^{(t)})) + (1 - y^{(t)})\log(1 - h_\Theta(x^{(t)}))

Intuitively, δ_j^{(l)} is the “error” for a_j^{(l)} (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:

δ_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} cost(t)

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the neural network below and see how we could calculate some δ_j^{(l)}:

[Figure: example four-layer network used to trace the δ values edge by edge]

In the image above, to calculate δ_2^{(2)}, we multiply the weights Θ_{12}^{(2)} and Θ_{22}^{(2)} by their respective δ values found to the right of each edge, giving δ_2^{(2)} = Θ_{12}^{(2)} δ_1^{(3)} + Θ_{22}^{(2)} δ_2^{(3)}. To calculate every single possible δ_j^{(l)}, we start from the right of our diagram. We can think of our edges as our Θ_{ij}. Going from right to left, to calculate the value of δ_j^{(l)}, you take the overall sum of each weight times the δ it is coming from. Hence, another example would be δ_2^{(3)} = Θ_{12}^{(3)} δ_1^{(4)}.
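
As a tiny numeric Octave sketch of that right-to-left step (all values below are made up for illustration):

delta3 = [0.4; -0.2];               % hypothetical delta^(3) for the two layer-3 units
theta_from_unit2 = [0.1; 0.3];      % hypothetical [Theta_12^(2); Theta_22^(2)]
delta2_2 = theta_from_unit2' * delta3;   % = 0.1*0.4 + 0.3*(-0.2) = -0.02
disp(delta2_2);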

5.4 Implementation Note: Unrolling Parameters

With neural networks, we are working with sets of matrices:

Θ^{(1)}, Θ^{(2)}, Θ^{(3)}, \dots

D^{(1)}, D^{(2)}, D^{(3)}, \dots

In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

If the dimensions of Theta1, Theta2, and Theta3 are 10x11, 10x11, and 1x11 respectively, then we can get back our original matrices from the “unrolled” versions as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
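
As a hedged sketch of how the unrolled vector plugs into an optimizer (nnCostFunction here is a hypothetical wrapper assumed to return the cost and the unrolled gradient; it is not a built-in):

% nnCostFunction(t) is assumed to reshape t back into Theta1..Theta3,
% run forward/back propagation, and return [J, gradVector] (also unrolled).
initialTheta = [Theta1(:); Theta2(:); Theta3(:)];
options = optimset('GradObj', 'on', 'MaxIter', 50);
[optTheta, cost] = fminunc(@(t) nnCostFunction(t), initialTheta, options);
Theta1 = reshape(optTheta(1:110), 10, 11);   % recover the trained matrices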

To summarize: we unroll all the matrices into one long vector before handing it to the optimizer, and reshape that vector back into matrices whenever we need to run forward and back propagation inside the cost function.

5.5 Gradient Checking

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}

With multiple theta matrices, we can approximate the derivative with respect to Θ_j as follows:

\frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}

A small value for ϵ (epsilon) such as ϵ = 10^{-4} guarantees that the math works out properly. If the value for ϵ is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the j-th element of Θ. In Octave we can do it as follows:

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;      % perturb the i-th parameter upward
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;     % and downward
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end;

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
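<br>
One hedged way to make that check concrete is a relative-difference test; the 1e-9 threshold below is a common rule of thumb, not a hard requirement:

% Small values (around 1e-9 or less) suggest backpropagation is correct.
relDiff = norm(gradApprox - deltaVector) / norm(gradApprox + deltaVector);
disp(relDiff);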

Once you have verified once that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox can be very slow.

5.6 Random Initialization

Initializing all theta weights to zero does not work with neural networks: when we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize the weights for our Θ matrices as follows.


Hence, we initialize each Θ_{ij}^{(l)} to a random value in [-ε, ε]. Scaling and shifting the output of rand as in the code below guarantees the desired bound. The same procedure applies to all the Θ’s. Below is some working code you could use to experiment.

Suppose the dimensions of Theta1, Theta2, and Theta3 are 10x11, 10x11, and 1x11 respectively:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

rand(x,y) is a built-in Octave function that returns an x-by-y matrix of random real numbers between 0 and 1.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
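
One common heuristic for choosing INIT_EPSILON (used in the course’s programming exercises, stated here as an assumption rather than a rule) ties it to the number of units on either side of the weight matrix:

L_in = 10;  L_out = 10;                        % example layer sizes (assumed)
INIT_EPSILON = sqrt(6) / sqrt(L_in + L_out);   % keeps initial weights small
Theta1 = rand(L_out, L_in + 1) * (2 * INIT_EPSILON) - INIT_EPSILON;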

5.7 Putting it Together

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

  • Number of input units = dimension of features x^{(i)}
  • Number of output units = number of classes
  • Number of hidden units per layer = usually the more the better (but this must be balanced against the cost of computation, which grows with more hidden units)
  • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network

  1. Randomly initialize the weights
  2. Implement forward propagation to get h_Θ(x^{(i)}) for any x^{(i)}
  3. Implement the cost function
  4. Implement backpropagation to compute partial derivatives
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i), y(i)),
   % getting the activations a(l) and delta terms d(l) for l = 2, ..., L.
end;


Ideally, you want h_Θ(x^{(i)}) ≈ y^{(i)}. This will minimize our cost function. However, keep in mind that J(Θ) is not convex and thus we can end up in a local minimum instead.
