Gradient Descent (Gradient Descent with Squared Error)

Gradient Descent with Squared Errors

We want to find the weights for our neural networks. Let's start by thinking about the goal. The network needs to make predictions as close as possible to the real values. To measure this, we use a metric of how wrong the predictions are, the error. A common metric is the sum of the squared errors (SSE):

E=\frac{1}{2} \sum_{\mu} \sum_{j}\left[ y_{j}^{\mu} -\hat{y}_{j}^{\mu} \right]^{2}

where y_hat is the prediction and y is the true value, and you take the sum over all output units j and another sum over all data points μ. This might seem like a really complicated equation at first, but it's fairly simple once you understand the symbols and can say what's going on in words.

First, the inside sum over j. This variable j represents the output units of the network. So this inside sum is saying for each output unit, find the difference between the true value y and the predicted value from the network y_hat, then square the difference, then sum up all those squares.

Then the other sum over μ is a sum over all the data points. So, for each data point you calculate the inner sum of the squared differences for each output unit. Then you sum up those squared differences for each data point. That gives you the overall error for all the output predictions for all the data points.
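To make the two sums concrete, here is a minimal sketch in Python with NumPy. The function name `sse` and the numbers are made up for illustration; rows stand for data points μ and columns for output units j.

```python
import numpy as np

def sse(y, y_hat):
    """Half the sum of squared errors over all data points and output units."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # Inner sum over output units j (axis 1): one value per data point.
    per_point = np.sum((y - y_hat) ** 2, axis=1)
    # Outer sum over data points mu, with the factor 1/2 in front.
    return 0.5 * np.sum(per_point)

# Made-up values: 3 data points (rows), 2 output units (columns).
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_hat = np.array([[0.9, 0.2], [0.1, 0.7], [0.6, 0.8]])

error = sse(y, y_hat)
print(error)
```

The per-point sums here are 0.05, 0.10 and 0.20, so the total error comes out to about 0.175.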

The SSE is a good choice for a few reasons. The square ensures the error is never negative, and larger errors are penalized more than smaller errors. Also, it makes the math nice, always a plus.

Remember that the output of a neural network, the prediction, depends on the weights

\hat{y}_{j}^{\mu}=f\left( \sum_{i} w_{ij} x_{i}^{\mu} \right)

and accordingly the error depends on the weights

E=\frac{1}{2} \sum_{\mu} \sum_{j}\left[ y_{j}^{\mu} - f\left( \sum_{i} w_{ij} x_{i}^{\mu} \right) \right]^{2}

We want the network's prediction error to be as small as possible and the weights are the knobs we can use to make that happen. Our goal is to find weights w that minimize the squared error E. To do this with a neural network, typically you'd use gradient descent.
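The sketch below shows the error as a function of the weights. The text does not say which activation f is used, so the sigmoid here is an assumption, as are the helper names (`predict`, `error`) and the tiny dataset.

```python
import numpy as np

def sigmoid(h):
    # Assumed activation function f; the text leaves f unspecified.
    return 1.0 / (1.0 + np.exp(-h))

def predict(X, W):
    # X has shape (n_points, n_inputs), W has shape (n_inputs, n_outputs).
    # Each prediction is f applied to the weighted sum of the inputs.
    return sigmoid(X @ W)

def error(W, X, y):
    # Half the sum of squared errors, now written in terms of the weights.
    return 0.5 * np.sum((y - predict(X, W)) ** 2)

# Made-up data: 2 data points, 2 inputs, 1 output unit.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([[1.0], [0.0]])

W_a = np.zeros((2, 1))          # all-zero weights: every prediction is 0.5
W_b = np.array([[1.0], [1.0]])  # a different setting of the knobs

E_a = error(W_a, X, y)
E_b = error(W_b, X, y)
print(E_a, E_b)
```

With the same data, the two weight settings give different errors (the second is lower here), which is exactly what makes the weights useful knobs.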

As Luis said, with gradient descent, we take multiple small steps towards our goal. In this case, we want to change the weights in steps that reduce the error. Continuing the analogy, the error is our mountain and we want to get to the bottom. Since the fastest way down a mountain is in the steepest direction, the steps taken should be in the direction that reduces the error the most. We can find this direction by calculating the gradient of the squared error, which points uphill, and stepping the opposite way.

Gradient is another term for rate of change or slope. If you need to brush up on this concept, check out Khan Academy's great lectures on the topic.

To calculate a rate of change, we turn to calculus, specifically derivatives. A derivative of a function f(x) gives you another function f′(x) that returns the slope of f(x) at point x. For example, consider f(x)=x^2. The derivative of x^2 is f′(x)=2x. So, at x=2, the slope is f′(2)=4. Plotting this out, it looks like:

The gradient is just a derivative generalized to functions with more than one variable. We can use calculus to find the gradient at any point in our error function, which depends on the input weights. You'll see how the gradient descent step is derived on the next page.
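A quick numerical check can make both ideas tangible. The sketch below approximates the slope of f(x)=x^2 at x=2 with a central difference, then does the same for each variable of a two-variable function. The function `g` and the helper names are made up for illustration.

```python
def f(x):
    return x ** 2

def numerical_derivative(func, x, eps=1e-6):
    # Central difference: slope of func around the point x.
    return (func(x + eps) - func(x - eps)) / (2 * eps)

slope = numerical_derivative(f, 2.0)
print(slope)  # close to f'(2) = 4

def g(w1, w2):
    # Hypothetical two-variable function; exact gradient is (2*w1, 6*w2).
    return w1 ** 2 + 3 * w2 ** 2

def numerical_gradient(func, w1, w2, eps=1e-6):
    # One partial derivative per variable, each holding the other fixed.
    d1 = (func(w1 + eps, w2) - func(w1 - eps, w2)) / (2 * eps)
    d2 = (func(w1, w2 + eps) - func(w1, w2 - eps)) / (2 * eps)
    return d1, d2

grad = numerical_gradient(g, 1.0, 2.0)
print(grad)  # close to (2, 12)
```

In practice the gradient is computed analytically rather than by finite differences, but a check like this is a handy way to confirm a derivative.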

Below I've plotted an example of the error of a neural network with two inputs, and accordingly, two weights. You can read this like a topographical map where points on a contour line have the same error and darker contour lines correspond to larger errors.

At each step, you calculate the error and the gradient, then use those to determine how much to change each weight. Repeating this process will eventually find weights that are close to the minimum of the error function, the black dot in the middle.
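Here is a minimal sketch of that loop for two weights. To keep the gradient simple it assumes a linear output unit (f is the identity), for which the gradient of the squared error with respect to the weights is -Xᵀ(y - ŷ). The data, the learning rate of 0.02, and the step count are made-up values, not anything prescribed by the lesson.

```python
import numpy as np

# Made-up data: 4 data points, 2 inputs, generated from known weights.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.0, 1.0]])
true_w = np.array([0.5, -1.0])
y = X @ true_w

w = np.zeros(2)        # starting weights
learning_rate = 0.02   # assumed step size, small enough to be stable here
errors = []

for step in range(1000):
    y_hat = X @ w                      # predictions of the linear unit
    residual = y - y_hat
    errors.append(0.5 * np.sum(residual ** 2))
    gradient = -X.T @ residual         # dE/dw for the linear unit
    w = w - learning_rate * gradient   # step against the gradient

print(w)
print(errors[0], errors[-1])
```

The error shrinks at every step and the weights end up very close to the ones used to generate the data. A learning rate that is too large would make the steps overshoot and the error grow instead.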

 

Caveats

Since the weights will just go wherever the gradient takes them, they can end up where the error is low, but not the lowest. These spots are called local minima. If the weights are initialized with the wrong values, gradient descent could lead the weights into a local minimum, illustrated below.
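The effect of the starting point can be seen with a one-weight toy example. The error function below, E(w) = w^4 - 3w^2 + w, is a hypothetical curve chosen only because it has two valleys; it is not the error of any real network, and the learning rate and starting values are likewise made up.

```python
def E(w):
    # Hypothetical error curve with a global minimum near w = -1.30
    # and a shallower local minimum near w = 1.13.
    return w ** 4 - 3 * w ** 2 + w

def dE(w):
    # Derivative of E with respect to w.
    return 4 * w ** 3 - 6 * w + 1

def descend(w, learning_rate=0.01, steps=1000):
    # Plain gradient descent on the single weight w.
    for _ in range(steps):
        w = w - learning_rate * dE(w)
    return w

w_left = descend(-2.0)   # starts on the left slope
w_right = descend(2.0)   # starts on the right slope

print(w_left, E(w_left))
print(w_right, E(w_right))
```

Starting from -2.0 the weight settles in the deeper valley, while starting from 2.0 it gets stuck in the shallower one, even though the update rule is identical.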

Mathematical Understanding

y: true value; \hat{y}: predicted value; w: weight; x: input

1. The error between the true value y and the predicted value \hat{y}: E = (y - \hat{y})

2. To keep the error from being negative, we square it: E = (y - \hat{y})^{2}. We square rather than take the absolute value because squaring penalizes outliers more heavily than small errors.

3. To get the error over the whole dataset, we sum the errors of the individual data records, denoted by the sum over μ: E = \sum_{\mu} (y^{\mu} - \hat{y}^{\mu})^{2}

4. To clean up the math later, add a factor of one half in front: E = \frac{1}{2} \sum_{\mu} (y^{\mu} - \hat{y}^{\mu})^{2}

5. Remember that \hat{y} is the linear combination of the weights and inputs, \sum_{i} w_{i} x_{i}, passed through the activation function f as in the prediction equation above.
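As a quick check of steps 1 to 4, here is the arithmetic for a made-up dataset with two records, where the true values are 1 and 0 and the predictions are 0.8 and 0.3. The numbers are purely illustrative.

```latex
% Record 1: y^{1} = 1, \hat{y}^{1} = 0.8.  Record 2: y^{2} = 0, \hat{y}^{2} = 0.3.
\begin{aligned}
y^{1} - \hat{y}^{1} &= 0.2, & (y^{1} - \hat{y}^{1})^{2} &= 0.04 \\
y^{2} - \hat{y}^{2} &= -0.3, & (y^{2} - \hat{y}^{2})^{2} &= 0.09 \\
E &= \frac{1}{2}\,(0.04 + 0.09) = 0.065
\end{aligned}
```

Note that the second record contributes more than twice as much as the first, even though its raw error is only 1.5 times larger in magnitude.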

 

 
