What is the estimate of the price of the meal after running the iterative algorithm for one step?
Recall that
The correct price for portions (2, 5, 3) is 850.
Initial guess of the weights (50, 50, 50) gave an estimate of 500.
After running the iterative algorithm for one step, the revised guess of the weights is (70, 100, 80).
70 * 2 + 100 * 5 + 80 * 3 = 880. Note that we are closer to the correct answer now. Also note that the learning changed the weight of chips from 50 to 100 even though the initial guess of 50 was the correct weight. So individual weights did not all get closer to their correct values, but the final output did.
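The step above can be sketched as a single delta-rule update. The learning rate is not stated in the question; a value of 1/35 is assumed here because it reproduces the revised weights (70, 100, 80):

```python
# One delta-rule step on the meal example. The learning rate eps = 1/35
# is an assumption chosen to reproduce the revised weights above.
x = [2, 5, 3]           # portions of fish, chips, ketchup
t = 850                 # true price of the meal
w = [50.0, 50.0, 50.0]  # initial guess of per-portion prices
eps = 1 / 35            # assumed learning rate

y = sum(wi * xi for wi, xi in zip(w, x))  # initial estimate: 500
residual = t - y                          # 350
# Delta rule: w_i <- w_i + eps * (t - y) * x_i
w = [wi + eps * residual * xi for wi, xi in zip(w, x)]

print([round(wi, 6) for wi in w])                           # [70.0, 100.0, 80.0]
print(round(sum(wi * xi for wi, xi in zip(w, x)), 6))       # 880.0
```

Note that each weight moves in proportion to its own input, which is why the chips weight overshoots its correct value even as the overall estimate improves.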
You have seen how the delta rule can be used to minimize the error by looking at one training case at a time. What is a good reason for using this kind of iterative method?
We can avoid referring to minimizing the squared error as linear regression.
We can prove the convergence of the algorithm using the same reasoning as the perceptron algorithm.
An iterative method will scale more gracefully to larger datasets.
We don't have to collect all of the examples before we begin learning.
The first option is false but is intended to remind you that minimizing the squared error of a set of examples is equivalent to linear regression. The second option is also false because the perceptron convergence proof uses the fact that the weights always get closer to a set of optimal weights, which doesn't happen in this case. The third and fourth options are correct and complementary: this general approach scales extremely well to large datasets and has the additional advantage that we can collect data as we run the iterative algorithm (that is, we run a step, go out and collect another example, run another step), meaning that we can keep collecting new data and learning better models. This approach to optimization is also known as online learning.
Suppose we have a dataset of two training points.
x_1 = (1, -1), \quad t_1 = 0 \qquad x_2 = (0, 1), \quad t_2 = 1
Consider a network with two input units connected to a linear neuron with weights w = (w_1, w_2). What is the equation for the error surface when using a squared error loss function?
Hint: The squared error is defined as \frac{1}{2}(w^T x_1 - t_1)^2 + \frac{1}{2}(w^T x_2 - t_2)^2.
E = \frac{1}{2}\left(w_1^2 + 2w_2^2 - 2w_1w_2 - 2w_2 + 1\right)
The error summed over all training cases is
E = \frac{1}{2}(w_1 - w_2 - 0)^2 + \frac{1}{2}(0 \cdot w_1 + w_2 - 1)^2 = \frac{1}{2}\left(w_1^2 + 2w_2^2 - 2w_1w_2 - 2w_2 + 1\right).
Note the quadratic form of the error surface. For any fixed value of E, the contours are ellipses. For fixed values of w_1, we get a parabolic relation between E and w_2, and similarly for fixed values of w_2.
E = \frac{1}{2}\left((w_1 - w_2 - 1)^2 + w_2^2\right)
E = \frac{1}{2}\left(w_1^2 - 2w_2^2 + 2w_1w_2 + 1\right)
E = \frac{1}{2}(w_1 - w_2)^2
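The algebra behind the correct answer can be checked numerically. This sketch evaluates both the summed per-case squared error and the expanded quadratic at random weight vectors and confirms they agree:

```python
import random

# Numerical check that the expanded quadratic matches the summed squared
# error for the dataset x1 = (1, -1), t1 = 0 and x2 = (0, 1), t2 = 1.
def summed_error(w1, w2):
    e1 = 0.5 * (w1 * 1 + w2 * (-1) - 0) ** 2   # case 1
    e2 = 0.5 * (w1 * 0 + w2 * 1 - 1) ** 2      # case 2
    return e1 + e2

def expanded(w1, w2):
    return 0.5 * (w1**2 + 2 * w2**2 - 2 * w1 * w2 - 2 * w2 + 1)

for _ in range(1000):
    w1, w2 = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(summed_error(w1, w2) - expanded(w1, w2)) < 1e-9

print(expanded(1.0, 1.0))  # the minimum: E = 0 at w = (1, 1)
```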
If our initial weight vector is (w_1, 0) for some real number w_1, then on which of the following error surfaces can we expect steepest descent to converge poorly? Check all that apply.
The first one is similar to the picture shown in the lectures. It is a diagonally oriented ellipse and steepest descent will still tend to zig-zag on this error surface. Even though the second and third have different minima locations and scalings, the steepest descent direction will still take you very close to the minimum with the appropriate learning rate. The last case is tricky: even though the shape is an ellipse, the initial weight vector starts off somewhere along the x axis, and so again the steepest descent direction points directly toward the minimum. In other words, there is zero gradient along the vertical axis and therefore we are simply minimizing a parabola along one dimension from that point to get to the minimum.
The range of the function y = \frac{1}{1+e^{-z}} is between 0 and 1. Another way of interpreting the logistic unit is that it is modelling:
The probability of the inputs given the outputs.
The probability of the outputs given the inputs. (Correct)
Instead of randomly perturbing the weights and looking at the change in error, one can try the following approach:
- For each weight parameter w_i, perturb w_i by adding a small (say, 10^{-5}) constant \epsilon and evaluate the error (call this E_i^+).
- Now reset w_i back to the original parameter and perturb it again by subtracting the same small constant \epsilon and evaluate the error again (call this E_i^-).
- Repeat this for each weight index i.
- Upon completing this, update the weight vector by w_i \leftarrow w_i - \eta \frac{E_i^+ - E_i^-}{2\epsilon}
for some learning rate \eta.
True or false: for an appropriately chosen \eta, repeating this procedure will find the minimum of the error surface for a linear output neuron.
True
Another name for this procedure is the finite difference approximation. This procedure approximates the gradient of the error with respect to the weights (think of the definition of a derivative). In effect, the procedure states that we approximate the gradient and then take a step of the steepest descent method. Although this method works, the backpropagation algorithm finds the exact gradient much more efficiently.
False
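The finite-difference procedure above can be sketched for a linear neuron with squared error on a single training case. The inputs, target, and starting weights below are made-up illustration values:

```python
# Central (finite) difference approximation of dE/dw_i for a linear
# neuron with squared error E = 0.5 * (w.x - t)^2 on one training case.
eps = 1e-5
x = [1.0, -1.0]   # illustrative input
t = 0.0           # illustrative target
w = [0.3, 0.7]    # illustrative starting weights

def error(w):
    y = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (y - t) ** 2

# Finite-difference estimate of each partial derivative.
fd_grad = []
for i in range(len(w)):
    w_plus = list(w);  w_plus[i] += eps
    w_minus = list(w); w_minus[i] -= eps
    fd_grad.append((error(w_plus) - error(w_minus)) / (2 * eps))

# Exact gradient (what backpropagation would give): dE/dw_i = (y - t) * x_i
y = sum(wi * xi for wi, xi in zip(w, x))
exact = [(y - t) * xi for xi in x]

print(fd_grad, exact)  # the two should agree closely
```

Note the cost: the finite-difference estimate needs two error evaluations per weight, whereas backpropagation delivers all the exact partial derivatives in roughly the cost of a couple of passes through the network.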
Backpropagation can be used for which kind of neurons?
Logistic neurons
Backpropagation works with derivatives. In a binary threshold neuron the derivative of the output function is zero almost everywhere (and undefined at the threshold itself), so the error signal cannot propagate through it.
Binary threshold neurons
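The contrast can be sketched numerically: the logistic unit has a nonzero slope everywhere, matching the analytic derivative dy/dz = y(1 - y), while the binary threshold unit is flat on either side of the threshold. The probe points below are arbitrary illustration values:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def threshold(z):
    return 1.0 if z >= 0 else 0.0

h = 1e-6
for z in [-2.0, 0.5, 3.0]:            # arbitrary probe points away from z = 0
    num = (logistic(z + h) - logistic(z - h)) / (2 * h)  # numerical slope
    y = logistic(z)
    print(z, num, y * (1 - y))        # numerical and analytic slopes agree
    # The threshold unit's slope at these points is exactly zero, so no
    # error signal can flow back through it:
    print(z, (threshold(z + h) - threshold(z - h)) / (2 * h))
```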
When we perform online learning (using steepest descent), we look at each data example, compute the gradient of the error on that case and then take a little step in the direction opposite the gradient. In offline (also known as batch) learning, we look at each example, compute the gradient, sum these gradients up and then take a (possibly much bigger) step in the direction opposite the sum of the gradients.
True or false: for one pass over the dataset, these procedures are equivalent in that if we took all of the gradients after each update of the online learning procedure and added them up, then we will get the same gradient as the offline method.
Follow-up question: which method do we expect to be more stable with respect to the choice of learning rate? Here we define stable to mean that the learning procedure will converge to a minimum.
Check one box from the left column and one box from the right column.
True
False
The intuitive reason why the answer is false is that in online learning we start moving after seeing just one data point. This takes us to a different region of the parameter space where the gradients will now differ. The message is that these methods are fundamentally different.
The key to understanding the second part is to notice that what we really care about (from an optimization standpoint) is minimizing the error with respect to the entire training set. That is, there is some error surface that is the sum of all of the individual error surfaces from each example. When we do offline learning, we directly minimize this total error while in online learning, each step is minimizing a different error function that only approximates the total error.
Online
Offline
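The difference can be made concrete on the two-case dataset from earlier (x_1 = (1, -1), t_1 = 0; x_2 = (0, 1), t_2 = 1) with a linear neuron and squared error. The starting weights and learning rate below are illustration values:

```python
# Contrast one batch step with one pass of online steps on the
# two-case dataset, linear neuron, squared error.
cases = [((1.0, -1.0), 0.0), ((0.0, 1.0), 1.0)]

def grad(w, case):
    (x1, x2), t = case
    y = w[0] * x1 + w[1] * x2
    return ((y - t) * x1, (y - t) * x2)

w0 = (1.0, 0.0)   # illustrative starting weights
eta = 0.1         # illustrative learning rate

# Batch (offline): sum the gradients, all evaluated at w0, then step once.
g_sum = [sum(g) for g in zip(*(grad(w0, c) for c in cases))]
w_batch = (w0[0] - eta * g_sum[0], w0[1] - eta * g_sum[1])

# Online: step after each case. The second gradient is evaluated at a
# *different* point in weight space, so the result differs.
w = w0
for c in cases:
    g = grad(w, c)
    w = (w[0] - eta * g[0], w[1] - eta * g[1])

print(w_batch, w)  # the two end points differ
```

Here the batch step lands at (0.9, 0.2), while the online pass lands at (0.9, 0.19): after the first online update the second case's gradient is computed at new weights, so the summed movement no longer matches the batch gradient.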