Question 1
Suppose w is the weight on some connection in a neural network. The network is trained using gradient descent until the learning converges. However, the dataset consists of two mini-batches, which differ from each other somewhat. As usual, we alternate between the mini-batches for our gradient calculations, and that has implications for what happens after convergence. We plot the change of w as training progresses. Which of the following scenarios shows that convergence has occurred?
Notice that we're plotting the change in w, as opposed to w itself.
Note that in the plots below, each iteration refers to a single step of steepest descent on a single minibatch.
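The setup in this question can be explored with a small simulation. The following is a minimal sketch, not part of the original quiz: it assumes a single weight w and gives each of the two mini-batches a toy quadratic loss with a slightly different minimum. The function name, the targets, and the learning rate are all illustrative assumptions. It records the change in w at every iteration so that the last few values can be inspected or plotted.

```python
def simulate_alternating_minibatches(num_iters=200, lr=0.1, targets=(1.0, 1.2)):
    """Run gradient descent on one weight, alternating between two mini-batches.

    Assumption (not from the quiz): mini-batch k has the toy loss
    0.5 * (w - targets[k]) ** 2, so the two batches disagree slightly
    about the best value of w.
    Returns the final weight and the list of per-iteration changes in w.
    """
    w = 0.0
    changes = []
    for i in range(num_iters):
        target = targets[i % 2]   # alternate between the two mini-batches
        grad = w - target         # gradient of the toy loss for this batch
        delta = -lr * grad        # one steepest-descent step
        w += delta
        changes.append(delta)
    return w, changes


if __name__ == "__main__":
    _, changes = simulate_alternating_minibatches()
    # Print the change in w (not w itself) for the last few iterations.
    for i, delta in enumerate(changes[-6:], start=len(changes) - 6):
        print(f"iteration {i}: change in w = {delta:+.6f}")
```

Varying `targets` and `lr` shows how the disagreement between the two mini-batches and the step size affect the late-training behaviour of the change in w.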
Question 2
Suppose you are using mini-batch gradient descent for training some neural net on a large dataset. You have to decide on the learning rate, the weight initialization, the input preprocessing, etc. You try some values for these and find that the value of the objective function on the training set decreases smoothly but very slowly. What could be causing this? Check all that apply.
Question 3
Four datasets are shown below. Each dataset has two input values (plotted below) and a target value (not shown). Each point in the plots denotes one training case. Assume that we are solving a classification problem. Which of the following datasets would most likely be easiest to train using neural nets?
Question 4
Claire is training a neural net using mini-batch gradient descent. She chose a particular learning rate and found that the training error decreased as more iterations of training were performed, as shown here in blue:
She was not sure if this was the best she could do. So she tried a bigger learning rate. Which of the following error curves (shown in red) might she observe now? Select the two most likely plots.
Note that in the plots below, each iteration refers to a single step of steepest descent on a single minibatch.
Question 5
In the lectures, we discussed two kinds of gradient descent algorithms: mini-batch and full-batch. For which of the following problems is mini-batch gradient descent likely to be a lot better than full-batch gradient descent?
Disease prediction: Predict if a person will get cancer. The input consists of 1000 medical indicators (blood pressure, family cancer history, etc.); the training set consists of 100 patients who all suffered the same type of cancer, and 100 healthy patients.
Sentiment Analysis: Decide whether a given movie review says that the movie is 'good' or 'bad'. The input consists of the word count in the review, for each of 50,000 words. The training set consists of 1,000,000 movie reviews found on the internet.
Predict if an experiment at the Large Hadron Collider is going to yield positive results. The input consists of 25 experiment parameters (energy level, types of particles, etc.). The training set consists of the 200 experiments that have already been completed (some of those yielded positive results; some yielded only negative results).
Language modeling: Predict the next word using the previous 3 words. The vocabulary consists of 100,000 words. The dataset consists of all Wikipedia articles.
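One way to compare the problems above is to count how many weight updates each algorithm makes per pass over the training set. The sketch below is illustrative only: the mini-batch size of 100 is an assumption, not something stated in the question, and `updates_per_epoch` is a hypothetical helper. The training-set sizes are the ones given in the problem descriptions (the Wikipedia dataset is left out because its size is not stated).

```python
import math


def updates_per_epoch(num_cases, batch_size):
    """Number of weight updates in one pass over num_cases training cases."""
    return math.ceil(num_cases / batch_size)


# Assumed mini-batch size; the quiz does not specify one.
BATCH_SIZE = 100

# Training-set sizes taken from the problem descriptions.
training_set_sizes = {
    "Disease prediction": 100 + 100,    # 100 cancer patients + 100 healthy
    "Sentiment analysis": 1_000_000,    # movie reviews
    "Large Hadron Collider": 200,       # completed experiments
}

if __name__ == "__main__":
    for name, n in training_set_sizes.items():
        mini = updates_per_epoch(n, BATCH_SIZE)
        full = updates_per_epoch(n, n)  # full-batch: one update per pass
        print(f"{name}: mini-batch updates = {mini}, full-batch updates = {full}")
```

Full-batch gradient descent always makes exactly one update per pass, so the ratio between the two counts grows with the size of the training set.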