Neural Networks for Machine Learning: Final Exam

Final Exam

Warning: The hard deadline has passed. You can attempt it, but you will not get credit for it. You are welcome to try it as a learning exercise.

There are 18 questions in this exam. You can earn one point for each of the questions where you're asked to select exactly one of the possible answers. You can earn two points for each of the questions where you're asked to say for each answer whether it's right or wrong, and also two points for each of the questions where you're asked to calculate a number. The total number of points that you can earn is 25. The exam is worth 25% of the course grade, so each of those points is worth 1% of the course grade.

Other than the deadline for submitting your answers, there is no time limit: you don't have to finish it within a set number of hours, or anything like that.

Unlike with the weekly quizzes and the programming assignments, the deadline for this exam is a hard deadline. If you don't submit your answers before the deadline passes, then you will have a score of 0.

Good luck!

Question 1

One regularization technique is to start with lots of connections in a neural network, and then remove those that are least useful to the task at hand (removing connections is the same as setting their weight to zero). Which of the following regularization techniques is best at removing connections that are least useful to the task that the network is trying to accomplish?

Question 2

Why don't we usually train Restricted Boltzmann Machines by taking steps in the exact direction of the gradient of the objective function, like we do for other systems?

Question 3

When we want to train a Restricted Boltzmann Machine, we could try the following strategy. Each time we want to do a weight update based on some training cases, we turn each of those training cases into a full configuration by adding a sampled state of the hidden units (sampled from their distribution conditional on the state of the visible units as specified in the training case); and then we do our weight update in the direction that would most increase the goodness (i.e. decrease the energy) of those full configurations. This way, we expect to end up with a model where configurations that match the training data have high goodness (i.e. low energy).

However, that's not what we do in practice. Why not?
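For readers working through this question, here is a minimal numpy sketch of the strategy the question describes: sample a hidden state from p(h|v), then nudge the weights to raise the goodness (lower the energy) of the resulting full configuration. This is a study aid, not the course's code; the energy function E(v, h) = -vᵀWh (no biases) and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_positive_phase_update(W, v, learning_rate=0.1, rng=rng):
    """One update of the strategy described in the question: sample
    h ~ p(h | v), then move the weights to increase the goodness
    (decrease the energy) of the full configuration (v, h).
    For E(v, h) = -v^T W h, the gradient of the goodness -E with
    respect to W is the outer product v h^T."""
    p_h = sigmoid(v @ W)                              # p(h=1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)   # sampled hidden state
    return W + learning_rate * np.outer(v, h)         # raise goodness of (v, h)

# Tiny illustrative example: 3 visible units, 2 hidden units.
W = np.zeros((3, 2))
v = np.array([1.0, 0.0, 1.0])
W = naive_positive_phase_update(W, v)
```

Notice that this update only ever increases the goodness of data-driven configurations; it has no term that lowers the goodness of configurations the model itself favors, which is the clue to why this is not what we do in practice.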

Question 4

CD-1 and CD-10 both have their strong sides and their weak sides. Which is the main advantage of CD-10 over CD-1?

Question 5

CD-1 and CD-10 both have their strong sides and their weak sides. Which are significant advantages of CD-1 over CD-10? Check all that apply.
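To make the CD-1 versus CD-10 trade-off concrete, here is a sketch of a CD-k gradient estimate for a bias-free RBM (an illustrative assumption, not the course's implementation): the only difference between CD-1 and CD-10 is the number k of alternating Gibbs steps used to produce the negative-phase sample, so CD-10 costs roughly ten times as much per update but its negative sample has wandered further from the data toward the model's own distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradient(W, v0, k, rng=rng):
    """Contrastive-divergence gradient estimate for an RBM with
    energy E(v, h) = -v^T W h (biases omitted for brevity).
    CD-1 uses k=1 Gibbs steps; CD-10 uses k=10."""
    # Positive phase: hidden probabilities driven by the data.
    p_h0 = sigmoid(v0 @ W)
    positive = np.outer(v0, p_h0)
    # Negative phase: k alternating Gibbs steps away from the data.
    v = v0.copy()
    for _ in range(k):
        h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
        v = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
    negative = np.outer(v, sigmoid(v @ W))
    return positive - negative

W = 0.1 * np.ones((3, 2))
v0 = np.array([1.0, 1.0, 0.0])
g1 = cd_k_gradient(W, v0, k=1)    # CD-1: cheap, more biased estimate
g10 = cd_k_gradient(W, v0, k=10)  # CD-10: ~10x the work per update
```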

Question 6

With a lot of training data, is the perceptron learning procedure more likely or less likely to converge than with just a little training data?

Clarification: We're not assuming that the data is always linearly separable.
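As a refresher for this question, here is the standard perceptron learning procedure in plain Python (a generic sketch; the toy AND-style dataset is an illustrative assumption). The convergence theorem guarantees finitely many updates only when a separating weight vector exists, which is why the clarification about linear separability matters: more data means more constraints that all have to be satisfiable at once.

```python
def perceptron_step(w, x, y):
    """One perceptron learning step with labels y in {-1, +1}:
    if the example is misclassified, add y*x to the weights;
    otherwise leave them unchanged."""
    prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
    if prediction != y:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data (AND), with a constant bias input in position 0.
data = [([1, 0, 0], -1), ([1, 1, 1], 1), ([1, 1, 0], -1), ([1, 0, 1], -1)]
w = [0, 0, 0]
for _ in range(200):          # sweep until the weights stop changing
    for x, y in data:
        w = perceptron_step(w, x, y)
```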

Question 7

You just trained a neural network for a classification task, using some weight decay for regularization. After training it for 20 minutes, you find that on the validation data it performs much worse than on the training data: on the validation data, it classifies 90% of the data cases correctly, while on the training data it classifies 99% of the data cases correctly. Also, you made a plot of the performance on the training data and the performance on the validation data, and that plot shows that at the end of those 20 minutes, the performance on the training data is improving while the performance on the validation data is getting worse. 

What would be a reasonable strategy to try next? Check all that apply.
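One of the candidate strategies in this situation, early stopping, can be sketched as a generic training loop (this is an illustration, not the course's code; the `patience` heuristic and the toy error curve are assumptions): stop once the validation error has failed to improve for several consecutive epochs and keep the best epoch seen, rather than the final one.

```python
def train_with_early_stopping(train_step, validation_error,
                              max_epochs=100, patience=5):
    """Generic early-stopping loop: `train_step` runs one epoch of
    training, `validation_error` returns the current validation
    error.  Stop when the validation error has not improved for
    `patience` consecutive epochs; report the best epoch seen."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # validation error keeps getting worse: stop
    return best_epoch, best_error

# Toy run: validation error falls, then rises (the classic overfitting curve).
errors = [0.5, 0.3, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55]
it = iter(errors)
epoch, err = train_with_early_stopping(lambda: None, lambda: next(it),
                                       max_epochs=10)
# Stops shortly after epoch 2, where the validation error bottomed out at 0.2.
```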

Question 8

If the hidden units of a network are independent of each other, then it's easy to get a sample from the correct distribution, which is a very important advantage. For which systems, and under which conditions, are the hidden units independent of each other? Check all that apply.

Question 9

What is the purpose of momentum?
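For reference, the classical momentum update looks like this (a generic sketch; the learning rate and momentum values are illustrative): the velocity accumulates a decaying sum of past gradients, which damps oscillations across directions where the gradient keeps changing sign and builds up speed along directions where it is consistent.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    """One classical-momentum update: the velocity is a decaying
    running sum of gradients, and the weight moves by the velocity."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# With a gradient that points the same way every step, the effective
# step size grows toward learning_rate / (1 - momentum).
w, v = 0.0, 0.0
for _ in range(3):
    w, v = sgd_momentum_step(w, v, grad=1.0)
# After three steps: velocity -0.271, weight -0.561.
```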

Question 10

Consider a Restricted Boltzmann Machine with 2 visible units  v1,v2  and 1 hidden unit  h . The visible units are connected to the hidden unit by weights 