Neural Networks for Machine Learning: Lecture 4 Quiz

Latest recommended article published on 2024-09-23 20:31:41

GarfieldEr007



Category: Machine Learning. Tags: machine learning, neural networks, Neural Networks, Machine Learning, Quiz

Article link: https://blog.csdn.net/garfielder007/article/details/50598138


Warning: The hard deadline has passed. You can attempt it, but you will not get credit for it. You are welcome to try it as a learning exercise.

Question 1

The cross-entropy cost function with an $n$-way softmax unit (a softmax unit with $n$ different outputs) is equivalent to:

Clarification:

Let's say that a network with $n$ linear output units has some weights $w$. $w$ is a matrix with $n$ columns, and $w_i$ indexes a particular column in this matrix and represents the weights from the inputs to the $i$-th output unit.
Suppose the target for a particular example is $j$ (so that it belongs to class $j$, in other words).
The squared error cost function for $n$ linear units is given by:

$$\frac{1}{2}\sum_{i=1}^{n}\left(t_i - w_i^T x\right)^2$$
where $t$ is a vector of zeros except for a 1 at index $j$.
The cross-entropy cost function for an $n$-way softmax unit is given by:

$$-\log\left(\frac{\exp(w_j^T x)}{\sum_{i=1}^{n}\exp(w_i^T x)}\right) = -w_j^T x + \log\left(\sum_{i=1}^{n}\exp(w_i^T x)\right)$$
Finally, $n$ logistic units would compute an output of $\sigma(w_i^T x) = \frac{1}{1+\exp(-w_i^T x)}$ independently for each class $i$. Combined with the squared error, the cost would be:

$$\frac{1}{2}\sum_{i=1}^{n}\left(t_i - \sigma(w_i^T x)\right)^2$$
Where again, $t$ is a vector of zeros with a 1 at index $j$ (assuming the true class of the example is $j$).
Using this same definition for $t$, the cross-entropy error for $n$ logistic units would be the sum of the individual cross-entropy errors:

$$-\sum_{i=1}^{n}\left[t_i\log\left(\sigma(w_i^T x)\right) + (1-t_i)\log\left(1-\sigma(w_i^T x)\right)\right]$$
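The four cost functions defined in this clarification can be evaluated numerically. What follows is a minimal sketch in plain Python, not part of the original quiz: the function names, the example weights `W`, the input `x`, and the use of a 0-based class index `j` are all assumptions made for illustration. `W` is stored as a list of the column vectors $w_i$.

```python
import math

def dot(u, v):
    """Inner product w_i^T x of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def target(i, j):
    """Entry t_i of the one-hot target vector for true class j (0-based here)."""
    return 1.0 if i == j else 0.0

def squared_error_linear(W, x, j):
    return 0.5 * sum((target(i, j) - dot(w, x)) ** 2 for i, w in enumerate(W))

def cross_entropy_softmax(W, x, j):
    # Uses the right-hand side of the identity: -w_j^T x + log(sum_i exp(w_i^T x))
    logits = [dot(w, x) for w in W]
    return -logits[j] + math.log(sum(math.exp(z) for z in logits))

def squared_error_logistic(W, x, j):
    return 0.5 * sum((target(i, j) - sigmoid(dot(w, x))) ** 2 for i, w in enumerate(W))

def cross_entropy_logistic(W, x, j):
    total = 0.0
    for i, w in enumerate(W):
        y = sigmoid(dot(w, x))
        t = target(i, j)
        total -= t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
    return total

# Hypothetical example: n = 3 classes, 2 inputs, true class j = 1 (0-based).
W = [[0.5, -1.0], [0.2, 0.3], [-0.4, 0.1]]
x = [1.0, 2.0]
j = 1
for cost in (squared_error_linear, cross_entropy_softmax,
             squared_error_logistic, cross_entropy_logistic):
    print(cost.__name__, cost(W, x, j))
```

Evaluating the costs side by side for a few weight settings is one way to build intuition before reasoning about the equivalence the question asks for.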

For any set of weights $w$, the network with a softmax output unit over $n$ classes will have some cost due to the cross-entropy error (cost function). The question is now asking whether we can define a new network with a set of weights $w^*$ using some (possibly different) cost function such that:

a) $w^* = f(w)$ for some function $f$
b) For every input, the cost we get using $w$ in the softmax output network with cross-entropy error is the same as the cost we would get using $w^*$ in the new network with the possibly different cost function.

The squared error cost function with $n$ linear units.

The squared error cost function with $n$ logistic units.

The cross-entropy cost function with $n$ logistic units.

None of the above.

Question 2

A logistic unit with the cross-entropy cost function is equivalent to:

Clarification:

In a network with a logistic output, we will have a single vector of weights $w$. For a particular example with target $t$ (which is 0 or 1), the cross-entropy error is given by:

$$-t\log\left(\sigma(w^T x)\right) - (1-t)\log\left(1-\sigma(w^T x)\right)$$
where $\sigma(w^T x) = \frac{1}{1+\exp(-w^T x)}$.
The squared error if we use a single linear unit would be:

$$\frac{1}{2}\left(t - w^T x\right)^2$$
Now notice that another way we might define $t$ is by using a vector with 2 elements, [1,0] to indicate the first class, and [0,1] to indicate the second class. Using this definition, we can develop a new type of classification network using a softmax unit over these two classes instead. In this case, we would use a weight matrix $w$ with two columns, where $w_i$ is the column of the $i$-th class and connects the inputs to the $i$-th output unit.
Suppose an example belonged to class $j$ (where $j$ is 1 or 2 to indicate [1,0] or [0,1]). Then the cross-entropy cost for this network would be:

$$-\log\left(\frac{\exp(w_j^T x)}{\exp(w_1^T x)+\exp(w_2^T x)}\right) = -w_j^T x + \log\left(\exp(w_1^T x)+\exp(w_2^T x)\right)$$
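The two costs in this clarification, and the two target encodings, can be written out directly. This is a small sketch in plain Python under assumptions: the function names and the example numbers are invented for illustration, and `j` follows the text's convention of being 1 or 2.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic_cross_entropy(w, x, t):
    """Cross-entropy error of a single logistic unit; t is 0 or 1."""
    y = 1.0 / (1.0 + math.exp(-dot(w, x)))
    return -t * math.log(y) - (1 - t) * math.log(1.0 - y)

def two_element_target(j):
    """Two-element target vector: class 1 -> [1, 0], class 2 -> [0, 1]."""
    return [1, 0] if j == 1 else [0, 1]

def two_way_softmax_cross_entropy(w1, w2, x, j):
    """Right-hand side of the identity: -w_j^T x + log(exp(w_1^T x) + exp(w_2^T x))."""
    z1, z2 = dot(w1, x), dot(w2, x)
    zj = z1 if j == 1 else z2
    return -zj + math.log(math.exp(z1) + math.exp(z2))

# Hypothetical example values.
x = [1.0, -2.0]
print(logistic_cross_entropy([0.3, 0.1], x, 1))
print(two_element_target(2), two_way_softmax_cross_entropy([0.3, 0.1], [-0.2, 0.4], x, 2))
```

With both functions available, candidate choices of $f$ can be tried by computing $w^*$ from $w$ and comparing the resulting costs on several inputs.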

For any set of weights $w$, the network with a logistic output unit will have some error due to the cross-entropy cost function. The question is now asking whether we can define a new network with a set of weights $w^*$ using some (possibly different) cost function such that:

a) $w^* = f(w)$ for some function $f$
b) For every input, the cost we get using $w$ in the network with a logistic output and cross-entropy error is the same cost that we would get using $w^*$ in the new network with the possibly different cost function.

A 2-way softmax unit (a softmax unit with 2 elements) with the cross entropy cost function.

A 2-way softmax unit (a softmax with 2 elements) with the squared error cost function.

A linear unit with the squared error cost function.

None of the above.

Question 3

The output of a neuro-probabilistic language model is a large softmax unit and this creates problems if the vocabulary size is large. Andy claims that the following method solves this problem:
At every iteration of training, train the network to predict the current learned feature vector of the target word (instead of using a softmax). Since the embedding dimensionality is typically much smaller than the vocabulary size, we don't have the problem of having many output weights any more. Which of the following are correct? Check all that apply.

The serialized version of the model discussed in the slides is using the current word embedding for the output word, but it's optimizing something different than what Andy is suggesting.

If we add in extra derivatives that change the feature vector for the target word to be more like what is predicted, it may find a trivial solution in which all words have the same feature vector.

In theory there's nothing wrong with Andy's idea. However, the number of learnable parameters will be so far reduced that the network no longer has sufficient learning capacity to do the task well.

Andy is correct: this is equivalent to the serialized version of the model discussed in the lecture.

Question 4

We are given the following tree that we will use to classify a particular example $x$:

In this tree, each $p$ value indicates the probability that $x$ will be classified as belonging to a class in the right subtree of the node at which that $p$ was computed. For example, the probability that $x$ belongs to Class 2 is $(1-p_1)\times p_2$. Recall that at training time this is a very efficient representation because we only have to consider a single branch of the tree. However, at test-time we need to look over all branches in order to determine the probabilities of each outcome.
Suppose we are not interested in obtaining the exact probability of every outcome, but instead we just want to find the class with the maximum probability. A simple heuristic is to search the tree greedily by starting at the root and choosing the branch with maximum probability at each node on our way from the root to the leaves. That is, at the root of this tree we would choose to go right if $p_1 \geq 0.5$ and left otherwise.
For this particular tree, what would make it more likely that these two methods (exact search and greedy search) will report the same class?

It helps if the value of each $p$ is close to 0 or 1.

It helps if $p_1$ is close to 0.5 while $p_2$ and $p_3$ are close to 0 or 1.

It helps if the value of each $p$ is close to 0.5.

It helps if $p_1$ is further from 0.5. It hurts if $p_2$ is further from 0.5.
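The exact and greedy searches described in Question 4 can be sketched in a few lines of Python. The tree figure is not reproduced in this text, so the layout used here is an assumption: a balanced tree with $p_1$ at the root, $p_2$ at the left child, $p_3$ at the right child, and leaves Class 1 to Class 4 from left to right. This layout is consistent with the stated probability $(1-p_1)\times p_2$ for Class 2, but it is not confirmed by the source. The function names are hypothetical.

```python
def leaf_probabilities(p1, p2, p3):
    """Probability of each class under the ASSUMED balanced 4-leaf tree.

    Each p is the probability of taking the right branch at its node.
    """
    return {
        1: (1 - p1) * (1 - p2),  # left at root, left at p2
        2: (1 - p1) * p2,        # left at root, right at p2 (matches the text)
        3: p1 * (1 - p3),        # right at root, left at p3
        4: p1 * p3,              # right at root, right at p3
    }

def exact_search(p1, p2, p3):
    """Look over all branches and return the class with maximum probability."""
    probs = leaf_probabilities(p1, p2, p3)
    return max(probs, key=probs.get)  # ties resolve to the lowest class number

def greedy_search(p1, p2, p3):
    """Follow the more probable branch at each node, going right when p >= 0.5."""
    if p1 >= 0.5:
        return 4 if p3 >= 0.5 else 3
    return 2 if p2 >= 0.5 else 1

print(exact_search(0.9, 0.5, 0.9), greedy_search(0.9, 0.5, 0.9))
```

Calling both functions on the same $p$ values shows whether the two methods report the same class for that setting, which makes it easy to probe each answer option with concrete numbers.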

Question 5

True or false: the neural network in the lectures that was used to predict relationships in family trees had "bottleneck" layers (layers with fewer dimensions than the input). The reason these were used was to prevent the network from memorizing the training data without learning any meaningful features for generalization.

False

True

Question 6

In the Collobert and Weston model, the problem of learning a feature vector from a sequence of words is turned into a problem of:

Learning to predict the next word in an arbitrary length sequence.

Learning to reconstruct the input vector.

Learning a binary classifier.

Learning to predict the middle word in the sequence given the words that came before and the words that came after.

Question 7

Suppose that we have a vocabulary of 3 words, "a", "b", and "c", and we want to predict the next word in a sentence given the previous two words. Also suppose that we don't want to use feature vectors for words: we simply use the local encoding, i.e. a 3-component vector with one entry being 1 and the other two entries being 0.
In the language models that we have seen so far, each of the context words has its own dedicated section of the network, so we would encode this problem with two 3-dimensional inputs. That makes for a total of 6 dimensions; clearly, the more context words we want to include, the more input units our network must have. Here's a method that uses fewer input units:
We could instead encode the counts of each word in the context. So a context of "a a" would be encoded as input vector [2 0 0] instead of [1 0 0 1 0 0], and "b c" would be encoded as input vector [0 1 1] instead of [0 1 0 0 0 1]. Now we only need an input vector of the size of our vocabulary (3 in our case), as opposed to the size of our vocabulary times the length of the context (which makes for a total of 6 in our case). Are there any significant problems with this idea?

Yes: even though the input has a smaller dimensionality, each entry of the input now requires more bits to encode, because it's no longer just 1 or 0. Therefore, there would be no significant advantage.

Yes: the neural networks shown in the course so far cannot deal with integer inputs (as opposed to binary inputs).

Yes: although we could encode the context in this way, we would then need a smaller bottleneck layer than we did before, thereby lowering the learning capacity of the model.

Yes: the network loses the knowledge of the location at which a context word occurs, and that is valuable knowledge.
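The two input encodings described in Question 7 can be reproduced with a short Python sketch. The function names `positional_encoding` and `count_encoding` are hypothetical labels chosen for illustration; the vocabulary and the example contexts come from the question text.

```python
VOCAB = ["a", "b", "c"]

def positional_encoding(context):
    """One local (one-hot) vector per context word, concatenated in order."""
    vec = []
    for word in context:
        vec.extend(1 if word == v else 0 for v in VOCAB)
    return vec

def count_encoding(context):
    """A single vector holding how often each vocabulary word occurs in the context."""
    return [context.count(v) for v in VOCAB]

for context in (["a", "a"], ["b", "c"]):
    print(context, positional_encoding(context), count_encoding(context))
```

The first encoding grows with the length of the context (6 entries for two words), while the second always has one entry per vocabulary word (3 entries here). Trying further contexts with both functions is a direct way to compare what information each encoding keeps.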