Neural Networks for Machine Learning: Lecture 4 Quiz


Warning: The hard deadline has passed. You can attempt it, but you will not get credit for it. You are welcome to try it as a learning exercise.


Question 1

The cross-entropy cost function with an  n -way softmax unit (a softmax unit with  n  different outputs) is equivalent to:

Clarification: 

Let's say that a network with $n$ linear output units has some weights $w$; $w$ is a matrix with $n$ columns, and $w_i$ indexes a particular column in this matrix and represents the weights from the inputs to the $i$th output unit.
Suppose the target for a particular example is  j  (so that it belongs to class  j  in other words). 
The squared error cost function for  n  linear units is given by:
$$\frac{1}{2}\sum_{i=1}^{n}\left(t_i - w_i^T x\right)^2$$
where  t  is a vector of zeros except for 1 in index  j .
The cross-entropy cost function for an  n -way softmax unit is given by:
$$-\log\left(\frac{\exp(w_j^T x)}{\sum_{i=1}^{n}\exp(w_i^T x)}\right) = -w_j^T x + \log\left(\sum_{i=1}^{n}\exp(w_i^T x)\right)$$
Finally, $n$ logistic units would compute an output of $\sigma(w_i^T x) = \frac{1}{1+\exp(-w_i^T x)}$ independently for each class $i$. Combined with the squared error, the cost would be:
$$\frac{1}{2}\sum_{i=1}^{n}\left(t_i - \sigma(w_i^T x)\right)^2$$
Where again,  t  is a vector of zeros with a 1 at index  j  (assuming the true class of the example is  j ).
Using this same definition for  t , the cross-entropy error for  n  logistic units would be the sum of the individual cross-entropy errors:
$$-\sum_{i=1}^{n}\left[t_i\log\left(\sigma(w_i^T x)\right) + (1-t_i)\log\left(1-\sigma(w_i^T x)\right)\right]$$

For any set of weights $w$, the network with a softmax output unit over $n$ classes will have some cost due to the cross-entropy error (cost function). The question is now asking whether we can define a new network with a set of weights $w'$ using some (possibly different) cost function such that:

a) $w' = f(w)$ for some function $f$
b) For every input, the cost we get using $w$ in the softmax output network with cross-entropy error is the same as the cost we would get using $w'$ in the new network with the possibly different cost function.
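To make the cost functions above concrete, here is a minimal numpy sketch (not part of the quiz; the shapes, random weights, and variable names are illustrative assumptions) that evaluates each of them for one example with true class $j$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                      # number of classes, input dimensionality (made up)
W = rng.normal(size=(d, n))      # column W[:, i] holds the weights w_i of output unit i
x = rng.normal(size=d)
j = 2                            # true class of this example
t = np.zeros(n); t[j] = 1.0      # target vector: all zeros except a 1 at index j

z = W.T @ x                      # z[i] = w_i^T x, the n output activations

# Squared error with n linear output units
sq_linear = 0.5 * np.sum((t - z) ** 2)

# Cross-entropy with an n-way softmax unit
softmax_xent = -z[j] + np.log(np.sum(np.exp(z)))

# n independent logistic units, with squared error and with cross-entropy
sigma = 1.0 / (1.0 + np.exp(-z))
sq_logistic = 0.5 * np.sum((t - sigma) ** 2)
xent_logistic = -np.sum(t * np.log(sigma) + (1 - t) * np.log(1 - sigma))

print(sq_linear, softmax_xent, sq_logistic, xent_logistic)
```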

Question 2

A logistic unit with the cross-entropy cost function is equivalent to:

Clarification: 

In a network with a logistic output, we will have a single vector of weights $w$. For a particular example with target $t$ (which is 0 or 1), the cross-entropy error is given by:
$$-t\log\left(\sigma(w^T x)\right) - (1-t)\log\left(1-\sigma(w^T x)\right), \quad \text{where } \sigma(w^T x) = \frac{1}{1+\exp(-w^T x)}.$$
The squared error if we use a single linear unit would be: 
$$\frac{1}{2}\left(t - w^T x\right)^2$$
Now notice that another way we might define $t$ is by using a vector with 2 elements, [1,0] to indicate the first class and [0,1] to indicate the second class. Using this definition, we can develop a new type of classification network using a softmax unit over these two classes instead. In this case, we would use a weight matrix $w$ with two columns, where $w_i$ is the column for the $i$th class and connects the inputs to the $i$th output unit.
Suppose an example belonged to class  j  (where  j  is 1 or 2 to indicate [1,0] or [0,1]). Then the cross-entropy cost for this network would be:
$$-\log\left(\frac{\exp(w_j^T x)}{\exp(w_1^T x)+\exp(w_2^T x)}\right) = -w_j^T x + \log\left(\exp(w_1^T x)+\exp(w_2^T x)\right)$$

For any set of weights $w$, the network with a logistic output unit will have some error due to the cross-entropy cost function. The question is now asking whether we can define a new network with a set of weights $w'$ using some (possibly different) cost function such that:

a) $w' = f(w)$ for some function $f$
b) For every input, the cost we get using $w$ in the network with a logistic output and cross-entropy error is the same cost that we would get using $w'$ in the new network with the possibly different cost function.
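As with Question 1, the two costs defined above can be computed side by side. The sketch below (illustrative only; the random weights, dimensions, and the 0-indexed class convention are assumptions) evaluates the logistic cross-entropy and the 2-way softmax cross-entropy for one example, so the two setups can be experimented with directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
x = rng.normal(size=d)
t = 1                                   # target for the logistic unit, 0 or 1

# Logistic unit with a single weight vector w
w = rng.normal(size=d)
p = 1.0 / (1.0 + np.exp(-(w @ x)))
logistic_xent = -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Softmax over two classes with weight matrix W (columns W[:, 0] and W[:, 1])
W = rng.normal(size=(d, 2))
z = W.T @ x                             # z[i] = w_i^T x
j = t                                   # index of the true class (0-indexed here)
softmax_xent = -z[j] + np.log(np.exp(z).sum())

print(logistic_xent, softmax_xent)
```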

Question 3

The output of a neuro-probabilistic language model is a large softmax unit and this creates problems if the vocabulary size is large. Andy claims that the following method solves this problem:
At every iteration of training, train the network to predict the current learned feature vector of the target word (instead of using a softmax). Since the embedding dimensionality is typically much smaller than the vocabulary size, we don't have the problem of having many output weights any more. Which of the following are correct? Check all that apply.
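For concreteness, here is a minimal sketch of the proposal as described (all sizes, names, and the stand-in hidden activations are made-up assumptions, not the course's code): the output layer produces a $D$-dimensional vector and is trained with squared error against the current embedding of the target word, so its weight count no longer scales with the vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(2)
V, D, H = 50_000, 50, 200                      # vocabulary, embedding dim, hidden dim (hypothetical)

embeddings = rng.normal(size=(V, D)) * 0.01    # current learned word feature vectors
W_out = rng.normal(size=(H, D)) * 0.01         # hidden -> predicted embedding (H*D weights, not H*V)

h = rng.normal(size=H)                         # hidden activations for some context (stand-in)
target_word = 123                              # index of the word to be predicted

pred = W_out.T @ h                             # D-dimensional prediction, no softmax involved
loss = 0.5 * np.sum((pred - embeddings[target_word]) ** 2)
print(loss)
```

Whether this training signal actually solves the problem is what the question asks you to judge.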

Question 4

We are given the following tree that we will use to classify a particular example  x :
[Figure: ptree.png, a binary tree with a probability $p_i$ at each internal node and the classes at the leaves]
In this tree, each $p$ value indicates the probability that $x$ will be classified as belonging to a class in the right subtree of the node at which that $p$ was computed. For example, the probability that $x$ belongs to Class 2 is $(1-p_1)\times p_2$. Recall that at training time this is a very efficient representation because we only have to consider a single branch of the tree. However, at test time we need to look over all branches in order to determine the probabilities of each outcome.
Suppose we are not interested in obtaining the exact probability of every outcome, but instead we just want to find the class with the maximum probability. A simple heuristic is to search the tree greedily by starting at the root and choosing the branch with maximum probability at each node on our way from the root to the leaves. That is, at the root of this tree we would choose to go right if $p_1 \geq 0.5$ and left otherwise.
For this particular tree, what would make it more likely that these two methods (exact search and greedy search) will report the same class?
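The sketch below compares the two search strategies on a small hypothetical tree (not the tree in ptree.png; the depth, node layout, and probability values are made up for illustration). Exact search multiplies the branch probabilities along every root-to-leaf path and takes the argmax; greedy search follows the locally more probable branch at each node.

```python
import numpy as np

# A complete binary tree of depth 2 with 4 leaves (classes 0..3).
# p[k] is the probability of taking the RIGHT branch at internal node k,
# with node 0 the root and nodes 1, 2 its left and right children.
p = np.array([0.55, 0.9, 0.52])

def exact_class(p):
    """Probability of each leaf: product of branch probabilities along its path."""
    leaf_probs = [
        (1 - p[0]) * (1 - p[1]),   # left,  left  -> class 0
        (1 - p[0]) * p[1],         # left,  right -> class 1
        p[0] * (1 - p[2]),         # right, left  -> class 2
        p[0] * p[2],               # right, right -> class 3
    ]
    return int(np.argmax(leaf_probs)), leaf_probs

def greedy_class(p):
    """Follow the locally more probable branch from the root down to a leaf."""
    node, leaf = 0, 0
    for depth in range(2):
        go_right = p[node] >= 0.5
        leaf = 2 * leaf + int(go_right)
        node = 2 * node + (2 if go_right else 1)
    return leaf

print(exact_class(p))    # class 1 has the largest exact probability here
print(greedy_class(p))   # greedy descent picks class 3, since p[0] is just above 0.5
```

With these made-up numbers the two methods disagree; changing the branch probabilities shows when they agree, which is what the question is about.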

Question 5

True or false: the neural network in the lectures that was used to predict relationships in family trees had "bottleneck" layers (layers with fewer dimensions than the input). The reason these were used was to prevent the network from memorizing the training data without learning any meaningful features for generalization.

Question 6

In the Collobert and Weston model, the problem of learning a feature vector from a sequence of words is turned into a problem of:

Question 7

Suppose that we have a vocabulary of 3 words, "a", "b", and "c", and we want to predict the next word in a sentence given the previous two words. Also suppose that we don't want to use feature vectors for words: we simply use the local encoding, i.e. a 3-component vector with one entry being 1 and the other two entries being 0.
In the language models that we have seen so far, each of the context words has its own dedicated section of the network, so we would encode this problem with two 3-dimensional inputs. That makes for a total of 6 dimensions; clearly, the more context words we want to include, the more input units our network must have. Here's a method that uses fewer input units:
We could instead encode the  counts of each word in the context. So a context of "a a" would be encoded as input vector [2 0 0] instead of [1 0 0 1 0 0], and "b c" would be encoded as input vector [0 1 1] instead of [0 1 0 0 0 1]. Now we only need an input vector of the size of our vocabulary (3 in our case), as opposed to the size of our vocabulary times the length of the context (which makes for a total of 6 in our case). Are there any significant problems with this idea?
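The two encodings described above can be written out directly. Here is a minimal sketch (the vocabulary and helper names are illustrative) that reproduces the example vectors from the text:

```python
import numpy as np

vocab = ["a", "b", "c"]
index = {w: i for i, w in enumerate(vocab)}

def concatenated_one_hot(context):
    """One 3-dimensional one-hot block per context word (6 inputs for 2 words)."""
    vec = np.zeros(len(vocab) * len(context))
    for pos, word in enumerate(context):
        vec[pos * len(vocab) + index[word]] = 1.0
    return vec

def count_encoding(context):
    """Counts of each vocabulary word in the context (3 inputs regardless of context length)."""
    vec = np.zeros(len(vocab))
    for word in context:
        vec[index[word]] += 1.0
    return vec

print(concatenated_one_hot(["a", "a"]))   # [1. 0. 0. 1. 0. 0.]
print(count_encoding(["a", "a"]))         # [2. 0. 0.]
print(concatenated_one_hot(["b", "c"]))   # [0. 1. 0. 0. 0. 1.]
print(count_encoding(["b", "c"]))         # [0. 1. 1.]
```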