Train a model with a 50-dimensional embedding space, a 200-dimensional hidden layer, and the default settings for all other hyperparameters. What is the average validation set cross entropy reported by the training program after 10 epochs? Please provide a numeric answer (three decimal places). [4 points]
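For reference, "average cross entropy" here is the mean of the negative log-probability the model assigns to the correct next word. A minimal sketch (the probabilities below are made up for illustration; the assignment's training program computes this internally):

```python
import math

# Average cross entropy: mean of -log p(correct word) over the examples.
def average_cross_entropy(probs_of_targets):
    return sum(-math.log(p) for p in probs_of_targets) / len(probs_of_targets)

# Made-up predicted probabilities for the correct next word:
print(round(average_cross_entropy([0.2, 0.05, 0.5]), 3))  # → 1.766
```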
3. Question 3
Train a model for 10 epochs with a 50-dimensional embedding space, a 200-dimensional hidden layer, a learning rate of 0.0001, and the default settings for all other hyperparameters. What do you observe? [3 points]
Cross entropy on the training and validation sets decreases very rapidly.
Cross entropy on the validation set fluctuates wildly and eventually diverges.
Cross entropy on the training set fluctuates wildly and eventually diverges.
Cross entropy on the training and validation sets decreases very slowly.
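The effect of the learning rate can be seen on a toy problem. A sketch (gradient descent on f(w) = w², not the assignment's training program): with a tiny learning rate the loss decreases very slowly, while a huge one makes it diverge.

```python
# Gradient descent on f(w) = w^2; the gradient is 2w.
def run(lr, steps=10, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w * w  # final loss

print(run(0.0001))  # ~0.996: barely decreased after 10 steps
print(run(0.1))     # ~0.012: converging
print(run(100.0))   # astronomically large: diverged
```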
4. Question 4
If all weights and biases in this network were set to zero and no training is performed, what will be the average cross entropy on the training set? Please provide a numeric answer (three decimal places). [3 points]
The answer you gave is not a number.
If all weights and biases are zero, the output distribution will be uniform for all inputs. The cross entropy will then be log_e(n), where n is the number of words in the vocabulary. In this case it will be log_e(250).
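The quoted value can be checked directly; a quick sketch (standard math, nothing specific to the assignment's code):

```python
import math

# All-zero weights give zero logits, so the softmax output is uniform over
# the 250-word vocabulary; cross entropy is then -log(1/250) = log_e(250).
vocab_size = 250
cross_entropy = -math.log(1.0 / vocab_size)
print(round(cross_entropy, 3))  # → 5.521
```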
5. Question 5
Train three models, each with a 50-dimensional embedding space and a 200-dimensional hidden layer.
- Model A: Learning rate = 0.001
- Model B: Learning rate = 0.1
- Model C: Learning rate = 10.0
Use the default settings for all other hyperparameters. Which model gives the lowest training set cross entropy after 1 epoch? [3 points]
Model C
Model A
Model B (incorrect)
6. Question 6
In the models trained in Question 5, which one gives the lowest training set cross entropy after 10 epochs? [2 points]
Model B
Model A
Model C (tried, incorrect)
7. Question 7
Train each of the following models:
- Model A: 5-dimensional embedding, 100-dimensional hidden layer
- Model B: 50-dimensional embedding, 10-dimensional hidden layer
- Model C: 50-dimensional embedding, 200-dimensional hidden layer
- Model D: 100-dimensional embedding, 5-dimensional hidden layer
Use default values for all other hyperparameters.
Which model gives the best training set cross entropy after 10 epochs of training? [3 points]
Model D
Model C
Model A
Model B
8. Question 8
In the models trained in Question 7, which one gives the best validation set cross entropy after 10 epochs of training? [2 points]
Model D (tried, incorrect)
Model A (tried, incorrect)
Model B
Model C
9. Question 9
Train three models, each with a 50-dimensional embedding space and a 200-dimensional hidden layer.
- Model A: Momentum = 0.0
- Model B: Momentum = 0.5
- Model C: Momentum = 0.9
Use the default settings for all other hyperparameters. Which model gives the lowest validation set cross entropy after 5 epochs? [3 points]
Model C (tried, correct!)
Model B
Model A
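The momentum values the question compares enter through the momentum update rule. A sketch of the classical form (assumed here; the assignment's training program may differ in details):

```python
# Classical momentum: v <- momentum * v - lr * grad; w <- w + v.
def momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    v = momentum * v - lr * grad
    return w + v, v

# Toy run on f(w) = w^2 (gradient 2w): velocity accumulates along a
# consistent gradient direction, so early steps grow in size.
w, v = 1.0, 0.0
for _ in range(5):
    w, v = momentum_step(w, v, 2 * w)
print(w)
```

Higher momentum lets the optimizer make faster progress along consistent gradient directions, which is why it can reach a lower cross entropy within a few epochs.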
10. Question 10
Train a model with a 50-dimensional embedding layer and a 200-dimensional hidden layer for 10 epochs. Use the default values for all other hyperparameters.
Which words are among the 10 closest words to the word 'could'? [2 points]
'can'
'some'
'the'
'should'
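"Closest words" here means nearest neighbours in the learned embedding space. A hypothetical sketch, using Euclidean distance and a made-up embedding table (the real assignment reads the vectors from the trained model):

```python
import math

# Rank words by Euclidean distance to the target word's embedding vector.
def nearest_words(word, embeddings, k=10):
    target = embeddings[word]
    return sorted((w for w in embeddings if w != word),
                  key=lambda w: math.dist(target, embeddings[w]))[:k]

# Made-up 2-D embeddings for illustration only:
emb = {'could': [1.0, 0.0], 'can': [0.9, 0.1],
       'should': [0.8, 0.2], 'the': [-1.0, 5.0]}
print(nearest_words('could', emb, k=2))  # → ['can', 'should']
```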
11. Question 11
In the model trained in Question 10, why is the word 'percent' close to 'dr.' even though they have very different contexts and are not expected to be close in word embedding space? [2 points]
The model is not capable of separating them in embedding space, even if it were given a much larger training set.
Both words occur very rarely, so their embedding weights get updated very few times and remain close to their initialization. (tried, correct!)
We trained the model with too large a learning rate.
Both words occur too frequently.
12. Question 12
In the model trained in Question 10, why is 'he' close to 'she' even though they refer to completely different genders? [2 points]
Both words occur very rarely, so their embedding weights get updated very few times and remain close to their initialization.
They differ by only one letter.
The model does not care about gender. It puts them close because if 'he' occurs in a 4-gram, it is very likely that substituting it by 'she' will also make a sensible 4-gram.
They often occur close by in sentences.
13. Question 13
In conclusion, what kind of words does the model put close to each other in embedding space? Choose the most appropriate answer. [3 points]
Words that belong to similar topics. A topic is a semantic categorization (like 'sports', 'art', 'business', 'computers' etc).
Words that can be substituted for one another and still make up a sensible 4-gram.
Words that occur close to each other (within three words to the left or right) in many sentences.
Words that occur close in an alphabetical sort.
3. Question 3
Train a model for 10 epochs with a 50-dimensional embedding space, a 200-dimensional hidden layer, a learning rate of 100.0, and the default settings for all other hyperparameters. What do you observe? [3 points]
Cross entropy on the training set fluctuates around a large value.
Cross entropy on the training set decreases smoothly but fluctuates around a large value on the validation set.
Cross entropy on the validation set fluctuates around a large value.
Cross entropy on the training set fluctuates wildly and eventually diverges.