Deep Learning Review Questions

parry64

Published 2022-12-27 21:07:03

Tags: deep learning, artificial intelligence

Article link: https://blog.csdn.net/parry64/article/details/128459619

Neural Network and Deep Learning

Basics

These are simple neural network questions for anyone who wants a basic review of the area. I recommend learning from Andrew Ng's videos first.

Intro

What is the relationship between artificial intelligence, machine learning and deep learning?
AI ⊃ ML ⊃ DL: deep learning is a subset of machine learning, which is a subset of artificial intelligence.

IT’s Intro to Deep Learning

Q: What are the meanings of the following: artificial intelligence (AI), natural intelligence, and cognitive functions?
AI: the ability of a machine to do tasks that are usually done by humans, so that it appears intelligent.
Natural intelligence: human intelligence.
Cognitive functions: the ability to process incoming information, including perception, intuition, and memory.

Q: What is the difference between supervised and unsupervised learning? Name one example of supervised learning and one example of unsupervised learning.
Supervised: learning by example from labeled data. E.g. regression on a given data set, housing price prediction, photo tagging, machine translation (often with CNNs or RNNs).
Unsupervised: learning by concept, without labels. E.g. clustering methods.

Q: What are the elements (or parts) of a neural network?
Input layer, hidden layer, output layer

Q: In the example for handwritten digits recognition, what are the inputs and output?
Input: images of handwritten digits. Output: the digit that each image shows.

Q: In that example, each neuron has two parts. What do each of these parts compute?
A linear part, which computes z = w^T x + b from the input values, and an activation part, which computes a = σ(z).

Q: In that example, what are the training data?
Images of handwritten digits, each tagged with the correct digit.

Q: What two deep learning techniques are used in Alpha GO?
A game tree search procedure and neural networks

C1M1

Q: In the neural network example for housing price prediction, what are the input and output?
Input: Home features. Output: Price

Q: What must we have in supervised learning?
The given labeled data, examples of inputs and corresponding outputs.

Q: What are some examples of supervised learning? What are the input and output in these examples?
Photo tagging: image -> objects 1, 2, 3; Advertising: ad and user info -> click or not; Machine translation: English -> Chinese

Q: Give examples of supervised learning model that use structured data.
House price prediction. Advertising.

Q: Give examples of supervised learning model that use unstructured data.
Audio; Image;Text

Q: What are the three reasons why deep learning made big progress in recent years?
Large data; Algorithms; Computation

Q: In deep learning, what two things are needed in order to achieve high level of performance?
A good algorithm (a large enough neural network) and enough time to train it on a lot of labeled data

Q: Name the three parts in the iterative process in training a neural network.
Idea -> Code -> Experiment, then repeat

C1M2

Q: What is the output for binary classification?
1 or 0 (true/false) (yes/no)

Q: What is the difference between the outputs for logistic regression and binary classification?
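Logistic regression outputs a continuous value between 0 and 1, the probability ŷ = σ(w^T x + b) given by the sigmoid function; binary classification outputs the discrete label 0 or 1.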

Q: What is the general meaning in the following diagrams? Explain what each part is doing.

Q: In a single node neural network, write the equations for the linear part and the activation part. Use the sigmoid function.

$\hat{y} = \sigma(w^T x + b)$
Loss function: $L(\hat{y}^{(i)}, y^{(i)}) = -[\, y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \,]$
Cost function: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

Q: What is the difference between the loss function and the cost function?
The loss function is for a single training example; the cost function is the average over all m training examples.

Q: Why do we want to find the minimum of the cost function?
The minimum gives us the best values of w and b in the logistic regression algorithm.

Q: What is one way to find the minimum of the cost function in a neural network?
the gradient descent method

Q: What is the purpose of backpropagation? How is it used?
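Backpropagation is basically using the chain rule to compute a derivative: the derivative of the cost with respect to each parameter is broken into a series of simple derivatives, computed from the output back toward the input. Gradient descent then uses these derivatives to update the parameters.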

Q: What does the following code do? Explain the main steps in the code.
You don’t need to memorize the algorithm, but you need to understand its meaning.

Non-vectorized version (explicit loop over the m training examples):

J = 0, dw_1 = 0, dw_2 = 0, db = 0
For i = 1 to m:
    z^(i) = w^T x^(i) + b
    a^(i) = σ(z^(i))
    J += −[ y^(i) log a^(i) + (1 − y^(i)) log(1 − a^(i)) ]
    dz^(i) = a^(i) − y^(i)
    dw_1 += x_1^(i) dz^(i)
    dw_2 += x_2^(i) dz^(i)
    … (likewise for every further feature when n > 2)
    db += dz^(i)
J = J/m, dw_1 = dw_1/m, dw_2 = dw_2/m, db = db/m

Vectorized version (no explicit loop):

Z = w^T X + [b b … b] = np.dot(w.T, X) + b
A = σ(Z)
dZ = A − Y
dw = (1/m) X dZ^T
db = (1/m) np.sum(dZ)

One iteration of gradient descent, over all training data:
w := w − α * dw
b := b − α * db
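
The vectorized steps above can be written as a short NumPy sketch. The toy data, the learning rate of 0.5 and the function name `gradient_step` are assumptions made for illustration; they are not from the lectures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent iteration for logistic regression.

    X has shape (n_features, m), Y has shape (1, m), w has shape (n_features, 1).
    """
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                      # forward pass, shape (1, m)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                                           # derivative of the loss w.r.t. Z
    dw = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    return w - alpha * dw, b - alpha * db, cost

# Hypothetical toy data: one feature, four examples
X = np.array([[0.0, 1.0, 2.0, 3.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros((1, 1)), 0.0
costs = []
for _ in range(200):
    w, b, cost = gradient_step(w, b, X, Y, alpha=0.5)
    costs.append(cost)
print(costs[0], costs[-1])   # the cost should go down over the iterations
```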

Q: What is the purpose of vectorization?
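Speed: replacing explicit for-loops with matrix operations makes the computation much faster.

Q: In Python, what is the result of the following? What is this property called?
[1, 2, 3] + 10 = [11, 12, 13] when [1, 2, 3] is a NumPy array (a plain Python list would raise a TypeError). The property is called broadcasting.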
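
A minimal check of the broadcasting behaviour, assuming NumPy is what the question has in mind:

```python
import numpy as np

a = np.array([1, 2, 3])
result = a + 10            # the scalar 10 is broadcast across every element
print(result)              # [11 12 13]

try:
    [1, 2, 3] + 10         # a plain list does not broadcast
    list_failed = False
except TypeError:
    list_failed = True
print(list_failed)         # True
```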

Q: What are the components of a neural network?
Input layer, hidden layer, output layer

Q: How are the hidden layer values different from the input and output layer values?
Hidden layer values are not observed: the training set contains the inputs and the outputs, but not the values in the hidden layer.

Q: In a shallow neural network with one hidden layer, what are the parameters for the input layer, hidden layer, and output layer? Please use proper notations.
Input layer: it has no parameters of its own; the input features x are denoted $a^{[0]}$.
Hidden layer: parameters $W^{[1]}$, $b^{[1]}$; its output is the vector $a^{[1]}$.
Output layer: parameters $W^{[2]}$, $b^{[2]}$; its output is the scalar $a^{[2]} = \hat{y}$.

Q: At layer l, what is the formula for computing the linear and activation part of node j? Please use proper notations.
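Linear part: $z_j^{[l]} = w_j^{[l]T} a^{[l-1]} + b_j^{[l]}$
Activation part: $a_j^{[l]} = g^{[l]}(z_j^{[l]})$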

Consider the equation $z_3^{[2]} = w_3^{[2]T} a^{[1]} + b_3^{[2]}$

Q: Which layer number is the equation computing? What is the node number in the equation?
Layer 2, node 3

Q: Suppose L=3; is the equation computing an input, output or hidden layer?
Hidden layer

Q: What is the equation(s) for a one-hidden layer neural network for 1 training example?
$z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$
$a^{[1]} = \sigma(z^{[1]})$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma(z^{[2]})$
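
The four equations map directly onto NumPy. This is only a sketch: the layer sizes (3 inputs, 4 hidden units) and the random initial weights are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network for one example x = a[0]."""
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return a1, a2

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                                   # hypothetical layer sizes
x = rng.normal(size=(n_x, 1))
W1, b1 = rng.normal(size=(n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.normal(size=(1, n_h)) * 0.01, np.zeros((1, 1))

a1, a2 = forward(x, W1, b1, W2, b2)
print(a1.shape, a2.shape)                         # (4, 1) (1, 1)
```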

Q: What is the equation(s) for a one-hidden layer NN for m-training examples? Consider both the vectorized and non-vectorized version.
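Non-vectorized (loop over the examples), for i = 1 to m:
$z^{[1](i)} = W^{[1]} x^{(i)} + b^{[1]}$, $a^{[1](i)} = \sigma(z^{[1](i)})$
$z^{[2](i)} = W^{[2]} a^{[1](i)} + b^{[2]}$, $a^{[2](i)} = \sigma(z^{[2](i)})$
Vectorized (all m examples at once):
$Z^{[1]} = W^{[1]} X + b^{[1]}$, $A^{[1]} = \sigma(Z^{[1]})$
$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$, $A^{[2]} = \sigma(Z^{[2]})$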

Q: What do the matrices X, Z, A represent in deep learning? What do each row and column of these matrices refer to?
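X: all training examples stacked side by side as columns; each row is an input feature.
Z and A: the linear values and the activations of a layer for all training examples.

In Z and A, each row denotes a hidden unit (i.e. a node in the hidden layer), and each column denotes a training example.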

Q: Let X be a given 3 × 4 matrix.

How many training examples are there?
4 examples (one per column); the 3 rows are the input units

How many input features does the neural network have?
3

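Consider $Z^{[2]} = [10, 11, 12]$.
Q: How many training examples are there?
3 (one per column)

Q: Name the different activation functions mentioned in the lectures and video. Why do we choose these activation functions? When would you use each specific type? Discuss the pros and cons of each.
Hyperbolic tangent (tanh): the output is centered at zero, which matters for the next layer; but when |z| is large the derivative is small, so learning is slow.
Rectified Linear Unit (ReLU and Leaky ReLU): the derivative does not shrink for large z. Problem: for ReLU the derivative is 0 when z < 0; Leaky ReLU keeps a small non-zero slope there.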

Q: Why must activation functions be non-linear?
Otherwise the entire network is linear, no matter how many hidden layers we use: a composition of linear functions is still a linear function.
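
A small numerical check of this claim, with arbitrary (assumed) layer sizes: two layers without a non-linear activation give exactly the same output as a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))

# Two "layers" with the identity as activation
two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layers, one_layer))   # True
```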

Q: How should we initialize W and B in gradient descent?
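Initialize W with small random values (all zeros would make every hidden unit compute the same thing); b can be initialized to zeros.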

C1M4

Q: How many rows are in W^([1])? How many columns?
Since Z = WX, $W^{[1]}$ has shape (2, 3): 2 rows and 3 columns.

Q: In a deep (convolutional) neural network, what type of feature do the early layer detect? What type of features do the deeper layers detect? Give examples in image classification and speech recognition.
Earlier layers detect simple features; later layers detect more complicated features.
For image classification, e.g., edges first, then parts of objects, then whole objects.
For speech recognition, e.g.,
the 1st layer detects basic waveform features (high/low pitch, white noise),
the 2nd layer detects basic forms of sound (phonemes),
the 3rd layer detects words, etc.

C2M1

Q: What are the 4 branches of machine learning? Which branch does this class (and recent success) focuses on?
Supervised learning; unsupervised learning; self-supervised learning; reinforcement learning. This class focuses on supervised learning.

Q: If we have 100 to 10K labeled input data, how should we divide up the labeled data into training/ development/testing datasets? What if we have 100K labeled data? What about 1 M?
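100 to 10K: 70/30 (train/test) or 60/20/20 (train/dev/test). With more data the dev and test shares can shrink: 90/5/5 for 100K, 98/1/1 for 1M.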

Q: What is the relationship between bias, variance, underfitting and overfitting?
High bias->underfitting; high variance->overfitting

Q: If a network has high bias, how should we fix the problem? If a network has high variance, how should we fix the problem? (Assuming we use human judgement as ground truth.)
High bias -> bigger network; high variance -> more data, regularization, or a more suitable architecture (e.g. a CNN)

Q: Given the training set error and dev set error, how do we compute the amount of bias and variance? (Assume we use human judgement as ground truth.)
Training set error = bias
Dev set error = bias + variance

Our model has a 3% training set error when compared to human results. The dev set error is 10%. Assuming the human results are ground truth.
(a) How much bias is in our model? 3%
(b) How much variance is in our model? 7%
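(c) Is our model overfitting or underfitting?
Both to some degree: it underfits in some places and overfits in others. The variance (7%) is larger than the bias (3%), so overfitting is the bigger problem.
(d) List two ways we can improve our model.
More training data; regularization.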

Q: In weights regularization, what term do we add to the cost function?
Add a regularization term to the cost function. For L2 regularization: $\frac{\lambda}{2m} \sum_{l} \|W^{[l]}\|_F^2$

Q: In dropout regularization, what do we do with each training example? How do we define the dropout rate?
For each training example at each layer, randomly drop out (set to zero) a fraction of the output features.
Dropout rate = fraction of output features set to 0
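
A minimal sketch of dropout on one layer's activations. Dividing by the keep probability ("inverted dropout") is an assumption added here so that the expected value of the activations stays the same; the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(a, rate, rng):
    """Set a fraction `rate` of the activations to zero (inverted dropout)."""
    keep_prob = 1.0 - rate
    mask = rng.random(a.shape) < keep_prob    # True where the unit is kept
    return a * mask / keep_prob               # rescale the surviving units

rng = np.random.default_rng(0)
a = np.ones((4, 5))
dropped = dropout(a, rate=0.5, rng=rng)       # entries are either 0 or 2
unchanged = dropout(a, rate=0.0, rng=rng)     # rate 0 keeps everything
print(dropped)
```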

Q: What are some of the ways to augment image data to reduce overfitting? How do we augment text data?
flipping, rotating, randomly cropping the image
Text: back translation; cropping the text

Q: How do we normalize the inputs? How does that involve the mean and standard deviation?
x := x − μ (subtract the mean of each feature)
x := x / σ (divide by the standard deviation of each feature)
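
The two steps as a NumPy sketch, assuming X stores one feature per row and one example per column (the toy values are made up). The same μ and σ computed on the training set should then be reused for the dev and test sets.

```python
import numpy as np

# Hypothetical data: 2 features (rows), 4 training examples (columns)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])

mu = X.mean(axis=1, keepdims=True)        # per-feature mean
sigma = X.std(axis=1, keepdims=True)      # per-feature standard deviation
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=1), X_norm.std(axis=1))   # approximately 0 and 1 per feature
```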

Q: When do vanishing or exploding gradients occur?
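In very deep networks, where the gradient is a product of many layer-wise terms and can shrink or grow exponentially with depth.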

C2M2

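Q: What is the difference between batch gradient descent, mini-batch gradient descent and stochastic gradient descent? At each epoch, which one is most accurate? Fastest?
Batch: each step uses all training examples; the trajectory is smooth and each step is the most accurate.
Mini-batch: each step uses a small subset; the trajectory oscillates somewhat (a bit zigzag).
Stochastic: each step uses a single example; the trajectory is very zigzag, but each step is the cheapest to compute.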

Q: Consider the 2-days, 10-days and 50-days moving average on some fluctuating data. Which average smooths out the data the most? Which average follows (the changes in) the data most closely?
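The 50-day average smooths the data the most (too smooth); the 2-day average follows the data most closely (too many oscillations); the 10-day average is the best compromise.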

Q: What are the basic properties of gradients descent with momentum? State its relationship to regular gradient descent, moving average, and damp out oscillations.
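Usually faster than regular gradient descent. Compute the exponentially weighted (moving) average of the gradients, and then use that average instead of the raw gradient to update the weights; the averaging damps out oscillations.
$v_{dW} = \beta v_{dW} + (1 - \beta)\, dW$, $W := W - \alpha v_{dW}$ (and likewise for b)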

Q: What are some of the basic properties of RMSprop?
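RMSprop keeps an exponentially weighted average of the squared gradients, $S_{dW}$ and $S_{db}$, and divides each update by its root mean square, $\sqrt{S_{dW}}$ or $\sqrt{S_{db}}$. In the lecture example this gives a small update for b (the oscillating direction) and a large update for W.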

Q: What are the basic properties of ADAM optimization? State its relationship GD with momentum and RMSprop.
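ADAM combines the two: it keeps the momentum average of the gradients and the RMSprop average of the squared gradients, applies bias correction to both, and updates with the first divided by the square root of the second.
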
Q: Know the formulas and illustrations for gradient descent with momentum, RMSprop and ADAM optimization. You don’t have to memorize the formula; just recognize what the formula is for when you see it.
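
For recognition purposes, the three update rules can be sketched side by side. The hyperparameter defaults (α = 0.01, β = 0.9, β2 = 0.999, ε = 1e-8) are commonly quoted values used here as assumptions, and the function names are hypothetical.

```python
import numpy as np

def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    v = beta * v + (1 - beta) * dw                 # moving average of gradients
    return w - alpha * v, v

def rmsprop_step(w, dw, s, alpha=0.01, beta2=0.999, eps=1e-8):
    s = beta2 * s + (1 - beta2) * dw ** 2          # moving average of squared gradients
    return w - alpha * dw / (np.sqrt(s) + eps), s

def adam_step(w, dw, v, s, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw               # momentum part
    s = beta2 * s + (1 - beta2) * dw ** 2          # RMSprop part
    v_hat = v / (1 - beta1 ** t)                   # bias correction (t starts at 1)
    s_hat = s / (1 - beta2 ** t)
    return w - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s

# One step of each, starting from w = 1 with gradient dw = 2
w_m, _ = momentum_step(1.0, 2.0, v=0.0)
w_r, _ = rmsprop_step(1.0, 2.0, s=0.0)
w_a, _, _ = adam_step(1.0, 2.0, v=0.0, s=0.0, t=1)
print(w_m, w_r, w_a)
```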

Q: What is the purpose of learning rate decay? When should we take large or small steps sizes?
Take large steps at the beginning, when we are far from the minimum, and smaller steps as we approach the minimum.

Q: In deep learning, do local optima tend to be maxima, minima, or saddle points? Why?
saddle points
• The number of dimensions is so large that a zero-gradient point is very unlikely to curve the same way in every direction

Q: What is the problem in computing near a saddle point? Why do we need algorithms faster than gradient descent?
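Plateaus: near a saddle point the gradient is close to zero, so it takes many steps to get away from this false minimum.
Most often we have saddle points rather than minimum or maximum points.
Faster algorithms, and the noise they generate, help us get out of a saddle point.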

C2M3

Q: Describe some good strategies in choosing the values for tuning hyperparameters.
Evaluate on the dev set. Choose values at random rather than on a grid, then zoom in on the region with good values. Sample on an appropriate scale: a linear scale is fine if the range does not vary too much, otherwise use a log scale.

Q: When do we need to retune hyperparameters?
When the good values fall within a certain range, so that we can zoom in to find the best region.
When a more appropriate sampling scale is used.

Q: Explain the “panda” versus the “caviar” approach in tuning hyperparameters. In which situation would you use panda? In caviar?
Panda: train one model, plot the error each day and adjust the hyperparameters as you go. Use it when there is lots of data but limited computation power.
Caviar: train many models in parallel. Use it when there is enough computation power.

Q: Briefly explain the main idea in batch normalization. How is batch norm similar to normalizing input (C2M1L09)? How are they different?
The 1st step of batch normalization is the same as normalizing the inputs (calculate the mean and standard deviation, subtract the mean and divide by the standard deviation), but applied to the z values of a hidden layer over a mini-batch. It adds a new 2nd step: the normalized value is rescaled linearly with two learnable parameters, $\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$. The 3rd step computes the activation from $\tilde{z}^{(i)}$, as usual in a neural network.

Q: How does batch norm improve the calculations? Under what circumstance would your use batch norm?
1: Normalizing in the deep layers also speeds up learning. 2: It helps when data with a different distribution is added, since batch norm makes later layers less sensitive to changes in earlier weights. 3: It acts as a regularizer: normalizing over mini-batches has a slight regularization effect.

Q: Briefly explain softmax. What is it use for?
It is a kind of activation function used to classify inputs into several possible classes. It is usually the output layer of a neural network and gives one probability per class.

Q: How do we calculate softmax?
$t_i = e^{z_i}$, $a_i = \frac{t_i}{\sum_{j} t_j}$, so that the outputs are positive and sum to 1.
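
A minimal NumPy sketch of the softmax calculation. Subtracting the maximum before exponentiating is an assumption added for numerical stability (it does not change the result), and the input values are made up.

```python
import numpy as np

def softmax(z):
    """Softmax over the rows of a column vector (or of each column of a matrix)."""
    z = z - np.max(z, axis=0, keepdims=True)   # stability shift, result unchanged
    t = np.exp(z)
    return t / np.sum(t, axis=0, keepdims=True)

z = np.array([[5.0], [2.0], [-1.0], [3.0]])    # hypothetical output-layer values
a = softmax(z)
print(a.ravel(), a.sum())                      # class probabilities that sum to 1
```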

Q: Name some of the deep learning frameworks presented in this class. Which two are used the most today?
PyTorch, Keras, Tensorflow

C3M1

Q: What is the meaning of perfect precision? What is the meaning of perfect recall?
Precision = TP/(TP+FP); perfect precision (= 1) means there are no false positives.
Recall = TP/(TP+FN); perfect recall (= 1) means there are no false negatives.

Q: What metric combines both precision and recall?
The F1 score = 2/((1/p)+(1/r)), the harmonic mean of precision p and recall r
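
A short sketch of the three metrics. The counts (8 true positives, 2 false positives, 8 false negatives) are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / precision + 1 / recall)      # harmonic mean of the two
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)                                # 0.8, 0.5 and about 0.615
```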

Q: What is the meaning for “dev set is like setting the target”? What is the consequence of this statement?
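The dev set is for validation: we try different methods and ideas and measure them against this set, so it defines the target we aim at. The consequence is that the dev set must reflect the data we actually care about.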

Q: How should we divide our labeled data into training/dev/testing if we have (a) 1K labeled data, (b) 100K labeled data, (c) 1 million labeled data?
(a) 60/20/20; (b) 90/5/5; (c) 98/1/1

Q: What is Bayes optimal error?
The lowest error that any model can possibly achieve. It is the lower limit of the error and cannot be avoided.

Q: How do we compute avoidable bias? How do we compute the variance?
Avoidable bias = training error – bayes error
Variance = dev error – train error
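
The two subtractions as a tiny sketch. The numbers reuse the earlier 3% training error and 10% dev error, with a Bayes error of 0% assumed because human results were treated as ground truth; the function name `diagnose` is hypothetical.

```python
def diagnose(bayes_error, train_error, dev_error):
    avoidable_bias = train_error - bayes_error
    variance = dev_error - train_error
    return avoidable_bias, variance

bias, variance = diagnose(bayes_error=0.0, train_error=0.03, dev_error=0.10)
print(bias, variance)          # about 0.03 and 0.07
```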

C3M2

Q: What is the main idea in error analysis? When would you use it? Draw a table to illustrate.
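Manually look at a sample of misclassified dev set examples and tally, in a table, which category each error falls into (one row per example, one column per error category, with the share of each category at the bottom). Use it to decide which kind of error is most worth working on.

Q: Suppose we have 2 different sets of labeled data, with each set from a different distribution. One large set was downloaded from the Internet, and a smaller set was specifically made for the app we want to build. How should we divide the data for training, dev and testing?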
Use only the app data for dev and testing.
Training set: data from both sets (all of the Internet data plus part of the app data)
Dev set: app data only
Test set: app data only
Advantage: the target (dev set) is set where we actually want to perform well
Disadvantage: the training set distribution is different from the dev/test set distribution

C4M1

Q: How do we calculate convolutions? How do we compute output sizes and the number of parameters in each layer?
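Each output cell = the sum of the element-wise products between the filter and the image patch it covers.
That is, multiply the values at corresponding positions and add them up to get the result.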
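
A minimal sketch of the calculation for a square, single-channel image. It relies on the standard formulas output size = (n + 2p − f)/s + 1 and parameters per layer = (f · f · n_c_prev + 1) · n_filters, which are stated here as background rather than taken from the answer above. The vertical-edge image and the function names are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Cross-correlation of a square image with a square kernel."""
    image = np.pad(image, pad)                      # zero padding on all sides
    f = kernel.shape[0]
    n_out = (image.shape[0] - f) // stride + 1      # (n + 2p - f)/s + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)      # multiply element-wise, then add
    return out

def conv_params(f, n_c_prev, n_filters):
    """Weights plus one bias per filter."""
    return (f * f * n_c_prev + 1) * n_filters

image = np.zeros((6, 6))
image[:, :3] = 10                                   # bright left half, dark right half
kernel = np.array([[1, 0, -1]] * 3)                 # vertical edge detector
out = conv2d(image, kernel)
print(out.shape)                                    # (4, 4)
print(out[0])                                       # [ 0. 30. 30.  0.]
print(conv_params(f=3, n_c_prev=3, n_filters=10))   # 280
```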

Q: Why do CONV layers have so much fewer parameters than densely connected layers?
Max pool layers don’t have any parameters
CONV layers have relatively few parameters
FC layers have lots of parameters
Activation size gradually decreases as we go deeper into the layers

Parameter sharing – a filter detecting a specific feature (e.g. vertical edges) can be applied anywhere in the image
Sparsity of connections
Translational invariance

C4M2

Q: What are some of the networks presented in the lecture? Briefly explain the main idea in each network.
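LeNet: the first successful network.
AlexNet: good results, but needs a lot of data.
VGG: simple design; each stage halves the image size and doubles the number of channels.
ResNet: solves the vanishing gradient problem in deep networks.
Inception network: uses all filter sizes in parallel.

Q: Briefly describe how a residual block works in ResNet. You can use a diagram or equations.
$a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$: the activation $a^{[l]}$ skips two layers through a shortcut connection and is added back before the final activation.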

Q: What is a 1x1 convolution? When would you use it?
A filter of size 1×1, which at each position combines the values across all channels.
If an image is too big, we can use MAXPOOL to reduce the image size.
As we go deeper in a network, the image size ↓ and the number of channels ↑.
If our image has too many channels, we can reduce the number of channels by using a 1x1 convolution.

Q: Describe the main ideas in transfer learning. How should you apply that in deep learning when you have (a) only a little of your own data, (b) a moderate amount of your own data, (c) lots and lots of your own data?
Reuse and fine-tune pre-trained code and weights instead of training from scratch.
(a) Freeze all pre-trained parameters, replace the softmax layer with your own and train it on your data.
(b) Freeze the parameters in the early layers; train the later layers and the softmax using your own data.
(c) Retrain all parameters with your own data, using the pre-trained values as starting values.

Q: What are some of the common data augmentation methods?
Mirroring, random cropping, rotation, shearing, local warping, color shifting.

C4M3

Q: What is the difference between image classification, (classification with) localization, and object detection?
Classification: identify the one object in the image. Localization: identify the object and its location. Object detection: detect multiple objects with their corresponding locations.
In an image, suppose we want to detect pedestrian, car, bicycle and background.

Q: What are the components in the training label vector?
[pc, bx, by, bh, bw, c1, c2, c3]: pc is whether there is an object in the image; bx, by, bh, bw describe the bounding box (center coordinates, height and width); c1, c2, c3 specify which kind of object is in the image.

Q: What is the loss function for this object detection?
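If an object is present ($y_1 = 1$), sum the squared errors over all components: $(\hat{y}_1 - y_1)^2 + \dots + (\hat{y}_8 - y_8)^2$. If no object is present ($y_1 = 0$), use only $(\hat{y}_1 - y_1)^2$.

Q: What are landmark points? How do we use them in deep learning?
Important points in an image. They are used in AR filters, key locations, and body pose estimation.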

Q: What is a sliding window used for? How does it work? What is its drawback?
Slide a window across the whole image and feed each windowed part of the image into a ConvNet to check for an object. Disadvantage: high computational cost.

Q: What does YOLO stand for in deep learning? What does it do? What are its main idea?
You Only Look Once. Divide the image into a grid, with a training label for each grid cell. Assign each object to the grid cell that contains the object's midpoint. The image is run through the CNN only once.

Q: What does IoU stand for? What does it use for? Describe its main idea.
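Intersection over union. It measures whether a predicted location (bounding box) is correct.
IoU = (area of intersection) / (area of union) of the predicted box and the true box
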
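A small sketch of the IoU calculation. Representing a box by its corner coordinates (x1, y1, x2, y2) is an assumed convention, and the two example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)       # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))           # intersection 1, union 7
print(overlap)
```
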
Q: What is object segmentation? What is class segmentation?
Given an image, identify which object each part of the image belongs to.
Object segmentation: identify each object separately.
Class segmentation: identify the classes in the image.

C4M4

Q: What is face recognition? Briefly describe the issues involved.
A database holds a list of names/IDs. The input is an image (without a name/ID) and the output is the matching name/ID, or “not recognized”.

Q: In one-shot learning, how do we determine if two facial images match?
d(img1, img2) = degree of difference between the two images. The images match if d is below a chosen threshold.

Q: Briefly describe Siamese network in facial recognition.
Feed each image through the same ConvNet, without the softmax layer. The last feature vector is used as the encoding of the image, and two images are compared through their encodings.

Q: What is the meaning of triplet loss? What is the meaning of A, P and N.
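The loss function is $L(A, P, N) = \max(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha,\ 0)$, where f is the encoding and α is a margin.
It is computed on three images used for training:

Anchor (A): a reference image. Positive (P): another image of the same person. Negative (N): an image of a different person.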
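
A minimal sketch of the triplet loss on already-computed encodings. The 2-dimensional encodings and the margin α = 0.2 are assumptions chosen for illustration.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)."""
    d_pos = np.sum((f_a - f_p) ** 2)       # anchor to positive distance
    d_neg = np.sum((f_a - f_n) ** 2)       # anchor to negative distance
    return max(d_pos - d_neg + alpha, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 0.1])            # close to the anchor
negative = np.array([1.0, 0.0])            # far from the anchor

easy = triplet_loss(anchor, positive, negative)   # 0: the triplet is already satisfied
hard = triplet_loss(anchor, negative, positive)   # positive, roles swapped
print(easy, hard)
```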

Q: Briefly describe neural style transfer.
Given content image and style image, generate a new image with corresponding content and style.

Q: In neural style transfer (NST), what does the content involve? In which layer(s) is the content located?
Content = higher-level macrostructure (overall shape) of the C image
Content is determined by the deep layers

Q: In NST, what does the style involve? In which layer(s) is style located?
Style is determined by how correlated the activations are across different channels in a layer.
Style = textures, colors, visual patterns of the S image

C5M1

Q: What is named entity recognition? What are the input and output?
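Named entity recognition finds the words in a sentence that are named entities (e.g. names of people or places).
Input: the words of a sentence.
Output: one label per word saying whether it is part of an entity, e.g. 11001010001.
Word embeddings help here: similar words tend to be similar kinds of entities.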

Q: Briefly describe RNN. What does it stand for? What are the input and output? You can draw a diagram.
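Recurrent Neural Network. It reads a sequence one step at a time: at step t it takes the input $x^{<t>}$ and the previous activation $a^{<t-1>}$, and produces a new activation $a^{<t>}$ and an output $\hat{y}^{<t>}$.
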
Q: What does a language model do?
predict the next word in a sentence or fill in the blank.

Q: On what basis does a language model predict the missing word?
Probability of different words.

Q: What is the drawback or weakness of (plain) RNN?
Vanishing gradient problem.

Q: What is GRU? What does it stand for? What does it do?
Gated Recurrent Unit. It uses gates to decide when to update its memory, which addresses the vanishing gradient problem.

Q: What is LSTM? What does it stand for? What does it do?
Long Short-Term Memory.
Has update, forget, and output gates
Has separate $c^{<t>}$ and $a^{<t>}$
Is more complex than GRU
Has been around longer (1997) than GRU (2014)

C5M2

Q: What is one-hot vector? What are the elements in a one-hot vector?
A one-hot vector $o_i$ has 0 in all its components except for a 1 in the i-th position, where i is the word's index in the vocabulary V.

Q: What is the drawback of one-hot vector?
1-hot vectors have no relationship with each other, even if the words are related linguistically
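
A short sketch of both points. The five-word vocabulary and the random 3-dimensional embedding matrix E are stand-ins invented for illustration; real embeddings are learned from text.

```python
import numpy as np

vocab = ["a", "king", "man", "queen", "woman"]     # hypothetical toy vocabulary

def one_hot(word):
    o = np.zeros(len(vocab))
    o[vocab.index(word)] = 1.0                     # 1 at the word's index, 0 elsewhere
    return o

# Any two different one-hot vectors are orthogonal, so they carry no similarity.
dot = np.dot(one_hot("king"), one_hot("queen"))
print(dot)                                         # 0.0

# Multiplying an embedding matrix by a one-hot vector selects that word's column.
rng = np.random.default_rng(0)
E = rng.normal(size=(3, len(vocab)))               # random stand-in for learned embeddings
e_king = E @ one_hot("king")
print(np.allclose(e_king, E[:, vocab.index("king")]))   # True
```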

Q: Briefly describe word embedding. What does the word embedding matrix show? What do the vectors show?
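Each word is represented by a vector of feature values. The embedding matrix holds one such vector per word in the vocabulary. The vectors show the relationships between words: related words have similar vectors.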

Q: What advantage does word embedding offer?
Dense (not many zeros, elements have different values)
Lower-dimensional (say 300)
Learned from data (features are learned from reading corpus of text from Internet)

Q: Compare one hot vector with word embedding vector.
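Compared with one-hot vectors, word embedding vectors use way less memory and are way more efficient in computing.

Q: Give an example of using transfer learning in word embedding, e.g. in named entity recognition.

Q: How do we use word embedding in analogy?
Man is to Woman as King is to ______ (Queen): look for the word whose embedding is closest to e_king − e_man + e_woman.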

Q: Briefly explain the vector space projections from Gensim. What do the plots show?
Relations. Similarity.

Q: How do we use the cosine similarity function? What is it used for?
$\text{sim}(u, v) = \frac{u^T v}{\|u\| \, \|v\|}$. It is used to find the similarity between two words from their embedding vectors.
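
A minimal sketch of cosine similarity; the three test vectors are made up and stand in for word embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(u, 2 * u)                       # 1: same direction
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # 0
opposite = cosine_similarity(u, -u)                                # -1
print(same_direction, orthogonal, opposite)
```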

Q: In sentiment classification, what is the input and output?
Input a sentence, output sentiment level

C5M3

Q: In machine translation (MT) or neural machine translation (NMT), what is the input? What is the output?
Input: a sentence in one language. Output: the same sentence translated into another language.

Q: How are MT and language model (LM) related? You can draw a diagram.
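MT can be seen as a conditional language model: the decoder works like a language model, but it starts from the encoding of the input sentence instead of from zeros.
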
Q: Under what basis in MT do we choose a sentence in translation, say from French to English?
We want the English sentence that maximizes the conditional probability P(English sentence | French sentence)

Q: In MT, compare greedy search with beam search.
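Greedy search keeps only the single most probable word at each step. Beam search is better: it considers the top 3 (the beam width) highest-probability candidates. For the next word it considers the most likely pairs of words and keeps only the 3 pairs with the highest probability, and so on.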

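Q: What is the main idea of the attention model? What does the attention model consist of? What is the relation between context and attention weights? (Drawing a diagram may help.) (C5M3L07)
Based on the encoder-decoder model: instead of translating from one encoding of the whole sentence, we look at a few nearby words at a time and then translate. The context for each output word is the sum of the encoder activations weighted by the attention weights.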