Deeplearning.ai coursera notes

Neural Networks and Deep learning

Initialization: W != 0
Otherwise the network is symmetric: after each gradient step every neuron in a layer gets the same update, so all neurons end up computing the same function.
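A minimal numpy sketch of symmetry-breaking initialization; the helper name `init_params` and the `layer_dims` list are illustrative, not from the course:

```python
import numpy as np

def init_params(layer_dims, seed=1):
    """Random small W breaks symmetry; b can stay at zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W = 0 would make every neuron in a layer compute (and learn) the same thing
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```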

Nonlinear activation function
Otherwise the whole network collapses to a linear model, no matter how many layers it has.

Why a lot of layers
first layers: simple features
last layers: more complex features
operation example, XOR:
with multiple layers vs. only 2 layers,
the number of neurons needed is smaller and the network is easier to train

Hyperparameters:
W, b are learned from the data
alpha (learning rate) -> needs to be tried out
number of iterations
number of layers and units per layer
activation function
momentum
mini-batch size
regularization

Quote: A lot of complexity of your learning algorithm comes from the data rather than necessarily from your writing thousands and thousands of lines of code

reshape(num_samples, -1).T -> one column per sample
per sample:
[r][g][b] channels flattened into [r, g, b, …]
(209, 64, 64, 3) -> (64*64*3, 209) = (12288, 209)
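A small sketch of that reshape, with random data standing in for the 209 cat images:

```python
import numpy as np

X = np.random.rand(209, 64, 64, 3)        # 209 RGB images of 64x64 pixels
X_flat = X.reshape(X.shape[0], -1).T      # one column per sample
print(X_flat.shape)                       # (12288, 209), i.e. (64*64*3, 209)
```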

Split train/dev/test
100–10,000 examples: 70/20/10
~1,000,000 examples: 98/1/1

high bias = underfitting -> look at train set error relative to the optimal (Bayes) error -> bigger network, train longer, try a different model
high variance = overfitting -> look at dev set error -> more data, regularization

Hyperparameters

gradient descent
batch vs. mini-batch

  • one mini-batch should fit in CPU/GPU memory
  • mini-batch size should be a power of 2 (2^n)
  • mini-batch size should satisfy 1 < batch size < m (m = number of training examples)
  • if mini-batch size != m, the cost oscillates as it descends (a mini-batch split sketch follows this list)
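A minimal sketch of shuffling a training set and slicing it into mini-batches; the function name and the column-per-example shapes are assumptions for illustration:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X: (n_x, m), Y: (1, m). Shuffle the m examples, then slice into batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```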

exponentially weighted average

  • $V_t = \beta_1 V_{t-1} + (1-\beta_1)\,dW_t$
  • $\beta_1 = 0.9$ averages over roughly the last 10 values (since $0.9^{10} \approx 1/e$); $\beta_1 = 0.98$ over roughly 50
  • bias correction: $V_t' = V_t/(1-\beta_1^t)$ (often not bothered with in practice)

momentum: use the exponentially weighted average of the gradients, $W_{t+1} = W_t - \alpha V_t$, instead of a single gradient, since averaging cancels out the oscillating directions.

RMSprop (root mean square prop)

  • $S_t = \beta_2 S_{t-1} + (1-\beta_2)\,dW_t^2$ (element-wise square)
  • $W_{t+1} = W_t - \alpha\, dW_t/(\sqrt{S_t}+\epsilon)$ ($\epsilon = 10^{-8}$ to avoid division by zero)

Adam: combines momentum and RMSprop (a sketch of one update step follows this list)

  • apply bias correction to both: $V_t' = V_t/(1-\beta_1^t)$, $S_t' = S_t/(1-\beta_2^t)$
  • $W_{t+1} = W_t - \alpha V_t'/(\sqrt{S_t'}+\epsilon)$
  • $\alpha$ should be tuned; the other defaults are fine: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
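A minimal numpy sketch of one Adam update for a single parameter array, following the formulas above; the function name and the way state is passed around are illustrative:

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """V, S are the running first/second moments; t is the step count starting at 1."""
    V = beta1 * V + (1 - beta1) * dW          # momentum term
    S = beta2 * S + (1 - beta2) * dW ** 2     # RMSprop term (element-wise square)
    V_corr = V / (1 - beta1 ** t)             # bias correction
    S_corr = S / (1 - beta2 ** t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return W, V, S
```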

learning rate decay

  • use a smaller learning rate when approaching the minimum
  • $\alpha = \alpha_0/(1 + decayRate \cdot epochNumber)$
  • $\alpha = \alpha_0 \cdot 0.95^{epochNumber}$
  • $\alpha = k/\sqrt{epochNumber}$ or $k/\sqrt{t}$

local optima

  • plateaus / saddle points -> Adam may help

Tune hyperparameters (roughly in order of importance)

  1. learning rate
  2. momentum beta (0.9), mini-batch size, number of hidden units
  3. number of layers, learning rate decay
  4. Adam parameters (beta1, beta2, epsilon)

methods

  • sample at random instead of on a grid (you get more distinct values per hyperparameter, rather than e.g. only 5 per parameter)
  • coarse to fine: zoom in on the promising region and sample more densely at random

random sampling

  • the number of hidden units and the number of layers can be sampled uniformly at random
  • use an appropriate scale for other hyperparameters:
    • search on a log scale, i.e. sample uniformly on the log scale: $\alpha = 10^r$ with r uniform in [a, b] (e.g. [-4, 0]: r = -4 * np.random.rand())
  • beta: 0.9, …, 0.999, so 1-beta: 0.1, …, 0.001; $\beta = 1-10^r$ with r in [a, b] (e.g. [-3, -1]) (since 1/(1-beta) changes a lot when beta is close to 1, it is better to sample on a log scale rather than uniformly 0.9, 0.905, …)
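A tiny sketch of sampling alpha and beta on a log scale, matching the ranges above:

```python
import numpy as np

rng = np.random.default_rng(0)

r = -4 * rng.random()        # r uniform in (-4, 0]
alpha = 10 ** r              # learning rate in (1e-4, 1]

r = -3 + 2 * rng.random()    # r uniform in [-3, -1)
beta = 1 - 10 ** r           # beta in (0.9, 0.999]
```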

Training choices

  • babysitting one model (limited compute / huge data set) vs. training many models in parallel (enough compute / normal data set)
  • normalising inputs to speed up learning
  • batch normalisation (normalise z[2], z[3], …)
    • $z^{[i]}_{norm} = \frac{z^{[i]}-\mu}{\sqrt{\sigma^2+\epsilon}}$
    • $z^{[i]\prime} = \gamma z^{[i]}_{norm}+\beta$ (learnable parameters; normalised to some other mean and variance)
    • beta and gamma are learned with the usual learning-rate update (tf.nn.batch_normalization, or implement it yourself: beta = beta - alpha*d_beta)
    • z[i] = W[i]a[i-1] + b[i] is the value before applying the activation function
    • when using batch normalisation, b[i] is eliminated by the mean subtraction and its role is taken over by $\beta$
    • why it works: training w[l], b[l] keeps shifting the distribution of a[l] that later layers see ("covariate shift", e.g. black cats vs. coloured cats). Batch norm limits how much parameter updates in the earlier layers can shift the distribution of values a later layer sees, so those values become more stable and the later layers have firmer ground to stand on; layers learn more independently instead of constantly chasing changes in earlier layers.
    • similar to dropout, it adds some noise to each hidden layer's activations (each mini-batch is scaled by the mean/variance computed on just that mini-batch), which gives a slight, unintended regularisation effect; increasing the mini-batch size reduces this effect
    • batch norm at test time: there may not be many samples, so use the mean and variance estimated from the training mini-batches ($\mu$, $\sigma^2$ per mini-batch, combined by exponentially weighted averaging or a plain average) (a forward-pass sketch follows this list)
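A minimal sketch of the batch-norm forward pass at training time (per-unit statistics over the mini-batch); at test time you would plug in the running mean/variance instead. Names are illustrative:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, batch_size); gamma, beta: (n_units, 1), learnable."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance per unit
    return gamma * Z_norm + beta              # rescale/shift to a learned mean/variance
```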

multi-class classification: softmax regression

  • 4 output-layer units = 4 classes: (p(cat1|x), p(cat2|x), p(cat3|x), p(others|x))
  • $t = e^{z^{[l]}}$, $a^{[l]} = e^{z^{[l]}}/\sum_i t_i$ (element-wise; the softmax activation)
  • softmax regression generalises logistic regression to multiple classes
  • loss function: y = [0,1,0,0], y' = [0.3,0.2,0.1,0.4], $L(y',y) = -\sum_i y_i \log(y'_i)$
  • back prop: dz[L] = y' - y (a sketch follows this list)
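A small numpy sketch of the softmax activation, the cross-entropy loss, and the output-layer gradient for the 4-class example above:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())      # subtract max for numerical stability
    return t / t.sum()

y_hat = softmax(np.array([1.0, 0.5, -0.2, 1.3]))
y = np.array([0, 1, 0, 0])
loss = -np.sum(y * np.log(y_hat))   # L(y', y) = -sum_i y_i log(y'_i)
dz = y_hat - y                      # dz[L] = y' - y
```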

deep learning frameworks

  • choose a framework by: ease of programming, running speed, truly open (open source with good governance)
  • Caffe, CNTK, DL4J, Keras, Lasagne, MXNet, PaddlePaddle, TensorFlow, Theano, Torch

ML strategy

single optimizing metric
optimizing and satisficing metrics
train/dev/test set setup:

dev and test sets should come from the same distribution
change the metric according to user preference
consider how to get a better new metric by adjusting the cost function, adding data, or other means

compare to human-level performance
training error vs. human error = avoidable bias
training error vs. dev error = variance

compare avoidable bias vs. variance to decide what to focus on

human-level performance ≈ human error, used as an estimate of the ground-truth Bayes error

humans are better at natural perception problems (images, language, voice)

a machine can see far more data than a human

overfitting:
add data / regularization

Regularization

  1. L2 regularization: $J = \text{loss} + \frac{\lambda}{2m}\|W\|^2$ (L2 norm / L1 norm / Frobenius norm for matrices)
    how: L2 -> weight decay -> dW = (backprop term) + (lambd/m)*W -> W = W - alpha*dW
    why: large lambda -> small W -> activations stay in the middle, nearly linear region of the activation function -> the network behaves more linearly -> cannot fit a very high-degree function

  2. dropout
    how:
    drop units with probability 1 - keep_prob and scale a3 = a3/keep_prob (inverted dropout) -> E[a3] stays the same (see the sketch after this list)
    no dropout at test time
    layers with a large weight matrix: low keep_prob, e.g. 0.5
    layers with a small weight matrix: high keep_prob, can even be 1 (no dropout)
    the cost function is no longer well-defined, so plotting J while training is important

    why: effectively a smaller network is trained; a unit can't rely on any single feature, since any input could be dropped out -> it spreads out the weights

  3. data augmentation (mirror/updown…)

  4. early stopping: stop when dev error starts to rise relative to training error --> not orthogonalization (it merges optimizing J with preventing overfitting)
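A minimal sketch of inverted dropout on one layer's activations (the scaling by 1/keep_prob keeps the expected value unchanged); the function name is illustrative:

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, seed=0):
    """Training time only; at test time do not apply dropout at all."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    return (a * mask) / keep_prob, mask   # keep the mask for backprop
```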

Optimization J

  1. normalizing training sets
    how: x := (x - mean)/standard deviation
    why: J becomes more symmetric, so gradient descent can take bigger steps / use a bigger learning rate
  2. weight initialization
    why: problem: vanishing/exploding gradients make deep networks hard to optimize
    how: scale the random weights by np.sqrt(2/n[l-1]) -> ReLU (He initialization)
    np.sqrt(1/n[l-1]) -> tanh (Xavier initialization)
    np.sqrt(1/(n[l-1]+n[l])) is another variant
  3. gradient checking
    how: numerical approximation of the gradients
    theta = [flatten(W1), flatten(b1), …]
    dtheta = [flatten(dW1), flatten(db1), …]
    dtheta_approx_i = (J(theta_1, …, theta_i + epsilon, …) - J(theta_1, …, theta_i - epsilon, …)) / (2*epsilon)
    check ||dtheta - dtheta_approx||_2 / (||dtheta||_2 + ||dtheta_approx||_2) ~ epsilon? 10^-7 great, 10^-5 okay, 10^-3 worrying
    only do it when debugging, not during training
    do not forget the regularization term
    does not work with dropout
    (a sketch follows this list)
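A minimal sketch of gradient checking against a cost function J of the flattened parameter vector; the names and the callable interface are assumptions for illustration:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    diff = np.linalg.norm(dtheta - approx) / (np.linalg.norm(dtheta) + np.linalg.norm(approx))
    return diff   # ~1e-7 great, ~1e-5 borderline, ~1e-3 likely a bug
```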

Sequence model

A standard network does not work: inputs have different lengths and position information is lost.
A basic RNN also has a limitation: only information from earlier timesteps is passed on, nothing from later timesteps.
RNN
a<t> = g([W_aa | W_ax][a<t-1>, x<t>] + b_a) = g(W_a[a<t-1>, x<t>] + b_a), g = tanh/ReLU
y<t> = g(W_ya a<t> + b_y), g = sigmoid

  1. many to many: Tx = Ty
  2. many to one: Tx > Ty = 1 (only the last y is computed)
  3. one to many: only x<0> exists, and each output y is fed back in as the next input
  4. many to many with Tx != Ty: no y is computed in the first half (encoder), then all outputs are produced in the second half (decoder)

Language model
example:
predict a probability distribution over the next word
sample from the trained distribution to generate text
character-level vs. word-level (vocabulary): character-level sequences are much longer and take a long time to train, but have no unknown tokens.

Problem: vanishing gradients
basic RNNs are not very good at capturing very long-term dependencies due to vanishing gradients;
they mostly capture local dependencies -> solution: gated recurrent unit
Problem: exploding gradients:
gradients become NaN -> solution: gradient clipping (cap gradients at a maximum value)

Original:
a(t) = g(W_a[a(t-1), x(t)] + b_a), g = tanh
y(t) = softmax(a(t))

Gated recurrent unit (simplified):
c = memory cell
c(t) = a(t)

c~(t) = tanh(W_c[c(t-1), x(t)] + b_c)
Gate_u = sigmoid(W_u[c(t-1), x(t)] + b_u)
(always between 0 and 1;
Gate_u decides when to update the memory value)
c(t) = Gate_u * c~(t) + (1 - Gate_u) * c(t-1)   (element-wise *)
y(t) = softmax(c(t))

Gated recurrent unit (full):
c~(t) = tanh(W_c[Gate_r * c(t-1), x(t)] + b_c)
Gate_u = sigmoid(W_u[c(t-1), x(t)] + b_u)
Gate_r = sigmoid(W_r[c(t-1), x(t)] + b_r)
c(t) = Gate_u * c~(t) + (1 - Gate_u) * c(t-1)
The GRU is one of the versions researchers have converged on and found robust and useful for many different problems.

LSTM: a more general version of the GRU
c(t) != a(t)

c~(t) = tanh(W_c[a(t-1), x(t)] + b_c)
Gate_u = sigmoid(W_u[a(t-1), x(t)] + b_u)   (update)
Gate_f = sigmoid(W_f[a(t-1), x(t)] + b_f)   (forget)
Gate_o = sigmoid(W_o[a(t-1), x(t)] + b_o)   (output)

c(t) = Gate_u * c~(t) + Gate_f * c(t-1)
a(t) = Gate_o * tanh(c(t))
y(t) = softmax(a(t))

(peephole connection:
[a(t-1), x(t)] -> [c(t-1), a(t-1), x(t)])

LSTM is more powerful; the GRU is simpler. (A one-timestep sketch follows.)
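A minimal numpy sketch of one LSTM timestep following the equations above; the parameter dictionary and its key names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    concat = np.concatenate([a_prev, x_t], axis=0)        # [a(t-1), x(t)]
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])
    g_u = sigmoid(p["W_u"] @ concat + p["b_u"])           # update gate
    g_f = sigmoid(p["W_f"] @ concat + p["b_f"])           # forget gate
    g_o = sigmoid(p["W_o"] @ concat + p["b_o"])           # output gate
    c_t = g_u * c_tilde + g_f * c_prev
    a_t = g_o * np.tanh(c_t)
    return a_t, c_t
```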

Bidirectional RNN (needs the entire sequence)
a backward pass is added; the forward and backward activations form an acyclic graph
->a and <-a
y(t) = g(W_y[->a(t), <-a(t)] + b_y)

deep RNN
a(t)[1] a(t)[2] a(t)[3]… a(t)[layer]

3 layers is already deep for an RNN

Word embedding

  • learn from a large text corpus (1–100B words), or download a pre-trained embedding online
  • transfer the embedding to a new task with a smaller training set (~100k words)

Analogies (similar words)

  • word embeddings have a continuity property: vector differences are similar, so solve argmax_a1 sim(e_a1, e_b1 - e_b2 + e_a2)
  • $sim(u, v) = u^Tv/(\|u\|_2\|v\|_2)$ (cosine similarity)

Learning word embedding

  • E = embedding matrix, 300 × 10k (embedding dimension × number of words). embedding matrix * one-hot = embedding of the word: (300, 10k) * (10k, 1), OR simply embedding_matrix[:, k] (see the sketch after this list)
  • neural language model: predict the next word given the previous words; or predict the middle word given the surrounding words; or use only the last word.
  • set up the model: one-hot, then E, then the embedded words, an NN layer, then a softmax over 10k output probabilities. E, the NN, and the softmax are all learnable parameters.
  • skip-gram: select a context word (random sampling, skipping overly common words) and randomly pick a target within ±10 words around it.
  • problem: the softmax is computationally expensive; one possible solution is a hierarchical softmax.
  • negative sampling: sample word pairs with a 0/1 target indicating whether they occur close together; for each positive pair pick k negative samples (target = 0, not close by). Small dataset: k = 20; larger dataset: k = 2-5.
  • GloVe: X_ij = number of times j appears in the context of i. Minimize $\sum_i\sum_j f(X_{ij})(\theta_i^Te_j + b_i + b_j' - \log X_{ij})^2$ (f(X_ij) handles X_ij = 0 and down-weights frequent words like 'a'). $\theta$ and $e$ are symmetric, so $e^{final} = (e+\theta)/2$.
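A tiny sketch showing that multiplying E by a one-hot vector is the same as a column lookup (the index 4257 is an arbitrary example):

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.rand(emb_dim, vocab_size)   # embedding matrix (300, 10k)

k = 4257                                  # index of some word in the vocabulary
one_hot = np.zeros((vocab_size, 1))
one_hot[k] = 1.0

e_matmul = E @ one_hot                    # (300, 1) via matrix multiply
e_lookup = E[:, [k]]                      # same vector, by direct column lookup
assert np.allclose(e_matmul, e_lookup)    # in practice, use the lookup
```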

Sentiment classification

  • use an RNN instead of averaging word embeddings for the sentence representation: a many-to-one RNN.

Debiasing

  • word embeddings can reflect gender, age, ethnicity, etc. bias, e.g.: Nurse–Mother, Father–Doctor.
  • Three steps:
    • Identify the bias direction: average(e_he - e_she, e_male - e_female, …).
    • For every word that is not definitional (grandmother, female, she are definitional; train a classifier to find the non-definitional words), project out the bias component.
    • Equalize pairs (e.g. make grandmother and grandfather equidistant from babysitter).

Sequence to Sequence

  • Greedy search: picking the best word at each step does not work well; it is not globally optimal.
  • use an approximate search instead
  • beam search algorithm (beam width 3): select the 3 most likely first words. Then select the second word conditioned on each chosen first word (3 copies), keep the top 3 first+second word pairs, and repeat for the rest, always keeping 3 copies of the network, one per hypothesis. (beam width 1 = greedy search)
  • length normalization for beam search: $\arg\max_y \sum_t \log P(y_t \mid x, y_1, y_2, \dots, y_{t-1}) / T_y^\alpha$. Taking logs avoids underflow from multiplying values close to 0, and dividing by $T_y$ avoids favouring short translations. Run for $T_y = 1,2,3,\dots,30$ and pick the best across all $T_y$.
  • a larger beam width is better but slower; going from 1000 to 3000 does not gain as much as going from 1 to 5
  • error analysis to find whether the problem is in the RNN or in beam search: compute p(y*) and p(y') with the RNN. If p(y*) > p(y'), beam search failed to find y*; if p(y*) <= p(y'), the RNN is at fault. Build a table over all (y*, y') pairs to see whether errors are mostly due to beam search or to the RNN.
  • BLEU (bilingual evaluation understudy) score measures a translation against human-generated references (multiple correct references are allowed) (see the sketch after this list)
    • precision = # output words that appear in a reference / # output words
    • modified precision = each word's count clipped to its max count in any single reference / # output words
    • p_n = sum(count_clip)/sum(count): count = how many n-grams from the machine translation appear in the references; count_clip = the max number of times the n-gram appears in any single reference. n can be 1, 2, …
    • combined BLEU score: $BP \cdot \exp(\frac{1}{N}\sum_{n=1}^N p_n)$. BP is a brevity penalty to avoid short translations: BP = 1 if y'_length > y*_length, otherwise BP = exp(1 - y*_length/y'_length).
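A minimal sketch of the clipped (modified) n-gram precision p_n described above; the full BLEU score would combine several p_n values with the brevity penalty:

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Candidate n-gram counts are clipped by their max count in any single reference."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

p1 = modified_precision("the the the cat".split(),
                        ["the cat is on the mat".split()], n=1)
print(p1)   # 0.75: "the" is clipped to 2, its max count in the reference
```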

Attention model

  • long sequences are hard to translate; the attention model addresses this by handling the output part by part.
    activation: $a^{<t'>} = (a_{forward}^{<t'>}, a_{backward}^{<t'>})$
    attention weight: $\alpha^{<t,t'>}$, with t' the input timestep and t the output timestep; $\sum_{t'}\alpha^{<t,t'>}=1$
    input to the attention layer: $c^{<t>} = \sum_{t'} \alpha^{<t,t'>}a^{<t'>}$
    output state of the attention layer: $s^{<t>}$
  • computing attention: $\alpha^{<t,t'>} = \exp(e^{<t,t'>})/\sum_{t'=1}^{T_x} \exp(e^{<t,t'>})$, where $e^{<t,t'>}$ is the output of a small NN with inputs $a^{<t'>}$ and $s^{<t-1>}$ (a sketch follows this list)
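A small sketch of turning the raw scores e<t,t'> into attention weights and a context vector for one output step t; the shapes are assumptions for illustration:

```python
import numpy as np

def attention_context(e_t, a):
    """e_t: (T_x,) raw scores for one output step; a: (T_x, n_a) encoder activations."""
    alpha = np.exp(e_t - e_t.max())
    alpha = alpha / alpha.sum()               # sum over t' of alpha<t,t'> is 1
    c_t = (alpha[:, None] * a).sum(axis=0)    # c<t> = sum_t' alpha<t,t'> a<t'>
    return c_t, alpha
```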

Speech recognition
CTC cost (connectionist temporal classification): collapse repeated characters not separated by "blank", so that e.g. 1000 network outputs can end up as a sentence of reasonable length. (A small collapse sketch follows.)
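A tiny sketch of the CTC collapse rule (repeated characters not separated by the blank token are merged, then blanks are dropped); the example string is illustrative:

```python
def ctc_collapse(tokens, blank="_"):
    out, prev = [], None
    for tok in tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse("ttt_h_eee___ qqq__"))   # "the q"
```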
