Deeplearning.ai coursera notes

Neural Networks and Deep learning

Initialization: W != 0
Otherwise the network is symmetric: after each gradient step every neuron in a layer gets the same update, so all neurons end up computing the same function.
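A minimal numpy sketch of symmetry-breaking initialization; the helper name `init_params` and the `layer_dims` list are illustrative, not from the course:

```python
import numpy as np

def init_params(layer_dims, seed=1):
    """Random small W breaks symmetry; b can stay at zero."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W = 0 would make every neuron in a layer compute (and learn) the same thing
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```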

Nonlinear activation function
Otherwise the whole network collapses to a linear model, no matter how many layers it has.

Why a lot of layers
first layers: simple features
last layers: more complex features
operation example, XOR:
with multiple layers vs. only 2 layers,
the number of neurons needed is smaller and the network is easier to train

Hyperparameters:
W, b are learned from the data
alpha (learning rate) -> needs to be tried out
number of iterations
number of layers and units per layer
activation function
momentum
mini-batch size
regularization

Quote: A lot of complexity of your learning algorithm comes from the data rather than necessarily from your writing thousands and thousands of lines of code

reshape(num_samples, -1).T -> one column per sample
per sample:
[r][g][b] channels flattened into [r, g, b, …]
(209, 64, 64, 3) -> (64*64*3, 209) = (12288, 209)
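A small sketch of that reshape, with random data standing in for the 209 cat images:

```python
import numpy as np

X = np.random.rand(209, 64, 64, 3)        # 209 RGB images of 64x64 pixels
X_flat = X.reshape(X.shape[0], -1).T      # one column per sample
print(X_flat.shape)                       # (12288, 209), i.e. (64*64*3, 209)
```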

Split train/dev/test
100–10,000 examples: 70/20/10
~1,000,000 examples: 98/1/1

high bias = underfitting -> look at train set error relative to the optimal (Bayes) error -> bigger network, train longer, try a different model
high variance = overfitting -> look at dev set error -> more data, regularization

Hyperparameters

gradient descent
batch vs. mini-batch

  • one mini-batch should fit in CPU/GPU memory
  • mini-batch size should be a power of 2 (2^n)
  • mini-batch size should satisfy 1 < batch size < m (m = number of training examples)
  • if mini-batch size != m, the cost oscillates as it descends (a mini-batch split sketch follows this list)
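A minimal sketch of shuffling a training set and slicing it into mini-batches; the function name and the column-per-example shapes are assumptions for illustration:

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X: (n_x, m), Y: (1, m). Shuffle the m examples, then slice into batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```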

exponentially weighted average

  • $V_t = \beta_1 V_{t-1} + (1-\beta_1)\,dW_t$
  • $\beta_1 = 0.9$ averages over roughly the last 10 values (since $0.9^{10} \approx 1/e$); $\beta_1 = 0.98$ over roughly 50
  • bias correction: $V_t' = V_t/(1-\beta_1^t)$ (often not bothered with in practice)

momentum: use the exponentially weighted average of the gradients, $W_{t+1} = W_t - \alpha V_t$, instead of a single gradient, since averaging cancels out the oscillating directions.

RMSprop (root mean square prop)

  • $S_t = \beta_2 S_{t-1} + (1-\beta_2)\,dW_t^2$ (element-wise square)
  • $W_{t+1} = W_t - \alpha\, dW_t/(\sqrt{S_t}+\epsilon)$ ($\epsilon = 10^{-8}$ to avoid division by zero)

Adam: combines momentum and RMSprop (a sketch of one update step follows this list)

  • apply bias correction to both: $V_t' = V_t/(1-\beta_1^t)$, $S_t' = S_t/(1-\beta_2^t)$
  • $W_{t+1} = W_t - \alpha V_t'/(\sqrt{S_t'}+\epsilon)$
  • $\alpha$ should be tuned; the other defaults are fine: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
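A minimal numpy sketch of one Adam update for a single parameter array, following the formulas above; the function name and the way state is passed around are illustrative:

```python
import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """V, S are the running first/second moments; t is the step count starting at 1."""
    V = beta1 * V + (1 - beta1) * dW          # momentum term
    S = beta2 * S + (1 - beta2) * dW ** 2     # RMSprop term (element-wise square)
    V_corr = V / (1 - beta1 ** t)             # bias correction
    S_corr = S / (1 - beta2 ** t)
    W = W - alpha * V_corr / (np.sqrt(S_corr) + eps)
    return W, V, S
```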

learning rate decay

  • use a smaller learning rate when approaching the minimum
  • $\alpha = \alpha_0/(1 + decayRate \cdot epochNumber)$
  • $\alpha = \alpha_0 \cdot 0.95^{epochNumber}$
  • $\alpha = k/\sqrt{epochNumber}$ or $k/\sqrt{t}$

local optima

  • plateaus / saddle points -> Adam may help

Tune hyperparameters (roughly in order of importance)

  1. learning rate
  2. momentum beta (0.9), mini-batch size, number of hidden units
  3. number of layers, learning rate decay
  4. Adam parameters (beta1, beta2, epsilon)

methods

  • sample at random instead of on a grid (you get more distinct values per hyperparameter, rather than e.g. only 5 per parameter)
  • coarse to fine: zoom in on the promising region and sample more densely at random

random sampling

  • the number of hidden units and the number of layers can be sampled uniformly at random
  • use an appropriate scale for other hyperparameters:
    • search on a log scale, i.e. sample uniformly on the log scale: $\alpha = 10^r$ with r uniform in [a, b] (e.g. [-4, 0]: r = -4 * np.random.rand())
  • beta: 0.9, …, 0.999, so 1-beta: 0.1, …, 0.001; $\beta = 1-10^r$ with r in [a, b] (e.g. [-3, -1]) (since 1/(1-beta) changes a lot when beta is close to 1, it is better to sample on a log scale rather than uniformly 0.9, 0.905, …)
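A tiny sketch of sampling alpha and beta on a log scale, matching the ranges above:

```python
import numpy as np

rng = np.random.default_rng(0)

r = -4 * rng.random()        # r uniform in (-4, 0]
alpha = 10 ** r              # learning rate in (1e-4, 1]

r = -3 + 2 * rng.random()    # r uniform in [-3, -1)
beta = 1 - 10 ** r           # beta in (0.9, 0.999]
```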

Training choices

  • babysitting one model (limited compute / huge data set) vs. training many models in parallel (enough compute / normal data set)
  • normalising inputs to speed up learning
  • batch normalisation (normalise z[2], z[3], …)
    • $z^{[i]}_{norm} = \frac{z^{[i]}-\mu}{\sqrt{\sigma^2+\epsilon}}$
    • $z^{[i]\prime} = \gamma z^{[i]}_{norm}+\beta$ (learnable parameters; normalised to some other mean and variance)
    • beta and gamma are learned with the usual learning-rate update (tf.nn.batch_normalization, or implement it yourself: beta = beta - alpha*d_beta)
    • z[i] = W[i]a[i-1] + b[i] is the value before applying the activation function
    • when using batch normalisation, b[i] is eliminated by the mean subtraction and its role is taken over by $\beta$
    • why it works: training w[l], b[l] keeps shifting the distribution of a[l] that later layers see ("covariate shift", e.g. black cats vs. coloured cats). Batch norm limits how much parameter updates in the earlier layers can shift the distribution of values a later layer sees, so those values become more stable and the later layers have firmer ground to stand on; layers learn more independently instead of constantly chasing changes in earlier layers.
    • similar to dropout, it adds some noise to each hidden layer's activations (each mini-batch is scaled by the mean/variance computed on just that mini-batch), which gives a slight, unintended regularisation effect; increasing the mini-batch size reduces this effect
    • batch norm at test time: there may not be many samples, so use the mean and variance estimated from the training mini-batches ($\mu$, $\sigma^2$ per mini-batch, combined by exponentially weighted averaging or a plain average) (a forward-pass sketch follows this list)
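A minimal sketch of the batch-norm forward pass at training time (per-unit statistics over the mini-batch); at test time you would plug in the running mean/variance instead. Names are illustrative:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, batch_size); gamma, beta: (n_units, 1), learnable."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance per unit
    return gamma * Z_norm + beta              # rescale/shift to a learned mean/variance
```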

multi-class classification: softmax regression

  • 4 output-layer units = 4 classes: (p(cat1|x), p(cat2|x), p(cat3|x), p(others|x))
  • $t = e^{z^{[l]}}$, $a^{[l]} = e^{z^{[l]}}/\sum_i t_i$ (element-wise; the softmax activation)
  • softmax regression generalises logistic regression to multiple classes
  • loss function: y = [0,1,0,0], y' = [0.3,0.2,0.1,0.4], $L(y',y) = -\sum_i y_i \log(y'_i)$
  • back prop: dz[L] = y' - y (a sketch follows this list)
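A small numpy sketch of the softmax activation, the cross-entropy loss, and the output-layer gradient for the 4-class example above:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())      # subtract max for numerical stability
    return t / t.sum()

y_hat = softmax(np.array([1.0, 0.5, -0.2, 1.3]))
y = np.array([0, 1, 0, 0])
loss = -np.sum(y * np.log(y_hat))   # L(y', y) = -sum_i y_i log(y'_i)
dz = y_hat - y                      # dz[L] = y' - y
```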

deep learning frameworks

  • choose a framework by: ease of programming, running speed, truly open (open source with good governance)
  • Caffe, CNTK, DL4J, Keras, Lasagne, MXNet, PaddlePaddle, TensorFlow, Theano, Torch

ML strategy

single optimizing metric
optimizing and satisficing metrics
train/dev/test set setup:

dev and test sets should come from the same distribution
change the metric according to user preference
consider how to get a better new metric by adjusting the cost function, adding data, or other means

compare to human-level performance
training error vs. human error = avoidable bias
training error vs. dev error = variance

compare avoidable bias vs. variance to decide what to focus on

human-level performance ≈ human error, used as an estimate of the ground-truth Bayes error

humans are better at natural perception problems (images, language, voice)

a machine can see far more data than a human

overfitting:
add data / regularization

Regularization

  1. L2 regularization: $J = \text{loss} + \frac{\lambda}{2m}\|W\|^2$ (L2 norm / L1 norm / Frobenius norm for matrices)
    how: L2 -> weight decay -> dW = (backprop term) + (lambd/m)*W -> W = W - alpha*dW
    why: large lambda -> small W -> activations stay in the middle, nearly linear region of the activation function -> the network behaves more linearly -> cannot fit a very high-degree function

  2. dropout
    how:
    drop units with probability 1 - keep_prob and scale a3 = a3/keep_prob (inverted dropout) -> E[a3] stays the same (see the sketch after this list)
    no dropout at test time
    layers with a large weight matrix: low keep_prob, e.g. 0.5
    layers with a small weight matrix: high keep_prob, can even be 1 (no dropout)
    the cost function is no longer well-defined, so plotting J while training is important

    why: effectively a smaller network is trained; a unit can't rely on any single feature, since any input could be dropped out -> it spreads out the weights

  3. data augmentation (mirror/updown…)

  4. early stopping: stop when dev error starts to rise relative to training error --> not orthogonalization (it merges optimizing J with preventing overfitting)
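A minimal sketch of inverted dropout on one layer's activations (the scaling by 1/keep_prob keeps the expected value unchanged); the function name is illustrative:

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, seed=0):
    """Training time only; at test time do not apply dropout at all."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(a.shape) < keep_prob).astype(a.dtype)
    return (a * mask) / keep_prob, mask   # keep the mask for backprop
```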

Optimization J

  1. normalizing training sets
    how: x := (x - mean)/standard deviation
    why: J becomes more symmetric, so gradient descent can take bigger steps / use a bigger learning rate
  2. weight initialization
    why: problem: vanishing/exploding gradients make deep networks hard to optimize
    how: scale the random weights by np.sqrt(2/n[l-1]) -> ReLU (He initialization)
    np.sqrt(1/n[l-1]) -> tanh (Xavier initialization)
    np.sqrt(1/(n[l-1]+n[l])) is another variant
  3. gradient checking
    how: numerical approximation of the gradients
    theta = [flatten(W1), flatten(b1), …]
    dtheta = [flatten(dW1), flatten(db1), …]
    dtheta_approx_i = (J(theta_1, …, theta_i + epsilon, …) - J(theta_1, …, theta_i - epsilon, …)) / (2*epsilon)
    check ||dtheta - dtheta_approx||_2 / (||dtheta||_2 + ||dtheta_approx||_2) ~ epsilon? 10^-7 great, 10^-5 okay, 10^-3 worrying
    only do it when debugging, not during training
    do not forget the regularization term
    does not work with dropout
    (a sketch follows this list)
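A minimal sketch of gradient checking against a cost function J of the flattened parameter vector; the names and the callable interface are assumptions for illustration:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a two-sided numerical estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    diff = np.linalg.norm(dtheta - approx) / (np.linalg.norm(dtheta) + np.linalg.norm(approx))
    return diff   # ~1e-7 great, ~1e-5 borderline, ~1e-3 likely a bug
```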

Sequence model

A standard network does not work: inputs have different lengths and position information is lost.
A basic RNN also has a limitation: only information from earlier timesteps is passed on, nothing from later timesteps.
RNN
a<t> = g([W_aa | W_ax][a<t-1>, x<t>] + b_a) = g(W_a[a<t-1>, x<t>] + b_a), g = tanh/ReLU
y<t> = g(W_ya a<t> + b_y), g = sigmoid

  1. many to many: Tx = Ty
  2. many to one: Tx > Ty = 1 (only the last y is computed)
  3. one to many: only x<0> exists, and each output y is fed back in as the next input
  4. many to many with Tx != Ty: no y is computed in the first half (encoder), then all outputs are produced in the second half (decoder)

Language model
example:
predict a probability distribution over the next word
sample from the trained distribution to generate text
character-level vs. word-level (vocabulary): character-level sequences are much longer and take a long time to train, but have no unknown tokens.

Problem: vanishing gradients
basic RNNs are not very good at capturing very long-term dependencies due to vanishing gradients;
they mostly capture local dependencies -> solution: gated recurrent unit
Problem: exploding gradients:
gradients become NaN -> solution: gradient clipping (cap gradients at a maximum value)

Original:
a(t) = g(W_a[a(t-1), x(t)] + b_a), g = tanh
y(t) = softmax(a(t))

Gated recurrent unit (simplified):
c = memory cell
c(t) = a(t)

c~(t) = tanh(W_c[c(t-1), x(t)] + b_c)
Gate_u = sigmoid(W_u[c(t-1), x(t)] + b_u)
(always between 0 and 1;
Gate_u decides when to update the memory value)
c(t) = Gate_u * c~(t) + (1 - Gate_u) * c(t-1)   (element-wise *)
y(t) = softmax(c(t))

Gated recurrent unit (full):
c~(t) = tanh(W_c[Gate_r * c(t-1), x(t)] + b_c)
Gate_u = sigmoid(W_u[c(t-1), x(t)] + b_u)
Gate_r = sigmoid(W_r[c(t-1), x(t)] + b_r)
c(t) = Gate_u * c~(t) + (1 - Gate_u) * c(t-1)
The GRU is one of the versions researchers have converged on and found robust and useful for many different problems.

LSTM: a more general version of the GRU
c(t) != a(t)

c~(t) = tanh(W_c[a(t-1), x(t)] + b_c)
Gate_u = sigmoid(W_u[a(t-1), x(t)] + b_u)   (update)
Gate_f = sigmoid(W_f[a(t-1), x(t)] + b_f)   (forget)
Gate_o = sigmoid(W_o[a(t-1), x(t)] + b_o)   (output)

c(t) = Gate_u * c~(t) + Gate_f * c(t-1)
a(t) = Gate_o * tanh(c(t))
y(t) = softmax(a(t))

(peephole connection:
[a(t-1), x(t)] -> [c(t-1), a(t-1), x(t)])

LSTM is more powerful; the GRU is simpler. (A one-timestep sketch follows.)
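A minimal numpy sketch of one LSTM timestep following the equations above; the parameter dictionary and its key names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    concat = np.concatenate([a_prev, x_t], axis=0)        # [a(t-1), x(t)]
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])
    g_u = sigmoid(p["W_u"] @ concat + p["b_u"])           # update gate
    g_f = sigmoid(p["W_f"] @ concat + p["b_f"])           # forget gate
    g_o = sigmoid(p["W_o"] @ concat + p["b_o"])           # output gate
    c_t = g_u * c_tilde + g_f * c_prev
    a_t = g_o * np.tanh(c_t)
    return a_t, c_t
```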

Bidirectional RNN (needs the entire sequence)
a backward pass is added; the forward and backward activations form an acyclic graph
->a and <-a
y(t) = g(W_y[->a(t), <-a(t)] + b_y)

deep RNN
a(t)[1] a(t)[2] a(t)[3]… a(t)[layer]

3 layers is already deep for an RNN

Word embedding

  • learn from a large text corpus (1–100B words), or download a pre-trained embedding online
  • transfer the embedding to a new task with a smaller training set (~100k words)

Analogies (similar words)

  • word embeddings have a continuity property: vector differences are similar, so solve argmax_a1 sim(e_a1, e_b1 - e_b2 + e_a2)
  • $sim(u, v) = u^Tv/(\|u\|_2\|v\|_2)$ (cosine similarity)

Learning word embedding

  • E = embedding matrix, 300 × 10k (embedding dimension × number of words). embedding matrix * one-hot = embedding of the word: (300, 10k) * (10k, 1), OR simply embedding_matrix[:, k] (see the sketch after this list)
  • neural language model: predict the next word given the previous words; or predict the middle word given the surrounding words; or use only the last word.
  • set up the model: one-hot, then E, then the embedded words, an NN layer, then a softmax over 10k output probabilities. E, the NN, and the softmax are all learnable parameters.
  • skip-gram: select a context word (random sampling, skipping overly common words) and randomly pick a target within ±10 words around it.
  • problem: the softmax is computationally expensive; one possible solution is a hierarchical softmax.
  • negative sampling: sample word pairs with a 0/1 target indicating whether they occur close together; for each positive pair pick k negative samples (target = 0, not close by). Small dataset: k = 20; larger dataset: k = 2-5.
  • GloVe: X_ij = number of times j appears in the context of i. Minimize $\sum_i\sum_j f(X_{ij})(\theta_i^Te_j + b_i + b_j' - \log X_{ij})^2$ (f(X_ij) handles X_ij = 0 and down-weights frequent words like 'a'). $\theta$ and $e$ are symmetric, so $e^{final} = (e+\theta)/2$.
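A tiny sketch showing that multiplying E by a one-hot vector is the same as a column lookup (the index 4257 is an arbitrary example):

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.rand(emb_dim, vocab_size)   # embedding matrix (300, 10k)

k = 4257                                  # index of some word in the vocabulary
one_hot = np.zeros((vocab_size, 1))
one_hot[k] = 1.0

e_matmul = E @ one_hot                    # (300, 1) via matrix multiply
e_lookup = E[:, [k]]                      # same vector, by direct column lookup
assert np.allclose(e_matmul, e_lookup)    # in practice, use the lookup
```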

Sentiment classification

  • use an RNN instead of averaging word embeddings for the sentence representation: a many-to-one RNN.

Debiasing

  • word embeddings can reflect gender, age, ethnicity, etc. bias, e.g.: Nurse–Mother, Father–Doctor.
  • Three steps:
    • Identify the bias direction: average(e_he - e_she, e_male - e_female, …).
    • For every word that is not definitional (grandmother, female, she are definitional; train a classifier to find the non-definitional words), project out the bias component.
    • Equalize pairs (e.g. make grandmother and grandfather equidistant from babysitter).

Sequence to Sequence

  • Greedy search: picking the best word at each step does not work well; it is not globally optimal.
  • use an approximate search instead
  • beam search algorithm (beam width 3): select the 3 most likely first words. Then select the second word conditioned on each chosen first word (3 copies), keep the top 3 first+second word pairs, and repeat for the rest, always keeping 3 copies of the network, one per hypothesis. (beam width 1 = greedy search)
  • length normalization for beam search: $\arg\max_y \sum_t \log P(y_t \mid x, y_1, y_2, \dots, y_{t-1}) / T_y^\alpha$. Taking logs avoids underflow from multiplying values close to 0, and dividing by $T_y$ avoids favouring short translations. Run for $T_y = 1,2,3,\dots,30$ and pick the best across all $T_y$.
  • a larger beam width is better but slower; going from 1000 to 3000 does not gain as much as going from 1 to 5
  • error analysis to find whether the problem is in the RNN or in beam search: compute p(y*) and p(y') with the RNN. If p(y*) > p(y'), beam search failed to find y*; if p(y*) <= p(y'), the RNN is at fault. Build a table over all (y*, y') pairs to see whether errors are mostly due to beam search or to the RNN.
  • BLEU (bilingual evaluation understudy) score measures a translation against human-generated references (multiple correct references are allowed) (see the sketch after this list)
    • precision = # output words that appear in a reference / # output words
    • modified precision = each word's count clipped to its max count in any single reference / # output words
    • p_n = sum(count_clip)/sum(count): count = how many n-grams from the machine translation appear in the references; count_clip = the max number of times the n-gram appears in any single reference. n can be 1, 2, …
    • combined BLEU score: $BP \cdot \exp(\frac{1}{N}\sum_{n=1}^N p_n)$. BP is a brevity penalty to avoid short translations: BP = 1 if y'_length > y*_length, otherwise BP = exp(1 - y*_length/y'_length).
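A minimal sketch of the clipped (modified) n-gram precision p_n described above; the full BLEU score would combine several p_n values with the brevity penalty:

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Candidate n-gram counts are clipped by their max count in any single reference."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

p1 = modified_precision("the the the cat".split(),
                        ["the cat is on the mat".split()], n=1)
print(p1)   # 0.75: "the" is clipped to 2, its max count in the reference
```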

Attention model

  • long sequences are hard to translate; the attention model addresses this by handling the output part by part.
    activation: $a^{<t'>} = (a_{forward}^{<t'>}, a_{backward}^{<t'>})$
    attention weight: $\alpha^{<t,t'>}$, with t' the input timestep and t the output timestep; $\sum_{t'}\alpha^{<t,t'>}=1$
    input to the attention layer: $c^{<t>} = \sum_{t'} \alpha^{<t,t'>}a^{<t'>}$
    output state of the attention layer: $s^{<t>}$
  • computing attention: $\alpha^{<t,t'>} = \exp(e^{<t,t'>})/\sum_{t'=1}^{T_x} \exp(e^{<t,t'>})$, where $e^{<t,t'>}$ is the output of a small NN with inputs $a^{<t'>}$ and $s^{<t-1>}$ (a sketch follows this list)
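A small sketch of turning the raw scores e<t,t'> into attention weights and a context vector for one output step t; the shapes are assumptions for illustration:

```python
import numpy as np

def attention_context(e_t, a):
    """e_t: (T_x,) raw scores for one output step; a: (T_x, n_a) encoder activations."""
    alpha = np.exp(e_t - e_t.max())
    alpha = alpha / alpha.sum()               # sum over t' of alpha<t,t'> is 1
    c_t = (alpha[:, None] * a).sum(axis=0)    # c<t> = sum_t' alpha<t,t'> a<t'>
    return c_t, alpha
```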

Speech recognition
CTC cost (connectionist temporal classification): collapse repeated characters not separated by "blank", so that e.g. 1000 network outputs can end up as a sentence of reasonable length. (A small collapse sketch follows.)
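A tiny sketch of the CTC collapse rule (repeated characters not separated by the blank token are merged, then blanks are dropped); the example string is illustrative:

```python
def ctc_collapse(tokens, blank="_"):
    out, prev = [], None
    for tok in tokens:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse("ttt_h_eee___ qqq__"))   # "the q"
```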
