LSTM
- With a gated RNN, the network learns which information should be remembered and which should be forgotten over a long duration (through the forget gate).
- Distinguish between the cell state and the hidden state: the former maintains long-term dependencies, while the latter is the input to the forget, input, and gate gates and the output of the output gate.
- The introduction of the cell state in LSTM is the primary reason the vanishing and exploding gradient problems are mitigated. See the tutorial here.
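The gate/state distinction above can be sketched as a single LSTM step in numpy. This is a minimal illustration, not a library implementation; the names `W`, `U`, `b` and the stacked-gate layout are my assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the four gate parameter sets
    stacked as [forget, input, gate (candidate), output]."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b   # pre-activations for all four gates
    f = sigmoid(z[0:n])          # forget gate: what to erase from the cell
    i = sigmoid(z[n:2*n])        # input gate: what to write to the cell
    g = np.tanh(z[2*n:3*n])      # candidate values
    o = sigmoid(z[3*n:4*n])      # output gate
    c = f * c_prev + i * g       # cell state: additive update -> long-term memory
    h = o * np.tanh(c)           # hidden state: filtered view of the cell
    return h, c
```

Note how `h_prev` feeds every gate, while `c` is only touched through the elementwise forget/input update, which is what lets it carry information over long durations.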
Others
- Example of image captioning:
- a combination of CNN and RNN
- the CNN takes an image as input and outputs a feature vector
- this feature vector is then fed into the RNN as something like a hidden state (but actually not!), via a conversion matrix $W_{ih}$
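The conversion step above can be sketched in numpy; the sizes (512-d feature, 256-d hidden state) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=512)          # feature vector produced by the CNN
W_ih = rng.normal(size=(256, 512))   # conversion matrix: feature -> hidden size
h0 = np.tanh(W_ih @ feat)            # "hidden-state-like" initialization for the RNN
```

The RNN then starts decoding the caption from `h0` instead of from a zero hidden state.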
- Gradient clipping:
- solves two problems: sharp cliffs in parameter space and exploding gradients.
The basic idea is to recall that the gradient specifies not the optimal step size, but only the optimal direction within an infinitesimal region.
The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters.
Thus, limit the gradient size by a predefined threshold.
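Clipping by global norm can be sketched as below (the function name is mine): the direction is kept, only the step size is capped.

```python
import numpy as np

def clip_by_norm(grads, threshold):
    """Rescale the gradients if their global norm exceeds threshold:
    g <- g * threshold / ||g||  (direction preserved, magnitude capped)."""
    norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads
```

Gradients below the threshold pass through unchanged, so clipping only intervenes at the sharp cliffs.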
- Exploding and vanishing gradient:
- It is sufficient for the largest singular value $\lambda_1 < \frac{1}{\gamma}$ for the vanishing gradient problem to occur.
- The necessary condition for the exploding gradient problem is that the largest singular value $\lambda_1 > \frac{1}{\gamma}$.
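A toy linear demonstration of the vanishing condition (the matrix is made up; $\gamma = 1$ here, as for a linear activation, so the condition is $\lambda_1 < 1$): repeatedly backpropagating through the same recurrent weight shrinks the gradient geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
A *= 0.9 / np.linalg.svd(A, compute_uv=False)[0]  # force largest singular value to 0.9

g = rng.normal(size=n)
norms = [np.linalg.norm(g)]
for _ in range(50):          # backprop through 50 time steps: g <- A^T g
    g = A.T @ g
    norms.append(np.linalg.norm(g))
```

Since $\|A^\top g\| \le \lambda_1 \|g\| = 0.9\|g\|$, the gradient norm decays at least as fast as $0.9^t$, i.e. it vanishes.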
Generative Model:
- Training example of GAN:
- we sample a minibatch of $m$ noise examples $\{z^{(1)},\dots,z^{(m)}\}$ from the noise prior $p_g(z)$ (used to generate images).
- we sample a minibatch of $m$ examples (for training the discriminator) $\{x^{(1)},\dots,x^{(m)}\}$ from the data-generating distribution $p_{data}(x)$.
- Cost functions may not converge using gradient descent in a minimax game.
- A zero-sum game is also called minimax: your opponent acts to maximize the objective, while your actions aim to minimize it.
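The sampling and the minimax objective above can be combined in a toy sketch of the GAN value $V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$, which the discriminator ascends and the generator descends. The 1-D generator and discriminator here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 64
z = rng.normal(size=m)               # minibatch from the noise prior p_g(z)
x = rng.normal(loc=3.0, size=m)      # minibatch from p_data(x)

G = lambda z: z + 1.0                                # toy generator
D = lambda x: 1.0 / (1.0 + np.exp(-(x - 2.0)))       # toy discriminator (sigmoid score)

# minimax value on the two minibatches: max over D, min over G
V = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
```

Each gradient-descent step updates only one player's parameters while holding the other fixed, which is why convergence is not guaranteed.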
- Maximum Likelihood Estimation: $\hat{\theta} = \arg\max_\theta \prod_{i=1}^N p(x_i|\theta)$ can be viewed as minimizing the KL divergence $D_{KL}(P\|Q)$, where $P$ is the true probability distribution we want to approximate, while $Q$ is the estimated distribution.
So the KL divergence $D_{KL}(P\|Q)$ penalizes a generator that misses some mode of the real-life distribution, i.e. $p(x) > 0$ but $q(x) \to 0$, while it tolerates some generated images looking unreal (in other words, $D_{KL}(P\|Q)$ won't penalize this failure case): $p(x) \to 0$ but $q(x) > 0$.
By contrast, the reverse KL divergence $D_{KL}(Q\|P)$ penalizes a generator that produces unreal images, i.e. $q(x) > 0$ but $p(x) \to 0$, while it accepts a generator that is less varied but always produces real-looking images, i.e. $q(x) \to 0$ but $p(x) > 0$.
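The asymmetry can be checked numerically on a discrete toy example: a two-mode $P$ and a mode-dropping $Q$ that puts almost all its mass on the first mode.

```python
import numpy as np

def kl(p, q):
    """D_KL(P||Q) = sum_x p(x) log(p(x)/q(x)), summed over the support of P."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])     # true distribution: two equally likely modes
q = np.array([0.99, 0.01])   # mode-dropping generator: almost ignores mode 2

forward = kl(p, q)   # D_KL(P||Q): dominated by the 0.5 * log(0.5/0.01) term
reverse = kl(q, p)   # D_KL(Q||P): much smaller for the same Q
```

The forward direction blows up exactly where $p(x) > 0$ but $q(x) \to 0$, matching the mode-dropping penalty described above.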
- The gradient of the JS divergence will vanish if there is a huge mismatch between $\mu(x)$ and $\mu_g(x)$, especially for an optimal discriminator (because the JS divergence saturates). This makes learning very slow.
- Mode collapse: nicely explained in the above tutorial. The generator becomes less varied: all random noise inputs generate similar images, collapsing to a single mode that can already fool the discriminator.
Extension of GAN & Application
- Pix2Pix formulation: the $y$ in the formulation is the paired example, like the corresponding map from an aerial photo
- The generator now tries to generate the paired image: e.g., from day to night
- $L_1$ measures the difference between the generated paired image and the true paired image.
- requires paired images as training data
- This GAN is in a conditional setting, which means the random noise $z$ in latent space is conditioned on the input image $x$. So the generator takes $z$ and $x$ as input to output the desired data sample.
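The Pix2Pix generator objective (adversarial term plus weighted $L_1$ term) can be sketched with toy stand-ins; the constant discriminator and additive generator here are made up, while $\lambda = 100$ is the weight used in the Pix2Pix paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))            # input image (e.g. the daytime photo)
y = rng.random((8, 8))            # paired target (e.g. the night version)
z = rng.normal(size=(8, 8))       # noise, conditioned on x inside G

G = lambda x, z: x + 0.01 * z     # toy conditional generator G(x, z)
D = lambda x, y: 0.3              # toy conditional discriminator score D(x, y)

fake = G(x, z)
lam = 100.0                       # L1 weight (lambda = 100 in the Pix2Pix paper)
adv = -np.log(D(x, fake))         # generator's adversarial term
l1 = np.mean(np.abs(fake - y))    # L1 term: pull the generated pair toward the true pair
loss = adv + lam * l1
```

Note that the discriminator also sees the input `x`, so it judges whether the *pair* $(x, y)$ looks real, not the output alone.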
- CycleGAN: unpaired image translation; it learns two densities and translates a sample from the first ("images of apples") into a sample likely under the second ("images of oranges").
- measures the cycle-consistency loss
- F and G are two generators (translators) whose input is an image (instead of random noise) and whose output is the unpaired corresponding image.
- Input - Generated image - Reconstruction
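The input → generated image → reconstruction chain corresponds to the cycle-consistency loss $\|F(G(x)) - x\|_1 + \|G(F(y)) - y\|_1$. In this sketch the toy generators are exact inverses of each other, so the loss is zero; real generators are neural networks trained to approximate this.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))    # sample from domain X ("images of apples")
y = rng.random((8, 8))    # sample from domain Y ("images of oranges")

G = lambda x: x + 0.1     # toy generator G: X -> Y
F = lambda y: y - 0.1     # toy generator F: Y -> X

# input -> generated image -> reconstruction, in both directions
cycle = np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))
```

Penalizing the reconstruction error ties the two translators together, which is what removes the need for paired training data.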
VAE
- Intuition of VAE:
- assuming training dataset