LSTM overview
- This is an excellent explanation of the ideas behind LSTM and its variants.
Experiment
- Vanilla LSTM
- Tested modifications:
- No Input Gate (NIG)
- No Forget Gate (NFG)
- No Output Gate (NOG)
- No Input Activation Function (NIAF)
- No Output Activation Function (NOAF)
- No Peepholes (NP)
- Coupled Input and Forget Gate (CIFG)
- Full Gate Recurrence (FGR)
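The variants above each drop or couple one term of the vanilla LSTM update. A minimal NumPy sketch of a single peephole-LSTM step (not the authors' implementation; shapes and names are illustrative) makes it easy to see which term each variant touches:

```python
import numpy as np

def lstm_step(x, h, c, W, R, b, p):
    """One vanilla LSTM step with peephole connections.

    W: input weights (4H x D), R: recurrent weights (4H x H),
    b: biases (4H,), p: peephole weight vectors for gates i, f, o.
    """
    H = h.shape[0]
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = W @ x + R @ h + b
    zi, zf, zo, zg = np.split(z, 4)
    i = sigma(zi + p["i"] * c)      # input gate  (NIG removes it; NP drops the p["i"]*c term)
    f = sigma(zf + p["f"] * c)      # forget gate (NFG removes it; CIFG couples it as f = 1 - i)
    g = np.tanh(zg)                 # input activation (NIAF drops the tanh)
    c_new = f * c + i * g           # cell state update
    o = sigma(zo + p["o"] * c_new)  # output gate (NOG removes it)
    h_new = o * np.tanh(c_new)      # output activation (NOAF drops the tanh)
    return h_new, c_new
```

FGR additionally feeds all gate activations back into all gates at the next step, which is not shown here; it adds nine extra recurrent weight matrices.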
- Hyperparameter Search
  - number of LSTM blocks per hidden layer: log-uniform samples from [20, 200]
  - learning rate: log-uniform samples from [10^-6, 10^-2]
  - momentum: 1 − log-uniform samples from [0.01, 1.0]
  - standard deviation of Gaussian input noise: uniform samples from [0, 1]
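Drawing one random configuration from these distributions can be sketched with the standard library alone (a hypothetical helper, not the paper's search code):

```python
import math
import random

def log_uniform(lo, hi):
    """Sample uniformly in log space, then exponentiate back."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_config():
    # One hyperparameter draw, matching the ranges listed above.
    return {
        "n_blocks": round(log_uniform(20, 200)),
        "learning_rate": log_uniform(1e-6, 1e-2),
        "momentum": 1.0 - log_uniform(0.01, 1.0),
        "input_noise_std": random.uniform(0.0, 1.0),
    }
```

Log-uniform sampling spreads trials evenly across orders of magnitude, which suits the learning rate's four-decade range far better than a plain uniform draw would.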
- Tested datasets
- TIMIT Speech corpus (speech recognition)
- IAM Online Handwriting Database (handwriting recognition)
- JSB Chorales (music modeling)
- Conclusions
- Vanilla LSTM performs well. Coupling the input and forget gates (CIFG) and removing peephole connections (NP) are worth trying, since they simplify the model without hurting performance.
- Do not remove output gate or forget gate.
- The learning rate is the most important hyperparameter. Momentum matters little for LSTM. Gaussian noise on the input usually hurts performance.
- Hyperparameters interact little, so they can be tuned almost independently.