---------------------------------------------------batch---------------------------------------
0. batch-size = N (full batch) -- converges and is accurate, but takes more epochs
1. batch-size = 1 (online learning) -- updates are noisy and may not converge
2. batch-size = m (mini-batch) -- often the best trade-off (see the sketch below)
First, the gradient of the loss over a mini-batch is an estimate of the gradient over the whole training set, and the quality of that estimate improves as the batch size increases.
Second, computing over a batch can be much more efficient than m separate computations for individual examples, due to the parallelism afforded by modern computing platforms.
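A minimal sketch of mini-batch SGD on a linear least-squares model (the toy data, model, learning rate, and batch size are illustrative assumptions, not from these notes):

import numpy as np

# Toy data: y = X @ w_true + noise (illustrative assumption).
rng = np.random.default_rng(0)
N, d = 1024, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=N)

def minibatch_sgd(X, y, batch_size=32, lr=0.1, epochs=20):
    """Mini-batch SGD on squared loss.
    batch_size=len(X) gives full-batch gradient descent,
    batch_size=1 gives online learning."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(N)              # reshuffle examples each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # The mini-batch gradient estimates the full-batch gradient;
            # the estimate gets less noisy as batch_size grows.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w

w_hat = minibatch_sgd(X, y, batch_size=32)
print(np.allclose(w_hat, w_true, atol=1e-2))

Larger batches also let each gradient step be computed as one matrix product, which is where the parallelism of modern hardware pays off.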
-------------------------------------------------------dropout---------------------------------
The idea is related to ensemble methods (cf. AdaBoost / bagging): randomly dropping units during training is like training an implicit ensemble of sub-networks that share weights, which are averaged at test time.
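A minimal sketch of (inverted) dropout applied to an activation array; the keep probability and the input are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p_drop=0.5, train=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and rescale the survivors so the expected activation is unchanged.
    At test time the layer is the identity, which corresponds to averaging
    the implicit ensemble of sub-networks."""
    if not train or p_drop == 0.0:
        return a
    mask = rng.random(a.shape) >= p_drop      # keep with probability 1 - p_drop
    return a * mask / (1.0 - p_drop)

a = np.ones((2, 4))
print(dropout(a, p_drop=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0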
---------------------------------------------------batch-normalization--------------------------