Before training on a large dataset, plot a learning curve on a much smaller dataset to check whether the model has high bias; if it does, adding more data will not help.
Stochastic Gradient Descent
(taking linear regression as an example)
(repeat the outer loop 1 to 10 times over the training set)
As the picture shows, the parameters wander in random directions near the minimum and never converge to a single point, but the method is much faster than Batch Gradient Descent.
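A minimal sketch of the per-example update; the toy data, learning rate, and function names below are my own illustration, not from the notes:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=10):
    """Stochastic gradient descent for linear regression.

    Updates theta after *each* example, instead of after a full
    pass over the data as batch gradient descent would.
    """
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):                # typically 1 to 10 passes
        for i in rng.permutation(m):       # shuffle the data first
            error = X[i] @ theta - y[i]    # h(x_i) - y_i
            theta -= alpha * error * X[i]  # update from one example
    return theta

# toy data: y = 2 * x, with a bias column of ones
X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 2 * X[:, 1]
theta = sgd_linear_regression(X, y, alpha=0.1, epochs=10)
```

On this noise-free toy problem the exact solution is a fixed point of the update, so theta lands close to [0, 2]; on noisy data it would keep oscillating around the optimum instead.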
Mini-Batch Gradient Descent
b is often chosen from 2 to 100
Mini-Batch Gradient Descent is faster than Stochastic Gradient Descent only because of vectorization.
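A sketch of the mini-batch variant. The gradient over each batch of b examples is a single vectorized matrix product, which is where the speed-up over one-example-at-a-time updates comes from (names and toy data are mine):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.1, b=10, epochs=100):
    """Mini-batch gradient descent: update theta after every b examples.

    The batch gradient is one vectorized matrix product, so a good
    linear-algebra library makes this faster than looping over
    single examples as stochastic gradient descent does.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for start in range(0, m, b):
            Xb, yb = X[start:start + b], y[start:start + b]
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)  # vectorized over the batch
            theta -= alpha * grad
    return theta

# toy data: y = 2 * x, with a bias column of ones
X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 2 * X[:, 1]
theta = minibatch_gd(X, y)
```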
Checking for Convergence
(without periodically scanning the entire training set)
In plot 1, with a smaller α, the red line converges to a slightly better value, but more slowly.
In plots 2 and 3, averaging the cost over more examples makes the red line smoother, which helps in plot 3, where the raw curve is too noisy; the trade-off is that the plot responds to progress more slowly.
In plot 4, the cost is increasing (the algorithm is diverging), so we should probably use a smaller α instead.
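The check described above can be sketched as follows: record the cost on each example just before its update, and every `window` iterations plot the average of those recent costs (the window size, data, and names are my choice here):

```python
import numpy as np

def sgd_with_monitoring(X, y, alpha=0.05, window=1000):
    """SGD that records cost(theta, x_i, y_i) *before* each update,
    then averages the last `window` costs as a cheap convergence
    check -- no full pass over the training set is needed."""
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, averages = [], []
    for i in np.random.default_rng(0).permutation(m):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)  # cost on this example, pre-update
        theta -= alpha * error * X[i]
        if len(recent_costs) == window:
            averages.append(np.mean(recent_costs))  # one point on the plot
            recent_costs = []
    return theta, averages

# simulated data: y = 3 * x plus a little noise
rng = np.random.default_rng(1)
X = np.c_[np.ones(5000), rng.normal(size=5000)]
y = 3 * X[:, 1] + rng.normal(scale=0.1, size=5000)
theta, averages = sgd_with_monitoring(X, y)
# the averaged cost should trend downward over the run
```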
Learning Rate
To get closer to the optimum, slowly decrease α over time, e.g. α = const1 / (iterationNumber + const2).
However, choosing these two extra constants is hard work, so this is often not a good choice; a fixed α is usually kept instead.
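The decaying-schedule idea, as commonly written α = const1 / (iterationNumber + const2), fits in a one-line helper; the constant values below are arbitrary placeholders, not recommendations:

```python
def decaying_alpha(iteration, const1=1.0, const2=50.0):
    """Slowly shrink the learning rate so SGD settles nearer the
    optimum; const1 and const2 are the two constants that must be
    tuned by hand, which is the drawback of this schedule."""
    return const1 / (iteration + const2)

alphas = [decaying_alpha(t) for t in (0, 100, 1000)]
# alpha shrinks as the iteration number grows
```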
Online Learning
(learning from a continuous stream of data)
The examples carry no superscript (i): once a piece of data has been used for an update, it is discarded.
Online learning can also adapt to changing user preferences.
CTR (Click-Through Rate)
Showing the user 10 results yields 10 (x, y) pairs, i.e. 10 training examples with which to update the parameters.
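A sketch of online learning for predicted CTR, using logistic regression on a simulated stream; the price feature and the click model are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """One online-learning step (logistic regression, e.g. for
    predicted CTR). Each (x, y) pair is used exactly once and then
    discarded: there is no fixed training set, hence no
    superscript (i) on the examples."""
    error = sigmoid(theta @ x) - y
    return theta - alpha * error * x

# hypothetical stream: x = [1, price]; users click cheap offers more often
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(5000):
    price = rng.uniform(0, 1)
    x = np.array([1.0, price])
    y = float(rng.uniform() < sigmoid(2 - 4 * price))  # simulated click
    theta = online_update(theta, x, y)
```

After enough of the stream, theta learns a negative weight on price, so the model predicts a higher click probability for cheaper offers; if user preferences drifted, continued updates would track the change.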
MapReduce
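The MapReduce idea applied to batch gradient descent: split the summation in the gradient across shards of the data (the map step), then add the partial sums back together (the reduce step). This sketch runs the shards sequentially; in practice each shard would go to a different machine or core. All names here are mine:

```python
import numpy as np

def partial_gradient(X_shard, y_shard, theta):
    """Map step: one 'machine' sums the gradient terms over its shard."""
    return X_shard.T @ (X_shard @ theta - y_shard)

def mapreduce_gradient(X, y, theta, n_machines=4):
    """Reduce step: add the partial sums from every shard.

    Because the batch gradient is just a sum over examples, splitting
    that sum across machines gives the same answer as computing it
    centrally, up to communication cost.
    """
    shards = zip(np.array_split(X, n_machines),
                 np.array_split(y, n_machines))
    partials = [partial_gradient(Xs, ys, theta) for Xs, ys in shards]
    return sum(partials) / len(y)

# toy data: the sharded gradient must match the centralized one
X = np.c_[np.ones(100), np.linspace(0, 1, 100)]
y = 2 * X[:, 1]
theta = np.zeros(2)
grad = mapreduce_gradient(X, y, theta)
```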