Table of Contents
1. Gradient Descent with Large Datasets
1.1 Learning with large datasets
1.2 Stochastic gradient descent
1.3 Mini-Batch Gradient Descent
1.4 Stochastic gradient descent convergence
2. Advanced Topics
2.1 Online learning
2.2 Map-Reduce and Data Parallelism
1. Gradient Descent with Large Datasets
1.1 Learning with large datasets
Learning with large datasets:
m = 100,000,000
Plot the learning curve, which looks like this:
fig. 1
(from Coursera week 10, Learning with large datasets)
===> Training with more data can reduce the generalization error.
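A minimal sketch of how one might check this with a learning curve on a subset before committing to all 100,000,000 examples; fit(X, y) -> θ and error(θ, X, y) are hypothetical helpers standing in for whatever model and cost are actually used:

import matplotlib.pyplot as plt

def plot_learning_curve(X, y, X_cv, y_cv, fit, error, sizes):
    """Train on increasingly large subsets and plot training / cross-validation error vs. m."""
    train_err, cv_err = [], []
    for m in sizes:
        theta = fit(X[:m], y[:m])                 # hypothetical training helper
        train_err.append(error(theta, X[:m], y[:m]))
        cv_err.append(error(theta, X_cv, y_cv))
    plt.plot(sizes, train_err, label="J_train")
    plt.plot(sizes, cv_err, label="J_cv")
    plt.xlabel("m (training set size)")
    plt.ylabel("error")
    plt.legend()
    plt.show()

If the two curves have already converged (high bias), more data will not help; if there is still a gap between them (high variance), more data should reduce the generalization error.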
1.2 Stochastic gradient descent
Linear regression with gradient descent:
h(x) = θ^T x
θ_j = θ_j - α (1/m) Σ_{i=1..m} (h(x^(i)) - y^(i)) x_j^(i)   (for every j = 0, ..., n)
===> Also called batch gradient descent (every iteration looks at all m training examples)
stochastic gradient descent:
(1) Randomly shuffle (reorder) the training examples
(2) Repeat {
      for i = 1, ..., m {
        θ_j = θ_j - α(h(x^(i)) - y^(i))x_j^(i)   (for every j = 0, ..., n)
      }
    }
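A minimal Python/NumPy sketch of these two steps for linear regression (function and variable names are illustrative, not from the course):

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    """X: (m, n) design matrix with a leading column of ones; y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        # (1) Randomly shuffle (reorder) the training examples.
        perm = np.random.permutation(m)
        X, y = X[perm], y[perm]
        # (2) Sweep over the examples, updating theta with one example at a time.
        for i in range(m):
            error = X[i] @ theta - y[i]            # h(x^(i)) - y^(i)
            theta = theta - alpha * error * X[i]   # theta_j = theta_j - alpha*(h(x^(i)) - y^(i))*x_j^(i)
    return theta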
1.3 Mini-Batch Gradient Descent
Batch Gradient Descent: use all m examples in each iteration
Stochastic Gradient Descent: use 1 example in each iteration
Mini-Batch Gradient Descent: use b examples in each iteration
b = mini-batch size, typically around 10 (commonly anywhere from 2 to 100)
Say b = 10 and m = 1000: each pass through the training set makes 100 parameter updates, each using 10 examples, as in the sketch below.
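A minimal Python/NumPy sketch, again for linear regression (names are illustrative):

import numpy as np

def mini_batch_gradient_descent(X, y, b=10, alpha=0.01, epochs=10):
    """Mini-batch gradient descent with mini-batch size b."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        perm = np.random.permutation(m)
        X, y = X[perm], y[perm]
        for start in range(0, m, b):                      # b = 10, m = 1000 -> 100 updates per pass
            Xb, yb = X[start:start + b], y[start:start + b]
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)     # average gradient over the mini-batch
            theta = theta - alpha * grad
    return theta

Vectorizing the update over the b examples is what can make mini-batch gradient descent faster than plain stochastic gradient descent.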
1.4 Stochastic gradient descent convergence
Checking for convergence:
During learning, compute cost(θ, (x^(i), y^(i))) = (1/2)(h(x^(i)) - y^(i))^2 before updating θ using (x^(i), y^(i)).
Every 1000 iterations (say), plot the cost averaged over the last 1000 examples processed by the algorithm.
fig. 2
(from Coursera week 10, Stochastic gradient descent convergence)
Learning rate α is typically held constant. We can slowly decrease α over time if we want θ to converge (e.g.
α = const1 / (#iterations + const2)), but this often turns the problem into picking const1 and const2, which makes it more complicated.
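A minimal Python/NumPy sketch of one pass of SGD with this convergence check and the optional decaying learning rate (names and defaults are illustrative):

import numpy as np

def sgd_with_convergence_check(X, y, alpha=0.01, window=1000, const1=None, const2=1000):
    """One pass of SGD for linear regression, recording the cost averaged over the
    last `window` examples; if const1 is given, alpha decays as const1/(#iteration + const2)."""
    m, n = X.shape
    theta = np.zeros(n)
    recent, curve = [], []
    perm = np.random.permutation(m)
    X, y = X[perm], y[perm]
    for i in range(m):
        error = X[i] @ theta - y[i]
        recent.append(0.5 * error ** 2)        # cost(theta, (x^(i), y^(i))) BEFORE the update
        if const1 is not None:
            alpha = const1 / (i + 1 + const2)  # slowly decrease the learning rate
        theta = theta - alpha * error * X[i]
        if (i + 1) % window == 0:
            curve.append(np.mean(recent))      # average cost over the last `window` examples
            recent = []
    return theta, curve

Plotting curve should give a picture like fig. 2: noisy, but trending downward if the algorithm is converging.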
2. Advanced Topics
2.1 Online learning
Shipping service website: a user comes and specifies an origin and destination, you offer to ship their package for some asking price, and the user sometimes chooses to use your shipping service (y = 1) and sometimes does not (y = 0).
Features x capture properties of the user, of the origin/destination, and the asking price. We want to learn p(y = 1 | x; θ) to optimize the price.
Repeat forever {
Get (x, y) corresponding to user.
Update θ using (x, y)
θ_j = θ_j - α(h(x) - y)x_j   (for every j = 0, ..., n)
}
Online learning can adapt to changing user tastes, and it allows us to learn from a continuous stream of data, since we use each example once and then never need to process it again.
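A minimal sketch of one such update, assuming logistic regression so that h(x) = sigmoid(θ^T x) models p(y = 1 | x; θ); stream() is a hypothetical source of incoming (x, y) pairs:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """One online-learning step: update theta with the single (x, y) just
    observed, then discard the example."""
    h = sigmoid(theta @ x)                 # h(x) = p(y = 1 | x; theta)
    return theta - alpha * (h - y) * x     # theta_j = theta_j - alpha*(h(x) - y)*x_j

# Inside the website's request loop (stream() is hypothetical):
# for x, y in stream():
#     theta = online_update(theta, x, y)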
2.2 Map-Reduce and Data Parallelism
fig. 3
(from Coursera week 10, Map-Reduce and Data Parallelism)
fig. 4
(from Coursera week 10, Map-Reduce and Data Parallelism)
fig. 5
(from Coursera week 10, Map-Reduce and Data Parallelism)
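The figures show batch gradient descent split across several machines: each machine computes the partial sum of (h(x^(i)) - y^(i))x^(i) over its own slice of the training set, and a central server adds the partial sums and makes the update. A minimal single-machine sketch of that idea, using a process pool to stand in for the separate machines (names are illustrative):

import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: one machine's partial sum of (h(x^(i)) - y^(i)) * x^(i) over its slice."""
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def map_reduce_gradient_step(X, y, theta, alpha, n_machines=4):
    """Reduce step: a central server adds the partial sums and takes one batch gradient descent step."""
    X_parts = np.array_split(X, n_machines)
    y_parts = np.array_split(y, n_machines)
    with Pool(n_machines) as pool:
        partials = pool.map(partial_gradient,
                            [(Xp, yp, theta) for Xp, yp in zip(X_parts, y_parts)])
    grad = sum(partials) / len(y)
    return theta - alpha * grad

The same splitting works across the cores of a single machine (data parallelism), as long as the learning algorithm can be expressed as a sum of functions over the training set.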