Example of Gradient Back-propagation in LR
In this example, we do not cleanly separate the optimizer, the cost function, and the model as a modern machine learning framework would. We simply derive the gradient back-propagation for the whole process.
For the SGD Case
Forward propagation:
- $\text{logit}_{raw} = \sum_i w_i x_i + b$
- $h = \mathrm{sigmoid}(\text{logit}_{raw})$
- $L = -y_{true}\log(h) - (1 - y_{true})\log(1 - h), \quad y_{true} \in \{0, 1\}$
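As a minimal sketch of this forward pass (the function and variable names `forward`, `w`, `b`, `x`, `y_true` are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # Logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x, y_true):
    # logit_raw = sum_i w_i * x_i + b
    logit_raw = np.dot(w, x) + b
    # h = sigmoid(logit_raw)
    h = sigmoid(logit_raw)
    # Cross-entropy loss for a single sample, y_true in {0, 1}.
    loss = -y_true * np.log(h) - (1.0 - y_true) * np.log(1.0 - h)
    return logit_raw, h, loss
```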
Back propagation:
- $\frac{\partial L}{\partial h} = -\frac{y_{true}}{h} + \frac{1 - y_{true}}{1 - h} = -\frac{y_{true} - h}{h(1 - h)}$
- $\frac{\partial h}{\partial \text{logit}_{raw}} = h(1 - h)$
- $\frac{\partial \text{logit}_{raw}}{\partial w_i} = x_i$
- $\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial \text{logit}_{raw}} \cdot \frac{\partial \text{logit}_{raw}}{\partial w_i} = -x_i (y_{true} - h)$
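A matching sketch of the back-propagation for a single sample, following the chain-rule factors above (again, the names are assumptions for illustration):

```python
def backward(x, y_true, h):
    # dL/dh = -(y_true - h) / (h * (1 - h))
    dL_dh = -(y_true - h) / (h * (1.0 - h))
    # dh/dlogit_raw = h * (1 - h)
    dh_dlogit = h * (1.0 - h)
    # Chain rule: dL/dlogit_raw = dL/dh * dh/dlogit_raw, which simplifies to (h - y_true).
    dL_dlogit = dL_dh * dh_dlogit
    # dlogit_raw/dw_i = x_i, so dL/dw_i = -x_i * (y_true - h).
    dL_dw = dL_dlogit * x
    # dlogit_raw/db = 1, so dL/db = -(y_true - h).
    dL_db = dL_dlogit
    return dL_dw, dL_db
```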
So the gradient with respect to $w_i$ for a single sample is $-x_i(y_{true} - h)$.
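Putting the two together, one SGD update on a single sample could look like this sketch, reusing `forward` and `backward` from above (the learning rate `lr` is an assumed hyperparameter):

```python
def sgd_step(w, b, x, y_true, lr=0.1):
    # One stochastic gradient descent update on a single sample.
    _, h, _ = forward(w, b, x, y_true)
    dL_dw, dL_db = backward(x, y_true, h)
    w = w - lr * dL_dw
    b = b - lr * dL_db
    return w, b
```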
For the Batch-GD Case
In the batch-GD case, the data flow is simply the SGD flows for all samples in the batch combined.
By the chain rule of multivariable calculus [1], the gradient with respect to $w_i$ is the sum of the gradients from each back-propagation flow.
Gradient:
$\frac{\partial L_{TOTAL}}{\partial w_i} = -\sum_{n} x_i^{(n)} \left( y_{true}^{(n)} - h^{(n)} \right)$
where the sum runs over the samples $n$ in the batch.
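A sketch of the batch-GD gradient, summing the per-sample gradients in vectorized form (reusing `sigmoid` and the NumPy import from the first sketch; the batch layout, `X` of shape `(N, D)` and `Y` of shape `(N,)`, is an assumption):

```python
def batch_gradient(w, b, X, Y):
    # Forward pass over the whole batch: X has shape (N, D), Y has shape (N,).
    logits = X @ w + b          # shape (N,)
    H = sigmoid(logits)         # shape (N,)
    # dL_TOTAL/dw_i = -sum_n x_i^(n) * (y_true^(n) - h^(n))
    dL_dw = -X.T @ (Y - H)      # shape (D,)
    dL_db = -np.sum(Y - H)
    return dL_dw, dL_db
```

The matrix product `X.T @ (Y - H)` carries out exactly the summation over samples in the formula above, one component per weight $w_i$.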
Reference:
[1] Chain Rule - Wikipedia
[2] CS231n Back Propagation