# Minimal Paper Notes (极简论文笔记)

Because of some trivial matters, I have read very few papers in the past few months. In the coming period I will make up for it, recording my understanding of each paper in English, which should also strengthen my writing skills.

### 1. A PID Controller Approach for Stochastic Optimization of Deep Networks

This paper, published at CVPR 2018, introduces the PID method (a concept from automatic control) into optimization.

Motivation: SGD corresponds to P control; SGD with momentum corresponds to PI control. The integral term can reduce steady-state error, which is why SGD with momentum often performs better than plain SGD. However, the integral term is also prone to overshoot, which slows down training.
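This correspondence can be written out with the standard update rules (the notation here, $r$ for the learning rate, $\alpha$ for the momentum coefficient, $g_t$ for the gradient, is mine rather than the paper's):

$$\theta_{t+1} = \theta_t - r\,g_t \quad \text{(SGD: P control acting on the "error" } g_t\text{)}$$

$$v_{t+1} = \alpha v_t - r\,g_t,\qquad \theta_{t+1} = \theta_t + v_{t+1} = \theta_t - r\sum_{i=0}^{t}\alpha^{t-i} g_i$$

Unrolling the momentum buffer shows the update is a discounted sum over the whole gradient history, i.e. an integral of the error signal, which is why momentum plays the role of the I term.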

Approach: Introduce a D term to build a PID optimizer. In control theory, the derivative term has a predictive effect, so it reduces overshoot and accelerates convergence. $K_d$ is determined by the Ziegler-Nichols method, and the D term is updated with a moving average.
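The idea can be sketched in a few lines. This is a minimal illustrative implementation assuming a textbook PID form, not necessarily the paper's exact update rule; the gain values, state names, and the toy quadratic objective below are my own choices.

```python
import numpy as np

def pid_update(theta, grad, state, lr=0.01, ki=0.9, kd=0.1, beta=0.9):
    """One PID-style parameter update (illustrative sketch, not the paper's
    exact rule; gains and names are assumptions for this example).

    P: the current gradient (plain SGD).
    I: a momentum-style accumulation of past gradients.
    D: a moving average of gradient differences, which damps overshoot.
    """
    state["v"] = ki * state["v"] + grad                # I term (momentum buffer)
    state["d"] = beta * state["d"] + (1 - beta) * (grad - state["g_prev"])  # D term
    state["g_prev"] = grad
    return theta - lr * (grad + state["v"] + kd * state["d"])

# Usage: minimize f(x) = 0.5 * x^2, whose gradient is simply x.
theta = np.array([5.0])
state = {"v": np.zeros(1), "d": np.zeros(1), "g_prev": np.zeros(1)}
for _ in range(500):
    theta = pid_update(theta, theta.copy(), state)
```

The D term only reacts to the *change* in the gradient, so it pushes against sudden swings of the momentum buffer, which is the intuition behind its overshoot-reducing effect.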

Thinking:

1. The author regards the gradient as the error signal of control theory. In fact, the loss is the counterpart of the error, because the goals of a neural network and a control system are to drive the loss and the error to zero, respectively. In classification tasks where the loss is MSE, the gradient is directly tied to the loss, so we can optimize via the gradient directly. But in other tasks, especially when the gradient and the loss do not have such a direct relationship, how should this be handled?
2. The D term is sensitive to noise disturbance; does this mean we could add an extra filter to enhance stability?
3. How should the parameters of the P and I terms be tuned to further improve performance?
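On question 2, one possible (purely hypothetical) realization would be a first-order exponential low-pass filter applied to the D term's input; the class name and default coefficient below are assumptions for illustration, not something proposed in the paper.

```python
class LowPassFilter:
    """First-order exponential low-pass filter:
        y_t = a * y_{t-1} + (1 - a) * x_t
    Larger a means stronger smoothing (and more lag). The name and the
    default coefficient are illustrative assumptions.
    """
    def __init__(self, a=0.9):
        self.a = a
        self.y = None  # filter state; initialized on the first sample

    def __call__(self, x):
        self.y = x if self.y is None else self.a * self.y + (1 - self.a) * x
        return self.y

# Usage: an alternating +/-1 "noise" signal is squashed toward zero,
# while a constant signal would pass through unchanged.
f = LowPassFilter(a=0.9)
smoothed = [f(1.0 if i % 2 == 0 else -1.0) for i in range(200)]
```

Note that the paper's moving-average update of the D term is already a filter of exactly this form, so the open question is really whether a second, stronger smoothing stage would help or just add lag.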

2019.06.04