Minimal Paper Notes

Because of some trivial matters, I have read very few papers in the past few months. In the coming period I will make up for it, recording my understanding of each paper in English and strengthening my writing skills along the way.

1. A PID Controller Approach for Stochastic Optimization of Deep Networks

This paper, published at CVPR 2018, introduces the PID method (a concept from automatic control) into optimization.

Motivation: SGD corresponds to P control; SGD with momentum corresponds to PI control. The integral term can reduce steady-state error, which is why SGD with momentum often performs better than plain SGD. However, it also tends to overshoot, which slows down training.
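
As a rough sketch of the analogy (in my own notation, not copied from the paper), the SGD update responds only to the current gradient, i.e. a proportional term, while momentum additionally accumulates a decayed sum of past gradients, i.e. an integral term:

    \theta_{t+1} = \theta_t - r \, \nabla_\theta L(\theta_t)   % P: step proportional to the current gradient ("error")
    v_{t+1} = \alpha v_t - r \, \nabla_\theta L(\theta_t), \quad \theta_{t+1} = \theta_t + v_{t+1}   % PI: v is a decayed sum (integral) of past gradients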

Approach: introduce a D term to build a PID-style optimizer. In control theory, the derivative term has a predictive effect, so it reduces overshoot and accelerates convergence. K_d is determined by the Ziegler-Nichols method, and the D term is updated with a moving average.
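
A minimal sketch of what one PID-style step might look like (variable names, default hyperparameters, and sign conventions are my own, not the paper's released code), assuming the I term is the usual momentum buffer and the D term is a moving average of gradient differences:

    import numpy as np

    def pid_step(theta, grad, state, lr=0.1, alpha=0.9, kd=1.0, beta=0.9):
        """One PID-style update: momentum SGD (P + I) plus a derivative term (sketch)."""
        # I term: the usual momentum buffer, a decayed sum of past gradients.
        state["v"] = alpha * state.get("v", np.zeros_like(grad)) - lr * grad
        # D term: moving average of the change in gradient, approximating d(error)/dt.
        prev = state.get("prev_grad", grad)
        state["d"] = beta * state.get("d", np.zeros_like(grad)) + (1 - beta) * (grad - prev)
        state["prev_grad"] = grad
        # The derivative term damps rapid changes in the gradient, which is what reduces overshoot.
        return theta + state["v"] - kd * state["d"]

    # Toy usage: a few steps on f(theta) = 0.5 * theta**2, whose gradient is theta.
    theta, state = np.array([5.0]), {}
    for _ in range(10):
        theta = pid_step(theta, grad=theta.copy(), state=state)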

 

Thinking:

  1. The author treats the gradient as the error signal from control theory. Strictly speaking, the loss corresponds to the error, since the goal of a neural network is to drive the loss to zero, just as a control system drives the error to zero. For tasks whose loss function is MSE, the gradient is directly tied to the loss, so operating on the gradient is reasonable. But in other tasks, where the gradient and the loss do not have such a proportional relationship, how should this be handled?
  2. The D term is sensitive to noise; does that mean an extra filter could be added to improve stability?
  3. How should the parameters of the P and I terms be tuned to improve performance?

 

2019.06.04

It has been a long time since I last took paper notes, so here is a brief record of the COT paper (ICLR 2019) I read today. The views below are not taken from the paper itself; criticism and discussion are welcome.

The whole paper rests on the following assumption: once an image's label is given, the confidence assigned to every other class should be very low, while the confidence of the true label should be very high. Setting aside whether this assumption is correct, let us first see how the authors motivate COT.

Consider a case where the target is [0, 0, 0, 1] but the prediction is [0.05, 0.05, 0.8, 0.1]. Cross entropy will pull the 0.1 toward 1, and at the same time the COT term pushes the three values [0.05, 0.05, 0.8] toward [0.3, 0.3, 0.3]. The two objectives reinforce each other; to some extent, COT gives CE a helping hand.

If instead the prediction is [0.05, 0.05, 0.1, 0.8], then COT probably has little effect.

Having covered what COT does, let us now discuss whether the earlier assumption is actually correct, looking at it from two angles. From the data perspective, for classes with high overlap, such as cup and bottle, which look very similar, I think that even when the label is bottle, a relatively high score for cup actually indicates that feature extraction and classification are working correctly; if the cup score were extremely low, that would instead be somewhat suspect. From the classifier perspective, when the feature dimension is much higher than the number of classes, the classification space can accommodate many subspaces with little overlap; in that case COT may indeed help, keeping the overlap between class subspaces small and making the resulting classification more accurate. But when the feature dimension is equal to or smaller than the number of classes, the overlap between classes is already high, and an object is quite likely to fall into an overlapping region, so its scores for several classes will all be fairly high; forcibly using COT to suppress the non-ground-truth classes may then backfire.

Oh, I still have not said what COT actually is. It is quite simple: consider the entropy over all the false labels and maximize this loss. Note that it is a maximization, because COT reaches its maximum when the predicted probabilities Yij_hat are evenly split among the wrong classes; this is clear from the derivative of -x log x, a function that first increases and then decreases, so it has a single maximum. So after updating the parameters with CE, the parameters are updated again with COT so that this complement entropy reaches its maximum. In the actual implementation, the leading minus sign of the COT loss is dropped, which turns the maximization into an equivalent minimization, so ordinary gradient descent can be used.
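
A minimal numpy sketch of that complement-entropy term as I understand it (the function name and epsilon are mine, not the authors' code), checked against the two predictions discussed above:

    import numpy as np

    def complement_entropy(probs, target):
        """Entropy of the predicted distribution restricted to the non-target classes.

        probs: softmax output, shape (num_classes,); target: index of the true label.
        COT maximizes this quantity (equivalently, minimizes its negative), pushing the
        probability mass on the wrong classes toward a flat, uninformative shape.
        """
        comp = np.delete(probs, target)        # drop the true-label probability
        comp = comp / (comp.sum() + 1e-12)     # renormalize over the K-1 wrong classes
        return -np.sum(comp * np.log(comp + 1e-12))

    # First case: wrong-class mass is concentrated on one class, so complement
    # entropy is low and COT has something to push against.
    print(complement_entropy(np.array([0.05, 0.05, 0.8, 0.1]), target=3))
    # Second case: wrong-class mass is already nearly flat, complement entropy is
    # close to its maximum log(3), so COT contributes little.
    print(complement_entropy(np.array([0.05, 0.05, 0.1, 0.8]), target=3))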
