End-to-end people detection in crowded scenes

  • Divides the image into a grid and uses an LSTM to predict the objects in each grid cell separately, producing a sequential set of output detections (a decoder sketch follows the Advantages list below).
  • decoding an image into a set of people detections

Advantages

  • directly outputs a set of distinct detection hypotheses.
  • Because we generate predictions jointly, common post-processing steps
    such as non-maximum suppression are unnecessary
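
A minimal decoder sketch (my own PyTorch reconstruction, not the authors' code; feature_dim, hidden_dim and max_detections are illustrative values): an LSTM repeatedly reads the features of one grid cell and emits one box plus a confidence per step, so the recurrent state keeps track of which people have already been predicted.

```python
import torch
import torch.nn as nn

class LSTMDetectionDecoder(nn.Module):
    """Decode one grid cell's image features into a set of detections.

    Hypothetical reconstruction of the decoding idea; the sizes below are
    illustrative, not the values used in the paper.
    """
    def __init__(self, feature_dim=1024, hidden_dim=256, max_detections=5):
        super().__init__()
        self.lstm = nn.LSTMCell(feature_dim, hidden_dim)
        self.box_head = nn.Linear(hidden_dim, 4)    # (x, y, w, h)
        self.conf_head = nn.Linear(hidden_dim, 1)   # confidence of this detection
        self.max_detections = max_detections

    def forward(self, cell_features):
        # cell_features: (batch, feature_dim) for one grid cell
        batch = cell_features.size(0)
        h = cell_features.new_zeros(batch, self.lstm.hidden_size)
        c = cell_features.new_zeros(batch, self.lstm.hidden_size)
        boxes, confidences = [], []
        for _ in range(self.max_detections):
            # The same image feature is fed at every step; the recurrent state
            # is what lets the model avoid re-predicting earlier boxes.
            h, c = self.lstm(cell_features, (h, c))
            boxes.append(self.box_head(h))
            confidences.append(torch.sigmoid(self.conf_head(h)))
        # (batch, max_detections, 4) and (batch, max_detections, 1)
        return torch.stack(boxes, dim=1), torch.stack(confidences, dim=1)
```

In the paper the predicted sequence is matched to the ground truth with a Hungarian-style loss during training, which penalizes duplicate hypotheses directly and is what makes NMS unnecessary at test time.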

Cognitive Mapping and Planning for Visual Navigation

  • Builds and updates a metric belief of the world (analogous to the map of its surroundings that a person or a rat keeps in its head).
  • A planner then uses this metric belief to plan a path to the goal.
  • The planning algorithm is value iteration: "Our planner is based on value iteration networks proposed by Tamar et al. [58]" (a tabular sketch follows this list).
  • "To alleviate this problem, we extend the hierarchical version presented in [58]": a hierarchical, multi-level planner operating over a multi-scale pyramid of the map.
  • The best paths are generated synthetically and used as targets, i.e. training is supervised.
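
For reference, the planning primitive is ordinary value iteration; below is a tabular NumPy sketch of the recurrence (the VIN module in the paper unrolls these backups as convolution + max layers with learned rewards, so this toy version only illustrates the update being approximated).

```python
import numpy as np

def value_iteration(reward, transitions, gamma=0.99, iters=50):
    """Tabular value iteration.

    reward:      (S, A) array of rewards r(s, a).
    transitions: (S, A, S) array of transition probabilities P(s' | s, a).
    Returns state values V and a greedy policy.
    """
    S, A = reward.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Bellman backup: Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s,a) * V(s')
        Q = reward + gamma * (transitions @ V)
        V = Q.max(axis=1)   # the max over actions that a VIN implements as pooling
    return V, Q.argmax(axis=1)
```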

Deep Successor Reinforcement Learning

Q-values are generated via an inner product (a step I did not fully follow) between, for each action, the predicted next-step state features and the value weights of the current state; see the sketch after this list.
- Besides the value function, there is also a Successor Representation (SR). The model includes:
- a reward predictor and
- a successor map.
- Predicts the next state's features after executing a given action.
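
A minimal sketch of the inner-product step as I understand it (toy NumPy code with made-up numbers): the Q-value of each action is the dot product between that action's successor features m(s, a) and a learned reward-weight vector w, the same w used to predict immediate rewards from state features.

```python
import numpy as np

def dsr_q_values(successor_features, reward_weights):
    """Q(s, a) = m(s, a) . w for every action.

    successor_features: (num_actions, feature_dim) array -- the predicted
                        discounted sum of future state features per action.
    reward_weights:     (feature_dim,) vector w with r(s) ~ phi(s) . w.
    """
    return successor_features @ reward_weights

# Toy usage (illustrative numbers only):
m = np.array([[0.2, 0.8, 0.1],   # successor features for action 0
              [0.5, 0.1, 0.9]])  # successor features for action 1
w = np.array([1.0, 0.5, -0.2])
print(dsr_q_values(m, w))        # one Q-value per action
```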

ToDo

  • [17] presented an option discovery algorithm where the agent is encouraged to explore regions that were previously out of reach.
  • Hierarchical reinforcement learning algorithms [1] such as the options framework [38, 39] provide
    a flexible framework to create temporal abstractions, which will enable exploration at different
    time-scales.

Attractor Network Dynamics Enable Preplay and Rapid Path Planning in Maze-like Environments

  • Models how the hippocampus (which is important for episodic memory) guides the agent all the way to the goal.

*****LEARNING TO PLAY IN A DAY: FASTER DEEP REINFORCEMENT LEARNING BY OPTIMALITY TIGHTENING

Speeds up the propagation of rewards and improves the convergence rate.
Uses upper and lower bounds on the Q-values; I have not yet worked out the exact method (a rough sketch of the bounds follows below).
- Not only do we propagate information from time instances in the future to our current state, but we also pass information from states several steps in the past.
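
Since the method is still unclear to me, the following is only a rough NumPy sketch of how I read the two bounds: a lower bound on Q*(s_t, a_t) assembled from future rewards plus a bootstrapped value, and an upper bound obtained by rearranging the same inequality around a state several steps in the past.

```python
import numpy as np

def lower_bound(future_rewards, q_max_future, gamma=0.99):
    """L_t = sum_j gamma^j * r_{t+j} + gamma^k * max_a Q(s_{t+k}, a).

    future_rewards: [r_t, r_{t+1}, ..., r_{t+k-1}] observed after time t.
    q_max_future:   max_a Q(s_{t+k}, a) from the target network.
    """
    k = len(future_rewards)
    discounts = gamma ** np.arange(k)
    return float(discounts @ np.asarray(future_rewards)) + gamma ** k * q_max_future

def upper_bound(past_rewards, q_past, gamma=0.99):
    """U_t = gamma^{-k} * (Q(s_{t-k}, a_{t-k}) - sum_j gamma^j * r_{t-k+j}).

    past_rewards: [r_{t-k}, ..., r_{t-1}] observed before time t.
    q_past:       Q(s_{t-k}, a_{t-k}) from the target network.
    """
    k = len(past_rewards)
    discounts = gamma ** np.arange(k)
    return gamma ** (-k) * (q_past - float(discounts @ np.asarray(past_rewards)))
```

During training, the current estimate Q(s_t, a_t) is penalized whenever it drops below the tightest lower bound or rises above the tightest upper bound, which is how reward information gets propagated both forwards and backwards in time.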

Deep Attention Recurrent Q-Network

  • v_t = {v_t^1, ..., v_t^L}, v_t^i ∈ R^D, L = m·m, where m × m is the spatial size of the convolutional feature map and D is the number of feature maps.
  • g is a softmax function that produces attention probabilities over all the v_t^i. The context z_t can then be either soft attention (a weighted sum of the v_t^i) or hard attention (sampling a single v_t^i at each step); see the sketch below.
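
A small PyTorch sketch of the soft-attention branch (the parameterisation W_v, W_h, w_out is illustrative, not the paper's exact one): scores over the L locations go through the softmax g, and z_t is the resulting weighted sum of the v_t^i.

```python
import torch
import torch.nn.functional as F

def soft_attention(v, h, W_v, W_h, w_out):
    """Soft attention over the L = m*m feature-map vectors v_t^i.

    v:     (L, D) feature vectors from the CNN at time t.
    h:     (H,) previous LSTM hidden state.
    W_v:   (D, K), W_h: (H, K), w_out: (K,) -- illustrative attention weights.
    Returns the context z_t and the attention distribution over locations.
    """
    scores = torch.tanh(v @ W_v + h @ W_h) @ w_out  # (L,) unnormalised scores
    probs = F.softmax(scores, dim=0)                # g: attention probabilities
    z = probs @ v                                   # (D,) weighted sum = soft attention
    return z, probs
```

Hard attention would instead sample a single location from probs (e.g. with torch.multinomial) and return that one v_t^i, which then requires a REINFORCE-style gradient during training.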

Learning Purposeful Behaviour in the Absence of Rewards

  • Sutton et al. extended the RL framework by introducing temporally extended actions called options.
  • (i) storing the changes seen between two different time steps, (ii) clustering correlated changes in order to extract a purpose, (iii) learning policies capable of reproducing desired purposes, and (iv) transforming these policies into options that can be used to move the agent farther in the state space.
  • Store the change between the current state and the previous state, cluster these changes, extract the different eigenpurposes via singular value decomposition, and add the resulting options to a set (see the SVD sketch below).
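
A minimal NumPy sketch of the SVD step as I understand it (phi here stands in for whatever state representation is used): consecutive state-feature differences are stacked into a matrix, and its right singular vectors are the candidate eigenpurposes, each defining an intrinsic reward e . (phi(s') - phi(s)) whose maximizing policy becomes an option.

```python
import numpy as np

def eigenpurposes(phi_states):
    """Extract candidate eigenpurposes from a trajectory of state features.

    phi_states: (T+1, D) array of features phi(s_0), ..., phi(s_T).
    Returns the right singular vectors of the matrix of per-step changes;
    each row e defines an intrinsic reward r_e(s, s') = e . (phi(s') - phi(s)).
    """
    diffs = np.diff(phi_states, axis=0)              # changes between consecutive steps
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt                                        # rows = candidate eigenpurposes
```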

2017-02-20 13:36:34

THE PREDICTRON: END-TO-END LEARNING AND PLANNING

  • state representation
  • a model s', r, γ = m(s, β) that maps from internal state s to subsequent internal state s', internal rewards r, and internal discounts γ.
  • a value function v that outputs internal values v = v(s)
  • Finally, these internal rewards, discounts and values are combined together by an accumulator into an overall estimate of value g.
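
A tiny sketch of the accumulator (my own reconstruction): the k-step preturn g^k nests the internal rewards, discounts and the bootstrap value v_k; the paper additionally mixes the different g^k with learned λ-weights, which is omitted here.

```python
def predictron_k_step_return(rewards, discounts, values, k):
    """g^k = r_1 + gamma_1 * (r_2 + gamma_2 * (... + gamma_k * v_k)).

    rewards:   [r_1, ..., r_k]          internal rewards from the model m.
    discounts: [gamma_1, ..., gamma_k]  internal discounts from m.
    values:    [v_0, ..., v_k]          internal values of the abstract states.
    """
    g = values[k]                    # bootstrap from the internal value v_k
    for i in reversed(range(k)):     # fold the rewards/discounts back to step 1
        g = rewards[i] + discounts[i] * g
    return g

# Example: g^0 = v_0, g^1 = r_1 + gamma_1 * v_1, and so on.
```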

INCENTIVIZING EXPLORATION IN REINFORCEMENT LEARNING WITH DEEP PREDICTIVE MODELS

2017-02-20 16:18:54

  • The main contribution is a scalable and efficient method for assigning exploration bonuses in large RL problems with complex observations, as well as an extensive empirical evaluation of this approach and other simple alternative strategies, such as Boltzmann exploration and Thompson sampling.
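
A rough NumPy sketch of the general shape of such a bonus (the paper's encoder and normalisation details differ; names here are illustrative): the agent receives extra reward proportional to the prediction error of a learned dynamics model on the next state's encoding, and the bonus decays as training progresses.

```python
import numpy as np

def augmented_reward(r, phi_next, phi_pred, t, beta=1.0, decay_const=100.0):
    """Reward plus an exploration bonus from predictive-model error.

    phi_next: encoding of the actual next state, sigma(s_{t+1}).
    phi_pred: the dynamics model's prediction of that encoding.
    t:        current time step (t >= 1), used to decay the bonus.
    """
    error = np.sum((phi_next - phi_pred) ** 2)   # model "surprise" for this transition
    bonus = error / (t * decay_const)            # novelty bonus that shrinks over time
    return r + beta * bonus
```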