Deep RL Bootcamp Lecture 4A: Policy Gradients

In the policy gradient notation, the action "a" is usually written as "u".
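
For reference, the likelihood-ratio policy gradient written in this notation (standard REINFORCE form; H is the horizon and R(τ) is the trajectory return):

```latex
\nabla_\theta U(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log P(\tau;\theta)\, R(\tau)\right]
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(u_t \mid s_t)\, R(\tau)\right]
```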

Use this new form (the likelihood-ratio estimator) to estimate the gradient from sampled trajectories, i.e. how good a given update direction is.
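
A minimal sketch of how that estimator could be computed from samples (hypothetical PyTorch code; `policy`, the trajectory format, and all names here are illustrative assumptions, not the lecture's implementation):

```python
import torch

def reinforce_surrogate_loss(policy, trajectories):
    """Surrogate loss whose gradient is the likelihood-ratio policy gradient estimate.

    policy(obs) is assumed to return a torch.distributions.Distribution over actions u.
    Each trajectory is assumed to be a list of (obs, u, reward) tuples.
    """
    per_traj = []
    for traj in trajectories:
        R = sum(r for _, _, r in traj)                             # total return R(tau)
        logp = sum(policy(obs).log_prob(u) for obs, u, _ in traj)  # sum of log pi(u_t | s_t)
        per_traj.append(-logp * R)                                 # negate so minimizing ascends the return
    return torch.stack(per_traj).mean()                           # average over the m sampled trajectories
```

Calling `.backward()` on this loss yields the same gradient as the sample-based estimator above.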

If all three sampled paths show positive reward, should the policy increase the probability of all of the samples? (With the raw estimator it does, each in proportion to its return; subtracting a baseline, as below, makes only the better-than-average paths more likely.)
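
The standard fix, sketched here: subtract a baseline b (for example the average return of the sampled batch) from each trajectory's return:

```latex
\hat{g} = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta \log P\!\left(\tau^{(i)};\theta\right)\left(R(\tau^{(i)}) - b\right),
\qquad b = \frac{1}{m}\sum_{i=1}^{m} R\!\left(\tau^{(i)}\right)
```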

Monte Carlo estimate
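
Assuming this refers to estimating the value function used for variance reduction, the Monte Carlo estimate regresses V toward the full empirical return from each visited state:

```latex
V^\pi(s_t) \approx r_t + r_{t+1} + r_{t+2} + \cdots + r_{H-1}
```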

TD estimate
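
The TD estimate instead bootstraps off the current value-function approximation (standard TD(0) target, with discount γ):

```latex
V^\pi(s_t) \approx r_t + \gamma\, V^\pi_\phi(s_{t+1})
```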

It takes about 2 weeks to train with respect to the real-world time scale.

But it could be faster in a simulator (MuJoCo).

We don't know whether a set of hyperparameters is going to work until enough iterations have passed, so tuning is tricky; using a simulator can alleviate this problem.

Question: how do we transfer the knowledge the robot learned in simulation to real life if we are not sure how well the simulator matches the real world?

Randomly initialize many simulators and check the robustness of the algorithm, as sketched below.
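
A rough sketch of that idea (hypothetical Python; the parameter names and ranges are made up for illustration): resample the simulator's physics parameters at the start of each episode, so a policy that works across all of them is more likely to transfer to the real robot.

```python
import random

def make_randomized_sim(base_env_factory):
    """Build a simulator instance with randomly perturbed physics parameters."""
    env = base_env_factory()
    env.friction = random.uniform(0.5, 1.5)       # hypothetical friction range
    env.mass_scale = random.uniform(0.8, 1.2)     # hypothetical link-mass scaling
    env.motor_delay = random.uniform(0.0, 0.02)   # hypothetical actuation delay in seconds
    return env

# Training loop sketch: every episode sees a differently randomized simulator.
# for episode in range(num_episodes):
#     env = make_randomized_sim(base_env_factory)
#     collect_trajectory(policy, env)
```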

This video shows that even a robot built with two years of effort by a group of experts still isn't good at locomotion.

Hindsight Experience Replay (HER)

Marcin Andrychowicz from OpenAI

The program is set up to find the best way to get pizza, but when the agent instead finds an ice cream, it realizes that the ice cream, which corresponds to a higher reward, is exactly the thing it wants to get.
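
A rough sketch of the relabeling step behind this idea (hypothetical code; the transition record and reward function are assumptions, not the paper's interface): when the agent fails to reach the commanded goal, the outcome it actually achieved is stored as if it had been the goal, so the episode still yields a useful reward signal.

```python
from collections import namedtuple

# Hypothetical transition record; a real HER buffer would also store observations, actions, etc.
Transition = namedtuple("Transition", ["goal", "achieved_goal", "reward"])

def hindsight_relabel(episode, reward_fn):
    """Relabel an episode so the goal is whatever the agent actually ended up achieving."""
    final_outcome = episode[-1].achieved_goal      # e.g. the ice cream it actually reached
    relabeled = []
    for t in episode:
        new_reward = reward_fn(t.achieved_goal, final_outcome)   # reward w.r.t. the substituted goal
        relabeled.append(t._replace(goal=final_outcome, reward=new_reward))
    return relabeled
```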

 https://zhuanlan.zhihu.com/p/29486661

 https://zhuanlan.zhihu.com/p/31527085

Reposted from: https://www.cnblogs.com/ecoflex/p/8974602.html
