Topics in Reinforcement Learning

Table of Contents

What is Reinforcement Learning

Q-Learning and Policy Gradient

Q-Learning(value-based):

Policy Gradient/PPO/TRPO (policy-based):

Combination of the two:

Rule of Thumb for picking models?  

Applications

Potential Applications in Smart Vehicle (and challenges)

References


What is Reinforcement Learning

Traditional deep learning and machine learning: learn a static mapping X \rightarrow Y

 Reinforcement Learning: Sequential Decision Making

  • s\rightarrow a\rightarrow r\rightarrow s'
  • Markov assumption of states P(s_{t+1}|s_{t}, s_{t-1},...,a_{t}, a_{t-1},...) = P(s_{t+1}|s_{t}, a_{t})

General training flow of RL algorithms

RL objective:    \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t} r(s_{t}, a_{t})\right]
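To make the flow concrete, below is a minimal, self-contained sketch of the generic interaction loop that all RL algorithms build on (the toy environment and random policy are illustrative stand-ins, not any particular library's API): the agent observes s, picks a, receives r and the next state s', and the objective above is the discounted sum of those rewards.

```python
# Minimal sketch of the generic RL interaction loop.
# Assumptions: ToyEnv and random_policy below are illustrative stand-ins,
# not a real benchmark environment or a trained policy.
import random

class ToyEnv:
    """1-D chain with states 0..5; reaching state 5 gives reward 1 and ends the episode."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):                        # a in {0: left, 1: right}
        self.s = max(0, min(5, self.s + (1 if a == 1 else -1)))
        r = 1.0 if self.s == 5 else 0.0
        done = self.s == 5
        return self.s, r, done                # (s', r, terminal flag)

def random_policy(s):
    return random.choice([0, 1])

gamma = 0.99
env = ToyEnv()
s, done, ret, t = env.reset(), False, 0.0, 0
while not done and t < 100:
    a = random_policy(s)                      # s -> a
    s_next, r, done = env.step(a)             # a -> r, s'
    ret += (gamma ** t) * r                   # accumulate the discounted return from the objective above
    s, t = s_next, t + 1
print("discounted return of this rollout:", ret)
```

Different RL algorithms differ only in how the policy is represented and how it is updated from the collected (s, a, r, s') transitions.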

Q-Learning and Policy Gradient

Q-Learning(value-based):

  • value-based, bellman equation, policy/inference is implicit from Q function 
  • discrete action space only, because it uses a = \arg \max Q_{\theta}(s, a) 
  • Convergence is slow and can be unstable: the Q function must become accurate over all (s, a) pairs, which is especially hard for large state-action spaces
  • Off-policy: replay memory; data can be reused (high data efficiency); potential for offline learning
  • Tricks: separate target Q network, Double-DQN, Dueling DQN (a tabular sketch of the basic update follows this list)
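As a concrete illustration of the bullets above, here is a minimal tabular sketch of the off-policy Q-learning update with a replay buffer and a separate target table (the state/action sizes and the randomly generated transitions are made up for illustration; a DQN would replace the tables with neural networks and gradient steps):

```python
# Tabular sketch of the Q-learning update: Bellman target, replay memory,
# and a periodically synced target table (stand-in for a target network).
# Assumptions: the toy chain dynamics below are invented for illustration.
import random
import numpy as np

n_states, n_actions, gamma, lr = 6, 2, 0.99, 0.1
Q = np.zeros((n_states, n_actions))            # online Q table
Q_target = Q.copy()                            # separate target Q (synced periodically)
replay = []                                    # off-policy replay memory

# Fill the replay memory with transitions (s, a, r, s', done) from a toy chain.
for _ in range(500):
    s, a = random.randrange(n_states), random.randrange(n_actions)
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r, done = (1.0, True) if s2 == n_states - 1 else (0.0, False)
    replay.append((s, a, r, s2, done))

for step in range(2000):
    s, a, r, s2, done = random.choice(replay)            # reuse stored data (off-policy)
    # Bellman target: r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal states.
    target = r + (0.0 if done else gamma * Q_target[s2].max())
    Q[s, a] += lr * (target - Q[s, a])                   # move Q(s, a) toward the target
    if step % 100 == 0:
        Q_target = Q.copy()                              # sync the target table

print("implicit greedy policy, a = argmax_a Q(s, a):", Q.argmax(axis=1))
```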

Policy Gradient/PPO/TRPO (policy-based):

Loss function

  • Directly optimize the policy \pi(a|s), based on observed future rewards from data
  • with Actor-Critic structure (introducing V(s)), can introduce advantage function
    •  A(s, a) = Q(s, a) - V(s)
  • Generalized Advantage Estimation (GAE): an exponentially weighted sum of TD errors, analogous to the TD(\lambda) return (see the sketch after this list)
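A small sketch of the pieces above: compute GAE advantages from one rollout with a critic V(s), then plug them into the basic policy-gradient loss (the reward, value, and action-probability numbers are placeholders; in practice they come from sampled trajectories and learned networks, and PPO/TRPO would additionally constrain or clip the policy update):

```python
# Sketch: Generalized Advantage Estimation (GAE) + advantage-weighted
# policy-gradient loss.
# Assumptions: rewards, values, and action probabilities are placeholder
# numbers standing in for rollout data and a learned actor/critic.
import numpy as np

gamma, lam = 0.99, 0.95
rewards = np.array([0.0, 0.0, 1.0])          # r_t from one short rollout
values  = np.array([0.5, 0.6, 0.8, 0.0])     # V(s_t), with V(s_T) = 0 appended

# TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = rewards + gamma * values[1:] - values[:-1]

# GAE: A_t = sum_l (gamma * lam)^l * delta_{t+l}, computed backwards in time.
adv, running = np.zeros_like(rewards), 0.0
for t in reversed(range(len(rewards))):
    running = deltas[t] + gamma * lam * running
    adv[t] = running

log_probs = np.log(np.array([0.6, 0.5, 0.7]))   # log pi(a_t | s_t) of the taken actions
pg_loss = -(log_probs * adv).mean()             # minimizing this ascends E[log pi * A]
print("advantages:", adv, "policy-gradient loss:", pg_loss)
```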

Combination of the two:

        DDPG, SAC etc

Trade-offs Between Policy Optimization and Q-Learning. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training Q_{\theta} to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.

Side note: Monte Carlo vs. TD(\lambda)
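For reference, the standard textbook definitions side by side:

  • Monte-Carlo return (unbiased, high variance): G_t^{MC} = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
  • n-step TD target (bootstraps from V, lower variance but biased): G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})
  • TD(\lambda) return (interpolates between the two; \lambda = 1 recovers Monte Carlo, \lambda = 0 recovers the one-step TD target): G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}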

Rule of Thumb for picking models?  

1. data availability and cost

2. Online or offline data

3. state and action spaces: size? continuous or discrete?

4. Multiple optima: do you want one solution or all of them?

5. model structure, computing powers etc

Applications

  • Video Games: OpenAI gym, Atari, AlphaGo etc

Google Research Football Academy 3 vs 1, trained with MAPPO



Potential Applications in Smart Vehicle (and challenges)

  • Autonomous driving: real-world data is offline and expensive to collect, with little exploration of dangerous situations
  • Car assembly and design: inspired by Learning to Design and Construct Bridge without Blueprint (https://arxiv.org/pdf/2108.02439.pdf)

 Towards a truly autonomous general-purpose robotics system, the robot should not just be capable of understanding how to accomplish a human-decided target, but should ultimately be able to deduce what to produce to help address sophisticated real-world problems, possibly beyond human capabilities.

  • Human-vehicle interaction: the vehicle can take a variety of actions and can collect rich multi-modal data.
    • Mainstream research focuses on recognizing the driver's state and needs (e.g., drowsy driving) with multi-modal supervised learning, while the vehicle's response is rule-based or driven by recommendation models (see 智能座舱系列文一,他到底是什么? on 极术社区).
    • RL has the potential of learning how the car should react. RL can learn a generalized action strategy from common offline data, then continuously adapt its actions to a specific driver's preferences in a local mode (akin to transfer learning).
    • RL can act on sequential data: think of drowsy-driving detection.

References

A (Long) Peek into Reinforcement Learning | Lil'Log

Policy Gradient Algorithms | Lil'Log

OpenAI Spinning Up
