Table of Contents
What is Reinforcement Learning
Q-Learning and Policy Gradient
Policy Gradient/PPO/TRPO (policy-based)
Rule of Thumb for picking models?
Potential Applications in Smart Vehicle (and challenges)
What is Reinforcement Learning
Traditional deep learning and machine learning: learn a one-shot mapping X → Y
Reinforcement Learning: Sequential Decision Making
- Markov assumption on states: the next state depends only on the current state and action
General training flow of RL algorithms
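The notes don't spell the flow out, so here is a minimal sketch of the collect-and-update loop most RL algorithms share; `env` and `agent` are hypothetical stand-ins for a Gym-style environment and any learner, not a specific library API:

```python
# Generic RL training loop (sketch). `env` and `agent` are hypothetical
# interfaces: env follows the Gym reset/step convention, and agent wraps
# whatever learner (DQN, PPO, ...) is being trained.
def train(env, agent, num_episodes=1000):
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)              # explicit policy, or greedy w.r.t. Q
            next_state, reward, done = env.step(action)
            agent.observe(state, action, reward, next_state, done)  # buffer transition
            agent.update()                         # gradient step / Bellman backup
            state = next_state
```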
RL objective: maximize the expected cumulative (discounted) reward, as formalized below
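In standard notation (my addition; this matches the usual textbook definition), the objective is the expected discounted return over trajectories:

```latex
% Maximize expected discounted return over trajectories \tau sampled from \pi_\theta
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\theta^{*} = \arg\max_{\theta} J(\theta)
```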
Q-Learning and Policy Gradient
Q-Learning (value-based):
- Value-based: built on the Bellman equation; the policy/inference is implicit, i.e., act greedily with respect to the learned Q function
- Discrete action spaces only, because the Bellman target takes a max over all actions (see the sketch after this list)
- Convergence is slow and unstable: Q must converge for all (s, a) pairs, which is especially hard for a large (s, a) space
- Off-policy: uses a replay memory; data can be reused (high data efficiency); potential for offline learning
- Tricks: separate target Q network, Double DQN, Dueling DQN
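A minimal tabular sketch of the core update (my illustration on a toy chain MDP, not from the notes); it shows why the max over actions restricts Q-learning to discrete action spaces and why the behavior policy can differ from the greedy target policy:

```python
# Tabular Q-learning on a toy chain MDP: states 0..N-1, action 0 = left,
# action 1 = right, reward 1 only when reaching the rightmost state.
import numpy as np

N_STATES, N_ACTIONS = 6, 2
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def step(s, a):
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(0)
for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy (off-policy w.r.t. the greedy target)
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Bellman backup: the TD target takes a max over next actions,
        # which is why plain Q-learning needs a discrete action space
        target = r + (0.0 if done else GAMMA * Q[s_next].max())
        Q[s, a] += ALPHA * (target - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # implicit greedy policy recovered from Q
```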
Policy Gradient/PPO/TRPO (policy-based):
Loss function: the policy-gradient loss, whose gradient is ∇_θ J(θ) = E[∇_θ log π_θ(a|s) · A(s, a)]
- Directly optimizes the policy π_θ, based on observed future rewards in the data
- With an actor-critic structure (introducing a learned value function V(s) as the critic), the raw return can be replaced by an advantage function
- Generalized advantage estimation (GAE): an exponentially weighted sum of one-step TD errors, analogous to TD(λ); see the sketch after this list
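A small sketch of GAE under standard assumptions (γ is the discount, λ the bias-variance knob); the function and the test values are illustrative, not from the notes:

```python
# Generalized Advantage Estimation: a lambda-weighted sum of one-step TD errors.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1 (bootstrap value for the final state)."""
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD(0) error
        running = delta + gamma * lam * running                 # accumulate (gamma*lam)^l weights
        advantages[t] = running
    return advantages

# lam=0 reduces to the one-step TD error; lam=1 to the Monte Carlo advantage.
adv = gae(rewards=np.array([0.0, 0.0, 1.0]), values=np.array([0.1, 0.2, 0.5, 0.0]))
print(adv)
```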
Combination of the two:
DDPG, SAC, etc.: actor-critic methods that learn an off-policy Q function as the critic while directly optimizing the policy as the actor (update equations sketched below)
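For reference, the standard DDPG updates (textbook form, my addition) show the combination: a Q-learning-style critic regression plus a direct policy objective for the actor, where Q' and μ' denote the target networks:

```latex
% Critic: Bellman regression toward a target built from target networks Q', \mu'
L(\phi) = \mathbb{E}\Big[\big(Q_\phi(s,a) - (r + \gamma\, Q'_{\phi'}(s', \mu'_{\theta'}(s')))\big)^2\Big],
\qquad
% Actor: ascend the critic through the deterministic policy \mu_\theta
\max_\theta\; \mathbb{E}\big[Q_\phi(s, \mu_\theta(s))\big]
```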
Trade-offs Between Policy Optimization and Q-Learning
The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.
Side note: Monte Carlo vs. TD(λ)
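In standard notation (my addition), the two differ in how far the value target bootstraps, and TD(λ) interpolates between the extremes via the λ-return, where G_t^{(n)} is the n-step return:

```latex
% Targets for estimating V(s_t): full return vs. one-step bootstrap vs. lambda-return
\begin{aligned}
G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots && \text{(Monte Carlo)}\\
G_t^{(1)} &= r_t + \gamma V(s_{t+1}) && \text{(TD(0))}\\
G_t^{\lambda} &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)} && \text{(TD($\lambda$))}
\end{aligned}
```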
Rule of Thumb for picking models?
1. Data availability and cost
2. Online or offline data
3. State and action spaces: how large? Continuous or discrete?
4. Multiple optima: do you want one solution or all of them?
5. Model structure, computing power, etc.
Applications
- Video games: OpenAI Gym, Atari, AlphaGo, etc.
Google Research Football, Academy 3 vs. 1 scenario, trained with MAPPO
- Robotics: MuJoCo, picking and sorting objects; see Proximal Policy Optimization
- Smart Factory/inventory management
- Traffic-light planning
- DiDi order dispatch: centralized planning with values estimated offline (via Bellman backups); see "Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms", Proceedings of the 24th ACM SIGKDD Conference (KDD 2018)
- Finance: portfolio management: https://github.com/AI4Finance-Foundation/FinRL
- Others: Emergent Tool Use from Multi-Agent Interaction
Potential Applications in Smart Vehicle (and challenges)
- Autonomous driving: real-world data is offline and expensive to collect, with little exploration of dangerous situations
- Car assembly and design: inspired by "Learning to Design and Construct Bridge without Blueprint" (https://arxiv.org/pdf/2108.02439.pdf)
The paper argues that "towards a truly autonomous general-purpose robotics system, the robot should not just be capable of understanding how to accomplish a human-decided target but also ultimately be able to deduce what to produce to help address sophisticated real-world problems possibly beyond human capabilities"
- Human-vehicle interaction: the vehicle can take a variety of actions and may collect a variety of multi-modal data.
- Mainstream research focuses on using multi-modal supervised learning to recognize the driver's state and needs (e.g., drowsy driving), while the vehicle's response is rule-based or driven by a recommendation model (see the 智能座舱 series on 极术社区)
- RL has the potential to learn how the car should react: it can learn a generalized action strategy from common offline data, then continuously adapt its actions to a specific driver's preferences in a local mode (akin to transfer learning)
- RL can act on sequential data: think of drowsy-driving detection
References
A (Long) Peek into Reinforcement Learning | Lil'Log
Policy Gradient Algorithms | Lil'Log
OpenAI Spinning Up