Topics in Reinforcement Learning

Table of Contents

What is Reinforcement Learning

Q-Learning and Policy Gradient

Q-Learning(value-based):

Policy Gradient/PPO/TRPO (policy-based):

Combination of the two:

Rule of Thumb for picking models?  

Applications

Potential Applications in Smart Vehicle (and challenges)

References


What is Reinforcement Learning

Traditional deep learning and machine learning: learn a static mapping X \rightarrow Y

 Reinforcement Learning: Sequential Decision Making

  • s\rightarrow a\rightarrow r\rightarrow s'
  • Markov assumption of states P(s_{t+1}|s_{t}, s_{t-1},...,a_{t}, a_{t-1},...) = P(s_{t+1}|s_{t}, a_{t})

General training flow of RL algorithms

RL objective:    \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t} r(s_{t}, a_{t})\right]
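To make the flow concrete, below is a minimal, self-contained sketch of the generic interaction loop that all RL algorithms build on (the toy environment and random policy are illustrative stand-ins, not any particular library's API): the agent observes s, picks a, receives r and the next state s', and the objective above is the discounted sum of those rewards.

```python
# Minimal sketch of the generic RL interaction loop.
# Assumptions: ToyEnv and random_policy below are illustrative stand-ins,
# not a real benchmark environment or a trained policy.
import random

class ToyEnv:
    """1-D chain with states 0..5; reaching state 5 gives reward 1 and ends the episode."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):                        # a in {0: left, 1: right}
        self.s = max(0, min(5, self.s + (1 if a == 1 else -1)))
        r = 1.0 if self.s == 5 else 0.0
        done = self.s == 5
        return self.s, r, done                # (s', r, terminal flag)

def random_policy(s):
    return random.choice([0, 1])

gamma = 0.99
env = ToyEnv()
s, done, ret, t = env.reset(), False, 0.0, 0
while not done and t < 100:
    a = random_policy(s)                      # s -> a
    s_next, r, done = env.step(a)             # a -> r, s'
    ret += (gamma ** t) * r                   # accumulate the discounted return from the objective above
    s, t = s_next, t + 1
print("discounted return of this rollout:", ret)
```

Different RL algorithms differ only in how the policy is represented and how it is updated from the collected (s, a, r, s') transitions.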

Q-Learning and Policy Gradient

Q-Learning(value-based):

  • value-based, bellman equation, policy/inference is implicit from Q function 
  • discrete action space only, because it uses a = \arg \max Q_{\theta}(s, a) 
  • Convergence is slow and can be unstable: the Q function must become accurate over all (s, a) pairs, which is especially hard for large state-action spaces
  • Off-policy: replay memory; data can be reused (high data efficiency); potential for offline learning
  • Tricks: separate target Q network, Double-DQN, Dueling DQN (a tabular sketch of the basic update follows this list)
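As a concrete illustration of the bullets above, here is a minimal tabular sketch of the off-policy Q-learning update with a replay buffer and a separate target table (the state/action sizes and the randomly generated transitions are made up for illustration; a DQN would replace the tables with neural networks and gradient steps):

```python
# Tabular sketch of the Q-learning update: Bellman target, replay memory,
# and a periodically synced target table (stand-in for a target network).
# Assumptions: the toy chain dynamics below are invented for illustration.
import random
import numpy as np

n_states, n_actions, gamma, lr = 6, 2, 0.99, 0.1
Q = np.zeros((n_states, n_actions))            # online Q table
Q_target = Q.copy()                            # separate target Q (synced periodically)
replay = []                                    # off-policy replay memory

# Fill the replay memory with transitions (s, a, r, s', done) from a toy chain.
for _ in range(500):
    s, a = random.randrange(n_states), random.randrange(n_actions)
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    r, done = (1.0, True) if s2 == n_states - 1 else (0.0, False)
    replay.append((s, a, r, s2, done))

for step in range(2000):
    s, a, r, s2, done = random.choice(replay)            # reuse stored data (off-policy)
    # Bellman target: r + gamma * max_a' Q_target(s', a'); no bootstrap at terminal states.
    target = r + (0.0 if done else gamma * Q_target[s2].max())
    Q[s, a] += lr * (target - Q[s, a])                   # move Q(s, a) toward the target
    if step % 100 == 0:
        Q_target = Q.copy()                              # sync the target table

print("implicit greedy policy, a = argmax_a Q(s, a):", Q.argmax(axis=1))
```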

Policy Gradient/PPO/TRPO (policy-based):

Loss function

  • Directly optimize the policy \pi(a|s), based on observed future rewards from data
  • with Actor-Critic structure (introducing V(s)), can introduce advantage function
    •  A(s, a) = Q(s, a) - V(s)
  • Generalized Advantage Estimation (GAE): an exponentially weighted sum of TD errors, analogous to the TD(\lambda) return (see the sketch after this list)
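A small sketch of the pieces above: compute GAE advantages from one rollout with a critic V(s), then plug them into the basic policy-gradient loss (the reward, value, and action-probability numbers are placeholders; in practice they come from sampled trajectories and learned networks, and PPO/TRPO would additionally constrain or clip the policy update):

```python
# Sketch: Generalized Advantage Estimation (GAE) + advantage-weighted
# policy-gradient loss.
# Assumptions: rewards, values, and action probabilities are placeholder
# numbers standing in for rollout data and a learned actor/critic.
import numpy as np

gamma, lam = 0.99, 0.95
rewards = np.array([0.0, 0.0, 1.0])          # r_t from one short rollout
values  = np.array([0.5, 0.6, 0.8, 0.0])     # V(s_t), with V(s_T) = 0 appended

# TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = rewards + gamma * values[1:] - values[:-1]

# GAE: A_t = sum_l (gamma * lam)^l * delta_{t+l}, computed backwards in time.
adv, running = np.zeros_like(rewards), 0.0
for t in reversed(range(len(rewards))):
    running = deltas[t] + gamma * lam * running
    adv[t] = running

log_probs = np.log(np.array([0.6, 0.5, 0.7]))   # log pi(a_t | s_t) of the taken actions
pg_loss = -(log_probs * adv).mean()             # minimizing this ascends E[log pi * A]
print("advantages:", adv, "policy-gradient loss:", pg_loss)
```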

Combination of the two:

        DDPG, SAC etc

Trade-offs Between Policy Optimization and Q-Learning. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training Q_{\theta} to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. [1] But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.

Side note: Monte Carlo vs. TD(\lambda)
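For reference, the standard textbook definitions side by side:

  • Monte-Carlo return (unbiased, high variance): G_t^{MC} = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
  • n-step TD target (bootstraps from V, lower variance but biased): G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n})
  • TD(\lambda) return (interpolates between the two; \lambda = 1 recovers Monte Carlo, \lambda = 0 recovers the one-step TD target): G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}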

Rule of Thumb for picking models?  

1. data availability and cost

2. Online or offline data

3. state and action spaces: size? continuous or discrete?

4. Multiple optima: do you want one solution or all of them?

5. model structure, computing powers etc

Applications

  • Video Games: OpenAI gym, Atari, AlphaGo etc

Google Research Football Academy 3 vs 1, trained with MAPPO



Potential Applications in Smart Vehicle (and challenges)

  • Autonomous driving: real-world data is offline and expensive to collect, with little exploration of dangerous situations
  • Car assembly and design: inspired by Learning to Design and Construct Bridge without Blueprint (https://arxiv.org/pdf/2108.02439.pdf)

 Towards a truly autonomous general-purpose robotics system, the robot should not just be capable of understanding how to accomplish a human-decided target, but should ultimately be able to deduce what to produce to help address sophisticated real-world problems, possibly beyond human capabilities.

  • Human-vehicle interaction: the vehicle can take a variety of actions and can collect rich multi-modal data.
    • Mainstream research focuses on recognizing the driver's state and needs (e.g., drowsy driving) with multi-modal supervised learning, while the vehicle's response is rule-based or driven by recommendation models (see 智能座舱系列文一,他到底是什么? on 极术社区).
    • RL has the potential of learning how the car should react. RL can learn a generalized action strategy from common offline data, then continuously adapt its actions to a specific driver's preferences in a local mode (akin to transfer learning).
    • RL can act on sequential data: think of drowsy-driving detection.

References

A (Long) Peek into Reinforcement Learning | Lil'Log

Policy Gradient Algorithms | Lil'Log

OpenAI Spinning Up
