I. Components
Environment
State
- The information used to decide what happens next
  • Divided into Environment State and Agent State
- Environment State
  • Reflects what has changed in the environment
  • The environment's own state and the state it feeds back to the Agent are not necessarily the same
- Agent State
  • A representation of the state the Agent is currently in
  • The state that RL uses
- Observation
  • An observation of the state
  • May not be identical to the state itself
▸ History is the sequence of all Actions, States, and Rewards
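In standard notation, the history up to time t and the agent state built from it can be written as:

$$H_t = O_1, R_1, A_1,\ \ldots,\ A_{t-1}, O_t, R_t$$
$$S_t = f(H_t)$$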
Reward
- A scalar number that the environment sends to the Agent at each time step
- Defines the goal in a reinforcement learning problem
- Defines what the good and bad events are for the Agent
- Is the immediate and defining feature of the problem the Agent faces; it expresses the immediate, intrinsic desirability of the environment state
- The Agent cannot change the function that generates the reward signal; that is, it cannot change the problem it faces
- Is the primary basis for changing the policy
- Is, in general, a stochastic function of the environment state and the action taken
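In symbols, the expected reward for taking action a in state s is

$$r(s, a) = \mathbb{E}\left[R_{t+1} \mid S_t = s,\ A_t = a\right]$$

while the realized reward $R_{t+1}$ itself may be random, which is the "stochastic function" in the last bullet.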
Agent: an Agent consists of three parts
- Policy
- Value function
- Model
- The three need not all be present at the same time
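In standard notation: a policy is a distribution over actions given the state, a value function is the expected discounted return under that policy, and a model predicts the environment's dynamics:

$$\pi(a \mid s) = \Pr[A_t = a \mid S_t = s]$$
$$v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$
$$p(s', r \mid s, a) = \Pr[S_{t+1} = s',\ R_{t+1} = r \mid S_t = s,\ A_t = a]$$

This is why the three need not coexist: value-based methods keep only a value function, policy-based methods keep only a policy, and model-free methods drop the model entirely (see the classification below).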
Sequential Decision Making
II. Classification
Value Based
- No Policy (Implicit)
- Value Function
Policy Based
- Policy
- No Value Function
Actor Critic
- Policy
- Value Function
Model Free
- Policy and/or Value Function
- No Model
Model Based
- Policy and/or Value Function
- Model
Two fundamental problems in sequential decision making
- Learning
  • The environment is initially unknown
  • The Agent does not know how the environment works
  • The Agent gradually improves its Policy by interacting with the environment
- Planning
  • How the environment works is known, or approximately known, to the Agent
  • The Agent does not actually interact with the environment
  • The Agent computes with the Model it has built, and improves its Policy on that basis
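A minimal sketch of the contrast in Python: learning needs only sampled transitions from the environment, while planning needs no interaction at all, only a known model. The `env` interface (`reset()`, `step(a)`, `n_actions`) and the model arrays `P`, `R` are assumed, illustrative names.

```python
import numpy as np

def td0_learning(env, V, alpha=0.1, gamma=0.99, episodes=100):
    """Learning: estimate V purely from sampled interaction; no model needed."""
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(env.n_actions)   # some behavior policy
            s2, r, done = env.step(a)              # real experience
            V[s] += alpha * (r + gamma * V[s2] * (not done) - V[s])
            s = s2
    return V

def value_iteration(P, R, gamma=0.99, iters=1000):
    """Planning: compute V by pure computation on a known model.
    P[s, a, s'] is the transition probability, R[s, a] the expected reward."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.max(R + gamma * (P @ V), axis=1)    # Bellman optimality backup
    return V
```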
Markov Decision Processes
Bellman Equation
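The Bellman expectation equation writes $v_\pi$ recursively in terms of its successor states:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$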
Optimal Policies and Optimal Value Functions
Monte Carlo Methods
Dynamic Programming and Temporal Difference
Dynamic Programming
Temporal-Difference Learning
1)Q-learning: Off-policy TD Control
Double Q-Learning
2)Sarsa: On-policy TD Control
Expected Sarsa
n-step Sarsa
Off-policy n-step Sarsa
Off-policy n-step Expected Sarsa
3)Maximization Bias
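A minimal sketch of the update rules behind 1) to 3), assuming tabular Q arrays and a single transition `(s, a, r, s2, a2)`; `alpha` and `gamma` are the step size and discount:

```python
import numpy as np

alpha, gamma = 0.1, 0.99        # step size and discount factor

def q_learning_update(Q, s, a, r, s2):
    # Off-policy: bootstrap from the greedy action in s2,
    # regardless of what the behavior policy actually does next
    Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])

def sarsa_update(Q, s, a, r, s2, a2):
    # On-policy: bootstrap from a2, the action the current policy really takes in s2
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def double_q_update(Q1, Q2, s, a, r, s2):
    # Double Q-learning: one table selects the action, the other evaluates it,
    # which reduces the maximization bias of the single max in Q-learning
    if np.random.rand() < 0.5:
        a_star = np.argmax(Q1[s2])
        Q1[s, a] += alpha * (r + gamma * Q2[s2, a_star] - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s2])
        Q2[s, a] += alpha * (r + gamma * Q1[s2, a_star] - Q2[s, a])
```

For Double Q-learning, the behavior policy is usually epsilon-greedy with respect to Q1 + Q2.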
Integrating Learning and Planning
1)Model-Based RL
2)Sample-Based Planning
3)Dyna
Dyna-Q with an Inaccurate Model
4)On-line Planning
5)Inaccurate Model
6)Prioritized Sweeping
7)Trajectory Sampling
8)Real-time Dynamic Programming (RTDP)
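A minimal tabular Dyna-Q sketch for item 3) above, assuming a deterministic environment and a hypothetical `env` interface (`reset()` returns a state; `step(a)` returns `(s2, r, done)`); after each real step, `n` planning updates replay transitions from the learned model:

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, n=10, alpha=0.1, gamma=0.95,
           eps=0.1, steps=10_000):
    """Tabular Dyna-Q: each real step is followed by n simulated planning steps."""
    Q = np.zeros((n_states, n_actions))
    model = {}                              # (s, a) -> (r, s2); deterministic model
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection from the current Q
        a = random.randrange(n_actions) if random.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = env.step(a)           # real experience
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        model[(s, a)] = (r, s2)             # learn the model from experience
        for _ in range(n):                  # planning: replay simulated transitions
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]       # terminal handling simplified here
            Q[ps, pa] += alpha * (pr + gamma * np.max(Q[ps2]) - Q[ps, pa])
        s = env.reset() if done else s2
    return Q
```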
Value Function Approximation
1)Gradient Descent
Stochastic Gradient Descent
Semi-gradient Methods
2)Linear Function Approximation
Feature Vectors
Linear methods
3)Prediction Algorithms
Monte-Carlo with Value Function Approximation
TD Learning with Value Function Approximation
4)Off-policy Methods with Approximation
- Off policy: able to improve the policy without generating new samples from that policy
- On policy: each time the policy is changed, even a little bit, we need to generate new samples
5)Feature Construction for Linear Methods
Polynomials
Fourier Basis
Coarse Coding
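For the linear case above, with feature vector $\mathbf{x}(s)$ and weights $\mathbf{w}$, the value estimate and the semi-gradient TD(0) update take the standard form:

$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$$
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\left[R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\right]\mathbf{x}(S_t)$$

The update is called semi-gradient because it differentiates only through $\hat{v}(S_t, \mathbf{w})$ and treats the bootstrapped target as a constant.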
n-step Bootstrapping
Eligibility traces
Deep Reinforcement Learning
1)Imitation Learning
Value-based
Model-based
Policy gradients
Actor-critic
Deep RL with Q-Functions
Fitted value iteration
DQN algorithms
Advanced Policy Gradient
Policy Performance Bounds
Monotonic Improvement Theory
Approximate Monotonic Improvement
Natural Policy Gradient
Trust Region Policy Optimization
Proximal Policy Optimization(PPO)
Model-based Reinforcement Learning
Trajectory Optimization
Model-Predictive Control
Inverse Reinforcement Learning
Variational Inference
Probabilistic Graphical Model
Guided Cost Learning
Online resources: