1. Components


Environment



State



- The information used to decide what to do next



• Divided into Environment State and Agent State



- Environment State



• Reflects what has changed in the environment



• The environment's own state and the state it feeds back to the agent are not necessarily the same



- Agent State



• A representation of the state the Agent is currently in



• The state actually used by RL



- Observation: what the agent observes of the state



• May not be the same as the environment state



History is the sequence of all Actions, States, and Rewards
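
A minimal sketch (toy environment and random placeholder policy, not from the course) of the agent-environment loop that generates this history, and of the agent state as a function of it:

```python
# Agent-environment loop building the history H_t = (S_0, A_0, R_1, S_1, A_1, R_2, ...).
# ToyEnv and the random policy are made-up placeholders for illustration only.
import random

class ToyEnv:
    """A 1-D corridor: states 0..4, start at 2, reward 1 for reaching either end."""
    def reset(self):
        self.pos = 2
        return self.pos                      # observation sent to the agent

    def step(self, action):                  # action in {-1, +1}
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if self.pos in (0, 4) else 0.0
        done = self.pos in (0, 4)
        return self.pos, reward, done

env = ToyEnv()
obs = env.reset()
history = []                                 # sequence of (state, action, reward)
done = False
while not done:
    action = random.choice([-1, 1])          # placeholder policy
    next_obs, reward, done = env.step(action)
    history.append((obs, action, reward))
    obs = next_obs

# The agent state used by RL is some function of this history; in a Markov
# environment like this toy one, the latest observation already suffices.
agent_state = obs
print(history, agent_state)
```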



Reward



- A scalar number the environment sends to the Agent at each time step



- Defines the goal of the reinforcement learning problem



- Defines which events are good and which are bad for the Agent



- The immediate and defining feature of the problem the Agent faces, an immediate and intrinsic characteristic of the environment state





- The Agent cannot change the function that generates the reward signal, i.e., it cannot change the problem it faces



- The primary basis for changing the policy



- In general, a stochastic function of the environment state and the action taken
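
In standard MDP notation (not spelled out in these notes), the bullets above amount to:

```latex
% Reward as a (possibly stochastic) function of the environment state and the
% action taken, and the discounted return G_t that defines the goal of the problem.
\begin{align*}
  R_{t+1} &\sim p(\,r \mid S_t = s,\; A_t = a\,), &
  r(s,a)  &= \mathbb{E}\!\left[R_{t+1} \mid S_t = s,\, A_t = a\right] \\
  G_t     &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
           = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, &
  \gamma  &\in [0, 1]
\end{align*}
```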



Agent: an Agent is made up of three components


- Policy


- Value Function


- Model


- They do not all have to be present at the same time
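
A minimal sketch, with hypothetical names, of how the three optional components might be laid out: a purely value-based agent keeps only the value function, a Dyna-style agent also keeps a model.

```python
# Hypothetical layout of an Agent's optional components (illustrative only).
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

State = int
Action = int

@dataclass
class Agent:
    policy: Optional[Callable[[State], Action]] = None                        # maps state -> action
    value_fn: Optional[Dict[Tuple[State, Action], float]] = None              # Q(s, a) (or V(s))
    model: Optional[Dict[Tuple[State, Action], Tuple[State, float]]] = None   # (s, a) -> (s', r)

value_based_agent = Agent(value_fn={})            # no explicit policy, no model
policy_based_agent = Agent(policy=lambda s: 0)    # no value function
dyna_agent = Agent(policy=lambda s: 0, value_fn={}, model={})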



Sequential Decision Making




2. Classification


Value Based


- No explicit Policy (implicit; see the sketch below)


- Value Function


Policy Based


- Policy


- No Value Function


Actor Critic


- Policy


- Value Function
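
To illustrate the "implicit policy" point under Value Based: the policy is simply derived from the value function at decision time, so nothing beyond Q needs to be stored. A toy sketch (names made up, not course code):

```python
# A value-based agent stores no policy object; acting greedily (or
# epsilon-greedily) with respect to Q *is* the policy.
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]
Q = defaultdict(float)                       # Q[(state, action)] -> estimated value

def implicit_policy(state, epsilon=0.1):
    """Epsilon-greedy w.r.t. Q: the only 'policy' a value-based agent needs."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                    # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit: argmax_a Q(s, a)
```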



Model Free


- Policy and/or Value Function


- No Model


Model Based


- Policy and/or Value Function


- Model



Two fundamental problems in sequential decision making


- Learning


• The environment is initially unknown


• The Agent does not know how the environment works


• The Agent gradually improves its Policy by interacting with the environment



- Planning


• How the environment works is known, or approximately known, to the Agent


• The Agent does not actually interact with the environment


• The Agent computes with the Model it has built and improves its Policy on that basis
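
A toy sketch of pure planning, assuming a small known model: the value estimates (and hence the greedy policy) improve by computation alone, with no real interaction. This is illustrative, not the course's code.

```python
# Planning with a known model: value-iteration sweeps over simulated
# transitions only. Moving to (or staying in) state 3 yields reward 1.
GAMMA = 0.9
STATES = [0, 1, 2, 3]
ACTIONS = [-1, +1]

def model(s, a):
    """Known model of the environment: returns (next_state, reward)."""
    s_next = max(0, min(3, s + a))
    return s_next, (1.0 if s_next == 3 else 0.0)

V = {s: 0.0 for s in STATES}
for _ in range(50):                          # each sweep is a planning step
    for s in STATES:
        V[s] = max(model(s, a)[1] + GAMMA * V[model(s, a)[0]] for a in ACTIONS)

# Greedy policy read off the planned values (one-step lookahead in the model).
greedy_policy = {s: max(ACTIONS, key=lambda a: model(s, a)[1] + GAMMA * V[model(s, a)[0]])
                 for s in STATES}
print(V, greedy_policy)
```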


Markov Decision Processes

Bellman Equation

Optimal Policies and Optimal Value Functions

Monte Carlo Methods

Dynamic Programming and Temporal-Difference Learning


Dynamic Programming(动态规划)
Temporal-Difference Learning(时序差分学习)

1)Q-learning: Off-policy TD Control

Double Q-Learning

2)Sarsa: On-policy TD Control

Expected Sarsa

n-step Sarsa

Off-policy n-step Sarsa

Off-policy n-step Expected Sarsa

3)Maximization Bias


Integrating Learning and Planning


1)Model-Based RL


2)Sample-Based Planning


3)Dyna


Dyna-Q with an Inaccurate Model


4)On-line Planning


5)Inaccurate Model


6)Prioritized Sweeping






7) Trajectory Sampling


8)Real-time Dynamic Programming (RTDP)



Value Function Approximation



1)Gradient Descent


Stochastic Gradient Descent


Semi-gradient Methods


2)Linear Function Approximation


Feature Vectors


Linear methods


3)Prediction Algorithms


Monte-Carlo with Value Function Approximation


TD Learning with Value Function Approximation


4)Off-policy Methods with Approximation


- Off-policy: able to improve the policy without generating new samples from that policy
- On-policy: each time the policy is changed, even a little bit, we need to generate new samples
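
The contrast is easiest to see in the tabular TD updates listed earlier (Q-learning vs. Sarsa); the sketch below uses made-up names and a dict keyed by (state, action), e.g. a defaultdict(float).

```python
# Off-policy (Q-learning): the target bootstraps from max_a' Q(s', a'),
# regardless of which action the behavior policy actually takes next.
# On-policy (Sarsa): the target uses the action a' the current policy actually
# selects, so changing the policy changes which samples are relevant.
ALPHA, GAMMA = 0.1, 0.99

def q_learning_update(Q, s, a, r, s_next, actions):
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)   # off-policy target
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    target = r + GAMMA * Q[(s_next, a_next)]                    # on-policy target
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```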


5)Feature Construction for Linear Methods

Polynomials

Fourier Basis

Coarse Coding

n-step Bootstrapping


Eligibility traces


Deep Reinforcement Learning

1)Imitation Learning


Value-based


Model-based


Policy gradients


Actor-critic


Deep RL with Q-Functions


Fitted value iteration


DQN algorithms


Advanced Policy Gradient


Policy Performance Bounds


Monotonic Improvement Theory


Approximate Monotonic Improvement


Natural Policy Gradient


Trust Region Policy Optimization


Proximal Policy Optimization(PPO)




Model-based Reinforcement Learning


Trajectory Optimization


Model-Predictive Control


Inverse Reinforcement Learning

Variational Inference

Probabilistic Graphical Model

Guided Cost Learning


Web resources:


 读书笔记汇总 - 强化学习 - 知乎 (zhihu.com)