Key Concepts in RL

For review; updated from time to time.

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

[Figure: agent-environment interaction loop]

The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.
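To make this interaction loop concrete, here is a minimal sketch assuming a Gymnasium-style interface (env.reset() / env.step()); the CartPole-v1 environment and the random action choice are only illustrative stand-ins for a real task and a learned policy.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")   # illustrative environment

obs, info = env.reset()
episode_return = 0.0            # cumulative reward, i.e. the return

for t in range(200):
    action = env.action_space.sample()   # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:          # the episode has ended
        break

print("return of this episode:", episode_return)
env.close()
```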

More Terminology

DQN (Deep Q-Network): uses a neural network to produce Q-values (see the sketch after this terminology list).

Markov Decision Process (MDP)

Planning: decision-making carried out before executing any action.
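As a rough illustration of the DQN point above, here is a minimal sketch of a Q-network in PyTorch that maps a state to one Q-value per action; the layer sizes, the 4-dimensional state, and the 2 actions are illustrative assumptions, and this is not a full DQN (no replay buffer or target network).

```python
import torch
import torch.nn as nn

# Q-network: state in, one Q-value per action out (sizes are illustrative).
q_net = nn.Sequential(
    nn.Linear(4, 64),   # assume a 4-dimensional state
    nn.ReLU(),
    nn.Linear(64, 2),   # assume 2 discrete actions
)

state = torch.randn(1, 4)                       # a dummy state
q_values = q_net(state)                         # shape (1, 2)
greedy_action = q_values.argmax(dim=1).item()   # action with the largest Q-value
print(q_values, greedy_action)
```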

In reinforcement learning, however, the agent usually cannot know all the elements of the MDP so easily. For example, it may not know how the environment will change after it executes an action (the state transition function $T$), nor what immediate reward it will receive for that action (the reward function $R$). All the agent can do is: based on its current policy $\pi$, choose an action $a$ that it considers good in the current state $s$, execute it in the environment, observe the feedback $r$ and the next state $s'$ returned by the environment, and use this feedback $r$ to update its policy $\pi$. Iterating this process, it eventually finds an optimal policy $\pi'$ that maximizes the positive feedback it receives.

So, when the agent does not know the transition function $T$ or the reward function $R$, how does it find a good policy? There are of course many approaches:

Model-based RL

One approach is the model-based method: the agent learns a model that describes, from its point of view, how the environment works, and then uses this model to plan its actions. Concretely, when the agent is in state $s_1$, executes action $a_1$, and observes the environment transition from $s_1$ to $s_2$ together with a reward $r$, this information can be used to improve the accuracy of its estimates of $T(s_2|s_1, a_1)$ and $R(s_1, a_1)$. Once the learned model is close enough to the real environment, the agent can find the optimal policy directly with planning algorithms. Specifically, if the reward $R(s_t, a_t)$ for executing any action in any state is known, and the next state can be computed from $T(s_{t+1}|s_t, a_t)$, the problem is easily solved with dynamic programming. In particular, when $T(s_{t+1}|s_t, a_t) = 1$, a greedy algorithm can be used directly: at each step, simply choose the action that maximizes the reward function in the current state $s_t$, i.e. $\max_a R(s_t, a)$. Reinforcement learning methods that model the environment in this way are model-based methods.
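To make the planning step concrete, here is a minimal value-iteration sketch on a tiny hand-made MDP where $T$ and $R$ are assumed to be known exactly; the 3-state/2-action transition table and the discount factor are illustrative assumptions.

```python
import numpy as np

# A tiny known MDP: 3 states, 2 actions (all numbers are illustrative).
n_states, n_actions = 3, 2
gamma = 0.9

# T[s, a, s'] = probability of landing in s' after taking action a in state s.
T = np.zeros((n_states, n_actions, n_states))
T[0, 0, 1] = 1.0
T[0, 1, 2] = 1.0
T[1, :, 2] = 1.0
T[2, :, 2] = 1.0          # state 2 is absorbing

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[0.0, 1.0],
              [5.0, 5.0],
              [0.0, 0.0]])

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * T @ V   # shape (n_states, n_actions)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)   # greedy policy with respect to the planned Q
print("V:", V, "policy:", policy)
```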

Model-free RL

It turns out, however, that we do not always need to model the environment to find the optimal policy. A classic example is Q-learning, which directly estimates the future return $Q(s,a)$: $Q(s_k, a_k)$ is an estimate of the total future reward $E\left[\sum_{t=k}^{n} \gamma^{t-k} R_t\right]$ obtained after executing action $a_k$ in state $s_k$. The more accurately this Q-value is estimated, the more confidently we can choose an action in the current state $s_t$: simply pick the $a_t$ that maximizes $Q(s_t, a_t)$. The update target for the Q-value is defined by the Bellman equation, and the update itself can be done with methods such as TD (temporal difference) learning. This is a value-iteration style of method; similarly there are policy-iteration methods, and actor-critic methods that combine value iteration and policy iteration. Basic policy-iteration methods are generally updated once per episode (Monte Carlo updates). Since none of these methods build a model of the environment, they are all model-free methods.
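Here is a minimal sketch of the tabular Q-learning update described above, again assuming a Gymnasium-style environment with discrete states and actions; the FrozenLake-v1 environment, the learning rate, and the epsilon-greedy exploration are illustrative assumptions.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")          # illustrative discrete environment
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection based on the current Q estimates
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # TD update toward the Bellman target r + gamma * max_a' Q(s', a')
        target = r + gamma * np.max(Q[s_next]) * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```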

So, if you want to check whether a reinforcement learning algorithm is model-based or model-free, ask yourself this question: before the agent executes its action, can it predict the next state and the reward? If it can, it is a model-based method; if it cannot, it is a model-free method.

https://www.quora.com/What-is-the-difference-between-model-based-and-model-free-reinforcement-learning

States and Observations

A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.

Action Spaces

action space: The set of all valid actions in a given environment.

discrete action space: only a finite number of moves are available to the agent.

continuous action space: actions are real-valued vectors, e.g. when the agent controls a robot in the physical world (see the sketch below).
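As a concrete illustration of the two kinds of action spaces, here is a small sketch using Gymnasium space objects; the number of actions, the bounds, and the 3-dimensional shape are illustrative assumptions.

```python
import numpy as np
from gymnasium import spaces

# Discrete action space: a finite set of moves, e.g. {0, 1, 2, 3}.
discrete_actions = spaces.Discrete(4)
print(discrete_actions.sample())     # e.g. 2

# Continuous action space: real-valued vectors, e.g. 3 torques in [-1, 1].
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
print(continuous_actions.sample())   # e.g. [ 0.12 -0.87  0.45]
```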

Policies

A policy is a rule used by an agent to decide what actions to take. The policy tries to maximize reward.

Deterministic Policies: $a_t = \mu(s_t)$

Stochastic Policies: $a_t \sim \pi(\cdot|s_t)$

two kinds of stochastic policies:

  1. categorical policies: used in discrete action spaces.
  2. diagonal Gaussian policies: used in continuous action spaces.

two key computations:

  1. sampling actions from the policy
  2. computing log likelihoods of particular actions, $\log \pi_\theta(a|s)$
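The sketch below illustrates both computations for the two kinds of stochastic policies using torch.distributions; the logits, mean, and log-std values are illustrative stand-ins for the outputs of a policy network $\pi_\theta$.

```python
import torch
from torch.distributions import Categorical, Normal

# --- Categorical policy (discrete action spaces) ---
logits = torch.tensor([1.0, 0.5, -0.2])   # pretend these come from a policy net
pi = Categorical(logits=logits)
a = pi.sample()                           # 1. sample an action
logp = pi.log_prob(a)                     # 2. log-likelihood of that action
print(a.item(), logp.item())

# --- Diagonal Gaussian policy (continuous action spaces) ---
mean = torch.tensor([0.0, 0.5])           # pretend these come from a policy net
log_std = torch.tensor([-0.5, -0.5])
pi = Normal(mean, log_std.exp())
a = pi.sample()                           # 1. sample an action vector
logp = pi.log_prob(a).sum(-1)             # 2. log-likelihood (sum over dimensions)
print(a, logp.item())
```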