【Wikipedia】【Model-free reinforcement learning】

Model-free (reinforcement learning)

In reinforcement learning (RL), a model-free algorithm (as opposed to a model-based one) is an algorithm that does not estimate the transition probability distribution (and the reward function) associated with the Markov decision process (MDP),[1] which, in RL, represents the problem to be solved. The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP), hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm.[1] Typical examples of model-free algorithms include Monte Carlo RL, Sarsa, and Q-learning.

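To make the "model-free" property concrete, the sketch below implements tabular Q-learning, one of the typical algorithms named above. It updates action values directly from sampled transitions and never estimates transition probabilities or a reward function. This is a minimal sketch under stated assumptions: the `env` interface (`env.actions`, `env.reset()`, `env.step()`), the function name `q_learning`, and the hyperparameter values are all illustrative, not part of the cited text.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learns Q(s, a) from sampled transitions only.

    `env` is assumed to expose a minimal, hypothetical interface:
        env.actions   -- list of discrete actions
        env.reset()   -- returns an initial state
        env.step(a)   -- returns (next_state, reward, done)
    No transition model or reward function is ever estimated.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def greedy(s):
        # action with the highest current value estimate in state s
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # model-free update: bootstrap from the sampled transition alone
            target = r + (0.0 if done else gamma * Q[(s_next, greedy(s_next))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```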

In model-free reinforcement learning, Monte Carlo (MC) estimation is a central component of a large class of model-free algorithms. The MC learning algorithm is essentially an important branch of generalized policy iteration, which alternates between two steps: policy evaluation (PEV) and policy improvement (PIM). In this framework, each policy is first evaluated through its corresponding value function; based on that evaluation, a greedy search then produces an improved policy. MC estimation is mainly applied to the first step, policy evaluation. In its simplest form, the returns of all collected samples are averaged to judge the effectiveness of the current policy, and as more experience accumulates the estimate converges to the true value by the law of large numbers. Hence, MC policy evaluation does not require any prior knowledge of the environment dynamics. All it needs is experience, i.e., samples of states, actions, and rewards generated by interacting with a real environment.[2] A sketch of this evaluation step is given below.
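The following sketch shows every-visit Monte Carlo policy evaluation as described above: state values are estimated simply by averaging the returns observed after each visit. The trajectory format, the function name `mc_policy_evaluation`, and the discount factor are assumptions made for illustration.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """Every-visit Monte Carlo policy evaluation.

    `episodes` is assumed to be a list of trajectories, each a list of
    (state, action, reward) tuples generated by running the current
    policy in the real environment.  The value of a state is estimated
    as the average of the returns observed after visiting it.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = {}

    for episode in episodes:
        G = 0.0
        # walk the trajectory backwards, accumulating the discounted return
        for state, _action, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1

    for state in returns_sum:
        # sample average converges to the true value by the law of large numbers
        V[state] = returns_sum[state] / returns_cnt[state]
    return V
```

In a full generalized policy iteration loop, this evaluation step (PEV) would alternate with a greedy improvement step (PIM) over the estimated action values.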

The estimation of the value function is critical for model-free RL algorithms. Unlike Monte Carlo (MC) methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. If one idea had to be identified as central and novel to reinforcement learning, it would undoubtedly be temporal difference learning. TD can learn from an incomplete sequence of events without waiting for the final outcome: rather than averaging complete returns, it approximates the return by bootstrapping from the current value estimates of subsequent states.
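The sketch below illustrates TD(0) value estimation for a fixed policy: each update uses a single observed transition and bootstraps from the current estimate of the next state's value, so it does not wait for the episode to finish. The transition format, the function name `td0_value_estimation`, and the step-size and discount values are illustrative assumptions.

```python
from collections import defaultdict

def td0_value_estimation(transitions, gamma=0.99, alpha=0.1):
    """TD(0) prediction: updates V(s) from individual transitions.

    `transitions` is assumed to be an iterable of
    (state, reward, next_state, done) tuples produced while following
    a fixed policy.  Unlike Monte Carlo evaluation, each update
    bootstraps from the current estimate V(next_state) instead of
    waiting for the final return of the episode.
    """
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])  # temporal-difference update
    return V
```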
