【Wikipedia】【Model-free reinforcement learning】

Model-free (reinforcement learning)

In reinforcement learning (RL), a model-free algorithm (as opposed to a model-based one) is an algorithm that does not estimate the transition probability distribution (and the reward function) associated with the Markov decision process (MDP),[1] which, in RL, represents the problem to be solved. The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP), hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm.[1] Typical examples of model-free algorithms include Monte Carlo RL, Sarsa, and Q-learning.

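To make the "model-free" property concrete, the sketch below implements tabular Q-learning, one of the typical algorithms named above. It updates action values directly from sampled transitions and never estimates transition probabilities or a reward function. This is a minimal sketch under stated assumptions: the `env` interface (`env.actions`, `env.reset()`, `env.step()`), the function name `q_learning`, and the hyperparameter values are all illustrative, not part of the cited text.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learns Q(s, a) from sampled transitions only.

    `env` is assumed to expose a minimal, hypothetical interface:
        env.actions   -- list of discrete actions
        env.reset()   -- returns an initial state
        env.step(a)   -- returns (next_state, reward, done)
    No transition model or reward function is ever estimated.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def greedy(s):
        # action with the highest current value estimate in state s
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # model-free update: bootstrap from the sampled transition alone
            target = r + (0.0 if done else gamma * Q[(s_next, greedy(s_next))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```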

In model-free reinforcement learning, Monte Carlo (MC) estimation is a central component of a large class of model-free algorithms. The MC learning algorithm is essentially an important branch of generalized policy iteration, which alternates between two steps: policy evaluation (PEV) and policy improvement (PIM). In this framework, each policy is first evaluated through its corresponding value function; based on that evaluation, a greedy search then produces an improved policy. MC estimation is mainly applied to the first step, policy evaluation. In its simplest form, the returns of all collected samples are averaged to judge the effectiveness of the current policy, and as more experience accumulates the estimate converges to the true value by the law of large numbers. Hence, MC policy evaluation does not require any prior knowledge of the environment dynamics. All it needs is experience, i.e., samples of states, actions, and rewards generated by interacting with a real environment.[2] A sketch of this evaluation step is given below.
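The following sketch shows every-visit Monte Carlo policy evaluation as described above: state values are estimated simply by averaging the returns observed after each visit. The trajectory format, the function name `mc_policy_evaluation`, and the discount factor are assumptions made for illustration.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.99):
    """Every-visit Monte Carlo policy evaluation.

    `episodes` is assumed to be a list of trajectories, each a list of
    (state, action, reward) tuples generated by running the current
    policy in the real environment.  The value of a state is estimated
    as the average of the returns observed after visiting it.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = {}

    for episode in episodes:
        G = 0.0
        # walk the trajectory backwards, accumulating the discounted return
        for state, _action, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1

    for state in returns_sum:
        # sample average converges to the true value by the law of large numbers
        V[state] = returns_sum[state] / returns_cnt[state]
    return V
```

In a full generalized policy iteration loop, this evaluation step (PEV) would alternate with a greedy improvement step (PIM) over the estimated action values.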

The estimation of the value function is critical for model-free RL algorithms. Unlike Monte Carlo (MC) methods, temporal difference (TD) methods learn the value function by reusing existing value estimates. If one idea had to be identified as central and novel to reinforcement learning, it would undoubtedly be temporal difference learning. TD can learn from an incomplete sequence of events without waiting for the final outcome: rather than averaging complete returns, it approximates the return by bootstrapping from the current value estimates of subsequent states.
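The sketch below illustrates TD(0) value estimation for a fixed policy: each update uses a single observed transition and bootstraps from the current estimate of the next state's value, so it does not wait for the episode to finish. The transition format, the function name `td0_value_estimation`, and the step-size and discount values are illustrative assumptions.

```python
from collections import defaultdict

def td0_value_estimation(transitions, gamma=0.99, alpha=0.1):
    """TD(0) prediction: updates V(s) from individual transitions.

    `transitions` is assumed to be an iterable of
    (state, reward, next_state, done) tuples produced while following
    a fixed policy.  Unlike Monte Carlo evaluation, each update
    bootstraps from the current estimate V(next_state) instead of
    waiting for the final return of the episode.
    """
    V = defaultdict(float)
    for state, reward, next_state, done in transitions:
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])  # temporal-difference update
    return V
```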
