Reinforcement Learning (Markov Decision Processes)


circle_yy


Article link: https://blog.csdn.net/cy_believ/article/details/86561932

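This article introduces the basic concepts of reinforcement learning and emphasizes the central role of the Markov decision process (MDP). An MDP uses the Markov property to simplify the problem, and dynamic programming is used to find an optimal policy. The article describes the components of an MDP (states, actions, state transition probabilities, and the reward function) and explains the Bellman equation. Finally, it discusses the role of the policy in an MDP and the concepts of the value function and the action-value function.

Summary generated automatically by CSDN.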

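Reinforcement learning notes:

Reinforcement learning is a subfield of machine learning. Machine learning comprises supervised learning, unsupervised learning, and reinforcement learning.

Definitions of reinforcement learning: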

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective.

Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal.

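In other words, reinforcement learning is a paradigm for learning to control a system so as to maximize a numerical measure of a long-term objective. It has attracted great interest because it can be applied to a wide range of practical problems, from artificial intelligence to operations research and control engineering.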

Put differently, the learner has to discover a mapping from situations to actions that maximizes a numerical reward signal.

A controller receives the controlled system’s state and a reward associated with the last state transition. It then calculates an action which is sent back to the system. In response, the system makes a transition to a new state and the cycle is repeated. The problem is to learn a way of controlling the system so as to maximize the total reward.

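That is, a controller receives the state of the controlled system together with the immediate reward for the last state transition. It then computes an action and sends it back to the system. In response, the system transitions to a new state, and the cycle repeats. The goal is to maximize the long-term (total) reward.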
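The controller/system cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-state `ToySystem`, its 0.8 success probability, its reward rule, and the two fixed controllers are all hypothetical and do not come from the original post.

```python
import random


class ToySystem:
    """Hypothetical two-state controlled system (an assumption for illustration)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Action 1 tries to reach state 1 and succeeds with probability 0.8;
        # any other outcome puts the system in state 0.
        if action == 1 and self.rng.random() < 0.8:
            next_state = 1
        else:
            next_state = 0
        # The reward is tied to the transition just made: 1.0 for landing in state 1.
        reward = 1.0 if next_state == 1 else 0.0
        self.state = next_state
        return next_state, reward


def run_episode(system, controller, num_steps):
    """Run the controller/system cycle and return the total reward collected."""
    state, reward = system.state, 0.0
    total_reward = 0.0
    for _ in range(num_steps):
        action = controller(state, reward)   # controller sees the state and the last reward
        state, reward = system.step(action)  # system moves to a new state
        total_reward += reward
    return total_reward


def always_one(state, reward):
    """Fixed controller that always sends action 1."""
    return 1


def always_zero(state, reward):
    """Fixed controller that always sends action 0."""
    return 0


if __name__ == "__main__":
    print(run_episode(ToySystem(seed=0), always_one, 100))
    print(run_episode(ToySystem(seed=0), always_zero, 100))
```

In this toy setting `always_zero` never collects any reward, while `always_one` collects reward on most steps. A learning algorithm would have to discover that difference from the observed states and rewards alone.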

Problems with these characteristics are best described in the framework of Markovian Decision Processes (MDPs). The standard approach to ‘solve’ MDPs is to use dynamic programming, which transforms the problem of finding a good controller into the problem of finding a good value function.

Markov decision processes are the foundation of reinforcement learning.

MDP (Markov Decision Processes):

Markov Property:

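Let S_t denote the current state. Once S_t is known, the history before it no longer needs to be considered: S_t captures all the relevant information from the history and is sufficient for determining the future state.

Markov process: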
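The post states the property in words only. Written as an equation, and assuming states are indexed S_1, ..., S_t (an indexing convention not given in the original), the usual formulation is:

```latex
\mathbb{P}\left[ S_{t+1} \mid S_t \right]
  = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]
```

That is, conditioning on the full history gives the same distribution over the next state as conditioning on the current state alone.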

For a Markov state s and a successor state s', the state transition probability is defined as follows:
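The formula itself did not survive in this copy of the post. Assuming the standard textbook definition is the one intended, it reads:

```latex
\mathcal{P}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s \right]
```

The symbol \mathcal{P}_{ss'} is an assumed notation for the probability of moving from state s to state s' in one step.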

In matrix form this is written as follows:
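The matrix is likewise missing from this copy. A sketch of the usual form, assuming n states numbered 1 to n and writing \mathcal{P}_{ss'} for the probability of moving from state s to state s' (both are assumptions, not taken from the original), is:

```latex
\mathcal{P} =
\begin{bmatrix}
  \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\
  \vdots           & \ddots & \vdots           \\
  \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn}
\end{bmatrix},
\qquad
\sum_{s'=1}^{n} \mathcal{P}_{ss'} = 1 \quad \text{for every } s
```

Each row corresponds to a current state and each column to a successor state, so every row is a probability distribution and sums to 1.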
