Reinforcement learning: 强化学习, 增强学习, RL

干了这碗汤

Categories: Machine Learning | Deep Learning | Reinforcement Learning; UAV Technology; Matlab | Simulink

Article link: https://blog.csdn.net/weixin_43321489/article/details/108503159

What is RL

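By repeatedly trying different strategies for growing watermelons, the learner acquires one (or several) policies π that produce good melons; this is learning, also called training. The learned policy π is then used for the next round of growing; this is application. The growing process can be viewed as a Markov decision process, whose key concepts in reinforcement learning theory include action, state, reward, state transition function, and cumulative reward.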
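As a rough illustration of these concepts, the sketch below writes a toy melon-growing MDP as plain Python data and rolls out a fixed policy. The states, actions, transition probabilities and rewards are all invented for this example and are not taken from any reference.

```python
import random

# Hypothetical toy MDP for the melon-growing story; every number below
# is made up for illustration.
STATES = ["healthy", "dry", "dead"]
ACTIONS = ["water", "wait"]

# State transition function: P[state][action] -> [(next_state, probability)]
P = {
    "healthy": {"water": [("healthy", 0.9), ("dry", 0.1)],
                "wait":  [("healthy", 0.4), ("dry", 0.6)]},
    "dry":     {"water": [("healthy", 0.7), ("dry", 0.3)],
                "wait":  [("dry", 0.5), ("dead", 0.5)]},
    "dead":    {"water": [("dead", 1.0)],
                "wait":  [("dead", 1.0)]},
}

# Reward received on arriving in a state.
R = {"healthy": 1.0, "dry": -1.0, "dead": -10.0}


def step(state, action, rng):
    """Sample the next state from P and return (next_state, reward)."""
    next_states, probs = zip(*P[state][action])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[next_state]


def rollout(policy, start, steps, rng):
    """Follow a policy (a dict state -> action) and sum the rewards."""
    state, total = start, 0.0
    for _ in range(steps):
        state, reward = step(state, policy[state], rng)
        total += reward
    return total


always_water = {s: "water" for s in STATES}
total = rollout(always_water, "healthy", steps=10, rng=random.Random(0))
```

Here `total` is the cumulative reward of one sampled trajectory; comparing such totals across candidate policies is the basic idea behind picking a good π.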

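Some necessary notes

"Reinforcement learning", 强化学习, 增强学习 and RL all refer to the same concept (the term is sometimes also written "reinforce learning").
Reference book: the "Watermelon Book" (西瓜书, Zhou Zhihua's Machine Learning).
A recent, cutting-edge RL algorithm: A3C
Leading companies: Google, Baidu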

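Keywords

Markov decision process (MDP)
State s
State space S
Action a
Action space A
State transition function P
Reward R
State-action value function Q(s,a)
Cumulative reward V(x)
Final reward (i.e. the cumulative reward)
T-step cumulative reward
γ-discounted cumulative reward
Reward obtained at step t
Policy π
Single-step reinforcement learning task
Multi-step reinforcement learning task
Model-based learning
Model-free learning
Monte Carlo reinforcement learning
Q-learning
Q-table, Q lookup table, behavior criterion
How the Q-table is updated, i.e. how the behavior criterion is updated
AlphaGo defeating Lee Sedol
Making decisions by means of a Q-table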
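The two cumulative-reward variants in the list above can be made concrete with a short sketch. It assumes the usual textbook definitions, the T-step average (1/T) Σ r_t and the γ-discounted sum Σ γ^t r_{t+1}, evaluated on a single made-up reward sequence; the value function is the expectation of these quantities over trajectories.

```python
def t_step_average_reward(rewards):
    """T-step cumulative reward of one trajectory: (1/T) * sum of r_t."""
    return sum(rewards) / len(rewards)


def discounted_return(rewards, gamma):
    """gamma-discounted cumulative reward: sum over t of gamma^t * r_{t+1}."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]                  # r_1, r_2, r_3 (made-up values)
avg = t_step_average_reward(rewards)       # (1 + 0 + 2) / 3 = 1.0
disc = discounted_return(rewards, 0.5)     # 1 + 0.5*0 + 0.25*2 = 1.5
```

With γ < 1, rewards that arrive later count for less, which is why the discounted form stays finite even for very long trajectories with bounded rewards.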

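Steps:

Build the model, i.e. determine A, X, P and R.
Choose an algorithm, e.g. exploitation-only, exploration-only, softmax, or ε-greedy.
Train on the actual problem to obtain the policy π.
Use π.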

Examples

1. Growing watermelons
2. The K-armed bandit
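For the K-armed bandit, a minimal ε-greedy sketch could look like the following. The Bernoulli payout model, the arm probabilities, and the function name `epsilon_greedy_bandit` are assumptions made for this illustration.

```python
import random


def epsilon_greedy_bandit(arm_probs, epsilon, steps, seed=0):
    """Run epsilon-greedy on a Bernoulli K-armed bandit.

    arm_probs[i] is the (hidden) payout probability of arm i.
    Returns the estimated average reward of each arm and the pull counts.
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    q = [0.0] * k       # running average reward per arm
    count = [0] * k     # number of pulls per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                    # explore
        else:
            arm = max(range(k), key=lambda i: q[i])   # exploit
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        count[arm] += 1
        q[arm] += (reward - q[arm]) / count[arm]      # incremental mean
    return q, count


q, count = epsilon_greedy_bandit([0.2, 0.5, 0.8], epsilon=0.1, steps=5000)
```

Setting `epsilon=1.0` gives the exploration-only strategy and `epsilon=0.0` the exploitation-only one; ε-greedy trades off between the two.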

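In plain language

The goal of reinforcement learning: through repeated trials (that is, learning), obtain a policy π for doing some task well, for example growing good melons.
The training (learning) process of reinforcement learning can be described by an MDP.
Application after training: use the learned policy π to do the task. Given the current state x, the policy returns the corresponding action a = π(x), which is the action that maximizes the cumulative reward.

The policy with the largest cumulative reward is the one finally learned. Once it has been obtained, it is simply applied.
In reinforcement learning, the cumulative (final) reward is only obtained after multiple actions.
Maximizing the single-step reward: this assumes the cumulative reward is obtained after just one action.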

Theoretical details

Continuous action spaces

Q-Learning

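A minimal sketch of tabular Q-learning is given below. The four-state chain environment, the hyperparameters and the helper names are hypothetical and chosen only to keep the example small; the update line itself is the standard Q-learning rule.

```python
import random

N_STATES = 4            # states 0..3; state 3 is terminal (hypothetical chain)
ACTIONS = [0, 1]        # 0 = move left, 1 = move right


def env_step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # the Q-table
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy, with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = env_step(state, action)
            # The target uses the best action value in the next state.
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next
                                         - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy a = pi(x) read off the table for the non-terminal states.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a])
                 for s in range(N_STATES - 1)]
```

Reading the greedy action out of each row of `q_table` is what "making decisions by means of a Q-table" amounts to in practice.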
DQN

Why DQN was proposed: when there are many states (or actions), the Q-table becomes extremely large and slow to query.

How is the NN trained?
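One common way to answer this is to regress Q(s, a; w) toward the target r + γ max Q(s', a'; w). The sketch below shows a single such update using a linear function approximator as a stand-in for the neural network, so that it runs with the standard library alone. The feature vectors, step size and function names are assumptions; DQN itself additionally uses an experience replay buffer and a separate target network, both omitted here.

```python
def q_values(weights, features):
    """Q(s, a; w) for every action a: one weight vector per action."""
    return [sum(w_i * f_i for w_i, f_i in zip(w, features)) for w in weights]


def td_update(weights, features, action, reward, next_features, done,
              alpha=0.1, gamma=0.9):
    """One semi-gradient step that moves Q(s, a; w) toward the TD target."""
    next_q = 0.0 if done else max(q_values(weights, next_features))
    target = reward + gamma * next_q                 # treated as a constant
    td_error = target - q_values(weights, features)[action]
    # For a linear model, the gradient of Q(s, a; w) with respect to
    # weights[action] is simply the feature vector.
    weights[action] = [w_i + alpha * td_error * f_i
                       for w_i, f_i in zip(weights[action], features)]
    return td_error


# Two actions, three made-up state features.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
s, s_next = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
first_error = td_update(weights, s, action=1, reward=1.0,
                        next_features=s_next, done=False)
second_error = td_update(weights, s, action=1, reward=1.0,
                         next_features=s_next, done=False)
```

Repeating the same update shrinks the error on that transition, which is the sense in which the approximator replaces table lookups with learned parameters.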

DDPG

Some concepts:
Update rule for the Q value function: the Q-learning algorithm
AC (actor-critic) network
critic, actor
Learning (training) process: can be viewed as a Markov decision process (MDP).
Policy: π(·), a_t = π(x_t)
Obtaining the PID parameters by determining the vector k_t can be formulated as a Markov Decision Process (MDP) within the RL framework, where an entity, called the agent, makes its decisions as a function, π(·), of the current state of the robot, x_t, i.e. k_t = π(x_t). The RL is an unsupervised learning approach for solving MDP problems, where the RL agent learns a policy, π(·), from direct interactions with its environment. At each time step, t, the agent observes the state, x_t, and performs an action, k_t, based on its current policy, π, and receives a scalar reward, r_t, from the environment after the system transition occurs.
Time, training batches
RL algorithms, RL agent
Goal: The aim of the RL algorithms is to find an optimal policy π∗ that maximizes the expected future discounted rewards over time.
That is, find a policy π(·) that maximizes the total (or discounted) reward over 0 < t < t_max.
Solution: To solve the stated RL problem, actor-critic methods can be used.
policy function:
parameterized policy:
state value function: the value function provides a measure of how good those actions are.
parameterization:
parameterized actor function π_θ: the parameterized policy
DDPG: one of the actor-critic (AC) RL algorithms. It uses a state-action value function Q(x, k) as the critic and a deterministic action selection function π as the actor.
There are many ways to update the Q value function; the most classic is the Q-learning algorithm.
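For reference, the standard Q-learning update can be written in this section's notation (state x, action k) as follows. This is the textbook form and may differ in detail from the formula in the original post; the learning rate α and the discount factor γ are symbols introduced here.

```latex
Q(x_t, k_t) \leftarrow Q(x_t, k_t)
  + \alpha \left[ r_t + \gamma \max_{k'} Q(x_{t+1}, k') - Q(x_t, k_t) \right]
```

The bracketed term is the temporal-difference error: the gap between the current estimate and the one-step target built from the observed reward r_t.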
The Q value function evaluates the action a_t = π(x_t) taken in the current state. This evaluation plays a major role in updating the policy π(·), as follows:
The critic provides information for training the actor.
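One way to make this concrete is the deterministic policy gradient, on which DDPG is built. In the usual formulation, and assuming an actor π_θ with parameters θ and a differentiable critic Q, the actor is updated along:

```latex
\nabla_\theta J(\theta) \approx
  \mathbb{E}_{x}\!\left[
    \nabla_k Q(x, k)\big|_{k = \pi_\theta(x)} \,
    \nabla_\theta \pi_\theta(x)
  \right]
```

The critic's gradient with respect to the action tells the actor in which direction to change its output k = π_θ(x) to increase the estimated value. The objective J(θ) and the expectation over visited states x are notation introduced here, not taken from the original post.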