[Chapter 6] Reinforcement Learning (4) Policy Search

In the previous sections, we tried to learn the utility function or, more commonly, the action-value function, and then greedily selected the action with the highest Q-value:

$$\pi(s) = \arg\max_a Q(s, a)$$

This means that once we have learnt the Q-function well, we can obtain an optimal policy. All the methods so far have therefore been learning the Q-function, directly or indirectly. The policy search method, in contrast, updates the policy itself directly.

Policy Search

Based on function approximation, we can write the policy as:

$$\pi(s) = \arg\max_a \hat{Q}(s, a)$$

As a mapping from states to actions, the policy is itself a function with parameters $\theta$ to learn. The policy search method then adjusts $\theta$ to improve the policy directly, without approximating the Q-values or utilities.

However, with the formula above there are two main problems we need to solve first:

  • The operation $\arg\max$ is not differentiable, which makes gradient-based search difficult
  • In environments with discrete actions, the outputs of the policy function are discrete

In fact, one idea solves both problems easily: treat the task as a classification problem. Why? When the agent selects an action, it picks the action with the highest Q-value given the current state; in a classification problem, our model predicts the probability that the input belongs to each class and outputs the class with the highest probability. These are essentially the same thing. Remember how we solve classification problems? Yes, with the softmax function, and we can use it here as well:

$$\pi_{\theta}(s, a) = \frac{e^{\hat{Q}_{\theta}(s, a)}}{\sum_{a'} e^{\hat{Q}_{\theta}(s, a')}}$$

Given a state $s$, the model classifies it into a class that indicates which action to execute (the one with the highest estimated Q-value).
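To make this concrete, here is a minimal NumPy sketch of such a softmax policy over estimated Q-values; the function name, the example Q-values, and the use of NumPy are assumptions made only for illustration:

```python
import numpy as np

def softmax_policy(q_values):
    """Turn the estimated Q-values of one state into action probabilities.

    q_values: 1-D array with one entry per action, i.e. Q_hat_theta(s, a).
    Subtracting the maximum is a standard numerical-stability trick and
    does not change the resulting probabilities.
    """
    z = q_values - np.max(q_values)
    exp_q = np.exp(z)
    return exp_q / exp_q.sum()

# Example: three actions with made-up estimated Q-values for the current state
rng = np.random.default_rng(0)
q_s = np.array([1.0, 2.0, 0.5])
probs = softmax_policy(q_s)              # pi_theta(s, a) for each action a
action = rng.choice(len(q_s), p=probs)   # sample an action instead of taking arg max
```

Sampling from these probabilities, rather than taking the arg max, gives a stochastic policy that is differentiable in $\theta$ and still explores.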

Using the policy gradient method, we obtain the parameter update formula, where $G_j$ is the return observed in trial $j$:

$$\theta_{i+1} = \theta_i + \alpha \, G_j \, \frac{\nabla_{\theta} \pi_{\theta}(s, a_i)}{\pi_{\theta}(s, a_i)}$$

An equivalent form of this update uses the identity $\nabla_{\theta} \ln \pi_{\theta}(s, a) = \nabla_{\theta} \pi_{\theta}(s, a) / \pi_{\theta}(s, a)$, which gives:

$$\theta_{i+1} = \theta_i + \alpha \, G_j \, \nabla_{\theta} \ln \pi_{\theta}(s, a_i)$$
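As a rough sketch, the following implements one such update for a linear-softmax policy whose logits are $\theta^{\top}\phi(s)$; the parameterization, the feature vector `phi_s`, and all names are illustrative assumptions rather than anything specified above:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, phi_s, action, G, alpha):
    """One update: theta <- theta + alpha * G * grad_theta log pi_theta(s, a).

    theta  : (num_features, num_actions) parameters of a linear-softmax policy
    phi_s  : feature vector phi(s) of the current state
    action : index of the action actually taken in s
    G      : return observed after taking that action
    """
    probs = softmax(theta.T @ phi_s)                 # pi_theta(s, .)
    one_hot = np.zeros(theta.shape[1])
    one_hot[action] = 1.0
    # For a linear-softmax policy: grad_theta log pi(a|s) = phi(s) (1_a - pi(s, .))^T
    grad_log_pi = np.outer(phi_s, one_hot - probs)
    return theta + alpha * G * grad_log_pi
```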

Variance Reduction using a Baseline

Another technique is to use a baseline to reduce the variance of the gradient estimate: we replace $Q_{\pi_{\theta}}(s, a)$ with $Q_{\pi_{\theta}}(s, a) - B(s)$. A natural choice for the baseline is $V_{\pi_{\theta}}(s)$, which defines a new advantage function:

$$A_{\pi_{\theta}}(s, a) = Q_{\pi_{\theta}}(s, a) - V_{\pi_{\theta}}(s)$$
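In the update formula above, this simply means replacing the raw return $G_j$ with $G_j - \hat{V}(s)$. A possible sketch extends the previous snippet with a linear baseline $\hat{V}(s) = w^{\top}\phi(s)$ fitted by regression on observed returns; again, all names and the linear form are assumptions for illustration:

```python
import numpy as np

def baseline_adjusted_step(theta, w, phi_s, action, G, alpha, beta):
    """Policy step using the advantage estimate G - V_hat(s) instead of G.

    theta : (num_features, num_actions) parameters of a linear-softmax policy
    w     : weights of a linear baseline V_hat(s) = w . phi(s)
    """
    logits = theta.T @ phi_s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    one_hot = np.zeros(theta.shape[1])
    one_hot[action] = 1.0
    grad_log_pi = np.outer(phi_s, one_hot - probs)

    advantage = G - w @ phi_s                          # A_hat(s, a) = G - V_hat(s)
    theta = theta + alpha * advantage * grad_log_pi    # lower-variance policy update
    w = w + beta * (G - w @ phi_s) * phi_s             # fit the baseline toward G
    return theta, w
```

Because the baseline depends only on the state, subtracting it does not change the expected gradient, only its variance.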

Actor Critic

The Actor-Critic algorithm combines Q-function-based learning and policy search. It maintains two outputs: one learns a policy that selects actions, called the actor, while the other learns a value or Q-function used only for evaluation, called the critic. It thus divides the process into evaluation and improvement, and the two parts are executed alternately.

In DRL, to save memory and training time, we usually let these two parts share the bottom layers used for feature extraction and split the network only at a higher layer, as sketched below.
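A minimal sketch of such a shared-body network, written here with PyTorch; the layer sizes, names, and the framework choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic heads on top of a shared feature-extracting body."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(                   # shared lower layers
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, num_actions)  # policy head: action logits
        self.critic = nn.Linear(hidden, 1)           # value head: V(s)

    def forward(self, obs):
        features = self.body(obs)
        return self.actor(features), self.critic(features)

# Usage: the actor samples an action, the critic's value is kept for evaluation
net = ActorCritic(obs_dim=4, num_actions=2)
obs = torch.randn(1, 4)
logits, value = net(obs)
action = torch.distributions.Categorical(logits=logits).sample()
```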
