[Chapter 6] Reinforcement Learning (4) Policy Search

In the previous sections, we tried to learn the utility function or, more commonly, the action-value function, and then greedily selected the action with the highest Q-value:

$$\pi(s) = \arg\max_a Q(s, a)$$

This means that once we have learnt the Q-function well, we can obtain an optimal policy. All the methods so far have therefore been learning the Q-function, directly or indirectly. The policy search method, in contrast, updates the policy itself directly.

Policy Search

Based on function approximation, we can write the policy as:

$$\pi(s) = \arg\max_a \hat{Q}(s, a)$$

As a mapping from states to actions, the policy is itself a function with parameters $\theta$ to learn. The policy search method then adjusts $\theta$ to improve the policy directly, without approximating the Q-values or utilities.

However, with the formula above there are two main problems we need to solve first:

  • The operation $\arg\max$ is not differentiable, which makes gradient-based search difficult
  • In environments with discrete actions, the outputs of the policy function are discrete

In fact, one idea solves both problems easily: treat the task as a classification problem. Why? When the agent selects an action, it picks the action with the highest Q-value given the current state; in a classification problem, our model predicts the probability that the input belongs to each class and outputs the class with the highest probability. These are essentially the same thing. Remember how we solve classification problems? Yes, with the softmax function, and we can use it here as well:

$$\pi_{\theta}(s, a) = \frac{e^{\hat{Q}_{\theta}(s, a)}}{\sum_{a'} e^{\hat{Q}_{\theta}(s, a')}}$$

Given a state $s$, the model classifies it into a class that indicates which action to execute (the one with the highest estimated Q-value).
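To make this concrete, here is a minimal NumPy sketch of such a softmax policy over estimated Q-values; the function name, the example Q-values, and the use of NumPy are assumptions made only for illustration:

```python
import numpy as np

def softmax_policy(q_values):
    """Turn the estimated Q-values of one state into action probabilities.

    q_values: 1-D array with one entry per action, i.e. Q_hat_theta(s, a).
    Subtracting the maximum is a standard numerical-stability trick and
    does not change the resulting probabilities.
    """
    z = q_values - np.max(q_values)
    exp_q = np.exp(z)
    return exp_q / exp_q.sum()

# Example: three actions with made-up estimated Q-values for the current state
rng = np.random.default_rng(0)
q_s = np.array([1.0, 2.0, 0.5])
probs = softmax_policy(q_s)              # pi_theta(s, a) for each action a
action = rng.choice(len(q_s), p=probs)   # sample an action instead of taking arg max
```

Sampling from these probabilities, rather than taking the arg max, gives a stochastic policy that is differentiable in $\theta$ and still explores.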

Using the policy gradient method, we obtain the parameter update formula, where $G_j$ is the return observed in trial $j$:

$$\theta_{i+1} = \theta_i + \alpha \, G_j \, \frac{\nabla_{\theta} \pi_{\theta}(s, a_i)}{\pi_{\theta}(s, a_i)}$$

An equivalent form of this update uses the identity $\nabla_{\theta} \ln \pi_{\theta}(s, a) = \nabla_{\theta} \pi_{\theta}(s, a) / \pi_{\theta}(s, a)$, which gives:

$$\theta_{i+1} = \theta_i + \alpha \, G_j \, \nabla_{\theta} \ln \pi_{\theta}(s, a_i)$$
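As a rough sketch, the following implements one such update for a linear-softmax policy whose logits are $\theta^{\top}\phi(s)$; the parameterization, the feature vector `phi_s`, and all names are illustrative assumptions rather than anything specified above:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, phi_s, action, G, alpha):
    """One update: theta <- theta + alpha * G * grad_theta log pi_theta(s, a).

    theta  : (num_features, num_actions) parameters of a linear-softmax policy
    phi_s  : feature vector phi(s) of the current state
    action : index of the action actually taken in s
    G      : return observed after taking that action
    """
    probs = softmax(theta.T @ phi_s)                 # pi_theta(s, .)
    one_hot = np.zeros(theta.shape[1])
    one_hot[action] = 1.0
    # For a linear-softmax policy: grad_theta log pi(a|s) = phi(s) (1_a - pi(s, .))^T
    grad_log_pi = np.outer(phi_s, one_hot - probs)
    return theta + alpha * G * grad_log_pi
```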

Variance Reduction using a Baseline

Another technique is to use a baseline to reduce the variance of the gradient estimate: we replace $Q_{\pi_{\theta}}(s, a)$ with $Q_{\pi_{\theta}}(s, a) - B(s)$. A natural choice for the baseline is $V_{\pi_{\theta}}(s)$, which defines a new advantage function:

$$A_{\pi_{\theta}}(s, a) = Q_{\pi_{\theta}}(s, a) - V_{\pi_{\theta}}(s)$$
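In the update formula above, this simply means replacing the raw return $G_j$ with $G_j - \hat{V}(s)$. A possible sketch extends the previous snippet with a linear baseline $\hat{V}(s) = w^{\top}\phi(s)$ fitted by regression on observed returns; again, all names and the linear form are assumptions for illustration:

```python
import numpy as np

def baseline_adjusted_step(theta, w, phi_s, action, G, alpha, beta):
    """Policy step using the advantage estimate G - V_hat(s) instead of G.

    theta : (num_features, num_actions) parameters of a linear-softmax policy
    w     : weights of a linear baseline V_hat(s) = w . phi(s)
    """
    logits = theta.T @ phi_s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    one_hot = np.zeros(theta.shape[1])
    one_hot[action] = 1.0
    grad_log_pi = np.outer(phi_s, one_hot - probs)

    advantage = G - w @ phi_s                          # A_hat(s, a) = G - V_hat(s)
    theta = theta + alpha * advantage * grad_log_pi    # lower-variance policy update
    w = w + beta * (G - w @ phi_s) * phi_s             # fit the baseline toward G
    return theta, w
```

Because the baseline depends only on the state, subtracting it does not change the expected gradient, only its variance.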

Actor Critic

The Actor-Critic algorithm combines Q-function-based learning and policy search. It maintains two outputs: one learns a policy that selects actions, called the actor, while the other learns a value or Q-function used only for evaluation, called the critic. It thus divides the process into evaluation and improvement, and the two parts are executed alternately.

In DRL, to save memory and training time, we usually let these two parts share the bottom layers used for feature extraction and split the network only at a higher layer, as sketched below.
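A minimal sketch of such a shared-body network, written here with PyTorch; the layer sizes, names, and the framework choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor and critic heads on top of a shared feature-extracting body."""

    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(                   # shared lower layers
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, num_actions)  # policy head: action logits
        self.critic = nn.Linear(hidden, 1)           # value head: V(s)

    def forward(self, obs):
        features = self.body(obs)
        return self.actor(features), self.critic(features)

# Usage: the actor samples an action, the critic's value is kept for evaluation
net = ActorCritic(obs_dim=4, num_actions=2)
obs = torch.randn(1, 4)
logits, value = net(obs)
action = torch.distributions.Categorical(logits=logits).sample()
```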
