DRL — Policy Based Methods — Chapter 3-2 Introduction to Policy-Based Method

3.2.1 Policy-Based Methods

Recall value-based methods, where we first tried to find an estimate of the optimal action-value function.
This optimal value function was represented in a table, with one row for each state and one column for each action. We then used the table to build the optimal policy one state at a time: for each state, we pull out its corresponding row, and the optimal action is simply the action with the largest entry.

Value-based methods first obtain the value-function table and then derive the optimal policy from that table;
Policy-based methods obtain the optimal policy directly from the agent's current state, without first computing a value-function table.
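
As a minimal sketch of the value-based recipe above (assuming NumPy; the table `Q` below is just random placeholder values standing in for a learned estimate), extracting the greedy policy is one `argmax` per row:

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.random.rand(n_states, n_actions)  # placeholder for a learned action-value table

def greedy_policy(Q):
    # For each state (row), pick the column (action) with the largest entry.
    return np.argmax(Q, axis=1)

print(greedy_policy(Q))  # one action index per state
```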

3.2.5 Hill Climbing Pseudocode

[Figure: hill climbing pseudocode]
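
The pseudocode figure is not reproduced here, so below is a minimal Python sketch of the same loop. The `evaluate` callback (assumed to roll out one episode with the given weights and return its return G), the Gaussian perturbation, and the hyperparameter values are illustrative assumptions, not details taken from the figure.

```python
import numpy as np

def hill_climbing(evaluate, n_weights, n_iterations=1000, noise_scale=1e-2, seed=0):
    """Vanilla hill climbing over policy weights theta.

    evaluate(theta) is assumed to run one episode with the policy
    parameterized by theta and return the episode return G.
    """
    rng = np.random.default_rng(seed)
    theta_best = rng.normal(size=n_weights)   # random initial weights
    G_best = evaluate(theta_best)             # return of the initial policy

    for _ in range(n_iterations):
        # Slightly perturb the current best weights to get a candidate.
        theta_new = theta_best + noise_scale * rng.normal(size=n_weights)
        G_new = evaluate(theta_new)
        # Keep the candidate only if it achieved a higher return.
        if G_new > G_best:
            theta_best, G_best = theta_new, G_new

    return theta_best, G_best

# Toy usage: a quadratic "return" whose maximum is at theta = [1, 2]
# stands in for collecting an episode.
target = np.array([1.0, 2.0])
theta, G = hill_climbing(lambda th: -np.sum((th - target) ** 2),
                         n_weights=2, noise_scale=0.1)
```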

3.2.12 Summary

Policy-Based Methods

  • With value-based methods, the agent uses its experience with the environment to maintain an estimate of the optimal action-value function. The optimal policy is then obtained from the optimal action-value function estimate.
  • Policy-based methods directly learn the optimal policy, without having to maintain a separate value function estimate.

Policy Function Approximation

  • In deep reinforcement learning, it is common to represent the policy with a neural network.
    • This network takes the environment state as input.
    • If the environment has discrete actions, the output layer has a node for each possible action and contains the probability that the agent should select each possible action.
  • The weights in this neural network are initially set to random values. Then, the agent updates the weights as it interacts with (and learns more about) the environment.
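
A minimal sketch of such a policy network, assuming PyTorch, a 4-dimensional state, and 2 discrete actions; the layer sizes and names are illustrative choices, not something specified in the text:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps an environment state to a probability for each discrete action."""
    def __init__(self, state_size, action_size, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        # Softmax turns the output layer into a probability distribution
        # over the possible actions.
        return torch.softmax(self.fc2(x), dim=-1)

# Weights start out at (framework-default) random values and would be
# updated as the agent interacts with the environment.
policy = PolicyNetwork(state_size=4, action_size=2)
probs = policy(torch.rand(1, 4))                 # dummy state, just to show shapes
action = torch.multinomial(probs, num_samples=1) # sample an action from the policy
```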

More on the Policy

Policy-based methods can learn either stochastic or deterministic policies, and they can be used to solve environments with either finite or continuous action spaces.

Hill Climbing

  • Hill climbing is an iterative algorithm that can be used to find the weights $\theta$ for an optimal policy.
  • At each iteration,
    • We slightly perturb the values of the current best estimate for the weights $\theta_{best}$, to yield a new set of weights.
    • These new weights are then used to collect an episode. If the new weights $\theta_{new}$ resulted in a higher return than the old weights, then we set $\theta_{best} \leftarrow \theta_{new}$.

Beyond Hill Climbing

  • Steepest ascent hill climbing is a variation of hill climbing that considers a small number of neighboring policies at each iteration and selects the best among them.
  • Simulated annealing uses a pre-defined schedule to control how the policy space is explored, and gradually reduces the search radius as we get closer to the optimal solution.
  • Adaptive noise scaling decreases the search radius with each iteration when a new best policy is found, and otherwise increases the search radius.
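
As a sketch of how adaptive noise scaling could slot into the hill-climbing loop above (the halving/doubling factors and the bounds are common illustrative choices, not values prescribed here):

```python
def adapt_noise(noise_scale, improved, shrink=0.5, grow=2.0,
                min_scale=1e-3, max_scale=2.0):
    """One adaptive-noise-scaling step for a hill-climbing loop.

    When a new best policy was just found, shrink the search radius;
    otherwise widen it so the search can escape a plateau.
    """
    if improved:
        return max(min_scale, noise_scale * shrink)
    return min(max_scale, noise_scale * grow)
```

Inside the loop this would be called once per candidate, e.g. `noise_scale = adapt_noise(noise_scale, improved=G_new > G_best)`, evaluated before `G_best` is overwritten.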

More Black-Box Optimization

  • The cross-entropy method iteratively suggests a small number of neighboring policies, and uses a small percentage of the best performing policies to calculate a new estimate.
  • The evolution strategies technique considers the return corresponding to each candidate policy. The policy estimate at the next iteration is a weighted sum of all of the candidate policies, where policies that got higher return are given higher weight.
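
A sketch of the cross-entropy method in this spirit, assuming the same kind of `evaluate(theta)` episode-return callback as in the hill-climbing sketch; the population size, elite fraction, and noise level are illustrative:

```python
import numpy as np

def cross_entropy_method(evaluate, n_weights, n_iterations=50,
                         pop_size=50, elite_frac=0.2, sigma=0.5, seed=0):
    """Cross-entropy method over policy weights (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_elite = int(pop_size * elite_frac)
    theta = rng.normal(size=n_weights)  # current policy estimate

    for _ in range(n_iterations):
        # Suggest neighboring policies by perturbing the current estimate.
        candidates = theta + sigma * rng.normal(size=(pop_size, n_weights))
        returns = np.array([evaluate(c) for c in candidates])
        # Keep only the best-performing fraction ("elite") of candidates ...
        elite = candidates[returns.argsort()[-n_elite:]]
        # ... and use their average as the new policy estimate.
        theta = elite.mean(axis=0)

    return theta
```

Evolution strategies would change only the last step: instead of averaging an elite subset, the new estimate is a weighted sum over all candidates, with larger weights for candidates that obtained higher returns.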

Why Policy-Based Methods?

  • There are three reasons why we consider policy-based methods:
    1. Simplicity: Policy-based methods directly get to the problem at hand (estimating the optimal policy), without having to store a bunch of additional data (i.e., the action values) that may not be useful.
    2. Stochastic policies: Unlike value-based methods, policy-based methods can learn true stochastic policies.
    3. Continuous action spaces: Policy-based methods are well-suited for continuous action spaces.

On Stochastic and Deterministic Policies

So the core distinction is actually quite simple: is the final learned policy of the form
$\pi(s) = a$ (a deterministic policy),
or
$\pi(s, a_i)$, the probability of selecting each action $a_i$ in state $s$ (a stochastic policy)?
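
A tiny numerical sketch of the difference (the action set and the probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([0, 1, 2])

# Deterministic policy: pi(s) = a, each state maps to exactly one action.
def deterministic_policy(state):
    return actions[0]  # always the same action for this state

# Stochastic policy: pi(s, a_i) gives a probability for each action,
# and the agent samples an action from that distribution.
def stochastic_policy(state):
    probs = np.array([0.7, 0.2, 0.1])  # made-up probabilities for this state
    return rng.choice(actions, p=probs)
```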
