Silver-Slides Chapter 3 - Reinforcement Learning: Dynamic Programming (DP)

Chapter 3 - DP

Introduction

Dynamic programming solves a complex problem by breaking it down into subproblems.

MDPs satisfy the two properties that make dynamic programming applicable: optimal substructure and overlapping subproblems.

DP is used for the planning problem in an MDP, where the full model is known.

all of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.

Policy Evaluation

Policy evaluation computes the value function under the current policy π. The update formula is:

v_{k+1}(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ v_k(s') ]

That is, the left column of the figure below: the value function in the lower left is computed from the policy in the upper left.

Policy Iteration

Policy iteration consists of two parts: one is the policy evaluation just described, and the other is policy improvement (policy improvement corresponds to the right column of the figure above, where the policy is updated from the value function computed on the left):

We can see that as these two steps alternate, the policy π keeps approaching the optimal policy π*.

Policy iteration often converges in surprisingly few iterations.

  • policy improvement

    Maximizing q_π(s, π'(s)) can be turned into maximizing v_{π'}(s): greedily improving the action values at each state also improves the state-value function (see the short derivation below).
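    For reference, here is the standard one-line argument behind this claim (as in Silver's slides and Sutton & Barto): acting greedily with respect to q_π at every state can only increase the value.

    π'(s) = argmax_a q_π(s, a)
    q_π(s, π'(s)) = max_a q_π(s, a) ≥ q_π(s, π(s)) = v_π(s)
    ⇒ v_{π'}(s) ≥ v_π(s) for all s   (the policy improvement theorem)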

This subsection works through two examples in detail:

  • GridWorld

    • policy evaluation

    As mentioned under policy evaluation, we mainly look at the formula for computing the value function:

    For a: up/down/left/right

    For π: π(a|s) = 0.25 for every a = up/down/left/right (the uniform random policy)

    For p(s',r|s,a): s' is either s itself (when the action would move off the grid) or one of the four neighbouring cells, and r = -1

    For r: r = -1 on every transition

    For v_k(s'): the state value from the previous iteration

    We can see that the explicit notion of time t is essentially gone, because it has been absorbed by the Bellman Expectation Equation.

    • policy improvement

    For each state, find the action with the largest value after executing it; actions are chosen from {left, right, up, down}. (A minimal code sketch of the whole example follows below.)
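    For concreteness, here is a minimal, self-contained sketch of this example, assuming the 4x4 small gridworld of Sutton & Barto (Example 4.1) with γ = 1, a reward of -1 per step, and two terminal corner cells; all names below are my own illustration, not the code from the original post.

    import numpy as np

    GRID = 4
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
    TERMINALS = {(0, 0), (GRID - 1, GRID - 1)}

    def step(state, action):
        """Deterministic transition: stay in place if the move leaves the grid."""
        if state in TERMINALS:
            return state, 0.0
        r, c = state[0] + action[0], state[1] + action[1]
        if not (0 <= r < GRID and 0 <= c < GRID):
            r, c = state                                 # bounce back off the edge
        return (r, c), -1.0                              # reward is -1 on every step

    def policy_evaluation(theta=1e-4):
        """Iterative Bellman expectation backup under the uniform random policy."""
        V = np.zeros((GRID, GRID))
        while True:
            delta = 0.0
            new_V = np.zeros_like(V)
            for r in range(GRID):
                for c in range(GRID):
                    for a in ACTIONS:                    # pi(a|s) = 0.25 for each action
                        (nr, nc), reward = step((r, c), a)
                        new_V[r, c] += 0.25 * (reward + V[nr, nc])   # gamma = 1
                    delta = max(delta, abs(new_V[r, c] - V[r, c]))
            V = new_V
            if delta < theta:
                return V

    def greedy_policy(V):
        """Policy improvement: pick the action with the best one-step lookahead."""
        policy = {}
        for r in range(GRID):
            for c in range(GRID):
                returns = []
                for a in ACTIONS:
                    (nr, nc), reward = step((r, c), a)
                    returns.append(reward + V[nr, nc])
                policy[(r, c)] = ACTIONS[int(np.argmax(returns))]
        return policy

    V = policy_evaluation()
    print(np.round(V, 1))    # approaches the 0 / -14 / -20 / -22 values of Figure 4.1

    Running policy_evaluation and then greedy_policy reproduces the two columns of the figure referred to above; in this example the greedy policy with respect to the random-policy values is already optimal.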

  • CarRental

    • policy evaluation

    Again we look at the formula for computing the value function:

    For a: {-5,-4,-3,-2,-1,0,1,2,3,4,5}; at most 5 cars can be moved overnight

    For π: here each state has exactly one action (the policy is deterministic), i.e., π(a|s) = 1

    For p(s',r|s,a): s' can be any state, with the probabilities computed from Poisson distributions

    For r: r = (realRentalFirstLoc + realRentalSecondLoc) * RENTAL_CREDIT

    For v_k(s'): the state value from the previous iteration

    The code for the whole formula is (a fuller sketch of the surrounding expectedReturn computation is given after the policy-improvement code below):

    returns += prob * (reward + DISCOUNT * 
                       stateValue[numOfCarsFirstLoc, numOfCarsSecondLoc])
    • policy improvement

    For each state, find the action with the largest expected return after executing it; actions are chosen from {-5,-4,-3,-2,-1,0,1,2,3,4,5}:

    for i, j in states:
        actionReturns = []
        # go through all actions and select the best one
        for action in actions:
            # a car move is only legal if the source location has enough cars
            if (action >= 0 and i >= action) or (action < 0 and j >= abs(action)):
                actionReturns.append(expectedReturn([i, j], action, stateValue))
            else:
                actionReturns.append(-float('inf'))
        # greedy improvement: keep the action with the highest expected return
        bestAction = np.argmax(actionReturns)
        newPolicy[i, j] = actions[bestAction]
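    For context, here is a hedged sketch of what the expectedReturn(state, action, stateValue) helper used above might look like, assuming Jack's Car Rental (Sutton & Barto, Example 4.2) with the common simplification that returned cars arrive at their expected rates (3 and 2) rather than being Poisson-distributed; the constants and structure are my own reconstruction, not the original post's code.

    import numpy as np
    from math import exp, factorial

    MAX_CARS = 20
    RENTAL_CREDIT = 10.0
    MOVE_COST = 2.0
    DISCOUNT = 0.9
    RENTAL_LAMBDA = (3, 4)        # expected rental requests at the two locations
    RETURNS_CONST = (3, 2)        # expected returned cars, treated as deterministic here
    POISSON_UPPER_BOUND = 11      # truncate the Poisson sums

    def poisson(n, lam):
        return exp(-lam) * lam ** n / factorial(n)

    def expectedReturn(state, action, stateValue):
        # assumes the caller only passes feasible actions (see the check in the loop above)
        # cost of moving |action| cars overnight
        returns = -MOVE_COST * abs(action)
        # cars available at each location the next morning
        carsFirst = min(state[0] - action, MAX_CARS)
        carsSecond = min(state[1] + action, MAX_CARS)
        # sum over the (truncated) joint distribution of rental requests
        for req1 in range(POISSON_UPPER_BOUND):
            for req2 in range(POISSON_UPPER_BOUND):
                prob = poisson(req1, RENTAL_LAMBDA[0]) * poisson(req2, RENTAL_LAMBDA[1])
                realRentalFirstLoc = min(carsFirst, req1)
                realRentalSecondLoc = min(carsSecond, req2)
                reward = (realRentalFirstLoc + realRentalSecondLoc) * RENTAL_CREDIT
                # next state: unrented cars plus (deterministic) returned cars, capped at 20
                numOfCarsFirstLoc = min(carsFirst - realRentalFirstLoc + RETURNS_CONST[0], MAX_CARS)
                numOfCarsSecondLoc = min(carsSecond - realRentalSecondLoc + RETURNS_CONST[1], MAX_CARS)
                returns += prob * (reward + DISCOUNT *
                                   stateValue[numOfCarsFirstLoc, numOfCarsSecondLoc])
        return returns

    In this sketch, the returns += prob * (...) line quoted earlier is the accumulation inside the double loop over rental requests.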

From the one-step lookahead formula

v_*(s) = max_a Σ_{s',r} p(s',r|s,a) [ r + γ v_*(s') ]

we can see that a policy π(a|s) achieves the optimal value from a state s if and only if it also achieves the optimal value from every successor state s' reachable from s (the principle of optimality).

Value Iteration

One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.

In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.

Policy iteration updates the policy only after each full policy evaluation, which is time-consuming, and sometimes the extra work is unnecessary: in the gridworld, for example, the policy no longer changes after the third iteration.

Value iteration is one way to fix this drawback: it iterates only on the value function and extracts the policy once at the very end. The formula for computing the value function also changes, as follows:

v_{k+1}(s) = max_a Σ_{s',r} p(s',r|s,a) [ r + γ v_k(s') ]

Here the values of the individual actions are no longer summed (weighted by the policy); instead, the maximum over actions is taken.

The value iteration algorithm can be written out as follows:
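A minimal Python sketch of synchronous value iteration for a generic finite MDP; the arrays P (transition probabilities, shape S×A×S) and R (expected rewards, shape S×A) and all other names are illustrative assumptions, not code from the slides or the original post.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, theta=1e-6):
        n_states = R.shape[0]
        V = np.zeros(n_states)
        while True:
            # Bellman optimality backup: max over actions instead of the
            # policy-weighted sum used in policy evaluation
            Q = R + gamma * P.dot(V)          # shape (S, A)
            new_V = Q.max(axis=1)
            delta = np.max(np.abs(new_V - V))
            V = new_V
            if delta < theta:
                break
        # extract the greedy (deterministic) policy once, at the very end
        policy = (R + gamma * P.dot(V)).argmax(axis=1)
        return V, policy

The policy is extracted only once, after the value function has converged, which is exactly the difference from policy iteration described above.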

Extensions to Dynamic Programming

  • Asynchronous Dynamic Programming

    A major drawback to the DP methods that we have discussed so far is that they involve operations over the entire state set of the MDP, that is, they require sweeps of the state set. If the state set is very large, then even a single sweep can be prohibitively expensive. For example, the game of backgammon has over 10^20 states. Even if we could perform the value iteration update on a million states per second, it would take over a thousand years to complete a single sweep.

    Asynchronous dynamic programming can achieve the following:

    1. Can significantly reduce computation
    2. Guaranteed to converge if all states continue to be selected

    Three simple ideas for asynchronous dynamic programming (a minimal sketch of the first idea follows this list):

    1. In-place dynamic programming
    2. Prioritised sweeping
    3. Real-time dynamic programming
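    As an illustration of the first idea (in-place dynamic programming), here is a minimal sketch contrasting with the synchronous value iteration sketch above; it reuses the same illustrative P/R layout, which is an assumption rather than code from the slides.

    import numpy as np

    def in_place_value_iteration(P, R, gamma=0.9, theta=1e-6):
        n_states = R.shape[0]
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):                    # sweep the states one by one
                v = V[s]
                # overwrite V[s] immediately; later states in this sweep already see it,
                # instead of waiting for a separate new_V array as in the synchronous version
                V[s] = np.max(R[s] + gamma * P[s].dot(V))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                return V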
  • Full-width and sample backups

    Using sample rewards and sample transitions instead of the reward function R and the transition dynamics P.
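    A tiny sketch of what a one-step sample backup looks like (a TD(0)-style update): the value of s is updated from a single sampled transition (s, a, r, s') rather than from a full expectation over R and P. The step size alpha and the value table V are illustrative assumptions.

    def sample_backup(V, s, r, s_next, gamma=0.9, alpha=0.1):
        # move V[s] toward the sampled one-step target r + gamma * V[s_next]
        V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
        return V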

  • Approximate Dynamic Programming

    Approximate the value function

Contraction Mapping

The contraction mapping theorem resolves the following questions (as listed on Silver's slides):

  • How do we know that value iteration converges to v*?
  • Or that iterative policy evaluation converges to v_π?
  • And therefore that policy iteration converges to v*?
  • Is the solution unique?
  • How fast do these algorithms converge?
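A sketch of the key fact, in the notation used above: define the Bellman expectation backup operator T^π by

(T^π v)(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ v(s') ]

Then T^π is a γ-contraction in the max norm,

||T^π u − T^π v||_∞ ≤ γ ||u − v||_∞

so by the contraction mapping theorem it has a unique fixed point (namely v_π) and repeated application converges to it. The same argument applied to the Bellman optimality backup (with a max over actions) gives the uniqueness of v_* and the convergence of value iteration and policy iteration.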
