Chapter 3 - DP
Introduction
Dynamic programming breaks a problem down into subproblems.
An MDP satisfies both properties that dynamic programming requires: optimal substructure and overlapping subproblems.
DP is used for the planning problem in MDPs.
All of these methods can be viewed as attempts to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.
Policy Evaluation
Policy evaluation computes the value function for the current policy π. The formula is:
That is, the left column of the figure below: the value function at the bottom left is computed from the policy at the top left.
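As a sketch, the iterative evaluation loop can be written as follows; the two-state MDP and the `transitions` table here are invented purely to make the example self-contained:

```python
# Iterative policy evaluation sketch on a tiny, hypothetical 2-state MDP.
# transitions[s][a] is a list of (prob, next_state, reward) triples.
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
policy = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 0.5, "go": 0.5}}
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in transitions}
while True:
    delta = 0.0
    for s in transitions:
        # Bellman expectation update:
        # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
        v_new = sum(
            pi_a * sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])
            for a, pi_a in policy[s].items()
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break
```

The loop sweeps the states repeatedly until the largest change in any state value drops below a threshold.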
Policy Iteration
Policy iteration consists of two parts: the policy evaluation just described, and policy improvement (the right column of the figure above, which updates the policy from the value function computed in the left column):
As the two steps alternate, the policy π steadily approaches π*.
Policy iteration often converges in surprisingly few iterations.
Policy Improvement
Finding the maximum of q_π(s, π′(s)) can be reduced to finding the maximum of v_π′(s).
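The greedy step behind this statement can be written out explicitly; this is the standard policy-improvement update, reproduced here for reference:

```latex
\pi'(s) \doteq \operatorname*{arg\,max}_a q_\pi(s,a)
       = \operatorname*{arg\,max}_a \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
```

By the policy improvement theorem, if q_π(s, π′(s)) ≥ v_π(s) for every state s, then v_π′(s) ≥ v_π(s) everywhere, so the greedy policy is at least as good as π.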
This section gives two examples; here is a detailed walkthrough of each:
GridWorld
- policy evaluation
As mentioned in the policy evaluation section, we mainly look at the formula for computing the value function:
For a: up/down/left/right
For π: π(a|s) = 0.25 for each of a = up/down/left/right
For p(s′,r|s,a): s′ can be s itself or one of its four neighbours, with r = -1
For r: r = -1
For v_k(s′): the state value from the previous iteration
Notice that the notion of the time step t has essentially disappeared; it was simplified away by the Bellman expectation equation.
- policy improvement
For each state, choose the action that yields the largest value after execution, with actions taken from {left, right, up, down}.
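The two steps above can be sketched in code for the 4×4 gridworld; the layout (terminal states in opposite corners, reward -1 per step, undiscounted) follows the standard textbook example, but all function and variable names here are this sketch's own:

```python
import numpy as np

SIZE = 4
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    # Moving off the grid leaves the state unchanged; every move costs -1.
    if state in TERMINALS:
        return state, 0.0
    di, dj = ACTIONS[action]
    i, j = state[0] + di, state[1] + dj
    if not (0 <= i < SIZE and 0 <= j < SIZE):
        i, j = state
    return (i, j), -1.0

def evaluate_random_policy(theta=1e-6):
    # Policy evaluation: pi(a|s) = 0.25 for all four actions, gamma = 1.
    V = np.zeros((SIZE, SIZE))
    while True:
        delta = 0.0
        for i in range(SIZE):
            for j in range(SIZE):
                if (i, j) in TERMINALS:
                    continue
                v_new = sum(
                    0.25 * (r + V[s2])
                    for s2, r in (step((i, j), a) for a in ACTIONS)
                )
                delta = max(delta, abs(v_new - V[i, j]))
                V[i, j] = v_new
        if delta < theta:
            break
    return V

def greedy_policy(V):
    # Policy improvement: pick the action whose one-step return is largest.
    policy = {}
    for i in range(SIZE):
        for j in range(SIZE):
            if (i, j) in TERMINALS:
                continue
            returns = {a: r + V[s2] for a in ACTIONS
                       for s2, r in [step((i, j), a)]}
            policy[(i, j)] = max(returns, key=returns.get)
    return policy
```

Running `evaluate_random_policy()` reproduces the familiar converged values (-14, -20, -22, ... next to the terminals), and `greedy_policy` then points every state toward the nearest terminal.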
CarRental
- policy evaluation
Again, look at the formula for computing the value function:
For a: {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5}; at most 5 cars can be moved overnight
For π: here each state has exactly one action, i.e. π(a|s) = 1
For p(s′,r|s,a): s′ can be any state, with probabilities given by Poisson distributions
For r: r = (realRentalFirstLoc + realRentalSecondLoc) * RENTAL_CREDIT
For v_k(s′): the state value from the previous iteration
The code for the full formula is:

```python
returns += prob * (reward + DISCOUNT * stateValue[numOfCarsFirstLoc, numOfCarsSecondLoc])
```
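For context, here is a hedged sketch of how such a line might sit inside the full expected-return computation. The helper `poisson_pmf`, the truncation bound `POISSON_CUTOFF`, and the simplification of replacing the Poisson-distributed car returns by their means are all assumptions of this sketch, not necessarily what the original code does:

```python
from math import exp, factorial

# Problem constants from the Jack's Car Rental example.
MAX_CARS = 20
RENTAL_CREDIT = 10.0
MOVE_COST = 2.0
DISCOUNT = 0.9
LAMBDA_REQ_1, LAMBDA_REQ_2 = 3, 4   # Poisson means for rental requests
RET_MEAN_1, RET_MEAN_2 = 3, 2       # car returns, approximated by their means
POISSON_CUTOFF = 11                 # truncate the infinite Poisson sums

def poisson_pmf(n, lam):
    return exp(-lam) * lam ** n / factorial(n)

def expected_return(state, action, state_value):
    # action > 0 moves cars from location 1 to location 2 overnight.
    returns = -MOVE_COST * abs(action)
    cars1 = min(state[0] - action, MAX_CARS)
    cars2 = min(state[1] + action, MAX_CARS)
    for req1 in range(POISSON_CUTOFF):
        for req2 in range(POISSON_CUTOFF):
            prob = poisson_pmf(req1, LAMBDA_REQ_1) * poisson_pmf(req2, LAMBDA_REQ_2)
            # Rentals are capped by the cars actually available.
            real1, real2 = min(req1, cars1), min(req2, cars2)
            reward = (real1 + real2) * RENTAL_CREDIT
            s1 = min(cars1 - real1 + RET_MEAN_1, MAX_CARS)
            s2 = min(cars2 - real2 + RET_MEAN_2, MAX_CARS)
            returns += prob * (reward + DISCOUNT * state_value[s1][s2])
    return returns
```

Each term of the outer sum multiplies the joint Poisson probability of a demand pair by the reward plus the discounted value of the resulting state, exactly the Bellman expectation structure spelled out above.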
- policy improvement
For each state, choose the action that yields the largest value after execution, with actions taken from {-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5}:

```python
for i, j in states:
    actionReturns = []
    # go through all actions and select the best one
    for action in actions:
        if (action >= 0 and i >= action) or (action < 0 and j >= abs(action)):
            actionReturns.append(expectedReturn([i, j], action, stateValue))
        else:
            actionReturns.append(-float('inf'))
    bestAction = np.argmax(actionReturns)
    newPolicy[i, j] = actions[bestAction]
```
According to the formula:
a policy π(a|s) achieves the optimal value at s if and only if its successor state s′ achieves the optimal value.
Value Iteration
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set.
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.
Policy iteration updates the policy after every full evaluation, which is time-consuming, and sometimes the policy improvement step is unnecessary: in the gridworld, for instance, the policy stops changing after three iterations.
Value iteration addresses this drawback. It only computes the value function during the iterations, and extracts the policy once at the end. The update formula also changes, as follows:
Instead of summing the weighted returns over all actions, it takes the maximum.
The pseudocode for value iteration is:
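As a concrete counterpart to the pseudocode, here is a minimal runnable sketch of value iteration; the two-state MDP and its `transitions` table are invented for illustration:

```python
# Value iteration sketch on a hypothetical 2-state MDP.
# transitions[s][a] is a list of (prob, next_state, reward) triples.
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma, theta = 0.9, 1e-8

V = {s: 0.0 for s in transitions}
while True:
    delta = 0.0
    for s in transitions:
        # Bellman optimality update: max over actions, not an expectation.
        v_new = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in acts)
            for acts in transitions[s].values()
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# Extract the greedy policy once, after the values have converged.
policy = {
    s: max(transitions[s], key=lambda a: sum(
        p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a]))
    for s in transitions
}
```

Note that no policy exists during the loop; the single `max` inside the sweep plays the role of an implicit improvement step, and the explicit policy is recovered only at the end.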
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
A major drawback to the DP methods that we have discussed so far is that they involve operations over the entire state set of the MDP, that is, they require sweeps of the state set. If the state set is very large, then even a single sweep can be prohibitively expensive. For example, the game of backgammon has over 10^20 states. Even if we could perform the value iteration update on a million states per second, it would take over a thousand years to complete a single sweep.
Asynchronous dynamic programming offers the following:
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Three simple ideas for asynchronous dynamic programming:
- In-place dynamic programming
- Prioritised sweeping
- Real-time dynamic programming
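To illustrate the first of these ideas, here is a small contrast between a synchronous sweep and an in-place sweep; the one-dimensional chain MDP is invented for this sketch:

```python
# Contrast synchronous vs. in-place value updates on an invented chain MDP:
# states 0..4, a deterministic "right" move with reward -1, state 4 terminal.
GAMMA = 1.0
N = 5

def sync_sweep(V):
    # Synchronous: read only old values, write into a fresh array.
    new = V[:]
    for s in range(N - 1):
        new[s] = -1 + GAMMA * V[s + 1]
    return new

def inplace_sweep(V):
    # In-place: each update immediately reuses the freshest values,
    # so information can propagate through the whole chain in one pass.
    for s in range(N - 2, -1, -1):   # sweep backwards, towards state 0
        V[s] = -1 + GAMMA * V[s + 1]
    return V
```

Swept backwards from the terminal, the in-place version converges in a single pass on this chain, while the synchronous version needs N-1 sweeps; exploiting such update orderings is exactly what asynchronous DP is about.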
Full-width and sample backups
Use sample rewards and sample transitions instead of the reward function R and the transition dynamics P.
Approximate Dynamic Programming
Approximate the value function
Contraction Mapping
Contraction mapping can resolve the following questions: