David Silver Reinforcement Learning Notes 3

These are notes taken while watching David Silver's reinforcement learning lectures; they organize the ideas from the videos and add a few proofs.

1. What is dynamic programming?

Skipped; not written up here.

 

2. Iterative policy evaluation

Task: evaluate the state-value function v_{\pi}(s) of a given policy \pi.

Method 1: an iterative algorithm (iterative application of the Bellman expectation backup),

             using synchronous backups (as opposed to asynchronous backups).

It can be shown that under this iterative scheme, v_{k}\rightarrow v_{\pi} as k\rightarrow \infty. (A proof appears later in the slides.)
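For reference, each synchronous sweep applies the Bellman expectation backup to every state simultaneously (written here in the notation of the lectures; the evaluation code below implements exactly this with \pi(a|s) = 0.25 and deterministic transitions):

v_{k+1}(s) = \sum_{a\in A}\pi(a|s)\left(R_{s}^{a} + \gamma \sum_{s'\in S}P_{ss'}^{a}\, v_{k}(s')\right)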

 

3. Example (gridworld)

As shown in the figure below, evaluate the state-value function v_{\pi}(s) of a 5x5 gridworld.

(Figure: the 5x5 gridworld, with the special states A, A', B and B' marked.)

Actions: north, south, west, east;

States: the 25 grid cells correspond to 25 states;

Rewards: 1) From A, every action moves the agent to A' and yields reward +10;

           2) From B, every action moves the agent to B' and yields reward +5;

           3) On the boundary, e.g. at C, any action that would move off the grid leaves the agent where it is and yields reward -1;

           4) All other moves yield reward 0 and shift the agent one cell in the direction of the action;

Discount factor: \gamma = 0.9

Policy: the uniform random policy, i.e. in every state (cell) each of the 4 actions is chosen with probability 0.25. (A quick sanity check of these dynamics follows below.)
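As a quick sanity check of the specification above (this worked instance is my addition, not part of the original notes): from A every action moves to A' with reward +10, so the Bellman expectation equation for A reduces to

v_{\pi}(A) = 10 + 0.9\, v_{\pi}(A')

and the converged values printed by the code below should satisfy this relation, along with the analogous one v_{\pi}(B) = 5 + 0.9\, v_{\pi}(B'), up to rounding.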


Python implementation:

The code below is Shangtong Zhang's, with some modifications:

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.table import Table

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
gamma = 0.9

ACTIONS = np.array([[0, -1], [-1, 0], [0, 1], [1, 0]]);
ACTION_PROB = 0.25

def move(state, action):
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    state = np.array(state)
    next_state = (state + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
        next_state = state
    else:
        reward = 0
    return next_state, reward


def policyEvaluation():
    Vk = np.zeros((WORLD_SIZE, WORLD_SIZE));
    iteration = 0;
    while True:
        # keep iteration until convergence
        Vkk = np.zeros(Vk.shape);
        iteration += 1;
        for i in range(0, WORLD_SIZE):
            for j in range(0, WORLD_SIZE):
                for action in ACTIONS:
                    (next_i, next_j), reward = move([i, j], action);
                    # bellman equation
                    Vkk[i, j] += ACTION_PROB * (reward + gamma * Vk[next_i, next_j]);
        if np.sum(np.abs(Vk - Vkk)) < 1e-4:
            print("iteration:", iteration);
            print(np.around(Vk, 1));
            break;
        Vk = Vkk;


if __name__ == '__main__':
    policyEvaluation();

Running it prints the number of sweeps and the converged state-value function (rounded to one decimal place).

 

4. Policy improvement

Problem statement:

Given a deterministic policy \pi and its state-value function v_{\pi}, how can we improve \pi to obtain a new policy \pi' such that \pi' \geqslant \pi?

One line of thought:

First, recall the original form of v_{\pi}(s):

v_{\pi}(s) = E_{\pi}(G_{t} | S_{t} = s)

Note the meaning of the subscript \pi: at times t, t+1, t+2, \dots the policy \pi is followed, taking actions A_{t} = \pi(s), A_{t+1} = \pi(S_{t+1}), A_{t+2} = \pi(S_{t+2}), \dots, and v_{\pi}(s) is the expected cumulative reward obtained in this way.

 

Next, look at the Bellman equation for v_{\pi}(s):

v_{\pi}(s) = E_{\pi}(R_{t+1} + \gamma v_{\pi}(S_{t+1})| S_{t} = s)

Since v_{\pi}(S_{t+1}) already carries the subscript \pi, and the action at time t under the deterministic policy is A_{t} = \pi(s), this can be rewritten as:

v_{\pi}(s) = E_{\pi}(R_{t+1} | S_{t} = s) + E_{\pi}(\gamma v_{\pi}(S_{t+1}) | S_{t} = s)

           = E(R_{t+1} | S_{t} = s, A_{t}=\pi(s)) + E(\gamma v_{\pi}(S_{t+1}) | S_{t} = s, A_{t}=\pi(s))

           = E(R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_{t} = s, A_{t}=\pi(s))

           = q_{\pi}(s,\pi(s))

This suggests a two-step improvement: at times t+1, t+2, \dots (states S_{t+1}, S_{t+2}, \dots) keep following the old policy \pi, but at time t (state S_{t} = s) act according to a new policy \pi', chosen so that q_{\pi}(s,\pi'(s)) \geqslant q_{\pi}(s,\pi(s)) = v_{\pi}(s) for all s\in S. Does it then follow that \pi'\geqslant \pi, i.e. that v_{\pi'}(s)\geqslant v_{\pi}(s) for all s\in S?
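The concrete choice used later (and in the policy iteration code below) is the greedy policy with respect to q_{\pi}:

\pi'(s) = \underset{a\in A}{\operatorname{argmax}}\, q_{\pi}(s,a)

which satisfies q_{\pi}(s,\pi'(s)) = \max_{a\in A} q_{\pi}(s,a) \geqslant q_{\pi}(s,\pi(s)) = v_{\pi}(s) by construction, so the condition above holds automatically.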

 

5. The policy improvement theorem

Let \pi and \pi' be deterministic policies. If q_{\pi}(s,\pi'(s)) \geqslant v_{\pi}(s) for all s\in S, then \pi'\geqslant \pi, i.e. v_{\pi'}(s)\geqslant v_{\pi}(s) for all s\in S.

Proof: repeatedly expand q_{\pi} one step and apply the assumption at the next state:

v_{\pi}(s) \leqslant q_{\pi}(s,\pi'(s)) = E_{\pi'}(R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_{t} = s)

            \leqslant E_{\pi'}(R_{t+1} + \gamma q_{\pi}(S_{t+1},\pi'(S_{t+1})) | S_{t} = s)

            \leqslant E_{\pi'}(R_{t+1} + \gamma R_{t+2} + \gamma^{2} q_{\pi}(S_{t+2},\pi'(S_{t+2})) | S_{t} = s)

            \leqslant \dots \leqslant E_{\pi'}(R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots | S_{t} = s) = v_{\pi'}(s)

When the greedy improvement stops changing the policy, we have:

q_{\pi}(s,\pi'(s)) = \underset{a\in A}{\max}\, q_{\pi}(s,a) = q_{\pi}(s,\pi(s)) = v_{\pi}(s)        (1)

which means the Bellman optimality equation is satisfied:

v_{\pi}(s) = \underset{a\in A}{\max}\, q_{\pi}(s,a), \forall s\in S

Since the Bellman optimality equation has v_{*} as its unique solution, v_{\pi}(s) = v_{*}(s) for all s\in S, and therefore \pi is an optimal policy \pi_{*}.

 

6. Policy iteration

The policy improvement theorem yields an algorithm for obtaining an optimal policy: policy iteration.

The algorithm alternates two steps until the policy no longer changes: policy evaluation (compute v_{\pi} for the current \pi, as in Section 2) and policy improvement (make the policy greedy with respect to v_{\pi}).

Compared with the policy-evaluation update of Section 2, \pi(a|s) no longer appears here; it is effectively set to 1, because the action is now a = \pi(s). The evaluation/improvement cycle can be pictured as shown below.
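In Sutton & Barto's notation, policy iteration produces a sequence of monotonically improving policies and value functions, where each E is a policy evaluation and each I a greedy improvement:

\pi_{0} \overset{E}{\rightarrow} v_{\pi_{0}} \overset{I}{\rightarrow} \pi_{1} \overset{E}{\rightarrow} v_{\pi_{1}} \overset{I}{\rightarrow} \pi_{2} \overset{E}{\rightarrow} \dots \overset{I}{\rightarrow} \pi_{*} \overset{E}{\rightarrow} v_{*}

A finite MDP has only finitely many deterministic policies, so the process terminates at an optimal policy after finitely many iterations.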

7. Policy iteration example

Using the same gridworld as above, find an optimal policy:

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
#import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.table import Table

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
gamma = 0.9

ACTIONS = np.array([[0, -1], [-1, 0], [0, 1], [1, 0]]);
ACTION_PROB = 0.25

def move(state, action):
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    state = np.array(state)
    next_state = (state + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
        next_state = state
    else:
        reward = 0
    return next_state, reward
  
def Evaluation(value, policy):
    Q = np.zeros(value.shape);
    while True:
        Q = value.copy();
        for i in np.arange(0, WORLD_SIZE):
            for j in np.arange(0, WORLD_SIZE):
                action = ACTIONS[policy[i, j]];
                (next_i, next_j), reward = move([i, j], action);
                value[i, j] = reward + gamma * value[next_i, next_j];
        delta = np.abs(value - Q);
        if delta.max() < 0.000001:
            break;
    return;
       
def Improvement(value, policy):
    isOptimal = True;
    new_policy = np.zeros(policy.shape);
    new_policy = policy.copy();
    for i in np.arange(0, WORLD_SIZE):
        for j in np.arange(0, WORLD_SIZE):
            Q = [];
            maxQ = -99999999.0;
            index = policy[i, j];
            for action in ACTIONS:
                (next_i, next_j), reward = move([i, j], action);
                Q.append(reward + gamma * value[next_i, next_j]);
            for k in range(len(ACTIONS)):  # was np.arange(0, ACTIONS.itemsize); itemsize is the element size in bytes, not the number of actions
                if(maxQ <= Q[k]):
                    maxQ = Q[k];
                    new_policy[i, j] = k;
            if new_policy[i, j] != policy[i, j]:
                isOptimal = False;
            policy[i, j] = new_policy[i, j];
    return isOptimal;
   

def policyIteration():
    value = np.zeros((WORLD_SIZE, WORLD_SIZE));
    policy = np.zeros((WORLD_SIZE, WORLD_SIZE), dtype = int);  
    while True:
        Evaluation(value, policy);
        isOptimal = Improvement(value, policy);
        if(isOptimal == True):
            print(policy);
            print(np.around(value, 2));
            break;

if __name__ == '__main__':
    policyIteration();
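The printed policy is a grid of integer indices into ACTIONS. As a small convenience (my addition, not part of the original code), it can be rendered as arrows; the mapping follows the order of ACTIONS, where [0, -1] is west, [-1, 0] is north, [0, 1] is east and [1, 0] is south:

# hypothetical helper: pretty-print an integer policy grid as arrows
ARROWS = ['←', '↑', '→', '↓']  # same order as ACTIONS

def printPolicy(policy):
    for row in policy:
        print(' '.join(ARROWS[a] for a in row))

For example, calling printPolicy(policy) right after print(policy) in policyIteration() shows the greedy action chosen in each cell.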

8. Value iteration

It applies the Bellman optimality equation directly as an update rule.
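Concretely, each synchronous sweep turns the Bellman optimality equation into a backup (lecture notation again); unlike policy iteration, no explicit policy is maintained while the values are being iterated:

v_{k+1}(s) = \underset{a\in A}{\max}\left(R_{s}^{a} + \gamma \sum_{s'\in S}P_{ss'}^{a}\, v_{k}(s')\right)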

 

The code:

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)             #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################

import numpy as np
#import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.table import Table

WORLD_SIZE = 5
A_POS = [0, 1]
A_PRIME_POS = [4, 1]
B_POS = [0, 3]
B_PRIME_POS = [2, 3]
gamma = 0.9

ACTIONS = np.array([[0, -1], [-1, 0], [0, 1], [1, 0]]);
ACTION_PROB = 0.25

def move(state, action):
    if state == A_POS:
        return A_PRIME_POS, 10
    if state == B_POS:
        return B_PRIME_POS, 5

    state = np.array(state)
    next_state = (state + action).tolist()
    x, y = next_state
    if x < 0 or x >= WORLD_SIZE or y < 0 or y >= WORLD_SIZE:
        reward = -1.0
        next_state = state
    else:
        reward = 0
    return next_state, reward

def valueIteration():
    Vk = np.zeros((WORLD_SIZE, WORLD_SIZE));
    policy = np.zeros((WORLD_SIZE, WORLD_SIZE), dtype = int);  
    iteration = 0; 
    while True:
        # keep iteration until convergence
        iteration += 1;
        Vkk = np.zeros(Vk.shape)
        for i in range(0, WORLD_SIZE):
            for j in range(0, WORLD_SIZE):
                Q = []
                for action in ACTIONS:
                    (next_i, next_j), reward = move([i, j], action)
                    # value iteration
                    Q.append(reward + gamma * Vk[next_i, next_j])
                Vkk[i, j] = np.max(Q)
                maxQ = -99999999.0;
                for k in range(len(ACTIONS)):  # was ACTIONS.itemsize, which is the element size in bytes, not the number of actions
                    if(maxQ <= Q[k]):
                        maxQ = Q[k];
                        policy[i, j] = k;
        if np.sum(np.abs(Vkk - Vk)) < 1e-4:
            print("value iteration:", iteration);
            print(policy)
            print(np.around(Vk, 2));
            break
        Vk = Vkk

if __name__ == '__main__':
    valueIteration();
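A possible simplification (my suggestion, not in the original code): the manual maxQ bookkeeping can be dropped by extracting the greedy policy with np.argmax once the values have converged, e.g. with a helper like this:

def greedyPolicy(V):
    # hypothetical helper: greedy policy with respect to a converged value function V
    policy = np.zeros((WORLD_SIZE, WORLD_SIZE), dtype=int)
    for i in range(WORLD_SIZE):
        for j in range(WORLD_SIZE):
            Q = []
            for action in ACTIONS:
                (next_i, next_j), reward = move([i, j], action)
                Q.append(reward + gamma * V[next_i, next_j])
            policy[i, j] = np.argmax(Q)  # index into ACTIONS
    return policy

Calling greedyPolicy(Vk) after the while loop would produce the same kind of integer policy grid that valueIteration() prints.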

 
