Reinforcement Learning Basics Notes [2] - Model-Based Dynamic Programming

References
[1] 强化学习入门 第二讲 基于模型的动态规划方法 (Introduction to Reinforcement Learning, Lecture 2: Model-Based Dynamic Programming Methods)

Classification of Reinforcement Learning

[Figure: classification of reinforcement learning methods]

The classification is shown in the figure above. Note that model-based dynamic programming assumes the transition probabilities $P$, the immediate reward function $R$, and the discount factor $\gamma$ are all known, whereas model-free reinforcement learning does not.

Understanding Dynamic Programming

"Dynamic" refers to the changing of states; "programming" refers to an optimization method. A problem can be solved by dynamic programming if it satisfies two conditions:

  1. It can be decomposed into subproblems.
  2. The solutions of the subproblems can be stored and reused.

$$v^*(s) = \max_{a}\left( R^a_{s} + \gamma\sum_{s' \in S} P^{a}_{ss'}\,v^*(s') \right) \tag{1.1}$$
$$q^*(s,a) = R^a_{s} + \gamma\sum_{s' \in S} P^{a}_{ss'}\,\max_{a'}q^*(s',a') \tag{1.2}$$

The Bellman optimality equations of a Markov decision process satisfy these two conditions, so the problem can be solved by dynamic programming. The core task is to find an optimal policy $\pi$ that maximizes the value function in (1.1).

Therefore, two problems need to be solved:

  1. How to compute the value function
  2. How to choose the policy

Computing the Value Function

From the earlier study of MDPs we have the following two formulas:
$$v_\pi(s) = \sum_{a \in A} \pi(a|s)\, q_{\pi}(s,a) \tag{1.3}$$
$$q_\pi(s,a) = R^a_s + \gamma\sum_{s'} P^{a}_{ss'}\, v_{\pi}(s') \tag{1.4}$$

Substituting (1.4) into (1.3) gives:
$$v_\pi(s) = \sum_{a \in A} \pi(a|s)\left( R^a_s + \gamma\sum_{s'} P^{a}_{ss'}\,v_{\pi}(s') \right) \tag{1.5}$$

This equation shows that $v_\pi(s)$ is computed from the values $v_\pi(s')$ of its successor states.
For model-based dynamic programming, $R$, $P$, and $\gamma$ are all known, and so is the current state $s$, so the only unknowns in (1.5) are the values $v_\pi(s')$. Equation (1.5) can therefore be treated as a system of linear equations and solved with an iterative method (e.g. a Gauss-Seidel-style iteration); the synchronous update used here is:

$$v_{k+1}(s) = \sum_{a \in A} \pi(a|s)\left( R^a_s + \gamma\sum_{s'} P^{a}_{ss'}\,v_{k}(s') \right) \tag{1.6}$$

where $k$ is the iteration (sweep) index.

The computation proceeds as follows: initialize $v_0$ to zero for all states, sweep over every state applying update (1.6), and repeat until the value function converges.

Here we use the gridworld example from [1]: a 4×4 grid whose top-left and bottom-right cells are terminal states, every move yields an immediate reward of $-1$, actions that would leave the grid keep the agent in place, and the agent follows the uniform random policy (each of the four actions with probability 0.25).

Based on my understanding of [1], I wrote the following code to compute the value function of this gridworld:

import numpy as np

def index2pos(vec, width):
    # Convert a linear state index into a (row, col) grid position.
    return np.array([vec // width, vec % width], dtype=np.int32)

def get_avil_pos(pos, shape):
    # Positions reached by moving up/down/left/right; moves that would leave
    # the grid are clipped, i.e. the agent stays where it is.
    ret = [
        np.array([np.clip(pos[0]-1, 0, shape[0] - 1), pos[1]]),  # up
        np.array([np.clip(pos[0]+1, 0, shape[0] - 1), pos[1]]),  # down
        np.array([pos[0], np.clip(pos[1]-1, 0, shape[1] - 1)]),  # left
        np.array([pos[0], np.clip(pos[1]+1, 0, shape[1] - 1)])   # right
        ]
    return ret

def main():
    States = np.zeros((4,4))      # v_k
    States_new = np.zeros((4,4))  # v_{k+1}
    r = -1       # immediate reward for every move
    gamma = 1    # discount factor
    pi = 0.25    # uniform random policy: probability of each action

    times = 10

    for k in range(1, times):
        # One synchronous sweep of update (1.6) over the non-terminal
        # states 1..14 (states 0 and 15 are terminal and keep the value 0).
        for s in range(1, 14+1):
            pos = index2pos(s, 4)
            avil_pos = get_avil_pos(pos, np.array([4,4]))
            States_new[pos[0], pos[1]] = 0
            for p in avil_pos:
                States_new[pos[0], pos[1]] += pi * (
                    r + gamma*States[p[0], p[1]]
                    )
        States = States_new.copy()
        print("\n----------------------------------------------------")
        print("\nK = {0}".format(k), end="\n\n")
        print(States)

if __name__ == "__main__":
    main()

The output is as follows:

----------------------------------------------------

K = 1

[[ 0. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1. -1.]
 [-1. -1. -1.  0.]]

----------------------------------------------------

K = 2

[[ 0.   -1.75 -2.   -2.  ]
 [-1.75 -2.   -2.   -2.  ]
 [-2.   -2.   -2.   -1.75]
 [-2.   -2.   -1.75  0.  ]]

----------------------------------------------------

K = 3

[[ 0.     -2.4375 -2.9375 -3.    ]
 [-2.4375 -2.875  -3.     -2.9375]
 [-2.9375 -3.     -2.875  -2.4375]
 [-3.     -2.9375 -2.4375  0.    ]]

----------------------------------------------------

K = 4

[[ 0.      -3.0625  -3.84375 -3.96875]
 [-3.0625  -3.71875 -3.90625 -3.84375]
 [-3.84375 -3.90625 -3.71875 -3.0625 ]
 [-3.96875 -3.84375 -3.0625   0.     ]]

----------------------------------------------------

K = 5

[[ 0.        -3.65625   -4.6953125 -4.90625  ]
 [-3.65625   -4.484375  -4.78125   -4.6953125]
 [-4.6953125 -4.78125   -4.484375  -3.65625  ]
 [-4.90625   -4.6953125 -3.65625    0.       ]]

----------------------------------------------------

K = 6

[[ 0.         -4.20898438 -5.50976562 -5.80078125]
 [-4.20898438 -5.21875    -5.58984375 -5.50976562]
 [-5.50976562 -5.58984375 -5.21875    -4.20898438]
 [-5.80078125 -5.50976562 -4.20898438  0.        ]]

----------------------------------------------------

K = 7

[[ 0.         -4.734375   -6.27734375 -6.65527344]
 [-4.734375   -5.89941406 -6.36425781 -6.27734375]
 [-6.27734375 -6.36425781 -5.89941406 -4.734375  ]
 [-6.65527344 -6.27734375 -4.734375    0.        ]]

----------------------------------------------------

K = 8

[[ 0.         -5.2277832  -7.0078125  -7.46630859]
 [-5.2277832  -6.54931641 -7.08837891 -7.0078125 ]
 [-7.0078125  -7.08837891 -6.54931641 -5.2277832 ]
 [-7.46630859 -7.0078125  -5.2277832   0.        ]]

----------------------------------------------------

K = 9

[[ 0.         -5.69622803 -7.6975708  -8.23706055]
 [-5.69622803 -7.15808105 -7.77856445 -7.6975708 ]
 [-7.6975708  -7.77856445 -7.15808105 -5.69622803]
 [-8.23706055 -7.6975708  -5.69622803  0.        ]]
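
As a quick sanity check on these numbers, consider state 1 (row 0, column 1 in the code's numbering; its "up" action leads off the grid and so keeps the agent in place, and state 0 to its left is terminal). Applying update (1.6) with the values from K = 1 gives

$$v_2(1) = \tfrac{1}{4}\big[(-1 + v_1(1)) + (-1 + v_1(5)) + (-1 + v_1(0)) + (-1 + v_1(2))\big] = \tfrac{1}{4}\big[(-2) + (-2) + (-1) + (-2)\big] = -1.75,$$

which matches the value printed for that cell at K = 2.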

Policy Selection

Once the value function of the current policy has been computed, the policy is improved greedily at every state: in each state, choose the action that maximizes the state-action value function.
That is:

$$\pi'(s) = \arg\max_{a}\, q_\pi(s,a) \tag{2.9}$$
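
As a concrete illustration of this greedy improvement step (separate from the gridworld code below), here is a minimal sketch; the function name greedy_improve and the q table are hypothetical, assuming q[s] maps each action to $q_\pi(s,a)$ computed from (1.4):

def greedy_improve(q):
    # Greedy policy improvement: in every state pick the action with the
    # largest state-action value under the current policy.
    new_policy = {}
    for s, action_values in q.items():
        new_policy[s] = max(action_values, key=action_values.get)
    return new_policy

# Example with made-up q values for two states:
q = {
    0: {"up": -1.0, "down": -2.0},
    1: {"up": -3.0, "down": -1.5},
}
print(greedy_improve(q))  # {0: 'up', 1: 'down'}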

Policy Iteration Algorithm

Combining the value-function computation above with greedy policy improvement gives the policy iteration algorithm: repeatedly evaluate the current policy and then improve it, until the policy no longer changes.

The code I wrote according to this algorithm is as follows:

import numpy as np

def index2pos(vec, width):
    # Convert a linear state index into a (row, col) grid position.
    return np.array([vec // width, vec % width], dtype=np.int32)

def get_avil_pos(pos, shape):
    # Positions reached by each of the four actions; moves that would leave
    # the grid are clipped, i.e. the agent stays where it is.
    ret = {
        "up":    np.array([np.clip(pos[0]-1, 0, shape[0] - 1), pos[1]]),
        "down":  np.array([np.clip(pos[0]+1, 0, shape[0] - 1), pos[1]]),
        "left":  np.array([pos[0], np.clip(pos[1]-1, 0, shape[1] - 1)]),
        "right": np.array([pos[0], np.clip(pos[1]+1, 0, shape[1] - 1)])
        }
    return ret

def find_best_policy(policy):
    # For each state, pick the action with the highest probability.
    best = []
    for i in policy:
        v, k = max(zip(i.values(), i.keys()))
        best.append(k)
    return best

def main():
    r = -1       # immediate reward for every move
    gamma = 1    # discount factor
    pi = 0.25    # initial uniform action probability

    policy = []
    best_policy = []
    for i in range(0, 16):
        policy.append({"up": pi, "down": pi, "left": pi, "right": pi})

    K_times = 4  # number of evaluation sweeps per policy iteration (k = 1..3)
    Times = 1
    while True:
        States = np.zeros((4,4))
        States_new = np.zeros((4,4))

        print("\n----------------------------------------------------")
        print("{0} iteration: \n".format(Times))
        print("Initial Policy:")
        print(np.array(policy))

        # Policy evaluation: a few synchronous sweeps of update (1.6)
        for k in range(1, K_times):
            for s in range(1, 14+1):
                pos = index2pos(s, 4)
                avil_pos = get_avil_pos(pos, np.array([4,4]))
                States_new[pos[0], pos[1]] = 0
                for p in avil_pos:
                    States_new[pos[0], pos[1]] += policy[s][p] * (
                        r + gamma*States[avil_pos[p][0], avil_pos[p][1]]
                        )
            States = States_new.copy()

        print("\n----------------------------------------------------")
        print("\nAfter {0} iterations, the value function of all states:".format(k), end="\n\n")
        print(States)

        # Policy improvement: shift every successor value by the worst
        # neighbour's value and renormalize, so that better actions get
        # higher probability (the greedy action is then recovered by
        # find_best_policy).
        for s in range(1, 14+1):
            pos = index2pos(s, 4)
            avil_pos = get_avil_pos(pos, np.array([4,4]))
            total = 0
            minnum = 0
            for p in avil_pos:
                minnum = min(minnum, States[avil_pos[p][0], avil_pos[p][1]])
            for p in avil_pos:
                total += States[avil_pos[p][0], avil_pos[p][1]] - minnum
            for p in avil_pos:
                policy[s][p] = (States[avil_pos[p][0], avil_pos[p][1]] - minnum) / total

        print("\n----------------------------------------------------")
        print("Updated Policy:")
        print(np.array(policy))

        print("\nChoose:")
        print(np.array(best_policy))

        # Stop when the greedy policy no longer changes between iterations.
        temp_policy = find_best_policy(policy)
        if best_policy == temp_policy:
            break
        else:
            best_policy = temp_policy
            Times += 1

    print("After {0} times of iteration, we find the best policy.".format(Times))

if __name__ == "__main__":
    main()

Output (the best action for states 0 through 15, in order):

Choose:
['up' 'left' 'left' 'left' 'up' 'up' 'left' 'down' 'up' 'up' 'right'
 'down' 'up' 'right' 'right' 'up']
After 2 times of iteration, we find the best policy.

It should be noted that the policy evaluation step above requires many sweeps to converge, but in practice the greedy policy obtained after only a limited number of evaluation sweeps is often identical to the one obtained by evaluating the policy to convergence. The policy can therefore be improved after just one evaluation sweep; this variant is called the value iteration algorithm.

Value Iteration Algorithm
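
The pseudocode for this algorithm from [1] is not reproduced here. As a rough sketch of the idea (my own illustration, not code from [1]), value iteration replaces the expectation over the policy in (1.6) with a max over actions, as in the Bellman optimality equation (1.1); on the same gridworld as above it could look like this:

import numpy as np

def value_iteration(r=-1, gamma=1.0, sweeps=100):
    # Minimal sketch of value iteration on the 4x4 gridworld used above:
    # v_{k+1}(s) = max_a [ r + gamma * v_k(s') ], terminal cells stay at 0.
    V = np.zeros((4, 4))
    for _ in range(sweeps):
        V_new = V.copy()
        for s in range(1, 15):                               # non-terminal states
            row, col = s // 4, s % 4
            neighbours = [
                (max(row-1, 0), col), (min(row+1, 3), col),  # up, down
                (row, max(col-1, 0)), (row, min(col+1, 3)),  # left, right
            ]
            V_new[row, col] = max(r + gamma * V[n] for n in neighbours)
        if np.allclose(V_new, V):
            break
        V = V_new
    return V

print(value_iteration())

With r = -1 and gamma = 1 this converges to the negative of the number of steps from each cell to the nearest terminal cell, and the greedy policy with respect to the converged values is the shortest-path policy found above.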
