Sutton & Barto Reinforcement Learning

The following is only my personal understanding of the book (the original is in English); corrections and discussion are welcome.

Terminology

policy: a mapping from each state to a probability distribution over actions, written $\pi(a|s)$.
state-value: the expected return from the current state until the terminal state (if there is one), written $v_{\pi}(s)$.
action-value: the expected return from the current state until the terminal state (if there is one), given that action $a$ is taken first, written $q_{\pi}(s,a)$.

Markov Decision Process

An environment satisfies the Markov property as long as its next state depends only on the current state (and the action taken), not on the earlier history.

If the action, state, and reward sets are all finite, it is further called a finite MDP.

If the environment is a finite MDP, then for a given policy the state-value function and the action-value function can be computed iteratively as follows:
$$v_{k+1}(s) = E\bigl[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s\bigr] = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\bigl(r + \gamma v_k(s')\bigr)$$
$$q(s,a) = E\bigl[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \sum_{s',r} p(s',r|s,a)\bigl(r + \gamma v_k(s')\bigr)$$

The difference between the two: the state-value function is the expected return from the current state, averaged over all actions the policy might take; the action-value function is the expected return obtained by taking one particular action in the current state.

Both formulas are parameterized by the policy.
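
As a concrete illustration of the two sums (my own sketch, not code from the book), the snippet below evaluates one backup of each kind on a made-up two-state, two-action tabular model; the `model` dictionary encoding $p(s',r|s,a)$ and the policy array `pi` are hypothetical examples:

import numpy as np

gamma = 0.9

# hypothetical tabular model: model[(s, a)] lists (probability, next_state, reward)
model = {
    (0, 0): [(1.0, 0, 0.0)],                   # state 0 behaves like a terminal state
    (0, 1): [(1.0, 0, 0.0)],
    (1, 0): [(0.8, 0, -1.0), (0.2, 1, -1.0)],  # action 0 usually reaches state 0
    (1, 1): [(1.0, 1, -1.0)],                  # action 1 stays in state 1
}
# hypothetical policy: pi[s, a] = pi(a|s)
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])

def q_value(s, a, v):
    # q(s,a) = sum over s', r of p(s',r|s,a) * (r + gamma * v(s'))
    return sum(p * (r + gamma * v[s2]) for p, s2, r in model[(s, a)])

def state_value(s, v):
    # v(s) = sum over a of pi(a|s) * q(s,a)
    return sum(pi[s, a] * q_value(s, a, v) for a in range(pi.shape[1]))

v = np.zeros(2)                  # current estimate v_k
print(state_value(1, v))         # one backup of v_{k+1}(1): -1.0
print(q_value(1, 0, v))          # one backup of q(1, 0):    -1.0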

Solving a finite MDP with DP

Both the state-value function and the action-value function are stored as tables.

What "in-place" means in the book

When sweeping the table, updating a state requires the values of all of its successor states. Some of those successors may already have been updated during the current sweep; if the update uses those freshly updated values instead of the values from the previous sweep, the method is in-place. In-place updates usually converge faster.
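
To make the difference concrete, here is a small sketch of my own (not from the book) that compares the two sweep styles on a made-up 5-state chain whose single action moves one step toward the terminal state 0 with reward -1; the in-place variant needs fewer sweeps here because the sweep order 0, 1, 2, ... follows the direction of the backups:

import numpy as np

gamma, n_states = 1.0, 5   # made-up chain: states 0..4, state 0 is terminal

def backup(values, s):
    # deterministic policy: move one step toward state 0, reward -1 per step
    return 0.0 if s == 0 else -1.0 + gamma * values[s - 1]

def sweeps_until_converged(in_place):
    v = np.zeros(n_states)
    for sweep in range(1, 1000):
        # in-place reads values already updated in this sweep; otherwise read a frozen copy
        source = v if in_place else v.copy()
        delta = 0.0
        for s in range(n_states):
            new_value = backup(source, s)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < 1e-8:
            return sweep
    return None

print(sweeps_until_converged(in_place=True))    # 2 sweeps
print(sweeps_until_converged(in_place=False))   # 5 sweeps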

policy evaluation

Compute the state values of a given policy (a sketch follows the list below):

  1. Initialize the table
  2. For each state, look up $\pi(a|s)$ for each action
  3. Update every state value using the state-value definition above, record the change of each update, and track the largest change across all states in the sweep
  4. If that largest change is smaller than a threshold, stop; otherwise go to step 2
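
A sketch of these four steps, assuming for illustration the equiprobable random policy ($\pi(a|s) = 1/4$ for every action) on the same 5×5 grid used in the full program below:

import numpy as np

grid_size, gamma, theta = 5, 0.9, 1e-4
actions = np.array([[-1, 0], [1, 0], [0, -1], [0, 1]])   # up, down, left, right

def step(state, action):
    if state[0] == 0 and state[1] == 0:      # terminal state stays put with reward 0
        return state, 0
    new_state = state + action
    if (new_state < 0).any() or (new_state >= grid_size).any():
        return state, -1                     # bumping a wall leaves the state unchanged
    return new_state, -1

# 1. initialize the table
v = np.zeros((grid_size, grid_size))
while True:
    delta = 0.0
    for x in range(grid_size):
        for y in range(grid_size):
            # 2.-3. expected update: average the one-step lookahead over all four actions
            value = 0.0
            for a in actions:
                (nx, ny), r = step(np.array([x, y]), a)
                value += 0.25 * (r + gamma * v[nx, ny])
            delta = max(delta, abs(v[x, y] - value))
            v[x, y] = value                  # in-place update
    # 4. stop once the largest change in a sweep is below the threshold
    if delta < theta:
        break

print(np.round(v, 1))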

policy improvement

For the policy whose evaluation has converged, if at some state there is an action with $q(s,a) \ge v(s)$, the policy can be updated to take that action there. Repeatedly applying this greedy step is guaranteed to produce a policy at least as good as the previous one:

$$q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s) \;\; \forall s \quad\Longrightarrow\quad v_{\pi'}(s) \ge v_{\pi}(s) \;\; \forall s$$
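
A minimal sketch of the greedy improvement step itself, using made-up action values `q[s, a]` and the state values `v` of a hypothetical equiprobable current policy:

import numpy as np

# hypothetical action values q(s, a) for 3 states and 2 actions,
# and the state values of the current (equiprobable) policy: v(s) = 0.5*q(s,0) + 0.5*q(s,1)
q = np.array([[ 0.0, -1.0],
              [-2.0, -1.5],
              [-3.0, -3.0]])
v = q.mean(axis=1)                      # [-0.5, -1.75, -3.0]

# greedy improvement: in every state pick argmax_a q(s, a);
# because max_a q(s, a) >= v(s), the new policy is at least as good as the old one
new_policy = np.argmax(q, axis=1)
print(new_policy)                       # [0 1 0]
print((q.max(axis=1) >= v).all())       # True: the improvement condition holds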

policy iteration

If the environment is a finite MDP, alternating policy evaluation and policy improvement yields an optimal policy:

  1. init table
  2. do policy evaluation
  3. do policy improvement
    sweep through the state-value table; if $\arg\max_a q(s,a)$ is not equal to $\pi(s)$ for some state, go to step 2, else done

Below is the code for a 5×5 grid with (0,0) as the terminal state:

import numpy as np

grid_width, grid_height = 5, 5   # 5x5 grid; (0, 0) is the terminal state
#up down left right
action_set = np.array([[-1,0],[1,0],[0,-1],[0,1]])
policy = np.zeros((grid_height, grid_width),dtype=np.int8)
gamma = 0.9


def step(state, action):
    # the terminal state (0, 0) stays put and yields reward 0
    if state[0] == 0 and state[1] == 0:
        return state, 0
    newstate = state + action
    # moving off the grid leaves the state unchanged; every non-terminal step costs -1
    if newstate[0] < 0 or newstate[0] >= grid_height or newstate[1] < 0 or newstate[1] >= grid_width:
        return state, -1
    else:
        return newstate, -1

#init state-value function
state_value = np.zeros((grid_height, grid_width))
policy_evaluation_iteration = 0

while True:
    #policy evaluation: in-place sweeps until the largest value change falls below a threshold
    while True:
        delta = 0
        for x in range(grid_height):
            for y in range(grid_width):
                # one-step lookahead under the current deterministic policy
                next_state, reward = step(np.array([x, y]), action_set[policy[x][y]])
                value = reward + gamma * state_value[next_state[0]][next_state[1]]

                # track the largest value change in this sweep
                if delta < abs(state_value[x][y] - value):
                    delta = abs(state_value[x][y] - value)

                # in-place update: later states in this sweep see the new value
                state_value[x][y] = value
        policy_evaluation_iteration += 1
        if delta < 1e-4:
            break
    
    #policy improvement
    policy_stable = True
    for x in range(grid_height):
        for y in range(grid_width):
            # value of the action currently prescribed by the policy
            next_state, reward = step(np.array([x, y]), action_set[policy[x][y]])
            max_action_value = reward + gamma * state_value[next_state[0]][next_state[1]]
            optimal_action = policy[x][y]
            # greedy improvement: pick the action with the largest one-step lookahead value
            for action_index in range(4):
                next_state, reward = step(np.array([x, y]), action_set[action_index])
                action_value = reward + gamma * state_value[next_state[0]][next_state[1]]

                if max_action_value < action_value:
                    max_action_value = action_value
                    optimal_action = action_index
            
            if optimal_action != policy[x][y]:
                policy_stable = False
            
            policy[x][y] = optimal_action
    
    if policy_stable:
        break

print(state_value)
print(policy)
print(policy_evaluation_iteration)

    



    
