Reinforcement Learning on a Maze Problem (MC, Sarsa, and Q-learning Implementations)

Understanding the Monte Carlo, Sarsa, and Q-learning algorithms through a simple maze problem.

The $3 \times 3$ maze is shown below.
[Figure: the 3×3 maze, states S0–S8, with S8 as the goal]
One way to solve it is to take random steps until reaching S8.
Here we instead use reinforcement learning to find the best route.
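The figure itself is not reproduced here, but the layout can be read off from the wall constraints encoded in $\theta_{0}$ below (my reconstruction): the states S0–S8 form a 3×3 grid, S8 in the bottom-right corner is the goal, and walls block the pairs S1–S4, S4–S5, S5–S8 and S6–S7:

+--+--+--+
|S0 S1 S2|
+  +--+  +
|S3 S4|S5|
+  +  +--+
|S6|S7 S8|
+--+--+--+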

The MC (Monte Carlo) algorithm:

First define a policy. It is a matrix whose rows correspond to states S0~S7
and whose columns correspond to the 4 action directions, ordered as ↑, →, ↓, ←.

The policy is written $\pi_{\theta}(s, a)$, where $\theta$ is the parameter that determines $\pi$. Start by defining an initial $\theta_{0}$:
a value of 1 in $\theta_{0}$ means that direction can be taken; where there is a wall, the entry is np.nan.

import numpy as np

# rows: S0~S7, columns: ↑, →, ↓, ←
# theta: parameters of the policy pi(s, a)
theta_0 = np.array([[np.nan, 1, 1, np.nan], #S0
                    [np.nan, 1, np.nan, 1], #S1
                    [np.nan, np.nan, 1, 1], #S2
                    [1, 1, 1, np.nan], #S3
                    [np.nan, np.nan, 1, 1], #S4
                    [1, np.nan, np.nan, np.nan], #S5
                    [1, np.nan, np.nan, np.nan], #S6
                    [1, 1, np.nan, np.nan], #S7
                    #S8 is goal, no policy
    ])

Next, convert $\theta_{0}$ into $\pi_{\theta}(s, a)$. One can simply normalize the ratios, or use a softmax.
The softmax formula is:
$\pi_{\theta}(s, a_{i}) = \frac{\exp(\beta\theta_{i})}{\sum_{j=1}^{N_{a}}\exp(\beta\theta_{j})}$
where $N_{a}$ is the number of actions (4 here) and $\beta$ is an inverse-temperature parameter (set to 1.0 in the code).

#use softmax to get policy
def softmax_convert_into_pi_from_theta(theta):
    beta = 1.0
    [m, n] = theta.shape
    pi = np.zeros((m,n))
    
    exp_theta = np.exp(beta * theta)
    
    for i in range(0, m):
        pi[i, :] = exp_theta[i, :] /np.nansum(exp_theta[i, :])
        
    pi = np.nan_to_num(pi)
    
    return pi

Printing the initial $\pi_{\theta}$ shows the probability of each action in each state:

pi_0 = softmax_convert_into_pi_from_theta(theta_0)
print(pi_0)
#output:
#[[ 0.          0.5         0.5         0.        ]
# [ 0.          0.5         0.          0.5       ]
# [ 0.          0.          0.5         0.5       ]
# [ 0.33333333  0.33333333  0.33333333  0.        ]
# [ 0.          0.          0.5         0.5       ]
# [ 1.          0.          0.          0.        ]
# [ 1.          0.          0.          0.        ]
# [ 0.5         0.5         0.          0.        ]]
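As a quick sanity check (my addition, not in the original), each row of pi_0 is a proper probability distribution:

print(pi_0.sum(axis=1))  # every entry is 1.0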

The agent's moves are determined by the policy:
the next action and next state are sampled according to the probabilities in $\pi$.

def get_action_and_next_s(pi, s):
    direction = ["up", "right", "down", "left"]
    
    next_direction = np.random.choice(direction, p=pi[s, :])
    
    if next_direction == "up":
        action = 0
        s_next = s - 3
    elif next_direction == "right":
        action = 1
        s_next = s + 1
    elif next_direction == "down":
        action = 2
        s_next = s + 3
    elif next_direction == "left":
        action = 3
        s_next = s - 1
    
    return [action, s_next]

The agent keeps exploring until it reaches the goal; the (state, action) pairs visited along the way are stored in s_a_history.

def goal_maze_ret_s_a(pi):
    s = 0  #start
    s_a_history = [[0, np.nan]]
    
    while(1):
        [action, next_s] = get_action_and_next_s(pi, s)
        s_a_history[-1][1] = action
        #print([action, next_s])
        
        s_a_history.append([next_s, np.nan])  # the next action is not known yet, so use np.nan as a placeholder
        
        if next_s == 8:
            break
        else:
            s = next_s
            
    return s_a_history

Take one path at random:

s_a_history = goal_maze_ret_s_a(pi_0)
print(s_a_history)
#output:
#[[0, 2], [3, 0], [0, 2], [3, 0], [0, 1], [1, 3], [0, 2], [3, 0], [0, 2], [3, 1], [4, 2], [7, 0], [4, 3], [3, 1], [4, 2], [7, 1], [8, nan]]
#16 steps (17 entries including the start state)

The MC approach needs a complete episode to update the policy, so after each full path we update $\theta$ and then recompute $\pi$ from $\theta$.
Repeat this until the policy converges.
The update rule for $\theta$:
$\theta_{s_{i},a_{j}} \leftarrow \theta_{s_{i},a_{j}} + \eta \cdot \Delta\theta_{s_{i},a_{j}}$
$\Delta\theta_{s_{i},a_{j}} = \{N(s_{i}, a_{j}) + P(s_{i}, a_{j})\,N(s_{i}, a)\}/T$

where $N(\cdot)$ counts occurrences in the episode and $T$ is the total number of steps taken to reach the goal.

def update_theta(theta, pi, s_a_history):
    eta = 0.1  #learning rate
    T = len(s_a_history) - 1
    
    [m, n] = theta.shape
    delta_theta = theta.copy() #deep copy
    
    for i in range(0, m):
        for j in range(0, n):
            if not(np.isnan(theta[i, j])):
                SA_i = [SA for SA in s_a_history if SA[0] == i] #the (s,a) pair with state i
                SA_ij = [SA for SA in s_a_history if SA == [i, j]] #the (s,a) pair with state i, action j
                
                N_i = len(SA_i)
                N_ij = len(SA_ij)
                delta_theta[i, j] = (N_ij + pi[i, j] * N_i) / T
    
    new_theta = theta + eta * delta_theta
    
    return new_theta

Iterate the MC algorithm, with a convergence criterion:

stop_epsilon = 10**-8

theta = theta_0
pi = pi_0

is_continue = True
count = 1

while is_continue:
    s_a_history = goal_maze_ret_s_a(pi)
    new_theta = update_theta(theta, pi, s_a_history)
    new_pi = softmax_convert_into_pi_from_theta(new_theta)
    
    print(np.sum(np.abs(new_pi - pi)))
    print("steps to reach goal is: " + str(len(s_a_history) - 1))
    
    if np.sum(np.abs(new_pi - pi)) < stop_epsilon:
        is_continue = False
    else:
        theta = new_theta
        pi = new_pi

Finally, check the converged policy:

np.set_printoptions(precision=3, suppress=True)  # suppress scientific notation
print(pi)
#output
#[[ 0.     0.     1.     0.   ]
# [ 0.     0.38   0.     0.62 ]
# [ 0.     0.     0.474  0.526]
# [ 0.     1.     0.     0.   ]
# [ 0.     0.     1.     0.   ]
# [ 1.     0.     0.     0.   ]
# [ 1.     0.     0.     0.   ]
# [ 0.     1.     0.     0.   ]]

The policy now follows the optimal route. The rows whose probabilities did not converge to 0/1 belong to states that the optimal route no longer visits; since those actions are never taken again, their probabilities stay close to their initial values.
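As a small extra check (my addition, not part of the original post), the greedy route implied by the converged pi can be read off by always taking the most probable action in each state:

def greedy_path_from_pi(pi, goal=8):
    s = 0
    path = [s]
    while s != goal:
        a = int(np.argmax(pi[s, :]))         # most probable action in state s
        s = [s - 3, s + 1, s + 3, s - 1][a]  # move: up, right, down, left
        path.append(s)
    return path

print(greedy_path_from_pi(pi))  # [0, 3, 4, 7, 8]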

The Sarsa algorithm

About the name: Sarsa uses the current State (S) and Action (A), the immediate Reward (R) obtained, and then the next State (S') and Action (A') to form the expected-return target.

Define a Q matrix and choose the next action with an $\epsilon$-greedy policy. The Q values are updated at every step, rather than once per episode as the MC algorithm updates its policy.
Once the Q matrix converges, it gives the optimal route.

The Q matrix has the same shape as the policy above: rows are states and columns are actions, initialized with random values.
To preserve the wall constraints, Q is multiplied by $\theta_{0}$,
and the Q values for the goal state are taken to be 0.

# rows: S0~S7, columns: ↑, →, ↓, ←
# theta_0: wall constraints / parameters of the policy pi(s, a)
theta_0 = np.array([[np.nan, 1, 1, np.nan], #S0
                    [np.nan, 1, np.nan, 1], #S1
                    [np.nan, np.nan, 1, 1], #S2
                    [1, 1, 1, np.nan], #S3
                    [np.nan, np.nan, 1, 1], #S4
                    [1, np.nan, np.nan, np.nan], #S5
                    [1, np.nan, np.nan, np.nan], #S6
                    [1, 1, np.nan, np.nan], #S7
                    #S8 is goal, no policy
    ])

[a, b] = theta_0.shape
Q = np.random.rand(a, b) * theta_0
Q[7,3]=0

print(Q)
#output:
[[   nan  0.745  0.711    nan]
 [   nan  0.286    nan  0.385]
 [   nan    nan  0.513  0.056]
 [ 0.501  0.921  0.993    nan]
 [   nan    nan  0.891  0.01 ]
 [ 0.875    nan    nan    nan]
 [ 0.439    nan    nan    nan]
 [ 0.958  0.936    nan  0.   ]]

$\epsilon$-greedy (choosing the next action):
Greedy means always picking the action with the largest value in Q[s, :]. The problem is that pure greedy selection easily gets stuck in a local optimum, so random exploration has to be balanced against greedy exploitation.
So with probability $\epsilon$ the agent explores randomly, and with probability $1-\epsilon$ it acts greedily.
$\epsilon$ is decayed as training proceeds.

There is also a theoretical result that $\epsilon$-greedy improves the policy:
for any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_{\pi}$ is an improvement, i.e. $V_{\pi'}(s) \geq V_{\pi}(s)$.
The proof is omitted here.

Since random exploration is involved, we still need a policy to explore with.
Here, instead of softmax, we use simple proportional probabilities:

def simple_convert_into_pi_from_theta(theta):
    [m, n] = theta.shape
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])
        
    pi = np.nan_to_num(pi)
    return pi

pi_0 = simple_convert_into_pi_from_theta(theta_0)

Implementing $\epsilon$-greedy:

#epsilon-greedy
#get the action
def get_action(s, Q, epsilon, pi_0):
    direction = ["up", "right", "down", "left"]
    #print("s = " + str(s))
    
    #probability epsilon to random search
    if np.random.rand() < epsilon:
        next_direction = np.random.choice(direction, p=pi_0[s, :])
    else:
        #move by the maximum Q
        next_direction = direction[np.nanargmax(Q[s, :])]
    
    if next_direction == "up":
        action = 0
    elif next_direction == "right":
        action = 1
    elif next_direction == "down":
        action = 2
    elif next_direction == "left":
        action = 3
        
    return action
    
#get next state by action
def get_s_next(s, a, Q, epsilon, pi_0):
    direction = ["up", "right", "down", "left"]
    next_direction = direction[a]
    
    if next_direction == "up":
        s_next = s - 3
    elif next_direction == "right":
        s_next = s + 1
    elif next_direction == "down":
        s_next = s + 3
    elif next_direction == "left":
        s_next = s - 1
        
    return s_next

The Sarsa algorithm updates the Q matrix from the current state and action.
Derivation, starting from the incremental form of the sample mean:
$\mu_{k} = \frac{1}{k}\sum_{j=1}^{k}x_{j}$
$= \frac{1}{k}\big(x_{k} + \sum_{j=1}^{k-1}x_{j}\big)$
$= \frac{1}{k}\big(x_{k} + (k-1)\mu_{k-1}\big)$
$= \mu_{k-1} + \frac{1}{k}(x_{k} - \mu_{k-1})$
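A quick numerical check of this incremental-mean identity (my addition, with arbitrary sample values):

x = np.array([2.0, 5.0, 3.0, 8.0])  # arbitrary samples
mu = 0.0
for k, xk in enumerate(x, start=1):
    mu = mu + (xk - mu) / k          # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
print(mu, x.mean())                  # both print 4.5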

Applying this running-mean update to each state $S_{t}$ with return estimate $G_{t}$:
$N(S_{t}) \leftarrow N(S_{t})+1$
$V(S_{t}) \leftarrow V(S_{t}) + \frac{1}{N(S_{t})}\big(G_{t}-V(S_{t})\big)$

with the bootstrapped target
$G_{t}=R_{t+1} +\gamma V(S_{t+1})$

Replacing $V$ with $Q$ and the count-based step size with a learning rate $\eta$ gives the Sarsa update:
$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \eta\big(R_{t+1} +\gamma Q(s_{t+1},a_{t+1})-Q(s_{t}, a_{t})\big)$
where $R_{t+1} +\gamma Q(s_{t+1},a_{t+1})-Q(s_{t}, a_{t})$ is called the TD error.

Update Q according to this formula:

def Sarsa(s, a, r, s_next, a_next, Q, eta, gamma):
    if s_next == 8:  # reached the goal: there is no next state/action to bootstrap from
        Q[s, a] = Q[s, a] + eta * (r - Q[s, a])
    else:
        Q[s, a] = Q[s, a] + eta * (r + gamma*Q[s_next, a_next] - Q[s, a])
        
    return Q

One episode:

def goal_maze_ret_s_a_Q(Q, epsilon, eta, gamma, pi):
    s = 0 #start state
    a = a_next = get_action(s, Q, epsilon, pi) #get action by epsilon-greedy
    s_a_history = [[0, np.nan]]
    
    while(1): #until reach the goal
        a = a_next
        
        s_a_history[-1][1] = a
        
        s_next = get_s_next(s, a, Q, epsilon, pi)
        s_a_history.append([s_next, np.nan])
        
        if s_next == 8:
            r = 1 #get the reward
            a_next = np.nan
        else:
            r = 0
            a_next = get_action(s_next, Q, epsilon, pi)
            
        Q = Sarsa(s, a, r, s_next, a_next, Q, eta, gamma)  #update Q
        
        if s_next == 8:
            break
        else:
            s = s_next
    
    return [s_a_history, Q]

The full training loop:

eta = 0.1
gamma = 0.9
epsilon = 0.5
v = np.nanmax(Q, axis=1) #select the maximum Q value for each state
is_continue = True
episode = 1

while is_continue:
    print("episode: " + str(episode))
    
    epsilon = epsilon / 2  # decay epsilon each episode
    
    [s_a_history, Q] = goal_maze_ret_s_a_Q(Q, epsilon, eta, gamma, pi_0)
    
    new_v = np.nanmax(Q, axis=1) #maximum value for each state
    
    print(np.sum(np.abs(new_v - v)))
    
    v = new_v
    
    print("steps to reach goal is: " + str(len(s_a_history) - 1))
    
    episode = episode + 1
    if episode > 100:
        break

Although the loop is written to run up to 100 episodes, the output shows it reaches the optimal route within just a few episodes.

episode: 1
0.227489819094
steps to reach goal is: 14
episode: 2
0.154785251991
steps to reach goal is: 10
episode: 3
0.0105243446365
steps to reach goal is: 4
episode: 4
0.00946282714278
steps to reach goal is: 4
episode: 5
0.00848520483778
steps to reach goal is: 4
episode: 6
0.00758386849718
steps to reach goal is: 4
episode: 7
0.0051117594454
steps to reach goal is: 4
episode: 8

Check the converged path and Q matrix:

print(s_a_history)
#output:
[[0, 2], [3, 1], [4, 2], [7, 1], [8, nan]]

print(Q)
#output:
[[   nan  0.705  0.729    nan]
 [   nan  0.286    nan  0.41 ]
 [   nan    nan  0.513  0.056]
 [ 0.501  0.81   0.775    nan]
 [   nan    nan  0.9    0.088]
 [ 0.875    nan    nan    nan]
 [ 0.575    nan    nan    nan]
 [ 0.928  1.       nan  0.   ]]
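As a check (my addition), along the learned route S0 → S3 → S4 → S7 → S8 the action values are close to the discounted returns $\gamma^{k}$ for reaching the goal in $k$ more steps (0.729, 0.81, 0.9, 1.0), which matches Q[0, 2], Q[3, 1], Q[4, 2] and Q[7, 1] above:

expected = [gamma**k for k in (3, 2, 1, 0)]     # [0.729, 0.81, 0.9, 1.0]
learned = [Q[0, 2], Q[3, 1], Q[4, 2], Q[7, 1]]  # values along the optimal route
print(np.round(expected, 3), np.round(learned, 3))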

Standard Sarsa only backs values up one step at a time. Sarsa($\lambda$) can update the whole path in a single sweep; the difference from plain Sarsa is an additional eligibility-trace matrix $E$.
(That will be covered in a separate post.)
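Although the post defers Sarsa($\lambda$) to a separate write-up, here is a rough sketch of the idea (my addition, not the author's implementation): the eligibility matrix E marks recently visited (s, a) pairs, and the TD error of the current step is broadcast to all of them at once.

# sketch of one Sarsa(lambda) update step (assumes the same maze setup;
# E starts as np.zeros(Q.shape) at the beginning of each episode)
def sarsa_lambda_step(s, a, r, s_next, a_next, Q, E, eta, gamma, lam):
    E = gamma * lam * E              # decay all eligibility traces
    E[s, a] += 1                     # mark the (s, a) pair just visited
    if s_next == 8:                  # goal: no bootstrap term
        td_error = r - Q[s, a]
    else:
        td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q = Q + eta * td_error * E       # update every eligible (s, a) at once
    return Q, E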

Q-learning

The difference between Q-learning and Sarsa:

Compare the two update rules for Q.

Sarsa:
$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \eta\big(R_{t+1} +\gamma Q(s_{t+1},a_{t+1})-Q(s_{t}, a_{t})\big)$

Q-learning:
$Q(s_{t}, a_{t}) \leftarrow Q(s_{t}, a_{t}) + \eta\big(R_{t+1} +\gamma \max_{a}Q(s_{t+1},a)-Q(s_{t}, a_{t})\big)$

Sarsa's update needs the actually chosen next action $a_{t+1}$, while Q-learning uses the maximum Q value over actions in $s_{t+1}$.
Since $a_{t+1}$ is produced by the action-selection policy, Sarsa's update depends on that policy; it is an on-policy method.
Q-learning's update does not depend on the action-selection policy; it is an off-policy method.

Because the Q-learning target uses the greedy max rather than the randomly sampled next action, it is not disturbed by the random part of $\epsilon$-greedy, and it tends to converge faster than Sarsa. In the implementation below, exploration is dropped entirely and actions are chosen greedily.
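To make the difference concrete, here is a small numeric comparison (my addition, with made-up values): suppose the next state has action values [0.2, 0.8] and the behavior policy happens to sample the first action. The two TD targets then differ:

# hypothetical values, purely to illustrate the two targets
r, gamma_ex = 0.0, 0.9
Q_next = np.array([0.2, 0.8])   # action values in the next state
a_next = 0                      # action actually sampled by the behavior policy

sarsa_target = r + gamma_ex * Q_next[a_next]     # 0.18, uses the sampled action
q_learning_target = r + gamma_ex * Q_next.max()  # 0.72, uses the greedy max
print(sarsa_target, q_learning_target)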

Following the Q-learning formula:

def Q_learning(s, a, r, s_next, Q, eta, gamma):
    if s_next == 8:  #goal
        Q[s, a] = Q[s, a] + eta*(r - Q[s, a])
    else:
        Q[s, a] = Q[s, a] + eta*(r + gamma*np.nanmax(Q[s_next,:]) - Q[s, a])
    
    return Q

Since $\epsilon$-greedy is not used here, the random part is removed from the action-selection function:

#get the action
def get_action(s, Q):
    direction = ["up", "right", "down", "left"]
    #print("s = " + str(s))    

    next_direction = direction[np.nanargmax(Q[s, :])]
    
    if next_direction == "up":
        action = 0
    elif next_direction == "right":
        action = 1
    elif next_direction == "down":
        action = 2
    elif next_direction == "left":
        action = 3
        
    return action

def get_s_next(s, a):
    direction = ["up", "right", "down", "left"]
    next_direction = direction[a]
    
    if next_direction == "up":
        s_next = s - 3
    elif next_direction == "right":
        s_next = s + 1
    elif next_direction == "down":
        s_next = s + 3
    elif next_direction == "left":
        s_next = s - 1
        
    return s_next

One episode:

def goal_maze_ret_Q_learning(Q, eta, gamma):
    s = 0 #start state
    a = a_next = get_action(s, Q) #get action by maximum Q value
    s_a_history = [[0, np.nan]]
    
    while(1): #until reach the goal
        a = a_next
        
        s_a_history[-1][1] = a
        
        s_next = get_s_next(s, a)
        s_a_history.append([s_next, np.nan])
        
        if s_next == 8:
            r = 1 #get the reward
            a_next = np.nan
        else:
            r = 0
            a_next = get_action(s_next, Q)
            
        Q = Q_learning(s, a, r, s_next, Q, eta, gamma) #update Q
        
        if s_next == 8:
            break
        else:
            s = s_next
    
    return [s_a_history, Q]
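Note that Q still holds the values learned by Sarsa above. The episode-1 output below (20 steps) suggests the Q-learning run starts from a fresh random matrix, so presumably Q is re-initialized first (my assumption; this step is not shown in the original):

# re-initialize Q randomly, keeping nan at the wall entries
[a, b] = theta_0.shape
Q = np.random.rand(a, b) * theta_0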

The full loop, up to 100 episodes:

eta = 0.1 #learning-rate
gamma = 0.9  # discount rate
v = np.nanmax(Q, axis=1) #maximum value for each state
is_continue = True
episode = 1

V = []   #state value for each episode
V.append(np.nanmax(Q, axis=1))  #get the maximum value for each state

while is_continue:
    print("episode " + str(episode))
    
    [s_a_history, Q] = goal_maze_ret_Q_learning(Q, eta, gamma) #get one path
    
    new_v = np.nanmax(Q, axis=1)
    
    print(np.sum(np.abs(new_v - v))) #get the error
    
    v = new_v
    
    V.append(v)
    
    print("steps to reach goal: " + str(len(s_a_history) - 1))
    episode = episode + 1
    if episode > 100:
        is_continue = False

Convergence results:

episode 1
0.100490460992
steps to reach goal: 20
episode 2
0.0969381947436
steps to reach goal: 16
episode 3
0.0943713455446
steps to reach goal: 8
episode 4
0.0932547392558
steps to reach goal: 6
episode 5
0.0918454990963
steps to reach goal: 6
episode 6
0.0904328729472
steps to reach goal: 4
episode 7
0.0895058129093
steps to reach goal: 4
episode 8
0.0885468091866
steps to reach goal: 4
episode 9
0.0875441463751
steps to reach goal: 4
episode 10
0.0864873768061
steps to reach goal: 4
episode 11
0.0853677368575
steps to reach goal: 4
episode 12
0.0841783707317
steps to reach goal: 4
episode 13
0.0829144098366
steps to reach goal: 4
episode 14
0.0815729464098
steps to reach goal: 4
episode 15
0.0801529321705
steps to reach goal: 4
episode 16
0.0786550263091
steps to reach goal: 4
episode 17
0.0770814118009
steps to reach goal: 4
episode 18
0.0754355946909
steps to reach goal: 4
episode 19
0.0737221974552
steps to reach goal: 4
episode 20
0.0719467546908
steps to reach goal: 4