通过简易迷宫问题,理解Monte-Carlo, Sarsa和Q-learning算法
3
×
3
3\times3
3×3的迷宫如下
可以通过每一步都随机地走,直到走到S8为止
这里主要写通过强化学习,找到最佳路线
MC算法:
首先定义一个policy,它是一个矩阵,行代表S1~S8
列代表action的4个方向,这里定义方向为↑、→、↓、←的顺序
policy定义为
π
θ
(
s
,
a
)
\pi_{\theta}(s, a)
πθ(s,a), 其中
θ
\theta
θ是决定
π
\pi
π的参数,首先定义初始
θ
0
\theta_{0}
θ0
θ
0
\theta_{0}
θ0矩阵中值为1代表这个方向可以走,如果是墙壁,就定义为np.nan
#row: S0~S7, column: ↑、→、↓、←
#theta: parameter of policy pi(s, a)
theta_0 = np.array([[np.nan, 1, 1, np.nan], #S0
[np.nan, 1, np.nan, 1], #S1
[np.nan, np.nan, 1, 1], #S2
[1, 1, 1, np.nan], #S3
[np.nan, np.nan, 1, 1], #S4
[1, np.nan, np.nan, np.nan], #S5
[1, np.nan, np.nan, np.nan], #S6
[1, 1, np.nan, np.nan], #S7
#S8 is goal, no policy
])
下面由
θ
0
\theta_{0}
θ0求
π
θ
(
s
,
a
)
\pi_{\theta}(s, a)
πθ(s,a),可以单纯求比例,也可以用softmax
soft公式如下:
P
(
θ
i
)
=
e
x
p
(
β
θ
i
)
∑
i
=
1
N
a
e
x
p
(
β
θ
i
)
P(\theta_{i}) = \frac{exp(\beta\theta_{i})}{\sum_{i=1}^{N_{a}}exp(\beta\theta_{i})}
P(θi)=∑i=1Naexp(βθi)exp(βθi)
N
a
N_{a}
Na表示action的种类数(4个)
#use softmax to get policy
def softmax_convert_into_pi_from_theta(theta):
beta = 1.0
[m, n] = theta.shape
pi = np.zeros((m,n))
exp_theta = np.exp(beta * theta)
for i in range(0, m):
pi[i, :] = exp_theta[i, :] /np.nansum(exp_theta[i, :])
pi = np.nan_to_num(pi)
return pi
打印初期 π θ \pi_{\theta} πθ可以看到在不同的state下不同action的概率
pi_0 = softmax_convert_into_pi_from_theta(theta_0)
print(pi_0)
#output:
#[[ 0. 0.5 0.5 0. ]
# [ 0. 0.5 0. 0.5 ]
# [ 0. 0. 0.5 0.5 ]
# [ 0.33333333 0.33333333 0.33333333 0. ]
# [ 0. 0. 0.5 0.5 ]
# [ 1. 0. 0. 0. ]
# [ 1. 0. 0. 0. ]
# [ 0.5 0.5 0. 0. ]]
通过policy来确定agent的动作
通过
π
\pi
π的概率来选择下一步的action和state
def get_action_and_next_s(pi, s):
direction = ["up", "right", "down", "left"]
next_direction = np.random.choice(direction, p=pi[s, :])
if next_direction == "up":
action = 0
s_next = s - 3
elif next_direction == "right":
action = 1
s_next = s + 1
elif next_direction == "down":
action = 2
s_next = s + 3
elif next_direction == "left":
action = 3
s_next = s - 1
return [action, s_next]
直到走到Goal,一直不停探索路径,到达goal的路径保存在s_a_history
def goal_maze_ret_s_a(pi):
s = 0 #start
s_a_history = [[0, np.nan]]
while(1):
[action, next_s] = get_action_and_next_s(pi, s)
s_a_history[-1][1] = action
#print([action, next_s])
s_a_history.append([next_s, np.nan]) #now don't know next action, add np.nan as next action
if next_s == 8:
break
else:
s = next_s
return s_a_history
随机走一条路径
s_a_history = goal_maze_ret_s_a(pi_0)
print(s_a_history)
#output:
#[[0, 2], [3, 0], [0, 2], [3, 0], [0, 1], [1, 3], [0, 2], [3, 0], [0, 2], [3, #1], [4, 2], [7, 0], [4, 3], [3, 1], [4, 2], [7, 1], [8, nan]]
#17 steps
MC算法需要完整的路径来更新policy,所以每次得到完整路径后去更新
θ
\theta
θ,通过
θ
\theta
θ得到
π
\pi
π,
重复上面的步骤直到policy收敛
θ
\theta
θ的更新公式:
θ
s
i
,
a
j
=
θ
s
i
,
a
j
+
η
.
Δ
θ
s
i
,
a
j
\theta_{s_{i},a_{j}} = \theta_{s_{i},a_{j}} + \eta.\Delta\theta_{s_{i},a_{j}}
θsi,aj=θsi,aj+η.Δθsi,aj
Δ
θ
s
i
,
a
j
=
{
N
(
s
i
,
a
j
)
+
P
(
s
i
,
a
j
)
N
(
s
i
,
a
)
}
/
T
\Delta\theta_{s_{i},a_{j}} = \{N(s_{i}, a_{j}) + P(s_{i}, a_{j})N(s_{i}, a)\}/T
Δθsi,aj={N(si,aj)+P(si,aj)N(si,a)}/T
N N N表示次数, T T T表示到达goal的总step数
def update_theta(theta, pi, s_a_history):
eta = 0.1 #learning rate
T = len(s_a_history) - 1
[m, n] = theta.shape
delta_theta = theta.copy() #deep copy
for i in range(0, m):
for j in range(0, n):
if not(np.isnan(theta[i, j])):
SA_i = [SA for SA in s_a_history if SA[0] == i] #the (s,a) pair with state i
SA_ij = [SA for SA in s_a_history if SA == [i, j]] #the (s,a) pair with state i, action j
N_i = len(SA_i)
N_ij = len(SA_ij)
delta_theta[i, j] = (N_ij + pi[i, j] * N_i) / T
new_theta = theta + eta * delta_theta
return new_theta
MC算法迭代,定义收敛标准
stop_epsilon = 10**-8
theta = theta_0
pi = pi_0
is_continue = True
count = 1
while is_continue:
s_a_history = goal_maze_ret_s_a(pi)
new_theta = update_theta(theta, pi, s_a_history)
new_pi = softmax_convert_into_pi_from_theta(new_theta)
print(np.sum(np.abs(new_pi - pi)))
print("steps to reach goal is: " + str(len(s_a_history) - 1))
if np.sum(np.abs(new_pi - pi)) < stop_epsilon:
is_continue = False
else:
theta = new_theta
pi = new_pi
最后确认下收敛后的policy
np.set_printoptions(precision=3, suppress=True) #not display by e
print(pi)
#output
#[[ 0. 0. 1. 0. ]
# [ 0. 0.38 0. 0.62 ]
# [ 0. 0. 0.474 0.526]
# [ 0. 1. 0. 0. ]
# [ 0. 0. 1. 0. ]
# [ 1. 0. 0. 0. ]
# [ 1. 0. 0. 0. ]
# [ 0. 1. 0. 0. ]]
可以看到policy已经可以找到最佳路径,而未收敛的概率处是因为通过最优路径不会再采取那个action,所以仍保留初始附近的概率
Sarsa算法
名称解释,Sarsa指当前State(S), Action(a)得到即时Reward( r), 并通过下一步的State(s), Action(a)得到预期reward
定义Q矩阵,通过
ϵ
−
g
r
e
e
d
y
\epsilon-greedy
ϵ−greedy方法选择下一步的action, 在每一个step都更新Q值,而不是MC算法的每个episode更新policy
最后Q矩阵收敛可以得到最优路径
Q矩阵的size和上面policy一样,行代表state,列代表action,初始值为random
但是为了保持墙壁的限制条件,让Q乘
θ
0
\theta_{0}
θ0
将Q的terminal state设为0
#row: S0~S7, column: ↑、→、↓、←
#theta: policy pi(s, a)
theta_0 = np.array([[np.nan, 1, 1, np.nan], #S0
[np.nan, 1, np.nan, 1], #S1
[np.nan, np.nan, 1, 1], #S2
[1, 1, 1, np.nan], #S3
[np.nan, np.nan, 1, 1], #S4
[1, np.nan, np.nan, np.nan], #S5
[1, np.nan, np.nan, np.nan], #S6
[1, 1, np.nan, np.nan], #S7
#S8 is goal, no policy
])
[a, b] = theta_0.shape
Q = np.random.rand(a, b) * theta_0
Q[7,3]=0
print(Q)
#output:
[[ nan 0.745 0.711 nan]
[ nan 0.286 nan 0.385]
[ nan nan 0.513 0.056]
[ 0.501 0.921 0.993 nan]
[ nan nan 0.891 0.01 ]
[ 0.875 nan nan nan]
[ 0.439 nan nan nan]
[ 0.958 0.936 nan 0. ]]
ϵ
−
g
r
e
e
d
y
\epsilon-greedy
ϵ−greedy(选择下一步的action):
greedy表示每次选择Q[s, :]中的最大值作为下一步的action,但是它有一个问题,就是容易陷入local optimal,所以要平衡随机探索和greedy
于是以
ϵ
\epsilon
ϵ的概率进行随机探索,
1
−
ϵ
1-\epsilon
1−ϵ的概率采取greedy
ϵ
\epsilon
ϵ随着step增加而衰减
而且有一个理论,也就是说
ϵ
−
g
r
e
e
d
y
\epsilon-greedy
ϵ−greedy是可以改进policy的:
for any
ϵ
−
g
r
e
e
d
y
\epsilon-greedy
ϵ−greedy policy
π
\pi
π,
ϵ
−
g
r
e
e
d
y
\epsilon-greedy
ϵ−greedy policy
π
′
\pi'
π′ with respect to
q
π
q_{\pi}
qπ is an improvement.
V
π
′
(
s
)
≥
V
π
(
s
)
V_{\pi'}(s) \geq V_{\pi}(s)
Vπ′(s)≥Vπ(s)
证明就不写了
由于涉及到随机探索,所以仍然需要探索用的policy
这里不用softmax,而是用简单的概率
def simple_convert_into_pi_from_theta(theta):
[m, n] = theta.shape
pi = np.zeros((m, n))
for i in range(0, m):
pi[i, :] = theta[i, :] / np.nansum(theta[i, :])
pi = np.nan_to_num(pi)
return pi
pi_0 = simple_convert_into_pi_from_theta(theta_0)
实现 ϵ − g r e e d y \epsilon-greedy ϵ−greedy:
#epsilon-greedy
#get the action
def get_action(s, Q, epsilon, pi_0):
direction = ["up", "right", "down", "left"]
#print("s = " + str(s))
#probability epsilon to random search
if np.random.rand() < epsilon:
next_direction = np.random.choice(direction, p=pi_0[s, :])
else:
#move by the maximum Q
next_direction = direction[np.nanargmax(Q[s, :])]
if next_direction == "up":
action = 0
elif next_direction == "right":
action = 1
elif next_direction == "down":
action = 2
elif next_direction == "left":
action = 3
return action
#get next state by action
def get_s_next(s, a, Q, epsilon, pi_0):
direction = ["up", "right", "down", "left"]
next_direction = direction[a]
if next_direction == "up":
s_next = s - 3
elif next_direction == "right":
s_next = s + 1
elif next_direction == "down":
s_next = s + 3
elif next_direction == "left":
s_next = s - 1
return s_next
通过state, action更新Q矩阵的Sarsa算法:
推导:
μ
k
=
1
k
∑
j
=
1
k
x
j
\mu_{k} = \frac{1}{k}\sum_{j=1}^{k}x_{j}
μk=k1∑j=1kxj
=
1
k
(
x
k
+
∑
j
=
1
k
−
1
x
j
)
=\frac{1}{k}(x_{k} + \sum_{j=1}^{k-1}x_{j})
=k1(xk+∑j=1k−1xj)
=
1
k
(
x
k
+
(
k
−
1
)
μ
k
−
1
)
=\frac{1}{k}(x_{k} + (k-1)\mu_{k-1})
=k1(xk+(k−1)μk−1)
=
μ
k
−
1
+
1
k
(
x
k
−
μ
k
−
1
)
=\mu_{k-1} + \frac{1}{k}(x_{k} -\mu_{k-1})
=μk−1+k1(xk−μk−1)
for each state
S
t
S_{t}
St with
G
t
G_{t}
Gt:
N
(
S
t
)
←
N
(
S
t
)
+
1
N(S_{t})\leftarrow N(S_{t})+1
N(St)←N(St)+1
V
(
S
t
)
←
V
(
S
t
)
+
1
N
(
S
t
)
(
G
t
−
V
(
S
t
)
)
V(S_{t})\leftarrow V(S_{t}) + \frac{1}{N(S_{t})}(G_{t}-V(S_{t}))
V(St)←V(St)+N(St)1(Gt−V(St))
G t = R t + 1 + γ V ( S t + 1 ) G_{t}=R_{t+1} +\gamma V(S_{t+1}) Gt=Rt+1+γV(St+1)
得到:
Q
(
s
t
,
a
t
)
=
Q
(
s
t
,
a
t
)
+
η
∗
(
R
t
+
1
+
γ
Q
(
s
t
+
1
,
a
t
+
1
)
−
Q
(
s
t
,
a
t
)
)
Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \eta*(R_{t+1} +\gamma Q(s_{t+1},a_{t+1})-Q(s_{t}, a_{t}))
Q(st,at)=Q(st,at)+η∗(Rt+1+γQ(st+1,at+1)−Q(st,at))
其中
R
t
+
1
+
γ
Q
(
s
t
+
1
,
a
t
+
1
)
−
Q
(
s
t
,
a
t
)
R_{t+1} +\gamma Q(s_{t+1},a_{t+1})-Q(s_{t}, a_{t})
Rt+1+γQ(st+1,at+1)−Q(st,at)叫做TD error
根据上述公式更新Q:
def Sarsa(s, a, r, s_next, a_next, Q, eta, gamma):
if s_next == 8: #the goal, next state not exist
Q[s, a] = Q[s, a] + eta * (r - Q[s, a])
else:
Q[s, a] = Q[s, a] + eta * (r + gamma*Q[s_next, a_next] - Q[s, a])
return Q
一个episode:
def goal_maze_ret_s_a_Q(Q, epsilon, eta, gamma, pi):
s = 0 #start state
a = a_next = get_action(s, Q, epsilon, pi) #get action by epsilon-greedy
s_a_history = [[0, np.nan]]
while(1): #until reach the goal
a = a_next
s_a_history[-1][1] = a
s_next = get_s_next(s, a, Q, epsilon, pi)
s_a_history.append([s_next, np.nan])
if s_next == 8:
r = 1 #get the reward
a_next = np.nan
else:
r = 0
a_next = get_action(s_next, Q, epsilon, pi)
Q = Sarsa(s, a, r, s_next, a_next, Q, eta, gamma) #update Q
if s_next == 8:
break
else:
s = s_next
return [s_a_history, Q]
完整迭代:
eta = 0.1
gamma = 0.9
epsilon = 0.5
v = np.nanmax(Q, axis=1) #select the maximum Q value for each state
is_continue = True
episode = 1
while is_continue:
print("episode: " + str(episode))
epsilon = epsilon / 2 #epsilon-greedy
[s_a_history, Q] = goal_maze_ret_s_a_Q(Q, epsilon, eta, gamma, pi_0)
new_v = np.nanmax(Q, axis=1) #maximum value for each state
print(np.sum(np.abs(new_v - v)))
v = new_v
print("steps to reach goal is: " + str(len(s_a_history) - 1))
episode = episode + 1
if episode > 100:
break
虽然写了到100 episode,但是从output看出很快就收敛到最优路径
episode: 1
0.227489819094
steps to reach goal is: 14
episode: 2
0.154785251991
steps to reach goal is: 10
episode: 3
0.0105243446365
steps to reach goal is: 4
episode: 4
0.00946282714278
steps to reach goal is: 4
episode: 5
0.00848520483778
steps to reach goal is: 4
episode: 6
0.00758386849718
steps to reach goal is: 4
episode: 7
0.0051117594454
steps to reach goal is: 4
episode: 8
看下收敛后的路径及Q矩阵
print(s_a_history)
#output:
[[0, 2], [3, 1], [4, 2], [7, 1], [8, nan]]
print(Q)
#output:
[[ nan 0.705 0.729 nan]
[ nan 0.286 nan 0.41 ]
[ nan nan 0.513 0.056]
[ 0.501 0.81 0.775 nan]
[ nan nan 0.9 0.088]
[ 0.875 nan nan nan]
[ 0.575 nan nan nan]
[ 0.928 1. nan 0. ]]
标准Sarsa只能backward一步,而Sarsa(
λ
)
\lambda)
λ)算法一次可以更新整个路径,与Sarsa不同的是加了一个
E
E
E矩阵
(这个单独写)
Q-learning
Q-learning与Sarsa的区别:
对比Q的更新公式:
Sarsa:
Q
(
s
t
,
a
t
)
=
Q
(
s
t
,
a
t
)
+
η
∗
(
R
t
+
1
+
γ
Q
(
s
t
+
1
,
a
t
+
1
)
−
Q
(
s
t
,
a
t
)
)
Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \eta*(R_{t+1} +\gamma Q(s_{t+1},a_{t+1})-Q(s_{t}, a_{t}))
Q(st,at)=Q(st,at)+η∗(Rt+1+γQ(st+1,at+1)−Q(st,at))
Q-learning:
Q
(
s
t
,
a
t
)
=
Q
(
s
t
,
a
t
)
+
η
∗
(
R
t
+
1
+
γ
max
a
Q
(
s
t
+
1
,
a
)
−
Q
(
s
t
,
a
t
)
)
Q(s_{t}, a_{t}) = Q(s_{t}, a_{t}) + \eta*(R_{t+1} +\gamma \mathop{\max}\limits_{a }Q(s_{t+1},a)-Q(s_{t}, a_{t}))
Q(st,at)=Q(st,at)+η∗(Rt+1+γamaxQ(st+1,a)−Q(st,at))
可以看到Sarsa更新时需要求并且使用
a
t
+
1
a_{t+1}
at+1,而Q-learning需要用到
s
t
+
1
s_{t+1}
st+1对应的最大Q值。
a
t
+
1
a_{t+1}
at+1是由决定下一步action的policy决定的,所以Q的更新依赖于policy,称为online-policy
而Q-learning不依赖于决定action的policy,称为offline-policy
Sarsa中用到的 ϵ − g r e e d y \epsilon-greedy ϵ−greedy中random的部分Q-learning中没有,所以Q-learning有比Sarsa收敛更快的特征
按照Q-learning公式:
def Q_learning(s, a, r, s_next, Q, eta, gamma):
if s_next == 8: #goal
Q[s, a] = Q[s, a] + eta*(r - Q[s, a])
else:
Q[s, a] = Q[s, a] + eta*(r + gamma*np.nanmax(Q[s_next,:]) - Q[s, a])
return Q
因为不用 ϵ − g r e e d y \epsilon-greedy ϵ−greedy,所以action函数去掉随机的部分:
#get the action
def get_action(s, Q):
direction = ["up", "right", "down", "left"]
#print("s = " + str(s))
next_direction = direction[np.nanargmax(Q[s, :])]
if next_direction == "up":
action = 0
elif next_direction == "right":
action = 1
elif next_direction == "down":
action = 2
elif next_direction == "left":
action = 3
return action
def get_s_next(s, a):
direction = ["up", "right", "down", "left"]
next_direction = direction[a]
if next_direction == "up":
s_next = s - 3
elif next_direction == "right":
s_next = s + 1
elif next_direction == "down":
s_next = s + 3
elif next_direction == "left":
s_next = s - 1
return s_next
一次episode:
def goal_maze_ret_Q_learning(Q, eta, gamma):
s = 0 #start state
a = a_next = get_action(s, Q) #get action by maximum Q value
s_a_history = [[0, np.nan]]
while(1): #until reach the goal
a = a_next
s_a_history[-1][1] = a
s_next = get_s_next(s, a)
s_a_history.append([s_next, np.nan])
if s_next == 8:
r = 1 #get the reward
a_next = np.nan
else:
r = 0
a_next = get_action(s_next, Q)
Q = Q_learning(s, a, r, s_next, Q, eta, gamma) #update Q
if s_next == 8:
break
else:
s = s_next
return [s_a_history, Q]
完整迭代到100次:
eta = 0.1 #learning-rate
gamma = 0.9 #decrease rate
v = np.nanmax(Q, axis=1) #maximum value for each state
is_continue = True
episode = 1
V = [] #state value for each episode
V.append(np.nanmax(Q, axis=1)) #get the maximum value for each state
while is_continue:
print("episode " + str(episode))
[s_a_history, Q] = goal_maze_ret_Q_learning(Q, eta, gamma) #get one path
new_v = np.nanmax(Q, axis=1)
print(np.sum(np.abs(new_v - v))) #get the error
v = new_v
V.append(v)
print("steps to reach goal: " + str(len(s_a_history) - 1))
episode = episode + 1
if episode > 100:
is_continue = False
收敛结果
episode 1
0.100490460992
steps to reach goal: 20
episode 2
0.0969381947436
steps to reach goal: 16
episode 3
0.0943713455446
steps to reach goal: 8
episode 4
0.0932547392558
steps to reach goal: 6
episode 5
0.0918454990963
steps to reach goal: 6
episode 6
0.0904328729472
steps to reach goal: 4
episode 7
0.0895058129093
steps to reach goal: 4
episode 8
0.0885468091866
steps to reach goal: 4
episode 9
0.0875441463751
steps to reach goal: 4
episode 10
0.0864873768061
steps to reach goal: 4
episode 11
0.0853677368575
steps to reach goal: 4
episode 12
0.0841783707317
steps to reach goal: 4
episode 13
0.0829144098366
steps to reach goal: 4
episode 14
0.0815729464098
steps to reach goal: 4
episode 15
0.0801529321705
steps to reach goal: 4
episode 16
0.0786550263091
steps to reach goal: 4
episode 17
0.0770814118009
steps to reach goal: 4
episode 18
0.0754355946909
steps to reach goal: 4
episode 19
0.0737221974552
steps to reach goal: 4
episode 20
0.0719467546908
steps to reach goal: 4