q-learning

KpLn_HJL

Last modified 2023-01-29 10:59:14


First published 2021-09-12 16:29:32

Article link: https://blog.csdn.net/sinat_41679123/article/details/120248222


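Core idea

There is a q-table that, indexed by a (state, action) tuple, gives the estimated value $Q(s_t, a_t)$ of taking action $a_t$ in the current state $s_t$. This value estimates the discounted return that follows the action, not only the immediate reward $r_t$. The agent makes its choice based on these values.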
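
As a minimal sketch of that lookup, assume the q-table is stored as a plain dictionary keyed by (state, action) tuples. The states, actions, and numbers below are made up for illustration, and `greedy_action` is a hypothetical helper name.

```python
# Hypothetical q-table: maps (state, action) -> estimated value.
q_table = {
    ("s1", "a1"): -2.0,
    ("s1", "a2"): 2.0,
    ("s2", "a1"): 1.0,
    ("s2", "a2"): 0.5,
}
ACTIONS = ["a1", "a2"]


def greedy_action(q_table, state, actions):
    """Return the action with the largest q-value in the given state."""
    return max(actions, key=lambda a: q_table[(state, a)])


print(greedy_action(q_table, "s1", ACTIONS))  # a2
```

In practice the agent usually does not act greedily all the time. Some exploration, such as an epsilon-greedy rule, is mixed in so that every entry of the table gets visited.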

Details

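Suppose we start with an estimated q_table like this:

	a1	a2
s1	-2	2

According to the q-table, a2 has the larger value in s1 (2 > -2), so the agent chooses a2 and arrives at s2.
In s2, no action is taken yet. Instead, we estimate the value from s2 onward, which gives the value of actually reaching s2: $r + \gamma * \max_a Q(s_2, a)$, where $r$ is the reward received for the move.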

We now have two values for the step that leads to s2: 1. the actual value of reaching s2; 2. the estimate $Q(s_1, a_2)$ previously stored in the q-table.

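q-learning shrinks the gap between the actual value and the estimated value, i.e. gap = actual - estimate, and updates the q-table by adding $\alpha * \text{gap}$ to $Q(s_1, a_2)$.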

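Written as a formula:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$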
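
To make the arithmetic concrete, here is a small sketch of one update for the step (s1, a2). The reward `r = 0`, the s2 row of the table, and the values of `ALPHA` and `GAMMA` are assumptions chosen only for illustration, and `q_update` is a hypothetical helper name.

```python
ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

q_table = {
    "s1": {"a1": -2.0, "a2": 2.0},
    "s2": {"a1": 1.0, "a2": 3.0},   # hypothetical row for s2
}


def q_update(q_table, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """Apply one q-learning update and return the gap (actual - estimate)."""
    estimate = q_table[s][a]
    actual = r + gamma * max(q_table[s_next].values())
    gap = actual - estimate
    q_table[s][a] = estimate + alpha * gap
    return gap


gap = q_update(q_table, "s1", "a2", r=0.0, s_next="s2")
print(round(gap, 3), round(q_table["s1"]["a2"], 3))  # 0.7 2.07
```

With these numbers the actual value is 0 + 0.9 * 3 = 2.7, the gap is 2.7 - 2 = 0.7, and the entry moves from 2 to 2 + 0.1 * 0.7 = 2.07.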
In actual code, the flow is roughly:

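q_table = build_q_table()
for epoch in range(EPOCHS):
	# INITIAL_STATE is a placeholder for wherever each episode starts
	s = INITIAL_STATE
	terminated = False
	update_env()
	while not terminated:
		a = choose_action(s, q_table)
		s_next, r = get_env_feedback(s, a)
		# estimated q-value
		q_predict = q_table[s, a]
		# actual q-value; no bootstrapping once the episode has ended
		if s_next != 'terminal':
			q_target = r + gamma * q_table[s_next, :].max()
		else:
			q_target = r
			terminated = True
		# update q_table
		q_table[s, a] += alpha * (q_target - q_predict)
		# move to s_next
		s = s_next
		update_env()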
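
The flow above can be turned into a small runnable sketch. Everything about the environment here is an assumption made for illustration: a corridor of 6 cells in which the agent starts at the left end and receives a reward of 1 on entering the rightmost cell. The hyperparameter values and the list-of-dicts q-table are also illustrative choices, and `update_env()` (display only) is omitted.

```python
import random

N_STATES = 6                 # corridor cells 0..5; entering cell 5 ends the episode
ACTIONS = ["left", "right"]
EPSILON = 0.9                # probability of acting greedily
ALPHA = 0.1                  # learning rate
GAMMA = 0.9                  # discount factor
EPOCHS = 50
TERMINAL = "terminal"


def build_q_table(n_states, actions):
    """One row per state, all q-values initialised to zero."""
    return [{a: 0.0 for a in actions} for _ in range(n_states)]


def choose_action(s, q_table, rng):
    row = q_table[s]
    # Explore with probability 1 - EPSILON, or while the row is still all zeros.
    if rng.random() > EPSILON or all(v == 0.0 for v in row.values()):
        return rng.choice(ACTIONS)
    return max(row, key=row.get)


def get_env_feedback(s, a):
    """Return (s_next, r) for the hypothetical corridor."""
    if a == "right":
        if s == N_STATES - 2:
            return TERMINAL, 1.0
        return s + 1, 0.0
    return max(s - 1, 0), 0.0   # moving left from cell 0 stays in cell 0


def train(seed=0):
    rng = random.Random(seed)
    q_table = build_q_table(N_STATES, ACTIONS)
    for _ in range(EPOCHS):
        s = 0
        while s != TERMINAL:
            a = choose_action(s, q_table, rng)
            s_next, r = get_env_feedback(s, a)
            q_predict = q_table[s][a]
            if s_next != TERMINAL:
                q_target = r + GAMMA * max(q_table[s_next].values())
            else:
                q_target = r
            q_table[s][a] += ALPHA * (q_target - q_predict)
            s = s_next
    return q_table


q_table = train()
for s, row in enumerate(q_table[:-1]):
    print(s, row)
```

After training, the values for "right" should be positive in every non-terminal cell, growing toward the goal, which is the discounted reward propagating backwards through the table.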

Abstracting one step further:

for epoch in range(EPOCHS):
	observation = env.reset()
	while True:
		# display the environment
		env.render()
		action = RL.choose_action(observation)
		# observation_next is the observation of the next state, reward is the reward for reaching it, done says whether the episode has ended
		observation_next, reward, done = env.step(action)
		RL.learn(observation, action, reward, observation_next)
		# update the state
		observation = observation_next
		if done:
			break
# end of the game
print('Game over')
env.destroy()
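
The loop above only relies on four methods of `env`. As a sketch of that interface, here is a hypothetical `CorridorEnv` (not part of mofanRL) whose `step` returns the same `(observation_next, reward, done)` triple. The layout, the reward, and the text-based `render` are assumptions for illustration.

```python
class CorridorEnv:
    """Hypothetical 1-D environment exposing reset/step/render/destroy."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the left end and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """action is -1 (left) or +1 (right); returns (observation_next, reward, done)."""
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        observation_next = "terminal" if done else self.position
        return observation_next, reward, done

    def render(self):
        """Print the corridor as text, with 'o' marking the agent."""
        cells = ["-"] * self.length
        cells[self.position] = "o"
        print("".join(cells))

    def destroy(self):
        """Nothing to clean up in this text-only sketch."""


env = CorridorEnv(length=5)
observation = env.reset()
trajectory = [observation]
done = False
while not done:
    observation, reward, done = env.step(+1)  # fixed policy: always move right
    trajectory.append(observation)
print(trajectory)  # [0, 1, 2, 3, 'terminal']
env.destroy()
```

Returning the string 'terminal' as the last observation matches the check used in the learn method, which skips the bootstrap term for that state.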

The RL class here is essentially the class that wraps the q-table lookup for q-learning. Based on mofanRL, it looks roughly like:

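class RL(object):
	def __init__(self, actions: list, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
		# store the action list and hyperparameters, and create an empty q_table
		pass
	def check_state_exist(self, state):
		# add a row for state to the q_table if it has not been seen before
		pass
	def choose_action(self, observation):
		# pick an action for the observation, e.g. epsilon-greedy over the q_table
		pass
	def learn(self, *args):
		# update the q_table; implemented by subclasses
		pass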

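class QLearningAgent(RL):
	def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
		super().__init__(actions, learning_rate, reward_decay, e_greedy)

	def learn(self, s, a, r, s_):
		# assumes RL.__init__ has set self.q_table (indexed with .loc, e.g. a pandas DataFrame), self.gamma and self.lr
		self.check_state_exist(s_)
		# estimated q-value
		q_predict = self.q_table.loc[s, a]
		if s_ != 'terminal':
			# actual q-value: reward plus discounted best value of the next state
			q_target = r + self.gamma * self.q_table.loc[s_, :].max()
		else:
			# the next state is terminal, so there is nothing to bootstrap from
			q_target = r
		self.q_table.loc[s, a] += self.lr * (q_target - q_predict)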
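
The method bodies left as `pass` above could be filled in along the following lines. This is a sketch under stated assumptions, not the mofanRL implementation: a dict of dicts stands in for the `.loc`-indexed table so that it runs with the standard library only. The class name `DictQLearningTable`, the `seed` parameter, and the random tie-breaking are illustrative choices.

```python
import random


class DictQLearningTable:
    """Sketch of a q-learning agent; a dict of dicts stands in for the q_table."""

    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9,
                 e_greedy=0.9, seed=None):
        self.actions = list(actions)
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy          # probability of acting greedily
        self.q_table = {}                # state -> {action: q-value}
        self.rng = random.Random(seed)   # seed is a hypothetical extra parameter

    def check_state_exist(self, state):
        # Add an all-zero row the first time a state is seen.
        if state not in self.q_table:
            self.q_table[state] = {a: 0.0 for a in self.actions}

    def choose_action(self, observation):
        self.check_state_exist(observation)
        if self.rng.random() < self.epsilon:
            row = self.q_table[observation]
            best = max(row.values())
            # Break ties at random so the first action is not always favoured.
            return self.rng.choice([a for a in self.actions if row[a] == best])
        return self.rng.choice(self.actions)

    def learn(self, s, a, r, s_):
        # Also check s here so learn() can be called on its own in this sketch.
        self.check_state_exist(s)
        self.check_state_exist(s_)
        q_predict = self.q_table[s][a]
        if s_ != "terminal":
            q_target = r + self.gamma * max(self.q_table[s_].values())
        else:
            q_target = r
        self.q_table[s][a] += self.lr * (q_target - q_predict)


agent = DictQLearningTable(actions=[0, 1], learning_rate=0.5, seed=0)
agent.learn("s1", 1, 1.0, "terminal")
print(agent.q_table["s1"])  # {0: 0.0, 1: 0.5}
```

An instance of this class can play the role of `RL` in the training loop shown earlier, since it exposes the same `choose_action` and `learn` calls.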