Reinforcement Learning
1 The Q Table
1.1 What is a Q table, and what is it for?
From the earlier study of DQN, we know that DQN makes decisions with a neural network. Before neural networks were applied to reinforcement learning, decisions were made with a lookup table instead.
The idea is to start from an all-zero table and, in every episode, update it while acting with a greedy (ε-greedy) strategy, so that the table eventually converges to the best decision policy【1】.
1.2 What a Q table looks like
A Q table has one row per state and one column per action; the entry Q(s, a) stores the estimated cumulative reward of taking action a in state s.
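As a minimal sketch (the variable names here are my own, chosen to match the code in section 3.2), the table for this article's task can be built as a pandas DataFrame:

```python
import numpy as np
import pandas as pd

# One row per state, one column per action; Q(s, a) estimates the
# return of taking action a in state s. Training starts from all zeros.
states = 6
actions = ["left", "right"]
q_table = pd.DataFrame(np.zeros((states, len(actions))), columns=actions)
print(q_table)

# Acting greedily means picking the argmax column of the current row.
# On an all-zero row, idxmax falls back to the first column ("left"),
# which is why the code below also explores when a row is still all zero.
greedy_action = q_table.loc[0].idxmax()
```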
2 The reinforcement learning task in this article
The task in this article comes from【2】.
The problem: a treasure lies six steps to the right, and an agent must find it. The agent does not know where the treasure is; the only thing it knows is that moving right will eventually bring a reward.
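The setup can be pictured as a one-dimensional string of cells (a sketch; the function name and the 'o' marker are my own, while the article's code later renders the agent as '0'):

```python
# The 1-D treasure hunt: five walkable cells plus the treasure 'T'.
# The agent starts at the far left and receives reward +1 only on the
# step that lands on the treasure; every other step gives reward 0.
n_states = 6

def render(agent_pos):
    cells = ['-'] * (n_states - 1) + ['T']
    cells[agent_pos] = 'o'
    return ''.join(cells)

print(render(0))  # o----T : starting position
print(render(4))  # ----oT : one step left of the treasure
```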
3 Pseudocode and Python Code
3.1 Pseudocode
Initialize Q(s, a) = 0 for all states s and actions a
For each episode:
    Initialize S to the start state
    Repeat until S is terminal:
        Choose A from S using an ε-greedy policy on Q
        Take action A, observe reward R and next state S'
        Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
        S ← S'
3.2 Python code
import time

import numpy as np
import pandas as pd

MAX_EPISODE = 10  # number of training episodes
EPSILON = 0.9     # probability of acting greedily (exploration rate is 1 - EPSILON)
GAMMA = 0.9       # discount factor
ALPHA = 0.1       # learning rate
actions = ["left", "right"]  # available actions
states = 6        # total number of states
endflag = "end"   # sentinel marking the terminal state


def init_q_table(states, actions):
    # All-zero table: one row per state, one column per action
    q_table = pd.DataFrame(np.zeros((states, len(actions))), columns=actions)
    return q_table


def choose_action_fun(state, q_table):
    # ε-greedy: explore randomly 10% of the time, or whenever this
    # state's row has not been updated yet (still all zeros)
    optional_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((optional_actions == 0).all()):
        choose_action = np.random.choice(actions)  # explore: random action
    else:
        choose_action = optional_actions.idxmax()  # exploit: best-known action
    return choose_action


def get_env_feedback(now_state, choose_action):
    # Moving right from the second-to-last cell reaches the treasure (+1);
    # every other move earns no reward
    if choose_action == "right":
        if now_state == states - 2:
            new_state, reward = endflag, 1
        else:
            new_state, reward = now_state + 1, 0
    else:
        new_state, reward = max(0, now_state - 1), 0
    return new_state, reward


def update_env(state):
    # Render the 1-D world: '-' for empty cells, 'T' treasure, '0' agent
    env_list = ['-'] * (states - 1) + ['T']
    env_list[state] = '0'
    print(''.join(env_list))
    time.sleep(0.3)


if __name__ == '__main__':  # follows the pseudocode in section 3.1
    q_table = init_q_table(states, actions)
    for episode in range(MAX_EPISODE):
        step_count = 0  # steps taken in this episode
        S = 0           # start state: far left
        is_end = False
        update_env(S)
        while not is_end:
            A = choose_action_fun(S, q_table)  # ε-greedy action choice
            S_, R = get_env_feedback(S, A)     # new state and reward
            q_predict = q_table.loc[S, A]
            if S_ != endflag:
                q_target = R + GAMMA * q_table.iloc[S_, :].max()
            else:
                q_target = R
                is_end = True
            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            if S != endflag:
                update_env(S)
            step_count += 1
        print('Episode %s: total_step = %s' % (episode + 1, step_count))
        time.sleep(2)
    print(q_table)
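After training, the learned policy can be read straight off the table. A hypothetical check (the numbers below are illustrative stand-ins, not actual output of the script above):

```python
import pandas as pd

# Illustrative post-training Q table for the 6-state corridor: values in
# the "right" column grow as the agent nears the treasure, and the last
# (terminal) row stays all zero because it is never updated.
q_table = pd.DataFrame(
    {"left":  [0.000, 0.000, 0.000, 0.001, 0.002, 0.0],
     "right": [0.004, 0.025, 0.111, 0.343, 0.745, 0.0]})

# Greedy policy: the argmax action in each non-terminal row.
policy = q_table.iloc[:-1].idxmax(axis=1).tolist()
print(policy)  # "right" dominates in every non-terminal state
```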
4 References
【1】Wang Shusen (王树森), Reinforcement Learning
【2】Morvan (莫烦), Reinforcement Learning