[Reinforcement Learning] Q-learning algorithm: tabular form, Python implementation

kt4ngw

Last modified 2022-10-04 14:49:36

First published 2022-10-04 11:53:00

Article link: https://blog.csdn.net/t4ngw/article/details/127159291

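1 The Q-table

1.1 What is a Q-table, and what is it for?

From my earlier study of DQN, I learned that DQN makes decisions with a neural network. Before neural networks were applied to reinforcement learning, decisions were made by building a table.

Start with a table of all zeros. In each episode, the agent uses this table together with a greedy strategy (epsilon-greedy in the code in section 3.2) to learn, until the table converges to the best decision policy [1].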

1.2 What the Q-table looks like
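
The figure for this section is not available in this copy. As a rough sketch only, assuming the same layout as the code in section 3.2 (one row per state, one column per action, every value starting at zero), a Q-table for the 6-state, 2-action problem might look like the following. The names `n_states` and `action_names` are illustrative, and a plain dictionary is used here instead of the pandas DataFrame used later.

```python
# Illustrative sizes (assumption: the 6-state, 2-action problem from section 3.2)
n_states = 6
action_names = ["left", "right"]

# One entry per state; each maps an action to its current estimate Q(state, action)
q_table = {s: {a: 0.0 for a in action_names} for s in range(n_states)}

# Print it as a small grid: rows are states, columns are actions
print("state  " + "  ".join(f"{a:>5}" for a in action_names))
for s, row in q_table.items():
    print(f"{s:>5}  " + "  ".join(f"{row[a]:>5.1f}" for a in action_names))
```

Each cell holds the agent's current estimate of how good it is to take that action in that state; learning consists of filling in these cells.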

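2 The reinforcement learning goal of this post

The problem used in this post comes from [2].
A treasure lies to the right of the start, at the far end of a 6-position track (with `states = 6` in the code, that is 5 moves to the right). A person searches for the treasure without knowing where it is; the only thing he can learn is that moving right leads to a reward.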

3 Pseudocode and Python code

3.1 Pseudocode
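
The pseudocode figure is not included in this copy. As a hedged sketch reconstructed from the code in section 3.2 (not from the original figure), each step moves Q(S, A) a fraction ALPHA of the way toward the target R + GAMMA * max Q(S', a), and the target is just R when S' is terminal. The helper name `q_update` and its parameters are hypothetical, introduced only for this illustration.

```python
GAMMA = 0.9  # discount factor, same value as in section 3.2
ALPHA = 0.1  # learning rate, same value as in section 3.2

def q_update(q_sa, reward, next_q_values, terminal):
    """Return the new Q(S, A) after one tabular Q-learning step."""
    # A terminal transition has no future value, so the target is just the reward
    target = reward if terminal else reward + GAMMA * max(next_q_values)
    return q_sa + ALPHA * (target - q_sa)

# First time the agent steps onto the treasure: Q was 0, reward is 1
first = q_update(0.0, 1, [], terminal=True)
# A later step taken from the state before that: reward 0, best next value is `first`
second = q_update(0.0, 0, [0.0, first], terminal=False)
print(first, second)
```

With these values the first update gives about 0.1 and the second about 0.009, which shows how the reward at the treasure propagates backward one state at a time over repeated episodes.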

3.2 Python code

import pandas as pd
import numpy as np
import time

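MAX_EPISODE = 10  # number of training episodes
actions = ["left", "right"]  # available actions
states = 6  # total number of states (positions on the track)
endflag = "end"  # marker for the terminal state
GAMMA = 0.9  # discount factor
ALPHA = 0.1  # learning rate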

def init_q_table(states, actions):
    q_table = pd.DataFrame(np.zeros((states, len(actions))), columns=actions)
    return q_table


def choose_action_fun(state, q_table, epsilon=0.9):
    # epsilon-greedy: exploit with probability epsilon, otherwise explore
    optional_actions = q_table.iloc[state, :]
    if (np.random.uniform() > epsilon) or ((optional_actions == 0).all()):
        choose_action = np.random.choice(actions)  # explore: pick a random action
    else:
        choose_action = optional_actions.idxmax()  # exploit: best-known action
    return choose_action


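def get_env_feedback(now_state, choose_action):
    # Return (new_state, reward) for taking choose_action in now_state
    if choose_action == "right":
        if now_state == states - 2:  # one step left of the treasure
            new_state, reward = endflag, 1
        else:
            new_state, reward = now_state + 1, 0
    else:
        new_state, reward = max(0, now_state - 1), 0  # cannot move left of position 0
    return new_state, reward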


def update_env(state):
    # Draw the track: '-' for empty cells, 'T' for the treasure, '0' for the agent
    env_list = ['-'] * (states - 1) + ['T']
    env_list[state] = '0'
    interaction = ''.join(env_list)
    print(interaction)
    time.sleep(0.3)


if __name__ == '__main__':  # follows the pseudocode
    q_table = init_q_table(states, actions)
    for episode in range(MAX_EPISODE):
        step_count = 0  # step counter
        S = 0  # start state
        is_end = False
        update_env(S)
        while not is_end:
            A = choose_action_fun(S, q_table)  # epsilon-greedy action selection
            S_, R = get_env_feedback(S, A)  # new state and reward
            q_predict = q_table.loc[S, A]

            if S_ != endflag:
                q_target = R + GAMMA * q_table.iloc[S_, :].max()
            else:
                q_target = R  # terminal state: no future value
                is_end = True
            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            if S != endflag:
                update_env(S)
            step_count += 1
        print('Episode %s: total_step = %s' % (episode + 1, step_count))
        time.sleep(2)
    print(q_table)
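
To check what the table has learned without watching the animation, the sketch below reimplements the same loop in plain Python (no pandas, no `time.sleep`) and reads off the greedy action for each state. It is a compact variant for quick checks, not the code above: the names `step`, `train`, and the `seed` parameter are illustrative assumptions, and only the five non-terminal states get a row.

```python
import random

N_STATES = 6                # positions on the track, as in the code above
ACTIONS = ["left", "right"]
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.9
END = "end"                 # marker for the terminal state

def step(state, action):
    """Same transition rule as get_env_feedback: reward 1 only on reaching the treasure."""
    if action == "right":
        if state == N_STATES - 2:
            return END, 1
        return state + 1, 0
    return max(0, state - 1), 0

def train(episodes, seed=0):
    rng = random.Random(seed)  # seeded so runs are repeatable (assumption, not in the original)
    # One row per non-terminal state (0..4); each row maps action -> Q value
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES - 1)]
    for _ in range(episodes):
        s = 0
        while s != END:
            # Explore when the draw exceeds EPSILON or nothing has been learned for this state
            if rng.random() > EPSILON or all(v == 0 for v in q[s].values()):
                a = rng.choice(ACTIONS)
            else:
                a = max(q[s], key=q[s].get)
            s_next, r = step(s, a)
            target = r if s_next == END else r + GAMMA * max(q[s_next].values())
            q[s][a] += ALPHA * (target - q[s][a])
            s = s_next
    return q

q = train(episodes=50)
# Greedy policy: the action with the largest Q value in each state
policy = [max(row, key=row.get) for row in q]
print(policy)
```

If the table has learned the task, the greedy action in every state should be "right", and the values in the "right" column should grow as the state gets closer to the treasure.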