Reinforcement Learning: A Q-Learning Implementation


Laplace666


Reinforcement learning: Q-Learning

Learning the Q-Learning algorithm

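In Q-Learning, the agent's aim is to reach the goal state while collecting the highest total reward. Once the goal state is reached, the final reward no longer changes, which is why the goal state is also called an absorbing state.

A Q-Learning agent does not know the environment as a whole; it only knows which actions it can choose in its current state.

We usually build an immediate reward matrix R, whose entries give the reward for the action that takes the agent from state s to the next state s'.

The Q matrix, which guides the agent's actions, is computed from the immediate reward matrix R.

The Q matrix is the agent's brain.

At the start, every element of the Q matrix is initialized to 0, meaning the agent's brain is blank and it knows nothing yet.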

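The update formula for $Q(s,a)$ is:

$$Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s', a')$$

Here $s'$ is the next state, and the maximum is taken over all actions $a'$ available in $s'$.

The elements of the Q matrix are computed with this formula. Note that the Q values inside the max on the right-hand side are looked up in the current Q matrix, while the left-hand side is the entry being computed.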
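To make the update concrete, here is a minimal sketch that applies the formula once by hand. The numbers are made up for illustration (a discount factor of 0.7, a reward of -1 for the move, and a tiny two-state Q-table); only the formula itself comes from the text above.

```python
import numpy as np

gamma = 0.7  # discount factor (illustrative value)

# Hypothetical Q-table with 2 states and 2 actions, partly filled in.
q = np.array([[0.0, 0.0],
              [4.0, 10.0]])

state, action = 0, 1   # the agent is in state 0 and takes action 1
reward_sa = -1.0       # assumed immediate reward R(s, a)
next_state = 1         # assumed successor state s'

# Right-hand side: look up the current Q values of s' and take the max.
best_next = q[next_state].max()

# Left-hand side: the entry that is being computed.
q[state, action] = reward_sa + gamma * best_next

print(q[state, action])
```

The new entry is the immediate reward plus the discounted best value of the next state, so here it is -1 + 0.7 × 10, which is about 6.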

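Core of the Q-Learning algorithm

One episode is one training cycle: it runs from the initial state to the terminal state.

When one episode has been learned, the agent moves on to the next episode.

So the outer loop of Q-Learning iterates over episodes, and the inner loop iterates over the steps of a single episode.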

The core algorithm:

1. Set the value of $\gamma$ and the reward matrix R.
2. Initialize the matrix Q to all zeros.
3. For each episode:
    1. Select a random initial state.
    2. Do while the goal state hasn't been reached:
        1. Select one among all possible actions for the current state.
        2. Using this possible action, consider going to the next state.
        3. Get the maximum Q value for this next state based on all possible actions.
        4. Compute: $Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s', a')$, where $s'$ is the next state and $a'$ ranges over its possible actions.
        5. Set the next state as the current state.
    3. End Do
4. End For

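Each episode is one training session, and the point of each round of training is to strengthen the brain, which shows up as updates to the elements of the agent's Q matrix.

Once Q has been learned, the Q matrix can be used to guide the agent's actions.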
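One way to let a learned Q matrix guide the agent is to follow the highest-valued valid action from each state until the goal is reached. The sketch below is only an illustration: `greedy_path` is a hypothetical helper, and the small `q`, `transitions` and `valid` tables are made-up inputs, not the maze from this post.

```python
import numpy as np

def greedy_path(q, transitions, valid, start, goal, max_steps=100):
    """Follow the best valid action from `start` until `goal` (hypothetical helper)."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        # Pick the valid action with the largest Q value in this state.
        best = max(valid[state], key=lambda a: q[state][a])
        # transitions[s][a] is the state reached from s by action a.
        state = transitions[state][best]
        path.append(state)
    return path

# Made-up chain of 3 states: action 0 moves left, action 1 moves right.
q = np.array([[0.0, 6.0],
              [3.2, 10.0],
              [0.0, 0.0]])
transitions = [[0, 1], [0, 2], [2, 2]]
valid = [[1], [0, 1], [0, 1]]

print(greedy_path(q, transitions, valid, start=0, goal=2))
```

The `max_steps` cap is a safety net: if the table has not been trained well enough, the greedy walk could cycle, and the cap makes sure the function still returns.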

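Take a small maze as an example.

The initial Q-table has one row per state and one column per action, and all of its entries are 0.

The actions are U, D, L, R (up, down, left, right) and N (stay in place).

The reward matrix and the transition matrix are the reward and transition_matrix arrays in the code listing at the end of this post. In the transition matrix every state number is reduced by 1 so that it matches the array index.

Reading the Q-table produced by the run: in state 1 the table says to move right; in state 2 it says to move down, which reaches state 4, the goal. This is only a small maze, but the maze can be enlarged and the same approach still works.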
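For a larger maze, writing the matrices by hand gets tedious, so they can be generated instead. The following is a rough sketch under several assumptions of my own: the helpers `build_grid` and `train` are hypothetical, the maze is an open rectangular grid with no penalty cells, every ordinary move costs -1, and entering the goal pays 10. The action order U, D, L, R, N is the one used in this post.

```python
import numpy as np

# Action columns in the article's order: U, D, L, R, N (stay).
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]

def build_grid(rows, cols, goal, step_reward=-1, goal_reward=10):
    """Build transition and reward tables for a rows x cols grid (hypothetical helper).

    States are numbered row by row from 0; -1 marks a move that leaves the grid.
    """
    n_states = rows * cols
    transitions = np.full((n_states, len(MOVES)), -1, dtype=int)
    rewards = np.zeros((n_states, len(MOVES)))
    for state in range(n_states):
        r, c = divmod(state, cols)
        for action, (dr, dc) in enumerate(MOVES):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nxt = nr * cols + nc
                transitions[state, action] = nxt
                rewards[state, action] = goal_reward if nxt == goal else step_reward
    return transitions, rewards

def train(transitions, rewards, goal, gamma=0.7, episodes=500, seed=0):
    """Apply the tabular update Q(s,a) = R(s,a) + gamma * max Q(s', a') on the grid."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = transitions.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = int(rng.integers(n_states))  # random initial state
        while state != goal:
            valid = np.flatnonzero(transitions[state] >= 0)
            action = int(rng.choice(valid))  # explore at random
            nxt = int(transitions[state, action])
            nxt_valid = np.flatnonzero(transitions[nxt] >= 0)
            q[state, action] = rewards[state, action] + gamma * q[nxt, nxt_valid].max()
            state = nxt
    return q

# 3 x 3 maze with the goal in the bottom-right corner (state 8).
transitions, rewards = build_grid(3, 3, goal=8)
q = train(transitions, rewards, goal=8)

# Best action index per state (0=U, 1=D, 2=L, 3=R, 4=N); impossible moves are masked out.
masked = np.where(transitions >= 0, q, -np.inf)
print(masked.argmax(axis=1))
```

Cells next to the goal should end up preferring the move into it, and cells further away should prefer moves that shorten the path, since every extra step costs -1 and the goal reward is discounted by gamma.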

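Code:

import random

import numpy as np

gamma = 0.7     # discount factor
goal_state = 3  # state 4 in the 1-based numbering used in the text

# Columns are the actions U, D, L, R, N; rows are states 1-4 (indices 0-3).
# reward[s][a] is the immediate reward for taking action a in state s.
# Entries for moves that are not possible are 0 and are never used.
reward = np.array([[0, -10, 0, -1, -1],
                   [0, 10, -1, 0, -1],
                   [-1, 0, 0, 10, -1],
                   [-1, 0, -10, 0, 10]])

# The Q-table starts at zero: the agent knows nothing yet.
q_matrix = np.zeros((4, 5))

# transition_matrix[s][a] is the state reached from s by action a (-1: not possible).
# N means "stay in place", so the last column maps each state to itself.
transition_matrix = np.array([[-1, 2, -1, 1, 0],
                              [-1, 3, 0, -1, 1],
                              [0, -1, -1, 3, 2],
                              [1, -1, 2, -1, 3]])

# Actions that are possible in each state.
valid_actions = np.array([[1, 3, 4],
                          [1, 2, 4],
                          [0, 3, 4],
                          [0, 2, 4]])

for episode in range(1000):
    # Every episode starts in state 1 (index 0) and ends at the goal state.
    current_state = 0
    while current_state != goal_state:
        # Explore: pick one of the valid actions at random.
        action = random.choice(valid_actions[current_state])
        next_state = transition_matrix[current_state][action]
        # Best Q value available from the next state, looked up in the current table.
        max_future = q_matrix[next_state, valid_actions[next_state]].max()
        q_matrix[current_state][action] = reward[current_state][action] + gamma * max_future
        current_state = next_state

print('Final Q-table:')
print(q_matrix)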
