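[Reinforcement Learning Algorithms: Q-Learning] Example source code for understanding the algorithm's basic principles and how it works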

Latest recommended article published on 2024-05-14 01:07:44

小黄人软件

Article link: https://blog.csdn.net/chenhao0568/article/details/134923048

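In reinforcement learning, an agent learns how to reach a goal through trial and error in an environment. At each step the agent observes the current state of the environment, chooses an action, and receives a reward or penalty in return; it then adjusts its behavior to earn a higher score. The agent's objective is to maximize the total reward it collects over the long run. The script below trains a tabular Q-learning agent to walk from the top-left corner of a 5x5 grid world to the bottom-right corner.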
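As a rough summary of what the script computes, the "long-run total reward" is usually formalized as a discounted return, and Q-learning nudges each table entry toward a one-step estimate of it. The notation here is an assumption added for illustration: $\alpha$ stands for `learning_rate`, $\gamma$ for `discount_factor`, $s'$ for the next state, and $r$ for the reward returned by `env.step`.

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
$$

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term corresponds to `td_error` in the code, and $r + \gamma \max_{a'} Q(s', a')$ corresponds to `td_target`.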

import numpy as np
import random

# Define a simple environment: a grid world
class GridWorld:
    def __init__(self, width, height, start, goal):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        x, y = self.state
        if action == 0:    # up
            y = max(y - 1, 0)
        elif action == 1:  # right
            x = min(x + 1, self.width - 1)
        elif action == 2:  # down
            y = min(y + 1, self.height - 1)
        elif action == 3:  # left
            x = max(x - 1, 0)

        self.state = (x, y)
        reward = 1 if self.state == self.goal else 0
        done = self.state == self.goal
        return self.state, reward, done

# Create the environment: 5x5 grid, start at (0, 0), goal at (4, 4)
env = GridWorld(5, 5, (0, 0), (4, 4))
n_states = env.width * env.height
n_actions = 4

# Initialize the Q-table (one row per state, one column per action)
Q = np.zeros((n_states, n_actions))

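# Learning hyperparameters
learning_rate = 0.1     # step size for each Q-value update
discount_factor = 0.99  # weight given to future rewards
epsilon = 0.1           # probability of taking a random (exploratory) action
n_episodes = 1000       # number of training episodes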

# Convert a grid position (x, y) to a state index
def to_state(x, y):
    return y * env.width + x

# Q-learning training loop
for episode in range(n_episodes):
    state = env.reset()
    done = False

    while not done:
        current_state = to_state(*state)

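        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(n_actions)  # explore
        else:
            # Exploit, breaking ties at random: np.argmax alone always picks
            # action 0 (up) while the Q-table is still all zeros, which makes
            # the first episodes extremely slow.
            q_row = Q[current_state]
            best = np.flatnonzero(q_row == q_row.max())
            action = int(random.choice(best.tolist()))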

        # Take the action
        next_state, reward, done = env.step(action)
        next_state = to_state(*next_state)

        # Update the Q-value toward the TD target
        best_next_action = np.argmax(Q[next_state])
        td_target = reward + discount_factor * Q[next_state, best_next_action]
        td_error = td_target - Q[current_state, action]
        Q[current_state, action] += learning_rate * td_error

# Print the learned Q-table, indexed as [y, x, action]
print("Learned Q-table:")
print(Q.reshape(env.height, env.width, n_actions))
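
The raw Q-table is hard to read at a glance, so one way to inspect the result is to print the greedy action for each cell. The following is a minimal sketch, not part of the original script: `greedy_policy_grid`, `ARROWS`, and the tiny hand-built `demo_Q` table are hypothetical names introduced here for illustration, and the state encoding is assumed to match `to_state` above (`y * width + x`).

```python
import numpy as np

# Arrow for each action index, following the script's convention:
# 0 = up, 1 = right, 2 = down, 3 = left
ARROWS = ["^", ">", "v", "<"]

def greedy_policy_grid(Q, width, height, goal=None):
    """Return one string per grid row showing the greedy action in each cell."""
    rows = []
    for y in range(height):
        cells = []
        for x in range(width):
            if goal is not None and (x, y) == goal:
                cells.append("G")  # mark the goal cell instead of an action
                continue
            state = y * width + x  # same encoding as to_state(x, y)
            cells.append(ARROWS[int(np.argmax(Q[state]))])
        rows.append(" ".join(cells))
    return rows

# Hand-built 2x2 Q-table for illustration only (these are not trained values).
demo_Q = np.zeros((4, 4))
demo_Q[0, 1] = 1.0  # at (0, 0): prefer right
demo_Q[1, 2] = 1.0  # at (1, 0): prefer down
demo_Q[2, 1] = 1.0  # at (0, 1): prefer right
demo_policy = greedy_policy_grid(demo_Q, 2, 2, goal=(1, 1))
print("\n".join(demo_policy))
```

After training, the same helper could be called as `greedy_policy_grid(Q, env.width, env.height, goal=env.goal)`. Cells whose Q-values are all still zero will show `^`, because `np.argmax` returns the first index on ties.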