Study Notes on the Q-learning Reinforcement Learning Algorithm

TCCCLY

Article link: https://blog.csdn.net/TCCCLY/article/details/126492056

Q-learning is a classic tabular, off-policy reinforcement learning method.

How the algorithm works
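
For reference, the standard textbook form of the tabular Q-learning update can be written as follows. This is a sketch added for orientation, not a formula quoted from the post; the symbol \alpha is assumed to correspond to `lr` and \gamma to `gamma` in the `update` method later in the post.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

When s_{t+1} is a terminal state, the bootstrapped term \gamma \max_{a'} Q(s_{t+1}, a') is dropped and the target is just r_t, which matches the `if done` branch of `update`.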

Environment setup

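# gym==0.21.0; torch==1.9.0+cu111
# Grid-world environment gridworld_env.py from Datawhale's "EASY-RL" (Copyright (c) 2020 PaddlePaddle)
import math  # used by the epsilon decay in choose_action
import numpy as np  # used for argmax / max / random choice over the Q-table
import torch
import gym
from gridworld_env import CliffWalkingWapper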

Algorithm implementation

The core of Q-learning comes down to two pieces:

1. Behavior policy

def choose_action(self, state):
    self.sample_count += 1
    # epsilon shrinks over time; exponential decay is used here
    self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
        math.exp(-1. * self.sample_count / self.epsilon_decay)
    # epsilon-greedy policy
    if np.random.uniform(0, 1) > self.epsilon:
        action = np.argmax(self.Q_table[str(state)])  # exploit: action with the largest Q(s, a)
    else:
        action = np.random.choice(self.n_actions)  # explore: pick a random action
    return action

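Here the exploration rate epsilon decays exponentially: as sample_count (the number of action selections so far) grows, epsilon shrinks from epsilon_start toward epsilon_end. The agent therefore picks actions mostly at random early on and, as it learns, shifts toward greedily choosing the action with the largest Q-value.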
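
To get a feel for how fast the exploration rate falls, the schedule can be evaluated on its own. The sketch below reuses the formula from `choose_action`; the helper name `epsilon_at` and the values epsilon_start=0.95, epsilon_end=0.01, epsilon_decay=300 are assumptions chosen for illustration, since the post does not state them.

```python
import math

def epsilon_at(sample_count, epsilon_start=0.95, epsilon_end=0.01, epsilon_decay=300):
    """Exploration rate after `sample_count` action selections (hypothetical defaults)."""
    return epsilon_end + (epsilon_start - epsilon_end) * math.exp(-1.0 * sample_count / epsilon_decay)

# Print the exploration rate at a few points of the schedule
for n in (0, 300, 3000):
    print(n, round(epsilon_at(n), 4))
```

With these assumed values epsilon starts at epsilon_start, has lost roughly 63% of the gap to epsilon_end after epsilon_decay selections, and is practically equal to epsilon_end after ten times that many.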

2. Update rule

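def update(self, state, action, reward, next_state, done):
    Q_predict = self.Q_table[str(state)][action]  # current estimate of Q(s, a)
    if done:  # terminal state: no future reward to bootstrap from
        Q_target = reward
    else:
        # off-policy target: assume the greedy action is taken in the next state
        Q_target = reward + self.gamma * np.max(self.Q_table[str(next_state)])
    self.Q_table[str(state)][action] += self.lr * (Q_target - Q_predict)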

Here the discount factor is gamma = 0.90 and the learning rate is lr = 0.1.
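
The `update` method reads `self.Q_table`, `self.lr` and `self.gamma`, but the post does not show where they are created. A minimal sketch of one possible setup follows; the class name `TabularQ` and the zero-initialized `defaultdict` table are assumptions for illustration, not the author's code. The `update` body is the one from the post, with gamma = 0.90 and lr = 0.1 as defaults.

```python
from collections import defaultdict
import numpy as np

class TabularQ:
    """Hypothetical minimal holder for the Q-table and the update rule."""

    def __init__(self, n_actions, lr=0.1, gamma=0.9):
        self.n_actions = n_actions
        self.lr = lr        # learning rate
        self.gamma = gamma  # discount factor
        # Assumption: states not seen before start with all-zero action values
        self.Q_table = defaultdict(lambda: np.zeros(n_actions))

    def update(self, state, action, reward, next_state, done):
        Q_predict = self.Q_table[str(state)][action]
        if done:
            Q_target = reward
        else:
            Q_target = reward + self.gamma * np.max(self.Q_table[str(next_state)])
        self.Q_table[str(state)][action] += self.lr * (Q_target - Q_predict)

agent = TabularQ(n_actions=4)
agent.Q_table["1"][:] = [0.0, 2.0, 1.0, 0.0]  # pretend state 1 already has estimates
agent.update(state=0, action=3, reward=-1.0, next_state=1, done=False)
print(agent.Q_table["0"][3])
```

With these numbers the target is -1 + 0.9 × 2 = 0.8, so the entry for state 0, action 3 moves from 0 to about 0.08, one tenth of the way to the target.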

Training

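# Assumes env, agent and train_eps have already been created (their setup is not shown here)
rewards = []  # total reward of each episode
ma_rewards = []  # moving average of the episode rewards
for i_ep in range(train_eps):
    ep_reward = 0  # reward accumulated in this episode
    state = env.reset()  # reset the environment, i.e. start a new episode
    while True:
        action = agent.choose_action(state)  # pick an action with the epsilon-greedy behavior policy
        next_state, reward, done, _ = env.step(action)  # take one step in the environment
        agent.update(state, action, reward, next_state, done)  # Q-learning update
        state = next_state  # move on to the next state
        ep_reward += reward
        if done:
            break
    rewards.append(ep_reward)
    if ma_rewards:
        ma_rewards.append(ma_rewards[-1]*0.9+ep_reward*0.1)
    else:
        ma_rewards.append(ep_reward)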
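
The `ma_rewards` bookkeeping at the end of the loop is an exponential moving average that is seeded with the first episode's reward. As a small standalone sketch, the same smoothing can be applied to any list of rewards; the function name `smooth` and the sample rewards are made up for illustration.

```python
def smooth(rewards, weight=0.9):
    """Exponential moving average, seeded with the first episode's reward."""
    smoothed = []
    for r in rewards:
        if smoothed:
            # keep `weight` of the previous average, blend in the new reward
            smoothed.append(smoothed[-1] * weight + r * (1 - weight))
        else:
            smoothed.append(r)
    return smoothed

# Hypothetical episode rewards, not results from the post
print(smooth([-100, -50, -13]))
```

For these sample rewards the smoothed values are roughly -100, -95 and -86.8: each new episode only moves the curve by 10% of the gap, which is why the smoothed curve lags behind sudden improvements in the raw rewards.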