蒙特卡洛方法（MC Exploring Starts算法例子）

最新推荐文章于 2025-03-04 15:33:48 发布

edgarrrr

最新推荐文章于 2025-03-04 15:33:48 发布

阅读量1.2k

点赞数 26

分类专栏：强化学习文章标签：算法 python 机器学习

本文链接：https://blog.csdn.net/edgarrr/article/details/143493805

版权

强化学习专栏收录该内容

3 篇文章

订阅专栏

本文章中使用的算法和例子来源于bilibili中西湖大学赵世钰老师的【强化学习的数学原理】课程。网址：第5课-蒙特卡洛方法（MC Exploring Starts算法）_哔哩哔哩_bilibili

一、算法简介

"Monte Carlo Exploring Starts"（MC Exploring Starts）是一种强化学习算法，用于估计策略下的状态-动作值函数 Q(s,a)。该方法适用于有限状态和动作空间的问题，主要用于确定性策略的优化。它属于蒙特卡洛（Monte Carlo）方法的一类，使用从起始状态到终止状态的完整回合（episode）进行学习。

二、相关定义

1、策略评估

①每次访问（Every-Visit）蒙特卡洛策略评估：每次一个（状态，动作）对出现在一个 episode 中时，都会对其进行一次回报更新。

②首次访问（First-Visit）蒙特卡洛策略评估：在一个 episode 中，只对某个（状态，动作）对的首次访问进行回报更新。即在一个 episode 中，如果某个（状态，动作）对已经更新过，就不再更新。

Every-Visit 和 First-Visit 的对比

策略评估方法	是否更新重复访问的状态	数据量	收敛速度	适用场景
每次访问	是	多	较快	适用于频繁访问的状态，数据充分时
首次访问	否	少	较慢	适用于减少计算量和数据冗余的情况

2、visit定义

对于从棋盘格中的一个位置出发，每一个状态-动作组成的一个元组称为一个visit，如(s1,a2)，(s2,a4)，(s1,a2)……，如此定义。

3、episode定义

从一个位置出发，所有的visit组成一个episode，至于步数可以自己自行设置。

三、算法流程

四、代码演示

import numpy as np
import random
from collections import defaultdict

class BehrmanMC_EGreedy:
    # Attributes
    n, m = None, None  # Chessboard dimensions
    xx, yy = [-1, 0, 1, 0, 0], [0, 1, 0, -1, 0]  # 动作（上、右、下、左、停留）
    s = "↑→↓←O"  # 动作符号
    _lambda = 0.9  # 折扣因子

    # 初始化棋盘格和策略
    chessboard = [
        ["***"],
        ["**#"],
        ["#*x"]
    ]
    policy = [
        [1, 3, 2],
        [2, 3, 3],
        [2, 2, 0]
    ]
    states = None
    Punishment = None

    def __init__(self):
        self.n, self.m = len(self.chessboard), len(self.chessboard[0][0])
        self.init_matrix()
        self.returns = defaultdict(list)  # 用于记录每个 (状态, 动作) 对的累积回报
        self.Q = defaultdict(lambda: [0] * 5)  # 用来存储每个 (状态, 动作) 对的 Q 值

    def init_matrix(self):
        """Initialize chessboard matrix, rewards, states, and policy."""
        self.chessboard = [list(row[0]) for row in self.chessboard]
        self.states = [[0] * self.m for _ in range(self.n)]
        self.Punishment = [[0 if cell == "*" else (-10 if cell == "#" else 1)
                            for cell in row] for row in self.chessboard]

    def next_point(self, x, y, i):
        """Calculate the next position coordinates, returns current position if out of bounds."""
        wall = 0
        nx, ny = x + self.xx[i], y + self.yy[i]
        if 0 <= nx < self.n and 0 <= ny < self.m:
            return nx, ny, wall
        else:
            wall = 1
            return x, y, wall

    def action(self, x, y, i):
        """Return reward for the given action."""
        next_pos = self.next_point(x, y, i)
        if next_pos[2] == 1:
            return -1  # 撞墙
        else:
            return self.Punishment[next_pos[0]][next_pos[1]]

    def generate_episode(self,x,y,action):
        """Generate an episode using Exploring Starts."""
        episode = []
        for t in range(10):  # episode中visit的数量
            reward = self.action(x, y, action)
            episode.append(((x, y), action, reward))
            x, y, _ = self.next_point(x, y, action)
        return episode


    def update_policy(self):
        """Update Q values and policy based on episode results."""
        for x in range(self.n):
            for y in range(self.m):
                for z in range(5):
                    episode_data = self.generate_episode(x,y,z)
                    G = 0
                    visited = set()

                    # 使用首次访问
                    for (x, y), action, reward in reversed(episode_data):
                        G = self._lambda * G + reward
                        if (x, y, action) not in visited:  # Every-visit MC update
                            self.returns[(x, y, action)].append(G)
                            self.Q[(x, y)][action] = np.mean(self.returns[(x, y, action)])
                            visited.add((x, y, action))
                    for (x, y) in self.Q.keys():
                        best_action = np.argmax(self.Q[(x, y)])
                        self.policy[x][y] = best_action  # 将策略更新为最佳动作

    def show_policy(self):
        """Display the final policy matrix."""
        policy_display = "\n".join("\t".join(self.s[self.policy[x][y]] for y in range(self.m)) for x in range(self.n))
        print(policy_display)

if __name__ == '__main__':
    behrman_mc = BehrmanMC_EGreedy()
    behrman_mc.update_policy()  # 实施MC Exploring Starts算法
    behrman_mc.show_policy()  # 输出最终策略