Reinforcement Learning: Deep Reinforcement Learning Gomoku Based on Monte Carlo Tree Search and a Policy-Value Network (Source Code Included)

[Reinforcement Learning Theory + Projects Column] must-read series: single-agent and multi-agent algorithm theory + hands-on projects, practical techniques (hyperparameter tuning, plotting, etc.), fun project implementations, and academic application project implementations.


The roadmap for the deep reinforcement learning part of the column is:

  • Basic single-agent algorithm tutorials (mainly gym environments)
  • Mainstream multi-agent algorithm tutorials (mainly gym environments)
    • Mainstream algorithms: hands-on projects with DDPG, DQN, TD3, SAC, PPO, Rainbow DQN, Q-Learning, A2C, and more
  • Fun projects (Super Mario, Gomoku, Dou Dizhu, applications to various games)
  • Applied single-agent and multi-agent practice (business-oriented paper reproductions, e.g. UAV scheduling optimization, power resource dispatch)

This column is designed to help newcomers quickly master the theory of single-agent and multi-agent reinforcement learning algorithms together with hands-on projects. Follow-up posts will keep breaking down the underlying deep learning principles, so that readers build up theory while practicing: knowing what works, why it works, and how one comes to know why it works.

Note: some of the projects are classic ones from around the web, included so that everyone can learn quickly; more practical material (competitions, papers, real-world applications, etc.) will be added over time.

This project implements deep reinforcement learning Gomoku based on Monte Carlo tree search and a policy-value network (source code included).

  • Features

    • Self-play
    • Detailed comments
    • Simple workflow
  • Code structure

    • net: policy-value network implementation
    • mcts: Monte Carlo tree search implementation
    • server: front-end UI code
    • legacy: deprecated code
    • docs: miscellaneous documents
    • utils: utility code
    • network.py: network architecture ported from another open-source project
    • model_5400.pkl: trained weights for the ported network
    • train_agent.py: training script
    • web_server.py: game-play server script
    • web_server_demo.py: game-play server script (ported network)

1.1 Pipeline

The overall pipeline follows the AlphaZero recipe: the agent plays games against itself through Monte Carlo tree search, each finished game is converted into (state, search distribution, winner) training samples, and the policy-value network is periodically retrained on the accumulated data; the updated network in turn guides later searches.

1.2 Policy-Value Network

The network uses a ResNet-like structure, with an SPP (spatial pyramid pooling) module added.

(For now the policy network still performs poorly. Training is extremely time-consuming: after more than three weeks of continuous running, only a little over 2,000 self-play game records have been produced, so it is probably just not trained enough yet.)

Another open-source policy network and its trained weights (network.py, model_5400.pkl) have also been ported over, and are used for the demo.
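
For a concrete picture of what "a ResNet-like structure with an SPP module" can look like, here is a minimal PyTorch sketch. It is an illustrative assumption only: `ResBlock`, `SPP`, `PolicyValueNet`, the channel counts, and the pooling levels are all made up for this example and do not reproduce the repository's actual `LinXiaoNet` definition in `net`.

```python
# Illustrative sketch only -- not the repository's actual LinXiaoNet code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """A standard two-convolution residual block."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # residual connection

class SPP(nn.Module):
    """Spatial pyramid pooling: pool the feature map to several fixed grid
    sizes and concatenate, yielding a fixed-length feature vector."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        return torch.cat(
            [F.adaptive_avg_pool2d(x, lvl).flatten(1) for lvl in self.levels],
            dim=1)

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=3, channels=64, board_size=8, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.body = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])
        self.spp = SPP()
        spp_dim = channels * sum(l * l for l in self.spp.levels)
        # policy head: a log-probability for every board position
        self.policy = nn.Linear(spp_dim, board_size * board_size)
        # value head: a scalar evaluation squashed into [-1, 1]
        self.value = nn.Sequential(
            nn.Linear(spp_dim, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        feat = self.spp(self.body(self.stem(x)))
        return F.log_softmax(self.policy(feat), dim=1), self.value(feat)
```

Because the SPP head pools to fixed grid sizes, the flattened feature length does not depend on the input board size, which is one plausible motivation for adding SPP to a board-game network.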

1.3 Training

Adjust train_agent.py as described in its comments, then run the script.

Excerpt from the code:


```python
if __name__ == '__main__':

    conf = LinXiaoNetConfig()
    conf.set_cuda(True)
    conf.set_input_shape(8, 8)
    conf.set_train_info(5, 16, 1e-2)  # presumably (epoch_num, batch_size, init_lr), judging from usage below
    conf.set_checkpoint_config(5, 'checkpoints/v2train')  # save every 5 epochs to this directory
    conf.set_num_worker(0)
    conf.set_log('log/v2train.log')
    # conf.set_pretrained_path('checkpoints/v2m4000/epoch_15')

    init_logger(conf.log_file)
    logger()(conf)

    device = 'cuda' if conf.use_cuda else 'cpu'

    # build the policy-value network
    model = LinXiaoNet(3)
    model.to(device)

    loss_func = AlphaLoss()
    loss_func.to(device)

    optimizer = torch.optim.SGD(model.parameters(), conf.init_lr, 0.9, weight_decay=5e-4)
    lr_schedule = torch.optim.lr_scheduler.StepLR(optimizer, 1, 0.95)

    # initialize the Monte Carlo tree
    tree = MonteTree(model, device, chess_size=conf.input_shape[0], simulate_count=500)
    data_cache = TrainDataCache(num_worker=conf.num_worker)

    ep_num = 0
    chess_num = 0
    # train once every `train_every_chess` self-play games
    train_every_chess = 18

    # resume from a checkpoint if a pretrained path is configured
    if conf.pretrain_path is not None:
        model_data, optimizer_data, lr_schedule_data, data_cache, ep_num, chess_num = load_checkpoint(conf.pretrain_path)
        model.load_state_dict(model_data)
        optimizer.load_state_dict(optimizer_data)
        lr_schedule.load_state_dict(lr_schedule_data)
        logger()('successfully load pretrained : {}'.format(conf.pretrain_path))

    while True:
        logger()(f'self chess game no.{chess_num+1} start.')
        # play one self-play game and collect its record
        chess_record = tree.self_game()
        logger()(f'self chess game no.{chess_num+1} end.')
        # turn the game record into training samples
        train_data = generate_train_data(tree.chess_size, chess_record)
        # push the samples into the data cache
        for i in range(len(train_data)):
            data_cache.push(train_data[i])
        if chess_num % train_every_chess == 0:
            logger()('train start.')
            loader = data_cache.get_loader(conf.batch_size)
            model.train()
            for _ in range(conf.epoch_num):
                loss_record = []
                for bat_state, bat_dist, bat_winner in loader:
                    bat_state, bat_dist, bat_winner = bat_state.to(device), bat_dist.to(device), bat_winner.to(device)
                    optimizer.zero_grad()
                    prob, value = model(bat_state)
                    loss = loss_func(prob, value, bat_dist, bat_winner)
                    loss.backward()
                    optimizer.step()
                    loss_record.append(loss.item())
                logger()(f'train epoch {ep_num} loss: {sum(loss_record) / float(len(loss_record))}')
                ep_num += 1
                if ep_num % conf.checkpoint_save_every_num == 0:
                    save_checkpoint(
                        os.path.join(conf.checkpoint_save_dir, f'epoch_{ep_num}'),
                        ep_num, chess_num, model.state_dict(), optimizer.state_dict(), lr_schedule.state_dict(), data_cache
                    )
            lr_schedule.step()
            logger()('train end.')
        chess_num += 1
        save_chess_record(
            os.path.join(conf.checkpoint_save_dir, f'chess_record_{chess_num}.pkl'),
            chess_record
        )
        # break  # uncomment to stop after a single self-play game
```
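`AlphaLoss` is not shown in the excerpt, but from its call signature `loss_func(prob, value, bat_dist, bat_winner)` it presumably implements the AlphaZero objective: squared error between the predicted value and the game outcome, plus cross-entropy between the MCTS visit distribution and the predicted policy (the L2 term of the AlphaZero loss is handled here by the optimizer's `weight_decay=5e-4`). A minimal sketch under that assumption, taking log-probabilities as the policy output:

```python
import torch
import torch.nn as nn

class AlphaLoss(nn.Module):
    """Assumed AlphaZero-style loss: (z - v)^2 - pi^T log p.

    log_prob: network's log-softmax policy output, shape (B, board^2)
    value:    network's scalar evaluation, shape (B, 1)
    dist:     MCTS visit-count distribution pi, shape (B, board^2)
    winner:   game outcome z from the current player's view, shape (B,)
    """
    def forward(self, log_prob, value, dist, winner):
        value_loss = torch.mean((winner.view(-1) - value.view(-1)) ** 2)
        policy_loss = -torch.mean(torch.sum(dist * log_prob, dim=1))
        return value_loss + policy_loss
```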

1.4 Demo

Adjust web_server.py as described in its comments to load the desired pretrained weights, then run the script.

Open http://127.0.0.1:8080/ in a browser to play.

Excerpt from the code:

```python
# Excerpt from web_server.py -- `app`, `trees`, `state_result`, `model`,
# `device`, `chess_size` and `simulate_count` are initialized earlier
# in the full script.

# client polls for the machine's move
@app.route('/state/get/<state_id>', methods=['GET'])
def get_state(state_id):
    global state_result
    state_id = int(state_id)
    state = 0
    chess_state = None
    if state_id in state_result and state_result[state_id] is not None:
        state = 1
        chess_state = state_result[state_id]
        state_result[state_id] = None
    ret = {
        'code': 0,
        'msg': 'OK',
        'data': {
            'state': state,
            'chess_state': chess_state
        }
    }
    return jsonify(ret)


# game start: create a Monte Carlo tree for this game
@app.route('/game/start', methods=['POST'])
def game_start():
    global trees
    global model, device, chess_size, simulate_count
    tree_id = random.randint(1000, 100000)
    trees[tree_id] = MonteTree(model, device, chess_size=chess_size, simulate_count=simulate_count)
    ret = {
        'code': 0,
        'msg': 'OK',
        'data': {
            'tree_id': tree_id
        }
    }
    return jsonify(ret)


# game end: drop the game's Monte Carlo tree
@app.route('/game/end/<tree_id>', methods=['POST'])
def game_end(tree_id):
    global trees
    tree_id = int(tree_id)
    trees[tree_id] = None
    ret = {
        'code': 0,
        'msg': 'OK',
        'data': {}
    }
    return jsonify(ret)


if __name__ == '__main__':
    app.run(
        '0.0.0.0',
        8080
    )
```
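
With the server running locally, the three routes in the excerpt can be exercised by hand. A small sketch using the third-party `requests` library (the route for submitting the human player's move is part of the full script but not of the excerpt, so it is not called here, and the state id `1` is purely illustrative):

```python
import requests  # pip install requests

BASE = 'http://127.0.0.1:8080'

# start a game: the server allocates a Monte Carlo tree and returns its id
tree_id = requests.post(f'{BASE}/game/start').json()['data']['tree_id']
print('tree id:', tree_id)

# poll whether the machine's move for a given state id is ready
resp = requests.get(f'{BASE}/state/get/1').json()
print('ready:', resp['data']['state'], 'move:', resp['data']['chess_state'])

# end the game: the server drops the tree
requests.post(f'{BASE}/game/end/{tree_id}')
```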

1.5 Demo (Ported Network)

Run the script: python web_server_demo.py

Open http://127.0.0.1:8080/ in a browser to play.

The source code link is at the top of the article or below:

https://download.csdn.net/download/sinat_39620217/88045879

1.6 Appendix: A Pure MCTS + UCT Gomoku AI

For reference, below is a self-contained Python code example of a Gomoku AI based on plain MCTS with UCT (no neural network):

```python
import random
import math

class TreeNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.score = 0

def UCT(node):
    # UCB1: exploitation term (score/visits) plus exploration term
    C = 1.4
    if node.visits == 0:
        return float('inf')
    return (node.score / node.visits) + C * math.sqrt(math.log(node.parent.visits) / node.visits)

def MCTS(state, iterations):
    root = TreeNode(state)
    for i in range(iterations):
        node = root
        # selection: descend to a leaf, always taking the child with max UCT
        while node.children:
            node = max(node.children, key=UCT)
        # expansion: add all untried moves as children of a visited leaf
        if node.visits > 0:
            moves = node.state.get_moves()
            for move in moves:
                if move not in [c.state.last_move for c in node.children]:
                    child_state = node.state.apply_move(move)
                    child_node = TreeNode(child_state, node)
                    node.children.append(child_node)
        # simulation: random rollout from a randomly chosen descendant
        sim_node = node
        while sim_node.children:
            sim_node = random.choice(sim_node.children)
        score = simulate(sim_node.state)
        # backpropagation (note: this simple example credits the same score
        # to every ancestor instead of flipping perspective between players)
        while node:
            node.visits += 1
            node.score += score
            node = node.parent
    return max(root.children, key=lambda c: c.visits).state.last_move

def simulate(state):
    # random playout, scored from the perspective of the player to move
    # at the start of the rollout (the original version overwrote `player`
    # on every ply, which made decisive games always score 0)
    player = state.get_current_player()
    while not state.is_terminal():
        move = random.choice(state.get_moves())
        state = state.apply_move(move)
    winner = state.get_winner()
    if winner == player:
        return 1
    elif winner is None:
        return 0.5
    else:
        return 0

class Board:
    def __init__(self, width=15, height=15, win_length=5):
        self.width = width
        self.height = height
        self.win_length = win_length
        self.board = [[None for y in range(height)] for x in range(width)]
        self.last_move = None

    def get_moves(self):
        moves = []
        for x in range(self.width):
            for y in range(self.height):
                if self.board[x][y] is None:
                    moves.append((x, y))
        return moves

    def apply_move(self, move):
        x, y = move
        player = self.get_current_player()
        new_board = Board(self.width, self.height, self.win_length)
        new_board.board = [row[:] for row in self.board]
        new_board.board[x][y] = player
        new_board.last_move = move
        return new_board

    def get_current_player(self):
        if sum(row.count(None) for row in self.board) % 2 == 0:
            return "X"
        else:
            return "O"

    def is_terminal(self):
        if self.get_winner() is not None:
            return True
        for x in range(self.width):
            for y in range(self.height):
                if self.board[x][y] is None:
                    return False
        return True

    def get_winner(self):
        # scan every cell for a horizontal, vertical, or diagonal run
        for x in range(self.width):
            for y in range(self.height):
                if self.board[x][y] is None:
                    continue
                if x + self.win_length <= self.width:
                    if all(self.board[x+i][y] == self.board[x][y] for i in range(self.win_length)):
                        return self.board[x][y]
                if y + self.win_length <= self.height:
                    if all(self.board[x][y+i] == self.board[x][y] for i in range(self.win_length)):
                        return self.board[x][y]
                if x + self.win_length <= self.width and y + self.win_length <= self.height:
                    if all(self.board[x+i][y+i] == self.board[x][y] for i in range(self.win_length)):
                        return self.board[x][y]
                if x + self.win_length <= self.width and y - self.win_length >= -1:
                    if all(self.board[x+i][y-i] == self.board[x][y] for i in range(self.win_length)):
                        return self.board[x][y]
        return None

    def __str__(self):
        return "\n".join(" ".join(self.board[x][y] or "-" for x in range(self.width)) for y in range(self.height))

if __name__ == "__main__":
    board = Board()
    while not board.is_terminal():
        if board.get_current_player() == "X":
            x, y = map(int, input("Enter move (x y): ").split())
            board = board.apply_move((x, y))
        else:
            move = MCTS(board, 1000)
            print("AI move:", move)
            board = board.apply_move(move)
        print(board)
    print("Winner:", board.get_winner())
```

The code defines a `TreeNode` class that stores each node's state and visit statistics, and uses the UCB1 formula (mean score plus `C * sqrt(ln(parent visits) / visits)`) for UCT selection inside MCTS. A `Board` class represents the Gomoku state and rules, with methods for detecting a winner and enumerating legal moves. The `__main__` block alternates between reading the human player's move and letting the AI choose one, giving a simple human-vs-machine game.
