[RL From Getting Started to Giving Up] [14]

AlphaZero plays Gomoku: a code walkthrough

1. Play

Last time I jumped straight into training, which made MCTS feel terribly complicated when I tried to learn it, so this time let's start with how the game is actually played.

play

class Play(object):
    def __init__(self):
        net = Net()
        if USECUDA:  # False in this setup
            net = net.cuda()
        net.load_model("model.pt", cuda=USECUDA)
        self.net = net
        self.net.eval()  # eval() switches the net to inference mode; it returns the module itself, so printing it shows the structure
        #print("Play __init__ self.net.eval() is {0}".format(self.net.eval()))

    def go(self):
        print("One rule:\r\n Move piece form 'x,y' \r\n eg 1,3\r\n")
        print("-" * 60)
        print("Ready Go")

        mc = MonteCarloTreeSearch(self.net, 1000)
        node = TreeNode()
        board = Board()

        while True:
            print("Play board.c_player is {0}".format(board.c_player))#白子走
            if board.c_player == BLACK:
                action = input(f"Your piece is 'O' and move: ")
                action = [int(n, 10) for n in action.split(",")]
                action = action[0] * board.size + action[1]
                print("Play and action is {0}".format(action))#1-8,2-16这样依次下去
                next_node = TreeNode(action=action)#如果不传入参数的话,里面的值就是None,就是默认的
            else:#上一步有了trigger的动作,所以下一次循环就开始c_player = white
                _, next_node = mc.search(board, node)#白子需要在

            board.move(next_node.action)
            board.show()

            next_node.parent = None
            node = next_node

            if board.is_draw():
                print("board bas all been drawed\n")
                print("-" * 28 + "Draw" + "-" * 28)
                return

            if board.is_game_over():  # game over
                if board.c_player == BLACK:
                    print("-" * 28 + "Win" + "-" * 28)
                else:
                    print("-" * 28 + "Loss" + "-" * 28)
                return

            board.trigger()

The play loop is very simple: the human takes one color (c_player), AlphaZero computes the move for the other side, and the two alternate to play out the game.

  1. What role does the num passed to MonteCarloTreeSearch play?

It is the number of search iterations per move: each iteration performs one selection / expansion / evaluation / backup round of MCTS, so it is essentially the simulation (round) count.

  2. Why is next_node.parent set to None?

Setting parent to None turns the chosen child into the new root, and in search() a node whose parent is None is the one that gets Dirichlet noise added to its priors.

mcts

How does AlphaZero know where to play the next stone? It comes from the MCTS search. About MCTS (from Wikipedia):

Each round of Monte Carlo tree search consists of four steps:

  • Selection: start from root R and select successive child nodes down to a leaf node L. The section below says more about a way of choosing child nodes that lets the game tree expand towards the most promising moves, which is the essence of Monte Carlo tree search.

  • Expansion: unless L ends the game with a win/loss for either player, create one (or more) child nodes and choose node C from one of them.

  • Simulation: play a random playout from node C. This step is sometimes also called playout or rollout.

  • Backpropagation: use the result of the playout to update information in the nodes on the path from C to R.

Sample steps from one round are shown in the figure below. Each tree node stores the number of won/played playouts.

https://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/MCTS_%28English%29.svg/808px-MCTS_%28English%29.svg.png
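
To make those four steps concrete, here is a minimal, generic MCTS sketch. It is not this project's code: the toy game (single-pile Nim, where you take 1-3 stones and whoever takes the last stone wins) and all the names in it are only there to keep the example self-contained and runnable.

# Generic MCTS with random rollouts, illustrating selection, expansion, simulation and
# backpropagation. Toy game: take 1-3 stones from a pile; taking the last stone wins.
import math
import random

def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

class Node:
    def __init__(self, stones, player_just_moved, parent=None, move=None):
        self.stones = stones                        # stones left after `move` was played
        self.player_just_moved = player_just_moved  # whose move led to this node
        self.parent, self.move = parent, move
        self.children = []
        self.wins, self.visits = 0.0, 0
        self.untried = legal_moves(stones)

    def ucb1(self, c=1.4):  # classic UCB1 score used during selection
        return self.wins / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts(stones, player_to_move, iterations=2000):
    root = Node(stones, player_just_moved=1 - player_to_move)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB1
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one untried child unless the node is terminal
        if node.untried:
            move = node.untried.pop()
            node.children.append(Node(node.stones - move, 1 - node.player_just_moved,
                                      parent=node, move=move))
            node = node.children[-1]
        # 3. Simulation: random playout from the expanded node
        sim_stones, just_moved = node.stones, node.player_just_moved
        while sim_stones > 0:
            sim_stones -= random.choice(legal_moves(sim_stones))
            just_moved = 1 - just_moved
        winner = just_moved  # whoever took the last stone wins
        # 4. Backpropagation: credit every node on the path from the winner's point of view
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if node.player_just_moved == winner else 0.0
            node = node.parent
    return max(root.children, key=lambda c: c.visits).move  # most visited move at the root

print(mcts(stones=10, player_to_move=0))  # optimal play is to take 2, leaving a multiple of 4

The AlphaZero-style search below keeps the same skeleton but replaces UCB1 with PUCT and the random playout with the network's value estimate.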

class MonteCarloTreeSearch(object):
    def __init__(self, net,
                 ms_num=MCTSSIMNUM):

        self.net = net
        self.ms_num = ms_num  # number of search iterations per move; MCTSSIMNUM is 400
        print("self.ms_num is {0}".format(self.ms_num))

    """
    1、从根节点开始往下搜索直到叶节点
    2、将当前棋面使用神经网络给出落子概率和价值评估
    3、然后从叶节点返回到根节点一路更新
    """
    def search(self, borad, node, temperature=.001):
        self.borad = borad
        self.root = node  # the root node of this search

        for _ in range(self.ms_num):
            node = self.root
            borad = self.borad.clone()  # clone the board so the simulation does not modify the real game state
            print("node is {0} borad is {1}, num is {2}".format(node, borad, _))

            print("node.is_leaf is {0}".format(node.is_leaf()))
            while not node.is_leaf():  # node.is_leaf() returns True or False
                print("while node.is_leaf is {0}".format(node.is_leaf()))  # not reached until the root has children
                node = node.select_child()
                borad.move(node.action)  # play the selected move on the cloned board
                borad.trigger()  # switch to the other player; only the clone is updated
            print("search borad show and num is {0}".format(_))
            borad.show()

            # be careful - this is the opponent's state
            # part of the network's forward pass is invoked here
            """
            Zero's net takes binary (0/1) feature planes of the historical and current board as input
            and outputs a policy p and a value v, where p is the probability of playing on each point
            of the board and v is an estimate of the current player's probability of winning.
            """
            # treat the net as a black box for now
            value, props = self.net(  # value and props come from the net's forward function
                to_tensor(borad.gen_state(), unsqueeze=True))  # unsqueeze adds a batch dimension
            #print("before MonteCarloTreeSearch value is {0}".format(value))#tensor([[0.0450]], grad_fn=<TanhBackward>)
            #print("before MonteCarloTreeSearch props is {0}".format(props))#torch.Size([1, 64])

            value = to_numpy(value, USECUDA)[0]  # USECUDA is False; convert the tensor to a numpy scalar
            #print("value is {0} USECUDA is {1}".format(value, USECUDA))
            props = np.exp(to_numpy(props, USECUDA))  # the policy head returns log-probabilities, so exponentiate them
            print("after MonteCarloTreeSearch value is {0}".format(value))
            #print("after MonteCarloTreeSearch props is {0}".format(props))  # shape (64,)

            # add dirichlet noise for the root node so that more moves get explored
            print("node.parent is {0}\t borad.invalid_moves is {1}, node.parent is {2}".format(node.parent, borad.invalid_moves, node.parent))
            if node.parent is None:  # only the root node has no parent
                props = self.dirichlet_noise(props)

            # normalize: mask out illegal moves, then rescale so the probabilities sum to 1
            #print("before now prop is {0}, borad.invalid_moves is {1}".format(props, borad.invalid_moves))
            props[borad.invalid_moves] = 0.
            #print("after now prop is {0}, borad.invalid_moves is {1}".format(props, borad.invalid_moves))
            total_p = np.sum(props)  # sum of the remaining probabilities
            #print("total_p is {0}\t props is {1}".format(total_p, props))  # props is an array of length 64

            if total_p > 0:
                props /= total_p  # renormalize after masking

            # winner, draw or continue
            if borad.is_draw():  # the board is full, so the game ends in a draw
                print("enter value = 0.")
                value = 0.  # a draw is worth 0
            else:
                print("not enter value = 0.\t borad.last_player is {0}".format(borad.last_player))
                done = borad.is_game_over(player=borad.last_player)  # did the player who made the last move win?
                print("done is {0}".format(done))
                if done:
                    value = -1.  # from the viewpoint of the player to move, the opponent's last move just won, so the value is -1
                else:
                    node.expand_node(props)  # not terminal: expand the leaf with the network priors

            while node is not None:  # back up the value from the leaf to the root
                value = -value  # flip the sign at every level: a good position for one player is bad for the other
                node.backup(value)
                node = node.parent
                print("search and node is {0}".format(node is not None))


        action_times = np.zeros(borad.size**2)  # visit count of each action at the root
        for child in self.root.children:
            action_times[child.action] = child.N
        print("action_times is {0}".format(action_times))  # shape (64,); actions that were never expanded stay at 0

        action, pi = self.decision(action_times, temperature)
        print("search action is {0}, pi is {1}".format(action, pi))
        for child in self.root.children:
            if child.action == action:
                print("search pi is {0}\n child is {1}".format(pi, child))
                return pi, child  # return the move distribution and the chosen child, which becomes the next root

    @staticmethod
    def dirichlet_noise(props, eps=DLEPS, alpha=DLALPHA):  # DLEPS is 0.25, DLALPHA is 0.03
        # np.random.dirichlet(np.full(n, alpha)) draws one sample from a Dirichlet distribution
        return (1 - eps) * props + eps * np.random.dirichlet(np.full(len(props), alpha))

    @staticmethod
    def decision(pi, temperature):  # sample an action from the visit counts
        # temperature -- parameter in (0, 1] that controls the level of exploration
        pi = (1.0 / temperature) * np.log(pi + 1e-10)
        # the next two lines are a numerically stable softmax
        pi = np.exp(pi - np.max(pi))
        pi /= np.sum(pi)
        action = np.random.choice(len(pi), p=pi)  # sample one index according to the distribution pi
        return action, pi

That is roughly what the MCTS looks like.

Each search round here consists of three steps (the network evaluation takes the place of the random simulation):

selection

It is actually very simple: pick the child TreeNode with the largest Q + U value.

    def select_child(self):  # pick the child with the largest Q + U value
        index = np.argmax(np.asarray([c.uct() for c in self.children]))
        return self.children[index]

    def uct(self):  # the PUCT score of this node; CPUCT is 5
        return self.Q + self.P * CPUCT * (np.sqrt(self.parent.N) / (1 + self.N))
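
Written as a formula, uct() computes the PUCT score used by AlphaZero-style search, with $c_{\mathrm{puct}} = 5$ (CPUCT) here:

$$U(s, a) = Q(s, a) + c_{\mathrm{puct}} \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + N(s, a)}$$

where P(s, a) is the network prior stored in the child, N(s) the parent's visit count (self.parent.N), and N(s, a) the child's own visit count (self.N).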

expand

The net provides a prior probability for every possible move; the leaf TreeNode reached by following the largest Q + U children is then expanded with these priors.

    def expand_node(self, props):  # props holds one prior probability per board position
        #print("TreeNode props is {0}".format(props))
        # enumerate() pairs each prior with its flattened action index;
        # a child TreeNode is created for every position with a positive prior (up to 64 on an 8x8 board)
        self.children = [TreeNode(action=action, props=p, parent=self)
                         for action, p in enumerate(props) if p > 0.]
        #print("expand_node self.children is {0}".format(self.children))

The N, Q and W statistics are updated during backpropagation.

    def __init__(self,
                 action=None,
                 props=None,
                 parent=None):

        self.parent = parent  # parent node in the tree
        self.action = action
        self.children = []
        self.P = props  # prior probability from the network

        self.N = 0  # visit count
        self.Q = .0  # mean action value (W / N)
        self.W = .0  # total action value

backpropagation

    def backup(self, v):  # fold the value of one simulation into this node's statistics
        self.N += 1
        self.W += v  # v is the (sign-corrected) value passed up from the leaf
        self.Q = self.W / self.N

Backpropagation is the reverse of selection: it only updates N, W and Q along the path from the expanded leaf back to the root.

NET

PyTorch models can be visualized with TensorBoard, which is worth trying later: https://www.jianshu.com/p/46eb3004beca
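
For reference, a minimal sketch of what that would look like with torch.utils.tensorboard; the dummy input shape (1, 8, 8, 8) is my assumption, inferred from the Conv2d(8, 128, ...) first layer in the printout below:

# Sketch only: visualize the project's Net with TensorBoard.
import torch
from torch.utils.tensorboard import SummaryWriter

net = Net()                                # the model defined in this project
dummy = torch.zeros(1, 8, 8, 8)            # one fake 8-channel 8x8 board state (assumed shape)
writer = SummaryWriter("runs/alphazero")   # logs go to ./runs/alphazero
writer.add_graph(net, dummy)               # then run: tensorboard --logdir runs
writer.close()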

I drew the structure by hand before realizing I could simply print the model, which felt rather silly; the printout is pasted below anyway, since the drawing did take time.

Finally, have a look at the net structure (the full printout is pasted further down).

Playing games does not train any parameters; the parameters are hidden inside the network and the intermediate computation is invisible too, which makes the net a very black box indeed.

Building a net in PyTorch, and how forward() gets called during execution, was covered in another chapter.

2. Train

Playing by itself is relatively simple, but how do we actually train the model?

Training the model requires training data; here a mini-batch of batch_size samples is drawn at random from the collected data sets.

dataloader

class DataLoader(object):
    def __init__(self, cuda, batch_size):
        print("DataLoader batch_size is {0}\t cuda is {1}".format(batch_size, cuda))
        self.cuda = cuda
        self.bsz = batch_size

    def __call__(self, datas):
        print("enter __call__ datas is {0}".format(datas))#难道是没有进入这个函数里面吗?至少说在创建对象的时候是没有进来的
        mini_batch = random.sample(datas, self.bsz)
        states, pi, rewards = [], [], []
        for s, p, r in mini_batch:
            states.append(s)
            pi.append(p)
            rewards.append(r)

        states = to_tensor(np.stack(states, axis=0), use_cuda=self.cuda)
        pi = to_tensor(np.stack(pi, axis=0), use_cuda=self.cuda)
        rewards = to_tensor(np.stack(rewards, axis=0), use_cuda=self.cuda)

        return states, pi, rewards.view(-1, 1)

optimizer

class ScheduledOptim(object):
  def __init__(self, optimizer, lr):

    self.lr = lr
    self.optimizer = optimizer

  def step(self):  # forwards to the wrapped optimizer
    self.optimizer.step()

  def zero_grad(self):  # forwards to the wrapped optimizer
    self.optimizer.zero_grad()

  def update_learning_rate(self, lr_multiplier):  # scale the base lr by lr_multiplier
    print("enter update_learning_rate lr_multiplier is {0}".format(lr_multiplier))
    new_lr = self.lr * lr_multiplier
    for param_group in self.optimizer.param_groups:
      param_group['lr'] = new_lr

entropy

class AlphaEntropy(nn.Module):  # the combined value / policy loss ("entropy" here is the cross-entropy term)
  def __init__(self):
    super().__init__()
    self.v_loss = nn.MSELoss()  # mean squared error for the value head
    #print("self.v_loss is {0}".format(self.v_loss))

  def forward(self, props, v, pi, reward):  # props: the policy output (log-probabilities), pi: the MCTS move distribution
    print("AlphaEntropy forward v is {0}\t reward is {1}".format(v, reward))
    v_loss = self.v_loss(v, reward)  # reward is the self-play game outcome z
    p_loss = -torch.mean(torch.sum(props * pi, 1))  # cross-entropy between pi and the policy output
    print("AlphaEntropy forward p_loss + v_loss is {0}".format(p_loss + v_loss))

    return p_loss + v_loss
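
Written out, this forward() is the AlphaZero loss minus its L2 term (that part is handled by the optimizer's weight_decay in gen_optim further down). Here z is the game outcome (reward), v the value head output, π the MCTS move distribution (pi), and log p the policy output (props), assuming the policy head returns log-probabilities:

$$l = (z - v)^2 - \pi^{\top} \log p$$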

The overall flow is roughly:

  1. Starting with Black, self-play a complete game and collect the resulting self-play data.
  2. Sample mini-batches from the collected data, compute the loss, and update the parameters by backpropagation.
  3. Every CHECKOUT steps, let the latest network play against the previous best network (eval_net) to evaluate the model.
  4. Depending on the evaluation result, either save the new model as the best one or reload the previous best.

play_game

class Game(object):
    def __init__(self, net, evl_net):
        self.net = net
        self.evl_net = evl_net
        self.board = Board()  # create a fresh board

    def play(self):  # self-play one full game and return the training samples
        datas, node = [], TreeNode()
        #print("datas is {0}\t node is {1}".format(datas, node))

        mc = MonteCarloTreeSearch(self.net)  # Monte Carlo tree search driven by the current net

        move_count = 0

        while True:
            #print("move_count is {0}".format(move_count))
            if move_count < TEMPTRIG:  # TEMPTRIG is 8: use temperature 1 for the first moves to encourage exploration
                pi, next_node = mc.search(self.board, node, temperature=1)
                # returns the move distribution pi and the next node
            else:
                pi, next_node = mc.search(self.board, node)

            datas.append([self.board.gen_state(), pi, self.board.c_player])  # record state, search distribution and the player to move

            self.board.move(next_node.action)
            next_node.parent = None  # detach the chosen child so it becomes the new root of the search tree
            node = next_node
            
            if self.board.is_draw():
                reward = 0.
                break

            if self.board.is_game_over():
                reward = 1.
                break

            self.board.trigger()
            move_count += 1
            #self.board.show()

        datas = np.asarray(datas)
        datas[:, 2][datas[:, 2] == self.board.c_player] = reward
        datas[:, 2][datas[:, 2] != self.board.c_player] = -reward

        return datas

    def evaluate(self, result):
        self.net.eval()
        self.evl_net.eval()

        if random.randint(0, 1) == 1:
            players = {
                BLACK: (MonteCarloTreeSearch(self.net), "net"),
                WHITE: (MonteCarloTreeSearch(self.evl_net), "eval"),
            }
        else:
            players = {
                WHITE: (MonteCarloTreeSearch(self.net), "net"),
                BLACK: (MonteCarloTreeSearch(self.evl_net), "eval"),
            }
        node = TreeNode()

        while True:  # play one complete evaluation game
            print("self.board.c_player is {0}".format(self.board.c_player))
            print("players[self.board.c_player][0] is {0}".format(players[self.board.c_player][0]))
            print("players[self.board.c_player][1] is {0}".format(players[self.board.c_player][1]))
            _, next_node = players[self.board.c_player][0].search(
                self.board, node)

            self.board.move(next_node.action)

            if self.board.is_draw():
                result[0] += 1
                return

            if self.board.is_game_over():
                if players[self.board.c_player][1] == "net":
                    result[1] += 1
                else:
                    result[2] += 1
                return

            self.board.trigger()

            next_node.parent = None
            node = next_node

    def reset(self):
        self.board = Board()

update_args

class Train(object):
    def __init__(self, use_cuda=USECUDA, lr=LR):
        print("use_cuda is {0}".format(use_cuda))#这个值是false
        if use_cuda:
            torch.cuda.manual_seed(1234)
        else:
            torch.manual_seed(1234)  # seed torch's random number generator

        self.kl_targ = 0.02  # target KL divergence between the old and new policy, used to adapt the learning rate
        self.lr_multiplier = 1.  # adaptive learning-rate multiplier
        self.use_cuda = use_cuda  # False

        self.net = Net()
        self.eval_net = Net()  # a second net that holds the current best model, used for evaluation
        if use_cuda:  # False here
            self.net = self.net.cuda()  # .cuda() is inherited from nn.Module
            self.eval_net = self.eval_net.cuda()

        # self.dl is the mini-batch sampler defined above
        self.dl = DataLoader(use_cuda, MINIBATCH)  # MINIBATCH is 512
        #print("self.dl is {0}".format(self.dl))

        self.sample_data = deque(maxlen=TRAINLEN)  # replay buffer; TRAINLEN is 10000
        #print("self.sample_data is {0}".format(self.sample_data))

        self.gen_optim(lr)  # build the optimizer (Adam wrapped in ScheduledOptim)

        self.entropy = AlphaEntropy()  # combined value / policy loss

        print("end of Train __init__ TRAINLEN is {0}".format(TRAINLEN))
        
    def run(self):
        model_path = f"model_{time.strftime('%Y%m%d%H%M', time.localtime())}.pt"
        print("model_path is {0}".format(model_path))
        self.net.save_model(path=model_path)  # stores only the parameters

        self.eval_net.load_model(path=model_path, cuda=self.use_cuda)
        print("GAMETIMES is {0}".format(GAMETIMES))
        for step in range(1, 1 + GAMETIMES):  # GAMETIMES is 3000
            game = Game(self.net, self.eval_net)  # hand both networks to the self-play game
            print(f"Game - {step} | data length - {self.sample(game.play())}")  # game.play() runs one self-play game; sample() augments its data and stores it in the replay buffer
            if len(self.sample_data) < MINIBATCH:
                continue
            game.board.show()


            states, pi, rewards = self.dl(self.sample_data)  # draw a random mini-batch from the replay buffer (filled by sample())
            #print("run states is {0}\n pi is {1}\n rewards is {2}".format(states, pi, rewards))
            _, old_props = self.net(states)  # policy before the update, kept for the KL check below
            #break
            for _ in range(EPOCHS):
                self.optim.zero_grad()

                v, props = self.net(states)
                loss = self.entropy(props, v, pi, rewards)  # combined value + policy loss
                loss.backward()  # backpropagate to compute gradients

                self.optim.step()

                _, new_props = self.net(states)
                # .item() converts a single-element tensor into a Python scalar
                # (it raises ValueError if the tensor has more than one element)
                kl = torch.mean(torch.sum(
                    torch.exp(old_props) * (old_props - new_props), 1)).item()
                if kl > self.kl_targ * 4:
                    break

            if kl > self.kl_targ * 2 and self.lr_multiplier > 0.1:
                self.lr_multiplier /= 1.5
            elif kl < self.kl_targ / 2 and self.lr_multiplier < 10:
                self.lr_multiplier *= 1.5

            self.optim.update_learning_rate(self.lr_multiplier)

            print(
                f"kl - {kl} | lr_multiplier - {self.lr_multiplier} | loss - {loss}")
            print("-" * 100 + "\r\n")

            if step % CHECKOUT == 0:  # CHECKOUT is 50
                result = [0, 0, 0]  # draw win loss
                for _ in range(EVALNUMS):  # EVALNUMS is 20
                    # evaluation plays the latest net against the eval (best-so-far) net
                    game.reset()  # reset the board before each evaluation game
                    game.evaluate(result)
                    break  # note: this break ends the loop after a single evaluation game (likely a debug leftover)

                if result[1] + result[2] == 0:
                    rate = 0
                else:
                    rate = result[1] / (result[1] + result[2])

                print(f"step - {step} evaluation")
                print(
                    f"win - {result[1]} | loss - {result[2]} | draw - {result[0]}")

                # save or reload model
                if rate >= WINRATE:
                    print(f"new best model. rate - {rate}")
                    self.net.save_model(path=model_path)
                    self.eval_net.load_model(
                        path=model_path, cuda=self.use_cuda)
                else:
                    print(f"load last model. rate - {rate}")
                    self.net.load_model(path=model_path, cuda=self.use_cuda)

                print("-" * 100 + "\r\n")
            #break
    def gen_optim(self, lr):
        optim = torch.optim.Adam(self.net.parameters(), lr=lr, weight_decay=L2)  # Adam with L2 weight decay
        #print("optim is {0}".format(optim))
        self.optim = ScheduledOptim(optim, lr)  # wrap the optimizer so its learning rate can be rescaled
        #print("self.optim is {0}".format(self.optim))

    def sample(self, datas):  # augment each position with its 8 board symmetries (4 rotations x mirror) and push them into the replay buffer
        #print("datas is {0}".format(datas))
        for state, pi, reward in datas:
            c_state = state.copy()
            c_pi = pi.copy()
            for i in range(4):
                c_state = np.array([np.rot90(s, i) for s in c_state])
                c_pi = np.rot90(c_pi.reshape(SIZE, SIZE), i)
                self.sample_data.append([c_state, c_pi.flatten(), reward])

                c_state = np.array([np.fliplr(s) for s in c_state])
                c_pi = np.fliplr(c_pi)
                self.sample_data.append([c_state, c_pi.flatten(), reward])

        return len(datas)
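
For reference, the kl quantity computed in run() is the KL divergence between the policy before and after the update, averaged over the batch; assuming old_props and new_props are log-probabilities (consistent with the np.exp(...) call in search()), the summed expression is exactly

$$\mathrm{KL}(p_{\text{old}} \,\|\, p_{\text{new}}) = \sum_{a} p_{\text{old}}(a)\,\big(\log p_{\text{old}}(a) - \log p_{\text{new}}(a)\big)$$

If it exceeds 4 × kl_targ the epoch loop stops early, and lr_multiplier is then adjusted so the KL stays close to kl_targ = 0.02.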

Play __init__ self.net.eval() is Net(
  (feat): Feature(
    (layer): Sequential(
      (0): Conv2d(8, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (encodes): ModuleList(
      (0): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (2): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (3): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (4): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (5): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (6): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (7): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (8): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (9): ResBlockNet(
        (layers): Sequential(
          (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (2): ReLU()
          (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
  )
  (value): Value(
    (conv): Sequential(
      (0): Conv2d(128, 1, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (linear): Sequential(
      (0): Linear(in_features=64, out_features=256, bias=True)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=1, bias=True)
      (3): Tanh()
    )
  )
  (policy): Policy(
    (conv): Sequential(
      (0): Conv2d(128, 2, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
    )
    (linear): Linear(in_features=128, out_features=64, bias=True)
  )
)

Details

  • Temperature: for the first moves of self-play (the paper uses 30; this code uses TEMPTRIG = 8) the temperature is 1; afterwards, including model evaluation, a near-zero value is used.
  • Dirichlet noise is mixed into the root node's priors P so that more candidate moves get tried.
  • After every move, the selected child node becomes the new root node.
  • The input to the net consists of binary feature planes: boards containing only the black stones, boards containing only the white stones, plus the current player to move (the printed first conv layer takes 8 channels in total).
  • The loss function combines mean-squared error and cross-entropy (see the formula given with AlphaEntropy above).
  • During training, the input features are augmented by rotations and reflections.
  • In MCTS, the per-move probabilities are obtained by passing the visit counts through a softmax.
  • In MCTS, mind the sign of value each time it is propagated to the parent node.
  • Unlike Go, deciding who won a Gomoku game only requires looking at the side that made the last move.
  • The neural network is not strictly decisive in Zero: even a uniform probability distribution, given enough simulations, still produces playable moves.

The part above went through the AlphaZero code, but it did not feel very connected to RL. Now let's take a look at an AlphaGo Zero implementation and compare:

https://github.com/junxiaosong/AlphaZero_Gomoku

1. Play

Initialization and play are similar to the AlphaZero code above.

The human player's get_action is very simple: it just reads the move from the command line.

The MCTS player works by obtaining a probability for each action and then choosing an action according to that distribution.

So how does the MCTS player get those per-action probabilities? The same way as the AlphaZero code above: update the Q + U values in the MCTS, accumulate the visit counts, and finally convert the visit counts into probabilities with a softmax. A small sketch of that last step follows.
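
A sketch of that conversion (illustration only, not the repo's exact code): the visit counts go through a temperature-scaled softmax, so temperature 1 keeps them proportional while a tiny temperature makes the choice almost greedy.

# Turn MCTS visit counts into move probabilities with a temperature-controlled softmax.
import numpy as np

def visits_to_probs(visit_counts, temperature=1e-3):
    logits = (1.0 / temperature) * np.log(np.asarray(visit_counts, dtype=np.float64) + 1e-10)
    logits -= np.max(logits)        # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / np.sum(probs)

print(visits_to_probs([10, 30, 5], temperature=1.0))   # roughly proportional to the visit counts
print(visits_to_probs([10, 30, 5], temperature=1e-3))  # nearly all the mass on the most-visited move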

As for the net, it is passed in through PolicyValueNetNumpy.policy_value_fnc and gets called at

action_probs, leaf_value = self._policy(state)  # the policy function receives the current board as its argument

2. Train

It does not feel very different from the AlphaZero code: both revolve around self-play. So what exactly does all this have to do with RL?
