RL (Chapter 1): Tic-Tac-Toe

This post shows how to solve the game of tic-tac-toe with reinforcement learning. By building a value function and using an ε-greedy policy, the agent is trained through self-play and eventually reaches an optimal policy against a particular opponent. The key steps are described in detail: state representation, value-function initialization, and value updates.


These are my reinforcement learning notes, based mainly on the following material:

An Extended Example: Tic-Tac-Toe (Sutton & Barto, Reinforcement Learning: An Introduction, Section 1.5)


  • Because a skilled player can play so as never to lose, let us assume that we are playing against an imperfect player.
  • For the moment, in fact, let us consider draws and losses to be equally bad for us.
  • How might we construct a player that will find the imperfections in its opponent’s play and learn to maximize its chances of winning?

Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical “minimax” solution is not correct here because it assumes a particular way of playing by the opponent. For example, a minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent.


Here is how the tic-tac-toe problem would be approached with a method making use of a value function.

  • Set up a value function:
    First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function.

  • Initialize the value function:
    Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1. Similarly, for all states with three Os in a row, or that are filled up, the correct probability of winning is 0. We set the initial values of all the other states to 0.5.

  • Train the agent:

    • We then play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves and look up their current values in the table. We can adopt an $\varepsilon$-greedy policy to balance exploitation and exploration.
    • While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 1.1.
    • This can be done by moving the earlier state’s value a fraction of the way toward the value of the later state.
      If $V(S_t)$ denotes the estimated value of the state before the greedy move and $V(S_{t+1})$ the value of the state reached after it, the update is
      $$V(S_t) \leftarrow V(S_t) + \alpha \big[ V(S_{t+1}) - V(S_t) \big]$$
      where $\alpha$ is a small positive fraction called the step-size parameter, which influences the rate of learning.

This update rule is an example of a temporal-difference learning method (see Chapter 6), so called because its changes are based on a difference between estimates at two successive times.
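
As a minimal sketch (my own simplification, not the full program listed later in this post; the helper names td_backup and epsilon_greedy are hypothetical), the learning step can be written as a few lines of Python operating on a dictionary of state values:

import random

# values maps a hashable state to the current estimate of the probability of winning
def td_backup(values, state_before, state_after, step_size=0.1):
    # move the earlier state's value a fraction of the way toward the later state's value
    values[state_before] += step_size * (values[state_after] - values[state_before])

def epsilon_greedy(values, candidate_states, epsilon=0.1):
    # with probability epsilon explore a random move, otherwise exploit the best-valued state
    if random.random() < epsilon:
        return random.choice(candidate_states)
    return max(candidate_states, key=lambda s: values[s])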

[Figure 1.1: a sequence of tic-tac-toe moves; the arrows indicate the backups of state values from later states to earlier states]

The tic-tac-toe player is model-free in this sense with respect to its opponent: it has no model of its opponent of any kind.


  • The method described above performs quite well on this task. For example, if the step-size parameter is reduced properly over time, the states' values converge, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player.
  • Furthermore, the moves then taken (except on exploratory moves) are in fact the optimal moves against this (imperfect) opponent. In other words, the method converges to an optimal policy for playing the game against this opponent.
  • If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing.

It is a striking feature of the reinforcement learning solution that it can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions.


Supervised learning methods within reinforcement learning:

  • Tic-tac-toe has a relatively small, finite state set, whereas reinforcement learning can be used when the state set is very large, or even infinite.
  • How well a reinforcement learning system can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience. It is in this role that we have the greatest need for supervised learning methods within reinforcement learning. Artificial neural networks and deep learning (Section 9.7) are not the only, or necessarily the best, way to do this. (A minimal sketch of this generalization idea follows this list.)
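
The sketch below is my own illustration (not from the book or from the program later in this post), and the choice of raw board cells as features is purely hypothetical. It replaces the value table with a parameterized function trained by the same temporal-difference update:

import numpy as np

# Linear value function: v(board) = w · board.flatten().
# One weight per cell, so similar boards share information instead of each
# state being estimated independently as in the tabular version.
weights = np.zeros(9)

def value(board):
    # board: 3x3 array with entries in {1, -1, 0}
    return float(weights @ board.flatten())

def td_update(board_before, board_after, step_size=0.01):
    # semi-gradient TD update: move v(board_before) toward v(board_after);
    # the gradient of a linear value function is just the feature vector
    td_error = value(board_after) - value(board_before)
    weights[:] = weights + step_size * td_error * board_before.flatten()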

Prior information:

  • In this tic-tac-toe example, learning started with no prior knowledge beyond the rules of the game, but prior information can be incorporated into reinforcement learning in a variety of ways that can be critical for efficient learning (e.g., see Sections 9.5, 17.4, and 13.1).

Code (Python)

#######################################################################
# Copyright (C)                                                       #
# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com)           #
# 2016 Jan Hakenberg(jan.hakenberg@gmail.com)                         #
# 2016 Tian Jun(tianjun.cpp@gmail.com)                                #
# 2016 Kenta Shimada(hyperkentakun@gmail.com)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import pickle

import numpy as np

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS

State

This class stores the board and determines the winner and whether the game has ended.

class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))  # the board
        self.winner = None    # winner of the current position (1, -1, or 0 for a tie)
        self.hash_val = None  # cached hash value of the position
        self.end = None       # whether the game has ended in this position

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):  # iterate over every cell of the board
                self.hash_val = self.hash_val * 3 + i + 1  # base-3 digit: cell value + 1, i.e. one of {0, 1, 2}
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    # the result is computed only once per position; it is cached and later retrieved by hash value
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check row
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))

        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')
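
A quick illustrative check of the hash encoding (my own snippet, not part of the original listing): each cell value is mapped to value + 1 in {0, 1, 2} and the nine digits are read as a base-3 number, so the empty board, all of whose digits are 1, hashes to the sum of the first nine powers of 3:

empty = State()
# 1 + 3 + 9 + ... + 3**8 == 9841; the result is a float because the board array holds floats
assert empty.hash() == sum(3 ** k for k in range(BOARD_SIZE))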

The code below enumerates every reachable position, computes its information (whether the game has ended, who the winner is, and so on), and caches it in the dictionary all_states, so that later the information for any position can be retrieved directly from its hash value:

def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)  # the other player moves next


def get_all_states():
    current_symbol = 1  # 1 marks the player who moves first
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()
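
As another illustrative snippet (again my own, not part of the original listing), any position can now be resolved from the cache by its hash value:

opening = State().next_state(1, 1, 1)              # the first player takes the centre square
cached_state, is_end = all_states[opening.hash()]  # is_end is False: the game has only just begun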

HumanPlayer

# human interface
# input a number to put a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None  # 1 if this player moves first, -1 if second
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol
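
For example, pressing 's' gives index 4 in self.keys, so i = 4 // 3 = 1 and j = 4 % 3 = 1, which is the centre square, matching the keyboard layout in the comment above.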

Player

# AI player
class Player:
    # @step_size: the step size to update estimations 
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()   # value table: estimated win probability for every position
        self.step_size = step_size  # step-size parameter controlling how far each backup moves an estimate
        self.epsilon = epsilon      # probability of making an exploratory move
        self.states = []            # positions visited during the current game
        self.greedy = []            # greedy[i] is False if an exploratory move was made from states[i]
        self.symbol = 0             # 1 if playing first, -1 if playing second

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        # initialize the value table: 1 for winning positions, 0 for losses, 0.5 for ties and unfinished positions
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]
		
        # update the visited positions' estimates in reverse order; exploratory moves (greedy flag False) contribute no update
        for i in reversed(range(len(states) - 1)): 
            state = states[i] 
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state]
            )
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]  # the current position (the last state we were given)
        next_states = []         # hash values of the positions reachable by one move
        next_positions = []      # the corresponding legal positions [i, j]
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:  # exploratory move: pick a legal position at random
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False  # mark this move as exploratory so backup() ignores it
            return action

        # greedy move: choose the position whose resulting state has the highest estimated value
        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle first so that ties are broken at random (Python's sort is stable)
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]  # the position [i, j] with the highest estimated value
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)
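
To make the backup concrete: if a greedy move led from a position currently estimated at 0.5 to a position estimated at 1.0 (a win), then with step_size = 0.1 the earlier position's estimate becomes 0.5 + 0.1 × (1.0 − 0.5) = 0.55; repeated wins from that position keep pulling its estimate toward 1.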

Judger

class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)  # assign sides; for an AI player this also initializes its value table
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()        # (this attribute does not appear to be used)

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):  # the two players take turns
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()  # generator that alternates between the two players
        self.reset()                   # clear both players' per-game records
        current_state = State()        # start from an empty board
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)    # the player whose turn it is
            i, j, symbol = player.act()  # the player chooses a move
            next_state_hash = current_state.next_state(i, j, symbol).hash()  # hash of the board after the move
            current_state, is_end = all_states[next_state_hash]  # look up the cached position and end-of-game flag
            self.p1.set_state(current_state)  # tell both players about the new position
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner

Training and playing

def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)  # training is done by self-play
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()

# evaluate the trained AI by self-play with no exploration
def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))


# Tic-tac-toe is a zero-sum game. If both players play an optimal strategy, every game ends in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)  # no exploration when playing against a human
        judger = Judger(player1, player2)	
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()