These notes on reinforcement learning are based mainly on the following sources:
- Reinforcement Learning: An Introduction
- All code is taken from GitHub
- Exercise solutions are referenced from GitHub
An Extended Example: Tic-Tac-Toe
- Because a skilled player can play so as never to lose, let us assume that we are playing against an imperfect player.
- For the moment, in fact, let us consider draws and losses to be equally bad for us.
- How might we construct a player that will find the imperfections in its opponent’s play and learn to maximize its chances of winning?
Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical “minimax” solution is not correct here because it assumes a particular way of playing by the opponent. For example, a minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent.
Here is how the tic-tac-toe problem would be approached with a method making use of a value function.
- Set up a value function:
  First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function.
- Initialize the value function:
  Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1. Similarly, for all states with three Os in a row, or that are filled up, the correct probability is 0. We set the initial values of all the other states to 0.5.
- Train the agent:
  - We then play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves and look up their current values in the table. We can adopt an $\varepsilon$-greedy policy to balance exploitation and exploration (a code sketch of this selection rule follows this list).
  - While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 1.1.
  - This can be done by moving the earlier state's value a fraction of the way toward the value of the later state. Writing $S_t$ for the state before the greedy move and $S_{t+1}$ for the state after it, the update is
    $$V(S_t) \leftarrow V(S_t) + \alpha \bigl[ V(S_{t+1}) - V(S_t) \bigr],$$
    where $\alpha$ is a small positive fraction called the step-size parameter, which influences the rate of learning.

This update rule is an example of a temporal-difference learning method (see Chapter 6), so called because its changes are based on a difference between estimates at two successive times.
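Below is a minimal sketch of how this $\varepsilon$-greedy selection and back-up could look in code. It assumes the value table is simply a dictionary keyed by (hashable) states; the names value_table, choose_afterstate, and td_backup are illustrative only and do not appear in the full program later in these notes.

import random

EPSILON = 0.1  # probability of an exploratory move
ALPHA = 0.1    # step-size parameter

value_table = {}  # state -> estimated probability of winning

def value(state):
    # unseen, non-terminal states start at 0.5
    return value_table.get(state, 0.5)

def choose_afterstate(afterstates):
    # epsilon-greedy: occasionally explore, otherwise pick the highest-valued afterstate;
    # the second return value records whether the move was greedy
    if random.random() < EPSILON:
        return random.choice(afterstates), False
    return max(afterstates, key=value), True

def td_backup(state_before, state_after):
    # move V(state_before) a fraction ALPHA toward V(state_after)
    value_table[state_before] = value(state_before) + ALPHA * (value(state_after) - value(state_before))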
The tic-tac-toe player is model-free in this sense with respect to its opponent: it has no model of its opponent of any kind.
- The method described above performs quite well on this task. For example, if the step-size parameter is reduced properly over time, this method (i.e., the states' values) converges, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player.
- Furthermore, the moves then taken (except on exploratory moves) are in fact the optimal moves against this (imperfect) opponent. In other words, the method converges to an optimal policy for playing the game against this opponent.
- If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing.
It is a striking feature of the reinforcement learning solution that it can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions.
Supervised learning methods within reinforcement learning:
- Tic-tac-toe has a relatively small, finite state set, whereas reinforcement learning can be used when the state set is very large, or even infinite.
- How well a reinforcement learning system can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience. It is in this role that we have the greatest need for supervised learning methods within reinforcement learning. Artificial neural networks and deep learning (Section 9.7) are not the only, or necessarily the best, way to do this.
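When the state set is too large for a table, the same back-up idea can be applied to a parameterized value function trained in a supervised-learning style. The sketch below is only illustrative and is not part of the tic-tac-toe program: it assumes a sigmoid-linear approximator over the raw board cells and a semi-gradient version of the update above.

import numpy as np

ALPHA = 0.01
w = np.zeros(10)  # weights for the 9 board cells plus a bias term

def features(board):
    # board: 3x3 array with entries in {-1, 0, 1}; append a constant bias feature
    return np.append(board.ravel(), 1.0)

def value(board):
    # squash the linear score into a (0, 1) "probability of winning"
    return 1.0 / (1.0 + np.exp(-w.dot(features(board))))

def td_update(board_before, board_after):
    # semi-gradient TD: move the prediction for board_before toward that for board_after;
    # the gradient of the sigmoid output is v * (1 - v) * features
    global w
    v = value(board_before)
    w += ALPHA * (value(board_after) - v) * v * (1.0 - v) * features(board_before)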
Prior information:
- In this tic-tac-toe example, learning started with no prior knowledge beyond the rules of the game, but prior information can be incorporated into reinforcement learning in a variety of ways that can be critical for efficient learning (e.g., see Sections 9.5, 17.4, and 13.1).
Code (Python)
#######################################################################
# Copyright (C) #
# 2016 - 2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Jan Hakenberg(jan.hakenberg@gmail.com) #
# 2016 Tian Jun(tianjun.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
import numpy as np
import pickle

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS
The State class
This class is mainly responsible for managing the board state and determining who has won, or whether the game is a draw.
class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))  # the board
        self.winner = None    # winner of the current position
        self.hash_val = None  # hash value of the current position
        self.end = None       # whether the current position is terminal

    # compute the hash value for one state, it's unique
    # (each cell contributes one base-3 digit: cell value + 1 is in {0, 1, 2})
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):  # np.nditer iterates over every cell
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    # the result is computed only once per position and cached;
    # afterwards it can be looked up directly via the hash value
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check rows
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))
        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')
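As a quick illustration of how State is used and what the hash encodes (this snippet is not in the original file): the nine cells are read as base-3 digits after shifting each cell value by +1, so the empty board should hash to (3**9 - 1) / 2 = 9841.

s = State()
print(s.hash())            # 9841.0 (a float, because the board array holds floats)
print(s.is_end())          # False, no one has moved yet
s = s.next_state(1, 1, 1)  # the first player takes the centre
s.print_state()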
The code below enumerates every possible position up front, computes the relevant information for each one (whether the game has ended, who the winner is, ...), and caches it in the dictionary all_states, so that afterwards the information for any position can be retrieved quickly via its hash value:
def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)  # the opponent moves next

def get_all_states():
    current_symbol = 1  # 1 is the first player
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states

# all possible board configurations
all_states = get_all_states()
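As a rough sanity check (not in the original file), the size of this cache should match the number of board positions reachable in legal play, which is commonly quoted as 5478 including the empty board:

print(len(all_states))  # expected to be 5478 (positions reachable in legal play, empty board included)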
The HumanPlayer class
# human interface
# input a key to put a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None  # 1 if this player moves first, -1 if second
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol
The Player class
# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()   # value table: estimated winning probability of every position
        self.step_size = step_size  # step-size parameter, controls how far each back-up moves the estimate
        self.epsilon = epsilon      # probability of making an exploratory move
        self.states = []            # positions encountered during the current game
        self.greedy = []            # whether each move of the current game was greedy
        self.symbol = 0             # 1 if this player moves first, -1 if second

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        # initialize the value table
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]
        # walk through the game in reverse order and back up the estimate of every
        # position that was left by a greedy move; exploratory moves contribute no
        # update because self.greedy[i] is False (i.e. 0)
        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state]
            )
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]  # the latest position this player has seen
        next_states = []         # hash values of the positions reachable in one move
        next_positions = []      # the corresponding board coordinates
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:  # exploratory move
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False  # mark this move as non-greedy so backup() skips it
            return action

        # greedy move: choose the position with the highest estimated winning probability
        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle first so that ties between equally valued moves are broken at random
        # (Python's sort is stable)
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]  # the chosen board position
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)
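A quick way to sanity-check the value-table initialization (this snippet is not part of the original file): construct a terminal position in which the first player has completed the top row, and confirm that its initial estimate is 1.0 for a Player using symbol 1.

p = Player(epsilon=0)
p.set_symbol(1)  # play as the first player
board = State()
for k, (i, j) in enumerate([(0, 0), (1, 0), (0, 1), (1, 1), (0, 2)]):
    board = board.next_state(i, j, 1 if k % 2 == 0 else -1)
# the first player owns the whole top row, so this terminal state is worth 1.0
print(board.is_end(), p.estimations[board.hash()])  # True 1.0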
The Judger class
class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)  # assign first/second move; an AI player also initializes its value table here
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()  # note: this attribute does not seem to be used anywhere

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):  # the two players take turns to move
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()  # a generator alternating between the two players
        self.reset()                   # clear the players' per-game state
        current_state = State()        # start from an empty board
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)    # the player to move
            i, j, symbol = player.act()  # the player chooses a move
            next_state_hash = current_state.next_state(i, j, symbol).hash()  # hash of the position after the move
            current_state, is_end = all_states[next_state_hash]  # look up the cached position by its hash
            self.p1.set_state(current_state)  # inform both players of the new position
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner
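Putting the pieces together, a single untrained self-play game can be run like this (an illustrative snippet, not part of the original file):

p1 = Player(epsilon=0.1)
p2 = Player(epsilon=0.1)
judger = Judger(p1, p2)
winner = judger.play(print_state=True)
print('winner:', winner)  # 1 if the first player won, -1 if the second player won, 0 for a tie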
Training and playing
def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)  # training is done by self-play
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()

# evaluate the trained AI: the two saved policies play against each other
def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))

# The game is a zero-sum game. If both players play an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie when it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)  # no exploration when playing against a human
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")

if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()