Overview
This page collects my notes on Reinforcement Learning and will be updated frequently.
Differences from supervised ML
- RL has no labels; the agent learns from reward signals.
- Data in RL is not i.i.d.; successive observations are correlated.
- RL naturally models the sequential nature of the problem.
Q-learning
Background: TD(0) method
V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
Here we define the TD error (loss):
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
We call this the Temporal-Difference (one-step) method, or TD(0), because it updates the value function of time step t immediately at time t+1.
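As a concrete illustration, below is a minimal tabular TD(0) sketch for policy evaluation; the environment interface (Gym-style reset/step), the policy callable, and all hyperparameter values are assumptions for illustration:

import numpy as np

# tabular TD(0) policy evaluation sketch (env, policy, and sizes are assumptions)
def td0_evaluate(env, policy, n_states, episodes=500, alpha=0.1, gamma=0.99):
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_, r, done, _ = env.step(a)
            # TD(0) update: V(S_t) += alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
            V[s] += alpha * (r + gamma * V[s_] * (not done) - V[s])
            s = s_
    return V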
Description: Off-Policy Q-learning
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
Again we define the Target as
R_{t+1} + \gamma \max_a Q(S_{t+1}, a)
and the loss as
R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)
Here \alpha is the learning rate and \gamma is the reward decay (discount factor).
It is called an off-policy method because the behavior policy is an \epsilon-soft policy while the target policy is the greedy policy.
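A minimal tabular Q-learning sketch, again assuming a hypothetical discrete environment with a Gym-style reset/step interface; all names and hyperparameters are illustrative:

import numpy as np

# tabular off-policy Q-learning sketch (env, sizes, and constants are assumptions)
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-soft behavior policy: explore with probability epsilon
            if np.random.uniform() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = np.argmax(Q[s])
            s_, r, done, _ = env.step(a)
            # greedy target policy: bootstrap from max_a Q(S_{t+1}, a)
            target = r + gamma * np.max(Q[s_]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_
    return Q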
OpenAI Gym
Import Libraries
import gym
Environment
- Get the environment
env = gym.make('MountainCar-v0')
- Get the basic info of the environment
env.action_space: returns the action space (env.action_space.n is the number of available actions)
env.observation_space: returns the observation space (the number of observations in one state)
env.observation_space.high/low: return the highest/lowest value of each observation
Functions
- Get the initial state from the environment
observation = env.reset()
- Refresh the environment (render it in the window)
env.render()
- Get the next state given the current action
observation_, reward, done, info = env.step(action)
- Close the environment
env.close()
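Tying these calls together, a minimal random-agent loop might look like the sketch below (the episode count is arbitrary, and the random policy is only for illustration):

import gym

env = gym.make('MountainCar-v0')
for episode in range(3):
    observation = env.reset()               # initial state
    done = False
    while not done:
        env.render()                        # refresh the window
        action = env.action_space.sample()  # random action, for illustration only
        observation_, reward, done, info = env.step(action)
        observation = observation_
env.close()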
Example
import gym
from RL_brain import DeepQNetwork

env = gym.make('CartPole-v0')
env = env.unwrapped

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

RL = DeepQNetwork(n_actions=env.action_space.n,
                  n_features=env.observation_space.shape[0],
                  learning_rate=0.01, e_greedy=0.9,
                  replace_target_iter=100, memory_size=2000,
                  e_greedy_increment=0.001,)

total_steps = 0

for i_episode in range(50):
    observation = env.reset()
    ep_r = 0
    while True:
        env.render()
        action = RL.choose_action(observation)
        observation_, reward, done, info = env.step(action)

        # the smaller theta and the closer to the center, the better
        x, x_dot, theta, theta_dot = observation_
        r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8
        r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
        reward = r1 + r2

        RL.store_transition(observation, action, reward, observation_)
        ep_r += reward

        if total_steps > 1000:
            RL.learn()

        if done:
            print('episode: ', i_episode,
                  'ep_r: ', round(ep_r, 2),
                  ' epsilon: ', round(RL.epsilon, 2))
            break

        observation = observation_
        total_steps += 1

RL.plot_cost()
Tensorflow
Import Libraries
import tensorflow as tf
import numpy as np
Session
Key: tf.Session() & sess.run(Variable_Name)
import tensorflow as tf
matrix1 = tf.constant([[4,3]]) # 1 by 2 matrix
matrix2 = tf.constant([[2],[5]]) # 2 by 1 matrix
product = tf.matmul(matrix1, matrix2) # matrix multiply np.dot(m1, m2)
# method 1
sess = tf.Session()
result = sess.run(product)  # product is evaluated when run by the session
print(result)
sess.close()

# method 2
with tf.Session() as sess:
    result2 = sess.run(product)
    print(result2)
# no need to call close after the with block
Output:
[[23]]
[[23]]
Variable
Variables must be defined and initialized before using them.
Define Variable
tf.get_variable() is recommended over tf.Variable().
- tf.Variable()
# define a variable and assign its value and name
w = tf.Variable(<initial-value>, name=<optional-name>)
Example:
import tensorflow as tf

state = tf.Variable(0, name='counter')
# print(state.name)

# define a constant and assign its value
one = tf.constant(1)
# add two variables or constants
new_value = tf.add(state, one)
# assign one variable to another variable
update = tf.assign(state, new_value)  # state = new_value

# initialize all variables (required whenever variables are defined)
init = tf.global_variables_initializer()

# perform operations on variables
with tf.Session() as sess:
    sess.run(init)  # activate the initialization
    for _ in range(3):
        sess.run(update)  # perform the update
        print(sess.run(state))  # print the value of state
Output:
1
2
3
- tf.get_variable()
Must be used within tf.variable_scope():
w = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None, regularizer=None, trainable=True, collections=None)
Example:
with tf.variable_scope('eval_net'):
    # c_names (collections_names) are the collections to store variables
    c_names, n_l1, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

    # first layer. collections is used later when assigning to the target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)
- collection
c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
...
t_params = tf.get_collection('target_net_params')
Reuse variable
with tf.variable_scope("one"):
a = tf.get_variable("v", [1]) #a.name == "one/v:0"
with tf.variable_scope("one", reuse = True):
c = tf.get_variable("v", [1]) #c.name == "one/v:0"
assert a==c #Assertion is true, they refer to the same object.
Placeholder
A placeholder is used to feed different values to one variable: define it first, then feed a dict when it is run by the session.
import tensorflow as tf
# define a placeholder, must define the type (usually use float32)
input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.multiply(input1, input2)
# feed a dict to placeholder
with tf.Session() as sess:
    print(sess.run(output, feed_dict={input1: [7.], input2: [2.]}))
Activation Function
An activation function is used to make the output of a layer nonlinear. For CNNs (Convolutional Neural Networks), relu is commonly applied; for RNNs (Recurrent Neural Networks), relu or tanh is used.
List of activation functions offered by Tensorflow:
- tf.nn.relu
- tf.nn.relu6
- tf.nn.crelu
- tf.nn.elu
- tf.nn.selu
- tf.nn.softplus
- tf.nn.softsign
- tf.nn.dropout
- tf.nn.bias_add
- tf.sigmoid
- tf.tanh
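A quick sketch of applying a few of these activations to a tensor (the input values are arbitrary):

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

with tf.Session() as sess:
    print(sess.run(tf.nn.relu(x)))   # [0.  0.  0.  0.5 2. ] negatives clipped to 0
    print(sess.run(tf.tanh(x)))      # values squashed into (-1, 1)
    print(sess.run(tf.sigmoid(x)))   # values squashed into (0, 1)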
Optimizer
tf.train.Optimizer_Name
List of optimizers offered by Tensorflow:
- GradientDescentOptimizer
- AdagradOptimizer
- AdagradDAOptimizer
- MomentumOptimizer
- AdamOptimizer
- FtrlOptimizer
- RMSPropOptimizer # used by Alpha Go
Structure
- input layer
We do not add a separate input layer because we usually use it directly as the input to the hidden layer. In Python, it is a placeholder.
- hidden layer (relu layer)
The hidden layer takes the input layer as its input and applies relu (or another activation function). Its input size is the size of the input layer (the number of features of each example in the input data). Its output size is the number of nodes (neurons) in this layer.
- output layer (logit layer)
The output layer takes the hidden layer as its input and usually applies no activation function. Its input size is the output size of the hidden layer. Its output size is the size of the output result (the number of features of each result in the output).
Add one layer
The function add_layer has four parameters: inputs, in_size, out_size, and activation_function.
import tensorflow as tf
import numpy as np
def add_layer(inputs, in_size, out_size, activation_function=None):
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))  # in_size by out_size matrix
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)  # one-dimensional vector
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs
Compute loss value
In the following code, the loss value is calculated by mean-square error.
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction), # compute loss
reduction_indices=[1]))
The following code is used to print the loss value
print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
Training
The following code uses the gradient descent method to optimize the loss in the training process.
train_variable = tf.train.Optimizer_Name(Learning_Rate).minimize(loss_var)
Example:
#### define the training variable ####
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss) # train with learning rate 0.1
#### run the training process ####
sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
Example 1: Nonlinear Structure
import tensorflow as tf
import numpy as np
# make up some real data
x_data = np.linspace(-1,1,300)[:, np.newaxis] # 300 rows, 1 column (one feature)
noise = np.random.normal(0, 0.05, x_data.shape) # mean 0, variance 0.05
y_data = np.square(x_data) - 0.5 + noise
# define placeholder for inputs to networks
xs = tf.placeholder(tf.float32, [None, 1]) # None means unfixed rows
ys = tf.placeholder(tf.float32, [None, 1]) # None means unfixed rows
# add hidden layer
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
# add output layer
prediction = add_layer(l1, 10, 1, activation_function=None)
# the error between prediction and real data
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction), # compute loss
reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss) # train with learning rate 0.1
# initialize all variables
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        # training
        sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
        if i % 50 == 0:
            # to see the step improvement
            print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
Expected Results:
0.39016193
0.009708067
0.008203865
0.0074426793
0.0069677234
0.006761972
0.006046914
0.0058904393
0.005795052
0.0056967423
0.005605927
0.005513572
0.0054187346
0.0053218375
0.005224105
0.0051228567
0.0050213463
0.004917563
0.004811827
0.0047068885
Example 2: Linear Structure
- create data
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3
- create tensorflow structure
Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))  # one dimension, between -1.0 and 1.0
biases = tf.Variable(tf.zeros([1]))  # one dimension, all zeros
y = Weights * x_data + biases
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)  # learning rate < 1
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
- create session and run it
sess = tf.Session()
sess.run(init)  # activate the network, very important!
for step in range(201):
    sess.run(train)
    if step % 20 == 0:
        print(step, sess.run(Weights), sess.run(biases))
- Expected Results
0 [-0.25551698] [0.68949234]
20 [-0.02096055] [0.3655322]
40 [0.06659328] [0.3180986]
60 [0.09077379] [0.30499846]
80 [0.09745194] [0.30138046]
100 [0.09929628] [0.30038127]
120 [0.09980565] [0.3001053]
140 [0.09994634] [0.3000291]
160 [0.09998519] [0.30000803]
180 [0.09999589] [0.30000225]
200 [0.09999888] [0.3000006]
Some useful functions
Random
tf.set_random_seed(seed): fix the graph-level random seed so that random operations produce reproducible values.
tf.random_normal([row_size, column_size]): generate a normally distributed random matrix.
Initializer
tf.random_normal_initializer(mean, stddev): create a normally distributed random initializer, which can be passed to the 'initializer' parameter of tf.get_variable.
tf.constant_initializer(num): create a constant-valued initializer, which can be passed to the 'initializer' parameter of tf.get_variable.
Matrix Computation
tf.square(Mat): compute the element-wise square of a matrix
tf.reduce_sum(Mat): compute the sum of all elements of a matrix
tf.reduce_sum(Mat, reduction_indices=[1]): compute the sum of each row
tf.reduce_sum(Mat, reduction_indices=[0]): compute the sum of each column
tf.matmul(MatA, MatB): matrix multiplication A*B
tf.multiply(MatA, MatB): element-wise multiplication of A and B
tf.reduce_mean(Array): compute the mean value of an array
Example: Compute loss value
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction), # compute loss
reduction_indices=[1]))
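A quick sketch of how the reduction axes behave (the matrix values are arbitrary):

import tensorflow as tf

m = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])

with tf.Session() as sess:
    print(sess.run(tf.reduce_sum(m)))                         # 21.0, sum of all elements
    print(sess.run(tf.reduce_sum(m, reduction_indices=[1])))  # [ 6. 15.], sum of each row
    print(sess.run(tf.reduce_sum(m, reduction_indices=[0])))  # [5. 7. 9.], sum of each column
    print(sess.run(tf.reduce_mean(m)))                        # 3.5, mean of all elements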
Value Assignment
- tf.assign(t, e): assign the value of e to t
self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]
- tf.zeros([row_size, column_size]): generate all-zero matrix
Tensorboard
open tensorboard
First run the following command:
# in a terminal (e.g., in PyCharm)
tensorboard --logdir logs
Then open the printed link (usually 0.0.0.0:6006) in a browser.
generate logs
# must be placed after sess = tf.Session()
writer = tf.summary.FileWriter('logs/', sess.graph)
name the framework, layer, and objects
Example 1
with tf.name_scope('layer'):
    with tf.name_scope('weights'):
        Weights = tf.Variable(tf.random_normal([in_size, out_size]), name='W')  # in_size by out_size matrix
    with tf.name_scope('biases'):
        biases = tf.Variable(tf.zeros([1, out_size]), name='b')  # one-dimensional vector
    with tf.name_scope('Wx_plus_b'):
        Wx_plus_b = tf.matmul(inputs, Weights) + biases
Example 2
with tf.variable_scope('eval_net'):
    ...
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', ...)
        b1 = tf.get_variable('b1', ...)
        l1 = ...
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', ...)
        b2 = tf.get_variable('b2', ...)
        l2 = ...
with tf.variable_scope('loss'):
    self.loss = ...
with tf.variable_scope('train'):
    self.train = ...
add different types of summary
- Scalar, for example, loss
tf.summary.scalar('loss', loss)
- Histogram, for example, weights and biases
tf.summary.histogram(layer_name+'/weights', Weights)
merge different summary
- define the merged summary
merged = tf.summary.merge_all()
- run the merged summary
result = sess.run(merged, feed_dict={xs:x_data, ys:y_data})
- add the merged summary to the file writer
writer.add_summary(result,i)
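Putting the summary steps together, a minimal sketch, assuming xs, ys, x_data, y_data, loss, and train_step from Example 1 above:

# minimal TensorBoard logging sketch, reusing names from Example 1
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs/', sess.graph)
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
        if i % 50 == 0:
            # evaluate the merged summary and write it at step i
            result = sess.run(merged, feed_dict={xs: x_data, ys: y_data})
            writer.add_summary(result, i)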
Expected Results: (TensorBoard screenshot)
DQN
Idea
- Use features to predict the Q_value for different actions.
- Apply ANN to do the prediction.
Property
- Experience Replay
- Fixed Q-target
Initialization
def __init__(
        self,
        n_actions,
        n_features,
        learning_rate=0.01,
        reward_decay=0.9,
        e_greedy=0.9,
        replace_target_iter=300,
        memory_size=500,
        batch_size=32,
        e_greedy_increment=None,
        output_graph=False,
):
    self.n_actions = n_actions
    self.n_features = n_features
    self.lr = learning_rate
    self.gamma = reward_decay
    self.epsilon_max = e_greedy
    self.replace_target_iter = replace_target_iter  # steps between target-network replacements
    self.memory_size = memory_size  # size of the Experience Replay memory
    self.batch_size = batch_size  # batch size for SGD
    self.epsilon_increment = e_greedy_increment  # gradually shrink the exploration range
    self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

    # total learning steps
    self.learn_step_counter = 0

    # initialize zero memory [s, a, r, s_]
    self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

    # consists of [target_net, evaluate_net]
    self._build_net()

    # replace target parameters with evaluate parameters
    t_params = tf.get_collection('target_net_params')
    e_params = tf.get_collection('eval_net_params')
    self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

    self.sess = tf.Session()

    if output_graph:
        # $ tensorboard --logdir=logs
        # tf.train.SummaryWriter will soon be deprecated; use the following
        tf.summary.FileWriter("logs/", self.sess.graph)

    self.sess.run(tf.global_variables_initializer())
    # cost_his stores the loss at every step
    self.cost_his = []
Build Neural Network
evaluate network
# ------------------ build evaluate_net ------------------
self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
with tf.variable_scope('eval_net'):
    # c_names (collections_names) are the collections to store variables
    c_names, n_l1, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

    # first layer. collections is used later when assigning to the target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

    # second layer. collections is used later when assigning to the target net
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_eval = tf.matmul(l1, w2) + b2

with tf.variable_scope('loss'):
    self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
with tf.variable_scope('train'):
    self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)
target network
# ------------------ build target_net ------------------
self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')  # input
with tf.variable_scope('target_net'):
    # c_names (collections_names) are the collections to store variables
    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

    # first layer. collections is used later when assigning to the target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

    # second layer. collections is used later when assigning to the target net
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_next = tf.matmul(l1, w2) + b2
Store transition
# store memory
def store_transition(self, s, a, r, s_):
    # check whether memory_counter exists on the object
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0
    transition = np.hstack((s, [a, r], s_))
    # replace the oldest memory with the new memory
    index = self.memory_counter % self.memory_size
    self.memory[index, :] = transition
    self.memory_counter += 1
Choose action
def choose_action(self, observation):
    # add a batch dimension before feeding into the tf placeholder
    observation = observation[np.newaxis, :]
    if np.random.uniform() < self.epsilon:
        # forward-feed the observation and get the q value for every action
        actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
        action = np.argmax(actions_value)
    else:
        action = np.random.randint(0, self.n_actions)
    return action
Learn
def learn(self):
    # check whether to replace the target parameters
    if self.learn_step_counter % self.replace_target_iter == 0:
        self.sess.run(self.replace_target_op)
        print('\ntarget_params_replaced\n')

    # sample a batch of memory from all memory
    if self.memory_counter > self.memory_size:
        sample_index = np.random.choice(self.memory_size, size=self.batch_size)
    else:
        sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
    batch_memory = self.memory[sample_index, :]

    q_next, q_eval = self.sess.run(
        [self.q_next, self.q_eval],
        feed_dict={
            self.s_: batch_memory[:, -self.n_features:],  # fixed params
            self.s: batch_memory[:, :self.n_features],  # newest params
        })

    # change q_target w.r.t. q_eval's action
    q_target = q_eval.copy()

    batch_index = np.arange(self.batch_size, dtype=np.int32)
    eval_act_index = batch_memory[:, self.n_features].astype(int)
    reward = batch_memory[:, self.n_features + 1]

    q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

    """
    For example, in this batch I have 2 samples and 3 actions:
    q_eval =
    [[1, 2, 3],
     [4, 5, 6]]
    q_target = q_eval =
    [[1, 2, 3],
     [4, 5, 6]]
    Then change q_target to the real q_target value w.r.t. the q_eval's action.
    For example:
        in sample 0, I took action 0, and the target q value is -1;
        in sample 1, I took action 2, and the target q value is -2:
    q_target =
    [[-1, 2, 3],
     [4, 5, -2]]
    So (q_target - q_eval) becomes:
    [[(-1)-(1), 0, 0],
     [0, 0, (-2)-(6)]]
    We then backpropagate this error w.r.t. the corresponding action to the network,
    leaving the other actions with error = 0 because we didn't choose them.
    """

    # train the eval network
    _, self.cost = self.sess.run([self._train_op, self.loss],
                                 feed_dict={self.s: batch_memory[:, :self.n_features],
                                            self.q_target: q_target})
    self.cost_his.append(self.cost)

    # increase epsilon
    self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
    self.learn_step_counter += 1
Plot loss
def plot_cost(self):
    import matplotlib.pyplot as plt
    plt.plot(np.arange(len(self.cost_his)), self.cost_his)
    plt.ylabel('Cost')
    plt.xlabel('training steps')
    plt.show()
Double DQN
Problem of the original DQN
The original DQN has the Q-target as
Y_t^{DQN} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)
The neural network is updated toward a Q-target built from the maximum estimated action value; since the max operator also picks up estimation noise, this leads to the overestimation problem.
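To see why taking the max overestimates, here is a small numpy sketch (the setup is an illustration, not part of the original notes): every action's true value is 0, yet the max over noisy estimates is positive on average.

import numpy as np

rng = np.random.RandomState(0)
n_actions, n_trials = 5, 10000

# true Q value of every action is 0; estimates carry zero-mean noise
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print(noisy_q.mean())              # ~0: each action's estimate is unbiased
print(noisy_q.max(axis=1).mean())  # ~1.16 > 0: the max over estimates is biased upward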
Idea: Double DQN
There are two neural networks: an evaluation network (Eval-Net) and a target network (Target-Net). In DQN, we use the Eval-Net to evaluate the Q value of the current state-action pair, while using the Target-Net to evaluate the Q values of the next state over all actions.
In Double DQN, we additionally use the Eval-Net to select the action used in the Q-target.
Y_t^{DoubleDQN} = R_{t+1} + \gamma\, Q(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t^-)
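In terms of the batch arrays used in learn() above, only the target computation changes. A minimal numpy sketch, where q_next comes from the Target-Net on s_ and q_eval_next is a hypothetical second forward pass of the Eval-Net on the same s_:

import numpy as np

# DQN target: max over the Target-Net's own estimates
#   q_target = reward + gamma * np.max(q_next, axis=1)

# Double DQN target: Eval-Net selects the action, Target-Net evaluates it
# (q_eval_next would come from running self.q_eval with s_ fed to self.s)
def double_dqn_target(reward, gamma, q_next, q_eval_next):
    batch_index = np.arange(len(reward), dtype=np.int32)
    max_act_next = np.argmax(q_eval_next, axis=1)  # action chosen by the Eval-Net
    return reward + gamma * q_next[batch_index, max_act_next]  # value from the Target-Net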