Reinforcement Learning Tips

Overview

This page summarizes my notes on Reinforcement Learning. It will be updated frequently.

Differences from supervised ML

  1. RL has no explicit labels
  2. data in RL is not i.i.d.
  3. RL naturally models the sequential nature of the problem

Q-learning

Background: TD(0) method

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Here we define the TD error (loss):

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

We call this the Temporal-Difference (one-step) method, or TD(0), because it updates the value function of time step t immediately at time step t+1.
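
A minimal tabular sketch of this update (the state indexing and the example transition below are hypothetical):

import numpy as np

# hypothetical tabular setting: states are integers 0 .. n_states-1
n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)

def td0_update(s, r, s_next):
    # delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
    td_error = r + gamma * V[s_next] - V[s]
    # V(S_t) <- V(S_t) + alpha * delta_t
    V[s] += alpha * td_error
    return td_error

# example transition: from state 0, receive reward 1.0, land in state 1
td0_update(0, 1.0, 1)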

Description: Off-Policy Q-learning

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

Again we define the target as

$$R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$$

and the loss as

$$R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)$$

Here $\alpha$ is the learning rate and $\gamma$ is the reward decay.

It is called an off-policy method because the behavior policy is an $\epsilon$-soft policy while the target policy is the greedy policy.
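
A minimal tabular sketch of off-policy Q-learning (the table sizes and hyperparameters below are hypothetical):

import numpy as np

# hypothetical tabular setting: a Q-table of shape (n_states, n_actions)
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def choose_action(s):
    # behavior policy: epsilon-soft (explore with probability epsilon)
    if np.random.uniform() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(s, a, r, s_next):
    # target policy: greedy, hence the max over next actions (off-policy)
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])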

OpenAI Gym

Import Libraries

import gym

Environment

  1. Get the environment

    env = gym.make('MountainCar-v0')
    
  2. Get the basic info of the environment

    env.action_space: the action space (env.action_space.n gives the number of available actions)
    env.observation_space: the observation space (its shape gives the number of observations in one state)
    env.observation_space.high/low: the highest/lowest value of each observation
    
  3. Functions (a minimal loop combining these calls is sketched after this list)

  • Get the initial state from the environment

    observation = env.reset()
    
  • Refresh the environment (shown in the window)

    env.render()
    
  • Get the next state given the current action

    observation_, reward, done, info = env.step(action)
    
  • Close the environment

    env.close()
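
Putting these calls together, a minimal random-agent loop (a sketch assuming the classic gym API, where step() returns four values) looks like:

import gym

env = gym.make('MountainCar-v0')

for episode in range(3):
    observation = env.reset()
    done = False
    while not done:
        env.render()
        action = env.action_space.sample()   # random action, just to exercise the API
        observation_, reward, done, info = env.step(action)
        observation = observation_

env.close()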
    

Example

import gym
from RL_brain import DeepQNetwork

env = gym.make('CartPole-v0')
env = env.unwrapped

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

RL = DeepQNetwork(n_actions=env.action_space.n,
                  n_features=env.observation_space.shape[0],
                  learning_rate=0.01, e_greedy=0.9,
                  replace_target_iter=100, memory_size=2000,
                  e_greedy_increment=0.001,)

total_steps = 0


for i_episode in range(50):

    observation = env.reset()
    ep_r = 0
    while True:
        env.render()

        action = RL.choose_action(observation)

        observation_, reward, done, info = env.step(action)

        # the smaller theta and closer to center the better
        x, x_dot, theta, theta_dot = observation_
        r1 = (env.x_threshold - abs(x))/env.x_threshold - 0.8
        r2 = (env.theta_threshold_radians - abs(theta))/env.theta_threshold_radians - 0.5
        reward = r1 + r2

        RL.store_transition(observation, action, reward, observation_)

        ep_r += reward
        if total_steps > 1000:
            RL.learn()

        if done:
            print('episode: ', i_episode,
                  'ep_r: ', round(ep_r, 2),
                  ' epsilon: ', round(RL.epsilon, 2))
            break

        observation = observation_
        total_steps += 1

RL.plot_cost()

Tensorflow

Import Libraries

	import tensorflow as tf
	import numpy as np

Session

Key: tf.Session() & sess.run(Variable_Name)

import tensorflow as tf

matrix1 = tf.constant([[4,3]])  # 1 by 2 matrix
matrix2 = tf.constant([[2],[5]])    # 2 by 1 matrix

product = tf.matmul(matrix1, matrix2)    # matrix multiply np.dot(m1, m2)

# method 1
sess = tf.Session()
result = sess.run(product)  # product is evaluated only when it is run by the session
print(result)
sess.close()

# method 2
with tf.Session() as sess:
    result2 = sess.run(product)
    print(result2)

# no need to call close() when using a with block

Output:

[[23]]
[[23]]

Variable

Variables must be defined and initialized before using them.

Define Variable

tf.get_variable() is recommended over tf.Variable().

  1. tf.Variable

    # define a variable and assign its value and name
    w = tf.Variable(<initial-value>, name=<optional-name>)
    

    Example:

    import tensorflow as tf
    
    state = tf.Variable(0, name = 'counter')
    # print(state.name)
    
    # define a constant and assign its value
    one = tf.constant(1)
    
    # add two variables or constants
    new_value = tf.add(state, one)
    
    # assign one variable to another variable
    update = tf.assign(state, new_value)    # state = new_value
    
    # Initialize all variables (required whenever variables are defined)
    init = tf.global_variables_initializer()
    
    # perform operations on variables
    with tf.Session() as sess:
        sess.run(init)  # activate the initialization
        for _ in range(3):
            sess.run(update)    # perform update
            print(sess.run(state))  # print the value of state
    

    Output:

    1
    2
    3
    
  2. tf.get_variable()
    Typically used within tf.variable_scope()

    w = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None, 
    				regularizer=None, trainable=True, collections=None)
    

    Example:

    with tf.variable_scope('eval_net'):
            # c_names(collections_names) are the collections to store variables
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers
    
            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)
    
  3. collection

    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
    ...
    t_params = tf.get_collection('target_net_params')
    

Reuse variable

with tf.variable_scope("one"):
    a = tf.get_variable("v", [1]) #a.name == "one/v:0"
with tf.variable_scope("one", reuse = True): 
    c = tf.get_variable("v", [1]) #c.name == "one/v:0"
assert a==c #Assertion is true, they refer to the same object.

Placeholder

A placeholder is used to feed different values into one variable. Define it first and then feed a dict (feed_dict) when the graph is run by the session.

import tensorflow as tf

# define a placeholder, must define the type (usually use float32)
input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)

output = tf.multiply(input1, input2)

# feed a dict to placeholder
with tf.Session() as sess:
    print(sess.run(output, feed_dict={input1: [7.], input2: [2.]}))

Activation Function

An activation function adds nonlinearity to the output of a layer. For CNNs (Convolutional Neural Networks), relu is commonly applied; for RNNs (Recurrent Neural Networks), relu or tanh is commonly used.
List of activation functions offered by Tensorflow (a small usage sketch follows the list):

  • tf.nn.relu
  • tf.nn.relu6
  • tf.nn.crelu
  • tf.nn.elu
  • tf.nn.selu
  • tf.nn.softplus
  • tf.nn.softsign
  • tf.nn.dropout
  • tf.nn.bias_add
  • tf.sigmoid
  • tf.tanh
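
A minimal sketch of applying an activation function to a tensor (the values are arbitrary):

import tensorflow as tf

x = tf.constant([[-1.0, 0.5, 2.0]])
activated = tf.nn.relu(x)   # element-wise max(x, 0)

with tf.Session() as sess:
    print(sess.run(activated))   # [[0.  0.5 2. ]]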

Optimizer

tf.train.Optimizer_Name
List of optimizers offered by Tensorflow:

  • GradientDescentOptimizer
  • AdagradOptimizer
  • AdagradDAOptimizer
  • MomentumOptimizer
  • AdamOptimizer
  • FtrlOptimizer
  • RMSPropOptimizer (used by AlphaGo)

Structure

  1. input layer
    We do not add a separate input layer because it is simply used as the input to the hidden layer. In the code, it is a placeholder.
  2. hidden layer (Relu layer)
    The hidden layer takes the input layer as its input and applies relu (or another activation function). Its input size is the size of the input layer (the number of features of each example in the input data). Its output size is the number of nodes (neurons) in this layer.
  3. output layer (Logit layer)
    The output layer takes the hidden layer as its input and usually does not apply an activation function. Its input size is the output size of the hidden layer. Its output size is the size of the output result (the number of features of each result in the output).

Add one layer

The function add_layer has four parameters: inputs, in_size, out_size, and activation_function.

import tensorflow as tf
import numpy as np

def add_layer(inputs, in_size, out_size, activation_function=None):
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))    # in_size by out_size matrix
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)    # 1 by out_size row vector, initialized to 0.1
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

Compute loss value

In the following code, the loss value is calculated by mean-square error.

loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction),   # compute loss
                                    reduction_indices=[1]))

The following code is used to print the loss value

 print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))

Training

The following code uses the gradient descent method to optimize the loss in the training process.

train_variable = tf.train.Optimizer_Name(Learning_Rate).minimize(loss_var)

Example:

#### define the training variable ####
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss) # train with learning rate 0.1

#### run the training process ####
sess.run(train_step, feed_dict={xs: x_data, ys: y_data})

Example 1: Nonlinear Structure

import tensorflow as tf
import numpy as np

# make up some real data
x_data = np.linspace(-1,1,300)[:, np.newaxis]    # 300 rows, 1 column (one feature)
noise = np.random.normal(0, 0.05, x_data.shape) # mean 0, standard deviation 0.05
y_data = np.square(x_data) - 0.5 + noise

# define placeholder for inputs to networks
xs = tf.placeholder(tf.float32, [None, 1])  # None means unfixed rows
ys = tf.placeholder(tf.float32, [None, 1])  # None means unfixed rows

# add hidden layer (uses the add_layer function defined above)
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
# add output layer
prediction = add_layer(l1, 10, 1, activation_function=None)

# the error between prediction and real data
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction),   # compute loss
                                    reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss) # train with learning rate 0.1

# initialize all variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        # training
        sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
        if i % 50 == 0:
            # to see the step improvement
            print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
            

Expected Results:

0.39016193
0.009708067
0.008203865
0.0074426793
0.0069677234
0.006761972
0.006046914
0.0058904393
0.005795052
0.0056967423
0.005605927
0.005513572
0.0054187346
0.0053218375
0.005224105
0.0051228567
0.0050213463
0.004917563
0.004811827
0.0047068885

Example 2: linear structure

  1. create data
    x_data = np.random.rand(100).astype(np.float32)
    y_data = x_data * 0.1 + 0.3
    
  2. create tensorflow structure
    Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))    # one dimension between -1.0 and 1.0
    biases = tf.Variable(tf.zeros([1]))  # one dimension, all zero
    
    y = Weights*x_data + biases
    
    loss = tf.reduce_mean(tf.square(y-y_data))
    optimizer = tf.train.GradientDescentOptimizer(0.5)  # learning rate <1
    train = optimizer.minimize(loss)
    
    init = tf.global_variables_initializer()
    
  3. create session and run it
    sess = tf.Session()
    sess.run(init)  # Activate the network, Very important!!
    
    for step in range(201):
    	sess.run(train)
    	if step % 20 == 0:
    		print(step, sess.run(Weights), sess.run(biases))
    
    
  4. Expected Results
    0 [-0.25551698] [0.68949234]
    20 [-0.02096055] [0.3655322]
    40 [0.06659328] [0.3180986]
    60 [0.09077379] [0.30499846]
    80 [0.09745194] [0.30138046]
    100 [0.09929628] [0.30038127]
    120 [0.09980565] [0.3001053]
    140 [0.09994634] [0.3000291]
    160 [0.09998519] [0.30000803]
    180 [0.09999589] [0.30000225]
    200 [0.09999888] [0.3000006]
    

Some useful function

Random

tf.set_random_seed(seed): set the graph-level random seed so that random operations produce reproducible values
tf.random_normal([row_size, column_size]): generate a normally distributed random matrix

Initializer

tf.random_normal_initializer(mean, stddev): create a random normal initializer, which can be passed to the 'initializer' parameter of tf.get_variable.
tf.constant_initializer(num): create a constant-valued initializer, which can be passed to the 'initializer' parameter of tf.get_variable.

Matrix Computation

tf.square(Mat): compute the element-wise square of a matrix
tf.reduce_sum(Mat): compute the sum of all elements of Mat
tf.reduce_sum(Mat, reduction_indices=[1]): compute the sum of each row
tf.reduce_sum(Mat, reduction_indices=[0]): compute the sum of each column
tf.matmul(MatA, MatB): matrix multiplication A*B
tf.multiply(MatA, MatB): element-wise multiplication of A and B
tf.reduce_mean(Array): compute the mean value of an array
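
A quick sketch checking the two reduce_sum variants (the matrix values are arbitrary):

import tensorflow as tf

m = tf.constant([[1., 2.], [3., 4.]])

with tf.Session() as sess:
    print(sess.run(tf.reduce_sum(m)))                          # 10.0
    print(sess.run(tf.reduce_sum(m, reduction_indices=[1])))   # [3. 7.]
    print(sess.run(tf.reduce_sum(m, reduction_indices=[0])))   # [4. 6.]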

Example: Compute loss value

loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction),   # compute loss
                                    reduction_indices=[1]))

Value Assignment

  1. tf.assign(t, e): assign the value of e to t (see the sketch after this list)
    self.replace_target_op = [tf.assign(t, e) 
    							for t, e in zip(t_params, e_params)]
    
  2. tf.zeros([row_size, column_size]): generate all-zero matrix
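
A minimal sketch of tf.assign copying one variable's value into another (the variables here are hypothetical):

import tensorflow as tf

a = tf.Variable(0)
b = tf.Variable(5)
assign_op = tf.assign(a, b)   # copies the current value of b into a when run

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(assign_op)
    print(sess.run(a))   # 5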

Tensorboard

open tensorboard

First run the following command:

# using terminal (in pycharm)
tensorboard --logdir logs

Then open the printed link (e.g. http://0.0.0.0:6006) in a browser.

generate logs

# must be placed after sess = tf.Session()
writer = tf.summary.FileWriter('logs/', sess.graph)

name the graph, layers, and objects

Example 1

with tf.name_scope('layer'):
    with tf.name_scope('weights'):
        Weights = tf.Variable(tf.random_normal([in_size, out_size]), name='W')    # in_size by out_size Matrix
    with tf.name_scope('biases'):
        biases = tf.Variable(tf.zeros([1, out_size]), name='b')   # one-dimensional vector
    with tf.name_scope('Wx_plus_b'):
        Wx_plus_b = tf.matmul(inputs, Weights) + biases

Example 2

with tf.variable_scope('eval_net'):
	...
	with tf.variable_scope('l1'):
		w1 = tf.get_variable('w1', ...)
		b1 = tf.get_variable('b1', ...)
		l1 = ...
	with tf.variable_scope('l2'):
		w2 = tf.get_variable('w2', ...)
		b2 = tf.get_variable('b2', ...)
		l2 = ...
with tf.variable_scope('loss'):
	self.loss = ...
with tf.variable_scope('train'):
	self.train = ...
 

	

add different types of summary

  1. Scalar, for example, loss
    tf.summary.scalar('loss', loss)
    
  2. Histogram, for example, weights and biases
    tf.summary.histogram(layer_name+'/weights', Weights)
    

merge different summary

  1. define the merged summary
    merged = tf.summary.merge_all()
    
  2. run the merged summary
    result = sess.run(merged, feed_dict={xs:x_data, ys:y_data})	
    
  3. add the merged summary to the file writer (the three steps are combined in the sketch after this list)
    writer.add_summary(result, i)
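
Putting the three steps together, a minimal sketch (assuming xs, ys, x_data, y_data, loss, and train_step from the earlier regression example):

tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs/', sess.graph)
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
        if i % 50 == 0:
            result = sess.run(merged, feed_dict={xs: x_data, ys: y_data})
            writer.add_summary(result, i)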
    

Expected Results

(Screenshots of the TensorBoard graph and summary panels.)

DQN

Idea

  1. Use the state features to predict the Q-value of each action.
  2. Use an ANN (artificial neural network) to do the prediction.

Property

  1. Experience Replay
  2. Fixed Q-target

Initialization

def __init__(
        self,
        n_actions,
        n_features,
        learning_rate=0.01,
        reward_decay=0.9,
        e_greedy=0.9,
        replace_target_iter=300,
        memory_size=500,
        batch_size=32,
        e_greedy_increment=None,
        output_graph=False,
):
    self.n_actions = n_actions
    self.n_features = n_features
    self.lr = learning_rate
    self.gamma = reward_decay
    self.epsilon_max = e_greedy
    self.replace_target_iter = replace_target_iter  # number of steps between target network parameter replacements
    self.memory_size = memory_size  # size of memory in Experience Replay
    self.batch_size = batch_size    # batch SGD
    self.epsilon_increment = e_greedy_increment # gradually increase epsilon to reduce exploration
    self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

    # total learning step
    self.learn_step_counter = 0

    # initialize zero memory [s, a, r, s_]
    self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

    # consist of [target_net, evaluate_net]
    self._build_net()

    # replace target parameter with evaluate parameter
    t_params = tf.get_collection('target_net_params')
    e_params = tf.get_collection('eval_net_params')
    self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

    self.sess = tf.Session()

    if output_graph:
        # $ tensorboard --logdir=logs
        # tf.train.SummaryWriter will soon be deprecated; use the following instead
        tf.summary.FileWriter("logs/", self.sess.graph)

    self.sess.run(tf.global_variables_initializer())

    # cost_his stores loss in every step
    self.cost_his = []

Build Neural Network

evaluate network

# ------------------ build evaluate_net ------------------
self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
with tf.variable_scope('eval_net'):
    # c_names(collections_names) are the collections to store variables
    c_names, n_l1, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

    # first layer. collections is used later when assign to target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

    # second layer. collections is used later when assign to target net
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_eval = tf.matmul(l1, w2) + b2

with tf.variable_scope('loss'):
    self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
with tf.variable_scope('train'):
    self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

target network

# ------------------ build target_net ------------------
self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input
with tf.variable_scope('target_net'):
    # c_names(collections_names) are the collections to store variables
    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

    # first layer. collections is used later when assign to target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

    # second layer. collections is used later when assign to target net
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_next = tf.matmul(l1, w2) + b2

Store transition

# store memory
def store_transition(self, s, a, r, s_):
    # check if memory_counter exists in the object
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0

    transition = np.hstack((s, [a, r], s_))

    # replace the old memory with new memory
    index = self.memory_counter % self.memory_size
    self.memory[index, :] = transition

    self.memory_counter += 1

Choose action

def choose_action(self, observation):
    # add a batch dimension before feeding into the tf placeholder
    observation = observation[np.newaxis, :]

    if np.random.uniform() < self.epsilon:
        # forward pass the observation to get the q value of every action
        actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
        action = np.argmax(actions_value)
    else:
        action = np.random.randint(0, self.n_actions)
    return action

Learn

def learn(self):
    # check to replace target parameters
    if self.learn_step_counter % self.replace_target_iter == 0:
        self.sess.run(self.replace_target_op)
        print('\ntarget_params_replaced\n')

    # sample batch memory from all memory
    if self.memory_counter > self.memory_size:
        sample_index = np.random.choice(self.memory_size, size=self.batch_size)
    else:
        sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
    batch_memory = self.memory[sample_index, :]

    q_next, q_eval = self.sess.run(
        [self.q_next, self.q_eval],
        feed_dict={
            self.s_: batch_memory[:, -self.n_features:],  # fixed params
            self.s: batch_memory[:, :self.n_features],  # newest params
        })

    # change q_target w.r.t q_eval's action
    q_target = q_eval.copy()

    batch_index = np.arange(self.batch_size, dtype=np.int32)
    eval_act_index = batch_memory[:, self.n_features].astype(int)
    reward = batch_memory[:, self.n_features + 1]

    q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

    """
    For example in this batch I have 2 samples and 3 actions:
    q_eval =
    [[1, 2, 3],
     [4, 5, 6]]

    q_target = q_eval =
    [[1, 2, 3],
     [4, 5, 6]]

    Then change q_target with the real q_target value w.r.t the q_eval's action.
    For example in:
        sample 0, I took action 0, and the max q_target value is -1;
        sample 1, I took action 2, and the max q_target value is -2:
    q_target =
    [[-1, 2, 3],
     [4, 5, -2]]

    So the (q_target - q_eval) becomes:
    [[(-1)-(1), 0, 0],
     [0, 0, (-2)-(6)]]

    We then backpropagate this error w.r.t the corresponding action to the network,
    and leave the other actions with error = 0 because we did not choose them.
    """

    # train eval network
    _, self.cost = self.sess.run([self._train_op, self.loss],
                                 feed_dict={self.s: batch_memory[:, :self.n_features],
                                            self.q_target: q_target})
    self.cost_his.append(self.cost)

    # increasing epsilon
    self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
    self.learn_step_counter += 1

Plot loss

def plot_cost(self):
    import matplotlib.pyplot as plt
    plt.plot(np.arange(len(self.cost_his)), self.cost_his)
    plt.ylabel('Cost')
    plt.xlabel('training steps')
    plt.show()

Double DQN

Problem of the original DQN

The original DQN has the Q-target as
$$Y_t^{DQN} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$
The neural network is trained towards a Q-target that takes the maximum over estimated action values, which leads to the overestimation issue.

Idea: Double DQN

There are two neural networks: evaluation network (Eval-Net) and target network (Target-Net). In DQN, we use Eval-Net to evaluate the Q value of the current state-action pair, while using Target-Net to evaluate the Q value of next state with all actions.

In Double DQN, we additionally use the Eval-Net to select the action for the Q-target, while the Target-Net evaluates its value:

$$Y_t^{DoubleDQN} = R_{t+1} + \gamma Q(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t); \theta_t^-)$$
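
A minimal numpy sketch of how the batched target computation in learn() above could be adapted (the function name and the extra q_eval_next argument, an additional forward pass of the eval net on the next states, are hypothetical):

import numpy as np

def double_dqn_targets(q_eval, q_next, q_eval_next, actions, rewards, gamma):
    # q_eval:      eval net Q-values for the current states, shape (batch, n_actions)
    # q_next:      target net Q-values for the next states,  shape (batch, n_actions)
    # q_eval_next: eval net Q-values for the next states,    shape (batch, n_actions)
    batch_index = np.arange(q_eval.shape[0], dtype=np.int32)
    q_target = q_eval.copy()

    # select the next action with the eval net, evaluate it with the target net
    max_act_next = np.argmax(q_eval_next, axis=1)
    selected_q_next = q_next[batch_index, max_act_next]

    q_target[batch_index, actions] = rewards + gamma * selected_q_next
    return q_target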
