Reinforcement Learning Tips

Overview

This page summarizes my notes on Reinforcement Learning. It will be updated frequently.

Differences from supervised ML

  1. RL has no explicit labels
  2. data in RL is not i.i.d.
  3. RL naturally models the sequential nature of the problem

Q-learning

Background: TD(0) method

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Here we define the TD error (loss):

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

We call this the Temporal-Difference (one-step) method, or TD(0), because it updates the value function of time step t immediately at time step t+1.
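
A minimal tabular sketch of this update (the state indexing and the example transition below are hypothetical):

import numpy as np

# hypothetical tabular setting: states are integers 0 .. n_states-1
n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states)

def td0_update(s, r, s_next):
    # delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
    td_error = r + gamma * V[s_next] - V[s]
    # V(S_t) <- V(S_t) + alpha * delta_t
    V[s] += alpha * td_error
    return td_error

# example transition: from state 0, receive reward 1.0, land in state 1
td0_update(0, 1.0, 1)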

Description: Off-Policy Q-learning

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

Again we define the target as

$$R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$$

and the loss as

$$R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)$$

Here $\alpha$ is the learning rate and $\gamma$ is the reward decay.

It is called an off-policy method because the behavior policy is an $\epsilon$-soft policy while the target policy is the greedy policy.
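
A minimal tabular sketch of off-policy Q-learning (the table sizes and hyperparameters below are hypothetical):

import numpy as np

# hypothetical tabular setting: a Q-table of shape (n_states, n_actions)
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

def choose_action(s):
    # behavior policy: epsilon-soft (explore with probability epsilon)
    if np.random.uniform() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(s, a, r, s_next):
    # target policy: greedy, hence the max over next actions (off-policy)
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])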

OpenAI Gym

Import Libraries

import gym

Environment

  1. Get the environment

    env = gym.make('MountainCar-v0')
    
  2. Get the basic info of the environment

    env.action_space: the action space (env.action_space.n gives the number of available actions)
    env.observation_space: the observation space (its shape gives the number of observations in one state)
    env.observation_space.high/low: the highest/lowest value of each observation
    
  3. Functions (a minimal loop combining these calls is sketched after this list)

  • Get the initial state from the environment

    observation = env.reset()
    
  • Refresh the environment (shown in the window)

    env.render()
    
  • Get the next state given the current action

    observation_, reward, done, info = env.step(action)
    
  • Close the environment

    env.close()
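
Putting these calls together, a minimal random-agent loop (a sketch assuming the classic gym API, where step() returns four values) looks like:

import gym

env = gym.make('MountainCar-v0')

for episode in range(3):
    observation = env.reset()
    done = False
    while not done:
        env.render()
        action = env.action_space.sample()   # random action, just to exercise the API
        observation_, reward, done, info = env.step(action)
        observation = observation_

env.close()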
    

Example

import gym
from RL_brain import DeepQNetwork

env = gym.make('CartPole-v0')
env = env.unwrapped

print(env.action_space)
print(env.observation_space)
print(env.observation_space.high)
print(env.observation_space.low)

RL = DeepQNetwork(n_actions=env.action_space.n,
                  n_features=env.observation_space.shape[0],
                  learning_rate=0.01, e_greedy=0.9,
                  replace_target_iter=100, memory_size=2000,
                  e_greedy_increment=0.001,)

total_steps = 0


for i_episode in range(50):

    observation = env.reset()
    ep_r = 0
    while True:
        env.render()

        action = RL.choose_action(observation)

        observation_, reward, done, info = env.step(action)

        # the smaller theta and closer to center the better
        x, x_dot, theta, theta_dot = observation_
        r1 = (env.x_threshold - abs(x))/env.x_threshold - 0.8
        r2 = (env.theta_threshold_radians - abs(theta))/env.theta_threshold_radians - 0.5
        reward = r1 + r2

        RL.store_transition(observation, action, reward, observation_)

        ep_r += reward
        if total_steps > 1000:
            RL.learn()

        if done:
            print('episode: ', i_episode,
                  'ep_r: ', round(ep_r, 2),
                  ' epsilon: ', round(RL.epsilon, 2))
            break

        observation = observation_
        total_steps += 1

RL.plot_cost()

Tensorflow

Import Libraries

	import tensorflow as tf
	import numpy as np

Session

Key: tf.Session() & sess.run(Variable_Name)

import tensorflow as tf

matrix1 = tf.constant([[4,3]])  # 1 by 2 matrix
matrix2 = tf.constant([[2],[5]])    # 2 by 1 matrix

product = tf.matmul(matrix1, matrix2)    # matrix multiply np.dot(m1, m2)

# method 1
sess = tf.Session()
result = sess.run(product)  # product is evaluated only when it is run by the session
print(result)
sess.close()

# method 2
with tf.Session() as sess:
    result2 = sess.run(product)
    print(result2)

# no need to call close() when using a with block

Output:

[[23]]
[[23]]

Variable

Variables must be defined and initialized before using them.

Define Variable

tf.get_variable() is recommended over tf.Variable().

  1. tf.Variable

    # define a variable and assign its value and name
    w = tf.Variable(<initial-value>, name=<optional-name>)
    

    Example:

    import tensorflow as tf
    
    state = tf.Variable(0, name = 'counter')
    # print(state.name)
    
    # define a constant and assign its value
    one = tf.constant(1)
    
    # add two variables or constants
    new_value = tf.add(state, one)
    
    # assign one variable to another variable
    update = tf.assign(state, new_value)    # state = new_value
    
    # Initialize all variables (required whenever variables are defined)
    init = tf.global_variables_initializer()
    
    # perform operations on variables
    with tf.Session() as sess:
        sess.run(init)  # activate the initialization
        for _ in range(3):
            sess.run(update)    # perform update
            print(sess.run(state))  # print the value of state
    

    Output:

    1
    2
    3
    
  2. tf.get_variable()
    Typically used within tf.variable_scope()

    w = tf.get_variable(name, shape=None, dtype=tf.float32, initializer=None, 
    				regularizer=None, trainable=True, collections=None)
    

    Example:

    with tf.variable_scope('eval_net'):
            # c_names(collections_names) are the collections to store variables
            c_names, n_l1, w_initializer, b_initializer = \
                ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
                tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers
    
            # first layer. collections is used later when assign to target net
            with tf.variable_scope('l1'):
                w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
                b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
                l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)
    
  3. collection

    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
    ...
    t_params = tf.get_collection('target_net_params')
    

Reuse variable

with tf.variable_scope("one"):
    a = tf.get_variable("v", [1]) #a.name == "one/v:0"
with tf.variable_scope("one", reuse = True): 
    c = tf.get_variable("v", [1]) #c.name == "one/v:0"
assert a==c #Assertion is true, they refer to the same object.

Placeholder

A placeholder is used to feed different values into one variable. Define it first and then feed a dict (feed_dict) when the graph is run by the session.

import tensorflow as tf

# define a placeholder, must define the type (usually use float32)
input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)

output = tf.multiply(input1, input2)

# feed a dict to placeholder
with tf.Session() as sess:
    print(sess.run(output, feed_dict={input1: [7.], input2: [2.]}))

Activation Function

An activation function adds nonlinearity to the output of a layer. For CNNs (Convolutional Neural Networks), relu is commonly applied; for RNNs (Recurrent Neural Networks), relu or tanh is commonly used.
List of activation functions offered by Tensorflow (a small usage sketch follows the list):

  • tf.nn.relu
  • tf.nn.relu6
  • tf.nn.crelu
  • tf.nn.elu
  • tf.nn.selu
  • tf.nn.softplus
  • tf.nn.softsign
  • tf.nn.dropout
  • tf.nn.bias_add
  • tf.sigmoid
  • tf.tanh
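
A minimal sketch of applying an activation function to a tensor (the values are arbitrary):

import tensorflow as tf

x = tf.constant([[-1.0, 0.5, 2.0]])
activated = tf.nn.relu(x)   # element-wise max(x, 0)

with tf.Session() as sess:
    print(sess.run(activated))   # [[0.  0.5 2. ]]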

Optimizer

tf.train.Optimizer_Name
List of optimizers offered by Tensorflow:

  • GradientDescentOptimizer
  • AdagradOptimizer
  • AdagradDAOptimizer
  • MomentumOptimizer
  • AdamOptimizer
  • FtrlOptimizer
  • RMSPropOptimizer (used by AlphaGo)

Structure

  1. input layer
    We do not add a separate input layer because it is simply used as the input to the hidden layer. In the code, it is a placeholder.
  2. hidden layer (Relu layer)
    The hidden layer takes the input layer as its input and applies relu (or another activation function). Its input size is the size of the input layer (the number of features of each example in the input data). Its output size is the number of nodes (neurons) in this layer.
  3. output layer (Logit layer)
    The output layer takes the hidden layer as its input and usually does not apply an activation function. Its input size is the output size of the hidden layer. Its output size is the size of the output result (the number of features of each result in the output).

Add one layer

The function add_layer has four parameters: inputs, in_size, out_size, and activation_function.

import tensorflow as tf
import numpy as np

def add_layer(inputs, in_size, out_size, activation_function=None):
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))    # in_size by out_size matrix
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)    # 1 by out_size row vector, initialized to 0.1
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

Compute loss value

In the following code, the loss value is calculated by mean-square error.

loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction),   # compute loss
                                    reduction_indices=[1]))

The following code is used to print the loss value

 print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))

Training

The following code uses the gradient descent method to optimize the loss in the training process.

train_variable = tf.train.Optimizer_Name(Learning_Rate).minimize(loss_var)

Example:

#### define the training variable ####
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss) # train with learning rate 0.1

#### run the training process ####
sess.run(train_step, feed_dict={xs: x_data, ys: y_data})

Example 1: Nonlinear Structure

import tensorflow as tf
import numpy as np

# make up some real data
x_data = np.linspace(-1,1,300)[:, np.newaxis]    # 300 rows, 1 column (one feature)
noise = np.random.normal(0, 0.05, x_data.shape) # mean 0, standard deviation 0.05
y_data = np.square(x_data) - 0.5 + noise

# define placeholder for inputs to networks
xs = tf.placeholder(tf.float32, [None, 1])  # None means unfixed rows
ys = tf.placeholder(tf.float32, [None, 1])  # None means unfixed rows

# add hidden layer (uses the add_layer function defined above)
l1 = add_layer(xs, 1, 10, activation_function=tf.nn.relu)
# add output layer
prediction = add_layer(l1, 10, 1, activation_function=None)

# the error between prediction and real data
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction),   # compute loss
                                    reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss) # train with learning rate 0.1

# initialize all variables
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        # training
        sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
        if i % 50 == 0:
            # to see the step improvement
            print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
            

Expected Results:

0.39016193
0.009708067
0.008203865
0.0074426793
0.0069677234
0.006761972
0.006046914
0.0058904393
0.005795052
0.0056967423
0.005605927
0.005513572
0.0054187346
0.0053218375
0.005224105
0.0051228567
0.0050213463
0.004917563
0.004811827
0.0047068885

Example 2: linear structure

  1. create data
    x_data = np.random.rand(100).astype(np.float32)
    y_data = x_data * 0.1 + 0.3
    
  2. create tensorflow structure
    Weights = tf.Variable(tf.random_uniform([1], -1.0, 1.0))    # one dimension between -1.0 and 1.0
    biases = tf.Variable(tf.zeros([1]))  # one dimension, all zero
    
    y = Weights*x_data + biases
    
    loss = tf.reduce_mean(tf.square(y-y_data))
    optimizer = tf.train.GradientDescentOptimizer(0.5)  # learning rate <1
    train = optimizer.minimize(loss)
    
    init = tf.global_variables_initializer()
    
  3. create session and run it
    sess = tf.Session()
    sess.run(init)  # Activate the network, Very important!!
    
    for step in range(201):
    	sess.run(train)
    	if step % 20 == 0:
    		print(step, sess.run(Weights), sess.run(biases))
    
    
  4. Expected Results
    0 [-0.25551698] [0.68949234]
    20 [-0.02096055] [0.3655322]
    40 [0.06659328] [0.3180986]
    60 [0.09077379] [0.30499846]
    80 [0.09745194] [0.30138046]
    100 [0.09929628] [0.30038127]
    120 [0.09980565] [0.3001053]
    140 [0.09994634] [0.3000291]
    160 [0.09998519] [0.30000803]
    180 [0.09999589] [0.30000225]
    200 [0.09999888] [0.3000006]
    

Some useful function

Random

tf.set_random_seed(seed): set the graph-level random seed so that random operations produce reproducible values
tf.random_normal([row_size, column_size]): generate a normally distributed random matrix

Initializer

tf.random_normal_initializer(mean, stddev): create a random normal initializer, which can be passed to the 'initializer' parameter of tf.get_variable.
tf.constant_initializer(num): create a constant-valued initializer, which can be passed to the 'initializer' parameter of tf.get_variable.

Matrix Computation

tf.square(Mat): compute the element-wise square of a matrix
tf.reduce_sum(Mat): compute the sum of all elements of Mat
tf.reduce_sum(Mat, reduction_indices=[1]): compute the sum of each row
tf.reduce_sum(Mat, reduction_indices=[0]): compute the sum of each column
tf.matmul(MatA, MatB): matrix multiplication A*B
tf.multiply(MatA, MatB): element-wise multiplication of A and B
tf.reduce_mean(Array): compute the mean value of an array
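
A quick sketch checking the two reduce_sum variants (the matrix values are arbitrary):

import tensorflow as tf

m = tf.constant([[1., 2.], [3., 4.]])

with tf.Session() as sess:
    print(sess.run(tf.reduce_sum(m)))                          # 10.0
    print(sess.run(tf.reduce_sum(m, reduction_indices=[1])))   # [3. 7.]
    print(sess.run(tf.reduce_sum(m, reduction_indices=[0])))   # [4. 6.]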

Example: Compute loss value

loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction),   # compute loss
                                    reduction_indices=[1]))

Value Assignment

  1. tf.assign(t, e): assign the value of e to t (see the sketch after this list)
    self.replace_target_op = [tf.assign(t, e) 
    							for t, e in zip(t_params, e_params)]
    
  2. tf.zeros([row_size, column_size]): generate all-zero matrix
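
A minimal sketch of tf.assign copying one variable's value into another (the variables here are hypothetical):

import tensorflow as tf

a = tf.Variable(0)
b = tf.Variable(5)
assign_op = tf.assign(a, b)   # copies the current value of b into a when run

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(assign_op)
    print(sess.run(a))   # 5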

Tensorboard

open tensorboard

First run the following command:

# using terminal (in pycharm)
tensorboard --logdir logs

Then open the printed link (e.g. http://0.0.0.0:6006) in a browser.

generate logs

# must be placed after sess = tf.Session()
writer = tf.summary.FileWriter('logs/', sess.graph)

name the graph, layers, and objects

Example 1

with tf.name_scope('layer'):
    with tf.name_scope('weights'):
        Weights = tf.Variable(tf.random_normal([in_size, out_size]), name='W')    # in_size by out_size Matrix
    with tf.name_scope('biases'):
        biases = tf.Variable(tf.zeros([1, out_size]), name='b')   # one-dimensional vector
    with tf.name_scope('Wx_plus_b'):
        Wx_plus_b = tf.matmul(inputs, Weights) + biases

Example 2

with tf.variable_scope('eval_net'):
	...
	with tf.variable_scope('l1'):
		w1 = tf.get_variable('w1', ...)
		b1 = tf.get_variable('b1', ...)
		l1 = ...
	with tf.variable_scope('l2'):
		w2 = tf.get_variable('w2', ...)
		b2 = tf.get_variable('b2', ...)
		l2 = ...
with tf.variable_scope('loss'):
	self.loss = ...
with tf.variable_scope('train'):
	self.train = ...
 

	

add different types of summary

  1. Scalar, for example, loss
    tf.summary.scalar('loss', loss)
    
  2. Histogram, for example, weights and biases
    tf.summary.histogram(layer_name+'/weights', Weights)
    

merge different summary

  1. define the merged summary
    merged = tf.summary.merge_all()
    
  2. run the merged summary
    result = sess.run(merged, feed_dict={xs:x_data, ys:y_data})	
    
  3. add the merged summary to the file writer (the three steps are combined in the sketch after this list)
    writer.add_summary(result, i)
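
Putting the three steps together, a minimal sketch (assuming xs, ys, x_data, y_data, loss, and train_step from the earlier regression example):

tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('logs/', sess.graph)
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
        if i % 50 == 0:
            result = sess.run(merged, feed_dict={xs: x_data, ys: y_data})
            writer.add_summary(result, i)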
    

Expected Results

(Screenshots of the TensorBoard graph and summary panels.)

DQN

Idea

  1. Use the state features to predict the Q-value of each action.
  2. Use an ANN (artificial neural network) to do the prediction.

Property

  1. Experience Replay
  2. Fixed Q-target

Initialization

def __init__(
        self,
        n_actions,
        n_features,
        learning_rate=0.01,
        reward_decay=0.9,
        e_greedy=0.9,
        replace_target_iter=300,
        memory_size=500,
        batch_size=32,
        e_greedy_increment=None,
        output_graph=False,
):
    self.n_actions = n_actions
    self.n_features = n_features
    self.lr = learning_rate
    self.gamma = reward_decay
    self.epsilon_max = e_greedy
    self.replace_target_iter = replace_target_iter  # number of steps between target network parameter replacements
    self.memory_size = memory_size  # size of memory in Experience Replay
    self.batch_size = batch_size    # batch SGD
    self.epsilon_increment = e_greedy_increment # gradually increase epsilon to reduce exploration
    self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max

    # total learning step
    self.learn_step_counter = 0

    # initialize zero memory [s, a, r, s_]
    self.memory = np.zeros((self.memory_size, n_features * 2 + 2))

    # consist of [target_net, evaluate_net]
    self._build_net()

    # replace target parameter with evaluate parameter
    t_params = tf.get_collection('target_net_params')
    e_params = tf.get_collection('eval_net_params')
    self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

    self.sess = tf.Session()

    if output_graph:
        # $ tensorboard --logdir=logs
        # tf.train.SummaryWriter will soon be deprecated; use the following instead
        tf.summary.FileWriter("logs/", self.sess.graph)

    self.sess.run(tf.global_variables_initializer())

    # cost_his stores loss in every step
    self.cost_his = []

Build Neural Network

evaluate network

# ------------------ build evaluate_net ------------------
self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')  # input
self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # for calculating loss
with tf.variable_scope('eval_net'):
    # c_names(collections_names) are the collections to store variables
    c_names, n_l1, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 10, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

    # first layer. collections is used later when assign to target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s, w1) + b1)

    # second layer. collections is used later when assign to target net
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_eval = tf.matmul(l1, w2) + b2

with tf.variable_scope('loss'):
    self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
with tf.variable_scope('train'):
    self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

target network

# ------------------ build target_net ------------------
self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')    # input
with tf.variable_scope('target_net'):
    # c_names(collections_names) are the collections to store variables
    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]

    # first layer. collections is used later when assign to target net
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(self.s_, w1) + b1)

    # second layer. collections is used later when assign to target net
    with tf.variable_scope('l2'):
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        self.q_next = tf.matmul(l1, w2) + b2

Store transition

# store memory
def store_transition(self, s, a, r, s_):
    # check if memory_counter exists in the object
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0

    transition = np.hstack((s, [a, r], s_))

    # replace the old memory with new memory
    index = self.memory_counter % self.memory_size
    self.memory[index, :] = transition

    self.memory_counter += 1

Choose action

def choose_action(self, observation):
    # add a batch dimension before feeding into the tf placeholder
    observation = observation[np.newaxis, :]

    if np.random.uniform() < self.epsilon:
        # forward pass the observation to get the q value of every action
        actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
        action = np.argmax(actions_value)
    else:
        action = np.random.randint(0, self.n_actions)
    return action

Learn

def learn(self):
    # check to replace target parameters
    if self.learn_step_counter % self.replace_target_iter == 0:
        self.sess.run(self.replace_target_op)
        print('\ntarget_params_replaced\n')

    # sample batch memory from all memory
    if self.memory_counter > self.memory_size:
        sample_index = np.random.choice(self.memory_size, size=self.batch_size)
    else:
        sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
    batch_memory = self.memory[sample_index, :]

    q_next, q_eval = self.sess.run(
        [self.q_next, self.q_eval],
        feed_dict={
            self.s_: batch_memory[:, -self.n_features:],  # fixed params
            self.s: batch_memory[:, :self.n_features],  # newest params
        })

    # change q_target w.r.t q_eval's action
    q_target = q_eval.copy()

    batch_index = np.arange(self.batch_size, dtype=np.int32)
    eval_act_index = batch_memory[:, self.n_features].astype(int)
    reward = batch_memory[:, self.n_features + 1]

    q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

    """
    For example in this batch I have 2 samples and 3 actions:
    q_eval =
    [[1, 2, 3],
     [4, 5, 6]]

    q_target = q_eval =
    [[1, 2, 3],
     [4, 5, 6]]

    Then change q_target with the real q_target value w.r.t the q_eval's action.
    For example in:
        sample 0, I took action 0, and the max q_target value is -1;
        sample 1, I took action 2, and the max q_target value is -2:
    q_target =
    [[-1, 2, 3],
     [4, 5, -2]]

    So the (q_target - q_eval) becomes:
    [[(-1)-(1), 0, 0],
     [0, 0, (-2)-(6)]]

    We then backpropagate this error w.r.t the corresponding action to the network,
    and leave the other actions with error = 0 because we did not choose them.
    """

    # train eval network
    _, self.cost = self.sess.run([self._train_op, self.loss],
                                 feed_dict={self.s: batch_memory[:, :self.n_features],
                                            self.q_target: q_target})
    self.cost_his.append(self.cost)

    # increasing epsilon
    self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
    self.learn_step_counter += 1

Plot loss

def plot_cost(self):
    import matplotlib.pyplot as plt
    plt.plot(np.arange(len(self.cost_his)), self.cost_his)
    plt.ylabel('Cost')
    plt.xlabel('training steps')
    plt.show()

Double DQN

Problem of the original DQN

The original DQN has the Q-target as
$$Y_t^{DQN} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$
The neural network is trained towards a Q-target that takes the maximum over estimated action values, which leads to the overestimation issue.

Idea: Double DQN

There are two neural networks: evaluation network (Eval-Net) and target network (Target-Net). In DQN, we use Eval-Net to evaluate the Q value of the current state-action pair, while using Target-Net to evaluate the Q value of next state with all actions.

In Double DQN, we additionally use the Eval-Net to select the action for the Q-target, while the Target-Net evaluates its value:

$$Y_t^{DoubleDQN} = R_{t+1} + \gamma Q(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t); \theta_t^-)$$
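
A minimal numpy sketch of how the batched target computation in learn() above could be adapted (the function name and the extra q_eval_next argument, an additional forward pass of the eval net on the next states, are hypothetical):

import numpy as np

def double_dqn_targets(q_eval, q_next, q_eval_next, actions, rewards, gamma):
    # q_eval:      eval net Q-values for the current states, shape (batch, n_actions)
    # q_next:      target net Q-values for the next states,  shape (batch, n_actions)
    # q_eval_next: eval net Q-values for the next states,    shape (batch, n_actions)
    batch_index = np.arange(q_eval.shape[0], dtype=np.int32)
    q_target = q_eval.copy()

    # select the next action with the eval net, evaluate it with the target net
    max_act_next = np.argmax(q_eval_next, axis=1)
    selected_q_next = q_next[batch_index, max_act_next]

    q_target[batch_index, actions] = rewards + gamma * selected_q_next
    return q_target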
