Deep Reinforcement Learning: Policy Gradients and Optimization (Part 2) — DDPG

DDPG

We previously discussed applying DQN to play Atari games. Those, however, are discrete environments with a finite number of actions. Now consider a continuous environment, such as training a robot to walk. In such environments Q-learning cannot be applied, because the greedy policy would require an expensive optimization at every time step. Even if the continuous environment were discretized, important features might be lost and we would end up with an enormous action space, in which case convergence is hard to guarantee.

To address this, a new architecture called the actor critic is used, which consists of two networks: an actor network and a critic network. The actor-critic architecture combines the policy gradient with the state-action value function. The role of the actor network is to determine the best action in a given state by tuning its parameters $\theta$, and the role of the critic network is to evaluate the action produced by the actor. The critic evaluates the actor's action by computing the temporal-difference (TD) error: the actor selects actions via the policy gradient, and the critic uses the TD error to assess the actions the actor produces. The actor-critic architecture is shown in the figure below.

[Figure: actor-critic network architecture]
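
A minimal numpy sketch of the two roles may make this concrete (the layer sizes and random, untrained weights below are purely illustrative; the full TensorFlow implementation appears later in this post):

import numpy as np

state_dim, action_dim, hidden = 3, 1, 30

# actor: maps a state to a deterministic action, mu(s; theta_mu)
W_a1 = np.random.randn(state_dim, hidden)
W_a2 = np.random.randn(hidden, action_dim)
def actor(s):
    return np.tanh(np.tanh(s @ W_a1) @ W_a2)          # action squashed into [-1, 1]

# critic: maps a (state, action) pair to a scalar Q value
W_c1 = np.random.randn(state_dim + action_dim, hidden)
W_c2 = np.random.randn(hidden, 1)
def critic(s, a):
    return np.tanh(np.concatenate([s, a], axis=-1) @ W_c1) @ W_c2

s = np.random.randn(1, state_dim)
a = actor(s)          # the actor proposes an action
q = critic(s, a)      # the critic scores that (state, action) pair
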
As with DQN, an experience replay buffer is used here: the actor and critic networks are trained on minibatches of experience sampled from it. In addition, a separate actor target network and critic target network are used to compute the loss.

For example, in a pong game different features, such as position and velocity, live on different scales, so all features are rescaled to the same range. Here a method called batch normalization is used to scale the features; it normalizes every feature to zero mean and unit variance. How, then, do we explore new actions? In a continuous environment there are n possible actions. To explore new actions, some noise N is added to the action produced by the actor network; the noise is generated with a random process known as the Ornstein-Uhlenbeck process.
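
The following is a minimal sketch of an Ornstein-Uhlenbeck noise generator, assuming commonly used parameter values (the class name OUNoise and the values of theta, sigma and dt are illustrative; note that the implementation later in this post actually adds plain Gaussian noise with a decaying variance rather than a true OU process):

import numpy as np

class OUNoise:
    """Temporally correlated exploration noise from an Ornstein-Uhlenbeck process."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu = mu * np.ones(action_dim)   # long-run mean the process reverts to
        self.theta = theta                   # speed of mean reversion
        self.sigma = sigma                   # scale of the random fluctuations
        self.dt = dt
        self.state = np.copy(self.mu)

    def reset(self):
        self.state = np.copy(self.mu)

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.state)))
        self.state = self.state + dx
        return self.state

# usage: noisy_action = actor_output + ou_noise.sample()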

Next, the DDPG algorithm is examined in detail.
There are two networks: the actor network and the critic network. The actor network is denoted $\mu(s; \theta^{\mu})$; it takes a state as input and produces an action, where $\theta^{\mu}$ are the weights of the actor network. The critic network is denoted $Q(s, a; \theta^{Q})$; it takes a state and an action as input and returns the Q value, where $\theta^{Q}$ are the weights of the critic network.

Similarly, a target network is defined for both the actor and the critic: $\mu(s; \theta^{\mu'})$ and $Q(s, a; \theta^{Q'})$, where $\theta^{\mu'}$ and $\theta^{Q'}$ are the weights of the actor target network and the critic target network, respectively.

The weights of the actor network are updated with the policy gradient, and the weights of the critic network are updated with the gradient computed from the TD error.

First, an action is selected by adding exploration noise N to the action produced by the actor network, i.e. $\mu(s; \theta^{\mu}) + \mathcal{N}$. The action is performed in state $s$, a reward $r$ is received, and the environment moves to a new state $s'$. This transition is stored in the experience replay buffer.

After a number of iterations, transitions are sampled from the replay buffer to train the networks, and the target Q value is computed as $y_{i}=r_{i}+\gamma Q'\left(s_{i+1}, \mu'\left(s_{i+1} \mid \theta^{\mu'}\right) \mid \theta^{Q'}\right)$. The TD error is then computed as

$$L=\frac{1}{M} \sum_{i}\left(y_{i}-Q\left(s_{i}, a_{i} \mid \theta^{Q}\right)\right)^{2}$$

where M is the number of samples drawn from the replay buffer for training.

The weights of the critic network are updated with the gradient computed from the loss L.
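
As a minimal numpy sketch of this step (the minibatch values below are made up purely for illustration):

import numpy as np

gamma = 0.9
# hypothetical minibatch of M = 3 transitions
rewards = np.array([-1.2, -0.5, -2.0])    # r_i
q_next  = np.array([-10.0, -8.5, -12.3])  # Q'(s_{i+1}, mu'(s_{i+1})) from the target networks
q_pred  = np.array([-11.0, -9.4, -13.0])  # Q(s_i, a_i) from the critic being trained

y = rewards + gamma * q_next              # target Q values y_i
loss = np.mean((y - q_pred) ** 2)         # L = (1/M) * sum_i (y_i - Q(s_i, a_i))^2
print(loss)                               # the critic is updated along the gradient of this loss
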
Similarly, the weights of the policy (actor) network are updated with the policy gradient, and the weights of the actor and critic target networks are then updated. Updating the target network weights slowly improves stability; this is known as soft replacement:

$$\theta' \leftarrow \tau \theta + (1-\tau)\theta'$$
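
In code, soft replacement is just an exponential moving average of the learned weights. A minimal numpy sketch (the weight values are made up; tau = 0.01 matches the alpha hyperparameter used in the implementation below):

import numpy as np

tau = 0.01
theta = np.array([0.8, -0.3, 1.5])   # weights of the learned (eval) network
theta_target = np.zeros(3)           # weights of the corresponding target network

# soft replacement: theta' <- tau * theta + (1 - tau) * theta'
theta_target = tau * theta + (1 - tau) * theta_target
print(theta_target)                  # the target weights drift slowly toward theta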

Swinging Up the Pendulum

The agent must swing the pendulum up and keep it upright. First, import the required libraries:

import tensorflow as tf
import numpy as np
import gym

Define the hyperparameters:

# number of time steps in each episode
episode_steps = 500

# learning rate for the actor network
lr_a = 0.001

# learning rate for the critic network
lr_c = 0.002

# discount factor
gamma = 0.9

# soft replacement coefficient (tau)
alpha = 0.01

# replay buffer size
memory = 10000

# batch size for training
batch_size = 32

render = False

Implementing DDPG:

class DDPG(object):
    def __init__(self, no_of_actions, no_of_states, a_bound,):
        
        # initialize the replay memory; each row stores (state, action, reward, next state)
        self.memory = np.zeros((memory, no_of_states * 2 + no_of_actions + 1), dtype=np.float32)
        
        # initialize the pointer into the experience buffer
        self.pointer = 0
        
        # initialize tensorflow session
        self.sess = tf.Session()
        
        # initialize the variance of the exploration noise
        self.noise_variance = 3.0
        
        self.no_of_actions, self.no_of_states, self.a_bound = no_of_actions, no_of_states, a_bound,
        
        # placeholder for current state, next state and rewards
        self.state = tf.placeholder(tf.float32, [None, no_of_states], 's')
        self.next_state = tf.placeholder(tf.float32, [None, no_of_states], 's_')
        self.reward = tf.placeholder(tf.float32, [None, 1], 'r')
        
        # build the actor network, which has separate eval (primary) and target networks
        with tf.variable_scope('Actor'):
            self.a = self.build_actor_network(self.state, scope='eval', trainable=True)
            a_ = self.build_actor_network(self.next_state, scope='target', trainable=False)
            
        # build the critic network, which has separate eval (primary) and target networks
        with tf.variable_scope('Critic'):
            q = self.build_crtic_network(self.state, self.a, scope='eval', trainable=True)
            q_ = self.build_crtic_network(self.next_state, a_, scope='target', trainable=False)
            

        # initialize the network parameters
        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval')
        self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')
        
        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/eval')
        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')

        # soft-update the target network parameters
        self.soft_replace = [[tf.assign(at, (1-alpha)*at+alpha*ae), tf.assign(ct, (1-alpha)*ct+alpha*ce)]
            for at, ae, ct, ce in zip(self.at_params, self.ae_params, self.ct_params, self.ce_params)]
        
        
        # compute the target Q value: Q(s,a) = reward + gamma * Q'(s', a')
        q_target = self.reward + gamma * q_

        # compute the TD error, i.e. the mean squared difference between target and predicted values
        td_error = tf.losses.mean_squared_error(labels=q_target, predictions=q)
        
        # train the critic network with adam optimizer
        self.ctrain = tf.train.AdamOptimizer(lr_c).minimize(td_error, name="adam-ink", var_list = self.ce_params)
        
        # compute the loss in actor network
        a_loss = - tf.reduce_mean(q)    
        
        # train the actor network with adam optimizer for minimizing the loss
        self.atrain = tf.train.AdamOptimizer(lr_a).minimize(a_loss, var_list=self.ae_params)

        # initialize a summary writer to visualize the network in TensorBoard
        tf.summary.FileWriter("logs", self.sess.graph)
        
        # initialize all variables
        self.sess.run(tf.global_variables_initializer())

       

    # How do we select an action in DDPG? We select an action by adding exploration noise to the
    # actor's output. The text above describes an Ornstein-Uhlenbeck process; this implementation
    # simply adds Gaussian noise with a decaying variance.

    def choose_action(self, s):
        a = self.sess.run(self.a, {self.state: s[np.newaxis, :]})[0]
        a = np.clip(np.random.normal(a, self.noise_variance), -2, 2)
        
        return a
    
    
    # next we define the learn function, where the actual training happens:
    # a minibatch of states, actions, rewards and next states is sampled from the experience buffer
    # and the actor and critic networks are trained on it

    def learn(self):
        
        # soft target replacement
        self.sess.run(self.soft_replace)

        indices = np.random.choice(memory, size=batch_size)
        batch_transition = self.memory[indices, :]
        batch_states = batch_transition[:, :self.no_of_states]
        batch_actions = batch_transition[:, self.no_of_states: self.no_of_states + self.no_of_actions]
        batch_rewards = batch_transition[:, -self.no_of_states - 1: -self.no_of_states]
        batch_next_state = batch_transition[:, -self.no_of_states:]

        self.sess.run(self.atrain, {self.state: batch_states})
        self.sess.run(self.ctrain, {self.state: batch_states, self.a: batch_actions, self.reward: batch_rewards, self.next_state: batch_next_state})

    # store_transition stores the transition information in the replay buffer
    def store_transition(self, s, a, r, s_):
        trans = np.hstack((s,a,[r],s_))
        
        index = self.pointer % memory
        self.memory[index, :] = trans
        self.pointer += 1

        if self.pointer > memory:
            self.noise_variance *= 0.99995
            self.learn()
            
            
    # build_actor_network builds the actor network
    def build_actor_network(self, s, scope, trainable):
        # Actor DPG
        with tf.variable_scope(scope):
            l1 = tf.layers.dense(s, 30, activation = tf.nn.tanh, name = 'l1', trainable = trainable)
            a = tf.layers.dense(l1, self.no_of_actions, activation = tf.nn.tanh, name = 'a', trainable = trainable)     
            return tf.multiply(a, self.a_bound, name = "scaled_a")  


    # build_crtic_network builds the critic network
    def build_crtic_network(self, s, a, scope, trainable):
        with tf.variable_scope(scope):
            n_l1 = 30
            w1_s = tf.get_variable('w1_s', [self.no_of_states, n_l1], trainable = trainable)
            w1_a = tf.get_variable('w1_a', [self.no_of_actions, n_l1], trainable = trainable)
            b1 = tf.get_variable('b1', [1, n_l1], trainable = trainable)
            net = tf.nn.tanh( tf.matmul(s, w1_s) + tf.matmul(a, w1_a) + b1 )

            q = tf.layers.dense(net, 1, trainable = trainable)
            return q

Initialize the gym environment:

env = gym.make("Pendulum-v0")
env = env.unwrapped
env.seed(1)

Get the number of states and the number of actions:

no_of_states = env.observation_space.shape[0]
no_of_actions = env.action_space.shape[0]

Also get the upper bound of the action:

a_bound = env.action_space.high

Create an object of the DDPG class:

ddpg = DDPG(no_of_actions, no_of_states, a_bound)
# for storing the total rewards
total_reward = []

# set the number of episodes
no_of_episodes = 300

Now, start training:

# for each episode
for i in range(no_of_episodes):
    # initialize the environment
    s = env.reset()
    
    # episodic reward
    ep_reward = 0
    
    for j in range(episode_steps):
        
        env.render()

        # select an action by adding exploration noise to the actor's output
        a = ddpg.choose_action(s)
        
        # perform the action and move to the next state s_
        s_, r, done, info = env.step(a)
        
        # store the transition in the experience buffer;
        # once the buffer is full, a minibatch of experience is sampled and the networks are trained
        ddpg.store_transition(s, a, r, s_)
      
        # update current state as next state
        s = s_
        
        # add episodic rewards
        ep_reward += r
        
        if j == episode_steps-1:
            
            # store the total rewards
            total_reward.append(ep_reward)
            
            # print rewards obtained per each episode
            print('Episode:', i, ' Reward: %i' % int(ep_reward))
   
            break

The complete notebook is available at:
https://github.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/blob/master/Chapter11/11.3%20Swinging%20Up%20the%20Pendulum%20Using%20DDPG.ipynb


Packaged program

ddpg.py:

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import gym


class DDPG(object):
    def __init__(self, n_states, n_actions, action_low_bound, action_high_bound, gamma=0.99,
                 actor_lr=0.002, critic_lr=0.002, tau=0.01, memory_size=10000, batch_size=32):
        self.n_states = n_states
        self.n_actions = n_actions
        self.action_low_bound = action_low_bound
        self.action_high_bound = action_high_bound
        self.gamma = gamma
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.tau = tau

        self.memory_size = memory_size
        self.memory = np.zeros([self.memory_size, self.n_states*2 + self.n_actions + 1])
        self.memory_counter = 0
        self.batch_size = batch_size

        self.action_noise = 3
        self.action_noise_decay = 0.9995
        self.learning_counter = 0

        self._build_graph()

        self.session = tf.Session()
        self.session.run(tf.global_variables_initializer())

    def _build_graph(self):
        self.s = tf.placeholder(tf.float32, [None, self.n_states], name='s')
        self.s_ = tf.placeholder(tf.float32, [None, self.n_states], name='s_')
        self.r = tf.placeholder(tf.float32, [None, 1], name='r')

        self.low_action = tf.constant(self.action_low_bound, dtype=tf.float32)
        self.high_action = tf.constant(self.action_high_bound, dtype=tf.float32)

        self.actor_net = self._build_actor_net(s=self.s, trainable=True, scope='actor_eval')
        self.actor_target_net = self._build_actor_net(s=self.s_, trainable=False, scope='actor_target')

        self.critic_net = self._build_critic_net(s=self.s, a=self.actor_net, trainable=True, scope='critic_eval')
        self.critic_target_net = self._build_critic_net(s=self.s_, a=self.actor_target_net, trainable=False,
                                                        scope='critic_target')

        self.ae_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='actor_eval')
        self.at_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='actor_target')
        self.ce_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='critic_eval')
        self.ct_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='critic_target')

        self.soft_replace = [[tf.assign(ta, (1 - self.tau) * ta + self.tau * ea),
                              tf.assign(tc, (1 - self.tau) * tc + self.tau * ec)]
                             for ta, ea, tc, ec in zip(self.at_params, self.ae_params, self.ct_params, self.ce_params)]

        with tf.variable_scope('critic_loss'):
            q_target = self.r + self.gamma*self.critic_target_net
            self.critic_loss_op = tf.reduce_mean(tf.squared_difference(q_target, self.critic_net))
        with tf.variable_scope('critic_train'):
            self.critic_train_op = tf.train.AdamOptimizer(self.critic_lr).minimize(self.critic_loss_op,
                                                                                   var_list=self.ce_params)
        with tf.variable_scope('actor_loss'):
            self.actor_loss_op = -tf.reduce_mean(self.critic_net)  # maximize q

        with tf.variable_scope('actor_train'):
            self.actor_train_op = tf.train.AdamOptimizer(self.actor_lr).minimize(self.actor_loss_op,
                                                                                 var_list=self.ae_params)

    def _build_actor_net(self, s, trainable, scope):

        k_init, b_init = tf.random_normal_initializer(0., 0.1), tf.constant_initializer(0.1)
        h1_units = 64

        with tf.variable_scope(scope):
            w1 = tf.get_variable(name='w1', shape=[self.n_states, h1_units], initializer=k_init, trainable=trainable)
            b1 = tf.get_variable(name='b1', shape=[h1_units], initializer=b_init, trainable=trainable)
            h1 = tf.nn.relu(tf.matmul(s, w1) + b1)

            w2 = tf.get_variable(name='w2', shape=[h1_units, self.n_actions], initializer=k_init, trainable=trainable)
            b2 = tf.get_variable(name='b2', shape=[self.n_actions], initializer=b_init, trainable=trainable)
            actor_net = tf.matmul(h1, w2) + b2
            actor_net = tf.clip_by_value(actor_net, self.low_action, self.high_action)
        return actor_net

    def _build_critic_net(self, s, a, trainable, scope):
        k_init, b_init = tf.random_normal_initializer(0., 0.2), tf.constant_initializer(0.1)
        h1_units = 64
        with tf.variable_scope(scope):
            w1s = tf.get_variable(name='w1s', shape=[self.n_states, h1_units], initializer=k_init, trainable=trainable)
            w1a = tf.get_variable(name='w1a', shape=[self.n_actions, h1_units], initializer=k_init, trainable=trainable)
            b1 = tf.get_variable(name='b1_e', shape=[h1_units], initializer=b_init, trainable=trainable)
            h1 = tf.nn.relu(tf.matmul(s, w1s) + tf.matmul(a, w1a) + b1)

            w2 = tf.get_variable(name='w2', shape=[h1_units, 1], initializer=k_init, trainable=trainable)
            b2 = tf.get_variable(name='b2', shape=[1], initializer=b_init, trainable=trainable)
            critic_net = tf.matmul(h1, w2) + b2
        return critic_net

    def choose_action(self, s):
        s = s[np.newaxis, :]
        action_probs = self.session.run(self.actor_net, feed_dict={self.s: s})
        action = action_probs[0]
        action = np.clip(np.random.normal(action, self.action_noise), -2, 2)
        self.action_noise *= self.action_noise_decay
        return action

    def learn(self):
        self.session.run(self.soft_replace)

        bs, ba, br, bs_ = self.sample_memory()

        fetches = [self.actor_train_op, self.critic_train_op]

        self.session.run(fetches=fetches, feed_dict={self.s: bs, self.actor_net: ba,
                                                     self.r: br, self.s_: bs_})
        self.learning_counter += 1

    def store_memory(self, s, a, r, s_):
        transition = np.hstack((s, a, r, s_))
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition
        self.memory_counter += 1

    def sample_memory(self):
        assert self.memory_counter >= self.batch_size
        if self.memory_counter <= self.memory_size:
            index = np.random.choice(self.memory_counter, self.batch_size)
        else:
            index = np.random.choice(self.memory_size, self.batch_size)
        batch_memory = self.memory[index, :]

        bs = batch_memory[:, :self.n_states]
        ba = batch_memory[:, self.n_states: self.n_states+self.n_actions]
        br = batch_memory[:, self.n_states+self.n_actions]
        bs_ = batch_memory[:, -self.n_states:]

        br = br[:, np.newaxis]

        return bs, ba, br, bs_

run.py:

import gym
import matplotlib.pyplot as plt
from ddpg import DDPG

GAME = 'Pendulum-v0'
MAX_STEP = 100
EPISODES = 3000


env = gym.make(GAME)
env = env.unwrapped
env.seed(1)

n_states = env.observation_space.shape[0]
n_actions = env.action_space.shape[0]
action_low_bound = env.action_space.low
action_high_bound = env.action_space.high

agent = DDPG(n_states, n_actions, action_low_bound, action_high_bound)


def run():
    plt.ion()
    total_r = 0
    avg_ep_r_hist = []
    for episode in range(EPISODES):
        ep_step = 0
        ep_r = 0
        s = env.reset()
        while True:
            a = agent.choose_action(s)
            s_, r, done, info = env.step(a)
            agent.store_memory(s, a, r, s_)

            ep_r += r
            total_r += r
            ep_step += 1

            if agent.memory_counter >= agent.batch_size:
                agent.learn()

            if ep_step >= MAX_STEP:
                break

            s = s_

        if episode >= 10:
            avg_ep_r = total_r/(episode+1)
            avg_ep_r_hist.append(avg_ep_r)
            if episode % 20 == 0:
                print('Episode %d Avg Reward/Ep %s' % (episode, avg_ep_r))
        plt.cla()
        plt.plot(avg_ep_r_hist)
        plt.pause(0.0001)
    plt.ioff()
    plt.show()


if __name__ == '__main__':
    run()