Implementing a Policy Network with TensorFlow (Deep Reinforcement Learning, Part 1)

1. Introduction to Deep Reinforcement Learning

    Reinforcement Learning is an important branch of machine learning, mainly used to solve sequential decision-making problems. It can learn how to achieve a goal we set even in complex, uncertain environments.

    A reinforcement learning problem involves three main concepts: the environment state (Environment State), the action (Action), and the reward (Reward). The goal of reinforcement learning is to obtain the largest cumulative reward.

    As a quick recap: an AutoEncoder belongs to unsupervised learning, while MLP, CNN and RNN all belong to supervised learning, but reinforcement learning differs from both. Unlike unsupervised learning it does have a learning target, yet unlike supervised learning that target is not a clearly defined label; the target in reinforcement learning is usually changing and loosely defined, and an absolutely correct label may not even exist.

2. Policy Network

    A policy network is a neural network model that, by observing the environment state, directly predicts the policy that should be executed right now, such that executing this policy yields the maximum expected return (including both current and future rewards). Unlike the earlier tasks, in reinforcement learning there may be no absolutely correct learning target, and the samples' features and labels no longer correspond one to one. Our learning target is the expected value: the reward obtained now plus the rewards potentially obtainable in the future. Therefore the policy network does not use only the current reward as the label; instead it uses the Discounted Future Reward, in which every future reward is multiplied in turn by the discount factor γ. The discount factor is a number slightly smaller than, but close to, 1; it keeps the reward target from diverging through undamped accumulation, and it also encodes our uncertainty about future rewards.
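    In other words, the value assigned to time step t is the standard discounted return (the same quantity the discount_rewards helper in the code below computes):

    R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_k γ^k·r_{t+k}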

3. Gym

Gym is an open-source toolkit released by OpenAI for building reinforcement learning environments. Gym has two core concepts: the Environment, which is the task or problem we want to solve, and the Agent, which is the policy or algorithm we write. The Agent passes the Action it wants to execute to the Environment; after accepting the Action, the Environment returns the resulting Observation (the environment state) and the Reward to the Agent.

Installation:

sudo pip3 install gym
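
A minimal sketch of this Agent/Environment loop, using Gym's classic step() API that returns (observation, reward, done, info); the CartPole code below follows exactly the same pattern:

import gym

env = gym.make('CartPole-v0')
observation = env.reset()               # the Environment hands the Agent its first state
for _ in range(100):
    action = env.action_space.sample()  # the Agent picks an Action (random here)
    # the Environment executes the Action and returns the new state and the reward
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()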

4. Code Implementation

    Below we use TensorFlow to build a policy-network based Agent that solves the CartPole problem. The code in this section is mainly based on the open-source DeepRL-Agents implementation.

# !/usr/bin/python3
# -*- coding:utf-8 -*-
# Use TensorFlow to build a policy-network based Agent that solves the CartPole problem
import numpy as np
import tensorflow as tf
import gym
env = gym.make('CartPole-v0') # create the CartPole environment env

# Initialize the environment
env.reset()
random_episodes = 0
reward_sum = 0
# Run 10 episodes with a random policy
while random_episodes < 10:
    env.render() # render the CartPole environment
    # np.random.randint(0,2) picks a random action; env.step(action) executes it
    observation,reward,done,_ = env.step(np.random.randint(0,2))  # done == True means this episode is over
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:",reward_sum)
        reward_sum = 0
        env.reset() # reset the environment

Running this prints the total reward of each of the 10 random episodes. The reward obtained by the random policy is roughly between 10 and 45 per episode, with a mean of about 20-35; this will be the baseline we compare against. We set the goal of the task to a reward of 200, which is the maximum episode length of CartPole-v0 (i.e. keeping the pole balanced for a full episode), and we want to reach this goal in as few episodes as possible.

Our policy network is a simple MLP with a single hidden layer. The full code is shown below.

# !/usr/bin/python3
# -*- coding:utf-8 -*-
# Use TensorFlow to build a policy-network based Agent that solves the CartPole problem
import numpy as np
import tensorflow as tf
import gym
env = gym.make('CartPole-v0') # create the CartPole environment env

# hyperparameters
H = 50  # number of hidden layer neurons
batch_size = 25  # every how many episodes to do a param update?
learning_rate = 1e-1  # feel free to play with this to train faster or more stably.
gamma = 0.99  # discount factor for reward

D = 4  # input dimensionality: the observation has 4 components

tf.reset_default_graph()

# This defines the network as it goes from taking an observation of the environment to
# giving a probability of choosing the action of moving left or right.
# The network takes the environment state as input and outputs the probability of moving left/right
observations = tf.placeholder(tf.float32, [None, D], name="input_x")
# Create the hidden-layer weights W1 with shape [D, H], using Xavier initialization
W1 = tf.get_variable("W1", shape=[D, H],
                     initializer=tf.contrib.layers.xavier_initializer())
# ReLU activation gives the hidden-layer output layer1; note that no bias is added
layer1 = tf.nn.relu(tf.matmul(observations, W1))
# Create the output-layer weights W2
W2 = tf.get_variable("W2", shape=[H, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1, W2)
# Apply the sigmoid activation to get the final output probability
probability = tf.nn.sigmoid(score)

# From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables() # get all trainable parameters of the policy network
input_y = tf.placeholder(tf.float32, [None, 1], name="input_y") # manually constructed fake label
advantages = tf.placeholder(tf.float32, name="reward_signal") # the discounted value of each action

# The loss function. This sends the weights in the direction of making actions
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
# loglik is simply the log-probability of the action that was actually taken
loglik = tf.log(input_y * (input_y - probability) + (1 - input_y) * (input_y + probability))
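# Concretely: if input_y = 1 (action 0 was taken), loglik = log(1 - probability) = log P(action = 0);
# if input_y = 0 (action 1 was taken), loglik = log(probability) = log P(action = 1).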
loss = -tf.reduce_mean(loglik * advantages) # loss: log-probability of each action weighted by its advantage
newGrads = tf.gradients(loss, tvars) # gradients of the loss w.r.t. the model parameters

# To reduce noise in the reward signal, we accumulate gradients over several episodes before updating the network parameters.
# Once we have collected a series of gradients from multiple episodes, we apply them.
# We don't just apply gradients after every episode in order to account for noise in the reward signal.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate)  # Our optimizer
W1Grad = tf.placeholder(tf.float32,
                        name="batch_grad1")  # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32, name="batch_grad2")
batchGrad = [W1Grad, W2Grad]
# apply the accumulated gradients to update the model parameters
updateGrads = adam.apply_gradients(zip(batchGrad, tvars))

# Helper that computes the discounted value discounted_r of every action
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
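# For example, with gamma = 0.99 and rewards [1.0, 1.0, 1.0] this returns
# [2.9701, 1.99, 1.0]: each step is credited with its own reward plus the discounted
# future rewards, which is exactly the Discounted Future Reward described above.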

# Buffers: xs stores the observations, ys stores the fake labels, drs stores the reward of every action
xs, ys, drs = [], [], []
# running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000 # total number of episodes
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    observation = env.reset()  # Obtain an initial observation of the environment

    # Reset the gradient placeholder. We will collect gradients in
    # gradBuffer until we are ready to update our policy network.
    gradBuffer = sess.run(tvars) # buffer that accumulates the parameter gradients
    # initialize gradBuffer to all zeros
    for ix, grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0
    # Main training loop
    while episode_number <= total_episodes:

        # Rendering the environment slows things down,
        # so let's only look at it once our agent is doing a good job.
        if reward_sum / batch_size > 100 or rendering == True:
            env.render()
            rendering = True

        # Make sure the observation is in a shape the network can handle.
        x = np.reshape(observation, [1, D])

        # Run the policy network and get an action to take.
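        # probability is the network's estimate of P(action = 1); sampling against
        # np.random.uniform() gives a stochastic policy, so the agent keeps exploring.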
        tfprob = sess.run(probability, feed_dict={observations: x})
        action = 1 if np.random.uniform() < tfprob else 0

        xs.append(x)  # observation
        # y = 1 if action == 0 else 0  # a "fake label"
        y = 1 - action
        ys.append(y)

        # step the environment and get new measurements
        observation, reward, done, info = env.step(action)
        reward_sum += reward

        drs.append(reward)  # record reward (has to be done after we call step() to get reward for previous action)

        if done:
            episode_number += 1
            # stack together all inputs, hidden states, action gradients, and rewards for this episode
            epx = np.vstack(xs) # np.vstack stacks the list elements vertically
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            xs, ys, drs = [], [], []  # reset array memory

            # compute the discounted reward backwards through time
            discounted_epr = discount_rewards(epr)
            # size the rewards to be unit normal (helps control the gradient estimator variance)
            # Standardize: subtract the mean and divide by the standard deviation, giving zero mean and unit standard deviation
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)

            # Get the gradient for this episode, and save it in the gradBuffer
            tGrad = sess.run(newGrads, feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
            for ix, grad in enumerate(tGrad):
                gradBuffer[ix] += grad

            # If we have completed enough episodes, then update the policy network with our gradients.
            if episode_number % batch_size == 0:
                sess.run(updateGrads, feed_dict={W1Grad: gradBuffer[0], W2Grad: gradBuffer[1]})
                for ix, grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0

                # Give a summary of how well our network is doing for each batch of episodes.
                # running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
                print('Average reward for episode %d : %f.' % (episode_number, reward_sum / batch_size))

                if reward_sum / batch_size >= 200:  # CartPole-v0 caps each episode at a reward of 200
                    print("Task solved in", episode_number, 'episodes!')
                    break

                reward_sum = 0

            observation = env.reset()








