[RL: From Getting Started to Giving Up] [Part 8]

money_yuan

1. Using TensorBoard

pip install tensorboard

Run the following .py script to generate the log:

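# Written against the TensorFlow 1.x graph API; in TensorFlow 2.x the same
# calls are available under tf.compat.v1 with eager execution disabled.
import tensorflow as tf

# Each name_scope groups its ops into one collapsible node in the graph view
with tf.name_scope('input1'):
    input1 = tf.constant([1.0, 2.0, 3.0], name="input1")
with tf.name_scope('input2'):
    input2 = tf.Variable(tf.random_uniform([3]), name="input2")
output = tf.add_n([input1, input2], name="add")

# Write the default graph to the "log" directory for TensorBoard to read
writer = tf.summary.FileWriter("log", tf.get_default_graph())
writer.close()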

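Running the script writes the log files into the log directory.

Then run the following command in cmd: tensorboard --logdir=log

The following screen appears:

Paste the address it reports into Chrome, here: http://sh09297pcw:6006

TensorBoard then works normally: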

2. Prioritized replay

Pseudocode for Double DQN with prioritized replay is shown in Figure 6.10b.

A practical example:

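    # A detailed diagram would help explain this
    def get_leaf(self, v):
        parent_idx = 0
        while True:
            # Children of node i sit at 2*i + 1 (left) and 2*i + 2 (right)
            cl_idx = 2 * parent_idx + 1
            cr_idx = cl_idx + 1
            if cl_idx >= len(self.tree):
                # No children left: parent_idx is a leaf
                leaf_idx = parent_idx
                break
            else:
                if v <= self.tree[cl_idx]:
                    # v falls within the left subtree's priority mass
                    parent_idx = cl_idx
                else:
                    # Skip the left subtree: subtract its sum and go right
                    v -= self.tree[cl_idx]
                    parent_idx = cr_idx
        # Leaves occupy the last `capacity` slots of the tree array
        data_idx = leaf_idx - self.capacity + 1
        return leaf_idx, self.tree[leaf_idx], self.data[data_idx]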

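When sampling, we divide the sum of all p by the batch size, splitting the total into batch-size many intervals, each of width n = sum(p) / batch_size. If the priorities of all nodes add up to 42 and we draw 6 samples, the intervals of priority look like this: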

[0-7], [7-14], [14-21], [21-28], [28-35], [35-42]
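The interval split can be sketched in a few lines of plain Python. This is only an illustration: the helper name priority_segments is hypothetical and not part of the code in this post, and the numbers 42 and 6 come from the example above.

```python
import random

def priority_segments(total_p, n):
    """Split [0, total_p] into n equal-width segments (hypothetical helper)."""
    seg = total_p / n
    return [(seg * i, seg * (i + 1)) for i in range(n)]

# Total priority 42 and batch size 6, as in the example above
segments = priority_segments(42, 6)

# One uniform draw per segment, so every priority range is represented
draws = [random.uniform(a, b) for a, b in segments]

for (a, b), v in zip(segments, draws):
    print("segment [{0}, {1}] -> draw {2:.2f}".format(a, b, v))
```

Each draw is later used as the search value for one walk down the tree.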

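Then pick a random number within each interval. Say we draw 24 from the fourth interval, [21-28]. We take this 24 and search downward from the 42 at the top. The root 42 has two child nodes, and we compare the 24 in hand with the left child, 29: if the left child is at least as large as the value in hand, we take the left path. Next we compare with the left node below 29, which is 13. The 24 in hand is larger than 13, so we take the right path and reduce the value in hand by 13, giving 24 - 13 = 11. The node we land on is the right child of 29, which holds 29 - 13 = 16. We then compare 11 with that node's left child, 12. Since 12 is larger than 11, we select 12 as the sampled priority, along with the data stored for 12.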
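The walk-through above can be reproduced with a minimal sum tree. This is a sketch under assumptions, not the author's full class: the leaf priorities [3, 10, 12, 4, 1, 2, 8, 2] are chosen only so that the internal sums match the numbers in the text (42, 29, 13, 12), the add and update methods are a plausible minimal implementation, and the transition labels are placeholders. The array layout assumes the capacity is a power of two.

```python
class SumTree:
    """Minimal sum tree: internal nodes store the sum of their children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)  # internal nodes, then leaves
        self.data = [None] * capacity
        self.data_pointer = 0

    def add(self, p, data):
        tree_idx = self.data_pointer + self.capacity - 1
        self.data[self.data_pointer] = data
        self.update(tree_idx, p)
        self.data_pointer = (self.data_pointer + 1) % self.capacity

    def update(self, tree_idx, p):
        # Propagate the priority change from the leaf up to the root
        change = p - self.tree[tree_idx]
        self.tree[tree_idx] = p
        while tree_idx != 0:
            tree_idx = (tree_idx - 1) // 2
            self.tree[tree_idx] += change

    def get_leaf(self, v):
        parent_idx = 0
        while True:
            cl_idx = 2 * parent_idx + 1
            cr_idx = cl_idx + 1
            if cl_idx >= len(self.tree):
                leaf_idx = parent_idx
                break
            if v <= self.tree[cl_idx]:
                parent_idx = cl_idx
            else:
                v -= self.tree[cl_idx]
                parent_idx = cr_idx
        data_idx = leaf_idx - self.capacity + 1
        return leaf_idx, self.tree[leaf_idx], self.data[data_idx]


# Assumed leaf priorities; they sum to 42 like the example in the text
tree = SumTree(capacity=8)
for i, p in enumerate([3, 10, 12, 4, 1, 2, 8, 2]):
    tree.add(p, "transition_{0}".format(i))

leaf_idx, priority, data = tree.get_leaf(24)
print(tree.tree[0], tree.tree[1], tree.tree[2])  # root and its two children
print(leaf_idx, priority, data)
```

With these priorities the search for 24 goes left at 29, right past 13 (leaving 11), and ends on the leaf with priority 12.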

    # Newly added data is assumed to have the highest priority
    def store(self, transition):
        # max_p is the largest priority currently stored in the leaves
        max_p = np.max(self.tree.tree[-self.tree.capacity:])
        #print("max_p is {0}\t self.tree.tree[-self.tree.capacity:] is {1}".format(max_p, self.tree.tree[-self.tree.capacity:]))
        if max_p == 0:
            # Empty tree: fall back to the upper bound
            max_p = self.abs_err_upper
        self.tree.add(max_p, transition)   # set the max p for new p

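We define a store function that saves new experience into the SumTree. We also define abs_err_upper and epsilon, which bound p to the range [epsilon, abs_err_upper]. The first transition stored gets the maximum priority p, and every later transition gets the same priority as the highest-priority experience currently in the tree.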
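A small sketch of the two rules just described: keeping p inside [epsilon, abs_err_upper], and choosing the priority of a newly stored transition. The constant values and both helper names are assumptions made for illustration; they do not appear in the post's code.

```python
EPSILON = 0.01        # assumed small constant that keeps priorities above zero
ABS_ERR_UPPER = 1.0   # assumed upper bound on the priority

def clip_priority(abs_error):
    """Map an absolute error to a priority in [EPSILON, ABS_ERR_UPPER]."""
    return min(abs(abs_error) + EPSILON, ABS_ERR_UPPER)

def priority_for_new_sample(leaf_priorities):
    """New data takes the current maximum, or the upper bound if the tree is empty."""
    max_p = max(leaf_priorities)
    return ABS_ERR_UPPER if max_p == 0 else max_p

print(clip_priority(0.0))                             # lower bound
print(clip_priority(5.0))                             # clipped to the upper bound
print(priority_for_new_sample([0.0, 0.0, 0.0, 0.0]))  # empty tree
print(priority_for_new_sample([0.0, 0.3, 0.7, 0.0]))  # current maximum
```

Giving new transitions the maximum priority means each one is likely to be replayed at least once before its priority is replaced by a measured error.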

    def sample(self, n):
        print("self.tree.data[0].size is {0}".format(self.tree.data[0].size))
        # Leaf indices, sampled transitions, and importance-sampling weights
        b_idx, b_memory, ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, self.tree.data[0].size)), np.empty((n, 1))

        # Width of each priority segment
        pri_seg = self.tree.total_p / n
        print("pri_seg is {0}".format(pri_seg))

        # Anneal beta toward 1 on every call
        self.beta = np.min([1., self.beta + self.beta_increment_per_sampling])

        # Assumes every leaf holds a positive priority (memory is full);
        # an empty leaf would make min_prob zero
        min_prob = np.min(self.tree.tree[-self.tree.capacity:]) / self.tree.total_p  # for later calculate ISweight

        for i in range(n):
            a, b = pri_seg * i, pri_seg * (i + 1)
            # Why compute v this way? One value is drawn uniformly at random inside each segment
            v = np.random.uniform(a, b)
            idx, p, data = self.tree.get_leaf(v)
            prob = p / self.tree.total_p
            ISWeights[i, 0] = np.power(prob/min_prob, -self.beta)
            b_idx[i], b_memory[i, :] = idx, data
        return b_idx, b_memory, ISWeights

Computing the weights
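The weight used in sample is (prob / min_prob) raised to the power -beta, so the transition with the lowest probability gets weight 1 and every other weight is smaller. A minimal standalone sketch, assuming all priorities are positive; the function name and the example priorities are made up for illustration:

```python
def importance_weights(priorities, sampled, beta):
    """Importance-sampling weights, normalized so the largest weight is 1."""
    total = sum(priorities)
    min_prob = min(priorities) / total
    weights = []
    for i in sampled:
        prob = priorities[i] / total
        weights.append((prob / min_prob) ** (-beta))
    return weights

priorities = [1.0, 2.0, 4.0, 8.0]   # hypothetical leaf priorities
sampled = [0, 3]                    # indices of the sampled leaves

w_full = importance_weights(priorities, sampled, beta=1.0)
w_none = importance_weights(priorities, sampled, beta=0.0)
print(w_full)  # higher-priority samples get smaller weights
print(w_none)  # beta = 0 switches the correction off
```

As beta is annealed toward 1, the correction for the non-uniform sampling becomes stronger.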

3. Dueling DQN

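This time we look at the cumulative reward. The reward is 0 when the pole is upright and negative at all other times, so when the cumulative reward stops decreasing, the pole has been held upright for a long time.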
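To see why a flat cumulative reward signals success, here is a toy sketch. The reward function is an assumption (a pendulum-style cost that is 0 only when the pole is upright and still), and the angle sequence is invented; neither comes from the experiment in this post.

```python
import math

def pendulum_style_reward(theta, theta_dot, torque):
    """Assumed reward: 0 when upright and still, negative otherwise."""
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

# Invented episode: the pole swings up, then stays upright (theta = 0)
angles = [math.pi, 2.0, 1.0, 0.3, 0.0, 0.0, 0.0]

cumulative = []
total = 0.0
for theta in angles:
    total += pendulum_style_reward(theta, 0.0, 0.0)
    cumulative.append(total)

print(cumulative)  # decreases during the swing-up, then stays flat
```

Once the pole is upright each step adds 0, so the curve stops falling.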

We find that the more actions are available, the harder learning becomes, but Dueling DQN still learns faster than Natural DQN and converges better.
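For context, what separates Dueling DQN from Natural DQN is how the Q values are assembled: a state value and per-action advantages are estimated separately and then combined. The sketch below shows only that combination step with the usual mean-subtracted form; the function name and the numbers are illustrative assumptions, not the network used in the experiment.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]

# Hypothetical outputs of the two streams for one state with three actions
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(q_values)
```

Because the state value is shared by all actions, one update to it shifts every Q value at once, which is one plausible reason the advantage grows with the number of actions.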
