TensorFlow Bug Collection and Fixes

jiruiYang

Categories: Machine Learning, Image Processing Algorithms, Essays. Tags: bug, tensorflow, cudnn, cuda

Article link: https://blog.csdn.net/jiruiyang/article/details/77968226

This post collects the harder-to-solve bugs I have run into while writing and debugging my own code.

Could not set cudnn filter descriptor:CUDNN_STATUS_BAD_PARAM

This error is usually caused by a gradient tensor whose batch size is 0 during backpropagation (BP).
For example, when I implemented FPN I had the following code:

with tf.variable_scope('crop_roi_and_roi_align'):
    for i in range(2, 6):
        level_i_proposal_indices = tf.reshape(tf.where(tf.equal(levels, i)), [-1])
        level_i_proposals = tf.gather(self.first_stage_decode_boxes, level_i_proposal_indices)

        all_level_proposal_list.append(level_i_proposals)

        ymin, xmin, ymax, xmax = tf.unstack(level_i_proposals, axis=1)
        img_h, img_w = tf.cast(self.img_shape[1], tf.float32), tf.cast(self.img_shape[2], tf.float32)
        normalize_ymin = ymin / img_h
        normalize_xmin = xmin / img_w
        normalize_ymax = ymax / img_h
        normalize_xmax = xmax / img_w

        level_i_cropped_rois = tf.image.crop_and_resize(
            self.feature_pyramid['P%d' % i],
            boxes=tf.transpose(tf.stack([normalize_ymin, normalize_xmin,
                                         normalize_ymax, normalize_xmax])),
            box_ind=tf.zeros(shape=[tf.shape(level_i_proposals)[0],],dtype=tf.int32),
            crop_size=[self.initial_crop_size, self.initial_crop_size])
        level_i_rois = slim.max_pool2d(level_i_cropped_rois,
                                       [self.max_pool_kernel_size, self.max_pool_kernel_size],
                                       stride=self.max_pool_kernel_size)
        all_level_roi_list.append(level_i_rois)

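The important part here is tf.image.crop_and_resize, which takes an argument giving the batch index of each box. I specified it like this:
box_ind=tf.zeros(shape=[tf.shape(level_i_proposals)[0],],dtype=tf.int32). However, level_i_proposals may well be [], that is, its first dimension is 0. In that case box_ind is [] as well, so the batch size is 0 and the error is raised.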
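The empty selection is easy to reproduce in isolation. The sketch below is not from the original code: the `levels` and `proposals` values are made up for illustration, and it assumes TensorFlow 2 with eager execution (the original code is TensorFlow 1.x). It only shows how a pyramid level with no matching proposals produces zero-row tensors.

```python
import tensorflow as tf

# Hypothetical inputs: three proposals, assigned to pyramid levels 2 and 3 only.
levels = tf.constant([2, 2, 3])
proposals = tf.constant([[0.0, 0.0, 10.0, 10.0],
                         [5.0, 5.0, 20.0, 20.0],
                         [0.0, 0.0, 64.0, 64.0]])

# No proposal belongs to level 5, so tf.where finds no indices.
level_5_indices = tf.reshape(tf.where(tf.equal(levels, 5)), [-1])
level_5_proposals = tf.gather(proposals, level_5_indices)

# Built the same way as in the FPN code: one index per proposal, so zero entries here.
box_ind = tf.zeros(shape=[tf.shape(level_5_proposals)[0]], dtype=tf.int32)

print(level_5_proposals.shape)  # zero rows, four columns
print(box_ind.shape)            # zero entries
```

Any level that happens to receive no proposals in a given step ends up in this state, which is why the error can appear only occasionally.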
Fix:
Add a condition where level_i_proposals is produced, so that its first dimension is never 0.

with tf.variable_scope('crop_roi_and_roi_align'):
    for i in range(2, 6):
        level_i_proposal_indices = tf.reshape(tf.where(tf.equal(levels, i)), [-1])
        level_i_proposals = tf.gather(self.first_stage_decode_boxes, level_i_proposal_indices)

        level_i_proposals = tf.cond(
            tf.equal(tf.shape(level_i_proposals)[0], 0),
            lambda: tf.constant([[0, 0, 0, 0]], dtype=tf.float32),
            lambda: level_i_proposals
        )  # avoid an empty batch in level_i_proposals, which would break gradient backpropagation

With this change the problem is solved.
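For experimenting with the guard outside the full model, it can be wrapped in a small helper. This is a minimal sketch, assuming TensorFlow 2 with eager execution; `ensure_non_empty_boxes`, the random feature map and the 7x7 crop size are hypothetical choices for illustration and do not appear in the original code.

```python
import tensorflow as tf


def ensure_non_empty_boxes(boxes):
    """Return `boxes` unchanged, or one all-zero dummy box if `boxes` has no rows."""
    return tf.cond(
        tf.equal(tf.shape(boxes)[0], 0),
        lambda: tf.zeros([1, 4], dtype=boxes.dtype),
        lambda: boxes,
    )


# Hypothetical stand-in for one pyramid level: batch of 1, 32x32, 8 channels.
feature_map = tf.random.uniform([1, 32, 32, 8])

# A level that received no proposals.
empty_proposals = tf.zeros([0, 4], dtype=tf.float32)

boxes = ensure_non_empty_boxes(empty_proposals)
box_indices = tf.zeros([tf.shape(boxes)[0]], dtype=tf.int32)

# With the guard in place, crop_and_resize always sees at least one box.
crops = tf.image.crop_and_resize(feature_map, boxes, box_indices, crop_size=[7, 7])
print(crops.shape)
```

Note that the dummy box adds one extra ROI for that level, so any downstream code that matches ROIs to labels has to account for it.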

Ran out of GPU memory when allocating 0 bytes for[[Node]: softmax_cross_entropy_loss/xentropy

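This error is not caused by insufficient GPU memory. It occurs because the logits or labels passed to the cross-entropy loss are [], that is, empty. (In most cases the bug shows up only occasionally, on a single step, and that is enough to abort the program.)
Fix:

Make the program robust: ensure that logits and labels can never be [].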
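One possible way to add such a guard is to skip the loss computation when the batch is empty. The sketch below assumes TensorFlow 2 with eager execution; `safe_softmax_cross_entropy` is a hypothetical helper, and returning a loss of 0 for an empty batch is a design choice made here, not something the original code prescribes.

```python
import tensorflow as tf


def safe_softmax_cross_entropy(labels, logits):
    """Mean softmax cross-entropy that returns 0 when the batch is empty."""
    num_examples = tf.shape(logits)[0]
    return tf.cond(
        tf.equal(num_examples, 0),
        # Empty batch: contribute nothing to the total loss.
        lambda: tf.constant(0.0, dtype=logits.dtype),
        # Non-empty batch: the usual per-example loss, averaged.
        lambda: tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)),
    )


# Empty inputs no longer reach the cross-entropy kernel.
empty_loss = safe_softmax_cross_entropy(tf.zeros([0, 3]), tf.zeros([0, 3]))

# A regular batch is handled as before.
loss = safe_softmax_cross_entropy(tf.constant([[1.0, 0.0, 0.0]]), tf.zeros([1, 3]))
print(float(empty_loss), float(loss))
```

An alternative is to fix the problem upstream, as in the first bug, so that empty logits or labels are never produced at all.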

To be continued...
