A Minimal TensorFlow Implementation of YOLO v7

gzroy

Last modified 2023-04-07 11:48:38

Tags: tensorflow, YOLO, deep learning

First published 2023-04-06 16:58:23

Article link: https://blog.csdn.net/gzroy/article/details/129264083

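YOLO v7 was released last year and achieved very good results. Its authors also published source code implemented in PyTorch. In several of my earlier blog posts I analyzed that code in depth to understand the technical details and implementation of YOLO v7. Because I have always worked with TensorFlow, I also wanted to try porting the code to TensorFlow.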

Building the dataset

Run the get_coco.sh script that ships with the YOLO v7 source code to download the COCO dataset. The script is as follows:

#!/bin/bash
# COCO 2017 dataset http://cocodataset.org
# Download command: bash ./scripts/get_coco.sh

# Download/unzip labels
d='./' # unzip directory
url=https://github.com/ultralytics/yolov5/releases/download/v1.0/
f='coco2017labels-segments.zip' # or 'coco2017labels.zip', 68 MB
echo 'Downloading' $url$f ' ...'
curl -L $url$f -o $f && unzip -q $f -d $d && rm $f & # download, unzip, remove in background

# Download/unzip images
d='./coco/images' # unzip directory
url=http://images.cocodataset.org/zips/
f1='train2017.zip' # 19G, 118k images
f2='val2017.zip'   # 1G, 5k images
f3='test2017.zip'  # 7G, 41k images (optional)
for f in $f1 $f2 $f3; do
  echo 'Downloading' $url$f '...'
  curl -L $url$f -o $f && unzip -q $f -d $d && rm $f & # download, unzip, remove in background
done
wait # finish background tasks

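After the download finishes, the images and labels directories each contain three subdirectories, train2017, val2017, and test2017, which hold the training, validation, and test data respectively.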

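Next we can build a training dataset with TensorFlow. The training images need to be augmented, which includes Mosaic stitching, random copy-paste of image content, random geometric warping, and color adjustment, and the object labels in each image must be transformed to match. For how this works, see my earlier post 解读YOLO v7的代码(二)训练数据的准备 (Reading the YOLO v7 Code, Part 2: Preparing the Training Data).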

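Here I define a Dataloader class that applies the image augmentation to the training set. The processing is essentially the same as in the YOLOv7 source code, with one small change: after Mosaic stitching, if the random warp shrinks the image, an object's bounding box can end up extending beyond the image. I clip the box using the object's segments data so that it never leaves the image.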
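The clipping idea can be illustrated with a minimal sketch. This is not the code from dataloader.py: the function name `clip_box_to_image` is hypothetical, and it assumes a segment is a list of (x, y) pixel points and that clamping each point to the image rectangle is an acceptable way to bound the box.

```python
def clip_box_to_image(segment, width, height):
    """Return (x1, y1, x2, y2) of a polygon after clamping its points to the image.

    segment: list of (x, y) pixel coordinates of the object's outline (assumed format).
    """
    # Clamp every point into [0, width] x [0, height] so nothing lies outside the image.
    xs = [min(max(x, 0.0), float(width)) for x, _ in segment]
    ys = [min(max(y, 0.0), float(height)) for _, y in segment]
    # The bounding box of the clamped points can no longer exceed the image.
    return min(xs), min(ys), max(xs), max(ys)


box = clip_box_to_image([(-10, 20), (50, 700), (300, 100)], 640, 640)
```

Because the box is derived from the outline points rather than from the warped box corners, it stays tight around the visible part of the object.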

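The validation data needs no augmentation: each image is simply scaled so that its long side is 640, and the remaining area is filled with padding. The TensorFlow dataset is defined as follows: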

def map_val_fn(t: tf.Tensor):
    filename = str(t.numpy(), encoding='utf-8')
    imgid = int(filename[20:32])
    # Load image
    img, (h0, w0), (h, w) = load_image(filename)
    #augment_hsv(img, hgain=hsv_h, sgain=hsv_s, vgain=hsv_v)

    # Labels
    label_filename = val_label_path + filename.split('/')[-1].split('.')[0] + '.txt'
    labels, _ = load_labels(label_filename)
    labels[:, 1:] = xywhn2xyxy(labels[:, 1:], w, h, 0, 0)  # normalized xywh to pixel xyxy format
    labels[:, 1:5] = xyxy2xywh(labels[:, 1:5])  # convert xyxy to xywh
    labels[:, 1:5] /= img_size  # normalized height 0-1
    
    img = img[:, :, ::-1].transpose(2,0,1)
    img = img/255.
    
    img_hw = tf.concat([h0, w0], axis=0)
    return img, labels, img_hw, imgid

dataset_val = tf.data.Dataset.list_files("coco/images/val2017/*.jpg", shuffle=False)
dataset_val = dataset_val.map(
    lambda x: tf.py_function(func=map_val_fn, inp=[x], Tout=[tf.float32, tf.float32, tf.int32, tf.int32]), 
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset_val = dataset_val\
    .padded_batch(val_batch_size, padded_shapes=([3, img_size, img_size], [None, 5], [2], []), padding_values=(144/255., 0., 0, 0))\
    .prefetch(tf.data.experimental.AUTOTUNE)
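The label arithmetic in map_val_fn can be checked with a small plain-Python sketch. It assumes, as in the original YOLOv7 load_image, that the image is scaled by img_size / max(h0, w0) with integer truncation, and that the image sits at the top-left of the square canvas with padding on the bottom and right (which is what padded_batch does). The function name `val_box_on_canvas` is hypothetical.

```python
def val_box_on_canvas(box_n, h0, w0, img_size=640):
    """Map a normalized (xc, yc, w, h) box from the original image to the padded canvas."""
    r = img_size / max(h0, w0)        # scale so that the long side equals img_size
    h, w = int(h0 * r), int(w0 * r)   # size of the resized image (assumed truncation)
    xc, yc, bw, bh = box_n
    # normalized -> pixels on the resized image -> normalized by the canvas size
    return (xc * w / img_size, yc * h / img_size,
            bw * w / img_size, bh * h / img_size)


# A 480x640 (h x w) image: x values are unchanged, y values shrink by 480/640.
print(val_box_on_canvas((0.5, 0.5, 0.2, 0.4), 480, 640))
```

Dividing by img_size rather than by the resized width and height is what makes the labels consistent with the padded 640x640 input the model sees.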

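For the training dataset I originally planned to use the same approach as the validation set above, just replacing the map function with the corresponding function from the Dataloader; see dataloader.py for the code. This turned out to be inefficient. In practice the augmentation pipeline is complex enough that the CPU spends a lot of time on it. Although the map and prefetch methods of a TensorFlow dataset accept an AUTOTUNE setting that optimizes parallel processing, the result was not good enough, and the GPU still ended up waiting for the CPU to finish preparing data. So I wrote my own parallel routine using Python's multiprocessing module: while the GPU trains on 100 batches, the CPU prepares the next 100 batches in parallel, which improves performance considerably.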

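Concretely, a block of shared memory is created for all the child processes. A random subset of filenames is drawn from the training images and divided among several child processes. Each child reads its images, applies the image processing, processes the matching label files, and writes the results into the corresponding position in shared memory. Finally, a separate child process merges and organizes the data in shared memory, and a dataset can then be built directly from the organized data.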
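The shared-memory pattern can be sketched with the standard library alone. This is only an illustration under simplifying assumptions, not the author's implementation: each "sample" is four floats instead of an augmented image plus labels, the merge step is done by the parent rather than a separate child process, and the names `write_slot`, `worker`, and `prepare_block` are hypothetical.

```python
import struct
from multiprocessing import Process, shared_memory

SLOT_FLOATS = 4                  # hypothetical: number of floats stored per sample
SLOT_BYTES = SLOT_FLOATS * 4     # float32 is 4 bytes


def write_slot(shm_name, slot, values):
    """Attach to an existing shared-memory block and fill one sample slot."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        struct.pack_into(f"{SLOT_FLOATS}f", shm.buf, slot * SLOT_BYTES, *values)
    finally:
        shm.close()              # detach only; the creator unlinks the block


def worker(shm_name, slots):
    """Stand-in for 'load image, augment, transform labels' on assigned samples."""
    for slot in slots:
        write_slot(shm_name, slot, [float(slot)] * SLOT_FLOATS)


def prepare_block(num_samples, num_workers):
    """Split the samples across child processes and collect what they wrote."""
    shm = shared_memory.SharedMemory(create=True, size=num_samples * SLOT_BYTES)
    try:
        chunks = [list(range(w, num_samples, num_workers)) for w in range(num_workers)]
        procs = [Process(target=worker, args=(shm.name, c)) for c in chunks]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return [struct.unpack_from(f"{SLOT_FLOATS}f", shm.buf, i * SLOT_BYTES)
                for i in range(num_samples)]
    finally:
        shm.close()
        shm.unlink()
```

When running this as a script, call prepare_block from inside an `if __name__ == "__main__":` guard so that child processes can be started safely on every platform. In the real pipeline the collected arrays would then be handed to something like tf.data.Dataset.from_tensor_slices.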

Defining the model

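Now we build a YOLO v7 model. For an explanation of the model structure, see my other post 解读YOLO v7的代码(一)模型结构研究 (Reading the YOLO v7 Code, Part 1: A Study of the Model Structure).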

Create a file named yolo.py that defines the model's custom layers and assembles them into the model.

import tensorflow as tf
from tensorflow import keras
l=tf.keras.layers
from params import *

@tf.keras.utils.register_keras_serializable()
class YoloConv(keras.layers.Layer):
    def __init__(self, filters, kernel_size, strides, padding='same', bias=False, activation='swish', **kwargs):
        super(YoloConv, self).__init__(**kwargs)
        self.activation = activation
        self.filters = filters
        self.kernel_size = kernel_size
        self.strides = strides
        self.padding = padding
        self.bias = bias
        self.cv = l.Conv2D(filters=self.filters, 
            kernel_size=self.kernel_size, 
            strides=self.strides,
            padding=self.padding,
            data_format='channels_first',
            use_bias=self.bias,
            kernel_initializer='he_normal',
            kernel_regularizer=tf.keras.regularizers.l2(l=weight_decay))
        self.bn = l.BatchNormalization(axis=1)
        self.swish = l.Activation('swish')

    def call(self, inputs, training):
        output = self.cv(inputs)
        output = self.bn(output, training)
        if self.activation=='swish':
            output = self.swish(output)
        else:
            output = output
        return output

    def get_config(self):
        config = super(YoloConv, self).get_config()
        config.update({
            "activation": self.activation,
            "filters": self.filters,
            "kernel_size": self.kernel_size,
            "strides": self.strides,
            "padding": self.padding,
            "bias": self.bias
        })
        return config

@tf.keras.utils.register_keras_serializable()
class Elan(keras.layers.Layer):
    def __init__(self, filters, **kwargs):
        super(Elan, self).__init__(**kwargs)
        self.filters = filters
        self.cv1 = YoloConv(self.filters, 1, 1)
        self.cv2 = YoloConv(self.filters, 1, 1)
        self.cv3 = YoloConv(self.filters, 3, 1)
        self.cv4 = YoloConv(self.filters, 3, 1)
        self.cv5 = YoloConv(self.filters, 3, 1)
        self.cv6 = YoloConv(self.filters, 3, 1)
        self.cv7 = YoloConv(self.filters*4, 1, 1)
        self.concat = l.Concatenate(axis=1)
        
    def call(self, inputs, training):
        output1 = self.cv1(inputs, training)
        output2 = self.cv2(inputs, training)
        output3 = self.cv4(self.cv3(output2, training), training)
        output4 = self.cv6(self.cv5(output3, training), training)
        output = self.concat([output1, output2, output3, output4])
        output = self.cv7(output, training)
        return output

    def get_config(self):
        config = super(Elan, self).get_config()
        config.update({
            "filters": self.filters
        })
        return config

@tf.keras.utils.register_keras_serializable()
class MP(keras.layers.Layer):
    def __init__(self, filters, k=2, **kwargs):
        # Accept name/trainable/dtype so that from_config() can rebuild the layer
        super(MP, self).__init__(**kwargs)
        self.filters = filters
        self.k = k
        self.cv1 = YoloConv(filters, 1, 1)
        self.cv2 = YoloConv(filters, 1, 1)
        self.cv3 = YoloConv(filters, 3, 2)
        self.pool = l.MaxPool2D(pool_size=self.k, strides=self.k, padding='same', data_format='channels_first')
        self.concat = l.Concatenate(axis=1)
        
    def call(self, inputs, training):
        output1 = self.pool(inputs)
        output1 = self.cv1(output1, training)
        output2 = self.cv2(inputs, training)
        output2 = self.cv3(output2, training)
        output = self.concat([output1, output2])
        return output

    def get_config(self):
        config = super(MP, self).get_config()
        config.update({
            "filters": self.filters,
            "k": self.k
        })
        return config

@tf.keras.utils.register_keras_serializable()    
class SPPCSPC(keras.layers.Layer):
    def __init__(self, filters, e=0.5, k=(5,9,13), **kwargs):
        # Accept name/trainable/dtype so that from_config() can rebuild the layer
        super(SPPCSPC, self).__init__(**kwargs)
        self.filters = filters
        self.e = e
        self.k = k
        c_ = int(2 * self.filters * self.e)
        self.cv1 = YoloConv(c_, 1, 1)
        self.cv2 = YoloConv(c_, 1, 1)
        self.cv3 = YoloConv(c_, 3, 1)
        self.cv4 = YoloConv(c_, 1, 1)
        self.m = [l.MaxPool2D(pool_size=x, strides=1, padding='same', data_format='channels_first') for x in k]
        self.cv5 = YoloConv(c_, 1, 1)
        self.cv6 = YoloConv(c_, 3, 1)
        self.cv7 = YoloConv(filters, 1, 1)
        self.concat = l.Concatenate(axis=1)
        
    def call(self, inputs, training):
        output1 = self.cv4(self.cv3(self.cv1(inputs, training), training), training)
        output2 = self.concat([output1] + [m(output1) for m in self.m])
        output2 = self.cv6(self.cv5(output2, training), training)
        output3 = self.cv2(inputs, training)
        output = self.cv7(self.concat([output2, output3]), training)
        return output
    
    def get_config(self):
        config = super(SPPCSPC, self).get_config()
        config.update({
            "filters": self.filters,
            "k": self.k,
            "e": self.e
        })
        return config

@tf.keras.utils.register_keras_serializable()
class Elan_A(keras.layers.Layer):
    def __init__(self, filters, **kwargs):
        # Accept name/trainable/dtype so that from_config() can rebuild the layer
        super(Elan_A, self).__init__(**kwargs)
        self.filters = filters
        self.cv1 = YoloConv(filters, 1, 1)
        self.cv2 = YoloConv(filters, 1, 1)
        self.cv3 = YoloConv(filters//2, 3, 1)
        self.cv4 = YoloConv(filters//2, 3, 1)
        self.cv5 = YoloConv(filters//2, 3, 1)
        self.cv6 = YoloConv(filters//2, 3, 1)
        self.cv7 = YoloConv(filters, 1, 1)
        self.concat = l.Concatenate(axis=1)
        
    def call(self, inputs, training):
        output1 = self.cv1(inputs, training)
        output2 = self.cv2(inputs, training)
        output3 = self.cv3(output2, training)
        output4 = self.cv4(output3, training)
        output5 = self.cv5(output4, training)
        output6 = self.cv6(output5, training)
        output7 = self.concat([output1, output2, output3, output4, output5, output6])
        output = self.cv7(output7, training)
        return output
    
    def get_config(self):
        config = super(Elan_A, self).get_config()
        config.update({
            "filters": self.filters,
        })
        return config

@tf.keras.utils.register_keras_serializable()
class RepConv(keras.layers.Layer):
    def __init__(self, filters, **kwargs):
        # Accept name/trainable/dtype so that from_config() can rebuild the layer
        super(RepConv, self).__init__(**kwargs)
        self.filters = filters
        self.cv1 = YoloConv(filters, 3, 1, activation=None)
        self.cv2 = YoloConv(filters, 1, 1, activation=None)
        self.swish = l.Activation('swish')
        
    def call(self, inputs, training):
        output1 = self.cv1(inputs, training)
        output2 = self.cv2(inputs, training)
        output = self.swish(output1+output2)
        return output
    
    def get_config(self):
        config = super(RepConv, self).get_config()
        config.update({
            "filters": self.filters,
        })
        return config

@tf.keras.utils.register_keras_serializable()
class IDetect(keras.layers.Layer):
    def __init__(self, shape, no, na, grids, **kwargs):
        # Accept name/trainable/dtype so that from_config() can rebuild the layer
        super(IDetect, self).__init__(**kwargs)
        #self.a = tf.random.normal((1,shape,1,1), mean=0.0, stddev=0.02, dtype=tf.dtypes.float16)
        self.a = tf.Variable(tf.random.normal((1,shape,1,1), mean=0.0, stddev=0.02, dtype=tf.dtypes.float16))
        self.m = tf.Variable(tf.random.normal((1,no*na,1,1), mean=0.0, stddev=0.02, dtype=tf.dtypes.float16))
        #self.a = keras.initializers.RandomNormal(mean=0., stddev=0.02)(shape=(1,shape,1,1))
        #self.m = keras.initializers.RandomNormal(mean=0., stddev=0.02)(shape=(1,no*na,1,1))
        self.cv = YoloConv(no*na, 1, 1, bias=True, activation=None)
        self.shape = shape
        self.no = no
        self.na = na
        self.grids = grids
        self.reshape = l.Reshape([self.na, self.no, self.grids*self.grids])
        #self.permute = l.Permute([1,3,4,2])
        self.permute = l.Permute([1,3,2])
        self.activation = l.Activation('linear', dtype='float32')
    
    def call(self, inputs, training):
        #output = l.Add()([inputs, self.a])
        output = inputs + self.a
        output = self.cv(output, training)
        output = self.m * output
        #output = self.cv(inputs)
        #output = tf.reshape(output, [-1, self.na, self.no, self.grids, self.grids])
        output = self.reshape(output)
        #output = tf.transpose(output, perm=[0,1,3,4,2])
        output = self.permute(output)
        output = self.activation(output)
        return output

    def get_config(self):
        config = super(IDetect, self).get_config()
        config.update({
            "no": self.no,
            "na": self.na,
            "grids": self.grids,
            "shape": self.shape
        })
        return config

def create_model():
    inputs = keras.Input(shape=(3, img_size, img_size))
    x = YoloConv(32, 3, 1)(inputs)    #[32, img_size, img_size]
    x = YoloConv(64, 3, 2)(x)         #[64, img_size/2, img_size/2]
    x = YoloConv(64, 3, 1)(x)         #[64, img_size/2, img_size/2]
    x = YoloConv(128, 3, 2)(x)        #[128, img_size/4, img_size/4]
    x = Elan(64)(x)                   #11
    x = MP(128)(x)                    #16
    route1 = Elan(128)(x)             #24
    x = MP(256)(route1)               #29
    route2 = Elan(256)(x)             #37
    x = MP(512)(route2)               #42
    x = Elan(256)(x)                  #50
    route3 = SPPCSPC(512)(x)          #51
    x = YoloConv(256, 1, 1)(route3)
    x = l.UpSampling2D(size=(2, 2), data_format='channels_first', interpolation='nearest')(x)
    x = l.Concatenate(axis=1)([x, YoloConv(256, 1, 1)(route2)])
    route4 = Elan_A(256)(x)           #63
    x = YoloConv(128, 1, 1)(route4)
    x = l.UpSampling2D(size=(2, 2), data_format='channels_first', interpolation='nearest')(x)
    x = l.Concatenate(axis=1)([x, YoloConv(128, 1, 1)(route1)])
    route5 = Elan_A(128)(x)           #75, Connect to Detector 1
    x = MP(128)(route5)  
    x = l.Concatenate(axis=1)([x, route4])
    route6 = Elan_A(256)(x)           #88, Connect to Detector 2
    x = MP(256)(route6)   
    x = l.Concatenate(axis=1)([x, route3])
    route7 = Elan_A(512)(x)           #101, Connect to Detector 3
    detect1 = RepConv(256)(route5)
    detect2 = RepConv(512)(route6)
    detect3 = RepConv(1024)(route7)
    output1 = IDetect(256, 85, 3, 80)(detect1)
    output2 = IDetect(512, 85, 3, 40)(detect2)
    output3 = IDetect(1024, 85, 3, 20)(detect3)
    output = l.Concatenate(axis=-2)([output1, output2, output3])
    output = l.Activation('linear', dtype='float32')(output)
    model = keras.Model(inputs=inputs, outputs=output, name="yolov7_model")
    return model
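The shape of the model output follows from the three IDetect heads. A quick sketch of the arithmetic, assuming img_size is 640 and the strides are 8, 16, and 32 (the grid sizes 80, 40, and 20 passed to IDetect above); the helper name `detection_rows` is hypothetical:

```python
def detection_rows(img_size=640, strides=(8, 16, 32)):
    """Return the grid size of each detection level and the total cells per anchor."""
    grids = [img_size // s for s in strides]
    # Each IDetect head is reshaped to [na, grid*grid, no]; the heads are then
    # concatenated along the grid-cell axis (axis=-2).
    return grids, sum(g * g for g in grids)


grids, rows = detection_rows()
print(grids, rows)
```

With na = 3 anchors and no = 85 values per prediction (4 box terms, 1 objectness, 80 classes), the output tensor has shape [batch, 3, 8400, 85], which matches the TensorSpec used by the prediction function in loss.py.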

Defining the loss function

For how YOLOv7 defines its loss, see my other article 解读YOLO v7的代码(三)损失函数 (Reading the YOLO v7 Code, Part 3: The Loss Function).
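As a reference point for the box term of the loss, here is a scalar, plain-Python sketch of CIoU for two boxes in (x_center, y_center, w, h) form. It mirrors the formula used by bbox_ciou in loss.py but is simplified for readability (single boxes instead of tensors, one eps term per denominator); the name `ciou_xywh` is hypothetical.

```python
import math


def ciou_xywh(box1, box2, eps=1e-7):
    """Complete IoU of two boxes given as (x_center, y_center, w, h)."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    b1_x1, b1_x2, b1_y1, b1_y2 = x1 - w1 / 2, x1 + w1 / 2, y1 - h1 / 2, y1 + h1 / 2
    b2_x1, b2_x2, b2_y1, b2_y2 = x2 - w2 / 2, x2 + w2 / 2, y2 - h2 / 2, y2 + h2 / 2

    # Plain IoU
    inter = (max(0.0, min(b1_x2, b2_x2) - max(b1_x1, b2_x1)) *
             max(0.0, min(b1_y2, b2_y2) - max(b1_y1, b2_y1)))
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared diagonal of the smallest box enclosing both boxes
    cw = max(b1_x2, b2_x2) - min(b1_x1, b2_x1)
    ch = max(b1_y2, b2_y2) - min(b1_y1, b2_y1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between the two box centers
    rho2 = (x2 - x1) ** 2 + (y2 - y1) ** 2

    # Aspect-ratio consistency term and its weight
    v = (4 / math.pi ** 2) * (math.atan(w2 / (h2 + eps)) - math.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (v - iou + 1 + eps)
    return iou - (rho2 / c2 + alpha * v)


same = ciou_xywh((0.5, 0.5, 2.0, 2.0), (0.5, 0.5, 2.0, 2.0))   # close to 1
apart = ciou_xywh((0.0, 0.0, 1.0, 1.0), (3.0, 0.0, 1.0, 1.0))  # negative
```

Identical boxes give a value close to 1, while boxes that do not overlap give a negative value driven by the center-distance penalty, so the box loss 1 - CIoU still provides a useful gradient when IoU is zero.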

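The implementation is in loss.py. I followed the approach of the YOLOv7 code and rewrote it for TensorFlow, wrapping the functions in tf.function to make the computation more efficient. The code is as follows: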

import tensorflow as tf
import math
from test1 import batch_size, na, nl, img_size, stride, balance
from test1 import loss_box, loss_obj, loss_cls
from test1 import batch_no_constant, anchor_no_constant, anchors_reshape, anchor_t, anchors_constant, layer_no_constant
from test1 import val_batch_no_constant, val_layer_no_constant
from util import *
from params import *

#In param: 
#    p - predictions of the model, list of three detection level.
#    labels - the label of the object, dimension [batch, boxnum, 5(class, xywh)]
#Out param:
#    results - list of the suggest positive samples for three detection level. 
#        dimension for each element: [sample_number, 5(batch_no, anch_no, x, y, class)]
#    anch - list of the anchor wh ratio for the positive samples
#        dimension for each element: [sample_number, anchor_w, anchor_h]
@tf.function(
    input_signature=(
        [tf.TensorSpec(shape=[batch_size, None, 5], dtype=tf.float32)]
    )
)
def tf_find_3_positive(labels):
    batch_no = tf.zeros_like(labels)[...,0:1] + batch_no_constant
    targets = tf.concat((batch_no, labels), axis=-1)    #targets dim [batch,box_num,6]
    targets = tf.reshape(targets, [batch_size, 1, -1, 6])   #targets dim [batch,1,box_num,6]
    targets = tf.tile(targets, [1,na,1,1])
    anchor_no = anchor_no_constant + tf.reshape(tf.zeros_like(batch_no), [batch_size, 1, -1, 1])
    targets = tf.concat([targets,anchor_no], axis=-1)   #targets dim [batch,na,box_num,7(batch_no, cls, xywh, anchor_no)]

    g = 0.5  # bias
    offsets = tf.expand_dims(tf.constant([[0.,0.], [-1.,0.], [0.,-1.], [1.,0.], [0.,1.]]), axis=0)  #offset dim [1,5,2]

    gain = tf.constant([[1.,1.,80.,80.,80.,80.,1.], [1.,1.,40.,40.,40.,40.,1.], [1.,1.,20.,20.,20.,20.,1.]])

    results = tf.TensorArray(tf.int32, size=nl, dynamic_size=False)
    anch = tf.TensorArray(tf.float32, size=nl, dynamic_size=False)

    for i in tf.range(nl):
        t = targets * tf.gather(gain, i)
        r = t[..., 4:6] / tf.gather(anchors_reshape, i)
        r_reciprocal = tf.math.reciprocal_no_nan(r)      #1/r
        r_max = tf.reduce_max(tf.math.maximum(r, r_reciprocal), axis=-1)
        mask_t = tf.logical_and(r_max<anchor_t, r_max>0)
        t = t[mask_t]
        # Offsets
        gxy = t[:, 2:4]  # grid xy
        #gxi = gain[[2, 3]] - gxy  # inverse    
        gxi = tf.gather(gain, i)[2:4] - gxy
        mask_xy = tf.concat([
            tf.ones([tf.shape(t)[0], 1], dtype=tf.bool),
            ((gxy % 1. < g) & (gxy > 1.)),
            ((gxi % 1. < g) & (gxi > 1.))
        ], axis=1)
        t = tf.repeat(tf.expand_dims(t, axis=1), 5, axis=1)[mask_xy]
        offsets_xy = (tf.expand_dims(tf.zeros_like(gxy, dtype=tf.float32), axis=1) + offsets)[mask_xy]
        xy = t[...,2:4] + offsets_xy
        from_which_layer = tf.ones_like(t[...,0:1]) * tf.dtypes.cast(i, tf.float32)
        results = results.write(i, tf.dtypes.cast(tf.concat([t[...,0:1], t[...,-1:], xy[...,1:2], xy[...,0:1], t[...,1:2], from_which_layer], axis=-1), tf.int32))
        anch = anch.write(i, tf.gather(tf.gather(anchors_constant, i), tf.dtypes.cast(t[...,-1], tf.int32)))
    return results.concat(), anch.concat()

@tf.function(
    input_signature=([
        tf.TensorSpec(shape=[None, 4], dtype=tf.float32),
        tf.TensorSpec(shape=[None, 4], dtype=tf.float32)
    ])
)
def box_iou(box1, box2):
    area1 = (box1[:,2]-box1[:,0])*(box1[:,3]-box1[:,1])
    area2 = (box2[:,2]-box2[:,0])*(box2[:,3]-box2[:,1])

    intersect_wh = tf.math.minimum(box1[:,None,2:], box2[:,2:]) - tf.math.maximum(box1[:,None,:2], box2[:,:2])
    intersect_wh = tf.clip_by_value(intersect_wh, clip_value_min=0, clip_value_max=img_size)
    intersect_area = intersect_wh[...,0]*intersect_wh[...,1]
    
    iou = intersect_area/(area1[:,None]+area2-intersect_area)
    return iou

@tf.function(
    input_signature=([
        tf.TensorSpec(shape=[None, 4], dtype=tf.float32),
        tf.TensorSpec(shape=[None, 4], dtype=tf.float32)
    ])
)
def bbox_ciou(box1, box2):
    eps=1e-7
    b1_x1, b1_x2 = box1[:,0]-box1[:,2]/2, box1[:,0]+box1[:,2]/2
    b1_y1, b1_y2 = box1[:,1]-box1[:,3]/2, box1[:,1]+box1[:,3]/2
    b2_x1, b2_x2 = box2[:,0]-box2[:,2]/2, box2[:,0]+box2[:,2]/2
    b2_y1, b2_y2 = box2[:,1]-box2[:,3]/2, box2[:,1]+box2[:,3]/2
    
    # Intersection area
    inter = tf.clip_by_value(
        tf.math.minimum(b1_x2, b2_x2) - tf.math.maximum(b1_x1, b2_x1), 
        clip_value_min=0, 
        clip_value_max=tf.float32.max) * tf.clip_by_value(
        tf.math.minimum(b1_y2, b2_y2) - tf.math.maximum(b1_y1, b2_y1), 
        clip_value_min=0, 
        clip_value_max=tf.float32.max)
    
    # Union Area
    w1, h1 = b1_x2 - b1_x1, b1_y2 - b1_y1 + eps
    w2, h2 = b2_x2 - b2_x1, b2_y2 - b2_y1 + eps
    union = w1 * h1 + w2 * h2 - inter + eps

    iou = inter / union
    
    cw = tf.math.maximum(b1_x2, b2_x2) - tf.math.minimum(b1_x1, b2_x1)  # convex (smallest enclosing box) width
    ch = tf.math.maximum(b1_y2, b2_y2) - tf.math.minimum(b1_y1, b2_y1)  # convex height
    
    c2 = cw ** 2 + ch ** 2 + eps  # convex diagonal squared
    rho2 = ((b2_x1 + b2_x2 - b1_x1 - b1_x2) ** 2 +
        (b2_y1 + b2_y2 - b1_y1 - b1_y2) ** 2) / 4  # center distance squared
    
    v = (4 / math.pi ** 2) * tf.math.pow(tf.math.atan(w2 / (h2 + eps)) - tf.math.atan(w1 / (h1 + eps)), 2)
    alpha = v / (v - iou + (1 + eps))
    return iou - (rho2 / c2 + v * alpha)

@tf.function(
    input_signature=([
        tf.TensorSpec(shape=[batch_size, na, None, 85], dtype=tf.float32),
        tf.TensorSpec(shape=[batch_size, None, 5], dtype=tf.float32)
    ])
)
def tf_build_targets(p, labels):
    results, anch = tf_find_3_positive(labels)

    #stride = tf.constant([8., 16., 32.])
    grids = tf.dtypes.cast(img_size/stride, tf.int32)

    pxyxys = tf.TensorArray(tf.float32, size=nl, dynamic_size=False)
    p_obj = tf.TensorArray(tf.float32, size=nl, dynamic_size=True, element_shape=[None, 1])
    p_cls = tf.TensorArray(tf.float32, size=nl, dynamic_size=False)
    all_idx = tf.TensorArray(tf.int32, size=nl, dynamic_size=False)
    from_which_layer = tf.TensorArray(tf.int32, size=nl, dynamic_size=False)
    all_anch = tf.TensorArray(tf.float32, size=nl, dynamic_size=False)
    
    matching_idxs = tf.TensorArray(tf.int32, size=batch_size, dynamic_size=False)
    matching_targets = tf.TensorArray(tf.float32, size=batch_size, dynamic_size=False)
    matching_anchs = tf.TensorArray(tf.float32, size=batch_size, dynamic_size=False)
    matching_layers = tf.TensorArray(tf.int32, size=batch_size, dynamic_size=False)

    for i in tf.range(nl):
        idx_mask = results[...,-1]==i
        idx = tf.boolean_mask(results, idx_mask)
        layer_mask = layer_no_constant[...,0]==i
        grid_no = tf.gather(grids, i)
        pl = tf.boolean_mask(p, layer_mask)
        pl = tf.reshape(pl, [batch_size, na, grid_no, grid_no, -1])
        pi = tf.gather_nd(pl, idx[...,0:4])
        anchors_p = tf.boolean_mask(anch, idx_mask)
        p_obj = p_obj.write(i, pi[...,4:5])
        p_cls = p_cls.write(i, pi[...,5:])
        gij = tf.dtypes.cast(tf.concat([idx[...,3:4], idx[...,2:3]], axis=-1), tf.float32)
        pxy = (tf.math.sigmoid(pi[...,:2])*2-0.5+gij)*tf.dtypes.cast(tf.gather(stride, i), tf.float32)
        pwh = (tf.math.sigmoid(pi[...,2:4])*2)**2*anchors_p*tf.dtypes.cast(tf.gather(stride, i), tf.float32)
        pxywh = tf.concat([pxy, pwh], axis=-1)
        pxyxy = xywh2xyxy(pxywh)
        pxyxys = pxyxys.write(i, pxyxy)
        all_idx = all_idx.write(i, idx[...,0:4])
        from_which_layer = from_which_layer.write(i, idx[..., -1:])
        all_anch = all_anch.write(i, tf.boolean_mask(anch, idx_mask))

    pxyxys = pxyxys.concat()
    p_obj = p_obj.concat()
    p_cls = p_cls.concat()
    all_idx = all_idx.concat()
    from_which_layer = from_which_layer.concat()
    all_anch = all_anch.concat()

    for i in tf.range(batch_size):
        batch_mask = all_idx[...,0]==i
        if tf.math.reduce_sum(tf.dtypes.cast(batch_mask, tf.int32)) > 0:
            pxyxy_i = tf.boolean_mask(pxyxys, batch_mask)
            target_mask = labels[i][...,3]>0
            target = tf.boolean_mask(labels[i], target_mask)
            txywh = target[...,1:] * img_size
            txyxy = xywh2xyxy(txywh)
            pair_wise_iou = box_iou(txyxy, pxyxy_i)
            pair_wise_iou_loss = -tf.math.log(pair_wise_iou + 1e-8)

            top_k, _ = tf.math.top_k(pair_wise_iou, tf.math.minimum(10, tf.shape(pair_wise_iou)[1]))
            dynamic_ks = tf.clip_by_value(
                tf.dtypes.cast(tf.math.reduce_sum(top_k, axis=-1), tf.int32),
                clip_value_min=1, 
                clip_value_max=10)

            gt_cls_per_image = tf.tile(
                tf.expand_dims(
                    tf.one_hot(
                        tf.dtypes.cast(target[...,0], tf.int32), nc),
                    axis = 1),
                [1,tf.shape(pxyxy_i)[0],1])

            num_gt = tf.shape(target)[0]
            cls_preds_ = (
                tf.math.sigmoid(tf.tile(tf.expand_dims(tf.boolean_mask(p_cls, batch_mask), 0), [num_gt, 1, 1])) *
                tf.math.sigmoid(tf.tile(tf.expand_dims(tf.boolean_mask(p_obj, batch_mask), 0), [num_gt, 1, 1])))    #dimension [labels_number, positive_targets_number, 80]
            y = tf.math.sqrt(cls_preds_)
            pair_wise_cls_loss = tf.math.reduce_sum(
                tf.nn.sigmoid_cross_entropy_with_logits(
                    labels = gt_cls_per_image,
                    logits = tf.math.log(y/(1-y))),
                axis = -1)

            cost = (
                pair_wise_cls_loss
                + 3.0 * pair_wise_iou_loss
            )

            matching_matrix = tf.zeros_like(cost)      #dimension [labels_number, positive_targets_number]

            matching_idx = tf.TensorArray(tf.int64, size=0, dynamic_size=True)
            for gt_idx in tf.range(num_gt):
                _, pos_idx = tf.math.top_k(
                    -cost[gt_idx], k=dynamic_ks[gt_idx], sorted=True)
                X,Y = tf.meshgrid(gt_idx, pos_idx)
                matching_idx = matching_idx.write(gt_idx, tf.dtypes.cast(tf.concat([X,Y], axis=-1), tf.int64))

            matching_idx = matching_idx.concat()
            '''
            matching_matrix = tf.scatter_nd(
                matching_idx, 
                tf.ones(tf.shape(matching_idx)[0]), 
                tf.dtypes.cast(tf.shape(cost), tf.int64))
            '''
            matching_matrix = tf.sparse.to_dense(
                tf.sparse.reorder(
                    tf.sparse.SparseTensor(
                        indices=tf.dtypes.cast(matching_idx, tf.int64), 
                        values=tf.ones(tf.shape(matching_idx)[0]), 
                        dense_shape=tf.dtypes.cast(tf.shape(cost), tf.int64))
                )
            )

            anchor_matching_gt = tf.reduce_sum(matching_matrix, axis=0)    #dimension [positive_targets_number]
            mask_1 = anchor_matching_gt>1     #it means one target match to several ground truths

            if tf.reduce_sum(tf.dtypes.cast(mask_1, tf.int32)) > 0:   #There is at least one positive target that predict several ground truth  
                #Get the lowest-cost ground truth among the several ground truths matched to the target
                #For example, there are 100 targets and 10 ground truths.
                #The #5 target matches the #2 and #3 ground truths; the related costs are 10 for #2 and 20 for #3
                #Then it will select the #2 ground truth for the #5 target.
                #mask_1 dimension [positive_targets_number]
                #tf.boolean_mask(cost, mask_1, axis=1), dimension [ground_truth_number, targets_predict_several_GT_number]
                cost_argmin = tf.math.argmin(
                    tf.boolean_mask(cost, mask_1, axis=1), axis=0)  #in above example, the cost_argmin is [2]
                m = tf.dtypes.cast(mask_1, tf.float32)
                _, target_indices = tf.math.top_k(
                    m, 
                    k=tf.dtypes.cast(tf.math.reduce_sum(m), tf.int32))  #in above example, the target_indices is [5]
                #So will set the index [2,5] of matching_matrix to 1, and set the other elements of [:,5] to 0
                target_matching_gt_indices = tf.concat(
                    [tf.reshape(tf.dtypes.cast(cost_argmin, tf.int32), [-1,1]), tf.reshape(target_indices, [-1,1])], 
                    axis=1)          
                matching_matrix = tf.multiply(
                    matching_matrix,
                    tf.repeat(tf.reshape(tf.dtypes.cast(anchor_matching_gt<=1, tf.float32), [1,-1]), tf.shape(cost)[0], axis=0))
                target_value = tf.sparse.to_dense(
                    tf.sparse.reorder(
                        tf.sparse.SparseTensor(
                            indices=tf.dtypes.cast(target_matching_gt_indices, tf.int64),
                            values=tf.ones(tf.shape(target_matching_gt_indices)[0]),
                            dense_shape=tf.dtypes.cast(tf.shape(matching_matrix), tf.int64)
                        )
                    )
                )
                matching_matrix = tf.add(matching_matrix, target_value)

            fg_mask_inboxes = tf.math.reduce_sum(matching_matrix, axis=0)>0.  #The mask for the targets that will use to predict
            if tf.shape(tf.boolean_mask(matching_matrix, fg_mask_inboxes, axis=1))[0]>0:
                matched_gt_inds = tf.math.argmax(tf.boolean_mask(matching_matrix, fg_mask_inboxes, axis=1), axis=0)  #Get the related gt number for the target

                all_idx_i = tf.boolean_mask(tf.boolean_mask(all_idx, batch_mask), fg_mask_inboxes)
                from_which_layer_i = tf.boolean_mask(tf.boolean_mask(from_which_layer, batch_mask), fg_mask_inboxes)
                all_anch_i = tf.boolean_mask(tf.boolean_mask(all_anch, batch_mask), fg_mask_inboxes)

                matching_idxs = matching_idxs.write(i, all_idx_i)
                matching_layers = matching_layers.write(i, from_which_layer_i)
                matching_anchs = matching_anchs.write(i, all_anch_i )
                matching_targets = matching_targets.write(i, tf.gather(target, matched_gt_inds))
            else:
                matching_idxs = matching_idxs.write(i, tf.constant([[-1,-1,-1,-1]], dtype=tf.int32))
                matching_layers = matching_layers.write(i, tf.constant([[-1]], dtype=tf.int32))
                matching_anchs = matching_anchs.write(i, tf.constant([[-1, -1]], dtype=tf.float32))
                matching_targets = matching_targets.write(i, tf.constant([[-1, -1, -1, -1, -1]], dtype=tf.float32))                                    
        
        else:
            matching_idxs = matching_idxs.write(i, tf.constant([[-1,-1,-1,-1]], dtype=tf.int32))
            matching_layers = matching_layers.write(i, tf.constant([[-1]], dtype=tf.int32))
            matching_anchs = matching_anchs.write(i, tf.constant([[-1, -1]], dtype=tf.float32))
            matching_targets = matching_targets.write(i, tf.constant([[-1, -1, -1, -1, -1]], dtype=tf.float32))
        
    matching_idxs = matching_idxs.concat()
    matching_layers = matching_layers.concat()
    matching_anchs = matching_anchs.concat()
    matching_targets = matching_targets.concat()
    filter_mask = matching_idxs[:,0]!=-1
    matching_idxs = tf.boolean_mask(matching_idxs, filter_mask)
    matching_layers = tf.boolean_mask(matching_layers, filter_mask)
    matching_anchs = tf.boolean_mask(matching_anchs, filter_mask)
    matching_targets = tf.boolean_mask(matching_targets, filter_mask)
    
    #return pxyxys, all_idx, matching_idx, matching_matrix, all_idx_i, cost, pair_wise_iou, from_which_layer_i
    return matching_idxs, matching_layers, matching_anchs, matching_targets

@tf.function(
    input_signature=([
        tf.TensorSpec(shape=[batch_size, na, None, 85], dtype=tf.float32),
        tf.TensorSpec(shape=[batch_size, None, 5], dtype=tf.float32)
    ])
)
def tf_loss_func(p, labels):
    matching_idxs, matching_layers, matching_anchs, matching_targets = tf_build_targets(p, labels)
    lcls, lbox, lobj = tf.zeros(1), tf.zeros(1), tf.zeros(1)
    
    grids = img_size//stride
    for i in tf.range(nl):
        layer_mask = layer_no_constant[...,0]==i
        grid = tf.gather(grids, i)
        pi = tf.reshape(tf.boolean_mask(p, layer_mask), [batch_size, na, grid, grid, -1])
        matching_layer_mask = matching_layers[:,0]==i
        if tf.reduce_sum(tf.dtypes.cast(matching_layer_mask, tf.int32))==0:
            continue
        m_idxs = tf.boolean_mask(matching_idxs, matching_layer_mask)
        if tf.shape(m_idxs)[0]==0:
            continue
        m_targets = tf.boolean_mask(matching_targets, matching_layer_mask)
        m_anchs = tf.boolean_mask(matching_anchs, matching_layer_mask)
        ps = tf.gather_nd(pi, m_idxs)
        pxy = tf.math.sigmoid(ps[:,:2])*2-0.5
        pwh = (tf.math.sigmoid(ps[:,2:4])*2)**2*m_anchs
        pbox = tf.concat([pxy,pwh], axis=-1)
        #selected_tbox = tf.gather_nd(labels, matching_targets[i])[:, 1:]
        selected_tbox = m_targets[:, 1:]
        selected_tbox = tf.multiply(selected_tbox, tf.dtypes.cast(grid, tf.float32))
        tbox_grid = tf.concat([
            tf.dtypes.cast(m_idxs[:,3:4], tf.float32),
            tf.dtypes.cast(m_idxs[:,2:3], tf.float32),
            tf.zeros((tf.shape(m_idxs)[0],2))], 
            axis=-1)
        selected_tbox = tf.subtract(selected_tbox, tbox_grid)
        iou = bbox_ciou(pbox, selected_tbox)
        lbox += tf.math.reduce_mean(1.0 - iou)  # iou loss

        # Objectness
        tobj = tf.sparse.to_dense(
            tf.sparse.reorder(
                tf.sparse.SparseTensor(
                    indices = tf.dtypes.cast(m_idxs, tf.int64),
                    values = (1.0 - gr) + gr * tf.clip_by_value(tf.stop_gradient(iou), clip_value_min=0, clip_value_max=tf.float32.max),
                    dense_shape = tf.dtypes.cast(tf.shape(pi[..., 0]), tf.int64)
                )
            ), validate_indices=False
        )

        # Classification

        tcls = tf.one_hot(
            indices = tf.dtypes.cast(m_targets[:,0], tf.int32),
            depth = 80,
            dtype = tf.float32
        )
        
        lcls += tf.math.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(
                labels = tcls,
                logits = ps[:, 5:]
            )
        )
        '''
        lcls += tf.math.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels = tf.dtypes.cast(m_targets[:,0], tf.int32),
                logits = ps[:, 5:]
            )    
        )
        '''
        obji = tf.math.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(
                labels = tobj,
                logits = pi[..., 4]
            )
        )

        lobj += obji * tf.gather(balance, i) 
        
    lbox *= loss_box
    lobj *= loss_obj
    lcls *= loss_cls

    loss = (lbox + lobj + lcls) * batch_size

    return loss

@tf.function(
    input_signature=([
        tf.TensorSpec(shape=[None, na, 8400, 85], dtype=tf.float32),
        tf.TensorSpec(shape=[None, None, 5], dtype=tf.float32),
        tf.TensorSpec(shape=[None, 2], dtype=tf.int32),
        tf.TensorSpec(shape=[None], dtype=tf.int32),
    ])
)
def tf_predict_func(predictions, labels, imgs_hw, imgs_id):
    grids = img_size // stride
    batch_size = tf.shape(predictions)[0]
    confidence_threshold = 0.2
    probabilty_threshold = 0.8
    all_predict_result = tf.TensorArray(tf.float32, size=nl, dynamic_size=False) 
    boxes_result = tf.TensorArray(tf.float32, size=0, dynamic_size=True) 
    imgs_info = tf.TensorArray(tf.int32, size=0, dynamic_size=True)
    for i in tf.range(nl):
        grid = tf.gather(grids, i)
        grid_x, grid_y = tf.meshgrid(tf.range(grid, dtype=tf.float32), tf.range(grid, dtype=tf.float32))
        grid_x = tf.reshape(grid_x, [-1, 1])
        grid_y = tf.reshape(grid_y, [-1, 1])
        #grid_xy = tf.concat([grid_y, grid_x], axis=-1)
        grid_xy = tf.concat([grid_x, grid_y], axis=-1)
        grid_xy = tf.reshape(grid_xy, [1,1,-1,2])
        layer_mask = val_layer_no_constant[...,0]==i
        #grid = tf.gather(grids, i)
        predict_layer = tf.boolean_mask(predictions, layer_mask)
        predict_layer = tf.reshape(predict_layer, [batch_size, na, -1, 85])
        predict_conf = tf.math.sigmoid(predict_layer[...,4:5])
        predict_xy = (tf.math.sigmoid(predict_layer[...,:2])*2-0.5 + \
            tf.dtypes.cast(grid_xy,tf.float32))*tf.dtypes.cast(tf.gather(stride, i), tf.float32)
        predict_wh = (tf.math.sigmoid(predict_layer[...,2:4])*2)**2*\
            tf.reshape(tf.gather(anchors_constant,i), [1,na,1,2])*\
            tf.dtypes.cast(tf.gather(stride, i), tf.float32)
        predict_xywh = tf.concat([predict_xy, predict_wh], axis=-1)
        predict_xyxy = xywh2xyxy(predict_xywh)
        predict_cls = tf.reshape(tf.argmax(predict_layer[...,5:], axis=-1), [batch_size, na, -1, 1])
        predict_cls = tf.dtypes.cast(predict_cls, tf.float32)
        predict_proba = tf.nn.sigmoid(
            tf.reduce_max(
                predict_layer[...,5:], axis=-1, keepdims=True
            )
        )
        batch_no = tf.expand_dims(tf.tile(tf.gather(val_batch_no_constant, tf.range(batch_size)), [1,na,grid*grid]), -1)
        predict_result = tf.concat([batch_no, predict_conf, predict_xyxy, predict_cls, predict_proba], axis=-1)
        mask = tf.math.logical_and(
            predict_result[...,1]>=confidence_threshold,
            predict_result[...,-1]>=probabilty_threshold
        )
        predict_result = tf.boolean_mask(predict_result, mask)
        #tf.print(tf.shape(predict_result))
        if tf.shape(predict_result)[0] > 0:
            all_predict_result = all_predict_result.write(i, predict_result)
            #tf.print(tf.shape(predict_result))
        else:
            # Keep this TensorArray slot filled; the all-zero row is dropped later by the conf>0 mask.
            all_predict_result = all_predict_result.write(i, tf.zeros(shape=[1,8]))
    all_predict_result = all_predict_result.concat()
    #return all_predict_result
          
    for i in tf.range(batch_size):
        batch_mask = tf.math.logical_and(
            all_predict_result[...,0]==tf.dtypes.cast(i, tf.float32),
            all_predict_result[...,1]>0
        )
        predict_true_box = tf.boolean_mask(all_predict_result, batch_mask)
        if tf.shape(predict_true_box)[0]==0:
            continue
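        original_hw = tf.dtypes.cast(tf.gather(imgs_hw, i), tf.float32)
        # Scale factor back to the original image size (the long side was resized to img_size).
        # No padding offset is subtracted here, so the padding is assumed to be at the bottom/right.
        ratio = tf.dtypes.cast(tf.reduce_max(original_hw/img_size), tf.float32)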
        predict_classes, _ = tf.unique(predict_true_box[:,6])
        #predict_classes_list = tf.unstack(predict_classes)
        #for class_id in predict_classes_list:
        for j in tf.range(tf.shape(predict_classes)[0]):
            #class_mask = tf.math.equal(predict_true_box[:, 6], class_id)
            class_mask = tf.math.equal(predict_true_box[:, 6], tf.gather(predict_classes, j))
            predict_true_box_class = tf.boolean_mask(predict_true_box, class_mask)
            predict_true_box_xy = predict_true_box_class[:, 2:6]
            predict_true_box_score = predict_true_box_class[:, 7]*predict_true_box_class[:, 1]
            #predict_true_box_score = predict_true_box_class[:, 1]
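            # tf.image.non_max_suppression documents boxes as [y1, x1, y2, x2]; passing
            # [x1, y1, x2, y2] gives the same result because IoU is unchanged when x and y are swapped.
            selected_indices = tf.image.non_max_suppression(
                predict_true_box_xy,
                predict_true_box_score,
                100,
                iou_threshold=0.2
                #score_threshold=confidence_threshold
            )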
            #Shape [box_num, 8]: batch_no, conf, x1, y1, x2, y2, cls, proba
            selected_boxes = tf.gather(predict_true_box_class, selected_indices) 
            #boxes_result = boxes_result.write(boxes_result.size(), selected_boxes)
            boxes_xyxy = selected_boxes[:,2:6]*ratio
            boxes_x1 = tf.clip_by_value(boxes_xyxy[:,0:1], 0., original_hw[1])
            boxes_x2 = tf.clip_by_value(boxes_xyxy[:,2:3], 0., original_hw[1])
            boxes_y1 = tf.clip_by_value(boxes_xyxy[:,1:2], 0., original_hw[0])
            boxes_y2 = tf.clip_by_value(boxes_xyxy[:,3:4], 0., original_hw[0])
            boxes_w = boxes_x2 - boxes_x1
            boxes_h = boxes_y2 - boxes_y1
            boxes = tf.concat([selected_boxes[:,0:2], boxes_x1, boxes_y1, boxes_w, boxes_h, selected_boxes[:,6:8]], axis=-1)
            boxes_result = boxes_result.write(boxes_result.size(), boxes)
        img_id = tf.gather(imgs_id, i)
        imgs_info = imgs_info.write(imgs_info.size(), tf.reshape(tf.stack([i, img_id]), [-1,2]))
    if boxes_result.size()==0:
        boxes_result = boxes_result.write(0, tf.zeros(shape=[1,8]))
    if imgs_info.size()==0:
        imgs_info = imgs_info.write(0, tf.dtypes.cast(tf.zeros(shape=[1,2]), tf.int32))

    return boxes_result.concat(), imgs_info.concat()
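
Both `tf_loss_func` and `tf_predict_func` above decode the raw network outputs in the same way: the xy offset is `sigmoid(t)*2 - 0.5` and the width/height is `(sigmoid(t)*2)**2` times the anchor. The framework-free sketch below works through that arithmetic for a single prediction, and then maps a box from the 640-pixel input back to the original image as `tf_predict_func` does. The function names, the anchor size and the sample values are illustrative assumptions and are not part of the code above.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(raw, cell_xy, anchor_wh, stride):
    """Decode one raw prediction (tx, ty, tw, th) into pixel-space (cx, cy, w, h).

    cell_xy is the (column, row) index of the grid cell, anchor_wh is the anchor
    size in grid units, and stride is the downsampling factor of the layer.
    """
    tx, ty, tw, th = raw
    cx = (sigmoid(tx) * 2 - 0.5 + cell_xy[0]) * stride   # offset lies in (-0.5, 1.5)
    cy = (sigmoid(ty) * 2 - 0.5 + cell_xy[1]) * stride
    w = (sigmoid(tw) * 2) ** 2 * anchor_wh[0] * stride   # scale lies in (0, 4)
    h = (sigmoid(th) * 2) ** 2 * anchor_wh[1] * stride
    return cx, cy, w, h

def rescale_to_original(xyxy, original_hw, img_size=640):
    """Map an (x1, y1, x2, y2) box from the resized input to the original image.

    Returns (x, y, w, h) with the top-left corner, which is the COCO box layout.
    Assumes the long side was resized to img_size with padding at the bottom/right.
    """
    h, w = float(original_hw[0]), float(original_hw[1])
    ratio = max(h, w) / img_size
    x1 = min(max(xyxy[0] * ratio, 0.0), w)
    y1 = min(max(xyxy[1] * ratio, 0.0), h)
    x2 = min(max(xyxy[2] * ratio, 0.0), w)
    y2 = min(max(xyxy[3] * ratio, 0.0), h)
    return x1, y1, x2 - x1, y2 - y1

# Zero logits place the centre in the middle of the cell and keep the anchor size.
print(decode_box((0.0, 0.0, 0.0, 0.0), (10, 5), (1.5, 2.0), 8))  # (84.0, 44.0, 12.0, 16.0)
# A box that overflows the right edge is clipped to the original width.
print(rescale_to_original((100.0, 50.0, 700.0, 300.0), (960, 1280)))
```

Because the xy offset can reach 0.5 of a cell beyond the cell borders, a target can be matched to neighbouring cells as well as to the one that contains its centre.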

Training and validation

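The last step is to train and validate the model. Training follows the same procedure as the YOLOv7 implementation, and validation uses pycocotools to compute mAP. See train.py for the details.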
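
As a rough illustration of the validation step, the sketch below turns the two outputs of `tf_predict_func` (rows of `[batch_no, conf, x, y, w, h, cls, proba]` and rows of `[batch_no, image_id]`) into COCO result dictionaries and passes them to pycocotools. It is not taken from train.py: the function names, the use of `conf * proba` as the score, and the `category_ids` lookup table are assumptions made for this example.

```python
def to_coco_results(boxes, imgs_info, category_ids):
    """Convert detection rows into COCO-style result dicts.

    boxes:        rows of [batch_no, conf, x, y, w, h, cls, proba]
    imgs_info:    rows of [batch_no, image_id]
    category_ids: list mapping the 0-based class index to a COCO category id
                  (hypothetical argument, supplied by the caller)
    """
    batch_to_image = {int(b): int(img_id) for b, img_id in imgs_info}
    results = []
    for row in boxes:
        batch_no, conf, x, y, w, h, cls, proba = (float(v) for v in row)
        if conf <= 0.0:
            continue  # skip the all-zero placeholder row written when nothing was detected
        results.append({
            "image_id": batch_to_image[int(batch_no)],
            "category_id": category_ids[int(cls)],
            "bbox": [round(x, 3), round(y, 3), round(w, 3), round(h, 3)],
            "score": round(conf * proba, 5),  # assumption: same score as used for NMS
        })
    return results

def evaluate_bbox(annotation_file, results):
    """Run the standard COCO bbox evaluation and return the 12 summary numbers."""
    # Imported lazily so the conversion above works without pycocotools installed.
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    if not results:
        return None  # loadRes cannot handle an empty result list
    coco_gt = COCO(annotation_file)
    coco_dt = coco_gt.loadRes(results)
    coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
    # Restrict the evaluation to the images that actually have predictions.
    coco_eval.params.imgIds = sorted({r["image_id"] for r in results})
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()
    return coco_eval.stats
```

The category ids in the COCO annotation files are not contiguous, so the 0 to 79 class index predicted by the model has to be translated through a table such as `category_ids` before evaluation. Since `tf_predict_func` numbers images within one batch, the conversion would be called once per batch and the resulting lists concatenated before calling `evaluate_bbox`.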

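Because the model is trained on 640×640 images, it needs a lot of GPU memory. On my local 2080 Ti with 11 GB of video memory, even with mixed precision enabled, I could only use a batch size of 8, and the training results were not very good. I therefore rented a V100 GPU with 32 GB of video memory on the AutoDL platform (2.28 yuan per hour) and set the batch size to 32. In my experience the batch size has a fairly large effect on how well the model trains. In the end I trained for a little over 20 epochs, each taking slightly more than an hour, for about one day in total. The results are as follows: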

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.270
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.411
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.289
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.162
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.302
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.334
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.476
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.528
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.338
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.576
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.661

Below are the predictions for some images from the validation set: