Project (8) — YOLOv5 Visible + Infrared Dual-Modality Fusion (Code)

Contents

Introduction

1. Data-level fusion
  1.1 Direct weighted fusion
  1.2 TIF fusion method

2. Result-level fusion
  2.1 Overall code framework
  2.2 Calibration parameters
  2.3 Selecting the best detection boxes
    2.3.1 Merging detection boxes
    2.3.2 Removing duplicate detection boxes
  2.4 Result visualization
    2.4.1 Monocular distance estimation
    2.4.2 Visualizing the result parameters

3. Feature-level fusion
  3.1 Building the dual-modality dataset
  3.2 Feeding in dual-modality data
    3.2.1 dataset.py
    3.2.2 Displaying train_batch images
  3.3 Modifying the backbone
    3.3.1 The yaml parsing function
    3.3.2 backbone
  3.4 Modifying forward
  3.5 Debugging and running


Introduction

Dual-modality detection network: SLBAF-Net

https://pan.baidu.com/s/1WjRiUCXS9qid31dPBka8kg?pwd=34pj 

Extraction code: 34pj

Open-source code:

git@github.com:huashu996/SLBAF-Net.git

If this is useful to you, please consider citing the paper and giving the code a star — many thanks. I have also set up a research discussion QQ group (772356582); everyone is welcome to join.

At present there are three main ways to fuse multimodal data: early fusion (data-level fusion), late fusion (decision-level fusion), and intermediate fusion.

Early fusion combines several independent data sources into a single feature vector, which is then fed to a machine-learning classifier. Because early fusion often fails to exploit the complementarity between modalities, and the raw data usually contain a lot of redundancy, it is commonly paired with feature-extraction methods such as principal component analysis (PCA), maximum relevance minimum redundancy (mRMR), or autoencoders to remove that redundancy.

Late fusion combines the output scores (decisions) of classifiers trained separately on each modality. The advantage is that the errors of the fused model come from different classifiers, and errors from different classifiers tend to be uncorrelated, so they do not accumulate. Common late-fusion schemes include max fusion, average fusion, Bayes'-rule-based fusion, and ensemble learning; ensemble learning is the classic late-fusion approach and is widely used in communications, pattern recognition, speech recognition, and other fields.

Intermediate fusion first converts each modality into a high-dimensional feature representation and then fuses them in an intermediate layer of the model. With neural networks, the raw data are first mapped into a high-dimensional feature space, where the commonality between modalities is captured. A major advantage of intermediate fusion is that the fusion point can be chosen flexibly.
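
As a toy illustration of intermediate fusion (not the network used later in this post), two small convolutional stems can each encode one modality, and their feature maps can be concatenated along the channel dimension before a shared head; this is only a minimal sketch:

import torch
import torch.nn as nn

class ToyMidFusion(nn.Module):
    # Toy example: one conv stem per modality, features fused by channel concatenation
    def __init__(self):
        super().__init__()
        self.rgb_stem = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.ir_stem = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 8, 1)  # shared head working on the fused features

    def forward(self, rgb, ir):
        f = torch.cat([self.rgb_stem(rgb), self.ir_stem(ir)], dim=1)  # fuse in feature space
        return self.head(f)

x_rgb, x_ir = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
print(ToyMidFusion()(x_rgb, x_ir).shape)  # torch.Size([1, 8, 32, 32])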

For object detection the fusion likewise falls into these three approaches; the ideas and the corresponding code for each are given below. Fusion presupposes that the sensors are calibrated; if you do not know how to calibrate them, see the link below. This post works directly with calibrated images.

Visible-light and infrared camera calibration — OrigamiSun's blog (CSDN)

1. Data-level fusion

Engineering difficulty: easy
Fuse the visible and infrared images first and then feed the fused image into the network for training; the network structure does not need to change. Before fusion, the visible and infrared cameras must be calibrated so that the two images can be aligned.

1.1 Direct weighted fusion

# -*- coding: utf-8 -*-

try:
    import cv2
except:
    import sys
    sys.path.remove('/opt/ros/kinetic/lib/python2.7/dist-packages')
    import cv2

import numpy as np
import os
import time


def weight_half_algo(p1_path, p2_path, weight=0.5):
    # weighted average of the two images
    p1 = cv2.imread(p1_path, cv2.IMREAD_COLOR)  # read as a color image
    p2 = cv2.imread(p2_path, cv2.IMREAD_COLOR)  # .astype(np.float64)
    p1 = p1.astype(np.float64)  # convert to a float matrix (np.float is deprecated)
    p2 = p2.astype(np.float64)  # convert to a float matrix
    p = weight * p1 + (1 - weight) * p2
    return p


def flist():
    """
    对文件夹下的示例图像批量进行计算。结果写入文件夹 rootdir_Res 中
    :return:
    """
    rootdir_IR = r'D:\Project\Python\ImageFusion\VIFB-master\input\IR'  # 红外图像存放路径
    rootdir_VI = r'D:\Project\Python\ImageFusion\VIFB-master\input\VI'  # 可见光图像存放路径
    rootdir_Res = r'D:\Project\Python\ImageFusion\VIFB-master\Res'  # TIF算法处理后的图像存放路径
    rootdir_Res_weight = r'D:\Project\Python\ImageFusion\VIFB-master\Res_weight'  # 平均加权算法处理后的图像存放路径
    fflist = os.listdir(rootdir_IR)  # 列出文件夹下所有的目录与文件
    # print(fflist)
    for i in range(0, len(fflist)):
        path1 = os.path.join(rootdir_IR, fflist[i])
        path2 = os.path.join(rootdir_VI, fflist[i])
        if os.path.isfile(path1) and os.path.isfile(path2):
            p = weight_half_algo(path1, path2)  # fuse by averaging the two images
            cv2.imwrite(os.path.join(rootdir_Res_weight, fflist[i]), p)


if __name__ == '__main__':
    # record the start time
    time_start = time.time()
    # image 1 is visible light; image 2 is infrared
    flist()
    time_end = time.time()
    #cv2.imwrite('./final_res.jpg', p)  
    print('Elapsed time:', time_end - time_start)

Here the weight is set manually to a fixed value, so the fusion quality is limited.

(Figure: result of fixed-weight fusion)
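
For a quick check, the same fixed-weight blend can also be done with OpenCV's built-in cv2.addWeighted, which avoids the manual float conversion (a minimal sketch; the image paths are just examples):

import cv2

# cv2.addWeighted computes weight*p1 + (1-weight)*p2 + 0 directly on the uint8 images
p1 = cv2.imread('./rgb/000000.jpg', cv2.IMREAD_COLOR)   # visible image (example path)
p2 = cv2.imread('./t/000000.jpg', cv2.IMREAD_COLOR)     # infrared image (example path)
fused = cv2.addWeighted(p1, 0.5, p2, 0.5, 0)
cv2.imwrite('./fused_000000.jpg', fused)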

1.2 TIF fusion method

The TIF algorithm splits each image into a base layer and a detail layer and then adds them back together with weights. The base layer is the image after mean filtering (the paper uses a 35×35 kernel); the detail layer is the original image minus the base layer. The base layers are fused with a fixed weight of 0.5, while the detail-layer weights are computed from a formula and are therefore dynamic.

# -*- coding: utf-8 -*-

try:
    import cv2
except:
    import sys
    sys.path.remove('/opt/ros/kinetic/lib/python2.7/dist-packages')
    import cv2

import numpy as np
import os
import time


def TIF_algo(p1, p2, median_blur_value=3, mean_blur_value=35):
    """
    Fuse two images with the TIF method
    :param p1: image 1 (visible)
    :param p2: image 2 (infrared)
    :param median_blur_value: median filter kernel size
    :param mean_blur_value: mean filter kernel size
    :return: fused image
    """
    # median_blur_value = 3  # median filter coefficient
    # mean_blur_value = 35  # mean filter coefficient

    # base layer: the image after (35, 35) mean filtering
    p1_b = cv2.blur(p1, (mean_blur_value, mean_blur_value))
    p1_b = p1_b.astype(np.float64)  # convert to a float matrix
    p2_b = cv2.blur(p2, (mean_blur_value, mean_blur_value))
    p2_b = p2_b.astype(np.float64)  # convert to a float matrix
    # cv2.imshow('picture after mean blur p1_b', p1_b)
    # cv2.imshow('picture after mean blur p2_b', p2_b)
    # detail layer: original image minus the base layer
    # p1_d = abs(p1.astype(np.float64) - p1_b)
    # p2_d = abs(p2.astype(np.float64) - p2_b)
    p1_d = p1.astype(np.float64) - p1_b
    p2_d = p2.astype(np.float64) - p2_b
    # cv2.imshow('detail layer p1', p1_d / 255.0)
    # cv2.imshow('detail layer p2', p2_d / 255.0)
    # original images after median filtering
    p1_after_medianblur = cv2.medianBlur(p1, median_blur_value)
    p2_after_medianblur = cv2.medianBlur(p2, median_blur_value)
    # convert to float for the computations below
    p1_after_medianblur = p1_after_medianblur.astype(np.float64)
    p2_after_medianblur = p2_after_medianblur.astype(np.float64)

    # difference between the median-filtered and mean-filtered images
    p1_subtract_from_median_mean = p1_after_medianblur - p1_b + 0.01  # add 0.01 so the result is never NaN
    p2_subtract_from_median_mean = p2_after_medianblur - p2_b + 0.01
    # cv2.imshow('subtract_from_median_mean  p1_subtract_from_median_mean', p1_subtract_from_median_mean/255.0)
    # cv2.imshow('subtract_from_median_mean  p2_subtract_from_median_mean', p2_subtract_from_median_mean/255.0)
    m1 = p1_subtract_from_median_mean[:, :, 0]
    m2 = p1_subtract_from_median_mean[:, :, 1]
    m3 = p1_subtract_from_median_mean[:, :, 2]
    res = m1 * m1 + m2 * m2 + m3 * m3
    # delta1 = np.sqrt(res)
    delta1 = res
    m1 = p2_subtract_from_median_mean[:, :, 0]
    m2 = p2_subtract_from_median_mean[:, :, 1]
    m3 = p2_subtract_from_median_mean[:, :, 2]
    res = m1 * m1 + m2 * m2 + m3 * m3
    # delta2 = np.sqrt(res)  # use the root of the squared sum as the weight
    # delta2 = res  # use the squared sum as the weight
    delta2 = abs(m1)  # image 2 (infrared) is grayscale, so use the pixel difference directly as the weight

    delta_total = delta1 + delta2  # denominator

    psi_1 = delta1 / delta_total
    psi_2 = delta2 / delta_total
    psi1 = np.zeros(p1.shape, dtype=np.float64)
    psi2 = np.zeros(p2.shape, dtype=np.float64)
    psi1[:, :, 0] = psi_1
    psi1[:, :, 1] = psi_1
    psi1[:, :, 2] = psi_1
    psi2[:, :, 0] = psi_2
    psi2[:, :, 1] = psi_2
    psi2[:, :, 2] = psi_2
    # fuse the base layers
    p_b = 0.5 * (p1_b + p2_b)
    # fuse the detail layers
    p_d = psi1 * p1_d + psi2 * p2_d
    # overall fusion
    p = p_b + p_d
    return p

if __name__ == '__main__':
    # record the start time
    time_start = time.time()
    # image 1 is visible light; image 2 is infrared
    #flist()

    p1_path = './rgb/000000.jpg'  # visible image
    p1 = cv2.imread(p1_path, cv2.IMREAD_COLOR)  # .astype(np.float64)
    p2_path = './t/000000.jpg'  # infrared image
    p2 = cv2.imread(p2_path, cv2.IMREAD_COLOR)  # .astype(np.float64)
    p = TIF_algo(p1, p2)
    time_end = time.time()
    cv2.imwrite('./dy_weight/000000.jpg', p)
    
    print('Elapsed time:', time_end - time_start)

You can see that the fused image preserves detail much more clearly.

(Figure: TIF fusion result)

The fused images are then fed into the neural network for training.

2. Result-level fusion

Engineering difficulty: medium

Train two networks separately to obtain two sets of weights. Turn detect into a ROS node with two callbacks, each loading one set of model weights for detection; the two sets of results are then fused and the best ones are kept.

2.1 Overall code framework

The main changes for the dual-modality version are:

1. Two detectors

2. How to select the best bounding boxes

3. Visualizing the results

# -*- coding: UTF-8 -*-

import os
import sys
import rospy
import numpy as np
import time
import math
import matplotlib.pyplot as plt
import threading
import argparse

from std_msgs.msg import Header
from sensor_msgs.msg import Image

try:
    import cv2
except ImportError:
    import sys
    sys.path.remove('/opt/ros/kinetic/lib/python2.7/dist-packages')
    import cv2

from yolov5_detector import Yolov5Detector, draw_predictions
from mono_estimator import MonoEstimator
from functions import get_stamp, publish_image
from functions import display, print_info
from functions import simplified_nms

parser = argparse.ArgumentParser(
    description='Demo script for dual modal perception')
parser.add_argument('--print', action='store_true',
    help='Whether to print and record infos.')
parser.add_argument('--sub_image1', default='/pub_rgb', type=str,
    help='The image topic to subscribe.')
parser.add_argument('--sub_image2', default='/pub_t', type=str,
    help='The image topic to subscribe.')
parser.add_argument('--pub_image', default='/result', type=str,
    help='The image topic to publish.')
parser.add_argument('--calib_file', default='../conf/calibration_image.yaml', type=str,
    help='The calibration file of the camera.')
parser.add_argument('--modality', default='RGBT', type=str,
    help='The modality to use. This should be `RGB`, `T` or `RGBT`.')
parser.add_argument('--indoor', action='store_true',
    help='Whether to use INDOOR detection mode.')
parser.add_argument('--frame_rate', default=10, type=int,
    help='Working frequency.')
parser.add_argument('--display', action='store_true',
    help='Whether to display and save all videos.')
args = parser.parse_args()
# Acquire the lock before the callback runs and release it afterwards, so that only one thread holds it at a time.
# This is what keeps the RGB and thermal frames matched.
image1_lock = threading.Lock()  # create the locks
image2_lock = threading.Lock()

# 3.1 Get the timestamp of the RGB image and convert its format
#×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
def image1_callback(image): 
    global image1_stamp, image1_frame  # threads share resources, so use globals
    image1_lock.acquire()  # acquire the lock
    image1_stamp = get_stamp(image.header)  # get the timestamp
    image1_frame = np.frombuffer(image.data, dtype=np.uint8).reshape(image.height, image.width, -1)  # convert the image format
    image1_lock.release()  # release the lock
# 3.2 Get the timestamp of the infrared image and convert its format
#×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
def image2_callback(image):
    global image2_stamp, image2_frame
    image2_lock.acquire()
    image2_stamp = get_stamp(image.header)
    image2_frame = np.frombuffer(image.data, dtype=np.uint8).reshape(image.height, image.width, -1)
    image2_lock.release()

# 5.1 Timer callback: grab the latest RGB and thermal frames
#×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
def timer_callback(event):
    # get the RGB timestamp and frame
    global image1_stamp, image1_frame
    image1_lock.acquire()
    cur_stamp1 = image1_stamp
    cur_frame1 = image1_frame.copy()
    image1_lock.release()
    # get the thermal timestamp and frame
    global image2_stamp, image2_frame
    image2_lock.acquire()
    cur_stamp2 = image2_stamp
    cur_frame2 = image2_frame.copy()
    image2_lock.release()
    
    global frame
    frame += 1
    start = time.time()
    # 5.2 Run the detectors
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    if args.indoor:  # use the stock yolov5 weights
        labels, scores, boxes = detector.run(
            cur_frame1, conf_thres=0.50, classes=[0]
        ) # person
    else:
        if args.modality.lower() == 'rgb':  # RGB weights only
            labels, scores, boxes = detector1.run(
                cur_frame1, conf_thres=0.50, classes=[0, 1, 2, 3, 4]
            ) # pedestrian, cyclist, car, bus, truck
            
        elif args.modality.lower() == 't':  # thermal weights only
            labels, scores, boxes = detector2.run(
                cur_frame2, conf_thres=0.50, classes=[0, 1, 2, 3, 4]
            ) # pedestrian, cyclist, car, bus, truck
            
        elif args.modality.lower() == 'rgbt':  # use both modalities
            # RGB predictions: labels, confidence scores, boxes
            labels1, scores1, boxes1 = detector1.run(
                cur_frame1, conf_thres=0.50, classes=[0, 1, 2, 3, 4]
            ) # pedestrian, cyclist, car, bus, truck
            #print("rgb", labels1, scores1, boxes1)
            # thermal predictions
            labels2, scores2, boxes2 = detector2.run(
                cur_frame2, conf_thres=0.50, classes=[0, 1, 2, 3, 4]
            ) # pedestrian, cyclist, car, bus, truck
            #print("T", labels2, scores2, boxes2)
            # merge the results
            labels = labels1 + labels2  # concatenate the label lists
            #print("labels", labels)
            scores = scores1 + scores2  # concatenate the score lists
            #print("scores", scores)
            if boxes1.shape[0] > 0 and boxes2.shape[0] > 0:  # both the visible and the infrared detector fired
                boxes = np.concatenate([boxes1, boxes2], axis=0)  # stack the two sets of boxes
                #print("boxes", boxes)
                # remove duplicate boxes
                indices = simplified_nms(boxes, scores)
                labels, scores, boxes = np.array(labels)[indices], np.array(scores)[indices], boxes[indices]
                #print("result", labels, scores, boxes)
            elif boxes1.shape[0] > 0:  # only the visible detector fired
                boxes = boxes1
                #print("boxes", boxes)
            elif boxes2.shape[0] > 0:  # only the infrared detector fired
                boxes = boxes2
                #print("boxes", boxes)
            else:   # neither detector fired
                boxes = np.array([])
                #print("boxes", boxes)
        else:
            raise ValueError("The modality must be 'RGB', 'T' or 'RGBT'.")
    labels_temp = labels.copy()
    labels = []
    for i in labels_temp:
        labels.append(i if i not in ['pedestrian', 'cyclist'] else 'person')
    print("boxes",boxes)
    # 5.3 Monocular distance estimation
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    locations = mono.estimate(boxes)  # world coordinates of each box
    print("locations", locations)
    indices = [i for i in range(len(locations)) if locations[i][1] > 0 and locations[i][1] < 200]
    labels, scores, boxes, locations = \
        np.array(labels)[indices], np.array(scores)[indices], boxes[indices], np.array(locations)[indices]
    distances = [(loc[0] ** 2 + loc[1] ** 2) ** 0.5 for loc in locations]  # estimated distances
    print("distances", distances)
    cur_frame1 = cur_frame1[:, :, ::-1].copy() # to BGR
    cur_frame2 = cur_frame2[:, :, ::-1].copy() # to BGR
    # 5.4 Draw the detection boxes
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    for i in reversed(np.argsort(distances)):  # walk the sorted indices from the farthest target to the nearest
        cur_frame1 = draw_predictions(
            cur_frame1, str(labels[i]), float(scores[i]), boxes[i], location=locations[i]
        )
        cur_frame2 = draw_predictions(
            cur_frame2, str(labels[i]), float(scores[i]), boxes[i], location=locations[i]
        )
    # 5.5 Publish the image
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    result_frame = np.concatenate([cur_frame1, cur_frame2], axis=1)  # concatenate the two frames side by side
    
    if args.display:
        if not display(result_frame, v_writer, win_name='result'):
            print("\nReceived the shutdown signal.\n")
            rospy.signal_shutdown("Everything is over now.")
    result_frame = result_frame[:, :, ::-1] # to RGB
    publish_image(pub, result_frame)
    delay = round(time.time() - start, 3)
    
    if args.print:
        print_info(frame, cur_stamp1, delay, labels, scores, boxes, locations, file_name)

if __name__ == '__main__':
    # initialize the node
    rospy.init_node("dual_modal_perception", anonymous=True, disable_signals=True)
    frame = 0
    
    # 1. Load the camera calibration parameters
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    if not os.path.exists(args.calib_file):
        raise ValueError("%s Not Found" % (args.calib_file))
    mono = MonoEstimator(args.calib_file, print_info=args.print)
    
    # 2. Initialize the Yolov5Detector(s)
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    if args.indoor:
        detector = Yolov5Detector(weights='weights/coco/yolov5s.pt')
    else:
        if args.modality.lower() == 'rgb':
            detector1 = Yolov5Detector(weights='weights/seumm_visible/yolov5s_100ep_pretrained.pt')
        elif args.modality.lower() == 't':
            detector2 = Yolov5Detector(weights='weights/seumm_lwir/yolov5s_100ep_pretrained.pt')
        elif args.modality.lower() == 'rgbt':  # dual modality
            detector1 = Yolov5Detector(weights='weights/yolov5s.pt')
            detector2 = Yolov5Detector(weights='weights/yolov5s.pt')
        else:
            raise ValueError("The modality must be 'RGB', 'T' or 'RGBT'.")
    # 3. Register the image callbacks
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    # prepare the image buffers
    image1_stamp, image1_frame = None, None
    image2_stamp, image2_frame = None, None
    
    # RGB image callback
    rospy.Subscriber(args.sub_image1, Image, image1_callback, queue_size=1,
        buff_size=52428800)
    # infrared image callback
    rospy.Subscriber(args.sub_image2, Image, image2_callback, queue_size=1,
        buff_size=52428800)
    # wait until both the RGB and thermal topics have delivered a frame
    while image1_frame is None or image2_frame is None:
        time.sleep(0.1)
        print('Waiting for topic %s and %s...' % (args.sub_image1, args.sub_image2))
    print('  Done.\n')
    
    # 4. Optional features
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    # the block below runs when the script is started with --print
    # whether to record timestamps and detection results
    if args.print:
        file_name = 'result.txt'
        with open(file_name, 'w') as fob:
            fob.seek(0)
            fob.truncate()
    # whether to save the output video
    if args.display:
        assert image1_frame.shape == image2_frame.shape, \
            'image1_frame.shape must be equal to image2_frame.shape.'
        win_h, win_w = image1_frame.shape[0], image1_frame.shape[1] * 2
        v_path = 'result.mp4'
        v_format = cv2.VideoWriter_fourcc(*"mp4v")
        v_writer = cv2.VideoWriter(v_path, v_format, args.frame_rate, (win_w, win_h), True)
    
    # 5. Detection and publishing
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    # start the periodic detection timer
    pub = rospy.Publisher(args.pub_image, Image, queue_size=1)
    rospy.Timer(rospy.Duration(1 / args.frame_rate), timer_callback)  # call timer_callback every 1/frame_rate seconds
    
    # unlike spin() in C++, rospy.spin() here just keeps the Python program alive until the node shuts down
    rospy.spin()
    

2.2 Calibration parameters

Monocular distance estimation needs the camera calibration parameters; the calibration_image.yaml file looks like this:

%YAML:1.0
---
ProjectionMat: !!opencv-matrix
   rows: 3
   cols: 4
   dt: d
   data: [859, 0, 339, 0, 0, 864, 212, 0, 0, 0, 1, 0]
Height: 2.0 # the height of the camera (meter)
DepressionAngle: 0.0 # the depression angle of the camera (degree)
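
As a quick sanity check, the file can be read back with OpenCV's FileStorage, which is also how MonoEstimator loads it later (a minimal sketch):

import cv2

fs = cv2.FileStorage('../conf/calibration_image.yaml', cv2.FILE_STORAGE_READ)
mat = fs.getNode('ProjectionMat').mat()   # 3x4 projection matrix
fx, fy = mat[0, 0], mat[1, 1]             # focal lengths in pixels
u0, v0 = mat[0, 2], mat[1, 2]             # principal point
height = fs.getNode('Height').real()      # camera height above the ground (m)
print(fx, fy, u0, v0, height)             # 859.0 864.0 339.0 212.0 2.0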

2.3 Selecting the best detection boxes

2.3.1 Merging detection boxes

To select the best boxes, all boxes must first be merged. There are four cases: both modalities detect targets, only the visible detects, only the infrared detects, or neither does (a small sketch of the merge logic follows the four examples below).

1. Both visible and infrared detect targets

# raw detections from visible and infrared
rgb ['person'] [0.74] [[        354         245         379         329]]
T ['person', 'person'] [0.77, 0.54] [[        353         241         379         327]
 [        114         249         128         292]]
# merged arrays
labels ['person', 'person', 'person']
scores [0.74, 0.77, 0.54]
boxes [[        354         245         379         329]
 [        353         241         379         327]
 [        114         249         128         292]]

2. Only visible detects a target

# individual detections
rgb ['person'] [0.72] [[        357         245         382         330]]
T [] [] []
# merged
labels ['person']
scores [0.72]
boxes [[        357         245         382         330]]

3. Only infrared detects a target

# individual detections
rgb [] [] []
T ['person'] [0.72] [[        357         245         382         330]]

# merged
labels ['person']
scores [0.72]
boxes [[        357         245         382         330]]

4. Neither detects anything

# individual detections
rgb [] [] []
T [] [] []

# merged
labels []
scores []
boxes [[]]
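
The branching inside timer_callback above can be summed up in a small helper; this is only a sketch of the same logic (merge_detections is a hypothetical name, not part of the original code):

import numpy as np
from functions import simplified_nms  # the same helper used by the node script, defined in 2.3.2

def merge_detections(labels1, scores1, boxes1, labels2, scores2, boxes2):
    """Sketch of the merge step: concatenate both result sets, then let simplified_nms drop duplicates."""
    labels = labels1 + labels2
    scores = scores1 + scores2
    if boxes1.shape[0] > 0 and boxes2.shape[0] > 0:          # case 1: both modalities fired
        boxes = np.concatenate([boxes1, boxes2], axis=0)
        keep = simplified_nms(boxes, scores)                  # remove duplicate boxes
        return np.array(labels)[keep], np.array(scores)[keep], boxes[keep]
    # cases 2-4: at most one modality fired, nothing to deduplicate
    boxes = boxes1 if boxes1.shape[0] > 0 else (boxes2 if boxes2.shape[0] > 0 else np.array([]))
    return labels, scores, boxes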

2.3.2 Removing duplicate detection boxes

In the first case the same target may be detected by both the visible and the infrared detector, so which box should represent it becomes a selection problem.

First look at the effect below: from the box coordinates we can tell that the 1st and the 3rd person are the same target, and the algorithm drops the duplicate with the lower confidence. So how is the best box chosen?

NMS takes the box with the highest confidence (the confidence that a target is present) as the reference box and computes the IoU between it and every other box. If the IoU exceeds a threshold, e.g. 0.5, the two boxes are considered to predict the same object and the lower-confidence one is removed.

#1. detections
rgb ['person', 'person'] [0.82, 0.6] [[        373         244         401         333]
 [        110         259         123         299]]
T ['person'] [0.67] [[        372         248         400         331]]
#2. merged
labels ['person', 'person', 'person']
scores [0.82, 0.6, 0.67]
boxes [[        373         244         401         333]
 [        110         259         123         299]
 [        372         248         400         331]]
#3. selected
result ['person' 'person'] [       0.82         0.6] [[        373         244         401         333]  [        110         259         123         299]]

The code of the algorithm is as follows:

def simplified_nms(boxes, scores, iou_thres=0.5):  # remove duplicate boxes
    '''
    Args:
        boxes: (n, 4), xyxy format
        scores: list(float)
    Returns:
        indices: list(int), indices to keep
    '''
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]  # corner coordinates of every box
    area = (x2 - x1) * (y2 - y1)  # area of every box
    # argsort() sorts ascending and returns the indices; [::-1] flips them into descending order
    # 1. Sort by confidence
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    idx = np.argsort(scores)[::-1]  # indices sorted by confidence, high to low
    print("idx", idx)
    indices = []
    # 2. Loop over the remaining boxes
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    while len(idx) > 0:
        i = idx[0]  # index of the highest-confidence box
        indices.append(i)  # keep this index
        print('indices', indices)
        if len(idx) == 1:  # only one box left
            break
        idx = idx[1:]  # drop the first element and work on the rest
        print('idx[1:]', idx)
        # 3. Use the overlap with the kept box to decide whether a box is a duplicate
        #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
        xx1 = np.clip(x1[idx], a_min=x1[i], a_max=np.inf)  # np.clip bounds the values; np.inf means unbounded above
        yy1 = np.clip(y1[idx], a_min=y1[i], a_max=np.inf)
        xx2 = np.clip(x2[idx], a_min=0, a_max=x2[i])
        yy2 = np.clip(y2[idx], a_min=0, a_max=y2[i])
        w, h = np.clip((xx2 - xx1), a_min=0, a_max=np.inf), np.clip((yy2 - yy1), a_min=0, a_max=np.inf)
        
        inter = w * h
        union = area[i] + area[idx] - inter
        iou = inter / union  # IoU with the kept box; the maximum is 1
        print('iou', iou)
        # 4. Keep only the boxes with low overlap (different targets) and loop again
        #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
        idx = idx[iou < iou_thres]  # low overlap means a different target
        print("idx", idx)
        
    return indices  # return the indices of the kept boxes

The variable traces look like this.

Duplicate target

#1. detections
rgb ['person'] [0.81] [[        365         248         392         332]]
T ['person'] [0.82] [[        364         243         392         329]]
#2. merged
labels ['person', 'person']
scores [0.81, 0.82]
boxes [[        365         248         392         332]
 [        364         243         392         329]]
#3. sort by confidence and pick the best
idx [1 0]
indices [1]
#4. compute the IoU between the remaining box and the best one to decide whether they are the same target
idx[1:] [0]
iou [    0.87867] # IoU of 0.88 means heavy overlap, so it is the same target
idx [] # no candidates left, the loop ends

Different targets

# 1. detections
rgb ['person'] [0.78] [[        363         246         390         329]]
T ['person', 'person'] [0.84, 0.52] [[        362         245         391         332]
 [        113         250         135         295]]
# 2. merged
labels ['person', 'person', 'person']
scores [0.78, 0.84, 0.52]
boxes [[        363         246         390         329]
 [        362         245         391         332]
 [        113         250         135         295]]
# 3. sorted
idx [1 0 2]
indices [1]
# 4. compute the IoU to remove duplicates
idx[1:] [0 2]
iou [    0.88823           0] # 0 means no overlap, i.e. a different target, so its index is kept in indices
idx [2]
indices [1, 2]
result ['person' 'person'] [       0.84        0.52] [[        362         245         391         332]
 [        113         250         135         295]]

2.4 Result visualization

2.4.1 Monocular distance estimation

    # 5.3 Monocular distance estimation
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    locations = mono.estimate(boxes)  # world coordinates of each box
    print("locations", locations)
    indices = [i for i in range(len(locations)) if locations[i][1] > 0 and locations[i][1] < 200]
    labels, scores, boxes, locations = \
        np.array(labels)[indices], np.array(scores)[indices], boxes[indices], np.array(locations)[indices]
    distances = [(loc[0] ** 2 + loc[1] ** 2) ** 0.5 for loc in locations]  # estimated distances
    print("distances", distances)
    cur_frame1 = cur_frame1[:, :, ::-1].copy() # to BGR
    cur_frame2 = cur_frame2[:, :, ::-1].copy() # to BGR

The monocular distance estimation code is as follows:

# -*- coding: UTF-8 -*- 

import numpy as np
import math
import cv2
from math import sin, cos

class MonoEstimator():
    def __init__(self, file_path, print_info=True):
        fs = cv2.FileStorage(file_path, cv2.FileStorage_READ)
        
        mat = fs.getNode('ProjectionMat').mat()
        self.fx = int(mat[0, 0])
        self.fy = int(mat[1, 1])
        self.u0 = int(mat[0, 2])
        self.v0 = int(mat[1, 2])
        
        self.height = fs.getNode('Height').real()
        self.depression = fs.getNode('DepressionAngle').real() * math.pi / 180.0
        
        if print_info:
            print('Calibration of camera:')
            print('  Parameters: fx(%d) fy(%d) u0(%d) v0(%d)' % (self.fx, self.fy, self.u0, self.v0))
            print('  Height: %.2fm' % self.height)
            print('  DepressionAngle: %.2frad' % self.depression)
            print()
    
    def uv_to_xyz(self, u, v):
        # Compute (x, y, z) coordinates in the real world, according to (u, v) coordinates in the image.
        # X axis - on the right side of the camera
        # Z axis - in front of the camera
        u = int(u)
        v = int(v)
        
        fx, fy = self.fx, self.fy
        u0, v0 = self.u0, self.v0
        
        h = self.height
        t = self.depression
        
        denominator = fy * sin(t) + (v - v0) * cos(t)
        if denominator != 0:
            z = (h * fy * cos(t) - h * (v - v0) * sin(t)) / denominator
            if z > 1000: z = 1000
        else:
            z = 1000
        x = (z * (u - u0) * cos(t) + h * (u - u0) * sin(t)) / fx
        y = h
        return x, y, z
    
    def estimate(self, boxes):  # monocular distance estimation
        locations = []
        if boxes.shape[0] > 0:  # only if there is at least one box
            for box in boxes:
                print(box[0])
                u, v = (box[0] + box[2]) / 2, box[3]  # bottom-center of the box
                x, y, z = self.uv_to_xyz(u, v)
                locations.append((x, z))
        return locations

(Figure: printed locations and distances)

locations holds each target's planar (x, z) coordinates in the camera's ground frame, and distances is the straight-line distance computed from them.
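
As a worked check of uv_to_xyz with the calibration above (fx = 859, fy = 864, u0 = 339, v0 = 212, height h = 2.0 m, depression angle 0°): the visible box [354, 245, 379, 329] from 2.3.1 has its bottom edge at v = 329 and bottom-center at u ≈ 366, so the denominator reduces to v − v0 = 117 and z = h·fy / (v − v0) = 2 × 864 / 117 ≈ 14.8 m, while the lateral offset is x = z·(u − u0) / fx ≈ 14.8 × 27 / 859 ≈ 0.46 m.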

2.4.2 Visualizing the result parameters

# 5.4 Draw the detection boxes
  #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    for i in reversed(np.argsort(distances)):  # walk the sorted indices from the farthest target to the nearest
        cur_frame1 = draw_predictions(
            cur_frame1, str(labels[i]), float(scores[i]), boxes[i], location=locations[i]  # the same annotations are drawn on both images
        )
        cur_frame2 = draw_predictions(
            cur_frame2, str(labels[i]), float(scores[i]), boxes[i], location=locations[i]
        )
    # 5.5 Publish the image
    #×××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××××
    result_frame = np.concatenate([cur_frame1, cur_frame2], axis=1)  # concatenate the two frames side by side

The final result looks like this:

(Figure: dual-modality detection result with estimated distances)

3. Feature-level fusion

Engineering difficulty: hard

Train on the visible and infrared images simultaneously. This requires changing the network structure so that features from corresponding layers can be fused.

It also requires assessing image quality so that adaptive fusion weights can be assigned.

The paper has finally been published; citations are very welcome:

SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment | SpringerLink

3.1 Building the dual-modality dataset

(Figure: dataset folder layout — images, images2, labels)

There are two image sets, visible-light images and infrared images2, which share a single labels folder.

The dataset yaml is configured as follows:

train: data/train.txt
train2: data/train2.txt
val: data/val.txt
val2: data/val2.txt
test: data/test.txt
# Classes
nc: 7  # number of classes
names: ['pedestrian','cyclist','car','bus','truck','traffic_light','traffic_sign']  # class names
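
train.txt and train2.txt simply list the visible and infrared image paths line by line with matching filenames, so that index i in both lists refers to the same frame; an example layout, assuming the folder names above:

# data/train.txt (visible)      # data/train2.txt (infrared)
./images/000000.jpg             ./images2/000000.jpg
./images/000001.jpg             ./images2/000001.jpg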

3.2 Feeding in the dual-modality data

3.2.1  dataset.py

This file holds the functions that read and preprocess the dataset. YOLOv5 normally uses mosaic augmentation, i.e. each image is stitched together with three other randomly chosen images, but that does not suit dual-modality fusion. So I first stripped dataset.py down to only its most basic functions and then modified it.

(Figure: stripped-down dataset.py structure)

The stripped-down dataset.py runs as shown below, which makes the dual-modality changes much easier. The next step is to give each function dual-modality inputs and outputs.

(Figure: running result of the stripped-down dataset.py)

The main ideas behind the modifications are listed below. The whole process is fairly arduous and requires being very familiar with the code; this post only gives the general approach.

1. create_dataloader

This function mainly needs an extra input path2; the calls to it in train.py have to be changed accordingly.

def create_dataloader(path,path2,imgsz, batch_size, stride, single_cls=False, hyp=None, augment=False, cache=False, pad=0.0,
                      rect=False, rank=-1, workers=8, image_weights=False, quad=False, prefix='', shuffle=False):
    if rect and shuffle:
        LOGGER.warning('WARNING: --rect is incompatible with DataLoader shuffle, setting shuffle=False')
        shuffle = False
    with torch_distributed_zero_first(rank):  # init dataset *.cache only once if DDP
        #return torch.from_numpy(img),torch.from_numpy(img2), labels_out, self.im_files[index],self.im_files2[index], shapes
        dataset = LoadImagesAndLabels(path,path2, imgsz, batch_size,
                                      augment=augment,  # augmentation
                                      hyp=hyp,  # hyperparameters
                                      rect=rect,  # rectangular batches
                                      cache_images=cache,
                                      single_cls=single_cls,
                                      stride=int(stride),
                                      pad=pad,
                                      image_weights=image_weights,
                                      prefix=prefix)
    batch_size = min(batch_size, len(dataset))
    nd = torch.cuda.device_count()  # number of CUDA devices
    nw = min([os.cpu_count() // max(nd, 1), batch_size if batch_size > 1 else 0, workers])  # number of workers
    sampler = None if rank == -1 else distributed.DistributedSampler(dataset, shuffle=shuffle)
    loader = DataLoader if image_weights else InfiniteDataLoader  # only DataLoader allows for attribute updates
    return loader(dataset,
                  batch_size=batch_size,
                  shuffle=shuffle and sampler is None,
                  num_workers=nw,
                  sampler=sampler,
                  pin_memory=True,
                  collate_fn=LoadImagesAndLabels.collate_fn4 if quad else LoadImagesAndLabels.collate_fn), dataset

Change train_loader and val_loader in train.py accordingly:

    # create the training set
    train_loader, dataset = create_dataloader(train_path,train_path2, imgsz, batch_size // WORLD_SIZE, gs, single_cls,
                                              hyp=hyp, augment=True, cache=None if opt.cache == 'val' else opt.cache,
                                              rect=opt.rect, rank=LOCAL_RANK, workers=workers,
                                              image_weights=opt.image_weights, quad=opt.quad,
                                              prefix=colorstr('train: '), shuffle=True)
    # get the largest class id in the labels and compare it with nc
    #mlc = int(np.concatenate(dataset.labels, 0)[:, 0].max())  # max label class
    nb = len(train_loader)  # number of batches
    # if it is not smaller than nc, something is wrong
    #assert mlc < nc, f'Label class {mlc} exceeds nc={nc} in {data}. Possible class labels are 0-{nc - 1}'

    # Process 0
    if RANK in [-1, 0]:
        # create the validation set
        val_loader = create_dataloader(val_path,val_path2, imgsz, batch_size // WORLD_SIZE * 2, gs, single_cls,
                                       hyp=hyp, cache=None if noval else opt.cache,
                                       rect=True, rank=-1, workers=workers * 2, pad=0.5,
                                       prefix=colorstr('val: '))[0]

2. LoadImagesAndLabels

This function is where most of the changes happen. I pulled the cache-building code out into a separate function, get_cache, so that a second cache (cache2) can be built for the infrared images.

def img2label_paths(img_paths):
    # Define label paths as a function of image paths
    sa, sb = os.sep + 'images' + os.sep, os.sep + 'labels' + os.sep  # /images/, /labels/ substrings
    return [sb.join(x.rsplit(sa, 1)).rsplit('.', 1)[0] + '.txt' for x in img_paths]
def get_cache(path):
    prefix=''
    cache_version = 0.6  # dataset labels *.cache version
    try:
        # 1. Collect the image files
        f = []  # image files
        for p in path if isinstance(path, list) else [path]:
            p = Path(p)  # os-agnostic
            if p.is_dir():  # dir
                f += glob.glob(str(p / '**' / '*.*'), recursive=True)
                # f = list(p.rglob('*.*'))  # pathlib
            elif p.is_file():  # file
                with open(p) as t:
                    t = t.read().strip().splitlines()
                    parent = str(p.parent) + os.sep  # parent directory; os.sep is the path separator
                    f += [x.replace('./', parent) if x.startswith('./') else x for x in t]  # local to global path
                    # f += [p.parent / x.lstrip(os.sep) for x in t]  # local to global path (pathlib)
            else:
                raise Exception(f'{prefix}{p} does not exist')
        # 2. Filter out images in unsupported formats
        im_files = sorted(x.replace('/', os.sep) for x in f if x.split('.')[-1].lower() in IMG_FORMATS)
        # self.img_files = sorted([x for x in f if x.suffix[1:].lower() in IMG_FORMATS])  # pathlib
        assert im_files, f'{prefix}No images found'
    except Exception as e:
        raise Exception(f'{prefix}Error loading data from {path}: {e}\nSee {HELP_URL}')
    label_files = img2label_paths(im_files)  # get the label paths
    cache_path = (p if p.is_file() else Path(label_files[0]).parent).with_suffix('.cache')
    try:
        cache, exists = np.load(cache_path, allow_pickle=True).item(), True  # load dict
        assert cache['version'] == cache_version  # same version
        assert cache['hash'] == get_hash(label_files + im_files)  # same hash: check whether the file list has changed
    except Exception:
        cache, exists = cache_labels(cache_path,im_files,label_files, prefix), False  # cache

    # Display cache: print the filtering results
    nf, nm, ne, nc, n = cache.pop('results')  # found, missing, empty, corrupt, total
    if exists:
        d = f"Scanning '{cache_path}' images and labels... {nf} found, {nm} missing, {ne} empty, {nc} corrupt"
        tqdm(None, desc=prefix + d, total=n, initial=n, bar_format=BAR_FORMAT)  # display cache results
        if cache['msgs']:
            LOGGER.info('\n'.join(cache['msgs']))  # display warnings
    return cache
#↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
# 1. Data processing
class LoadImagesAndLabels(Dataset):
    # YOLOv5 train_loader/val_loader, loads images and labels for training and validation
    

    def __init__(self, path, path2, img_size=640, batch_size=16, augment=False, hyp=None, rect=False, image_weights=False,
                 cache_images=False, single_cls=False, stride=32, pad=0.0, prefix=''):
        # set up the attributes
        self.img_size = img_size
        self.augment = augment  # whether to apply augmentation
        self.hyp = hyp  # hyperparameters
        self.image_weights = image_weights  # image sampling weights
        self.rect = False if image_weights else rect  # rectangular training
        self.stride = stride  # downsampling stride
        self.path = path
        self.path2 = path2
        self.albumentations = Albumentations() if augment else None
        cache = get_cache(self.path)
        cache2 = get_cache(self.path2)
        # Read cache
        [cache.pop(k) for k in ('hash', 'version', 'msgs')]  # remove items
        [cache2.pop(k) for k in ('hash', 'version', 'msgs')]  # remove items
        labels, shapes, self.segments = zip(*cache.values())
        self.labels = list(labels)
        self.shapes = np.array(shapes, dtype=np.float64)
        self.im_files = list(cache.keys())  # update: visible image list
        self.im_files2 = list(cache2.keys())  # update: infrared image list
        self.label_files = img2label_paths(cache.keys())  # update: label list
        n = len(shapes)  # number of images (14329 here)
        bi = np.floor(np.arange(n) / batch_size).astype(int)  # batch index of every image (np.int is deprecated)
        nb = bi[-1] + 1  # number of batches
        self.batch = bi  # batch index of image
        self.n = n
        self.indices = range(n)
        

        # Update labels
        # filter classes
        include_class = []  # filter labels to include only these classes (optional)
        include_class_array = np.array(include_class).reshape(1, -1)
        for i, (label, segment) in enumerate(zip(self.labels, self.segments)):
            if include_class:
                j = (label[:, 0:1] == include_class_array).any(1)
                self.labels[i] = label[j]
                if segment:
                    self.segments[i] = segment[j]
            if single_cls:  # single-class training, merge all classes into class 0
                self.labels[i][:, 0] = 0
                if segment:
                    self.segments[i][:, 0] = 0
        # whether to use rectangular batches
        if self.rect:
            # Sort by aspect ratio
            s = self.shapes  # wh
            ar = s[:, 1] / s[:, 0]  # aspect ratio (height / width)
            irect = ar.argsort()  # sort by aspect ratio
            self.im_files = [self.im_files[i] for i in irect]
            self.label_files = [self.label_files[i] for i in irect]
            self.labels = [self.labels[i] for i in irect]
            self.shapes = s[irect]  # wh
            ar = ar[irect]

            # Set the training image shapes
            # unify the target shape within each batch
            shapes = [[1, 1]] * nb
            for i in range(nb):
                ari = ar[bi == i]
                mini, maxi = ari.min(), ari.max()
                if maxi < 1:
                    shapes[i] = [maxi, 1]
                elif mini > 1:
                    shapes[i] = [1, 1 / mini]
            self.batch_shapes = np.ceil(np.array(shapes) * img_size / stride + pad).astype(int) * stride
        self.ims = [None] * self.n
        self.npy_files = [Path(f).with_suffix('.npy') for f in self.im_files]
        self.npy_files2 = [Path(f).with_suffix('.npy') for f in self.im_files2]

3. __getitem__

Next, change the inputs and outputs of the image-augmentation part, checking at every step that the two images belong to the same frame.

    # 2. Image loading and augmentation
    def __getitem__(self, index):  # map the (possibly weighted/shuffled) sample index to an image index
        i = self.indices[index]  # linear, shuffled, or image_weights
        hyp = self.hyp
        #↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
        # Load and resize both images
        im, f, fn = self.ims[i], self.im_files[i], self.npy_files[i],  # cached image / file path / npy path (visible)
        im2, f2, fn2 = self.ims[i], self.im_files2[i], self.npy_files2[i],  # cached image / file path / npy path (infrared)
        img, (h0, w0), (h, w) = self.load_image(im, f, fn)
        img2, (h02, w02), (h2, w2) = self.load_image(im2, f2, fn2)

        # Letterbox
        shape = self.batch_shapes[self.batch[index]] if self.rect else self.img_size  # final letterboxed shape
        img, ratio, pad = letterbox(img, shape, auto=False, scaleup=self.augment)
        img2, ratio2, pad2 = letterbox(img2, shape, auto=False, scaleup=self.augment)
        labels = self.labels[index].copy()
        shapes = (h0, w0), ((h / h0, w / w0), pad)  # for COCO mAP rescaling
        if labels.size:  # normalized xywh to pixel xyxy format
            labels[:, 1:] = xywhn2xyxy(labels[:, 1:], ratio[0] * w, ratio[1] * h, padw=pad[0], padh=pad[1])

        nl = len(labels)  # number of labels
        if nl:
            labels[:, 1:5] = xyxy2xywhn(labels[:, 1:5], w=img.shape[1], h=img.shape[0], clip=True, eps=1E-3)

        labels_out = torch.zeros((nl, 6))
        if nl:
            labels_out[:, 1:] = torch.from_numpy(labels)

        # Convert
        img = img.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img = np.ascontiguousarray(img)
        img2 = img2.transpose((2, 0, 1))[::-1]  # HWC to CHW, BGR to RGB
        img2 = np.ascontiguousarray(img2)
        return torch.from_numpy(img),torch.from_numpy(img2), labels_out, self.im_files[index],self.im_files2[index], shapes


#↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑

    @staticmethod
    def collate_fn(batch):  # how samples are collated into a batch
        im,im2, label, path,path2, shapes = zip(*batch)  # transposed
        for i, lb in enumerate(label):
            lb[:, 0] = i  # add target image index for build_targets()
        return torch.stack(im, 0),torch.stack(im2, 0), torch.cat(label, 0), path,path2, shapes
    @staticmethod
    def collate_fn4(batch):
        im,im2, label, path,path2, shapes = zip(*batch)  # transposed
        n = len(shapes) // 4
        im4,im42, label4, path4,path42, shapes4 = [],[], [], path[:n],path2[:n], shapes[:n]

        ho = torch.tensor([[0.0, 0, 0, 1, 0, 0]])
        wo = torch.tensor([[0.0, 0, 1, 0, 0, 0]])
        s = torch.tensor([[1, 1, 0.5, 0.5, 0.5, 0.5]])  # scale
        for i in range(n):  # zidane torch.zeros(16,3,720,1280)  # BCHW
            i *= 4
            if random.random() < 0.5:
                im_ = F.interpolate(im[i].unsqueeze(0).float(), scale_factor=2.0, mode='bilinear', align_corners=False)[
                    0].type(im[i].type())
                im2_ = F.interpolate(im2[i].unsqueeze(0).float(), scale_factor=2.0, mode='bilinear', align_corners=False)[
                    0].type(im2[i].type())
                lb = label[i]
            else:
                im_ = torch.cat((torch.cat((im[i], im[i + 1]), 1), torch.cat((im[i + 2], im[i + 3]), 1)), 2)
                im2_ = torch.cat((torch.cat((im2[i], im2[i + 1]), 1), torch.cat((im2[i + 2], im2[i + 3]), 1)), 2)
                lb = torch.cat((label[i], label[i + 1] + ho, label[i + 2] + wo, label[i + 3] + ho + wo), 0) * s
            im4.append(im_)
            im42.append(im2_)
            label4.append(lb)

        for i, lb in enumerate(label4):
            lb[:, 0] = i  # add target image index for build_targets()
        return torch.stack(im4, 0),torch.stack(im42, 0), torch.cat(label4, 0), path4,path42,shapes4

4. pbar changes

Because the dataset now returns dual-modality samples, each batch is different from before:

im, targets, paths, shapes becomes im, im2, targets, paths, paths2, shapes

so train.py and val.py have to be updated accordingly.

train.py

 if RANK in [-1, 0]:
            # create a tqdm progress bar to show the training info
            pbar = tqdm(pbar, total=nb, bar_format='{l_bar}{bar:10}{r_bar}{bar:-10b}')  # progress bar
        optimizer.zero_grad()  # reset the gradients
        for i, (imgs,imgs2, targets, paths,paths2, _) in pbar:  # batch -------------------------------------------------------------
            # count the iterations
            ni = i + nb * epoch  # number integrated batches (since train start)
            imgs = imgs.to(device, non_blocking=True).float() / 255  # uint8 to float32, 0-255 to 0.0-1.0

val.py 

    pbar = tqdm(dataloader, desc=s, bar_format='{l_bar}{bar:10}{r_bar}{bar:-10b}')  # progress bar
    for batch_i, (im,im2,targets, paths,paths2, shapes) in enumerate(pbar):
        t1 = time_sync()
        if cuda:
            im = im.to(device, non_blocking=True)
            targets = targets.to(device)
        im = im.half() if half else im.float()  # uint8 to fp16/32
        im /= 255  # 0 - 255 to 0.0 - 1.0
        nb, _, height, width = im.shape  # batch size, channels, height, width
        t2 = time_sync()
        dt[0] += t2 - t1

If the changes went smoothly, the original yolov5 training command should still run at this point: although two modalities are now passed in, only one of them is actually used, so nothing breaks. If it does not run, go through the error messages carefully; the problems are usually mismatched inputs and outputs.

3.2.2 Displaying train_batch images

YOLOv5 saves train_batch*.jpg images for the first three batches so you can check that the images and labels fed to the network are correct. I wanted to verify that the generated infrared and visible images match and that the labels are right, which means changing the relevant functions; the approach is given below.

1. First, find where the train_batch images are generated

Search by keyword; on Ubuntu the following command does the job:

grep -rn "train_batch" *

The search result shows that it lives in utils/loggers/__init__.py. (Figure: grep output)

2. Modify the function

Copy the function and change the name of the generated image so the two modalities can be told apart:

    def on_train_batch_end(self, ni, model, imgs, targets, paths, plots, sync_bn):
        # Callback runs on train batch end
        if plots:
            if ni == 0:
                if not sync_bn:  # tb.add_graph() --sync known issue https://github.com/ultralytics/yolov5/issues/3754
                    with warnings.catch_warnings():
                        warnings.simplefilter('ignore')  # suppress jit trace warning
                        self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])
            if ni < 3:
                f = self.save_dir / f'train_batch{ni}.jpg'  # filename
                Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()
            if self.wandb and ni == 10:
                files = sorted(self.save_dir.glob('train*.jpg'))
                self.wandb.log({'Mosaics': [wandb.Image(str(f), caption=f.name) for f in files if f.exists()]})
    def on_train_batch_end2(self, ni, model, imgs, targets, paths, plots, sync_bn):
        # Callback runs on train batch end
        if plots:
            if ni == 0:
                if not sync_bn:  # tb.add_graph() --sync known issue https://github.com/ultralytics/yolov5/issues/3754
                    with warnings.catch_warnings():
                        warnings.simplefilter('ignore')  # suppress jit trace warning
                        self.tb.add_graph(torch.jit.trace(de_parallel(model), imgs[0:1], strict=False), [])
            if ni < 3:
                f = self.save_dir / f'train_batch2{ni}.jpg'  # filename
                Thread(target=plot_images, args=(imgs, targets, paths, f), daemon=True).start()
            if self.wandb and ni == 10:
                files = sorted(self.save_dir.glob('train*.jpg'))
                self.wandb.log({'Mosaics': [wandb.Image(str(f), caption=f.name) for f in files if f.exists()]})

3. Calling it from train.py

Add a call to this new function in train.py so that imgs2 goes through it as well.

(Figure: the added call in train.py)
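
A minimal sketch of the call added in train.py, assuming the stock YOLOv5 training loop where callbacks.run('on_train_batch_end', ...) already exists; the second call just feeds the infrared batch to the new hook:

            # existing call for the visible batch (stock yolov5)
            callbacks.run('on_train_batch_end', ni, model, imgs, targets, paths, plots, opt.sync_bn)
            # added call for the infrared batch, using the hook defined above (sketch, check against your version)
            callbacks.run('on_train_batch_end2', ni, model, imgs2, targets, paths2, plots, opt.sync_bn)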

At this point I thought I was done, but the code still would not run, so I searched for where on_train_batch_end originally lives; the result is below.

(Figure: grep output for on_train_batch_end)

The hook also exists in callbacks.py, so the new function has to be registered there as well, as shown below.

(Figure: the added hook in callbacks.py)
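
In callbacks.py the new hook also has to appear in the _callbacks dictionary, otherwise registering or running 'on_train_batch_end2' fails; a minimal sketch of the entry I mean (only an excerpt, the surrounding dict is the stock YOLOv5 Callbacks class):

class Callbacks:
    # (excerpt) list the new hook next to the existing ones
    def __init__(self):
        self._callbacks = {
            'on_train_batch_start': [],
            'on_train_batch_end': [],
            'on_train_batch_end2': [],  # added: hook for the infrared train_batch images
            # ... the remaining stock hooks stay unchanged ...
        }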

Running the code again, it finally works. The infrared and visible images correspond and the labels are correct.

(Figure: train_batch images for the visible and infrared modalities)

That completes the data-input part; next comes modifying the network.

3.3 Modifying the backbone

Modifying the backbone requires being very familiar with yolo.py. The whole process of changing and debugging the code was quite painful, so here I only share the main ideas and the code. Before touching anything, a few concepts need to be clear.

(Figure: key concepts for modifying yolo.py)

3.3.1 The yaml parsing function

yolo.py is a very important file; it reads the yaml configuration and builds the network. It is confusing at first, but there are plenty of annotated walkthroughs online, so I will not repeat them here.

parse_model is the function that parses the yaml. It has to become dual-modality: the two branches need separate input and output channel lists so they do not get mixed up when layers are looked up, so I added ch2, c12 and c22. The layer index then decides which backbone a layer belongs to. Note that at the end ch2 must also include all the channels of ch; otherwise the 'from' indices will not line up.

# Parse the network configuration file and build the model
def parse_model(d, ch,ch2):  # model_dict, input_channels(3)
    LOGGER.info(f"\n{'':>3}{'from':>18}{'n':>3}{'params':>10}  {'module':<40}{'arguments':<30}")  # log header
    # 1. Read the parameters from the config dict
    #————————————————————————————————————————————————————————————————
    anchors, nc, gd, gw = d['anchors'], d['nc'], d['depth_multiple'], d['width_multiple']
    na = (len(anchors[0]) // 2) if isinstance(anchors, list) else anchors  # number of anchors
    no = na * (nc + 5)  # number of outputs = anchors * (classes + 5)
 
    # 2. Build the network
    #————————————————————————————————————————————————————————————————
    '''
    layers: the module of every layer
    save: indices of layers whose 'from' is not -1
    c2: output channels of the current layer
    '''
    layers, save, c2, c22 = [], [], ch[-1], ch2[-1]  # layers, savelist, ch out (both branches)
    # from (which layers feed this one), number (repeat count), module (layer type), args (layer arguments)
    for i, (f, n, m, args) in enumerate(d['backbone1'] + d['backbone2'] + d['head']):  # iterate over backbone1, backbone2 and the head
        m = eval(m) if isinstance(m, str) else m  # resolve the class from its name
        for j, a in enumerate(args):  # loop over the module arguments
            try:
                args[j] = eval(a) if isinstance(a, str) else a  # eval strings
            except NameError:
                pass
                
        n = n_ = max(round(n * gd), 1) if n > 1 else n  # depth gain, e.g. v5s: n * 0.33
        
        if i < 10:  # layers 0-9 belong to backbone1 (visible branch) and use ch
            if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                 BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, conv_bn_relu_maxpool, Shuffle_Block]:
                c1, c2 = ch[f], args[0]
                if c2 != no:
                    c2 = make_divisible(c2 * gw, 8)  # keep the channel count a multiple of 8
                # update args with the input channel of this layer
                # [in_channel, out_channel, *args[1:]]
                args = [c1, c2, *args[1:]] 
        # from layer 10 on, the layers belong to backbone2 and the head and use ch2
        else :
            if m in [Conv, GhostConv, Bottleneck, GhostBottleneck, SPP, SPPF, DWConv, MixConv2d, Focus, CrossConv,
                     BottleneckCSP, C3, C3TR, C3SPP, C3Ghost, conv_bn_relu_maxpool, Shuffle_Block]:
                c12, c22 = ch2[f], args[0]  # record the output channels
                if c22 != no:
                    c22 = make_divisible(c22 * gw, 8)  # keep the channel count a multiple of 8
                args = [c12, c22, *args[1:]] 
                if m in [BottleneckCSP, C3, C3TR, C3Ghost]:
                    args.insert(2, n)  # insert the number of bottlenecks n at position 2
                    n = 1  # reset
            elif m is nn.BatchNorm2d:  # a BN layer only needs the output channel of the previous layer
                args = [ch2[f]]
            elif m is Concat2:  # Concat2: f lists the layers to concatenate; the output channel is their sum
                c22 = 0
                for x in f:
                    if x == -1:
                        c2p = ch2[x]   # previous layer of the infrared branch
                    else:
                        c2p = ch[x]    # a layer of the visible branch
                    c22 = c22 + c2p
            elif m is Concat:  # Concat: f lists the layers to concatenate; the output channel is their sum
                c22 = sum(ch2[x] for x in f)
            elif m is Detect:  # first the input channels of each prediction layer, then the anchor list
                args.append([ch2[x] for x in f])  # output channels of the three Detect inputs
                if isinstance(args[1], int):  # number of anchors
                    args[1] = [list(range(args[1] * 2))] * len(f)
            elif m is Contract:
                c22 = ch2[f] * args[0] ** 2
            elif m is Expand:
                c22 = ch2[f] // args[0] ** 2
            else:
                c22 = ch2[f]
        # build module m from args; the repeat count is controlled by n
        # m_: if n > 1 create n copies of m, otherwise a single one
        m_ = nn.Sequential(*(m(*args) for _ in range(n))) if n > 1 else m(*args)  # module
        t = str(m)[8:-2].replace('__main__.', '')  # module type string
        np = sum(x.numel() for x in m_.parameters())  # number of parameters of this layer
        m_.i, m_.f, m_.type, m_.np = i, f, t, np  # attach index, 'from' index, type, number params
        LOGGER.info(f'{i:>3}{str(f):>18}{n_:>3}{np:10.0f}  {t:<40}{str(args):<30}')  # print the layer info
        # append to savelist: remember every 'from' index that is not -1
        save.extend(x % i for x in ([f] if isinstance(f, int) else f) if x != -1)  # append to savelist
        layers.append(m_)  # add the module to layers
        if i == 0:
            ch = []
        ch.append(c2)
        
        if i == 10:
            ch2 = [32, 64, 64, 128, 128, 256, 256, 512, 512, 512]  # include the channels of backbone1 so the layer indices line up
        ch2.append(c22)

    return nn.Sequential(*layers), sorted(save)  # build the model once the loop finishes

3.3.2 backbone

Once the yaml parser is ready, the backbone yaml can be changed as below. To avoid confusion with the normal Concat I added a Concat2 module that is identical except for its name (a sketch of it follows the yaml).

# dual-modality model
nc: 7  
depth_multiple: 0.33  
width_multiple: 0.50

# anchor settings
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

backbone1:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2  # 64 output channels, 6x6 kernel, stride 2, padding 2
   #320×320×32
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   #160×160×64
   [-1, 3, C3, [128]],
   #160×160×64
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   #80×80×128
   [-1, 6, C3, [256]],
   #80×80×128
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   #40×40×256
   [-1, 9, C3, [512]],
   #40×40×256
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   #20×20×512
   [-1, 3, C3, [1024]],
   #20×20×512
   [-1, 1, SPPF, [1024, 5]],  # 9
   #20×20×512
  ]



backbone2:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  #10
   #320×320×32
   [-1, 1, Conv, [128, 3, 2]],  # 
   #160×160×64
   [-1, 3, C3, [128]], 
   #160×160×64 
   [-1, 1, Conv, [256, 3, 2]],
   #80×80×128  
   [-1, 6, C3, [256]],  # 14
   #80×80×128
   [[-1, 4], 1, Concat2, [1]],  #
   #80×80×256
   [-1, 6, C3, [256]],  #16
   #80×80×128
   [14, 1, Conv, [512, 3, 2]],  # 
   #40×40×256
   [-1, 9, C3, [512]],
   #40×40×256
   [[-1, 6], 1, Concat2, [1]],  #
   #40×40×512
   [-1, 6, C3, [512]],  #20
   #40×40×256
   [17, 1, Conv, [1024, 3, 2]],  # 
   #20×20×512
   [[-1, 9], 1, Concat2, [1]],  # 
   #20×20×1024
   [-1, 3, C3, [1024]],
   #20×20×512
   [-1, 1, SPPF, [1024, 5]],  #24
   #20×20×512
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
  #20×20×256
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   #40×40×256
   [[-1, 20], 1, Concat, [1]],  # cat backbone P4
    #40×40×512
   [-1, 3, C3, [512, False]],  # 28
   #40×40×256

   [-1, 1, Conv, [256, 1, 1]],
   #40×40×128
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   #80×80×128
   [[-1, 16], 1, Concat, [1]],  # cat backbone P3
   #80×80×256
   [-1, 3, C3, [256, False]],  # 32 (P3/8-small)
   #80×80×128

   [-1, 1, Conv, [512, 3, 2]],
   #40×40×256
   [[-1, 28], 1, Concat, [1]],  # cat head P4
   #40×40×512
   [-1, 3, C3, [512, False]],  # 35 (P4/16-medium)
   #40×40×256

   [-1, 1, Conv, [1024, 3, 2]],
   #20×20×512
   [[-1, 24], 1, Concat, [1]],  # cat head P5
   #20×20×1024
   [-1, 3, C3, [1024, False]],  # 38 (P5/32-large)
   #20×20×512

   [[32, 35, 38], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
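
For reference, Concat2 in models/common.py is just a copy of the stock Concat with a new name, so that parse_model can treat cross-branch concatenations separately (a minimal sketch):

import torch
import torch.nn as nn

class Concat2(nn.Module):
    # Identical to yolov5's Concat; only the name differs so parse_model can handle it on its own branch
    def __init__(self, dimension=1):
        super().__init__()
        self.d = dimension

    def forward(self, x):
        return torch.cat(x, self.d)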

3.4 Modifying forward

forward holds the forward-propagation logic; the problem to solve is how x1 and x2 get routed into their respective backbones.

    def _forward_once(self, x1,x2, profile=False, visualize=False):
        y, dt = [], []  # outputs
        for m in self.model:
            if m.i<10:
                x1 = m(x1)
                x2=x2
                y.append(x1 if m.i in self.save else None)  # save output
            else:
                if m.f != -1:  # if not from previous layer
                    x2 = y[m.f] if isinstance(m.f, int) else [x2 if j == -1 else y[j] for j in m.f]  # from earlier layers
                if profile:
                    self._profile_one_layer(m, x2, dt)
                x2 = m(x2)  # run
                x1=x1
                y.append(x2 if m.i in self.save else None)  # save output
            
            if visualize:
                feature_visualization(x2, m.type, m.i, save_dir=visualize)
            
            feature_vis = False
            if m.type == 'models.common.C3' and feature_vis and m.i==17:
                print(m.type, m.i)
                feature_visualization2(x2, m.type, m.i,128,8,16)
        return x2
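
The model's top-level forward and the places that call the model also have to accept two images; a minimal sketch under my assumptions (the method name follows stock yolo.py, the exact signature may differ in your version):

    def forward(self, x1, x2, augment=False, profile=False, visualize=False):
        # dual-modality forward: x1 is the visible image, x2 the infrared image
        return self._forward_once(x1, x2, profile, visualize)

# and in train.py / val.py the model is then called with both batches, e.g.:
#   pred = model(imgs, imgs2)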

3.5 Debugging and running

python train.py --data data/dual.yaml --cfg models/dualyolov5s.yaml --weights weights/yolov5s.pt --batch-size 12 --epochs 50

The network structure is as follows:

              from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64]                      
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  2    148992  models.common.C3                        [128, 128]                    
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    889344  models.common.C3                        [256, 256]                    
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1   1182720  models.common.C3                        [512, 512]                    
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 10                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]              
 11                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
 12                -1  1     18816  models.common.C3                        [64, 64, 1]                   
 13                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
 14                -1  2    115712  models.common.C3                        [128, 128, 2]                 
 15           [-1, 4]  1         0  models.common.Concat2                   [1]                           
 16                -1  2    132096  models.common.C3                        [256, 128, 2]                 
 17                14  1    295424  models.common.Conv                      [128, 256, 3, 2]              
 18                -1  3    625152  models.common.C3                        [256, 256, 3]                 
 19           [-1, 6]  1         0  models.common.Concat2                   [1]                           
 20                -1  2    526336  models.common.C3                        [512, 256, 2]                 
 21                17  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
 22           [-1, 9]  1         0  models.common.Concat2                   [1]                           
 23                -1  1   1444864  models.common.C3                        [1024, 512, 1]                
 24                -1  1    656896  models.common.SPPF                      [512, 512, 5]                 
 25                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 26                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 27          [-1, 20]  1         0  models.common.Concat                    [1]                           
 28                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 29                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 30                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 31          [-1, 16]  1         0  models.common.Concat                    [1]                           
 32                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 33                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
 34          [-1, 28]  1         0  models.common.Concat                    [1]                           
 35                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 36                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
 37          [-1, 24]  1         0  models.common.Concat                    [1]                           
 38                -1  1   1444864  models.common.C3                        [1024, 512, 1, False]         
 39      [32, 35, 38]  1     32364  models.yolo.Detect                      [7, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 530 layers, 13493740 parameters, 13493740 gradients

The results are quite good: after only 13 training epochs mAP@0.5 already reaches about 0.85. (Figure: training curves)

After a long stretch of debugging it finally runs. I worked the whole modification out on my own and could not find any reference for it online; if you have a simpler way to do it, feel free to discuss. There is still plenty of optimization work left, which I will update when I have time.
