Faster R-CNN Source Code Walkthrough, Part 1 (PyTorch)

神洛华

Last modified: 2022-10-03 14:13:35


First published: 2022-08-14 04:17:26

Article link: https://blog.csdn.net/qq_56591814/article/details/126325693


GitHub project: Faster R-CNN. Bilibili video: "Faster RCNN源码解析(pytorch)" (Faster RCNN source code walkthrough, PyTorch).

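I. Using the Project Code

This project follows the faster_rcnn source in the official PyTorch torchvision module. You can also type import torchvision.models.detection.faster_rcnn in an ipynb file, hold Ctrl and right-click it, and choose Go to Definition to open faster_rcnn.py; the two are the same code.

That file is only part of the code, though. The training code lives under references/detection in the official PyTorch vision repository.

For reference: the official faster_rcnn API documentation and the object detection tutorial in the PyTorch 1.7 Chinese documentation.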

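1.1 Project Overview

Open the project page; README.md describes the file structure:

├── backbone: feature extraction network, chosen to suit your needs; MobileNetV2 or ResNet50+FPN can be used here
├── network_files: the main Faster R-CNN source (including the Fast R-CNN and RPN modules)
├── train_utils: training and validation utilities, mostly the training code from pytorch/vision/references/detection mentioned above
├── my_dataset.py: custom dataset for reading the VOC dataset
├── train_mobilenet.py: training with MobileNetV2 as the backbone (a simple network, good for studying the code)
├── train_resnet50_fpn.py: training with ResNet50+FPN as the backbone (better results in practice)
├── train_multi_GPU.py: for users training on multiple GPUs
├── predict.py: simple prediction script that runs inference with trained weights
├── validation.py: evaluates COCO metrics on the validation/test data with trained weights and writes record_mAP.txt
└── pascal_voc_classes.json: pascal_voc label file
How to train:

Prepare the dataset in advance.
Download the matching pretrained weights in advance (put them in the backbone folder).
- To train mobilenetv2+fasterrcnn, use the train_mobilenet.py script directly.
- To train resnet50+fpn+fasterrcnn, use the train_resnet50_fpn.py script directly.
- To train on multiple GPUs, run python -m torch.distributed.launch --nproc_per_node=8 --use_env train_multi_GPU.py. The nproc_per_node argument is the number of GPUs to use, and torch.distributed.launch starts multiple training processes.
To choose which GPUs are used, prefix the command with CUDA_VISIBLE_DEVICES=0,3 (for example, to use only the 1st and 4th GPUs on the machine):
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 --use_env train_multi_GPU.py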
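
With --use_env, the launcher hands each process its rank through environment variables instead of a --local_rank command-line argument. The sketch below shows how a training script might read them. read_dist_env is a hypothetical helper written for illustration, not a function from the project; RANK, WORLD_SIZE and LOCAL_RANK are the variable names set by the launcher.

```python
import os


def read_dist_env(env=None):
    """Hypothetical helper: read the per-process values set by the launcher."""
    env = os.environ if env is None else env
    if "RANK" not in env or "WORLD_SIZE" not in env:
        # Not started by the launcher: fall back to single-process training.
        return {"distributed": False, "rank": 0, "world_size": 1, "local_rank": 0}
    return {
        "distributed": True,
        "rank": int(env["RANK"]),                     # global index of this process
        "world_size": int(env["WORLD_SIZE"]),         # total number of processes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this machine
    }


if __name__ == "__main__":
    # Simulate the second of two processes started with --nproc_per_node=2.
    print(read_dist_env({"RANK": "1", "WORLD_SIZE": "2", "LOCAL_RANK": "1"}))
```

Note that with CUDA_VISIBLE_DEVICES=0,3 the two visible GPUs are renumbered 0 and 1 inside the processes, so local_rank 1 refers to physical GPU 3.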

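1.2 Notes on the Training Code

train_mobilenet.py:
- The model is built with the create_model function, and the VOC dataset is handled by the VOCDataSet class defined in my_dataset.py.
- The mobilenet pretrained weights cover only the backbone, so training first freezes the backbone and trains the other layers for 5 epochs (their parameters are still freshly initialized at that point), then trains the whole network for 20 epochs.
- When training the whole network, it follows the official PyTorch recipe for the resnet50+fpn Faster R-CNN and keeps the lower layers of the backbone frozen (those weights should be generic). In practice this works better than training the entire backbone and is also faster.
- Training only starts to converge around epoch 10, so only the weights of the last 5 epochs are kept. Each checkpoint stores the model weights together with the optimizer, scheduler and epoch, so training can be resumed later.
train_resnet50_fpn.py:
- The resnet50+fpn weights cover the whole Faster R-CNN network, so there is no need to freeze the backbone and train the other layers first; the whole network is trained directly for 15 epochs. This is transfer learning and trains quickly, so the weights are saved from the first epoch on.
- Arguments are added with parser.add_argument so they can be overridden on the command line. --resume defaults to '' (an empty string), which means training from scratch. To continue a previous run, set it to the path of the saved checkpoint. --start_epoch is the epoch to start from after loading that checkpoint.
- In create_model, resnet50_fpn_backbone freezes some of the lower layers by default, so no extra code is needed for that.
- Download the backbone weights and the full-model weights, and rename them to resnet50.pth and fasterrcnn_resnet50_fpn_coco.pth respectively.
predict.py:
- Training with frozen BN layers made training fail, so the project author commented that code out. Since the BN layers were not frozen during training, do not freeze them for prediction either.
- The model is loaded with model.load_state_dict(torch.load(weights_path, map_location='cpu')["model"]). The trailing ["model"] key is needed because the checkpoint saved during training also contains the optimizer and other state, while prediction needs only the model weights.

img = torch.unsqueeze(img, dim=0) adds the batch dimension.
The project author reports about 80 s per epoch when training resnet50+FPN on multiple GPUs, versus about 10 min per epoch on a single GPU. The multi-GPU script is heavily optimized, for example with multi-process data preprocessing that makes heavy use of the CPU. After 15 epochs the mAP on VOC is about 0.8. Hearing that makes me want to cry: on Colab one epoch takes me roughly 20 min.

Tips: the network_files folder contains many scripts, such as rpn_function.py, and the functions defined in them carry a type comment right below the signature, for example # type: (Tensor, int) -> Tuple[int, int]. Some IDEs flag this comment as an error (red squiggly underline). Do not "fix" it to silence the warning: the forward pass will then break and the code will not run. The comment tells PyTorch the types of the input variables so that they can be checked at run time, which prevents a range of problems.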

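II. Custom Dataset

2.1 Splitting into Training and Validation Sets

split_data.py splits the original dataset into a training set and a validation set and saves their indices to train.txt and val.txt. These files have the same form as train.txt and val.txt under VOC2012/ImageSets/Main in the VOC dataset: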

2008_000008
2008_000015
2008_000019
...

The code of split_data.py:

import os
import random


def main():
    random.seed(0)  # fix the random seed so the split is reproducible

    files_path = "./VOCdevkit/VOC2012/Annotations"
    assert os.path.exists(files_path), "path: '{}' does not exist.".format(files_path)

    val_rate = 0.5  # fraction of samples assigned to the validation set
    # List every xml file and split its name on "."; the first part is the image index.
    files_name = sorted([file.split(".")[0] for file in os.listdir(files_path)])
    files_num = len(files_name)
    # A set makes the membership test in the loop below O(1).
    val_index = set(random.sample(range(0, files_num), k=int(files_num * val_rate)))
    train_files = []
    val_files = []
    for index, file_name in enumerate(files_name):
        if index in val_index:
            val_files.append(file_name)
        else:
            train_files.append(file_name)

    try:
        # Mode "x" refuses to overwrite an existing split file.
        with open("train.txt", "x") as train_f, open("val.txt", "x") as eval_f:
            train_f.write("\n".join(train_files))  # one file name per line
            eval_f.write("\n".join(val_files))
    except FileExistsError as e:
        print(e)
        exit(1)

if __name__ == '__main__':
    main()

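2.2 Defining the Dataset

The dataset follows the "Defining the Dataset" section of the official PyTorch tutorial "TORCHVISION OBJECT DETECTION FINETUNING TUTORIAL".
A custom dataset must subclass torch.utils.data.Dataset and implement __len__ and __getitem__ (which returns an image and its annotations). For multi-GPU training, also implement get_height_and_width. Without it, the height and width are computed by loading every image, which is slow and memory-hungry; with it, the whole dataset does not have to be traversed.
pascal_voc_classes.json is a dictionary that maps class names to indices. Index 0 is reserved for the background class, so the classes start at 1: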

{
    "aeroplane": 1,
    "bicycle": 2,
    ...
    "tvmonitor": 20
}

The xml annotation files have the following format:

<annotation>
    <folder>images</folder>
    <filename>road650.png</filename>
    <size>
        <width>300</width>
        <height>400</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>speedlimit</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <occluded>0</occluded>
        <difficult>0</difficult>
        <bndbox>
            <xmin>126</xmin>
            <ymin>110</ymin>
            <xmax>162</xmax>
            <ymax>147</ymax>
        </bndbox>
    </object>
</annotation>
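
To make the structure concrete, here is a minimal sketch of turning such an annotation into a nested dict and reading one box from it. It mirrors the recursive idea of the parse_xml_to_dict method of VOCDataSet, but as an assumption for portability it uses the standard-library xml.etree.ElementTree instead of lxml, and the SAMPLE string is a shortened copy of the file above.

```python
import xml.etree.ElementTree as ET


def parse_xml_to_dict(node):
    """Recursively convert an XML element into nested dicts; <object> tags are collected in a list."""
    if len(node) == 0:  # leaf element: return its text
        return {node.tag: node.text}
    result = {}
    for child in node:
        child_result = parse_xml_to_dict(child)
        if child.tag != "object":
            result[child.tag] = child_result[child.tag]
        else:
            # An image can contain several objects, so they go into a list.
            result.setdefault(child.tag, []).append(child_result[child.tag])
    return {node.tag: result}


SAMPLE = """
<annotation>
    <filename>road650.png</filename>
    <size><width>300</width><height>400</height><depth>3</depth></size>
    <object>
        <name>speedlimit</name>
        <difficult>0</difficult>
        <bndbox><xmin>126</xmin><ymin>110</ymin><xmax>162</xmax><ymax>147</ymax></bndbox>
    </object>
</annotation>
"""

data = parse_xml_to_dict(ET.fromstring(SAMPLE.strip()))["annotation"]
obj = data["object"][0]
# All leaf values are strings, so coordinates must be converted explicitly.
box = [float(obj["bndbox"][k]) for k in ("xmin", "ymin", "xmax", "ymax")]
print(data["filename"], data["size"]["height"], obj["name"], box)
```

Every leaf value comes back as a string, which is why the dataset code converts coordinates with float() and sizes with int().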

The VOCDataSet class is defined in my_dataset.py.

import numpy as np
from torch.utils.data import Dataset
import os
import torch
import json
from PIL import Image
from lxml import etree


class VOCDataSet(Dataset):
    """Read and parse the PASCAL VOC2007/2012 dataset."""

    def __init__(self, voc_root, year="2012", transforms=None, txt_name: str = "train.txt"):
        assert year in ["2007", "2012"], "year must be in ['2007', '2012']"

        # voc_root is the path that leads to VOCdevkit; self.root is the VOC{year} dataset directory.
        # txt_name is train.txt or val.txt and selects which of the two split files is read.
        # Be tolerant about whether voc_root already contains VOCdevkit.
        if "VOCdevkit" in voc_root:
            self.root = os.path.join(voc_root, f"VOC{year}")
        else:
            self.root = os.path.join(voc_root, "VOCdevkit", f"VOC{year}")
        self.img_root = os.path.join(self.root, "JPEGImages")
        self.annotations_root = os.path.join(self.root, "Annotations")

        # read train.txt or val.txt file
        txt_path = os.path.join(self.root, "ImageSets", "Main", txt_name)
        assert os.path.exists(txt_path), "not found {} file.".format(txt_name)

        # read every line of the txt file; line.strip() removes the trailing newline
        with open(txt_path) as read: 
            xml_list = [os.path.join(self.annotations_root, line.strip() + ".xml")
                        for line in read.readlines() if len(line.strip()) > 0]

        self.xml_list = []
        # The loop below:
        # 1. checks that every xml path in xml_list exists
        # 2. drops xml files that contain no annotated object
        for xml_path in xml_list:
            if os.path.exists(xml_path) is False:
                print(f"Warning: not found '{xml_path}', skip this annotation file.")
                continue

            # check for targets
            with open(xml_path) as fid:
                xml_str = fid.read()
            xml = etree.fromstring(xml_str)  # parse the xml file read from the path stored in xml_list
            data = self.parse_xml_to_dict(xml)["annotation"]
            if "object" not in data:        # the xml file has no annotated objects
                print(f"INFO: no objects in {xml_path}, skip this annotation file.")
                continue

            self.xml_list.append(xml_path) 

        assert len(self.xml_list) > 0, "in '{}' file does not find any information.".format(txt_path)

        # load the class dictionary
        json_file = './pascal_voc_classes.json'
        assert os.path.exists(json_file), "{} file not exist.".format(json_file)
        with open(json_file, 'r') as f:
            self.class_dict = json.load(f)

        self.transforms = transforms

    def __len__(self):
        return len(self.xml_list)

    def __getitem__(self, idx):  # return the image and its target for index idx
        # read the xml file
        xml_path = self.xml_list[idx]
        with open(xml_path) as fid:
            xml_str = fid.read()
        xml = etree.fromstring(xml_str)  # parse the xml string with etree
        # parse_xml_to_dict converts the xml tree to a dict; the "annotation" key holds all of its content
        data = self.parse_xml_to_dict(xml)["annotation"]
        img_path = os.path.join(self.img_root, data["filename"])
        image = Image.open(img_path)
        if image.format != "JPEG":  # every image in the VOC dataset is a JPEG file
            raise ValueError("Image '{}' format not JPEG".format(img_path))

        # iscrowd indicates whether the target overlaps with other targets, roughly whether it is
        # hard to detect; 0 means easy to detect. Here it is filled from the VOC "difficult" flag.
        boxes, labels, iscrowd = [], [], []
        assert "object" in data, "{} lack of object information.".format(xml_path)
        for obj in data["object"]:  # iterate over every annotated object
            xmin = float(obj["bndbox"]["xmin"])
            xmax = float(obj["bndbox"]["xmax"])
            ymin = float(obj["bndbox"]["ymin"])
            ymax = float(obj["bndbox"]["ymax"])

            # Extra sanity check: some annotations have boxes with zero width or height,
            # which would make the regression loss NaN.
            if xmax <= xmin or ymax <= ymin:
                print("Warning: in '{}' xml, there are some bbox w/h <=0".format(xml_path))
                continue

            boxes.append([xmin, ymin, xmax, ymax])
            labels.append(self.class_dict[obj["name"]])  # index of the object's class
            if "difficult" in obj:
                iscrowd.append(int(obj["difficult"]))
            else:
                iscrowd.append(0)

        # convert the lists and idx to torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.int64)
        iscrowd = torch.as_tensor(iscrowd, dtype=torch.int64)
        image_id = torch.tensor([idx])
        area = (boxes[:,3]-boxes[:,1])*(boxes[:,2]-boxes[:,0]) #(ymax-ymin)*(xmax-xmin)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            image, target = self.transforms(image, target)

        return image, target

    # read the image height and width from the size element of the xml file
    def get_height_and_width(self, idx):
        # read xml
        xml_path = self.xml_list[idx]
        with open(xml_path) as fid:
            xml_str = fid.read()
        xml = etree.fromstring(xml_str)
        data = self.parse_xml_to_dict(xml)["annotation"]
        data_height = int(data["size"]["height"])
        data_width = int(data["size"]["width"])
        return data_height, data_width

    def parse_xml_to_dict(self, xml):
        """
        Parse an xml tree into a dict; adapted from TensorFlow's recursive_parse_xml_to_dict.
        Args:
            xml: xml tree obtained by parsing XML file contents using lxml.etree
        Returns:
            Python dictionary holding XML contents.

        xml holds the already-parsed file contents. The first call starts at the top-level
        annotation element, and len(xml) tells whether it has child elements.
        Normally len(xml) != 0 there, so the code continues below.
        """
        if len(xml) == 0:  # reached a leaf: return the text stored under this tag
            return {xml.tag: xml.text}

        result = {}
        for child in xml:  # iterate over the child elements
            child_result = self.parse_xml_to_dict(child)  # recurse into the next level
            if child.tag != 'object':
                result[child.tag] = child_result[child.tag]
            else:
                if child.tag not in result:  # there may be several objects, so store them in a list
                    result[child.tag] = []   # i.e. result["object"] = []
                result[child.tag].append(child_result[child.tag])
        return {xml.tag: result}

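    def coco_index(self, idx):
        """
        Used only by pycocotools to collect label statistics; it applies no processing to images or labels.
        Because the image is never read, the statistics pass is much faster.
        Args:
            idx: index of the image to fetch
        """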
        # read xml
        xml_path = self.xml_list[idx]
        with open(xml_path) as fid:
            xml_str = fid.read()
        xml = etree.fromstring(xml_str)
        data = self.parse_xml_to_dict(xml)["annotation"]
        data_height = int(data["size"]["height"])
        data_width = int(data["size"]["width"])
        # img_path = os.path.join(self.img_root, data["filename"])
        # image = Image.open(img_path)
        # if image.format != "JPEG":
        #     raise ValueError("Image format not JPEG")
        boxes = []
        labels = []
        iscrowd = []
        for obj in data["object"]:
            xmin = float(obj["bndbox"]["xmin"])
            xmax = float(obj["bndbox"]["xmax"])
            ymin = float(obj["bndbox"]["ymin"])
            ymax = float(obj["bndbox"]["ymax"])
            boxes.append([xmin, ymin, xmax, ymax])
            labels.append(self.class_dict[obj["name"]])
            iscrowd.append(int(obj["difficult"]))

        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.int64)
        iscrowd = torch.as_tensor(iscrowd, dtype=torch.int64)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        return (data_height, data_width), target

    @staticmethod
    def collate_fn(batch):
        return tuple(zip(*batch))

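The transforms used by VOCDataSet are defined in transforms.py, which implements ToTensor and random horizontal flipping. Note that the bounding boxes must be flipped along with the image: the y coordinates stay the same and only the x coordinates change.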
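
As a small illustration of the box flip, the sketch below mirrors [xmin, ymin, xmax, ymax] boxes for an image of a given width. hflip_boxes is a hypothetical helper that works on plain lists; the project's transforms.py operates on tensors.

```python
def hflip_boxes(boxes, image_width):
    """Mirror [xmin, ymin, xmax, ymax] boxes around the vertical centre line of the image."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        # The old right edge becomes the new left edge and vice versa; y is untouched.
        flipped.append([image_width - xmax, ymin, image_width - xmin, ymax])
    return flipped


print(hflip_boxes([[126.0, 110.0, 162.0, 147.0]], image_width=300))
# [[138.0, 110.0, 174.0, 147.0]]
```

Swapping the two x values is what keeps xmin smaller than xmax after the flip.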
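In the training code, train_data_loader is created with collate_fn=train_dataset.collate_fn (train_dataset is a VOCDataSet instance). Without it, PyTorch's default collate function simply stacks the samples with torch.stack().
In the earlier image classification models, each element of train_dataset was one image, a single tensor, so torch.stack() could combine the images into a batch. In object detection, VOCDataSet returns a tuple (image, target), and the default method fails on it, so a custom collate_fn is needed.
In collate_fn, zip(*batch) unpacks batch into positional arguments for zip. This splits each (image, target) pair and groups the images together and the targets together, returning (batch_images, batch_targets).

With batch_size=8, the input is 8 elements, each an (image, target) pair. The output has only two elements: a tuple of 8 images and a tuple of 8 targets.

In short, VOCDataSet.collate_fn regroups the (image, target) pairs returned by VOCDataSet into a batch of images and a batch of targets. The reason for doing it this way is explained later.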
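
The regrouping done by zip(*batch) can be tried without any tensors. In the sketch below the strings and dicts are stand-ins for image tensors and target dicts; they are made up for illustration.

```python
def collate_fn(batch):
    # batch is a list of (image, target) pairs; zip(*batch) regroups it column-wise.
    return tuple(zip(*batch))


# Stand-ins for tensors: images of different sizes and targets with different numbers of boxes.
batch = [
    ("image_0 (3x375x500)", {"boxes": [[10, 20, 60, 80]], "labels": [12]}),
    ("image_1 (3x333x500)", {"boxes": [[5, 5, 50, 50], [30, 40, 90, 120]], "labels": [7, 15]}),
]

images, targets = collate_fn(batch)
print(len(images), len(targets))  # 2 2
print(images)                     # ('image_0 (3x375x500)', 'image_1 (3x333x500)')
```

Nothing is stacked here: the images keep their individual sizes and each target keeps its own number of boxes, which is exactly what stacking could not handle.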

Below is a simple example that reads from the dataset defined above (draw_objs is in draw_box_utils.py):

import json
import os
import random

import matplotlib.pyplot as plt
import numpy as np
import torchvision.transforms as ts

import transforms  # the project's own transforms.py, not torchvision.transforms
from draw_box_utils import draw_objs
from my_dataset import VOCDataSet  # not needed when this runs at the bottom of my_dataset.py

# read class_indict
category_index = {}
try:
    with open('./pascal_voc_classes.json', 'r') as json_file:
        class_dict = json.load(json_file)
    # predictions and targets both store class indices, so swap keys and values here
    category_index = {str(v): str(k) for k, v in class_dict.items()}
except Exception as e:
    print(e)
    exit(-1)

data_transform = {
    "train": transforms.Compose([transforms.ToTensor(),
                                 transforms.RandomHorizontalFlip(0.5)]),
    "val": transforms.Compose([transforms.ToTensor()])}

# os.getcwd() returns the current working directory
train_data_set = VOCDataSet(os.getcwd(), "2012", data_transform["train"], "train.txt")
print(len(train_data_set))
for index in random.sample(range(0, len(train_data_set)), k=5):  # pick 5 images at random
    img, target = train_data_set[index]  # both are tensors at this point
    img = ts.ToPILImage()(img)           # convert back to a PIL image
    plot_img = draw_objs(img,
                         target["boxes"].numpy(),
                         target["labels"].numpy(),
                         np.ones(target["labels"].shape[0]),  # ground truth has no objectness score, so use 1 for all
                         category_index=category_index,
                         box_thresh=0.5,    # normally filters out low-confidence predictions so they are not drawn
                         line_thickness=3,  # line width of the boxes
                         font='arial.ttf',
                         font_size=20)
    plt.imshow(plot_img)
    plt.show()

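III. Faster R-CNN Model Architecture

3.1 Model Architecture

The figure below shows the Faster R-CNN architecture. The yellow parts are steps that run only during training, such as computing the RPN and Faster R-CNN losses.
[Figure: Faster R-CNN model architecture diagram]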

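roi_heads: the part of the figure from ROI pooling to postprocess, i.e. roi_heads = ROI pooling + TwoMLPHead + Predictor + postprocess Detections.

3.2 faster_rcnn_framework.py

The walkthrough below follows the figure, starting with network_files/faster_rcnn_framework.py.

3.2.1 The FasterRCNNBase Class

Custom collate_fn: train_data_loader is given the custom collate_fn, so in the forward function of FasterRCNNBase both images and targets are lists, i.e. # type: (List[Tensor], Optional[List[Dict[str, Tensor]]])
transform: the transform in FasterRCNNBase is the GeneralizedRCNNTransform class from transform.py, which corresponds to GeneralizedRCNNTransform in the figure. It normalizes and resizes the images (it does not scale every image to one fixed size; it constrains the shortest and longest side of the input).
The if torch.jit.is_scripting() part: see the official PyTorch tutorial "INTRODUCTION TO TORCHSCRIPT", which explains how to use TorchScript. Its advantages are:
- A model converted to TorchScript no longer depends on a Python environment. It runs without the Python interpreter and so is not affected by the Global Interpreter Lock, which allows better use of the hardware and more parallelism.
- The complete model can be saved to disk and loaded in other environments.
- The model can be optimized: similar to tf.function, the dynamic graph is turned into a static graph that can be accelerated during compilation.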
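
The resize rule mentioned for the transform (bound the shortest and the longest side instead of forcing one fixed size) can be written as a single scale factor. The sketch below is a simplified illustration under the assumption that the short side is scaled to min_size unless that would push the long side past max_size; resize_scale is a hypothetical helper, and the defaults 800 and 1333 are the min_size and max_size defaults of the FasterRCNN constructor.

```python
def resize_scale(height, width, min_size=800, max_size=1333):
    """Scale factor that brings the short side to min_size without the long side exceeding max_size."""
    short_side = min(height, width)
    long_side = max(height, width)
    scale = min_size / short_side
    if long_side * scale > max_size:
        # The long side would overshoot, so let it decide the scale instead.
        scale = max_size / long_side
    return scale


for h, w in [(375, 500), (300, 1000)]:
    s = resize_scale(h, w)
    print((h, w), "->", (round(h * s), round(w * s)))
```

For the 375x500 image the short side decides the scale; for the very wide 300x1000 image the long side does, so its short side ends up below min_size. In both cases the aspect ratio is unchanged.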

import warnings
from collections import OrderedDict
from typing import Tuple, List, Dict, Optional, Union

import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torchvision.ops import MultiScaleRoIAlign

from .roi_head import RoIHeads
from .transform import GeneralizedRCNNTransform
from .rpn_function import AnchorsGenerator, RPNHead, RegionProposalNetwork


class FasterRCNNBase(nn.Module):
    """
    Main class for Generalized R-CNN.
    Arguments:
        backbone (nn.Module):
        rpn (nn.Module):
        roi_heads (nn.Module): takes the features + the proposals from the RPN and computes
            detections / masks from it.
        transform (nn.Module): performs the data transformation from the inputs to feed into
            the model
    """

    def __init__(self, backbone, rpn, roi_heads, transform):
        super(FasterRCNNBase, self).__init__()
        self.transform = transform
        self.backbone = backbone
        self.rpn = rpn
        self.roi_heads = roi_heads
        # used only on torchscript mode
        self._has_warned = False

    def forward(self, images, targets=None):
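        # type: (List[Tensor], Optional[List[Dict[str, Tensor]]]) -> Tuple[Dict[str, Tensor], List[Dict[str, Tensor]]]
        """
        The input images have different sizes; preprocessing later places them into
        tensors of one common size and packs them into a batch.
        Arguments:
            images (list[Tensor]): images to be processed
            targets (list[Dict[Tensor]]): ground-truth boxes present in the image (optional)

        Returns:
            result (list[BoxList] or dict[Tensor]): the output from the model.
                During training, it returns a dict[Tensor] which contains the losses.
                During testing, it returns list[BoxList] containing additional fields
                like `scores`, `labels` and `mask` (for Mask R-CNN models).
        """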
        if self.training and targets is None:  # training mode without targets is an error
            raise ValueError("In training mode, targets should be passed")

        if self.training:  # in training mode, validate the targets
            assert targets is not None
            # check that the boxes of every target have the expected type and shape
            for target in targets:
                boxes = target["boxes"]
                if isinstance(boxes, torch.Tensor):  # boxes must be a torch.Tensor
                    # boxes must have shape [N, 4]: two dimensions, the last of size 4,
                    # where N is the number of objects in the image
                    if len(boxes.shape) != 2 or boxes.shape[-1] != 4:
                        raise ValueError("Expected target boxes to be a tensor "
                                         "of shape [N, 4], got {:}.".format(
                                          boxes.shape))
                else:
                    raise ValueError("Expected target boxes to be of type "
                                     "Tensor, got {:}.".format(type(boxes)))
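        # original_image_sizes starts as an empty list annotated as List[Tuple[int, int]].
        # It stores the original size of every image and is used later by the transform.
        original_image_sizes = torch.jit.annotate(List[Tuple[int, int]], [])
        for img in images:
            val = img.shape[-2:]  # VOCDataSet already converted the image to a tensor of shape [channel, h, w]
            assert len(val) == 2  # guards against a 1-D input, for which img.shape[-2:] raises no error
            original_image_sizes.append((val[0], val[1]))
        # original_image_sizes = [img.shape[-2:] for img in images]

        # 1. self.transform is a GeneralizedRCNNTransform, which preprocesses the images (normalize, resize).
        # 2. Preprocessing changes the image sizes, so the original sizes are recorded beforehand.
        #    The final outputs are mapped back to the original sizes so that the boxes are correct.
        # 3. Before preprocessing the images differ in size and cannot be packed into one batch
        #    for parallel computation on the GPU. The transform places the resized images
        #    into one tensor of a given size, and only then is the data a real batch.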
        images, targets = self.transform(images, targets) 

        # print(images.tensors.shape)
        features = self.backbone(images.tensors)  # 将图像输入backbone得到特征图
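        if isinstance(features, torch.Tensor):  # prediction on a single feature map: wrap it in an ordered dict under key '0'
            features = OrderedDict([('0', features)])  # with multiple feature maps, the backbone already returns an ordered dict
        # 1. Why an ordered dict? The resnet50+FPN backbone produces 5 feature maps and
        #    returns them as an ordered dict with keys [0, 1, 2, 3, pool].
        # 2. mobilenet as the backbone yields a single feature map; storing it in an ordered dict
        #    as well lets single-level and multi-level prediction share the same code.

        # Pass the feature maps and the targets to the RPN to get the region proposals and the RPN loss.
        # proposals: List[Tensor], Tensor_shape: [num_proposals, 4],
        # each proposal is in absolute coordinates in (x1, y1, x2, y2) format
        proposals, proposal_losses = self.rpn(images, features, targets)

        # Pass the RPN output and the targets to the second half of Fast R-CNN, i.e. roi_heads.
        # detections are the final predictions, in the coordinates of the preprocessed images;
        # detector_losses are the Faster R-CNN losses.
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        # 1. transform.postprocess post-processes the predictions and maps the bboxes back to the
        #    original image scale; it is the last step in the figure, GeneralizedRCNNTransform postprocess.
        # 2. images.image_sizes holds the image sizes after the transform preprocessing.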
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

        losses = {}
        losses.update(detector_losses)  # dict1.update(dict2) copies the key-value pairs of dict2 into dict1
        losses.update(proposal_losses)

        # For TorchScript see https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html
        if torch.jit.is_scripting():
            if not self._has_warned:
                warnings.warn("RCNN always returns a (Losses, Detections) tuple in scripting")
                self._has_warned = True
            return losses, detections
        else:
            # eager_outputs is not shown in this excerpt; it implements the logic commented out below
            return self.eager_outputs(losses, detections)

        # if self.training:
        #     return losses
        #
        # return detections


class TwoMLPHead(nn.Module):
    """
    Standard heads for FPN-based models
    Arguments:
        in_channels (int): number of input channels
        representation_size (int): size of the intermediate representation
    """

    def __init__(self, in_channels, representation_size):
        super(TwoMLPHead, self).__init__()

        self.fc6 = nn.Linear(in_channels, representation_size)
        self.fc7 = nn.Linear(representation_size, representation_size)

    def forward(self, x):
        x = x.flatten(start_dim=1)

        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))

        return x


class FastRCNNPredictor(nn.Module):
    """
    Standard classification + bounding box regression layers
    for Fast R-CNN.
    Arguments:
        in_channels (int): number of input channels
        num_classes (int): number of output classes (including background)
    """

    def __init__(self, in_channels, num_classes):
        super(FastRCNNPredictor, self).__init__()
        self.cls_score = nn.Linear(in_channels, num_classes)
        self.bbox_pred = nn.Linear(in_channels, num_classes * 4)

    def forward(self, x):
        if x.dim() == 4:
            assert list(x.shape[2:]) == [1, 1]
        x = x.flatten(start_dim=1)
        scores = self.cls_score(x)
        bbox_deltas = self.bbox_pred(x)

        return scores, bbox_deltas

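Summary:

FasterRCNNBase is the forward pass of the whole network:
- Feed the images to the backbone to get the feature maps.
- Feed the feature maps to the RPN to get proposals and proposal_losses.
- Pass the proposals through roi_heads to get detections, the predictions at the scale of the preprocessed images, and detector_losses. proposal_losses + detector_losses is the loss of the whole network and is what gets backpropagated.
- Pass detections through transform.postprocess to get the predictions mapped onto the original images.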
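
The last step, mapping boxes from the resized image back to the original image, amounts to scaling x by the width ratio and y by the height ratio. The sketch below shows that arithmetic on plain lists; rescale_boxes is a hypothetical helper and the sizes are made-up examples in (height, width) order.

```python
def rescale_boxes(boxes, resized_size, original_size):
    """Map [xmin, ymin, xmax, ymax] boxes from the resized image back to the original image.

    Both sizes are (height, width) tuples.
    """
    ratio_h = original_size[0] / resized_size[0]
    ratio_w = original_size[1] / resized_size[1]
    return [[xmin * ratio_w, ymin * ratio_h, xmax * ratio_w, ymax * ratio_h]
            for xmin, ymin, xmax, ymax in boxes]


# A 375x500 image that was resized to 750x1000 before entering the network.
print(rescale_boxes([[200.0, 100.0, 400.0, 300.0]], (750, 1000), (375, 500)))
# [[100.0, 50.0, 200.0, 150.0]]
```

This is why forward records original_image_sizes before calling the transform: without them the ratios cannot be computed.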

3.2.2 The FasterRCNN Class

The FasterRCNN class mainly defines the parameters and assembles the individual modules.

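class FasterRCNN(FasterRCNNBase):
    """
    Implements Faster R-CNN.
    The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
    image, and should be in 0-1 range. Different images can have different sizes.
    The behavior of the model changes depending on whether it is in training or evaluation mode.
    During training, the model expects both the input tensors and targets (list of dictionary),
    containing:
        - boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with values
          between 0 and H and 0 and W
        - labels (Int64Tensor[N]): the class label for each ground-truth box
    During training the model returns a Dict[Tensor] containing the classification and regression
    losses for both the RPN and the R-CNN.
    During inference, the model requires only the input tensors and returns the post-processed
    predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are:
        - boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values
          between 0 and H and 0 and W
        - labels (Int64Tensor[N]): the predicted labels for each image
        - scores (Tensor[N]): the confidence score of each prediction

    Arguments:
        backbone (nn.Module): the backbone network used to extract image features.
            It should contain an out_channels attribute, which indicates the number of output
            channels that each feature map has (and it should be the same for all feature maps).
            The backbone should return a single Tensor or an OrderedDict[Tensor].
        num_classes (int): number of output classes of the model (including the background).
            VOC has 20 classes, so num_classes=21.
            If box_predictor is specified, num_classes should be None.
        min_size (int): minimum size allowed by the resize in the transform preprocessing
        max_size (int): maximum size allowed by the resize in the transform preprocessing
        image_mean (Tuple[float, float, float]): mean values used for input normalization.
            They are generally the mean values of the dataset on which the backbone has been trained
            on
        image_std (Tuple[float, float, float]): std values used for input normalization.
            They are generally the std values of the dataset on which the backbone has been trained on

        rpn_anchor_generator (AnchorGenerator): module that generates the anchors on the feature maps
        rpn_head (nn.Module): module that computes the objectness and regression deltas in the RPN;
            it corresponds to RPNHead in the figure.
            The 3×3 sliding window is implemented as a 3×3 convolution.
        rpn_pre_nms_top_n_train (int):  number of proposals kept before NMS during training
        rpn_pre_nms_top_n_test (int):   number of proposals kept before NMS during testing
        rpn_post_nms_top_n_train (int): number of proposals kept after NMS during training
        rpn_post_nms_top_n_test (int):  number of proposals kept after NMS during testing
        rpn_nms_thresh (float):    NMS threshold used for post-processing the RPN proposals
        rpn_fg_iou_thresh (float): positive-sample IoU threshold for RPN training; an anchor whose IoU
            with any GT box exceeds it is treated as a positive (foreground).
        rpn_bg_iou_thresh (float): negative-sample IoU threshold for RPN training; an anchor whose IoU
            with every GT box is below it is treated as a negative (background).
        rpn_batch_size_per_image (int): number of anchors sampled during RPN training for computing
            the loss; 256 by default
        rpn_positive_fraction (float):  proportion of positive anchors in a mini-batch during RPN
            training; 0.5 by default
        rpn_score_thresh (float):  during inference, only return proposals with a classification score
            greater than rpn_score_thresh

        box_roi_pool (MultiScaleRoIAlign): corresponds to the RoI pooling layer
        box_head (nn.Module): corresponds to TwoMLPHead in the figure, i.e. Flatten + FC
        box_predictor (nn.Module): the FastRCNNPredictor module in the figure; takes the output of
            box_head and returns the class scores and box regression parameters
        box_score_thresh (float): during inference, only return proposals with a classification score
            greater than this value
        box_nms_thresh (float):   NMS threshold for the prediction head during inference
        box_detections_per_img (int): maximum number of detections per image (over all classes);
            100 by default, which is usually enough
        box_fg_iou_thresh (float): positive-sample IoU threshold for Faster R-CNN training; a proposal
            whose IoU with any GT box exceeds it is treated as a positive.
        box_bg_iou_thresh (float): negative-sample IoU threshold for Faster R-CNN training; a proposal
            whose IoU with every GT box is below it is treated as a negative.
        box_batch_size_per_image (int): number of proposals sampled during Faster R-CNN training;
            512 by default
        box_positive_fraction (float):  proportion of positive samples during Faster R-CNN training;
            0.25 by default
        bbox_reg_weights (Tuple[float, float, float, float]): weights for encoding/decoding the
            bounding boxes
    """

    # In the init parameters, the numbers of proposals kept before and after NMS are equal.
    # This is meant for networks with FPN: FPN outputs 5 feature levels, each level keeps
    # 2000 proposals before NMS, and after NMS 2000 proposals are still kept according to score.
    # This happens in Filter Proposals in the figure: proposals are first filtered by predicted
    # score and then passed through NMS.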
    def __init__(self, backbone, num_classes=None,
                 # transform parameters
                 min_size=800, max_size=1333,      # min and max size limits for the resize in preprocessing
                 image_mean=None, image_std=None,  # mean and std used by normalize; the ImageNet values are used by default
                 # RPN parameters
                 rpn_anchor_generator=None, rpn_head=None,
                 rpn_pre_nms_top_n_train=2000, rpn_pre_nms_top_n_test=1000,    # proposals kept before NMS in the RPN
                 rpn_post_nms_top_n_train=2000, rpn_post_nms_top_n_test=1000,  # proposals kept after NMS in the RPN
                 rpn_nms_thresh=0.7,  # IoU threshold used by NMS in the RPN
                 # The next two are the IoU thresholds for sampling positives/negatives (foreground/background)
                 # when computing the RPN loss; anchors whose IoU with the GT boxes lies between 0.3 and 0.7 are discarded
                 rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
                 rpn_batch_size_per_image=256, rpn_positive_fraction=0.5,  # anchors sampled for RPN training, fraction of positives
                 rpn_score_thresh=0.0,
                 # Box parameters
                 box_roi_pool=None, box_head=None, box_predictor=None,
                 # drop low-score detections; NMS threshold in fast rcnn; keep the top 100 detections by score
                 box_score_thresh=0.05, box_nms_thresh=0.5, box_detections_per_img=100,
                 box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,   # thresholds for sampling positives/negatives for the faster rcnn loss
                 box_batch_size_per_image=512, box_positive_fraction=0.25,
                 # number of samples drawn for the faster rcnn loss, and the fraction of positives among them
                 bbox_reg_weights=None):

        """
        1. Check that the backbone has an out_channels attribute. For example, the create_model function
           in train_mobilenetv2.py adds out_channels to the backbone; it is the number of channels of
           the feature map the backbone outputs.
        2. create_model defines anchor_generator as an AnchorsGenerator instance.
           rpn_anchor_generator may also be None, in which case a generator is created below.
        3. box_roi_pool is either the MultiScaleRoIAlign instantiated in create_model or None.
        4. Check whether num_classes and box_predictor were defined in create_model; a box_predictor
           that is None is created here.
        """
        if not hasattr(backbone, "out_channels"):
            raise ValueError(
                "backbone should contain an attribute out_channels "
                "specifying the number of output channels (assumed to be the "
                "same for all the levels)")

        assert isinstance(rpn_anchor_generator, (AnchorsGenerator, type(None)))
        assert isinstance(box_roi_pool, (MultiScaleRoIAlign, type(None)))

        if num_classes is not None:
            if box_predictor is not None:
                raise ValueError("num_classes should be None when box_predictor "
                                 "is specified")
        else:
            if box_predictor is None:
                raise ValueError("num_classes should not be None when box_predictor "
                                 "is not specified")

        # number of channels of the prediction feature maps
        out_channels = backbone.out_channels

        # If no anchor generator is given, build the one for resnet50_fpn; the mobilenetv2 script defines its own.
        # resnet50_fpn has 5 prediction feature levels, each predicting objects at one scale.
        # The comma in (32,) makes it a tuple; never drop it, or the value is treated as an int.
        if rpn_anchor_generator is None:
            anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
            aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)  # repeat the tuple (0.5, 1.0, 2.0) 5 times
            rpn_anchor_generator = AnchorsGenerator(
                anchor_sizes, aspect_ratios
            )

        # The RPN head, i.e. the sliding-window prediction part of the RPN.
        # It is not passed by default (None), so it is created here.
        if rpn_head is None:
            rpn_head = RPNHead(
                out_channels, rpn_anchor_generator.num_anchors_per_location()[0]
            )

        # defaults: rpn_pre_nms_top_n_train = 2000, rpn_pre_nms_top_n_test = 1000,
        # defaults: rpn_post_nms_top_n_train = 2000, rpn_post_nms_top_n_test = 1000,
        rpn_pre_nms_top_n = dict(training=rpn_pre_nms_top_n_train, testing=rpn_pre_nms_top_n_test)
        rpn_post_nms_top_n = dict(training=rpn_post_nms_top_n_train, testing=rpn_post_nms_top_n_test)

        # Build the whole RPN; it is covered later.
        rpn = RegionProposalNetwork(
            rpn_anchor_generator, rpn_head,
            rpn_fg_iou_thresh, rpn_bg_iou_thresh,
            rpn_batch_size_per_image, rpn_positive_fraction,
            rpn_pre_nms_top_n, rpn_post_nms_top_n, rpn_nms_thresh,
            score_thresh=rpn_score_thresh)

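        #  Multi-scale RoIAlign pooling. box_roi_pool is defined in train_mobilenetv2.py;
        #  the resnet50_fpn script does not define it, so None is passed in and it is created here.
        if box_roi_pool is None:
            # featmap_names lists the feature levels used for roi pooling. The resnet50_fpn source
            # also has a pooling level, but the official implementation does not include it here.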
            box_roi_pool = MultiScaleRoIAlign(            	
                featmap_names=['0', '1', '2', '3'],  
                output_size=[7, 7],
                sampling_ratio=2)

        # the flatten + two fully connected layers that follow roi pooling in fast RCNN
        if box_head is None:
            resolution = box_roi_pool.output_size[0]  # 默认等于7
            representation_size = 1024
            box_head = TwoMLPHead(
                out_channels * resolution ** 2,
                representation_size)

        # predictor on top of the box_head output: class scores and bounding-box regression parameters
        if box_predictor is None:
            representation_size = 1024
            box_predictor = FastRCNNPredictor(
                representation_size,
                num_classes)

        # combine RoI pooling, box_head and box_predictor
        roi_heads = RoIHeads(
            # box
            box_roi_pool, box_head, box_predictor,
            box_fg_iou_thresh, box_bg_iou_thresh,  # 0.5  0.5
            box_batch_size_per_image, box_positive_fraction,  # 512  0.25
            bbox_reg_weights,
            box_score_thresh, box_nms_thresh, box_detections_per_img)  # 0.05  0.5  100

        if image_mean is None:  # per-channel mean and standard deviation used in preprocessing
            image_mean = [0.485, 0.456, 0.406]
        if image_std is None:
            image_std = [0.229, 0.224, 0.225]

        # Normalizes, resizes and batches the data; runs before the backbone.
        # GeneralizedRCNNTransform also contains postprocess, the last block of the framework diagram.
        transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)

        super(FasterRCNN, self).__init__(backbone, rpn, roi_heads, transform)
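
The nested-tuple arguments used in the constructor above are easy to get wrong, so here is a minimal sketch in plain Python (no PyTorch needed) of how they behave. The names mobilenet_sizes, mobilenet_ratios and the standalone helper num_anchors_per_location are illustrative only; the helper uses the same formula as the AnchorsGenerator method of that name, and the values are the ones quoted in this article.

```python
# (32) is just the int 32; the trailing comma is what makes a tuple.
assert isinstance((32), int)
assert isinstance((32,), tuple)

# ResNet50+FPN: one anchor size per feature level, three aspect ratios on every level.
anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)  # the same inner tuple, repeated 5 times


def num_anchors_per_location(sizes, ratios):
    """Anchors per sliding-window position, one entry per feature level."""
    return [len(s) * len(a) for s, a in zip(sizes, ratios)]


fpn_counts = num_anchors_per_location(anchor_sizes, aspect_ratios)
print(fpn_counts)  # [3, 3, 3, 3, 3]

# MobileNetV2: a single feature level that carries all five sizes.
mobilenet_sizes = ((32, 64, 128, 256, 512),)
mobilenet_ratios = ((0.5, 1.0, 2.0),)
mobilenet_counts = num_anchors_per_location(mobilenet_sizes, mobilenet_ratios)
print(mobilenet_counts)  # [15]
```

RPNHead receives only the first entry of this list, which is 3 for the FPN configuration and 15 for the MobileNetV2 configuration.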

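3.3 Data preprocessing

This code lives in network_files/transform.py, which defines the GeneralizedRCNNTransform class for data preprocessing. Its two main methods are:

forward: normalizes and resizes each image, packs the images into a batch for the network, and records each image's size after resizing. It returns (image_list, targets), where image_list is an ImageList(images, image_sizes_list) object. images is the batched tensor: all images inside one batch share the same size, but that size differs from batch to batch. image_sizes_list holds the image sizes after resizing and before batching. targets likewise holds the labels after resizing and before batching.
postprocess: uses the recorded post-resize sizes to map the predictions back to the original image sizes.

I removed the torch.jit-related code; see the original source if you need it.
The batch_images method defined below does not simply resize every image to one common size when batching (that would distort the images; squeezing a wide image into a square, for example, looks wrong). Instead it pads the images of a mini-batch. It first takes the largest height and width found in the mini-batch (the blue box in the figure), rounded up to a multiple of size_divisible. Every image is then aligned to the top-left corner of that canvas and the remaining area is filled with zeros. The aspect ratio of each image is preserved, and the zero padding does not interfere with detection.
[Figure: images of a mini-batch zero-padded to a common canvas size (blue box)]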
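
To make the padding rule concrete, here is a small sketch in plain Python that only computes the shape of the padded batch, without touching any pixel data. The function name padded_batch_shape and the two example image shapes are assumptions made for illustration; the rounding mirrors what batch_images does with size_divisible=32.

```python
import math


def padded_batch_shape(shapes, size_divisible=32):
    """Shape [batch, C, H, W] of the zero-padded batch for images of the given [C, H, W] shapes."""
    max_c = max(s[0] for s in shapes)
    max_h = max(s[1] for s in shapes)
    max_w = max(s[2] for s in shapes)
    # round height and width up to the next multiple of size_divisible
    max_h = int(math.ceil(max_h / size_divisible) * size_divisible)
    max_w = int(math.ceil(max_w / size_divisible) * size_divisible)
    return [len(shapes), max_c, max_h, max_w]


# two hypothetical images after resizing: 600x800 and 750x1000
example_shapes = [[3, 600, 800], [3, 750, 1000]]
batch_shape = padded_batch_shape(example_shapes)
print(batch_shape)  # [2, 3, 768, 1024]
```

Both images end up on a 768×1024 canvas. The 600×800 image occupies the top-left corner and the rest is zero, so its box coordinates need no adjustment.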

import math
from typing import List, Tuple, Dict, Optional

import torch
from torch import nn, Tensor
import torchvision

from .image_list import ImageList


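# Some torch.jit-only helper code from the original source is omitted here (its leftover @torch.jit.unused
# decorator would otherwise be applied to the class below). See the original source if you need it.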

class GeneralizedRCNNTransform(nn.Module):
    """
    Performs input / target transformation before feeding the data to a GeneralizedRCNN
    model.
    The transformations it perform are:
        - input normalization (mean subtraction and std division)
        - input / target resizing to match min_size / max_size
    It returns a ImageList for the inputs, and a List[Dict[Tensor]] for the targets
    """

    def __init__(self, min_size, max_size, image_mean, image_std):
        super(GeneralizedRCNNTransform, self).__init__()
        if not isinstance(min_size, (list, tuple)):  # wrap a scalar min_size in a tuple
            min_size = (min_size,)
        self.min_size = min_size      # candidate lengths for the shorter image side
        self.max_size = max_size      # upper bound for the longer image side
        self.image_mean = image_mean  # per-channel mean used for normalization
        self.image_std = image_std    # per-channel standard deviation used for normalization

    def normalize(self, image):
        """Normalize the image per channel."""
        dtype, device = image.dtype, image.device
        mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)
        std = torch.as_tensor(self.image_std, dtype=dtype, device=device)
        # mean and std are lists of length 3, one value per channel
        # indexing with None makes them 3-D so they broadcast against the 3-D image: shape [3] -> [3, 1, 1]
        return (image - mean[:, None, None]) / std[:, None, None]

    def torch_choice(self, k):
        # type: (List[int]) -> int
        """
        Implements `random.choice` via torch ops so it can be compiled with
        TorchScript. Remove if https://github.com/pytorch/pytorch/issues/25803
        is fixed.
        """
        index = int(torch.empty(1).uniform_(0., float(len(k))).item())
        return k[index]

    def resize(self, image, target):
        # type: (Tensor, Optional[Dict[str, Tensor]]) -> Tuple[Tensor, Optional[Dict[str, Tensor]]]
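        """
        Resize the image into the configured size range and rescale its bounding boxes accordingly.
        Args:
            image: input image
            target: annotations of the input image (including the bboxes)
        Returns:
            image: resized image
            target: annotations with the bboxes rescaled
        """
        # image shape is [channel, height, width]
        h, w = image.shape[-2:]
        if self.training:
            size = float(self.torch_choice(self.min_size))  # target length of the shorter side; note: self.min_size, not min_size
        else:
            # FIXME assume for now that testing uses the largest scale
            size = float(self.min_size[-1])    # target length of the shorter side; note: self.min_size, not min_size
        # NOTE: the step that actually resizes `image` to `size` (capped by self.max_size) appears to have been
        # removed from this excerpt together with the torch.jit code; see the original source.
        # Without it, image.shape[-2:] below still equals [h, w].
        if target is None:  # no target, i.e. inference mode
            return image, target

        bbox = target["boxes"]
        # rescale the bboxes by the image scaling ratio;
        # [h, w] and image.shape[-2:] are the (height, width) of the image before and after resizing
        bbox = resize_boxes(bbox, [h, w], image.shape[-2:])  # resize_boxes is defined at the end of this file
        target["boxes"] = bbox

        return image, target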


    def max_by_axis(self, the_list):
        # type: (List[List[int]]) -> List[int]
        maxes = the_list[0]  # start from the shape of the first image
        for sublist in the_list[1:]:  # for every remaining image, keep the per-dimension maximum of [C, H, W]
            for index, item in enumerate(sublist):
                maxes[index] = max(maxes[index], item)
        return maxes  # [max_channel, max_height, max_width] over all images in the batch

    def batch_images(self, images, size_divisible=32):
        # type: (List[Tensor], int) -> Tensor
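        """
        Pack a list of images into one batch tensor (every image in the batch ends up with the same shape).
        Args:
            images: the input images
            size_divisible: round the height and width up to a multiple of this number
        Returns:
            batched_imgs: the batched tensor
        """
        # ONNX is an open neural-network exchange format. Models from TensorFlow, PyTorch, Caffe, etc. can be
        # converted to it and then run without the original framework.
        # Ignore this branch unless you export to ONNX (much of the related code was already removed above).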
        if torchvision._is_tracing(): 
            # batch_images() does not export well to ONNX
            # call _onnx_batch_images() instead
            return self._onnx_batch_images(images, size_divisible)

        # per-dimension maximum of channel, height and width over all images in the batch; max_size is a 1-D list
        max_size = self.max_by_axis([list(img.shape) for img in images])

        stride = float(size_divisible)
        # max_size = list(max_size)
        # round the height up to a multiple of stride (math.ceil rounds up)
        max_size[1] = int(math.ceil(float(max_size[1]) / stride) * stride)
        # round the width up to a multiple of stride
        max_size[2] = int(math.ceil(float(max_size[2]) / stride) * stride)

        # batch_shape: [batch, channel, height, width]
        batch_shape = [len(images)] + max_size

        # create an all-zero tensor of shape batch_shape
        # images[0] only supplies the dtype and device of the new tensor; any image of the batch would do
        batched_imgs = images[0].new_full(batch_shape, 0)
        for img, pad_img in zip(images, batched_imgs):
            # img.shape[0] = 3; img.shape[1] and img.shape[2] are the height and width of the current image
            # copy_ copies the elements of src into the destination tensor in place and returns it
            # each input image is copied into its slot of batched_imgs, aligned to the top-left corner, so bbox coordinates stay valid
            # every image in the batch now has the same shape while its own scale is unchanged
            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)

        return batched_imgs

    def postprocess(self,
                    result,                # type: List[Dict[str, Tensor]]
                    image_shapes,          # type: List[Tuple[int, int]]
                    original_image_sizes   # type: List[Tuple[int, int]]
                    ):
        # type: (...) -> List[Dict[str, Tensor]]
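        """
        Post-process the network predictions (mainly mapping the bboxes back to the original image scale).
        Args:
            result: list(dict), network predictions, including box coordinates and class information
            image_shapes: list(torch.Size), image sizes after the preprocessing resize
            original_image_sizes: list(torch.Size), original image sizes
        Returns:
            result: the predictions; in eval mode the boxes are mapped back to the original image sizes
        """
        if self.training:  # training only needs the losses, so the boxes are not mapped back to the original images
            return result

        # for each image, map its predicted boxes back to the original scale
        for i, (pred, im_s, o_im_s) in enumerate(zip(result, image_shapes, original_image_sizes)):
            boxes = pred["boxes"]
            boxes = resize_boxes(boxes, im_s, o_im_s)  # rescale the bboxes to the original image size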
            result[i]["boxes"] = boxes
        return result

    def __repr__(self):
        """Custom string representation of the instance, shown by print(). This part can be skipped."""
        format_string = self.__class__.__name__ + '('
        _indent = '\n    '
        format_string += "{0}Normalize(mean={1}, std={2})".format(_indent, self.image_mean, self.image_std)
        format_string += "{0}Resize(min_size={1}, max_size={2}, mode='bilinear')".format(_indent, self.min_size,
                                                                                         self.max_size)
        format_string += '\n)'
        return format_string

    def forward(self,
                images,       # type: List[Tensor]
                targets=None  # type: Optional[List[Dict[str, Tensor]]]
                ):
        # type: (...) -> Tuple[ImageList, Optional[List[Dict[str, Tensor]]]]
        images = [img for img in images]
        for i in range(len(images)):
            image = images[i]
            target_index = targets[i] if targets is not None else None

            if image.dim() != 3:  # every input image must be a 3-D tensor
                raise ValueError("images is expected to be a list of 3d tensors "
                                 "of shape [C, H, W], got {}".format(image.shape))
            image = self.normalize(image)                # normalize the image
            image, target_index = self.resize(image, target_index)   # resize the image and its bboxes into the configured range
            images[i] = image
            if targets is not None and target_index is not None:
                targets[i] = target_index

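        # image_sizes is taken before batching: the (height, width) of each image after resizing.
        # After batch_images, images is a single 4-D tensor [batch, C, H, W].
        image_sizes = [img.shape[-2:] for img in images]
        images = self.batch_images(images)  # pack images into a batch; all images in it share one size
        image_sizes_list = torch.jit.annotate(List[Tuple[int, int]], [])

        for image_size in image_sizes:
            assert len(image_size) == 2
            image_sizes_list.append((image_size[0], image_size[1]))
        # Note that the sizes stored here come from image_sizes, not from the batched images tensor:
        # they are the sizes after resizing and before batching. targets are likewise the labels after resizing.
        # The network sees the resized images and predicts boxes at that scale, but the final results should be
        # shown on the original images, so the post-resize sizes are recorded for postprocess to map the boxes back.
        image_list = ImageList(images, image_sizes_list)  # ImageList is defined in image_list.py
        return image_list, targets  # the processed data that is fed to the backbone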

def resize_boxes(boxes, original_size, new_size):
    # type: (Tensor, List[int], List[int]) -> Tensor
    """
    Rescale boxes according to how the image was resized.
    Arguments:
        original_size: image size before resizing
        new_size: image size after resizing
    """
    ratios = [
        torch.tensor(s, dtype=torch.float32, device=boxes.device) /
        torch.tensor(s_orig, dtype=torch.float32, device=boxes.device)
        for s, s_orig in zip(new_size, original_size)]  # zip pairs the new and original (height, width); dividing gives the ratios
    ratios_height, ratios_width = ratios
    # boxes has shape [N, 4], one row per box
    # Returns a tuple of all slices along a given dimension, already without it.
    # unbind(1) removes dimension 1 and returns a tuple of the slices along it:
    # boxes [N, 4] -> 4 tensors of shape [N] (xmin, ymin, xmax, ymax), which are stacked back together at the end
    xmin, ymin, xmax, ymax = boxes.unbind(1)
    xmin = xmin * ratios_width
    xmax = xmax * ratios_width
    ymin = ymin * ratios_height
    ymax = ymax * ratios_height
    return torch.stack((xmin, ymin, xmax, ymax), dim=1)
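
As a quick check of the arithmetic in resize_boxes, the sketch below redoes it with plain Python lists. The helper name scale_boxes and the example sizes are assumptions chosen so that the ratios are exact; it is not the project's implementation, only the same formula applied to one box.

```python
def scale_boxes(boxes, original_size, new_size):
    """Rescale [xmin, ymin, xmax, ymax] boxes from original_size to new_size, both given as (height, width)."""
    ratio_height = new_size[0] / original_size[0]
    ratio_width = new_size[1] / original_size[1]
    return [
        [xmin * ratio_width, ymin * ratio_height, xmax * ratio_width, ymax * ratio_height]
        for xmin, ymin, xmax, ymax in boxes
    ]


boxes = [[100.0, 50.0, 300.0, 200.0]]
resized = scale_boxes(boxes, (300, 400), (600, 800))    # preprocessing: 300x400 -> 600x800
restored = scale_boxes(resized, (600, 800), (300, 400))  # postprocess: back to the original size
print(resized)   # [[200.0, 100.0, 600.0, 400.0]]
print(restored)  # [[100.0, 50.0, 300.0, 200.0]]
```

The x coordinates use the width ratio and the y coordinates use the height ratio, which is why the sizes are passed as (height, width) while the boxes are stored as (x, y) pairs.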

GeneralizedRCNNTransform and the ImageList class

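4. The RPN

All of the code below is in rpn_function.py.

Note: as described in the preprocessing section, the input size of each batch is only constrained to a range. Batches are sampled randomly during training, so while debugging you will see a different image size, and therefore a different number of anchors per image, on every run. (Sizes are identical within a batch but differ between batches.)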

4.1 AnchorsGenerator

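The forward pass of AnchorsGenerator takes (image_list, feature_maps). image_list is an ImageList object that holds the batched image tensor together with the image sizes recorded after resizing and before batching. feature_maps is the list of prediction feature maps.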

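For the MobileNetV2 network:

In train_mobilenet.py, the anchor generator is created inside create_model:

# generate anchors of 5 sizes on the single prediction feature map
# set_cell_anchors below iterates over sizes and aspect_ratios, so both are passed as ((values),), with an extra level of nesting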
anchor_generator = AnchorsGenerator(sizes=((32, 64, 128, 256, 512),),
                                        aspect_ratios=((0.5, 1.0, 2.0),))

set_cell_anchors generates the anchor templates self.cell_anchors (the top-left and bottom-right corners of each anchor relative to the anchor's own center, i.e. relative coordinates). Internally it calls generate_anchors, whose intermediate values are:

# Debugging in PyCharm: set a breakpoint or place the cursor where needed, then debug the train_mobilenet.py script.
# Use Run to Cursor (Alt+F9), then step over with F8. Step into / step out also work; see any debugging tutorial for details.
ws=tensor([ 45.2548,  90.5097, 181.0193, 362.0387, 724.0773,  32.0000,  64.0000,
        128.0000, 256.0000, 512.0000,  22.6274,  45.2548,  90.5097, 181.0193,
        362.0387])

hs=tensor([ 22.6274,  45.2548,  90.5097, 181.0193, 362.0387,  32.0000,  64.0000,
        128.0000, 256.0000, 512.0000,  45.2548,  90.5097, 181.0193, 362.0387,
        724.0773])

base_anchors=tensor([[ -22.6274,  -11.3137,   22.6274,   11.3137],
        [ -45.2548,  -22.6274,   45.2548,   22.6274],
        [ -90.5097,  -45.2548,   90.5097,   45.2548],
        [-181.0193,  -90.5097,  181.0193,   90.5097],
        [-362.0387, -181.0193,  362.0387,  181.0193],
        [ -16.0000,  -16.0000,   16.0000,   16.0000],
        [ -32.0000,  -32.0000,   32.0000,   32.0000],
        [ -64.0000,  -64.0000,   64.0000,   64.0000],
        [-128.0000, -128.0000,  128.0000,  128.0000],
        [-256.0000, -256.0000,  256.0000,  256.0000],
        [ -11.3137,  -22.6274,   11.3137,   22.6274],
        [ -22.6274,  -45.2548,   22.6274,   45.2548],
        [ -45.2548,  -90.5097,   45.2548,   90.5097],
        [ -90.5097, -181.0193,   90.5097,  181.0193],
        [-181.0193, -362.0387,  181.0193,  362.0387]])

After rounding with round(), the anchor templates are:

# cell_anchors
[tensor([[ -23.,  -11.,   23.,   11.],
         [ -45.,  -23.,   45.,   23.],
         [ -91.,  -45.,   91.,   45.],
         [-181.,  -91.,  181.,   91.],
         [-362., -181.,  362.,  181.],
         [ -16.,  -16.,   16.,   16.],
         [ -32.,  -32.,   32.,   32.],
         [ -64.,  -64.,   64.,   64.],
         [-128., -128.,  128.,  128.],
         [-256., -256.,  256.,  256.],
         [ -11.,  -23.,   11.,   23.],
         [ -23.,  -45.,   23.,   45.],
         [ -45.,  -91.,   45.,   91.],
         [ -91., -181.,   91.,  181.],
         [-181., -362.,  181.,  362.]])]

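I used to wonder which dimension the sizes passed to AnchorsGenerator refer to, and what a size actually means. The values above answer this: every anchor template, measured on the original image, has area $size^2$, which you can verify from the coordinates. In the next step these template coordinates are added to the grid-point coordinates on the original image, which gives the absolute coordinates of the anchors generated at every grid point.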
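
The claim that every template has area $size^2$ can be checked with a few lines of plain Python. This is a sketch of the same arithmetic as generate_anchors, written without PyTorch; the function name anchor_templates is an assumption for illustration, and the templates are left unrounded so that the area check is exact up to floating-point error.

```python
import math


def anchor_templates(sizes, aspect_ratios):
    """Return unrounded (xmin, ymin, xmax, ymax) templates centred at (0, 0)."""
    templates = []
    for ratio in aspect_ratios:  # ratio = height / width
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for size in sizes:
            w, h = w_ratio * size, h_ratio * size
            templates.append((-w / 2, -h / 2, w / 2, h / 2))
    return templates


sizes = (32, 64, 128, 256, 512)
templates = anchor_templates(sizes, (0.5, 1.0, 2.0))

# width, height, area and h/w of the five templates with aspect ratio 0.5
for xmin, ymin, xmax, ymax in templates[:5]:
    w, h = xmax - xmin, ymax - ymin
    print(round(w, 4), round(h, 4), round(w * h), round(h / w, 4))
# first line printed: 45.2548 22.6274 1024 0.5
```

The width and height change with the aspect ratio, but their product stays at 32² = 1024 for the first template, 64² = 4096 for the second, and so on.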

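grid_anchors maps the relative coordinates in self.cell_anchors onto the original image. Image sizes differ from batch to batch, so the grid size of the feature map differs too. Suppose the grid is 25×38 and the (height, width) strides relative to the original image are strides=[32, 32]. After shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x) builds the grid, shift_y and shift_x look like this:

# shift_y: each element is the y coordinate on the original image of one grid point of the feature map. torch.Size([25, 38])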
tensor([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
           0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
           0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
           0.,   0.],
        [ 32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,
          32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,
          32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,  32.,
          32.,  32.],
        [ 64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,
          64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,
          64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,  64.,
          64.,  64.],
        [ 96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,
          96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,
          96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,  96.,
          96.,  96.],
        [128., 128., 128., 128., 128., 128., 128., 128., 128., 128., 128., 128.,
         128., 128., 128., 128., 128., 128., 128., 128., 128., 128., 128., 128.,
         128., 128., 128., 128., 128., 128., 128., 128., 128., 128., 128., 128.,
         128., 128.],
        [160., 160., 160., 160., 160., 160., 160., 160., 160., 160., 160., 160.,
         160., 160., 160., 160., 160., 160., 160., 160., 160., 160., 160., 160.,
         160., 160., 160., 160., 160., 160., 160., 160., 160., 160., 160., 160.,
         160., 160.],
        [192., 192., 192., 192., 192., 192., 192., 192., 192., 192., 192., 192.,
         192., 192., 192., 192., 192., 192., 192., 192., 192., 192., 192., 192.,
         192., 192., 192., 192., 192., 192., 192., 192., 192., 192., 192., 192.,
         192., 192.],
        [224., 224., 224., 224., 224., 224., 224., 224., 224., 224., 224., 224.,
         224., 224., 224., 224., 224., 224., 224., 224., 224., 224., 224., 224.,
         224., 224., 224., 224., 224., 224., 224., 224., 224., 224., 224., 224.,
         224., 224.],
        [256., 256., 256., 256., 256., 256., 256., 256., 256., 256., 256., 256.,
         256., 256., 256., 256., 256., 256., 256., 256., 256., 256., 256., 256.,
         256., 256., 256., 256., 256., 256., 256., 256., 256., 256., 256., 256.,
         256., 256.],
        [288., 288., 288., 288., 288., 288., 288., 288., 288., 288., 288., 288.,
         288., 288., 288., 288., 288., 288., 288., 288., 288., 288., 288., 288.,
         288., 288., 288., 288., 288., 288., 288., 288., 288., 288., 288., 288.,
         288., 288.],
        [320., 320., 320., 320., 320., 320., 320., 320., 320., 320., 320., 320.,
         320., 320., 320., 320., 320., 320., 320., 320., 320., 320., 320., 320.,
         320., 320., 320., 320., 320., 320., 320., 320., 320., 320., 320., 320.,
         320., 320.],
        [352., 352., 352., 352., 352., 352., 352., 352., 352., 352., 352., 352.,
         352., 352., 352., 352., 352., 352., 352., 352., 352., 352., 352., 352.,
         352., 352., 352., 352., 352., 352., 352., 352., 352., 352., 352., 352.,
         352., 352.],
        [384., 384., 384., 384., 384., 384., 384., 384., 384., 384., 384., 384.,
         384., 384., 384., 384., 384., 384., 384., 384., 384., 384., 384., 384.,
         384., 384., 384., 384., 384., 384., 384., 384., 384., 384., 384., 384.,
         384., 384.],
        [416., 416., 416., 416., 416., 416., 416., 416., 416., 416., 416., 416.,
         416., 416., 416., 416., 416., 416., 416., 416., 416., 416., 416., 416.,
         416., 416., 416., 416., 416., 416., 416., 416., 416., 416., 416., 416.,
         416., 416.],
        [448., 448., 448., 448., 448., 448., 448., 448., 448., 448., 448., 448.,
         448., 448., 448., 448., 448., 448., 448., 448., 448., 448., 448., 448.,
         448., 448., 448., 448., 448., 448., 448., 448., 448., 448., 448., 448.,
         448., 448.],
        [480., 480., 480., 480., 480., 480., 480., 480., 480., 480., 480., 480.,
         480., 480., 480., 480., 480., 480., 480., 480., 480., 480., 480., 480.,
         480., 480., 480., 480., 480., 480., 480., 480., 480., 480., 480., 480.,
         480., 480.],
        [512., 512., 512., 512., 512., 512., 512., 512., 512., 512., 512., 512.,
         512., 512., 512., 512., 512., 512., 512., 512., 512., 512., 512., 512.,
         512., 512., 512., 512., 512., 512., 512., 512., 512., 512., 512., 512.,
         512., 512.],
        [544., 544., 544., 544., 544., 544., 544., 544., 544., 544., 544., 544.,
         544., 544., 544., 544., 544., 544., 544., 544., 544., 544., 544., 544.,
         544., 544., 544., 544., 544., 544., 544., 544., 544., 544., 544., 544.,
         544., 544.],
        [576., 576., 576., 576., 576., 576., 576., 576., 576., 576., 576., 576.,
         576., 576., 576., 576., 576., 576., 576., 576., 576., 576., 576., 576.,
         576., 576., 576., 576., 576., 576., 576., 576., 576., 576., 576., 576.,
         576., 576.],
        [608., 608., 608., 608., 608., 608., 608., 608., 608., 608., 608., 608.,
         608., 608., 608., 608., 608., 608., 608., 608., 608., 608., 608., 608.,
         608., 608., 608., 608., 608., 608., 608., 608., 608., 608., 608., 608.,
         608., 608.],
        [640., 640., 640., 640., 640., 640., 640., 640., 640., 640., 640., 640.,
         640., 640., 640., 640., 640., 640., 640., 640., 640., 640., 640., 640.,
         640., 640., 640., 640., 640., 640., 640., 640., 640., 640., 640., 640.,
         640., 640.],
        [672., 672., 672., 672., 672., 672., 672., 672., 672., 672., 672., 672.,
         672., 672., 672., 672., 672., 672., 672., 672., 672., 672., 672., 672.,
         672., 672., 672., 672., 672., 672., 672., 672., 672., 672., 672., 672.,
         672., 672.],
        [704., 704., 704., 704., 704., 704., 704., 704., 704., 704., 704., 704.,
         704., 704., 704., 704., 704., 704., 704., 704., 704., 704., 704., 704.,
         704., 704., 704., 704., 704., 704., 704., 704., 704., 704., 704., 704.,
         704., 704.],
        [736., 736., 736., 736., 736., 736., 736., 736., 736., 736., 736., 736.,
         736., 736., 736., 736., 736., 736., 736., 736., 736., 736., 736., 736.,
         736., 736., 736., 736., 736., 736., 736., 736., 736., 736., 736., 736.,
         736., 736.],
        [768., 768., 768., 768., 768., 768., 768., 768., 768., 768., 768., 768.,
         768., 768., 768., 768., 768., 768., 768., 768., 768., 768., 768., 768.,
         768., 768., 768., 768., 768., 768., 768., 768., 768., 768., 768., 768.,
         768., 768.]])
         
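# shift_x: each element is the x coordinate on the original image of one grid point. torch.Size([25, 38]); all 25 rows are identical
tensor([[   0.,   32.,   64.,   96.,  128.,  160.,  192.,  224.,  256.,  288.,
          320.,  352.,  384.,  416.,  448.,  480.,  512.,  544.,  576.,  608.,
          640.,  672.,  704.,  736.,  768.,  800.,  832.,  864.,  896.,  928.,
          960.,  992., 1024., 1056., 1088., 1120., 1152., 1184.],
        ...,
        [   0.,   32.,   64.,  ..., 1120., 1152., 1184.]])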

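Both are then flattened to 1-D, and shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1) gives the grid-point coordinates on the original image (each (x, y) pair appears twice, once for the top-left and once for the bottom-right corner), with shape torch.Size([950, 4]). Adding self.cell_anchors to shifts with broadcasting gives the coordinates of every anchor on the original image, shifts_anchor, with shape torch.Size([950, 15, 4]). Merging the first two dimensions gives the final result anchors, a list with one element per prediction feature map that holds all anchors this feature map generates on the original image. For MobileNetV2 the list has a single element of shape torch.Size([14250, 4]) (assuming a 25×38 feature map).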

# shifts
tensor([[   0.,    0.,    0.,    0.],
        [  32.,    0.,   32.,    0.],
        [  64.,    0.,   64.,    0.],
        ...,
        [1120.,  768., 1120.,  768.],
        [1152.,  768., 1152.,  768.],
        [1184.,  768., 1184.,  768.]])

# anchors
tensor([[ -23.,  -11.,   23.,   11.],
        [ -45.,  -23.,   45.,   23.],
        [ -91.,  -45.,   91.,   45.],
        ...,
        [1139.,  677., 1229.,  859.],
        [1093.,  587., 1275.,  949.],
        [1003.,  406., 1365., 1130.]])

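Put simply: shifts [grid point (x, y), grid point (x, y)] + [top-left offset, bottom-right offset] = [anchor top-left corner, anchor bottom-right corner].
The figure illustrates this:

The left diagram shows the grid points on the original image; their coordinates are shifts.
The right diagram shows the anchor templates, i.e. cell_anchors, which store exactly the top-left and bottom-right corners of each template relative to its center.
shifts + cell_anchors gives the absolute coordinates of the anchors generated at every grid point of the original image, stored in shifts_anchor. For a 7×7 grid (49 grid points) its shape is [49, 15, 4].
anchors.append(shifts_anchor.reshape(-1, 4)) then has size [735, 4]: this feature map generates 735 anchors, each with 4 coordinates.

[Figure: grid points on the original image (left) and anchor templates (right)]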
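
The "grid point + template" addition can be reproduced without broadcasting by three nested loops. The sketch below is plain Python, and the function name grid_anchors_loop is an assumption; to keep it short it uses only the first and the last of the 15 rounded templates listed earlier, together with the 25×38 grid and stride 32 from the example.

```python
def grid_anchors_loop(grid_height, grid_width, stride_height, stride_width, templates):
    """Shift every zero-centred template to every grid point; rows first, then columns, then templates."""
    anchors = []
    for row in range(grid_height):
        for col in range(grid_width):
            shift_x, shift_y = col * stride_width, row * stride_height
            for xmin, ymin, xmax, ymax in templates:
                anchors.append((xmin + shift_x, ymin + shift_y, xmax + shift_x, ymax + shift_y))
    return anchors


# first and last rounded template from cell_anchors
two_templates = [(-23, -11, 23, 11), (-181, -362, 181, 362)]
anchors = grid_anchors_loop(25, 38, 32, 32, two_templates)
print(len(anchors))  # 1900 = 25 * 38 * 2
print(anchors[0])    # (-23, -11, 23, 11)
print(anchors[-1])   # (1003, 406, 1365, 1130)
```

The last anchor matches the last row of the anchors tensor printed above. With all 15 templates the same loop would produce 25 × 38 × 15 = 14250 anchors, in the same order as shifts_anchor.reshape(-1, 4).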

In train_resnet50_fpn.py, create_model does not create an anchor generator in advance. Instead, network_files/faster_rcnn_framework.py builds one for resnet50_fpn automatically when the anchor generator is None:

if rpn_anchor_generator is None:
    anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
    aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
    rpn_anchor_generator = AnchorsGenerator(
        anchor_sizes, aspect_ratios
    )  # one anchor size on each of the 5 prediction feature levels

Anchor area: $Area = scale^2 = anchor\_size^2$; aspect ratio: $aspect\_ratio = h/w$.

The AnchorsGenerator code is shown below:

from typing import List, Optional, Dict, Tuple

import torch
from torch import nn, Tensor
from torch.nn import functional as F
import torchvision

from . import det_utils
from . import boxes as box_ops
from .image_list import ImageList

class AnchorsGenerator(nn.Module):
    __annotations__ = {
        "cell_anchors": Optional[List[torch.Tensor]],
        "_cache": Dict[str, List[torch.Tensor]]
    }
    """
    Anchor generator: a module that generates anchors for a set of feature maps and image sizes.
    It supports multiple sizes (scales) and multiple height/width ratios per feature map.
    sizes and aspect_ratios must each have one entry per feature map (anchors are generated on every feature map),
    so len(sizes) == len(aspect_ratios).
    sizes[i] and aspect_ratios[i] may each hold any number of elements.
    At every location of feature map i, AnchorsGenerator outputs len(sizes[i]) * len(aspect_ratios[i]) anchors.

    Arguments:
        sizes (Tuple[Tuple[int]]):
        aspect_ratios (Tuple[Tuple[float]]):
    """

    def __init__(self, sizes=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
        super(AnchorsGenerator, self).__init__()
        # The defaults above are the sizes and aspect_ratios from the paper; both models in this project
        # use the five anchor sizes (32, 64, 128, 256, 512) instead.
        # If the elements of sizes / aspect_ratios are not lists or tuples, wrap them into nested tuples.
        if not isinstance(sizes[0], (list, tuple)):
            # TODO change this
            sizes = tuple((s,) for s in sizes)
        if not isinstance(aspect_ratios[0], (list, tuple)):
            aspect_ratios = (aspect_ratios,) * len(sizes)

        assert len(sizes) == len(aspect_ratios)  # both need one entry per feature map

        self.sizes = sizes
        self.aspect_ratios = aspect_ratios
        self.cell_anchors = None
        self._cache = {}  # caches the anchor coordinates generated on the original image

    def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device=torch.device("cpu")):
        # type: (List[int], List[float], torch.dtype, torch.device) -> Tensor
        """
        compute anchor sizes
        Arguments:
            scales: sqrt(anchor_area), i.e. the entries of self.sizes (the side length of a square anchor, not its area)
            aspect_ratios: h/w ratios
            dtype: float32 , device: cpu/gpu
            The torch.Size annotations below are for the mobilenetv2 backbone; debug the resnet50+fpn case yourself.
        """
        scales = torch.as_tensor(scales, dtype=dtype, device=device)  # list -> tensor, torch.Size([5])
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)  # torch.Size([3])
        h_ratios = torch.sqrt(aspect_ratios)  # height factors: [0.7071, 1.0000, 1.4142]
        w_ratios = 1.0 / h_ratios             # width factors:  [1.4142, 1.0000, 0.7071]

        # [r1, r2, r3]' * [s1, s2, s3], number of elements is len(ratios)*len(scales)
        # add a trailing dimension to w_ratios and a leading one to scales so every ratio is paired with every scale
        # w_ratios[:, None]: torch.Size([3]) -> torch.Size([3, 1])
        # scales[None, :]:   torch.Size([5]) -> torch.Size([1, 5])
        # w_ratios[:, None] * scales[None, :] has torch.Size([3, 5]); view(-1) flattens it to 1-D
        #
        # In short: keep the area fixed at scale^2 and change the height/width ratio; what are the new hs and ws?
        # The area is unchanged, so ws*hs = (w_ratios*scales)*(h_ratios*scales) = scales^2, hence w_ratios*h_ratios = 1.
        # The new ratio is aspect_ratios = h_ratios/w_ratios; solving both gives h_ratios = sqrt(aspect_ratios).
        # Note: the anchors generated here are at the scale of the original image (scales are in input-image pixels).
        ws = (w_ratios[:, None] * scales[None, :]).view(-1)  # width of each anchor template, torch.Size([15])
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)  # height of each anchor template, torch.Size([15])

        # the templates are centered at (0, 0), shape [len(ratios)*len(scales), 4],
        # so every coordinate is divided by 2: the width is then 0.5*ws - (-0.5*ws) = ws, as required
        # torch.stack joins the four columns along dim=1
        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2  # torch.Size([15, 4]); for resnet50+fpn each level gets torch.Size([3, 4])

        return base_anchors.round()  # round to the nearest integer

    def set_cell_anchors(self, dtype, device):
        # type: (torch.dtype, torch.device) -> None
        if self.cell_anchors is not None:  # initialized to None, so this branch is skipped on the first call
            cell_anchors = self.cell_anchors
            assert cell_anchors is not None
            # suppose that all anchors have the same device
            # which is a valid assumption in the current state of the codebase
            if cell_anchors[0].device == device:
                return

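        # generate the anchor templates from the given sizes and aspect_ratios
        # every template is centered at (0, 0); the number of entries in sizes equals the number of prediction feature maps
        # the loop below iterates over those entries, which is why train_mobilenet.py passes sizes and
        # aspect_ratios as ((values),), with an extra level of nesting
        cell_anchors = [
            self.generate_anchors(sizes, aspect_ratios, dtype, device)
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)  # each entry: torch.Size([15, 4])
        ]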
        self.cell_anchors = cell_anchors

    def num_anchors_per_location(self):
        # number of anchors predicted at each sliding-window location, one entry per prediction feature map
        return [len(s) * len(a) for s, a in zip(self.sizes, self.aspect_ratios)]

    # For every combination of (a, (g, s), i) in (self.cell_anchors, zip(grid_sizes, strides), 0:2),
    # output g[i] anchors that are s[i] distance apart in direction i, with the same dimensions as a.
    def grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
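        """
        anchors position in grid coordinate axis map into origin image
        Compute the coordinates on the original image of all anchors of the prediction feature maps.
        Args:
            grid_sizes: height and width of each prediction feature map
            strides: how many pixels of the original image one step on each prediction feature map covers
        """
        anchors = []
        cell_anchors = self.cell_anchors  # all anchor templates produced by set_cell_anchors
        assert cell_anchors is not None

        # iterate over the grid size, stride and cell_anchors of each prediction feature map
        for size, stride, base_anchors in zip(grid_sizes, strides, cell_anchors):
            grid_height, grid_width = size        # height and width of the feature map, e.g. 25 and 38
            stride_height, stride_width = stride  # height/width stride relative to the original image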
            device = base_anchors.device

            # For output anchor, compute [x_center, y_center, x_center, y_center]
            # shape: [grid_width], x coordinates (columns) on the original image
            # from the feature-map size and the stride, compute the x and y coordinate of every grid point on the original image
            shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width
            # shape: [grid_height], y coordinates (rows) on the original image
            shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height

            # coordinates on the original image of every point of the feature map (the offsets applied to the anchor templates)
            # torch.meshgrid(a, b) builds two grids from the 1-D tensors a and b, both of shape [len(a), len(b)]:
            # the first repeats a along the columns, the second repeats b along the rows
            # shape: [grid_height, grid_width]
            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)
            shift_x = shift_x.reshape(-1)  # flatten to 1-D: x coordinate on the original image of every grid point
            shift_y = shift_y.reshape(-1)  # y coordinate on the original image of every grid point, torch.Size([950])

            # offsets on the original image for the anchor coordinates (xmin, ymin, xmax, ymax)
            # shape: [grid_width*grid_height, 4], torch.Size([950, 4])
            shifts = torch.stack([shift_x, shift_y, shift_x, shift_y], dim=1)  # grid-point (x, y) on the original image, repeated twice

            # For every (base anchor, output anchor) pair,
            # offset each zero-centered base anchor by the center of the output anchor.
            # add the templates to the offsets to get the coordinates of all anchors on the original image

            # torch.Size([950, 1, 4]) + torch.Size([1, 15, 4]) = torch.Size([950, 15, 4]): broadcasting creates 15 anchors per grid point
            shifts_anchor = shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)  # torch.Size([950, 15, 4])
            anchors.append(shifts_anchor.reshape(-1, 4))  # torch.Size([14250, 4])

        return anchors  # List[Tensor(all_num_anchors, 4)]: one element per prediction feature map with all of its anchor coordinates

    def cached_grid_anchors(self, grid_sizes, strides):
        # type: (List[List[int]], List[List[Tensor]]) -> List[Tensor]
        """Cache the computed anchors."""
        key = str(grid_sizes) + str(strides)
        # self._cache is a dict
        if key in self._cache:  # self._cache starts out as {}
            return self._cache[key]
        anchors = self.grid_anchors(grid_sizes, strides)
        self._cache[key] = anchors
        return anchors  # 14250 anchors (25×38×15) in the MobileNetV2 example

    def forward(self, image_list, feature_maps):
        # type: (ImageList, List[Tensor]) -> List[Tensor]
        # image_list holds the batched (images, image_sizes); feature_maps is a list with one element per prediction feature map
        # (height, width) of each prediction feature map; it varies with the input batch, e.g. [25, 38]
        grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])

        # image_list.tensors is the batched image tensor
        image_size = image_list.tensors.shape[-2:]

        # dtype and device of the feature maps
        dtype, device = feature_maps[0].dtype, feature_maps[0].device

        # one step in feature map equate n pixel stride in origin image
        # the stride is determined by the backbone, e.g. strides = 32 in the MobileNetV2 example
        strides = [[torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
                    torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device)] for g in grid_sizes]

        # generate the anchor templates from sizes and aspect_ratios. A template only stores the top-left and
        # bottom-right corners relative to the anchor's own center, i.e. just height and width,
        # with no position on the feature map or the original image yet
        self.set_cell_anchors(dtype, device)

        # compute (or read from the cache) the coordinates of all anchors mapped onto the original image (not the templates)
        # the result is a list with one tensor per prediction feature map
        anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)

        anchors = torch.jit.annotate(List[List[torch.Tensor]], [])
        # iterate over the images of the batch; image_list.image_sizes holds their sizes (8 images when the batch size is 8)
        for i, (image_height, image_width) in enumerate(image_list.image_sizes):
            anchors_in_image = []
            # iterate over the anchors of every prediction feature map
            for anchors_per_feature_map in anchors_over_all_feature_maps:
                anchors_in_image.append(anchors_per_feature_map)
            anchors.append(anchors_in_image)
        # concatenate the anchors of all prediction feature maps of each image
        # anchors is a list; each element holds all anchors of one image
        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]
        # Clear the cache in case that memory leaks.
        self._cache.clear()
        return anchors  # list of length 8 (the batch size); each element is torch.Size([14250, 4])

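Compute the feature-map sizes grid_sizes and the size of the batched images image_size, and derive the strides from the two.
Call set_cell_anchors, which calls generate_anchors once per prediction feature map with the given sizes and aspect_ratios and stores the resulting anchor templates in self.cell_anchors. On later calls the stored templates are reused as long as the device has not changed.
Call cached_grid_anchors to compute the anchor coordinates mapped onto the original image. Specifically:
- grid_anchors receives grid_sizes and strides, i.e. the height and width of each prediction feature map and its height/width stride relative to the original image.
- shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x) builds the grid points on the original image, which are then processed into the grid-point coordinates shifts.
- shifts + cell_anchors is broadcast into the list anchors. Each element holds all anchor coordinates that one prediction feature map generates on the original image, so the list has type List[torch.Tensor].
Return the final result anchors:
- The steps above give the anchors of all prediction feature maps for one image. Looping over the images of the batch appends this list once per image, so anchors has type List[List[torch.Tensor]]: the outer list covers the 8 images, the inner list the prediction feature maps.
- The inner lists are concatenated over the prediction feature maps, and the result is returned as the output of AnchorsGenerator.forward. anchors is still a list; each element holds all anchors of one image, with size [feature_map_numbers*grid_cells*15, 4].
- Clear the dict self._cache to prevent memory leaks.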
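
The first step and the final anchor count can be sanity-checked with integer arithmetic alone. The sketch below is plain Python; the function name strides_and_anchor_count, the 800×1216 and 768×1024 batch sizes and the five FPN grid sizes are hypothetical values picked for illustration, not numbers taken from a real run.

```python
def strides_and_anchor_count(image_size, grid_sizes, anchors_per_location):
    """Return the (height, width) stride of every feature level and the total number of anchors per image."""
    strides = [(image_size[0] // gh, image_size[1] // gw) for gh, gw in grid_sizes]
    total = sum(gh * gw * n for (gh, gw), n in zip(grid_sizes, anchors_per_location))
    return strides, total


# MobileNetV2: one 25x38 feature map with 15 anchors per location (assumed 800x1216 padded batch)
mobilenet_result = strides_and_anchor_count((800, 1216), [(25, 38)], [15])
print(mobilenet_result)  # ([(32, 32)], 14250)

# ResNet50+FPN: five feature levels with 3 anchors per location (assumed 768x1024 padded batch)
fpn_grids = [(192, 256), (96, 128), (48, 64), (24, 32), (12, 16)]
fpn_result = strides_and_anchor_count((768, 1024), fpn_grids, [3] * 5)
print(fpn_result)  # ([(4, 4), (8, 8), (16, 16), (32, 32), (64, 64)], 196416)
```

With several feature levels, the per-image anchor tensor is the concatenation of all levels, so its first dimension is the sum over levels of grid cells times anchors per location rather than a single grid_cells × 15 product.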