Innovation idea for the second paper

Tasks solved

The long-tail distribution of predicate relationships in the VG dataset for the scene graph generation (SGG) task

The goal is to address the long-tail distribution of predicate relationships in the VG dataset.

Existing methods

First, methods that leave the data untouched, e.g., "Unbiased Scene Graph Generation from Biased Training" (CVPR 2020).

Second, methods that re-sample the data, e.g., "Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation" (CVPR 2021).

Motivation

  • Methods that leave the data untouched sacrifice recall on the head classes (on, has, near)
  • Methods that manipulate the data generally rely on re-sampling; re-sampling alleviates the long tail to some extent, but it always alters the original distribution of the dataset

Contributions

  • Change the training paradigm: make the minimal training unit a triple instead of an image
  • Use an allocation strategy so that the predicate distribution within each training batch is no longer long-tailed
  • Do not re-sample the dataset (no over-/under-sampling), so the original data distribution is never distorted

Remake train GT Dataset

Idea: process the dataset and extract each image's GT data. Save: the image ID, the GT triples, the object labels, ...

Concrete steps:

  • Build the VGDataset object
train_data = VGDataset(split='train', img_dir=img_dir, roidb_file=roidb_file, 
                        dict_file=dict_file, image_file=image_file, num_val_im=5000, 
                        filter_duplicate_rels=False)
  • Iterate over the whole train dataset and save the remade data into Predicate_GT.json and Predicate_GT.csv
for ex_ind in tqdm(range(len(train_data))):
    flip_img = (random.random() > 0.5) and train_data.flip_aug and (train_data.split == 'train')
    target = train_data.get_groundtruth(ex_ind, flip_img)  # GT of the image
    relation_map = target.get_field("relation")  # GT relation matrix: 1~50 where a relation exists, 0 otherwise
    image_id = target.get_field("image_id")  # id of the current image
    # image_path = train_data.img_info[ex_ind]['url']  # url of the current image, unused for now
    for i in range(relation_map.shape[0]):
        sub_id = i  # index of the current sub (its index within the image, not a class id)
        sub_labels = target.get_field("labels")[i]  # class label of the sub at position i
        for j in range(relation_map.shape[1]):  # fix sub_id, iterate over obj_id
            obj_id = j  # index of the current obj
            obj_labels = target.get_field("labels")[j]  # class label of the obj at position j
            predicate_label = relation_map[i][j]  # predicate class (0~50) between the i-th sub and the j-th obj
            predicate_gt = {
                "image_id": int(image_id),
                "sub_id": sub_id,
                "obj_id": obj_id,
                "sub_labels": int(sub_labels),
                "obj_labels": int(obj_labels),
                "predicate_label": int(predicate_label),
                # "image_path_url": image_path,
            }
            csv_writer.writerow(predicate_gt)

The csv file is saved in the following format:

fieldnames = ["image_id", "sub_id", "obj_id", "sub_labels", "obj_labels", "predicate_label"]
f = open('/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Predicate_GT.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=fieldnames)
csv_writer.writeheader()  # write the header row
  • Read the saved csv file back into a list of dicts
# load the dataset as a list of dicts
Predicate_list = []
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Predicate_GT.csv", 'r', encoding='utf-8') as fp:
    fp_key = csv.reader(fp)
    for csv_key in fp_key:  # read the header row to get the keys
        csv_reader = csv.DictReader(fp, fieldnames=csv_key)
        for row in tqdm(csv_reader):
            Predicate_list.append(row)
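As an aside, the nested header-then-DictReader loop above can be collapsed: when csv.DictReader is given no fieldnames, it consumes the header row itself. A minimal equivalent sketch:

import csv
from tqdm import tqdm

# csv.DictReader reads the header row on its own when fieldnames is omitted,
# so a single comprehension replaces the nested loops above.
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Predicate_GT.csv",
          'r', encoding='utf-8') as fp:
    Predicate_list = [row for row in tqdm(csv.DictReader(fp))]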

Remake train dataset

Idea: run predictions on the images of the dataset. Each image passes through Faster R-CNN to extract features, which are saved: the image ID, the sub-obj pair features, ...

Concrete steps:

  • 1 First, obtain all box features produced by Faster R-CNN for every image (the concatenation of visual features, spatial features, and label features)
import torch
from torch import nn
import numpy as np
from torch.nn import functional as F
from maskrcnn_benchmark.modeling.utils import cat
from .utils_motifs import encode_box_info,obj_edge_vectors

class FasterRCNNFeats(nn.Module):
    def __init__(self,cfg,obj_classes):
        super(FasterRCNNFeats, self).__init__()

        self.cfg = cfg
        self.obj_classes = obj_classes
        self.num_obj_classes = len(obj_classes)

        # position embedding
        self.pos_embed = nn.Sequential(*[
            nn.Linear(9, 32), nn.BatchNorm1d(32, momentum=0.001),
            nn.Linear(32, 128), nn.ReLU(inplace=True),
        ])

        # word embedding
        self.embed_dim = self.cfg.MODEL.ROI_RELATION_HEAD.EMBED_DIM
        obj_embed_vecs = obj_edge_vectors(self.obj_classes, wv_dir=self.cfg.GLOVE_DIR, wv_dim=self.embed_dim)
        self.obj_embed1 = nn.Embedding(self.num_obj_classes, self.embed_dim)
        with torch.no_grad():
            self.obj_embed1.weight.copy_(obj_embed_vecs, non_blocking=True)


    def forward(self, x, proposals):

        if self.training or self.cfg.MODEL.ROI_RELATION_HEAD.USE_GT_BOX:
            obj_labels = cat([proposal.get_field("labels") for proposal in proposals], dim=0)
        else:
            obj_labels = None

        if self.cfg.MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL:  # object labels are given, i.e., the PREDCLS task
            obj_embed = self.obj_embed1(obj_labels.long())  # use the word embedding directly
        else:  # object labels are not given, i.e., the SGCLS and SGDET tasks
            obj_logits = cat([proposal.get_field("predict_logits") for proposal in proposals], dim=0).detach()
            obj_embed = F.softmax(obj_logits, dim=1) @ self.obj_embed1.weight  # soft-weight the embeddings by the logits

        assert proposals[0].mode == 'xyxy'
        pos_embed = self.pos_embed(encode_box_info(proposals))  # position encoding via the motifs helper

        # x: visual features; obj_embed: label-embedding features; pos_embed: position features
        obj_pre_rep = cat((x, obj_embed, pos_embed), -1)

        return obj_pre_rep  # concatenated features (4096 visual + 200 embedding + 128 position = 4424 dims)
  • 2 Where the SGG model is built, instantiate the FasterRCNNFeats class we just wrote, to be used for getting object features
def __init__(self, cfg):
    super(GeneralizedRCNN, self).__init__()
    self.cfg = cfg.clone()
    self.backbone = build_backbone(cfg)  # xhb: R-101-FPN
    self.rpn = build_rpn(cfg, self.backbone.out_channels)
    self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)


    self.statistics = get_dataset_statistics(cfg)
    self.obj_classes = self.statistics['obj_classes']  # get the object classes and instantiate a FasterRCNNFeats object
    self.getFasterRCNNFeats = FasterRCNNFeats(cfg, self.obj_classes)

    self.PretrainingFeats_dict_list = []  # member that lives for the whole run, holding one dict of per-image info each
  • 3 getFasterRCNNFeats takes {x: visual features of the boxes, result: box proposals}. Store each single image's id and box features in a PretrainingFeats_dict, then append every image's dict to the PretrainingFeats_dict_list
if self.training:
    Feats = self.getFasterRCNNFeats(x, result)  # features for the whole batch
    num_rois = [len(b) for b in result]  # number of boxes in each image
    Feats = Feats.split(num_rois, dim=0)  # per-image object features

    for i in range(len(num_rois)):
        per_feats = Feats[i]  # all box features of the i-th image
        image_id = targets[i].get_field("image_id")  # id of the current image
        PretrainingFeats_dict = {
            "image_id": image_id,
            "feats": per_feats,
        }
        self.PretrainingFeats_dict_list.append(PretrainingFeats_dict)  # store this batch's data
  • 4 Every 2000 iterations, 2000×batch_size images have participated in training, and the feature information of those images is saved into PretrainingFeats_2000.pt
# xhb: when the iteration count hits a checkpoint boundary, save the model
if iteration % checkpoint_period == 0:
    checkpointer.save("model_{:07d}".format(iteration), **arguments)
    torch.save(model.PretrainingFeats_dict_list, "/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats_{:07d}.pt".format(iteration))
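Since PretrainingFeats_dict_list only ever grows, each save re-serializes everything from the earlier saves, including re-sampled images. An optional sketch, dropped into the same save branch, that deduplicates by image_id before saving (the int cast matters; see the set discussion below):

seen_ids, unique_feats = set(), []
for d in model.PretrainingFeats_dict_list:
    iid = int(d["image_id"])  # image_id is a 0-dim tensor; cast so the set hashes by value
    if iid not in seen_ids:
        seen_ids.add(iid)
        unique_feats.append(d)
torch.save(unique_feats, "/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats_{:07d}.pt".format(iteration))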
  • 5 Because a save happens every 2000 iterations, each save contains the feature data of all previous saves. This means the last save holds the feature information of every image that ever participated in training.

Taking iteration 42000 as an example, load PretrainingFeats_0042000.pt via torch.load into the list of dicts View_list_dict_42000 (as the name says, a list with dicts nested inside).

# load the save from iteration 42000: box tensors saved for 239916 image entries
View_list_dict_42000 = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats_0042000.pt")

The figure below shows the data: the list holds feature data for 239916 images. However, only 57723 images of the VG dataset participate in training, so some images must have been sampled repeatedly.

[Figure: contents of View_list_dict_42000 (239916 entries)]

  • 6 Verify whether any image repeats, i.e., check whether any image_id occurs more than once

The approach: extract the image_id of all 239916 entries into iter_42000_image_id_list, deduplicate it with set, and compare the lengths before and after deduplication. If the lengths match, no image_id is duplicated.

# check whether any of the 239916 images repeats; start with image_id
iter_42000_image_id_list = []
for i in range(len(View_list_dict_42000)):
    image_id = View_list_dict_42000[i]["image_id"]
    iter_42000_image_id_list.append(image_id)

new_list_42000 = set(iter_42000_image_id_list)  # unordered, deduplicated set used to test for repeated image_ids
print("iter_42000_image_id_list length = {}".format(len(iter_42000_image_id_list)))
print("new_list_42000 length = {}".format(len(new_list_42000)))

The lengths did match, i.e., all 239916 image_ids looked unique. That points to a bug in our processing: the dataset only has 57723 images, so where would that many distinct image_ids come from? After repeated experiments I finally found the cause.

I first suspected that the set function simply failed to deduplicate. On reflection, a Python built-in is unlikely to be broken; the data itself was the more likely culprit. Each image_id is a tensor, and set hashes tensors by object identity rather than by value, so two distinct tensor objects holding the same id never collapse into one. I therefore cast the tensors to int and deduplicated the list of ints instead; a minimal repro follows.
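A minimal sketch reproducing the pitfall (the id value 498202 is just an example):

import torch

a, b = torch.tensor(498202), torch.tensor(498202)
print(a == b)                 # tensor(True): the values are equal
print(len({a, b}))            # 2: tensors hash by identity, so no deduplication
print(len({int(a), int(b)}))  # 1: casting to int restores value-based hashing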

First I re-checked the save from the first 2000 iterations. With my batch_size=8 it holds 16000 images, and after deduplication the length of new_list_2000 is also 16000.

iter_2000_image_id_list = []
for i in range(len(View_list_dict_2000)):
    image_id = View_list_dict_2000[i]["image_id"]
    iter_2000_image_id_list.append(int(image_id))
new_list_2000 = set(iter_2000_image_id_list)

But 16000 images prove nothing; we must test a .pt file holding more entries than the dataset has images. So we test the save after 8000 iterations, i.e., 64000 images. Again the tensors are cast to int and collected in iter_8000_image_id_list; this time the length of new_list_8000 comes out as exactly 57723.

View_list_dict_8000 = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats_0008000.pt")
# check whether any of the 64000 images repeats; start with image_id
iter_8000_image_id_list = []
for i in tqdm(range(len(View_list_dict_8000))):
   image_id = View_list_dict_8000[i]["image_id"]
   iter_8000_image_id_list.append(int(image_id))

new_list_8000 = set(iter_8000_image_id_list)  # unordered, deduplicated set used to test for repeated image_ids
print("iter_8000_image_id_list length = {}".format(len(iter_8000_image_id_list)))
print("new_list_8000 length = {}".format(len(new_list_8000)))

So the image-ids were fine all along; set merely mishandles a list of tensors (it compares them by identity). How do we now obtain the dataset's 57723 entries? Simply slice the first 57723 out of the 64000.

Store the 57723 non-duplicated images in the list PredicateFeats_list_dict; this serves as the input part of our dataset.

PredicateFeats_list_dict = View_list_dict_8000[:57723]  # slice off the first 57723 entries

But how do we prove these 57723 entries are distinct? Test: deduplicate them. The length of new_list_57723 is also 57723, so the 57723 entries are indeed unique.

iter_57723_image_id_list = []
for i in tqdm(range(len(PredicateFeats_list_dict))):
    image_id = PredicateFeats_list_dict[i]["image_id"]
    iter_57723_image_id_list.append(int(image_id))
new_list_57723 = set(iter_57723_image_id_list)  # unordered, deduplicated set used to test for repeated image_ids
print("iter_57723_image_id_list length = {}".format(len(iter_57723_image_id_list)))
print("new_list_57723 length = {}".format(len(new_list_57723)))

To push further, we slice 57724 entries instead. Result: new_list_57723 still has length 57723 while the source list has length 57724, so the 57724th image_id duplicates an earlier one. This confirms that the first 57723 entries are exactly the dataset's 57723 distinct images.

PredicateFeats_list_dict = View_list_dict_8000[:57724]  # deliberately take one extra entry

iter_57723_image_id_list = []
for i in tqdm(range(len(PredicateFeats_list_dict))):
    image_id = PredicateFeats_list_dict[i]["image_id"]
    iter_57723_image_id_list.append(int(image_id))
new_list_57723 = set(iter_57723_image_id_list)  # unordered, deduplicated set used to test for repeated image_ids
print("iter_57723_image_id_list length = {}".format(len(iter_57723_image_id_list)))  # 57724
print("new_list_57723 length = {}".format(len(new_list_57723)))  # 57723

Save the first 57723 of the 64000 entries as the training-set feature file, so future loads need no slicing.

PredicateFeats_list_dict = View_list_dict_8000[:57723]
torch.save(PredicateFeats_list_dict, "/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/PretrainingFeats.pt")

The ids and features of the 57723 training images. feats holds the features of every GT box in the image, so this data is only usable for the SGCLS and PREDCLS tasks; SGDET needs additional processing.

[Figure: PretrainingFeats entries (image_id + feats)]

GT data: image-id: the unique index of the image; sub-id: the subject's index within the image; obj-id: the object's index within the image; sub-labels: the label of that subject; obj-labels: the label of that object; predicate-label: the GT label of the predicate between this sub-obj pair.

[Figure: remade GT entries]

Allocation strategy

The Remake Dataset step gave us the training part and the GT part; next comes training the scene-graph task. Before training, though, we must settle the most critical problem, the allocation strategy: the predicate distribution assigned to each batch must not be long-tailed.

Predicate counts

There are 51 predicate classes in total; index 0 is background, meaning no relation.

<class 'list'>: ['background', 'above', 'across', 'against', 'along', 'and', 'at', 'attached to', 'behind', 'belonging to', 'between', 'carrying', 'covered in', 'covering', 'eating', 'flying in', 'for', 'from', 'growing on', 'hanging from', 'has', 'holding', 'in', 'in front of', 'laying on', 'looking at', 'lying on', 'made of', 'mounted on', 'near', 'of', 'on', 'on back of', 'over', 'painted on', 'parked on', 'part of', 'playing', 'riding', 'says', 'sitting on', 'standing on', 'to', 'under', 'using', 'walking in', 'walking on', 'watching', 'wearing', 'wears', 'with']

By my statistics over the 56224 training images, the GT count of each predicate is as follows (the red numbers are class indices):

[Figure: per-predicate GT counts (red numbers are class indices)]
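As a cross-check, the counts in the figure can be re-derived from the remade GT csv; a small sketch, assuming Predicate_list has been loaded as above:

from collections import Counter

pred_counts = Counter(int(row['predicate_label']) for row in Predicate_list)
for k in range(51):  # index 0 is background
    print(k, pred_counts[k])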

The long-tail distribution looks like this:

[Figure: long-tailed distribution of predicate counts]
I want the predicate counts within every batch to be balanced, and am currently trying two allocation schemes:

  • Scheme 1: balance no-relation against relation, i.e., within one batch, background and the predicates carrying GT relations appear in roughly equal numbers

[Figure: per-batch predicate distribution under scheme 1]

  • Scheme 2: background keeps its large share, while the remaining predicate classes with GT labels stay roughly balanced among themselves.

[Figure: per-batch predicate distribution under scheme 2]

Technical details of the strategy

Without any constraint on sampling frequency, plain random sampling gives every triple instance the same weight, so the head predicates dominate:

# sample according to the original distribution
m = 2000
origin_GTPredicate_batch_All = []   # holds the sampled batches
origin_predicate_distribution = []  # holds the predicate distribution of each batch
for n in tqdm(range(4000)):  # draw 4000 batches of size m
    predicate_batch = random.sample(GTPredicate_list, m)
    origin_GTPredicate_batch_All.append(predicate_batch)  # keep the batch itself
    batch_predicate_count = [0 for _ in range(51)]  # predicate histogram of this batch
    for i in range(len(predicate_batch)):
        id = int(predicate_batch[i]['predicate_label'])
        batch_predicate_count[id] = batch_predicate_count[id] + 1
    origin_predicate_distribution.append(batch_predicate_count)

This yields the distribution below: still long-tailed, with on and of still appearing at high frequency, which is exactly what we do not want.

[Figure: per-batch predicate distribution under original-distribution sampling]
We therefore need an allocation strategy that balances the predicate counts within every batch. The implementation:

  • First obtain each predicate's frequency of occurrence in the GT data

[Figure: per-predicate frequencies in the GT data]

  • Use roughly the reciprocal of the frequency as the sampling rate (so populous predicate classes are sampled with lower probability). The value at each index is the reciprocal of the count of the predicate at that index; background is tunable. Increasing ratio-background produces the second distribution scheme, which is the one I tentatively settle on. A weight-derivation sketch follows the probability list below.
predicate_probability = [0.0000107383,  # background; at sample-ratio = 0.000000107383, background stays instance-balanced with the others
    0.000224972,0.006993007,0.006289308,0.002923977,0.002469136,0.000831947,0.000958773,0.000117357,0.001831502,0.002873563,
    0.00101626,0.00310559,0.002680965,0.002824859,0.25,0.001517451,0.007092199,0.007462687,0.001912046,0.0000206637,0.00015006,
    0.0000655781,0.000380373,0.002298851,0.002079002,0.004716981,0.012658228,0.005524862,0.0000814001,0.000040657,0.00001259,
    0.004065041,0.001481481,0.008333333,0.002123142,0.002915452,0.02,0.000445831,0.045454545,0.000312402,0.000632111,0.004081633,
    0.000335233,0.003278689,0.005291005,0.001113586,0.003322259,0.000031001,0.000294291,0.00011711]
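Instead of hard-coding the 51 values, they can be derived from the per-class counts; a hedged sketch, where predicate_counts is assumed to hold the counts from the statistics above:

def inverse_frequency_weights(predicate_counts, background_weight=None):
    # reciprocal of each class count; max(c, 1) guards against empty classes
    weights = [1.0 / max(c, 1) for c in predicate_counts]
    if background_weight is not None:
        # index 0 is background; its weight controls how many no-relation
        # pairs each sampled batch contains
        weights[0] = background_weight
    return weights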
  • Each of the 9581817 GT entries must be given the discrete sampling probability of its predicate class (the weighted sampling function requires per-item weights)
# load the GT data as a list of dicts; 9581817 entries
GTPredicate_list = []
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Predicate_GT.csv", 'r', encoding='utf-8') as fp:
   fp_key = csv.reader(fp)
   for csv_key in fp_key:  # read the header row to get the keys
       csv_reader = csv.DictReader(fp, fieldnames=csv_key)
       for row in tqdm(csv_reader):
           GTPredicate_list.append(row)

# attach a discrete probability to every GT entry, used at sampling time
GTPredicate_probability = [0 for _ in range(len(GTPredicate_list))]  # holds the 9581817 probabilities
for i in tqdm(range(len(GTPredicate_list))):
   id = int(GTPredicate_list[i]['predicate_label'])
   GTPredicate_probability[i] = predicate_probability[id]
  • Sample according to this frequency distribution and store the batches in GTPredicate_batch_All; also tally each batch's predicate histogram into predicate_distribution
batch_size = 2000  # currently 2000 triples per batch
epoch_iteration = 5000  # number of batches to draw
GTPredicate_batch_All = []  # holds epoch_iteration batches
predicate_distribution = []  # holds the predicate distribution of each batch
for n in tqdm(range(epoch_iteration)):  # sample epoch_iteration times, yielding epoch_iteration batches
    batch_sampler = random.choices(GTPredicate_list, weights=GTPredicate_probability, k=batch_size)
    GTPredicate_batch_All.append(batch_sampler)  # keep the batch itself
    batch_predicate_count = [0 for _ in range(51)]  # predicate histogram of this batch
    for i in range(len(batch_sampler)):
        id = int(batch_sampler[i]['predicate_label'])
        batch_predicate_count[id] = batch_predicate_count[id] + 1
    predicate_distribution.append(batch_predicate_count)
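One caveat worth noting: random.choices samples with replacement, so a batch can contain the same triple more than once. If that is undesirable, numpy offers weighted sampling without replacement; a sketch under that assumption:

import numpy as np

probs = np.asarray(GTPredicate_probability, dtype=np.float64)
probs = probs / probs.sum()  # np.random.choice requires probabilities that sum to 1
idxs = np.random.choice(len(GTPredicate_list), size=batch_size, replace=False, p=probs)
batch_sampler = [GTPredicate_list[i] for i in idxs]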

[Figures: per-batch predicate distributions under the two weighted-sampling schemes]

  • Check whether the 5000 batches cover every image in the GT data (the result is 56105 images, not far off; at 10000 iterations all images should be covered)
# check whether the sampled batches contain every image
sample_image_id_list = []
for i in tqdm(range(len(GTPredicate_batch_All))):
    for j in range(len(GTPredicate_batch_All[i])):
        image_id = int(GTPredicate_batch_All[i][j]['image_id'])
        sample_image_id_list.append(image_id)

set_sample_image_id_list = set(sample_image_id_list)  # check whether its length is 56224

Combining Training data and GT

Above we obtained predicate triples following a roughly uniform distribution. For prediction, and hence for computing the loss, each of them must be matched one-to-one with the training set. In this section we bind the training set to the GT: while sampling a GT batch, the matching training data is pulled out as well and bound to it for the subsequent training.

To bind training data and GT data together, they must correspond via image-id. So the first step is to store the image-ids in a list for lookups.

  • Extract the image-ids of the training set
# append all image-ids to a list
DatasetFeats_image_id = []
for j in tqdm(range(len(DatasetFeats))):
    image_id = int(DatasetFeats[j]['image_id'])
    DatasetFeats_image_id.append(image_id)
  • Via image-id, find the image data whose id matches the GT entry, and fetch the matching object features via sub-id and obj-id
# bind GT and training data together one-to-one: [[train_data1, GT1], [train_data2, GT2], ....]
debug_print(logger, 'Remake Dataloader')
batch_size = 2000
epoch_iteration = 4000
train_dataloader_list_dict = []  # holds 4000 batches of 2000 dicts each
for n in tqdm(range(epoch_iteration)):  # sample epoch_iteration times, yielding epoch_iteration batches
    batch_gt = random.choices(GTPredicate_list, weights=GTPredicate_probability, k=batch_size)
    train_set_batch_list_dict = []  # 2000 dicts, i.e., one batch
    for i in range(batch_size):
        batch_i_image_id = int(batch_gt[i]['image_id'])  # image-id of the i-th gt triple in the n-th batch
        batch_i_sub_id = int(batch_gt[i]['sub_id'])      # sub_id of the i-th gt triple in the n-th batch
        batch_i_obj_id = int(batch_gt[i]['obj_id'])      # obj_id of the i-th gt triple in the n-th batch
        if batch_i_image_id in DatasetFeats_image_id:  # find the matching image in the training set
            index = DatasetFeats_image_id.index(batch_i_image_id)  # position of that image-id
            batch_i_sub_feats = DatasetFeats[index]['feats'][batch_i_sub_id, :]  # matching sub features (4424-d)
            batch_i_obj_feats = DatasetFeats[index]['feats'][batch_i_obj_id, :]  # matching obj features (4424-d)

            train_dataset_dict = {  # dict holding the training data matched to this gt
                "image_id": batch_i_image_id,
                "sub_id": batch_i_sub_id,
                "obj_id": batch_i_obj_id,
                "sub_feats": batch_i_sub_feats,
                "obj_feats": batch_i_obj_feats,
            }

            train_unit = (train_dataset_dict, batch_gt[i])
            train_set_batch_list_dict.append(train_unit)  # append the matched pair; 2000 of them form one batch
    train_dataloader_list_dict.append(train_set_batch_list_dict)

We store each single gt and its single training datum in the train_unit tuple, and 2000 tuples in train_set_batch_list_dict; that forms one batch (2000 training samples, i.e., batch_size=2000). Sampling from the dataset, one epoch is currently set to 4000 iterations; how many epochs to train depends on how the model converges. The structure of train_dataloader_list_dict is shown below, followed by a lookup speed-up sketch:

[Figure: structure of train_dataloader_list_dict]
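An optional speed-up sketch: the `in` test plus `.index()` above scans the whole id list for every sampled triple. Building the index map once makes each lookup O(1); the names follow the binding code above:

feats_index_by_image_id = {img_id: idx for idx, img_id in enumerate(DatasetFeats_image_id)}

index = feats_index_by_image_id.get(batch_i_image_id)
if index is not None:
    batch_i_sub_feats = DatasetFeats[index]['feats'][batch_i_sub_id, :]
    batch_i_obj_feats = DatasetFeats[index]['feats'][batch_i_obj_id, :]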

Load union features of two bboxes

Relation classification originally also uses the union features of the two boxes. The earlier steps skipped this; when the results were unsatisfying, we wondered whether the missing union features were the cause. Hence this section: extract the union features, restore the original formulation, and check whether it affects performance.

  • First, collect and save them inside the main model, GeneralizedRCNN.
# imports needed below, i.e., the files where these helper functions live
from ..roi_heads.relation_head.sampling import make_roi_relation_samp_processor
from ..roi_heads.relation_head.roi_relation_feature_extractors import make_roi_relation_feature_extractor

class GeneralizedRCNN(nn.Module):
    """
    Main class for Generalized R-CNN. Currently supports boxes and masks.
    It consists of three main parts:
    - backbone
    - rpn
    - heads: takes the features + the proposals from the RPN and computes
        detections / masks from it.
    """

    def __init__(self, cfg):
        super(GeneralizedRCNN, self).__init__()
        self.cfg = cfg.clone()
        self.backbone = build_backbone(cfg)  # xhb: R-101-FPN
        self.rpn = build_rpn(cfg, self.backbone.out_channels)
        self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)

        # -------- preparing to save union_feature ---------#
        # declare the needed members
        self.samp_processor = make_roi_relation_samp_processor(cfg)
        self.union_feature_extractor = make_roi_relation_feature_extractor(cfg, self.backbone.out_channels)
        self.use_union_box = self.cfg.MODEL.ROI_RELATION_HEAD.PREDICT_USE_VISION
        self.union_feats_list_dict = []


    def forward(self, images, targets=None, logger=None):  # xhb: losses are computed in this forward

        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")
        images = to_image_list(images)  # xhb: turn the images into an image list

        features = self.backbone(images.tensors)  # xhb: R-101-FPN
        proposals, proposal_losses = self.rpn(images, features, targets)  # xhb: proposals from Faster R-CNN
        if self.roi_heads:
            x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
        else:
            # RPN-only models don't have roi_heads
            x = features
            result = proposals
            detector_losses = {}

        # -------- preparing to save union_feature ---------#
        if self.training:
            # relation subsamples and assign ground truth label during training
            with torch.no_grad():
                if self.cfg.MODEL.ROI_RELATION_HEAD.USE_GT_BOX:
                    _, rel_labels, rel_pair_idxs, _ = self.samp_processor.gtbox_relsample(result, targets)
                else:
                    _, rel_labels, rel_pair_idxs, _ = self.samp_processor.detect_relsample(result, targets)
        else:
            rel_labels, rel_binarys = None, None
            rel_pair_idxs = self.samp_processor.prepare_test_pairs(features[0].device, result)

        if self.use_union_box:
            # union_features covers all boxes of all images in the batch
            union_features = self.union_feature_extractor(features, result, rel_pair_idxs)
        else:
            union_features = None

        # ---------------save union_features in training-----------------#
        if self.training:
            with torch.no_grad():
                num_objs = [len(b) for b in rel_pair_idxs]
                union_feature = union_features.split(num_objs, dim=0)  # split out each image's union features
                for i in range(len(union_feature)):
                    image_id = targets[i].get_field("image_id")
                    # save the union feature only where a relation exists
                    relation_map = targets[i].get_field("relation")
                    relation_mat = relation_map.cpu().numpy()  # a cuda tensor cannot be converted to numpy directly
                    gt_rel_idx = np.where(relation_mat != 0)  # positions with a non-zero, i.e., GT, relation
                    for k in range(len(gt_rel_idx[0])):
                        sub_id = gt_rel_idx[0][k]  # gt_rel_idx is [2, k], hence this indexing
                        obj_id = gt_rel_idx[1][k]
                        idx = None  # stays None if the pair was not sampled into rel_pair_idxs
                        for p in range(rel_pair_idxs[i].shape[0]):
                            if int(sub_id) == int(rel_pair_idxs[i][p][0]) and int(obj_id) == int(rel_pair_idxs[i][p][1]):
                                idx = p  # position of this rel pair inside union_feature; fetch the matching feature
                                break
                            else:
                                continue
                        union_feats_dict = {
                            "image_id": image_id,
                            "sub_id": sub_id,
                            "obj_id": obj_id,
                            "union_feats": union_feature[i][idx].data,
                        }
                        self.union_feats_list_dict.append(union_feats_dict)


        if self.training:
            losses = {}
            losses.update(detector_losses)
            if not self.cfg.MODEL.RELATION_ON:
                # During the relationship training stage, the rpn_head should be fixed, and no loss. 
                losses.update(proposal_losses)
            return losses

        return result
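A hypothetical speed-up for the pair-matching loop above: indexing rel_pair_idxs once per image avoids scanning it for every GT pair.

pair_to_idx = {(int(s), int(o)): p
               for p, (s, o) in enumerate(rel_pair_idxs[i].tolist())}
idx = pair_to_idx.get((int(sub_id), int(obj_id)))
if idx is not None:
    union_feats = union_feature[i][idx].data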
  • Then save the list in the train.net file
 if iteration % checkpoint_period == 0:
    checkpointer.save("model_{:07d}".format(iteration), **arguments)
    torch.save(model.union_feats_list_dict, "/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/UnionFeats_{:05d}.pt".format(iteration))

Once saved, we bind the union features to the training data so they are available during the model's computation.

  • First load the union-feats file and collect its image-id list
# load train union features
UnionFeats_list_dict_train = torch.load(
    "/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/UnionFeats_02000.pt")

UnionFeats_image_id_train = []
for j in tqdm(range(len(UnionFeats_list_dict_train))):
    image_id = int(UnionFeats_list_dict_train[j]['image_id'])
    UnionFeats_image_id_train.append(image_id)
  • Then add the code below to the earlier binding code. It checks whether the image has union features at all: while saving them I ran out of GPU memory and only stored the union-feats of 25000+ images, so roughly half the dataset has none; hence the check.
union_feats = None
if batch_i_image_id in UnionFeats_image_id:
    index_u = UnionFeats_image_id.index(batch_i_image_id)
    if batch_i_sub_id == int(UnionFeats_list_dict[index_u]["sub_id"]) and batch_i_obj_id == int(
            UnionFeats_list_dict[index_u]["obj_id"]):
        union_feats = UnionFeats_list_dict[index_u]["union_feats"]

if union_feats is not None:
    train_unit = (train_dataset_dict, batch_gt[i], union_feats)  # if present, bind the union-feats as well
else:
    train_unit = (train_dataset_dict, batch_gt[i])  # otherwise keep the old form: training data + GT
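A note and sketch, under an assumption about the stored layout: list.index() returns only the first entry for a given image_id, yet an image contributes one union-feats entry per GT pair, so only that first pair can ever match above. A lookup keyed on the full (image, sub, obj) triple would match every stored pair:

union_key_to_feats = {
    (int(d["image_id"]), int(d["sub_id"]), int(d["obj_id"])): d["union_feats"]
    for d in UnionFeats_list_dict
}
union_feats = union_key_to_feats.get((batch_i_image_id, batch_i_sub_id, batch_i_obj_id))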

The whole function is shown below; the changes are mainly the added parameters and the code above.

def Combine_TrainGT(batch_size, epoch_iteration, GTPredicate_list, GTPredicate_probability, DatasetFeats_image_id, newDatasetFeats, UnionFeats_list_dict, UnionFeats_image_id):
    # batch_size = 2000
    # epoch_iteration = 12000

    train_dataloader_list_dict = []  # holds 4000 batches of 2000 dicts each
    for n in tqdm(range(epoch_iteration)):  # sample epoch_iteration times, yielding epoch_iteration batches
        batch_gt = random.choices(GTPredicate_list, weights=GTPredicate_probability, k=batch_size)  # sample under GTPredicate_probability
        # batch_gt = random.sample(GTPredicate_list, batch_size)  # sample under the original distribution
        train_set_batch_list_dict = []  # 2000 dicts, i.e., one batch
        for i in range(batch_size):
            batch_i_image_id = int(batch_gt[i]['image_id'])  # image-id of the i-th gt triple in the n-th batch
            batch_i_sub_id = int(batch_gt[i]['sub_id'])  # sub_id of the i-th gt triple in the n-th batch
            batch_i_obj_id = int(batch_gt[i]['obj_id'])  # obj_id of the i-th gt triple in the n-th batch
            if batch_i_image_id in DatasetFeats_image_id:
                index = DatasetFeats_image_id.index(batch_i_image_id)
                batch_i_sub_feats = newDatasetFeats[index]['feats'][batch_i_sub_id, :]  # matching sub features (4424-d)
                batch_i_obj_feats = newDatasetFeats[index]['feats'][batch_i_obj_id, :]  # matching obj features (4424-d)
                batch_i_sub_dist = newDatasetFeats[index]['obj_dist'][batch_i_sub_id, :]  # matching sub dist (151-d)
                batch_i_obj_dist = newDatasetFeats[index]['obj_dist'][batch_i_obj_id, :]  # matching obj dist (151-d)
                batch_i_sub_ctx = newDatasetFeats[index]['obj_ctx'][batch_i_sub_id, :]  # matching sub context (512-d)
                batch_i_obj_ctx = newDatasetFeats[index]['obj_ctx'][batch_i_obj_id, :]  # matching obj context (512-d)

                train_dataset_dict = {
                    "image_id": batch_i_image_id,
                    "sub_id": batch_i_sub_id,
                    "obj_id": batch_i_obj_id,
                    "sub_feats": batch_i_sub_feats,
                    "obj_feats": batch_i_obj_feats,
                    "sub_dist": batch_i_sub_dist,
                    "obj_dist": batch_i_obj_dist,
                    "sub_ctx": batch_i_sub_ctx,
                    "obj_ctx": batch_i_obj_ctx,
                }

                union_feats = None
                if batch_i_image_id in UnionFeats_image_id:
                    index_u = UnionFeats_image_id.index(batch_i_image_id)
                    if batch_i_sub_id == int(UnionFeats_list_dict[index_u]["sub_id"]) and batch_i_obj_id == int(
                            UnionFeats_list_dict[index_u]["obj_id"]):
                        union_feats = UnionFeats_list_dict[index_u]["union_feats"]

                if union_feats is not None:
                    train_unit = (train_dataset_dict, batch_gt[i], union_feats)
                else:
                    train_unit = (train_dataset_dict, batch_gt[i])

                train_set_batch_list_dict.append(train_unit)  # append the matched pair; 2000 of them form one batch
        train_dataloader_list_dict.append(train_set_batch_list_dict)

    return train_dataloader_list_dict
  • After binding, training proceeds as before. Before entering the model, check whether union_feats exists: if so, pass train_gt_batch[i][2] (bound in the third slot); otherwise pass None.
train_data = train_gt_batch[i][0]
target = train_gt_batch[i][1]
if len(train_gt_batch[i]) == 3:
    union_feats = train_gt_batch[i][2]
else:
    union_feats = None
output_loss = mymodel(train_data, target,union_feats)
  • In the PredicatedRCNN class, when building the relation feature, check whether union-feats exists: if so, multiply it in; if not, compute without union-feats as before.
prod_rep = torch.cat((head_rep, tail_rep), dim=-1)
prod_rep = self.post_cat(prod_rep)

if union_feats is not None:
    relation_rep = prod_rep * self.up_dim(self.down_dim(union_feats))  # relation feature: reduce then restore the union-feats dimension
else:
    relation_rep = prod_rep  # relation feature without union feats
relation_dists = self.rel_compress(relation_rep)  # relation classification

Still to do: obtain union-feats for the whole dataset and fix the GPU memory overflow.

  • Without union-feats
    [Figure: results without union features]
  • With union-feats: R improves while mR drops.
    [Figure: results with union features]

SGG Prediction (train+val+test)

With the allocation strategy settled, no batch entering training suffers from the predicate long tail, and the training data is paired with its GT. Now the scene-graph-generation prediction begins.

Training

The subject and object features first predict their own labels. Since the features cannot pass through the LSTM for refinement, a ResNet50 pretrained on ImageNet is used to refine them and predict better object labels. The SGG network boils down to roughly the following process:

[Figure: overall prediction pipeline]

  • First fetch the subject and object features
def forward(self, train_data, target, logger=None):  # xhb: losses are computed in this forward
    # get object feature
    sub_feats = train_data['sub_feats']
    obj_feats = train_data['obj_feats']
  • I have no good way to refine the features yet (if the raw features turn out poor, context features will have to be saved too), so the raw features are classified directly into logits. These logits contain many zeros because I applied ReLU twice; whether the final ReLU belongs there still needs verification (see the sketch after this block).
def __init__(self, cfg):
    super(PredicatedRCNN, self).__init__()
    self.cfg = cfg.clone()
    self.feats_dim = 4424
    self.hidden_size = 1024
    self.num_obj_classes = 151

    # map the input object features to class logits
    self.out_obj = nn.Sequential(  # 4424 -> 151
        nn.Linear(self.feats_dim, self.hidden_size),
        nn.ReLU(inplace=True),
        nn.Linear(self.hidden_size, self.num_obj_classes),
        nn.ReLU(inplace=True),
    )

def forward(self, train_data, target, logger=None):  # xhb: losses are computed in this forward
    # get updated object feature
    '''not used for now'''

    # get object logits
    sub_logits = self.out_obj(sub_feats)  # classify the sub, giving 151-d logits
    obj_logits = self.out_obj(obj_feats)  # classify the obj, giving 151-d logits
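On the double-ReLU doubt above: nn.CrossEntropyLoss expects raw logits, and a ReLU after the last Linear clamps every negative logit to zero, which can stall a classification loss. A minimal fix sketch, assuming the rest of the head stays unchanged:

self.out_obj = nn.Sequential(  # 4424 -> 151
    nn.Linear(self.feats_dim, self.hidden_size),
    nn.ReLU(inplace=True),
    nn.Linear(self.hidden_size, self.num_obj_classes),  # no activation after the last layer
)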
  • Take the predicted labels from the logits, look up the corresponding label-embedding vectors via edge_obj_embed (word vectors from glove.pt), and build from them the relation features (i.e., the edge features)
    def __init__(self, cfg):
        super(PredicatedRCNN, self).__init__()
        self.cfg = cfg.clone()
        self.num_obj_classes = 151
        self.embed_dim = cfg.MODEL.ROI_RELATION_HEAD.EMBED_DIM  # 200

        # load class dict
        statistics = get_dataset_statistics(cfg)
        obj_classes, rel_classes = statistics['obj_classes'], statistics['rel_classes']
        assert self.num_obj_classes == len(obj_classes)

        # word-vector encoding of the object predictions
        obj_embed_vecs = obj_edge_vectors(obj_classes, wv_dir=self.cfg.GLOVE_DIR, wv_dim=self.embed_dim)  # [151, 200], covers all classes
        self.edge_obj_embed = nn.Embedding(self.num_obj_classes, self.embed_dim)
        with torch.no_grad():
            self.edge_obj_embed.weight.copy_(obj_embed_vecs, non_blocking=True)

    def forward(self, train_data, target, logger=None):  # xhb: losses are computed in this forward
        # get object pred
        sub_pred = torch.argmax(sub_logits).view(1)  # tensor(8, device='cuda:0') is 0-dim (shape torch.Size([])), so .view(1) makes it a vector for to_onehot
        obj_pred = torch.argmax(obj_logits).view(1)

        # get label embedding feature
        sub_embed = self.edge_obj_embed(sub_pred.long())
        obj_embed = self.edge_obj_embed(obj_pred.long())

        # get edge features
        sub_rel_rep = torch.cat((sub_embed.view(1, -1), sub_feats.view(1, -1)), -1)
        obj_rel_rep = torch.cat((obj_embed.view(1, -1), obj_feats.view(1, -1)), -1)
  • Concatenate the edge features, apply two dimension mappings, split the result into head and tail, map both to 4096, and multiply them. This yields the final relation feature, which is classified into relation logits.
    def __init__(self, cfg):
        super(PredicatedRCNN, self).__init__()
        self.cfg = cfg.clone()
        self.feats_dim = 4424
        self.hidden_size = 1024
        self.num_obj_classes = 151
        self.num_rel_classes = 51
        self.edge_hidden_dim = cfg.MODEL.ROI_RELATION_HEAD.CONTEXT_HIDDEN_DIM  # 512
        self.embed_dim = cfg.MODEL.ROI_RELATION_HEAD.EMBED_DIM  # 200
        self.post_emb = nn.Linear(self.edge_hidden_dim, self.edge_hidden_dim * 2)  # 512 -> 1024
        self.edge_lin = nn.Linear((self.feats_dim + self.embed_dim) * 2, self.edge_hidden_dim)  # 9248 -> 512

        self.pooling_dim = cfg.MODEL.ROI_RELATION_HEAD.CONTEXT_POOLING_DIM
        self.post_cat = nn.Linear(self.edge_hidden_dim, self.pooling_dim)
        self.rel_compress = nn.Linear(self.pooling_dim, self.num_rel_classes, bias=True)

    def forward(self, train_data, target, logger=None):  # xhb: losses are computed in this forward
        # get relation head-tail feature
        edge_feats = torch.cat((sub_rel_rep, obj_rel_rep), -1)  # concatenate the sub and obj joint features
        edge_feats = self.edge_lin(edge_feats)  # map to 512 dims
        edge_rep = self.post_emb(edge_feats)  # map to 1024 dims
        edge_rep = edge_rep.view(edge_rep.size(0), 2, self.edge_hidden_dim)
        head_rep = edge_rep[:, 0].contiguous().view(-1, self.edge_hidden_dim)  # 512 dims
        tail_rep = edge_rep[:, 1].contiguous().view(-1, self.edge_hidden_dim)  # 512 dims

        # get relation dist
        head_rep = self.post_cat(head_rep)  # map head to 4096 dims
        tail_rep = self.post_cat(tail_rep)  # map tail to 4096 dims
        relation_rep = head_rep * tail_rep  # the relation feature
        relation_dists = self.rel_compress(relation_rep)  # relation classification
  • Fetch the labels and compute the losses with cross-entropy
# get <subject predicate object> label
sub_label = torch.tensor(int(target['sub_labels'])).long().view(1).to(self.device)
obj_label = torch.tensor(int(target['obj_labels'])).long().view(1).to(self.device)
rel_label = torch.tensor(int(target['predicate_label'])).long().view(1).to(self.device)

# in training mode, return the losses
if self.training:
    # calculate object loss
    sub_loss = self.criterion_loss(sub_logits.view(1, -1), sub_label)
    obj_loss = self.criterion_loss(obj_logits.view(1, -1), obj_label)
    # calculate relation loss
    rel_loss = self.criterion_loss(relation_dists.view(1, -1), rel_label)

    output_losses = dict(loss_rel=rel_loss, loss_refine_sub=sub_loss, loss_refine_obj=obj_loss)
    return output_losses

With the losses in hand, gradients can be back-propagated, completing the training loop:

logger.info("Start training")  //xhb:输出日志:开始训练了
meters = MetricLogger(delimiter="  ")
max_iter = len(train_dataloader_list_dict)  # xhb:max_iter:4000
start_training_time = time.time()  # 1641821079.0426774
end = time.time()  # 1641821079.0426843
mymodel = PredicatedRCNN(cfg)
mymodel.to(device)
optimizer_Adam = torch.optim.Adam(mymodel.parameters(),lr=0.01)  # SGD也可备选,lr需要尝试初始学习率
//scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer_Adam,gamma=0.9,last_epoch=-1)
scheduler = make_lr_scheduler(cfg, optimizer_Adam, logger)
for iteration,train_gt_batch in enumerate(train_dataloader_list_dict):
    data_time = time.time() - end
    iteration = iteration + 1
    arguments["iteration"] = iteration
    rel_loss = torch.tensor(0)
    sub_loss = torch.tensor(0)
    obj_loss = torch.tensor(0)
    batch_loss  = dict(loss_rel=rel_loss, loss_refine_sub=sub_loss, loss_refine_obj=obj_loss)
    for i in range(len(train_gt_batch)):
        train_data = train_gt_batch[i][0]
        target = train_gt_batch[i][1]
        output_loss = mymodel(train_data, target)
        batch_loss['loss_rel'] = output_loss['loss_rel']  + batch_loss['loss_rel']
        batch_loss['loss_refine_sub'] = output_loss['loss_refine_sub'] + batch_loss['loss_refine_sub']
        batch_loss['loss_refine_obj'] = output_loss['loss_refine_obj'] + batch_loss['loss_refine_obj']

    losses = sum(loss/len(train_gt_batch) for loss in batch_loss.values())

    meters.update(loss=losses, **batch_loss)

    optimizer_Adam.zero_grad()   # xhb:优化器清除梯度,准备损失梯度回传了
	losses.backward()
    optimizer_Adam.step()  # xhb:优化器开始工作

    batch_time = time.time() - end
    end = time.time()
    meters.update(time=batch_time, data=data_time)
    eta_seconds = meters.time.global_avg * (max_iter - iteration)
    eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))

    if iteration % 2 == 0:
        logger.info(
            meters.delimiter.join(
                [
                    "eta: {eta}",
                    "iter: {iter}",
                    "{meters}",
                    "lr: {lr:.6f}",
                    "max mem: {memory:.0f}",
                ]
            ).format(
                eta=eta_string,
                iter=iteration,
                meters=str(meters),
                lr=optimizer.param_groups[-1]["lr"],
                memory=torch.cuda.max_memory_allocated() / 1024.0 / 1024.0,
            )
        )

Evaluating process

Remake val GT Dataset

Since my model takes a single triple's features as input and accuracy is also computed per triple, the validation set must be remade in the same form as the training set.

  • Get the validation set's GT data (object labels, relation labels, triple labels, image-ids, etc.)
dataset_names = cfg.DATASETS.VAL
for dataset_name, val_data_loader in zip(dataset_names, val_data_loaders):
    dataset = val_data_loader.dataset
    groundtruths = []
    for i in range(5000):
        image_id = dataset.img_info[i]['image_id']
        gt = dataset.get_groundtruth(i, evaluation=True)
        gt.add_field("image_id", torch.as_tensor(image_id))
        groundtruths.append(gt)

The GT has the structure below; the previous step added image_id into it.

[Figure: GT BoxList fields]

  • Save the groundtruths data into value_GT.csv
fieldnames = ["image_id", "sub_id", "obj_id", "sub_labels", "obj_labels", "predicate_label"]
f = open('/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/value_GT.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=fieldnames)
csv_writer.writeheader()

dataset_names = cfg.DATASETS.VAL
for dataset_name, val_data_loader in zip(dataset_names, val_data_loaders):
  dataset = val_data_loader.dataset
  groundtruths = []
  for i in tqdm(range(5000)):
      image_id = dataset.img_info[i]['image_id']
      gt = dataset.get_groundtruth(i, evaluation=True)
      gt.add_field("image_id", torch.as_tensor(image_id))
      groundtruths.append(gt)

      # saving; note the inner loops rebind `i`, which is safe because the outer for pulls from its own iterator
      relation_map = gt.get_field("relation")
      for i in range(relation_map.shape[0]):
          sub_id = i  # index of the current sub (its index within the image, not a class id)
          sub_labels = gt.get_field("labels")[i]  # class label of the sub at position i
          for j in range(relation_map.shape[1]):  # fix sub_id, iterate over obj_id
              obj_id = j  # index of the current obj
              obj_labels = gt.get_field("labels")[j]  # class label of the obj at position j
              predicate_label = relation_map[i][j]  # predicate class (0~50) between the i-th sub and the j-th obj
              predicate_gt = {
                  "image_id": int(image_id),
                  "sub_id": sub_id,
                  "obj_id": obj_id,
                  "sub_labels": int(sub_labels),
                  "obj_labels": int(obj_labels),
                  "predicate_label": int(predicate_label),
              }
              csv_writer.writerow(predicate_gt)
  • Read value_GT.csv back into a list; the validation set holds 1039344 triples in total
# load the GT data as a list of dicts; 1039344 entries
debug_print(logger, 'Loading value_GT')
GTValue_list = []
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/value_GT.csv", 'r', encoding='utf-8') as fp:
    fp_key = csv.reader(fp)
    for csv_key in fp_key:  # read the header row to get the keys
        csv_reader = csv.DictReader(fp, fieldnames=csv_key)
        for row in tqdm(csv_reader):
            GTValue_list.append(row)


Remake val dataset

  • Produce the validation set's feature file, analogous to the training set (image_id + features), since my model takes features as input. Because the collection code already exists for the training set, it suffices to move it out from under if self.training: and run one evaluation pass so the validation set flows through it
class GeneralizedRCNN(nn.Module):
    def __init__(self, cfg):
        super(GeneralizedRCNN, self).__init__()
        self.cfg = cfg.clone()
        self.backbone = build_backbone(cfg)  # xhb: R-101-FPN
        self.rpn = build_rpn(cfg, self.backbone.out_channels)
        self.roi_heads = build_roi_heads(cfg, self.backbone.out_channels)

        self.statistics = get_dataset_statistics(cfg)
        self.obj_classes = self.statistics['obj_classes']
        self.getFasterRCNNFeats = FasterRCNNFeats(cfg, self.obj_classes)
        self.ValueFeats_dict_list = []

    def forward(self, images, targets=None, logger=None):  # xhb: losses are computed in this forward
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")
        images = to_image_list(images)  # xhb: turn the images into an image list
        features = self.backbone(images.tensors)  # xhb: R-101-FPN
        proposals, proposal_losses = self.rpn(images, features, targets)  # xhb: proposals from Faster R-CNN
        if self.roi_heads:
            x, result, detector_losses = self.roi_heads(features, proposals, targets, logger)
        else:
            # RPN-only models don't have roi_heads
            x = features
            result = proposals
            detector_losses = {}

        if self.training:
            losses = {}
            losses.update(detector_losses)
            if not self.cfg.MODEL.RELATION_ON:
                # During the relationship training stage, the rpn_head should be fixed, and no loss. 
                losses.update(proposal_losses)
            return losses

        # collect the validation set's feature file
        Feats = self.getFasterRCNNFeats(x, result)  # features for the whole batch
        num_rois = [len(b) for b in result]  # number of boxes in each image
        Feats = Feats.split(num_rois, dim=0)  # per-image object features
        # PretrainingFeats_dict_list = []
        for i in range(len(num_rois)):
            per_feats = Feats[i]  # all box features of the i-th image
            image_id = targets[i].get_field("image_id")  # id of the current image
            PrevaluingFeats_dict = {
                "image_id": image_id,
                "feats": per_feats,
            }
            self.ValueFeats_dict_list.append(PrevaluingFeats_dict)  # store this batch's data

        return result
  • After the evaluation finishes, save the file as PrevaluingFeats.pt
if cfg.SOLVER.TO_VAL and iteration % cfg.SOLVER.VAL_PERIOD == 0:
     logger.info("Start validating")
     val_result = run_val(cfg, model, val_data_loaders, distributed, logger)
     torch.save(model.ValueFeats_dict_list,"/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PrevaluingFeats.pt")
     logger.info("Validation Result: %.4f" % val_result)

  • Reading PrevaluingFeats.pt back takes a single line and yields the validation-set file in list-dict format (5000 entries)

Value_list_dict = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PrevaluingFeats.pt")

The validation set holds 5000 images; each unit contains one image's image-id and all of its box features.

[Figure: PrevaluingFeats entries (image_id + feats)]

Combining val data and GT

What is the purpose of the combination? To split the validation feature file into the triple form matching the GT, i.e., each unit carries the sub and obj information (ids and features).

# load the validation-set feature file
Value_list_dict = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PrevaluingFeats.pt")

# append all validation-set image-ids to a list
ValueFeats_image_id = []
for j in tqdm(range(len(Value_list_dict))):
    image_id = int(Value_list_dict[j]['image_id'])
    ValueFeats_image_id.append(image_id)

# load the val GT data as a list of dicts; 1039344 entries
debug_print(logger, 'Loading value_GT')
GTValue_list = []
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/value_GT.csv", 'r', encoding='utf-8') as fp:
    fp_key = csv.reader(fp)
    for csv_key in fp_key:  # read the header row to get the keys
        csv_reader = csv.DictReader(fp, fieldnames=csv_key)
        for row in tqdm(csv_reader):
            GTValue_list.append(row)

debug_print(logger, 'Remake Val Dataloader')
val_dataloader_list_dict = []  # holds the data for every triple of the whole validation set
for i in tqdm(range(len(GTValue_list))):
    val_i_image_id = int(GTValue_list[i]['image_id'])  # image-id of the i-th gt triple
    val_i_sub_id = int(GTValue_list[i]['sub_id'])  # sub_id of the i-th gt triple
    val_i_obj_id = int(GTValue_list[i]['obj_id'])  # obj_id of the i-th gt triple
    if val_i_image_id in ValueFeats_image_id:  # find the matching image in the feature file
        index = ValueFeats_image_id.index(val_i_image_id)  # position of that image-id
        batch_i_sub_feats = Value_list_dict[index]['feats'][val_i_sub_id, :]  # matching sub features (4424-d)
        batch_i_obj_feats = Value_list_dict[index]['feats'][val_i_obj_id, :]  # matching obj features (4424-d)

        val_dataset_dict = {  # dict holding the data matched to this gt
            "image_id": val_i_image_id,
            "sub_id": val_i_sub_id,
            "obj_id": val_i_obj_id,
            "sub_feats": batch_i_sub_feats,
            "obj_feats": batch_i_obj_feats,
        }

        val_unit = (val_dataset_dict, GTValue_list[i])  # bind gt and feats together
        val_dataloader_list_dict.append(val_unit)

A single validation triple carries this information:

[Figure: one val_unit (feature dict + GT dict)]

Everything ends up in val_dataloader_list_dict: 1039344 evaluable entries in total (validation-set features plus GT).

Starting the evaluation

The evaluation not only pushes data through the model; it also computes R@K and mR@K.

  • Because R/mR@20/50/100 are computed, the predicted triples must be ranked, with sub-scores * obj-scores * pred-scores as the ranking key. First compute the scores of all predicted triples
model.eval()
# rank the raw val_dataloader_list_dict: per image, sort the triples by score to get a new val_dataloader_list_dict
triplet_score_image = [[] for x in range(5000)]  # holds the scores
print("-------------start save triplet_score_image-------------")
for i in tqdm(range(len(val_dataloader_list_dict))):
   val_data_i = val_dataloader_list_dict[i][0]
   target_i = val_dataloader_list_dict[i][1]
   predictions = model(val_data_i, target_i)
   '''
     predictions = dict(
     sub_pred=sub_pred,
     obj_pred=obj_pred,
     predicate_pred=predicate_pred,
     sub_scores=sub_scores,
     obj_scores=obj_scores,
     predicate_scores=predicate_scores
   )
   '''
   # sort each image's triples by the product of confidences, so R@20/50/100 and mR@20/50/100 can be computed
   if int(target_i['image_id']) in val_image_id_list:
       index = val_image_id_list.index(int(target_i['image_id']))
       triplet_score_i = predictions['sub_scores'] * predictions['obj_scores'] * predictions['predicate_scores']
       triplet_score_image[index].append(triplet_score_i)  # append this triple's score to its image's sub-list

The score data is organized as below: it is first split into 5000 images, each holding the scores of that image's triples.

[Figure: triplet_score_image layout]

  • Sort the scores per image in descending order, because mR@20/50/100 is also computed per image
index_all = []  # sorted index lists, one per image; length 5000
for i in range(len(triplet_score_image)):
    # sorted() normally returns the sorted values, not their indices, hence this idiom
    index_image = sorted(range(len(triplet_score_image[i])), key=lambda k: triplet_score_image[i][k], reverse=True)
    index_all.append(index_image)

The per-image sorted indices look like below; they index from the original list into the sorted one, i.e., src_list[index] = sort_list.

[Figure: per-image sorted indices]

  • Since the triples must be ranked within each image, the flat val_dataloader_list_dict must also be split into per-image units
# split the raw val_dataloader_list_dict into 5000 per-image parts, stored in split_val_dataloader_list_dict
split_val_dataloader_list_dict = []  # length 5000, but every element is itself a list
flag = 0  # index where the next image starts
for triplet_count in triplet_image_count:  # triplet_count: triples per image, e.g. [529, 200, 44, ..., 361, ...]
    image_i_triplet = val_dataloader_list_dict[flag : triplet_count + flag]
    split_val_dataloader_list_dict.append(image_i_triplet)
    flag = flag + triplet_count
  • After splitting, sort the triples within each image by the indices computed earlier (the scores were computed in the original order, so the sorted indices apply directly to the triples that produced them)
# sort the split list per image, using the precomputed indices
sort_new_val_dataloader_list_dict = []
for i in range(len(split_val_dataloader_list_dict)):
    for j in range(len(split_val_dataloader_list_dict[i])):
        triplet = split_val_dataloader_list_dict[i][index_all[i][j]]  # the key step of the sort
        sort_new_val_dataloader_list_dict.append(triplet)
  • This yields a new sorted list of 1039344 entries; but the metrics are computed per image, so split sort_new_val_dataloader_list_dict again into a per-image list of length 5000 (5000 images, each image's triples sorted by score).
# split sort_new_val_dataloader_list_dict into 5000 per-image parts, stored in split_new_val_dataloader_list_dict
split_new_val_dataloader_list_dict = []  # length 5000, but every element is itself a list
flag_new = 0
for triplet_count_ in triplet_image_count:
    image_i_triplet_ = sort_new_val_dataloader_list_dict[flag_new:triplet_count_ + flag_new]
    split_new_val_dataloader_list_dict.append(image_i_triplet_)
    flag_new = flag_new + triplet_count_
  • Before evaluating, tally the predicate class ids present in each image: for mR@K the denominator is not the total number of GTs but the count of the current predicate class. (For example, predicting 31 against GT [31,22,33,10] gives class 31 an mR of 100%.)
# collect each image's non-zero GT labels, so every predicted predicate class can be divided by its own denominator
val_dataloader_GTPredicate = []  # GT predicate ids of every image in the val set, i.e. [[31,31,22,34...], [31,2,12,34...], ...]
for image_i_triplet in split_val_dataloader_list_dict:
    image_GTPredicate = []  # non-zero GT predicate ids of one image, i.e. [31,31,22,34...]
    for triplet_i in image_i_triplet:
        rel_label = int(triplet_i[1]['predicate_label'])  # predicate label of the current triple
        if rel_label != 0:  # append only non-zero labels
            image_GTPredicate.append(rel_label)
    val_dataloader_GTPredicate.append(image_GTPredicate)  # keep each image's predicates
  • Next, count the total number of GTs per image, needed for R@K
# count the GT predicates per image, needed for R
GT_count = [0 for _ in range(5000)]
for i in range(len(val_dataloader_list_dict)):
    image_id = int(val_dataloader_list_dict[i][1]['image_id'])  # image-id of the current triple
    if int(val_dataloader_list_dict[i][1]['predicate_label']) != 0:  # if this predicate is non-zero
        if image_id in ValueFeats_image_id:
            index = ValueFeats_image_id.index(image_id)  # position of this image-id
            GT_count[index] = GT_count[index] + 1
  • Now the evaluation itself can start
Recall_image20 = [0 for _ in range(5000)]   # per-image recall@20
Recall_image50 = [0 for _ in range(5000)]   # per-image recall@50
Recall_image100 = [0 for _ in range(5000)]  # per-image recall@100
mRecall_predicate20 = [[] for _ in range(51)]   # per-predicate-class recalls@20; index 0 unused
mRecall_predicate50 = [[] for _ in range(51)]   # per-predicate-class recalls@50
mRecall_predicate100 = [[] for _ in range(51)]  # per-predicate-class recalls@100

# with torch.no_grad():
print("-------------start valuing-------------")
for i in tqdm(range(len(split_new_val_dataloader_list_dict))):

    recall_count = [0 for _ in range(51)]  # per-predicate counts in this image; index 0 holds the non-zero total
    for idx in range(len(split_new_val_dataloader_list_dict[i])):
        predicate_label = int(split_new_val_dataloader_list_dict[i][idx][1]['predicate_label'])
        if predicate_label != 0:
            recall_count[predicate_label] += 1
            recall_count[0] += 1

    recall_hit20 = [0 for _ in range(51)]  # per-predicate hits in this image; index 0 holds the recalled total
    recall_hit50 = [0 for _ in range(51)]
    recall_hit100 = [0 for _ in range(51)]

    with torch.no_grad():
        for k in [20, 50, 100]:
            # cap k at the number of triples available in this image
            # (fixes the original chain of elifs, whose parenthesisation was wrong)
            length = min(k, len(split_new_val_dataloader_list_dict[i]))

            for j in range(len(split_new_val_dataloader_list_dict[i][:length])):  # image by image
                val_data_i = split_new_val_dataloader_list_dict[i][j][0]
                target_i = split_new_val_dataloader_list_dict[i][j][1]
                predictions = model(val_data_i, target_i)

                '''
                  predictions = dict(
                  sub_pred=sub_pred,
                  obj_pred=obj_pred,
                  predicate_pred=predicate_pred,
                  sub_scores=sub_scores,
                  obj_scores=obj_scores,
                  predicate_scores=predicate_scores
                )
                '''
                # tally recalls per image-id
                sub_pred = predictions['sub_pred']
                obj_pred = predictions['obj_pred']
                predicate_pred = predictions['predicate_pred']
                sub_label = int(target_i['sub_labels'])
                obj_label = int(target_i['obj_labels'])
                predicate_label = int(target_i['predicate_label'])

                # a prediction counts only when all labels match; boxes need no check because entries are fetched by index
                if sub_pred == sub_label and obj_pred == obj_label and predicate_pred == predicate_label and predicate_label != 0:

                    if k == 20:
                        recall_hit20[predicate_label] += 1
                        recall_hit20[0] += 1
                    elif k == 50:
                        recall_hit50[predicate_label] += 1
                        recall_hit50[0] += 1
                    elif k == 100:
                        recall_hit100[predicate_label] += 1
                        recall_hit100[0] += 1

        for n in range(51):  # fold this image's R and mR into the tables
            if n == 0 and recall_count[0] > 0:  # this image's R goes into the overall table
                Recall_image20[i] = float(recall_hit20[n] / recall_count[n])
                Recall_image50[i] = float(recall_hit50[n] / recall_count[n])
                Recall_image100[i] = float(recall_hit100[n] / recall_count[n])
                continue
            if recall_count[n] > 0:  # this image's mR goes into the overall table
                mRecall_predicate20[n].append(float(recall_hit20[n] / recall_count[n]))
                mRecall_predicate50[n].append(float(recall_hit50[n] / recall_count[n]))
                mRecall_predicate100[n].append(float(recall_hit100[n] / recall_count[n]))
  • With all the R and mR lists filled, print the results
print('----------------------- calculate R ------------------------\n')
print("R@20 = {:.4f}-----R@50 = {:.4f}-----R@100 = {:.4f}\n".format(np.mean(Recall_image20)*100,
                                                                    np.mean(Recall_image50)*100,
                                                                    np.mean(Recall_image100)*100))
print('----------------------- calculate mR ------------------------\n')
mR20 = []
for i, mR_predicate in enumerate(mRecall_predicate20):
    if i == 0:
        continue
    if len(mR_predicate) == 0:
        continue
    image_mR20 = np.mean(mR_predicate)
    mR20.append(image_mR20)
mR50 = []
for i, mR_predicate in enumerate(mRecall_predicate50):
    if i == 0:
        continue
    if len(mR_predicate) == 0:
        continue
    image_mR50 = np.mean(mR_predicate)
    mR50.append(image_mR50)
mR100 = []
for i, mR_predicate in enumerate(mRecall_predicate100):
    if i == 0:
        continue
    if len(mR_predicate) == 0:
        continue
    image_mR100 = np.mean(mR_predicate)
    mR100.append(image_mR100)
# sum/0.5 == sum/50*100, i.e., the mean over the 50 foreground classes, in percent
print("mR@20 = {:.4f}----mR@50 = {:.4f}----mR@100 = {:.4f}\n".format((np.sum(mR20))/0.5,
                                                                         (np.sum(mR50))/0.5,
                                                                         (np.sum(mR100))/0.5))
val_result = np.sum(mR100)/0.5

return val_result
  • Finally, R@100 is used as the validation result for adjusting the learning rate
val_result = np.mean(Recall_image100)
return val_result
  • Evaluation can run every so many iterations, decaying the learning rate at the same time
val_result = None  # used for scheduler updating; xhb: evaluate the model, mainly via run_val
if cfg.SOLVER.TO_VAL and real_iteration % 4000 == 0:  # I set both cfg.SOLVER.VAL_PERIOD and VAL_PERIOD to 500
    logger.info("Start validating")
    val_result = do_predicate_evaluation(cfg, mymodel, val_dataloader_list_dict, ValueFeats_image_id, GT_count, triplet_image_count, logger)
    logger.info("Validation Result: %.4f" % val_result)
    for p in optimizer_Adam.param_groups:
        p["lr"] *= 0.8

# xhb: when the iteration count hits a save boundary, save the model
if real_iteration % 4000 == 0:
    torch.save(mymodel, "/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/modelpth/model_{:05d}.pth".format(real_iteration))

Bug Summary

The number of training images and GT images is inconsistent

The first bug encountered: the image counts in the training data and the GT disagree. The training set yields 57723 image-ids, while the GT data holds only 56224. To avoid unnecessary errors when later computing the loss, I slice the 57723 down to 56224 so that the training set matches the GT.

  • Load the GT data
# load the GT data as a list of dicts
Predicate_list = []
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Predicate_GT.csv", 'r', encoding='utf-8') as fp:
    fp_key = csv.reader(fp)
    for csv_key in fp_key:  # read the header row to get the keys
        csv_reader = csv.DictReader(fp, fieldnames=csv_key)
        for row in tqdm(csv_reader):
            Predicate_list.append(row)

This list has length 9581817 and holds the data of our remade GT dataset.


  • Extract the image-ids into the list Predicate_list_id, then deduplicate to get the unique new_Predicate_list_id
Predicate_list_id = []
for j in range(len(Predicate_list)):
    image_id = Predicate_list[j]["image_id"]
    Predicate_list_id.append(int(image_id))
new_Predicate_list_id = list(set(Predicate_list_id))  # 56224 - 57723 = -1499: the GT is missing 1499 images

new_Predicate_list_id: a list of length 56224, meaning the GT dataset covers only 56224 images.

  • Load the training data
DatasetFeats = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/PretrainingFeats.pt")

DatasetFeats_list = []  # holds the 57723 image-ids
for i in range(len(DatasetFeats)):
    image_id = DatasetFeats[i]["image_id"]
    DatasetFeats_list.append(int(image_id))
new_DatasetFeats_list_id = list(set(DatasetFeats_list))

new_DatasetFeats_list_id: a list of length 57723, meaning the training set holds 57723 images.

I know the 56,224 must be a subset of the 57,723. I now want to remove from the 57,723 every entry whose id is not among the 56,224, so that what remains agrees exactly with the 56,224 images in the GT.

# the GT covers only 56224 images, so filter the 57723 training entries down to those
gt_id_set = set(new_Predicate_list_id)  # O(1) membership tests; the original nested loop was O(n^2)
new_DatasetFeats = []  # the feature data of the 56224 surviving images
for i in tqdm(range(len(DatasetFeats_list))):
    if DatasetFeats_list[i] in gt_id_set:
        new_DatasetFeats.append(DatasetFeats[i])

If an image-id from the 57,723 is found among the 56,224, the corresponding training entry is kept; if it is not found, the entry is not shared by the two sets and is discarded. The filtered training data is then saved:

torch.save(new_DatasetFeats,"/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats56224.pt")
  • Verify that the processed training set and the GT cover the same images

Load the processed training set, extract its image-ids, and compare them with the deduplicated GT image-ids.

DatasetFeats = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats56224.pt")

DatasetFeats_list = [] # stores the 56224 image-ids
for i in range(len(DatasetFeats)):
    image_id = DatasetFeats[i]["image_id"]
    DatasetFeats_list.append(int(image_id))
new_DatasetFeats_list_id = list(set(DatasetFeats_list))

Predicate_list_id = [] # stores the 56224 image-ids
for j in range(len(Predicate_list)):
    image_id = Predicate_list[j]["image_id"]
    Predicate_list_id.append(int(image_id))
new_Predicate_list_id = list(set(Predicate_list_id))  

# sort, then compare
new_DatasetFeats_list_id.sort()
new_Predicate_list_id.sort()
if new_DatasetFeats_list_id == new_Predicate_list_id:
    print("OK") # prints OK, as expected

Object loss does not decrease

The current problem: although the relation loss keeps decreasing, the object-prediction loss does not move at all, which means not a single object is being predicted correctly. The relation loss drops from 7849 to 5565 over 900 iterations, while the sub and obj losses stay essentially flat.


Problem analysis: case one, the code is wrong, i.e., the label-classification code has a bug. Case two, classifying labels directly from the FasterRCNN features is itself problematic. Case two splits further: either the procedure for extracting the 4424-dim features is wrong, or the 4424-dim features simply cannot be used for direct prediction.
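Before rewriting anything, the second hypothesis can be probed offline. The sketch below is mine, not the original code: it fits a single linear layer on the saved 4424-dim features (probe_loader is a hypothetical loader yielding (features, GT object labels) batches). If even a dedicated linear probe stays near chance, the features themselves are the problem; if it learns easily, the bug is in my training code.

import torch
from torch import nn

probe = nn.Linear(4424, 151).cuda()  # 151 object classes, as in VG
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step, (feats, labels) in enumerate(probe_loader):
    logits = probe(feats.cuda())
    loss = loss_fn(logits, labels.cuda())
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        acc = (logits.argmax(1) == labels.cuda()).float().mean()
        print("step {} loss {:.3f} acc {:.2%}".format(step, loss.item(), acc.item()))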

Solution

The first remedy we thought of was to predict the labels from context features, which requires additionally saving the context features when the training-set features are saved. The overall solution is as follows:

  • Pretrain the LSTMContext from Motifs and, once the model converges, save its parameters and weights, which yields a good set of model parameters. (When the maximum decay step is reached, save the state_dict of the LSTMContext.)
if cfg.SOLVER.SCHEDULE.TYPE == "WarmupReduceLROnPlateau":
    scheduler.step(val_result, epoch=iteration)  # xhb: adjust the learning rate based on the evaluation result
    if scheduler.stage_count >= cfg.SOLVER.SCHEDULE.MAX_DECAY_STEP:
        logger.info("Trigger MAX_DECAY_STEP at iteration {}.".format(iteration))
        torch.save(model.roi_heads.relation.predictor.context_layer.state_dict(),"/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/modulepth/LSTMContext.pth")
        break
else:
    scheduler.step()
  • Build a myLSTMContext with the same network structure as LSTMContext and load the trained parameters into it, which is equivalent to using pretrained weights. (Compared with the original LSTMContext, forward returns one extra value, obj_ctx, because that is what I need.)
class myLSTMContext(nn.Module):
    """
    Modified from neural-motifs to encode contexts for each object
    """
    def __init__(self, config, obj_classes, rel_classes, in_channels):
        super(myLSTMContext, self).__init__()
        self.cfg = config
        self.obj_classes = obj_classes
        self.rel_classes = rel_classes
        self.num_obj_classes = len(obj_classes)

        # mode

        # word embedding

        # position embedding

        # object & relation context

        # TODO Kaihua Tang
        # AlternatingHighwayLSTM is invalid for pytorch 1.0

        # map bidirectional hidden states of dimension self.hidden_dim*2 to self.hidden_dim

        # untreated average features

    def obj_ctx(self, obj_feats, proposals, obj_labels=None, boxes_per_cls=None, ctx_average=False):
        # code
        return obj_dists, obj_preds, encoder_rep, perm, inv_perm, ls_transposed

    def edge_ctx(self, inp_feats, perm, inv_perm, ls_transposed):
        # code
        return edge_ctx

    def forward(self, x, proposals, all_average=False, ctx_average=False):
        # code
        return obj_dists, obj_preds, edge_ctx, obj_ctx, None
  • Then import the model in the feature-saving code (instantiate the untrained network and call load_state_dict on it; it then behaves as a pretrained network)
class FasterRCNNFeats(nn.Module):
    def __init__(self,cfg,obj_classes,rel_classes):
        super(FasterRCNNFeats, self).__init__()

        self.cfg = cfg
        self.obj_classes = obj_classes
        self.rel_classes = rel_classes
        self.num_obj_classes = len(obj_classes)
        self.num_rel_classes = len(rel_classes)

        # position embedding
        self.pos_embed = nn.Sequential(*[
            nn.Linear(9, 32), nn.BatchNorm1d(32, momentum=0.001),
            nn.Linear(32, 128), nn.ReLU(inplace=True),
        ])

        # word embedding
        self.embed_dim = self.cfg.MODEL.ROI_RELATION_HEAD.EMBED_DIM
        obj_embed_vecs = obj_edge_vectors(self.obj_classes, wv_dir=self.cfg.GLOVE_DIR, wv_dim=self.embed_dim)
        self.obj_embed1 = nn.Embedding(self.num_obj_classes, self.embed_dim)
        with torch.no_grad():
            self.obj_embed1.weight.copy_(obj_embed_vecs, non_blocking=True)

        # first instantiate the model, then load the trained parameters into it
        self.myLSTM = myLSTMContext(self.cfg,self.obj_classes,self.rel_classes, 4096)
        self.myLSTM.load_state_dict(torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/modulepth/LSTMContext_state_dict_final.pth"))
        print("self.myLSTM.load_state_dict")


    def forward(self, x, proposals): # given the 4096-dim features x and the proposals, produce the 4424-dim feature data

        if self.training or self.cfg.MODEL.ROI_RELATION_HEAD.USE_GT_BOX:
            obj_labels = cat([proposal.get_field("labels") for proposal in proposals], dim=0)
        else:
            obj_labels = None

        if self.cfg.MODEL.ROI_RELATION_HEAD.USE_GT_OBJECT_LABEL:
            obj_embed = self.obj_embed1(obj_labels.long()) # PredCls
        else:
            obj_logits = cat([proposal.get_field("predict_logits") for proposal in proposals], dim=0).detach() #SGCLS
            obj_embed = F.softmax(obj_logits, dim=1) @ self.obj_embed1.weight

        assert proposals[0].mode == 'xyxy'
        pos_embed = self.pos_embed(encode_box_info(proposals))

        obj_pre_rep = cat((x, obj_embed, pos_embed), -1)

        # run the pretrained context network
        obj_dists, obj_preds, edge_ctx, obj_ctx, _ = self.myLSTM(x, proposals)

        return  obj_pre_rep,obj_dists,obj_ctx

This way we get back several things: obj_pre_rep (the 4424-dim FasterRCNN feature), obj_dists (the predicted class distribution of every object), and obj_ctx (the context feature of every object).

self.statistics = get_dataset_statistics(cfg)
self.obj_classes = self.statistics['obj_classes']
self.rel_classes = self.statistics['rel_classes']
self.getFasterRCNNFeats = FasterRCNNFeats(cfg,self.obj_classes,self.rel_classes)
self.PretrainingFeats_dict_list = []

if self.training:
    with torch.no_grad():
        obj_pre_rep, obj_dists, obj_ctx = self.getFasterRCNNFeats(x, result)  # all features for the batch
        num_rois = [len(b) for b in result]  # number of boxes in each image
        obj_pre_rep = obj_pre_rep.split(num_rois, dim=0)  # per-image object features
        obj_dists = obj_dists.split(num_rois, dim=0)
        obj_ctx = obj_ctx.split(num_rois, dim=0)
        for i in range(len(num_rois)):
            per_feats = obj_pre_rep[i]  # all object features of the i-th image
            per_obj_dists = obj_dists[i]  # predicted class distributions of the objects in the i-th image
            per_obj_ctx = obj_ctx[i]  # context features of the objects in the i-th image
            image_id = targets[i].get_field("image_id")  # id of the current image
            PretrainingFeats_dict = {
                "image_id": image_id,
                "feats": per_feats,
                "obj_dist": per_obj_dists,
                "obj_ctx": per_obj_ctx,
            }
            self.PretrainingFeats_dict_list.append(PretrainingFeats_dict)  # cache the batch's data
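One practical detail worth flagging (my note, not part of the original run): the tensors produced under torch.no_grad() still live on the GPU, and caching one dict per image across the whole epoch keeps all of them in GPU memory until the final torch.save. Storing CPU copies avoids that:

# sketch: cache CPU copies instead of GPU tensors
PretrainingFeats_dict = {
    "image_id": image_id,
    "feats": per_feats.cpu(),
    "obj_dist": per_obj_dists.cpu(),
    "obj_ctx": per_obj_ctx.cpu(),
}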
  • Then save the above data during training. (For this run I save the model every 5000 iterations, with batch-size=12 and sequential sampling of the VG dataset; this samples 60,000 images, more than the 57,723 images of the full training set.)
if iteration % checkpoint_period == 0:
    checkpointer.save("model_{:07d}".format(iteration), **arguments)
    torch.save(model.PretrainingFeats_dict_list,"/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats_ctx.pt")

We already worked through the 56,224-vs-57,723 length-mismatch bug above, so here the fix is applied directly without repeating the process.

  • Extract the training GT data, collect the image-ids of the 9,851,817 triples, and deduplicate
PredicateGT_list = [] # the training GT data: 9851817 triples in total
with open("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Predicate_GT.csv", 'r', encoding='utf-8') as fp:
    csv_reader = csv.DictReader(fp)  # DictReader consumes the header row as the field names
    for row in tqdm(csv_reader):
        PredicateGT_list.append(row)

PredicateGT_list_id = [] # the image-ids from the GT
for j in tqdm(range(len(PredicateGT_list))):
    image_id = PredicateGT_list[j]["image_id"]
    PredicateGT_list_id.append(int(image_id))
new_PredicateGT_list_id = list(set(PredicateGT_list_id))  # 56224 - 57723 = -1499: the GT is missing 1499 images
  • Load the .pt file saved above
DatasetFeats = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats_ctx.pt")

It contains entries for 59,979 images; each entry holds the image-id, the input feature feats, the predicted object distribution obj_dist, and the object context obj_ctx.


  • The 59,979 entries must duplicate some images relative to the 57,723, so we check whether that is the case. It is: after deduplication, the 59,979 become 57,723
DatasetFeats_list = []  # stores the 57723 image-ids
for i in range(len(DatasetFeats)):
    image_id = DatasetFeats[i]["image_id"]
    DatasetFeats_list.append(int(image_id))
new_DatasetFeats_list_id = list(set(DatasetFeats_list)) # 57723
  • Since the GT holds only 56,224 images, the 57,723 must be filtered. Walk over the 57,723 image-ids: if the GT also contains an id, keep that entry; if it cannot be found in the GT, discard it. Then save the resulting 56,224 entries to PretrainingFeats56224_ctx.pt
gt_id_set = set(new_PredicateGT_list_id)  # O(1) membership tests instead of scanning a list
new_DatasetFeats = []  # the feature data of the 56224 surviving images
for i in tqdm(range(len(DatasetFeats_list[:57723]))):  # note: the slice assumes all duplicates sit after index 57723
    if DatasetFeats_list[i] in gt_id_set:  # keep the 56224 ids that the GT shares with the 57723
        new_DatasetFeats.append(DatasetFeats[i])

torch.save(new_DatasetFeats,"/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats56224_ctx.pt")
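Note that the DatasetFeats_list[:57723] slice assumes every duplicate sits after index 57723, which sequential sampling does not strictly guarantee. A filter that deduplicates by image_id directly avoids that assumption (my sketch, same inputs as above):

gt_ids = set(new_PredicateGT_list_id)
seen = set()
new_DatasetFeats = []
for entry in DatasetFeats:
    img = int(entry["image_id"])
    if img in gt_ids and img not in seen:  # keep the first occurrence of each GT image
        seen.add(img)
        new_DatasetFeats.append(entry)
print(len(new_DatasetFeats))  # expected: 56224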
  • Check that the images in the saved 56,224-entry training feature file match the GT. (Load the file, extract the image-ids, deduplicate, and sort; if the sorted list equals the sorted GT ids, the two sets contain exactly the same images.) They do match, so the processing is sound
newDatasetFeats = torch.load("/data/xuhongbo/xuhongbo.code/unbiased_sgg_xuhongbo_new/datasets/vg/Feats/PretrainingFeats56224_ctx.pt")

# verify consistency
newDatasetFeats_list = []  # stores the 56224 image-ids
for i in range(len(newDatasetFeats)):
    image_id = newDatasetFeats[i]["image_id"]
    newDatasetFeats_list.append(int(image_id))
new_DatasetFeats_list_id = list(set(newDatasetFeats_list))

# sort, then compare
new_DatasetFeats_list_id.sort()
new_PredicateGT_list_id.sort()
if new_DatasetFeats_list_id == new_PredicateGT_list_id:
    print("OK")  # prints OK, as expected

With the training data and the GT data now in one-to-one correspondence, the next step is to bind them together.

  • Note that the training data now carries two extra fields, dist and ctx, produced by the LSTMContext I pretrained on VG; my model therefore no longer needs to predict object labels, i.e., no object loss has to be back-propagated.
# pair each GT triple with its training data: [[train_data1,GT1],[train_data2,GT2],....]
debug_print(logger, 'Remake Dataloader')
batch_size = 2000
epoch_iteration = 8000
train_dataloader_list_dict = []  # holds epoch_iteration batches, each a list of 2000 dicts
for n in tqdm(range(epoch_iteration)):  # sample epoch_iteration times to obtain epoch_iteration batches
    batch_gt = random.choices(GTPredicate_list, weights=GTPredicate_probability, k=batch_size)
    train_set_batch_list_dict = [] # 2000 dicts, i.e. one batch
    for i in range(batch_size):
        batch_i_image_id = int(batch_gt[i]['image_id']) # image-id of the i-th GT triple in the n-th batch
        batch_i_sub_id = int(batch_gt[i]['sub_id'])     # sub_id of the i-th GT triple
        batch_i_obj_id = int(batch_gt[i]['obj_id'])     # obj_id of the i-th GT triple
        if batch_i_image_id in DatasetFeats_image_id:
            index = DatasetFeats_image_id.index(batch_i_image_id)
            batch_i_sub_feats = newDatasetFeats[index]['feats'][batch_i_sub_id, :]  # the sub's feature (4424-dim)
            batch_i_obj_feats = newDatasetFeats[index]['feats'][batch_i_obj_id, :]  # the obj's feature (4424-dim)
            batch_i_sub_dist = newDatasetFeats[index]['obj_dist'][batch_i_sub_id, :]  # the sub's dist (151-dim)
            batch_i_obj_dist = newDatasetFeats[index]['obj_dist'][batch_i_obj_id, :]  # the obj's dist (151-dim)
            batch_i_sub_ctx = newDatasetFeats[index]['obj_ctx'][batch_i_sub_id, :]  # the sub's context (512-dim)
            batch_i_obj_ctx = newDatasetFeats[index]['obj_ctx'][batch_i_obj_id, :]  # the obj's context (512-dim)

            train_dataset_dict = {
                "image_id": batch_i_image_id,
                "sub_id": batch_i_sub_id,
                "obj_id": batch_i_obj_id,
                "sub_feats": batch_i_sub_feats,
                "obj_feats": batch_i_obj_feats,
                "sub_dist": batch_i_sub_dist,
                "obj_dist": batch_i_obj_dist,
                "sub_ctx": batch_i_sub_ctx,
                "obj_ctx": batch_i_obj_ctx,
            }

            train_unit = (train_dataset_dict, batch_gt[i])
            train_set_batch_list_dict.append(train_unit)  # pair the training data with its GT and append to the 2000-long batch list
    train_dataloader_list_dict.append(train_set_batch_list_dict)
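GTPredicate_list and GTPredicate_probability are consumed above but built elsewhere. The key idea is that random.choices draws triples with per-triple weights, so giving each triple the inverse frequency of its predicate over-samples rare predicates until a batch is roughly uniform over the predicate classes. A minimal sketch of such weights (my assumption about their construction, not code from the original post):

from collections import Counter

# how often each predicate class occurs over all remade GT triples
predicate_counts = Counter(int(row["predicate_label"]) for row in GTPredicate_list)

# inverse-frequency weight: a triple whose predicate occurs n times gets
# weight 1/n, so every predicate class carries the same total sampling mass
GTPredicate_probability = [1.0 / predicate_counts[int(row["predicate_label"])]
                           for row in GTPredicate_list]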
  • My model structure therefore becomes the following. (It no longer needs to predict the sub/obj labels; it simply takes the argmax of the pretrained dists. The only loss is the rel-loss, i.e., the model is essentially a relation-classification network trained on batches free of the long-tail problem.)
class PredicatedRCNN(nn.Module):

    def __init__(self, cfg):
        super(PredicatedRCNN, self).__init__()
        self.cfg = cfg.clone()
        self.feats_dim = 4424
        self.hidden_size = 1024
        self.num_obj_classes = 151
        self.num_rel_classes = 51
        self.ctx_dim = 512
        self.edge_hidden_dim = cfg.MODEL.ROI_RELATION_HEAD.CONTEXT_HIDDEN_DIM # 512
        self.embed_dim = cfg.MODEL.ROI_RELATION_HEAD.EMBED_DIM # 200
        self.post_emb = nn.Linear(self.edge_hidden_dim, self.edge_hidden_dim * 2)  # 512->1024
        self.edge_lin = nn.Linear((self.feats_dim+self.embed_dim + self.ctx_dim)*2, self.edge_hidden_dim) # 10272->512

        self.pooling_dim = cfg.MODEL.ROI_RELATION_HEAD.CONTEXT_POOLING_DIM
        self.post_cat = nn.Linear(self.edge_hidden_dim, self.pooling_dim)
        self.rel_compress = nn.Linear(self.pooling_dim, self.num_rel_classes, bias=True)

        # load class dict
        statistics = get_dataset_statistics(cfg)
        obj_classes, rel_classes = statistics['obj_classes'], statistics['rel_classes']
        assert self.num_obj_classes == len(obj_classes)
        assert self.num_rel_classes == len(rel_classes)

        # word-vector embeddings for the object classes
        obj_embed_vecs = obj_edge_vectors(obj_classes, wv_dir=self.cfg.GLOVE_DIR, wv_dim=self.embed_dim) # [151,200], covering all classes
        self.edge_obj_embed = nn.Embedding(self.num_obj_classes, self.embed_dim)
        self.sub_obj_embed1 = nn.Embedding(self.num_obj_classes, self.embed_dim)
        with torch.no_grad():
            self.sub_obj_embed1.weight.copy_(obj_embed_vecs, non_blocking=True)
            self.edge_obj_embed.weight.copy_(obj_embed_vecs, non_blocking=True)

        # cross-entropy loss
        self.criterion_loss = nn.CrossEntropyLoss()

        self.device = torch.device(cfg.MODEL.DEVICE)

    def forward(self, train_data, target, logger=None): # xhb: this forward computes the loss

        # get object feature
        sub_feats = train_data['sub_feats']
        obj_feats = train_data['obj_feats']

        # get object pretraining dist
        pre_sub_dist = train_data['sub_dist']
        pre_obj_dist = train_data['obj_dist']

        # get object pretraining context feature
        pre_sub_ctx = train_data['sub_ctx']
        pre_obj_ctx = train_data['obj_ctx']


        # get object predictions from the pretrained distributions
        sub_logits = F.softmax(pre_sub_dist.view(1,-1), dim=1)
        obj_logits = F.softmax(pre_obj_dist.view(1,-1), dim=1)
        sub_pred = torch.argmax(sub_logits).view(1) # argmax alone gives a 0-dim tensor, e.g. tensor(8, device='cuda:0'), which to_onehot cannot handle, hence .view(1)
        obj_pred = torch.argmax(obj_logits).view(1)

        # get label embedding features; drawback: if the prediction above is wrong, the embedding is wrong too
        sub_embed = self.edge_obj_embed(sub_pred.long())
        obj_embed = self.edge_obj_embed(obj_pred.long())

        # get edge features
        sub_rel_rep = torch.cat((sub_embed.view(1,-1), sub_feats.view(1,-1), pre_sub_ctx.view(1,-1)), -1) # 5136 = 200 + 4424 + 512
        obj_rel_rep = torch.cat((obj_embed.view(1,-1), obj_feats.view(1,-1), pre_obj_ctx.view(1,-1)), -1)

        # get relation head-tail features
        edge_feats = torch.cat((sub_rel_rep, obj_rel_rep), -1) # concatenate the combined sub and obj features
        edge_feats = self.edge_lin(edge_feats) # project to 512 dims
        edge_rep = self.post_emb(edge_feats) # then to 1024 dims
        edge_rep = edge_rep.view(edge_rep.size(0), 2, self.edge_hidden_dim)
        head_rep = edge_rep[:, 0].contiguous().view(-1, self.edge_hidden_dim) # 512 dims
        tail_rep = edge_rep[:, 1].contiguous().view(-1, self.edge_hidden_dim) # 512 dims

        # get relation dist
        head_rep = self.post_cat(head_rep) # project head to pooling_dim (4096)
        tail_rep = self.post_cat(tail_rep) # project tail to pooling_dim (4096)
        relation_rep = head_rep * tail_rep # relation feature
        relation_dists = self.rel_compress(relation_rep) # relation classification
        relation_logits = F.softmax(relation_dists, dim=1)
        new_relation_logits = relation_logits.view(-1)[1:] # drop class 0 so triples with relation label 0 are never predicted
        predicate_pred = torch.argmax(new_relation_logits).view(1)

        # get <predicate> label
        rel_label = torch.tensor(int(target['predicate_label'])).long().view(1).to(self.device)

        # classification confidences
        sub_index = int(sub_pred)
        sub_scores = sub_logits.view(-1).tolist()[sub_index] # subject
        obj_index = int(obj_pred)
        obj_scores = obj_logits.view(-1).tolist()[obj_index] # object
        predicate_index = int(predicate_pred) + 1 # class 0 was dropped above, so shift the index back by 1
        predicate_scores = relation_logits.view(-1).tolist()[predicate_index] # predicate


        # in training mode, return the loss
        if self.training:
            # calculate relation loss
            rel_loss = self.criterion_loss(relation_dists.view(1,-1), rel_label)
            output_losses = dict(loss_rel=rel_loss)
            return output_losses

        # in evaluation mode, return the predictions
        elif self.cfg.SOLVER.TO_VAL:

            predictions = dict(sub_pred=sub_index,
                               obj_pred=obj_index,
                               predicate_pred=predicate_index,
                               sub_scores=sub_scores,
                               obj_scores=obj_scores,
                               predicate_scores=predicate_scores
                               )

            return predictions
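For completeness, a training step over the remade batches might look like the following. This is my sketch, assuming mymodel is a PredicatedRCNN and that optimizer_Adam and train_dataloader_list_dict are built as above; since forward handles one triple at a time, the per-triple relation losses are averaged before the backward pass:

for real_iteration, batch in enumerate(train_dataloader_list_dict, start=1):
    optimizer_Adam.zero_grad()
    batch_loss = 0.0
    for train_data, gt in batch:          # 2000 (train_data, GT) units per batch
        losses = mymodel(train_data, gt)  # {"loss_rel": ...} in training mode
        batch_loss = batch_loss + losses["loss_rel"]
    batch_loss = batch_loss / len(batch)  # average relation loss over the batch
    batch_loss.backward()
    optimizer_Adam.step()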