借京东图文识别baseline 来看clip训练过程。 clip是怎样练成的。

亮子李

已于 2022-03-28 09:39:29 修改

阅读量3.9k

点赞数 3

分类专栏：日常学习文章标签：人工智能

于 2022-03-25 17:41:48 首次发布

本文链接：https://blog.csdn.net/YI_SHU_JIA/article/details/123714897

版权

日常学习专栏收录该内容

16 篇文章 0 订阅

订阅专栏

这次轮到clip模型啦。记笔记记笔记。

背景是京东已经给了图片的feature 也就是不需要我们再去抽特征。然后给了图片对应的标题。

我们直接从clip训练开始。

    dataloader, sampler = data['train'].dataloader,  data['train'].sampler

    loss_img = nn.CrossEntropyLoss()
    loss_txt = nn.CrossEntropyLoss()
    if args.gpu is not None:
        loss_img = loss_img.cuda(args.gpu)
        loss_txt = loss_txt.cuda(args.gpu)

定义数据loader 和两个loss 都是普通的交叉熵loss。

    for i, batch in enumerate(dataloader):
        step = num_batches_per_epoch * epoch + i
        scheduler(step)

        optimizer.zero_grad()

        images, texts = batch

设置学习率然后读入一个batch的数据我们看看数据长啥样。

在我的这批数据里图像和文字的编码长度都是2048 所以上面就是images了一批256张，每张2048维。而text 就是一批文字数据。256句文字

tokens = tokenize(texts)

这句代码很长就是用bert的tokenize对文字进行编码

而bert的编码处理文字后一般有三个部分

1 input_ids :就是输入的文字的id 注意只是id编号而不是编码 101表示cls 的token 6144也代表着一个字

2 token_types_id 这个属性表示的是这个字在哪个句子里（一般seq把两个句子分开。）因为我们只有一个句子所有全部都是0 他的大小是 128*77 表示128个样本里每个字的所在句子位置。

3 attention_mask 表示我们要关注哪些字一般填充的pad对应0 其他的对应1

image_features, text_features, logit_scale = model(images, texts)

把图像特征和文字编码输入进模型

clip的编码就这两句第一句编图像第二句编文字但这里图像编码好了所以去编码文字。

        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

class TextModel(torch.nn.Module):
    def __init__(self, model_name='M-CLIP/M-BERT-Base-69', out_features=2048):
        super().__init__()
        self.model_name = model_name
        self.transformer = transformers.AutoModel.from_pretrained(model_name, cache_dir='~/.cache')
        in_features = self.transformer.pooler.dense.out_features
        self.clip_head = torch.nn.Linear(in_features=in_features, out_features=out_features)
    
    def forward(self, txt_tok):
        embs = self.transformer(**txt_tok)[0]
        att = txt_tok['attention_mask']
        embs = (embs * att.unsqueeze(2)).sum(dim=1) / att.sum(dim=1)[:, None]
        return self.clip_head(embs)

这个class就是用来编码文字的 forward里有四句我们一句一句来看。

bert

第一句 transoforms 就是指 bert模型关于bert模型可以看这篇介绍

HuggingFace BERT源码详解：基本模型组件实现_PaperWeekly的博客-CSDN博客

bert分三块第一部分是编码 embedding 也就是把上面的字的id 对应成768维的向量。

第二步是encoder 就是子注意力模块。而第三部分就是pooler层。

bert模型的输入一半要求三个东西就是上面的三个。

embedding层

我们进入embedding层这里的输入是 128*77

其中 128是batch 因为我用了两个gpu 所以减半了。 77是text的长度其中位置0上是cls 后面跟着句子中字的id 后面是seq 再后面都是pad。

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

获得位置的id 这里的positionids 就是0到76

            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
position_embeddings = self.position_embeddings(position_ids)

把字和句子位置都编码 data变成了 128*77*768 位置也编码变成1*77*768 然后把他们三个直接加起来。位置编码在加的时候会自动扩充。

        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

经过 LN和DP 后得到带位置信息和句子位置信息的编码。输出是 128*77*768

encoder

encoder 就是自注意力层 bertchinese 是有12层每层注意力头有12个

下面是其中一层。

class BertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.chunk_size_feed_forward = config.chunk_size_feed_forward
        self.seq_len_dim = 1
        self.attention = BertAttention(config)
        self.is_decoder = config.is_decoder
        self.add_cross_attention = config.add_cross_attention
        if self.add_cross_attention:
            if not self.is_decoder:
                raise ValueError(f"{self} should be used as a decoder model if cross attention is added")
            self.crossattention = BertAttention(config, position_embedding_type="absolute")
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)


class BertAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        self.self = BertSelfAttention(config, position_embedding_type=position_embedding_type)
        self.output = BertSelfOutput(config)
        self.pruned_heads = set()

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

上面三层扣起来的我也是醉了。下面的hidden_states就是输入。他依然是 128*77*768

mixed_query_layer = self.query(hidden_states)
query_layer = self.transpose_for_scores(mixed_query_layer)
 key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))

q，k，v 就是三个linear得到的之后要经过一个reshape 他们三个都是 128 *12 * 77 *64 12是注意力的头数 64是每个头得到的数据维数所以这样子我们就可以12个头同时计算了。

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)


attention_scores = attention_scores + attention_mask   #注意这一句  attention 前面做了反向 也就是说 pad的位置上 都是-10000 很小 然后加起来  pad位置的注意力都很小 这样softmax后 几乎为 0 就不会注意到pad

    
        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()


        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

上面代码就完成了attention is all your need 的这一数学公式

上面的操作做12次就得到了encoder后的特征大小是 128*77*768

pool层

        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output

这个pool层的意思很简单就是取cls 对应的特征相当于只取第一个字因为cls 考虑了全局他的大小是 128*768 然后经过一个mlp 就完成了 bert对text的编码。

        return BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            past_key_values=encoder_outputs.past_key_values,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )

看着很多但其实就返回了两个值一个是 pool前的一个pool后的结果

text 后处理

att = txt_tok['attention_mask']

取出att 就是那个有字1 没字0的

embs = (embs * att.unsqueeze(2)).sum(dim=1) / att.sum(dim=1)[:, None]

att 我们知道是 128 * 77

att.unsqueeze(2)    这个是指在第三维扩充一维  变成 128*77 *1

embs 是pool前的结果所以他的大小是128 * 77*768 而*乘是指两个相同大小矩阵对应位置的数相乘。

不相同怎么*乘呢？其实有一个扩充比如下面代码。 a 大小是2*2 b是2*1 会自动把b的1维扩充为2 变成2*2 之后*乘

a =torch.tensor([[1,1],[1,1]])
print(a)
b = torch.tensor([[2],[3]])
print(a*b)

*******************************

tensor([[1, 1],
[1, 1]])
tensor([[2, 2],
[3, 3]])

然后在第一维相加也就是在77这里加就是说不用pool了现在我要取全部字的特征加起来然后取平均值。（我觉得这样不是很好说实话）那个

att.sum(dim=1)[:, None]  这个的意思是 看这句有多少字  然后[:,None]可以起到扩充维度的作用

a =torch.tensor([[1,1],[1,1]])
print(a)
print(a[:,None].shape)

torch.Size([2, 1, 2])

 return self.clip_head(embs)

clip_head 就是一个fc 把特征从768 到2048 最后结果是 128*2048

回到clip 、

text_features = text_features / text_features.norm(dim=-1, keepdim=True)

return image_features, text_features, self.logit_scale.exp()

上面我们得到了文字和图像的编码。 .norm表示二范数用二范数归一化。但我真的不知道那个logit_scale_有啥用可能后面有大用吧

回到train函数注意我们从两个gpu回收了data 所以现在数据是 256 *2048

        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logit_scale * text_features @ image_features.t()

logit_scale 在这里当一个倍数？

logits_per_image 大小是256 *256 表示的意思是 256张 照片 每张图片 有256个分类结果 相当于一个分类256的网络结果。 其中一个是正确的txt 。   对于per——text也是一样。

ground_truth = torch.arange(len(logits_per_image)).long()

gt 是0到256 其实很简单我们当作一个普通的分类任务结果。第一个图片有256个预测结果，他的真实结果是哪个呢？显然是第一个文字。所以他的标签是0 . 着256张照片对应的txt下标就是 0，1，2，3，.。。。255 对于txt也是一样。这里的loss是交叉熵损失。

                scaler.scale(total_loss).backward()
                scaler.step(optimizer)
            scaler.update()

更新参数和loss 所以logit_scale有什么用？

        m.logit_scale.data = torch.clamp(m.logit_scale.data, 0, 4.6052)

torch.clamp 表示让一个数落在一个区间超过就裁剪。所以logit_scale有什么用？

我们可以看到这里clip并没有像论文里那样用全部的数据进行对比学习而只是用一个bat的256个数据进行对比学习。也就是说是一个阉割版的。主要是官方很贴心知道大家都没时间（没钱）所以写了个判断如果在分布式系统上进行训练，才会进行全部数据的对比学习。

    if args.distributed and args.aggregate:
        world_size = dist.get_world_size()
        rank = dist.get_rank()

        # We gather tensors from all gpus to get more negatives to contrast with.
        gathered_image_features = [
            torch.zeros_like(image_features) for _ in range(world_size)
        ]
        gathered_text_features = [
            torch.zeros_like(text_features) for _ in range(world_size)
        ]
        dist.all_gather(gathered_image_features, image_features)
        dist.all_gather(gathered_text_features, text_features)

        all_image_features = torch.cat(
            [image_features]
            + gathered_image_features[:rank]
            + gathered_image_features[rank + 1 :]
        )
        all_text_features = torch.cat(
            [text_features]
            + gathered_text_features[:rank]
            + gathered_text_features[rank + 1 :]
        )

        # this is needed to send gradients back everywhere.
        logits_per_image = logit_scale * all_image_features @ all_text_features.t()
        logits_per_text = logits_per_image.t()

测试

然后我们会进入 evaluate 也就是跑测试集。一般跑测试集没啥好看的不过 clip好像不一样。

和训练得到loss的方法是一模一样的但是得到loss后测试多了个步骤我们来看看。

            batch_size = len(images)
            cumulative_loss += total_loss * batch_size
            num_elements += batch_size

        metrics = get_metrics(
            image_features=torch.cat(all_image_features),
            text_features=torch.cat(all_text_features),
            logit_scale=logit_scale
        )

get_metrics

def get_metrics(image_features, text_features, logit_scale):
    metrics = {}
    logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()
    logits_per_text = logits_per_image.t().detach().cpu()

    logits = {"image_to_text": logits_per_image, "text_to_image": logits_per_text}
    ground_truth = torch.arange(len(text_features)).view(-1, 1)

进入函数并初始化一些东西上面这些东西跟训练时是一样的。但不一样的是他的长度。他的长度足足有1000 . 注意GT的形状是1000*1 而不是1*1000

    for name, logit in logits.items():
        ranking = torch.argsort(logit, descending=True)

logit 就是 1000*1000类。先取的是图片预测文字。

ranking 大小是1000*1000 第一个1000是样本数。 argsort得到的是排序后对应位置数的下标。 des表示倒序。示例如下。最大数在第二个位置所以开始是2 然后 1 ， 0，3.

a = torch.tensor([3,5,7 , 1])
print(torch.argsort(a,descending=True))

tensor([2, 1, 0, 3])

preds = torch.where(ranking == ground_truth)[1] 
preds = preds.detach().cpu().numpy()

ranking == ground_truth   这个会返回一个1000*1000的矩阵 ranking 和GT不一样  GT会扩充 变成1000*1000 .   就像下图。

之后 ranking 第一行等于0的地方就会变成True 第二行等于1的地方变成True 以此类推。每行只有一个True。

torch.where（CON,X,Y）x，y是相同形状的矩阵 con是条件符合就返回x 不符就y 但是这里没有x，y 只有条件他会相当于另一个函数： torch.nonzero(condition, as_tuple=True) 他返回两个元组分别表示不为0的元素的行索引和列索引。比如下面 00 ，01，10 三个位置都不是0.

a = torch.tensor([[1,1],[2,0]])
print(torch.nonzero(a,as_tuple=True))
(tensor([0, 0, 1]), tensor([0, 1, 0]))

我们知道ranking == ground_truth 每行只有一个 1 剩下全是0 也就是说每行只对应一个列索引所以 preds 就每行中为Ture的那个下标。也就是标签和排序位置对应的标签他的含义是什么呢？

        metrics[f"{name}_mean_rank"] = preds.mean() + 1
        metrics[f"{name}_median_rank"] = np.floor(np.median(preds)) + 1

保存两个值一个是预测值的均值加1 一个是中位数取整后加1.

        for k in [1, 5, 10]:
            metrics[f"{name}_R@{k}"] = np.mean(preds < k)

统计preds小于k的概率。到这里我似乎懂了一点点。我们知道最大的预测值的下标都在第一个。

比如这样一串数他就是logit 经过argsort 变成了 2，1，0，4 也就是说我们预测的值就是2 那么如果真实标签也是2 那么在torch.where里就会返回0 也就是说所有预测正确的都会返回0 就算不是0 我们也希望他越小越好。就是说希望预测值里真值的概率排行比较靠前。

那么问题很清楚了这个所谓的矩阵其实就是一个正确率统计罢了。也就是我们经常在论文里看到的前5准确率前10准确率。

我们可以看到前1准确率 0.322 前5准确率 0.677 前10准确率 0.817

 return metrics

返回预测矩阵结果。这样找预测正确率是不是很高端？高端的我看了半天才知道在干嘛 .........

至此 clip 怎么用图片和text来训练的步骤就算完了。

所以logit_scale有什么用？？？

二：用预训练的模型去预测。

给的baseline 预测出来全是0 我倒要看看你是怎么预测的。

checkpoint_path = "/home/competition/jingdong/baseLine/src/training/logs/epoch_36.pt"
test_data = "src/data/test.txt"
attr_dict_file = "src/data/attr_to_attrvals.json"
out_file = "test_pred.txt"

# build model
model = load_model(checkpoint_path)

# test
attr_dict = load_attr_dict(attr_dict_file)
rets = []

设置模型载入数据

        data = json.loads(data)
        feature = np.array(data['feature']).astype(np.float32)
        texts = [data['title'] if a=='图文' else match_attrval(data['title'], a, attr_dict) for a in data['query']]
        features = torch.from_numpy(feature)[None, ].repeat(len(texts), 1)
        tokens = tokenize(texts)
        
        features = features.cuda()
        tokens = {k: v.cuda() for k, v in tokens.items()}

读数据和写数据

这里很有意思。

一个数据进来先把图文title 作为图文的字编码把query也各自编码比如如果有两个quert 就会把query跟title一样等于作为一个样本然后把图像特征复制三次。跟text对应

with torch.no_grad():
    image_features, text_features, _ = model(features, tokens)
    similarities = (image_features*text_features).sum(dim=-1)
    similarities = similarities.cpu().tolist()

之后两个特征相乘得到similarity

        ret = {
            "img_name": data["img_name"],
            "match": {
                a: int(s>0.4 if a=='图文' else 0.04) for a, s in zip(data['query'], similarities)
            }
        }

这段很有意思就是说如果是图文当匹配度大于0.4 就是1 如果不是图文就输出0

也就是如果是属性只能输出0 我估计这里是写错了代码应该这样写

 a: int(s>0.4 if a=='图文' else s > 0.04) for a, s in zip(data['query'], similarities)

因为属性匹配的分数都很低。

       
        rets.append(json.dumps(ret, ensure_ascii=False)+'\n')

with open(out_file, 'w') as f:
    f.writelines(rets)

写入结果

亮子李

关注

3
点赞
踩
14

收藏

觉得还不错? 一键收藏
5
评论
借京东图文识别baseline 来看clip训练过程。 clip是怎样练成的。

这次轮到clip模型啦。记笔记记笔记。背景是京东已经给了图片的feature 也就是不需要我们再去抽特征。然后给了图片对应的标题。我们直接从clip训练开始。 dataloader, sampler = data['train'].dataloader, data['train'].sampler loss_img = nn.CrossEntropyLoss() loss_txt = nn.CrossEntropyLoss() if args.
复制链接

扫一扫