YOLO-World Network Architecture and Principles (Part 2): The Text Encoder

小小酥kkk

Published 2024-07-20 15:58:48

Tags: YOLO, artificial intelligence, computer vision, python, deep learning, neural networks, natural language processing

Article link: https://blog.csdn.net/ITdaka/article/details/140572401

Preface

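In YOLO-World, the text encoder converts text into embedding representations that the rest of the model can consume. This article walks through the main parts of that text encoder.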

Text Encoder

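YOLO-World relies on a pre-trained CLIP model to encode the input text (category names, noun phrases, or object descriptions) into text embeddings, so this article focuses on the text-encoding part of CLIP.
[Figure: CLIP model architecture]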

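1. Main functions of the text encoder

Text representation: converts the input text into high-dimensional vectors that capture its semantic content.
Multimodal learning: trained jointly with the image encoder through contrastive learning, so that the model learns the association between images and text.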
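
To make the contrastive-learning idea concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of matched image/text embeddings. The function name `contrastive_loss`, the temperature of 0.07, and the random embeddings are assumptions for illustration; this is not YOLO-World's or CLIP's actual training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; row i of each input is a matched pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # the correct match sits on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
text_emb = torch.randn(4, 512)                               # stand-in text embeddings
aligned = contrastive_loss(text_emb.clone(), text_emb)       # every pair matches
shuffled = contrastive_loss(text_emb[[1, 2, 3, 0]], text_emb)  # every pair is mismatched
print(aligned.item(), shuffled.item())
```

The loss is small when matched pairs are the most similar entries in their row and column, and large when they are not, which is what pushes the two encoders toward a shared embedding space.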

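2. Detailed workflow of the CLIP text encoder

2.1 Input text processing

Tokenization: the input natural-language text is first split into a sequence of tokens. When the input to YOLO-World is a caption or a referring expression rather than a list of category names, it first uses a simple n-gram algorithm to extract noun phrases, that is, it scans contiguous word sequences for candidate phrases. The text is then tokenized (CLIP uses byte-pair encoding), and each token is mapped to a unique integer ID through a predefined vocabulary. The vocabulary is built from a large text corpus ahead of training, and every token in it corresponds to one index.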

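# GPT-2's tokenizer is used here only as a readily available BPE stand-in;
# CLIP ships its own BPE tokenizer with a different vocabulary.
from transformers import GPT2Tokenizer

input_text = "A cat sits on the mat."
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokens = tokenizer.tokenize(input_text)                # split the text into BPE tokens
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)   # map each token to its vocabulary ID
print(tokens)         # ['A', 'Ġcat', 'Ġsits', 'Ġon', 'Ġthe', 'Ġmat', '.']
print(tokens_ids)     # a list of integer IDs, one per token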
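
The n-gram step can be sketched in plain Python. This is only an illustration of the idea: the helper names `ngrams` and `extract_phrases`, and the small `phrase_vocab` set, are hypothetical, and YOLO-World's actual noun-phrase extraction is not shown in this article.

```python
def ngrams(words, max_n=3):
    """Yield (start, length, text) for every contiguous run of 1..max_n words, longest first."""
    for n in range(max_n, 0, -1):
        for i in range(len(words) - n + 1):
            yield i, n, " ".join(words[i:i + n])

def extract_phrases(caption, phrase_vocab, max_n=3):
    """Greedy longest-match of n-grams against a set of known phrases (hypothetical helper)."""
    words = caption.lower().replace(".", "").split()
    used = [False] * len(words)          # words already covered by a longer phrase
    found = []
    for i, n, gram in ngrams(words, max_n):
        if gram in phrase_vocab and not any(used[i:i + n]):
            found.append((i, gram))
            used[i:i + n] = [True] * n
    return [gram for _, gram in sorted(found)]   # restore caption order

phrase_vocab = {"cat", "mat", "yoga mat", "black cat"}   # made-up candidate phrases
phrases = extract_phrases("A black cat sits on the yoga mat.", phrase_vocab)
print(phrases)
```

Each extracted phrase would then be tokenized and encoded on its own, the same way a category name is.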

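Adding Special Tokens: after tokenization, CLIP adds special tokens at the start and end of the text. CLIP's tokens are <|startoftext|> and <|endoftext|>; they play a role similar to BERT's [CLS] and [SEP], marking where the sentence begins and ends so the model can recognize the boundaries of the text.

Input Text: "A cat sits on the mat."

Step 1: Tokenization (simplified; CLIP lowercases the text)
Tokens: ["<|startoftext|>", "a", "cat", "sits", "on", "the", "mat", ".", "<|endoftext|>"]

Step 2: Token and Position Embeddings
Token Embeddings: [e_sot, e_a, e_cat, e_sits, e_on, e_the, e_mat, e_dot, e_eot]

Position Embeddings: [p_0, p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8]

Final Embeddings: [e_sot + p_0, e_a + p_1, e_cat + p_2, ...]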
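
As a rough sketch of this step, the snippet below wraps a list of token IDs with start and end markers and pads it to a fixed length. The ID values 49406 and 49407 and the context length of 77 are what I understand OpenAI's released CLIP vocabulary to use, but treat them, the padding value 0, and the helper name `wrap_and_pad` as assumptions.

```python
SOT_ID = 49406          # assumed ID of <|startoftext|>
EOT_ID = 49407          # assumed ID of <|endoftext|>
CONTEXT_LENGTH = 77     # assumed fixed sequence length

def wrap_and_pad(token_ids, context_length=CONTEXT_LENGTH, pad_id=0):
    """Add start/end markers, truncate if needed, then right-pad to context_length."""
    ids = [SOT_ID] + list(token_ids) + [EOT_ID]
    if len(ids) > context_length:
        # keep the end marker as the last token after truncation
        ids = ids[:context_length - 1] + [EOT_ID]
    return ids + [pad_id] * (context_length - len(ids))

example_ids = [11, 22, 33]            # placeholder IDs, not real vocabulary entries
padded = wrap_and_pad(example_ids)
print(len(padded), padded[:6])
```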

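2.2 Token embedding

Token Embedding: each token is converted into its embedding vector by a lookup in the embedding matrix, which maps the token into a high-dimensional vector space. The matrix has shape (vocab_size, embedding_dim), where vocab_size is the size of the vocabulary and embedding_dim is the dimension of the embedding vectors. A token's embedding is simply the row of the matrix indexed by its token ID.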

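import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained('gpt2')
input_ids = torch.tensor([tokens_ids])      # shape (1, 7): a batch holding one sentence
with torch.no_grad():
    # Look up one row of the embedding matrix per token ID
    # (the original snippet ran the whole model and read last_hidden_state instead)
    embeddings = model.get_input_embeddings()(input_ids)

print(embeddings.shape)
# torch.Size([1, 7, 768])  - 1 sentence, 7 tokens, 768 dimensions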

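Position Embedding: a position vector is added to each token so that the model knows where the token sits in the sequence. Self-attention on its own is order-agnostic, so CLIP adds positional embeddings to represent each token's position in the sentence. In general a position encoding can be a fixed, predefined vector or a trainable parameter; CLIP's text encoder uses trainable ones. The position vector is added element-wise to the token embedding, which lets the model capture sequence order.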
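
The two embedding steps can be sketched together in PyTorch. The sizes (vocabulary 49408, context length 77, width 512) are assumed CLIP-like values, the token IDs are placeholders, and the weights are random rather than pre-trained.

```python
import torch

torch.manual_seed(0)
vocab_size, context_length, width = 49408, 77, 512   # assumed CLIP-like sizes

token_embedding = torch.nn.Embedding(vocab_size, width)
# Learned positional embedding: one trainable vector per position
positional_embedding = torch.nn.Parameter(torch.randn(context_length, width) * 0.01)

token_ids = torch.tensor([[49406, 320, 2368, 49407]])   # placeholder IDs, batch of one
seq_len = token_ids.size(1)

# Row lookup for each token, plus the vector for its position
x = token_embedding(token_ids) + positional_embedding[:seq_len]

# An embedding lookup is the same as indexing a row of the weight matrix
lookup_is_row_indexing = torch.equal(
    token_embedding(token_ids)[0, 1], token_embedding.weight[320]
)
print(x.shape, lookup_is_row_indexing)
```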

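2.3 Transformer encoder

Multi-Head Self-Attention:
- Linear Transformations: the input token embeddings are first mapped to Query, Key, and Value vectors by linear transformations implemented with trainable weight matrices.
- Scaled Dot-Product Attention: for each query vector, compute its dot product with every key vector, divide by a scaling factor (usually the square root of the key dimension), and apply softmax to obtain the attention weights. The weights then multiply the corresponding value vectors, and the results are summed into the weighted output.
- Concatenation and Linear Transformation: the outputs of the individual attention heads are concatenated and passed through another linear transformation to produce the final multi-head self-attention output.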
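
The three steps above can be sketched as a small PyTorch module. This is an illustrative implementation under assumed sizes (width 512, 8 heads) with random weights, not CLIP's own attention code; the class name `MultiHeadSelfAttention` is my own.

```python
import math
import torch

class MultiHeadSelfAttention(torch.nn.Module):
    def __init__(self, width=512, num_heads=8):
        super().__init__()
        assert width % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = width // num_heads
        # Step 1: trainable projections to Query, Key, Value
        self.q_proj = torch.nn.Linear(width, width)
        self.k_proj = torch.nn.Linear(width, width)
        self.v_proj = torch.nn.Linear(width, width)
        # Step 3: projection applied after concatenating the heads
        self.out_proj = torch.nn.Linear(width, width)

    def _split(self, t):
        # (batch, tokens, width) -> (batch, heads, tokens, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        q = self._split(self.q_proj(x))
        k = self._split(self.k_proj(x))
        v = self._split(self.v_proj(x))
        # Step 2: scaled dot-product attention, computed per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = scores.softmax(dim=-1)
        heads = weights @ v
        # Step 3: concatenate heads back to (batch, tokens, width)
        b, h, n, d = heads.shape
        merged = heads.transpose(1, 2).reshape(b, n, h * d)
        return self.out_proj(merged), weights

torch.manual_seed(0)
mhsa = MultiHeadSelfAttention()
out, weights = mhsa(torch.randn(2, 5, 512))
print(out.shape, weights.shape)
```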

[Figure: Multi-Head Self-Attention flow]

[Figure: self-attention computation]
A concrete example (the scaling factor is omitted for simplicity):
1. The query vector of input-1 is [1, 0, 2]. Taking its dot product with the key vectors of input-1, input-2, and input-3 gives three scores: 2, 4, 4.
2. Applying softmax to these three scores gives the importance of input-1, input-2, and input-3: roughly 0.0, 0.5, 0.5 after rounding (the unrounded values are about 0.06, 0.47, 0.47).
3. These weights multiply the value vectors of input-1, input-2, and input-3, and the results are summed:
$0.0 \cdot [1, 2, 3] + 0.5 \cdot [2, 8, 0] + 0.5 \cdot [2, 6, 3] = [2.0, 7.0, 1.5]$
4. This gives the output for input-1: [2.0, 7.0, 1.5].
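
The arithmetic in this example can be checked with a few lines of plain Python. The query and value vectors come from the example; the key vectors are not given in the text, so the ones below are an assumption chosen to reproduce the scores 2, 4, 4.

```python
import math

query = [1, 0, 2]
keys = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]      # assumed keys that yield scores 2, 4, 4
values = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]

# Step 1: dot product of the query with every key
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

# Step 2: softmax over the scores
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Step 3: weighted sum of the value vectors
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(3)]

# Rounding the weights to the nearest 0.5 gives the simplified numbers 0.0, 0.5, 0.5
rounded_weights = [round(w * 2) / 2 for w in weights]
rounded_output = [sum(w * v[j] for w, v in zip(rounded_weights, values)) for j in range(3)]

print(scores, [round(w, 3) for w in weights], [round(o, 3) for o in output])
print(rounded_weights, rounded_output)
```

With unrounded softmax weights the output is close to, but not exactly, [2.0, 7.0, 1.5].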
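[Figure: self-attention as matrix operations]
For more details, see the original article: 多模态模型学习1——CLIP对比学习语言-图像预训练模型 (Multimodal Model Learning 1: CLIP, Contrastive Language-Image Pre-training)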

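# Example of the computation inside a Transformer encoder layer
import math
import torch
from torch.nn import functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    # Q.size(-1) is a Python int, so use math.sqrt (torch.sqrt expects a tensor)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    attention_weights = F.softmax(scores, dim=-1)   # each row sums to 1
    output = torch.matmul(attention_weights, V)
    return output

# Q, K, V are obtained from the embeddings through linear projections
# (randomly initialized here and single-head, for illustration only)
Q = torch.nn.Linear(768, 768)(embeddings)
K = torch.nn.Linear(768, 768)(embeddings)
V = torch.nn.Linear(768, 768)(embeddings)

attention_output = scaled_dot_product_attention(Q, K, V)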

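Add & Norm: the output of the self-attention layer is added to the original input through a residual connection, and the sum is then layer-normalized. Note that CLIP's released implementation applies layer normalization before each sub-layer (pre-norm); the post-norm order described here follows the original Transformer.

from torch.nn import LayerNorm

# Residual connection
residual_output = embeddings + attention_output

# Layer normalization over the last (feature) dimension
layer_norm = LayerNorm(768)
normalized_output = layer_norm(residual_output)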
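
To show what layer normalization actually computes, the sketch below reproduces it by hand and compares the result with `torch.nn.LayerNorm`. The tensors are random stand-ins with the same shape as in the example above.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 7, 768)              # stand-in for the sub-layer input
sublayer_out = torch.randn(1, 7, 768)   # stand-in for the attention output

residual = x + sublayer_out             # residual connection

# Normalize each token vector to zero mean and unit variance over its 768 features
mean = residual.mean(dim=-1, keepdim=True)
var = residual.var(dim=-1, keepdim=True, unbiased=False)
manual = (residual - mean) / torch.sqrt(var + 1e-5)

# A freshly created LayerNorm has scale 1 and shift 0, so it should match
layer_norm = torch.nn.LayerNorm(768)
matches = torch.allclose(manual, layer_norm(residual), atol=1e-5)
print(matches)
```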

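Feed-Forward Network: the normalized attention output is then passed through a feed-forward network, which usually consists of two linear layers with a non-linear activation between them (ReLU in this example; CLIP itself uses a GELU variant). The output of the feed-forward network again goes through a residual connection and layer normalization.

ffn = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),   # expand to 4x the model width
    torch.nn.ReLU(),
    torch.nn.Linear(3072, 768)    # project back to the model width
)

# Second Add & Norm: residual connection around the FFN, then layer normalization
ffn_norm = torch.nn.LayerNorm(768)
ffn_output = ffn_norm(normalized_output + ffn(normalized_output))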

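Layer stacking: the self-attention and feed-forward sub-layers described above together form one Transformer encoder layer. CLIP stacks several such layers, typically 12, to increase the model's representational capacity. Each layer repeats the same sequence of self-attention, residual connection, layer normalization, and feed-forward network, but with its own weights.

# 12 layers assumed. Every layer gets its own randomly initialized weights;
# this only illustrates the data flow, not a trained model.
ffn_output = embeddings  # start again from the input embeddings
for _ in range(12):
    q_proj = torch.nn.Linear(768, 768)
    k_proj = torch.nn.Linear(768, 768)
    v_proj = torch.nn.Linear(768, 768)
    norm1, norm2 = torch.nn.LayerNorm(768), torch.nn.LayerNorm(768)
    layer_ffn = torch.nn.Sequential(torch.nn.Linear(768, 3072), torch.nn.ReLU(), torch.nn.Linear(3072, 768))
    attention_output = scaled_dot_product_attention(q_proj(ffn_output), k_proj(ffn_output), v_proj(ffn_output))
    normalized_output = norm1(ffn_output + attention_output)               # Add & Norm after attention
    ffn_output = norm2(normalized_output + layer_ffn(normalized_output))   # Add & Norm after the FFN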
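
In practice the stacking is usually delegated to PyTorch's built-in modules rather than written as a manual loop. The sketch below assumes CLIP-like sizes (width 512, 8 heads, 12 layers) and uses random weights; it shows the stacking pattern only and is not CLIP's own implementation.

```python
import torch

torch.manual_seed(0)
width, heads, layers = 512, 8, 12   # assumed CLIP-like configuration

# One encoder layer: multi-head self-attention + feed-forward, each with Add & Norm
block = torch.nn.TransformerEncoderLayer(
    d_model=width, nhead=heads, dim_feedforward=4 * width,
    activation="gelu", batch_first=True,
)
# TransformerEncoder clones the layer, so every copy has its own parameters
encoder = torch.nn.TransformerEncoder(block, num_layers=layers)
encoder.eval()

x = torch.randn(1, 9, width)        # stand-in embeddings: 1 sentence, 9 tokens
with torch.no_grad():
    hidden = encoder(x)
num_blocks = len(encoder.layers)
print(hidden.shape, num_blocks)
```

The output keeps the input shape: every layer maps a sequence of token vectors to a sequence of the same length and width.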

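2.4 Output text embedding

End-of-Text Token Output: CLIP's vocabulary has no [CLS] token. After the last Transformer encoder layer and a final layer normalization, CLIP takes the hidden state at the <|endoftext|> position as the representation of the whole text and linearly projects it into the joint image-text embedding space. This vector is treated as a summary of the semantics of the entire sentence.

# The GPT-2 tokenizer above added no special tokens, so the last token
# stands in for CLIP's <|endoftext|> position
sentence_embedding = ffn_output[:, -1, :]  # take the last token's embedding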

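Normalization: the final sentence embedding is typically L2-normalized, so that its dot product with a normalized image feature equals their cosine similarity and the representations stay on a consistent scale in the embedding space.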

sentence_embedding = torch.nn.functional.normalize(sentence_embedding, p=2, dim=1)
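
To illustrate why normalization matters downstream, the sketch below matches a region feature against a few normalized text embeddings by cosine similarity. All vectors are random stand-ins, and the region feature is constructed to lie near the "dog" embedding; the class names are made up for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
class_names = ["cat", "dog", "mat"]                        # made-up vocabulary
text_emb = F.normalize(torch.randn(3, 512), p=2, dim=1)    # stand-in text embeddings

# A made-up region feature that lies close to the "dog" text embedding
region_feat = text_emb[1] + 0.01 * torch.randn(512)
region_emb = F.normalize(region_feat.unsqueeze(0), p=2, dim=1)

# With unit-length vectors, the dot product is the cosine similarity
similarity = region_emb @ text_emb.t()                     # shape (1, 3)
best = class_names[similarity.argmax(dim=1).item()]
norms = text_emb.norm(dim=1)
print(best, similarity)
```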

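Through these steps, the text encoder converts the input text into a vector in a high-dimensional space. The vector captures the semantics of the text and can be used in downstream tasks, such as comparison with image features.
[Figure: Transformer block, showing how a TransformerBlock is built]
This completes the full process of turning a piece of text into an embedding vector. (If you spot any mistakes, corrections are welcome. Thank you!)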