1. Principle
Paper: Learning Transferable Visual Models From Natural Language Supervision (arxiv.org)
Introduction:
In NLP, pre-training gives models very strong zero-shot transfer to downstream tasks: they can be used after light fine-tuning, or sometimes with none at all, GPT-3 being the prime example. Computer vision, by contrast, still relies on crowd-labeled datasets such as ImageNet, i.e. data with large amounts of human annotation. NLP pre-training methods are mostly label-free: BERT masks some tokens, GPT predicts the next token. In essence, both hide part of the token sequence from the network and train it to predict the hidden tokens. The difference is that GPT always predicts the next token with unidirectional attention, while BERT reconstructs randomly masked tokens in the middle of the sequence with bidirectional attention. The "labels" are simply the masked tokens or the next token to predict, which are already known, so this can be viewed as self-supervised learning. Self-supervised pre-trained large models work so well largely because they need no human annotation, which allows the training data to be enormous. In CV, however, the SOTA is still supervised learning, and human-annotated datasets are hard to scale because labeling is too expensive. The authors set out to build a SOTA model for CV without manual labels, supervised instead by natural language.
Approach:
Use natural language supervision.
In brief: encode each text and each image in a batch of text-image pairs with their respective encoders, normalize the features to a common scale, and multiply the two feature matrices to compute pairwise similarities. The underlying measure is cosine similarity: cosine_similarity(A, B) = (A · B) / (||A|| * ||B||), i.e. the similarity of two vectors is their dot product divided by the product of their norms; since the features have already been normalized here, the division by the norms is effectively a division by 1 and can be skipped. The labels are a 1-D array of length n containing the consecutive integers 0 to n-1; they indicate, for each sample, which position counts as the match when computing the loss. This multimodal embedding task uses a symmetric loss that treats similarity the same way along both dimensions. Concretely, each element logits[i, j] of the logits matrix represents the similarity between sample i and sample j, and the matching text-image pairs lie on the diagonal, so the label for row i is simply i. When the loss is computed, logits[i, i] is compared against label i, which makes the loss symmetric and ensures the diagonal elements are the ones being optimized. In short, CLIP's label design guarantees the symmetry of the loss and pairs each sample with its own counterpart for computation and optimization.
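To make this concrete, here is a minimal sketch of the symmetric loss described above, following the pseudocode in the CLIP paper (names are illustrative; image_features and text_features stand for an already-encoded batch of n pairs, and logit_scale is the paper's learned temperature):

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale=100.0):
    # L2-normalize so the dot products below are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # n x n similarity matrix: logits[i, j] = similarity(image i, text j)
    logits = logit_scale * image_features @ text_features.t()
    # the matching pairs sit on the diagonal, hence labels = [0, 1, ..., n-1]
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2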
2. Implementation
We follow the examples in Hugging Face's open-source transformers repository.
transformers/examples/pytorch/contrastive-image-text at main · huggingface/transformers (github.com)
2.1 Downloading the dataset
mkdir data
cd data
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/annotations/image_info_test2017.zip
cd ..
You can also download the files directly from the website, or use a download manager such as Thunder (Xunlei), which is faster.
2.2 Loading the data
import os
import datasets

# point the dataset script at the data/ directory created above
COCO_DIR = os.path.join(os.getcwd(), "data")
ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR)
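As a quick sanity check, you can inspect the loaded splits and one example (illustrative; the column names image_path and caption come from the dataset script and match the training flags below):

print(ds)              # DatasetDict with train / validation / test splits
print(ds["train"][0])  # fields include image_path and caption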
2.3 Fine-tuning
--output_dir ./clip-roberta-finetuned            # output path
--model_name_or_path ./clip-roberta
--data_dir D:\data\coco                          # path to the data downloaded above
--dataset_name ydshieh/coco_dataset_script
--dataset_config_name=2017
--image_column image_path
--caption_column caption
--remove_unused_columns=False
--do_train
--do_eval
--per_device_train_batch_size="64"               # batch size; reduce this if your GPU is small
--per_device_eval_batch_size="64"
--learning_rate="5e-5"
--warmup_steps="0"
--weight_decay 0.1
--overwrite_output_dir 1
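Putting the flags together, the full launch would look roughly like this (a sketch: run_clip.py is the training script in the linked example directory, and the paths must match your own setup):

python run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir ./data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train --do_eval \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir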
RoBERTa makes a few adjustments on top of BERT: 1) longer training, larger batch size, more training data; 2) removal of the next-sentence-prediction (NSP) loss; 3) longer training sequences; 4) dynamic masking; 5) byte-level BPE. In the paper's words: "RoBERTa is trained with dynamic masking (Section 4.1), FULL-SENTENCES without NSP loss (Section 4.2), large mini-batches (Section 4.3) and a larger byte-level BPE (Section 4.4)."
The CLIP variant fine-tuned here (clip-roberta) uses RoBERTa as its text encoder.
2.4 Source code walkthrough
2.4.1 The official CLIP source code, model.py: the CLIP model
openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image (github.com)
CLIP model = image encoder + text encoder + normalization + similarity computation + cross-entropy loss
Image encoder implementation
Let's look at the image encoder first.
Either a ResNet variant or a ViT can be chosen as the image encoder.
I mainly looked at the ViT path: the first layer, conv1, does the patch partitioning, splitting the 3×224×224 input into 32×32 patches.
Then come the class token and the positional embedding (note the class token is concatenated with cat, so the sequence length grows by 1), LayerNorm, and the core Transformer class.
The core of this Transformer class is ResidualAttentionBlock,
a standard pre-norm Transformer encoder block:
LayerNorm --> multi-head attention --> residual add, then LayerNorm --> MLP (FFN) --> residual add
Call chain of the image encoder classes:
VisionTransformer --> Transformer --> ResidualAttentionBlock --> nn.MultiheadAttention
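The block itself is short; a condensed sketch, simplified from the official repo's model.py (QuickGELU and attention masking omitted):

import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),  # c_fc
            nn.GELU(),                        # the original uses QuickGELU
            nn.Linear(d_model * 4, d_model),  # c_proj
        )
        self.ln_2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pre-norm residual attention, then pre-norm residual MLP
        y = self.ln_1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))
        return x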
Text encoder implementation
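The official repo's text encoder reuses the same Transformer stack, but with a causal attention mask, and pools the feature at the EOT token. A condensed functional sketch of CLIP.encode_text (dtype casts omitted; the arguments stand for the model's submodules):

import torch

def encode_text(text, token_embedding, positional_embedding,
                transformer, ln_final, text_projection):
    # text: [batch, n_ctx] token ids; the EOT token has the highest id in each row
    x = token_embedding(text) + positional_embedding
    x = x.permute(1, 0, 2)   # NLD -> LND, the nn.MultiheadAttention convention
    x = transformer(x)       # same ResidualAttentionBlock stack, causal mask inside
    x = x.permute(1, 0, 2)
    x = ln_final(x)
    # take the features at each sequence's EOT token, then project
    return x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ text_projection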
2.4.2 The implementation in the transformers package
transformers/examples/pytorch/contrastive-image-text at main · huggingface/transformers (github.com)
Model loading goes through from_pretrained, which reads the configuration file from the given path and then initializes the model from that configuration (model.from_pretrained).
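For reference, the clip-roberta starting checkpoint used above can be assembled the way the example's README describes (a sketch; model names as in the README):

from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# glue a CLIP vision tower and a RoBERTa text tower into one dual encoder
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32", "roberta-base"
)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

model.save_pretrained("clip-roberta")
processor.save_pretrained("clip-roberta")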
Vision model
Image encoder: the visual model, whose core is the vision Transformer.
class CLIPVisionEmbeddings(nn.Module):
    def __init__(self, config: CLIPVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size
        self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            bias=False,
        )
        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches + 1
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
        """Three embeddings: cat[cls, patch] + position."""
        batch_size = pixel_values.shape[0]
        target_dtype = self.patch_embedding.weight.dtype
        patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings
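With the default CLIPVisionConfig (image_size=224, patch_size=32, hidden_size=768), a quick shape check confirms the cat-plus-position logic (a sketch; the internal module path may differ across transformers versions):

import torch
from transformers import CLIPVisionConfig
from transformers.models.clip.modeling_clip import CLIPVisionEmbeddings

emb = CLIPVisionEmbeddings(CLIPVisionConfig())
out = emb(torch.randn(2, 3, 224, 224))
# (224 / 32)^2 = 49 patches, + 1 class token = 50 positions
print(out.shape)  # torch.Size([2, 50, 768])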
class CLIPEncoder(nn.Module):
    """
    Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
    [`CLIPEncoderLayer`].

    Args:
        config: CLIPConfig
    """

    def __init__(self, config: CLIPConfig):
        super().__init__()
        self.config = config
        self.layers = nn.ModuleList([CLIPEncoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.gradient_checkpointing = False

    def forward(
        self,
        inputs_embeds,
        attention_mask: Optional[torch.Tensor] = None,
        causal_attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutput]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        encoder_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None

        hidden_states = inputs_embeds
        for idx, encoder_layer in enumerate(self.layers):
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    encoder_layer.__call__,
                    hidden_states,
                    attention_mask,
                    causal_attention_mask,
                    output_attentions,
                )
            else:
                layer_outputs = encoder_layer(
                    hidden_states,
                    attention_mask,
                    causal_attention_mask,
                    output_attentions=output_attentions,
                )
            hidden_states = layer_outputs[0]
            if output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)
        if output_hidden_states:
            encoder_states = encoder_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
        )
The main pieces are the vision embedding (CLIPVisionEmbeddings) and CLIPEncoder.
CLIPEncoder is essentially a stack of CLIPEncoderLayer modules,
and the core of each CLIPEncoderLayer is CLIPAttention plus CLIPMLP.
class CLIPMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.activation_fn = ACT2FN[config.hidden_act]
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)
        hidden_states = self.fc2(hidden_states)
        return hidden_states
class CLIPAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                f" {self.num_heads})."
            )
        self.scale = self.head_dim**-0.5
        self.dropout = config.attention_dropout

        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)

    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        causal_attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        """Input shape: Batch x Time x Channel"""
        bsz, tgt_len, embed_dim = hidden_states.size()

        # project the input features into q, k, v with linear layers
        query_states = self.q_proj(hidden_states) * self.scale
        key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
        value_states = self._shape(self.v_proj(hidden_states), -1, bsz)

        # reshape to (bsz * num_heads, seq_len, head_dim)
        proj_shape = (bsz * self.num_heads, -1, self.head_dim)
        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
        key_states = key_states.view(*proj_shape)
        value_states = value_states.view(*proj_shape)

        # attn_weights = q * k
        src_len = key_states.size(1)
        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))

        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
                f" {attn_weights.size()}"
            )

        # apply the causal_attention_mask first
        if causal_attention_mask is not None:
            if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
                    f" {causal_attention_mask.size()}"
                )
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, tgt_len, src_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        if output_attentions:
            # this operation is a bit awkward, but it's required to
            # make sure that attn_weights keeps its gradient.
            # In order to do so, attn_weights has to be reshaped
            # twice and has to be reused in the following
            attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
        else:
            attn_weights_reshaped = None

        # attn_weights --> attn_probs (dropout)
        attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)

        # attn_probs * v
        attn_output = torch.bmm(attn_probs, value_states)

        if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)

        # final output projection
        attn_output = self.out_proj(attn_output)

        return attn_output, attn_weights_reshaped
Attention = x --project--> q, k, v --> q·k (attn_weights) --> mask(attn_weights) --> softmax --> dropout (attn_probs) --> attn_probs·v (attn_output) --> linear(attn_output)
The MLP is just the feed-forward network (FFN), and the encoder stacks 12 such layers.
Summary of the structure:
attention = project q, k, v + q·k + mask + softmax + dropout + attn_probs·v + linear
clip_encoder_layer = layernorm + clip attention + residual, then layernorm + MLP (linear --> GELU --> linear) + residual
clip_encoder = 12 × clip_encoder_layer
clip_visual_transformer = clip vision embedding + layernorm + clip encoder + layernorm
Text model
Initialization goes through from_config and eventually lands in this class: RobertaModel = embedding + encoder + pooler.
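The three parts are easy to see by printing the loaded model (a small sketch using the public API):

from transformers import RobertaModel

model = RobertaModel.from_pretrained("roberta-base")
print(model.embeddings)        # token + position + token-type embeddings
print(model.encoder.layer[0])  # one of 12 RobertaLayer blocks
print(model.pooler)            # Linear + Tanh over the first (<s>) token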
class RobertaSelfAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        # (defined in the original class; splits hidden_size into attention heads)
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        mixed_query_layer = self.query(hidden_states)

        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:
            # reuse k,v, cross_attentions
            key_layer = past_key_value[0]
            value_layer = past_key_value[1]
            attention_mask = encoder_attention_mask
        elif is_cross_attention:
            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
            attention_mask = encoder_attention_mask
        elif past_key_value is not None:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))
            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
        else:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        use_cache = past_key_value is not None
        if self.is_decoder:
            past_key_value = (key_layer, value_layer)

        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]
            if use_cache:
                position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
                    -1, 1
                )
            else:
                position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in RobertaModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs
This is almost identical to standard attention, with optional extras for cross-attention, a key/value cache, and relative position embeddings.
Summary
Finally: CLIP model = visual model + text model.
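Once fine-tuning finishes, the saved checkpoint can be used for image-text matching like a regular CLIP model (a sketch; the path follows the --output_dir above, and cat.jpg is any local test image):

import torch
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

model = VisionTextDualEncoderModel.from_pretrained("./clip-roberta-finetuned")
processor = VisionTextDualEncoderProcessor.from_pretrained("./clip-roberta-finetuned")

image = Image.open("cat.jpg")
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=1))  # similarity over the two captions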
2.5 Problems encountered during training
Problem 1: matching accelerate and transformers versions
https://github.com/hiyouga/LLaMA-Factory/issues/2552
Look up the version correspondence between accelerate and transformers, then install a matching pair.
Problem 2: Hugging Face token
ValueError: Token is required (write-access action) but no token found. You need to provide a token or be logged in to Hugging Face with huggingface-cli login or huggingface_hub.login. See https://huggingface.co/settings/tokens.
Problem 3:
Since https://huggingface.co/api/models does not list this CLIP model, and we do not need to push the results to the Hub anyway, the push step (and with it the token requirement from problem 2) can simply be skipped.
Problem 5:
In the end, training ran successfully.