1. Principle
Paper: Learning Transferable Visual Models From Natural Language Supervision (arxiv.org)
Introduction:
In NLP, pre-training gives models very strong zero-shot transfer to downstream tasks: they can be used after light fine-tuning, or sometimes with none at all, GPT-3 being the prime example. Computer vision, by contrast, still relies on crowd-labeled datasets such as ImageNet, i.e. data with large amounts of human annotation. NLP pre-training methods are mostly label-free: BERT masks some tokens, GPT predicts the next token. In essence, both hide part of the token sequence from the network and train it to predict the hidden tokens. The difference is that GPT always predicts the next token with unidirectional attention, while BERT reconstructs randomly masked tokens in the middle of the sequence with bidirectional attention. The "labels" are simply the masked tokens or the next token to predict, which are already known, so this can be viewed as self-supervised learning. Self-supervised pre-trained large models work so well largely because they need no human annotation, which allows the training data to be enormous. In CV, however, the SOTA is still supervised learning, and human-annotated datasets are hard to scale because labeling is too expensive. The authors set out to build a SOTA model for CV without manual labels, supervised instead by natural language.
Approach:
Use natural language supervision.
In brief: encode each text and each image in a batch of text-image pairs with their respective encoders, normalize the features to a common scale, and multiply the two feature matrices to compute pairwise similarities. The underlying measure is cosine similarity: cosine_similarity(A, B) = (A · B) / (||A|| * ||B||), i.e. the similarity of two vectors is their dot product divided by the product of their norms; since the features have already been normalized here, the division by the norms is effectively a division by 1 and can be skipped. The labels are a 1-D array of length n containing the consecutive integers 0 to n-1; they indicate, for each sample, which position counts as the match when computing the loss. This multimodal embedding task uses a symmetric loss that treats similarity the same way along both dimensions. Concretely, each element logits[i, j] of the logits matrix represents the similarity between sample i and sample j, and the matching text-image pairs lie on the diagonal, so the label for row i is simply i. When the loss is computed, logits[i, i] is compared against label i, which makes the loss symmetric and ensures the diagonal elements are the ones being optimized. In short, CLIP's label design guarantees the symmetry of the loss and pairs each sample with its own counterpart for computation and optimization.
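To make this concrete, here is a minimal sketch of the symmetric loss described above, following the pseudocode in the CLIP paper (names are illustrative; image_features and text_features stand for an already-encoded batch of n pairs, and logit_scale is the paper's learned temperature):

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale=100.0):
    # L2-normalize so the dot products below are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # n x n similarity matrix: logits[i, j] = similarity(image i, text j)
    logits = logit_scale * image_features @ text_features.t()
    # the matching pairs sit on the diagonal, hence labels = [0, 1, ..., n-1]
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2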
2. Implementation
We follow the examples in Hugging Face's open-source transformers repository.
transformers/examples/pytorch/contrastive-image-text at main · huggingface/transformers (github.com)
2.1 Downloading the dataset
mkdir data
cd data
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/annotations/image_info_test2017.zip
cd ..
You can also download the files directly from the website, or use a download manager such as Thunder (Xunlei), which is faster.
2.2 Loading the data
import os
import datasets

# point the dataset script at the data/ directory created above
COCO_DIR = os.path.join(os.getcwd(), "data")
ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR)
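As a quick sanity check, you can inspect the loaded splits and one example (illustrative; the column names image_path and caption come from the dataset script and match the training flags below):

print(ds)              # DatasetDict with train / validation / test splits
print(ds["train"][0])  # fields include image_path and caption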
2.3 Fine-tuning
--output_dir ./clip-roberta-finetuned            # output path
--model_name_or_path ./clip-roberta
--data_dir D:\data\coco                          # path to the data downloaded above
--dataset_name ydshieh/coco_dataset_script
--dataset_config_name=2017
--image_column image_path
--caption_column caption
--remove_unused_columns=False
--do_train
--do_eval
--per_device_train_batch_size="64"               # batch size; reduce this if your GPU is small
--per_device_eval_batch_size="64"
--learning_rate="5e-5"
--warmup_steps="0"
--weight_decay 0.1
--overwrite_output_dir 1
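Putting the flags together, the full launch would look roughly like this (a sketch: run_clip.py is the training script in the linked example directory, and the paths must match your own setup):

python run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir ./data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train --do_eval \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir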
RoBERTa makes a few adjustments on top of BERT: 1) longer training, larger batch size, more training data; 2) removal of the next-sentence-prediction (NSP) loss; 3) longer training sequences; 4) dynamic masking; 5) byte-level BPE. In the paper's words: "RoBERTa is trained with dynamic masking (Section 4.1), FULL-SENTENCES without NSP loss (Section 4.2), large mini-batches (Section 4.3) and a larger byte-level BPE (Section 4.4)."
The CLIP variant fine-tuned here (clip-roberta) uses RoBERTa as its text encoder.
2.4 Source code walkthrough
2.4.1 The official CLIP source code, model.py: the CLIP model
openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image (github.com)
CLIP model = image encoder + text encoder + normalization + similarity computation + cross-entropy loss
Image encoder implementation
Let's look at the image encoder first.
Either a ResNet variant or a ViT can be chosen as the image encoder.
I mainly looked at the ViT path: the first layer, conv1, does the patch partitioning, splitting the 3×224×224 input into 32×32 patches.
Then come the class token and the positional embedding (note the class token is concatenated with cat, so the sequence length grows by 1), LayerNorm, and the core Transformer class.
The core of this Transformer class is ResidualAttentionBlock,
a standard pre-norm Transformer encoder block:
LayerNorm --> multi-head attention --> residual add, then LayerNorm --> MLP (FFN) --> residual add
Call chain of the image encoder classes:
VisionTransformer --> Transformer --> ResidualAttentionBlock --> nn.MultiheadAttention
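The block itself is short; a condensed sketch, simplified from the official repo's model.py (QuickGELU and attention masking omitted):

import torch
import torch.nn as nn

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),  # c_fc
            nn.GELU(),                        # the original uses QuickGELU
            nn.Linear(d_model * 4, d_model),  # c_proj
        )
        self.ln_2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pre-norm residual attention, then pre-norm residual MLP
        y = self.ln_1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))
        return x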
Text encoder implementation
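The official repo's text encoder reuses the same Transformer stack, but with a causal attention mask, and pools the feature at the EOT token. A condensed functional sketch of CLIP.encode_text (dtype casts omitted; the arguments stand for the model's submodules):

import torch

def encode_text(text, token_embedding, positional_embedding,
                transformer, ln_final, text_projection):
    # text: [batch, n_ctx] token ids; the EOT token has the highest id in each row
    x = token_embedding(text) + positional_embedding
    x = x.permute(1, 0, 2)   # NLD -> LND, the nn.MultiheadAttention convention
    x = transformer(x)       # same ResidualAttentionBlock stack, causal mask inside
    x = x.permute(1, 0, 2)
    x = ln_final(x)
    # take the features at each sequence's EOT token, then project
    return x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ text_projection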
2.4.2 The implementation in the transformers package
transformers/examples/pytorch/contrastive-image-text at main · huggingface/transformers (github.com)
Model loading goes through from_pretrained, which reads the configuration file from the given path and then initializes the model from that configuration (model.from_pretrained).
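For reference, the clip-roberta starting checkpoint used above can be assembled the way the example's README describes (a sketch; model names as in the README):

from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# glue a CLIP vision tower and a RoBERTa text tower into one dual encoder
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32", "roberta-base"
)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

model.save_pretrained("clip-roberta")
processor.save_pretrained("clip-roberta")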
Vision model
Image encoder: the visual model, whose core is the vision Transformer.
class CLIPVisionEmbeddings(nn.Module):
    def __init__(self, config: CLIPVisionConfig):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.image_size = config.image_size
        self.patch_size = config.patch_size
        self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))
        self.patch_embedding = nn.Conv2d(
            in_channels=config.num_channels,
            out_channels=self.embed_dim,
            kernel_size=self.patch_size,
            stride=self.patch_size,
            bias=False,
        )
        self.num_patches = (self.image_size // self.patch_size) ** 2
        self.num_positions = self.num_patches + 1
        self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
        self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)

    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
        """Three embeddings: cat[cls, patch] + position."""
        batch_size = pixel_values.shape[0]
        target_dtype = self.patch_embedding.weight.dtype
        patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
        patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
        class_embeds = self.class_embedding.expand(batch_size, 1, -1)
        embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
        embeddings = embeddings + self.position_embedding(self.position_ids)
        return embeddings
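With the default CLIPVisionConfig (image_size=224, patch_size=32, hidden_size=768), a quick shape check confirms the cat-plus-position logic (a sketch; the internal module path may differ across transformers versions):

import torch
from transformers import CLIPVisionConfig
from transformers.models.clip.modeling_clip import CLIPVisionEmbeddings

emb = CLIPVisionEmbeddings(CLIPVisionConfig())
out = emb(torch.randn(2, 3, 224, 224))
# (224 / 32)^2 = 49 patches, + 1 class token = 50 positions
print(out.shape)  # torch.Size([2, 50, 768])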
class CLIPEncoder(nn.Module):
    """
    Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
    [`CLIPEncoderLayer`].

    Args:
        config: CLIPConfig
    """

    def __init__(self, config: CLIPConfig):
        super().__init__()
        self.config = config
        self.layers = nn.ModuleList([CLIPEncoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.gradient_checkpointing = False

    def forward(
        self,
        inputs_embeds,
        attention_mask: Optional[torch.Tensor] = None,
        causal_attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutput]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        encoder_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None

        hidden_states = inputs_embeds
        for idx, encoder_layer in enumerate(self.layers):
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    encoder_layer.__call__,
                    hidden_states,
                    attention_mask,
                    causal_attention_mask,
                    output_attentions,
                )
            else:
                layer_outputs = encoder_layer(
                    hidden_states,
                    attention_mask,
                    causal_attention_mask,
                    output_attentions=output_attentions,
                )
            hidden_states = layer_outputs[0]
            if output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)
        if output_hidden_states:
            encoder_states = encoder_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
        )
The main pieces are the vision embedding (CLIPVisionEmbeddings) and CLIPEncoder.
CLIPEncoder is essentially a stack of CLIPEncoderLayer modules,
and the core of each CLIPEncoderLayer is CLIPAttention plus CLIPMLP.
class CLIPMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.activation_fn = ACT2FN[config.hidden_act]
        self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.fc1(hidden_states)
        hidden_states = self.activation_fn(hidden_states)
        hidden_states = self.fc2(hidden_states)
        return hidden_states
class CLIPAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embed_dim = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                f" {self.num_heads})."
            )
        self.scale = self.head_dim**-0.5
        self.dropout = config.attention_dropout

        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)

    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        causal_attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        """Input shape: Batch x Time x Channel"""
        bsz, tgt_len, embed_dim = hidden_states.size()

        # project the input features into q, k, v with linear layers
        query_states = self.q_proj(hidden_states) * self.scale
        key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
        value_states = self._shape(self.v_proj(hidden_states), -1, bsz)

        # reshape to (bsz * num_heads, seq_len, head_dim)
        proj_shape = (bsz * self.num_heads, -1, self.head_dim)
        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
        key_states = key_states.view(*proj_shape)
        value_states = value_states.view(*proj_shape)

        # attn_weights = q * k
        src_len = key_states.size(1)
        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))

        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
                f" {attn_weights.size()}"
            )

        # apply the causal_attention_mask first
        if causal_attention_mask is not None:
            if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
                    f" {causal_attention_mask.size()}"
                )
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, tgt_len, src_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        if output_attentions:
            # this operation is a bit awkward, but it's required to
            # make sure that attn_weights keeps its gradient.
            # In order to do so, attn_weights has to be reshaped
            # twice and has to be reused in the following
            attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
        else:
            attn_weights_reshaped = None

        # attn_weights --> attn_probs (dropout)
        attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)

        # attn_probs * v
        attn_output = torch.bmm(attn_probs, value_states)

        if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)

        # final output projection
        attn_output = self.out_proj(attn_output)

        return attn_output, attn_weights_reshaped
Attention = x --project--> q, k, v --> q·k (attn_weights) --> mask(attn_weights) --> softmax --> dropout (attn_probs) --> attn_probs·v (attn_output) --> linear(attn_output)
The MLP is just the feed-forward network (FFN), and the encoder stacks 12 such layers.
Summary of the structure:
attention = project q, k, v + q·k + mask + softmax + dropout + attn_probs·v + linear
clip_encoder_layer = layernorm + clip attention + residual, then layernorm + MLP (linear --> GELU --> linear) + residual
clip_encoder = 12 × clip_encoder_layer
clip_visual_transformer = clip vision embedding + layernorm + clip encoder + layernorm
Text model
Initialization goes through from_config and eventually lands in this class: RobertaModel = embedding + encoder + pooler.
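The three parts are easy to see by printing the loaded model (a small sketch using the public API):

from transformers import RobertaModel

model = RobertaModel.from_pretrained("roberta-base")
print(model.embeddings)        # token + position + token-type embeddings
print(model.encoder.layer[0])  # one of 12 RobertaLayer blocks
print(model.pooler)            # Linear + Tanh over the first (<s>) token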
class RobertaSelfAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        # (defined in the original class; splits hidden_size into attention heads)
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        mixed_query_layer = self.query(hidden_states)

        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:
            # reuse k,v, cross_attentions
            key_layer = past_key_value[0]
            value_layer = past_key_value[1]
            attention_mask = encoder_attention_mask
        elif is_cross_attention:
            key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
            value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
            attention_mask = encoder_attention_mask
        elif past_key_value is not None:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))
            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
        else:
            key_layer = self.transpose_for_scores(self.key(hidden_states))
            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        use_cache = past_key_value is not None
        if self.is_decoder:
            past_key_value = (key_layer, value_layer)

        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]
            if use_cache:
                position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
                    -1, 1
                )
            else:
                position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in RobertaModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs
This is almost identical to standard attention, with optional extras for cross-attention, a key/value cache, and relative position embeddings.
Summary
Finally: CLIP model = visual model + text model.
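Once fine-tuning finishes, the saved checkpoint can be used for image-text matching like a regular CLIP model (a sketch; the path follows the --output_dir above, and cat.jpg is any local test image):

import torch
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

model = VisionTextDualEncoderModel.from_pretrained("./clip-roberta-finetuned")
processor = VisionTextDualEncoderProcessor.from_pretrained("./clip-roberta-finetuned")

image = Image.open("cat.jpg")
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=1))  # similarity over the two captions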
2.5 Problems encountered during training
Problem 1: matching accelerate and transformers versions
https://github.com/hiyouga/LLaMA-Factory/issues/2552
Look up the version correspondence between accelerate and transformers, then install a matching pair.
Problem 2: Hugging Face token
ValueError: Token is required (write-access action) but no token found. You need to provide a token or be logged in to Hugging Face with huggingface-cli login or huggingface_hub.login. See https://huggingface.co/settings/tokens.
Problem 3:
Since https://huggingface.co/api/models does not list this CLIP model, and we do not need to push the results to the Hub anyway, the push step (and with it the token requirement from problem 2) can simply be skipped.
Problem 5:
In the end, training ran successfully.