A detailed walkthrough of the model code in a-PyTorch-Tutorial-to-Image-Captioning, a PyTorch reimplementation of the Show, Attend and Tell paper

Latest recommended article published on 2024-03-23 09:33:55

冰岛小贤

Tags: pytorch, deep learning, natural language processing, neural networks

Article link: https://blog.csdn.net/weixin_51666355/article/details/130539734

Code download: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning

For setting up and debugging the code, see the CSDN blog post "show attend and tell代码实现（绝对详细）" by 饿了就干饭 (highly recommended).

This article walks through the model file (models.py in the repository) in detail.

Encoder

Attention mechanism

Decoder with attention

Encoder

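import torch
from torch import nn
import torchvision

# Device used by the decoder when it allocates the prediction and alpha tensors
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Encoder(nn.Module):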
    """
    Encoder.
    """

    def __init__(self, encoded_image_size=14):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size

        resnet = torchvision.models.resnet101(pretrained=True)  # pretrained ImageNet ResNet-101

        # Remove linear and pool layers (since we're not doing classification)
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)

        # Resize image to fixed size to allow input images of variable size
        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

        self.fine_tune()


    def forward(self, images):
        """
        Forward propagation.

        :param images: images, a tensor of dimensions (batch_size, 3, image_size, image_size)
        :return: encoded images
        """
        out = self.resnet(images)  # (batch_size, 2048, image_size/32, image_size/32)
        out = self.adaptive_pool(out)  # (batch_size, 2048, encoded_image_size, encoded_image_size)
        out = out.permute(0, 2, 3, 1)  # (batch_size, encoded_image_size, encoded_image_size, 2048)
        return out

    def fine_tune(self, fine_tune=True):
        """
        Allow or prevent the computation of gradients for convolutional blocks 2 through 4 of the encoder.

        :param fine_tune: Allow?
        """
        for p in self.resnet.parameters():
            p.requires_grad = False
        # If fine-tuning, only fine-tune convolutional blocks 2 through 4
        for c in list(self.resnet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune

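This code defines the Encoder class, a neural network module that encodes an input image into a grid of feature vectors: by default a 14 x 14 map with 2048 channels.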

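The constructor takes an encoded_image_size parameter, which sets the spatial size of the encoded image and defaults to 14. It loads an ImageNet-pretrained ResNet-101 through torchvision.models.resnet101 and drops its last two children (the global average-pooling layer and the fully connected layer), since those only serve classification. The remaining layers are wrapped in nn.Sequential as self.resnet and are followed by an adaptive average-pooling layer (nn.AdaptiveAvgPool2d) with output size (encoded_image_size, encoded_image_size), so every image yields an encoded_image_size x encoded_image_size grid of 2048-dimensional features regardless of its input resolution. Finally, the constructor calls fine_tune(), which with its default argument freezes the early layers of the ResNet and leaves convolutional blocks 2 through 4 trainable.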

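The forward() method of Encoder runs the network defined above on a batch of images. Its input, images, is a tensor of shape (batch_size, 3, image_size, image_size). Calling self.resnet(images) extracts features and produces a tensor of shape (batch_size, 2048, image_size/32, image_size/32): a 2048-channel feature map whose spatial resolution is 32 times smaller than the input, not a single 2048-dimensional vector. self.adaptive_pool(out) then resizes the map to (batch_size, 2048, encoded_image_size, encoded_image_size). Finally, out.permute(0, 2, 3, 1) moves the channel dimension to the end, giving (batch_size, encoded_image_size, encoded_image_size, 2048). With channels last, the decoder can flatten the two spatial dimensions into a sequence of num_pixels feature vectors, one per image location. The method returns this tensor.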
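To make the shape bookkeeping concrete, the following minimal sketch traces a dummy batch through the same pooling and permute steps. To avoid downloading ResNet-101 weights, a single strided convolution (`backbone_stub`, a hypothetical stand-in that is not part of the tutorial code) plays the role of the truncated backbone; it only reproduces the 32x downsampling and the 2048 output channels. The 256 x 256 input size is an assumption chosen for illustration.

```python
import torch
from torch import nn

encoded_image_size = 14

# Hypothetical stand-in for the truncated ResNet-101:
# 3 -> 2048 channels with 32x spatial downsampling.
backbone_stub = nn.Conv2d(3, 2048, kernel_size=32, stride=32)
adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

images = torch.randn(2, 3, 256, 256)  # (batch_size, 3, image_size, image_size)

with torch.no_grad():
    feats = backbone_stub(images)      # (2, 2048, 8, 8), since 256 / 32 = 8
    pooled = adaptive_pool(feats)      # (2, 2048, 14, 14)
    out = pooled.permute(0, 2, 3, 1)   # (2, 14, 14, 2048), channels last

print(feats.shape, pooled.shape, out.shape)
```

Note that the adaptive pooling layer fixes the output grid at 14 x 14 whatever the backbone resolution is; in this sketch it resizes an 8 x 8 map up to 14 x 14.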

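The fine_tune() method controls which parts of the ResNet-101 receive gradients. It first loops over all parameters of self.resnet and sets requires_grad to False, which freezes the whole network. It then loops over the children of self.resnet from index 5 onward, which correspond to convolutional blocks 2 through 4, and sets requires_grad on their parameters to the value of fine_tune. With fine_tune=True, those blocks are updated during backpropagation while the earlier layers stay frozen. With fine_tune=False, every layer of the ResNet-101 remains frozen and none of its weights are updated.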
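The freeze-then-selectively-unfreeze pattern can be checked on a toy model. The sketch below assumes a stand-in `toy` network with eight children, mirroring the eight top-level children of the truncated ResNet (conv1, bn1, relu, maxpool, layer1, layer2, layer3, layer4); `trainable_children` is a hypothetical helper added only for inspection.

```python
from torch import nn

# Toy stand-in with 8 children, one per top-level child of the truncated ResNet.
toy = nn.Sequential(*[nn.Linear(4, 4) for _ in range(8)])

def fine_tune(model, fine_tune=True):
    # Freeze every parameter first.
    for p in model.parameters():
        p.requires_grad = False
    # Then set requires_grad on children 5, 6, 7 only.
    for c in list(model.children())[5:]:
        for p in c.parameters():
            p.requires_grad = fine_tune

def trainable_children(model):
    # Indices of children whose parameters all require gradients.
    return [i for i, c in enumerate(model.children())
            if all(p.requires_grad for p in c.parameters())]

fine_tune(toy, True)
after_true = trainable_children(toy)
fine_tune(toy, False)
after_false = trainable_children(toy)

print(after_true)   # [5, 6, 7]
print(after_false)  # []
```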

Attention mechanism

class Attention(nn.Module):
    """
    Attention Network.
    """

    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        """
        :param encoder_dim: feature size of encoded images
        :param decoder_dim: size of decoder's RNN
        :param attention_dim: size of the attention network
        """
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # linear layer to transform encoded image
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # linear layer to transform decoder's output
        self.full_att = nn.Linear(attention_dim, 1)  # linear layer to calculate values to be softmax-ed
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)  # softmax layer to calculate weights

    def forward(self, encoder_out, decoder_hidden):
        """
        Forward propagation.

        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :param decoder_hidden: previous decoder output, a tensor of dimension (batch_size, decoder_dim)
        :return: attention weighted encoding, weights
        """
        att1 = self.encoder_att(encoder_out)  # (batch_size, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)  # (batch_size, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1))).squeeze(2)  # (batch_size, num_pixels)
        alpha = self.softmax(att)  # (batch_size, num_pixels)
        attention_weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (batch_size, encoder_dim)

        return attention_weighted_encoding, alpha

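The Attention class computes, for each decoding step, a set of weights over the image locations given the decoder's previous hidden state, and returns the weighted sum of the image features. Its constructor __init__() defines the layers of this attention network.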

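The constructor defines three linear layers. self.encoder_att transforms the encoded image features; encoder_dim is the size of each feature vector and attention_dim is the size it is projected to. Similarly, self.decoder_att transforms the decoder RNN's output from decoder_dim to attention_dim. self.full_att maps each location's attention_dim vector to a single scalar score; these scores are what the softmax later normalizes into weights. The constructor also defines a ReLU layer, which adds a non-linearity to the combined projections, and a softmax layer over dim=1 (the pixel dimension), which turns the scores into attention weights that sum to 1 for each image.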

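The forward() method of Attention computes the attention weights and the attention-weighted encoding. Its inputs are encoder_out, the encoded image of shape (batch_size, num_pixels, encoder_dim), and decoder_hidden, the previous decoder output of shape (batch_size, decoder_dim). Both are first projected linearly, giving att1 of shape (batch_size, num_pixels, attention_dim) and att2 of shape (batch_size, attention_dim). att2.unsqueeze(1) adds a pixel dimension so that att2 broadcasts across all locations when added to att1. ReLU is applied to the sum (not to att1 and att2 separately), self.full_att reduces the last dimension to 1, and squeeze(2) removes it, leaving unnormalized scores of shape (batch_size, num_pixels). Softmax over these scores gives the weight tensor alpha, also of shape (batch_size, num_pixels). Finally, encoder_out is multiplied element-wise by alpha.unsqueeze(2) (a broadcast product, not a matrix multiplication) and summed over the pixel dimension, which yields attention_weighted_encoding of shape (batch_size, encoder_dim). The method returns attention_weighted_encoding and alpha.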
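The sequence of shapes is easier to follow on a small random example. This is a minimal sketch that re-creates the three linear layers with arbitrary small dimensions (all sizes below are illustrative assumptions, not the tutorial's settings) and repeats the forward computation step by step.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_pixels = 2, 4
encoder_dim, decoder_dim, attention_dim = 8, 6, 5  # arbitrary small sizes

encoder_att = nn.Linear(encoder_dim, attention_dim)
decoder_att = nn.Linear(decoder_dim, attention_dim)
full_att = nn.Linear(attention_dim, 1)

encoder_out = torch.randn(batch_size, num_pixels, encoder_dim)
decoder_hidden = torch.randn(batch_size, decoder_dim)

att1 = encoder_att(encoder_out)                         # (2, 4, 5)
att2 = decoder_att(decoder_hidden).unsqueeze(1)         # (2, 1, 5), broadcasts over pixels
scores = full_att(torch.relu(att1 + att2)).squeeze(2)   # (2, 4), unnormalized
alpha = torch.softmax(scores, dim=1)                    # (2, 4), each row sums to 1
context = (encoder_out * alpha.unsqueeze(2)).sum(dim=1) # (2, 8)

print(scores.shape, alpha.shape, context.shape)
print(alpha.sum(dim=1))  # approximately 1 for every image
```

The broadcast-and-sum in the last step is a per-image weighted average of the pixel features, so each element of `context` lies between the minimum and maximum of the corresponding feature across pixels.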

Decoder with attention

class DecoderWithAttention(nn.Module):
    """
    Decoder.
    """

    def __init__(self, attention_dim, embed_dim, decoder_dim, vocab_size, encoder_dim=2048, dropout=0.5):
        """
        :param attention_dim: size of attention network
        :param embed_dim: embedding size
        :param decoder_dim: size of decoder's RNN
        :param vocab_size: size of vocabulary
        :param encoder_dim: feature size of encoded images
        :param dropout: dropout
        """
        super(DecoderWithAttention, self).__init__()

        self.encoder_dim = encoder_dim
        self.attention_dim = attention_dim
        self.embed_dim = embed_dim
        self.decoder_dim = decoder_dim
        self.vocab_size = vocab_size
        self.dropout = dropout

        self.attention = Attention(encoder_dim, decoder_dim, attention_dim)  # attention network

        self.embedding = nn.Embedding(vocab_size, embed_dim)  # embedding layer
        self.dropout = nn.Dropout(p=self.dropout)
        self.decode_step = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim, bias=True)  # decoding LSTMCell
        self.init_h = nn.Linear(encoder_dim, decoder_dim)  # linear layer to find initial hidden state of LSTMCell
        self.init_c = nn.Linear(encoder_dim, decoder_dim)  # linear layer to find initial cell state of LSTMCell
        self.f_beta = nn.Linear(decoder_dim, encoder_dim)  # linear layer to create a sigmoid-activated gate
        self.sigmoid = nn.Sigmoid()
        self.fc = nn.Linear(decoder_dim, vocab_size)  # linear layer to find scores over vocabulary
        self.init_weights()  # initialize some layers with the uniform distribution

    def init_weights(self):
        """
        Initializes some parameters with values from the uniform distribution, for easier convergence.
        """
        self.embedding.weight.data.uniform_(-0.1, 0.1)
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-0.1, 0.1)

    def load_weights_embeddings(self, embeddings):
        """
        Loads embedding layer with pre-trained embeddings.

        :param embeddings: pre-trained embeddings
        """
        self.embedding.weight = nn.Parameter(embeddings)

    def fine_tune_embeddings(self, fine_tune=True):
        """
        Allow fine-tuning of embedding layer? (Only makes sense to not-allow if using pre-trained embeddings).

        :param fine_tune: Allow?
        """
        for p in self.embedding.parameters():
            p.requires_grad = fine_tune

    def init_hidden_state(self, encoder_out):
        """
        Creates the initial hidden and cell states for the decoder's LSTM based on the encoded images.

        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :return: hidden state, cell state
        """
        mean_encoder_out = encoder_out.mean(dim=1)
        h = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
        c = self.init_c(mean_encoder_out)
        return h, c

    def forward(self, encoder_out, encoded_captions, caption_lengths):
        """
        Forward propagation.

        :param encoder_out: encoded images, a tensor of dimension (batch_size, enc_image_size, enc_image_size, encoder_dim)
        :param encoded_captions: encoded captions, a tensor of dimension (batch_size, max_caption_length)
        :param caption_lengths: caption lengths, a tensor of dimension (batch_size, 1)
        :return: scores for vocabulary, sorted encoded captions, decode lengths, weights, sort indices
        """

        batch_size = encoder_out.size(0)
        encoder_dim = encoder_out.size(-1)
        vocab_size = self.vocab_size

        # Flatten image
        encoder_out = encoder_out.view(batch_size, -1, encoder_dim)  # (batch_size, num_pixels, encoder_dim)
        num_pixels = encoder_out.size(1)

        # Sort input data by decreasing lengths; why? apparent below
        caption_lengths, sort_ind = caption_lengths.squeeze(1).sort(dim=0, descending=True)
        encoder_out = encoder_out[sort_ind]
        encoded_captions = encoded_captions[sort_ind]

        # Embedding
        embeddings = self.embedding(encoded_captions)  # (batch_size, max_caption_length, embed_dim)

        # Initialize LSTM state
        h, c = self.init_hidden_state(encoder_out)  # (batch_size, decoder_dim)

        # We won't decode at the <end> position, since we've finished generating as soon as we generate <end>
        # So, decoding lengths are actual lengths - 1
        decode_lengths = (caption_lengths - 1).tolist()

        # Create tensors to hold word prediction scores and alphas
        predictions = torch.zeros(batch_size, max(decode_lengths), vocab_size).to(device)
        alphas = torch.zeros(batch_size, max(decode_lengths), num_pixels).to(device)

        # At each time-step, decode by
        # attention-weighing the encoder's output based on the decoder's previous hidden state output
        # then generate a new word in the decoder with the previous word and the attention weighted encoding
        for t in range(max(decode_lengths)):
            batch_size_t = sum([l > t for l in decode_lengths])
            attention_weighted_encoding, alpha = self.attention(encoder_out[:batch_size_t],
                                                                h[:batch_size_t])
            gate = self.sigmoid(self.f_beta(h[:batch_size_t]))  # gating scalar, (batch_size_t, encoder_dim)
            attention_weighted_encoding = gate * attention_weighted_encoding
            h, c = self.decode_step(
                torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),
                (h[:batch_size_t], c[:batch_size_t]))  # (batch_size_t, decoder_dim)
            preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
            predictions[:batch_size_t, t, :] = preds
            alphas[:batch_size_t, t, :] = alpha

        return predictions, encoded_captions, decode_lengths, alphas, sort_ind

The constructor __init__() of DecoderWithAttention defines a decoder network with an attention mechanism.

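The constructor first stores the hyperparameters used to build the model: encoder_dim is the size of the encoded image feature vectors, attention_dim is the hidden size of the attention network, embed_dim is the size of the word embeddings, decoder_dim is the size of the decoder LSTM's state, vocab_size is the size of the vocabulary, and dropout is the dropout rate. It then creates an Attention instance as self.attention. Next come the embedding layer (nn.Embedding), the dropout layer (nn.Dropout, which replaces the rate previously stored in self.dropout), the decoding LSTM cell (nn.LSTMCell, whose input is the word embedding concatenated with the attention-weighted encoding), and four linear layers: init_h and init_c compute the LSTM's initial hidden state and cell state, f_beta produces a sigmoid-activated gate that scales the attention-weighted encoding, and fc produces the scores over the vocabulary. Finally, the constructor calls init_weights(), which re-initializes only the embedding layer and fc; the LSTM cell and the other linear layers keep PyTorch's default initialization.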

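The init_weights() method initializes some of the model's parameters. It fills the weights of the embedding layer (self.embedding) and of the output layer (self.fc) with values drawn from a uniform distribution on the interval [-0.1, 0.1]. This distribution has mean 0; its variance is 0.2^2 / 12, about 0.0033, not 0.1. The bias of self.fc is set to zero. According to the docstring, this initialization is intended to make convergence easier.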
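The statistics of this initialization can be verified empirically. The sketch below applies the same `uniform_(-0.1, 0.1)` call to a standalone embedding layer (the 1000 x 64 size is an arbitrary assumption) and compares the sample mean and variance with the theoretical values for a uniform distribution on [-0.1, 0.1].

```python
import torch
from torch import nn

torch.manual_seed(0)
embedding = nn.Embedding(1000, 64)         # arbitrary size for illustration
embedding.weight.data.uniform_(-0.1, 0.1)  # same call as in init_weights()

w = embedding.weight.data
max_abs = w.abs().max().item()
mean = w.mean().item()
var = w.var().item()
expected_var = (0.1 - (-0.1)) ** 2 / 12    # variance of U(a, b) is (b - a)^2 / 12

print(max_abs)            # no larger than 0.1 (up to float32 rounding)
print(mean)               # close to 0
print(var, expected_var)  # both close to 0.0033
```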

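The load_weights_embeddings() method loads pre-trained embeddings into the embedding layer. The embedding layer is the first layer the caption tokens pass through: it maps each word index to a dense vector that the rest of the network works with. The method takes a tensor of pre-trained embeddings, which must have shape (vocab_size, embed_dim), wraps it in nn.Parameter, and assigns it as the layer's weight. The decoder then starts from pre-trained word vectors instead of a random initialization, which can help training.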

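The fine_tune_embeddings() method controls whether the embedding layer is trained. When the embedding layer is initialized from pre-trained word vectors, it is common to freeze it: those vectors already capture general-purpose semantic information, and keeping them fixed can make training more stable and cheaper. If the target domain or task differs noticeably from the data the vectors were trained on, fine-tuning them may work better. The method loops over the embedding layer's parameters and sets requires_grad to the value of fine_tune: True (the default) makes the embeddings trainable, while False freezes them so that they are not updated during training. As the docstring notes, disallowing fine-tuning only makes sense when pre-trained embeddings are used.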
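Loading and freezing can be combined as in the following minimal sketch. The `pretrained` tensor is a hypothetical placeholder; in practice it would be built from a word-vector file, with one row per vocabulary entry.

```python
import torch
from torch import nn

vocab_size, embed_dim = 5, 3
embedding = nn.Embedding(vocab_size, embed_dim)

# Hypothetical pre-trained vectors, shape (vocab_size, embed_dim).
pretrained = torch.arange(vocab_size * embed_dim, dtype=torch.float32).view(vocab_size, embed_dim)

# Same assignment as load_weights_embeddings(pretrained).
embedding.weight = nn.Parameter(pretrained)

# Same loop as fine_tune_embeddings(fine_tune=False): freeze the layer.
for p in embedding.parameters():
    p.requires_grad = False

tokens = torch.tensor([[0, 4]])   # one caption with two word indices
vectors = embedding(tokens)       # (1, 2, 3): rows 0 and 4 of `pretrained`

print(vectors)
print(embedding.weight.requires_grad)  # False
```

PyTorch also offers `nn.Embedding.from_pretrained(pretrained, freeze=True)`, which builds an equivalent frozen layer in one call.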

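The init_hidden_state() method creates the initial hidden state and cell state of the decoder's LSTM cell. In image captioning, the image is first encoded into visual features, typically taken from a late convolutional layer of a vision network. Here the method averages the encoder output over all pixel locations (dim=1) to obtain mean_encoder_out of shape (batch_size, encoder_dim). It then passes this mean through two separate linear layers, init_h and init_c, to obtain the initial hidden state h and cell state c, each of shape (batch_size, decoder_dim), and returns both. The LSTM therefore starts decoding from a summary of the image rather than from zeros, which conditions the generated caption on the image from the first step.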
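A short shape check of this step, as a sketch with arbitrary small dimensions (the sizes below are assumptions for illustration only):

```python
import torch
from torch import nn

batch_size, num_pixels, encoder_dim, decoder_dim = 2, 196, 16, 8

init_h = nn.Linear(encoder_dim, decoder_dim)
init_c = nn.Linear(encoder_dim, decoder_dim)

encoder_out = torch.randn(batch_size, num_pixels, encoder_dim)

mean_encoder_out = encoder_out.mean(dim=1)  # average over pixels: (2, 16)
h = init_h(mean_encoder_out)                # initial hidden state: (2, 8)
c = init_c(mean_encoder_out)                # initial cell state:   (2, 8)

print(mean_encoder_out.shape, h.shape, c.shape)
```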

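The forward() method of DecoderWithAttention is the forward pass of the captioning model. It takes the encoder output (encoder_out), the encoded captions (encoded_captions), and the caption lengths (caption_lengths), and returns the word prediction scores (predictions), the sorted encoded captions (encoded_captions), the decode lengths (decode_lengths), the attention weights (alphas), and the sort indices (sort_ind). It begins with some preprocessing. encoder_out is reshaped to (batch_size, num_pixels, encoder_dim), where num_pixels is the number of image locations (enc_image_size x enc_image_size, 196 with the default size of 14) and is unrelated to the number of words in the caption. caption_lengths is squeezed to one dimension and sorted in descending order, and encoder_out and encoded_captions are reordered with the same indices. Because of this ordering, the captions that are still being decoded at time step t are always the first batch_size_t rows of the batch, so the loop can simply slice [:batch_size_t] and skip captions that have already ended. The sorted captions are then passed through the embedding layer, and init_hidden_state() initializes the LSTM's hidden and cell states. The decode lengths are the caption lengths minus 1, since nothing is decoded at the <end> position.

The method then iterates over the time steps up to the longest decode length. At each step it computes the attention-weighted encoding from the previous hidden state, scales it by the sigmoid gate produced by f_beta, concatenates it with the embedding of the current input word, feeds the result to the LSTM cell to update h and c, and applies dropout and fc to h to obtain scores over the vocabulary. The scores and attention weights are written into the pre-allocated predictions and alphas tensors; positions beyond a caption's decode length stay zero. Finally, the five outputs are returned. Since the batch has been reordered, callers need sort_ind to match results back to the original inputs. The predictions can then be turned into a natural-language description of the input image.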
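The effect of sorting by length on the per-step batch size can be seen without any tensors. The sketch below uses hypothetical caption lengths for a batch of three captions and reproduces the `batch_size_t` computation from the loop.

```python
# Hypothetical caption lengths, already sorted in descending order.
caption_lengths = [7, 5, 4]

# Nothing is decoded at the <end> position, so decode one step fewer per caption.
decode_lengths = [length - 1 for length in caption_lengths]  # [6, 4, 3]

batch_sizes = []
for t in range(max(decode_lengths)):
    # Number of captions that still have a word to predict at step t.
    batch_size_t = sum(length > t for length in decode_lengths)
    batch_sizes.append(batch_size_t)

print(decode_lengths)  # [6, 4, 3]
print(batch_sizes)     # [3, 3, 3, 2, 1, 1]
```

Because the lengths are in descending order, the active captions at step t are exactly rows 0 to batch_size_t - 1, which is why slicing with [:batch_size_t] is sufficient in the decoder loop.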

That's all!
