Week 59 Study Notes

luputo

Article link: https://blog.csdn.net/luo3300612/article/details/100867195

Huh? Where did the Week 58 notes go?

Paper reading overview

nocaps: novel object captioning at scale: This article creates a new novel object captioning dataset, nocaps, as a better benchmark than the COCO held-out dataset.
Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment: This article uses a high-level vision task (image captioning) to guide a low-level vision task (phrase grounding). It simultaneously trains a phrase-ROI matching model and a sequence-image matching model, in which the global matching is based on the local matching results.
Towards Unsupervised Image Captioning with Shared Multimodal Embeddings: This article introduces an unsupervised image captioning model that aligns images and sentences according to the concepts they share, and achieves better performance with a multi-task loss.
Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning: The key idea of this article is learning a latent intention sequence, which serves as a probability distribution over the future of the sentence and yields diverse generated captions.
An Empirical Study of Spatial Attention Mechanisms in Deep Networks: This article proposes a simple formulation that covers many attention mechanisms proposed over the years and helps us better understand the essence of the mechanism. Its conclusion is that a high attention weight does not necessarily mean more responsibility for the output.
SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks: This article introduces a model that predicts human fixations using a pre-trained network and a combination of coarse and fine detectors, achieving better results with fine-tuning.

coding

LSTM from scratch

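An LSTM written from scratch on top of torch.nn; the code is here.
Main design ideas:

Core + main model: the core part of the LSTM is written as Core, while the other feature-processing parts (mapping features into the hidden space, fetching the inputs, and so on) live in the outer model.
All gates are computed in one step: the gates in the LSTM equations share the same form and differ only in their parameters, so they can be written as a single linear transformation.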
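
To illustrate the second point, here is a minimal pure-Python sketch (not the author's code) showing that stacking the per-gate weight rows into one matrix gives the same pre-activations as three separate linear maps. The weights `W_i`, `W_f`, `W_o`, the biases, and the input `xh` are made-up values chosen only for illustration.

```python
import math


def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for the input, forget, and output gates,
# acting on a 2-dimensional concatenated [x, h] vector.
W_i, b_i = [[0.1, -0.2]], [0.0]
W_f, b_f = [[0.3, 0.4]], [0.1]
W_o, b_o = [[-0.5, 0.2]], [-0.1]

xh = [1.0, 2.0]

# Three separate linear maps, one per gate.
separate = [linear(W, b, xh)[0]
            for W, b in ((W_i, b_i), (W_f, b_f), (W_o, b_o))]

# One fused linear map: the rows of the three matrices stacked together.
W_all = W_i + W_f + W_o
b_all = b_i + b_f + b_o
fused = linear(W_all, b_all, xh)

# The gate activations are then read off by slicing the fused output.
gate_i, gate_f, gate_o = (sigmoid(z) for z in fused)
```

The fused form costs one matrix multiplication instead of three, which is the reason for writing all gates as a single layer.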

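import torch
import torch.nn as nn


class myLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(myLSTMCell, self).__init__()
        # One linear layer produces the input, forget, and output gates at once.
        # Each gate here is a single scalar per sample; a standard LSTM uses
        # one gate value per hidden unit (3 * hidden_size outputs).
        self.gate = nn.Linear(input_size + hidden_size, 3)
        # Linear layer for the candidate cell state.
        self.Cell = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, state):
        h, c = state
        x = torch.cat([x, h], dim=1)
        gate = self.gate(x)
        gate_i = torch.sigmoid(gate[:, 0]).view(-1, 1)
        gate_f = torch.sigmoid(gate[:, 1]).view(-1, 1)
        gate_o = torch.sigmoid(gate[:, 2]).view(-1, 1)
        C = torch.tanh(self.Cell(x))
        C = gate_i * C + gate_f * c
        h = gate_o * torch.tanh(C)
        # Return the updated cell state C so that it carries over to the next step.
        return h, C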


class myLSTM(nn.Module):
    def __init__(self, opt):
        super(myLSTM, self).__init__()
        self.input_size = opt.input_size
        self.hidden_size = opt.hidden_size
        self.vocab_size = opt.vocab_size
        self.embedding_dim = opt.embedding_dim
        # The cell consumes embedded tokens, so opt.input_size must equal
        # opt.embedding_dim.
        self.core = myLSTMCell(self.input_size, self.hidden_size)
        # vocab_size + 1 rows: index 0 appears to be reserved for padding.
        self.embed = nn.Embedding(self.vocab_size+1, self.embedding_dim)
        self.max_length = opt.max_length
        self.num_classes = opt.num_classes
        self.output = nn.Linear(self.hidden_size, self.num_classes)

    def init_weight(self, bs):
        # Zero initial (h, c), created on the same device and dtype as the parameters.
        weight = next(self.parameters())
        return weight.new_zeros(bs, self.hidden_size), weight.new_zeros(bs, self.hidden_size)

    def forward(self, x):
        # x: (batch, max_length) tensor of token indices.
        state = self.init_weight(x.shape[0])
        outputs = []
        for i in range(self.max_length):
            xt = x[:, i]
            # Stop once every sequence in the batch has reached padding.
            if torch.sum(xt) == 0:
                break
            xt = self.embed(xt)
            state = self.core(xt, state)
            outputs.append(state[0].unsqueeze(2))
        # Average the hidden states over all time steps, then classify.
        x = self.output(torch.mean(torch.cat(outputs,dim=2),dim=2))
        return x

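I picked an arbitrary food-delivery review dataset for prediction. Accuracy reached 83.71%; for comparison, a bag-of-words model reaches 73%.

Some issues

Predicting from the last time step only, the model barely learns (perhaps the gradient has vanished?)
Predicting from the average over all time steps is more than 15% better than using a single step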
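
The two readouts compared above can be sketched in plain Python. This is only an illustration under assumptions: the hidden states and the padding mask below are made up, and the per-sequence mask is an addition that the model code above does not use (it averages over every step it runs).

```python
def last_step(hidden_states):
    """Readout 1: use only the hidden state of the final time step."""
    return hidden_states[-1]


def mean_pool(hidden_states, mask):
    """Readout 2: average the hidden states over non-padding time steps."""
    count = sum(mask)
    if count == 0:
        raise ValueError("sequence contains no real tokens")
    dim = len(hidden_states[0])
    return [sum(h[d] * m for h, m in zip(hidden_states, mask)) / count
            for d in range(dim)]


# Hypothetical hidden states for one sequence of length 3;
# the last step is padding and is masked out.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]]
mask = [1, 1, 0]

final = last_step(hidden_states)
pooled = mean_pool(hidden_states, mask)
```

With mean pooling, every time step contributes directly to the prediction, so the loss gradient reaches early steps without passing through the whole recurrence. That is consistent with the observation above, though it is not a verified explanation.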

Attention model visualization

Visualization of the attention model
Attention serialization
Fusion of the attention sequence

caption model performance

Measured the performance of all the grid-attention models I have on hand; the code is taken from here
Compared against the official results wherever they could be found

Review

Reviewed some earlier material

Exponential family and the derivation of Logistic Regression

The sigmoid function is derived as follows:
$P(y;\eta)=b(y)\exp(\eta^T T(y)-a(\eta)) \text{ (exponential family form)}\\ P(y;\phi)=\phi^y(1-\phi)^{1-y}=\exp(y\log(\phi)+(1-y)\log(1-\phi))=\exp(y\log(\frac{\phi}{1-\phi})+\log(1-\phi))\\ \text{Let } \eta=\log(\dfrac{\phi}{1-\phi}) \text{, so that } b(y)=1,\ T(y)=y,\ a(\eta)=-\log(1-\phi) \text{; then}\\ \phi=\dfrac{1}{1+e^{-\eta}}$
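
As a numerical sanity check of this derivation, the sketch below compares the Bernoulli probability with its exponential-family form. It assumes the common convention $P(y;\eta)=b(y)\exp(\eta T(y)-a(\eta))$ with $b(y)=1$, $T(y)=y$, and $a(\eta)=\log(1+e^{\eta})$; the function names are chosen here for illustration.

```python
import math


def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def logit(phi):
    """Natural parameter eta = log(phi / (1 - phi))."""
    return math.log(phi / (1.0 - phi))


def bernoulli(y, phi):
    """P(y; phi) = phi^y * (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi ** y * (1.0 - phi) ** (1 - y)


def bernoulli_exp_family(y, eta):
    """Same distribution written as exp(eta * y - a(eta)), a(eta) = log(1 + e^eta)."""
    return math.exp(eta * y - math.log(1.0 + math.exp(eta)))


# Largest disagreement between the two forms over a few test values.
max_gap = 0.0
for phi in (0.1, 0.5, 0.9):
    eta = logit(phi)
    # The sigmoid inverts the logit, recovering phi from eta.
    max_gap = max(max_gap, abs(sigmoid(eta) - phi))
    for y in (0, 1):
        max_gap = max(max_gap,
                      abs(bernoulli(y, phi) - bernoulli_exp_family(y, eta)))
```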

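Caption metrics

Reorganized here

Organizing the papers I have read

Organized here

Weekly summary

First week of the semester; I was a bit lost and kept zoning out

Tasks for next week

Finish the attention model visualization and compare it against the SALICON dataset
Read papers on attention and on cross-level tasks, including the three left unfinished this week
Nothing else; this item just pads the list to three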

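Appendix (diary)

September 16 TODO

LSTM model from scratch
Review the exponential family and Logistic Regression
Read two papers
In the evening, organize the earlier papers onto GitHub

September 16 results

LSTM model from scratch, here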
The sigmoid function is derived as follows:
$P(y;\eta)=b(y)\exp(\eta^T T(y)-a(\eta)) \text{ (exponential family form)}\\ P(y;\phi)=\phi^y(1-\phi)^{1-y}=\exp(y\log(\phi)+(1-y)\log(1-\phi))=\exp(y\log(\frac{\phi}{1-\phi})+\log(1-\phi))\\ \text{Let } \eta=\log(\dfrac{\phi}{1-\phi}) \text{, so that } b(y)=1,\ T(y)=y,\ a(\eta)=-\log(1-\phi) \text{; then}\\ \phi=\dfrac{1}{1+e^{-\eta}}$
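In Logistic Regression, $\eta$ is estimated with a linear form
Finished reading the two papers
Organizing the papers: not finished; the task turned out to be too large, so it is now a task for the whole week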
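
The remark above about estimating $\eta$ with a linear form can be sketched as follows: set $\eta=\theta^T x$, predict $\phi=\sigma(\eta)$, and improve $\theta$ by gradient ascent on the log-likelihood. The parameter vector, the sample, and the learning rate are made-up values for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """phi = sigmoid(eta), with the natural parameter eta = theta . x."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(eta)


def ascent_step(theta, x, y, lr=0.1):
    """One stochastic gradient ascent step on the Bernoulli log-likelihood.

    The gradient with respect to theta is (y - phi) * x.
    """
    error = y - predict(theta, x)
    return [t + lr * error * xi for t, xi in zip(theta, x)]


theta = [0.0, 0.0]   # hypothetical initial parameters
x, y = [1.0, 2.0], 1  # hypothetical training sample with a positive label

before = predict(theta, x)
theta = ascent_step(theta, x, y)
after = predict(theta, x)
```

After one step on a positive example, the predicted probability for that example moves above its initial value of 0.5.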

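September 17 TODO

Recall the metrics
Recall ELMo
Paper reading ×2
Organize papers
Finish uploading and debugging the code

September 17 results

Recalled the metrics: done, everything is in this blog post
Recalled ELMo:
Traditional word embeddings are fixed once training is done, so they cannot handle polysemy. ELMo instead generates the embedding of each word in a newly seen sentence from the sentence context, adjusting the word embedding dynamically. It does this with a two-layer bidirectional LSTM: each word has one embedding at the input and one at each layer, three in total. After pre-training on a large corpus (predicting the current word from its context), a downstream task obtains the final embedding as a weighted sum of the three. Seen from today, ELMo's limitation is that it does not use the Transformer, which has stronger feature-extraction ability and parallelizes better
Paper reading ×2: not read
Organizing papers: in progress
Debugging showed that a model hosted on Google Drive is needed, and I have no TZ to reach it, which hurts (╯﹏╰)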
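
The weighted sum mentioned in the ELMo recap above can be sketched in plain Python. As in the ELMo paper, the layer weights are softmax-normalized scalars and the result is scaled by a task-specific factor `gamma`; the three layer vectors and the scores below are made-up values for a single word.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_embeddings, scores, gamma=1.0):
    """Return gamma * sum_j softmax(scores)[j] * layer_embeddings[j]."""
    weights = softmax(scores)
    dim = len(layer_embeddings[0])
    return [gamma * sum(w * layer[d]
                        for w, layer in zip(weights, layer_embeddings))
            for d in range(dim)]


# Hypothetical embeddings of one word: input layer, LSTM layer 1, LSTM layer 2.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal weights, so the result is the plain average.
combined = combine_layers(layers, scores=[0.0, 0.0, 0.0])
```

In a downstream task the scores and `gamma` would be learned together with the task model, letting each task decide how much to rely on each layer.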

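September 18 TODO

Paper reading ×3 (including one carried over from yesterday); unsupervised comparison
Organize papers
Metric results for all models

September 18 results

Paper reading done
Organizing papers: in progress
Metric results for all models: not out yet
Generated COCO2014 bottom-up features

September 19 TODO

Compare the metric results of all models with those in the original papers
Paper reading ×2
Organize material on high-level tasks helping low-level tasks

September 19 results

Compared every model whose original paper I could find
Paper reading: finished one
Found three more papers while organizing; moved them to tomorrow's TODO
Finished computing the metric results for all bottom-up models

September 20 TODO

Paying Attention to Descriptions Generated by Image Captioning Models: on interpreting attention
Top-down Visual Saliency Guided by Captions: a paper on captions guiding saliency
Boosted Attention: Leveraging Human Attention for Image Captioning: a paper on SALICON guiding captioning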