- Attribute variance heat maps of the 312 attributes in CUB birds and the 102 attributes in SUN scenes
- t-SNE [35] visualizations of the test images represented by all attributes (left) and only the high-variance ones (right)
- SP-AEN method
Abstract
- We propose a novel architecture, termed SP-AEN, for zero-shot visual recognition (ZSL).
- Throughout training, the test images and their classes are unseen.
- The method aims to tackle the inherent problem of semantic loss in the prevailing family of embedding-based ZSL: some semantics can be discarded during training if they are non-discriminative among the seen classes, even though they are critical for recognizing the test classes.
- SP-AEN resolves semantic loss by introducing an independent visual-to-semantic embedder, which disentangles the semantic space into two arguably conflicting subspaces: classification and reconstruction.
- Through adversarial learning over the two subspaces, SP-AEN transfers semantics from the reconstructive subspace to the discriminative one, improving zero-shot recognition of unseen classes.
- Compared with prior work, SP-AEN not only improves classification but also generates photo-realistic images, demonstrating the effectiveness of semantic preservation, on four popular benchmarks: CUB, AWA, SUN and aPY.
Introduction
Class embeddings: the extreme case
- All the class embeddings are one-hot label vectors.
- This degenerates into conventional supervised classification, so no semantics can be transferred.

Solution
- preserve semantics by reconstruction
- The semantic embedding of an image should be able to map back to the image; any two semantic embeddings are expected to preserve rich, separable semantic information, otherwise reconstruction will fail. However, reconstruction and classification demand two conflicting objectives:
  - reconstruction: preserve as many image details as possible.
  - classification: compress away irrelevant content.
- $E: V \rightarrow S$
- $G: S \rightarrow V$
To resolve these conflicts, we propose a novel visual-semantic embedding framework:
- Semantics-Preserving Adversarial Embedding Network (SP-AEN).
- Introduce a new mapping: $F: V \rightarrow S$.
- An adversarial objective: the discriminator $D$ and the encoder $F$ try to make $F(x)$ and $E(x)$ indistinguishable.
- The two benefits of introducing $F$ and $D$ are to help $E$ preserve semantic information.
- Semantic Transfer: even though semantic loss is unavoidable for $E$, we can avoid it using $F$ by borrowing ingredients from $E(x)$ of other classes.
- The discriminator $D$ eventually transfers semantics from $F(x)$ to $E(x)$ by aligning the two semantic embedding spaces to the same distribution.
- Disentangled Classification and Reconstruction.
Related Work
Zero-Shot Learning
- The mainstream of ZSL is attribute-based visual recognition: the attributes serve as an intermediate feature space.
- scale up ZSL
- embedding based methods
- learn a mapping from the image visual space to a semantic space, represented by semantic vectors
- SP-AEN is an embedding-based ZSL method
- the ranking based classification loss
- reconstruct images from the semantic embeddings
- no image is exposed to test classes at training in ZSL.
Domain Shift and Hubness
- the semantic loss
- Domain shift
- the training and test data follow different distributions
- Hubness [37]: in a high-dimensional space, a few points (hubs) become the nearest neighbors of many queries
- Another way of countering semantic loss is to learn independent attribute classifiers
Generative Adversarial Network (GAN)
- to train a generator that can fool a discriminator to confuse the distributions of the generated and true samples
- this max-min training procedure
- data augmentation of unseen classes
- feature-level
Image Generation
- pixel-level loss
- feature-level reconstruction loss
- perceptual similarity
- adversarial loss
- image-to-image transformation
- a bottleneck layer
Formulation
Preliminaries
- Given a training set $\{x_i, l_i\}$:
  - $x_i \in V$: an image represented in the visual space
  - $l_i \in L_s$: a class label in the seen class set; the unseen class set is $L_u$
- The embedding-based framework:
  - a visual-to-semantic mapping $E: V \rightarrow S$, followed by simple nearest neighbor search
  - any class label $l$ is embedded as $y_l \in \mathbb{R}^d$ in the semantic space $S$
- The predicted label $l^{*}$ is obtained by simple nearest neighbor search:
  - $l^{*} = \arg\max_{l \in L} y_l^{T} E(x)$ (1)
- $l \in L_u$: the conventional ZSL setting; $l \in L_s \cup L_u$: the generalized ZSL setting
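The nearest-neighbor prediction of Eq. (1) can be sketched as follows (a minimal NumPy sketch; the dictionary-based interface is my own illustration, not the paper's code):

```python
import numpy as np

def predict_label(E_x, class_embeddings):
    """Eq. (1): return the label whose semantic vector y_l has the highest
    dot-product similarity with the image embedding E(x)."""
    labels = list(class_embeddings)
    Y = np.stack([class_embeddings[l] for l in labels])  # shape (|L|, d)
    return labels[int(np.argmax(Y @ E_x))]
```

Restricting `class_embeddings` to unseen classes gives the conventional ZSL setting; including both seen and unseen classes gives the generalized setting.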
Classification Objective
- As label prediction in Eq. (1) 是一个基本的ranking problem。
- a large-margin based ranking loss function for classification objective
- a higher dot-product similarity
- y l y_l yl and E ( x ) E(x) E(x)
- a lower one for any wrongly labeled pair
( x , l ˊ x, \acute{l} x,lˊ) - the similarity margin between the correct one
and the wrong one should be larger than a threshold
- γ > 0 \gamma > 0 γ>0: a hyperparameter for the margin
- At each iteration in stochastic training
- the unpaired labels 未配对标签
- two additional objectives introduced next.
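The large-margin ranking objective above can be sketched as follows (a hedged NumPy sketch; the paper's exact loss may sum or average the hinge terms differently):

```python
import numpy as np

def ranking_loss(E_x, y_true, Y_wrong, gamma=0.5):
    """Large-margin ranking loss for one image: the correct class embedding
    y_true should score at least gamma higher (by dot product with E(x))
    than every wrongly labeled class embedding in Y_wrong (shape (k, d))."""
    s_true = y_true @ E_x          # similarity of the correct pair
    s_wrong = Y_wrong @ E_x        # similarities of the wrong pairs
    # hinge: penalize wrong classes whose score is within gamma of the true one
    return float(np.maximum(0.0, gamma + s_wrong - s_true).sum())
```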
- Architecture figure notes: the semantic embedding $E(x)$; kernel size $c$, fully-connected layer dimension $fc$, and stride $s$ of each convolutional layer; the same color indicates the same layer type.
Reconstruction Objective
- Learn a semantic-to-visual mapping $G: S \rightarrow V$ that reconstructs a semantic embedding $s \in S$ back to the image such that $||G(s) - x||$ is minimized.
- Recall that the reconstruction in the autoencoder fashion uses $s = E(x)$.
- Introduce an independent visual-to-semantic mapping $F$ for the reconstructive embedding $s = F(x)$.
- The visual space $V$ is a feature space from the output of a higher layer in a deep CNN.
- Use the raw 256 × 256 × 3 RGB color space for image reconstruction.
- Minimize the reconstruction objective with respect to $F(x)$.
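The reconstruction objective can be written compactly as below (a sketch; `F` and `G` are arbitrary callables standing in for the paper's networks):

```python
import numpy as np

def reconstruction_loss(x, F, G):
    """Embed x with the reconstructive encoder F, decode with G, and
    penalize the squared pixel-space distance ||G(F(x)) - x||^2."""
    x_hat = G(F(x))
    return float(np.sum((x_hat - x) ** 2))
```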
Adversarial objective
- the disentangled semantic embeddings $E(x)$ and $F(x)$
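A standard GAN-style formulation of this objective might look like the following (my own hedged sketch; the paper's exact adversarial loss may differ). Here `d_real` and `d_fake` are the discriminator's sigmoid outputs on $E(x)$ and $F(x)$:

```python
import numpy as np

def adversarial_losses(d_real, d_fake, eps=1e-8):
    """D tries to tell E(x) (treated as real) from F(x) (treated as fake);
    F tries to fool D, which pushes the two embedding distributions to match."""
    loss_d = -float(np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))
    loss_f = -float(np.mean(np.log(d_fake + eps)))  # non-saturating generator loss
    return loss_d, loss_f
```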
Full Objective
The final objective combines all three terms:
- considering $F$ as the encoder and $G$ as the decoder
- the semantic embedding $F(x)$ can be considered as the bottleneck layer
- regularized to match a supervised distribution $E(x)$
- SP-AEN is a supervised Adversarial Autoencoder
- another adversarial objective for $F(x)$ to match a prior embedding space
Implementation
Architecture
- **an end-to-end network** with the input of raw images and ground-truth class embeddings.
- The embedder $E$: ResNet-101.
- $F$ is based on AlexNet, appended with two more fully-connected blocks that output a d-dimensional embedding vector.
- The subsequent reconstruction network $G$: **five up-convolutional blocks** with leaky ReLU [20] for transforming a vector into a 3-D feature map.
- $D$ is a two-layer fully-connected network plus a non-linear ReLU layer that takes the d-dimensional embedding vector as input.
Training Details
- per-pixel mean subtraction
- MSRA random initializer
- grid search
- the pretrained generator
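The MSRA random initializer mentioned above draws weights from a zero-mean Gaussian with standard deviation $\sqrt{2/\text{fan\_in}}$; a minimal sketch:

```python
import numpy as np

def msra_init(fan_in, fan_out, rng=None):
    """MSRA/He initialization: zero-mean Gaussian with std sqrt(2 / fan_in),
    designed to keep activation variance stable under ReLU nonlinearities."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```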
Datasets
- CUB
- SUN
- AWA
- aPY
Settings and Evaluation Metrics
- $U \rightarrow U$: test images from unseen classes, search space restricted to unseen classes (conventional ZSL)
- $S \rightarrow T$: test images from seen classes, search space over all classes
- $U \rightarrow T$: test images from unseen classes, search space over all classes (generalized ZSL)
Comparisons with State-of-the-Art
Comparing Methods
- embedding based
- DeViSE, ALE, SJE, ESZSL, LATEM
- CMT, SAE
- attribute based:
- DAP
- IAP
- SSE
- CONSE
- SYNC
Ablation Studies
Conflict between Classification & Reconstruction
- DirectMap
- SAE
- SplitBranch
Effectiveness of D and G
- the Seen-Unseen accuracy Curve (SUC)
- The Area Under Seen-Unseen Accuracy Curve (AUSUC)
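The seen/unseen trade-off in generalized ZSL is commonly summarized by the harmonic mean of the two accuracies (a minimal sketch of that metric):

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of seen- and unseen-class accuracies; unlike the
    arithmetic mean, H is high only when both accuracies are high."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```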
Summary
- When time permits, study this paper's network architecture in depth and work through it piece by piece.
Technical notes
- harmonic mean values
- a simple nearest neighbor search
- Semantic Transfer
- a flexible plug-and-play
- end-to-end fine-tune fashion
- this max-min training procedure
Keywords
- SP-AEN method
- photo-realistic reconstruction.
- the high-variance
- the low-variance
- non-discriminative
- adversarial learning
- the semantic discrepancy
- a lossy semantic space
- the class embedding: enriched with semantic information
- a flexible plug-and-play
- end-to-end fine-tune fashion
- trade-off parameters
- ZSL
- few-shot learning
- domain adaptation
- data augmentation
- mode collapse problem