- Attribute variance heat maps of the 312 attributes in CUB birds and the 102 attributes in SUN scenes
- t-SNE [35] visualizations of the test images represented by all attributes (left) and only the high-variance ones (right)
- SP-AEN method
Abstract
- We propose a novel architecture, termed SP-AEN, for zero-shot visual recognition (ZSL).
- Throughout training, the test images and their classes are unseen.
- The method aims to tackle the inherent problem of semantic loss in the prevailing family of embedding-based ZSL: some semantics can be discarded during training if they are non-discriminative among the seen classes, even though they are critical for recognizing the test classes.
- SP-AEN resolves semantic loss by introducing an independent visual-to-semantic embedder, which disentangles the semantic space into two arguably conflicting subspaces: classification and reconstruction.
- Through adversarial learning over the two subspaces, SP-AEN transfers semantics from the reconstructive subspace to the discriminative one, improving zero-shot recognition of unseen classes.
- Compared with prior work, SP-AEN not only improves classification but also generates photo-realistic images, demonstrating the effectiveness of semantic preservation, on four popular benchmarks: CUB, AWA, SUN and aPY.
Introduction
Class embeddings: the extreme case
- All the class embeddings are one-hot label vectors.
- This degenerates into conventional supervised classification, so no semantics can be transferred.

Solution
- preserve semantics by reconstruction
- The semantic embedding of an image should be able to map back to the image; any two semantic embeddings are expected to preserve rich, separable semantic information, otherwise reconstruction will fail. However, reconstruction and classification demand two conflicting objectives:
  - reconstruction: preserve as many image details as possible.
  - classification: compress away irrelevant content.
- $E: V \rightarrow S$
- $G: S \rightarrow V$
To resolve these conflicts, we propose a novel visual-semantic embedding framework:
- Semantics-Preserving Adversarial Embedding Network (SP-AEN).
- Introduce a new mapping: $F: V \rightarrow S$.
- An adversarial objective: the discriminator $D$ and the encoder $F$ try to make $F(x)$ and $E(x)$ indistinguishable.
- The two benefits of introducing $F$ and $D$ are to help $E$ preserve semantic information.
- Semantic Transfer: even though semantic loss is unavoidable for $E$, we can avoid it using $F$ by borrowing ingredients from $E(x)$ of other classes.
- The discriminator $D$ eventually transfers semantics from $F(x)$ to $E(x)$ by aligning the two semantic embedding spaces to the same distribution.
- Disentangled Classification and Reconstruction.
Related Work
Zero-Shot Learning
- The mainstream of ZSL is attribute-based visual recognition: the attributes serve as an intermediate feature space.
- scale up ZSL
- embedding based methods
- learn a mapping from the image visual space to a semantic space, represented by semantic vectors
- SP-AEN is an embedding-based ZSL method
- the ranking based classification loss
- reconstruct images from the semantic embeddings
- no image is exposed to test classes at training in ZSL.
Domain Shift and Hubness
- the semantic loss
- Domain shift
- the training and test data follow different distributions
- Hubness [37]: in a high-dimensional space, a few points (hubs) become the nearest neighbors of many queries
- Another way of countering semantic loss is to learn independent attribute classifiers
Generative Adversarial Network (GAN)
- to train a generator that can fool a discriminator to confuse the distributions of the generated and true samples
- this max-min training procedure
- data augmentation of unseen classes
- feature-level
Image Generation
- pixel-level loss
- feature-level reconstruction loss
- perceptual similarity
- adversarial loss
- image-to-image transformation
- a bottleneck layer
Formulation
Preliminaries
- Given a training set $\{x_i, l_i\}$:
  - $x_i \in V$: an image represented in the visual space
  - $l_i \in L_s$: a class label in the seen class set; the unseen class set is $L_u$
- The embedding-based framework:
  - a visual-to-semantic mapping $E: V \rightarrow S$, followed by simple nearest neighbor search
  - any class label $l$ is embedded as $y_l \in \mathbb{R}^d$ in the semantic space $S$
- The predicted label $l^{*}$ is obtained by simple nearest neighbor search:
  - $l^{*} = \arg\max_{l \in L} y_l^{T} E(x)$ (1)
- $l \in L_u$: the conventional ZSL setting; $l \in L_s \cup L_u$: the generalized ZSL setting
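The nearest-neighbor prediction of Eq. (1) can be sketched as follows (a minimal NumPy sketch; the dictionary-based interface is my own illustration, not the paper's code):

```python
import numpy as np

def predict_label(E_x, class_embeddings):
    """Eq. (1): return the label whose semantic vector y_l has the highest
    dot-product similarity with the image embedding E(x)."""
    labels = list(class_embeddings)
    Y = np.stack([class_embeddings[l] for l in labels])  # shape (|L|, d)
    return labels[int(np.argmax(Y @ E_x))]
```

Restricting `class_embeddings` to unseen classes gives the conventional ZSL setting; including both seen and unseen classes gives the generalized setting.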
Classification Objective
- As label prediction in Eq. (1) 是一个基本的ranking problem。
- a large-margin based ranking loss function for classification objective
- a higher dot-product similarity
- y l y_l yl and E ( x ) E(x) E(x)
- a lower one for any wrongly labeled pair
( x , l ˊ x, \acute{l} x,lˊ) - the similarity margin between the correct one
and the wrong one should be larger than a threshold
- γ > 0 \gamma > 0 γ>0: a hyperparameter for the margin
- At each iteration in stochastic training
- the unpaired labels 未配对标签
- two additional objectives introduced next.
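The large-margin ranking objective above can be sketched as follows (a hedged NumPy sketch; the paper's exact loss may sum or average the hinge terms differently):

```python
import numpy as np

def ranking_loss(E_x, y_true, Y_wrong, gamma=0.5):
    """Large-margin ranking loss for one image: the correct class embedding
    y_true should score at least gamma higher (by dot product with E(x))
    than every wrongly labeled class embedding in Y_wrong (shape (k, d))."""
    s_true = y_true @ E_x          # similarity of the correct pair
    s_wrong = Y_wrong @ E_x        # similarities of the wrong pairs
    # hinge: penalize wrong classes whose score is within gamma of the true one
    return float(np.maximum(0.0, gamma + s_wrong - s_true).sum())
```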
- Architecture figure notes: the semantic embedding $E(x)$; kernel size $c$, fully-connected layer dimension $fc$, and stride $s$ of each convolutional layer; the same color indicates the same layer type.
Reconstruction Objective
- Learn a semantic-to-visual mapping $G: S \rightarrow V$ that reconstructs a semantic embedding $s \in S$ back to the image such that $||G(s) - x||$ is minimized.
- Recall that the reconstruction in the autoencoder fashion uses $s = E(x)$.
- Introduce an independent visual-to-semantic mapping $F$ for the reconstructive embedding $s = F(x)$.
- The visual space $V$ is a feature space from the output of a higher layer in a deep CNN.
- Use the raw 256 × 256 × 3 RGB color space for image reconstruction.
- Minimize the reconstruction objective with respect to $F(x)$.
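The reconstruction objective can be written compactly as below (a sketch; `F` and `G` are arbitrary callables standing in for the paper's networks):

```python
import numpy as np

def reconstruction_loss(x, F, G):
    """Embed x with the reconstructive encoder F, decode with G, and
    penalize the squared pixel-space distance ||G(F(x)) - x||^2."""
    x_hat = G(F(x))
    return float(np.sum((x_hat - x) ** 2))
```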
Adversarial objective
- the disentangled semantic embeddings $E(x)$ and $F(x)$
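A standard GAN-style formulation of this objective might look like the following (my own hedged sketch; the paper's exact adversarial loss may differ). Here `d_real` and `d_fake` are the discriminator's sigmoid outputs on $E(x)$ and $F(x)$:

```python
import numpy as np

def adversarial_losses(d_real, d_fake, eps=1e-8):
    """D tries to tell E(x) (treated as real) from F(x) (treated as fake);
    F tries to fool D, which pushes the two embedding distributions to match."""
    loss_d = -float(np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))
    loss_f = -float(np.mean(np.log(d_fake + eps)))  # non-saturating generator loss
    return loss_d, loss_f
```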
Full Objective
The final objective combines all three terms:
- considering $F$ as the encoder and $G$ as the decoder
- the semantic embedding $F(x)$ can be considered as the bottleneck layer
- regularized to match a supervised distribution $E(x)$
- SP-AEN is a supervised Adversarial Autoencoder
- another adversarial objective for $F(x)$ to match a prior embedding space
Implementation
Architecture
- **an end-to-end network** with the input of raw images and ground-truth class embeddings.
- The embedder $E$: ResNet-101.
- $F$ is based on AlexNet, appended with two more fully-connected blocks that output a d-dimensional embedding vector.
- The subsequent reconstruction network $G$: **five up-convolutional blocks** with leaky ReLU [20] for transforming a vector into a 3-D feature map.
- $D$ is a two-layer fully-connected network plus a non-linear ReLU layer that takes the d-dimensional embedding vector as input.
Training Details
- per-pixel mean subtraction
- MSRA random initializer
- grid search
- the pretrained generator
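The MSRA random initializer mentioned above draws weights from a zero-mean Gaussian with standard deviation $\sqrt{2/\text{fan\_in}}$; a minimal sketch:

```python
import numpy as np

def msra_init(fan_in, fan_out, rng=None):
    """MSRA/He initialization: zero-mean Gaussian with std sqrt(2 / fan_in),
    designed to keep activation variance stable under ReLU nonlinearities."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```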
Datasets
- CUB
- SUN
- AWA
- aPY
Settings and Evaluation Metrics
- $U \rightarrow U$: test images from unseen classes, search space restricted to unseen classes (conventional ZSL)
- $S \rightarrow T$: test images from seen classes, search space over all classes
- $U \rightarrow T$: test images from unseen classes, search space over all classes (generalized ZSL)
Comparisons with State-of-the-Art
Comparing Methods
- embedding based
- DeViSE, ALE, SJE, ESZSL, LATEM
- CMT, SAE
- attribute based:
- DAP
- IAP
- SSE
- CONSE
- SYNC
Ablation Studies
Conflict between Classification & Reconstruction
- DirectMap
- SAE
- SplitBranch
Effectiveness of D and G
- the Seen-Unseen accuracy Curve (SUC)
- The Area Under Seen-Unseen Accuracy Curve (AUSUC)
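The seen/unseen trade-off in generalized ZSL is commonly summarized by the harmonic mean of the two accuracies (a minimal sketch of that metric):

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of seen- and unseen-class accuracies; unlike the
    arithmetic mean, H is high only when both accuracies are high."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```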
Summary
- When time permits, study this paper's network architecture in depth and work through it piece by piece.
Technical notes
- harmonic mean values
- a simple nearest neighbor search
- Semantic Transfer
- a flexible plug-and-play
- end-to-end fine-tune fashion
- this max-min training procedure
Keywords
- SP-AEN method
- photo-realistic reconstruction.
- the high-variance
- the low-variance
- non-discriminative
- adversarial learning
- the semantic discrepancy
- a lossy semantic space
- the class embedding: enriched with semantic information
- a flexible plug-and-play
- end-to-end fine-tune fashion
- trade-off parameters
- ZSL
- few-shot learning
- domain adaptation
- data augmentation
- mode collapse problem