Paper is all you need

Latest recommended article published on 2021-03-19 01:11:47

Dr_P

Category: Paper Reading. Tags: Deep Learning

Article link: https://blog.csdn.net/qq2414205893/article/details/83515865

An End Is Also a Beginning

This is an important date to write down: 2018-10-29.
Today I finished the first paper of my PhD project. From today on, I will post my notes on the papers I read here every week.

Good words

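Words with the suffix "-ize", such as "particularize", are a typical example of wording that is not concise enough; "specify" can replace it. Similar pairs are "utilize" and "use", "finalize" and "end", and "hitherto" and "until now/previously".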

Title

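Suppression of Inter-Domain Background Shift for Person Re-Identification (suppression: holding something down, restraining it)
Because of the complementary nature of their products, Microsoft’s and Intel’s alternating advances have created a virtuous cycle, benefiting from network effects. (alternating advances: taking turns to move forward; virtuous cycle: a self-reinforcing positive loop)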

Adjective

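tractable: easy to handle
well-studied: thoroughly studied
incompatible: unable to coexist or work together
they make impractical assumptions (impractical: unrealistic)
Extensive experiments (extensive: wide-ranging, large in number)
diverse: varied, of many kinds
despite its undeniable success (undeniable: indisputable)
empirical insights: deep observations grounded in experience and experiments
a large amount of: a great quantity of
Omni-Scale & omni-directional: covering all scales and all directions
hereby, thereby: by this means, as a result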

Adverb

and vice versa: the reverse also holds
projection domain shift exists inherently in the regression of GZSL (exists inherently in: is intrinsically present in)
We revisit the dissimilarity representation [9] in the new context of GZSL.
explicitly interpretable information
Log-Euclidean distance (LED), and further derive a kernel function that explicitly maps the covariance matrix from the Riemannian manifold to a Euclidean space (explicitly maps: gives an explicit mapping function)
regarding: about, concerning
Given a training set comprising N samples (comprising: containing)
Concretely: in specific terms (== specifically)
explicitly: clearly, definitely
underlying relations: latent relations
Subsequently: afterwards
vice versa: the reverse also holds

Verb

Compared to ..., our results ... / Comparing ..., we ...
compare with
Unfortunately, these ways suffer from the problem of error accumulation, as they undergo two-stage probabilistic inference so that probability errors are accumulated.
To express what something is made up of: is a composition of, is comprised of
The major limitation originates from the fact that the classical filters are invariant at each location. (originates from: stems from)
elaborate (v.): to set out, to explain in detail
More specifically, HGNN is a general framework which can incorporate with multi-modal data and complicated data correlations. (incorporate: to merge in, not only to combine)
Pair-based metric learning often generates a large amount of (a great quantity of) pair-wise samples, which are highly redundant and include many uninformative samples. Training with random sampling can be overwhelmed (overpowered, overburdened) by these redundant samples, which significantly degrades (lowers) the model capability and also slows the convergence. Therefore, sampling plays a key role in pair-based metric learning.
complement each other: be mutually complementary
exploits (makes use of) == utilizes == employs == leverages == takes full advantage of
replicate: to reproduce, to copy
A common challenge in person re-identification systems is to differentiate people with very similar appearances. (differentiate == distinguish: to tell apart)
Existing methods primarily aim to reduce the inter-domain shift between the domains, but they usually overlook the relations among target samples.
Apart from the easily identified camera variance, some other latent intra-domain variations are hard to explicitly discern without fine-grained labels, such as the changes of pose, view, and background. (discern: to tell apart)
We accomplish this constraint by encouraging an exemplar and its reliable neighbors to be close to each other. (accomplish: to achieve)
effectively accommodate: effectively mitigate
advocate, emphasize, maintain, sustain, convey, optimize, deliver, narrow, deteriorate, inhibit, incorporate

Conjunction

So, Consequently, Thus, Therefore, As a result, As a consequence, To this end
subsequently: later
More concretely: in more specific terms
As such,
Originally: at first
Altogether: all in all
In the light of this: in view of this
I was caught off-guard by the “conference numbers” (caught off-guard: taken by surprise)
according to == based on
Suppose we are given an appropriate model trained on the source and target domains, a target sample and its nearest-neighbors in the target set may share the same identity with a higher potential. (pattern: Suppose ..., then state what would follow)
In addition to == apart from: besides ...

Specific for mathematics

vertex: a node of a graph
adjacency matrix: the matrix that records which vertices of a graph are adjacent

Noun

Graph convolutional neural networks have shown superiority on representation learning compared with traditional neural networks due to its ability of using data graph structure. (superiority: being better, having the advantage)
Under such circumstances, (circumstances: conditions, situation)

Abstract

Such == this

Beginning of Abstract

To understand the processing of information underlying these counterintuitive properties, we visualize the features of shape and texture that underlie identity decisions. Then, we shed a light of information processing into the black box and demonstrate how the hidden layers represent features for decision, and characterize the invariance of these representations to changes of 3D pose.
Face recognition has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs), the central challenge of which is feature discrimination. To address it, one group tries to exploit mining-based strategies (e.g., hard example mining and focal loss) to focus on the informative examples. The other group devotes to designing margin-based loss functions (e.g., angular, additive and additive angular margins) to increase the feature margin from the perspective of ground truth class. Both of them have been well-verified to learn discriminative features. However, they suffer from either the ambiguity of hard examples or the lack of discriminative power of other classes. In this paper, we design a novel loss function, namely support vector guided softmax loss (SV-Softmax), which adaptively emphasizes the mis-classified points (support vectors) to guide the discriminative features learning. So the developed SV-Softmax loss is able to eliminate the ambiguity of hard examples as well as absorb the discriminative power of other classes, and thus results in more discriminative features.
Beyond that, we conduct an exhaustive analysis on the role of training data on performance.
To the best of our knowledge, this is the first attempt to inherit the advantages of mining-based and margin-based losses into one framework.
Symmetric Positive Definite (SPD) matrix learning methods have become popular in many image and video processing tasks, thanks to their ability to learn appropriate statistical representations while respecting Riemannian geometry of underlying SPD manifolds.
In particular, we devise bilinear mapping layers to transform input SPD matrices to more desirable SPD matrices, exploit eigenvalue rectification (verb: rectify) layers to apply a non-linear activation function to the new SPD matrices, and design an eigenvalue logarithm layer to perform Riemannian computing on the resulting SPD matrices for regular output layers.
We utilize the process of attribute detection to generate corresponding attribute-part detectors, whose invariance to many influences like poses and camera views can be guaranteed.
In this paper, unlike most existing methods simply taking attribute learning as a classification problem, we perform it in a different way with the motivation that attributes are related to specific local regions, which refers to the perceptual ability of attributes.
In this work, we explore how to harness the similar natural characteristics existing in the samples from the target domain for learning to conduct person re-ID in an unsupervised manner. (harness: to make use of)

Abstract

These independent clusters are then assigned with labels, which serve as the pseudo identities to supervise the training process. (serve as: to act as)

End of Abstract

Experimental results on several benchmarks have demonstrated the effectiveness of our approach over state-of-the-art methods.
Extensive experiments demonstrate the superior performance of our algorithm over several state-of-the-art algorithms on small-scale datasets and comparable performance on large-scale re-ID datasets.
Person re-identification (re-id) is a fundamental technique to associate various person images, captured by different surveillance cameras, to the same person.
Extensive experiments demonstrate that by simply substituting OLM for standard linear module without revising any experimental protocols, our method largely improves the performance of the state-of-the-art networks, including Inception and residual networks on CIFAR and ImageNet datasets.

Introduction

Beginning of Introduction

Such hierarchy and deep architectures equip DNNs with large capacity to represent complicated relationships between inputs and outputs.
the dependencies amplify as the network becomes deeper (here "as" means "while, along with")
Person attribute learning has been studied a lot in recent years, and has been proven beneficial for the person Re-ID task.
its key novelty lies in the regularization framework (a pattern for stating where the main innovation lies)

Intermediate Sentences of Introduction

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions [3]. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
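The averaging step described in the sentence above can be sketched in a few lines. This is only an illustration: the function name `average_predictions` and the three probability vectors are made up for the example, and each model is assumed to output a probability vector over the same classes.

```python
from typing import List, Sequence


def average_predictions(per_model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors produced by several models."""
    if not per_model_probs:
        raise ValueError("need predictions from at least one model")
    num_classes = len(per_model_probs[0])
    if any(len(p) != num_classes for p in per_model_probs):
        raise ValueError("all models must predict over the same classes")
    num_models = len(per_model_probs)
    return [
        sum(p[c] for p in per_model_probs) / num_models
        for c in range(num_classes)
    ]


# Hypothetical outputs of three models for one input over three classes.
ensemble = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.7, 0.2],
]
avg = average_predictions(ensemble)
# The ensemble decision is the class with the highest averaged probability.
predicted_class = max(range(len(avg)), key=avg.__getitem__)
```

Every model has to be evaluated for every input, which is the deployment cost the quoted passage calls cumbersome.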
Recent efforts toward reducing these overheads involve pruning and compressing the weights of various layers without hurting original accuracy.
That is, the limitation seems to lie in the difficulty of optimisation rather than in the network size
A starting point to understand information processing in CNNs (and the brain) is to identify the features represented across their respective computational hierarchies
An open question remains whether a well-constrained CNN (i.e. constrained by architecture, time, representation, function and so forth) could learn the mid-to-high-level features that flexibly represent task-dependent visual categories in the human visual.
There are some pitfalls in this paradigm. Firstly, the image features φ(x), either crafted manually or taken from a pre-trained CNN model, may not be representative enough for the zero-shot recognition task. Though the features from a pre-trained CNN model are learned, they are restricted to a fixed set of images (e.g., ImageNet [24]), which is not optimal for a particular ZSL task.
Secondly, the user-defined attributes (y) are semantically descriptive, but they are not exhaustive, thus limiting its discriminativeness in classification. There may exist discriminative visual clues not reflected by the pre-defined attributes in ZSL datasets, e.g., the huge mouths of hippos. On the other hand, as shown in Figure 1, the annotated attributes, such as big, strong and ground, are shared in many object categories. This is desired for knowledge transfer between categories, especially from seen to unseen categories. However, if two categories (e.g. cheetah and tiger) share too many (user-defined) attributes, they will be hardly distinguishable in the space of attribute vectors.
Thirdly, low-level feature extraction and embedding space construction in existing ZSL approaches are treated separately, and usually carried out in isolation. Therefore, few existing work ever considers those two components in a unified framework. To address those pitfalls, we propose an end-to-end model capable of learning latent discriminative features (LDF) for ZSL in both visual and semantic space. Specifically, our contributions are:
A problem thus arises: A DNN comprises multiple feature extraction layers stacked one on top of each other; and it is widely acknowledged [20, 39, 14] that, when progressing from the bottom to the top layers, the visual concepts captured by the feature maps tend to be more abstract and of higher semantic level.
To enhance performance of CNNs, recent researches have mainly investigated three important factors of networks: depth, width, and cardinality.
They empirically show that cardinality not only saves the total number of parameters but also results in stronger representation power than the other two factors: depth and width
Through empirical statistics on the classification errors, we find that the network is able to predict several candidate categories that include the correct one with high confidence. However, making the correct final decision on the single category is difficult for the network based models, due to the distraction from other candidate categories. Motivated by the above observations, we propose a novel “Learning by Rethinking” (LR) algorithm in this paper: instead of making the final decision based on one-pass of the data through the network, we introduce feedback connections and allow the network based models to “re-think” the decision and take the high-level feedback information into feature extraction. Benefiting from the feedback, the model is able to extract more discriminative low-level features with the guidance from the high-level information.
we may make accurate Re-ID more tractable.
Specifically, we flow global contextual information obtained at top sides into bottom sides. The top contextual information will learn to guide the bottom sides to construct the contextual features at fine spatial scales only emphasizing salient objects. Hence the obtained contexts are different from side-output features or some combinations of them which only contain or at least emphasize local representations for an image.
Machine learning on visual recognition greatly relies on many manually labeled images. However, labeling images is a costly work, especially for fine-grained annotation in specific domains.
Inspired by humans’ ability that human can classify visual objects of unseen classes according to their previous knowledge, GZSL is proposed. GZSL is to classify objects of unseen classes within the whole scope of classes [23, 24]. If the classification is just within the scope of unseen classes, it is known as zero-shot learning (ZSL) [4, 14]. As GZSL is more practical and valuable than ZSL, it has gradually attracted more attention.
how to guide the learning process to weaken the effect of projection domain shift becomes a key factor
Learning with no data, **a.k.a.** (also known as) Zero-Shot Learning (ZSL), has been proved to be an effective way to tackle the increasing difficulty posed by insufficient training samples

End of Introduction

The experimental results demonstrate our simple idea can favorably outperform recent state-of-the-art methods that use heavily engineered networks, **especially for fine-grained annotation in specific domains.**
Since our ultimate goal is classification

Contribution

We build an hourglass network with intermediate supervision to learn hierarchical contexts, which are generated with the guidance of global contextual information and thus only emphasize salient objects at different scales
We extensively compare our method with recent state-of-the-art methods on six popular datasets. Our simple method favorably outperforms these competitors under various metrics.
We propose a hierarchical context aggregation module to ensure the network is optimized from the top sides to bottom sides. We aggregate the learned hierarchical contexts at different scales to perform accurate salient object detection unlike previous studies [16, 55, 43] that fuse side-output features or some complex combinations of side-outputs

Related Work

Beginning of Related Work

It reveals the potential of making the region feature extraction step learnable. However, its form still resembles the regular grid based pooling. The learnable part is limited to bin offsets only.
Most (if not all) previous region feature extraction methods are shown to be specialization of this formulation by specifying the weights in different ways, mostly hand-crafted.
We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly. In contrast to pruning weights, this approach does not result in sparse connectivity patterns.
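As a rough illustration of whole-filter pruning, the sketch below ranks the filters of one convolution layer and keeps the strongest ones. The quoted sentence does not say how low-impact filters are identified, so ranking by the L1 norm of each filter's weights is an assumption made here, and `l1_norm`, `filters_to_keep` and the toy weights are hypothetical.

```python
from typing import List

# One filter is stored as [in_channels][kernel_h][kernel_w].
Filter = List[List[List[float]]]


def l1_norm(filt: Filter) -> float:
    """Sum of absolute weights, used here as an assumed importance score."""
    return sum(abs(w) for channel in filt for row in channel for w in row)


def filters_to_keep(filters: List[Filter], num_to_prune: int) -> List[int]:
    """Return the sorted indices of filters kept after dropping the weakest."""
    if not 0 <= num_to_prune < len(filters):
        raise ValueError("num_to_prune must leave at least one filter")
    ranked = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
    return sorted(ranked[num_to_prune:])


# Four toy 1x2x2 filters with L1 norms 2.0, 0.03, 4.0 and 0.3.
layer = [
    [[[0.5, -0.5], [0.5, -0.5]]],
    [[[0.01, 0.0], [0.0, -0.02]]],
    [[[1.0, 1.0], [1.0, 1.0]]],
    [[[0.1, 0.1], [-0.1, 0.0]]],
]
kept = filters_to_keep(layer, num_to_prune=2)
```

Dropping a filter also drops its output feature map, so the matching input channel of the next layer has to be removed as well. The layers stay dense, which is why no sparse connectivity pattern appears.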
Zhao et al. [58] added a pyramid pooling module for global context construction upon the final layer of the deep network, by which they significantly improved the performance of semantic segmentation.

End of Related Work

Hence deep learning based methods have dominated this field due to their powerful representation capability.
The full literature review of salient object detection is out of the scope of this paper. Please refer to [2, 8, 12] for a more comprehensive survey. In this paper, we focus on the context learning rather than previous multi-level feature fusion for the improvement of saliency detection. Different from [43] that uses multiple networks, each of which has a pyramid pooling module [58] at the top, we propose an elegant single network. Different from [59] that uses multi-scale inputs, we use single-scale inputs to extract multi-level contexts. The resulting model is simple yet effective.

Proposed method

Beginning of Proposed method

In this section, we will elaborate our proposed framework for salient object detection. We first introduce our base network in Section 3.1. Then, we present a Mirror-linked Hourglass Network (MLHN) in Section 3.2. A detailed description of the Hierarchical Context Aggregation (HCA) module is finally provided in Section 3.3. We show an overall network architecture in Figure 2.
To tackle the salient object detection, we follow recent studies [5, 43, 16] to use fully convolutional networks.
Specifically, we use the well-known VGG16 network [38] as our backbone net, whose final fully connected layers are removed to serve for image-to-image translation.
To this end, we retain the final pooling layer as in [16] and follow [3] to transform the last two fully connected layers to convolution layers, one of which has the kernel size of 3 × 3 with 1024 channels and another of which has the kernel size of 1 × 1 with 1024 channels as well.
Following these observations, we hypothesize that despite ReLU erasing negative linear responses, the first few convolution layers of a deep CNN manage to capture both negative and positive phase information through learning pairs or groups of negatively correlated filters. This conjecture implies that there exists a redundancy among the filters from the lower convolution layers.
A well-learned class center is expected to characterize the samples belonging to this class in the feature space. (characterize: to describe)
To simplify the notation, ... (a lead-in used before introducing shorthand symbols)
It seems difficult to derive a closed-form solution of Eq. (1) owing to the difficulty in computing the derivative with respect to A. (owing to: because of; derivative with respect to A: the derivative taken in the variable A)

Intermediate Sentences of Proposed method

solid is the professional representation

a projection $\tilde{H} = 2H - 1$ is exploited to transform $H$ to $\tilde{H} \in [-1, 1]^{m \times k}$
each hash bit is generated on the basis of the whole (and the same) input image feature vector, which may inevitably result in redundancy among the hash bits
where α, β, γ are weights that control the interaction of the loss terms
two feature vectors from source and target domain are concatenated to a 2,048-dim vector
For similarity learning, we employ the triplet loss used in [15], which is formulated as,
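The formula itself is missing from these notes. As a hedged stand-in, the sketch below uses the common margin-based form max(0, d(a, p) − d(a, n) + m). Whether [15] uses exactly this form, Euclidean distance, or this margin is not stated in the source, so the function names and the default `margin=0.3` are assumptions.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 0.3) -> float:
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # same identity, distance 1.0
easy_negative = [3.0, 4.0]   # different identity, far away (distance 5.0)
hard_negative = [0.0, 1.2]   # different identity, almost as close as the positive

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only the hard triplet contributes here.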
Given a certain small length of binary codes, the redundancy lying in different bits would badly affect its performance.
1) a source image and its translated image should contain the same ID, i.e., self-similarity, and 2) the translated image should be of a different ID from any target image, i.e., domain dissimilarity. Note: the source and target domains contain entirely different IDs.
but we want a representation which is conducive to training strong classifiers
In unsupervised adaptation, we assume access to source images Xs and labels Ys drawn from a source domain distribution ps(x, y), as well as target images Xt drawn from a target distribution pt(x, y), where there are no label observations.
Our goal is to learn a target representation, Mt and classifier Ct that can correctly classify target images into one of K categories at test time, despite the lack of in domain annotations.
Since direct supervised learning on the target is not possible, domain adaptation instead learns a source representation mapping, Ms, along with a source classifier, Cs, and then learns to adapt that model for use in the target domain
In adversarial adaptive methods, the main goal is to regularize the learning of the source and target mappings, Ms and Mt, so as to minimize the distance between the empirical source and target mapping distributions: Ms(Xs) and Mt(Xt)
If this is the case then the source classification model, Cs, can be directly applied to the target representations, eliminating the need to learn a separate target classifier and instead setting C = Cs = Ct.
In the case of learning a source mapping Ms alone it is clear that supervised training through a latent space discriminative loss using the known labels Ys results in the best representation for final source recognition.
However, given that our target domain is unlabeled, it remains an open question how best to minimize the distance between the source and target mappings. Thus the first choice to be made is in the particular parameterization of these mappings.
Both g1 and g2 are realized as multilayer perceptrons
This constraint forces the high-level semantics to be decoded in the same way in g1 and g2.
Likewise, let Eb(xb; θb) represent the shared encoder function, parameterized by θb which maps an image xb to the encoder output hb, where hb ∼ HB.
This notation simply states that at each output location u of the channel c, the gather operator has a receptive field of the input that lies within a single channel and has an area bounded by $(2e - 1)^2$. If the field envelops the full input feature map, we say that the gather operator has global extent.
Note that the architecture in principle contains multiple scales and for clarity, we illustrate the network with two scales as an example.
Different from traditional ZSL approaches, the parameters of FNet are jointly trained with other parts in our framework; thus the obtained features are regulated well with the embedding component. We show that this leads to a performance improvement.
However, there exist identity-discriminative but view-invariant visual appearance characteristics or factors that can be exploited for person Re-ID (exploited for: put to full use for)
automatically learn the space of multi-level discriminative visual factors that are insensitive to viewing condition changes
We propose two new types of layers – the “feedback” layer and the “emphasis” layer – to serve as the channel for transferring the feedback information.
We consider Se6 as the top valve (the uppermost valve) that controls the overall contextual information flow in the network.
The resolution of feature maps in each convolution block is the half of the preceding one. Following [16, 48], the side-output of each convolution block means the connection from the last layer of this block.
existing SPD matrix learning approaches typically flatten SPD manifolds (flatten: to press the manifold flat, as in flattening a tensor into a vector) via tangent space approximation
While, in principle, this could be handled by using the strategy of Section 3.2 with a small D˜, this would incur a loss of information that reduces the network capacity too severely.
The objective is to construct a graph to jointly take the target pairs and the context information into consideration, and eventually outputs the similarity score.

End Sentences of Proposed method

To this end, one can simply remove the fully-connected layers of the first-order CNN and connect the resulting output to a CDU. The output of the CDU being a vector, one can then simply pass it to a fully-connected layer, which, after a softmax activation, produces class probabilities. Since, as discussed above, all our new layers are differentiable, the resulting network can be trained in an end-to-end manner.

Equation Description

For a clear presentation, this can be formulated as
The (D × D) covariance matrix of such features can then be expressed as (no comma here; the equation follows), where
Therein, γ > 0 is a small weight constant to ensure the state of convergence. (therein: in which, where)

Figure Description

Hierarchical Context Aggregation (HCA) module used in our proposed network. All sides of the backbone have intermediate supervision to ensure that the optimization is performed from high sides to lower sides, so that every side can learn the contextual information. The hierarchical contexts from all sides are concatenated for final saliency map prediction
Figure 2. Overall framework of our proposed method. Our effort starts from the VGG16 network [38]. We add an additional convolution block at the end of the convolution layers of VGG16, resulting in six convolution blocks in total. The contexts at each convolution block are learned in a high-to-low manner to ensure that each block is guided by all higher layers to generate scale-aware contexts. The Hierarchical Context Aggregation (HCA) module can guarantee the optimization order is high-to-low and aggregate the generated hierarchical contexts to predict the final saliency maps.
The proposed Riemannian network is conceptually illustrated in Fig.1.

Experiments

Experiments Configuration

Architectural Analyses

Due to the nature of the multi-scale and multi-level learning in deep neural networks, there have emerged a large number of architectures that are designed to utilize the hierarchical deep features. For example, multi-scale learning can use skip-layer connections [13, 31] which is widely accepted owing to their strong capabilities to fuse hierarchical deep features inside the networks. On the other hand, multi-scale learning can use encoder-decoder networks that progressively decode the hierarchical deep representation learned in the encoder backbone net. We have seen these two structures applied in various vision tasks.
We continue our discussion by briefly categorizing inside multi-scale deep learning into five classes: hyper feature learning, FCN style, HED style, DSS style and encoder-decoder networks. An overall illustration of them is summarized in Figure 4. Our following discussion of them will clearly show the differences between our proposed HCA network and previous efforts on multi-scale learning.
Our network architecture is shown in Figure 2. Firstly, the concept generator is particularly designed to have multiple fully connected layers in order to obtain enough capacity to generate image-analogous concepts which are highly heterogeneous from the input attribute. Details are shown in Table 1. Secondly, our concept discriminator is also a combination of fully connected layers, each followed by batch normalization and leaky ReLU, except for the output layer, which is processed by the Sigmoid non-linearity. Finally, the concept extractor is obtained by removing the last Softmax classification layer of ResNet-50 and adding a 128-D fully connected layer. We regard the feature produced by the FC layer as the image concept. Note that the dimension of the last layer in the concept generator is also set to 128.

Ablation Study

Comparison with the state-of-the-art

Conclusion

Adding an FC layer after pool5 is good for cross-domain re-ID but decreases the rank-1 accuracy for supervised person re-ID.

Professional description collection

The domain adversarial similarity loss [7, 8] is used to train a model to produce representations such that a classifier cannot reliably predict the domain of the encoded representation. Maximizing such “confusion” is achieved via a Gradient Reversal Layer (GRL) and a domain classifier trained to predict the domain producing the hidden representation. The GRL has the same output as the identity function, but reverses the gradient direction. Formally, for some function $f(u)$, the GRL is defined as $Q(f(u)) = f(u)$ with a gradient $\frac{d}{du} Q(f(u)) = -\frac{d}{du} f(u)$. The domain classifier $Z(Q(h_c); \theta_z) \to \hat{d}$, parameterized by $\theta_z$, maps a shared representation vector $h_c = E_c(x; \theta_c)$ to a prediction of the label $\hat{d} \in \{0, 1\}$ of the input sample $x$. Learning with a GRL is adversarial in that $\theta_z$ is optimized to increase $Z$’s ability to discriminate between encodings of images from the source or target domains, while the reversal of the gradient results in the model parameters $\theta_c$ learning representations from which domain classification accuracy is reduced.
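The GRL definition above (identity in the forward pass, negated gradient in the backward pass) can be illustrated without any deep learning framework. The class below is a minimal sketch with hand-written forward and backward methods; the `scale` factor is an added assumption, and its default of 1.0 matches the plain sign flip in the text.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass, negated gradient in the backward pass."""

    def __init__(self, scale: float = 1.0) -> None:
        # scale is a hypothetical extra knob; 1.0 gives a pure sign flip.
        self.scale = scale

    def forward(self, features: List[float]) -> List[float]:
        # Q(f(u)) = f(u): the features pass through unchanged.
        return list(features)

    def backward(self, grad_from_classifier: List[float]) -> List[float]:
        # dQ/du = -df/du: the gradient sent to the encoder is reversed.
        return [-self.scale * g for g in grad_from_classifier]


grl = GradientReversal()
hidden = [1.0, -2.0, 0.5]
passed = grl.forward(hidden)
reversed_grad = grl.backward([0.5, -0.25, 0.0])
```

The domain classifier after the layer is trained normally, while the encoder before it receives the opposite gradient and is pushed toward features that make the domain harder to predict.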
With a discriminative base model, input images are mapped into a feature space that is useful for a discriminative task such as image classification. For example, in the case of digit classification this may be the standard LeNet model. However, Liu and Tuzel achieve state of the art results on unsupervised MNIST-USPS using two generative adversarial networks [13]. These generative models use random noise as input to generate samples in image space—generally, an intermediate feature of an adversarial discriminator is then used as a feature for training a task-specific classifier.
Note that this information flow direction is opposite to that in a discriminative deep neural network [6] where the first layers extract low-level features while the last layers extract high-level features.
They can materialize the shared high-level representation differently for fooling the respective discriminators.
We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose)
A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attacks than our baseline convolutional neural network.
For cumbersome models that learn to discriminate between a large number of classes, the normal training objective is to maximize the average log probability of the correct answer, but a side-effect of the learning is that the trained model assigns probabilities to all of the incorrect answers and even when these probabilities are very small, some of them are much larger than others.
While there are often many solutions (deep network parameter settings) that generate zero train error, some of these generalise better than others due to being in wide valleys rather than narrow crevices [4, 9] – so that small perturbations do not change the prediction efficacy drastically;and that deep networks are better than might be expected at finding these good solutions [26], but that the tendency towards finding robust minima can be enhanced by biasing deep nets towards solutions with higher posterior entropy
In this paper, we propose a new network module, named “Convolutional Block Attention Module”. Since convolution operations extract informative features by blending cross-channel and spatial information
together, we adopt our module to emphasize meaningful features along those two principal dimensions: channel and spatial axes. To achieve this, we sequentially apply channel and spatial attention modules (as shown in Fig. 1), so that each of the branches can learn ‘what’ and ‘where’ to attend in the channel and spatial axes respectively. As a result, our module efficiently helps the information flow within the network by learning which information to emphasize or suppress.
We produce a channel attention map by exploiting the inter-channel relationship of features. As each channel of a feature map is considered as a feature detector [31], channel attention focuses on ‘what’ is meaningful given an input image. To compute the channel attention efficiently, we squeeze the spatial dimension of the input feature map. For aggregating spatial information, average-pooling has been commonly adopted so far. Zhou et al. [32] suggest using it to learn the extent of the target object effectively and Hu et al. [28] adopt it in their attention module to compute spatial statistics. Beyond the previous works, we argue that max-pooling gathers another important clue about distinctive object features to infer finer channel-wise attention. Thus, we use both average-pooled and max-pooled features simultaneously. We empirically confirmed that exploiting both features improves the representation power of networks more than using each independently (see Sec. 4.1), showing the effectiveness of our design choice. We describe the detailed operation below.
We first aggregate spatial information of a feature map by using both average-pooling and max-pooling operations, generating two different spatial context descriptors: $F^c_{avg}$ and $F^c_{max}$, which denote average-pooled features and max-pooled features respectively. Both descriptors are then forwarded to a shared network to produce our channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. The shared network is composed of a multi-layer perceptron (MLP) with one hidden layer. To reduce parameter overhead, the hidden activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where $r$ is the reduction ratio. After the shared network is applied to each descriptor, we merge the output feature vectors using element-wise summation. In short, the channel attention is computed as:
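The equation itself did not survive in these notes. A reconstruction consistent with the description above would be the following, where $\sigma$ (a sigmoid) and the MLP weights $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are symbols assumed here rather than quoted from the excerpt:

```latex
\begin{align}
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \\
       &= \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)
\end{align}
```

The same $W_0$ and $W_1$ appear in both terms because the MLP is shared between the two descriptors.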
We generate a spatial attention map by utilizing the inter-spatial relationship of features. Different from the channel attention, the spatial attention focuses on ‘where’ is an informative part, which is complementary to the channel attention. To compute the spatial attention, we first apply average-pooling and max-pooling operations along the channel axis and concatenate them to generate an efficient feature descriptor. Applying pooling operations along the channel axis is shown to be effective in highlighting informative regions [33]. On the concatenated feature descriptor, we apply a convolution layer to generate a spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$ which encodes where to emphasize or suppress. We describe the detailed operation below. We aggregate channel information of a feature map by using two pooling operations, generating two 2D maps: $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$. Each denotes average-pooled features and max-pooled features across the channel. Those are then concatenated and convolved by a standard convolution layer, producing our 2D spatial attention map. In short, the spatial attention is computed as:
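The equation is missing from these notes as well. A reconstruction that follows the description would be the following, where $\sigma$ (a sigmoid) and $f$ (the convolution, whose kernel size is not stated in the excerpt) are symbols assumed here:

```latex
\begin{align}
M_s(F) &= \sigma\big(f([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) \\
       &= \sigma\big(f([F^s_{avg}; F^s_{max}])\big)
\end{align}
```

Here $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the channel axis, so $f$ maps a 2-channel input to a 1-channel map.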
Furthermore, convolutional features naturally retain spatial information which is lost in fully-connected layers, so we can expect the last convolutional layers to have the best compromise between high-level semantics and detailed spatial information.
Inspired by the above evidence, we present a novel Feedback Convolutional Neural Network architecture in this paper. It achieves this selectivity by jointly reasoning about the outputs of class nodes and the activations of hidden layer neurons during the feedback loop.
From a machine learning perspective, the proposed feedback networks add extra flexibility to Convolutional Networks, helping to capture visual attention and improve feature detection.
Compared with traditional bottom-up strategies [11, 13], which aim to regularize the network training, the proposed feedback framework adds flexibilities to the model inference from high-level concepts down to the receptive field.
We mimic the human visual recognition process, in which a person focuses on recognizing objects in a complicated image after a first glimpse, as the procedure “Look and Think Twice” for image classification. We use weakly supervised object localization during the “first glimpse” to guess ROIs, then refocus the network on those ROIs and give the final classification list.
As in Biased Competition Theory [1, 6], feedback, which passes the high-level semantic information down to the low-level perception, controls the selectivity of neuron activations in an extra loop in addition to the feedforward process. This results in the “Top-Down” attention in human cognition. Hierarchical probabilistic computational models [19] are proposed to characterize feedback stimuli in a top-down manner, which are further incorporated into deep neural networks, for example, modeling feedback as latent variables in DBM [31], or using selectivity to resolve fine-grained classification [21], etc.
Inspired by visualizations of CNNs [33, 24], a more feasible and cognitive manner for detection / localization could be derived by utilizing the saliency maps generated in feedback visualizations.
However, if possible, the challenge lies in (depends on) how to obtain semantically meaningful salience maps with high quality for each concept. That’s the ultimate goal of our work presented in this paper.
By interpreting ReLU and Max-Pooling layers as “gates” controlled by input x, the network selects information during feedforward phases in a bottom-up manner, and eliminates signals with minor contributions in making decisions. However, the activated neurons could be either helpful or harmful for classification, and involve too much noise, for instance, cluttered backgrounds in complex scenes.
Since the model opens all gates and allows maximal information to get through to ensure generalization, to increase discriminability at the feature level it is feasible to turn off those gates that provide irrelevant information when targeting particular semantic labels.
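The "gate" reading of ReLU and max-pooling can be made concrete with a toy sketch. Everything below is illustrative: the function names are hypothetical, and `close_gates` simply takes a hand-picked set of indices to keep, whereas the paper infers which gates to close from top-down feedback.

```python
def relu_gate(x):
    """ReLU as a gate: the gate is open (1) where the input is positive."""
    gates = [1 if v > 0 else 0 for v in x]
    outputs = [v if g else 0.0 for v, g in zip(x, gates)]
    return gates, outputs

def max_pool_gate(window):
    """Max-pooling over one window: only the arg-max position is open."""
    winner = max(range(len(window)), key=lambda i: window[i])
    gates = [1 if i == winner else 0 for i in range(len(window))]
    return gates, window[winner]

def close_gates(x, gates, keep):
    """Hypothetical feedback step: close every open gate whose index is not in `keep`."""
    new_gates = [g if i in keep else 0 for i, g in enumerate(gates)]
    outputs = [v if g else 0.0 for v, g in zip(x, new_gates)]
    return new_gates, outputs

x = [-1.0, 2.0, 0.5]
gates, activations = relu_gate(x)                         # bottom-up: input decides
fb_gates, fb_activations = close_gates(x, gates, keep={1})  # top-down: label decides
```

In the feedforward pass the input alone decides which gates open; the feedback step then suppresses activations that passed the gate but are irrelevant to the target label.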
However, all the existing methods merely apply shallow learning, and such traditional methods are typically surpassed by recent deep learning methods in many areas of artificial intelligence and visual recognition.
A new backpropagation procedure is derived to train the proposed network by exploiting a stochastic gradient descent optimization algorithm on Stiefel manifolds.
Analogously to the well-known convolutional network (ConvNet), the proposed SPD matrix network (SPDNet) also designs fully connected convolution-like layers and rectified linear units (ReLU)-like layers, named bilinear mapping (BiMap) layers and eigenvalue rectification (ReEig) layers respectively. In particular, following the classical manifold learning theory that learning or even preserving the original data structure can benefit classification, the BiMap layers are designed to transform the input SPD matrices, which are usually covariance matrices derived from the data, to new SPD matrices with a bilinear mapping. Like the classical ReLU layers, the proposed ReEig layers introduce a non-linearity to the SPDNet by rectifying the resulting SPD matrices with a non-linear function. Since SPD matrices reside on non-Euclidean manifolds, we have to devise an eigenvalue logarithm (LogEig) layer to carry out Riemannian computing on them to output their Euclidean forms for any regular output layers.
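The three layer types can be summarized as follows. This is a sketch: the excerpt gives no formulas, so the symbols are assumptions introduced here, with $X_{k-1} = U_{k-1} \Sigma_{k-1} U_{k-1}^{\top}$ the eigendecomposition of the layer input, $W_k$ a full-row-rank weight matrix, and $\epsilon > 0$ a small rectification threshold.

```latex
\begin{align}
\text{BiMap:}  \quad X_k &= W_k \, X_{k-1} \, W_k^{\top} \\
\text{ReEig:}  \quad X_k &= U_{k-1} \max(\epsilon I, \Sigma_{k-1}) \, U_{k-1}^{\top} \\
\text{LogEig:} \quad X_k &= U_{k-1} \log(\Sigma_{k-1}) \, U_{k-1}^{\top}
\end{align}
```

BiMap and ReEig both map SPD matrices to SPD matrices, while LogEig produces a symmetric matrix that ordinary Euclidean output layers can consume.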
The normalized and de-correlated activation is well known for improving the conditioning of the Fisher information matrix and accelerating the training of deep neural networks [20, 6, 37].
This trick can recover the representation capacity of an orthogonal weight layer to some extent, which is practical for shallow neural networks; for deep CNNs, however, we observe that it is unnecessary.
We aim to update the proxy parameters V, and therefore it is necessary to back-propagate the gradient information through the transformation φ(V).

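LaTeX general problems

When compiling a paper with LaTeX, you sometimes hit a bibliography-related error such as:

Something's wrong--perhaps a missing \item. \end{thebibliography}

The same LaTeX document compiles on Windows but fails to compile on Mac.

The root cause is the *.bbl file in the directory that contains the *.tex file. This file is handled differently on Windows and Mac: when the document cites no references, compilation succeeds on Windows but fails on Mac.

The fix is:

(1) Close the *.tex file, then delete the *.bbl file;

(2) Open the *.tex file and add \cite{*} anywhere in the document (\nocite{*} is the standard command for listing every entry of the .bib file without printing a citation mark in the text);

(3) Compile again, and the error is gone.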
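Step (1) can be scripted when it comes up often. The sketch below is a minimal helper using only the Python standard library; the function name `remove_stale_bbl` is hypothetical, and it assumes the *.bbl files sit directly in the project directory next to the *.tex file.

```python
from pathlib import Path

def remove_stale_bbl(project_dir):
    """Delete every *.bbl file directly inside project_dir.

    Returns the names of the removed files, sorted, so the caller can log them.
    """
    removed = []
    for bbl in sorted(Path(project_dir).glob("*.bbl")):
        bbl.unlink()
        removed.append(bbl.name)
    return removed
```

Calling `remove_stale_bbl(".")` from the project directory before recompiling covers step (1); steps (2) and (3) still have to be done in the editor.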
