Paper Reading: Deep TEN: Texture Encoding Network

Title

Deep TEN: Texture Encoding Network

Year / Authors / Venue

2017 / Hang Zhang, Jia Xue, Kristin Dana / 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Citation

@inproceedings{zhang2017deep,
  title={Deep TEN: Texture Encoding Network},
  author={Zhang, Hang and Xue, Jia and Dana, Kristin},
  booktitle={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017}
}

Summary

  • There is some similarity between the Encoding Layer and the Attention Block: both aim to retain the more meaningful features while ignoring distracting ones.

  • There may be room for innovation in building a more efficient layer for this idea while simultaneously taking multi-size training into consideration.

Interesting Point(s)

The properties of Deep TEN that come from integrating the Encoding Layer with an end-to-end CNN architecture:

  1. Domain Transfer.

  2. Multi-size Training.

  3. Joint Deep Encoding.

    This method can be transferred to different modules, and the paper also shows that this model has the characteristics of the compared models (a hard-assignment special case is sketched after this list), which include:

    • Relation to dictionary learning.
    • Relation to BoWs and residual encoders.
    • Relation to pooling.
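
For example, on the relation to residual encoders: with a hard (nearest-codeword) assignment, the aggregation below reduces to VLAD, while the Encoding Layer relaxes the assignment to a differentiable soft weight. Notation is mine: $x_i$ are the input descriptors and $d_k$ the learned codewords.

$$
e_k=\sum_{i=1}^{N} a_{ik}\,(x_i-d_k),\qquad
a_{ik}=\begin{cases}1, & k=\arg\min_j \lVert x_i-d_j\rVert^2\\ 0, & \text{otherwise}\end{cases}
$$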

Research Objective(s)


**Figure 1.** A comparison of classic approaches and the proposed Deep Texture Encoding Network. Traditional methods such as bag-of-words (BoW, left) have a structural similarity to more recent FV-CNN methods (center). Each component is optimized in separate steps as illustrated with different colors. In our approach (right) the entire pipeline is learned in an integrated manner, tuning each component for the task at hand (end-to-end texture/material/pattern recognition).

Contributions

  • As the first contribution of this paper, we introduce a novel learnable residual encoding layer which we refer to as the Encoding Layer, that ports the entire dictionary learning and residual encoding pipeline into a single layer for CNN. The Encoding Layer has three main properties.
    • (1) The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vector. This representation is orderless and describes the feature distribution, which is suitable for material and texture recognition.
    • (2) The Encoding Layer acts as a pooling layer integrated on top of the convolutional layers, accepting arbitrary input sizes and providing output as a fixed-length representation. By allowing arbitrary-size images, the Encoding Layer makes the deep learning framework more flexible, and our experiments show that recognition performance is often improved with multi-size training (see the short size check after this list).
    • (3) In addition, the Encoding Layer learns an inherent dictionary and the encoding representation, which are likely to carry domain-specific information and therefore are suitable for transferring pretrained features. In this work, we transfer CNNs from object categorization (ImageNet [12]) to material recognition. Since the network is trained end-to-end as a regression, the convolutional features learned together with the Encoding Layer on top are easier to transfer (likely to be domain-independent).
  • The second contribution of this paper is a new framework for end-to-end material recognition which we refer to as the Texture Encoding Network (Deep TEN), where the feature extraction, dictionary learning and encoding representation are learned together in a single network as illustrated in Figure 1. Our approach has the benefit of gradient information passing to each component during back-propagation, tuning each component for the task at hand. Deep TEN outperforms existing modular methods and achieves state-of-the-art results on material/texture datasets such as MINC-2500 and KTH-TIPS-2b. Additionally, this Deep Encoding Network performs well in general recognition tasks beyond texture and material, as demonstrated with results on the MIT-Indoor and Caltech-101 datasets. We also explore how convolutional features learned with the Encoding Layer can be transferred through joint training on two different datasets. The experimental result shows that the recognition rate is significantly improved with this joint training.
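
To make property (2) concrete: plain convolutional feature maps scale with the input image, so a head that expects a fixed spatial size cannot be fed multiple sizes, and an orderless fixed-length encoder removes that constraint. A minimal check of the map sizes (my own snippet; it assumes a recent torchvision and uses an untrained ResNet-18 trunk):

```python
import torch
from torchvision import models

# Plain convolutional features scale with the input, so a classifier head that
# expects a fixed spatial size cannot be trained with multiple image sizes.
# An orderless, fixed-length encoding on top removes that constraint.
backbone = torch.nn.Sequential(*list(models.resnet18(weights=None).children())[:-2])
for size in (224, 320):
    feats = backbone(torch.randn(1, 3, size, size))
    print(size, tuple(feats.shape))   # 224 -> (1, 512, 7, 7); 320 -> (1, 512, 10, 10)
```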

Background / Problem Statement

  • Traditional hand-crafted methods, such as FV and SIFT, have the advantage of accepting arbitrary input image sizes and pose no issue when transferring features across different domains, since the low-level features are generic. However, these methods are built by stacking self-contained algorithmic components (feature extraction, dictionary learning, encoding, classifier training) as visualized in Figure 1 (left, center). Consequently, they have the disadvantage that the features and the encoders are fixed once built, so that feature learning (CNNs and dictionary) does not benefit from labeled data. We present a new approach (Figure 1, right) where the entire pipeline is learned in an end-to-end manner.
  • Deep learning is well known for end-to-end learning of hierarchical features. This paper transfers the approach to texture recognition, which needs a spatially invariant representation describing the feature distribution rather than a concatenation of features.
  • The challenge is to make the loss function differentiable with respect to the inputs and layer parameters.

In a word: the aim is to better transfer the deep-learning method to texture recognition, since models in this field are usually pretrained on a large dataset (such as ImageNet).

Method(s)


Figure 2: The Encoding Layer learns an inherent dictionary. The residuals are calculated by the pairwise difference between the input visual descriptors and the codewords of the dictionary. The assignment weights are based on the pairwise distance between the input descriptors and the codewords. Finally, the residual vectors are aggregated with the assignment weights.
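
In symbols (my reconstruction of the computation described in the caption; $x_i$ are the $N$ input descriptors, $d_k$ the $K$ codewords, and $s_k$ are the learnable smoothing factors used for the soft assignment):

$$
r_{ik}=x_i-d_k,\qquad
a_{ik}=\frac{\exp\left(-s_k\lVert r_{ik}\rVert^2\right)}{\sum_{j=1}^{K}\exp\left(-s_j\lVert r_{ij}\rVert^2\right)},\qquad
e_k=\sum_{i=1}^{N} a_{ik}\,r_{ik}
$$

The aggregated output $E=\{e_1,\dots,e_K\}$ has fixed size $K\times D$ no matter how many descriptors $N$ the image produces.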

Learnable Residual Encoding Layer

  1. Encoding Layer.
  2. End-to-end Learning.
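
A minimal PyTorch sketch covering both points (my own simplification for illustration, not the authors' released implementation; the class and variable names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncodingLayer(nn.Module):
    """Sketch of a learnable residual encoding layer (illustrative, simplified).

    Maps the N descriptors of a feature map to a fixed K*D-dimensional vector,
    so the output length does not depend on the input image size.
    """
    def __init__(self, D, K):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(K, D) * 0.1)  # inherent dictionary
        self.scale = nn.Parameter(torch.ones(K))                # smoothing factors s_k

    def forward(self, x):
        # x: (B, D, H, W) convolutional features -> (B, N, D) descriptors
        B, D, H, W = x.shape
        x = x.view(B, D, H * W).transpose(1, 2)
        # residuals r_ik = x_i - d_k: (B, N, K, D)
        r = x.unsqueeze(2) - self.codewords.view(1, 1, *self.codewords.shape)
        # soft assignment weights from pairwise distances: (B, N, K)
        a = F.softmax(-self.scale * r.pow(2).sum(-1), dim=2)
        # aggregate residuals with the assignment weights: (B, K, D)
        e = (a.unsqueeze(-1) * r).sum(dim=1)
        return F.normalize(e.reshape(B, -1), dim=1)             # fixed-length output

enc = EncodingLayer(D=128, K=32)
for hw in (7, 10):                      # two different feature-map sizes
    out = enc(torch.randn(2, 128, hw, hw))
    print(tuple(out.shape))             # (2, 4096) both times
out.sum().backward()                    # end-to-end: gradients reach the dictionary
print(enc.codewords.grad is not None)   # True
```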

Evaluation

  1. Comparison between single-size training and multi-size training (a toy multi-size loop is sketched after this list).

  2. Comparison with traditional methods on multiple datasets.


  3. Joint training.

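On item 1 (single-size vs. multi-size training): because the encoding output is fixed-length, training can simply alternate the input resolution between epochs. A toy sketch (my own; `nn.AdaptiveAvgPool2d` stands in for the Encoding Layer, and the model, data and sizes are placeholders, not the paper's settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy multi-size training loop. nn.AdaptiveAvgPool2d is only a stand-in for the
# Encoding Layer: both return a fixed-length vector for any input size. Model,
# data and the two resolutions are placeholders, not the paper's settings.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
)
opt = torch.optim.SGD(net.parameters(), lr=0.01)

for epoch in range(4):
    size = 224 if epoch % 2 == 0 else 320   # alternate the input resolution
    x = torch.randn(8, 3, size, size)       # fake image batch
    y = torch.randint(0, 4, (8,))           # fake labels
    loss = F.cross_entropy(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(epoch, size, round(loss.item(), 3))
```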

Conclusion


Figure 5: Pipelines of classic computer vision approaches. Given input images, the local visual appearance is extracted using hand-engineered features (SIFT or filter bank responses). A dictionary is then learned off-line using unsupervised grouping such as K-means. An encoder (such as BoWs or Fisher Vector) is built on top, which describes the distribution of the features and outputs a fixed-length representation for classification.
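
For contrast, a toy version of this classic modular pipeline, assuming scikit-learn is available (random vectors stand in for SIFT descriptors; note that the dictionary is built off-line and never updated by the labels):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1) "Hand-engineered" local features (random vectors stand in for SIFT).
train_descriptors = rng.normal(size=(2000, 128))

# 2) Off-line dictionary from unsupervised K-means: fixed once built,
#    it never sees the labels used later by the classifier.
dictionary = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_descriptors)

# 3) BoW encoder: hard-assign each descriptor of one image to its nearest
#    codeword and build a fixed-length, normalized 32-bin histogram.
image_descriptors = rng.normal(size=(300, 128))
bow = np.bincount(dictionary.predict(image_descriptors), minlength=32).astype(float)
bow /= bow.sum()
print(bow.shape)   # (32,), fed to a separately trained classifier (e.g. an SVM)
```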

Codes

  • We developed an Encoding Layer which bridges the gap between classic computer vision approaches and the CNN architecture.
  • This layer has two main advantages:
    • (1) the resulting deep learning framework is more flexible by allowing arbitrary input image size, and
    • (2) the learned convolutional features are easier to transfer since the Encoding Layer is likely to carry domain-specific information.

Notes

Datasets: MINC-2500, KTH-TIPS-2b, and two recent material datasets, GTOS and 4D-Lightfield.

Question(s)

1. What is "domain specific"?
