DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations – CVPR 2016
Paper (project): http://personal.ie.cuhk.edu.hk/~lz013/projects/DeepFashion.html
1 challenges of clothes recognition
Large variation in style, texture, and cutting;
Clothing items are frequently subject to deformation and occlusion;
Clothes images often exhibit serious variations when taken under different scenarios, such as selfies vs. online shopping photos.
2 multi-tasks: predicting clothing categories, attributes, and landmarks, as well as cross-pose/cross-domain correspondences of clothes pairs
The additional landmark locations may improve recognition;
Massive attributes lead to a better partition of the clothes feature space, facilitating the recognition and retrieval of cross-domain clothes images.
---
50 categories
1,000 attributes
4~8 landmarks per image
>300K pairs of cross-pose/cross-domain correspondences
bounding boxes
>800K images
3 core ideas
Handling clothing deformation/occlusion by pooling/gating feature maps upon estimated landmark locations;
Supervised by massive attribute labels;
4 contributions
a) Building a large-scale clothes dataset of over 800K images, namely DeepFashion, annotated with categories, attributes, landmarks, and cross-pose/cross-domain pair correspondences;
b) Developing FashionNet to jointly predict attributes and landmarks; the estimated landmarks are then employed to pool/gate the learned features;
c) Defining benchmark datasets and evaluation protocols for three widely accepted tasks in clothes recognition and retrieval.
5 approaches -> FashionNet
Simultaneously predicting landmarks and attributes
A multi-task network based on VGGNet, in which landmarks are used to extract local features that are fused with global features;
the network then simultaneously predicts the category, attributes, and a triplet output (corresponding to pair correspondence; at test time this branch may not be used).
This procedure is performed in an iterative manner, and the whole framework can be learned end-to-end.
Red branch: capturing global features of the entire clothing item
Green branch: capturing local features pooled over the estimated clothing landmarks
Blue branch: predicting landmark locations as well as their visibility (i.e., occluded or not)
Moreover, the outputs of the branches in red and green are concatenated together in 'fc7_fusion' to jointly predict the clothes categories and attributes and to model clothes pairs.
Forward Pass
A clothes image is fed into the network;
it is passed through the branch in blue to predict the landmarks' locations;
the estimated landmarks are employed to pool or gate the features in 'pool5_local',
which is a landmark pooling layer, leading to local features that are invariant to the deformations and occlusions of clothes;
the global features of 'fc6_global' and the landmark-pooled local features of 'fc6_local' are concatenated together in 'fc7_fusion' (see the sketch below).
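Below is a minimal PyTorch-style sketch of this forward pass, assuming a reduced conv backbone in place of the full VGG-16 layers; names such as `SimpleFashionNet`, `gate_landmark_features`, and the layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LANDMARKS = 8      # the dataset provides 4~8 landmarks per image
NUM_CATEGORIES = 50
NUM_ATTRIBUTES = 1000

def gate_landmark_features(feat, locs, vis):
    """Pick the feature vector at each predicted landmark and gate it by the
    landmark's visibility probability (a simplified stand-in for the landmark
    pooling layer, which is sketched in more detail further below)."""
    B, C, H, W = feat.shape
    xs = (locs[..., 0] * (W - 1)).long().clamp(0, W - 1)    # (B, L)
    ys = (locs[..., 1] * (H - 1)).long().clamp(0, H - 1)
    out = []
    for b in range(B):
        vecs = feat[b, :, ys[b], xs[b]].t()                 # (L, C) features at landmarks
        out.append((vecs * vis[b].unsqueeze(1)).flatten())  # gate by visibility, then flatten
    return torch.stack(out)                                 # (B, L * C)

class SimpleFashionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # shared conv backbone (a small stand-in for the VGG conv layers)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # blue branch: landmark locations (normalized x, y) and visibility states
        self.landmark_loc = nn.Linear(256, NUM_LANDMARKS * 2)
        self.landmark_vis = nn.Linear(256, NUM_LANDMARKS * 2)
        # red branch: global features / green branch: landmark-pooled local features
        self.fc6_global = nn.Linear(256, 1024)
        self.fc6_local = nn.Linear(NUM_LANDMARKS * 256, 1024)
        # fusion and task heads
        self.fc7_fusion = nn.Linear(2048, 1024)
        self.category_head = nn.Linear(1024, NUM_CATEGORIES)
        self.attribute_head = nn.Linear(1024, NUM_ATTRIBUTES)

    def forward(self, images):
        feat = self.backbone(images)                        # (B, 256, H, W)
        pooled = F.adaptive_avg_pool2d(feat, 1).flatten(1)  # (B, 256)

        # blue branch: predict landmark locations and visibility
        locs = torch.sigmoid(self.landmark_loc(pooled)).view(-1, NUM_LANDMARKS, 2)
        vis_logits = self.landmark_vis(pooled).view(-1, NUM_LANDMARKS, 2)
        vis = vis_logits.softmax(dim=-1)[..., 1]            # probability of "visible"

        # green branch: pool/gate local features at the predicted landmarks
        local = gate_landmark_features(feat, locs, vis)     # (B, L * 256)

        # red branch + fusion
        global_feat = F.relu(self.fc6_global(pooled))
        local_feat = F.relu(self.fc6_local(local))
        fused = F.relu(self.fc7_fusion(torch.cat([global_feat, local_feat], dim=1)))

        return {
            "category": self.category_head(fused),          # clothes category logits
            "attributes": self.attribute_head(fused),       # attribute logits
            "embedding": fused,                             # used for the triplet loss
            "landmark_locs": locs,
            "landmark_vis": vis_logits,
        }
```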
Backward Pass
Four types of loss functions (a combined sketch follows this list):
Landmark localization: regression loss (L2 norm gated by visibility variables) -> errors are not back-propagated when a landmark is occluded
Landmark visibility: softmax loss
Clothes categories: softmax loss
Attribute prediction: weighted cross-entropy loss
Two coefficients are determined by the ratio of the numbers of positive and negative samples (to control the imbalance between attributes)
Pairwise clothes images: triplet loss for metric learning
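A minimal sketch of the four loss types, under the assumptions of the forward-pass sketch above; the triplet margin, the epsilon constants, and the batch-level computation of the attribute weights (the paper derives them from positive/negative ratios, ideally over the whole training set) are illustrative choices.

```python
import torch
import torch.nn.functional as F

def landmark_losses(pred_locs, pred_vis_logits, gt_locs, gt_vis):
    """L2 regression on landmark locations, masked so that errors are not
    back-propagated for occluded landmarks, plus a softmax (cross-entropy)
    loss on the visibility states."""
    mask = gt_vis.float().unsqueeze(-1)                               # (B, L, 1): 1 if visible
    loc_loss = (mask * (pred_locs - gt_locs) ** 2).sum() / mask.sum().clamp(min=1.0)
    vis_loss = F.cross_entropy(pred_vis_logits.flatten(0, 1), gt_vis.flatten().long())
    return loc_loss, vis_loss

def attribute_weights(attr_targets):
    """Per-attribute positive/negative weights from the ratio of positive to
    negative samples, to counter label imbalance (one possible instantiation,
    here computed per batch for brevity)."""
    pos = attr_targets.sum(dim=0).clamp(min=1.0)
    neg = (attr_targets.shape[0] - pos).clamp(min=1.0)
    return neg / (pos + neg), pos / (pos + neg)                       # w_pos, w_neg

def attribute_loss(attr_logits, attr_targets, w_pos, w_neg):
    """Weighted cross-entropy over the binary attribute labels."""
    p = torch.sigmoid(attr_logits)
    loss = -(w_pos * attr_targets * torch.log(p + 1e-8)
             + w_neg * (1 - attr_targets) * torch.log(1 - p + 1e-8))
    return loss.mean()

def total_loss(outputs, targets, anchor, positive, negative, lambdas):
    """Weighted sum of the four loss types; the lambdas implement the loss
    weights used by the two-stage training strategy described below."""
    loc_loss, vis_loss = landmark_losses(
        outputs["landmark_locs"], outputs["landmark_vis"],
        targets["landmark_locs"], targets["landmark_vis"])
    cat_loss = F.cross_entropy(outputs["category"], targets["category"])
    w_pos, w_neg = attribute_weights(targets["attributes"])
    attr_loss = attribute_loss(outputs["attributes"], targets["attributes"], w_pos, w_neg)
    trip_loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.3)
    return (lambdas["loc"] * loc_loss + lambdas["vis"] * vis_loss
            + lambdas["cat"] * cat_loss + lambdas["attr"] * attr_loss
            + lambdas["triplet"] * trip_loss)
```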
Iterative training strategy:
Similar to the 4-stage training of Faster R-CNN, but with 2 stages here:
Stage 1 (joint training): all four losses are back-propagated simultaneously, but the emphasis is on landmark localization,
which is implemented by setting different loss weights:
treat the branch in blue as the main task and the remaining branches as the auxiliary tasks,
i.e., set the loss weights of landmark visibility and localization to be large, while the others have small weights.
The assumption is clearly that joint optimization of correlated multi-tasks speeds up convergence.
Stage 2: predicting clothing categories and attributes, as well as learning the pairwise relations between clothes images:
this stage mainly learns category classification, attribute classification, and pairwise correspondences, while the landmark locations and visibility are used by the branch in green to extract local features.
(The branch in blue may or may not keep learning here; if it does, its loss weights are very small. An illustrative weight schedule follows.)
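A tiny illustration of how the two-stage strategy could be expressed as loss-weight dictionaries for the `total_loss` sketch above; the specific weight values and epoch counts are assumptions, not the paper's settings.

```python
# Stage 1 emphasizes the blue branch (landmark localization and visibility);
# Stage 2 emphasizes categories, attributes, and pairwise relations.
stage1_lambdas = {"loc": 5.0, "vis": 5.0, "cat": 0.1, "attr": 0.1, "triplet": 0.1}
stage2_lambdas = {"loc": 0.1, "vis": 0.1, "cat": 1.0, "attr": 1.0, "triplet": 1.0}

for stage, lambdas, epochs in [(1, stage1_lambdas, 10), (2, stage2_lambdas, 20)]:
    for epoch in range(epochs):
        pass  # train one epoch, calling total_loss(..., lambdas=lambdas) from the sketch above
```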
Landmark Pooling Layer
Similar to an RoI-pooling layer, except that the bounding box is replaced by a landmark's location and visibility.
If a landmark is visible, a region around its projected location on the feature map (e.g., conv4) is selected
and max-pooled; if it is invisible, the corresponding output features are filled with zeros.
Finally, the local features from all landmarks are concatenated.
***** I am curious how the region size is determined from the location ***** (the sketch below simply treats it as a hyperparameter)
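A minimal sketch of such a landmark pooling layer, assuming a fixed square region of size `region` around each visible landmark (the exact region size used in the paper is the open question noted above); `landmark_roi_pool` and its arguments are illustrative names.

```python
import torch

def landmark_roi_pool(feat, locs, visible, region=3):
    """feat: (B, C, H, W) feature map (e.g. conv4); locs: (B, L, 2) normalized
    (x, y) landmark locations; visible: (B, L) boolean visibility.
    Max-pools a small square region around each visible landmark; invisible
    landmarks contribute all-zero features. Returns (B, L * C)."""
    B, C, H, W = feat.shape
    L = locs.shape[1]
    half = region // 2
    out = feat.new_zeros(B, L, C)
    for b in range(B):
        for l in range(L):
            if not visible[b, l]:
                continue                                    # occluded -> keep zeros
            cx = int(locs[b, l, 0].item() * (W - 1))        # project to feature-map coords
            cy = int(locs[b, l, 1].item() * (H - 1))
            x0, x1 = max(cx - half, 0), min(cx + half + 1, W)
            y0, y1 = max(cy - half, 0), min(cy + half + 1, H)
            out[b, l] = feat[b, :, y0:y1, x0:x1].amax(dim=(1, 2))  # max-pool the region
    return out.flatten(1)                                   # concatenate all landmarks
```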
6 evaluation protocols
Category and Attribute Prediction:
63720 images
Top-k accuracy for categories
Top-k recall for attributes
In-Shop Clothes Retrieval
54642 images, 11735 clothing items
Top-k retrieval accuracy
Consumer-to-Shop Clothes Retrieval
251361 consumer-to-shop image pairs from Mogujie
Top-k retrieval accuracy (metric sketches follow)
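Minimal sketches of the top-k metrics used by these protocols (top-k category accuracy and top-k retrieval accuracy based on embedding similarity); the cosine-similarity retrieval and all variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topk_accuracy(category_logits, labels, k=5):
    """Top-k accuracy for category prediction: the true category must appear
    among the k highest-scoring categories."""
    topk = category_logits.topk(k, dim=1).indices           # (N, k)
    return (topk == labels.unsqueeze(1)).any(dim=1).float().mean().item()

def topk_retrieval_accuracy(query_emb, gallery_emb, query_ids, gallery_ids, k=20):
    """Top-k retrieval accuracy: a query counts as a hit if at least one of its
    k nearest gallery images shares the same clothing-item id."""
    q = F.normalize(query_emb, dim=1)
    g = F.normalize(gallery_emb, dim=1)
    sims = q @ g.t()                                         # cosine similarities
    topk = sims.topk(k, dim=1).indices                       # (Nq, k)
    hits = (gallery_ids[topk] == query_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```

Top-k recall for attributes can be computed analogously: for each image, count how many of its ground-truth attributes appear among the k attributes with the highest predicted scores, divided by the number of ground-truth attributes.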
7 experiments
From the experiments, multi-task joint training brings the largest gain for category classification; both the landmark-pooled/gated features and the attribute classification task improve category classification.
The paper also mentions that human pose estimation could replace landmark localization, but it does not perform as well as landmark localization.
---
Overall, the framework is fairly classic: multi-task joint training,
e.g., landmark localization assists local feature extraction, so that both global and local information is learned and the feature space becomes more discriminative.
The drawback is that such rich annotation information has to be constructed.
In addition, the paper does not analyze how much the triplet training contributes to the results.
---
It seems that in recent years clothes parsing & segmentation, clothes category classification, and clothes/instance search & retrieval have become quite popular,
with papers on these topics at all the major conferences.
Young folks, you might consider this direction; maybe the next first-author paper will be yours.