CVPR 2016: Reading Notes on All Paper Abstracts




Image Captioning and Question Answering

Monday, June 27th, 9:00AM - 10:05AM.

These papers will also be presented at the following poster session

1   Deep Compositional Captioning: Describing Novel Object Categories Without Paired Training Data. 

Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell

The gist: their model can describe objects that never appear in the paired image-sentence corpus. In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets.

2   Generation and Comprehension of Unambiguous Object Descriptions. 

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, Kevin Murphy

Another flavor of image description; the authors note that this kind of description admits an objective evaluation metric. We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described.

3   Stacked Attention Networks for Image Question Answering. 

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola

Visual question answering: e.g., ask how many people are in the picture and have the model answer, which feels even harder than captioning. The novelty here is a stacked attention network. This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images.
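A quick numpy sketch of the attention idea in the spirit of SAN (my illustration, not the authors' code; all shapes and weight names are made up, and a real model learns the weights end-to-end):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_hop(V, u, Wv, Wu, w):
    """One attention hop: score each image region against the query,
    then refine the query with the attended visual vector."""
    h = np.tanh(V @ Wv + u @ Wu)       # (regions, k)
    p = softmax(h @ w)                 # attention weights over regions
    v_tilde = p @ V                    # attended image vector
    return v_tilde + u, p              # refined query

d, k, regions = 8, 6, 14 * 14          # feature dim, hidden dim, 14x14 grid
V = rng.standard_normal((regions, d))  # region features (e.g., from a CNN)
u = rng.standard_normal(d)             # question embedding (e.g., from an LSTM)

# Stack two hops, as in SAN: the second hop sharpens the first one's focus.
Wv1, Wu1, w1 = rng.standard_normal((d, k)), rng.standard_normal((d, k)), rng.standard_normal(k)
Wv2, Wu2, w2 = rng.standard_normal((d, k)), rng.standard_normal((d, k)), rng.standard_normal(k)

u1, p1 = attention_hop(V, u, Wv1, Wu1, w1)
u2, p2 = attention_hop(V, u1, Wv2, Wu2, w2)   # u2 feeds the answer classifier
```

Stacking the second hop over the refined query is the paper's point: attention is queried repeatedly rather than once.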

4   Image Question Answering Using Convolutional Neural Network With Dynamic Parameter Prediction. 

Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han

Image QA again; the novelty is adding a dynamic parameter layer whose adaptive weights are predicted from the question by a GRU. We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions.
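A tiny sketch of the idea as I understand it (illustrative only: the paper predicts a small vector of candidate weights from the question and expands it into the full layer with a hashing trick, so the predictor stays small; all shapes below are made up):

```python
import numpy as np

def hashed_weights(candidates, shape, seed=0):
    """Expand a small candidate-weight vector into a full weight matrix:
    every entry is one candidate, picked by a fixed (seeded) random hash,
    with a random sign to decorrelate shared entries."""
    hash_rng = np.random.default_rng(seed)     # fixed seed = same mapping every call
    idx = hash_rng.integers(0, len(candidates), size=shape)
    sign = hash_rng.choice([-1.0, 1.0], size=shape)
    return sign * candidates[idx]

rng = np.random.default_rng(3)
q = rng.standard_normal(16)                    # question embedding (GRU output)
P = rng.standard_normal((16, 32))              # parameter-prediction layer
candidates = q @ P                             # 32 question-dependent candidates
W = hashed_weights(candidates, shape=(256, 100))  # full dynamic fc layer

img_feat = rng.standard_normal(256)            # CNN image feature
scores = img_feat @ W                          # question-conditioned answer scores
```

The point of the hashing step is that a 256×100 layer is conditioned on the question while only 32 numbers are actually predicted.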

5   Neural Module Networks. 

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein

Also aimed at image QA; the novelty is jointly considering two things: the question representation and the language model (though, arguably, every approach considers both). I have not read it closely yet. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.)




Language and Vision

Monday, June 27th, 10:05AM - 10:30AM.

These papers will also be presented at the following poster session

6   Learning Deep Representations of Fine-Grained Visual Descriptions. 

Scott Reed, Zeynep Akata, Honglak Lee, Bernt Schiele

Tackles the zero-shot problem; I did not fully work out the novelty. Roughly it has two parts; judging from the title, it uses deep learning to obtain fine-grained feature representations. Our proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories.

7   Multi-Cue Zero-Shot Learning With Strong Supervision. 

Zeynep Akata, Mateusz Malinowski, Mario Fritz, Bernt Schiele


8   Latent Embeddings for Zero-Shot Classification. 

Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, Bernt Schiele


9   One-Shot Learning of Scene Locations via Feature Trajectory Transfer. 

Roland Kwitt, Sebastian Hegenbart, Marc Niethammer


10   Learning Attributes Equals Multi-Source Domain Generalization. 

Chuang Gan, Tianbao Yang, Boqing Gong


11   Anticipating Visual Representations From Unlabeled Video. 

Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

The core idea is using deep learning to predict the actions occurring in the next moment or time interval. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future.




Matching and Alignment

Monday, June 27th, 9:00AM - 10:05AM.

These papers will also be presented at the following poster session

12   Learning to Assign Orientations to Feature Points. 

Kwang Moo Yi, Yannick Verdie, Pascal Fua, Vincent Lepetit

Uses deep learning to assign an orientation to each feature point for use in matching, and also proposes a new activation function. We show how to train a Convolutional Neural Network to assign a canonical orientation to feature points given an image patch centered on the feature point.

13   Learning Dense Correspondence via 3D-Guided Cycle Consistency. 

Tinghui Zhou, Philipp Krähenbuhl, Mathieu Aubry, Qixing Huang, Alexei A. Efros

Also deep learning; the goal is to explore cross-instance similarity. We exploit this consistency as a supervisory signal to train a convolutional neural network to predict cross-instance correspondences between pairs of images depicting objects of the same category.

14   The Global Patch Collider. 

Shenlong Wang, Sean Ryan Fanello, Christoph Rhemann, Shahram Izadi, Pushmeet Kohli


15   Joint Probabilistic Matching Using m-Best Solutions. 

Seyed Hamid Rezatofighi, Anton Milan, Zhen Zhang, Qinfeng Shi, Anthony Dick, Ian Reid


16   Face Alignment Across Large Poses: A 3D Solution. 

Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, Stan Z. Li

A deep-learning-based 3D face alignment technique. We propose a solution to the three problems in a new alignment framework, called 3D Dense Face Alignment (3DDFA), in which a dense 3D face model is fitted to the image via a convolutional neural network (CNN).




Segmentation and Contour Detection

Monday, June 27th, 10:05AM - 10:30AM.

These papers will also be presented at the following poster session

17   Interactive Segmentation on RGBD Images via Cue Selection. 

Jie Feng, Brian Price, Scott Cohen, Shih-Fu Chang


18   Layered Scene Decomposition via the Occlusion-CRF. 

Chen Liu, Pushmeet Kohli, Yasutaka Furukawa


19   Affinity CNN: Learning Pixel-Centric Pairwise Relations for Figure/Ground Embedding. 

Michael Maire, Takuya Narihira, Stella X. Yu

Learns an affinity matrix via deep learning. Not that interesting to me. We train a convolutional neural network (CNN) to directly predict the pairwise relationships that define this affinity matrix.

20   Weakly Supervised Object Boundaries. 

Anna Khoreva, Rodrigo Benenson, Mohamed Omran, Matthias Hein, Bernt Schiele


21   Object Contour Detection With a Fully Convolutional Encoder-Decoder Network. 

Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, Ming-Hsuan Yang

A contour detection algorithm based on a fully convolutional encoder-decoder network. We develop a deep learning algorithm for contour detection with a fully convolutional encoder-decoder network.




Poster Session 1-1. Monday, June 27th, 10:30AM - 12:30PM.

Images and Language

22   What Value Do Explicit High Level Concepts Have in Vision to Language Problems?. 

Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, Anton van den Hengel

Image captioning again: existing methods map objects in the image directly to text without exploiting high-level semantics, and the paper's novelty is bringing that high-level information in. We propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.

Edge Contour Detection

23   Fast Detection of Curved Edges at Low SNR. 

Nati Ofir, Meirav Galun, Boaz Nadler, Ronen Basri


24   Object Skeleton Extraction in Natural Images by Fusing Scale-Associated Deep Side Outputs. 

Wei Shen, Kai Zhao, Yuan Jiang, Yan Wang, Zhijiang Zhang, Xiang Bai

A skeleton extraction algorithm built on a fully convolutional network; apparently the same paper I saw on a blog earlier. In this paper, we present a fully convolutional network with multiple scale-associated side outputs to address this problem. By observing the relationship between the receptive field sizes of the sequential stages in the network and the skeleton scales they can capture, we introduce a scale-associated side output to each stage.

25   Learning Relaxed Deep Supervision for Better Edge Detection. 

Yu Liu, Michael S. Lew

A deep-learning-based edge detection algorithm. We propose using relaxed deep supervision (RDS) within convolutional neural networks for edge detection.

26   Occlusion Boundary Detection via Deep Exploration of Context. 

Huan Fu, Chaohui Wang, Dacheng Tao, Michael J. Black

Occlusion boundary detection, based on deep learning. In this paper, we improve occlusion boundary detection via enhanced exploration of contextual information (e.g., local structural boundary patterns, observations from surrounding regions, and temporal context), and in doing so develop a novel approach based on convolutional neural networks (CNNs) and conditional random fields (CRFs).

27   SemiContour: A Semi-Supervised Learning Approach for Contour Detection. 

Zizhao Zhang, Fuyong Xing, Xiaoshuang Shi, Lin Yang


Feature Extraction and Description

28   Learning to Localize Little Landmarks. 

Saurabh Singh, Derek Hoiem, David Forsyth


29   InterActive: Inter-Layer Activeness Propagation. 

Lingxi Xie, Liang Zheng, Jingdong Wang, Alan L. Yuille, Qi Tian

Roughly an improvement on activations. In this paper, we present InterActive, a novel algorithm which computes the activeness of neurons and network connections.

30   Exploit Bounding Box Annotations for Multi-Label Object Recognition. 

Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, Jianfei Cai

Applies deep learning to multi-label object recognition, which really means recognizing multiple objects per image. In this paper, we incorporate local information to enhance the feature discriminative power.

31   TI-POOLING: Transformation-Invariant Pooling for Feature Learning in Convolutional Neural Networks. 

Dmitry Laptev, Nikolay Savinov, Joachim M. Buhmann, Marc Pollefeys

A new pooling operator. In this paper we present a deep neural network topology that incorporates a simple to implement transformation-invariant pooling operator (TI-pooling).
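The operator itself is easy to sketch: run every transformed copy of the input through the same (weight-shared) branch and take an elementwise max over the transforms. Below is a toy numpy version using rotations as the transform set; `features` is just a stand-in for the shared network branch:

```python
import numpy as np

def features(x, W):
    """Stand-in for the shared conv branch (weights W are shared
    across all transformed copies of the input)."""
    return np.maximum(x.ravel() @ W, 0.0)   # ReLU features

def ti_pool(x, transforms, W):
    """TI-pooling: run every transformed copy through the same branch
    and keep the elementwise max over transforms."""
    return np.max([features(t(x), W) for t in transforms], axis=0)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
W = rng.standard_normal((64, 16))
rotations = [lambda a, k=k: np.rot90(a, k) for k in range(4)]

f = ti_pool(x, rotations, W)
f_rot = ti_pool(np.rot90(x), rotations, W)   # same feature for a rotated input
```

Because the max runs over the whole orbit of rotations, the pooled feature does not change when the input is rotated, which is the invariance the paper is after.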

32   Fashion Style in 128 Floats: Joint Ranking and Classification Using Weak Data for Feature Extraction. 

Edgar Simo-Serra, Hiroshi Ishikawa


33   Equiangular Kernel Dictionary Learning With Applications to Dynamic Texture Analysis. 

Yuhui Quan, Chenglong Bao, Hui Ji


34   Compact Bilinear Pooling. 

Yang Gao, Oscar Beijbom, Ning Zhang, Trevor Darrell


Feature Extraction and Matching

35   Accumulated Stability Voting: A Robust Descriptor From Descriptors of Multiple Scales. 

Tsun-Yi Yang, Yen-Yu Lin, Yung-Yu Chuang


36   CoMaL: Good Features to Match on Object Boundaries. 

Swarna K. Ravindran, Anurag Mittal


37   Progressive Feature Matching With Alternate Descriptor Selection and Correspondence Enrichment. 

Yuan-Ting Hu, Yen-Yu Lin


Image Segmentation

38   A New Finsler Minimal Path Model With Curvature Penalization for Image Segmentation and Closed Contour Detection. 

Da Chen, Jean-Marie Mirebeau, Laurent D. Cohen


39   Scale-Aware Alignment of Hierarchical Image Segmentation. 

Yuhua Chen, Dengxin Dai, Jordi Pont-Tuset, Luc Van Gool


40   Deep Interactive Object Selection. 

Ning Xu, Brian Price, Scott Cohen, Jimei Yang, Thomas S. Huang

Deep-learning-based interactive object selection. In this paper, we present a novel deep-learning-based algorithm which has much better understanding of objectness and can reduce user interactions to just a few clicks.

41   Pull the Plug? Predicting If Computers or Humans Should Segment Images. 

Danna Gurari, Suyog Jain, Margrit Betke, Kristen Grauman


42   In the Shadows, Shape Priors Shine: Using Occlusion to Improve Multi-Region Segmentation. 

Yuka Kihara, Matvey Soloviev, Tsuhan Chen


43   Convexity Shape Constraints for Image Segmentation. 

Loic A. Royer, David L. Richmond, Carsten Rother, Bjoern Andres, Dagmar Kainmueller


44   MCMC Shape Sampling for Image Segmentation With Nonparametric Shape Priors. 

Ertunc Erdil, Sinan Yildirim, Müjdat Cetin, Tolga Tasdizen


Low-Level Vision

45   From Noise Modeling to Blind Image Denoising. 

Fengyuan Zhu, Guangyong Chen, Pheng-Ann Heng


46   Efficient and Robust Color Consistency for Community Photo Collections. 

Jaesik Park, Yu-Wing Tai, Sudipta N. Sinha, In So Kweon


47   Needle-Match: Reliable Patch Matching Under High Uncertainty. 

Or Lotan, Michal Irani


48   ReconNet: Non-Iterative Reconstruction of Images From Compressively Sensed Measurements. 

Kuldeep Kulkarni, Suhas Lohit, Pavan Turaga, Ronan Kerviche, Amit Ashok

Uses a convolutional neural network to reconstruct images from compressive sensing measurements. We propose a novel convolutional neural network (CNN) architecture which takes in CS measurements of an image as input and outputs an intermediate reconstruction.

49   Soft-Segmentation Guided Object Motion Deblurring. 

Jinshan Pan, Zhe Hu, Zhixun Su, Hsin-Ying Lee, Ming-Hsuan Yang


50   Two Illuminant Estimation and User Correction Preference. 

Dongliang Cheng, Abdelrahman Abdelhamed, Brian Price, Scott Cohen, Michael S. Brown


51   Deep Contrast Learning for Salient Object Detection. 

Guanbin Li, Yizhou Yu

Salient object detection. We propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream.

52   Multiview Image Completion With Space Structure Propagation. 

Seung-Hwan Baek, Inchang Choi, Min H. Kim


53   Composition-Preserving Deep Photo Aesthetics Assessment. 

Long Mai, Hailin Jin, Feng Liu

Photo aesthetics assessment with a ConvNet. We present a composition-preserving deep ConvNet method that directly learns aesthetics features from the original input images without any image transformations.

54   Automatic Image Cropping : A Computational Complexity Study. 

Jiansheng Chen, Gaocheng Bai, Shaoheng Liang, Zhengqin Li


55   A Deeper Look at Saliency: Feature Contrast, Semantics, and Beyond. 

Neil D. B. Bruce, Christopher Catton, Sasa Janjic


56   Spatially Binned ROC: A Comprehensive Saliency Metric. 

Calden Wloka, John Tsotsos


57   GraB: Visual Saliency via Novel Graph Model and Background Priors. 

Qiaosong Wang, Wen Zheng, Robinson Piramuthu


58   Predicting When Saliency Maps Are Accurate and Eye Fixations Consistent. 

Anna Volokitin, Michael Gygli, Xavier Boix


59   Split and Match: Example-Based Adaptive Patch Sampling for Unsupervised Style Transfer. 

Oriel Frigo, Neus Sabater, Julie Delon, Pierre Hellier


60   Detection and Accurate Localization of Circular Fiducials Under Highly Challenging Conditions. 

Lilian Calvet, Pierre Gurdjos, Carsten Griwodz, Simone Gasparini


Scene Understanding

61   Scene Recognition With CNNs: Objects, Scales and Dataset Bias. 

Luis Herranz, Shuqiang Jiang, Xiangyang Li

Uses CNNs for two things: combining object-centric and scene-centric knowledge, and handling scale; the main task is scene recognition. In this paper we address two related problems: 1) scale induced dataset bias in multi-scale convolutional neural network (CNN) architectures, and 2) how to combine effectively scene-centric and object-centric knowledge.

62   Learning Action Maps of Large Environments via First-Person Vision. 

Nicholas Rhinehart, Kris M. Kitani


63   Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. 

Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, Yi Ma

Builds a new dataset for dense crowd counting and proposes an MCNN that accepts images of arbitrary size; the M stands for multi-column, and each column adapts to a different head scale. This paper aims to develop a method that can accurately estimate the crowd count from an individual image with arbitrary crowd density and arbitrary perspective.
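To make the multi-column idea concrete, here is a toy sketch (mine, not the paper's architecture): each "column" is replaced by a normalized box blur with a different window size, the columns are fused by a weighted sum (the real MCNN fuses learned feature maps with a 1×1 conv), and the crowd count is the integral of the fused density map:

```python
import numpy as np

def box_blur(img, k):
    """Crude stand-in for a conv column with receptive field k:
    average over a k x k neighborhood (clipped at image borders)."""
    H, W = img.shape
    out = np.empty_like(img, dtype=float)
    r = k // 2
    for i in range(H):
        for j in range(W):
            out[i, j] = img[max(0, i-r):i+r+1, max(0, j-r):j+r+1].mean()
    return out

def mcnn_density(img, kernel_sizes=(3, 5, 9), weights=(1/3, 1/3, 1/3)):
    """Fuse columns with different receptive fields into one density map."""
    cols = [box_blur(img, k) for k in kernel_sizes]
    return sum(w * c for w, c in zip(weights, cols))

# Toy "head annotation" image: ones at head centers, away from the borders
img = np.zeros((32, 32))
for (i, j) in [(10, 10), (10, 22), (22, 10), (22, 22)]:
    img[i, j] = 1.0

density = mcnn_density(img)
count = density.sum()     # integrating the density map gives the crowd count, 4 here
```

Since each blur preserves mass and the fusion weights sum to one, the integral of the density map equals the number of seeded heads, which mirrors how counts are read off a predicted density map.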

64   Shallow and Deep Convolutional Networks for Saliency Prediction. 

Junting Pan, Elisa Sayrol, Xavier Giro-i-Nieto, Kevin McGuinness, Noel E. O'Connor

The paper claims to be the first to do saliency prediction with deep learning. It proposes one shallow and one deep network architecture, and optimizes a loss given by the Euclidean distance between the predicted and ground-truth saliency. This paper, however, addresses the problem with a completely data-driven approach by training a convolutional neural network (convnet).

65   Sample and Filter: Nonparametric Scene Parsing via Efficient Filtering. 

Mohammad Najafi, Sarah Taghavi Namin, Mathieu Salzmann, Lars Petersson


66   DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes. 

Saumitro Dasgupta, Kuan Fang, Kevin Chen, Silvio Savarese

Spatial layout estimation for indoor scenes; proposes a fully convolutional architecture together with a new optimization framework. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates.

67   A Text Detection System for Natural Scenes With Convolutional Feature Learning and Cascaded Classification. 

Siyu Zhu, Richard Zanibbi

The task is recognizing text in natural scenes, in roughly two stages: first detect text regions in the scene, then run a graph-based segmentation on the text-bearing regions. We propose a system that finds text in natural scenes using a variety of cues. Our novel data-driven method incorporates coarse-to-fine detection of character pixels using convolutional features (Text-Conv), followed by extracting connected components (CCs) from characters using edge and color features, and finally performing a graph-based segmentation of CCs into words (Word-Graph).

Segmentation and Saliency

68   Reversible Recursive Instance-Level Object Segmentation. 

Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Zequn Jie, Jiashi Feng, Liang Lin, Shuicheng Yan

Instance-level object segmentation, apparently based on a recursive neural network. In this work, we propose a novel Reversible Recursive Instance-level Object Segmentation (R2-IOS) framework to address the challenging instance-level object segmentation task.

69   Coherent Parametric Contours for Interactive Video Object Segmentation. 

Yao Lu, Xue Bai, Linda Shapiro, Jue Wang


70   Manifold SLIC: A Fast Method to Compute Content-Sensitive Superpixels. 

Yong-Jin Liu, Cheng-Chi Yu, Min-Jing Yu, Ying He


71   Deep Saliency With Encoded Low Level Distance Map and High Level Features. 

Gayoung Lee, Yu-Wing Tai, Junmo Kim

At this point it is clear that the earlier paper claiming to be the first deep-learning saliency method was overstating things: this paper's references show that prior deep-learning approaches already did well. The novelty here is modest: combine high-level and low-level features, where the low-level features pass through 1×1 convolutions and ReLU units before being fused with the high-level features for saliency detection. In this paper, we demonstrate that the hand-crafted features can provide complementary effects to enhance performance of saliency detection that utilizes only high level features. Our method utilizes both high level and low level features for saliency detection under a unified deep learning framework.

72   Instance-Level Segmentation for Autonomous Driving With Deep Densely Connected MRFs. 

Ziyu Zhang, Sanja Fidler, Raquel Urtasun


73   DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection. 

Nian Liu, Junwei Han

A deep network for saliency detection: DHSNet first makes a coarse prediction, then refines it with a modified recurrent network. In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convolutional neural networks for detecting salient objects.

74   Object Co-Segmentation via Graph Optimized-Flexible Manifold Ranking. 

Rong Quan, Junwei Han, Dingwen Zhang, Feiping Nie


Video Segmentation

75   Primary Object Segmentation in Videos via Alternate Convex Optimization of Foreground and Background Distributions. 

Won-Dong Jang, Chulwoo Lee, Chang-Su Kim


76   Automatic Fence Segmentation in Videos of Dynamic Scenes. 

Renjiao Yi, Jue Wang, Ping Tan


77   Discovering the Physical Parts of an Articulated Object Class From Multiple Videos. 

Luca Del Pero, Susanna Ricco, Rahul Sukthankar, Vittorio Ferrari


78   A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. 

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, Alexander Sorkine-Hornung


79   Learning Temporal Regularity in Video Sequences. 

Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, Larry S. Davis


80   Bilateral Space Video Segmentation. 

Nicolas Maerki, Federico Perazzi, Oliver Wang, Alexander Sorkine-Hornung


81   ReD-SFA: Relation Discovery Based Slow Feature Analysis for Trajectory Clustering. 

Zhang Zhang, Kaiqi Huang, Tieniu Tan, Peipei Yang, Jun Li





Object Recognition and Detection

Monday, June 27th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

82   Training Region-Based Object Detectors With Online Hard Example Mining. 

Abhinav Shrivastava, Abhinav Gupta, Ross Girshick

Still a region-based ConvNet detector, but with a heuristic twist: as in human learning, some examples are hard to recognize while others are easy, and training focuses on the hard ones. We present a simple yet surprisingly effective online hard example mining (OHEM) algorithm for training region-based ConvNet detectors.
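The core of OHEM is just a loss-based selection step between the forward and backward passes; a minimal sketch (ignoring the NMS step the paper uses to suppress co-located RoIs with correlated losses):

```python
import numpy as np

def ohem_select(losses, batch_size):
    """Online hard example mining: rank all candidate RoIs by their
    current loss and keep only the hardest ones for the backward pass."""
    return np.argsort(losses)[::-1][:batch_size]

rng = np.random.default_rng(2)
roi_losses = rng.random(2000)            # per-RoI losses from one forward pass
selected = ohem_select(roi_losses, 128)  # only these RoIs get gradients
```

The heuristic replaces hand-tuned foreground/background sampling ratios: whatever the network currently finds hard is, by definition, what it trains on.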

83   Deep Residual Learning for Image Recognition. 

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun


84   You Only Look Once: Unified, Real-Time Object Detection. 

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

An object detection algorithm that casts detection as regression: a neural network regresses bounding boxes directly, along with their probabilities. The paper claims to be fast and to outperform R-CNN. We present YOLO, a new approach to object detection.
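To see what "detection as regression" means, here is a sketch of decoding YOLO's S×S×(B·5+C) output tensor into boxes (my simplification of the layout; grid size, thresholds, and values are illustrative):

```python
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes

def decode(pred, conf_thresh=0.5):
    """Turn the S x S x (B*5 + C) regression output into boxes.
    Each cell predicts B boxes (x, y relative to the cell, w, h relative
    to the image, confidence) plus one class distribution."""
    boxes = []
    for i in range(S):
        for j in range(S):
            cell = pred[i, j]
            cls = int(np.argmax(cell[B * 5:]))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf < conf_thresh:
                    continue
                cx, cy = (j + x) / S, (i + y) / S   # image-relative center
                boxes.append((cx, cy, w, h, conf, cls))
    return boxes

pred = np.zeros((S, S, B * 5 + C))
pred[3, 4, :5] = [0.5, 0.5, 0.2, 0.3, 0.9]   # one confident box in cell (3, 4)
pred[3, 4, B * 5 + 11] = 1.0                 # class 11 for that cell
boxes = decode(pred)
```

The whole image is evaluated in a single forward pass, which is where the claimed speed comes from; a real pipeline would follow this with non-maximum suppression.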

85   LocNet: Improving Localization Accuracy for Object Detection. 

Spyros Gidaris, Nikos Komodakis

An object localization method, again bounding-box based and CNN driven, aimed at improving localization accuracy. We propose a novel object localization methodology with the purpose of boosting the localization accuracy of state-of-the-art object detection systems.

86   Sketch Me That Shoe. 

Qian Yu, Feng Liu, Yi-Zhe Song, Tao Xiang, Timothy M. Hospedales, Chen-Change Loy





Object Detection 1

Monday, June 27th, 2:50PM - 3:20PM.

These papers will also be presented at the following poster session

87   Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. 

Shuran Song, Jianxiong Xiao

Object detection in RGB-D images. Since the data are three-dimensional, the paper uses 3D convolutional networks; the analogy to video is direct: here the third dimension is depth, whereas in video it is time (the number of frames). In our approach, we propose the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D.

88   Object Detection From Video Tubelets With Convolutional Neural Networks. 

Kai Kang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang

Object detection in video: genuinely early work, arguably the first in this area, building on current CNN and tracking techniques. The lately introduced ImageNet task on object detection from video (VID) brings the object detection task into video domain, in which objects' locations at each frame are required to be annotated with bounding boxes. In this work, we introduce a complete framework for the VID task based on still-image object detection and general object tracking.

89   Learning With Side Information Through Modality Hallucination. 

Judy Hoffman, Saurabh Gupta, Trevor Darrell

Object detection described rather mysteriously; in essence it exploits depth as side information within a CNN. There is a novelty, but the abstract does not spell it out. We present a modality hallucination architecture for training an RGB object detection model which incorporates depth side information at training time.

90   Object-Proposal Evaluation Protocol is ‘Gameable’. 

Neelima Chavali, Harsh Agrawal, Aroma Mahendru, Dhruv Batra


91   HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection. 

Tao Kong, Anbang Yao, Yurong Chen, Fuchun Sun

CNN-based joint region proposal generation and object detection; effectively multi-task. In this paper, we present a deep hierarchical network, namely HyperNet, for handling region proposal generation and object detection jointly.

92   We Don't Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification. 

Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, Vittorio Ferrari


93   Factors in Finetuning Deep Model for Object Detection With Long-Tail Distribution. 

Wanli Ouyang, Xiaogang Wang, Cong Zhang, Xiaokang Yang

Focuses on long-tail distributions, which is essentially the imbalanced-data problem: fine-tuning a CNN on long-tailed data does not necessarily work well, so the paper appears to cluster classes into subsets. These classes/tasks have their individuality in discriminative visual appearance representation. Taking this individuality into account, we cluster objects into visually similar class groups and learn deep representations for these groups separately. A hierarchical feature learning scheme is proposed.




Vision With Alternative Sensors

Monday, June 27th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

94   Information-Driven Adaptive Structured-Light Scanners. 

Guy Rosman, Daniela Rus, John W. Fisher III


95   Simultaneous Optical Flow and Intensity Estimation From an Event Camera. 

Patrick Bardow, Andrew J. Davison, Stefan Leutenegger


96   Macroscopic Interferometry: Rethinking Depth Estimation With Frequency-Domain Time-Of-Flight. 

Achuta Kadambi, Jamie Schiel, Ramesh Raskar


97   ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks Using Angle Sensitive Pixels. 

Huaijin G. Chen, Suren Jayasuriya, Jiyue Yang, Judy Stephen, Sriram Sivaramakrishnan, Ashok Veeraraghavan, Alyosha Molnar


98   Computational Imaging for VLBI Image Reconstruction. 

Katherine L. Bouman, Michael D. Johnson, Daniel Zoran, Vincent L. Fish, Sheperd S. Doeleman, William T. Freeman





Video Analysis 1

Monday, June 27th, 2:50PM - 3:20PM.

These papers will also be presented at the following poster session

99   You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images. 

Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, Tao Mei

Essentially video classification; the main concern is insufficient training samples, addressed by querying Web videos and images by keyword and using them for training. Event analysis is also mentioned. In this paper, we propose a Lead--Exceed Neural Network (LENN), which reinforces the training on Web images and videos in a curriculum manner. Specifically, the training proceeds by inputting frames of Web videos to obtain a network. The Web images are then filtered by the learnt network and the selected images are additionally fed into the network to enhance the architecture and further trim the videos. In addition, Long Short-Term Memory (LSTM) can be applied on the trimmed videos to explore temporal information.

100   Track and Segment: An Iterative Unsupervised Approach for Video Object Proposals. 

Fanyi Xiao, Yong Jae Lee

Possibly not deep learning, but it is video related and uses spatio-temporal features; most importantly, it models the relationship between adjacent (temporal) frames. We present an unsupervised approach that generates a diverse, ranked set of bounding box and segmentation video object proposals---spatio-temporal tubes that localize the foreground objects---in an unannotated video.

101   Beyond Local Search: Tracking Objects Everywhere With Instance-Specific Proposals. 

Gao Zhu, Fatih Porikli, Hongdong Li


102   Groupwise Tracking of Crowded Similar-Appearance Targets From Low-Continuity Image Sequences. 

Hongkai Yu, Youjie Zhou, Jeff Simmons, Craig P. Przybyla, Yuewei Lin, Xiaochuan Fan, Yang Mi, Song Wang

This one also has little to do with deep learning, and the study is video related; what matters is that it deals with crowds, and since my future research direction will involve group events, it is worth considering. In this paper we propose a new groupwise method to explore the target group information and employ the within-group correlations for association and tracking. In particular, the within-group association is modeled by a nonrigid 2D Thin-Plate transform and a sequence of group shrinking, group growing and group merging operations are then developed to refine the composition of each group.

103   Social LSTM: Human Trajectory Prediction in Crowded Spaces. 

Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, Silvio Savarese

Also deals with crowds, and notably it applies LSTMs: it predicts the trajectory of an individual within a crowd or crowded space. Quite important; anything involving crowds is. We present a new Long Short-Term Memory (LSTM) model which jointly reasons across multiple individuals in a scene. Different from the conventional LSTM, we share the information between multiple LSTMs through a new pooling layer. This layer pools the hidden representation from LSTMs corresponding to neighboring trajectories to capture interactions within this neighborhood.
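The new pooling layer can be sketched as follows (my simplification: neighbours' hidden states are summed into a spatial grid centred on one person; the grid size, radius, and feature dimensions are made up):

```python
import numpy as np

def social_pool(positions, hidden, target, grid=4, radius=2.0):
    """Social pooling, sketched: sum the hidden states of neighbours
    into a grid x grid map centred on the target person. Neighbours
    falling in the same cell share a slot; far-away people are dropped."""
    D = hidden.shape[1]
    pooled = np.zeros((grid, grid, D))
    cell = 2 * radius / grid
    for p, h in zip(positions, hidden):
        dx, dy = p - positions[target]
        if (dx, dy) == (0.0, 0.0):
            continue    # skip the target's own state (a real impl tracks indices)
        gx = int((dx + radius) // cell)
        gy = int((dy + radius) // cell)
        if 0 <= gx < grid and 0 <= gy < grid:
            pooled[gx, gy] += h
    return pooled.ravel()   # concatenated with the target's own state downstream

positions = np.array([[0.0, 0.0], [0.5, 0.5], [10.0, 10.0]])
hidden = np.ones((3, 8))                 # per-person LSTM hidden states
v = social_pool(positions, hidden, target=0)
```

Only the nearby neighbour contributes to the pooled vector; the person 10 units away falls outside the grid, which is how the layer restricts interactions to a local neighbourhood.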

104   What Players Do With the Ball: A Physically Constrained Interaction Modeling. 

Andrii Maksai, Xinchao Wang, Pascal Fua

Models the relationship between players and the ball in sports video; the difficulties are occlusion, low resolution, and so on. In this paper, we propose a generic and principled approach to modeling the interaction between the ball and the players while also imposing appropriate physical constraints on the ball's trajectory.

105   Highlight Detection With Pairwise Deep Ranking for First-Person Video Summarization. 

Ting Yao, Tao Mei, Yong Rui

First-person (egocentric) video summarization. We propose a novel pairwise deep ranking model that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments. A two-stream network structure by representing video segments from complementary information on appearance of video frames and temporal dynamics across frames is developed for video highlight detection.




Poster Session 1-2. Monday, June 27th, 4:45PM - 6:45PM.

Events, Activities, and Surveillance

106   Direct Prediction of 3D Body Poses From Motion Compensated Sequences. 

Bugra Tekin, Artem Rozantsev, Vincent Lepetit, Pascal Fua


107   Video2GIF: Automatic Generation of Animated GIFs From Video. 

Michael Gygli, Yale Song, Liangliang Cao

Video to GIF, which amounts to ranking each segment by importance. We pose the question: Can we automate the entirely manual and elaborate process of GIF creation by leveraging the plethora of user generated GIF content? We propose a Robust Deep RankNet that, given a video, generates a ranked list of its segments according to their suitability as GIF. We train our model to learn what visual content is often selected for GIFs by using over 100K user generated GIFs and their corresponding video sources. We effectively deal with the noisy web data by proposing a novel adaptive Huber loss in the ranking formulation.

108   NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. 

Amir Shahroudy, Jun Liu, Tian-Tsong Ng, Gang Wang

Curiously it mentions temporal modeling; is that related to the depth channel? In any case, using the data directly would presumably require 3D convolutions. In addition, we propose a new recurrent neural network structure to model the long-term temporal correlation of the features for each body part, and utilize them for better action classification.

109   Progressively Parsing Interactional Objects for Fine Grained Action Detection. 

Bingbing Ni, Xiaokang Yang, Shenghua Gao

Fine-grained object parsing and detection, applying LSTM nodes at every frame. In this work, we propose an end-to-end system based on recursive neural network to perform frame by frame interactional object parsing, which can alleviate the difficulty through an incremental manner. Our key innovation is that: instead of jointly outputting all object detections at once, for each frame, we use a set of long-short term memory (LSTM) nodes to incrementally refine the detections.

110   Hierarchical Recurrent Neural Encoder for Video Representation With Application to Captioning. 

Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang

For video data; although the goal is probably video representations for captioning, the temporal modeling it discusses is worth referencing. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while attaining more non-linearity. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e. it can model the temporal transitions between frames as well as the transitions between segments.

111   From Keyframes to Key Objects: Video Summarization by Representative Object Proposal Selection. 

Jingjing Meng, Hongxing Wang, Junsong Yuan, Yap-Peng Tan

Not deep learning; the goal is to identify the several key objects that appear in a video. We propose to summarize a video into a few key objects by selecting representative object proposals generated from video frames.

112   Temporal Action Localization in Untrimmed Videos via Multi-Stage CNNs. 

Zheng Shou, Dongang Wang, Shih-Fu Chang

Localization again? There are ideas worth borrowing. We exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy.

113   Summary Transfer: Exemplar-Based Subset Selection for Video Summarization. 

Ke Zhang, Wei-Lun Chao, Fei Sha, Kristen Grauman

The main novelty is video summarization by transfer; apparently no deep learning involved. We propose a novel subset selection technique that leverages supervision in the form of human-created summaries to perform automatic keyframe-based video summarization. The main idea is to nonparametrically transfer summary structures from annotated videos to unseen test videos.

114   POD: Discovering Primary Objects in Videos Based on Evolutionary Refinement of Object Recurrence, Background, and Primary Object Models. 

Yeong Jun Koh, Won-Dong Jang, Chang-Su Kim

Primary object discovery in video, including filtering out noisy frames. A primary object discovery (POD) algorithm for a video sequence is proposed in this work, which is capable of discovering a primary object, as well as identifying noisy frames that do not contain the object. First, we generate object proposals for each frame. Then, we bisect each proposal into foreground and background regions, and extract features from each region. By superposing the foreground and background features, we build the object recurrence model, the background model, and the primary object model. We develop an iterative scheme to refine each model evolutionary using the information in the other models. Finally, using the evolved primary object model, we select candidate proposals and locate the bounding box of a primary object by merging the proposals selectively.

115   What If We Do Not Have Multiple Videos of the Same Action? — Video Action Localization Using Web Images. 

Waqas Sultani, Mubarak Shah

The task is spatio-temporal action localization when little video data is available; the approach is to exploit images mined from the web. This paper tackles the problem of spatio-temporal action localization in a video without assuming the availability of multiple videos or any prior annotations.

116   Beyond F-Formations: Determining Social Involvement in Free Standing Conversing Groups From Static Images. 

Lu Zhang, Hayley Hung


Fine Grained Categorization

117   DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations. 

Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, Xiaoou Tang

They build a new clothing dataset and propose a deep model for it. To demonstrate the advantages of DeepFashion, we propose a new deep model, namely FashionNet, which learns clothing features by jointly predicting clothing attributes and landmarks. The estimated landmarks are then employed to pool or gate the learned features.

118   SketchNet: Sketch Classification With Web Images. 

Hua Zhang, Si Liu, Changqing Zhang, Wenqi Ren, Rui Wang, Xiaochun Cao

A deep network for sketch classification with web images. We propose a deep convolutional neural network, named SketchNet. We firstly develop a triplet composed of sketch, positive and negative real image as the input of our neural network. To discover the coherent visual structures between the sketch and its positive pairs, we introduce the softmax as the loss function. Then a ranking mechanism is introduced to make the positive pairs obtain a higher score comparing over negative ones to achieve robust representation. Finally, we formalize the above-mentioned constraints into the unified objective function, and create an ensemble feature representation to describe the sketch images.

119   Embedding Label Structures for Fine-Grained Feature Representation. 

Xiaofan Zhang, Feng Zhou, Yuanqing Lin, Shaoting Zhang

Fine-grained feature representation: a multi-task learning framework with label structures embedded into it. However, previous studies have rarely focused on learning a fined-grained and structured feature representation that is able to locate relevant images at different levels of relevance, e.g., discovering cars from the same make or the same model, both of which require high precision. In this paper, we propose two main contributions to tackle this problem. 1) A multi-task learning framework is designed to effectively learn fine-grained feature representations by jointly optimizing both classification and similarity constraints. 2) To model the multi-level relevance, label structures such as hierarchy or shared attributes are seamlessly embedded into the framework by generalizing the triplet loss.

120   Fine-Grained Image Classification by Exploring Bipartite-Graph Labels. 

Feng Zhou, Yuanqing Lin


121   Picking Deep Filter Responses for Fine-Grained Image Recognition. 

Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, Qi Tian


122   SPDA-CNN: Unifying Semantic Part Detection and Abstraction for Fine-Grained Recognition. 

Han Zhang, Tao Xu, Mohamed Elhoseiny, Xiaolei Huang, Shaoting Zhang, Ahmed Elgammal, Dimitris Metaxas

Two sub-networks: one for detection, one for recognition. In this paper, we propose a new CNN architecture that integrates semantic part detection and abstraction (SPDA-CNN) for fine-grained classification. The proposed network has two sub-networks: one for detection and one for recognition. The detection sub-network has a novel top-down proposal method to generate small semantic part candidates for detection. The classification sub-network introduces novel part layers that extract features from parts detected by the detection sub-network, and combine them for recognition. 

123   Fine-Grained Categorization and Dataset Bootstrapping Using Deep Metric Learning With Humans in the Loop. 

Yin Cui, Feng Zhou, Yuanqing Lin, Serge Belongie

Also fine-grained classification; not of interest to me. In this work we propose a generic iterative framework for fine-grained categorization and dataset bootstrapping that handles these three challenges. Using deep metric learning with humans in the loop, we learn a low dimensional feature embedding with anchor points on manifolds for each category. These anchor points capture intra-class variances and remain discriminative between classes.

124   Mining Discriminative Triplets of Patches for Fine-Grained Classification. 

Yaming Wang, Jonghyun Choi, Vlad Morariu, Larry S. Davis


125   Part-Stacked CNN for Fine-Grained Visual Categorization. 

Shaoli Huang, Zhe Xu, Dacheng Tao, Ya Zhang

Not sure exactly what this is, but presumably another two-stream or part-combination design. In this paper, we propose a novel Part-Stacked CNN architecture that explicitly explains the fine-grained recognition process by modeling subtle differences from object parts. Based on manually-labeled strong part annotations, the proposed architecture consists of a fully convolutional network to locate multiple object parts and a two-stream classification network that encodes object-level and part-level cues simultaneously.

Feature Matching and Indexing

126   Learning Compact Binary Descriptors With Unsupervised Deep Neural Networks. 

Kevin Lin, Jiwen Lu, Chu-Song Chen, Jie Zhou

An unsupervised deep-learning algorithm for extracting feature descriptors. In this paper, we propose a new unsupervised deep learning approach called DeepBit to learn compact binary descriptor for efficient visual object matching. Unlike most existing binary descriptors which were designed with random projections or linear hash functions, we develop a deep neural network to learn binary descriptors in an unsupervised manner. We enforce three criteria on binary codes which are learned at the top layer of our network: 1) minimal loss quantization, 2) evenly distributed codes and 3) uncorrelated bits.
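The three criteria map naturally onto three penalty terms on sigmoid-like activations in [0, 1]. The toy losses below are my own illustration of the idea, not the paper's exact formulation:

```python
def quantization_loss(h):
    # 1) minimal quantization loss: push each activation toward 0 or 1
    return sum(min(x, 1 - x) ** 2 for x in h)

def balance_loss(h):
    # 2) evenly distributed codes: roughly half the bits should fire
    mean = sum(h) / len(h)
    return (mean - 0.5) ** 2

def decorrelation_loss(codes):
    # 3) uncorrelated bits: penalize pairwise bit correlation over a batch of codes
    n, n_bits = len(codes), len(codes[0])
    loss = 0.0
    for i in range(n_bits):
        for j in range(i + 1, n_bits):
            cov = sum((c[i] - 0.5) * (c[j] - 0.5) for c in codes) / n
            loss += cov ** 2
    return loss
```

In training the three terms would be weighted and summed; the weights here are left open since the paper's values are not quoted in this abstract.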

127   Solving Small-Piece Jigsaw Puzzles by Growing Consensus. 

Kilho Son, Daniel Moreno, James Hays, David B. Cooper


128   Pairwise Matching Through Max-Weight Bipartite Belief Propagation. 

Zhen Zhang, Qinfeng Shi, Julian McAuley, Wei Wei, Yanning Zhang, Anton van den Hengel


129   Structured Feature Similarity With Explicit Feature Map. 

Takumi Kobayashi


130   Temporal Epipolar Regions. 

Mor Dar, Yael Moses


Human ID

131   Recurrent Attention Models for Depth-Based Person Identification. 

Albert Haque, Alexandre Alahi, Li Fei-Fei

In any case there is something to borrow here, since it involves spatio-temporal features and is related to recurrent neural networks. Formulated as a reinforcement learning task, our model is based on a combination of convolutional and recurrent neural networks with the goal of identifying small, discriminative regions indicative of human identity.

132   Learning a Discriminative Null Space for Person Re-Identification. 

Li Zhang, Tao Xiang, Shaogang Gong


133   Learning Deep Feature Representations With Domain Guided Dropout for Person Re-Identification. 

Tong Xiao, Hongsheng Li, Wanli Ouyang, Xiaogang Wang

Also worth borrowing: it considers how to draw on other datasets when the number of training samples is insufficient. Learning generic and robust feature representations with data from multiple domains for the same problem is of great value, especially for the problems that have multiple datasets but none of them are large enough to provide abundant data variations. In this work, we present a pipeline for learning deep feature representations from multiple domains with Convolutional Neural Networks (CNNs). When training a CNN with data from all the domains, some neurons learn representations shared across several domains, while some others are effective only for a specific one. Based on this important observation, we propose a Domain Guided Dropout algorithm to improve the feature learning procedure. 
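The core idea of Domain Guided Dropout: for each domain, measure each neuron's impact (how much muting it changes that domain's loss) and drop the unhelpful neurons. A rough sketch of deterministic and stochastic variants — the logistic gate and its temperature are my assumptions:

```python
import math
import random

def domain_guided_mask(impacts, deterministic=True, temperature=1.0):
    """impacts[i]: increase in the domain's loss when neuron i is muted.
    A positive impact means the neuron is useful for this domain."""
    if deterministic:
        return [1.0 if s > 0 else 0.0 for s in impacts]
    # stochastic variant: keep probability rises with impact (logistic gate)
    return [1.0 if random.random() < 1 / (1 + math.exp(-s / temperature)) else 0.0
            for s in impacts]

def apply_mask(activations, mask):
    # zero out the activations of dropped neurons
    return [a * m for a, m in zip(activations, mask)]
```

Unlike standard dropout, the mask is tied to the domain of the current sample rather than drawn uniformly at random.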

134   How Far Are We From Solving Pedestrian Detection?. 

Shanshan Zhang, Rodrigo Benenson, Mohamed Omran, Jan Hosang, Bernt Schiele


135   Similarity Learning With Spatial Constraints for Person Re-Identification. 

Dapeng Chen, Zejian Yuan, Badong Chen, Nanning Zheng


136   Sample-Specific SVM Learning for Person Re-Identification. 

Ying Zhang, Baohua Li, Huchuan Lu, Atsushi Irie, Xiang Ruan


137   Joint Learning of Single-Image and Cross-Image Representations for Person Re-Identification. 

Faqiang Wang, Wangmeng Zuo, Liang Lin, David Zhang, Lei Zhang

Person re-identification, again with two sub-networks / multi-task learning. In this work, we exploit the connection between these two categories of methods, and propose a joint learning framework to unify SIR and CIR using convolutional neural network (CNN). Specifically, our deep architecture contains one shared sub-network together with two sub-networks that extract the SIRs of given images and the CIRs of given image pairs, respectively. The SIR sub-network is required to be computed once for each image (in both the probe and gallery sets), and the depth of the CIR sub-network is required to be minimal to reduce computational burden. Therefore, the two types of representation can be jointly optimized for pursuing better matching accuracy with moderate computational cost.

138   A Multi-Level Contextual Model For Person Recognition in Photo Albums. 

Haoxiang Li, Jonathan Brandt, Zhe Lin, Xiaohui Shen, Gang Hua


139   Unsupervised Cross-Dataset Transfer Learning for Person Re-Identification. 

Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pontil, Shaogang Gong, Tiejun Huang, Yonghong Tian

Not deep learning; also about how to extend across datasets. Most existing person re-identification (Re-ID) approaches follow a supervised learning framework, in which a large number of labelled matching pairs are required for training. This severely limits their scalability in real-world applications. To overcome this limitation, we develop a novel cross-dataset transfer learning approach to learn a discriminative representation. It is unsupervised in the sense that the target dataset is completely unlabelled. Specifically, we present a multi-task dictionary learning method which is able to learn a dataset-shared but target-data-biased representation.

140   Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry. 

Jiale Cao, Yanwei Pang, Xuelong Li


141   Recurrent Convolutional Network for Video-Based Person Re-Identification. 

Niall McLaughlin, Jesus Martinez del Rincon, Paul Miller

Considers video-level features — worth borrowing; there are three important components. In this paper we propose a novel recurrent neural network architecture for video-based person re-identification. Given the video sequence of a person, features are extracted from each frame using a convolutional neural network that incorporates a recurrent final layer, which allows information to flow between time-steps. The features from all time-steps are then combined using temporal pooling to give an overall appearance feature for the complete sequence. The convolutional network, recurrent layer, and temporal pooling layer, are jointly trained to act as a feature extractor for video-based re-identification using a Siamese network architecture. Our approach makes use of colour and optical flow information in order to capture appearance and motion information which is useful for video re-identification.
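The three pieces — per-frame CNN features, a recurrent layer, and temporal pooling — end in a simple aggregation step. A sketch of the pooling on plain feature lists (mean or max over time):

```python
def temporal_pool(frame_features, mode="mean"):
    """Collapse T per-frame feature vectors (each of length D) into one
    sequence-level appearance vector."""
    T, D = len(frame_features), len(frame_features[0])
    if mode == "mean":
        return [sum(f[d] for f in frame_features) / T for d in range(D)]
    if mode == "max":
        return [max(f[d] for f in frame_features) for d in range(D)]
    raise ValueError("unknown pooling mode: " + mode)
```

The paper trains this pooling jointly with the CNN and recurrent layer inside a Siamese setup; here it is shown in isolation.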

142   Person Re-Identification by Multi-Channel Parts-Based CNN With Improved Triplet Loss Function. 

De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, Nanning Zheng

Again the pattern of one network doing one thing and another doing another. In this paper, we present a novel multi-channel parts-based convolutional neural network (CNN) model under the triplet framework for person re-identification. Specifically, the proposed CNN model consists of multiple channels to jointly learn both the global full body and local body-parts features of the input persons. The CNN model is trained by an improved triplet loss function that serves to pull the instances of the same person closer, and at the same time push the instances belonging to different persons farther from each other in the learned feature space. 
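The "pull closer / push farther" objective is a triplet loss; the paper's improvement additionally constrains the absolute intra-class distance. A hedged sketch — the margin values are illustrative, not the paper's:

```python
def improved_triplet_loss(d_ap, d_an, margin=1.0, intra_margin=0.5):
    """d_ap: anchor-positive distance, d_an: anchor-negative distance.
    Term 1: positives must be closer than negatives by `margin`.
    Term 2: positives must also lie within `intra_margin` of the anchor,
    which is the 'improved' part relative to a plain triplet loss."""
    inter = max(0.0, d_ap - d_an + margin)
    intra = max(0.0, d_ap - intra_margin)
    return inter + intra
```

A plain triplet loss is only the first term; without the second, same-person instances can drift arbitrarily far apart as long as they stay relatively closer than impostors.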

143   Top-Push Video-Based Person Re-Identification. 

Jinjie You, Ancong Wu, Xiang Li, Wei-Shi Zheng


144   Improving Person Re-Identification via Pose-Aware Multi-Shot Matching. 

Yeong-Jun Cho, Kuk-Jin Yoon


145   Hierarchical Gaussian Descriptor for Person Re-Identification. 

Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, Yoichi Sato


Motion and Tracking

146   STCT: Sequentially Training Convolutional Networks for Visual Tracking. 

Lijun Wang, Wanli Ouyang, Xiaogang Wang, Huchuan Lu

Tracking with an ensemble flavor — I am not entirely clear on it, but each output channel is treated as a base learner. In this paper, we propose a sequential training method for convolutional neural networks (CNNs) to effectively transfer pre-trained deep features for online applications. We regard a CNN as an ensemble with each channel of the output feature map as an individual base learner. Each base learner is trained using different loss criterions to reduce correlation and avoid over-training. To achieve the best ensemble online, all the base learners are sequentially sampled into the ensemble via importance sampling. To further improve the robustness of each base learner, we propose to train the convolutional layers with random binary masks, which serves as a regularization to enforce each base learner to focus on different input features.

147   Determining Occlusions From Space and Time Image Reconstructions. 

Juan-Manuel Pérez-Rúa, Tomas Crivelli, Patrick Bouthemy, Patrick Pérez


148   Online Multi-Object Tracking via Structural Constraint Event Aggregation. 

Ju Hong Yoon, Chang-Ryeol Lee, Ming-Hsuan Yang, Kuk-Jin Yoon


149   Staple: Complementary Learners for Real-Time Tracking. 

Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, Philip H. S. Torr


150   Robust Optical Flow Estimation of Double-Layer Images Under Transparency or Reflection. 

Jiaolong Yang, Hongdong Li, Yuchao Dai, Robby T. Tan


151   Siamese Instance Search for Tracking. 

Ran Tao, Efstratios Gavves, Arnold W.M. Smeulders

Tracking, claiming to rely on very little. In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. 

152   Adaptive Decontamination of the Training Set: A Unified Formulation for Discriminative Visual Tracking. 

Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, Michael Felsberg


153   3D Part-Based Sparse Tracker With Automatic Synchronization and Registration. 

Adel Bibi, Tianzhu Zhang, Bernard Ghanem


154   Recurrently Target-Attending Tracking. 

Zhen Cui, Shengtao Xiao, Jiashi Feng, Shuicheng Yan

Tracking with the drift problem in mind. Robust visual tracking is a challenging task in computer vision. Due to the accumulation and propagation of estimation error, model drifting often occurs and degrades the tracking performance. To mitigate this problem, in this paper we propose a novel tracking method called Recurrently Target-attending Tracking (RTT). RTT attempts to identify and exploit those reliable parts which are beneficial for the overall tracking process. To bypass occlusion and discover reliable components, multi-directional Recurrent Neural Networks (RNNs) are employed in RTT to capture long-range contextual cues by traversing a candidate spatial region from multiple directions. The produced confidence maps from the RNNs are employed to adaptively regularize the learning of discriminative correlation filters by suppressing clutter background noises while making full use of the information from reliable parts.

Supervised Learning

155   Structured Regression Gradient Boosting. 

Ferran Diego, Fred A. Hamprecht

156   Loss Functions for Top-k Error: Analysis and Insights. 

Maksim Lapin, Matthias Hein, Bernt Schiele

157   Metric Learning as Convex Combinations of Local Models With Generalization Guarantees. 

Valentina Zantedeschi, Rémi Emonet, Marc Sebban

158   Efficient Training of Very Deep Neural Networks for Supervised Hashing. 

Ziming Zhang, Yuting Chen, Venkatesh Saligrama

An ADMM-inspired training algorithm for supervised hashing networks that reduces training time and improves efficiency. In this paper, we propose training very deep neural networks (DNNs) for supervised learning of hash codes. Existing methods in this context train relatively "shallow" networks limited by the issues arising in back propagation (e.g. vanishing gradients) as well as computational efficiency. We propose a novel and efficient training algorithm inspired by alternating direction method of multipliers (ADMM) that overcomes some of these limitations. Our method decomposes the training process into independent layer-wise local updates through auxiliary variables. Empirically we observe that our training algorithm always converges and its computational complexity is linearly proportional to the number of edges in the networks. 

159   Information Bottleneck Learning Using Privileged Information for Visual Recognition. 

Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, Gianfranco Doretto



Recognition and Parsing In 3D

Tuesday, June 28th, 9:00AM - 10:05AM.

160   3D Action Recognition From Novel Viewpoints. 

Hossein Rahmani, Ajmal Mian

Haven't read it carefully, but there must be something worth borrowing. We propose a human pose representation model that transfers human poses acquired from different unknown views to a view-invariant high-level space.

161   3D Shape Attributes. 

David F. Fouhey, Abhinav Gupta, Andrew Zisserman

Proposes 3D shape descriptors for images, naturally using a convolutional neural network. The paper claims a great deal of work — five contributions. In this paper we investigate 3D attributes as a means to understand the shape of an object in a single image.

162   Three-Dimensional Object Detection and Layout Prediction Using Clouds of Oriented Gradients. 

Zhile Ren, Erik B. Sudderth


163   3D Semantic Parsing of Large-Scale Indoor Spaces. 

Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, Silvio Savarese


164   Dense Human Body Correspondences Using Convolutional Networks. 

Lingyu Wei, Qixing Huang, Duygu Ceylan, Etienne Vouga, Hao Li




Recognition Beyond Objects

Tuesday, June 28th, 10:05AM - 10:30AM.

165   Geometry-Informed Material Recognition. 

Joseph DeGol, Mani Golparvar-Fard, Derek Hoiem


166   Towards Open Set Deep Networks. 

Abhijit Bendale, Terrance E. Boult


167   What's Wrong With That Object? Identifying Images of Unusual Objects by Modelling the Detection Score Distribution. 

Peng Wang, Lingqiao Liu, Chunhua Shen, Zi Huang, Anton van den Hengel, Heng Tao Shen


168   Large-Scale Location Recognition and the Geometric Burstiness Problem. 

Torsten Sattler, Michal Havlena, Konrad Schindler, Marc Pollefeys


169   Regularity-Driven Facade Matching Between Aerial and Street Views. 

Mark Wolff, Robert T. Collins, Yanxi Liu


170   Do Computational Models Differ Systematically From Human Object Perception?. 

R. T. Pramod, S. P. Arun




Image Processing and Restoration

Tuesday, June 28th, 9:00AM - 10:05AM.

171   Contour Detection in Unstructured 3D Point Clouds. 

Timo Hackel, Jan D. Wegner, Konrad Schindler


172   Unsupervised Learning of Edges. 

Yin Li, Manohar Paluri, James M. Rehg, Piotr Dollár


173   Blind Image Deblurring Using Dark Channel Prior. 

Jinshan Pan, Deqing Sun, Hanspeter Pfister, Ming-Hsuan Yang


174   Deeply-Recursive Convolutional Network for Image Super-Resolution. 

Jiwon Kim, Jung Kwon Lee, Kyoung Mu Lee

This feels like reuse of network layers: the recursion depth increases without adding new parameters. We propose an image super-resolution method (SR) using a deeply-recursive convolutional network (DRCN). Our network has a very deep recursive layer (up to 16 recursions). Increasing recursion depth can improve performance without introducing new parameters for additional convolutions. 

175   Accurate Image Super-Resolution Using Very Deep Convolutional Networks. 

Jiwon Kim, Jung Kwon Lee, Kyoung Mu Lee

Based on VGG: more layers and much higher learning rates, applied to super-resolution. We present a highly accurate single image superresolution (SR) method. Our method uses a very deep convolutional network inspired by VGG-net used for ImageNet classification [19]. We find increasing our network depth shows a significant improvement in accuracy. Our final model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. With very deep networks, however, convergence speed becomes a critical issue during training. We propose a simple yet effective training procedure. We learn residuals only and use extremely high learning rates (10^4 times higher than SRCNN [6]) enabled by adjustable gradient clipping. 



Image Processing and Restoration

Tuesday, June 28th, 10:05AM - 10:30AM.

176   RAW Image Reconstruction Using a Self-Contained sRGB-JPEG Image With Only 64 KB Overhead. 

Rang M. H. Nguyen, Michael S. Brown


177   Group MAD Competition - A New Methodology to Compare Objective Image Quality Models. 

Kede Ma, Qingbo Wu, Zhou Wang, Zhengfang Duanmu, Hongwei Yong, Hongliang Li, Lei Zhang


178   Non-Local Image Dehazing. 

Dana Berman, Tali Treibitz, Shai Avidan


179   A Holistic Approach to Cross-Channel Image Noise Modeling and Its Application to Image Denoising. 

Seonghyeon Nam, Youngbae Hwang, Yasuyuki Matsushita, Seon Joo Kim


180   Multispectral Images Denoising by Intrinsic Tensor Sparsity Regularization. 

Qi Xie, Qian Zhao, Deyu Meng, Zongben Xu, Shuhang Gu, Wangmeng Zuo, Lei Zhang

This one looks rather interesting. Nothing deep-learning related, but the tensor sparsity regularization is worth borrowing. In this paper, we propose a new tensor-based denoising approach by fully considering two intrinsic characteristics underlying an MSI, i.e., the global correlation along spectrum (GCS) and nonlocal self-similarity across space (NSS). In specific, we construct a new tensor sparsity measure, called intrinsic tensor sparsity (ITS) measure, which encodes both sparsity insights delivered by the most typical Tucker and CANDECOMP/PARAFAC (CP) low-rank decomposition for a general tensor.

181   A Comparative Study for Single Image Blind Deblurring. 

Wei-Sheng Lai, Jia-Bin Huang, Zhe Hu, Narendra Ahuja, Ming-Hsuan Yang




Poster Session 2-1. Tuesday, June 28th, 10:30AM - 12:30PM.

3D Vision

182   Spatiotemporal Bundle Adjustment for Dynamic 3D Reconstruction. 

Minh Vo, Srinivasa G. Narasimhan, Yaser Sheikh


183   Inextensible Non-Rigid Shape-From-Motion by Second-Order Cone Programming. 

Ajad Chhatkuli, Daniel Pizarro, Toby Collins, Adrien Bartoli


184   Optimal Relative Pose With Unknown Correspondences. 

Johan Fredriksson, Viktor Larsson, Carl Olsson, Fredrik Kahl


185   Homography Estimation From the Common Self-Polar Triangle of Separate Ellipses. 

Haifei Huang, Hui Zhang, Yiu-ming Cheung


186   Heterogeneous Light Fields. 

Maximilian Diebold, Bernd Jähne, Alexander Gatto


187   A Consensus-Based Framework for Distributed Bundle Adjustment. 

Anders Eriksson, John Bastian, Tat-Jun Chin, Mats Isaksson


188   Globally Optimal Manhattan Frame Estimation in Real-Time. 

Kyungdon Joo, Tae-Hyun Oh, Junsik Kim, In So Kweon


189   Mirror Surface Reconstruction Under an Uncalibrated Camera. 

Kai Han, Kwan-Yee K. Wong, Dirk Schnieders, Miaomiao Liu


190   A Hole Filling Approach Based on Background Reconstruction for View Synthesis in 3D Video. 

Guibo Luo, Yuesheng Zhu, Zhaotian Li, Liming Zhang


191   A Direct Least-Squares Solution to the PnP Problem With Unknown Focal Length. 

Yinqiang Zheng, Laurent Kneip


192   Efficient Intersection of Three Quadrics and Applications in Computer Vision. 

Zuzana Kukelova, Jan Heller, Andrew Fitzgibbon


193   Using Spatial Order to Boost the Elimination of Incorrect Feature Matches. 

Lior Talker, Yael Moses, Ilan Shimshoni


194   A Probabilistic Framework for Color-Based Point Set Registration. 

Martin Danelljan, Giulia Meneghetti, Fahad Shahbaz Khan, Michael Felsberg


Deblurring and Super-Resolution

195   Blind Image Deconvolution by Automatic Gradient Activation. 

Dong Gong, Mingkui Tan, Yanning Zhang, Anton van den Hengel, Qinfeng Shi


196   PSyCo: Manifold Span Reduction for Super Resolution. 

Eduardo Pérez-Pellitero, Jordi Salvador, Javier Ruiz-Hidalgo, Bodo Rosenhahn


197   Parametric Object Motion From Blur. 

Jochen Gast, Anita Sellent, Stefan Roth


198   Image Deblurring Using Smartphone Inertial Sensors. 

Zhe Hu, Lu Yuan, Stephen Lin, Ming-Hsuan Yang


199   Seven Ways to Improve Example-Based Single Image Super Resolution. 

Radu Timofte, Rasmus Rothe, Luc Van Gool


200   Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. 

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, Zehan Wang

The selling point is not anything else — it is speed: real-time video super-resolution. In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation.
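The sub-pixel convolution layer ends in a periodic shuffling step: r² low-resolution channels are rearranged into one high-resolution map. A pure-Python sketch of that rearrangement:

```python
def pixel_shuffle(channels, r):
    """channels: list of r*r feature maps, each an H x W 2-D list.
    Returns a single (H*r) x (W*r) map built from interleaved sub-pixels."""
    H, W = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (W * r) for _ in range(H * r)]
    for y in range(H * r):
        for x in range(W * r):
            c = (y % r) * r + (x % r)   # which LR channel supplies this sub-pixel
            out[y][x] = channels[c][y // r][x // r]
    return out
```

Because all convolutions run at low resolution and only this cheap rearrangement produces the HR output, the network avoids ever convolving in HR space — which is where the speed comes from.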

Events, Actions, and Activity Recognition

201   They Are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers. 

Xiaojun Chang, Yao-Liang Yu, Yi Yang, Eric P. Xing

Video event search, and without deep learning — probably of limited relevance to me. In this paper, we present a state-of-the-art event search system without any example videos. Relying on the key observation that events (e.g. dog show) are usually compositions of multiple mid-level concepts (e.g. "dog," "theater," and "dog jumping"), we first train a skip-gram model to measure the relevance of each concept with the event of interest. The relevant concept classifiers then cast votes on the test videos but their reliability, due to lack of labeled training videos, has been largely unaddressed. We propose to combine the concept classifiers based on a principled estimate of their accuracy on the unlabeled test videos. A novel warping technique is proposed to improve the performance and an efficient highly-scalable algorithm is provided to quickly solve the resulting optimization.

202   Going Deeper into First-Person Activity Recognition. 

Minghuang Ma, Haoqi Fan, Kris M. Kitani

Another two-stream method, though two streams do seem genuinely suited to video. This one is a useful reference. We propose a twin stream network architecture, where one stream analyzes appearance information and the other stream analyzes motion information. Our appearance stream encodes prior knowledge of the egocentric paradigm by explicitly training the network to segment hands and localize objects. By visualizing certain neuron activation of our network, we show that our proposed architecture naturally learns features that capture object attributes and hand-object configurations.

203   Cascaded Interactional Targeting Network for Egocentric Video Analysis. 

Yang Zhou, Bingbing Ni, Richang Hong, Xiaokang Yang, Qi Tian

First, a note on terminology: "egocentric video analysis" and "first-person activity recognition" are the same task under different names — analyzing actions from a first-person viewpoint. The problem splits into two tasks, and the method is motivated by the lack of precisely annotated datasets. This work aims to explicitly address these two issues via introducing a cascaded interactional targeting (i.e., infer both hand and active object regions) deep neural network. Firstly, a novel EM-like learning framework is proposed to train the pixel-level deep convolutional neural network (DCNN) by seamlessly integrating weakly supervised data (i.e., massive bounding box annotations) with a small set of strongly supervised data (i.e., fully annotated hand segmentation maps) to achieve state-of-the-art hand segmentation performance. Secondly, the resulting high-quality hand segmentation maps are further paired with the corresponding motion maps and object feature maps, in order to explore the contextual information among object, motion and hand to generate interactional foreground regions (operated objects).

204   Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos. 

Fabian Caba Heilbron, Juan Carlos Niebles, Bernard Ghanem

Also worth borrowing: it targets efficiency, which has always been a major obstacle in video processing. Current approaches for activity detection still struggle to handle large-scale video collections and the task remains relatively unexplored. This is in part due to the computational complexity of current action recognition approaches and the lack of a method that proposes fewer intervals in the video, where activity processing can be focused. In this paper, we introduce a proposal method that aims to recover temporal segments containing actions in untrimmed videos. Building on techniques for learning sparse dictionaries, we introduce a learning framework to represent and retrieve activity proposals. We demonstrate the capabilities of our method in not only producing high quality proposals but also in its efficiency.

205   Discriminative Hierarchical Rank Pooling for Activity Recognition. 

Basura Fernando, Peter Anderson, Marcus Hutter, Stephen Gould

A paper worth attention — essentially a new pooling algorithm, and a useful reference. We present hierarchical rank pooling, a video sequence encoding method for activity recognition. It consists of a network of rank pooling functions which captures the dynamics of rich convolutional neural network features within a video sequence. By stacking non-linear feature functions and rank pooling over one another, we obtain a high capacity dynamic encoding mechanism, which is used for action recognition. We present a method for jointly learning the video representation and activity classifier parameters.
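Rank pooling encodes a video by the parameters of a function trained to rank its frames in temporal order; stacking such pooling layers gives the hierarchy. As a hedged illustration, here is the closed-form "approximate rank pooling" weighting from the related dynamic-image line of work — a surrogate for the learned ranking parameters, not this paper's exact operator:

```python
def approx_rank_pool(frames):
    """frames: T feature vectors of length D, in temporal order.
    Weight frame t (1-based) by alpha_t = 2t - T - 1 and sum: later frames
    get positive weight, earlier ones negative, encoding temporal evolution."""
    T, D = len(frames), len(frames[0])
    pooled = [0.0] * D
    for t, f in enumerate(frames, start=1):
        alpha = 2 * t - T - 1
        for d in range(D):
            pooled[d] += alpha * f[d]
    return pooled
```

Unlike mean or max pooling, the result depends on frame order, which is exactly what "capturing dynamics" requires.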

206   Convolutional Two-Stream Network Fusion for Video Action Recognition. 

Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

This paper deserves attention: it does video analysis, thoroughly explores multiple fusion techniques, and draws many conclusions. Worth reading. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

207   Learning Activity Progression in LSTMs for Activity Detection and Early Detection. 

Shugao Ma, Leonid Sigal, Stan Sclaroff

Uses LSTMs, so certainly important. Moreover, it considers not just classification error but a finer-grained criterion — worth borrowing as an innovation in the loss function. In this work we improve training of temporal deep models to better learn activity progression for activity detection and early detection. Conventionally, when training a Recurrent Neural Network, specifically a Long Short Term Memory (LSTM) model, the training loss only considers classification error. However, we argue that the detection score of the correct activity category or the detection score margin between the correct and incorrect categories should be monotonically non-decreasing as the model observes more of the activity. We design novel ranking losses that directly penalize the model on violation of such monotonicities, which are used together with classification loss in training of LSTM models.
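The monotonicity constraint has a very compact expression: penalize every drop in the correct class's detection score as the LSTM sees more of the activity. A toy version of such a ranking loss (the paper's actual losses also cover the score margin between classes):

```python
def monotonicity_loss(scores):
    """scores: detection scores for the correct class at successive time-steps.
    Returns the summed hinge penalty for every decrease between steps."""
    return sum(max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:]))
```

In training this term would be added to the usual per-step classification loss, so the model is rewarded for growing ever more confident as evidence accumulates.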

208   VLAD3: Encoding Dynamics of Deep Features for Action Recognition. 

Yingwei Li, Weixin Li, Vijay Mahadevan, Nuno Vasconcelos

Addresses the problem of long videos, but of limited reference value here; it is not mainly deep learning. We propose a representation, VLAD for Deep Dynamics (VLAD^3), that accounts for different levels of video dynamics. It captures short-term dynamics with deep convolutional neural network features, relying on linear dynamic systems (LDS) to model medium-range dynamics. To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the LDS and pooled over the whole video, to arrive at the final VLAD^3 representation.

209   A Multi-Stream Bi-Directional Recurrent Neural Network for Fine-Grained Action Detection. 

Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, Ming Shao

Definitely worth referencing. It also grows out of the two-stream idea, but instead of operating on raw frames it first uses tracking to obtain a bounding box around the target and then applies the two-stream idea to the cropped box, additionally incorporating an LSTM and bidirectional information; highly relevant. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer.

210   A Hierarchical Deep Temporal Model for Group Activity Recognition. 

Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori

Also builds on the LSTM structure, with a two-stage design; worth borrowing. We build a deep model to capture these dynamics based on LSTM (long short-term memory) models. To make use of these observations, we present a 2-stage deep temporal model for the group activity recognition problem. In our model, a LSTM model is designed to represent action dynamics of individual people in a sequence and another LSTM model is designed to aggregate person-level information for whole activity understanding.

211   A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets. 

Ivan Lillo, Juan Carlos Niebles, Alvaro Soto

Complex action understanding: impressive-sounding, but no deep learning. In this paper, we introduce a new hierarchical model for human action recognition that is able to categorize complex actions performed in videos. Our model is also able to perform spatio-temporal annotation of the atomic actions that compose the overall complex action. That is, for each atomic action, the model generates temporal atomic action annotations by inferring the starting and ending times of the atomic action, as well as spatial annotations by inferring the human body parts that are involved in each atomic action. Our model has three key properties: (i) it can be trained with no spatial supervision, as it is able to automatically discover the relevant body parts from temporal action annotations only; (ii) its jointly learned poselet and actionlet representation encodes the visual variability of actions with good generalization power; (iii) its mechanism for handling noisy body pose estimates makes it robust to common pose estimation errors.

212   A Key Volume Mining Deep Framework for Action Recognition. 

Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, Yu Qiao

Considers the influence of irrelevant frames in a video: the forward pass mines key volumes, and the backward pass updates the parameters. Training with a large proportion of irrelevant volumes will hurt performance. To address this issue, we propose a key volume mining deep framework to identify key volumes and conduct classification simultaneously. Specifically, our framework is trained end-to-end in an EM-like loop. In the forward pass, our network mines key volumes for each action class. In the backward pass, it updates network parameters with the help of these mined key volumes. In addition, we propose "Stochastic out" to handle key volumes from multi-modalities, and an effective yet simple "unsupervised key volume proposal" method for high quality volume sampling.
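The EM-like loop can be caricatured in a few lines (toy NumPy sketch; the scores and shapes are illustrative assumptions):

```python
import numpy as np

def e_step(volume_scores):
    # Forward pass: mine the key volume for a class, i.e. the volume
    # the current network scores highest for that class.
    return int(np.argmax(volume_scores))

# toy scores the network assigns to 5 volumes of one video, one class
scores = np.array([0.10, 0.80, 0.30, 0.20, 0.05])
key = e_step(scores)
print(key)  # 1: volume 1 is mined as the key volume
# M-step (backward pass): backpropagate the classification loss only
# through the mined key volume, then re-score all volumes and repeat.
```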

Image Indexing and Retrieval

213   Improved Hamming Distance Search Using Variable Length Substrings. 

Eng-Jon Ong, Miroslaw Bober


214   Shortlist Selection With Residual-Aware Distance Estimator for K-Nearest Neighbor Search. 

Jae-Pil Heo, Zhe Lin, Xiaohui Shen, Jonathan Brandt, Sung-eui Yoon


215   Supervised Quantization for Similarity Search.

Xiaojuan Wang, Ting Zhang, Guo-Jun Qi, Jinhui Tang, Jingdong Wang


216   Efficient Large-Scale Approximate Nearest Neighbor Search on the GPU. 

Patrick Wieschollek, Oliver Wang, Alexander Sorkine-Hornung, Hendrik P. A. Lensch


217   Collaborative Quantization for Cross-Modal Similarity Search. 

Ting Zhang, Jingdong Wang


218   Aggregating Image and Text Quantized Correlated Components. 

Thi Quynh Nhi Tran, Hervé Le Borgne, Michel Crucianu


219   Efficient Indexing of Billion-Scale Datasets of Deep Descriptors. 

Artem Babenko, Victor Lempitsky


220   Deep Supervised Hashing for Fast Image Retrieval. 

Haomiao Liu, Ruiping Wang, Shiguang Shan, Xilin Chen

Deep hashing; not of much interest to me.

221   Efficient Large-Scale Similarity Search Using Matrix Factorization. 

Ahmet Iscen, Michael Rabbat, Teddy Furon


222   Incremental Object Discovery in Time-Varying Image Collections. 

Theodora Kontogianni, Markus Mathias, Bastian Leibe


Motion and Tracking

223   Detecting Migrating Birds at Night. 

Jia-Bin Huang, Rich Caruana, Andrew Farnsworth, Steve Kelling, Narendra Ahuja


Object Class Detection and Recognition

224   When Naïve Bayes Nearest Neighbors Meet Convolutional Neural Networks. 

Ilja Kuzborskij, Fabio Maria Carlucci, Barbara Caputo


225   Traffic-Sign Detection and Classification in the Wild. 

Zhe Zhu, Dun Liang, Songhai Zhang, Xiaolei Huang, Baoli Li, Shimin Hu

They build a traffic-sign benchmark and propose a CNN-based architecture; if it is only sign recognition, it is probably not too hard. Firstly, we have created a large traffic-sign benchmark from 100000 Tencent Street View panoramas, going beyond previous benchmarks. It provides 100000 images containing 30000 traffic-sign instances. These images cover large variations in illuminance and weather conditions. Each traffic-sign in the benchmark is annotated with a class label, its bounding box and pixel mask. We call this benchmark Tsinghua-Tencent 100K. Secondly, we demonstrate how a robust end-to-end convolutional neural network (CNN) can simultaneously detect and classify traffic-signs. Most previous CNN image processing solutions target objects that occupy a large proportion of an image, and such networks do not work well for target objects occupying only a small fraction of an image like the traffic-signs here.

226   Large Scale Semi-Supervised Object Detection Using Visual and Semantic Knowledge Transfer. 

Yuxing Tang, Josiah Wang, Boyang Gao, Emmanuel Dellandréa, Robert Gaizauskas, Liming Chen


227   Exploit All the Layers: Fast and Accurate CNN Object Detector With Scale Dependent Pooling and Cascaded Rejection Classifiers. 

Fan Yang, Wongun Choi, Yuanqing Lin

There should be techniques worth borrowing here, or at least helpful ideas. In this paper, we investigate two new strategies to detect objects accurately and efficiently using deep convolutional neural network: 1) scale-dependent pooling and 2) layer-wise cascaded rejection classifiers. The scale-dependent pooling (SDP) improves detection accuracy by exploiting appropriate convolutional features depending on the scale of candidate object proposals. The cascaded rejection classifiers (CRC) effectively utilize convolutional features and eliminate negative object proposals in a cascaded manner, which greatly speeds up the detection while maintaining high accuracy.

228   Dictionary Pair Classifier Driven Convolutional Neural Networks for Object Detection. 

Keze Wang, Liang Lin, Wangmeng Zuo, Shuhang Gu, Lei Zhang

Worth borrowing: essentially a new layer plus a new loss function. In this paper, we propose a novel object detection system by unifying DPL with the convolutional feature learning. Specifically, we incorporate DPL as a Dictionary Pair Classifier Layer (DPCL) into the deep architecture, and develop an end-to-end learning algorithm for optimizing the dictionary pairs and the neural networks simultaneously. Moreover, we design a multi-task loss for guiding our model to accomplish the three correlated tasks: objectness estimation, categoryness computation, and bounding box regression.

229   Monocular 3D Object Detection for Autonomous Driving. 

Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, Raquel Urtasun


230   How Hard Can It Be? Estimating the Difficulty of Visual Search in an Image. 

Radu Tudor Ionescu, Bogdan Alexe, Marius Leordeanu, Marius Popescu, Dim P. Papadopoulos, Vittorio Ferrari


231   Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles. 

Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, Tiejun Huang

Vehicle re-identification, with a new database and, of course, a deep learning architecture, but of limited reference value here. This paper focuses on an interesting but challenging problem, vehicle re-identification (a.k.a precise vehicle search). We propose a Deep Relative Distance Learning (DRDL) method which exploits a two-branch deep convolutional network to project raw vehicle images into a Euclidean space where distance can be directly used to measure the similarity of arbitrary two vehicles. To further facilitate the future research on this problem, we also present a carefully-organized large-scale image database "VehicleID", which includes multiple images of the same vehicle captured by different real-world cameras in a city.

Recognition and Detection

232   Eye Tracking for Everyone. 

Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Suchendra Bhandarkar, Wojciech Matusik, Antonio Torralba


233   Efficient Globally Optimal 2D-To-3D Deformable Shape Matching. 

Zorah Lähner, Emanuele Rodolà, Frank R. Schmidt, Michael M. Bronstein, Daniel Cremers


234   Ambiguity Helps: Classification With Disagreements in Crowdsourced Annotations. 

Viktoriia Sharmanska, Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Novi Quadrianto


235   A Task-Oriented Approach for Cost-Sensitive Recognition. 

Roozbeh Mottaghi, Hannaneh Hajishirzi, Ali Farhadi


236   Refining Architectures of Deep Convolutional Neural Networks. 

Sukrit Shankar, Duncan Robertson, Yani Ioannou, Antonio Criminisi, Roberto Cipolla

A fairly novel idea: rather than fine-tuning the parameters, it fine-tunes the network architecture itself. In this paper, we intend to answer this question and introduce a novel strategy that alters the architecture of a given CNN for a specified dataset, to potentially enhance the original accuracy while possibly reducing the model size. We use two operations for architecture refinement, viz. stretching and symmetrical splitting. Stretching increases the number of hidden units (nodes) in a given CNN layer, while a symmetrical split of say K between two layers separates the input and output channels into K equal groups, and connects only the corresponding input-output channel groups. Our procedure starts with a pre-trained CNN for a given dataset, and optimally decides the stretch and split factors across the network to refine the architecture.
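The parameter saving of a symmetrical split is easy to quantify (my own toy arithmetic, not the authors' code): a K-way split connects only matching channel groups, so the dense channel connection shrinks by a factor of K.

```python
def dense_params(c_out, c_in):
    # fully connected channel pattern between two layers
    return c_out * c_in

def split_params(c_out, c_in, k):
    # K-way symmetrical split: K groups of (c_out/k) x (c_in/k) connections
    assert c_out % k == 0 and c_in % k == 0
    return k * (c_out // k) * (c_in // k)

print(dense_params(256, 256))     # 65536
print(split_params(256, 256, 2))  # 32768, half the parameters
print(split_params(256, 256, 4))  # 16384, a quarter
```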

237   iLab-20M: A Large-Scale Controlled Object Dataset to Investigate Deep Learning. 

Ali Borji, Saeed Izadi, Laurent Itti


238   Recursive Recurrent Nets With Attention Modeling for OCR in the Wild. 

Chen-Yu Lee, Simon Osindero

Essentially text recognition in natural scenes; if anything here is worth borrowing, it is the use of an RNN. The primary advantages of the proposed method are: (1) use of recursive convolutional neural networks (CNNs), which allow for parametrically efficient and effective image feature extraction; (2) an implicitly learned character-level language model, embodied in a recurrent neural network which avoids the need to use N-grams; and (3) the use of a soft-attention mechanism, allowing the model to selectively exploit image features in a coordinated way, and allowing for end-to-end training within a standard backpropagation framework.

239   Deep Decision Network for Multi-Class Image Classification. 

Venkatesh N. Murthy, Vivek Singh, Terrence Chen, R. Manmatha, Dorin Comaniciu


240   Less Is More: Zero-Shot Learning From Online Textual Documents With Noise Suppression. 

Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, Anton van den Hengel


241   Fast Algorithms for Linear and Kernel SVM+. 

Wen Li, Dengxin Dai, Mingkui Tan, Dong Xu, Luc Van Gool



Recognition and Labeling

Tuesday, June 28th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

242   Hierarchically Gated Deep Networks for Semantic Segmentation. 

Guo-Jun Qi

Semantic segmentation; I did not fully understand it. We develop a novel paradigm of multi-scale deep network to model spatial contexts surrounding different pixels at various scales. It builds multiple layers of memory cells, learning feature representations for individual pixels at their customized scales by hierarchically absorbing relevant spatial contexts via memory gates between layers. Such Hierarchically Gated Deep Networks (HGDNs) can customize a suitable scale for each pixel, thereby delivering better performance on labeling scene structures of various scales.

243   Deep Structured Scene Parsing by Learning With Image Descriptions. 

Liang Lin, Guangrun Wang, Rui Zhang, Ruimao Zhang, Xiaodan Liang, Wangmeng Zuo

Essentially image description, again with two sub-networks, one CNN and one RNN. We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixelwise object labeling and ii) a recursive neural network (RNN) discovering the hierarchical object structure and the inter-object relations.

244   CNN-RNN: A Unified Framework for Multi-Label Image Classification. 

Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, Wei Xu

This idea is fairly novel: it considers multi-label classification. Conventional multi-label methods treat the labels as separate multi-class or independent problems with no correlation between them; here the label correlations are modeled with an RNN, again a CNN+RNN combination. Traditional approaches to multi-label image classification learn independent classifiers for each category and employ ranking or thresholding on the classification results. These techniques, although working well, fail to explicitly exploit the label dependencies in an image. In this paper, we utilize recurrent neural networks (RNNs) to address this problem. Combined with CNNs, the proposed CNN-RNN framework learns a joint image-label embedding to characterize the semantic label dependency as well as the image-label relevance, and it can be trained end-to-end from scratch to integrate both information in an unified framework.
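A minimal sketch of the decoding side (toy NumPy with randomly initialized weights; purely illustrative of how an RNN state can carry already-emitted labels, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, d = 5, 8
img = rng.normal(size=d)            # CNN image embedding
U = rng.normal(size=(n_labels, d))  # label embeddings
W = 0.1 * rng.normal(size=(d, d))   # recurrent weights

def decode(img, steps=3):
    # Greedy multi-label decode: the hidden state mixes the image with
    # the labels emitted so far, so later labels depend on earlier ones.
    h, out = np.zeros(d), []
    for _ in range(steps):
        h = np.tanh(W @ h + img + U[out].sum(axis=0))
        scores = U @ h
        scores[out] = -np.inf       # never repeat a label
        out.append(int(np.argmax(scores)))
    return out

print(decode(img))  # three distinct label indices
```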

245   Walk and Learn: Facial Attribute Representation Learning From Egocentric Video and Contextual Data. 

Jing Wang, Yu Cheng, Rogerio Schmidt Feris


246   CNN-N-Gram for Handwriting Word Recognition. 

Arik Poznanski, Lior Wolf



Object Detection 2

Tuesday, June 28th, 2:50PM - 3:20PM.

These papers will also be presented at the following poster session

247   Synthetic Data for Text Localisation in Natural Images. 

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

Also text localisation in natural scenes, in two parts; the second part is a fully-convolutional regression network. The method comprises two contributions: First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text to existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently-introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning.

248   End-To-End People Detection in Crowded Scenes. 

Russell Stewart, Mykhaylo Andriluka, Andrew Y. Ng

Pedestrian detection in crowded scenes, using an LSTM and a new loss function; worth a look. We propose a model that is based on decoding an image into a set of people detections. Our system takes an image as input and directly outputs a set of distinct detection hypotheses. Because we generate predictions jointly, common post-processing steps such as non-maximum suppression are unnecessary. We use a recurrent LSTM layer for sequence generation and train our model end-to-end with a new loss function that operates on sets of detections.

249   Real-Time Salient Object Detection With a Minimum Spanning Tree. 

Wei-Chih Tu, Shengfeng He, Qingxiong Yang, Shao-Yi Chien


250   Local Background Enclosure for RGB-D Salient Object Detection. 

David Feng, Nick Barnes, Shaodi You, Chris McCarthy


251   Adaptive Object Detection Using Adjacency and Zoom Prediction. 

Yongxi Lu, Tara Javidi, Svetlana Lazebnik

Not deep learning, but claims to rival Faster R-CNN.

252   Semantic Channels for Fast Pedestrian Detection. 

Arthur Daniel Costea, Sergiu Nedevschi


253   G-CNN: An Iterative Grid Based Object Detector. 

Mahyar Najibi, Mohammad Rastegari, Larry S. Davis

From the abstract, the idea resembles a paper discussed earlier: it is grid-based, so no separate proposal algorithm is needed. We introduce G-CNN, an object detection technique based on CNNs which works without proposal algorithms. G-CNN starts with a multi-scale grid of fixed bounding boxes. We train a regressor to move and scale elements of the grid towards objects iteratively. G-CNN models the problem of object detection as finding a path from a fixed grid to boxes tightly surrounding the objects. G-CNN with around 180 boxes in a multi-scale grid performs comparably to Fast R-CNN which uses around 2K bounding boxes generated with a proposal technique.
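The iterative idea, starting from a fixed grid box and repeatedly regressing it toward the object, can be sketched with an oracle step (toy code; in the paper the step is predicted by a trained CNN regressor):

```python
def regress_step(box, target, alpha=0.5):
    # One G-CNN-style iteration: move and scale the box part-way toward
    # the object (an oracle target is used here purely for illustration).
    return [b + alpha * (t - b) for b, t in zip(box, target)]

box, target = [0.0, 0.0, 50.0, 50.0], [30.0, 40.0, 90.0, 100.0]
for _ in range(6):
    box = regress_step(box, target)
print(box)  # after a few steps the box nearly coincides with the object
```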


Computational Photography and Faces

Tuesday, June 28th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

254   Recurrent Face Aging. 

Wei Wang, Zhen Cui, Yan Yan, Jiashi Feng, Shuicheng Yan, Xiangbo Shu, Nicu Sebe

Essentially a process of age progression, done with an RNN. In this paper, we introduce a recurrent face aging (RFA) framework based on a recurrent neural network which can identify the ages of people from 0 to 80.

255   Face2Face: Real-Time Face Capture and Reenactment of RGB Videos. 

Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, Matthias Nießner

This is the Face2Face that went viral on Weibo, but there is no deep learning in it. We present a novel approach for real-time facial reenactment of a monocular target video sequence.

256   Self-Adaptive Matrix Completion for Heart Rate Estimation From Face Videos Under Realistic Conditions. 

Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F. Cohn, Nicu Sebe


257   Visually Indicated Sounds. 

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman


258   Image Style Transfer Using Convolutional Neural Networks. 

Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

Is this more or less the same as that NeuralStyle work? It feels like it; possibly even the same paper. We introduce A Neural Algorithm of Artistic Style that can separate and recombine the image content and style of natural images. The algorithm allows us to produce new images of high perceptual quality that combine the content of an arbitrary photograph with the appearance of numerous well-known artworks.
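Central to this line of work is representing style by the Gram matrix of conv features, i.e. channel correlations that discard spatial layout (a minimal NumPy sketch, not the authors' implementation):

```python
import numpy as np

def gram_matrix(features):
    # features: conv activations of shape (C, H, W); the (C, C) Gram
    # matrix records which feature channels co-activate, the "style".
    c = features.shape[0]
    f = features.reshape(c, -1)
    return f @ f.T

feats = np.random.rand(16, 8, 8)
g = gram_matrix(feats)
print(g.shape)  # (16, 16)
```

The style loss is then the mismatch of Gram matrices between the generated image and the artwork, while a separate content loss matches raw activations.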


Computational Photography and Biomedical Applications

Tuesday, June 28th, 2:50PM - 3:20PM.

These papers will also be presented at the following poster session

259   Patch-Based Convolutional Neural Network for Whole Slide Tissue Image Classification. 

Le Hou, Dimitris Samaras, Tahsin M. Kurc, Yi Gao, James E. Davis, Joel H. Saltz

In principle this should be useful: patch-based methods apparently outperform whole-image approaches, yet many patches carry no discriminative information, so learning how to fuse the discriminative patches is meaningful. We propose to train a decision fusion model to aggregate patch-level predictions given by patch-level CNNs, which to the best of our knowledge has not been shown before. Furthermore, we formulate a novel Expectation-Maximization (EM) based method that automatically locates discriminative patches robustly by utilizing the spatial relationships of patches. We apply our method to the classification of glioma and non-small-cell lung carcinoma cases into subtypes. The classification accuracy of our method is similar to the inter-observer agreement between pathologists.

260   Hedgehog Shape Priors for Multi-Object Segmentation. 

Hossam Isack, Olga Veksler, Milan Sonka, Yuri Boykov


261   Latent Variable Graphical Model Selection Using Harmonic Analysis: Applications to the Human Connectome Project (HCP). 

Won Hwa Kim, Hyunwoo J. Kim, Nagesh Adluru, Vikas Singh


262   Simultaneous Estimation of Near IR BRDF and Fine-Scale Surface Geometry. 

Gyeongmin Choe, Srinivasa G. Narasimhan, In So Kweon


263   Do It Yourself Hyperspectral Imaging With Everyday Digital Cameras. 

Seoung Wug Oh, Michael S. Brown, Marc Pollefeys, Seon Joo Kim


264   Automatic Content-Aware Color and Tone Stylization. 

Joon-Young Lee, Kalyan Sunkavalli, Zhe Lin, Xiaohui Shen, In So Kweon


265   Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis. 

Chuan Li, Michael Wand



Poster Session 2-2. Tuesday, June 28th, 4:45PM - 6:45PM.

Biomedical Image Analysis

266   DCAN: Deep Contour-Aware Networks for Accurate Gland Segmentation. 

Hao Chen, Xiaojuan Qi, Lequan Yu, Pheng-Ann Heng

Segmentation work that also exploits contour information. In this paper, we proposed an efficient deep contour-aware network (DCAN) to solve this challenging problem under a unified multi-task learning framework. In the proposed network, multi-level contextual features from the hierarchical architecture are explored with auxiliary supervision for accurate gland segmentation. When incorporated with multi-task regularization during the training, the discriminative capability of intermediate features can be further improved. Moreover, our network can not only output accurate probability maps of glands, but also depict clear contours simultaneously for separating clustered objects, which further boosts the gland segmentation performance.

267   Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation. 

Hoo-Chang Shin, Kirk Roberts, Le Lu, Dina Demner-Fushman, Jianhua Yao, Ronald M. Summers

Essentially applies deep learning to diagnosing disease from images. In this paper, we present a deep learning model to efficiently detect a disease from an image and annotate its contexts (e.g., location, severity and the affected organs). We employ a publicly available radiology dataset of chest x-rays and their reports, and use its image annotations to mine disease names to train convolutional neural networks (CNNs). In doing so, we adopt various regularization techniques to circumvent the large normal-vs-diseased cases bias. Recurrent neural networks (RNNs) are then trained to describe the contexts of a detected disease, based on the deep CNN features. Moreover, we introduce a novel approach to use the weights of the already trained pair of CNN/RNN on the domain-specific image/text dataset, to infer the joint image/text contexts for composite image labeling.

268   Conformal Surface Alignment With Optimal Möbius Search. 

Huu Le, Tat-Jun Chin, David Suter


269   Coupled Harmonic Bases for Longitudinal Characterization of Brain Networks. 

Seong Jae Hwang, Nagesh Adluru, Maxwell D. Collins, Sathya N. Ravi, Barbara B. Bendlin, Sterling C. Johnson, Vikas Singh


270   Automating Carotid Intima-Media Thickness Video Interpretation With Convolutional Neural Networks. 

Jae Shin, Nima Tajbakhsh, R. Todd Hurst, Christopher B. Kendall, Jianming Liang


Deep Learning and CNNs

271   Context Encoders: Feature Learning by Inpainting. 

Deepak Pathak, Philipp Krähenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros

Apparently unsupervised? The contribution seems quite substantial; I have not fully understood it. We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks.
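The reconstruction part can be sketched as an L2 loss restricted to the dropped-out region (toy NumPy; the paper adds an adversarial term for sharper completions):

```python
import numpy as np

def masked_l2(pred, target, mask):
    # mask is 1 on the region the encoder must inpaint, 0 elsewhere;
    # the loss scores only the missing part the context must explain.
    return float((((pred - target) ** 2) * mask).sum() / mask.sum())

target = np.ones((4, 4))
pred = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1  # a central 2x2 hole
print(masked_l2(pred, target, mask))  # 1.0: every masked pixel is off by 1
```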

272   Comparative Deep Learning of Hybrid Representations for Image Recommendations. 

Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, Houqiang Li

The two sub-networks form a dual network, with a new training method. We design a dual-net deep network, in which the two sub-networks map input images and preferences of users into a same latent semantic space, and then the distances between images and users in the latent space are calculated to make decisions. We further propose a comparative deep learning (CDL) method to train the deep network, using a pair of images compared against one user to learn the pattern of their relative distances.

273   Fast ConvNets Using Group-Wise Brain Damage. 

Vadim Lebedev, Victor Lempitsky

Acceleration of convolutional networks, in the spirit of dropout, except here an entire group is presumably dropped at once, which amounts to direct brain damage. We revisit the idea of brain damage, i.e. the pruning of the coefficients of a neural network, and suggest how brain damage can be modified and used to speedup convolutional layers in ConvNets. The approach uses the fact that many efficient implementations reduce generalized convolutions to matrix multiplications. The suggested brain damage process prunes the convolutional kernel tensor in a group-wise fashion. After such pruning, convolutions can be reduced to multiplications of thinned dense matrices, which leads to speedup. We investigate different ways to add group-wise pruning to the learning process, and show that several-fold speedups of convolutional layers can be attained using group-sparsity regularizers.
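A group-sparsity regularizer of this flavor can be sketched as a group lasso over the input-channel groups of a conv kernel (toy NumPy; the grouping choice here is illustrative):

```python
import numpy as np

def group_lasso(kernel):
    # kernel: (c_out, c_in, kh, kw). Penalize the L2 norm of each
    # input-channel group; channels driven entirely to zero can then be
    # pruned, leaving a thinner dense matrix multiplication.
    c_out, c_in = kernel.shape[:2]
    g = kernel.reshape(c_out, c_in, -1)
    norms = np.sqrt((g ** 2).sum(axis=(0, 2)))  # one norm per input channel
    return float(norms.sum())

print(group_lasso(np.zeros((8, 4, 3, 3))))  # 0.0, a fully pruned kernel
```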

274   Learning to Co-Generate Object Proposals With a Deep Structured Network. 

Zeeshan Hayder, Xuming He, Mathieu Salzmann

Two points of note: the emphasis on co-generation, and the CNN+CRF combination. In this paper, we present an approach to co-generating object proposals in multiple images, thus leveraging the collective power of multiple object candidates. In particular, we introduce a deep structured network that jointly predicts the objectness scores and the bounding box locations of multiple object candidates. Our deep structured network consists of a fully-connected Conditional Random Field built on top of a set of deep Convolutional Neural Networks, which learn features to model both the individual object candidate and the similarity between multiple candidates. To train our deep structured network, we develop an end-to-end learning algorithm that, by unrolling the CRF inference procedure, lets us backpropagate the loss gradient throughout the entire structured network.

275   DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. 

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Pascal Frossard

A paper on the robustness of CNNs. Despite the importance of this phenomenon, no effective methods have been proposed to accurately compute the robustness of state-of-the-art deep classifiers to such perturbations on large-scale datasets. In this paper, we fill this gap and propose the DeepFool algorithm to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers.

276   Blockout: Dynamic Model Selection for Hierarchical Deep Networks. 

Calvin Murdock, Zhen Li, Howard Zhou, Tom Duerig

Structure learning, again an improvement on dropout. We propose Blockout, a method for regularization and model selection that simultaneously learns both the model architecture and parameters. A generalization of Dropout, our approach gives a novel parametrization of hierarchical architectures that allows for structure learning via back-propagation.

277   FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. 

Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Kurt Keutzer

Aimed at accelerating training. Our approach has three key pillars. First, we select network hardware that achieves high bandwidth between GPU servers -- Infiniband or Cray interconnects are ideal for this. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes.
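The reduction-tree point is easy to sketch: partial sums are combined pairwise in about log2(N) rounds instead of all N gradients converging on a single parameter server (toy code, with scalars standing in for gradient tensors):

```python
def tree_reduce(grads):
    # Pairwise reduction: each round roughly halves the number of
    # partial sums, so N workers finish in ceil(log2 N) rounds.
    while len(grads) > 1:
        paired = [grads[i] + grads[i + 1] for i in range(0, len(grads) - 1, 2)]
        if len(grads) % 2:        # an odd leftover carries to the next round
            paired.append(grads[-1])
        grads = paired
    return grads[0]

print(tree_reduce([1, 2, 3, 4, 5]))  # 15
```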

278   MDL-CW: A Multimodal Deep Learning Framework With Cross Weights. 

Sarah Rastegar, Mahdieh Soleymani, Hamid R. Rabiee, Seyed Mohsen Shojaee

Multimodal? I don't quite follow; perhaps it relates to weight sharing. In this paper, we propose a multimodal deep learning framework (MDL-CW) that exploits the cross weights between representation of modalities, and try to gradually learn interactions of the modalities in a deep network manner (from low to high level interactions). Moreover, we theoretically show that considering these interactions provide more intra-modality information, and introduce a multi-stage pre-training method that is based on the properties of multi-modal data. In the proposed framework, as opposed to the existing deep methods for multi-modal data, we try to reconstruct the representation of each modality at a given level, with representation of other modalities in the previous layer.

279   Structured Receptive Fields in CNNs. 

Jörn-Henrik Jacobsen, Jan van Gemert, Zhongyu Lou, Arnold W. M. Smeulders

Designing a prior model? We combine these ideas into structured receptive field networks, a model which has a fixed filter basis and yet retains the flexibility of CNNs. This flexibility is achieved by expressing receptive fields in CNNs as a weighted sum over a fixed basis which is similar in spirit to Scattering Networks. The key difference is that we learn arbitrary effective filter sets from the basis rather than modeling the filters. This approach explicitly connects classical multiscale image analysis with general CNNs.

Events, Actions, and Activity Recognition

280   First Person Action Recognition Using Deep Learned Descriptors. 

Suriya Singh, Chetan Arora, C. V. Jawahar

First-person action analysis? Also application-oriented. We propose convolutional neural networks (CNNs) for end to end learning and classification of wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and saliency map. It is compact. It can also be trained from relatively small number of labeled egocentric videos that are available.

281   Recognizing Micro-Actions and Reactions From Paired Egocentric Videos. 

Ryo Yonetani, Kris M. Kitani, Yoichi Sato

Interactively uses first-person videos from two people to recognize subtle micro-actions. To recognize micro-level actions and reactions, such as slight shifts in attention, subtle nodding, or small hand actions, where only subtle body motion is apparent, we propose to use paired egocentric videos recorded by two interacting people. We show that the first-person and second-person points-of-view features of two people, enabled by paired egocentric videos, are complementary and essential for reliably recognizing micro-actions and reactions.

282   Mining 3D Key-Pose-Motifs for Action Recognition. 

Chunyu Wang, Yizhou Wang, Alan L. Yuille


283   Predicting the Where and What of Actors and Actions Through Online Action Localization. 

Khurram Soomro, Haroon Idrees, Mubarak Shah


284   Actions ~ Transformations. 

Xiaolong Wang, Ali Farhadi, Abhinav Gupta

An action recognition method based on a Siamese network. Motivated by recent advancements of video representation using deep learning, we design a Siamese network which models the action as a transformation on a high-level feature space. We show that our model gives improvements on standard action recognition datasets including UCF101 and HMDB51. More importantly, our approach is able to generalize beyond learned action categories and shows significant performance improvement on cross-category generalization on our new ACT dataset.

285   Visual Path Prediction in Complex Scenes With Crowded Moving Objects. 

YoungJoon Yoo, Kimin Yun, Sangdoo Yun, JongHee Hong, Hawook Jeong, Jin Young Choi


286   End-To-End Learning of Action Detection From Frame Glimpses in Videos. 

Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei

Inspired by human cognition: people recognize actions in roughly two stages — observing the video, then pinning down the key frames. The method is RNN-based, but since the frame-selection decisions are non-differentiable, a new training scheme is needed. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and whether to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's task-specific decision policy.
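The REINFORCE estimator used here in place of plain backpropagation can be sketched as a generic score-function gradient for a softmax policy (an illustration of the technique, not the authors' exact agent; all names are illustrative):

```python
import numpy as np

def reinforce_grad(logits, actions, rewards):
    """Score-function (REINFORCE) gradient estimate for a categorical policy.

    logits : (T, A) unnormalized action scores at each decision step
    actions: (T,)   sampled action indices
    rewards: (T,)   reward credited to each step
    Returns the gradient of expected reward w.r.t. the logits.
    """
    # softmax policy probabilities
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # grad log pi(a|s) for a softmax policy is one_hot(a) - p
    one_hot = np.eye(logits.shape[1])[actions]
    return (one_hot - p) * rewards[:, None]
```

Each gradient row pushes up the log-probability of the sampled action in proportion to the reward it earned.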

287   Action Recognition in Video Using Sparse Coding and Relative Features. 

Analí Alfaro, Domingo Mery, Alvaro Soto


288   Improving Human Action Recognition by Non-Action Classification. 

Yang Wang, Minh Hoai


289   Actionness Estimation Using Hybrid Fully Convolutional Networks. 

Limin Wang, Yu Qiao, Xiaoou Tang, Luc Van Gool

An earlier paper proposed the two-stream architecture, combining per-frame appearance with multi-frame motion using CNNs; the improvement here essentially amounts to replacing the CNNs with FCNs. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (H-FCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes.

290   Real-Time Action Recognition With Enhanced Motion Vector CNNs. 

Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, Hanli Wang

The main goal is to improve the computational efficiency of two-stream networks: computing optical flow is cumbersome and time-consuming, so motion vectors are used instead — improving efficiency and, apparently, accuracy as well. Our key insight for relieving this problem is that optical flow and motion vectors are inherently correlated. Transferring the knowledge learned with the optical flow CNN to the motion vector CNN can significantly boost the performance of the latter. Specifically, we introduce three strategies for this: initialization transfer, supervision transfer, and their combination.
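Supervision transfer can be sketched as a teacher-student loss: the motion-vector CNN (student) fits the action labels while also mimicking the optical-flow CNN's (teacher's) features. This is a minimal sketch of the general idea; the exact loss shape and the `alpha` weight are assumptions, not the paper's formulation.

```python
import numpy as np

def supervision_transfer_loss(student_feat, teacher_feat, logits, labels, alpha=1.0):
    """Cross-entropy on labels plus an L2 feature-mimicking term (hypothetical form)."""
    mimic = np.mean((student_feat - teacher_feat) ** 2)  # match teacher features
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
    return ce + alpha * mimic
```

Initialization transfer would instead simply copy the teacher's weights into the student before fine-tuning.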

Image Enhancement, Restoration, and Texture

291   Laplacian Patch-Based Image Synthesis. 

Joo Ho Lee, Inchang Choi, Min H. Kim


292   Rain Streak Removal Using Layer Priors. 

Yu Li, Robby T. Tan, Xiaojie Guo, Jiangbo Lu, Michael S. Brown


293   Gradient-Domain Image Reconstruction Framework With Intensity-Range and Base-Structure Constraints. 

Takashi Shibata, Masayuki Tanaka, Masatoshi Okutomi


294   Removing Clouds and Recovering Ground Observations in Satellite Image Sequences via Temporally Contiguous Robust Matrix Completion. 

Jialei Wang, Peder A. Olsen, Andrew R. Conn, Aurélie C. Lozano


295   D3: Deep Dual-Domain Based Fast Restoration of JPEG-Compressed Images. 

Zhangyang Wang, Ding Liu, Shiyu Chang, Qing Ling, Yingzhen Yang, Thomas S. Huang


296   From Bows to Arrows: Rolling Shutter Rectification of Urban Scenes. 

Vijay Rengarajan, Ambasamudram N. Rajagopalan, Rangarajan Aravind


297   A Weighted Variational Model for Simultaneous Reflectance and Illumination Estimation. 

Xueyang Fu, Delu Zeng, Yue Huang, Xiao-Ping Zhang, Xinghao Ding


298   Visualizing and Understanding Deep Texture Representations. 

Tsung-Yu Lin, Subhransu Maji

Not examined closely — it may be of the same nature as the earlier "Visualizing and Understanding" papers on CNNs and RNNs. Based on recent work [13, 28] we propose a technique to visualize pre-images, providing a means for understanding categorical properties that are captured by these representations. Finally, we show preliminary results on how a unified parametric model of texture analysis and synthesis can be used for attribute-based image manipulation, e.g. to make an image more swirly, honeycombed, or knitted.

Low-Level Vision

299   Robust Kernel Estimation With Outliers Handling for Image Deblurring. 

Jinshan Pan, Zhouchen Lin, Zhixun Su, Ming-Hsuan Yang


Large Scale Visual Recognition

300   Online Collaborative Learning for Open-Vocabulary Visual Classifiers. 

Hanwang Zhang, Xindi Shang, Wenzhuo Yang, Huan Xu, Huanbo Luan, Tat-Seng Chua


301   Rethinking the Inception Architecture for Computer Vision. 

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna


Object Class Detection and Recognition

302   Cross Modal Distillation for Supervision Transfer. 

Saurabh Gupta, Judy Hoffman, Jitendra Malik


303   Efficient Point Process Inference for Large-Scale Object Detection. 

Trung T. Pham, Seyed Hamid Rezatofighi, Ian Reid, Tat-Jun Chin

Quite surprising: although this targets large-scale detection, it is not a deep learning approach.

304   Weakly Supervised Deep Detection Networks. 

Hakan Bilen, Andrea Vedaldi

Object detection: a classification model pre-trained on a large-scale dataset is fine-tuned to perform detection. In this paper, we address this problem by exploiting the power of deep convolutional neural networks pre-trained on large-scale image-level classification tasks. We propose a weakly supervised deep detection architecture that modifies one such network to operate at the level of image regions, performing region selection and classification simultaneously. Trained as an image classifier, the architecture implicitly learns object detectors that are better than alternative weakly supervised detection systems on the PASCAL VOC data.

305   BORDER: An Oriented Rectangles Approach to Texture-Less Object Recognition. 

Jacob Chan, Jimmy Addison Lee, Qian Kemao


306   Active Image Segmentation Propagation. 

Suyog Dutt Jain, Kristen Grauman


307   Inside-Outside Net: Detecting Objects in Context With Skip Pooling and Recurrent Neural Networks. 

Sean Bell, C. Lawrence Zitnick, Kavita Bala, Ross Girshick

Uses spatial RNNs — worth a closer look. In this paper we present the Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest. Contextual information outside the region of interest is integrated using spatial recurrent neural networks. Inside, we use skip pooling to extract information at multiple scales and levels of abstraction.

308   RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection. 

Gong Cheng, Peicheng Zhou, Junwei Han

The idea here seems good. The two "added layers" are admittedly old wine in new bottles, building on existing architectures: rotation invariance comes from an added loss term, and the Fisher discriminative layer is a projection applied after feature extraction. Worth attention and reference. To address these problems, this paper proposes a novel and effective method to learn a rotation-invariant and Fisher discriminative CNN (RIFD-CNN) model. This is achieved by introducing and learning a rotation-invariant layer and a Fisher discriminative layer, respectively, on the basis of existing high-capacity CNN architectures. Specifically, the rotation-invariant layer is trained by imposing an explicit regularization constraint on the objective function that enforces invariance of the CNN features before and after rotation. The Fisher discriminative layer is trained by imposing the Fisher discrimination criterion on the CNN features so that they have small within-class scatter but large between-class separation.
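The rotation-invariance regularizer can be sketched as penalizing how much a sample's features vary across its rotated versions (a minimal illustration of the loss-term idea, not the paper's exact objective):

```python
import numpy as np

def rotation_invariance_penalty(feats):
    """feats: (R, D) features of the same sample under R rotations.
    Penalize deviation from the mean feature, encouraging the CNN
    to produce the same feature before and after rotation."""
    mean = feats.mean(axis=0, keepdims=True)
    return np.mean((feats - mean) ** 2)
```

The penalty is zero exactly when all rotated versions map to the same feature vector.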

309   Reinforcement Learning for Visual Object Detection. 

Stefan Mathe, Aleksis Pirinen, Cristian Sminchisescu


310   Detecting Repeating Objects Using Patch Correlation Analysis. 

Inbar Huberman, Raanan Fattal


311   Analyzing Classifiers: Fisher Vectors and Deep Neural Networks. 

Sebastian Bach, Alexander Binder, Grégoire Montavon, Klaus-Robert Müller, Wojciech Samek


Scene and Image Classification

312   Learning Deep Features for Discriminative Localization. 

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba

Revisits global average pooling. In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image.
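With global average pooling before the final classifier, a class activation map falls out almost for free: weight the last convolutional feature maps by that class's fc weights and sum over channels. A minimal numpy sketch (illustrative shapes, not the paper's code):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) last-conv activations; fc_weights: (num_classes, C).
    The CAM for a class is the channel-wise weighted sum of the feature maps."""
    w = fc_weights[class_idx]                           # (C,)
    return np.tensordot(w, feature_maps, axes=(0, 0))   # (H, W) heat map
```

High values in the returned map indicate regions that drove the class score, which is what yields localization from image-level labels.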

313   Seeing Through the Human Reporting Bias: Visual Classifiers From Noisy Human-Centric Labels. 

Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, Ross Girshick


314   Learning Aligned Cross-Modal Representations From Weakly Aligned Data. 

Lluís Castrejón, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

Considers cross-modal data — specifically, adapting CNNs to cross-modal settings. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize cross-modal scenes well, they also learn an intermediate representation not aligned across modalities, which is undesirable for cross-modal transfer applications. We present methods to regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality.

315   A Probabilistic Collaborative Representation Based Approach for Pattern Classification. 

Sijia Cai, Lei Zhang, Wangmeng Zuo, Xiangchu Feng


316   Learning Structured Inference Neural Networks With Label Relations. 

Hexiang Hu, Guang-Tong Zhou, Zhiwei Deng, Zicheng Liao, Greg Mori

A graphical model over label relations? Somewhat related to my work. A natural image could be assigned fine-grained labels that describe major components, coarse-grained labels that depict high-level abstraction, or a set of labels that reveal attributes. Such categorization at different concept layers can be modeled with label graphs encoding label information. In this paper, we exploit this rich information with a state-of-the-art deep learning framework, and propose a generic structured model that leverages diverse label relations to improve image classification performance. Our approach employs a novel stacked label prediction neural network, capturing both inter-level and intra-level label semantics.

317   Discriminative Multi-Modal Feature Fusion for RGBD Indoor Scene Recognition. 

Hongyuan Zhu, Jean-Baptiste Weibel, Shijian Lu

Essentially adds a fusion layer near the top of the network so that information from the fusion layer can be back-propagated to update the lower layers — similar to the idea Prof. Xiang mentioned; worth referencing. Inspired by some recent work on RGBD object recognition using multi-modal feature fusion, we introduce a novel discriminative multi-modal fusion framework for RGBD scene recognition for the first time, which simultaneously considers the inter- and intra-modality correlation for all samples while regularizing the learned features to be discriminative and compact. The results from the multimodal layer can be back-propagated to the lower CNN layers, hence the parameters of the CNN layers and multimodal layers are updated iteratively until convergence.
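The key point — that fusion-layer gradients reach both modality branches — can be shown with a toy linear fusion layer (a sketch under assumed shapes, not the paper's architecture):

```python
import numpy as np

def fuse_and_grad(rgb_feat, depth_feat, W):
    """Late fusion: concatenate modality features and apply a linear map W.
    Returns the output plus the backprop signal each branch would receive,
    illustrating that one fused loss updates both lower networks."""
    x = np.concatenate([rgb_feat, depth_feat])
    out = W @ x
    # d out_k / d x_j = W[k, j]; summing over outputs gives each input's signal
    g = W.sum(axis=0)
    d = rgb_feat.shape[0]
    return out, g[:d], g[d:]
```

Because both gradient slices are generally nonzero, training the fusion layer jointly adjusts both the RGB and depth feature extractors.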

318   Conditional Graphical Lasso for Multi-Label Image Classification. 

Qiang Li, Maoying Qiao, Wei Bian, Dacheng Tao


319   Region Ranking SVM for Image Classification. 

Zijun Wei, Minh Hoai

Ranking SVM — not deep learning.

Scene Understanding

320   Predicting Motivations of Actions by Leveraging Text. 

Carl Vondrick, Deniz Oktay, Hamed Pirsiavash, Antonio Torralba


Video Surveilance

321   BoxCars: 3D Boxes as CNN Input for Improved Fine-Grained Vehicle Recognition. 

Jakub Sochor, Adam Herout, Jiří Havel

Proposes 3D box features as input to the CNN, which makes even a plain CNN perform well; worth referencing. Our contribution is showing that extracting additional data from the video stream - besides the vehicle image itself - and feeding it into the deep convolutional neural network boosts the recognition performance considerably. This additional information includes: the 3D vehicle bounding box used for "unpacking" the vehicle image, its rasterized low-resolution shape, and information about the 3D vehicle orientation. Experiments show that adding such information decreases classification error by 26% (the accuracy is improved from 0.772 to 0.832) and boosts verification average precision by 208% (0.378 to 0.785) compared to a baseline pure CNN without any input modifications. Also, the pure baseline CNN outperforms the recent state-of-the-art solution by 0.081.

322   Highway Vehicle Counting in Compressed Domain. 

Xu Liu, Zilei Wang, Jiashi Feng, Hongsheng Xi


323   Camera Calibration From Periodic Motion of a Pedestrian. 

Shiyao Huang, Xianghua Ying, Jiangpeng Rong, Zeyu Shang, Hongbin Zha





Actions and Human Pose

Wednesday, June 29th, 9:00AM - 10:05AM.

These papers will also be presented at the following poster session

324   Dynamic Image Networks for Action Recognition. 

Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, Stephen Gould

Likely of high reference value: it converts an entire video into a single dynamic image, so video data can be fed directly to image models for analysis. Worth noting. We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used. The dynamic image is based on the rank pooling concept and is obtained through the parameters of a ranking machine that encodes the temporal evolution of the frames of the video. Dynamic images are obtained by directly applying rank pooling on the raw image pixels of a video, producing a single RGB image per video. This idea is simple but powerful as it enables the use of existing CNN models directly on video data with fine-tuning. We present an efficient and effective approximate rank pooling operator, speeding it up orders of magnitude compared to rank pooling. Our new approximate rank pooling CNN layer allows us to generalize dynamic images to dynamic feature maps, and we demonstrate the power of our new representations on standard benchmarks in action recognition, achieving state-of-the-art performance.
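Approximate rank pooling reduces to a fixed weighted sum of the frames, which is why it is so fast. Below is a simplified sketch using the linear weights alpha_t = 2t - T - 1, one common closed-form approximation; the paper's exact coefficients differ, so treat this as illustrative:

```python
import numpy as np

def dynamic_image(frames):
    """frames: (T, H, W) grayscale video.
    Collapse the video into one image by weighting frames so that
    later frames contribute positively and earlier ones negatively,
    encoding the temporal evolution of appearance."""
    T = frames.shape[0]
    alpha = 2 * np.arange(1, T + 1) - T - 1        # e.g. T=3 -> [-2, 0, 2]
    return np.tensordot(alpha, frames, axes=(0, 0))  # (H, W)
```

The resulting single image can then be passed through any pretrained image CNN and fine-tuned.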

325   Detecting Events and Key Actors in Multi-Person Videos. 

Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, Li Fei-Fei

High reference value. For multi-person (nearly crowd-level) videos, the model detects events and identifies the key actors driving them, using an RNN architecture. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time-instant. The attended features are then processed using another RNN for event detection/classification.

326   Regularizing Long Short Term Memory With 3D Human-Skeleton Sequences for Action Recognition. 

Behrooz Mahasseni, Sinisa Todorovic

High reference value: skeleton sequences are first extracted from the video, then an LSTM performs recognition. An LSTM may not be needed at the start, but video is sequential data, so LSTMs will have to be considered eventually. This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in training data -- namely, 3D human-skeleton sequences -- aimed at complementing poorly represented or missing features of human actions in the training videos. For recognition, we use Long Short Term Memory (LSTM) grounded via a deep Convolutional Neural Network (CNN) onto the video. Training of the LSTM is regularized using the output of another encoder LSTM (eLSTM) grounded on 3D human-skeleton training data. For such regularized training of the LSTM, we modify the standard backpropagation through time (BPTT) in order to address the well-known issues with gradient descent in constrained optimization.

327   Personalizing Human Video Pose Estimation. 

James Charles, Tomas Pfister, Derek Magee, David Hogg, Andrew Zisserman

Video-related and CNN-based, so it should have some reference value. We propose a personalized ConvNet pose estimator that automatically adapts itself to the uniqueness of a person's appearance to improve pose estimation in long videos. We make the following contributions: (i) we show that given a few high-precision pose annotations, e.g. from a generic ConvNet pose estimator, additional annotations can be generated throughout the video using a combination of image-based matching for temporally distant frames, and dense optical flow for temporally local frames; (ii) we develop an occlusion-aware self-evaluation model that is able to automatically select the high-quality and reject the erroneous additional annotations; and (iii) we demonstrate that these high-quality annotations can be used to fine-tune a ConvNet pose estimator and thereby personalize it to lock on to key discriminative features of the person's appearance.

328   End-To-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation. 

Wei Yang, Wanli Ouyang, Hongsheng Li, Xiaogang Wang

Pose estimation that incorporates prior knowledge; worth referencing. In this paper, we propose a novel end-to-end framework for human pose estimation that combines DCNNs with an expressive deformable mixture of parts. We explicitly incorporate domain prior knowledge into the framework, which greatly regularizes the learning process and enables the flexibility of our framework for loopy models or tree-structured models. The effectiveness of jointly learning a DCNN with a deformable mixture-of-parts model is evaluated through intensive experiments on several widely used benchmarks.




Activity Recognition

Wednesday, June 29th, 10:05AM - 10:30AM.

These papers will also be presented at the following poster session

329   Actor-Action Semantic Segmentation With Grouping Process Models. 

Chenliang Xu, Jason J. Corso


330   Temporal Action Localization With Pyramid of Score Distribution Features. 

Jun Yuan, Bingbing Ni, Xiaokang Yang, Ashraf A. Kassim


331   Recognizing Activities of Daily Living With a Wrist-Mounted Camera. 

Katsunori Ohnishi, Atsushi Kanehira, Asako Kanezaki, Tatsuya Harada


332   Harnessing Object and Scene Semantics for Large-Scale Video Understanding. 

Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, Leonid Sigal

A new network architecture — now three streams, unlike the two streams of the two-stream architecture: one at the frame level, one at the object level, and one at the scene level. To address these problems, we propose a novel object- and scene-based semantic fusion network and representation. Our semantic fusion network combines three streams of information using a three-layer neural network: (i) frame-based low-level CNN features, (ii) object features from a state-of-the-art large-scale CNN object-detector trained to recognize 20K classes, and (iii) scene features from a state-of-the-art CNN scene-detector trained to recognize 205 scenes. The trained network achieves improvements in supervised activity and video categorization in two complex large-scale datasets - ActivityNet and FCVID, respectively. Further, by examining and back-propagating information through the fusion network, semantic relationships (correlations) between video classes and objects/scenes can be discovered.

333   Video-Story Composition via Plot Analysis. 

Jinsoo Choi, Tae-Hyun Oh, In So Kweon


334   Temporal Action Detection Using a Statistical Language Model. 

Alexander Richard, Juergen Gall





Semantic Segmentation

Wednesday, June 29th, 9:00AM - 10:05AM.

These papers will also be presented at the following poster session

335   Multi-Scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation. 

Shu Liu, Xiaojuan Qi, Jianping Shi, Hong Zhang, Jiaya Jia


336   Instance-Aware Semantic Segmentation via Multi-Task Network Cascades. 

Jifeng Dai, Kaiming He, Jian Sun

Applies the multi-task idea by cascading three networks; fast and effective. In this paper, we present Multi-task Network Cascades for instance-aware semantic segmentation. Our model consists of three networks, respectively differentiating instances, estimating masks, and categorizing objects. These networks form a cascaded structure, and are designed to share their convolutional features. We develop an algorithm for the nontrivial end-to-end training of this causal, cascaded structure. Our solution is a clean, single-step training framework and can be generalized to cascades that have more stages. We demonstrate state-of-the-art instance-aware semantic segmentation accuracy on PASCAL VOC. Meanwhile, our method takes only 360ms testing an image using VGG-16, which is two orders of magnitude faster than previous systems for this challenging problem. As a by-product, our method also achieves compelling object detection results which surpass the competitive Fast/Faster R-CNN systems.

337   ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation. 

Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, Jian Sun


338   Feature Space Optimization for Semantic Video Segmentation. 

Abhijit Kundu, Vibhav Vineet, Vladlen Koltun


339   Large-Scale Semantic 3D Reconstruction: An Adaptive Multi-Resolution Model for Multi-Class Volumetric Labeling. 

Maroš Bláha, Christoph Vogel, Audrey Richard, Jan D. Wegner, Thomas Pock, Konrad Schindler





Semantic Parsing and Segmentation

Wednesday, June 29th, 10:05AM - 10:30AM.

These papers will also be presented at the following poster session

340   Semantic Object Parsing With Local-Global Long Short-Term Memory. 

Xiaodan Liang, Xiaohui Shen, Donglai Xiang, Jiashi Feng, Liang Lin, Shuicheng Yan

Uses local and global LSTMs for segmentation — here, semantic parsing. In this work, we propose a novel deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly incorporate short-distance and long-distance spatial dependencies into the feature learning over all pixel positions. In each LG-LSTM layer, local guidance from neighboring positions and global guidance from the whole image are imposed on each position to better exploit complex local and global contextual information. Individual LSTMs for distinct spatial dimensions are also utilized to intrinsically capture various spatial layouts of semantic parts in the images, yielding distinct hidden and memory cells of each position for each dimension. In our parsing approach, several LG-LSTM layers are stacked and appended to the intermediate convolutional layers to directly enhance visual features, allowing network parameters to be learned in an end-to-end way.

341   Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation. 

Guosheng Lin, Chunhua Shen, Anton van den Hengel, Ian Reid

Adds contextual information? It still feels like a combination of CRFs and CNNs; the key is how they are combined. We show how to improve semantic segmentation through the use of contextual information; specifically, we explore 'patch-patch' context between image regions, and 'patch-background' context. For learning from the patch-patch context, we formulate Conditional Random Fields (CRFs) with CNN-based pairwise potential functions to capture semantic correlations between neighboring patches. Efficient piecewise training of the proposed deep structured model is then applied to avoid repeated expensive CRF inference for back propagation. For capturing the patch-background context, we show that a network design with traditional multi-scale image input and sliding pyramid pooling is effective for improving performance.

342   Learning Transferrable Knowledge for Semantic Segmentation With Deep Convolutional Neural Network. 

Seunghoon Hong, Junhyuk Oh, Honglak Lee , Bohyung Han

One component is a weakly supervised network; the other is a decoupled encoder-decoder network. We propose a novel weakly-supervised semantic segmentation algorithm based on a Deep Convolutional Neural Network (DCNN). Contrary to existing weakly-supervised approaches, our algorithm exploits auxiliary segmentation annotations available for different categories to guide segmentation on images with only image-level class labels. To make segmentation knowledge transferrable across categories, we design a decoupled encoder-decoder architecture with an attention model. In this architecture, the model generates spatial highlights of each category present in the image using the attention model, and subsequently performs binary segmentation for each highlighted region using the decoder. Combined with the attention model, the decoder trained with segmentation annotations in different categories boosts the accuracy of weakly-supervised semantic segmentation.

343   The Cityscapes Dataset for Semantic Urban Scene Understanding. 

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele


344   Gaussian Conditional Random Field Network for Semantic Segmentation. 

Raviteja Vemulapalli, Oncel Tuzel, Ming-Yu Liu, Rama Chellapa

Image segmentation, replacing the discrete CRF with a Gaussian CRF. The proposed Gaussian CRF network is composed of three sub-networks: (i) a CNN-based unary network for generating unary potentials, (ii) a CNN-based pairwise network for generating pairwise potentials, and (iii) a GMF network for performing Gaussian CRF inference.
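The appeal of a Gaussian CRF is that inference is exact and cheap: for an energy E(x) = 0.5 x^T A x - b^T x with positive-definite precision A, the mean (and MAP) is just the solution of A x = b, unlike discrete CRFs which require approximate inference. A minimal sketch (generic Gaussian CRF math, not the paper's GMF network):

```python
import numpy as np

def gaussian_crf_mean(b, A):
    """MAP/mean of a Gaussian CRF with energy 0.5*x'Ax - b'x:
    minimizing the quadratic energy reduces to solving A x = b."""
    return np.linalg.solve(A, b)
```

In the segmentation setting, b would come from the unary network and A from the pairwise network.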

345   The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. 

German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, Antonio M. López





Poster Session 3-1. Wednesday, June 29th, 10:30AM - 12:30PM.

3D Vision

346   Progressive Prioritized Multi-View Stereo. 

Alex Locher, Michal Perdoch, Luc Van Gool


347   WarpNet: Weakly Supervised Matching for Single-View Reconstruction. 

Angjoo Kanazawa, David W. Jacobs, Manmohan Chandraker


348   What Sparse Light Field Coding Reveals About Scene Structure. 

Ole Johannsen, Antonin Sulc, Bastian Goldluecke


349   Online Reconstruction of Indoor Scenes From RGB-D Streams. 

Hao Wang, Jun Wang, Wang Liang


350   Patches, Planes and Probabilities: A Non-Local Prior for Volumetric 3D Reconstruction. 

Ali Osman Ulusoy, Michael J. Black, Andreas Geiger


351   Single Image Camera Calibration With Lenticular Arrays for Augmented Reality. 

Ian Schillebeeckx, Robert Pless


352   Augmented Blendshapes for Real-Time Simultaneous 3D Head Modeling and Facial Motion Capture. 

Diego Thomas, Rin-ichiro Taniguchi


353   Learned Binary Spectral Shape Descriptor for 3D Shape Correspondence. 

Jin Xie, Meng Wang, Yi Fang


354   Multiple Model Fitting as a Set Coverage Problem. 

Luca Magri, Andrea Fusiello


355   Piecewise-Planar 3D Approximation From Wide-Baseline Stereo. 

Cédric Verleysen, Christophe De Vleeschouwer


356   Sparse to Dense 3D Reconstruction From Rolling Shutter Images. 

Olivier Saurer, Marc Pollefeys, Gim Hee Lee


357   Consistency of Silhouettes and Their Duals. 

Matthew Trager, Martial Hebert, Jean Ponce


358   Rolling Shutter Absolute Pose Problem With Known Vertical Direction. 

Cenek Albl, Zuzana Kukelova, Tomas Pajdla


359   Uncertainty-Driven 6D Pose Estimation of Objects and Scenes From a Single RGB Image. 

Eric Brachmann, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, carsten Rother


360   Multicamera Calibration From Visible and Mirrored Epipoles. 

Andrey Bushnevskiy, Lorenzo Sorgi, Bodo Rosenhahn


Face and Gesture

361   Joint Unsupervised Deformable Spatio-Temporal Alignment of Sequences. 

Lazaros Zafeiriou, Epameinondas Antonakos, Stefanos Zafeiriou, Maja Pantic


362   Deep Region and Multi-Label Learning for Facial Action Unit Detection. 

Kaili Zhao, Wen-Sheng Chu, Honggang Zhang

A unified framework that handles region learning and multi-label learning together. In this paper, we propose Deep Region and Multi-label Learning (DRML), a unified deep network that simultaneously addresses these two problems. One crucial aspect in DRML is a novel region layer that uses feed-forward functions to induce important facial regions, forcing the learned weights to capture structural information of the face. Our region layer serves as an alternative design between locally connected layers (i.e., confined kernels to individual pixels) and conventional convolution layers (i.e., shared kernels across an entire image). Unlike previous studies that solve RL and ML alternately, DRML by construction addresses both problems, allowing the two seemingly irrelevant problems to interact more directly.

363   Constrained Joint Cascade Regression Framework for Simultaneous Facial Action Unit Recognition and Facial Landmark Detection. 

Yue Wu, Qiang Ji


364   Unconstrained Face Alignment via Cascaded Compositional Learning. 

Shizhan Zhu, Cheng Li, Chen-Change Loy, Xiaoou Tang


365   Automated 3D Face Reconstruction From Multiple Images Using Quality Measures. 

Marcel Piotraschke, Volker Blanz


366   Occlusion-Free Face Alignment: Deep Regression Networks Coupled With De-Corrupt AutoEncoders. 

Jie Zhang, Meina Kan, Shiguang Shan, Xilin Chen

A network coupling deep regression networks with de-corrupting autoencoders. In this work, we propose a novel face alignment method, which cascades several Deep Regression networks coupled with De-corrupt Autoencoders (denoted as DRDA) to explicitly handle the partial occlusion problem. Different from previous works that can only detect occlusions and discard the occluded parts, our proposed de-corrupt autoencoder network can automatically recover the genuine appearance of the occluded parts, and the recovered parts can be leveraged together with the non-occluded parts for more accurate alignment.

367   Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis. 

Zheng Zhang, Jeff M. Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun Canavan, Michael Reale, Andy Horowitz, Huiyuan Yang, Jeffrey F. Cohn, Qiang Ji, Lijun Yin


368   Learning Reconstruction-Based Remote Gaze Estimation. 

Pei Yu, Jiahuan Zhou, Ying Wu


369   Joint Training of Cascaded CNN for Face Detection. 

Hongwei Qin, Junjie Yan, Xiu Li, Xiaolin Hu

Also cascades CNNs, within an AdaBoost-style framework; the advance over earlier methods is that the cascade is optimized jointly end-to-end, whereas prior cascades were only locally optimal, stage by stage. However, to the best of our knowledge, most previous detection methods use the cascade in a greedy manner, where earlier stages are fixed when training a new stage, so the optimizations of the different CNNs are isolated. In this paper, we propose joint training to achieve end-to-end optimization of the CNN cascade. We show that the back-propagation algorithm used in training a CNN can be naturally used in training a CNN cascade.

370   Facial Expression Intensity Estimation Using Ordinal Information. 

Rui Zhao, Quan Gan, Shangfei Wang, Qiang Ji


Recognition and Detection

371   Proposal Flow. 

Bumsub Ham, Minsu Cho, Cordelia Schmid, Jean Ponce


372   ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks. 

Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, Lubomir Bourdev

Object recognition with fully convolutional networks. In this paper, we propose a novel classification architecture, ProNet, based on convolutional neural networks. It uses computationally efficient neural networks to propose image regions that are likely to contain objects, and applies more powerful but slower networks on the proposed regions. The basic building block is a multi-scale fully-convolutional network which assigns object confidence scores to boxes at different locations and scales.

373   Seeing Behind the Camera: Identifying the Authorship of a Photograph. 

Christopher Thomas, Adriana Kovashka


374   Material Classification Using Raw Time-Of-Flight Measurements. 

Shuochen Su, Felix Heide, Robin Swanson, Jonathan Klein, Clara Callenberg, Matthias Hullin, Wolfgang Heidrich


375   Weakly Supervised Object Localization With Progressive Domain Adaptation. 

Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, Ming-Hsuan Yang

Object localization in two stages: the first extends a pre-trained network to multi-label classification; the second adapts it to detection. In this paper, we address this problem by progressive domain adaptation with two main steps: classification adaptation and detection adaptation. In classification adaptation, we transfer a pre-trained network to our multi-label classification task for recognizing the presence of a certain object in an image. In detection adaptation, we first use a mask-out strategy to collect class-specific object proposals and apply multiple instance learning to mine confident candidates. We then use these selected object proposals to fine-tune all the layers, resulting in a fully adapted detection network.

376   Newtonian Scene Understanding: Unfolding the Dynamics of Objects in Static Images. 

Roozbeh Mottaghi, Hessam Bagherinezhad, Mohammad Rastegari, Ali Farhadi


377   Identifying Good Training Data for Self-Supervised Free Space Estimation. 

Ali Harakeh, Daniel Asmar, Elie Shammas


378   Learning to Match Aerial Images With Deep Attentive Architectures. 

Hani Altwaijry, Eduard Trulls, James Hays, Pascal Fua, Serge Belongie

Of limited relevance here. In this paper we propose a data-driven, deep learning-based approach that sidesteps local correspondence by framing the problem as a classification task. Furthermore, we demonstrate that local correspondences can still be useful. To do so we incorporate an attention mechanism to produce a set of probable matches, which allows us to further increase performance.

379   Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection. 

Krishna Kumar Singh, Fanyi Xiao, Yong Jae Lee


380   DeepCAMP: Deep Convolutional Action & Attribute Mid-Level Patterns. 

Ali Diba, Ali Mohammad Pazandeh, Hamed Pirsiavash, Luc Van Gool

Mid-level image patch features. In order to deal with this challenge, we propose a novel convolutional neural network that mines mid-level image patches that are sufficiently dedicated to resolve the corresponding subtleties. In particular, we train a newly designed CNN (DeepPattern) that learns discriminative patch groups.

381   Canny Text Detector: Fast and Robust Scene Text Localization Algorithm. 

Hojin Cho, Myungchul Sung, Bongjin Jun


382   Temporal Multimodal Learning in Audiovisual Speech Recognition. 

Di Hu, Xuelong Li, Xiaoqiang Lu


383   Recovering 6D Object Pose and Predicting Next-Best-View in the Crowd. 

Andreas Doumanoglou, Rigas Kouskouridas, Sotiris Malassiotis, Tae-Kyun Kim


384   Robust 3D Hand Pose Estimation in Single Depth Images: From Single-View CNN to Multi-View CNNs. 

Liuhao Ge, Hui Liang, Junsong Yuan, Daniel Thalmann


Semantic Image Segmentation

385   Semantic Segmentation With Boundary Neural Fields. 

Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

Addresses the shortcomings of FCN and CNN+CRF pipelines with boundary-based segmentation. To overcome these problems, we introduce a Boundary Neural Field (BNF), which is a global energy model integrating FCN predictions with boundary cues. The boundary information is used to enhance semantic segment coherence and to improve object localization. Specifically, we first show that the convolutional filters of semantic FCNs provide good features for boundary detection. We then employ the predicted boundaries to define pairwise potentials in our energy. Finally, we show that our energy decomposes semantic segmentation into multiple binary problems, which can be relaxed for efficient global optimization.

386   HD Maps: Fine-Grained Road Segmentation by Parsing Ground and Aerial Images. 

Gellért Máttyus, Shenlong Wang, Sanja Fidler, Raquel Urtasun


387   DAG-Recurrent Neural Networks For Scene Labeling. 

Bing Shuai, Zhen Zuo, Bing Wang, Gang Wang

RNNs over directed acyclic graphs. In image labeling, local representations for image units (pixels, patches or superpixels) are usually generated from their surrounding image patches, thus long-range contextual information is not effectively encoded. In this paper, we introduce recurrent neural networks (RNNs) to address this issue. Specifically, directed acyclic graph RNNs (DAG-RNNs) are proposed to process DAG-structured images, which enables the network to model long-range semantic dependencies among image units. Our DAG-RNNs are capable of tremendously enhancing the discriminative power of local representations, which significantly benefits the local classification.
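One sweep of such a DAG-RNN can be sketched as follows: hidden states propagate along the graph so that each image unit sees the context accumulated at its predecessors. A minimal sketch, assuming a southeast-directed DAG over a feature grid; all shapes and weight initialisations are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D, hidden = 4, 5, 3, 6          # grid size, feature dim, hidden dim
x = rng.standard_normal((H, W, D))    # local features per image unit
Wx = rng.standard_normal((hidden, D)) * 0.1
Uh = rng.standard_normal((hidden, hidden)) * 0.1

h = np.zeros((H, W, hidden))
for i in range(H):
    for j in range(W):
        # aggregate hidden states of the DAG predecessors (north and west)
        ctx = np.zeros(hidden)
        if i > 0:
            ctx += h[i - 1, j]
        if j > 0:
            ctx += h[i, j - 1]
        h[i, j] = np.tanh(Wx @ x[i, j] + Uh @ ctx)

# h[i, j] now depends on every unit above and to the left of (i, j)
```

In the paper, sweeps in several directions are combined so every unit receives context from the whole image; a single direction is shown here for brevity.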

388   Saliency Guided Dictionary Learning for Weakly-Supervised Image Parsing. 

Baisheng Lai, Xiaojin Gong


389   Attention to Scale: Scale-Aware Semantic Image Segmentation. 

Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, Alan L. Yuille

Again, multi-scale fully convolutional networks for segmentation. In this work, we propose an attention mechanism that learns to softly weight the multi-scale features at each pixel location. We adapt a state-of-the-art semantic image segmentation model, which we jointly train with multi-scale input images and the attention model. The proposed attention model not only outperforms average- and max-pooling, but allows us to diagnostically visualize the importance of features at different positions and scales. Moreover, we show that adding extra supervision to the output at each scale is essential to achieving excellent performance when merging multi-scale features.
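The soft weighting amounts to a per-pixel softmax over scales, replacing average- or max-pooling of the score maps. A minimal numpy sketch, assuming random stand-ins for the per-scale score maps and attention logits:

```python
import numpy as np

rng = np.random.default_rng(1)
S, H, W, C = 3, 4, 4, 5                      # scales, height, width, classes
scores = rng.standard_normal((S, H, W, C))   # per-scale class score maps
logits = rng.standard_normal((S, H, W))      # attention logits per scale/pixel

w = np.exp(logits)
w /= w.sum(axis=0, keepdims=True)            # softmax over scales, per pixel
fused = (w[..., None] * scores).sum(axis=0)  # attention-weighted merge
```

Average-pooling is the special case where `w` is uniform; the learned weights let each pixel prefer the scale that best resolves it.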

390   Scene Labeling Using Sparse Precision Matrix. 

Nasim Souly, Mubarak Shah


391   Iterative Instance Segmentation. 

Ke Li, Bharath Hariharan, Jitendra Malik


392   Recurrent Attentional Networks for Saliency Detection. 

Jason Kuen, Zhenhua Wang, Gang Wang

An RNN over image sub-regions, used for saliency detection. Convolutional-deconvolution networks can be adopted to perform end-to-end saliency detection. But, they do not work well with objects of multiple scales. To overcome such a limitation, in this work, we propose a recurrent attentional convolutional-deconvolution network (RACDNN). Using spatial transformer and recurrent network units, RACDNN is able to iteratively attend to selected image sub-regions to perform saliency refinement progressively. Besides tackling the scale problem, RACDNN can also learn context-aware features from past iterations to enhance saliency refinement in future iterations.

Semantic Video Segmentation

393   Instance-Level Video Segmentation From Object Tracks. 

Guillaume Seguin, Piotr Bojanowski, Rémi Lajugie, Ivan Laptev


394   Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer. 

Jun Xie, Martin Kiefel, Ming-Ting Sun, Andreas Geiger


Shape From X

395   Amplitude Modulated Video Camera - Light Separation in Dynamic Scenes. 

Amir Kolaman, Maxim Lvov, Rami Hagege, Hugo Guterman


396   A Benchmark Dataset and Evaluation for Non-Lambertian and Uncalibrated Photometric Stereo. 

Boxin Shi, Zhe Wu, Zhipeng Mo, Dinglong Duan, Sai-Kit Yeung, Ping Tan


397   Depth From Semi-Calibrated Stereo and Defocus. 

Ting-Chun Wang, Manohar Srikanth, Ravi Ramamoorthi


398   Exploiting Spectral-Spatial Correlation for Coded Hyperspectral Image Restoration. 

Ying Fu, Yinqiang Zheng, Imari Sato, Yoichi Sato


399   Variable Aperture Light Field Photography: Overcoming the Diffraction-Limited Spatio-Angular Resolution Tradeoff. 

Julie Chang, Isaac Kauvar, Xuemei Hu, Gordon Wetzstein


400   Convolutional Networks for Shape From Light Field. 

Stefan Heber, Thomas Pock

Obtains depth information from light field data, and does so with a CNN. In this paper we utilize CNNs to predict depth information for given Light Field (LF) data. The proposed method learns an end-to-end mapping between the 4D light field and a representation of the corresponding 4D depth field in terms of 2D hyperplane orientations. The obtained prediction is then further refined in a post processing step by applying a higher-order regularization.

401   Panoramic Stereo Videos With a Single Camera. 

Rajat Aggarwal, Amrisha Vohra, Anoop M. Namboodiri


402   The Next Best Underwater View. 

Mark Sheinin, Yoav Y. Schechner


403   Reconstructing Shapes and Appearances of Thin Film Objects Using RGB Images. 

Yoshie Kobayashi, Tetsuro Morimoto, Imari Sato, Yasuhiro Mukaigawa, Takao Tomono, Katsushi Ikeuchi


404   Noisy Label Recovery for Shadow Detection in Unfamiliar Domains. 

Tomás F. Yago Vicente, Minh Hoai, Dimitris Samaras





Video Understanding

Wednesday, June 29th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

405   Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled. 

Oscar Koller, Hermann Ney, Richard Bowden

The novelty here is probably fairly high: it effectively uses consecutive video frames to train a per-frame image model, even though each individual frame may only be weakly labelled. This work presents a new approach to learning a frame-based classifier on weakly labelled sequence data by embedding a CNN within an iterative EM algorithm. This allows the CNN to be trained on a vast number of example images when only loose sequence level information is available for the source videos. Although we demonstrate this in the context of hand shape recognition, the approach has wider application to any video recognition task where frame level labelling is not available.

406   Recognizing Car Fluents From Video. 

Bo Li, Tianfu Wu, Caiming Xiong, Song-Chun Zhu


407   Pairwise Decomposition of Image Sequences for Active Multi-View Recognition. 

Edward Johns, Stefan Leutenegger, Andrew J. Davison

Two CNN networks. We propose to bring Convolutional Neural Networks to generic multi-view recognition, by decomposing an image sequence into a set of image pairs, classifying each pair independently, and then learning an object classifier by weighting the contribution of each pair. This allows for recognition over arbitrary camera trajectories, without requiring explicit training over the potentially infinite number of camera paths and lengths. Building these pairwise relationships then naturally extends to the next-best-view problem in an active recognition framework. To achieve this, we train a second Convolutional Neural Network to map directly from an observed image to next viewpoint. Finally, we incorporate this into a trajectory optimisation task, whereby the best recognition confidence is sought for a given trajectory length.

408   Inferring Forces and Learning Human Utilities From Videos. 

Yixin Zhu, Chenfanfu Jiang, Yibiao Zhao, Demetri Terzopoulos, Song-Chun Zhu


409   Force From Motion: Decoding Physical Sensation in a First Person Video. 

Hyun Soo Park, Jyh-Jing Hwang, Jianbo Shi





Video Analysis 2

Wednesday, June 29th, 2:50PM - 1:20PM.

These papers will also be presented at the following poster session

410   Robust Multi-Body Feature Tracker: A Segmentation-Free Approach. 

Pan Ji, Hongdong Li, Mathieu Salzmann, Yiran Zhong


411   Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video. 

Dinesh Jayaraman, Kristen Grauman

Video analysis, built on the assumption that the higher the order of a temporal quantity, the less it changes. A simple example: if the background is fixed and only a foreground person moves, the difference between most adjacent frames is small, and taking a second-order difference makes it smaller still — this is the higher-order information referred to here. Worth referencing. The key idea is to impose a prior that higher order derivatives in the learned feature space must be small. To this end, we train a convolutional neural network with a regularizer that minimizes a contrastive loss on tuples of sequential frames from unlabeled video. Focusing on the case of triplets of frames, the proposed method encourages that feature changes over time should be smooth, i.e., similar to the most recent changes.
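On a frame triplet (a, b, c), "steady" features mean the second difference f(c) - 2 f(b) + f(a) should be small, with a contrastive margin term pushing it up for mismatched triplets. A toy sketch under those assumptions; the margin value and the feature function are illustrative:

```python
import numpy as np

def steady_loss(fa, fb, fc, fa_neg, fb_neg, fc_neg, margin=1.0):
    # sequential triplet: penalise its second difference (should be small)
    pos = np.sum((fc - 2.0 * fb + fa) ** 2)
    # mismatched triplet: hinge pushes its second difference above the margin
    neg = np.sum((fc_neg - 2.0 * fb_neg + fa_neg) ** 2)
    return pos + max(0.0, margin - neg)

f = lambda t: np.array([t, t ** 2])  # toy feature trajectory over time t
loss = steady_loss(f(0.0), f(0.1), f(0.2),   # evenly spaced frames
                   f(0.0), f(0.5), f(0.6))   # unevenly spaced "negative"
```

The first-order version of this idea is classic slow feature analysis; the paper's contribution is exactly the step to second-order (steady) coherence sketched here.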

412   Volumetric 3D Tracking by Detection. 

Chun-Hao Huang, Benjamin Allain, Jean-Sébastien Franco, Nassir Navab, Slobodan Ilic, Edmond Boyer


413   The Solution Path Algorithm for Identity-Aware Multi-Object Tracking. 

Shoou-I Yu, Deyu Meng, Wangmeng Zuo, Alexander Hauptmann


414   In Defense of Sparse Tracking: Circulant Sparse Tracker. 

Tianzhu Zhang, Adel Bibi, Bernard Ghanem


415   Optical Flow With Semantic Segmentation and Localized Layers. 

Laura Sevilla-Lara, Deqing Sun, Varun Jampani, Michael J. Black


416   Video Segmentation via Object Flow. 

Yi-Hsuan Tsai, Ming-Hsuan Yang, Michael J. Black





Grouping and Optimization Methods

Wednesday, June 29th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

417   Closed-Form Training of Mahalanobis Distance for Supervised Clustering. 

Marc T. Law, YaoLiang Yu, Matthieu Cord, Eric P. Xing


418   Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit. 

Chong You, Daniel Robinson, René Vidal


419   Oracle Based Active Set Algorithm for Scalable Elastic Net Subspace Clustering. 

Chong You, Chun-Guang Li, Daniel P. Robinson, René Vidal


420   Sparse Coding and Dictionary Learning With Linear Dynamical Systems. 

Wenbing Huang, Fuchun Sun, Lele Cao, Deli Zhao, Huaping Liu, Mehrtash Harandi


421   Sublabel-Accurate Relaxation of Nonconvex Energies. 

Thomas Möllenhoff, Emanuel Laude, Michael Moeller, Jan Lellmann, Daniel Cremers





Statistical Methods and Transfer Learning

Wednesday, June 29th, 2:50PM - 1:20PM.

These papers will also be presented at the following poster session

422   The Multiverse Loss for Robust Transfer Learning. 

Etai Littwin, Lior Wolf


423   Learning From the Mistakes of Others: Matching Errors in Cross-Dataset Learning. 

Viktoriia Sharmanska, Novi Quadrianto


424   An Efficient Exact-PGA Algorithm for Constant Curvature Manifolds. 

Rudrasis Chakraborty, Dohyung Seo, Baba C. Vemuri


425   Online Learning With Bayesian Classification Trees. 

Samuel Rota Bulò, Peter Kontschieder


426   Cross-Stitch Networks for Multi-Task Learning. 

Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, Martial Hebert

This appears to be a new network structure that shares feature representations through multi-task learning. In this paper, we propose a principled approach to learn shared representations in ConvNets using multi-task learning. Specifically, we propose a new sharing unit: "cross-stitch" unit. These units combine the activations from multiple networks and can be trained end-to-end. A network with cross-stitch units can learn an optimal combination of shared and task-specific representations.
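A cross-stitch unit is just a learned 2x2 mixing matrix that linearly combines the same-layer activations of two task networks. A minimal sketch; the alpha initialisation below is illustrative:

```python
import numpy as np

alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])          # [[aAA, aAB], [aBA, aBB]], learned
xA = np.array([1.0, 2.0, 3.0])          # activations from the task-A network
xB = np.array([4.0, 5.0, 6.0])          # activations from the task-B network

stitched = alpha @ np.stack([xA, xB])   # rows: mixed activations xA', xB'
xA_new, xB_new = stitched
```

With alpha near the identity the tasks stay separate; with uniform alpha the layer is fully shared — training the alphas lets the network pick a point in between.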

427   Deep Metric Learning via Lifted Structured Feature Embedding. 

Hyun Oh Song, Yu Xiang, Stefanie Jegelka, Silvio Savarese

Deep metric learning. In this paper, we describe an algorithm for taking full advantage of the training batches in the neural network training by lifting the vector of pairwise distances within the batch to the matrix of pairwise distances. This step enables the algorithm to learn the state of the art feature embedding by optimizing a novel structured prediction objective for active hard negative mining on the lifted problem.
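The "lifting" step itself is easy to sketch: from a batch of embeddings, build the full matrix of pairwise distances on which the structured loss is then defined. Only the lifting is shown here; the structured objective is omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))                 # batch of 6 embeddings

sq = np.sum(X ** 2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
D = np.sqrt(np.maximum(D2, 0.0))                # lifted distance matrix
```

A plain contrastive or triplet loss only touches O(batch) distances; the lifted matrix exposes all O(batch^2) of them for hard-negative mining.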

428   Fast Algorithms for Convolutional Neural Networks. 

Andrew Lavin, Scott Gray

Aimed at improving speed. We introduce a new class of fast algorithms for convolutional neural networks using Winograd's minimal filtering algorithms. The algorithms compute minimal complexity convolution over small tiles, which makes them fast with small filters and small batch sizes.
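The smallest instance, F(2,3), computes two outputs of a 3-tap filter over a 4-sample tile with 4 multiplies instead of 6, via Y = Aᵀ[(G g) ⊙ (Bᵀ d)]. A 1D sketch with the standard published transform matrices, checked against direct correlation:

```python
import numpy as np

BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])                # filter transform
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)   # output transform

d = np.array([1.0, 2.0, 3.0, 4.0])             # input tile
g = np.array([0.5, 1.0, -1.0])                 # filter taps

y_winograd = AT @ ((G @ g) * (BT @ d))         # 4 elementwise multiplies
y_direct = np.array([d[i:i + 3] @ g for i in range(2)])  # 6 multiplies
```

In the paper the 2D analogue F(2x2, 3x3) is applied tile-by-tile over feature maps, amortising the transforms across channels.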




Poster Session 3-2. Wednesday, June 29th, 4:45PM - 6:45PM.

3D Vision

429   Coordinating Multiple Disparity Proposals for Stereo Computation. 

Ang Li, Dapeng Chen, Yuanliu Liu, Zejian Yuan


430   Joint Multiview Segmentation and Localization of RGB-D Images Using Depth-Induced Silhouette Consistency. 

Chi Zhang, Zhiwei Li, Rui Cai, Hongyang Chao, Yong Rui


431   A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. 

Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, Thomas Brox


432   6D Dynamic Camera Relocalization From Single Reference Image. 

Wei Feng, Fei-Peng Tian, Qian Zhang, Jizhou Sun


433   Dense Monocular Depth Estimation in Complex Dynamic Scenes. 

René Ranftl, Vibhav Vineet, Qifeng Chen, Vladlen Koltun


434   Using Self-Contradiction to Learn Confidence Measures in Stereo Vision. 

Christian Mostegel, Markus Rumpler, Friedrich Fraundorfer, Horst Bischof


435   Understanding Real World Indoor Scenes With Synthetic Data. 

Ankur Handa, Viorica Pătrăucean, Vijay Badrinarayanan, Simon Stent, Roberto Cipolla


436   Stereo Matching With Color and Monochrome Cameras in Low-Light Conditions. 

Hae-Gon Jeon, Joon-Young Lee, Sunghoon Im, Hyowon Ha, In So Kweon


437   Camera Calibration From Dynamic Silhouettes Using Motion Barcodes. 

Gil Ben-Artzi, Yoni Kasten, Shmuel Peleg, Michael Werman


438   Structure-From-Motion Revisited. 

Johannes L. Schönberger, Jan-Michael Frahm


439   Constructing Canonical Regions for Fast and Effective View Selection. 

Wencheng Wang, Tianhao Gao


440   Prior-Less Compressible Structure From Motion. 

Chen Kong, Simon Lucey


441   Rolling Shutter Camera Relative Pose: Generalized Epipolar Geometry. 

Yuchao Dai, Hongdong Li, Laurent Kneip


442   Structure From Motion With Objects. 

Marco Crocco, Cosimo Rubino, Alessio Del Bue


443   DeepHand: Robust Hand Pose Estimation by Completing a Matrix Imputed With Deep Features. 

Ayan Sinha, Chiho Choi, Karthik Ramani


Document Analysis

444   Multi-Oriented Text Detection With Fully Convolutional Networks. 

Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, Xiang Bai

Two FCN networks: the first detects salient text regions, the second does classification to remove false hypotheses. First, a Fully Convolutional Network (FCN) model is trained for predicting a salient map of text regions in a holistic manner. Then, a set of hypotheses text lines are estimated by combining the salient map and MSER components. Finally, another FCN classifier is used for predicting the centroid of each character, in order to remove the false hypotheses.

445   Robust Scene Text Recognition With Automatic Rectification. 

Baoguang Shi, Xinggang Wang, Pengyuan Lyu, Cong Yao, Xiang Bai


Face and Gesture

446   Mnemonic Descent Method: A Recurrent Process Applied for End-To-End Face Alignment. 

George Trigeorgis, Patrick Snape, Mihalis A. Nicolaou, Epameinondas Antonakos, Stefanos Zafeiriou


447   Large-Pose Face Alignment via CNN-Based Dense 3D Model Fitting. 

Amin Jourabloo, Xiaoming Liu

The novelty is not especially striking: face alignment, cast as a 3D-model-fitting regression problem solved by a cascade of CNNs. In this paper, we propose a face alignment method for large-pose face images, by combining the powerful cascaded CNN regressor method and 3DMM. We formulate the face alignment as a 3DMM fitting problem, where the camera projection matrix and 3D shape parameters are estimated by a cascade of CNN-based regressors.

448   Adaptive 3D Face Reconstruction From Unconstrained Photo Collections. 

Joseph Roth, Yiying Tong, Xiaoming Liu


449   Online Detection and Classification of Dynamic Hand Gestures With Recurrent 3D Convolutional Neural Network. 

Pavlo Molchanov, Xiaodong Yang, Shalini Gupta, Kihwan Kim, Stephen Tyree, Jan Kautz

Setting aside the exact application — it appears to be hand gesture recognition — the recurrent 3D convolutional network alone is of great reference value for video work. In this paper, we address these challenges with a recurrent three-dimensional convolutional neural network that performs simultaneous detection and classification of dynamic hand gestures from multi-modal data. We employ connectionist temporal classification to train the network to predict class labels from in-progress gestures in unsegmented input streams.

450   Kinematic Structure Correspondences via Hypergraph Matching. 

Hyung Jin Chang, Tobias Fischer, Maxime Petit, Martina Zambelli, Yiannis Demiris


451   CP-mtML: Coupled Projection Multi-Task Metric Learning for Large Scale Face Retrieval. 

Binod Bhattarai, Gaurav Sharma, Frederic Jurie


Motion and Tracking

452   PatchBatch: A Batch Augmented Loss for Optical Flow. 

David Gadot, Lior Wolf


453   Joint Recovery of Dense Correspondence and Cosegmentation in Two Images. 

Tatsunori Taniai, Sudipta N. Sinha, Yoichi Sato


454   Multi-View People Tracking via Hierarchical Trajectory Composition. 

Yuanlu Xu, Xiaobai Liu, Yang Liu, Song-Chun Zhu


455   Object Tracking via Dual Linear Structured SVM and Explicit Feature Map. 

Jifeng Ning, Jimei Yang, Shaojie Jiang, Lei Zhang, Ming-Hsuan Yang


456   Robust, Real-Time 3D Tracking of Multiple Objects With Similar Appearances. 

Taiki Sekii


457   An Egocentric Look at Video Photographer Identity. 

Yedid Hoshen, Shmuel Peleg


458   Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. 

Hyeonseob Nam, Bohyung Han

Deep learning for tracking: it first pretrains on a large set of videos; I did not follow the rest. We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify target in each domain. We train each domain in the network iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online.

459   Hedged Deep Tracking. 

Yuankai Qi, Shengping Zhang, Lei Qin, Hongxun Yao, Qingming Huang, Jongwoo Lim, Ming-Hsuan Yang

Deep learning for tracking; the Hedge method presumably amounts to fusing CNN features from multiple layers into a stronger tracker. In this paper, we propose a novel CNN based tracking framework, which takes full advantage of features from different CNN layers and uses an adaptive Hedge method to hedge several CNN trackers into a stronger one.
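The classical Hedge update, which I assume underlies the adaptive fusion, treats each per-layer tracker as an expert whose weight shrinks exponentially with its cumulative loss. A minimal sketch with toy losses; the learning rate `eta` is an assumption:

```python
import numpy as np

def hedge_update(weights, losses, eta=1.0):
    # multiplicative-weights step: experts with lower loss gain weight
    w = weights * np.exp(-eta * losses)
    return w / w.sum()

w = np.ones(3) / 3.0                    # three per-layer CNN trackers
losses = np.array([0.1, 0.5, 0.9])      # toy per-frame tracker losses
w = hedge_update(w, losses)             # fused decision uses these weights
```

Repeating this update each frame lets the ensemble shift weight toward whichever CNN layer is currently tracking best.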

460   Structural Correlation Filter for Robust Visual Tracking. 

Si Liu, Tianzhu Zhang, Xiaochun Cao, Changsheng Xu


461   Visual Tracking Using Attention-Modulated Disintegration and Integration. 

Jongwon Choi, Hyung Jin Chang, Jiyeoup Jeong, Yiannis Demiris, Jin Young Choi


462   A Continuous Occlusion Model for Road Scene Understanding. 

Vikas Dhiman, Quoc-Huy Tran, Jason J. Corso, Manmohan Chandraker


463   Virtual Worlds as Proxy for Multi-Object Tracking Analysis. 

Adrien Gaidon, Qiao Wang, Yohann Cabon, Eleonora Vig


Shape From X

464   Uncalibrated Photometric Stereo by Stepwise Optimization Using Principal Components of Isotropic BRDFs. 

Keisuke Midorikawa, Toshihiko Yamasaki, Kiyoharu Aizawa


465   Unbiased Photometric Stereo for Colored Surfaces: A Variational Approach. 

Yvain Quéau, Roberto Mecca, Jean-Denis Durou


466   3D Reconstruction of Transparent Objects With Position-Normal Consistency. 

Yiming Qian, Minglun Gong, Yee Hong Yang


467   Real-Time Depth Refinement for Specular Objects. 

Roy Or-El, Rom Hershkovitz, Aaron Wetzler, Guy Rosman, Alfred M. Bruckstein, Ron Kimmel


468   Recovering Transparent Shape From Time-Of-Flight Distortion. 

Kenichiro Tanaka, Yasuhiro Mukaigawa, Hiroyuki Kubo, Yasuyuki Matsushita, Yasushi Yagi


469   Robust Light Field Depth Estimation for Noisy Scene With Occlusion. 

Williem, In Kyu Park


470   Rotational Crossed-Slit Light Field. 

Nianyi Li, Haiting Lin, Bilin Sun, Mingyuan Zhou, Jingyi Yu


471   Single Image Object Modeling Based on BRDF and R-Surfaces Learning. 

Fabrizio Natola, Valsamis Ntouskos, Fiora Pirri, Marta Sanzari


Statistical Methods and Learning

472   A Nonlinear Regression Technique for Manifold Valued Data With Applications to Medical Image Analysis. 

Monami Banerjee, Rudrasis Chakraborty, Edward Ofori, Michael S. Okun, David E. Viallancourt, Baba C. Vemuri


473   RAID-G: Robust Estimation of Approximate Infinite Dimensional Gaussian With Application to Material Recognition. 

Qilong Wang, Peihua Li, Wangmeng Zuo, Lei Zhang


474   An Empirical Evaluation of Current Convolutional Architectures’ Ability to Manage Nuisance Location and Scale Variability. 

Nikolaos Karianakis, Jingming Dong, Stefano Soatto

Essentially a validation study. We conduct an empirical study to test the ability of convolutional neural networks (CNNs) to reduce the effects of nuisance transformations of the input data, such as location, scale and aspect ratio. We isolate factors by adopting a common convolutional architecture either deployed globally on the image to compute class posterior distributions, or restricted locally to compute class conditional distributions given location, scale and aspect ratios of bounding boxes determined by proposal heuristics. In theory, averaging the latter should yield inferior performance compared to proper marginalization. Yet empirical evidence suggests the converse, leading us to conclude that - at the current level of complexity of convolutional architectures and scale of the data sets used to train them - CNNs are not very effective at marginalizing nuisance variability. We also quantify the effects of context on the overall classification task and its impact on the performance of CNNs, and propose improved sampling techniques for heuristic proposal schemes that improve end-to-end performance to state-of-the-art levels.

475   Learning Sparse High Dimensional Filters: Image Filtering, Dense CRFs and Bilateral Neural Networks. 

Varun Jampani, Martin Kiefel, Peter V. Gehler


476   Mixture of Bilateral-Projection Two-Dimensional Probabilistic Principal Component Analysis. 

Fujiao Ju, Yanfeng Sun, Junbin Gao, Simeng Liu, Yongli Hu, Baocai Yin


477   Rolling Rotations for Recognizing Human Actions From 3D Skeletal Data. 

Raviteja Vemulapalli, Rama Chellappa


478   Improving the Robustness of Deep Neural Networks via Stability Training. 

Stephan Zheng, Yang Song, Thomas Leung, Ian Goodfellow

No particular novelty is claimed, but the goal is to improve the robustness of CNNs. We present a general stability training method to stabilize deep networks against small input distortions that result from various types of common image processing, such as compression, rescaling, and cropping. We validate our method by stabilizing the state-of-the-art Inception architecture against these types of distortions.
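The stability objective adds, to the usual task loss on the clean input, a penalty on how far the network's output moves under a small distortion of that input. A toy sketch; the stand-in network `f` and the weight `alpha` are illustrative assumptions:

```python
import numpy as np

def stability_loss(f, x, x_distorted, y_true, alpha=0.1):
    task = np.sum((f(x) - y_true) ** 2)               # toy task loss
    stability = np.sum((f(x) - f(x_distorted)) ** 2)  # output should not move
    return task + alpha * stability

f = lambda x: np.tanh(x)                 # stand-in for a deep network
x = np.array([0.2, -0.4])
loss = stability_loss(f, x, x + 0.01, f(x))  # tiny perturbation of the input
```

During training, `x_distorted` would be a compressed, rescaled, or cropped copy of `x`, so the network learns outputs that are stable under such processing.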

479   Logistic Boosting Regression for Label Distribution Learning. 

Chao Xing, Xin Geng, Hui Xue


480   Efficient Temporal Sequence Comparison and Classification Using Gram Matrix Embeddings on a Riemannian Manifold. 

Xikang Zhang, Yin Wang, Mengran Gou, Mario Sznaier, Octavia Camps


Vision For Graphics

481   Deep Reflectance Maps. 

Konstantinos Rematas, Tobias Ritschel, Mario Fritz, Efstratios Gavves, Tinne Tuytelaars

I did not quite work out what this is for. We propose a fully convolutional neural architecture to estimate reflectance maps of specular materials in natural lighting conditions. We achieve this in an end-to-end learning formulation that directly predicts a reflectance map from the image itself.

482   Semantic Filtering. 

Qingxiong Yang


Vision For Robotics

483   UAV Sensor Fusion With Latent-Dynamic Conditional Random Fields in Coronal Plane Estimation. 

Amir M. Rahimi, Raphael Ruschel, B.S. Manjunath


484   Robust Visual Place Recognition With Graph Kernels. 

Elena Stumm, Christopher Mei, Simon Lacroix, Juan Nieto, Marco Hutter, Roland Siegwart


Semantic Image Segmentation

485   Semantic Image Segmentation With Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform. 

Liang-Chieh Chen, Jonathan T. Barron, George Papandreou, Kevin Murphy, Alan L. Yuille

Replaces the fully-connected CRF with a domain transform, improving efficiency while maintaining accuracy. We propose replacing the fully-connected CRF with domain transform (DT), a modern edge-preserving filtering method in which the amount of smoothing is controlled by a reference edge map. Domain transform filtering is several times faster than dense CRF inference and we show that it yields comparable semantic segmentation results, accurately capturing object boundaries. Importantly, our formulation allows learning the reference edge map from intermediate CNN features instead of using the image gradient magnitude as in standard DT filtering.
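The domain transform is a recursive edge-aware filter: each sample blends its own value with the filtered previous sample, and the feedback weight collapses wherever the edge map fires, so smoothing stops at boundaries. A simplified 1D sketch; the exact coefficient formula here is an assumption:

```python
import numpy as np

def domain_transform_1d(x, edges, sigma=3.0):
    a = np.exp(-np.sqrt(2.0) / sigma)
    w = a ** (1.0 + edges)        # strong edge -> tiny w -> no smoothing
    y = x.astype(float).copy()
    for i in range(1, len(x)):    # single left-to-right recursive pass
        y[i] = (1.0 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])        # step signal
edges = np.array([0.0, 0.0, 0.0, 10.0, 0.0, 0.0])   # edge at the step
y = domain_transform_1d(x, edges)                   # step is preserved
```

In the paper, passes are run in both directions along rows and columns of the score maps, and the edge map itself is predicted from CNN features.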




Image & Video Captioning and Descriptions

Thursday, June 30th, 9:00AM - 10:05AM.

These papers will also be presented at the following poster session

486   Natural Language Object Retrieval. 

Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell

Retrieval based on natural language, or perhaps keyword, queries. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, spatial configurations and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as a score for the box, and can transfer visual-linguistic knowledge from image captioning domain to our task.

487   DenseCap: Fully Convolutional Localization Networks for Dense Captioning. 

Justin Johnson, Andrej Karpathy, Li Fei-Fei

DenseCap — no argument here, the results are very impressive, but it still sits within the CNN+RNN framework; the CNN is an FCLN, where the L stands for localization, since positions must be taken into account. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external regions proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and Recurrent Neural Network language model that generates the label sequences.

488   Unsupervised Learning From Narrated Instruction Videos. 

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien


489   Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks. 

Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu

Video captioning; worth referencing. Since a single image can be described with a CNN+RNN architecture, a video is simply a sequence of images, so the generated sentences can be composed into a paragraph by adding one more level of RNN — the hierarchical RNN the paper proposes. We present an approach that exploits hierarchical Recurrent Neural Networks (RNNs) to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video. Our hierarchical framework contains a sentence generator and a paragraph generator. The sentence generator produces one simple short sentence that describes a specific short video interval. It exploits both temporal- and spatial-attention mechanisms to selectively focus on visual elements during generation. The paragraph generator captures the inter-sentence dependency by taking as input the sentential embedding produced by the sentence generator, combining it with the paragraph history, and outputting the new initial state for the sentence generator.

490   Jointly Modeling Embedding and Translation to Bridge Video and Language. 

Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui

Still video description: the observation is that an individual sentence may be contextually correct yet semantically inaccurate, and this paper addresses exactly that. As a result, the generated sentences may be contextually correct but the semantics (e.g., subjects, verbs or objects) are not true. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given previous words and visual content, while the latter is to create a visual-semantic embedding space for enforcing the relationship between the semantics of the entire sentence and visual content.




High Level Semantics

Thursday, June 30th, 10:05AM - 10:30AM.

These papers will also be presented at the following poster session

491   We Are Humor Beings: Understanding and Predicting Visual Humor. 

Arjun Chandrasekaran, Ashwin K. Vijayakumar, Stanislaw Antol, Mohit Bansal, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh


492   Where to Look: Focus Regions for Visual Question Answering. 

Kevin J. Shih, Saurabh Singh, Derek Hoiem


493   Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge From External Sources. 

Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel


494   MovieQA: Understanding Stories in Movies Through Question-Answering. 

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler


495   TGIF: A New Dataset and Benchmark on Animated GIF Description. 

Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, Jiebo Luo


496   Image Captioning With Semantic Attention. 

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, Jiebo Luo





Non-Rigid Reconstruction and Motion Analysis

Thursday, June 30th, 9:00AM - 10:18AM.

These papers will also be presented at the following poster session

497   Temporally Coherent 4D Reconstruction of Complex Dynamic Scenes. 

Armin Mustafa, Hansung Kim, Jean-Yves Guillemaut, Adrian Hilton


498   Consensus of Non-Rigid Reconstructions. 

Minsik Lee, Jungchan Cho, Songhwai Oh


499   Isometric Non-Rigid Shape-From-Motion in Linear Time. 

Shaifali Parashar, Daniel Pizarro, Adrien Bartoli


500   Learning Online Smooth Predictors for Realtime Camera Planning Using Recurrent Decision Trees. 

Jianhui Chen, Hoang M. Le, Peter Carr, Yisong Yue, James J. Little


501   Egocentric Future Localization. 

Hyun Soo Park, Jyh-Jing Hwang, Yedong Niu, Jianbo Shi


502   Full Flow: Optical Flow Estimation By Global Optimization Over Regular Grids. 

Qifeng Chen, Vladlen Koltun





Human Pose Estimation

Thursday, June 30th, 10:18AM - 10:30AM.

These papers will also be presented at the following poster session

503   Structured Feature Learning for Pose Estimation. 

Xiao Chu, Wanli Ouyang, Hongsheng Li, Xiaogang Wang


504   Convolutional Pose Machines. 

Shih-En Wei, Varun Ramakrishna, Takeo Kanade, Yaser Sheikh


505   Human Pose Estimation With Iterative Error Feedback. 

João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, Jitendra Malik



Poster Session 4-1. Thursday, June 30th, 10:30AM - 12:30PM.

Deep Learning and CNNs

506   WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks. 

Thibaut Durand, Nicolas Thome, Matthieu Cord

A weakly supervised ConvNet that, on the one hand, exploits Multiple Instance Learning and, on the other, shares convolutional features; the novelty does not feel that large to me. In this paper, we introduce a novel framework for WEakly supervised Learning of Deep cOnvolutional neural Networks (WELDON). Our method is dedicated to automatically selecting relevant image regions from weak annotations, e.g. global image labels, and encompasses the following contributions. Firstly, WELDON leverages recent improvements on the Multiple Instance Learning paradigm, i.e. negative evidence scoring and top instance selection. Secondly, the deep CNN is trained to optimize Average Precision, and fine-tuned on the target dataset with efficient computations due to convolutional feature sharing.

507   DisturbLabel: Regularizing CNN on the Loss Layer. 

Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, Qi Tian

A new regularization method, in the same family as weight decay, model averaging, and data augmentation: in each training iteration, some samples' labels are deliberately replaced with incorrect ones, and experiments show this mechanism effectively suppresses overfitting. In this paper, we present DisturbLabel, an extremely simple algorithm which randomly replaces a part of labels as incorrect values in each iteration. Although it seems weird to intentionally generate incorrect training labels, we show that DisturbLabel prevents the network training from over-fitting by implicitly averaging over exponentially many networks which are trained with different label sets. To the best of our knowledge, DisturbLabel serves as the first work which adds noises on the loss layer. Meanwhile, DisturbLabel cooperates well with Dropout to provide complementary regularization functions.
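The mechanism is simple enough to sketch. A minimal NumPy version of the label-disturbing step, per my reading of the abstract (the `alpha` rate and uniform resampling over all classes are assumptions, not necessarily the paper's exact noise model):

```python
import numpy as np

def disturb_labels(labels, num_classes, alpha=0.1, rng=None):
    """Replace roughly a fraction `alpha` of the minibatch labels with
    labels drawn uniformly over all classes (so a 'disturbed' label
    can still be correct by chance)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels).copy()
    mask = rng.random(labels.shape[0]) < alpha
    labels[mask] = rng.integers(0, num_classes, size=int(mask.sum()))
    return labels

# Called once per SGD iteration on the minibatch labels:
batch_labels = np.array([3, 1, 4, 1, 5, 9, 2, 6])
noisy = disturb_labels(batch_labels, num_classes=10, alpha=0.25)
```

The rest of training is unchanged; only the loss layer sees the noisy labels.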

508   Gradual DropIn of Layers to Train Very Deep Neural Networks. 

Leslie N. Smith, Emily M. Hand, Timothy Doster

A novel idea: a network whose depth grows dynamically during training, implemented through a layer called DropIn; I have not fully worked out the implementation details. We introduce the concept of dynamically growing a neural network during training. In particular, an untrainable deep network starts as a trainable shallow network and newly added layers are slowly, organically added during training, thereby increasing the network's depth. This is accomplished by a new layer, which we call DropIn. The DropIn layer starts by passing the output from a previous layer (effectively skipping over the newly added layers), then increasingly including units from the new layers for both feedforward and backpropagation. We show that deep networks, which are untrainable with conventional methods, will converge with DropIn layers interspersed in the architecture. In addition, we demonstrate that DropIn provides regularization during training in an analogous way as dropout.
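A guess at what one DropIn forward step might look like, based only on the abstract: assuming the new layer's output has the same shape as the pass-through, and that the inclusion probability is ramped up over training (the schedule here is hypothetical):

```python
import numpy as np

def dropin(prev_out, new_out, p_include, rng):
    """One DropIn forward step: each unit of the newly added layer is
    included with probability p_include (ramped from 0 toward 1 over
    training); excluded units fall back to the pass-through value
    from the previous layer. Shapes are assumed to match."""
    mask = rng.random(new_out.shape) < p_include
    return np.where(mask, new_out, prev_out)

rng = np.random.default_rng(0)
prev = np.zeros(8)                    # pass-through (skip) activations
new = np.ones(8)                      # newly added layer's activations
early = dropin(prev, new, 0.1, rng)   # mostly pass-through early on
late = dropin(prev, new, 0.9, rng)    # mostly the new layer later
```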

509   Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition. 

Zhiwei Deng, Arash Vahdat, Hexiang Hu, Greg Mori

Worth a look to broaden one's horizons: it combines graphical models with neural networks, which is a natural move, but since inference in graphical models is hard, the paper proposes using an RNN to carry out the inference. In this paper, we propose a method to integrate graphical models and deep neural networks into a joint framework. Instead of using a traditional inference method, we use a sequential inference modeled by a recurrent neural network. Beyond this, the appropriate structure for inference can be learned by imposing gates on edges between nodes.

510   Deep SimNets. 

Nadav Cohen, Or Sharir, Amnon Shashua

A deep similarity-based network with two innovations: a similarity function replaces the inner-product layer, and a log-mean-exp operator (MEX) generalizes max and average. We present a deep layered architecture that generalizes convolutional neural networks (ConvNets). The architecture, called SimNets, is driven by two operators: (i) a similarity function that generalizes inner-product, and (ii) a log-mean-exp function called MEX that generalizes maximum and average. The two operators applied in succession give rise to a standard neuron but in "feature space". The feature spaces realized by SimNets depend on the choice of the similarity operator. The simplest setting, which corresponds to a convolution, realizes the feature space of the Exponential kernel, while other settings realize feature spaces of more powerful kernels (Generalized Gaussian, which includes as special cases RBF and Laplacian), or even dynamically learned feature spaces (Generalized Multiple Kernel Learning). As a result, the SimNet contains a higher abstraction level compared to a traditional ConvNet. We argue that enhanced expressiveness is important when the networks are small due to run-time constraints (such as those imposed by mobile applications).
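The MEX operator itself is easy to write down. A small sketch of log-mean-exp with its two limiting behaviors (the max-shift stabilization trick is mine, not from the paper):

```python
import numpy as np

def mex(x, beta):
    """log-mean-exp: (1/beta) * log(mean(exp(beta * x))).
    beta -> +inf approaches max(x), beta -> -inf approaches min(x),
    and beta -> 0 recovers mean(x)."""
    x = np.asarray(x, dtype=float)
    if beta == 0:
        return x.mean()
    m = (beta * x).max()   # shift for numerical stability
    return (m + np.log(np.mean(np.exp(beta * x - m)))) / beta

x = np.array([1.0, 2.0, 3.0])
hi = mex(x, 100.0)    # close to max(x) = 3
lo = mex(x, 1e-6)     # close to mean(x) = 2
```

Sweeping `beta` thus interpolates continuously between average pooling and max pooling.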

511   Studying Very Low Resolution Recognition Using Deep Networks. 

Zhangyang Wang, Shiyu Chang, Yingzhen Yang, Ding Liu, Thomas S. Huang

Targets very-low-resolution image recognition, again integrating several techniques, such as super-resolution and robust regression applied to the low-resolution input. We attempt to solve the VLRR problem using deep learning methods. Taking advantage of techniques primarily in super resolution, domain adaptation and robust regression, we formulate a dedicated deep learning method and demonstrate how these techniques are incorporated step by step. Any extra complexity, when introduced, is fully justified by both analysis and simulation results. The resulting Robust Partially Coupled Networks achieves feature enhancement and recognition simultaneously. It allows for both the flexibility to combat the LR-HR domain mismatch, and the robustness to outliers.

512   Deep Gaussian Conditional Random Field Network: A Model-Based Deep Network for Discriminative Denoising. 

Raviteja Vemulapalli, Oncel Tuzel, Ming-Yu Liu

Again two sub-networks, both tied to a Gaussian conditional random field: one estimates the potential parameters, the other performs the inference for denoising. We propose a novel end-to-end trainable deep network architecture for image denoising based on a Gaussian Conditional Random Field (GCRF) model. In contrast to the existing discriminative denoising methods that train a separate model for each individual noise level, the proposed deep network explicitly models the input noise variance and hence is capable of handling a range of noise levels. Our deep network, which we refer to as deep GCRF network, consists of two sub-networks: (i) a parameter generation network that generates the pairwise potential parameters based on the noisy input image, and (ii) an inference network whose layers perform the computations involved in an iterative GCRF inference procedure. We train two deep GCRF networks (each network operates over a range of noise levels: one for low input noise levels and one for high input noise levels) discriminatively by maximizing the peak signal-to-noise ratio measure.

513   Event-Specific Image Importance. 

Yufei Wang, Zhe Lin, Xiaohui Shen, Radomír Mĕch, Gavin Miller, Garrison W. Cottrell

Contributions: a notion of event-specific image importance, a dataset, and a new rank loss. We also propose a Convolutional Neural Network (CNN) based method to predict the image importance score of a given event album, using a novel rank loss function and a progressive training scheme.

514   Quantized Convolutional Neural Networks for Mobile Devices. 

Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, Jian Cheng

Whoa, quantized neural networks that substantially cut memory and computation; sounds impressive. In this paper, we propose an efficient framework, namely Quantized CNN, to simultaneously speed-up the computation and reduce the storage and memory overhead of CNN models. Both filter kernels in convolutional layers and weighting matrices in fully-connected layers are quantized, aiming at minimizing the estimation error of each layer's response.
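The storage-saving side of weight quantization can be illustrated with a toy scalar k-means codebook. Note this is only an illustration: the paper quantizes sub-vectors of kernels and weight matrices so as to minimize each layer's response error, which this scalar version does not do:

```python
import numpy as np

def quantize_weights(W, num_codes=16, iters=20, rng=None):
    """Toy scalar k-means quantization of a weight tensor: store one
    small codebook plus a per-weight code index instead of a full
    float per weight."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = W.ravel()
    codebook = rng.choice(w, size=num_codes, replace=False)
    for _ in range(iters):
        codes = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        for k in range(num_codes):
            if np.any(codes == k):
                codebook[k] = w[codes == k].mean()
    codes = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
    return codebook, codes.reshape(W.shape)

W = np.random.default_rng(1).standard_normal((64, 64))
codebook, codes = quantize_weights(W, num_codes=16)
W_hat = codebook[codes]   # reconstructed (dequantized) weights
```

With 16 codes, each weight needs only a 4-bit index plus a shared 16-entry codebook.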

515   Inverting Visual Representations With Convolutional Networks. 

Alexey Dosovitskiy, Thomas Brox

Essentially the inverse of a ConvNet: going from features back to a visualized image. It confirms my earlier idea was meaningful; pressing on. Inverting a deep network trained on ImageNet provides several insights into the properties of the feature representation learned by the network.

Face Recognition

516   Pose-Aware Face Recognition in the Wild. 

Iacopo Masi, Stephen Rawls, Gérard Medioni, Prem Natarajan


517   Multi-View Deep Network for Cross-View Classification. 

Meina Kan, Shiguang Shan, Xilin Chen

This one is worth referencing: the multi-view network aims to remove the effect of viewpoint variation. It contains two sub-networks, one removing view-specific variation and the other being an ordinary recognition network. The highlight is the loss, which is Fisher-related, namely the Rayleigh quotient objective. To eliminate the complex (maybe even highly nonlinear) view discrepancy for favorable cross-view recognition, we propose a multi-view deep network (MvDN), which seeks for a non-linear discriminant and view-invariant representation shared between multiple views. Specifically, our proposed MvDN network consists of two sub-networks, view-specific sub-network attempting to remove view-specific variations and the following common sub-network attempting to obtain common representation shared by all views. As the objective of MvDN network, the Fisher loss, i.e. the Rayleigh quotient objective, is calculated from the samples of all views so as to guide the learning of the whole network.

518   Sparsifying Neural Network Connections for Face Recognition. 

Yi Sun, Xiaogang Wang, Xiaoou Tang

I had long wanted to build a sparse network, and now someone has, a top group at that. The recipe: pretrain a conventional dense network, then iteratively sparsify it; the paper also notes that training the sparse network from scratch directly actually performs worse. This paper proposes to learn high-performance deep ConvNets with sparse neural connections, referred to as sparse ConvNets, for face recognition. The sparse ConvNets are learned in an iterative way, each time one additional layer is sparsified and the entire model is re-trained given the initial weights learned in previous iterations. One important finding is that directly training the sparse ConvNet from scratch failed to find good solutions for face recognition, while using a previously learned denser model to properly initialize a sparser model is critical to continue learning effective features for face recognition. This paper also proposes a new neural correlation-based weight selection criterion and empirically verifies its effectiveness in selecting informative connections from previously learned models in each iteration.

519   Pairwise Linear Regression Classification for Image Set Retrieval. 

Qingxiang Feng, Yicong Zhou, Rushi Lan


520   The MegaFace Benchmark: 1 Million Faces for Recognition at Scale.

Ira Kemelmacher-Shlizerman, Steven M. Seitz, Daniel Miller, Evan Brossard


521   Learnt Quasi-Transitive Similarity for Retrieval From Large Collections of Faces. 

Ognjen Arandjelović


522   Latent Factor Guided Convolutional Neural Networks for Age-Invariant Face Recognition. 

Yandong Wen, Zhifeng Li, Yu Qiao

Mostly an application of deep CNNs to this task. In order to address this problem, we propose a novel deep face recognition framework to learn the age-invariant deep face features through a carefully designed CNN model. To the best of our knowledge, this is the first attempt to show the effectiveness of deep CNNs in advancing the state-of-the-art of AIFR.

523   Copula Ordinal Regression for Joint Estimation of Facial Action Unit Intensity. 

Robert Walecki, Ognjen Rudovic, Vladimir Pavlovic, Maja Pantic


Face and Gesture

524   A Robust Multilinear Model Learning Framework for 3D Faces. 

Timo Bolkart, Stefanie Wuhrer


525   Ordinal Regression With Multiple Output CNN for Age Estimation. 

Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, Gang Hua

Also an application, but one that recasts the regression problem as a series of binary classification problems. In this paper, we propose an End-to-End learning approach to address ordinal regression problems using deep Convolutional Neural Network, which could simultaneously conduct feature learning and regression modeling. In particular, an ordinal regression problem is transformed into a series of binary classification sub-problems. And we propose a multiple output CNN learning algorithm to collectively solve these classification sub-problems, so that the correlation between these tasks could be explored.
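The ordinal-to-binary transformation is concrete enough to sketch: rank k becomes K-1 targets of the form "is the label greater than k?". The 0.5-threshold counting decode below is a common convention, not necessarily the paper's exact aggregation:

```python
import numpy as np

def ordinal_to_binary(y, num_ranks):
    """Encode an ordinal label y in {0, ..., K-1} as K-1 binary
    targets: target_k = 1 iff y > k."""
    return (y > np.arange(num_ranks - 1)).astype(int)

def binary_to_ordinal(probs, threshold=0.5):
    """Decode K-1 binary predictions back to a rank by counting how
    many sub-classifiers fire."""
    return int(np.sum(np.asarray(probs) > threshold))

enc = ordinal_to_binary(3, num_ranks=6)              # [1, 1, 1, 0, 0]
dec = binary_to_ordinal([0.9, 0.8, 0.7, 0.2, 0.1])   # rank 3
```

Unlike one-hot classification, a prediction that is off by one rank flips only one binary target, which is what lets the sub-problems share their ordinal structure.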

Human Pose Estimation

526   DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. 

Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, Bernt Schiele

Multi-task again: detection plus pose estimation; nothing feels particularly special to me. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other. This joint formulation is in contrast to previous strategies, that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configuration of body parts respecting geometric and appearance constraints.

527   Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation. 

Suha Kwak, Minsu Cho, Ivan Laptev

Also an application. We address the problem of learning a pose-aware, compact embedding that projects images with similar human poses to be placed close-by in the embedding space. The embedding function is built on a deep convolutional network, and trained with triplet-based rank constraints on real image data. This architecture allows us to learn a robust representation that captures differences in human poses by effectively factoring out variations in clothing, background, and imaging conditions in the wild. For a variety of pose-related tasks, the proposed pose embedding provides a cost-efficient and natural alternative to explicit pose estimation, circumventing challenges of localizing body joints.

528   A Dual-Source Approach for 3D Pose Estimation From a Single Image. 

Hashim Yasin, Umar Iqbal, Björn Krüger, Andreas Weber, Juergen Gall


529   Efficiently Creating 3D Training Data for Fine Hand Pose Estimation. 

Markus Oberweger, Gernot Riegler, Paul Wohlhart, Vincent Lepetit


530   Sparseness Meets Deepness: 3D Human Pose Estimation From Monocular Video. 

Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Konstantinos G. Derpanis, Kostas Daniilidis

Also an application. A deep fully convolutional network is trained to predict the uncertainty maps of the 2D joint locations.

Images and Language

531   Answer-Type Prediction for Visual Question Answering. 

Kushal Kafle, Christopher Kanan


532   Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes. 

Satwik Kottur, Ramakrishna Vedantam, José M. F. Moura, Devi Parikh


533   Visual7W: Grounded Question Answering in Images. 

Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei


534   Learning Deep Structure-Preserving Image-Text Embeddings. 

Liwei Wang, Yin Li, Svetlana Lazebnik


535   Yin and Yang: Balancing and Answering Binary Visual Questions. 

Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, Devi Parikh


Shape Representations and Matching

536   GIFT: A Real-Time and Scalable 3D Shape Search Engine. 

Song Bai, Xiang Bai, Zhichao Zhou, Zhaoxiang Zhang, Longin Jan Latecki


537   Functional Faces: Groupwise Dense Correspondence Using Functional Maps. 

Chao Zhang, William A. P. Smith, Arnaud Dessein, Nick Pears, Hang Dai


538   Similarity Metric For Curved Shapes In Euclidean Space. 

Girum G. Demisse, Djamila Aouada, Björn Ottersten


539   Shape Analysis With Hyperbolic Wasserstein Distance. 

Jie Shi, Wen Zhang, Yalin Wang


540   Tensor Power Iteration for Multi-Graph Matching. 

Xinchu Shi, Haibin Ling, Weiming Hu, Junliang Xing, Yanning Zhang


Transfer Learning

541   Multivariate Regression on the Grassmannian for Predicting Novel Domains. 

Yongxin Yang, Timothy M. Hospedales


542   Learning Cross-Domain Landmarks for Heterogeneous Domain Adaptation. 

Yao-Hung Hubert Tsai, Yi-Ren Yeh, Yu-Chiang Frank Wang


543   Geospatial Correspondences for Multimodal Registration. 

Diego Marcos, Raffay Hamid, Devis Tuia


544   Constrained Deep Transfer Feature Learning and Its Applications. 

Yue Wu, Qiang Ji

Transfer learning, aimed at the case where target-domain data is too scarce to train on directly. To address these issues, we introduce a constrained deep transfer feature learning method to perform simultaneous transfer learning and feature learning by performing transfer learning in a progressively improving feature space iteratively in order to better narrow the gap between the target domain and the source domain for effective transfer of the data from source domain to target domain. Furthermore, we propose to exploit the target domain knowledge and incorporate such prior knowledge as constraint during transfer learning to ensure that the transferred data satisfies certain properties of the target domain. To demonstrate the effectiveness of the proposed constrained deep transfer feature learning method, we apply it to thermal feature learning for eye detection by transferring from the visible domain.

Unsupervised, Semi-Supervised and Interactive Learning

545   Deep Canonical Time Warping. 

George Trigeorgis, Mihalis A. Nicolaou, Stefanos Zafeiriou, Björn W. Schuller


546   Multilinear Hyperplane Hashing. 

Xianglong Liu, Xinjie Fan, Cheng Deng, Zhujin Li, Hao Su, Dacheng Tao


547   Large Scale Hard Sample Mining With Monte Carlo Tree Search. 

Olivier Canévet, François Fleuret


548   Multi-Label Ranking From Positive and Unlabeled Data. 

Atsushi Kanehira, Tatsuya Harada


549   Joint Unsupervised Learning of Deep Representations and Image Clusters. 

Jianwei Yang, Devi Parikh, Dhruv Batra

Treats deep learning as a feature-extraction process and performs clustering on top of it: the unsupervised learning here is really clustering in the learned feature space, with feature extraction on one side and the clustering results serving as supervisory signals for representation learning on the other. In this paper, we propose a recurrent framework for joint unsupervised learning of deep representations and image clusters. In our framework, successive operations in a clustering algorithm are expressed as steps in a recurrent process, stacked on top of representations output by a Convolutional Neural Network (CNN). During training, image clusters and representations are updated jointly: image clustering is conducted in the forward pass, while representation learning in the backward pass. Our key idea behind this framework is that good representations are beneficial to image clustering and clustering results provide supervisory signals to representation learning. By integrating two processes into a single model with a unified weighted triplet loss function and optimizing it end-to-end, we can obtain not only more powerful representations, but also more precise image clusters.

550   Kernel Sparse Subspace Clustering on Symmetric Positive Definite Manifolds. 

Ming Yin, Yi Guo, Junbin Gao, Zhaoshui He, Shengli Xie


551   Symmetry reCAPTCHA. 

Chris Funk, Yanxi Liu


552   Unsupervised Learning of Discriminative Attributes and Visual Representations. 

Chen Huang, Chen Change Loy, Xiaoou Tang

The focus is again on unsupervised learning; unsupervised methods all seem to exploit this same idea, just like the paper above. Attributes offer useful mid-level features to interpret visual data. While most attribute learning methods are supervised by costly human-generated labels, we introduce a simple yet powerful unsupervised approach to learn and predict visual attributes directly from data. Given a large unlabeled image collection as input, we train deep Convolutional Neural Networks (CNNs) to output a set of discriminative, binary attributes often with semantic meanings. Specifically, we first train a CNN coupled with unsupervised discriminative clustering, and then use the cluster membership as a soft supervision to discover shared attributes from the clusters while maximizing their separability.

553   When VLAD Met Hilbert. 

Mehrtash Harandi, Mathieu Salzmann, Fatih Porikli


554   Approximate Log-Hilbert-Schmidt Distances Between Covariance Operators for Image Classification. 

Hà Quang Minh, Marco San Biagio, Loris Bazzani, Vittorio Murino


555   Subspace Clustering With Priors via Sparse Quadratically Constrained Quadratic Programming. 

Yongfang Cheng, Yin Wang, Mario Sznaier, Octavia Camps


556   Robust Tensor Factorization With Unknown Noise. 

Xi'ai Chen, Zhi Han, Yao Wang, Qian Zhao, Deyu Meng, Yandong Tang

Although not deep learning, this is very helpful for anyone who later wants to bring factorization into CNNs; worth a read. Because of the limitations of matrix factorization, such as losing spatial structure information, the concept of tensor factorization has been applied for the recovery of a low dimensional subspace from high dimensional visual data. Generally, the recovery is achieved by minimizing the loss function between the observed data and the factorization representation. Under different assumptions of the noise distribution, the loss functions are in various forms, like L1 and L2 norms. However, real data are often corrupted by noise with an unknown distribution. Then any specific form of loss function for one specific kind of noise often fails to tackle such real data with unknown noise. In this paper, we propose a tensor factorization algorithm to model the noise as a Mixture of Gaussians (MoG). As MoG has the ability of universally approximating any hybrids of continuous distributions, our algorithm can effectively recover the low dimensional subspace from various forms of noisy observations.

557   Kernel Approximation via Empirical Orthogonal Decomposition for Unsupervised Feature Learning. 

Yusuke Mukuta, Tatsuya Harada

Also not deep learning, also factorization-related, and also worth following. Kernel approximation methods are important tools for various machine learning problems. There are two major methods used to approximate the kernel function: the Nystrom method and the random features method. However, the Nystrom method requires relatively high-complexity post-processing to calculate a solution and the random features method does not provide sufficient generalization performance. In this paper, we propose a method that has good generalization performance without high-complexity postprocessing via empirical orthogonal decomposition using the probability distribution estimated from training data. We provide a bound for the approximation error of the proposed method.

558   Active Learning for Delineation of Curvilinear Structures. 

Agata Mosinska-Domanska, Raphael Sznitman, Przemyslaw Glowacki, Pascal Fua


559   Recognizing Emotions From Abstract Paintings Using Non-Linear Matrix Completion. 

Xavier Alameda-Pineda, Elisa Ricci, Yan Yan, Nicu Sebe

I currently need to study matrix-completion material, so this one is worth following. Advanced computer vision and machine learning techniques tried to automatically categorize the emotions elicited by abstract paintings with limited success. Since the annotation of the emotional content is highly resource-consuming, datasets of abstract paintings are either constrained in size or partially annotated. Consequently, it is natural to address the targeted task within a transductive framework. Intuitively, the use of multi-label classification techniques is desirable so to synergically exploit the relations between multiple latent variables, such as emotional content, technique, author, etc. A very popular approach for transductive multi-label recognition under linear classification settings is matrix completion. In this study we introduce non-linear matrix completion (NLMC), thus extending classical linear matrix completion techniques to the non-linear case. Together with the theory grounding the model, we propose an efficient optimization solver.

560   Tensor Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Tensors via Convex Optimization. 

Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, Shuicheng Yan

Looks packed with solid material; worth following. This paper studies the Tensor Robust Principal Component Analysis (TRPCA) problem which extends the known Robust PCA to the tensor case. Our model is based on a new tensor Singular Value Decomposition (t-SVD) and its induced tensor tubal rank and tensor nuclear norm. Consider that we have a 3-way tensor X in R^{n_1 x n_2 x n_3} such that X = L_0 + S_0, where L_0 has low tubal rank and S_0 is sparse. Is it possible to recover both components? In this work, we prove that under certain suitable assumptions, we can recover both the low-rank and the sparse components exactly by simply solving a convex program whose objective is a weighted combination of the tensor nuclear norm and the l1-norm, i.e., min_{L,E} ||L||_* + lambda*||E||_1, s.t. X = L + E, where lambda = 1/sqrt(max(n_1, n_2)*n_3). Interestingly, TRPCA involves RPCA as a special case when n_3 = 1 and thus it is a simple and elegant tensor extension of RPCA. Also numerical experiments verify our theory and the application to image denoising demonstrates the effectiveness of our method.

561   Sliced Wasserstein Kernels for Probability Distributions. 

Soheil Kolouri, Yang Zou, Gustavo K. Rohde


562   Trace Quotient Meets Sparsity: A Method for Learning Low Dimensional Image Representations. 

Xian Wei, Hao Shen, Martin Kleinsteuber


563   Backtracking ScSPM Image Classifier for Weakly Supervised Top-Down Saliency. 

Hisham Cholakkal, Jubin Johnson, Deepu Rajan


Video and Language

564   MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. 

Jun Xu, Tao Mei, Ting Yao, Yong Rui



Learning and CNN Architectures

Thursday, June 30th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

565   NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. 

Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, Josef Sivic

A newly exploited layer, VLAD, which already has nice properties; perhaps worth following. We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine.
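The VLAD aggregation the layer generalizes can be sketched in NumPy: soft-assign each descriptor to cluster centers, accumulate residuals, then normalize. This is a simplified stand-in for NetVLAD's learned soft-assignment convolution, with the assignment sharpness `alpha` as an assumed free parameter:

```python
import numpy as np

def soft_vlad(descriptors, centers, alpha=1.0):
    """Simplified NetVLAD-style aggregation: soft-assign each local
    descriptor to the cluster centers, accumulate residuals per
    center, intra-normalize, then L2-normalize the flattened vector.
    descriptors: (N, D), centers: (K, D)."""
    resid = descriptors[:, None, :] - centers[None, :, :]   # (N, K, D)
    logits = -alpha * (resid ** 2).sum(-1)                  # (N, K)
    logits -= logits.max(axis=1, keepdims=True)             # stable softmax
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)
    V = (a[:, :, None] * resid).sum(axis=0)                 # (K, D)
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12   # intra-norm
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
desc = rng.standard_normal((100, 8))   # 100 local descriptors, D = 8
cent = rng.standard_normal((4, 8))     # K = 4 cluster centers
v = soft_vlad(desc, cent)              # fixed length K*D = 32, unit norm
```

Because the soft assignment is differentiable, the centers (and the features feeding in) can be trained by backpropagation, which is the point of making VLAD a layer.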

566   Structural-RNN: Deep Learning on Spatio-Temporal Graphs. 

Ashesh Jain, Amir R. Zamir, Silvio Savarese, Ashutosh Saxena

A structural RNN built on spatio-temporal graphs? Presumably it fully exploits both temporal and spatial information. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs and sequence learning success of Recurrent Neural Networks (RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled as it can be used for transforming any spatio-temporal graph through employing a certain set of well defined steps.

567   Learning to Select Pre-Trained Deep Representations With Bayesian Evidence Framework. 

Yong-Deok Kim, Taewoong Jang, Bohyung Han, Seungjin Choi


568   Synthesized Classifiers for Zero-Shot Learning. 

Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, Fei Sha


569   Semi-Supervised Vocabulary-Informed Learning. 

Yanwei Fu, Leonid Sigal



Learning and Optimization

Thursday, June 30th, 2:50PM - 3:20PM.

These papers will also be presented at the following poster session

570   Simultaneous Clustering and Model Selection for Tensor Affinities. 

Zhuwen Li, Shuoguang Yang, Loong-Fah Cheong, Kim-Chuan Toh


571   Discriminatively Embedded K-Means for Multi-View Clustering. 

Jinglin Xu, Junwei Han, Feiping Nie


572   Min Norm Point Algorithm for Higher Order MRF-MAP Inference. 

Ishant Shanu, Chetan Arora, Parag Singla


573   Learning Deep Representation for Imbalanced Classification. 

Chen Huang, Yining Li, Chen Change Loy, Xiaoou Tang

Research-oriented work proposing to enforce both inter-cluster and inter-class margins. Data in vision domain often exhibit highly-skewed class distribution, i.e., most data belong to a few majority classes, while the minority classes only contain a scarce amount of instances. To mitigate this issue, contemporary classification methods based on deep convolutional neural network (CNN) typically follow classic strategies such as class re-sampling or cost-sensitive training. In this paper, we conduct extensive and systematic experiments to validate the effectiveness of these classic schemes for representation learning on class-imbalanced data. We further demonstrate that more discriminative deep representation can be learned by enforcing a deep network to maintain both inter-cluster and inter-class margins. This tighter constraint effectively reduces the class imbalance inherent in the local data neighborhood. We show that the margins can be easily deployed in standard deep learning framework through quintuplet instance sampling and the associated triple-header hinge loss.

574   Learning Local Image Descriptors With Deep Siamese and Triplet Convolutional Networks by Minimising Global Loss Functions. 

Vijay Kumar B G, Gustavo Carneiro, Ian Reid

Uses a triplet network to learn local image descriptors. Recent innovations in training deep convolutional neural network (ConvNet) models have motivated the design of new methods to automatically learn local image descriptors. The latest deep ConvNets proposed for this task consist of a siamese network that is trained by penalising misclassification of pairs of local image patches. Current results from machine learning show that replacing this siamese by a triplet network can improve the classification accuracy in several problems, but this has yet to be demonstrated for local image descriptor learning. Moreover, current siamese and triplet networks have been trained with stochastic gradient descent that computes the gradient from individual pairs or triplets of local image patches, which can make them prone to overfitting. In this paper, we first propose the use of triplet networks for the problem of local image descriptor learning. Furthermore, we also propose the use of a global loss that minimises the overall classification error of all patches present in the training set, which can improve the generalisation capability of the model.
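The per-triplet hinge loss that these networks build on is easy to state. This is the standard local triplet loss, not the paper's proposed global loss, which instead penalizes the overall classification error across the whole training set:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge-style triplet loss on descriptor vectors: the
    matching patch should be closer to the anchor than the
    non-matching patch by at least `margin` (squared distances)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

a = np.array([[0.0, 0.0]])    # anchor descriptor
p = np.array([[0.1, 0.0]])    # descriptor of a matching patch
n = np.array([[2.0, 0.0]])    # descriptor of a non-matching patch
loss = triplet_loss(a, p, n)  # margin already satisfied, so loss is 0
```

The paper's critique is that SGD over such individual triplets sees only local evidence, which motivates aggregating the error over all patches instead.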

575   Sparse Coding for Third-Order Super-Symmetric Tensor Descriptors With Application to Texture Recognition. 

Piotr Koniusz, Anoop Cherian


576   Random Features for Sparse Signal Classification. 

Jen-Hao Rick Chang, Aswin C. Sankaranarayanan, B. V. K. Vijaya Kumar



3D Shape Reconstruction

Thursday, June 30th, 1:45PM - 2:50PM.

These papers will also be presented at the following poster session

577   High-Quality Depth From Uncalibrated Small Motion Clip. 

Hyowon Ha, Sunghoon Im, Jaesik Park, Hae-Gon Jeon, In So Kweon


578   Efficient 3D Room Shape Recovery From a Single Panorama. 

Hao Yang, Hui Zhang


579   Structured Prediction of Unobserved Voxels From a Single Depth Image. 

Michael Firman, Oisin Mac Aodha, Simon Julier, Gabriel J. Brostow


580   HyperDepth: Learning Depth From Structured Light Without Matching. 

Sean Ryan Fanello, Christoph Rhemann, Vladimir Tankovich, Adarsh Kowdle, Sergio Orts Escolano, David Kim, Shahram Izadi


581   SVBRDF-Invariant Shape and Reflectance Estimation From Light-Field Cameras. 

Ting-Chun Wang, Manmohan Chandraker, Alexei A. Efros, Ravi Ramamoorthi



3D Reconstruction

Thursday, June 30th, 2:50PM - 3:20PM.

These papers will also be presented at the following poster session

582   Semantic 3D Reconstruction With Continuous Regularization and Ray Potentials Using a Visibility Consistency Constraint. 

Nikolay Savinov, Christian Häne, Ľubor Ladický, Marc Pollefeys


583   Theory and Practice of Structure-From-Motion Using Affine Correspondences. 

Carolina Raposo, João P. Barreto


584   Just Look at the Image: Viewpoint-Specific Surface Normal Prediction for Improved Multi-View Reconstruction. 

Silvano Galliani, Konrad Schindler


585   From Dusk Till Dawn: Modeling in the Dark. 

Filip Radenović, Johannes L. Schönberger, Dinghuang Ji, Jan-Michael Frahm, Ondřej Chum, Jiří Matas


586   Accelerated Generative Models for 3D Point Cloud Data. 

Benjamin Eckart, Kihwan Kim, Alejandro Troccoli, Alonzo Kelly, Jan Kautz


587   Monocular Depth Estimation Using Neural Regression Forest. 

Anirban Roy, Sinisa Todorovic


588   DeepStereo: Learning to Predict New Views From the World’s Imagery. 

John Flynn, Ivan Neulander, James Philbin, Noah Snavely





Face, Gesture, & Situation Recognition: Algorithms and Datasets

Thursday, June 30th, 3:45PM - 4:10PM.

These papers will also be presented at the following poster session

589   WIDER FACE: A Face Detection Benchmark. 

Shuo Yang, Ping Luo, Chen-Change Loy, Xiaoou Tang


590   Situation Recognition: Visual Semantic Role Labeling for Image Understanding.

Mark Yatskar, Luke Zettlemoyer, Ali Farhadi





People and Faces

Thursday, June 30th, 4:10PM - 4:45PM.

These papers will also be presented at the following poster session

591   A 3D Morphable Model Learnt From 10,000 Faces. 

James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, David Dunaway


592   Some Like It Hot - Visual Guidance for Preference Prediction. 

Rasmus Rothe, Radu Timofte, Luc Van Gool


593   EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild. 

C. Fabian Benitez-Quiroz, Ramprakash Srinivasan, Aleix M. Martinez


594   ForgetMeNot: Memory-Aware Forensic Facial Sketch Matching. 

Shuxin Ouyang, Timothy M. Hospedales, Yi-Zhe Song, Xueming Li


595   LOMo: Latent Ordinal Model for Facial Analysis in Videos. 

Karan Sikka, Gaurav Sharma, Marian Bartlett


596   Discriminative Invariant Kernel Features: A Bells-and-Whistles-Free Approach to Unsupervised Face Recognition and Pose Estimation. 

Dipan K. Pal, Felix Juefei-Xu, Marios Savvides


597   Bottom-Up and Top-Down Reasoning With Hierarchical Rectified Gaussians. 

Peiyun Hu, Deva Ramanan


598   Fits Like a Glove: Rapid and Reliable Hand Shape Personalization. 

David Joseph Tan, Thomas Cashman, Jonathan Taylor, Andrew Fitzgibbon, Daniel Tarlow, Sameh Khamis, Shahram Izadi, Jamie Shotton


599   Slicing Convolutional Neural Network for Crowd Video Understanding. 

Jing Shao, Chen-Change Loy, Kai Kang, Xiaogang Wang

Exactly the same idea I had in mind — they implemented it while I was still working out how; a valuable reference as well. Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown its remarkable potential in learning appearance representations from images. However, the learning of dynamic representation, and how it can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatio- and 2D temporal-slices representations. The decomposition brings unique advantages: (1) the model is capable of capturing dynamics of different semantic units such as groups and objects, (2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and (3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding.
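The slice decomposition the abstract describes can be illustrated directly on a feature volume. A hypothetical NumPy sketch follows; the actual S-CNN applies separate 2D convolution branches to each slice family, which is omitted here:

```python
import numpy as np

def slice_feature_maps(fmap):
    """Decompose a 3D (T, H, W) feature volume into the three slice
    families used by the Slicing CNN idea: spatial xy-slices (one per
    frame) and the two temporal slice directions, xt and yt."""
    xy = fmap                     # (T, H, W): appearance, one map per frame
    xt = fmap.transpose(1, 0, 2)  # (H, T, W): temporal slices along y
    yt = fmap.transpose(2, 0, 1)  # (W, T, H): temporal slices along x
    return xy, xt, yt
```

Each returned stack can then be fed to an ordinary 2D CNN, which is how the decomposition lets standard spatial filters capture dynamics.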




3D, Stereo, Matching, and Saliency Estimation

Thursday, June 30th, 3:45PM - 4:45PM.

These papers will also be presented at the following poster session

600   Linear Shape Deformation Models With Local Support Using Graph-Based Structured Matrix Factorisation. 

Florian Bernard, Peter Gemmar, Frank Hertel, Jorge Goncalves, Johan Thunberg


601   Motion From Structure (MfS): Searching for 3D Objects in Cluttered Point Trajectories. 

Jayakorn Vongkulbhisal, Ricardo Cabral, Fernando De la Torre, João P. Costeira


602   Volumetric and Multi-View CNNs for Object Classification on 3D Data. 

Charles R. Qi, Hao Su, Matthias Niessner, Angela Dai, Mengyuan Yan, Leonidas J. Guibas

Another combination of two kinds of models. 3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, we witness two types of CNNs being developed: CNNs based upon volumetric representations versus CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D.

603   Detecting Vanishing Points Using Global Image Context in a Non-Manhattan World. 

Menghua Zhai, Scott Workman, Nathan Jacobs


604   Learning Weight Uncertainty With Stochastic Gradient MCMC for Shape Classification. 

Chunyuan Li, Andrew Stevens, Changyou Chen, Yunchen Pu, Zhe Gan, Lawrence Carin

Appears to rest on deep theoretical foundations. Learning the representation of shape cues in 2D & 3D objects for recognition is a fundamental task in computer vision. Deep neural networks (DNNs) have shown promising performance on this task. Due to the large variability of shapes, accurate recognition relies on good estimates of model uncertainty, ignored in traditional training of DNNs, typically learned via stochastic optimization. This paper leverages recent advances in stochastic gradient Markov Chain Monte Carlo (SG-MCMC) to learn weight uncertainty in DNNs. It yields principled Bayesian interpretations for the commonly used Dropout/DropConnect techniques and incorporates them into the SG-MCMC framework. Extensive experiments on 2D & 3D shape datasets and various DNN models demonstrate the superiority of the proposed approach over stochastic optimization.
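As a rough illustration of the SG-MCMC idea, a single stochastic gradient Langevin dynamics (SGLD) step adds Gaussian noise, scaled to the step size, to the usual gradient update, so the weight iterates approximately sample from the posterior rather than converging to a point estimate. This is a simplified sketch, not the paper's exact sampler:

```python
import numpy as np

def sgld_step(w, grad_log_post, lr, rng):
    """One SGLD update: half a gradient-ascent step on the log posterior
    plus N(0, lr) noise, so repeated iterates explore the posterior
    over weights instead of collapsing to the mode."""
    noise = rng.normal(0.0, np.sqrt(lr), size=w.shape)
    return w + 0.5 * lr * grad_log_post(w) + noise
```

Running many such steps and averaging predictions over the sampled weights is what provides the model-uncertainty estimates the abstract refers to.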

605   A Field Model for Repairing 3D Shapes. 

Duc Thanh Nguyen, Binh-Son Hua, Khoi Tran, Quang-Hieu Pham, Sai-Kit Yeung


606   GOGMA: Globally-Optimal Gaussian Mixture Alignment. 

Dylan Campbell, Lars Petersson


607   Efficient Deep Learning for Stereo Matching. 

Wenjie Luo, Alexander G. Schwing, Raquel Urtasun

The goal is simply to speed up matching. In contrast, in this paper we propose a matching network which is able to produce very accurate results in less than a second of GPU computation. Towards this goal, we exploit a product layer which simply computes the inner product between the two representations of a siamese architecture. We train our network by treating the problem as multi-class classification, where the classes are all possible disparities.
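A minimal sketch of the product-layer idea, assuming precomputed descriptors: one left-patch feature is scored against the right-patch features at every candidate disparity via inner products, and a softmax turns the scores into the multi-class distribution over disparities:

```python
import numpy as np

def disparity_scores(left_feat, right_feats):
    """Inner-product 'product layer': score each candidate disparity by
    the dot product between the left-patch descriptor (D-dim) and the
    right-patch descriptor at that disparity (rows of right_feats),
    then softmax over disparities."""
    scores = right_feats @ left_feat          # one score per disparity
    e = np.exp(scores - scores.max())         # numerically stable softmax
    return e / e.sum()
```

Because the matching cost is just a dot product, all disparities for a pixel can be scored in one matrix-vector product, which is where the sub-second speed comes from.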

608   Efficient Coarse-To-Fine PatchMatch for Large Displacement Optical Flow. 

Yinlin Hu, Rui Song, Yunsong Li


609   FANNG: Fast Approximate Nearest Neighbour Graphs. 

Ben Harwood, Tom Drummond


610   Exemplar-Driven Top-Down Saliency Detection via Deep Association. 

Shengfeng He, Rynson W.H. Lau, Qingxiong Yang


611   Unconstrained Salient Object Detection via Proposal Subset Optimization. 

Jianming Zhang, Stan Sclaroff, Zhe Lin, Xiaohui Shen, Brian Price, Radomír Mech


612   Recombinator Networks: Learning Coarse-To-Fine Feature Aggregation. 

Sina Honari, Jason Yosinski, Pascal Vincent, Christopher Pal


613   End-To-End Saliency Mapping via Probability Distribution Prediction. 

Saumya Jetley, Naila Murray, Eleonora Vig

New loss functions. In this work, we introduce a new saliency map model which formulates a map as a generalized Bernoulli distribution. We then train a deep architecture to predict such maps using novel loss functions which pair the softmax activation function with measures designed to compute distances between probability distributions.
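One concrete instance of such a loss pairs a softmax over all pixel locations with the KL divergence to the ground-truth fixation distribution. A minimal NumPy sketch follows; the paper explores several distribution distances, so this is illustrative only:

```python
import numpy as np

def saliency_kl_loss(logits, target):
    """Treat the saliency map as a distribution over pixels: softmax the
    logits over all locations, normalise the target map, and return
    KL(target || prediction)."""
    z = logits.ravel()
    p = np.exp(z - z.max())
    p /= p.sum()                      # predicted pixel distribution
    q = target.ravel() / target.sum() # ground-truth fixation distribution
    eps = 1e-12                       # guard against log(0)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))
```

The key difference from a per-pixel sigmoid loss is that the softmax couples all pixels, so the loss measures the shape of the whole map at once.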




Poster Session 4-2. Thursday, June 30th, 4:45PM - 6:45PM.

Biologically Inspired Vision

614   A Paradigm for Building Generalized Models of Human Image Perception Through Data Fusion. 

Shaojing Fan, Tian-Tsong Ng, Bryan L. Koenig, Ming Jiang, Qi Zhao


615   Longitudinal Face Modeling via Temporal Deep Restricted Boltzmann Machines. 

Chi Nhan Duong, Khoa Luu, Kha Gia Quach, Tien D. Bui

Essentially uses deep learning to estimate face aging, via deep restricted Boltzmann machines. This paper presents a deep model approach for face age progression that can efficiently capture the non-linear aging process and automatically synthesize a series of age-progressed faces in various age ranges. In this approach, we first decompose the long-term age progress into a sequence of short-term changes and model it as a face sequence. The Temporal Deep Restricted Boltzmann Machines based age progression model together with the prototype faces are then constructed to learn the aging transformation between faces in the sequence. In addition, to enhance the wrinkles of faces in the later age ranges, the wrinkle models are further constructed using Restricted Boltzmann Machines to capture their variations in different facial regions. The geometry constraints are also taken into account in the last step for more consistent age-progressed results. The proposed approach is evaluated using various face aging databases, i.e. FG-NET, Cross-Age Celebrity Dataset (CACD) and MORPH, and our collected large-scale aging database named AginG Faces in the Wild (AGFW).

616   Saliency Unified: A Deep Architecture for Simultaneous Eye Fixation Prediction and Salient Object Segmentation. 

Srinivas S. S. Kruthiventi, Vennela Gudisa, Jaley H. Dholakiya, R. Venkatesh Babu

Multi-task: one task is eye-fixation prediction and the other is salient-object segmentation. Jointly combining these two could actually yield real substance, unlike other works where making two predictions amounts to nothing more than summing loss functions. In this work, we propose a deep convolutional neural network (CNN) capable of predicting eye fixations and segmenting salient objects in a unified framework. We design the initial network layers, shared between both the tasks, such that they capture the object level semantics and the global contextual aspects of saliency, while the deeper layers of the network address task specific aspects. In addition, our network captures saliency at multiple scales via inception-style convolution blocks.

Image Alignment and Registration

617   Estimating Correspondences of Deformable Objects In-The-Wild. 

Yuxiang Zhou, Epameinondas Antonakos, Joan Alabort-i-Medina, Anastasios Roussos, Stefanos Zafeiriou


618   Gravitational Approach for Point Set Registration. 

Vladislav Golyanik, Sk Aziz Ali, Didier Stricker


619   Context-Aware Gaussian Fields for Non-Rigid Point Set Registration. 

Gang Wang, Zhicheng Wang, Yufei Chen, Qiangqiang Zhou, Weidong Zhao


Optimization — although none of the papers in this series use deep learning, they are all helpful for decomposition, optimization, and the like, and worth following.

620   Trust No One: Low Rank Matrix Factorization Using Hierarchical RANSAC. 

Magnus Oskarsson, Kenneth Batstone, Kalle Åström


621   Relaxation-Based Preprocessing Techniques for Markov Random Field Inference. 

Chen Wang, Ramin Zabih


622   Sparse Coding for Classification via Discrimination Ensemble. 

Yuhui Quan, Yong Xu, Yuping Sun, Yan Huang, Hui Ji


623   Principled Parallel Mean-Field Inference for Discrete Random Fields. 

Pierre Baqué, Timur Bagautdinov, François Fleuret, Pascal Fua


624   Guaranteed Outlier Removal With Mixed Integer Linear Programs.

Tat-Jun Chin, Yang Heng Kee, Anders Eriksson, Frank Neumann


625   Memory Efficient Max Flow for Multi-Label Submodular MRFs. 

Thalaiyasingam Ajanthan, Richard Hartley, Mathieu Salzmann


626   Proximal Riemannian Pursuit for Large-Scale Trace-Norm Minimization. 

Mingkui Tan, Shijie Xiao, Junbin Gao, Dong Xu, Anton van den Hengel, Qinfeng Shi


627   Minimizing the Maximal Rank. 

Erik Bylow, Carl Olsson, Fredrik Kahl, Mikael Nilsson


628   Solving Temporal Puzzles. 

Caglayan Dicle, Burak Yilmaz, Octavia Camps, Mario Sznaier


629   Estimating Sparse Signals With Smooth Support via Convex Programming and Block Sparsity. 

Sohil Shah, Tom Goldstein, Christoph Studer


630   TenSR: Multi-Dimensional Tensor Sparse Representation. 

Na Qi, Yunhui Shi, Xiaoyan Sun, Baocai Yin


631   Moral Lineage Tracing. 

Florian Jug, Evgeny Levinkov, Corinna Blasse, Eugene W. Myers, Bjoern Andres


632   Globally Optimal Rigid Intensity Based Registration: A Fast Fourier Domain Approach. 

Behrooz Nasihatkon, Frida Fejne, Fredrik Kahl


633   On Benefits of Selection Diversity via Bilevel Exclusive Sparsity. 

Haichuan Yang, Yijun Huang, Lam Tran, Ji Liu, Shuai Huang


Recognition and Detection

634   Fast Training of Triplet-Based Deep Binary Embedding Networks. 

Bohan Zhuang, Guosheng Lin, Chunhua Shen, Ian Reid

Multi-class classification, split into two stages: the first designs the objective function to improve computational efficiency; the second designs a CNN architecture that converts the multi-class problem into multiple binary-coding classification problems. To address this issue, we propose to formulate high-order binary codes learning as a multi-label classification problem by explicitly separating learning into two interleaved stages. To solve the first stage, we design a large-scale high-order binary codes inference algorithm to reduce the high-order objective to a standard binary quadratic problem such that graph cuts can be used to efficiently infer the binary codes which serve as the labels of each training datum. In the second stage we propose to map the original image to compact binary codes via carefully designed deep convolutional neural networks (CNNs) and the hashing function fitting can be solved by training binary CNN classifiers.

635   Marr Revisited: 2D-3D Alignment via Surface Normal Prediction. 

Aayush Bansal, Bryan Russell, Abhinav Gupta

Essentially designs a skip-network plus a two-stream network. We introduce a skip-network model built on the pre-trained Oxford VGG convolutional neural network for surface normal prediction. Our model achieves state-of-the-art accuracy on the NYUv2 RGB-D dataset for surface normal prediction, and recovers fine object detail compared to previous methods. Furthermore, we develop a two-stream network over the input image and predicted surface normals that jointly learns pose and style for CAD model retrieval.

636   Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning. 

Ziad Al-Halah, Makarand Tapaswi, Rainer Stiefelhagen


637   Fast Zero-Shot Image Tagging. 

Yang Zhang, Boqing Gong, Mubarak Shah


638   Modality and Component Aware Feature Fusion For RGB-D Scene Classification. 

Anran Wang, Jianfei Cai, Jiwen Lu, Tat-Jen Cham


639   PPP: Joint Pointwise and Pairwise Image Label Prediction. 

Yilin Wang, Suhang Wang, Jiliang Tang, Huan Liu, Baoxin Li


640   Cataloging Public Objects Using Aerial and Street-Level Images – Urban Trees. 

Jan D. Wegner, Steven Branson, David Hall, Konrad Schindler, Pietro Perona


641   Deep Exemplar 2D-3D Detection by Adapting From Real to Rendered Views. 

Francisco Massa, Bryan C. Russell, Mathieu Aubry

In essence this is also recognition. This paper presents an end-to-end convolutional neural network (CNN) for 2D-3D exemplar detection. We demonstrate that the ability to adapt the features of natural images to better align with those of CAD rendered views is critical to the success of our technique. We show that the adaptation can be learned by compositing rendered views of textured object models on natural images. Our approach can be naturally incorporated into a CNN detection pipeline and extends the accuracy and speed benefits from recent advances in deep learning to 2D-3D exemplar detection.

642   Zero-Shot Learning via Joint Latent Similarity Embedding. 

Ziming Zhang, Venkatesh Saligrama


643   CRAFT Objects From Images. 

Bin Yang, Junjie Yan, Zhen Lei, Stan Z. Li

Whereas earlier R-CNN / Faster R-CNN split recognition into two tasks, here each of those two is further decomposed into two sub-tasks, four in total. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances.

