CVPR-Related Papers 02

26. Posebits for Monocular Human Pose Estimation
Pons-Moll, G. ; Fleet, D.J. ; Rosenhahn, B.
Publication Year: 2014, Page(s): 2345-2352

Uses posebits for monocular human pose estimation. The paper introduces qualitative information about 3D human pose, called posebits. Posebits represent Boolean geometric relationships between body parts (e.g., left-leg in front of right-leg, or hands close to each other).
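For illustration, here is a minimal sketch of how such Boolean posebits might be computed from 3D joint positions. The joint names, depth-axis convention, and 15 cm threshold are my own assumptions, not the paper's definitions.

```python
import numpy as np

def posebits(joints3d, depth_axis=2, close_thresh=0.15):
    """joints3d: dict mapping joint name -> (x, y, z) position in meters."""
    bits = {}
    # Is the left knee closer to the camera than the right knee?
    # (Smaller depth coordinate = nearer, assuming a camera-facing z axis.)
    bits["left_leg_in_front"] = (
        joints3d["left_knee"][depth_axis] < joints3d["right_knee"][depth_axis])
    # Are the two hands within close_thresh meters of each other?
    hand_gap = np.linalg.norm(np.asarray(joints3d["left_hand"])
                              - np.asarray(joints3d["right_hand"]))
    bits["hands_close"] = bool(hand_gap < close_thresh)
    return bits
```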

27. Robust Estimation of 3D Human Poses from a Single Image
Chunyu Wang ; Yizhou Wang ; Zhouchen Lin ; Yuille, A.L. ; Wen Gao

Pose estimation is also a key step in action recognition.
We propose a method of estimating 3D human poses from a single image, which works in conjunction with an existing 2D pose/joint detector. 3D pose estimation is challenging because multiple 3D poses may correspond to the same 2D pose after projection due to the lack of depth information.
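A toy numeric illustration (mine, not the paper's) of why this is ill-posed: under orthographic projection, flipping every joint's depth gives a genuinely different 3D pose with exactly the same 2D projection.

```python
import numpy as np

# Toy 3-joint "skeleton"; columns are (x, y, z) with z as depth.
pose_a = np.array([[0.0, 1.0, 0.5],
                   [0.2, 0.5, 0.1],
                   [0.1, 0.0, 0.4]])
pose_b = pose_a.copy()
pose_b[:, 2] *= -1                 # a genuinely different 3D pose

def project(pose):
    return pose[:, :2]             # orthographic projection drops depth

assert np.allclose(project(pose_a), project(pose_b))  # identical 2D joints
```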


28. Efficient Action Localization with Approximately Normalized Fisher Vectors
Oneata, D. ; Verbeek, J. ; Schmid, C.
Publication Year: 2014, Page(s): 2545-2552
A transformation of the Fisher vector that improves action recognition performance.
The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-word representation.
Transformation of the FV by power and ℓ2 normalizations has been shown to significantly improve its performance, and led to state-of-the-art results for a range of image and video classification and retrieval tasks.
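For reference, the two normalizations mentioned above are easy to state in code; a minimal sketch (the paper's actual contribution, approximate normalization for efficient localization, is not reproduced here):

```python
import numpy as np

def normalize_fv(fv, alpha=0.5, eps=1e-12):
    """Signed power normalization followed by l2 normalization."""
    fv = np.sign(fv) * np.abs(fv) ** alpha        # power (signed square root)
    return fv / (np.linalg.norm(fv) + eps)        # l2

# Dimension is illustrative: 256 GMM components x 64-dim descriptors x 2.
fv = normalize_fv(np.random.randn(2 * 256 * 64))
```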

29. Towards Good Practices for Action Video Encoding
Jianxin Wu ; Yu Zhang ; Weiyao Lin
Publication Year: 2014, Page(s): 2577-2584
Cited by: Papers (2)
An encoding built on top of VLAD; the results are quite good.
High dimensional representations such as VLAD or FV have shown excellent accuracy in action recognition. This paper shows that a proper encoding built upon VLAD can achieve further accuracy boost with only negligible computational cost.
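For reference, a minimal sketch of a plain VLAD encoder (assuming a pre-trained k-means codebook); the paper's proposed encoding refinements sit on top of something like this:

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """descriptors: (n, d) local features; centers: (k, d) k-means centroids."""
    k, d = centers.shape
    # Hard-assign each descriptor to its nearest centroid.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            v[i] = (members - centers[i]).sum(axis=0)   # aggregate residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))                 # power normalization
    return v / (np.linalg.norm(v) + 1e-12)              # l2 normalization
```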


30. Improving Semantic Concept Detection through the Dictionary of Visually-Distinct Elements
Dehghan, A. ; Idrees, H. ; Shah, M.
Publication Year: 2014, Page(s): 2585-2592

A dictionary of visually-distinct elements (DOVE) for semantic concept detection.
A video captures a sequence and interactions of concepts that can be static, for instance, objects or scenes, or dynamic, such as actions. For large datasets containing hundreds of thousands of images or videos, it is impractical to manually annotate all the concepts, or all the instances of a single concept. However, a dictionary with visually-distinct elements can be created automatically.

The paper's approach: "In this paper, we present an approach that leverages the strengths of semantic concepts and the machine-discovered DOVE by learning a relationship between them."

31. Efficient Feature Extraction, Encoding, and Classification for Action Recognition
Kantorov, V. ; Laptev, I.
Publication Year: 2014, Page(s): 2593-2600
Cited by: Papers (2)


A paper by Laptev of INRIA (France), a leading figure in action recognition; worth reading in any case.
Already downloaded. The figures in the paper are quite distinctive, and the problem it tackles is relatively fresh.

Abstract summary: Local video features achieve state-of-the-art action recognition results. Although accurate, feature extraction is computationally slow, which limits real-world applications; this paper tackles exactly that problem. The authors state: "We develop highly efficient video features using motion information in video compression."

Abstract: Local video features provide state-of-the-art performance for action recognition. 
While the accuracy of action recognition has been continuously improved over the recent years, the low speed of feature extraction and subsequent recognition prevents current methods from scaling up to real-size problems.
We address this issue and first develop highly efficient video features using motion information in video compression.
We next explore feature encoding by Fisher vectors and demonstrate accurate action recognition using fast linear classifiers. 
Our method improves the speed of video feature extraction, feature encoding and action classification by two orders of magnitude at the cost of minor reduction in recognition accuracy. We validate our approach and compare it to the state of the art on four recent action recognition datasets.
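For the last stage, a minimal sketch of what "fast linear classifiers" over Fisher vectors look like in practice (random data stands in for real FV encodings; scikit-learn is assumed, and this is not the authors' code):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 32768))   # 200 videos' FV encodings
y_train = rng.integers(0, 4, 200)             # 4 hypothetical action classes

clf = LinearSVC(C=1.0).fit(X_train, y_train)  # training scales linearly in dim
pred = clf.predict(rng.standard_normal((5, 32768)))
```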

32. 3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding
Gupta, A. ; Martinez, J. ; Little, J.J. ; Woodham, R.J.
Publication Year: 2014, Page(s): 2601-2608
Cross-view action recognition using 3D poses recovered from motion, via non-linear circulant temporal encoding.


33. Human Action Recognition Based on Context-Dependent Graph Kernels
Baoxin Wu ; Chunfeng Yuan ; Weiming Hu
Publication Year: 2014, Page(s): 2609-2616

Based on context-dependent graph kernels.

In this paper, we construct a two-graph model to represent human actions by recording the spatial and temporal relationships among local features. We also propose a novel family of context-dependent graph kernels (CGKs) to measure similarity between graphs. 
 
34. A key paper: action recognition combining depth and skeleton data.

Depth and Skeleton Associated Action Recognition without Online Accessible RGB-D Cameras
Yen-Yu Lin ; Ju-Hsuan Hua ; Tang, N.C. ; Min-Hung Chen ; Liao, H.-Y.M.

The recent advances in RGB-D cameras have allowed us to better solve increasingly complex computer vision tasks. However, modern RGB-D cameras are still restricted by the short effective distances. The limitation may make RGB-D cameras not online accessible in practice, and degrade their applicability. We propose an alternative scenario to address this problem, and illustrate it with the application to action recognition. We use Kinect to offline collect an auxiliary, multi-modal database, in which not only the RGB videos but also the depth maps and skeleton structures of actions of interest are available. Our approach aims to enhance action recognition in RGB videos by leveraging the extra database. Specifically, it optimizes a feature transformation, by which the actions to be recognized can be concisely reconstructed by entries in the auxiliary database. In this way, the inter-database variations are adapted. More importantly, each action can be augmented with additional depth and skeleton images retrieved from the auxiliary database. The proposed approach has been evaluated on three benchmarks of action recognition. The promising results manifest that the augmented depth and skeleton features can lead to remarkable boost in recognition accuracy.


35. DL-SFA: Deeply-Learned Slow Feature Analysis for Action Recognition
Lin Sun ; Kui Jia ; Tsung-Han Chan ; Yuqiang Fang ; Gang Wang ; Shuicheng Yan
This paper is interesting: it abandons the traditional hand-crafted-feature approach in favor of deep learning. The paper states: "In this paper, we propose to combine SFA with deep learning techniques to learn hierarchical representations from the video data itself."

Abstract: Most of the previous work on video action recognition use complex hand-designed local features, such as SIFT, HOG and SURF, but these approaches are implemented sophisticatedly and difficult to be extended to other sensor modalities. Recent studies discover that there are no universally best hand-engineered features for all datasets, and learning features directly from the data may be more advantageous. One such endeavor is Slow Feature Analysis (SFA) proposed by Wiskott and Sejnowski [33]. SFA can learn the invariant and slowly varying features from input signals and has been proved to be valuable in human action recognition [34]. It is also observed that the multi-layer feature representation has succeeded remarkably in widespread machine learning applications. In this paper, we propose to combine SFA with deep learning techniques to learn hierarchical representations from the video data itself. Specifically, we use a two-layered SFA learning structure with 3D convolution and max pooling operations to scale up the method to large inputs and capture abstract and structural features from the video. Thus, the proposed method is suitable for action recognition. At the same time, sharing the same merits of deep learning, the proposed method is generic and fully automated. Our classification results on Hollywood2, KTH and UCF Sports are competitive with previously published results. To highlight some, on the KTH dataset, our recognition rate shows approximately 1% improvement in comparison to state-of-the-art methods even without supervision or dense sampling.
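To ground the SFA component, a minimal sketch of plain linear SFA (whiten the signal, then keep the directions whose temporal derivative has the least variance). The paper's two-layer hierarchy with 3D convolution and max pooling is not reproduced here:

```python
import numpy as np

def linear_sfa(X, n_components=4):
    """X: (T, d) signal over T time steps, assumed full rank.
    Returns the slowest projections and the projection matrix."""
    X = X - X.mean(axis=0)
    # Whiten the input so all directions have unit variance.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    W_white = Vt.T / S * np.sqrt(len(X))
    Z = X @ W_white
    # Slow directions minimize the variance of the temporal derivative.
    dZ = np.diff(Z, axis=0)
    eigvals, eigvecs = np.linalg.eigh(dZ.T @ dZ / len(dZ))
    W_slow = eigvecs[:, :n_components]   # smallest eigenvalues = slowest
    return Z @ W_slow, W_white @ W_slow
```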

36. A Cause and Effect Analysis of Motion Trajectories for Modeling Actions
Narayan, S. ; Ramakrishnan, K.R.
Publication Year: 2014, Page(s): 2633-2640
Cited by: Papers (1)


Action modeling using trajectory features.
It critiques the BoW model and proposes a new one; the results on HMDB51 and UCF50 are worth noting.
Abstract: An action is typically composed of different parts of the object moving in particular sequences. The presence of different motions (represented as a 1D histogram) has been used in the traditional bag-of-words (BoW) approach for recognizing actions. However the interactions among the motions also form a crucial part of an action. Different object-parts have varying degrees of interactions with the other parts during an action cycle. It is these interactions we want to quantify in order to bring in additional information about the actions. In this paper we propose a causality based approach for quantifying the interactions to aid action classification. Granger causality is used to compute the cause and effect relationships for pairs of motion trajectories of a video. A 2D histogram descriptor for the video is constructed using these pairwise measures. Our proposed method of obtaining pairwise measures for videos is also applicable for large datasets. We have conducted experiments on challenging action recognition databases such as HMDB51 and UCF50 and shown that our causality descriptor helps in encoding additional information regarding the actions and performs on par with the state-of-the art approaches. Due to the complementary nature, a further increase in performance can be observed by combining our approach with state-of-the-art approaches.
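To make the pairwise measure concrete, a small sketch using the Granger causality test from statsmodels on two toy 1D motion signals (the signal construction and lag are my own illustration; the paper's 2D histogram over all trajectory pairs is not reproduced):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
T = 300
x = rng.standard_normal(T)                         # toy 1D motion signal
y = np.roll(x, 3) + 0.3 * rng.standard_normal(T)   # y follows x with a 3-step lag

# statsmodels tests whether the SECOND column Granger-causes the FIRST.
data = np.column_stack([y, x])
res = grangercausalitytests(data, maxlag=5, verbose=False)
f_stat, p_val = res[3][0]["ssr_ftest"][:2]         # F-test results at lag 3
print(f"lag-3 F = {f_stat:.1f}, p = {p_val:.3g}")
```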


37. From Stochastic Grammar to Bayes Network: Probabilistic Parsing of Complex Activity
Vo, N.N. ; Bobick, A.F.
Publication Year: 2014, Page(s): 2641-2648
Cited by: Papers (1)

Probabilistic parsing of complex activities with a Bayes network.
We propose a probabilistic method for parsing a temporal sequence such as a complex activity defined as composition of sub-activities/actions.


38. Cross-View Action Modeling, Learning, and Recognition
Jiang Wang ; Xiaohan Nie ; Yin Xia ; Ying Wu ; Song-Chun Zhu
Publication Year: 2014, Page(s): 2649-2656
 
A paper by the well-known researchers Jiang Wang and Song-Chun Zhu; worth a look.

Existing methods on video-based action recognition are generally view-dependent, i.e., performing recognition from the same views seen in the training data. We present a novel multiview spatio-temporal and-or graph (MST-AOG) representation for cross-view action recognition, i.e., the recognition is performed on the video from an unknown and unseen view. As a compositional model, MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling the geometry, appearance and motion variations. This paper proposes effective methods to learn the structure and parameters of MST-AOG. The inference based on MST-AOG enables action recognition from novel views. The training of MST-AOG takes advantage of the 3D human skeleton data obtained from Kinect cameras to avoid annotating enormous multi-view video frames, which is error-prone and time-consuming, but the recognition does not need 3D information and is based on 2D video input. A new Multiview Action3D dataset has been created and will be released. Extensive experiments have demonstrated that this new action representation significantly improves the accuracy and robustness for cross-view action recognition on 2D videos.

39. Feature-Independent Action Spotting without Human Localization, Segmentation, or Frame-wise Tracking
Chuan Sun ; Tappen, M. ; Foroosh, H.
The authors propose a method that "does not depend on any specific feature (e.g. HOG/HOF, STIP, silhouette, bag-of-words, etc.)", which is interesting.

In this paper, we propose an unsupervised framework for action spotting in videos that does not depend on any specific feature (e.g. HOG/HOF, STIP, silhouette, bag-of-words, etc.). Furthermore, our solution requires no human localization, segmentation, or framewise tracking. This is achieved by treating the problem holistically as that of extracting the internal dynamics of video cuboids.

40. Learning Everything about Anything: Webly-Supervised Visual Concept Learning
A general-purpose method that can also be applied to action recognition.

Divvala, S.K. ; Farhadi, A. ; Guestrin, C.
Publication Year: 2014, Page(s): 3270-3277
Cited by: Papers (2)

Recognition is graduating from labs to real-world applications. While it is encouraging to see its potential being tapped, it brings forth a fundamental challenge to the vision researcher: scalability. How can we learn a model for any concept that exhaustively covers all its appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance, gathering the training images and annotations, and learning the models? In this paper, we introduce a fully-automated approach for learning extensive models for a wide range of variations (e.g. actions, interactions, attributes and beyond) within any concept. Our approach leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models. Our approach organizes the visual knowledge about a concept in a convenient and useful way, enabling a variety of applications across vision and NLP. Our online system has been queried by users to learn models for several interesting concepts including breakfast, Gandhi, beautiful, etc. To date, our system has models available for over 50, 000 variations within 150 concepts, and has annotated more than 10 million images with bounding boxes.



41. Active Annotation Translation
Branson, S. ; Hjorleifsson, K.E. ; Perona, P.
Publication Year: 2014, Page(s): 3702-3709
Experiments show the method applies to action vocabularies: "2) our system can be used effectively in a scheme where definitions of part, attribute, or action vocabularies are evolved interactively without relabeling the entire dataset."

Abstract: We introduce a general framework for quickly annotating an image dataset when previous annotations exist. The new annotations (e.g. part locations) may be quite different from the old annotations (e.g. segmentations). Human annotators may be thought of as helping translate the old annotations into the new ones. As annotators label images, our algorithm incrementally learns a translator from source to target labels as well as a computer-vision-based structured predictor. These two components are combined to form an improved prediction system which accelerates the annotators' work through a smart GUI. We show how the method can be applied to translate between a wide variety of annotation types, including bounding boxes, segmentations, 2D and 3D part-based systems, and class and attribute labels. The proposed system will be a useful tool toward exploring new types of representations beyond simple bounding boxes, object segmentations, and class labels, and toward finding new ways to exploit existing large datasets with traditional types of annotations like SUN [36], ImageNet [11], and Pascal VOC [12]. Experiments on the CUB-200-2011 and H3D datasets demonstrate 1) our method accelerates collection of part annotations by a factor of 3-20 compared to manual labeling, 2) our system can be used effectively in a scheme where definitions of part, attribute, or action vocabularies are evolved interactively without relabeling the entire dataset, and 3) toward collecting pose annotations, segmentations are more useful than bounding boxes, and part-level annotations are more effective than segmentations.

42. Optimizing over Radial Kernels on Compact Manifolds
Jayasumana, S. ; Hartley, R. ; Salzmann, M. ; Hongdong Li ; Harandi, M.
Publication Year: 2014, Page(s): 3802-3809

43. Latent Dictionary Learning for Sparse Representation Based Classification
A general-purpose method for classification.
Meng Yang ; Dengxin Dai ; Linlin Shen ; Van Gool, L.

Dictionary learning (DL) for sparse coding has shown promising results in classification tasks, while how to adaptively build the relationship between dictionary atoms and class labels is still an important open question. The existing dictionary learning approaches simply fix a dictionary atom to be either class-specific or shared by all classes beforehand, ignoring that the relationship needs to be updated during DL. To address this issue, in this paper we propose a novel latent dictionary learning (LDL) method to learn a discriminative dictionary and build its relationship to class labels adaptively. Each dictionary atom is jointly learned with a latent vector, which associates this atom to the representation of different classes. More specifically, we introduce a latent representation model, in which discrimination of the learned dictionary is exploited via minimizing the within-class scatter of coding coefficients and the latent-value weighted dictionary coherence. The optimal solution is efficiently obtained by the proposed solving algorithm. Correspondingly, a latent sparse representation based classifier is also presented. Experimental results demonstrate that our algorithm outperforms many recently proposed sparse representation and dictionary learning approaches for action, gender and face recognition.
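For background, here is a sketch of the plain sparse-representation-based classification (SRC) decision rule that dictionary-learning methods like this build on: sparsely code a test sample over the dictionary, then assign the class whose atoms yield the smallest reconstruction residual. The names and the OMP solver choice are my own; atoms in D are assumed to be ℓ2-normalized columns.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def src_classify(D, atom_labels, x, n_nonzero=10):
    """D: (d, n_atoms) dictionary with l2-normalized columns;
    atom_labels: (n_atoms,) class of each atom; x: (d,) test sample."""
    code = orthogonal_mp(D, x, n_nonzero_coefs=n_nonzero)   # sparse coding
    best_class, best_res = None, np.inf
    for c in np.unique(atom_labels):
        code_c = np.where(atom_labels == c, code, 0.0)      # class-c coefficients only
        res = np.linalg.norm(x - D @ code_c)                # class reconstruction error
        if res < best_res:
            best_class, best_res = c, res
    return best_class
```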

