CVPR 2011: All Paper Titles and Abstracts


CVPR 2011

Tian, Yuandong; Narasimhan, Srinivasa G.; , "Rectification and 3D reconstruction of curved document images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.377-384, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995540
Abstract: Distortions in images of documents, such as the pages of books, adversely affect the performance of optical character recognition (OCR) systems. Removing such distortions requires the 3D deformation of the document that is often measured using special and precisely calibrated hardware (stereo, laser range scanning or structured light). In this paper, we introduce a new approach that automatically reconstructs the 3D shape and rectifies a deformed text document from a single image. We first estimate the 2D distortion grid in an image by exploiting the line structure and stroke statistics in text documents. This approach does not rely on more noise-sensitive operations such as image binarization and character segmentation. The regularity in the text pattern is used to constrain the 2D distortion grid to be a perspective projection of a 3D parallelogram mesh. Based on this constraint, we present a new shape-from-texture method that computes the 3D deformation up to a scale factor using SVD. Unlike previous work, this formulation imposes no restrictions on the shape (e.g., a developable surface). The estimated shape is then used to remove both geometric distortions and photometric (shading) effects in the image. We demonstrate our techniques on documents containing a variety of languages, fonts and sizes.

Oh, Sangmin; Hoogs, Anthony; Perera, Amitha; Cuntoor, Naresh; Chen, Chia-Chih; Lee, Jong Taek; Mukherjee, Saurajit; Aggarwal, J. K.; Lee, Hyungtae; Davis, Larry; Swears, Eran; Wang, Xioyang; Ji, Qiang; Reddy, Kishore; Shah, Mubarak; Vondrick, Carl; Pirsiavash, Hamed; Ramanan, Deva; Yuen, Jenny; Torralba, Antonio; Song, Bi; Fong, Anesco; Roy-Chowdhury, Amit; Desai, Mita; , "A large-scale benchmark dataset for event recognition in surveillance video," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3153-3160, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995586
Abstract: We introduce a new large-scale video dataset designed to assess the performance of diverse visual event recognition algorithms with a focus on continuous visual event recognition (CVER) in outdoor areas with wide coverage. Previous datasets for action recognition are unrealistic for real-world surveillance because they consist of short clips showing one action by one individual [15, 8]. Datasets have been developed for movies [11] and sports [12], but these actions and scene conditions do not apply effectively to surveillance videos. Our dataset consists of many outdoor scenes with actions occurring naturally by non-actors in continuously captured videos of the real world. The dataset includes large numbers of instances for 23 event types distributed throughout 29 hours of video. This data is accompanied by detailed annotations, including both moving object tracks and event examples, which provide a solid basis for large-scale evaluation. Additionally, we propose different types of evaluation modes for visual recognition tasks and evaluation metrics, along with our preliminary experimental results. We believe that this dataset will stimulate diverse aspects of computer vision research and help us to advance CVER tasks in the years ahead.

Padfield, Dirk; , "The magic sigma," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.129-136, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995577
Abstract: With the explosion in the usage of mobile devices and other smart electronics, embedded devices are becoming ubiquitous. Most such embedded architectures utilize fixed-point rather than floating-point computation to meet power, heat, and speed requirements, leading to the need for integer-based processing algorithms. Operations involving Gaussian kernels are common in such algorithms, but the standard methods of constructing such kernels result in approximations and lack a property that enables efficient bitwise shift operations. To overcome these limitations, we show how to precisely combine the power of integer arithmetic and bitwise shifts with intrinsically real-valued Gaussian kernels. We prove mathematically that there exists a set of what we call "magic sigmas" for which the integer kernels exactly represent a Gaussian function whose values are all powers of two, and we find that the maximum sigma that leads to such properties is about 0.85. We also design a simple and precise algorithm for constructing kernels composed exclusively of integers given any arbitrary sigma, and show how this can be exploited for Gaussian filter design. Considering the ubiquity of Gaussian filtering and the need for integer computation on increasing numbers of embedded devices, this is an important result for both theoretical and practical purposes.
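
The headline claim is easy to verify numerically. Below is a minimal sketch (not the paper's algorithm) checking that sigma = 1/sqrt(2 ln 2) ≈ 0.8493, the largest such "magic sigma", makes every integer-offset sample of the unnormalized Gaussian an exact power of two, so a scaled kernel is all-integer and shift-friendly:

```python
import math

# For G(x) = exp(-x^2 / (2*sigma^2)), choosing sigma = 1/sqrt(2*ln 2)
# gives G(x) = 2**(-x**2): every sample at an integer offset is a power of two.
MAGIC_SIGMA = 1.0 / math.sqrt(2.0 * math.log(2.0))   # ~0.8493

def gaussian(x, sigma):
    return math.exp(-x * x / (2.0 * sigma * sigma))

# Samples at offsets -2..2 are 1/16, 1/2, 1, 1/2, 1/16.
samples = [gaussian(x, MAGIC_SIGMA) for x in range(-2, 3)]

# Scaling by 2**4 yields an all-integer 5-tap kernel; division by the
# scale factor after filtering is a bitwise right shift.
int_kernel = [round(s * 16) for s in samples]
```

Filtering with `int_kernel` then needs only integer multiply-accumulate plus a final shift by 4 bits, which is the property the paper exploits.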

Emonet, Remi; Varadarajan, Jagannadan; Odobez, Jean-Marc; , "Extracting and locating temporal motifs in video scenes using a hierarchical non parametric Bayesian model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3233-3240, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995572
Abstract: In this paper, we present an unsupervised method for mining activities in videos. From unlabeled video sequences of a scene, our method can automatically recover the recurrent temporal activity patterns (or motifs) and when they occur. Using non parametric Bayesian methods, we are able to automatically find both the underlying number of motifs and the number of motif occurrences in each document. The model's robustness is first validated on synthetic data. It is then applied to a large set of video data from state-of-the-art papers. We show that it can effectively recover temporal activities with high semantic value for humans and strong temporal information. The model is also used for prediction, where it is shown to be as efficient as other approaches. Although illustrated on video sequences, this model can be directly applied to various kinds of time series where multiple activities occur simultaneously.

Das Gupta, Mithun; Xiao, Jing; , "Non-negative matrix factorization as a feature selection tool for maximum margin classifiers," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2841-2848, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995492
Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition tool for multivariate data. Non-negative bases allow strictly additive combinations, which have been shown to be part-based as well as relatively sparse. We pursue a discriminative decomposition by coupling the NMF objective with a maximum margin classifier, specifically a support vector machine (SVM). Conversely, we propose an NMF-based regularizer for SVM. We formulate the joint update equations and propose a new method which identifies the decomposition as well as the classification parameters. We present classification results on synthetic as well as real datasets.
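
For readers unfamiliar with the building block, here is a minimal sketch of plain multiplicative-update NMF (the classic Lee-Seung rule), without the paper's max-margin coupling; the data matrix and rank are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, r, n_iter=200, eps=1e-9):
    """Standard multiplicative-update NMF minimizing ||V - W H||_F.
    Updates keep W and H non-negative by construction."""
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = rng.random((20, 30))          # non-negative data matrix (invented)
W, H = nmf(V, r=5)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The paper's contribution is to optimize a joint objective in which this factorization and an SVM's margin are coupled, rather than running the two stages independently as above.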

Castillo, Carlos D.; Jacobs, David W.; , "Wide-baseline stereo for face recognition with large pose variation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.537-544, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995559
Abstract: 2-D face recognition in the presence of large pose variations presents a significant challenge. When comparing a frontal image of a face to a near profile image, one must cope with large occlusions, non-linear correspondences, and significant changes in appearance due to viewpoint. Stereo matching has been used to handle these problems, but performance of this approach degrades with large pose changes. We show that some of this difficulty is due to the effect that foreshortening of slanted surfaces has on window-based matching methods, which are needed to provide robustness to lighting change. We address this problem by designing a new, dynamic programming stereo algorithm that accounts for surface slant. We show that on the CMU PIE dataset this method results in significant improvements in recognition performance.

Osokin, Anton; Vetrov, Dmitry; Kolmogorov, Vladimir; , "Submodular decomposition framework for inference in associative Markov networks with global constraints," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1889-1896, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995361
Abstract: In this paper we address the problem of finding the most probable state of a discrete Markov random field (MRF) with associative pairwise terms. Although of practical importance, this problem is known to be NP-hard in general. We propose a new type of MRF decomposition, submodular decomposition (SMD). Unlike existing decomposition approaches, SMD decomposes the initial problem into subproblems corresponding to a specific class label while preserving the graph structure of each subproblem. Such decomposition enables us to take into account several types of global constraints in an efficient manner. We study the theoretical properties of the proposed approach and demonstrate its applicability on a number of problems.

Tian, Jiandong; Tang, Yandong; , "Linearity of each channel pixel values from a surface in and out of shadows and its applications," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.985-992, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995622
Abstract: Shadows, a common phenomenon in most outdoor scenes, are illuminated by diffuse skylight while being shaded from direct sunlight. Shadows generally occur in sunny weather, when the spectral power distributions (SPD) of sunlight, skylight, and daylight show strong regularity: they vary principally with sun angle. In this paper, we first deduce that the pixel values of a surface illuminated by skylight (in the shadow region) and by daylight (in the non-shadow region) have a linear relationship, and that this linearity is independent of surface reflectance and holds in each color channel. We then use six simulated images containing 1995 surfaces and two real captured images to test the linearity. The results validate the linearity. Based on the deduced linear relationship, we develop three shadow processing applications: intrinsic image derivation, shadow verification, and shadow removal. The results of these applications demonstrate that the linear relationship has practical value.
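
The claimed per-channel linearity is straightforward to probe on synthetic data. The sketch below uses an invented gain/offset for one channel and recovers it by least squares; it illustrates the kind of check one might run, not the paper's derivation:

```python
import numpy as np

# Hypothetical per-channel relation: for a fixed skylight/daylight SPD pair,
# a surface's shadow value s relates to its lit value d as s = a*d + b,
# with (a, b) shared across surfaces. Values below are invented.
rng = np.random.default_rng(1)
a_true, b_true = 0.35, 4.0
lit = rng.uniform(30, 250, size=100)                      # lit-region pixels
shadow = a_true * lit + b_true + rng.normal(0, 0.5, 100)  # shadowed + noise

# Recover the linear relation by least squares over surface pairs.
A = np.stack([lit, np.ones_like(lit)], axis=1)
(a_est, b_est), *_ = np.linalg.lstsq(A, shadow, rcond=None)
```

If the linearity holds, the fitted (a, b) are stable across surfaces, which is what the paper's shadow verification and removal applications rely on.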

Bo, Yihang; Fowlkes, Charless C.; , "Shape-based pedestrian parsing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2265-2272, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995609
Abstract: We describe a simple model for parsing pedestrians based on shape. Our model assembles candidate parts from an oversegmentation of the image and matches them to a library of exemplars. Our matching uses a hierarchical decomposition into a variable number of parts and computes scores on partial matchings in order to prune the search space of candidate segments. Simple constraints enforce a consistent layout of parts. Because our model is shape-based, it generalizes well. We use exemplars from a controlled dataset of poses but achieve good test performance on unconstrained images of pedestrians in street scenes. We demonstrate results of parsing detections returned from a standard scanning-window pedestrian detector and use the resulting parse to perform viewpoint prediction and detection re-scoring.

Jiang, Nan; Liu, Wenyu; Wu, Ying; , "Adaptive and discriminative metric differential tracking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1161-1168, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995716
Abstract: Matching the visual appearance of the target over consecutive image frames is the most critical issue in video-based object tracking. Choosing an appropriate distance metric for matching determines its accuracy and robustness, and significantly influences the tracking performance. This paper presents a new tracking approach that incorporates an adaptive metric into a differential tracking method. The new approach automatically learns an optimal distance metric for more accurate matching, and obtains a closed-form analytical solution for motion estimation and differential tracking. Extensive experiments validate the effectiveness of the adaptive metric and demonstrate the improved performance of the proposed tracking method.
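
The abstract does not spell out the metric form, but learned matching metrics are typically Mahalanobis-style distances d(x, y) = (x - y)^T M (x - y) with a positive semi-definite M. A minimal illustration of the idea (the matrix M and feature vectors below are invented, not the paper's learned metric):

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared distance under metric M (assumed positive semi-definite)."""
    d = x - y
    return float(d @ M @ d)

x = np.array([1.0, 2.0])   # invented appearance features
y = np.array([3.0, 1.0])

# With M = I the metric reduces to plain squared Euclidean distance.
d_euclid = mahalanobis_sq(x, y, np.eye(2))

# A learned M = A^T A reweights dimensions; here A downweights the
# (hypothetically unreliable) second feature dimension.
A = np.diag([1.0, 0.1])
d_learned = mahalanobis_sq(x, y, A.T @ A)
```

Writing M = A^T A guarantees positive semi-definiteness, which is why metric-learning formulations usually optimize over A rather than M directly.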

Roberts, Richard; Sinha, Sudipta N.; Szeliski, Richard; Steedly, Drew; , "Structure from motion for scenes with large duplicate structures," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3137-3144, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995549
Abstract: Most existing structure from motion (SFM) approaches for unordered images cannot handle multiple instances of the same structure in the scene. When image pairs containing different instances are matched based on visual similarity, the pairwise geometric relations as well as the correspondences inferred from such pairs are erroneous, which can lead to catastrophic failures in the reconstruction. In this paper, we investigate the geometric ambiguities caused by the presence of repeated or duplicate structures and show that to disambiguate between multiple hypotheses requires more than pure geometric reasoning. We couple an expectation maximization (EM)-based algorithm that estimates camera poses and identifies the false match-pairs with an efficient sampling method to discover plausible data association hypotheses. The sampling method is informed by geometric and image-based cues. Our algorithm usually recovers the correct data association, even in the presence of large numbers of false pairwise matches.

Tang, Chaoying; Kong, Adams Wai Kin; Craft, Noah; , "Uncovering vein patterns from color skin images for forensic analysis," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.665-672, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995531
Abstract: Recent technological advances have allowed for a proliferation of digital evidence images. Using these images as evidence in legal cases (e.g. child sexual abuse, child pornography and masked gunmen) can be very challenging, because the faces of criminals or victims are not visible. Although large skin marks and tattoos have been used, they are ineffective in some legal cases because the skin exposed in evidence images has neither unique tattoos nor enough skin marks for identification. The blood vessels between the skin and the muscle, which cover most parts of the human body, are a powerful biometric trait because of their universality, permanence and distinctiveness. Traditionally, it was impossible to use vein patterns for forensic identification because they are not visible in color images. This paper proposes an algorithm to uncover vein patterns from the skin exposed in color images for personal identification. Based on the principles of optics and skin biophysics, we model the inverse process of skin color formation in an image and derive spatial distributions of biophysical parameters from color images, in which vein patterns can be observed. Experimental results are very encouraging. The clarity of the vein patterns in the resulting images is comparable to or even better than that in near-infrared images.

Yu, Jin; Chin, Tat-Jun; Suter, David; , "A global optimization approach to robust multi-model fitting," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2041-2048, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995608
Abstract: We present a novel Quadratic Program (QP) formulation for robust multi-model fitting of geometric structures in vision data. Our objective function enforces both the fidelity of a model to the data and the similarity between its associated inliers. Departing from most previous optimization-based approaches, the outcome of our method is a ranking of a given set of putative models, instead of a pre-specified number of "good" candidates (or an attempt to decide the right number of models). This is particularly useful when the number of structures in the data is a priori unascertainable due to unknown intent and purposes. Another key advantage of our approach is that it operates in a unified optimization framework, and the standard QP form of our problem formulation permits globally convergent optimization techniques. We tested our method on several geometric multi-model fitting problems on both synthetic and real data. Experiments show that our method consistently achieves state-of-the-art results.

Douze, Matthijs; Ramisa, Arnau; Schmid, Cordelia; , "Combining attributes and Fisher vectors for efficient image retrieval," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.745-752, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995595
Abstract: Attributes were recently shown to give excellent results for category recognition. In this paper, we demonstrate their performance in the context of image retrieval. First, we show that retrieving images of particular objects based on attribute vectors gives results comparable to the state of the art. Second, we demonstrate that combining attribute and Fisher vectors improves performance for retrieval of particular objects as well as categories. Third, we implement an efficient coding technique for compressing the combined descriptor to very small codes. Experimental results on the Holidays dataset show that our approach significantly outperforms the state of the art, even for a very compact representation of 16 bytes per image. Retrieving category images is evaluated on the "web-queries" dataset. We show that attribute features combined with Fisher vectors improve the performance and that combined image features can supplement text features.

Jiang, Nan; Liu, Wenyu; Su, Heng; Wu, Ying; , "Tracking low resolution objects by metric preservation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1329-1336, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995537
Abstract: Tracking low resolution (LR) targets is a practical yet quite challenging problem in real applications. The loss of discriminative details in the visual appearance of the LR targets confronts most existing visual tracking methods. Although the resolution of the LR video inputs may be enhanced by super resolution (SR) techniques, the large computational cost for high-quality SR does not make it an attractive option. This paper presents a novel solution to track LR targets without performing explicit SR. This new approach is based on discriminative metric preservation that preserves the structure in the high resolution feature space for LR matching. In addition, we integrate metric preservation with differential tracking to derive a closed-form solution to motion estimation for LR video. Extensive experiments have demonstrated the effectiveness and efficiency of the proposed approach.

Pham, Viet-Quoc; Takahashi, Keita; Naemura, Takeshi; , "Foreground-background segmentation using iterated distribution matching," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2113-2120, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995356
Abstract: This paper addresses the problem of image segmentation with a reference distribution. Recent studies have shown that segmentation with global consistency measures outperforms conventional techniques based on pixel-wise measures. However, such global approaches require a precise distribution to obtain the correct extraction. To overcome this strict assumption, we propose a new approach in which the given reference distribution plays a guiding role in inferring the latent distribution and its consistent region. The inference is based on an assumption that the latent distribution resembles the distribution of the consistent region but is distinct from the distribution of the complement region. We state the problem as the minimization of an energy function consisting of global similarities based on the Bhattacharyya distance and then implement a novel iterated distribution matching process for jointly optimizing distribution and segmentation. We evaluate the proposed algorithm on the GrabCut dataset, and demonstrate the advantages of using our approach with various segmentation problems, including interactive segmentation, background subtraction, and co-segmentation.
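
The Bhattacharyya distance at the core of the energy is simple to compute between two normalized histograms; a minimal sketch (the histogram values are invented):

```python
import numpy as np

def bhattacharyya_distance(p, q, eps=1e-12):
    """-log of the Bhattacharyya coefficient between two histograms.
    Histograms are normalized to sum to 1 first; distance 0 means identical
    distributions, larger values mean less overlap."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient in [0, 1]
    return -np.log(max(bc, eps))

same = bhattacharyya_distance([2, 4, 6], [1, 2, 3])   # identical after normalization
diff = bhattacharyya_distance([1, 0, 0], [0, 0, 1])   # disjoint support
```

In the paper's setting, such a global similarity between the region's distribution and the latent reference distribution is what gets minimized jointly with the segmentation, rather than a sum of per-pixel terms.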

Xie, Yuelei; Chang, Hong; Li, Zhe; Liang, Luhong; Chen, Xilin; Zhao, Debin; , "A unified framework for locating and recognizing human actions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.25-32, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995648
Abstract: In this paper, we present a pose-based approach for locating and recognizing human actions in videos. In our method, human poses are detected and represented based on a deformable part model. To our knowledge, this is the first work exploring the effectiveness of deformable part models in combining human detection and pose estimation for action recognition. Compared with previous methods, ours has three main advantages. First, our method does not rely on any assumption about video preprocessing quality, such as satisfactory foreground segmentation or reliable tracking. Second, we propose a novel compact representation for human pose which works together with human detection and can well represent the spatial and temporal structure inside an action. Third, with human detection taken into consideration in our framework, our method has the ability to locate and recognize multiple actions in the same scene. Experiments on benchmark datasets and recorded cluttered videos verify the efficacy of our method.

Tron, Roberto; Vidal, Rene; , "Distributed computer vision algorithms through distributed averaging," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.57-63, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995654
Abstract: Traditional computer vision and machine learning algorithms have been largely studied in a centralized setting, where all the processing is performed at a single central location. However, a distributed approach might be more appropriate when a network with a large number of cameras is used to analyze a scene. In this paper we show how centralized algorithms based on linear algebraic operations can be made distributed by using simple distributed averages. We cover algorithms such as SVD, least squares, PCA, GPCA, 3-D point triangulation, pose estimation and affine SfM.
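
The core primitive, distributed averaging, can be sketched in a few lines: each node repeatedly averages its value with its neighbors' under doubly stochastic weights, and all nodes converge to the global mean without any central coordinator. The ring topology and weights below are invented for illustration:

```python
import numpy as np

# Ring of 5 camera nodes, each holding one scalar measurement (invented).
values = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
n = len(values)

# Doubly stochastic mixing matrix: each node keeps half its value and
# takes a quarter from each of its two ring neighbors. Rows and columns
# sum to 1, so repeated mixing preserves the mean and converges to it.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

x = values.copy()
for _ in range(200):
    x = W @ x          # one round of neighbor-only communication
```

Once every node holds the mean, sums and inner products follow (multiply the mean by the known node count), which is how the paper distributes the matrix products inside SVD, least squares, PCA, and the other listed algorithms.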

Kim, Taesup; Nowozin, Sebastian; Kohli, Pushmeet; Yoo, Chang D.; , "Variable grouping for energy minimization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1913-1920, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995645
Abstract: This paper addresses the problem of efficiently solving large-scale energy minimization problems encountered in computer vision. We propose an energy-aware method for merging random variables to reduce the size of the energy to be minimized. The method examines the energy function to find groups of variables which are likely to take the same label in the minimum energy state and thus can be represented by a single random variable. We propose and evaluate a number of extremely efficient variable grouping strategies. Experimental results show that our methods result in a dramatic reduction in the computational cost and memory requirements (in some cases by a factor of one hundred) with almost no drop in the accuracy of the final result. Comparative evaluation with efficient super-pixel generation methods, which are commonly used in variable grouping, reveals that our methods are far superior both in terms of accuracy and running time.
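
As a rough illustration of the grouping idea (not the paper's actual strategies), the toy sketch below merges chain neighbors whose unary costs prefer the same label, using union-find bookkeeping, so the reduced energy has fewer variables:

```python
# Toy energy-aware variable grouping: merge neighboring variables whose
# unary costs prefer the same label, representing each group by a single
# variable. The chain, costs, and merge criterion are invented.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Unary costs per variable for labels (0, 1); neighbors are (i, i+1).
unaries = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.3), (0.6, 0.4)]
prefer = [min((0, 1), key=lambda lab: u[lab]) for u in unaries]
for i in range(len(unaries) - 1):
    if prefer[i] == prefer[i + 1]:      # neighbors agree: merge into one group
        union(i, i + 1)

groups = {find(i) for i in range(len(unaries))}   # 4 variables -> 2 groups
```

The grouped variables inherit summed unary costs, so any minimizer can then run on the smaller problem; the paper's contribution is choosing merges so the final labeling is almost unchanged.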

Kokiopoulou, Effrosyni; Kressner, Daniel; Zervos, Michail; Paragios, Nikos; , "Optimal similarity registration of volumetric images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2449-2456, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995337
Abstract: This paper proposes a novel approach to optimally solve volumetric registration problems. The proposed framework exploits parametric dictionaries for sparse volumetric representations, ℓ1 dissimilarities and DC (Difference of Convex functions) decomposition. The SAD (sum of absolute differences) criterion is applied to the sparse representation of the reference volume, and a DC decomposition of this criterion with respect to the transformation parameters is derived. This permits employing a cutting plane algorithm for determining the optimal relative transformation parameters of the query volume. It further provides a guarantee of the global optimality of the obtained solution, which, to the best of our knowledge, is not offered by any other existing approach. A numerical validation demonstrates the effectiveness and the large potential of the proposed method.

Wang, Yang; Tran, Duan; Liao, Zicheng; , "Learning hierarchical poselets for human parsing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1705-1712, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995519
Abstract: We consider the problem of human parsing with part-based models. Most previous work in part-based models only considers rigid parts (e.g. torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate for human parsing. In this paper, we introduce hierarchical poselets–a new representation for human parsing. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of human bodies (e.g. torso + left arm). In the extreme case, they can be the whole bodies. We develop a structured model to organize poselets in a hierarchical way and learn the model parameters in a max-margin framework. We demonstrate the superior performance of our proposed approach on two datasets with aggressive pose variations.

Jiang, Hao; Tian, Tai-Peng; Sclaroff, Stan; , "Scale and rotation invariant matching using linearly augmented trees," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2473-2480, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995580
Abstract: We propose a novel linearly augmented tree method for efficient scale and rotation invariant object matching. The proposed method enforces pairwise matching consistency defined on trees, and high-order constraints on all the sites of a template. The pairwise constraints admit arbitrary metrics while the high-order constraints use L1 norms and therefore can be linearized. Such a linearly augmented tree formulation introduces hyperedges and loops into the basic tree structure, but unlike a general loopy graph, its special structure allows us to relax and decompose the optimization into a sequence of tree matching problems efficiently solvable by dynamic programming. The proposed method also works on continuous scale and rotation parameters; we can match at arbitrarily large scales with the same efficiency. Our experiments on ground truth data and a variety of real images and videos show that the proposed method is efficient, accurate and reliable.

Zhang, Cherry; Sato, Imari; , "Separating reflective and fluorescent components of an image," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.185-192, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995704
Abstract: Traditionally, researchers tend to exclude fluorescence from color appearance algorithms in computer vision and image processing because of its complexity. In reality, fluorescence is a very common phenomenon observed in many objects, from gems and corals, to different kinds of writing paper, to our clothes. In this paper, we provide a detailed theory of the fluorescence phenomenon. In particular, we show that the color appearance of fluorescence is unaffected by the illumination, in which respect it differs from ordinary reflectance. Moreover, we show that the color appearance of objects with reflective and fluorescent components can be represented as a linear combination of the two components. The linear model allows us to separate the two components using images taken under two unknown illuminants using independent component analysis (ICA). The effectiveness of the proposed method is demonstrated using digital images of various fluorescent objects.

Netz, Aaron; Osadchy, Margarita; , "Using specular highlights as pose invariant features for 2D-3D pose estimation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.721-728, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995673
Abstract: We address the problem of 2D-3D pose estimation in difficult viewing conditions, such as low illumination, cluttered background, and large highlights and shadows that appear on the object of interest. In such challenging conditions conventional features used for establishing correspondence are unreliable. We show that under the assumption of a dominant light source, specular highlights produced by a known object can be used to establish correspondence between its image and the 3D model, and to verify the hypothesized pose. These ideas are incorporated in an efficient method for pose estimation from a monocular image of an object using only highlights produced by the object as its input. The proposed method uses no knowledge of lighting direction and no calibration object for estimating the lighting in the scene. The evaluation of the method shows good accuracy on numerous synthetic images and good robustness on real images of complex, shiny objects, with shadows and difficult backgrounds.

Shotton, Jamie; Fitzgibbon, Andrew; Cook, Mat; Sharp, Toby; Finocchio, Mark; Moore, Richard; Kipman, Alex; Blake, Andrew; , "Real-time human pose recognition in parts from single depth images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1297-1304, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995316
Abstract: We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.
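
A key ingredient of this system is simple depth-comparison features whose pixel offsets are scaled by the inverse depth at the probe pixel, giving approximate invariance to how far the body is from the camera. A minimal sketch on a toy depth map (the offsets and map below are invented; out-of-image probes read a large constant, as in the paper):

```python
import numpy as np

def depth_feature(depth, x, u, v, big=1e6):
    """Depth-difference feature: probe at x + u/depth(x) and x + v/depth(x),
    return the difference of the two depth readings. Scaling the offsets by
    1/depth(x) makes the response roughly depth-invariant."""
    def probe(p):
        r, c = int(round(p[0])), int(round(p[1]))
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return big                      # out-of-image probes hit "background"
    d0 = depth[x]
    pu = (x[0] + u[0] / d0, x[1] + u[1] / d0)
    pv = (x[0] + v[0] / d0, x[1] + v[1] / d0)
    return probe(pu) - probe(pv)

# Toy depth map: left half near (depth 2), right half far (depth 6).
depth = np.full((10, 10), 2.0)
depth[:, 5:] = 6.0
f = depth_feature(depth, (5, 4), u=(0.0, 2.0), v=(0.0, -2.0))
```

In the full system, thousands of such (u, v) pairs are used as split tests in randomized decision forests for the per-pixel body-part classification the abstract describes.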

Pirsiavash, Hamed; Ramanan, Deva; Fowlkes, Charless C.; , "Globally-optimal greedy algorithms for tracking a variable number of objects," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1201-1208, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995604
Abstract: We analyze the computational problem of multi-object tracking in video sequences. We formulate the problem using a cost function that requires estimating the number of tracks, as well as their birth and death states. We show that the global solution can be obtained with a greedy algorithm that sequentially instantiates tracks using shortest path computations on a flow network. Greedy algorithms allow one to embed pre-processing steps, such as nonmax suppression, within the tracking algorithm. Furthermore, we give a near-optimal algorithm based on dynamic programming which runs in time linear in the number of objects and linear in the sequence length. Our algorithms are fast, simple, and scalable, allowing us to process dense input data. This results in state-of-the-art performance.
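
The full flow-network algorithm is beyond a short sketch, but the dynamic-programming flavor can be illustrated with a toy single-track version: pick one detection per frame so that detection costs plus transition costs are minimized, via a plain Viterbi-style shortest path. All costs below are invented, and track births/deaths are omitted:

```python
# Toy stand-in for the DP component: one track through per-frame detections.
def best_track(det_costs, trans_cost):
    """det_costs[t][i]: cost of picking detection i in frame t.
    trans_cost(i, j): cost of linking detection i to detection j."""
    T = len(det_costs)
    dp = list(det_costs[0])                 # best cost ending at each detection
    back = []                               # backpointers per frame
    for t in range(1, T):
        new_dp, new_back = [], []
        for j, cj in enumerate(det_costs[t]):
            best_i = min(range(len(dp)), key=lambda i: dp[i] + trans_cost(i, j))
            new_dp.append(dp[best_i] + trans_cost(best_i, j) + cj)
            new_back.append(best_i)
        dp, back = new_dp, back + [new_back]
    j = min(range(len(dp)), key=lambda i: dp[i])
    total = dp[j]
    path = [j]
    for ptrs in reversed(back):             # backtrack the cheapest path
        j = ptrs[j]
        path.append(j)
    return list(reversed(path)), total

# Two detections per frame; detection 0 is cheap early, 1 is cheap late,
# and switching identity between frames costs 1.
costs = [[0.0, 5.0], [0.0, 5.0], [4.0, 0.0]]
path, total = best_track(costs, lambda i, j: 0.0 if i == j else 1.0)
```

The paper's greedy algorithm repeats this kind of shortest-path computation on a flow network to instantiate a variable number of tracks, with birth and death costs deciding when to stop.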

Brown, Matthew; Susstrunk, Sabine; , "Multi-spectral SIFT for scene category recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.177-184, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995637
Abstract: We use a simple modification to a conventional SLR camera to capture images of several hundred scenes in colour (RGB) and near-infrared (NIR). We show that the addition of near-infrared information leads to significantly improved performance in a scene-recognition task, and that the improvements are greater still when an appropriate 4-dimensional colour representation is used. In particular we propose MSIFT–a multispectral SIFT descriptor that, when combined with a kernel based classifier, exceeds the performance of state-of-the-art scene recognition techniques (e.g., GIST) and their multispectral extensions. We extensively test our algorithms using a new dataset of several hundred RGB-NIR scene images, as well as benchmarking against Torralba's scene categorization dataset.

Rohrbach, Marcus; Stark, Michael; Schiele, Bernt; , "Evaluating knowledge transfer and zero-shot learning in a large-scale setting," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1641-1648, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995627
Abstract: While knowledge transfer (KT) between object classes has been accepted as a promising route towards scalable recognition, most experimental KT studies are surprisingly limited in the number of object classes considered. To support claims of KT w.r.t. scalability, we thus advocate evaluating KT in a large-scale setting. To this end, we provide an extensive evaluation of three popular approaches to KT on a recently proposed large-scale data set, the ImageNet Large Scale Visual Recognition Competition 2010 data set. In a first setting, they are directly compared to one-vs-all classification, which is often neglected in KT papers; in a second setting, we evaluate their ability to enable zero-shot learning. While none of the KT methods can improve over one-vs-all classification, they prove valuable for zero-shot learning, especially hierarchical and direct similarity based KT. We also propose and describe several extensions of the evaluated approaches that are necessary for this large-scale study.

Li, Mu; Lian, Xiao-Chen; Kwok, James T.; Lu, Bao-Liang; , "Time and space efficient spectral clustering via column sampling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2297-2304, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995425
Abstract: Spectral clustering is an elegant and powerful approach for clustering. However, the underlying eigen-decomposition takes cubic time and quadratic space w.r.t. the data set size. These can be reduced by the Nyström method, which samples only a subset of columns from the matrix. However, the manipulation and storage of these sampled columns can still be expensive when the data set is large. In this paper, we propose a time- and space-efficient spectral clustering algorithm which can scale to very large data sets. A general procedure to orthogonalize the approximated eigenvectors is also proposed. Extensive spectral clustering experiments on a number of data sets, ranging in size from a few thousand to several million, demonstrate the accuracy and scalability of the proposed approach. We further apply it to the task of image segmentation. For images with more than 10 million pixels, this algorithm can obtain the eigenvectors in 1 minute on a single machine.
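A minimal Nyström-flavoured sketch of the column-sampling idea, assuming a Gaussian affinity and given landmark points; the paper's own algorithm and its orthogonalization step differ in the details. Sampled columns against the landmarks stand in for the full affinity matrix, and the landmark eigenvectors are extended to all points.

```python
import numpy as np

# Hedged sketch: Nystrom-style column sampling to approximate the leading
# eigenvectors of an affinity matrix. The Gaussian kernel and the choice
# of landmarks are illustrative assumptions.

def rbf(a, b, gamma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_embedding(X, landmarks, k):
    C = rbf(X, landmarks)                 # n x m sampled columns
    W = rbf(landmarks, landmarks)         # m x m landmark block
    vals, vecs = np.linalg.eigh(W)
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]  # top-k
    # Nystrom extension of the landmark eigenvectors to all n points
    return C @ vecs / vals
```

On two well-separated blobs, points in the same blob land close together in the embedding, so any standard clustering step (e.g., k-means) would separate them.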

Qi, Guo-Jun; Tian, Qi; Huang, Thomas; , "Locality-sensitive support vector machine by exploring local correlation and global regularization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.841-848, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995378
Abstract: Local classifiers have achieved great success in classification tasks due to their powerful discriminating ability on local regions. However, most of them still have restricted generalization, in two respects: (1) each local classifier is sensitive to noise in its local region, which leads to overfitting; (2) local classifiers also ignore the local correlation determined by the sample distribution in each local region. To overcome these two problems, we present a novel locality-sensitive support vector machine (LSSVM) in this paper for the image retrieval problem. This classifier applies locality-sensitive hashing (LSH) to divide the whole feature space into a number of local regions, on each of which a local model can be better constructed due to the smaller within-class variation. To prevent these local models from overfitting to locality-sensitive structures, it imposes a global regularizer across local regions so that local classifiers are smoothly glued together to form a regularized overall classifier. Local correlation is modeled to capture the sample distribution that determines the locality structure of each local region, which can increase the discriminating ability of the algorithm. To evaluate the performance, we apply the proposed algorithm to the image retrieval task, and competitive results are obtained on a real-world web image data set.

Zhu, Chunhui; Wen, Fang; Sun, Jian; , "A rank-order distance based clustering algorithm for face tagging," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.481-488, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995680
Abstract: We present a novel clustering algorithm for tagging a face dataset (e.g., a personal photo album). The core of the algorithm is a new dissimilarity, called Rank-Order distance, which measures the dissimilarity between two faces using their neighboring information in the dataset. The Rank-Order distance is motivated by an observation that faces of the same person usually share their top neighbors. Specifically, for each face, we generate a ranking order list by sorting all other faces in the dataset by absolute distance (e.g., L1 or L2 distance between extracted face recognition features). Then, the Rank-Order distance of two faces is calculated using their ranking orders. Using the new distance, a Rank-Order distance based clustering algorithm is designed to iteratively group all faces into a small number of clusters for effective tagging. The proposed algorithm outperforms competitive clustering algorithms in terms of both precision/recall and efficiency.
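A small Python sketch of a Rank-Order style dissimilarity, following our reading of the definition above: two items are close if each appears early in the other's neighbor list and their top neighbors agree. The exact normalization in the paper may differ; we guard the denominator against zero.

```python
# Hedged sketch of a Rank-Order style dissimilarity (not the paper's
# exact formula). dist is any base distance matrix, e.g. L1/L2 between
# face features.

def order_lists(dist, n):
    """For each item, all other items sorted by absolute distance."""
    return {a: sorted((b for b in range(n) if b != a),
                      key=lambda b: dist[a][b]) for a in range(n)}

def rank_order_distance(a, b, ol):
    def d(x, y):
        rank_y = {f: r for r, f in enumerate(ol[y])}
        pos = ol[x].index(y)                       # rank of y in x's list
        # sum the ranks (in y's list) of x's neighbors up to y itself
        return pos, sum(rank_y.get(f, 0) for f in ol[x][:pos + 1])
    pab, dab = d(a, b)
    pba, dba = d(b, a)
    return (dab + dba) / max(min(pab, pba), 1)     # symmetric, guarded
```

Faces of the same person, which share top neighbors, get a small Rank-Order distance even when their absolute feature distance is moderate.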

Huang, Yongzhen; Huang, Kaiqi; Yu, Yinan; Tan, Tieniu; , "Salient coding for image classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1753-1760, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995682
Abstract: The codebook based (bag-of-words) model is a widely applied model for image classification. We analyze recent coding strategies in this model, and find that saliency is the fundamental characteristic of coding. The saliency in coding means that if a visual code is much closer to a descriptor than other codes, it will obtain a very strong response. The salient representation under maximum pooling operation leads to the state-of-the-art performance on many databases and competitions. However, most current coding schemes do not recognize the role of salient representation, so that they may lead to large deviations in representing local descriptors. In this paper, we propose "salient coding", which employs the ratio between descriptors' nearest code and other codes to describe descriptors. This approach can guarantee salient representation without deviations. We study salient coding on two sets of image classification databases (15-Scenes and PASCAL VOC2007). The experimental results demonstrate that our approach outperforms all other coding methods in image classification.
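The coding rule can be sketched as follows (hedged: the value of K and the exact saliency formula follow our reading of the abstract): each descriptor responds only on its nearest codeword, with strength given by how much closer that codeword is than the next K-1, and the image signature is formed by max pooling.

```python
import math

# Hedged sketch of salient coding: response only on the nearest codeword,
# strength = 1 - (nearest distance / mean of the next K-1 distances),
# followed by max pooling over an image's descriptors.

def salient_code(x, codebook, K=3):
    order = sorted(range(len(codebook)),
                   key=lambda j: math.dist(x, codebook[j]))
    near, rest = order[0], order[1:K]
    denom = sum(math.dist(x, codebook[j]) for j in rest) / len(rest)
    resp = [0.0] * len(codebook)
    resp[near] = 1.0 - math.dist(x, codebook[near]) / denom
    return resp

def image_signature(descriptors, codebook, K=3):
    codes = [salient_code(x, codebook, K) for x in descriptors]
    return [max(col) for col in zip(*codes)]   # max pooling per codeword
```

A descriptor sitting almost on a codeword gets a response near 1 there and exactly 0 elsewhere, which is the "salient representation without deviations" the abstract describes.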

Chandraker, Manmohan; Bai, Jiamin; Ramamoorthi, Ravi; , "A theory of differential photometric stereo for unknown isotropic BRDFs," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2505-2512, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995603
Abstract: This paper presents a comprehensive theory of photometric surface reconstruction from image derivatives. For unknown isotropic BRDFs, we show that two measurements of spatial and temporal image derivatives, under unknown light sources on a circle, suffice to determine the surface. This result is the culmination of a series of fundamental observations. First, we discover a photometric invariant that relates image derivatives to the surface geometry, regardless of the form of isotropic BRDF. Next, we show that just two pairs of differential images from unknown light directions suffice to recover surface information from the photometric invariant. This is shown to be equivalent to determining isocontours of constant magnitude of the surface gradient, as well as isocontours of constant depth. Further, we prove that specification of the surface normal at a single point completely determines the surface depth from these isocontours. In addition, we propose practical algorithms that require additional initial or boundary information, but recover depth from lower order derivatives. Our theoretical results are illustrated with several examples on synthetic and real data.

Cheng, Ming-Ming; Zhang, Guo-Xin; Mitra, Niloy J.; Huang, Xiaolei; Hu, Shi-Min; , "Global contrast based salient region detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.409-416, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995344
Abstract: Reliable estimation of visual saliency allows appropriate processing of images without prior knowledge of their contents, and thus remains an important step in many computer vision tasks including image segmentation, object recognition, and adaptive compression. We propose a regional contrast based saliency extraction algorithm, which simultaneously evaluates global contrast differences and spatial coherence. The proposed algorithm is simple, efficient, and yields full resolution saliency maps. Our algorithm consistently outperformed existing saliency detection methods, yielding higher precision and better recall rates, when evaluated using one of the largest publicly available data sets. We also demonstrate how the extracted saliency map can be used to create high quality segmentation masks for subsequent image processing.
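A toy sketch of the global-contrast principle on quantized colours. This is closer to the simpler histogram-contrast variant than to the paper's full regional method, which adds segmentation and spatial weighting: a colour is salient in proportion to its total contrast against all other colours, weighted by how frequent they are.

```python
from collections import Counter

# Hedged sketch of global (histogram) contrast saliency on quantised
# colours; the paper's RC method additionally uses regions and spatial
# distance weighting.

def color_saliency(pixels):
    """pixels: list of quantised colour tuples -> {colour: saliency}."""
    freq = Counter(pixels)
    n = len(pixels)

    def cdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    return {c: sum(f2 / n * cdist(c, c2) for c2, f2 in freq.items())
            for c in freq}
```

A rare foreground colour on a large uniform background receives high saliency, since almost every pixel in the image contrasts with it.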

Wu, Xinxiao; Xu, Dong; Duan, Lixin; Luo, Jiebo; , "Action recognition using context and appearance distribution features," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.489-496, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995624
Abstract: We first propose a new spatio-temporal context distribution feature of interest points for human action recognition. Each action video is expressed as a set of relative XYT coordinates between pairwise interest points in a local region. We learn a global GMM (referred to as Universal Background Model, UBM) using the relative coordinate features from all the training videos, and then represent each video as the normalized parameters of a video-specific GMM adapted from the global GMM. In order to capture the spatio-temporal relationships at different levels, multiple GMMs are utilized to describe the context distributions of interest points over multi-scale local regions. To describe the appearance information of an action video, we also propose to use GMM to characterize the distribution of local appearance features from the cuboids centered around the interest points. Accordingly, an action video can be represented by two types of distribution features: 1) multiple GMM distributions of spatio-temporal context; 2) GMM distribution of local video appearance. To effectively fuse these two types of heterogeneous and complementary distribution features, we additionally propose a new learning algorithm, called Multiple Kernel Learning with Augmented Features (AFMKL), to learn an adapted classifier based on multiple kernels and the pre-learned classifiers of other action classes. Extensive experiments on KTH, multi-view IXMAS and complex UCF sports datasets demonstrate that our method generally achieves higher recognition accuracy than other state-of-the-art methods.

Dixit, Mandar; Rasiwasia, Nikhil; Vasconcelos, Nuno; , "Adapted Gaussian models for image classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.937-943, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995674
Abstract: A general formulation of "Bayesian Adaptation" for generative and discriminative classification in the topic model framework is proposed. A generic topic-independent Gaussian mixture model, known as the background GMM, is learned using all available training data and adapted to the individual topics. In the generative framework, a Gaussian variant of the spatial pyramid model is used with a Bayes classifier. For the discriminative case, a novel predictive histogram representation for an image is presented. This builds upon the adapted topic model structure, using the individual class dictionaries and Bayesian weighting. The resulting histogram representation is evaluated for classification using a Support Vector Machine (SVM). A comparative evaluation of the proposed image models with the standard ones in the image classification literature is provided on three benchmark datasets.

Domke, Justin; , "Parameter learning with truncated message-passing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2937-2943, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995320
Abstract: Training of conditional random fields often takes the form of a double-loop procedure with message-passing inference in the inner loop. This can be very expensive, as the need to solve the inner loop to high accuracy can require many message-passing iterations. This paper seeks to reduce the expense of such training, by redefining the training objective in terms of the approximate marginals obtained after message-passing is "truncated" to a fixed number of iterations. An algorithm is derived to efficiently compute the exact gradient of this objective. On a common pixel labeling benchmark, this procedure improves training speeds by an order of magnitude, and slightly improves inference accuracy if a very small number of message-passing iterations are used at test time.

Wang, Liang; Yang, Ruigang; , "Global stereo matching leveraged by sparse ground control points," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3033-3040, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995480
Abstract: We present a novel global stereo model that makes use of constraints from points with known depths, i.e., the Ground Control Points (GCPs) as referred to in stereo literature. Our formulation explicitly models the influences of GCPs in a Markov Random Field. A novel GCPs-based regularization term is naturally integrated into our global optimization framework in a principled way using the Bayes rule. The optimal solution of the inference problem can be approximated via existing energy minimization techniques such as graph cuts used in this paper. Our generic probabilistic framework allows GCPs to be obtained from various modalities and provides a natural way to integrate the information from multiple sensors. Quantitative evaluations demonstrate the effectiveness of the proposed formulation for regularizing the ill-posed stereo matching problem and improving reconstruction accuracy.

Ding, Yuanyuan; Xiao, Jing; Yu, Jingyi; , "Importance filtering for image retargeting," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.89-96, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995445
Abstract: Content-aware image retargeting has attracted a lot of interest recently. The key and most challenging issue for this task is how to balance the tradeoff between preserving the important contents and minimizing visual distortion of the image structure. In this paper we present a novel filtering-based technique to tackle this issue, called "importance filtering". Specifically, we first filter the image saliency guided by the image itself to achieve a structure-consistent importance map. We then use the pixel importance as the key constraint to compute the gradient map of pixel shifts from the original resolution to the target. Finally, we integrate the shift gradient across the image using a weighted filter to construct a smooth shift map and render the target image. The weight is again controlled by the pixel importance. The two filtering processes together maintain structural consistency while preserving the important contents in the target image. Furthermore, the simple nature of filter operations allows highly efficient implementation for real-time applications and easy extension to video retargeting, as the structural constraints from the original image naturally convey the temporal coherence between frames. The effectiveness and efficiency of our importance filtering algorithm are confirmed in extensive experiments.
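A hedged one-dimensional sketch of the shift-map idea: when a row of pixels is shrunk, low-importance columns absorb most of the shift, and integrating the per-column shift gradient gives a smooth warp. This is our reading of the abstract, not the authors' implementation, and the inverse-importance allocation is an illustrative assumption.

```python
# Hedged 1-D sketch of importance-driven retargeting: build a shift
# gradient from inverse importance, integrate it into a shift map, then
# resample the row at the target width.

def retarget_row(row, importance, new_width):
    w = len(row)
    shrink = w - new_width
    inv = [1.0 / (1e-6 + s) for s in importance]
    total = sum(inv)
    # shift gradient: each column moves left in proportion to 1/importance
    grad = [shrink * v / total for v in inv]
    shift, shifts = 0.0, []
    for g in grad:                 # integrate the shift gradient
        shift += g
        shifts.append(shift)
    out = []
    for x in range(new_width):
        # nearest source column whose shifted position lands at x
        src = min(range(w), key=lambda i: abs(i - shifts[i] - x))
        out.append(row[src])
    return out
```

With high importance concentrated in the middle of the row, the middle pixels survive the shrink while low-importance border columns are squeezed away.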

Zhang, Zhengdong; Matsushita, Yasuyuki; Ma, Yi; , "Camera calibration with lens distortion from low-rank textures," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2321-2328, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995548
Abstract: We present a simple, accurate, and flexible method to calibrate intrinsic parameters of a camera together with (possibly significant) lens distortion. This new method can work under a wide range of practical scenarios: using multiple images of a known pattern, multiple images of an unknown pattern, single or multiple image(s) of multiple patterns, etc. Moreover, this new method does not rely on extracting any low-level features such as corners or edges. It can tolerate considerably large lens distortion, noise, error, illumination and viewpoint change, and still obtain accurate estimation of the camera parameters. The new method leverages recent breakthroughs in powerful high-dimensional convex optimization tools, especially those for matrix rank minimization and sparse signal recovery. We will show how the camera calibration problem can be formulated as an important extension to principal component pursuit, and solved by similar techniques. We characterize exactly to what extent the parameters can be recovered in cases of ambiguity. We verify the efficacy and accuracy of the proposed algorithm with extensive experiments on real images.

Ocegueda, Omar; Shah, Shishir K.; Kakadiaris, Ioannis A.; , "Which parts of the face give out your identity?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.641-648, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995613
Abstract: We present a Markov Random Field model for the analysis of lattices (e.g., images or 3D meshes) in terms of the discriminative information of their vertices. The proposed method provides a measure field that estimates the probability of each vertex to be "discriminative" or "non-discriminative". As an application of the proposed framework, we present a method for the selection of compact and robust features for 3D face recognition. The resulting signature consists of 360 coefficients, based on which we are able to build a classifier yielding better recognition rates than currently reported in the literature. The main contribution of this work lies in the development of a novel framework for feature selection in scenarios in which the most discriminative information is known to be concentrated along piece-wise smooth regions of a lattice.

Guo, Ruiqi; Dai, Qieyun; Hoiem, Derek; , "Single-image shadow detection and removal using paired regions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2033-2040, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995725
Abstract: In this paper, we address the problem of shadow detection and removal from single images of natural scenes. Different from traditional methods that explore pixel or edge information, we employ a region based approach. In addition to considering individual regions separately, we predict relative illumination conditions between segmented regions from their appearances and perform pairwise classification based on such information. Classification results are used to build a graph of segments, and graph-cut is used to solve the labeling of shadow and non-shadow regions. Detection results are later refined by image matting, and the shadow free image is recovered by relighting each pixel based on our lighting model. We evaluate our method on the shadow detection dataset in [19]. In addition, we created a new dataset with shadow-free ground truth images, which provides a quantitative basis for evaluating shadow removal.

Joly, Alexis; Buisson, Olivier; , "Random maximum margin hashing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.873-880, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995709
Abstract: Following the success of hashing methods for multidimensional indexing, more and more works are interested in embedding visual feature spaces in compact hash codes. Such approaches are not an alternative to using index structures but a complementary way to reduce both the memory usage and the distance computation cost. Several data dependent hash functions have notably been proposed to closely fit data distribution and provide better selectivity than usual random projections such as LSH. However, improvements occur only for relatively small hash code sizes up to 64 or 128 bits. As discussed in the paper, this is mainly due to the lack of independence between the produced hash functions. We introduce a new hash function family that attempts to solve this issue in any kernel space. Rather than boosting the collision probability of close points, our method focuses on data scattering. By training purely random splits of the data, regardless of the closeness of the training samples, it is indeed possible to generate consistently more independent hash functions. On the other hand, the use of large margin classifiers helps maintain good generalization performance. Experiments show that our new Random Maximum Margin Hashing scheme (RMMH) outperforms four state-of-the-art hashing methods, notably in kernel spaces.
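To illustrate the "purely random splits" idea, here is a dependency-free sketch that substitutes a plain perceptron for the large-margin (SVM) classifier used in the paper, purely to keep the sketch self-contained; sample sizes, epochs and the linear form are illustrative assumptions.

```python
import random

# Hedged sketch of the RMMH idea: each hash bit comes from a classifier
# trained on a random balanced split of sampled points. A perceptron
# stands in for the paper's max-margin classifier.

def train_split_classifier(pos, neg, dim, epochs=50):
    w, b = [0.0] * dim, 0.0
    data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def rmmh_hash(points, n_bits, m=4, seed=0):
    rng = random.Random(seed)
    random.seed(seed)                        # make shuffles reproducible
    dim = len(points[0])
    hashes = []
    for _ in range(n_bits):
        sample = rng.sample(points, 2 * m)   # random balanced split
        hashes.append(train_split_classifier(sample[:m], sample[m:], dim))

    def h(x):
        return tuple(int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
                     for w, b in hashes)
    return h
```

Because each split is drawn independently of point closeness, the resulting bits tend to cut the data in uncorrelated directions, which is the independence property the abstract argues for.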

Brau, Ernesto; Dunatunga, Damayanthi; Barnard, Kobus; Tsukamoto, Tatsuya; Palanivelu, Ravi; Lee, Philip; , "A generative statistical model for tracking multiple smooth trajectories," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1137-1144, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995736
Abstract: We present a general model for tracking smooth trajectories of multiple targets in complex data sets, where tracks potentially cross each other many times. As the number of overlapping trajectories grows, exploiting smoothness becomes increasingly important to disambiguate the association of successive points. However, in many important problems an effective parametric model for the trajectories does not exist. Hence we propose modeling trajectories as independent realizations of Gaussian processes with kernel functions which allow for arbitrary smooth motion. Our generative statistical model accounts for the data as coming from an unknown number of such processes, together with expectations for noise points and the probability that points are missing. For inference we compare two methods: A modified version of the Markov chain Monte Carlo data association (MCMCDA) method, and a Gibbs sampling method which is much simpler and faster, and gives better results by being able to search the solution space more efficiently. In both cases, we compare our results against the smoothing provided by linear dynamical systems (LDS). We test our approach on videos of birds and fish, and on 82 image sequences of pollen tubes growing in a petri dish, each with up to 60 tubes with multiple crossings. We achieve 93% accuracy on image sequences with up to ten trajectories (35 sequences) and 88% accuracy when there are more than ten (42 sequences). This performance surpasses that of using an LDS motion model, and far exceeds a simple heuristic tracker.

Yang, Meng; Zhang, Lei; Yang, Jian; Zhang, David; , "Robust sparse coding for face recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.625-632, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995393
Abstract: Recently the sparse representation (or coding) based classification (SRC) has been successfully used in face recognition. In SRC, the testing image is represented as a sparse linear combination of the training samples, and the representation fidelity is measured by the l2-norm or l1-norm of the coding residual. Such a sparse coding model actually assumes that the coding residual follows a Gaussian or Laplacian distribution, which may not be accurate enough to describe the coding errors in practice. In this paper, we propose a new scheme, namely the robust sparse coding (RSC), by modeling the sparse coding as a sparsity-constrained robust regression problem. The RSC seeks the MLE (maximum likelihood estimation) solution of the sparse coding problem, and it is much more robust to outliers (e.g., occlusions, corruptions, etc.) than SRC. An efficient iteratively reweighted sparse coding algorithm is proposed to solve the RSC model. Extensive experiments on representative face databases demonstrate that the RSC scheme is much more effective than state-of-the-art methods in dealing with face occlusion, corruption, lighting and expression changes, etc.
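A hedged sketch of the iteratively reweighted loop behind RSC, with two loud substitutions: ridge regularization stands in for the l1 sparsity term (so each step is a closed-form weighted least squares), and the logistic weight function with its constants is an illustrative choice, not the paper's learned one.

```python
import numpy as np

# Hedged sketch of iteratively reweighted robust coding: down-weight
# samples with large coding residuals, then re-solve a (here ridge-
# regularised, not l1) weighted coding problem.

def reweighted_coding(D, y, iters=5, lam=0.1, mu=8.0):
    n = D.shape[1]
    w = np.ones(len(y))
    for _ in range(iters):
        Dw = D * w[:, None]                       # row-weighted dictionary
        alpha = np.linalg.solve(Dw.T @ D + lam * np.eye(n), Dw.T @ y)
        r2 = (y - D @ alpha) ** 2
        tau = np.median(r2) + 1e-9                # robust residual scale
        # soft outlier weights: large residuals -> weight near 0
        w = 1.0 / (1.0 + np.exp(np.clip(mu * (r2 / tau - 1.0), -50.0, 50.0)))
    return alpha, w
```

On data with one gross outlier (think of an occluded pixel), the loop drives that sample's weight toward zero and recovers the clean coefficients, which is the qualitative behaviour the abstract claims for occlusion and corruption.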

Levin, Anat; Nadler, Boaz; , "Natural image denoising: Optimality and inherent bounds," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2833-2840, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995309
Abstract: The goal of natural image denoising is to estimate a clean version of a given noisy image, utilizing prior knowledge on the statistics of natural images. The problem has been studied intensively with considerable progress made in recent years. However, it seems that image denoising algorithms are starting to converge and recent algorithms improve over previous ones by only fractional dB values. It is thus important to understand how much further natural image denoising algorithms can still be improved, and what inherent limits are imposed by the actual statistics of the data. The challenge in evaluating such limits is that constructing proper models of natural image statistics is a long standing and yet unsolved problem. To overcome the absence of accurate image priors, this paper takes a non parametric approach and represents the distribution of natural images using a huge set of 10^10 patches. We then derive a simple statistical measure which provides a lower bound on the optimal Bayesian minimum mean square error (MMSE). This imposes a limit on the best possible results of denoising algorithms which utilize a fixed support around a denoised pixel and a generic natural image prior. Our findings suggest that for small windows, state of the art denoising algorithms are approaching optimality and cannot be further improved beyond ∼ 0.1dB values.
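The non-parametric estimator at the heart of the bound can be sketched as follows: with a database of clean patches standing in for the natural-image prior, the MMSE estimate of a pixel given its noisy patch is the Gaussian-likelihood-weighted average of the database patch centers. Patch size and sigma below are illustrative, and the real computation runs over billions of patches.

```python
import math

# Hedged sketch of the non-parametric MMSE estimate: average the clean
# center values, weighted by the Gaussian likelihood of the noisy patch
# under each database patch.

def mmse_center(noisy_patch, database, sigma):
    center = len(noisy_patch) // 2
    num = den = 0.0
    for clean in database:
        d2 = sum((a - b) ** 2 for a, b in zip(noisy_patch, clean))
        w = math.exp(-d2 / (2 * sigma ** 2))
        num += w * clean[center]
        den += w
    return num / den
```

The residual error of this estimator, averaged over noisy patches, is exactly the MMSE the paper lower-bounds: no denoiser restricted to the same patch support can do better.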

Liu, Wei; Jiang, Yu-Gang; Luo, Jiebo; Chang, Shih-Fu; , "Noise resistant graph ranking for improved web image search," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.849-856, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995315
Abstract: In this paper, we exploit a novel ranking mechanism that processes query samples with noisy labels, motivated by the practical application of web image search re-ranking where the originally highest ranked images are usually posed as pseudo queries for subsequent re-ranking. Availing ourselves of the low-frequency spectrum of a neighborhood graph built on the samples, we propose a graph-theoretical framework amenable to noise resistant ranking. The proposed framework consists of two components: spectral filtering and graph-based ranking. The former leverages sparse bases, progressively selected from a pool of smooth eigenvectors of the graph Laplacian, to reconstruct the noisy label vector associated with the query sample set and accordingly filter out the query samples with less authentic positive labels. The latter applies a canonical graph ranking algorithm with respect to the filtered query sample set. Quantitative image re-ranking experiments carried out on two public web image databases bear out that our re-ranking approach compares favorably with the state of the art and improves web image search engines by a large margin, even though we harvest the noisy queries from the top-ranked images returned by these search engines.

He, He; Siu, Wan-Chi; , "Single image super-resolution using Gaussian process regression," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.449-456, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995713
Abstract: In this paper we address the problem of producing a high-resolution image from a single low-resolution image without any external training set. We propose a framework for both magnification and deblurring using only the original low-resolution image and its blurred version. In our method, each pixel is predicted by its neighbors through the Gaussian process regression. We show that when using a proper covariance function, the Gaussian process regression can perform soft clustering of pixels based on their local structures. We further demonstrate that our algorithm can extract adequate information contained in a single low-resolution image to generate a high-resolution image with sharp edges, which is comparable to or even superior in quality to the performance of other edge-directed and example-based super-resolution algorithms. Experimental results also show that our approach maintains high-quality performance at large magnifications.
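A one-dimensional sketch of GP regression as it would be used here: known low-resolution samples act as training points, and missing high-resolution positions are predicted under a squared-exponential covariance. The hyper-parameters below are illustrative; the paper chooses the covariance so that the regression adapts to local image structure.

```python
import numpy as np

# Hedged 1-D sketch of Gaussian process regression for filling in
# missing samples, with a squared-exponential kernel and fixed,
# illustrative hyper-parameters.

def gp_predict(xs, ys, xq, length=1.0, noise=1e-4):
    xs, ys, xq = map(np.asarray, (xs, ys, xq))
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2)
                            / (2 * length ** 2))
    K = k(xs, xs) + noise * np.eye(len(xs))   # training covariance
    return k(xq, xs) @ np.linalg.solve(K, ys)  # posterior mean at xq
```

The prediction reproduces the known samples almost exactly (the small noise term keeps the solve stable) and interpolates smoothly between them, which is the soft, structure-aware averaging the abstract describes.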

Zhao, Bin; Fei-Fei, Li; Xing, Eric P.; , "Online detection of unusual events in videos via dynamic sparse coding," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3313-3320, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995524
Abstract: Real-time unusual event detection in video streams has been a difficult challenge due to the lack of sufficient training information, volatility of the definitions for both normality and abnormality, time constraints, and statistical limitation of the fitness of any parametric models. We propose a fully unsupervised dynamic sparse coding approach for detecting unusual events in videos based on online sparse reconstructibility of query signals from an automatically learned event dictionary, which forms the sparse coding basis. Based on an intuition that usual events in a video are more likely to be reconstructible from an event dictionary, whereas unusual events are not, our algorithm employs a principled convex optimization formulation that allows both a sparse reconstruction code and an online dictionary to be jointly inferred and updated. Our algorithm is completely unsupervised, making no prior assumptions about what unusual events may look like or the settings of the cameras. The fact that the basis dictionary is updated in an online fashion as the algorithm observes more data avoids any issues with concept drift. Experimental results on hours of real world surveillance video and several YouTube videos show that the proposed algorithm can reliably locate the unusual events in the video sequence, outperforming the current state-of-the-art methods.

Chen, Xi; Jain, Arpit; Gupta, Abhinav; Davis, Larry S.; , "Piecing together the segmentation jigsaw using context," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2001-2008, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995367
Abstract: We present an approach to jointly solve the segmentation and recognition problem using a multiple segmentation framework. We formulate the problem as segment selection from a pool of segments, assigning each selected segment a class label. Previous multiple segmentation approaches used local appearance matching to select segments in a greedy manner. In contrast, our approach formulates a cost function based on contextual information in conjunction with appearance matching. This relaxed cost function formulation is minimized using an efficient quadratic programming solver and an approximate solution is obtained by discretizing the relaxed solution. Our approach improves labeling performance compared to other segmentation based recognition approaches.

Kulis, Brian; Saenko, Kate; Darrell, Trevor; , "What you saw is not what you get: Domain adaptation using asymmetric kernel transforms," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1785-1792, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995702
Abstract: In real-world applications, "what you saw" during training is often not "what you get" during deployment: the distribution and even the type and dimensionality of features can change from one dataset to the next. In this paper, we address the problem of visual domain adaptation for transferring object models from one dataset or visual domain to another. We introduce ARC-t, a flexible model for supervised learning of non-linear transformations between domains. Our method is based on a novel theoretical result demonstrating that such transformations can be learned in kernel space. Unlike existing work, our model is not restricted to symmetric transformations, nor to features of the same type and dimensionality, making it applicable to a significantly wider set of adaptation scenarios than previous methods. Furthermore, the method can be applied to categories that were not available during training. We demonstrate the ability of our method to adapt object recognition models under a variety of situations, such as differing imaging conditions, feature types and codebooks.

Benoit, Louise; Mairal, Julien; Bach, Francis; Ponce, Jean; , "Sparse image representation with epitomes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2913-2920, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995636
Abstract: Sparse coding, which is the decomposition of a vector using only a few basis elements, is widely used in machine learning and image processing. The basis set, also called dictionary, is learned to adapt to specific data. This approach has proven to be very effective in many image processing tasks. Traditionally, the dictionary is an unstructured "flat" set of atoms. In this paper, we study structured dictionaries [1] which are obtained from an epitome [11], or a set of epitomes. The epitome is itself a small image, and the atoms are all the patches of a chosen size inside this image. This considerably reduces the number of parameters to learn and provides sparse image decompositions with shift-invariance properties. We propose a new formulation and an algorithm for learning the structured dictionaries associated with epitomes, and illustrate their use in image denoising tasks.

Zhou, Qian-Yi; Neumann, Ulrich; , "2.5D building modeling with topology control," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2489-2496, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995611
Abstract: 2.5D building reconstruction aims at creating building models composed of complex roofs and vertical walls. In this paper, we define 2.5D building topology as a set of roof features, wall features, and point features, together with the associations between them. Based on this definition, we extend 2.5D dual contouring into a 2.5D modeling method with topology control. Compared with the previous method, we place fewer restrictions on the adaptive simplification process. We show results under intense geometry simplification. Our results preserve significant topology structures while the number of triangles is comparable to that of manually created models or primitive-based models.

Kim, Jaechul; Grauman, Kristen; , "Boundary preserving dense local regions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1553-1560, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995526
Abstract: We propose a dense local region detector to extract features suitable for image matching and object recognition tasks. Whereas traditional local interest operators rely on repeatable structures that often cross object boundaries (e.g., corners, scale-space blobs), our sampling strategy is driven by segmentation, and thus preserves object boundaries and shape. At the same time, whereas existing region-based representations are sensitive to segmentation parameters and object deformations, our novel approach to robustly sample dense sites and determine their connectivity offers better repeatability. In extensive experiments, we find that the proposed region detector provides significantly better repeatability and localization accuracy for object matching compared to an array of existing detectors. In addition, we show our regions lead to excellent results on two benchmark tasks that require good feature matching: weakly supervised foreground discovery, and nearest neighbor-based object recognition.

Levin, Anat; Weiss, Yair; Durand, Fredo; Freeman, William T.; , "Efficient marginal likelihood optimization in blind deconvolution," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2657-2664, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995308
Abstract: In blind deconvolution one aims to estimate from an input blurred image y a sharp image x and an unknown blur kernel k. Recent research shows that a key to success is to consider the overall shape of the posterior distribution p(x, k | y) and not only its mode. This leads to a distinction between MAP_{x,k} strategies, which estimate the mode pair (x, k) and often lead to undesired results, and MAP_k strategies, which select the best k while marginalizing over all possible x images. The MAP_k principle is significantly more robust than the MAP_{x,k} one, yet it involves a challenging marginalization over latent images. As a result, MAP_k techniques are considered complicated, and have not been widely exploited. This paper derives a simple approximated MAP_k algorithm which involves only a modest modification of common MAP_{x,k} algorithms. We show that MAP_k can, in fact, be optimized easily, with no additional computational complexity.
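In the notation of the abstract, the two estimators differ in whether the latent image is maximized over or integrated out:

```latex
\mathrm{MAP}_{x,k}:\quad (\hat{x},\hat{k}) = \arg\max_{x,k}\, p(x,k \mid y)
\qquad\text{vs.}\qquad
\mathrm{MAP}_{k}:\quad \hat{k} = \arg\max_{k}\, p(k \mid y)
    = \arg\max_{k} \int p(x,k \mid y)\,\mathrm{d}x
```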

Brox, Thomas; Bourdev, Lubomir; Maji, Subhransu; Malik, Jitendra; , "Object segmentation by alignment of poselet activations to image contours," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2225-2232, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995659
Abstract: In this paper, we propose techniques to make use of two complementary bottom-up features, image edges and texture patches, to guide top-down object segmentation towards higher precision. We build upon the part-based poselet detector, which can predict masks for numerous parts of an object. For this purpose we extend poselets to 19 other categories apart from person. We non-rigidly align these part detections to potential object contours in the image, both to increase the precision of the predicted object mask and to sort out false positives. We spatially aggregate object information via a variational smoothing technique while ensuring that object regions do not overlap. Finally, we propose to refine the segmentation based on self-similarity defined on small image patches. We obtain competitive results on the challenging Pascal VOC benchmark. On four classes we achieve the best numbers to-date.

Humayun, Ahmad; Mac Aodha, Oisin; Brostow, Gabriel J.; , "Learning to find occlusion regions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2161-2168, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995517
Abstract: For two consecutive frames in a video, we identify which pixels in the first frame become occluded in the second. Such general-purpose detection of occlusion regions is difficult and important because one-to-one correspondence of imaged scene points is needed for many tracking, video segmentation, and reconstruction algorithms. Our hypothesis is that an effective trained occlusion detector can be generated on the basis of i) a broad spectrum of visual features, and ii) representative but synthetic training sequences. By using a Random Forest based framework for feature selection and training, we found that the proposed feature set was sufficient to frequently assign a high probability of occlusion to just the pixels that were indeed becoming occluded. Our extensive experiments on many sequences support this finding, and while accuracy is certainly still scene-dependent, the proposed classifier could be a useful preprocessing step to exploit temporal information in video.

Biswas, Soma; Aggarwal, Gaurav; Flynn, Patrick J.; , "Pose-robust recognition of low-resolution face images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.601-608, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995443
Abstract: Face images captured by surveillance cameras usually have poor resolution in addition to uncontrolled poses and illumination conditions which adversely affect performance of face matching algorithms. In this paper, we develop a novel approach for matching surveillance quality facial images to high resolution images in frontal pose which are often available during enrollment. The proposed approach uses Multidimensional Scaling to simultaneously transform the features from the poor quality probe images and the high quality gallery images in such a manner that the distances between them approximate the distances had the probe images been captured in the same conditions as the gallery images. Thorough evaluation on the Multi-PIE dataset [10] and comparisons with state-of-the-art super-resolution and classifier based approaches are performed to illustrate the usefulness of the proposed approach. Experiments on real surveillance images further signify the applicability of the framework.

Owens, Trevor; Saenko, Kate; Chakrabarti, Ayan; Xiong, Ying; Zickler, Todd; Darrell, Trevor; , "Learning object color models from multi-view constraints," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.169-176, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995705
Abstract: Color is known to be highly discriminative for many object recognition tasks, but is difficult to infer from uncontrolled images in which the illuminant is not known. Traditional methods for color constancy can improve surface reflectance estimates from such uncalibrated images, but their output depends significantly on the background scene. In many recognition and retrieval applications, we have access to image sets that contain multiple views of the same object in different environments; we show in this paper that correspondences between these images provide important constraints that can improve color constancy. We introduce the multi-view color constancy problem, and present a method to recover estimates of underlying surface reflectance based on joint estimation of these surface properties and the illuminants present in multiple images. The method can exploit image correspondences obtained by various alignment techniques, and we show examples based on matching local region features. Our results show that multi-view constraints can significantly improve estimates of both scene illuminants and object color (surface reflectance) when compared to a baseline single-view method.

Zeng, Yun; Wang, Chaohui; Wang, Yang; Gu, Xianfeng; Samaras, Dimitris; Paragios, Nikos; , "Intrinsic dense 3D surface tracking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1225-1232, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995513
Abstract: This paper presents a novel intrinsic 3D surface distance and its use in a complete probabilistic tracking framework for dynamic 3D data. Registering two frames of a deforming 3D shape relies on accurate correspondences between all points across the two frames. In the general case such correspondence search is computationally intractable. Common prior assumptions on the nature of the deformation, such as near-rigidity, isometry or learning from a training set, reduce the search space but often at the price of a loss of accuracy for deformations outside the prior assumptions. Considering the set of all possible 3D surface matchings defined by specifying triplets of correspondences in the uniformization domain, we introduce a new matching cost between two 3D surfaces: the lowest feature differences across this set of matchings that cause two points to correspond become the matching cost of that particular correspondence. We show that for surface tracking applications, the matching cost can be efficiently computed in the uniformization domain. This matching cost is then combined with regularization terms that enforce spatial and temporal motion consistencies, into a maximum a posteriori (MAP) problem which we approximate using a Markov Random Field (MRF). Compared to previous 3D surface tracking approaches that either assume isometric deformations or consistent features, our method achieves dense, accurate tracking results, which we demonstrate through a series of dense, anisometric 3D surface tracking experiments.

Danielsson, Oscar; Rasolzadeh, Babak; Carlsson, Stefan; , "Gated classifiers: Boosting under high intra-class variation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2673-2680, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995408
Abstract: In this paper we address the problem of using boosting (e.g. AdaBoost [7]) to classify a target class with significant intra-class variation against a large background class. This situation occurs for example when we want to recognize a visual object class against all other image patches. The boosting algorithm produces a strong classifier, which is a linear combination of weak classifiers. We observe that we often have sets of weak classifiers that individually fire on many examples of the target class but never fire together on those examples (i.e. their outputs are anti-correlated on the target class). Motivated by this observation we suggest a family of derived weak classifiers, termed gated classifiers, that suppress such combinations of weak classifiers. Gated classifiers can be used on top of any original weak learner. We run experiments on two popular datasets, showing that our method reduces the required number of weak classifiers by almost an order of magnitude, which in turn yields faster detectors. We experiment on synthetic data showing that gated classifiers enable more complex distributions to be represented. We hope that gated classifiers will extend the usefulness of boosted classifier cascades [29].

Szabo, Zoltan; Poczos, Barnabas; Lorincz, Andras; , "Online group-structured dictionary learning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2865-2872, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995712
Abstract: We develop a dictionary learning method which is (i) online, (ii) enables overlapping group structures with (iii) non-convex sparsity-inducing regularization and (iv) handles the partially observable case. Structured sparsity and the related group norms have recently gained widespread attention in group-sparsity regularized problems in the case when the dictionary is assumed to be known and fixed. However, when the dictionary also needs to be learned, the problem is much more difficult. Only a few methods have been proposed to solve this problem, and they can handle two of these four desirable properties at most. To the best of our knowledge, our proposed method is the first one that possesses all of these properties. We investigate several interesting special cases of our framework, such as the online, structured, sparse non-negative matrix factorization, and demonstrate the efficiency of our algorithm with several numerical experiments.

Liu, Jingen; Shah, Mubarak; Kuipers, Benjamin; Savarese, Silvio; , "Cross-view action recognition via view knowledge transfer," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3209-3216, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995729
Abstract: In this paper, we present a novel approach to recognizing human actions from different views by view knowledge transfer. An action is originally modelled as a bag of visual-words (BoVW), which is sensitive to view changes. We argue that, as opposed to visual words, there exist some higher level features which can be shared across views and enable the connection of action models for different views. To discover these features, we use a bipartite graph to model two view-dependent vocabularies, then apply bipartite graph partitioning to co-cluster two vocabularies into visual-word clusters called bilingual-words (i.e., high-level features), which can bridge the semantic gap across view-dependent vocabularies. Consequently, we can transfer a BoVW action model into a bag-of-bilingual-words (BoBW) model, which is more discriminative in the presence of view changes. We tested our approach on the IXMAS data set and obtained very promising results. Moreover, to further fuse view knowledge from multiple views, we apply a Locally Weighted Ensemble scheme to dynamically weight transferred models based on the local distribution structure around each test example. This process can further improve the average recognition rate by about 7%.

Sidorov, Kirill A.; Richmond, Stephen; Marshall, David; , "Efficient groupwise non-rigid registration of textured surfaces," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2401-2408, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995632
Abstract: Advances in 3D imaging have recently made 3D surface scanners, capable of capturing textured surfaces at video rate, affordable and common in computer vision. This is a relatively new source of data, the potential of which has not yet been fully exploited as the problem of non-rigid registration of surfaces is difficult. While registration based on shape alone has been an active research area for some time, the problem of registering surfaces based on texture information has not been addressed in a principled way. We propose a novel, efficient and reliable, fully automatic method for performing groupwise non-rigid registration of textured surfaces, such as those obtained with 3D scanners. We demonstrate the robustness of our approach on 3D scans of human faces, including the notoriously difficult case of inter-subject registration. We show how our method can be used to build high-quality 3D models of appearance fully automatically.

Smith, Brandon M.; Zhu, Shengqi; Zhang, Li; , "Face image retrieval by shape manipulation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.769-776, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995471
Abstract: Current face image retrieval methods achieve impressive results, but lack efficient ways to refine the search, particularly for geometric face attributes. Users cannot easily find faces with slightly more furrowed brows or specific leftward pose shifts, for example. To address this problem, we propose a new face search technique based on shape manipulation that is complementary to current search engines. Users drag one or a small number of contour points, like the bottom of the chin or the corner of an eyebrow, to search for faces similar in shape to the current face, but with updated geometric attributes specific to their edits. For example, the user can drag a mouth corner to find faces with wider smiles, or the tip of the nose to find faces with a specific pose. As part of our system, we propose (1) a novel confidence score for face alignment results that automatically constructs a contour-aligned face database with reasonable alignment accuracy, (2) a simple and straightforward extension of PCA with missing data to tensor analysis, and (3) a new regularized tensor model to compute shape feature vectors for each aligned face, all built upon previous work. To the best of our knowledge, our system demonstrates the first face retrieval approach based chiefly on shape manipulation. We show compelling results on a sizable database of over 10,000 face images captured in uncontrolled environments.

Gao, Tianshi; Packer, Benjamin; Koller, Daphne; , "A segmentation-aware object detection model with occlusion handling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1361-1368, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995623
Abstract: The bounding box representation employed by many popular object detection models [3, 6] implicitly assumes all pixels inside the box belong to the object. This assumption makes the representation less robust to objects with occlusion [16]. In this paper, we augment the bounding box with a set of binary variables, each of which corresponds to a cell indicating whether the pixels in the cell belong to the object. This segmentation-aware representation explicitly models and accounts for the supporting pixels of the object within the bounding box and is thus more robust to occlusion. We learn the model in a structured output framework, and develop a method that efficiently performs both inference and learning using this rich representation. The method is able to use segmentation reasoning to achieve improved detection results with richer output (cell-level segmentation) on the Street Scenes and Pascal VOC 2007 datasets. Finally, we present a globally coherent object model using our rich representation to account for object-object occlusion, resulting in a more coherent image understanding.

Elqursh, Ali; Elgammal, Ahmed; , "Line-based relative pose estimation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3049-3056, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995512
Abstract: We present an algorithm for calibrated camera relative pose estimation from lines. Given three lines, two of which are parallel to each other and orthogonal to the third, we can compute the relative rotation between two images. We can also compute the relative translation from two intersection points. We also present a framework in which such lines can be detected. We evaluate the performance of the algorithm using synthetic and real data. The intended use of the algorithm is with robust hypothesize-and-test frameworks such as RANSAC. Our approach is suitable for urban and indoor environments where most lines are either parallel or orthogonal to each other.

Moreno-Noguer, Francesc; , "Deformation and illumination invariant feature point descriptor," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1593-1600, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995529
Abstract: Recent advances in 3D shape recognition have shown that kernels based on diffusion geometry can be effectively used to describe local features of deforming surfaces. In this paper, we introduce a new framework that allows using these kernels on 2D local patches, yielding a novel feature point descriptor that is both invariant to non-rigid image deformations and illumination changes. In order to build the descriptor, 2D image patches are embedded as 3D surfaces, by multiplying the intensity level by an arbitrarily large and constant weight that favors anisotropic diffusion and retains the gradient magnitude information. Patches are then described in terms of a heat kernel signature, which is made invariant to intensity changes, rotation and scaling. The resulting feature point descriptor is proven to be significantly more discriminative than state-of-the-art ones, even those specifically designed for describing non-rigid image deformations.

Chen, Chao-Yeh; Grauman, Kristen; , "Clues from the beaten path: Location estimation with bursty sequences of tourist photos," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1569-1576, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995412
Abstract: Image-based location estimation methods typically recognize every photo independently, and their resulting reliance on strong visual feature matches makes them most suited for distinctive landmark scenes. We observe that when touring a city, people tend to follow common travel patterns — for example, a stroll down Wall Street might be followed by a ferry ride, then a visit to the Statue of Liberty. We propose an approach that learns these trends directly from online image data, and then leverages them within a Hidden Markov Model to robustly estimate locations for novel sequences of tourist photos. We further devise a set-to-set matching-based likelihood that treats each "burst" of photos from the same camera as a single observation, thereby better accommodating images that may not contain particularly distinctive scenes. Our experiments with two large datasets of major tourist cities clearly demonstrate the approach's advantages over methods that recognize each photo individually, as well as a simpler HMM baseline that lacks the proposed burst-based observation model.

Chen, Guangliang; Maggioni, Mauro; , "Multiscale geometric and spectral analysis of plane arrangements," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2825-2832, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995666
Abstract: Modeling data by multiple low-dimensional planes is an important problem in many applications such as computer vision and pattern recognition. In the most general setting where only coordinates of the data are given, the problem asks to determine the optimal model parameters (i.e., number of planes and their dimensions), estimate the model planes, and cluster the data accordingly. Though many algorithms have been proposed, most of them need to assume prior knowledge of the model parameters and thus address only the last two components of the problem. In this paper we propose an efficient algorithm based on multiscale SVD analysis and spectral methods to tackle the problem in full generality. We also demonstrate its state-of-the-art performance on both synthetic and real data.
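The role of the SVD at a given scale can be illustrated on a toy noisy plane in R^3 (the energy threshold and neighborhood radius below are illustrative assumptions, not the paper's multiscale procedure):

```python
import numpy as np

def local_dimension(points, center, radius, energy=0.95):
    """Estimate the intrinsic dimension near `center` by SVD of the
    neighbors within `radius`: count how many singular directions are
    needed to capture the given fraction of local variance."""
    nbrs = points[np.linalg.norm(points - center, axis=1) < radius]
    _, s, _ = np.linalg.svd(nbrs - nbrs.mean(0), full_matrices=False)
    var = s ** 2 / (s ** 2).sum()
    return int(np.searchsorted(np.cumsum(var), energy) + 1)

rng = np.random.default_rng(1)
u = rng.uniform(-1, 1, size=(500, 2))            # planar coordinates
plane3d = np.c_[u, 0.01 * rng.normal(size=500)]  # noisy plane in R^3
dim = local_dimension(plane3d, plane3d[0], radius=0.8)
```

At a scale large enough to average out the noise but small enough to stay on one plane, the estimate recovers the plane's dimension.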

Park, Sung Won; Savvides, Marios; , "Multifactor analysis based on factor-dependent geometry," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2817-2824, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995397
Abstract: This paper proposes a novel method that preserves the geometrical structure created by variation of multiple factors in the analysis of multiple factor models, i.e., multifactor analysis. We use factor-dependent submanifolds as constituent elements of the factor-dependent geometry in a multiple factor framework. In this paper, a submanifold is defined as some subset of a manifold in the data space, and factor-dependent submanifolds are defined as the submanifolds created for each factor by varying only this factor. We show that MPCA is formulated using factor-dependent submanifolds, as is our proposed method. We show, however, that MPCA loses the original shapes of these submanifolds because MPCA's parameterization is based on averaging the shapes of the factor-dependent submanifolds for each factor. On the other hand, our proposed multifactor analysis preserves the shapes of individual factor-dependent submanifolds in low-dimensional spaces. Because the parameters obtained by our method do not lose their structures, our method, unlike MPCA, sufficiently covers the original factor-dependent submanifolds. As a result of this sufficient coverage, our method is appropriate for accurate classification of each sample.

Zhao, Ji; Ma, Jiayi; Tian, Jinwen; Ma, Jie; Zhang, Dazhi; , "A robust method for vector field learning with application to mismatch removing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2977-2984, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995336
Abstract: We propose a method for vector field learning with outliers, called vector field consensus (VFC). It can distinguish inliers from outliers and simultaneously learn a vector field that fits the inliers. A prior is imposed to enforce the smoothness of the field, based on Tikhonov regularization in a vector-valued reproducing kernel Hilbert space. Under a Bayesian framework, we associate each sample with a latent variable indicating whether it is an inlier, then formulate the problem as a maximum a posteriori (MAP) problem and solve it with the Expectation-Maximization (EM) algorithm. The proposed method has two characteristics: 1) it is robust to outliers, tolerating 90% outliers or even more, and 2) it is computationally efficient. As an application, we apply VFC to the problem of mismatch removal. The results demonstrate that our method outperforms many state-of-the-art methods, and it is very robust.
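A rough EM-style sketch of the inlier/outlier alternation described above, with a linear field and a fixed outlier likelihood standing in for the paper's kernel-based field and full model (all parameter values here are assumptions):

```python
import numpy as np

def robust_field_fit(X, Y, n_iter=10, sigma2=0.05, outlier_lik=0.1):
    """Alternate between fitting a smooth (here: affine) field to
    responsibility-weighted samples (M-step) and updating each sample's
    inlier probability from its residual (E-step)."""
    gamma = np.full(len(X), 0.9)        # inlier responsibilities
    Xh = np.c_[X, np.ones(len(X))]      # homogeneous coordinates
    for _ in range(n_iter):
        W = np.diag(gamma)
        # Weighted least squares for the affine field parameters.
        A = np.linalg.solve(Xh.T @ W @ Xh + 1e-6 * np.eye(Xh.shape[1]),
                            Xh.T @ W @ Y)
        r2 = ((Y - Xh @ A) ** 2).sum(1)
        lik = np.exp(-r2 / (2 * sigma2)) / (2 * np.pi * sigma2)
        gamma = lik / (lik + outlier_lik)
    return A, gamma

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))
Y = X @ np.array([[1.0, 0.2], [-0.2, 1.0]])  # smooth (linear) field
Y[:20] = rng.uniform(-3, 3, size=(20, 2))    # 20% gross outliers
A, gamma = robust_field_fit(X, Y)
```

The responsibilities separate the contaminated first 20 samples from the 80 field-consistent ones.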

Wang, Meng; Wang, Xiaogang; , "Automatic adaptation of a generic pedestrian detector to a specific traffic scene," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3401-3408, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995698
Abstract: In recent years significant progress has been made in learning generic pedestrian detectors from manually labeled large-scale training sets. However, when a generic pedestrian detector is applied to a specific scene where the testing data does not match the training data because of variations in viewpoints, resolutions, illuminations and backgrounds, its accuracy may decrease greatly. In this paper, we propose a new framework for adapting a pre-trained generic pedestrian detector to a specific traffic scene by automatically selecting both confident positive and negative examples from the target scene to re-train the detector iteratively. An important feature of the proposed framework is to utilize models of vehicle and pedestrian paths, learned in an unsupervised manner, together with multiple other cues such as locations, sizes, appearance and motions, to select new training samples. This scene-structure information increases the reliability of the selected samples and is complementary to the appearance-based detector, yet it was not well explored in previous studies. In order to further improve the reliability of the selected samples, outliers are removed through multiple hierarchical clustering steps. The effectiveness of the different cues and clustering steps is evaluated through experiments. The proposed approach significantly improves the accuracy of the generic pedestrian detector and also outperforms a scene-specific detector retrained using background subtraction. Its results are comparable with those of a detector trained using a large number of manually labeled frames from the target scene.

Dinh, Thang Ba; Vo, Nam; Medioni, Gerard; , "Context tracker: Exploring supporters and distracters in unconstrained environments," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1177-1184, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995733
Abstract: Visual tracking in unconstrained environments is very challenging due to several sources of variation, such as changes in appearance, varying lighting conditions, cluttered backgrounds, and frame cuts. A major factor causing tracking failure is the emergence of regions with a similar appearance to the target. It is even more challenging when the target leaves the field of view (FoV), leading the tracker to follow another similar object and fail to reacquire the right target when it reappears. This paper presents a method to address this problem by exploiting context on-the-fly in two terms: distracters and supporters. Both are automatically explored using a sequential randomized forest, an online template-based appearance model, and local features. Distracters are regions that have a similar appearance to the target and consistently co-occur with it with a high confidence score. The tracker must keep tracking these distracters to avoid drifting. Supporters, on the other hand, are local key-points around the target with consistent co-occurrence and motion correlation over a short time span. They play an important role in verifying the genuine target. Extensive experiments on challenging real-world video sequences show the tracking improvement when using this context information. Comparisons with several state-of-the-art approaches are also provided.

Cho, Taeg Sang; Paris, Sylvain; Horn, Berthold K. P.; Freeman, William T.; , "Blur kernel estimation using the radon transform," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.241-248, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995479
Abstract: Camera shake is a common source of degradation in photographs. Restoring blurred pictures is challenging because both the blur kernel and the sharp image are unknown, which makes this problem severely underconstrained. In this work, we estimate camera shake by analyzing edges in the image, effectively constructing the Radon transform of the kernel. Building upon this result, we describe two algorithms for estimating spatially invariant blur kernels. In the first method, we directly invert the transform, which is computationally efficient since it is not necessary to also estimate the latent sharp image. This approach is well suited for scenes with a diversity of edges, such as man-made environments. In the second method, we incorporate the Radon transform within the MAP estimation framework to jointly estimate the kernel and the image. While more expensive, this algorithm performs well on a broader variety of scenes, even when fewer edges can be observed. Our experiments show that our algorithms achieve comparable results to the state of the art in general and produce superior outputs on man-made scenes and photos degraded by a small kernel.
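As a rough illustration of the first step, the Radon transform of a kernel can be approximated by rotating the kernel and summing along one axis. This is a minimal sketch, not the authors' implementation; `radon_projections` and the toy motion-blur kernel are made up for the example:

```python
import numpy as np
from scipy.ndimage import rotate

def radon_projections(kernel, angles_deg):
    """Crude Radon transform of a 2D blur kernel: for each angle, rotate the
    kernel and sum along the rows, giving one 1D projection per angle."""
    return np.array([rotate(kernel, a, reshape=False, order=1).sum(axis=0)
                     for a in angles_deg])

# Toy horizontal motion-blur kernel (a 5-pixel streak whose mass sums to 1).
k = np.zeros((9, 9))
k[4, 2:7] = 0.2
proj = radon_projections(k, [0.0, 90.0])
```

Inverting a sufficient set of such projections recovers the kernel; the paper instead builds the projections directly from blurred edge profiles observed in the image.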

Schmidt, Uwe; Schelten, Kevin; Roth, Stefan; , "Bayesian deblurring with integrated noise estimation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2625-2632, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995653
Abstract: Conventional non-blind image deblurring algorithms involve natural image priors and maximum a-posteriori (MAP) estimation. As a consequence of MAP estimation, separate pre-processing steps such as noise estimation and training of the regularization parameter are necessary to avoid user interaction. Moreover, MAP estimates involving standard natural image priors have been found lacking in terms of restoration performance. To address these issues we introduce an integrated Bayesian framework that unifies non-blind deblurring and noise estimation, thus freeing the user from tediously pre-determining a noise level. A sampling-based technique allows us to integrate out the unknown noise level and to perform deblurring using the Bayesian minimum mean squared error (MMSE) estimate, which requires no regularization parameter and yields higher performance than MAP estimates when combined with a learned high-order image prior. A quantitative evaluation demonstrates state-of-the-art results for both non-blind deblurring and noise estimation.

Le, Quoc V.; Zou, Will Y.; Yeung, Serena Y.; Ng, Andrew Y.; , "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3361-3368, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995496
Abstract: Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previously published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8% respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, will also be discussed. You can download our code and learned spatio-temporal features here:∼wzou/

Zhang, Yimeng; Jia, Zhaoyin; Chen, Tsuhan; , "Image retrieval with geometry-preserving visual phrases," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.809-816, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995528
Abstract: The most popular approach to large-scale image retrieval is based on the bag-of-visual-words (BoV) representation of images. Spatial information is usually re-introduced as a post-processing step to re-rank the retrieved images, through a spatial verification like RANSAC. Since spatial verification techniques are computationally expensive, they can be applied only to the top images in the initial ranking. In this paper, we propose an approach that encodes more spatial information into the BoV representation and that is efficient enough to be applied to large-scale databases. Other works pursuing the same purpose have proposed exploring word co-occurrences in neighborhood areas. Our approach encodes more spatial information through geometry-preserving visual phrases (GVP). In addition to co-occurrences, the GVP method also captures the local and long-range spatial layouts of the words. Our GVP-based search algorithm adds little memory usage or computational time compared to the BoV method. Moreover, we show that our approach can also be integrated into the min-hash method to improve its retrieval accuracy. Experimental results on the Oxford 5K and Flickr 1M datasets show that our approach outperforms the BoV method even when followed by RANSAC verification.
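A toy, translation-only version of the idea can be written with hypothetical names (`gvp_score`, a 10-pixel offset cell): every pair of identical visual words votes for the quantized offset between its locations, so only geometrically consistent co-occurrences accumulate in a single bin:

```python
import numpy as np
from collections import defaultdict

def gvp_score(words_q, pts_q, words_db, pts_db, cell=10.0):
    """Toy geometry-preserving matching score (translation only): identical
    visual words vote for the quantized offset between their locations; the
    score is the size of the largest geometrically consistent group, instead
    of the raw bag-of-words co-occurrence count."""
    index = defaultdict(list)
    for w, p in zip(words_db, pts_db):
        index[w].append(np.asarray(p, dtype=float))
    votes = defaultdict(int)
    for w, p in zip(words_q, pts_q):
        for q in index[w]:
            off = tuple(np.floor((q - np.asarray(p)) / cell).astype(int))
            votes[off] += 1
    return max(votes.values(), default=0)
```

A plain BoV count cannot distinguish a consistent spatial layout from a scrambled one with the same words; the offset voting can.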

Sun, Jian; Tappen, Marshall F.; , "Learning non-local range Markov Random field for image restoration," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2745-2752, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995520
Abstract: In this paper, we design a novel MRF framework called the Non-Local Range Markov Random Field (NLR-MRF). The local spatial range of a clique in a traditional MRF is extended to a non-local range, defined over a local patch and its similar patches within a non-local window. The traditional local spatial filter is accordingly extended to a non-local range filter that convolves an image over the non-local ranges of pixels. In this framework, we propose a gradient-based discriminative learning method to learn the potential functions and the non-local range filter bank. Because the gradients of the loss function with respect to the model parameters are explicitly computed, efficient gradient-based optimization methods can be utilized to train the proposed model. We implement this framework for image denoising and inpainting; the results show that the learned NLR-MRF model significantly outperforms traditional MRF models and produces state-of-the-art results.

Johnson, Sam; Everingham, Mark; , "Learning effective human pose estimation from inaccurate annotation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1465-1472, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995318
Abstract: The task of 2-D articulated human pose estimation in natural images is extremely challenging due to the high level of variation in human appearance. These variations arise from different clothing, anatomy, imaging conditions and the large number of poses it is possible for a human body to take. Recent work has shown state-of-the-art results by partitioning the pose space and using strong nonlinear classifiers such that the pose dependence and multi-modal nature of body part appearance can be captured. We propose to extend these methods to handle much larger quantities of training data, an order of magnitude larger than current datasets, and show how to utilize Amazon Mechanical Turk and a latent annotation update scheme to achieve high quality annotations at low cost. We demonstrate a significant increase in pose estimation accuracy, while simultaneously reducing computational expense by a factor of 10, and contribute a dataset of 10,000 highly articulated poses.

Grundmann, Matthias; Kwatra, Vivek; Essa, Irfan; , "Auto-directed video stabilization with robust L1 optimal camera paths," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.225-232, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995525
Abstract: We present a novel algorithm for automatically applying constrainable, L1-optimal camera paths to generate stabilized videos by removing undesired motions. Our goal is to compute camera paths that are composed of constant, linear and parabolic segments mimicking the camera motions employed by professional cinematographers. To this end, our algorithm is based on a linear programming framework to minimize the first, second, and third derivatives of the resulting camera path. Our method allows for video stabilization beyond the conventional filtering of camera paths that only suppresses high frequency jitter. We incorporate additional constraints on the path of the camera directly in our algorithm, allowing for stabilized and retargeted videos. Our approach accomplishes this without the need of user interaction or costly 3D reconstruction of the scene, and works as a post-process for videos from any camera or from an online source.
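The linear program described above can be sketched for a scalar (1D, translation-only) path. This is a simplified hypothetical version using `scipy.optimize.linprog` with made-up derivative weights, not the production algorithm:

```python
import numpy as np
from scipy.optimize import linprog

def stabilize_path(c, radius=0.5, w=(10.0, 1.0, 100.0)):
    """L1-optimal smoothing of a 1D camera path c: minimize weighted L1 norms
    of the 1st/2nd/3rd derivatives of the new path p, subject to the
    crop-window proximity constraint |p_t - c_t| <= radius."""
    n = len(c)
    D1 = np.diff(np.eye(n), 1, axis=0)   # first-difference operator, (n-1, n)
    D2 = np.diff(np.eye(n), 2, axis=0)
    D3 = np.diff(np.eye(n), 3, axis=0)
    sizes = [D.shape[0] for D in (D1, D2, D3)]
    m = sum(sizes)
    # Variables: [p (n values), e1, e2, e3 (slacks bounding |D p|)].
    cost = np.concatenate([np.zeros(n),
                           np.full(sizes[0], w[0]),
                           np.full(sizes[1], w[1]),
                           np.full(sizes[2], w[2])])
    rows, rhs, off = [], [], n
    for D, s in zip((D1, D2, D3), sizes):
        E = np.zeros((s, n + m)); E[:, off:off + s] = np.eye(s)
        Dp = np.zeros((s, n + m)); Dp[:, :n] = D
        rows += [Dp - E, -Dp - E]        # encodes  D p <= e  and  -D p <= e
        rhs += [np.zeros(s), np.zeros(s)]
        off += s
    bounds = [(ci - radius, ci + radius) for ci in c] + [(0, None)] * m
    res = linprog(cost, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=bounds, method="highs")
    return res.x[:n]
```

The slack variables bound the absolute derivatives, the standard trick for minimizing an L1 norm with a linear program; the bounds on p encode the crop-window inclusion constraint.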

Chatterjee, Priyam; Joshi, Neel; Kang, Sing Bing; Matsushita, Yasuyuki; , "Noise suppression in low-light images through joint denoising and demosaicing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.321-328, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995371
Abstract: We address the effects of noise in low-light images in this paper. Color images are usually captured by a sensor with a color filter array (CFA). This requires a demosaicing process to generate a full color image. The captured images typically have low signal-to-noise ratio, and the demosaicing step further corrupts the image, which we show to be the leading cause of visually objectionable random noise patterns (splotches). To avoid this problem, we propose a combined framework of denoising and demosaicing, where we use information about the image inferred in the denoising step to perform demosaicing. Our experiments show that such a framework results in sharper low-light images that are devoid of splotches and other noise artifacts.

He, Kaiming; Rhemann, Christoph; Rother, Carsten; Tang, Xiaoou; Sun, Jian; , "A global sampling method for alpha matting," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2049-2056, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995495
Abstract: Alpha matting refers to the problem of softly extracting the foreground from an image. Given a trimap (specifying known foreground/background and unknown pixels), a straightforward way to compute the alpha value is to sample some known foreground and background colors for each unknown pixel. Existing sampling-based matting methods often collect samples near the unknown pixels only. They fail if good samples cannot be found nearby. In this paper, we propose a global sampling method that uses all samples available in the image. Our global sample set avoids missing good samples. A simple but effective cost function is defined to tackle the ambiguity in the sample selection process. To handle the computational complexity introduced by the large number of samples, we pose the sampling task as a correspondence problem. The correspondence search is efficiently achieved by generalizing a randomized algorithm previously designed for patch matching [3]. A variety of experiments show that our global sampling method produces both visually and quantitatively high-quality matting results.
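The per-pixel sample selection can be illustrated with a brute-force miniature; `estimate_alpha` is a hypothetical name, and the exhaustive double loop stands in for the paper's randomized PatchMatch-style search:

```python
import numpy as np

def estimate_alpha(I, fg_samples, bg_samples, eps=1e-8):
    """Brute-force version of global sampling: try every (F, B) pair from the
    global foreground/background sample sets, keep the pair whose linear blend
    best reconstructs the pixel color I, and return that pair's alpha."""
    best_cost, best_alpha = np.inf, 0.0
    for F in fg_samples:
        for B in bg_samples:
            d = F - B
            # Alpha that best explains I as alpha*F + (1-alpha)*B.
            alpha = np.clip(np.dot(I - B, d) / (np.dot(d, d) + eps), 0.0, 1.0)
            cost = np.linalg.norm(I - (alpha * F + (1 - alpha) * B))
            if cost < best_cost:
                best_cost, best_alpha = cost, alpha
    return best_alpha
```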

Lin, Yuanqing; Lv, Fengjun; Zhu, Shenghuo; Yang, Ming; Cour, Timothee; Yu, Kai; Cao, Liangliang; Huang, Thomas; , "Large-scale image classification: Fast feature extraction and SVM training," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1689-1696, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995477
Abstract: Most research efforts on image classification so far have been focused on medium-scale datasets, which are often defined as datasets that can fit into the memory of a desktop (typically 4G∼48G). There are two main reasons for the limited effort on large-scale image classification. First, until the emergence of the ImageNet dataset, there was almost no publicly available large-scale benchmark data for image classification. This is mostly because class labels are expensive to obtain. Second, large-scale classification is hard because it poses more challenges than its medium-scale counterpart. A key challenge is how to achieve efficiency in both feature extraction and classifier training without compromising performance. This paper shows how we address this challenge using the ImageNet dataset as an example. For feature extraction, we develop a Hadoop scheme that performs feature extraction in parallel using hundreds of mappers. This allows us to extract fairly sophisticated features (with dimensions in the hundreds of thousands) on 1.2 million images within one day. For SVM training, we develop a parallel averaging stochastic gradient descent (ASGD) algorithm for training one-against-all 1000-class SVM classifiers. The ASGD algorithm is capable of dealing with terabytes of training data and converges very fast: typically 5 epochs are sufficient. As a result, we achieve state-of-the-art performance on ImageNet 1000-class classification, i.e., 52.9% in classification accuracy and 71.8% in top-5 hit rate.
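The averaging trick for one binary subproblem can be sketched as follows; `asgd_svm`, the Pegasos-style step size and the hyper-parameters are illustrative choices, not the paper's exact algorithm (which additionally parallelizes across machines and classes):

```python
import numpy as np

def asgd_svm(X, y, epochs=5, lam=1e-2):
    """Averaged SGD for a binary linear SVM (hinge loss + L2 regularization),
    the building block of one-vs-all training. Returns the running average of
    the SGD iterates, which is typically much more stable than the last one."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # Pegasos-style step size
            margin = y[i] * (X[i] @ w)
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            w = w - eta * grad
            w_avg += (w - w_avg) / t              # incremental mean of iterates
    return w_avg
```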

Li, Xiong; Lee, Tai Sing; Liu, Yuncai; , "Hybrid generative-discriminative classification using posterior divergence," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2713-2720, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995584
Abstract: Integrating generative models and discriminative models in a hybrid scheme has shown some success in recognition tasks. In such a scheme, generative models are used to derive feature maps that output a set of fixed-length features, which discriminative models then use to perform classification. In this paper, we present a method, called posterior divergence, to derive feature maps from the log-likelihood function implied in the incremental expectation-maximization algorithm. These feature maps evaluate a sample by three complementary measures: (1) how much the sample affects the model; (2) how well the sample fits the model; and (3) how uncertain the fit is. We prove that the linear classification error rate using the outputs of the derived feature maps is at least as low as that of plug-in estimation. We present efficient algorithms for computing these feature maps for semi-supervised and supervised learning. We evaluate the proposed method on three typical applications, i.e. scene recognition, face and non-face classification, and protein sequence analysis, and demonstrate improvements over related methods.

Aghazadeh, Omid; Sullivan, Josephine; Carlsson, Stefan; , "Novelty detection from an ego-centric perspective," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3297-3304, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995731
Abstract: This paper demonstrates a system for the automatic extraction of novelty in images captured from a small video camera attached to a subject's chest, replicating his visual perspective, while performing activities which are repeated daily. Novelty is detected when a (sub)sequence cannot be registered to previously stored sequences captured while performing the same daily activity. Sequence registration is performed by measuring appearance and geometric similarity of individual frames and exploiting the invariant temporal order of the activity. Experimental results demonstrate that this is a robust way to detect novelties induced by variations in the wearer's ego-motion such as stopping and talking to a person. This is an essentially new and generic way of automatically extracting information of interest to the camera wearer and can be used as input to a system for life logging or memory support.

Reshetouski, Ilya; Manakov, Alkhazur; Seidel, Hans-Peter; Ihrke, Ivo; , "Three-dimensional kaleidoscopic imaging," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.353-360, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995579
Abstract: We introduce three-dimensional kaleidoscopic imaging, a promising alternative for recording multi-view imagery. The main limitation of multi-view reconstruction techniques is the limited number of views available from multi-camera systems, especially for dynamic scenes. Our new system is based on imaging an object inside a kaleidoscopic mirror system. We show that this approach can generate a large number of high-quality views well distributed over the hemisphere surrounding the object in a single shot. In comparison to existing multi-view systems, our method offers a number of advantages: it is possible to operate with a single camera, the individual views are perfectly synchronized, and they have the same radiometric and colorimetric properties. We describe the setup theoretically and provide methods for a practical implementation. An important goal of our techniques is to enable interfacing with standard multi-view algorithms for further processing.

Wang, Jun; Tan, Ying; , "Efficient Euclidean distance transform using perpendicular bisector segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1625-1632, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995644
Abstract: In this paper, we propose an efficient algorithm for computing the Euclidean distance transform of a two-dimensional binary image, called PBEDT (Perpendicular Bisector Euclidean Distance Transform). PBEDT is a two-stage independent-scan algorithm. In the first stage, PBEDT computes the distance from each point to its closest feature point in the same column using a single column-wise scan. In the second stage, PBEDT computes the distance transform for each row using the intermediate results of the previous stage. Using the geometric properties of the perpendicular bisector, PBEDT directly computes the segmentation of each row by feature points, with each segment corresponding to one feature point. Furthermore, by using integer arithmetic to avoid time-consuming floating-point operations, PBEDT still achieves exact results. All these methods reduce the computational complexity significantly. Consequently, an efficient and exact linear-time Euclidean distance transform algorithm is implemented. A detailed comparison with state-of-the-art linear-time Euclidean distance transform algorithms shows that PBEDT is the fastest in most cases, and also the most stable with respect to image content.
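The two-stage independent-scan structure can be sketched as below. Stage 1 mirrors the paper's column scan; stage 2 here is a brute-force scan per row rather than the linear-time perpendicular-bisector segmentation, so this toy version is exact but O(n^3):

```python
import numpy as np

def edt_two_stage(feature):
    """Two-stage squared Euclidean distance transform of a binary image
    (True = feature point). Illustrates the independent-scan structure only;
    stage 2 is a brute-force per-row scan, not the bisector trick."""
    h, w = feature.shape
    INF = h + w                                   # larger than any real distance
    # Stage 1: per-column vertical distance to the nearest feature point.
    g = np.full((h, w), float(INF))
    for x in range(w):
        for y in range(h):                        # top-down sweep
            g[y, x] = 0 if feature[y, x] else (g[y - 1, x] + 1 if y else INF)
        for y in range(h - 2, -1, -1):            # bottom-up sweep
            g[y, x] = min(g[y, x], g[y + 1, x] + 1)
    # Stage 2: per-row combination over all columns.
    d = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            d[y, x] = min((x - xp) ** 2 + g[y, xp] ** 2 for xp in range(w))
    return d
```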

Micusik, Branislav; , "Relative pose problem for non-overlapping surveillance cameras with known gravity vector," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3105-3112, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995534
Abstract: We present a method for estimating the relative pose of two calibrated or uncalibrated non-overlapping surveillance cameras by observing a moving object. We show how to tackle the problem of missing point correspondences, heavily relied upon by SfM pipelines, and how to go beyond this basic paradigm. We relax the non-linear nature of the problem by accepting two assumptions which surveillance scenarios offer, i.e., the presence of a moving object and an easily estimable gravity vector. Under these assumptions we cast the problem as a Quadratic Eigenvalue Problem, offering an elegant way of treating the nonlinear monomials and delivering a quasi closed-form solution as a reliable starting point for further bundle adjustment. We are the first to bring a closed-form solution to this very practical problem arising in video surveillance. Results in different camera setups demonstrate the feasibility of the approach.

Liao, Miao; Huang, Xinyu; Yang, Ruigang; , "Interreflection removal for photometric stereo by using spectrum-dependent albedo," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.689-696, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995343
Abstract: We present a novel method that can separate m-bounced light and remove the interreflections in a photometric stereo setup. Under the assumption of a uniformly colored Lambertian surface, the intensity of a point in the scene is the sum of the 1-bounced through m-bounced light rays. By the law of diffuse reflection, whenever a light ray is bounced by the surface, its intensity is attenuated by a factor of the albedo ρ. This implies that the measured intensity value can be written as a polynomial function of ρ, with the intensity contribution of the m-bounced light rays expressed by the ρ^m term. Therefore, when we change the surface albedo, the intensity of the m-bounced light changes to the order of m. This non-linearity gives us the possibility to separate the m-bounced light. In practice, we illuminate the scene with different light colors to effectively simulate different surface albedos, since albedo is spectrum dependent. Once the m-bounced light rays are separated, we can run the photometric stereo algorithm on the 1-bounced light (direct lighting) images to produce the 3D shape without the impact of interreflections. Experiments show that we obtain significantly improved scene reconstruction with a minimum of two color images.
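For the simplest case m = 2, the separation reduces to solving a 2×2 linear system per pixel. The following is a hypothetical sketch (`separate_two_bounce` is a made-up name; the model I = a·ρ + b·ρ² truncates the polynomial at the second bounce):

```python
import numpy as np

def separate_two_bounce(I1, I2, rho1, rho2):
    """Separate 1- and 2-bounce components assuming I = a*rho + b*rho**2,
    given two images captured under effective albedos rho1 and rho2
    (simulated by two light colors). Returns per-pixel coefficients (a, b);
    a*rho is the direct term, b*rho**2 the interreflection term."""
    A = np.array([[rho1, rho1 ** 2],
                  [rho2, rho2 ** 2]])            # invertible when rho1 != rho2
    coeffs = np.linalg.solve(A, np.stack([I1, I2]).reshape(2, -1))
    a, b = coeffs.reshape(2, *np.shape(I1))
    return a, b
```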

Yang, Yi; Ramanan, Deva; , "Articulated pose estimation with flexible mixtures-of-parts," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1385-1392, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995741
Abstract: We describe a method for human pose estimation in static images based on a novel representation of part models. Notably, we do not use articulated limb parts, but rather capture orientation with a mixture of templates for each part. We describe a general, flexible mixture model for capturing contextual co-occurrence relations between parts, augmenting standard spring models that encode spatial relations. We show that such relations can capture notions of local rigidity. When co-occurrence and spatial relations are tree-structured, our model can be efficiently optimized with dynamic programming. We present experimental results on standard benchmarks for pose estimation that indicate our approach is the state-of-the-art system for pose estimation, outperforming past work by 50% while being orders of magnitude faster.

Ahmed, Mohamed Abdelaziz; Pitie, Francois; Kokaram, Anil; , "Reflection detection in image sequences," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.705-712, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995670
Abstract: Reflections in image sequences consist of several layers superimposed over each other. This phenomenon causes many image processing techniques, e.g. motion estimation and object recognition, to fail, as they assume the presence of only one layer at each examined site. This work presents an automated technique for detecting reflections in image sequences by analyzing motion trajectories of feature points. It models reflections as regions containing two different layers moving over each other. We present a strong detector based on combining a set of weak detectors. We use novel priors and generate sparse and dense detection maps; our results show a high detection rate while rejecting pathological motion and occlusion.

Li, Hanxi; Shen, Chunhua; Shi, Qinfeng; , "Real-time visual tracking using compressive sensing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1305-1312, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995483
Abstract: The ℓ1 tracker obtains robustness by seeking a sparse representation of the tracked object via ℓ1-norm minimization. However, the high computational complexity involved in the ℓ1 tracker may hamper its application in real-time processing scenarios. Here we propose Real-time Compressive Sensing Tracking (RTCST) by exploiting the signal recovery power of Compressive Sensing (CS). Dimensionality reduction and a customized Orthogonal Matching Pursuit (OMP) algorithm are adopted to accelerate the CS tracking. As a result, our algorithm achieves a real-time speed that is up to 5,000 times faster than that of the ℓ1 tracker. Meanwhile, RTCST still produces competitive (sometimes even superior) tracking accuracy compared to the ℓ1 tracker. Furthermore, for a stationary camera, a refined tracker is designed by integrating a CS-based background model (CSBM) into tracking. This CSBM-equipped tracker, termed RTCST-B, outperforms most state-of-the-art trackers in terms of both accuracy and robustness. Finally, our experimental results on various video sequences, verified by a new metric, Tracking Success Probability (TSP), demonstrate the excellence of the proposed algorithms.
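A generic OMP solver, the kind of building block referred to above, fits in a few lines; this is a standard textbook version, not the paper's customized variant:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily pick the dictionary atom most
    correlated with the current residual, then re-fit the coefficients by
    least squares on the chosen support. Recovers a k-sparse x with y ≈ A x
    (assumes k >= 1)."""
    support = []
    x = np.zeros(A.shape[1])
    residual = y.astype(float)
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))    # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x[support] = coef
    return x
```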

Ng, Bernard; Abugharbieh, Rafeef; , "Generalized group sparse classifiers with application in fMRI brain decoding," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1065-1071, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995651
Abstract: The perplexing effects of noise and high feature dimensionality greatly complicate functional magnetic resonance imaging (fMRI) classification. In this paper, we present a novel formulation for constructing "Generalized Group Sparse Classifiers" (GGSC) to alleviate these problems. In particular, we propose an extension of group LASSO that permits associations between features within (predefined) groups to be modeled. Integrating this new penalty into classifier learning enables the incorporation of additional prior information beyond group structure. In the context of fMRI, GGSC provides a flexible means of modeling how the brain is functionally organized into specialized modules (i.e. groups of voxels), with spatially proximal voxels often displaying similar levels of brain activity (i.e. feature associations). Applying GGSC to real fMRI data improved predictive performance over standard classifiers, while providing more neurologically interpretable classifier weight patterns. Our results thus demonstrate the importance of incorporating prior knowledge into classification problems.

An, Senjian; Peursum, Patrick; Liu, Wanquan; Venkatesh, Svetha; , "Efficient subwindow search with submodular score functions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1409-1416, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995355
Abstract: Subwindow search aims to find the optimal subimage which maximizes the score function of an object to be detected. After the development of the branch and bound (B&B) method called Efficient Subwindow Search (ESS), several algorithms (IESS [2], AESS [2], ARCS [3]) have been proposed to improve the performance of ESS. For n×n images, IESS's time complexity is bounded by O(n^3), which is better than ESS, but it is only applicable to linear score functions. Other work shows that Monge properties can hold in subwindow search and can be used to speed up the search to O(n^3), but only for certain types of score functions. In this paper we explore the connection between submodular functions and the Monge property, and prove that submodular score functions can be used to achieve O(n^3) time complexity for object detection. The time complexity can be further improved to sub-cubic by applying B&B methods on row intervals only, when the score function has a multivariate submodular bound function. Conditions for submodularity of common non-linear score functions and multivariate submodularity of their bound functions are also provided, and experiments compare the proposed approach against ESS and ARCS for object detection with some nonlinear score functions.

Zhao, Cong; Cham, Wai-Kuen; Wang, Xiaogang; , "Joint face alignment with a generic deformable face model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.561-568, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995381
Abstract: Since multiple images of an object are now easy to obtain, jointly aligning them is important for subsequent studies and a wide range of applications. In this paper, we propose a model-based approach to jointly align a batch of images of a face undergoing a variety of geometric and appearance variations. The principal idea is to model the non-rigid deformation of a face by means of a learned deformable model. Unlike existing model-based methods such as Active Appearance Models, the proposed one does not rely on an accurate appearance model built from a training set. We propose a robust fitting method that simultaneously identifies the appearance space of the input face and brings the images into alignment. Experiments conducted on images in the wild, in comparison with competing methods, demonstrate the effectiveness of our method in the joint alignment of complex objects like human faces.

Nguyen, Huu-Giao; Fablet, Ronan; Boucher, Jean-Marc; , "Visual textures as realizations of multivariate log-Gaussian Cox processes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2945-2952, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995340
Abstract: In this paper, we address invariant keypoint-based texture characterization and recognition. Viewing the keypoint sets associated with visual textures as realizations of point processes, we investigate probabilistic texture models based on multivariate log-Gaussian Cox processes. These models are parameterized by the covariance structure of the spatial patterns. Their implementation initially relies on the construction of a codebook of the visual signatures of keypoints. We discuss invariance properties of the proposed models for texture recognition applications and report a quantitative evaluation on three texture datasets, namely UIUC, KTH-TIPs and Brodatz. These experiments include a comparison of the performance reached using different methods for keypoint detection and characterization, and demonstrate the relevance of the proposed models w.r.t. state-of-the-art methods. We further discuss the main contributions of the proposed approach, including the key features of a statistical model and complexity aspects.

Sibiryakov, Alexander; , "Fast and high-performance template matching method," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1417-1424, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995391
Abstract: This paper proposes a new template matching method that is robust to outliers and fast enough for real-time operation. The template and image are densely transformed into binary code form by projecting and quantizing histograms of oriented gradients. The binary codes are matched by a generic method of robust similarity applicable to additive match measures, such as Lp and Hamming distances. The robust similarity map is computed efficiently via a proposed Inverted Location Index structure that stores pixel locations indexed by their values. The method is experimentally validated on large image patch datasets. Challenging applications, such as intra-category object detection, object tracking, and multimodal image matching, are demonstrated.

Zheng, Yinqiang; Sugimoto, Shigeki; Okutomi, Masatoshi; , "Deterministically maximizing feasible subsystem for robust model fitting with unit norm constraint," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1825-1832, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995640
Abstract: Many computer vision problems can be accounted for or properly approximated by linearity, and the robust model fitting (parameter estimation) problem in the presence of outliers amounts to finding the Maximum Feasible Subsystem (MaxFS) of a set of infeasible linear constraints. We propose a deterministic branch and bound method to solve the MaxFS problem with guaranteed global optimality. It can be used in a wide class of computer vision problems, in which the model variables are subject to the unit norm constraint. In contrast to the convex and concave relaxations in existing works, we introduce a piecewise linear relaxation to build very tight under- and over-estimators for square terms by partitioning variable bounds into smaller segments. Based on this novel relaxation technique, our branch and bound method can converge in a few iterations. For homogeneous linear systems, which correspond to some quasi-convex problems based on the L∞-norm, our method is non-iterative and certainly reaches the globally optimal solution at the root node by partitioning each variable range into two segments with equal length. Throughout this work, we rely on the so-called Big-M method, and successfully avoid potential numerical problems by exploiting proper parametrization and problem structure. Experimental results demonstrate the stability and efficiency of our proposed method.

Shi, Qinfeng; Eriksson, Anders; van den Hengel, Anton; Shen, Chunhua; , “Is face recognition really a Compressive Sensing problem?,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.553-560, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995556
Abstract: Compressive Sensing has become one of the standard methods of face recognition within the literature. We show, however, that the sparsity assumption which underpins much of this work is not supported by the data. This lack of sparsity in the data means that the compressive sensing approach cannot be guaranteed to recover the exact signal, and therefore that sparse approximations may not deliver the robustness or performance desired. In this vein we show that a simple ℓ2 approach to the face recognition problem is not only significantly more accurate than the state-of-the-art approach, it is also more robust, and much faster. These results are demonstrated on the publicly available YaleB and AR face datasets but have implications for the application of Compressive Sensing more broadly.

Saberian, Mohammad J.; Masnadi-Shirazi, Hamed; Vasconcelos, Nuno; , “TaylorBoost: First and second-order boosting algorithms with explicit margin control,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2929-2934, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995605
Abstract: A new family of boosting algorithms, denoted TaylorBoost, is proposed. It supports any combination of loss function and first or second order optimization, and includes classical algorithms such as AdaBoost, GradientBoost, or LogitBoost as special cases. Its restriction to the set of canonical losses makes it possible to have boosting algorithms with explicit margin control. A new large family of losses with this property, based on the set of cumulative distributions of zero mean random variables, is then proposed. A novel loss function in this family, the Laplace loss, is finally derived. The combination of this loss and second order TaylorBoost produces a boosting algorithm with explicit margin control.

Zhang, Ziming; Warrell, Jonathan; Torr, Philip H. S.; , “Proposal generation for object detection using cascaded ranking SVMs,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1497-1504, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995411
Abstract: Object recognition has made great strides recently. However, the best methods, such as those based on kernel-SVMs, are highly computationally intensive. The problem of how to accelerate the evaluation process without decreasing accuracy is thus of current interest. In this paper, we deal with this problem by using the idea of ranking. We propose a cascaded architecture which, using ranking SVMs, generates an ordered set of proposals for windows containing object instances. The top ranking windows may then be fed to a more complex detector. Our experiments demonstrate that our approach is robust, achieving higher overlap-recall values using fewer output proposals than the state-of-the-art. Our use of simple gradient features and linear convolution indicates that our method is also faster than the state-of-the-art.

Fan, Bin; Wu, Fuchao; Hu, Zhanyi; , “Aggregating gradient distributions into intensity orders: A novel local image descriptor,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2377-2384, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995385
Abstract: A novel local image descriptor is proposed in this paper, which combines intensity orders and gradient distributions in multiple support regions. The novelty lies in three aspects: 1) The gradient is calculated in a rotation invariant way in a given support region; 2) The rotation invariant gradients are adaptively pooled spatially based on intensity orders in order to encode spatial information; 3) Multiple support regions are used for constructing the descriptor, which further improves its discriminative ability. Therefore, the proposed descriptor encodes not only gradient information but also information about the relative relationships of intensities as well as spatial information. In addition, it is truly rotation invariant in theory without the need of computing a dominant orientation, which is a major error source of most existing methods, such as SIFT. Results on the standard Oxford dataset and 3D objects have shown a significant improvement over the state-of-the-art methods under various image transformations.

Karayev, Sergey; Fritz, Mario; Fidler, Sanja; Darrell, Trevor; , “A probabilistic model for recursive factorized image features,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.401-408, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995728
Abstract: Layered representations for object recognition are important due to their increased invariance, biological plausibility, and computational benefits. However, most existing approaches to hierarchical representations are strictly feedforward, and thus not well able to resolve local ambiguities. We propose a probabilistic model that learns and infers all layers of the hierarchy jointly. Specifically, we suggest a process of recursive probabilistic factorization, and present a novel generative model based on Latent Dirichlet Allocation to this end. The approach is tested on a standard recognition dataset, outperforming existing hierarchical approaches and demonstrating performance on par with current single-feature state-of-the-art models. We demonstrate two important properties of our proposed model: 1) adding an additional layer to the representation increases performance over the flat model; 2) a full Bayesian approach outperforms a feedforward implementation of the model.

Flach, Boris; Schlesinger, Dmitrij; , “Modelling composite shapes by Gibbs random fields,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2177-2182, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995726
Abstract: We analyse the potential of Gibbs Random Fields for shape prior modelling. We show that the expressive power of second order GRFs is already sufficient to express spatial relations between shape parts and simple shapes simultaneously. This makes it possible to model and recognise complex shapes as spatial compositions of simpler parts.

Li, Patrick S.; Givoni, Inmar E.; Frey, Brendan J.; , “Learning better image representations using ‘flobject analysis’,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2721-2728, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995649
Abstract: Unsupervised learning can be used to extract image representations that are useful for various and diverse vision tasks. After noticing that most biological vision systems for interpreting static images are trained using disparity information, we developed an analogous framework for unsupervised learning. The output of our method is a model that can generate a vector representation or descriptor from any static image. However, the model is trained using pairs of consecutive video frames, which are used to find representations that are consistent with optical flow-derived objects, or ‘flobjects’. To demonstrate the flobject analysis framework, we extend the latent Dirichlet allocation bag-of-words model to account for real-valued word-specific flow vectors and image-specific probabilistic associations between flow clusters and topics. We show that the static image representations extracted using our method can be used to achieve higher classification rates and better generalization than standard topic models, spatial pyramid matching and gist descriptors.

Li, Wen; Zhang, Jun; Dai, Qionghai; , “Exploring aligned complementary image pair for blind motion deblurring,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.273-280, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995351
Abstract: Camera shake during long exposure is ineluctable in light-limited situations, and results in a blurry observation. Recovering the blur kernel and the latent image from the blurred image is an inherently ill-posed problem. In this paper, we analyze the image acquisition model to capture two blurred images simultaneously with different blur kernels. The image pair is well-aligned and the kernels have a certain relationship. This strategy overcomes the challenge of blurry image alignment and reduces the ambiguity of blind deblurring. Thanks to the aided hardware, the algorithm based on such an image pair can give high-quality kernel estimation and image restoration. The experiments on both synthetic and real images demonstrate the effectiveness of our image capture strategy, and show that the kernel estimation is accurate enough to restore a superior latent image, which contains more details and fewer ringing artifacts.

Belhumeur, Peter N.; Jacobs, David W.; Kriegman, David J.; Kumar, Neeraj; , “Localizing parts of faces using a consensus of exemplars,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.545-552, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995602
Abstract: We present a novel approach to localizing parts in images of human faces. The approach combines the output of local detectors with a non-parametric set of global models for the part locations based on over one thousand hand-labeled exemplar images. By assuming that the global models generate the part locations as hidden variables, we derive a Bayesian objective function. This function is optimized using a consensus of models for these hidden variables. The resulting localizer handles a much wider range of expression, pose, lighting and occlusion than prior ones. We show excellent performance on a new dataset gathered from the internet and show that our localizer achieves state-of-the-art performance on the less challenging BioID dataset.

Fleck, Daniel; Duric, Zoran; , “Predicting image matching using affine distortion models,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.105-112, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995389
Abstract: We propose a novel method for predicting whether an image taken from a given location will match an existing set of images. This problem appears prominently in image based localization and augmented reality applications where new images are matched to an existing set to determine location or add virtual information into a scene. Our process generates a spatial coverage map showing the confidence that images taken at specific locations will match an existing image set. A new way to measure distortion between images using affine models is introduced. The distortion measure is combined with existing machine learning and structure from motion techniques to create a matching confidence predictor. The predictor is used to generate the spatial coverage map and also compute which images in the original set are redundant and can be removed. Results are presented showing the predictor is more accurate than previously published approaches.

Cui, Xinyi; Liu, Qingshan; Gao, Mingchen; Metaxas, Dimitris N.; , “Abnormal detection using interaction energy potentials,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3161-3167, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995558
Abstract: A new method is proposed to detect abnormal behaviors in human group activities. This approach effectively models group activities based on social behavior analysis. Different from previous work that uses independent local features, our method explores the relationships between the current behavior state of a subject and its actions. An interaction energy potential function is proposed to represent the current behavior state of a subject, and velocity is used as its actions. Our method does not depend on human detection or segmentation, so it is robust to detection errors. Instead, tracked spatio-temporal interest points are able to provide a good estimation of modeling group interaction. SVM is used to find abnormal events. We evaluate our algorithm on two datasets: UMN and BEHAVE. Experimental results show its promising performance against state-of-the-art methods.

Vijayanarasimhan, Sudheendra; Grauman, Kristen; , “Efficient region search for object detection,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1401-1408, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995545
Abstract: We propose a branch-and-cut strategy for efficient region-based object detection. Given an oversegmented image, our method determines the subset of spatially contiguous regions whose collective features will maximize a classifier's score. We formulate the objective as an instance of the prize-collecting Steiner tree problem, and show that for a family of additive classifiers this enables fast search for the optimal object region via a branch-and-cut algorithm. Unlike existing branch-and-bound detection methods designed for bounding boxes, our approach allows scoring of irregular shapes — which is especially critical for objects that do not conform to a rectangular window. We provide results on three challenging object detection datasets, and demonstrate the advantage of rapidly seeking best-scoring regions rather than subwindow rectangles.

Huang, Peng; Budd, Chris; Hilton, Adrian; , “Global temporal registration of multiple non-rigid surface sequences,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3473-3480, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995438
Abstract: In this paper we consider the problem of aligning multiple non-rigid surface mesh sequences into a single temporally consistent representation of the shape and motion. A global alignment graph structure is introduced which uses shape similarity to identify frames for inter-sequence registration. Graph optimisation is performed to minimise the total non-rigid deformation required to register the input sequences into a common structure. The resulting global alignment ensures that all input sequences are resampled with a common mesh structure which preserves the shape and temporal correspondence. Results demonstrate temporally consistent representation of several public databases of mesh sequences for multiple people performing a variety of motions with loose clothing and hair.

Morariu, Vlad I.; Davis, Larry S.; , “Multi-agent event recognition in structured scenarios,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3289-3296, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995386
Abstract: We present a framework for the automatic recognition of complex multi-agent events in settings where structure is imposed by rules that agents must follow while performing activities. Given semantic spatio-temporal descriptions of what generally happens (i.e., rules, event descriptions, physical constraints), and based on video analysis, we determine the events that occurred. Knowledge about spatio-temporal structure is encoded in first-order logic, using an approach based on Allen's Interval Logic, and robustness to low-level observation uncertainty is provided by Markov Logic Networks (MLN). Our main contribution is that we integrate interval-based temporal reasoning with probabilistic logical inference, relying on an efficient bottom-up grounding scheme to avoid combinatorial explosion. Applied to one-on-one basketball, our framework detects and tracks players, their hands and feet, and the ball, generates event observations from the resulting trajectories, and performs probabilistic logical inference to determine the most consistent sequence of events. We demonstrate our approach on 1hr (100,000 frames) of outdoor videos.

Zhang, Chunjie; Liu, Jing; Tian, Qi; Xu, Changsheng; Lu, Hanqing; Ma, Songde; , “Image classification by non-negative sparse coding, low-rank and sparse decomposition,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1673-1680, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995484
Abstract: We propose an image classification framework by leveraging the non-negative sparse coding, low-rank and sparse matrix decomposition techniques (LR-Sc+ SPM). First, we propose a new non-negative sparse coding along with max pooling and spatial pyramid matching method (Sc+ SPM) to extract local features' information in order to represent images, where non-negative sparse coding is used to encode local features. Max pooling along with spatial pyramid matching (SPM) is then utilized to get the feature vectors to represent images. Second, motivated by the observation that images of the same class often contain correlated (or common) items and specific (or noisy) items, we propose to leverage the low-rank and sparse matrix recovery technique to decompose the feature vectors of images per class into a low-rank matrix and a sparse error matrix. To incorporate the common and specific attributes into the image representation, we still adopt the idea of sparse coding to recode the Sc+ SPM representation of each image. In particular, we collect the columns of both matrices as the bases and use the coding parameters as the updated image representation by learning them through the locality-constrained linear coding (LLC). Finally, a linear SVM classifier is leveraged for the final classification. Experimental results show that the proposed method achieves or outperforms the state-of-the-art results on several benchmarks.
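The max-pooling step this abstract builds on is a simple operation: each local feature is encoded as a sparse vector over the codebook, and a region (or spatial-pyramid cell) is summarized by the element-wise maximum over those vectors. A minimal sketch with illustrative data, not the paper's pipeline:

```python
def max_pool(codes):
    """Element-wise max over a list of equal-length sparse code vectors."""
    return [max(col) for col in zip(*codes)]

codes = [
    [0.0, 0.8, 0.0, 0.1],   # sparse code of local feature 1
    [0.5, 0.0, 0.0, 0.3],   # sparse code of local feature 2
    [0.0, 0.2, 0.9, 0.0],   # sparse code of local feature 3
]
print(max_pool(codes))  # prints [0.5, 0.8, 0.9, 0.3]
```

In an SPM layout, this pooling would be repeated per pyramid cell and the cell vectors concatenated into the final image representation.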

Bychkovsky, Vladimir; Paris, Sylvain; Chan, Eric; Durand, Fredo; , “Learning photographic global tonal adjustment with a database of input / output image pairs,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.97-104, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995332
Abstract: Adjusting photographs to obtain compelling renditions requires skill and time. Even contrast and brightness adjustments are challenging because they require taking into account the image content. Photographers are also known for having different retouching preferences. As a result of this complexity, rule-based, one-size-fits-all automatic techniques often fail. This problem can greatly benefit from supervised machine learning but the lack of training data has impeded work in this area. Our first contribution is the creation of a high-quality reference dataset. We collected 5,000 photos, manually annotated them, and hired 5 trained photographers to retouch each picture. The result is a collection of 5 sets of 5,000 example input-output pairs that enable supervised learning. We first use this dataset to predict a user's adjustment from a large training set. We then show that our dataset and features enable accurate adjustment personalization using a carefully chosen set of training photos. Finally, we introduce difference learning: this method models and predicts differences between users. It frees the user from using predetermined photos for training. We show that difference learning enables accurate prediction using only a handful of examples.

He, Ran; Sun, Zhenan; Tan, Tieniu; Zheng, Wei-Shi; , “Recovery of corrupted low-rank matrices via half-quadratic based nonconvex minimization,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2889-2896, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995328
Abstract: Recovering arbitrarily corrupted low-rank matrices arises in many applications, including bioinformatic data analysis and visual tracking. The methods used involve minimizing a combination of nuclear norm and l1 norm. We show that by replacing the l1 norm on error items with nonconvex M-estimators, exact recovery of densely corrupted low-rank matrices is possible. The robustness of the proposed method is guaranteed by the M-estimator theory. The multiplicative form of half-quadratic optimization is used to simplify the nonconvex optimization problem so that it can be efficiently solved by an iterative regularization scheme. Simulation results corroborate our claims and demonstrate the efficiency of our proposed method under tough conditions.

Bi, Jinbo; Wu, Dijia; Lu, Le; Liu, Meizhu; Tao, Yimo; Wolf, Matthias; , “AdaBoost on low-rank PSD matrices for metric learning,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2617-2624, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995363
Abstract: The problem of learning a proper distance or similarity metric arises in many applications such as content-based image retrieval. In this work, we propose a boosting algorithm, MetricBoost, to learn the distance metric that preserves the proximity relationships among object triplets: object i is more similar to object j than to object k. MetricBoost constructs a positive semi-definite (PSD) matrix that parameterizes the distance metric by combining rank-one PSD matrices. Different options of weak models and combination coefficients are derived. Unlike existing proximity preserving metric learning which is generally not scalable, MetricBoost employs a bipartite strategy to dramatically reduce computation cost by decomposing proximity relationships over triplets into pair-wise constraints. MetricBoost outperforms the state-of-the-art on two real-world medical problems: 1. identifying and quantifying diffuse lung diseases; 2. colorectal polyp matching between different views, as well as on other benchmark datasets.
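The rank-one parameterization the abstract mentions can be made concrete: a PSD matrix M = sum_k w_k v_k v_k^T (w_k >= 0) gives the squared distance d(x, y)^2 = sum_k w_k ((x - y) . v_k)^2. The sketch below is a generic illustration of that parameterization with made-up weights and directions, not MetricBoost itself.

```python
def psd_distance_sq(x, y, weights, directions):
    """Squared distance under M = sum_k w_k v_k v_k^T, evaluated without
    ever forming M: project the difference onto each direction v_k."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(w * sum(d * v for d, v in zip(diff, vec)) ** 2
               for w, vec in zip(weights, directions))

# Illustrative triplet check: under this metric, xi is closer to xj than to xk.
xi, xj, xk = (0.0, 0.0), (1.0, 0.0), (0.0, 3.0)
w, V = [2.0, 3.0], [(1.0, 0.0), (0.0, 1.0)]
print(psd_distance_sq(xi, xj, w, V), psd_distance_sq(xi, xk, w, V))  # 2.0 27.0
```

Keeping M in this factored form is what lets a boosting scheme add one rank-one term per round while the matrix stays PSD by construction.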

Chen, Chia-Chih; Aggarwal, J. K.; , “Modeling human activities as speech,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3425-3432, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995555
Abstract: Human activity recognition and speech recognition appear to be two loosely related research areas. However, on careful thought, there are several analogies between activity and speech signals with regard to the way they are generated, propagated, and perceived. In this paper, we propose a novel action representation, the action spectrogram, which is inspired by a common spectrographic representation of speech. Different from a sound spectrogram, an action spectrogram is a space-time-frequency representation which characterizes the short-time spectral properties of body parts' movements. While the essence of the speech signal is the variation of air pressure in time, our method models activities as the likelihood time series of action-associated local interest patterns. This low-level process is realized by learning boosted window classifiers from spatially quantized spatio-temporal interest features. We have tested our algorithm on a variety of human activity datasets and achieved superior results.

Chan, Antoni B.; Dong, Daxiang; , “Generalized Gaussian process models,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2681-2688, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995688
Abstract: We propose a generalized Gaussian process model (GGPM), which is a unifying framework that encompasses many existing Gaussian process (GP) models, such as GP regression, classification, and counting. In the GGPM framework, the observation likelihood of the GP model is itself parameterized using the exponential family distribution. By deriving approximate inference algorithms for the generalized GP model, we are able to easily apply the same algorithm to all other GP models. Novel GP models are created by changing the parameterization of the likelihood function, which greatly simplifies their creation for task-specific output domains. We also derive a closed-form efficient Taylor approximation for inference on the model, and draw interesting connections with other model-specific closed-form approximations. Finally, using the GGPM, we create several new GP models and show their efficacy in building task-specific GP models for computer vision.
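For readers unfamiliar with the GP models the GGPM framework unifies, the simplest special case is standard GP regression with a Gaussian likelihood, whose posterior mean at test inputs is k_*^T (K + sigma^2 I)^{-1} y. A minimal sketch with illustrative hyperparameters; this is the textbook computation, not the paper's GGPM inference algorithm:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of GP regression with Gaussian observation noise."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    return k_star @ np.linalg.solve(K, y_train)

x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
mu = gp_posterior_mean(x, y, np.array([1.0]))
print(float(mu[0]))  # close to sin(1.0) ≈ 0.84
```

In GGPM terms, swapping this Gaussian likelihood for another exponential-family member (e.g. Poisson for counting) changes the model while the inference machinery stays shared.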

Susskind, Joshua; Hinton, Geoffrey; Memisevic, Roland; Pollefeys, Marc; , “Modeling the joint density of two images under a variety of transformations,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2793-2800, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995541
Abstract: We describe a generative model of the relationship between two images. The model is defined as a factored three-way Boltzmann machine, in which hidden variables collaborate to define the joint correlation matrix for image pairs. Modeling the joint distribution over pairs makes it possible to efficiently match images that are the same according to a learned measure of similarity. We apply the model to several face matching tasks, and show that it learns to represent the input images using task-specific basis functions. Matching performance is superior to previous similar generative models, including recent conditional models of transformations. We also show that the model can be used as a plug-in matching score to perform invariant classification.

Reynolds, Malcolm; Dobos, Jozef; Peel, Leto; Weyrich, Tim; Brostow, Gabriel J; , “Capturing Time-of-Flight data with confidence,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.945-952, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995550
Abstract: Time-of-Flight cameras provide high-frame-rate depth measurements within a limited range of distances. These readings can be extremely noisy and display unique errors, for instance, where scenes contain depth discontinuities or materials with low infrared reflectivity. Previous works have treated the amplitude of each Time-of-Flight sample as a measure of confidence. In this paper, we demonstrate the shortcomings of this common lone heuristic, and propose an improved per-pixel confidence measure using a Random Forest regressor trained with real-world data. Using an industrial laser scanner for ground truth acquisition, we evaluate our technique on data from two different Time-of-Flight cameras. We argue that an improved confidence measure leads to superior reconstructions in subsequent steps of traditional scan processing pipelines. At the same time, data with confidence reduces the need for point cloud smoothing and median filtering.

Zhang, Xu-Yao; Liu, Cheng-Lin; , “Style transfer matrix learning for writer adaptation,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.393-400, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995661
Abstract: In this paper, we propose a novel framework of style transfer matrix (STM) learning to reduce the writing style variation in handwriting recognition. After writer-specific style transfer learning, the data of different writers is projected onto a style-free space, where a writer independent classifier can yield high accuracy. We combine STM learning with a specific nearest prototype classifier: learning vector quantization (LVQ) with discriminative feature extraction (DFE), where both the prototypes and the subspace transformation matrix are learned via online discriminative learning. To adapt the basic classifier (trained with writer-independent data) to particular writers, we first propose two supervised models, one based on incremental learning and the other based on supervised STM learning. To overcome the lack of labeled samples for particular writers, we propose an unsupervised model to learn the STM using the self-taught strategy (also known as self-training). Experiments on a large-scale Chinese online handwriting database demonstrate that STM learning can reduce recognition errors significantly, and the unsupervised adaptation model performs even better than the supervised models.

Chen, David M.; Baatz, Georges; Koser, Kevin; Tsai, Sam S.; Vedantham, Ramakrishna; Pylvanainen, Timo; Roimela, Kimmo; Chen, Xin; Bach, Jeff; Pollefeys, Marc; Girod, Bernd; Grzeszczuk, Radek; , “City-scale landmark identification on mobile devices,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.737-744, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995610
Abstract: With recent advances in mobile computing, the demand for visual localization or landmark identification on mobile devices is gaining interest. We advance the state of the art in this area by fusing two popular representations of street-level image data — facade-aligned and viewpoint-aligned — and show that they contain complementary information that can be exploited to significantly improve the recall rates on the city scale. We also improve feature detection in low contrast parts of the street-level data, and discuss how to incorporate priors on a user's position (e.g. given by noisy GPS readings or network cells), which previous approaches often ignore. Finally, and maybe most importantly, we present our results according to a carefully designed, repeatable evaluation scheme and make publicly available a set of 1.7 million images with ground truth labels, geotags, and calibration data, as well as a difficult set of cell phone query images. We provide these resources as a benchmark to facilitate further research in the area.

Reddy, Dikpal; Veeraraghavan, Ashok; Chellappa, Rama; , “P2C2: Programmable pixel compressive camera for high speed imaging,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.329-336, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995542
Abstract: We describe an imaging architecture for compressive video sensing termed programmable pixel compressive camera (P2C2). P2C2 allows us to capture fast phenomena at frame rates higher than the camera sensor. In P2C2, each pixel has an independent shutter that is modulated at a rate higher than the camera frame-rate. The observed intensity at a pixel is an integration of the incoming light modulated by its specific shutter. We propose a reconstruction algorithm that uses the data from P2C2 along with additional priors about videos to perform temporal super-resolution. We model the spatial redundancy of videos using sparse representations and the temporal redundancy using brightness constancy constraints inferred via optical flow. We show that by modeling such spatio-temporal redundancies in a video volume, one can faithfully recover the underlying high-speed video frames from the observed low speed coded video. The imaging architecture and the reconstruction algorithm allow us to achieve temporal super-resolution without loss in spatial resolution. We implement a prototype of P2C2 using an LCOS modulator and recover several videos at 200 fps using a 25 fps camera.
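The per-pixel coded exposure the abstract describes can be simulated directly: each observed pixel is the sum over high-speed sub-frames of the scene intensity multiplied by that pixel's binary shutter value. A toy simulation on assumed 2x2 data, not the authors' implementation:

```python
def coded_exposure(frames, shutter):
    """frames: T sub-frames, each an H x W intensity grid; shutter: T binary
    masks of the same shape. Returns the single H x W coded observation that
    the sensor integrates over one camera frame."""
    h, w = len(frames[0]), len(frames[0][0])
    obs = [[0.0] * w for _ in range(h)]
    for frame, mask in zip(frames, shutter):
        for i in range(h):
            for j in range(w):
                obs[i][j] += mask[i][j] * frame[i][j]
    return obs

# Two high-speed sub-frames compressed into one camera frame; pixel (0,0)
# is open only during sub-frame 0, the other pixels during sub-frame 1.
frames = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
shutter = [[[1, 0], [0, 0]], [[0, 1], [1, 1]]]
print(coded_exposure(frames, shutter))  # prints [[1.0, 6.0], [7.0, 8.0]]
```

The reconstruction problem the paper solves is the inverse of this forward model: recover all T sub-frames from the coded observations, using sparsity and optical-flow priors to resolve the ambiguity.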

Rigamonti, Roberto; Brown, Matthew A.; Lepetit, Vincent; , "Are sparse representations really relevant for image classification?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1545-1552, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995313
Abstract: Recent years have seen an increasing interest in sparse representations for image classification and object recognition, probably motivated by evidence from the analysis of the primate visual cortex. It is still unclear, however, whether or not sparsity helps classification. In this paper we evaluate its impact on the recognition rate using a shallow modular architecture, adopting both standard filter banks and filter banks learned in an unsupervised way. In our experiments on the CIFAR-10 and on the Caltech-101 datasets, enforcing sparsity constraints actually does not improve recognition performance. This has an important practical impact in image descriptor design, as enforcing these constraints can have a heavy computational cost.

Hoai, Minh; Lan, Zhen-Zhong; De la Torre, Fernando; , "Joint segmentation and classification of human actions in video," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3265-3272, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995470
Abstract: Automatic video segmentation and action recognition have been long-standing problems in computer vision. Much work in the literature treats video segmentation and action recognition as two independent problems; while segmentation is often done without a temporal model of the activity, action recognition is usually performed on pre-segmented clips. In this paper we propose a novel method that avoids the limitations of the above approaches by jointly performing video segmentation and action recognition. Unlike standard approaches based on extensions of dynamic Bayesian networks, our method is based on a discriminative temporal extension of the spatial bag-of-words model that has been very popular in object recognition. The classification is performed robustly within a multi-class SVM framework whereas the inference over the segments is done efficiently with dynamic programming. Experimental results on honeybee, Weizmann, and Hollywood datasets illustrate the benefits of our approach compared to state-of-the-art methods.
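The joint inference described above, choosing segment boundaries so that the total classifier score is maximized via dynamic programming, can be illustrated with a toy sketch. The score table (which favors segments of length 3) is hypothetical; a real system would use multi-class SVM scores per candidate segment.

```python
import numpy as np

def best_segmentation(scores, max_len):
    """DP over cut points. scores[i][j] = classifier score of the segment
    covering frames i..j inclusive. Returns (total score, segment list)."""
    n = len(scores)
    best = np.full(n + 1, -np.inf)
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):                    # j = segment end (exclusive)
        for i in range(max(0, j - max_len), j):  # i = segment start
            s = best[i] + scores[i][j - 1]
            if s > best[j]:
                best[j], back[j] = s, i
    segs, j = [], n
    while j > 0:                                 # backtrack the chosen cuts
        segs.append((int(back[j]), int(j)))
        j = back[j]
    return float(best[n]), segs[::-1]

# Toy scores over 6 frames: only length-3 segments score 1.0.
scores = [[1.0 if j - i + 1 == 3 else 0.0 for j in range(6)] for i in range(6)]
total, segs = best_segmentation(scores, max_len=6)
```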

Del Pero, Luca; Guan, Jinyan; Brau, Ernesto; Schlecht, Joseph; Barnard, Kobus; , "Sampling bedrooms," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2009-2016, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995737
Abstract: We propose a top down approach for understanding indoor scenes such as bedrooms and living rooms. These environments typically have the Manhattan world property that many surfaces are parallel to three principal ones. Further, the 3D geometry of the room and objects within it can largely be approximated by non-overlapping simple structures such as single blocks (e.g. the room boundary), thin blocks (e.g. picture frames), and objects that are well modeled by single blocks (e.g. simple beds). We separately model the 3D geometry, the imaging process (camera parameters), and edge likelihood, to provide a generative statistical model for image data. We fit this model using data driven MCMC sampling. We combine reversible-jump Metropolis-Hastings samples for discrete changes in the model, such as the number of blocks, with stochastic dynamics to estimate continuous parameter values in a particular parameter space that includes block positions, block sizes, and camera parameters. We tested our approach on two datasets using room box pixel orientation. Despite using only bounding box geometry and, in particular, not training on appearance, our method achieves results approaching those of others. We also introduce a new evaluation method for this domain based on ground truth camera parameters, which we found to be more sensitive to the task of understanding scene geometry.

Teboul, Olivier; Kokkinos, Iasonas; Simon, Loic; Koutsourakis, Panagiotis; Paragios, Nikos; , "Shape grammar parsing via Reinforcement Learning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2273-2280, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995319
Abstract: We address shape grammar parsing for facade segmentation using Reinforcement Learning (RL). Shape parsing entails simultaneously optimizing the geometry and the topology (e.g. number of floors) of the facade, so as to optimize the fit of the predicted shape with the responses of pixel-level 'terminal detectors'. We formulate this problem in terms of a Hierarchical Markov Decision Process, by employing a recursive binary split grammar. This allows us to use RL to efficiently find the optimal parse of a given facade in terms of our shape grammar. Building on the RL paradigm, we exploit state aggregation to speedup computation, and introduce image-driven exploration in RL to accelerate convergence. We achieve state-of-the-art results on facade parsing, with a significant speed-up compared to existing methods, and substantial robustness to initial conditions. We demonstrate that the method can also be applied to interactive segmentation, and to a broad variety of architectural styles.

Ranzato, Marc'Aurelio; Susskind, Joshua; Mnih, Volodymyr; Hinton, Geoffrey; , "On deep generative models with applications to recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2857-2864, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995710
Abstract: The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using well-engineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic models directly on the raw pixel values has proved to be much more difficult and is typically only used for regularizing discriminative methods. In this work, we use one of the best pixel-level generative models of natural images, a gated MRF, as the lowest level of a deep belief network (DBN) that has several hidden layers. We show that the resulting DBN is very good at coping with occlusion when predicting expression categories from face images, and it can produce features that perform comparably to SIFT descriptors for discriminating different types of scene. The generative ability of the model also makes it easy to see what information is captured and what is lost at each level of representation.

Chen, Guang; Han, Tony X.; Lao, Shihong; , "Adapting an object detector by considering the worst case: A conservative approach," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1369-1376, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995362
Abstract: The performance of an offline-trained classifier can be improved on-site by adapting the classifier towards newly acquired data. However, the adaptation rate is a tuning parameter that affects the performance gain substantially. Poor selection of the adaptation rate may worsen the performance of the original classifier. To solve this problem, we propose a conservative model adaptation method that considers the worst case during the adaptation process. We first construct a random cover of the set of adaptation data from its partition. For each element in the cover (i.e. a portion of the whole adaptation data set), we define the cross-entropy error function in the form of logistic regression. The element in the cover with the maximum cross-entropy error corresponds to the worst case in the adaptation. Therefore we can convert conservative model adaptation into the classic min-max optimization problem: finding the adaptation parameters that minimize the maximum of the cross-entropy errors over the cover. Taking object detection as a testbed, we implement an adapted object detector based on binary classification. Under different adaptation scenarios and different datasets including PASCAL, ImageNet, INRIA, and TUD-Pedestrian, the proposed adaptation method achieves significant performance gain and compares favorably with the state-of-the-art adaptation method with a fine-tuned adaptation rate. Without the need to tune adaptation rates, the proposed conservative model adaptation method can be extended to other adaptive classification tasks.
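The min-max idea is straightforward to sketch: partition the adaptation data into a cover, and at each iteration take a gradient step on the cover element whose logistic-regression cross-entropy error is currently largest. The toy data, the three-way partition, and the step size below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_entropy(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return float(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

def grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

# Toy adaptation set, partitioned into a cover of three portions.
X = rng.normal(size=(90, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=90) > 0).astype(float)
cover = np.array_split(np.arange(90), 3)

w = np.zeros(5)
for _ in range(200):
    # Worst case: the cover element with maximum cross-entropy error.
    worst = max(cover, key=lambda idx: cross_entropy(w, X[idx], y[idx]))
    w -= 0.5 * grad(w, X[worst], y[worst])      # min-max subgradient step

worst_err = max(cross_entropy(w, X[idx], y[idx]) for idx in cover)
```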

Wang, Xiaogang; Liu, Ke; Tang, Xiaoou; , "Query-specific visual semantic spaces for web image re-ranking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.857-864, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995399
Abstract: Image re-ranking, as an effective way to improve the results of web-based image search, has been adopted by current commercial search engines. Given a query keyword, a pool of images is first retrieved by the search engine based on textual information. By asking the user to select a query image from the pool, the remaining images are re-ranked based on their visual similarities with the query image. A major challenge is that the similarities of visual features do not correlate well with images' semantic meanings, which reflect users' search intention. On the other hand, learning a universal visual semantic space to characterize highly diverse images from the web is difficult and inefficient. In this paper, we propose a novel image re-ranking framework, which automatically learns, offline, different visual semantic spaces for different query keywords through keyword expansions. The visual features of images are projected into their related visual semantic spaces to get semantic signatures. At the online stage, images are re-ranked by comparing their semantic signatures obtained from the visual semantic space specified by the query keyword. The new approach significantly improves both the accuracy and efficiency of image re-ranking. The original visual features of thousands of dimensions can be projected to semantic signatures as short as 25 dimensions. Experimental results show that a 20% to 35% relative improvement has been achieved on re-ranking precisions compared with the state-of-the-art methods.
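A sketch of the offline/online split: images are projected into a query-specific semantic space offline, and the online stage only compares short signatures. The 1000-D features, 25-D signatures, and the random projection standing in for a learned space are assumptions made purely to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for a learned query-specific semantic space: 1000-D -> 25-D.
P = rng.normal(size=(25, 1000))

pool = rng.normal(size=(50, 1000))    # images retrieved for the keyword
sigs = pool @ P.T                     # offline: precomputed 25-D signatures

query = pool[7] + 0.01 * rng.normal(size=1000)        # user picks image 7
q = P @ query                         # online: one projection ...
order = np.argsort(np.linalg.norm(sigs - q, axis=1))  # ... and 25-D distances
```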

Zhou, Bolei; Wang, Xiaogang; Tang, Xiaoou; , "Random field topic model for semantic region analysis in crowded scenes from tracklets," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3441-3448, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995459
Abstract: In this paper, a Random Field Topic (RFT) model is proposed for semantic region analysis from motions of objects in crowded scenes. Different from existing approaches that learn semantic regions either from optical flows or from complete trajectories, our model assumes that fragments of trajectories (called tracklets) are observed in crowded scenes. It advances the existing Latent Dirichlet Allocation topic model by integrating Markov random fields (MRF) as a prior to enforce spatial and temporal coherence between tracklets during the learning process. Two kinds of MRF, pairwise MRF and the forest of randomly spanning trees, are defined. Another contribution of this model is to include sources and sinks as a high-level semantic prior, which effectively improves the learning of semantic regions and the clustering of tracklets. Experiments on a large scale data set, which includes 40,000+ tracklets collected from the crowded New York Grand Central station, show that our model outperforms state-of-the-art methods both on qualitative results of learning semantic regions and on quantitative results of clustering tracklets.

Li, Bing; Xiong, Weihua; Hu, Weiming; Wu, Ou; , "Evaluating combinational color constancy methods on real-world images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1929-1936, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995615
Abstract: Light color estimation is crucial to the color constancy problem. Past decades have witnessed great progress in solving this problem. In contrast to traditional methods, many researchers have proposed a variety of combinational color constancy methods that apply different color constancy mathematical models to an image simultaneously and then produce a final estimate in diverse ways. Although many comprehensive evaluations or reviews of color constancy methods are available, few focus on combinational strategies. In this paper, we systematically survey some prevailing combinational strategies, divide them into three categories, and compare them qualitatively on three real-world image data sets in terms of the angular error and the perceptual Euclidean distance. The experimental results show that combinational strategies with a training procedure consistently produce better performance.
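The angular error referred to here is the standard color-constancy metric: the angle between the estimated and ground-truth illuminant vectors in RGB space, which is invariant to their magnitudes. A minimal implementation:

```python
import numpy as np

def angular_error_deg(est, truth):
    """Angle in degrees between estimated and true illuminant RGB vectors."""
    est, truth = np.asarray(est, float), np.asarray(truth, float)
    cos = est @ truth / (np.linalg.norm(est) * np.linalg.norm(truth))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Scaling the estimate leaves the error unchanged: only chromaticity matters.
err = angular_error_deg([0.9, 1.0, 1.1], [1.8, 2.0, 2.2])
```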

Enqvist, Olof; Jiang, Fangyuan; Kahl, Fredrik; , "A brute-force algorithm for reconstructing a scene from two projections," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2961-2968, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995669
Abstract: Is the real problem in finding the relative orientation of two viewpoints the correspondence problem? We argue that this is only one difficulty. Even with known correspondences, popular methods like the eight point algorithm and minimal solvers may break down due to planar scenes or small relative motions. In this paper, we derive a simple, brute-force algorithm which is both robust to outliers and has no such algorithmic degeneracies. Several cost functions are explored including maximizing the consensus set and robust norms like truncated least-squares. Our method is based on parameter search in a four-dimensional space using a new epipolar parametrization. In principle, we do an exhaustive search of parameter space, but the computations are very simple and easily parallelizable, resulting in an efficient method. Further speed-ups can be obtained by restricting the domain of possible motions to, for example, planar motions or small rotations. Experimental results are given for a variety of scenarios including scenes with a large portion of outliers. Further, we apply our algorithm to 3D motion segmentation, where we outperform the state of the art on the well-known Hopkins-155 benchmark database.

Wang, Xinggang; Bai, Xiang; Liu, Wenyu; Latecki, Longin Jan; , "Feature context for image classification and object detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.961-968, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995696
Abstract: In this paper, we present a new method to encode the spatial information of local image features, which is a natural extension of Shape Context (SC), so we call it Feature Context (FC). Given a position in an image, SC computes a histogram of the other points belonging to the target binary shape based on their distances and angles to the position. The value of each histogram bin of SC is the number of shape points in the region assigned to the bin. Thus, SC requires knowing the location of the points of the target shape. In other words, an image point can have only two labels: it belongs to the shape or it does not. In contrast, FC can be applied to the whole image without knowing the location of the target shape in the image. Each image point can have multiple labels depending on its local features. The value of each histogram bin of FC is a histogram of the various features assigned to points in the bin region. We also introduce an efficient coding method to encode the local image features, called Radial Basis Coding (RBC). Combining RBC and FC together, and using a linear SVM classifier, our method is suitable for both image classification and object detection.
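The histogram-of-histograms idea can be sketched directly: log-polar spatial bins around a position, where each bin accumulates a histogram over point labels instead of a bare point count. The bin counts and radii below are illustrative assumptions, and a real FC descriptor would use quantized local features (e.g. the paper's RBC codes) as the labels.

```python
import numpy as np

def feature_context(center, points, labels, n_labels, n_r=3, n_theta=4, r_max=2.0):
    """Log-polar spatial bins; each bin holds a histogram of point labels."""
    fc = np.zeros((n_r, n_theta, n_labels))
    d = points - center
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0])
    r_bin = np.clip((np.log1p(r) / np.log1p(r_max) * n_r).astype(int), 0, n_r - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    for ri, ti, lab in zip(r_bin, t_bin, labels):
        fc[ri, ti, lab] += 1          # a label histogram per spatial bin
    return fc.ravel()

pts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
desc = feature_context(np.zeros(2), pts, labels=[0, 1, 1], n_labels=2)
```

Setting `n_labels` to 2 with labels meaning "on shape" / "off shape" recovers something close to plain Shape Context, which is the sense in which FC generalizes SC.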

Huang, Dong; Storer, Markus; De la Torre, Fernando; Bischof, Horst; , "Supervised local subspace learning for continuous head pose estimation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2921-2928, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995683
Abstract: Head pose estimation from images has recently attracted much attention in computer vision due to its diverse applications in face recognition, driver monitoring and human computer interaction. Most successful approaches to head pose estimation formulate the problem as a nonlinear regression between image features and continuous 3D angles (i.e. yaw, pitch and roll). However, regression-like methods suffer from three main drawbacks: (1) they typically lack generalization and overfit when trained using a few samples; (2) they fail to produce reliable estimates over some regions of the output space (angles) when the training set is not uniformly sampled, for instance when the training data under-samples some angle ranges; (3) they are not robust to image noise or occlusion. To address these problems, this paper presents Supervised Local Subspace Learning (SL2), a method that learns a local linear model from a sparse and non-uniformly sampled training set. SL2 learns a mixture of local tangent spaces that is robust to under-sampled regions, and due to its regularization properties it is also robust to over-fitting. Moreover, because SL2 is a generative model, it can deal with image noise. Experimental results on the CMU Multi-PIE and BU-3DFE databases show the effectiveness of our approach in terms of accuracy and computational complexity.

Yamaguchi, Kota; Berg, Alexander C.; Ortiz, Luis E.; Berg, Tamara L.; , "Who are you with and where are you going?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1345-1352, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995468
Abstract: We propose an agent-based behavioral model of pedestrians to improve tracking performance in realistic scenarios. In this model, we view pedestrians as decision-making agents who consider a plethora of personal, social, and environmental factors to decide where to go next. We formulate prediction of pedestrian behavior as an energy minimization on this model. Two of our main contributions are simple, yet effective estimates of pedestrian destination and social relationships (groups). Our final contribution is to incorporate these hidden properties into an energy formulation that results in accurate behavioral prediction. We evaluate both our estimates of destination and grouping, as well as our accuracy at prediction and tracking, against a state-of-the-art behavioral model and show improvements, especially in the challenging observational situation of infrequent appearance observations, something that might occur in the thousands of webcams available on the Internet.

Strekalovskiy, Evgeny; Cremers, Daniel; , "Total variation for cyclic structures: Convex relaxation and efficient minimization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1905-1911, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995573
Abstract: We introduce a novel type of total variation regularizer, TV_S1, for cyclic structures such as angles or hue values. The method handles the periodicity of values in a simple and consistent way and is invariant to value shifts. The regularizer is integrated in a recent functional lifting framework which allows for arbitrary nonconvex data terms. Results are superior to, and more natural than, those of the plain total variation applied without special care for wrapping at interval end points. In addition, we propose an equivalent formulation which can be minimized with the same time and memory efficiency as the standard total variation.
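The essential point, measuring differences on the circle rather than on the real line, can be shown in a couple of lines (this sketch does not reproduce the paper's convex relaxation or functional lifting):

```python
import numpy as np

def cyclic_dist(a, b, period=2 * np.pi):
    """Shortest distance between two cyclic values, respecting wrap-around."""
    d = np.abs(a - b) % period
    return np.minimum(d, period - d)

# A hue jump from 359 deg to 1 deg: plain TV sees 358 deg, cyclic TV sees 2 deg.
a, b = np.radians(359.0), np.radians(1.0)
tv_linear = abs(a - b)
tv_cyclic = cyclic_dist(a, b)
```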

Cech, Jan; Sanchez-Riera, Jordi; Horaud, Radu; , "Scene flow estimation by growing correspondence seeds," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3129-3136, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995442
Abstract: A simple seed growing algorithm for estimating scene flow in a stereo setup is presented. Two calibrated and synchronized cameras observe a scene and output a sequence of image pairs. The algorithm simultaneously computes a disparity map between the image pairs and optical flow maps between consecutive images. This, together with calibration data, is an equivalent representation of the 3D scene flow, i.e. a 3D velocity vector is associated with each reconstructed point. The proposed method starts from correspondence seeds and propagates these correspondences to their neighborhood. It is accurate for complex scenes with large motions and produces temporally-coherent stereo disparity and optical flow results. The algorithm is fast due to inherent search space reduction. An explicit comparison with recent methods of spatiotemporal stereo and variational optical and scene flow is provided.

He, Ran; Zheng, Wei-Shi; Hu, Bao-Gang; Kong, Xiang-Wei; , "Nonnegative sparse coding for discriminative semi-supervised learning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2849-2856, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995487
Abstract: An informative and discriminative graph plays an important role in graph-based semi-supervised learning methods. This paper introduces a nonnegative sparse algorithm, and an approximation of it based on the l0-l1 equivalence theory, to compute the nonnegative sparse weights of a graph; the resulting method is termed the sparse probability graph (SPG). The nonnegative sparse weights in the graph naturally serve as clustering indicators, benefiting semi-supervised learning. More importantly, our approximation algorithm speeds up the computation of the nonnegative sparse coding, which has been a bottleneck in previous attempts at sparse nonnegative graph learning, and it is much more efficient than using the l1-norm sparsity technique for learning large-scale sparse graphs. Finally, for discriminative semi-supervised learning, an adaptive label propagation algorithm is also proposed to iteratively predict the labels of data on the SPG. Promising experimental results show that the nonnegative sparse coding is efficient and effective for discriminative semi-supervised learning.
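A projected-gradient sketch of nonnegative sparse coding (a generic approach, not the paper's l0-l1 equivalence-based algorithm): with the weights constrained to be nonnegative, the l1 penalty reduces to a plain sum, so each iteration is a gradient step followed by clipping at zero. The dictionary size, penalty, and step size below are illustrative.

```python
import numpy as np

def nonneg_sparse_code(x, D, lam=0.1, steps=500, lr=0.01):
    """Minimize 0.5*||x - D w||^2 + lam * sum(w) subject to w >= 0."""
    w = np.zeros(D.shape[1])
    for _ in range(steps):
        g = D.T @ (D @ w - x) + lam       # for w >= 0, sum(w) has gradient lam
        w = np.maximum(w - lr * g, 0.0)   # project onto the nonnegative orthant
    return w

rng = np.random.default_rng(3)
D = rng.normal(size=(20, 30))
D /= np.linalg.norm(D, axis=0)            # unit-norm dictionary atoms
x = 2.0 * D[:, 4] + 1.5 * D[:, 11]        # signal built from two atoms
w = nonneg_sparse_code(x, D)
```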

Yu, Kai; Lin, Yuanqing; Lafferty, John; , "Learning image representations from the pixel level via hierarchical sparse coding," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1713-1720, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995732
Abstract: We present a method for learning image representations using a two-layer sparse coding scheme at the pixel level. The first layer encodes local patches of an image. After pooling within local regions, the first layer codes are then passed to the second layer, which jointly encodes signals from the region. Unlike traditional sparse coding methods that encode local patches independently, this approach accounts for high-order dependency among patterns in a local image neighborhood. We develop algorithms for data encoding and codebook learning, and show in experiments that the method leads to more invariant and discriminative image representations. The algorithm gives excellent results for hand-written digit recognition on MNIST and object recognition on the Caltech101 benchmark. This marks the first time that such accuracies have been achieved using automatically learned features from the pixel level, rather than using hand-designed descriptors.

Andriyenko, Anton; Schindler, Konrad; , "Multi-target tracking by continuous energy minimization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1265-1272, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995311
Abstract: We propose to formulate multi-target tracking as minimization of a continuous energy function. Unlike a number of recent approaches, we focus on designing an energy function that represents the problem as faithfully as possible, rather than one that is amenable to elegant optimization. We then go on to construct a suitable optimization scheme to find strong local minima of the proposed energy. The scheme extends the conjugate gradient method with periodic trans-dimensional jumps. These moves allow the search to escape weak minima and explore a much larger portion of the variable-dimensional search space, while still always reducing the energy. To demonstrate the validity of this approach we present an extensive quantitative evaluation both on synthetic data and on six different real video sequences. In both cases we achieve a significant performance improvement over an extended Kalman filter baseline as well as an ILP-based state-of-the-art tracker.

Chen, Chao; Freedman, Daniel; Lampert, Christoph H.; , "Enforcing topological constraints in random field image segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2089-2096, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995503
Abstract: We introduce TopoCut: a new way to integrate knowledge about topological properties (TPs) into a random field image segmentation model. Instead of including TPs as additional constraints during minimization of the energy function, we devise an efficient algorithm for modifying the unary potentials such that the resulting segmentation is guaranteed to have the desired properties. Our method is more flexible in the sense that it handles more topology constraints than previous methods, which were only able to enforce pairwise or global connectivity. In particular, our method is very fast, making it possible for the first time to enforce global topological properties in practical image segmentation tasks.

Payet, Nadia; Todorovic, Sinisa; , "Scene shape from texture of objects," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2017-2024, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995326
Abstract: Joint reasoning about objects and 3D scene layout has shown great promise in scene interpretation. One visual cue that has been overlooked is texture arising from a spatial repetition of objects in the scene (e.g., windows of a building). Such texture provides scene-specific constraints among objects, and thus facilitates scene interpretation. We present an approach to: (1) detecting distinct textures of objects in a scene, (2) reconstructing the 3D shape of detected texture surfaces, and (3) combining object detections and shape-from-texture toward a globally consistent scene interpretation. Inference is formulated within the reinforcement learning framework as a sequential interpretation of image regions, starting from confident regions to guide the interpretation of other regions. Our algorithm finds an optimal policy that maps states of detected objects and reconstructed surfaces to actions which ought to be taken in those states, including detecting new objects and identifying new textures, so as to minimize a long-term loss. Tests against ground truth obtained from stereo images demonstrate that we can coarsely reconstruct a 3D model of the scene from a single image, without learning the layout of common scene surfaces, as done in prior work. We also show that reasoning about texture of objects improves object detection.

Deselaers, Thomas; Ferrari, Vittorio; , "Visual and semantic similarity in ImageNet," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1777-1784, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995474
Abstract: Many computer vision approaches take for granted positive answers to questions such as "Are semantic categories visually separable?" and "Is visual similarity correlated to semantic similarity?". In this paper, we study experimentally whether these assumptions hold and show parallels to questions investigated in cognitive science about the human visual system. The insights gained from our analysis enable building a novel distance function between images assessing whether they are from the same basic-level category. This function goes beyond direct visual distance as it also exploits semantic similarity measured through ImageNet. We demonstrate experimentally that it outperforms purely visual distances.

Zhang, Junge; Huang, Kaiqi; Yu, Yinan; Tan, Tieniu; , "Boosted local structured HOG-LBP for object localization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1393-1400, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995678
Abstract: Object localization is a challenging problem due to variations in object structure and illumination. Although existing part based models have achieved impressive progress in the past several years, their improvement is still limited by low-level feature representation. Therefore, this paper mainly studies the description of object structure from both the feature level and the topology level. Following the bottom-up paradigm, we propose a boosted Local Structured HOG-LBP based object detector. Firstly, at the feature level, we propose a Local Structured Descriptor to capture the object's local structure, and develop the descriptors from shape and texture information, respectively. Secondly, at the topology level, we present a boosted feature selection and fusion scheme for part based object detectors. All experiments are conducted on the challenging PASCAL VOC2007 dataset. Experimental results show that our method achieves state-of-the-art performance.

Zhang, Weiyu; Srinivasan, Praveen; Shi, Jianbo; , "Discriminative image warping with attribute flow," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2393-2400, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995342
Abstract: We address the problem of finding the deformation between two images for the purpose of recognizing objects. The challenge is that discriminative features are often transformation-variant (e.g. histogram of oriented gradients, texture), while transformation-invariant features (e.g. intensity, color) are often not discriminative. We introduce the concept of attribute flow, which explicitly models how image attributes vary with image deformation. We develop a non-parametric method to approximate this using histogram matching, which can be solved efficiently using linear programming. Our method produces dense correspondence between images, and utilizes discriminative, transformation-variant features for simultaneous detection and alignment. Experiments on the ETHZ shape categories dataset show that we can accurately recognize highly deformable objects with few training examples.

Torralba, Antonio; Efros, Alexei A.; , "Unbiased look at dataset bias," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1521-1528, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995347
Abstract: Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as a source of large amounts of training data, but also as a means of measuring and comparing the performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets that started out as data-capture efforts aimed at representing the visual world have become closed worlds unto themselves (e.g. the Corel world, the Caltech-101 world, the PASCAL VOC world). With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose? The goal of this paper is to take stock of the current state of recognition datasets. We present a comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of the closed-world assumption, and sample value. The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols. But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected, issue.

Liu, Jingen; Kuipers, Benjamin; Savarese, Silvio; "Recognizing human actions by attributes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3337-3344, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995353
Abstract: In this paper we explore the idea of using high-level semantic concepts, also called attributes, to represent human actions from videos and argue that attributes enable the construction of more descriptive models for human action recognition. We propose a unified framework wherein manually specified attributes are: i) selected in a discriminative fashion so as to account for intra-class variability; ii) coherently integrated with data-driven attributes to make the attribute set more descriptive. Data-driven attributes are automatically inferred from the training data using an information theoretic approach. Our framework is built upon a latent SVM formulation where latent variables capture the degree of importance of each attribute for each action class. We also demonstrate that our attribute-based action representation can be effectively used to design a recognition procedure for classifying novel action classes for which no training samples are available. We test our approach on several publicly available datasets and obtain promising results that quantitatively demonstrate our theoretical claims.

Fragkiadaki, Katerina; Shi, Jianbo; "Detection free tracking: Exploiting motion and topology for segmenting and tracking under entanglement," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2073-2080, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995366
Abstract: We propose a detection-free system for segmenting multiple interacting and deforming people in a video. People detectors often fail under close agent interaction, limiting the performance of detection-based tracking methods. Motion information often fails to separate similarly moving agents or to group distinctly moving articulated body parts. We formulate video segmentation as graph partitioning in the trajectory domain. We classify trajectories as foreground or background based on trajectory saliencies, and use foreground trajectories as graph nodes. We incorporate object connectedness constraints into our trajectory weight matrix based on the topology of the foreground: we set repulsive weights between trajectories that belong to different connected components in any frame of their time intersection. Attractive weights are set between similarly moving trajectories. Information from foreground topology complements motion information, and our spatiotemporal segments can be interpreted as connected moving entities rather than just trajectory groups of similar motion. All our cues are computed on trajectories and naturally encode large temporal context, which is crucial for resolving ambiguities that are local in time. We present results of our approach on challenging datasets, outperforming the state of the art by a wide margin.

Kamgar-Parsi, Behzad; Kamgar-Parsi, Behrooz; "Matching 2D image lines to 3D models: Two improvements and a new algorithm," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2425-2432, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995557
Abstract: We revisit the problem of matching a set of lines in the 2D image to a set of corresponding lines in the 3D model for the following reasons. (a) Existing algorithms that treat lines as infinitely long contain a flaw, namely, the solutions found are not invariant with respect to the choice of the coordinate frame. The source of this flaw is in the way lines are represented. We propose a frame-independent representation for sets of infinite lines that removes the non-invariance flaw. (b) Algorithms for finding the best rigid transform are nonlinear optimizations that are sensitive to initialization and may result in unreliable and expensive solutions. We present a new recipe for initialization that exploits the 3D geometry of the problem and is applicable to all algorithms that perform the matching in the 3D scene. Experiments show that with this initialization all algorithms find the best transform. (c) We present a new efficient matching algorithm that is significantly faster than existing alternatives, since it does not require explicit evaluation of the cost function and its derivatives.

Pirri, Fiora; Pizzoli, Matia; Rudi, Alessandro; "A general method for the point of regard estimation in 3D space," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.921-928, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995634
Abstract: A novel approach to 3D gaze estimation for wearable multi-camera devices is proposed and its effectiveness is demonstrated both theoretically and empirically. The proposed approach, firmly grounded on the geometry of the multiple views, introduces a calibration procedure that is efficient, accurate, and highly innovative, yet practical and easy. Thus, it can run online with little intervention from the user. The overall gaze estimation model is general, as no complex model of the human eye is assumed in this work. This is made possible by a novel approach that can be sketched as follows: each eye is imaged by a camera; two conics are fitted to the imaged pupils; and a calibration sequence, consisting of the subject gazing at a known 3D point while moving his/her head, provides information to 1) estimate the optical axis in the 3D world; 2) compute the geometry of the multi-camera system; 3) estimate the Point of Regard in the 3D world. The resulting model is being used effectively to study visual attention by means of gaze estimation experiments involving people performing natural tasks in wide-field, unstructured scenarios.

Holroyd, Michael; Lawrence, Jason; "An analysis of using high-frequency sinusoidal illumination to measure the 3D shape of translucent objects," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2985-2991, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995536
Abstract: Using optical triangulation methods to measure the shape of translucent objects is difficult because subsurface scattering contaminates measurements of the "direct" reflection at the surface. A number of recent papers have shown that high-frequency sinusoidal illumination patterns allow isolating this direct component [16], which in turn enables accurate estimation of the shape of translucent objects [4]. Despite these encouraging results, there is currently no rigorous mathematical analysis of the expected error in the measured surface as it relates to the parameters of these systems: the frequency of the projected sinusoid, the geometric configuration of the source and camera, and the optical properties of the target object. We present such an analysis, which confirms earlier empirical results and provides a much needed tool for designing 3D scanners for translucent objects.

Lee, Taehee; Soatto, Stefano; "Learning and matching multiscale template descriptors for real-time detection, localization and tracking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1457-1464, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995453
Abstract: We describe a system to learn an object template from a video stream, and localize and track the corresponding object in live video. The template is decomposed into a number of local descriptors, thus enabling detection and tracking in spite of partial occlusion. Each local descriptor aggregates contrast invariant statistics (normalized intensity and gradient orientation) across scales, in a way that enables matching under significant scale variations. Low-level tracking during the training video sequence enables capturing object-specific variability due to the shape of the object, which is encapsulated in the descriptor. Salient locations on both the template and the target image are used as hypotheses to expedite matching.

Zhou, Hongbo; Cheng, Qiang; "O(N) implicit subspace embedding for unsupervised multi-scale image segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2209-2215, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995606
Abstract: Subspace embedding is a powerful tool for extracting salient information from a matrix, and it has numerous applications in image processing. However, its applicability has been severely limited by the O(N³) computational complexity (N is the number of points) that usually arises in explicitly evaluating the eigenvalues and eigenvectors. In this paper, we propose an implicit subspace embedding method which avoids explicitly evaluating the eigenvectors. We also show that this method can be seamlessly incorporated into the unsupervised multi-scale image segmentation framework, and that the resulting algorithm has a genuine O(N) running time. Moreover, we can explicitly determine the number of iterations for the algorithm by estimating the desired size of the subspace, which also controls the amount of information we want to extract in this unsupervised setting. We performed extensive experiments to verify the validity and effectiveness of our method, and conclude that it requires less than 120 seconds (3.2 GHz CPU, 16 GB memory) to cut a 1000×1000 color image and is orders of magnitude faster than the original multi-scale image segmentation with explicit spectral decomposition, while maintaining the same or better segmentation quality.
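The central idea above, never forming the O(N³) eigendecomposition explicitly, can be illustrated with plain power iteration, which recovers a leading eigenvector using only matrix-vector products (each O(N) for a sparse affinity matrix). This is a generic sketch, not the authors' algorithm:

```python
import numpy as np

def power_iteration(A, iters=200):
    """Leading eigenvector via repeated matrix-vector products.
    Each product costs O(nnz(A)); no explicit O(N^3) eigendecomposition
    is ever formed."""
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
v = power_iteration(A)
# Compare (up to sign) against the eigenvector from a full decomposition.
w, V = np.linalg.eigh(A)
print(abs(v @ V[:, -1]))  # close to 1.0
```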

Hermans, Jeroen; Smeets, Dirk; Vandermeulen, Dirk; Suetens, Paul; "Robust point set registration using EM-ICP with information-theoretically optimal outlier handling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2465-2472, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995744
Abstract: In this paper the problem of pairwise model-to-scene point set registration is considered. Three contributions are made. Firstly, the relations between correspondence-based and some information-theoretic point cloud registration algorithms are formalized. Starting from the observation that the outlier handling of existing methods relies on heuristically determined models, a second contribution is made by exploiting the aforementioned relations to derive a new robust point set registration algorithm. Representing model and scene point clouds by mixtures of Gaussians, the method minimizes their Kullback-Leibler divergence both w.r.t. the registration transformation parameters and w.r.t. the scene's mixture coefficients. This results in an Expectation-Maximization Iterative Closest Point (EM-ICP) approach with a parameter-free outlier model that is optimal in an information-theoretic sense. While the current (CUDA) implementation is limited to the rigid registration case, the underlying theory applies to both rigid and non-rigid point set registration. As a by-product of the registration algorithm's theory, a third contribution is made by suggesting a new point cloud Kernel Density Estimation approach which relies on maximizing the resulting distribution's entropy w.r.t. the kernel weights. The rigid registration algorithm is applied to align different patches of the publicly available Stanford Dragon and Stanford Happy Buddha range data. The results show good performance regarding accuracy, robustness and convergence range.

Bucak, Serhat Selcuk; Jin, Rong; Jain, Anil K.; "Multi-label learning with incomplete class assignments," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2801-2808, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995734
Abstract: We consider a special type of multi-label learning where class assignments of training examples are incomplete. As an example, an instance whose true class assignment is (c1, c2, c3) is only assigned to class c1 when it is used as a training sample. We refer to this problem as multi-label learning with incomplete class assignment. Incompletely labeled data is frequently encountered when the number of classes is very large (hundreds, as in the MIR Flickr dataset) or when there is large ambiguity between classes (e.g., jet vs. plane). In both cases, it is difficult for users to provide complete class assignments for objects. We propose a ranking-based multi-label learning framework that explicitly addresses the challenge of learning from incompletely labeled data by exploiting the group lasso technique to combine the ranking errors. We present a learning algorithm that is empirically shown to be efficient for solving the related optimization problem. Our empirical study shows that the proposed framework is more effective than the state-of-the-art algorithms for multi-label learning in dealing with incompletely labeled data.
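Group lasso solvers of the kind the abstract mentions are typically built on the block soft-threshold (proximal) operator, which either shrinks an entire group of coefficients or zeroes it out. A minimal sketch with invented toy groups, purely to illustrate the primitive:

```python
import numpy as np

def group_soft_threshold(w, lam):
    """Proximal operator of the group-lasso penalty lam * ||w||_2:
    shrink the whole group toward zero, or kill it entirely."""
    norm = np.linalg.norm(w)
    if norm <= lam:
        return np.zeros_like(w)
    return (1.0 - lam / norm) * w

print(group_soft_threshold(np.array([3.0, 4.0]), 2.5))  # shrinks to [1.5, 2.0]
print(group_soft_threshold(np.array([0.3, 0.4]), 2.5))  # whole group zeroed
```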

Krishnan, Dilip; Tay, Terence; Fergus, Rob; "Blind deconvolution using a normalized sparsity measure," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.233-240, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995521
Abstract: Blind image deconvolution is an ill-posed problem that requires regularization to solve. However, many common forms of image prior used in this setting have a major drawback in that the minimum of the resulting cost function does not correspond to the true sharp solution. Accordingly, a range of additional methods are needed to yield good results (Bayesian methods, adaptive cost functions, alpha-matte extraction and edge localization). In this paper we introduce a new type of image regularization which gives lowest cost for the true sharp image. This allows a very simple cost formulation to be used for the blind deconvolution model, obviating the need for additional methods. Due to its simplicity the algorithm is fast and very robust. We demonstrate our method on real images with both spatially invariant and spatially varying blur.
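The normalized sparsity measure in question is the l1/l2 ratio of image gradients which, unlike a plain l1 prior, increases under blur, so the sharp image attains the lowest cost. A toy 1D check (the signal is invented for illustration):

```python
import numpy as np

def normalized_sparsity(g):
    """l1/l2 ratio of a gradient signal: small when gradients are sparse
    (sharp edges), larger once blur spreads the same mass over many entries."""
    return np.abs(g).sum() / np.linalg.norm(g)

sharp = np.zeros(100)
sharp[50] = 1.0                                   # a single sharp edge
blurred = np.convolve(sharp, np.ones(5) / 5, mode="same")

print(normalized_sparsity(sharp))    # 1.0
print(normalized_sparsity(blurred))  # ~2.24: blur raises the l1/l2 cost
```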

Guo, Guodong; Mu, Guowang; "Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.657-664, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995404
Abstract: Human age estimation has recently become an active research topic in computer vision and pattern recognition because of its many potential real-world applications. In this paper we propose to use kernel partial least squares (KPLS) regression for age estimation. The KPLS (or linear PLS) method has several advantages over previous approaches: (1) KPLS can reduce feature dimensionality and learn the aging function simultaneously in a single learning framework, instead of performing each task separately using different techniques; (2) KPLS can find a small number of latent variables, e.g., 20, to project thousands of features into a very low-dimensional subspace, which may have great impact on real-time applications; and (3) KPLS regression has an output vector that can contain multiple labels, so that several related problems, e.g., age estimation, gender classification, and ethnicity estimation, can be solved together. This is the first time the kernel PLS method has been introduced and applied to solve a regression problem in computer vision with high accuracy. Experimental results on a very large database show that KPLS is significantly better than the popular SVM method, and outperforms the state-of-the-art approaches in human age estimation.
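The linear PLS special case mentioned above can be sketched with a minimal NIPALS-style PLS1: each latent component is chosen to covary with the label, unlike PCA's variance-only criterion. The toy data, sizes, and names below are mine, not the authors' code:

```python
import numpy as np

def pls1_scores(X, y, n_components):
    """Minimal NIPALS PLS1: latent scores T for a single response y.
    Each weight vector maximizes covariance with the (deflated)
    response; predictors and response are deflated per component."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    scores = []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        p = X.T @ t / (t @ t)
        q = (y @ t) / (t @ t)
        X = X - np.outer(t, p)   # deflate predictors
        y = y - q * t            # deflate response
        scores.append(t)
    return np.column_stack(scores)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))      # e.g. thousands of face features
age = X[:, 0] + 0.1 * rng.standard_normal(200)
T = pls1_scores(X, age, 20)
print(T.shape)  # (200, 20): thousands of features -> 20 latent variables
```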

Gupta, Abhinav; Satkin, Scott; Efros, Alexei A.; Hebert, Martial; "From 3D scene geometry to human workspace," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1961-1968, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995448
Abstract: We present a human-centric paradigm for scene understanding. Our approach goes beyond estimating 3D scene geometry and predicts the "workspace" of a human, which is represented by a data-driven vocabulary of human interactions. Our method builds upon recent work in indoor scene understanding and the availability of motion capture data to create a joint space of human poses and scene geometry by modeling the physical interactions between the two. This joint space can then be used to predict potential human poses and joint locations from a single image. In a way, this work revisits the principle of Gibsonian affordances, reinterpreting it for the modern, data-driven era.

Cao, Yu; Zhang, Zhiqi; Czogiel, Irina; Dryden, Ian; Wang, Song; "2D nonrigid partial shape matching using MCMC and contour subdivision," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2345-2352, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995588
Abstract: Shape matching has many applications in computer vision, such as shape classification, object recognition, object detection, and localization. In 2D cases, shape instances are 2D closed contours, and matching two shape contours can usually be formulated as finding a one-to-one dense point correspondence between them. However, in practice, many shape contours are extracted from real images and may contain partial occlusions. This leads to the challenging partial shape matching problem, where we need to identify and match a subset of segments of the two shape contours. In this paper, we propose a new MCMC (Markov chain Monte Carlo) based algorithm to handle partial shape matching with mildly non-rigid deformations. Specifically, we represent each shape contour by a set of ordered landmark points. The selection of a subset of these landmark points for shape matching is evaluated and updated by a posterior distribution, which is composed of both a matching likelihood and a prior distribution. This prior distribution favors the inclusion of more, and consecutive, landmark points in the matching. To better describe the matching likelihood, we develop a contour-subdivision technique to highlight the contour segment with the highest matching cost among the selected subsequences of points. In our experiments, we construct 1,600 test shape instances by introducing partial occlusions to 40 shapes chosen from different categories of the MPEG-7 dataset. We evaluate the performance of the proposed algorithm by comparing with three well-known partial shape matching methods.

Cong, Yang; Yuan, Junsong; Liu, Ji; "Sparse reconstruction cost for abnormal event detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3449-3456, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995434
Abstract: We propose to detect abnormal events via a sparse reconstruction over the normal bases. Given an over-complete normal basis set (e.g., an image sequence or a collection of local spatio-temporal patches), we introduce the sparse reconstruction cost (SRC) over the normal dictionary to measure the normalness of the testing sample. To condense the size of the dictionary, a novel dictionary selection method is designed with sparsity consistency constraint. By introducing the prior weight of each basis during sparse reconstruction, the proposed SRC is more robust compared to other outlier detection criteria. Our method provides a unified solution to detect both local abnormal events (LAE) and global abnormal events (GAE). We further extend it to support online abnormal event detection by updating the dictionary incrementally. Experiments on three benchmark datasets and the comparison to the state-of-the-art methods validate the advantages of our algorithm.
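The idea of the sparse reconstruction cost can be sketched as follows: score a test sample by the cost of sparsely reconstructing it from a dictionary of normal samples; abnormal samples reconstruct poorly and so score high. The lasso subproblem here is solved with plain ISTA, and the dictionary, sizes, seed, and λ are invented toy choices, not the paper's setup:

```python
import numpy as np

def sparse_reconstruction_cost(y, D, lam=0.05, iters=500):
    """SRC: reconstruction residual + l1 penalty of the sparse code,
    found with plain ISTA (proximal gradient on the lasso objective)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        a = a + step * (D.T @ (y - D @ a))        # gradient step on the fit
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # shrink
    return 0.5 * np.sum((y - D @ a) ** 2) + lam * np.abs(a).sum()

rng = np.random.default_rng(0)
subspace = rng.standard_normal((20, 5))        # normal events span a low-dim subspace
D = subspace @ rng.standard_normal((5, 40))    # over-complete dictionary of normal bases
D /= np.linalg.norm(D, axis=0)
normal = subspace @ rng.standard_normal(5)     # new normal sample: in the span
abnormal = rng.standard_normal(20)             # abnormal sample: mostly outside it
normal /= np.linalg.norm(normal)
abnormal /= np.linalg.norm(abnormal)

print(sparse_reconstruction_cost(normal, D))    # small: well explained by the dictionary
print(sparse_reconstruction_cost(abnormal, D))  # noticeably larger
```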

Huang, Yongzhen; Huang, Kaiqi; Wang, Chong; Tan, Tieniu; "Exploring relations of visual codes for image classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1649-1656, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995655
Abstract: The classic Bag-of-Features (BOF) model and its extensions use a single value to represent a visual code. This strategy ignores the relations between visual codes. In this paper, we explore these relations and propose a new algorithm for image classification. It consists of two main parts: 1) construct a codebook graph wherein each visual code is linked with other codes; 2) describe each local feature using a pair of related codes, corresponding to an edge of the graph. Our approach contains richer information than previous BOF models. Moreover, we demonstrate that these models are special cases of ours. Various coding and pooling algorithms can be embedded into our framework to obtain better performance. Experiments on different kinds of image classification databases demonstrate that our approach stably achieves excellent performance compared with various BOF models.

Lee, Joon-Young; Shi, Boxin; Matsushita, Yasuyuki; Kweon, In-So; Ikeuchi, Katsushi; "Radiometric calibration by transform invariant low-rank structure," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2337-2344, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995409
Abstract: We present a robust radiometric calibration method that capitalizes on the transform invariant low-rank structure of sensor irradiances recorded from a static scene with different exposure times. We formulate the radiometric calibration problem as a rank minimization problem. Unlike previous approaches, our method naturally avoids the over-fitting problem; therefore, it is robust against biased distributions of the input data, which are common in practice. When the exposure times are completely unknown, the proposed method can robustly estimate the response function up to an exponential ambiguity. The method is evaluated using both simulated and real-world datasets and shows superior performance compared to previous approaches.
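The low-rank structure the method exploits can be checked on toy numbers: with a linear response, observations of a static scene under varying exposure times form a rank-1 matrix (irradiance times exposure), while a non-multiplicative response raises the rank. The irradiances, exposures, and response function below are illustrative inventions:

```python
import numpy as np

E = np.array([0.2, 0.5, 1.0, 2.0])   # scene irradiances (toy values)
t = np.array([1.0, 2.0, 4.0])        # exposure times
M = np.outer(E, t)                   # a linear sensor observes a rank-1 matrix

def response(x):
    """A hypothetical non-multiplicative camera response."""
    return x + 0.3 * x ** 2

print(np.linalg.matrix_rank(M))            # 1
print(np.linalg.matrix_rank(response(M)))  # 2: the nonlinearity raises the rank
```

Note that a pure power law x**g would leave the matrix rank 1, since (E*t)**g factors as an outer product; this is consistent with the exponential ambiguity mentioned in the abstract.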

Chakrabarti, Ayan; Zickler, Todd; "Statistics of real-world hyperspectral images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.193-200, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995660
Abstract: Hyperspectral images provide higher spectral resolution than typical RGB images by including per-pixel irradiance measurements in a number of narrow bands of wavelength in the visible spectrum. The additional spectral resolution may be useful for many visual tasks, including segmentation, recognition, and relighting. Vision systems that seek to capture and exploit hyperspectral data should benefit from statistical models of natural hyperspectral images, but at present, relatively little is known about their structure. Using a new collection of fifty hyperspectral images of indoor and outdoor scenes, we derive an optimized "spatio-spectral basis" for representing hyperspectral image patches, and explore statistical models for the coefficients in this basis.

Huh, Seungil; Chen, Mei; "Detection of mitosis within a stem cell population of high cell confluence in phase-contrast microscopy images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1033-1040, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995717
Abstract: Computer vision analysis of cells in phase-contrast microscopy images enables long-term continuous monitoring of live cells, which has not been feasible using existing cellular staining methods due to the use of fluorescence reagents or fixatives. In cell culture analysis, accurate detection of mitosis, or cell division, is critical for quantitative study of cell proliferation. In this work, we present an approach that can detect mitosis within a cell population of high cell confluence, or high cell density, which has proven challenging because of the difficulty in separating individual cells. We first detect candidates for birth events, defined as the time and location at which mitosis is complete and two daughter cells first appear. Each candidate is then examined to determine whether it is real, after incorporating spatio-temporal information by tracking the candidate in the neighboring frames. For this examination, we design a probabilistic model named Two-Labeled Hidden Conditional Random Field (TL-HCRF) that can use information on the timing of the candidate birth event in addition to the visual change of cells over time. Applied to two cell populations of high cell confluence, our method considerably outperforms previous methods. Comparisons with related statistical models also show the superiority of TL-HCRF on the proposed task.

Sharma, Abhishek; Jacobs, David W; "Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.593-600, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995350
Abstract: This paper presents a novel way to perform multi-modal face recognition. We use Partial Least Squares (PLS) to linearly map images in different modalities to a common linear subspace in which they are highly correlated. PLS has been previously used effectively for feature selection in face recognition. We show both theoretically and experimentally that PLS can be used effectively across modalities. We also formulate a generic intermediate subspace comparison framework for multi-modal recognition. Surprisingly, we achieve high performance using only pixel intensities as features. We experimentally demonstrate the highest published recognition rates on the pose variations in the PIE data set, and also show that PLS can be used to compare sketches to photos, and to compare images taken at different resolutions.

Yao, Bangpeng; Khosla, Aditya; Fei-Fei, Li; "Combining randomization and discrimination for fine-grained image categorization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1577-1584, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995368
Abstract: In this paper, we study the problem of fine-grained image categorization. The goal of our method is to explore fine image statistics and identify the discriminative image patches for recognition. We achieve this goal by combining two ideas, discriminative feature mining and randomization. Discriminative feature mining allows us to model the detailed information that distinguishes different classes of images, while randomization allows us to handle the huge feature space and prevents over-fitting. We propose a random forest with discriminative decision trees algorithm, where every tree node is a discriminative classifier that is trained by combining the information in this node as well as all upstream nodes. Our method is tested on both subordinate categorization and activity recognition datasets. Experimental results show that our method identifies semantically meaningful visual information and outperforms state-of-the-art algorithms on various datasets.

Huang, Dong; Tian, Yuandong; De la Torre, Fernando; "Local isomorphism to solve the pre-image problem in kernel methods," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2761-2768, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995685
Abstract: Kernel methods have been popular over the last decade for solving many computer vision, statistics and machine learning problems. An important open problem in kernel methods, both theoretically and practically, is the pre-image problem: finding a vector in the input space whose mapping is known in the feature space induced by a kernel. To solve the pre-image problem, this paper proposes a framework that computes an isomorphism between local Gram matrices in the input and feature spaces. Unlike existing methods that rely on analytic properties of kernels, our framework derives closed-form solutions to the pre-image problem in the case of non-differentiable and application-specific kernels. Experiments on the pre-image problem for visualizing cluster centers computed by kernel k-means and for denoising high-dimensional images show that our algorithm outperforms state-of-the-art methods.

Russell, Chris; Fayad, Joao; Agapito, Lourdes; "Energy based multiple model fitting for non-rigid structure from motion," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3009-3016, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995383
Abstract: In this paper we reformulate the 3D reconstruction of deformable surfaces from monocular video sequences as a labeling problem. We solve simultaneously for the assignment of feature points to multiple local deformation models and the fitting of models to points to minimize a geometric cost, subject to a spatial constraint that neighboring points should also belong to the same model. Piecewise reconstruction methods rely on features shared between models to enforce global consistency on the 3D surface. To account for this overlap between regions, we consider a super-set of the classic labeling problem in which a set of labels, instead of a single one, is assigned to each variable. We propose a mathematical formulation of this new model and show how it can be efficiently optimized with a variant of α-expansion. We demonstrate how this framework can be applied to Non-Rigid Structure from Motion and leads to simpler explanations of the same data. Compared to existing methods run on the same data, our approach has up to half the reconstruction error, and is more robust to over-fitting and outliers.

Han, Dongfeng; Bayouth, John; Song, Qi; Bhatia, Sudershan; Sonka, Milan; Wu, Xiaodong; "Feature guided motion artifact reduction with structure-awareness in 4D CT images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1057-1064, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995561
Abstract: In this paper, we propose a novel method to reduce the magnitude of 4D CT artifacts by stitching two images with a data-driven regularization constraint, which helps preserve local anatomical structures. Our method first computes an interface seam for the stitching in the overlapping region of the first image, which passes through the "smoothest" region, to reduce the structural complexity along the stitching interface. Then, we compute the displacements of the seam by matching the corresponding interface seam in the second image. We use sparse 3D features as structure cues to guide the seam matching, in which a regularization term is incorporated to keep the structure consistent. The energy function is minimized by solving a multiple-label problem in Markov Random Fields with an anatomical-structure-preserving regularization term. The displacements are propagated to the rest of the second image, and the two images are stitched along the interface seams based on the computed displacement field. The method was tested on both simulated data and clinical 4D CT images. The experiments on simulated data demonstrated that the proposed method was able to reduce the landmark distance error on average from 2.9 mm to 1.3 mm, outperforming the registration-based method by about 55%. For clinical 4D CT image data, the image quality was evaluated by three medical experts, and all identified far fewer artifacts in the results of our method than in those of the compared methods.
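Finding a seam through the "smoothest" region can be sketched in 2D as a minimum-cost seam found by dynamic programming over a per-pixel smoothness cost, a classic seam computation rather than the authors' 3D formulation; the cost map below is invented:

```python
import numpy as np

def min_cost_seam(cost):
    """Top-to-bottom seam of minimal total cost through a 2D cost map.
    Classic dynamic program: each cell extends the cheapest of the three
    cells above it; backtrack from the cheapest bottom cell."""
    h, w = cost.shape
    acc = cost.astype(float).copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            acc[i, j] += acc[i - 1, lo:hi].min()
    seam = [int(np.argmin(acc[-1]))]
    for i in range(h - 2, -1, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam.append(lo + int(np.argmin(acc[i, lo:hi])))
    return seam[::-1]  # column index of the seam in each row

# A cost map with one cheap ("smooth") column: the seam should follow it.
cost = np.full((4, 5), 5.0)
cost[:, 2] = 0.1
print(min_cost_seam(cost))  # [2, 2, 2, 2]
```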

Wu, Changchang; Agarwal, Sameer; Curless, Brian; Seitz, Steven M.; "Multicore bundle adjustment," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3057-3064, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995552
Abstract: We present the design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems. We explore the use of multicore CPU as well as multicore GPUs for this purpose. We show that overcoming the severe memory and bandwidth limitations of current generation GPUs not only leads to more space efficient algorithms, but also to surprising savings in runtime. Our CPU based system is up to ten times and our GPU based system is up to thirty times faster than the current state of the art methods [1], while maintaining comparable convergence behavior. The code and additional results are available at http://grail.cs.

Dong, Weisheng; Li, Xin; Zhang, Lei; Shi, Guangming; , "Sparsity-based image denoising via dictionary learning and structural clustering," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.457-464, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995478
Abstract: Where does the sparsity in image signals come from? Local and nonlocal image models have supplied complementary views of the regularity in natural images: the former attempts to construct or learn a dictionary of basis functions that promotes sparsity, while the latter connects sparsity with the self-similarity of the image source through clustering. In this paper, we present a variational framework for unifying these two views and propose a new denoising algorithm built upon clustering-based sparse representation (CSR). Inspired by the success of l1-optimization, we formulate a double-header l1-optimization problem whose regularization involves both dictionary learning and structural clustering. A surrogate-function-based iterative shrinkage solution is developed to solve this double-header l1-optimization problem, and a probabilistic interpretation of the CSR model is also included. Our experimental results show convincing improvements over the state-of-the-art denoising technique BM3D on the class of regular texture images. The PSNR performance of CSR denoising is at least comparable and often superior to other competing schemes, including BM3D, on a collection of 12 generic natural images.
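The surrogate-function iterative shrinkage solver this abstract mentions is, at its core, the classic ISTA update: a gradient step on the quadratic data term followed by soft-thresholding. Below is a minimal sketch of that generic update only, not the paper's full double-header CSR formulation; the names `D`, `y`, and `lam` are illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of the l1 norm: shrink each coefficient toward zero.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(D, y, lam, n_iter=200):
    """Iterative shrinkage-thresholding for min_x 0.5*||Dx - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)           # gradient of the quadratic data term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

With an orthonormal dictionary (e.g. `D = I`), the iteration reduces to a single soft-thresholding of `y`, which is a quick sanity check of the update.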

Bao, Sid Yingze; Savarese, Silvio; , "Semantic structure from motion," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2025-2032, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995462
Abstract: Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations (structure) of scene points, given observed 2D image feature points. In this paper, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both semantic and geometrical properties associated with objects in the scene (Fig. 1). These properties allow us to recover not only the structure and motion but also the 3D locations, poses, and categories of objects in the scene. We cast this problem as a maximum-likelihood problem where geometry (cameras, points, objects) and semantic information (object classes) are simultaneously estimated. The key intuition is that, in addition to image features, the measurements of objects across views provide additional geometrical constraints that relate cameras and scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: i) estimate camera poses only from object detections, ii) enhance camera pose estimation, compared to feature-point-based SFM algorithms, iii) improve object detections given multiple un-calibrated images, compared to independently detecting objects in single images. Extensive quantitative results on three datasets (LiDAR cars, street-view pedestrians, and Kinect office desktop) verify our theoretical claims.

Kuo, Cheng-Hao; Nevatia, Ram; , "How does person identity recognition help multi-person tracking?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1217-1224, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995384
Abstract: We address the problem of multi-person tracking in a complex scene from a single camera. Although tracklet-association methods have shown impressive results on several challenging datasets, the discriminability of the appearance model remains a limitation. Inspired by work on person identity recognition, we obtain discriminative appearance-based affinity models through a novel framework that incorporates the merits of person identity recognition, which benefits multi-person tracking performance. During off-line learning, a small set of local image descriptors is selected so that the on-line learned appearance-based affinity models are both effective and efficient. Given short but reliable tracklets generated by frame-to-frame association of detection responses, we identify them as query tracklets and gallery tracklets. For each gallery tracklet, a target-specific appearance model is learned from on-line training samples collected under spatio-temporal constraints. Both gallery tracklets and query tracklets are fed into a hierarchical association framework to obtain the final tracking results. We evaluate our proposed system on several public datasets and show significant improvements in terms of tracking evaluation metrics.

Gotardo, Paulo F.U.; Martinez, Aleix M.; , "Non-rigid structure from motion with complementary rank-3 spaces," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3065-3072, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995560
Abstract: Non-rigid structure from motion (NR-SFM) is a difficult, underconstrained problem in computer vision. This paper proposes a new algorithm that revises the standard matrix factorization approach in NR-SFM. We consider two alternative representations for the linear space spanned by a small number K of 3D basis shapes. As compared to the standard approach using general rank-3K matrix factors, we show that improved results are obtained by explicitly modeling K complementary spaces of rank 3. Our new method compares favorably to the state of the art in NR-SFM, providing improved results on high-frequency deformations of both articulated and simpler deformable shapes. We also present an approach for NR-SFM with occlusion.

Gupta, Mohit; Agrawal, Amit; Veeraraghavan, Ashok; Narasimhan, Srinivasa G.; , "Structured light 3D scanning in the presence of global illumination," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.713-720, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995321
Abstract: Global illumination effects such as inter-reflections, diffusion and sub-surface scattering severely degrade the performance of structured light-based 3D scanning. In this paper, we analyze the errors caused by global illumination in structured light-based shape recovery. Based on this analysis, we design structured light patterns that are resilient to individual global illumination effects using simple logical operations and tools from combinatorial mathematics. Scenes exhibiting multiple phenomena are handled by combining results from a small ensemble of such patterns. This combination also allows us to detect any residual errors that are corrected by acquiring a few additional images. Our techniques do not require explicit separation of the direct and global components of scene radiance and hence work even in scenarios where the separation fails or the direct component is too low. Our methods can be readily incorporated into existing scanning systems without significant overhead in terms of capture time or hardware. We show results on a variety of scenes with complex shape and material properties and challenging global illumination effects.
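For context, conventional structured-light systems often project binary Gray-code patterns, whose single-bit transitions between adjacent columns are precisely the kind of pattern property the authors redesign for resilience. Below is a minimal sketch of standard Gray-code pattern generation and decoding; it is illustrative background only, not the paper's resilient pattern set.

```python
import numpy as np

def gray_code_patterns(width, n_bits):
    # One projected bit-plane per row; adjacent columns differ in a single bit.
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)                         # binary-reflected Gray code
    return (gray[None, :] >> np.arange(n_bits)[:, None]) & 1

def decode_columns(patterns):
    # Reassemble the Gray code observed at each pixel from the bit-planes...
    n_bits = patterns.shape[0]
    gray = np.zeros(patterns.shape[1], dtype=np.int64)
    for b in range(n_bits):
        gray |= patterns[b].astype(np.int64) << b
    # ...then convert Gray back to binary with the prefix-XOR doubling trick.
    shift = 1
    while shift < n_bits:
        gray ^= gray >> shift
        shift *= 2
    return gray
```

Decoding the patterns for an 8-column "projector" recovers the column indices 0..7, which is the basic correspondence a structured-light scanner needs before triangulation.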

Kurkure, Uday; Le, Yen H.; Paragios, Nikos; Carson, James P.; Ju, Tao; Kakadiaris, Ioannis A.; , "Landmark/image-based deformable registration of gene expression data," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1089-1096, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995708
Abstract: Analysis of gene expression patterns in brain images obtained from high-throughput in situ hybridization requires accurate and consistent annotations of anatomical regions/subregions. Such annotations are obtained by mapping an anatomical atlas onto the gene expression images through intensity- and/or landmark-based registration methods or deformable model-based segmentation methods. Due to the complex appearance of the gene expression images, these approaches require a pre-processing step to determine landmark correspondences in order to incorporate landmark-based geometric constraints. In this paper, we propose a novel method for landmark-constrained, intensity-based registration without determining landmark correspondences a priori. The proposed method performs dense image registration and identifies the landmark correspondences, simultaneously, using a single higher-order Markov Random Field model. In addition, a machine learning technique is used to improve the discriminating properties of local descriptors for landmark matching by projecting them in a Hamming space of lower dimension. We qualitatively show that our method achieves promising results and also compares well, quantitatively, with the expert's annotations, outperforming previous methods.

Dhar, Sagnik; Ordonez, Vicente; Berg, Tamara L; , "High level describable attributes for predicting aesthetics and interestingness," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1657-1664, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995467
Abstract: With the rise in popularity of digital cameras, the amount of visual data available on the web is growing exponentially. Some of these pictures are extremely beautiful and aesthetically pleasing, but the vast majority are uninteresting or of low quality. This paper demonstrates a simple, yet powerful method to automatically select high aesthetic quality images from large image collections. Our aesthetic quality estimation method explicitly predicts some of the possible image cues that a human might use to evaluate an image and then uses them in a discriminative approach. These cues or high level describable image attributes fall into three broad types: 1) compositional attributes related to image layout or configuration, 2) content attributes related to the objects or scene types depicted, and 3) sky-illumination attributes related to the natural lighting conditions. We demonstrate that an aesthetics classifier trained on these describable attributes can provide a significant improvement over baseline methods for predicting human quality judgments. We also demonstrate our method for predicting the "interestingness" of Flickr photos, and introduce a novel problem of estimating query specific "interestingness".

Lin, Wen-Yan; Liu, Siying; Matsushita, Yasuyuki; Ng, Tian-Tsong; Cheong, Loong-Fah; , "Smoothly varying affine stitching," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.345-352, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995314
Abstract: Traditional image stitching using parametric transforms such as homography only produces perceptually correct composites for planar scenes or parallax-free camera motion between source frames. This limits mosaicing to source images taken from the same physical location. In this paper, we introduce a smoothly varying affine stitching field which is flexible enough to handle parallax while retaining the good extrapolation and occlusion handling properties of parametric transforms. Our algorithm, which jointly estimates both the stitching field and correspondence, permits the stitching of general motion source images, provided the scenes do not contain abrupt protrusions.

Gong, Minglun; , "Foreground segmentation of live videos using locally competing 1SVMs," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2105-2112, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995394
Abstract: The objective of foreground segmentation is to extract the desired foreground object from input videos. Over the years there has been a significant amount of effort on this topic; nevertheless, there is still no simple yet effective algorithm that can process live videos of objects with fuzzy boundaries captured by freely moving cameras. This paper presents an algorithm toward this goal. The key idea is to train and maintain two competing one-class support vector machines (1SVMs) at each pixel location, which model local color distributions for the foreground and background, respectively. We advocate the use of two competing local classifiers, as it provides higher discriminative power and allows better handling of ambiguities. As a result, our algorithm can deal with a variety of videos with complex backgrounds and freely moving cameras with minimal user interaction. In addition, by introducing novel acceleration techniques and by exploiting the parallel structure of the algorithm, real-time processing speed is achieved for VGA-sized videos.

Baust, Maximilian; Yezzi, Anthony J.; Unal, Gozde; Navab, Nassir; , "A Sobolev-type metric for polar active contours," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1017-1024, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995310
Abstract: Polar object representations have proven to be a powerful shape model for many medical as well as other computer vision applications, such as interactive image segmentation or tracking. Inspired by recent work on Sobolev active contours we derive a Sobolev-type function space for polar curves. This so-called polar space is endowed with a metric that allows us to favor origin translations and scale changes over smooth deformations of the curve. Moreover, the resulting curve flow inherits the coarse-to-fine behavior of Sobolev active contours and is thus very robust to local minima. These properties make the resulting polar active contours a powerful segmentation tool for many medical applications, such as cross-sectional vessel segmentation, aneurysm analysis, or cell tracking.

Li, Bing; Xiao, Rong; Li, Zhiwei; Cai, Rui; Lu, Bao-Liang; Zhang, Lei; , "Rank-SIFT: Learning to rank repeatable local interest points," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1737-1744, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995461
Abstract: Scale-invariant feature transform (SIFT) has been well studied in recent years. Most related research efforts have focused on designing and learning effective descriptors to characterize a local interest point. However, how to identify stable local interest points is still a very challenging problem. In this paper, we propose a set of differential features, and based on them we adopt a data-driven approach to learn a ranking function that sorts local interest points according to their stability across images containing the same visual objects. Compared with the handcrafted rule-based method used by the standard SIFT algorithm, our algorithm substantially improves the stability of detected local interest points on a very challenging benchmark dataset, in which images were generated under very different imaging conditions. Experimental results on the Oxford and PASCAL databases further demonstrate the superior performance of the proposed algorithm on both object image retrieval and category recognition.

Fang, Yi; Sun, Mengtian; Vishwanathan, S.V.N.; Ramani, Karthik; , "sLLE: Spherical locally linear embedding with applications to tomography," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1129-1136, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995563
Abstract: The tomographic reconstruction of a planar object from its projections taken at random unknown view angles is a problem that occurs often in medical imaging. Therefore, there is a need to robustly estimate the view angles given random observations of the projections. The widely used locally linear embedding (LLE) technique provides nonlinear embedding of points on a flat manifold. In our case, the projections belong to a sphere. Therefore, we extend LLE and develop a spherical locally linear embedding (sLLE) algorithm, which is capable of embedding data points on a non-flat spherically constrained manifold. Our algorithm, sLLE, transforms the problem of angle estimation into a spherically constrained embedding problem. It considers each projection as a high dimensional vector with dimensionality equal to the number of sampling points on the projection. The projections are then embedded onto a sphere, which parametrizes the projections with respect to view angles in a globally consistent manner. The image is reconstructed from parametrized projections through the inverse Radon transform. A number of experiments demonstrate that sLLE is particularly effective for the tomography application we consider. We evaluate its performance in terms of computational efficiency and noise tolerance, and show that sLLE can be used to shed light on other constrained applications of LLE.

Wolf, Lior; Hassner, Tal; Maoz, Itay; , "Face recognition in unconstrained videos with matched background similarity," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.529-534, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995566
Abstract: Recognizing faces in unconstrained videos is a task of mounting importance. While obviously related to face recognition in still images, it has its own unique characteristics and algorithmic requirements. Over the years several methods have been suggested for this problem, and a few benchmark data sets have been assembled to facilitate its study. However, there is a sizable gap between the actual application needs and the current state of the art. In this paper we make the following contributions. (a) We present a comprehensive database of labeled videos of faces in challenging, uncontrolled conditions (i.e., ‘in the wild’), the ‘YouTube Faces’ database, along with benchmark pair-matching tests. (b) We employ our benchmark to survey and compare the performance of a large variety of existing video face recognition techniques. Finally, (c) we describe a novel set-to-set similarity measure, the Matched Background Similarity (MBGS). This similarity is shown to considerably improve performance on the benchmark tests.

Raviv, Dan; Bronstein, Michael M.; Bronstein, Alexander M.; Kimmel, Ron; Sochen, Nir; , "Affine-invariant diffusion geometry for the analysis of deformable 3D shapes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2361-2367, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995486
Abstract: We introduce an (equi-)affine invariant diffusion geometry by which surfaces that go through squeeze and shear transformations can still be properly analyzed. The definition of an affine invariant metric enables us to construct an invariant Laplacian from which local and global geometric structures are extracted. Applications of the proposed framework demonstrate its power in generalizing and enriching the existing set of tools for shape analysis.

Ancuti, Codruta Orniana; Ancuti, Cosmin; Bekaert, Phillipe; , "Enhancing by saliency-guided decolorization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.257-264, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995414
Abstract: This paper introduces an effective decolorization algorithm that preserves the appearance of the original color image. Guided by the original saliency, the method blends the luminance and the chrominance information in order to conserve the initial color disparity while enhancing the chromatic contrast. As a result, our straightforward fusing strategy generates a new spatial distribution that discriminates better the illuminated areas and color features. Since we do not employ quantization or a per-pixel optimization (computationally expensive), the algorithm has a linear runtime, and depending on the image resolution it could be used in real-time applications. Extensive experiments and a comprehensive evaluation against existing state-of-the-art methods demonstrate the potential of our grayscale operator. Furthermore, since the method accurately preserves the finest details while enhancing the chromatic contrast, the utility and versatility of our operator have been proved for several other challenging applications such as video decolorization, detail enhancement, single image dehazing and segmentation under different illuminants.

Panagopoulos, Alexandros; Wang, Chaohui; Samaras, Dimitris; Paragios, Nikos; , "Illumination estimation and cast shadow detection through a higher-order graphical model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.673-680, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995585
Abstract: In this paper, we propose a novel framework to jointly recover the illumination environment and an estimate of the cast shadows in a scene from a single image, given coarse 3D geometry. We describe a higher-order Markov Random Field (MRF) illumination model, which combines low-level shadow evidence with high-level prior knowledge for the joint estimation of cast shadows and the illumination environment. First, a rough illumination estimate and the structure of the graphical model in the illumination space is determined through a voting procedure. Then, a higher order approach is considered where illumination sources are coupled with the observed image and the latent variables corresponding to the shadow detection. We examine two inference methods in order to effectively minimize the MRF energy of our model. Experimental evaluation shows that our approach is robust to rough knowledge of geometry and reflectance and inaccurate initial shadow estimates. We demonstrate the power of our MRF illumination model on various datasets and show that we can estimate the illumination in images of objects belonging to the same class using the same coarse 3D model to represent all instances of the class.

Cai, Deng; Bao, Hujun; He, Xiaofei; , "Sparse concept coding for visual analysis," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2905-2910, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995390
Abstract: We consider the problem of image representation for visual analysis. When representing images as vectors, the feature space is of very high dimensionality, which makes it difficult to apply statistical techniques for visual analysis. To tackle this problem, matrix factorization techniques, such as Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF), have received an increasing amount of interest in recent years. Matrix factorization is an unsupervised learning technique which finds a basis set capturing high-level semantics in the data and learns coordinates in terms of the basis set. However, the representations obtained this way are highly dense and cannot capture the intrinsic geometric structure in the data. In this paper, we propose a novel method, called Sparse Concept Coding (SCC), for image representation and analysis. Inspired by recent developments in manifold learning and sparse coding, SCC provides a sparse representation which can capture the intrinsic geometric structure of the image space. Extensive experimental results on image clustering show that the proposed approach provides a better representation with respect to the semantic structure.

Wu, Zheng; Kunz, Thomas H.; Betke, Margrit; , "Efficient track linking methods for track graphs using network-flow and set-cover techniques," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1185-1192, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995515
Abstract: This paper proposes novel algorithms that use network-flow and set-cover techniques to perform occlusion reasoning for a large number of small, moving objects in single or multiple views. We designed a track-linking framework for reasoning about short-term and long-term occlusions. We introduce a two-stage network-flow process to automatically construct a "track graph" that describes the track merging and splitting events caused by occlusion. To explain short-term occlusions, when local information is sufficient to distinguish objects, the process links trajectory segments through a series of optimal bipartite-graph matches. To resolve long-term occlusions, when global information is needed to characterize objects, the linking process computes a logarithmic approximation solution to the set cover problem. If multiple views are available, our method builds a track graph, independently for each view, and then simultaneously links track segments from each graph, solving a joint set cover problem for which a logarithmic approximation also exists. Through experiments on different datasets, we show that our proposed linear and integer optimization techniques make the track graph a particularly useful tool for tracking large groups of individuals in images.
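The short-term stage described above reduces to optimal bipartite matching between track endings and track beginnings. A minimal sketch of that generic step using the Hungarian algorithm from SciPy; the cost matrix and the `max_cost` gating threshold are illustrative, since the paper's actual costs come from its own affinity terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_tracklets(cost, max_cost=np.inf):
    """Optimally link track endings (rows) to track beginnings (columns)
    by minimizing total association cost, then gate out implausible links."""
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```

For example, with two endings and two beginnings whose cross-pairings are expensive, the matcher returns the two cheap diagonal links.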

Gall, Juergen; Fossati, Andrea; van Gool, Luc; , "Functional categorization of objects using real-time markerless motion capture," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1969-1976, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995582
Abstract: Unsupervised categorization of objects is a fundamental problem in computer vision. While appearance-based methods have become popular recently, other important cues like functionality are largely neglected. Motivated by psychological studies giving evidence that human demonstration has a facilitative effect on categorization in infancy, we propose an approach for object categorization from depth video streams. To this end, we have developed a method for capturing human motion in real-time. The captured data is then used to temporally segment the depth streams into actions. The segmented actions are then categorized in an unsupervised manner, through a novel descriptor for motion capture data that is robust to subject variations. Furthermore, we automatically localize the object that is manipulated within a video segment, and categorize it using the corresponding action. For evaluation, we have recorded a dataset that comprises depth data with registered video sequences for 6 subjects, 13 action classes, and 174 object manipulations.

Vasilyev, Yuriy; Zickler, Todd; Gortler, Steven; Ben-Shahar, Ohad; , "Shape from specular flow: Is one flow enough?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2561-2568, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995662
Abstract: Specular flow is the motion field induced on the image plane by the movement of points reflected by a curved, mirror-like surface. This flow provides information about surface shape, and when the camera and surface move as a fixed pair, shape can be recovered by solving linear differential equations along integral curves of flow. Previous analysis has shown that two distinct motions (i.e., two flow fields) are generally sufficient to guarantee a unique solution without externally-provided initial conditions. In this work, we show that we can often succeed with only one flow. The key idea is to exploit the fact that smooth surfaces induce integrability constraints on the surface normal field. We show that this induces a new differential equation that facilitates the propagation of shape information between integral curves of flow, and that combining this equation with known methods often permits the recovery of unique shape from a single specular flow given only a single seed point.

Chen, Lu-Hung; Yang, Yao-Hsiang; Chen, Chu-Song; Cheng, Ming-Yen; , "Illumination invariant feature extraction based on natural images statistics — Taking face images as an example," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.681-688, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995621
Abstract: Natural images are known to carry several distinct properties which are not shared with randomly generated images. In this article we utilize the scale-invariant property of natural images to construct a filter which extracts features invariant to illumination conditions. In contrast to most of the existing methods, which assume that such features lie in the high-frequency part of the spectrum, by analyzing the power spectra of natural images we show that some of these features can lie in the low-frequency part as well. From this fact, we derive a Wiener filter approach to best separate the illumination-invariant features from an image. We also provide a linear-time algorithm for our proposed Wiener filter, which only involves solving linear equations with a narrowly banded matrix. Our experiments on variable-lighting face recognition show that our proposed method achieves the best recognition rate and is generally faster than the state-of-the-art methods.

Gong, Yunchao; Lazebnik, Svetlana; , "Iterative quantization: A procrustean approach to learning binary codes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.817-824, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995432
Abstract: This paper addresses the problem of learning similarity-preserving binary codes for efficient retrieval in large-scale image collections. We propose a simple and efficient alternating minimization scheme for finding a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube. This method, dubbed iterative quantization (ITQ), has connections to multi-class spectral clustering and to the orthogonal Procrustes problem, and it can be used both with unsupervised data embeddings such as PCA and supervised embeddings such as canonical correlation analysis (CCA). Our experiments show that the resulting binary coding schemes decisively outperform several other state-of-the-art methods.
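The alternating minimization at the heart of ITQ is compact enough to sketch: fix the rotation and binarize, then fix the codes and solve the orthogonal Procrustes problem via an SVD. A minimal NumPy sketch under the paper's stated assumption that `V` is already zero-centered and projected (e.g. by PCA) down to the code length; the iteration count and seed are illustrative choices.

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Iterative quantization: find a rotation R minimizing the quantization
    error ||sign(VR) - VR||_F. V: zero-centered projected data (n, n_bits)."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    # Random orthogonal initialization via QR of a Gaussian matrix.
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R, update the binary codes
        # Fix B, update R: orthogonal Procrustes, solved by the SVD of V^T B.
        U, _, Wt = np.linalg.svd(V.T @ B)
        R = U @ Wt
    return np.sign(V @ R), R
```

Each Procrustes step keeps R exactly orthogonal, so the final codes are still a rotation-then-sign of the input data.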

Mu, Yadong; Dong, Jian; Yuan, Xiaotong; Yan, Shuicheng; , "Accelerated low-rank visual recovery by random projection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2609-2616, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995369
Abstract: Exact recovery from contaminated visual data plays an important role in various tasks. By modeling the observed data matrix as the sum of a low-rank matrix and a sparse matrix, theoretical guarantees exist under mild conditions for exact data recovery. In practice, the matrix nuclear norm is adopted as a convex surrogate of the non-convex matrix rank function to encourage the low-rank property, and it serves as the major component of the recently proposed Robust Principal Component Analysis (R-PCA). Recent endeavors have focused on enhancing the scalability of R-PCA to large-scale datasets, especially mitigating the computational burden of the frequent large-scale Singular Value Decompositions (SVDs) inherent in the nuclear norm optimization. In our proposed scheme, the nuclear norm of an auxiliary matrix is minimized instead, which is related to the original low-rank matrix by random projection. By design, the modified optimization entails SVDs on matrices of much smaller scale, as compared to the original optimization problem. Theoretical analysis justifies the proposed scheme, along with the greatly reduced optimization complexity. Both qualitative and quantitative studies are provided on various computer vision benchmarks to validate its effectiveness, including facial shadow removal, surveillance background modeling and large-scale image tag transduction. It is also highlighted that the proposed solution can serve as a general principle to accelerate many other nuclear-norm-oriented problems in numerous tasks.
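The core idea, replacing a large SVD with the SVD of a much smaller, randomly projected matrix, is the same principle behind randomized SVD. Below is a minimal sketch of that generic technique, not the paper's exact auxiliary-matrix formulation; the oversampling amount is an illustrative choice.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Approximate truncated SVD via random projection: the expensive SVD is
    performed on a small (rank+oversample) x n matrix instead of A itself."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)         # orthonormal basis for the range of A
    B = Q.T @ A                            # small matrix sharing A's row space
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]
```

When the input is exactly low-rank, the random projection captures its range with probability one, so the truncated reconstruction matches the input up to numerical precision.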

Salakhutdinov, Ruslan; Torralba, Antonio; Tenenbaum, Josh; , "Learning to share visual appearance for multiclass object detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1481-1488, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995720
Abstract: We present a hierarchical classification model that allows rare objects to borrow statistical strength from related objects that have many training examples. Unlike many of the existing object detection and recognition systems that treat different classes as unrelated entities, our model learns both a hierarchy for sharing visual appearance across 200 object categories and hierarchical parameters. Our experimental results on the challenging object localization and detection task demonstrate that the proposed model substantially improves the accuracy of the standard single object detectors that ignore hierarchical structure altogether.

Hansard, Miles; Horaud, Radu; Amat, Michel; Lee, Seungkyu; , "Projective alignment of range and parallax data," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3089-3096, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995533
Abstract: An approximately Euclidean representation of the visible scene can be obtained directly from a range, or ‘time-of-flight’, camera. An uncalibrated binocular system, in contrast, gives only a projective reconstruction of the scene. This paper analyzes the geometric mapping between the two representations, without requiring an intermediate calibration of the binocular system. The mapping can be found by either of two new methods, one of which requires point-correspondences between the range and colour cameras, and one of which does not. It is shown that these methods can be used to reproject the range data into the binocular images, which makes it possible to associate high-resolution colour and texture with each point in the Euclidean representation.

Anwaar-ul-Haq; Gondal, Iqbal; Murshed, Manzur; , "On dynamic scene geometry for view-invariant action matching," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3369-3376, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995697
Abstract: Variation in viewpoints poses significant challenges to action recognition. One popular way of encoding a view-invariant action representation is based on the exploitation of epipolar geometry between different views of the same action. The majority of representative work considers the detection of landmark points and their tracking, assuming that motion trajectories for all landmark points on the human body are available throughout the course of an action. Unfortunately, due to occlusion and noise, detection and tracking of these landmarks is not always robust. To work around this, some of the work assumes that such trajectories are manually marked, which is a clear drawback and lacks the automation computer vision should provide. In this paper, we address this problem by proposing a view-invariant action matching score based on epipolar geometry between actor silhouettes, without tracking and explicit point correspondences. In addition, we explore a multi-body epipolar constraint which makes it possible to work on original action volumes without any pre-processing. We show that the multi-body fundamental matrix captures the geometry of dynamic action scenes and helps devise an action matching score across different views without any prior segmentation of actors. Extensive experimentation on challenging view-invariant action datasets shows that our approach not only removes long-standing assumptions but also achieves significant improvement in recognition accuracy and retrieval.

Jiang, Zhuolin; Lin, Zhe; Davis, Larry S.; , "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1697-1704, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995354
Abstract: A label consistent K-SVD (LC-KSVD) algorithm to learn a discriminative dictionary for sparse coding is presented. In addition to using class labels of training data, we also associate label information with each dictionary item (columns of the dictionary matrix) to enforce discriminability in sparse codes during the dictionary learning process. More specifically, we introduce a new label consistent constraint called ‘discriminative sparse-code error’ and combine it with the reconstruction error and the classification error to form a unified objective function. The optimal solution is efficiently obtained using the K-SVD algorithm. Our algorithm learns a single over-complete dictionary and an optimal linear classifier jointly. It yields dictionaries so that feature points with the same class labels have similar sparse codes. Experimental results demonstrate that our algorithm outperforms many recently proposed sparse coding techniques for face and object category recognition under the same learning conditions.
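The unified objective combines the reconstruction error with a weighted "discriminative sparse-code error" and a weighted classification error. A small numpy sketch (dimensions, weights, and random data are illustrative, not from the paper) also shows the standard stacking trick that lets plain K-SVD optimize all three terms at once:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, N, C = 20, 8, 30, 4        # feature dim, atoms, samples, classes
Y = rng.standard_normal((d, N))  # training signals
D = rng.standard_normal((d, K))  # dictionary
X = rng.standard_normal((K, N))  # sparse codes
Q = rng.standard_normal((K, N))  # discriminative sparse-code targets
H = rng.standard_normal((C, N))  # class label matrix (one-hot in practice)
A = rng.standard_normal((K, K))  # linear transform enforcing label consistency
W = rng.standard_normal((C, K))  # linear classifier learned jointly
alpha, beta = 4.0, 2.0

def objective(Y, D, X, Q, A, H, W, alpha, beta):
    rec = np.linalg.norm(Y - D @ X, 'fro') ** 2  # reconstruction error
    lab = np.linalg.norm(Q - A @ X, 'fro') ** 2  # discriminative sparse-code error
    cls = np.linalg.norm(H - W @ X, 'fro') ** 2  # classification error
    return rec + alpha * lab + beta * cls

# Stacking targets and operators reduces this to a single K-SVD problem.
Y_s = np.vstack([Y, np.sqrt(alpha) * Q, np.sqrt(beta) * H])
D_s = np.vstack([D, np.sqrt(alpha) * A, np.sqrt(beta) * W])
```

The stacked residual equals the three-term objective, which is why the efficient K-SVD machinery carries over unchanged.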

Wang, Huayan; Koller, Daphne; , "Multi-level inference by relaxed dual decomposition for human pose segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2433-2440, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995722
Abstract: Combining information from the higher level and the lower level has long been recognized as an essential component in holistic image understanding. However, an efficient inference method for multi-level models remains an open problem. Moreover, modeling the complex relations within real world images often gives rise to energy terms that couple many variables in arbitrary ways. They make the inference problem even harder. In this paper, we construct an energy function over the pose of the human body and pixel-wise foreground / background segmentation. The energy function incorporates terms both on the higher level, which models the human poses, and the lower level, which models the pixels. It also contains an intractable term that couples all body parts. We show how to optimize this energy in a principled way by relaxed dual decomposition, which proceeds by maximizing a concave lower bound on the energy function. Empirically, we show that our approach improves the state-of-the-art performance of human pose estimation on the Ramanan benchmark dataset.

Lu, Wei-Lwun; Ting, Jo-Anne; Murphy, Kevin P.; Little, James J.; , "Identifying players in broadcast sports videos using conditional random fields," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3249-3256, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995562
Abstract: We are interested in the problem of automatic tracking and identification of players in broadcast sport videos shot with a moving camera from a medium distance. While there are many good tracking systems, there are fewer methods that can identify the tracked players. Player identification is challenging in such videos due to blurry facial features (due to fast camera motion and low-resolution) and rarely visible jersey numbers (which, when visible, are deformed due to player movements). We introduce a new system consisting of three components: a robust tracking system, a robust person identification system, and a conditional random field (CRF) model that can perform joint probabilistic inference about the player identities. The resulting system is able to achieve a player recognition accuracy up to 85% on unlabeled NBA basketball clips.

Shi, Jianping; Ren, Xiang; Dai, Guang; Wang, Jingdong; Zhang, Zhihua; , "A non-convex relaxation approach to sparse dictionary learning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1809-1816, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995592
Abstract: Dictionary learning is a challenging theme in computer vision. The basic goal is to learn a sparse representation from an overcomplete basis set. Most existing approaches employ a convex relaxation scheme to tackle this challenge, because convexity is well suited to computation and theoretical analysis. In this paper we propose a non-convex online approach for dictionary learning. To achieve sparseness, our approach treats the so-called minimax concave (MC) penalty as a non-convex relaxation of the ℓ0 penalty. This treatment is expected to yield a more robust and sparse representation than existing convex approaches. In addition, we employ an online algorithm to adaptively learn the dictionary, which makes the non-convex formulation computationally feasible. Experimental results on sparseness comparisons and on applications to image denoising and image inpainting demonstrate that our approach is more effective and flexible.
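The scalar proximal operator of the MC penalty is the "firm" thresholding rule, which, unlike the soft thresholding induced by the ℓ1 norm, leaves large coefficients unshrunk. A numpy sketch of that operator (the standard MCP prox for gamma > 1, not the authors' full online algorithm):

```python
import numpy as np

def firm_threshold(x, lam, gamma):
    """Proximal operator of the minimax concave (MC) penalty, gamma > 1.
    Small inputs are zeroed, mid-range inputs are shrunk, and inputs
    beyond gamma*lam pass through unchanged (no bias on large coefficients)."""
    x = np.asarray(x, dtype=float)
    a = np.abs(x)
    return np.where(a <= lam, 0.0,
           np.where(a <= gamma * lam,
                    np.sign(x) * (a - lam) * gamma / (gamma - 1),
                    x))
```

Soft thresholding would return x - lam for every large coefficient; firm thresholding returns x itself, which is the source of the reduced bias that motivates the non-convex relaxation.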

Harandi, Mehrtash T.; Sanderson, Conrad; Shirazi, Sareh; Lovell, Brian C.; , "Graph embedding discriminant analysis on Grassmannian manifolds for improved image set matching," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2705-2712, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995564
Abstract: A convenient way of dealing with image sets is to represent them as points on Grassmannian manifolds. While several recent studies explored the applicability of discriminant analysis on such manifolds, the conventional formalism of discriminant analysis suffers from not considering the local structure of the data. We propose a discriminant analysis approach on Grassmannian manifolds, based on a graph-embedding framework. We show that by introducing within-class and between-class similarity graphs to characterise intra-class compactness and inter-class separability, the geometrical structure of data can be exploited. Experiments on several image datasets (PIE, BANCA, MoBo, ETH-80) show that the proposed algorithm obtains considerable improvements in discrimination accuracy, in comparison to three recent methods: Grassmann Discriminant Analysis (GDA), Kernel GDA, and the kernel version of Affine Hull Image Set Distance. We further propose a Grassmannian kernel, based on canonical correlation between subspaces, which can increase discrimination accuracy when used in combination with previous Grassmannian kernels.

Jain, Vidit; Learned-Miller, Erik; , "Online domain adaptation of a pre-trained cascade of classifiers," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.577-584, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995317
Abstract: Many classifiers are trained with massive training sets only to be applied at test time on data from a different distribution. How can we rapidly and simply adapt a classifier to a new test distribution, even when we do not have access to the original training data? We present an on-line approach for rapidly adapting a "black box" classifier to a new test data set without retraining the classifier or examining the original optimization criterion. Assuming the original classifier outputs a continuous number for which a threshold gives the class, we reclassify points near the original boundary using a Gaussian process regression scheme. We show how this general procedure can be used in the context of a classifier cascade, demonstrating performance that far exceeds state-of-the-art results in face detection on a standard data set. We also draw connections to work in semi-supervised learning, domain adaptation, and information regularization.
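A minimal numpy sketch of the reclassification idea, with illustrative variable names: points the black-box classifier scores far from its threshold keep their labels, and a Gaussian-process posterior mean (RBF kernel) re-scores the near-boundary points from them. This only illustrates the mechanism, not the paper's implementation:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

rng = np.random.default_rng(2)
X_conf = rng.standard_normal((40, 3))   # points scored far from the threshold
y_conf = np.sign(X_conf[:, 0])          # toy labels trusted from the classifier
X_near = rng.standard_normal((10, 3))   # near-boundary points to re-score

K = rbf(X_conf, X_conf) + 1e-2 * np.eye(len(X_conf))     # jitter = noise term
mean = rbf(X_near, X_conf) @ np.linalg.solve(K, y_conf)  # GP posterior mean
relabels = np.sign(mean)                # smoothed decision near the boundary
```

No access to the original training data or optimization criterion is needed: only the black-box scores on the new test set enter the regression.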

Parikh, Devi; Grauman, Kristen; , "Interactively building a discriminative vocabulary of nameable attributes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1681-1688, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995451
Abstract: Human-nameable visual attributes offer many advantages when used as mid-level features for object recognition, but existing techniques to gather relevant attributes can be inefficient (costing substantial effort or expertise) and/or insufficient (descriptive properties need not be discriminative). We introduce an approach to define a vocabulary of attributes that is both human understandable and discriminative. The system takes object/scene-labeled images as input, and returns as output a set of attributes elicited from human annotators that distinguish the categories of interest. To ensure a compact vocabulary and efficient use of annotators' effort, we 1) show how to actively augment the vocabulary such that new attributes resolve inter-class confusions, and 2) propose a novel "nameability" manifold that prioritizes candidate attributes by their likelihood of being associated with a nameable property. We demonstrate the approach with multiple datasets, and show its clear advantages over baselines that lack a nameability model or rely on a list of expert-provided attributes.

Rouf, Mushfiqur; Mantiuk, Rafal; Heidrich, Wolfgang; Trentacoste, Matthew; Lau, Cheryl; , "Glare encoding of high dynamic range images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.289-296, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995335
Abstract: Without specialized sensor technology or custom, multi-chip cameras, high dynamic range imaging typically involves time-sequential capture of multiple photographs. The obvious downside to this approach is that it cannot easily be applied to images with moving objects, especially if the motions are complex. In this paper, we take a novel view of HDR capture, which is based on a computational photography approach. We propose to first optically encode both the low dynamic range portion of the scene and highlight information into a low dynamic range image that can be captured with a conventional image sensor. This step is achieved using a cross-screen, or star filter. Second, we decode, in software, both the low dynamic range image and the highlight information. Lastly, these two portions can be combined to form an image of a higher dynamic range than the regular sensor dynamic range.

Dixon, Michael; Abrams, Austin; Jacobs, Nathan; Pless, Robert; , "On analyzing video with very small motions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1-8, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995703
Abstract: We characterize a class of videos consisting of very small but potentially complicated motions. We find that in these scenes, linear appearance variations have a direct relationship to scene motions. We show how to interpret appearance variations captured through a PCA decomposition of the image set as a scene-specific non-parametric motion basis. We propose fast, robust tools for dense flow estimates that are effective in scenes with small motions and potentially large image noise. We show example results in a variety of applications, including motion segmentation and long-term point tracking.
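The appearance-variation decomposition described above is a plain PCA of the vectorized frames; a numpy sketch with synthetic data, where the principal components play the role of the scene-specific motion basis:

```python
import numpy as np

rng = np.random.default_rng(3)
n_frames, h, w = 60, 16, 16
video = rng.standard_normal((n_frames, h * w))  # each row: one vectorized frame

mean = video.mean(axis=0)
U, s, Vt = np.linalg.svd(video - mean, full_matrices=False)  # PCA via SVD
k = 5
basis = Vt[:k]                       # top-k appearance modes (motion basis)
coeffs = (video - mean) @ basis.T    # per-frame coordinates in that basis
recon = coeffs @ basis + mean        # low-dimensional approximation of the video
```

For scenes with very small motions, the per-frame coefficients vary smoothly with the underlying displacement, which is what licenses reading the PCA modes as a non-parametric motion basis.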

Yin, Qi; Tang, Xiaoou; Sun, Jian; , "An associate-predict model for face recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.497-504, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995494
Abstract: Handling intra-personal variation is a major challenge in face recognition. It is difficult to appropriately measure the similarity between human faces under significantly different settings (e.g., pose, illumination, and expression). In this paper, we propose a new model, called the "Associate-Predict" (AP) model, to address this issue. The associate-predict model is built on an extra generic identity data set, in which each identity contains multiple images with large intra-personal variation. When considering two faces under significantly different settings (e.g., non-frontal and frontal), we first "associate" one input face with alike identities from the generic identity data set. Using the associated faces, we generatively "predict" the appearance of one input face under the setting of the other input face, or discriminatively "predict" the likelihood that the two input faces are from the same person. We call the two proposed prediction methods "appearance-prediction" and "likelihood-prediction". By leveraging an extra data set ("memory") and the "associate-predict" model, the intra-personal variation can be effectively handled. To improve the generalization ability of our model, we further add a switching mechanism: we directly compare the appearances of two faces if they have close intra-personal settings; otherwise, we use the associate-predict model for recognition. Experiments on two public face benchmarks (Multi-PIE and LFW) demonstrate that our final model can substantially improve the performance of most existing face recognition methods.

Budvytis, Ignas; Badrinarayanan, Vijay; Cipolla, Roberto; , "Semi-supervised video segmentation using tree structured graphical models," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2257-2264, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995600
Abstract: We present a novel, implementation friendly and occlusion aware semi-supervised video segmentation algorithm using tree structured graphical models, which delivers pixel labels along with their uncertainty estimates. Our motivation to employ supervision is to tackle a task-specific segmentation problem where the semantic objects are pre-defined by the user. The video model we propose for this problem is based on a tree structured approximation of a patch based undirected mixture model, which includes a novel time-series and a soft label Random Forest classifier participating in a feedback mechanism. We demonstrate the efficacy of our model in cutting out foreground objects and multi-class segmentation problems in lengthy and complex road scene sequences. Our results have wide applicability, including harvesting labelled video data for training discriminative models, shape/pose/articulation learning and large scale statistical analysis to develop priors for video segmentation.

Yoshiyasu, Yusuke; Yamazaki, Nobutoshi; , "Topology-adaptive multi-view photometric stereo," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1001-1008, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995576
Abstract: In this paper, we present a novel technique that enables capturing detailed 3D models from flash photographs by integrating shading and silhouette cues. Our main contribution is an optimization framework which not only captures subtle surface details but also handles changes in topology. To incorporate normals estimated from shading, we employ a mesh-based deformable model using deformation gradients. This method is capable of manipulating precise geometry and, in fact, it outperforms previous methods in terms of both accuracy and efficiency. To adapt the topology of the mesh, we convert the mesh into an implicit surface representation and then back to a mesh representation. This simple procedure removes self-intersecting regions of the mesh and solves the topology problem effectively. In addition to the algorithm, we introduce a hand-held setup to achieve multi-view photometric stereo. The key idea is to acquire flash photographs from a wide range of positions in order to obtain a sufficient lighting variation even with a standard flash unit attached to the camera. Experimental results show that our method can capture detailed shapes of various objects and cope with topology changes well.

Lu, Le; Bi, Jinbo; Wolf, Matthias; Salganicoff, Marcos; , "Effective 3D object detection and regression using probabilistic segmentation features in CT images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1049-1056, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995359
Abstract: 3D object detection and importance regression/ranking are at the core of semantically interpreting 3D medical images for computer-aided diagnosis (CAD). In this paper, we propose effective image segmentation features and a novel multiple instance regression method for solving the above challenges. We run a supervised-learning-based segmentation algorithm on numerous lesion candidates (3D VOIs: Volumes Of Interest in CT images), which can be true or false. By assessing the statistical properties in the joint space of the segmentation output (e.g., a 3D class-specific probability map, or cloud) and the original image appearance, 57 descriptive features in six subgroups are derived. The new feature set shows excellent performance in classifying ambiguous positive and negative VOIs for our CAD system for detecting colonic polyps in CT images. The proposed regression model on our segmentation-derived features behaves as a robust object (polyp) size/importance estimator and ranking module with high reliability, which is critical for automatic clinical reporting and cancer staging. Extensive evaluation is performed on a large clinical dataset of 770 CT scans from 12 medical sites, with the best state-of-the-art results.

Razavi, Nima; Gall, Juergen; Van Gool, Luc; , "Scalable multi-class object detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1505-1512, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995441
Abstract: Scalability of object detectors with respect to the number of classes is a very important issue for applications where many object classes need to be detected. While combining single-class detectors yields a linear complexity for testing, multi-class detectors that localize all objects at once often come at the cost of reduced detection accuracy. In this work, we present a scalable multi-class detection algorithm which scales sublinearly with the number of classes without compromising accuracy. To this end, a shared discriminative codebook of feature appearances is jointly trained for all classes and detection is also performed for all classes jointly. Based on the learned sharing distributions of features among classes, we build a taxonomy of object classes. The taxonomy is then exploited to further reduce the cost of multi-class object detection. Our method has linear training and sublinear detection complexity in the number of classes. We have evaluated our method on the challenging PASCAL VOC'06 and PASCAL VOC'07 datasets and show that scaling the system does not lead to a loss in accuracy.

Sadeghi, Mohammad Amin; Farhadi, Ali; , "Recognition using visual phrases," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1745-1752, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995711
Abstract: In this paper we introduce visual phrases, complex visual composites like "a person riding a horse". Visual phrases often display significantly reduced visual complexity compared to their component objects, because the appearance of those objects can change profoundly when they participate in relations. We introduce a dataset suitable for phrasal recognition that uses familiar PASCAL object categories, and demonstrate significant experimental gains resulting from exploiting visual phrases. We show that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects. We argue that any multi-class detection system must decode detector outputs to produce final results; this is usually done with non-maximum suppression. We describe a novel decoding procedure that can account accurately for local context without solving difficult inference problems. We show this decoding procedure outperforms the state of the art. Finally, we show that decoding a combination of phrasal and object detectors produces real improvements in detector results.
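The baseline decoding step mentioned above, greedy non-maximum suppression, is standard; a numpy sketch of it for reference (the paper's contribution is a context-aware replacement, which is not reproduced here):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]     # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping detections
    return keep
```

Greedy suppression ignores context entirely, which is exactly the limitation the proposed decoding procedure addresses.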

Harker, Matthew; O'Leary, Paul; , "Least squares surface reconstruction from gradients: Direct algebraic methods with spectral, Tikhonov, and constrained regularization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2529-2536, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995427
Abstract: This paper presents three new methods for regularizing the least squares solution of the reconstruction of a surface from its gradient field: firstly, spectral regularization based on discrete generalized Fourier series (e.g., Gram polynomials, Haar functions, etc.); secondly, Tikhonov regularization applied directly to the 2D domain problem; and thirdly, regularization via constraints such as arbitrary Dirichlet boundary conditions. It is shown that the solutions to the aforementioned problems all satisfy Sylvester equations, which leads to substantial computational gains; specifically, the solution of the Sylvester equation is direct (non-iterative) and, for an m × n surface, is of the same complexity as computing an SVD of the same size, i.e., an O(n³) algorithm. In contrast, state-of-the-art algorithms are based on large-scale linear solvers and use iterative techniques based on an O(n⁶) linear sub-step. To emphasize this improvement, it is demonstrated that the new algorithms are upwards of 10⁴ (ten thousand) times faster than the state-of-the-art techniques incorporating regularization. In fact, the new algorithms allow for the real-time regularized reconstruction of surfaces on the order of megapixels, which is unprecedented for this computer vision problem.
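The Sylvester structure AX + XB = Q is what enables the direct solution. A toy numpy check via the Kronecker (vectorized) form, which is only practical at small sizes; the paper's point is precisely that dedicated O(n³) Sylvester solvers (e.g., Bartels-Stewart) avoid this n² × n² system:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X_true = rng.standard_normal((n, n))
Q = A @ X_true + X_true @ B          # right-hand side of AX + XB = Q

# Column-major vec identity: vec(AX + XB) = (I kron A + B^T kron I) vec(X).
M = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(n))   # n^2 x n^2 system
X = np.linalg.solve(M, Q.flatten(order='F')).reshape((n, n), order='F')
```

Solving the n² × n² linear system costs O(n⁶), matching the iterative sub-step the paper criticizes; a Schur-decomposition-based Sylvester solver recovers the same X in O(n³).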

Sapp, Benjamin; Weiss, David; Taskar, Ben; , "Parsing human motion with stretchable models," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1281-1288, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995607
Abstract: We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, using temporal coupling of limb positions has made little to no difference in performance over parsing frames individually [8, 28]. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because of the tree structure of each submodel, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms much slower (by orders of magnitude) approximate inference using dual decomposition. We apply our pose model on a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.

Qi, Guo-Jun; Aggarwal, Charu; Rui, Yong; Tian, Qi; Chang, Shiyu; Huang, Thomas; , "Towards cross-category knowledge propagation for learning visual concepts," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.897-904, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995312
Abstract: In recent years, knowledge transfer algorithms have become one of the most active research areas in learning visual concepts. Most existing learning algorithms focus on leveraging a knowledge transfer process that is specific to a given category. However, in many cases, such a process may not be very effective when a particular target category has very few samples. In such cases, it is interesting to examine whether it is feasible to use cross-category knowledge to improve the learning process by exploring the knowledge in correlated categories. Such a task can be quite challenging due to variations in semantic similarities and differences between categories, which could either help or hinder the cross-category learning process. In order to address this challenge, we develop a cross-category label propagation algorithm, which can directly propagate inter-category knowledge at the instance level between the source and the target categories. Furthermore, this algorithm can automatically detect conditions under which the transfer process can be detrimental to the learning process. This provides a way to know when the transfer of cross-category knowledge is both useful and desirable. We present experimental results on real image and video data sets to demonstrate the effectiveness of our approach.

Gallagher, Andrew C.; Batra, Dhruv; Parikh, Devi; , "Inference for order reduction in Markov random fields," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1857-1864, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995452
Abstract: This paper presents an algorithm for order reduction of factors in High-Order Markov Random Fields (HOMRFs). Standard techniques for transforming arbitrary high-order factors into pairwise ones have been known for a long time. In this work, we take a fresh look at this problem with the following motivation: It is important to keep in mind that order reduction is followed by an inference procedure on the order-reduced MRF. Since there are many possible ways of performing order reduction, a technique that generates "easier" pairwise inference problems is a better reduction. With this motivation in mind, we introduce a new algorithm called Order Reduction Inference (ORI) that searches over a space of order reduction methods to minimize the difficulty of the resultant pairwise inference problem. We set up this search problem as an energy minimization problem. We show that application of ORI for order reduction outperforms known order reduction techniques both in simulated problems and in real-world vision applications.

Cao, Yang; Wang, Changhu; Zhang, Liqing; Zhang, Lei; , "Edgel index for large-scale sketch-based image search," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.761-768, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995460
Abstract: Retrieving images that match a hand-drawn sketch query is a highly desired feature, especially with the popularity of devices with touch screens. Although query-by-sketch has been extensively studied since the 1990s, it is still very challenging to build a real-time sketch-based image search engine on a large-scale database due to the lack of effective and efficient matching/indexing solutions. The explosive growth of web images and the phenomenal success of search techniques have encouraged us to revisit this problem and target web-scale sketch-based image retrieval. In this work, a novel index structure and a corresponding raw-contour-based matching algorithm are proposed to calculate the similarity between a sketch query and natural images, making sketch-based image retrieval scalable to millions of images. The proposed solution simultaneously considers storage cost, retrieval accuracy, and efficiency, and on this basis we have developed a real-time sketch-based image search engine indexing more than 2 million images. Extensive experiments on various retrieval tasks (basic shape search, specific image search, and similar image search) show better accuracy and efficiency than state-of-the-art methods.

Zhang, Honghui; Fang, Tian; Chen, Xiaowu; Zhao, Qinping; Quan, Long; , "Partial similarity based nonparametric scene parsing in certain environment," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2241-2248, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995348
Abstract: In this paper we propose a novel nonparametric image parsing method for the image parsing problem in a certain environment. A novel and efficient nearest-neighbor matching scheme, the ANN bilateral matching scheme, is proposed. Based on this matching scheme, we first retrieve partially similar images for each given test image from the training image database. The test image can be well explained by these retrieved images, with similar regions existing in the retrieved images for each region of the test image. Then, we match the test image to the retrieved training images with the ANN bilateral matching scheme, and parse the test image by integrating multiple cues in a Markov random field. Experiments on three datasets show that our method achieves promising parsing accuracy and outperforms two state-of-the-art nonparametric image parsing methods.

Prisacariu, Victor Adrian; Reid, Ian; , "Nonlinear shape manifolds as shape priors in level set segmentation and tracking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2185-2192, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995687
Abstract: We propose a novel nonlinear, probabilistic and variational method for adding shape information to level set-based segmentation and tracking. Unlike previous work, we represent shapes with elliptic Fourier descriptors and learn their lower dimensional latent space using Gaussian Process Latent Variable Models. Segmentation is done by a nonlinear minimisation of an image-driven energy function in the learned latent space. We combine it with a 2D pose recovery stage, yielding a single, one shot, optimisation of both shape and pose. We demonstrate the performance of our method, both qualitatively and quantitatively, with multiple images, video sequences and latent spaces, capturing both shape kinematics and object class variance.

Wang, Heng; Klaser, Alexander; Schmid, Cordelia; Liu, Cheng-Lin; "Action recognition by dense trajectories," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3169-3176, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995407
Abstract: Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or by matching SIFT descriptors between frames. However, the quality as well as quantity of these trajectories is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach to describe videos by dense trajectories. We sample dense points from each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well. We also investigate how to design descriptors to encode the trajectory information. We introduce a novel descriptor based on motion boundary histograms, which is robust to camera motion. This descriptor consistently outperforms other state-of-the-art descriptors, in particular in uncontrolled realistic videos. We evaluate our video description in the context of action classification with a bag-of-features approach. Experimental results show a significant improvement over the state of the art on four datasets of varying difficulty, i.e. KTH, YouTube, Hollywood2 and UCF sports.
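The trajectory-building step described above (sample points densely, then chain them through per-frame dense optical flow) can be sketched as follows. The `track_dense_points` helper and its nearest-pixel flow lookup are illustrative simplifications, not the authors' implementation (which, among other things, median-filters the flow field):

```python
import numpy as np

def track_dense_points(points, flows):
    """Propagate seed points through a sequence of dense flow fields.

    points: list of (x, y) seeds; flows: list of (H, W, 2) arrays of (dx, dy).
    Returns one trajectory (list of (x, y) tuples) per seed point.
    """
    H, W = flows[0].shape[:2]
    trajs = [[tuple(p)] for p in points]
    pts = np.asarray(points, dtype=float)
    for flow in flows:
        # Nearest-pixel lookup of the flow at each current point position.
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, H - 1)
        pts = pts + flow[yi, xi]
        for t, p in zip(trajs, pts):
            t.append((float(p[0]), float(p[1])))
    return trajs

# Toy data: constant flow of (1, 0), so every point drifts one pixel right per frame.
flows = [np.tile([1.0, 0.0], (4, 4, 1)) for _ in range(3)]
trajs = track_dense_points([(0, 0), (1, 2)], flows)
```

In the paper, descriptors (HOG, HOF, motion boundary histograms) are then computed in a space-time volume aligned with each such trajectory.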

Bychkovsky, Vladimir; Paris, Sylvain; Chan, Eric; Durand, Fredo; "Learning photographic global tonal adjustment with a database of input/output image pairs," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.97-104, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995413
Abstract: Adjusting photographs to obtain compelling renditions requires skill and time. Even contrast and brightness adjustments are challenging because they require taking into account the image content. Photographers are also known for having different retouching preferences. As a result of this complexity, rule-based, one-size-fits-all automatic techniques often fail. This problem can greatly benefit from supervised machine learning but the lack of training data has impeded work in this area. Our first contribution is the creation of a high-quality reference dataset. We collected 5,000 photos, manually annotated them, and hired 5 trained photographers to retouch each picture. The result is a collection of 5 sets of 5,000 example input-output pairs that enable supervised learning. We first use this dataset to predict a user's adjustment from a large training set. We then show that our dataset and features enable accurate adjustment personalization using a carefully chosen set of training photos. Finally, we introduce difference learning: this method models and predicts differences between users. It frees the user from using predetermined photos for training. We show that difference learning enables accurate prediction using only a handful of examples.

Weinzaepfel, Philippe; Jegou, Herve; Perez, Patrick; "Reconstructing an image from its local descriptors," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.337-344, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995616
Abstract: This paper shows that an image can be approximately reconstructed from the output of a blackbox local description software such as those classically used for image indexing. Our approach consists of first using an off-the-shelf image database to find patches that are visually similar to each region of interest of the unknown input image, according to the associated local descriptors. These patches are then warped into the input image domain according to interest region geometry and seamlessly stitched together. Final completion of still-missing texture-free regions is obtained by smooth interpolation. As demonstrated in our experiments, visually meaningful reconstructions are obtained just from image local descriptors like SIFT, provided the geometry of the regions of interest is known. The reconstruction most often allows clear interpretation of the semantic image content. As a result, this work raises critical issues of privacy and rights when local descriptors of photos or videos are given away for indexing and search purposes.

Ma, Wenye; Morel, Jean-Michel; Osher, Stanley; Chien, Aichi; "An L1-based variational model for Retinex theory and its application to medical images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.153-160, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995422
Abstract: The human visual system (HVS) can perceive constant color under varying illumination conditions, while digital images record information of both the reflectance (physical color) of objects and the illumination. Retinex theory, formulated by Edwin H. Land, aimed to simulate and explain this feature of the HVS. However, recovering the reflectance from a given image is in general an ill-posed problem. In this paper, we establish an L1-based variational model for Retinex theory that can be solved by a fast computational approach based on Bregman iteration. Compared with previous works, our L1-Retinex method is more accurate for recovering the reflectance, which is illustrated by examples and statistics. In medical images such as magnetic resonance imaging (MRI), intensity inhomogeneity is often encountered due to bias fields. This formulation is similar to Retinex theory, although MRI data have some specific properties. We then modify the L1-Retinex method and develop a new algorithm for MRI data. We demonstrate the performance of our method by comparison with previous work on simulated and real data.
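Bregman-iteration (split Bregman) solvers for L1 models of this kind repeatedly apply a per-element soft-thresholding (shrinkage) step. A generic sketch of that building block, not the authors' full Retinex solver, is:

```python
import numpy as np

def shrink(v, lam):
    """Soft-thresholding (shrinkage) operator: the closed-form solution of
    min_u lam*|u| + 0.5*(u - v)^2, applied element-wise. This is the
    standard inner step of split-Bregman L1 solvers (generic sketch)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Values with magnitude below the threshold are zeroed; the rest shrink toward 0.
x = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
y = shrink(x, 1.0)
```

In a full solver this step alternates with a quadratic subproblem (e.g. a Poisson solve) and a Bregman variable update.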

Li, Binlong; Ayazoglu, Mustafa; Mao, Teresa; Camps, Octavia I.; Sznaier, Mario; "Activity recognition using dynamic subspace angles," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3193-3200, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995672
Abstract: Cameras are ubiquitous and hold the promise of significantly changing the way we live and interact with our environment. Human activity recognition is central to understanding dynamic scenes for applications ranging from security surveillance, to assisted living for the elderly, to video gaming without controllers. Most current approaches to this problem are based on the use of local temporal-spatial features, which limits their ability to recognize long and complex actions. In this paper, we propose a new approach to exploit the temporal information encoded in the data. The main idea is to model activities as the output of unknown dynamic systems evolving from unknown initial conditions. Under this framework, we show that activity videos can be compared by computing the principal angles between subspaces representing activity types, which are found by a simple SVD of the experimental data. The proposed approach outperforms state-of-the-art methods classifying activities in the KTH dataset as well as in much more complex scenarios involving interacting actors.
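The subspace comparison the abstract describes can be illustrated with a small sketch: the principal angles between two column spaces are the arccosines of the singular values of the product of their orthonormal bases. The `principal_angles` helper is a hypothetical name, and the two "activity subspaces" here are toy data:

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B.

    Illustrative helper, not the authors' code: orthonormalize each
    basis, then read the angle cosines off the SVD of Qa^T Qb.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Two subspaces of R^3 sharing the direction e1; the other directions are orthogonal.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
angles = principal_angles(A, B)
```

The smallest angle is 0 (the shared direction) and the largest is pi/2, so a distance built from these angles would report the subspaces as partially similar.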

Geiger, Andreas; Lauer, Martin; Urtasun, Raquel; "A generative model for 3D urban scene understanding from movable platforms," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1945-1952, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995641
Abstract: 3D scene understanding is key for the success of applications such as autonomous driving and robot navigation. However, existing approaches either produce a mild level of understanding, e.g., segmentation, object detection, or are not accurate enough for these applications, e.g., 3D pop-ups. In this paper we propose a principled generative model of 3D urban scenes that takes into account dependencies between static and dynamic features. We derive a reversible jump MCMC scheme that is able to infer the geometric (e.g., street orientation) and topological (e.g., number of intersecting streets) properties of the scene layout, as well as the semantic activities occurring in the scene, e.g., traffic situations at an intersection. Furthermore, we show that this global level of understanding provides the context necessary to disambiguate current state-of-the-art detectors. We demonstrate the effectiveness of our approach on a dataset composed of short stereo video sequences of 113 different scenes captured by a car driving around a mid-size city.

Schelten, Kevin; Roth, Stefan; "Connecting non-quadratic variational models and MRFs," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2641-2648, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995498
Abstract: Spatially-discrete Markov random fields (MRFs) and spatially-continuous variational approaches are ubiquitous in low-level vision, including image restoration, segmentation, optical flow, and stereo. Even though both families of approaches are fairly similar on an intuitive level, they are frequently seen as being technically rather distinct since they operate on different domains. In this paper we explore their connections and develop a direct, rigorous link with a particular emphasis on first-order regularizers. By representing spatially-continuous functions as linear combinations of finite elements with local support and performing explicit integration of the variational objective, we derive MRF potentials that make the resulting MRF energy equivalent to the variational energy functional. In contrast to previous attempts, we provide an explicit connection for modern non-quadratic regularizers and also integrate the data term. The established connection opens certain classes of MRFs to spatially-continuous interpretations and variational formulations to a broad range of probabilistic learning and inference algorithms.

Huang, Qixing; Han, Mei; Wu, Bo; Ioffe, Sergey; "A hierarchical conditional random field model for labeling and segmenting images of street scenes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1953-1960, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995571
Abstract: Simultaneously segmenting and labeling images is a fundamental problem in Computer Vision. In this paper, we introduce a hierarchical CRF model to deal with the problem of labeling images of street scenes by several distinctive object classes. In addition to learning a CRF model from all the labeled images, we group images into clusters of similar images and learn a CRF model from each cluster separately. When labeling a new image, we pick the closest cluster and use the associated CRF model to label this image. Experimental results show that this hierarchical image labeling method is comparable to, and in many cases superior to, previous methods on benchmark data sets. In addition to segmentation and labeling results, we also show how to apply the image labeling result to rerank Google similar images.

Charpiat, Guillaume; "Exhaustive family of energies minimizable exactly by a graph cut," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1849-1856, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995567
Abstract: Graph cuts are widely used in many fields of computer vision in order to minimize certain classes of energies in low polynomial time. These specific classes depend on the way chosen to build the graphs representing the problems to solve. We study here all possible ways of building graphs and the energies they minimize, leading to the exhaustive family of energies minimizable exactly by a graph cut. To do this, we consider the issue of coding pixel labels as states of the graph, i.e. the choice of state interpretations. The family obtained comprises many new classes, in particular energies that do not satisfy the submodularity condition, including energies that are not even permuted-submodular. A generating subfamily is studied in detail; in particular we propose a canonical form to represent Markov random fields, which proves useful to recognize energies in this subfamily in linear complexity almost surely, and then to build the associated graph in quasilinear time. A few experiments are performed to illustrate the new possibilities offered.
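For context, the classical submodularity condition that standard graph-cut constructions require (and that some of the new classes above go beyond) is easy to state for a pairwise binary energy:

```python
import numpy as np

def is_submodular(theta):
    """Submodularity condition for a pairwise binary term theta[xi, xj]:
    theta(0,0) + theta(1,1) <= theta(0,1) + theta(1,0).
    Energies whose pairwise terms all satisfy this can be minimized
    exactly by a standard s-t graph cut."""
    return theta[0, 0] + theta[1, 1] <= theta[0, 1] + theta[1, 0]

# Potts-style smoothness (penalize disagreement) is submodular;
# its opposite (reward disagreement) is not.
potts = np.array([[0.0, 1.0], [1.0, 0.0]])
anti = np.array([[1.0, 0.0], [0.0, 1.0]])
```

The paper's contribution is precisely to characterize what remains graph-cut-minimizable when this check fails, via alternative state interpretations.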

Bo, Liefeng; Lai, Kevin; Ren, Xiaofeng; Fox, Dieter; "Object recognition with hierarchical kernel descriptors," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1729-1736, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995719
Abstract: Kernel descriptors [1] provide a unified way to generate rich visual feature sets by turning pixel attributes into patch-level features, and yield impressive results on many object recognition tasks. However, the best results with kernel descriptors are achieved using efficient match kernels in conjunction with nonlinear SVMs, which makes them impractical for large-scale problems. In this paper, we propose hierarchical kernel descriptors that apply kernel descriptors recursively to form image-level features and thus provide a conceptually simple and consistent way to generate image-level features from pixel attributes. More importantly, hierarchical kernel descriptors allow linear SVMs to yield state-of-the-art accuracy while being scalable to large datasets. They can also be naturally extended to extract features over depth images. We evaluate hierarchical kernel descriptors both on the CIFAR10 dataset and the new RGB-D Object Dataset consisting of segmented RGB and depth images of 300 everyday objects.

Batra, Dhruv; Kohli, Pushmeet; "Making the right moves: Guiding alpha-expansion using local primal-dual gaps," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1865-1872, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995449
Abstract: This paper presents a new adaptive graph-cut based move-making algorithm for energy minimization. Traditional move-making algorithms such as Expansion and Swap operate by searching for better solutions in some predefined move spaces around the current solution. In contrast, our algorithm uses the primal-dual interpretation of the Expansion-move algorithm to adaptively compute the best move-space to search over. At each step, it tries to greedily find the move-space that will lead to the biggest decrease in the primal-dual gap. We test different variants of our algorithm on a variety of image labelling problems such as object segmentation and stereo. Experimental results show that our adaptive strategy significantly outperforms the conventional Expansion-move algorithm, in some cases cutting the runtime by 50%.

Lim, Jongwoo; Frahm, Jan-Michael; Pollefeys, Marc; "Online environment mapping," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3489-3496, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995511
Abstract: This paper proposes vision-based online mapping of large-scale environments. Our novel approach uses a hybrid representation of a fully metric Euclidean environment map and a topological map. This hybrid representation facilitates our scalable online hierarchical bundle adjustment approach. The proposed method achieves scalability by solving the local registration through embedding neighboring keyframes and landmarks into a Euclidean space. The global adjustment is performed on a segmentation of the keyframes and posed as the iterative optimization of the arrangement of keyframes in each segment and the arrangement of rigidly moving segments. The iterative global adjustment is performed concurrently with the local registration of the keyframes in a local map. Thus the map is always locally metric around the current location, and likely to be globally consistent. Loop closures are handled very efficiently, benefiting from the topological nature of the map and avoiding the loss of metric map properties that afflicts previous approaches. The effectiveness of the proposed method is demonstrated in real-time on various challenging video sequences.

Zhou, Luping; Wang, Yaping; Li, Yang; Yap, Pew-Thian; Shen, Dinggang; Adni; "Hierarchical anatomical brain networks for MCI prediction by partial least square analysis," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1073-1080, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995689
Abstract: Owing to its clinical accessibility, T1-weighted MRI has been extensively studied for the prediction of mild cognitive impairment (MCI) and Alzheimer's disease (AD). The tissue volumes of GM, WM and CSF are the most commonly used measures for MCI and AD prediction. We note that disease-induced structural changes may not happen at isolated spots, but in several inter-related regions. Therefore, in this paper we propose to directly extract inter-region connectivity based features for MCI prediction. This involves constructing a brain network for each subject, with each node representing an ROI and each edge representing regional interactions. This network is also built hierarchically to improve the robustness of classification. Compared with conventional methods, our approach produces a significantly larger pool of features, which, if improperly dealt with, will result in intractability when used for classifier training. Therefore, based on the characteristics of the network features, we employ Partial Least Square analysis to efficiently reduce the feature dimensionality to a manageable level while at the same time preserving as much discriminative information as possible. Our experiment demonstrates that without requiring any new information in addition to T1-weighted images, the prediction accuracy of MCI is statistically improved.

Niu, Zhenxing; Hua, Gang; Gao, Xinbo; Tian, Qi; "Spatial-DiscLDA for visual recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1769-1776, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995426
Abstract: Topic models such as pLSA, LDA and their variants have been widely adopted for visual recognition. However, most of the adopted models, if not all, are unsupervised, neglecting the valuable supervised labels during model training. In this paper, we exploit recent advancement in supervised topic modeling, more particularly the DiscLDA model, for object recognition. We extend it to a part based visual representation to automatically identify and model different object parts. We call the proposed model Spatial-DiscLDA (S-DiscLDA). It models the appearances and locations of the object parts simultaneously, and also takes the supervised labels into consideration. It can be directly used as a classifier to recognize the object. This is performed by an approximate inference algorithm based on Gibbs sampling and bridge sampling methods. We examine the performance of our model by comparing it with another supervised topic model on two scene category datasets, i.e., the LabelMe and UIUC-Sport datasets. We also compare our approach with other approaches which model spatial structures of visual features on the popular Caltech-4 dataset. The experimental results illustrate that it provides competitive performance.

Sfikas, Giorgos; Nikou, Christophoros; Galatsanos, Nikolaos; Heinrich, Christian; "Majorization-minimization mixture model determination in image segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2169-2176, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995349
Abstract: A new Bayesian model for image segmentation based on a Gaussian mixture model is proposed. The model structure allows the automatic determination of the number of segments while ensuring spatial smoothness of the final output. This is achieved by defining two separate mixture weight sets: the first set of weights is spatially variant and incorporates an MRF edge-preserving smoothing prior; the second set of weights is governed by a Dirichlet prior in order to prune unnecessary mixture components. The model is trained using variational inference and the Majorization-Minimization (MM) algorithm, resulting in closed-form parameter updates. The algorithm was successfully evaluated in terms of various segmentation indices using the Berkeley image database.

Liang, Chao; Xu, Changsheng; Cheng, Jian; Lu, Hanqing; "TVParser: An automatic TV video parsing method," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3377-3384, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995681
Abstract: In this paper, we propose an automatic approach to simultaneously name faces and discover scenes in TV shows. We follow the multi-modal idea of utilizing the script to assist video content understanding, but without using timestamps (provided by script-subtitles alignment) as the connection. Instead, the temporal relation between faces in the video and names in the script is investigated in our approach, and a globally optimal video-script alignment is inferred according to the character correspondence. The contribution of this paper is two-fold: (1) we propose a generative model, named TVParser, to depict the temporal character correspondence between video and script, from which the face-name relationship can be automatically learned as a model parameter, and meanwhile, the video scene structure can be effectively inferred as a hidden state sequence; (2) we find fast algorithms to accelerate both model parameter learning and state inference, resulting in an efficient and globally optimal alignment. We conduct extensive comparative experiments on popular TV series and report comparable and even superior performance over existing methods.

Hornacek, Michael; Maierhofer, Stefan; "Extracting vanishing points across multiple views," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.953-960, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995396
Abstract: The realization that we see lines known to be parallel in space as lines that appear to converge in a corresponding vanishing point has led to techniques employed by artists since at least the Renaissance to render a credible impression of perspective. More recently, it has also led to techniques for recovering information embedded in images pertaining to the geometry of their underlying scene. In this paper, we explore the extraction of vanishing points in the aim of facilitating the reconstruction of Manhattan-world scenes. In departure from most vanishing point extraction methods, ours extracts a constellation of vanishing points corresponding, respectively, to the scene's two or three dominant pairwise-orthogonal orientations by integrating information across multiple views rather than from a single image alone. What makes a multiple-view approach attractive is that in addition to increasing robustness to segments that do not correspond to any of the three dominant orientations, robustness is also increased with respect to inaccuracies in the extracted segments themselves.
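A single-image version of the underlying geometry can be sketched with homogeneous coordinates: each segment's endpoints determine a line via the cross product, and the common intersection (the vanishing point) is the near-null space of the stacked line equations. This least-squares toy, with hypothetical names, omits the paper's multi-view integration and orthogonality constraints:

```python
import numpy as np

def vanishing_point(segments):
    """Estimate the common intersection of 2D line segments.

    Each segment ((x1, y1), (x2, y2)) yields a homogeneous line via the
    cross product of its endpoints; the right singular vector with the
    smallest singular value of the stacked lines is their (least-squares)
    common point. Illustrative sketch, not the paper's method.
    """
    lines = []
    for (x1, y1), (x2, y2) in segments:
        l = np.cross([x1, y1, 1.0], [x2, y2, 1.0])
        lines.append(l / np.linalg.norm(l))
    _, _, Vt = np.linalg.svd(np.array(lines))
    v = Vt[-1]
    return v[:2] / v[2]  # back to inhomogeneous image coordinates

# Two segments lying on lines that intersect at (2, 3).
segs = [((0.0, 0.0), (1.0, 1.5)),   # line y = 1.5x
        ((0.0, 3.0), (1.0, 3.0))]   # line y = 3
vp = vanishing_point(segs)
```

Segments on parallel image lines would instead give a point at (or near) infinity, i.e. a very small third homogeneous coordinate.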

Pedersoli, Marco; Vedaldi, Andrea; Gonzalez, Jordi; "A coarse-to-fine approach for fast deformable object detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1353-1360, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995668
Abstract: We present a method that can dramatically accelerate object detection with part based models. The method is based on the observation that the cost of detection is likely to be dominated by the cost of matching each part to the image, and not by the cost of computing the optimal configuration of the parts as commonly assumed. Therefore accelerating detection requires minimizing the number of part-to-image comparisons. To this end we propose a multiple-resolutions hierarchical part based model and a corresponding coarse-to-fine inference procedure that recursively eliminates from the search space unpromising part placements. The method yields a ten-fold speedup over the standard dynamic programming approach and is complementary to the cascade-of-parts approach of [9]. Compared to the latter, our method does not have parameters to be determined empirically, which simplifies its use during the training of the model. Most importantly, the two techniques can be combined to obtain a very significant speedup, of two orders of magnitude in some cases. We evaluate our method extensively on the PASCAL VOC and INRIA datasets, demonstrating a very high increase in the detection speed with little degradation of the accuracy.

Cabrera, Reyes Rios; Tuytelaars, Tinne; Van Gool, Luc; "Efficient multi-camera detection, tracking, and identification using a shared set of haar-features," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.65-71, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995735
Abstract: This paper presents an integrated solution for the problem of detecting, tracking and identifying vehicles in a tunnel surveillance application, taking into account practical constraints including realtime operation, poor imaging conditions, and a decentralized architecture. Vehicles are followed through the tunnel by a network of non-overlapping cameras. They are detected and tracked in each camera and then identified, i.e. matched to any of the vehicles detected in the previous camera(s). To limit the computational load, we propose to reuse the same set of Haar-features for each of these steps. For the detection, we use an Adaboost cascade. Here we introduce a composite confidence score, integrating information from all stages of the cascades. A subset of the features used for detection is then selected, optimizing for the identification problem. This results in a compact binary ‘vehicle fingerprint’, requiring very limited bandwidth. Finally, we show that the same set of features can also be used for tracking. This Haar-feature-based ‘tracking-by-identification’ yields surprisingly good results on standard datasets, without the need to update the model online.

Bruna, Joan; Mallat, Stephane; "Classification with scattering operators," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1561-1566, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995635
Abstract: A scattering vector is a local descriptor including multiscale and multi-direction co-occurrence information. It is computed with a cascade of wavelet decompositions and complex modulus. This scattering representation is locally translation invariant and linearizes deformations. A supervised classification algorithm is computed with a PCA model selection on scattering vectors. State-of-the-art results are obtained for handwritten digit recognition and texture classification.
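A toy one-dimensional analogue of the wavelet-plus-modulus cascade might look like the following; real scattering uses complex 2D wavelets over many scales and orientations, so this sketch (with an illustrative `scattering_1d` name) only conveys the cascade structure:

```python
import numpy as np

def scattering_1d(x, num_orders=3):
    """Toy scattering cascade: repeatedly apply a Haar high-pass
    convolution and take the modulus, averaging each intermediate
    signal to obtain a locally translation-invariant coefficient.
    Illustrative only, not the paper's operator."""
    h = np.array([1.0, -1.0]) / np.sqrt(2.0)   # Haar high-pass filter
    feats = [x.mean()]                          # zeroth-order coefficient
    u = x
    for _ in range(num_orders):
        u = np.abs(np.convolve(u, h, mode="same"))  # wavelet then modulus
        feats.append(u.mean())                      # averaged (invariant) output
    return np.array(feats)

f = scattering_1d(np.ones(8))
```

The modulus is what makes the representation stable: it discards phase (fine positional detail) while the averaging yields the translation invariance mentioned in the abstract.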

Liu, Yebin; Stoll, Carsten; Gall, Juergen; Seidel, Hans-Peter; Theobalt, Christian; "Markerless motion capture of interacting characters using multi-view image segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1249-1256, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995424
Abstract: We present a markerless motion capture approach that reconstructs the skeletal motion and detailed time-varying surface geometry of two closely interacting people from multi-view video. Due to ambiguities in feature-to-person assignments and frequent occlusions, it is not feasible to directly apply single-person capture approaches to the multi-person case. We therefore propose a combined image segmentation and tracking approach to overcome these difficulties. A new probabilistic shape and appearance model is employed to segment the input images and to assign each pixel uniquely to one person. Thereafter, a single-person markerless motion and surface capture approach can be applied to each individual, either one-by-one or in parallel, even under strong occlusions. We demonstrate the performance of our approach on several challenging multi-person motions, including dance and martial arts, and also provide a reference dataset for multi-person motion capture with ground truth.

Deng, Jia; Berg, Alexander C.; Fei-Fei, Li; "Hierarchical semantic indexing for large scale image retrieval," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.785-792, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995516
Abstract: This paper addresses the problem of similar image retrieval, especially in the setting of large-scale datasets with millions to billions of images. The core novel contribution is an approach that can exploit prior knowledge of a semantic hierarchy. When semantic labels and a hierarchy relating them are available during training, significant improvements over the state of the art in similar image retrieval are attained. While some of this advantage comes from the ability to use additional information, experiments exploring a special case where no additional data is provided, show the new approach can still outperform OASIS [6], the current state of the art for similarity learning. Exploiting hierarchical relationships is most important for larger scale problems, where scalability becomes crucial. The proposed learning approach is fundamentally parallelizable and as a result scales more easily than previous work. An additional contribution is a novel hashing scheme (for bilinear similarity on vectors of probabilities, optionally taking into account hierarchy) that is able to reduce the computational cost of retrieval. Experiments are performed on Caltech256 and the larger ImageNet dataset.
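The bilinear similarity on probability vectors mentioned above has the form p^T W q. A minimal sketch in which W encodes a hand-written label affinity (the paper learns such structure from the semantic hierarchy, and then hashes the resulting similarity) is:

```python
import numpy as np

def bilinear_similarity(p, q, W):
    """Bilinear similarity p^T W q between two probability vectors.

    W is an illustrative pairwise label-affinity matrix, not the
    paper's learned similarity: here labels 0 and 1 share a parent
    in a hypothetical hierarchy, so they get affinity 0.5.
    """
    return float(p @ W @ q)

W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
p = np.array([1.0, 0.0, 0.0])   # image confidently labelled 0
q = np.array([0.0, 1.0, 0.0])   # image confidently labelled 1 (sibling of 0)
r = np.array([0.0, 0.0, 1.0])   # image with an unrelated label
```

With identical labels the score is 1; siblings under the hierarchy still score 0.5, while unrelated labels score 0, which is exactly the graded notion of similarity a flat (identity-W) comparison cannot express.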

Cao, Yu; Ju, Lili; Zou, Qin; Qu, Chengzhang; Wang, Song; "A Multichannel Edge-Weighted Centroidal Voronoi Tessellation algorithm for 3D super-alloy image segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.17-24, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995590
Abstract: In material science and engineering, the grain structure inside a super-alloy sample determines its mechanical and physical properties. In this paper, we develop a new Multichannel Edge-Weighted Centroidal Voronoi Tessellation (MCEWCVT) algorithm to automatically segment all the 3D grains from microscopic images of a super-alloy sample. Built upon the classical k-means/CVT algorithm, the proposed algorithm considers both the voxel-intensity similarity within each cluster and the compactness of each cluster. In addition, the same slice of a super-alloy sample can produce multiple images with different grain appearances using different settings of the microscope. We call this multichannel imaging, and in this paper we further adapt the proposed segmentation algorithm to handle such multichannel images to achieve higher grain-segmentation accuracy. We test the proposed MCEWCVT algorithm on a 4-channel Ni-based 3D super-alloy image consisting of 170 slices. The segmentation performance is evaluated against the manually annotated ground-truth segmentation and quantitatively compared with six other image segmentation/edge-detection methods. The experimental results demonstrate the higher accuracy of the proposed algorithm over the comparison methods.

Mukherjee, Lopamudra; Singh, Vikas; Peng, Jiming; , "Scale invariant cosegmentation for image groups," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1881-1888, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995420
Abstract: Our primary interest is in generalizing the problem of Cosegmentation to a large group of images; that is, concurrent segmentation of common foreground region(s) from multiple images. We further wish for our algorithm to offer scale invariance (foregrounds may have arbitrary sizes in different images) and a running time that increases (no more than) near linearly in the number of images in the set. What makes this setting particularly challenging is that even if we ignore the scale invariance desideratum, the Cosegmentation problem, as formalized in many recent papers (except [1]), is already hard to solve optimally in the two image case. A straightforward extension of such models to multiple images leads to loose relaxations; and unless we impose a distributional assumption on the appearance model, existing mechanisms for image-pair-wise measurement of foreground appearance variations lead to significantly large problem sizes (even for a moderate number of images). This paper presents a surprisingly easy to implement algorithm which performs well, and satisfies all requirements listed above (scale invariance, low computational requirements, and viability for the multiple image setting). We present a qualitative and technical analysis of the properties of this framework.

Yang, Ming; Zhu, Shenghuo; Lv, Fengjun; Yu, Kai; , "Correspondence driven adaptation for human profile recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.505-512, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995481
Abstract: Visual recognition systems for videos using statistical learning models often show degraded performance when deployed to a real-world environment, primarily because training data can hardly cover sufficient variations in reality. To alleviate this issue, we propose to utilize the object correspondences in successive frames as weak supervision to adapt visual recognition models, which is particularly suitable for human profile recognition. Specifically, we instantiate this new strategy on an advanced convolutional neural network (CNN) based system to estimate human gender, age, and race. We enforce the system to output consistent and stable results on face images from the same trajectories in videos by using incremental stochastic training. Our baseline system already achieves competitive performance on gender and age estimation as compared to the state-of-the-art algorithms on the FG-NET database. Further, on two new video datasets containing about 900 persons, the proposed supervision of correspondences improves the estimation accuracy by a large margin over the baseline.

Kurz, Daniel; Benhimane, Selim; , "Inertial sensor-aligned visual feature descriptors," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.161-166, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995339
Abstract: We propose to align the orientation of local feature descriptors with the gravitational force measured by inertial sensors. In contrast to standard approaches, which derive a reproducible feature orientation from the intensities of neighboring pixels to remain invariant against rotation, this approach results in clearly distinguishable descriptors for congruent features in different orientations. Gravity-aligned feature descriptors (GAFD) are suitable for any application relying on corresponding points in multiple images of static scenes and are particularly beneficial in the presence of differently oriented repetitive features, which are widespread in urban scenes and on man-made objects. In this paper, we show with different examples that feature description and matching both become faster and yield better matches when the descriptors are aligned with gravity, compared to traditional techniques.

Brutzer, Sebastian; Hoferlin, Benjamin; Heidemann, Gunther; , "Evaluation of background subtraction techniques for video surveillance," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1937-1944, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995508
Abstract: Background subtraction is one of the key techniques for automatic video analysis, especially in the domain of video surveillance. Despite its importance, evaluations of recent background subtraction methods with respect to the challenges of video surveillance suffer from various shortcomings. To address this issue, we first identify the main challenges of background subtraction in the field of video surveillance. We then compare the performance of nine background subtraction methods with post-processing according to their ability to meet those challenges. To this end, we introduce a new evaluation data set with accurate ground truth annotations and shadow masks. This enables us to provide a precise in-depth evaluation of the strengths and drawbacks of background subtraction methods.

Tappen, Marshall F.; , "Recovering shape from a single image of a mirrored surface from curvature constraints," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2545-2552, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995376
Abstract: This paper presents models and algorithms for estimating the shape of a mirrored surface from a single image of that surface, rendered under an unknown, natural illumination. While the unconstrained nature of this problem seems to make shape recovery impossible, the curvature of the surface causes characteristic image patterns to appear. These image patterns can be used to estimate how the surface curves in different directions. We show how these estimates produce constraints from which the shape of the surface can be estimated. This approach is demonstrated on simple surfaces rendered under both natural and synthetic illuminations.

Hartley, Richard; Aftab, Khurrum; Trumpf, Jochen; , "L1 rotation averaging using the Weiszfeld algorithm," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3041-3048, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995745
Abstract: We consider the problem of rotation averaging under the L1 norm. This problem is related to the classic Fermat-Weber problem for finding the geometric median of a set of points in R^n. We apply the classical Weiszfeld algorithm to this problem, adapting it iteratively in tangent spaces of SO(3) to obtain a provably convergent algorithm for finding the L1 mean. This results in an extremely simple and rapid averaging algorithm, without the need for line search. The choice of L1 mean (also called geometric median) is motivated by its greater robustness compared with rotation averaging under the L2 norm (the usual averaging process). We apply this algorithm to both single-rotation averaging (where the algorithm provably finds the global L1 optimum) and multiple rotation averaging (where no such proof exists). The algorithm is demonstrated to give markedly improved results, compared with L2 averaging. We achieve a median rotation error of 0.82 degrees on the 595 images of the Notre Dame image set.
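The Weiszfeld iteration the paper adapts to SO(3) is simple enough to sketch. Below is the classical Euclidean version for the geometric (L1) median, as a hedged illustration assuming numpy; the rotation case replaces the vector arithmetic with log/exp maps in the tangent space of SO(3):

```python
import numpy as np

def weiszfeld(points, iters=100, eps=1e-9):
    """Geometric (L1) median of points in R^n via Weiszfeld iterations.
    Each step re-weights the points by inverse distance to the current
    estimate, so distant outliers get progressively less influence."""
    y = points.mean(0)  # start from the L2 mean
    for _ in range(iters):
        d = np.linalg.norm(points - y, axis=1)
        w = 1.0 / np.maximum(d, eps)  # guard against division by zero
        y_new = (w[:, None] * points).sum(0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y
```

The robustness claim in the abstract is visible here: one far-away point barely moves the L1 median, whereas it shifts the L2 mean substantially.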

Kulkarni, Naveen; Li, Baoxin; , "Discriminative affine sparse codes for image classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1609-1616, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995701
Abstract: Images in general are captured under a diverse set of conditions. An image of the same object can be captured with varied poses, illuminations, scales, backgrounds and probably different camera parameters. The task of image classification then lies in forming features of the input images in a representational space where classifiers can be better supported in spite of the above variations. Existing methods have mostly focused on obtaining features which are invariant to scale and translation, and thus they generally suffer from performance degradation on datasets which consist of images with varied poses or camera orientations. In this paper we present a new framework for image classification, which is built upon a novel way of feature extraction that generates largely affine-invariant features called affine sparse codes. This is achieved through learning a compact dictionary of features from affine-transformed input images. Analysis and experiments indicate that this novel feature is highly discriminative in addition to being largely affine-invariant. A classifier using AdaBoost is then designed using the affine sparse codes as the input. Extensive experiments with standard databases demonstrate that the proposed approach can obtain the state-of-the-art results, outperforming existing leading approaches in the literature.

Shen, Chunhua; Kim, Junae; Wang, Lei; , "A scalable dual approach to semidefinite metric learning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2601-2608, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995447
Abstract: Distance metric learning plays an important role in many vision problems. Previous work on quadratic Mahalanobis metric learning usually needs to solve a semidefinite programming (SDP) problem. A standard interior-point SDP solver has a complexity of O(D^6.5) (with D the dimension of input data), and can only solve problems with up to a few thousand variables. Since the number of variables is D(D + 1)/2, this corresponds to a limit around D < 100. This high complexity hampers the application of metric learning to high-dimensional problems. In this work, we propose a very efficient approach to this metric learning problem. We formulate a Lagrange dual approach which is much simpler to optimize, and with which we can solve much larger Mahalanobis metric learning problems. Roughly, the proposed approach has a time complexity of O(t · D^3) with t ≈ 20 ∼ 30 for most problems in our experiments. The proposed algorithm is scalable and easy to implement. Experiments on various datasets show accuracy comparable to the state of the art. We also demonstrate that this idea may be applicable to other SDP problems such as maximum variance unfolding.

Cao, Xun; Tong, Xin; Dai, Qionghai; Lin, Stephen; , "High resolution multispectral video capture with a hybrid camera system," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.297-304, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995418
Abstract: We present a new approach to capture video at high spatial and spectral resolutions using a hybrid camera system. Composed of an RGB video camera, a grayscale video camera and several optical elements, the hybrid camera system simultaneously records two video streams: an RGB video with high spatial resolution, and a multispectral video with low spatial resolution. After registration of the two video streams, our system propagates the multispectral information into the RGB video to produce a video with both high spectral and spatial resolution. This propagation between videos is guided by color similarity of pixels in the spectral domain, proximity in the spatial domain, and the consistent color of each scene point in the temporal domain. The propagation algorithm is designed for rapid computation to allow real-time video generation at the original frame rate, and can thus facilitate real-time video analysis tasks such as tracking and surveillance. Hardware implementation details and design tradeoffs are discussed. We evaluate the proposed system using both simulations with ground truth data and on real-world scenes. The utility of this high resolution multispectral video data is demonstrated in dynamic white balance adjustment and tracking.

Ali, Karim; Hasler, David; Fleuret, François; , "FlowBoost — Appearance learning from sparsely annotated video," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1433-1440, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995403
Abstract: We propose a new learning method which exploits temporal consistency to successfully learn a complex appearance model from a sparsely labeled training video. Our approach consists in iteratively improving an appearance-based model built with a Boosting procedure, and the reconstruction of trajectories corresponding to the motion of multiple targets. We demonstrate the efficiency of our procedure on pedestrian detection in videos and cell detection in microscopy image sequences. In both cases, our method is demonstrated to reduce the labeling requirement by one to two orders of magnitude. We show that in some instances, our method trained with sparse labels on a video sequence is able to outperform a standard learning procedure trained with the fully labeled sequence.

Gordo, Albert; Perronnin, Florent; , "Asymmetric distances for binary embeddings," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.729-736, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995505
Abstract: In large-scale query-by-example retrieval, embedding image signatures in a binary space offers two benefits: data compression and search efficiency. While most embedding algorithms binarize both query and database signatures, it has been noted that this is not strictly a requirement. Indeed, asymmetric schemes which binarize the database signatures but not the query still enjoy the same two benefits but may provide superior accuracy. In this work, we propose two general asymmetric distances which are applicable to a wide variety of embedding techniques including Locality Sensitive Hashing (LSH), Locality Sensitive Binary Codes (LSBC), Spectral Hashing (SH) and Semi-Supervised Hashing (SSH). We experiment on four public benchmarks containing up to 1M images and show that the proposed asymmetric distances consistently lead to large improvements over the symmetric Hamming distance for all binary embedding techniques. We also propose a novel simple binary embedding technique — PCA Embedding (PCAE) — which is shown to yield competitive results with respect to more complex algorithms such as SH and SSH.
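The symmetric-vs-asymmetric idea can be illustrated with sign-of-projection (LSH-style) codes. In the asymmetric variant below only the database side is binarized; the weighting by projection magnitude is one simple possibility, not necessarily the paper's exact distance definitions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))  # 32 random hyperplanes for 8-d signatures

def binarize(x):
    """Symmetric scheme: quantize a signature to one bit per hyperplane."""
    return (W @ x > 0).astype(np.uint8)

def hamming(a, b):
    """Symmetric Hamming distance between two binary codes."""
    return int((a != b).sum())

def asym_dist(query, db_code):
    """Asymmetric scheme: database code is binary, query stays real-valued.
    Each disagreeing bit contributes the magnitude of the query's projection,
    so bits near the quantization boundary cost less than confident ones."""
    proj = W @ query
    disagree = (proj > 0).astype(np.uint8) != db_code
    return float(np.abs(proj[disagree]).sum())
```

Both schemes store only binary codes for the database, so the compression benefit is identical; the asymmetric distance simply avoids quantizing the query.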

Caicedo, Juan C.; Kapoor, Ashish; Kang, Sing Bing; , "Collaborative personalization of image enhancement," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.249-256, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995439
Abstract: While most existing enhancement tools for photographs have universal auto-enhancement functionality, recent research [8] shows that users can have personalized preferences. In this paper, we explore whether such personalized preferences in image enhancement tend to cluster and whether users can be grouped according to such preferences. To this end, we analyze a comprehensive data set of image enhancements collected from 336 users via Amazon Mechanical Turk. We find that such clusters do exist and can be used to derive methods to learn statistical preference models from a group of users. We also present a probabilistic framework that exploits the ideas behind collaborative filtering to automatically enhance novel images for new users. Experiments show that inferring clusters in image enhancement preferences results in better prediction of image enhancement preferences and outperforms generic auto-correction tools.

Utasi, Akos; Benedek, Csaba; , "A 3-D marked point process model for multi-view people detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3385-3392, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995699
Abstract: In this paper we introduce a probabilistic approach to multiple person localization using multiple calibrated camera views. People present in the scene are approximated by a population of cylinder objects in the 3-D world coordinate system, which is a realization of a Marked Point Process. The observation model is based on projecting the pixels of the obtained motion masks to the ground plane and to other parallel planes at different heights. The proposed pixel-level feature is based on physical properties of the 2-D image formation process and can accurately localize the leg position on the ground plane and estimate the height of the people, even if the area of interest is only part of the scene and silhouettes from irrelevant outside motions significantly overlap with the monitored region in some of the camera views. We introduce an energy function which contains a data term calculated from the extracted features and a geometrical constraint term modeling the distance between two people. The final configuration results (location and height) are obtained by an iterative stochastic energy optimization process called Multiple Birth and Death dynamics. The proposed approach is compared to a recent state-of-the-art technique on a publicly available dataset and its advantages are quantitatively demonstrated.

Xue, Tianfan; Liu, Jianzhuang; Tang, Xiaoou; , "Symmetric piecewise planar object reconstruction from a single image," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2577-2584, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995405
Abstract: Recovering 3D geometry from a single view of an object is an important and challenging problem in computer vision. Previous methods mainly focus on one specific class of objects without large topological changes, such as cars, faces, or human bodies. In this paper, we propose a novel single view reconstruction algorithm for symmetric piecewise planar objects that is not restricted to specific object classes. Symmetry is ubiquitous in man-made and natural objects and provides rich information for 3D reconstruction. Given a single view of a symmetric piecewise planar object, we first find all the symmetric line pairs. The geometric properties of symmetric objects are used to narrow down the search space. Then, based on the symmetric lines, a depth map is recovered through a Markov random field. Experimental results show that our algorithm can efficiently recover the 3D shapes of different objects with significant topological variations.

Zheng, Yinqiang; Sugimoto, Shigeki; Okutomi, Masatoshi; , "A branch and contract algorithm for globally optimal fundamental matrix estimation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2953-2960, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995352
Abstract: We propose a unified branch and contract method to estimate the fundamental matrix with guaranteed global optimality, by minimizing either the Sampson error or the point to epipolar line distance, and explicitly handling the rank-2 constraint and scale ambiguity. Based on a novel denominator linearization strategy, the fundamental matrix estimation problem can be transformed into an equivalent problem that involves 9 squared univariate, 12 bilinear and 6 trilinear terms. We build tight convex and concave relaxations for these nonconvex terms and solve the problem deterministically under the branch and bound framework. For acceleration, a bound contraction mechanism is introduced to reduce the size of the branching region at the root node. Given high-quality correspondences and proper data normalization, our experiments show that the state-of-the-art locally optimal methods generally converge to the globally optimal solution. However, they do risk being trapped in a local minimum in the presence of noise. As another important experimental result, we also demonstrate, from the viewpoint of global optimization, that the point to epipolar line distance is slightly inferior to the Sampson error when object scales vary drastically across the two views.

Gong, Yunchao; Lazebnik, Svetlana; , "Comparing data-dependent and data-independent embeddings for classification and ranking of Internet images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2633-2640, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995619
Abstract: This paper presents a comparative evaluation of feature embeddings for classification and ranking in large-scale Internet image datasets. We follow a popular framework for scalable visual learning, in which the data is first transformed by a nonlinear embedding and then an efficient linear classifier is trained in the resulting space. Our study includes data-dependent embeddings inspired by the semi-supervised learning literature, and data-independent ones based on approximating specific kernels (such as the Gaussian kernel for GIST features and the histogram intersection kernel for bags of words). Perhaps surprisingly, we find that data-dependent embeddings, despite being computed from large amounts of unlabeled data, do not have any advantage over data-independent ones in the regime of scarce labeled data. On the other hand, we find that several data-dependent embeddings are competitive with popular data-independent choices for large-scale classification.
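A classic example of a data-independent embedding approximating a specific kernel is the random Fourier feature map for the Gaussian kernel. The sketch below is the standard construction as a hedged illustration, not code from the paper:

```python
import numpy as np

def rff(X, n_features=2000, gamma=0.5, seed=0):
    """Random Fourier features: an embedding z such that
    z(x) . z(y) ~= exp(-gamma * ||x - y||^2), the standard data-independent
    approximation of the Gaussian (RBF) kernel. Training a linear
    classifier on z(X) then mimics a kernel machine at scale."""
    rng = np.random.default_rng(seed)
    # frequencies sampled from the kernel's Fourier transform
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

The approximation error shrinks roughly as 1/sqrt(n_features), which is why such embeddings pair well with efficient linear classifiers in the framework the abstract describes.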

Ma, Tianyang; Latecki, Longin Jan; , "From partial shape matching through local deformation to robust global shape similarity for object detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1441-1448, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995591
Abstract: In this paper, we propose a novel framework for contour based object detection. Compared to previous work, our contribution is three-fold. 1) A novel shape matching scheme suitable for partial matching of edge fragments. The shape descriptor has the same geometric units as shape context but our shape representation is not histogram based. 2) Grouping of partial matching hypotheses to object detection hypotheses is expressed as maximum clique inference on a weighted graph. 3) A novel local affine-transformation to utilize the holistic shape information for scoring and ranking the shape similarity hypotheses. Consequently, each detection result not only identifies the location of the target object in the image, but also provides a precise location of its contours, since we transform a complete model contour to the image. Very competitive results on the ETHZ dataset, obtained in a pure shape-based framework, demonstrate that our method achieves not only accurate object detection but also precise contour localization on cluttered backgrounds.

Pishchulin, Leonid; Jain, Arjun; Wojek, Christian; Andriluka, Mykhaylo; Thormahlen, Thorsten; Schiele, Bernt; , "Learning people detection models from few training samples," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1473-1480, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995574
Abstract: People detection is an important task for a wide range of applications in computer vision. State-of-the-art methods learn appearance based models requiring tedious collection and annotation of large data corpora. Also, obtaining data sets representing all relevant variations with sufficient accuracy for the intended application domain at hand is often a non-trivial task. Therefore this paper investigates how 3D shape models from computer graphics can be leveraged to ease training data generation. In particular we employ a rendering-based reshaping method in order to generate thousands of synthetic training samples from only a few persons and views. We evaluate our data generation method for two different people detection models. Our experiments on a challenging multi-view dataset indicate that the data from as few as eleven persons suffices to achieve good performance. When we additionally combine our synthetic training samples with real data we even outperform existing state-of-the-art methods.

Bleyer, Michael; Rother, Carsten; Kohli, Pushmeet; Scharstein, Daniel; Sinha, Sudipta; , "Object stereo — Joint stereo matching and object segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3081-3088, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995581
Abstract: This paper presents a method for joint stereo matching and object segmentation. In our approach a 3D scene is represented as a collection of visually distinct and spatially coherent objects. Each object is characterized by three different aspects: a color model, a 3D plane that approximates the object's disparity distribution, and a novel 3D connectivity property. Inspired by Markov Random Field models of image segmentation, we employ object-level color models as a soft constraint, which can aid depth estimation in powerful ways. In particular, our method is able to recover the depth of regions that are fully occluded in one input view, which to our knowledge is new for stereo matching. Our model is formulated as an energy function that is optimized via fusion moves. We show high-quality disparity and object segmentation results on challenging image pairs as well as standard benchmarks. We believe our work not only demonstrates a novel synergy between the areas of image segmentation and stereo matching, but may also inspire new work in the domain of automatic and interactive object-level scene manipulation.

Chen, Yan; Bao, Hujun; He, Xiaofei; , "Non-negative local coordinate factorization for image representation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.569-574, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995400
Abstract: Recently, Non-negative Matrix Factorization (NMF) has become increasingly popular for feature extraction in computer vision and pattern recognition. NMF seeks two non-negative matrices whose product best approximates the original matrix. The non-negativity constraints lead to sparse, parts-based representations which can be more robust than non-sparse, global features. To obtain more accurate control over the sparseness, in this paper, we propose a novel method called Non-negative Local Coordinate Factorization (NLCF) for feature extraction. NLCF adds a local coordinate constraint into the standard NMF objective function. Specifically, we require that the learned basis vectors be as close to the original data points as possible. In this way, each data point can be represented by a linear combination of only a few nearby basis vectors, which naturally leads to sparse representation. Extensive experimental results suggest that the proposed approach provides a better representation and achieves higher accuracy in image clustering.
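For reference, the plain NMF baseline that NLCF builds on can be written with the standard Lee-Seung multiplicative updates. The local-coordinate penalty the paper adds to the objective is omitted here, so this is only the starting point, not NLCF itself:

```python
import numpy as np

def nmf(V, r, iters=500, seed=0, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F^2
    subject to W, H >= 0. NLCF adds a local-coordinate penalty to this
    objective, pulling basis vectors (columns of W) toward data points."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update keeps H non-negative
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update keeps W non-negative
    return W, H
```

Because the updates are purely multiplicative, non-negativity of W and H is preserved automatically, which is what yields the parts-based representations the abstract mentions.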

Jancosek, Michal; Pajdla, Tomas; , "Multi-view reconstruction preserving weakly-supported surfaces," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3121-3128, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995693
Abstract: We propose a novel method for the multi-view reconstruction problem. Surfaces which do not have direct support in the input 3D point cloud and hence need not be photo-consistent but represent real parts of the scene (e.g. low-textured walls, windows, cars) are important for achieving complete reconstructions. We augment the existing Labatut CGF 2009 method with the ability to cope with these difficult surfaces simply by changing the t-edge weights in the construction of surfaces by a minimal s-t cut. Our method uses the Visual Hull to reconstruct the difficult surfaces which are not sampled densely enough by the input 3D point cloud. We demonstrate the importance of these surfaces on several real-world data sets. We compare our improvement to our implementation of the Labatut CGF 2009 method and show that our method reconstructs difficult surfaces considerably better while preserving thin structures and details at the same quality and computational cost.

Schwing, Alexander; Hazan, Tamir; Pollefeys, Marc; Urtasun, Raquel; , "Distributed message passing for large scale graphical models," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1833-1840, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995642
Abstract: In this paper we propose a distributed message-passing algorithm for inference in large scale graphical models. Our method can handle large problems efficiently by distributing and parallelizing the computation and memory requirements. The convergence and optimality guarantees of recently developed message-passing algorithms are preserved by introducing new types of consistency messages, sent between the distributed computers. We demonstrate the effectiveness of our approach in the task of stereo reconstruction from high-resolution imagery, and show that inference is possible with more than 200 labels in images larger than 10 MPixels.

Li, Liang; Jiang, Shuqiang; Huang, Qingming; , "Learning image Vicept description via mixed-norm regularization for large scale semantic image search," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.825-832, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995570
Abstract: The paradox of visual polysemia and concept polymorphism has been a great challenge in large scale semantic image search. To address this problem, our paper proposes a new method to generate image Vicept representations. Vicept characterizes the membership distribution between elementary visual appearances and semantic concepts, and forms a hierarchical representation of image semantics from local to global. To obtain discriminative Vicept descriptions with structural sparsity, we adopt mixed-norm regularization in the optimization problem for learning the concept membership distribution of each visual word. Furthermore, considering the structure of BOV in images, each visual descriptor is encoded as a weighted sum of dictionary elements using group sparse coding, which yields a sparse representation at the image level. The wide applicability of Vicept is validated in our experiments, including large scale semantic image search, image annotation, and semantic image re-ranking.

Wang, Hua; Huang, Heng; Ding, Chris; , "Image annotation using bi-relational graph of images and semantic labels," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.793-800, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995379
Abstract: Image annotation is usually formulated as a multi-label semi-supervised learning problem. Traditional graph-based methods only utilize the data (images) graph induced from image similarities, while ignoring the label (semantic terms) graph induced from label correlations of a multi-label image data set. In this paper, we propose a novel Bi-relational Graph (BG) model that comprises both the data graph and the label graph as subgraphs, and connect them by an additional bipartite graph induced from label assignments. By considering each class and its labeled images as a semantic group, we perform random walk on the BG to produce group-to-vertex relevance, including class-to-image and class-to-class relevances. The former can be used to predict labels for unannotated images, while the latter are new class relationships, called Causal Relationships (CR), which are asymmetric. CR is learned from input data and has better semantic meaning to enhance the label prediction for unannotated images. We apply the proposed approaches to automatic image annotation and semantic image retrieval tasks on four benchmark multi-label image data sets. The superior performance of our approaches compared to state-of-the-art multi-label classification methods demonstrates their effectiveness.

Dubska, Marketa; Herout, Adam; Havel, Jiri; , "PClines — Line detection using parallel coordinates," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1489-1494, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995501
Abstract: Detection of lines in raster images is often performed using the Hough transform. This paper presents a new parameterization of lines and a corresponding modification of the Hough transform, called PClines. PClines are based on parallel coordinates, a coordinate system used mostly for visualization of high-dimensional data. The PClines algorithm is described in the paper; its accuracy is evaluated numerically and compared to commonly used line detectors based on the Hough transform. The results show that PClines outperform the existing approaches in terms of accuracy. Moreover, PClines are computationally extremely efficient, require no floating-point operations, and can be easily accelerated by different hardware architectures.
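The duality behind PClines can be checked in a few lines: a point (x, y) maps to the segment joining (0, x) and (D, y) between the two parallel axes, and all points of a 2D line y = m·x + b map to segments through one common point, which is what the accumulator votes for. This sketch covers only the basic straight parallel-coordinate space; PClines itself combines two such spaces to handle all slopes, including m = 1:

```python
import numpy as np

D = 1.0  # distance between the parallel x- and y-axes

def pc_line_value(point, u):
    """Value at horizontal position u of the parallel-coordinate line
    of (x, y): the segment joining (0, x) on the x-axis to (D, y)."""
    x, y = point
    return x + (y - x) * u / D

def line_to_pc_point(m, b):
    """A 2D line y = m*x + b maps to a single point in PC space (m != 1):
    u = D / (1 - m), v = b / (1 - m)."""
    u = D / (1.0 - m)
    return u, b / (1.0 - m)
```

Accumulating the segments of many edge points and finding the most intersected cell is then exactly the Hough-style voting the abstract describes, doable with integer arithmetic.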

Grabner, Helmut; Gall, Juergen; Van Gool, Luc; , "What makes a chair a chair?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1529-1536, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995327
Abstract: Many object classes are primarily defined by their functions. However, this fact has been left largely unexploited by visual object categorization or detection systems. We propose a method to learn an affordance detector. It identifies locations in the 3D space which "support" the particular function. Our novel approach "imagines" an actor performing an action typical for the target object class, instead of relying purely on the visual object appearance. So, function is handled as a cue complementary to appearance, rather than being a consideration after appearance-based detection. Experimental results are given for the functional category "sitting". Such affordance is tested on a 3D representation of the scene, as can be realistically obtained through SfM or depth cameras. In contrast to appearance-based object detectors, affordance detection requires only very few training examples and generalizes very well to other sittable objects like benches or sofas when trained on a few chairs.

Favaro, Paolo; Vidal, Rene; Ravichandran, Avinash; , "A closed form solution to robust subspace estimation and clustering," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1801-1807, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995365
Abstract: We consider the problem of fitting one or more subspaces to a collection of data points drawn from the subspaces and corrupted by noise/outliers. We pose this problem as a rank minimization problem, where the goal is to decompose the corrupted data matrix as the sum of a clean, self-expressive, low-rank dictionary plus a matrix of noise/outliers. Our key contribution is to show that, for noisy data, this non-convex problem can be solved very efficiently and in closed form from the SVD of the noisy data matrix. Remarkably, this is true both for a single subspace and for multiple subspaces. An important difference with respect to existing methods is that our framework results in a polynomial thresholding of the singular values with minimal shrinkage. Indeed, a particular case of our framework in the case of a single subspace leads to classical PCA, which requires no shrinkage. In the case of multiple subspaces, our framework provides an affinity matrix that can be used to cluster the data according to the subspaces. In the case of data corrupted by outliers, a closed-form solution appears elusive. We thus use an augmented Lagrangian optimization framework, which requires a combination of our proposed polynomial thresholding operator with the more traditional shrinkage-thresholding operator.
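The key structural point of the abstract is that, for noisy data, all the work happens on the singular values of the data matrix. The sketch below shows only that generic flavor with plain hard thresholding of singular values; it does not reproduce the paper's specific polynomial thresholding operator, and the threshold value 1.0 is an arbitrary choice for this synthetic example.

```python
import numpy as np

# Generic flavor of the closed-form SVD step: denoise a low-rank data
# matrix by operating on its singular values only. The paper derives a
# specific *polynomial* thresholding operator with minimal shrinkage;
# this toy code uses plain hard thresholding instead.
rng = np.random.default_rng(0)
U = rng.standard_normal((40, 2))
V = rng.standard_normal((2, 30))
clean = U @ V                              # rank-2 data matrix
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

u, s, vt = np.linalg.svd(noisy, full_matrices=False)
s_thr = np.where(s > 1.0, s, 0.0)          # keep only dominant singular values
denoised = (u * s_thr) @ vt

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
assert (s_thr > 0).sum() == 2              # rank correctly identified
assert err_denoised < err_noisy            # closer to the clean matrix
```

For multiple subspaces the same SVD additionally yields an affinity matrix for clustering, which this single-subspace toy does not show.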

Anwaar-ul-Haq; Gondal, Iqbal; Murshed, Manzur; , "On dynamic scene geometry for view-invariant action matching," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3305-3312, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995690
Abstract: Variation in viewpoints poses significant challenges to action recognition. One popular way of encoding view-invariant action representation is based on the exploitation of epipolar geometry between different views of the same action. The majority of representative work considers detection of landmark points and their tracking, assuming that motion trajectories for all landmark points on the human body are available throughout the course of an action. Unfortunately, due to occlusion and noise, detection and tracking of these landmarks is not always robust. To sidestep this, some of the work assumes that such trajectories are manually marked, which is a clear drawback and lacks the automation promised by computer vision. In this paper, we address this problem by proposing a view-invariant action matching score based on epipolar geometry between actor silhouettes, without tracking and explicit point correspondences. In addition, we explore a multi-body epipolar constraint which makes it possible to work on original action volumes without any pre-processing. We show that the multi-body fundamental matrix captures the geometry of dynamic action scenes and helps devise an action matching score across different views without any prior segmentation of actors. Extensive experimentation on challenging view-invariant action datasets shows that our approach not only removes long-standing assumptions but also achieves significant improvement in recognition accuracy and retrieval.

Bhattacharya, Subhabrata; Sukthankar, Rahul; Jin, Rong; Shah, Mubarak; , "A probabilistic representation for efficient large scale visual recognition tasks," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2593-2600, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995746
Abstract: In this paper, we present an efficient alternative to the traditional vocabulary based on bag-of-visual words (BoW) used for visual classification tasks. Our representation is both conceptually and computationally superior to the bag-of-visual words: (1) We iteratively generate a Maximum Likelihood estimate of an image given a set of characteristic features, in contrast to the BoW methods where an image is represented as a histogram of visual words; (2) We randomly sample a set of characteristic features instead of employing the computation-intensive clustering algorithms used during the vocabulary generation step of BoW methods. Our performance compares favorably to the state-of-the-art in experiments on three challenging human action datasets and a scene categorization dataset, demonstrating the universal applicability of our method.

Edupuganti, Venkata Gopal; Agarwal, Vinayak A; Kompalli, Suryaprakash; , "Registration of camera captured documents under non-rigid deformation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.385-392, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995625
Abstract: Document registration is a problem where the image of a template document whose layout is known is registered with a test document image. Given the registration parameters, the layout of the template image is superimposed on the test document. Registration algorithms have been popular in applications such as forms processing, where the superimposed layout is used to extract relevant fields. Prior art has been designed to work with scanned documents under affine transformation. We find that the proliferation of camera captured images makes it necessary to address camera noise such as non-uniform lighting, clutter, and highly variable scale/resolution. The absence of a scan bed also leads to challenging non-rigid deformations being seen in paper images. Prior approaches in point pattern based registration like RANdom SAmple Consensus (RANSAC) [4] and Thin Plate Spline-Robust Point Matching (TPS-RPM) [5, 6] form the basis of our work. We propose enhancements to these methods to enable registration of cell phone and camera captured documents under non-rigid transformations. We embed three novel aspects into the framework: (i) histogram based uniformly transformed correspondence estimation, (ii) clustering of points located near the regions of interest (ROI) to select only close by regions for matching, (iii) validation of the registration in RANSAC and TPS-RPM algorithms for non-rigid registration. We consider Scale Invariant Feature Transform (SIFT) [8] and Speeded-Up Robust Features (SURF) [1] as our features. Results are reported comparing prior art with our method on a dataset that will be made publicly available.

Ramalingam, Srikumar; Bouaziz, Sofien; Sturm, Peter; Torr, Philip H. S.; , "The light-path less traveled," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3145-3152, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995706
Abstract: This paper extends classical object pose and relative camera motion estimation algorithms for imaging sensors sampling the scene through light-paths. Many algorithms in multi-view geometry assume that every pixel observes light traveling in a single line in space. We wish to relax this assumption and address various theoretical and practical issues in modeling camera rays as piecewise linear paths. Such paths, consisting of finitely many linear segments, are typical of any simple camera configuration with reflective and refractive elements. Our main contribution is to propose efficient algorithms that can work with the complete light-path without knowing the correspondence between their individual segments and the scene points. Second, we investigate light-paths containing infinitely many small piecewise-linear segments that can be modeled using simple parametric curves such as conics. We show compelling simulations and real experiments, involving catadioptric configurations and mirages, to validate our study.

Heller, Jan; Havlena, Michal; Sugimoto, Akihiro; Pajdla, Tomas; , "Structure-from-motion based hand-eye calibration using L∞ minimization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3497-3503, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995629
Abstract: This paper presents a novel method for so-called hand-eye calibration. Using a calibration target is not possible in many applications of hand-eye calibration; in such situations, a Structure-from-Motion approach is commonly used to recover the camera poses up to scale. The presented method takes advantage of recent results in L∞-norm optimization using Second-Order Cone Programming (SOCP) to recover the correct scale. Further, the correctly scaled displacement of the hand-eye transformation is recovered solely from the image correspondences and robot measurements, and is guaranteed to be globally optimal with respect to the L∞-norm. The method is experimentally validated using both synthetic and real world datasets.

Lee, Yong Jae; Grauman, Kristen; , "Learning the easy things first: Self-paced visual category discovery," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1721-1728, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995523
Abstract: Objects vary in their visual complexity, yet existing discovery methods perform "batch" clustering, paying equal attention to all instances simultaneously, regardless of the strength of their appearance or context cues. We propose a self-paced approach that instead focuses on the easiest instances first, and progressively expands its repertoire to include more complex objects. Easier regions are defined as those with both high likelihood of generic objectness and high familiarity of surrounding objects. At each cycle of the discovery process, we re-estimate the easiness of each subwindow in the pool of unlabeled images, and then retrieve a single prominent cluster from among the easiest instances. Critically, as the system gradually accumulates models, each new (more difficult) discovery benefits from the context provided by earlier discoveries. Our experiments demonstrate the clear advantages of self-paced discovery relative to conventional batch approaches, including both more accurate summarization as well as stronger predictive models for novel data.

Pomeranz, Dolev; Shemesh, Michal; Ben-Shahar, Ohad; , "A fully automated greedy square jigsaw puzzle solver," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.9-16, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995331
Abstract: In the square jigsaw puzzle problem one is required to reconstruct the complete image from a set of non-overlapping, unordered, square puzzle parts. Here we propose a fully automatic solver for this problem, where unlike some previous work, it assumes no clues regarding parts' location and requires no prior knowledge about the original image or its simplified (e.g., lower resolution) versions. To do so, we introduce a greedy solver which combines both informed piece placement and rearrangement of puzzle segments to find the final solution. Among our other contributions are new compatibility metrics which better predict the chances of two given parts to be neighbors, and a novel estimation measure which evaluates the quality of puzzle solutions without the need for ground-truth information. Incorporating these contributions, our approach facilitates solutions that surpass state-of-the-art solvers on puzzles of size larger than ever attempted before.
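A compatibility metric in this setting scores the hypothesis that two parts are neighbors. The sketch below shows only the baseline idea that such metrics refine (the paper proposes better predictors, which this toy code does not reproduce): the cost of placing part B immediately to the right of part A is the sum of squared differences between A's rightmost pixel column and B's leftmost column.

```python
import numpy as np

# Toy pairwise compatibility metric for square jigsaw parts: the cost of
# the hypothesis "B sits immediately to the right of A" is the SSD
# between A's last pixel column and B's first column. On a smooth image,
# the true neighbor is by far the cheapest placement.
rng = np.random.default_rng(1)
image = np.linspace(0.0, 1.0, 16).reshape(1, 16).repeat(8, axis=0)
part_a, part_b = image[:, :8], image[:, 8:]   # true left/right neighbors
impostor = rng.random((8, 8))                 # an unrelated puzzle part

def right_cost(a, b):
    """SSD boundary cost for placing part b directly right of part a."""
    return float(((a[:, -1] - b[:, 0]) ** 2).sum())

# The true neighbor should beat the impostor.
assert right_cost(part_a, part_b) < right_cost(part_a, impostor)
```

A greedy solver of the kind described would repeatedly commit the globally cheapest such placement, which is why the quality of the metric dominates overall accuracy.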

Escolano, Francisco; Hancock, Edwin; Lozano, Miguel; , "Graph matching through entropic manifold alignment," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2417-2424, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995583
Abstract: In this paper we cast the problem of graph matching as one of non-rigid manifold alignment. The low-dimensional manifolds are obtained from the commute time embedding and are matched through coherent point drift. Although there have been a number of attempts to realise graph matching in this way, in this paper we propose a novel information-theoretic measure of alignment, the so-called symmetrized normalized-entropy-square variation. We successfully test this dissimilarity measure between manifolds on a challenging database. The measure is estimated by means of the bypass Leonenko entropy functional. In addition we prove that the proposed measure induces a positive definite kernel between the probability density functions associated with the manifolds and hence between graphs after deformation. In our experiments we find that the optimal embedding is associated with the commute time distance, and we also find that our approach, which is purely topological, outperforms several state-of-the-art graph-based algorithms for point matching.

Hu, Yiqun; Mian, Ajmal S.; Owens, Robyn; , "Sparse approximated nearest points for image set classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.121-128, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995500
Abstract: Classification based on image sets has recently attracted great research interest as it holds more promise than single image based classification. In this paper, we propose an efficient and robust algorithm for image set classification. An image set is represented as a triplet: a number of image samples, their mean and an affine hull model. The affine hull model is used to account for unseen appearances in the form of affine combinations of sample images. We introduce a novel between-set distance called Sparse Approximated Nearest Point (SANP) distance. Unlike existing methods, the dissimilarity of two sets is measured as the distance between their nearest points, which can be sparsely approximated from the image samples of their respective set. Different from standard sparse modeling of a single image, this novel sparse formulation for the image set enforces sparsity on the sample coefficients rather than the model coefficients and jointly optimizes the nearest points as well as their sparse approximations. A convex formulation for searching the optimal SANP between two sets is proposed and the accelerated proximal gradient method is adapted to efficiently solve this optimization. Experimental evaluation was performed on the Honda, MoBo and Youtube datasets. Comparison with existing techniques shows that our method consistently achieves better results.
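The geometric core of a between-set distance based on affine hulls can be sketched in a few lines. The code below is a stripped-down illustration under simplifying assumptions: it finds the nearest points between two affine hulls by plain least squares, whereas SANP additionally enforces sparsity on the sample coefficients and solves a convex program with accelerated proximal gradient, none of which is reproduced here.

```python
import numpy as np

# Simplified geometric core of an affine-hull set-to-set distance:
# each image set is modeled as mean + span of its centered samples, and
# the distance is measured between the hulls' nearest points. (SANP
# further requires those points to be *sparse* combinations of actual
# samples; that convex program is omitted from this sketch.)
def affine_hull(samples):                   # samples: one column per image
    mean = samples.mean(axis=1, keepdims=True)
    return mean, samples - mean

def nearest_point_distance(s1, s2):
    m1, u1 = affine_hull(s1)
    m2, u2 = affine_hull(s2)
    # Solve min ||(m1 + u1 a) - (m2 + u2 b)|| as a least-squares problem.
    A = np.hstack([u1, -u2])
    coeffs, *_ = np.linalg.lstsq(A, (m2 - m1).ravel(), rcond=None)
    p1 = m1.ravel() + u1 @ coeffs[: u1.shape[1]]
    p2 = m2.ravel() + u2 @ coeffs[u1.shape[1] :]
    return float(np.linalg.norm(p1 - p2))

set1 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)  # z = 0 plane
set2 = np.array([[0, 1, 0], [0, 0, 1], [2, 2, 2]], dtype=float)  # z = 2 plane
d = nearest_point_distance(set1, set2)
assert abs(d - 2.0) < 1e-9   # parallel planes two units apart
```

Without the sparsity term, the unconstrained hull can reach implausible "images" far from any sample; constraining the coefficients is exactly what the SANP formulation adds.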

Brendel, William; Fern, Alan; Todorovic, Sinisa; , "Probabilistic event logic for interval-based event recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3329-3336, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995491
Abstract: This paper is about detecting and segmenting interrelated events which occur in challenging videos with motion blur, occlusions, dynamic backgrounds, and missing observations. We argue that holistic reasoning about time intervals of events, and their temporal constraints is critical in such domains to overcome the noise inherent to low-level video representations. For this purpose, our first contribution is the formulation of probabilistic event logic (PEL) for representing temporal constraints among events. A PEL knowledge base consists of confidence-weighted formulas from a temporal event logic, and specifies a joint distribution over the occurrence time intervals of all events. Our second contribution is a MAP inference algorithm for PEL that addresses the scalability issue of reasoning about an enormous number of time intervals and their constraints in a typical video. Specifically, our algorithm leverages the spanning-interval data structure for compactly representing and manipulating entire sets of time intervals without enumerating them. Our experiments on interpreting basketball videos show that PEL inference is able to jointly detect events and identify their time intervals, based on noisy input from primitive-event detectors.

Nasihatkon, Behrooz; Hartley, Richard; , "Graph connectivity in sparse subspace clustering," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2137-2144, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995679
Abstract: Sparse Subspace Clustering (SSC) is one of the recent approaches to subspace segmentation. In SSC a graph is constructed whose nodes are the data points and whose edges are inferred from the L1-sparse representation of each point by the others. It has been proved that if the points lie on a mixture of independent subspaces, the graphical structure of each subspace is disconnected from the others. However, the problem of connectivity within each subspace is still unanswered. This is important since the subspace segmentation in SSC is based on finding the connected components of the graph. Our analysis is built upon the connection between the sparse representation through L1-norm minimization and the geometry of convex polytopes proposed by the compressed sensing community. After introducing some assumptions to make the problem well-defined, it is proved that the connectivity within each subspace holds for 2- and 3-dimensional subspaces. The claim of connectivity for the general d-dimensional case, even for generic configurations, is proved false by giving a counterexample in dimensions greater than 3.
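The SSC graph construction that this connectivity analysis studies can be sketched concretely. The code below is a minimal toy version, not the paper's analysis: each point is represented as an L1-sparse combination of the other points (here via a small hand-rolled ISTA solver for the lasso, an assumption of this sketch rather than the original implementation), and the coefficient magnitudes become graph edge weights. For points from two orthogonal 1-D subspaces, no edge crosses subspaces, which is the inter-subspace disconnection property the abstract mentions.

```python
import numpy as np

# Minimal SSC-style graph construction: edges come from the L1-sparse
# representation of each point by all the others.
def lasso_ista(D, x, lam=0.01, n_iter=500):
    """Solve min_c 0.5*||x - D c||^2 + lam*||c||_1 by iterative
    soft-thresholding (ISTA)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        c = c + step * (D.T @ (x - D @ c))   # gradient step
        c = np.sign(c) * np.maximum(np.abs(c) - step * lam, 0.0)
    return c

# Three points on the x-axis line and three on the y-axis line, in R^3.
X = np.array([[1, 2, 3, 0, 0, 0],
              [0, 0, 0, 1, 2, 3],
              [0, 0, 0, 0, 0, 0]], dtype=float)
n = X.shape[1]
W = np.zeros((n, n))
for i in range(n):
    D = np.delete(X, i, axis=1)              # dictionary = all other points
    c = lasso_ista(D, X[:, i])
    W[i, np.arange(n) != i] = np.abs(c)      # coefficients as edge weights

same = W[:3, :3].max()                       # strongest within-subspace edge
cross = max(W[:3, 3:].max(), W[3:, :3].max())
assert cross < 1e-8 < same                   # edges never cross subspaces
```

The paper's question is about the other direction: whether the within-subspace edges (here `same`) always suffice to keep each subspace's subgraph connected, which it answers affirmatively only up to dimension 3.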

Zhang, Shaoting; Zhan, Yiqiang; Dewan, Maneesh; Huang, Junzhou; Metaxas, Dimitris N.; Zhou, Xiang Sean; , "Sparse shape composition: A new framework for shape prior modeling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1025-1032, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995322
Abstract: Image appearance cues are often used to derive object shapes, which is usually one of the key steps of image understanding tasks. However, when image appearance cues are weak or misleading, shape priors become critical to infer and refine the shape derived by these appearance cues. Effective modeling of shape priors is challenging because: 1) shape variation is complex and cannot always be modeled by a parametric probability distribution; 2) a shape instance derived from image appearance cues (input shape) may have gross errors; and 3) local details of the input shape are difficult to preserve if they are not statistically significant in the training data. In this paper we propose a novel Sparse Shape Composition model (SSC) to deal with these three challenges in a unified framework. In our method, training shapes are adaptively composed to infer/refine an input shape. The a-priori information is thus implicitly incorporated on-the-fly. Our model leverages two sparsity observations of the input shape instance: 1) the input shape can be approximately represented by a sparse linear combination of training shapes; 2) parts of the input shape may contain gross errors but such errors are usually sparse. Using L1 norm relaxation, our model is formulated as a convex optimization problem, which is solved by an efficient alternating minimization framework. Our method is extensively validated on two real world medical applications, 2D lung localization in X-ray images and 3D liver segmentation in low-dose CT scans. Compared to state-of-the-art methods, our model exhibits better performance in both studies.

Liu, Shubao; Cooper, David B.; , "A complete statistical inverse ray tracing approach to multi-view stereo," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.913-920, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995334
Abstract: This paper presents a complete solution to estimating a scene's 3D geometry and appearance from multiple 2D images by using a statistical inverse ray tracing method. Instead of matching image features/pixels across images, the inverse ray tracing approach models the image generation process directly and searches for the best 3D geometry and surface reflectance model to explain all the observations. Here the image generation process is modeled through volumetric ray tracing, where the occlusion/visibility is exactly modeled. All the constraints (including ray constraints and prior knowledge about the geometry) are put into the Ray Markov Random Field (Ray MRF) formulation, developed in [10]. Differently from [10], where the voxel colors are estimated independently of the voxel occupancies, in this work both voxel occupancies and colors (i.e., both geometry and appearance) are modeled and estimated jointly in the same inverse ray tracing framework (Ray MRF + deep belief propagation) and implemented in a common message passing scheme, which improves the accuracy significantly as verified by extensive experiments. The complete inverse ray tracing approach can better handle difficult problems in multi-view stereo than do traditional methods, including large camera baseline, occlusion, matching ambiguities, color constant or slowly changing regions, etc., without additional information and assumptions, such as an initial surface estimate or a simple background assumption. A prototype system is built and tested over several challenging datasets and compared with the state-of-the-art systems, which demonstrates its good performance and wide applicability.

Shyr, Alex; Darrell, Trevor; Jordan, Michael; Urtasun, Raquel; , "Supervised hierarchical Pitman-Yor process for natural scene segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2281-2288, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995647
Abstract: From conventional wisdom and empirical studies of annotated data, it has been shown that visual statistics such as object frequencies and segment sizes follow power law distributions. Previous work has shown that both kinds of power-law behavior can be captured by using a hierarchical Pitman-Yor process prior within a nonparametric Bayesian approach to scene segmentation. In this paper, we add label information into the previously unsupervised model. Our approach exploits the labelled data by adding constraints on the parameter space during the variational learning phase. We evaluate our formulation on the LabelMe natural scene dataset, and show the effectiveness of our approach.

Sharma, Avinash; Horaud, Radu; Cech, Jan; Boyer, Edmond; , "Topologically-robust 3D shape matching based on diffusion geometry and seed growing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2481-2488, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995455
Abstract: 3D shape matching is an important problem in computer vision. One of the major difficulties in finding dense correspondences between 3D shapes is related to the topological discrepancies that often arise due to complex kinematic motions. In this paper we propose a shape matching method that is robust to such changes in topology. The algorithm starts from a sparse set of seed matches and outputs a dense matching. We propose to use a shape descriptor based on properties of the heat kernel, which provides an intrinsic scale-space representation. This descriptor incorporates (i) heat flow from already matched points and (ii) self diffusion. At small scales the descriptor behaves locally and hence it is robust to global changes in topology. Therefore, it can be used to build a vertex-to-vertex matching score conditioned by an initial correspondence set. This score is then used to iteratively add new correspondences based on a novel seed-growing method that iteratively propagates the seed correspondences to nearby vertices. The matching is further densified via an EM-like method that explores the congruency between the two shape embeddings. Our method is compared with two recently proposed algorithms and we show that we can deal with substantial topological differences between the two shapes.

Smith, David L.; Field, Jacqueline; Learned-Miller, Erik; , "Enforcing similarity constraints with integer programming for better scene text recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.73-80, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995700
Abstract: The recognition of text in everyday scenes is made difficult by viewing conditions, unusual fonts, and lack of linguistic context. Most methods integrate a priori appearance information and some sort of hard or soft constraint on the allowable strings. Weinman and Learned-Miller [14] showed that the similarity among characters, as a supplement to the appearance of the characters with respect to a model, could be used to improve scene text recognition. In this work, we make further improvements to scene text recognition by taking a novel approach to the incorporation of similarity. In particular, we train a similarity expert that learns to classify each pair of characters as equivalent or not. After removing logical inconsistencies in an equivalence graph, we formulate the search for the maximum likelihood interpretation of a sign as an integer program. We incorporate the equivalence information as constraints in the integer program and build an optimization criterion out of appearance features and character bigrams. Finally, we take the optimal solution from the integer program, and compare all "nearby" solutions using a probability model for strings derived from search engine queries. We demonstrate word error reductions of more than 30% relative to previous methods on the same data set.

Gaidon, Adrien; Harchaoui, Zaid; Schmid, Cordelia; , "Actom sequence models for efficient action detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3201-3208, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995646
Abstract: We address the problem of detecting actions, such as drinking or opening a door, in hours of challenging video data. We propose a model based on a sequence of atomic action units, termed "actoms", that are characteristic for the action. Our model represents the temporal structure of actions as a sequence of histograms of actom-anchored visual features. Our representation, which can be seen as a temporally structured extension of the bag-of-features, is flexible, sparse and discriminative. We refer to our model as the Actom Sequence Model (ASM). Training requires the annotation of actoms for action clips. At test time, actoms are detected automatically, based on a nonparametric model of the distribution of actoms, which also acts as a prior on an action's temporal structure. We present experimental results on two recent benchmarks for temporal action detection, "Coffee and Cigarettes" [12] and the dataset of [3]. We show that our ASM method outperforms the current state of the art in temporal action detection.

Wu, Changchang; Frahm, Jan-Michael; Pollefeys, Marc; , "Repetition-based dense single-view reconstruction," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3113-3120, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995551
Abstract: This paper presents a novel approach for dense reconstruction from a single-view of a repetitive scene structure. Given an image and its detected repetition regions, we model the shape recovery as the dense pixel correspondences within a single image. The correspondences are represented by an interval map that tells the distance of each pixel to its matched pixels within the single image. In order to obtain dense repetitive structures, we develop a new repetition constraint that penalizes the inconsistency between the repetition intervals of the dynamically corresponding pixel pairs. We deploy a graph-cut to balance between the high-level constraint of geometric repetition and the low-level constraints of photometric consistency and spatial smoothness. We demonstrate the accurate reconstruction of dense 3D repetitive structures through a variety of experiments, which prove the robustness of our approach to outliers such as structure variations, illumination changes, and occlusions.

Ross, Stephane; Munoz, Daniel; Hebert, Martial; Bagnell, J. Andrew; , "Learning message-passing inference machines for structured prediction," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2737-2744, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995724
Abstract: Nearly every structured prediction problem in computer vision requires approximate inference due to large and complex dependencies among output labels. While graphical models provide a clean separation between modeling and inference, learning these models with approximate inference is not well understood. Furthermore, even if a good model is learned, predictions are often inaccurate due to approximations. In this work, instead of performing inference over a graphical model, we consider the inference procedure as a composition of predictors. Specifically, we focus on message-passing algorithms, such as Belief Propagation, and show how they can be viewed as procedures that sequentially predict label distributions at each node over a graph. Given labeled graphs, we can then train the sequence of predictors to output the correct labelings. The result no longer corresponds to a graphical model but simply defines an inference procedure, with strong theoretical properties, that can be used to classify new graphs. We demonstrate the scalability and efficacy of our approach on 3D point cloud classification and 3D surface estimation from single images.

Chakraborty, Shayok; Balasubramanian, Vineeth; Panchanathan, Sethuraman; , "Dynamic batch mode active learning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2649-2656, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995715
Abstract: Active learning techniques have gained popularity to reduce human effort in labeling data instances for inducing a classifier. When faced with large amounts of unlabeled data, such algorithms automatically identify the exemplar and representative instances to be selected for manual annotation. More recently, there have been attempts towards a batch mode form of active learning, where a batch of data points is simultaneously selected from an unlabeled set. Real-world applications require adaptive approaches for batch selection in active learning. However, existing work in this field has primarily been heuristic and static. In this work, we propose a novel optimization-based framework for dynamic batch mode active learning, where the batch size as well as the selection criteria are combined in a single formulation. The solution procedure has the same computational complexity as existing state-of-the-art static batch mode active learning techniques. Our results on four challenging biometric datasets portray the efficacy of the proposed framework and also certify the potential of this approach in being used for real world biometric recognition applications.

Yang, Yang; Yang, Yi; Huang, Zi; Shen, Heng Tao; Nie, Feiping; , "Tag localization with spatial correlations and joint group sparsity," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.881-888, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995499
Abstract: Numerous social images are emerging on the Web, and precisely labeling them is critical to image retrieval. However, traditional image-level tagging methods may become less effective because global image matching approaches can hardly cope with the diversity and arbitrariness of Web image content. This raises an urgent need for fine-grained tagging schemes. In this work, we study how to establish a mapping between tags and image regions, i.e., localize tags to image regions, so as to better depict and index the content of images. We propose spatial group sparse coding (SGSC), which extends the robust encoding ability of group sparse coding with spatial correlations among training regions. We represent spatial correlations in a two-dimensional image space and design group-specific spatial kernels to produce a more interpretable regularizer. Further, we propose a joint version of the SGSC model which is able to simultaneously encode a group of intrinsically related regions within a test image. An effective algorithm is developed to optimize the objective function of the joint SGSC. The tag localization task is conducted by propagating tags from sparsely selected groups of regions to the target regions according to the reconstruction coefficients. Extensive experiments on three public image datasets show that our models achieve substantial performance improvements over the state-of-the-art method in the tag localization task.

Takerkart, Sylvain; Ralaivola, Liva; , "MKPM: A multiclass extension to the kernel projection machine," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2785-2791, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995657
Abstract: We introduce Multiclass Kernel Projection Machines (MKPM), a new formalism that extends the Kernel Projection Machine framework to the multiclass case. Our formulation is based on the use of output codes and it implements a co-regularization scheme by simultaneously constraining the projection dimensions associated with the individual predictors that constitute the global classifier. In order to solve the optimization problem posed by our formulation, we propose an efficient dynamic programming approach. Numerical simulations conducted on a few pattern recognition problems illustrate the soundness of our approach.

Chum, Ondrej; Mikulik, Andrej; Perdoch, Michal; Matas, Jiri; , "Total recall II: Query expansion revisited," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.889-896, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995601
Abstract: The most effective particular object and image retrieval approaches are based on the bag-of-words (BoW) model. All state-of-the-art retrieval results have been achieved by methods that include a query expansion step, which brings a significant boost in performance. We introduce three extensions to automatic query expansion: (i) a method capable of preventing tf-idf failure caused by the presence of sets of correlated features (confusers), (ii) an improved spatial verification and re-ranking step that incrementally builds a statistical model of the query object and (iii) learning of the relevant spatial context to boost retrieval performance. The three improvements of query expansion were evaluated on the standard Paris and Oxford datasets according to a standard protocol, and state-of-the-art results were achieved.

Amberg, Brian; Vetter, Thomas; , "GraphTrack: Fast and globally optimal tracking in videos," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1209-1216, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995341
Abstract: In video post-production it is often necessary to track interest points in the video. This is called off-line tracking, because the complete video is available to the algorithm and can be contrasted with on-line tracking, where an incoming stream is tracked in real time. Off-line tracking should be accurate and–if used interactively–needs to be fast, preferably faster than real-time. We describe a 50 to 100 frames per second off-line tracking algorithm, which globally maximizes the probability of the track given the complete video. The algorithm is more reliable than previous methods because it explains the complete frames, not only the patches of the final track, making as much use of the data as possible. It achieves efficiency by using a greedy search strategy with deferred cost evaluation, focusing the computational effort on the most promising track candidates while finding the globally optimal track.
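
The notion of a globally optimal track given the whole video can be illustrated with a Viterbi-style dynamic program over per-frame candidates (a minimal sketch; the paper's contribution is reaching the same optimum much faster via greedy search with deferred cost evaluation, and the `unary`/`pairwise` log-score inputs here are hypothetical):

```python
import numpy as np

def best_track(unary, pairwise):
    """Globally optimal track by dynamic programming (Viterbi).

    unary[t][i]: log-score of candidate i in frame t;
    pairwise[t][i, j]: log-score of moving from candidate i in frame t
    to candidate j in frame t+1.
    """
    score = unary[0].copy()
    back = []
    for t in range(1, len(unary)):
        # total score of reaching candidate j in frame t via i in frame t-1
        cand = score[:, None] + pairwise[t - 1] + unary[t][None, :]
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    track = [int(score.argmax())]
    for bp in reversed(back):
        track.append(int(bp[track[-1]]))
    return track[::-1]

# toy example: 3 frames, 2 candidate locations per frame
unary = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
pairwise = [np.zeros((2, 2)), np.zeros((2, 2))]
track = best_track(unary, pairwise)
```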

Zen, Gloria; Ricci, Elisa; , "Earth mover's prototypes: A convex learning approach for discovering activity patterns in dynamic scenes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3225-3232, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995578
Abstract: We present a novel approach for automatically discovering spatio-temporal patterns in complex dynamic scenes. Similarly to recent non-object centric methods, we use low level visual cues to detect atomic activities and then construct clip histograms. Differently from previous works, we formulate the task of discovering high level activity patterns as a prototype learning problem where the correlation among atomic activities is explicitly taken into account when grouping clip histograms. Interestingly at the core of our approach there is a convex optimization problem which allows us to efficiently extract patterns at multiple levels of detail. The effectiveness of our method is demonstrated on publicly available datasets.

Wang, Liang; Wang, Yizhou; Jiang, Tingting; Gao, Wen; , "Instantly telling what happens in a video sequence using simple features," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3257-3264, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995377
Abstract: This paper presents an efficient method to tell what happens (e.g. recognize actions) in a video sequence from only a couple of frames in real time. For the sake of instantaneity, we employ two types of computationally efficient but perceptually important features, optical flow and edge, to capture motion and shape/structure information in video sequences. It is known that the two types of features are not sparse and can be unreliable or ambiguous at certain parts of a video. In order to endow them with strong discriminative power, we extend an efficient contrast set mining technique, the Emerging Pattern (EP) mining method, to learn joint features from videos to differentiate action classes. Experimental results show that the combination of the two types of features achieves superior performance in differentiating actions compared with using either type of feature alone. The learned features are discriminative, statistically significant (reliable) and display semantically meaningful shape-motion structures of human actions. Besides instant action recognition, we also extend the proposed approach to anomaly detection and sequential event detection. The experiments demonstrate encouraging results.

Gao, Junhong; Kim, Seon Joo; Brown, Michael S.; , "Constructing image panoramas using dual-homography warping," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.49-56, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995433
Abstract: This paper describes a method to construct seamless image mosaics of a panoramic scene containing two predominant planes: a distant back plane and a ground plane that sweeps out from the camera's location. While this type of panorama can be stitched when the camera is carefully rotated about its optical center, such ideal scene capture is hard to perform correctly. Existing techniques use a single homography per image to perform alignment, followed by seam cutting or image blending to hide inevitable alignment artifacts. In this paper, we demonstrate how to use two homographies per image to produce a more seamless image. Specifically, our approach blends the homographies in the alignment procedure to perform a nonlinear warping. Once the images are geometrically stitched, they are further processed to blend seams and reduce curvilinear visual artifacts due to the nonlinear warping. As demonstrated in our paper, our procedure is able to produce results for this type of scene where current state-of-the-art techniques fail.
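
The core warping idea can be sketched in a few lines: each pixel is mapped by a convex blend of the two homographies, with a spatially varying weight (here the weight `w` is just an input; in the paper it varies smoothly across the image, which is what makes the warp nonlinear):

```python
import numpy as np

def blended_warp(pt, H_a, H_b, w):
    """Warp a point with a convex blend of two homographies.

    w=1 uses H_a alone, w=0 uses H_b alone; intermediate weights blend the
    two alignments, the essence of the dual-homography warp.
    """
    H = w * H_a + (1.0 - w) * H_b
    x, y, z = H @ np.array([pt[0], pt[1], 1.0])
    return (x / z, y / z)

H_ground = np.eye(3)                                     # identity warp
H_back = np.array([[1, 0, 2], [0, 1, 0], [0, 0, 1.0]])   # shift right by 2
```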

Shahar, Oded; Faktor, Alon; Irani, Michal; , "Space-time super-resolution from a single video," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3353-3360, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995360
Abstract: Spatial Super Resolution (SR) aims to recover fine image details, smaller than a pixel size. Temporal SR aims to recover rapid dynamic events that occur faster than the video frame-rate, and are therefore invisible or seen incorrectly in the video sequence. Previous methods for Space-Time SR combined information from multiple video recordings of the same dynamic scene. In this paper we show how this can be done from a single video recording. Our approach is based on the observation that small space-time patches (‘ST-patches’, e.g., 5×5×3) of a single ‘natural video’, recur many times inside the same video sequence at multiple spatio-temporal scales. We statistically explore the degree of these ST-patch recurrences inside ‘natural videos’, and show that this is a very strong statistical phenomenon. Space-time SR is obtained by combining information from multiple ST-patches at sub-frame accuracy. We show how finding similar ST-patches can be done both efficiently (with a randomized search in space-time), and at sub-frame accuracy (despite severe motion aliasing). Our approach is particularly useful for temporal SR, resolving both severe motion aliasing and severe motion blur in complex ‘natural videos’.

Vijayanarasimhan, Sudheendra; Grauman, Kristen; , "Large-scale live active learning: Training object detectors with crawled data and crowds," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1449-1456, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995430
Abstract: Active learning and crowdsourcing are promising ways to efficiently build up training sets for object recognition, but thus far such techniques have been tested in artificially controlled settings. Typically the vision researcher has already determined the dataset's scope, the labels "actively" obtained are in fact already known, and/or the crowd-sourced collection process is iteratively fine-tuned. We present an approach for live learning of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. To address the technical issues such a large-scale system entails, we introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with a hashing-based solution. We demonstrate the approach with experiments of unprecedented scale and autonomy, and show it successfully improves the state-of-the-art for the most challenging objects in the PASCAL benchmark. In addition, we show our detector competes well with popular nonlinear classifiers that are much more expensive to train.

Maji, Subhransu; Bourdev, Lubomir; Malik, Jitendra; , "Action recognition from a distributed representation of pose and appearance," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3177-3184, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995631
Abstract: We present a distributed representation of pose and appearance of people called the "poselet activation vector". First we show that this representation can be used to estimate the pose of people defined by the 3D orientations of the head and torso in the challenging PASCAL VOC 2010 person detection dataset. Our method is robust to clutter, aspect and viewpoint variation and works even when body parts like faces and limbs are occluded or hard to localize. We combine this representation with other sources of information like interaction with objects and other people in the image and use it for action recognition. We report competitive results on the PASCAL VOC 2010 static image action classification challenge.

Schwing, Alexander G.; Zach, Christopher; Zheng, Yefeng; Pollefeys, Marc; , "Adaptive random forest — How many 'experts' to ask before making a decision?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1377-1384, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995684
Abstract: How many people should you ask if you are not sure about your way? We provide an answer to this question for Random Forest classification. The presented method is based on the statistical formulation of confidence intervals and conjugate priors for binomial as well as multinomial distributions. We derive appealing decision rules to speed up the classification process by leveraging the fact that many samples can be clearly mapped to classes. Results on test data are provided, and we highlight the applicability of our method to a wide range of problems. The approach introduces only one non-heuristic parameter, which allows trading off accuracy and speed without any re-training of the classifier. The proposed method automatically adapts to the difficulty of the test data and makes classification significantly faster without deteriorating the accuracy.
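
A deterministic stand-in for the stopping rule conveys the idea: query trees one at a time and stop once the leading class can no longer be overtaken by the remaining voters (the paper's actual rule uses binomial/multinomial confidence intervals rather than this hard bound):

```python
def adaptive_vote(tree_predictions, n_classes):
    """Query trees one at a time; stop once the leading class cannot be
    overtaken by the remaining voters. Returns (class, trees_asked)."""
    counts = [0] * n_classes
    total = len(tree_predictions)
    for i, p in enumerate(tree_predictions):
        counts[p] += 1
        remaining = total - (i + 1)
        ordered = sorted(counts, reverse=True)
        if ordered[0] - ordered[1] > remaining:   # margin is unassailable
            return counts.index(ordered[0]), i + 1
    return counts.index(max(counts)), total

votes = [0] * 8 + [1, 1]      # predictions of 10 trees, in query order
label, asked = adaptive_vote(votes, n_classes=2)
```

On this toy input only six of the ten trees need to be asked before the decision is safe.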

Zontak, Maria; Irani, Michal; , "Internal statistics of a single natural image," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.977-984, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995401
Abstract: Statistics of ‘natural images’ provides useful priors for solving under-constrained problems in Computer Vision. Such statistics is usually obtained from large collections of natural images. We claim that the substantial internal data redundancy within a single natural image (e.g., recurrence of small image patches), gives rise to powerful internal statistics, obtained directly from the image itself. While internal patch recurrence has been used in various applications, we provide a parametric quantification of this property. We show that the likelihood of an image patch to recur at another image location can be expressed parametrically as a function of the spatial distance from the patch, and its gradient content. This "internal parametric prior" is used to improve existing algorithms that rely on patch recurrence. Moreover, we show that internal image-specific statistics is often more powerful than general external statistics, giving rise to more powerful image-specific priors. In particular: (i) Patches tend to recur much more frequently (densely) inside the same image, than in any random external collection of natural images. (ii) Finding an equally good external representative patch for all the patches of an image requires an external database of hundreds of natural images. (iii) Internal statistics often has stronger predictive power than external statistics, indicating that it may potentially give rise to more powerful image-specific priors.
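
A crude probe of internal patch recurrence, measuring for each sampled patch the distance to its nearest other patch in the same image, can be written as follows (illustrative only; the paper fits a parametric model of recurrence as a function of spatial distance and gradient content):

```python
import numpy as np

def patch_recurrence(img, psize=5, stride=2):
    """Distance from each sampled patch to its nearest other patch in the
    same image: a crude probe of internal patch recurrence."""
    H, W = img.shape
    patches = np.array([img[y:y + psize, x:x + psize].ravel()
                        for y in range(0, H - psize + 1, stride)
                        for x in range(0, W - psize + 1, stride)])
    d2 = ((patches[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude each patch itself
    return np.sqrt(d2.min(axis=1))

base = np.arange(16, dtype=float).reshape(4, 4)
img = np.tile(base, (3, 3))               # 12x12 image whose patches recur
dists = patch_recurrence(img, psize=4, stride=4)
```

On this perfectly tiled image every sampled patch recurs exactly, so all nearest-neighbor distances are zero.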

Hussein, Mohamed; Porikli, Fatih; Li, Rui; Arslan, Suayb; , "CrossTrack: Robust 3D tracking from two cross-sectional views," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1041-1048, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995429
Abstract: One of the challenges in radiotherapy of moving tumors is to determine the location of the tumor accurately. Existing solutions to the problem are either invasive or inaccurate. We introduce a non-invasive solution to the problem by tracking the tumor in 3D using bi-plane ultrasound image sequences. We present CrossTrack, a novel tracking algorithm in this framework. We pose the problem as recursive inference of 3D location and tumor boundary segmentation in the two ultrasound views using the tumor 3D model as a prior. For the segmentation task, a robust graph-based approach is deployed as follows: First, robust segmentation priors are obtained through the tumor 3D model. Second, a unified graph combining information across time and multiple views is constructed with a robust weighting function. For the tracking task, an effective mechanism for recovery from respiration-induced occlusion is introduced. Our experiments show the robustness of CrossTrack in handling challenging tumor shapes and disappearance scenarios, with sub-voxel accuracy, and almost 100% precision and recall, significantly outperforming baseline solutions.

Liu, Ming-Yu; Tuzel, Oncel; Ramalingam, Srikumar; Chellappa, Rama; , "Entropy rate superpixel segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2097-2104, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995323
Abstract: We propose a new objective function for superpixel segmentation. This objective function consists of two components: entropy rate of a random walk on a graph and a balancing term. The entropy rate favors formation of compact and homogeneous clusters, while the balancing function encourages clusters with similar sizes. We present a novel graph construction for images and show that this construction induces a matroid — a combinatorial structure that generalizes the concept of linear independence in vector spaces. The segmentation is then given by the graph topology that maximizes the objective function under the matroid constraint. By exploiting submodular and monotonic properties of the objective function, we develop an efficient greedy algorithm. Furthermore, we prove an approximation bound of ½ for the optimality of the solution. Extensive experiments on the Berkeley segmentation benchmark show that the proposed algorithm outperforms the state of the art in all the standard evaluation metrics.
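
The greedy scheme can be sketched on the simplest matroid, a cardinality constraint, with a toy coverage function standing in for the paper's entropy-rate-plus-balancing objective:

```python
def greedy_max(ground, f, k):
    """Greedy maximization of a monotone submodular set function f under a
    cardinality constraint |S| <= k (the simplest matroid). The paper runs
    the same greedy scheme under a graph-matroid constraint, where it
    retains a 1/2 approximation guarantee."""
    S = set()
    for _ in range(k):
        gains = {e: f(S | {e}) - f(S) for e in ground - S}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        S = S | {best}
    return S

# toy coverage function: f(S) = number of elements covered by chosen sets
cover = {0: {1, 2, 3}, 1: {3, 4}, 2: {5}}
f = lambda S: len(set().union(*[cover[i] for i in S]))
S = greedy_max({0, 1, 2}, f, 2)
```

Greedy first picks set 0 (gain 3), then either remaining set (gain 1), covering four elements in total.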

Pizarro, Daniel; Bartoli, Adrien; , "Global optimization for optimal generalized procrustes analysis," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2409-2415, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995677
Abstract: This paper deals with generalized procrustes analysis: the problem of registering a set of shape data by estimating a reference shape and a set of rigid transformations given point correspondences. The transformed shape data must align with the reference shape as well as possible. This is a difficult problem. The classical approach alternates between computing the reference shape, usually as the average of the transformed shapes, and re-estimating each transformation in turn. We propose a global approach to generalized procrustes analysis for two- and three-dimensional shapes. It uses modern convex optimization based on the theory of Sum Of Squares functions. We show how to convert the whole procrustes problem, including missing data, into a semidefinite program. Our approach is statistically grounded: it finds the maximum likelihood estimate. We provide results on synthetic and real datasets. Compared to classical alternation, our algorithm obtains lower errors. The gap is largest when similarities are estimated or when the shape data have significant deformations.
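
The classical alternation baseline the paper compares against is easy to state: average the aligned shapes into a reference, re-align every shape to that reference, and repeat (rotations only here, for brevity; `best_rotation` is the standard orthogonal Procrustes solution via SVD):

```python
import numpy as np

def best_rotation(src, dst):
    """Orthogonal Procrustes: orthogonal R such that src @ R best matches dst."""
    U, _, Vt = np.linalg.svd(src.T @ dst)
    return U @ Vt

def alternation_gpa(shapes, iters=20):
    """Classical alternation: average the aligned shapes into a reference,
    re-align every shape to it, and repeat (rotations only, for brevity)."""
    aligned = [s.copy() for s in shapes]
    for _ in range(iters):
        ref = np.mean(aligned, axis=0)
        aligned = [s @ best_rotation(s, ref) for s in shapes]
    return np.mean(aligned, axis=0), aligned

square = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
R90 = np.array([[0.0, 1.0], [-1.0, 0.0]])      # 90-degree rotation
ref, aligned = alternation_gpa([square, square @ R90])
```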

Glasner, Daniel; Vitaladevuni, Shiv N.; Basri, Ronen; , "Contour-based joint clustering of multiple segmentations," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2385-2392, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995436
Abstract: We present an unsupervised, shape-based method for joint clustering of multiple image segmentations. Given two or more closely-related images, such as nearby frames in a video sequence or images of the same scene taken under different lighting conditions, our method generates a joint segmentation of the images. We introduce a novel contour-based representation that allows us to cast the shape-based joint clustering problem as a quadratic semi-assignment problem. Our score function is additive. We use complex-valued affinities to assess the quality of matching the edge elements at the exterior bounding contour of clusters, while ignoring the contributions of elements that fall in the interior of the clusters. We further combine this contour-based score with region information and use a linear programming relaxation to solve for the joint clusters. We evaluate our approach on the occlusion boundary dataset of Stein et al.

Murray, Naila; Vanrell, Maria; Otazu, Xavier; Parraga, C. Alejandro; , "Saliency estimation using a non-parametric low-level vision model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.433-440, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995506
Abstract: Many successful models for predicting attention in a scene involve three main steps: convolution with a set of filters, a center-surround mechanism and spatial pooling to construct a saliency map. However, integrating spatial information and justifying the choice of various parameter values remain open problems. In this paper we show that an efficient model of color appearance in human vision, which contains a principled selection of parameters as well as an innate spatial pooling mechanism, can be generalized to obtain a saliency model that outperforms state-of-the-art models. Scale integration is achieved by an inverse wavelet transform over the set of scale-weighted center-surround responses. The scale-weighting function (termed ECSF) has been optimized to better replicate psychophysical data on color appearance, and the appropriate sizes of the center-surround inhibition windows have been determined by training a Gaussian Mixture Model on eye-fixation data, thus avoiding ad-hoc parameter selection. Additionally, we conclude that the extension of a color appearance model to saliency estimation adds to the evidence for a common low-level visual front-end for different visual tasks.

Baboud, Lionel; Cadik, Martin; Eisemann, Elmar; Seidel, Hans-Peter; , "Automatic photo-to-terrain alignment for the annotation of mountain pictures," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.41-48, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995727
Abstract: We present a system for the annotation and augmentation of mountain photographs. The key issue resides in the registration of a given photograph with a 3D geo-referenced terrain model. Typical outdoor images contain little structural information, particularly mountain scenes whose aspect changes drastically across seasons and varying weather conditions. Existing approaches usually fail on such difficult scenarios. To avoid the burden of manual registration, we propose a novel automatic technique. Given only viewpoint and FOV estimates, the technique is able to automatically derive the pose of the camera relative to the geometric terrain model. We make use of silhouette edges, which are among the most reliable features that can be detected in the targeted situations. Using an edge detection algorithm, our technique then searches for the best match with silhouette edges rendered using the synthetic model. We develop a robust matching metric allowing us to cope with the inevitable noise affecting detected edges (e.g. due to clouds, snow, rocks, forests, or any phenomenon not encoded in the digital model). Once registered against the model, photographs can easily be augmented with annotations (e.g. topographic data, peak names, paths), which would otherwise imply a tedious fusion process. We further illustrate various other applications, such as 3D model-assisted image enhancement, or, inversely, texturing of digital models.

Kolomenkin, Michael; Leifman, George; Shimshoni, Ilan; Tal, Ayellet; , "Reconstruction of relief objects from line drawings," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.993-1000, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995643
Abstract: This paper addresses the problem of automatic reconstruction of a 3D relief from a line drawing on top of a given base object. Reconstruction is challenging for four reasons: the sparsity of the strokes, their ambiguity, their large number, and their inter-relations. Our approach is able to reconstruct a model from a complex drawing that consists of many inter-related strokes. Rather than viewing the inter-dependencies as a problem, we show how they can be exploited to automatically generate a good initial interpretation of the line drawing. Then, given a base and an interpretation, we propose an algorithm for reconstructing a consistent surface. The strength of our approach is demonstrated in the reconstruction of archaeological artifacts from drawings. These drawings are highly challenging, since artists created very complex and detailed descriptions of artifacts regardless of any considerations concerning their future use for shape reconstruction.

Taylor, Graham W.; Spiro, Ian; Bregler, Christoph; Fergus, Rob; , "Learning invariance through imitation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2729-2736, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995538
Abstract: Supervised methods for learning an embedding aim to map high-dimensional images to a space in which perceptually similar observations have high measurable similarity. Most approaches rely on binary similarity, typically defined by class membership where labels are expensive to obtain and/or difficult to define. In this paper we propose crowd-sourcing similar images by soliciting human imitations. We exploit temporal coherence in video to generate additional pairwise graded similarities between the user-contributed imitations. We introduce two methods for learning nonlinear, invariant mappings that exploit graded similarities. We learn a model that is highly effective at matching people in similar pose. It exhibits remarkable invariance to identity, clothing, background, lighting, shift and scale.

Parikh, Devi; Zitnick, C. Lawrence; , "Finding the weakest link in person detectors," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1425-1432, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995450
Abstract: Detecting people remains a popular and challenging problem in computer vision. In this paper, we analyze parts-based models for person detection to determine which components of their pipeline could benefit the most if improved. We accomplish this task by studying numerous detectors formed from combinations of components performed by human subjects and machines. The parts-based model we study can be roughly broken into four components: feature detection, part detection, spatial part scoring and contextual reasoning including non-maximal suppression. Our experiments conclude that part detection is the weakest link for challenging person detection datasets. Non-maximal suppression and context can also significantly boost performance. However, the use of human or machine spatial models does not significantly or consistently affect detection accuracy.

Fanelli, Gabriele; Gall, Juergen; Van Gool, Luc; , "Real time head pose estimation with random regression forests," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.617-624, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995458
Abstract: Fast and reliable algorithms for estimating the head pose are essential for many applications and higher-level face analysis tasks. We address the problem of head pose estimation from depth data, which can be captured using the ever more affordable 3D sensing technologies available today. To achieve robustness, we formulate pose estimation as a regression problem. While detecting specific face parts like the nose is sensitive to occlusions, learning the regression on rather generic surface patches requires an enormous amount of training data in order to achieve accurate estimates. We propose to use random regression forests for the task at hand, given their capability to handle large training datasets. Moreover, we synthesize a great amount of annotated training data using a statistical model of the human face. In our experiments, we show that our approach can handle real data presenting large pose changes, partial occlusions, and facial expressions, even though it is trained only on synthetic neutral face data. We have thoroughly evaluated our system on a publicly available database on which we achieve state-of-the-art performance without having to resort to the graphics card.

Rhemann, Christoph; Hosni, Asmaa; Bleyer, Michael; Rother, Carsten; Gelautz, Margrit; , "Fast cost-volume filtering for visual correspondence and beyond," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3017-3024, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995372
Abstract: Many computer vision tasks can be formulated as labeling problems. The desired solution is often a spatially smooth labeling where label transitions are aligned with color edges of the input image. We show that such solutions can be efficiently achieved by smoothing the label costs with a very fast edge preserving filter. In this paper we propose a generic and simple framework comprising three steps: (i) constructing a cost volume, (ii) fast cost volume filtering and (iii) winner-take-all label selection. Our main contribution is to show that with such a simple framework state-of-the-art results can be achieved for several computer vision applications. In particular, we achieve (i) disparity maps in real-time, whose quality exceeds those of all other fast (local) approaches on the Middlebury stereo benchmark, and (ii) optical flow fields with very fine structures as well as large displacements. To demonstrate robustness, the few parameters of our framework are set to nearly identical values for both applications. Also, competitive results for interactive image segmentation are presented. With this work, we hope to inspire other researchers to leverage this framework in other application areas.
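
The three-step framework can be sketched for stereo, with a plain mean filter standing in for the paper's fast edge-preserving filter (so this sketch lacks the edge-aligned label transitions the real filter provides):

```python
import numpy as np

def mean_filter(c, r):
    """Naive box (mean) filter; the paper uses a fast edge-preserving
    (guided) filter here instead."""
    H, W = c.shape
    out = np.empty_like(c)
    for y in range(H):
        for x in range(W):
            out[y, x] = c[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1].mean()
    return out

def stereo_wta(left, right, max_disp, r=1):
    """(i) build a cost volume, (ii) filter each cost slice,
    (iii) winner-take-all label (disparity) selection."""
    volume = np.empty((max_disp + 1,) + left.shape)
    for d in range(max_disp + 1):
        shifted = np.roll(right, d, axis=1)       # crude wrap-around borders
        volume[d] = mean_filter(np.abs(left - shifted), r)
    return volume.argmin(axis=0)

rng = np.random.default_rng(0)
left = rng.random((8, 12))
right = np.roll(left, -2, axis=1)                 # right view shifted by 2 px
disp = stereo_wta(left, right, max_disp=4)
```

Because the synthetic right view is an exact 2-pixel shift of the left view, winner-take-all recovers a constant disparity of 2 everywhere.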

Adato, Yair; Zickler, Todd; Ben-Shahar, Ohad; , "A polar representation of motion and implications for optical flow," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1145-1152, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995419
Abstract: We explore a polar representation of optical flow in which each element of the brightness motion field is represented by its magnitude and orientation instead of its Cartesian projections. This seemingly small change in representation provides more direct access to the intrinsic structure of a flow field, and when used with existing variational inference procedures it provides a framework in which regularizers can be intuitively tailored for very different classes of motion. Our evaluations reveal that a flow estimation algorithm based on a polar representation can perform as well as or better than the state-of-the-art when applied to traditional optical flow problems concerning camera or rigid scene motion, and at the same time, it facilitates both qualitative and quantitative improvements for non-traditional cases such as fluid flows and specular flows, whose structure is very different.
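
The change of representation itself is a two-line transform (the paper's contribution lies in the variational machinery built on top of it, not in the conversion):

```python
import numpy as np

def flow_to_polar(u, v):
    """Re-parameterize a Cartesian flow field (u, v) as magnitude and
    orientation, the representation the paper regularizes directly."""
    return np.hypot(u, v), np.arctan2(v, u)

def polar_to_flow(mag, ang):
    """Inverse transform back to Cartesian components."""
    return mag * np.cos(ang), mag * np.sin(ang)

u = np.array([[1.0, 0.0], [3.0, 0.0]])
v = np.array([[0.0, 2.0], [4.0, 0.0]])
mag, ang = flow_to_polar(u, v)
```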

Lee, Jungmin; Cho, Minsu; Lee, Kyoung Mu; , "Hyper-graph matching via reweighted random walks," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1633-1640, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995387
Abstract: Establishing correspondences between two feature sets is a fundamental issue in computer vision, pattern recognition, and machine learning. This problem can be well formulated as graph matching in which nodes represent feature points while edges describe pairwise relations between feature points. Recently, several works have tried to embed higher-order relations of feature points via hyper-graph matching formulations. In this paper, we generalize the previous hyper-graph matching formulations to cover relations of features in arbitrary orders, and propose a novel state-of-the-art algorithm by reinterpreting the random walk concept on the hyper-graph in a probabilistic manner. Adopting personalized jumps with a reweighting scheme, the algorithm effectively reflects the one-to-one matching constraints during the random walk process. Comparative experiments on synthetic data and real images show that the proposed method clearly outperforms existing algorithms especially in the presence of noise and outliers.
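
The classic spectral-matching relaxation, a simpler cousin of this method, conveys the setup: power iteration on a pairwise affinity matrix over candidate matches, followed by greedy one-to-one discretization (the paper instead folds the one-to-one constraints into the walk itself via personalized, reweighted jumps; the toy affinities below are hypothetical):

```python
import numpy as np

def spectral_matching(M, n1, n2, iters=100):
    """Power iteration on the match-affinity matrix M (rows/cols index
    candidate matches i->a as i * n2 + a), then greedy one-to-one
    discretization of the principal eigenvector."""
    x = np.ones(n1 * n2)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    match, used = {}, set()
    for idx in np.argsort(-x):            # strongest candidates first
        i, a = divmod(int(idx), n2)
        if i not in match and a not in used:
            match[i] = a
            used.add(a)
    return match

# toy affinities: matches (0->0) and (1->1) are mutually consistent,
# (0->1) and (1->0) only weakly so
M = np.zeros((4, 4))
M[0, 3] = M[3, 0] = 1.0
M[1, 2] = M[2, 1] = 0.1
match = spectral_matching(M, 2, 2)
```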

Maji, Subhransu; Vishnoi, Nisheeth K.; Malik, Jitendra; , "Biased normalized cuts," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2057-2064, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995630
Abstract: We present a modification of "Normalized Cuts" to incorporate priors which can be used for constrained image segmentation. Compared to previous generalizations of "Normalized Cuts" which incorporate constraints, our technique has two advantages. First, we seek solutions which are sufficiently "correlated" with priors which allows us to use noisy top-down information, for example from an object detector. Second, given the spectral solution of the unconstrained problem, the solution of the constrained one can be computed in small additional time, which allows us to run the algorithm in an interactive mode. We compare our algorithm to other graph cut based algorithms and highlight the advantages.
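The unconstrained spectral machinery this paper builds on can be sketched on a toy graph. This is the standard normalized-cut relaxation (second eigenvector of the normalized Laplacian), not the biased variant the paper introduces; the affinity values are made up:

```python
import numpy as np

# Affinity matrix for a tiny graph with two obvious clusters {0,1} and {2,3}.
W = np.array([[0.00, 1.00, 0.05, 0.00],
              [1.00, 0.00, 0.05, 0.00],
              [0.05, 0.05, 0.00, 1.00],
              [0.00, 0.00, 1.00, 0.00]])
d = W.sum(axis=1)
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(4) - D_isqrt @ W @ D_isqrt   # normalized graph Laplacian

vals, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
y = D_isqrt @ vecs[:, 1]                     # relaxed normalized-cut indicator
labels = y > 0                               # threshold at zero to bipartition
```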

Jegelka, Stefanie; Bilmes, Jeff; , "Submodularity beyond submodular energies: Coupling edges in graph cuts," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1897-1904, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995589
Abstract: We propose a new family of non-submodular global energy functions that still use submodularity internally to couple edges in a graph cut. We show it is possible to develop an efficient approximation algorithm that, thanks to the internal submodularity, can use standard graph cuts as a subroutine. We demonstrate the advantages of edge coupling in a natural setting, namely image segmentation. In particular, for fine-structured objects and objects with shading variation, our structured edge coupling leads to significant improvements over standard approaches.

Lee, Philip; Wu, Ying; , "Nonlocal matting," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2193-2200, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995665
Abstract: This work attempts to considerably reduce the amount of user effort in the natural image matting problem. The key observation is that the nonlocal principle, introduced to denoise images, can be successfully applied to the alpha matte to obtain sparsity in the matte representation, and therefore dramatically reduce the number of pixels a user needs to manually label. We show how to avoid making the user provide redundant and unnecessary input, how to cluster the image pixels for the user to label, and how to perform high-quality matte extraction. We show that this algorithm is therefore faster, easier to use, and produces higher-quality mattes than state-of-the-art methods.

Kitani, Kris M.; Okabe, Takahiro; Sato, Yoichi; Sugimoto, Akihiro; , "Fast unsupervised ego-action learning for first-person sports videos," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3241-3248, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995406
Abstract: Portable high-quality sports cameras (e.g. head or helmet mounted) built for recording dynamic first-person video footage are becoming a common item among many sports enthusiasts. We address the novel task of discovering first-person action categories (which we call ego-actions), which can be useful for such tasks as video indexing and retrieval. In order to learn ego-action categories, we investigate the use of motion-based histograms and unsupervised learning algorithms to quickly cluster video content. Our approach assumes a completely unsupervised scenario, where labeled training videos are not available, videos are not pre-segmented and the number of ego-action categories is unknown. In our proposed framework we show that a stacked Dirichlet process mixture model can be used to automatically learn a motion histogram codebook and the set of ego-action categories. We quantitatively evaluate our approach on both in-house and public YouTube videos and demonstrate robust ego-action categorization across several sports genres. Comparative analysis shows that our approach outperforms other state-of-the-art topic models with respect to both classification accuracy and computational speed. Preliminary results indicate that on average, the categorical content of a 10 minute video sequence can be indexed in under 5 seconds.

Chen, Xiaowu; Chen, Mengmeng; Jin, Xin; Zhao, Qinping; , "Face illumination transfer through edge-preserving filters," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.281-287, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995473
Abstract: This article proposes a novel image-based method to transfer illumination from a reference face image to a target face image through edge-preserving filters. Our method needs only a single reference image, without any knowledge of the 3D geometry or material information of the target face. We first decompose the lightness layers of the reference and the target images into large-scale and detail layers through a weighted least squares (WLS) filter after face alignment. The large-scale layer of the reference image is filtered with the guidance of the target image. Adaptive parameter selection schemes for the edge-preserving filters are proposed for the above two filtering steps. The final relit result is obtained by replacing the large-scale layer of the target image with that of the reference image. We obtain convincing relit results on numerous target and reference face images with different lighting effects and genders. Comparisons with previous work show that our method is less affected by geometry differences and better preserves the identity structure and skin color of the target face.
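The large-scale/detail decomposition the abstract describes can be illustrated with a crude smoother standing in for the WLS edge-preserving filter (a box blur here, purely for illustration; the toy lightness layer is made up):

```python
import numpy as np

def box_blur(img, radius=1):
    """Crude stand-in for the WLS edge-preserving filter used in the paper."""
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=float)
    k = 2 * radius + 1
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

lightness = np.outer(np.linspace(0.2, 0.8, 8), np.ones(8))  # toy lightness layer
large_scale = box_blur(lightness)      # smooth, illumination-carrying layer
detail = lightness - large_scale       # residual detail layer
# Relighting then swaps the target's large-scale layer for the reference's:
# relit = reference_large_scale + detail
```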

Moreno-Noguer, Francesc; Porta, Josep M.; , "Probabilistic simultaneous pose and non-rigid shape recovery," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1289-1296, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995532
Abstract: We present an algorithm to simultaneously recover non-rigid shape and camera poses from point correspondences between a reference shape and a sequence of input images. The key novel contribution of our approach is in bringing the tools of the probabilistic SLAM methodology from a rigid to a deformable domain. Under the assumption that the shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses may be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least squares optimization. An extensive evaluation on synthetic and real data shows that our approach has several significant advantages over current approaches, such as performing robustly under large amounts of noise and outliers, and requiring neither tracking points over the whole sequence nor initializations close to the ground truth solution.

Matei, Bogdan C.; Sawhney, Harpreet S.; Samarasekera, Supun; , "Vehicle tracking across nonoverlapping cameras using joint kinematic and appearance features," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3465-3472, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995575
Abstract: We describe a vehicle tracking algorithm using input from a network of nonoverlapping cameras. Our algorithm is based on a novel statistical formulation that uses joint kinematic and image appearance information to link local tracks of the same vehicles into global tracks with longer persistence. The algorithm can handle significant spatial separation between the cameras and is robust to challenging tracking conditions such as high traffic density, or complex road infrastructure. In these cases, traditional tracking formulations based on MHT, or JPDA algorithms, may fail to produce track associations across cameras due to the weak predictive models employed. We make several new contributions in this paper. Firstly, we model kinematic constraints between any two local tracks using road networks and transit time distributions. The transit time distributions are calculated dynamically as convolutions of normalized transit time distributions that are learned and adapted separately for individual roads. Secondly, we present a complete statistical tracker formulation, which combines kinematic and appearance likelihoods within a multi-hypothesis framework. We have extensively evaluated the algorithm proposed using a network of ground-based cameras with narrow field of view. The tracking results obtained on a large ground-truthed dataset demonstrate the effectiveness of the algorithm proposed.
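The abstract's statement that path transit-time distributions are convolutions of per-road distributions follows from summing independent per-road transit times. A toy sketch with made-up distributions:

```python
import numpy as np

# Normalized per-road transit-time distributions (hypothetical, 1-second bins).
road_a = np.array([0.0, 0.2, 0.5, 0.3])   # P(time on road A = k seconds)
road_b = np.array([0.1, 0.6, 0.3])        # P(time on road B = k seconds)

# The transit time over the two-road path is the sum of the independent
# per-road times, so its distribution is the convolution of the two.
path = np.convolve(road_a, road_b)
# path[k] = sum_j road_a[j] * road_b[k - j], and path still sums to 1.
```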

Chen, Xiaogang; He, Xiangjian; Yang, Jie; Wu, Qiang; , "An effective document image deblurring algorithm," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.369-376, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995568
Abstract: Deblurring camera-captured document images is an important task in digital document processing, since it can improve both the accuracy of optical character recognition systems and the visual quality of document images. Traditional deblurring algorithms have been proposed to work for natural-scene images. However, natural-scene images differ substantially from document images. In this paper, the distinct characteristics of document images are investigated. We propose a content-aware prior for document image deblurring. It is based on document image foreground segmentation. In addition, an upper-bound constraint combined with a total-variation-based method is proposed to suppress ringing in the deblurred image. Compared with traditional general-purpose deblurring methods, the proposed deblurring algorithm produces more pleasing results on document images. Encouraging experimental results demonstrate the efficacy of the proposed method.

Bergamasco, Filippo; Albarelli, Andrea; Rodola, Emanuele; Torsello, Andrea; , "RUNE-Tag: A high accuracy fiducial marker with strong occlusion resilience," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.113-120, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995544
Abstract: Over the last decades fiducial markers have provided widely adopted tools to add reliable model-based features into an otherwise general scene. Given their central role in many computer vision tasks, countless different solutions have been proposed in the literature. Some designs are focused on the accuracy of the recovered camera pose with respect to the tag; others concentrate on reaching high detection speed or on recognizing a large number of distinct markers in the scene. In such a crowded area both the researcher and the practitioner are licensed to wonder if there is any need to introduce yet another approach. Nevertheless, with this paper, we would like to present a general purpose fiducial marker system that can be deemed to add some valuable features to the pack. Specifically, by exploiting the projective properties of a circular set of sizeable dots, we propose a detection algorithm that is highly accurate. Further, applying a dot pattern scheme derived from error-correcting codes allows for robustness with respect to very large occlusions. In addition, the design of the marker itself is flexible enough to accommodate different requirements in terms of pose accuracy and number of patterns. The overall performance of the marker system is evaluated in an extensive experimental section, where a comparison with a well-known baseline technique is presented.

Kowdle, Adarsh; Chang, Yao-Jen; Gallagher, Andrew; Chen, Tsuhan; , "Active learning for piecewise planar 3D reconstruction," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.929-936, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995638
Abstract: This paper presents an active-learning algorithm for piecewise planar 3D reconstruction of a scene. While previous interactive algorithms require the user to provide tedious interactions to identify all the planes in the scene, we build on successful ideas from the automatic algorithms and introduce the idea of active learning, thereby improving the reconstructions while considerably reducing the effort. Our algorithm first attempts to obtain a piecewise planar reconstruction of the scene automatically through an energy minimization framework. The proposed active-learning algorithm then uses intuitive cues to quantify the uncertainty of the algorithm and suggest regions, querying the user to provide support for the uncertain regions via simple scribbles. These interactions are used to suitably update the algorithm, leading to better reconstructions. We show through machine experiments and a user study that the proposed approach can intelligently query users for interactions on informative regions, and users can achieve better reconstructions of the scene faster, especially for scenes with texture-less surfaces lacking cues like lines which automatic algorithms rely on.

Zhu, Yingying; Cox, Mark; Lucey, Simon; , "3D motion reconstruction for real-world camera motion," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1-8, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995650
Abstract: This paper addresses the problem of 3D motion reconstruction from a series of 2D projections under low reconstructibility. Reconstructibility defines the accuracy of a 3D reconstruction from 2D projections given a particular trajectory basis, 3D point trajectory, and 3D camera center trajectory. Reconstructibility accuracy is inherently related to the correlation between point and camera trajectories. Poor correlation leads to good reconstruction; high correlation leads to poor reconstruction. Unfortunately, in most real-world situations involving non-rigid objects (e.g. bodies), camera and point motions are highly correlated (i.e., slow and smooth), resulting in poor reconstructibility. In this paper, we propose a novel approach for 3D motion reconstruction of non-rigid body motion in the presence of real-world camera motion. Specifically we: (i) propose the inclusion of a small number of keyframes in the video sequence from which 3D coordinates are inferred/estimated to circumvent ambiguities between point and camera motion, and (ii) employ an L1 penalty term to enforce a sparsity constraint on the trajectory basis coefficients so as to ensure our reconstructions are consistent with the natural compressibility of human motion. We demonstrate impressive 3D motion reconstruction for 2D projection sequences with hitherto low reconstructibility.
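An L1 penalty on basis coefficients of the kind the abstract describes is commonly minimized with proximal methods. A toy ISTA sketch on a synthetic sparse-recovery problem (not the paper's solver or data; all sizes and the penalty weight are illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm: shrink each coefficient toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# Toy problem: recover sparse basis coefficients b from projections y = A @ b.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b_true = np.zeros(10)
b_true[[2, 7]] = [1.5, -2.0]
y = A @ b_true

lam = 0.1
step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
b = np.zeros(10)
for _ in range(500):  # ISTA: gradient step on the data term, then l1 shrinkage
    b = soft_threshold(b - step * A.T @ (A @ b - y), lam * step)
# b recovers the two active coefficients; the rest are driven to (near) zero.
```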

Duan, Lijuan; Wu, Chunpeng; Miao, Jun; Qing, Laiyun; Fu, Yu; , "Visual saliency detection by spatially weighted dissimilarity," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.473-480, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995676
Abstract: In this paper, a new visual saliency detection method is proposed based on the spatially weighted dissimilarity. We measured the saliency by integrating three elements as follows: the dissimilarities between image patches, which were evaluated in the reduced dimensional space, the spatial distance between image patches and the central bias. The dissimilarities were inversely weighted based on the corresponding spatial distance. A weighting mechanism, indicating a bias for human fixations to the center of the image, was employed. The principal component analysis (PCA) was the dimension reducing method used in our system. We extracted the principal components (PCs) by sampling the patches from the current image. Our method was compared with four saliency detection approaches using three image datasets. Experimental results show that our method outperforms current state-of-the-art methods on predicting human fixations.
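The three ingredients the abstract lists (dissimilarity in a PCA-reduced space, inverse spatial weighting, center bias) compose straightforwardly. A toy sketch with random patches; all sizes and weighting choices here are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
patches = rng.standard_normal((16, 25))        # 16 image patches, 25-dim each
coords = np.array([(i // 4, i % 4) for i in range(16)], dtype=float)

# Reduce dimension with PCs sampled from the current image itself.
centered = patches - patches.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:6].T                     # keep 6 principal components

saliency = np.zeros(16)
center = coords.mean(axis=0)
for i in range(16):
    diss = np.linalg.norm(proj - proj[i], axis=1)       # patch dissimilarities
    dist = np.linalg.norm(coords - coords[i], axis=1)   # spatial distances
    saliency[i] = (diss / (1.0 + dist)).sum()           # inverse spatial weighting
saliency *= np.exp(-np.linalg.norm(coords - center, axis=1))  # center bias
```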

Liu, Baiyang; Huang, Junzhou; Yang, Lin; Kulikowski, Casimir; , "Robust tracking using local sparse appearance model and K-selection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1313-1320, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995730
Abstract: Online learned tracking is widely used for its ability to adapt to appearance changes. However, it introduces potential drifting problems due to the accumulation of errors during the self-updating, especially for occluded scenarios. The recent literature demonstrates that appropriate combinations of trackers can help balance stability and flexibility requirements. We have developed a robust tracking algorithm using a local sparse appearance model (SPT). A static sparse dictionary and a dynamically online updated basis distribution model the target appearance. A novel sparse representation-based voting map and a sparse constraint regularized mean-shift support the robust object tracking. Besides these contributions, we also introduce a new dictionary learning algorithm with a locally constrained sparse representation, called K-Selection. Based on a set of comprehensive experiments, our algorithm has demonstrated better performance than alternatives reported in the recent literature.

Feng, Jiashi; Ni, Bingbing; Tian, Qi; Yan, Shuicheng; , "Geometric ℓp-norm feature pooling for image classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2609-2704, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995370
Abstract: Modern visual classification models generally include a feature pooling step, which aggregates local features over the region of interest into a statistic through a certain spatial pooling operation. Two commonly used operations are average and max pooling. However, recent theoretical analysis has indicated that neither of these two pooling techniques may be qualified to be optimal. Moreover, we reveal in this work that more severe limitations of these two pooling methods stem from the unrecoverable loss of spatial information during the statistical summarization and the underlying over-simplified assumption about the feature distribution. We aim to address these inherent issues in this work and generalize previous pooling methods as follows. We define a weighted ℓp-norm spatial pooling function tailored for the class-specific feature spatial distribution. Moreover, a sensible prior for the feature spatial correlation is incorporated. Optimizing such a pooling function towards optimal class separability yields the so-called geometric ℓp-norm pooling (GLP) method. The described GLP method is capable of preserving the class-specific spatial/geometric information in the pooled features and significantly boosts the discriminating capability of the resultant features for image classification. Comprehensive evaluations on several image benchmarks demonstrate that the proposed GLP method can boost image classification performance with a single type of feature to outperform or be comparable with the state of the art.
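A weighted ℓp-norm pooling function interpolates between the two classical operators: with uniform weights, p = 1 recovers average pooling and p → ∞ approaches max pooling. A minimal sketch (uniform weights; the paper learns class-specific spatial weights, which are omitted here):

```python
def lp_pool(features, p, weights=None):
    """Weighted l_p pooling of scalar feature responses.

    With uniform weights, p=1 reduces to the mean and large p approaches the max.
    """
    if weights is None:
        weights = [1.0 / len(features)] * len(features)
    return sum(w * f ** p for w, f in zip(weights, features)) ** (1.0 / p)

feats = [0.2, 0.9, 0.4]
avg_like = lp_pool(feats, 1)    # equals the mean, 0.5
max_like = lp_pool(feats, 50)   # close to max(feats) = 0.9
```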

Lasowski, Ruxandra; Tevs, Art; Wand, Michael; Seidel, Hans-Peter; , "Wavelet belief propagation for large scale inference problems," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1921-1928, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995489
Abstract: Loopy belief propagation (LBP) is a powerful tool for approximate inference in Markov random fields (MRFs). However, for problems with large state spaces, the runtime costs are often prohibitively high. In this paper, we present a new LBP algorithm that represents all beliefs, marginals, and messages in a wavelet representation, which can encode the probabilistic information much more compactly. Unlike previous work, our algorithm operates solely in the wavelet domain. This yields an output-sensitive algorithm where the running time depends mostly on the information content rather than the discretization resolution. We apply the new technique to typical problems with large state spaces such as image matching and wide-baseline optical flow where we observe a significantly improved scaling behavior with discretization resolution. For large problems, the new technique is significantly faster than even an optimized spatial domain implementation.

Zhou, Ziheng; Zhao, Guoying; Pietikainen, Matti; , "Towards a practical lipreading system," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.137-144, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995345
Abstract: A practical lipreading system can be considered either as subject dependent (SD) or subject-independent (SI). An SD system is user-specific, i.e., customized for some particular user, while an SI system has to cope with a large number of users. These two types of systems pose different challenges and have to be treated differently. In this paper, we propose a simple deterministic model to tackle the problem. The model first seeks a low-dimensional manifold where visual features extracted from the frames of a video can be projected onto a continuous deterministic curve embedded in a path graph. Moreover, it can map arbitrary points on the curve back into the image space, making it suitable for temporal interpolation. Based on the model, we develop two separate strategies for SD and SI lipreading. The former is turned into a simple curve-matching problem while for the latter, we propose a video-normalization scheme to improve the system developed by Zhao et al. We evaluated our system on the OuluVS database and achieved recognition rates more than 20% higher than the ones reported by Zhao et al. in both SD and SI testing scenarios.

Jorstad, Anne; Jacobs, David; Trouve, Alain; , "A deformation and lighting insensitive metric for face recognition based on dense correspondences," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2353-2360, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995431
Abstract: Face recognition is a challenging problem, complicated by variations in pose, expression, lighting, and the passage of time. Significant work has been done to solve each of these problems separately. We consider the problems of lighting and expression variation together, proposing a method that accounts for both variabilities within a single model. We present a novel deformation and lighting insensitive metric to compare images, and we present a novel framework to optimize over this metric to calculate dense correspondences between images. Typical correspondence cost patterns are learned between face image pairs and a Naïve Bayes classifier is applied to improve recognition accuracy. Very promising results are presented on the AR Face Database, and we note that our method can be extended to a broad set of applications.

Zhang, Wei; Wang, Xiaogang; Tang, Xiaoou; , "Coupled information-theoretic encoding for face photo-sketch recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.513-520, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995324
Abstract: Automatic face photo-sketch recognition has important applications for law enforcement. Recent research has focused on transforming photos and sketches into the same modality for matching or developing advanced classification algorithms to reduce the modality gap between features extracted from photos and sketches. In this paper, we propose a new inter-modality face recognition approach by reducing the modality gap at the feature extraction stage. A new face descriptor based on coupled information-theoretic encoding is used to capture discriminative local face structures and to effectively match photos and sketches. Guided by maximizing the mutual information between photos and sketches in the quantized feature spaces, the coupled encoding is achieved by the proposed coupled information-theoretic projection tree, which is extended to the randomized forest to further boost the performance. We create the largest face sketch database, including sketches of 1,194 people from the FERET database. Experiments on this large-scale dataset show that our approach significantly outperforms the state-of-the-art methods.
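The mutual-information objective that guides the coupled encoding can be computed directly from joint counts of quantized codes. A small sketch (toy code sequences, not the paper's features or trees):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two sequences of discrete codes."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# Codes that are perfectly coupled across modalities carry maximal MI; the
# coupled encoding is trained to push photo/sketch codes toward this regime.
photo_codes = [0, 0, 1, 1, 2, 2]
sketch_codes = [5, 5, 6, 6, 7, 7]
mi = mutual_information(photo_codes, sketch_codes)
```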

Isola, Phillip; Xiao, Jianxiong; Torralba, Antonio; Oliva, Aude; , "What makes an image memorable?," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.145-152, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995721
Abstract: When glancing at a magazine, or browsing the Internet, we are continuously being exposed to photographs. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with some of their visual details. But not all images are equal in memory. Some stick in our minds, and others are forgotten. In this paper we focus on the problem of predicting how memorable an image will be. We show that memorability is a stable property of an image that is shared across different viewers. We introduce a database for which we have measured the probability that each picture will be remembered after a single view. We analyze image features and labels that contribute to making an image memorable, and we train a predictor based on global image descriptors. We find that predicting image memorability is a task that can be addressed with current computer vision techniques. Whereas making memorable images is a challenging task in visualization and photography, this work is a first attempt to quantify this useful quality of images.

Yuan, Junsong; Yang, Ming; Wu, Ying; , "Mining discriminative co-occurrence patterns for visual recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2777-2784, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995476
Abstract: The co-occurrence pattern, a combination of binary or local features, is more discriminative than individual features and has shown its advantages in object, scene, and action recognition. We discuss two types of co-occurrence patterns that are complementary to each other, the conjunction (AND) and disjunction (OR) of binary features. The necessary condition for identifying discriminative co-occurrence patterns is first provided. Then we propose a novel data mining method to efficiently discover the optimal co-occurrence pattern with minimum empirical error, despite the noisy training dataset. This mining procedure of AND and OR patterns readily integrates into boosting, which improves the generalization ability over conventional boosting decision trees and boosting decision stumps. Our versatile experiments on object, scene, and action categorization validate the advantages of the discovered discriminative co-occurrence patterns.
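Scoring AND/OR conjunctions of binary features by empirical error, the quantity the abstract's mining procedure minimizes, can be sketched by brute force on a tiny pattern set (toy features and labels; the paper's mining method avoids this exhaustive search):

```python
from itertools import combinations

def empirical_error(pred, labels):
    """Fraction of samples the pattern's prediction gets wrong."""
    return sum(p != y for p, y in zip(pred, labels)) / len(labels)

# Toy binary feature responses per sample, and binary class labels (made up).
X = [(1, 1, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
y = [1, 1, 0, 0]

best = None
for i, j in combinations(range(3), 2):
    and_pred = [int(x[i] and x[j]) for x in X]   # conjunction (AND) pattern
    or_pred = [int(x[i] or x[j]) for x in X]     # disjunction (OR) pattern
    for name, pred in (("AND", and_pred), ("OR", or_pred)):
        err = empirical_error(pred, y)
        if best is None or err < best[0]:
            best = (err, name, (i, j))
# best holds the minimum-error co-occurrence pattern found.
```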

Kulkarni, Girish; Premraj, Visruth; Dhar, Sagnik; Li, Siming; Choi, Yejin; Berg, Alexander C; Berg, Tamara L; , "Baby talk: Understanding and generating simple image descriptions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1601-1608, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995466
Abstract: We posit that visually descriptive language offers computer vision researchers both information about the world, and information about how people describe the world. The potential benefit from this source is made more significant due to the enormous amount of language data easily available today. We present a system to automatically generate natural language descriptions from images that exploits both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision. The system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.

Nishiyama, Masashi; Okabe, Takahiro; Sato, Imari; Sato, Yoichi; , "Aesthetic quality classification of photographs based on color harmony," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.33-40, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995539
Abstract: Aesthetic quality classification plays an important role in how people organize large photo collections. In particular, color harmony is a key factor in the various aspects that determine the perceived quality of a photo, and it should be taken into account to improve the performance of automatic aesthetic quality classification. However, the existing models of color harmony take only simple color patterns into consideration–e.g., patches consisting of a few colors–and thus cannot be used to assess photos with complicated color arrangements. In this work, we tackle the challenging problem of evaluating the color harmony of photos with a particular focus on aesthetic quality classification. A key point is that a photograph can be seen as a collection of local regions with color variations that are relatively simple. This led us to develop a method for assessing the aesthetic quality of a photo based on the photo's color harmony. We term the method ‘bags-of-color-patterns.’ Results of experiments on a large photo collection with user-provided aesthetic quality scores show that our aesthetic quality classification method, which explicitly takes into account the color harmony of a photo, outperforms the existing methods. Results also show that the classification performance is improved by combining our color harmony feature with blur, edges, and saliency features that reflect the aesthetics of the photos.

Zhu, Zhiwei; Chiu, Han-Pang; Oskiper, Taragay; Ali, Saad; Hadsell, Raia; Samarasekera, Supun; Kumar, Rakesh; , "High-precision localization using visual landmarks fused with range data," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.81-88, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995463
Abstract: Visual landmark matching with a pre-built landmark database is a popular technique for localization. Traditionally, the landmark database was built with a visual odometry system, and the 3D information of each visual landmark was reconstructed from video. Due to the drift of the visual odometry system, a globally consistent landmark database is difficult to build, and the inaccuracy of each 3D landmark limits the performance of landmark matching. In this paper, we demonstrate that with the use of precise 3D lidar range data, we are able to build a globally consistent database of high precision 3D visual landmarks, which improves the landmark matching accuracy dramatically. In order to further improve the accuracy and robustness, landmark matching is fused with a multi-stereo based visual odometry system to estimate the camera pose in two aspects. First, a local visual odometry trajectory based consistency check is performed to reject bad landmark matches or those with large errors, and then Kalman filtering is used to further smooth out landmark matching errors. Finally, a disk-cache mechanism is proposed to maintain real-time performance as the landmark database grows to cover a large-scale area. Week-long real-time live marine training experiments have demonstrated the high precision and robustness of our proposed system.
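The Kalman filtering step used to smooth residual landmark-matching errors can be illustrated with a scalar random-walk filter (the noise variances and measurements here are made up, and the paper's full state is of course higher-dimensional):

```python
def kalman_1d(measurements, q=1e-3, r=0.5):
    """Scalar Kalman filter: random-walk state, noisy position measurements.

    q is the process-noise variance (state drift), r the measurement-noise
    variance; both are illustrative values.
    """
    x, p = measurements[0], 1.0
    out = [x]
    for z in measurements[1:]:
        p += q                  # predict: state uncertainty grows by q
        k = p / (p + r)         # Kalman gain
        x += k * (z - x)        # update: move estimate toward the measurement
        p *= (1 - k)            # posterior uncertainty shrinks
        out.append(x)
    return out

noisy = [10.0, 10.4, 9.7, 10.2, 9.9, 10.1]   # jittery position estimates
smooth = kalman_1d(noisy)                     # filtered trajectory
```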

Liu, Meizhu; Vemuri, Baba C.; , "Robust and efficient regularized boosting using total Bregman divergence," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2897-2902, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995686
Abstract: Boosting is a well known machine learning technique used to improve the performance of weak learners and has been successfully applied to computer vision, medical image analysis, computational biology and other fields. A critical step in boosting algorithms involves updating the data sample distribution; however, most existing boosting algorithms use updating mechanisms that lead to overfitting and instabilities during evolution of the distribution, which in turn result in classification inaccuracies. Regularized boosting has been proposed in the literature as a means to overcome these difficulties. In this paper, we propose a novel total Bregman divergence (tBD) regularized LPBoost, termed tBRLPBoost. tBD is a recently proposed divergence that is statistically robust, and we prove that tBRLPBoost requires a constant number of iterations to learn a strong classifier and hence is computationally more efficient than other regularized boosting algorithms in the literature. Also, unlike other boosting methods that are only effective on a handful of datasets, tBRLPBoost works well on a variety of datasets. We present results of testing our algorithm on many public domain databases along with comparisons to several other state-of-the-art methods. Numerical results depict much improvement in efficiency and accuracy over competing methods.

Chang, Kai-Yueh; Liu, Tyng-Luh; Lai, Shang-Hong; , "From co-saliency to co-segmentation: An efficient and fully unsupervised energy minimization model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2129-2136, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995415
Abstract: We address two key issues of co-segmentation over multiple images. The first is whether a purely unsupervised algorithm can satisfactorily solve this problem. Without the user's guidance, segmenting the foregrounds implied by the common object is quite a challenging task, especially when substantial variations in the object's appearance, shape, and scale are allowed. The second issue concerns efficiency, which determines whether the technique can lead to practical use. With these in mind, we establish an MRF optimization model that has an energy function with nice properties and can be shown to effectively resolve the two difficulties. Specifically, instead of relying on user inputs, our approach introduces a co-saliency prior as the hint about possible foreground locations, and uses it to construct the MRF data terms. To complete the optimization framework, we include a novel global term that is more appropriate to co-segmentation and results in a submodular energy function. The proposed model can thus be optimally solved by graph cuts. We demonstrate these advantages by testing our method on several benchmark datasets.

Leistner, Christian; Godec, Martin; Schulter, Samuel; Saffari, Amir; Werlberger, Manuel; Bischof, Horst; , "Improving classifiers with unlabeled weakly-related videos," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2753-2760, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995475
Abstract: Current state-of-the-art object classification systems are trained using large amounts of hand-labeled images. In this paper, we present an approach that shows how to use unlabeled video sequences, containing object categories only weakly related to the target class, to learn better classifiers for tracking and detection. The underlying idea is to exploit the space-time consistency of moving objects to learn classifiers that are robust to local transformations. In particular, we use dense optical flow to find moving objects in videos in order to train part-based random forests that are insensitive to natural transformations. Our method, called Video Forests, can be used in two settings: first, labeled training data can be regularized to force the trained classifier to generalize better towards small local transformations. Second, as part of a tracking-by-detection approach, it can be used to train a general codebook solely on pair-wise data that can then be applied to tracking of instances of a priori unknown object categories. In the experimental part, we show on benchmark datasets for both tracking and detection that incorporating unlabeled videos into the learning of visual classifiers leads to improved results.

Muller, Thomas; Rannacher, Jens; Rabe, Clemens; Franke, Uwe; , "Feature- and depth-supported modified total variation optical flow for 3D motion field estimation in real scenes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1193-1200, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995633
Abstract: We propose and evaluate improvements in motion field estimation in order to cope with challenges in real-world scenarios. To build a real-time stereo-based three-dimensional vision system which is able to handle illumination changes, textureless regions and fast moving objects observed by a moving platform, we introduce a new approach to support the variational optical flow computation scheme with stereo and feature information. The improved flow result is then used as input for a temporally integrated robust three-dimensional motion field estimation technique. We evaluate the results of our optical flow algorithm and the resulting three-dimensional motion field against approaches known from the literature. Tests on both realistic synthetic and real stereo sequences show that our approach is superior to these approaches with respect to density, accuracy and robustness.

Elhamifar, Ehsan; Vidal, Rene; , "Robust classification using structured sparse representation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1873-1879, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995664
Abstract: In many problems in computer vision, data in multiple classes lie in multiple low-dimensional subspaces of a high-dimensional ambient space. However, most of the existing classification methods do not explicitly take this structure into account. In this paper, we consider the problem of classification in the multi-subspace setting using sparse representation techniques. We exploit the fact that the dictionary of all the training data has a block structure where the training data in each class form a few blocks of the dictionary. We cast classification as a structured sparse recovery problem where our goal is to find a representation of a test example that uses the minimum number of blocks from the dictionary. We formulate this problem using two different classes of non-convex optimization programs. We propose convex relaxations for these two non-convex programs and study conditions under which the relaxations are equivalent to the original problems. In addition, we show that the proposed optimization programs can be modified properly to also deal with corrupted data. To evaluate the proposed algorithms, we consider the problem of automatic face recognition. We show that casting the face recognition problem as a structured sparse recovery problem can improve the results of state-of-the-art face recognition algorithms, especially when we have a relatively small number of training samples for each class. In particular, we show that the new class of convex programs can improve state-of-the-art face recognition results by 10% with only 25% of the training data. In addition, we show that the algorithms are robust to occlusion, corruption, and disguise.
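The paper's structured sparse programs are not reproduced here; as an illustrative baseline of the block idea, the sketch below represents a test sample with each class's training block separately (via plain least squares, a simplifying assumption in place of the paper's sparse recovery) and assigns the class whose block reconstructs the sample with the smallest residual. All names are hypothetical.

```python
import numpy as np

# Block-structured classification sketch: blocks maps a class label to a
# matrix whose columns are that class's training samples.  The test sample
# y is fit against each block independently; the winning class is the one
# with the smallest reconstruction residual.
def classify_by_block_residual(blocks, y):
    best, best_res = None, float("inf")
    for label, B in blocks.items():
        x, *_ = np.linalg.lstsq(B, y, rcond=None)   # least-squares coefficients
        res = np.linalg.norm(B @ x - y)             # reconstruction residual
        if res < best_res:
            best, best_res = label, res
    return best
```

The structured sparse recovery in the paper additionally penalises the number of active blocks, which this per-block least-squares baseline does not do.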

Choi, Wongun; Shahid, Khuram; Savarese, Silvio; , "Learning context for collective activity recognition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3273-3280, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995707
Abstract: In this paper we present a framework for the recognition of collective human activities. A collective activity is defined or reinforced by the existence of coherent behavior of individuals in time and space. We call such coherent behavior ‘Crowd Context’. Examples of collective activities are "queuing in a line" or "talking". Following [7], we propose to recognize collective activities using the crowd context and introduce a new scheme for learning it automatically. Our scheme is constructed upon a Random Forest structure which randomly samples variable-volume spatio-temporal regions to pick the most discriminating attributes for classification. Unlike previous approaches, our algorithm automatically finds the optimal configuration of spatio-temporal bins, over which to sample the evidence, by randomization. This enables a methodology for modeling crowd context. We employ a 3D Markov Random Field to regularize the classification and localize collective activities in the scene. We demonstrate the flexibility and scalability of the proposed framework in a number of experiments and show that our method outperforms state-of-the-art action classification techniques [7, 19].

Koppal, Sanjeev J.; Gkioulekas, Ioannis; Zickler, Todd; Barrows, Geoffrey L.; , "Wide-angle micro sensors for vision on a tight budget," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.361-368, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995338
Abstract: Achieving computer vision on micro-scale devices is a challenge. On these platforms, the power and mass constraints are severe enough for even the most common computations (matrix manipulations, convolution, etc.) to be difficult. This paper proposes and analyzes a class of miniature vision sensors that can help overcome these constraints. These sensors reduce power requirements through template-based optical convolution, and they enable a wide field-of-view within a small form through a novel optical design. We describe the trade-offs between the field of view, volume, and mass of these sensors and we provide analytic tools to navigate the design space. We also demonstrate milli-scale prototypes for computer vision tasks such as locating edges, tracking targets, and detecting faces.

Brendel, William; Amer, Mohamed; Todorovic, Sinisa; , "Multiobject tracking as maximum weight independent set," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1273-1280, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995395
Abstract: This paper addresses the problem of simultaneous tracking of multiple targets in a video. We first apply object detectors to every video frame. Pairs of detection responses from every two consecutive frames are then used to build a graph of tracklets. The graph helps transitively link the best matching tracklets that do not violate hard and soft contextual constraints between the resulting tracks. We prove that this data association problem can be formulated as finding the maximum-weight independent set (MWIS) of the graph. We present a new, polynomial-time MWIS algorithm, and prove that it converges to an optimum. Similarity and contextual constraints between object detections, used for data association, are learned online from object appearance and motion properties. Long-term occlusions are addressed by iteratively repeating MWIS to hierarchically merge smaller tracks into longer ones. Our results demonstrate advantages of simultaneously accounting for soft and hard contextual constraints in multitarget tracking. We outperform the state of the art on the benchmark datasets.
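The paper's polynomial-time MWIS algorithm is not reproduced here; as a simple baseline for the same combinatorial object, the sketch below computes a greedy maximum-weight independent set over a tracklet-conflict graph, where node weights stand in for tracklet-pair affinities and edges mark pairs that violate the hard constraints. Names and the greedy strategy are illustrative assumptions, not the paper's method.

```python
# Greedy maximum-weight independent set: repeatedly pick the highest-weight
# node whose neighbours (conflicting candidates) have not already excluded it.
# weights: dict node -> weight; edges: iterable of (u, v) conflict pairs.
def greedy_mwis(weights, edges):
    neighbours = {v: set() for v in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    chosen, banned = set(), set()
    for v in sorted(weights, key=weights.get, reverse=True):
        if v in banned:
            continue
        chosen.add(v)               # select this candidate
        banned |= neighbours[v]     # exclude everything it conflicts with
    return chosen
```

Greedy selection returns an independent set but is not guaranteed to be maximum-weight; the paper proves convergence of its own algorithm to an optimum, which this baseline does not.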

Wang, Wei; Chen, Cheng; Wang, Yizhou; Jiang, Tingting; Fang, Fang; Yao, Yuan; , "Simulating human saccadic scanpaths on natural images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.441-448, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995423
Abstract: Human saccade is a dynamic process of information pursuit. Based on the principle of information maximization, we propose a computational model to simulate human saccadic scanpaths on natural images. The model integrates three related factors as driving forces to guide eye movements sequentially: reference sensory responses, fovea-periphery resolution discrepancy, and visual working memory. For each eye movement, we compute three multi-band filter response maps as a coherent representation for the three factors. The three filter response maps are combined into multi-band residual filter response maps, on which we compute residual perceptual information (RPI) at each location. The RPI map is a dynamic saliency map that varies along with eye movements. The next fixation is selected as the location with the maximal RPI value. On a natural image dataset, we compare the saccadic scanpaths generated by the proposed model and several other visual saliency-based models against human eye movement data. Experimental results demonstrate that the proposed model achieves the best prediction accuracy on both static fixation locations and dynamic scanpaths.
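The fixation-selection rule stated in the abstract is simple enough to write down directly: the next fixation is the grid location with the maximal RPI value. The sketch below uses a toy 2-D list in place of the model's actual RPI map.

```python
# Next-fixation selection over an RPI map: return the (row, col) of the
# cell with the maximal residual perceptual information value.
def next_fixation(rpi_map):
    return max(
        ((r, c) for r, row in enumerate(rpi_map) for c in range(len(row))),
        key=lambda rc: rpi_map[rc[0]][rc[1]],
    )
```

In the full model the RPI map is recomputed after every fixation, so repeated calls trace out a scanpath rather than revisiting the same maximum.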

Wu, Chenglei; Wilburn, Bennett; Matsushita, Yasuyuki; Theobalt, Christian; , "High-quality shape from multi-view stereo and shading under general illumination," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.969-976, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995388
Abstract: Multi-view stereo methods reconstruct 3D geometry from images well for sufficiently textured scenes, but often fail to recover high-frequency surface detail, particularly for smoothly shaded surfaces. On the other hand, shape-from-shading methods can recover fine detail from shading variations. Unfortunately, it is non-trivial to apply shape-from-shading alone to multi-view data, and most shading-based estimation methods only succeed under very restricted or controlled illumination. We present a new algorithm that combines multi-view stereo and shading-based refinement for high-quality reconstruction of 3D geometry models from images taken under constant but otherwise arbitrary illumination. We have tested our algorithm on several scenes captured under general and unknown lighting conditions, and we show that our final reconstructions rival laser range scans.

Thangali, Ashwin; Nash, Joan P.; Sclaroff, Stan; Neidle, Carol; , "Exploiting phonological constraints for handshape inference in ASL video," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.521-528, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995718
Abstract: Handshape is a key linguistic component of signs, and thus, handshape recognition is essential to algorithms for sign language recognition and retrieval. In this work, linguistic constraints on the relationship between start and end handshapes are leveraged to improve handshape recognition accuracy. A Bayesian network formulation is proposed for learning and exploiting these constraints, while taking into consideration inter-signer variations in the production of particular handshapes. A Variational Bayes formulation is employed for supervised learning of the model parameters. A non-rigid image alignment algorithm, which yields improved robustness to variability in handshape appearance, is proposed for computing image observation likelihoods in the model. The resulting handshape inference algorithm is evaluated using a dataset of 1500 lexical signs in American Sign Language (ASL), where each lexical sign is produced by three native ASL signers.

Tran, Du; Yuan, Junsong; , "Optimal spatio-temporal path discovery for video event detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3321-3328, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995416
Abstract: We propose a novel algorithm for video event detection and localization as the optimal path discovery problem in spatio-temporal video space. By finding the optimal spatio-temporal path, our method not only detects the starting and ending points of the event, but also accurately locates it in each video frame. Moreover, our method is robust to scale and intra-class variations of the event, as well as to false and missed local detections, thereby improving overall detection and localization accuracy. The proposed search algorithm obtains the globally optimal solution with proven lowest computational complexity. Experiments on realistic video datasets demonstrate that our proposed method can be applied to different types of event detection tasks, such as abnormal event detection and walking pedestrian detection.

Lu, Xiaoguang; Chen, Terrence; Comaniciu, Dorin; , "Robust discriminative wire structure modeling with application to stent enhancement in fluoroscopy," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1121-1127, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995714
Abstract: Learning-based methods have been widely used to detect landmarks or anatomical structures in various medical imaging applications. Discriminative learning techniques have been demonstrated to be superior to traditional low-level filtering in robustness and scalability. Nevertheless, some structures and patterns, such as a non-rigid and highly deformable wire structure, are more difficult to define for such methods, so complicated, ad-hoc methods still need to be used. In this paper, we propose a novel scheme to train classifiers to detect the markers and the guidewire segment anchored by them. The classifier uses the markers as end points and parameterizes the wire between them. The probabilities of the markers and the wire are integrated in a Bayesian framework. As a result, both marker and wire detection are improved by such a unified approach. Promising results are demonstrated by quantitative evaluation on 263 fluoroscopic sequences with 12495 frames. Our training scheme can further be generalized to localize longer guidewires with higher degrees of parameterization.

Yang, Bo; Huang, Chang; Nevatia, Ram; , "Learning affinities and dependencies for multi-target tracking using a CRF model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1233-1240, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995587
Abstract: We propose a learning-based Conditional Random Field (CRF) model for tracking multiple targets by progressively associating detection responses into long tracks. The tracking task is transformed into a data association problem; most previous approaches developed heuristic parametric models or learning approaches for evaluating independent affinities between track fragments (tracklets). We argue that the independence assumption is not valid in many cases, and adopt a CRF model that considers both tracklet affinities and dependencies among them, represented by unary term costs and pairwise term costs respectively. Unlike previous methods, we learn the best global associations instead of the best local affinities between tracklets, and transform the task of finding the best association into an energy minimization problem. A RankBoost algorithm is proposed to select effective features for estimating term costs in the CRF model, so that better associations have lower costs. Our approach is evaluated on challenging pedestrian data sets and compared with state-of-the-art methods. Experiments show the effectiveness of our algorithm as well as improvements in tracking performance.

Benfold, Ben; Reid, Ian; , "Stable multi-target tracking in real-time surveillance video," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3457-3464, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995667
Abstract: The majority of existing pedestrian trackers concentrate on maintaining the identities of targets; however, systems for remote biometric analysis or activity recognition in surveillance video often require stable bounding boxes around pedestrians rather than approximate locations. We present a multi-target tracking system that is designed specifically for the provision of stable and accurate head location estimates. By performing data association over a sliding window of frames, we are able to correct many data association errors and fill in gaps where observations are missed. The approach is multi-threaded and combines asynchronous HOG detections with simultaneous KLT tracking and Markov-Chain Monte-Carlo Data Association (MCMCDA) to provide guaranteed real-time tracking in high definition video. Where previous approaches have used ad-hoc models for data association, we use a more principled approach based on a Minimal Description Length (MDL) objective which accurately models the affinity between observations. We demonstrate by qualitative and quantitative evaluation that the system is capable of providing precise location estimates for large crowds of pedestrians in real-time. To facilitate future performance comparisons, we make a new dataset with hand-annotated ground truth head locations publicly available.

Shen, Jianbing; Yang, Xiaoshan; Jia, Yunde; Li, Xuelong; , "Intrinsic images using optimization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3481-3487, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995507
Abstract: In this paper, we present a novel intrinsic image recovery approach using optimization. Our approach is based on an assumption about color characteristics in a local window in natural images: neighboring pixels in a local window of a single image that have similar intensity values should also have similar reflectance values. The intrinsic image decomposition is thus formulated by optimizing an energy function with a weighting constraint added to the local image properties. In order to improve the intrinsic image extraction results, we specify local constraint cues by integrating user strokes in our energy formulation, including constant-reflectance, constant-illumination and fixed-illumination brushes. Our experimental results demonstrate that our approach achieves a better recovery of intrinsic reflectance and illumination components than previous approaches.

Ott, Patrick; Everingham, Mark; , "Shared parts for deformable part-based models," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1513-1520, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995357
Abstract: The deformable part-based model (DPM) proposed by Felzenszwalb et al. has demonstrated state-of-the-art results in object localization. The model offers a high degree of learnt invariance by utilizing viewpoint-dependent mixture components and movable parts in each mixture component. One might hope to increase the accuracy of the DPM by increasing the number of mixture components and parts to give a more faithful model, but limited training data prevents this from being effective. We propose an extension to the DPM which allows for sharing of object part models among multiple mixture components as well as object classes. This results in more compact models and allows training examples to be shared by multiple components, ameliorating the effect of a limited size training set. We (i) reformulate the DPM to incorporate part sharing, and (ii) propose a novel energy function allowing for coupled training of mixture components and object classes. We report state-of-the-art results on the PASCAL VOC dataset.

Zheng, Wei-Shi; Gong, Shaogang; Xiang, Tao; , "Person re-identification by probabilistic relative distance comparison," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.649-656, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995598
Abstract: Matching people across non-overlapping camera views, known as person re-identification, is challenging due to the lack of spatial and temporal constraints and large visual appearance changes caused by variations in view angle, lighting, background clutter and occlusion. To address these challenges, most previous approaches aim to extract visual features that are both distinctive and stable under appearance changes. However, most visual features and their combinations under realistic conditions are neither stable nor distinctive and thus should not be used indiscriminately. In this paper, we propose to formulate person re-identification as a distance learning problem, which aims to learn the optimal distance that maximises matching accuracy regardless of the choice of representation. To that end, we introduce a novel Probabilistic Relative Distance Comparison (PRDC) model, which differs from most existing distance learning methods in that, rather than minimising intra-class variation whilst maximising inter-class variation, it aims to maximise the probability that a true match pair has a smaller distance than a wrong match pair. This makes our model more tolerant to appearance changes and less susceptible to model over-fitting. Extensive experiments are carried out to demonstrate that 1) by formulating the person re-identification problem as a distance learning problem, notable improvement in matching accuracy can be obtained over conventional person re-identification techniques, which is particularly significant when the training sample size is small; and 2) our PRDC outperforms not only existing distance learning methods but also alternative learning methods based on boosting and learning to rank.

Yang, Tao; Zhang, Yanning; Tong, Xiaomin; Zhang, Xiaoqiang; Yu, Rui; , "Continuously tracking and see-through occlusion based on a new hybrid synthetic aperture imaging model," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3409-3416, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995417
Abstract: Robust detection and tracking of multiple people in cluttered and crowded scenes with severe occlusion is a significant challenge for many computer vision applications. In this paper, we present a novel hybrid synthetic aperture imaging model to solve this problem. The main characteristics of this approach are: (1) To the best of our knowledge, this is the first algorithm to solve the occluded-people imaging and tracking problem in a joint multiple-camera synthetic aperture imaging domain. (2) A multiple-model framework is designed to achieve seamless interaction among the detection, imaging and tracking modules. (3) In the object detection module, a multiple-constraints-based approach is presented for localizing people and removing ghost objects in a 3D foreground silhouette synthetic aperture imaging volume. (4) In the synthetic imaging module, a novel occluder-removal-based synthetic imaging approach is proposed to continuously obtain a clear image of the object even under severe occlusion. (5) In the object tracking module, a camera array is used for robust people tracking in color synthetic aperture images. A network-camera-based hybrid synthetic aperture imaging system has been set up, and experimental results with qualitative and quantitative analysis demonstrate that the method can reliably locate and see people in challenging scenes.

Rubinstein, Michael; Liu, Ce; Sand, Peter; Durand, Fredo; Freeman, William T.; , "Motion denoising with application to time-lapse photography," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.313-320, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995374
Abstract: Motions can occur over both short and long time scales. We introduce motion denoising, which treats short-term changes as noise, long-term changes as signal, and re-renders a video to reveal the underlying long-term events. We demonstrate motion denoising for time-lapse videos. One of the characteristics of traditional time-lapse imagery is stylized jerkiness, where short-term changes in the scene appear as small and annoying jitters in the video, often obfuscating the underlying temporal events of interest. We apply motion denoising for resynthesizing time-lapse videos showing the long-term evolution of a scene with jerky short-term changes removed. We show that existing filtering approaches are often incapable of achieving this task, and present a novel computational approach to denoise motion without explicit motion analysis. We demonstrate promising experimental results on a set of challenging time-lapse sequences.

Streib, Kevin; Davis, James W.; , "Using Ripley's K-function to improve graph-based clustering techniques," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2305-2312, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995509
Abstract: The success of any graph-based clustering algorithm depends heavily on the quality of the similarity matrix being clustered, which is itself highly dependent on point-wise scaling parameters. We propose a novel technique for finding point-wise scaling parameters based on Ripley's K-function [12] which enables data clustering at different density scales within the same dataset. Additionally, we provide a method for enhancing the spatial similarity matrix by including a density metric between neighborhoods. We show how our proposed methods for building similarity matrices can improve the results attained by traditional approaches for several well known clustering algorithms on a variety of datasets.
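Ripley's K-function itself is standard; below is a minimal 2-D estimator without edge correction (the paper's use of it to derive point-wise scaling parameters is not reproduced). `area` is the area of the observation window, an input the caller must supply.

```python
import math

# Ripley's K-function estimator for a 2-D point set, no edge correction:
# K(r) = area / (n(n-1)) * sum over ordered pairs i != j of 1[d_ij <= r].
# Under complete spatial randomness, K(r) is approximately pi * r^2.
def ripley_k(points, r, area):
    n = len(points)
    pairs = sum(
        1
        for i, (x1, y1) in enumerate(points)
        for j, (x2, y2) in enumerate(points)
        if i != j and math.hypot(x1 - x2, y1 - y2) <= r
    )
    return area * pairs / (n * (n - 1))
```

Comparing the estimate against the CSR baseline pi * r^2 at several radii is one way to detect the multiple density scales the abstract refers to.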

Wang, Yu; Narayanaswamy, Arunachalam; Roysam, Badrinath; , "Novel 4-D Open-Curve Active Contour and curve completion approach for automated tree structure extraction," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1105-1112, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995620
Abstract: We present novel approaches for fully automated extraction of tree-like tubular structures from 3-D image stacks. A 4-D Open-Curve Active Contour (Snake) model is proposed for simultaneous 3-D centerline tracing and local radius estimation. An image energy term, stretching term, and a novel region-based radial energy term constitute the energy to be minimized. This combination of energy terms allows the 4-D open-curve snake model, starting from an automatically detected seed point, to stretch along and fit the tubular structures like neurites and blood vessels. A graph-based curve completion approach is proposed to merge possible fragments caused by discontinuities in the tree structures. After tree structure extraction, the centerlines serve as the starting points for a Fast Marching segmentation for which the stopping time is automatically chosen. We illustrate the performance of our method with various datasets.

Oliveira, Miguel; Sappa, Angel D.; Santos, Vitor; , "Unsupervised local color correction for coarsely registered images," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.201-208, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995658
Abstract: This paper proposes a new parametric local color correction technique. First, several color transfer functions are computed from the output of the mean shift color segmentation algorithm. Second, color influence maps are calculated. Finally, the contributions of the color transfer functions are merged using weights from the color influence maps. The proposed approach is compared with both global and local color correction approaches. Results show that our method outperforms the technique ranked first in a recent performance evaluation on this topic. Moreover, the proposed approach is computed in about one tenth of the time.
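The final merging step described in the abstract amounts to a per-pixel weighted average; here is a hedged sketch with hypothetical names, where each segment contributes a color transfer function and the color influence maps supply the per-pixel weights.

```python
# Merge several color transfer functions at one pixel: each transfer t is a
# callable mapping an input color value to a corrected value, and
# influence_weights holds that pixel's values from the color influence maps.
def merge_transfers(value, transfers, influence_weights):
    total_w = sum(influence_weights)
    return sum(w * t(value) for t, w in zip(transfers, influence_weights)) / total_w
```

With equal weights the corrections average out; a dominant influence-map weight pulls the result toward that segment's transfer, which is the local behavior the paper is after.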

Chen, Jixu; Ji, Qiang; , "Probabilistic gaze estimation without active personal calibration," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.609-616, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995675
Abstract: Existing eye gaze tracking systems typically require an explicit personal calibration process in order to estimate certain person-specific eye parameters. For natural human-computer interaction, such a personal calibration is often cumbersome and unnatural. In this paper, we propose a new probabilistic eye gaze tracking system without explicit personal calibration. Unlike traditional eye gaze tracking methods, which estimate the eye parameters deterministically, our approach estimates the probability distributions of the eye parameters and the eye gaze by combining image saliency with a 3D eye model. By using an incremental learning framework, the subject does not need personal calibration before using the system; the eye parameters and gaze estimates improve gradually as he/she naturally views a sequence of images on the screen. Experimental results show that the proposed system can achieve accuracy better than three degrees for different people without calibration.

Bertelli, Luca; Yu, Tianli; Vu, Diem; Gokturk, Burak; , "Kernelized structural SVM learning for supervised object segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2153-2160, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995597
Abstract: Object segmentation needs to be driven by top-down knowledge to produce semantically meaningful results. In this paper, we propose a supervised segmentation approach that tightly integrates object-level top-down information with low-level image cues. The information from the two levels is fused under a kernelized structural SVM learning framework. We define a novel nonlinear kernel for comparing two image-segmentation masks. This kernel combines four different kernels: the object similarity kernel, the object shape kernel, the per-image color distribution kernel, and the global color distribution kernel. Our experiments show that the structural SVM algorithm finds bad segmentations of the training examples under the current scoring function and pushes their scores below those of the example (good) segmentations. The result is a segmentation algorithm that not only knows what good segmentations are, but also learns potential segmentation mistakes and tries to avoid them. Our proposed approach obtains performance comparable to other state-of-the-art top-down driven segmentation approaches, yet is flexible enough to be applied to widely different domains.

Cai, Xiao; Nie, Feiping; Huang, Heng; Kamangar, Farhad; , "Heterogeneous image feature integration via multi-modal spectral clustering," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1977-1984, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995740
Abstract: In recent years, more and more visual descriptors have been proposed to describe objects and scenes appearing in images, with different features describing different aspects of the visual characteristics. How to combine these heterogeneous features has become an increasingly critical problem. In this paper, we propose a novel approach that integrates such heterogeneous features in an unsupervised manner by performing multi-modal spectral clustering on unlabeled and unsegmented images. Considering each type of feature as one modality, our new multi-modal spectral clustering (MMSC) algorithm learns a commonly shared graph Laplacian matrix by unifying the different modalities (image features). A non-negative relaxation is also added to our method to improve the robustness and efficiency of image clustering. We applied MMSC to integrate five popular image features (SIFT, HOG, GIST, LBP, and CENTRIST) and evaluated the performance on two benchmark data sets: Caltech-101 and MSRC-v1. Compared with existing unsupervised scene and object categorization methods, our approach consistently achieves superior performance as measured by three standard clustering evaluation metrics.
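To make the multi-modal idea concrete, here is a minimal sketch that averages per-modality graph Laplacians and embeds with the smallest eigenvectors. This is only a naive stand-in: the paper's MMSC *learns* the shared Laplacian jointly with a non-negative relaxation, and the toy features below are invented for illustration.

```python
import numpy as np

def normalized_laplacian(X, sigma=1.0):
    """Gaussian-affinity normalized graph Laplacian for one feature modality."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    return np.eye(len(X)) - d_inv[:, None] * W * d_inv[None, :]

def mmsc_embed(modalities, k):
    """Average the per-modality Laplacians and embed with the k smallest
    eigenvectors (a naive stand-in for MMSC's learned shared Laplacian)."""
    L = sum(normalized_laplacian(X) for X in modalities) / len(modalities)
    _, vecs = np.linalg.eigh(L)       # eigh returns eigenvalues in ascending order
    return vecs[:, :k]                # rows: spectral embedding of each image

# two toy "modalities" describing the same 6 samples drawn from 2 groups
rng = np.random.default_rng(0)
feat_a = np.vstack([rng.normal(0, 0.1, (3, 2)), rng.normal(5, 0.1, (3, 2))])
feat_b = np.vstack([rng.normal(0, 0.1, (3, 4)), rng.normal(5, 0.1, (3, 4))])
emb = mmsc_embed([feat_a, feat_b], k=2)
side = emb[:, 1] > 0                  # sign of the 2nd eigenvector splits the groups
```

The embedding rows would normally be clustered with k-means; here the sign of the second eigenvector already separates the two groups.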

Wojek, Christian; Walk, Stefan; Roth, Stefan; Schiele, Bernt; , "Monocular 3D scene understanding with explicit occlusion reasoning," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1993-2000, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995547
Abstract: Scene understanding from a monocular, moving camera is a challenging problem with a number of applications including robotics and automotive safety. While recent systems have shown that this is best accomplished with a 3D scene model, handling of partial object occlusion is still unsatisfactory. In this paper we propose an approach that tightly integrates monocular 3D scene tracking-by-detection with explicit object-object occlusion reasoning. Full object and object part detectors are combined in a mixture of experts based on their expected visibility, which is obtained from the 3D scene model. For the difficult case of multi-people tracking, we demonstrate that our approach yields more robust detection and tracking of partially visible pedestrians, even when they are occluded over long periods of time. Our approach is evaluated on two challenging sequences recorded from a moving camera in busy pedestrian zones and outperforms several state-of-the-art approaches.

Chang, Kuang-Yu; Chen, Chu-Song; Hung, Yi-Ping; , "Ordinal hyperplanes ranker with cost sensitivities for age estimation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.585-592, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995437
Abstract: In this paper, we propose an ordinal hyperplane ranking algorithm called OHRank, which estimates human ages via facial images. The design of the algorithm is based on the relative order information among the age labels in a database. Each ordinal hyperplane separates all the facial images into two groups according to the relative order, and a cost-sensitive property is exploited to find better hyperplanes based on the classification costs. Human ages are inferred by aggregating a set of preferences from the ordinal hyperplanes with their cost sensitivities. Our experimental results demonstrate that the proposed approach outperforms conventional multiclass-based and regression-based approaches as well as recently developed ranking-based age estimation approaches.
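A minimal sketch of the ordinal-aggregation idea described above, with the cost sensitivities and the actual SVM hyperplanes omitted: K-1 binary "is the label greater than k?" classifiers vote, and the positive votes are summed into a rank. The threshold classifiers below are hypothetical stand-ins for the paper's learned hyperplanes.

```python
def ordinal_rank(x, classifiers):
    """Aggregate binary ordinal preferences into a rank estimate by counting
    the classifiers that answer 'label > k' positively."""
    return sum(int(clf(x)) for clf in classifiers)

# Stand-in classifiers: simple thresholds on a scalar feature assumed to grow
# with age (the paper learns one cost-sensitive hyperplane per ordinal split).
thresholds = [10.0, 20.0, 30.0, 40.0]
classifiers = [lambda x, t=t: x > t for t in thresholds]
```

For example, an input between the second and third thresholds collects two positive votes and is assigned rank 2.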

Daubney, Ben; Xie, Xianghua; , "Tracking 3D human pose with large root node uncertainty," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1321-1328, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995502
Abstract: Representing articulated objects as a graphical model has gained much popularity in recent years; often the root node of the graph describes the global position and orientation of the object. In this work, a method is presented to robustly track 3D human pose by permitting greater uncertainty to be modeled over the root node than existing techniques allow. Significantly, this is achieved without increasing the uncertainty over the remaining parts of the model. The benefit is that a greater volume of the posterior can be supported, making the approach less vulnerable to tracking failure. Given a hypothesis of the root node state, a novel method is presented to estimate the posterior over the remaining parts of the body conditioned on this value. All probability distributions are approximated using a single Gaussian, allowing inference to be carried out in closed form. A set of deterministically selected sample points is used that allows the posterior to be updated for each part with just seven image likelihood evaluations, making the approach extremely efficient. Multiple root node states are supported and propagated using standard sampling techniques. We believe this to be the first work devoted to efficient tracking of human pose whilst modeling large uncertainty in the root node, and we demonstrate the presented method to be more robust to tracking failures than existing approaches.

Mensink, Thomas; Verbeek, Jakob; Csurka, Gabriela; , "Learning structured prediction models for interactive image labeling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.833-840, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995380
Abstract: We propose structured models for image labeling that take into account the dependencies among the image labels explicitly. These models are more expressive than independent label predictors, and lead to more accurate predictions. While the improvement is modest for fully-automatic image labeling, the gain is significant in an interactive scenario where a user provides the value of some of the image labels. Such an interactive scenario offers an interesting trade-off between accuracy and manual labeling effort. The structured models are used to decide which labels should be set by the user, and transfer the user input to more accurate predictions on other image labels. We also apply our models to attribute-based image classification, where attribute predictions of a test image are mapped to class probabilities by means of a given attribute-class mapping. In this case the structured models are built at the attribute level. We also consider an interactive system where the system asks a user to set some of the attribute values in order to maximally improve class prediction performance. Experimental results on three publicly available benchmark data sets show that in all scenarios our structured models lead to more accurate predictions, and leverage user input much more effectively than state-of-the-art independent models.

Ding, Yuanyuan; Xiao, Jing; Yu, Jingyi; , "A theory of multi-perspective defocusing," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.217-224, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995617
Abstract: We present a novel theory for characterizing defocus blurs in multi-perspective cameras such as catadioptric mirrors. Our approach studies how multi-perspective ray geometry transforms under a thin lens. We first use the General Linear Cameras (GLCs) [21] to approximate the multi-perspective rays incident on the lens, and then apply a Thin Lens Operator (TLO) to map an incident GLC to the exit GLC. To study defocus blurs caused by the GLC rays, we further introduce a new Ray Spread Function (RSF) model analogous to the Point Spread Function (PSF). While the PSF models defocus blurs caused by a 3D scene point, the RSF models blurs spread by rays. We derive closed-form RSFs for incident GLC rays, and we show that for catadioptric cameras with a circular aperture, the RSF can be effectively approximated as a single elliptic-shaped kernel or a mixture of such kernels. We apply our method to predicting defocus blurs in commonly used catadioptric cameras and to reducing defocus blurs in catadioptric projections. Experiments on synthetic and real data demonstrate the accuracy and general applicability of our approach.

He, Junfeng; Chang, Shih-Fu; Radhakrishnan, Regunathan; Bauer, Claus; , "Compact hashing with joint optimization of search accuracy and time," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.753-760, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995518
Abstract: Similarity search, namely finding approximate nearest neighbors, is the core of many large-scale machine learning and vision applications. Recently, many research results have demonstrated that hashing with compact codes can achieve promising performance for large-scale similarity search. However, most previous hashing methods with compact codes only model and optimize the search accuracy; search time, an important factor for hashing in practice, is usually not addressed explicitly. In this paper, we develop a new scalable hashing algorithm that jointly optimizes search accuracy and search time. Our method generates compact hash codes for data of general formats with any similarity function. We evaluate our method using diverse data sets of up to 1 million samples (e.g., web images). Our comprehensive results show the proposed method significantly outperforms several state-of-the-art hashing approaches.
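The compact-code search pipeline can be sketched with the standard random-projection baseline: sign-quantize projections into binary codes, then rank the database by Hamming distance. This is deliberately *not* the paper's method, which learns the projections to optimize accuracy and search time jointly; the baseline only shows what "compact hashing" computes.

```python
import numpy as np

def encode(X, P):
    """Compact binary codes via random-projection sign hashing (LSH-style
    baseline; the paper instead learns the projections)."""
    return (X @ P > 0).astype(np.uint8)

def hamming_search(query_code, db_codes):
    """Rank the database by Hamming distance to the query code."""
    dist = (db_codes != query_code).sum(axis=1)
    return np.argsort(dist, kind="stable"), dist

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))            # 100 database items, 16-dim features
P = rng.normal(size=(16, 32))             # 32 random hash bits
codes = encode(X, P)
order, dist = hamming_search(codes[0], codes)   # query with item 0 itself
```

Because Hamming distances over short binary codes are cheap to compute, this is what makes million-sample search feasible in practice.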

Prokaj, Jan; Medioni, Gerard; , "Using 3D scene structure to improve tracking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1337-1344, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995593
Abstract: In this work we consider the problem of tracking objects from a moving airborne platform in wide area surveillance through long occlusions and/or when their motion is unpredictable. The main idea is to take advantage of the known 3D scene structure to estimate a dynamic occlusion map, and to use the occlusion map to determine traffic entry and exit into these zones, which we call sources and sinks. Then the track linking problem is formulated as an alignment of sequences of tracks entering a sink and leaving a source. The sequence alignment problem is solved optimally and efficiently using dynamic programming. We have evaluated our algorithm on a vehicle tracking task in wide area motion imagery and have shown that track fragmentation is significantly decreased and outperforms the Hungarian algorithm.

Sankaranarayanan, Karthik; Davis, James W.; , "Object association across PTZ cameras using logistic MIL," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3433-3440, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995398
Abstract: We propose a novel approach to associate objects across multiple PTZ cameras that can be used to perform camera handoff in wide-area surveillance scenarios. While previous approaches relied on geometric, appearance, or correlation-based information for establishing correspondences between static cameras, they each have well-known limitations and are not extendable to wide-area settings with PTZ cameras. In our approach, the slave camera only passively follows the target (by loose registration with the master) and bootstraps itself from its own incoming imagery, thus effectively circumventing the problems faced by previous approaches and avoiding the need to perform any model transfer. Towards this goal, we also propose a novel Multiple Instance Learning (MIL) formulation for the problem based on the logistic softmax function of covariance-based region features within a MAP estimation framework. We demonstrate our approach with multiple PTZ camera sequences in typical outdoor surveillance settings and show a comparison with state-of-the-art approaches.

Cherian, Anoop; Morellas, Vassilios; Papanikolopoulos, Nikolaos; Bedros, Saad J.; , "Dirichlet process mixture models on symmetric positive definite matrices for appearance clustering in video surveillance applications," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3417-3424, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995723
Abstract: Covariance matrices of multivariate data capture feature correlations compactly and are very robust to noise; as a result, they have been used extensively as feature descriptors in many areas of computer vision, such as people appearance tracking, DTI imaging, and face recognition. Since these matrices do not adhere to Euclidean geometry, clustering algorithms using traditional distance measures cannot be directly extended to them. Prior work in this area has been restricted to K-means type clustering over the Riemannian space using the Riemannian metric. As applications scale, it becomes impractical to assume the number of components in a clustering model, which rules out such fixed-component soft-clustering algorithms. In this paper, a novel application of the Dirichlet Process Mixture Model framework is proposed for unsupervised clustering of symmetric positive definite matrices. We approach the problem by extending existing K-means type clustering algorithms based on the logdet divergence measure and derive the Bayesian counterpart, which leads to the Wishart-Inverse Wishart conjugate pair. Alternative possibilities based on the matrix Frobenius norm and log-Euclidean measures are also proposed. The models are extensively compared against state-of-the-art algorithms on two real-world datasets and demonstrate superior performance.
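The logdet (Burg) divergence the clustering is built on is easy to state in code. This sketch shows only the divergence itself, not the K-means extension or its Bayesian (Wishart-Inverse Wishart) counterpart from the paper.

```python
import numpy as np

def logdet_div(A, B):
    """LogDet (Burg) divergence between SPD matrices:
    D(A, B) = tr(A B^-1) - log det(A B^-1) - n.
    Zero iff A == B, positive otherwise, and asymmetric in its arguments."""
    M = A @ np.linalg.inv(B)
    _, logdet = np.linalg.slogdet(M)   # slogdet avoids overflow in det
    return float(np.trace(M) - logdet - A.shape[0])

A = np.array([[2.0, 0.3], [0.3, 1.0]])   # a toy SPD descriptor
B = np.eye(2)
```

A logdet K-means step would assign each covariance descriptor to the centroid minimizing this divergence.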

Garg, Rahul; Seitz, Steven M.; Ramanan, Deva; Snavely, Noah; , "Where's Waldo: Matching people in images of crowds," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1793-1800, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995546
Abstract: Given a community-contributed set of photos of a crowded public event, this paper addresses the problem of finding all images of each person in the scene. This problem is very challenging due to large changes in camera viewpoints, severe occlusions, low resolution, and photos from tens or hundreds of different photographers. Despite these challenges, the problem is made tractable by exploiting a variety of visual and contextual cues: appearance, time-stamps, camera pose, and co-occurrence of people. This paper demonstrates an approach that integrates these cues to enable high-quality person matching in community photo collections downloaded from

Wang, Jiang; Chen, Zhuoyuan; Wu, Ying; , "Action recognition with multiscale spatio-temporal contexts," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3185-3192, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995493
Abstract: The popular bag-of-words approach for action recognition is based on classifying the density of quantized local features. This approach focuses excessively on the local features themselves and discards all information about the interactions among them. Local features alone may not be discriminative enough, but combined with their contexts they can be very useful for recognizing some actions. In this paper, we present a novel representation that captures contextual interactions between interest points, based on the density of all features observed in each interest point's multiscale spatio-temporal contextual domain. We demonstrate that augmenting local features with our contextual feature significantly improves recognition performance.

Gu, Steve; Tomasi, Carlo; , "Branch and track," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1169-1174, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995599
Abstract: We present a new paradigm for tracking objects in video in the presence of other similar objects. This branch-and-track paradigm is also useful in the absence of motion, for the discovery of repetitive patterns in images. The object of interest is the lead object and the distracters are extras. The lead tracker branches out trackers for extras when they are detected, and all trackers share a common set of features. Sometimes, extras are tracked because they are of interest in their own right. In other cases, and perhaps more importantly, tracking extras makes tracking the lead nimbler and more robust, both because shared features provide a richer object model, and because tracking extras accounts for sources of confusion explicitly. Sharing features also makes joint tracking less expensive, and coordinating tracking across lead and extras allows optimizing window positions jointly rather than separately, for better results. The joint tracking of both lead and extras can be solved optimally by dynamic programming and branching is quickly determined by efficient subwindow search. Matlab experiments show near real time performance at 5–30 frames per second on a single-core laptop for 240 by 320 images.

Savchynskyy, Bogdan; Kappes, Jorg; Schmidt, Stefan; Schnorr, Christoph; , "A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1817-1823, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995652
Abstract: We study the MAP-labeling problem for graphical models by optimizing a dual problem obtained by Lagrangian decomposition. In this paper, we focus specifically on Nesterov's optimal first-order optimization scheme for non-smooth convex programs, which has been studied for a range of other problems in computer vision and machine learning in recent years. We show that in order to obtain an efficiently convergent iteration, this approach should be augmented with a dynamic estimation of the corresponding Lipschitz constant, leading to a runtime complexity of O(1/∊) in terms of the desired precision ∊. Additionally, we devise a stopping criterion based on a duality gap as a sound basis for competitive comparison and show how to compute it efficiently. We evaluate our results on the publicly available Middlebury database and on a set of computer-generated graphical models that highlight specific aspects, alongside other state-of-the-art methods for MAP inference.
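Nesterov's optimal first-order scheme itself is compact enough to sketch. The version below uses a fixed Lipschitz constant on a smooth toy quadratic; the paper's contribution is precisely that for the non-smooth dual of MAP labeling one should estimate this constant dynamically (e.g., by backtracking), which is omitted here.

```python
import numpy as np

def nesterov(grad, x0, L, iters):
    """Nesterov's accelerated gradient method for an L-smooth convex objective,
    with the classic t-sequence momentum extrapolation."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_new = y - grad(y) / L                      # gradient step from extrapolated point
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)    # momentum extrapolation
        x, t = x_new, t_new
    return x

# toy problem: minimize f(x) = 0.5 x^T Q x - b^T x, whose L is Q's top eigenvalue
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = np.linalg.solve(Q, b)                       # exact minimizer for reference
x_hat = nesterov(lambda x: Q @ x - b, np.zeros(2),
                 L=np.linalg.eigvalsh(Q).max(), iters=200)
```

The scheme attains the O(1/k²) worst-case rate in function value, which is what yields the O(1/∊) complexity discussed in the abstract.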

Komodakis, Nikos; , "Efficient training for pairwise or higher order CRFs via dual decomposition," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1841-1848, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995375
Abstract: We present a very general algorithmic framework for structured prediction learning that is able to efficiently handle both pairwise and higher-order discrete MRFs/CRFs. It relies on a dual decomposition approach that has recently been proposed for MRF optimization. By properly combining this approach with a max-margin method, our framework manages to reduce the training of a complex high-order MRF to the parallel training of a series of simple slave MRFs that are much easier to handle. This leads to an extremely efficient and general learning scheme. Furthermore, the proposed framework can yield learning algorithms of increasing accuracy, since it naturally allows a hierarchy of convex relaxations to be used for MRF inference within a max-margin learning approach. It also offers extreme flexibility and can be easily adapted to take advantage of any special structure of a given class of MRFs. Experimental results demonstrate the great effectiveness of our method.

Fang, Yi; Sun, Mengtian; Kim, Minhyong; Ramani, Karthik; , "Heat-mapping: A robust approach toward perceptually consistent mesh segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2145-2152, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995695
Abstract: 3D mesh segmentation is a fundamental low-level task with applications in areas as diverse as computer vision, computer-aided design, bio-informatics, and 3D medical imaging. A perceptually consistent mesh segmentation (PCMS), as defined in this paper, is one that satisfies 1) invariance to isometric transformations of the underlying surface, 2) robustness to perturbations of the surface, 3) robustness to numerical noise on the surface, and 4) close conformance to human perception. We exploit heat diffusion as a global structure-aware message on a meshed surface and develop a robust PCMS scheme, called Heat-Mapping, based on the heat kernel. There are three main steps in Heat-Mapping. First, the number of segments is estimated by analyzing the behavior of the Laplacian spectrum. Second, the heat center, defined as the most representative vertex on each segment, is discovered by a proposed heat center hunting algorithm. Third, a heat-center-driven segmentation scheme reveals the PCMS with high consistency with human perception. Extensive experimental results on various types of models verify the performance of Heat-Mapping with respect to the consistent segmentation of articulated bodies, topological changes, and various levels of numerical noise.
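The heat kernel underlying Heat-Mapping can be computed from the Laplacian eigendecomposition. A tiny path graph stands in for a real mesh Laplacian here; the paper's heat-center hunting and spectrum analysis are not reproduced.

```python
import numpy as np

# Toy stand-in for a mesh Laplacian: a path graph on 5 vertices.
L = np.diag([1.0, 2.0, 2.0, 2.0, 1.0])
for i in range(4):
    L[i, i + 1] = L[i + 1, i] = -1.0

def heat_kernel(L, t):
    """Heat kernel H = exp(-tL) via eigendecomposition; H[i, j] is the heat
    received at vertex j from a unit source at vertex i after time t."""
    vals, vecs = np.linalg.eigh(L)
    return vecs @ np.diag(np.exp(-t * vals)) @ vecs.T

H = heat_kernel(L, 0.5)
```

Two sanity properties make the kernel useful for segmentation: heat is conserved (each row sums to one, since L annihilates constants), and more heat stays near the source than reaches distant vertices.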

Zeng, Wei; Gu, Xianfeng David; , "Registration for 3D surfaces with large deformations using quasi-conformal curvature flow," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2457-2464, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995410
Abstract: A novel method for registering 3D surfaces with large deformations is presented, based on quasi-conformal geometry. A general diffeomorphism distorts the conformal structure of the surface, which is represented by the Beltrami coefficient; conversely, the diffeomorphism is determined by the Beltrami coefficient in an essentially unique way. Our registration method first extracts features on the surfaces, then estimates the Beltrami coefficient, and finally uniquely determines the registration mapping by solving Beltrami equations using curvature flow. The method is 1) general: it can search for the desired registration in the whole space of diffeomorphisms, which includes the conventional search spaces such as rigid motions, isometric transformations, and conformal mappings; 2) globally optimal: the global optimum determined by the method is unique up to a 3-dimensional transformation group; 3) robust: it handles large surfaces with complicated topologies; and 4) rigorous: it has a solid theoretical foundation. Experiments on real surfaces with large deformations and complicated topologies demonstrate the efficiency and robustness of the proposed method.

Saragih, Jason; , "Principal regression analysis," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2881-2888, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995618
Abstract: A new paradigm for multivariate regression is proposed: principal regression analysis (PRA). It entails learning a low dimensional subspace over sample-specific regressors. For a given input, the model predicts a subspace thought to contain the corresponding response. Using this subspace as a prior, the search space is considerably more constrained. An efficient local optimisation strategy is proposed for learning, and a practical choice for its initialisation is suggested. The utility of PRA is demonstrated on the task of non-rigid face and car alignment using challenging "in the wild" datasets, where substantial performance improvements are observed over alignment with a conventional prior.

Gopalan, Raghuraman; Sankaranarayanan, Jagan; , "Max-margin clustering: Detecting margins from projections of points on lines," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2769-2776, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995485
Abstract: Given an unlabelled set of points X ∊ R^N belonging to k groups, we propose a method to identify cluster assignments that provides the maximum separating margin among the clusters. We address this problem by exploiting the sparsity of data points inherent to margin regions, which a max-margin classifier would produce under a supervised setting to separate points belonging to different groups. By analyzing the projections of X onto the set of all possible lines L in R^N, we first establish some basic results that are satisfied only by those line intervals lying outside a cluster, under assumptions of linear separability of the clusters and absence of outliers. We then encode these results into a pairwise similarity measure to determine cluster assignments, accommodating non-linearly separable clusters using the kernel trick. We validate our method on several UCI datasets and on some computer vision problems, and empirically show its robustness to outliers and to cases where the exact number of clusters is not available. The proposed approach offers an improvement in clustering accuracy of about 6% on average, and up to 15% when compared with several existing methods.
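The core 1-D observation is easy to illustrate: project the points onto a line and look for the widest empty interval, which corresponds to a margin region. The sketch uses a single fixed line on toy data; the paper accumulates such evidence over many lines into a pairwise similarity measure, with a kernelized variant for non-linearly separable clusters.

```python
import numpy as np

def largest_gap_split(X, direction):
    """Project the points onto one line and split at the widest empty interval,
    the 1-D analogue of a max-margin decision boundary."""
    p = X @ direction                      # scalar projections
    order = np.argsort(p)
    gaps = np.diff(p[order])               # empty intervals between neighbors
    i = int(np.argmax(gaps))
    threshold = (p[order[i]] + p[order[i + 1]]) / 2
    return (p > threshold).astype(int), float(gaps[i])

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3, 0.3, (20, 2)),   # two linearly separable toy clusters
               rng.normal(3, 0.3, (20, 2))])
labels, margin = largest_gap_split(X, np.array([1.0, 0.0]))
```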

Mei, Xue; Ling, Haibin; Wu, Yi; Blasch, Erik; Bai, Li; , "Minimum error bounded efficient ℓ1 tracker with occlusion detection," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1257-1264, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995421
Abstract: Recently, sparse representation has been applied to visual tracking to find the target with the minimum reconstruction error from the target template subspace. Though effective, these L1 trackers require high computational costs due to numerous calculations for ℓ1 minimization. In addition, the inherent occlusion insensitivity of the ℓ1 minimization has not been fully utilized. In this paper, we propose an efficient L1 tracker with minimum error bound and occlusion detection which we call Bounded Particle Resampling (BPR)-L1 tracker. First, the minimum error bound is quickly calculated from a linear least squares equation, and serves as a guide for particle resampling in a particle filter framework. Without loss of precision during resampling, most insignificant samples are removed before solving the computationally expensive ℓ1 minimization function. The BPR technique enables us to speed up the L1 tracker without sacrificing accuracy. Second, we perform occlusion detection by investigating the trivial coefficients in the ℓ1 minimization. These coefficients, by design, contain rich information about image corruptions including occlusion. Detected occlusions enhance the template updates to effectively reduce the drifting problem. The proposed method shows good performance as compared with several state-of-the-art trackers on challenging benchmark sequences.
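The pruning idea in miniature: since the unconstrained least-squares residual lower-bounds the ℓ1-regularized reconstruction error, a cheap bound can discard unpromising particles before the expensive ℓ1 solve. This sketch shows only the bound computation on toy data, not the particle filter or occlusion detection.

```python
import numpy as np

def min_error_bound(T, y):
    """Least-squares residual of y against the template subspace T: a fast
    lower bound on the L1-regularized reconstruction error (the BPR idea)."""
    c, *_ = np.linalg.lstsq(T, y, rcond=None)
    return float(np.linalg.norm(y - T @ c))

rng = np.random.default_rng(3)
T = rng.normal(size=(50, 5))              # target template subspace (columns)
inlier = T @ rng.normal(size=5)           # candidate explained by the templates
outlier = rng.normal(size=50)             # unrelated candidate: large bound, prune it
```

A particle whose bound already exceeds the best error found so far can be resampled away without ever running ℓ1 minimization on it.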

Chang, Jason; Fisher, John W.; , "Efficient MCMC sampling with implicit shape representations," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2081-2088, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995333
Abstract: We present a method for sampling from the posterior distribution of implicitly defined segmentations conditioned on the observed image. Segmentation is often formulated as an energy minimization or statistical inference problem in which either the optimal or most probable configuration is the goal. Exponentiating the negative energy functional provides a Bayesian interpretation in which the solutions are equivalent. Sampling methods enable evaluation of distribution properties that characterize the solution space via the computation of marginal event probabilities. We develop a Metropolis-Hastings sampling algorithm over level-sets which improves upon previous methods by allowing for topological changes while simultaneously decreasing computational times by orders of magnitude. An M-ary extension to the method is provided.
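The accept/reject machinery at the heart of the sampler is ordinary Metropolis-Hastings. A 1-D Gaussian target stands in below for the paper's posterior over level-set segmentations, where each proposal would instead perturb the implicit shape.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step, rng):
    """Metropolis-Hastings with a Gaussian random-walk proposal; the proposal
    is symmetric, so the acceptance ratio reduces to p(proposal)/p(current)."""
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.normal(0.0, step)
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal                   # accept the move
        samples.append(x)                  # rejected moves repeat the old state
    return np.array(samples)

rng = np.random.default_rng(4)
chain = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 20000, 1.0, rng)  # N(0,1) target
```

Marginal event probabilities, as used in the paper, are then just empirical averages over such a chain after burn-in.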

Hwang, Sung Ju; Sha, Fei; Grauman, Kristen; , "Sharing features between objects and their attributes," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1761-1768, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995543
Abstract: Visual attributes expose human-defined semantics to object recognition models, but existing work largely restricts their influence to mid-level cues during classifier training. Rather than treat attributes as intermediate features, we consider how learning visual properties in concert with object categories can regularize the models for both. Given a low-level visual feature space together with attribute- and object-labeled image data, we learn a shared lower-dimensional representation by optimizing a joint loss function that favors common sparsity patterns across both types of prediction tasks. We adopt a recent kernelized formulation of convex multi-task feature learning, in which one alternates between learning the common features and learning task-specific classifier parameters on top of those features. In this way, our approach discovers any structure among the image descriptors that is relevant to both tasks, and allows the top-down semantics to restrict the hypothesis space of the ultimate object classifiers. We validate the approach on datasets of animals and outdoor scenes, and show significant improvements over traditional multi-class object classifiers and direct attribute prediction models.

Liu, Ce; Sun, Deqing; , "A Bayesian approach to adaptive video super resolution," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.209-216, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995614
Abstract: Although multi-frame super resolution has been extensively studied in past decades, super resolving real-world video sequences still remains challenging. In existing systems, either the motion models are oversimplified, or important factors such as blur kernel and noise level are assumed to be known. Such models cannot deal with the scene and imaging conditions that vary from one sequence to another. In this paper, we propose a Bayesian approach to adaptive video super resolution via simultaneously estimating underlying motion, blur kernel and noise level while reconstructing the original high-res frames. As a result, our system not only produces very promising super resolution results that outperform the state of the art, but also adapts to a variety of noise levels and blur kernels. Theoretical analysis of the relationship between blur kernel, noise level and frequency-wise reconstruction rate is also provided, consistent with our experimental results.

Rousseau, Francois; Habas, Piotr A.; Studholme, Colin; , "Human brain labeling using image similarities," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1081-1088, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995694
Abstract: We propose in this work a patch-based segmentation method relying on a label propagation framework. Based on image intensity similarities between the input image and a learning dataset, an original strategy which does not require any non-rigid registration is presented. Following recent developments in non-local image denoising, the similarity between images is represented by a weighted graph computed from intensity-based distance between patches. Experiments on simulated and in-vivo MR images show that the proposed method is very successful in providing automated human brain labeling.

Kawakami, Rei; Matsushita, Yasuyuki; Wright, John; Ben-Ezra, Moshe; Tai, Yu-Wing; Ikeuchi, Katsushi; , "High-resolution hyperspectral imaging via matrix factorization," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2329-2336, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995457
Abstract: Hyperspectral imaging is a promising tool for applications in geosensing, cultural heritage and beyond. However, compared to current RGB cameras, existing hyperspectral cameras are severely limited in spatial resolution. In this paper, we introduce a simple new technique for reconstructing a very high-resolution hyperspectral image from two readily obtained measurements: A lower-resolution hyperspectral image and a high-resolution RGB image. Our approach is divided into two stages: We first apply an unmixing algorithm to the hyperspectral input, to estimate a basis representing reflectance spectra. We then use this representation in conjunction with the RGB input to produce the desired result. Our approach to unmixing is motivated by the spatial sparsity of the hyperspectral input, and casts the unmixing problem as the search for a factorization of the input into a basis and a set of maximally sparse coefficients. Experiments show that this simple approach performs reasonably well on both simulations and real data examples.
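The second stage of the pipeline can be sketched per pixel: given a spectral basis (from the unmixing stage) and the RGB camera response, recover basis coefficients from the high-resolution RGB measurement and expand them back to a full spectrum. Plain least squares stands in for the paper's sparsity-driven formulation, and all matrices below are invented toy data.

```python
import numpy as np

rng = np.random.default_rng(5)
S = np.abs(rng.normal(size=(31, 3)))   # spectral basis from the unmixing stage (toy)
R = np.abs(rng.normal(size=(3, 31)))   # RGB camera response curves (assumed known)

def fuse_pixel(rgb, S, R):
    """Recover basis coefficients c from rgb ~= (R S) c, then expand to a full
    31-band spectrum S c (least squares stand-in for the sparse-coding step)."""
    c, *_ = np.linalg.lstsq(R @ S, rgb, rcond=None)
    return S @ c

true_coeff = np.array([0.2, 0.5, 0.3])
spectrum = S @ true_coeff               # ground-truth 31-band spectrum
rgb = R @ spectrum                      # what the RGB camera would observe
recovered = fuse_pixel(rgb, S, R)
```

With only three basis spectra the 3×3 system R S is generically invertible, which is why three RGB measurements per pixel suffice in this toy setting.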

Yu, Gang; Yuan, Junsong; Liu, Zicheng, "Unsupervised random forest indexing for fast action search," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.865-872, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995488
Abstract: Despite recent successes in searching for small objects in images, it remains a challenging problem to search for and locate actions in crowded videos because of (1) the large variations of human actions and (2) the intensive computational cost of searching the video space. To address these challenges, we propose a fast action search and localization method that supports relevance feedback from the user. By characterizing videos as spatio-temporal interest points and building a random forest to index and match these points, our query matching is robust and efficient. To enable efficient action localization, we propose a coarse-to-fine sub-volume search scheme, which is several orders of magnitude faster than the existing video branch-and-bound search. The challenging cross-dataset search of several actions validates the effectiveness and efficiency of our method.

Shen, Chunhua; Hao, Zhihui, "A direct formulation for totally-corrective multi-class boosting," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2585-2592, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995554
Abstract: Boosting combines a set of moderately accurate weak classifiers to form a highly accurate predictor. Compared with binary boosting classification, multi-class boosting has received less attention. We propose a novel multi-class boosting formulation here. Unlike most previous multi-class boosting algorithms, which decompose a multi-class boosting problem into multiple independent binary boosting problems, we formulate a direct optimization method for training multi-class boosting. Moreover, by explicitly deriving the Lagrange dual of the formulated primal optimization problem, we design totally-corrective boosting using the column generation technique in convex optimization. At each iteration, all weak classifiers' weights are updated. Our experiments on various data sets demonstrate that our direct multi-class boosting achieves competitive test accuracy compared with state-of-the-art multi-class boosting in the literature.

Yu, Zhiding; Au, Oscar C.; Tang, Ketan; Xu, Chunjing, "Nonparametric density estimation on a graph: Learning framework, fast approximation and application in image segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2201-2208, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995692
Abstract: We present a novel framework for tree-structure embedded density estimation and its fast approximation for mode seeking. The proposed method could find diverse applications in computer vision and feature space analysis. Given any undirected, connected and weighted graph, the density function is defined as a joint representation of the feature space and the distance domain on the graph's spanning tree. Since the distance domain of a tree is a constrained one, mode seeking cannot be directly achieved by traditional mean shift in both domains. We address this problem by introducing node shifting with force competition and its fast approximation. Our work is closely related to the previous literature of nonparametric methods. One shall see, however, that the new formulation of this problem can lead to many advantages and new characteristics in its application, as will be illustrated later in this paper.

Singaraju, Dheeraj; Vidal, Rene, "Using global bag of features models in random fields for joint categorization and segmentation of objects," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2313-2319, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995469
Abstract: We propose to bridge the gap between Random Field (RF) formulations for joint categorization and segmentation (JCaS), which model local interactions among pixels and superpixels, and Bag of Features categorization algorithms, which use global descriptors. For this purpose, we introduce new higher order potentials that encode the classification cost of a histogram extracted from all the objects in an image that belong to a particular category, where the cost is given as the output of a classifier when applied to the histogram. The potentials efficiently encode the classification costs of several histograms resulting from the different possible segmentations of an image. They can be integrated with existing potentials, hence providing a natural unification of global and local interactions. The potentials' parameters can be treated as parameters of the RF and hence be jointly learnt along with the other parameters of the RF. Experiments show that our framework can be used to improve the performance of existing JCaS algorithms.

Liu, Lingqiao; Wang, Lei; Shen, Chunhua, "A generalized probabilistic framework for compact codebook creation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1537-1544, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995628
Abstract: Compact and discriminative visual codebooks are preferred in many visual recognition tasks. In the literature, a few researchers have taken the approach of hierarchically merging the visual words of an initial large-size codebook, but implemented this idea with different merging criteria. In this work, we show that by defining different class-conditional distribution functions and parameter estimation methods, these merging criteria can be unified under a single probabilistic framework. More importantly, by adopting new distribution functions and/or parameter estimation methods, we can generalize this framework to produce a spectrum of novel merging criteria. Two of them are the particular focus of this work. For one criterion, we adopt the multinomial distribution to model each object class; for the other, we propose a max-margin-based parameter estimation method. Both theoretical analysis and experimental study demonstrate the superior performance of the two new merging criteria and the general applicability of our probabilistic framework.

Wang, Hongzhi; Suh, Jung Wook; Das, Sandhitsu; Pluta, John; Altinay, Murat; Yushkevich, Paul, "Regression-based label fusion for multi-atlas segmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1113-1120, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995382
Abstract: Automatic segmentation using multi-atlas label fusion has been widely applied in medical image analysis. To simplify the label fusion problem, most methods implicitly make a strong assumption that the segmentation errors produced by different atlases are uncorrelated. We show that violating this assumption significantly reduces the efficiency of multi-atlas segmentation. To address this problem, we propose a regression-based approach for label fusion. Our experiments on segmenting the hippocampus in magnetic resonance images (MRI) show significant improvement over previous label fusion techniques.

Fathi, Alireza; Ren, Xiaofeng; Rehg, James M., "Learning to recognize objects in egocentric activities," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3281-3288, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995444
Abstract: This paper addresses the problem of learning object models from egocentric video of household activities, using extremely weak supervision. For each activity sequence, we know only the names of the objects which are present within it, and have no other knowledge regarding the appearance or location of objects. The key to our approach is a robust, unsupervised bottom up segmentation method, which exploits the structure of the egocentric domain to partition each frame into hand, object, and background categories. By using Multiple Instance Learning to match object instances across sequences, we discover and localize object occurrences. Object representations are refined through transduction and object-level classifiers are trained. We demonstrate encouraging results in detecting novel object instances using models produced by weakly-supervised learning.

Yang, Xingwei; Adluru, Nagesh; Latecki, Longin Jan, "Particle filter with state permutations for solving image jigsaw puzzles," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2873-2880, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995535
Abstract: We deal with an image jigsaw puzzle problem, which is defined as reconstructing an image from a set of square and non-overlapping image patches. It is known that a general instance of this problem is NP-complete, and it is also challenging for humans, since in the considered setting the original image is not given. Recently a graphical model has been proposed to solve this and related problems. The target label probability function is then maximized using loopy belief propagation. We also formulate the problem as maximizing a label probability function and use exactly the same pairwise potentials. Our main contribution is a novel inference approach in the sampling framework of Particle Filter (PF). Usually in the PF framework it is assumed that the observations arrive sequentially, e.g., the observations are naturally ordered by their time stamps in the tracking scenario. Based on this assumption, the posterior density over the corresponding hidden states is estimated. In the jigsaw puzzle problem all observations (puzzle pieces) are given at once without any particular order. Therefore, we relax the assumption of having ordered observations and extend the PF framework to estimate the posterior density by exploring different orders of observations and selecting the most informative permutations of observations. This significantly broadens the scope of applications of the PF inference. Our experimental results demonstrate that the proposed inference framework significantly outperforms the loopy belief propagation in solving the image jigsaw puzzle problem. In particular, the extended PF inference triples the accuracy of the label assignment compared to that using loopy belief propagation.

Rodriguez-Sanchez, Antonio J.; Tsotsos, John K., "The importance of intermediate representations for the modeling of 2D shape detection: Endstopping and curvature tuned computations," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.4321-4326, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995671
Abstract: Computational models of visual processes with biological inspiration - and even biological realism - are currently of great interest in the computer vision community. This paper provides a biologically plausible model of 2D shape which incorporates intermediate layers of visual representation that have not previously been fully explored. We propose that endstopping and curvature cells are of great importance for shape selectivity and show how their combination can lead to shape selective neurons. This shape representation model provides a highly accurate fit with neural data from [17] and provides comparable results with real-world images to current computer vision systems. The conclusion is that such intermediate representations may no longer require a learning approach as a bridge between early representations based on Gabor or Difference of Gaussian filters (that are not learned since they are well-understood) and later representations closer to object representations that still can benefit from a learning methodology.

Barron, Jonathan T.; Malik, Jitendra, "High-frequency shape and albedo from shading using natural image statistics," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2521-2528, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995392
Abstract: We relax the long-held and problematic assumption in shape-from-shading (SFS) that albedo must be uniform or known, and address the problem of "shape and albedo from shading" (SAFS). Using models normally reserved for natural image statistics, we impose "naturalness" priors over the albedo and shape of a scene, which allows us to simultaneously recover the most likely albedo and shape that explain a single image. A simplification of our algorithm solves classic SFS, and our SAFS algorithm can solve the intrinsic image decomposition problem, as it solves a superset of that problem. We present results for SAFS, SFS, and intrinsic image decomposition on real lunar imagery from the Apollo missions, on our own pseudo-synthetic lunar dataset, and on a subset of the MIT Intrinsic Images dataset [15]. Our one unified technique appears to outperform the previous best individual algorithms for all three tasks. Our technique allows a coarse observation of shape (from a laser rangefinder or a stereo algorithm, etc.) to be incorporated a priori. We demonstrate that even a small amount of low-frequency information dramatically improves performance, and motivate the usage of shading for high-frequency shape (and albedo) recovery.

Kneip, Laurent; Scaramuzza, Davide; Siegwart, Roland, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2969-2976, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995464
Abstract: The Perspective-Three-Point (P3P) problem aims at determining the position and orientation of the camera in the world reference frame from three 2D-3D point correspondences. This problem is known to provide up to four solutions that can then be disambiguated using a fourth point. All existing solutions attempt to first solve for the position of the points in the camera reference frame, and then compute the position and orientation of the camera in the world frame, which aligns the two point sets. In contrast, in this paper we propose a novel closed-form solution to the P3P problem, which computes the aligning transformation directly in a single stage, without the intermediate derivation of the points in the camera frame. This is made possible by introducing intermediate camera and world reference frames, and expressing their relative position and orientation using only two parameters. The projection of a world point into the parametrized camera pose then leads to two conditions and finally a quartic equation for finding up to four solutions for the parameter pair. A subsequent backsubstitution directly leads to the corresponding camera poses with respect to the world reference frame. We show that the proposed algorithm offers accuracy and precision comparable to a popular, standard, state-of-the-art approach, but at much lower computational cost (15 times faster). Furthermore, it provides improved numerical stability and is less affected by degenerate configurations of the selected world points. The superior computational efficiency is particularly suitable for any RANSAC outlier-rejection step, which is always recommended before applying PnP or non-linear optimization of the final solution.

Tamaazousti, Mohamed; Gay-Bellile, Vincent; Collette, Sylvie Naudet; Bourgeois, Steve; Dhome, Michel, "NonLinear refinement of structure from motion reconstruction by taking advantage of a partial knowledge of the environment," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3073-3080, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995358
Abstract: We address the challenging issue of camera localization in a partially known environment, i.e. one for which a geometric 3D model covering only part of the observed scene is available. When this scene is static, both the known and unknown parts of the environment provide constraints on the camera motion. This paper proposes a nonlinear refinement process for an initial SfM reconstruction that takes advantage of these two types of constraints. Compared to approaches that exploit only the model constraints, i.e. the known part of the scene, including the unknown part of the environment in the optimization process yields a faster, more accurate and more robust refinement. It also presents a much larger convergence basin. This paper demonstrates these statements on varied synthetic and real sequences for both 3D object tracking and outdoor localization applications.

Pandharkar, Rohit; Velten, Andreas; Bardagjy, Andrew; Lawson, Everett; Bawendi, Moungi; Raskar, Ramesh, "Estimating motion and size of moving non-line-of-sight objects in cluttered environments," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.265-272, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995465
Abstract: We present a technique for motion and size estimation of non-line-of-sight (NLOS) moving objects in cluttered environments using a time-of-flight camera and multipath analysis. We exploit relative times of arrival after reflection from a grid of points on a diffuse surface and create a virtual phased array. By subtracting space-time impulse responses for successive frames, we separate responses of NLOS moving objects from those resulting from the cluttered environment. After reconstructing the line-of-sight scene geometry, we analyze the space of wavefronts using the phased array and solve a constrained least squares problem to recover the NLOS target location. Importantly, we can recover the target's motion vector even in the presence of uncalibrated time and pose bias common in time-of-flight systems. In addition, we compute an upper bound on the size of the target by backprojecting the extrema of the time profiles. The ability to track targets inside rooms despite opaque occluders and multipath responses has numerous applications in search and rescue, medicine and defense. We show centimeter-accurate results by making appropriate modifications to a time-of-flight system.

Crandall, David; Owens, Andrew; Snavely, Noah; Huttenlocher, Dan, "Discrete-continuous optimization for large-scale structure from motion," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3001-3008, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995626
Abstract: Recent work in structure from motion (SfM) has successfully built 3D models from large unstructured collections of images downloaded from the Internet. Most approaches use incremental algorithms that solve progressively larger bundle adjustment problems. These incremental techniques scale poorly as the number of images grows, and can drift or fall into bad local minima. We present an alternative formulation for SfM based on finding a coarse initial solution using a hybrid discrete-continuous optimization, and then improving that solution using bundle adjustment. The initial optimization step uses a discrete Markov random field (MRF) formulation, coupled with a continuous Levenberg-Marquardt refinement. The formulation naturally incorporates various sources of information about both the cameras and the points, including noisy geotags and vanishing point estimates. We test our method on several large-scale photo collections, including one with measured camera positions, and show that it can produce models that are similar to or better than those produced with incremental bundle adjustment, but more robustly and in a fraction of the time.

Rodriguez, A. L.; Lopez-de-Teruel, P. E.; Ruiz, A., "Reduced epipolar cost for accelerated incremental SfM," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.3097-3104, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995569
Abstract: We propose a reduced algebraic cost based on pairwise epipolar constraints for the iterative refinement of a multiple view 3D reconstruction. The aim is to accelerate the intermediate steps required when incrementally building a reconstruction from scratch. Though the proposed error is algebraic, careful input data normalization makes it a good approximation to the true geometric epipolar distance. Its minimization is significantly faster and obtains a geometric reprojection error very close to the optimum value, requiring very few iterations of final standard BA refinement. Smart usage of a reduced measurement matrix for each pair of views allows elimination of the variables corresponding to the 3D points prior to nonlinear optimization, subsequently reducing computation and memory usage, and considerably accelerating convergence. This approach has been tested on a wide range of real and synthetic problems, consistently obtaining significant robustness and convergence improvements even when starting from rough initial solutions. Its efficiency and scalability thus make it an ideal choice for incremental SfM in real-time tracking applications or scene modelling from large image databases.

Mittal, Sushil; Anand, Saket; Meer, Peter, "Generalized projection based M-estimator: Theory and applications," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2689-2696, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995514
Abstract: We introduce a robust estimator called the generalized projection based M-estimator (gpbM), which does not require the user to specify any scale parameters. For multiple inlier structures with different noise covariances, the estimator iteratively determines one inlier structure at a time. Unlike pbM, where the scale of the inlier noise is estimated simultaneously with the model parameters, gpbM has three distinct stages: scale estimation, robust model estimation and inlier/outlier dichotomy. We evaluate our performance on challenging synthetic data, on face image clustering with up to ten different faces from the Yale Face Database B, and on the multi-body projective motion segmentation problem on the Hopkins155 dataset. Results of state-of-the-art methods are presented for comparison.

Siddiquie, Behjat; Feris, Rogerio S.; Davis, Larry S., "Image ranking and retrieval based on multi-attribute queries," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.801-808, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995329
Abstract: We propose a novel approach for ranking and retrieval of images based on multi-attribute queries. Existing image retrieval methods train separate classifiers for each word and heuristically combine their outputs for retrieving multi-word queries. Moreover, these approaches also ignore the interdependencies among the query terms. In contrast, we propose a principled approach for multi-attribute retrieval which explicitly models the correlations that are present between the attributes. Given a multi-attribute query, we also utilize other attributes in the vocabulary which are not present in the query for ranking/retrieval. Furthermore, we integrate ranking and retrieval within the same formulation, by posing them as structured prediction problems. Extensive experimental evaluation on the Labeled Faces in the Wild (LFW), FaceTracer and PASCAL VOC datasets shows that our approach significantly outperforms several state-of-the-art ranking and retrieval methods.

Torsello, Andrea; Rodola, Emanuele; Albarelli, Andrea, "Multiview registration via graph diffusion of dual quaternions," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2441-2448, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995565
Abstract: Surface registration is a fundamental step in the reconstruction of three-dimensional objects. While there are several fast and reliable methods to align two surfaces, the tools available to align multiple surfaces are relatively limited. In this paper we propose a novel multiview registration algorithm that projects several pairwise alignments onto a common reference frame. The projection is performed by representing the motions as dual quaternions, an algebraic structure that is related to the group of 3D rigid transformations, and by performing a diffusion along the graph of adjacent (i.e., pairwise alignable) views. The approach allows for a completely generic topology with which the pair-wise motions are diffused. An extensive set of experiments shows that the proposed approach is both orders of magnitude faster than the state of the art, and more robust to extreme positional noise and outliers. The dramatic speedup of the approach allows it to be alternated with pairwise alignment resulting in a smoother energy profile, reducing the risk of getting stuck at local minima.

Qin, Danfeng; Gammeter, Stephan; Bossard, Lukas; Quack, Till; van Gool, Luc, "Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.777-784, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995373
Abstract: This paper introduces a simple yet effective method to improve visual-word based image retrieval. Our method is based on an analysis of the k-reciprocal nearest neighbor structure in the image space. At query time the information obtained from this process is used to treat different parts of the ranked retrieval list with different distance measures. This leads effectively to a re-ranking of retrieved images. As we will show, this has two benefits: first, using different similarity measures for different parts of the ranked list allows for compensation of the "curse of dimensionality". Second, it allows for dealing with the uneven distribution of images in the data space. Dealing with both challenges has a very beneficial effect on retrieval accuracy. Furthermore, a major part of the process happens offline, so it does not affect speed at retrieval time. Finally, the method operates on the bag-of-words level only, thus it could be combined with any additional measures on, e.g., either the descriptor level or feature geometry, making room for further improvement. We evaluate our approach on common object retrieval benchmarks and demonstrate a significant improvement over standard bag-of-words retrieval.
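The k-reciprocal neighbor structure the method builds on is easy to state in code. A minimal sketch (illustrative only; the paper's re-ranking and offline machinery are omitted, and the function name and toy data are assumptions): j is a k-reciprocal neighbor of i when each appears in the other's k-nearest-neighbor list.

```python
import numpy as np

def k_reciprocal_neighbors(D, k):
    """For each point i, keep neighbors j such that j is among i's k nearest
    AND i is among j's k nearest (D is a full pairwise distance matrix)."""
    knn = np.argsort(D, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    return [[j for j in knn[i] if i in knn[j]] for i in range(D.shape[0])]

# Toy 1-D points: {0, 1} and {10, 11} form two mutual-neighbor pairs.
pts = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])
r = k_reciprocal_neighbors(D, k=1)
```

The mutual condition is what makes the relation robust: a point surrounded by unrelated images rarely appears in its neighbors' own k-NN lists.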

Harada, Tatsuya; Ushiku, Yoshitaka; Yamashita, Yuya; Kuniyoshi, Yasuo, "Discriminative spatial pyramid," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.1617-1624, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995691
Abstract: Spatial Pyramid Representation (SPR) is a widely used method for embedding both global and local spatial information into a feature, and it shows good performance in terms of generic image recognition. In SPR, the image is divided into a sequence of increasingly finer grids on each pyramid level. Features are extracted from all of the grid cells and are concatenated to form one huge feature vector. As a result, expensive computational costs are required for both learning and testing. Moreover, because the strategy for partitioning the image at each pyramid level is designed by hand, there is weak theoretical evidence of the appropriate partitioning strategy for good categorization. In this paper, we propose discriminative SPR, which is a new representation that forms the image feature as a weighted sum of semi-local features over all pyramid levels. The weights are automatically selected to maximize a discriminative power. The resulting feature is compact and preserves high discriminative power, even in low dimension. Furthermore, the discriminative SPR can suggest the distinctive cells and the pyramid levels simultaneously by observing the optimal weights generated from the fine grid cells.

Johnson, Micah K.; Adelson, Edward H., "Shape estimation in natural illumination," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2553-2560, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995510
Abstract: The traditional shape-from-shading problem, with a single light source and Lambertian reflectance, is challenging since the constraints implied by the illumination are not sufficient to specify local orientation. Photometric stereo algorithms, a variant of shape-from-shading, simplify the problem by controlling the illumination to obtain additional constraints. In this paper, we demonstrate that many natural lighting environments already have sufficient variability to constrain local shape. We describe a novel optimization scheme that exploits this variability to estimate surface normals from a single image of a diffuse object in natural illumination. We demonstrate the effectiveness of our method on both simulated and real images.

Kotsia, Irene; Patras, Ioannis, "Support Tucker machines," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.633-640, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995663
Abstract: In this paper we address the two-class classification problem within the tensor-based framework, by formulating the Support Tucker Machines (STuMs). More precisely, in the proposed STuMs the weight parameters are regarded as a tensor, calculated according to the Tucker tensor decomposition as the multiplication of a core tensor with a set of matrices, one along each mode. We further extend the proposed STuMs to the Σ/Σw STuMs, in order to fully exploit the information offered by the total or the within-class covariance matrix and whiten the data, thus providing invariance to affine transformations in the feature space. We formulate the two above-mentioned problems in such a way that they can be solved in an iterative manner, where at each iteration the parameters corresponding to the projections along a single tensor mode are estimated by solving a typical Support Vector Machine-type problem. The superiority of the proposed methods in terms of classification accuracy is illustrated on the problems of gait and action recognition.

Gao, Shenghua; Chia, Liang-Tien; Tsang, Ivor Wai-Hung, "Multi-layer group sparse coding — For concurrent image classification and annotation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2809-2816, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995454
Abstract: We present a multi-layer group sparse coding framework for concurrent image classification and annotation. By leveraging the dependency between image class labels and tags, we introduce a multi-layer group sparse structure of the reconstruction coefficients. Such structure fully encodes the mutual dependency between the class label, which describes the image content as a whole, and tags, which describe the components of the image content. We then propose a multi-layer group based tag propagation method, which combines the class label and subgroups of instances with similar tag distributions to annotate test images. Moreover, we extend our multi-layer group sparse coding to the Reproducing Kernel Hilbert Space (RKHS), which captures the nonlinearity of features and further improves the performance of image classification and annotation. Experimental results on the LabelMe, UIUC-Sport and NUS-WIDE-Object databases show that our method outperforms the baseline methods and achieves excellent performance in both image classification and annotation tasks.

Galleguillos, Carolina; McFee, Brian; Belongie, Serge; Lanckriet, Gert, "From region similarity to category discovery," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, vol., no., pp.2665-2672, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995527
Abstract: The goal of object category discovery is to automatically identify groups of image regions which belong to some new, previously unseen category. This task is typically performed in a purely unsupervised setting, and as a result, performance depends critically upon accurate assessments of similarity between unlabeled image regions. To improve the accuracy of category discovery, we develop a novel multiple kernel learning algorithm based on structural SVM, which optimizes a similarity space for nearest-neighbor prediction. The optimized space is then used to cluster unlabeled data and identify new categories. Experimental results on the MSRC and PASCAL VOC2007 data sets indicate that using an optimized similarity metric can improve clustering for category discovery. Furthermore, we demonstrate that including both labeled and unlabeled training data when optimizing the similarity metric can improve the overall quality of the system.

Wang, Meng; Konrad, Janusz; Ishwar, Prakash; Jing, Kevin; Rowley, Henry; , "Image saliency: From intrinsic to extrinsic context," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.417-424, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995743
Abstract: We propose a novel framework for automatic saliency estimation in natural images. We consider saliency to be an anomaly with respect to a given context that can be global or local. In the case of global context, we estimate saliency in the whole image relative to a large dictionary of images. Unlike in some prior methods, this dictionary is not annotated, i.e., saliency is assumed unknown. In the case of local context, we partition the image into patches and estimate saliency in each patch relative to a large dictionary of un-annotated patches from the rest of the image. We propose a unified framework that applies to both cases in three steps. First, given an input (image or patch) we extract k nearest neighbors from the dictionary. Then, we geometrically warp each neighbor to match the input. Finally, we derive the saliency map from the mean absolute error between the input and all its warped neighbors. This algorithm is not only easy to implement but also outperforms state-of-the-art methods.
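The three-step pipeline above (k nearest neighbors, warping, error-based saliency) can be sketched in a few lines. This is a toy illustration with our own names and data, not the authors' implementation; the geometric warping step is omitted, so neighbors are compared to the input directly.

```python
# Hypothetical sketch of the three-step saliency pipeline described above:
# 1) take the k nearest dictionary neighbors, 2) warp each neighbor to the
# input (omitted in this toy), 3) score saliency as the mean absolute error.
# All names and the toy "dictionary" are ours, not the authors' code.

def mae(a, b):
    """Mean absolute error between two equal-length patch vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def saliency(patch, dictionary, k=3):
    # Step 1: k nearest neighbors under MAE distance.
    neighbors = sorted(dictionary, key=lambda d: mae(patch, d))[:k]
    # Step 2 (geometric warping of each neighbor) is skipped here.
    # Step 3: average reconstruction error over the retained neighbors.
    return sum(mae(patch, n) for n in neighbors) / len(neighbors)

dictionary = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
common = [0, 0, 0, 1]    # resembles dictionary content -> low saliency
anomaly = [9, 9, 9, 9]   # far from everything -> high saliency (anomaly)
assert saliency(common, dictionary) < saliency(anomaly, dictionary)
```

A patch that the un-annotated dictionary can reconstruct well gets a low score; an anomalous patch gets a high one, matching the paper's notion of saliency as an anomaly with respect to context.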

Oreifej, Omar; Shu, Guang; Pace, Teresa; Shah, Mubarak; , "A two-stage reconstruction approach for seeing through water," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1153-1160, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995428
Abstract: Several attempts have lately been proposed to tackle the problem of recovering the original image of an underwater scene from a sequence distorted by water waves. The main drawback of the state of the art [18] is that it depends heavily on modelling the waves, which is in fact ill-posed, since the actual behavior of the waves, along with the imaging process, is complicated and includes several noise components; therefore, the results are not satisfactory. In this paper, we revisit the problem by proposing a data-driven two-stage approach in which each stage targets a certain type of noise. The first stage leverages the temporal mean of the sequence to overcome the structured turbulence of the waves through an iterative robust registration algorithm. The result of the first stage is a high-quality mean and a better-structured sequence; however, the sequence still contains unstructured sparse noise. Thus, we employ a second stage in which we extract the sparse errors from the sequence through rank minimization. Our method converges faster, and drastically outperforms the state of the art on all testing sequences, even after only the first stage.
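A heavily simplified sketch of the two-stage idea follows. The paper uses iterative robust registration and rank minimization; this toy assumes registration is already done and recovers the sparse term by plain thresholding of deviations from the temporal mean. All data and names are ours.

```python
# Simplified sketch of the two-stage decomposition: stage 1 takes the
# temporal mean of the (assumed registered) sequence as a turbulence-free
# estimate; stage 2 separates sparse errors. The paper's stage 2 uses rank
# minimization; here large deviations are simply thresholded (our toy).

def two_stage(frames, sparse_threshold=2.0):
    n = len(frames)
    # Stage 1: per-pixel temporal mean of the sequence.
    mean = [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]
    # Stage 2: deviations well above the noise level are sparse errors.
    sparse = [[f[i] - mean[i] if abs(f[i] - mean[i]) > sparse_threshold
               else 0.0 for i in range(len(mean))] for f in frames]
    return mean, sparse

frames = [[5.0, 5.0, 5.0],
          [5.0, 5.0, 5.0],
          [5.0, 14.0, 5.0]]  # one frame hit by an outlier at pixel 1
mean, sparse = two_stage(frames)
assert sparse[2][1] > 0 and sparse[0][0] == 0.0
```

Note the limitation this toy shares with any mean-based estimate: the outlier also biases the mean itself, which is exactly why the paper iterates registration and uses robust rank minimization instead.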

Lhuillier, Maxime; , "Fusion of GPS and structure-from-motion using constrained bundle adjustments," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3025-3032, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995456
Abstract: Two problems occur when bundle adjustment (BA) is applied to long image sequences: the large calculation time and the drift (or error accumulation). In recent work, the calculation time is reduced by local BAs applied in an incremental scheme. The drift may be reduced by fusion of GPS and Structure-from-Motion. An existing fusion method is a BA minimizing a weighted sum of image and GPS errors. This paper introduces two constrained BAs for fusion, which enforce an upper bound on the reprojection error. These BAs are alternatives to the existing fusion BA, which does not guarantee a small reprojection error and requires a weight as input. The three fusion BAs are then integrated in an incremental Structure-from-Motion method based on local BA. Lastly, we compare the fusion results on a long monocular image sequence and a low-cost GPS.
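The baseline fusion BA that the paper compares against minimizes a weighted sum of reprojection and GPS errors. The toy cost below (our own names and data, a single 2D camera center instead of a full BA) illustrates why the choice of weight matters: it directly controls how far the optimum is pulled toward the GPS position.

```python
# Toy weighted-sum fusion cost for one camera center (illustrative only):
# reprojection term + weight * squared distance to the GPS position.

def fusion_cost(center, reprojection_error, gps_position, weight):
    gps_err = sum((c - g) ** 2 for c, g in zip(center, gps_position))
    return reprojection_error(center) + weight * gps_err

# Toy reprojection error with minimum at (1, 0); GPS reports (0, 0).
reproj = lambda c: (c[0] - 1.0) ** 2 + c[1] ** 2

# Brute-force the 1-D minimizer along x on a grid; analytically it is
# x* = 1 / (1 + weight), so a larger weight drags it toward the GPS fix.
grid = [i / 100.0 for i in range(0, 101)]
best = lambda w: min(grid, key=lambda x: fusion_cost((x, 0.0), reproj,
                                                     (0.0, 0.0), w))
assert best(0.0) == 1.0   # no GPS term: pure reprojection optimum
assert best(1.0) == 0.5   # equal weighting: halfway compromise
assert best(9.0) < 0.2    # heavy GPS weight: near the GPS position
```

This weight sensitivity is exactly what the paper's constrained BAs avoid by enforcing an explicit upper bound on the reprojection error instead.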

Parag, Toufiq; Elgammal, Ahmed; , "Supervised hypergraph labeling," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2289-2296, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995522
Abstract: We address the problem of labeling individual datapoints given some knowledge about (small) subsets or groups of them. The knowledge we have for a group is the likelihood value for each group member to satisfy a certain model. This problem is equivalent to the hypergraph labeling problem, where each datapoint corresponds to a node and each subset corresponds to a hyperedge with the likelihood value as its weight. We propose a novel method to model the label dependence using an Undirected Graphical Model and reduce the problem of hypergraph labeling to an inference problem. This paper describes the structure and necessary components of such a model and proposes useful cost functions. We discuss the behavior of the proposed algorithm with different forms of the cost functions, identify suitable algorithms for inference, and analyze the properties under which an exact solution is theoretically guaranteed. Several real-world problems are shown as applications of the proposed method.

Sun, Xin; Yao, Hongxun; Zhang, Shengping; , "A novel supervised level set method for non-rigid object tracking," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3393-3400, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995656
Abstract: We present a novel approach to non-rigid object tracking based on a supervised level set model (SLSM). In contrast with conventional level set models, which emphasize intensity consistency only and consider no priors, the curve evolution of the proposed SLSM is object-oriented and supervised by specific knowledge of the target we want to track. Therefore, the SLSM can ensure a more accurate convergence to the target in tracking applications. In particular, we first construct the appearance model for the target in an on-line boosting manner due to its strong discriminative power between objects and background. Then the probability of the contour is modeled by considering both region and edge cues in a Bayesian manner, leading the curve to converge to the candidate region with maximum likelihood of being the target. Finally, the accurate target region qualifies the samples fed to the boosting procedure as well as the target model prepared for the next time step. A positive decrease rate is used to adjust the learning pace over time, enabling tracking to continue under partial and total occlusion. Experimental results on a number of challenging sequences validate the effectiveness of the technique.

Sanchez, Jorge; Perronnin, Florent; , "High-dimensional signature compression for large-scale image classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1665-1672, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995504
Abstract: We address image classification on a large scale, i.e. when a large number of images and classes are involved. First, we study classification accuracy as a function of the image signature dimensionality and the training set size. We show experimentally that the larger the training set, the higher the impact of the dimensionality on the accuracy. In other words, high-dimensional signatures are important to obtain state-of-the-art results on large datasets. Second, we tackle the problem of data compression on very large signatures (on the order of 10^5 dimensions) using two lossy compression strategies: a dimensionality reduction technique known as the hash kernel and an encoding technique based on product quantizers. We explain how the gain in storage can be traded against a loss in accuracy and/or an increase in CPU cost. We report results on two large databases — ImageNet and a dataset of 1M Flickr images — showing that we can reduce the storage of our signatures by a factor of 64 to 128 with little loss in accuracy. Integrating the decompression in the classifier learning yields an efficient and scalable training algorithm. On ILSVRC2010 we report a 74.3% top-5 accuracy, which corresponds to a 2.5% absolute improvement with respect to the state-of-the-art. On a subset of 10K classes of ImageNet we report a top-1 accuracy of 16.7%, a relative improvement of 160% with respect to the state-of-the-art.
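The hash-kernel compression mentioned above can be illustrated with the standard feature-hashing trick: each input dimension is hashed to one of D buckets with a pseudo-random sign, and collisions simply add up. This is a generic sketch with our own names, not the authors' implementation.

```python
# Generic feature-hashing ("hash kernel") sketch: project a long signature
# into D buckets via a hash of the dimension index, with a sign bit to
# reduce collision bias. Names and parameters here are illustrative.
import hashlib

def hash_kernel(signature, D=8):
    compressed = [0.0] * D
    for i, value in enumerate(signature):
        # Deterministic hash of the dimension index i.
        h = int(hashlib.md5(str(i).encode()).hexdigest(), 16)
        bucket = h % D
        sign = 1.0 if (h // D) % 2 == 0 else -1.0  # pseudo-random sign
        compressed[bucket] += sign * value          # collisions add up
    return compressed

x = [1.0, 0.0, 2.0, 0.0, 3.0]
cx = hash_kernel(x, D=4)
assert len(cx) == 4
```

Because the hash depends only on the dimension index, the projection is deterministic, so two signatures are compressed consistently and their dot products are approximately preserved in expectation.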

Agrawal, Amit; Taguchi, Yuichi; Ramalingam, Srikumar; , "Beyond Alhazen's problem: Analytical projection model for non-central catadioptric cameras with quadric mirrors," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2993-3000, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995596
Abstract: Catadioptric cameras are widely used to increase the field of view using mirrors. Central catadioptric systems having an effective single viewpoint are easy to model and use, but severely constrain the camera positioning with respect to the mirror. On the other hand, non-central catadioptric systems allow greater flexibility in camera placement, but are often approximated using central or linear models due to the lack of an exact model. We bridge this gap and describe an exact projection model for non-central catadioptric systems. We derive an analytical 'forward projection' equation for the projection of a 3D point reflected by a quadric mirror on the imaging plane of a perspective camera, with no restrictions on the camera placement, and show that it is an 8th degree equation in a single unknown. While previous non-central catadioptric cameras primarily use an axial configuration where the camera is placed on the axis of a rotationally symmetric mirror, we allow off-axis (any) camera placement. Using this analytical model, a non-central catadioptric camera can be used for sparse as well as dense 3D reconstruction similar to perspective cameras, using well-known algorithms such as bundle adjustment and plane sweeping. Our paper is the first to show such results for off-axis placement of the camera with multiple quadric mirrors. Simulation and real results using parabolic mirrors and an off-axis perspective camera are demonstrated.

Sundberg, Patrik; Brox, Thomas; Maire, Michael; Arbelaez, Pablo; Malik, Jitendra; , "Occlusion boundary detection and figure/ground assignment from optical flow," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2233-2240, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995364
Abstract: In this work, we propose a contour and region detector for video data that exploits motion cues and distinguishes occlusion boundaries from internal boundaries based on optical flow. This detector outperforms the state-of-the-art on the benchmark of Stein and Hebert [24], improving average precision from 0.58 to 0.72. Moreover, the optical flow on and near occlusion boundaries allows us to assign a depth ordering to the adjacent regions. To evaluate performance on this edge-based figure/ground labeling task, we introduce a new video dataset that we believe will support further research in the field by allowing quantitative comparison of computational models for occlusion boundary detection, depth ordering and segmentation in video sequences.

Liu, Xiaobai; Feng, Jiashi; Yan, Shuicheng; Lin, Liang; Jin, Hai; , "Segment an image by looking into an image corpus," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2249-2256, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995497
Abstract: This paper investigates how to segment an image into semantic regions by harnessing an unlabeled image corpus. First, the image segmentation task is recast as a small-size patch grouping problem. Then, we discover two novel patch-pair priors, namely the first-order patch-pair density prior and the second-order patch-pair co-occurrence prior, founded on two statistical observations from the natural image corpus. The underlying rationales are: 1) a patch-pair falling within the same object region generally has higher density than a patch-pair falling on different objects, and 2) two patch-pairs with high co-occurrence frequency are likely to bear similar semantic consistence confidences (SCCs), i.e. the confidence that the two constituent patches belong to the same semantic concept. These two discriminative priors are further integrated into a unified objective function in order to augment the intrinsic patch-pair similarities, originally calculated using patch-level visual features, into the semantic consistence confidences. A nonnegativity constraint is also imposed on the output variables, and an efficient iterative procedure is provided to seek the optimal solution. The ultimate patch grouping is conducted by first building a similarity graph, which takes the atomic patches as vertices and the augmented patch-pair SCCs as edge weights, and then employing the popular Normalized Cut approach to group patches into semantic clusters. Extensive image segmentation experiments on two public databases clearly demonstrate the superiority of the proposed approach over various state-of-the-art unsupervised image segmentation algorithms.

Kuo, Yin-Hsi; Lin, Hsuan-Tien; Cheng, Wen-Huang; Yang, Yi-Hsuan; Hsu, Winston H.; , "Unsupervised auxiliary visual words discovery for large-scale image object retrieval," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.905-912, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995639
Abstract: Image object retrieval–locating image occurrences of specific objects in large-scale image collections–is essential for managing the sheer volume of photos. Current solutions, mostly based on the bag-of-words model, suffer from low recall and are not robust to the noise caused by changes in lighting, viewpoints, and even occlusions. We propose to augment each image with auxiliary visual words (AVWs), semantically relevant to the search targets. The AVWs are automatically discovered by feature propagation and selection in textual and visual image graphs in an unsupervised manner. We investigate variant optimization methods for effectiveness and scalability in large-scale image collections. Experimenting on large-scale consumer photos, we found that the proposed method significantly improves on the traditional bag-of-words model (a 111% relative improvement). Meanwhile, the selection process can also notably reduce the number of features (to 1.4%) and can further facilitate indexing in large-scale image object retrieval.

Perina, Alessandro; Jojic, Nebojsa; , "Image analysis by counting on a grid," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1985-1992, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995742
Abstract: In recent object/scene recognition research images or large image regions are often represented as disorganized "bags" of image features. This representation allows direct application of models of word counts in text. However, the image feature counts are likely to be constrained in different ways than word counts in text. As a camera pans upwards from a building entrance over its first few floors and then above the penthouse to the backdrop formed by the mountains, and then further up into the sky, some feature counts in the image drop while others rise–only to drop again, giving way to features found more often at higher elevations (Fig. 1). The space of all possible feature count combinations is constrained by the properties of the larger scene as well as the size and the location of the window into it. Accordingly, our model is based on a grid of feature counts, considerably larger than any of the modeled images, and considerably smaller than the real estate needed to tile the images next to each other tightly. Each modeled image is assumed to have a representative window in the grid in which the sum of feature counts mimics the distribution in the image. We provide learning procedures that jointly map all images in the training set to the counting grid and estimate the appropriate local counts in it. Experimentally, we demonstrate that the resulting representation captures the space of feature count combinations more accurately than the traditional models, such as latent Dirichlet allocation, even when modeling images of different scenes from the same category.
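The counting-grid intuition (each image corresponds to a window on a larger grid of feature counts, and sliding the window smoothly shifts the feature histogram, as in the panning-camera example) can be illustrated with a toy grid. All data and names here are ours, not the authors' model or learning procedure.

```python
# Toy counting grid: a 2x3 grid where each cell holds counts for two
# feature types. An "image" is a window on the grid whose summed counts
# give its feature histogram (illustrative data, not learned).

grid = [
    [[3, 0], [2, 1], [0, 3]],
    [[3, 0], [1, 2], [0, 3]],
]

def window_histogram(grid, top, left, h, w):
    """Sum the per-cell feature counts inside an h-by-w window."""
    hist = [0, 0]
    for r in range(top, top + h):
        for c in range(left, left + w):
            for f in range(2):
                hist[f] += grid[r][c][f]
    return hist

# Sliding a 2x2 window rightwards shifts the histogram from feature 0
# (say, "building") toward feature 1 (say, "sky"), like the panning camera.
left_hist = window_histogram(grid, 0, 0, 2, 2)
right_hist = window_histogram(grid, 0, 1, 2, 2)
assert left_hist[0] > right_hist[0]
assert left_hist[1] < right_hist[1]
```

The paper's learning procedure works in the opposite direction: given only the per-image histograms, it jointly estimates the grid's local counts and each image's window placement.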

Kwak, Suha; Han, Bohyung; Han, Joon Hee; , "Scenario-based video event recognition by constraint flow," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3345-3352, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995435
Abstract: We present a novel approach to representing and recognizing composite video events. A composite event is specified by a scenario, which is based on primitive events and their temporal-logical relations, to constrain the arrangements of the primitive events in the composite event. We propose a new scenario description method to represent composite events fluently and efficiently. A composite event is recognized by a constrained optimization algorithm whose constraints are defined by the scenario. The dynamic configuration of the scenario constraints is represented with a constraint flow, which is generated automatically from the scenario by our scenario parsing algorithm. The constraint flow reduces the search space dramatically, alleviates the effect of preprocessing errors, and guarantees the globally optimal solution for recognition. We validate our method for describing scenarios and constructing constraint flows on real videos, and illustrate the effectiveness of our composite event recognition algorithm on natural video events.

Zheng, Yali; Nobuhara, Shohei; Sheikh, Yaser; , "Structure from motion blur in low light," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2569-2576, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995594
Abstract: In theory, the precision of structure from motion estimation is known to increase as camera motion increases. In practice, larger camera motions induce motion blur, particularly in low light where longer exposures are needed. If the camera center moves during exposure, the trajectory traces in a motion-blurred image encode the underlying 3D structure of points and the motion of the camera. In this paper, we propose an algorithm to explicitly estimate the 3D structure of point light sources and camera motion from a motion-blurred image in a low light scene with point light sources. The algorithm identifies extremal points of the traces mapped out by the point sources in the image and classifies them into start and end sets. Each trace is charted out incrementally using local curvature, providing correspondences between start and end points. We use these correspondences to obtain an initial estimate of the epipolar geometry embedded in a motion-blurred image. The reconstruction and the 2D traces are used to estimate the motion of the camera during the interval of capture, and multiple view bundle adjustment is applied to refine the estimates.

Franco, Jean-Sebastien; Boyer, Edmond; , "Learning temporally consistent rigidities," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1241-1248, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995440
Abstract: We present a novel probabilistic framework for rigid tracking and segmentation of shapes observed from multiple cameras. Most existing methods have focused on solving each of these problems individually, segmenting the shape assuming surface registration is solved, or conversely performing surface registration assuming shape segmentation or kinematic structure is known. We assume no prior kinematic or registration knowledge except for an over-estimate k of the number of rigidities in the scene, instead proposing to simultaneously discover, adapt, and track its rigid structure on the fly. We simultaneously segment and infer poses of rigid subcomponents of a single chosen reference mesh acquired in the sequence. We show that this problem can be rigorously cast as a likelihood maximization over rigid component parameters. We solve this problem using an Expectation Maximization algorithm, with latent observation assignments to reference vertices and rigid parts. Our experiments on synthetic and real data show the validity of the method, robustness to noise, and its promising applicability to complex sequences.

Shen, Li; Yeo, Chuohao; , "Intrinsic images decomposition using a local and global sparse representation of reflectance," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.697-704, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995738
Abstract: Intrinsic image decomposition is an important problem that targets the recovery of shading and reflectance components from a single image. While this is an ill-posed problem on its own, we propose a novel approach for intrinsic image decomposition using a reflectance sparsity prior that we have developed. Our method is based on a simple observation: neighboring pixels usually have the same reflectance if their chromaticities are the same or very similar. We formalize this sparsity constraint on local reflectance, and derive a sparse representation of reflectance components using data-driven edge-avoiding-wavelets. We show that the reflectance component of natural images is sparse in this representation. We also propose and formulate a novel global reflectance sparsity constraint. Using this sparsity prior and global constraints, we formulate a l1-regularized least squares minimization problem for intrinsic image decomposition that can be solved efficiently. Our algorithm can successfully extract intrinsic images from a single image, without using other reflection or color models or any user interaction. The results on challenging scenes demonstrate the power of the proposed technique.
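The decomposition above is ultimately cast as an l1-regularized least squares problem. A generic way to solve such problems is iterative soft-thresholding (ISTA); the 1-D toy below is our own sketch of that idea, not the authors' solver, and the data is illustrative.

```python
# Generic ISTA sketch for an l1-regularized least squares problem:
# minimize 0.5*(x - b)^2 + lam*|x|, coordinate-wise (our toy, not the
# paper's wavelet-domain formulation).

def soft_threshold(x, t):
    """Proximal operator of the l1 norm: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def ista(b, lam, step=0.5, iters=200):
    x = [0.0] * len(b)
    for _ in range(iters):
        # Gradient step on the quadratic term, then the l1 prox.
        x = [soft_threshold(xi - step * (xi - bi), step * lam)
             for xi, bi in zip(x, b)]
    return x

# Small coefficients are driven exactly to zero (sparsity); large ones
# survive, shrunk by lam.
x = ista([3.0, 0.2, -2.0, 0.05], lam=0.5)
assert x[1] == 0.0 and x[3] == 0.0
assert abs(x[0] - 2.5) < 1e-6 and abs(x[2] + 1.5) < 1e-6
```

The exact zeros are the point of the l1 prior: in the paper's setting they correspond to a reflectance component that is sparse in the edge-avoiding-wavelet representation.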

Wu, Wen; Chen, Terrence; Wang, Peng; Zhou, Shaohua Kevin; Comaniciu, Dorin; Barbu, Adrian; Strobel, Norbert; , "Learning-based hypothesis fusion for robust catheter tracking in 2D X-ray fluoroscopy," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1097-1104, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995553
Abstract: Catheter tracking has become more and more important in recent interventional applications. It provides real-time navigation for the physicians and can be used to control a motion-compensated fluoro overlay reference image for other means of guidance, e.g. involving a 3D anatomical model. Tracking the coronary sinus (CS) catheter is effective for compensating respiratory and cardiac motion for 3D overlay navigation to assist positioning the ablation catheter in Atrial Fibrillation (Afib) treatments. During interventions, the CS catheter undergoes rapid motion and non-rigid deformation due to the beating heart and respiration. In this paper, we model the CS catheter as a set of electrodes. Hypotheses generated by a number of novel learning-based detectors are fused. Robust hypothesis matching through a Bayesian framework is then used to select the best hypothesis for each frame. As a result, our tracking method achieves very high robustness against challenging scenarios such as low SNR, occlusion, foreshortening, non-rigid deformation, as well as the catheter moving in and out of the ROI. Quantitative evaluation has been conducted on a database of 13221 frames from 1073 sequences. Our approach obtains a median error of 0.50 mm and a mean error of 0.76 mm; 97.8% of the evaluated data have errors of less than 2.00 mm. The speed of our tracking algorithm reaches 5 frames per second on most data sets. Our approach is not limited to catheters inside the CS but can be extended to track other types of catheters, such as ablation catheters or circumferential mapping catheters.

Vicente, Sara; Rother, Carsten; Kolmogorov, Vladimir; , "Object cosegmentation," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2217-2224, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995530
Abstract: Cosegmentation is typically defined as the task of jointly segmenting "something similar" in a given set of images. Existing methods are too generic and so far have not demonstrated competitive results for any specific task. In this paper we overcome this limitation by adding two new aspects to cosegmentation: (1) the "something" has to be an object, and (2) the "similarity" measure is learned. In this way, we are able to achieve excellent results on the recently introduced iCoseg dataset, which contains small sets of images of either the same object instance or similar objects of the same class. The challenge of this dataset lies in the extreme changes in viewpoint, lighting, and object deformations within each set. We are able to considerably outperform several competitors. To achieve this performance, we borrow recent ideas from object recognition: the use of powerful features extracted from a pool of candidate object-like segmentations. We believe that our work will be beneficial to several application areas, such as image retrieval.

Song, Zheng; Chen, Qiang; Huang, Zhongyang; Hua, Yang; Yan, Shuicheng; , "Contextualizing object detection and classification," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1585-1592, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995330
Abstract: In this paper, we investigate how to iteratively and mutually boost object classification and detection by taking the outputs from one task as the context of the other one. First, instead of intuitive feature and context concatenation or postprocessing with context, the so-called Contextualized Support Vector Machine (Context-SVM) is proposed, where the context takes the responsibility of dynamically adjusting the classification hyperplane, and thus the context-adaptive classifier is achieved. Then, an iterative training procedure is presented. In each step, Context-SVM, associated with the output context from one task (object classification or detection), is instantiated to boost the performance for the other task, whose augmented outputs are then further used to improve the former task by Context-SVM. The proposed solution is evaluated on the object classification and detection tasks of PASCAL Visual Object Challenge (VOC) 2007 and 2010, and achieves the state-of-the-art performance.

Mishra, Akshaya; Fieguth, Paul W.; Clausi, David A.; , "From active contours to active surfaces," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2121-2128, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995612
Abstract: Identifying the surfaces of three-dimensional static objects, or of two-dimensional objects over time, is key to a variety of applications throughout computer vision. Active surface techniques have been widely applied to such tasks: a deformable spline surface evolves under the influence of internal and external (typically opposing) energies until the model converges to the desired surface. Present deformable-model surface extraction techniques are computationally expensive and are not able to reliably identify surfaces in the presence of noise, high curvature, or clutter. This paper proposes a novel active surface technique, decoupled active surfaces, with the specific objectives of robustness and computational efficiency. Motivated by recent results in two-dimensional object segmentation, the internal and external energies are treated separately, which leads to much faster convergence. A truncated maximum likelihood estimator is applied to generate a surface consistent with the measurements (external energy), and a Bayesian linear least squares estimator is asserted to enforce the prior (internal energy). To maintain tractability for typical three-dimensional problems, the density of vertices is dynamically resampled based on curvature, a novel quasi-random search is used as a substitute for the ML estimator, and sparse conjugate gradient is used to execute the Bayesian estimator. The performance of the proposed method is presented using two natural and two synthetic image volumes.

Balzer, Jonathan; Hofer, Sebastian; Beyerer, Jurgen; , "Multiview specular stereo reconstruction of large mirror surfaces," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2537-2544, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995346
Abstract: In deflectometry, the shape of mirror objects is recovered from distorted images of a calibrated scene. While remarkably high accuracies are achievable, state-of-the-art methods suffer from two distinct weaknesses: First, for mainly constructive reasons, they can only capture a few square centimeters of surface area at once. Second, reconstructions are ambiguous, i.e. infinitely many surfaces lead to the same visual impression. We resolve both of these problems by introducing the first multiview specular stereo approach, which jointly evaluates a series of overlapping deflectometric images. Two publicly available benchmarks accompany this paper, enabling us to numerically demonstrate the viability and practicability of our approach.

Yang, Weilong; Toderici, George; , "Discriminative tag learning on YouTube videos with latent sub-tags," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.3217-3224, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995402
Abstract: We consider the problem of content-based automated tag learning. In particular, we address semantic variations (sub-tags) of the tag. Each video in the training set is assumed to be associated with a sub-tag label, and we treat this sub-tag label as latent information. A latent learning framework based on LogitBoost is proposed, which jointly considers both the tag label and the latent sub-tag label. The latent sub-tag information is exploited in our framework to assist the learning of our end goal, i.e., tag prediction. We use the cowatch information to initialize the learning process. In experiments, we show that the proposed method achieves significantly better results over baselines on a large-scale testing video set which contains about 50 million YouTube videos.

Zhao, Peng; Quan, Long; , "Translation symmetry detection in a fronto-parallel view," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.1009-1016, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995482
Abstract: In this paper, we present a method of detecting translation symmetries from a fronto-parallel image. The proposed method automatically detects unknown multiple repetitive patterns of arbitrary shapes, which are characterized by translation symmetries on a plane. The central idea of our approach is to take advantage of the interesting properties of translation symmetries in both image space and the space of the transformation group. We first detect feature points in the input image as sampling points. Then, for each sampling point, we search for the most probable corresponding lattice structures in the image and transform spaces using scale-space similarity maps. Finally, using an MRF formulation, we optimally partition the graph of all sampling points associated with the estimated lattices into subgraphs of sampling points and lattices belonging to the same symmetry pattern. Our method is robust because of the joint analysis in image and transform spaces, and the MRF optimization. We demonstrate the robustness and effectiveness of our method on a large variety of images.

Yeung, Sai-Kit; Wu, Tai-Pang; Tang, Chi-Keung; Chan, Tony F.; Osher, Stanley; , "Adequate reconstruction of transparent objects on a shoestring budget," Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2513-2520, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995472
Abstract: Reconstructing transparent objects is a challenging problem. While producing reasonable results for quite complex objects, existing approaches require custom calibration or somewhat expensive labor to achieve high precision. On the other hand, when an overall shape preserving salient and fine details is sufficient, we show in this paper a significant step toward solving the problem on a shoestring budget, by using only a video camera, a moving spotlight, and a small chrome sphere. Specifically, the problem we address is to estimate the normal map of the exterior surface of a given solid transparent object, from which the surface depth can be integrated. Our technical contribution lies in relating this normal reconstruction problem to one of graph-cut segmentation. Unlike conventional formulations, however, our graph is dual-layered, since we can see a transparent object's foreground as well as the background behind it. Quantitative and qualitative evaluation are performed to verify the efficacy of this practical solution.

Yang, Xingwei; Latecki, Longin Jan; , “Affinity learning on a tensor product graph with applications to shape and image retrieval,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2369-2376, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995325
Abstract: As observed in several recent publications, improved retrieval performance is achieved when pairwise similarities between the query and the database objects are replaced with more global affinities that also consider the relation among the database objects. This is commonly achieved by propagating the similarity information in a weighted graph representing the database and query objects. Instead of propagating the similarity information on the original graph, we propose to utilize the tensor product graph (TPG) obtained by the tensor product of the original graph with itself. By virtue of this construction, not only local but also long range similarities among graph nodes are explicitly represented as higher order relations, making it possible to better reveal the intrinsic structure of the data manifold. In addition, we improve the local neighborhood structure of the original graph in a preprocessing stage. We illustrate the benefits of the proposed approach on shape and image ranking and retrieval tasks. We are able to achieve the bull's eye retrieval score of 99.99% on the MPEG-7 shape dataset, which is much higher than that of state-of-the-art algorithms.
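The TPG construction above can be illustrated with a toy sketch. The code below is not the authors' algorithm; it only shows, on a made-up 3-object similarity matrix, how the tensor (Kronecker) product of a graph with itself turns pairwise similarities into higher-order relations, how diffusion on that product graph can be summed in closed form when the spectral radius is below 1, and one simple (assumed) way to read the result back to an n×n affinity matrix.

```python
import numpy as np

# Toy similarity matrix over 3 database objects (symmetric, values in [0, 1]).
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

# Tensor (Kronecker) product graph: each node is a *pair* of original nodes,
# so its edges encode second-order relations between similarities.
G = np.kron(S, S)

# Diffuse similarity on the TPG: scale so the spectral radius is < 1, then
# sum the geometric series  I + G + G^2 + ... = (I - G)^(-1)  in closed form.
G = G / (np.max(np.abs(np.linalg.eigvals(G))) + 1.0)
A_star = np.linalg.inv(np.eye(G.shape[0]) - G)

# Map the n^2 x n^2 diffused affinities back to n x n by tracing out the
# paired dimension (one simple readout choice, not necessarily the paper's).
n = S.shape[0]
A = A_star.reshape(n, n, n, n).trace(axis1=1, axis2=3)
print(np.round(A, 2))
```

The learned affinities remain symmetric, and the strongly similar pair (objects 0 and 1) keeps a higher affinity than the weakly similar pair (objects 0 and 2), which is the ranking behavior the abstract describes.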

Kennedy, Ryan; Gallier, Jean; Shi, Jianbo; , “Contour cut: Identifying salient contours in images by solving a Hermitian eigenvalue problem,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2065-2072, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995739
Abstract: The problem of finding one-dimensional structures in images and videos can be formulated as a problem of searching for cycles in graphs. In [11], an untangling-cycle cost function was proposed for identifying persistent cycles in a weighted graph, corresponding to salient contours in an image. We have analyzed their method and give two significant improvements. First, we generalize their cost function to a contour cut criterion and give a computational solution by solving a family of Hermitian eigenvalue problems. Second, we use the idea of a graph circulation, which ensures that each node has a balanced in- and out-flow and permits a natural random-walk interpretation of our cost function. We show that our method finds far more accurate contours in images than [11]. Furthermore, we show that our method is robust to graph compression which allows us to accelerate the computation without loss of accuracy.
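The circulation idea mentioned in the abstract is simple to state: a flow on a directed graph is a circulation when the in-flow equals the out-flow at every node. The minimal sketch below (not the paper's method) just constructs a flow that traces a 4-node cycle and verifies this conservation property, which is what underlies the random-walk interpretation.

```python
import numpy as np

# A toy directed graph on 4 nodes, with F[u, v] holding the flow on edge u->v.
# This flow simply traces the cycle 0 -> 1 -> 2 -> 3 -> 0 with unit strength.
F = np.zeros((4, 4))
for i in range(4):
    F[i, (i + 1) % 4] = 1.0

# A circulation requires flow conservation at every node: in-flow == out-flow.
out_flow = F.sum(axis=1)
in_flow = F.sum(axis=0)
print(np.allclose(in_flow, out_flow))
```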

Maurice, Xavier; Graebling, Pierre; Doignon, Christophe; , “A pattern framework driven by the Hamming distance for structured light-based reconstruction with a single image,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.2497-2504, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995490
Abstract: Structured light based patterns provide a means to capture the state of an object shape. However, this may be inefficient when the object is freely moving, when its surface contains high-curvature parts, or in out-of-depth-of-field situations. For image-based robotic guidance in unstructured and dynamic environments, only one shot is required for capturing the shape of a moving region-of-interest. Robust patterns and real-time capabilities must therefore be targeted. To this end, we have developed a novel technique for the generation of coded patterns directly driven by the Hamming distance. The drawback is the large number of codes that the coding/decoding algorithms must handle when a high Hamming distance is desired. We show that the mean Hamming distance is a useful criterion for driving the pattern generation process, and we give a way to predict its value. Furthermore, to ensure local uniqueness of codewords with consideration of many incomplete ones, the Perfect Map theory is involved. We then describe a pseudorandom/exhaustive algorithm to build patterns with more than 200×200 features in a very short time, thanks to a splitting strategy which performs the Hamming tests in the codeword space instead of the pattern array. This leads to a significant reduction of the computational complexity, and it may be applied to other purposes. Finally, real-time reconstructions from single images are reported, and the results are compared to the best known methods, which are outperformed in many cases.
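The two quantities this abstract is built around, the minimum and the mean pairwise Hamming distance of a codeword set, can be sketched in a few lines. The codebook below is made up for illustration; it stands in for the coded features of a structured-light pattern.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def min_and_mean_hamming(codewords):
    """Minimum and mean pairwise Hamming distance over a codeword set."""
    dists = [hamming(a, b) for a, b in combinations(codewords, 2)]
    return min(dists), sum(dists) / len(dists)

# Toy codebook: 3-symbol codewords over a small alphabet, standing in for
# the coded features of a structured-light pattern (codes here are made up).
codes = ["012", "120", "201", "111"]
h_min, h_mean = min_and_mean_hamming(codes)
print(h_min, h_mean)  # -> 2 2.5
```

A pattern generator in the spirit of the paper would reject candidate codewords that pull `h_min` below the desired Hamming distance; performing such tests in codeword space rather than over the full pattern array is the splitting strategy the abstract credits for the speedup.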

Tang, Huixuan; Joshi, Neel; Kapoor, Ashish; , “Learning a blind measure of perceptual image quality,” Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on , vol., no., pp.305-312, 20-25 June 2011
doi: 10.1109/CVPR.2011.5995446
Abstract: It is often desirable to evaluate an image based on its quality. For many computer vision applications, a perceptually meaningful measure is the most relevant for evaluation; however, most commonly used measures do not map well to human judgements of image quality. A further complication of many existing image measures is that they require a reference image, which is often not available in practice. In this paper, we present a “blind” image quality measure, where potentially neither the ground-truth image nor the degradation process is known. Our method uses a set of novel low-level image features in a machine learning framework to learn a mapping from these features to subjective image quality scores. The image quality features stem from natural image and texture statistics. Experiments on a standard image quality benchmark dataset show that our method outperforms the current state of the art.
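The core learning step described above, regressing from per-image feature vectors to subjective scores, can be sketched generically. The paper's actual features and learner are not reproduced here; the code below substitutes synthetic features and a plain ridge-regularized linear regressor just to show the reference-free train/predict structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 "images", 5 low-level features each, and
# subjective quality scores generated from a hidden linear rule + noise.
X = rng.normal(size=(100, 5))
w_true = np.array([0.5, -1.0, 0.3, 0.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Fit a ridge-regularized linear map from features to quality scores.
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Predict the quality of an unseen image from its features alone --
# no reference image is needed at test time, which is the "blind" property.
x_new = rng.normal(size=5)
print(float(x_new @ w))
```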

