About #今日arXiv精选
This is a column under 「AI 学术前沿」 in which the editors select high-quality papers from arXiv every day and deliver them to readers.
DESYR: Definition and Syntactic Representation Based Claim Detection on the Web
Comment: 10 pages, Accepted at CIKM 2021
Link: http://arxiv.org/abs/2108.08759
Abstract
The formulation of a claim rests at the core of argument mining. To demarcate between a claim and a non-claim is arduous for both humans and machines, owing to latent linguistic variance between the two and the inadequacy of extensive definition-based formalization. Furthermore, the increase in the usage of online social media has resulted in an explosion of unsolicited information on the web presented as informal text. To account for the aforementioned, in this paper, we propose DESYR, a framework that aims to resolve these issues for informal web-based text by leveraging a combination of hierarchical representation learning (dependency-inspired Poincaré embedding), definition-based alignment, and feature projection. We do away with fine-tuning compute-heavy language models in favor of a lighter, more domain-centric approach. Experimental results indicate that DESYR improves upon the state-of-the-art systems across four benchmark claim datasets, most of which were constructed from informal texts. We see an increase of 3 claim-F1 points on the LESA-Twitter dataset, an increase of 1 claim-F1 point and 9 macro-F1 points on the Online Comments (OC) dataset, an increase of 24 claim-F1 points and 17 macro-F1 points on the Web Discourse (WD) dataset, and an increase of 8 claim-F1 points and 5 macro-F1 points on the Micro Texts (MT) dataset. We also perform an extensive analysis of the results. We make a 100-D pre-trained version of our Poincaré variant available along with the source code.
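To make the hyperbolic-embedding ingredient concrete, here is a minimal sketch of the Poincaré-ball distance that hierarchical embeddings such as DESYR's dependency-inspired variant rely on. The embedding table and indices below are illustrative placeholders, not the authors' released 100-D vectors.

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Geodesic distance between two points inside the unit Poincaré ball."""
    sq_u = u.pow(2).sum(-1).clamp(max=1 - eps)        # ||u||^2 < 1
    sq_v = v.pow(2).sum(-1).clamp(max=1 - eps)        # ||v||^2 < 1
    sq_diff = (u - v).pow(2).sum(-1)
    x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
    return torch.acosh(x.clamp(min=1 + eps))

# Toy usage: embeddings must stay strictly inside the unit ball.
emb = torch.nn.Embedding(1000, 100)
with torch.no_grad():
    emb.weight.mul_(1e-3)                             # initialize near the origin
ids = torch.tensor([3, 17])
print(float(poincare_distance(emb(ids)[0], emb(ids)[1])))
```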
Fine-Grained Element Identification in Complaint Text of Internet Fraud
Comment: 5 pages, 5 figures, 3 tables, accepted as a short paper to CIKM 2021
Link: http://arxiv.org/abs/2108.08676
Abstract
Existing systems dealing with online complaints provide a final decision without explanations. We propose to analyse the complaint text of internet fraud in a fine-grained manner. Considering that the complaint text includes multiple clauses with various functions, we propose to identify the role of each clause and classify them into different types of fraud element. We construct a large labeled dataset originating from a real finance service platform. We build an element identification model on top of BERT and propose two additional modules that utilize the context of the complaint text for better element label classification, namely, a global context encoder and a label refiner. Experimental results show the effectiveness of our model.
Language Model Augmented Relevance Score
Comment: In ACL 2021
Link: http://arxiv.org/abs/2108.08485
Abstract
Although automated metrics are commonly used to evaluate NLG systems, they often correlate poorly with human judgements. Newer metrics such as BERTScore have addressed many weaknesses in prior metrics such as BLEU and ROUGE, which rely on n-gram matching. These newer methods, however, are still limited in that they do not consider the generation context, so they cannot properly reward generated text that is correct but deviates from the given reference. In this paper, we propose the Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation. MARS leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references, which are then used as additional references to score generated text. Compared with seven existing metrics in three common NLG tasks, MARS not only achieves higher correlation with human reference judgements, but also differentiates well-formed candidates from adversarial samples to a larger degree.
QUEACO: Borrowing Treasures from Weakly-labeled Behavior Data for Query Attribute Value Extraction
Comment: The 30th ACM International Conference on Information and Knowledge Management (CIKM 2021, Applied Research Track)
Link: http://arxiv.org/abs/2108.08468
Abstract
We study the problem of query attribute value extraction, which aims to identify named entities from user queries as diverse surface-form attribute values and afterward transform them into formally canonical forms. Such a problem consists of two phases: named entity recognition (NER) and attribute value normalization (AVN). However, existing works only focus on the NER phase and neglect the equally important AVN. To bridge this gap, this paper proposes QUEACO, a unified query attribute value extraction system for e-commerce search that covers both phases. Moreover, by leveraging large-scale weakly-labeled behavior data, we further improve the extraction performance with less supervision cost. Specifically, for the NER phase, QUEACO adopts a novel teacher-student network, where a teacher network trained on the strongly-labeled data generates pseudo-labels to refine the weakly-labeled data for training a student network. Meanwhile, the teacher network can be dynamically adapted by the feedback of the student's performance on strongly-labeled data to maximally denoise the noisy supervision from the weak labels. For the AVN phase, we also leverage the weakly-labeled query-to-attribute behavior data to normalize surface-form attribute values from queries into canonical forms from products. Extensive experiments on a real-world large-scale e-commerce dataset demonstrate the effectiveness of QUEACO.
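The NER-phase teacher-student idea can be sketched as follows: a teacher trained on strongly-labeled queries re-labels weakly-labeled behavior data where it is confident, and the student trains on the refined tags. The meta-feedback that adapts the teacher from the student's performance is omitted, and the toy tagger and data shapes are assumptions, not QUEACO's actual models.

```python
import torch
import torch.nn.functional as F

class TaggerStub(torch.nn.Module):
    """Toy token tagger (embedding + linear), a stand-in for a BERT-based NER model."""
    def __init__(self, vocab=1000, hidden=32, num_tags=9):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, hidden)
        self.out = torch.nn.Linear(hidden, num_tags)

    def forward(self, tokens):                 # tokens: (B, T) int ids
        return self.out(self.emb(tokens))      # (B, T, num_tags)

def refine_weak_batch(teacher, tokens, weak_tags, threshold=0.9):
    """Replace weak tags with teacher predictions where the teacher is confident."""
    with torch.no_grad():
        probs = teacher(tokens).softmax(-1)
    conf, pseudo = probs.max(-1)
    return torch.where(conf > threshold, pseudo, weak_tags)

teacher, student = TaggerStub(), TaggerStub()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
tokens = torch.randint(0, 1000, (4, 12))
weak_tags = torch.randint(0, 9, (4, 12))       # noisy tags mined from behavior data
refined = refine_weak_batch(teacher, tokens, weak_tags)
loss = F.cross_entropy(student(tokens).view(-1, 9), refined.view(-1))
opt.zero_grad(); loss.backward(); opt.step()
```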
Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models
Comment: Accepted by Interspeech2021
Link: http://arxiv.org/abs/2108.08451
Abstract
Spoken Language Understanding (SLU) is an essential step in building a dialogue system. Due to the expensive cost of obtaining labeled data, SLU suffers from data scarcity. Therefore, in this paper, we focus on data augmentation for the slot filling task in SLU. To achieve that, we aim at generating more diverse data based on existing data. Specifically, we try to exploit the latent language knowledge in pretrained language models by finetuning them. We propose two strategies for the finetuning process: value-based and context-based augmentation. Experimental results on two public SLU datasets show that, compared with existing data augmentation methods, our proposed method can generate more diverse sentences and significantly improve performance on SLU.
Graph-to-3D: End-to-End Generation and Manipulation of 3D Scenes Using Scene Graphs
Comment: accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.08841
Abstract
Controllable scene synthesis consists of generating 3D information that satisfies underlying specifications. These specifications should be abstract, i.e., allowing easy user interaction, whilst providing enough interface for detailed control. Scene graphs are representations of a scene composed of objects (nodes) and inter-object relationships (edges), and have proven particularly suited for this task, as they allow for semantic control over the generated content. Previous works tackling this task often rely on synthetic data and retrieve object meshes, which naturally limits the generation capabilities. To circumvent this issue, we instead propose the first work that directly generates shapes from a scene graph in an end-to-end manner. In addition, we show that the same model supports scene modification, using the respective scene graph as an interface. Leveraging Graph Convolutional Networks (GCN), we train a variational Auto-Encoder on top of the object and edge categories, as well as 3D shapes and scene layouts, allowing later sampling of new scenes and shapes.
PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers
Comment: Accepted to ICCV 2021 (Oral Presentation)
Link: http://arxiv.org/abs/2108.08839
Abstract
Point clouds captured in real-world applications are often incomplete due to limited sensor resolution, single viewpoints, and occlusion. Therefore, recovering complete point clouds from partial ones becomes an indispensable task in many practical applications. In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the point cloud into a sequence of point proxies and employ transformers for point cloud generation. To help the transformers better leverage the inductive bias about the 3D geometric structure of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Furthermore, we propose two more challenging benchmarks with more diverse incomplete point clouds that better reflect real-world scenarios to promote future research. Experimental results show that our method outperforms state-of-the-art methods by a large margin on both the new benchmarks and the existing ones. Code is available at https://github.com/yuxumin/PoinTr.
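A rough sketch of the "point proxy" idea: group an unordered partial cloud into local patches, embed each patch with a small point-wise MLP plus a positional embedding of its center, and feed the resulting sequence to a transformer. Grouping around random centers (instead of farthest-point sampling) and a vanilla nn.TransformerEncoder are simplifications, not the PoinTr implementation.

```python
import torch
import torch.nn as nn

def make_point_proxies(xyz, num_groups=32, k=16):
    """xyz: (B, N, 3) -> centers (B, G, 3), center-normalized patches (B, G, k, 3)."""
    B, N, _ = xyz.shape
    idx = torch.randint(0, N, (B, num_groups), device=xyz.device)      # random centers
    centers = torch.gather(xyz, 1, idx[..., None].expand(-1, -1, 3))
    dist = torch.cdist(centers, xyz)                                   # (B, G, N)
    knn = dist.topk(k, largest=False).indices                          # (B, G, k)
    patches = torch.gather(
        xyz[:, None].expand(-1, num_groups, -1, -1), 2,
        knn[..., None].expand(-1, -1, -1, 3))
    return centers, patches - centers[..., None, :]

class ProxyEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pos = nn.Linear(3, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, xyz):
        centers, patches = make_point_proxies(xyz)
        tokens = self.feat(patches).max(dim=2).values + self.pos(centers)  # per-patch pooling
        return self.encoder(tokens)                                        # (B, G, dim)

print(ProxyEncoder()(torch.rand(2, 2048, 3)).shape)   # torch.Size([2, 32, 128])
```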
Fine-grained Semantics-aware Representation Enhancement for Self-supervised Monocular Depth Estimation
Comment: ICCV 2021 (Oral)
Link: http://arxiv.org/abs/2108.08829
Abstract
Self-supervised monocular depth estimation has been widely studied, owing to its practical importance and recent promising improvements. However, most works suffer from limited supervision of photometric consistency, especially in weak texture regions and at object boundaries. To overcome this weakness, we propose novel ideas to improve self-supervised monocular depth estimation by leveraging cross-domain information, especially scene semantics. We focus on incorporating implicit semantic knowledge into geometric representation enhancement and suggest two ideas: a metric learning approach that exploits the semantics-guided local geometry to optimize intermediate depth representations, and a novel feature fusion module that judiciously utilizes cross-modality between two heterogeneous feature representations. We comprehensively evaluate our methods on the KITTI dataset and demonstrate that our method outperforms state-of-the-art methods. The source code is available at https://github.com/hyBlue/FSRE-Depth.
Towards Vivid and Diverse Image Colorization with Generative Color Prior
Comment: ICCV 2021
Link: http://arxiv.org/abs/2108.08826
Abstract
Colorization has attracted increasing interest in recent years. Classic reference-based methods usually rely on external color images for plausible results, and a large image database or online search engine is inevitably required to retrieve such exemplars. Recent deep-learning-based methods can automatically colorize images at a low cost, but unsatisfactory artifacts and incoherent colors often accompany the results. In this work, we aim at recovering vivid colors by leveraging the rich and diverse color priors encapsulated in a pretrained Generative Adversarial Network (GAN). Specifically, we first "retrieve" matched features (similar to exemplars) via a GAN encoder and then incorporate these features into the colorization process with feature modulations. Thanks to the powerful generative color prior and delicate designs, our method can produce vivid colors with a single forward pass. Moreover, it is highly convenient to obtain diverse results by modifying GAN latent codes. Our method also inherits the merit of interpretable controls of GANs and can attain controllable and smooth transitions by walking through the GAN latent space. Extensive experiments and user studies demonstrate that our method achieves superior performance to previous works.
Click to Move: Controlling Video Generation with Sparse Motion
Comment: Accepted by International Conference on Computer Vision (ICCV 2021)
Link: http://arxiv.org/abs/2108.08815
Abstract
This paper introduces Click to Move (C2M), a novel framework for video generation where the user can control the motion of the synthesized video through mouse clicks specifying simple object trajectories of the key objects in the scene. Our model receives as input an initial frame, its corresponding segmentation map, and the sparse motion vectors encoding the input provided by the user. It outputs a plausible video sequence starting from the given frame and with a motion that is consistent with the user input. Notably, our proposed deep architecture incorporates a Graph Convolution Network (GCN) modelling the movements of all the objects in the scene in a holistic manner and effectively combining the sparse user motion information and image features. Experimental results show that C2M outperforms existing methods on two publicly available datasets, thus demonstrating the effectiveness of our GCN framework at modelling object interactions. The source code is publicly available at https://github.com/PierfrancescoArdino/C2M.
Causal Attention for Unbiased Visual Recognition
Comment: Accepted by ICCV 2021
Link: http://arxiv.org/abs/2108.08782
Abstract
Attention modules do not always help deep models learn causal features that are robust in any confounding context, e.g., a foreground object feature that is invariant to different backgrounds. This is because the confounders trick the attention into capturing spurious correlations that benefit the prediction when the training and testing data are IID (independent and identically distributed), but harm the prediction when the data are OOD (out-of-distribution). The only fundamental solution for learning causal attention is causal intervention, which requires additional annotations of the confounders, e.g., a "dog" model is learned within "grass+dog" and "road+dog" contexts respectively, so the "grass" and "road" contexts will no longer confound the "dog" recognition. However, such annotation is not only prohibitively expensive, but also inherently problematic, as the confounders are elusive in nature. In this paper, we propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion. In particular, multiple CaaMs can be stacked and integrated into conventional attention CNNs and self-attention Vision Transformers. In OOD settings, deep models with CaaM outperform those without it significantly; even in IID settings, the attention localization is also improved by CaaM, showing great potential in applications that require robust visual saliency. Codes are available at https://github.com/Wangt-CN/CaaM.
Learning to Match Features with Seeded Graph Matching Network
Comment: Accepted by ICCV2021, code to be released at https://github.com/vdvchen/SGMNet
Link: http://arxiv.org/abs/2108.08771
Abstract
Matching local features across images is a fundamental problem in computer vision. Targeting high accuracy and efficiency, we propose the Seeded Graph Matching Network, a graph neural network with a sparse structure that reduces redundant connectivity and learns compact representations. The network consists of: 1) a Seeding Module, which initializes the matching by generating a small set of reliable matches as seeds; and 2) a Seeded Graph Neural Network, which utilizes seed matches to pass messages within and across images and predicts assignment costs. Three novel operations are proposed as basic elements for message passing: 1) Attentional Pooling, which aggregates keypoint features within the image to seed matches; 2) Seed Filtering, which enhances seed features and exchanges messages across images; and 3) Attentional Unpooling, which propagates seed features back to the original keypoints. Experiments show that our method reduces computational and memory complexity significantly compared with typical attention-based networks while achieving competitive or higher performance.
Category-Level 6D Object Pose Estimation via Cascaded Relation and Recurrent Reconstruction Networks
Comment: accepted by IROS2021
Link: http://arxiv.org/abs/2108.08755
Abstract
Category-level 6D pose estimation, aiming to predict the location and orientation of unseen object instances, is fundamental to many scenarios such as robotic manipulation and augmented reality, yet it remains unsolved. Precisely recovering the instance 3D model in the canonical space and accurately matching it with the observation are essential for estimating the 6D pose of unseen objects. In this paper, we achieve accurate category-level 6D pose estimation via cascaded relation and recurrent reconstruction networks. Specifically, a novel cascaded relation network is dedicated to advanced representation learning, exploring the complex and informative relations among the instance RGB image, the instance point cloud, and the category shape prior. Furthermore, we design a recurrent reconstruction network for iterative residual refinement to progressively improve the reconstruction and correspondence estimations from coarse to fine. Finally, the instance 6D pose is obtained by leveraging the estimated dense correspondences between the instance point cloud and the reconstructed 3D model in the canonical space. We have conducted extensive experiments on two well-acknowledged benchmarks of category-level 6D pose estimation, with significant performance improvement over existing approaches. On the strict evaluation metrics of $3D_{75}$ and $5^{\circ}\,2\,\mathrm{cm}$, our method exceeds the latest state-of-the-art SPD by 4.9% and 17.7% on the CAMERA25 dataset, and by 2.7% and 8.5% on the REAL275 dataset. Codes are available at https://wangjiaze.cn/projects/6DPoseEstimation.html.
Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification
Comment: Accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.08728
Abstract
Attention mechanisms have demonstrated great potential in fine-grained visual recognition tasks. In this paper, we present a counterfactual attention learning method to learn more effective attention based on causal inference. Unlike most existing methods that learn visual attention based on conventional likelihood, we propose to learn the attention with counterfactual causality, which provides a tool to measure the attention quality and a powerful supervisory signal to guide the learning process. Specifically, we analyze the effect of the learned visual attention on network prediction through counterfactual intervention and maximize the effect to encourage the network to learn more useful attention for fine-grained image recognition. Empirically, we evaluate our method on a wide range of fine-grained recognition tasks where attention plays a crucial role, including fine-grained image categorization, person re-identification, and vehicle re-identification. The consistent improvement on all benchmarks demonstrates the effectiveness of our method. Code is available at https://github.com/raoyongming/CAL.
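A minimal sketch of the counterfactual-attention idea: compare the prediction made with the learned attention map against the prediction made with a random ("counterfactual") attention map, and push the resulting effect to be discriminative with an extra classification term. The pooling, classifier, and shapes are toy stand-ins, not the released CAL code.

```python
import torch
import torch.nn.functional as F

def counterfactual_loss(features, attention, classifier, labels):
    """features: (B, C, H, W); attention: (B, 1, H, W) in [0, 1]."""
    pooled_fact = (features * attention).flatten(2).mean(-1)     # attended pooling
    rand_att = torch.rand_like(attention)                        # counterfactual attention
    pooled_cf = (features * rand_att).flatten(2).mean(-1)
    effect = classifier(pooled_fact) - classifier(pooled_cf)     # Y(A) - Y(A_random)
    main = F.cross_entropy(classifier(pooled_fact), labels)
    causal = F.cross_entropy(effect, labels)                     # maximize the causal effect
    return main + causal

# Toy usage with assumed shapes.
cls = torch.nn.Linear(64, 10)
loss = counterfactual_loss(torch.rand(4, 64, 7, 7), torch.rand(4, 1, 7, 7),
                           cls, torch.randint(0, 10, (4,)))
loss.backward()
```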
How to cheat with metrics in single-image HDR reconstruction
Comment: ICCV 2021 workshop on Learning for Computational Imaging (LCI)
Link: http://arxiv.org/abs/2108.08713
Abstract
Single-image high dynamic range (SI-HDR) reconstruction has recently emerged as a problem well-suited for deep learning methods. Each successive technique demonstrates an improvement over existing methods by reporting higher image quality scores. This paper, however, highlights that such improvements in objective metrics do not necessarily translate to visually superior images. The first problem is the use of disparate evaluation conditions in terms of data and metric parameters, calling for a standardized protocol to make it possible to compare between papers. The second problem, which forms the main focus of this paper, is the inherent difficulty in evaluating SI-HDR reconstructions since certain aspects of the reconstruction problem dominate objective differences, thereby introducing a bias. Here, we reproduce a typical evaluation using existing as well as simulated SI-HDR methods to demonstrate how different aspects of the problem affect objective quality metrics. Surprisingly, we found that methods that do not even reconstruct HDR information can compete with state-of-the-art deep learning methods. We show how such results are not representative of the perceived quality and that SI-HDR reconstruction needs better evaluation protocols.
Real-time Image Enhancer via Learnable Spatial-aware 3D Lookup Tables
Comment: Accepted to ICCV2021
Link: http://arxiv.org/abs/2108.08697
Abstract
Recently, deep learning-based image enhancement algorithms have achieved state-of-the-art (SOTA) performance on several publicly available datasets. However, most existing methods fail to meet practical requirements either for visual perception or for computation efficiency, especially for high-resolution images. In this paper, we propose a novel real-time image enhancer via learnable spatial-aware 3-dimensional lookup tables (3D LUTs), which considers both the global scenario and local spatial information. Specifically, we introduce a lightweight two-head weight predictor with two outputs: one is a 1D weight vector used for image-level scenario adaptation, the other is a 3D weight map aimed at pixel-wise category fusion. We learn the spatial-aware 3D LUTs and fuse them according to the aforementioned weights in an end-to-end manner. The fused LUT is then used to transform the source image into the target tone in an efficient way. Extensive results show that our model outperforms SOTA image enhancement methods on public datasets both subjectively and objectively, and that our model takes only about 4 ms to process a 4K-resolution image on one NVIDIA V100 GPU.
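As a sketch of the LUT side of this pipeline, the snippet below fuses a few basis 3D LUTs with scalar weights and applies the result with trilinear interpolation via torch.nn.functional.grid_sample. The fixed weights and the identity-plus-noise basis LUTs stand in for the paper's learned weight predictor and learned LUTs.

```python
import torch
import torch.nn.functional as F

def identity_lut(size=17):
    """Returns a (3, S, S, S) LUT indexed as lut[channel, b, g, r]."""
    coords = torch.linspace(0, 1, size)
    b, g, r = torch.meshgrid(coords, coords, coords, indexing="ij")
    return torch.stack([r, g, b], dim=0)           # identity mapping

def apply_lut(image, lut):
    """image: (B, 3, H, W) in [0, 1]; lut: (3, S, S, S)."""
    B, _, H, W = image.shape
    # grid_sample expects sampling coords as (x, y, z) = (r, g, b) in [-1, 1]
    grid = image.permute(0, 2, 3, 1)[:, None] * 2 - 1          # (B, 1, H, W, 3)
    lut = lut[None].expand(B, -1, -1, -1, -1)                  # (B, 3, S, S, S)
    out = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)
    return out[:, :, 0]                                        # (B, 3, H, W)

# Fuse basis LUTs with scalar weights (stand-ins for the predictor output).
basis = [identity_lut(), identity_lut() + 0.05 * torch.randn(3, 17, 17, 17)]
weights = [0.7, 0.3]
fused = sum(w * l for w, l in zip(weights, basis)).clamp(0, 1)
print(apply_lut(torch.rand(2, 3, 64, 64), fused).shape)   # torch.Size([2, 3, 64, 64])
```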
3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces
Comment: Published at ICCV 2021
Link: http://arxiv.org/abs/2108.08653
Abstract
3D shape representation has substantial effects on 3D shape reconstruction. Primitive-based representations approximate a 3D shape mainly by a set of simple implicit primitives, but the low geometrical complexity of the primitives limits the shape resolution. Moreover, setting a sufficient number of primitives for an arbitrary shape is challenging. To overcome these issues, we propose a constrained implicit algebraic surface as the primitive, with few learnable coefficients and higher geometrical complexity, together with a deep neural network that produces these primitives. Our experiments demonstrate the superiority of our method in terms of representation power compared to the state-of-the-art methods in single RGB image 3D shape reconstruction. Furthermore, we show that our method can semantically learn segments of 3D shapes in an unsupervised manner. The code is publicly available at https://myavartanoo.github.io/3dias/.
Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition
Comment: ACM MM Oral paper
Link: http://arxiv.org/abs/2108.08633
Abstract
For a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video. With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies. Capturing the position changes of humans and objects over the spatio-temporal dimension is especially critical when their appearance features do not change significantly over time. The full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode the videos with a graph composed of human and object nodes. These nodes are connected by two types of relations: (i) spatial relations modeling the interactions between a human and the interacted objects within each frame, and (ii) inter-time relations capturing the long-range dependencies between a human and the interacted objects across frames. With the graph, STIGPN learns spatio-temporal features directly from the whole video-based Human-Object Interaction scene. Multi-modal features and a multi-stream fusion strategy are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction video datasets, CAD-120 and Something-Else, are used to evaluate the proposed architecture, and the state-of-the-art performance demonstrates the superiority of STIGPN.
VolumeFusion: Deep Depth Fusion for 3D Scene Reconstruction
Comment: ICCV 2021 Accepted
Link: http://arxiv.org/abs/2108.08623
Abstract
To reconstruct a 3D scene from a set of calibrated views, traditional multi-view stereo techniques rely on two distinct stages: local depth map computation and global depth map fusion. Recent studies concentrate on deep neural architectures for depth estimation, either using a conventional depth fusion method or regressing a Truncated Signed Distance Function (TSDF) with a direct 3D reconstruction network. In this paper, we advocate that replicating the traditional two-stage framework with deep neural networks improves both the interpretability and the accuracy of the results. As mentioned, our network operates in two steps: 1) the local computation of depth maps with a deep MVS technique, and 2) the fusion of depth maps and image features to build a single TSDF volume. In order to improve the matching performance between images acquired from very different viewpoints (e.g., large baselines and rotations), we introduce a rotation-invariant 3D convolution kernel called PosedConv. The effectiveness of the proposed architecture is underlined via a large series of experiments conducted on the ScanNet dataset, where our approach compares favorably against both traditional and deep learning techniques.
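For context, here is a numpy sketch of the classical operation that the paper's second stage learns to replace: fusing per-view depth maps into a single TSDF volume by projecting each voxel into the cameras and averaging truncated signed distances. The intrinsics, pose, grid origin, and resolution below are illustrative assumptions.

```python
import numpy as np

def fuse_tsdf(depth_maps, intrinsics, cam_from_world, origin, voxel_size, dims, trunc=0.08):
    """Fuse per-view depth maps into a TSDF volume of shape `dims` = (X, Y, Z)."""
    n_vox = int(np.prod(dims))
    tsdf = np.ones(n_vox, dtype=np.float32)
    weight = np.zeros(n_vox, dtype=np.float32)
    grid = np.stack(np.meshgrid(*[np.arange(d) for d in dims], indexing="ij"), -1).reshape(-1, 3)
    world = origin + voxel_size * grid                                  # (V, 3) voxel centers
    world_h = np.concatenate([world, np.ones((n_vox, 1))], -1)          # homogeneous coords

    for depth, K, T in zip(depth_maps, intrinsics, cam_from_world):
        cam = (T @ world_h.T).T[:, :3]                                  # world -> camera frame
        z = cam[:, 2]
        uv = (K @ cam.T).T / np.maximum(z[:, None], 1e-6)               # perspective projection
        u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
        H, W = depth.shape
        valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        d = np.where(valid, depth[v.clip(0, H - 1), u.clip(0, W - 1)], 0.0)
        sdf = np.clip((d - z) / trunc, -1.0, 1.0)                       # truncated signed distance
        upd = valid & (d > 0) & (sdf > -1)                              # skip far-occluded voxels
        tsdf[upd] = (tsdf[upd] * weight[upd] + sdf[upd]) / (weight[upd] + 1)
        weight[upd] += 1
    return tsdf.reshape(dims)

# Toy usage with one synthetic view (assumed intrinsics and identity pose).
K = np.array([[100.0, 0, 64], [0, 100.0, 48], [0, 0, 1]])
vol = fuse_tsdf([np.full((96, 128), 1.0)], [K], [np.eye(4)],
                origin=np.array([-0.5, -0.5, 0.5]), voxel_size=0.02, dims=(50, 50, 50))
print(vol.shape)
```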
Spatially-Adaptive Image Restoration using Distortion-Guided Networks
Comment: Accepted at ICCV 2021
Link: http://arxiv.org/abs/2108.08617
Abstract
We present a general learning-based solution for restoring images suffering from spatially-varying degradations. Prior approaches are typically degradation-specific and employ the same processing across different images and different pixels within them. However, we hypothesize that such spatially rigid processing is suboptimal for simultaneously restoring the degraded pixels and reconstructing the clean regions of the image. To overcome this limitation, we propose SPAIR, a network design that harnesses distortion-localization information and dynamically adjusts computation to difficult regions in the image. SPAIR comprises two components: (1) a localization network that identifies degraded pixels, and (2) a restoration network that exploits knowledge from the localization network in the filter and feature domains to selectively and adaptively restore degraded pixels. Our key idea is to exploit the non-uniformity of heavy degradations in the spatial domain and suitably embed this knowledge within distortion-guided modules performing sparse normalization, feature extraction, and attention. Our architecture is agnostic to the physical formation model and generalizes across several types of spatially-varying degradations. We demonstrate the efficacy of SPAIR individually on four restoration tasks: removal of rain streaks, raindrops, shadows, and motion blur. Extensive qualitative and quantitative comparisons with prior art on 11 benchmark datasets demonstrate that our degradation-agnostic network design offers significant performance gains over state-of-the-art degradation-specific architectures. Code is available at https://github.com/human-analysis/spatially-adaptive-image-restoration.
Feature Stylization and Domain-aware Contrastive Learning for Domain Generalization
Comment: Accepted to ACM MM 2021 (oral)
Link: http://arxiv.org/abs/2108.08596
Abstract
Domain generalization aims to enhance model robustness against domain shift without accessing the target domain. Since the available source domains for training are limited, recent approaches focus on generating samples of novel domains. Nevertheless, they either struggle with the optimization problem when synthesizing abundant domains or cause the distortion of class semantics. To this end, we propose a novel domain generalization framework where feature statistics are utilized to stylize original features into ones with novel domain properties. To preserve class information during stylization, we first decompose features into high- and low-frequency components. Afterward, we stylize the low-frequency components with novel domain styles sampled from the manipulated statistics, while preserving the shape cues in the high-frequency ones. As the final step, we re-merge both components to synthesize novel domain features. To enhance domain robustness, we utilize the stylized features to maintain model consistency in terms of features as well as outputs. We achieve the feature consistency with the proposed domain-aware supervised contrastive loss, which ensures domain invariance while increasing class discriminability. Experimental results demonstrate the effectiveness of the proposed feature stylization and the domain-aware contrastive loss. Through quantitative comparisons, we verify that our method leads existing state-of-the-art methods on two benchmarks, PACS and Office-Home.
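A rough sketch of the stylization step: split a feature map into low- and high-frequency parts with an FFT mask, re-normalize the low-frequency part with style statistics sampled around another sample's mean/std (an AdaIN-like swap), and merge the parts back. The circular mask and the noise-based statistics manipulation are simplified stand-ins for the paper's exact scheme.

```python
import torch

def stylize_low_freq(feat, style_feat, radius=0.25, noise=0.1):
    """feat, style_feat: (B, C, H, W) feature maps."""
    B, C, H, W = feat.shape
    f = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= radius).to(feat.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(f * low_mask, dim=(-2, -1))).real
    high = feat - low                                   # shape cues kept untouched

    # Sample a "novel domain" style around the other sample's channel statistics.
    mu_s = style_feat.mean(dim=(2, 3), keepdim=True) + noise * torch.randn(B, C, 1, 1)
    std_s = style_feat.std(dim=(2, 3), keepdim=True) * (1 + noise * torch.randn(B, C, 1, 1))

    mu = low.mean(dim=(2, 3), keepdim=True)
    std = low.std(dim=(2, 3), keepdim=True)
    low_stylized = (low - mu) / (std + 1e-6) * std_s + mu_s
    return low_stylized + high

out = stylize_low_freq(torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32))
print(out.shape)   # torch.Size([2, 64, 32, 32])
```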
3D Shapes Local Geometry Codes Learning with SDF
Comment: DLGC workshop in ICCV 2021
Link: http://arxiv.org/abs/2108.08593
Abstract
A signed distance function (SDF) as a 3D shape description is one of the most effective approaches to represent 3D geometry for rendering and reconstruction. Our work is inspired by the state-of-the-art method DeepSDF, which learns and analyzes the 3D shape as the iso-surface of its shell and has shown promising results, especially in the 3D shape reconstruction and compression domain. In this paper, we consider the degeneration of reconstruction that comes from the capacity decrease of the DeepSDF model, which approximates the SDF with a neural network and a single latent code. We propose Local Geometry Code Learning (LGCL), a model that improves the original DeepSDF results by learning from the local shape geometry of the full 3D shape. We add an extra graph neural network to split the single transmittable latent code into a set of local latent codes distributed over the 3D shape. These latent codes are used to approximate the SDF in their local regions, which alleviates the complexity of the approximation compared to the original DeepSDF. Furthermore, we introduce a new geometric loss function to facilitate the training of these local latent codes. Note that other local shape-adjusting methods use the 3D voxel representation, which is highly difficult or even intractable to work with. In contrast, our architecture is based on implicit graph processing and performs the learning regression directly in the latent code space, thus making the proposed architecture more flexible and simpler to realize. Our experiments on 3D shape reconstruction demonstrate that our LGCL method can retain more details with a significantly smaller SDF decoder and considerably outperforms the original DeepSDF method on the most important quantitative metrics.
Exploiting Scene Graphs for Human-Object Interaction Detection
Comment: Accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.08584
Abstract
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects. Existing works focus on the visual and linguistic features of humans and objects. However, they do not capitalise on the high-level and semantic relationships present in the image, which provide crucial contextual and detailed relational knowledge for HOI inference. We propose a novel method to exploit this information, through the scene graph, for the Human-Object Interaction (SG2HOI) detection task. Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhoods and transfer them into interactions. Empirical evaluation shows that our SG2HOI method outperforms the state-of-the-art methods on two benchmark HOI datasets: V-COCO and HICO-DET. Code will be available at https://github.com/ht014/SG2HOI.
StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation
Comment: Accepted by ICCV2021. Project is in https://github.com/SJTU-ViSYS/StructDepth
Link: http://arxiv.org/abs/2108.08574
Abstract
Self-supervised monocular depth estimation has achieved impressive performance on outdoor datasets. Its performance, however, degrades notably in indoor environments because of the lack of textures. Without rich textures, the photometric consistency is too weak to train a good depth network. Inspired by early works on indoor modeling, we leverage the structural regularities exhibited in indoor scenes to train a better depth network. Specifically, we adopt two extra supervisory signals for self-supervised training: 1) the Manhattan normal constraint and 2) the co-planar constraint. The Manhattan normal constraint enforces the major surfaces (the floor, ceiling, and walls) to be aligned with dominant directions. The co-planar constraint states that the 3D points should be well fitted by a plane if they are located within the same planar region. To generate the supervisory signals, we adopt two components that classify the major surface normals into dominant directions and detect the planar regions on the fly during training. As the predicted depth becomes more accurate after more training epochs, the supervisory signals also improve and in turn feed back to obtain a better depth model. Through extensive experiments on indoor benchmark datasets, the results show that our network outperforms the state-of-the-art methods. The source code is available at https://github.com/SJTU-ViSYS/StructDepth.
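A simplified sketch of the two extra supervisory signals: a Manhattan loss pulling normals of major surfaces toward the nearest dominant direction, and a co-planar loss fitting a plane to the points of a detected planar region and penalizing their distance to it. The normals, planar masks, and dominant directions are assumed to come from the on-the-fly classifiers mentioned in the abstract; the loss forms below are generic, not the released code.

```python
import torch

def manhattan_normal_loss(normals, dominant_dirs):
    """normals: (N, 3) unit vectors; dominant_dirs: (3, 3), one axis per row."""
    cos = normals @ dominant_dirs.t()                   # (N, 3) cosines to each axis
    return (1 - cos.abs().max(dim=1).values).mean()     # align with the closest axis

def coplanar_loss(points):
    """points: (N, 3) back-projected points of one detected planar region."""
    centered = points - points.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    normal = vh[-1]                                      # least-variance direction
    return (centered @ normal).abs().mean()              # mean point-to-plane distance

dirs = torch.eye(3)
print(manhattan_normal_loss(torch.nn.functional.normalize(torch.rand(100, 3), dim=1), dirs))
print(coplanar_loss(torch.rand(50, 3)))
```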
DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders
Comment: International Conference on Computer Vision 2021 (ICCV 2021), 8 pages, 4 figures, 4 tables, accepted for ICCV 2021 oral
Link: http://arxiv.org/abs/2108.08557
Abstract
Human Pose Estimation (HPE) aims at retrieving the 3D position of human joints from images or videos. We show that current 3D HPE methods suffer from a lack of viewpoint equivariance, namely they tend to fail or perform poorly when dealing with viewpoints unseen at training time. Deep learning methods often rely on either scale-invariant, translation-invariant, or rotation-invariant operations, such as max-pooling. However, the adoption of such procedures does not necessarily improve viewpoint generalization, and rather leads to more data-dependent methods. To tackle this issue, we propose DECA, a novel capsule autoencoder network with fast Variational Bayes capsule routing. By modeling each joint as a capsule entity, combined with the routing algorithm, our approach can preserve the joints' hierarchical and geometrical structure in the feature space, independently of the viewpoint. By achieving viewpoint equivariance, we drastically reduce the network's data dependency at training time, resulting in an improved ability to generalize to unseen viewpoints. In the experimental validation, we outperform other methods on depth images from both seen and unseen viewpoints, both top-view and front-view. In the RGB domain, the same network gives state-of-the-art results on the challenging viewpoint-transfer task, also establishing a new framework for top-view HPE. The code can be found at https://github.com/mmlab-cv/DECA.
A Unified Objective for Novel Class Discovery
Comment: ICCV 2021 (Oral)
Link: http://arxiv.org/abs/2108.08536
Abstract
In this paper, we study the problem of Novel Class Discovery (NCD). NCD aims at inferring novel object categories in an unlabeled set by leveraging prior knowledge from a labeled set containing different, but related, classes. Existing approaches tackle this problem by considering multiple objective functions, usually involving specialized loss terms for the labeled and the unlabeled samples respectively, and often requiring auxiliary regularization terms. In this paper, we depart from this traditional scheme and introduce a UNified Objective function (UNO) for discovering novel classes, with the explicit purpose of favoring synergy between supervised and unsupervised learning. Using a multi-view self-labeling strategy, we generate pseudo-labels that can be treated homogeneously with ground-truth labels. This leads to a single classification objective operating on both known and unknown classes. Despite its simplicity, UNO outperforms the state of the art by a significant margin on several benchmarks (~+10% on CIFAR-100 and +8% on ImageNet). The project page is available at https://ncd-uno.github.io.
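A condensed sketch of the unified-objective idea: logits for known (labeled) classes and novel (clustered) classes are concatenated, ground-truth labels and multi-view pseudo-labels are treated homogeneously, and a single cross-entropy is applied to both. The plain argmax swap between two views below is a simplification of the paper's pseudo-labeling scheme, not its actual implementation.

```python
import torch
import torch.nn.functional as F

def unified_loss(logits_v1, logits_v2, labels, num_known):
    """logits_*: (B, num_known + num_novel); labels: (B,), with -1 for unlabeled samples."""
    unlabeled = labels < 0
    targets = labels.clone()
    # Unlabeled samples take pseudo-labels from the other view's novel-class head,
    # offset so they index the concatenated (known + novel) logits.
    pseudo = logits_v2[:, num_known:].argmax(dim=1) + num_known
    targets[unlabeled] = pseudo[unlabeled]
    return F.cross_entropy(logits_v1, targets)

logits_a, logits_b = torch.randn(8, 15), torch.randn(8, 15)    # 10 known + 5 novel classes
labels = torch.tensor([0, 3, 9, -1, -1, -1, 5, -1])
print(unified_loss(logits_a, logits_b, labels, num_known=10))
```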
Understanding and Mitigating Annotation Bias in Facial Expression Recognition
Comment: To appear in ICCV 2021
Link: http://arxiv.org/abs/2108.08504
Abstract
The performance of a computer vision model depends on the size and quality of its training data. Recent studies have unveiled previously-unknown composition biases in common image datasets which then lead to skewed model outputs, and have proposed methods to mitigate these biases. However, most existing works assume that human-generated annotations can be considered gold-standard and unbiased. In this paper, we reveal that this assumption can be problematic, and that special care should be taken to prevent models from learning such annotation biases. We focus on facial expression recognition and compare the label biases between lab-controlled and in-the-wild datasets. We demonstrate that many expression datasets contain significant annotation biases between genders, especially when it comes to the happy and angry expressions, and that traditional methods cannot fully mitigate such biases in trained models. To remove expression annotation bias, we propose an AU-Calibrated Facial Expression Recognition (AUC-FER) framework that utilizes facial action units (AUs) and incorporates the triplet loss into the objective function. Experimental results suggest that the proposed method is more effective in removing expression annotation bias than existing techniques.
Amplitude-Phase Recombination: Rethinking Robustness of Convolutional Neural Networks in Frequency Domain
Comment: ICCV 2021
Link: http://arxiv.org/abs/2108.08487
Abstract
Recently, the generalization behavior of Convolutional Neural Networks (CNNs) has gradually become more transparent through explanation techniques based on frequency component decomposition. However, the importance of the image's phase spectrum for a robust vision system is still ignored. In this paper, we notice that a CNN tends to converge to a local optimum that is closely related to the high-frequency components of the training images, while the amplitude spectrum is easily disturbed, e.g., by noise or common corruptions. In contrast, empirical studies have found that humans rely more on phase components to achieve robust recognition. This observation leads to further explanations of a CNN's generalization behavior in terms of both robustness to common perturbations and out-of-distribution detection, and motivates a new data augmentation perspective designed by re-combining the phase spectrum of the current image with the amplitude spectrum of a distracter image. The generated samples force the CNN to pay more attention to the structured information in the phase components and to remain robust to variations of the amplitude. Experiments on several image datasets indicate that the proposed method achieves state-of-the-art performance on multiple generalization and calibration tasks, including adaptability to common corruptions and surface variations, out-of-distribution detection, and adversarial attacks.
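The described augmentation is easy to sketch directly: keep the phase spectrum of the current image and recombine it with the amplitude spectrum of a distracter image, so the network must rely on phase (structure) rather than amplitude. The clamp to [0, 1] and the random tensors in the usage are illustrative choices.

```python
import torch

def amplitude_phase_recombine(image, distracter):
    """image, distracter: (B, C, H, W) tensors in [0, 1]."""
    f_img = torch.fft.fft2(image)
    f_dis = torch.fft.fft2(distracter)
    amplitude = f_dis.abs()                      # amplitude from the distracter image
    phase = torch.angle(f_img)                   # phase from the current image
    mixed = amplitude * torch.exp(1j * phase)
    return torch.fft.ifft2(mixed).real.clamp(0, 1)

aug = amplitude_phase_recombine(torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32))
print(aug.shape)   # torch.Size([4, 3, 32, 32])
```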
Learning Anchored Unsigned Distance Functions with Gradient Direction Alignment for Single-view Garment Reconstruction
Comment: ICCV 2021
Link: http://arxiv.org/abs/2108.08478
Abstract
While single-view 3D reconstruction has made significant progress benefiting from deep shape representations in recent years, garment reconstruction is still not solved well due to open surfaces, diverse topologies, and complex geometric details. In this paper, we propose a novel learnable Anchored Unsigned Distance Function (AnchorUDF) representation for 3D garment reconstruction from a single image. AnchorUDF represents 3D shapes by predicting unsigned distance fields (UDFs) to enable open garment surface modeling at arbitrary resolution. To capture diverse garment topologies, AnchorUDF not only computes pixel-aligned local image features of query points, but also leverages a set of anchor points located around the surface to enrich the 3D position features of query points, which provides a stronger 3D space context for the distance function. Furthermore, in order to obtain a more accurate point projection direction at inference, we explicitly align the spatial gradient direction of AnchorUDF with the ground-truth direction to the surface during training. Extensive experiments on two public 3D garment datasets, i.e., MGN and Deep Fashion3D, demonstrate that AnchorUDF achieves state-of-the-art performance on single-view garment reconstruction.
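A small sketch of the gradient-direction alignment term: the spatial gradient of the predicted unsigned distance field is obtained with autograd and pushed toward the ground-truth direction via a cosine penalty. The UDF network here is a toy MLP stand-in, and the sign convention of the target directions is left to the dataset.

```python
import torch
import torch.nn.functional as F

def gradient_alignment_loss(udf_net, points, gt_directions):
    """points: (N, 3) query points; gt_directions: (N, 3) unit vectors toward the surface."""
    points = points.clone().requires_grad_(True)
    udf = udf_net(points).squeeze(-1)                               # (N,) predicted distances
    grad = torch.autograd.grad(udf.sum(), points, create_graph=True)[0]
    grad = F.normalize(grad, dim=-1)                                # spatial gradient direction
    return (1 - (grad * gt_directions).sum(-1)).mean()              # cosine alignment penalty

udf_net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
pts = torch.rand(128, 3)
dirs = F.normalize(torch.randn(128, 3), dim=-1)
loss = gradient_alignment_loss(udf_net, pts, dirs)
loss.backward()
```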
Medical Image Segmentation using 3D Convolutional Neural Networks: A Review
Comment: 17 pages, 4 figures
Link: http://arxiv.org/abs/2108.08467
Abstract
Computer-aided medical image analysis plays a significant role in assisting medical practitioners with expert clinical diagnosis and deciding the optimal treatment plan. At present, convolutional neural networks (CNNs) are the preferred choice for medical image analysis. In addition, with the rapid advancements in three-dimensional (3D) imaging systems and the availability of excellent hardware and software support to process large volumes of data, 3D deep learning methods are gaining popularity in medical image analysis. Here, we present an extensive review of the recently evolved 3D deep learning methods in medical image segmentation. Furthermore, the research gaps and future directions in 3D medical image segmentation are discussed.
Self-Supervised Video Representation Learning with Meta-Contrastive Network
Comment: Accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.08426
Abstract
Self-supervised learning has been successfully applied to pre-train video representations, aiming at efficient adaptation from the pre-training domain to downstream tasks. Existing approaches merely leverage a contrastive loss to learn instance-level discrimination. However, the lack of category information leads to a hard-positive problem that constrains the generalization ability of such methods. We find that the multi-task process of meta learning can provide a solution to this problem. In this paper, we propose a Meta-Contrastive Network (MCN), which combines contrastive learning and meta learning to enhance the learning ability of existing self-supervised approaches. Our method contains two training stages based on model-agnostic meta learning (MAML), each of which consists of a contrastive branch and a meta branch. Extensive evaluations demonstrate the effectiveness of our method. For two downstream tasks, i.e., video action recognition and video retrieval, MCN outperforms state-of-the-art approaches on the UCF101 and HMDB51 datasets. To be more specific, with an R(2+1)D backbone, MCN achieves Top-1 accuracies of 84.8% and 54.5% for video action recognition, as well as 52.5% and 23.7% for video retrieval.
Generating Smooth Pose Sequences for Diverse Human Motion Prediction
Comment: ICCV21(oral)
Link: http://arxiv.org/abs/2108.08422
Abstract
Recent progress in stochastic motion prediction, i.e., predicting multiple possible future human motions given a single past pose sequence, has led to producing truly diverse future motions and even providing control over the motion of some body parts. However, to achieve this, the state-of-the-art method requires learning several mappings for diversity and a dedicated model for controllable motion prediction. In this paper, we introduce a unified deep generative network for both diverse and controllable motion prediction. To this end, we leverage the intuition that realistic human motions consist of smooth sequences of valid poses, and that, given limited data, learning a pose prior is much more tractable than a motion one. We therefore design a generator that predicts the motion of different body parts sequentially, and introduce a normalizing-flow-based pose prior, together with a joint angle loss, to achieve motion realism. Our experiments on two standard benchmark datasets, Human3.6M and HumanEva-I, demonstrate that our approach outperforms the state-of-the-art baselines in terms of both sample diversity and accuracy. The code is available at https://github.com/wei-mao-2019/gsps.
Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes
Comment: ICCV'21 Accepted
Link: http://arxiv.org/abs/2108.08421
Abstract
Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks (e.g., by checking the object co-occurrence relationships in complex scenes). However, existing approaches are tied to specific models and do not offer generalizability. Motivated by the observation that language descriptions of natural scene images have already captured the object co-occurrence relationships that can be learned by a language model, we develop a novel approach to perform context consistency checks using such language models. The distinguishing aspect of our approach is that it is independent of the deployed object detector and yet offers very high accuracy in terms of detecting adversarial examples in practical scenes with multiple objects.
Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
Comment: Initial submission; appeared as spotlight talk in ICML 2021 Workshop on Theory of RL
Link: http://arxiv.org/abs/2108.08812
Abstract
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well understood theoretically. We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle, leading to several key advantages compared to the state of the art. The algorithm can operate when the Bellman evaluation operator is closed with respect to the action value function of the actor's policies; this is a more general setting than the low-rank MDP model. Despite the added generality, the procedure is computationally tractable as it involves the solution of a sequence of second-order programs. We prove an upper bound on the suboptimality gap of the policy returned by the procedure that depends on the data coverage of any arbitrary, possibly data-dependent comparator policy. The achievable guarantee is complemented with a minimax lower bound that is matching up to logarithmic factors.