Today's arXiv Picks | 9 New ICCV 2021 Papers

About #Today's arXiv Picks

This is a column from 「AI 学术前沿」 (AI Academic Frontier): every day the editors select high-quality papers from arXiv and deliver them to readers.

The Power of Points for Modeling Humans in Clothing

Comment: In ICCV 2021. Project page: https://qianlim.github.io/POP

Link: http://arxiv.org/abs/2109.01137

Abstract

Currently it requires an artist to create 3D human avatars with realistic clothing that can move naturally. Despite progress on 3D scanning and modeling of human bodies, there is still no technology that can easily turn a static scan into an animatable avatar. Automating the creation of such avatars would enable many applications in games, social networking, animation, and AR/VR to name a few. The key problem is one of representation. Standard 3D meshes are widely used in modeling the minimally-clothed body but do not readily capture the complex topology of clothing. Recent interest has shifted to implicit surface models for this task but they are computationally heavy and lack compatibility with existing 3D tools. What is needed is a 3D representation that can capture varied topology at high resolution and that can be learned from data. We argue that this representation has been with us all along -- the point cloud. Point clouds have properties of both implicit and explicit representations that we exploit to model 3D garment geometry on a human body. We train a neural network with a novel local clothing geometric feature to represent the shape of different outfits. The network is trained from 3D point clouds of many types of clothing, on many bodies, in many poses, and learns to model pose-dependent clothing deformations. The geometry feature can be optimized to fit a previously unseen scan of a person in clothing, enabling the scan to be reposed realistically. Our model demonstrates superior quantitative and qualitative results in both multi-outfit modeling and unseen outfit animation. The code is available for research purposes.
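Editor's note: to make the point-based representation concrete, here is a minimal PyTorch sketch of the general idea (not the authors' released code): each body point carries an optimizable local garment geometry feature, and a shared MLP maps point position, pose, and feature to a pose-dependent offset. All class names, layer sizes, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointGarmentDecoder(nn.Module):
    """Hypothetical sketch of a point-based clothing decoder: a shared MLP
    predicts a pose-dependent 3D offset per body point. Because the per-point
    garment features are plain tensors, an unseen scan could be fit by
    optimizing `geom_feat` with the decoder frozen."""
    def __init__(self, feat_dim=64, pose_dim=72, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # 3D displacement for each point
        )

    def forward(self, body_pts, geom_feat, pose):
        # body_pts: (N, 3) minimally-clothed body points
        # geom_feat: (N, feat_dim) local clothing geometry features
        # pose: (pose_dim,) body pose parameters, broadcast to every point
        pose_rep = pose.unsqueeze(0).expand(body_pts.shape[0], -1)
        x = torch.cat([body_pts, geom_feat, pose_rep], dim=-1)
        return body_pts + self.mlp(x)  # clothed point cloud

decoder = PointGarmentDecoder()
clothed = decoder(torch.rand(500, 3), torch.rand(500, 64), torch.rand(72))
print(clothed.shape)  # torch.Size([500, 3])
```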

NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

Comment: To appear in ICCV 2021 (Oral). Project page: https://weiyithu.github.io/NerfingMVS/

Link: http://arxiv.org/abs/2109.01129

Abstract

In this work, we present a new multi-view depth estimation method that utilizes both conventional SfM reconstruction and learning-based priors over the recently proposed neural radiance fields (NeRF). Unlike existing neural network based optimization methods that rely on estimated correspondences, our method directly optimizes over implicit volumes, eliminating the challenging step of matching pixels in indoor scenes. The key to our approach is to utilize the learning-based priors to guide the optimization process of NeRF. Our system first adapts a monocular depth network to the target scene by finetuning on its sparse SfM reconstruction. Then, we show that the shape-radiance ambiguity of NeRF still exists in indoor environments and propose to address the issue by employing the adapted depth priors to monitor the sampling process of volume rendering. Finally, a per-pixel confidence map acquired by error computation on the rendered image can be used to further improve the depth quality. Experiments show that our proposed framework significantly outperforms state-of-the-art methods on indoor scenes, with surprising findings presented on the effectiveness of correspondence-based optimization and NeRF-based optimization over the adapted depth priors. In addition, we show that the guided optimization scheme does not sacrifice the original synthesis capability of neural radiance fields, improving the rendering quality on both seen and novel views. Code is available at https://github.com/weiyithu/NerfingMVS.
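Editor's note: the two most transferable ideas here are restricting NeRF's ray samples to a band around the adapted depth prior and deriving a per-pixel confidence from the rendering error. Below is a minimal sketch of both under assumed shapes and a hand-picked margin; it is illustrative only, not the released implementation linked above.

```python
import torch

def depth_guided_samples(depth_prior, n_samples=64, rel_margin=0.15):
    """Sketch of prior-guided sampling for volume rendering: instead of
    covering the full [near, far] range, per-ray samples are drawn inside a
    relative band around the adapted monocular depth."""
    near = depth_prior * (1.0 - rel_margin)    # (n_rays,) per-ray near bound
    far = depth_prior * (1.0 + rel_margin)     # (n_rays,) per-ray far bound
    t = torch.linspace(0.0, 1.0, n_samples)    # (n_samples,)
    return near[:, None] + (far - near)[:, None] * t[None, :]

def per_pixel_confidence(rendered, target):
    """Confidence map from photometric error on the rendered image (sketch):
    large rendering error means low confidence in the rendered depth there."""
    err = (rendered - target).abs().mean(dim=-1)   # (H, W)
    return 1.0 / (1.0 + err)

z_vals = depth_guided_samples(torch.full((4,), 2.0))
print(z_vals.shape)  # torch.Size([4, 64])
```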

The Functional Correspondence Problem

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2109.01097

Abstract

The ability to find correspondences in visual data is the essence of most computer vision tasks. But what are the right correspondences? The task of visual correspondence is well defined for two different images of the same object instance. In the case of two images of objects belonging to the same category, visual correspondence is reasonably well-defined in most cases. But what about correspondence between two objects of completely different categories -- e.g., a shoe and a bottle? Does there exist any correspondence? Inspired by humans' ability to (a) generalize beyond semantic categories and (b) infer functional affordances, we introduce the problem of functional correspondences in this paper. Given images of two objects, we ask a simple question: what is the set of correspondences between these two images for a given task? For example, what are the correspondences between a bottle and a shoe for the task of pounding or the task of pouring? We introduce a new dataset, FunKPoint, that has ground truth correspondences for 10 tasks and 20 object categories. We also introduce a modular task-driven representation for attacking this problem and demonstrate that our learned representation is effective for this task. Most importantly, because our supervision signal is not bound by semantics, we show that our learned representation can generalize better on the few-shot classification problem. We hope this paper will inspire our community to think beyond semantics and focus more on cross-category generalization and learning representations for robotics tasks.
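Editor's note: the abstract does not detail the architecture, so the sketch below only illustrates what a "task-driven representation" could look like: image features gated by a task embedding, followed by nearest-neighbor matching of keypoint features. Every layer choice and name is a hypothetical stand-in, not the FunKPoint model.

```python
import torch
import torch.nn as nn

class TaskConditionedEncoder(nn.Module):
    """Illustrative encoder: features are modulated by a task embedding, so
    the resulting correspondences depend on the queried task (e.g. pounding
    vs. pouring). Layers and sizes are placeholders."""
    def __init__(self, n_tasks=10, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        self.task_embed = nn.Embedding(n_tasks, feat_dim)

    def forward(self, image, task_id):
        feats = self.backbone(image)                  # (B, C, H, W)
        gate = self.task_embed(task_id).sigmoid()     # (B, C)
        return feats * gate[:, :, None, None]         # task-modulated features

def match_points(feats_a, pts_a, feats_b):
    """For each query (x, y) point in image A, return the most similar
    location in image B by cosine similarity of task-conditioned features."""
    fa = feats_a[0, :, pts_a[:, 1], pts_a[:, 0]].T              # (K, C)
    fb = feats_b[0].flatten(1).T                                # (H*W, C)
    sim = nn.functional.normalize(fa, dim=1) @ nn.functional.normalize(fb, dim=1).T
    idx = sim.argmax(dim=1)
    w = feats_b.shape[-1]
    return torch.stack([idx % w, idx // w], dim=1)              # (K, 2) xy

enc = TaskConditionedEncoder()
img_a, img_b = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
task = torch.tensor([3])
fa, fb = enc(img_a, task), enc(img_b, task)
print(match_points(fa, torch.tensor([[5, 7], [20, 11]]), fb))
```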

SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting

Comment: ICCV 2021 (Oral); Project page: https://varunjampani.github.io/slide; Video: https://www.youtube.com/watch?v=RQio7q-ueY8

Link: http://arxiv.org/abs/2109.01068

Abstract

Single image 3D photography enables viewers to view a still image from novel viewpoints. Recent approaches combine monocular depth networks with inpainting networks to achieve compelling results. A drawback of these techniques is the use of hard depth layering, making them unable to model intricate appearance details such as thin hair-like structures. We present SLIDE, a modular and unified system for single image 3D photography that uses a simple yet effective soft layering strategy to better preserve appearance details in novel views. In addition, we propose a novel depth-aware training strategy for our inpainting module, better suited for the 3D photography task. The resulting SLIDE approach is modular, enabling the use of other components such as segmentation and matting for improved layering. At the same time, SLIDE uses an efficient layered depth formulation that only requires a single forward pass through the component networks to produce high quality 3D photos. Extensive experimental analysis on three view-synthesis datasets, in combination with user studies on in-the-wild image collections, demonstrate superior performance of our technique in comparison to existing strong baselines while being conceptually much simpler. Project page: https://varunjampani.github.io/slide
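Editor's note: the key contrast with prior work is soft versus hard depth layering. The sketch below shows one simple way a soft two-layer decomposition and composite could be written, assuming a sigmoid over depth as the soft alpha; the actual SLIDE layering and its depth-aware inpainting module are more involved.

```python
import torch

def soft_layering(rgb, depth, thresh, softness=0.05):
    """Sketch of a two-layer soft decomposition (not the SLIDE code): instead
    of a hard cut at a depth threshold, a sigmoid produces a soft alpha so
    thin structures keep fractional membership in the foreground layer."""
    alpha = torch.sigmoid((thresh - depth) / softness)   # (H, W) in [0, 1]
    fg = rgb * alpha[..., None]                          # soft foreground layer
    bg = rgb * (1.0 - alpha[..., None])                  # soft background layer
    return fg, bg, alpha

def composite(fg, bg_inpainted, alpha):
    """Over-composite the layers; in the full system bg_inpainted would come
    from the inpainting module after warping to the novel view."""
    return fg + (1.0 - alpha[..., None]) * bg_inpainted

rgb = torch.rand(64, 64, 3)
depth = torch.rand(64, 64) * 10.0
fg, bg, a = soft_layering(rgb, depth, thresh=5.0)
print(composite(fg, bg, a).shape)  # torch.Size([64, 64, 3])
```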

4D-Net for Learned Multi-Modal Alignment

Comment: ICCV 2021

Link: http://arxiv.org/abs/2109.01066

Abstract

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines on the Waymo Open Dataset. 4D-Net is better able to use motion cues and dense image information to detect distant objects more successfully.
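Editor's note: "dynamic connection learning" is the part worth unpacking. A minimal reading, sketched below, is that each 3D point feature predicts weights over several RGB feature levels and fuses them as a weighted sum; the dimensions and module names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DynamicConnection(nn.Module):
    """Illustrative learned multi-modal connection: for each point-cloud
    feature, predict weights over several image feature levels (already
    sampled at the point's 2D projection) and fuse by weighted sum."""
    def __init__(self, pt_dim=64, img_dim=64, n_levels=3):
        super().__init__()
        self.weight_net = nn.Linear(pt_dim, n_levels)
        self.fuse = nn.Linear(pt_dim + img_dim, pt_dim)

    def forward(self, pt_feat, img_feats):
        # pt_feat: (N, pt_dim); img_feats: (n_levels, N, img_dim)
        w = torch.softmax(self.weight_net(pt_feat), dim=-1)   # (N, n_levels)
        mixed = (w.T[:, :, None] * img_feats).sum(dim=0)      # (N, img_dim)
        return self.fuse(torch.cat([pt_feat, mixed], dim=-1))

net = DynamicConnection()
out = net(torch.rand(100, 64), torch.rand(3, 100, 64))
print(out.shape)  # torch.Size([100, 64])
```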

Adversarial Robustness for Unsupervised Domain Adaptation

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2109.00946

Abstract

Extensive Unsupervised Domain Adaptation (UDA) studies have shown great success in practice by learning transferable representations across a labeled source domain and an unlabeled target domain with deep models. However, previous works focus on improving the generalization ability of UDA models on clean examples without considering the adversarial robustness, which is crucial in real-world applications. Conventional adversarial training methods are not suitable for the adversarial robustness on the unlabeled target domain of UDA since they train models with adversarial examples generated by the supervised loss function. In this work, we leverage intermediate representations learned by multiple robust ImageNet models to improve the robustness of UDA models. Our method works by aligning the features of the UDA model with the robust features learned by ImageNet pre-trained models along with domain adaptation training. It utilizes both labeled and unlabeled domains and instills robustness without any adversarial intervention or label requirement during domain adaptation training. Experimental results show that our method significantly improves adversarial robustness compared to the baseline while keeping clean accuracy on various UDA benchmarks.
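Editor's note: the mechanism is refreshingly simple: no adversarial examples are generated; instead the UDA model's intermediate features are pulled toward those of frozen, adversarially robust ImageNet models. Below is a hedged sketch of such an alignment loss, using MSE as a stand-in for whatever distance the paper actually uses.

```python
import torch
import torch.nn.functional as F

def robust_feature_alignment_loss(uda_feats, robust_feats_list):
    """Sketch of robustness-by-alignment (illustrative, not the paper's exact
    loss): match the UDA model's intermediate features to those of one or
    more frozen robust ImageNet teachers on the same images. A projection
    layer would be needed if the feature dimensions differ; here they are
    assumed equal."""
    loss = 0.0
    for robust_feats in robust_feats_list:
        # teachers are frozen; only the UDA model receives gradients
        loss = loss + F.mse_loss(uda_feats, robust_feats.detach())
    return loss / len(robust_feats_list)

uda_f = torch.rand(8, 512, requires_grad=True)
teachers = [torch.rand(8, 512), torch.rand(8, 512)]
print(robust_feature_alignment_loss(uda_f, teachers))
```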

Generative Models for Multi-Illumination Color Constancy

Comment: Accepted in International Conference on Computer Vision Workshop (ICCVW) 2021

Link: http://arxiv.org/abs/2109.00863

Abstract

In this paper, the aim is multi-illumination color constancy. However, most of the existing color constancy methods are designed for single light sources. Furthermore, datasets for learning multi-illumination color constancy are largely missing. We propose a seed (physics driven) based multi-illumination color constancy method. GANs are exploited to model the illumination estimation problem as an image-to-image domain translation problem. Additionally, a novel multi-illumination data augmentation method is proposed. Experiments on single and multi-illumination datasets show that our methods outperform state-of-the-art methods.
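Editor's note: once a per-pixel illumination map is estimated (here, by a GAN that treats estimation as image-to-image translation), the correction itself is just a per-pixel division. The snippet below shows only that final step, with assumed shapes; it does not reproduce the paper's estimator.

```python
import torch

def apply_white_balance(image, illum_map, eps=1e-6):
    """Sketch of multi-illuminant correction: divide out a per-pixel RGB
    illumination map (the kind of image-to-image output an estimator would
    produce) and rescale. Shapes and ranges are illustrative assumptions."""
    corrected = image / (illum_map + eps)
    return corrected / corrected.max()             # rescale into [0, 1]

img = torch.rand(32, 32, 3)
illum = torch.rand(32, 32, 3).clamp(0.2, 1.0)      # spatially varying light
print(apply_white_balance(img, illum).shape)       # torch.Size([32, 32, 3])
```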

SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos

Comment: Accepted to EPIC@ICCV 2021

Link: http://arxiv.org/abs/2109.00829

Abstract

Action anticipation in egocentric videos is a difficult task due to the inherently multi-modal nature of human actions. Additionally, some actions happen faster or slower than others depending on the actor or surrounding context, which could vary each time and lead to different predictions. Based on this idea, we build upon the RULSTM architecture, which is specifically designed for anticipating human actions, and propose a novel attention-based technique to evaluate, simultaneously, slow and fast features extracted from three different modalities, namely RGB, optical flow, and extracted objects. Two branches process information at different time scales, i.e., frame rates, and several fusion schemes are considered to improve prediction accuracy. We perform extensive experiments on the EpicKitchens-55 and EGTEA Gaze+ datasets, and demonstrate that our technique systematically improves the results of the RULSTM architecture on the Top-5 accuracy metric at different anticipation times.
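Editor's note: the core addition over RULSTM is attention over slow and fast branches across modalities. The sketch below shows one plausible form of such a fusion head, where a small network predicts weights over the per-branch predictions; the branch count, feature size, and class count are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SlowFastFusion(nn.Module):
    """Hypothetical attention fusion head: weights over the slow/fast branch
    predictions of each modality are produced by a small network and used to
    mix the branch logits. All sizes are illustrative."""
    def __init__(self, n_branches=6, n_classes=100, feat_dim=1024):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(n_branches * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_branches),
        )

    def forward(self, branch_feats, branch_logits):
        # branch_feats: (B, n_branches, feat_dim), e.g. {RGB, flow, objects} x {slow, fast}
        # branch_logits: (B, n_branches, n_classes) per-branch anticipation scores
        w = torch.softmax(self.attn(branch_feats.flatten(1)), dim=-1)  # (B, n_branches)
        return (w[:, :, None] * branch_logits).sum(dim=1)              # (B, n_classes)

fusion = SlowFastFusion()
print(fusion(torch.rand(2, 6, 1024), torch.rand(2, 6, 100)).shape)  # torch.Size([2, 100])
```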

Self-Calibrating Neural Radiance Fields

Comment: Accepted in ICCV21, Project Page: https://postech-cvlab.github.io/SCNeRF/

Link: http://arxiv.org/abs/2108.13826

Abstract

In this work, we propose a camera self-calibration algorithm for generic cameras with arbitrary non-linear distortions. We jointly learn the geometry of the scene and the accurate camera parameters without any calibration objects. Our camera model consists of a pinhole model, a fourth order radial distortion, and a generic noise model that can learn arbitrary non-linear camera distortions. While traditional self-calibration algorithms mostly rely on geometric constraints, we additionally incorporate photometric consistency. This requires learning the geometry of the scene, and we use Neural Radiance Fields (NeRF). We also propose a new geometric loss function, viz., projected ray distance loss, to incorporate geometric consistency for complex non-linear camera models. We validate our approach on standard real image datasets and demonstrate that our model can learn the camera intrinsics and extrinsics (pose) from scratch without COLMAP initialization. Also, we show that learning accurate camera models in a differentiable manner allows us to improve PSNR over baselines. Our module is an easy-to-use plugin that can be applied to NeRF variants to improve performance. The code and data are currently available at https://github.com/POSTECH-CVLab/SCNeRF.
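Editor's note: the learnable camera model is easy to make concrete: pinhole intrinsics plus a fourth-order radial distortion term and a generic learned residual, all differentiable so they can be optimized alongside the NeRF. The sketch below shows that pixel-to-ray mapping under an assumed parametrization; the projected ray distance loss is not reproduced here.

```python
import torch

def pixel_to_ray(uv, K_inv, k1, k2, noise_offset=None):
    """Sketch of a self-calibrating camera model (illustrative parametrization):
    unproject pixels with pinhole intrinsics, apply a fourth-order radial
    distortion (k1, k2), and optionally add a learned per-pixel residual. All
    inputs could be optimized jointly with a NeRF by gradient descent."""
    x = K_inv @ torch.cat([uv, torch.ones_like(uv[:1])], dim=0)   # (3, N)
    r2 = x[0] ** 2 + x[1] ** 2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2          # radial distortion factor
    dirs = torch.stack([x[0] * scale, x[1] * scale, x[2]], dim=0)
    if noise_offset is not None:                   # generic learned residual
        dirs = dirs + noise_offset
    return dirs / dirs.norm(dim=0, keepdim=True)   # unit ray directions

K_inv = torch.eye(3)
uv = torch.rand(2, 5) * 100
k1 = torch.tensor(0.01, requires_grad=True)
k2 = torch.tensor(0.001, requires_grad=True)
print(pixel_to_ray(uv, K_inv, k1, k2).shape)  # torch.Size([3, 5])
```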
