Today's arXiv Picks | 13 New ICCV 2021 Papers

About #Today's arXiv Picks

This is a column from 「AI 学术前沿」 (AI Academic Frontier): every day, the editors select high-quality papers from arXiv and deliver them to readers.

A QuadTree Image Representation for Computational Pathology

Comment: 11 pages, 5 figures, accepted to CDPath ICCV 2021

Link: http://arxiv.org/abs/2108.10873

Abstract

The field of computational pathology presents many challenges for computer vision algorithms due to the sheer size of pathology images. Histopathology images are large and need to be split up into image tiles or patches so that modern convolutional neural networks (CNNs) can process them. In this work, we present a method to generate an interpretable image representation of computational pathology images using quadtrees, and a pipeline to use these representations for highly accurate downstream classification. To the best of our knowledge, this is the first attempt to use quadtrees for pathology image data. We show the approach is highly accurate, able to achieve results as good as the currently widely adopted tissue mask patch extraction methods, all while using over 38% less data.
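
The abstract does not give the exact splitting rule, but the core idea of a quadtree representation can be illustrated with a short sketch: recursively split a region into four quadrants while it contains a mix of tissue and background, and keep only the tissue-bearing leaves as patches. The threshold, minimum patch size, and the use of a precomputed tissue mask below are assumptions for illustration, not the paper's criterion.

```python
import numpy as np

def build_quadtree(tissue_mask, x, y, size, min_size=256, keep_thresh=0.05):
    """Recursively split a (power-of-two sized) region while it contains a mix
    of tissue and background; return leaf regions that contain enough tissue.

    tissue_mask: 2D boolean array (True = tissue pixel).
    Returns a list of (x, y, size) leaf patches.
    """
    region = tissue_mask[y:y + size, x:x + size]
    tissue_frac = region.mean() if region.size else 0.0

    # Stop splitting when the region is (nearly) homogeneous or minimal.
    if size <= min_size or tissue_frac < keep_thresh or tissue_frac > 1 - keep_thresh:
        return [(x, y, size)] if tissue_frac >= keep_thresh else []

    half = size // 2
    leaves = []
    for dx, dy in [(0, 0), (half, 0), (0, half), (half, half)]:
        leaves += build_quadtree(tissue_mask, x + dx, y + dy, half,
                                 min_size, keep_thresh)
    return leaves

# Example: a 4096x4096 mask with a tissue blob in one corner.
mask = np.zeros((4096, 4096), dtype=bool)
mask[:1024, :1024] = True
patches = build_quadtree(mask, 0, 0, 4096)
print(len(patches), "patches kept")
```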

Tune it the Right Way: Unsupervised Validation of Domain Adaptation via Soft Neighborhood Density

Comment: ICCV2021

Link: http://arxiv.org/abs/2108.10860

Abstract

Unsupervised domain adaptation (UDA) methods can dramatically improve generalization on unlabeled target domains. However, optimal hyper-parameter selection is critical to achieving high accuracy and avoiding negative transfer. Supervised hyper-parameter validation is not possible without labeled target data, which raises the question: How can we validate unsupervised adaptation techniques in a realistic way? We first empirically analyze existing criteria and demonstrate that they are not very effective for tuning hyper-parameters. Intuitively, a well-trained source classifier should embed target samples of the same class nearby, forming dense neighborhoods in feature space. Based on this assumption, we propose a novel unsupervised validation criterion that measures the density of soft neighborhoods by computing the entropy of the similarity distribution between points. Our criterion is simpler than competing validation methods, yet more effective; it can tune hyper-parameters and the number of training iterations in both image classification and semantic segmentation models. The code used for the paper will be available at https://github.com/VisionLearningGroup/SND.
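
The criterion lends itself to a compact sketch: embed the unlabeled target data with a candidate checkpoint, build the pairwise similarity distribution, and score the checkpoint by the mean entropy of each point's soft neighborhood. The cosine similarity, temperature value, and exclusion of self-similarity below are plausible choices rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_neighborhood_density(features, temperature=0.05):
    """Soft Neighborhood Density (SND)-style criterion (sketch).

    features: (N, D) target-domain embeddings from a candidate checkpoint.
    Returns a scalar score; checkpoints with a higher score are preferred.
    """
    f = F.normalize(features, dim=1)              # unit-norm embeddings
    sim = f @ f.t()                               # cosine similarities
    sim.fill_diagonal_(float('-inf'))             # drop self-similarity
    p = F.softmax(sim / temperature, dim=1)       # soft neighborhood of each point
    entropy = -(p * torch.log(p + 1e-12)).sum(dim=1)
    return entropy.mean()

# Usage: evaluate two candidate hyper-parameter settings and keep the one
# whose target features yield the higher score.
feats_a = torch.randn(512, 256)
feats_b = torch.randn(512, 256)
print(soft_neighborhood_density(feats_a), soft_neighborhood_density(feats_b))
```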

Bridging Unsupervised and Supervised Depth from Focus via All-in-Focus Supervision

Comment: ICCV 2021. Project page: https://albert100121.github.io/AiFDepthNet/ Code: https://github.com/albert100121/AiFDepthNet

Link: http://arxiv.org/abs/2108.10843

Abstract

Depth estimation is a long-standing and important task in computer vision. Most previous works try to estimate depth from input images and assume the images are all-in-focus (AiF), which is less common in real-world applications. On the other hand, a few works take defocus blur into account and consider it as another cue for depth estimation. In this paper, we propose a method to estimate not only a depth map but also an AiF image from a set of images with different focus positions (known as a focal stack). We design a shared architecture to exploit the relationship between depth and AiF estimation. As a result, the proposed method can be trained either supervisedly with ground-truth depth, or unsupervisedly with AiF images as supervisory signals. We show in various experiments that our method outperforms state-of-the-art methods both quantitatively and qualitatively, and also has higher efficiency in inference time.

imGHUM: Implicit Generative Models of 3D Human Shape and Articulated Pose

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.10842

Abstract

We present imGHUM, the first holistic generative model of 3D human shape and articulated pose represented as a signed distance function. In contrast to prior work, we model the full human body implicitly, as the zero level set of a function, without the use of an explicit template mesh. We propose a novel network architecture and a learning paradigm, which make it possible to learn a detailed implicit generative model of human pose, shape, and semantics on par with state-of-the-art mesh-based models. Our model features the desired detail for human models, such as articulated pose including hand motion and facial expressions, and a broad spectrum of shape variations, and can be queried at arbitrary resolutions and spatial locations. Additionally, our model has attached spatial semantics, making it straightforward to establish correspondences between different shape instances and thus enabling applications that are difficult to tackle using classical implicit representations. In extensive experiments, we demonstrate the model's accuracy and its applicability to current research problems.
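
To make the "query anywhere" property of such an implicit model concrete, here is a minimal sketch of the interface: a network that maps a 3D point plus shape and pose codes to a signed distance (and per-part semantics), whose zero level set is the body surface. The MLP, code dimensions, and part count are placeholders; the actual imGHUM architecture is considerably more sophisticated.

```python
import torch
import torch.nn as nn

class ConditionalSDF(nn.Module):
    """Toy interface for an implicit human model: map a 3D query point plus
    shape/pose codes to a signed distance and semantic part logits. This only
    illustrates the query-anywhere property of an SDF representation."""

    def __init__(self, shape_dim=16, pose_dim=32, num_parts=24, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + shape_dim + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + num_parts),  # signed distance + part semantics
        )

    def forward(self, points, shape_code, pose_code):
        n = points.shape[0]
        cond = torch.cat([shape_code, pose_code]).expand(n, -1)
        out = self.mlp(torch.cat([points, cond], dim=1))
        return out[:, :1], out[:, 1:]          # sdf, semantics

model = ConditionalSDF()
pts = torch.rand(1024, 3) * 2 - 1              # arbitrary query locations
sdf, sem = model(pts, torch.randn(16), torch.randn(32))
# The body surface is the zero level set: points where sdf is close to 0.
surface_pts = pts[sdf.squeeze(1).abs() < 0.01]
```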

Meta Self-Learning for Multi-Source Domain Adaptation: A Benchmark

Comment: 10 pages, 4 figures, ICCV workshop

Link: http://arxiv.org/abs/2108.10840

Abstract

In recent years, deep learning-based methods have shown promising results in the computer vision area. However, a typical deep learning model requires a large amount of labeled data, which is labor-intensive to collect and label. Moreover, the model can be ruined by the domain shift between training data and testing data. Text recognition is a broadly studied field in computer vision and suffers from the same problems noted above due to the diversity of fonts and complicated backgrounds. In this paper, we focus on the text recognition problem and make three main contributions toward these problems. First, we collect a multi-source domain adaptation dataset for text recognition, comprising five different domains with over five million images, which to the best of our knowledge is the first multi-domain text recognition dataset. Second, we propose a new method called Meta Self-Learning, which combines the self-learning method with the meta-learning paradigm and achieves better recognition results in the multi-domain adaptation setting. Third, extensive experiments are conducted on the dataset to provide a benchmark and also show the effectiveness of our method. The code of our work and the dataset will be available soon at https://bupt-ai-cz.github.io/Meta-SelfLearning/.
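
The self-learning half of the method can be sketched as a standard pseudo-labeling round: predict on unlabeled target images, keep confident predictions as labels, and fine-tune on them. The confidence threshold and the per-image classification framing are simplifications (text recognition is a sequence task), and the meta-learning re-weighting that the paper adds is not reproduced here.

```python
import torch

def pseudo_label_round(model, target_loader, optimizer, conf_thresh=0.9):
    """One self-learning round (sketch): predict on unlabeled target images,
    keep confident predictions as pseudo-labels, then fine-tune on them."""
    model.eval()
    pseudo = []
    with torch.no_grad():
        for images in target_loader:
            probs = model(images).softmax(dim=1)     # (B, num_classes)
            conf, labels = probs.max(dim=1)
            keep = conf > conf_thresh
            if keep.any():
                pseudo.append((images[keep], labels[keep]))

    model.train()
    loss_fn = torch.nn.CrossEntropyLoss()
    for images, labels in pseudo:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)        # train on pseudo-labels
        loss.backward()
        optimizer.step()
```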

LLVIP: A Visible-infrared Paired Dataset for Low-light Vision

Comment: 9 pages, 9 figures, ICCV workshop

Link: http://arxiv.org/abs/2108.10831

Abstract

Various visual tasks such as image fusion, pedestrian detection, and image-to-image translation are very challenging in low-light conditions due to the loss of effective target areas. In this case, infrared and visible images can be used together to provide both rich detail information and effective target areas. In this paper, we present LLVIP, a visible-infrared paired dataset for low-light vision. This dataset contains 33672 images, or 16836 pairs, most of which were taken in very dark scenes, and all of the images are strictly aligned in time and space. Pedestrians in the dataset are labeled. We compare the dataset with other visible-infrared datasets and evaluate the performance of some popular visual algorithms, including image fusion, pedestrian detection, and image-to-image translation, on the dataset. The experimental results demonstrate the complementary effect of fusion on image information and reveal the deficiencies of existing algorithms for the three visual tasks in very low-light conditions. We believe the LLVIP dataset will contribute to the computer vision community by promoting image fusion, pedestrian detection, and image-to-image translation in very low-light applications. The dataset is being released at https://bupt-ai-cz.github.io/LLVIP.

Reconcile Prediction Consistency for Balanced Object Detection

Comment: To appear in ICCV 2021

Link: http://arxiv.org/abs/2108.10809

Abstract

Classification and regression are the two pillars of object detectors. In most CNN-based detectors, these two pillars are optimized independently. Without direct interactions between them, the classification loss and the regression loss cannot be optimized synchronously toward the optimal direction in the training phase. This leads to many inconsistent predictions in the inference phase, with high classification score but low localization accuracy, or low classification score but high localization accuracy, especially for objects with irregular shape or occlusion, which severely hurts the detection performance of existing detectors after NMS. To reconcile prediction consistency for balanced object detection, we propose a Harmonic loss to harmonize the optimization of the classification branch and the localization branch. The Harmonic loss enables these two branches to supervise and promote each other during training, thereby producing consistent predictions with high co-occurrence of top classification and localization in the inference phase. Furthermore, in order to prevent the localization loss from being dominated by outliers during the training phase, a Harmonic IoU loss is proposed to harmonize the weight of the localization loss across samples at different IoU levels. Comprehensive experiments on the PASCAL VOC and MS COCO benchmarks demonstrate the generality and effectiveness of our model in lifting existing object detectors to state-of-the-art accuracy.
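
The exact form of the Harmonic loss is not given in the abstract, so the snippet below only illustrates the underlying idea: let each branch's loss be modulated by the other branch's quality so that high scores and accurate boxes are encouraged to co-occur. The modulation factors and the `box_iou` helper (assumed to return the per-sample IoU of matched box pairs) are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def coupled_detection_loss(cls_logits, labels, pred_boxes, gt_boxes, box_iou):
    """Schematic coupling of the classification and localization branches.
    This is NOT the paper's Harmonic loss; it only illustrates cross-branch
    modulation so that score and localization quality co-occur."""
    cls_loss = F.cross_entropy(cls_logits, labels, reduction='none')   # (N,)
    iou = box_iou(pred_boxes, gt_boxes)          # (N,) IoU per matched sample (assumed helper)
    loc_loss = 1.0 - iou                         # plain IoU loss
    score = cls_logits.softmax(dim=1).gather(1, labels[:, None]).squeeze(1)

    # Cross-modulation: poorly localized boxes increase the classification
    # penalty; confidently classified boxes increase the localization penalty.
    total = ((1.0 + (1.0 - iou).detach()) * cls_loss
             + (1.0 + score.detach()) * loc_loss).mean()
    return total
```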

DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.10743

Abstract

Panorama images have a much larger field-of-view and thus naturally encode richer scene context information than standard perspective images, which, however, has not been well exploited in previous scene understanding methods. In this paper, we propose a novel method for panoramic 3D scene understanding that recovers the 3D room layout and the shape, pose, position, and semantic category of each object from a single full-view panorama image. In order to fully utilize the rich context information, we design a novel graph neural network based context model to predict the relationships among objects and the room layout, and a differentiable relationship-based optimization module to optimize object arrangement on-the-fly with well-designed objective functions. Recognizing that existing data either have incomplete ground truth or overly simplified scenes, we present a new synthetic dataset with good diversity in room layout and furniture placement, and realistic image quality, for total panoramic 3D scene understanding. Experiments demonstrate that our method outperforms existing methods on panoramic scene understanding in terms of both geometry accuracy and object arrangement. Code is available at https://chengzhag.github.io/publication/dpc.

Temporal Knowledge Consistency for Unsupervised Visual Representation Learning

Comment: To appear in ICCV 2021

Link: http://arxiv.org/abs/2108.10668

Abstract

The instance discrimination paradigm has become dominant in unsupervised learning. It typically adopts a teacher-student framework, in which the teacher provides embedded knowledge as a supervision signal for the student. The student learns meaningful representations by enforcing instance spatial consistency with the views from the teacher. However, the outputs of the teacher can vary dramatically on the same instance during different training stages, introducing unexpected noise and leading to catastrophic forgetting caused by inconsistent objectives. In this paper, we first integrate instance temporal consistency into current instance discrimination paradigms, and propose a novel and strong algorithm named Temporal Knowledge Consistency (TKC). Specifically, TKC dynamically ensembles the knowledge of temporal teachers and adaptively selects useful information according to its importance for learning instance temporal consistency. Experimental results show that TKC learns better visual representations on both ResNet and AlexNet under the linear evaluation protocol while transferring well to downstream tasks. All experiments suggest the good effectiveness and generalization of our method.
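
As a rough sketch of the temporal-consistency idea, one can keep snapshots of the teacher from earlier training stages and use a weighted ensemble of their embeddings as the student's target. The fixed recency-based weighting below stands in for TKC's adaptive selection and is an assumption for illustration.

```python
import copy
import torch
import torch.nn.functional as F

class TemporalTeacherEnsemble:
    """Keep snapshots of the teacher from earlier training stages and ensemble
    their embeddings as the student's target (sketch, not TKC itself)."""

    def __init__(self, teacher, num_snapshots=3):
        self.snapshots = [copy.deepcopy(teacher) for _ in range(num_snapshots)]

    def update(self, teacher):
        # Drop the oldest snapshot and store the current teacher.
        self.snapshots.pop(0)
        self.snapshots.append(copy.deepcopy(teacher))

    @torch.no_grad()
    def target(self, images, recency_temp=1.0):
        embs = [F.normalize(t(images), dim=1) for t in self.snapshots]
        # Fixed recency weighting; TKC instead selects adaptively.
        weights = torch.softmax(
            torch.arange(len(embs), dtype=torch.float) / recency_temp, dim=0)
        target = sum(w * e for w, e in zip(weights, embs))
        return F.normalize(target, dim=1)

def consistency_loss(student_emb, ensemble_target):
    # Pull the student's embedding toward the ensembled temporal target.
    return (2 - 2 * (F.normalize(student_emb, dim=1) * ensemble_target).sum(dim=1)).mean()
```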

Improving Generalization of Batch Whitening by Convolutional Unit Optimization

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.10629

Abstract

Batch Whitening is a technique that accelerates and stabilizes training by transforming input features to have zero mean (Centering) and unit variance (Scaling), and by removing linear correlation between channels (Decorrelation). In commonly used structures, which are empirically optimized with Batch Normalization, the normalization layer appears between the convolution and the activation function. Subsequent Batch Whitening studies have employed the same structure without further analysis, even though Batch Whitening was analyzed on the premise that the input of a linear layer is whitened. To bridge the gap, we propose a new Convolutional Unit that is in line with the theory, and our method generally improves the performance of Batch Whitening. Moreover, we show the inefficacy of the original Convolutional Unit by investigating the rank and correlation of features. As our method can employ off-the-shelf whitening modules, we use Iterative Normalization (IterNorm), the state-of-the-art whitening module, and obtain significantly improved performance on five image classification datasets: CIFAR-10, CIFAR-100, CUB-200-2011, Stanford Dogs, and ImageNet. Notably, we verify that our method improves the stability and performance of whitening when using a large learning rate, group size, and iteration number.
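
One reading of the abstract is that the whitening layer should feed the convolution directly, so that the linear layer actually receives whitened input as the theory assumes. The sketch below contrasts the standard unit with such a reordered unit; the exact ordering, and the use of BatchNorm2d as a stand-in for an IterNorm-style whitening module, are assumptions for illustration.

```python
import torch.nn as nn

# Standard unit used with Batch Normalization / Whitening:
#   Conv -> Whiten -> Activation  (the conv input is NOT whitened).
def standard_unit(in_ch, out_ch, whiten_layer):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        whiten_layer(out_ch),
        nn.ReLU(inplace=True),
    )

# Reordered unit, assuming the point is that the linear (conv) layer should
# receive whitened input:  Whiten -> Activation -> Conv.
def whiten_first_unit(in_ch, out_ch, whiten_layer):
    return nn.Sequential(
        whiten_layer(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
    )

# BatchNorm2d only centers and scales (no decorrelation); it merely stands in
# for a whitening module such as IterNorm in this sketch.
block = whiten_first_unit(64, 128, nn.BatchNorm2d)
```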

Making Higher Order MOT Scalable: An Efficient Approximate Solver for Lifted Disjoint Paths

Comment: ICCV 2021. Short version published at the CVPR 2021 RVSU workshop: https://omnomnom.vision.rwth-aachen.de/data/RobMOTS/workshop/papers/9/CameraReady/paper_V3.pdf. Implementation available at https://github.com/LPMP/LPMP and https://github.com/TimoK93/ApLift

Link: http://arxiv.org/abs/2108.10606

Abstract

We present an efficient approximate message passing solver for the lifted disjoint paths problem (LDP), a natural but NP-hard model for multiple object tracking (MOT). Our tracker scales to very large instances that come from long and crowded MOT sequences. Our approximate solver enables us to process the MOT15/16/17 benchmarks without sacrificing solution quality and allows solving MOT20, which has until now been out of reach for LDP solvers due to its size and complexity. On all four of these standard MOT benchmarks we achieve performance comparable to or better than current state-of-the-art methods, including a tracker based on an optimal LDP solver.

Support-Set Based Cross-Supervision for Video Grounding

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.10576

Abstract

Current approaches for video grounding propose various complex architectures to capture video-text relations and have achieved impressive improvements. In practice, however, it is hard to learn such complicated multi-modal relations through architecture design alone. In this paper, we introduce a novel Support-set Based Cross-Supervision (Sscs) module which can improve existing methods during the training phase without extra inference cost. The proposed Sscs module contains two main components, i.e., a discriminative contrastive objective and a generative caption objective. The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective can train a powerful video encoder supervised by texts. Because some visual entities co-exist in both ground-truth and background intervals, i.e., mutual exclusion, naive contrastive learning is unsuitable for video grounding. We address the problem by boosting the cross-supervision with the support-set concept, which collects visual information from the whole video and eliminates the mutual exclusion of entities. Combined with the original objectives, Sscs enhances the multi-modal relation modeling ability of existing approaches. We extensively evaluate Sscs on three challenging datasets and show that our method improves current state-of-the-art methods by large margins, notably by 6.35% in terms of R1@0.5 on Charades-STA.

ShapeConv: Shape-aware Convolutional Layer for Indoor RGB-D Semantic Segmentation

Comment: ICCV2021

Link: http://arxiv.org/abs/2108.10528

Abstract

RGB-D semantic segmentation has attracted increasing attention over the past few years. Existing methods mostly employ homogeneous convolution operators to consume the RGB and depth features, ignoring their intrinsic differences. In fact, the RGB values capture the photometric appearance properties in the projected image space, while the depth feature encodes both the shape of a local geometry and its base (whereabouts) in a larger context. Compared with the base, the shape is probably more inherent and has a stronger connection to the semantics, and is thus more critical for segmentation accuracy. Inspired by this observation, we introduce a Shape-aware Convolutional layer (ShapeConv) for processing the depth feature: the depth feature is first decomposed into a shape component and a base component, two learnable weights are then introduced to cooperate with them independently, and finally a convolution is applied to the re-weighted combination of these two components. ShapeConv is model-agnostic and can be easily integrated into most CNNs, replacing vanilla convolutional layers for semantic segmentation. Extensive experiments on three challenging indoor RGB-D semantic segmentation benchmarks, i.e., NYU-Dv2(-13,-40), SUN RGB-D, and SID, demonstrate the effectiveness of ShapeConv when it is employed on five popular architectures. Moreover, the performance of CNNs with ShapeConv is boosted without introducing any computation or memory increase in the inference phase. The reason is that the learned weights for balancing the importance of the shape and base components in ShapeConv become constants in the inference phase, and can thus be fused into the following convolution, resulting in a network identical to one with vanilla convolutional layers.
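
The decomposition described in the abstract can be sketched directly: split every k x k depth-feature patch into its mean (base) and its deviation from the mean (shape), re-weight the two parts with learnable scalars, and convolve the recombined patches. Scalar (rather than per-channel) weights and the unfold-based implementation below are simplifications of the paper's ShapeConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeConvSketch(nn.Module):
    """Sketch of a shape/base decomposition followed by a convolution."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch * k * k))
        self.w_base = nn.Parameter(torch.ones(1))    # weight for the patch mean
        self.w_shape = nn.Parameter(torch.ones(1))   # weight for the deviation

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        patches = F.unfold(x, self.k, padding=self.k // 2)      # (B, C*k*k, H*W)
        patches = patches.view(B, C, self.k * self.k, H * W)
        base = patches.mean(dim=2, keepdim=True)                # base component
        shape = patches - base                                  # shape component
        recombined = self.w_base * base + self.w_shape * shape  # re-weighted mix
        recombined = recombined.view(B, C * self.k * self.k, H * W)
        out = self.weight @ recombined                           # conv as matmul
        return out.view(B, -1, H, W)

layer = ShapeConvSketch(16, 32)
y = layer(torch.randn(2, 16, 40, 40))                # -> (2, 32, 40, 40)
```

At inference time w_base and w_shape are constants, which is why such a re-weighting can in principle be folded into the convolution weights, as the abstract notes.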
