About #今日arXiv精选 (Today's arXiv Picks)
This is a column from 「AI 学术前沿」 (AI Academic Frontier): each day the editors select high-quality papers from arXiv and deliver them to readers.
A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems
Comment: This paper has been accepted by Interspeech 2021
Link: http://arxiv.org/abs/2108.07493
Abstract
It is challenging to customize transducer-based automatic speech recognition (ASR) systems with context information that is dynamic and unavailable during model training. In this work, we introduce a light-weight contextual spelling correction model to correct context-related recognition errors in transducer-based ASR systems. We incorporate the context information into the spelling correction model with a shared context encoder and use a filtering algorithm to handle large-size context lists. Experiments show that the model improves baseline ASR model performance with about a 50% relative word error rate reduction, and also significantly outperforms baseline methods such as contextual LM biasing. The model also shows excellent performance for out-of-vocabulary terms not seen during training.
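Editor's note: the abstract does not spell out the filtering algorithm; as a rough, hedged sketch, one could pre-filter a large context list by string similarity to the ASR hypothesis before feeding the survivors to the correction model's context encoder. The helper below (`filter_context`, a hypothetical name) uses Python's standard-library SequenceMatcher and only illustrates the idea, not the paper's actual method.

```python
# Illustrative only: keep context phrases most similar to the ASR hypothesis
# before running the (much heavier) contextual spelling correction model.
from difflib import SequenceMatcher

def filter_context(hypothesis: str, context_phrases: list[str],
                   threshold: float = 0.4, max_keep: int = 100) -> list[str]:
    """Return the context phrases whose character-level similarity to the
    hypothesis exceeds a threshold, ranked by similarity."""
    scored = []
    for phrase in context_phrases:
        # Ratio of matching characters between phrase and hypothesis (0..1).
        score = SequenceMatcher(None, phrase.lower(), hypothesis.lower()).ratio()
        if score >= threshold:
            scored.append((score, phrase))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:max_keep]]

# Example: narrow a large contact-name list down to correction candidates.
candidates = filter_context("call jon smitt", ["John Smith", "Jane Doe", "Jonathan Smythe"])
```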
Modeling Protein Using Large-scale Pretrain Language Model
Comment: Accepted paper in Pretrain@KDD 2021 (The International Workshop on Pretraining: Algorithms, Architectures, and Applications)
Link: http://arxiv.org/abs/2108.07435
Abstract
Protein is linked to almost every life process. Therefore, analyzing the biological structure and properties of protein sequences is critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model data patterns in large quantities of data. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g., using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in the representations. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.
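Editor's note: to make the "protein sequence as language" analogy concrete, here is a toy sketch of masking amino-acid tokens for masked-language-model pretraining. The 20-letter vocabulary, 15% masking rate, and the -100 ignore label are common MLM conventions, not necessarily the ProteinLM recipe.

```python
# Toy masked-LM preprocessing for a protein sequence (amino acids as tokens).
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(VOCAB)                            # one extra id for the [MASK] token

def mask_sequence(seq: str, mask_prob: float = 0.15, seed: int = 0):
    """Return (input ids with some positions masked, labels with -100 at unmasked positions)."""
    rng = random.Random(seed)
    ids = [VOCAB[aa] for aa in seq]
    labels = [-100] * len(ids)                  # -100 marks positions ignored by the loss
    for i in range(len(ids)):
        if rng.random() < mask_prob:
            labels[i] = ids[i]                  # predict the original amino acid here
            ids[i] = MASK_ID
    return ids, labels

tokens, labels = mask_sequence("MKTAYIAKQR")
```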
Who's Waldo? Linking People Across Text and Images
Comment: Published in ICCV 2021 (Oral).
Project: https://whoswaldo.github.io
Link: http://arxiv.org/abs/2108.07253
Abstract
We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.
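Editor's note: the core data idea is simple to illustrate: replace named people in the caption with a placeholder so a model cannot rely on name-appearance priors. The snippet below assumes names are already identified (e.g., by an NER step) and uses a hypothetical placeholder token; the dataset's actual masking pipeline may differ.

```python
# Minimal sketch of name masking for person-centric grounding data.
def mask_names(caption: str, names: list[str], placeholder: str = "[NAME]") -> str:
    """Replace every occurrence of a known person name with a placeholder token."""
    masked = caption
    for name in names:
        masked = masked.replace(name, placeholder)
    return masked

print(mask_names("Alice shakes hands with Bob at the ceremony.", ["Alice", "Bob"]))
# -> "[NAME] shakes hands with [NAME] at the ceremony."
```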
Group-aware Contrastive Regression for Action Quality Assessment
Comment: Accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.07797
Abstract
Assessing action quality is challenging due to the subtle differences between videos and large variations in scores. Most existing approaches tackle this problem by regressing a quality score from a single video, suffering a lot from the large inter-video score variations. In this paper, we show that the relations among videos can provide important clues for more accurate action quality assessment during both training and inference. Specifically, we reformulate the problem of action quality assessment as regressing the relative scores with reference to another video that has shared attributes (e.g., category and difficulty), instead of learning unreferenced scores. Following this formulation, we propose a new Contrastive Regression (CoRe) framework to learn the relative scores by pair-wise comparison, which highlights the differences between videos and guides the models to learn the key hints for assessment. In order to further exploit the relative information between two videos, we devise a group-aware regression tree to convert the conventional score regression into two easier sub-problems: coarse-to-fine classification and regression in small intervals. To demonstrate the effectiveness of CoRe, we conduct extensive experiments on three mainstream AQA datasets including AQA-7, MTL-AQA and JIGSAWS. Our approach outperforms previous methods by a large margin and establishes new state-of-the-art on all three benchmarks.
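Editor's note: as a hedged sketch of the contrastive-regression formulation, the head below predicts a score difference relative to an exemplar video with a known score, rather than an absolute score. Module names and feature dimensions are illustrative placeholders; the paper's full method additionally uses a group-aware regression tree.

```python
# Sketch: regress the relative score of a query video w.r.t. an exemplar video.
import torch
import torch.nn as nn

class RelativeScoreHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, query_feat, exemplar_feat, exemplar_score):
        # Predict the relative score, then add the exemplar's known score.
        delta = self.regressor(torch.cat([query_feat, exemplar_feat], dim=-1))
        return exemplar_score + delta.squeeze(-1)

# Training step (sketch): regress toward the ground-truth query score.
head = RelativeScoreHead()
q, e = torch.randn(8, 512), torch.randn(8, 512)
e_score, q_score = torch.rand(8) * 10, torch.rand(8) * 10
loss = nn.functional.mse_loss(head(q, e, e_score), q_score)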
RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection
Comment: Accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.07794
Abstract
3D point cloud understanding has made great progress in recent years. However, one major bottleneck is the scarcity of annotated real datasets, especially compared to 2D object detection tasks, since a large amount of labor is involved in annotating the real scans of a scene. A promising solution to this problem is to make better use of synthetic datasets, which consist of CAD object models, to boost learning on real datasets. This can be achieved by a pre-training and fine-tuning procedure. However, recent work on 3D pre-training exhibits failure when transferring features learned on synthetic objects to other real-world applications. In this work, we put forward a new method called RandomRooms to accomplish this objective. In particular, we propose to generate random layouts of a scene by making use of the objects in the synthetic CAD dataset and to learn the 3D scene representation by applying object-level contrastive learning on two random scenes generated from the same set of synthetic objects. The model pre-trained in this way can serve as a better initialization when later fine-tuning on the 3D object detection task. Empirically, we show consistent improvement in downstream 3D detection tasks on several base models, especially when less training data are used, which strongly demonstrates the effectiveness and generalization of our method. Benefiting from the rich semantic knowledge and diverse objects from synthetic data, our method establishes the new state-of-the-art on the widely-used 3D detection benchmarks ScanNetV2 and SUN RGB-D. We expect our attempt to provide a new perspective for bridging object- and scene-level 3D understanding.
End-to-End Dense Video Captioning with Parallel Decoding
Comment: Accepted by ICCV 2021
Link: http://arxiv.org/abs/2108.07781
Abstract
Dense video captioning aims to generate multiple associated captions with their temporal locations from a video. Previous methods follow a sophisticated "localize-then-describe" scheme, which heavily relies on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), by formulating dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set of an appropriate size; (2) in contrast to adopting a two-stage scheme, we feed the enhanced representations of event queries into the localization head and caption head in parallel, making these two sub-tasks deeply interrelated and mutually promoted through the optimization; (3) without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results, surpassing the state-of-the-art two-stage methods when its localization accuracy is on par with them. Code is available at https://github.com/ttengwang/PDVC.
Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation
Comment: ICCV 2021.
Code: https://github.com/csyxwei/OroJaR
Link: http://arxiv.org/abs/2108.07668
Abstract
Unsupervised disentanglement learning is a crucial issue for understanding and exploiting deep generative models. Recently, SeFa tries to find latent disentangled directions by performing SVD on the first projection of a pre-trained GAN. However, it is only applied to the first layer and works in a post-processing way. Hessian Penalty minimizes the off-diagonal entries of the output's Hessian matrix to facilitate disentanglement, and can be applied to multiple layers. However, it constrains each entry of the output independently, making it insufficient for disentangling the latent directions (e.g., shape, size, rotation, etc.) of spatially correlated variations. In this paper, we propose a simple Orthogonal Jacobian Regularization (OroJaR) to encourage deep generative models to learn disentangled representations. It simply encourages the variations of output caused by perturbations on different latent dimensions to be orthogonal, and the Jacobian with respect to the input is calculated to represent this variation. We show that our OroJaR also encourages the output's Hessian matrix to be diagonal in an indirect manner. In contrast to the Hessian Penalty, our OroJaR constrains the output in a holistic way, making it very effective in disentangling latent dimensions corresponding to spatially correlated variations. Quantitative and qualitative experimental results show that our method is effective in disentangled and controllable image generation, and performs favorably against the state-of-the-art methods. Our code is available at https://github.com/csyxwei/OroJaR.
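Editor's note: the regularizer itself can be written down compactly. The sketch below computes the full Jacobian of a toy generator's output with respect to the latent code and penalizes the off-diagonal entries of J^T J; this brute-force form is only practical for small latent dimensions, and the paper uses a more scalable approximation, so treat this purely as an illustration of the objective.

```python
# Brute-force sketch of an orthogonal-Jacobian penalty for a small toy generator.
import torch
from torch.autograd.functional import jacobian

def orojar_loss(generator, z):
    """Off-diagonal penalty on J^T J for a single latent code z of shape (latent_dim,)."""
    # J: (output_dim, latent_dim) -- variation of the output along each latent direction.
    J = jacobian(lambda v: generator(v.unsqueeze(0)).flatten(), z, create_graph=True)
    gram = J.t() @ J
    off_diag = gram - torch.diag(torch.diagonal(gram))
    return (off_diag ** 2).sum()

# Toy usage: add the regularizer (with a small weight) to the generator loss.
gen = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3 * 16 * 16))
z = torch.randn(8)
reg = orojar_loss(gen, z)
```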
Look Who's Talking: Active Speaker Detection in the Wild
Comment: To appear in Interspeech 2021.
Data: https://github.com/clovaai/lookwhostalking
Link: http://arxiv.org/abs/2108.07640
Abstract
In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate the performance of active speaker detection. We therefore curate the Active Speakers in the Wild (ASW) dataset, which contains videos and co-occurring speech segments with dense speech activity labels. Videos and timestamps of audible segments are parsed and adopted from VoxConverse, an existing speaker diarisation dataset that consists of videos in the wild. Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way. Two reference systems, a self-supervised system and a fully supervised one, are evaluated on the dataset to provide the baseline performances of ASW. Cross-domain evaluation is conducted in order to show the negative effect of dubbed videos in the training data.
Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation
Comment: Accepted by ICCV 2021
Link: http://arxiv.org/abs/2108.07628
Abstract
Remarkable results have been achieved by DCNN-based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, while their performance degrades for all-day images due to the large domain shift and the variation of illumination between day and night images. To relieve these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: private and invariant domains, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains essential shared information (texture, etc.). Meanwhile, to guarantee that the day and night images contain the same information, the domain-separated network takes the day-time images and corresponding night-time images (generated by GAN) as input, and the private and invariant feature extractors are learned by orthogonality and similarity losses, so that the domain gap can be alleviated and better depth maps can be expected. Meanwhile, the reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach.
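Editor's note: the two feature-space constraints mentioned above can be sketched as an orthogonality term pushing private and invariant features apart and a similarity term pulling together the invariant features of a day image and its GAN-generated night counterpart. Function names, normalization, and weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of the orthogonality and similarity constraints between feature sub-spaces.
import torch
import torch.nn.functional as F

def orthogonality_loss(private_feat, invariant_feat):
    """Penalize correlation between private and invariant features (batch of feature maps)."""
    p = F.normalize(private_feat.flatten(1), dim=1)
    i = F.normalize(invariant_feat.flatten(1), dim=1)
    return (p * i).sum(dim=1).pow(2).mean()

def similarity_loss(day_invariant, night_invariant):
    """Invariant content should match across a day image and its generated night counterpart."""
    return F.mse_loss(day_invariant, night_invariant)
```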
DRÆM -- A discriminatively trained reconstruction embedding for surface anomaly detection
Comment: Accepted to ICCV2021
Link: http://arxiv.org/abs/2108.07610
Abstract
Visual surface anomaly detection aims to detect local image regions that significantly deviate from normal appearance. Recent surface anomaly detection methods rely on generative models to accurately reconstruct the normal areas and to fail on anomalies. These methods are trained only on anomaly-free images, and often require hand-crafted post-processing steps to localize the anomalies, which prohibits optimizing the feature extraction for maximal detection capability. In addition to the reconstructive approach, we cast surface anomaly detection primarily as a discriminative problem and propose a discriminatively trained reconstruction anomaly embedding model (DRAEM). The proposed method learns a joint representation of an anomalous image and its anomaly-free reconstruction, while simultaneously learning a decision boundary between normal and anomalous examples. The method enables direct anomaly localization without the need for additional complicated post-processing of the network output and can be trained using simple and general anomaly simulations. On the challenging MVTec anomaly detection dataset, DRAEM outperforms the current state-of-the-art unsupervised methods by a large margin and even delivers detection performance close to the fully-supervised methods on the widely used DAGM surface-defect detection dataset, while substantially outperforming them in localization accuracy.
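Editor's note: "simple and general anomaly simulations" can be illustrated with a toy augmentation that pastes a random noise patch into an anomaly-free image and keeps the patch mask as the segmentation target. DRAEM's actual simulation (Perlin-noise-shaped regions blended with out-of-distribution textures) is more elaborate; the function below is only a hedged approximation of the idea.

```python
# Toy anomaly simulation: paste a random noise patch and return the ground-truth mask.
import numpy as np

def simulate_anomaly(image: np.ndarray, rng: np.random.Generator):
    """image: float array (H, W, C) in [0, 1]; returns (augmented image, binary mask)."""
    h, w, _ = image.shape
    ph, pw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    augmented, mask = image.copy(), np.zeros((h, w), dtype=np.float32)
    augmented[y:y + ph, x:x + pw] = rng.random((ph, pw, image.shape[2]))  # simulated defect
    mask[y:y + ph, x:x + pw] = 1.0                                        # segmentation target
    return augmented, mask
```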
Self-Supervised Pretraining and Controlled Augmentation Improve Rare Wildlife Recognition in UAV Images
Comment: Accepted by 2021 IEEE/CVF International Conference on Computer Vision (ICCV) Workshops
Link: http://arxiv.org/abs/2108.07582
Abstract
Automated animal censuses with aerial imagery are a vital ingredient towards wildlife conservation. Recent models are generally based on deep learning and thus require vast amounts of training data. Due to their scarcity and minuscule size, annotating animals in aerial imagery is a highly tedious process. In this project, we present a methodology to reduce the amount of required training data by resorting to self-supervised pretraining. In detail, we examine a combination of recent contrastive learning methodologies like Momentum Contrast (MoCo) and Cross-Level Instance-Group Discrimination (CLD) to condition our model on the aerial images without the requirement for labels. We show that a combination of MoCo, CLD, and geometric augmentations outperforms conventional models pre-trained on ImageNet by a large margin. Crucially, our method still yields favorable results even if we reduce the number of training animals to just 10%, at which point our best model scores double the recall of the baseline at similar precision. This effectively allows reducing the number of required annotations to a fraction while still being able to train high-accuracy models in such highly challenging settings.
Investigating transformers in the decomposition of polygonal shapes as point collections
Comment: DLGC@ICCVW 2021
Link: http://arxiv.org/abs/2108.07533
Abstract
Transformers can generate predictions in two ways: 1. auto-regressively, by conditioning each sequence element on the previous ones, or 2. by directly producing output sequences in parallel. While research has mostly explored this difference on sequential tasks in NLP, we study the difference between auto-regressive and parallel prediction on visual set prediction tasks, and in particular on polygonal shapes in images, because polygons are representative of numerous types of objects, such as buildings or obstacles for aerial vehicles. This is challenging for deep learning architectures as a polygon can consist of a varying cardinality of points. We provide evidence on the importance of natural orders for Transformers, and show the benefit of decomposing complex polygons into collections of points in an auto-regressive manner.
Unsupervised Geodesic-preserved Generative Adversarial Networks for Unconstrained 3D Pose Transfer
Comment: ICCV 2021
Link: http://arxiv.org/abs/2108.07520
Abstract
With the strength of deep generative models, 3D pose transfer has regained intensive research interest in recent years. Existing methods mainly rely on a variety of constraints to achieve pose transfer over 3D meshes, e.g., the need for manual encoding for shape and pose disentanglement. In this paper, we present an unsupervised approach to conduct the pose transfer between any arbitrary given 3D meshes. Specifically, a novel Intrinsic-Extrinsic Preserved Generative Adversarial Network (IEP-GAN) is presented for both intrinsic (i.e., shape) and extrinsic (i.e., pose) information preservation. Extrinsically, we propose a co-occurrence discriminator to capture the structural/pose invariance from distinct Laplacians of the mesh. Meanwhile, intrinsically, a local intrinsic-preserved loss is introduced to preserve the geodesic priors while avoiding heavy computations. Finally, we show the possibility of using IEP-GAN to manipulate 3D human meshes in various ways, including pose transfer, identity swapping and pose interpolation with latent code vector arithmetic. The extensive experiments on various 3D datasets of humans, animals and hands qualitatively and quantitatively demonstrate the generality of our approach. Our proposed model produces better results and is substantially more efficient compared to recent state-of-the-art methods. Code is available at https://github.com/mikecheninoulu/Unsupervised_IEPGAN.
PR-RRN: Pairwise-Regularized Residual-Recursive Networks for Non-rigid Structure-from-Motion
Comment: Accepted to ICCV 2021
Link: http://arxiv.org/abs/2108.07506
Abstract
We propose PR-RRN, a novel neural-network-based method for Non-rigid Structure-from-Motion (NRSfM). PR-RRN consists of Residual-Recursive Networks (RRN) and two extra regularization losses. RRN is designed to effectively recover 3D shape and camera from 2D keypoints with a novel residual-recursive structure. As NRSfM is a highly under-constrained problem, we propose two new pairwise regularizations to further regularize the reconstruction. The Rigidity-based Pairwise Contrastive Loss regularizes the shape representation by encouraging higher similarity between the representations of high-rigidity pairs of frames than low-rigidity pairs. We propose the minimum singular-value ratio to measure pairwise rigidity. The Pairwise Consistency Loss enforces the reconstruction to be consistent when the estimated shapes and cameras are exchanged between pairs. Our approach achieves state-of-the-art performance on the CMU MOCAP and PASCAL3D+ datasets.
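Editor's note: a hedged sketch of the rigidity-based pairwise contrastive idea follows, using a margin hinge so that a high-rigidity frame pair must be more similar in representation space than a low-rigidity pair. The cosine similarity and margin value are illustrative choices, and the rigidity scoring itself (the minimum singular-value ratio) is not reproduced here.

```python
# Sketch of a rigidity-based pairwise contrastive loss over shape representations.
import torch
import torch.nn.functional as F

def rigidity_contrastive_loss(anchor, pos_high_rigidity, neg_low_rigidity, margin: float = 0.2):
    """anchor / pos / neg: (B, D) shape representations of frames."""
    sim_pos = F.cosine_similarity(anchor, pos_high_rigidity, dim=-1)
    sim_neg = F.cosine_similarity(anchor, neg_low_rigidity, dim=-1)
    # Hinge: the high-rigidity pair must beat the low-rigidity pair by the margin.
    return F.relu(margin + sim_neg - sim_pos).mean()
```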
G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-guided Feature Imitation
Comment: Accepted by ICCV2021
Link: http://arxiv.org/abs/2108.07482
Abstract
In this paper, we investigate the knowledge distillation (KD) strategy for object detection and propose an effective framework applicable to both homogeneous and heterogeneous student-teacher pairs. The conventional feature imitation paradigm introduces imitation masks to focus on informative foreground areas while excluding the background noise. However, we find that those methods fail to fully utilize the semantic information in all feature pyramid levels, which leads to inefficiency for knowledge distillation between FPN-based detectors. To this end, we propose a novel semantic-guided feature imitation technique, which automatically performs soft matching between feature pairs across all pyramid levels to provide the optimal guidance to the student. To push the envelope even further, we introduce contrastive distillation to effectively capture the information encoded in the relationship between different feature regions. Finally, we propose a generalized detection KD pipeline, which is capable of distilling both homogeneous and heterogeneous detector pairs. Our method consistently outperforms existing detection KD techniques, and works (1) when components in the framework are used separately and in conjunction; (2) for both homogeneous and heterogeneous student-teacher pairs; and (3) on multiple detection benchmarks. With a powerful X101-FasterRCNN-Instaboost detector as the teacher, R50-FasterRCNN reaches 44.0% AP, R50-RetinaNet reaches 43.3% AP and R50-FCOS reaches 43.1% AP on the COCO dataset.
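Editor's note: the semantic-guided soft matching can be sketched as weighting all teacher pyramid levels by their similarity to a student level and imitating the resulting weighted combination. Channel-matching projections and the paper's exact similarity measure are omitted, so treat the code as an assumption-laden illustration rather than the paper's loss.

```python
# Sketch: soft matching of one student FPN level against all teacher pyramid levels.
import torch
import torch.nn.functional as F

def soft_matching_imitation(student_feat, teacher_feats):
    """student_feat: (B, C, H, W); teacher_feats: list of (B, C, Hi, Wi) tensors."""
    # Bring every teacher level to the student's spatial resolution.
    resized = [F.interpolate(t, size=student_feat.shape[-2:], mode="bilinear",
                             align_corners=False) for t in teacher_feats]
    s = student_feat.flatten(2)                              # (B, C, H*W)
    sims = torch.stack([F.cosine_similarity(s, t.flatten(2), dim=1).mean(dim=-1)
                        for t in resized], dim=1)            # (B, num_levels)
    weights = sims.softmax(dim=1)                            # soft assignment over teacher levels
    target = sum(w.view(-1, 1, 1, 1) * t
                 for w, t in zip(weights.unbind(dim=1), resized))
    return F.mse_loss(student_feat, target.detach())
```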
Instance Segmentation in 3D Scenes using Semantic Superpoint Tree Networks
Comment: Accepted by ICCV2021
Link: http://arxiv.org/abs/2108.07478
Abstract
Instance segmentation in 3D scenes is fundamental in many applications of scene understanding. It is yet challenging due to the compound factors of data irregularity and uncertainty in the number of instances. State-of-the-art methods largely rely on a general pipeline that first learns point-wise features discriminative at semantic and instance levels, followed by a separate step of point grouping for proposing object instances. While promising, they have the shortcomings that (1) the second step is not supervised by the main objective of instance segmentation, and (2) their point-wise feature learning and grouping are less effective at dealing with data irregularities, possibly resulting in fragmented segmentations. To address these issues, we propose in this work an end-to-end solution, the Semantic Superpoint Tree Network (SSTNet), for proposing object instances from scene points. Key in SSTNet is an intermediate semantic superpoint tree (SST), which is constructed based on the learned semantic features of superpoints, and which is traversed and split at intermediate tree nodes for proposals of object instances. We also design in SSTNet a refinement module, termed CliqueNet, to prune superpoints that may be wrongly grouped into instance proposals. Experiments on the ScanNet and S3DIS benchmarks show the efficacy of our proposed method. At the time of submission, SSTNet ranks top on the ScanNet (V2) leaderboard, with mAP 2% higher than the second-best method. The source code in PyTorch is available at https://github.com/Gorilla-Lab-SCUT/SSTNet.
Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences
Comment: ICCV 2021
Link: http://arxiv.org/abs/2108.07422
Abstract
We address the problem of visible-infrared person re-identification (VI-reID), that is, retrieving a set of person images, captured by visible or infrared cameras, in a cross-modal setting. The two main challenges in VI-reID are intra-class variations across person images and cross-modal discrepancies between visible and infrared images. Assuming that the person images are roughly aligned, previous approaches attempt to learn coarse image- or rigid part-level person representations that are discriminative and generalizable across different modalities. However, the person images, typically cropped by off-the-shelf object detectors, are not necessarily well-aligned, which distracts discriminative person representation learning. In this paper, we introduce a novel feature learning framework that addresses these problems in a unified way. To this end, we propose to exploit dense correspondences between cross-modal person images. This allows addressing the cross-modal discrepancies at the pixel level, suppressing modality-related features from person representations more effectively. This also encourages pixel-wise associations between cross-modal local features, further facilitating discriminative feature learning for VI-reID. Extensive experiments and analyses on standard VI-reID benchmarks demonstrate the effectiveness of our approach, which significantly outperforms the state of the art.
Contextual Convolutional Neural Networks
Comment: Accepted at ICCV Workshop on Neural Architectures (NeurArch 2021)
Link: http://arxiv.org/abs/2108.07387
Abstract
We propose contextual convolution (CoConv) for visual recognition. CoConv is a direct replacement of the standard convolution, which is the core component of convolutional neural networks. CoConv is implicitly equipped with the capability of incorporating contextual information while maintaining a similar number of parameters and computational cost compared to the standard convolution. CoConv is inspired by neuroscience studies indicating that (i) neurons, even from the primary visual cortex (V1 area), are involved in the detection of contextual cues and that (ii) the activity of a visual neuron can be influenced by stimuli placed entirely outside of its theoretical receptive field. On the one hand, we integrate CoConv into the widely-used residual networks and show improved recognition performance over baselines on the core tasks and benchmarks for visual recognition, namely image classification on the ImageNet data set and object detection on the MS COCO data set. On the other hand, we introduce CoConv in the generator of a state-of-the-art Generative Adversarial Network, showing improved generative results on CIFAR-10 and CelebA. Our code is available at https://github.com/iduta/coconv.
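Editor's note: one plausible way to incorporate context at near-constant cost, used here purely as an assumption-based illustration of the idea, is to split the output channels across parallel 3x3 branches with increasing dilation so that part of the layer sees a wider receptive field. This is not presented as the exact CoConv design; see the paper and its code for the real layer.

```python
# Illustrative context-aware drop-in for a 3x3 convolution via channel-split dilated branches.
import torch
import torch.nn as nn

class ContextualConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        split = [out_ch // len(dilations)] * len(dilations)
        split[0] += out_ch - sum(split)                     # absorb any remainder in the first branch
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, c, kernel_size=3, padding=d, dilation=d)
            for c, d in zip(split, dilations)
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

layer = ContextualConv2d(64, 64)
y = layer(torch.randn(2, 64, 32, 32))                       # same spatial size, 64 output channels
```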
Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation
Comment: ICCV 2021
Link: http://arxiv.org/abs/2108.07181
Abstract
Various deep learning techniques have been proposed to solve the single-view 2D-to-3D pose estimation problem. While the average prediction accuracy has been improved significantly over the years, the performance on hard poses with depth ambiguity, self-occlusion, and complex or rare poses is still far from satisfactory. In this work, we target these hard poses and present a novel skeletal GNN learning solution. To be specific, we propose a hop-aware hierarchical channel-squeezing fusion layer to effectively extract relevant information from neighboring nodes while suppressing undesired noise in GNN learning. In addition, we propose a temporal-aware dynamic graph construction procedure that is robust and effective for 3D pose estimation. Experimental results on the Human3.6M dataset show that our solution achieves a 10.3% average prediction accuracy improvement and greatly improves on hard poses over state-of-the-art techniques. We further apply the proposed technique to the skeleton-based action recognition task and also achieve state-of-the-art performance. Our code is available at https://github.com/ailingzengzzz/Skeletal-GNN.
MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction
Comment: Latest camera-ready version (accepted by ICCV 2021)
Link: http://arxiv.org/abs/2108.07152
Abstract
Human motion prediction is a challenging task due to the stochasticity and aperiodicity of future poses. Recently, graph convolutional networks have been proven to be very effective at learning dynamic relations among pose joints, which is helpful for pose prediction. On the other hand, one can abstract a human pose recursively to obtain a set of poses at multiple scales. With the increase of the abstraction level, the motion of the pose becomes more stable, which benefits pose prediction too. In this paper, we propose a novel Multi-Scale Residual Graph Convolution Network (MSR-GCN) for the human pose prediction task in an end-to-end manner. The GCNs are used to extract features from fine to coarse scale and then from coarse to fine scale. The extracted features at each scale are then combined and decoded to obtain the residuals between the input and target poses. Intermediate supervision is imposed on all the predicted poses, which enforces the network to learn more representative features. Our proposed approach is evaluated on two standard benchmark datasets, i.e., the Human3.6M dataset and the CMU Mocap dataset. Experimental results demonstrate that our method outperforms the state-of-the-art approaches. Code and pre-trained models are available at https://github.com/Droliven/MSRGCN.
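Editor's note: the multi-scale abstraction step can be illustrated by averaging groups of joints to form a coarser pose; the grouping below is a hypothetical example, not the paper's actual scale definition.

```python
# Sketch: abstract a pose to a coarser scale by averaging predefined joint groups.
import torch

def pool_pose(pose, groups):
    """pose: (B, J, 3); groups: list of joint-index lists -> coarser pose (B, len(groups), 3)."""
    return torch.stack([pose[:, idx, :].mean(dim=1) for idx in groups], dim=1)

# Toy usage: collapse a 22-joint pose into 4 coarse parts.
coarse = pool_pose(torch.randn(4, 22, 3), [[0, 1, 2], [3, 4, 5], [6, 7], list(range(8, 22))])
```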
Revisiting State Augmentation methods for Reinforcement Learning with Stochastic Delays
Comment: Accepted at CIKM'21
Link: http://arxiv.org/abs/2108.07555
Abstract
Several real-world scenarios, such as remote control and sensing, involve action and observation delays. The presence of delays degrades the performance of reinforcement learning (RL) algorithms, often to such an extent that algorithms fail to learn anything substantial. This paper formally describes the notion of Markov Decision Processes (MDPs) with stochastic delays and shows that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with a significantly simplified cost structure. We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built upon this framework achieves near-optimal rewards in environments with stochastic delays in actions and observations. The delay-resolved deep Q-network (DRDQN) algorithm is benchmarked on a variety of environments comprising multi-step and stochastic delays, and results in better performance, both in terms of achieving near-optimal rewards and minimizing the computational overhead thereof, with respect to the currently established algorithms.
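Editor's note: the state-augmentation transformation is easy to sketch for a constant action delay: append the queue of not-yet-applied actions to the observation so the augmented process is Markovian again. The generic reset/step interface and the constant delay below are assumptions for illustration; the paper handles stochastic delays in both actions and observations.

```python
# Sketch: wrap an environment so the agent sees (observation, pending actions).
from collections import deque
import numpy as np

class DelayResolvedWrapper:
    def __init__(self, env, delay: int, action_dim: int):
        self.env, self.delay, self.action_dim = env, delay, action_dim

    def reset(self):
        obs = self.env.reset()
        # Queue of actions chosen but not yet applied (initially zeros).
        self.pending = deque(np.zeros(self.action_dim) for _ in range(self.delay))
        return self._augment(obs)

    def step(self, action):
        self.pending.append(np.asarray(action, dtype=np.float64))
        delayed_action = self.pending.popleft()             # action actually applied this step
        obs, reward, done = self.env.step(delayed_action)
        return self._augment(obs), reward, done

    def _augment(self, obs):
        # Augmented state = current observation + all not-yet-applied actions.
        return np.concatenate([np.asarray(obs, dtype=np.float64), *self.pending])
```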
BOBCAT: Bilevel Optimization-Based Computerized Adaptive Testing
Comment: IJCAI 2021 with supplementary material
Link: http://arxiv.org/abs/2108.07386
Abstract
Computerized adaptive testing (CAT) refers to a form of testing that is personalized to every student/test taker. CAT methods adaptively select the next most informative question/item for each student given their responses to previous questions, effectively reducing test length. Existing CAT methods use item response theory (IRT) models to relate student ability to their responses to questions, and static question selection algorithms designed to reduce the ability estimation error as quickly as possible; therefore, these algorithms cannot improve by learning from large-scale student response data. In this paper, we propose BOBCAT, a Bilevel Optimization-Based framework for CAT, to directly learn a data-driven question selection algorithm from training data. BOBCAT is agnostic to the underlying student response model and is computationally efficient during the adaptive testing process. Through extensive experiments on five real-world student response datasets, we show that BOBCAT outperforms existing CAT methods (sometimes significantly) at reducing test length.
Monolithic vs. hybrid controller for multi-objective Sim-to-Real learning
Comment: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021
Link: http://arxiv.org/abs/2108.07514
Abstract
Simulation to real (Sim-to-Real) is an attractive approach to construct controllers for robotic tasks that are easier to simulate than to analytically solve. Working Sim-to-Real solutions have been demonstrated for tasks with a clear single objective such as "reach the target". Real-world applications, however, often consist of multiple simultaneous objectives such as "reach the target" but "avoid obstacles". A straightforward solution in the context of reinforcement learning (RL) is to combine multiple objectives into a multi-term reward function and train a single monolithic controller. Recently, a hybrid solution based on pre-trained single-objective controllers and a switching rule between them was proposed. In this work, we compare these two approaches in the multi-objective setting of a robot manipulator reaching a target while avoiding an obstacle. Our findings show that the training of a hybrid controller is easier and obtains a better success-failure trade-off than a monolithic controller. The controllers trained in simulation were verified on a real setup.