Today's arXiv Picks: Featured Papers on Transformers

About #今日arXiv精选 (Today's arXiv Picks)

This is a column from 「AI 学术前沿」 (AI Academic Frontier): each day the editors hand-pick high-quality papers from arXiv and deliver them to readers.

Attentive fine-tuning of Transformers for Translation of low-resourced languages @LoResMT 2021

Comment: 10 pages

Link: http://arxiv.org/abs/2108.08556

Abstract

This paper reports the Machine Translation (MT) systems submitted by the IIITT team for the English->Marathi and English->Irish language pairs at the LoResMT 2021 shared task. The task focuses on getting exceptional translations for rather low-resourced languages like Irish and Marathi. We fine-tune IndicTrans, a pretrained multilingual NMT model, for English->Marathi, using an external parallel corpus as input for additional training. We have used a pretrained Helsinki-NLP Opus MT English->Irish model for the latter language pair. Our approaches yield relatively promising results on the BLEU metrics. Under the team name IIITT, our systems ranked 1, 1, and 2 in English->Marathi, Irish->English, and English->Irish, respectively.
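
The abstract names Helsinki-NLP's Opus MT English->Irish model as the pretrained starting point. A minimal sketch of that kind of fine-tuning with the Hugging Face transformers library is shown below; the checkpoint id, toy parallel sentences, and hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal fine-tuning sketch, assuming the Hugging Face Opus MT checkpoint.
from transformers import MarianMTModel, MarianTokenizer
import torch

checkpoint = "Helsinki-NLP/opus-mt-en-ga"  # English->Irish Opus MT
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

# Toy parallel pair standing in for the external training corpus.
src = ["The weather is nice today."]
tgt = ["Tá an aimsir go deas inniu."]

batch = tokenizer(src, text_target=tgt, return_tensors="pt", padding=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
loss = model(**batch).loss  # seq2seq cross-entropy on the target tokens
loss.backward()
optimizer.step()
```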

Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models

Comment: Accepted by Interspeech2021

Link: http://arxiv.org/abs/2108.08451

Abstract

Spoken Language Understanding (SLU) is one essential step in building a dialogue system. Due to the expensive cost of obtaining labeled data, SLU suffers from the data scarcity problem. Therefore, in this paper, we focus on data augmentation for the slot filling task in SLU. To achieve that, we aim at generating more diverse data based on existing data. Specifically, we try to exploit the latent language knowledge from pretrained language models by finetuning them. We propose two strategies for the finetuning process: value-based and context-based augmentation. Experimental results on two public SLU datasets have shown that, compared with existing data augmentation methods, our proposed method can generate more diverse sentences and significantly improve performance on SLU.
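
To make the value-based idea concrete, here is a minimal illustration: mask an existing slot value and let a pretrained masked language model propose substitutes. The paper fine-tunes the language model first; this sketch uses an off-the-shelf model, and the example utterance and slot are invented for demonstration.

```python
# Value-based augmentation sketch with an off-the-shelf masked LM.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Hypothetical utterance with the city slot value masked out.
utterance = "book a flight to [MASK] tomorrow morning"
for candidate in fill(utterance, top_k=5):
    # Each candidate slot value yields a new, more diverse training sentence.
    print(candidate["token_str"], round(candidate["score"], 3))
```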

PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers

Comment: Accepted to ICCV 2021 (Oral Presentation)

Link: http://arxiv.org/abs/2108.08839

Abstract

Point clouds captured in real-world applications are often incomplete due to limited sensor resolution, single viewpoints, and occlusion. Therefore, recovering complete point clouds from partial ones becomes an indispensable task in many practical applications. In this paper, we present a new method that reformulates point cloud completion as a set-to-set translation problem and design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion. By representing the point cloud as a set of unordered groups of points with position embeddings, we convert the point cloud to a sequence of point proxies and employ transformers for point cloud generation. To facilitate transformers to better leverage the inductive bias about the 3D geometric structures of point clouds, we further devise a geometry-aware block that models the local geometric relationships explicitly. The migration of transformers enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Furthermore, we propose two more challenging benchmarks with more diverse incomplete point clouds that can better reflect real-world scenarios, to promote future research. Experimental results show that our method outperforms state-of-the-art methods by a large margin on both the new benchmarks and the existing ones. Code is available at https://github.com/yuxumin/PoinTr
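
A minimal sketch of the point-proxy idea follows: group the points, embed each group together with its center position, and run a transformer encoder-decoder over the resulting sequence. The grouping here is a naive stand-in, each output proxy decodes to a single point rather than a local patch, and all sizes are illustrative assumptions; see the linked repository for the actual model.

```python
import torch
import torch.nn as nn

B, N, G = 2, 2048, 128            # batch size, input points, proxy groups
pts = torch.randn(B, N, 3)        # a partial input point cloud

# Naive grouping stand-in: split the cloud into G contiguous chunks.
groups = pts.view(B, G, N // G, 3)
centers = groups.mean(dim=2)                       # (B, G, 3) group centers

embed = nn.Linear((N // G) * 3, 256)               # per-group feature embedding
pos = nn.Linear(3, 256)                            # position embedding of centers
proxies = embed(groups.flatten(2)) + pos(centers)  # (B, G, 256) point proxies

# Learnable queries stand in for proxies of the missing region.
queries = nn.Parameter(torch.randn(1, 64, 256)).expand(B, -1, -1)
tf = nn.Transformer(d_model=256, nhead=8, batch_first=True)
decoded = tf(proxies, queries)                     # (B, 64, 256)
pred = nn.Linear(256, 3)(decoded)                  # one 3D point per output proxy
print(pred.shape)                                  # torch.Size([2, 64, 3])
```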

Do Vision Transformers See Like Convolutional Neural Networks?

Comment: None

Link: http://arxiv.org/abs/2108.08810

Abstract

Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer.
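
This kind of cross-layer, cross-architecture comparison relies on a representation-similarity measure; linear CKA (centered kernel alignment) is a standard tool for exactly this analysis, though the abstract does not name the metric, so treat the choice as an assumption. A minimal NumPy sketch:

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_examples, n_features) activation matrices."""
    x = x - x.mean(axis=0)                      # center each feature
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2  # cross-covariance strength
    norm = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return hsic / norm

a = np.random.randn(500, 64)        # e.g. activations from one layer
q, _ = np.linalg.qr(np.random.randn(64, 64))
b = a @ q                           # an orthogonal rotation of the same features
print(linear_cka(a, b))             # ~1.0: CKA is invariant to rotations
```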

Video Relation Detection via Tracklet based Visual Transformer

Comment: 1st place in the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021

Link: http://arxiv.org/abs/2108.08669

Abstract

Video Visual Relation Detection (VidVRD) has received significant attention from our community over recent years. In this paper, we apply the state-of-the-art video object tracklet detection pipeline MEGA and deepSORT to generate tracklet proposals. Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations. Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware decoder which performs feature interactions between the tracklets and learnable predicate query embeddings, and finally predicts the relations. Experimental results strongly demonstrate the superiority of our method, which outperforms other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge in ACM Multimedia 2021. Code is released at https://github.com/Dawn-LX/VidVRD-tracklets.
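
The decoder the abstract describes can be sketched as learnable predicate query embeddings cross-attending to tracklet features through a standard transformer decoder, with a linear head predicting the relation class per query. Dimensions, the number of queries, and the relation count below are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

B, T, D, Q, R = 2, 30, 256, 16, 132   # batch, tracklet steps, dim, queries, relations

tracklet_feats = torch.randn(B, T, D)       # features for a subject-object tracklet pair
queries = nn.Parameter(torch.randn(1, Q, D))  # learnable predicate query embeddings

layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=3)

# Queries attend over the temporal tracklet features, then predict relations.
out = decoder(queries.expand(B, -1, -1), tracklet_feats)  # (B, Q, D)
logits = nn.Linear(D, R)(out)                              # (B, Q, R)
print(logits.shape)                                        # torch.Size([2, 16, 132])
```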

A Multi-input Multi-output Transformer-based Hybrid Neural Network for Multi-class Privacy Disclosure Detection

Comment: 20 pages

Link: http://arxiv.org/abs/2108.08483

Abstract

The concern regarding users' data privacy has risen to its highest level due to the massive increase in communication platforms, social networking sites, and greater users' participation in online public discourse. An increasing number of people exchange private information via emails, text messages, and social media without being aware of the risks and implications. Researchers in the field of Natural Language Processing (NLP) have concentrated on creating tools and strategies to identify, categorize, and sanitize private information in text data, since a substantial amount of data is exchanged in textual form. However, most of the detection methods rely solely on the existence of pre-identified keywords in the text and disregard the inference of the underlying meaning of the utterance in a specific context. Hence, in some situations, these tools and algorithms fail to detect disclosure, or the produced results are misclassified. In this paper, we propose a multi-input, multi-output hybrid neural network which utilizes transfer learning, linguistics, and metadata to learn the hidden patterns. Our goal is to better classify disclosure/non-disclosure content in terms of the context of situation. We trained and evaluated our model on a human-annotated ground truth dataset containing a total of 5,400 tweets. The results show that the proposed model was able to identify privacy disclosure in tweets with an accuracy of 77.4%, while classifying the information type of those tweets with an impressive accuracy of 99%, by jointly learning two separate tasks.
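
The multi-input, multi-output structure can be sketched as a pretrained text encoder fused with a metadata vector, feeding two heads trained jointly: one for disclosure vs. non-disclosure, one for information type. The encoder checkpoint, layer sizes, metadata features, and class counts below are assumptions for illustration, not the paper's architecture details.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DisclosureModel(nn.Module):
    def __init__(self, n_info_types: int = 10, n_meta: int = 8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.fuse = nn.Linear(hidden + n_meta, 256)     # text + metadata fusion
        self.disclosure_head = nn.Linear(256, 2)        # output 1: disclosure?
        self.type_head = nn.Linear(256, n_info_types)   # output 2: info type

    def forward(self, input_ids, attention_mask, meta):
        cls = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        h = torch.relu(self.fuse(torch.cat([cls, meta], dim=-1)))
        return self.disclosure_head(h), self.type_head(h)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["I just moved to 12 Oak Street"], return_tensors="pt")  # toy tweet
model = DisclosureModel()
d_logits, t_logits = model(batch["input_ids"], batch["attention_mask"], torch.zeros(1, 8))
# Joint training would sum the cross-entropy losses of both heads.
```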
