Today's arXiv Picks | 21 New EMNLP 2021 Papers


About #Today's arXiv Picks

This is a column run by 「AI 学术前沿」 (AI Academic Frontier): every day, the editors hand-pick high-quality papers from arXiv and push them to readers.

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Comment: 11 pages. SustaiNLP workshop at EMNLP 2021

Link: http://arxiv.org/abs/2109.07460

Abstract

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining using domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora. In datasets from four disparate domains, we find adaptive tokenization on a pretrained RoBERTa model provides >97% of the performance benefits of domain-specific pretraining. Our approach produces smaller models and less training and inference time than other approaches using tokenizer augmentation. While adaptive tokenization incurs a 6% increase in model parameters in our experimentation, due to the introduction of 10k new domain-specific tokens, our approach, using 64 vCPUs, is 72x faster than further pretraining the language model on domain-specific corpora on 8 TPUs.
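
To make the core idea concrete, here is a minimal Python sketch, not the paper's implementation: candidate tokens are scored by how much more probable they are in the domain corpus than in the base corpus (a simple smoothed log-ratio stands in for the conditional-distribution divergence described in the abstract), and the top-scoring ones become new domain-specific tokens.

```python
# Minimal sketch (not the paper's algorithm): score tokens by a smoothed
# log-ratio of domain vs. base corpus probabilities and keep the top-k.
from collections import Counter
import math

def domain_token_scores(base_tokens, domain_tokens, smoothing=1.0):
    base, domain = Counter(base_tokens), Counter(domain_tokens)
    vocab = set(base) | set(domain)
    base_total = sum(base.values()) + smoothing * len(vocab)
    domain_total = sum(domain.values()) + smoothing * len(vocab)
    scores = {}
    for tok in domain:
        p_domain = (domain[tok] + smoothing) / domain_total
        p_base = (base[tok] + smoothing) / base_total
        scores[tok] = math.log(p_domain / p_base)   # higher = more domain-specific
    return scores

base_corpus = "the model was trained on general news text".split()
domain_corpus = "the assay measured cytokine expression in the cell line".split()
scores = domain_token_scores(base_corpus, domain_corpus)
new_tokens = sorted(scores, key=scores.get, reverse=True)[:10]
print(new_tokens)
# In Hugging Face Transformers, such strings could then be added with
# tokenizer.add_tokens(new_tokens) followed by
# model.resize_token_embeddings(len(tokenizer)).
```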

Challenges in Detoxifying Language Models

Comment: 23 pages, 6 figures, published in Findings of EMNLP 2021

Link: http://arxiv.org/abs/2109.07445

Abstract

Large language models (LM) generate remarkably fluent text and can be efficiently adapted across NLP tasks. Measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying LMs in the real world; to this end, prior work often relies on automatic evaluation of LM toxicity. We critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and LM quality. We demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the RealToxicityPrompts dataset, this comes at the cost of reduced LM coverage for both texts about, and dialects of, marginalized groups. Additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions -- highlighting further the nuances involved in careful evaluation of LM toxicity.

Is "moby dick" a Whale or a Bird? Named Entities and Terminology in Speech Translation

Comment: Accepted at EMNLP2021

Link: http://arxiv.org/abs/2109.07439

Abstract

Automatic translation systems are known to struggle with rare words. Among these, named entities (NEs) and domain-specific terms are crucial, since errors in their translation can lead to severe meaning distortions. Despite their importance, previous speech translation (ST) studies have neglected them, also due to the dearth of publicly available resources tailored to their specific evaluation. To fill this gap, we i) present the first systematic analysis of the behavior of state-of-the-art ST systems in translating NEs and terminology, and ii) release NEuRoparl-ST, a novel benchmark built from European Parliament speeches annotated with NEs and terminology. Our experiments on the three language directions covered by our benchmark (en->es/fr/it) show that ST systems correctly translate 75-80% of terms and 65-70% of NEs, with very low performance (37-40%) on person names.

SupCL-Seq: Supervised Contrastive Learning for Downstream Optimized Sequence Representations

Comment: short paper, EMNLP 2021, Findings

Link: http://arxiv.org/abs/2109.07424

Abstract

While contrastive learning is proven to be an effective training strategy in computer vision, Natural Language Processing (NLP) is only recently adopting it as a self-supervised alternative to Masked Language Modeling (MLM) for improving sequence representations. This paper introduces SupCL-Seq, which extends the supervised contrastive learning from computer vision to the optimization of sequence representations in NLP. By altering the dropout mask probability in standard Transformer architectures, for every representation (anchor), we generate augmented altered views. A supervised contrastive loss is then utilized to maximize the system's capability of pulling together similar samples (e.g., anchors and their altered views) and pushing apart the samples belonging to the other classes. Despite its simplicity, SupCL-Seq leads to large gains in many sequence classification tasks on the GLUE benchmark compared to a standard BERT-base, including 6% absolute improvement on CoLA, 5.4% on MRPC, 4.7% on RTE and 2.6% on STS-B. We also show consistent gains over self-supervised contrastively learned representations, especially in non-semantic tasks. Finally, we show that these gains are not solely due to augmentation, but rather to a downstream optimized sequence representation. Code: https://github.com/hooman650/SupCL-Seq
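
As a rough illustration of the training objective (a standard supervised contrastive loss in the style of Khosla et al., not the authors' exact code), the sketch below pulls together samples that share a label, e.g. dropout-augmented views of the same sentence, and pushes apart samples from other classes; the function and variable names are our own.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of feature vectors."""
    features = F.normalize(features, dim=1)
    n = features.size(0)
    sim = features @ features.T / temperature            # pairwise similarities
    logits_mask = ~torch.eye(n, dtype=torch.bool)         # exclude self-similarity
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & logits_mask
    sim = sim.masked_fill(~logits_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability of positives per anchor (skip anchors without positives)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask).sum(1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

# e.g. two dropout-augmented views of the same labeled batch, concatenated
feats = torch.randn(8, 128)
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
print(sup_con_loss(feats, labels))
```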

RankNAS: Efficient Neural Architecture Search by Pairwise Ranking

Comment: Accepted to EMNLP 2021 Long Paper

Link: http://arxiv.org/abs/2109.07383

Abstract

This paper addresses the efficiency challenge of Neural Architecture Search (NAS) by formulating the task as a ranking problem. Previous methods require numerous training examples to estimate the accurate performance of architectures, although the actual goal is to find the distinction between "good" and "bad" candidates. Here we do not resort to performance predictors. Instead, we propose a performance ranking method (RankNAS) via pairwise ranking. It enables efficient architecture search using much fewer training examples. Moreover, we develop an architecture selection method to prune the search space and concentrate on more promising candidates. Extensive experiments on machine translation and language modeling tasks show that RankNAS can design high-performance architectures while being orders of magnitude faster than state-of-the-art NAS systems.
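
A toy version of the pairwise idea, under our own simplifications rather than the RankNAS implementation: learn a linear scorer from observed comparisons of architecture feature vectors, then rank unseen candidates by their scores.

```python
# Toy pairwise ranker (our simplification, not the RankNAS code): learn a
# linear scorer w such that score(a) > score(b) whenever architecture a
# outperformed architecture b in the observed comparisons.
import torch
import torch.nn.functional as F

def train_pairwise_ranker(feat_a, feat_b, a_is_better, epochs=200, lr=0.1):
    """feat_a, feat_b: [N, D] architecture features; a_is_better: [N] in {0, 1}."""
    w = torch.zeros(feat_a.size(1), requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    target = a_is_better.float()
    for _ in range(epochs):
        opt.zero_grad()
        margin = (feat_a - feat_b) @ w            # score difference s(a) - s(b)
        loss = F.binary_cross_entropy_with_logits(margin, target)
        loss.backward()
        opt.step()
    return w.detach()

def rank_candidates(w, candidate_feats):
    """Return candidate indices sorted from most to least promising."""
    return torch.argsort(candidate_feats @ w, descending=True)
```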

Topic Transferable Table Question Answering

Comment: To appear at EMNLP 2021

Link: http://arxiv.org/abs/2109.07377

Abstract

Weakly-supervised table question-answering (TableQA) models have achieved state-of-the-art performance by using a pre-trained BERT transformer to jointly encode a question and a table to produce a structured query for the question. However, in practical settings TableQA systems are deployed over table corpora having topic and word distributions quite distinct from BERT's pretraining corpus. In this work we simulate the practical topic shift scenario by designing novel challenge benchmarks WikiSQL-TS and WikiTQ-TS, consisting of train-dev-test splits in five distinct topic groups, based on the popular WikiSQL and WikiTableQuestions datasets. We empirically show that, despite pre-training on large open-domain text, performance of models degrades significantly when they are evaluated on unseen topics. In response, we propose T3QA (Topic Transferable Table Question Answering), a pragmatic adaptation framework for TableQA comprising: (1) topic-specific vocabulary injection into BERT, (2) a novel text-to-text transformer generator (such as T5, GPT2) based natural language question generation pipeline focused on generating topic-specific training data, and (3) a logical form reranker. We show that T3QA provides a reasonably good baseline for our topic shift benchmarks. We believe our topic split benchmarks will lead to robust TableQA solutions that are better suited for practical deployment.

Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU

Comment: Accepted at EMNLP 2021

Link: http://arxiv.org/abs/2109.07364

Abstract

Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property e.g. in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole, abstracting away the notion of time. Recent work attempts to apply Transformers incrementally via restart-incrementality by repeatedly feeding, to an unchanged model, increasingly longer input prefixes to produce partial outputs. However, this approach is computationally costly and does not scale efficiently for long sequences. In parallel, we witness efforts to make Transformers more efficient, e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we examine the feasibility of LT for incremental NLU in English. Our results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and LT with restart-incrementality, at the cost of part of the non-incremental (full sequence) quality. We show that the performance drop can be mitigated by training the model to wait for right context before committing to an output and that training with input prefixes is beneficial for delivering correct partial outputs.
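
For readers unfamiliar with restart-incrementality, the baseline the paper contrasts with can be written in a few lines; this is a schematic only, with a hypothetical `model` callable that labels a prefix.

```python
# Plain sketch of restart-incrementality: an unchanged model is re-run on
# every longer prefix of the input, which is simple but costs one full
# forward pass per new token (the paper contrasts this with a recurrent
# Linear Transformer that carries state across steps).
def restart_incremental(model, tokens):
    """model(prefix) -> labels for that prefix (hypothetical interface)."""
    partial_outputs = []
    for t in range(1, len(tokens) + 1):
        partial_outputs.append(model(tokens[:t]))   # full re-computation each step
    return partial_outputs
```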

Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training

Comment: EMNLP 2021

Link: http://arxiv.org/abs/2109.07306

Abstract

Compared to monolingual models, cross-lingual models usually require a more expressive vocabulary to represent all languages adequately. We find that many languages are under-represented in recent cross-lingual language models due to the limited vocabulary capacity. To this end, we propose an algorithm VoCap to determine the desired vocabulary capacity of each language. However, increasing the vocabulary size significantly slows down the pre-training speed. In order to address the issues, we propose k-NN-based target sampling to accelerate the expensive softmax. Our experiments show that the multilingual vocabulary learned with VoCap benefits cross-lingual language model pre-training. Moreover, k-NN-based target sampling mitigates the side-effects of increasing the vocabulary size while achieving comparable performance and faster pre-training speed. The code and the pretrained multilingual vocabularies are available at https://github.com/bozheng-hit/VoCapXLM.
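
A hedged sketch of what k-NN-based target sampling can look like in code (our own simplification, not the VoCap implementation): the softmax is restricted to the gold token plus the k output embeddings nearest to the hidden state, and a real system would retrieve the neighbors from an approximate nearest-neighbor index rather than the exact top-k used here for clarity.

```python
import torch
import torch.nn.functional as F

def knn_sampled_ce(hidden, out_emb, gold, k=256):
    """hidden: [B, D] decoder states, out_emb: [V, D] output embeddings, gold: [B] ids."""
    sims = hidden @ out_emb.T                          # [B, V]; stand-in for an ANN lookup
    nbr = sims.topk(k, dim=1).indices                  # [B, k] sampled candidate tokens
    cand = torch.cat([gold.unsqueeze(1), nbr], dim=1)  # make sure the gold token is scored
    logits = torch.gather(sims, 1, cand)               # [B, k+1]
    # the gold token sits at position 0 of every candidate list
    # (a possible duplicate of it among the neighbors is ignored for brevity)
    return F.cross_entropy(logits, torch.zeros_like(gold))
```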

Unsupervised Keyphrase Extraction by Jointly Modeling Local and Global Context

Comment: 10 pages, 4 figures, EMNLP 2021, code: https://github.com/xnliang98/uke_ccrank

Link: http://arxiv.org/abs/2109.07293

Abstract

Embedding-based methods are widely used for unsupervised keyphrase extraction (UKE) tasks. Generally, these methods simply calculate similarities between phrase embeddings and the document embedding, which is insufficient to capture different contexts for a more effective UKE model. In this paper, we propose a novel method for UKE, where local and global contexts are jointly modeled. From a global view, we calculate the similarity between a certain phrase and the whole document in the vector space, as traditional embedding-based models do. In terms of the local view, we first build a graph structure based on the document, where phrases are regarded as vertices and the edges are similarities between vertices. Then, we propose a new centrality computation method to capture local salient information based on the graph structure. Finally, we further combine the modeling of global and local context for ranking. We evaluate our models on three public benchmarks (Inspec, DUC 2001, SemEval 2010) and compare with existing state-of-the-art models. The results show that our model outperforms most models while generalizing better on input documents with different domains and lengths. An additional ablation study shows that both local and global information are crucial for unsupervised keyphrase extraction tasks.
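
To illustrate the global/local combination at a high level (the paper's centrality measure is more elaborate than the plain degree centrality used here), candidate phrases could be ranked roughly like this:

```python
# Rough sketch of mixing a global relevance score with a local graph-centrality
# score for unsupervised keyphrase ranking; names and weighting are our own.
import numpy as np

def rank_phrases(phrase_emb, doc_emb, alpha=0.5):
    """phrase_emb: [P, D] phrase embeddings; doc_emb: [D] document embedding."""
    P = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb)
    global_score = P @ d                       # cosine similarity to the whole document
    sim = P @ P.T                              # phrase-phrase similarity graph
    np.fill_diagonal(sim, 0.0)
    local_score = sim.sum(axis=1)              # simple degree centrality
    # normalize both views before mixing
    g = (global_score - global_score.min()) / (np.ptp(global_score) + 1e-9)
    l = (local_score - local_score.min()) / (np.ptp(local_score) + 1e-9)
    score = alpha * g + (1 - alpha) * l
    return np.argsort(-score)                  # indices of phrases, best first
```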

Regressive Ensemble for Machine Translation Quality Evaluation

Comment: 8 pages incl. references, Proceedings of the EMNLP 2021 Sixth Conference on Machine Translation (WMT21)

Link: http://arxiv.org/abs/2109.07242

Abstract

This work introduces a simple regressive ensemble for evaluating machine translation quality based on a set of novel and established metrics. We evaluate the ensemble using a correlation to expert-based MQM scores of the WMT 2021 Metrics workshop. In both monolingual and zero-shot cross-lingual settings, we show a significant performance improvement over single metrics. In the cross-lingual settings, we also demonstrate that an ensemble approach is well-applicable to unseen languages. Furthermore, we identify a strong reference-free baseline that consistently outperforms the commonly-used BLEU and METEOR measures and significantly improves our ensemble's performance.
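
The ensemble idea itself is easy to reproduce in outline; below is a generic scikit-learn sketch with placeholder data and a ridge regressor, which may differ from the paper's actual feature set and model choice.

```python
# Illustrative sketch of a regressive metric ensemble: treat each segment's
# scores from several MT metrics as features and regress onto human MQM scores.
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

# rows = segments, columns = scores from different metrics (placeholder data)
train_features = np.random.rand(500, 4)
train_mqm = np.random.rand(500)
test_features = np.random.rand(100, 4)
test_mqm = np.random.rand(100)

ensemble = Ridge(alpha=1.0).fit(train_features, train_mqm)
pred = ensemble.predict(test_features)
print("segment-level Pearson r:", pearsonr(pred, test_mqm)[0])
```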

SWEAT: Scoring Polarization of Topics across Different Corpora

Comment: Published as a conference paper at EMNLP2021

Link: http://arxiv.org/abs/2109.07231

Abstract

Understanding differences of viewpoints across corpora is a fundamental task for computational social sciences. In this paper, we propose the Sliced Word Embedding Association Test (SWEAT), a novel statistical measure to compute the relative polarization of a topical wordset across two distributional representations. To this end, SWEAT uses two additional wordsets, deemed to have opposite valence, to represent two different poles. We validate our approach and illustrate a case study to show the usefulness of the introduced measure.
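
In spirit, SWEAT builds on WEAT-style association scores computed separately in each corpus's embedding space; the sketch below (our own simplification, not the official implementation) compares the mean association of a topical wordset with two opposite-valence wordsets across the two representations.

```python
import numpy as np

def association(emb, topic_words, set_a, set_b):
    """emb: dict word -> vector for one corpus's embedding model."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    scores = []
    for w in topic_words:
        a = np.mean([cos(emb[w], emb[x]) for x in set_a])
        b = np.mean([cos(emb[w], emb[x]) for x in set_b])
        scores.append(a - b)   # positive: w leans toward the A pole in this corpus
    return np.array(scores)

def sweat_like_score(emb_corpus1, emb_corpus2, topic, pos_words, neg_words):
    # positive values: corpus 1 associates the topic with the positive pole
    # more strongly than corpus 2 does
    return (association(emb_corpus1, topic, pos_words, neg_words).mean()
            - association(emb_corpus2, topic, pos_words, neg_words).mean())
```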

EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

Comment: Findings of EMNLP 2021

Link: http://arxiv.org/abs/2109.07222

Abstract

Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA), since the computational cost of FFN is 2~3 times larger than that of MHA. Hence, to compact BERT, we are devoted to designing an efficient FFN, as opposed to previous works that pay attention to MHA. Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT optimization, we further design a thorough search space towards an advanced MLP and perform a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate searching and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show our searched EfficientBERT is 6.9x smaller and 4.4x faster than BERT-BASE, and has competitive performance on the GLUE and SQuAD benchmarks. Concretely, EfficientBERT attains a 77.7 average score on the GLUE test set, 0.7 higher than MobileBERT-TINY, and achieves an 85.3/74.5 F1 score on the SQuAD v1.1/v2.0 dev sets, 3.2/2.7 higher than TinyBERT-4 even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.
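
For context, the distillation stages rely on a standard knowledge-distillation objective of the following form; this is a generic formulation, not the paper's exact warm-up schedule.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets from the teacher, scaled by temperature T
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # standard cross-entropy against the gold labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```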

A Relation-Oriented Clustering Method for Open Relation Extraction

Comment: 12 pages, 6 figures, EMNLP 2021

Link: http://arxiv.org/abs/2109.07205

Abstract

The clustering-based unsupervised relation discovery method has gradually become one of the important methods of open relation extraction (OpenRE). However, high-dimensional vectors can encode complex linguistic information, which leads to the problem that the derived clusters cannot explicitly align with the relational semantic classes. In this work, we propose a relation-oriented clustering model and use it to identify the novel relations in the unlabeled data. Specifically, to enable the model to learn to cluster relational data, our method leverages the readily available labeled data of pre-defined relations to learn a relation-oriented representation. We minimize the distance between instances with the same relation by gathering the instances towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly. To reduce the clustering bias on predefined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data. Experimental results show that our method reduces the error rate by 29.2% and 15.7% on two datasets respectively, compared with current SOTA methods.
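
A bare-bones version of the centroid-gathering idea, as a simplified stand-in for the paper's joint objective: pull each labeled instance toward the centroid of its relation class.

```python
import torch

def centroid_pull_loss(features, labels):
    """features: [N, D] instance encodings; labels: [N] relation ids."""
    loss = 0.0
    relations = labels.unique()
    for rel in relations:
        members = features[labels == rel]
        centroid = members.mean(dim=0, keepdim=True)
        # squared distance of each member to its relation centroid
        loss = loss + ((members - centroid) ** 2).sum(dim=1).mean()
    return loss / len(relations)
```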

Adversarial Mixing Policy for Relaxing Locally Linear Constraints in Mixup

Comment: This paper is accepted to appear in the main conference of EMNLP2021

Link: http://arxiv.org/abs/2109.07177

Abstract

Mixup is a recent regularizer for current deep classification networks. Through training a neural network on convex combinations of pairs of examples and their labels, it imposes locally linear constraints on the model's input space. However, such strict linear constraints often lead to under-fitting, which degrades the effects of regularization. Noticeably, this issue becomes more serious when resources are extremely limited. To address these issues, we propose the Adversarial Mixing Policy (AMP), organized in a min-max-rand formulation, to relax the locally linear constraints in Mixup. Specifically, AMP adds a small adversarial perturbation to the mixing coefficients rather than the examples. Thus, slight non-linearity is injected in-between the synthetic examples and synthetic labels. By training on these data, the deep networks are further regularized and thus achieve a lower predictive error rate. Experiments on five text classification benchmarks and five backbone models have empirically shown that our methods reduce the error rate over Mixup variants by a significant margin (up to 31.3%), especially in low-resource conditions (up to 17.5%).
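
The following sketch shows where the perturbation enters relative to vanilla Mixup; for brevity a small random shift of the input-mixing coefficient stands in for the adversarial perturbation that AMP actually derives from its min-max-rand objective.

```python
import torch
import torch.nn.functional as F

def mixup_step(model, x, y, num_classes, alpha=0.2, eps=0.05):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    # perturb the coefficient used to mix the *inputs* only, so the mixed
    # example no longer lies exactly on the line implied by the mixed label
    lam_x = min(max(lam + eps * (2 * torch.rand(1).item() - 1), 0.0), 1.0)
    mixed_x = lam_x * x + (1 - lam_x) * x[perm]
    y1 = F.one_hot(y, num_classes).float()
    mixed_y = lam * y1 + (1 - lam) * y1[perm]      # labels mixed with the unperturbed lam
    logits = model(mixed_x)
    return -(mixed_y * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```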

Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders

Comment: Findings of EMNLP 2021

Link: http://arxiv.org/abs/2109.07169

Abstract

The ability of learning disentangled representations represents a major step for interpretable NLP systems, as it allows latent linguistic features to be controlled. Most approaches to disentanglement rely on continuous variables, both for images and text. We argue that despite being suitable for image datasets, continuous variables may not be ideal to model features of textual data, due to the fact that most generative factors in text are discrete. We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations. The proposed model outperforms continuous and discrete baselines on several qualitative and quantitative benchmarks for disentanglement as well as on a text style transfer downstream application.
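
As background, a block of categorical latent variables of the kind such models typically rely on can be trained with Gumbel-Softmax sampling; the following is a generic illustration rather than the paper's architecture.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteLatent(nn.Module):
    """Independent categorical latent variables with a uniform prior."""
    def __init__(self, hidden_dim, n_vars=8, n_cats=10):
        super().__init__()
        self.n_vars, self.n_cats = n_vars, n_cats
        self.to_logits = nn.Linear(hidden_dim, n_vars * n_cats)

    def forward(self, h, tau=1.0):
        logits = self.to_logits(h).view(-1, self.n_vars, self.n_cats)
        z = F.gumbel_softmax(logits, tau=tau, hard=True)   # one-hot sample per variable
        q = F.softmax(logits, dim=-1)
        # KL divergence to a uniform categorical prior, summed over latent variables
        kl = (q * (q.clamp_min(1e-9).log() + math.log(self.n_cats))).sum(dim=(1, 2))
        return z.flatten(1), kl
```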

Can Language Models be Biomedical Knowledge Bases?

Comment: EMNLP 2021. Code available at https://github.com/dmis-lab/BioLAMA

Link: http://arxiv.org/abs/2109.07154

Abstract

Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which is comprised of 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with prompt templates without any subjects, hence producing similar results on each relation and hindering their capabilities to be used as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.
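
Probing of this kind essentially fills a masked prompt and checks whether the gold object appears among the top-k predictions. The sketch below is a single-token simplification (BioLAMA itself handles multi-token objects; the model name and triple are only examples, not drawn from the benchmark).

```python
from transformers import pipeline

# example biomedical masked LM; any fill-mask model could be substituted
fill = pipeline("fill-mask", model="dmis-lab/biobert-base-cased-v1.2", top_k=5)

def acc_at_5(triples, template):
    """triples: list of (subject, object) pairs; template contains [X] and [Y]."""
    hits = 0
    for subj, obj in triples:
        prompt = template.replace("[X]", subj).replace("[Y]", fill.tokenizer.mask_token)
        preds = [p["token_str"].strip() for p in fill(prompt)]
        hits += int(obj in preds)
    return hits / len(triples)

print(acc_at_5([("aspirin", "headache")], "[X] is used to treat [Y]."))
```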

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Comment: 22 pages, accepted to EMNLP 2021 main conference

Link: http://arxiv.org/abs/2109.07152

Abstract

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only composed of the multi-head attention; other components can also contribute to Transformers' progressive performance. In this study, we extended the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect the performance. The codes of our experiments are publicly available.

Beyond Glass-Box Features: Uncertainty Quantification Enhanced Quality Estimation for Neural Machine Translation

Comment: Accepted by Findings of EMNLP 2021

Link: http://arxiv.org/abs/2109.07141

Abstract

Quality Estimation (QE) plays an essential role in applications of Machine Translation (MT). Traditionally, a QE system accepts the original source text and a translation from a black-box MT system as input. Recently, a few studies indicate that, as a by-product of translation, QE benefits from the model and training data's information of the MT system where the translations come from, and this is called "glass-box QE". In this paper, we extend the definition of "glass-box QE" generally to uncertainty quantification with both "black-box" and "glass-box" approaches and design several features deduced from them to blaze a new trail in improving QE's performance. We propose a framework to fuse the feature engineering of uncertainty quantification into a pre-trained cross-lingual language model to predict the translation quality. Experiment results show that our method achieves state-of-the-art performance on the datasets of the WMT 2020 QE shared task.
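
One family of glass-box uncertainty features can be obtained with Monte Carlo dropout over the MT model; the sketch below is illustrative only, and `model.score` is a hypothetical interface returning the log-probability the MT model assigns to the translation.

```python
import torch

def mc_dropout_features(model, src, tgt, n_samples=10):
    """Mean and variance of the sentence-level log-probability of tgt under
    repeated stochastic forward passes (dropout left enabled)."""
    model.train()                      # keep dropout active at inference time
    scores = []
    with torch.no_grad():
        for _ in range(n_samples):
            scores.append(model.score(src, tgt))   # hypothetical log-prob API
    scores = torch.tensor(scores)
    # these scalars would be fed, with other features, into the QE regressor
    return {"mc_mean": scores.mean().item(), "mc_var": scores.var().item()}
```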

Towards Document-Level Paraphrase Generation with Sentence Rewriting and Reordering

Comment: Findings of EMNLP 2021

Link: http://arxiv.org/abs/2109.07095

Abstract

Paraphrase generation is an important task in natural language processing. Previous works focus on sentence-level paraphrase generation, while ignoring document-level paraphrase generation, which is a more challenging and valuable task. In this paper, we explore the task of document-level paraphrase generation for the first time and focus on inter-sentence diversity by considering sentence rewriting and reordering. We propose CoRPG (Coherence Relationship guided Paraphrase Generation), which leverages a graph GRU to encode the coherence relationship graph and obtain a coherence-aware representation for each sentence, which can be used for re-arranging the multiple (possibly modified) input sentences. We create a pseudo document-level paraphrase dataset for training CoRPG. Automatic evaluation results show CoRPG outperforms several strong baseline models on BERTScore and diversity scores. Human evaluation also shows our model can generate document paraphrases with more diversity and semantic preservation.

Transformer-based Lexically Constrained Headline Generation

Comment: EMNLP 2021

Link: http://arxiv.org/abs/2109.07080

Abstract

This paper explores a variant of automatic headline generation methods, where a generated headline is required to include a given phrase such as a company or a product name. Previous methods using Transformer-based models generate a headline including a given phrase by providing the encoder with additional information corresponding to the given phrase. However, these methods cannot always include the phrase in the generated headline. Inspired by previous RNN-based methods generating token sequences in backward and forward directions from the given phrase, we propose a simple Transformer-based method that guarantees to include the given phrase in the high-quality generated headline. We also consider a new headline generation strategy that takes advantage of the controllable generation order of the Transformer. Our experiments with the Japanese News Corpus demonstrate that our methods, which are guaranteed to include the phrase in the generated headline, achieve ROUGE scores comparable to previous Transformer-based methods. We also show that our generation strategy performs better than previous strategies.
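
Conceptually, constraint satisfaction comes from decoding outward from the required phrase rather than hoping it appears; the sketch below shows only that control flow, with a hypothetical `generate_step(tokens, direction)` decoder call rather than the paper's model.

```python
def phrase_anchored_headline(generate_step, phrase_tokens, max_len=20):
    """Build a headline that contains phrase_tokens by construction."""
    headline = list(phrase_tokens)
    # extend to the left of the phrase, then to the right, until an end
    # marker is produced or the length cap is reached
    for direction in ("left", "right"):
        while len(headline) < max_len:
            tok = generate_step(headline, direction)   # hypothetical decoder call
            if tok == "<end>":
                break
            headline = [tok] + headline if direction == "left" else headline + [tok]
    return headline
```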

Improving Text Auto-Completion with Next Phrase Prediction

Comment: 4 pages, 2 figures, 4 tables, Accepted in EMNLP 2021-Findings

Link: http://arxiv.org/abs/2109.07067

Abstract

Language models such as GPT-2 have performed well on constructing syntactically sound sentences for the text auto-completion task. However, such models often require considerable training effort to adapt to specific writing domains (e.g., medical). In this paper, we propose an intermediate training strategy to enhance pre-trained language models' performance in the text auto-completion task and quickly adapt them to specific domains. Our strategy includes a novel self-supervised training objective called Next Phrase Prediction (NPP), which encourages a language model to complete the partial query with enriched phrases and eventually improve the model's text auto-completion performance. Preliminary experiments have shown that our approach is able to outperform the baselines in auto-completion for email and academic writing domains.
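
The NPP objective can be approximated by harvesting (prefix, next phrase) pairs from raw text and training the model to generate the phrase given the prefix; the toy constructor below uses a fixed-length span where the paper defines phrases more carefully.

```python
# Toy sketch of building Next-Phrase-Prediction style training pairs from raw
# sentences (a fixed-length span stands in for a proper phrase).
def npp_pairs(sentence, phrase_len=3, min_prefix=2):
    tokens = sentence.split()
    pairs = []
    for i in range(min_prefix, len(tokens) - phrase_len + 1):
        prefix = " ".join(tokens[:i])
        phrase = " ".join(tokens[i:i + phrase_len])
        pairs.append((prefix, phrase))     # (input, target) for an LM / seq2seq model
    return pairs

print(npp_pairs("we propose an intermediate training strategy for auto completion"))
```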
