NLP paper daily1 arxiv16.3.2017

mijiaoxiaosan

Article link: https://blog.csdn.net/mijiaoxiaosan/article/details/62867519

arXiv:1703.05260
InScript: Narrative texts annotated with script information

This paper presents the InScript corpus (Narrative Texts Instantiating Script structure). InScript is a corpus of 1,000 stories centered
around 10 different scenarios. Verbs and noun phrases are annotated with event and participant types, respectively. Additionally, the text
is annotated with coreference information. The corpus shows rich lexical variation and will serve as a unique resource for the study of
the role of script knowledge in natural language processing.

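InScript: narrative texts annotated with script information
This paper introduces a new corpus, InScript (Narrative Texts Instantiating Script structure). InScript consists of 1,000 stories centered on 10 different scenarios. Verbs and noun phrases are annotated with event types and participant types, respectively. The text is also annotated with coreference information. The corpus shows rich lexical variation and offers a unique resource for studying the role of script knowledge in natural language processing.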

arXiv:1703.05122
Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

Code-mixing or code-switching are the effortless phenomena of natural switching between two or more languages in a single conversation. Use of a foreign word in a language, however, does not necessarily mean that the speaker is code-switching, because languages often borrow lexical items from other languages. If a word is borrowed, it becomes a part of the lexicon of a language; whereas, during code-switching, the speaker is aware that the conversation involves foreign words or phrases. Identifying whether a foreign word used by a bilingual speaker is due to borrowing or code-switching is of fundamental importance to theories of multilingualism, and an essential prerequisite towards the development of language and speech technologies for multilingual communities. In this paper, we present a series of novel computational methods to identify the borrowed likeliness of a word, based on the social media signals. We first propose a context-based clustering method to sample a set of candidate words from the social media data. Next, we propose three novel and similar metrics based on the usage of these words by the users in different tweets; these metrics were used to score and rank the candidate words indicating their borrowed likeliness. We compare these rankings with a ground truth ranking constructed through a human judgment experiment. The Spearman’s rank correlation between the two rankings (nearly 0.62 for all the three metric variants) is more than double the value (0.26) of the most competitive existing baseline reported in the literature. Some other striking observations are: (i) the correlation is higher for the ground truth data elicited from the younger participants (age less than 30) than that from the older participants, and (ii) those participants who use mixed-language for tweeting the least provide the best signals of borrowing.

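Is this word borrowed? An automatic method for quantifying how likely a word in social media is to be borrowed

A borrowed word has been absorbed into the new language and become part of its lexicon, whereas in code-switching the speaker is aware that the conversation involves foreign words or phrases. The authors propose new computational methods to estimate how likely a word is to be borrowed. They first propose a context-based clustering method to sample a set of candidate words. They then propose three novel, similar metrics based on how different users use these words in different tweets; the metrics score and rank the candidates by their likelihood of being borrowed. The resulting rankings are compared with a ground-truth ranking built from a human judgment experiment.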
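
The abstract evaluates its metrics by the Spearman rank correlation between the metric ranking and the human ranking. As a rough illustration of that comparison only (not the authors' code), here is a minimal pure-Python sketch. The function names and the toy scores are hypothetical; none of the numbers come from the paper.

```python
def rank(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical scores for five candidate words (illustrative values only).
metric_scores = [0.9, 0.4, 0.7, 0.1, 0.5]   # score from an automatic metric
human_scores = [5, 2, 3, 1, 4]              # score from human judgments
rho = spearman(metric_scores, human_scores)
```

A value of `rho` close to 1 means the automatic metric orders the candidate words almost the same way the human judges do.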

arXiv:1703.04929
SyntaxNet Models for the CoNLL 2017 Shared Task

We describe a baseline dependency parsing system for the CoNLL2017 Shared Task. This system, which we call “ParseySaurus,” uses the DRAGNN framework [Kong et al, 2017] to combine transition-based recurrent parsing and tagging with character-based word representations. On the v1.3 Universal Dependencies Treebanks, the new system outperforms the publicly available, state-of-the-art “Parsey’s Cousins” models by 3.47% absolute Labeled Accuracy Score (LAS) across 52 treebanks.

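SyntaxNet models for the CoNLL 2017 Shared Task

The paper describes a baseline dependency parsing system for the CoNLL 2017 Shared Task. It uses the DRAGNN framework [Kong et al, 2017] to combine transition-based recurrent parsing and tagging with character-based word representations. The new system outperforms the publicly available “Parsey’s Cousins” models by 3.47% absolute LAS across 52 treebanks.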

arXiv:1703.04914
Ensemble of Neural Classifiers for Scoring Knowledge Base Triples

This paper describes our approach for the triple scoring task at WSDM Cup 2017. The task aims to assign a relevance score for each pair of entities and their types in a knowledge base in order to enhance the ranking results in entity retrieval tasks. We propose an approach wherein the outputs of multiple neural network classifiers are combined using a supervised machine learning model. The experimental results show that our proposed method achieves the best performance in one out of three measures, and performs competitively in the other two measures.

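An ensemble of neural classifiers for scoring knowledge base triples

This paper describes the authors' approach to the triple scoring task at WSDM Cup 2017 (WSDM is the ACM International Conference on Web Search and Data Mining). The task is to assign a relevance score to each pair of an entity and its type in a knowledge base, in order to improve ranking results in entity retrieval. The authors propose combining the outputs of multiple neural network classifiers with a supervised machine learning model. The experiments show that the method achieves the best performance on one of the three measures and is competitive on the other two.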

arXiv:1703.04887
Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets

This paper proposes a new route for applying the generative adversarial nets (GANs) to NLP tasks (taking the neural machine translation as an instance) and the widespread perspective that GANs can’t work well in the NLP area turns out to be unreasonable. In this work, we build a conditional sequence generative adversarial net which comprises two adversarial sub models, a generative model (generator) which translates the source sentence into the target sentence as the traditional NMT models do and a discriminative model (discriminator) which discriminates the machine-translated target sentence from the human-translated sentence. From the perspective of Turing test, the proposed model is to generate the translation which is indistinguishable from the human-translated one. Experiments show that the proposed model achieves significant improvements over the traditional NMT model. In Chinese-English translation tasks, we obtain up to +2.0 BLEU points improvement. To the best of our knowledge, this is the first time that quantitative results about the application of GANs in the traditional NLP task are reported. Meanwhile, we present detailed strategies for GAN training. In addition, we find that the discriminator of the proposed model shows great capability in data cleaning.

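Improving neural machine translation with conditional sequence generative adversarial nets

This paper proposes a new way of applying generative adversarial nets (GANs) to NLP tasks, and shows that the widespread view that GANs cannot work well in NLP is unfounded. The authors build a conditional sequence generative adversarial net from two adversarial sub-models. A generative model (the generator) translates the source sentence into the target sentence, as traditional NMT models do. A discriminative model (the discriminator) distinguishes machine-translated target sentences from human-translated ones. In the spirit of the Turing test, the model should generate translations that are indistinguishable from human translations. Experiments show a significant improvement over the traditional NMT model: on Chinese-English translation the authors obtain up to +2.0 BLEU points. To the best of the authors' knowledge, this is the first time quantitative results have been reported for GANs on a traditional NLP task. They also present detailed strategies for GAN training, and find that the discriminator shows great capability in data cleaning.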

arXiv:1703.04879
Sparse Named Entity Classification using Factorization Machines

Named entity classification is the task of classifying text-based elements into various categories, including places, names, dates, times, and monetary values. A bottleneck in named entity classification, however, is the data problem of sparseness, because new named entities continually emerge, making it rather difficult to maintain a dictionary for named entity classification. Thus, in this paper, we address the problem of named entity classification using matrix factorization to overcome the problem of feature sparsity. Experimental results show that our proposed model, with fewer features and a smaller size, achieves competitive accuracy to state-of-the-art models.

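Sparse named entity classification using factorization machines

Named entity classification assigns text-based elements to categories such as places, names, dates, and times. The bottleneck of this task is data sparseness: new named entities keep emerging, which makes it hard to maintain a dictionary for named entity classification. The paper uses matrix factorization to overcome feature sparsity. Experiments show that the proposed model, with fewer features and a smaller size, achieves accuracy competitive with state-of-the-art models.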
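
The abstract does not spell out the model, so the following is only a sketch of the standard second-order factorization machine score, score(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, which is what the title refers to. It shows why factorized pairwise weights cope with sparse features: the interaction between two features is a dot product of their factor vectors rather than a separately learned weight. All parameter values below are made up for illustration.

```python
def fm_predict_naive(x, w0, w, V):
    """Factorization machine score using the explicit sum over feature pairs."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


def fm_predict(x, w0, w, V):
    """Same score in O(k * n) time via the identity
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)."""
    n, k = len(x), len(V[0])
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        score += 0.5 * (s * s - s_sq)
    return score


# Hypothetical parameters: 3 sparse binary features, factor dimension k = 2.
x = [1, 0, 1]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]
score = fm_predict(x, w0, w, V)
```

Only features 0 and 2 are active here, so the single pairwise term is the dot product of `V[0]` and `V[2]`; both functions return the same value.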

arXiv:1703.04826
Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard natural language processing pipeline, providing information to downstream tasks such as information extraction and question answering. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of multilayer neural networks operating on graphs, suited to modeling syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence and capturing information relevant to predicting the semantic representations. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009) both for Chinese and English.

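Encoding sentences with graph convolutional networks for semantic role labeling

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is an important step in the natural language processing pipeline, providing information to downstream tasks such as information extraction and QA. Because semantic representations are closely related to syntactic ones, the authors exploit syntactic information in their model. They propose a version of graph convolutional networks (GCNs), a recent class of multilayer neural networks operating on graphs, suited to modeling syntactic dependency graphs. GCNs over syntactic dependency trees serve as sentence encoders: they produce latent feature representations of the words and capture information relevant to predicting the semantic representations. The authors observe that GCN layers are complementary to LSTM layers: stacking both yields a substantial improvement over an already state-of-the-art LSTM SRL model, giving the best reported scores on the standard benchmark (CoNLL-2009) for both Chinese and English.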
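
To make the idea of a graph convolution over a dependency tree concrete, here is a deliberately simplified sketch of one layer. It is an assumption-laden illustration, not the paper's model: it uses a single shared weight matrix and treats dependency edges as undirected with self-loops, whereas the abstract only says the authors use a version adapted to syntactic dependency graphs. Each word's new vector is a ReLU of the bias plus the transformed vectors of itself and its syntactic neighbours. All names and values are hypothetical.

```python
def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]


def gcn_layer(H, edges, W, b):
    """One simplified GCN layer.

    H     : list of word vectors, one per token
    edges : list of (head_index, dependent_index) dependency arcs
    W, b  : shared weight matrix and bias (hypothetical parameters)
    """
    n = len(H)
    neighbors = [{v} for v in range(n)]      # self-loop for every word
    for head, dep in edges:
        neighbors[head].add(dep)             # arcs are used in both directions
        neighbors[dep].add(head)
    out = []
    for v in range(n):
        acc = list(b)
        for u in neighbors[v]:
            msg = matvec(W, H[u])
            acc = [a + m for a, m in zip(acc, msg)]
        out.append([max(0.0, a) for a in acc])   # ReLU
    return out


# Toy sentence of three words where word 1 is the head of words 0 and 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(1, 0), (1, 2)]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity weights keep the arithmetic readable
b = [0.0, 0.0]
H_next = gcn_layer(H, edges, W, b)
```

With identity weights each output vector is simply the sum of the word's own vector and its neighbours' vectors, so the head word (index 1) aggregates information from the whole toy sentence.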

arXiv:1703.04816
FastQA: A Simple and Efficient Neural Architecture for Question Answering

Recent development of large-scale question answering (QA) datasets triggered a substantial amount of research into end-to-end neural architectures for QA. Increasingly complex systems have been conceived without comparison to a simpler neural baseline system that would justify their complexity. In this work, we propose a simple heuristic that guided the development of FastQA, an efficient end-to-end neural model for question answering that is very competitive with existing models. We further demonstrate, that an extended version (FastQAExt) achieves state-of-the-art results on recent benchmark datasets, namely SQuAD, NewsQA and MsMARCO, outperforming most existing models. However, we show that increasing the complexity of FastQA to FastQAExt does not yield any systematic improvements. We argue that the same holds true for most existing systems that are similar to FastQAExt. A manual analysis reveals that our proposed heuristic explains most predictions of our model, which indicates that modeling a simple heuristic is enough to achieve strong performance on extractive QA datasets. The overall strong performance of FastQA puts results of existing, more complex models into perspective.

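FastQA: a simple and efficient neural architecture for question answering

The recent development of large-scale question answering (QA) datasets has triggered a great deal of research into end-to-end neural architectures for QA. The problem is that increasingly complex systems have been designed without comparison against a simpler neural baseline that would justify their complexity. The authors propose a simple heuristic that guided the development of FastQA, an efficient end-to-end neural model for QA that is very competitive with existing models. They further show that an extended version (FastQAExt) achieves state-of-the-art results on the benchmarks SQuAD, NewsQA and MsMARCO, outperforming most existing models. At the same time, they show that the added complexity from FastQA to FastQAExt does not yield any systematic improvement. A manual analysis reveals that the proposed heuristic explains most of the model's predictions, which indicates that modeling a simple heuristic is enough to achieve strong performance on extractive QA datasets. The overall strong performance of FastQA puts the results of existing, more complex models into perspective.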

arXiv:1703.04718
Extending Automatic Discourse Segmentation for Texts in Spanish to Catalan

At present, automatic discourse analysis is a relevant research topic in the field of NLP. However, discourse is one of the phenomena most difficult to process. Although discourse parsers have been already developed for several languages, this tool does not exist for Catalan. In order to implement this kind of parser, the first step is to develop a discourse segmenter. In this article we present the first discourse segmenter for texts in Catalan. This segmenter is based on Rhetorical Structure Theory (RST) for Spanish, and uses lexical and syntactic information to translate rules valid for Spanish into rules for Catalan. We have evaluated the system by using a gold standard corpus including manually segmented texts and results are promising.

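Extending automatic discourse segmentation for Spanish texts to Catalan

Automatic discourse analysis is a relevant research topic in NLP. However, discourse is one of the hardest phenomena to process. Although discourse parsers have already been developed for several languages, no such tool exists for Catalan. The first step towards implementing such a parser is to develop a discourse segmenter. The authors present the first discourse segmenter for Catalan texts. The segmenter is based on Rhetorical Structure Theory (RST) for Spanish, and uses lexical and syntactic information to translate rules valid for Spanish into rules for Catalan. The authors evaluate the system on a gold-standard corpus that includes manually segmented texts, and the results are promising.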

arXiv:1703.04677
A computational investigation of sources of variability in sentence comprehension difficulty in aphasia

We present a computational evaluation of three hypotheses about sources of deficit in sentence comprehension in aphasia: slowed processing, intermittent deficiency, and resource reduction. The ACT-R based Lewis & Vasishth 2005 model is used to implement these three proposals. Slowed processing is implemented as slowed default production-rule firing time; intermittent deficiency as increased random noise in activation of chunks in memory; and resource reduction as reduced goal activation. As data, we considered subject vs. object relatives presented in a self-paced listening modality to 56 individuals with aphasia (IWA) and 46 matched controls. The participants heard the sentences and carried out a picture verification task to decide on an interpretation of the sentence. These response accuracies are used to identify the best parameters (for each participant) that correspond to the three hypotheses mentioned above. We show that controls have more tightly clustered (less variable) parameter values than IWA; specifically, compared to controls, among IWA there are more individuals with low goal activations, high noise, and slow default action times. This suggests that (i) individual patients show differential amounts of deficit along the three dimensions of slowed processing, intermittent deficient, and resource reduction, (ii) overall, there is evidence for all three sources of deficit playing a role, and (iii) IWA have a more variable range of parameter values than controls. In sum, this study contributes a proof of concept of a quantitative implementation of, and evidence for, these three accounts of comprehension deficits in aphasia.

This one does not seem closely related to my work; it is a study of aphasia.

arXiv:1703.04650
Joint Learning of Correlated Sequence Labelling Tasks Using Bidirectional Recurrent Neural Networks

The stream of words produced by Automatic Speech Recognition (ASR) systems is devoid of any punctuations and formatting. Most natural language processing applications usually expect segmented and well-formatted texts as input, which is not available in ASR output. This paper proposes a novel technique of jointly modelling multiple correlated tasks such as punctuation and capitalization using bidirectional recurrent neural networks, which leads to improved performance for each of these tasks. This method can be extended for joint modelling of any other correlated multiple sequence labelling tasks.

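The word stream produced by automatic speech recognition (ASR) systems contains no punctuation or formatting. Most natural language processing applications expect segmented, well-formatted text as input, which ASR output does not provide. This paper proposes a novel technique that uses bidirectional recurrent neural networks to jointly model multiple correlated tasks, such as punctuation and capitalization, and this improves performance on each of the tasks. The method can be extended to the joint modelling of any other correlated sequence labelling tasks.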
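
The abstract does not describe the label scheme, so the following is only a sketch of how training targets for two correlated labelling tasks could be derived from formatted text. Each token gets a punctuation label (the mark that follows it, or `O`) and a capitalization label. The function name and the label sets (`O`, `CAP`, `LOWER`) are hypothetical choices, not the paper's.

```python
import re


def make_joint_labels(text):
    """Turn formatted text into (token, punctuation_label, capitalization_label).

    The lowercased, punctuation-free tokens imitate raw ASR output; the two
    labels are the targets a joint model would predict for each token.
    """
    examples = []
    for raw in text.split():
        # Split an optional trailing punctuation mark off the word.
        match = re.match(r"^(.*?)([.,?!]?)$", raw)
        word, punct = match.group(1), match.group(2)
        if not word:
            continue  # skip stray punctuation with no word attached
        cap = "CAP" if word[0].isupper() else "LOWER"
        examples.append((word.lower(), punct or "O", cap))
    return examples


labels = make_joint_labels("Hello, how are you? I am fine.")
```

Because both label sequences are aligned to the same tokens, a single bidirectional network can share its hidden states between the two output layers, which is the kind of joint modelling the abstract describes.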

arXiv:1703.04617
Exploring Question Understanding and Adaptation in Neural-Network-Based Question Answering

The last several years have seen intensive interest in exploring neural-network-based models for machine comprehension (MC) and question answering (QA). In this paper, we approach the problems by closely modelling questions in a neural network framework. We first introduce syntactic information to help encode questions. We then view and model different types of questions and the information shared among them as an adaptation task and proposed adaptation models for them. On the Stanford Question Answering Dataset (SQuAD), we show that these approaches can help attain better results over a competitive baseline.

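Exploring question understanding and adaptation in neural-network-based question answering

The last several years have seen intense interest in neural-network-based models for machine comprehension and question answering. The authors approach these problems by closely modelling questions in a neural network framework. They first introduce syntactic information to help encode questions. They then treat the modelling of different question types, and of the information shared among them, as an adaptation task, and propose adaptation models for it. On the Stanford Question Answering Dataset (SQuAD), these approaches help attain better results than a competitive baseline.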

arXiv:1703.05123
Character-based Neural Embeddings for Tweet Clustering

In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations related to the vocabulary explosion in the word-based models and allows for the seamless processing of the multilingual content. Our evaluation results and code are available on-line at https://github.com/vendi12/tweet2vec_clustering

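Character-based neural embeddings for tweet clustering

The authors show how character-based neural networks can improve the performance of tweet clustering. The approach overcomes the limitations caused by vocabulary explosion in word-based models, and allows seamless processing of multilingual content. The evaluation results and code are available on GitHub.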

arXiv:1703.04908
Emergence of Grounded Compositional Language in Multi-Agent Populations

By capturing statistical patterns in large corpora, machine learning has enabled significant advances in natural language processing, including in machine translation, question answering, and sentiment analysis. However, for agents to intelligently interact with humans, simply capturing the statistical patterns is insufficient. In this paper we investigate if, and how, grounded compositional language can emerge as a means to achieve goals in multi-agent populations. Towards this end, we propose a multi-agent learning environment and learning methods that bring about emergence of a basic compositional language. This language is represented as streams of abstract discrete symbols uttered by agents over time, but nonetheless has a coherent structure that possesses a defined vocabulary and syntax. We also observe emergence of non-verbal communication such as pointing and guiding when language communication is unavailable.

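Emergence of grounded compositional language in multi-agent populations

By capturing statistical patterns in large corpora, machine learning has enabled significant advances in many natural language processing tasks, such as MT, QA, and sentiment analysis. However, for agents to interact intelligently with humans, merely capturing statistical patterns is not enough. The authors investigate whether, and how, grounded compositional language can emerge as a means of achieving goals in multi-agent populations.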

arXiv:1703.04854
Distributed-Representation Based Hybrid Recommender System with Short Item Descriptions

Collaborative filtering (CF) aims to build a model from users’ past behaviors and/or similar decisions made by other users, and use the model to recommend items for users. Despite the success of previous collaborative filtering approaches, they are all based on the assumption that there are sufficient rating scores available for building high-quality recommendation models. In real world applications, however, it is often difficult to collect sufficient rating scores, especially when new items are introduced into the system, which makes the recommendation task challenging. We find that there are often “short” texts describing features of items, based on which we can approximate the similarity of items and make recommendation together with rating scores. In this paper we “borrow” the idea of vector representation of words to capture the information of short texts and embed it into a matrix factorization framework. We empirically show that our approach is effective by comparing it with state-of-the-art approaches.

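A distributed-representation-based hybrid recommender system with short item descriptions

Collaborative filtering builds a model from users' past behaviors and/or similar decisions made by other users, and uses it to recommend items. Existing collaborative filtering approaches all assume that enough rating scores are available to build a high-quality recommendation model. In practice it is often hard to collect enough ratings, especially when new items are added to the system. The authors observe that items often come with “short” text descriptions, from which the similarity between items can be approximated and then combined with rating scores to make recommendations. They borrow the idea of vector representations of words to capture the information in the short texts, and embed it into a matrix factorization framework. Their experiments show that the approach is effective compared with state-of-the-art approaches.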
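
The abstract does not say how the short texts are turned into vectors, so the sketch below makes a common simplifying assumption: an item description is represented by the average of its word vectors, and item similarity is the cosine between those averages. The tiny two-dimensional word vectors and the function names are invented for illustration; a real system would use pretrained embeddings.

```python
import math


def embed(text, vectors):
    """Average the vectors of the known words in a short description."""
    dim = len(next(iter(vectors.values())))
    words = [w for w in text.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word vectors (illustrative values only).
vectors = {
    "red": [1.0, 0.0],
    "lamp": [1.0, 0.0],
    "shoes": [0.0, 1.0],
    "boots": [0.0, 1.0],
}
sim_similar = cosine(embed("red shoes", vectors), embed("red boots", vectors))
sim_different = cosine(embed("shoes", vectors), embed("lamp", vectors))
```

A similarity computed this way is available even for a brand-new item with no ratings, which is the cold-start situation the abstract highlights.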

arXiv:1703.04783
Multichannel End-to-end Speech Recognition

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.

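Multichannel end-to-end speech recognition

In speech recognition, end-to-end neural networks are seriously challenging the dominance of hidden Markov models. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, which allows the acoustic and language modeling components to be trained jointly, end to end. The authors extend this end-to-end framework so that the acoustic encoding network also covers microphone-array signal processing for noise suppression and speech enhancement. This lets the beamforming components be optimized jointly within the recognition architecture, improving the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that the multichannel end-to-end system outperforms the attention-based baseline that takes its input from a conventional adaptive beamformer.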