Efficient Estimation of Word Representations in Vector Space (Translation)

We propose two novel model architectures for computing continuous vector representations
of words from very large data sets. The quality of these representations
is measured in a word similarity task, and the results are compared to the previously
best performing techniques based on different types of neural networks. We
observe large improvements in accuracy at much lower computational cost, i.e. it
takes less than a day to learn high quality word vectors from a 1.6 billion words
data set. Furthermore, we show that these vectors provide state-of-the-art performance
on our test set for measuring syntactic and semantic word similarities.

We propose two new model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is evaluated on a word similarity task, and the results are compared with the previously best-performing techniques based on different types of neural networks. The comparison shows large improvements in accuracy at a much lower computational cost; for example, high-quality word vectors can be learned from a 1.6-billion-word data set in less than a day of training. Furthermore, these word vectors achieve state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

 

1 Introduction
Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity
between words, as these are represented as indices in a vocabulary. This choice has several good
reasons - simplicity, robustness and the observation that simple models trained on huge amounts of
data outperform complex systems trained on less data. An example is the popular N-gram model
used for statistical language modeling - today, it is possible to train N-grams on virtually all available
data (trillions of words [3]).
Many current NLP systems and techniques treat words as atomic units: there is no notion of similarity between words, since they are represented merely as indices in a vocabulary. This choice has several advantages: simplicity, robustness, and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. A typical example is the popular N-gram model used for statistical language modeling; today it is possible to train N-grams on virtually all available data (trillions of words [3]).
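To make the "atomic units" point concrete, here is a minimal Python sketch (the toy vocabulary and function name are invented for illustration, not taken from the paper): each word is reduced to an opaque index, so the representation itself says nothing about which words are similar.

    # Toy sketch: words as atomic units, i.e. bare indices in a vocabulary.
    vocab = {"cat": 0, "dog": 1, "economy": 2}

    def word_id(word):
        # The returned integers are arbitrary labels; nothing here encodes
        # that "cat" is more similar to "dog" than to "economy".
        return vocab[word]

    print(word_id("cat"), word_id("dog"), word_id("economy"))  # 0 1 2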

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.
However, these simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited; performance is usually dominated by the amount of high-quality transcribed speech data (often only a few million words). In machine translation, the existing corpora for many languages contain only a few billion words or less. In such situations, simply scaling up the basic techniques will not yield significant progress, and we have to focus on more advanced techniques.
With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].

With the progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and these complex models typically outperform the simple ones. Probably the most successful idea is the distributed representation of words [10]; for example, neural-network-based language models significantly outperform N-gram models [1, 27, 17].
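As a rough sketch of what a distributed representation buys us (the 3-dimensional vectors below are made up for illustration; real vectors are learned from data and have far more dimensions), each word maps to a dense real-valued vector, and similarity between words can be read off as the cosine of the angle between their vectors:

    import numpy as np

    # Invented vectors purely for illustration; learned vectors typically
    # have tens to hundreds of dimensions.
    word_vectors = {
        "cat": np.array([0.8, 0.1, 0.3]),
        "dog": np.array([0.7, 0.2, 0.4]),
        "economy": np.array([-0.5, 0.9, 0.0]),
    }

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(word_vectors["cat"], word_vectors["dog"]))      # close to 1: similar
    print(cosine_similarity(word_vectors["cat"], word_vectors["economy"]))  # much lower: dissimilar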

1.1 Goals of the Paper

The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.

The main goal of this paper is to introduce techniques for learning high-quality word vectors from huge data sets (billions of words, with a vocabulary of millions of words). As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred million words with a modest word-vector dimensionality of 50-100.

We use recently proposed techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity [20].
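For readers who want to try the ideas above on their own corpus, the following is a rough sketch using the gensim library (assuming gensim 4.x is installed; parameter names such as vector_size differ in older versions). It is not the authors' original implementation, only a commonly used reimplementation of the CBOW / skip-gram architectures the paper proposes, with the dimensionality set within the 50-100 range discussed above.

    from gensim.models import Word2Vec

    # Toy corpus of tokenized sentences; in the paper's setting this would be
    # billions of words streamed from disk.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=100,  # modest dimensionality, at the top of the 50-100 range
        window=5,
        min_count=1,
        sg=1,             # 1 = skip-gram, 0 = CBOW
    )

    # Nearest neighbours by cosine similarity in the learned vector space.
    print(model.wv.most_similar("cat", topn=3))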
