A Roundup of word2vec Papers and Resources

A Beginner’s Guide to Word2Vec and Neural Word Embeddings

Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.
Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.
Why? Because words are simply discrete states like the other data mentioned above, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.
The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, features such as the context of individual words. It does so without human intervention.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.
The output of the Word2vec neural net is a vocabulary in which each item has a vector attached to it, which can be fed into a deep-learning net or simply queried to detect relationships between words.
When measuring cosine similarity, no similarity is expressed as a 90-degree angle, while total similarity of 1 is a 0-degree angle, i.e. complete overlap: Sweden equals Sweden, while Norway has a cosine similarity of 0.760124 with Sweden, the highest of any other country.
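A minimal sketch of querying such a set of vectors with gensim; the model name (glove-wiki-gigaword-100) and the word pairs are illustrative, not the exact setup behind the Sweden/Norway figure above.

import gensim.downloader as api

# Load any pre-trained KeyedVectors; the specific model here is just an example
vectors = api.load("glove-wiki-gigaword-100")

# Cosine similarity: 1.0 means identical (0 degrees), 0.0 means unrelated (90 degrees)
print(vectors.similarity("sweden", "sweden"))   # 1.0 -- a word is identical to itself
print(vectors.similarity("sweden", "norway"))   # high, but below 1.0

# Analogy query: "man" is to "boy" as "woman" is to ... ?
print(vectors.most_similar(positive=["woman", "boy"], negative=["man"], topn=3))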

A Beginner’s Guide to Word2Vec and Neural Word Embeddings

The Illustrated Word2vec

I find the concept of embeddings to be one of the most fascinating ideas in machine learning. If you’ve ever used Siri, Google Assistant, Alexa, Google Translate, or even a smartphone keyboard with next-word prediction, then chances are you’ve benefitted from this idea that has become central to Natural Language Processing models. There has been quite a development over the last couple of decades in using embeddings for neural models (recent developments include contextualized word embeddings, leading to cutting-edge models like BERT and GPT2).
Word2vec is a method to efficiently create word embeddings and has been around since 2013. But in addition to its utility as a word-embedding method, some of its concepts have been shown to be effective in creating recommendation engines and making sense of sequential data even in commercial, non-language tasks. Companies like Airbnb, Alibaba, Spotify, and Anghami have all benefitted from carving out this brilliant piece of machinery from the world of NLP and using it in production to empower a new breed of recommendation engines.
In this post, we’ll go over the concept of embedding, and the mechanics of generating embeddings with word2vec. But let’s start with an example to get familiar with using vectors to represent things. Did you know that a list of five numbers (a vector) can represent so much about your personality?
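As a tiny, hypothetical taste of that idea (the five "traits" and every number below are invented purely for illustration), people can be represented as vectors and compared with cosine similarity:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up five-dimensional "personality" vectors, each trait scored from -1 to 1
person      = np.array([-0.4, 0.8, 0.5, -0.2, 0.3])
candidate_a = np.array([-0.3, 0.2, 0.3, -0.4, 0.9])
candidate_b = np.array([-0.5, 0.7, 0.6, -0.1, 0.4])

# The candidate with the higher cosine similarity has the more similar "personality"
print(cosine_similarity(person, candidate_a))
print(cosine_similarity(person, candidate_b))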

The Illustrated Word2vec

word2vec Google

Introduction
This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
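A rough sketch, separate from the tool itself, of the training pairs the two architectures are built around, using a toy sentence and an assumed window size of 2:

sentence = "the quick brown fox jumps".split()
window = 2

for i, center in enumerate(sentence):
    # Words within `window` positions of the center word
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # Skip-gram: the center word is used to predict each context word separately
    skip_gram_pairs = [(center, c) for c in context]
    # Continuous bag-of-words (CBOW): the whole context predicts the center word
    cbow_pair = (context, center)
    print(skip_gram_pairs, cbow_pair)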

Pre-trained word and phrase vectors
We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [2]. The archive is available here: GoogleNews-vectors-negative300.bin.gz.
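A minimal sketch of loading the published archive with gensim, assuming the file has already been downloaded to the working directory:

from gensim.models import KeyedVectors

# The archive is in the binary word2vec format, hence binary=True
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz",
    binary=True,
)
print(vectors["dog"].shape)              # (300,) -- 300-dimensional vectors
print(vectors.most_similar("dog", topn=5))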

word2vec Google

[Paper] Information-Theory Interpretation of the Skip-Gram Negative-Sampling Objective Function

Abstract
In this paper, we define a measure of dependency between two random variables, based on the Jensen-Shannon (JS) divergence between their joint distribution and the product of their marginal distributions. Then, we show that word2vec’s skip-gram with negative sampling embedding algorithm finds the optimal low-dimensional approximation of this JS dependency measure between the words and their contexts. The gap between the optimal score and the low-dimensional approximation is demonstrated on a standard text corpus.
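For context, the per-pair objective that skip-gram with negative sampling maximizes for an observed word-context pair $(w, c)$, in the standard notation of the negative-sampling papers ($k$ negative samples drawn from a noise distribution $P_n$), is

$$
\log \sigma(\vec{v}_c \cdot \vec{v}_w) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\left[\log \sigma(-\vec{v}_{c_i} \cdot \vec{v}_w)\right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}},
$$

where $\vec{v}_w$ and $\vec{v}_c$ are the word and context embeddings; this is the quantity whose optimum the paper relates to the JS-based dependency measure.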

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 167–171,
Vancouver, Canada, July 30 - August 4, 2017.
© 2017 Association for Computational Linguistics
https://doi.org/10.18653/v1/P17-2026

[Paper] Word Embeddings for User Profiling in Online Social Networks

Abstract
User profiling in social networks can be significantly augmented by using available full-text items such as posts or statuses and ratings (in the form of likes) that users give them. In this work, we apply modern natural language processing techniques based on word embeddings to several problems related to user profiling in social networks. First, we present an approach to create user profiles that measure a user’s interest in various topics mined from the full texts of the items. As a result, we get a user profile that can be used, e.g., for cold start recommendations for items, targeted advertisement, and other purposes; our experiments show that the interests mining method performs on a level comparable with collaborative algorithms while at the same time being a cold start approach, i.e., it does not use the likes of an item being recommended.
Second, we study the problem of predicting a user’s demographic attributes such as age and gender based on his or her full-text items. We evaluate the efficiency of various age prediction algorithms based on word2vec word embeddings and conduct an extensive experimental evaluation, comparing these algorithms with each other and with classical baseline approaches.
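A rough sketch of the general idea rather than the authors' exact pipeline: average the word vectors of a user's posts into a fixed-length profile vector and feed it to an off-the-shelf classifier. The model name, toy posts, and labels below are placeholders.

import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

vectors = api.load("glove-twitter-25")            # any pre-trained word vectors

def profile_vector(posts):
    # Average the vectors of all in-vocabulary words across a user's posts
    words = [w for post in posts for w in post.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

# Hypothetical toy data: each user's posts plus a binary demographic label
users = [(["loving this new game", "streaming again tonight"], 0),
         (["great paper on word vectors", "conference deadline next week"], 1)]
X = np.stack([profile_vector(posts) for posts, _ in users])
y = [label for _, label in users]

clf = LogisticRegression().fit(X, y)              # stand-in for the classifiers compared in the paper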

Word Embeddings for User Profiling in Online Social Networks

[Paper] Support vector machines and Word2vec for text classification with semantic features

Abstract
With the rapid expansion of new available information presented to us online on a daily basis, text classification becomes imperative in order to classify and maintain it. Word2vec offers a unique perspective to the text mining community. By converting words and phrases into a vector representation, word2vec takes an entirely new approach on text classification. Based on the assumption that word2vec brings extra semantic features that helps in text classification, our work demonstrates the effectiveness of word2vec by showing that tf-idf and word2vec combined can outperform tf-idf because word2vec provides complementary features (e.g. semantics that tf-idf can’t capture) to tf-idf. Our results show that the combination of word2vec weighted by tf-idf and tf-idf does not outperform tf-idf consistently. It is consistent enough to say the combination of the two can outperform either individually.
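A rough sketch of the kind of feature combination the abstract describes, with placeholder documents, a stand-in set of pre-trained vectors, and default SVM settings:

import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs   = ["cats purr and sleep all day", "stock markets fell sharply",
          "kittens nap in the sun", "shares dropped after earnings"]
labels = [0, 1, 0, 1]

vectors = api.load("glove-wiki-gigaword-50")      # stand-in for word2vec vectors

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs).toarray()
vocab = tfidf.get_feature_names_out()

def weighted_embedding(doc_row):
    # tf-idf-weighted average of the word vectors appearing in one document
    pairs = [(vectors[w], doc_row[i]) for i, w in enumerate(vocab)
             if doc_row[i] > 0 and w in vectors]
    return np.average([v for v, _ in pairs], axis=0, weights=[wt for _, wt in pairs])

X_w2v = np.stack([weighted_embedding(row) for row in X_tfidf])
X_combined = np.hstack([X_tfidf, X_w2v])          # tf-idf features + weighted embedding features

clf = LinearSVC().fit(X_combined, labels)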

Support vector machines and Word2vec for text classification with semantic features

Gensim Pretrained models

Introduction
This module implements the word2vec family of algorithms, using highly optimized C routines, data streaming and Pythonic interfaces.
The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling: Tomas Mikolov et al: Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al: Distributed Representations of Words and Phrases and their Compositionality.
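A minimal sketch of training with this module; the toy corpus and hyperparameter values below are only illustrative.

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "dog", "sleeps"]]

# sg=1 selects skip-gram (sg=0 would be CBOW); hs=0 with negative>0 selects
# negative sampling (hs=1 would switch to hierarchical softmax)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, hs=0, negative=5, epochs=10)

print(model.wv["fox"].shape)             # (100,)
print(model.wv.most_similar("fox"))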

Word2vec embeddings

import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Use the downloaded vectors as usual:
glove_vectors.most_similar('twitter')
[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863)]

[Project] Chinese text keyword extraction in Python, implemented three ways: TF-IDF, TextRank, and Word2Vec word clustering (with dataset and documentation).

A document's keywords are the N words that best express its main idea, i.e., the words that matter most to the document. Keyword extraction can therefore be cast as a word-importance ranking problem: rank the words and take the top N as the document's keywords. Mainstream keyword extraction methods fall into two broad categories:

(1) Statistics-based methods. These compute word weights from statistical information such as term frequency and rank words by weight. TF-IDF and TextRank both belong to this category: TF-IDF weights a word by its term frequency (TF) and inverse document frequency (IDF), while TextRank, following the idea of PageRank, builds a co-occurrence network from a sliding window over the text and scores words on that graph. Such methods are simple and widely applicable, but they ignore word order.

(2) Machine-learning-based methods. These include supervised approaches such as SVMs and naive Bayes, as well as unsupervised approaches such as K-means and hierarchical clustering. Their quality depends on feature extraction, and deep learning is an effective way to extract features. Google's Word2Vec model is a representative tool here: in training a language model it maps the vocabulary into an abstract vector space in which each word is represented by a high-dimensional vector, and the distance between two points reflects the similarity of the corresponding words.

Building on this, the project implements keyword extraction in Python with the TF-IDF, TextRank, and Word2Vec word-clustering methods, describing the principle, workflow, and code of each. Because the test corpus is small and rather specialized, no formal evaluation was done; by inspection, the ten keywords extracted per text all reflect the text's main topic, with TF-IDF and TextRank giving better results and Word2Vec word clustering performing poorly, consistent with the conclusion of reference [8]: when Word2Vec clustering is applied directly to a single document, taking the cluster center as a keyword is itself inaccurate, so the N words nearest that center are not necessarily keywords either. TextRank, a graph-based ranking algorithm, is comparatively stable for single-document keyword extraction, which is why many papers improve on TextRank to raise extraction accuracy.

The experiments are mainly meant to illustrate the ideas and workflow of the three methods, and several details could still be improved: the corpus used to train the Word2Vec model could be extended with domain-specific text; title words often carry important information and could be given extra initial weight; if the test set contains long documents from several categories, the n_clusters parameter of the KMeans() call should be set to the number of categories; words that appear in every document with a frequency above a specified threshold could be removed after segmentation; and so on. Parameters and details can be tuned to your own setting or with reference to the literature.
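A rough sketch of the Word2Vec word-clustering method described above (the toy corpus, model settings, and n_clusters value are placeholders): cluster the document's word vectors with KMeans and take the words nearest each cluster center as candidate keywords.

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Placeholder tokenized corpus; in practice this would be the segmented
# Chinese text (e.g. produced by a word segmenter) that the project uses.
corpus = [["word2vec", "learns", "word", "vectors", "from", "text"],
          ["keywords", "summarize", "the", "main", "topic", "of", "a", "document"],
          ["clustering", "groups", "similar", "word", "vectors", "together"]]

model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50)

# Vectors for the words of the document we want keywords for
doc_words = sorted({w for sent in corpus for w in sent})
doc_vectors = np.stack([model.wv[w] for w in doc_words])

# n_clusters would normally be set to the number of expected topics/categories
kmeans = KMeans(n_clusters=2, n_init=10).fit(doc_vectors)

# For each cluster, the words closest to its center are the candidate keywords
for center in kmeans.cluster_centers_:
    distances = np.linalg.norm(doc_vectors - center, axis=1)
    print([doc_words[i] for i in np.argsort(distances)[:3]])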
