Introduction
Vectorizing words is a fundamental step in natural language processing. This post explains how to do it in R with the word2vec package.
The word2vec() function
word2vec(
x,
type = c("cbow", "skip-gram"),
dim = 50,
window = ifelse(type == "cbow", 5L, 10L),
iter = 5L,
lr = 0.05,
hs = FALSE,
negative = 5L,
sample = 0.001,
min_count = 5L,
split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
stopwords = character(),
threads = 1L,
encoding = "UTF-8",
...
)
Arguments
The arguments are as follows; a minimal call illustrating the defaults follows the list.
- x: a character vector with text, or the path to a file on disk containing the training data
- type: the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'.
- dim: dimension of the word vectors. Defaults to 50.
- window: skip length between words. As the signature shows, defaults to 5 for cbow and 10 for skip-gram.
- iter: number of training iterations. Defaults to 5.
- lr: initial learning rate, also known as alpha. Defaults to 0.05.
- hs: logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, i.e. negative sampling.
- negative: integer with the number of negative samples. Only used when hs is FALSE.
- sample: threshold for the occurrence of words. Defaults to 0.001.
- min_count: integer indicating the number of times a word must occur to be part of the training vocabulary. Defaults to 5.
- split: a character vector of length 2, where the first element indicates how to split words and the second element how to split sentences in x
- stopwords: a character vector of stopwords to exclude from training
- threads: number of CPU threads to use. Defaults to 1.
- encoding: the encoding of x and stopwords. Defaults to 'UTF-8'. Building the model always starts from files, which allows training on large corpora; the encoding argument is passed on to file when x, provided as a character vector, is written to disk.
- ...: further arguments passed on to the C++ function w2v_train; for expert use only
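For example, a minimal call that relies on the defaults could look like this (a sketch: the toy corpus and min_count = 1 are only for illustration; real training needs a much larger corpus):
library(word2vec)
## Toy corpus; in practice pass a large character vector or a file path
txt <- c("parijs is de hoofdstad van frankrijk",
         "brussel is de hoofdstad van belgie")
## min_count = 1 keeps every word despite the tiny corpus
m <- word2vec(x = txt, dim = 10, iter = 5, min_count = 1)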
Notes
Some advice on the optimal set of training parameters, as given by Mikolov et al. (a sketch applying this advice follows the list):
- type: skip-gram (slower, better for infrequent words) vs. cbow (faster)
- hs: the training algorithm: hierarchical softmax (better for infrequent words) vs. negative sampling (better for frequent words, and better with low-dimensional vectors)
- dim: dimensionality of the word vectors: usually more is better, but not always
- window: for skip-gram usually around 10, for cbow around 5
- sample: sub-sampling of frequent words: can improve both accuracy and speed on large data sets (useful values are in the range 0.001 to 0.00001)
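Applied to the arguments above, this advice translates into calls along these lines (a sketch; x stands for your training text, and the exact values remain corpus-dependent):
## skip-gram: slower, better for infrequent words, wider window
model_sg <- word2vec(x = x, type = "skip-gram", window = 10, dim = 100)
## cbow: faster, typically with a narrower window
model_cbow <- word2vec(x = x, type = "cbow", window = 5, dim = 100)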
Value
An object of class w2v_trained, which is a list with the following elements (a short inspection sketch follows the list):
- model: an Rcpp pointer to the model
- data: a list with elements file (the training data used), stopwords (the character vector of stopwords) and n
- vocabulary: the number of words in the vocabulary
- success: logical indicating whether training succeeded
- error_log: the error log in case training failed
- control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax
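These elements can be read off the returned object directly. A short sketch, assuming model is the result of a word2vec() call as in the example below:
## Inspect a trained model
model$success         # TRUE if training succeeded
model$vocabulary      # number of words in the vocabulary
model$control$dim     # dimension of the word vectors
model$control$window  # window size used during training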
Example
library(udpipe)
library(word2vec)
## Read the data
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)
x[1:3]
# [1] "zeer leuke plek om te vertoeven , rustig en toch erg centraal gelegen in het centrum van brussel , leuk adres om te kennen , als we terug naar brussel komen zullen we zeker teruggaan ! \n"
# [2] "het appartement ligt op een goede locatie: op loopafstand van de europese wijk en vlakbij verschilende metrostations en toch rustig. het is zeer ruim en modern ingericht en voorzien van alle gemakken. ik vond het de perfecte plek om te verblijven als je in brussel moet zijn. het contact met de verhuurders was prettig en er was een duidelijke instructie voor het verkrijgen van de sleutels. "
# [3] "bedankt bettina en collin. ik ben heel blij dat ik bij jullie heb verbleven, in zo'n prachtige stille omgeving en dat toch vlakbij het centrum van brussel. jullie zijn heel warme, vriendelijke mensen. en zo lief dat ik een bordje mocht mee eten als ik wou. "
#
## Train the word vectors
model <- word2vec(x = x, type = "skip-gram", dim = 15, window = 5, iter = 20, stopwords = c("a", "and", "is"))
emb <- as.matrix(model)
head(emb)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# persoonlijke -0.3042065 -0.1184860 -1.201266 -2.3450553 0.8367136 1.04953480 -1.2723569 -1.4005361
# horen 0.7087634 -0.3501654 -1.595468 0.1109525 2.0254283 1.23821533 1.5700442 -1.2268434
# atomium 1.2572429 -0.1606519 -1.493153 0.4524430 0.8455694 0.07073398 -0.7973976 -1.4407603
# maakte 0.3302444 -0.9569687 -1.484279 -1.7543029 0.3030996 0.98232365 1.3279068 -1.6941810
# best 0.7629781 -0.6842179 -1.688801 -0.7240605 1.8485303 0.86944288 1.1971822 -1.7174718
# vele 0.9057386 -2.0495012 -1.318962 -1.6884000 -0.2110360 0.94611353 -0.2507533 0.1012649
# [,9] [,10] [,11] [,12] [,13] [,14] [,15]
# persoonlijke -0.19206892 -0.06204492 1.3579390 -0.3258170 -0.34472424 -0.62112594 -0.2712209
# horen 0.03050502 -0.64446056 -0.9804438 -0.4411592 -0.32660010 -0.59695488 0.4198991
# atomium 1.40547168 -0.18392503 -0.5595728 0.3433310 -0.86827427 -2.03704834 0.4290147
# maakte -0.67575353 -0.45142314 0.5420703 0.2255125 -1.27246237 -0.05663725 -0.6136616
# best -0.26500553 0.10107612 0.6706604 -0.9874146 -0.03598274 -0.20110258 -0.6859254
# vele 0.19042419 -0.35138375 -0.7454548 0.5159962 0.44680297 -1.77189505 0.2347254
## Get the embeddings of specific words (unknown words yield NA)
emb <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# bus 1.2695463 -0.7826285 -1.491744 1.18059 1.169373 1.0245659 0.5599136 -0.9842771
# toilet 0.6540414 -0.9534258 -1.994404 -0.92366 1.082000 0.3969926 -0.4183521 0.3170334
# unknownword NA NA NA NA NA NA NA NA
# [,9] [,10] [,11] [,12] [,13] [,14] [,15]
# bus 1.648509 0.3475770 -0.571207 0.003765954 -0.09555515 -1.482624 -0.2918743
# toilet -1.834143 -0.9669787 -1.333826 -0.838830054 0.12554233 -0.486763 0.4404586
# unknownword NA NA NA NA NA NA NA
## Find the nearest words
nn <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn
# $bus
# term1 term2 similarity rank
# 1 bus metro 0.9660415 1
# 2 bus gemakkelijk 0.9644725 2
# 3 bus voet 0.9583727 3
# 4 bus gratis 0.9559109 4
# 5 bus tram 0.9528881 5
#
# $toilet
# term1 term2 similarity rank
# 1 toilet koelkast 0.9544268 1
# 2 toilet douche 0.9469139 2
# 3 toilet werkte 0.9395254 3
# 4 toilet tv 0.9391247 4
# 5 toilet slaapbank 0.9203613 5
## Get the vocabulary used in training
vocab <- summary(model, type = "vocabulary")
## Number of words in the vocabulary
model$vocabulary
# [1] 639
## Find the words closest to a composed vector (word analogy)
emb <- as.matrix(model)
vector <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)
# term similarity rank
# 1 cafe 0.9898908 1
# 2 cafes 0.9836166 2
# 3 tips 0.9831459 3
# 4 restaurantjes 0.9734224 4
# 5 buurt 0.9720594 5
# 6 allerlei 0.9607190 6
# 7 geven 0.9604430 7
# 8 veel 0.9536935 8
# 9 wijk 0.9472404 9
# 10 eten 0.9413667 10
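Similarities between embeddings can also be computed directly via the package's word2vec_similarity() function; a sketch, assuming that function is available in your version of the package:
## Cosine similarity between the embeddings of two words
word2vec_similarity(emb["bus", , drop = FALSE], emb["tram", , drop = FALSE], type = "cosine")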
## Save and reload the model
path <- "mymodel.bin"
write.word2vec(model, file = path)
model <- read.word2vec(path)
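If you need the embeddings outside R, the matrix form can also be written out as plain text with base R; a sketch (the file name is a placeholder):
## Export the embedding matrix as plain text, one word per line
emb <- as.matrix(model)
write.table(emb, file = "myembeddings.txt", col.names = FALSE, quote = FALSE)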
Notes on Chinese word segmentation
For Chinese text, first segment the sentences with the jiebaR package so that each word is separated by a space, as in English, and then pass the segmented text to word2vec() to vectorize the Chinese words.
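A minimal sketch of this workflow (the Chinese sentences are toy examples, and min_count = 1 is only needed because the corpus is tiny):
library(jiebaR)
library(word2vec)
## Build a jiebaR segmenter and turn each sentence into space-separated words
cutter <- worker()
zh <- c("我喜欢自然语言处理", "词向量是自然语言处理的基础")
zh_spaced <- sapply(zh, function(s) paste(segment(s, cutter), collapse = " "))
## The space-separated text can now be vectorized exactly like English text
model_zh <- word2vec(x = zh_spaced, dim = 15, iter = 20, min_count = 1)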