Creating Word Vectors with the word2vec Package in R

Introduction

Vectorizing words is a fundamental step in natural language processing. This article explains how to do it with the word2vec package in R.

The word2vec() function

word2vec(
  x,
  type = c("cbow", "skip-gram"),
  dim = 50,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 5L,
  lr = 0.05,
  hs = FALSE,
  negative = 5L,
  sample = 0.001,
  min_count = 5L,
  split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
  stopwords = character(),
  threads = 1L,
  encoding = "UTF-8",
  ...
)

Arguments

  • x
    a character vector with text or the path to the file on disk containing training data

  • type
    the type of algorithm to use, either ‘cbow’ or ‘skip-gram’. Defaults to ‘cbow’

  • dim
    dimension of the word vectors. Defaults to 50.

  • window
skip length between words. Defaults to 5 for cbow and 10 for skip-gram, as shown in the signature above.

  • iter
    number of training iterations. Defaults to 5.

  • lr
initial learning rate, also known as alpha. Defaults to 0.05.

  • hs
    logical indicating to use hierarchical softmax instead of negative sampling. Defaults to FALSE indicating to do negative sampling.

  • negative
    integer with the number of negative samples. Only used in case hs is set to FALSE

  • sample
    threshold for occurrence of words. Defaults to 0.001

  • min_count
integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

  • split
    a character vector of length 2 where the first element indicates how to split words and the second element indicates how to split sentences in x

  • stopwords
    a character vector of stopwords to exclude from training

  • threads
    number of CPU threads to use. Defaults to 1.

  • encoding
the encoding of x and stopwords. Defaults to ‘UTF-8’. Model calculation always starts from files, which allows building a model on large corpora. The encoding argument is passed on to file when x is written to hard disk, in case you provided it as a character vector.


  • ...
    further arguments passed on to the C++ function w2v_train - for expert use only

Note

Some advice on the optimal set of parameters to use for training, as given by Mikolov et al. (a configuration sketch applying this advice follows the list):

  • argument type: skip-gram (slower, better for infrequent words) vs cbow (fast)

  • argument hs: the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)

  • argument dim: dimensionality of the word vectors: usually more is better, but not always

  • argument window: for skip-gram usually around 10, for cbow around 5

  • argument sample: sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 0.001 to 0.00001)
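
Applying that advice, here is a minimal sketch of two typical configurations. The corpus file "corpus.txt" is a placeholder; substitute your own character vector or file path.

library(word2vec)

txt <- readLines("corpus.txt", encoding = "UTF-8")  # hypothetical corpus

# Fast baseline: cbow with negative sampling and a narrow window
m_cbow <- word2vec(x = txt, type = "cbow", dim = 100,
                   window = 5, negative = 5, sample = 0.001)

# Slower but better for rare words: skip-gram with hierarchical softmax
m_sg <- word2vec(x = txt, type = "skip-gram", dim = 100,
                 window = 10, hs = TRUE, sample = 1e-5)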

Return value

an object of class w2v_trained which is a list with elements

  • model: a Rcpp pointer to the model

  • data: a list with elements file (the training data used), stopwords (the character vector of stopwords) and n

  • vocabulary: the number of words in the vocabulary

  • success: logical indicating if training succeeded

  • error_log: the error log in case training failed

  • control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax
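
A quick, hedged way to inspect these elements on a trained model (using the model object created in the example below):

model$success          # TRUE if training completed without errors
model$vocabulary       # number of words in the vocabulary
model$control$dim      # dimensionality actually used
model$control$window   # window size actually used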

Example

library(udpipe)
library(word2vec)
## Read the data
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

x[1:3]
# [1] "zeer leuke plek om te vertoeven , rustig en toch erg centraal gelegen in het centrum van brussel , leuk adres om te kennen , als we terug naar brussel komen zullen we zeker teruggaan ! \n"                                                                                                                                                                                                              
# [2] "het appartement ligt op een goede locatie: op loopafstand van de europese wijk en vlakbij verschilende metrostations en toch rustig. het is zeer ruim en modern ingericht en voorzien van alle gemakken. ik vond het de perfecte plek om te verblijven als je in brussel moet zijn. het contact met de verhuurders was prettig en er was een duidelijke instructie voor het verkrijgen van de sleutels.  "
# [3] "bedankt bettina en collin. ik ben heel blij dat ik bij jullie heb verbleven, in zo'n prachtige stille omgeving en dat toch vlakbij het centrum van brussel. jullie zijn heel warme, vriendelijke mensen. en zo lief dat ik een bordje mocht mee eten als ik wou. "  
# 

# Train the word vectors
model <- word2vec(x = x, type = "skip-gram", dim = 15, window = 5, iter = 20, stopwords = c("a", "and", "is"))
emb   <- as.matrix(model)
head(emb)
#                  [,1]       [,2]      [,3]       [,4]       [,5]       [,6]       [,7]       [,8]
# persoonlijke -0.3042065 -0.1184860 -1.201266 -2.3450553  0.8367136 1.04953480 -1.2723569 -1.4005361
# horen         0.7087634 -0.3501654 -1.595468  0.1109525  2.0254283 1.23821533  1.5700442 -1.2268434
# atomium       1.2572429 -0.1606519 -1.493153  0.4524430  0.8455694 0.07073398 -0.7973976 -1.4407603
# maakte        0.3302444 -0.9569687 -1.484279 -1.7543029  0.3030996 0.98232365  1.3279068 -1.6941810
# best          0.7629781 -0.6842179 -1.688801 -0.7240605  1.8485303 0.86944288  1.1971822 -1.7174718
# vele          0.9057386 -2.0495012 -1.318962 -1.6884000 -0.2110360 0.94611353 -0.2507533  0.1012649
#                   [,9]       [,10]      [,11]      [,12]       [,13]       [,14]      [,15]
# persoonlijke -0.19206892 -0.06204492  1.3579390 -0.3258170 -0.34472424 -0.62112594 -0.2712209
# horen         0.03050502 -0.64446056 -0.9804438 -0.4411592 -0.32660010 -0.59695488  0.4198991
# atomium       1.40547168 -0.18392503 -0.5595728  0.3433310 -0.86827427 -2.03704834  0.4290147
# maakte       -0.67575353 -0.45142314  0.5420703  0.2255125 -1.27246237 -0.05663725 -0.6136616
# best         -0.26500553  0.10107612  0.6706604 -0.9874146 -0.03598274 -0.20110258 -0.6859254
# vele          0.19042419 -0.35138375 -0.7454548  0.5159962  0.44680297 -1.77189505  0.2347254
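
Each row of the matrix is one vocabulary word and each column one of the dim = 15 dimensions, so individual vectors can be looked up by rowname. A small illustrative check:

dim(emb)       # number of vocabulary words x 15
emb["bus", ]   # the 15-dimensional vector for a single word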


# Get the embeddings of new words
emb <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
#                [,1]       [,2]      [,3]     [,4]     [,5]      [,6]       [,7]       [,8]
# bus         1.2695463 -0.7826285 -1.491744  1.18059 1.169373 1.0245659  0.5599136 -0.9842771
# toilet      0.6540414 -0.9534258 -1.994404 -0.92366 1.082000 0.3969926 -0.4183521  0.3170334
# unknownword        NA         NA        NA       NA       NA        NA         NA         NA
#                 [,9]      [,10]     [,11]        [,12]       [,13]     [,14]      [,15]
# bus          1.648509  0.3475770 -0.571207  0.003765954 -0.09555515 -1.482624 -0.2918743
# toilet      -1.834143 -0.9669787 -1.333826 -0.838830054  0.12554233 -0.486763  0.4404586
# unknownword        NA         NA        NA           NA          NA        NA         NA
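
Out-of-vocabulary words come back as all-NA rows, so it can help to filter them out before downstream use. A minimal sketch:

# Keep only rows without NA values, i.e. words that were in the vocabulary
emb_known <- emb[rowSums(is.na(emb)) == 0, , drop = FALSE]
rownames(emb_known)   # "bus" "toilet"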

# Find the most similar words
nn  <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn
# $bus
# term1       term2 similarity rank
# 1   bus       metro  0.9660415    1
# 2   bus gemakkelijk  0.9644725    2
# 3   bus        voet  0.9583727    3
# 4   bus      gratis  0.9559109    4
# 5   bus        tram  0.9528881    5
# 
# $toilet
# term1     term2 similarity rank
# 1 toilet  koelkast  0.9544268    1
# 2 toilet    douche  0.9469139    2
# 3 toilet    werkte  0.9395254    3
# 4 toilet        tv  0.9391247    4
# 5 toilet slaapbank  0.9203613    5
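
The similarity column above can be cross-checked by hand with cosine similarity between the raw vectors; this is a small sketch, not part of the package API:

emb <- as.matrix(model)
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(emb["bus", ], emb["metro", ])   # should be close to the similarity reported above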


# Get the set of words used in training
vocab <- summary(model, type = "vocabulary")
# Number of words in the vocabulary
model$vocabulary
# [1] 639

# Find the words nearest to a given vector
emb <- as.matrix(model)
vector <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)
#               term similarity rank
# 1           cafe  0.9898908    1
# 2          cafes  0.9836166    2
# 3           tips  0.9831459    3
# 4  restaurantjes  0.9734224    4
# 5          buurt  0.9720594    5
# 6       allerlei  0.9607190    6
# 7          geven  0.9604430    7
# 8           veel  0.9536935    8
# 9           wijk  0.9472404    9
# 10          eten  0.9413667   10

## Save and reload the model
path <- "mymodel.bin"

write.word2vec(model, file = path)
model <- read.word2vec(path)
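
As a sanity check, the reloaded model should support the same predict() interface and return the same neighbours as before saving:

predict(model, "bus", type = "nearest", top_n = 3)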

Notes on Chinese word segmentation

When working with Chinese text, first segment the sentences with the jiebaR package so that, as in English, each word is separated by a space, and then pass the result to word2vec() to vectorize the Chinese words, as in the sketch below.
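
A minimal sketch of this workflow, with made-up example sentences; the segmentation settings of jiebaR's worker() are left at their defaults:

library(jiebaR)
library(word2vec)

docs <- c("我喜欢自然语言处理", "词向量是自然语言处理的基础")
wk   <- worker()   # default jiebaR segmenter

# Join the segmented tokens with spaces so word2vec() can split on them
segmented <- sapply(docs, function(d) paste(segment(d, wk), collapse = " "))

# min_count lowered to 1 because this toy corpus is tiny
model_zh <- word2vec(x = segmented, dim = 10, iter = 5, min_count = 1)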
