Introduction
Vectorizing words is a fundamental step in natural language processing. This post explains how to do it in R with the word2vec package.
The word2vec() function
word2vec(
x,
type = c("cbow", "skip-gram"),
dim = 50,
window = ifelse(type == "cbow", 5L, 10L),
iter = 5L,
lr = 0.05,
hs = FALSE,
negative = 5L,
sample = 0.001,
min_count = 5L,
split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
stopwords = character(),
threads = 1L,
encoding = "UTF-8",
...
)
Arguments
The arguments are as follows; a minimal call illustrating the defaults follows the list.
- x: a character vector with text, or the path to a file on disk containing the training data
- type: the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'.
- dim: dimension of the word vectors. Defaults to 50.
- window: skip length between words. As the signature shows, defaults to 5 for cbow and 10 for skip-gram.
- iter: number of training iterations. Defaults to 5.
- lr: initial learning rate, also known as alpha. Defaults to 0.05.
- hs: logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, i.e. negative sampling.
- negative: integer with the number of negative samples. Only used when hs is FALSE.
- sample: threshold for the occurrence of words. Defaults to 0.001.
- min_count: integer indicating the number of times a word must occur to be part of the training vocabulary. Defaults to 5.
- split: a character vector of length 2, where the first element indicates how to split words and the second element how to split sentences in x
- stopwords: a character vector of stopwords to exclude from training
- threads: number of CPU threads to use. Defaults to 1.
- encoding: the encoding of x and stopwords. Defaults to 'UTF-8'. Building the model always starts from files, which allows training on large corpora; the encoding argument is passed on to file when x, provided as a character vector, is written to disk.
- ...: further arguments passed on to the C++ function w2v_train; for expert use only
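For example, a minimal call that relies on the defaults could look like this (a sketch: the toy corpus and min_count = 1 are only for illustration; real training needs a much larger corpus):
library(word2vec)
## Toy corpus; in practice pass a large character vector or a file path
txt <- c("parijs is de hoofdstad van frankrijk",
         "brussel is de hoofdstad van belgie")
## min_count = 1 keeps every word despite the tiny corpus
m <- word2vec(x = txt, dim = 10, iter = 5, min_count = 1)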
Notes
Some advice on the optimal set of training parameters, as given by Mikolov et al. (a sketch applying this advice follows the list):
- type: skip-gram (slower, better for infrequent words) vs. cbow (faster)
- hs: the training algorithm: hierarchical softmax (better for infrequent words) vs. negative sampling (better for frequent words, and better with low-dimensional vectors)
- dim: dimensionality of the word vectors: usually more is better, but not always
- window: for skip-gram usually around 10, for cbow around 5
- sample: sub-sampling of frequent words: can improve both accuracy and speed on large data sets (useful values are in the range 0.001 to 0.00001)
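Applied to the arguments above, this advice translates into calls along these lines (a sketch; x stands for your training text, and the exact values remain corpus-dependent):
## skip-gram: slower, better for infrequent words, wider window
model_sg <- word2vec(x = x, type = "skip-gram", window = 10, dim = 100)
## cbow: faster, typically with a narrower window
model_cbow <- word2vec(x = x, type = "cbow", window = 5, dim = 100)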
Value
An object of class w2v_trained, which is a list with the following elements (a short inspection sketch follows the list):
- model: an Rcpp pointer to the model
- data: a list with elements file (the training data used), stopwords (the character vector of stopwords) and n
- vocabulary: the number of words in the vocabulary
- success: logical indicating whether training succeeded
- error_log: the error log in case training failed
- control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax
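These elements can be read off the returned object directly. A short sketch, assuming model is the result of a word2vec() call as in the example below:
## Inspect a trained model
model$success         # TRUE if training succeeded
model$vocabulary      # number of words in the vocabulary
model$control$dim     # dimension of the word vectors
model$control$window  # window size used during training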
Example
library(udpipe)
library(word2vec)
## Read the data
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)
x[1:3]
# [1] "zeer leuke plek om te vertoeven , rustig en toch erg centraal gelegen in het centrum van brussel , leuk adres om te kennen , als we terug naar brussel komen zullen we zeker teruggaan ! \n"
# [2] "het appartement ligt op een goede locatie: op loopafstand van de europese wijk en vlakbij verschilende metrostations en toch rustig. het is zeer ruim en modern ingericht en voorzien van alle gemakken. ik vond het de perfecte plek om te verblijven als je in brussel moet zijn. het contact met de verhuurders was prettig en er was een duidelijke instructie voor het verkrijgen van de sleutels. "
# [3] "bedankt bettina en collin. ik ben heel blij dat ik bij jullie heb verbleven, in zo'n prachtige stille omgeving en dat toch vlakbij het centrum van brussel. jullie zijn heel warme, vriendelijke mensen. en zo lief dat ik een bordje mocht mee eten als ik wou. "
#
## Train the word vectors
model <- word2vec(x = x, type = "skip-gram", dim = 15, window = 5, iter = 20, stopwords = c("a", "and", "is"))
emb <- as.matrix(model)
head(emb)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# persoonlijke -0.3042065 -0.1184860 -1.201266 -2.3450553 0.8367136 1.04953480 -1.2723569 -1.4005361
# horen 0.7087634 -0.3501654 -1.595468 0.1109525 2.0254283 1.23821533 1.5700442 -1.2268434
# atomium 1.2572429 -0.1606519 -1.493153 0.4524430 0.8455694 0.07073398 -0.7973976 -1.4407603
# maakte 0.3302444 -0.9569687 -1.484279 -1.7543029 0.3030996 0.98232365 1.3279068 -1.6941810
# best 0.7629781 -0.6842179 -1.688801 -0.7240605 1.8485303 0.86944288 1.1971822 -1.7174718
# vele 0.9057386 -2.0495012 -1.318962 -1.6884000 -0.2110360 0.94611353 -0.2507533 0.1012649
# [,9] [,10] [,11] [,12] [,13] [,14] [,15]
# persoonlijke -0.19206892 -0.06204492 1.3579390 -0.3258170 -0.34472424 -0.62112594 -0.2712209
# horen 0.03050502 -0.64446056 -0.9804438 -0.4411592 -0.32660010 -0.59695488 0.4198991
# atomium 1.40547168 -0.18392503 -0.5595728 0.3433310 -0.86827427 -2.03704834 0.4290147
# maakte -0.67575353 -0.45142314 0.5420703 0.2255125 -1.27246237 -0.05663725 -0.6136616
# best -0.26500553 0.10107612 0.6706604 -0.9874146 -0.03598274 -0.20110258 -0.6859254
# vele 0.19042419 -0.35138375 -0.7454548 0.5159962 0.44680297 -1.77189505 0.2347254
## Get the embeddings of specific words (unknown words yield NA)
emb <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# bus 1.2695463 -0.7826285 -1.491744 1.18059 1.169373 1.0245659 0.5599136 -0.9842771
# toilet 0.6540414 -0.9534258 -1.994404 -0.92366 1.082000 0.3969926 -0.4183521 0.3170334
# unknownword NA NA NA NA NA NA NA NA
# [,9] [,10] [,11] [,12] [,13] [,14] [,15]
# bus 1.648509 0.3475770 -0.571207 0.003765954 -0.09555515 -1.482624 -0.2918743
# toilet -1.834143 -0.9669787 -1.333826 -0.838830054 0.12554233 -0.486763 0.4404586
# unknownword NA NA NA NA NA NA NA
## Find the nearest words
nn <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn
# $bus
# term1 term2 similarity rank
# 1 bus metro 0.9660415 1
# 2 bus gemakkelijk 0.9644725 2
# 3 bus voet 0.9583727 3
# 4 bus gratis 0.9559109 4
# 5 bus tram 0.9528881 5
#
# $toilet
# term1 term2 similarity rank
# 1 toilet koelkast 0.9544268 1
# 2 toilet douche 0.9469139 2
# 3 toilet werkte 0.9395254 3
# 4 toilet tv 0.9391247 4
# 5 toilet slaapbank 0.9203613 5
## Get the vocabulary used in training
vocab <- summary(model, type = "vocabulary")
## Number of words in the vocabulary
model$vocabulary
# [1] 639
## Find the words closest to a composed vector (word analogy)
emb <- as.matrix(model)
vector <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)
# term similarity rank
# 1 cafe 0.9898908 1
# 2 cafes 0.9836166 2
# 3 tips 0.9831459 3
# 4 restaurantjes 0.9734224 4
# 5 buurt 0.9720594 5
# 6 allerlei 0.9607190 6
# 7 geven 0.9604430 7
# 8 veel 0.9536935 8
# 9 wijk 0.9472404 9
# 10 eten 0.9413667 10
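Similarities between embeddings can also be computed directly via the package's word2vec_similarity() function; a sketch, assuming that function is available in your version of the package:
## Cosine similarity between the embeddings of two words
word2vec_similarity(emb["bus", , drop = FALSE], emb["tram", , drop = FALSE], type = "cosine")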
## Save and reload the model
path <- "mymodel.bin"
write.word2vec(model, file = path)
model <- read.word2vec(path)
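If you need the embeddings outside R, the matrix form can also be written out as plain text with base R; a sketch (the file name is a placeholder):
## Export the embedding matrix as plain text, one word per line
emb <- as.matrix(model)
write.table(emb, file = "myembeddings.txt", col.names = FALSE, quote = FALSE)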
Notes on Chinese word segmentation
For Chinese text, first segment the sentences with the jiebaR package so that each word is separated by a space, as in English, and then pass the segmented text to word2vec() to vectorize the Chinese words.
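A minimal sketch of this workflow (the Chinese sentences are toy examples, and min_count = 1 is only needed because the corpus is tiny):
library(jiebaR)
library(word2vec)
## Build a jiebaR segmenter and turn each sentence into space-separated words
cutter <- worker()
zh <- c("我喜欢自然语言处理", "词向量是自然语言处理的基础")
zh_spaced <- sapply(zh, function(s) paste(segment(s, cutter), collapse = " "))
## The space-separated text can now be vectorized exactly like English text
model_zh <- word2vec(x = zh_spaced, dim = 15, iter = 20, min_count = 1)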