I can't extend this example to a word-level model. See the reprex below.
library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)
# Parameters
maxlen <- 40
# Data Preparation
# Retrieve text
path <- get_file(
  'nietzsche.txt',
  origin = 'https://s3.amazonaws.com/text-datasets/nietzsche.txt'
)
# Load, collapse, and tokenize text
text <- read_lines(path) %>%
  str_to_lower() %>%
  str_c(collapse = "\n") %>%
  tokenize_words(simplify = TRUE)
print(sprintf("corpus length: %d", length(text)))
words <- text %>%
  unique() %>%
  sort()
print(sprintf("total words: %d", length(words)))
This gives:
[1] "corpus length: 101345"
[1] "total words: 10283"
When I move on to the next step, I run into a problem:
# Cut the text in semi-redundant sequences of maxlen words
dataset <- map(
  seq(1, length(text) - maxlen - 1, by = 3),
  ~list(sentence = text[.x:(.x + maxlen - 1)], next_word = text[.x + maxlen])
)
dataset <- transpose(dataset)
# Vectorization
X <- array(0, dim = c(length(dataset$sentence), maxlen, length(words)))
y <- array(0, dim = c(length(dataset$sentence), length(words)))
for(i in 1:length(dataset$sentence)){
  X[i,,] <- sapply(words, function(x){
    as.integer(x == dataset$sentence[[i]])
  })
  y[i,] <- as.integer(words == dataset$next_word[[i]])
}
Error: cannot allocate vector of size 103.5 Gb
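The error is consistent with the size of the one-hot array I'm asking for: roughly 33,768 sequences × 40 words × 10,283 vocabulary entries, stored as doubles. A quick check with the numbers from my corpus above:

n_seq <- length(seq(1, 101345 - 40 - 1, by = 3))  # 33768 sequences
n_seq * 40 * 10283 * 8 / 2^30                     # ~103.5 GiB of doubles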
Now, compared to the character example, my vocabulary has far more words than the character vocabulary had characters, which is presumably why I'm hitting this vector-size problem. But how can I preprocess word-level text data so it fits into a model like this one? Is this somehow accomplished with an embedding layer? Do I need to remove stop words or stem the text to get the vocabulary size down?
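For context, here is a minimal sketch of the embedding-layer approach as I understand it: keep X as a matrix of integer word indices (a few MB instead of 100+ GB) and let layer_embedding learn the dense representation. The names word_index, output_dim = 64, and units = 128 are my own guesses, not taken from the example:

# Sketch: integer-encode words instead of one-hot encoding them
word_index <- seq_along(words)        # map each word to an integer 1..V
names(word_index) <- words

# X: n_sequences x maxlen matrix of indices (no 3-D one-hot array)
X <- t(sapply(dataset$sentence, function(s) word_index[s]))
# y: zero-based class indices, so no one-hot targets are needed either
y <- word_index[unlist(dataset$next_word)] - 1

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = length(words) + 1,  # +1 since indices start at 1
                  output_dim = 64,                # embedding size: my guess
                  input_length = maxlen) %>%
  layer_lstm(units = 128) %>%
  layer_dense(units = length(words), activation = "softmax")

model %>% compile(
  loss = "sparse_categorical_crossentropy",  # takes integer targets directly
  optimizer = "adam"
)

With this layout X is about 33,768 × 40 integers (a few MB), so the allocation problem disappears, but I don't know whether this is the intended way to adapt the example.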