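A while back, while doing text topic analysis in R, I tried to build a DTM (document-term matrix) and ran into an error.
The R code that triggered the error is as follows: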
library(tm)    # Corpus, VectorSource, tm_map, DocumentTermMatrix
library(tmcn)  # stopwordsCN()
# Read the segmented text; the documents are in column x
samgov.segmentText <- read.csv('samgov_segment.csv', header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
d.corpus <- Corpus(VectorSource(samgov.segmentText$x), readerControl = list(language = "UTF-8"))
# Remove Chinese stop words
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE, wordLengths = c(2, Inf), weighting = weightTf, encoding = "UTF-8")
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
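I tried several fixes suggested online. The most commonly recommended one is to change the locale, for example:
first call Sys.setlocale(locale="English"), run the code above, and then switch back with Sys.setlocale(locale="Chinese (Simplified)_People's Republic of China.936"). None of these approaches worked for me.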
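For reference, the switch-and-restore pattern described above can be sketched as a small helper. This is only an illustration: `with_ctype_locale` is a hypothetical name, it changes only the `LC_CTYPE` category (the calls quoted above change all categories), and locale names such as "English" are platform dependent, so the usage line uses the portable "C" locale instead. As noted, changing the locale did not fix the error in my case; the sketch just makes sure the original locale is always restored.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_CTYPE locale
# and restore the previous locale afterwards, even if `expr` fails.
with_ctype_locale <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" when the requested locale is unavailable
  set <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(set, "")) {
    warning("locale '", locale, "' is not available on this system")
  }
  force(expr)  # lazy argument: evaluated here, after the switch
}

before <- Sys.getlocale("LC_CTYPE")
result <- with_ctype_locale("C", toupper("abc"))
after  <- Sys.getlocale("LC_CTYPE")
```

In the real script, the DTM-building code would take the place of `toupper("abc")`.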
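After digging through a lot more material, I finally found a fix that actually works on Zhihu [1] (*^▽^*)
The fix is as follows:
Add one line: m <- enc2utf8(samgov.segmentText$x)
The full R code is as follows: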
library(tm)    # Corpus, VectorSource, tm_map, DocumentTermMatrix
library(tmcn)  # stopwordsCN()
samgov.segmentText <- read.csv('samgov_segment.csv', header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
# The fix: convert the text to UTF-8 before building the corpus
m <- enc2utf8(samgov.segmentText$x)
d.corpus <- Corpus(VectorSource(m), readerControl = list(language = "UTF-8"))
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers = TRUE, wordLengths = c(2, Inf), weighting = weightTf, encoding = "UTF-8")
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
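To see what enc2utf8 does, here is a minimal base-R sketch. It is an assumption-laden stand-in: instead of my CSV data it uses a Latin-1 string (the example from R's own Encoding documentation), because that behaves the same on every platform. The idea carries over: a string stored in a non-UTF-8 encoding is converted to UTF-8 and marked as such.

```r
# A string whose bytes are Latin-1, not UTF-8 (stand-in for non-UTF-8 text)
x <- "fa\xe7ile"
Encoding(x) <- "latin1"   # declare how the bytes should be interpreted

y <- enc2utf8(x)          # convert to UTF-8

Encoding(c(x, y))         # "latin1" "UTF-8"
validUTF8(c(x, y))        # FALSE TRUE: only y holds valid UTF-8 bytes
y == "fa\u00e7ile"        # TRUE: same text, now stored as UTF-8
```

A quick check like `all(validUTF8(m))` before building the corpus may help confirm that the text really is UTF-8.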
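With this change the code runs and the DTM is generated.
DTM (DocumentTermMatrix):
This is also called the document-term matrix. Its rows represent documents, its columns represent terms, and each entry is the number of times a given term occurs in a given document.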
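The definition above can be made concrete with a tiny hand-built example in base R. The three toy documents and their names are made up for illustration, and the text is assumed to be already segmented into space-separated terms.

```r
# Three toy documents, already segmented into space-separated terms
docs <- c(d1 = "apple banana apple",
          d2 = "banana cherry",
          d3 = "apple cherry cherry")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one term vector per document
vocab  <- sort(unique(unlist(tokens)))        # columns: all distinct terms

# Count every vocabulary term in every document (rows: documents)
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

dtm
#    apple banana cherry
# d1     2      1      0
# d2     0      1      1
# d3     1      0      2
```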
See also the explanation on Wikipedia [2].
In R, a DTM can be obtained with the DocumentTermMatrix function provided by the tm package.
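As a minimal sketch of that function on toy data (the two English documents are made up, and the default control settings are assumed), the block below builds a DTM and converts it to an ordinary matrix. It is guarded so that it is simply skipped when tm is not installed.

```r
dtm_matrix <- NULL
if (requireNamespace("tm", quietly = TRUE)) {
  docs   <- c("apple banana apple", "banana cherry")
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  dtm    <- tm::DocumentTermMatrix(corpus)   # stored as a sparse matrix
  dtm_matrix <- as.matrix(dtm)               # dense: documents x terms
  print(dtm_matrix)
}
```

For a large corpus the dense conversion can use a lot of memory, so `inspect(dtm)` from tm is the gentler way to look at a sample of the matrix.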
References:
[1] Zhihu (I can no longer find the exact link T_T, but many thanks to whoever shared the method)