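The problem arises because the scan function in newer versions of R sometimes appends \n when reading UTF-8 data. The workaround is to call Sys.setlocale(locale="English") before running TermDocumentMatrix (or DocumentTermMatrix), and then set the locale back afterwards with Sys.setlocale(locale="Chinese (Simplified)_People's Republic of China.936"). The locale string to restore can be obtained from sessionInfo().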
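# Build the corpus from the segmented text (segmentCN comes from Rwordseg, Corpus from tm)
txt <- Corpus(VectorSource(segmentCN(Diy_dict, returnType = "tm")), readerControl = list(language = "UTF-8"))
# Switch to the English locale before building the matrix
Sys.setlocale(locale = "English")
dtm <- DocumentTermMatrix(txt)
# as.matrix() converts the whole matrix; inspect() is meant for printing
df_dtm2 <- as.data.frame(as.matrix(dtm))
# Restore the original locale (the value reported by sessionInfo())
Sys.setlocale(locale = "Chinese (Simplified)_People's Republic of China.936")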
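
The switch-and-restore steps above can also be wrapped in a small helper, so that the original locale is restored even if building the matrix fails partway through. This is only a sketch: with_temp_locale is a hypothetical name, not a function from tm or base R. It saves and restores the individual locale categories rather than the combined LC_ALL string, because the R documentation notes that the combined string is not always accepted back by Sys.setlocale. The demo uses the "C" locale so that it runs on any platform; "English" is a Windows locale name.

```r
# Hypothetical helper (not part of tm or base R): evaluate `expr` under a
# temporary locale, then restore the previous settings even if `expr` fails.
with_temp_locale <- function(locale, expr) {
  # Query categories one by one; the combined "LC_ALL" string is not
  # guaranteed to be accepted back by Sys.setlocale() on every platform.
  categories <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_TIME")
  old <- vapply(categories, Sys.getlocale, character(1))
  on.exit(
    for (category in categories) Sys.setlocale(category, old[[category]]),
    add = TRUE
  )
  Sys.setlocale("LC_ALL", locale)
  # `expr` is a lazily evaluated argument, so it only runs here,
  # after the temporary locale is in place.
  force(expr)
}

# Platform-independent demo: the expression runs under the "C" locale.
inside <- with_temp_locale("C", Sys.getlocale("LC_CTYPE"))
```

With the snippet above, the call would presumably look like dtm <- with_temp_locale("English", DocumentTermMatrix(txt)) on Windows, which removes the need to hard-code the Chinese locale string for the restore step.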