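Chinese text mining often means reading content from Chinese websites, but RStudio on an English-language Windows system does not fully support Chinese characters, so any Chinese text printed to the console comes out garbled. Here is how to fix it.
First, change the system language settings: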
- Control Panel -> Region and Language -> Formats -> Chinese (Simplified, PRC)
- Control Panel -> Region and Language -> Administrative -> Change System Locale... -> Chinese (Simplified, PRC)
After making these changes, you can check the settings from within RStudio with `sessionInfo()`:
sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8  LC_CTYPE=Chinese (Simplified)_China.utf8
[3] LC_MONETARY=Chinese (Simplified)_China.utf8 LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_China.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.2.0 tools_4.2.0
As you can see, the locale entries are now set to Chinese (Simplified)_China.utf8.
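If you only want the locale, it can also be queried directly instead of reading through the full `sessionInfo()` output. Below is a minimal sketch using base R's `Sys.getlocale()` and `l10n_info()`. The variable names `ctype` and `info` are just illustrative, and the values printed depend on your own system.

```r
# Ask R which locale it is using for character handling
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports, among other things, whether the session runs in a UTF-8 locale
info <- l10n_info()
print(info[["UTF-8"]])
```

If the system locale change took effect, `ctype` should mention Chinese (Simplified), matching the `locale:` section of `sessionInfo()`.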
Next, change RStudio's encoding settings for opening files, saving files, and the default text encoding, setting all of them to UTF-8:
- File -> Reopen with Encoding -> UTF-8
- File -> Save with Encoding -> UTF-8
- Tools -> Global Options -> General -> Default text encoding -> UTF-8 (in newer RStudio versions this setting is under Code -> Saving)
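The menu settings above cover files opened and saved in the RStudio editor. When reading or writing text files from code, the encoding can be stated explicitly as well. The following is a rough sketch, not part of the original procedure: `sample_text` and the temporary file are made up for illustration, and the Chinese characters are written as `\u` escapes so the script itself does not depend on the file encoding.

```r
# Hypothetical sample: four Chinese characters given as Unicode escapes
sample_text <- "\u4e2d\u6587\u6587\u672c"
tmp <- tempfile(fileext = ".txt")

# Write the raw UTF-8 bytes, bypassing conversion to the native encoding
writeLines(enc2utf8(sample_text), tmp, useBytes = TRUE)

# Read the file back and mark the strings as UTF-8
read_back <- readLines(tmp, encoding = "UTF-8")

print(Encoding(read_back))
print(read_back)
```

Using `useBytes = TRUE` on write and `encoding = "UTF-8"` on read is one common way to avoid a round trip through the system code page. Other approaches, such as opening a connection with `file(..., encoding = "UTF-8")`, may work equally well once the locale is set as described.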
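With that, things should mostly work.
The one thing that takes some getting used to is that all error and warning messages become a bit baffling.
This post draws on an article by 岁月催猪老; thanks to the author.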
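If the odd-looking messages come from R translating them into Chinese under the new locale (an assumption; the post does not say exactly what changes), one possible workaround is to set the `LANGUAGE` environment variable to English for the session. A minimal sketch:

```r
# Ask R to emit its messages in English, regardless of the system locale
Sys.setenv(LANGUAGE = "en")

# Trigger a harmless warning and capture its text to check the message language
result <- tryCatch(
  as.integer("abc"),
  warning = function(w) conditionMessage(w)
)
print(result)
```

This only affects the current session. To make it permanent, the same line could go into a startup file such as `.Rprofile`.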