This article segments a text into words and counts word frequencies.
I. Word segmentation
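Word segmentation in R requires the Rwordseg package, which in turn depends on rJava.
Rwordseg is a Chinese word segmentation tool for R. It wraps Ansj, an open-source Java Chinese segmenter based on the ICTCLAS segmentation algorithm from the Chinese Academy of Sciences, which uses a hidden Markov model (HMM). Rwordseg has several advantages: segmentation is accurate, it is fast, and it can import custom dictionaries.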
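(1) Install the JDK. rJava needs a JDK (Java Development Kit) on the machine, so download and install the JDK first, then install rJava from CRAN through R. Without the JDK, rJava will not work even if it is installed.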
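(2) Install and load the rJava package. rJava gives R access to the Java libraries that Rwordseg calls. Note that the package name is case-sensitive. The install and load statements: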
> install.packages("rJava")
> library(rJava)
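Step (1) says rJava is unusable without a JDK, so it can help to confirm that R actually sees a Java runtime before going further. The following is a minimal sketch, assuming rJava was installed as above. `.jinit()` and `.jcall()` are rJava functions; the variable name `java_version` is only illustrative, and the guard simply skips the check when rJava cannot be loaded.

```r
# Sketch: ask the JVM for its version through rJava.
# Assumption: rJava is installed and a JDK is available on this machine.
java_version <- NA_character_

if (requireNamespace("rJava", quietly = TRUE)) {
  rJava::.jinit()  # start the JVM inside the R session
  # Call the static Java method System.getProperty("java.version"), returning a string ("S")
  java_version <- rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
}

print(java_version)  # NA means rJava could not be loaded
```

If this prints `NA` or the JVM fails to start, the JDK installation or its environment variables are the first thing to check.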
(3) Install and load the Rwordseg package.
> install.packages("Rwordseg", repos = "http://R-Forge.R-project.org", type = "source")
> library(Rwordseg)
(4) Segment text with the segmentCN() function.
> segmentCN("盼望着,盼望着,东风来了,春天的脚步近了。
+ 一切都像刚睡醒的样子,欣欣然张开了眼。山朗润起来了,水涨起来了,太阳的脸红起来了。
+ 小草偷偷地从土地里钻出来,嫩嫩的,绿绿的。园子里,田野里,瞧去,一大片一大片满是的。坐着,躺着,打两个滚,踢几脚球,赛几趟跑,捉几回迷藏。风轻俏俏的,草软绵绵的。
+ 桃树,杏树,梨树,你不让我,我不让你,都开满了花赶趟儿。红的像火,粉的像霞,白的像雪。花里带着甜味;闭了眼,树上仿佛已经满是桃儿,杏儿,梨儿。花下成千成百的蜜蜂嗡嗡的闹着,大小的蝴蝶飞来飞去。野花遍地是:杂样儿,有名字的,没名字的,散在草丛里像眼睛像星星,还眨呀眨。
+ “吹面不寒杨柳风”,不错的,像母亲的手抚摸着你,风里带着些心翻的泥土的气息,混着青草味儿,还有各种花的香,都在微微润湿的空气里酝酿。鸟儿将巢安在繁花嫩叶当中,高兴起来,呼朋引伴的卖弄清脆的歌喉,唱出婉转的曲子,跟清风流水应和着。牛背上牧童的短笛,这时候也成天嘹亮的响着。
+ 雨是最寻常的,一下就是三两天。可别恼。看,像牛牦,像花针,像细丝,密密的斜织着,人家屋顶上全笼着一层薄烟。树叶却绿得发亮,小草也青得逼你的眼。傍晚时候,上灯了,一点点黄晕的光,烘托出一片安静而和平的夜。在乡下,小路上,石桥边,有撑着伞慢慢走着的人,地里还有工作的农民,披着所戴着笠。他们的房屋稀稀疏疏的,在雨里静默着。
+ 天上的风筝渐渐多了,地上的孩子也多了。城里乡下,家家户户,老老小小,也赶趟似的,一个个都出来了。舒活舒活筋骨,抖擞抖擞精神,各做各的一份事儿去。“一年之计在于春”,刚起头儿,有的是功夫,有的是希望
+ 春天像刚落地的娃娃,从头到脚都是新的,它生长着。
+ 春天像小姑娘,花枝招展的笑着走着。
+ 春天像健壮的青年,有铁一般的胳膊和腰脚,领着我们向前去。")
[1] "盼望" "着" "盼望" "着"
[5] "东风" "来" "了" "春天"
[9] "的" "脚步" "近" "了"
[13] "一切" "都" "像" "刚"
[17] "睡醒" "的" "样子" "欣欣然"
[21] "张" "开" "了" "眼"
[25] "山" "朗" "润" "起来"
[29] "了" "水" "涨" "起来"
[33] "了" "太阳" "的" "脸红"
[37] "起来" "了" "小" "草"
[41] "偷偷" "地" "从" "土地"
[45] "里" "钻" "出来" "嫩"
[49] "嫩" "的" "绿" "绿"
[53] "的" "园子" "里" "田野"
[57] "里" "瞧" "去" "一"
[61] "大" "片" "一" "大片"
[65] "满" "是" "的" "坐"
[69] "着" "躺" "着" "打"
[73] "两个" "滚" "踢" "几"
[77] "脚" "球" "赛" "几趟"
[81] "跑" "捉" "几回" "迷"
[85] "藏" "风" "轻" "俏"
[89] "俏" "的" "草" "软绵绵"
[93] "的" "桃树" "杏树" "梨树"
[97] "你" "不" "让" "我"
[101] "我" "不" "让" "你"
[105] "都" "开" "满" "了"
[109] "花" "赶趟" "儿" "红"
[113] "的" "像" "火" "粉"
[117] "的" "像" "霞" "白"
[121] "的" "像" "雪" "花"
[125] "里" "带" "着" "甜味"
[129] "闭" "了" "眼" "树上"
[133] "仿佛" "已经" "满" "是"
[137] "桃" "儿" "杏" "儿"
[141] "梨" "儿" "花" "下"
[145] "成" "千" "成百" "的"
[149] "蜜蜂" "嗡嗡" "的" "闹"
[153] "着" "大小" "的" "蝴蝶"
[157] "飞来" "飞" "去" "野花"
[161] "遍地" "是" "杂" "样"
[165] "儿" "有" "名字" "的"
[169] "没" "名字" "的" "散"
[173] "在" "草丛" "里" "像"
[177] "眼睛" "像" "星星" "还"
[181] "眨" "呀" "眨" "吹"
[185] "面" "不" "寒" "杨柳"
[189] "风" "不错" "的" "像"
[193] "母亲" "的" "手" "抚摸"
[197] "着" "你" "风" "里"
[201] "带" "着" "些" "心"
[205] "翻" "的" "泥土" "的"
[209] "气息" "混" "着" "青草"
[213] "味儿" "还有" "各种" "花"
[217] "的" "香" "都" "在"
[221] "微微" "润湿" "的" "空气"
[225] "里" "酝酿" "鸟儿" "将"
[229] "巢" "安" "在" "繁花"
[233] "嫩叶" "当中" "高兴" "起来"
[237] "呼" "朋" "引" "伴"
[241] "的" "卖弄" "清脆" "的"
[245] "歌喉" "唱" "出" "婉转"
[249] "的" "曲子" "跟" "清风"
[253] "流水" "应和" "着" "牛"
[257] "背" "上" "牧童" "的"
[261] "短笛" "这时候" "也" "成天"
[265] "嘹亮" "的" "响" "着"
[269] "雨" "是" "最" "寻常"
[273] "的" "一下" "就" "是"
[277] "三两天" "可" "别" "恼"
[281] "看" "像" "牛" "牦"
[285] "像" "花" "针" "像"
[289] "细" "丝" "密" "密"
[293] "的" "斜" "织" "着"
[297] "人家" "屋顶" "上" "全"
[301] "笼" "着" "一层" "薄"
[305] "烟" "树叶" "却" "绿"
[309] "得" "发亮" "小" "草"
[313] "也" "青" "得" "逼"
[317] "你" "的" "眼" "傍晚"
[321] "时候" "上" "灯" "了"
[325] "一点点" "黄晕" "的" "光"
[329] "烘托" "出" "一片" "安静"
[333] "而" "和平" "的" "夜"
[337] "在" "乡下" "小路" "上"
[341] "石桥" "边" "有" "撑"
[345] "着" "伞" "慢慢" "走"
[349] "着" "的" "人" "地"
[353] "里" "还" "有" "工作"
[357] "的" "农民" "披" "着"
[361] "所" "戴" "着" "笠"
[365] "他们" "的" "房屋" "稀稀疏疏"
[369] "的" "在" "雨" "里"
[373] "静默" "着" "天上" "的"
[377] "风筝" "渐渐" "多" "了"
[381] "地上" "的" "孩子" "也"
[385] "多" "了" "城里" "乡下"
[389] "家家户户" "老" "老小" "小"
[393] "也" "赶趟" "似的" "一个个"
[397] "都" "出来" "了" "舒"
[401] "活" "舒" "活" "筋骨"
[405] "抖擞" "抖擞精神" "各" "做"
[409] "各" "的" "一份" "事儿"
[413] "去" "一年之计在于春" "刚" "起"
[417] "头儿" "有的是" "功夫" "有的是"
[421] "希望" "春天" "像" "刚"
[425] "落" "地" "的" "娃娃"
[429] "从" "头" "到" "脚"
[433] "都" "是" "新" "的"
[437] "它" "生长" "着" "春天"
[441] "像" "小姑娘" "花枝招展" "的"
[445] "笑" "着" "走" "着"
[449] "春天" "像" "健壮" "的"
[453] "青年" "有" "铁" "一般"
[457] "的" "胳膊" "和" "腰"
[461] "脚" "领" "着" "我们"
[465] "向" "前" "去"
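In the output above, some words are split into single characters, for example "小" "草". Since Rwordseg supports custom words, one possible way to address this is to add the word to the in-memory dictionary with `insertWords()` and segment again. This is only a sketch: the sample sentence is taken from the text above, the variable names are illustrative, and whether the split actually changes depends on the dictionary shipped with the installed version.

```r
# Sketch: add a custom word, then compare segmentation before and after.
# Assumption: Rwordseg is installed; the code is skipped otherwise.
sentence <- "小草偷偷地从土地里钻出来"

if (requireNamespace("Rwordseg", quietly = TRUE)) {
  library(Rwordseg)
  before <- segmentCN(sentence)  # segmentation with the default dictionary
  insertWords("小草")            # register "小草" as a word for this session
  after <- segmentCN(sentence)   # segment the same sentence again
  print(before)
  print(after)
}
```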
> segmentCN("春.txt")
Output file: 春.segment.txt
[1] TRUE
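When the argument is the path to a text file, segmentCN() segments the file's contents. By default, the result is saved in the same folder as the input file, with "segment" inserted into the original file name (here, 春.segment.txt). The outfile parameter can be used to specify a different output file name and path.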
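As a rough illustration of the outfile parameter, the sketch below writes the result to a chosen location instead of the default one. It assumes 春.txt exists in the working directory. The output file name `chun_words.txt` and the use of a temporary directory are hypothetical choices made for this example only.

```r
# Sketch: send the segmentation result to a custom output file.
# Assumptions: Rwordseg is installed and "春.txt" is in the working directory.
input_file  <- "春.txt"
output_file <- file.path(tempdir(), "chun_words.txt")  # hypothetical output location

if (requireNamespace("Rwordseg", quietly = TRUE) && file.exists(input_file)) {
  library(Rwordseg)
  segmentCN(input_file, outfile = output_file)             # write to output_file
  print(head(readLines(output_file, encoding = "UTF-8")))  # peek at the first lines
}
```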