Chinese Word Segmentation in R with the jiebaR Package

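Introduction

R packages that can segment Chinese text include Rwordseg and jiebaR. At present jiebaR is the more capable and the more efficient of the two. This article shows how to use jiebaR for Chinese word segmentation.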

The worker() function

worker() creates a jiebaR object (a segmenter, a finder, a keyword extractor, and so on), which then carries out the actual work.

worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
  user = USERPATH, idf = IDFPATH, stop_word = STOPPATH, write = T,
  qmax = 20, topn = 5, encoding = "UTF-8", detect = T,
  symbol = F, lines = 1e+05, output = NULL, bylines = F,
  user_weight = "max")

Parameters

  • type
    The type of jiebaR workers including mix, mp, hmm, full, query, tag, simhash, and keywords.

  • dict
    A path to main dictionary, default value is DICTPATH, and the value is used for mix, mp, query, full, tag, simhash and keywords workers.

  • hmm
    A path to Hidden Markov Model, default value is HMMPATH, and the value is used for mix, hmm, query, tag, simhash and keywords workers.

  • user
    A path to user dictionary, default value is USERPATH, and the value is used for mix, full, tag and mp workers.

  • idf
    A path to inverse document frequency, default value is IDFPATH, and the value is used for simhash and keywords workers.

  • stop_word
    A path to stop word dictionary, default value is STOPPATH, and the value is used for simhash, keywords, tagger and segment workers. Encoding of this file is checked by file_coding, and it should be UTF-8 encoding. For segment workers, the default STOPPATH will not be used, so you should provide another file path.

  • write
    Whether to write the output to a file or return the result in an object. This value is only used when the input is a file path. The default value is TRUE. The value is used for segment and speech tagging workers.

  • qmax
    Max query length of words, and the value is used for query workers.

  • topn
    The number of keywords, and the value is used for simhash and keywords workers.

  • encoding
    The encoding of the input file. If encoding detection is enabled, the value of encoding is ignored.

  • detect
    Whether to detect the encoding of the input file using the file_coding function. If encoding detection is enabled, the value of encoding is ignored.

  • symbol
    Whether to keep symbols in the sentence.

  • lines
    The maximal number of lines to read at one time when input is a file. The value is used for segmentation and speech tagging workers.

  • output
    A path to the output file. By default the worker generates a file name from the system time stamp. The value is used for segmentation and speech tagging workers.

  • bylines
    Whether to return the result line by line, following the lines of the input file.

  • user_weight
    The weight of the user dictionary words: "min", "max" or "median".
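
As a rough illustration of how a few of these parameters change a worker's behaviour, the sketch below builds several workers and runs them on short sample sentences. The sample sentences and variable names are made up for this example, and the exact tokens returned depend on the installed dictionary, so no output is shown.

```r
library(jiebaR)

# Hypothetical sample sentence, used only for illustration
text <- "我爱北京天安门"

# Default worker: type = "mix"
mix_worker <- worker()

# "mp" worker: maximum-probability segmentation based on the dictionary
mp_worker <- worker(type = "mp")

# Keep punctuation marks in the result instead of dropping them
symbol_worker <- worker(symbol = TRUE)

# Return one result per input element (a list) instead of one flat vector
line_worker <- worker(bylines = TRUE)

segment(text, mix_worker)
segment(text, mp_worker)
segment("你好,世界!", symbol_worker)

# With bylines = TRUE, a two-element input gives a list of length two
by_line <- segment(c("这是第一行文本", "这是第二行文本"), line_worker)
length(by_line)
```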

Usage

When segmenting text, the following three forms are equivalent:

segment(words,worker)
worker<=words
worker[words]

Here words is the text to segment and worker is an object created by worker().

The new_user_word() function

This function adds user-defined words to a worker.

new_user_word(worker, words, tags = rep("n", length(words)))

Parameters

  • worker
    A jiebaR worker.

  • words
    The new words, given as a character vector.

  • tags
    The tags of the new words, default "n".
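
A minimal sketch of passing explicit tags, assuming a tagging worker accepts new words in the same way as the default worker. The two words and the tag values "n" and "nt" are chosen only for illustration.

```r
library(jiebaR)

# A part-of-speech tagging worker
tagger <- worker(type = "tag")

# Add two words; tags must have the same length as words
new_user_word(tagger,
              words = c("新型冠状病毒", "疾控中心"),
              tags  = c("n", "nt"))

# The result is a character vector of tokens whose names are the tags
tagged <- segment("新型冠状病毒肺炎疫情防控", tagger)
tagged
```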

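The freq() function

freq(x) counts word frequencies in a character vector x; the result can then be used to draw a word cloud.

Examples

Segmenting with the default dictionary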
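
To make the shape of the result concrete, here is a small sketch that calls freq() on a hand-written token vector (the tokens are invented for the example) and then sorts the table by count. The row order returned by freq() is not assumed to be meaningful, which is why the sort is done explicitly.

```r
library(jiebaR)

# Hypothetical tokens, as they might come out of segment()
tokens <- c("肺炎", "疫情", "肺炎", "北京市", "肺炎", "疫情")

# freq() returns a data frame with the columns char and freq
counts <- freq(tokens)

# Sort from most to least frequent
counts_sorted <- counts[order(-counts$freq), ]
counts_sorted
```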


library(jiebaR)
engine<-worker()
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
segment(words,engine)
# or: engine<=words
# or: engine[words]
# [1] "4"          "月"         "28"         "日"         "在"         "北京市"     "新型"      
# [8] "冠状病毒"   "肺炎"       "疫情"       "防控"       "工作"       "第"         "318"       
# [15] "场"         "新闻"       "发布会"     "上"         "市"         "疾控中心"   "副"        
# [22] "主任"       "全国"       "新型"       "冠状病毒"   "肺炎"       "专家"       "组成员"    
# [29] "庞"         "星火"       "通报"       "4"          "月"         "27"         "日"        
# [36] "15"         "时至"       "28"         "日"         "15"         "时"         "本市"      
# [43] "新增"       "本土"       "新冠"       "肺炎"       "病毒感染者" "56"         "例"        
# [50] "其中"       "确诊"       "病例"       "53"         "例"         "无症状"     "感染者"    
# [57] "3"          "例"         "房山区"     "20"         "例"         "朝阳区"     "14"        
# [64] "例"         "顺义区"     "8"          "例"         "通州区"     "6"          "例"        
# [71] "海淀区"     "3"          "例"         "丰台区"     "2"          "例"         "东城区"    
# [78] "1"          "例"         "石景山区"   "1"          "例"         "大兴区"     "1"         
# [85] "例"         "社区"       "筛查"       "6"          "例"         "主动"       "就诊"      
# [92] "2"          "例"         "风险"       "人员"       "48"         "例"   

Segmenting with user-defined words

library(jiebaR)
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
engine_new_word<-worker()
new_user_word(engine_new_word, c("新闻发布会","新型冠状病毒"))
segment(words,engine_new_word)
# [1] "4"            "月"           "28"           "日"           "在"           "北京市"      
# [7] "新型冠状病毒" "肺炎"         "疫情"         "防控"         "工作"         "第"          
# [13] "318"          "场"           "新闻发布会"   "上"           "市"           "疾控中心"    
# [19] "副"           "主任"         "全国"         "新型冠状病毒" "肺炎"         "专家"        
# [25] "组成员"       "庞"           "星火"         "通报"         "4"            "月"          
# [31] "27"           "日"           "15"           "时至"         "28"           "日"          
# [37] "15"           "时"           "本市"         "新增"         "本土"         "新冠"        
# [43] "肺炎"         "病毒感染者"   "56"           "例"           "其中"         "确诊"        
# [49] "病例"         "53"           "例"           "无症状"       "感染者"       "3"           
# [55] "例"           "房山区"       "20"           "例"           "朝阳区"       "14"          
# [61] "例"           "顺义区"       "8"            "例"           "通州区"       "6"           
# [67] "例"           "海淀区"       "3"            "例"           "丰台区"       "2"           
# [73] "例"           "东城区"       "1"            "例"           "石景山区"     "1"           
# [79] "例"           "大兴区"       "1"            "例"           "社区"         "筛查"        
# [85] "6"            "例"           "主动"         "就诊"         "2"            "例"          
# [91] "风险"         "人员"         "48"           "例"  

Adding a user dictionary from a text file

library(jiebaR)
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
engine_user<-worker(user='dictionary.txt')
segment(words,engine_user)
# [1] "4"            "月"           "28"           "日"           "在"           "北京市"      
# [7] "新型冠状病毒" "肺炎"         "疫情"         "防控"         "工作"         "第"          
# [13] "318"          "场"           "新闻发布会"   "上"           "市"           "疾控中心"    
# [19] "副"           "主任"         "全国"         "新型冠状病毒" "肺炎"         "专家"        
# [25] "组成员"       "庞"           "星火"         "通报"         "4"            "月"          
# [31] "27"           "日"           "15"           "时至"         "28"           "日"          
# [37] "15"           "时"           "本市"         "新增"         "本土"         "新冠"        
# [43] "肺炎"         "病毒感染者"   "56"           "例"           "其中"         "确诊"        
# [49] "病例"         "53"           "例"           "无症状"       "感染者"       "3"           
# [55] "例"           "房山区"       "20"           "例"           "朝阳区"       "14"          
# [61] "例"           "顺义区"       "8"            "例"           "通州区"       "6"           
# [67] "例"           "海淀区"       "3"            "例"           "丰台区"       "2"           
# [73] "例"           "东城区"       "1"            "例"           "石景山区"     "1"           
# [79] "例"           "大兴区"       "1"            "例"           "社区"         "筛查"        
# [85] "6"            "例"           "主动"         "就诊"         "2"            "例"          
# [91] "风险"         "人员"         "48"           "例"  

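Notes

1. If you write your dictionary in Notepad, the file is sometimes not saved as UTF-8, which can cause all kinds of errors or even crash the software. It is therefore recommended to edit the file in Notepad++, set the encoding to UTF-8, and save it as a txt file.
2. If you need to add Sogou cell dictionaries, install the cidian package, which converts them into dictionaries that jiebaR can use.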
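
One way to sidestep the editor encoding problem is to write the dictionary from R itself. The sketch below assumes a dictionary file with one word per line and writes it to a temporary path (the path and the test sentence are made up for this example); enc2utf8() together with useBytes = TRUE is a common base R idiom for forcing UTF-8 bytes on disk.

```r
library(jiebaR)

# Hypothetical location; in practice this would be your own dictionary.txt
dict_path <- file.path(tempdir(), "dictionary.txt")

# Assumed file layout: one user-defined word per line
user_words <- c("新闻发布会", "新型冠状病毒")

# Convert to UTF-8 and write the bytes unchanged
writeLines(enc2utf8(user_words), con = dict_path, useBytes = TRUE)

# Build a worker on top of the freshly written dictionary
engine_file <- worker(user = dict_path)
result <- segment("新闻发布会上通报了新型冠状病毒肺炎疫情", engine_file)
result
```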

Custom stop words

library(jiebaR)
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
engine_user<-worker(user='dictionary.txt',stop_word = "stopwords.txt")
segment(words,engine_user)
# [1] "4"            "月"           "28"           "日"           "北京市"       "新型冠状病毒"
# [7] "肺炎"         "疫情"         "防控"         "工作"         "318"          "场"          
# [13] "新闻发布会"   "上"           "市"           "疾控中心"     "副"           "主任"        
# [19] "全国"         "新型冠状病毒" "肺炎"         "专家"         "组成员"       "庞"          
# [25] "星火"         "通报"         "4"            "月"           "27"           "日"          
# [31] "15"           "时至"         "28"           "日"           "15"           "时"          
# [37] "本市"         "新增"         "本土"         "新冠"         "肺炎"         "病毒感染者"  
# [43] "56"           "其中"         "确诊"         "病例"         "53"           "无症状"      
# [49] "感染者"       "3"            "房山区"       "20"           "朝阳区"       "14"          
# [55] "顺义区"       "8"            "通州区"       "6"            "海淀区"       "3"           
# [61] "丰台区"       "2"            "东城区"       "1"            "石景山区"     "1"           
# [67] "大兴区"       "1"            "社区"         "筛查"         "6"            "主动"        
# [73] "就诊"         "2"            "风险"         "人员"         "48" 
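
When a stop-word file is not at hand, tokens can also be filtered after segmentation. The following sketch uses jiebaR's filter_segment() with a short, invented stop-word vector; it is an alternative to the stop_word argument, not a description of how that argument works internally.

```r
library(jiebaR)

engine_plain <- worker()

# Hypothetical sentence and stop-word list, for illustration only
tokens <- segment("本市新增本土新冠肺炎病毒感染者56例", engine_plain)
stop_words <- c("例", "在", "第")

# Drop every token that matches one of the stop words
filtered <- filter_segment(tokens, stop_words)
filtered
```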

Segmenting and counting word frequencies


library(jiebaR)
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
engine_user<-worker(user='dictionary.txt',stop_word = "stopwords.txt")
freq(segment(words,engine_user))
#             char freq
# 1          就诊    1
# 2          主动    1
# 3          社区    1
# 4        大兴区    1
# 5             1    3
# 6        丰台区    1
# 7        通州区    1
# 8          筛查    1
# 9            14    1
# 10            3    2
# 11       无症状    1
# 12         病例    1
# ...

Part-of-speech tagging

library(jiebaR)
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
engine_user<-worker(user='dictionary.txt',stop_word = "stopwords.txt",type="tag")
segment(words,engine_user)
# x              m              m              m             ns              b 
# "4"           "月"           "28"           "日"       "北京市" "新型冠状病毒" 
# n              n             vn             vn              m              q 
# "肺炎"         "疫情"         "防控"         "工作"          "318"           "场" 
# a              f              n              n              b              b 
# "新闻发布会"           "上"           "市"     "疾控中心"           "副"         "主任" 
# n              b              n              n              l             nr 
# "全国" "新型冠状病毒"         "肺炎"         "专家"       "组成员"           "庞" 
# n              n              x              m              m              m 
# "星火"         "通报"            "4"           "月"           "27"           "日" 
# m              x              m              m              m              n 
# "15"         "时至"           "28"           "日"           "15"           "时" 
# n              v              n              x              n              n 
# "本市"         "新增"         "本土"         "新冠"         "肺炎"   "病毒感染者" 
# m              r              v              n              m              i 
# "56"         "其中"         "确诊"         "病例"           "53"       "无症状" 
# n              x             ns              m             ns              m 
# "感染者"            "3"       "房山区"           "20"       "朝阳区"           "14" 
# ns              x             ns              x             ns              x 
# "顺义区"            "8"       "通州区"            "6"       "海淀区"            "3" 
# ns              x             ns              x             ns              x 
# "丰台区"            "2"       "东城区"            "1"     "石景山区"            "1" 
# ns              x              n             vn              x              b 
# "大兴区"            "1"         "社区"         "筛查"            "6"         "主动" 
# v              x              n              n              m 
# "就诊"            "2"         "风险"         "人员"           "48" 

Notes

If a word in the user dictionary has no tag specified, the output labels it as x.

Keyword extraction

library(jiebaR)
words<-"4月28日,在北京市新型冠状病毒肺炎疫情防控工作第318场新闻发布会上,市疾控中心副主任、全国新型冠状病毒肺炎专家组成员庞星火通报,4月27日15时至28日15时,本市新增本土新冠肺炎病毒感染者56例,其中确诊病例53例、无症状感染者3例;房山区20例、朝阳区14例、顺义区8例、通州区6例、海淀区3例、丰台区2例、东城区1例、石景山区1例、大兴区1例。社区筛查6例、主动就诊2例、风险人员48例。"
engine_user<-worker(user='dictionary.txt',stop_word = "stopwords.txt",type="keywords",topn=10)
engine_user<=words
# 234.784        35.2176        35.2176        26.4953        23.4784        23.4784 
# " "            "1"           "日"         "肺炎"           "15"           "28" 
# 23.4784        23.4784        23.4784        23.4784 
# "3" "新型冠状病毒"            "2"            "6" 
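
The keyword result is a character vector whose names carry the scores, which is awkward to work with directly. As a sketch, assuming the names can be parsed as numbers (as the printed output above suggests), the result can be reshaped into a data frame. The sample text and variable names are invented for this example, and keywords() is used here as the function form of the <= call.

```r
library(jiebaR)

# Keyword worker with the default dictionaries, returning at most 3 keywords
key_engine <- worker(type = "keywords", topn = 3)

# Hypothetical input text
text <- "北京市疾控中心通报新增新冠肺炎确诊病例,社区筛查发现多例感染者"

kw <- keywords(text, key_engine)

# Names hold the scores as strings; convert them to numbers
kw_table <- data.frame(keyword = unname(kw),
                       score   = as.numeric(names(kw)),
                       stringsAsFactors = FALSE)
kw_table
```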