Survey of Word Segmentation Tools
1. Background
Two Chinese word segmentation tools were evaluated:
Ansj: https://github.com/NLPchina/ansj_seg
HanLP: https://github.com/hankcs/HanLP#7-极速词典分词
HanLP was ultimately chosen.
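For context, HanLP's 极速词典分词 (speed dictionary tokenizer) is a pure dictionary-matching segmenter. The sketch below is only an illustration of the general idea using forward maximum matching in plain Java; it is not HanLP's actual implementation (HanLP uses a double-array trie and is far faster), and the class name and toy dictionary here are made up for demonstration.

```java
import java.util.*;

public class FmmSegmenter {
    // Toy in-memory dictionary; a real tokenizer would load a trie from a dictionary file
    private final Set<String> dict;
    private final int maxLen;

    public FmmSegmenter(Collection<String> words) {
        this.dict = new HashSet<>(words);
        int m = 1;
        for (String w : words) m = Math.max(m, w.length());
        this.maxLen = m;
    }

    // Forward maximum matching: at each position, greedily take the longest dictionary word
    public List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String hit = null;
            for (int j = end; j > i; j--) {          // try longest candidate first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { hit = cand; break; }
            }
            if (hit == null) hit = text.substring(i, i + 1); // OOV: emit single character
            out.add(hit);
            i += hit.length();
        }
        return out;
    }

    public static void main(String[] args) {
        FmmSegmenter seg = new FmmSegmenter(Arrays.asList("分词", "工具", "调研"));
        System.out.println(seg.segment("分词工具调研")); // prints [分词, 工具, 调研]
    }
}
```

Because matching is a pure dictionary lookup with no statistical model, this style of segmentation is very fast but depends entirely on dictionary coverage, which is why custom-dictionary support matters in both tools below.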
2. Ansj
A custom dictionary can be plugged in via DicAnalysis:
val forest = DicLibrary.get()
if (forest == null) {
  // Register an empty default dictionary if none exists yet
  DicLibrary.put(DicLibrary.DEFAULT, DicLibrary.DEFAULT, new Forest())
}
// Load brand and category terms into the default user dictionary
for (b <- brandDict) {
  DicLibrary.insert(DicLibrary.DEFAULT, b, "n", 1)
}
for (c <- cateDict) {
  DicLibrary.insert(DicLibrary.DEFAULT, c, "n", 1)
}
// The custom dictionary, stop words, etc. must be shipped to every executor via broadcast
val stopBC = spark.sparkContext.broadcast(stop)
val dicBC = spark.sparkContext.broadcast(DicLibrary.get(DicLibrary.DEFAULT))
val parse = DicAnalysis.parse(keywordDealed, dicBC.value).recognition(stopBC.value)