Using the IK analyzer
- Integrating the IK analyzer: https://mp.csdn.net/postedit/93602713
- The entity class PosEntity:
```java
/** All-args constructor, getters and setters omitted. */
class PosEntity {
    private Integer posId;
    private String posName;
    private String posAddress;
}
```
In the entity class, posName and posAddress both hold Chinese text, so both use the IK analyzer.
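The test methods in this post also reference a `client`, `ikIndexName`, and `typeName` set up in the integration post linked above. A minimal sketch of that scaffolding, where the class name, cluster name, host, port, and index/type names are assumptions for illustration:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import org.junit.Before;

public class IkAnalyzerTest {
    // Index and type names are hypothetical; substitute your own.
    private String ikIndexName = "pos_ik";
    private String typeName = "pos";
    private TransportClient client;

    @Before
    public void setUp() throws UnknownHostException {
        // Cluster name and address are assumptions; match your ES cluster.
        Settings settings = Settings.builder()
                .put("cluster.name", "my-es-cluster")
                .build();
        client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(
                        InetAddress.getByName("127.0.0.1"), 9300));
    }
}
```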
- Create the index:
```java
@Test
public void createIKIndex() {
    XContentBuilder contentBuilder = null;
    try {
        // Map both Chinese text fields with the ik_max_word analyzer.
        contentBuilder = XContentFactory.jsonBuilder()
                .startObject()
                    .startObject(typeName)
                        .startObject("properties")
                            .startObject("posId").field("type", "integer").endObject()
                            .startObject("posName").field("type", "text")
                                .field("analyzer", "ik_max_word").endObject()
                            .startObject("posAddress").field("type", "text")
                                .field("analyzer", "ik_max_word").endObject()
                        .endObject()
                    .endObject()
                .endObject();
    } catch (IOException e) {
        e.printStackTrace();
    }
    client.admin().indices().prepareCreate(ikIndexName)
            .addMapping(typeName, contentBuilder).get();
}
```
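To confirm the mapping was applied, one option is to fetch it back through the client. A small sketch, not from the original post (GetMappingsResponse comes from org.elasticsearch.action.admin.indices.mapping.get):

```java
// Optional sanity check: fetch the mapping we just created and print it.
@Test
public void checkMapping() {
    GetMappingsResponse response = client.admin().indices()
            .prepareGetMappings(ikIndexName).get();
    // source() holds the mapping source for the type as JSON.
    System.out.println(response.getMappings()
            .get(ikIndexName).get(typeName).source().string());
}
```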
- Load the data:
```java
@Test
public void loadIKData() {
    // One document with a license plate number, one with an ordinary sentence.
    PosEntity entity = new PosEntity(3, "渝B21902321", "中国是世界上人口最多的国家");
    PosEntity entity1 = new PosEntity(4, "渝A21902321", "今天是你的生日胖头鱼");
    client.prepareIndex(ikIndexName, typeName, "3")
            .setSource(JSONObject.toJSONString(entity), XContentType.JSON).get();
    client.prepareIndex(ikIndexName, typeName, "4")
            .setSource(JSONObject.toJSONString(entity1), XContentType.JSON).get();
}
```
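For larger data sets, the same documents can be sent through the bulk API in a single round-trip. A sketch under the same assumptions (the method name is hypothetical):

```java
// Sketch: load the same two documents with one bulk request instead of a
// network round-trip per document.
@Test
public void loadIKDataBulk() {
    PosEntity entity = new PosEntity(3, "渝B21902321", "中国是世界上人口最多的国家");
    PosEntity entity1 = new PosEntity(4, "渝A21902321", "今天是你的生日胖头鱼");
    BulkRequestBuilder bulk = client.prepareBulk();
    bulk.add(client.prepareIndex(ikIndexName, typeName, "3")
            .setSource(JSONObject.toJSONString(entity), XContentType.JSON));
    bulk.add(client.prepareIndex(ikIndexName, typeName, "4")
            .setSource(JSONObject.toJSONString(entity1), XContentType.JSON));
    BulkResponse bulkResponse = bulk.get();
    if (bulkResponse.hasFailures()) {
        System.out.println(bulkResponse.buildFailureMessage());
    }
}
```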
The loaded data includes a license plate number and an ordinary sentence. How IK tokenizes a piece of text can be checked in Kibana:

```
GET _analyze
{"text":"今天是你的生日胖子","analyzer":"ik_max_word"}
```
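The same analysis can also be run from the Java client rather than Kibana. A sketch (the test name is mine; AnalyzeResponse comes from org.elasticsearch.action.admin.indices.analyze):

```java
// Run the same _analyze request through the TransportClient.
@Test
public void analyzeIK() {
    AnalyzeResponse response = client.admin().indices()
            .prepareAnalyze("今天是你的生日胖子")
            .setAnalyzer("ik_max_word")
            .get();
    for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
        System.out.println(token.getTerm());
    }
}
```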
- Searching works much as before:
```java
@Test
public void multiIK() {
    // Match "是你的" against both IK-analyzed fields.
    MultiMatchQueryBuilder queryBuilder =
            new MultiMatchQueryBuilder("是你的", "posName", "posAddress");
    // Note: prepareSearch takes index names only; the type goes in setTypes.
    SearchRequestBuilder builder = client.prepareSearch(ikIndexName)
            .setTypes(typeName).setQuery(queryBuilder);
    execSearch(builder);
}

@Test
public void ikStringQuery() {
    QueryStringQueryBuilder queryStringQueryBuilder =
            new QueryStringQueryBuilder("posAddress:是你的");
    SearchRequestBuilder builder = client.prepareSearch(ikIndexName)
            .setTypes(typeName).setQuery(queryStringQueryBuilder);
    execSearch(builder);
}
```
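execSearch is a small helper carried over from the earlier posts. A hypothetical sketch of what it does:

```java
// Hypothetical sketch of the execSearch helper: run the query and print
// each hit's id and _source JSON.
private void execSearch(SearchRequestBuilder builder) {
    SearchResponse response = builder.get();
    for (SearchHit hit : response.getHits().getHits()) {
        System.out.println(hit.getId() + " -> " + hit.getSourceAsString());
    }
}
```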
Extending the IK dictionary
- First, analyze the sentence “蓝瘦香菇很好吃” in Kibana:
```
GET _analyze
{"text":"蓝瘦香菇很好吃","analyzer":"ik_max_word"}
```
The result is:
{ "tokens": [ { "token": "蓝", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 }, { "token": "瘦", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }, { "token": "香菇", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 }, { "token": "很好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 }, { "token": "很", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 4 }, { "token": "好吃", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 5 } ] }
“蓝瘦香菇” was not recognized as a single word, while “很” came back as its own token. The goal now is to have “蓝瘦香菇” recognized as one word and “很” dropped as a stopword.
- Go to the IK config directory /home/es/elasticsearch-6.2.2/plugins/ik/config and create a dictionary file:
```
vim my_extra.dic
```
Add the single line 蓝瘦香菇 and save.
- Create a stopword file and add 很 to it:
```
vim my_stopword.dic
```
- Edit the IK config file and register the custom files:
```
vim IKAnalyzer.cfg.xml
```
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Custom extension dictionaries -->
    <entry key="ext_dict">my_extra.dic</entry>
    <!-- Custom extension stopword dictionaries -->
    <entry key="ext_stopwords">my_stopword.dic</entry>
    <!-- Remote extension dictionaries -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Remote extension stopword dictionaries -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```
Point ext_dict at your extension dictionary file and ext_stopwords at your stopword file.
- Restart the ES cluster.
- Re-analyze the sentence; the result is now:
{ "tokens": [ { "token": "蓝瘦香菇", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 }, { "token": "瘦", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }, { "token": "香菇", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 }, { "token": "很好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 }, { "token": "好吃", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4 } ] }