(5) Using the IK Analyzer, Extending the IK Dictionary and Stopword Dictionary

Using the IK analyzer

  1. Integrate the ik analyzer: https://mp.csdn.net/postedit/93602713
  2. Entity class PosEntity 

    /** Constructor, getters and setters omitted. */
    public class PosEntity {
        private Integer posId;
        private String posName;
        private String posAddress;
    }

    In the entity class, posName and posAddress are both Chinese-text fields, so they are mapped with the IK analyzer.
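
    The later snippets also assume a shared TransportClient plus the index and type names (client, ikIndexName, typeName). A minimal sketch of that test fixture, assuming JUnit 4 and the transport client dependency; the class name, cluster name, host, and index/type names are placeholders for your own environment:

    import java.net.InetAddress;
    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.TransportAddress;
    import org.elasticsearch.transport.client.PreBuiltTransportClient;
    import org.junit.Before;

    public class IkAnalyzerTest {
        private TransportClient client;
        private String ikIndexName = "pos_ik";   // placeholder index name
        private String typeName = "pos";         // placeholder type name

        @Before
        public void setUp() throws Exception {
            // Connect to the cluster over the transport protocol (port 9300 by default)
            Settings settings = Settings.builder().put("cluster.name", "my-es-cluster").build();
            client = new PreBuiltTransportClient(settings)
                    .addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300));
        }

        // ... the test methods from the steps below go here
    }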

  3. Create the index 

    @Test
    public void createIKIndex() {
        XContentBuilder contentBuilder = null;
        try {
            // Mapping: posId as integer, the two Chinese text fields analyzed with ik_max_word
            contentBuilder = XContentFactory.jsonBuilder()
                    .startObject()
                    .startObject(typeName)
                    .startObject("properties")
                    .startObject("posId").field("type", "integer").endObject()
                    .startObject("posName").field("type", "text").field("analyzer", "ik_max_word").endObject()
                    .startObject("posAddress").field("type", "text").field("analyzer", "ik_max_word").endObject()
                    .endObject()
                    .endObject()
                    .endObject();
        } catch (IOException e) {
            e.printStackTrace();
        }
        client.admin().indices().prepareCreate(ikIndexName).addMapping(typeName, contentBuilder).get();
    }
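
    Note that prepareCreate fails if the index already exists; when re-running the test, a small guard like the following (a sketch using the same client) can drop the old index first:

    // Drop the index if it is already there, so createIKIndex can be re-run
    if (client.admin().indices().prepareExists(ikIndexName).get().isExists()) {
        client.admin().indices().prepareDelete(ikIndexName).get();
    }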

     

  4. Import data 

    @Test
    public void loadIKData() {
        // Two documents: posName holds a license-plate number, posAddress a normal Chinese sentence
        PosEntity entity = new PosEntity(3, "渝B21902321", "中国是世界上人口最多的国家");
        PosEntity entity1 = new PosEntity(4, "渝A21902321", "今天是你的生日胖头鱼");
        client.prepareIndex(ikIndexName, typeName, "3").setSource(JSONObject.toJSONString(entity), XContentType.JSON).get();
        client.prepareIndex(ikIndexName, typeName, "4").setSource(JSONObject.toJSONString(entity1), XContentType.JSON).get();
    }

    The imported data includes a license-plate number and a normal Chinese sentence. You can check how a sentence is tokenized in Kibana:

    GET _analyze
    {"text":"今天是你的生日胖子","analyzer":"ik_max_word"}
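
    Newly indexed documents only become visible to search after the next index refresh (roughly one second by default), so if a search test runs immediately after loadIKData you can force a refresh first (a sketch using the same client):

    // Make the freshly indexed documents searchable right away
    client.admin().indices().prepareRefresh(ikIndexName).get();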

  5. Search operations are similar to the earlier ones; both tests use an execSearch helper, sketched right after the two snippets below  

    @Test
    public void multiIK() {
        // Match "是你的" against both IK-analyzed fields
        MultiMatchQueryBuilder queryBuilder = new MultiMatchQueryBuilder("是你的", "posName", "posAddress");
        SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName).setQuery(queryBuilder);
        execSearch(builder);
    }

    @Test
    public void ikStringQuery() {
        // query_string syntax: field:value, analyzed with the field's ik_max_word analyzer
        QueryStringQueryBuilder queryStringQueryBuilder = new QueryStringQueryBuilder("posAddress:是你的");
        SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName).setQuery(queryStringQueryBuilder);
        execSearch(builder);
    }
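
    execSearch just executes the request and prints the hits; it is not shown in the original snippets, so here is a minimal sketch of what it might look like (imports: org.elasticsearch.action.search.SearchResponse, org.elasticsearch.search.SearchHit):

    private void execSearch(SearchRequestBuilder builder) {
        SearchResponse response = builder.get();
        // Print the total hit count and the id/source of each returned document
        System.out.println("total hits: " + response.getHits().getTotalHits());
        for (SearchHit hit : response.getHits().getHits()) {
            System.out.println(hit.getId() + " -> " + hit.getSourceAsString());
        }
    }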
    

     

Extending the ik dictionary

  1. First, analyze the sentence “蓝瘦香菇很好吃” in Kibana 
    	GET _analyze
    	{"text":"蓝瘦香菇很好吃","analyzer":"ik_max_word"}

    The result is as follows 

    {
      "tokens": [
        {
          "token": "蓝",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_CHAR",
          "position": 0
        },
        {
          "token": "瘦",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 1
        },
        {
          "token": "香菇",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "很好",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "很",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 4
        },
        {
          "token": "好吃",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 5
        }
      ]
    }

    蓝瘦香菇 was not recognized as a single word, while 很 was kept as a token. The goal now is to have 蓝瘦香菇 recognized as one whole word and 很 ignored as a stopword.
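
    The same analysis can also be run from the Java TransportClient set up earlier (a minimal sketch; client is the field assumed above, and the import is org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse):

    @Test
    public void analyzeWithIK() {
        // Java-client equivalent of Kibana's GET _analyze with the ik_max_word analyzer
        AnalyzeResponse response = client.admin().indices()
                .prepareAnalyze("蓝瘦香菇很好吃")
                .setAnalyzer("ik_max_word")
                .get();
        for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
            System.out.println(token.getTerm() + " [" + token.getStartOffset() + "," + token.getEndOffset() + "]");
        }
    }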

  2. Go to the ik config directory /home/es/elasticsearch-6.2.2/plugins/ik/config and create a dictionary file 

    vim my_extra.dic

    Add the word 蓝瘦香菇 and save the file.  

  3. Create a stopword file and add 很 to it 

    vim my_stopword.dic
  4. Modify the IK configuration file to register the custom files

    vim IKAnalyzer.cfg.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
            <comment>IK Analyzer extension configuration</comment>
            <!-- Users can configure their own extension dictionaries here -->
            <entry key="ext_dict">my_extra.dic</entry>
            <!-- Users can configure their own extension stopword dictionaries here -->
            <entry key="ext_stopwords">my_stopword.dic</entry>
            <!-- Users can configure remote extension dictionaries here -->
            <!-- <entry key="remote_ext_dict">words_location</entry> -->
            <!-- Users can configure remote extension stopword dictionaries here -->
            <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    </properties>
    

    The extension dictionary file goes into the ext_dict entry and the stopword file into the ext_stopwords entry.

  5. Restart the ES cluster  

  6. Re-run the analysis on the same sentence; the result is as follows 

    {
      "tokens": [
        {
          "token": "蓝瘦香菇",
          "start_offset": 0,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "瘦",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 1
        },
        {
          "token": "香菇",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "很好",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "好吃",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 4
        }
      ]
    }
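
    As a quick end-to-end check from Java, you can index a document containing the new word and search for it (a sketch reusing the client, entity class, and execSearch helper from above; note that documents indexed before the dictionary change keep their old tokens until they are reindexed):

    @Test
    public void searchExtendedWord() {
        // Index a document whose posAddress contains the newly added word
        PosEntity entity = new PosEntity(5, "渝C12345", "蓝瘦香菇很好吃");
        client.prepareIndex(ikIndexName, typeName, "5")
                .setSource(JSONObject.toJSONString(entity), XContentType.JSON).get();
        client.admin().indices().prepareRefresh(ikIndexName).get();

        // 蓝瘦香菇 is now kept as a single token, so a match query on it finds the document
        SearchRequestBuilder builder = client.prepareSearch(ikIndexName).setTypes(typeName)
                .setQuery(new MatchQueryBuilder("posAddress", "蓝瘦香菇"));
        execSearch(builder);
    }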

     
