Elasticsearch Analyzer Summary

I. The ik and pinyin analyzers

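Today I used an address book to demonstrate ES search. When searching by name, I wanted both Chinese and pinyin queries to work, so in addition to ik, the Chinese analyzer I had been using, I also downloaded the pinyin analyzer. My notes on using them follow: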

1. Download

ik: https://github.com/medcl/elasticsearch-analysis-ik
pinyin: https://github.com/medcl/elasticsearch-analysis-pinyin

2. Installation

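Unzip the downloaded files and place them under the plugins folder of the ES directory; you can create ik and pinyin subfolders for them.
For some reason I could not download the zip file for the pinyin analyzer directly, so I downloaded the source, built the package myself, and unzipped the result into plugins/pinyin.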

3. Testing the pinyin analyzer

GET _analyze?pretty
{
  "analyzer": "pinyin",
  "text": "刘德华"
}

Result:

{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    }
  ]
}
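
The last token, ldh, is the first-letter abbreviation of the three syllables, which is why a query such as ldh can also find 刘德华. A minimal Python sketch of that derivation, working only on the tokens shown above; the function name first_letter_abbreviation is hypothetical, and this is an illustration rather than the plugin's actual implementation:

```python
def first_letter_abbreviation(syllables):
    """Join the first letter of each pinyin syllable (skipping empty strings)."""
    return "".join(s[0] for s in syllables if s)


# Full-pinyin tokens copied from the _analyze result above.
full_pinyin_tokens = ["liu", "de", "hua"]
abbreviation = first_letter_abbreviation(full_pinyin_tokens)
print(abbreviation)  # prints ldh
```
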
4. Analyzer configuration in the index template

Analyzer configuration in the template's settings:

"analysis" : {
    "analyzer" : {
        "ik" : {
            "tokenizer" : "ik_max_word"
        },
        "pinyin_analyzer" : {
            "tokenizer" : "my_pinyin"
        }
    },
    "tokenizer" : {
        "my_pinyin" : {
            "keep_separate_first_letter" : "false",
            "lowercase" : "true",
            "type" : "pinyin",
            "limit_first_letter_length" : "16",
            "remove_duplicated_term" : "true",
            "keep_original" : "true",
            "keep_full_pinyin" : "true",
            "keep_joined_full_pinyin" : "true",
            "keep_none_chinese_in_joined_full_pinyin" : "true"
        }
    }
}

The options under my_pinyin are documented at https://github.com/medcl/elasticsearch-analysis-pinyin and can be adjusted to suit your needs.

5. Defining the type in the mapping

A single property can be given several analyzers through fields:

"mappings": {
    "doc": {
        "properties": {
            "PERSON_ENAME": {
                "type": "text",
                "fields": {
                    "ik": { "type": "text", "analyzer": "ik" },
                    "english": { "type": "text", "analyzer": "english" },
                    "standard": { "type": "text" }
                }
            },
            "CONTACTER_NAME": {
                "type": "text",
                "fields": {
                    "ik": { "type": "text", "analyzer": "ik" },
                    "pinyin": { "type": "text", "analyzer": "pinyin_analyzer" },
                    "standard": { "type": "text" }
                }
            }
        }
    }
}
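
Every analyzer named in the mapping has to be either built in or defined in the settings from step 4, otherwise index creation is rejected. A minimal Python sketch of a pre-flight check for that, assuming the two fragments above are available as dictionaries; undefined_analyzers and BUILT_IN_ANALYZERS are hypothetical names, the built-in list is deliberately partial, and the tokenizer section is left out for brevity:

```python
# Partial list of analyzers that ship with Elasticsearch (assumption: extend as needed).
BUILT_IN_ANALYZERS = {"standard", "simple", "whitespace", "stop", "keyword", "english"}


def undefined_analyzers(analysis, mappings):
    """Return analyzer names used in the mappings but defined nowhere."""
    defined = set(analysis.get("analyzer", {})) | BUILT_IN_ANALYZERS
    missing = set()

    def walk(properties):
        for field in properties.values():
            name = field.get("analyzer")
            if name is not None and name not in defined:
                missing.add(name)
            # Multi-fields live under "fields", nested objects under "properties".
            walk(field.get("fields", {}))
            walk(field.get("properties", {}))

    for type_mapping in mappings.values():
        walk(type_mapping.get("properties", {}))
    return sorted(missing)


analysis = {
    "analyzer": {
        "ik": {"tokenizer": "ik_max_word"},
        "pinyin_analyzer": {"tokenizer": "my_pinyin"},
    },
}

mappings = {
    "doc": {
        "properties": {
            "PERSON_ENAME": {
                "type": "text",
                "fields": {
                    "ik": {"type": "text", "analyzer": "ik"},
                    "english": {"type": "text", "analyzer": "english"},
                    "standard": {"type": "text"},
                },
            },
            "CONTACTER_NAME": {
                "type": "text",
                "fields": {
                    "ik": {"type": "text", "analyzer": "ik"},
                    "pinyin": {"type": "text", "analyzer": "pinyin_analyzer"},
                    "standard": {"type": "text"},
                },
            },
        }
    }
}

print(undefined_analyzers(analysis, mappings))  # prints []
```

If pinyin_analyzer were dropped from the settings, the same call would report it as missing.
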
6. Testing

Query across multiple fields:

POST sim/doc/_search
{
  "query": {
    "multi_match" : {
      "query" : "dfbb",
      "fields" : [
        "PERSON_ENAME.ik",
        "PERSON_ENAME.standard",
        "PERSON_ENAME.english",
        "CONTACTER_NAME.ik",
        "CONTACTER_NAME.standard",
        "CONTACTER_NAME.pinyin"
      ]
    }
  }
}
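
The same request can be issued from a script. A minimal sketch using only the Python standard library; build_multi_match and search are hypothetical helper names, and the base URL is an assumption about where the node is running:

```python
import json
import urllib.request


def build_multi_match(query, fields):
    """Build a multi_match search body for the given query text and fields."""
    return {"query": {"multi_match": {"query": query, "fields": list(fields)}}}


def search(base_url, index, doc_type, body):
    """POST the body to <base_url>/<index>/<doc_type>/_search and return the parsed JSON."""
    url = "{}/{}/{}/_search".format(base_url.rstrip("/"), index, doc_type)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Sub-fields defined in the mapping from step 5.
FIELDS = [
    "PERSON_ENAME.ik",
    "PERSON_ENAME.standard",
    "PERSON_ENAME.english",
    "CONTACTER_NAME.ik",
    "CONTACTER_NAME.standard",
    "CONTACTER_NAME.pinyin",
]

body = build_multi_match("dfbb", FIELDS)
print(json.dumps(body, indent=2))
```

With a node running, search("http://localhost:9200", "sim", "doc", body) would send the query; the address is an assumption and should be replaced with your own.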
