ElasticSearch50: Index Management: Hands-On Practice Modifying an Analyzer and Building Your Own Custom Analyzer

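Copyright notice: this is an original article by the blogger. It may be reposted, but the source must be credited. https://blog.csdn.net/m0_37557582/article/details/79005336
1. The default analyzer
standard
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it, and so on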
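
The chain above can be sketched in a few lines of Python. This is only a rough model, not the Lucene implementation: the regular expression \w+ merely approximates word-boundary tokenization, and ENGLISH_STOPWORDS is a small hypothetical stand-in for the real English stop-word list.

```python
import re

# Hypothetical, shortened stand-in for the built-in English stop-word list.
ENGLISH_STOPWORDS = {"a", "an", "the", "it", "is", "of"}

def standard_analyze(text, stopwords=None):
    """Rough model of the standard analyzer: tokenize, lowercase, optional stop filter."""
    # Tokenizer step: \w+ only approximates splitting on word boundaries.
    tokens = re.findall(r"\w+", text)
    # lowercase token filter
    tokens = [t.lower() for t in tokens]
    # stop token filter: skipped unless a stop-word set is supplied
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(standard_analyze("A Little Dog"))                     # stop filter disabled
print(standard_analyze("A Little Dog", ENGLISH_STOPWORDS))  # stop filter enabled
```

With the stop filter left off, every word survives in lowercase; passing a stop-word set removes a, which mirrors the difference between the default behaviour and an analyzer with stop words enabled.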

2. Changing the analyzer settings

Example: enable the English stop-word token filter on an analyzer of type standard.
Here es_std is the name we give to this analyzer.
PUT /index0
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}



Test:

Analyze a little dog with the standard analyzer:

GET /index0/_analyze
{
  "analyzer":"standard",
  "text":"a little dog"
}
Result:
{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}


Analyze a little dog with the es_std analyzer configured above. In the result, the stop word a has been filtered out:

GET /index0/_analyze
{
  "analyzer":"es_std",
  "text":"a little dog"
}
Result:

{
  "tokens": [
    {
      "token": "little",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 9,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
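
Note that in the result above little still has position 1 and dog still has position 2, even though a is gone. The sketch below illustrates that behaviour under simplifying assumptions: the function name analyze_with_positions is hypothetical, and \w+ again only approximates the standard tokenizer.

```python
import re

def analyze_with_positions(text, stopwords=frozenset()):
    """Return tokens with offsets and positions, dropping stop words but keeping the gaps."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in stopwords:
            # The token is dropped, but its position number is still used up.
            continue
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

for tok in analyze_with_positions("a little dog", {"a"}):
    print(tok)
```

Keeping the gap in the position numbers is what lets position-based queries, such as phrase queries, know that a word once stood between the remaining tokens.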





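3. Building your own custom analyzer
Example:
char_filter: of type mapping; defines our own replacement character filter. Here it converts & to and, and we name the filter &_to_and.
my_stopwords: of type stop; defines our own stop words. Here we set two stop words, a and the.
my_analyzer: of type custom; this is our custom analyzer. Before tokenization, html_strip removes HTML tags and &_to_and (the character filter defined above) replaces & with and. Tokenization uses the standard tokenizer. Afterwards the lowercase filter converts all tokens to lowercase and my_stopwords removes our stop words.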

PUT /index0
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and":{
          "type":"mapping",
          "mappings":["&=> and"]
        }
      },
      "filter":{
        "my_stopwords":{
          "type":"stop",
          "stopwords":["a","the"]
        }
      },
      "analyzer":{
        "my_analyzer":{
          "type":"custom",
          "char_filter":["html_strip","&_to_and"],
          "tokenizer":"standard",
          "filter":["lowercase","my_stopwords"]
        }
      }
    }
  }
}


Running this request fails because the index already exists:
{
  "error": {
    "root_cause": [
      {
        "type": "index_already_exists_exception",
        "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
        "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
        "index": "index0"
      }
    ],
    "type": "index_already_exists_exception",
    "reason": "index [index0/zeKanPhhTR-6fiUjKRoe9g] already exists",
    "index_uuid": "zeKanPhhTR-6fiUjKRoe9g",
    "index": "index0"
  },
  "status": 400
}

So we first delete the index with DELETE /index0 and then run the request again.
This time it succeeds:
{
  "acknowledged": true,
  "shards_acknowledged": true
}



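Testing our analyzer my_analyzer:
Sample text: tom and jery in the a house <a> & me HAHA
The result shows that a and the were filtered out, HAHA was converted to lowercase, & was converted to and, and the <a> tag was stripped.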

GET /index0/_analyze
{
  "analyzer": "my_analyzer",
  "text":"tom and jery in the a house <a> & me HAHA"
}

Result:

{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jery",
      "start_offset": 8,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "house",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "and",
      "start_offset": 32,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "me",
      "start_offset": 34,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "haha",
      "start_offset": 37,
      "end_offset": 41,
      "type": "<ALPHANUM>",
      "position": 9
    }
  ]
}
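
To make the order of the stages explicit (character filters, then tokenizer, then token filters), here is a minimal Python sketch that imitates my_analyzer. It is an approximation under stated assumptions: html_strip is modelled by a simple tag-matching regular expression, the standard tokenizer by \w+, and offsets and positions are ignored.

```python
import re

MY_STOPWORDS = {"a", "the"}

def my_analyzer(text):
    """Approximate the custom analyzer: char filters -> tokenizer -> token filters."""
    # char_filter 1: html_strip, approximated by removing anything that looks like a tag
    text = re.sub(r"<[^>]+>", " ", text)
    # char_filter 2: the &_to_and mapping
    text = text.replace("&", "and")
    # tokenizer: \w+ stands in for the standard tokenizer
    tokens = re.findall(r"\w+", text)
    # token filter 1: lowercase
    tokens = [t.lower() for t in tokens]
    # token filter 2: my_stopwords
    return [t for t in tokens if t not in MY_STOPWORDS]

print(my_analyzer("tom and jery in the a house <a> & me HAHA"))
```

Because lowercase runs before my_stopwords, capitalized forms such as The or A are removed as well; with the two filters in the opposite order they would slip through the stop-word check.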





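4. Using the custom analyzer in our index
Set the content field of my_type to use our custom analyzer my_analyzer. Note that the mapping is written with PUT; a GET request only reads the existing mapping.
PUT /index0/_mapping/my_type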
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"my_analyzer"
        }
    }
}
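
Once a text field has an analyzer, the same analyzer is by default applied both to documents at index time and to full-text queries on that field at search time. The toy inverted index below sketches why that matters. Everything in it is illustrative: analyze is the same rough approximation of my_analyzer, and build_index, match, and the sample documents are hypothetical helpers, not part of any Elasticsearch API.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the"}

def analyze(text):
    """Rough approximation of my_analyzer (html_strip, & -> and, lowercase, stop words)."""
    text = re.sub(r"<[^>]+>", " ", text).replace("&", "and")
    words = (w.lower() for w in re.findall(r"\w+", text))
    return [w for w in words if w not in STOPWORDS]

def build_index(docs):
    """Map every analyzed term of the content field to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return index

def match(index, query):
    """Return ids of documents that contain at least one analyzed query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

docs = {1: "Tom & Jery in the <b>House</b>", 2: "a little dog"}
index = build_index(docs)
print(match(index, "HOUSE"))  # query is lowercased before lookup
print(match(index, "and"))    # & was indexed as and
print(match(index, "the"))    # stop word: no terms left to look up
```

Since both sides go through the same analysis, a query for HOUSE finds the document that contained <b>House</b>, while a query consisting only of stop words has nothing left to search for.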









