Elasticsearch 7 Analyzers (Built-in and Custom)

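This article covers text analysis in Elasticsearch 7: an overview of the analysis settings, character filters such as html_strip, mapping, and pattern_replace, and token filters such as asciifolding, length, and ngram. It discusses tokenizers such as the Standard tokenizer, NGram Tokenizer, and Edge NGram Tokenizer, covers built-in analyzers such as standard, simple, and fingerprint, and explains how to create a custom analyzer.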

analysis

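Overview

"settings": {
    "analysis": {  # custom analysis settings
      "filter": {
        "my_custom_filter": {
          "type": "edge_ngram",  # filter type
          "min_gram": 1,  # minimum gram length
          "max_gram": 6  # maximum gram length
        }
      },  # token filters
      "char_filter": {},  # character filters
      "tokenizer": {},  # tokenizers
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "name of a custom tokenizer defined above, or a built-in tokenizer",
          "filter": [
            "name of a custom token filter defined above, or a built-in token filter"
          ],
          "char_filter": [
            "name of a custom character filter defined above, or a built-in character filter"
          ]
        }
      }  # analyzers
    }
}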
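To make the edge_ngram filter in the template above more concrete, here is a minimal Python sketch of what an edge n-gram filter with min_gram 1 and max_gram 6 does to a single token. The function name edge_ngrams is hypothetical, and the code only imitates the behaviour for illustration; it is not how Elasticsearch implements the filter.

```python
def edge_ngrams(token, min_gram=1, max_gram=6):
    """Return the prefixes of `token` whose length lies between min_gram and max_gram."""
    # A token shorter than max_gram can only produce prefixes up to its own length.
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


if __name__ == "__main__":
    print(edge_ngrams("cats"))           # prefixes of a short token
    print(edge_ngrams("elasticsearch"))  # prefixes are capped at max_gram characters
```

Indexing prefixes like these is what lets a search-as-you-type query match a token before the user has typed it in full.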

Testing an analyzer:

1. Test an analyzer defined on a specific index
POST /discovery-user/_analyze
{
  "analyzer": "analyzer_ngram",
  "text": "i like cats"
}
2. Test an analyzer that is available on every index
POST _analyze
{
  "analyzer": "standard",  # or, for example: english, ik_max_word, ik_smart
  "text": "i like cats"
}

char_filter

Definition: a character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
For example, it can strip HTML elements or convert 0123 to 零一二三.

An analyzer may have zero or more character filters, which are applied in order.

Built-in character filters in Elasticsearch 7:

  • HTML Strip Character Filter: html_strip
The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.
  • Mapping Character Filter: mapping
The mapping character filter replaces any occurrences of the specified strings with the specified replacements.
  • Pattern Replace Character Filter: pattern_replace
The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.

html_strip

html_strip accepts an escaped_tags parameter: an array of HTML tags which should not be stripped from the original text.

"char_filter": {
    "my_char_filter": {
      "type": "html_strip",
      "escaped_tags": ["b"]
    }
}

Analyzing a sample string with an analyzer that uses this filter (my_analyzer here):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

Result:
I'm so <b>happy</b>!  # the <b> tags are left in place
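The effect of escaped_tags can be imitated in a few lines of Python. This is only a rough sketch under simplifying assumptions: strip_html is a hypothetical helper, tags are matched with a simple regular expression, and details of the real filter (such as how whitespace around removed tags is handled) are ignored.

```python
import html
import re

# Matches an opening or closing tag and captures the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_html(text, escaped_tags=()):
    """Remove HTML tags except those listed in escaped_tags, then decode entities."""
    keep = {tag.lower() for tag in escaped_tags}

    def replace(match):
        # Leave escaped tags untouched; drop every other tag.
        return match.group(0) if match.group(1).lower() in keep else ""

    return html.unescape(TAG.sub(replace, text))


if __name__ == "__main__":
    sample = "<p>I&apos;m so <b>happy</b>!</p>"
    print(strip_html(sample, escaped_tags=["b"]))
    print(strip_html(sample))
```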

mapping

The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that is the same as a key, it replaces them with the value associated with that key.
Replacements are allowed to be the empty string.

The mapping character filter accepts the following two parameters, one of which must be provided:

mappings

An array of mappings, with each element having the form key => value.

mappings_path

A path, either absolute or relative to the config directory, to a UTF-8 encoded text mappings file containing a key => value mapping per line.

"char_filter": {
    "my_char_filter": {
      "type": "mapping",
      "mappings": [
        "一 => 0",
        "二 => 1",
        "# => ",  # the replacement may be empty
        "一二三 => 老虎"  # a key may span several characters
      ]
    }
}
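The rules above can be tried out with a small Python sketch. It assumes that when several keys match at the same position the longest one wins, which is how the Elasticsearch reference describes the filter; apply_mappings is a hypothetical helper and not part of any Elasticsearch client.

```python
def apply_mappings(text, rules):
    """Apply "key => value" rules to text, preferring the longest key at each position."""
    table = {}
    for rule in rules:
        key, _, value = rule.partition("=>")
        if key.strip():  # ignore rules with an empty key
            table[key.strip()] = value.strip()
    keys = sorted(table, key=len, reverse=True)  # try longer keys first

    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    rules = ["一 => 0", "二 => 1", "# => ", "一二三 => 老虎"]
    print(apply_mappings("一二三#一二", rules))
```

With these rules the leading 一二三 is consumed by the three-character key rather than by the single-character keys, the # is dropped, and the remaining 一二 is replaced character by character.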

pattern_replace

The pattern_replace character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.

Beware of pathological regular expressions: an inefficient regular expression can run very slowly or even cause a StackOverflowError. Regular expressions in Elasticsearch 7 follow the syntax of Java's Pattern class.

The pattern_replace character filter accepts the following parameters:

pattern

A Java regular expression. Required.

replacement

The replacement string, which can reference capture groups using the $1..$9 syntax.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

Example, turning 123-456-789 into 123_456_789:

"char_filter": {
    "my_char_filter": {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
}
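The same pattern can be checked outside Elasticsearch. The sketch below uses Python's re module rather than Java's Pattern, so the group reference is written \1 where the filter configuration uses $1; replace_dashes is a hypothetical name chosen for this illustration.

```python
import re


def replace_dashes(text):
    """Replace a dash that sits between digits with an underscore."""
    # (\d+) captures the digits before the dash; (?=\d) requires a digit after it
    # without consuming it, so the next group of digits can match again.
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)


if __name__ == "__main__":
    print(replace_dashes("123-456-789"))
```

Because the lookahead does not consume the following digit, every dash in a run such as 123-456-789 is replaced, not just the first one.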

Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting.

filter

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like "the" from the token stream, and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.
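Because token filters run in the order they are listed, changing the order can change the result. The following Python sketch illustrates this with two toy filters; lowercase, stop, and run_filters are hypothetical stand-ins, and the stop-word list is an assumption made up for this example.

```python
def lowercase(tokens):
    """Toy lowercase filter: lowercases every token."""
    return [token.lower() for token in tokens]


def stop(tokens, stop_words=frozenset({"the", "a", "an"})):
    """Toy stop filter: drops tokens that exactly match a stop word."""
    return [token for token in tokens if token not in stop_words]


def run_filters(tokens, filters):
    """Apply each filter in turn, feeding the output of one into the next."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    tokens = ["The", "Quick", "fox"]
    print(run_filters(tokens, [lowercase, stop]))  # "The" is lowercased first, then removed
    print(run_filters(tokens, [stop, lowercase]))  # "The" is not matched by stop, so it survives
```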

asciifolding

A token filter of type asciifolding converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

It accepts a preserve_original setting, which defaults to false; if true, the filter keeps the original token as well as emitting the folded token.
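The idea behind asciifolding and preserve_original can be approximated in Python with Unicode decomposition. This is a rough sketch, not the real filter: ascii_fold is a hypothetical helper, it only handles characters that decompose into a base letter plus combining marks, and characters without such a decomposition are left unchanged.

```python
import unicodedata


def ascii_fold(token, preserve_original=False):
    """Strip combining marks from token; optionally also return the original token."""
    decomposed = unicodedata.normalize("NFKD", token)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if preserve_original and folded != token:
        # Emit both forms when folding actually changed the token.
        return [folded, token]
    return [folded]


if __name__ == "__main__":
    print(ascii_fold("caf\u00e9"))
    print(ascii_fold("caf\u00e9", preserve_original=True))
```

Keeping the original token alongside the folded one lets a query match both the accented and the unaccented spelling.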
