Elasticsearch Analysis Token Filters: What They Do, with Examples

wangzhuo0978

Article tags: Analysis TokenFilters

Article link: https://blog.csdn.net/wangzhuo0978/article/details/79914849

1.Standard Token Filter

The standard token filter currently does nothing (it is a no-op, and it was removed entirely in Elasticsearch 7.0).

2.ASCII Folding Token Filter

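The asciifolding token filter converts alphabetic, numeric, and symbolic Unicode characters that are not among the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if such equivalents exist.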
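
To make the folding idea concrete, here is a rough Python approximation. It is not how Elasticsearch implements asciifolding (Lucene uses its own mapping table); it only assumes that Unicode decomposition plus dropping combining marks is close enough for accented Latin letters. The function name ascii_fold is hypothetical.

```python
import unicodedata

def ascii_fold(token: str) -> str:
    """Approximate ASCII folding: strip accents via Unicode decomposition."""
    # Decompose characters, e.g. "é" becomes "e" followed by a combining accent
    decomposed = unicodedata.normalize("NFKD", token)
    # Keep the base characters and drop the combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("café"))
print(ascii_fold("açaí à la carte"))
```

Characters that have no decomposition (for example "ß" or "ø") pass through this sketch unchanged, whereas the real filter maps many of them to ASCII as well.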

3.Length Token Filter

The length token filter removes tokens that are too long or too short.

min: the minimum token length

max: the maximum token length

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min":1, "max":3 }],  
  "text" : "this is a test"
}

Output:

"tokens": [
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
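
The same filtering rule can be sketched in a few lines of Python. This is only an illustration of the min/max logic, not the Elasticsearch implementation; the function name length_filter and its default bounds are assumptions chosen for the sketch.

```python
def length_filter(tokens, min_len=0, max_len=2**31 - 1):
    """Keep only tokens whose length lies within [min_len, max_len]."""
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

# Same input as the request above, already split into tokens
print(length_filter(["this", "is", "a", "test"], min_len=1, max_len=3))
```

Both bounds are inclusive in this sketch, which matches the output above: "is" and "a" survive, while the four-letter tokens are dropped.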

4.Lowercase Token Filter

The lowercase token filter normalizes token text to lowercase.

5.Uppercase Token Filter

The uppercase token filter normalizes token text to uppercase.

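6.Porter Stem Token Filter

The porter_stem token filter transforms the token stream according to the Porter stemming algorithm.

Note: the input must already be lowercase for the filter to work correctly.

In short, it returns the stem of an English word:

1> Handles plurals and words ending in ed and ing
2> Turns plurals into their singular form
3> If a word contains a vowel and ends in ys, ys is changed to i (based on testing, this appears to be ...)

And so on; all of the rules revolve around extracting the stem.

Input: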

GET _analyze
{
  "tokenizer" : "standard",
  "filter": ["porter_stem"],  
  "text" : ["I readed books", "eys"]
}

Output:

"tokens": [
    {
      "token": "I",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "read",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "book",
      "start_offset": 9,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "ei",
      "start_offset": 15,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
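
The results above can be reproduced with a heavily simplified sketch of the first Porter steps. This is not the full algorithm and not Lucene's code: it covers only plural removal (step 1a), a reduced version of the ed/ing rule (step 1b, without its clean-up rules), and the y-to-i rule (step 1c). The names contains_vowel and porter_step1 are hypothetical.

```python
VOWELS = set("aeiou")

def contains_vowel(stem):
    """Return True if the stem contains a vowel in Porter's sense."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        # Porter treats "y" as a vowel when it follows a consonant
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False

def porter_step1(word):
    """Apply simplified versions of Porter steps 1a, 1b and 1c."""
    # Step 1a: plurals (sses -> ss, ies -> i, s -> "")
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Step 1b (simplified): drop "ed" or "ing" if the remaining stem has a vowel
    for suffix in ("ed", "ing"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and contains_vowel(stem):
            word = stem
            break
    # Step 1c: a final "y" becomes "i" if the stem has a vowel
    if word.endswith("y") and contains_vowel(word[:-1]):
        word = word[:-1] + "i"
    return word

print([porter_step1(w) for w in ["I", "readed", "books", "eys"]])
```

The sketch also suggests why "eys" becomes "ei": the plural s is removed first, and the remaining "ey" then has its final y turned into i.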

7.Shingle Token Filter

The shingle token filter builds combinations of adjacent tokens and emits each combination as a single token.

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "shingle", "output_unigrams": "false"}],  
  "text" : ["this is a test"]
}

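Output:

Note: if output_unigrams is set to true (which is also the default), the original single tokens are included in the output as well.

# "this is a test" becomes "this is", "is a", "a test": adjacent tokens are combined in pairs, and the number of tokens per shingle can be customized (min_shingle_size and max_shingle_size)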

[
    {
      "token": "this is",
      "start_offset": 0,
      "end_offset": 7,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "is a",
      "start_offset": 5,
      "end_offset": 9,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "a test",
      "start_offset": 8,
      "end_offset": 14,
      "type": "shingle",
      "position": 2
    }
  ]
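
As an illustration of how shingles are formed, the following Python sketch builds them from a token list. It is a simplified model rather than the Lucene implementation; the function name shingles is hypothetical, and its keyword arguments are only loosely modeled on the filter's min_shingle_size, max_shingle_size, and output_unigrams settings.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Build word n-grams (shingles) from a list of tokens."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            # Emit the original token before the shingles that start at it
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out

tokens = ["this", "is", "a", "test"]
print(shingles(tokens, output_unigrams=False))
print(shingles(tokens))
```

With output_unigrams=False the sketch yields the three pairs shown above; with the default it interleaves the single tokens with the pairs.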

8.Stop Token Filter

The stop token filter removes the words listed in stopwords from the token stream.

stopwords: a list of stop words; defaults to the _english_ stop words

Input:

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "stop", "stopwords": ["this", "a"]}],
  "text" : ["this is a test"]
}

Output:

# The stop words "this" and "a" listed in stopwords are filtered out
"tokens": [
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "test",
      "start_offset": 10,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
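
Note that the surviving tokens keep their original positions (1 and 3), so the removed words leave gaps. A small Python sketch of that behavior, under the assumption that positions are simply the token indexes (the function name stop_filter is hypothetical):

```python
def stop_filter(tokens, stopwords):
    """Drop stop words while keeping each surviving token's original position."""
    stop = set(stopwords)
    return [(tok, pos) for pos, tok in enumerate(tokens) if tok not in stop]

print(stop_filter(["this", "is", "a", "test"], ["this", "a"]))
```

Keeping the positions matters for phrase queries, which rely on the distance between tokens.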

9.Word Delimiter Token Filter

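The word_delimiter token filter splits words into subwords and performs optional transformations on groups of subwords. Words are split into subwords according to the following rules:

Split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"

Split on case transitions: "PowerShot" → "Power", "Shot"

Split on letter-number transitions: "SD500" → "SD", "500"

Leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"

A trailing "'s" is removed from each subword: "O’Neil’s" → "O", "Neil"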
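Parameters include:

generate_word_parts

If true, word parts are generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.

generate_number_parts

If true, number subwords are generated: "500-42" ⇒ "500" "42". Defaults to true.

catenate_words

If true, maximal runs of word parts are concatenated: "wi-fi" ⇒ "wifi". Defaults to false.

catenate_numbers

If true, maximal runs of number parts are concatenated: "500-42" ⇒ "50042". Defaults to false.

catenate_all

If true, all subword parts are concatenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.

split_on_case_change

If true, "PowerShot" becomes two tokens (it is treated the same as the two parts of "Power-Shot"). Defaults to true.

preserve_original

If true, the original word is kept alongside its subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.

split_on_numerics

If true, "j2se" becomes three tokens: "j" "2" "se". Defaults to true.

stem_english_possessive

If true, a trailing "'s" is removed from each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.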

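Advanced settings:

protected_words

A list of words that are protected from being split. Either an array, or set protected_words_path to a file containing the protected words (one per line). The path is automatically resolved relative to the config/ location if the file exists there.

Input: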

GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "word_delimiter"}],  
  "text" : ["PowerShot", "219-230"]
}

Output:

# PowerShot and 219-230 are split into Power, Shot, 219, 230
"tokens": [
    {
      "token": "Power",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Shot",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "219",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 2
    },
    {
      "token": "230",
      "start_offset": 14,
      "end_offset": 17,
      "type": "<NUM>",
      "position": 3
    }
  ]
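
The splitting rules listed earlier can be approximated with regular expressions. The sketch below is an ASCII-only simplification, not the Lucene implementation: it ignores the catenate_* and preserve_original options, and the function name word_delimiter and its keyword arguments merely echo the filter's parameter names for readability.

```python
import re

def word_delimiter(token, split_on_case_change=True, split_on_numerics=True,
                   stem_english_possessive=True):
    """Split one token into subwords (ASCII-only approximation)."""
    if stem_english_possessive:
        # Drop a possessive 's (straight or curly apostrophe) at the end of a subword
        token = re.sub("['\u2019]s(?![0-9A-Za-z])", "", token)
    parts = []
    # Intra-word delimiters: every run of non-alphanumeric characters
    for chunk in re.split(r"[^0-9A-Za-z]+", token):
        if split_on_case_change:
            # Mark a boundary at each lowercase -> uppercase transition
            chunk = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", chunk)
        if split_on_numerics:
            # Mark a boundary at each letter <-> digit transition
            chunk = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", chunk)
        # split() also discards the empty chunks left by leading/trailing delimiters
        parts.extend(chunk.split())
    return parts

for word in ["Wi-Fi", "PowerShot", "SD500", "//hello---there, dude", "j2se"]:
    print(word, "->", word_delimiter(word))
```

Turning off split_on_case_change or split_on_numerics in the sketch leaves the corresponding chunks intact, which mirrors the effect described for those parameters.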

10.Word Delimiter Graph Token Filter

Skipped.

11.Stemmer Token Filter

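The stemmer token filter provides access to almost all of the available stemming token filters, so it acts as a single unified interface.

Usage:
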
PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "my_stemmer"]
                }
            },
            "filter" : {
                "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "light_german"
                }
            }
        }
    }
}

The language/name parameter controls the stemmer with the following available values (the preferred filters are marked in bold):

Arabic	`arabic`
Armenian	`armenian`
Basque
