Elasticsearch Text Analysis (Part 3)

Article link: https://blog.csdn.net/Suubyy/article/details/118383524

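Token filters

A token filter receives the token stream produced by a tokenizer. It can modify tokens (for example, lowercasing them), remove tokens (for example, dropping stop words), or add tokens (for example, synonyms).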

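Apostrophe (`'`) token filter

Strips all characters after an apostrophe, including the apostrophe itself.

This filter is included in Elasticsearch's built-in Turkish language analyzer. It uses Lucene's ApostropheFilter, which was built for the Turkish language.

Example

The following analyze API request demonstrates how the apostrophe token filter works: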

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer" : "standard",
  "filter" : ["apostrophe"],
  "text" : "Istanbul\u0027a veya Istanbul\u0027dan"
}
'

The filter produces the following tokens:

[ Istanbul, veya, Istanbul ]
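
To make the rule concrete, here is a minimal Python sketch of the same idea. It is not Lucene's implementation: the function name `apostrophe_filter` is hypothetical, and the input list stands in for what the standard tokenizer would emit for the text above.

```python
# Characters treated as apostrophes in this sketch: ASCII and typographic.
APOSTROPHES = ("'", "\u2019")


def apostrophe_filter(tokens):
    """Cut each token at its first apostrophe, dropping the apostrophe too."""
    filtered = []
    for token in tokens:
        # Positions of every apostrophe variant that occurs in the token.
        positions = [token.find(ch) for ch in APOSTROPHES if ch in token]
        filtered.append(token[:min(positions)] if positions else token)
    return filtered


# Assumed tokenizer output for "Istanbul'a veya Istanbul'dan".
print(apostrophe_filter(["Istanbul'a", "veya", "Istanbul'dan"]))
# ['Istanbul', 'veya', 'Istanbul']
```

Tokens without an apostrophe pass through unchanged, which is why `veya` is untouched.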

Add to an analyzer

The following create index API request uses the apostrophe token filter to configure a new custom analyzer:

curl -X PUT "localhost:9200/apostrophe_example?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_apostrophe": {
          "tokenizer": "standard",
          "filter": [ "apostrophe" ]
        }
      }
    }
  }
}
'

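Classic token filter

Performs optional post-processing of terms generated by the classic tokenizer. This filter removes the English possessive ('s) from the end of words and removes the dots from acronyms. It uses Lucene's ClassicFilter.

Example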

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer" : "classic",
  "filter" : ["classic"],
  "text" : "The 2 Q.U.I.C.K. Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The filter produces the following tokens:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, bone ]
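
The two rewrites applied by this filter can be sketched in Python as follows. This is an illustration under stated assumptions, not Lucene's code: the function `classic_filter` is hypothetical, and the type labels `"ACRONYM"`, `"APOSTROPHE"` and `"ALPHANUM"` are simplified stand-ins for the token types that the classic tokenizer attaches to each token.

```python
def classic_filter(typed_tokens):
    """Apply the two classic-filter rewrites to (term, type) pairs."""
    terms = []
    for term, token_type in typed_tokens:
        if token_type == "APOSTROPHE" and term[-2:].lower() == "'s":
            # Drop a trailing English possessive: dog's -> dog
            term = term[:-2]
        elif token_type == "ACRONYM":
            # Drop the dots of an acronym: Q.U.I.C.K. -> QUICK
            term = term.replace(".", "")
        terms.append(term)
    return terms


# Hand-labelled sample input; the labels are assumptions for this sketch.
sample = [
    ("Q.U.I.C.K.", "ACRONYM"),
    ("Brown", "ALPHANUM"),
    ("dog's", "APOSTROPHE"),
    ("bone", "ALPHANUM"),
]
print(classic_filter(sample))
# ['QUICK', 'Brown', 'dog', 'bone']
```

The sketch relies on token types rather than on the text alone, so a term is only rewritten when it was recognised as an acronym or an apostrophe word.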

Add to an analyzer

curl -X PUT "localhost:9200/classic_example?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "classic_analyzer": {
          "tokenizer": "classic",
          "filter": [ "classic" ]
        }
      }
    }
  }
}
'

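Conditional token filter

Example

The following analyze API request uses the condition filter to match tokens with fewer than 5 characters in THE QUICK BROWN FOX. It then applies the lowercase filter to the matching tokens, converting them to lowercase.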

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
'

The filter produces the following tokens:

[ the, QUICK, BROWN, fox ]
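
The control flow of the request above can be sketched in Python. This is only an illustration: `conditional_filter` is a hypothetical name, the lambda stands in for the Painless predicate script, and `str.lower` stands in for the lowercase filter.

```python
def conditional_filter(tokens, predicate, filters):
    """Apply every filter, in order, only to tokens that match the predicate."""
    output = []
    for position, term in enumerate(tokens):
        if predicate(term, position):
            for token_filter in filters:
                term = token_filter(term)
        output.append(term)
    return output


tokens = ["THE", "QUICK", "BROWN", "FOX"]
# Predicate mirrors the script "token.getTerm().length() < 5".
result = conditional_filter(tokens, lambda term, position: len(term) < 5, [str.lower])
print(result)
# ['the', 'QUICK', 'BROWN', 'fox']
```

Tokens that fail the predicate are emitted unchanged, so the stream keeps its length and order.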

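Configurable parameters

filter: (Required, array of token filters) Array of token filters. If a token matches the predicate script in the script parameter, these filters are applied to the token in the order given. They can include custom token filters defined in the index mapping.
script: (Required, script object) Predicate script that decides which tokens are filtered. If a token matches this script, the filters in the filter parameter are applied to that token.

Customize and add to an analyzer

The following create index API request defines a custom condition filter, reverse_first_token, which applies the reverse filter to the first token in the stream, and uses it to configure a new custom analyzer: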

curl -X PUT "localhost:9200/palindrome_list?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_reverse_first_token": {
          "tokenizer": "whitespace",
          "filter": [ "reverse_first_token" ]
        }
      },
      "filter": {
        "reverse_first_token": {
          "type": "condition",
          "filter": [ "reverse" ],
          "script": {
            "source": "token.getPosition() === 0"
          }
        }
      }
    }
  }
}
'
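
As a rough picture of what the whitespace_reverse_first_token analyzer above would do, the following Python sketch splits on whitespace and reverses only the token at position 0. The function name and the sample text are assumptions made for illustration, and `str.split` is just an approximation of the whitespace tokenizer.

```python
def analyze_reverse_first_token(text):
    """Split on whitespace, then reverse only the token at position 0."""
    tokens = text.split()
    return [term[::-1] if position == 0 else term
            for position, term in enumerate(tokens)]


print(analyze_reverse_first_token("stressed desserts"))
# ['desserts', 'desserts']
```

Because the predicate checks the position rather than the term, later tokens are never reversed, whatever their content.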

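Delimited payload token filter

Separates a token stream into tokens and payloads based on a specified delimiter.

For example, you can use the delimited_payload filter with a | delimiter to split the|1 quick|2 fox|3 into the tokens the, quick and fox, with respective payloads of 1, 2 and 3.

Example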

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "filter": ["delimited_payload"],
  "text": "the|0 brown|10 fox|5 is|0 quick|10"
}
'

The filter produces the following tokens:

[ the, brown, fox, is, quick ]

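Note: the analyze API does not return stored payloads. For an example that does return them, see the "Return stored payloads" section of the Elasticsearch documentation for this filter.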
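
Since the analyze API hides the payloads, a small Python sketch may help show how each token is split. It is an illustration only: `delimited_payload_filter` is a hypothetical name, and the payloads are kept as plain strings here, whereas Elasticsearch stores them in encoded form.

```python
def delimited_payload_filter(tokens, delimiter="|"):
    """Split each token at the first delimiter into (term, payload)."""
    pairs = []
    for token in tokens:
        term, separator, payload = token.partition(delimiter)
        # A token without the delimiter carries no payload.
        pairs.append((term, payload if separator else None))
    return pairs


# str.split() approximates the whitespace tokenizer used in the request.
pairs = delimited_payload_filter("the|0 brown|10 fox|5 is|0 quick|10".split())
print([term for term, _ in pairs])
# ['the', 'brown', 'fox', 'is', 'quick']
print([payload for _, payload in pairs])
# ['0', '10', '5', '0', '10']
```

The first list matches the tokens returned by the analyze API; the second is the information that the response leaves out.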

Add to an analyzer

The following create index API request uses the delimited_payload filter to configure a new custom analyzer.

curl -X PUT "localhost:9200/delimited_payload?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_delimited_payload": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
'