Elasticsearch: An Introduction to Analyzers

九品神元师

Published 2024-07-22 20:44:10

Tags: elasticsearch, big data, search engine

Article link: https://blog.csdn.net/yimin_tank/article/details/140619212

Analysis

Analysis (text analysis, also called tokenization) is the process of converting full text into a series of terms.

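Components of an Analyzer

An analyzer usually consists of three parts, applied in this order.

Character Filters: pre-process the raw text, for example by stripping HTML tags.
Tokenizer: splits the string into terms according to a set of rules.
Token Filter: post-processes the resulting terms, for example by lowercasing them, removing stopwords, or adding synonyms.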
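
The three stages above can be sketched as a small pipeline. This is only a toy model in Python, not how ES implements analysis: the function names, the regex-based HTML stripping, and the three-word stopword set are all assumptions made for illustration.

```python
import re

# Illustrative stopword subset (assumption); ES ships a longer default list.
STOPWORDS = {"the", "a", "is"}

def char_filter(text):
    """Character filter: replace HTML tags with spaces (crude regex, demo only)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text):
    """Tokenizer: split on runs of non-word characters and drop empty pieces."""
    return [piece for piece in re.split(r"\W+", text) if piece]

def token_filter(tokens):
    """Token filters: lowercase every token, then remove stopwords."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]

def analyze(text):
    """Run the stages in order: char filter, then tokenizer, then token filters."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<p>The QUICK brown fox is fast</p>"))
# prints ['quick', 'brown', 'fox', 'fast']
```

The order matters: the stopword check only catches "The" because lowercasing runs before it.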

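Built-in Analyzers in ES

Standard Analyzer: the default analyzer; splits text into words and lowercases them
Simple Analyzer: splits on non-letter characters (symbols are dropped) and lowercases
Stop Analyzer: lowercases and applies a stopword filter (the, a, is, etc.)
Whitespace Analyzer: splits on whitespace and does not lowercase
Keyword Analyzer: does not tokenize; the input is emitted unchanged as a single term
Pattern Analyzer: splits on a regular expression, \W+ by default (runs of non-word characters)
Language: analyzers for more than 30 common languages
Custom Analyzer: a user-defined analyzer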
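
To make the differences between some of these analyzers concrete, here is a rough Python approximation of four of them applied to the same input. The function names are hypothetical and the regexes only mimic the behavior described in the list above; they are not the ES implementations.

```python
import re

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()

def simple_analyzer(text):
    # Keep runs of letters only (digits and symbols act as separators), then lowercase.
    return [run.lower() for run in re.findall(r"[^\W\d_]+", text)]

def keyword_analyzer(text):
    # No tokenization: the whole input becomes a single term.
    return [text]

def pattern_analyzer(text, pattern=r"\W+"):
    # Split on the regular expression, drop empty pieces, then lowercase.
    return [piece.lower() for piece in re.split(pattern, text) if piece]

sample = "The 2 QUICK Brown-Foxes"
print(whitespace_analyzer(sample))  # ['The', '2', 'QUICK', 'Brown-Foxes']
print(simple_analyzer(sample))      # ['the', 'quick', 'brown', 'foxes']
print(keyword_analyzer(sample))     # ['The 2 QUICK Brown-Foxes']
print(pattern_analyzer(sample))     # ['the', '2', 'quick', 'brown', 'foxes']
```

Note how the digit survives in the whitespace and pattern variants but is dropped by the simple variant, which treats anything that is not a letter as a separator.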

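Using an Analyzer

You can name an analyzer directly in a request to test how it tokenizes text.

For example, suppose we want to see how ES tokenizes a piece of text. The sample text here means "pedestrian, blue clothes, black trousers, wearing a hat".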

GET /_analyze
{
  "analyzer": "standard",
  "text":"行人,蓝色衣服,黑色裤子,带帽子"
}

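The response below shows the tokenization result. token is the term itself, start_offset and end_offset are the character offsets at which the term starts and ends in the original text, type is the token type (text, number, and so on), and position is the term's ordinal position in the token sequence. Note that the standard analyzer splits the Chinese text into single characters, and that the commas produce no tokens, which is why the offsets skip 2, 7, and 12.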

{
  "tokens" : [
    {
      "token" : "行",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "人",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "蓝",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "色",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "衣",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "服",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "黑",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "色",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "裤",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "子",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "带",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "帽",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "子",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    }
  ]
}
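
The offsets and positions in this response can be reproduced with a short Python sketch. It models only what is visible in this example (one token per Chinese character, punctuation skipped) and is not the standard tokenizer's actual algorithm; the function name and the choice of the CJK Unified Ideographs range U+4E00 to U+9FFF are assumptions.

```python
def tokenize_with_offsets(text):
    """Emit one token per CJK ideograph, tracking offsets and positions."""
    tokens = []
    position = 0  # counts emitted tokens only, so skipped commas leave no gap
    for offset, ch in enumerate(text):
        # Assumption: treat only the CJK Unified Ideographs block as ideographs.
        if "\u4e00" <= ch <= "\u9fff":
            tokens.append({
                "token": ch,
                "start_offset": offset,    # index of the character in the text
                "end_offset": offset + 1,  # exclusive end
                "type": "<IDEOGRAPHIC>",
                "position": position,
            })
            position += 1
    return tokens

for tok in tokenize_with_offsets("行人,蓝色衣服,黑色裤子,带帽子"):
    print(tok)
```

The third token comes out as 蓝 with start_offset 3 and position 2: the comma at offset 2 consumes an offset but not a position, which matches the response above.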