An Introduction to Elasticsearch Analyzers

程大帅气

Category: Elasticsearch. Tags: elasticsearch, search engine, java

Article link: https://blog.csdn.net/weixin_44692700/article/details/122074246

Analysis

Analysis (text analysis, also called tokenization) is the process of converting full text into a series of terms.

The component that performs analysis is called an analyzer. ES ships with many built-in analyzers, and we can also define custom analyzers as needed.

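An analyzer is applied in two places. At index time it splits the fields that need analysis into terms. At search time the query text normally has to be analyzed with the same analyzer, so that the query terms can match the indexed terms.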

For example:
The text "Elasticsearch is fun" is split by the analyzer into three terms: elasticsearch, is, and fun.
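To make the point about index-time and search-time analysis concrete, here is a minimal Python sketch of a toy inverted index. The functions analyze, build_index, and search are hypothetical names invented for this illustration; they are not part of any Elasticsearch API, and the toy analyzer only approximates what a real one does.

```python
import re
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Toy analyzer: split on non-word characters, then lowercase."""
    return [t.lower() for t in re.split(r"\W+", text) if t]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build a tiny inverted index: term -> ids of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str, analyzed: bool = True) -> set[int]:
    """Return ids of documents that contain every query term."""
    # With analyzed=False the query is only split on whitespace,
    # so its terms keep their original case.
    terms = analyze(query) if analyzed else query.split()
    hits = [index.get(term, set()) for term in terms]
    return set.intersection(*hits) if hits else set()


index = build_index({1: "Elasticsearch is fun", 2: "Lucene is fast"})
print(search(index, "ELASTICSEARCH"))                  # {1}
print(search(index, "ELASTICSEARCH", analyzed=False))  # set()
```

The second search finds nothing because the index only contains the lowercased term, which is why the query usually goes through the same analysis as the indexed text.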

Components of an Analyzer

An analyzer usually consists of three parts.

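Character Filters: process the raw text, for example by stripping HTML tags.
Tokenizer: splits the string into tokens according to a set of rules.
Token Filters: post-process the tokens, for example by changing case, removing stopwords, or adding synonyms.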
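The three stages can be sketched as a small pipeline in Python. This is only an illustration of the order in which the stages run; strip_html, word_tokenizer, lowercase, remove_stopwords, and the three-word STOPWORDS set are hypothetical stand-ins, not the real ES components.

```python
import re
from typing import Callable

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], list[str]]
TokenFilter = Callable[[list[str]], list[str]]

# Illustrative stop-word list only; real analyzers use a longer one.
STOPWORDS = {"a", "is", "the"}


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> list[str]:
    """Tokenizer: emit each run of word characters as one token."""
    return re.findall(r"\w+", text)


def lowercase(tokens: list[str]) -> list[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Token filter: drop tokens that appear in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text: str, char_filters: list[CharFilter], tokenizer: Tokenizer,
            token_filters: list[TokenFilter]) -> list[str]:
    """Run character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>Elasticsearch is <b>FUN</b></p>",
              [strip_html], word_tokenizer, [lowercase, remove_stopwords]))
# ['elasticsearch', 'fun']
```

The order of the token filters matters here: lowercasing runs first so that the stop-word check sees lowercase tokens.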

Built-in Analyzers in ES

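Standard Analyzer: the default analyzer; splits on word boundaries and lowercases
Simple Analyzer: splits on non-letter characters (symbols are discarded) and lowercases
Stop Analyzer: lowercases and applies a stop-word filter (the, a, is, etc.)
Whitespace Analyzer: splits on whitespace, does not lowercase
Keyword Analyzer: no tokenization; the input is emitted unchanged as the output
Pattern Analyzer: splits by regular expression; the default is \W+ (split on non-word characters)
Language: analyzers for more than 30 common languages
Custom Analyzer: a user-defined analyzer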

Using Analyzers

We can test tokenization by specifying an analyzer directly.

For example, suppose we want to see how ES tokenizes a piece of text.

GET /_analyze
{
  "analyzer": "standard",
  "text":"Elasticsearch is fun"
}

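The response below shows the result. token is the emitted term, start_offset and end_offset are the character offsets in the input text where the term starts and ends, type is the token type (alphanumeric, numeric, and so on), and position is the ordinal position of the term in the token stream.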

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "fun",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
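To show what the offsets mean, the following Python sketch slices the original text with start_offset and end_offset from the response above. The helper name original_slices is hypothetical, and the response is pasted in as a literal rather than fetched from a cluster. One assumption to keep in mind: Python indexes strings by code point, so for text containing characters outside the Basic Multilingual Plane the offsets may not line up.

```python
text = "Elasticsearch is fun"

# The _analyze response shown above, pasted in as a Python literal.
response = {
    "tokens": [
        {"token": "elasticsearch", "start_offset": 0, "end_offset": 13,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "is", "start_offset": 14, "end_offset": 16,
         "type": "<ALPHANUM>", "position": 1},
        {"token": "fun", "start_offset": 17, "end_offset": 20,
         "type": "<ALPHANUM>", "position": 2},
    ]
}


def original_slices(text: str, response: dict) -> list[tuple[int, str, str]]:
    """Pair each token with the slice of the input text it was produced from."""
    return [
        (tok["position"], tok["token"], text[tok["start_offset"]:tok["end_offset"]])
        for tok in response["tokens"]
    ]


for position, token, source in original_slices(text, response):
    print(position, token, repr(source))
# 0 elasticsearch 'Elasticsearch'
# 1 is 'is'
# 2 fun 'fun'
```

The token is the lowercased term, while the offsets still point at the original, unmodified text. This is what makes features such as highlighting possible.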

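We can also run the test against a field of an index, to see how that field analyzes text.

For example, to test the comment field of an index named index, send the following request:

POST /index/_analyze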
{
  "field":"comment",
  "text":"ES真好玩"
}

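The input text is tokenized with the analyzer configured for that field. Note that each Chinese character becomes a separate token.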

{
  "tokens" : [
    {
      "token" : "es",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "真",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "好",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "玩",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}

We can also assemble an ad hoc analyzer from a tokenizer and token filters and test it:

POST /_analyze
{
  "tokenizer":"standard",
  "filter":["lowercase"],
  "text":"Elasticsearch is FUN"
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "fun",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

A Closer Look at Several Analyzers

Standard Analyzer

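Standard Analyzer is the default analyzer in ES. It follows these rules:

Splits text on word boundaries
Lowercases tokens
Its stop filter (which removes stop words such as is and the) is disabled by default.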

GET /_analyze
{
  "analyzer": "standard",
  "text":" 1 Elasticsearch is FUN5."
}

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fun5",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

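As the output shows, the standard analyzer splits on word boundaries rather than on whitespace alone: the trailing period is dropped from FUN5. and the token is lowercased to fun5. It does not remove stop words such as is, and it keeps numbers.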

Simple Analyzer

Splits on non-letter characters; all non-letters are discarded
Lowercases tokens

GET /_analyze
{
  "analyzer": "simple",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "fun",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "asd",
      "start_offset" : 26,
      "end_offset" : 29,
      "type" : "word",
      "position" : 3
    }
  ]
}

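As the output shows, the simple analyzer splits the last word, FUN511asd, into fun and asd, and lowercases everything. This applies not only to digits: any non-letter character, including symbols, acts as a separator and is discarded, which is also why the leading 1 disappears.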
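The behaviour above can be approximated in a few lines of Python. This is a rough emulation for illustration, not how ES implements the simple analyzer, and simple_analyze is a hypothetical name.

```python
import re


def simple_analyze(text: str) -> list[tuple[str, int, int]]:
    """Emit each run of letters as a lowercased token with its start and end offsets."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore,
    # which leaves letters only.
    return [
        (match.group().lower(), match.start(), match.end())
        for match in re.finditer(r"[^\W\d_]+", text)
    ]


print(simple_analyze(" 1 Elasticsearch is FUN511asd."))
# [('elasticsearch', 3, 16), ('is', 17, 19), ('fun', 20, 23), ('asd', 26, 29)]
```

The tokens and offsets agree with the response shown above for the same input.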

Stop Analyzer

Splits on non-letter characters; all non-letters are discarded
Lowercases tokens
Adds a stop filter that removes stop words such as is, a, and the

GET /_analyze
{
  "analyzer": "stop",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "fun",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "asd",
      "start_offset" : 26,
      "end_offset" : 29,
      "type" : "word",
      "position" : 3
    }
  ]
}

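As the output shows, the stop analyzer does everything the simple analyzer does and also removes stop words such as is. The positions of the remaining tokens are unchanged (0, 2, 3), so the removed word leaves a gap at position 1.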
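A minimal Python sketch of the same idea: tokenize on letters, lowercase, then drop stop words while keeping each token's original position. The name stop_analyze and the four-word STOPWORDS set are assumptions made for this example; the real default English stop-word list is longer.

```python
import re

# Illustrative subset only, not the actual default stop-word list.
STOPWORDS = {"a", "an", "is", "the"}


def stop_analyze(text: str) -> list[tuple[str, int]]:
    """Return (token, position) pairs, skipping stop words but keeping positions."""
    tokens = [m.group().lower() for m in re.finditer(r"[^\W\d_]+", text)]
    # Positions are assigned before filtering, so removed words leave gaps.
    return [(tok, pos) for pos, tok in enumerate(tokens) if tok not in STOPWORDS]


print(stop_analyze(" 1 Elasticsearch is FUN511asd."))
# [('elasticsearch', 0), ('fun', 2), ('asd', 3)]
```

Keeping the gap matters for phrase queries, which rely on token positions rather than on the order of the surviving tokens.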

Whitespace Analyzer

Splits on whitespace only; tokens are not lowercased and punctuation is kept

GET /_analyze
{
  "analyzer": "whitespace",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "FUN511asd.",
      "start_offset" : 20,
      "end_offset" : 30,
      "type" : "word",
      "position" : 3
    }
  ]
}

Keyword Analyzer

No tokenization; the entire input is emitted as a single term

GET /_analyze
{
  "analyzer": "keyword",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : " 1 Elasticsearch is FUN511asd.",
      "start_offset" : 0,
      "end_offset" : 30,
      "type" : "word",
      "position" : 0
    }
  ]
}
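The whitespace and keyword analyzers are simple enough to mimic in Python with one line each. The function names below are hypothetical, and the sketch only reproduces the token text, not the offsets or types.

```python
def whitespace_analyze(text: str) -> list[str]:
    """Split on runs of whitespace; case and punctuation are left untouched."""
    return text.split()


def keyword_analyze(text: str) -> list[str]:
    """Emit the whole input, including leading spaces, as a single term."""
    return [text]


sample = " 1 Elasticsearch is FUN511asd."
print(whitespace_analyze(sample))  # ['1', 'Elasticsearch', 'is', 'FUN511asd.']
print(keyword_analyze(sample))     # [' 1 Elasticsearch is FUN511asd.']
```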

Pattern Analyzer

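Tokenizes using a regular expression
The default pattern is \W+, which splits on runs of non-word characters; tokens are lowercased by default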

GET /_analyze
{
  "analyzer": "pattern",
  "text":" 1 Elasticsearch is FUN511asd-a."
}

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fun511asd",
      "start_offset" : 20,
      "end_offset" : 29,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 30,
      "end_offset" : 31,
      "type" : "word",
      "position" : 4
    }
  ]
}

As the output shows, FUN511asd-a is split into fun511asd and a, because the hyphen is a non-word character.
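The same splitting can be tried out with Python's re module. This is a sketch under assumptions: pattern_analyze is a hypothetical helper, and Python's \W is Unicode-aware by default, so results may differ from the Java regex engine for non-ASCII text.

```python
import re


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> list[str]:
    """Split text on the given separator pattern and drop empty tokens."""
    if lowercase:
        text = text.lower()
    # re.split yields empty strings when the text starts or ends with a separator.
    return [token for token in re.split(pattern, text) if token]


print(pattern_analyze(" 1 Elasticsearch is FUN511asd-a."))
# ['1', 'elasticsearch', 'is', 'fun511asd', 'a']

# A custom separator pattern: split on commas and semicolons only.
print(pattern_analyze("a,b;c", pattern=r"[,;]"))
# ['a', 'b', 'c']
```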

Language Analyzer

You can pick an analyzer for a specific language, such as english.

GET /_analyze
{
  "analyzer": "english",
  "text":"ES真是太好玩了，Elasticsearch is FUN-fun"
}

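Chinese, however, poses some particular difficulties for tokenization:

A sentence should be split into words, not into individual characters.
English words are separated by spaces; Chinese words are not.
The same Chinese sentence can mean different things in different contexts.
Several Chinese sentences may express the same meaning yet be tokenized differently.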
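To illustrate why splitting Chinese into words needs a dictionary, and why the result can be ambiguous, here is a minimal Python sketch of forward maximum matching. It is a classic textbook technique chosen for illustration only; it is not a description of how any particular ES plugin works, and forward_max_match and both tiny dictionaries are made up for this example.

```python
def forward_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedily take the longest dictionary word at each position.

    Falls back to a single character when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


print(forward_max_match("ES真好玩", {"ES", "好玩"}))
# ['ES', '真', '好玩']

# Adding one more dictionary entry changes where the words are cut.
print(forward_max_match("ES真好玩", {"ES", "好玩", "真好"}))
# ['ES', '真好', '玩']
```

The two runs differ only in the dictionary, yet they produce different tokens for the same sentence. Dedicated Chinese analyzers exist to handle this kind of ambiguity.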

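We can install a dedicated Chinese analyzer, for example:
ICU Analyzer
IK: supports custom dictionaries and hot-reloading of the tokenization dictionary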
THULAC