4. Analyzers in Elasticsearch

码农的进阶之路

Column: Elastic Stack学习之旅 (Elastic Stack Learning Journey) | Tags: ik Analyzer analyzer _analyze

Article link: https://blog.csdn.net/zyxwvuuvwxyz/article/details/108678970

Table of contents

- 1. Analysis and Analyzer
- 2. Elasticsearch's built-in analyzers
  - 2.1 Standard Analyzer
  - 2.2 Simple Analyzer
  - 2.3 Stop Analyzer
  - 2.4 Whitespace Analyzer
  - 2.5 Keyword Analyzer
  - 2.6 Pattern Analyzer
  - 2.7 English Analyzer
  - 2.8 Chinese tokenization
  - 2.9 Custom analyzers

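1. Analysis and Analyzer

Analysis (text analysis) is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is carried out by an analyzer. You can use one of Elasticsearch's built-in analyzers or define a custom one as needed.

Analyzers are used not only at index time, when incoming documents are turned into terms, but also at search time: the query string should be analyzed with the same analyzer so that the query terms can match the indexed terms.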
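
To see why the same analyzer matters on both sides, here is a minimal Python sketch of a toy inverted index. It does not use Elasticsearch at all; `toy_analyzer`, `build_index`, and `search` are hypothetical names invented for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def toy_analyzer(text: str) -> List[str]:
    """Toy analyzer (an assumption for this sketch): lowercase, then split on whitespace."""
    return text.lower().split()


def build_index(docs: Dict[int, str], analyzer: Callable[[str], List[str]]) -> Dict[str, Set[int]]:
    """Index time: analyze each document and record which documents contain each term."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str, analyzer: Callable[[str], List[str]]) -> Set[int]:
    """Search time: analyze the query, then keep documents that contain every query term."""
    terms = analyzer(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {1: "Mastering Elasticsearch", 2: "Learning Kibana"}
index = build_index(docs, toy_analyzer)

# Same analyzer on both sides: the query is lowercased and finds document 1.
same_analyzer_hits = search(index, "ELASTICSEARCH", toy_analyzer)
# Different analyzer (no lowercasing): the query term never matches an indexed term.
other_analyzer_hits = search(index, "ELASTICSEARCH", str.split)
print(same_analyzer_hits, other_analyzer_hits)
```

The index only ever stores lowercase terms, so a query that skips the lowercasing step looks up a term that was never indexed.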

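An analyzer consists of three parts:

Character Filters: preprocess the raw text, for example by stripping HTML tags
Tokenizer: splits the text into tokens according to a set of rules
Token Filters: post-process the tokens, for example lowercasing them, removing stop words, or adding synonyms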
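
The three stages run in that order, each feeding the next. The following is a rough Python imitation of that pipeline, not Elasticsearch's implementation; every function name here is hypothetical, and the stop-word set is a tiny illustrative subset.

```python
import re
from typing import Callable, List


def strip_html(text: str) -> str:
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Token filter: drop a few stop words (illustrative subset only)."""
    stopwords = {"the", "a", "is", "in"}
    return [token for token in tokens if token not in stopwords]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


pipeline_tokens = analyze(
    "<p>The Quick fox is in the box</p>",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(pipeline_tokens)  # ['quick', 'fox', 'box']
```

Order matters: the stop filter here only recognizes lowercase words, so it has to run after the lowercase filter.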

2. Elasticsearch's built-in analyzers

Prerequisite: the _analyze API
For example, to tokenize text with the default (standard) analyzer:

POST _analyze
{
  "text": ["I'm studing now"],
  "analyzer": "standard"
}
## Response
{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "studing",
      "start_offset" : 4,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "now",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

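2.1 Standard Analyzer

The default analyzer. It splits text on word boundaries and lowercases the tokens.

Tokenizer: standard
Token Filters:
- standard (a no-op filter in older releases; removed in Elasticsearch 7.0)
- lowercase
- stop (disabled by default)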

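2.2 Simple Analyzer

Splits text at every non-letter character and discards the non-letters, so digits and punctuation never appear in the output.
Lowercases the tokens.
Tokenizer: lowercase

Example: tokenizing "I'm studying 11"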

POST _analyze
{
  "text": ["I'm studying 11"],
  "analyzer": "simple"
}
## Response
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "studying",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}
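
The behavior described for the Simple Analyzer can be approximated in a few lines of Python. This is only a sketch of the rule "keep runs of letters, lowercase them"; `simple_analyze` is a hypothetical helper, not an Elasticsearch API.

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    """Keep runs of letters and lowercase them; digits and punctuation act as separators."""
    # [^\W\d_] means "a word character that is neither a digit nor an underscore", i.e. a letter.
    return [match.group().lower() for match in re.finditer(r"[^\W\d_]+", text)]


simple_tokens = simple_analyze("I'm studying 11")
print(simple_tokens)  # ['i', 'm', 'studying']
```

The apostrophe splits I'm into two tokens, and 11 disappears because it contains no letters.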

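2.3 Stop Analyzer

The same as the Simple Analyzer plus a stop token filter, which removes stop words such as the, a, is, and in. Removed stop words still occupy a position, which is why room is reported at position 5 in the response.

Tokenizer: lowercase
Token Filters: stop

Example: tokenizing "I'm studying in the room"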

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "stop"
}
## Response
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "m",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "studying",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "room",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "word",
      "position" : 5
    }
  ]
}
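
The gap in the position numbers is easier to see in code. The sketch below imitates the Stop Analyzer in Python under two assumptions: `stop_analyze` is a hypothetical helper, and `STOPWORDS` is a small illustrative subset rather than the real default stop-word list.

```python
import re
from typing import Dict, List, Union

# Illustrative subset only; the real default list is longer.
STOPWORDS = {"a", "an", "the", "is", "in"}


def stop_analyze(text: str) -> List[Dict[str, Union[str, int]]]:
    """Split on non-letters, lowercase, and drop stop words while keeping their positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        token = match.group().lower()
        if token in STOPWORDS:
            continue  # the stop word is dropped, but its position number is still used up
        tokens.append(
            {
                "token": token,
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            }
        )
    return tokens


stop_tokens = stop_analyze("I'm studying in the room")
for entry in stop_tokens:
    print(entry)
```

Here in and the take positions 3 and 4 before being dropped, so room ends up at position 5 with offsets 20 to 24.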

2.4 Whitespace Analyzer

Splits text on whitespace only; case and punctuation are left untouched.
Tokenizer: whitespace

Example: tokenizing "I'm studying in the room"

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "whitespace"
}
## Response
{
  "tokens" : [
    {
      "token" : "I'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "studying",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "in",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "the",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "room",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "word",
      "position" : 4
    }
  ]
}

2.5 Keyword Analyzer

Does not tokenize: the entire input is emitted as a single token.
Tokenizer: keyword

Example: tokenizing "I'm studying in the room"

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "keyword"
}
## Response
{
  "tokens" : [
    {
      "token" : "I'm studying in the room",
      "start_offset" : 0,
      "end_offset" : 24,
      "type" : "word",
      "position" : 0
    }
  ]
}

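2.6 Pattern Analyzer

Splits text using a regular expression.
The default pattern is \W+, which splits on runs of non-word characters.
Tokenizer: pattern
Token Filters: lowercase, stop (disabled by default)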
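
As a rough illustration of splitting on \W+, here is a Python sketch. `pattern_analyze` is a hypothetical helper. Python's re module and the Java regex engine used by Elasticsearch differ in details (for example, in how \W treats non-ASCII characters), so this only approximates the behavior for plain ASCII input.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    """Split text wherever the pattern matches, drop empty pieces, optionally lowercase."""
    tokens = [piece for piece in re.split(pattern, text) if piece]
    return [token.lower() for token in tokens] if lowercase else tokens


default_tokens = pattern_analyze("I'm studying in the room")
comma_tokens = pattern_analyze("Foo,Bar,baz", pattern=",")
print(default_tokens)  # ['i', 'm', 'studying', 'in', 'the', 'room']
print(comma_tokens)    # ['foo', 'bar', 'baz']
```

The second call shows the point of a pattern-based analyzer: swapping the expression changes where the text is split, for instance on commas only.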

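2.7 English Analyzer

A language-specific analyzer for English text. As the response shows, it lowercases the tokens, removes English stop words (in, the), and stems words (studying becomes studi).
Example: tokenizing "I'm studying in the room"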

POST _analyze
{
  "text": ["I'm studying in the room"],
  "analyzer": "english"
}
## Response
{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "studi",
      "start_offset" : 4,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "room",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

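2.8 Chinese tokenization

Why Chinese tokenization is hard:

A Chinese sentence has to be split into words, not into individual characters, and there are no spaces marking word boundaries
The same sentence can be read differently depending on context

Commonly used Chinese analyzers:

ICU Analyzer: provides Unicode support and handles Asian languages better
IK: supports custom dictionaries and hot-reloading of the dictionary
THULAC

This article uses the IK plugin for Chinese tokenization. How to install the plugin is covered in detail in the next section, so it is not repeated here.

The IK plugin provides two analyzers: ik_smart and ik_max_word.

Example: tokenizing "这个苹果不大好吃" with ik_max_word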

POST _analyze
{
  "text": ["这个苹果不大好吃"],
  "analyzer": "ik_max_word"
}
## Response
{
  "tokens" : [
    {
      "token" : "这个",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "苹果",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "不大好",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "不大",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "大好",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "好吃",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
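
One way to picture why ik_max_word returns overlapping tokens is exhaustive dictionary lookup: emit every substring that is a known word. The Python sketch below does exactly that with a toy dictionary. It is not IK's actual algorithm, and both `TOY_DICT` and `max_word_segment` are assumptions made up for this illustration.

```python
from typing import List, Set, Tuple

# Assumed mini-dictionary; a real IK dictionary is far larger.
TOY_DICT = {"这个", "苹果", "不大好", "不大", "大好", "好吃"}


def max_word_segment(text: str, dictionary: Set[str]) -> List[Tuple[str, int, int]]:
    """Return every dictionary word found in text as (word, start_offset, end_offset)."""
    longest = max(len(word) for word in dictionary)
    found = []
    for start in range(len(text)):
        # Try longer candidates first so that longer words at the same offset come first.
        for length in range(min(longest, len(text) - start), 0, -1):
            candidate = text[start:start + length]
            if candidate in dictionary:
                found.append((candidate, start, start + length))
    return found


segments = max_word_segment("这个苹果不大好吃", TOY_DICT)
for word, start, end in segments:
    print(word, start, end)
```

With this toy dictionary the sketch yields 这个, 苹果, 不大好, 不大, 大好, 好吃, with overlapping spans such as 4 to 7 and 5 to 7. A coarser strategy would have to pick one non-overlapping reading instead, which is where context and ambiguity come in.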

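2.9 Custom analyzers

My understanding: you assemble your own combination of Character Filters, Tokenizer, and Token Filters to meet your needs.
Requirement, for example: "do not split the text; output it as a single token in lowercase"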

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": ["Mastering Elasticsearch"]
}
## Response
{
  "tokens" : [
    {
      "token" : "mastering elasticsearch",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
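
The combination used in this request, a keyword tokenizer followed by a lowercase filter, can be mimicked in Python to check what it should produce. This is a sketch with hypothetical helper names, not Elasticsearch code.

```python
from typing import List


def keyword_tokenizer(text: str) -> List[str]:
    """Tokenizer: emit the whole input as one token."""
    return [text]


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


custom_tokens = lowercase_filter(keyword_tokenizer("Mastering Elasticsearch"))
print(custom_tokens)  # ['mastering elasticsearch']
```

The input stays in one piece because the tokenizer never splits it, and the single token is lowercased by the filter that runs afterwards.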