Elasticsearch Study Notes 4: Analyzers

tyw15

Article link: https://blog.csdn.net/tyw15/article/details/107613278

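Topic 1: Installing analyzer plugins

How to check which plugins Elasticsearch already has installed

In a browser, open http://<ES IP address>/_cat/plugins (add the HTTP port if needed; the default is 9200).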
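The same check can be scripted instead of done in a browser. The following is a minimal sketch, assuming the default plain-text output of `_cat/plugins` (one line per node and plugin, with whitespace-separated name, component, and version columns). The function names, the node name `node-1`, and the version numbers in the sample are made up for illustration.

```python
from urllib.request import urlopen


def parse_cat_plugins(text):
    """Parse `_cat/plugins` text output into (node, plugin, version) tuples."""
    plugins = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the header row that appears when `?v` is used.
        if len(parts) < 3 or parts[:3] == ["name", "component", "version"]:
            continue
        plugins.append((parts[0], parts[1], parts[2]))
    return plugins


def fetch_plugins(base_url):
    """Fetch and parse the plugin list from a running cluster (not called here)."""
    with urlopen(base_url.rstrip("/") + "/_cat/plugins") as resp:
        return parse_cat_plugins(resp.read().decode("utf-8"))


# Hypothetical sample output, for illustration only.
sample = "node-1 analysis-icu 7.8.0\nnode-1 analysis-ik 7.8.0\n"
for node, plugin, version in parse_cat_plugins(sample):
    print(node, plugin, version)
```

Against a real cluster, `fetch_plugins("http://localhost:9200")` would be the entry point, assuming the cluster is reachable without authentication.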

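To install an analyzer plugin, download the release that matches your Elasticsearch version, unzip it into the plugins directory, and restart Elasticsearch.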

analysis-icu analyzer

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html

IK analyzer

https://github.com/medcl/elasticsearch-analysis-ik/releases

https://www.cnblogs.com/dgwblog/p/12374212.html

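Topic 2: Tokenizing text with an Analyzer

Analysis (text analysis) is the process of converting full text into a series of words (terms/tokens); it is also called tokenization. In Elasticsearch, analysis can be done with the built-in analyzers, or with a custom analyzer defined as needed.

An Analyzer consists of three parts:

• Character Filters: preprocess the raw text, for example by stripping HTML
• Tokenizer: splits the text into words according to a set of rules
• Token Filters: post-process the resulting words, for example lowercasing them, removing stopwords, or adding synonyms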
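To make the three stages concrete, here is a toy pipeline in plain Python. It is only a sketch of the idea, not how Elasticsearch implements analysis: the regular expressions, the function names, and the five-word stop list are all assumptions chosen for illustration.

```python
import re

# Illustrative stop list only; real analyzers ship with longer lists.
STOPWORDS = {"the", "is", "a", "in", "to"}


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens found in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze(
    "<p>The Quick fox is in a box</p>",
    char_filters=[strip_html],
    tokenizer=tokenize,
    token_filters=[lowercase, remove_stopwords],
)
print(result)  # → ['quick', 'fox', 'box']
```

The order of the token filters matters in this sketch: lowercasing runs first so that "The" matches the lowercase stop list.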

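Analyzer API

There are three ways to see how an Analyzer processes text:
• Test by specifying an Analyzer directly
• Test against a field of an index
• Test with a custom combination of analyzer components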
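The three approaches correspond to three shapes of `_analyze` request. The sketch below only builds the request line and JSON body for each one and prints them in Console form; it does not contact a cluster. The helper names, the index name `my_index`, and the field name `title` are hypothetical.

```python
import json


def analyze_with_analyzer(analyzer, text):
    """Way 1: name an analyzer directly."""
    return "GET", "/_analyze", {"analyzer": analyzer, "text": text}


def analyze_with_field(index, field, text):
    """Way 2: use whatever analyzer is configured for a field of an index."""
    return "POST", f"/{index}/_analyze", {"field": field, "text": text}


def analyze_with_custom(tokenizer, filters, text):
    """Way 3: combine a tokenizer and token filters ad hoc."""
    body = {"tokenizer": tokenizer, "filter": filters, "text": text}
    return "POST", "/_analyze", body


def to_console(method, path, body):
    """Render a request the way it would be typed into the Console."""
    return f"{method} {path}\n{json.dumps(body, ensure_ascii=False, indent=2)}"


requests = [
    analyze_with_analyzer("standard", "Mastering Elasticsearch"),
    analyze_with_field("my_index", "title", "Mastering Elasticsearch"),
    analyze_with_custom("standard", ["lowercase"], "Mastering Elasticsearch"),
]
for method, path, body in requests:
    print(to_console(method, path, body))
    print()
```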

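Elasticsearch built-in analyzers

Stop Analyzer: the Simple Analyzer plus stopword filtering (function words such as the, is, a, in, to)

Language analyzers: tokenize according to the characteristics of a language. For English, for example, this is the Stop Analyzer plus word normalization (singular/plural forms and so on).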
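The relationship between the Simple and Stop analyzers can be approximated in a few lines of Python. This is a rough stand-in, assuming the Simple Analyzer splits on anything that is not a letter and lowercases the result; the stop list contains only the five words mentioned above, and the function names are invented for this example.

```python
import re

# Only the stopwords named in the text; the real default list is longer.
STOPWORDS = {"the", "is", "a", "in", "to"}


def simple_analyze(text):
    """Approximate the Simple Analyzer: split on non-letters, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text):
    """Approximate the Stop Analyzer: Simple Analyzer output minus stopwords."""
    return [t for t in simple_analyze(text) if t not in STOPWORDS]


text = "The 2 QUICK Brown-Foxes jumped to the lake"
print(simple_analyze(text))
# → ['the', 'quick', 'brown', 'foxes', 'jumped', 'to', 'the', 'lake']
print(stop_analyze(text))
# → ['quick', 'brown', 'foxes', 'jumped', 'lake']
```

A language analyzer would go one step further and reduce words such as "foxes" to a base form; that step is left out here because it depends on language-specific rules.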

IK analyzer vs. the official ICU analyzer

GET /_analyze
{
  "analyzer": "icu_analyzer",
  "text": "长风破浪会有时，直挂云帆济沧海"
}

{
  "tokens" : [
    {
      "token" : "长风破浪",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "会",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "有时",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "直",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "挂",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "云",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "帆",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "济",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "沧海",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    }
  ]
}

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "长风破浪会有时，直挂云帆济沧海"
}

{
  "tokens" : [
    {
      "token" : "长风破浪",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "长风",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "破浪",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "会有",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "有时",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "直",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "挂",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "云",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "帆",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "济",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_CHAR",
      "position" : 9
    },
    {
      "token" : "沧海",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 10
    }
  ]
}
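One way to read the two responses side by side is to compare the token sets and look for tokens whose character ranges overlap. The sketch below uses the tokens and offsets copied from the two responses; the variable and function names are made up for this example.

```python
# (token, start_offset, end_offset) copied from the icu_analyzer response.
ICU = [
    ("长风破浪", 0, 4), ("会", 4, 5), ("有时", 5, 7), ("直", 8, 9), ("挂", 9, 10),
    ("云", 10, 11), ("帆", 11, 12), ("济", 12, 13), ("沧海", 13, 15),
]

# (token, start_offset, end_offset) copied from the ik_max_word response.
IK = [
    ("长风破浪", 0, 4), ("长风", 0, 2), ("破浪", 2, 4), ("会有", 4, 6), ("有时", 5, 7),
    ("直", 8, 9), ("挂", 9, 10), ("云", 10, 11), ("帆", 11, 12), ("济", 12, 13),
    ("沧海", 13, 15),
]


def overlapping_pairs(tokens):
    """Return pairs of tokens whose [start, end) character ranges overlap."""
    pairs = []
    for i, (tok_a, start_a, end_a) in enumerate(tokens):
        for tok_b, start_b, end_b in tokens[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


icu_terms = {tok for tok, _, _ in ICU}
ik_terms = {tok for tok, _, _ in IK}

print("only in ik_max_word:", sorted(ik_terms - icu_terms))
print("only in icu_analyzer:", sorted(icu_terms - ik_terms))
print("overlaps in icu_analyzer:", overlapping_pairs(ICU))
print("overlaps in ik_max_word:", overlapping_pairs(IK))
```

For this sentence, icu_analyzer produces a single non-overlapping segmentation, while ik_max_word also emits the sub-words 长风 and 破浪 and the overlapping pair 会有 / 有时, so it indexes more candidate terms for the same text.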
