Analysis pipeline
- Character Filters: pre-process the raw text, e.g. stripping HTML tags
- Tokenizer: splits the text into tokens according to rules, e.g. on whitespace
- Token Filters: post-process the tokens, e.g. lowercasing
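The three stages above can be sketched in Python. This is only an illustration of the pipeline order (the function names are hypothetical, and it assumes an HTML-stripping character filter, a whitespace tokenizer, and a lowercase token filter), not how Elasticsearch implements it:

```python
import re

def char_filter(text):
    # Character filter: strip HTML-like tags before tokenization
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split on whitespace
    return text.split()

def token_filters(tokens):
    # Token filter: lowercase each token
    return [t.lower() for t in tokens]

def analyze(text):
    # The stages always run in this order: char filters -> tokenizer -> token filters
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Quick</b> Brown FOX"))  # ['quick', 'brown', 'fox']
```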
Built-in analyzers
Use GET /_analyze to inspect how a given analyzer tokenizes text.
Standard
The default analyzer: splits on word boundaries and lowercases tokens.
GET /_analyze
{
"analyzer": "standard",
"text":"1 A 2 b"
}
# result
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "a",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
Simple
Splits on any non-letter character, so digits and symbols are dropped (Chinese characters count as letters); tokens are lowercased (Chinese characters are unchanged by lowercasing).
GET /_analyze
{
"analyzer": "simple",
"text":"1 好 ef A-B 2 c"
}
# result
{
"tokens" : [
{
"token" : "好",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 4
}
]
}
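The simple analyzer's behavior can be approximated in Python: keep runs of letters and lowercase everything. This is an illustrative sketch matching the output above, not the actual Elasticsearch implementation:

```python
import re

def simple_analyze(text):
    # [^\W\d_]+ matches runs of letters only: digits, punctuation and
    # whitespace act as delimiters. Python's \w is Unicode-aware, so
    # CJK characters like 好 count as letters and survive.
    return re.findall(r"[^\W\d_]+", text.lower())

print(simple_analyze("1 好 ef A-B 2 c"))  # ['好', 'ef', 'a', 'b', 'c']
```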
Stop
Builds on Simple by also filtering out stop words (the, a, is, etc.).
GET /_analyze
{
"analyzer": "stop",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "好",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "b",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 6
}
]
}
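The stop analyzer is the same tokenization followed by a stop-word filter. A sketch (the stop list here is a small subset of the default English one, just enough for this example):

```python
import re

STOPWORDS = {"the", "a", "is"}  # small subset of the default English stop list

def stop_analyze(text):
    # Same letter-run tokenization as the simple analyzer sketch,
    # then drop any token found in the stop list.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(stop_analyze("1 好 ef A-B is the 2 c"))  # ['好', 'ef', 'b', 'c']
```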
Whitespace
Splits on whitespace only; does not lowercase.
GET /_analyze
{
"analyzer": "whitespace",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "好",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : "A-B",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "the",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 5
},
{
"token" : "2",
"start_offset" : 19,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "c",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 7
}
]
}
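Whitespace tokenization maps directly onto Python's str.split, with no lowercasing, which reproduces the token list above:

```python
text = "1 好 ef A-B is the 2 c"
# split() with no argument splits on runs of whitespace and keeps case
print(text.split())  # ['1', '好', 'ef', 'A-B', 'is', 'the', '2', 'c']
```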
Keyword
No tokenization: the entire input is emitted as a single token.
GET /_analyze
{
"analyzer": "keyword",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "1 好 ef A-B is the 2 c",
"start_offset" : 0,
"end_offset" : 22,
"type" : "word",
"position" : 0
}
]
}
Pattern
Splits on a regular expression; the default pattern is \W+ (any run of non-word characters). Note that 好 is dropped below: the underlying Java regex \w only matches [A-Za-z0-9_] by default, so CJK characters act as delimiters.
GET /_analyze
{
"analyzer": "pattern:\d+",
"text":"1 好 ef A-B is the 2 c"
}
# result
{
"tokens" : [
{
"token" : "1",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "ef",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 3
},
{
"token" : "is",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "the",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 5
},
{
"token" : "2",
"start_offset" : 19,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "c",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 7
}
]
}
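The default \W+ split can be reproduced with Python's re module, using re.ASCII so that \w means [A-Za-z0-9_] as in Java's default (which is why 好 is treated as a delimiter and dropped). Again a sketch, not the ES implementation:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the pattern after lowercasing; re.ASCII mimics Java's
    # ASCII-only \w, so CJK characters are delimiters, not token characters.
    # The filter drops empty strings produced at the edges of the split.
    return [t for t in re.split(pattern, text.lower(), flags=re.ASCII) if t]

print(pattern_analyze("1 好 ef A-B is the 2 c"))
# ['1', 'ef', 'a', 'b', 'is', 'the', '2', 'c']
```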
Chinese analysis
Pick the release that matches your Elasticsearch version from https://github.com/medcl/elasticsearch-analysis-ik. My current ES version is v7.1.0, so I install the 7.1.0 plugin as well:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip
If you run an ES cluster, install the IK plugin on every node; otherwise Kibana may behave abnormally. IK provides two analyzers: ik_smart (coarse-grained) and ik_max_word (fine-grained).
GET /_analyze
{
"analyzer": "ik_smart",
"text":"苹果电脑是比较适合程序员的电脑"
}
# result
{
"tokens" : [
{
"token" : "苹果电脑",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "是",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "比较",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "适合",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "程序员",
"start_offset" : 9,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "的",
"start_offset" : 12,
"end_offset" : 13,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "电脑",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 6
}
]
}