es - Elasticsearch custom analyzers - built-in token filters - 2

Category: stack - es (version_7.10.1)  Tags: es

Article link: https://blog.csdn.net/a13662080711/article/details/113306906

No program in the world is perfect, but that does not discourage us, because programming is a continual pursuit of perfection.

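Custom analyzer :

Character filters :
    1. Purpose : add, remove, or change characters before tokenization
    2. Count : zero or more
    3. Built-in character filters :
        1. HTML Strip Character filter : strips HTML tags
        2. Mapping Character filter : replaces characters using a mapping
        3. Pattern Replace Character filter : replaces characters using a regular expression
Tokenizer :
    1. Purpose :
        1. Splits the text into tokens
        2. Records the order and position of each token (used by phrase queries)
        3. Records the start and end character offsets of each token (used by highlighting)
        4. Records the type of each token (classification)
    2. Count : exactly one
    3. Categories :
        1. Word-oriented :
            1. Standard
            2. Letter
            3. Lowercase
            4. Whitespace
            5. UAX URL Email
            6. Classic
            7. Thai
        2. Partial-word :
            1. N-Gram
            2. Edge N-Gram
        3. Structured text :
            1. Keyword
            2. Pattern
            3. Simple Pattern
            4. Char Group
            5. Simple Pattern Split
            6. Path Hierarchy
Token filters :
    1. Purpose : add, remove, or change tokens
    2. Count : zero or more
    3. Built-in token filters (partial list) :
        1. apostrophe
        2. asciifolding
        3. cjk bigram
        4. cjk width
        5. classic
        6. common grams
        7. conditional
        8. decimal digit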
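
To make the three building blocks concrete, here is a minimal sketch in Python that assembles the request body for an index with a custom analyzer. The names `my_custom_analyzer` and `my_common_grams` are hypothetical, and the particular combination of character filter, tokenizer, and token filters is only an illustration, not a recommendation. The printed JSON is what would be sent as the body of a `PUT /<index-name>` request.

```python
import json


def build_custom_analyzer_settings():
    """Return an index-creation body that defines one custom analyzer.

    The analyzer and filter names are hypothetical placeholders.
    """
    return {
        "settings": {
            "analysis": {
                # A configured token filter, referenced by name below.
                "filter": {
                    "my_common_grams": {
                        "type": "common_grams",
                        "common_words": ["是", "的", "is"],
                        "ignore_case": True,
                    }
                },
                "analyzer": {
                    "my_custom_analyzer": {
                        "type": "custom",
                        # Zero or more character filters.
                        "char_filter": ["html_strip"],
                        # Exactly one tokenizer.
                        "tokenizer": "standard",
                        # Zero or more token filters, applied in order.
                        "filter": ["lowercase", "my_common_grams"],
                    }
                },
            }
        }
    }


body = build_custom_analyzer_settings()
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The token filters run in the order listed, so here `lowercase` runs before the common grams filter.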

Of the filters demonstrated today, pay particular attention to:

common grams token filter
conditional token filter

# classic token filter
# Purpose :
#   1. Removes the English possessive ('s) from the end of a token
#   2. Removes the dots from acronyms
# Use with : the classic tokenizer
GET /_analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": ["hello this is hi's good H.J.K.M. Q.U.I.C.K.  "]
}

# Result
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "this",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "hi",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<APOSTROPHE>",
      "position" : 3
    },
    {
      "token" : "good",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "HJKM",
      "start_offset" : 24,
      "end_offset" : 32,
      "type" : "<ACRONYM>",
      "position" : 5
    },
    {
      "token" : "QUICK",
      "start_offset" : 33,
      "end_offset" : 43,
      "type" : "<ACRONYM>",
      "position" : 6
    }
  ]
}
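
The two rules can be emulated in a few lines of Python. This is a simplified sketch of the behaviour seen in the result above, not the Lucene implementation: the function name `classic_filter` is hypothetical, and it assumes the tokens arrive already tagged with the types assigned by the classic tokenizer.

```python
def classic_filter(tokens):
    """Emulate the classic token filter on (text, type) pairs."""
    out = []
    for text, token_type in tokens:
        if (
            token_type == "<APOSTROPHE>"
            and len(text) >= 2
            and text[-2] == "'"
            and text[-1] in "sS"
        ):
            # Rule 1: drop a trailing possessive 's (or 'S).
            text = text[:-2]
        elif token_type == "<ACRONYM>":
            # Rule 2: drop the dots inside an acronym.
            text = text.replace(".", "")
        out.append((text, token_type))
    return out


classic_tokens = [
    ("hello", "<ALPHANUM>"),
    ("hi's", "<APOSTROPHE>"),
    ("H.J.K.M.", "<ACRONYM>"),
    ("Q.U.I.C.K.", "<ACRONYM>"),
]
filtered = classic_filter(classic_tokens)
print([text for text, _ in filtered])
```

Because both rules key off the token type, the filter has no effect on tokens from a tokenizer that does not assign these types, which is why it is paired with the classic tokenizer.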

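# common grams token filter
# Purpose :
#   1. Generates bigrams that join each specified common word with the token before it and the token after it
#   2. Keeps common words usable in phrase searches, avoiding the loss caused by removing them as stop words
# Options :
#   1. common_words      : list of the common words to combine (either this or common_words_path is required)
#   2. common_words_path : path to a file that lists the common words
#   3. ignore_case       : match common words case-insensitively, default false
#   4. query_mode        : default false, which also emits the single tokens; when true, the single tokens
#                          for common words and for tokens followed by a common word are omitted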
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type"         : "common_grams",
    "common_words" : ["是", "的", "Is"],
    "ignore_case"  : true,
    "query_mode"   : true
  }],
  "text": ["我是中国人", "这是我的饭", "this is my food"]
}

# Result
{
  "tokens" : [
    {
      "token" : "我_是",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "gram",
      "position" : 0
    },
    {
      "token" : "是_中",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "gram",
      "position" : 1
    },
    {
      "token" : "中",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "国",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "人",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "这_是",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 105
    },
    {
      "token" : "是_我",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 106
    },
    {
      "token" : "我_的",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "gram",
      "position" : 107
    },
    {
      "token" : "的_饭",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "gram",
      "position" : 108
    },
    {
      "token" : "this_is",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "gram",
      "position" : 209
    },
    {
      "token" : "is_my",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 210
    },
    {
      "token" : "my",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 211
    },
    {
      "token" : "food",
      "start_offset" : 23,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 212
    }
  ]
}
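
The result above can be reproduced with a small Python emulation. This is a sketch under stated assumptions, not the Lucene code: the function name `common_grams` is hypothetical, positions and offsets are ignored, and the query-mode rule (keep a token only if the token after it in the index-mode stream is not a bigram, and drop a final single token that was already covered by the last bigram) is inferred from the output shown here.

```python
def common_grams(tokens, common_words, ignore_case=False, query_mode=False):
    """Emulate the common grams token filter on a list of token strings."""
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    common = {norm(w) for w in common_words}

    # Index-mode stream: every single token, followed by a bigram whenever
    # the token or its successor is a common word.
    stream = []  # list of (text, is_gram)
    for i, tok in enumerate(tokens):
        stream.append((tok, False))
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if norm(tok) in common or norm(nxt) in common:
                stream.append((tok + "_" + nxt, True))

    if not query_mode:
        return [text for text, _ in stream]

    # Query mode: a token survives only if the next token is not a bigram.
    out = []
    for i, item in enumerate(stream):
        if i + 1 < len(stream):
            if not stream[i + 1][1]:
                out.append(item)
        elif not (out and out[-1][1]):
            # Final token: kept unless the token emitted just before it
            # was a bigram that already contains it.
            out.append(item)
    return [text for text, _ in out]


words = ["是", "的", "Is"]
print(common_grams(list("我是中国人"), words, ignore_case=True, query_mode=True))
print(common_grams(list("这是我的饭"), words, ignore_case=True, query_mode=True))
print(common_grams("this is my food".split(), words, ignore_case=True, query_mode=True))
```

Each text value in the request is analyzed separately, which is why the sketch is called once per string and why the positions in the result jump by 100 between values.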

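# conditional token filter
# Purpose : applies the listed token filters only to tokens that satisfy the condition in the script
# Options :
#   1. filter : the token filters to apply when the condition is true
#   2. script : a predicate script; tokens for which it returns true are passed through the filters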
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [{
      "type"   : "condition",
      "filter" : ["lowercase"],
      "script" : {
        "source": "token.getTerm().length() < 5"
      }
  }], 
  "text": ["THE QUICK BROWN FOX"]
}

# Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "QUICK",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "BROWN",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
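
The control flow is easy to mirror in Python. In this sketch the function name `conditional_filter` is hypothetical, the predicate plays the role of the script, and plain functions stand in for the token filters; tokens that fail the predicate pass through unchanged.

```python
def conditional_filter(tokens, predicate, filters):
    """Apply `filters` in order, but only to tokens matching `predicate`."""
    out = []
    for tok in tokens:
        if predicate(tok):
            for apply_filter in filters:
                tok = apply_filter(tok)
        out.append(tok)
    return out


# Same condition as the script above: lowercase only tokens shorter than 5.
conditional_result = conditional_filter(
    ["THE", "QUICK", "BROWN", "FOX"],
    predicate=lambda t: len(t) < 5,
    filters=[str.lower],
)
print(conditional_result)
```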

# decimal digit token filter
# Purpose : converts every digit in the Unicode Decimal_Number general category to its ASCII equivalent (0-9)
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["decimal_digit"],
  "text": ["6.7 १ १-one two-२ ३ "]
}

# Result
{
  "tokens" : [
    {
      "token" : "6.7 1 1-one two-2 3 ",
      "start_offset" : 0,
      "end_offset" : 20,
      "type" : "word",
      "position" : 0
    }
  ]
}
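
The same conversion can be sketched with Python's standard `unicodedata` module. The function name `decimal_digit` is hypothetical; the sketch assumes the filter's behaviour is exactly "map every character in category Nd to its digit value", which matches the result above.

```python
import unicodedata


def decimal_digit(text):
    """Replace every Unicode decimal digit (category Nd) with 0-9."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # unicodedata.decimal returns the digit's integer value.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


converted = decimal_digit("6.7 १ १-one two-२ ३ ")
print(converted)
```

Characters outside the Nd category, such as letters, punctuation, and spaces, are left untouched, so the token keeps its length and offsets.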
