Elasticsearch Text Analysis (Part 2)

666呀


Article link: https://blog.csdn.net/Suubyy/article/details/118364606


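Tokenizer reference

A tokenizer receives a stream of characters, breaks it up into individual tokens, and outputs a stream of tokens. For example, the whitespace tokenizer breaks text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording the following:

- The order or position of each term.
- The start and end character offsets of the original word that each term represents.
- The token type, a classification of each term produced, such as <ALPHANUM>, <HANGUL>, or <NUM>. Simpler tokenizers only produce the word token type.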
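
The bookkeeping described above can be pictured with a small Python sketch. It is not Elasticsearch code: the function name `whitespace_tokenize` and the dictionary layout are assumptions chosen to mirror the fields of an `_analyze` response, and the sketch only splits on whitespace.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace and record position, offsets and type per token."""
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index one past the last character
            "type": "word",                 # simple tokenizers only emit "word"
            "position": position,           # order of the term in the stream
        })
    return tokens

for tok in whitespace_tokenize("Quick brown fox!"):
    print(tok)
```

Each printed dictionary carries the term together with the position, offsets, and type that a tokenizer is expected to track.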

Elasticsearch has a number of built-in tokenizers that can be used to build custom analyzers.

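Character group tokenizer

The char_group tokenizer breaks text into terms whenever it encounters a character that is in a defined set.

Configuration

The char_group tokenizer accepts the following parameters:

`tokenize_on_chars`	A list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Each entry is either a single character such as -, or one of the character groups whitespace, letter, digit, punctuation, symbol.
`max_token_length`	The maximum token length. If a token exceeds this length, it is split at `max_token_length` intervals. Defaults to 255.

Example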

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
'

Response:

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
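
To make the splitting rule concrete, here is a rough Python emulation of the request above. The helper names `_in_group` and `char_group_tokenize` are hypothetical, the mapping of character groups onto Python string methods and Unicode categories is an assumption, and `max_token_length` is not modelled.

```python
import unicodedata

def _in_group(ch, group):
    """Return True if `ch` belongs to a named group or equals a literal character."""
    if group == "whitespace":
        return ch.isspace()
    if group == "letter":
        return ch.isalpha()
    if group == "digit":
        return ch.isdigit()
    if group == "punctuation":
        return unicodedata.category(ch).startswith("P")
    if group == "symbol":
        return unicodedata.category(ch).startswith("S")
    return ch == group  # anything else is treated as a literal character

def char_group_tokenize(text, tokenize_on_chars):
    """Return (token, start_offset, end_offset) tuples."""
    tokens = []
    start = None  # start offset of the token being built, if any
    for index, ch in enumerate(text):
        if any(_in_group(ch, group) for group in tokenize_on_chars):
            if start is not None:
                tokens.append((text[start:index], start, index))
                start = None
        elif start is None:
            start = index
    if start is not None:
        tokens.append((text[start:], start, len(text)))
    return tokens

print(char_group_tokenize("The QUICK brown-fox", ["whitespace", "-", "\n"]))
```

The offsets in the tuples line up with the `start_offset` and `end_offset` values in the response shown above.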

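Edge n-gram tokenizer

The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word, where the start of each N-gram is anchored to the beginning of the word.

Tip: When you need to search text that has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.

Example

With the default settings, the edge_ngram tokenizer treats the initial text as a single token and produces N-grams with a minimum length of 1 and a maximum length of 2: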

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}
'

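The above request produces the following terms:

[ Q, Qu ]

Tip: The default gram lengths are almost entirely useless. Configure the edge_ngram tokenizer before using it.

Configuration

The edge_ngram tokenizer accepts the following parameters:

min_gram: Minimum length of characters in a gram. Defaults to 1.
max_gram: Maximum length of characters in a gram. Defaults to 2. See the limitations of the max_gram parameter.
token_chars: Character classes that should be included in a token. Elasticsearch splits on characters that do not belong to the specified classes. By default, all characters are kept.

Character classes may be any of the following: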
- letter — for example a, b, ï or 京
- digit — for example 3 or 7
- whitespace — for example " " or "\n"
- punctuation — for example ! or "
- symbol — for example $ or √
- custom — custom characters which need to be set using the custom_token_chars setting.
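custom_token_chars: Custom characters that should be treated as part of a token. For example, setting this to +-_ makes the tokenizer treat the plus, minus, and underscore signs as part of a token.

Limitations of the `max_gram` parameter

The max_gram value of the edge_ngram tokenizer limits the character length of tokens. When the edge_ngram tokenizer is used with an index analyzer, this means that search terms longer than the max_gram length may not match any indexed terms.

For example, if max_gram is 3, a search for apple will not match the indexed term app, because the indexed grams stop at three characters while the search term has five.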
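
The limitation can be reproduced with a few lines of Python. This is only a sketch: `edge_ngrams` is a hypothetical helper that models gram generation for a single word, and the set lookup stands in for an exact term match against the index.

```python
def edge_ngrams(word, min_gram=1, max_gram=2):
    """Return the prefixes of `word` whose length lies in [min_gram, max_gram]."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]

# Terms that would be indexed for "apple" when max_gram is 3.
indexed_terms = set(edge_ngrams("apple", min_gram=1, max_gram=3))

print(sorted(indexed_terms))
print("app" in indexed_terms)    # a three-character search term is present
print("apple" in indexed_terms)  # the full word is longer than any gram
```

Because no gram is longer than max_gram, the full word never appears among the indexed terms.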

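Example configuration

In this example, we configure the edge_ngram tokenizer to treat letters and digits as tokens, and to produce grams with a minimum length of 2 and a maximum length of 10: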

curl -X PUT "localhost:9200/my-index-00001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-00001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
'

The above example produces the following terms:

[ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ]
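
The following Python sketch approximates how min_gram, max_gram, and token_chars interact. The names `_matches` and `edge_ngram_tokenize` are hypothetical, the character classes are mapped onto Python string methods and Unicode categories as an assumption, and the custom class is not supported.

```python
import unicodedata

def _matches(ch, char_class):
    """Rough equivalents of the character classes listed earlier."""
    if char_class == "letter":
        return ch.isalpha()
    if char_class == "digit":
        return ch.isdigit()
    if char_class == "whitespace":
        return ch.isspace()
    if char_class == "punctuation":
        return unicodedata.category(ch).startswith("P")
    if char_class == "symbol":
        return unicodedata.category(ch).startswith("S")
    raise ValueError(f"unsupported character class: {char_class}")

def edge_ngram_tokenize(text, min_gram=1, max_gram=2, token_chars=()):
    # With no token_chars every character is kept, so the text is one word.
    if not token_chars:
        words = [text] if text else []
    else:
        words, current = [], []
        for ch in text:
            if any(_matches(ch, c) for c in token_chars):
                current.append(ch)
            elif current:
                words.append("".join(current))
                current = []
        if current:
            words.append("".join(current))
    grams = []
    for word in words:
        # Words shorter than min_gram produce no grams at all.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:n])
    return grams

print(edge_ngram_tokenize("Quick Fox"))
print(edge_ngram_tokenize("2 Quick Foxes.", 2, 10, ["letter", "digit"]))
```

Note that the word 2 disappears in the second call because it is shorter than min_gram.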

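`keyword` tokenizer

The keyword tokenizer is a "noop" tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters to normalize the output, for example to lowercase email addresses.

Example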

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "keyword",
  "text": "New York"
}
'

The above request produces the following term:

[ New York ]

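Combining with token filters

You can combine the keyword tokenizer with token filters to normalize structured data, such as product IDs or email addresses.

For example, the following analyze API request uses the keyword tokenizer and the lowercase filter to convert an email address to lowercase.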

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "john.SMITH@example.COM"
}
'

The request produces the following token:

[ john.smith@example.com ]
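
Conceptually, the request above is a two-stage pipeline: the tokenizer runs first, then each token filter transforms the token stream in order. A minimal Python sketch of that idea follows; the function names `keyword_tokenizer`, `lowercase_filter`, and `analyze` are illustrative and are not Elasticsearch APIs.

```python
def keyword_tokenizer(text):
    """Emit the whole input as a single term."""
    return [text]

def lowercase_filter(tokens):
    """Lowercase every token in the stream."""
    return [token.lower() for token in tokens]

def analyze(text, tokenizer, filters=()):
    """Run the tokenizer, then apply each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("New York", keyword_tokenizer))
print(analyze("john.SMITH@example.COM", keyword_tokenizer, [lowercase_filter]))
```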

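Lowercase tokenizer

Like the letter tokenizer, the lowercase tokenizer breaks text into terms whenever it encounters a character that is not a letter, but it also lowercases all terms. It is functionally equivalent to the letter tokenizer combined with the lowercase token filter, but it is more efficient because it performs both steps in a single pass.

Example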

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The above sentence produces the following terms:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
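
The single-pass behaviour can be sketched in Python as follows. `lowercase_tokenize` is a hypothetical name, and using `str.isalpha` as the definition of a letter is an assumption that may differ from Lucene's for unusual characters.

```python
def lowercase_tokenize(text):
    """Split on non-letters and lowercase in the same pass over the text."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())  # lowercase while collecting the term
        elif current:
            tokens.append("".join(current))  # a non-letter ends the term
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(lowercase_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

The digit 2 is dropped entirely, and the apostrophe splits dog's into dog and s.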

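Path hierarchy tokenizer

The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree.

Example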

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
'

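This produces the following terms:

[ /one, /one/two, /one/two/three ]

Configuration

The path_hierarchy tokenizer accepts the following parameters:

`delimiter`	The character to use as the path separator. Defaults to `/`.
`replacement`	An optional replacement character to use for the delimiter. Defaults to the delimiter.
`buffer_size`	The number of characters read into the term buffer in a single pass. Defaults to 1024.
`reverse`	If set to true, emits the tokens in reverse order. Defaults to false.
`skip`	The number of initial tokens to skip. Defaults to 0.

Example configuration

In this example, we configure the path_hierarchy tokenizer to split on - characters and to replace them with /. The first two tokens are skipped: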

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}
'

The above example produces the following terms:

[ /three, /three/four, /three/four/five ]

If we set reverse to true, it produces the following:

[ one/two/three/, two/three/, three/ ]
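
The forward (non-reversed) behaviour of delimiter, replacement, and skip can be approximated in Python. This is a sketch under assumptions: `path_hierarchy_tokenize` is a hypothetical name, buffer_size is ignored, and the reverse option is deliberately not modelled.

```python
def path_hierarchy_tokenize(text, delimiter="/", replacement=None, skip=0):
    """Emit one term per path component, each term extending the previous one."""
    if replacement is None:
        replacement = delimiter
    # Cut the text into parts; each part keeps the (replaced) delimiter before it.
    parts, current = [], ""
    for ch in text:
        if ch == delimiter:
            if current:
                parts.append(current)
            current = replacement
        else:
            current += ch
    if current:
        parts.append(current)
    # Drop the first `skip` parts, then accumulate the rest into growing paths.
    tokens, prefix = [], ""
    for part in parts[skip:]:
        prefix += part
        tokens.append(prefix)
    return tokens

print(path_hierarchy_tokenize("/one/two/three"))
print(path_hierarchy_tokenize("one-two-three-four-five",
                              delimiter="-", replacement="/", skip=2))
```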

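Detailed example

A common use case for the path_hierarchy tokenizer is filtering results by file path. If you index a file path along with the data, analyzing the path with the path_hierarchy tokenizer lets you filter the results by different parts of the path.

This example configures an index with two custom analyzers and applies them to multi-fields of the file_path text field that stores file names. One of the two analyzers uses reverse tokenization. Some sample documents are then indexed to represent file paths for photos inside the photo folders of two different users.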

curl -X PUT "localhost:9200/file-path-test?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/file-path-test/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
'
curl -X POST "localhost:9200/file-path-test/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}
'
curl -X POST "localhost:9200/file-path-test/_doc/3?pretty" -H 'Content-Type: application/json' -d'
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}
'
curl -X POST "localhost:9200/file-path-test/_doc/4?pretty" -H 'Content-Type: application/json' -d'
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}
'
curl -X POST "localhost:9200/file-path-test/_doc/5?pretty" -H 'Content-Type: application/json' -d'
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}
'
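
To see why the two subfields are useful for filtering, the Python sketch below builds the terms each analyzer would plausibly emit for the five sample paths and checks which documents contain a given exact term. It is an in-memory illustration, not an Elasticsearch query. The helper names are hypothetical, and the shape of the reversed tokens (the full path plus every suffix that begins at a path component) is an assumption about how reverse behaves when skip is 0.

```python
def tree_tokens(path, delimiter="/"):
    """Forward tokens: one term per ancestor of an absolute path."""
    tokens, prefix = [], ""
    for part in path.split(delimiter):
        if part:
            prefix += delimiter + part
            tokens.append(prefix)
    return tokens

def tree_reversed_tokens(path, delimiter="/"):
    """Reversed tokens: the full path plus every suffix that starts a component."""
    tokens = [path]
    for index, ch in enumerate(path):
        if ch == delimiter and index + 1 < len(path):
            tokens.append(path[index + 1:])
    return tokens

# The five sample documents indexed above, keyed by document ID.
docs = {
    1: "/User/alice/photos/2017/05/16/my_photo1.jpg",
    2: "/User/alice/photos/2017/05/16/my_photo2.jpg",
    3: "/User/alice/photos/2017/05/16/my_photo3.jpg",
    4: "/User/alice/photos/2017/05/15/my_photo1.jpg",
    5: "/User/bob/photos/2017/05/16/my_photo1.jpg",
}

def matching_ids(term, tokenize):
    """IDs of documents whose token list contains `term` exactly."""
    return sorted(doc_id for doc_id, path in docs.items() if term in tokenize(path))

# Filter by directory with the forward terms (the file_path.tree subfield).
print(matching_ids("/User/alice/photos/2017/05/16", tree_tokens))
# Filter by file name with the reversed terms (the file_path.tree_reversed subfield).
print(matching_ids("my_photo1.jpg", tree_reversed_tokens))
```

The forward terms answer "everything under this directory", while the reversed terms answer "every file with this name, in any directory".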

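Pattern tokenizer

The pattern tokenizer uses a regular expression either to split text into terms wherever the expression matches, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.

Warning: Beware of pathological regular expressions. The pattern tokenizer uses Java regular expressions. A badly written regular expression can run very slowly or even throw a StackOverflowError. For more information, read about pathological regular expressions and how to avoid them.

Example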

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size\u0027s default is 5."
}
'

[ The, foo_bar_size, s, default, is, 5 ]

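Configuration

The pattern tokenizer accepts the following parameters:

`pattern`	A Java regular expression. Defaults to `\W+`.
`flags`	Java regular expression flags. Flags should be pipe-separated, for example `"CASE_INSENSITIVE|COMMENTS"`.
`group`	Which capture group to extract as tokens. Defaults to -1 (split).

Example configuration

In this example, we configure the pattern tokenizer to break text into tokens whenever it encounters a comma: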

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}
'

[ comma, separated, values ]
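
The split mode (group set to -1) can be imitated with Python's `re` module. This is a sketch only: `pattern_split_tokenize` is a hypothetical name, and Python's regular expression dialect is not identical to Java's (for example, Python's \W is Unicode-aware by default), although the two agree on the ASCII inputs used here.

```python
import re

def pattern_split_tokenize(text, pattern=r"\W+"):
    """Split `text` wherever `pattern` matches and drop empty pieces."""
    tokens, last_end = [], 0
    for match in re.finditer(pattern, text):
        if match.start() > last_end:
            tokens.append(text[last_end:match.start()])
        last_end = match.end()
    if last_end < len(text):
        tokens.append(text[last_end:])  # text after the final separator
    return tokens

print(pattern_split_tokenize("The foo_bar_size's default is 5."))
print(pattern_split_tokenize("comma,separated,values", ","))
```

The underscore counts as a word character, which is why foo_bar_size stays in one piece under the default pattern.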

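In the next example, we configure the pattern tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes \"). The regular expression itself looks like this:

"((?:\\"|[^"]|\\")+)"

When the pattern is specified in JSON, the " and \ characters need to be escaped, so the pattern ends up looking like this:

\"((?:\\\\\"|[^\"]|\\\\\")+)\"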

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}
'

The above example produces the following two terms:

[ value, value with embedded \" quote ]
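
Both the JSON escaping and the capture-group behaviour can be checked with a short Python sketch. It uses Python's `re` module as a stand-in for Java regular expressions, which is an assumption that holds for this particular pattern, and `capture_group_tokenize` is a hypothetical helper name.

```python
import json
import re

# The regular expression as it would be written outside JSON.
PATTERN = r'"((?:\\"|[^"]|\\")+)"'

# json.dumps escapes every " and \ and wraps the result in quotes,
# giving the string that goes into the "pattern" setting.
print(json.dumps(PATTERN))

def capture_group_tokenize(text, pattern, group):
    """Emit the chosen capture group of every match as a term."""
    terms = []
    for match in re.finditer(pattern, text):
        value = match.group(group)
        if value:  # skip groups that are empty or did not participate
            terms.append(value)
    return terms

# The decoded form of the "text" value sent in the request above.
text = r'"value", "value with embedded \" quote"'
print(capture_group_tokenize(text, PATTERN, group=1))
```

Letting a JSON library do the escaping avoids counting backslashes by hand.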

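Simple pattern tokenizer

The simple_pattern tokenizer uses a regular expression to capture matching text as terms. The set of regular expression features it supports is more limited than that of the pattern tokenizer, but tokenization is generally faster. Unlike the pattern tokenizer, the simple_pattern tokenizer does not support splitting the input on a pattern match. To split on pattern matches using the same restricted regular expression subset, see the simple_pattern_split tokenizer.

This example configures the simple_pattern tokenizer to produce terms that are three-digit numbers: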

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
'

[ 786, 335, 514 ]
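
The match-only behaviour can be imitated in Python. This is a loose sketch: `simple_pattern_tokenize` is a hypothetical name, and Python's `re` module accepts a far richer syntax than the restricted subset that simple_pattern supports, so only the simple character-class pattern from the example is exercised here.

```python
import re

def simple_pattern_tokenize(text, pattern):
    """Emit every non-overlapping, non-empty match of `pattern` as a term."""
    return [m.group(0) for m in re.finditer(pattern, text) if m.group(0)]

print(simple_pattern_tokenize("fd-786-335-514-x", r"[0123456789]{3}"))
```

Text that does not match the pattern, such as fd and x, is simply discarded rather than emitted as a token.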

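Standard tokenizer

The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.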

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

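Configuration

The standard tokenizer accepts the following parameter:

max_token_length: The maximum token length in characters. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

Example configuration

In this example, we configure the standard tokenizer with a max_token_length of 5 (for demonstration purposes):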

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
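
The effect of max_token_length can be isolated from the word-boundary rules with a small Python sketch. It assumes the text has already been split into tokens (the list below is the default standard tokenizer output shown earlier), and `split_long_tokens` is a hypothetical helper that models only the interval splitting.

```python
def split_long_tokens(tokens, max_token_length=255):
    """Cut every token into consecutive pieces of at most max_token_length characters."""
    result = []
    for token in tokens:
        for start in range(0, len(token), max_token_length):
            result.append(token[start:start + max_token_length])
    return result

tokens = ["The", "2", "QUICK", "Brown", "Foxes", "jumped",
          "over", "the", "lazy", "dog's", "bone"]
print(split_long_tokens(tokens, max_token_length=5))
```

Only jumped exceeds five characters, so it is the only token that gets cut, into jumpe and d.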

`UAX URL email` tokenizer

The uax_url_email tokenizer is like the standard tokenizer, except that it recognizes URLs and email addresses as single tokens.

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
'

[ Email, me, at, john.smith@global-international.com ]

Whitespace tokenizer

The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

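Configuration

The whitespace tokenizer accepts the following parameter:

max_token_length: The maximum token length in characters. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.

Token filter reference

Character filter reference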
