Tokenization
1. Exact values vs. full text
Elasticsearch builds an inverted index for each field; an exact-value (keyword) field is indexed as a single untokenized term rather than being analyzed.
1. keyword (exact value): matched exactly and never analyzed; numbers, dates, or precise strings (IDs, status codes) are typically mapped to this type
2. text: full-text search field; its value is analyzed (tokenized) at index time
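As a sketch, an explicit mapping using both types might look like this (the index and field names are assumptions for illustration):

```json
PUT demo
{
  "mappings": {
    "properties": {
      "status":  { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}
```

A term query on `status` must match the stored value exactly, while a match query on `message` runs against the analyzed terms.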
Analyzer
1. Processing pipeline
Character Filters --> Tokenizer --> Token Filters
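All three stages can be exercised in one `_analyze` call; the request below (a sketch using only built-in components) strips HTML, tokenizes, then lowercases and removes stopwords, leaving quick, brown, fox:

```json
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "<b>The QUICK brown fox</b>"
}
```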
Character Filter
Preprocesses the text before it reaches the Tokenizer, e.g. adding or replacing characters. Multiple character filters can be configured; they affect the position and offset information the Tokenizer produces.
Built-in character filters:
- HTML strip – removes HTML tags
- Mapping – string replacement
- Pattern replace – regex-based replacement
Tokenizer
Splits the raw text into terms according to certain rules.
Built-in tokenizers:
whitespace/standard/uax_url_email/pattern/keyword/path_hierarchy
Custom tokenizers can also be implemented in Java.
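For example, `uax_url_email` keeps URLs and e-mail addresses as single tokens where `standard` would split them (the sample text is an assumption):

```json
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "contact admin@example.com or visit https://www.elastic.co"
}
```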
Token Filters
Adds, modifies, or removes the terms produced by the Tokenizer.
Built-in filters: lowercase / stop / synonym (adds synonyms)
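The `synonym` filter can be defined inline for a quick test; a sketch (the synonym pair is made up):

```json
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick test"
}
```

Both quick and fast appear in the output at the same position, so a search for either term matches.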
POST users/_analyze
{
"text": "我是中国人",
"analyzer": "ik_max_word"
}
# dynamic mapping: "level" is indexed as text with a keyword sub-field
PUT logs/_doc/1
{"level":"DEBUG"}
GET /logs/_mapping
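Because dynamic mapping indexes the string as text with a keyword sub-field, `level` supports both query styles; a sketch assuming the document above:

```json
# exact, un-analyzed match on the sub-field
GET logs/_search
{
  "query": { "term": { "level.keyword": "DEBUG" } }
}

# full-text match against the analyzed terms
GET logs/_search
{
  "query": { "match": { "level": "debug" } }
}
```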
# html_strip removes HTML markup
POST _analyze
{
"tokenizer":"keyword",
"char_filter":["html_strip"],
"text": "<b>hello world</b>"
}
# split a path into hierarchy tokens
POST _analyze
{
"tokenizer":"path_hierarchy",
"text":"/user/ymruan/a/b/c/d/e"
}
# use a mapping char filter to replace characters
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "mapping",
"mappings" : [ "- => _"]
}
],
"text": "123-456, I-test! test-990 650-555-1234"
}
# mapping char filter replacing emoticons
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "mapping",
"mappings" : [ ":) => happy", ":( => sad"]
}
],
"text": ["I am feeling :)", "Feeling :( today"]
}
# whitespace tokenizer with stop and snowball filters
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop","snowball"],
"text": ["The girls in China are playing this game! zw"]
}
# whitespace with stop: "The" survives because stop is case-sensitive
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop","snowball"],
"text": ["The rain in Spain falls mainly on the plain."]
}
# with lowercase added, "The" is recognized as a stopword and removed
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["lowercase","stop","snowball"],
"text": ["The girls in China are playing this game! zw"]
}
# regex replacement with pattern_replace
GET _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type" : "pattern_replace",
"pattern" : "http://(.*)",
"replacement" : "$1"
}
],
"text" : "http://www.elastic.co"
}