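Elasticsearch Core Technology and Practice (Elasticsearch核心技术与实战) study notes, Chapter 3, Part 20: multi-fields and configuring a custom analyzer in the mapping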

Latest recommended article published on 2024-04-21 20:23:20

bohu83


Category: ES. Tags: elasticsearch, analyzer, Exact Values, filter

Article link: https://blog.csdn.net/bohu83/article/details/106180032


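1. Preface

This post is part of my study notes on the Geek Time (极客时间) course Elasticsearch核心技术与实战 (Elasticsearch Core Technology and Practice).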

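2. Multi-field types

Multi-field features

Exact matching on a manufacturer name
- Add a keyword sub-field
Using different analyzers
- Different languages
- Searching on a pinyin field
- Different analyzers can also be specified for indexing and for searching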
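
A minimal sketch of how these options might look in a mapping, assuming Elasticsearch 7.x or later (typeless mappings). The index name `products`, the field `company`, and the sub-field names are hypothetical and chosen only for illustration. A pinyin sub-field is left out because it depends on a separately installed analysis plugin.

```console
# Hypothetical index and field names, for illustration only.
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}

# Exact match against the keyword sub-field.
GET products/_search
{
  "query": {
    "term": { "company.keyword": "Apple Store" }
  }
}
```

`analyzer` is used when indexing (and when searching, unless `search_analyzer` overrides it). The two are set to the same value here only to show where each parameter goes.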

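Exact Values vs. Full Text

Exact values: numbers, dates, and specific strings (for example "Apple Store")

These map to keyword in Elasticsearch

Full text: unstructured text data

This maps to text in Elasticsearch

Exact values do not need to be tokenized

Elasticsearch creates an inverted index for every field

An exact value needs no special analysis at index time; the whole value is indexed as a single term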
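
The difference can be observed with the `_analyze` API. This is a small sketch that uses the built-in `keyword` and `standard` analyzers as stand-ins for how a `keyword` field and a `text` field (with the default analyzer) treat the same string:

```console
# Kept as one term, as a keyword field would do: "Apple Store"
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}

# Split and lowercased, as a text field with the default analyzer would do: "apple", "store"
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
```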

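3. Custom analyzers

When the analyzers that ship with Elasticsearch do not meet your needs, you can define your own by combining three kinds of components:

Character Filter
Tokenizer
Token Filter

A custom analyzer combines zero or more character filters, exactly one tokenizer, and zero or more token filters in a configuration suited to your data. They run in that order.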
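
The ordering can be tried out ad hoc, without creating an index, by passing all three kinds of components to `_analyze`. A minimal sketch; the sample text and the `& => and` rule are made up for illustration:

```console
# Sample text and mapping rule are illustrative assumptions.
POST _analyze
{
  "char_filter": [
    { "type": "mapping", "mappings": ["& => and"] }
  ],
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "Tom & Jerry"
}
```

The char filter first rewrites `&` to `and`, the whitespace tokenizer then splits the text, and `lowercase` finally normalizes each token, so the tokens should come out as `tom`, `and`, `jerry`.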

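3.1 Character Filters

Character filters process the text before it reaches the tokenizer, for example by adding, removing, or replacing characters. Several character filters can be configured. They affect the position and offset information that the tokenizer produces.
Some built-in Character Filters

HTML strip - removes HTML tags
Mapping - string replacement
Pattern replace - regex-based replacement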

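3.2 Tokenizer

A tokenizer splits the original text into words (terms or tokens) according to a set of rules.
Tokenizers built into Elasticsearch

whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy

You can also write a plugin in Java to implement your own Tokenizer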
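
As a quick illustration of how much the choice of tokenizer changes the output, here is a sketch using the built-in `path_hierarchy` tokenizer. The sample path is an assumption chosen for the example:

```console
# Sample path is illustrative.
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/bin"
}
```

This tokenizer emits one token per level of the path: `/usr`, `/usr/local`, and `/usr/local/bin`.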

3.3 Token Filters

Token filters add, modify, or remove the tokens that the tokenizer emits.
Built-in Token Filters
- lowercase | stop | synonym (adds synonyms)
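
A rough sketch of the `synonym` filter, defined in index settings so that it can be referenced by name. The index name `synonym_demo`, the filter and analyzer names, and the synonym rule are all hypothetical:

```console
# Hypothetical index, filter, and analyzer names; the synonym rule is made up.
PUT synonym_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["quick, fast"]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  }
}

POST synonym_demo/_analyze
{
  "analyzer": "my_synonym_analyzer",
  "text": "Quick fox"
}
```

With this rule, the response would be expected to contain `fast` as an extra token alongside `quick`, which is what "adding" a token means here.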

Demo

// Result: the HTML tags are stripped.
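
The request behind this result is not shown in these notes. A request along the following lines would produce it; this is a sketch with an illustrative sample string, not necessarily the one used in the course:

```console
# Sample text is an assumption; the keyword tokenizer keeps the output as a single token.
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
```

The single returned token should be `hello world`, with the `<b>` tags gone.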

# Use a char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

Result: the hyphens are replaced with underscores

"tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },

A char filter can also replace emoticons

A char filter can use a regular expression

In that example the http:// prefix is stripped
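
The regular-expression case can be sketched with the built-in `pattern_replace` char filter. The pattern and sample URL below are assumptions for illustration, since the original request is not included in these notes:

```console
# Pattern and sample URL are illustrative assumptions.
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
```

The capture group keeps everything after `http://`, so the tokenizer only ever sees the rest of the URL.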

// whitespace tokenizer with the stop and snowball token filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

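Result: the capitalized "The" is kept, because the whitespace tokenizer does not lowercase and the stop filter is case-sensitive by default. The lowercase "in", "on", and "the" are removed, and snowball stems "falls" to "fall" and "mainly" to "main".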

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "fall",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "main",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}

// After adding lowercase to the filters, "The" is treated as a stopword and removed
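
A sketch of that variant, reusing the sentence from the request above. The only change is that `lowercase` is placed ahead of `stop`:

```console
# lowercase must come before stop so that "The" becomes "the" before stopwords are checked.
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
```

Token filters run in the order listed, so putting `lowercase` after `stop` would leave "The" in the output.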

Custom analyzer

First, the skeleton from the official documentation (for version 2.x):

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

Demo:

# Define a custom analyzer
PUT my_index
{
"settings": {
  "analysis": {
    "analyzer": {
      "my_custom_analyzer":{
        "type":"custom",
        "char_filter":[
          "emoticons"
        ],
        "tokenizer":"punctuation",
        "filter":[
          "lowercase",
          "english_stop"
        ]
      }
    },
    "tokenizer": {
      "punctuation":{
        "type":"pattern",
        "pattern": "[ .,!?]"
      }
    },
    "char_filter": {
      "emoticons":{
        "type":"mapping",
        "mappings" : [ 
          ":) => happy",
          ":( => sad"
        ]
      }
    },
    "filter": {
      "english_stop":{
        "type":"stop",
        "stopwords":"_english_"
      }
    }
  }
}
}

Run it:

POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": ["I am felling :)", "Feeling :( today"]
}

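With both the index and the analyzer specified, the result is what we want: the emoticons become "happy" and "sad", the text is split on the punctuation pattern, and everything is lowercased. The jump from position 3 to position 104 reflects the default position gap of 100 between the two elements of the text array.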

{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "word",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "word",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 106
    }
  ]
}
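
To use the analyzer for real documents, it has to be referenced from a field in the mapping. A minimal sketch, assuming Elasticsearch 7.x or later and the `my_index` index created above; the field name `comment` is hypothetical:

```console
# The field name "comment" is a hypothetical example.
PUT my_index/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}

# Check which analyzer the field resolves to by analyzing through the field.
POST my_index/_analyze
{
  "field": "comment",
  "text": "Feeling :( today"
}
```

Analyzing through `field` instead of `analyzer` confirms that the mapping, and not just the index settings, picks up the custom analyzer.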