Custom Analyzers in Elasticsearch

Custom analyzers
  • When the analyzers that ship with Elasticsearch cannot meet your needs, you can define a custom analyzer by combining different components:
    1. Character Filter
    2. Tokenizer
    3. Token Filter
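These components can also be combined ad hoc in an _analyze request to preview the pipeline before defining a named analyzer. A minimal sketch using only built-in components (the sample text is made up for illustration):

// preview a char_filter + tokenizer + token filter pipeline (sketch)
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Some Sample Text</b>"
}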
Character Filters
  • Process the text before it reaches the Tokenizer, for example by adding, removing, or replacing characters. Multiple Character Filters can be configured; they affect the position and offset information seen by the Tokenizer
  • Some built-in Character Filters
    1. HTML strip - removes HTML tags
// strip HTML tags
GET _analyze
{
  "tokenizer" : "keyword",
  "char_filter" : ["html_strip"],
  "text" : "<b>Hello Word</b>"
}
// Response
{
  "tokens" : [
    {
      "token" : "Hello Word",
      "start_offset" : 3,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

    2. Mapping - string replacement
// use a mapping char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : ["- => _"]
      }
    ],
    "text" :  "123-456, I-test! test-900 650-550-12345"
}
// Response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_900",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_550_12345",
      "start_offset" : 26,
      "end_offset" : 39,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}


// use a mapping char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type":"mapping",
        "mappings" : [":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am felling :)", "I am very :("]
}
// Response
// :) has been replaced with happy and :( with sad
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "I",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "am",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "very",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "sad",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

    3. Pattern replace - regex-based replacement
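A pattern_replace example is not shown above; here is a minimal sketch (the regex and sample text are assumptions for illustration) that rewrites hyphens between digits to underscores so the standard tokenizer keeps the number as one token:

// pattern_replace char filter (sketch)
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "123-456-789"
}
// the token becomes 123_456_789 instead of being split into 123, 456, 789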
Tokenizer
  • Splits the original text into terms (tokens) according to certain rules
  • Built-in Tokenizers in Elasticsearch
    1. whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy (see the path_hierarchy sketch after the example below)
// whitespace tokenizer with the stop filter
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the palin"]
}
// Response
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "palin",
      "start_offset" : 38,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    }
  ]
}
// Apart from the capitalized "The" (the stop filter only matches lowercase terms), the stop words "in", "on", and "the" have been removed
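The path_hierarchy tokenizer from the list above emits every prefix of a path-like string; a minimal sketch (the sample path is made up for illustration):

// path_hierarchy tokenizer (sketch)
GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/elasticsearch/bin"
}
// emits /usr, /usr/local, /usr/local/elasticsearch, /usr/local/elasticsearch/bin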

  • You can also implement your own Tokenizer by developing a plugin in Java
Token Filters
  • Add to, modify, or remove the terms output by the Tokenizer
  • Built-in Token Filters
    1. lowercase / stop / synonym (adds synonyms; see the sketch after the example below)
// whitespace tokenizer with the lowercase and stop filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop"],
  "text": ["The rain in Spain falls mainly on the palin"]
}
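The synonym filter mentioned in the list above can also be tried inline in _analyze; a minimal sketch (the synonym pair quick/fast is an assumption for illustration):

// synonym token filter (sketch)
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick fox"
}
// "fast" is emitted at the same position as "quick"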
Defining your own custom analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer" : {
          "type" : "custom",
          "char_fitler" : [
              "emoticons"
            ],
            "tokenizer" : "punctuation",
            "filter" : [
                "lowercase" ,
                "english_stop"
              ]
        }
      },
      "tokenizer" : {
        "punctuation" : {
          "type" : "pattern",
          "pattern" : "[.,!?]"
        }
      },
      "char_filter": {
        "emoticons" : {
          "type" :"mapping",
          "mappings" : [
              ":) => _happy_",
              ":( => _sad_"
            ]
        }
      },
      "filter": {
        "english_stop" :{
          "type":"stop",
          "stopwords" :"_english_"
        }
      }
    }
  }
}
// Sample response for a successful creation
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

// Analyze text with the analyzer just defined

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
// Response: the emoticons char filter replaces :) with _happy_, and the pattern tokenizer splits on punctuation
{
  "tokens" : [
    {
      "token" : "i'm a :) person",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " and you",
      "start_offset" : 16,
      "end_offset" : 24,
      "type" : "word",
      "position" : 1
    }
  ]
}
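To use the new analyzer at index time, reference it from a field mapping on the same index; a minimal sketch assuming a hypothetical text field named "content" (7.x typeless mapping syntax):

// bind my_custom_analyzer to a field ("content" is a made-up field name)
PUT my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}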
