Custom Analyzers in Elasticsearch

Custom analyzers
  • When the analyzers that ship with Elasticsearch cannot meet your needs, you can define a custom analyzer by combining different components:
    1. Character Filter
    2. Tokenizer
    3. Token Filter
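These components can also be combined ad hoc in an _analyze request to preview the pipeline before defining a named analyzer. A minimal sketch using only built-in components (the sample text is made up for illustration):

// preview a char_filter + tokenizer + token filter pipeline (sketch)
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Some Sample Text</b>"
}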
Character Filters
  • Process the text before it reaches the Tokenizer, for example by adding, removing, or replacing characters. Multiple Character Filters can be configured; they affect the position and offset information seen by the Tokenizer
  • Some built-in Character Filters
    1. HTML strip - removes HTML tags
// strip HTML tags
GET _analyze
{
  "tokenizer" : "keyword",
  "char_filter" : ["html_strip"],
  "text" : "<b>Hello Word</b>"
}
// Response
{
  "tokens" : [
    {
      "token" : "Hello Word",
      "start_offset" : 3,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

    2. Mapping - string replacement
// use a mapping char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : ["- => _"]
      }
    ],
    "text" :  "123-456, I-test! test-900 650-550-12345"
}
// Response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_900",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_550_12345",
      "start_offset" : 26,
      "end_offset" : 39,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}


// use a mapping char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type":"mapping",
        "mappings" : [":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am felling :)", "I am very :("]
}
// Response
// :) has been replaced with happy and :( with sad
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "I",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "am",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "very",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "sad",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

    3. Pattern replace - regex-based replacement
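A pattern_replace example is not shown above; here is a minimal sketch (the regex and sample text are assumptions for illustration) that rewrites hyphens between digits to underscores so the standard tokenizer keeps the number as one token:

// pattern_replace char filter (sketch)
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "123-456-789"
}
// the token becomes 123_456_789 instead of being split into 123, 456, 789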
Tokenizer
  • Splits the original text into terms (tokens) according to certain rules
  • Built-in Tokenizers in Elasticsearch
    1. whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy (see the path_hierarchy sketch after the example below)
// whitespace tokenizer with the stop filter
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the palin"]
}
// Response
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "palin",
      "start_offset" : 38,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    }
  ]
}
// Apart from the capitalized "The" (the stop filter only matches lowercase terms), the stop words "in", "on", and "the" have been removed
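The path_hierarchy tokenizer from the list above emits every prefix of a path-like string; a minimal sketch (the sample path is made up for illustration):

// path_hierarchy tokenizer (sketch)
GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/elasticsearch/bin"
}
// emits /usr, /usr/local, /usr/local/elasticsearch, /usr/local/elasticsearch/bin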

  • You can also implement your own Tokenizer by developing a plugin in Java
Token Filters
  • Add to, modify, or remove the terms output by the Tokenizer
  • Built-in Token Filters
    1. lowercase / stop / synonym (adds synonyms; see the sketch after the example below)
// whitespace tokenizer with the lowercase and stop filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop"],
  "text": ["The rain in Spain falls mainly on the palin"]
}
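The synonym filter mentioned in the list above can also be tried inline in _analyze; a minimal sketch (the synonym pair quick/fast is an assumption for illustration):

// synonym token filter (sketch)
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick fox"
}
// "fast" is emitted at the same position as "quick"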
Defining your own custom analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer" : {
          "type" : "custom",
          "char_fitler" : [
              "emoticons"
            ],
            "tokenizer" : "punctuation",
            "filter" : [
                "lowercase" ,
                "english_stop"
              ]
        }
      },
      "tokenizer" : {
        "punctuation" : {
          "type" : "pattern",
          "pattern" : "[.,!?]"
        }
      },
      "char_filter": {
        "emoticons" : {
          "type" :"mapping",
          "mappings" : [
              ":) => _happy_",
              ":( => _sad_"
            ]
        }
      },
      "filter": {
        "english_stop" :{
          "type":"stop",
          "stopwords" :"_english_"
        }
      }
    }
  }
}
// Sample response for a successful creation
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

// Analyze text with the analyzer just defined

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
// Response: the emoticons char filter replaces :) with _happy_, and the pattern tokenizer splits on punctuation
{
  "tokens" : [
    {
      "token" : "i'm a :) person",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : " and you",
      "start_offset" : 16,
      "end_offset" : 24,
      "type" : "word",
      "position" : 1
    }
  ]
}
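To use the new analyzer at index time, reference it from a field mapping on the same index; a minimal sketch assuming a hypothetical text field named "content" (7.x typeless mapping syntax):

// bind my_custom_analyzer to a field ("content" is a made-up field name)
PUT my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}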
