Character Filters process the text before it reaches the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be chained, and they affect the position and offset information produced by the Tokenizer.
Some built-in Character Filters
1. HTML strip - removes HTML tags
// Strip HTML tags
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>Hello Word</b>"
}
// Response
{"tokens":[{"token":"Hello Word","start_offset":3,"end_offset":17,"type":"word","position":0}]}
2. Mapping - string replacement
// Replace characters with a mapping char filter
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {"type": "mapping", "mappings": ["- => _"]}
  ],
  "text": "123-456, I-test! test-900 650-550-12345"
}
// Response
{"tokens":[{"token":"123_456","start_offset":0,"end_offset":7,"type":"<NUM>","position":0},{"token":"I_test","start_offset":9,"end_offset":15,"type":"<ALPHANUM>","position":1},{"token":"test_900","start_offset":17,"end_offset":25,"type":"<ALPHANUM>","position":2},{"token":"650_550_12345","start_offset":26,"end_offset":39,"type":"<NUM>","position":3}]}

// Use a mapping to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {"type": "mapping", "mappings": [":) => happy", ":( => sad"]}
  ],
  "text": ["I am felling :)", "I am very :("]
}
// Response: :) is replaced with happy, :( with sad
{"tokens":[{"token":"I","start_offset":0,"end_offset":1,"type":"<ALPHANUM>","position":0},{"token":"am","start_offset":2,"end_offset":4,"type":"<ALPHANUM>","position":1},{"token":"felling","start_offset":5,"end_offset":12,"type":"<ALPHANUM>","position":2},{"token":"happy","start_offset":13,"end_offset":15,"type":"<ALPHANUM>","position":3},{"token":"I","start_offset":16,"end_offset":17,"type":"<ALPHANUM>","position":4},{"token":"am","start_offset":18,"end_offset":20,"type":"<ALPHANUM>","position":5},{"token":"very","start_offset":21,"end_offset":25,"type":"<ALPHANUM>","position":6},{"token":"sad","start_offset":26,"end_offset":28,"type":"<ALPHANUM>","position":7}]}
3. Pattern replace - regex-based replacement
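The notes give no example for item 3; below is a sketch using the built-in pattern_replace character filter (the credit-card sample text is illustrative, adapted from the example in the Elasticsearch documentation).

```
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "My credit card is 123-456-789"
}
```

The regex matches a run of digits followed by a hyphen that is itself followed by another digit, so "123-456-789" should come out as the single token "123_456_789" while ordinary hyphenated words are left alone.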
Tokenizer
Splits the raw text into terms (term or token) according to a set of rules.
Built-in Tokenizers in Elasticsearch
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
// whitespace tokenizer with the stop filter
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the palin"]
}
// Response
{"tokens":[{"token":"The","start_offset":0,"end_offset":3,"type":"word","position":0},{"token":"rain","start_offset":4,"end_offset":8,"type":"word","position":1},{"token":"Spain","start_offset":12,"end_offset":17,"type":"word","position":3},{"token":"falls","start_offset":18,"end_offset":23,"type":"word","position":4},{"token":"mainly","start_offset":24,"end_offset":30,"type":"word","position":5},{"token":"palin","start_offset":38,"end_offset":43,"type":"word","position":8}]}
// Note: "in", "on", and "the" are gone, but the capitalized "The" survives because the stop filter's default stopword list is lowercase
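Among the tokenizers listed above, uax_url_email behaves like standard except that it keeps URLs and email addresses intact; a minimal sketch (the sample address is made up):

```
GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}
```

With the standard tokenizer the address would be split into several tokens; uax_url_email emits john.smith@global-international.com as a single token.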
You can develop a plugin in Java to implement your own Tokenizer.
Token Filters
Adds, modifies, or removes the terms output by the Tokenizer.
Built-in Token Filters
lowercase / stop / synonym (adds synonyms)
// whitespace tokenizer with the lowercase and stop filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The rain in Spain falls mainly on the palin"]
}
// With lowercase applied first, "The" becomes "the" and is then removed by the stop filter as well
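The synonym filter mentioned above can also be tried directly in _analyze by defining it inline; a sketch with an illustrative synonym rule:

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {"type": "synonym", "synonyms": ["quick, fast"]}
  ],
  "text": "a quick reply"
}
```

The rule makes "quick" and "fast" equivalent, so the output contains both tokens at the same position, which is what lets a search for "fast" match documents that only contain "quick".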
Defining a custom analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {"type": "pattern", "pattern": "[.,!?]"}
      },
      "char_filter": {
        "emoticons": {"type": "mapping", "mappings": [":) => _happy_", ":( => _sad_"]}
      },
      "filter": {
        "english_stop": {"type": "stop", "stopwords": "_english_"}
      }
    }
  }
}
// Successful response
{"acknowledged":true,"shards_acknowledged":true,"index":"my_index"}
// Analyze text with the analyzer just defined
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
// Response: the emoticons char filter maps :) to _happy_, the pattern tokenizer splits on [.,!?], and lowercase is applied
{"tokens":[{"token":"i'm a _happy_ person","start_offset":0,"end_offset":15,"type":"word","position":0},{"token":" and you","start_offset":16,"end_offset":24,"type":"word","position":1}]}
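To apply the custom analyzer at index time, assign it to a field in the mapping; a sketch (the field name "comment" is illustrative):

```
PUT my_index/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}
```

Documents indexed into "comment" will then pass through emoticons → punctuation → lowercase → english_stop before being stored in the inverted index.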