Analyzer
Elasticsearch ships with built-in analyzers; use the _analyze API to test them
-
Character Filters
- -Processes the raw text, e.g. stripping HTML markup
- html_strip: removes HTML tags and decodes HTML entities
- mapping: performs character replacements (see the sketch after the html_strip test below)
- pattern_replace: performs regex-based replacements (also sketched below)
- -Affects the position and offset information seen by the subsequent tokenizer
- Test html_strip
- POST _analyze
- {
-   "char_filter": ["html_strip"],
-   "text": "<p>delete label</p>",
-   "tokenizer": "keyword"
- }
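- A sketch of the other two char filters via _analyze (the mappings, pattern, and sample texts here are illustrative, not from the original notes):
- POST _analyze
- {
-   "char_filter": [
-     {
-       "type": "mapping",
-       "mappings": [":) => happy"]
-     }
-   ],
-   "text": ":) weather",
-   "tokenizer": "keyword"
- }
- POST _analyze
- {
-   "char_filter": [
-     {
-       "type": "pattern_replace",
-       "pattern": "(\\d+)-(\\d+)",
-       "replacement": "$1_$2"
-     }
-   ],
-   "text": "call 123-456",
-   "tokenizer": "keyword"
- }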
-
Tokenizer
- -Splits the original text into individual terms according to a set of rules
- Test path_hierarchy
- POST _analyze
- {
-   "text": "/a/b/c",
-   "tokenizer": "path_hierarchy"
- }
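- For contrast, a sketch using the built-in standard tokenizer (sample text is illustrative):
- POST _analyze
- {
-   "text": "Hello, Elasticsearch world!",
-   "tokenizer": "standard"
- }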
-
Token Filters
- -Post-processes the terms emitted by the tokenizer, e.g. lowercasing, adding or removing terms
- lowercase: converts all terms to lowercase
- stop: removes stop words
- ngram and edge_ngram: split terms into n-grams
- Test stop and ngram together (tokenizer changed to standard so the stop filter sees individual words rather than one keyword token)
- POST _analyze
- {
-   "text": ["<p>zxc asd fgh start stop words yes ol</p> stop"],
-   "tokenizer": "standard",
-   "char_filter": ["html_strip"],
-   "filter": [
-     "stop",
-     {
-       "type": "ngram",
-       "min_gram": 4,
-       "max_gram": 4
-     }
-   ]
- }
- synonym: adds synonym terms (sketch below)
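- A minimal synonym sketch with an inline filter definition (the synonym pair is illustrative; assumes your ES version accepts inline filter definitions in _analyze):
- POST _analyze
- {
-   "tokenizer": "standard",
-   "filter": [
-     {
-       "type": "synonym",
-       "synonyms": ["fast, quick"]
-     }
-   ],
-   "text": "a fast fox"
- }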
Custom analyzers
Format
- PUT my_index_word
- {
-   "settings": {
-     "analysis": {
-       "char_filter": {},
-       "tokenizer": {},
-       "filter": {},
-       "analyzer": {}
-     }
-   }
- }
Example: defining a custom analyzer
- PUT myanalyzes
- {
-   "settings": {
-     "analysis": {
-       "analyzer": {
-         "mydex": {
-           "type": "custom",
-           "tokenizer": "mydeftokenizer",
-           "char_filter": [
-             "mydefchar_filter"
-           ],
-           "filter": [
-             "lowercase",
-             "asciifolding"
-           ]
-         }
-       },
-       "tokenizer": {
-         "mydeftokenizer": {
-           "type": "pattern",
-           "pattern": "[.;,/!?]"
-         }
-       },
-       "char_filter": {
-         "mydefchar_filter": {
-           "type": "mapping",
-           "mappings": [
-             ":) => help",
-             "(: => nohelp"
-           ]
-         }
-       }
-     }
-   }
- }
Using the custom analyzer
- POST myanalyzes/_analyze
- {
- "analyzer": "mydex",
- "text": ["<p>help' 1 !niad</p>"]
- }
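- To use the custom analyzer at index time, reference it from a field mapping; a minimal sketch (the field name content is illustrative):
- PUT myanalyzes/_mapping
- {
-   "properties": {
-     "content": {
-       "type": "text",
-       "analyzer": "mydex"
-     }
-   }
- }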