Elasticsearch: Using Analyzers Defined in Index Templates

Elasticsearch 7.2.0
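
The ik and pinyin analyzers used below come from medcl's elasticsearch-analysis-ik and elasticsearch-analysis-pinyin plugins, which must be installed on every node before the templates will work. A minimal install sketch, assuming the standard release URL pattern for 7.2.0 (match the version to your cluster and restart the node afterwards):

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.2.0/elasticsearch-analysis-pinyin-7.2.0.zip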

1. Define an ik + english analyzer

2. Define an ik + english + synonym analyzer

3. Define an english + pinyin analyzer

4. Use the analyzers

5. Test the analysis results


1. Define an ik + english analyzer

A note on "index_patterns": ["*"]: it matches all indices, meaning every index can use the analyzers from this template, but that is not the same as them being applied by default (unless you give the template a higher precedence). Only the contents of the template named template_default are applied by default. The same applies to the templates below.
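
To make the precedence point concrete: when several legacy templates match a new index, their settings are merged, and on conflicting keys the template with the higher order wins. A minimal sketch (hypothetical template names, only the conflicting setting shown):

POST _template/low_priority_defaults
{
  "index_patterns": ["*"],
  "order": 0,
  "settings": { "number_of_replicas": 1 }
}

POST _template/high_priority_overrides
{
  "index_patterns": ["*"],
  "order": 1,
  "settings": { "number_of_replicas": 0 }
}

A new index matching both templates ends up with number_of_replicas = 0, because order 1 outranks order 0.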


POST _template/ik_en_analyzer
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_en_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["en_stemmer"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
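
To confirm the template was stored, fetch it back:

GET _template/ik_en_analyzer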

2. Define an ik + english + synonym analyzer

Prepare the synonym file in advance

Create synonym.txt under elasticsearch/config/analysis (you need to create the analysis directory yourself).

synonym.txt

马铃薯,土豆
番茄,西红柿
i,me,我
you,你
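
One caveat: the synonym file is read when the analyzer is instantiated, so later edits to synonym.txt are not picked up by an existing index until it is closed and reopened (a sketch, using the test_analyzer index created in step 4):

POST test_analyzer/_close
POST test_analyzer/_open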

Template definition

POST _template/template_synonym
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_en_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["en_stemmer", "interim_synonym"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "interim_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

3. Define an english + pinyin analyzer

Personally I would not recommend combining the pinyin tokenizer with ik (unless you have a specific need for it).

POST _template/template_pinyin
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "en_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "pinyin",
          "filter": ["en_stemmer"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
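
The pinyin tokenizer's defaults explain the output in step 5: full pinyin syllables plus first-letter abbreviations such as "zb". The plugin exposes options to tune this; a hedged sketch with a custom tokenizer (names like my_pinyin_tokenizer are made up here, and the parameters are taken from the plugin's README):

POST _template/template_pinyin_custom
{
  "index_patterns": ["*"],
  "order": 1,
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pinyin_tokenizer": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true
        }
      },
      "analyzer": {
        "my_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "my_pinyin_tokenizer"
        }
      }
    }
  }
}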

4. Use the analyzers

Once the templates are in place, the analyzers can be used directly when defining an index.

The index below defines three fields, each using one of the analyzers just defined.

PUT test_analyzer
{
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "pinyin": {
        "type": "text",
        "analyzer": "en_pinyin_analyzer"
      },
      "ik": {
        "type": "text",
        "analyzer": "ik_en_analyzer"
      },
      "synonym": {
        "type": "text",
        "analyzer": "ik_en_synonym_analyzer"
      }
    }
  }
}
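
With the index in place, you can index a document that exercises all three fields (a minimal sketch; the values are chosen to match the analysis tests below):

POST test_analyzer/_doc
{
  "pinyin": "赵本山",
  "ik": "我有100个梨 I have 100 pears",
  "synonym": "土豆的果实是长在底下的"
}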

5. Test the analysis results

Test pinyin + english


GET test_analyzer/_analyze
{
  "field": "pinyin",
  "text": "赵本山"
}

GET test_analyzer/_analyze
{
  "field": "pinyin",
  "text": "zhaobenshan"
}


Result of the first request (赵本山):
{
  "tokens" : [
    {
      "token" : "zhao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zb",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
Result of the second request (zhaobenshan):
{
  "tokens" : [
    {
      "token" : "zhao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zhaobenshan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
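
Both the Chinese text and its romanization produce the same full-pinyin tokens, so a pinyin query can find a document stored in Chinese (a sketch, assuming the sample document indexed in step 4):

GET test_analyzer/_search
{
  "query": {
    "match": {
      "pinyin": "zhaobenshan"
    }
  }
}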

Test ik + english

GET test_analyzer/_analyze
{
  "field": "ik",
  "text": "我有100个梨 I have 100 pears"
}


Result:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "有",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "100",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "ARABIC",
      "position" : 2
    },
    {
      "token" : "个",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "COUNT",
      "position" : 3
    },
    {
      "token" : "梨",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "i",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "have",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 6
    },
    {
      "token" : "100",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "ARABIC",
      "position" : 7
    },
    {
      "token" : "pear", //把复数形式的s去掉了
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "ENGLISH",
      "position" : 8
    }
  ]
}
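
Because the stemmer runs at both index and search time, "pear" and "pears" reduce to the same token, so either form matches (a sketch against the sample document):

GET test_analyzer/_search
{
  "query": {
    "match": {
      "ik": "pears"
    }
  }
}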

Test synonyms

GET test_analyzer/_analyze
{
  "field": "synonym",
  "text": "土豆的果实是长在底下的"
}

Result: the output contains both 土豆 and 马铃薯
{
  "tokens" : [
    {
      "token" : "土豆",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "马铃薯", 
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "果实",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "实是",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长在",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "底下",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "的",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 6
    }
  ]
}
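
Since 马铃薯 is indexed as a SYNONYM token at the same position as 土豆, a search for either word finds the document (a sketch against the sample document):

GET test_analyzer/_search
{
  "query": {
    "match": {
      "synonym": "马铃薯"
    }
  }
}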

 

 
