Elasticsearch: Using Analyzers Defined in Index Templates

Elasticsearch 7.2.0
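
The ik and pinyin analyzers used below come from medcl's elasticsearch-analysis-ik and elasticsearch-analysis-pinyin plugins, which must be installed on every node before the templates will work. A minimal install sketch, assuming the standard release URL pattern for 7.2.0 (match the version to your cluster and restart the node afterwards):

bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.2.0/elasticsearch-analysis-pinyin-7.2.0.zip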

1. Define an ik + english analyzer

2. Define an ik + english + synonym analyzer

3. Define an english + pinyin analyzer

4. Use the analyzers

5. Test the analysis results


1. Define an ik + english analyzer

A note on "index_patterns": ["*"]: it matches all indices, meaning every index can use the analyzers from this template, but that is not the same as them being applied by default (unless you give the template a higher precedence). Only the contents of the template named template_default are applied by default. The same applies to the templates below.
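
To make the precedence point concrete: when several legacy templates match a new index, their settings are merged, and on conflicting keys the template with the higher order wins. A minimal sketch (hypothetical template names, only the conflicting setting shown):

POST _template/low_priority_defaults
{
  "index_patterns": ["*"],
  "order": 0,
  "settings": { "number_of_replicas": 1 }
}

POST _template/high_priority_overrides
{
  "index_patterns": ["*"],
  "order": 1,
  "settings": { "number_of_replicas": 0 }
}

A new index matching both templates ends up with number_of_replicas = 0, because order 1 outranks order 0.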


POST _template/ik_en_analyzer
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_en_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["en_stemmer"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
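
To confirm the template was stored, fetch it back:

GET _template/ik_en_analyzer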

2. Define an ik + english + synonym analyzer

Prepare the synonym file in advance

Create synonym.txt under elasticsearch/config/analysis (you need to create the analysis directory yourself).

synonym.txt

马铃薯,土豆
番茄,西红柿
i,me,我
you,你
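
One caveat: the synonym file is read when the analyzer is instantiated, so later edits to synonym.txt are not picked up by an existing index until it is closed and reopened (a sketch, using the test_analyzer index created in step 4):

POST test_analyzer/_close
POST test_analyzer/_open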

Template definition

POST _template/template_synonym
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_en_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["en_stemmer", "interim_synonym"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "interim_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

3. Define an english + pinyin analyzer

Personally I would not recommend combining the pinyin tokenizer with ik (unless you have a specific need for it).

POST _template/template_pinyin
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "en_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "pinyin",
          "filter": ["en_stemmer"]
        }
      },
      "filter": {
        "en_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}
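
The pinyin tokenizer's defaults explain the output in step 5: full pinyin syllables plus first-letter abbreviations such as "zb". The plugin exposes options to tune this; a hedged sketch with a custom tokenizer (names like my_pinyin_tokenizer are made up here, and the parameters are taken from the plugin's README):

POST _template/template_pinyin_custom
{
  "index_patterns": ["*"],
  "order": 1,
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pinyin_tokenizer": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_full_pinyin": true,
          "keep_original": true
        }
      },
      "analyzer": {
        "my_pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "my_pinyin_tokenizer"
        }
      }
    }
  }
}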

4. Use the analyzers

Once the templates are in place, the analyzers can be used directly when defining an index.

The index below defines three fields, each using one of the analyzers just defined.

PUT test_analyzer
{
  "settings": {
    "number_of_shards": "1",
    "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "pinyin": {
        "type": "text",
        "analyzer": "en_pinyin_analyzer"
      },
      "ik": {
        "type": "text",
        "analyzer": "ik_en_analyzer"
      },
      "synonym": {
        "type": "text",
        "analyzer": "ik_en_synonym_analyzer"
      }
    }
  }
}
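
With the index in place, you can index a document that exercises all three fields (a minimal sketch; the values are chosen to match the analysis tests below):

POST test_analyzer/_doc
{
  "pinyin": "赵本山",
  "ik": "我有100个梨 I have 100 pears",
  "synonym": "土豆的果实是长在底下的"
}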

5. Test the analysis results

Test pinyin + english


GET test_analyzer/_analyze
{
  "field": "pinyin",
  "text": "赵本山"
}

GET test_analyzer/_analyze
{
  "field": "pinyin",
  "text": "zhaobenshan"
}


Result of the first request (赵本山):
{
  "tokens" : [
    {
      "token" : "zhao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zb",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
Result of the second request (zhaobenshan):
{
  "tokens" : [
    {
      "token" : "zhao",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zhaobenshan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "shan",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
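
Both the Chinese text and its romanization produce the same full-pinyin tokens, so a pinyin query can find a document stored in Chinese (a sketch, assuming the sample document indexed in step 4):

GET test_analyzer/_search
{
  "query": {
    "match": {
      "pinyin": "zhaobenshan"
    }
  }
}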

Test ik + english

GET test_analyzer/_analyze
{
  "field": "ik",
  "text": "我有100个梨 I have 100 pears"
}


Result:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "有",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "100",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "ARABIC",
      "position" : 2
    },
    {
      "token" : "个",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "COUNT",
      "position" : 3
    },
    {
      "token" : "梨",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "i",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "have",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "ENGLISH",
      "position" : 6
    },
    {
      "token" : "100",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "ARABIC",
      "position" : 7
    },
    {
      "token" : "pear", //把复数形式的s去掉了
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "ENGLISH",
      "position" : 8
    }
  ]
}
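
Because the stemmer runs at both index and search time, "pear" and "pears" reduce to the same token, so either form matches (a sketch against the sample document):

GET test_analyzer/_search
{
  "query": {
    "match": {
      "ik": "pears"
    }
  }
}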

Test synonyms

GET test_analyzer/_analyze
{
  "field": "synonym",
  "text": "土豆的果实是长在底下的"
}

Result: the output contains both 土豆 and 马铃薯
{
  "tokens" : [
    {
      "token" : "土豆",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "马铃薯", 
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "的",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "果实",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "实是",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "长在",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "底下",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "的",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "CN_CHAR",
      "position" : 6
    }
  ]
}
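
Since 马铃薯 is indexed as a SYNONYM token at the same position as 土豆, a search for either word finds the document (a sketch against the sample document):

GET test_analyzer/_search
{
  "query": {
    "match": {
      "synonym": "马铃薯"
    }
  }
}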

 

 
