8. Multi-fields, Analyzers, and Templates

Latest recommended article published on 2023-07-27 16:16:15

Gedeon

Category: ELK

Article link: https://blog.csdn.net/qq_24908345/article/details/97299150

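1. Multi-fields

Multi-fields let you index the same field in more than one way:
- Exact matching on a vendor name
  - Add a keyword sub-field
- Using different analyzers
  - Different languages
  - Searching on a pinyin field
  - Different analyzers can also be specified for search and for indexing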
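
As a rough illustration of the bullets above, the following Python sketch builds a mapping body that uses multi-fields. The field names (`vendor`, `comment`), the sub-field names, and the choice of the built-in `standard`, `simple` and `english` analyzers are assumptions made for this example. A pinyin sub-field would also need a pinyin analysis plugin, which is not shown.

```python
import json

# Hypothetical mapping body: every field and sub-field name is illustrative.
mapping = {
    "mappings": {
        "properties": {
            # Full-text field with a keyword sub-field for exact matching.
            "vendor": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword", "ignore_above": 256},
                },
            },
            # Different analyzers for indexing and searching, plus a
            # sub-field analyzed with a language-specific analyzer.
            "comment": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "simple",
                "fields": {
                    "english": {"type": "text", "analyzer": "english"},
                },
            },
        }
    }
}

# The printed JSON could be used as the body of a create-index request.
print(json.dumps(mapping, indent=2))
```

With such a mapping, a query on `vendor.keyword` would match the whole string exactly, while a query on `vendor` would go through full-text analysis.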

1.1 Exact Values vs. Full Text

Exact Value: numbers, dates, or a specific string taken as a whole; the keyword type in Elasticsearch
Full Text: unstructured text data; the text type in Elasticsearch

1.1.1 Exact Values are not analyzed

Elasticsearch creates an inverted index for every field
- An Exact Value needs no special tokenization at index time
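
To make the difference concrete, here is a minimal Python sketch of an inverted index built two ways. It is a toy model, not how Elasticsearch is implemented: `keyword_analyze` and `text_analyze` are hypothetical helpers, and the lowercase-and-split step is only a crude stand-in for real full-text analysis.

```python
from collections import defaultdict


def build_inverted_index(docs, analyze):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def keyword_analyze(value):
    # Exact value: the whole string becomes a single term.
    return [value]


def text_analyze(value):
    # Crude stand-in for full-text analysis: lowercase, split on whitespace.
    return value.lower().split()


docs = {1: "Apple Store", 2: "Apple Juice"}
keyword_index = build_inverted_index(docs, keyword_analyze)
text_index = build_inverted_index(docs, text_analyze)

print(keyword_index)
print(text_index)
```

In the keyword-style index only the complete string "Apple Store" finds document 1, while in the text-style index the term "apple" finds both documents.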

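2 Custom Analyzers

When the analyzers built into Elasticsearch do not meet your needs, you can define a custom analyzer by combining three kinds of components, which are applied in this order:

Character Filter
Tokenizer
Token Filter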

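2.1 Character Filters

Character Filters process the text before it reaches the Tokenizer, for example by adding, removing, or replacing characters. Multiple Character Filters can be configured. They affect the position and offset information reported by the Tokenizer.

Some built-in Character Filters

HTML strip: removes HTML tags
Mapping: string replacement
Pattern replace: regular-expression replacement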

2.1.1 HTML strip

	# The keyword tokenizer emits the whole input as a single token;
	# the html_strip char filter removes the HTML tags first
	POST _analyze
	{
	  "tokenizer": "keyword",
	  "char_filter": ["html_strip"],
	  "text": "<b>hello world</b>"
	}

	# Response
	{
	  "tokens" : [
	    {
	      "token" : "hello world",
	      "start_offset" : 3,
	      "end_offset" : 18,
	      "type" : "word",
	      "position" : 0
	    }
	  ]
	}

2.1.2 Mapping replacement

# The mapping char filter replaces every - with _ before tokenization
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990"
}

# Response
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

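2.1.3 Pattern replace

The pattern_replace character filter uses a regular expression to match the characters to change and replaces them with the specified string. The replacement string can refer to capture groups in the regular expression.

pattern: a Java regular expression.
replacement: the replacement string, which can reference capture groups as $1 to $9.
flags: Java regular expression flags.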

  # Strip the http:// prefix by keeping only capture group 1
  GET _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "pattern_replace",
        "pattern": "http://(.*)",
        "replacement": "$1"
      }
    ],
    "text": "http://www.elastic.co"
  }

  # Response
  {
    "tokens" : [
      {
        "token" : "www.elastic.co",
        "start_offset" : 0,
        "end_offset" : 21,
        "type" : "<ALPHANUM>",
        "position" : 0
      }
    ]
  }
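
The same capture-group replacement can be tried outside Elasticsearch. The sketch below uses Python's `re` module as a stand-in, so it only approximates the filter: Python writes group references as `\1` where the Elasticsearch filter (Java regex) uses `$1`, and the two regex dialects differ in other details too. The helper name `pattern_replace` is made up for this example.

```python
import re


def pattern_replace(text, pattern, replacement):
    """Replace every match of pattern in text with replacement.

    The replacement uses Python syntax (\\1), not the $1 syntax of the
    Elasticsearch pattern_replace filter.
    """
    return re.sub(pattern, replacement, text)


stripped = pattern_replace("http://www.elastic.co", r"http://(.*)", r"\1")
print(stripped)
```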

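2.2 Tokenizer

A Tokenizer splits the original text into words (terms or tokens) according to a set of rules
Tokenizers built into Elasticsearch
- whitespace, standard, uax_url_email, pattern, keyword, path_hierarchy

You can also develop a plugin in Java to implement your own Tokenizer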

  # path_hierarchy emits one token for every level of the path
  POST _analyze
  {
    "tokenizer": "path_hierarchy",
    "text": "/user/ymruan/a/b"
  }

  # Response
  {
    "tokens" : [
      {
        "token" : "/user",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "word",
        "position" : 0
      },
      {
        "token" : "/user/ymruan",
        "start_offset" : 0,
        "end_offset" : 12,
        "type" : "word",
        "position" : 0
      },
      {
        "token" : "/user/ymruan/a",
        "start_offset" : 0,
        "end_offset" : 14,
        "type" : "word",
        "position" : 0
      },
      {
        "token" : "/user/ymruan/a/b",
        "start_offset" : 0,
        "end_offset" : 16,
        "type" : "word",
        "position" : 0
      }
    ]
  }
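
The token list above follows a simple rule: each token is a prefix of the path that ends at a component boundary. A minimal Python sketch of that rule follows. It is an illustration only, with a hypothetical function name, and it ignores the tokenizer's options and its offset and position bookkeeping.

```python
def path_hierarchy(text, delimiter="/"):
    """Return every prefix of the path that ends at a component boundary."""
    tokens = []
    current = ""
    for i, part in enumerate(text.split(delimiter)):
        # Re-join the components seen so far with the delimiter.
        current = part if i == 0 else current + delimiter + part
        if current:  # skip the empty prefix produced by a leading delimiter
            tokens.append(current)
    return tokens


tokens = path_hierarchy("/user/ymruan/a/b")
print(tokens)
```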

2.3 Token Filters

Token Filters modify, add, or remove the words (terms) output by the Tokenizer
Built-in Token Filters
- lowercase, stop, synonym (adds synonyms)

2.4 Defining a Custom Analyzer

# my_custom_analyzer combines the emoticons char_filter, the punctuation
# tokenizer, and the lowercase and english_stop token filters.
# The tokenizer, char_filter and filter sections define the custom components.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
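
To see how the three stages of this analyzer chain together, the following Python sketch imitates it on a sample sentence. It is a simulation under assumptions, not Elasticsearch code: the function names are hypothetical, the sample text is invented, and `ENGLISH_STOP` is my understanding of the default `_english_` stop word list.

```python
import re

# Assumed contents of the built-in _english_ stop word list.
ENGLISH_STOP = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def emoticons(text):
    # Character filter: mapping replacement, applied before tokenization.
    for source, target in ((":)", "_happy_"), (":(", "_sad_")):
        text = text.replace(source, target)
    return text


def punctuation(text):
    # Tokenizer: split on the pattern [ .,!?] and drop empty tokens.
    return [token for token in re.split(r"[ .,!?]", text) if token]


def my_custom_analyzer(text):
    tokens = punctuation(emoticons(text))
    tokens = [token.lower() for token in tokens]                  # lowercase
    return [token for token in tokens if token not in ENGLISH_STOP]  # stop


result = my_custom_analyzer("I'm a :) person, and you?")
print(result)  # ["i'm", '_happy_', 'person', 'you']
```

The order matters: the emoticon is rewritten before the text is split, and stop words are removed only after lowercasing.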

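3 Index Template and Dynamic Template

The number of indices in a cluster keeps growing, for example when you create one index per day for your logs. Spreading data across multiple indices makes it easier to manage and can improve performance.

3.1.1 What is an Index Template

Index Templates let you define Mappings and Settings that are applied automatically, by matching a pattern against the index name, to newly created indices
- A template only takes effect when an index is created. Changing a template does not affect indices that already exist
- You can define multiple index templates; their settings are merged together
- You can set the order value to control the merging process

3.1.2 How an Index Template works

When a new index is created:

Elasticsearch's default settings and mappings are applied
The settings from Index Templates with a lower order value are applied
The settings from Index Templates with a higher order value are applied, overriding the earlier ones
The Settings and Mappings the user specified when creating the index are applied, overriding those from the templates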
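
The precedence rules above can be sketched in a few lines of Python. This is a simplified model with assumptions stated up front: the function name and sample templates are made up, empty dicts stand in for the Elasticsearch defaults, `fnmatchcase` approximates index pattern matching, and the merge is a shallow `dict.update`, whereas real mapping merges go deeper.

```python
from fnmatch import fnmatchcase


def resolve_index_config(index_name, templates, request=None):
    """Apply matching templates from low to high order, then the request."""
    config = {"settings": {}, "mappings": {}}  # stand-in for the defaults
    matching = [
        t for t in templates
        if any(fnmatchcase(index_name, p) for p in t["index_patterns"])
    ]
    layers = sorted(matching, key=lambda t: t.get("order", 0))
    layers.append(request or {})  # explicit request settings win last
    for layer in layers:
        for section in ("settings", "mappings"):
            config[section].update(layer.get(section, {}))
    return config


# Sample templates for illustration only.
templates = [
    {"index_patterns": ["test*"], "order": 1,
     "settings": {"number_of_replicas": 2},
     "mappings": {"date_detection": False}},
    {"index_patterns": ["*"], "order": 0,
     "settings": {"number_of_shards": 1, "number_of_replicas": 1}},
]

print(resolve_index_config("testtemplate", templates))
print(resolve_index_config("logs", templates))
```

For an index named `testtemplate` the order 1 template overrides the replica count of the order 0 template, while an index named `logs` only picks up the catch-all template.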

3.1.3 Demo

Create two Index Templates

# template_default matches every index and has order 0.
# It sets one primary shard and one replica shard
PUT _template/template_default
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

# template_test matches indices whose names start with "test". Its order
# of 1 means its settings override those of the order 0 template.
# It also turns date detection off and numeric detection on
PUT /_template/template_test
{
  "index_patterns": ["test*"],
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "date_detection": false,
    "numeric_detection": true
  }
}

View the template information

GET /_template/template_default
GET /_template/temp*

Create a new index

# Index a document into a new index whose name starts with "test"
PUT testtemplate/_doc/1
{
  "someNumber": "1",
  "someDate": "2019/01/01"
}

Check the result

GET testtemplate/_mapping

# Response: someDate stays text because date detection is off, so the
# date string is not inferred as a date; someNumber is inferred as a
# numeric type because numeric detection is on
{
  "testtemplate" : {
    "mappings" : {
      "date_detection" : false,
      "numeric_detection" : true,
      "properties" : {
        "someDate" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "someNumber" : {
          "type" : "long"
        }
      }
    }
  }
}

GET testtemplate/_settings

# Response: number_of_replicas is 2, so the order 1 template has
# overridden the order 0 template
{
  "testtemplate" : {
    "settings" : {
      "index" : {
        "creation_date" : "1564111116898",
        "number_of_shards" : "1",
        "number_of_replicas" : "2",
        "uuid" : "poT_luf-Svu6oPhWIeI7fQ",
        "version" : {
          "created" : "7010099"
        },
        "provided_name" : "testtemplate"
      }
    }
  }
}

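3.2.1 What is a Dynamic Template

A Dynamic Template sets field types dynamically, based on the data type Elasticsearch detects combined with the field name. For example:
- Map all string fields to keyword, or turn off the keyword sub-field
- Map fields whose names start with is to boolean
- Map fields whose names start with long_ to long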

3.2.2 demo1

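# strings_as_boolean and strings_as_keywords are the template names.
# match_mapping_type is the detected type to match; "match": "is*" limits
# the first template to fields whose names start with "is", which are
# mapped as boolean. All other strings are mapped as keyword
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_boolean": {
          "match_mapping_type": "string",
          "match": "is*",
          "mapping": {
            "type": "boolean"
          }
        }
      },
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}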

Index a document

	PUT my_index/_doc/1
	{
	  "firstName": "Ruan",
	  "isVIP": "true"
	}

View the mapping

GET my_index/_mapping

# Response: firstName is keyword and isVIP is boolean. Without the
# dynamic templates both string values would have been mapped as text
{
  "my_index" : {
    "mappings" : {
      "dynamic_templates" : [
        {
          "strings_as_boolean" : {
            "match" : "is*",
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "boolean"
            }
          }
        },
        {
          "strings_as_keywords" : {
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "keyword"
            }
          }
        }
      ],
      "properties" : {
        "firstName" : {
          "type" : "keyword"
        },
        "isVIP" : {
          "type" : "boolean"
        }
      }
    }
  }
}
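
The result above is consistent with templates being checked in order, with the first match deciding the mapping. The Python sketch below models that behavior for these two templates. It rests on assumptions: the helper names are hypothetical, type detection is reduced to a few JSON value kinds, and `fnmatchcase` only approximates the wildcard matching Elasticsearch performs.

```python
from fnmatch import fnmatchcase

DYNAMIC_TEMPLATES = [
    {"strings_as_boolean": {"match_mapping_type": "string",
                            "match": "is*",
                            "mapping": {"type": "boolean"}}},
    {"strings_as_keywords": {"match_mapping_type": "string",
                             "mapping": {"type": "keyword"}}},
]


def detected_type(value):
    # Simplified detection; bool is checked before int on purpose,
    # because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, str):
        return "string"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "object"


def resolve_field_type(field_name, value, templates=DYNAMIC_TEMPLATES):
    """Return the type from the first matching template, or None."""
    kind = detected_type(value)
    for entry in templates:
        for rule in entry.values():
            if rule.get("match_mapping_type", kind) != kind:
                continue
            if not fnmatchcase(field_name, rule.get("match", "*")):
                continue
            return rule["mapping"]["type"]
    return None  # no template matched; default dynamic mapping would apply


print(resolve_field_type("isVIP", "true"))      # boolean
print(resolve_field_type("firstName", "Ruan"))  # keyword
```

Because `strings_as_boolean` is listed first, `isVIP` never reaches the catch-all `strings_as_keywords` template.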

3.2.3 demo2

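# path_match selects every field under name; path_unmatch excludes any
# field whose path ends in .middle. Matching fields are mapped as text
# and copied to full_name
DELETE my_index
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "full_name": {
          "path_match": "name.*",
          "path_unmatch": "*.middle",
          "mapping": {
            "type": "text",
            "copy_to": "full_name"
          }
        }
      }
    ]
  }
}

PUT my_index/_doc/1
{
  "name": {
    "first": "John",
    "middle": "Winston",
    "last": "Lennon"
  }
}

# Returns a hit: name.first was copied to full_name
GET my_index/_search?q=full_name:John
# Returns no hits: name.middle is excluded by path_unmatch
GET my_index/_search?q=full_name:Winston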
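
Why John is found and Winston is not can be traced with a small Python sketch that flattens the document into dotted paths and applies the two patterns. The helper names are hypothetical and `fnmatchcase` is used as an approximation of the path matching, so treat this as a model of the outcome rather than of the engine.

```python
from fnmatch import fnmatchcase


def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def copied_values(doc, path_match, path_unmatch):
    """Collect values whose path matches path_match but not path_unmatch."""
    return [
        value for path, value in flatten(doc)
        if fnmatchcase(path, path_match) and not fnmatchcase(path, path_unmatch)
    ]


doc = {"name": {"first": "John", "middle": "Winston", "last": "Lennon"}}
full_name = copied_values(doc, "name.*", "*.middle")
print(full_name)  # ['John', 'Lennon']
```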