Elasticsearch Documentation: Mapping

songtaiwu

Last modified: 2024-08-16 17:32:33

Tags: elasticsearch, big data, search engine

First published: 2024-08-14 17:30:19

Article link: https://blog.csdn.net/songtaiwu/article/details/141196220

Dynamic mapping | Elasticsearch Guide [8.15] | Elastic

Dynamic mapping

Dynamic field mapping

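When Elasticsearch detects a new field in a document, it dynamically adds the field to the type mapping by default. The dynamic parameter controls this behavior.

You can explicitly instruct Elasticsearch to create fields dynamically from incoming documents by setting the dynamic parameter to true or runtime. When dynamic field mapping is enabled, Elasticsearch uses the rules in the following table to decide how to map the data type of each field.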

JSON data type	"dynamic": "true"	"dynamic": "runtime"
null	No field added	No field added
true or false	boolean	boolean
double	float	double
long	long	long
object	object	No field added
array	Depends on the first non-null value in the array	Depends on the first non-null value in the array
string that passes date detection	date	date
string that passes numeric detection	float or long	double or long
string that doesn't pass date detection or numeric detection	text with a .keyword sub-field	keyword

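You can disable dynamic mapping at both the document level and the object level. Setting the dynamic parameter to false ignores new fields, and setting it to strict rejects any document that contains an unknown field.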
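The table above can be read as a small decision procedure. The Python sketch below only illustrates those rules and is not how Elasticsearch implements them: the function name infer_dynamic_type is hypothetical, and strings are mapped as if neither date detection nor numeric detection matched.

```python
def infer_dynamic_type(value, mode="true"):
    """Return the field type from the table, or None when no field is added.

    `value` is a JSON value already parsed into Python; `mode` is the
    value of the dynamic parameter, either "true" or "runtime".
    """
    if value is None:
        return None  # null: no field added
    if isinstance(value, bool):  # must be checked before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float" if mode == "true" else "double"
    if isinstance(value, dict):
        return "object" if mode == "true" else None
    if isinstance(value, list):
        # Arrays depend on the first non-null value.
        for item in value:
            if item is not None:
                return infer_dynamic_type(item, mode)
        return None
    if isinstance(value, str):
        # Assumption: the string passed neither date nor numeric detection.
        return "text" if mode == "true" else "keyword"
    raise TypeError(f"not a JSON value: {value!r}")
```

For instance, infer_dynamic_type(1.5) gives "float", while the same value with mode="runtime" gives "double".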

Date detection

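If date detection is enabled (the default), new string fields are checked to see whether their contents match any of the date patterns specified in dynamic_date_formats. If a match is found, a new date field is added to the mapping with the corresponding format.

The default value for dynamic_date_formats is:

[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]

For example: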


PUT my-index-000001/_doc/1
{
  "create_date": "2015/09/02"
}

GET my-index-000001/_mapping

In the example above, indexing the document creates a date field based on the format of the create_date value. The resulting mapping is:

{
	"my-index-000001": {
		"mappings": {
			"properties": {
				"create_date": {
					"type": "date",
					"format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
				}
			}
		}
	}
}

Disabling date detection

Dynamic date detection can be disabled by setting date_detection to false.

PUT my-index-000001
{
  "mappings": {
    "date_detection": false
  }
}

PUT my-index-000001/_doc/1 
{
  "create_date": "2015/09/02"
}

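The requests above first create an index named my-index-000001 with date detection disabled, then index a document. The create_date field is detected as a new field and is mapped as text.

Retrieving the mapping shows: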

GET  my-index-000001/_mapping

{
	"my-index-000001": {
		"mappings": {
			"date_detection": false,
			"properties": {
				"create_date": {
					"type": "text",
					"fields": {
						"keyword": {
							"type": "keyword",
							"ignore_above": 256
						}
					}
				}
			}
		}
	}
}

Customizing detected date formats

Alternatively, dynamic_date_formats can be customized to support your own date formats.

PUT my-index-000001
{
  "mappings": {
    "dynamic_date_formats": ["MM/dd/yyyy"]
  }
}

PUT my-index-000001/_doc/1
{
  "create_date": "09/25/2015"
}

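There is a difference between configuring an array of date patterns and configuring multiple patterns in a single string separated by ||.

When you configure an array of date patterns, the pattern that matches the date in the first document with an unmapped date field determines the format of that field.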


PUT my-index-000001
{
  "mappings": {
    "dynamic_date_formats": [ "yyyy/MM", "MM/dd/yyyy"]
  }
}

PUT my-index-000001/_doc/1
{
  "create_date": "09/25/2015"
}

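The array above defines two date patterns, and the indexed value matches the second one. Retrieving the mapping shows that the format of the field is "MM/dd/yyyy". Documents indexed later with a date in the yyyy/MM format are no longer accepted for this field.

When you configure multiple patterns in a single string separated by ||, the mapping supports any of the configured formats. This lets you index documents that use different formats.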

PUT my-index-000001
{
  "mappings": {
    "dynamic_date_formats": [ "yyyy/MM||MM/dd/yyyy"]
  }
}

PUT my-index-000001/_doc/1
{
  "create_date": "09/25/2015"
}

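With the patterns defined in a single string, retrieving the mapping after indexing the document shows that the format of the field is "yyyy/MM||MM/dd/yyyy".

epoch_millis and epoch_second are not supported as dynamic date formats.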
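The difference between the array form and the || form can be sketched in Python. This is a simplified model of the behavior described above, not Elasticsearch code: the helper names and the STRPTIME table, which translates only the two Java-style patterns used in these examples into strptime directives, are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical translation table covering only the patterns used in this section.
STRPTIME = {"yyyy/MM": "%Y/%m", "MM/dd/yyyy": "%m/%d/%Y"}


def matches(entry, value):
    """True if `value` parses with any ||-separated pattern in `entry`."""
    for pattern in entry.split("||"):
        try:
            datetime.strptime(value, STRPTIME[pattern])
            return True
        except ValueError:
            continue
    return False


def detect_format(dynamic_date_formats, first_value):
    """Return the array entry that becomes the field's format, or None."""
    for entry in dynamic_date_formats:
        if matches(entry, first_value):
            return entry  # the whole entry, including any || alternatives
    return None


# Array form: only the matching pattern is kept as the field's format.
array_format = detect_format(["yyyy/MM", "MM/dd/yyyy"], "09/25/2015")
# Single-string form: both alternatives stay available for later documents.
joined_format = detect_format(["yyyy/MM||MM/dd/yyyy"], "09/25/2015")
```

Here array_format is "MM/dd/yyyy", so a later value such as "2015/09" no longer matches, while joined_format keeps both alternatives and still accepts it.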

Numeric detection

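While JSON supports native floating point and integer data types, some applications or languages may render numbers as strings. Usually the correct solution is to map these fields explicitly, but numeric detection (disabled by default) can be enabled to do this automatically.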

PUT my-index-000001
{
  "mappings": {
    "numeric_detection": true
  }
}

PUT my-index-000001/_doc/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}

In the dynamically created mapping, my_float is mapped as float and my_integer is mapped as long.
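As a rough illustration of the outcome, the Python sketch below classifies a string the way the example above ends up mapped. It is an approximation only: the function name is hypothetical, and Python's int() and float() accept some inputs (such as "1_000" or "nan") that Elasticsearch's numeric detection may treat differently.

```python
def detect_numeric(text, mode="true"):
    """Approximate numeric detection for a string value.

    Returns "long" for whole numbers, "float" (or "double" when the
    dynamic parameter is "runtime") for decimals, and None otherwise.
    """
    try:
        int(text)
        return "long"
    except ValueError:
        pass
    try:
        float(text)
        return "float" if mode == "true" else "double"
    except ValueError:
        return None  # not numeric: falls through to the string rules
```

With the document above, detect_numeric("1.0") gives "float" and detect_numeric("1") gives "long".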

Dynamic templates

Explicit mapping

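You know more about your data than Elasticsearch can guess, so while dynamic mapping is useful for getting started, at some point you will want to specify your own explicit mappings.

You can create field mappings when you create an index and when you add fields to an existing index.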

Create an index with an explicit mapping

You can use the create index API to create a new index with an explicit mapping.

PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "age":    { "type": "integer" },  
      "email":  { "type": "keyword"  }, 
      "name":   { "type": "text"  }     
    }
  }
}

In the request above, age is an integer field, email is a keyword field, and name is a text field.

Add a field to an existing mapping

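You can use the update mapping API to add one or more new fields to an existing index.

The following example adds employee-id, a keyword field with the index mapping parameter set to false. This means values for the employee-id field are stored but are not indexed or available for search.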

PUT /my-index-000001/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false
    }
  }
}

Update the mapping of a field

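Except for supported mapping parameters, you can't change the mapping or field type of an existing field. Changing an existing field could invalidate data that is already indexed.

If you need to change the mapping of a field in a data stream's backing indices, see "Change mappings and settings for a data stream".

If you need to change the mapping of a field in other indices, create a new index with the correct mapping and reindex your data into that index.

Renaming a field would invalidate data already indexed under the old field name. Instead, add an alias field to create an alternate field name.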

View the mapping of an index

You can use the get mapping API to view the mapping of an existing index.

GET /my-index-000001/_mapping

View the mapping of specific fields

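If you only want to view the mapping of one or more specific fields, you can use the get field mapping API.

This is useful when your index contains a large number of fields and you don't need the complete mapping.

The following request retrieves the mapping for the employee-id field only.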

GET /my-index-000001/_mapping/field/employee-id

Runtime fields

A runtime field is a field that is evaluated at query time. Runtime fields enable you to:

Add fields to existing documents without reindexing your data.

Benefits

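Because runtime fields aren't indexed, adding a runtime field doesn't increase the index size. You define runtime fields directly in the index mapping, which saves storage costs and increases ingestion speed. Once you define a runtime field, you can immediately use it in search requests, aggregations, filtering, and sorting.

If you change a runtime field into an indexed field, you don't need to modify any queries that refer to the runtime field.

At its core, the most important benefit of runtime fields is the ability to add fields to documents after you have ingested them. This capability simplifies mapping decisions, because you don't have to decide how to parse your data up front and can use runtime fields to amend the mapping at any time. Using runtime fields allows for a smaller index and faster ingest time, which together use fewer resources and reduce your operating costs.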

Incentives

Map a runtime field

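You map runtime fields by adding a runtime section under the mapping definition and defining a Painless script.

The script has access to the entire context of a document, including the original _source (via params._source) and any mapped fields plus their values. At query time, the script runs and generates values for each scripted field.

Emitting runtime field values

When you define a Painless script to use with runtime fields, you must call the emit method to output the calculated values.

For example, the script in the following request calculates the day of the week from the @timestamp field, which is defined as a date type. The script computes the day of the week from the value of the timestamp and uses emit to return the result.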

PUT my-index-000001/
{
  "mappings": {
    "runtime": {
      "day_of_week": {
        "type": "keyword",
        "script": {
          "source": "emit(doc['@timestamp'].value.dayOfWeekEnum.getDisplayName(TextStyle.FULL, Locale.ROOT))"
        }
      }
    },
    "properties": {
      "@timestamp": {"type": "date"}
    }
  }
}
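To make the calculation concrete, here is a Python sketch of what the script computes for a single document. It mirrors only the day-of-week lookup, not Painless or the emit mechanism; the function name and the hard-coded English day names (standing in for TextStyle.FULL with Locale.ROOT) are assumptions for illustration.

```python
from datetime import datetime, timezone

# English day names, Monday first, matching datetime.weekday() ordering.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_of_week(timestamp):
    """Return the full day name for a timezone-aware datetime, evaluated in UTC."""
    return DAY_NAMES[timestamp.astimezone(timezone.utc).weekday()]


# The value a document with @timestamp 2015-09-02T00:00:00Z would produce.
example = day_of_week(datetime(2015, 9, 2, tzinfo=timezone.utc))
```

Because the value is computed at query time, changing the script changes the result for every document without reindexing.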

The runtime section can use any of these data types:

boolean
composite
date
double
geo_point
ip
keyword
long
lookup

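Runtime fields with a type of date can accept the format parameter exactly as the date field type does.

Runtime fields with a type of lookup allow you to retrieve fields from related indices.

If dynamic field mapping is enabled with the dynamic parameter set to runtime, new fields are automatically added to the index mapping as runtime fields.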


PUT my-index-000001
{
  "mappings": {
    "dynamic": "runtime",
    "properties": {
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

Define runtime fields without a script

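Runtime fields typically include a Painless script that manipulates data in some way. However, there are cases where you might define a runtime field without a script. For example, if you want to retrieve a single field from _source without making changes, you don't need a script. You can simply create a runtime field without one, such as day_of_week: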

PUT my-index-000001/
{
  "mappings": {
    "runtime": {
      "day_of_week": {
        "type": "keyword"
      }
    }
  }
}

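When no script is provided, Elasticsearch implicitly looks in _source at query time for a field with the same name as the runtime field. If such a field exists, its value is returned as the value of the runtime field. If it does not exist, the response doesn't include any values for that runtime field.

In most cases, retrieve field values through doc_values whenever possible. Accessing doc_values is faster than retrieving values from _source because of how the data is loaded from Lucene.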

Ignoring script errors on runtime fields

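Scripts can throw errors at runtime, for example when accessing missing or invalid values in documents or when performing invalid operations. The on_script_error parameter controls what happens in that case. Setting it to continue silently ignores all errors on the runtime field. The default value, fail, causes a shard failure that is reported in the search response.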

Updating and removing runtime fields

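You can update or remove runtime fields at any time. To replace an existing runtime field, add a new runtime field with the same name to the mapping. To remove a runtime field from the mapping, set its value to null: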

PUT my-index-000001/_mapping
{
 "runtime": {
   "day_of_week": null
 }
}
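The replace-or-remove rule can be modelled as a small merge over the runtime section. The Python sketch below illustrates only that rule and is not Elasticsearch code; the function name and the dictionary representation of the runtime section are assumptions.

```python
def apply_runtime_update(runtime, update):
    """Return a new runtime section after applying an update request body.

    A definition replaces any existing runtime field with the same name;
    a None value (JSON null) removes the field.
    """
    result = dict(runtime)
    for name, definition in update.items():
        if definition is None:
            result.pop(name, None)  # null removes the runtime field
        else:
            result[name] = definition  # same name replaces the old definition
    return result


current = {"day_of_week": {"type": "keyword"}}
after_removal = apply_runtime_update(current, {"day_of_week": None})
```

After the request above, after_removal is an empty dictionary, while passing a new definition under the same name would replace the old one instead.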

Field data types

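Each field has a field data type, or field type. This type indicates the kind of data the field contains, such as strings or boolean values, and its intended use. For example, you can index strings to both text and keyword fields. However, text field values are analyzed for full-text search, while keyword strings are left as-is for filtering and sorting.

Field types are grouped by family. Types in the same family have exactly the same search behavior but may differ in space usage or performance characteristics.

Currently, there are two type families, keyword and text. Other type families have only a single field type. For example, the boolean type family consists of one field type: boolean.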

Common types

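binary
- Binary value encoded as a Base64 string.
boolean
- true and false values.
Keywords
- The keyword family, including keyword, constant_keyword, and wildcard.
Numbers
- Numeric types, such as long and double.
Dates
- Date types, including date and date_nanos.
alias
- Defines an alias for an existing field.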

Objects and relational types

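object
- A JSON object.
flattened
- An entire JSON object as a single field value.
nested
- A JSON object that preserves the relationship between its subfields.
join
- Defines a parent/child relationship for documents in the same index.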

Structured data types

Range
- Range types, such as long_range, double_range, date_range, and ip_range.
ip
- IPv4 and IPv6 addresses.
version
- Software versions. Supports Semantic Versioning precedence rules.
murmur3
- Computes and stores hashes of values.

Aggregate data types

aggregate_metric_double
histogram

Text search types

text fields
annotated-text
completion
search_as_you_type
semantic_text
token_count

Document ranking types

dense_vector

sparse_vector

rank_feature

rank_features

Spatial data types

geo_point

geo_shape

point

shape

Arrays

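In Elasticsearch, there is no dedicated array data type. Any field can contain zero or more values by default, but all values in the array must be of the same data type. For instance:

an array of strings: ["one", "two"]
an array of integers: [1, 2]
an array of arrays: [1, [2, 3]], which is the equivalent of [1, 2, 3]
an array of objects: [{"name": "Mary", "age": 12}, {"name": "John", "age": 10}]

When a field is added dynamically, the first value in the array determines the field type. All subsequent values must be of the same data type, or it must at least be possible to coerce them to the same data type.

Arrays with a mixture of data types are not supported: [10, "some string"]

An array may contain null values, which are either replaced by the configured null_value or skipped entirely. An empty array [] is treated as a missing field, that is, a field with no values.

Nothing needs to be configured in advance to use arrays in documents; they are supported out of the box.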

PUT my-index-000001/_doc/1
{
  "message": "some arrays in this document...",
  "tags":  [ "elasticsearch", "wow" ], 
  "lists": [ 
    {
      "name": "prog_list",
      "description": "programming list"
    },
    {
      "name": "cool_list",
      "description": "cool stuff list"
    }
  ]
}

PUT my-index-000001/_doc/2 
{
  "message": "no arrays in this document...",
  "tags":  "elasticsearch",
  "lists": {
    "name": "prog_list",
    "description": "programming list"
  }
}

GET my-index-000001/_search
{
  "query": {
    "match": {
      "tags": "elasticsearch" 
    }
  }
}

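In the requests above:

The tags field is dynamically added as a string field.
The lists field is dynamically added as an object field.
The second document contains no arrays, but it is indexed into the same fields.
The query looks for elasticsearch in the tags field and matches both documents.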
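The flattening and null handling described earlier in this section can be sketched in Python. This is an illustrative model of how a field's values can be thought of, not Elasticsearch's indexing code; the function name and the null_value argument (standing in for the null_value mapping parameter) are assumptions.

```python
def flatten_values(value, null_value=None):
    """Return the flat list of values a field effectively holds.

    Nested arrays are flattened, a single value becomes a one-element list,
    and nulls are skipped unless a null_value replacement is given.
    """
    if isinstance(value, list):
        flat = []
        for item in value:
            flat.extend(flatten_values(item, null_value))
        return flat
    if value is None:
        return [] if null_value is None else [null_value]
    return [value]
```

This also shows why the second document, whose tags field holds a single string rather than an array, fits the same field: flatten_values("elasticsearch") is simply a one-element list.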

Boolean field type

Boolean fields accept JSON true and false values, and also accept strings that are interpreted as either true or false:

False values	false, "false", "" (empty string)
True values	true, "true"
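The accepted values can be expressed as a small parsing function. The Python sketch below covers only the values listed above and is not Elasticsearch's parser; the function name is hypothetical, and raising ValueError for anything else is an assumption standing in for a malformed-value error.

```python
def parse_boolean(value):
    """Interpret a JSON value the way the table above describes."""
    if value is True or value == "true":
        return True
    if value is False or value in ("false", ""):
        return False
    # Assumption: anything else is treated as malformed.
    raise ValueError(f"not a boolean value: {value!r}")
```

Note that the empty string counts as false, while values such as 1 or "yes" are not in the table and are rejected by this sketch.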

For example:


PUT my-index-000001
{
  "mappings": {
    "properties": {
      "is_published": {
        "type": "boolean"
      }
    }
  }
}

POST my-index-000001/_doc/1?refresh
{
  "is_published": "true" 
}

GET my-index-000001/_search
{
  "query": {
    "term": {
      "is_published": true 
    }
  }
}

The document is indexed with "true", which is interpreted as true.

The search uses a JSON true.

Parameters for boolean fields

The following parameters are accepted by boolean fields:

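doc_values	Whether the field is stored on disk in a column-stride fashion so that it can later be used for sorting, aggregations, or scripting. Accepts true (default) or false.
index	Whether the field is quickly searchable. Accepts true (default) or false. Fields that only have doc_values enabled can still be queried with term or range-based queries, albeit more slowly.
ignore_malformed	By default, trying to index the wrong data type into a field throws an exception and rejects the whole document. If this parameter is set to true, the exception is ignored: the malformed field is not indexed, but the other fields in the document are processed normally. Accepts true or false. Note that this cannot be set if the script parameter is used.
null_value	Accepts any of the true or false values listed above. The value is substituted for any explicit null values. Defaults to null, which means the field is treated as missing. Note that this cannot be set if the script parameter is used.
on_script_error	Defines what to do if the script defined by the script parameter throws an error at indexing time. Accepts fail (default), which causes the entire document to be rejected, and continue, which registers the field in the document's _ignored metadata field and continues indexing. This parameter can only be set if the script parameter is also set.
script	If this parameter is set, the field indexes values generated by the script rather than reading them directly from the source. If a value is set for this field in the input document, the document is rejected with an error. Scripts use the same format as their runtime equivalent.
store	Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).
meta	Metadata about the field.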

Synthetic _source

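Synthetic _source is Generally Available only for TSDB indices (indices that have index.mode set to time_series). For other indices, synthetic _source is in technical preview. Features in technical preview may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

boolean fields support synthetic _source in their default configuration. Synthetic _source cannot be used together with copy_to or with doc_values disabled.

Synthetic source always sorts boolean fields. For example: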

PUT idx
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "bool": { "type": "boolean" }
    }
  }
}
PUT idx/_doc/1
{
  "bool": [true, false, true, false]
}

Will become:

{
  "bool": [false, false, true, true]
}
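The reordering can be reproduced in Python, where False sorts before True. This is only an analogy for the output shown above, not a description of how Elasticsearch builds synthetic _source; the helper name is hypothetical.

```python
def synthetic_boolean_values(values):
    """Return boolean values in sorted order, false before true."""
    return sorted(values)


reordered = synthetic_boolean_values([True, False, True, False])
```

The practical consequence is that the original order of the values in a boolean array is not preserved when _source is synthesized.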

Text type family

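The text family includes the following field types:

text, the traditional field type for full-text content such as the body of an email or the description of a product.
match_only_text, a space-optimized variant of text that disables scoring and performs slower on queries that need positions. It is best suited for indexing log messages.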

Text field type

Use a field as both text and keyword

Parameters for text fields

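analyzer	The analyzer used for the text field both at index time and at search time (unless overridden by search_analyzer). Defaults to the default index analyzer, or the standard analyzer.
eager_global_ordinals	Whether global ordinals are loaded eagerly on refresh. Accepts true or false (default). Enabling this is a good idea on fields that are frequently used for terms aggregations.
fielddata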
fielddata_frequency_filter
fields
index
index_options
index_prefixes
index_phrases
norms
position_increment_gap
store
search_analyzer
search_quote_analyzer
similarity
term_vector
meta

Synthetic _source

fielddata mapping parameter

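text fields are searchable by default, but by default they are not available for aggregations, sorting, or scripting. If you try to sort, aggregate, or access values from a text field using a script, you get an exception indicating that field data is not available on the field. To load field data into memory, set fielddata=true on the field.

Loading field data into memory can consume a significant amount of memory.

Field data is the only way to access the analyzed tokens of a full-text field in aggregations, sorting, or scripting. For example, a full-text field containing New York is analyzed as new and york. Aggregating on these tokens requires field data.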

Before enabling fielddata

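It usually doesn't make sense to enable fielddata on text fields. Field data is stored in the heap because it is expensive to calculate. Calculating field data can cause latency spikes, and increased heap usage is one cause of cluster performance issues.

Most users who want to do more with text fields use multi-field mappings, with a text field for full-text search and an unanalyzed keyword field for aggregations, as in the following example: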

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": { 
        "type": "text",
        "fields": {
          "keyword": { 
            "type": "keyword"
          }
        }
      }
    }
  }
}

Use the my_field field for searches.

Use the my_field.keyword field for aggregations, sorting, or in scripts.

Enabling fielddata on text fields

You can enable fielddata on an existing text field using the update mapping API:

PUT my-index-000001/_mapping
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}

fielddata_frequency_filter mapping parameter

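Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency.

The frequency filter lets you load only the terms whose document frequency falls between a min and a max value, each of which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (for example, 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of documents that have a value for the field, as opposed to all documents in the segment.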
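The absolute-versus-percentage rule can be sketched for a single segment in Python. This is a simplified model, not Elasticsearch's implementation: the function and argument names are hypothetical, and treating both bounds as inclusive is an assumption made for the sketch.

```python
def filter_terms(doc_freq, docs_with_field, min_freq, max_freq):
    """Return the terms of one segment that pass the frequency filter.

    doc_freq maps each term to the number of documents containing it;
    docs_with_field is the number of documents that have a value for the field.
    """
    def to_count(bound):
        # Bounds above 1.0 are absolute counts; otherwise they are fractions
        # of the documents that have a value for the field.
        return bound if bound > 1.0 else bound * docs_with_field

    low, high = to_count(min_freq), to_count(max_freq)
    # Assumption: both bounds are inclusive.
    return {term for term, freq in doc_freq.items() if low <= freq <= high}


segment = {"the": 95, "york": 10, "zyx": 1}
kept = filter_terms(segment, docs_with_field=100, min_freq=0.05, max_freq=0.5)
```

With these made-up numbers, only "york" is kept: "the" is too frequent and "zyx" too rare to be loaded into memory.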

Match-only text field type
