ES mapping explained

Latest recommended article published on 2024-09-05 13:09:01

ZhaoYingChao88

Category: elasticsearch. Tags: elasticsearch.

This is an original article by the author and may not be reposted without the author's permission.

Article link: https://blog.csdn.net/zyc88888/article/details/83027458

1 mapping type

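Mapping

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

For example, a mapping defines:

which string fields should be treated as full-text fields
which fields contain numbers, dates, or geolocations
what format date values use
whether the values of all fields in a document should be indexed into the catch-all _all field
which custom templates control the mapping of dynamically added fields

Mapping types

Each index has one or more mapping types, which divide the documents in the index into logical groups (a mapping type is what is usually just called a type).

Each mapping type contains the following:

1. Meta-fields

Meta-fields customize how the metadata associated with a document is handled. Meta-fields include _index, _type, _id, and _source.

2. Fields or properties

Each mapping type contains a list of fields, or properties, pertinent to that type.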

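Field datatypes

Every field has a datatype.

1. Simple types

string, long, boolean, ip

2. Types that support the hierarchical nature of JSON

object, nested

3. Specialised types

geo_point, geo_shape, completion

Dynamic mapping

Fields and mapping types do not need to be defined before they are used, thanks to dynamic mapping.

With dynamic mapping, new mapping types and new field names are added automatically as documents are indexed.

Dynamic mapping rules can be configured to customize the mapping used for new types and new fields.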

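Explicit mappings

You know more about your data than ES can guess, so although dynamic mapping is useful for getting started, at some point you will want to specify your own explicit mappings.

Explicit mappings can be defined when an index is created, or mapping types and fields can be added to an existing index with the mapping API.

Updating existing mappings

Apart from a few documented exceptions, existing type and field mappings cannot be updated, because changing them would invalidate documents that are already indexed. When a change is needed, create a new index with the correct mapping and reindex the data into it instead of trying to update the existing mapping.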

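Fields are shared across mapping types

Mapping types group fields logically, but the fields in each mapping type are not independent of each other.

1. Rule:

Fields with

the same name
in the same index
in different mapping types

map to the same field internally, so they must have the same mapping settings.

2. Exceptions:

A few parameters are exceptions:

copy_to
dynamic
enabled
ignore_above
include_in_all
properties

These may be set differently on each field that meets the rule above.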

2 field datatypes

Core types

1. String

String fields fall into two cases: full-text and keywords.

Full-text means the field content is analyzed; keywords means the field value can only be queried as a single exact value.

Parameters:

analyzer, boost, doc_values, fielddata, fields, ignore_above, include_in_all, index, index_options, norms, null_value, position_increment_gap, store, search_analyzer, search_quote_analyzer, similarity, term_vector

2. Numeric

Numeric types include long, integer, short, byte, double, and float.

Parameters:

coerce, boost, doc_values, ignore_malformed, include_in_all, index, null_value, precision_step, store

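3. Date

JSON has no date datatype, so a date in ES can be:

a string such as "2015-01-01" or "2015/01/01 12:10:30"
a long number representing milliseconds since the epoch
an integer representing seconds since the epoch

Internally, dates are converted to UTC and stored as a long number of milliseconds since the epoch.

If no format is specified for a date field, the default format is used.

Parameters:

boost, doc_values, format, ignore_malformed, include_in_all, index, null_value, precision_step, store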
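
The normalization described for dates can be sketched in Python. This is only an illustration of the idea, not how ES implements it: the function name to_epoch_millis and its fmt selector are hypothetical, strptime patterns stand in for ES date format strings, and strings without a time zone are assumed to be UTC.

```python
from datetime import datetime, timezone

def to_epoch_millis(value, fmt):
    """Normalize a date value to UTC milliseconds since the epoch.

    fmt is a hypothetical selector for this sketch: "epoch_millis",
    "epoch_second", or a strptime pattern for date strings.
    """
    if fmt == "epoch_millis":
        return int(value)
    if fmt == "epoch_second":
        return int(value) * 1000
    parsed = datetime.strptime(value, fmt)
    if parsed.tzinfo is None:
        # Assumption: a string without a zone is interpreted as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

# The three accepted input forms end up as the same kind of long value
from_string = to_epoch_millis("2015-01-01", "%Y-%m-%d")
from_datetime_string = to_epoch_millis("2015/01/01 12:10:30", "%Y/%m/%d %H:%M:%S")
from_seconds = to_epoch_millis(1420070400, "epoch_second")
from_millis = to_epoch_millis(1420070400000, "epoch_millis")
print(from_string, from_datetime_string, from_seconds, from_millis)
```

Whatever the input form, the stored value is a single long, which is why range queries and sorting on dates behave like numeric operations.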

4. Boolean

False values: false, "false", "off", "no", "0", "" (empty string), 0, 0.0.

True values: anything that is not false.

Aggregations such as the terms aggregation use 1 and 0 for the key, and the strings "true" and "false" for key_as_string.

In scripts, boolean fields always return 1 or 0.

Parameters:

boost, doc_values, index, null_value, store
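
The true/false rule above can be written down as a small Python sketch. The names parse_boolean and terms_bucket_key are hypothetical helpers for illustration; they mimic the rule as the article states it and are not ES code.

```python
FALSE_STRINGS = {"false", "off", "no", "0", ""}

def parse_boolean(value):
    """Return False for the listed false values, True for anything else."""
    if isinstance(value, str):
        return value not in FALSE_STRINGS
    # False, 0 and 0.0 compare equal in Python, so one check covers all three
    return value != 0

def terms_bucket_key(flag):
    """Key pair as a terms aggregation would report it for a boolean field."""
    return {"key": 1 if flag else 0,
            "key_as_string": "true" if flag else "false"}

print(parse_boolean("off"), parse_boolean("yes"), parse_boolean(0.0))
print(terms_bucket_key(parse_boolean("yes")))
```

Note the consequence of "anything that is not false": a value such as "yes" or "2" counts as true under this rule.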

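5. Binary

The binary type accepts a binary value as a Base64-encoded string. Binary fields are not stored by default and are not searchable.

Parameters: doc_values, store

Complex types

1. Object

JSON is hierarchical by nature: a document can contain objects, which in turn can contain inner objects. Internally, however, ES indexes such a document as a flat list of key-value pairs.

For example: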


PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

is converted to:


{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"  // the hierarchy is expressed with "."
}
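
The flattening of inner objects into dotted keys can be reproduced with a few lines of Python. This is a minimal sketch of the transformation shown above, with a hypothetical helper name (flatten); it says nothing about how ES stores the result.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
flat = flatten(doc)
print(flat)
```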

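2. Array

All values in an array must be of the same datatype.

an array of strings: [ "one", "two" ]
an array of integers: [ 1, 2 ]
an array of arrays: [ 1, [ 2, 3 ]], which is equivalent to [ 1, 2, 3 ]
an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]

When an array field is added dynamically, the datatype of its first value determines the datatype of the field.

Internally, ES converts an array of objects into flat multi-value fields. This is explained in detail below.

For example: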


PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

is converted to:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

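3. Array of objects

Internally, ES merges all elements (objects) of an object array, and each field of those objects is indexed as a multi-value field.

As a result, the association between the fields inside each array element (object) is lost. The fix is to use the nested type.

For example: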


PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": [
    { 
      "first": "John",
      "last":  "Smith"
    },
    { 
      "first": "Bob",
      "last":  "Leo"
    }
    ]
  }
}

is converted to:


{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": [ "John", "Bob" ],
  "manager.name.last":  [ "Smith", "Leo" ]
}
// If we search with:
"bool": {
      "must": [
        { "match": { "manager.name.first": "John" }},   // John Smith
        { "match": { "manager.name.last": "Leo"}}       // Bob Leo
      ]
}
// the document is a hit, because the association within each pair, John Smith and Bob Leo, has been lost
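
The difference between the merged (object) view and the per-element (nested) view can be simulated in Python. This is a sketch under simplifying assumptions: exact string comparison stands in for the match query, and the helper names are hypothetical.

```python
names = [
    {"first": "John", "last": "Smith"},
    {"first": "Bob", "last": "Leo"},
]

def merge_object_array(objects):
    """Merge an array of objects into multi-value fields (object behaviour)."""
    merged = {}
    for obj in objects:
        for key, value in obj.items():
            merged.setdefault(key, []).append(value)
    return merged

def matches_merged(objects, first, last):
    """Each condition is checked against the merged multi-value fields."""
    merged = merge_object_array(objects)
    return first in merged.get("first", []) and last in merged.get("last", [])

def matches_per_object(objects, first, last):
    """Both conditions must hold inside the same element (nested behaviour)."""
    return any(o.get("first") == first and o.get("last") == last
               for o in objects)

merged_hit = matches_merged(names, "John", "Leo")
nested_hit = matches_per_object(names, "John", "Leo")
print(merged_hit, nested_hit)
```

The merged view reports a hit for the pair John/Leo even though no single person has that name, while the per-object check rejects it.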

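Parameters:

dynamic, enabled, include_in_all, properties

4. Nested

The nested type is a specialised version of the object type. It allows each element (object) of an object array to be queried independently of the others, which means the elements are not merged into one object.

Documents with nested fields can be:

queried with the nested query
analyzed with the nested and reverse_nested aggregations
sorted with nested sorting
retrieved and highlighted with nested inner hits

For example: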


PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": [
    { 
      "first": "John",
      "last":  "Smith"
    },
    { 
      "first": "Bob",
      "last":  "Leo"
    }
    ]
  }
}

is converted to:


{
  "region":             "US",
  "manager.age":        30,
  {
      "manager.name.first": "John",
      "manager.name.last": "Smith"
  },
  {
      "manager.name.first": "Bob",
      "manager.name.last": "Leo" 
  }
}
// If we search with:
"bool": {
      "must": [
        { "match": { "manager.name.first": "John" }},   // John Smith
        { "match": { "manager.name.last": "Leo"}}       // Bob Leo
      ]
}
// this query does not hit the document, because John and Leo belong to different nested objects

Parameters:

dynamic, include_in_all, properties

Specialised types

1. IPv4

An ip field is really a long field; it accepts an IPv4 address and stores it as a long value.

Parameters:

boost, doc_values, include_in_all, index, null_value, precision_step, store
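
The conversion between an IPv4 address and a long value can be illustrated with Python's standard ipaddress module. The helper names are hypothetical, and the sketch only shows the numeric equivalence, not the ES storage format.

```python
import ipaddress

def ipv4_to_long(address):
    """Convert a dotted IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))

def long_to_ipv4(number):
    """Convert a 32-bit integer back to dotted IPv4 notation."""
    return str(ipaddress.IPv4Address(number))

as_long = ipv4_to_long("192.168.1.1")
print(as_long, long_to_ipv4(as_long))
```

Because addresses become plain numbers, range queries over IP addresses reduce to numeric range comparisons.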

3 Meta-Fields

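Identity meta-fields

_index

When querying across multiple indices, it is sometimes useful to add query clauses that apply only to documents from certain indices.
The _index field can be used in term and terms queries, aggregations, scripts, and sorting.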



GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": "doc['_index']" 
    }
  }
}

_type

The _type field makes searches restricted to specific types faster.
The _type field can be used in queries, aggregations, scripts, and sorting.


GET my_index/type_*/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": "doc['_type']" 
    }
  }
}

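Document source meta-fields

_source

Description

The _source field contains the original JSON of the document.
The _source field is not indexed but it is stored, so its value can be returned by get and search requests.

Disabling the _source field

The _source field can be disabled in the mapping.
Disabling _source has side effects; for example, the update API can no longer be used.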


PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}

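Excluding fields from _source

The includes and excludes parameters in the _source mapping include or exclude specific fields.
Included and excluded fields are given as plain field paths, and the paths support wildcards.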


PUT logs
{
  "mappings": {
    "event": {
      "_source": {
        "includes": [
          "*.count",
          "meta.*"
        ],
        "excludes": [
          "meta.description",
          "meta.other.*"
        ]
      }
    }
  }
}
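
The effect of includes and excludes can be approximated in Python on a document that has already been flattened to dotted paths. This is a rough sketch: fnmatch-style wildcards stand in for ES path patterns, the sample document is invented, and filter_source is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def filter_source(flat_doc, includes, excludes):
    """Keep paths that match an include pattern and no exclude pattern."""
    kept = {}
    for path, value in flat_doc.items():
        # An empty includes list means "include everything"
        included = not includes or any(fnmatchcase(path, p) for p in includes)
        excluded = any(fnmatchcase(path, p) for p in excludes)
        if included and not excluded:
            kept[path] = value
    return kept

# Hypothetical document, written with dotted paths for brevity
event = {
    "requests.count": 10,
    "meta.name": "probe",
    "meta.description": "long text",
    "meta.other.foo": 1,
    "title": "ignored",
}
stored = filter_source(
    event,
    includes=["*.count", "meta.*"],
    excludes=["meta.description", "meta.other.*"],
)
print(stored)
```

Fields removed this way are still indexed and searchable; they are only missing from the stored _source.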

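Indexing meta-fields

_all

Description

The _all field concatenates the values of all other fields into one big string; whatever their datatypes, all values are treated as strings in _all.
Each index has only one _all field.
This string is analyzed and indexed but not stored, so it can be searched but not retrieved.
Like any other string field, the _all field accepts parameters such as analyzer, term_vector, index_options, and store.
Building the _all field has a cost: it uses extra CPU and disk space.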



GET my_index/_search
{
  "query": {
    "match": {
      "_all": "john smith 1970"
    }
  }
}

Querying the _all field

The query_string and simple_query_string queries search the _all field by default, unless another field is specified.



GET _search
{
  "query": {
    "query_string": {
      "query": "john smith 1970"
    }
  }
}

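Disabling the _all field

The _all field can be completely disabled in the mapping. If it is disabled, query_string and simple_query_string queries need a default field to be configured before they can be used.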



PUT my_index
{
  "mappings": {
    "my_type": {
      "_all": {
        "enabled": false 
      },
      "properties": {
        "content": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index.query.default_field": "content" 
  }
}

Excluding fields from _all

The include_in_all parameter in a field's mapping controls whether that field is included in the _all field.



PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}

Storing the _all field

The store parameter sets whether the _all field is stored.



PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {
        "store": true
      }
    }
  }
}

_field_names

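Description

The _field_names field stores the names of every field in a document that contains a non-null value.
The exists and missing queries use this field to find documents that do or do not contain a given field.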



GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "title" ] 
    }
  },
  "aggs": {
    "Field names": {
      "terms": {
        "field": "_field_names", 
        "size": 10
      }
    }
  },
  "script_fields": {
    "Field names": {
      "script": "doc['_field_names']" 
    }
  }
}

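Routing meta-fields

_parent

Description

Within one index, a parent-child relationship between documents is established by making one type the parent of another.
The parent and child types must be different types.
The type named as parent must not exist yet; a type that already exists cannot become the parent of another type.
Parent and child documents must be indexed on the same shard; a child document passes the parent parameter, which is used as its routing value, to guarantee this.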


PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}

_routing

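The _routing field determines which shard a document is indexed into: shard_num = hash(routing) % num_primary_shards
The default _routing value is the document's _id or its _parent ID.
A custom _routing value can be set with the routing parameter.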


GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  },
  "aggs": {
    "Routing values": {
      "terms": {
        "field": "_routing", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_routing": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "Routing value": {
      "script": "doc['_routing']" 
    }
  }
}
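
The routing formula shard_num = hash(routing) % num_primary_shards can be tried out in Python. This is only a sketch: zlib.crc32 is used as a stand-in for the hash function ES uses internally, so the shard numbers it produces will not match a real cluster. The function name shard_for and the sample IDs are hypothetical.

```python
import zlib

def shard_for(routing, num_primary_shards):
    """Pick a shard from a routing value (stand-in hash, not the ES hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

NUM_PRIMARY_SHARDS = 5

# A parent document is routed by its own _id by default
parent_shard = shard_for("parent_1", NUM_PRIMARY_SHARDS)
# A child document is routed by its parent's ID, so it lands on the same shard
child_shard = shard_for("parent_1", NUM_PRIMARY_SHARDS)
# A custom routing value groups related documents on one shard
user_shard = shard_for("user1", NUM_PRIMARY_SHARDS)

print(parent_shard, child_shard, user_shard)
```

The modulo also shows why the number of primary shards cannot change after index creation without reindexing: a different divisor would send existing routing values to different shards.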

4 mapping setting

mapping type

Mappings are usually set in the following situations:

1. When creating a new index: add mapping types and set the field mappings


PUT twitter 
{
  "mappings": {
    "tweet": {
      "properties": {
        "message": {
          "type": "string"
        }
      }
    }
  }
}

2. When adding a new mapping type to an index: set its field mappings


PUT twitter/_mapping/user 
{
  "properties": {
    "name": {
      "type": "string"
    }
  }
}

3. When adding new field mappings to an existing mapping type


PUT twitter/_mapping/tweet 
{
  "properties": {
    "user_name": {
      "type": "string"
    }
  }
}

How to set a mapping

1. Give the complete mapping in the body of the PUT request


PUT twitter 
{
  "mappings": {                         // the mappings object marks this as a mapping definition
    "tweet": {                          // the mapping type
      "properties": {                   // the properties of the mapping type
        "message": {                    // the mapping for the message field
          "type": "string"              // mapping parameter
        }
      }
    }
  }
}

When creating an index, besides mapping types you can also configure the index itself, for example custom analyzers or the number of shards


PUT /my_index
{
  "settings": {
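    "analysis": {
      "filter": {
        "autocomplete_filter": {        // assumed definition: the original request references this filter without defining it
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },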
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}

2. Specify the type in the URI of the PUT request and give the settings for that type in the request body


PUT twitter/_mapping/user 
{
  "properties": {                   // the properties of the mapping type
    "name": {                       // the mapping for the name field
      "type": "string"              // mapping parameter
    }
  }
}

3. A complete mapping type definition includes meta-fields and fields (properties)


PUT my_index
{
  "mappings": {
    "type_1": { 
      "properties": {...}           // properties settings
    },
    "type_2": { 
      "_all": {                     // meta-field settings
        "enabled": false
      },
      "properties": {...}
    }
  }
}

5 dynamic mapping

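Overview

When using ES, documents can be indexed without defining a mapping first. ES detects the type of each field automatically and creates the mapping; this process is called dynamic mapping.

Dynamic mapping can be turned off with the following setting.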

PUT /_settings 
{
  "index.mapper.dynamic":false
}

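The rules of dynamic mapping can also be customized. Custom rules can be applied in the following ways:

the default mapping (_default_ mapping)
dynamic field mapping
dynamic templates
index templates

The first three are set for the types of a specific index, whereas the fourth applies to every index that matches its condition.

Default mapping

The default mapping is defined by using _default_ as the mapping type name.

The default mapping is applied to any type added to the index afterwards.

The default mapping can be set when the index is created, or afterwards through the PUT mapping API.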


PUT my_index
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false         // the default mapping disables the _all meta-field for all new types
      }
    },
    "user": {}, 
    "blogpost": { 
      "_all": {
        "enabled": true     // overrides the _default_ setting and enables the _all field
      }
    }
  }
}

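Dynamic field mapping

By default, when ES finds a new field it detects the field's datatype and adds the field to the mapping type.

Settings control how fields are mapped dynamically, including date detection, numeric detection, and custom formats for detected dates.


PUT my_index         // disable date detection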
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}
PUT my_index       // customize the formats used to detect dates
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}
PUT my_index        // enable numeric detection
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}
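
How the detection switches change the outcome for a new field can be sketched in Python. This is a simplified model, not the ES implementation: detect_type is a hypothetical helper, strptime patterns stand in for ES date formats, and the type names follow the ones used in this article.

```python
from datetime import datetime

def detect_type(value, date_detection=True, numeric_detection=False,
                date_formats=("%Y-%m-%d", "%Y/%m/%d")):
    """Guess a field datatype from a JSON value (simplified sketch)."""
    if isinstance(value, bool):          # check bool first: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        if date_detection:
            for fmt in date_formats:
                try:
                    datetime.strptime(value, fmt)
                    return "date"
                except ValueError:
                    pass                 # try the next format
        if numeric_detection:
            for caster, name in ((int, "long"), (float, "double")):
                try:
                    caster(value)
                    return name
                except ValueError:
                    pass
        return "string"
    return None

print(detect_type("2015/01/01"))                        # date
print(detect_type("2015/01/01", date_detection=False))  # string
print(detect_type("42"))                                # string
print(detect_type("42", numeric_detection=True))        # long
```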

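Dynamic templates

Dynamic templates are applied to new fields that satisfy their match conditions.

The match conditions are:

match_mapping_type tests the datatype detected for the new field
match, unmatch, and match_pattern test the name of the new field
path_match and path_unmatch test the full dotted path of the new field

Dynamic templates are given as an array in which each element is one template. Each template has its own match conditions; templates are checked in order, and the first one that a new field matches is applied to that field.

Two special placeholders can be used in a template: {name} and {dynamic_type}. The former is replaced with the field name, the latter with the datatype that ES detected for the field.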


"dynamic_templates": [                 // an array; each element is a dynamic template
    {
      "my_template_name": {            // template name
        ...  match conditions ...      // match conditions
        "mapping": { ... }             // the mapping to apply
      }
    },
    ...                                // further elements define further templates
  ]
  ]


PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "named_analyzers": {
            "match_mapping_type": "string",
            "match": "*",
            "mapping": {
              "type": "string",
              "analyzer": "{name}"
            }
          }
        },
        {
          "no_doc_values": {
            "match_mapping_type":"*",
            "mapping": {
              "type": "{dynamic_type}",
              "doc_values": false
            }
          }
        }
      ]
    }
  }
}
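
The way the two templates in the request above would be resolved for a new field can be modelled in Python. This is a sketch under simplifying assumptions: only match_mapping_type, match, and unmatch are handled, fnmatch-style wildcards stand in for ES patterns, and resolve_mapping and substitute are hypothetical helpers.

```python
from fnmatch import fnmatchcase

DYNAMIC_TEMPLATES = [
    {"named_analyzers": {
        "match_mapping_type": "string",
        "match": "*",
        "mapping": {"type": "string", "analyzer": "{name}"}}},
    {"no_doc_values": {
        "match_mapping_type": "*",
        "mapping": {"type": "{dynamic_type}", "doc_values": False}}},
]

def substitute(value, field_name, dynamic_type):
    """Replace the {name} and {dynamic_type} placeholders in string values."""
    if isinstance(value, str):
        return (value.replace("{name}", field_name)
                     .replace("{dynamic_type}", dynamic_type))
    return value

def resolve_mapping(field_name, dynamic_type, templates=DYNAMIC_TEMPLATES):
    """Return the mapping of the first template that matches the new field."""
    for template in templates:
        rule = next(iter(template.values()))
        if rule.get("match_mapping_type", "*") not in ("*", dynamic_type):
            continue
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        return {key: substitute(value, field_name, dynamic_type)
                for key, value in rule["mapping"].items()}
    return {"type": dynamic_type}        # no template matched

print(resolve_mapping("english", "string"))
print(resolve_mapping("count", "long"))
```

With these templates, a new string field named english gets an analyzer of the same name, and every non-string field keeps its detected type with doc_values turned off.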

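Index templates

An index template tests whether a newly created index matches a condition (it applies only to new indices) and, if so, applies settings and mappings to it.

An index template contains index settings and mappings.

One special placeholder can be used in an index template: {index}, which stands for the name of the index that matched.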


PUT /_template/template_1
{
  "template": "te*",                          // index name pattern: which indices the template applies to
  "settings": {                               // index settings
    "number_of_shards": 1
  },
  "mappings": {                               // mappings
    "type1": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "string",
          "index": "not_analyzed"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
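
Which new indices pick up the template above can be illustrated with a short Python sketch. It models only the name pattern and the settings; fnmatch-style wildcards stand in for ES pattern matching, template ordering is ignored, and settings_for_new_index and the index names are hypothetical.

```python
from fnmatch import fnmatchcase

INDEX_TEMPLATES = {
    "template_1": {
        "template": "te*",
        "settings": {"number_of_shards": 1},
    },
}

def settings_for_new_index(index_name, templates=INDEX_TEMPLATES):
    """Collect the settings of every template whose pattern matches the name."""
    merged = {}
    for body in templates.values():
        if fnmatchcase(index_name, body["template"]):
            merged.update(body.get("settings", {}))
    return merged

print(settings_for_new_index("test-2016"))   # matches "te*"
print(settings_for_new_index("logs-2016"))   # no template applies
```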

Reference: https://www.cnblogs.com/licongyu/category/819588.html
