Elasticsearch—Mappings

Table of Contents

Mapping base types

Mapping arrays

Mapping an object

Mapping a document

Using dynamic templates in document mapping

Managing nested objects

Managing a child document with a join field

Adding a field with multiple mappings

Mapping a GeoPoint field

Mapping a GeoShape field

Mapping an IP field

Mapping an Alias field

Mapping a Percolator field

Mapping the Rank Feature and Feature Vector fields

Mapping the Search as you type field

Using the Range Fields type

Using the Flattened field type

Using the Point and Shape field types

Using the Dense Vector field type

Using the Histogram field type

Adding metadata to a mapping

Specifying different analyzers

Using index components and templates


Mapping base types

PUT test/_mapping
{ "properties" : {
    "id" : {"type" : "keyword"},
    "date" : {"type" : "date"},
    "customer_id" : {"type" : "keyword"},
    "sent" : {"type" : "boolean"},
    "name" : {"type" : "keyword"},
    "quantity" : {"type" : "integer"},
    "price" : {"type" : "double"},
    "vat" : {"type" : "double", "index": false}
} }


•    store (default false): This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space, but it reduces computation if you need to extract its value from a document (that is, in scripting and aggregations). The possible values for this option are true and false. Stored values are always returned as an array of values for consistency.
Stored fields are faster than other fields when used in aggregations.
•    index: This defines whether or not the field should be indexed. The possible values for this parameter are true and false. Fields that are not indexed are not searchable (the default is true).
•    null_value: This defines a default value if the field is null.
•    boost: This is used to change the importance of a field (the default is 1.0).
•    search_analyzer: This defines an analyzer to be used during the search. If it's not defined, the analyzer of the parent object is used (the default is null).
•    analyzer: This sets the default analyzer to be used (the default is null).
•    norms: This controls the Lucene norms. This parameter is used to score queries better. If the field is only used for filtering, it's a best practice to disable it to reduce resource usage (true for analyzed fields and false for not_analyzed ones).
•    copy_to: This allows you to copy the content of a field to another one to achieve functionality similar to the _all field, as shown in the sketch after this list.
•    ignore_above: This allows you to skip the indexing string if it's bigger than its value. This is useful for processing fields for exact filtering, aggregations, and sorting. It also prevents a single term token from becoming too big and prevents errors due to the Lucene term's byte-length limit of 32,766. The maximum suggested value is 8191 (https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html).
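As a minimal sketch of how some of these parameters can be combined (the test-params index and its field names are illustrative only, not part of the original example):

PUT test-params
{ "mappings": {
    "properties": {
      "code": { "type": "keyword", "ignore_above": 256, "null_value": "NA" },
      "title": { "type": "text", "copy_to": "all_text" },
      "description": { "type": "text", "copy_to": "all_text" },
      "all_text": { "type": "text" }
} } }

Here, title and description are both copied into all_text, which can then be searched as a single catch-all field, much like the old _all field.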

Mapping arrays

PUT test/_mapping
{ "properties" : {
    "name" : {"type" : "keyword"},
    "tag" : {"type" : "keyword", "store" : true}
}}
PUT test/_doc/1
{"name": "document1", "tag": "awesome"}
PUT test/_doc/2
{"name": "document2", "tag": ["cool", "awesome", "amazing"] }

Mapping an object

PUT test/_mapping
{ "properties" : {
"id" : {"type" : "keyword"},
"date" : {"type" : "date"},
"customer_id" : {"type" : "keyword", "store" : true},
"sent" : {"type" : "boolean"},
"item" : {
"type" : "object",
"properties" : {
    "name" : {"type" : "text"},
    "quantity" : {"type" : "integer"},
    "price" : {"type" : "double"},
    "vat" : {"type" : "double"}
}
} } }

The most important attributes of an object are as follows:
•    properties: This is a collection of fields or objects (we can consider them as columns in the SQL world).
•    enabled: This establishes whether or not the object should be processed. If it's set to false, the data contained in the object is not indexed and it cannot be searched (the default is true).
•    dynamic: This allows Elasticsearch to add new field names to the object by inferring their types from the values of the inserted data. If it's set to false, new fields in an indexed object are silently ignored: they are not added to the mapping and cannot be searched. If it's set to strict, an error will be raised when a new field is present in the object and the whole document is rejected, skipping the indexing process. The dynamic parameter allows you to be safe about changes to the document's structure (the default is true); a sketch of the strict behavior follows this list.
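The effect of dynamic is easiest to see with strict. The following is a minimal sketch (the test-strict index and its fields are illustrative only):

PUT test-strict
{ "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "keyword" }
} } }

PUT test-strict/_doc/1
{ "name": "ok" }

PUT test-strict/_doc/2
{ "name": "ko", "surname": "unmapped field" }

The first document is indexed normally, while the second one is rejected with a strict_dynamic_mapping_exception because surname is not part of the mapping.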

Mapping a document

PUT test/_mapping
{ "_source": { "enabled": true },
"_routing": { "required": true },
"_index": { "enabled": true },
"properties": {}
}

•    _id: This allows you to index only the ID part of the document, so that queries on the ID can be sped up by using the ID value (by default, this is not indexed and not stored).
•    _index: This controls whether or not the index name must be stored as part of the document. It can be enabled by setting the "enabled": true parameter (enabled=false is the default).
•    _source: This controls the storage of the document source. Storing the source is very useful, but it adds storage overhead, so if it's not required, it's better to turn it off (enabled=true is the default).
•    _routing: This defines the shard that will store the document. It supports additional parameters, such as required (true/false). This is used to force the presence of the routing value, raising an exception if it's not provided, as shown in the example after this list.
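Because _routing is marked as required in the preceding mapping, every index operation on this index must pass a routing value, otherwise Elasticsearch rejects it. A minimal sketch (the routing value chosen here is just an example):

PUT test/_doc/1?routing=customer100
{ "id": "1", "customer_id": "customer100", "sent": true }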


Using dynamic templates in document mapping

PUT test/_mapping
{
"dynamic_date_formats":["yyyy-MM-dd", "dd-MM-yyyy"],\
"date_detection": true,
"numeric_detection": true,
"dynamic_templates":[
    {"template1":{
    "match":"*",
    "match_mapping_type": "long",
    "mapping": {"type":" {dynamic_type}", "store": true}
    }} ],
"properties" : {...}
}
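To see the template at work, we can index a document that contains a new numeric field and then inspect the mapping that Elasticsearch generated; a quick sketch (the counter field is illustrative):

PUT test/_doc/1
{ "counter": 100 }

GET test/_mapping

Because the template matches the long type, counter should appear in the returned mapping with "store": true.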


Managing nested objects


PUT test/_mapping
{ "properties" : {
    "id" : {"type" : "keyword"},
    "date" : {"type" : "date"},
    "customer_id" : {"type" : "keyword"},
    "sent" : {"type" : "boolean"},
    "item" : {"type" : "nested",
    "properties" : {
    "name" : {"type" : "keyword"},
    "quantity" : {"type" : "long"},
    "price" : {"type" : "double"},
    "vat" : {"type" : "double"}
} } } }
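Because nested objects are indexed as separate hidden documents, they must be searched with the nested query, which takes the path of the nested field and an inner query; a minimal sketch:

GET test/_search
{ "query": {
    "nested": {
      "path": "item",
      "query": {
        "bool": {
          "must": [
            { "term": { "item.name": "tshirt" } },
            { "range": { "item.quantity": { "gte": 2 } } }
          ]
} } } } }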


Managing a child document with a join field

PUT test/_mapping
{ "properties": {
"join_field": {
    "type": "join", "relations": { "order": "item" }
},
"id": { "type": "keyword" },
"date": { "type": "date" },
"customer_id": { "type": "keyword" },
"sent": { "type": "boolean" },
"name": { "type": "text" },
"quantity": { "type": "integer" },
"vat": { "type": "double" }
} }

PUT test/_doc/1?refresh
{ "id": "1", "date": "2018-11-16T20:07:45Z", "customer_id": "100", "sent": true, "join_field": "order" }

PUT test/_doc/c1?routing=1&refresh
{ "name": "tshirt", "quantity": 10, "price": 4.3, "vat": 8.5,
"join_field": { "name": "item", "parent": "1" } }


Adding a field with multiple mappings

{ "name": {
"type": "keyword",
"fields": {
    "name": {"type": "keyword"},
    "tk": {"type": "text"},
    "code": {"type": "text","analyzer": "code_analyzer"}
} }


•    name: This points to the default field of the multifield (the keyword one).
•    name.tk: This points to the standard analyzed (tokenized) text field.
•    name.code: This points to a field that was analyzed with a code extractor analyzer.
As you may have noticed in the preceding example, we changed the analyzer to introduce a code extractor analyzer that allows you to extract the item code from a string.
By using the multifield, if we index a string such as Good Item to buy - ABC1234, we'll have the following:
•    name = Good Item to buy - ABC1234 (useful for sorting)
•    name.tk= ["good", "item", "to", "buy", "abc1234"] (useful for searching)
•    name.code = ["ABC1234"] (useful for searching and aggregations)
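The code_analyzer referenced above is not a built-in analyzer, so it must be defined in the index settings. The following is a purely illustrative sketch that assumes item codes follow an ABC1234-like pattern and uses a pattern tokenizer to extract only the code:

PUT test
{ "settings": {
    "analysis": {
      "tokenizer": {
        "code_tokenizer": {
          "type": "pattern",
          "pattern": "([A-Z]{3}\\d{4})",
          "group": 1
        }
      },
      "analyzer": {
        "code_analyzer": {
          "type": "custom",
          "tokenizer": "code_tokenizer"
        }
      }
} } }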


Mapping a GeoPoint field

PUT test/_mapping
{ "properties": {
"id": {"type": "keyword",},
"date": {"type": "date"},
"customer_id": {"type": "keyword"},
"customer_ip": {"type": "ip"},
"customer_location": {"type": "geo_point"},
"sent": {"type": "boolean"}
} }


•    lat_lon (the default is false): This allows you to store the latitude and longitude as the .lat and .lon fields. 
Storing these values improves the performance of many memory algorithms that are used in distance and shape calculus.
It makes sense to set lat_lon to true so that you store them if there is a single point value for a field. This speeds up searches and reduces memory usage during computation.
•    geohash (the default is false): This allows you to store the computed geohash value.
•    geohash_precision (the default is 12): This defines the precision to be used in geohash calculus.

For example, given a geo point value, [45.61752, 9.08363], it can be stored using one of the following syntaxes:
•    customer_location = [45.61752, 9.08363]
•    customer_location.lat = 45.61752
•    customer_location.lon = 9.08363
•    customer_location.geohash = u0n7w8qmrfj
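Once a geo_point field is mapped, it can be used in geo queries such as geo_distance; a minimal sketch that finds documents within 10 km of that point:

GET test/_search
{ "query": {
    "geo_distance": {
      "distance": "10km",
      "customer_location": { "lat": 45.61752, "lon": 9.08363 }
} } }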


Mapping a GeoShape field

"customer_location": {
"type": "geo_shape",
"tree": "quadtree",
"precision": "1m" }
}
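A geo_shape field is usually searched with the geo_shape query, passing an inline shape (an envelope in this sketch, with illustrative coordinates) and the spatial relation to test:

GET test/_search
{ "query": {
    "geo_shape": {
      "customer_location": {
        "shape": {
          "type": "envelope",
          "coordinates": [ [ 9.0, 46.0 ], [ 10.0, 45.0 ] ]
        },
        "relation": "within"
} } } }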


Mapping an IP field

"customer_ip": { "type": "ip" }
The IP must be in the standard dotted notation form, as follows:
"customer_ip":"19.18.200.201"

Mapping an Alias field

PUT test/_mapping
{ "properties": {
    "id": {"type": "keyword"},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "sent": {"type": "boolean"},
    "item": {
        "type": "object",
        "properties": {
        "name": {"type": "keyword"},
        "quantity": {"type": "long"},
        "price": {"type": "double"},
        "vat": {"type": "double"}
} } } }

PUT test/_doc/1?refresh
{ "id": "1", "date": "2018-11-16T20:07:45Z",
"customer_id": "100", "sent": true,
"item": [ { "name": "tshirt", "quantity": 10, "price": 4.3, "vat": 8.5 } ] }

GET test/_search
{ "query": { "term": { "item.cost": 4.3 } } }

Mapping a Percolator field

PUT test-percolator
{ "mappings": {
    "properties": {
    "query": { "type": "percolator" },
    "body": { "type": "text" }
} } } 
PUT test-percolator/_doc/1?refresh
{ "query": { "match": { "body": "quick brown fox" }}}

GET test-percolator/_search
{ "query": {
"percolate": {
"field": "query",
"document": { "body": "fox jumps over the lazy dog" } } } 
}


Mapping the Rank Feature and Feature Vector fields

rank_feature and rank_features are special field types that are used to store numeric values, which are mainly used to boost the scoring of results.

1.    To be able to score based on a pagerank value and an inverse url length, we can use the following mapping:

PUT test-rank
{ "mappings": {
"properties": {
"pagerank": { "type": "rank_feature" },
"url_length": {
    "type": "rank_feature",
    "positive_score_impact": false
} } } }


2.    Now, we can store a document, as shown here:
PUT test-rank/_doc/1
{ "pagerank": 5, "url_length": 20 }
PUT test-rank/_doc/2
{ "pagerank": 4, "url_length": 21 }

3.    Now, we can execute a rank_feature query on the pagerank value to score our records, like so:
GET test-rank/_search
{ "query": { "rank_feature": { "field":"pagerank" }}}

To store multiple features in a single field, we can use the rank_features type:
1.    First, we must define the mapping for the categories field:
PUT test-ranks
{ "mappings": {
"properties": {
"categories": { "type": "rank_features" } } } }
2.    Now, we can store some documents in the index by using the following commands:
PUT test-ranks/_doc/1
{ "categories": { "sport": 14.2, "economic": 24.3 } }
PUT test-ranks/_doc/2
{ "categories": { "sport": 19.2, "economic": 23.1 } }
3.    Now, we can search based on the saved feature values, as shown here:
GET test-ranks/_search
{ "query": { "rank_feature": { "field": "categories.sport" } } }
GET test-ranks/_search
{ "query": { "rank_feature": { "field": "categories.economic" } } }

Mapping the Search as you type field


The "search_as_you_type" field can be customized using the max_shingle_size parameter (the default is 3). This parameter allows you to define the maximum size of the gram to be created.

1.    To be able to provide search as you type on a title field, we will use the following mapping:
PUT test-sayt
{ "mappings": {
    "properties": {
    "title": { "type": "search_as_you_type" }
} } }
2.    Now, we can store some documents, as shown here:
PUT test-sayt/_doc/1
{ "title": "Ice Age" }
PUT test-sayt/_doc/2
{ "title": "The Polar Express" }
PUT test-sayt/_doc/3
{ "title": "The Godfather" }
3.    Now, we can execute a match query on the title value to return our records:
GET test-sayt/_search
{
"query": {
    "multi_match": {
    "query": "the p", "type": "bool_prefix",
    "fields": [ "title", "title._2gram", "title._3gram" ]
} } }

Using the Range Fields type


•    integer_range: This is used to store signed 32-bit integer values.
•    float_range: This is used to store signed 32-bit floating-point values.
•    long_range: This is used to store signed 64-bit integer values.
•    double_range: This is used to store signed 64-bit floating-point values.
•    date_range: This is used to store date values as 64-bit integers.
•    ip_range: This is used to store IPv4 and IPv6 values.

•    gt or gte for the lower bound of the range
•    lt or lte for the upper bound of the range

1.    To populate our stock, we need to create an index with range fields. Let's use the following mapping:
PUT test-range
{ "mappings": {
"properties": {
    "price": { "type": "float_range" },
    "timeframe": { "type": "date_range" }
} } }
2.    Now, we can store some documents, as shown here:
PUT test-range/_bulk
{"index":{"_index":"test-range","_id":"1"}}
{"price":{"gte":1.5,"lt":3.2},"timeframe":{"gte":"2022-01-01T12:00:00","lt":"2022-01-01T12:00:01"}}
{"index":{"_index":"test-range","_id":"2"}}
{"price":{"gte":1.7,"lt":3.7},"timeframe":{"gte":"2022-01-01T12:00:01","lt":"2022-01-01T12:00:02"}}
{"index":{"_index":"test-range","_id":"3"}}
{"price":{"gte":1.3,"lt":3.3},"timeframe":{"gte":"2022-01-01T12:00:02","lt":"2022-01-01T12:00:03"}}
3.    Now, we can execute a query for filtering on price and timeframe values to check the correct indexing of the data:
GET test-range/_search
{ "query": {
    "bool": {
        "filter": [
        { "term": { "price": { "value": 2.4 } } },
        { "term": { "timeframe": { "value": "2022-01-01T12:00:02" } } }
] } } }
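Range fields can also be filtered with a range query plus a relation (intersects, within, or contains) that describes how the indexed range must compare with the query range; a minimal sketch:

GET test-range/_search
{ "query": {
    "range": {
      "price": { "gte": 1.0, "lte": 2.0, "relation": "intersects" }
} } }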

Using the Flattened field type


1.    To create our configuration index with a flattened field, we will use the following mapping:
PUT test-flattened
{ "mappings": {
"properties": {
"name": { "type": "keyword" },
"configs": { "type": "flattened" } } } }
2.    Now, we can store some documents that contain our configuration data:
PUT test-flattened/_bulk
{"index":{"_index":"test-flattened","_id":"1"}}
{"name":"config1","configs":{"key1":"value1","key3":"2022-01-01T12:00:01"}}
{"index":{"_index":"test-flattened","_id":"2"}}
{"name":"config2","configs":{"key1":true,"key2":30}}
{"index":{"_index":"test-flattened","_id":"3"}}
{"name":"config3","configs":{"key4":"test","key2":30.3}}
3.    Now, we can execute a query that's searching for the text in all the configurations:
POST test-flattened/_search
{ "query": { "term": { "configs": "test" } } }
Alternatively, we can search for a particular key in the configs object, like so:
POST test-flattened/_search
{ "query": { "term": { "configs.key4": "test" } } }

Using the Point and Shape field types

In this section, we will map a device's coordinates in our shop.
1.    To create our index for storing devices and their location, we will use the following mapping:
PUT test-point
{ "mappings": {
"properties": {
    "device": { "type": "keyword" },
    "location": { "type": "point" } } } }
2.    Now, we can store some documents that contain our device's data:
PUT test-point/_bulk
{"index":{"_index":"test-point","_id":"1"}}
{"device":"device1","location":{"x":10,"y":10}}
{"index":{"_index":"test-point","_id":"2"}}
{"device":"device2","location":{"x":10,"y":15}}
{"index":{"_index":"test-point","_id":"3"}}
{"device":"device3","location":{"x":15,"y":10}}
Next, we will create shapes in our shop so that we can divide it into parts and check whether people/devices are inside a defined shape.
1.    First, let's create an index to store our shapes:
PUT test-shape
{ "mappings": {
"properties": {
    "room": { "type": "keyword" },
    "geometry": { "type": "shape" } } } }
2.    Now, we can store a document to test the mapping:
POST test-shape/_doc/1
{ "room":"hall",
    "geometry" : {
    "type" : "polygon",
    "coordinates" : [ [ [8.0, 8.0], [8.0, 12.0], [12.0, 12.0], [12.0, 8.0], [8.0, 8.0]] ] } }
3.    Now, let's search our devices in our stored shape:
POST test-point/_search
{ "query": {
    "shape": {
        "location": {
        "indexed_shape": { "index": "test-shape", "id": "1", "path": "geometry" } } } } }


Using the Dense Vector field type

Elasticsearch is often used to store machine learning data for training algorithms. X-Pack provides the Dense Vector field to store vectors that have up to 2,048 dimension values.

1.    To create an index to store a vector of values, we will use the following mapping:
PUT test-dvector
{ "mappings": {
"properties": {
"vector": { "type": "dense_vector", "dims": 4 },
"model": { "type": "keyword" } } } }
2.    Now, we can store a document to test the mapping:
POST test-dvector/_doc/1
{ "model":"pipe_flood", "vector" : [8.1, 8.3, 12.1, 7.32] }


Using the Histogram field type


Histograms are a common data type for analytics and machine learning analysis. We can store histograms as pairs of values and counts; they are not indexed, but they can be used in aggregations.
The histogram field type is a special mapping, available in X-Pack, that is commonly used to store the results of histogram aggregations in Elasticsearch for further processing, such as comparing aggregation results at different times.

1.    First, let's create an index for the Histogram by using the following mapping:
PUT test-histo
{ "mappings": {
"properties": {
    "histogram": { "type": "histogram" },
    "model": { "type": "keyword" } } } }
2.    Now, we can store a document to test the mapping:
POST test-histo/_doc/1
{ "model":"show_level", "histogram" : { "values" : [2016, 2017, 2018, 2019, 2020, 2021], "counts" : [283, 337, 323, 312, 236, 232] } }

The histogram field can be used in the following aggregations:
•    Metric aggregations such as min, max, sum, value_count, and avg
•    The percentiles and percentile_ranks aggregations (see the sketch after this list)
•    The boxplot aggregation
•    The histogram aggregation
The data is not indexed, but you can still check whether a document has this field populated by using the exists query.
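For example, a percentiles aggregation can be computed directly from the pre-aggregated values and counts stored in the histogram field; a minimal sketch:

GET test-histo/_search
{ "size": 0,
  "aggs": {
    "histo_percentiles": {
      "percentiles": { "field": "histogram", "percents": [50, 95, 99] }
} } }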


Adding metadata to a mapping


Sometimes, when we are working with our mapping, we may need to store some additional data to be used for display purposes, ORM facilities, permissions, or simply to track them in the mapping.
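Such metadata lives in the _meta section of the mapping: Elasticsearch stores and returns it, but never interprets it. A minimal sketch with purely illustrative attributes:

PUT test/_mapping
{ "_meta": {
    "model_version": "1.2",
    "attributes": { "label": "Orders", "visible": true }
} }

The stored metadata can be read back with GET test/_mapping.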

Specifying different analyzers


We have already seen how easy it is to change the standard analyzer with the analyzer and search_analyzer properties:

{ "name": {
"type": "string",
"index_analyzer": "standard",
"search_analyzer": "simple"
} }


The most famous analysis plugins are as follows:
•    The ICU analysis plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-icu.html)
•    The Phonetic analysis plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-phonetic.html)
•    The Smart Chinese analysis plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-smartcn.html)
•    The Japanese (kuromoji) analysis plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-kuromoji.html)


Using index components and templates


Real-world index mappings can be very complex and, often, parts of them can be reused across different indices. To be able to simplify this management, mappings can be divided into the following:
•    Components: These will collect the reusable parts of the mapping.
•    Index templates: These aggregate the components in a single template.
Using components is the most manageable way to scale on large index mappings because they can simplify large template management.


1.    First, we will create three components for the timestamp, order, and items. These will store parts of our index mapping:
PUT _component_template/timestamp-management
{ "template": {
    "mappings": {
    "properties": {
        "@timestamp": { "type": "date" } } } } }
PUT _component_template/order-data
{ "template": {
    "mappings": {
    "properties": {
        "id": { "type": "keyword" },
        "date": { "type": "date" },
        "customer_id": { "type": "keyword" },
        "sent": { "type": "boolean" } } } } }
PUT _component_template/items-data
{ "template": {
    "mappings": {
    "properties": {
        "item": {
            "type": "object",
            "properties": {
            "name": { "type": "keyword" },
            "quantity": { "type": "long" },
            "cost": { "type": "alias", "path": "item.price" },
            "price": { "type": "double" },
            "vat": { "type": "double" } } } } } } }
2.    Now, we can create an index template that can sum them up:
PUT _index_template/order
{
    "index_patterns": ["order*"],
    "template": {
        "settings": { "number_of_shards": 1 },
        "mappings": {
        "properties": { "id": { "type": "keyword" } }
        },
        "aliases": { "order": { } }
    },
    "priority": 200,
    "composed_of": ["timestamp-management", "order-data", "items-data"],
    "version": 1,
    "_meta": { "description": "My order index template" } }


 
