Elasticsearch: Basic Operations

Contents

Creating an index

Deleting an index

Opening or closing an index

Putting a mapping in an index

Getting a mapping

Reindexing an index

Refreshing an index

Flushing an index

Using ForceMerge on an index

Shrinking an index

Checking whether an index exists

Managing index settings

Using index aliases

Managing dangling indices

Resolving index names

Rolling over an index

Indexing a document

Getting a document

Deleting a document

Updating a document

Speeding up atomic operations (bulk operations)

Speeding up GET operations (multi-GET)


Creating an index


The HTTP method for creating an index is PUT.
http://<server>/<index_name>


PUT /myindex
{ "settings": {
"index": {
"number_of_shards": 2, "number_of_replicas": 1
} } }


Mappings can also be provided at creation time:

PUT /myindex
{ "settings": { "number_of_shards": 2, "number_of_replicas": 1 },
"mappings": {
    "properties": {
    "id": { "type": "keyword", "store": true },
    "date": { "type": "date", "store": false },
    "customer_id": { "type": "keyword", "store": true },
    "sent": { "type": "boolean" },
    "name": { "type": "text" },
    "quantity": { "type": "integer" },
    "vat": { "type": "double", "index": true }
} } }


Deleting an index

The HTTP method for deleting an index is DELETE.
http://<server>/<index_name>
DELETE /myindex

Opening or closing an index

The HTTP method for opening/closing an index is POST.
http://<server>/<index_name>/_open
http://<server>/<index_name>/_close

If you want to keep your data but save resources (memory and CPU), a good practice is to close unused indices.


POST /myindex/_close
POST /myindex/_open


There are several use cases for closing an index:
•    Disabling date-based indices (indices that store their records by date) – for example, when you keep an index per day, week, or month and want a fixed window of recent indices (say, up to 2 months old) online and older ones (say, from 2 to 6 months old) offline.
•    Excluding some indices when you search across all the active indices of a cluster (using an alias is the best solution here, but you can achieve the same result by closing the indices you want to exclude).

Indices can also be frozen with _freeze and unfrozen with _unfreeze. A frozen index is in read-only mode:

POST /myindex/_freeze
POST /myindex/_unfreeze


Putting a mapping in an index


The HTTP method for putting a mapping in an index is PUT (POST also works).
http://<server>/<index_name>/_mapping

PUT /myindex/_mapping
{ "properties": {
    "id": { "type": "keyword", "store": true },
    "date": { "type": "date", "store": false },
    "customer_id": {"type": "keyword","store": true },
    "sent": { "type": "boolean" },
    "name": { "type": "text" },
    "quantity": { "type": "integer" },
    "vat": { "type": "double", "index": false } } }


When a mapping is inserted and one already exists for the index, the new mapping is merged with the existing one.
If a field is resubmitted with a different type and the type cannot be updated, an exception about expanding the fields property is raised.
To prevent an exception during the mapping merge phase, the ignore_conflicts parameter can be set to true (the default is false).
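
For example, assuming the myindex mapping shown previously is already in place, submitting a mapping that only adds a new field merges cleanly (the description field here is a hypothetical addition):

PUT /myindex/_mapping
{ "properties": {
    "description": { "type": "text" } } }

Resubmitting an existing field with an incompatible type (for example, vat as keyword) would raise a merge exception instead.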


Getting a mapping

The HTTP method for getting a mapping is GET.
http://<server>/_mapping
http://<server>/<index_name>/_mapping

GET /myindex/_mapping?pretty


Reindexing an index

Reindexing copies the documents of a source index into a destination index, typically one with a new mapping. The most common scenarios are as follows:
•    Changing the analyzer of the mapping
•    Adding a new subfield to the mapping, where you need to reprocess all the records to search for the new subfield
•    Removing unused mappings
•    Changing a record structure that requires a new mapping


The HTTP method for reindexing an index is POST.
http://<server>/_reindex
POST /_reindex?pretty=true
{ "source": { "index": "myindex" },
"dest": { "index": "myindex2" } }


The advantages of the new Elasticsearch implementation are as follows:
•    You can quickly copy data because it is completely managed on the server side.
•    You can manage the operation better due to the new task API.
•    Better error-handling support as it is done at the server level. This allows us to manage failovers better during reindex operations.
•    The source index can be a remote one, so this command lets you copy/back up data or part of a dataset from an Elasticsearch cluster to another one.
At the server level, this action is comprised of the following steps:
1.    Initializing an Elasticsearch task to manage the operation
2.    Creating the target index and copying the source mappings, if required
3.    Executing a query to collect the documents to be reindexed
4.    Reindexing all the documents using bulk operations until all the documents have been reindexed

•    The source section controls how the source documents are selected. The most important subsections are as follows:
    o    index, which is the source index to be used. It can also be a list of indices.
    o    query (optional), which is the Elasticsearch query to be used to select parts of the document.
    o    sort (optional), which can be used to provide a way of sorting the documents.

•    The dest section controls how the target documents are written. The most important parameters in this section are as follows:
    o    index, which is the target index to be used. If it does not exist, it is created.
    o    version_type (optional) where, when it is set to external, the external version is preserved.
    o    routing (optional), which controls the routing in the destination index. It can be any of the following:
            keep (the default), which preserves the original routing
            discard, which discards the original routing
            =<text>, which uses the text value for the routing process
•    pipeline (optional), which allows you to define a custom pipeline for ingestion. We will learn more about the ingestion pipeline in Chapter 12, Using the Ingest Module.
•    size (optional), which specifies the number of documents to be reindexed.
•    script (optional), which allows you to define a script for document manipulation. This will be discussed in the Reindexing with a custom script recipe in Chapter 8, Scripting in Elasticsearch.
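
Putting these options together, a sketch of a filtered reindex might look as follows (the query, the size value, and the target index name are illustrative):

POST /_reindex
{ "source": { "index": "myindex",
    "query": { "term": { "customer_id": "customer1" } } },
"dest": { "index": "myindex2", "version_type": "external" },
"size": 1000 }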


Refreshing an index


When you send data to Elasticsearch, the data is not instantly searchable. It becomes searchable only after a time interval (generally a second) known as the refresh interval.

The HTTP method that's used for both operations is POST.
The URL format for refreshing an index is as follows:
http://<server>/<index_name(s)>/_refresh
The URL format for refreshing all the indices in a cluster is as follows:
http://<server>/_refresh

POST /myindex/_refresh

# To guarantee that a newly indexed document is immediately searchable, pass refresh=true:
POST /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?refresh=true
{ "id": "1234", "date": "2013-06-07T12:14:54",
"customer_id": "customer1", "sent": true,
"in_stock_items": 0,
"items": [
{ "name": "item1", "quantity": 3, "vat": 20 },
{ "name": "item2", "quantity": 2, "vat": 20 },
{ "name": "item3", "quantity": 1, "vat": 10 } ] }


Flushing an index

For performance reasons, Elasticsearch stores some data in memory and in a transaction log.
If we want to free that memory and ensure that our data is safely written on disk, we need to flush the index, which also empties the transaction log.
Elasticsearch automatically provides periodic flushing on disk, but forcing flushing can be useful in the following situations:
•    When we need to shut down a node to prevent stale data
•    When we need to have all the data in a safe state (for example, after a big indexing operation so that all the data can be flushed and refreshed)


The HTTP method that's used for both operations is POST.
The URL format for flushing an index is as follows:
http://<server>/<index_name(s)>/_flush[?refresh=True]
The URL format for flushing all the indices in a cluster is as follows:
http://<server>/_flush[?refresh=True]

POST /myindex/_flush

Using ForceMerge on an index

Lucene stores your data in several segments on disk. These segments are created when you index a new document or record, or when you delete a document.
In Elasticsearch, the deleted document is not removed from disk; instead, it is marked as deleted (and referred to as a tombstone). 
To free up disk space, you need to run a force merge to purge the deleted documents.

 
The HTTP method that's used here is POST.
The URL format for force-merging one or more indices is as follows:
http://<server>/<index_name(s)>/_forcemerge
The URL format for force-merging all the indices in a cluster is as follows:
http://<server>/_forcemerge

POST /myindex/_forcemerge

The forcemerge operation in Lucene tries to reduce the segments in an I/O-heavy way by removing unused ones, purging deleted documents, and rebuilding the index with a minimal number of segments.
The main advantages of this are as follows:
•    It reduces the number of open file descriptors.
•    It frees up the memory that was used by the segment readers.
•    It improves search performance because fewer segments must be managed.
ForceMerge is a very I/O-heavy operation. The index can be unresponsive during this optimization. 
It is generally executed on indices that are rarely modified, such as Logstash indices for previous days.

You can pass several additional parameters to the ForceMerge call, such as the following:
•    max_num_segments: The default value is autodetect. For full optimization, set this value to 1.
•    only_expunge_deletes: The default value is false. Lucene does not delete documents from segments; instead, it marks them as deleted. If this flag is true, only the segments that contain deleted documents are merged.
•    flush: The default value is true. Elasticsearch performs a flush after a ForceMerge.
•    wait_for_merge: The default value is true. It controls whether the request must wait until the merge ends.
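
For example, a full optimization that leaves a single segment per shard can be requested as follows:

POST /myindex/_forcemerge?max_num_segments=1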

Shrinking an index


By using the shrink API, it's possible to reduce the number of shards in an index.
This feature targets several common scenarios:
•    A wrong number of shards was chosen during the initial design sizing. Sizing the shards without knowing the correct data or text distribution often leads to an oversized number of shards.
•    Reducing the number of shards to reduce memory and resource usage.
•    Reducing the number of shards to speed up searching.

The HTTP method that's used here is POST.
The URL format for shrinking an index is as follows:
http://<server>/<source_index_name>/_shrink/<target_index_name>
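
A minimal sketch of the full workflow, assuming a node named es-node-1 and a target index called myindex_shrunk, is as follows. First, relocate all the shards to a single node and block writes:

PUT /myindex/_settings
{ "index.routing.allocation.require._name": "es-node-1",
"index.blocks.write": true }

Once the relocation has finished, execute the shrink and clear the temporary settings on the target:

POST /myindex/_shrink/myindex_shrunk
{ "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null } }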


Checking whether an index exists


The HTTP method for checking an index's existence is HEAD.
http://<server>/<index_name>/

HEAD /myindex/  

The most common status codes are as follows:
•    The 20X family, if everything is okay
•    404, if the resource is not available
•    The 50X family, if there are server errors
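
From the command line, a quick existence check can print just the status code (localhost:9200 is assumed):

curl -s -o /dev/null -w "%{http_code}\n" -I http://localhost:9200/myindex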


Managing index settings


http://<server>/<index_name>/_settings

GET /myindex/_settings?pretty=true

PUT /myindex/_settings
{"index":{ "number_of_replicas": 2}}

A typical pattern for speeding up a massive bulk ingestion is as follows:
1.    Disable the refresh:
PUT /myindex/_settings
{"index":{"refresh_interval": "-1"}}
2.    Bulk-index millions of documents.
3.    Restore the refresh:
PUT /myindex/_settings
{"index":{"refresh_interval": "1s"}}
4.    Optionally, you can optimize an index for search performance:
POST /myindex/_forcemerge


Using index aliases

http://<server>/_aliases
http://<server>/<index>/_alias/<alias_name>

GET /_aliases
GET /myindex/_alias

PUT /myindex/_alias/myalias1
DELETE /myindex/_alias/myalias1


# Aliases can also be used to define a filter and routing parameters:
POST /myindex/_aliases/user1alias
{ "filter": { "term": { "user": "user_1" } },
"search_routing": "1,2", "index_routing": "2" }

Managing dangling indices


In the case of a node failure, if there are not enough replicas, some shards (and the data within them) can be lost.
Indices with missing shards are marked red; they are put in read-only mode, and querying their data raises errors.
In this situation, the only available option is to drop the broken index and restore it from the original data or a backup. When the failed node returns as active in the cluster, its orphan shards show up as dangling indices.


http://<server>/_dangling
To manage a dangling index, follow these steps:
1.    Get the list of dangling indices that are present in our cluster (we use GET to read here):
GET /_dangling
The output will be as follows:
    { "_nodes" : { "total" : 1, "successful" : 1,
    "failed" : 0 },
    "cluster_name" : "packtpub",
    "dangling_indices" : [
    { "index_name": "my-index-000001",
    "index_uuid": "zmM4e0JtBkeUjiHD-MihPQ",
    "creation_date_millis": 1589414451372,
    "node_ids": [ "pL47UN3dAb2d5RCWP6lQ3e" ]
    } ] }
2.    We can restore the data of the dangling index using the index_uuid from the previous response (zmM4e0JtBkeUjiHD-MihPQ) like so:
POST /_dangling/zmM4e0JtBkeUjiHD-MihPQ?accept_data_loss=true
The output will be as follows:
{ "acknowledged" : true }
3.    If you wish to save space and remove the data of the dangling index instead, you can execute the following command:
DELETE /_dangling/<index-uuid>?accept_data_loss=true
The output will be as follows:
{ "acknowledged" : true }

Resolving index names

The HTTP method here is GET, and the expression can contain wildcards:

GET /_resolve/index/myinde*
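
The response lists the matching indices, aliases, and data streams; an abridged, illustrative output looks like this:

{ "indices": [ { "name": "myindex", "attributes": [ "open" ] } ],
"aliases": [ ],
"data_streams": [ ] }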


Rolling over an index

When you're using a system that manages logs, it is very common to use rolling files for your log entries. By doing so, you can have indices that are similar to rolling files.

1.    We need an index with a logs_write alias that points to it alone:
PUT /mylogs-000001
{ "aliases": { "logs_write": {} } }
2.    We can execute a rollover on the logs_write alias like so:
POST /logs_write/_rollover
{ "conditions": {
    "max_age": "7d", "max_docs": 100000, "max_size": "5g" },
"settings": { "index.number_of_shards": 3 } }
The output will be as follows:
{ "acknowledged" : false,
"shards_acknowledged" : false,
"old_index" : "mylogs-000001",
"new_index" : "mylogs-000002",
"rolled_over" : false, "dry_run" : false,
"conditions" : {
"[max_docs: 100000]" : false,
"[max_age: 7d]" : false } }
3.    If your alias doesn't point to a single index, the following error will be returned:
{ "error" : {
"root_cause" : [ {
"type" : "illegal_argument_exception",
"reason" : "source alias maps to multiple indices" } ],
"type" : "illegal_argument_exception",
"reason" : "source alias maps to multiple indices"
}, "status" : 400 }


You can define different criteria for rolling over your index:
•    max_age (Optional): The validity period for writing in this index.
•    max_docs (Optional): The maximum number of documents in an index.
•    max_size (Optional): The maximum size of the index. (Pay attention and divide it by the number of shards to get the real shard size.)

Using rolling indices has several advantages, including the following:
•    You can cap indices at a fixed number of documents or a fixed size, avoiding indices that grow too large or stay nearly empty.
•    You can automatically manage the time validity of the index, which can span different days.
If large data is stored in rolling indices, then the following disadvantages may occur:
•    It's more difficult to filter indices for data, and your queries often have to hit all the indices of a rolling group (more time and resources will be needed for queries).
•    There will be issues in guaranteeing a GDPR approach to the end of life of the indices because some days of data could be present in two indices.
•    There will be more complexity in operational activities such as ForceMerge and cold/warm indices management.

Indexing a document


POST http://<server>/<index>/_doc
POST/PUT http://<server>/<index>/_doc/<id>
POST/PUT http://<server>/<index>/_create/<id>

POST /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw
{ "id": "1234", "date": "2013-06-07T12:14:54",
"customer_id": "customer1", "sent": true,
"in_stock_items": 0,
"items": [
{ "name": "item1", "quantity": 3, "vat": 20 },
{ "name": "item2", "quantity": 2, "vat": 20 },
{ "name": "item3", "quantity": 1, "vat": 10 } ] }


Indexing a JSON document consists of the following steps:
1.    Routing the call to the correct shard based on the ID, routing, or parent metadata. If the ID is not supplied by the client, Elasticsearch automatically generates one and associates it with the document.
To improve performance, IDs should generally be of the same character length, which improves the balancing of the data tree that stores them.

2.    Validating the sent JSON.
3.    Processing the JSON according to the mapping. If new fields are present in the document (and the mapping can be updated), new fields will be added to the mapping.
4.    Indexing the document in the shard. If the ID already exists, it is updated.
5.    If it contains nested documents, it extracts and processes them separately.
6.    Returning information about the saved document (ID and versioning).


Because these are REST calls, pay attention when you use non-ASCII characters: they must be URL-encoded and decoded correctly (or ensure that the client framework you use escapes them correctly).


Several query parameters control the indexing call. The most used ones are as follows:
•    routing: This controls the shard to be used for indexing, as follows:
POST /myindex/_doc?routing=1
•    consistency (one/quorum/all): By default, an index operation succeeds if a quorum (>replicas/2+1) of active shards is available. The consistency value can be changed per index action:
POST /myindex/_doc?consistency=one
•    replication (sync/async): By default, Elasticsearch returns from an index operation when all the shards of the current replication group have executed it. Setting async replication executes the index action synchronously on the primary shard and asynchronously on the secondary shards. In this way, the API call returns the response faster:
POST /myindex/_doc?replication=async
•    version: The version allows us to use the optimistic concurrency control (http://en.wikipedia.org/wiki/Optimistic_concurrency_control). The first time a document is indexed, its version (1) is set on the document. Every time it's updated, this value is incremented. Optimistic concurrency control is a way to manage concurrency in every insert or update operation. The passed version value is the last seen version (usually, it's returned by a GET or a search). Indexing only happens if the current index version's value is equal to the passed one:
POST /myindex/_doc?version=2
•    op_type: This can be used to force a create on a document. If a document with the same ID exists, the index will fail:
POST /myindex/_doc?op_type=create
•    refresh: This forces a refresh once you've indexed the document. It allows documents to be ready for searching once they've been indexed:
POST /myindex/_doc?refresh=true
•    timeout: This defines the time to wait for the primary shard to become available. Sometimes, the primary shard is not in a writable state (if it's relocating or recovering from a gateway); by default, the write operation times out after 1 minute:
POST /myindex/_doc?timeout=5m


Getting a document

http://<server>/<index_name>/_doc/<id>


GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw


Several additional parameters can be used to control the GET call:
•    _source allows us to retrieve only a subset of fields. This is very useful for reducing bandwidth or for retrieving calculated fields such as the attachment-mapping ones:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?_source=date,sent
•    stored_fields, similar to _source, allows us to retrieve only a subset of fields that are marked as stored in the mapping. Stored fields are kept in a separate portion of the index, and they can be retrieved without having to parse the JSON source:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?stored_fields=date,sent
•    routing allows us to specify the shard to be used for the GET operation. To retrieve a document, the routing used at GET time must be the same as the one used at indexing time:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?routing=customer_id
•    refresh allows us to refresh the current shard before performing the GET operation (it must be used with care because it slows down indexing and introduces some overhead):
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?refresh=true
•    preference allows us to control which shard replica is chosen to execute the GET method. Generally, Elasticsearch chooses a random shard for the GET call. The possible values are as follows:
o    _primary for the primary shard.
o    _local, first trying the local shard and then falling back to a random choice. Using the local shard reduces the bandwidth usage and should generally be used with auto-replicating shards (replica set to 0-all).
o    custom value for selecting a shard-related value, such as customer_id or username.
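
For example, the following call prefers the local shard copy:
GET /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw?preference=_local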


Deleting a document


DELETE http://<server>/<index_name>/_doc/<id>
DELETE /myindex/_doc/2qLrAfPVQvCRMe7Ku8r0Tw
Deleting a record only hits the shards that contain the document, so there is no overhead.


Updating a document


The update API covers changing basic fields and counter-style increments.
The HTTP method for updating a document is POST.
POST http://<server>/<index_name>/_update/<id>

PUT test/_mapping
{ "properties": {
    "id": {"type": "keyword"},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "salary": {"type": "double"},
    "sent": {"type": "boolean"},
    "item": {
        "type": "object",
        "properties": {
        "name": {"type": "keyword"},
        "quantity": {"type": "long"},
        "price": {"type": "double"},
        "vat": {"type": "double"}
} } } }

PUT test/_doc/1?refresh
{ "id": "1", "date": "2018-11-16T20:07:45Z",
"customer_id": "100", "salary":100.0,"sent": true,
"item": [ { "name": "tshirt", "quantity": 10, "price": 4.3, "vat": 8.5 } ] }


GET test/_doc/1?refresh
POST /test/_update/1?refresh
{ "script": {
"source": "ctx._source.salary += params.inc_salary",
"params": { "inc_salary": 200.0 } },
"upsert": { "in_stock_items": 4 } 
 }

Alternatively, a partial document update sets the field directly:
POST /test/_update/1
{ "doc": { "salary": 400.0 } }


By using Painless scripting, it is possible to apply advanced operations on fields, such as the following:
•    Removing a field, like so:
POST /test/_update/1
{ "script" : {"inline": "ctx._source.remove(\"salary\")"}} 
•    Adding a new field, like so:
POST /test/_update/1
{ "script" : {"inline": "ctx._source.salary=300.0"}} 

If the field does not yet exist in the document, doc_as_upsert adds it (and would create the whole document if it were missing):
POST /test/_update/1
{ "doc": { "loan": 200.0 }, "doc_as_upsert": true }
Without doc_as_upsert, the same partial update only succeeds if the document already exists:
POST /test/_update/1
{ "doc": { "loan": 200.0 } }

Speeding up atomic operations (bulk operations)

POST http://<server>/_bulk
POST http://<server>/<index_name>/_bulk
1.    We need to collect the create/index/delete/update commands in a structure made up of bulk JSON lines, composed of a line of action with metadata, and another optional line of data related to the action. Every line must end with a new line, \n. A bulk data file should be presented like this:
{ "index":{ "_index":"myindex", "_id":"1" } }
{ "field1" : "value1", "field2" : "value2" }
{ "delete":{ "_index":"myindex", "_id":"2" } }
{ "create":{ "_index":"myindex", "_id":"3" } }
{ "field1" : "value1", "field2" : "value2" }
{ "update":{ "_index":"myindex", "_id":"3" } }
{ "doc":{"field1" : "value1", "field2" : "value2" }}
2.    This file can be sent with the following POST:
curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@bulkdata"
3.    The output that's returned by Elasticsearch should collect all the responses from the actions.
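
The response mirrors the request order, with one entry per action; an abridged, illustrative output looks like this:

{ "took": 30, "errors": false,
"items": [
    { "index": { "_index": "myindex", "_id": "1", "result": "created", "status": 201 } },
    { "delete": { "_index": "myindex", "_id": "2", "result": "not_found", "status": 404 } } ] }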


POST /_bulk
{ "index":{ "_index":"myindex", "_id":"1" } }
{ "field1" : "value1", "field2" : "value2" }
{ "delete":{ "_index":"myindex", "_id":"2" } }
{ "create":{ "_index":"myindex", "_id":"3" } }
{ "field1" : "value1", "field2" : "value2" }
{ "update":{ "_index":"myindex", "_id":"3" } }
{ "doc":{ "field1" : "value1", "field2" : "value2" } }


Speeding up GET operations (multi-GET)

http://<server>/_mget
http://<server>/<index_name>/_mget

POST /_mget
{ "docs": [
    { "_index": "myindex", "_id":"2qLrAfPVQvCRMe7Ku8r0Tw" },
    { "_index": "myindex", "_id": "2" } ] }

If the index is fixed, the call can use the following shorter form:
GET /myindex/_mget
{ "ids" : ["1", "2"] }
 
