Elasticsearch 6.4 Series, Part 9: Document APIs

Document APIs


This section starts with a short introduction to Elasticsearch’s data replication model, followed by a detailed description of the following CRUD APIs:

Single document APIs

  • Index API
  • Get API
  • Delete API
  • Update API

Multi-document APIs

  • Multi Get API
  • Bulk API
  • Delete By Query API
  • Update By Query API
  • Reindex API

Reading and Writing documents

Basic write model

Failure handling

Basic read model

Failure handling

A few simple implications

Efficient reads

Under normal operation each read operation is performed once for each relevant replication group. Only under failure conditions do multiple copies of the same shard execute the same search.

Read unacknowledged

Since the primary first indexes locally and then replicates the request, it is possible for a concurrent read to already see the change before it has been acknowledged.

Two copies by default

Failures

A single shard can slow down indexing

Because the primary waits for all replicas in the in-sync copies set during each operation, a single slow shard can slow down the entire replication group. This is the price we pay for the read efficiency mentioned above. Of course a single slow shard will also slow down unlucky searches that have been routed to it.

Dirty reads

Index API

The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts the JSON document into the “twitter” index, under a type called _doc with an id of 1:

PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
or
curl -X PUT "localhost:9200/twitter/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

Automatic Index Creation

The index operation automatically creates an index if it does not already exist, and applies any index templates that are configured. The index operation also creates a dynamic type mapping for the specified type if one does not already exist. By default, new fields and objects will automatically be added to the mapping definition for the specified type if needed. Check out the mapping section for more information on mapping definitions, and the put mapping API for information about updating type mappings manually.

Automatic index creation is controlled by the action.auto_create_index setting. This setting defaults to true, meaning that indices are always automatically created. Automatic index creation can be permitted only for indices matching certain patterns by changing the value of this setting to a comma-separated list of these patterns. It can also be explicitly permitted and forbidden by prefixing patterns in the list with a + or -. Finally it can be completely disabled by changing this setting to false.

Permit only the auto-creation of indices called twitter, index10, no other index matching index1*, and any other index matching ind*. The patterns are matched in the order in which they are given.

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "twitter,index10,-index1*,+ind*" 
    }
}

Completely disable the auto-creation of indices.

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "false" 
    }
}

Permit the auto-creation of indices with any name. This is the default.

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "true" 
    }
}
or
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
    "persistent": {
        "action.auto_create_index": "twitter,index10,-index1*,+ind*" 
    }
}
'
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
    "persistent": {
        "action.auto_create_index": "false" 
    }
}
'
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
    "persistent": {
        "action.auto_create_index": "true" 
    }
}'

Versioning

Each indexed document is given a version number. The associated version number is returned as part of the response to the index API request. The index API optionally allows for optimistic concurrency control when the version parameter is specified. This will control the version of the document the operation is intended to be executed against. A good example of a use case for versioning is performing a transactional read-then-update. Specifying a version from the document initially read ensures no changes have happened in the meantime (when reading in order to update, it is recommended to set preference to _primary). For example:

PUT twitter/_doc/1?version=2
{
    "message" : "elasticsearch now has versioning support, double cool!"
}
or
curl -X PUT "localhost:9200/twitter/_doc/1?version=2&pretty" -H 'Content-Type: application/json' -d'
{
    "message" : "elasticsearch now has versioning support, double cool!"
}'

By default, internal versioning is used that starts at 1 and increments with each update, deletes included. Optionally, the version number can be supplemented with an external value (for example, if maintained in a database). To enable this functionality, version_type should be set to external. The value provided must be a numeric, long value greater or equal to 0, and less than around 9.2e+18. When using the external version type, instead of checking for a matching version number, the system checks to see if the version number passed to the index request is greater than the version of the currently stored document. If true, the document will be indexed and the new version number used. If the value provided is less than or equal to the stored document’s version number, a version conflict will occur and the index operation will fail.
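For example, a minimal sketch of indexing with an external version (the value 5 is illustrative; in practice it would come from your source system):

PUT twitter/_doc/1?version=5&version_type=external
{
    "message" : "elasticsearch now has external versioning support"
}
or
curl -X PUT "localhost:9200/twitter/_doc/1?version=5&version_type=external&pretty" -H 'Content-Type: application/json' -d'
{
    "message" : "elasticsearch now has external versioning support"
}'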

A nice side effect is that there is no need to maintain strict ordering of async indexing operations executed as a result of changes to a source database, as long as version numbers from the source database are used. Even the simple case of updating the Elasticsearch index using data from a database is simplified if external versioning is used, as only the latest version will be used if the index operations are out of order for whatever reason.

Version types

Next to the internal & external version types explained above, Elasticsearch also supports other types for specific use cases. Here is an overview of the different version types and their semantics.

  • internal

    only index the document if the given version is identical to the version of the stored document.
  • external or external_gt

    only index the document if the given version is strictly higher than the version of the stored document or if there is no existing document. The given version will be used as the new version and will be stored with the new document. The supplied version must be a non-negative long number.
  • external_gte

    only index the document if the given version is equal or higher than the version of the stored document. If there is no existing document the operation will succeed as well. The given version will be used as the new version and will be stored with the new document. The supplied version must be a non-negative long number.
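To illustrate the difference (the version value is illustrative): if the stored version is already 5, the following request succeeds under external_gte because equal versions are accepted, but would fail with a version conflict under external:

PUT twitter/_doc/1?version=5&version_type=external_gte
{
    "message" : "trying out external_gte"
}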

Operation Type

The index operation also accepts an op_type that can be used to force a create operation, allowing for “put-if-absent” behavior. When create is used, the index operation will fail if a document by that id already exists in the index.

Here is an example of using the op_type parameter:

PUT twitter/_doc/1?op_type=create
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
or
curl -X PUT "localhost:9200/twitter/_doc/1?op_type=create&pretty" -H 'Content-Type: application/json' -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

Another option to specify create is to use the following uri:

PUT twitter/_doc/1/_create
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
or
curl -X PUT "localhost:9200/twitter/_doc/1/_create?pretty" -H 'Content-Type: application/json' -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

Automatic ID Generation

The index operation can be executed without specifying the id. In such a case, an id will be generated automatically. In addition, the op_type will automatically be set to create. Here is an example (note the POST used instead of PUT):

POST twitter/_doc/
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
or
curl -X POST "localhost:9200/twitter/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

Routing

By default, shard placement — or routing — is controlled by using a hash of the document’s id value. For more explicit control, the value fed into the hash function used by the router can be directly specified on a per-operation basis using the routing parameter. For example:

POST twitter/_doc?routing=kimchy
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
or
curl -X POST "localhost:9200/twitter/_doc?routing=kimchy&pretty" -H 'Content-Type: application/json' -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

In the example above, the “_doc” document is routed to a shard based on the routing parameter provided: “kimchy”.

When setting up explicit mapping, the _routing field can be optionally used to direct the index operation to extract the routing value from the document itself. This does come at the (very minimal) cost of an additional document parsing pass. If the _routing mapping is defined and set to be required, the index operation will fail if no routing value is provided or extracted.
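For reference, a minimal sketch of a mapping that makes routing mandatory (the index name is illustrative):

PUT my_index
{
    "mappings": {
        "_doc": {
            "_routing": {
                "required": true
            }
        }
    }
}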

Distributed

The index operation is directed to the primary shard based on its route (see the Routing section above) and performed on the actual node containing this shard. After the primary shard completes the operation, if needed, the update is distributed to applicable replicas.

Wait For Active Shards

To improve the resiliency of writes to the system, indexing operations can be configured to wait for a certain number of active shard copies before proceeding with the operation. If the requisite number of active shard copies are not available, then the write operation must wait and retry, until either the requisite shard copies have started or a timeout occurs. By default, write operations only wait for the primary shards to be active before proceeding (i.e. wait_for_active_shards=1). This default can be overridden in the index settings dynamically by setting index.write.wait_for_active_shards. To alter this behavior per operation, the wait_for_active_shards request parameter can be used.

Valid values are all or any positive integer up to the total number of configured copies per shard in the index (which is number_of_replicas+1). Specifying a negative value or a number greater than the number of shard copies will throw an error.

For example, suppose we have a cluster of three nodes, A, B, and C and we create an index index with the number of replicas set to 3 (resulting in 4 shard copies, one more copy than there are nodes). If we attempt an indexing operation, by default the operation will only ensure the primary copy of each shard is available before proceeding. This means that even if B and C went down, and A hosted the primary shard copies, the indexing operation would still proceed with only one copy of the data. If wait_for_active_shards is set on the request to 3 (and all 3 nodes are up), then the indexing operation will require 3 active shard copies before proceeding, a requirement which should be met because there are 3 active nodes in the cluster, each one holding a copy of the shard. However, if we set wait_for_active_shards to all (or to 4, which is the same), the indexing operation will not proceed as we do not have all 4 copies of each shard active in the index. The operation will timeout unless a new node is brought up in the cluster to host the fourth copy of the shard.

It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary. The _shards section of the write operation’s response reveals the number of shard copies on which replication succeeded/failed.
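To set the requirement on a single request, a sketch (the value 2, i.e. the primary plus one replica, is illustrative):

PUT twitter/_doc/1?wait_for_active_shards=2
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}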

Refresh

Control when the changes made by this request are visible to search. See refresh.
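For example, a sketch that blocks the index request until the change is visible to search (the refresh parameter also accepts true and false):

PUT twitter/_doc/1?refresh=wait_for
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}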

Noop Updates

When updating a document using the index api a new version of the document is always created even if the document hasn’t changed. If this isn’t acceptable use the _update api with detect_noop set to true. This option isn’t available on the index api because the index api doesn’t fetch the old source and isn’t able to compare it against the new source.

There isn’t a hard and fast rule about when noop updates aren’t acceptable. It’s a combination of lots of factors like how frequently your data source sends updates that are actually noops and how many queries per second Elasticsearch runs on the shard receiving the updates.

Timeout

The primary shard assigned to perform the index operation might not be available when the index operation is executed. Some reasons for this might be that the primary shard is currently recovering from a gateway or undergoing relocation. By default, the index operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error. The timeout parameter can be used to explicitly specify how long it waits. Here is an example of setting it to 5 minutes:

PUT twitter/_doc/1?timeout=5m
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
or
curl -X PUT "localhost:9200/twitter/_doc/1?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

Get API

The get API allows you to get a typed JSON document from the index based on its id. The following example gets a JSON document from an index called twitter, under a type called _doc, with id valued 0:

GET twitter/_doc/0
or
curl -X GET "localhost:9200/twitter/_doc/0?pretty"

The above result includes the _index, _type, _id and _version of the document we wish to retrieve, including the actual _source of the document if it could be found (as indicated by the found field in the response).

The API also allows you to check for the existence of a document using HEAD. For example, to check whether the document exists:

HEAD twitter/_doc/0
or
curl -I "localhost:9200/twitter/_doc/0?pretty"

Realtime

By default, the get API is realtime, and is not affected by the refresh rate of the index (when data will become visible for search). If a document has been updated but is not yet refreshed, the get API will issue a refresh call in-place to make the document visible. This will also make other documents changed since the last refresh visible. In order to disable realtime GET, one can set the realtime parameter to false.
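A sketch of disabling realtime GET so the request only sees what is already visible to search:

GET twitter/_doc/0?realtime=false
or
curl -X GET "localhost:9200/twitter/_doc/0?realtime=false&pretty"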

Source filtering

By default, the get operation returns the contents of the _source field unless you have used the stored_fields parameter or if the _source field is disabled. You can turn off _source retrieval by using the _source parameter:

GET twitter/_doc/0?_source=false
or
curl -X GET "localhost:9200/twitter/_doc/0?_source=false&pretty"

If you only need one or two fields from the complete _source, you can use the _source_include & _source_exclude parameters to include or filter out the parts you need. This can be especially helpful with large documents where partial retrieval can save on network overhead. Both parameters take a comma separated list of fields or wildcard expressions. Example:

GET twitter/_doc/0?_source_include=*.id&_source_exclude=entities
or
curl -X GET "localhost:9200/twitter/_doc/0?_source_include=*.id&_source_exclude=entities&pretty"

If you only want to specify includes, you can use a shorter notation:

GET twitter/_doc/0?_source=*.id,retweeted
or
curl -X GET "localhost:9200/twitter/_doc/0?_source=*.id,retweeted&pretty"

Stored Fields

The get operation allows specifying a set of stored fields that will be returned by passing the stored_fields parameter. If the requested fields are not stored, they will be ignored. Consider for instance the following mapping:

PUT twitter
{
   "mappings": {
      "_doc": {
         "properties": {
            "counter": {
               "type": "integer",
               "store": false
            },
            "tags": {
               "type": "keyword",
               "store": true
            }
         }
      }
   }
}

Now we can add a document:

PUT twitter/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

and try to retrieve it:

GET twitter/_doc/1?stored_fields=tags,counter
or
curl -X GET "localhost:9200/twitter/_doc/1?stored_fields=tags,counter&pretty"

Field values fetched from the document itself are always returned as an array. Since the counter field is not stored the get request simply ignores it when trying to get the stored_fields.

It is also possible to retrieve metadata fields like the _routing field:

PUT twitter/_doc/2?routing=user1
{
    "counter" : 1,
    "tags" : ["white"]
}
GET twitter/_doc/2?routing=user1&stored_fields=tags,counter
or
curl -X GET "localhost:9200/twitter/_doc/2?routing=user1&stored_fields=tags,counter&pretty"

Also, only leaf fields can be returned via the stored_fields option. So object fields can’t be returned and such requests will fail.

Getting the _source directly

Use the /{index}/{type}/{id}/_source endpoint to get just the _source field of the document, without any additional content around it. For example:

GET twitter/_doc/1/_source
or
curl -X GET "localhost:9200/twitter/_doc/1/_source?pretty"

You can also use the same source filtering parameters to control which parts of the _source will be returned:

GET twitter/_doc/1/_source?_source_include=*.id&_source_exclude=entities
or
curl -X GET "localhost:9200/twitter/_doc/1/_source?_source_include=*.id&_source_exclude=entities&pretty"

Note, there is also a HEAD variant for the _source endpoint to efficiently test for document _source existence. An existing document will not have a _source if it is disabled in the mapping.

HEAD twitter/_doc/1/_source
or
curl -I "localhost:9200/twitter/_doc/1/_source?pretty"

Routing

When indexing with controlled routing, the routing value should also be provided in order to get the document. For example:

GET twitter/_doc/2?routing=user1
or
curl -X GET "localhost:9200/twitter/_doc/2?routing=user1&pretty"

The above will get a tweet with id 2, but will be routed based on the user. Note that issuing a get without the correct routing will cause the document not to be fetched.

Preference

Controls a preference of which shard replicas to execute the get request on. By default, the operation is randomized between the shard replicas.

The preference can be set to:

  • _primary

    The operation will go and be executed only on the primary shards.
  • _local

    The operation will prefer to be executed on a local allocated shard if possible.
  • Custom (string) value

    A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with “jumping values” when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.

Refresh

The refresh parameter can be set to true in order to refresh the relevant shard before the get operation and make it searchable. Setting it to true should be done after careful thought and verification that this does not cause a heavy load on the system (and slows down indexing).

Distributed

The get operation gets hashed into a specific shard id. It then gets redirected to one of the replicas within that shard id and returns the result. The replicas are the primary shard and its replicas within that shard id group. This means that the more replicas we will have, the better GET scaling we will have.

Versioning support

You can use the version parameter to retrieve the document only if its current version is equal to the specified one. This behavior is the same for all version types with the exception of version type FORCE which always retrieves the document. Note that FORCE version type is deprecated.
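For example (the version value is illustrative; a non-matching version results in a version conflict):

GET twitter/_doc/1?version=1
or
curl -X GET "localhost:9200/twitter/_doc/1?version=1&pretty"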

Internally, Elasticsearch has marked the old document as deleted and added an entirely new document. The old version of the document doesn’t disappear immediately, although you won’t be able to access it. Elasticsearch cleans up deleted documents in the background as you continue to index more data.

Delete API

The delete API allows you to delete a typed JSON document from a specific index based on its id. The following example deletes the JSON document from an index called twitter, under a type called _doc, with id 1:

DELETE /twitter/_doc/1
or
curl -X DELETE "localhost:9200/twitter/_doc/1?pretty"

Versioning

Each document indexed is versioned. When deleting a document, the version can be specified to make sure the relevant document we are trying to delete is actually being deleted and it has not changed in the meantime. Every write operation executed on a document, deletes included, causes its version to be incremented. The version number of a deleted document remains available for a short time after deletion to allow for control of concurrent operations. The length of time for which a deleted document’s version remains available is determined by the index.gc_deletes index setting and defaults to 60 seconds.
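For example, a sketch that deletes the document only if its current version is still 2 (an illustrative value); otherwise the request fails with a version conflict:

DELETE /twitter/_doc/1?version=2
or
curl -X DELETE "localhost:9200/twitter/_doc/1?version=2&pretty"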

Routing

When indexing with controlled routing, the routing value should also be provided in order to delete the document. For example:

DELETE /twitter/_doc/1?routing=kimchy
or
curl -X DELETE "localhost:9200/twitter/_doc/1?routing=kimchy&pretty"

The above will delete a tweet with id 1, but will be routed based on the user. Note that issuing a delete without the correct routing will cause the document not to be deleted.

When the _routing mapping is set as required and no routing value is specified, the delete api will throw a RoutingMissingException and reject the request.

Automatic index creation

If an external versioning variant is used, the delete operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if it has not been created before (check out the put mapping API for manually creating type mapping).

Distributed

The delete operation gets hashed into a specific shard id. It then gets redirected into the primary shard within that id group, and replicated (if needed) to shard replicas within that id group.

Wait For Active Shards

When making delete requests, you can set the wait_for_active_shards parameter to require a minimum number of shard copies to be active before starting to process the delete request. See here for further details and a usage example.

Refresh

Control when the changes made by this request are visible to search. See refresh.

Timeout

The primary shard assigned to perform the delete operation might not be available when the delete operation is executed. Some reasons for this might be that the primary shard is currently recovering from a store or undergoing relocation. By default, the delete operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error. The timeout parameter can be used to explicitly specify how long it waits. Here is an example of setting it to 5 minutes:

DELETE /twitter/_doc/1?timeout=5m
or
curl -X DELETE "localhost:9200/twitter/_doc/1?timeout=5m&pretty"

Delete By Query API

The simplest usage of _delete_by_query just performs a deletion on every document that matches a query. Here is the API:

The query must be passed as a value to the query key, in the same way as the Search API. You can also use the q parameter in the same way as the search api.

POST twitter/_delete_by_query
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}
or
curl -X POST "localhost:9200/twitter/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}'

_delete_by_query gets a snapshot of the index when it starts and deletes what it finds using internal versioning. That means that you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the delete request is processed. When the versions match the document is deleted.

During the _delete_by_query execution, multiple search requests are sequentially executed in order to find all the matching documents to delete. Every time a batch of documents is found, a corresponding bulk request is executed to delete all these documents. In case a search or bulk request got rejected, _delete_by_query relies on a default policy to retry rejected requests (up to 10 times, with exponential back off). Reaching the maximum retries limit causes the _delete_by_query to abort and all failures are returned in the failures of the response. The deletions that have been performed still stick. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures that are returned by the failing bulk request are returned in the failures element; therefore it’s possible for there to be quite a few failed entities.

If you’d like to count version conflicts rather than cause them to abort then set conflicts=proceed on the url or "conflicts": "proceed" in the request body.

Back to the API format, this will delete tweets from the twitter index:

POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
or
curl -X POST "localhost:9200/twitter/_doc/_delete_by_query?conflicts=proceed&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

It’s also possible to delete documents of multiple indexes and multiple types at once, just like the search API:

POST twitter,blog/_docs,post/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}
or
curl -X POST "localhost:9200/twitter,blog/_docs,post/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_all": {}
  }
}'

If you provide routing then the routing is copied to the scroll query, limiting the process to the shards that match that routing value:

POST twitter/_delete_by_query?routing=1
{
  "query": {
    "range" : {
        "age" : {
           "gte" : 10
        }
    }
  }
}
or
curl -X POST "localhost:9200/twitter/_delete_by_query?routing=1&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range" : {
        "age" : {
           "gte" : 10
        }
    }
  }
}'

By default _delete_by_query uses scroll batches of 1000. You can change the batch size with the scroll_size URL parameter:

POST twitter/_delete_by_query?scroll_size=5000
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
or
curl -X POST "localhost:9200/twitter/_delete_by_query?scroll_size=5000&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}'

URL Parameters

In addition to the standard parameters like pretty, the Delete By Query API also supports refresh, wait_for_completion, wait_for_active_shards, timeout and scroll.

Sending the refresh will refresh all shards involved in the delete by query once the request completes. This is different than the Delete API’s refresh parameter which causes just the shard that received the delete request to be refreshed.

If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.
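A sketch of launching the request as a task and then polling it (the task id echoes the one used later in this section):

POST twitter/_delete_by_query?wait_for_completion=false&conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
GET _tasks/r1A2WoRbTwKZ516z6NEs5A:36619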

wait_for_active_shards controls how many copies of a shard must be active before proceeding with the request. See here for details. timeout controls how long each write request waits for unavailable shards to become available. Both work exactly how they work in the Bulk API. As _delete_by_query uses scroll search, you can also specify the scroll parameter to control how long it keeps the “search context” alive, eg ?scroll=10m, by default it’s 5 minutes.

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc) and throttles the rate at which _delete_by_query issues batches of delete operations by padding each batch with a wait time. The throttling can be disabled by setting requests_per_second to -1.
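For example, a sketch that throttles deletions to roughly 500 requests per second (an illustrative value, reused in the arithmetic below):

POST twitter/_delete_by_query?requests_per_second=500&conflicts=proceed
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}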

The throttling is done by waiting between batches so that scroll that _delete_by_query uses internally can be given a timeout that takes into account the padding. The padding time is the difference between the batch size divided by the requests_per_second and the time spent writing. By default the batch size is 1000, so if the requests_per_second is set to 500:

target_time = 1000 / 500 per second = 2 seconds
wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds

Since the batch is issued as a single _bulk request large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is “bursty” instead of “smooth”. The default is -1.

Response body

The JSON response looks like this:

{
  "took" : 147,
  "timed_out": false,
  "total": 119,
  "deleted": 119,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures" : [ ]
}

took

The number of milliseconds from start to end of the whole operation.

timed_out

This flag is set to true if any of the requests executed during the delete by query execution has timed out.

total

The number of documents that were successfully processed.

deleted

The number of documents that were successfully deleted.

batches

The number of scroll responses pulled back by the delete by query.

version_conflicts

The number of version conflicts that the delete by query hit.

noops

This field is always equal to zero for delete by query. It only exists so that delete by query, update by query and reindex APIs return responses with the same structure.

retries

The number of retries attempted by delete by query. bulk is the number of bulk actions retried and search is the number of search actions retried.

throttled_millis

Number of milliseconds the request slept to conform to requests_per_second.

requests_per_second

The number of requests per second effectively executed during the delete by query.

throttled_until_millis

This field should always be equal to zero in a _delete_by_query response. It only has meaning when using the Task API, where it indicates the next time (in milliseconds since epoch) a throttled request will be executed again in order to conform to requests_per_second.

failures

Array of failures if there were any unrecoverable errors during the process. If this is non-empty then the request aborted because of those failures. Delete-by-query is implemented using batches and any failure causes the entire process to abort but all failures in the current batch are collected into the array. You can use the conflicts option to prevent delete by query from aborting on version conflicts.

Works with the Task API

You can fetch the status of any running delete-by-query requests with the Task API:

GET _tasks?detailed=true&actions=*/delete/byquery
or
curl -X GET "localhost:9200/_tasks?detailed=true&actions=*/delete/byquery&pretty"

With the task id you can look up the task directly:

GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619
or
curl -X GET "localhost:9200/_tasks/r1A2WoRbTwKZ516z6NEs5A:36619?pretty"

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it then it’ll come back with results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.

Works with the Cancel Task API

Any Delete By Query can be canceled using the task cancel API:

POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel
or
curl -X POST "localhost:9200/_tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel?pretty"

The task ID can be found using the tasks API.

Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the task until it wakes to cancel itself.

Rethrottling

The value of requests_per_second can be changed on a running delete by query using the _rethrottle API:

POST _delete_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1
or
curl -X POST "localhost:9200/_delete_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1&pretty"

The task ID can be found using the tasks API.

Just like when setting it on the _delete_by_query API, requests_per_second can be either -1 to disable throttling or any decimal number like 1.7 or 12 to throttle to that level. Rethrottling that speeds up the query takes effect immediately but rethrottling that slows down the query will take effect after completing the current batch. This prevents scroll timeouts.

Slicing

Delete-by-query supports Sliced Scroll to parallelize the deleting process. This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.

Manually slicing

Slice a delete-by-query manually by providing a slice id and total number of slices to each request:

POST twitter/_delete_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
POST twitter/_delete_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
or
curl -X POST "localhost:9200/twitter/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}'
curl -X POST "localhost:9200/twitter/_delete_by_query?pretty" -H 'Content-Type: application/json' -d'
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}'

Which you can verify works with:

GET _refresh
POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
or
curl -X GET "localhost:9200/_refresh?pretty"
curl -X POST "localhost:9200/twitter/_search?size=0&filter_path=hits.total&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}'

Automatic slicing

You can also let delete-by-query automatically parallelize using Sliced Scroll to slice on _uid. Use slices to specify the number of slices to use:

POST twitter/_delete_by_query?refresh&slices=5
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
or
curl -X POST "localhost:9200/twitter/_delete_by_query?refresh&slices=5&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}'

Which you also can verify works with:

POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
or
curl -X POST "localhost:9200/twitter/_search?size=0&filter_path=hits.total&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}'

Setting slices to auto will let Elasticsearch choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If there are multiple source indices, it will choose the number of slices based on the index with the smallest number of shards.
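For example, a sketch that lets Elasticsearch pick the slice count:

POST twitter/_delete_by_query?refresh&slices=auto
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}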

Adding slices to _delete_by_query just automates the manual process used in the section above, creating sub-requests which means it has some quirks:

  • You can see these requests in the Tasks APIs. These sub-requests are “child” tasks of the task for the request with slices.
  • Fetching the status of the task for the request with slices only contains the status of completed slices.
  • These sub-requests are individually addressable for things like cancellation and rethrottling.
  • Rethrottling the request with slices will rethrottle the unfinished sub-request proportionally.
  • Canceling the request with slices will cancel each sub-request.
  • Due to the nature of slices each sub-request won’t get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.
  • Parameters like requests_per_second and size on a request with slices are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using size with slices might not result in exactly size documents being deleted.
  • Each sub-request gets a slightly different snapshot of the source index, though these are all taken at approximately the same time.

Picking the number of slices

If slicing automatically, setting slices to auto will choose a reasonable number for most indices. If you’re slicing manually or otherwise tuning automatic slicing, use these guidelines.

Query performance is most efficient when the number of slices is equal to the number of shards in the index. If that number is large (for example, 500), choose a lower number as too many slices will hurt performance. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.

Delete performance scales linearly across available resources with the number of slices.

Whether query or delete performance dominates the runtime depends on the documents being deleted and cluster resources.

Update API

The update API allows you to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and indexes back the result (the script can also choose to delete the document or to ignore the operation). It uses versioning to make sure no updates have happened during the “get” and “reindex”.

Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.

For example, let’s index a simple doc:

PUT test/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}
or
curl -X PUT "localhost:9200/test/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
    "counter" : 1,
    "tags" : ["red"]
}'

Scripted updates

Now, we can execute a script that would increment the counter:

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}'

We can add a tag to the list of tags (note that if the tag already exists, it will still be added, since it’s a list):

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}'

In addition to _source, the following variables are available through the ctx map: _index, _type, _id, _version, _routing and _now (the current timestamp).

We can also add a new field to the document:

POST test/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "script" : "ctx._source.new_field = \u0027value_of_new_field\u0027"
}'

Or remove a field from the document:

POST test/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "script" : "ctx._source.remove(\u0027new_field\u0027)"
}'

And, we can even change the operation that is executed. This example deletes the doc if the tags field contains green, otherwise it does nothing (noop):

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = \u0027delete\u0027 } else { ctx.op = \u0027none\u0027 }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}'

Updates with a partial document

The update API also supports passing a partial document, which will be merged into the existing document (simple recursive merge, inner merging of objects, replacing core "keys/values" and arrays). To fully replace the existing document, the index API should be used instead. The following partial update adds a new field to the existing document:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "doc" : {
        "name" : "new_name"
    }
}'

If both doc and script are specified, then doc is ignored. It is best to put the fields of the partial document in the script itself.

Detecting noop updates

If doc is specified its value is merged with the existing _source. By default updates that don’t change anything detect that they don’t change anything and return "result": "noop" like this:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "doc" : {
        "name" : "new_name"
    }
}'

If name was new_name before the request was sent then the entire update request is ignored. The result element in the response returns noop if the request was ignored.

You can disable this behavior by setting "detect_noop": false like this:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}'

Upserts

If the document does not already exist, the contents of the upsert element will be inserted as a new document. If the document does exist, then the script will be executed instead:

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}'

scripted_upsert

If you would like your script to run regardless of whether the document exists or not — i.e. the script handles initializing the document instead of the upsert element — then set scripted_upsert to true:

POST sessions/session/dh3sgudg8gsrgl/_update
{
    "scripted_upsert":true,
    "script" : {
        "id": "my_web_session_summariser",
        "params" : {
            "pageViewEvent" : {
                "url":"foo.com/bar",
                "response":404,
                "time":"2014-01-01 12:32"
            }
        }
    },
    "upsert" : {}
}
or
curl -X POST "localhost:9200/sessions/session/dh3sgudg8gsrgl/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "scripted_upsert":true,
    "script" : {
        "id": "my_web_session_summariser",
        "params" : {
            "pageViewEvent" : {
                "url":"foo.com/bar",
                "response":404,
                "time":"2014-01-01 12:32"
            }
        }
    },
    "upsert" : {}
}'

doc_as_upsert

Instead of sending a partial doc plus an upsert doc, setting doc_as_upsert to true will use the contents of doc as the upsert value:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "doc_as_upsert" : true
}
or
curl -X POST "localhost:9200/test/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
    "doc" : {
        "name" : "new_name"
    },
    "doc_as_upsert" : true
}'

Parameters

The update operation supports the following query-string parameters (a combined example follows the list):

  • retry_on_conflict

In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.

  • routing

Routing is used to route the update request to the right shard and sets the routing for the upsert request if the document being updated doesn’t exist. Can’t be used to update the routing of an existing document.

  • timeout

Timeout waiting for a shard to become available.

  • wait_for_active_shards

The number of shard copies required to be active before proceeding with the update operation. See here for details.

  • refresh

Control when the changes made by this request are visible to search. See refresh.

  • _source

Allows you to control if and how the updated source should be returned in the response. By default the updated source is not returned. See source filtering for details.

  • version

The update API uses the Elasticsearch’s versioning support internally to make sure the document doesn’t change during the update. You can use the version parameter to specify that the document should only be updated if its version matches the one specified.
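For example, a sketch combining several of these parameters (the values are illustrative): retry up to three times on version conflict and return the updated source in the response:

POST test/_doc/1/_update?retry_on_conflict=3&_source=true
{
    "doc" : {
        "name" : "new_name"
    }
}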

Update By Query API

The simplest usage of _update_by_query just performs an update on every document in the index without changing the source. This is useful to pick up a new property or some other online mapping change. Here is the API:

POST twitter/_update_by_query?conflicts=proceed

_update_by_query gets a snapshot of the index when it starts and indexes what it finds using internal versioning. That means that you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the index request is processed. When the versions match the document is updated and the version number is incremented.

All update and query failures cause the _update_by_query to abort and are returned in the failures of the response. The updates that have been performed still stick. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures that are returned by the failing bulk request are returned in the failures element; therefore it’s possible for there to be quite a few failed entities.

If you want to simply count version conflicts, and not cause the _update_by_query to abort, you can set conflicts=proceed on the url or "conflicts": "proceed" in the request body. The first example does this because it is just trying to pick up an online mapping change and a version conflict simply means that the conflicting document was updated between the start of the _update_by_query and the time when it attempted to update the document. This is fine because that update will have picked up the online mapping update.
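For example, the body form looks like this (a sketch; the query is illustrative):

POST twitter/_update_by_query
{
  "conflicts": "proceed",
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}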
