elasticsearch reference 2.3 学习笔记

本文是Elasticsearch 2.3的学习笔记,涵盖了集群健康、索引创建、文档操作、批量处理、搜索API、过滤与聚合等关键知识点。介绍了Elasticsearch作为全文搜索引擎的特性,强调了分片和复制的重要性,以及如何通过API进行数据管理和查询优化。
摘要由CSDN通过智能技术生成

1. Getting Started

1.1 Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.

1.2 Sharding is important for two primary reasons:

It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput

1.3 Replication is important for two primary reasons:

It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.

1.4 Cluster Health

curl 'localhost:9200/_cat/health?v'

curl 'localhost:9200/_cat/nodes?v'

1.5 List All Indices

curl 'localhost:9200/_cat/indices?v'

1.6 Create an Index

curl -XPUT 'localhost:9200/customer?pretty'

1.7 Index and Query a Document

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "John Doe"
}'

curl -XGET 'localhost:9200/customer/external/1?pretty'

1.8 Delete an Index

curl -XDELETE 'localhost:9200/customer?pretty'

1.9 Updating Documents

Note though that Elasticsearch does not actually do in-place updates under the hood. Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.

1.10 Deleting Documents

curl -XDELETE 'localhost:9200/customer/external/2?pretty'

1.11 Batch Processing

In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations as fast as possible with as little network roundtrips as possible.

The bulk API executes all the actions sequentially and in order. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.

1.12 The Search API

It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL wherein you may initially get a partial subset of your query results up-front and then you have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.

1.13 Executing Filters

In the previous section, we skipped over a little detail called the document score (_score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document is, the lower the score, the less relevant the document is.

But queries do not always need to produce scores, in particular when they are only used for “filtering” the document set. Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.

1.14 Executing Aggregations

Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating it to the SQL GROUP BY and the SQL aggregate functions. In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.

2. Setup

2.1 Environment Variables

Most times it is better to leave the default JAVA_OPTS as they are, and use the ES_JAVA_OPTS environment variable in order to set / change JVM settings or arguments.

The ES_HEAP_SIZE environment variable allows to set the heap memory that will be allocated to elasticsearch java process. It will allocate the same value to both min and max values, though those can be set explicitly (not recommended) by setting ES_MIN_MEM (defaults to 256m), and ES_MAX_MEM (defaults to 1g).

It is recommended to set the min and max memory to the same value, and enable mlockall.

2.2 File Descriptors

Make sure to increase the number of open files descriptors on the machine (or for the user running elasticsearch). Setting it to 32k or even 64k is recommended.

In order to test how many open files the process can open, start it with -Des.max-open-files set to true. This will print the number of open files the process can open on startup.

Alternatively, you can retrieve the max_file_descriptors for each node using the Nodes Info API, with:

curl localhost:9200/_nodes/stats/process?pretty

2.3 Virtual memory

Elasticsearch uses a hybrid mmapfs / niofs directory by default to store its indices. The default operating system limits on mmap counts is likely to be too low, which may result in out of memory exceptions. On Linux, you can increase the limits by running the following command as root:

sysctl -w vm.max_map_count=262144

To set this value permanently, update the vm.max_map_count setting in /etc/sysctl.conf.

2.4 Memory Settings

Most operating systems try to use as much memory as possible for file system caches and eagerly swap out unused application memory, possibly resulting in the elasticsearch process being swapped. Swapping is very bad for performance and for node stability, so it should be avoided at all costs.

There are three options: Disable swap、Configure swappiness、mlockall

2.5 Elasticsearch Settings

elasticsearch configuration files can be found under ES_HOME/config folder. The folder comes with two files, the elasticsearch.yml for configuring Elasticsearch different modules, and logging.yml for configuring the Elasticsearch logging.

The configuration format is YAML.

2.6 Directory Layout

zip and tar.gz
|Type | Description| Location |
|:— |:———–|:———|
home |Home of elasticsearch installation|{extract.path}
bin |Binary scripts including elasticsearch to start a node|{extract.path}/bin
conf |Configuration files elasticsearch.yml and logging.yml|{extract.path}/config
data |The location of the data files of each index / shard allocated on the node|{extract.path}/data
logs |Log files location|{extract.path}/logs
plugins|Plugin files location. Each plugin will be contained in a subdirectory|{extract.path}/plugins
repo |Shared file system repository locations.|Not configured
script|Location of script files.|{extract.path}/config/scripts

3 Breaking changes (skipped)

4 API Conventions (skipped)

5 Document APIs

5.1 Index API

The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts the JSON document into the “twitter” index, under a type called “tweet” with an id of 1:

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

5.2 Automatic Index Creation

The index operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if one has not yet been created (check out the put mapping API for manually creating a type mapping).

5.3 Versioning

Each indexed document is given a version number. The associated version number is returned as part of the response to the index API request. The index API optionally allows for optimistic concurrency control when the version parameter is specified. This will control the version of the document the operation is intended to be executed against. A good example of a use case for versioning is performing a transactional read-then-update. Specifying a version from the document initially read ensures no changes have happened in the meantime (when reading in order to update, it is recommended to set preference to _primary). For example:

curl -XPUT 'localhost:9200/twitter/tweet/1?version=2' -d '{
    "message" : "elasticsearch now has versioning support, double cool!"
}'

5.4 Operation Type

The index operation also accepts an op_type that can be used to force a create operation, allowing for “put-if-absent” behavior. When create is used, the index operation will fail if a document by that id already exists in the index.

Here is an example of using the op_type parameter:

curl -XPUT 'http://localhost:9200/twitter/tweet/1?op_type=create' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

5.5 Routing

By default, shard placement — or routing — is controlled by using a hash of the document’s id value. For more explicit control, the value fed into the hash function used by the router can be directly specified on a per-operation basis using the routing parameter. For example:

curl -XPOST 'http://localhost:9200/twitter/tweet?routing=kimchy' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

5.6 Parents & Children (**适合什么场景?)

A child document can be indexed by specifying its parent when indexing. For example:

curl -XPUT localhost:9200/blogs/blog_tag/1122?parent=1111 -d '{
    "tag" : "something"
}'

When indexing a child document, the routing value is automatically set to be the same as its parent, unless the routing value is explicitly specified using the routing parameter.

5.7 Distributed

The index operation is directed to the primary shard based on its route (see the Routing section above) and performed on the actual node containing this shard. After the primary shard completes the operation, if needed, the update is distributed to applicable replicas.

5.8 Write Consistency

To prevent writes from taking place on the “wrong” side of a network partition, by default, index operations only succeed if a quorum (>replicas/2+1) of active shards are available.

5.9 Write Consistency (**如果index设置为1个主分片,两个复制分片。当两个复制分片都不可用的时候index在主分片是否成功?复制分片可用了是否能复制?)

To prevent writes from taking place on the “wrong” side of a network partition, by default, index operations only succeed if a quorum (>replicas/2+1) of active shards are available.

The index operation only returns after all active shards within the replication group have indexed the document (sync replication).

5.10 Refresh

To refresh the shard (not the whole index) immediately after the operation occurs, so that the document appears in search results immediately, the refresh parameter can be set to true. Setting this option to true should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint. Note, getting a document using the get API is completely realtime and doesn’t require a refresh.

5.11 Get API

The get API allows to get a typed JSON document from the index based on its id. The following example gets a JSON document from an index called twitter, under a type called tweet, with id valued 1:

curl -XGET 'http://localhost:9200/twitter/tweet/1'

5.12 Preference

Controls a preference of which shard replicas to execute the get request on. By default, the operation is randomized between the shard replicas.

The preference can be set to:

  • _primary: The operation will go and be executed only on the primary shards.
  • _local: The operation will prefer to be executed on a local allocated shard if possible.
  • Custom (string) value: A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with “jumping values” when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.

5.13 Delete API

The delete API allows to delete a typed JSON document from a specific index based on its id. The following example deletes the JSON document from an index called twitter, under a type called tweet, with id valued 1:

curl -XDELETE 'http://localhost:9200/twitter/tweet/1'

The delete operation gets hashed into a specific shard id. It then gets redirected into the primary shard within that id group, and replicated (if needed) to shard replicas within that id group.

5.14 Update API

The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the “get” and “reindex”.

Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.

5.15 Update By Query API (new and should still be considered experimental)

The simplest usage of _update_by_query just performs an update on every document in the index without changing the source. This is useful to pick up a new property or some other online mapping change. Here is the API:

curl -XPOST 'localhost:9200/twitter/_update_by_query?conflicts=proceed'

All update and query failures cause the _update_by_query to abort and are returned in the failures of the response. The updates that have been performed still stick. In other words, the process is not rolled back, only aborted.

5.16 Multi Get API

Multi GET API allows to get multiple documents based on an index, type (optional) and id (and possibly routing). The response includes a docs array with all the fetched documents, each element similar in structure to a document provided by the get API. Here is an example:

curl 'localhost:9200/_mget' -d '{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
    ]
}'

5.17 Bulk API

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.

The REST API endpoint is /_bulk, and it expects the following JSON structure:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

NOTE: the final line of data must end with a newline character \n.

The possible actions are index, create, delete and update. index and create expect a source on the next line, and have the same semantics as the op_type parameter to the standard index API (i.e. create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary). delete does not expect a source on the following line, and has the same semantics as the standard delete API. update expects that the partial doc, upsert and script and its options are specified on the next line.

Here is an example of a correct sequence of bulk commands:

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }

The endpoints are /_bulk, /{index}/_bulk, and {index}/{type}/_bulk. When the index or the index/type are provided, they will be used by default on bulk items that don’t provide them explicitly.

A note on the format. The idea here is to make processing of this as fast as possible. As some of the actions will be redirected to other shards on other nodes, only action_meta_data is parsed on the receiving node side.

5. 18 Reindex API (new and should still be considered experimental)

The most basic form of _reindex just copies documents from one index to another. This will copy documents from the twitter index into the new_twitter index:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

You can limit the documents by adding a type to the source or by adding a query. This will only copy tweet’s made by kimchy into new_twitter:

POST /_reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

5.19 Term Vectors

Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.

curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Term information
- term frequency in the field (always returned)
- term positions (positions : true)
- start and end offsets (offsets : true)
- term payloads (payloads : true), as base64 encoded bytes

Term statistics
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

Setting term_statistics to true (default is false) will return term statistics. By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in, unless dfs is set to true. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected.

See more examples: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-termvectors.html#_behaviour

6 Search APIs

The search API allows you to execute a search query and get back search hits that match the query. The query can either be provided using a simple query string as a parameter, or using a request body.

All search APIs can be applied across multiple types within an index, and across multiple indices with support for the multi index syntax.

A search request can be executed purely using a URI by providing request parameters. Not all search options are exposed when executing a search using this mode, but it can be handy for quick “curl tests”. Here is an example:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=user:kimchy'

The search request can be executed with a search DSL, which includes the Query DSL, within its body. Here is an example:

curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'

6.4 Query

The query element within the search request body allows to define a query using the Query DSL.

{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

6.5 From / Size

Pagination of results can be done by using the from and size parameters.

{
    "from" : 0, "size" : 10,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll API for more efficient ways to do deep scrolling.

6.6 Sort (**多个字段的排序规则是怎么样的?)

Allows to add one or more sort on specific fields. Each sort can be reversed as well. The sort is defined on a per field level, with special field name for _score to sort by score, and _doc to sort by index order.

{
    "sort" : [
        { "post_date" : {"order" : "asc"}},
        "user",
        { "name" : "desc" },
        { "age" : "desc" },
        "_score"
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The sort values for each document returned are also returned as part of the response.
The order option can have the following values:

  • asc: Sort in ascending order
  • desc: Sort in descending order

The order defaults to desc when sorting on the _score, and defaults to asc when sorting on anything else.

6.7 Sort mode option

Elasticsearch supports sorting by array or multi-valued fields. The mode option controls what array value is picked for sorting the document it belongs to. The mode option can have the following values:

  • min: Pick the lowest value.
  • max: Pick the highest value.
  • sum: Use the sum of all values as sort value. Only applicable for number based array fields.
  • avg: Use the average of all values as sort value. Only applicable for number based array fields.
  • median: Use the median of all values as sort value. Only applicable for number based array fields.

6.8 Missing Values

The missing parameter specifies how docs which are missing the field should be treated: The missing value can be set to _last, _first, or a custom value (that will be used for missing docs as the sort value). For example:

{
    "sort" : [
        { "price" : {"missing" : "_last"} },
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

6.9 Script Based Sorting

Allow to sort based on custom scripts, here is an example:

{
    "query" : {
        ....
    },
    "sort" : {
        "_script" : {
            "type" : "number",
            "script" : {
                "inline": "doc['field_name'].value * factor",
                "params" : {
                    "factor" : 1.1
                }
            },
            "order" : "asc"
        }
    }
}

6.10 Memory Considerations

When sorting, the relevant sorted field values are loaded into memory. This means that per shard, there should be enough memory to contain them. For string based types, the field sorted on should not be analyzed / tokenized. For numeric types, if possible, it is recommended to explicitly set the type to narrower types (like short, integer and float).

6.11 Source filtering

Allows to control how the _source field is returned with every hit.

By default operations return the contents of the _source field unless you have used the fields parameter or if the _source field is disabled.

  • To disable _source retrieval set to false.
  • The _source also accepts one or more wildcard patterns to control what parts of the _source should be returned.
  • Finally, for complete control, you can specify both include and exclude patterns.

6.12 Fields

The fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.

6.13 Script Fields

Allows to return a script evaluation (based on different fields) for each hit, for example:

{
    "query" : {
        ...
    },
    "script_fields" : {
        "test1" : {
            "script" : "_source.obj1.obj2"
        },
        "test2" : {
            "script" : {
                "inline": "doc['my_field_name'].value * factor",
                "params" : {
                    "factor"  : 2.0
                }
            }
        }
    }
}

Note the _source keyword here to navigate the json-like model.

It’s important to understand the difference between doc[‘my_field’].value and _source.my_field. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[…] notation only allows for simple valued fields (can’t return a json object from it) and make sense only on non-analyzed or single term based fields.

The _source on the other hand causes the source to be loaded, parsed, and then only the relevant part of the json is returned.

6.14 Field Data Fields (**需要了解stored和fielddata的概念)

Allows to return the field data representation of a field for each hit, for example:

{
    "query" : {
        ...
    },
    "fielddata_fields" : ["test1", "test2"]
}

Field data fields can work on fields that are not stored.

It’s important to understand that using the fielddata_fields parameter will cause the terms for that field to be loaded to memory (cached), which will result in more memory consumption.

6.15 Post filter

The post_filter is applied to the search hits at the very end of a search request, after aggregations have already been calculated.

{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "brandName": "vans"
        }
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": {
        "field": "colorNames"
      }
    },
    "color_red": {
      "filter": {
        "term": {
          "colorNames": "红色"
        }
      },
      "aggs": {
        "smallSorts": {
          "terms": {
            "field": "smallSort"
          }
        }
      }
    }
  },
  "post_filter": {
    "term": {
      "colorNames": "红色"
    }
  }
}
  • The main query now finds all products by vans, regardless of color.
  • The colors agg returns popular colors by vans.
  • The color_red agg limits the small sort sub-aggregation to red vans products.
  • Finally, the post_filter removes colors other than red from the search hits.

6.16 Highlighting

Allows to highlight search results on one or more fields. The implementation uses either the lucene highlighter, fast-vector-highlighter or postings-highlighter. The following is an example of the search request body:

{
    "query" : {...},
    "highlight" : {
        "fields" : {
            "content" : {}
        }
    }
}
6.16.1 Plain highlighter

The default choice of highlighter is of type plain and uses the Lucene highlighter. It tries

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值