Here are a few sample use cases for Elasticsearch:
- You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
- You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
- You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
- You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
The default cluster name is elasticsearch, which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
Sharding is important for two primary reasons:
- It allows you to horizontally split/scale your content volume
- It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput
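To picture how documents are distributed, here is a minimal Python sketch of shard routing: Elasticsearch picks a primary shard by hashing the document's routing value (the document ID by default) modulo the number of primary shards. The real implementation uses a murmur3 hash, so `zlib.crc32` below is only an illustrative stand-in, and the `pick_shard` helper is hypothetical:

```python
import zlib

def pick_shard(routing_value, num_primary_shards):
    # Elasticsearch hashes the routing value (by default the document ID)
    # and takes it modulo the number of primary shards. The real engine
    # uses murmur3; crc32 here is just a deterministic stand-in.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# The same ID always maps to the same shard, which is why the number of
# primary shards cannot be changed after an index is created.
shard = pick_shard("customer-1", 5)
```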
Replication is important for two primary reasons:
- It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
- It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
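A quick way to see how primaries and replicas interact is to count the resulting shards. A minimal sketch (the `total_shards` helper is hypothetical, not an Elasticsearch API):

```python
def total_shards(primaries, replicas_per_primary):
    # Each primary shard gets `replicas_per_primary` replica copies, and a
    # replica is never allocated on the same node as its primary.
    return primaries * (1 + replicas_per_primary)

# The defaults in this version of Elasticsearch (5 primaries, 1 replica)
# give 10 shards per index.
shards = total_shards(5, 1)
```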
As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents per shard. You can monitor shard sizes using the _cat/shards API.
2、Installation
java -version
echo $JAVA_HOME
Elasticsearch can be downloaded from www.elastic.co/downloads along with all the releases that have been made in the past. For each release, you have a choice of a zip or tar archive, or a DEB or RPM package.
[2016-09-16T14:17:51,251][INFO ][o.e.n.Node ] [] initializing ...
[2016-09-16T14:17:51,329][INFO ][o.e.e.NodeEnvironment ] [6-bjhwl] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [317.7gb], net total_space [453.6gb], spins? [no], types [ext4]
[2016-09-16T14:17:51,330][INFO ][o.e.e.NodeEnvironment ] [6-bjhwl] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-09-16T14:17:51,333][INFO ][o.e.n.Node ] [6-bjhwl] node name [6-bjhwl] derived from node ID; set [node.name] to override
[2016-09-16T14:17:51,334][INFO ][o.e.n.Node ] [6-bjhwl] version[5.2.0], pid[21261], build[f5daa16/2016-09-16T09:12:24.346Z], OS[Linux/4.4.0-36-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_60/25.60-b23]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [aggs-matrix-stats]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [ingest-common]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [lang-expression]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [lang-groovy]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [lang-mustache]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [lang-painless]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [percolator]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [reindex]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [transport-netty3]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded module [transport-netty4]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService ] [6-bjhwl] loaded plugin [mapper-murmur3]
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node ] [6-bjhwl] initialized
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node ] [6-bjhwl] starting ...
[2016-09-16T14:17:53,671][INFO ][o.e.t.TransportService ] [6-bjhwl] publish_address {192.168.8.112:9300}, bound_addresses {{192.168.8.112:9300}
[2016-09-16T14:17:53,676][WARN ][o.e.b.BootstrapCheck ] [6-bjhwl] max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
[2016-09-16T14:17:56,731][INFO ][o.e.h.HttpServer ] [6-bjhwl] publish_address {192.168.8.112:9200}, bound_addresses {[::1]:9200}, {192.168.8.112:9200}
[2016-09-16T14:17:56,732][INFO ][o.e.g.GatewayService ] [6-bjhwl] recovered [0] indices into cluster_state
[2016-09-16T14:17:56,748][INFO ][o.e.n.Node ] [6-bjhwl] started
./elasticsearch -Ecluster.name=my_cluster_name -Enode.name=my_node_name
Also note the HTTP address (192.168.8.112) and port (9200) that our node is reachable from. By default, Elasticsearch uses port 9200 to provide access to its REST API. This port is configurable if necessary.
- Check your cluster, node, and index health, status, and statistics
- Administer your cluster, node, and index data and metadata
- Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
- Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others
To check the cluster health, we will be using the _cat API. You can run the command below in Kibana’s Console by clicking "VIEW IN CONSOLE" or with curl by clicking the "COPY AS CURL" link below and pasting it into a terminal.

GET /_cat/health?v
Whenever we ask for the cluster health, we either get green, yellow, or red. Green means everything is good (cluster is fully functional), yellow means all data is available but some replicas are not yet allocated (cluster is fully functional), and red means some data is not available for whatever reason. Note that even if a cluster is red, it still is partially functional (i.e. it will continue to serve search requests from the available shards) but you will likely need to fix it ASAP since you have missing data.
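As a minimal sketch of acting on these three colors from a monitoring script (the `HEALTH_MEANING` table and `needs_attention` helper are illustrative, not part of Elasticsearch):

```python
# Mirror of the color semantics described above, e.g. for a script that
# polls the cluster health endpoint and decides whether to alert.
HEALTH_MEANING = {
    "green": "everything is good; the cluster is fully functional",
    "yellow": "all data available, but some replicas are not yet allocated",
    "red": "some data is not available; fix ASAP",
}

def needs_attention(status):
    # Yellow is still fully functional but worth watching;
    # red means missing data and should be fixed as soon as possible.
    return status in ("yellow", "red")
```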
5、List All Indices
Now let’s take a peek at our indices:
GET /_cat/indices?v
And the response:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
Note that we appended pretty to the end of the call to tell Elasticsearch to pretty-print the JSON response (if any).
Let’s now put something into our customer index. Remember that in order to index a document, we must tell Elasticsearch which type in the index it should go to.
PUT /customer/external/1?pretty
{
  "name": "John Doe"
}
And the response:
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
From the above, we can see that a new customer document was successfully created inside the customer index and the external type. The document also has an internal id of 1 which we specified at index time.
GET /customer/external/1?pretty
{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : { "name": "John Doe" }
}
Nothing out of the ordinary here other than a field, found, stating that we found a document with the requested ID 1, and another field, _source, which returns the full JSON document that we indexed in the previous step.
DELETE /customer?pretty
GET /_cat/indices?v
And the response:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
PUT /customer
PUT /customer/external/1
{
  "name": "John Doe"
}
GET /customer/external/1
DELETE /customer
<REST Verb> /<Index>/<Type>/<ID>
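The access pattern above can be sketched as a tiny path-building helper in Python (the `doc_endpoint` function is illustrative, not part of any Elasticsearch client):

```python
def doc_endpoint(index, doc_type, doc_id=None):
    # Builds the /<Index>/<Type>/<ID> path from the access pattern above.
    # Omitting the ID (as with POST) lets Elasticsearch generate one.
    path = "/{0}/{1}".format(index, doc_type)
    if doc_id is not None:
        path += "/{0}".format(doc_id)
    return path

# PUT /customer/external/1 vs. POST /customer/external
put_path = doc_endpoint("customer", "external", "1")
post_path = doc_endpoint("customer", "external")
```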
PUT /customer/external/1?pretty
{
  "name": "John Doe"
}

If we execute the above command again with a different document, Elasticsearch will replace (i.e. reindex) the existing document with ID 1:

PUT /customer/external/1?pretty
{
  "name": "Jane Doe"
}

The following, in contrast, indexes a new document with an ID of 2:

PUT /customer/external/2?pretty
{
  "name": "Jane Doe"
}
POST /customer/external?pretty
{
  "name": "Jane Doe"
}
Note that in the above case, we are using the POST verb instead of PUT since we didn’t specify an ID.
This example changes the name of the document with ID 1 to "Jane Doe":

POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe" }
}

This example changes the name and adds an age field at the same time:

POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe", "age": 20 }
}

Updates can also be performed with simple scripts. This example uses a script to increment the age by 5:

POST /customer/external/1/_update?pretty
{
  "script" : "ctx._source.age += 5"
}
In the above example, ctx._source refers to the current source document that is about to be updated.

Elasticsearch also provides the ability to update multiple documents given a query condition (like an SQL UPDATE-WHERE statement).
11、Deleting Documents
DELETE /customer/external/2?pretty
In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations as fast as possible with as few network roundtrips as possible.
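To see why this saves roundtrips, here is a sketch of how a bulk request body is assembled client-side: it is newline-delimited JSON, alternating an action line with a source line, all sent in one request. The `bulk_index_body` helper is hypothetical:

```python
import json

def bulk_index_body(docs):
    # The _bulk request body is newline-delimited JSON: an action line
    # such as {"index": {"_id": "1"}} followed by the document source.
    # The whole body must end with a trailing newline.
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

# Two index operations, one network roundtrip.
body = bulk_index_body([(1, {"name": "John Doe"}), (2, {"name": "Jane Doe"})])
```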
POST /customer/external/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:

POST /customer/external/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
{
  "account_number": 0,
  "balance": 16623,
  "firstname": "Bradshaw",
  "lastname": "Mckenzie",
  "age": 29,
  "gender": "F",
  "address": "244 Columbus Place",
  "employer": "Euron",
  "email": "bradshawmckenzie@euron.com",
  "city": "Hobucken",
  "state": "CO"
}
This data was generated using www.json-generator.com/, so please ignore the actual values and semantics of the data as these are all randomly generated.
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  l7sSYV2cQXmu6_4rJWVIww   5   1       1000            0    128.6kb        128.6kb
The REST API for search is accessible from the _search endpoint. This example returns all documents in the bank index:
GET /bank/_search?q=*&sort=account_number:asc&pretty
The query above searches all documents (via the _search endpoint) in the bank index. The q=* parameter instructs Elasticsearch to match all documents in the index. The sort=account_number:asc parameter indicates to sort the results using the account_number field of each document in ascending order. The pretty parameter, again, just tells Elasticsearch to return pretty-printed JSON results.
And the response (partially shown):
{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "0",
      "sort": [0],
      "_score" : null
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "sort": [1],
      "_score" : null
    } ]
  }
}
As for the response, we see the following parts:
- took – time in milliseconds for Elasticsearch to execute the search
- timed_out – tells us if the search timed out or not
- _shards – tells us how many shards were searched, as well as a count of the successful/failed searched shards
- hits – search results
- hits.total – total number of documents matching our search criteria
- hits.hits – actual array of search results (defaults to first 10 documents)
- hits.sort – sort key for results (missing if sorting by score)
- hits._score and max_score – ignore these fields for now
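A sketch of pulling the fields described above out of a parsed search response in Python (the `response` dict below is an abbreviated, hand-written stand-in for real output, not actual data from the bank index):

```python
# Abbreviated stand-in for the parsed JSON body of GET /bank/_search.
response = {
    "took": 63,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 1000,
        "max_score": None,
        "hits": [
            {"_index": "bank", "_type": "account", "_id": "0", "sort": [0]},
        ],
    },
}

took_ms = response["took"]                 # search time in milliseconds
total_matches = response["hits"]["total"]  # all matching docs, not just this page
page = response["hits"]["hits"]            # at most 10 results by default
```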
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}
The difference here is that instead of passing q=* in the URI, we POST a JSON-style query request body to the _search API. We’ll discuss this JSON query in the next section.
GET /bank/_search
{
  "query": { "match_all": {} }
}
Dissecting the above, the query part tells us what our query definition is and the match_all part is simply the type of query that we want to run. The match_all query is simply a search for all documents in the specified index.
In addition to the query parameter, we can also pass other parameters to influence the search results. In the example in the section above we passed in sort; here we pass in size:
GET /bank/_search
{
  "query": { "match_all": {} },
  "size": 1
}
Note that if size is not specified, it defaults to 10.
This example does a match_all and returns documents 11 through 20:
GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}
The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter. This feature is useful when implementing paging of search results. Note that if from is not specified, it defaults to 0.
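The from/size arithmetic for paging can be sketched as a small helper (the `page_params` function is hypothetical, not a client API):

```python
def page_params(page_number, page_size=10):
    # Translates a 1-based page number into the 0-based `from` offset
    # and `size` used by the search API.
    return {"from": (page_number - 1) * page_size, "size": page_size}

first_page = page_params(1)   # documents 1 through 10
second_page = page_params(2)  # documents 11 through 20
```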
This example does a match_all, sorts the results by account balance in descending order, and returns the top 10 (default size) documents:
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}
By default, the full JSON document is returned as part of all searches (in the _source field in the search hits). If we don’t want the entire source document returned, we have the ability to request only a few fields from within source to be returned. This example shows how to return two fields, account_number and balance (inside of _source), from the search:
GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}
Note that the above example simply reduces the _source field. It will still only return one field named _source, but within it, only the fields account_number and balance are included. If you come from a SQL background, the above is somewhat similar in concept to the SQL SELECT FROM field list.
Previously, we’ve seen how the match_all query is used to match all documents. Let’s now introduce a new query called the match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).
This example returns the account numbered 20:

GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}
This example returns all accounts containing the term "mill" in the address:

GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}
This example returns all accounts containing the term "mill" or "lane" in the address:

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}
This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address:
GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}
Let’s now introduce the bool(ean) query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.
This example composes two match queries and returns all accounts containing "mill" and "lane" in the address:
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.
In contrast, this example composes two match queries and returns all accounts containing "mill" or "lane" in the address:
GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
In the above example, the bool should clause specifies a list of queries either of which must be true for a document to be considered a match.
This example composes two match queries and returns all accounts that contain neither "mill" nor "lane" in the address:
GET /bank/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
In the above example, the bool must_not clause specifies a list of queries none of which must be true for a document to be considered a match.
We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.
This example returns all accounts of anybody who is 40 years old but doesn’t live in ID(aho):

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
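Composing bool queries programmatically follows the same shape every time, which is handy in client code. A minimal Python sketch (the `bool_query` helper is hypothetical, not part of any Elasticsearch client):

```python
def bool_query(must=None, should=None, must_not=None):
    # Assembles a bool query body from lists of sub-queries. Because a
    # bool query can itself appear inside must/should/must_not, this
    # pattern supports arbitrarily deep boolean logic.
    clause = {}
    if must:
        clause["must"] = must
    if should:
        clause["should"] = should
    if must_not:
        clause["must_not"] = must_not
    return {"query": {"bool": clause}}

# The "40 years old but not in ID(aho)" request from above, as a dict.
q = bool_query(
    must=[{"match": {"age": "40"}}],
    must_not=[{"match": {"state": "ID"}}],
)
```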
When executing queries, Elasticsearch computes a relevance score for each matching document (returned in the _score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document; the lower the score, the less relevant the document.
The bool query that we introduced in the previous section also supports filter clauses, which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed. As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.
GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part). We can substitute any other queries into the query and the filter parts. In the above case, the range query makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.
In addition to the match_all, match, bool, and range queries, there are a lot of other query types that are available and we won’t go into them here. Since we already have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in learning and experimenting with the other query types.
To start with, this example groups all the accounts by state, and returns the top 10 (default) states sorted by count descending:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}
In SQL, the above aggregation is similar in concept to:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
And the response (partially shown):
{
  "took": 29,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": { "total": 1000, "max_score": 0.0, "hits": [] },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 20,
      "sum_other_doc_count": 770,
      "buckets": [
        { "key": "ID", "doc_count": 27 },
        { "key": "TX", "doc_count": 27 },
        { "key": "AL", "doc_count": 25 },
        { "key": "MD", "doc_count": 25 },
        { "key": "TN", "doc_count": 23 },
        { "key": "MA", "doc_count": 21 },
        { "key": "NC", "doc_count": 21 },
        { "key": "ND", "doc_count": 21 },
        { "key": "ME", "doc_count": 20 },
        { "key": "MO", "doc_count": 20 }
      ]
    }
  }
}
We can see that there are 27 accounts in ID (Idaho), followed by 27 accounts in TX (Texas), followed by 25 accounts in AL (Alabama), and so forth.
Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.
Building on the previous aggregation, this example calculates the average account balance by state:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all aggregations: you can nest aggregations inside aggregations arbitrarily to extract the pivoted summarizations you require from your data.
Building on the previous aggregation, this example sorts on the average balance in descending order:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance per age bracket, per gender:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 20, "to": 30 },
          { "from": 30, "to": 40 },
          { "from": 40, "to": 50 }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
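Nesting aggregations programmatically always follows the same shape, which makes it easy to build requests like the ones above in client code. A minimal sketch (the `terms_agg` helper is hypothetical, not part of any Elasticsearch client):

```python
def terms_agg(name, field, sub_aggs=None):
    # Builds a terms bucket aggregation; nesting happens by attaching
    # sub-aggregations under "aggs", exactly as in the requests above.
    agg = {name: {"terms": {"field": field}}}
    if sub_aggs:
        agg[name]["aggs"] = sub_aggs
    return agg

# The "average balance by state" request, assembled as a dict.
request = {
    "size": 0,
    "aggs": terms_agg(
        "group_by_state",
        "state.keyword",
        sub_aggs={"average_balance": {"avg": {"field": "balance"}}},
    ),
}
```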