Elasticsearch Notes
- Elasticsearch Notes
- 1.Install Sense
- 2.Getting Started
- _search
- human language
- sorting
- settings
- mappings
- Relevance
- analyze
- aggregations
- Reindexing
- Refresh
- Flush
- Segment Merging
- Key points
1.Install Sense
https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html#sense
2.Getting Started
- REQUEST FORMAT
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
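For example, counting every document in the cluster (assuming Elasticsearch is listening on localhost:9200; host and port depend on your setup):

```
curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}'
```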
- GET a document, returning only the specified fields; when the document is found, `found` is `true`
GET /website/blog/123?_source=title,text&pretty
GET /website/blog/_mget
{
"ids" : [ "1", "2" ]
}
GET /website/blog/_mget
{
"docs" : [
{ "_id" : 2 },
{ "_type" : "pageviews", "_id" : 1 }
]
}
- Updating a Whole Document
`created` is `false` because a document with this id already exists
PUT /website/blog/123
{
"title": "My first blog entry",
"text": "I am starting to get the hang of this...",
"date": "2014/01/02"
}
{
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 2,
"created": false
}
POST /website/blog/1/_update
{
"doc" : {
"tags" : [ "testing" ],
"views": 0
}
}
Updating data with a script requires adding `script.inline: true` to the config file
POST /website/blog/1/_update
{
"script" : "ctx._source.views+=1"
}
A script can also be used to append values to an array
POST /website/blog/123/_update
{
"script" : "ctx._source.tags+=new_tag",
"params" : {
"new_tag" : "search"
}
}
- Creating a New Document
Create a new document without overwriting an existing one
POST /website/blog/
{ ... }
PUT /website/blog/123?op_type=create
{ ... }
PUT /website/blog/123/_create
{ ... }
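If a document with that id already exists, a create request fails instead of silently overwriting. A sketch of the expected behavior (the exact error message varies by version):

```
PUT /website/blog/123/_create
{ "title": "My first blog entry" }

# When id 123 already exists, the response is a conflict error, roughly:
# { "error": "DocumentAlreadyExistsException[ ... ]", "status": 409 }
```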
- Document count
POST /cnki02,wanfang02,pubmed02/doc/_count
{
"query":{
"match_all": {}
}
}
- Bulk operations with _bulk
A good bulk size to start playing with is around 5-15MB in size.
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
_search
Return only some fields
GET /cnki02/_search
{
"query": {"match_all": {}},
"_source":["title_cn","organizations"]
}
exists
Note that `null` and the string `"null"` are not the same
POST /pubmed02/doc/_search
{
"query": {
"exists" : { "field" : "source_all" }
}
}
An `exists` or `missing` query on an object works on the flattened fields, combined as a bool `should`. For example, given this document:
{
"name" : {
"first" : "John",
"last" : "Smith"
}
}
The reason that it works is that a filter like
{
"exists" : { "field" : "name" }
}
is really executed as
{
"bool": {
"should": [
{ "exists": { "field": "name.first" }},
{ "exists": { "field": "name.last" }}
]
}
}
missing
POST /pubmed02/doc/_search
{
"query": {
"missing" : { "field" : "source_all" }
}
}
Multiple exact values with terms
GET /my_store/products/_search
{
"filter": {
"terms": {
"price": [20,30]
}
}
}
Range query
GET /my_store/products/_search
{
"filter": {
"range": {
"price": {
"gt": 20,
"lt": 40
}
}
}
}
query-string search
GET /megacorp/employee/_search?q=last_name:Smith
- the query string is analyzed
match, with precision controlled by `operator` and `minimum_should_match` (equivalent to `minimum_should_match` in a bool `should`)
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"last_name" : "Smith"
}
}
}
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "BROWN DOG!",
"operator": "and"
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "quick brown dog",
"minimum_should_match": "75%"
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "brown" }},
{ "match": { "title": "fox" }},
{ "match": { "title": "dog" }}
],
"minimum_should_match": 2
}
}
}
Weighting should clauses with boost
GET /_search
{
"query": {
"bool": {
"must": {
"match": {
"content": {
"query": "full text search",
"operator": "and"
}
}
},
"should": [
{ "match": {
"content": {
"query": "Elasticsearch",
"boost": 3
}
}},
{ "match": {
"content": {
"query": "Lucene",
"boost": 2
}
}}
]
}
}
}
match_phrase
- matches a word or phrase exactly
- the words of the phrase must be adjacent and in order
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "I love to go rock climbing"
}
}
}
Multifield Search
Best Fields: the dis_max Query
Score (and sort) on the single best-matching field, rather than combining scores across multiple fields
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
Tuning Best Fields Queries with tie_breaker
With the tie_breaker, all matching clauses count, but the best-matching clause counts most.
The tie_breaker can be a floating-point value between 0 and 1, where 0 uses just the best-matching clause and 1 counts all matching clauses equally. The exact value can be tuned based on your data and queries, but a reasonable value should be close to zero (for example, 0.1-0.4), in order not to overwhelm the best-matching nature of dis_max.
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.3
}
}
}
Most Fields
GET /my_index/_search
{
"query": {
"multi_match": {
"query": "jumping rabbits",
"type": "most_fields",
"fields": [ "title^10", "title.std" ]
}
}
}
Proximity Matching with match_phrase
Use slop to relax the exact-position requirement
POST /my_index/my_type/_search
{
"query": {
"match_phrase": {
"title": {
"query": "quick dog",
"slop": 50
}
}
}
}
If match_phrase is too strict, use the approach below: match as the base query, with match_phrase in a should clause to improve relevance
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"should": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
match_phrase is relatively expensive; it can be optimized by rescoring only the top results
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
Producing Shingles
Shingles (adjacent word pairs) improve relevance. They cost some extra indexing time and disk space, but at search time they are more efficient than match_phrase.
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": "the hungry alligator ate sue"
}
},
"should": {
"match": {
"title.shingles": "the hungry alligator ate sue"
}
}
}
}
}
Partial Matching
The prefix, wildcard, and regexp queries operate on terms. If you use them to query an analyzed field, they will examine each term in the field, not the field as a whole.
prefix, wildcard, and regexp are low-level, term-based queries. They are not well suited to analyzed fields: an analyzed field is broken into many terms, while these three queries treat the query string as a single term to match.
prefix
Prefix matching
prefix is very expensive; avoid it where possible, or use longer prefixes
GET /my_index/address/_search
{
"query": {
"prefix": {
"postcode": "W1"
}
}
}
wildcard
Wildcard matching
GET /my_index/address/_search
{
"query": {
"wildcard": {
"postcode": "W?F*HW"
}
}
}
regexp
Regular-expression matching
GET /my_index/address/_search
{
"query": {
"regexp": {
"postcode": "W[0-9].+"
}
}
}
Query-Time Search-as-You-Type
{
"match_phrase_prefix" : {
"brand" : {
"query": "johnnie walker bl",
"max_expansions": 50
}
}
}
Ngrams
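As an alternative to query-time prefix expansion, n-grams can be prepared at index time. A minimal sketch using an edge_ngram token filter (the names autocomplete_filter and autocomplete are illustrative):

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "autocomplete_filter" ]
                }
            }
        }
    }
}
```

The analyzer is then applied to a field in its mapping, with search_analyzer left as standard so the query string itself is not n-grammed.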
Cross-index search with per-index boosting
GET /blogs-*/post/_search
{
"query": {
"multi_match": {
"query": "deja vu",
"fields": [ "title", "title.stemmed" ],
"type": "most_fields"
}
},
"indices_boost": {
"blogs-en": 3,
"blogs-fr": 2
}
}
boosting Query
must_not is too strict; with a boosting query, negatively matching documents still appear in the results, but with a lower ranking
GET /_search
{
"query": {
"boosting": {
"positive": {
"match": {
"text": "apple"
}
},
"negative": {
"match": {
"text": "pie tart fruit crumble tree"
}
},
"negative_boost": 0.5
}
}
}
constant_score Query
GET /_search
{
"query": {
"bool": {
"should": [
{ "constant_score": {
"query": { "match": { "features": "wifi" }}
}},
{ "constant_score": {
"query": { "match": { "features": "garden" }}
}},
{ "constant_score": {
"boost": 2,
"query": { "match": { "features": "pool" }}
}}
]
}
}
}
human language
Identifying the language of text
Of particular note is the chromium-compact-language-detector library from Mike McCandless, which uses the open source (Apache License 2.0) Compact Language Detector (CLD) from Google. It is small, fast, and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R.
Identifying the language of the user’s search request is not quite as simple. The CLD is designed for text that is at least 200 characters in length. Shorter amounts of text, such as search keywords, produce much less accurate results. In these cases, it may be preferable to take simple heuristics into account such as the country of origin, the user’s selected language, and the HTTP accept-language headers.
sorting
GET /_search
"sort": "field"
- Sorting on multiple fields
GET /_search
{
"query" : {
"bool" : {
"must": { "match": { "tweet": "manage text search" }},
"filter" : { "term" : { "user_id" : 2 }}
}
},
"sort": [
{ "date": { "order": "desc" }},
{ "_score": { "order": "desc" }}
]
}
- When a field has multiple values
GET /_search
"sort": {
"dates": {
"order": "asc",
"mode": "min"
}
}
- Sorting on string fields
An analyzed string field is also multivalued. Sorting "find art odd" with mode min or max does not give the word-order sort we want, so the field should be indexed twice via fields: once analyzed and once not_analyzed
"tweet": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
GET /_search
{
"query": {
"match": {
"tweet": "elasticsearch"
}
},
"sort": "tweet.raw"
}
The analyzed version is used for full-text search; the not_analyzed version is used for sorting
- How the score is calculated
GET /_search?explain
{
"query" : { "match" : { "tweet" : "honeymoon" }}
}
- Why a document did or did not match (for a specified id)
GET /us/tweet/12/_explain
{
"query" : {
"bool" : {
"filter" : { "term" : { "user_id" : 2 }},
"must" : { "match" : { "tweet" : "honeymoon" }}
}
}
}
settings
Get settings
GET /cnki02/_settings
Set settings
- number_of_shards
- number_of_replicas
- analysis
PUT /my_temp_index
{
"settings": {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}
The number of replicas can be updated on a live index
PUT /my_temp_index/_settings
{
"number_of_replicas": 1
}
Configure an analyzer
PUT /spanish_docs
{
"settings": {
"analysis": {
"analyzer": {
"es_std": {
"type": "standard",
"stopwords": "_spanish_"
}
}
}
}
}
Configure a custom analyzer
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [ "&=> and "]
}},
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": [ "the", "a" ]
}},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [ "html_strip", "&_to_and" ],
"tokenizer": "standard",
"filter": [ "lowercase", "my_stopwords" ]
}}
}}}
mappings
Create Index
PUT /my_index
{
"settings": {
"number_of_replicas": 0,
"number_of_shards":1 },
"mappings": {
"type_one": { ... any mappings ... },
"type_two": { ... any mappings ... },
...
}
}
- Disable automatic index creation
action.auto_create_index: false
Get mappings
GET /cnki02/_mapping/
Dynamic mapping: the `dynamic` setting
- true: add new fields dynamically (the default)
- false: ignore new fields
- strict: throw an exception if an unknown field is encountered
PUT /my_index
{
"mappings": {
"my_type": {
"dynamic": "strict",
"properties": {
"title": { "type": "string"},
"stash": {
"type": "object",
"dynamic": true
}
}
}
}
}
Prevent strings from being auto-detected as dates
PUT /my_index
{
"mappings": {
"my_type": {
"date_detection": false
}
}
}
Disable _all, control it per field, or specify its analyzer
PUT /my_index/_mapping/my_type
{
"my_type": {
"_all": { "enabled": false }
}
}
Per-field control with include_in_all
PUT /my_index/my_type/_mapping
{
"my_type": {
"include_in_all": false,
"properties": {
"title": {
"type": "string",
"include_in_all": true
},
...
}
}
}
Specify an analyzer for _all
PUT /my_index/my_type/_mapping
{
"my_type": {
"_all": { "analyzer": "whitespace" }
}
}
Relevance
Term Frequency
tf(t in d) = √frequency
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}
boosting
- boosting indexes
GET /docs_2014_*/_search
{
"indices_boost": {
"docs_2014_10": 3,
"docs_2014_09": 2
},
"query": {
"match": {
"text": "quick brown fox"
}
}
}
analyze
Test an analyzer
GET /_analyze
{
"analyzer":"ik",
"text":"杨延友是好人"
}
GET /cnki02/_analyze
{
"field":"publishInfo.periodicalInfo.year",
"text":"Text to analyze"
}
Validating a query with _validate
GET /my_index/my_type/_validate/query?explain
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Foxes"}},
{ "match": { "english_title": "Foxes"}}
]
}
}
}
Exclude certain words from stemming
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english": {
"type": "english",
"stem_exclusion": [ "organization", "organizations" ],
"stopwords": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
"if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
"the", "their", "then", "there", "these", "they", "this", "to",
"was", "will", "with"
]
}
}
}
}
}
Configure a search-time analyzer with search_analyzer
PUT /my_index/my_type/_mapping
{
"my_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
aggregations
terms aggs
GET /megacorp/employee/_search
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
A query can be added so that the aggregation runs only over the matching documents
GET /megacorp/employee/_search
{
"query": {
"match": {
"last_name": "smith"
}
},
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
Aggregations can be nested inside other aggregations
GET /megacorp/employee/_search
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
# Aggregation result
{
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "music",
"doc_count": 2,
"avg_age": {
"value": 28.5
}
},
{
"key": "forestry",
"doc_count": 1,
"avg_age": {
"value": 35
}
},
{
"key": "sports",
"doc_count": 1,
"avg_age": {
"value": 25
}
}
]
}
}
}
Reindexing
- reindex
- Reindex API
- Reindex API Reference
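A minimal reindex request, assuming a version that ships the Reindex API (2.3+); the source and dest index names are illustrative:

```
POST /_reindex
{
    "source": { "index": "my_index_v1" },
    "dest":   { "index": "my_index_v2" }
}
```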
Index Aliases
- Create an alias
PUT /my_index_v1
PUT /my_index_v1/_alias/my_index
- Check what an alias points to
GET /*/_alias/my_index
GET /my_index_v1/_alias/*
- Manipulate aliases atomically
POST /_aliases
{
"actions": [
{ "remove": { "index": "my_index_v1", "alias": "my_index" }},
{ "add": { "index": "my_index_v2", "alias": "my_index" }}
]
}
Refresh
Committing a new segment to disk is expensive, but writing it to the filesystem cache is cheap; the latter is how near-real-time search is achieved. This process is called refresh.
Manual refresh API
POST /_refresh
POST /blogs/_refresh
When importing large amounts of data, disable the refresh interval, then restore it:
PUT /my_logs/_settings
{ "refresh_interval": -1 }
PUT /my_logs/_settings
{ "refresh_interval": "1s" }
Flush
The purpose of the translog is to ensure that operations are not lost
The process of committing segments based on the translog and then clearing the in-memory buffer and the translog is called flush.
Manual flush API
POST /blogs/_flush
POST /_flush?wait_for_ongoing
On a cluster this can be tuned (trading some durability for throughput):
PUT /my_index/_settings
{
"index.translog.durability": "async",
"index.translog.sync_interval": "5s"
}
Segment Merging
Segment merging is very resource-intensive
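For an index that is no longer being written to (an old logging index, for example), the optimize API (renamed _forcemerge in later versions) can merge each shard down to a chosen number of segments, making searches on it faster; it should never be run against an index that is still being actively updated:

```
POST /logstash-2014-10/_optimize?max_num_segments=1
```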
Key points
- document frequencies are calculated per shard, rather than per index
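With small test datasets, per-shard frequencies can make scores look wrong. The dfs_query_then_fetch search type first fetches term frequencies from all shards to compute global frequencies, at the cost of an extra round trip, so it is mainly useful for debugging:

```
GET /_search?search_type=dfs_query_then_fetch
{
    "query": { "match": { "title": "quick brown fox" }}
}
```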
- Against Deep Pagination
Paging up to roughly 1,000-5,000 pages (10,000-50,000 documents) is workable, but beyond that it is not: deep paging involves sorting across multiple shards, which is expensive, and it is also pointless in practice, since users rarely go past the first couple of pages.
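When every document really is needed (for reindexing, for example), the scroll API avoids the cost of deep pagination by keeping a point-in-time view of the index and returning results in batches:

```
GET /old_index/_search?scroll=1m
{
    "query": { "match_all": {} },
    "size":  1000
}
```

Each response contains a _scroll_id, which is passed to GET /_search/scroll to fetch the next batch.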