Elasticsearch Notes
- Elasticsearch Notes
- 1.Install Sense
- 2.Getting Started
- _search
- human language
- sorting
- settings
- mappings
- Relevance
- analyze
- aggregations
- Reindexing
- Refresh
- Flush
- Segment Merging
- Key points
1.Install Sense
https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html#sense
2.Getting Started
- REQUEST FORMAT
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
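For example, counting every document in the cluster (assuming Elasticsearch is listening on localhost:9200; host and port depend on your setup):

```
curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}'
```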
- GET a document, returning only the specified fields; when the document is found, `found` is `true`
GET /website/blog/123?_source=title,text&pretty
GET /website/blog/_mget
{
"ids" : [ "1", "2" ]
}
GET /website/blog/_mget
{
"docs" : [
{ "_id" : 2 },
{ "_type" : "pageviews", "_id" : 1 }
]
}
- Updating a Whole Document
`created` is `false` because a document with this id already exists
PUT /website/blog/123
{
"title": "My first blog entry",
"text": "I am starting to get the hang of this...",
"date": "2014/01/02"
}
{
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 2,
"created": false
}
POST /website/blog/1/_update
{
"doc" : {
"tags" : [ "testing" ],
"views": 0
}
}
Updating data with a script requires adding `script.inline: true` to the config file
POST /website/blog/1/_update
{
"script" : "ctx._source.views+=1"
}
A script can also be used to append values to an array
POST /website/blog/123/_update
{
"script" : "ctx._source.tags+=new_tag",
"params" : {
"new_tag" : "search"
}
}
- Creating a New Document
Create a new document without overwriting an existing one
POST /website/blog/
{ ... }
PUT /website/blog/123?op_type=create
{ ... }
PUT /website/blog/123/_create
{ ... }
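If a document with that id already exists, a create request fails instead of silently overwriting. A sketch of the expected behavior (the exact error message varies by version):

```
PUT /website/blog/123/_create
{ "title": "My first blog entry" }

# When id 123 already exists, the response is a conflict error, roughly:
# { "error": "DocumentAlreadyExistsException[ ... ]", "status": 409 }
```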
- Document count
POST /cnki02,wanfang02,pubmed02/doc/_count
{
"query":{
"match_all": {}
}
}
- Bulk operations with _bulk
A good bulk size to start playing with is around 5-15MB in size.
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
_search
Return only some fields
GET /cnki02/_search
{
"query": {"match_all": {}},
"_source":["title_cn","organizations"]
}
exists
Note that `null` and the string `"null"` are not the same
POST /pubmed02/doc/_search
{
"query": {
"exists" : { "field" : "source_all" }
}
}
An `exists` or `missing` query on an object works on the flattened fields, combined as a bool `should`. For example, given this document:
{
"name" : {
"first" : "John",
"last" : "Smith"
}
}
The reason that it works is that a filter like
{
"exists" : { "field" : "name" }
}
is really executed as
{
"bool": {
"should": [
{ "exists": { "field": "name.first" }},
{ "exists": { "field": "name.last" }}
]
}
}
missing
POST /pubmed02/doc/_search
{
"query": {
"missing" : { "field" : "source_all" }
}
}
Multiple exact values with terms
GET /my_store/products/_search
{
"filter": {
"terms": {
"price": [20,30]
}
}
}
Range query
GET /my_store/products/_search
{
"filter": {
"range": {
"price": {
"gt": 20,
"lt": 40
}
}
}
}
query-string search
GET /megacorp/employee/_search?q=last_name:Smith
- the query string is analyzed
match, with precision controlled by `operator` and `minimum_should_match` (equivalent to `minimum_should_match` in a bool `should`)
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"last_name" : "Smith"
}
}
}
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "BROWN DOG!",
"operator": "and"
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "quick brown dog",
"minimum_should_match": "75%"
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"bool": {
"should": [
{ "match": { "title": "brown" }},
{ "match": { "title": "fox" }},
{ "match": { "title": "dog" }}
],
"minimum_should_match": 2
}
}
}
Weighting should clauses with boost
GET /_search
{
"query": {
"bool": {
"must": {
"match": {
"content": {
"query": "full text search",
"operator": "and"
}
}
},
"should": [
{ "match": {
"content": {
"query": "Elasticsearch",
"boost": 3
}
}},
{ "match": {
"content": {
"query": "Lucene",
"boost": 2
}
}}
]
}
}
}
match_phrase
- matches a word or phrase exactly
- the words of the phrase must be adjacent and in order
GET /megacorp/employee/_search
{
"query" : {
"match_phrase" : {
"about" : "I love to go rock climbing"
}
}
}
Multifield Search
Best Fields: the dis_max Query
Score (and sort) on the single best-matching field, rather than combining scores across multiple fields
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body": "Brown fox" }}
]
}
}
}
Tuning Best Fields Queries with tie_breaker
With the tie_breaker, all matching clauses count, but the best-matching clause counts most.
The tie_breaker can be a floating-point value between 0 and 1, where 0 uses just the best-matching clause and 1 counts all matching clauses equally. The exact value can be tuned based on your data and queries, but a reasonable value should be close to zero (for example, 0.1-0.4), in order not to overwhelm the best-matching nature of dis_max.
{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body": "Quick pets" }}
],
"tie_breaker": 0.3
}
}
}
Most Fields
GET /my_index/_search
{
"query": {
"multi_match": {
"query": "jumping rabbits",
"type": "most_fields",
"fields": [ "title^10", "title.std" ]
}
}
}
Proximity Matching with match_phrase
Use slop to relax the exact-position requirement
POST /my_index/my_type/_search
{
"query": {
"match_phrase": {
"title": {
"query": "quick dog",
"slop": 50
}
}
}
}
If match_phrase is too strict, use the approach below: match as the base query, with match_phrase in a should clause to improve relevance
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"should": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
match_phrase is relatively expensive; it can be optimized by rescoring only the top results
GET /my_index/my_type/_search
{
"query": {
"match": {
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query": {
"match_phrase": {
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
Producing Shingles
Shingles (adjacent word pairs) improve relevance. They cost some extra indexing time and disk space, but at search time they are more efficient than match_phrase.
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"my_shingle_filter": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 2,
"output_unigrams": false
}
},
"analyzer": {
"my_shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_shingle_filter"
]
}
}
}
}
}
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": {
"title": "the hungry alligator ate sue"
}
},
"should": {
"match": {
"title.shingles": "the hungry alligator ate sue"
}
}
}
}
}
Partial Matching
The prefix, wildcard, and regexp queries operate on terms. If you use them to query an analyzed field, they will examine each term in the field, not the field as a whole.
prefix, wildcard, and regexp are low-level, term-based queries. They are not well suited to analyzed fields: an analyzed field is broken into many terms, while these three queries treat the query string as a single term to match.
prefix
Prefix matching
prefix is very expensive; avoid it where possible, or use longer prefixes
GET /my_index/address/_search
{
"query": {
"prefix": {
"postcode": "W1"
}
}
}
wildcard
Wildcard matching
GET /my_index/address/_search
{
"query": {
"wildcard": {
"postcode": "W?F*HW"
}
}
}
regexp
Regular-expression matching
GET /my_index/address/_search
{
"query": {
"regexp": {
"postcode": "W[0-9].+"
}
}
}
Query-Time Search-as-You-Type
{
"match_phrase_prefix" : {
"brand" : {
"query": "johnnie walker bl",
"max_expansions": 50
}
}
}
Ngrams
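As an alternative to query-time prefix expansion, n-grams can be prepared at index time. A minimal sketch using an edge_ngram token filter (the names autocomplete_filter and autocomplete are illustrative):

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "autocomplete_filter" ]
                }
            }
        }
    }
}
```

The analyzer is then applied to a field in its mapping, with search_analyzer left as standard so the query string itself is not n-grammed.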
Cross-index search with per-index boosting
GET /blogs-*/post/_search
{
"query": {
"multi_match": {
"query": "deja vu",
"fields": [ "title", "title.stemmed" ],
"type": "most_fields"
}
},
"indices_boost": {
"blogs-en": 3,
"blogs-fr": 2
}
}
boosting Query
must_not is too strict; with a boosting query, negatively matching documents still appear in the results, but with a lower ranking
GET /_search
{
"query": {
"boosting": {
"positive": {
"match": {
"text": "apple"
}
},
"negative": {
"match": {
"text": "pie tart fruit crumble tree"
}
},
"negative_boost": 0.5
}
}
}
constant_score Query
GET /_search
{
"query": {
"bool": {
"should": [
{ "constant_score": {
"query": { "match": { "features": "wifi" }}
}},
{ "constant_score": {
"query": { "match": { "features": "garden" }}
}},
{ "constant_score": {
"boost": 2,
"query": { "match": { "features": "pool" }}
}}
]
}
}
}
human language
Identifying the language of text
Of particular note is the chromium-compact-language-detector library from Mike McCandless, which uses the open source (Apache License 2.0) Compact Language Detector (CLD) from Google. It is small, fast, and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R.
Identifying the language of the user’s search request is not quite as simple. The CLD is designed for text that is at least 200 characters in length. Shorter amounts of text, such as search keywords, produce much less accurate results. In these cases, it may be preferable to take simple heuristics into account such as the country of origin, the user’s selected language, and the HTTP accept-language headers.
sorting
GET /_search
"sort": "field"
- Sorting on multiple fields
GET /_search
{
"query" : {
"bool" : {
"must": { "match": { "tweet": "manage text search" }},
"filter" : { "term" : { "user_id" : 2 }}
}
},
"sort": [
{ "date": { "order": "desc" }},
{ "_score": { "order": "desc" }}
]
}
- When a field has multiple values
GET /_search
"sort": {
"dates": {
"order": "asc",
"mode": "min"
}
}
- Sorting on string fields
An analyzed string field is also multivalued. Sorting "find art odd" with mode min or max does not give the word-order sort we want, so the field should be indexed twice via fields: once analyzed and once not_analyzed
"tweet": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
GET /_search
{
"query": {
"match": {
"tweet": "elasticsearch"
}
},
"sort": "tweet.raw"
}
The analyzed version is used for full-text search; the not_analyzed version is used for sorting
- How the score is calculated
GET /_search?explain
{
"query" : { "match" : { "tweet" : "honeymoon" }}
}
- Why a document did or did not match (for a specified id)
GET /us/tweet/12/_explain
{
"query" : {
"bool" : {
"filter" : { "term" : { "user_id" : 2 }},
"must" : { "match" : { "tweet" : "honeymoon" }}
}
}
}
settings
Get settings
GET /cnki02/_settings
Set settings
- number_of_shards
- number_of_replicas
- analysis
PUT /my_temp_index
{
"settings": {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}
The number of replicas can be updated on a live index
PUT /my_temp_index/_settings
{
"number_of_replicas": 1
}
Configure an analyzer
PUT /spanish_docs
{
"settings": {
"analysis": {
"analyzer": {
"es_std": {
"type": "standard",
"stopwords": "_spanish_"
}
}
}
}
}
Configure a custom analyzer
PUT /my_index
{
"settings": {
"analysis": {
"char_filter": {
"&_to_and": {
"type": "mapping",
"mappings": [ "&=> and "]
}},
"filter": {
"my_stopwords": {
"type": "stop",
"stopwords": [ "the", "a" ]
}},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [ "html_strip", "&_to_and" ],
"tokenizer": "standard",
"filter": [ "lowercase", "my_stopwords" ]
}}
}}}
mappings
Create Index
PUT /my_index
{
"settings": {
"number_of_replicas": 0,
"number_of_shards":1 },
"mappings": {
"type_one": { ... any mappings ... },
"type_two": { ... any mappings ... },
...
}
}
- Disable automatic index creation
action.auto_create_index: false
Get mappings
GET /cnki02/_mapping/
Dynamic mapping: the `dynamic` setting
- true: add new fields dynamically (the default)
- false: ignore new fields
- strict: throw an exception if an unknown field is encountered
PUT /my_index
{
"mappings": {
"my_type": {
"dynamic": "strict",
"properties": {
"title": { "type": "string"},
"stash": {
"type": "object",
"dynamic": true
}
}
}
}
}
Prevent strings from being auto-detected as dates
PUT /my_index
{
"mappings": {
"my_type": {
"date_detection": false
}
}
}
Disable _all, control it per field, or specify its analyzer
PUT /my_index/_mapping/my_type
{
"my_type": {
"_all": { "enabled": false }
}
}
Per-field control with include_in_all
PUT /my_index/my_type/_mapping
{
"my_type": {
"include_in_all": false,
"properties": {
"title": {
"type": "string",
"include_in_all": true
},
...
}
}
}
Specify an analyzer for _all
PUT /my_index/my_type/_mapping
{
"my_type": {
"_all": { "analyzer": "whitespace" }
}
}
Relevance
Term Frequency
tf(t in d) = √frequency
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}
boosting
- boosting indexes
GET /docs_2014_*/_search
{
"indices_boost": {
"docs_2014_10": 3,
"docs_2014_09": 2
},
"query": {
"match": {
"text": "quick brown fox"
}
}
}
analyze
Test an analyzer
GET /_analyze
{
"analyzer":"ik",
"text":"杨延友是好人"
}
GET /cnki02/_analyze
{
"field":"publishInfo.periodicalInfo.year",
"text":"Text to analyze"
}
Validating a query with _validate
GET /my_index/my_type/_validate/query?explain
{
"query": {
"bool": {
"should": [
{ "match": { "title": "Foxes"}},
{ "match": { "english_title": "Foxes"}}
]
}
}
}
Exclude certain words from stemming
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_english": {
"type": "english",
"stem_exclusion": [ "organization", "organizations" ],
"stopwords": [
"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
"if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
"the", "their", "then", "there", "these", "they", "this", "to",
"was", "will", "with"
]
}
}
}
}
}
Configure a search-time analyzer with search_analyzer
PUT /my_index/my_type/_mapping
{
"my_type": {
"properties": {
"name": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
aggregations
terms aggs
GET /megacorp/employee/_search
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
A query can be added so that the aggregation runs only over the matching documents
GET /megacorp/employee/_search
{
"query": {
"match": {
"last_name": "smith"
}
},
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
Aggregations can be nested inside other aggregations
GET /megacorp/employee/_search
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
# Aggregation result
{
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "music",
"doc_count": 2,
"avg_age": {
"value": 28.5
}
},
{
"key": "forestry",
"doc_count": 1,
"avg_age": {
"value": 35
}
},
{
"key": "sports",
"doc_count": 1,
"avg_age": {
"value": 25
}
}
]
}
}
}
Reindexing
- reindex
- Reindex API
- Reindex API Reference
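A minimal reindex request, assuming a version that ships the Reindex API (2.3+); the source and dest index names are illustrative:

```
POST /_reindex
{
    "source": { "index": "my_index_v1" },
    "dest":   { "index": "my_index_v2" }
}
```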
Index Aliases
- Create an alias
PUT /my_index_v1
PUT /my_index_v1/_alias/my_index
- Check what an alias points to
GET /*/_alias/my_index
GET /my_index_v1/_alias/*
- Manipulate aliases atomically
POST /_aliases
{
"actions": [
{ "remove": { "index": "my_index_v1", "alias": "my_index" }},
{ "add": { "index": "my_index_v2", "alias": "my_index" }}
]
}
Refresh
Committing a new segment to disk is expensive, but writing it to the filesystem cache is cheap; the latter is how near-real-time search is achieved. This process is called refresh.
Manual refresh API
POST /_refresh
POST /blogs/_refresh
When importing large amounts of data, disable the refresh interval, then restore it:
PUT /my_logs/_settings
{ "refresh_interval": -1 }
PUT /my_logs/_settings
{ "refresh_interval": "1s" }
Flush
The purpose of the translog is to ensure that operations are not lost
The process of committing segments based on the translog and then clearing the in-memory buffer and the translog is called flush.
Manual flush API
POST /blogs/_flush
POST /_flush?wait_for_ongoing
On a cluster this can be tuned (trading some durability for throughput):
PUT /my_index/_settings
{
"index.translog.durability": "async",
"index.translog.sync_interval": "5s"
}
Segment Merging
Segment merging is very resource-intensive
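For an index that is no longer being written to (an old logging index, for example), the optimize API (renamed _forcemerge in later versions) can merge each shard down to a chosen number of segments, making searches on it faster; it should never be run against an index that is still being actively updated:

```
POST /logstash-2014-10/_optimize?max_num_segments=1
```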
Key points
- document frequencies are calculated per shard, rather than per index
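With small test datasets, per-shard frequencies can make scores look wrong. The dfs_query_then_fetch search type first fetches term frequencies from all shards to compute global frequencies, at the cost of an extra round trip, so it is mainly useful for debugging:

```
GET /_search?search_type=dfs_query_then_fetch
{
    "query": { "match": { "title": "quick brown fox" }}
}
```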
- Against Deep Pagination
Paging up to roughly 1,000-5,000 pages (10,000-50,000 documents) is workable, but beyond that it is not: deep paging involves sorting across multiple shards, which is expensive, and it is also pointless in practice, since users rarely go past the first couple of pages.
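When every document really is needed (for reindexing, for example), the scroll API avoids the cost of deep pagination by keeping a point-in-time view of the index and returning results in batches:

```
GET /old_index/_search?scroll=1m
{
    "query": { "match_all": {} },
    "size":  1000
}
```

Each response contains a _scroll_id, which is passed to GET /_search/scroll to fetch the next batch.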