Elasticsearch Study Notes

1. Install Sense

https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html#sense

2. Getting Started

  • REQUEST FORMAT
    curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

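  • GET a document, returning only selected fields; when the document is found, the response contains "found": true

GET /website/blog/123?_source=title,text&pretty

  • Retrieve several documents in one request with _mget

GET /website/blog/_mget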
{
   "ids" : [ "1", "2" ]
}

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

  • Index or replace a whole document under a chosen ID
PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 2,
  "created":   false 
}

  • Partial update: merge new fields into an existing document
POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

To update a document with a script, add script.inline: true to the configuration file.

POST /website/blog/1/_update
{
   "script" : "ctx._source.views+=1"
}

A script can also append a value to an array field:

POST /website/blog/123/_update
{
   "script" : "ctx._source.tags+=new_tag",
   "params" : {
      "new_tag" : "search"
   }
}
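
  • Create a new document: let Elasticsearch generate the ID (POST), or refuse to overwrite an existing ID (op_type=create or _create)
POST /website/blog/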
{ ... }
PUT /website/blog/123?op_type=create
{ ... }
PUT /website/blog/123/_create
{ ... }
  • Document count
POST /cnki02,wanfang02,pubmed02/doc/_count
{
    "query":{
        "match_all": {}
    }
}

  • Bulk operations
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} } 
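
The bulk body above is newline-delimited JSON: one action line, optionally followed by one source line, and every line (including the last) ends with a newline. A minimal Python sketch of assembling such a body with the standard library follows. The helper name build_bulk_body is hypothetical, and sending the body to a cluster is deliberately left out.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, source) pairs into a newline-delimited bulk body.

    `operations` is a list of (action_dict, source_dict_or_None) tuples.
    """
    lines = []
    for action, source in operations:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"

meta = {"_index": "website", "_type": "blog", "_id": "123"}
body = build_bulk_body([
    ({"delete": meta}, None),
    ({"create": meta}, {"title": "My first blog post"}),
    ({"index": {"_index": "website", "_type": "blog"}},
     {"title": "My second blog post"}),
])
print(body, end="")
```

Building the body from Python objects avoids the two most common mistakes with hand-written bulk requests: pretty-printed JSON that spans several lines, and a missing final newline.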

_search

Return only selected fields

GET /cnki02/_search
{
    "query": {"match_all": {}},
    "_source":["title_cn","organizations"]
}

exists

Note the difference between null and the string "null".

POST /pubmed02/doc/_search
{
    "query": {
        "exists" : { "field" : "source_all" }
    }
}

For an object field, exists and missing are evaluated against the flattened inner fields, combined as bool should clauses.

{
   "name" : {
      "first" : "John",
      "last" :  "Smith"
   }
}

The reason that it works is that a filter like

{
    "exists" : { "field" : "name" }
}

is really executed as

{
    "bool": {
        "should": [
            { "exists": { "field": "name.first" }},
            { "exists": { "field": "name.last" }}
        ]
    }
}
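
To make the flattening concrete, here is a small Python sketch that models it outside Elasticsearch. It is only an illustration of the idea described above, not how Lucene stores fields; the function names leaf_fields and exists are hypothetical, and arrays are not handled.

```python
def leaf_fields(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf value in a nested dict."""
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from leaf_fields(value, path + ".")
        else:
            yield path, value

def exists(doc, field):
    """True if `field` or any inner field under it has a non-null value."""
    return any(
        value is not None
        for path, value in leaf_fields(doc)
        if path == field or path.startswith(field + ".")
    )

doc = {"name": {"first": "John", "last": "Smith"}}
print(dict(leaf_fields(doc)))   # the object is seen as name.first and name.last
print(exists(doc, "name"))      # any inner field with a value is enough
print(exists({"name": {"first": None, "last": None}}, "name"))
```

The same model shows the null versus "null" distinction: a field holding None does not exist, while a field holding the string "null" does.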

missing

POST /pubmed02/doc/_search
{
    "query": {
        "missing" : { "field" : "source_all" }
    }
}

Multiple exact values: terms

Contains, but Does Not Equal

GET /my_store/products/_search
{
    "query": {
        "constant_score": {
            "filter": {
                "terms": {
                    "price": [20, 30]
                }
            }
        }
    }
}

Range query

GET /my_store/products/_search
{
    "query": {
        "constant_score": {
            "filter": {
                "range": {
                    "price": {
                        "gt": 20,
                        "lt": 40
                    }
                }
            }
        }
    }
}

Query-string search

GET /megacorp/employee/_search?q=last_name:Smith

  • The query text is analyzed

match and controlling precision

  • operator
  • minimum_should_match works the same way as minimum_should_match on bool should clauses
GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": {      
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }
}

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":                "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "brown" }},
        { "match": { "title": "fox"   }},
        { "match": { "title": "dog"   }}
      ],
      "minimum_should_match": 2 
    }
  }
}
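
As a rough illustration of how a percentage such as "75%" turns into a clause count, the sketch below assumes the computed value is rounded down, so 75% of three terms means two terms must match. The helper required_matches is hypothetical and only handles plain positive percentages, not negative values or combined expressions.

```python
def required_matches(term_count, minimum_should_match):
    """Number of terms that must match for a positive percentage like "75%"."""
    percent = int(minimum_should_match.rstrip("%"))
    # Integer division rounds the computed value down.
    return (term_count * percent) // 100

print(required_matches(3, "75%"))   # "quick brown dog" has three terms
print(required_matches(4, "75%"))
```

Under that assumption, the match query with "75%" on three terms behaves like the bool query with three should clauses and "minimum_should_match": 2.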
Boosting should clauses

A reasonable range for boost lies between 1 and 10, maybe 15. Boosts higher than that have little more impact because scores are normalized.

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {  
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3 
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2 
                    }
                }}
            ]
        }
    }
}

match_phrase

  • Matches an exact word or phrase
  • The terms of the phrase must appear in order and directly next to each other
GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "I love to go rock climbing"
        }
    }
}

Multifield Search

Best Fields: dis_max Query

What if, instead of combining the scores from each field, we used the score from the best-matching field as the overall score for the query? This would give preference to a single field that contains both of the words we are looking for, rather than the same word repeated in different fields.

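The score, and therefore the ranking, comes from the single best-matching field rather than from combining the scores of several fields.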

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
Tuning Best Fields Queries: tie_breaker

Tuning the best-fields query
With the tie_breaker, all matching clauses count, but the best-matching clause counts most.
The tie_breaker can be a floating-point value between 0 and 1, where 0 uses just the best-matching clause and 1 counts all matching clauses equally. The exact value can be tuned based on your data and queries, but a reasonable value should be close to zero, (for example, 0.1 - 0.4), in order not to overwhelm the best-matching nature of dis_max.

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.3
        }
    }
}
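
The tie_breaker description above can be sketched in a few lines of Python. This is a simplified model that ignores query normalization, and the clause scores used here are made-up numbers, not real Elasticsearch output.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching clauses.

    Clauses that do not match should be left out of `clause_scores`.
    """
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    others = sum(clause_scores) - best
    return best + tie_breaker * others

title_score, body_score = 2.0, 1.0  # hypothetical per-field scores
print(dis_max_score([title_score, body_score]))        # best clause only
print(dis_max_score([title_score, body_score], 0.3))   # others count for 30%
print(dis_max_score([title_score, body_score], 1.0))   # all clauses count equally
```

With tie_breaker at 0 the weaker field is ignored entirely; at 1 the result is the plain sum, which is why values near zero preserve the best-field behavior.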
Most Fields
GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":  "jumping rabbits",
            "type":   "most_fields", 
            "fields":      [ "title^10", "title.std" ]
        }
    }
}

Proximity Matching: match_phrase

Use slop to relax the strictness of the phrase match

POST /my_index/my_type/_search
{
   "query": {
      "match_phrase": {
         "title": {
            "query": "quick dog",
            "slop":  50 
         }
      }
   }
}

match_phrase on its own is too strict. Instead, use match as the base query and add match_phrase as a should clause to raise relevance:

GET /my_index/my_type/_search   
{
  "query": {
    "bool": {
      "must": {
        "match": { 
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { 
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}

match_phrase is relatively expensive. It can be optimized by rescoring only the top results:

GET /my_index/my_type/_search
{
    "query": {
        "match": {  
            "title": {
                "query":                "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, 
        "query": {         
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}
Producing Shingles

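Word association
Indexing pairs of adjacent words (shingles) improves relevance. It costs some extra indexing time and disk space, but at search time it is more efficient than match_phrase.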

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, 
                    "max_shingle_size": 2, 
                    "output_unigrams":  false   
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" 
                    ]
                }
            }
        }
    }
}


GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}
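
To see what the shingle filter configured above produces (size 2, no unigrams), here is a plain Python sketch. Tokenization is simplified to lowercase plus whitespace splitting, which is an assumption standing in for the standard tokenizer and lowercase filter; the function name shingles is hypothetical.

```python
def shingles(text, size=2):
    """Return word n-grams (shingles) of `size` words from `text`."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]

print(shingles("the hungry alligator ate sue"))
# ['the hungry', 'hungry alligator', 'alligator ate', 'ate sue']
```

A document that shares several of these word pairs with the query gets extra score from the should clause, which is how shingles reward word order without the cost of a phrase query.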

Partial Matching

The prefix, wildcard, and regexp queries operate on terms. If you use them to query an analyzed field, they will examine each term in the field, not the field as a whole.
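prefix, wildcard, and regexp are low-level, term-based queries. They are not well suited to analyzed fields: an analyzed field is split into several terms, while these three queries treat the query string as a single term.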

prefix: prefix matching

prefix is very expensive. Avoid it where possible, or use a longer prefix.

GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}

wildcard: wildcard matching
GET /my_index/address/_search
{
    "query": {
        "wildcard": {
            "postcode": "W?F*HW" 
        }
    }
}
regexp: regular expressions
GET /my_index/address/_search
{
    "query": {
        "regexp": {
            "postcode": "W[0-9].+" 
        }
    }
}
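
The point that these queries test each term, not the field as a whole, can be illustrated in Python. This is only a model: fnmatch and re stand in for Lucene's wildcard and regexp syntax (which differ in details), the postcodes are sample data chosen for the illustration, and the analysis step is simplified to lowercase plus whitespace splitting.

```python
import fnmatch
import re

def wildcard_terms(pattern, terms):
    """Terms matching a wildcard pattern (? = one character, * = any run)."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]

def regexp_terms(pattern, terms):
    """Terms matching a regular expression in full (anchored at both ends)."""
    return [t for t in terms if re.fullmatch(pattern, t)]

postcodes = ["W1V 3DG", "W2F 8HW", "W1F 7HW", "WC1N 1LZ", "SW5 0BE"]

# not_analyzed: each postcode is stored as one term.
print(wildcard_terms("W?F*HW", postcodes))
print(regexp_terms("W[0-9].+", postcodes))

# Analyzed (simplified): the patterns are now tested against fragments
# such as "w1f" and "7hw", so the same wildcard finds nothing.
analyzed = [term for p in postcodes for term in p.lower().split()]
print(wildcard_terms("W?F*HW", analyzed))
```

This is why the postcode field in these examples needs to be indexed as an exact value for prefix, wildcard, and regexp to behave as expected.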
Query-Time Search-as-You-Type
{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
Ngrams

Searching across indices and boosting by index

GET /blogs-*/post/_search 
{
    "query": {
        "multi_match": {
            "query":   "deja vu",
            "fields":  [ "title", "title.stemmed" ],
            "type":    "most_fields"
        }
    },
    "indices_boost": { 
        "blogs-en": 3,
        "blogs-fr": 2
    }
}

boosting Query

must_not is too strict. With a boosting query, documents that match the negative clause still appear in the results but are ranked lower.

GET /_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "text": "apple"
        }
      },
      "negative": {
        "match": {
          "text": "pie tart fruit crumble tree"
        }
      },
      "negative_boost": 0.5
    }
  }
}

constant_score Query

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "constant_score": {
          "query": { "match": { "features": "wifi" }}
        }},
        { "constant_score": {
          "query": { "match": { "features": "garden" }}
        }},
        { "constant_score": {
          "boost":   2,
          "query": { "match": { "features": "pool" }}
        }}
      ]
    }
  }
}

human language

Identifying the language

Of particular note is the chromium-compact-language-detector library from Mike McCandless, which uses the open source (Apache License 2.0) Compact Language Detector (CLD) from Google. It is small, fast, and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R.
Identifying the language of the user’s search request is not quite as simple. The CLD is designed for text that is at least 200 characters in length. Shorter amounts of text, such as search keywords, produce much less accurate results. In these cases, it may be preferable to take simple heuristics into account such as the country of origin, the user’s selected language, and the HTTP accept-language headers.

sorting

GET /_search
{
    "sort": "field"
}
  • Sorting on multiple fields
GET /_search
{
    "query" : {
        "bool" : {
            "must":   { "match": { "tweet": "manage text search" }},
            "filter" : { "term" : { "user_id" : 2 }}
        }
    },
    "sort": [
        { "date":   { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}
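  • Sorting on a field with multiple values: use mode to choose which value to sort on
GET /_search
{
    "sort": {
        "dates": {
            "order": "asc",
            "mode":  "min"
        }
    }
}
  • Sorting on string fields
An analyzed string field is also multivalued: a value such as "find art odd" is indexed as several terms, so sorting with mode min or max sorts on one of those terms rather than on the whole string in word order. The field therefore needs to be indexed twice using fields, once analyzed and once not_analyzed: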
"tweet": { 
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": { 
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}
The analyzed field is used for full-text search; the not_analyzed field is used for sorting.
  • How the score is calculated
GET /_search?explain 
{
   "query"   : { "match" : { "tweet" : "honeymoon" }}
}
  • Why a specific document (by ID) did or did not match
GET /us/tweet/12/_explain
{
   "query" : {
      "bool" : {
         "filter" : { "term" :  { "user_id" : 2           }},
         "must" :  { "match" : { "tweet" :   "honeymoon" }}
      }
   }
}

settings

Get settings

GET /cnki02/_settings/

Configure settings

  • number_of_shards
  • number_of_replicas
  • analysis
PUT /my_temp_index
{
    "settings": {
        "number_of_shards" :   1,
        "number_of_replicas" : 0
    }
}

The number of replica shards can be changed on an existing index:

PUT /my_temp_index/_settings
{
    "number_of_replicas": 1
}

Configure an analyzer

PUT /spanish_docs
{
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type":      "standard",
                    "stopwords": "_spanish_"
                }
            }
        }
    }
}

Configure a custom analyzer
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}

mappings

Create Index

PUT /my_index
{
    "settings": { 
        "number_of_replicas": 0,
        "number_of_shards":1 },
    "mappings": {
        "type_one": { ... any mappings ... },
        "type_two": { ... any mappings ... },
        ...
    }
}
  • Disable automatic index creation
    action.auto_create_index: false

Get mappings

GET /cnki02/_mapping/

Dynamic mapping

  • true: Add new fields dynamically (the default)
  • false: Ignore new fields
  • strict: Throw an exception if an unknown field is encountered
PUT /my_index
{
    "mappings": {
        "my_type": {
            "dynamic":      "strict", 
            "properties": {
                "title":  { "type": "string"},
                "stash":  {
                    "type":     "object",
                    "dynamic":  true 
                }
            }
        }
    }
}

Prevent strings from being mistaken for dates

PUT /my_index
{
    "mappings": {
        "my_type": {
            "date_detection": false
        }
    }
}

Disable _all, choose which fields it includes, or set its analyzer

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "_all": { "enabled": false }
    }
}

Choose which fields are included in _all
PUT /my_index/my_type/_mapping
{
    "my_type": {
        "include_in_all": false,
        "properties": {
            "title": {
                "type":           "string",
                "include_in_all": true
            },
            ...
        }
    }
}

Set the analyzer for _all
PUT /my_index/my_type/_mapping
{
    "my_type": {
        "_all": { "analyzer": "whitespace" }
    }
}

Relevance

Term Frequency

tf(t in d) = √frequency
If you don't care how often a term appears in a field, and all you care about is that the term is present, you can disable term frequencies in the field mapping:

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type":          "string",
          "index_options": "docs" 
        }
      }
    }
  }
}
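
The term-frequency formula above, tf(t in d) = √frequency, is easy to check numerically. The sketch below is just the formula in Python; the function name term_frequency is hypothetical.

```python
import math

def term_frequency(frequency):
    """tf(t in d) = sqrt(frequency): how often term t appears in document d."""
    return math.sqrt(frequency)

for freq in (1, 4, 9):
    print(freq, term_frequency(freq))
```

The square root dampens repetition: a term that appears nine times contributes three times as much as a term that appears once, not nine times as much.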

boosting

  • boosting indexes
GET /docs_2014_*/_search 
{
  "indices_boost": { 
    "docs_2014_10": 3,
    "docs_2014_09": 2
  },
  "query": {
    "match": {
      "text": "quick brown fox"
    }
  }
}

analyze

Test an analyzer

GET /_analyze
{
    "analyzer":"ik",
    "text":"杨延友是好人"
}

GET /cnki02/_analyze
{   
    "field":"publishInfo.periodicalInfo.year",
    "text":"Text to analyze"
}
Validate a query: _validate
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}

Exclude certain words from stemming

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ], 
          "stopwords": [ 
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

Configure the search-time analyzer: search_analyzer

PUT /my_index/my_type/_mapping
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "analyzer":  "autocomplete", 
                "search_analyzer": "standard" 
            }
        }
    }
}

aggregations

terms aggs

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

Aggregate over the results of a query

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}

Nest one aggregation inside another

GET /megacorp/employee/_search
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}
# Aggregation result
{
"aggregations": {
      "all_interests": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "music",
               "doc_count": 2,
               "avg_age": {
                  "value": 28.5
               }
            },
            {
               "key": "forestry",
               "doc_count": 1,
               "avg_age": {
                  "value": 35
               }
            },
            {
               "key": "sports",
               "doc_count": 1,
               "avg_age": {
                  "value": 25
               }
            }
         ]
      }
   }
}

Reindexing

  1. Reindex
    Reindex API
    Reindex API reference

  2. Index Aliases


  • Create an alias
PUT /my_index_v1 
PUT /my_index_v1/_alias/my_index 
  • Check which index an alias points to
GET /*/_alias/my_index
GET /my_index_v1/_alias/*
  • Change an alias (remove and add in a single request)
POST /_aliases
{
    "actions": [
        { "remove": { "index": "my_index_v1", "alias": "my_index" }},
        { "add":    { "index": "my_index_v2", "alias": "my_index" }}
    ]
}

Refresh

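Committing a new segment is expensive, but writing it to the filesystem cache is cheap, and the latter is enough to make near real-time search possible. This process is called a refresh. Manual refresh API: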

POST /_refresh 
POST /blogs/_refresh 

When importing a large amount of data

PUT /my_logs/_settings
{ "refresh_interval": -1 } 

PUT /my_logs/_settings
{ "refresh_interval": "1s" } 

Flush

The purpose of the translog is to ensure that operations are not lost.
Committing the segments covered by the translog and then clearing the in-memory buffer and the translog is called a flush.
Manual flush API:

POST /blogs/_flush 
POST /_flush?wait_for_ongoing 

On a cluster, this can be tuned:

PUT /my_index/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
}

Segment Merging

Segment merging is very resource-intensive.

Key points

  1. document frequencies are calculated per shard, rather than per index
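  2. Avoid Deep Pagination
    Paging to a depth of 1,000 to 5,000 pages (10,000 to 50,000 documents) is feasible, but not much deeper. Every request has to gather and sort results from multiple shards, which strains performance. It also makes little sense in practice, because people rarely read past the second page.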
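
A rough way to see why deep pages are costly: each shard has to produce its own top from + size results, and the coordinating node then has to sort all of them to keep just one page. The Python sketch below does that arithmetic. The shard count of 5 is an assumption for illustration, and the function name deep_paging_cost is hypothetical.

```python
def deep_paging_cost(page, size, shards):
    """Return (docs each shard must produce, docs the coordinating node sorts)."""
    offset = (page - 1) * size   # value of the `from` parameter
    per_shard = offset + size    # each shard returns its own top from + size
    return per_shard, per_shard * shards

print(deep_paging_cost(page=1, size=10, shards=5))
print(deep_paging_cost(page=1000, size=10, shards=5))
```

For page 1 the coordinating node sorts 50 candidates to return 10; for page 1,000 it sorts 50,000 candidates and still returns only 10.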