Elasticsearch v2.2 Quick Start (with three client styles: curl, Sense, and Python)

Article link: https://blog.csdn.net/AbnerGong/article/details/50794048

Author: AbnerGong

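It took me a week of reading the documentation and earlier blog posts just to reach a beginner's level. I wrote this article as a stepping stone so that you can spend less time getting started. I also hope we can learn from each other: if you write a more advanced article, or fill in the many details omitted here, please post the link in the comments. Many thanks!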

Understanding the basic concepts of Elasticsearch: I recommend reading "Elasticsearch学习笔记" (Elasticsearch Study Notes) by siddontang@2015.01

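Index and search are the basic concepts, and they are the main focus of this article.
Key Lucene concepts: a Document is one record and contains multiple Fields; each Field has a name and a value. If a field is analyzed, its value is split into multiple Terms; if it is not analyzed, the whole value is a single Term. The actual storage structure, the inverted index, is keyed by Term: for each Term it records which documents, and which positions in them, the Term occurs in (each such occurrence is a Token). A search is really a lookup in the Term table.
Comparing Elasticsearch with MySQL makes it easier to understand: index = database; type = table; mapping = schema (the fields a table contains and their types); document = row; field = column; aggregations (aggs) = GROUP BY.
The basic concepts of Node, Cluster, Shard, and Replica. See also "1.1.1 Basic Concepts".
The basic concepts of the RESTful API. Tools for accessing it: httpie or the Sense editor.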
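
The Document / Field / Term relationship described above can be sketched with a toy inverted index in plain Python. This is only an illustration under simplifying assumptions: `analyze` is a hypothetical stand-in for a real analyzer (it just lowercases and splits on whitespace), the two sample documents are made up for the example, and the dict layout is not Lucene's actual storage format.

```python
from collections import defaultdict

def analyze(text):
    """Hypothetical analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs, field, analyzed=True):
    """Map each term to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, doc in docs.items():
        value = doc[field]
        # An analyzed field yields many terms; a non-analyzed field yields one.
        terms = analyze(value) if analyzed else [value]
        for position, term in enumerate(terms):
            index[term].append((doc_id, position))
    return dict(index)

# Sample data, invented for this sketch.
docs = {
    1: {"about": "I love to go rock climbing"},
    2: {"about": "I like to collect rock albums"},
}
analyzed_index = build_inverted_index(docs, "about", analyzed=True)
raw_index = build_inverted_index(docs, "about", analyzed=False)

print(analyzed_index["rock"])   # occurrences of the term "rock"
print(sorted(raw_index))        # one term per document when not analyzed
```

With the analyzed index, "rock" is a key of its own; with the non-analyzed index, only the two full strings are keys, which is why a lookup for "rock" alone would find nothing there.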

Indexing and searching: I strongly recommend the Chinese translation of the Elasticsearch definitive guide (Elasticsearch 权威指南) by QQ1350995917@2014/05

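Suppose we have joined the HR department of Megacorp and have been asked to build an employee directory. The directory is meant to foster empathy between employer and employees and real-time, dynamic collaboration, and it has the following requirements:
1. Data can contain multi-value tags, numbers, and full text.
2. The full details of any employee can be retrieved.
3. Structured search is supported, e.g. finding employees older than 30.
4. Simple full-text search and more complex phrase search are supported.
5. Matching snippets can be highlighted in the documents that match a search.
6. Employers can build analytic charts and tables from the data.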

Create / delete / check the existence of an index:

#curl
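$ curl -XPUT 'http://localhost:9200/megacorp/'   # create the index
$ curl -XPUT 'http://localhost:9200/megacorp/' -d '  # create the index and set the shard and replica counts
index :
    number_of_shards : 3   # default is 5, i.e. 5 primary shards
    number_of_replicas : 2   # default is 1, i.e. 1 replica per primary shard
'
$ curl -XDELETE 'http://localhost:9200/megacorp/'  # delete the index
$ curl -XHEAD -i 'http://localhost:9200/megacorp'  # check whether the index exists (200 if it does, 404 if not)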

#Sense
PUT megacorp

#Python
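from elasticsearch import Elasticsearch  # official Python client (pip install elasticsearch)
es = Elasticsearch()  # connects to localhost:9200 by default
es.indices.create(index="megacorp")  # create the index
es.indices.delete(index="megacorp")  # delete the index
es.indices.exists(index="megacorp")  # True if the index exists, otherwise False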
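
The shard and replica settings shown in the curl example can also be written as a JSON body. The following is a minimal sketch: `index_settings` and `total_shard_copies` are hypothetical helper names introduced here, and the defaults simply mirror the ones quoted in the comments above (5 primary shards, 1 replica).

```python
import json

def index_settings(number_of_shards=5, number_of_replicas=1):
    """Build a create-index request body (defaults as quoted in the text above)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }

def total_shard_copies(body):
    """Primaries plus all their replicas: shards * (1 + replicas)."""
    index = body["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])

body = index_settings(number_of_shards=3, number_of_replicas=2)
print(json.dumps(body, indent=2))
print(total_shard_copies(body))
```

With the Python client, a dict like this would typically be passed as `es.indices.create(index="megacorp", body=body)`.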

Create / delete / read the table structure: Mapping

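The word "mapping" literally means a chart or map; here it is the structural layout of an index, or "structure" for short. Note that an index, a type within an index, and a field within a type each have their own mapping.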

#curl: create
curl -XPOST localhost:9200/customer -d '{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "type1" : {
            "properties" : {
                "field1" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}'
#read the mapping
curl -XGET 'http://localhost:9200/twitter/_mapping/tweet'

#Sense
#the mapping can also be specified when the index is created
PUT customer
{
  "mappings": {  #several type mappings can be given
    "user": {   #the first type, user
      "_all":       { "enabled": false  },  #disables the meta field _all
      "properties": {   #this type has the following fields
        "title":    { "type": "string"  },     #every field needs a type; string fields are analyzed by default
        "name":     { "type": "string"  }, 
        "age":      { "type": "integer" }  
      }
    },
    "blogpost": { 
      "properties": { 
        "title":    { "type": "string"  }, 
        "body":     { "type": "string"  }, 
        "user_id":  {
          "type":   "string", 
          "index":  "not_analyzed"     #do not analyze this field
        },
        "created":  {
          "type":   "date", 
          "format": "strict_date_optional_time||epoch_millis"   #date fields need a format
        }
      }
    }
  }
}
#read the mapping
GET twitter/_mapping/tweet,user  #index given, type names given (several types allowed)
GET twitter/_mapping  #index given, no type name
GET _mapping/tweet    #no index, type name given
GET _mapping   #no index, no type: returns every mapping
#read the mapping of individual fields
GET publications/_mapping/article/field/author.id,abstract,name  #index, the keyword _mapping, type, the keyword field, then the field names
GET _all/_mapping/tw*/field/*.id  #wildcards are allowed
GET _mapping/field/*       #no index, no type, no field name

#Python
#add a mapping while creating the index; note that JSON false becomes Python False
q = {
  "mappings": {
    "user2": {
      "_all":       { "enabled": False  },
      "properties": {
        "title":    { "type": "string"  },
        "name":     { "type": "string"  },
        "age":      { "type": "integer" }
      }
    }
  }
}
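es.indices.create(index="customer",body=q)  #create the index with its settings and mapping
#add a mapping to an existing index; this does not seem to support adding several at once. Adding fields to an existing mapping works the same way.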
requestbody = {
      "_all":       { "enabled": False  },
      "properties": {
        "title":    { "type": "string"  },
        "name":     { "type": "string"  },
        "age":      { "type": "integer" }
      }
    }
print(es.indices.put_mapping(index="customer",doc_type="user",body=requestbody))
#read the type mapping and the field mapping
es.indices.get_mapping(index="customer", doc_type="user") #both arguments are optional
es.indices.get_field_mapping(index="customer",doc_type="user2",fields=["name","age"])  
#check that a type exists
es.indices.exists_type(index="customer",doc_type="user")  #both arguments are required
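
The note about `false` versus `False` can be checked without a running cluster. Below is a minimal sketch that builds a mapping body as a Python dict and serializes it with the standard `json` module; `string_field` is a hypothetical helper introduced only for this example, and the field names are taken from the mappings above.

```python
import json

def string_field(analyzed=True):
    """ES 2.x string field: analyzed by default, "not_analyzed" keeps the value as one term."""
    field = {"type": "string"}
    if not analyzed:
        field["index"] = "not_analyzed"
    return field

mapping = {
    "_all": {"enabled": False},          # Python False ...
    "properties": {
        "title": string_field(),
        "user_id": string_field(analyzed=False),
        "age": {"type": "integer"},
    },
}

payload = json.dumps(mapping, sort_keys=True)
print(payload)                            # ... is serialized as JSON false
```

The Python client performs this serialization itself, so in practice the dict is passed directly as `body`.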

Indexing (i.e. inserting a document):

#curl
curl -XPUT 'http://localhost:9200/megacorp/employee/1?pretty' -d '
{
    "first_name":  "John",
    "last_name":   "Smith",
    "age":         25,
    "about":       "I love to go rock climbing",
    "interests":  ["sports","music"]
}'

#sense
PUT megacorp/employee/1
{
    "first_name":"John",
    "last_name":  "Smith",
    "age":        25,
    "about":      "I love to go rock climbing",
    "interests":["sports","music"]
}

#python
body={
    "first_name":"John",
    "last_name":  "Smith",
    "age":        25,
    "about":      "I love to go rock climbing",
    "interests":["sports","music"]
}
es.create(index="megacorp", doc_type="employee", id=1,body=body)

Simple lookup (by index / type / id)

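[Note] Do not search immediately after inserting: a new document only becomes visible to search after the next index refresh (every 1s by default), so wait about 1s or the document will not be found. The same applies below. Fetching a single document by id, as in this section, is real-time and should not need the wait.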

#curl
curl -XGET 'http://localhost:9200/megacorp/employee/1?pretty'

#Sense
GET megacorp/employee/1?pretty

#python
es.get(index="megacorp", doc_type="employee", id=1)

Simple search (searching for a string)

#curl
curl -XGET 'http://localhost:9200/_search?pretty' # show the contents of all indexes
curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' # only the index/type is given
curl -XGET 'http://localhost:9200/_search?q=about:"go rock"&pretty' # only the search content is given: about must contain "go rock" as an exact phrase
curl -XGET 'http://localhost:9200/megacorp/employee/_search?q=about:"go rock"&pretty' # both combined

#Sense
GET _search?pretty #show the contents of all indexes
GET megacorp/employee/_search?pretty #index/type given, no search content
GET _search?q=about:"go rock"&pretty #only the search content is given: about must contain "go rock" as an exact phrase
GET megacorp/employee/_search?q=about:"go rock"&pretty #both combined

#python
es.search() #no arguments. es.search()["hits"]["hits"] is a list; each item is a dict record with fields such as "_index", "_type", "_id", "_source", and "_score"
es.search(filter_path=['hits.hits._id', 'hits.hits._type']) #restrict what the response contains; this way each dict record has only the "_id" and "_type" fields
es.search(filter_path=['hits.hits._*']) #this selects all the fields
es.search(index='megacorp', doc_type="employee") #index/type given, no search content
es.search(from_=3, size=5) #skip the first 3 results: start at the result with index 3 and return 5 results
#for more parameters see the official [API documentation](http://elasticsearch-py.readthedocs.org/en/master/api.html)
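
When the `q=` parameter contains quotes and spaces, as in `about:"go rock"`, it is safer to percent-encode it before putting it in a URL. The following is a minimal sketch using only the Python standard library; `search_url` is a hypothetical helper introduced for this example, and the base address is the localhost one used throughout this article.

```python
from urllib.parse import quote, urlencode

def search_url(base, index=None, doc_type=None, q=None):
    """Build a _search URL, percent-encoding the q parameter if present."""
    path = "/".join([part for part in (index, doc_type) if part] + ["_search"])
    # quote_via=quote encodes spaces as %20 rather than "+".
    params = urlencode({"q": q}, quote_via=quote) if q is not None else ""
    return base.rstrip("/") + "/" + path + "?" + (params + "&" if params else "") + "pretty"

url = search_url("http://localhost:9200", "megacorp", "employee", 'about:"go rock"')
print(url)
```

The resulting string can be passed to curl or any HTTP client in place of the hand-written URLs above.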

How to search with the Search DSL

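The searches so far put their conditions in URL parameters, which only allows simple searches. For complex searches you need the Search DSL, which puts the search conditions in JSON and sends them to the server as the request body. As an example, here is the same search for documents whose about field contains the complete phrase "go rock":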

#curl
curl -XGET 'http://localhost:9200/_search?pretty' -d '
 {"from" : 0, "size" : 10,
"query":{"match_phrase":{"about":"go rock"}}}'

#Sense
GET _search?pretty
 {"from" : 0, "size" : 10,
"query":{"match_phrase":{"about":"go rock"}}}

#python
body = {"from" : 0, "size" : 10,
"query":{"match_phrase":{"about":"go rock"}}}
es.search(body=body) #pass the search body

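Clearly, the JSON part (i.e. the Search DSL) is identical across the curl, Sense, and Python approaches, while the non-JSON part has nothing to do with the search itself. However we change the search conditions, the non-JSON part stays the same; what we are studying is only the JSON part. The code from here on therefore contains only the JSON part and does not distinguish between curl, Sense, and Python.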

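The Search DSL in the code above contains three fields, "from", "size", and "query": "query" is what we want to search for, "from" is how many results to skip, and "size" is how many results to return. This is only an example; the Search DSL has many fields, including Query DSL, from/size, Sort, Source filtering, fields, Script fields, Field Data Fields, Post filter, Highlighting, Rescoring, Search Type, and Scroll. Their meaning and usage are described one by one below.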
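
To make the role of "from" and "size" concrete, here is a minimal paging sketch. `page_body` is a hypothetical helper introduced for this example; it only builds the request dict and assumes 1-based page numbers.

```python
def page_body(query, page, page_size=10):
    """Build a search body for the given 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "from": (page - 1) * page_size,  # number of results to skip
        "size": page_size,               # number of results to return
        "query": query,
    }

condition = {"match_phrase": {"about": "go rock"}}
first_page = page_body(condition, page=1)
third_page = page_body(condition, page=3, page_size=5)
print(first_page)
print(third_page)
```

The first call reproduces the body used in the examples above; the third page with 5 results per page skips the first 10 hits.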

Query DSL

0. Important preface

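Before going further, the difference between the Search DSL and the Query DSL needs to be stressed.
The Search DSL contains the Query DSL. A Search DSL object is executable; a Query DSL object only expresses a condition and is just one field of the Search DSL. Using the example above: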

QueryDSL = {"match_phrase":{"about":"go rock"}}  #a query: it expresses the meaning "about contains the phrase go rock"
SearchDSL = {"query": {"match_phrase":{"about":"go rock"}} } #a search: this can actually be executed

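"Query" and "search" mean almost the same thing in everyday language, which I find unintuitive, so I refer to a Query as a "condition".
The Query DSL (condition DSL) has two kinds of clause: leaf queries (atomic conditions) and compound queries (compound conditions).
The Query DSL expresses the conditions we want to query on, and any conditions can be combined into more complex ones. However they are combined, though, a condition is not the final JSON object: it has to be wrapped in an outer query (scoring) or filter (filtering) before it can be executed. Next I introduce atomic and compound conditions. You can also consult the official pages, but when testing, note that anything whose outermost key is not query or filter is only a condition, not the final JSON object, and cannot be sent as a search directly; it must be wrapped in one more layer first.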
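
The distinction between a bare condition and an executable search body can be sketched as a small check. This is a heuristic illustration only: `SEARCH_LEVEL_KEYS`, `is_bare_condition`, and `to_search_body` are names invented for this example, and the key list is a partial assumption rather than the full set of Search DSL fields.

```python
# Partial, assumed list of keys that belong to the search body rather than to a condition.
SEARCH_LEVEL_KEYS = {"query", "filter", "from", "size", "sort", "highlight", "post_filter"}

def is_bare_condition(obj):
    """Heuristic: a condition has exactly one top-level key (its query type)."""
    return (
        isinstance(obj, dict)
        and len(obj) == 1
        and not (set(obj) & SEARCH_LEVEL_KEYS)
    )

def to_search_body(obj):
    """Wrap a bare condition in "query"; return an existing search body unchanged."""
    return {"query": obj} if is_bare_condition(obj) else obj

condition = {"match_phrase": {"about": "go rock"}}
search = to_search_body(condition)
print(search)
```

Calling `to_search_body` on something that is already a search body leaves it as it is, so the helper can be applied safely more than once.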

1. Atomic conditions

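For example, "I want documents whose about field contains the phrase go rock" is a condition, and {"match_phrase":{"about":"go rock"}} expresses it. This condition is the smallest unit in the JSON, so let us call it an atomic condition (leaf query clause). Atomic conditions can of course be combined into compound conditions, which are covered later.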

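Atomic conditions fall into the match class (Full text queries) and the term class (Term level queries). Both are introduced at the top of their official documentation pages, but here is the explanation again:
As mentioned at the start of this article, every field is stored as a Term table (whether analyzed or not), and a lookup is really a lookup in that Term table. The difference is that match first analyzes the search string and then looks up the resulting terms, while term looks up the whole string, unanalyzed, directly in the Term table. In other words, whether the search string is analyzed depends on whether you use match or term, not on how the stored field was indexed (one caveat: match analyzes with the analyzer of the target field, so against a not_analyzed field the string stays a single term). For a concrete example see the Term Query documentation.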
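
The difference between the two lookups can be imitated on a toy term table. This is a rough sketch under stated assumptions: `analyze` is a hypothetical analyzer that lowercases and splits on whitespace, `match_lookup` treats the analyzed words as alternatives (any one may match), the sample texts are invented, and none of this reproduces real Elasticsearch scoring.

```python
def analyze(text):
    """Hypothetical analyzer: lowercase, then split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Term table for an analyzed field: term -> set of doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def term_lookup(index, value):
    """term-style: look up the whole string as-is, without analysis."""
    return index.get(value, set())

def match_lookup(index, text):
    """match-style: analyze the string first, then look up each resulting term."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits

docs = {1: "I love to go rock climbing", 2: "I like to collect rock albums"}
index = build_index(docs)
print(term_lookup(index, "go rock"))    # no single term equals "go rock"
print(match_lookup(index, "go rock"))   # "go" or "rock" is enough
print(term_lookup(index, "Go"))         # stored terms are lowercase
```

The term-style lookup fails for "go rock" and for "Go" because neither string exists as a stored term, while the match-style lookup succeeds because the string is analyzed first.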

match class (Full text queries)

This class has five members: match, multi_match, common_terms, query_string, and simple_query_string. They are listed one by one below.

#short form
{"match":{"about":"go suck"}} #match requires at least 1 of the words to match
{"match_phrase":{"about":"go rock"}} #match_phrase requires the complete phrase to match

#long form, which allows more parameters
{"match":{
     "about" : {
            "query" : "go suck",
            "operator" : "and",
            "zero_terms_query": "all"
        }
}}
{"match_phrase" : {
        "about" : {
            "query" : "go rock",
            "analyzer" : "my_analyzer"
        }
}}
{"match_phrase_prefix" : {
        "message" : {
            "query" : "go rock",
            "max_expansions" : 10
        }
}}

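multi_match: the multi-field version of match, i.e. it searches for the string in several fields at once
common_terms: a more specialized query that gives preference to uncommon words
query_string:
simple_query_string: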

term class (Term level queries)

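Full-text queries analyze the query string before executing, whereas term-level queries operate on the exact terms stored in the inverted index. These queries are usually used for structured data such as numbers, dates, and enums rather than for full-text fields. They also let you craft low-level queries that bypass the analysis process.
The term class includes term, terms, range, exists, missing, prefix, wildcard, regexp, fuzzy, type, and ids, listed one by one below:
term: finds documents that contain the exact term specified in the given field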
terms
range

#short form
{ "range":{"age":{"gt":30}}}

#long form, which allows more parameters:
#gte is ≥, gt is >, lte is ≤, lt is <, boost sets the weight (default 1)
{"range" : {
        "age" : {
            "gte" : 10,
            "lte" : 20,
            "boost" : 2.0
        }
}}
#for dates, there are the following forms
{
    "range" : {
        "date" : {
            "gte" : "now-1d/d",
            "lt" :  "now/d"
        }
    }
}
{
    "range" : {
        "born" : {
            "gte": "01/01/2012",
            "lte": "2013",
            "format": "dd/MM/yyyy||yyyy"
        }
    }
}
{"range" : {
        "timestamp" : {
            "gte": "2015-01-01 00:00:00", 
            "lte": "now", 
            "time_zone": "+01:00"
        }
}}
{"range" : {
         "postDate" : {
              "from" : "2010-03-01",
              "to" : "2010-04-01"
         }
}}

exists
missing
prefix
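wildcard:
regexp: finds documents whose specified field contains terms matching the given regular expression
fuzzy: finds documents whose specified field contains terms that are fuzzily similar to the specified term. Fuzziness is measured by Levenshtein edit distance
type: finds documents of the specified type
ids: finds documents with the specified type and IDs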
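
The Levenshtein edit distance mentioned for fuzzy can be computed with a short dynamic-programming routine. This is a generic textbook sketch of the plain distance (insertions, deletions, substitutions), not the code Elasticsearch runs; the function name and sample words are chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("rock", "rocks"))
print(levenshtein("rock", "sock"))
print(levenshtein("kitten", "sitting"))
```

A fuzzy lookup for "rock" with a maximum distance of 1 would, in this simplified picture, accept both "rocks" and "sock".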

For more, see the official documentation on full-text queries.
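
Returning to the range operators listed in this section (gt, gte, lt, lte), their meaning can be pinned down with a small local evaluator. This is a sketch for numeric values only; `in_range` is a hypothetical helper, the employee ages are sample data invented for the example, and keys other than the four operators (such as boost) are simply ignored.

```python
import operator

# The four comparison operators of a range condition.
RANGE_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """True if value satisfies every gt/gte/lt/lte bound; other keys are ignored."""
    return all(
        RANGE_OPS[name](value, bound)
        for name, bound in bounds.items()
        if name in RANGE_OPS
    )

# Sample data for illustration only.
ages = {"John": 25, "Jane": 32, "Douglas": 35}
older_than_30 = sorted(name for name, age in ages.items() if in_range(age, {"gt": 30}))
print(older_than_30)
print(in_range(20, {"gte": 10, "lte": 20, "boost": 2.0}))
```

This mirrors the structured-search requirement from the start of the article: finding the employees older than 30.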

2. Compound conditions

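Atomic conditions can be joined by connectives into larger compound conditions. For the connectives see the official compound queries page; they are listed one by one below: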

bool

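Syntax: each condition is first given an occurrence keyword (must / filter / should / must_not), and the conditions are then joined under bool.
Effect: it considers the conditions together and adds up their scores. must (the clause must match and contributes to the score), filter (the clause must match but its score is ignored), should (of these clauses, at least minimum_should_match must hold), must_not (the clause must not match).
Note: each condition may be a match-class or term-class atomic condition, or a compound condition. The occurrence keyword cannot be query: it appears in the reference documentation, but it failed when I tested it.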

{
    "bool" : {
        "must" : {
            "term" : { "user" : "kimchy" }
        },
        "filter": {
            "term" : { "tag" : "tech" }
        },
        "must_not" : {
            "range" : {
                "age" : { "from" : 10, "to" : 20 }
            }
        },
        "should" : [
            {
                "term" : { "tag" : "wow" }
            },
            {
                "term" : { "tag" : "elasticsearch" }
            }
        ],
        "minimum_should_match" : 1,
        "boost" : 1.0
    }
}
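Constructed in Python as follows:
atom1={"match":{"data":"really me"}} #atomic condition
atom2={"term":{"data":"bore"}} #atomic condition
atom3={"match_all":{}} #atomic condition
comp={"bool":{"filter":atom2}} #compound condition
comp={"bool":{"filter":comp}} #a compound condition built from a compound condition
body={"filter":comp}  #final command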
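
The matching rules of bool (ignoring scores) can be imitated locally. The following is a simplified sketch, not Elasticsearch's implementation: it only understands term leaves, all helper names are invented for this example, and the default for minimum_should_match (0 when must or filter clauses are present, otherwise 1) is an assumption made for the sketch.

```python
def as_list(clauses):
    """bool accepts either a single clause or a list of clauses."""
    if clauses is None:
        return []
    return clauses if isinstance(clauses, list) else [clauses]

def term_matches(doc, clause):
    """Evaluate {"term": {field: value}}; a list value matches if any element is equal."""
    field, value = next(iter(clause["term"].items()))
    actual = doc.get(field)
    return value in actual if isinstance(actual, list) else actual == value

def bool_matches(doc, bool_body):
    """Decide whether doc satisfies the must/filter/must_not/should clauses."""
    required = as_list(bool_body.get("must")) + as_list(bool_body.get("filter"))
    if not all(term_matches(doc, c) for c in required):
        return False
    if any(term_matches(doc, c) for c in as_list(bool_body.get("must_not"))):
        return False
    should = as_list(bool_body.get("should"))
    if not should:
        return True
    # Assumed default: should is optional next to must/filter, otherwise one must match.
    needed = bool_body.get("minimum_should_match", 0 if required else 1)
    return sum(term_matches(doc, c) for c in should) >= needed

query = {
    "must": {"term": {"user": "kimchy"}},
    "filter": {"term": {"tag": "tech"}},
    "must_not": {"term": {"status": "deleted"}},
    "should": [{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    "minimum_should_match": 1,
}
doc_a = {"user": "kimchy", "tag": ["tech", "wow"]}
doc_b = {"user": "kimchy", "tag": ["tech"]}
print(bool_matches(doc_a, query), bool_matches(doc_b, query))
```

Here doc_a passes every clause, while doc_b fails only because none of its tags satisfies a should clause and minimum_should_match is 1.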

boosting

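Syntax: each condition is first tagged positive or negative, and the conditions are then joined under boosting.
Effect: it can lower the score for some conditions. In bool, documents matching a NOT (must_not) clause are dropped; here, documents matching the negative clause are not dropped but have their score reduced.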

{
    "boosting" : {  
        "positive" : { 
            "term" : {  
                "field1" : "value1"
            }
        },
        "negative" : {  
            "term" : {
                "field2" : "value2"
            }
        },
        "negative_boost" : 0.2 
    }
}
#simplified structure: boosting > (positive > condition + negative > condition + negative_boost: value)
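
The effect of negative_boost can be written as one line of arithmetic. This is a simplified sketch that ignores everything else that goes into a real relevance score; `boosting_score` is a hypothetical function, and `None` is used here to mean "the document does not match the positive condition".

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Demote, rather than drop, documents that match the negative condition."""
    if positive_score is None:
        return None  # not matched by the positive condition, so not returned at all
    if matches_negative:
        return positive_score * negative_boost
    return positive_score

print(boosting_score(2.0, matches_negative=False, negative_boost=0.2))
print(boosting_score(2.0, matches_negative=True, negative_boost=0.2))
```

With negative_boost set to 0.2 as in the example above, a document that also matches the negative condition keeps only a fifth of its score but still appears in the results.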

constant_score

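Usage: each condition is first given the keyword filter or query, and is then wrapped in constant_score.
Effect: it turns the final score into a constant. Whichever combination of query/filter and match/term is used inside, every matching document gets the same score. According to the official documentation that constant equals the boost value (1.2 in the example below), which mainly matters when this clause is combined with other scoring clauses.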

{
    "constant_score" : {
        "filter" : {
            "term" : { "user" : "kimchy"}
        },
        "boost" : 1.2
    }
}
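
A toy version of constant_score makes the behavior easy to see. This sketch ignores any score normalization the server may apply; `constant_score` here is a local Python function invented for the example, the predicate stands in for the wrapped filter, and the documents are sample data.

```python
def constant_score(docs, predicate, boost=1.0):
    """Give every document that passes the filter the same score (the boost)."""
    return {doc_id: boost for doc_id, doc in docs.items() if predicate(doc)}

# Sample data for illustration only.
docs = {1: {"user": "kimchy"}, 2: {"user": "other"}, 3: {"user": "kimchy"}}
scores = constant_score(docs, lambda doc: doc["user"] == "kimchy", boost=1.2)
print(scores)
```

Documents 1 and 3 receive the same score, and document 2 is absent because it does not pass the filter.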

dis_max

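Usage: the documentation is fairly long and I have not read it closely.
Effect: where bool adds up the scores of its clauses, dis_max takes the best score among them. A document matches if any one of the clauses is satisfied.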

{
    "dis_max" : {
        "tie_breaker" : 0.7,
        "boost" : 1.2,
        "queries" : [
            {
                "term" : { "age" : 34 }
            },
            {
                "term" : { "age" : 35 }
            }
        ]
    }
}
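
The way dis_max combines clause scores can be sketched as follows, assuming the usual formulation: the best clause score, plus tie_breaker times the scores of the other matching clauses, times boost. This is a simplification that leaves out any normalization; `dis_max_score` is a hypothetical function, and `None` marks a clause that did not match.

```python
def dis_max_score(clause_scores, tie_breaker=0.0, boost=1.0):
    """Best clause score, plus tie_breaker times the other matching clause scores."""
    matching = [score for score in clause_scores if score is not None]
    if not matching:
        return None  # no clause matched, so the document does not match
    best = max(matching)
    others = sum(matching) - best
    return boost * (best + tie_breaker * others)

print(dis_max_score([1.0, 0.5], tie_breaker=0.7))
print(dis_max_score([None, 2.0], tie_breaker=0.7))
print(dis_max_score([None, None]))
```

With tie_breaker at 0 the result is purely the maximum; with tie_breaker at 1 it would equal the plain sum, which is the bool-style behavior described above.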