python 使用 Elasticsearch 增删查改

最新推荐文章于 2024-05-12 17:57:20 发布

i0208

最新推荐文章于 2024-05-12 17:57:20 发布

阅读量1.2k

点赞数

分类专栏： Elasticsearch

本文链接：https://blog.csdn.net/Waller_/article/details/109810964

版权

Elasticsearch 专栏收录该内容

6 篇文章 0 订阅

订阅专栏

pip install elasticsearch==7.10

连接ES

from elasticsearch import Elasticsearch

# es服务器ip, port
ES_IP = '172.30.xx.xx'
ES_PORT = 9200

# 认证信息
http_auth = ('elastic', '123456')

es = Elasticsearch(
    [ES_IP],
    http_auth=('elastic', '123456'),
    port=ES_PORT
)

基础概念

index: 索引, 可以看做是mysql的表名
doc_type: 文档类型
id: 可看做是mysql表中记录的id
body: 查询体, 就是es的查询语句，使用DSL语句，相比SQL语句要复杂得多，但是基本逻辑其实是类似的

操作

索引

创建索引

from elasticsearch.exceptions import RequestError

方法一: 若索引已经存在了，就返回个400，
try:
    es.indices.create(index='my-index')
except RequestError as e:
    print('索引已存在')

方法二: 参数 ignore=400 表示 忽略返回的400状态码
es.indices.create(index='my-index', ignore=400)  # 索引不存在创建,存在则不操作

删除索引

# 该索引及其内部的数据全部删除, 类似mysql把表删除

es.indices.delete(index='my-index', ignore=[400, 404])

数据

向指定索引内增加数据, 若索引不存在,则自动创建索引,并插入数据

方法一: index
如果不指定 id，会自动生成一个 id,
若该id存在,则是将此id的原数据删除,重新创建

es.index(index="user_info",doc_type="_doc",id=1, body={"name":"喜洋洋","age":21,})


方法二: create
需要我们指定 id 字段来唯一标识该条数据
若该id存在,则报错

from elasticsearch.exceptions import ConflictError
try:
    es.create(index="user_info",doc_type="_doc",id=1,body={"name":"喜洋洋","age":27})
except ConflictError as e:
    print('id存在')

删除索引内指定id对应的那一行数据

from elasticsearch.exceptions import NotFoundError
try:
    es.delete(index='user_info', doc_type='_doc', id=2)
except NotFoundError as e:
    print('数据不存在')

或

es.delete(index='user_info', doc_type='_doc', id=1, ignore=[404])  # 忽略状态码

修改索引内指定id对应的数据, 如 将id为1的name改成'懒洋洋'
body={"doc":{"name":"懒洋洋"}}  # 固定写法
es.update(index="user_info",doc_type="_doc",id=1, body=body)

get 查询

res = es.get(index="user_info", doc_type="_doc",id=1)
print(res)
{'_index': 'user_info', '_type': '_doc', '_id': '1', '_version': 3, '_seq_no': 11, '_primary_term': 1, 'found': True, '_source': {'name': '懒洋洋', 'age': 18}}

search 查询

1.查询所有:
body = {
        'query':{
        'match_all':{}
         }
    }

res = es.search(index='user_info',doc_type='_doc')
print(res['hits']['hits'])
[
    {'_index': 'user_info', '_type': '_doc', '_id': '2', '_score': 1.0, '_source': {'name': '喜洋洋', 'age': 27}},
    {'_index': 'user_info', '_type': '_doc', '_id': '1', '_score': 1.0, '_source': {'name': '懒洋洋', 'age': 18}}
]

2.根据某个字段的值进行查询数据, 如查询年龄为18的数据信息
body = {
        "query":{
            "term":{
                "age":18
                }
            }
        }
res = es.search(index='user_info',doc_type='_doc', body=body)
print(res['hits']['hits'])
[{'_index': 'user_info', '_type': '_doc', '_id': '1', '_score': 1.0, '_source': {'name': '懒洋洋', 'age': 18}}]
 
3.根据某个字段的多个值进行查询数据, 如查询年龄为18和27的数据信息
body = {
        "query":{
            "terms":{  # 注意是 'terms'
                "age":[18, 27]  # 注意是 列表
                }
            }
        }
res = es.search(index='user_info',doc_type='_doc', body=body)
print(res['hits']['hits'])
[
    {'_index': 'user_info', '_type': '_doc', '_id': '2', '_score': 1.0, '_source': {'name': '喜洋洋', 'age': 27}}, 
    {'_index': 'user_info', '_type': '_doc', '_id': '1', '_score': 1.0, '_source': {'name': '懒洋洋', 'age': 18}}
]

4.根据某个字段包含某个字符进行查询, 如查询名字中含有'喜'的数据
body = {
        "query":{
            "match":{
                "name":'喜'
                }
            }
        }
res = es.search(index='user_info',doc_type='_doc', body=body)
print(res['hits']['hits'])
[{'_index': 'user_info', '_type': '_doc', '_id': '2', '_score': 0.4700036, '_source': {'name': '喜洋洋', 'age': 27}}]

5.根据多个字段包含某个字符进行查询, 如查询name与age中含有'2'的数据
body = {
        "query":{
            "multi_match":{
                "query":'2',
                "fields":["name","age"]
                }
            }
        }
res = es.search(index='user_info',doc_type='_doc', body=body)
print(res['hits']['hits'])
[{'_index': 'user_info', '_type': '_doc', '_id': '3', '_score': 1.1001158, '_source': {'name': '美羊羊2', 'age': 20}}]

6.范围查询
body = {
        "query":{
            "range":{
                       "age":{
                             "gte":20,       # >=18
                             "lte":30        # <=30
                       }
                    }
            }
        }
res = es.search(index='user_info',doc_type='_doc', body=body)
print(res['hits']['hits'])
[
    {'_index': 'user_info', '_type': '_doc', '_id': '2', '_score': 1.0, '_source': {'name': '喜洋洋', 'age': 27}}, 
    {'_index': 'user_info', '_type': '_doc', '_id': '3', '_score': 1.0, '_source': {'name': '美羊羊2', 'age': 20}}
]

配置IK分词器后查询

首先是创建索引, 在创建索引时,就要指定字段的拆词粒度,如下,注意:这里是es7.x版本

body = {
  'mappings':{
    'dynamic':'strict',  # 规定如果添加新的字段,报错. 
      'properties':{
        'id': {
              'type': 'text',
          },
        'text':{
          'type':'text',
          'analyzer':'ik_max_word',  # 新增数据时,规定该字段对应值的拆词粒度为 ik_max_word
          "search_analyzer": "ik_smart"  # 查询时,规定该字段对应搜索词的拆词粒度为 ik_smart
        },
         'knowledge_id':{
          'type':'text',
        },
      }
  }
}

res = es.indices.create(index='my-index', body=body)

下面插入一些数据到索引中

info = [
    {'text':'服务器无法登录，提示“可信芯片异常，拒绝登陆！"', 'id':1, 'knowledge_id':1},
    {'text':'查看系统中所有用户的三种方式', 'id':2, 'knowledge_id':2},
    {'text':'如何退出三合一注册向导全屏界面？', 'id':2, 'knowledge_id':3},
]

for dic in info:
    es.index(index='my-index',doc_type="_doc",body=dic)

到head中看一下搜索词为'三合一拒绝登陆',拆词粒度为'ik_smart'时, 得到的拆词结果如下,得到三个词: '三合一','拒绝','登录',

对应到我们的数据中,可以查出来的结果应该是两条数据,第一条和第三条.

用代码实现查询:

body = {
        "query":{
            "match":{
                "text":'三合一拒绝登陆'
                }
            }
        }

res = es.search(index='my-index',doc_type='_doc', body=body)
print(res['hits']['hits'])

# 这里仅把查出的数据拿出来展示
# {'text': '服务器无法登录，提示“可信芯片异常，拒绝登陆！"', 'id': 1, 'knowledge_id': 1}}
# {'text': '如何退出三合一注册向导全屏界面？', 'id': 2, 'knowledge_id': 3}

python获取es拆分后的词

body={
    "text":"惠普 p2015dn",
    "analyzer":"ik_max_word"
    # "analyzer":"optimizeIK"
}
res = es.indices.analyze(index=INDEX_KNOWLEDGE, body=body)
key_list = [dic['token'] for dic in res['tokens']]
print(key_list)  # ['惠普', 'p2015dn', 'p', '2015', 'dn']

案例:

现在需要在文章内容中匹配搜索词,如果文章数量多,用mysql的模糊匹配会很慢,所以将mysql中的数据取出,存到es中,

import pymysql
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError, ConflictError, NotFoundError, ConnectionError

# 连接mysql获取数据
def checkmysql(num=None):
    conn = pymysql.connect(
        host='172.30.00.00',
        port=3306,
        user='xy',
        password='123456',
        database='xxx',
        charset='utf8',
    )

    cursor = conn.cursor(pymysql.cursors.DictCursor)

    cursor.execute('select * from knowledgetext')
    # rows = cursor.fetchmany(num)
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    return rows

# 连接es
ES_IP = '172.30.00.01'
ES_PORT = 9200

# 认证信息
http_auth = ('elastic', '123456')

es = Elasticsearch(
    [ES_IP],
    http_auth=('elastic', '123456'),
    port=ES_PORT
)


# 规定body内字段格式及拆词类型
body = {
    "settings": {
        "index": {
            "number_of_shards": 1,  # 是数据分片数，默认为5，有时候设置为3
            "number_of_replicas": 0  # 是数据备份数，如果只有一台机器，设置为0
        }
    },
  'mappings':{
    'dynamic':'strict',  # 规定如果添加新的字段,报错
      'properties':{
        'id': {
              'type': 'text',
          },
        'text':{
          'type':'text',
          'analyzer':'ik_max_word',  # 新增数据时,规定该字段的拆词粒度为 ik_max_word
          "search_analyzer": "ik_smart"  # 查询时,规定该此字段的拆词粒度为 ik_smart
        },
         'knowledge_id':{
          'type':'text',
        },
      }
  }
}

# 创建索引
es.indices.create(index='my-index', body=body)

# 将数据存入es
for dic in checkmysql():
    es.index(index='my-index',doc_type="_doc",body=dic, request_timeout=30)
    # request_timeout 是允许的最大超时时间

# 查询数据
# 1.查所有
body = {
        "size": 111,  # 最大显示数量,es默认展示10条

        'query':{
        'match_all':{}
         }
    }
# 2.查具体某个字段
body = {
        "size": 10000,  # 最大显示数量
        "query": {
            "match": {
                # "text": search_key,
                "text": {
                    "query": search_key,
                    "analyzer": "ik_smart",  # 用来指定搜索的词语按那种拆词粒度拆词
                    "operator": "or",  # 按拆分后的词查询时,词与词之间是 and 还是 or 的关系
                    "minimum_should_match": "75%"  # 该参数用来控制应该匹配的分词的最少数量,至少匹配几个词才召回查询的结果
                }
            }
        },
        }

try:
    res = es.search(
        index='my-index',
        # doc_type='_doc',  # 可以不加doc_type, 若加上该参数则会出现 ElasticsearchDeprecationWarning: [types removal] Specifying types in search requests is deprecated.
        body=body,
        request_timeout=30  # 允许的超时时间,默认是10s
    )
except ConnectionError:
    res = None
    print('连接超时或ES未启动')


def filter_data(res=None):
    data = []
    if res:
        for d in res['hits']['hits']:
            data.append(d.get('_source'))
    return data

print(len(filter_data(res)))

推荐文章参考文章参考文章参考文章

查看文章参考文章参考文章参考文章参考文章

常见错误

其他: 提高搜索精准度, 使IK分词器兼容英文分词

IK兼容英文分词

body = {
    "settings": {
        "index": {
            "number_of_shards": 1,  # 是数据分片数，默认为5，有时候设置为3
            "number_of_replicas": 0  # 是数据备份数，如果只有一台机器，设置为0
        },
        # 使中文分词器IK 增加对英文的支持,可以理解为基于ik自定义了拆词模式:optimizeIK
        "analysis": {
            "analyzer": {
                "optimizeIK": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": [
                        "stemmer"  # stemmer模式是将在ik对文档完成分词之后，将其中的英文单词做提取词干处理。
                    ] 
                }
            }
        }
    },
    'mappings': {
        # 'dynamic': 'strict',  # 规定如果添加新的字段,报错
        'properties': {
            'id': {
                'type': 'keyword',
            },
            'text': {
                "type": "text",
                "analyzer": "optimizeIK",  # 若不需要对英文分词的支持,可注销这行,将下面一行开打
                # "analyzer": "ik_max_word",  # 新增数据时,规定该字段的拆词粒度为 ik_max_word
                "search_analyzer": "ik_smart"  # 查询时,规定该此字段的拆词粒度为 ik_smart
            },
            'knowledge_id': {
                'type': 'keyword',
            },
        }
    }
}

补充: 使用上面模板创建的索引支持三种分词模式: ik_max_word, ik_smart, optimizeIK(这是自定义的分词模式,准确的说应该是IK的ik_max_word + stemmer英文拆词)

检查一下拆词结果

POST klbp-knowledge/_analyze
{
    "text":"惠普 p2015dn",
    "analyzer":"ik_smart"  # 这是IK自带的拆词模式
}

结果:
{
  "tokens" : [
    {
      "token" : "惠普",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "p2015dn",
      "start_offset" : 3,
      "end_offset" : 10,
      "type" : "LETTER",
      "position" : 1
    }
  ]
}


POST klbp-knowledge/_analyze
{
    "text":"惠普 p2015dn",
    "analyzer":"optimizeIK"  # 这是兼容英文拆词后的拆词模式
}

结果:
{
  "tokens" : [
    {
      "token" : "惠普",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "p2015dn",
      "start_offset" : 3,
      "end_offset" : 10,
      "type" : "LETTER",
      "position" : 1
    },
    {
      "token" : "p",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 2
    },
    {
      "token" : "2015",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "ARABIC",
      "position" : 3
    },
    {
      "token" : "dn",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 4
    }
  ]
}

提高搜索精准度

多字段搜索:
这是基于上面自定义了拆词模式的搜索
POST klbp-knowledge/_search
{
        "size": 10000,
        "query": {
            # bool: 内部的匹配方式如果匹配成功了,就把结果拿出来, 内部的匹配方式都会去匹配
            "bool": {
                "must_not": {
                    "match": {
                        "status": "0"
                    }
                },
                "must": [
                    {
                        # multi_match 匹配多个字段
                        "multi_match": {
                            "query": "惠普 p2015",
                            "fields": [
                                "title",
                                "abstract"
                            ],
                            "analyzer": "optimizeIK",
                            "minimum_should_match": "50%"
                            # "type": "best_fields",  # 使 完全匹配的文档占的评分比较高
                            # "tie_breaker": 0.3  # 使 没有完全匹配的评分乘以0.3的系数
                        }
                    },
                    {
                        "bool": {
                            "should": [
                                {
                                    "match": {
                                        "abstract": {
                                            "query": "惠普 p2015",
                                            "analyzer": "optimizeIK",  # 可根据实际情况换成 ik_smart 模式
                                            "boost": 2,
                                            "operator": "and"
                                        }
                                    }
                                },
                                {
                                    "match": {
                                        "title": {
                                            "query": "惠普 p2015",
                                            "analyzer": "optimizeIK",
                                            "operator": "and",
                                            "boost": 3
                                        }
                                    }
                                },
                                {
                                    "match": {
                                        "keyword": {
                                            "query": "惠普 p2015",
                                            "analyzer": "optimizeIK"
                                        }
                                    }
                                }
                            ],
                            "minimum_should_match": 1  # 至少满足一个条件
                        }
                    }
                ]
            }
        }
    }

单字段搜索:
POST klbp-knowledgetext/_search
{
  "size": 10000,
  "query": {
    "dis_max": {
      "queries": [
        {
          "function_score": {
            "query": {
              "match_phrase": {
                "text": {
                  "query": "惠普 p2015",
                  "slop": 2
                }
              }
            },
            # 定义了一个加分方法
            "functions": [
              {
                "weight": 10
              }
            ]
          }
        },
        {
          "match": {
            "text": {
              "query": "惠普 p2015",
              "analyzer": "optimizeIK",
              "minimum_should_match": "50%"
            }
          }
        }
      ]
    }
  }
}

提高搜索精准度推荐文章文章文章

如果需要搜索分页，可以通过from size组合来进行。from表示从第几行开始，size表示查询多少条文档。from默认为0，size默认为10，
如果搜索size大于10000，需要设置index.max_result_window参数
注意：size的大小不能超过index.max_result_window这个参数的设置，默认为10,000。

PUT _settings
{
    "index": {
        "max_result_window": "10000000"
    }
}

elasticsearch bool中should must联用问题

elasticsearch match_phrase slop参数问题

i0208

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录