Exact-match and BM25 queries in Python with Elasticsearch

Latest recommended article published on 2024-05-11 17:17:44

P-ShineBeam

Article link: https://blog.csdn.net/weixin_42045968/article/details/136476785

Exact-match and BM25 queries with the Python Elasticsearch client

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from elasticsearch_dsl import Search
from elasticsearch_dsl.query import Match

1) Connect to Elasticsearch

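es = Elasticsearch(
    ['http://127.21.0.0:9200'],
    # Replace with your own username and password.
    # Note: elasticsearch-py 8.x deprecates http_auth in favor of basic_auth.
    http_auth=('elastic', '123456'),
)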

2) Create the index

your_index_name = 'my_es_index'  # choose your own index name
# ignore=400 suppresses the error raised when the index already exists
es.indices.create(index=your_index_name, ignore=400)
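
As an alternative to swallowing the 400 error, the index can be checked for existence first. The following is a minimal sketch, assuming a client object that exposes `indices.exists` and `indices.create` as the official Python client does; the helper name `ensure_index` is hypothetical and not part of the original article.

```python
def ensure_index(es, index_name):
    """Create the index only if it is missing; return True if it was created."""
    # `es` is assumed to be an elasticsearch.Elasticsearch client instance.
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name)
    return True


# Possible usage with a live client:
# created = ensure_index(es, 'my_es_index')
```

Compared with `ignore=400`, this makes it explicit whether the index was newly created, at the cost of one extra request.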

Define the document mapping

A mapping defines how a document and the fields it contains are stored and indexed.

doc_mapping = {
    "properties": {
        "content": {
            "type": "text",
            "similarity": "BM25"  # score this field with BM25 (also the default)
        },
        "parent_id": {
            "type": "keyword"  # keyword type for exact matching
        },
        "class_id": {
            "type": "keyword"  # keyword type for exact matching
        }
    }
}

This defines a content field that is scored with BM25, plus parent_id and class_id fields that support exact matching.
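
To make the BM25 setting on the content field more concrete, here is a minimal pure-Python sketch of the textbook BM25 formula. It is only an illustration under stated assumptions: the names `tokenize` and `bm25_scores` are hypothetical, the one-token-per-character split is a rough stand-in for how an analyzer might handle Chinese text, and k1=1.2 and b=0.75 are simply the commonly used defaults. Elasticsearch's internal scoring differs in detail, so the numbers will not match real `_score` values.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["哇吃了橘子", "阿飞  吃了橙子", "冬天不允许穿裤衩"]
scores = bm25_scores("哇撒吃了橘子", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best], scores)
```

The first document shares the most characters with the query and is also the shortest, so it ranks first; the third shares none and scores zero.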

Apply the mapping to the index

es.indices.put_mapping(index=your_index_name, body=doc_mapping)

3) Bulk-insert documents

docs = [
    {"class_id": 'aa12234551', "parent_id": 12234551, "content": "哇吃了橘子", "inx": 3},
    {"class_id": 'aa12234551', "parent_id": 12234551, "content": "阿飞  吃了橙子", "inx": 4},
    {"class_id": 'aa12234551', "parent_id": 12234551, "content": "冬天不允许穿裤衩", "inx": 5},
    {"class_id": 'bbaa12234551', "parent_id": 12234552, "content": "不知道这个人哇吃了橘子", "inx": 3},
    # add more documents...
]

Prepare the bulk actions

actions = [
    {"_index": your_index_name, "_source": doc} for doc in docs
]

Insert the documents with the bulk helper

bulk(es, actions)
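
Because the actions above carry no `_id`, Elasticsearch generates one per document, so rerunning the script inserts duplicates. One possible way around this is to derive a deterministic `_id`. The sketch below is an assumption, not part of the original article: the helper name `build_actions` is hypothetical, and it assumes that class_id plus inx identifies a document uniquely, which holds for the sample data but may not hold for yours.

```python
def build_actions(index_name, docs):
    """Yield one bulk action per document, with a deterministic _id."""
    for doc in docs:
        yield {
            "_index": index_name,
            # Assumption: class_id plus inx is unique per document.
            "_id": f"{doc['class_id']}-{doc['inx']}",
            "_source": doc,
        }


docs = [
    {"class_id": "aa12234551", "parent_id": 12234551, "content": "哇吃了橘子", "inx": 3},
    {"class_id": "bbaa12234551", "parent_id": 12234552, "content": "不知道这个人哇吃了橘子", "inx": 3},
]
actions = list(build_actions("my_es_index", docs))
print([a["_id"] for a in actions])
# With a live client this could be sent as: bulk(es, actions)
```

With fixed IDs, a second run overwrites the existing documents instead of adding new copies.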

4) Query documents

Build the query body

query = {
  "query": {
    "bool": {
      "must": [
          {
            "match": {
              "content":  "哇撒吃了橘子"
            }
          },
          {
            "term": {
              "parent_id":  12234551
            }
          }
        ]
    }
  }
}

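The bool query combines several conditions. Inside must, the "match" clause runs a full-text search on the content field, scored with BM25, while the "term" clause matches the parent_id field exactly. must behaves like AND in SQL; for OR semantics, where matching any one of the conditions is enough, replace must with should.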
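
The must/should switch described above can be sketched as a small query builder. This is an illustrative helper, not the author's code: the name `build_bool_query` and its `mode` parameter are hypothetical, while the field names are the ones from the mapping defined earlier.

```python
def build_bool_query(text, parent_id, mode="must"):
    """Build a bool query; mode is "must" (AND) or "should" (OR)."""
    if mode not in ("must", "should"):
        raise ValueError("mode must be 'must' or 'should'")
    clauses = [
        {"match": {"content": text}},        # full-text, BM25-scored
        {"term": {"parent_id": parent_id}},  # exact match on a keyword field
    ]
    bool_body = {mode: clauses}
    if mode == "should":
        # State explicitly that at least one clause has to match.
        bool_body["minimum_should_match"] = 1
    return {"query": {"bool": bool_body}}


and_query = build_bool_query("哇撒吃了橘子", 12234551)
or_query = build_bool_query("哇撒吃了橘子", 12234551, mode="should")
print(and_query)
print(or_query)
```

Either dictionary can be passed as the body of es.search in the same way as the hand-written query.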

Run the search

response = es.search(index=your_index_name, body=query)  # your_index_name is the index created in step 2

Collect the search results

Dis = []  # BM25 relevance score of each hit
Inx = []  # inx value stored with each hit
for hit in response['hits']['hits']:
    # print(f"Score: {hit['_score']}, content: {hit['_source']['content']}, inx: {hit['_source']['inx']}")
    Dis.append(hit['_score'])
    Inx.append(hit['_source']['inx'])
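
The same extraction can be wrapped in a function that keeps each score paired with its inx value. This is a minimal sketch: the name `collect_hits` is hypothetical, and the sample response is hand-written for illustration with made-up scores, not output from a real cluster.

```python
def collect_hits(response):
    """Return (score, inx) pairs from a search response, best score first."""
    pairs = [
        (hit["_score"], hit["_source"]["inx"])
        for hit in response["hits"]["hits"]
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hand-written stand-in for a search response; the scores are invented.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "a", "_score": 0.5, "_source": {"inx": 4}},
            {"_id": "b", "_score": 2.0, "_source": {"inx": 3}},
        ]
    }
}
pairs = collect_hits(sample_response)
print(pairs)
```

Keeping score and inx in one tuple avoids the two parallel lists drifting out of sync.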

5) Find the documents to delete

query = {
  "query": {
    "bool": {
      "must": [
          {
            "match": {
              "content":  "哇撒吃了橘子"
            }
          },
          {
            "term": {
              "parent_id":  12234551
            }
          }
        ]
    }
  }
}

Run the query

search_results = es.search(index=your_index_name, body=query)

Iterate over the hits and delete each document

for hit in search_results['hits']['hits']:
    document_id = hit['_id']
    es.delete(index=your_index_name, id=document_id)
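
A search returns only the first 10 hits by default, so the loop above can leave matching documents behind when there are more. One alternative is the client's delete_by_query call, which applies the query on the server and removes every match in a single request. The sketch below assumes a client passed in as `es`; the helper name `delete_matching` is hypothetical.

```python
def delete_matching(es, index_name, text, parent_id):
    """Delete all documents matching both conditions; return the count."""
    query = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"content": text}},
                    {"term": {"parent_id": parent_id}},
                ]
            }
        }
    }
    # delete_by_query runs the query server-side and deletes every match.
    response = es.delete_by_query(index=index_name, body=query)
    return response["deleted"]


# Possible usage with a live client:
# deleted = delete_matching(es, 'my_es_index', "哇撒吃了橘子", 12234551)
```

Since a match query also hits partially matching documents, it is worth running the same query with es.search first to confirm what would be removed.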

6) Delete the entire index

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ['http://127.21.0.0:9200'],
    http_auth=('elastic', '123456'),  # replace with your own username and password
)

index_name = 'my_es_index'

# ignore=[400, 404] suppresses the error raised when the index does not exist
es.indices.delete(index=index_name, ignore=[400, 404])

print(f"Index {index_name} and its data have been deleted.")