Using _all, _source, store, and index

This article covers the roles and configuration of the _all, _source, and store fields in Elasticsearch, including how to disable them and how to include or exclude individual fields.


1. _all

1.1 The _all field

The _all field is rarely used directly: it concatenates the values of all other fields into one big space-separated string, which is analyzed and indexed but not stored. It is useful when you do not yet know or care about the structure of your documents. For example, index a document:

curl -XPUT 'http://127.0.0.1:9200/myindex/order/0508' -d '{
    "name": "Scott",
    "age": "24"
}'

Search against the _all field:

curl -XGET "http://127.0.0.1:9200/myindex/order/_search?pretty" -d '{
    "query": {
        "match": {
            "_all": "Scott 24"
        }
    }
}'

You can also use query_string, which searches _all by default:

curl -XGET "http://127.0.0.1:9200/myindex/order/_search?pretty" -d '{
    "query": {
        "query_string": {
            "query": "Scott 24"
        }
    }
}'

Output:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2712221,
    "hits" : [ {
      "_index" : "myindex",
      "_type" : "order",
      "_id" : "0508",
      "_score" : 0.2712221
    } ]
  }
}

Note that _all is just one big analyzed string, so a date value ends up split into separate year, month, and day tokens. For example, given this document:

{
  "first_name":    "John",
  "last_name":     "Smith",
  "date_of_birth": "1970-10-24"
}

curl -XGET "http://127.0.0.1:9200/myindex/order/_search?pretty" -d '{
    "query": {
        "match": {
            "_all": "john smith 1970"
        }
    }
}'

The _all field will contain the terms ["john", "smith", "1970", "10", "24"], so the query above matches.

So the _all field is simply an analyzed string field. It analyzes its values with the default analyzer, regardless of which analyzer is configured on the fields the values originally came from. And, like any other string field, you can configure which analyzer _all should use:

PUT /myindex/order/_mapping
{
    "order": {
        "_all": { "analyzer": "whitespace" }
    }
}
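
For instance (a sketch, assuming the analyzer is changed before any documents are indexed), with the whitespace analyzer the birth date from the earlier example stays in _all as the single token 1970-10-24, so searching for just the year no longer matches:

curl -XGET "http://127.0.0.1:9200/myindex/order/_search?pretty" -d '{
    "query": {
        "match": {
            "_all": "1970"
        }
    }
}'

With the default standard analyzer this query did match, because the date was split into the separate tokens 1970, 10, and 24.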

1.2 Disabling the _all field

The _all field costs extra CPU cycles at index time and extra disk space. If you do not need it, it is best to disable it:

curl -XPUT 'http://127.0.0.1:9200/myindex/order/_mapping' -d '{
    "order": {
        "_all": {
            "enabled": true
        },
        "properties": {
            .......
        }
    }
}'

1.3 Excluding fields from _all

Rather than disabling _all entirely, you may want it to contain only certain fields. The include_in_all option controls, per field, whether a field's value is added to _all; the default is true. Setting include_in_all on an object changes that default for all fields beneath it. For example, to have _all include only name:

PUT /myindex/order/_mapping
{
    "order": {
        "include_in_all": false,
        "properties": {
            "name": {
                "type": "string",
                "include_in_all": true
            },
            ...
        }
    }
}
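
As a quick check (a sketch, assuming the mapping above is in place before the document from section 1.1 is indexed), a match on _all for the name value still finds the document, while the age value no longer does, because age is excluded from _all:

curl -XGET "http://127.0.0.1:9200/myindex/order/_search?pretty" -d '{
    "query": {
        "match": {
            "_all": "Scott"
        }
    }
}'

curl -XGET "http://127.0.0.1:9200/myindex/order/_search?pretty" -d '{
    "query": {
        "match": {
            "_all": "24"
        }
    }
}'

The first query returns the document; the second returns no hits.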

2. _source

2.1 Disabling the _source field

Elasticsearch keeps the JSON string representing the document body in the _source field. Like any other stored field, _source is compressed before being written to disk. _source is not indexed, so you cannot search on it, but it is stored, so it still takes up disk space. If that is a problem, you can disable it:

curl -XPUT 'http://127.0.0.1:9200/myindex/order/_mapping' -d '{
    "order": {
        "_source": {
            "enabled": false
        },
        "properties": {
			......
        }
    }
}'

However, with _source disabled, the following features are no longer supported:

  1. Update requests no longer work.
  2. On-the-fly highlighting.
  3. Reindexing from one Elasticsearch index to another, whether to change the mapping or analysis, or to upgrade an index to a new major version.
  4. Debugging queries or aggregations by viewing the document body that was used at index time.
  5. Potentially, in the future, the ability to repair index corruption automatically.

If disk space is the concern, you can increase the compression level instead of disabling _source.
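
A minimal sketch of that alternative, assuming Elasticsearch 2.x or later where the index.codec setting exists (it must be set when the index is created, or while the index is closed); older releases exposed compression options on the _source mapping instead:

PUT /myindex
{
  "settings": {
    "index.codec": "best_compression"
  }
}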

2.2 Including / Excluding fields from _source

You can also prune the contents of _source: fields are removed after the document has been indexed but before the _source field is stored. Removing fields from _source has downsides similar to disabling it altogether, in particular that you can no longer reindex from one Elasticsearch index to another; consider using source filtering at search time instead (shown at the end of this section). The following example comes from the official documentation:

PUT logs
{
  "mappings": {
    "event": {
      "_source": {
        "includes": [
          "*.count",
          "meta.*"
        ],
        "excludes": [
          "meta.description",
          "meta.other.*"
        ]
      }
    }
  }
}

PUT logs/event/1
{
  "requests": {
    "count": 10,
    "foo": "bar" 
  },
  "meta": {
    "name": "Some metric",
    "description": "Some metric description", 
    "other": {
      "foo": "one", 
      "baz": "two" 
    }
  }
}

GET logs/event/_search
{
  "query": {
    "match": {
      "meta.other.foo": "one" 
    }
  }
}

Note that in the example above the query on meta.other.foo still matches, even though that field is excluded from the stored _source: includes/excludes only affect what is stored, not what is indexed. And of course, even with _source fully enabled ({"_source": {"enabled": true}}), you can ask for just the fields you need at search time by filtering _source:

GET /_search
{
    "query":   { "match_all": {}},
    "_source": [ "title", "created" ]
}

3. store

store is a per-field attribute in the mapping, for example:

curl -XPUT 'http://127.0.0.1:9200/myindex/order/_mapping' -d '{
    "order": {
        ......
        "properties": {
            "name": {
                "type": "string",
                "store": "no",
                ......
            },
            ......
        }
    }
}'

Fields marked as stored are kept in a data structure separate from the inverted index so that their original values can be retrieved quickly. Storing a field costs disk space but saves work at retrieval time. store accepts yes/no (or true/false) and defaults to no/false.

Stored fields can be retrieved like this (for multiple fields, use fields=f1,f2,f3...):

curl -XGET 'http://hadoop:9200/myindex/order/0508?fields=age&pretty=true'
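
Stored fields can also be requested through the search API. A sketch, assuming the same pre-5.x Elasticsearch used throughout this article, where the search body accepts a top-level fields parameter (renamed stored_fields in 5.x):

curl -XGET 'http://127.0.0.1:9200/myindex/order/_search?pretty' -d '{
    "query":  { "match_all": {} },
    "fields": [ "name", "age" ]
}'

Each hit then carries a fields section with the requested values.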

4. index

Like store, index is a per-field attribute. It controls how each field is indexed, and the default is analyzed. index takes one of three values (a short sketch contrasting analyzed and not_analyzed follows the mapping example below):

  1. no: the field is not indexed at all. Use this for fields that never need to be searched.
  2. analyzed: the field is passed through the configured analyzer; by default Elasticsearch uses the StandardAnalyzer, which lowercases and tokenizes the value.
  3. not_analyzed: the field is indexed, but its value is not analyzed. Elasticsearch uses the KeywordAnalyzer, which treats the entire value as a single token.

curl -XPUT 'http://127.0.0.1:9200/myindex/order/_mapping' -d '{
    "order": {
        ......
        "properties": {
            "name": {
                "type": "string",
                "index": "no",
                ......
            },
            ......
        }
    }
}'
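
To make the difference between analyzed and not_analyzed concrete, here is a sketch using a hypothetical myindex2 index with a status field mapped as not_analyzed (both names are illustrative, not from the article). Because the whole value is kept as a single token, an exact-value term query matches; on an analyzed field the same value would have been lowercased and split into separate tokens, and this term query would find nothing:

curl -XPUT 'http://127.0.0.1:9200/myindex2' -d '{
    "mappings": {
        "order": {
            "properties": {
                "status": { "type": "string", "index": "not_analyzed" }
            }
        }
    }
}'

curl -XPUT 'http://127.0.0.1:9200/myindex2/order/1' -d '{ "status": "In Progress" }'

curl -XGET 'http://127.0.0.1:9200/myindex2/order/_search?pretty' -d '{
    "query": { "term": { "status": "In Progress" } }
}'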




