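Background
In the current era of large models, many companies have started building their own product lines on top of large language models (LLMs), innovating on their existing business.
Popular open-source LLMs in China currently include ChatGLM2 and Llama 2. Frameworks such as LangChain or LlamaIndex, which combine an LLM with a vector store and related technologies, make it easy to build large-model systems for a wide range of domains.
The richness and accuracy of the knowledge data are key to whether a large-model product succeeds.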
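Data processing for large models
The processing flow for knowledge data can be described by the following figure:
The blue part on the left is the data sources. In general, our source data includes both structured and unstructured data: structured data is stored in SQL databases or ES, and unstructured data is stored in the file system.
We can use a metadata database to record the basic structure of the source data.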
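As a rough illustration of the metadata database idea, the sketch below records each data source's name, kind, location and field structure in a small catalog table. The `DataSource` class, the `data_source` table and the helper functions are hypothetical names chosen for this example, and SQLite is used only so that the sketch is self-contained; the original architecture does not specify which database holds the metadata.

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One entry in a hypothetical metadata catalog."""
    name: str
    kind: str       # "structured" or "unstructured"
    location: str   # e.g. a table name, an index name, or a file path
    fields: dict = field(default_factory=dict)  # field name -> type name


def create_catalog(conn: sqlite3.Connection) -> None:
    """Create the catalog table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_source ("
        "name TEXT PRIMARY KEY, kind TEXT NOT NULL, "
        "location TEXT NOT NULL, fields TEXT NOT NULL)"
    )


def register(conn: sqlite3.Connection, source: DataSource) -> None:
    """Insert or update the description of one data source."""
    conn.execute(
        "INSERT OR REPLACE INTO data_source VALUES (?, ?, ?, ?)",
        (source.name, source.kind, source.location, json.dumps(source.fields)),
    )


def lookup(conn: sqlite3.Connection, name: str):
    """Return the DataSource registered under `name`, or None."""
    row = conn.execute(
        "SELECT name, kind, location, fields FROM data_source WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        return None
    return DataSource(row[0], row[1], row[2], json.loads(row[3]))
```

A pipeline step could then call `lookup(conn, "orders")` to decide whether a source should be read through SQL, through ES, or from the file system.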
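Engineering the model data processing pipeline
Turning the model data processing flow into an engineered pipeline involves several considerations: high availability of the overall architecture, reusability of the data, and support for both real-time processing and offline (repeatable) processing.
On the high-availability side, making unstructured data highly available requires a distributed file system.
Popular open-source distributed storage solutions include HDFS and MinIO. In our tests MinIO clearly outperformed HDFS, so we chose MinIO as the high-availability solution for storing unstructured data.
With that choice, an engineered, highly available data processing architecture for large models can be designed as follows:
The data stored in MinIO mainly comes from three places: 1. training data; 2. intermediate files produced during large-model processing; 3. data uploaded by users.
For knowledge data and vector data, we chose a recent version of ES. Starting with version 8.8, Elasticsearch adds RRF (reciprocal rank fusion), which supports hybrid search combining vector and text retrieval and can greatly simplify the storage architecture for large-model knowledge data. (Reference: Elasticsearch:倒数排序融合 - Reciprocal rank fusion - 掘金)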
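To make the idea behind RRF concrete, here is a minimal pure-Python sketch of reciprocal rank fusion: each document's fused score is the sum of 1 / (rank_constant + rank) over every ranked list it appears in. This is not the Elasticsearch implementation or its API; the function name `rrf_fuse` is hypothetical, and the default `rank_constant=60` is an assumption picked as a commonly used value.

```python
from collections import defaultdict


def rrf_fuse(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    `rankings` is a list of lists, each ordered from best to worst match.
    Returns the document IDs ordered by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes the most.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a text query and a vector query.
text_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = rrf_fuse([text_hits, vector_hits])
```

Documents that rank well in both lists rise to the top, which is why a single index that can run both queries and fuse them removes the need to merge results from separate text and vector stores in application code.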
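Integrating MinIO with the LangChain framework
Implementing the data architecture above raises the question of how to integrate MinIO with LangChain (at the time of writing, LangChain does not support loading data from MinIO).
LangChain currently supports loading data from S3 files and S3 directories. My approach is to adapt those two loader classes so that they work with MinIO.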
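The reason this adaptation is small is that the loader only needs to download an object to a temporary file and then hand that file to a parser, so swapping the storage client is the only real change. The sketch below shows that pattern in isolation, with the download and parse steps passed in as functions. `load_via_tempfile`, `fake_fetch` and `fake_parse` are hypothetical names for this illustration; in the real loaders the fetch step would be the storage client's download call and the parse step would be the document partitioner.

```python
import os
import tempfile


def load_via_tempfile(key, fetch, parse):
    """Download object `key` to a temporary file, parse it, then clean up.

    fetch(key, local_path) must write the object's bytes to local_path.
    parse(local_path) must return the parsed content.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        # Flatten the key so nested prefixes do not need local directories.
        local_path = os.path.join(temp_dir, key.replace("/", "_"))
        fetch(key, local_path)
        # Parse before leaving the block: the directory is deleted on exit.
        return parse(local_path)


# Stand-ins for an object store and a parser, for demonstration only.
store = {"docs/a.txt": b"hello"}


def fake_fetch(key, local_path):
    with open(local_path, "wb") as f:
        f.write(store[key])


def fake_parse(local_path):
    with open(local_path, "rb") as f:
        return f.read().decode("utf-8")


content = load_via_tempfile("docs/a.txt", fake_fetch, fake_parse)
```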
Minio Loader source code: minio_file.py
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import os
import tempfile
from typing import List

from langchain.document_loaders.unstructured import UnstructuredBaseLoader


class MinioFileLoader(UnstructuredBaseLoader):
    """Load a single object from MinIO."""

    def __init__(self, url: str, bucket: str, key: str, access_key: str, secret_key: str, secure: bool):
        """Initialize with the MinIO endpoint, credentials, bucket and object key.

        Args:
            url: The MinIO endpoint in "host:port" form.
            bucket: The name of the MinIO bucket.
            key: The key of the MinIO object.
            access_key: The MinIO access key.
            secret_key: The MinIO secret key.
            secure: Whether to connect over HTTPS.
        """
        super().__init__()
        self.url = url
        self.access_key = access_key
        self.secret_key = secret_key
        self.secure = secure
        self.bucket = bucket
        self.key = key

    def _get_elements(self) -> List:
        """Download the object to a temporary file and partition it."""
        from unstructured.partition.auto import partition

        try:
            from minio import Minio
        except ImportError:
            raise ImportError(
                "Could not import `minio` python package. "
                "Please install it with `pip install minio`."
            )
        client = Minio(self.url, access_key=self.access_key, secret_key=self.secret_key, secure=self.secure)
        with tempfile.TemporaryDirectory() as temp_dir:
            # Flatten the object key into one file name; os.path.join picks
            # the correct path separator on both Windows and POSIX systems.
            file_path = os.path.join(temp_dir, self.key.replace("/", "_"))
            client.fget_object(self.bucket, self.key, file_path)
            # Partition inside the block: the directory is removed on exit.
            return partition(filename=file_path)

    def _get_metadata(self) -> dict:
        return {"source": f"minio://{self.bucket}/{self.key}"}


if __name__ == "__main__":
    loader = MinioFileLoader(url="127.0.0.1:9900", bucket="llm", key="data/dataset/pdf_download/pdf_data/c73086fcb0c21c164006af61df75b59c.pdf",
                             access_key="minio_access_key", secret_key="XXXXX", secure=False)
    # Load the object and report how many documents were produced.
    docs = loader.load()
    print(len(docs))
Minio Loader source code: minio_directory.py
from __future__ import annotations

from typing import List, Optional, Union

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

from minio_file import MinioFileLoader


class MinioDirectoryLoader(BaseLoader):
    """Load every object under a prefix ("directory") from MinIO."""

    def __init__(
        self,
        bucket: str,
        prefix: str = "",
        endpoint_url: str = "127.0.0.1:9000",
        *,
        region_name: Optional[str] = None,
        api_version: Optional[str] = None,
        use_ssl: bool = False,
        verify: Union[str, bool, None] = None,
        secret_access_key: Optional[str] = None,
        session_token: Optional[str] = None,
    ):
        """Initialize with the bucket, key prefix and MinIO connection settings.

        The signature was adapted from LangChain's S3 directory loader, so
        some parameter names still follow S3 conventions.

        :param bucket: The name of the MinIO bucket.
        :param prefix: The prefix of the object keys to load. Defaults to "".
        :param endpoint_url: The MinIO endpoint in "host:port" form.
        :param region_name: Kept from the S3 loader; not used by this loader.
        :param api_version: Kept from the S3 loader; not used by this loader.
        :param use_ssl: Whether to connect over HTTPS. Defaults to False.
        :param verify: Kept from the S3 loader; not used by this loader.
        :param secret_access_key: Despite its name, this value is passed to
            MinIO as the access key.
        :param session_token: Despite its name, this value is passed to
            MinIO as the secret key.
        """
        self.bucket = bucket
        self.prefix = prefix
        self.region_name = region_name
        self.api_version = api_version
        self.use_ssl = use_ssl
        self.verify = verify
        self.endpoint_url = endpoint_url
        self.secret_access_key = secret_access_key
        self.session_token = session_token

    def load(self) -> List[Document]:
        """Load documents from every object under the prefix."""
        try:
            from minio import Minio
        except ImportError:
            raise ImportError(
                "Could not import `minio` python package. "
                "Please install it with `pip install minio`."
            )
        docs = []
        # Create the client with the access key and secret key.
        client = Minio(self.endpoint_url, access_key=self.secret_access_key, secret_key=self.session_token, secure=self.use_ssl)
        # List all objects whose names start with the prefix, recursively.
        objects = client.list_objects(bucket_name=self.bucket, prefix=self.prefix, recursive=True)
        for obj in objects:
            print(obj.object_name)
            loader = MinioFileLoader(url=self.endpoint_url, bucket=self.bucket, key=obj.object_name,
                                     access_key=self.secret_access_key, secret_key=self.session_token, secure=self.use_ssl)
            docs.extend(loader.load())
        return docs


if __name__ == "__main__":
    minio_loader = MinioDirectoryLoader(endpoint_url="127.0.0.1:9900", bucket="llm", prefix="data/dataset/pdf_download/pdf_data/",
                                        secret_access_key="minio_access_key", session_token="XXXXXXXXXXXXX", use_ssl=False)
    docs = minio_loader.load()
    print(len(docs))
    print(docs[0])