Text Analysis with Milvus: A Walkthrough

前端练习生-小爵

Last modified 2024-07-17 16:29:37

Tags: milvus, artificial intelligence, machine learning, python, natural language processing

First published 2024-07-17 16:28:55

Article link: https://blog.csdn.net/2401_85811247/article/details/140498464

To analyze text efficiently and store it in a Milvus database, we first convert the text into vectors with the `SentenceTransformerEmbeddingFunction` provided by PyMilvus.

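1. First, read a text column from a CSV file.

2. Use the pretrained `all-MiniLM-L6-v2` model to convert the text into fixed-dimension vectors (384 dimensions for this model).

3. Print the generated vectors and save them to a new CSV file for later machine learning or other analysis tasks.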

The code example is as follows:

from pymilvus import model
import pandas as pd

# Read the CSV file and select the text column to embed.
data = pd.read_csv('your_file_path', encoding='utf-8')
column_data = data["column_name"]
print(column_data)

# Load the pretrained sentence-transformer model.
sentence_transformer_ef = model.dense.SentenceTransformerEmbeddingFunction(
    model_name='all-MiniLM-L6-v2',  # Specify the model name
    device='cpu'  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

# Pass the texts as a plain list of strings.
# Note: astype(str) turns missing values into the text "nan".
docs = column_data.astype(str).tolist()

docs_embeddings = sentence_transformer_ef.encode_documents(docs)

print("Embeddings:", docs_embeddings)
print("Dim:", sentence_transformer_ef.dim, docs_embeddings[0].shape)

# Save the embeddings, one CSV column per vector dimension.
pd.DataFrame(docs_embeddings).to_csv('your_output_path', encoding='utf-8', na_rep='', header=True, index=True)
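
The save step above writes one CSV column per embedding dimension, whereas the loading script further on reads a single `title_vector` column whose cells each hold a whole vector. One possible way to bridge the two is to serialize each vector as a JSON list in one cell and parse it back with `ast.literal_eval`. The sketch below uses only the standard library and made-up three-dimensional vectors; the `id` column and the sample values are assumptions, not part of the original data.

```python
import ast
import csv
import io
import json

# Stand-in for docs_embeddings: two tiny vectors instead of 384-dimensional ones.
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Write one row per document, with the whole vector in a single cell.
buffer = io.StringIO()  # in-memory file; use open(path, "w", newline="") for a real file
writer = csv.DictWriter(buffer, fieldnames=["id", "title_vector"])
writer.writeheader()
for row_id, vector in enumerate(embeddings):
    writer.writerow({"id": row_id, "title_vector": json.dumps(vector)})

# Read the rows back and turn each cell into a list of floats again.
buffer.seek(0)
restored = [ast.literal_eval(row["title_vector"]) for row in csv.DictReader(buffer)]
print(restored)
```

With pandas, cells in this format can be parsed by passing `converters={'title_vector': ast.literal_eval}` to `read_csv`. If the embeddings are NumPy arrays, call `.tolist()` on each one before `json.dumps`, since JSON cannot serialize arrays directly.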

Once the text has been converted to vectors, we store the data, using the PyMilvus library to interact with the Milvus database:

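1. Import the PyMilvus modules needed to create the collection and its index.

2. Read the CSV file with Pandas and convert the data.

3. Connect to the Milvus server; define the collection's field schema, including the ID, the vector field, and the other text fields; check whether the collection already exists, drop it if so, and then create it.

4. Create an index on the vector field, specifying the metric type, index type, and parameters, for example L2 distance with an IVF_FLAT index, and set index parameters such as the value of nlist.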

With this in place, vector data can be stored and searched efficiently in Milvus.
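
To make the index parameters concrete, here is a toy, pure-Python picture of what an inverted-file (IVF) index does. Vectors are grouped into `nlist` buckets around centroids, and a query is compared only with the vectors in the `nprobe` nearest buckets, using L2 (Euclidean) distance. This is a conceptual sketch with hand-picked centroids and hypothetical helper names, not Milvus's actual implementation, which among other things learns the centroids by clustering.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_buckets(vectors, centroids):
    """Assign each vector's position to its nearest centroid (one bucket per centroid)."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2_distance(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe=1, top_k=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2_distance(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in buckets[c]]
    ranked = sorted(candidates, key=lambda idx: l2_distance(query, vectors[idx]))
    return ranked[:top_k]

vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]  # nlist = 2, chosen by hand for the demo
buckets = build_buckets(vectors, centroids)
print(buckets)
print(ivf_search([4.9, 5.0], vectors, centroids, buckets, nprobe=1, top_k=2))
```

In this picture, a larger `nlist` means smaller buckets and fewer comparisons per probed bucket, while a larger `nprobe` scans more buckets and trades speed for a better chance of finding the true nearest neighbors.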

import ast

import pandas as pd
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility

# Parse each title_vector cell (a list written as text) with ast.literal_eval,
# which only accepts Python literals and is safer than eval.
df = pd.read_csv("your_file_path", converters={'title_vector': ast.literal_eval})
print(df.head())

# Connect to the Milvus server.
connections.connect(host='127.0.0.1', port='19530')

def create_milvus_collection(collection_name, dim):
    # Drop any existing collection with the same name (this deletes its data).
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    # Field schema: primary key, vector field, and the text fields.
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="title_vector", dtype=DataType.FLOAT_VECTOR, dim=dim),
        FieldSchema(name="Nationalpolicy", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="Municipalpolicy", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="Districtpolicy", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="InheritanceRelation", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="Connection", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="UpDown", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="RelatedContent", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="Alteration", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="PIRequire", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="ImplementContent", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="CorrespondingGroup", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="CorrespondingDomain", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="CorrespondingDirection", dtype=DataType.VARCHAR, max_length=63000),
        FieldSchema(name="CorrespondingEC", dtype=DataType.VARCHAR, max_length=63000),
    ]

    schema = CollectionSchema(fields=fields, description='search text')
    collection = Collection(name=collection_name, schema=schema)

    # Index the vector field: L2 distance, IVF_FLAT index with nlist clusters.
    index_params = {
        'metric_type': "L2",
        'index_type': "IVF_FLAT",
        'params': {"nlist": 2096}
    }
    collection.create_index(field_name='title_vector', index_params=index_params)
    return collection

# 384 matches the output dimension of all-MiniLM-L6-v2.
collection = create_milvus_collection('policy', 384)

Once the collection and index have been created, the vector data file can be uploaded to Milvus, matching each column to its field.
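
The upload itself is not shown here. A minimal sketch follows, assuming the rows are available as dictionaries keyed by field name (for example from `df.to_dict('records')`) and that `collection` is the object returned by `create_milvus_collection`. PyMilvus's `Collection.insert` accepts column-based data, one list per field in schema order, so the helpers below regroup rows into columns and insert them in batches. The helper names, the batch size, and the `nprobe` value are illustrative assumptions, and the sample rows use shortened vectors and only one text field.

```python
def rows_to_columns(rows, field_names):
    """Regroup row dicts into one list per field, in schema order."""
    return [[row[name] for row in rows] for name in field_names]

def upload(collection, rows, field_names, batch_size=500):
    """Insert rows in batches, then flush so the data is persisted."""
    inserted = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        collection.insert(rows_to_columns(batch, field_names))
        inserted += len(batch)
    collection.flush()
    return inserted

def search_similar(collection, query_vector, top_k=5, output_fields=None):
    """Load the collection into memory and run an L2 search on the vector field."""
    collection.load()
    return collection.search(
        data=[query_vector],
        anns_field="title_vector",
        param={"metric_type": "L2", "params": {"nprobe": 16}},  # nprobe is an assumed value
        limit=top_k,
        output_fields=output_fields,
    )

# Shortened sample rows to show the column layout that insert() receives.
rows = [
    {"id": 1, "title_vector": [0.1, 0.2], "Nationalpolicy": "text a"},
    {"id": 2, "title_vector": [0.3, 0.4], "Nationalpolicy": "text b"},
]
columns = rows_to_columns(rows, ["id", "title_vector", "Nationalpolicy"])
print(columns)
```

In real use, `field_names` would list every field of the schema in the order it was defined, each `title_vector` would need the collection's dimension (384 here), and `id` values must be supplied because the primary key is not auto-generated.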
