LangChain Document Loaders Explained: TextLoader Usage and Source Code Analysis

Article link: https://blog.csdn.net/yibuapi_com/article/details/148054864

TextLoader Usage and Source Code Analysis

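In LangChain, TextLoader is the most basic document loader. It loads plain-text files such as source code files, Markdown documents, and .txt files, that is, any data stored as text; binary formats such as DOC files do not qualify. TextLoader reads the entire file into a single Document object and automatically adds a source field to that object's metadata, recording where the document came from.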

TextLoader is simple to use; you only need to pass it the path to the text file:

Example code:

from langchain_community.document_loaders import TextLoader

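# Load a UTF-8 encoded text file; the path is relative to the working directory.
loader = TextLoader("./电商产品数据.txt", encoding="utf-8")
documents = loader.load()

print(documents)              # the list of loaded Document objects
print(len(documents))         # the whole file becomes a single Document
print(documents[0].metadata)  # holds the source path of the file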

Output:

[Document(page_content='xxx', metadata={'source': './电商产品数据.txt'})]
1
{'source': './电商产品数据.txt'}
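
The single-element list in the output follows from how TextLoader treats a file: the whole file becomes one Document, however many lines it contains. The sketch below mimics that behavior using only the standard library, so it runs without LangChain installed. Note that `SimpleDocument` and `load_text_file` are hypothetical names made up for this illustration; they are not part of the LangChain API.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(file_path, encoding=None):
    """Read the whole file into ONE document and record where it came from."""
    with open(file_path, encoding=encoding) as f:
        text = f.read()
    return [SimpleDocument(page_content=text, metadata={"source": str(file_path)})]


# Write a small multi-line sample file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("line one\nline two\nline three\n")

    docs = load_text_file(sample_path, encoding="utf-8")

print(len(docs))         # one document, regardless of the number of lines
print(docs[0].metadata)  # a dict with a single 'source' key holding the file path
```

If you need smaller pieces, for example one chunk per paragraph, that splitting happens in a later step; the loader itself only reads the file and attaches the source.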

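Under the hood, TextLoader uses Python's built-in open function with the given encoding to open the file and read its contents, then copies the path it was given into the metadata field of the generated Document instance, which keeps loading simple and fast. If decoding fails and autodetect_encoding is enabled, it retries with the encodings returned by detect_file_encodings; any other error is re-raised as a RuntimeError.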

Source code walkthrough:

# langchain_community/document_loaders/text.py->TextLoader::lazy_load
def lazy_load(self) -> Iterator[Document]:
    """Load from file path."""
    text = ""
    try:
        with open(self.file_path, encoding=self.encoding) as f:
            text = f.read()
    except UnicodeDecodeError as e:
        if self.autodetect_encoding:
            detected_encodings = detect_file_encodings(self.file_path)
            for encoding in detected_encodings:
                logger.debug(f"Trying encoding: {encoding.encoding}")
                try:
                    with open(self.file_path, encoding=encoding.encoding) as f:
                        text = f.read()
                    break
                except UnicodeDecodeError:
                    continue
        else:
            raise RuntimeError(f"Error loading {self.file_path}") from e
    except Exception as e:
        raise RuntimeError(f"Error loading {self.file_path}") from e

    metadata = {"source": str(self.file_path)}
    yield Document(page_content=text, metadata=metadata)
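
The fallback path in lazy_load can be easier to follow in isolation. The following is a simplified sketch, not the library code: it replaces detect_file_encodings with a fixed list of candidate encodings supplied by the caller, and `read_with_fallback` and its `fallback_encodings` parameter are hypothetical names used only for illustration. One deliberate difference: in the code quoted above, if every detected encoding fails, text stays an empty string and an empty Document is yielded, whereas this sketch raises instead.

```python
import os
import tempfile


def read_with_fallback(file_path, encoding="utf-8", fallback_encodings=()):
    """Try `encoding` first, then each fallback, following the shape of lazy_load."""
    try:
        with open(file_path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as e:
        for candidate in fallback_encodings:
            try:
                with open(file_path, encoding=candidate) as f:
                    return f.read()
            except UnicodeDecodeError:
                continue  # this candidate failed too, try the next one
        # Assumption of this sketch: fail loudly when nothing could decode the file.
        raise RuntimeError(f"Error loading {file_path}") from e
    except Exception as e:
        # Any other problem (missing file, permissions, ...) is wrapped the same way.
        raise RuntimeError(f"Error loading {file_path}") from e


with tempfile.TemporaryDirectory() as tmp_dir:
    gbk_path = os.path.join(tmp_dir, "gbk_sample.txt")
    with open(gbk_path, "w", encoding="gbk") as f:
        f.write("电商产品数据")

    # UTF-8 cannot decode these GBK bytes, so the fallback list is consulted.
    recovered = read_with_fallback(gbk_path, fallback_encodings=["gbk"])

    try:
        read_with_fallback(gbk_path)  # no fallbacks: the decode error is wrapped
    except RuntimeError as err:
        wrapped_cause = type(err.__cause__).__name__

print(recovered == "电商产品数据")
print(wrapped_cause)
```

Because the error is raised with `from e`, the original exception remains available as `__cause__`, which is how the sketch can tell a decoding failure apart from, say, a missing file.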