Handling a Word document upload failure in langchain-chatglm

爱在一瞬间

Last modified: 2023-08-24 10:44:44


First published: 2023-06-20 09:04:35

Article link: https://blog.csdn.net/aizaiyishunjian/article/details/131299523

Contents

- 1. Deploying langchain-chatglm locally
- 2. langchain's UnstructuredFileLoader and UnstructuredWordDocumentLoader

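LLMs are very popular right now, and Langchain is an especially interesting tool because it makes application development much simpler. So I decided to deploy langchain-chatglm and try out a large model backed by a knowledge base. The deployment took a long time, mostly for environment setup, but overall it went smoothly. One problem, however, cost me a great deal of time: an uploaded Word document could not be loaded. The log looked like this (the last two lines say that the file failed to load, and that no file was loaded so the dependencies should be checked or a different file uploaded):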

INFO  2023-06-17 11:36:30,140-1d:
ERROR 2023-06-17 11:36:30,141-1d: Error: source file could not be loaded
ERROR 2023-06-17 11:36:30,142-1d: Package not found at '/tmp/tmp33_8507e/08服务器软硬件环境配置说明书.docx'
INFO  2023-06-17 11:36:30,143-1d: /home/aiadmin/langchain-ChatGLM-master/content/DK12/08服务器软硬件环境配置说明书.docx 未能成功加载
INFO  2023-06-17 11:36:30,143-1d: 文件均未成功加载，请检查依赖包或替换为其他文件再次上传。

The eventual fix was simple. Run the following uninstall and install commands:

yum remove openoffice* libreoffice*
yum install libreoffice*

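If downloading through yum is too slow, you can also download and install LibreOffice manually (I followed another blog post for this). From the download page, fetch these three files: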

LibreOffice_6.0.3_Linux_x86-64_rpm.tar.gz
LibreOffice_6.0.3_Linux_x86-64_rpm_sdk.tar.gz
LibreOffice_6.0.3_Linux_x86-64_rpm_langpack_zh-CN.tar.gz

Extract the archives and install the packages with the following commands:

mkdir /usr/libreoffice
tar -zxvf LibreOffice_6.0.3_Linux_x86-64_rpm.tar.gz -C /usr/libreoffice/
tar -zxvf LibreOffice_6.0.3_Linux_x86-64_rpm_sdk.tar.gz -C /usr/libreoffice/
tar -zxvf LibreOffice_6.0.3_Linux_x86-64_rpm_langpack_zh-CN.tar.gz -C /usr/libreoffice/
cd /usr/libreoffice/LibreOffice_6.0.3_Linux_x86-64_rpm/RPMS
yum localinstall *.rpm

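After installation, create a symbolic link so that the soffice command resolves. The link target below comes from a LibreOffice 7.5 installation; point it at whichever versioned libreoffice binary your installation actually created: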

ln -s /usr/bin/libreoffice7.5 /usr/bin/soffice
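
A quick way to confirm that the symlink took effect is to check whether the soffice command resolves from the same Python environment that runs langchain-chatglm. The following is a minimal sketch using only the standard library; the helper names `resolve_soffice` and `soffice_version` are made up for illustration, and the sketch assumes the binary accepts a `--version` flag.

```python
import shutil
import subprocess
from typing import Optional, Sequence


def resolve_soffice(names: Sequence[str] = ("soffice",)) -> Optional[str]:
    """Return the full path of the first command in `names` found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def soffice_version(path: str) -> str:
    """Ask the binary for its version string (assumes it accepts --version)."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=60
    )
    return result.stdout.strip()


found = resolve_soffice()
if found is None:
    print("soffice not found on PATH; the symlink is missing or points nowhere")
else:
    print("soffice resolves to:", found)
```

If the command resolves, calling `soffice_version(found)` should print the installed LibreOffice version, which also confirms that the link target is executable.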

=============================================================================
What follows is a record of how I got there, in the hope that it helps with troubleshooting similar problems.

1. Deploying langchain-chatglm locally

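Deploying langchain-chatglm locally is really not complicated. Follow the installation guide to set up the environment, download chatglm-6b and text2vec-large-chinese from huggingface, update the corresponding settings in configs/model_config.py, and it runs.

Everything went smoothly, so I immediately uploaded a PDF and asked for a summary, which worked well. Then I uploaded a Word document, and it failed. In the installation guide I found this passage: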

Note: When using langchain.document_loaders.UnstructuredFileLoader for unstructured file integration, you may need to install other dependency packages according to the documentation. Please refer to langchain documentation

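In short, using langchain.document_loaders.UnstructuredFileLoader to process unstructured documents requires extra dependencies. The link given in that guide was dead, but I found a working one. Following the documentation there, I installed these packages in order: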

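# 1. libmagic
# Install the file-devel package, which contains the libmagic library and its development files.
# libmagic is a standalone C library for file type identification.
# python-magic and python-magic-bin are both Python wrappers around libmagic, used in Python via `import magic`.
yum install file-devel
# 2. Poppler is an open-source toolkit for working with PDF files; poppler-utils contains common Poppler tools such as pdftotext, pdfinfo and pdfimages.
yum install poppler-utils
# 3. tesseract is an open-source OCR (optical character recognition) engine that turns text in images into editable text.
yum install tesseract
# 4. libxml2 is an open-source library for parsing and manipulating XML files.
yum install libxml2
# 5. libxslt is an open-source library for processing XSLT (Extensible Stylesheet Language Transformations).
yum install libxslt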
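
After installing these packages it can be useful to confirm from Python what is actually visible. The sketch below is an assumption-laden check, not part of langchain or unstructured: it only looks for the command-line tools named above on PATH and checks whether a module called `magic` can be found. The function names are hypothetical.

```python
import importlib.util
import shutil

# Command-line tools shipped by poppler-utils and tesseract.
# libxml2 and libxslt are libraries rather than commands, so they are not listed.
CLI_TOOLS = ["pdftotext", "pdfinfo", "tesseract"]


def missing_cli_tools(tools=CLI_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_python_magic() -> bool:
    """True if a module named `magic` can be located.

    This only checks that the Python wrapper is installed; it does not
    prove that the underlying libmagic C library loads correctly.
    """
    return importlib.util.find_spec("magic") is not None


print("missing CLI tools:", missing_cli_tools() or "none")
print("python magic module present:", has_python_magic())
```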

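Because the machine sits on an isolated network, nltk cannot download its data directly, so the usual call below does not work and I downloaded the data manually.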

import nltk
nltk.download('punkt')

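See the nltk data documentation and the manual download page, search for the resource by keyword, download it, and place the data files where the error message tells you to. I actually ran into and fixed this error while debugging the Word document problem by hand, and I simply picked one of the search paths listed in the error message.
I put the files under my virtual environment's directory: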

# nltk
~/langchain-ChatGLM/nltk_data/taggers/
# punkt
~/langchain-ChatGLM/nltk_data/tokenizers/
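
To double-check a manual placement like this, a small script can verify the directory layout and add the directory to nltk's search path. This is a sketch under assumptions: the root directory mirrors the one above, the helper names are hypothetical, and the tagger resource name `averaged_perceptron_tagger` is my guess at which tagger is needed, since the text above only names the `taggers/` directory.

```python
from pathlib import Path

# Assumed root, mirroring the directory used above.
NLTK_ROOT = Path.home() / "langchain-ChatGLM" / "nltk_data"


def expected_resources(root: Path) -> dict:
    """Map nltk resource names to the directories expected under `root`."""
    return {
        "tokenizers/punkt": root / "tokenizers" / "punkt",
        # Assumption: the tagger resource name is not given in the article.
        "taggers/averaged_perceptron_tagger": root / "taggers" / "averaged_perceptron_tagger",
    }


def missing_resources(root: Path) -> list:
    """Return the resource names whose directory does not exist under `root`."""
    return [name for name, path in expected_resources(root).items() if not path.is_dir()]


def register_with_nltk(root: Path) -> None:
    """Add `root` to nltk's search path (requires nltk to be installed)."""
    import nltk

    if str(root) not in nltk.data.path:
        nltk.data.path.append(str(root))


print("missing under", NLTK_ROOT, ":", missing_resources(NLTK_ROOT) or "none")
```

Setting the `NLTK_DATA` environment variable to the same directory before starting the application is an alternative to calling `register_with_nltk`.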

Of course, after all of that installing, the problem was still not solved: Word documents still could not be processed.

2. langchain's UnstructuredFileLoader and UnstructuredWordDocumentLoader

So I turned to the code that loads Word documents, local_doc_qa.py. The call sequence is roughly as follows, and the error is raised in the load_and_split method:

from langchain.document_loaders import UnstructuredFileLoader, UnstructuredWordDocumentLoader
from textsplitter import ChineseTextSplitter
loader = UnstructuredWordDocumentLoader("/home/testReport2.doc", mode="elements")
textsplitter = ChineseTextSplitter(pdf=False, sentence_size=100)
docs = loader.load_and_split(text_splitter=textsplitter)

I kept debugging downwards and looked at how langchain implements UnstructuredWordDocumentLoader:

class UnstructuredWordDocumentLoader(UnstructuredFileLoader):
    """Loader that uses unstructured to load word documents."""

    def _get_elements(self) -> List:
        from unstructured.__version__ import __version__ as __unstructured_version__
        from unstructured.file_utils.filetype import FileType, detect_filetype

        unstructured_version = tuple(
            [int(x) for x in __unstructured_version__.split(".")]
        )
        # NOTE(MthwRobinson) - magic will raise an import error if the libmagic
        # system dependency isn't installed. If it's not installed, we'll just
        # check the file extension
        try:
            import magic  # noqa: F401

            is_doc = detect_filetype(self.file_path) == FileType.DOC
        except ImportError:
            _, extension = os.path.splitext(str(self.file_path))
            is_doc = extension == ".doc"

        if is_doc and unstructured_version < (0, 4, 11):
            raise ValueError(
                f"You are on unstructured version {__unstructured_version__}. "
                "Partitioning .doc files is only supported in unstructured>=0.4.11. "
                "Please upgrade the unstructured package and try again."
            )

        if is_doc:
            from unstructured.partition.doc import partition_doc

            return partition_doc(filename=self.file_path, **self.unstructured_kwargs)
        else:
            from unstructured.partition.docx import partition_docx

            return partition_docx(filename=self.file_path, **self.unstructured_kwargs)

Analysis showed that _get_elements is the key method. Checking it line by line revealed the following problems:

loader = UnstructuredWordDocumentLoader("/home/aiadmin/h2412730/testReport2.doc", mode="elements")

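- Once the loader is created, loading a .docx directly through partition_docx works, but loading a .doc directly through partition_doc fails.
- When detect_filetype runs, the .docx file is also detected as DOC, so the .docx file cannot be loaded properly.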
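
When a detection library and the file extension disagree like this, it can help to look at the file's leading bytes yourself. The sketch below is independent of unstructured and python-magic: it relies only on the fact that a .docx file is a ZIP container and a legacy .doc file is an OLE2 compound file, each with a well-known signature. The function name `sniff_container` is made up for illustration.

```python
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (.docx, .xlsx, ...)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (.doc, .xls, ...)


def sniff_container(path: str) -> str:
    """Classify a file by its first bytes as 'zip', 'ole', or 'unknown'.

    'zip' is consistent with .docx and 'ole' with legacy .doc, but neither
    signature proves the file is a Word document.
    """
    with open(path, "rb") as fh:
        header = fh.read(8)
    if header.startswith(ZIP_MAGIC):
        return "zip"
    if header.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"
```

Calling `sniff_container` on the uploaded file shows which container format it really uses, which tells you whether the detector or the file extension is the one that is wrong.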

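So I looked at the logic of detect_filetype and found that it uses two different detection paths depending on whether magic is installed. After I uninstalled python-magic, file type detection worked correctly.
The .docx file could now be processed. The .doc file, however, still failed to load, with this message: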

docx.opc.exceptions.PackageNotFoundError: Package not found at '/tmp/tmpmzlvwg1m/testReport.docx'

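Two things are odd here:

- I was testing partition_doc with a .doc file, yet the error says a .docx file with the same name cannot be found.
- A widely used library should not, in theory, have such a basic problem.

So I looked at the partition_doc method, and one look made everything clear: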

@process_metadata()
@add_metadata_with_filetype(FileType.DOC)
def partition_doc(
    filename: Optional[str] = None,
    file: Optional[IO] = None,
    include_page_breaks: bool = True,
    **kwargs,
) -> List[Element]:
    """Partitions Microsoft Word Documents in .doc format into its document elements.

    Parameters
    ----------
    filename
        A string defining the target filename path.
    file
        A file-like object using "rb" mode --> open(filename, "rb").
    """
    # Verify that only one of the arguments was provided
    if filename is None:
        filename = ""
    exactly_one(filename=filename, file=file)

    if len(filename) > 0:
        _, filename_no_path = os.path.split(os.path.abspath(filename))
        base_filename, _ = os.path.splitext(filename_no_path)
        if not os.path.exists(filename):
            raise ValueError(f"The file {filename} does not exist.")
    elif file is not None:
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.write(file.read())
        tmp.close()
        filename = tmp.name
        _, filename_no_path = os.path.split(os.path.abspath(tmp.name))

    base_filename, _ = os.path.splitext(filename_no_path)

    with tempfile.TemporaryDirectory() as tmpdir:
        convert_office_doc(filename, tmpdir, target_format="docx")
        docx_filename = os.path.join(tmpdir, f"{base_filename}.docx")
        elements = partition_docx(
            filename=docx_filename,
            metadata_filename=filename,
            include_page_breaks=include_page_breaks,
        )

    return elements

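This is the unstructured package's implementation of partition_doc. The logic is plain: convert the .doc file to .docx, then call partition_docx. That explains why it goes looking for a .docx file. Since the temporary .docx produced by the conversion cannot be found, the conversion function convert_office_doc must have failed. So let's look at its implementation: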

def convert_office_doc(input_filename: str, output_directory: str, target_format: str):
    """Converts a .doc file to a .docx file using the libreoffice CLI."""
    # NOTE(robinson) - In the future can also include win32com client as a fallback for windows
    # users who do not have LibreOffice installed
    # ref: https://stackoverflow.com/questions/38468442/
    #       multiple-doc-to-docx-file-conversion-using-python
    command = [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        output_directory,
        input_filename,
    ]
    try:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        output, error = process.communicate()
    except FileNotFoundError:
        raise FileNotFoundError(
            """soffice command was not found. Please install libreoffice
on your system and try again.

- Install instructions: https://www.libreoffice.org/get-help/install-howto/
- Mac: https://formulae.brew.sh/cask/libreoffice
- Debian: https://wiki.debian.org/LibreOffice""",
        )

    logger.info(output.decode().strip())
    if error:
        logger.error(error.decode().strip())

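The way this function is implemented and called hides the underlying error. The subprocess's error output is only written to the log; the caller, partition_doc, never checks it and simply carries on with the next call, which conceals where the error actually occurred.
At this point the problem is obvious: the soffice command that converts the document is what fails. With the real source identified, I searched for the error and found the following fix: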

# Error: no export filter for teste.docx found, aborting
yum remove openoffice* libreoffice*
yum install libreoffice*

For the manual installation method, see the beginning of this article.
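
The swallowed subprocess error described above can be avoided by checking the conversion result before moving on. The following is a sketch of that idea, not the unstructured library's code: the function name `convert_or_raise` and its `executable` parameter are hypothetical, while the command-line flags are the same ones convert_office_doc passes to soffice. It checks for the output file as well as the exit code, on the assumption that a failed conversion may still exit with status 0.

```python
import os
import subprocess


def convert_or_raise(input_filename, output_directory,
                     target_format="docx", executable="soffice"):
    """Run the soffice conversion and fail loudly if no output file appears."""
    command = [
        executable, "--headless",
        "--convert-to", target_format,
        "--outdir", output_directory,
        input_filename,
    ]
    result = subprocess.run(command, capture_output=True, text=True)

    # Name of the file the conversion is expected to produce.
    base, _ = os.path.splitext(os.path.basename(input_filename))
    expected = os.path.join(output_directory, f"{base}.{target_format}")

    if result.returncode != 0 or not os.path.exists(expected):
        raise RuntimeError(
            f"conversion of {input_filename} failed "
            f"(exit code {result.returncode}): "
            f"{result.stderr.strip() or result.stdout.strip()}"
        )
    return expected
```

With a check like this, a broken LibreOffice installation would surface as an error mentioning the soffice output (such as the "no export filter" message) instead of a misleading PackageNotFoundError for a .docx file that was never created.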

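Looking back over the whole process, the lesson is that you have to find the root cause before you can fix a problem. Root cause analysis is clear enough in theory, but carrying it out does not always go smoothly. In hindsight, there were a few issues along the way:

My goal was not clear enough, and I was not persistent enough. I kept circling around problems at one level instead of digging down layer by layer.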
