textract raises "local variable 'pipe' referenced before assignment" when reading a PDF file

Latest recommended article published on 2023-07-05 18:50:53

Category: NLP. Tags: textstract, NLP, artificial intelligence, reading PDF files

Article link: https://blog.csdn.net/fangxiananvhai/article/details/95168362

textract error when reading a PDF file

- Solving this problem when deploying to a Linux server
- Python 3: UnicodeEncodeError 'ascii' codec can't encode characters in position 0-1

Code executed:

text = textract.process(file_path, method='pdfminer', encoding='utf-8')

Error:

  File "D:\anaconda3\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "D:\anaconda3\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "D:\anaconda3\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "D:\anaconda3\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "D:\anaconda3\lib\site-packages\textract\parsers\utils.py", line 96, in run
    stdout, stderr = pipe.communicate()
UnboundLocalError: local variable 'pipe' referenced before assignment

Root cause analysis:
The problem is in this code in Lib/site-packages/textract/parsers/pdf_parser.py (line 48):

    def extract_pdfminer(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer."""
        stdout, _ = self.run(['pdf2txt.py', filename])
        return stdout

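The cause is that the command does not say which program should run the file: 'pdf2txt.py' is passed as the executable itself, with no interpreter in front of it. On the Windows setup shown in the traceback the script apparently cannot be launched this way, so the pipe variable in utils.py is never assigned, and the call to pipe.communicate() raises UnboundLocalError.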
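To make the failure mode concrete, here is a minimal sketch of how a swallowed launch error can end in this exact exception. It is a simplified imitation inferred from the traceback, not textract's actual source; `run_fragile`, `fake_popen`, and `demo` are hypothetical names, and the fake launcher stands in for a platform that refuses to execute a `.py` file directly.

```python
import errno
import subprocess


def run_fragile(args, popen=subprocess.Popen):
    """Simplified imitation of the failing pattern (assumed, not textract's code)."""
    try:
        pipe = popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        # Only "file not found" is reported; every other OSError is swallowed.
        if e.errno == errno.ENOENT:
            raise RuntimeError("command not found: %s" % args[0])
    # If the launch failed with a different errno, 'pipe' was never assigned,
    # so this line raises UnboundLocalError instead of the real error.
    stdout, _ = pipe.communicate()
    return stdout


def fake_popen(args, **kwargs):
    """Hypothetical launcher that fails the way a non-executable script might."""
    raise OSError(errno.ENOEXEC, "Exec format error")


def demo():
    """Return the name of the exception that run_fragile ends up raising."""
    try:
        run_fragile(["pdf2txt.py", "sample.pdf"], popen=fake_popen)
    except Exception as exc:
        return type(exc).__name__
    return "no error"


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the UnboundLocalError is only a symptom: the real failure is the launch of 'pdf2txt.py', which is what the fix has to address.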

Solution:
Change the code to the following:

    def extract_pdfminer(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer."""
        stdout, _ = self.run(['python', 'pdf2txt.py', filename])
        return stdout

Problem solved.
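The fix works because the interpreter, not the script, becomes the program being launched. The same idea can be sketched outside of textract as follows. Note the assumptions: `run_script` is a hypothetical helper, and it swaps the literal 'python' for `sys.executable` so that the interpreter currently running is used even when `python` is not on PATH; that substitution is not part of the original fix.

```python
import subprocess
import sys


def run_script(script_path, *script_args):
    """Run a Python script through the current interpreter and return its stdout."""
    # The interpreter is the executable; the script is only its first argument.
    cmd = [sys.executable, script_path, *script_args]
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical usage: both paths are placeholders.
    print(run_script("pdf2txt.py", "sample.pdf"))
```

Because the script path is passed as given, a bare 'pdf2txt.py' is resolved relative to the current working directory.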

If the extracted text looks garbled, add text = text.decode('utf-8') after the line text = textract.process(file_path, method='pdfminer', encoding='utf-8') to decode the returned bytes.
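A small sketch of that decoding step, written so it can be tried without a PDF at hand. `to_text` is a hypothetical helper, and the choice of `errors="replace"` is an assumption made here so that a stray invalid byte does not abort the whole extraction.

```python
def to_text(raw, encoding="utf-8"):
    """Decode extractor output to str; pass through values that are already str."""
    if isinstance(raw, bytes):
        # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
        return raw.decode(encoding, errors="replace")
    return raw


if __name__ == "__main__":
    sample = "中文 PDF 内容".encode("utf-8")  # stands in for the bytes textract returns
    print(to_text(sample))
```

In the article's code this would be used as text = to_text(textract.process(file_path, method='pdfminer', encoding='utf-8')).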

Additional notes:
Installing textract does not automatically install pdfminer; you need to install it manually:

pip install pdfminer.six

Before running, also remember to copy the pdf2txt.py file into the project directory; otherwise other errors will occur.
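Locating and copying the script can be sketched like this. The assumption is that pip placed pdf2txt.py either on PATH or in the environment's Scripts (Windows) or bin directory; `find_script` and `copy_script_to` are hypothetical helper names, and the search locations may need adjusting for a given installation.

```python
import os
import shutil
import sys


def find_script(name="pdf2txt.py", extra_dirs=None):
    """Return the path of a script found on PATH or in likely install dirs, else None."""
    found = shutil.which(name)
    if found:
        return found
    candidates = list(extra_dirs or [])
    # Typical script locations inside a Python environment (assumed).
    candidates.append(os.path.join(sys.prefix, "Scripts"))
    candidates.append(os.path.join(sys.prefix, "bin"))
    for directory in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None


def copy_script_to(project_dir, name="pdf2txt.py", extra_dirs=None):
    """Copy the script into project_dir and return the destination path."""
    source = find_script(name, extra_dirs)
    if source is None:
        raise FileNotFoundError("%s not found; is pdfminer.six installed?" % name)
    return shutil.copy(source, project_dir)


if __name__ == "__main__":
    print(find_script())
```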

Solving this problem when deploying to a Linux server

See the official documentation: https://textract.readthedocs.io/en/latest/installation.html

On Ubuntu / Debian, install the prerequisite packages before installing textract to parse PDFs:

apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
pip install textract

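If it still does not work after installation, parsing PDFs also requires pip install pdfminer.six.
Reference documentation: https://textract.readthedocs.io/en/stable/

With this in place, running the script file directly with python should work without problems.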
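Before running the script on a fresh server, it can help to check that the external commands are actually on PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper, and the command names in the usage part (pdftotext from poppler-utils, tesseract from tesseract-ocr, antiword, unrtf) are chosen here for illustration and are not an exhaustive list of what textract calls.

```python
import shutil


def missing_tools(names):
    """Return the subset of command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Illustrative selection of commands provided by the apt packages above.
    wanted = ["pdftotext", "tesseract", "antiword", "unrtf", "pdf2txt.py"]
    absent = missing_tools(wanted)
    if absent:
        print("Not found on PATH:", ", ".join(absent))
    else:
        print("All listed tools were found.")
```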

Python 3: UnicodeEncodeError 'ascii' codec can't encode characters in position 0-1

Once the problem has been located, the fix is simple. There are two approaches.

1. Use PYTHONIOENCODING. Set PYTHONIOENCODING=utf-8 when running python, i.e.:

 PYTHONIOENCODING=utf-8  python XXXX.py

2. Redefine standard output.
Standard output is redefined as follows:

import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
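Why the redefinition helps can be shown without touching the real sys.stdout. The sketch below writes the same text through an 'ascii' writer and a 'utf-8' writer backed by an in-memory buffer; `write_as` and `can_write` are hypothetical helpers, and the buffer only stands in for the terminal stream.

```python
import codecs
import io


def write_as(encoding, text):
    """Write text through a codecs writer, the way the redefined stdout would."""
    buffer = io.BytesIO()
    writer = codecs.getwriter(encoding)(buffer)
    writer.write(text)
    return buffer.getvalue()


def can_write(encoding, text):
    """Return True if text can be written with the given encoding."""
    try:
        write_as(encoding, text)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    sample = "中文"
    print("ascii ok:", can_write("ascii", sample))
    print("utf-8 ok:", can_write("utf-8", sample))
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to get the same effect without wrapping the stream.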

Original article: https://blog.csdn.net/AckClinkz/article/details/78538462

https://blog.csdn.net/u011415481/article/details/80794567
