Extracting Tables from PDF Files with Python (Repairing PDFs with mutool)

Eichee

Article link: https://blog.csdn.net/Yubu_/article/details/84198971

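I had downloaded a lot of PDF files with a large amount of data in them, and I needed to extract the tables. And so!

I found a Python package called camelot (https://github.com/socialcopsdev/camelot) to do the table extraction. Powerful!!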

import camelot

# filename: path to the PDF; page_index: 1-based number of the page to read
tables = camelot.read_pdf(filename, pages=str(page_index))
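
To make the call above a bit more concrete, here is a minimal sketch of reading several pages at once and getting each table as a pandas DataFrame. The helper names `pages_arg` and `extract_tables` and the file name in the comment are my own placeholders, not part of camelot; the sketch only relies on `camelot.read_pdf` accepting a comma-separated page string and on each returned table exposing a `.df` attribute.

```python
def pages_arg(page_indices):
    """Build the comma-separated page string passed to camelot.read_pdf, e.g. '1,3,5'."""
    return ','.join(str(p) for p in page_indices)


def extract_tables(filename, page_indices):
    """Return one pandas DataFrame per table found on the given pages."""
    import camelot  # imported here so pages_arg can be used without camelot installed

    tables = camelot.read_pdf(filename, pages=pages_arg(page_indices))
    return [table.df for table in tables]


# Hypothetical usage (the file name is a placeholder):
# for df in extract_tables('example.pdf', [1, 2]):
#     print(df.head())
```

Page numbers are 1-based, so the first page of the file is page 1.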

And then!

About half of the PDF files failed to extract, with errors like this:

PdfReadWarning: Illegal character in Name Object [generic.py:489]
Invalid dictionary construct: [/'Type', /'Font', /'Subtype', /'Type0', /'BaseFont', /b"b'", /"ABCDEE+\xcb\xce\xcc\xe5'", /'Encoding', /'Identity-H', /'DescendantFonts', PDFObjRef:6, /'ToUnicode', PDFObjRef:12]

Then I heard that the PDF files need to be repaired first and extracted afterwards.

So I went and found this command:

mutool clean

A very powerful command.

I tested extraction again after the repair, and sure enough, it worked.
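
For a single file, the repair step could look roughly like the sketch below. It assumes `mutool` is installed and on the PATH, and that the command is used in its plain form, `mutool clean input.pdf output.pdf`. The function names are my own, chosen only for illustration.

```python
import shutil
import subprocess


def build_clean_command(src, dst):
    """Argument list for `mutool clean input.pdf output.pdf`."""
    return ['mutool', 'clean', src, dst]


def repair_pdf(src, dst):
    """Run mutool clean on one file; return True if it exited with status 0."""
    if shutil.which('mutool') is None:
        return False  # mutool was not found on PATH
    result = subprocess.run(build_clean_command(src, dst))
    return result.returncode == 0
```

Passing the arguments as a list instead of one string keeps file names with spaces in them from being split apart.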

Then I ran the repair over all of the PDFs in one go:

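import os
import subprocess

src_dir = '修复前'  # input folder ("before repair")
dst_dir = '修复后'  # output folder ("after repair")
os.makedirs(dst_dir, exist_ok=True)  # make sure the output folder exists
for file in os.listdir(src_dir):
    print(file)
    # Pass the arguments as a list so file names containing spaces still work
    subprocess.run(['mutool', 'clean', os.path.join(src_dir, file), os.path.join(dst_dir, file)])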
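
Since only about half of the files needed repairing, another option is to repair on demand: try the extraction first, and only fall back to a repaired copy when it fails. The sketch below is one possible way to wire that up, not what I actually ran. It assumes a failed extraction surfaces as a Python exception, and every function name in it is hypothetical.

```python
import os
import subprocess
import tempfile


def extract_with_fallback(path, extract, repair):
    """Call extract(path); if it raises, repair into a temp file and retry once."""
    try:
        return extract(path)
    except Exception:  # assumption: a failed extraction raises an exception
        with tempfile.TemporaryDirectory() as tmp:
            fixed = os.path.join(tmp, os.path.basename(path))
            repair(path, fixed)
            return extract(fixed)


def mutool_clean(src, dst):
    """Repair src into dst; raises CalledProcessError if mutool reports failure."""
    subprocess.run(['mutool', 'clean', src, dst], check=True)


def camelot_extract(path, pages='1'):
    """Return the tables on the given pages as pandas DataFrames."""
    import camelot

    return [table.df for table in camelot.read_pdf(path, pages=pages)]


# Hypothetical usage (the file name is a placeholder):
# dfs = extract_with_fallback('example.pdf', camelot_extract, mutool_clean)
```

The extract and repair steps are passed in as functions, so the fallback logic can be tried out with stand-ins before camelot or mutool are involved.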
