Extracting Specified PDF Tables and Saving Them to Excel

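This article presents a program that extracts specific tables from PDF files and saves them to Excel. Following user-specified rules, the program relies on a PDF parsing module and Excel read/write modules, and uses multiple processes to speed up the work.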

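Abstract: This article describes a program that extracts table content from PDF files. It first shows how the program is used, then explains the design of the code and its details.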


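0. Requirements
  • The PDFs contain a large number of tables, and only tables of specified types need to be extracted. A target table is identified mainly by keywords in its header and in the table itself.
1. PDF example
2. Extraction rules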

The extraction rules are specified in an Excel file, for example:

[Figure: example rule sheet]
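The way such rules select a header line can be sketched as follows. The key names `in-header` and `not-in-header` are the ones used by the code later in this article; the patterns in `RULES` and the helper name `header_matches` are made up for illustration, and the actual layout of the rule sheet is not shown here.

```python
import re

# Hypothetical rule set. Only the two key names come from the article's code;
# the regular expressions are placeholders.
RULES = {
    "in-header": [r"Revenue by (product|region)"],
    "not-in-header": [r"Forecast"],
}


def header_matches(line, rules):
    """Return True if `line` matches at least one 'in-header' pattern
    and none of the 'not-in-header' patterns."""
    wanted = any(re.search(p, line) for p in rules["in-header"])
    unwanted = any(re.search(p, line) for p in rules["not-in-header"])
    return wanted and not unwanted
```

With these placeholder rules, a line such as "Revenue by region" is accepted, while "Forecast: Revenue by region" is rejected by the exclusion pattern.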

3. Example output

The extracted tables are saved to an Excel file, for example:

[Figure: example output workbook]
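As a rough sketch of how extracted tables might be written out with xlwt: the sheet-per-table layout, the sheet naming and the function names below are assumptions, not the layout the program actually produces. The dictionary keys `page` and `table` follow the result records built by the code later in this article. Note that xlwt writes the legacy `.xls` format.

```python
def clean_rows(table):
    """Replace None cells with '' and strip surrounding whitespace."""
    return [["" if cell is None else str(cell).strip() for cell in row]
            for row in table]


def save_tables(target_tables, out_path):
    """Write each extracted table to its own sheet (assumed layout)."""
    import xlwt  # third-party; imported here so clean_rows works without it

    if not target_tables:
        return  # xlwt cannot save a workbook that has no sheets
    book = xlwt.Workbook(encoding="utf-8")
    for n, item in enumerate(target_tables, start=1):
        # Sheet names must be unique, so include a running number.
        sheet = book.add_sheet("t{}_p{}".format(n, item["page"]))
        for r, row in enumerate(clean_rows(item["table"])):
            for c, value in enumerate(row):
                sheet.write(r, c, value)
    book.save(out_path)  # e.g. "result.xls"
```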

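4. Usage
  • First prepare the Demo.xlsx file (download) and download the PDFparser.exe program (download), and put both in the same directory. Then put the PDF files in a folder of any name, here called xxx, and place the xxx folder in that same directory. Double-click the program to run it.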
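The directory layout above could be discovered along these lines. This is a sketch only: the function name `find_inputs` is hypothetical, and how PDFparser.exe actually locates its inputs is not shown in this article.

```python
from pathlib import Path


def find_inputs(base_dir, rules_name="Demo.xlsx"):
    """Return (rules_path, pdf_paths) for a directory that holds the rule
    file plus one or more sub-folders containing PDF files."""
    base = Path(base_dir)
    rules_path = base / rules_name
    if not rules_path.is_file():
        raise FileNotFoundError("rule file not found: {}".format(rules_path))
    # Collect PDFs from every sub-folder, matching the extension case-insensitively.
    pdf_paths = sorted(
        p
        for sub in base.iterdir() if sub.is_dir()
        for p in sub.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return rules_path, pdf_paths
```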
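5. Code overview
  • The program uses the pdfplumber module to parse PDFs and obtain their tables and text.
  • The program uses the xlwt and xlrd modules to write and read Excel files.
  • The program combines multiple processes with multiple threads to speed up processing.
  • The program uses the re module for Python regular expressions.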
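One way to combine processes and threads is sketched below, using the standard `concurrent.futures` module. The function names, the batch-splitting scheme and the placeholder `handle_file` are assumptions for illustration; the article does not show how the program itself distributes the work.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_batches(items, n_batches):
    """Split `items` into at most `n_batches` non-empty, interleaved batches."""
    if not items:
        return []
    n_batches = max(1, min(n_batches, len(items)))
    return [items[i::n_batches] for i in range(n_batches)]


def handle_file(path):
    """Placeholder for the per-file work (parse one PDF, extract its tables)."""
    return path, "done"


def process_batch(batch, n_threads=4):
    """Run inside one worker process: handle a batch of files with threads."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(handle_file, batch))


def process_all(paths, n_procs=2):
    """Distribute the batches over several processes and flatten the results."""
    results = []
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        for part in pool.map(process_batch, split_batches(paths, n_procs)):
            results.extend(part)
    return results
```

PDF parsing is largely CPU-bound, so the processes do most of the speeding up; the threads mainly help where file I/O dominates. On Windows, `process_all` should be called from under an `if __name__ == "__main__":` guard.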
6. Code details
  • PDF parsing

    
    import os
    import re

    import pdfplumber


    # This class extracts the tables and text content of a PDF
    class Extractor(object):
        def __init__(self, file_path, rules):
            '''
            :param file_path: path to the PDF file
            :param rules: extraction rules
            '''
            self.file_path = file_path
            self.rules = rules
    
        # Load the PDF file; keep the text and tables of every page that contains tables
        def parse_pages(self):
            try:
                pages = []
                with pdfplumber.open(self.file_path) as pdf:
                    print('parse file:{}   page num:{}'.format(os.path.basename(self.file_path), len(pdf.pages)))
                    for index, page in enumerate(pdf.pages):
                        tables = page.extract_tables()
                        if len(tables) < 1:
                            continue
                        # extract_text() may return None for a page without a text layer
                        pages.append({'text': page.extract_text() or '', 'tables': tables, 'page': index + 1})
                return pages
            except Exception as e:
                print(e)
            return None

        # Extract tables with a specific kind of header; the rules come from the rules parameter
        def extract_table_with_specific_header(self, pages):
            if pages is None or len(pages) < 1:
                # print('no-page...')
                return
            target_tables = []
            # Iterate over all pages
            for index, page in enumerate(pages):
                text = page['text']
                tables = page['tables']
                page_id = page['page'] - 1
                lines = re.split(r'\n+', text)
                # Iterate over every line of the current page
                for ind, line in enumerate(lines):
                    # Look for a header line that satisfies the rules
                    if any(re.search(rule, line) for rule in self.rules['in-header']) and \
                            not any(re.search(rule, line) for rule in self.rules['not-in-header']):
                        if ind >= len(lines) - 1:
                            break
                        cnt = ind + 1
                        next_line = lines[ind + 1]
                        # Skip a "单位:" (unit) line between the header and the table
                        if ind < len(lines) - 2 and re.search(r'单位[::]', next_line):
                            next_line = lines[ind + 2]
                            cnt += 1
                        for ti, table in enumerate(tables):
                            if not table:
                                continue
                            first = ''.join([word for word in table[0] if word is not None])
                            # Case: the table is complete
                            if first == re.sub(r'\s+', '', next_line):
                                tables[ti] = False
                                if index + 1 < len(pages) and len(pages[index + 1]['tables']) > 0:
                                    table_next = pages[index + 1]['tables'][0]
                                    fi = ''.join([item for item in table_next[0] if item is not None])
                                    if re.sub(r'\s+', '', fi) in re.sub(r'\s+', '', pages[index + 1]['text'][1]):
                                        table += table_next
                                target_tables.append({'page': page_id + 1, 'method': 'exact', 'table': table,
                                                      'table-id': str(page_id + 1) + str(ti + 1)})
                            # Case: the table may be incomplete
                            elif first in re.sub(r'\s+', '', next_line):
                                tables[ti] = False
                                if index + 1 < len(pages) and len(pages[index + 1]['tables']) > 0:
                                    table_next = pages[index + 1]['tables'][0]
                                    # fi = ''.join(...  (the listing is cut off at this point)
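The listing above appends the first table of the following page when it looks like a continuation. A self-contained sketch of one possible continuation test is given below. It is an assumed heuristic, not the original code: the function names, the `top_lines` parameter and the column-count check are all introduced here for illustration.

```python
import re


def squash(text):
    """Remove all whitespace so text lines and table rows can be compared."""
    return re.sub(r"\s+", "", text)


def append_continuation(table, next_tables, next_text, top_lines=2):
    """Return `table` extended by the first table of the next page when that
    table has the same number of columns and its first row appears within the
    first `top_lines` non-empty text lines of the page; otherwise return
    `table` unchanged."""
    if not table or not next_tables or not next_tables[0]:
        return table
    candidate = next_tables[0]
    first_row = squash("".join(cell for cell in candidate[0] if cell))
    top = [squash(line) for line in next_text.splitlines() if line.strip()][:top_lines]
    same_width = len(candidate[0]) == len(table[0])
    if first_row and same_width and any(first_row in line for line in top):
        return table + candidate  # build a new list; the inputs are not modified
    return table
```

Returning a new list instead of using `+=` keeps the tables stored in `pages` untouched, which makes the merge easier to reason about when a page is visited more than once.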