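Abstract: This article presents a program that extracts table content from PDF files. It first shows usage examples and then describes the development approach and the details of the code.
0. Requirements
- The PDFs contain a large number of tables, and only tables of specified types need to be extracted. These tables are identified mainly by keywords in their headers and in the table body.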
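To make the keyword idea concrete, here is a minimal sketch of a header check. The key names `in-header` and `not-in-header` are the ones used by the `Extractor` code in section 6; the patterns themselves and the helper name `header_matches` are hypothetical and only serve as an illustration.

```python
import re

# Hypothetical rule set: the key names follow the Extractor code in section 6,
# the regular expressions are made up for this illustration.
rules = {
    'in-header': [r'Revenue', r'Income'],
    'not-in-header': [r'Tax'],
}


def header_matches(line, rules):
    """Return True if the line hits an include pattern and no exclude pattern."""
    included = any(re.search(pattern, line) for pattern in rules['in-header'])
    excluded = any(re.search(pattern, line) for pattern in rules['not-in-header'])
    return included and not excluded
```

With these example rules, a caption such as `Table 3: Revenue by region` is accepted, while `Revenue and Tax summary` is rejected because it also hits an exclude pattern.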
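1. PDF example
2. Extraction rules
The extraction rules are specified in an Excel file, as in the following example:
3. Example of extracted results
The extracted results are saved to an Excel file, as shown below:
4. Usage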
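- First prepare the Demo.xlsx file (download) and the PDFparser.exe program (download), and place both in the same directory. Then put the PDF files in a folder of any name, say xxx, place the xxx folder in the same directory as the two files above, and double-click the program to run it.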
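The directory layout described above suggests that the program scans the folders next to it for PDF files. How PDFparser.exe actually does this is not shown in the article; the following is only a sketch of one plausible approach, and the function name `find_pdfs` is an assumption.

```python
from pathlib import Path


def find_pdfs(base_dir):
    """Map each immediate subfolder of base_dir to the PDF files it contains."""
    found = {}
    folders = sorted(p for p in Path(base_dir).iterdir() if p.is_dir())
    for folder in folders:
        pdfs = sorted(
            f for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() == '.pdf'
        )
        # Folders without any PDF are left out of the result
        if pdfs:
            found[folder.name] = pdfs
    return found
```

Calling `find_pdfs('.')` from the directory that holds Demo.xlsx would return a dictionary keyed by folder name, for example `xxx`, with the PDF paths found in each folder.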
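5. Code overview
- The program uses the pdfplumber module to parse PDFs and obtain their tables and text
- The program uses the xlwt and xlrd modules to write and read Excel files
- The program uses a multi-process + multi-threaded model to speed up processing
- The program uses the re module for Python regular expressions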
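The multi-process + multi-threaded model mentioned above can be sketched with the standard `concurrent.futures` module. The article does not show how the original program divides its work, so the layout below (one chunk of files per process, a thread pool inside each process) and all function names are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, non-empty lists."""
    if not items:
        return []
    size = -(-len(items) // max(1, n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(paths, handle, n_threads=4):
    """Run handle(path) for every path of one chunk in a thread pool."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(handle, paths))


def process_all(paths, handle, n_processes=2, n_threads=4):
    """Give each worker process one chunk; each process fans out to threads.

    handle must be a picklable, module-level function.
    """
    results = []
    chunks = split_into_chunks(paths, n_processes)
    with ProcessPoolExecutor(max_workers=n_processes) as pool:
        futures = [pool.submit(process_chunk, chunk, handle, n_threads)
                   for chunk in chunks]
        # Collect in submission order so results line up with paths
        for future in futures:
            results.extend(future.result())
    return results
```

On platforms that start worker processes by spawning a fresh interpreter, `process_all` should be called from under an `if __name__ == '__main__':` guard. Processes sidestep the GIL for CPU-heavy parsing, while threads inside each process overlap file I/O.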
6. Code details
- PDF parsing
```python
import os
import re

import pdfplumber


# Extracts the tables and text content of a PDF
class Extractor(object):
    def __init__(self, file_path, rules):
        '''
        :param file_path: PDF file path
        :param rules: extract rules
        '''
        self.file_path = file_path
        self.rules = rules

    # Load the PDF file and keep only the pages that contain tables
    def parse_pages(self):
        try:
            pages = []
            # The context manager closes the file handle when parsing is done
            with pdfplumber.open(self.file_path) as pdf:
                print('parse file:{} page num:{}'.format(os.path.basename(self.file_path), len(pdf.pages)))
                for index, page in enumerate(pdf.pages):
                    tables = page.extract_tables()
                    if len(tables) < 1:
                        continue
                    # extract_text() can return None, so fall back to ''
                    pages.append({'text': page.extract_text() or '', 'tables': tables, 'page': index + 1})
            return pages
        except Exception as e:
            print(e)
            return None

    # Extract tables whose header matches the rules given by the rules parameter
    def extract_table_with_specific_header(self, pages):
        if pages is None or len(pages) < 1:
            # print('no-page...')
            return
        target_tables = []
        # Iterate over all pages
        for index, page in enumerate(pages):
            text = page['text']
            tables = page['tables']
            page_id = page['page'] - 1
            lines = re.split(r'\n+', text)
            # Iterate over all lines of the current page
            for ind, line in enumerate(lines):
                # Find tables whose header line satisfies the rules
                if sum([1 for rule in self.rules['in-header'] if re.search(rule, line)]) > 0 and \
                        sum([1 for rule in self.rules['not-in-header'] if re.search(rule, line)]) == 0:
                    if ind >= len(lines) - 1:
                        break
                    cnt = ind + 1
                    next = lines[ind + 1]
                    # Skip a unit line (e.g. "单位:元") between the caption and the table
                    if ind < len(lines) - 2 and re.search(r'单位[::]', next):
                        next = lines[ind + 2]
                        cnt += 1
                    for ti, table in enumerate(tables):
                        if not table:
                            continue
                        first = ''.join([word for word in table[0] if word is not None])
                        # Case: the table is complete
                        if first == re.sub(r'\s+', '', next):
                            tables[ti] = False  # mark the table as consumed
                            if index + 1 < len(pages) and len(pages[index + 1]['tables']) > 0:
                                table_next = pages[index + 1]['tables'][0]
                                fi = ''.join([item for item in table_next[0] if item is not None])
                                # NOTE: pages[...]['text'] is a str, so [1] is its second character
                                if re.sub(r'\s+', '', fi) in re.sub(r'\s+', '', pages[index + 1]['text'][1]):
                                    table += table_next
                            target_tables.append({'page': page_id + 1, 'method': 'exact', 'table': table,
                                                  'table-id': str(page_id + 1) + str(ti + 1)})
                        # Case: the table may be incomplete
                        elif first in re.sub(r'\s+', '', next):
                            tables[ti] = False
                            if index + 1 < len(pages) and len(pages[index + 1]['tables']) > 0:
                                table_next = pages[index + 1]['tables'][0]
                                fi = ''.join
```