如何使用Python从PDF中提取文本并转换为Markdown文本的实操

最新推荐文章于 2025-03-13 11:08:49 发布

wavemenu

最新推荐文章于 2025-03-13 11:08:49 发布

阅读量1.6k

点赞数 19

文章标签： python pdf 数据库

本文链接：https://blog.csdn.net/wavemenu/article/details/141099115

版权

前言概述

之前介绍如何通过pdfplumber获取PDF的文本。

《利用Python将PDF文件转换为文本文件》

这次，基于上次所写的内容，继续向下深挖。

这是关于使用Python从PDF文件中提取文本并转换为Markdown的实际操作。

背景

现在有12个注册测绘师综合真题的PDF，但是里面很多广告，个人希望把这些广告消除了。如果是少量pdf文件，那么用WPS然后花钱开会员的应该能把广告清理掉。但是这12个PDF对应着12年的注册测绘师综合真题，每个PDF有100页，一页一道真题。如果手动清理广告，这个工作量是很大的。

所以打算使用程序解决以上问题。那先说说解决思路：先把PDF转为文本，然后对这些文本进行数据清理，这个是对单个PDF的转纯净文本的思路。

思路

这个编程思路，也是我平常写代码去解决问题的思路。

先解决简单的问题，提取共性，再解决复杂的问题，在这过程中把握好输入输出。

这句话有很多种角度去理解。从底层出发，进而解决高层的问题。即从内到外，从下到上。

就具体事情来说，先解决单个PDF格式转换文本，再进行数据清洗。进而解决批量PDF格式转换文本且数据清洗。

单个PDF格式转换文本

相关代码如下：


def get_file_name(file_dir, type):
   """
  搜索 后缀名为type的文件 不包括子目录的文件
  #
  """
   corretion_file = []
   filelist = os.listdir(file_dir)
   for file in filelist:
       if os.path.splitext(file)[1] == type:
           corretion_file.append(os.path.join(file_dir, file))
   if corretion_file == []:
       for file in filelist:
           if os.path.splitext(file)[1] == '.'+type:
               corretion_file.append(os.path.join(file_dir, file))
return corretion_file
pdf_files_list = get_file_name(path, '.pdf')

i = 0
for pdf_file in pdf_files_list:
单文件处理模块

这个get_file_name函数，实现对指定文件夹下的指定文件的检索，返回一个符合文件后缀的文件名字列表，进而实现批量处理。基本上每次需要批量处理时，这个函数就要被复制粘贴，被拿出来使用。

函数是自己写的，改进了一两个版本，所以现在它很好用，且用了好几年了。以后再细细说这个函数。

全部代码

整合以上所有代码，如下：


import pdfplumber
import os

def get_file_name(file_dir, type):
   """
  搜索 后缀名为type的文件 不包括子目录的文件
  #
  """
   corretion_file = []
   filelist = os.listdir(file_dir)
   for file in filelist:
       if os.path.splitext(file)[1] == type:
           corretion_file.append(os.path.join(file_dir, file))
   if corretion_file == []:
       for file in filelist:
           if os.path.splitext(file)[1] == '.'+type:
               corretion_file.append(os.path.join(file_dir, file))
   return corretion_file
def read_pdf(pdf_path):
   with pdfplumber.open(pdf_path) as pdf:
           content = ''
           table = []
           image = []
           for i in range(len(pdf.pages)):
               # 读取PDF文档第i+1页
               page = pdf.pages[i]
               if i == 9:
                   print()
               # page.extract_text()函数即读取文本内容
               page_content = '\n'.join(page.extract_text().split('\n'))
               content = content + page_content + '\n'


   modified_string = content.replace("\nB", " B")
   # 然后，替换\nC为 C
   modified_string = modified_string.replace("\nC", " C")
   # 最后，替换\nD为 D
   modified_string = modified_string.replace("\nD", " D")
   return modified_string
#
def comprehensive_cleaning(path, outpath):
   """
  综合真题_数据清洗_批量处理
  """
   if os.path.exists(outpath)==False:
       os.makedirs(outpath)
   pdf_files_list = get_file_name(path, '.pdf')

   i = 0
   for pdf_file in pdf_files_list:
       original_pdf_text = read_pdf(pdf_file)
       modified_string = original_pdf_text.replace(
       r'仅允许加其中一个群！！',
       '')
       # 写
       outfile = os.path.join(outpath , os.path.splitext(os.path.basename(pdf_file))[0] + '.md')
       with open(outfile, 'w', encoding='utf-8') as f:
           f.write(modified_string)
           f.close()
       i += 1
       print("\r进行PDF to markdown转换: [{0:50s}] {1:.1f}%".format('#' * int(i / (len(pdf_files_list)) * 50),
                                                                i / len(pdf_files_list) * 100), end="",
             flush=True)

if __name__ == '__main__':
   path = r'D:\Registered_Surveyor\pdf'
   outpath = r'D:\Registered_Surveyor\markdown\test2022'
   comprehensive_cleaning(path, outpath)

已全部转换为markdown文本。如下图所示。