Converting PDF to Word with Python

Latest recommended article published on 2024-05-07 06:31:59

农民小飞侠

Article link: https://blog.csdn.net/w5688414/article/details/106730574

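Today I tried converting PDF files to Word. The code is adapted from someone else's project, https://github.com/python-fan/pdf2word ; I tweaked it a little and stripped out the multithreading parts. This code has a few shortcomings, which I'll list:

Scanned PDFs cannot be converted
The output is mostly plain text, with no formatting preserved
Some fonts cannot be converted

There is plenty of room for improvement here, so I hope some experts will join in and push the project forward.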
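
Two of the shortcomings above can usually be spotted from the extracted text itself. Here is a rough sketch, not part of the original project: `diagnose_extracted_text` and its `cid_ratio_threshold` parameter are hypothetical names made up for illustration. It assumes that a scanned PDF yields essentially no text, and that glyphs from an undecodable font show up as `(cid:N)` placeholders, which is the pattern the code in this post strips out.

```python
import re

# Placeholder that appears when a glyph cannot be mapped to a character.
CID_PATTERN = re.compile(r'\(cid:\d+\)')


def diagnose_extracted_text(text, cid_ratio_threshold=0.3):
    """Hypothetical helper: guess why extracted PDF text may be unusable.

    Returns 'no-text' (likely a scanned PDF), 'unmapped-font'
    (mostly "(cid:N)" placeholders), or 'ok'.
    The 0.3 threshold is an arbitrary assumption, not a tuned value.
    """
    stripped = text.strip()
    if not stripped:
        return 'no-text'
    # Count how many characters belong to "(cid:N)" placeholders.
    cid_chars = sum(len(match) for match in CID_PATTERN.findall(stripped))
    if cid_chars / len(stripped) >= cid_ratio_threshold:
        return 'unmapped-font'
    return 'ok'


if __name__ == "__main__":
    print(diagnose_extracted_text(" \n\x0c"))
    print(diagnose_extracted_text("(cid:12)(cid:7)(cid:33) ok"))
    print(diagnose_extracted_text("Hello world"))
```

A check like this could run on the extracted string before writing the .docx file, so that files needing OCR or a different font handling are reported instead of being saved as empty or garbled documents.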

The code needs the following environment:

attrs==17.4.0
lxml==4.1.1
pdfminer3k==1.3.1
pluggy==0.6.0
ply==3.11
py==1.5.2
pytest==3.4.1
python-docx==0.8.6
six==1.11.0

Here is the code:

import logging
import os
import re
from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from docx import Document

# Only report errors, so pdfminer's warnings do not flood the console.
logging.Logger.propagate = False
logging.getLogger().setLevel(logging.ERROR)


def read_from_pdf(file_path):
    """Extract all text from a PDF and return it as one string."""
    with open(file_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        return_str = StringIO()
        lap_params = LAParams()

        # TextConverter writes the text of every page into return_str.
        device = TextConverter(
            resource_manager, return_str, laparams=lap_params)
        process_pdf(resource_manager, device, file)
        device.close()

        content = return_str.getvalue()
        return_str.close()
        return content

def save_text_to_word(content, file_path):
    """Write each line of content as a plain paragraph in a .docx file."""
    doc = Document()
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)

def remove_control_characters(content):
    """Delete ASCII control characters (code points 0-31) from content."""
    mpa = dict.fromkeys(range(32))
    return content.translate(mpa)

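def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    # Join a line break between two word characters with a space.
    content = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', content)
    # Rejoin words that were hyphenated across a line break.
    content = re.sub(r'(-)\n([0-9a-zA-Z_])', r'\2', content)
    # Drop line breaks that are surrounded by spaces.
    content = re.sub(r' \n ', r'', content)
    # Remove the remaining line breaks unless they follow a period.
    content = re.sub(r'([^.])\n', r'\1', content)
    # Strip "(cid:N)" placeholders left by fonts that could not be decoded.
    content = re.sub(r'\(cid:\d{1,2}\)', r'', content)
    save_text_to_word(content, word_file_path)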


if __name__ == "__main__":
    # Convert every .pdf file found in the ./pdf directory.
    root_path = 'pdf'
    for file in os.listdir(root_path):
        file_name, extension_name = os.path.splitext(file)
        if extension_name != '.pdf':
            continue
        pdf_file = os.path.join(root_path, file)
        word_file = os.path.join(root_path, file_name + '.docx')
        print('Processing: ', file)
        pdf_to_word(pdf_file, word_file)
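
To make the regular expressions in pdf_to_word easier to follow, here is a small standalone sketch that chains the same five substitutions and runs them on a made-up string. It needs no PDF library; `clean_extracted_text` and the sample text are names and data invented for illustration, not part of the original project.

```python
import re


def clean_extracted_text(content):
    """Chain the five substitutions used in pdf_to_word, in order."""
    # 1. A line break between two word characters becomes a space.
    content = re.sub(r'([0-9a-zA-Z_])\n([0-9a-zA-Z_])', r'\1 \2', content)
    # 2. A hyphen at a line end is dropped and the word is rejoined.
    content = re.sub(r'(-)\n([0-9a-zA-Z_])', r'\2', content)
    # 3. A line break surrounded by spaces is removed entirely.
    content = re.sub(r' \n ', '', content)
    # 4. Any other line break is removed unless it follows a period.
    content = re.sub(r'([^.])\n', r'\1', content)
    # 5. "(cid:N)" placeholders with one or two digits are stripped.
    content = re.sub(r'\(cid:\d{1,2}\)', '', content)
    return content


def remove_control_characters(content):
    """Delete ASCII control characters (code points 0-31)."""
    return content.translate(dict.fromkeys(range(32)))


if __name__ == "__main__":
    sample = (
        "PDF text is of-\nten wrapped\nmid sentence and \ncontinues.\n"
        "A new(cid:3) sentence."
    )
    print(repr(clean_extracted_text(sample)))
    # -> 'PDF text is often wrapped mid sentence and continues.\nA new sentence.'
    print(repr(remove_control_characters("page\x0cbreak")))
    # -> 'pagebreak'
```

Note that only line breaks after a period survive, so each paragraph in the Word file ends up being one or more full sentences. Steps 3 and 4 join the surrounding text without inserting a space, which can glue two words together when a line ends in punctuation such as a comma.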

I tested this on macOS. Another day I'll share how to parse scanned PDFs, although that approach also has a limited range of use.

