Document Conversion and Watermark Removal with Python

局点

Tags: python, server, java

Article link: https://blog.csdn.net/weixin_45224874/article/details/135052535

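Document conversion with Python scripts

In day-to-day work I often run into problems with document types. When copying text out of a PDF, for example, the selection range is hard to control and the copied content is often inaccurate. Likewise, when I want to process plain-text content, the file format and the content formatting get in the way.

So here are a few document conversion scripts I use myself. They are efficient and work well.

PDF to TXT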

import PyPDF2


def coverPDFToTxt(pdf_file_path=None, txt_file_path=None):
    text = ""

    # Open the PDF in binary mode; the with block closes it even on error
    with open(pdf_file_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Walk through every page and append its text to the text string
        for page in pdf_reader.pages:
            # Guard against a page that yields no extractable text
            text += page.extract_text() or ""

    with open(txt_file_path, 'w', encoding='utf-8') as file:
        file.write(text)


if __name__ == '__main__':
    coverPDFToTxt('d.pdf', 'd.txt')

For example, a sample PDF (screenshot not preserved) is converted to txt.
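Text extracted from a PDF often still carries the content formatting problems mentioned earlier, such as words split across lines and stray whitespace. The following is a minimal cleanup sketch, not part of the original scripts. The function name clean_extracted_text and the specific rules are assumptions chosen for illustration, and real documents may need different rules.

```python
import re


def clean_extracted_text(text):
    """Tidy up raw text extracted from a PDF (hypothetical helper)."""
    # Rejoin words hyphenated across a line break: "conver-\nsion" -> "conversion"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop trailing spaces and tabs at the end of each line
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Collapse runs of spaces or tabs into one space
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


if __name__ == '__main__':
    raw = "conver-\nsion  of   text \n\n\n\nnext paragraph"
    print(clean_extracted_text(raw))
```

It could be applied to the text string before it is written to the txt file. Note that the hyphen rule also removes genuine hyphens that happen to fall at a line end, so it is a trade-off.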

PDF to DOCX

from pdf2docx import Converter


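def pdfToWord(pdf_file_path=None, word_file_path=None):
    converter_ = Converter(pdf_file_path)
    # start=0 and end=None convert every page, from the first to the last
    converter_.convert(word_file_path, start=0, end=None)
    # Release the underlying PDF file
    converter_.close()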


if __name__ == '__main__':
    pdfToWord('d.pdf','d.docx')

The advantage is that there is no page limit: you can convert as many pages as you like.

DOCX to TXT

In practice, however, converting a PDF directly to text does not always give good results, so sometimes it works better to go pdf -> docx -> txt.

import pypandoc


def docxToTxt(docx_file_path=None, txt_file_path=None):
    output = pypandoc.convert_file(docx_file_path, 'plain', outputfile=txt_file_path)
    assert output == ""


if __name__ == '__main__':
    docxToTxt('d.docx','d.txt')

I won't demo this one, since the result looks much the same as above.
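The pdf -> docx -> txt route can be chained into a single call. The sketch below is one possible way to do it, not the author's own script. The name pdf_to_txt_via_docx is hypothetical, and the two conversion steps are passed in as callables so the helper itself depends only on the standard library.

```python
import os
import tempfile


def pdf_to_txt_via_docx(pdf_path, txt_path, pdf_to_docx, docx_to_txt):
    """Convert pdf -> docx -> txt through a temporary intermediate file.

    pdf_to_docx and docx_to_txt are callables that take
    (source_path, target_path), for example pdfToWord and docxToTxt.
    """
    # The temporary directory and the intermediate .docx are removed on exit
    with tempfile.TemporaryDirectory() as tmp_dir:
        docx_path = os.path.join(tmp_dir, "intermediate.docx")
        pdf_to_docx(pdf_path, docx_path)
        docx_to_txt(docx_path, txt_path)
```

With the two functions defined earlier, a call would look like pdf_to_txt_via_docx('d.pdf', 'd.txt', pdfToWord, docxToTxt).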

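Removing image watermarks with Python

Some images in documents I wrote earlier were automatically watermarked once they were uploaded to a blog site. To reuse them I could only crop small screenshots or redo the code examples from scratch. Later I came across a Python watermark-removal script on GitHub, and it works quite well. I have trimmed the script down to only the parts I need. If you want the original, see the DocumentLightMarkWipeTool project on GitHub; it is also very short.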

import numpy as np
from PIL import Image


# Image processing
def imgDeal(img_path, save_path):
    img = Image.open(img_path)
    img = levelsDeal(img, 108, 164)
    # Casting to uint8 wraps the negative values returned by levelsDeal,
    # so -1 becomes 255 (white) while 0 stays 0 (black)
    img_res = Image.fromarray(img.astype('uint8'))
    print('Image [' + img_path + '] processed')
    img_res.save(save_path)


# Image matrix processing
def levelsDeal(img, black, white):
    # Clamp the black and white levels to a valid range
    if white > 255:
        white = 255
    if black < 0:
        black = 0
    if black >= white:
        black = white - 2
    img_array = np.array(img, dtype=int)
    # Small negative scale factor applied to the distance above the black level
    cRate = -(white - black) / 255.0 * 0.05
    rgb_diff = img_array - black
    rgb_diff = np.maximum(rgb_diff, 0)
    img_array = rgb_diff * cRate
    img_array = np.around(img_array, 0)
    img_array = img_array.astype(int)
    return img_array


if __name__ == '__main__':
    imgDeal("image/水印.png", "results/水印.png")

Original image: (screenshot not preserved)

After processing: (screenshot not preserved)
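To see why this removes a light watermark, it helps to trace what levelsDeal and the later uint8 cast do to a single channel value. The sketch below reproduces that arithmetic in plain Python. The name levels_pixel is hypothetical, and the modulo step is my reading of how the integer to uint8 cast wraps negative values, so treat it as an illustration rather than a specification.

```python
def levels_pixel(value, black=108, white=164):
    """Return the 0-255 value that one channel value ends up with."""
    # Same clamping of the two levels as in levelsDeal
    white = min(white, 255)
    black = max(black, 0)
    if black >= white:
        black = white - 2
    c_rate = -(white - black) / 255.0 * 0.05
    diff = max(value - black, 0)
    # round() and np.around both round exact halves to the nearest even number
    scaled = round(diff * c_rate)
    # Emulate the integer -> uint8 cast, which wraps modulo 256
    return scaled % 256


if __name__ == '__main__':
    for value in (100, 153, 154, 255):
        print(value, "->", levels_pixel(value))
```

With the levels 108 and 164 used above, values up to 153 map to 0 (black) and values from 154 upward map to 255 or 254 (white or nearly white). That is presumably why a light gray watermark disappears while dark text survives.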

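Closing notes

This post simply shares some of the scripts I use day to day. They could be combined with another article of mine to package them as a Mac app with PyQt5 for extra convenience, but I won't go into that here.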
