October 20 Study Summary

WBYLX

Article link: https://blog.csdn.net/WBYLX/article/details/120874609

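This post covers several ways to work with PDFs in Python: extracting text with the PyPDF2 library, converting a PDF to PNG images with the fitz library, rotating and overlaying pages, and encrypting PDF files. It also shows how to add a watermark to a PDF and how to create the watermark PDF itself, and it includes utility functions for getting a file's name and telling files from directories.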

1. Extracting text from a PDF

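In Python, the third-party library PyPDF2 can be used to read PDF files. Install it with the command below.

Setup: pip install PyPDF2==1.26.0

The code in this post uses the camelCase API of PyPDF2 1.x (PdfFileReader, getPage, and so on), which PyPDF2 3.0 removed, hence the pinned version. PyPDF2 cannot extract images, charts, or other media from a PDF document, but it can extract text and return it as a Python string.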

import PyPDF2
from PyPDF2.pdf import PageObject

reader = PyPDF2.PdfFileReader('resources/XGBoost.pdf')
page = reader.getPage(0) # type:PageObject
print(page.extractText())
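
PyPDF2 has since been succeeded by the pypdf package, which uses snake_case names. The following is a minimal sketch of the same extraction with that API, assuming pip install pypdf; the helper name extract_first_page_text is illustrative and not part of the original notes.

```python
def extract_first_page_text(pdf_path):
    """Return the text of the first page of a PDF using the pypdf reader API."""
    # Imported inside the function so that defining it does not require pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # reader.pages behaves like a sequence of page objects.
    return reader.pages[0].extract_text()
```

Calling extract_first_page_text('resources/XGBoost.pdf') should give roughly the same string as the PyPDF2 version above, although the exact whitespace may differ between releases.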

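2. Converting a PDF to PNG images

Setup: pip install PyMuPDF (this is the package that provides the fitz module; the PyPI package named fitz is an unrelated project)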

import fitz

import os.path


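def get_filename(file_path):
    """Return the file name without its directory or extension.
    :param file_path: path to the file
    """
    _, fullname = os.path.split(file_path)
    filename, _ = os.path.splitext(fullname)
    return filename


def pdf_image(pdf_file, img_path, zoom_x=4, zoom_y=4, rotation_angle=0):
    """
    Convert each page of a PDF file to a PNG image.
    :param pdf_file: path to the PDF file
    :param img_path: directory in which to save the images
    :param zoom_x: horizontal zoom factor
    :param zoom_y: vertical zoom factor
    :param rotation_angle: rotation angle in degrees
    """
    # Open the PDF file
    pdf = fitz.open(pdf_file)
    temp = get_filename(pdf_file)
    # Read the PDF page by page. The snake_case names below replace the
    # older spellings pageCount, preRotate, getPixmap and writePNG.
    for page_num in range(pdf.page_count):
        page_obj = pdf[page_num]
        # Build the matrix used to scale and rotate the image
        trans = fitz.Matrix(zoom_x, zoom_y).prerotate(rotation_angle)
        # Render the PDF page to an image
        pm = page_obj.get_pixmap(matrix=trans, alpha=False)
        pm.save(os.path.join(img_path, f'{temp}_{page_num + 1}.png'))
    pdf.close()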


def main():
    if not os.path.exists('resources/images/'):
        os.makedirs('resources/images/')
    pdf_image('resources/XGBoost.pdf', 'resources/images/', 2, 2)


if __name__ == '__main__':
    main()
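
The zoom factors scale the page relative to the PDF's native 72 points per inch, so a zoom of 2 corresponds to roughly 144 dpi. The sketch below only illustrates that arithmetic; zoom_to_dpi and rendered_size are hypothetical helpers, not PyMuPDF functions, and the pixel size is an approximation for an unrotated page.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def zoom_to_dpi(zoom):
    """Effective resolution when a page is rendered with the given zoom factor."""
    return zoom * POINTS_PER_INCH


def rendered_size(width_pt, height_pt, zoom_x, zoom_y):
    """Approximate pixel size of an unrotated page given its size in points."""
    return round(width_pt * zoom_x), round(height_pt * zoom_y)


# An A4 page is about 595 x 842 points.
print(zoom_to_dpi(2))                 # 144
print(rendered_size(595, 842, 2, 2))  # (1190, 1684)
```

With the default zoom of 4 the images come out at about 288 dpi, which is why main() passes 2 to keep the files smaller.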

3. Rotating and overlaying pages

import PyPDF2
from PyPDF2.pdf import PageObject
# Create a reader object for the PDF
reader = PyPDF2.PdfFileReader('resources/XGBoost.pdf')
# Create a writer object for the output PDF
writer = PyPDF2.PdfFileWriter()
# Iterate over every page of the PDF
for page_num in range(reader.getNumPages()):
    page_obj = reader.getPage(page_num)  # type: PageObject
    # Rotate each page 90 degrees clockwise
    page_obj.rotateClockwise(90)
    # Add the rotated page to the writer object
    writer.addPage(page_obj)
# Append a blank page at the end and rotate it 90 degrees as well
blank_page = writer.addBlankPage()  # type: PageObject
blank_page.rotateClockwise(90)
# Set a password on the PDF file
writer.encrypt('5201314')
with open('resources/XGBoost_modified.pdf', 'wb') as file:
    writer.write(file)
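
In the PyPDF2 1.x API, rotateClockwise only accepts multiples of 90 degrees, and a quarter turn swaps a page's displayed width and height. The following stand-alone sketch shows that bookkeeping; rotated_size is an illustrative helper, not a PyPDF2 function.

```python
def rotated_size(width, height, angle):
    """Displayed (width, height) of a page after rotating it by angle degrees."""
    if angle % 90 != 0:
        raise ValueError('angle must be a multiple of 90')
    # An odd number of quarter turns swaps width and height.
    quarter_turns = (angle // 90) % 4
    return (height, width) if quarter_turns % 2 else (width, height)


print(rotated_size(595, 842, 90))   # (842, 595)
print(rotated_size(595, 842, 180))  # (595, 842)
```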

4. Encrypting a PDF file

import PyPDF2

reader = PyPDF2.PdfFileReader('resources/XGBoost.pdf')
writer = PyPDF2.PdfFileWriter()
for page_num in range(reader.getNumPages()):
    writer.addPage(reader.getPage(page_num))
# Encrypt the PDF with the encrypt method; its argument is the password to set
writer.encrypt('5201314')
with open('resources/XGBoost_modified.pdf', 'wb') as file:
    writer.write(file)
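
To read the encrypted file back, the reader has to be unlocked with the same password first. This is a minimal sketch using the same PyPDF2 1.x API; open_encrypted is a hypothetical helper name, and the import sits inside the function so that defining it does not require PyPDF2.

```python
def open_encrypted(pdf_path, password):
    """Open a PDF with the PyPDF2 1.x API, unlocking it if it is encrypted."""
    import PyPDF2  # legacy 1.x API, matching the rest of these notes

    reader = PyPDF2.PdfFileReader(pdf_path)
    # decrypt() returns 0 when the password does not match.
    if reader.isEncrypted and reader.decrypt(password) == 0:
        raise ValueError('wrong password')
    return reader
```

For the file written above, open_encrypted('resources/XGBoost_modified.pdf', '5201314') should return a reader whose pages can be accessed as usual.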

5. Adding a watermark to a file

import PyPDF2
from PyPDF2.pdf import PageObject

reader1 = PyPDF2.PdfFileReader('resources/watermark.pdf')
reader2 = PyPDF2.PdfFileReader('resources/XGBoost.pdf')
# Get the watermark page
watermark_page = reader1.getPage(0)
writer = PyPDF2.PdfFileWriter()
for page_num in range(reader2.getNumPages()):
    page_obj = reader2.getPage(page_num)  # type: PageObject
    # Merge the watermark page onto the original page
    page_obj.mergePage(watermark_page)
    writer.addPage(page_obj)
# Write the PDF to a file
with open('resources/XGBoost_watermarked.pdf', 'wb') as file:
    writer.write(file)

6. Creating a PDF file and drawing a watermark

Setup: pip install reportlab

from reportlab.lib.pagesizes import A4
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

pdf_canvas = canvas.Canvas('resources/demo.pdf', pagesize=A4)
width, height = A4
print(width, height)
# Draw an image
# image = canvas.ImageReader('resources/.jpg')
# pdf_canvas.drawImage(image, 20, height - 395, 250, 375)

# Register the font files
pdfmetrics.registerFont(TTFont('F1', 'resources/fonts/Action.ttf'))
pdfmetrics.registerFont(TTFont('F2', 'resources/fonts/青呱石头体.ttf'))

# Draw text
pdf_canvas.setFont('F1', 40)
pdf_canvas.setFillColorRGB(1, 0, 0, 0.3)
pdf_canvas.rotate(18)
pdf_canvas.drawString(250, 250, 'I LOVE YOU')

pdf_canvas.showPage()

pdf_canvas.setFont('F2', 40)
pdf_canvas.setFillColorRGB(0, 0, 1, 0.3)
pdf_canvas.rotate(18)
pdf_canvas.drawString(250, 250, '千峰Python人工智能学院')

# Save
pdf_canvas.save()
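
The width and height printed above are in points (1/72 inch), and reportlab places the origin at the bottom-left corner of the page, which is why the commented-out image call uses height - 395 to position content near the top. The sketch below shows where the A4 numbers come from; mm_to_points is an illustrative helper rather than a reportlab function.

```python
POINTS_PER_MM = 72 / 25.4  # 72 points per inch, 25.4 mm per inch


def mm_to_points(mm):
    """Convert a length in millimetres to PDF points."""
    return mm * POINTS_PER_MM


# A4 paper is 210 mm x 297 mm.
a4_width, a4_height = mm_to_points(210), mm_to_points(297)
print(round(a4_width, 2), round(a4_height, 2))  # 595.28 841.89
```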

7. Checking whether an entry is a directory: if it is, print its contents; if not, split off the file's name and extension

import os
# Iterate over every entry in the current directory
for file in os.listdir():
    # Check whether the entry is a directory
    if os.path.isdir(file):
        print(os.listdir(file))
    else:
        _, suffix = os.path.splitext(file)
        print(file, suffix)
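
The loop above works because os.listdir() with no argument lists the current directory; for any other directory the bare names would have to be joined with the directory path before calling os.path.isdir. Here is a sketch of the same idea with pathlib that returns its results instead of printing them; describe_entries is a hypothetical helper name.

```python
from pathlib import Path


def describe_entries(directory='.'):
    """Return (name, info) pairs: child names for directories, the suffix for files."""
    results = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_dir():
            # For a directory, collect the names of its children.
            children = sorted(child.name for child in entry.iterdir())
            results.append((entry.name, children))
        else:
            # For a file, record its extension (including the dot).
            results.append((entry.name, entry.suffix))
    return results
```

Because iterdir() yields full paths, the is_dir() check stays correct no matter which directory is passed in.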

8. Getting a file name without its path or extension

import os.path

def get_filename(file_path):
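    """Return the file name without its directory or extension.
    :param file_path: path to the file
    """
    # Split the directory from the full file name
    _, fullname = os.path.split(file_path)
    # Split the file name from its extension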
    filename, _ = os.path.splitext(fullname)
    return filename
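
The same result is available from pathlib as the stem attribute. This is a short sketch, with get_filename_pathlib as an illustrative name; like os.path.splitext, it strips only the last extension.

```python
from pathlib import Path


def get_filename_pathlib(file_path):
    """Return the file name without its directory or final extension."""
    return Path(file_path).stem


print(get_filename_pathlib('resources/XGBoost.pdf'))  # XGBoost
print(get_filename_pathlib('backup/archive.tar.gz'))  # archive.tar
```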
