Python for Beginners (12): Working with PDF and Word Documents

Article link: https://blog.csdn.net/hbrown/article/details/138545803

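14. Working with PDF and Word Documents
1) PDF basics
The latest release of the PyPDF2 module is 3.0. Compared with version 1.26 used in the book, its API has changed significantly.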
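As a quick reference for the API change noted above, the renames that matter for the programs in this post can be collected in a small table. This is only a sketch covering the calls used here, not a complete migration list; the variable names `reader`, `writer`, `page`, `other` and `f` are placeholders.

```python
# Old spelling (PyPDF2 1.26, as in the book) -> new spelling (PyPDF2 3.0).
# Only the calls used in this post are listed.
RENAMED = {
    'PdfFileReader(f)': 'PdfReader(f)',
    'PdfFileWriter()': 'PdfWriter()',
    'reader.numPages': 'len(reader.pages)',
    'reader.getPage(n)': 'reader.pages[n]',
    'reader.isEncrypted': 'reader.is_encrypted',
    'page.extractText()': 'page.extract_text()',
    'page.rotateClockwise(90)': 'page.rotate(90)',
    'page.mergePage(other)': 'page.merge_page(other)',
    'writer.addPage(page)': 'writer.add_page(page)',
}


def print_renames():
    """Print the table, one rename per line."""
    for old, new in RENAMED.items():
        print('%-26s -> %s' % (old, new))
```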
a) Extracting text from a PDF file
Test program: Test_1501.py

import PyPDF2,os

os.chdir('d:/temp')
pdfReader = PyPDF2.PdfReader('python11.pdf')

# PyPDF2 3.0 spells this is_encrypted; decrypt only when it is needed.
if pdfReader.is_encrypted:
    pdfReader.decrypt('123456')
# pages is the list-like collection of all pages, not a single page.
pages = pdfReader.pages
print('page num %s\n' % len(pages))

for i in range(len(pages)):
    print(pages[i].extract_text())
    print('---------------Page %s--------------' % i)

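PyPDF2.PdfReader(): opens a PDF file for reading
PdfReader.pages: the pages of the document
len(PdfReader.pages): number of pages
pages[0].extract_text(): extracts the text of the page
pages[0].rotate(angle): rotates the page clockwise; angle must be a multiple of 90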
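b) Decrypting a PDF
PdfReader.is_encrypted: True means the file is encrypted
PdfReader.decrypt(passwd): decrypts the file
Password-protected PDFs exported from Word are encrypted with AES; decrypting them requires the pycryptodome package:
pip install pycryptodome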
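The read and decrypt calls listed above can be combined into one helper. The following is a minimal sketch, assuming `reader` behaves like a PyPDF2 3.0 `PdfReader`; the function name `extract_all_text` is made up for illustration. It relies on `decrypt()` returning a falsy value (0) when the password does not match.

```python
def extract_all_text(reader, password=None):
    """Return a list holding the text of every page.

    `reader` is assumed to behave like PyPDF2.PdfReader (3.0 API).
    """
    if reader.is_encrypted:
        if password is None:
            raise ValueError('PDF is encrypted; a password is required')
        # decrypt() returns a falsy value (0) when the password is wrong.
        if not reader.decrypt(password):
            raise ValueError('wrong password')
    return [page.extract_text() for page in reader.pages]
```

With PyPDF2 installed, it could be called as `extract_all_text(PyPDF2.PdfReader('python11.pdf'), '123456')`.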
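c) Creating PDF files
PyPDF2.PdfWriter(): returns a writer object
page.merge_page(): merges another page onto this one; used for watermarks
PdfWriter.add_page(): adds a page
outfp = open(fn, 'wb'): opens the output file for binary writing
PdfWriter.write(outfp): writes the PDF to the file
PdfWriter.encrypt(passwd): encrypts the output
outfp.close(): closes the output file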
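The watermark use of merge_page() mentioned above might look like the sketch below. It assumes `reader` and `writer` behave like PyPDF2 3.0 `PdfReader` and `PdfWriter` objects and that `watermark_page` is a page taken from another PDF; the function name `stamp_pages` is hypothetical.

```python
def stamp_pages(reader, watermark_page, writer):
    """Merge watermark_page onto every page of reader and add it to writer.

    Returns the number of pages added.
    """
    count = 0
    for page in reader.pages:
        page.merge_page(watermark_page)  # draw the watermark over the page
        writer.add_page(page)
        count += 1
    return count
```

After calling it, the result would still have to be saved with `writer.write(outfp)` on a file opened in 'wb' mode.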
Test program: test_1502.py

import PyPDF2,os

os.chdir('d:/temp')

pdfReader = PyPDF2.PdfReader('pjt001.pdf')
pdfWriter = PyPDF2.PdfWriter()

# The pages can only be read after a successful decrypt().
if pdfReader.is_encrypted:
    print('decrypt ....')
    pdfReader.decrypt('abcdefgh')
else:
    print('no password ')

for i in range(len(pdfReader.pages)):
    cuPage = pdfReader.pages[i]
    # Rotate 90 degrees clockwise, then overlay page 0 as a test "watermark".
    cuPage.rotate(90)
    cuPage.merge_page(pdfReader.pages[0],expand=True)
    pdfWriter.add_page(cuPage)

# Protect the output with a new password before writing it.
pdfWriter.encrypt('abcd1234')

with open('pjt002.pdf','wb') as outPdfFile:
    pdfWriter.write(outPdfFile)

2) Project: combining selected pages from multiple PDFs
Test program: test_1503.py

#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into a single PDF.

import PyPDF2,os
# Get all the PDF filenames.

os.chdir('d:/temp/waffle')
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key = str.lower)

pdfWriter = PyPDF2.PdfWriter()

# Loop through all the PDF files
for filename in pdfFiles:
    print('merge %s ...' % filename)
    pdfReader = PyPDF2.PdfReader(filename)
    # Loop through all the pages (except the first) and add them
    for pageNum in range(1,len(pdfReader.pages)):
        pageObj = pdfReader.pages[pageNum]
        pdfWriter.add_page(pageObj)

# Save the resulting PDF to a file.
with open('../allminutes.pdf','wb') as pdfOutput:
    pdfWriter.write(pdfOutput)

A file opened with with … as … is closed automatically when the block ends, so no explicit close() is needed.
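The closing behaviour of the with statement can be checked directly. This small sketch uses only the standard library; the temporary file name `demo.bin` is arbitrary.

```python
import os
import tempfile


def write_and_check():
    """Write a few bytes inside a with block and report the file state."""
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, 'demo.bin')
    with open(path, 'wb') as out:
        out.write(b'%PDF-')
        open_inside = not out.closed   # still open inside the block
    closed_after = out.closed          # closed as soon as the block ends
    os.remove(path)
    os.rmdir(folder)
    return open_inside, closed_after
```

The file is also closed if an exception is raised inside the block, which is the main advantage over a manual close().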
Test program: test_1504.py, which merges multiple image files into one PDF

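import os
from PIL import Image

imageFolder = 'd:/temp/photo'

imageList = []
# os.listdir() returns names in arbitrary order, so sort for a stable page order.
for imgName in sorted(os.listdir(imageFolder)):
    if not imgName.endswith('.jpg'):
        continue
    imgPath = os.path.join(imageFolder,imgName)
    print('merge %s' % imgPath)
    # Convert to RGB so every page uses a mode the PDF writer accepts.
    image = Image.open(imgPath).convert('RGB')
    imageList.append(image)

# The first image becomes page 1; the others are appended as further pages.
if imageList:
    imageList[0].save('d:/temp/imageall.pdf','PDF', resolution=100.0, save_all=True, append_images=imageList[1:])
else:
    print('no .jpg files found in %s' % imageFolder)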

The PIL module is provided by the Pillow package: pip install pillow
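The file-selection step of the image program can be made a little more robust. The sketch below is one possible variant using only the standard library: it matches extensions case-insensitively (so `.JPG` is included) and sorts the names so the page order is predictable. The function name `list_images` and the default extension tuple are assumptions for illustration.

```python
import os


def list_images(folder, extensions=('.jpg', '.jpeg')):
    """Return the sorted full paths of the image files in folder."""
    names = [name for name in os.listdir(folder)
             if name.lower().endswith(extensions)]
    names.sort(key=str.lower)  # case-insensitive, stable page order
    return [os.path.join(folder, name) for name in names]
```

Each returned path could then be passed to `Image.open(path).convert('RGB')` as in the program above.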
3) Word documents
The python-docx module: pip install python-docx
a) Reading Word documents
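docx.Document(): returns the document object
doc.paragraphs[]: the Paragraph objects in the document, one per paragraph
doc.paragraphs[i].text: the text of paragraph i
doc.paragraphs[i].runs[]: Run objects; a run is a contiguous stretch of text with the same formatting (a new run starts where the formatting changes, not at spaces)
doc.paragraphs[i].runs[j].text: the text of run j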
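To see how a document splits into paragraphs and runs, the attributes above can be walked in a nested loop. This is a minimal sketch, assuming `doc` behaves like the object returned by `docx.Document()`; the function name `list_runs` is hypothetical.

```python
def list_runs(doc):
    """Return (paragraph index, run index, run text) for every run."""
    rows = []
    for i, para in enumerate(doc.paragraphs):
        for j, run in enumerate(para.runs):
            rows.append((i, j, run.text))
    return rows
```

Printing the result for a real file, for example `list_runs(docx.Document('d:/temp/prj001.docx'))`, shows where Word started a new run.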
b) Getting the full text from a .docx file
Test program: test_1505.py

#! python3

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append('----' + para.text)
    return '\n'.join(fullText)

#print(getText('d:/temp/prj001.docx'))

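Appendix: loading and reloading modules
Loading a module: change to the directory that contains the .py file, then run import fname.
Reloading a module: run del sys.modules['fname'] and then import fname again. Alternatively, call importlib.reload(fname).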
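The reload step can be tried out with a throwaway module. The sketch below uses only the standard library; the module name `fname_demo` and its `VALUE` variable are made up for the demonstration, and it uses `importlib.reload()` as an alternative to deleting the entry from `sys.modules`. Bytecode caching is switched off during the demo so that the edited source is always the one that gets executed.

```python
import importlib
import os
import sys
import tempfile


def demo_reload():
    """Import a temporary module, edit it, reload it; return both values."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True      # avoid a stale .pyc during the demo
    try:
        with tempfile.TemporaryDirectory() as folder:
            path = os.path.join(folder, 'fname_demo.py')
            with open(path, 'w') as f:
                f.write('VALUE = 1\n')
            sys.path.insert(0, folder)
            try:
                importlib.invalidate_caches()
                module = importlib.import_module('fname_demo')
                first = module.VALUE
                # Edit the source, then reload the already imported module.
                with open(path, 'w') as f:
                    f.write('VALUE = 2  # edited\n')
                module = importlib.reload(module)
                second = module.VALUE
            finally:
                sys.path.remove(folder)
                sys.modules.pop('fname_demo', None)
    finally:
        sys.dont_write_bytecode = old_flag
    return first, second
```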
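c) Styling Paragraph and Run objects
The style attribute
Only styles already defined in the document can be applied by name. (The book states that new styles cannot be created; newer python-docx versions do provide doc.styles.add_style().)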
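Because a style has to exist in the document before it can be applied, it may help to check first. The following is a minimal sketch, assuming `doc` behaves like a python-docx document (an iterable `doc.styles` whose items have a `name`, and paragraphs whose `style` can be assigned by name); the function name `set_paragraph_style` is hypothetical.

```python
def set_paragraph_style(doc, index, style_name):
    """Apply style_name to paragraph `index` if the document defines it."""
    available = [style.name for style in doc.styles]
    if style_name not in available:
        raise KeyError('style %r is not defined in this document' % style_name)
    paragraph = doc.paragraphs[index]
    paragraph.style = style_name
    return paragraph
```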
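d) Creating Word documents with non-default styles
Set up the styles in a blank document in Word first, then work on that document from Python.
e) Run attributes
f) Writing Word documents
Test program: test_1506.py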

import os
import docx
from docx.enum.text import WD_BREAK
from docx.shared import Cm, Inches

os.chdir('d:/temp')

doc = docx.Document('prj001.docx')

doc.add_heading('====Title====',0)
doc.add_heading('====Heading 1',1)
doc.add_paragraph('this is on the second')
# These indexes assume prj001.docx already has at least 13 paragraphs with text.
doc.paragraphs[5].runs[0].add_break()
doc.paragraphs[12].runs[0].add_break(WD_BREAK.PAGE)

# Giving both width and height stretches the image to exactly that size.
doc.add_picture('prj101.png',width=Inches(1),height=Cm(4))

doc.save('prj002.docx')

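doc.add_paragraph(text, style): appends a paragraph
doc.paragraphs[i].add_run(text, style): appends a run to paragraph i
doc.save(filename): saves the document
g) Adding headings
doc.add_heading(text, level): level 0 is the document title, levels 1 to 9 are headings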
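The writing calls above can be grouped into a small helper that fills a document from plain Python data. This is a sketch only, assuming `doc` behaves like a python-docx document; the function name `fill_document` and the shape of `sections` are choices made for illustration.

```python
def fill_document(doc, title, sections):
    """Add a title, then a level-1 heading and paragraphs per section.

    sections is a list of (heading text, [paragraph texts]) pairs.
    """
    doc.add_heading(title, 0)
    for heading, paragraphs in sections:
        doc.add_heading(heading, 1)
        for text in paragraphs:
            doc.add_paragraph(text)
    return doc
```

With python-docx installed, `fill_document(docx.Document(), 'Report', [('Intro', ['first line'])]).save('out.docx')` would write the result; the file name is arbitrary.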
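h) Adding line breaks and page breaks
run.add_break(): adds a line break (add_break() is a method of Run objects, not of the document)
run.add_break(docx.enum.text.WD_BREAK.PAGE): adds a page break
i) Adding images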
doc.add_picture(imgname,width,height)
width = docx.shared.Inches(1)
height = docx.shared.Cm(4)
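Passing both width and height forces the picture to that exact size, which can distort it; if only one of them is given, python-docx scales the other to keep the aspect ratio. The arithmetic behind that scaling is sketched below in plain Python. The function name `height_for_width` is hypothetical, and the only fact it relies on is that 1 inch equals 2.54 cm.

```python
CM_PER_INCH = 2.54


def height_for_width(pixel_width, pixel_height, target_width_inches):
    """Return the height in cm that keeps the image's aspect ratio."""
    if pixel_width <= 0 or pixel_height <= 0:
        raise ValueError('image dimensions must be positive')
    target_width_cm = target_width_inches * CM_PER_INCH
    return target_width_cm * pixel_height / pixel_width
```

For an 800 x 600 pixel image placed 1 inch wide, the proportional height is three quarters of 2.54 cm, well below the 4 cm used in the test program.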
4) Creating a PDF from a Word document
Test program: test_1507.py

# This script runs on Windows only,and you must have Word installed.
import win32com.client  # install with 'pip install pywin32==224'
import docx,os

os.chdir('d:/temp')

wordFilename = 'd:\\temp\\word001.docx'
pdfFilename = 'd:\\temp\\pdf010.pdf'

#doc = docx.Document()
# Code to create Word document goes here.
#doc.save(wordFilename)

wdFormatPDF = 17    # Word's numeric code for PDFs.
wordObj = win32com.client.Dispatch('Word.Application')
docObj = wordObj.Documents.Open(wordFilename)
docObj.SaveAs(pdfFilename,FileFormat=wdFormatPDF)
docObj.Close()
wordObj.Quit()

The pywin32 module: import win32com.client
wdFormatPDF = 17
wordObj = win32com.client.Dispatch('Word.Application')
docObj = wordObj.Documents.Open(filename)
docObj.SaveAs(filename, FileFormat=wdFormatPDF)
docObj.Close()
wordObj.Quit()
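The conversion steps above can be wrapped so that the document is closed even if saving fails. This is a minimal sketch, assuming `word_app` is the object returned by `win32com.client.Dispatch('Word.Application')`; the function name `convert_to_pdf` is hypothetical. Paths are made absolute because Word resolves relative paths against its own working directory, not the script's.

```python
import os

WD_FORMAT_PDF = 17  # Word's numeric code for the PDF format


def convert_to_pdf(word_app, word_path, pdf_path):
    """Open word_path in Word and save it as a PDF at pdf_path."""
    pdf_path = os.path.abspath(pdf_path)
    doc = word_app.Documents.Open(os.path.abspath(word_path))
    try:
        doc.SaveAs(pdf_path, FileFormat=WD_FORMAT_PDF)
    finally:
        doc.Close()  # always release the document, even on errors
    return pdf_path
```

The caller would still be responsible for `word_app.Quit()`, ideally in its own try/finally block.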