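14. Working with PDF and Word Documents
1) PDF document basics
PyPDF2 module: the latest release is 3.0, which differs substantially from version 1.26 used in the book. The old camelCase names were replaced by snake_case ones (for example PdfFileReader became PdfReader, getPage(n) became pages[n], and extractText() became extract_text()).
a) Extracting text from a PDF file
Test program: Test_1501.py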
import PyPDF2, os
os.chdir('d:/temp')
pdfReader = PyPDF2.PdfReader('python11.pdf')
# Decrypt only when the file is actually encrypted.
if pdfReader.is_encrypted:
    pdfReader.decrypt('123456')
pages = pdfReader.pages
print('page num %s\n' % len(pages))
for i in range(len(pages)):
    print(pages[i].extract_text())
    print('---------------Page %s--------------' % i)
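PyPDF2.PdfReader(filename) opens a PDF and returns a reader object
PdfReader.pages the pages of the document
len(PdfReader.pages) number of pages
pages[0].extract_text() extracts the text of a page
pages[0].rotate(angle) rotates the page clockwise; angle must be a multiple of 90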
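The calls listed above can be wrapped in a small helper. This is only a minimal sketch, assuming PyPDF2 3.x is installed; the names page_texts and read_pdf_texts are made up for illustration and are not part of PyPDF2.

```python
def page_texts(reader):
    """Return the text of every page of a PdfReader-like object as a list."""
    # Pages without a text layer (e.g. scans) typically yield an empty string;
    # "or ''" also guards against a None result.
    return [page.extract_text() or '' for page in reader.pages]


def read_pdf_texts(path):
    """Hypothetical convenience wrapper: open the PDF at path and extract its text."""
    import PyPDF2  # imported here so page_texts() can be used without PyPDF2
    return page_texts(PyPDF2.PdfReader(path))
```

A call such as read_pdf_texts('python11.pdf') would then return one string per page, which is easier to post-process than printed output.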
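b) Decrypting a PDF
PdfReader.is_encrypted True if the file is encrypted
PdfReader.decrypt(passwd) decrypts the file with the given password
PDFs that Word exports with a password are encrypted with AES; decrypting them requires the pycryptodome package:
pip install pycryptodome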
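The test program above ignores the result of decrypt(). As far as I know, decrypt() returns a falsy value (0) when the password does not match, so a wrong password can be detected early. A minimal sketch under that assumption; the helper name unlock is hypothetical.

```python
def unlock(reader, password):
    """Decrypt a PdfReader-like object in place.

    Returns False if the file was not encrypted, True after a successful
    decryption, and raises ValueError if the password is rejected.
    """
    if not reader.is_encrypted:
        return False
    # Assumption: decrypt() returns a falsy value (0) for a wrong password.
    if not reader.decrypt(password):
        raise ValueError('wrong password')
    return True
```

With a real file this would be used as unlock(PyPDF2.PdfReader('python11.pdf'), '123456') before reading any pages.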
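c) Creating PDF files
PyPDF2.PdfWriter() returns a writer object
page.merge_page(otherPage) overlays another page on this one; useful for watermarks
PdfWriter.add_page(page) adds a page
outfp = open(fn, 'wb') opens the output file in binary write mode
PdfWriter.write(outfp) writes the PDF to the file
PdfWriter.encrypt(passwd) encrypts the output; call it before write()
outfp.close() closes the output file
Test program: test_1502.py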
import PyPDF2, os
os.chdir('d:/temp')
pdfReader = PyPDF2.PdfReader('pjt001.pdf')
pdfWriter = PyPDF2.PdfWriter()
if pdfReader.is_encrypted:
    print('decrypt ....')
    pdfReader.decrypt('abcdefgh')
else:
    print('no password ')
for i in range(len(pdfReader.pages)):
    cuPage = pdfReader.pages[i]
    cuPage.rotate(90)
    # Overlay the first page on every page, as a watermark would be.
    cuPage.merge_page(pdfReader.pages[0], expand=True)
    pdfWriter.add_page(cuPage)
pdfWriter.encrypt('abcd1234')
with open('pjt002.pdf', 'wb') as outPdfFile:
    pdfWriter.write(outPdfFile)
2) Project: combining selected pages from multiple PDFs
Test program: test_1503.py
#! python3
# combinePdfs.py - Combines all the PDFs in the current working directory into a single PDF.
import PyPDF2, os
# Get all the PDF filenames.
os.chdir('d:/temp/waffle')
pdfFiles = []
for filename in os.listdir('.'):
    if filename.endswith('.pdf'):
        pdfFiles.append(filename)
pdfFiles.sort(key=str.lower)
pdfWriter = PyPDF2.PdfWriter()
# Loop through all the PDF files.
for filename in pdfFiles:
    print('merge %s ...' % filename)
    pdfReader = PyPDF2.PdfReader(filename)
    # Loop through all the pages (except the first) and add them.
    for pageNum in range(1, len(pdfReader.pages)):
        pageObj = pdfReader.pages[pageNum]
        pdfWriter.add_page(pageObj)
# Save the resulting PDF to a file.
with open('../allminutes.pdf', 'wb') as pdfOutput:
    pdfWriter.write(pdfOutput)
A file opened with "with ... as ..." is closed automatically when the block ends, so no explicit close() is needed.
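The automatic close can be checked directly. A small sketch using only the standard library; the temporary file and the bytes written are stand-ins for the real PDF output.

```python
import os
import tempfile

# Temporary location so the demo does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'demo.bin')

with open(path, 'wb') as out_file:
    out_file.write(b'%PDF-demo')   # stand-in for pdfWriter.write(out_file)
    inside = out_file.closed       # still open inside the block

after = out_file.closed            # closed as soon as the block is left
print(inside, after)               # prints: False True
```

The file is closed even if an exception is raised inside the block, which is the main advantage over a manual close().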
Test program: test_1504.py merges multiple image files into one PDF
import os
from PIL import Image
imageFolder = 'd:/temp/photo'
imageList = []
for imgName in os.listdir(imageFolder):
    if not imgName.endswith('.jpg'):
        continue
    imgPath = os.path.join(imageFolder, imgName)
    print('merge %s' % imgPath)
    # PDF pages cannot carry an alpha channel, so convert to RGB first.
    image = Image.open(imgPath).convert('RGB')
    imageList.append(image)
# The first image starts the PDF; the others are appended as further pages.
imageList[0].save('d:/temp/imageall.pdf', 'PDF', resolution=100.0, save_all=True, append_images=imageList[1:])
PIL module: pip install pillow
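The test program above picks up only lowercase .jpg names, in whatever order os.listdir() returns them, and fails on an empty folder. A hedged sketch of a more defensive variant; list_images and images_to_pdf are hypothetical helper names, and the extension list is an assumption.

```python
import os


def list_images(folder, extensions=('.jpg', '.jpeg', '.png')):
    """Return full paths of image files in folder, sorted case-insensitively."""
    names = [n for n in os.listdir(folder) if n.lower().endswith(extensions)]
    return [os.path.join(folder, n) for n in sorted(names, key=str.lower)]


def images_to_pdf(paths, pdf_path):
    """Save the given images as the pages of one PDF (requires Pillow)."""
    from PIL import Image  # imported lazily so list_images() works without Pillow
    if not paths:
        raise ValueError('no images to merge')
    pages = [Image.open(p).convert('RGB') for p in paths]
    pages[0].save(pdf_path, 'PDF', save_all=True, append_images=pages[1:])
```

Usage would be images_to_pdf(list_images('d:/temp/photo'), 'd:/temp/imageall.pdf').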
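3) Word documents
python-docx module: pip install python-docx
a) Reading Word documents
docx.Document(filename) returns a Document object
doc.paragraphs list of Paragraph objects, one per paragraph in the document
doc.paragraphs[i].text text of that paragraph
doc.paragraphs[i].runs list of Run objects; a new run starts wherever the character formatting changes, not at each space
doc.paragraphs[i].runs[j].text text of that run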
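To see where paragraphs and runs come from, it helps to look at the file itself: a .docx is a zip archive whose word/document.xml holds w:p (paragraph), w:r (run) and w:t (text) elements. The sketch below deliberately does not use python-docx; it reads that XML with the standard library only, and the function name runs_per_paragraph is made up. Unlike doc.paragraphs, it also visits paragraphs nested inside tables.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements (w:p, w:r, w:t).
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def runs_per_paragraph(docx_path):
    """Return one list per paragraph, each holding the text of its runs."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    result = []
    for para in root.iter(W + 'p'):
        runs = []
        for run in para.iter(W + 'r'):
            # A run may hold several w:t elements; join them into one string.
            runs.append(''.join(t.text or '' for t in run.iter(W + 't')))
        result.append(runs)
    return result
```

Joining the strings of one inner list gives what python-docx reports as the paragraph's text.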
b) Getting the full text from a .docx file
Test program: test_1505.py
#! python3
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Prefix each paragraph so paragraph boundaries are visible.
        fullText.append('----' + para.text)
    return '\n'.join(fullText)
#print(getText('d:/temp/prj001.docx'))
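Note: loading and reloading modules
Loading a module: change to the directory that contains the .py file, then run import fname.
Reloading a module: run del sys.modules['fname'] and then import fname again, or call importlib.reload(fname).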
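The reload step can be tried without leaving the interpreter. A minimal sketch using importlib.reload(); the module name demo_mod and the temporary folder are made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile

folder = tempfile.mkdtemp()
module_path = os.path.join(folder, 'demo_mod.py')  # hypothetical module
sys.path.insert(0, folder)  # same effect as starting Python in that folder

with open(module_path, 'w') as f:
    f.write('VALUE = 1\n')
importlib.invalidate_caches()   # make sure the new file is noticed
import demo_mod
first = demo_mod.VALUE

# Simulate editing the file; the changed size makes any cached bytecode stale.
with open(module_path, 'w') as f:
    f.write('VALUE = 2  # edited\n')
demo_mod = importlib.reload(demo_mod)
second = demo_mod.VALUE
print(first, second)            # prints: 1 2
```

Names that were copied out earlier with "from demo_mod import VALUE" keep their old value after a reload; only attribute access through the module object sees the new code.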
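c) Styling Paragraph and Run objects
The style attribute
Only styles that already exist in the document can be applied by name. The book states that new styles cannot be created; recent python-docx versions do provide doc.styles.add_style(), but preparing the styles in Word (see d) is usually simpler.
d) Creating Word documents with non-default styles
Set up the styles in an empty document with Word first, then open and fill that document from Python.
e) Run attributes
f) Writing Word documents
Test program: test_1506.py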
import docx, os
from docx.enum.text import WD_BREAK
from docx.shared import Inches, Cm
os.chdir('d:/temp')
doc = docx.Document('prj001.docx')
doc.add_heading('====Title====', 0)
doc.add_heading('====Heading 1', 1)
doc.add_paragraph('this is on the second')
# The indexes below assume prj001.docx has at least 13 paragraphs.
doc.paragraphs[5].runs[0].add_break()                 # line break
doc.paragraphs[12].runs[0].add_break(WD_BREAK.PAGE)   # page break
doc.add_picture('prj101.png', width=Inches(1), height=Cm(4))
doc.save('prj002.docx')
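doc.add_paragraph(text, style) adds a paragraph
doc.paragraphs[i].add_run(text, style) adds a run to a paragraph
doc.save(filename) saves the document
g) Adding headings
doc.add_heading(text, level) level 0 is the Title style, levels 1 to 9 are heading levels
h) Adding line breaks and page breaks
run.add_break() adds a line break (add_break() is a method of Run, not of Document)
run.add_break(docx.enum.text.WD_BREAK.PAGE) adds a page break
i) Adding pictures
doc.add_picture(imgname, width, height)
width = docx.shared.Inches(1)
height = docx.shared.Cm(4)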
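Passing both width and height, as in the test program, stretches the picture unless the two values happen to match its aspect ratio; passing only one of them lets python-docx scale the other. The arithmetic behind that scaling can be sketched without the library. python-docx lengths are integers in English Metric Units (914400 per inch, 360000 per centimetre); the helper name and the 800x600 pixel size below are assumptions for illustration.

```python
EMU_PER_INCH = 914400   # what docx.shared.Inches(1) evaluates to
EMU_PER_CM = 360000     # what docx.shared.Cm(1) evaluates to


def scaled_height_emu(pixel_width, pixel_height, target_width_emu):
    """Height in EMU that keeps the aspect ratio for the given target width."""
    return round(target_width_emu * pixel_height / pixel_width)


width = 1 * EMU_PER_INCH                      # 1 inch wide
height = scaled_height_emu(800, 600, width)   # hypothetical 800x600 image
print(height, height / EMU_PER_CM)            # prints: 685800 1.905
```

So an 800x600 picture placed 1 inch wide is about 1.9 cm tall when undistorted, far from the 4 cm requested in the test program.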
4) Creating a PDF from a Word document
Test program: test_1507.py
# This script runs on Windows only,and you must have Word installed.
import win32com.client # install with 'pip install pywin32==224'
import docx,os
os.chdir('d:/temp')
wordFilename = 'd:\\temp\\word001.docx'
pdfFilename = 'd:\\temp\\pdf010.pdf'
#doc = docx.Document()
# Code to create Word document goes here.
#doc.save(wordFilename)
wdFormatPDF = 17 # Word's numeric code for PDFs.
wordObj = win32com.client.Dispatch('Word.Application')
docObj = wordObj.Documents.Open(wordFilename)
docObj.SaveAs(pdfFilename,FileFormat=wdFormatPDF)
docObj.Close()
wordObj.Quit()
pywin32 module: import win32com.client
wdFormatPDF = 17
wordObj = win32com.client.Dispatch('Word.Application')
docObj = wordObj.Documents.Open(filename)
docObj.SaveAs(filename, FileFormat=wdFormatPDF)
docObj.Close()
wordObj.Quit()
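Word runs as a separate process with its own working directory, so it seems safer to hand it absolute paths and to make sure Quit() is called even when the conversion fails. A hedged sketch along those lines; pdf_path_for and word_to_pdf are hypothetical helpers, and word_to_pdf runs only on Windows with Word and pywin32 installed.

```python
import os

WD_FORMAT_PDF = 17  # Word's numeric code for PDF output


def pdf_path_for(word_path):
    """Return the absolute path of a .pdf file next to word_path."""
    base, _ext = os.path.splitext(os.path.abspath(word_path))
    return base + '.pdf'


def word_to_pdf(word_path):
    """Convert a Word document to PDF through the Word application."""
    import win32com.client  # from pywin32; imported lazily (Windows only)
    pdf_path = pdf_path_for(word_path)
    word = win32com.client.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(os.path.abspath(word_path))
        try:
            doc.SaveAs(pdf_path, FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close()
    finally:
        word.Quit()   # do not leave a hidden Word process behind
    return pdf_path
```

Calling word_to_pdf('d:/temp/word001.docx') would write word001.pdf into the same folder and return its path.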