Automate the Boring Stuff with Python (《Python编程快速上手》)
Working with PDF and Word Documents
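The module used for working with PDFs is PyPDF2.
Word documents are handled by the python-docx module. The package to install is python-docx, but the import statement is import docx.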
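Because the install name (python-docx) differs from the import name (docx), it can help to check which import names are actually available before running the examples. The following is a minimal sketch using only the standard library; the helper name is_importable is made up for illustration and is not part of either module.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this import name can be found."""
    return importlib.util.find_spec(module_name) is not None


# (name used with pip, name used with import)
packages = [("PyPDF2", "PyPDF2"), ("python-docx", "docx")]

for pip_name, import_name in packages:
    if is_importable(import_name):
        print(f"import {import_name}: available")
    else:
        print(f"import {import_name}: missing (try: pip install {pip_name})")
```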
1. Extracting text from a PDF
import PyPDF2
pdfFileObj = open('meetingminutes.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
>> 19
pageObj = pdfReader.getPage(0)
pageObj.extractText()
>> 'OOFFFFIICCIIAALL BBOOAARRDD MMIINNUUTTEESS Meeting of \nMarch 7\n, 2014\n \n The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n '
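The calls above read a single page. To read a whole document, the same calls can be wrapped in a loop over numPages. This is a sketch that assumes the PyPDF2 1.x interface used in these notes (PdfFileReader, numPages, getPage, extractText); PyPDF2 3.0 and later renamed these to PdfReader, reader.pages and extract_text(). The function name extract_all_text is hypothetical.

```python
def extract_all_text(pdf_reader):
    """Return the text of every page, joined with newlines.

    Accepts any object that exposes numPages and getPage(i).extractText(),
    i.e. the PyPDF2 1.x PdfFileReader interface.
    """
    pages = []
    for page_num in range(pdf_reader.numPages):
        page_obj = pdf_reader.getPage(page_num)
        pages.append(page_obj.extractText())
    return "\n".join(pages)


# Hypothetical usage with the file from the notes:
# import PyPDF2
# with open('meetingminutes.pdf', 'rb') as pdf_file_obj:
#     pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
#     print(extract_all_text(pdf_reader))
```

As the sample output shows, the extracted text may contain doubled characters and stray line breaks, so it usually needs cleaning up before further use.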
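2. Decrypting a PDF
Some PDF documents are encrypted to keep others from reading them; they can be read only after the password is supplied when the document is opened.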
import PyPDF2
pdfFile = open('encrypted.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)
# True means the PDF is encrypted
pdfReader.isEncrypted
>>True
# Call decrypt() with the password string; a return value of 1 means the password is correct, and the document can then be read
pdfReader.decrypt('rosebud')
>> 1
pdfReader.getPage(0).extractText()
>> 'OOFFFFIICCIIAALL BBOOAARRDD MMIINNUUTTEESS Meeting of \nMarch 7\n, 2014\n \n The Board of Elementary and Secondary Education shall provide leadership and \ncreate policies for education that expand opportunities for children, empower \nfamilies and communities, and advance Louisiana in an increasingly \ncompetitive glob\nal market.\n BOARD \n of ELEMENTARY\n and \n SECONDARY\n EDUCATION\n '
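Note: decrypt() decrypts only the PdfFileReader object, not the actual PDF file. After the program terminates, the file on disk is still encrypted, and the program must call decrypt() again the next time it runs.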
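Since decrypt() has to be called on every run, the check and the call can be bundled into one helper. This is a sketch under the PyPDF2 1.x interface shown above, where decrypt() returns 0 when the password does not match; the helper name ensure_decrypted and the choice to raise ValueError are assumptions made for illustration.

```python
def ensure_decrypted(pdf_reader, password):
    """Decrypt pdf_reader in memory if it is encrypted.

    Returns the same reader object. Raises ValueError when the password
    is rejected. The file on disk is never modified.
    """
    if not pdf_reader.isEncrypted:
        return pdf_reader  # already readable, nothing to do
    # decrypt() returns 0 when the password does not match.
    if pdf_reader.decrypt(password) == 0:
        raise ValueError("password rejected, PDF is still encrypted")
    return pdf_reader


# Hypothetical usage with the file from the notes:
# import PyPDF2
# with open('encrypted.pdf', 'rb') as pdf_file:
#     pdf_reader = ensure_decrypted(PyPDF2.PdfFileReader(pdf_file), 'rosebud')
#     print(pdf_reader.getPage(0).extractText())
```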
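3. Creating a PDF
A PdfFileWriter object can create a new PDF file. However, PyPDF2 cannot write arbitrary text to a PDF; its writing ability is limited to copying pages from other PDFs, rotating pages, overlaying pages, and encrypting files.
The module does not allow a PDF to be edited directly. You must create a new PDF and then copy content into it from an existing document.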
import PyPDF2
# Open the PDFs as File objects, then create PdfFileReader objects to read data from the opened files
pdf1File = open('meetingminutes.pdf','rb')
pdf2File = open('meetingminutes2.pdf','rb')
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
pdf2Reader = PyPDF2.PdfF