I am using Python 3.4 and need to extract all the text from a PDF and then use it for text processing.
All the answers I have seen suggest options for Python 2.7.
I need something in Python 3.4.
Bonson
Solution
You need to install the PyPDF2 module to work with PDFs in Python 3.4. PyPDF2 cannot extract images, charts, or other media, but it can extract text and return it as a Python string. To install it, run pip install PyPDF2 from the command line. The module name is case-sensitive, so type the 'y' in lowercase and all the other letters in uppercase.
>>> import PyPDF2
>>> pdfFileObj = open('my_file.pdf','rb') #'rb' for read binary mode
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
56
>>> pageObj = pdfReader.getPage(9) #page indices start at 0, so 9 is the tenth page
>>> pageObj.extractText()
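The last statement returns all the text available on the requested page of 'my_file.pdf' as a string. Page indices are zero-based, so getPage(9) is the tenth page of the document, not the ninth.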
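Since the question asks for all the text rather than a single page, the same calls can be put in a loop over every page. The following is only a minimal sketch, assuming the same PyPDF2 API used above (PdfFileReader, numPages, getPage, extractText); PyPDF2 3.0 and later removed these names in favour of PdfReader, reader.pages and extract_text(). The function names extract_all_text and extract_text_from_file are made up for illustration and are not part of PyPDF2.

```python
def extract_all_text(pdf_reader):
    """Return the text of every page, joined with newlines.

    `pdf_reader` is assumed to expose numPages and getPage(), as
    PyPDF2.PdfFileReader does in the session above.
    """
    pages = []
    for index in range(pdf_reader.numPages):  # page indices are zero-based
        pages.append(pdf_reader.getPage(index).extractText())
    return "\n".join(pages)


def extract_text_from_file(path):
    """Open the PDF at `path` and return all of its text as one string."""
    # Hypothetical helper: PyPDF2 is imported here so that
    # extract_all_text() can be used without the module installed.
    import PyPDF2

    with open(path, "rb") as pdf_file:  # 'rb' for read binary mode
        return extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```

Calling extract_text_from_file('my_file.pdf') should then give a single string ready for text processing. How clean that text is depends on how the PDF was produced; scanned pages with no text layer yield no text.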