https://towardsdatascience.com/how-to-extract-keywords-from-pdfs-and-arrange-in-order-of-their-weights-using-python-841556083341
Problem Statement -
Given a particular PDF or text document, how can we extract keywords and arrange them in order of their weight using Python?
Dependencies :
(I used Python 2.7.15 for this tutorial.)
You will need the libraries listed below installed on your machine for this task. In case you don't have them, the install command for each dependency is given right beneath it; type it at the command prompt on Windows or in the terminal on macOS.
- PyPDF2 (To convert simple, text-based PDF files into text readable by Python)
pip install PyPDF2
- textract (To convert non-trivial, scanned PDF files into text readable by Python)
pip install textract
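- re (To find keywords. It ships with Python's standard library, so nothing needs to be installed. Note that pip install regex installs a different, third-party module named regex, which this tutorial does not import.)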
Note: I attempted three approaches for this task. The libraries above suffice for approach 1. I have only touched upon the two other approaches, which I found online; treat them as alternatives. Below is the Jupyter notebook with all three approaches. Take a look!
Jupyter Notebook :
All necessary remarks are denoted with ‘#’.
- Approach 1 unboxed
Step 1: Import all libraries.
Step 2: Convert PDF file to txt format and read data.
Step 3: Use “.findall()” function of regular expressions to extract keywords.
Step 4: Save list of extracted keywords in a DataFrame.
Step 5: Apply the concept of TF-IDF to calculate the weight of each keyword.
Step 6: Save the results in a DataFrame and use “.sort_values()” to arrange the keywords in order of weight.
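Steps 3 and 4 can be sketched roughly as below. This is a minimal plain-Python illustration, not the notebook's exact code: the sample string, the regular expression pattern and the use of a Counter in place of a DataFrame are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the text extracted from the PDF.
sample_text = ("Java is a programming language. "
               "Java programs run on the Java Virtual Machine.")

# Step 3: re.findall() returns every non-overlapping match of the pattern.
# Here the (assumed) pattern picks out runs of lowercase letters as keywords.
words = re.findall(r"[a-z]+", sample_text.lower())

# Step 4 (plain-Python stand-in for the DataFrame): count each keyword.
counts = Counter(words)

# Show the most frequent keywords first.
print(counts.most_common(3))
```

The same counts could be loaded into a pandas DataFrame if you want to keep working with them in tabular form, as the notebook does.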
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re
Reading Text
- converted PDF file to txt format for better pre-processing
filename ='JavaBasics-notes.pdf'
pdfFileObj = open(filename, 'rb')  # open the PDF in binary mode so it can be read
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # readable object that will be parsed
num_pages = pdfReader.numPages  # knowing the number of pages lets us loop over all of them
count = 0
text = ""
while count < num_pages:  # the while loop reads each page in turn
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# PyPDF2 cannot read scanned files, so check whether it returned any words.
# If it returned nothing, run the OCR library textract to convert the
# scanned/image-based PDF file into text. textract.process() takes a local
# file path, so the same file is passed here.
if text == "":
    text = textract.process(filename, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived from our PDF file.
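With the text in hand, the weighting and ordering described in Steps 5 and 6 might look like the sketch below. It is a plain-Python illustration under stated assumptions rather than the notebook's code: the function name tfidf_weights is hypothetical, each sentence is treated as one "document" for the IDF term, and sorted() stands in for the DataFrame's .sort_values().

```python
import math
import re
from collections import Counter


def tfidf_weights(text):
    """Return (keyword, weight) pairs sorted by descending TF-IDF weight."""
    # Assumption: every sentence counts as one "document" for the IDF term.
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s) for s in sentences]

    all_words = [w for sent in tokenized for w in sent]
    total = float(len(all_words))   # float keeps the division exact on Python 2
    n_docs = float(len(tokenized))

    tf = Counter(all_words)                                    # term counts
    df = Counter(w for sent in tokenized for w in set(sent))   # sentences containing the term

    # weight = (term frequency) * log(number of sentences / sentences containing the term)
    weights = {w: (tf[w] / total) * math.log(n_docs / df[w]) for w in tf}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample text; in the tutorial this would be the extracted PDF text.
ranked = tfidf_weights("Java is simple. Java is portable. Python is simple.")
for word, weight in ranked:
    print(word, round(weight, 4))
```

Under this definition, a word that occurs in every sentence (such as "is" in the sample) gets a weight of zero, which is how TF-IDF pushes very common words to the bottom of the ranking.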
For more details find GitHub repo HERE !
References :
2. Medium post for PDF to Text Conversion
3. keyword extraction tutorial
I hope you find this tutorial fruitful and worth reading. Also, I am sure there are tons of other approaches with which you can perform this task. Do share them in the comments section if you have come across any.
Code for the Masked Word Cloud :
Find GitHub repo HERE !
# modules for generating the word cloud
from os import path, getcwd
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline
d = getcwd()
text = open('nlp.txt','r').read()
#Image link = 'https://produto.mercadolivre.com.br/MLB-693994282-adesivo-decorativo-de-parede-batman-rosto-e-simbolo-grande-_JM'
mask = np.array(Image.open(path.join(d, "batman.jpg")))
wc = WordCloud(background_color="black", max_words=3000, mask=mask,
               max_font_size=30, min_font_size=0.00000001,
               random_state=42)
wc.generate(text)
plt.figure(figsize=[100,80])
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig('bat_wordcloud.jpg',bbox_inches='tight',pad_inches=0.3)