Data mining in pdf

weixin_34167043

Published 2018-10-26 11:15:00


Original link: http://www.cnblogs.com/rabbittail/p/9855250.html


https://towardsdatascience.com/how-to-extract-keywords-from-pdfs-and-arrange-in-order-of-their-weights-using-python-841556083341

Problem Statement -

Given a particular PDF or text document, how can we extract keywords and arrange them in order of their weight using Python?

Dependencies :

(I have used Python 2.7.15 version for this tutorial.)

You will need the libraries listed below installed on your machine. If you don't have them, each dependency is followed by its install command, which you can type at the Command Prompt on Windows or in the Terminal on macOS.

PyPDF2 (To convert simple, text-based PDF files into text readable by Python)

pip install PyPDF2

textract (To convert non-trivial, scanned PDF files into text readable by Python)

pip install textract

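re (To find keywords)

No install is needed: re is part of the Python standard library. (pip install regex installs a separate third-party package named regex, which the code here does not import.)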
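Before opening the notebook, it can help to confirm that each library actually imports. The snippet below is a minimal sketch of such a check using only the standard library; the helper name check_dependencies is made up for this illustration and does not come from the original tutorial.

```python
import importlib


def check_dependencies(names):
    """Return a dict mapping each module name to True if it imports, else False.

    check_dependencies is a hypothetical helper written for this example.
    """
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # The module is not installed (or one of its own imports is missing)
            status[name] = False
    return status
```

Calling check_dependencies(["PyPDF2", "textract", "re"]) returns one entry per library; any entry that is False points at a package that still needs to be installed with pip.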

Note: I attempted three approaches for this task. The libraries above suffice for approach 1. I have only touched on the two other approaches, which I found online; treat them as alternatives. Below is the Jupyter notebook with all three approaches. Take a look!

Jupyter Notebook :

All necessary remarks are denoted with ‘#’.

Approach 1 unboxed

Step 1: Import all libraries.

Step 2: Convert PDF file to txt format and read data.

Step 3: Use “.findall()” function of regular expressions to extract keywords.

Step 4: Save list of extracted keywords in a DataFrame.

Step 5: Apply the concept of TF-IDF to calculate the weight of each keyword.

Step 6: Save the results in a DataFrame and use “.sort_values()” to arrange the keywords in order.

import pandas as pd
import numpy as np
import PyPDF2
import textract
import re

Reading Text

Convert the PDF file to text format for easier pre-processing.


filename = 'JavaBasics-notes.pdf'

# open() in binary mode lets PyPDF2 read the raw PDF bytes
pdfFileObj = open(filename, 'rb')
# pdfReader is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Knowing the number of pages lets us loop over all of them
num_pages = pdfReader.numPages

count = 0
text = ""
# The while loop reads each page and appends its text
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

# PyPDF2 cannot read scanned files, so an empty result means we need OCR.
# In that case, run textract with Tesseract to convert the scanned/image-based
# PDF into text. textract.process expects a local file path, so we pass
# filename rather than a URL.
if text == "":
    text = textract.process(filename, method='tesseract', language='eng')

# Now the text variable contains all the text derived from our PDF file.
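The notebook excerpt above stops after step 2. What follows is a rough sketch of how steps 3 to 6 could look, not the author's actual code. It makes several assumptions that are not in the original: each page (or any other chunk of text) is treated as one "document" for the IDF part, the IDF is the smoothed variant log(N/df) + 1, keywords are simply lowercase alphabetic runs of three or more letters, and plain Counter objects plus sorted() stand in for the pandas DataFrame and “.sort_values()” so the example needs only the standard library. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def extract_keywords(text):
    """Step 3: use re.findall to pull lowercase words of 3+ letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def tfidf_weights(documents):
    """Steps 4-5: TF-IDF weight per keyword, summed over all documents.

    documents is a list of strings, e.g. one string per PDF page (assumption).
    """
    tokenized = [extract_keywords(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each keyword occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = Counter()
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty pages to avoid dividing by zero
        for word, count in Counter(tokens).items():
            tf = count / float(len(tokens))
            # Smoothed IDF (assumption): the +1 keeps words that occur in
            # every document from getting a weight of zero.
            idf = math.log(n_docs / float(doc_freq[word])) + 1.0
            weights[word] += tf * idf
    return weights


def ranked_keywords(documents):
    """Step 6: (keyword, weight) pairs, highest weight first, ties by name."""
    weights = tfidf_weights(documents)
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
```

To try it on the text from step 2, one option is to collect each page's text in a list instead of one long string and pass that list to ranked_keywords. With pandas installed, the resulting list of pairs can be loaded into a DataFrame and ordered with “.sort_values()” as in step 6.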

For more details find GitHub repo HERE !

References :

1. www.wikipedia.org

2. Medium post for PDF to Text Conversion

3. keyword extraction tutorial

4. Regular expressions

I hope you find this tutorial fruitful and worth reading. I am sure there are plenty of other approaches to this task, so do share them in the comment section if you have come across any.

Code for the Masked Word Cloud :

Find GitHub repo HERE !

# modules for generating the word cloud 
from os import path, getcwd 
from PIL import Image 
import numpy as np 
import matplotlib.pyplot as plt 
from wordcloud import WordCloud

%matplotlib inline 
d = getcwd()  
text = open('nlp.txt','r').read()

#Image link = 'https://produto.mercadolivre.com.br/MLB-693994282-adesivo-decorativo-de-parede-batman-rosto-e-simbolo-grande-_JM'

mask = np.array(Image.open(path.join(d, "batman.jpg")))

wc = WordCloud(background_color="black", max_words=3000, mask=mask,
               max_font_size=30, min_font_size=0.00000001,
               random_state=42)
wc.generate(text) 
plt.figure(figsize=[100,80]) 
plt.imshow(wc, interpolation="bilinear") 
plt.axis("off") 
plt.savefig('bat_wordcloud.jpg',bbox_inches='tight',pad_inches=0.3)
