Data Mining in PDF

https://towardsdatascience.com/how-to-extract-keywords-from-pdfs-and-arrange-in-order-of-their-weights-using-python-841556083341

 

Problem Statement:

Given a particular PDF/text document, how do we extract keywords and arrange them in order of their weights using Python?

Dependencies:

(I have used Python 2.7.15 for this tutorial.)

You will need the libraries mentioned below installed on your machine for this task. In case you don't have them, I have included the install command for each dependency, which you can type at the command prompt on Windows or in the terminal on macOS.

  • PyPDF2 (to convert simple, text-based PDF files into text readable by Python)
pip install PyPDF2
  • textract (to convert non-trivial, scanned PDF files into text readable by Python)
pip install textract
  • re (to find keywords; this is part of the Python standard library, so no installation is needed)

Note: I have attempted three approaches for this task. The libraries above suffice for approach 1. I have only touched upon the two other approaches, which I found online; treat them as alternatives. Below is the Jupyter notebook with all three approaches. Take a look!

Jupyter Notebook:

All necessary remarks are denoted with ‘#’.

  • Approach 1 unboxed

Step 1: Import all libraries.

Step 2: Convert the PDF file to txt format and read the data.

Step 3: Use the “.findall()” function of regular expressions to extract keywords.

Step 4: Save the list of extracted keywords in a DataFrame.

Step 5: Apply the concept of TF-IDF to calculate a weight for each keyword.

Step 6: Save the results in a DataFrame and use “.sort_values()” to arrange the keywords in order of weight.

 
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re


Reading Text

 
  • converted the PDF file to txt format for better pre-processing

filename = 'JavaBasics-notes.pdf'

pdfFileObj = open(filename, 'rb')  # open allows you to read the file
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # readable object that will be parsed
num_pages = pdfReader.numPages  # the page count lets us parse through all the pages
count = 0
text = ""
while count < num_pages:  # read each page
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# PyPDF2 cannot read scanned files, so if it returned no words we fall back
# to the OCR library textract to convert scanned/image-based PDFs into text.
if text == "":
    text = textract.process('http://bit.ly/epo_keyword_extraction_document',
                            method='tesseract', language='eng')
# Now the text variable contains all the text derived from our PDF file.
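One wrinkle the snippet above glosses over: textract.process returns bytes, while PyPDF2 returns str. Normalizing the fallback result to str keeps the later regex step from failing; the byte string below is just a stand-in for real textract output:

```python
import re

# Stand-in for what textract.process might return (bytes, not str)
raw = b"Java was designed by James Gosling."
text = raw.decode("utf-8") if isinstance(raw, bytes) else raw

# Preview of Step 3: pull word keywords out of the decoded text
keywords = re.findall(r"[A-Za-z]+", text)
print(keywords)
```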


For more details, find the GitHub repo HERE!

References:

  1. www.wikipedia.org
  2. Medium post for PDF to Text Conversion
  3. Keyword extraction tutorial
  4. Regular expressions

I hope you find this tutorial fruitful and worth reading. I am sure there are tons of other approaches for this task; do share them in the comments section if you have come across any.

Code for the Masked Word Cloud:

Find the GitHub repo HERE!

# modules for generating the word cloud 
from os import path, getcwd
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline 
d = getcwd()
text = open('nlp.txt','r').read()
#Image link = 'https://produto.mercadolivre.com.br/MLB-693994282-adesivo-decorativo-de-parede-batman-rosto-e-simbolo-grande-_JM'  
mask = np.array(Image.open(path.join(d, "batman.jpg")))

wc = WordCloud(background_color="black", max_words=3000, mask=mask,
               max_font_size=30, min_font_size=0.00000001,
               random_state=42)
wc.generate(text)
plt.figure(figsize=[100,80])
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig('bat_wordcloud.jpg',bbox_inches='tight',pad_inches=0.3)

Reposted from: https://www.cnblogs.com/rabbittail/p/9855250.html
