rake-nltk
RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that determines key phrases in a body of text by analyzing the frequency with which words appear and their co-occurrence with other words in the text.
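To make the frequency/co-occurrence idea concrete, here is a minimal sketch of RAKE-style scoring in plain Python. It is illustrative only, not the rake-nltk implementation: the tiny stopword set and the `rake_sketch` name are assumptions for the example, while the library itself uses NLTK's full English stopword list.

```python
import re
from collections import defaultdict

# Illustrative stopword set; rake-nltk uses NLTK's full English list.
STOPWORDS = {"of", "the", "is", "a", "an", "and", "in", "over", "are", "for", "to"}

def rake_sketch(text):
    # 1. Split the text into candidate phrases at stopwords and punctuation.
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    # 2. Score each word as degree(word) / frequency(word), where degree
    #    counts co-occurrences within candidate phrases (including itself).
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # 3. A phrase's score is the sum of its word scores; rank highest first.
    scored = [(sum(word_score[w] for w in p), " ".join(p)) for p in phrases]
    return sorted(scored, reverse=True)

print(rake_sketch(
    "Compatibility of systems of linear constraints over the set of natural numbers"
))
```

Multi-word phrases like "linear constraints" outrank single words because each member word picks up degree from its neighbors, which is the core intuition behind RAKE.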
Setup
Using pip
pip install rake-nltk
Directly from the repository
git clone https://github.com/csurfer/rake-nltk.git
python rake-nltk/setup.py install
Quick start
from rake_nltk import Rake
# Uses stopwords for English from NLTK, and all punctuation characters by
# default
r = Rake()
# Extraction given the text.
r.extract_keywords_from_text(<text to process>)
# Extraction given the list of strings where each string is a sentence.
r.extract_keywords_from_sentences(<list of sentences>)
# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()
# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()
Debugging Setup
If you see a stopwords error, it means that you do not have the stopwords corpus downloaded from NLTK. You can download it using the command below.
python -c "import nltk; nltk.download('stopwords')"
References
Why did I choose to implement it myself?
It is extremely fun to implement algorithms by reading papers. It is the digital equivalent of DIY kits.
There are some rather popular implementations out there, in Python (aneesha/RAKE) and Node (waseem18/node-rake), but neither seemed to use the power of NLTK. By making NLTK an integral part of the implementation, I get the flexibility and power to extend it in other creative ways later, if I see fit, without having to implement everything myself.
I plan to use it in my other pet projects to come, and I wanted it to be modular and tunable; this way I have complete control.
Contributing
Bug Reports and Feature Requests
Please use the issue tracker for reporting bugs or requesting features.
Development
Pull requests are most welcome.
Buy the developer a cup of coffee!
If you found the utility helpful, you can buy me a cup of coffee using