RAKE, i.e. Rapid Automatic Keyword Extraction.
The algorithm is described in Chapter 1 of M. W. Berry and J. Kogan (Eds.), Text Mining: Applications and Theory, John Wiley and Sons, Ltd.
https://www.airpair.com/nlp/keyword-extraction-tutorial gives a detailed introduction to keyword extraction.
https://github.com/aneesha/RAKE is the Python source code of RAKE.
https://github.com/zelandiya/RAKE-tutorial is a modified version by "a_medelyan".
Based on that tutorial page, I added the following comments to rake_tutorial.py.
import rake
import operator
# EXAMPLE ONE - SIMPLE
stoppath = "SmartStoplist.txt"
'''
# 1. initialize RAKE by providing a path to a stopwords file
rake_object = rake.Rake(stoppath, 5, 3, 4)  # the arguments mean: (1) each word has at least 5 characters, (2) each phrase has at most 3 words, (3) each keyword appears in the text at least 4 times (a rough sketch of these filters follows after this commented-out block)
# 2. run RAKE on a given text
sample_file = open("data/docs/fao_test/w2167e.txt", 'r')
text = sample_file.read()
keywords = rake_object.run(text) # this command can output all the keywords and their scores
# 3. print results
print "Keywords:", keywords
print "----------" '''
# EXAMPLE TWO - BEHIND THE SCENES (from https://github.com/aneesha/RAKE/rake.py)
# initialize RAKE by providing a path to a stopwords file
rake_object = rake.Rake(stoppath)
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility " \
"of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. " \
"Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating"\
" sets of solutions for all types of systems are given. These criteria and the corresponding algorithms " \
"for constructing a minimal supporting set of solutions can be used in solving all the considered types of " \
"systems and systems of mixed types."
# Split text into sentences
sentenceList = rake.split_sentences(text)  # the text is split into sentences on punctuation marks, including commas and periods
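# Rough illustration of the splitting step (my sketch; the real split_sentences
# uses a broader delimiter set, e.g. ! ? ; : and dashes as well):
import re
rough_sentences = [s.strip() for s in re.split(r'[.,!?;:\t]', text) if s.strip()]
print "Rough split (illustration only):", rough_sentences[:2]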
for sentence in sentenceList:
    print "Sentence:", sentence
# generate candidate keywords
stopwordpattern = rake.build_stop_word_regex(stoppath)
phraseList = rake.generate_candidate_keywords(sentenceList, stopwordpattern)  # the phrases are the candidate keywords
# this method does not work for phrases in which the boundary characters (stop words or punctuation) are part of the actual phrase (e.g. .Net or Dr. Who);
# improvements could be made here
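# Rough illustration of what generate_candidate_keywords does (my understanding,
# not the library's exact code): stop words are replaced by a delimiter and every
# remaining run of words inside a sentence becomes one candidate phrase.
import re
split_on_stopwords = re.sub(stopwordpattern, '|', sentenceList[0].lower())
print "Candidates from the first sentence (illustration):", \
    [p.strip() for p in split_on_stopwords.split('|') if p.strip()]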
print "Phrases:", phraseList
# calculate individual word scores
wordscores = rake.calculate_word_scores(phraseList)
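# Background on the scoring (from the RAKE chapter): each word w is scored as
# deg(w)/freq(w), where freq(w) counts occurrences of w in the candidate phrases
# and deg(w) additionally credits co-occurrence with the other words of those
# phrases. A simplified re-computation for illustration (assumption: it mirrors
# calculate_word_scores up to tokenization details):
from collections import defaultdict
freq = defaultdict(int)
degree = defaultdict(int)
for phrase in phraseList:
    words = phrase.split()
    for word in words:
        freq[word] += 1
        degree[word] += len(words) - 1   # links to the other words in this phrase
illustrative_word_scores = dict((w, (degree[w] + freq[w]) / float(freq[w])) for w in freq)
print "Illustrative score for 'linear':", illustrative_word_scores.get('linear')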
# generate candidate keyword scores
keywordcandidates = rake.generate_candidate_keyword_scores(phraseList, wordscores)
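# For reference: a candidate phrase's score is simply the sum of its member
# words' scores (a simplified illustration of this step):
illustrative_phrase_scores = {}
for phrase in phraseList:
    illustrative_phrase_scores[phrase] = sum(wordscores.get(word, 0) for word in phrase.split())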
# One issue here is that the candidates are not normalized in any way.
# As a result we may have keywords that look nearly identical: small scale production and small scale producers, or skim milk powder and skimmed milk powder.
# Ideally, a keyword extraction algorithm should apply stemming and other ways of normalizing keywords first.
# Applying stemming before keyword extraction would be another possible improvement (see the sketch below).
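# Possible improvement (my sketch, not part of rake.py; assumes NLTK and its
# Porter stemmer are available): stem each candidate so near-duplicates such as
# "skim milk powder" and "skimmed milk powder" collapse to one key, keeping the
# higher score.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
normalized_candidates = {}
for phrase, score in keywordcandidates.items():
    stemmed_key = " ".join(stemmer.stem(word) for word in phrase.split())
    normalized_candidates[stemmed_key] = max(score, normalized_candidates.get(stemmed_key, 0))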
for candidate in keywordcandidates.keys():
    print "Candidate: ", candidate, ", score: ", keywordcandidates.get(candidate)
# sort candidates by score to determine top-scoring keywords
sortedKeywords = sorted(keywordcandidates.iteritems(), key=operator.itemgetter(1), reverse=True)
totalKeywords = len(sortedKeywords)
# for example, you could just take the top third as the final keywords
for keyword in sortedKeywords[0:(totalKeywords / 3)]:  # note that the final keywords are the top third by score
    print "Keyword: ", keyword[0], ", score: ", keyword[1]
print rake_object.run(text) # this command outputs all the keywords and their scores.