RAKE (Rapid Automatic Keyword Extraction): Principles and Implementation

The RAKE Algorithm

RAKE performs keyword extraction, but what it actually extracts are key phrases, with a bias toward longer phrases. In English, a keyword usually consists of several words but rarely contains punctuation or stopwords such as "and", "the", and "of", or other words that carry no semantic information.

Algorithm Idea

  1. RAKE first splits a document into sentences, using punctuation marks (half-width periods, question marks, exclamation marks, commas, and so on) as delimiters.
  2. Each sentence is then split into phrases, using stopwords as delimiters; these phrases are the candidates for the extracted keywords (a minimal sketch of steps 1 and 2 follows this list).
  3. Finally, each phrase is split on whitespace into individual words; each word is assigned a score, and the score of a phrase is obtained by summing the scores of its words.
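
A minimal sketch of steps 1 and 2, assuming an illustrative delimiter set and a tiny, hypothetical stopword list (the full implementation in the appendix uses regular expressions for both; the example sentence is taken from the sample abstract in [5]):

import re

# Hypothetical, minimal stopword list -- for illustration only.
STOPWORDS = {"a", "and", "in", "is", "of", "the"}

def candidate_phrases(document):
    # Step 1: split the document into sentences at punctuation.
    # Step 2: split each sentence at stopwords; the runs of
    # non-stopwords in between are the candidate phrases.
    phrases = []
    for sentence in re.split(r"[.!?,;:]", document):
        current = []
        for word in sentence.lower().split():
            if word in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

print(candidate_phrases("Compatibility of systems of linear constraints"))
# -> ['compatibility', 'systems', 'linear constraints']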

A key point is to take the co-occurrence relations among the words of a phrase into account. The score of a word is defined as:
$wordScore(w) = \frac{wordDegree(w)}{wordFrequency(w)}$
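
Continuing the sketch above, step 3 takes only a few lines; here wordDegree(w) counts the word itself, i.e. it is the sum of the lengths of all candidate phrases that contain w:

from collections import defaultdict

def phrase_scores(phrases):
    freq = defaultdict(int)  # wordFrequency(w)
    deg = defaultdict(int)   # wordDegree(w), counting the word itself
    for phrase in phrases:
        words = phrase.split()
        for w in words:
            freq[w] += 1
            deg[w] += len(words)
    # wordScore(w) = wordDegree(w) / wordFrequency(w);
    # a phrase scores the sum of its words' scores.
    return {p: sum(deg[w] / freq[w] for w in p.split()) for p in phrases}

print(phrase_scores(['compatibility', 'systems', 'linear constraints']))
# -> {'compatibility': 1.0, 'systems': 1.0, 'linear constraints': 4.0}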
Two Different Interpretations of the Word Co-occurrence Matrix

  1. The value summed down each column of the word co-occurrence matrix is that word's degree deg (a notion from graph theory: the degree increases by 1 for every word it co-occurs with in a phrase, counting the word itself), and the number of times the word occurs in the text is its frequency freq.
  2. Count word co-occurrences within a window of a fixed, pre-specified size, and use the counts of the words co-occurring around a word as that word's vector. Concretely, a word representation is defined by building such a co-occurrence matrix from a large corpus.

An example of the first interpretation:

Each diagonal element of the co-word matrix is set to the total number of occurrences of the word itself across all articles. Iterate over the feature-word list to form every pair of words, then walk through the tokenized result of each article; whenever both words of a pair occur in the same article, the weight of that pair is incremented by 1 and written to the corresponding cell of the matrix. A minimal sketch follows.
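
The sketch below follows that description, using hypothetical tokenized articles and a hypothetical feature-word list (the data is illustrative only):

import numpy as np

# Hypothetical tokenized articles and feature-word list.
articles = [["rake", "keyword", "extraction"],
            ["rake", "rake", "stopword", "keyword"]]
vocab = ["rake", "keyword", "extraction", "stopword"]
index = {w: i for i, w in enumerate(vocab)}

matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for tokens in articles:
    # Diagonal: total occurrences of the word across all articles.
    for w in tokens:
        matrix[index[w], index[w]] += 1
    # Off-diagonal: +1 for each pair of words sharing an article.
    present = {index[w] for w in tokens}
    for i in present:
        for j in present:
            if i != j:
                matrix[i, j] += 1

print(matrix)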

A full implementation can be found in [4] or [7].

An example of the second interpretation:

I like deep learning.
I like NLP.
I enjoy flying.

Assume the co-occurrence window covers each word and one word of context on either side. The word co-occurrence matrix is then as shown in the table below:

Table 1. Word co-occurrence matrix

count     I   like  enjoy  deep  learning  NLP  flying  .
I         0   2     1      0     0         0    0       0
like      2   0     0      1     0         1    0       0
enjoy     1   0     0      0     0         0    1       0
deep      0   1     0      0     1         0    0       0
learning  0   0     0      1     0         0    0       1
NLP       0   1     0      0     0         0    0       1
flying    0   0     1      0     0         0    0       1
.         0   0     0      0     1         1    1       0

Here (I, like) = 2 because the pair falls within the window once in "I like deep learning" and once in "I like NLP"; the remaining entries follow in the same way.
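
A minimal sketch that reproduces Table 1, treating each sentence as a separate token sequence and the final period as a token:

import numpy as np

sentences = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {w: i for i, w in enumerate(vocab)}

window = 1  # one word of context on each side
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for sentence in sentences:
    tokens = sentence.split()
    for pos, w in enumerate(tokens):
        for offset in range(-window, window + 1):
            if offset != 0 and 0 <= pos + offset < len(tokens):
                matrix[index[w], index[tokens[pos + offset]]] += 1

print(matrix)  # matches Table 1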

References

[1] Building a co-occurrence matrix in Python (python 共现矩阵构建), CSDN blog.

[2] https://mp.weixin.qq.com/s/nX_idrMlI_cv9aufnHoRwQ

[3] Co-occurrence matrices (共现矩阵), CSDN blog, 贾世林jiashilin.

[4] https://github.com/zelandiya/RAKE-tutorial

[5] Rose, S. J., et al. "Automatic Keyword Extraction from Individual Documents." (2010).

[6] RAKE for English keyword extraction (英文关键词提取之RAKE算法), CSDN blog, lizzy05.

[7] laserwave/keywords_extraction_rake: A Python implementation of Rapid Automatic Keyword Extraction (github.com).

Appendix

Code Implementation

import re
import operator
import argparse
import codecs
 
def isNumber(s):
    try:
        float(s) if '.' in s else int(s)
        return True
    except ValueError:
        return False
 
class Rake:
    def __init__(self, inputFilePath, stopwordsFilePath, outputFilePath, minPhraseChar, maxPhraseLength):
        self.outputFilePath = outputFilePath
        self.minPhraseChar = minPhraseChar
        self.maxPhraseLength = maxPhraseLength
        # read documents
        self.docs = []
        for document in codecs.open(inputFilePath, 'r', 'utf-8'):
            self.docs.append(document)
        # read stopwords
        stopwords = []
        for word in codecs.open(stopwordsFilePath, 'r', 'utf-8'):
            stopwords.append(word.strip())
        stopwordsRegex = []
        for word in stopwords:
            # match the stopword as a whole word (not followed by a word character or hyphen);
            # re.escape guards against regex metacharacters in the stopword file
            regex = r'\b' + re.escape(word) + r'(?![\w-])'
            stopwordsRegex.append(regex)
        self.stopwordsPattern = re.compile('|'.join(stopwordsRegex), re.IGNORECASE)
 
    def separateWords(self, text):
        splitter = re.compile(r'[^a-zA-Z0-9_+\-/]')
        words = []
        for word in splitter.split(text):
            word = word.strip().lower()
            # leave numbers in phrase, but don't count as words, since they tend to invalidate scores of their phrases
            if len(word) > 0 and not isNumber(word):
                words.append(word)
        return words
    
    def calculatePhraseScore(self, phrases):
        # calculate wordFrequency and wordDegree
        wordFrequency = {}
        wordDegree = {}
        for phrase in phrases:
            wordList = self.separateWords(phrase)
            wordListLength = len(wordList)
            wordListDegree = wordListLength - 1
            for word in wordList:
                wordFrequency.setdefault(word, 0)
                wordFrequency[word] += 1
                wordDegree.setdefault(word, 0)
                wordDegree[word] += wordListDegree
        for item in wordFrequency:
            wordDegree[item] = wordDegree[item] + wordFrequency[item]
    
        # calculate wordScore = wordDegree(w)/wordFrequency(w)
        wordScore = {}
        for item in wordFrequency:
            wordScore.setdefault(item, 0)
            wordScore[item] = wordDegree[item] * 1.0 / wordFrequency[item]
 
        # calculate phraseScore
        phraseScore = {}
        for phrase in phrases:
            phraseScore.setdefault(phrase, 0)
            wordList = self.separateWords(phrase)
            candidateScore = 0
            for word in wordList:
                candidateScore += wordScore[word]
            phraseScore[phrase] = candidateScore
        return phraseScore
    
        
    def execute(self):
        file = codecs.open(self.outputFilePath,'w','utf-8')
        for document in self.docs:
            # split a document into sentences
            sentenceDelimiters = re.compile(u'[.!?,;:\t\\\\"\\(\\)\\\'\u2019\u2013]|\\s\\-\\s')
            sentences = sentenceDelimiters.split(document)
            # generate all valid phrases
            phrases = []
            for s in sentences:
                tmp = re.sub(self.stopwordsPattern, '|', s.strip())
                phrasesOfSentence = tmp.split("|")
                for phrase in phrasesOfSentence:
                    phrase = phrase.strip().lower()
                    if phrase != "" and len(phrase) >= self.minPhraseChar and len(phrase.split()) <= self.maxPhraseLength:
                        phrases.append(phrase)
    
            # calculate phrase score
            phraseScore = self.calculatePhraseScore(phrases)
            keywords = sorted(phraseScore.items(), key=operator.itemgetter(1), reverse=True)
            # keep the top third of the ranked candidate phrases
            file.write(str(keywords[0:int(len(keywords) / 3)]) + "\n")
        file.close()
        
def readParamsFromCmd():
    parser = argparse.ArgumentParser(description = "This is a python implementation of rake(rapid automatic keyword extraction).")
    parser.add_argument('inputFilePath', help = 'The file path of input document(s). One line represents a document.')
    parser.add_argument('stopwordsFilePath', help = 'The file path of stopwords, each line represents a word.')
    parser.add_argument('-o', '--outputFilePath', help = 'The file path of output. (default output.txt in current dir).', default = 'output.txt')
    parser.add_argument('-m', '--minPhraseChar', type = int, help = 'The minimum number of characters of a phrase.(default 1)', default = 1)
    parser.add_argument('-a', '--maxPhraseLength', type = int, help = 'The maximum length of a phrase.(default 3)', default = 3)
    return parser.parse_args()
 
if __name__ == '__main__':
    params = vars(readParamsFromCmd())
    rake = Rake(params['inputFilePath'], params['stopwordsFilePath'], params['outputFilePath'], params['minPhraseChar'], params['maxPhraseLength'])
    rake.execute()
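
Assuming the script above is saved as rake.py (the file name is an assumption), it can be invoked from the command line as follows, where input.txt holds one document per line and stopwords.txt one stopword per line, matching the argparse help text above:

python rake.py input.txt stopwords.txt -o output.txt -m 1 -a 3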
