TextRank is adapted from PageRank; its scoring formula is:
$$WS(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$
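As a quick illustration of this update rule, the sketch below (a hypothetical three-node graph with hand-picked edge weights, d = 0.85) runs the weighted iteration; for an undirected graph, $In(V)$ and $Out(V)$ are the same neighbor set:

```python
# Hypothetical toy graph: three words A, B, C with co-occurrence weights.
edges = {
    ("A", "B"): 3.0,
    ("A", "C"): 2.0,
    ("B", "C"): 1.0,
}

# Build a symmetric adjacency map and per-node outgoing weight sums.
neighbors = {}
for (u, v), w in edges.items():
    neighbors.setdefault(u, {})[v] = w
    neighbors.setdefault(v, {})[u] = w
out_sum = {n: sum(nbrs.values()) for n, nbrs in neighbors.items()}

d = 0.85
score = {n: 1.0 for n in neighbors}
for _ in range(50):  # iterate the update rule until (approximately) converged
    for n in neighbors:
        s = sum(w / out_sum[m] * score[m] for m, w in neighbors[n].items())
        score[n] = (1 - d) + d * s

# A sits on the two heaviest edges, so it ends up with the highest score.
print(max(score, key=score.get))  # -> A
```

Note how each neighbor's score is divided by that neighbor's total outgoing weight before it is propagated, exactly as in the denominator of the formula.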
The TextRank keyword-extraction algorithm works as follows:
- Split the given text T into complete sentences, i.e. $T = [S_1, S_2, \dots, S_m]$.
- For each sentence $S_i \in T$, tokenize it and tag parts of speech, then filter out stopwords and keep only words with the designated POS tags (e.g. nouns, verbs, adjectives), giving $S_i = [t_{i,1}, t_{i,2}, \dots, t_{i,n}]$, where each $t_{i,j}$ is a retained candidate keyword.
- Build the candidate-keyword graph $G = (V, E)$, where the node set V consists of the candidate keywords from the previous step. Edges are added by co-occurrence: two nodes are connected if and only if their words co-occur within a window of length k, where k is the window size, i.e. at most k words co-occur.
- Using the formula above, iteratively propagate the node weights until convergence.
- Sort the nodes by weight in descending order and take the top T words as candidate keywords.
- Mark the top T words from the previous step in the original text; if some of them are adjacent, merge them into multi-word keywords.
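The steps above can be sketched end-to-end in a minimal, self-contained form. This is an assumption-laden toy (whitespace tokenization and a hand-written English stopword list stand in for real segmentation and POS filtering), not the jieba implementation:

```python
from collections import defaultdict

def textrank_keywords(text, window=5, d=0.85, iters=20, topk=3):
    """Toy TextRank sketch: split into sentences, filter candidate words,
    link words co-occurring within `window`, then iterate the scores."""
    stopwords = {"the", "a", "of", "and", "is", "in", "to"}  # hand-written
    graph = defaultdict(lambda: defaultdict(float))
    for sent in text.lower().split("."):
        words = [w for w in sent.split() if w not in stopwords]
        for i, wi in enumerate(words):
            for wj in words[i + 1 : i + window]:
                if wi != wj:                 # skip self-loops
                    graph[wi][wj] += 1.0     # symmetric co-occurrence weight
                    graph[wj][wi] += 1.0
    out_sum = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        for w in graph:
            s = sum(wt / out_sum[v] * score[v] for v, wt in graph[w].items())
            score[w] = (1 - d) + d * s
    return sorted(score, key=score.get, reverse=True)[:topk]

keywords = textrank_keywords(
    "Graph ranking models rank graph vertices. "
    "TextRank applies graph ranking to text."
)
print(keywords)  # "graph" co-occurs with the most words, so it ranks high
```

The final merge-adjacent-words step is omitted here for brevity; it is a simple scan over the original token sequence.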
Code
The implementation below is taken from jieba (结巴分词), lightly adapted to run on Python 3:
```python
import sys
from collections import defaultdict
from operator import itemgetter

import jieba.posseg
# KeywordExtractor supplies STOP_WORDS and the shared stop-word handling
from jieba.analyse.tfidf import KeywordExtractor


class UndirectWeightedGraph:
    d = 0.85

    def __init__(self):
        self.graph = defaultdict(list)

    def addEdge(self, start, end, weight):
        # use a tuple (start, end, weight) instead of an Edge object
        self.graph[start].append((start, end, weight))
        self.graph[end].append((end, start, weight))

    def rank(self):
        ws = defaultdict(float)
        outSum = defaultdict(float)

        wsdef = 1.0 / (len(self.graph) or 1.0)
        for n, out in self.graph.items():
            ws[n] = wsdef
            outSum[n] = sum((e[2] for e in out), 0.0)

        # sort the keys so every run iterates in a stable order
        sorted_keys = sorted(self.graph.keys())
        for x in range(10):  # 10 iterations
            for n in sorted_keys:
                s = 0
                for e in self.graph[n]:
                    s += e[2] / outSum[e[1]] * ws[e[1]]
                ws[n] = (1 - self.d) + self.d * s

        # sys.float_info[0] is float max, sys.float_info[3] is float min,
        # so the first comparison below is guaranteed to update both extremes
        (min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])
        for w in ws.values():
            if w < min_rank:
                min_rank = w
            if w > max_rank:
                max_rank = w

        for n, w in ws.items():
            # normalize the weights into a unified range (no *100 scaling)
            ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)

        return ws


class TextRank(KeywordExtractor):

    def __init__(self):
        self.tokenizer = self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
        self.span = 5

    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)

    def textrank(self, sentence, topK=20, withWeight=False,
                 allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, flag) like posseg.cut
                        if False, return a list of words
        """
        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)
        words = tuple(self.tokenizer.cut(sentence))
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                for j in range(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1

        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()

        if withWeight:
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)

        if topK:
            return tags[:topK]
        else:
            return tags

    extract_tags = textrank
```
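The sliding-window pair counting that feeds `addEdge` can be exercised in isolation. In this sketch plain strings stand in for jieba's (word, flag) pairs, and `cooccurrence_counts` is a hypothetical helper mirroring the `cm` loop:

```python
from collections import defaultdict

def cooccurrence_counts(words, span=5):
    """Count, for each ordered pair, how often the second token appears
    within `span` positions after the first (mirrors the `cm` loop)."""
    cm = defaultdict(int)
    for i, w in enumerate(words):
        for j in range(i + 1, i + span):
            if j >= len(words):
                break
            cm[(w, words[j])] += 1
    return dict(cm)

cm = cooccurrence_counts(["graph", "rank", "model", "graph"], span=3)
print(cm[("graph", "rank")])  # -> 1
print(cm[("rank", "graph")])  # -> 1
```

Because `addEdge` later inserts each pair in both directions, the resulting graph is undirected even though the window scan itself only looks forward.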
References
https://www.zybuluo.com/evilking/note/902585