TextRank

TextRank 是由 PageRank 改进而来,其公式:

W S ( V i ) = ( 1 − d ) + d ∗ Σ V j ∈ I n ( V i ) w j i Σ V k ∈ O u t ( V j ) W S ( V j ) WS(V_i)=(1-d)+d*\Sigma_{V_j \in In(V_i)}\frac{w_{ji}}{\Sigma_{V_k \in Out(V_j)}}WS(V_j) WS(Vi)=(1d)+dΣVjIn(Vi)ΣVkOut(Vj)wjiWS(Vj)

TextRank用于关键词提取的算法如下 :

  1. 把给定的文本 T 按照完整句子进行分割,即 T = [ S 1 , S 2 . . . S m ] T=[S_1, S_2...S_m] T=[S1,S2...Sm]
  2. 对于每个句子 S i ∈ T S_i \in T SiT ,进行分词和词性标注处理,并过滤掉停用词,只保留指定词性的单词,如名词、动词、形容词,即 S i = [ t i , 1 , t i , 2 . . . t i , n ] S_i=[t_{i,1},t_{i,2}...t_{i,n}] Si=[ti,1,ti,2...ti,n] ,其中 t i , j t_{i,j} ti,j 是保留后的候选关键词
  3. 构建候选关键词图 G = ( V , E ) G=(V,E) G=(V,E),其中 V 为节点集,由上步生成的候选关键词组成,然后采用共现关系构造任两点之间的边,两个节点之间存在边仅当它们对应的词汇在长度为 k 的窗口中共现,k 表示窗口大小,即最多共现 k 个单词
  4. 根据上面的公式,迭代传播各节点的权重,直至收敛
  5. 对节点权重进行倒序排列,从而得到最重要的 T 个单词,作为候选关键词
  6. 由上步得到最重要的 T 个单词,在原始文本中进行标记,若形成相邻词组,则组合成多词关键词

代码

来自结巴分词

class UndirectWeightedGraph:
    d = 0.85

    def __init__(self):
        self.graph = defaultdict(list)

    def addEdge(self, start, end, weight):
        # use a tuple (start, end, weight) instead of a Edge object
        self.graph[start].append((start, end, weight))
        self.graph[end].append((end, start, weight))

    def rank(self):
        ws = defaultdict(float)
        outSum = defaultdict(float)

        wsdef = 1.0 / (len(self.graph) or 1.0)
        for n, out in self.graph.items():
            ws[n] = wsdef
            outSum[n] = sum((e[2] for e in out), 0.0)

        # this line for build stable iteration
        sorted_keys = sorted(self.graph.keys())
        for x in xrange(10):  # 10 iters
            for n in sorted_keys:
                s = 0
                for e in self.graph[n]:
                    s += e[2] / outSum[e[1]] * ws[e[1]]
                ws[n] = (1 - self.d) + self.d * s

        (min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])

        for w in itervalues(ws):
            if w < min_rank:
                min_rank = w
            if w > max_rank:
                max_rank = w

        for n, w in ws.items():
            # to unify the weights, don't *100.
            ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)

        return ws


class TextRank(KeywordExtractor):

    def __init__(self):
        self.tokenizer = self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
        self.span = 5

    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)

    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)
        words = tuple(self.tokenizer.cut(sentence))
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                for j in xrange(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1

        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        if withWeight:
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)

        if topK:
            return tags[:topK]
        else:
            return tags

    extract_tags = textrank

参考

https://www.zybuluo.com/evilking/note/902585

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值