Open-Source Algorithm Management and Recommendation for Specific Problems (14)

郭德纲闭门弟子

Article link: https://blog.csdn.net/m0_46320525/article/details/122005831

2021SC@SDUSC

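Series table of contents

(14) PKE Code Analysis, Part 7

Preface

pke includes the following models:

This post continues the code analysis with the graph-based models among the unsupervised models.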

unsupervised->graph_based

topicrank.py

The TopicRank keyphrase extraction model.

A graph-based ranking method for keyphrase extraction, described in:

Adrien Bougouin, Florian Boudin and Béatrice Daille.

TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction.

In Proceedings of IJCNLP, pages 543-551, 2013.

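(1) Principle

TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction
TopicRank treats a topic as a cluster of similar keyphrases. Topics are ranked by their importance in the document, the top K most relevant topics are selected, and one most important keyphrase is chosen from each topic to represent the core keyphrases of the document.

The steps of the TopicRank algorithm are as follows:

Topic identification: noun phrases are extracted to represent the topics of the document. Two phrases are considered similar if more than 25% of their words overlap, and similar phrases are grouped with the Hierarchical Agglomerative Clustering (HAC) algorithm.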

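Graph construction: the nodes of the graph are topics. The weight of the edge between two topics $t_i$ and $t_j$ reflects the strength of their semantic relation, which is measured from the distances between the keyphrases of the two topics in the document:

$$w_{i,j} = \sum_{c_i \in t_i} \sum_{c_j \in t_j} \mathrm{dist}(c_i, c_j)$$

$$\mathrm{dist}(c_i, c_j) = \sum_{p_i \in \mathrm{pos}(c_i)} \sum_{p_j \in \mathrm{pos}(c_j)} \frac{1}{|p_i - p_j|}$$

where

$\mathrm{dist}(c_i, c_j)$ measures the offset distance between keyphrases $c_i$ and $c_j$ in the document, as a sum of reciprocal distances, so closer occurrences contribute more.

$\mathrm{pos}(c_i)$ is the set of all offset positions of keyphrase $c_i$.

TopicRank does not need a co-occurrence window to be set; it relies on positional offset distances instead.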

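Keyphrase selection: once the topics are ranked, the top K topics are selected and one most important keyphrase is output for each of them, so the topics produce K keyphrases in total. There are three strategies for choosing the most suitable keyphrase of a topic: first, choose the keyphrase that appears earliest in the document; second, choose the most frequent keyphrase; third, choose the keyphrase at the centroid of the cluster.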
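The three selection strategies can be illustrated with a small sketch. This is not pke code: the function names, the `topic` layout (phrase mapped to its sorted word offsets), and the centroid definition (highest total Jaccard word overlap with the other phrases of the topic) are all assumptions made for illustration.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two phrases, compared as word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)


def select_keyphrase(topic, strategy="first"):
    """Pick one representative phrase from a topic (hypothetical helper).

    `topic` maps each candidate phrase to the sorted list of its word offsets.
    """
    phrases = list(topic)
    if strategy == "first":
        # Phrase whose earliest occurrence comes first in the document.
        return min(phrases, key=lambda p: topic[p][0])
    if strategy == "frequent":
        # Most occurrences; ties are broken by the earliest occurrence.
        return min(phrases, key=lambda p: (-len(topic[p]), topic[p][0]))
    if strategy == "centroid":
        # Assumed definition: highest total similarity to the other phrases.
        return max(
            phrases,
            key=lambda p: sum(jaccard_similarity(p, q) for q in phrases if q != p),
        )
    raise ValueError(f"unknown strategy: {strategy}")


topic = {
    "neural network": [12, 40, 57],
    "deep neural network": [3],
    "network": [25, 60],
}
print(select_keyphrase(topic, "first"))     # deep neural network
print(select_keyphrase(topic, "frequent"))  # neural network
print(select_keyphrase(topic, "centroid"))  # neural network
```

In this toy topic, "deep neural network" occurs first, while "neural network" is both the most frequent phrase and the one that overlaps most with the others.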

Reference: https://blog.csdn.net/BGoodHabit/article/details/108926383

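(2) Usage example

The class docstring contains a usage example.

First, import pke, string, and the stopwords corpus from nltk.corpus; the numbered steps follow:

    import pke
    import string
    from nltk.corpus import stopwords

    # 1. Create a TopicRank extractor.
    extractor = pke.unsupervised.TopicRank()

    # 2. Load the content of the document.
    extractor.load_document(input='path/to/input.xml')

    # 3. Select the longest sequences of nouns and adjectives that do not
    #    contain punctuation marks or stopwords as candidates.
    pos = {'NOUN', 'PROPN', 'ADJ'}
    stoplist = list(string.punctuation)
    stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
    stoplist += stopwords.words('english')
    extractor.candidate_selection(pos=pos, stoplist=stoplist)

    # 4. Build topics by grouping candidates with HAC (average linkage,
    #    threshold of 1/4 of shared stems). Weight the topics with a random
    #    walk, and select the first occurring candidate from each topic.
    extractor.candidate_weighting(threshold=0.74, method='average')

    # 5. Get the 10 highest-scored candidates as keyphrases.
    keyphrases = extractor.get_n_best(n=10)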

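(3) Functions

The file contains one class:

class TopicRank(LoadFile):

The class contains 6 methods.

1.def __init__(self):

Redefines the initializer for TopicRank: it calls the parent initializer, then creates an empty topic graph and an empty topic container.

        super(TopicRank, self).__init__()

        # the topic graph
        self.graph = nx.Graph()

        # the topic container
        self.topics = []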

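2.def candidate_selection(self, pos=None, stoplist=None):

Selects the longest sequences of nouns and adjectives as keyphrase candidates.

Parameters:
pos (set): the set of valid POS tags, defaults to ('NOUN', 'PROPN', 'ADJ').
stoplist (list): the stoplist used to filter candidates, defaults to the nltk stoplist. Words that are punctuation marks from string.punctuation are not allowed.

        # define the default set of POS tags
        if pos is None:
            pos = {'NOUN', 'PROPN', 'ADJ'}

        # select the longest sequences of adjectives and nouns
        self.longest_pos_sequence_selection(valid_pos=pos)

        # initialize the stoplist if none was provided
        if stoplist is None:
            stoplist = self.stoplist

        # filter out candidates containing stopwords or punctuation marks
        self.candidate_filtering(
            stoplist=list(string.punctuation) +
            ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-'] +
            stoplist)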
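To make the selection step more concrete, here is a minimal sketch of what "longest POS sequence" selection followed by stoplist filtering could look like on a list of tagged tokens. The function name and input layout are assumptions for illustration; pke's own `longest_pos_sequence_selection` and `candidate_filtering` work on its internal sentence objects and may differ in detail.

```python
def longest_pos_sequences(tagged_tokens, valid_pos, stoplist=()):
    """Return maximal runs of consecutive tokens whose POS tag is in valid_pos.

    Runs that contain a stoplisted word are dropped entirely.
    """
    candidates, current = [], []
    # A sentinel with an invalid tag flushes the final run.
    for word, tag in list(tagged_tokens) + [("", None)]:
        if tag in valid_pos:
            current.append(word)
            continue
        if current and not any(w.lower() in stoplist for w in current):
            candidates.append(" ".join(current))
        current = []
    return candidates


tagged = [
    ("Keyphrase", "NOUN"), ("extraction", "NOUN"), ("is", "AUX"),
    ("a", "DET"), ("hard", "ADJ"), ("problem", "NOUN"), (".", "PUNCT"),
]
pos = {"NOUN", "PROPN", "ADJ"}
print(longest_pos_sequences(tagged, pos))
# ['Keyphrase extraction', 'hard problem']
print(longest_pos_sequences(tagged, pos, stoplist={"hard"}))
# ['Keyphrase extraction']
```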

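3.def vectorize_candidates(self):

Vectorizes the keyphrase candidates as bag-of-stems count vectors.

Returns:
C (list): the list of candidates.
X (matrix): the vectorized representation of the candidates.

        # build the vocabulary, i.e. set the vector dimensions
        dim = set([])
        # for k, v in self.candidates.iteritems():
        # Python 2/3 compatible
        for (k, v) in self.candidates.items():
            for w in v.lexical_form:
                dim.add(w)
        dim = list(dim)

        # vectorize the candidates; sorting the keys keeps the order
        # deterministic across Python 2/3
        C = list(self.candidates)  # .keys()
        C.sort()

        X = np.zeros((len(C), len(dim)))
        for i, k in enumerate(C):
            for w in self.candidates[k].lexical_form:
                X[i, dim.index(w)] += 1

        return C, X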
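The same idea can be tried without pke or NumPy. The sketch below is a simplified stand-in, not the library code: `candidates` is assumed to map a candidate key to its list of stems, and a dictionary replaces the `dim.index(w)` lookup so that finding a column does not require scanning the vocabulary list.

```python
def vectorize(candidates):
    """Build count vectors over the stem vocabulary.

    `candidates` maps a candidate key to its list of stems (assumed layout).
    """
    vocabulary = sorted({stem for stems in candidates.values() for stem in stems})
    index = {stem: i for i, stem in enumerate(vocabulary)}  # stem -> column
    keys = sorted(candidates)
    matrix = []
    for key in keys:
        row = [0] * len(vocabulary)
        for stem in candidates[key]:
            row[index[stem]] += 1
        matrix.append(row)
    return keys, vocabulary, matrix


candidates = {
    "neural network": ["neural", "network"],
    "network model": ["network", "model"],
}
keys, vocabulary, matrix = vectorize(candidates)
print(vocabulary)  # ['model', 'network', 'neural']
print(keys)        # ['network model', 'neural network']
print(matrix)      # [[1, 1, 0], [0, 1, 1]]
```

Each row is one candidate and each column one stem, which is the shape that the distance computation in the clustering step expects.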

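4.def topic_clustering(self, threshold=0.74, method='average'):

Clusters the candidates into topics.

Parameters:

threshold (float): the minimum similarity for clustering, defaults to 0.74, i.e. more than 1/4 of stem overlap similarity. Because fcluster is called with criterion='distance', the value acts as a maximum Jaccard distance: clusters at a distance of at most 0.74 (similarity of at least 0.26) end up in the same topic.
method (str): the linkage method, defaults to average.

        # handle documents with only one candidate
        if len(self.candidates) == 1:
            self.topics.append([list(self.candidates)[0]])
            return

        # vectorize the candidates
        candidates, X = self.vectorize_candidates()

        # compute the distance matrix
        Y = pdist(X, 'jaccard')

        # compute the hierarchical clustering
        Z = linkage(Y, method=method)

        # form flat clusters
        clusters = fcluster(Z, t=threshold, criterion='distance')

        # for each topic identifier, collect its candidates
        for cluster_id in range(1, max(clusters) + 1):
            self.topics.append([candidates[j] for j in range(len(clusters))
                                if clusters[j] == cluster_id])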
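For readers without SciPy at hand, the clustering step can be approximated with a naive pure-Python sketch. It is not what `linkage` and `fcluster` do internally: the function names and the `stems` layout are assumptions, and the loop simply merges the closest pair of clusters (average linkage over Jaccard distances) until no pair is within the threshold.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets of stems."""
    return 1.0 - len(a & b) / len(a | b)


def cluster_candidates(stems, threshold=0.74):
    """Naive average-linkage agglomerative clustering (illustrative only).

    `stems` maps each candidate to its set of stems.
    """
    clusters = [[key] for key in sorted(stems)]

    def average_distance(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(jaccard_distance(stems[a], stems[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), average_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


stems = {
    "neural network": {"neural", "network"},
    "deep neural network": {"deep", "neural", "network"},
    "graph": {"graph"},
    "topic graph": {"topic", "graph"},
}
print(cluster_candidates(stems))
# [['deep neural network', 'neural network'], ['graph', 'topic graph']]
```

The two "network" phrases merge at distance 1/3 and the two "graph" phrases at 1/2; the resulting topics share no stems (distance 1.0), so they stay separate.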

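The distance matrix is computed with the Jaccard metric.

The Jaccard similarity coefficient is mainly used for data clustering and for comparing the similarity of texts, for example in duplicate detection and deduplication, and for computing distances between objects.

It compares the similarity and diversity of finite sample sets: J(A, B) is the ratio of the size of the intersection of A and B to the size of their union. Note that pdist(X, 'jaccard') returns the Jaccard distance, i.e. 1 - J(A, B), not the similarity itself.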
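As a worked example of how the 0.74 threshold relates to "more than 1/4 of shared stems", the definitions can be written out as follows. The two example candidate pairs are made up for illustration.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
d_J(A, B) = 1 - J(A, B)

% Pair 1: A = {neural, network}, B = {deep, neural, network}
J = \frac{2}{3}, \quad d_J = \frac{1}{3} \approx 0.33 \le 0.74
\;\Rightarrow\; \text{close enough to be merged}

% Pair 2: A = {graph, model}, B = {graph, topic, rank}
J = \frac{1}{4}, \quad d_J = 0.75 > 0.74
\;\Rightarrow\; \text{not close enough}
```

So a pair sharing exactly one quarter of its stems falls just outside the threshold, while any larger overlap falls inside it.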

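5.def build_topic_graph(self):

Builds the topic graph: a complete graph over the topics in which each edge weight is the sum of reciprocal gaps between the occurrences of the candidates of the two topics.

        # add the nodes to the graph
        self.graph.add_nodes_from(range(len(self.topics)))

        # loop through the topic pairs to connect the nodes
        for i, j in combinations(range(len(self.topics)), 2):
            self.graph.add_edge(i, j, weight=0.0)
            for c_i in self.topics[i]:
                for c_j in self.topics[j]:
                    for p_i in self.candidates[c_i].offsets:
                        for p_j in self.candidates[c_j].offsets:
                            gap = abs(p_i - p_j)
                            # shorten the gap by the length of whichever
                            # candidate occurs first
                            if p_i < p_j:
                                gap -= len(self.candidates[c_i].lexical_form) - 1
                            if p_j < p_i:
                                gap -= len(self.candidates[c_j].lexical_form) - 1
                            self.graph[i][j]['weight'] += 1.0 / gap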
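The effect of the gap correction is easier to see on a tiny example. The sketch below mirrors the loop above using plain dictionaries instead of pke candidates; the function name, the input layout, and the `gap > 0` guard are assumptions added for this illustration.

```python
from itertools import combinations


def topic_edge_weights(topics, offsets, lengths):
    """Sum reciprocal gaps between every pair of occurrences across two topics.

    topics:  list of topics, each a list of candidate keys
    offsets: candidate key -> list of word offsets of its occurrences
    lengths: candidate key -> number of words in the candidate
    """
    weights = {}
    for i, j in combinations(range(len(topics)), 2):
        weight = 0.0
        for c_i in topics[i]:
            for c_j in topics[j]:
                for p_i in offsets[c_i]:
                    for p_j in offsets[c_j]:
                        gap = abs(p_i - p_j)
                        # Subtract the span of whichever candidate comes first,
                        # so the gap runs from its last word to the other's first.
                        if p_i < p_j:
                            gap -= lengths[c_i] - 1
                        elif p_j < p_i:
                            gap -= lengths[c_j] - 1
                        if gap > 0:  # guard added in this sketch only
                            weight += 1.0 / gap
        weights[(i, j)] = weight
    return weights


topics = [["neural network"], ["model"]]
offsets = {"neural network": [0, 10], "model": [3]}
lengths = {"neural network": 2, "model": 1}
weights = topic_edge_weights(topics, offsets, lengths)
print(weights[(0, 1)])  # 1/2 + 1/7
```

"neural network" at offset 0 ends at word 1, so its gap to "model" at offset 3 is 2 rather than 3; the occurrence at offset 10 comes after "model", whose length is 1, so that gap stays 7.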

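6.def candidate_weighting(self, threshold=0.74, method='average', heuristic=None):

Ranks the candidates with a random walk over the topic graph.

Parameters:

threshold (float): the minimum similarity for clustering, defaults to 0.74.
method (str): the linkage method, defaults to average.
heuristic (str): the heuristic for selecting the best candidate of each topic, defaults to the first occurring candidate. The other option is 'frequent' (the most frequent candidate; position is used to break ties).

        if not self.candidates:
            return

        # cluster the candidates into topics
        self.topic_clustering(threshold=threshold, method=method)

        # build the topic graph
        self.build_topic_graph()

        # compute the topic scores using a random walk (PageRank)
        # note: pagerank_scipy was removed in NetworkX 3.0; nx.pagerank
        # accepts the same alpha and weight arguments
        w = nx.pagerank_scipy(self.graph, alpha=0.85, weight='weight')

        # loop through the topics
        for i, topic in enumerate(self.topics):

            # get the first offset of each candidate in the topic
            offsets = [self.candidates[t].offsets[0] for t in topic]

            # pick the representative candidate according to the heuristic
            if heuristic == 'frequent':

                # get the frequency of each candidate within the topic
                freq = [len(self.candidates[t].surface_forms) for t in topic]

                # get the indexes of the most frequent candidates
                indexes = [j for j, f in enumerate(freq) if f == max(freq)]

                # offsets of those candidates
                indexes_offsets = [offsets[j] for j in indexes]
                # select the most frequent candidate that occurs first
                most_frequent = offsets.index(min(indexes_offsets))
                self.weights[topic[most_frequent]] = w[i]

            else:
                # default: the first occurring candidate of the topic
                first = offsets.index(min(offsets))
                self.weights[topic[first]] = w[i]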
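The random-walk scoring itself is delegated to NetworkX. To show roughly what a weighted PageRank computes on the topic graph, here is a plain power-iteration sketch. It is an illustration under stated assumptions (function name, edge dictionary layout, convergence settings), not the NetworkX implementation, and its numbers are not guaranteed to match the library's.

```python
def weighted_pagerank(n_nodes, edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on an undirected weighted graph.

    edges: dict mapping (i, j) node pairs to a non-negative weight.
    Returns a list of scores that sums to 1.
    """
    # Symmetric adjacency: an undirected edge is walked in both directions.
    neighbors = [dict() for _ in range(n_nodes)]
    for (i, j), w in edges.items():
        if w > 0:
            neighbors[i][j] = neighbors[i].get(j, 0.0) + w
            neighbors[j][i] = neighbors[j].get(i, 0.0) + w
    strength = [sum(nbrs.values()) for nbrs in neighbors]

    scores = [1.0 / n_nodes] * n_nodes
    for _ in range(max_iter):
        # Mass held by nodes without weighted edges is spread uniformly.
        dangling = sum(scores[i] for i in range(n_nodes) if strength[i] == 0)
        new_scores = [(1 - alpha) / n_nodes + alpha * dangling / n_nodes] * n_nodes
        for i in range(n_nodes):
            for j, w in neighbors[i].items():
                new_scores[j] += alpha * scores[i] * w / strength[i]
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Three topics: topic 1 is strongly tied to topic 0 and weakly to topic 2.
edges = {(0, 1): 2.0, (1, 2): 1.0, (0, 2): 0.0}
scores = weighted_pagerank(3, edges)
print(scores)
```

Topic 1 carries the most edge weight and receives the highest score, followed by topic 0 and then topic 2; `candidate_weighting` would then assign each topic's score to its representative candidate.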

Summary

This post analyzed unsupervised->graph_based->topicrank.py.

The next post will begin the code analysis of the supervised models.
