Open-Source Algorithm Management and Recommendation for Specific Problems (9)

郭德纲闭门弟子

Category: Software Engineering Applications and Practice. Tags: algorithms, python, artificial intelligence

Article link: https://blog.csdn.net/m0_46320525/article/details/121545534


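This article analyzes in detail the data structures of PKE, a Python keyphrase extraction library, including the Sentence, Candidate, and Document classes, as well as the functions in utils.py for document frequency computation, LDA models, and related tasks. Through a close reading of the code, it shows how the document data structures are built and manipulated, and how keyphrase frequency statistics and topic modeling are carried out.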


2021SC@SDUSC

Series contents

(9) PKE Code Analysis, Part 2

Preface

Continuing from the previous post, this part analyzes the following code.

I. data_structures.py

The data structures of the pke module. The file defines three classes.

(1) class Sentence(object):

This class defines the sentence data structure.

It has two methods.

1.def __init__(self, words):# initializes the following attributes

length: the length (number of tokens) of the sentence.

meta: meta-information of the sentence.

pos: list of part-of-speech tags.

stems: list of stems.

words: list of words (tokens) in the sentence.

        self.words = words
        
        self.pos = []

        self.stems = []

        self.length = len(words)

        self.meta = {}

2.def __eq__(self, other):# compares two sentences for equality

Compares each of the attributes initialized above and returns True only if all of them are equal.

        if type(self) != type(other):
            return False

        if self.length != other.length:
            return False

        if self.words != other.words:
            return False

        if self.pos != other.pos:
            return False

        if self.stems != other.stems:
            return False

        if self.meta != other.meta:
            return False

        return True
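
To see the equality check in action, here is a minimal sketch that re-creates the Sentence structure described above. It is a simplified stand-in, not pke's own file: the chain of `if` statements is folded into a single boolean expression, and the sample tokens and POS tags are made up for illustration.

```python
class Sentence(object):
    """Minimal re-creation of the Sentence structure described above."""

    def __init__(self, words):
        self.words = words          # list of tokens
        self.pos = []               # part-of-speech tags, one per token
        self.stems = []             # stems, one per token
        self.length = len(words)    # number of tokens
        self.meta = {}              # free-form meta-information

    def __eq__(self, other):
        # Same checks as the original, folded into one expression.
        if type(self) != type(other):
            return False
        return (self.length == other.length and self.words == other.words
                and self.pos == other.pos and self.stems == other.stems
                and self.meta == other.meta)


a = Sentence(["keyphrase", "extraction"])
b = Sentence(["keyphrase", "extraction"])
same_before = (a == b)      # every attribute matches
b.pos = ["NOUN", "NOUN"]    # illustrative tags, not produced by a real tagger
same_after = (a == b)       # the POS lists now differ
print(same_before, same_after)   # prints: True False
```

Two sentences with identical tokens stop being equal as soon as any one attribute, here the POS list, differs.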

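(2) class Candidate(object):

This class defines the keyphrase candidate data structure.

It has only a constructor.

def __init__(self):# initializes the following attributes
        lexical_form: the lexical form of the candidate.
        offsets: the offsets of the surface forms.
        pos_patterns: the part-of-speech patterns of the candidate.
        sentence_ids: the sentence id of each surface form.
        surface_forms: the surface forms of the candidate.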

        self.surface_forms = []
        
        self.offsets = []

        self.sentence_ids = []
       
        self.pos_patterns = []

        self.lexical_form = []

(3) class Document(object):

This class defines the document data structure.

It has three methods:

1.def __init__(self):# initializes the input file path and the sentence list

        self.input_file = None
       
        self.sentences = []

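2.def from_sentences(sentences, **kwargs):# populates the sentence list

Parameters: sentences (Sentence list): the content used to create the document. input_file (str): path to the input file.

This is a static method (@staticmethod). Setting the decorator mechanism aside, here is only what it does and how it is used. staticmethod marks a method of a class so that it can be called without creating an instance of the class; it can also be called on an instance like an ordinary method. Such a method is usually called a static method. A static method receives neither the instance nor the class implicitly, so its parameter list does not need the conventional first parameter self, and it can reach class attributes or other methods only by naming the class explicitly (as from_sentences does with Document()). In my view, a static method is a way for a class to wrap what would otherwise be an external function, which helps organize the code and improves readability. Naturally, the wrapped function should match the purpose of the class that wraps it as closely as possible.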
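
A small sketch may make the calling conventions concrete. The `Temperature` class and its `from_fahrenheit` method are hypothetical and have nothing to do with pke; they only mirror the factory-style use of @staticmethod seen in from_sentences.

```python
class Temperature(object):
    """Hypothetical class, used only to illustrate @staticmethod."""

    def __init__(self, celsius):
        self.celsius = celsius

    @staticmethod
    def from_fahrenheit(fahrenheit):
        # No `self` parameter: the method builds an instance by naming the class.
        return Temperature((fahrenheit - 32) * 5 / 9)


t1 = Temperature.from_fahrenheit(212)   # called on the class, no instance needed
t2 = t1.from_fahrenheit(32)             # also callable on an existing instance
print(t1.celsius, t2.celsius)           # prints: 100.0 0.0
```

Both calls run the same function; the instance `t1` plays no role in the second call beyond giving access to the method.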

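# initialize the document

doc = Document()

# set the input file

doc.input_file = kwargs.get('input_file', None)

The for loop below iterates over the parsed sentences and, for each one, in order:

# adds the sentence to the container (creates the Sentence object)

# adds the POS tags

# adds the lemmas

# adds the meta-information

# adds the sentence to the document

        for i, sentence in enumerate(sentences):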

            s = Sentence(words=sentence['words'])

            s.pos = sentence['POS']

            s.stems = sentence['lemmas']

            for (k, infos) in sentence.items():
                if k not in {'POS', 'lemmas', 'words'}:
                    s.meta[k] = infos

            doc.sentences.append(s)
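
The loop above can be exercised end to end with a minimal sketch. The classes below are cut-down copies of the structures described in this post; the final `return doc` is an assumption (the excerpt does not show it), and the `char_offsets` key and the sample sentence are hypothetical.

```python
class Sentence(object):
    def __init__(self, words):
        self.words, self.pos, self.stems = words, [], []
        self.length, self.meta = len(words), {}


class Document(object):
    def __init__(self):
        self.input_file = None
        self.sentences = []

    @staticmethod
    def from_sentences(sentences, **kwargs):
        doc = Document()
        doc.input_file = kwargs.get('input_file', None)
        for sentence in sentences:
            s = Sentence(words=sentence['words'])
            s.pos = sentence['POS']
            s.stems = sentence['lemmas']
            # every other key is kept as meta-information
            for k, infos in sentence.items():
                if k not in {'POS', 'lemmas', 'words'}:
                    s.meta[k] = infos
            doc.sentences.append(s)
        return doc   # assumed: the excerpt above does not show the return


parsed = [{
    'words': ['Graphs', 'rank', 'words'],
    'POS': ['NOUN', 'VERB', 'NOUN'],
    'lemmas': ['graph', 'rank', 'word'],
    'char_offsets': [(0, 6), (7, 11), (12, 17)],   # hypothetical extra key
}]
doc = Document.from_sentences(parsed, input_file='example.txt')
print(doc.input_file, doc.sentences[0].stems, doc.sentences[0].meta)
```

Any key other than words, POS, and lemmas ends up in the sentence's meta dictionary, which is how extra parser output can travel with the sentence.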

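3.def __eq__(self, other):# compares two documents for equality

Compares the attributes one by one and returns True only if all of them are equal. Note that language is compared here although the __init__ shown above does not set it, so it has to be assigned elsewhere before two documents are compared.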

        if type(self) != type(other):
            return False

        if self.language != other.language:
            return False

        if self.input_file != other.input_file:
            return False

        if self.sentences != other.sentences:
            return False

        return True

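II. utils.py

This file contains nine functions.

1.def load_document_frequency_file(input_file, delimiter='\t'):

Loads a TSV (tab-separated values) file containing document frequencies.
Whether the input file is compressed (gzip) is detected automatically from its extension (.gz).

Parameters:

input_file (str): the input file containing document frequencies in CSV format.

delimiter (str): the delimiter separating each term from its document frequency, defaults to '\t'.

Returns: dict: a dictionary of the form {term_1: freq}, where freq is an integer.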

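    # initialize the DF dictionary
    frequencies = {}

    # open the input file (gzip-compressed when the name ends with .gz)
    with gzip.open(input_file, 'rt', encoding='utf-8') if input_file.endswith('.gz') else \
            codecs.open(input_file, 'rt', encoding='utf-8') as f:

        # read the CSV file
        df_reader = csv.reader(f, delimiter=delimiter)

        # populate the dictionary
        for row in df_reader:
            frequencies[row[0]] = int(row[1])

    # finally, return the populated dictionary
    return frequencies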
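
As a rough, standalone illustration of the loading logic, the sketch below writes a tiny gzip-compressed file and reads it back. `load_df` is a hypothetical re-implementation using only the standard library, not pke's function, and the sample terms and counts are invented.

```python
import csv
import gzip
import os
import tempfile


def load_df(input_file, delimiter='\t'):
    """Read term<delimiter>count rows into a dict; gzip is detected by extension."""
    opener = gzip.open if input_file.endswith('.gz') else open
    frequencies = {}
    with opener(input_file, 'rt', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter=delimiter):
            frequencies[row[0]] = int(row[1])
    return frequencies


# Write a tiny gzip-compressed sample file, then load it back.
path = os.path.join(tempfile.mkdtemp(), 'df.tsv.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write('--NB_DOC--\t2\nkeyphras\t2\ngraph\t1\n')

df = load_df(path)
print(df)
```

The special `--NB_DOC--` row is loaded like any other term, so the number of documents is available from the same dictionary.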

2.def compute_document_frequency(input_dir, output_file, extension='xml', language='en', normalization="stemming", stoplist=None, delimiter='\t', n=3, max_length=None, encoding=None):

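Computes n-gram document frequencies from a set of input documents. An extra line is added to the output file to record the number of documents from which the frequencies were computed (--NB_DOC-- tab XXX). The output file is compressed with gzip.

Parameters:

input_dir (str): the input directory.
output_file (str): the output file.
extension (str): file extension of the input documents, defaults to xml.
language (str): language of the input documents (used to compute the n-stem or n-lemma forms), defaults to 'en' (English).
normalization (str): word normalization method, defaults to 'stemming'. Other possible values are 'lemmatization', or 'None' to use word surface forms instead of stems/lemmas.
stoplist (list): stopwords used to filter n-grams, defaults to None.
delimiter (str): the delimiter between an n-gram and its document frequency, defaults to tab (\t).
n (int): the size of the n-grams, defaults to 3.
encoding (str): encoding of the files in input_dir, defaults to None.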

    # document frequency container
    frequencies = defaultdict(int)

    # initialize the number of documents
    nb_documents = 0

    # loop through the documents
    for input_file in glob.iglob(input_dir + os.sep + '*.' + extension):

        # initialize a LoadFile object
        doc = LoadFile()

        # read the input file
        doc.load_document(input=input_file,
                          language=language,
                          normalization=normalization,
                          max_length=max_length,
                          encoding=encoding)

        # candidate selection
        doc.ngram_selection(n=n)

        # filter out candidates that contain punctuation marks
        doc.candidate_filtering(stoplist=stoplist)

        # loop through the candidates
        for lexical_form in doc.candidates:
            frequencies[lexical_form] += 1

        nb_documents += 1

        if nb_documents % 1000 == 0:
            logging.info("{} docs, memory used: {} mb".format(
                nb_documents,
                sys.getsizeof(frequencies) / 1024 / 1024))

    # create the output directory from the path if it does not exist
    if os.path.dirname(output_file):
        os.makedirs(os.path.dirname(output_file), exist_ok=True)

    # dump the df container
    with gzip.open(output_file, 'wt', encoding='utf-8') as f:

        # add the number of documents as a special token
        first_line = '--NB_DOC--' + delimiter + str(nb_documents)
        f.write(first_line + '\n')

        for ngram in frequencies:
            line = ngram + delimiter + str(frequencies[ngram])
            f.write(line + '\n')
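
The counting idea can be tried without pke on a few pre-tokenized documents. In the sketch below, `ngrams` and `document_frequency` are hypothetical helpers that stand in for LoadFile, ngram_selection, and candidate_filtering; it assumes candidates are all n-grams of length 1 up to n and skips normalization and stopword filtering entirely.

```python
import gzip
import os
import tempfile
from collections import defaultdict


def ngrams(tokens, n=3):
    """Return the set of all 1..n-grams of a token list, joined by spaces."""
    return {' '.join(tokens[i:i + size])
            for size in range(1, n + 1)
            for i in range(len(tokens) - size + 1)}


def document_frequency(documents, n=3):
    """Count, for every n-gram, the number of documents containing it."""
    frequencies = defaultdict(int)
    for tokens in documents:
        for gram in ngrams(tokens, n):   # a set, so one count per document
            frequencies[gram] += 1
    return frequencies, len(documents)


docs = [['graph', 'rank'], ['graph', 'model']]   # toy, pre-tokenized documents
freqs, nb_documents = document_frequency(docs, n=2)

# Dump in the same layout: a --NB_DOC-- line first, then one n-gram per line.
out = os.path.join(tempfile.mkdtemp(), 'df.tsv.gz')
with gzip.open(out, 'wt', encoding='utf-8') as f:
    f.write('--NB_DOC--\t' + str(nb_documents) + '\n')
    for gram in sorted(freqs):
        f.write(gram + '\t' + str(freqs[gram]) + '\n')
print(dict(freqs))
```

Because each document contributes a set of n-grams, a term repeated many times in one document still counts once, which is what distinguishes document frequency from term frequency.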

3.def train_supervised_model(input_dir, reference_file, model_file, extension='xml', language='en', normalization="stemming", df=None, model=None, sep_doc_id=':', sep_ref_keyphrases=',', normalize_reference=False, leave_one_out=False, encoding=None, ref_encoding=None):

Builds a supervised keyphrase extraction model from a set of documents and a reference file.

4.def load_references(input_file, sep_doc_id=':', sep_ref_keyphrases=',', normalize_reference=False, language="en", encoding=None, excluded_file=None):

Loads a reference file. The reference file can be in JSON format or in the official SemEval-2010 format.

5.def load_lda_model(input_file):

Loads a gzip file containing an LDA model.

6.def compute_lda_model(input_dir, output_file, n_topics=500, extension="xml", language="en", normalization="stemming", max_length=None, encoding=None):

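Computes an LDA model from a collection of documents. Latent Dirichlet Allocation is computed with the sklearn module.

The Dirichlet distribution, or multivariate Beta distribution, is a family of high-dimensional continuous probability distributions over the reals whose support is the standard simplex; it generalizes the Beta distribution to higher dimensions. The Dirichlet distribution belongs to the exponential family and is a special case of the Liouville distribution. Generalizing its analytic form yields the generalized Dirichlet distribution and the grouped Dirichlet distribution.

In Bayesian inference, the Dirichlet distribution is used as the conjugate prior of the multinomial distribution; in machine learning it is used to build Dirichlet mixture models. The stochastic process corresponding to the Dirichlet distribution in function space is the Dirichlet process.

The Dirichlet distribution is named after the German mathematician Johann P. G. Lejeune Dirichlet, in honor of his being the first to obtain the analytic form of the Dirichlet distribution (in integral form).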
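
To make "supported on the standard simplex" tangible, here is a small sketch that draws one Dirichlet sample by normalizing independent Gamma draws, a standard construction. The function name `sample_dirichlet`, the parameter values, and the seed are illustrative choices; this is not how pke or sklearn fit an LDA model.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)                            # fixed seed for repeatability
theta = sample_dirichlet([0.5, 0.5, 0.5], rng)    # e.g. a mixture over 3 topics
print(theta, sum(theta))
```

The components are non-negative and sum to 1, so a sample can be read as a probability vector, for instance the topic proportions of one document in LDA.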

7.def load_document_as_bos(input_file, language="en", normalization="stemming", stoplist=None, encoding=None):#Load a document as a bag of words/stems/lemmas.

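Parameters:

input_file (str): path to the input file.
language (str): language of the input document, used for stop_words in sklearn CountVectorizer; defaults to 'en'.

normalization (str): word normalization method, defaults to 'stemming'. Other possible values are 'lemmatization', or 'None' to use word surface forms instead of stems/lemmas.
stoplist (list): list of stopwords used to filter tokens; documented as defaulting to [ ], although the signature above uses None.
encoding (str): encoding of input_file, defaults to None.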
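
The bag-of-stems idea itself is simple enough to sketch with the standard library. The helper `bag_of_words` below is hypothetical and assumes the sentences have already been tokenized and stemmed; it does not reproduce pke's file loading or normalization.

```python
from collections import Counter


def bag_of_words(sentences, stoplist=None):
    """Count normalized tokens across sentences, skipping stopwords."""
    stoplist = set(stoplist or [])
    bag = Counter()
    for stems in sentences:
        bag.update(s for s in stems if s not in stoplist)
    return bag


stems = [['the', 'graph', 'rank', 'word'], ['the', 'word', 'graph']]
bag = bag_of_words(stems, stoplist=['the'])
print(bag)
```

Word order and sentence boundaries are discarded; only how often each stem occurs is kept.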

8.def load_pairwise_similarities(path):

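Loads the pairwise similarities used by ExpandRank.

ExpandRank

Single Document Keyphrase Extraction Using Neighborhood Knowledge
Paper link: https://www.aaai.org/Papers/AAAI/2008/AAAI08-136.pdf

The algorithm steps and the experimental results are described below.

The algorithm proceeds as follows.

Select the k candidate neighbor documents by cosine similarity, where $d_i$ denotes the tf-idf vector of a document:

$$\mathrm{sim}_{doc}(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}$$

Build a word graph over the expanded document set $D$ (the k neighbor documents together with the target document; after part-of-speech tagging, only nouns and adjectives are kept) and compute the weight of the edge between each pair of words. In the formula below, the first factor is the similarity between the target document $d_0$ and an expanded document $d_p$, and the second factor is the number of co-occurrences of words $v_i$ and $v_j$ in document $d_p$:

$$\mathrm{aff}(v_i, v_j) = \sum_{d_p \in D} \mathrm{sim}_{doc}(d_0, d_p) \times \mathrm{count}_{d_p}(v_i, v_j)$$

Compute the weight of each candidate word.

First build the Global Affinity Graph, represented by the matrix $M$ with $M_{ij} = \mathrm{aff}(v_i, v_j)$ for $i \neq j$ and $M_{ii} = 0$.

Normalize $M$ so that each row sums to 1:

$$\tilde{M}_{ij} = \frac{M_{ij}}{\sum_{l=1}^{|V|} M_{il}}$$

Iterate the word scores in the form of the PageRank algorithm:

$$\vec{\lambda} = \mu \tilde{M}^{T} \vec{\lambda} + \frac{1 - \mu}{|V|} \vec{e}$$

Note: $\vec{e}$ is the vector whose elements are all 1, and $\mu$ is the damping factor, usually set to 0.85.

Compute the candidate phrase weights, sort them, and take the top m (m is generally between 1 and 20).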
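
The graph-building and scoring steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's or pke's implementation: the documents are already tokenized and filtered, the similarities in `sims` are given rather than computed (the target document itself gets 1.0), co-occurrence is counted only for adjacent words, and the function names are hypothetical.

```python
from collections import defaultdict


def affinity(docs, sims, window=2):
    """aff[(a, b)] = sum over documents of sim(d0, dp) * co-occurrence count."""
    aff = defaultdict(float)
    for tokens, sim in zip(docs, sims):
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + window]:
                if a != b:
                    aff[(a, b)] += sim
                    aff[(b, a)] += sim       # the graph is undirected
    return aff


def word_scores(aff, mu=0.85, iterations=100):
    """Iterate score = mu * M~^T score + (1 - mu) / |V| on the row-normalized M."""
    vocab = sorted({w for pair in aff for w in pair})
    row_sum = {w: sum(aff.get((w, x), 0.0) for x in vocab) for w in vocab}
    scores = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iterations):
        scores = {
            w: mu * sum(scores[v] * aff.get((v, w), 0.0) / row_sum[v]
                        for v in vocab if row_sum[v] > 0)
            + (1 - mu) / len(vocab)
            for w in vocab}
    return scores


docs = [['fast', 'graph', 'rank'], ['graph', 'rank', 'model']]  # target, neighbor
sims = [1.0, 0.5]                      # assumed similarities to the target
scores = word_scores(affinity(docs, sims))
print(sorted(scores, key=scores.get, reverse=True))
```

Words that co-occur often, and in documents similar to the target, accumulate heavier edges and therefore higher scores; candidate phrases would then be scored by summing the scores of their words.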

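Experimental results:

Data: the DUC2001 dataset, annotated by two graduate students; their annotations were checked for agreement and disagreements were resolved through discussion.
Conclusion: ExpandRank outperforms TF-IDF and SingleRank.

Note: when k=0, ExpandRank degenerates to SingleRank.

Other: the paper also compares how the parameters k (number of neighbor documents), w (co-occurrence window size), and m (number of top keyphrases) affect the results.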

9.def compute_pairwise_similarity_matrix(input_dir, output_file, collection_dir=None, df=None, extension="xml", language="en", normalization="stemming", stoplist=None, encoding=None):

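Computes the pairwise similarities between the documents in input_dir and the documents in collection_dir. Similarity scores are computed as the cosine similarity over TF x IDF term weights. If no collection is provided for computing these scores, the similarities among the documents in input_dir are returned instead.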
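
A minimal sketch of cosine similarity over TF x IDF weights follows. The helpers `tfidf_vector` and `cosine` are hypothetical, the IDF form log(N / df) is an assumption (the exact weighting used by pke is not shown in this post), and the document frequencies are toy values.

```python
import math
from collections import Counter


def tfidf_vector(tokens, df, nb_documents):
    """Weight each term by tf * log(N / df); unseen terms get df = 1."""
    tf = Counter(tokens)
    return {t: c * math.log(nb_documents / df.get(t, 1)) for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


df = {'graph': 2, 'rank': 1, 'model': 1, 'topic': 1}   # toy document frequencies
n_docs = 4
a = tfidf_vector(['graph', 'rank'], df, n_docs)
b = tfidf_vector(['graph', 'model'], df, n_docs)
c = tfidf_vector(['topic'], df, n_docs)
print(cosine(a, b), cosine(a, c))
```

Documents that share no terms get a similarity of 0, while a shared but common term such as `graph` contributes less than a shared rare term would.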