Managing and Recommending Open-Source Algorithms for Specific Problems (15)

郭德纲闭门弟子

Category: Software Engineering Applications and Practice. Tags: algorithms

Article link: https://blog.csdn.net/m0_46320525/article/details/122157541

2021SC@SDUSC

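Series contents

(15) PKE code analysis, part 8

Preface

pke includes the following models (shown in a figure in the original post):

This post starts the code analysis of the supervised models with the feature-based ones.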

supervised->feature_based

kea.py

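Kea: a supervised keyphrase extraction model.

Kea is a supervised model for keyphrase extraction. It uses two features, TF x IDF and first occurrence, to classify keyphrase candidates as keyphrase or not. The model is described in:

Ian Witten, Gordon Paynter, Eibe Frank, Carl Gutwin and Craig Nevill-Manning.
KEA: Practical Automatic Keyphrase Extraction.
Proceedings of the 4th ACM Conference on Digital Libraries, pages 254–255, 1999.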

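(1) Principle

Kea identifies candidate keyphrases with lexical methods, computes feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases.
1. First, candidates are selected according to a set of rules. The authors propose three rules in the paper:
(1) Candidate phrases are limited to a certain maximum length (usually three words).
(2) Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).
(3) Candidate phrases cannot begin or end with a stopword.
2. Extract the TF x IDF feature of each candidate (first occurrence, mentioned above, is the second feature).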

3. Use a Naive Bayes model to score the candidates, rank them by score, and select the top ones as keyphrases.
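
For reference, the two features and the Naive Bayes score can be written roughly as follows. This is a sketch based on my reading of the Kea paper cited above, not on pke's code; the symbol names are notation chosen here. freq(P, D) is the number of times phrase P occurs in document D, size(D) is the number of words in D, count(P) is the number of documents containing P, N is the number of documents in the collection, and Y and N_no are the numbers of positive and negative training instances. In the paper both features are discretized before the probabilities are estimated.

```latex
\mathrm{TF{\times}IDF}(P, D) = \frac{\mathrm{freq}(P, D)}{\mathrm{size}(D)} \times \left( -\log_2 \frac{\mathrm{count}(P)}{N} \right)

\mathrm{first}(P, D) = \frac{\text{number of words before the first occurrence of } P \text{ in } D}{\mathrm{size}(D)}

P[\mathrm{yes}] = \frac{Y}{Y + N_{\mathrm{no}}} \; P_{\mathrm{TF \times IDF}}[\, t \mid \mathrm{yes} \,] \; P_{\mathrm{distance}}[\, d \mid \mathrm{yes} \,]

p = \frac{P[\mathrm{yes}]}{P[\mathrm{yes}] + P[\mathrm{no}]}
```

P[no] is computed the same way from the negative instances, and candidates are ranked by p. Note that pke's feature_extraction, analyzed in section (3), departs from this slightly: it uses the raw occurrence count as tf and add-one document frequencies.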

Reference article: https://blog.csdn.net/qq_41824131/article/details/107028478

(2) Usage example

        The class docstring contains a usage example:

        First, import pke and the stopwords corpus from nltk.corpus

        import pke
        from nltk.corpus import stopwords

        # define a list of stopwords
        stoplist = stopwords.words('english')

        # 1. create a Kea extractor.
        extractor = pke.supervised.Kea()

        # 2. load the content of the document.
        extractor.load_document(input='path/to/input',
                                language='en',
                                normalization=None)

        # 3. select 1-3 grams that do not start or end with a stopword as
        #    candidates. Candidates that contain punctuation marks are discarded.
        extractor.candidate_selection(stoplist=stoplist)

        # 4. classify the candidates as keyphrase or not keyphrase.
        df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
        model_file = 'path/to/kea_model'
        extractor.candidate_weighting(model_file=model_file,
                                      df=df)

        # 5. get the 10 highest scored candidates as keyphrases
        keyphrases = extractor.get_n_best(n=10)
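
As far as I know, get_n_best returns a list of (keyphrase, score) tuples. A minimal sketch for printing such a list is shown here; format_keyphrases is a hypothetical helper, not part of pke, and the sample values are made up for illustration.

```python
def format_keyphrases(keyphrases):
    """Render (phrase, score) tuples as one ranked line per keyphrase."""
    lines = []
    for rank, (phrase, score) in enumerate(keyphrases, start=1):
        lines.append(f"{rank:2d}. {phrase}\t{score:.3f}")
    return "\n".join(lines)


# Made-up sample standing in for the result of extractor.get_n_best(n=10).
sample = [("keyphrase extraction", 0.912), ("naive bayes", 0.874)]
print(format_keyphrases(sample))
```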

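(3) Functions

The file contains one class

class Kea(SupervisedLoadFile):

The class contains five functions

1.def __init__(self):

Redefines the initializer for Kea; it only calls the parent class initializer.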

super(Kea, self).__init__()

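2.def candidate_selection(self, stoplist=None, **kwargs):

Selects 1-3 grams of "normalized" words as keyphrase candidates.
Candidates that start or end with a stopword are discarded. Candidates that contain punctuation marks (from `string.punctuation`) as words are filtered out.

Parameters:
stoplist (list): the stoplist used to filter candidates, defaults to the nltk stoplist.

        # select ngrams from 1 to 3 grams
        self.ngram_selection(n=3)

        # filter candidates containing punctuation marks
        self.candidate_filtering(list(string.punctuation))

        # initialize the stoplist if not provided
        if stoplist is None:
            stoplist = self.stoplist

        # filter candidates that start or end with a stopword
        for k in list(self.candidates):

            # get the candidate
            v = self.candidates[k]

            # delete the candidate if its first or last word is a stopword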
            words = [u.lower() for u in v.surface_forms[0]]
            if words[0] in stoplist or words[-1] in stoplist:
                del self.candidates[k]
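
The selection logic above can be sketched in plain Python, independently of pke. This is an illustration under simplifying assumptions: select_candidates is a hypothetical function, it works on a single pre-tokenized sentence, and it returns lowercased strings rather than pke's candidate objects.

```python
import string


def select_candidates(tokens, stoplist, max_len=3):
    """Return the 1..max_len-grams of `tokens` that pass the Kea-style filters."""
    punctuation = set(string.punctuation)
    stop = {w.lower() for w in stoplist}
    candidates = set()
    for start in range(len(tokens)):
        for n in range(1, max_len + 1):
            gram = tokens[start:start + n]
            if len(gram) < n:
                break  # ran past the end of the sentence
            words = [w.lower() for w in gram]
            # discard n-grams that contain a punctuation mark as a word
            if any(w in punctuation for w in words):
                continue
            # discard n-grams that start or end with a stopword
            if words[0] in stop or words[-1] in stop:
                continue
            candidates.add(" ".join(words))
    return candidates


tokens = ["The", "extraction", "of", "keyphrases", ",", "fast"]
print(sorted(select_candidates(tokens, ["the", "of"])))
```

Stopwords are still allowed inside a candidate, which is why "extraction of keyphrases" survives while "extraction of" does not.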

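3.def feature_extraction(self, df=None, training=False):

Extracts features for each keyphrase candidate. The features are the candidate's tf * idf and its first occurrence relative to the document.

Parameters:
df (dict): document frequencies; the number of documents should be specified using the "--NB_DOC--" key.
training (bool): indicates whether features are computed for the training set when computing IDF weights, defaults to false.

        # initialize the default document frequency counts if none are provided
        if df is None:
            logging.warning('LoadFile._df_counts is hard coded to {}'.format(
                self._df_counts))
            df = load_document_frequency_file(self._df_counts, delimiter='\t')

        # initialize the number of documents as --NB_DOC-- + 1
        N = df.get('--NB_DOC--', 0) + 1
        if training:
            N -= 1

        # find the maximum offset
        maximum_offset = float(sum([s.length for s in self.sentences]))

        for k, v in self.candidates.items():

            # get the candidate's document frequency
            candidate_df = 1 + df.get(k, 0)

            # hack for handling documents from the training set
            if training and candidate_df > 1:
                candidate_df -= 1

            # compute the idf used in the candidate's tf * idf
            idf = math.log(N / candidate_df, 2)

            # add the features to the instance container
            self.instances[k] = np.array([len(v.surface_forms) * idf,
                                          v.offsets[0] / maximum_offset])

        # scale the features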
        self.feature_scaling()
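
To make the arithmetic concrete, here is a minimal standalone sketch of the two features for a single candidate. kea_features is a hypothetical function that mirrors the formulas in the code above; the document frequencies are made-up numbers, and the final feature scaling step is left out.

```python
import math


def kea_features(candidate, tf, first_offset, doc_length, df, training=False):
    """Return [tf * idf, relative first occurrence] for one candidate."""
    # number of documents, plus one for the document being processed
    n_docs = df.get('--NB_DOC--', 0) + 1
    if training:
        n_docs -= 1  # a training document is already counted in df

    # add-one document frequency of the candidate
    candidate_df = 1 + df.get(candidate, 0)
    if training and candidate_df > 1:
        candidate_df -= 1

    idf = math.log(n_docs / candidate_df, 2)
    return [tf * idf, first_offset / doc_length]


# Made-up counts: 7 documents in the collection, 1 of them contains the phrase.
df = {'--NB_DOC--': 7, 'neural network': 1}
print(kea_features('neural network', tf=3, first_offset=10, doc_length=100, df=df))
```

With these numbers N is 8 and the candidate's document frequency is 2, so idf = log2(4) = 2 and the first feature is 3 * 2 = 6; the second feature is 10 / 100 = 0.1.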

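4.def candidate_weighting(self, model_file=None, df=None):

Extracts features and classifies the candidates. If there are no candidates, it returns immediately.

Parameters:
model_file (str): path to the model file.
df (dict): document frequencies; the number of documents should be specified using the "--NB_DOC--" key.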

        if not self.candidates:
            return

        self.feature_extraction(df=df)
        self.classify_candidates(model=model_file)

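5.def train(training_instances, training_classes, model_file):

Declared as a @staticmethod.

Trains a Naive Bayes classifier and stores the model in a file.

Parameters:
training_instances (list): list of feature vectors.
training_classes (list): list of binary values.
model_file (str): the model output file.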

        clf = MultinomialNB()
        clf.fit(training_instances, training_classes)
        dump_model(clf, model_file)
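
The two lists passed to train have to be built from labeled documents first. A minimal sketch of that step is shown here, assuming that a candidate counts as a positive example when it appears in the document's gold keyphrases and that both use the same normalized form; build_training_data is a hypothetical helper, not pke's own training routine.

```python
def build_training_data(instances, gold_keyphrases):
    """Turn {candidate: feature vector} into parallel feature and label lists."""
    training_instances = []
    training_classes = []
    for candidate in sorted(instances):
        training_instances.append(list(instances[candidate]))
        # 1 = keyphrase, 0 = not a keyphrase
        training_classes.append(1 if candidate in gold_keyphrases else 0)
    return training_instances, training_classes


# Made-up features for three candidates of one document.
instances = {
    "keyphrase extraction": [0.9, 0.05],
    "paper": [0.1, 0.40],
    "naive bayes": [0.7, 0.20],
}
X, y = build_training_data(instances, {"keyphrase extraction", "naive bayes"})
print(X, y)
```

The resulting X and y would then play the roles of training_instances and training_classes.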

Summary

This post analyzed supervised->feature_based->kea.py
