Keyword Extraction with the KEA Algorithm

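The previous post covered BERT-based keyword extraction, which produced too few keywords. I needed other methods to add more, and the first one I chose was the KEA algorithm.
The KEA algorithm
KEA identifies candidate keyphrases using lexical methods, computes feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases.
1. First, candidate keyphrases are selected by rule. The authors give three rules in the paper: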
(1) Candidate phrases are limited to a certain maximum length (usually three words).
(2) Candidate phrases cannot be proper names (i.e. single words that only ever appear with an initial capital).
(3) Candidate phrases cannot begin or end with a stopword.
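2. Extract the TF-IDF feature of each candidate keyphrase (the code below also computes the relative position of its first occurrence).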
3. Score the candidates with a Naive Bayes model, rank them by score, and select the top-ranked ones as keywords.
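For reference, the formulas behind steps 2 and 3, as I recall them from the KEA paper listed in the references, look roughly like this. The symbol names are my own choices rather than the paper's exact notation: $N_{\mathrm{docs}}$ is the number of documents in the corpus, $\mathrm{df}(P)$ is the number of documents containing phrase $P$, $Y$ and $M$ are the numbers of positive and negative training candidates, and $t$ and $d$ are the discretized TF×IDF and first-occurrence values of a candidate.

```latex
\begin{align*}
\mathrm{TF{\times}IDF}(P, D) &= \frac{\mathrm{freq}(P, D)}{\mathrm{size}(D)}
    \times \left( -\log_2 \frac{\mathrm{df}(P)}{N_{\mathrm{docs}}} \right) \\
P[\mathrm{yes}] &= \frac{Y}{Y + M}\,
    P_{\mathrm{TF \times IDF}}[t \mid \mathrm{yes}]\,
    P_{\mathrm{distance}}[d \mid \mathrm{yes}] \\
p &= \frac{P[\mathrm{yes}]}{P[\mathrm{yes}] + P[\mathrm{no}]}
\end{align*}
```

$P[\mathrm{no}]$ is defined in the same way from the negative training candidates, and candidates are ranked by $p$. The pke code in this post differs in detail: it multiplies the raw occurrence count by the IDF and adds one to both document counts.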
Main code
1. Select candidates according to the rules

    def candidate_selection(self, stoplist=None, **kwargs):
        """Select 1-3 grams of `normalized` words as keyphrase candidates.
        Candidates that start or end with a stopword are discarded. Candidates
        that contain punctuation marks (from `string.punctuation`) as words are
        filtered out.

        Args:
            stoplist (list): the stoplist for filtering candidates, defaults
                to the nltk stoplist.
        """

        # select ngrams from 1 to 3 grams
        self.ngram_selection(n=3)

        # filter candidates containing punctuation marks
        self.candidate_filtering(list(string.punctuation))

        # initialize stoplist list if not provided
        if stoplist is None:
            stoplist = self.stoplist

        # filter candidates that start or end with a stopword
        for k in list(self.candidates):

            # get the candidate
            v = self.candidates[k]

            # delete if candidate contains a stopword in first/last position
            words = [u.lower() for u in v.surface_forms[0]]
            if words[0] in stoplist or words[-1] in stoplist:
                del self.candidates[k]
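To try the filtering logic outside pke, here is a minimal standalone sketch. It is not pke's implementation: `select_candidates` and the tiny `STOPLIST` are hypothetical names made up for illustration, it works on raw tokens instead of normalized words, and, like the excerpt above, it does not implement the proper-name rule.

```python
import string

# Tiny illustrative stoplist (an assumption; pke uses a full stoplist).
STOPLIST = {"the", "of", "a", "an", "and", "in", "for", "is", "to"}


def select_candidates(tokens, stoplist=STOPLIST, max_len=3):
    """Return the 1..max_len-grams that pass the punctuation and stopword filters."""
    punctuation = set(string.punctuation)
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            words = [w.lower() for w in tokens[i:i + n]]
            # drop n-grams that contain a punctuation token
            if any(w in punctuation for w in words):
                continue
            # drop n-grams that begin or end with a stopword
            if words[0] in stoplist or words[-1] in stoplist:
                continue
            phrase = " ".join(words)
            if phrase not in candidates:
                candidates.append(phrase)
    return candidates


print(select_candidates("the extraction of keyphrases".split()))
```

For this input only "extraction", "keyphrases" and "extraction of keyphrases" survive. A stopword in the middle of a phrase is allowed; only the first and last positions are checked.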

2. Extract the TF-IDF features

    def feature_extraction(self, df=None, training=False):
        """Extract features for each keyphrase candidate. Features are the
        tf*idf of the candidate and its first occurrence relative to the
        document.

        Args:
            df (dict): document frequencies, the number of documents should be
                specified using the "--NB_DOC--" key.
            training (bool): indicates whether features are computed for the
                training set for computing IDF weights, defaults to false.
        """

        # initialize default document frequency counts if none provided
        if df is None:
            logging.warning('LoadFile._df_counts is hard coded to {}'.format(
                self._df_counts))
            df = load_document_frequency_file(self._df_counts, delimiter='\t')

        # initialize the number of documents as --NB_DOC--
        N = df.get('--NB_DOC--', 0) + 1
        if training:
            N -= 1

        # find the maximum offset
        maximum_offset = float(sum([s.length for s in self.sentences]))

        for k, v in self.candidates.items():

            # get candidate document frequency
            candidate_df = 1 + df.get(k, 0)

            # hack for handling training documents
            if training and candidate_df > 1:
                candidate_df -= 1

            # compute the tf*idf of the candidate
            idf = math.log(N / candidate_df, 2)

            # add the features to the instance container
            self.instances[k] = np.array([len(v.surface_forms) * idf,
                                          v.offsets[0] / maximum_offset])

        # scale features
        self.feature_scaling()
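The arithmetic in this method is easier to follow with concrete numbers. The sketch below repeats only the non-training branch for a single candidate and leaves out feature scaling. `kea_features` and its parameter names are hypothetical, chosen for this example, and are not part of pke.

```python
import math


def kea_features(count, first_offset, doc_length, candidate_df, n_docs):
    """Return (tf*idf, relative first occurrence) for one candidate.

    Follows the non-training branch of the excerpt above: both document
    counts are incremented by one before the logarithm is taken.
    """
    N = n_docs + 1            # documents in the collection, plus one
    df = 1 + candidate_df     # documents containing the candidate, plus one
    idf = math.log(N / df, 2)
    return count * idf, first_offset / doc_length


# A candidate that occurs 3 times, first at word offset 10 of a 200-word
# document, and that appears in 3 of the 15 documents in the collection.
tfidf, first_occurrence = kea_features(count=3, first_offset=10,
                                       doc_length=200, candidate_df=3,
                                       n_docs=15)
print(tfidf, first_occurrence)
```

Here the IDF is log2(16 / 4) = 2, so the tf*idf feature is 3 × 2 = 6, and the first-occurrence feature is 10 / 200 = 0.05. Note that the term frequency is the raw occurrence count, not a count normalized by document length.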

3. Train the Naive Bayes model and save it

    @staticmethod
    def train(training_instances, training_classes, model_file):
        """ Train a Naive Bayes classifier and store the model in a file.

            Args:
                training_instances (list): list of features.
                training_classes (list): list of binary values.
                model_file (str): the model output file.
        """

        clf = MultinomialNB()
        clf.fit(training_instances, training_classes)
        dump_model(clf, model_file)
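To show how a trained Naive Bayes model turns features into a ranking, here is a small pure-Python sketch. It is not what pke does internally (pke fits scikit-learn's `MultinomialNB` on the scaled features). Instead it assumes each feature has already been discretized into a small number of bins and uses Laplace-smoothed counts. The function names, the toy training data and the candidate phrases are all made up for illustration.

```python
from collections import Counter


def train_nb(instances, labels):
    """Count, per class, how often each discretized feature value occurs."""
    class_counts = Counter(labels)
    # value_counts[(label, feature_index, value)] -> number of training candidates
    value_counts = Counter()
    for features, label in zip(instances, labels):
        for j, value in enumerate(features):
            value_counts[(label, j, value)] += 1
    return class_counts, value_counts


def keyphrase_probability(features, class_counts, value_counts, n_values):
    """Return P[yes] / (P[yes] + P[no]) for one candidate."""
    total = sum(class_counts.values())
    joint = {}
    for label in (0, 1):
        p = class_counts[label] / total  # class prior
        for j, value in enumerate(features):
            # Laplace-smoothed conditional probability of this feature value
            p *= (value_counts[(label, j, value)] + 1) / (class_counts[label] + n_values)
        joint[label] = p
    return joint[1] / (joint[0] + joint[1])


# Toy training set: (tf*idf bin, first-occurrence bin), label 1 = keyphrase.
instances = [(2, 0), (2, 1), (1, 0), (0, 2), (0, 1), (1, 2)]
labels = [1, 1, 1, 0, 0, 0]
class_counts, value_counts = train_nb(instances, labels)

candidates = {"keyphrase extraction": (2, 0),
              "naive bayes": (1, 1),
              "next section": (0, 2)}
scores = {c: keyphrase_probability(f, class_counts, value_counts, n_values=3)
          for c, f in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With this toy data the candidate with a high tf*idf bin and an early first occurrence scores about 0.9, the middle one 0.5 and the last one about 0.1, so the ranking follows the order in which the candidates are listed.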

References
https://www.cs.waikato.ac.nz/ml/publications/2005/chap_Witten-et-al_Windows.pdf
https://github.com/boudinfl/pke
