Extracting Keyphrases with the WINGNUS Algorithm

I think of WINGNUS as an improved version of the KEA algorithm: it takes linguistic structure into account and looks not only at global document information but also at locally important regions.

The WINGNUS algorithm

The paper reports that, statistically, the important parts of a document tend to appear at the beginnings of sentences, in titles, and in similar places. So instead of using the whole document text as input, WINGNUS reduces the input text at several levels, from the full text down to a minimal subset, focusing on those important regions.

1. As in KEA, candidate phrases are first selected with rule-based patterns.
2. Features are extracted for each candidate: on top of TF-IDF, WINGNUS adds word offset, typeface, phrase length in words, and similar features.
3. A Naive Bayes model scores the candidates to produce the keyphrases; an end-to-end usage sketch follows.
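These three steps correspond to the WINGNUS implementation in the pke library, whose code is walked through below. As a quick orientation, here is a minimal usage sketch; it assumes pke is installed, that a document frequency file (df.tsv.gz) and a trained model (wingnus_model.pickle) already exist at those hypothetical paths, and that the API matches the version of the code shown in this post.

    import pke
    from pke.utils import load_document_frequency_file

    # document frequencies computed on a background corpus (hypothetical path)
    df = load_document_frequency_file(input_file='df.tsv.gz', delimiter='\t')

    # 1. load the document and select noun-phrase candidates
    extractor = pke.supervised.WINGNUS()
    extractor.load_document(input='some scientific paper text ...', language='en')
    extractor.candidate_selection()

    # 2 + 3. extract features and score candidates with a trained Naive Bayes model
    extractor.candidate_weighting(model_file='wingnus_model.pickle', df=df)

    # print the ten highest-scoring keyphrases
    for keyphrase, score in extractor.get_n_best(n=10):
        print(keyphrase, score)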
Main code

1. Candidate keyphrase selection

    def candidate_selection(self, grammar=None):
        """Select noun phrases (NP) and NP containing a pre-propositional phrase
        (NP IN NP) as keyphrase candidates.

        Args:
            grammar (str): grammar defining POS patterns of NPs.
        """

        # initialize default grammar if none provided
        if grammar is None:
            grammar = r"""
                NBAR:
                    {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>} 
                    
                NP:
                    {<NBAR>}
                    {<NBAR><ADP><NBAR>}
            """

        self.grammar_selection(grammar)
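For intuition about what this grammar matches: NBAR chunks up to two adjectives or nouns followed by a head noun, and the NP clause wraps either a single NBAR or an NBAR-adposition-NBAR sequence. Below is a toy illustration using NLTK's RegexpParser directly (the kind of chunker that grammar_selection applies internally); the POS-tagged sentence is made up for the example.

    from nltk import RegexpParser

    grammar = r"""
        NBAR:
            {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>}

        NP:
            {<NBAR>}
            {<NBAR><ADP><NBAR>}
    """

    # toy sentence with Universal POS tags, made up for illustration
    tagged = [('supervised', 'ADJ'), ('keyphrase', 'NOUN'), ('extraction', 'NOUN'),
              ('works', 'VERB'), ('well', 'ADV')]

    chunker = RegexpParser(grammar)
    tree = chunker.parse(tagged)

    # print every NP chunk the grammar finds
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        print(' '.join(word for word, pos in subtree.leaves()))
        # expected: "supervised keyphrase extraction"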

2. Features of the candidate keyphrases

    def feature_extraction(self, df=None, training=False, features_set=None):
        """Extract features for each candidate.

        Args:
            df (dict): document frequencies, the number of documents should be
                specified using the "--NB_DOC--" key.
            training (bool): indicates whether features are computed for the
                training set for computing IDF weights, defaults to false.
            features_set (list): the set of features to use, defaults to
                [1, 4, 6].

        """

        # define the default features_set
        if features_set is None:
            features_set = [1, 4, 6]

        # initialize default document frequency counts if none provided
        if df is None:
            logging.warning('LoadFile._df_counts is hard coded to {}'.format(
                self._df_counts))
            df = load_document_frequency_file(self._df_counts, delimiter='\t')

        # initialize the number of documents as --NB_DOC--
        N = df.get('--NB_DOC--', 0) + 1
        if training:
            N -= 1

        # find the maximum offset
        maximum_offset = float(sum([s.length for s in self.sentences]))

        # loop through the candidates
        for k, v in self.candidates.items():

            # initialize features array
            feature_array = []

            # get candidate document frequency
            candidate_df = 1 + df.get(k, 0)

            # hack for handling training documents
            if training and candidate_df > 1:
                candidate_df -= 1

            # compute the tf*idf of the candidate
            idf = math.log(N / candidate_df, 2)

            # [F1] TF*IDF
            feature_array.append(len(v.surface_forms) * idf)

            # [F2] -> TF
            feature_array.append(len(v.surface_forms))

            # [F3] -> term frequency of substrings
            tf_of_substrings = 0
            stoplist = self.stoplist
            for i in range(len(v.lexical_form)):
                for j in range(i, min(len(v.lexical_form), i + 3)):
                    sub_words = v.lexical_form[i:j + 1]
                    sub_string = ' '.join(sub_words)

                    # skip if substring is fullstring
                    if sub_string == ' '.join(v.lexical_form):
                        continue

                    # skip if substring contains a stopword
                    if set(sub_words).intersection(stoplist):
                        continue

                    # check whether the substring occurs "as is"
                    if sub_string in self.candidates:

                        # loop through substring offsets
                        for offset_1 in self.candidates[sub_string].offsets:
                            is_included = False
                            for offset_2 in v.offsets:
                                if offset_2 <= offset_1 <= offset_2 + len(v.lexical_form):
                                    is_included = True
                            if not is_included:
                                tf_of_substrings += 1

            feature_array.append(tf_of_substrings)

            # [F4] -> relative first occurrence
            feature_array.append(v.offsets[0] / maximum_offset)

            # [F5] -> relative last occurrence
            feature_array.append(v.offsets[-1] / maximum_offset)

            # [F6] -> length of phrases in words
            feature_array.append(len(v.lexical_form))

            # [F7] -> typeface
            feature_array.append(0)

            # extract information from sentence meta information
            meta = [self.sentences[sid].meta for sid in v.sentence_ids]

            # extract meta information of candidate
            sections = [u['section'] for u in meta if 'section' in u]
            types = [u['type'] for u in meta if 'type' in u]

            # [F8] -> Is in title
            feature_array.append('title' in sections)

            # [F9] -> TitleOverlap
            feature_array.append(0)

            # [F10] -> Header
            feature_array.append('sectionHeader' in types or
                                 'subsectionHeader' in types or
                                 'subsubsectionHeader' in types)

            # [F11] -> abstract
            feature_array.append('abstract' in sections)

            # [F12] -> introduction
            feature_array.append('introduction' in sections)

            # [F13] -> related work
            feature_array.append('related work' in sections)

            # [F14] -> conclusions
            feature_array.append('conclusions' in sections)

            # [F15] -> HeaderF
            feature_array.append(types.count('sectionHeader') +
                                 types.count('subsectionHeader') +
                                 types.count('subsubsectionHeader'))

            # [F16] -> abstractF
            feature_array.append(sections.count('abstract'))

            # [F17] -> introductionF
            feature_array.append(sections.count('introduction'))

            # [F18] -> related workF
            feature_array.append(sections.count('related work'))

            # [F19] -> conclusionsF
            feature_array.append(sections.count('conclusions'))

            # add the features to the instance container
            self.instances[k] = np.array([feature_array[i - 1] for i
                                          in features_set])

        # scale features
        self.feature_scaling()
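To make the default features concrete, here is a tiny worked example of [F1], [F4] and [F6] (the default features_set is [1, 4, 6]) with made-up numbers, mirroring the computation above.

    import math

    # made-up toy numbers for one candidate
    N = 101                # number of background documents ('--NB_DOC--') + 1
    candidate_df = 1 + 4   # the candidate appears in 4 background documents
    tf = 3                 # the candidate occurs 3 times in this document
    first_offset = 12      # word offset of its first occurrence
    maximum_offset = 240.0 # total number of words in the document
    n_words = 2            # candidate length in words

    idf = math.log(N / candidate_df, 2)

    f1 = tf * idf                        # [F1] TF*IDF
    f4 = first_offset / maximum_offset   # [F4] relative first occurrence
    f6 = n_words                         # [F6] phrase length in words

    print(round(f1, 2), round(f4, 2), f6)  # -> 13.01 0.05 2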

3. Training the Naive Bayes classifier and saving the model

    def train(training_instances, training_classes, model_file):
        """Train a Naive Bayes classifier and store the model in a file.

        Args:
            training_instances (list): list of features.
            training_classes (list): list of binary values.
            model_file (str): the model output file.
        """

        clf = MultinomialNB()
        clf.fit(training_instances, training_classes)
        dump_model(clf, model_file)
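Putting the pieces together for training: the sketch below shows one way the training instances and labels could be assembled before calling the train() function above. It assumes the same pke version as the code in this post; the tiny corpus, the df.tsv.gz path and the model path are made up, and in practice the gold keyphrases have to be normalized (e.g. stemmed) the same way as the candidates.

    import pke
    from pke.utils import load_document_frequency_file

    # hypothetical training data: document text -> set of gold keyphrases
    corpus = {
        'first training document ...': {'keyphrase extraction', 'naive bayes'},
        'second training document ...': {'noun phrase', 'document frequency'},
    }

    df = load_document_frequency_file(input_file='df.tsv.gz', delimiter='\t')

    X, y = [], []
    for text, gold in corpus.items():
        extractor = pke.supervised.WINGNUS()
        extractor.load_document(input=text, language='en')
        extractor.candidate_selection()
        extractor.feature_extraction(df=df, training=True)

        # one instance per candidate, labelled 1 if it is a gold keyphrase
        for form, features in extractor.instances.items():
            X.append(features)
            y.append(1 if form in gold else 0)

    # fit the Multinomial Naive Bayes classifier and save it (see train() above)
    pke.supervised.WINGNUS.train(X, y, 'wingnus_model.pickle')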

4. Comparison with other algorithms

Keyphrases produced by the KEA algorithm for two sample abstracts:

![KEA keyphrases for two sample abstracts](https://img-blog.csdnimg.cn/20200630002604233.png)

Keyphrases produced by the WINGNUS algorithm for the same two abstracts: (figure not recovered)

You can see that the WINGNUS output carries more information.

Remaining issues

Some of the output phrases are not real words, and high-frequency, meaningless stopwords are not filtered out. A later improvement is therefore to remove stopwords and to keep phrases within a reasonable length range, which yields more reliable keyphrases; a small post-filtering sketch follows.
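A minimal post-filtering sketch of that idea (my own addition, not part of pke): drop candidates that contain a stopword or fall outside a reasonable length range. The stoplist here is a tiny illustrative one.

    # post-filter the ranked keyphrases (illustrative sketch, not pke code)
    stoplist = {'the', 'of', 'and', 'a', 'an', 'in', 'to', 'is'}  # tiny example stoplist

    def filter_keyphrases(keyphrases, min_words=1, max_words=4):
        kept = []
        for phrase, score in keyphrases:
            words = phrase.split()
            if any(w in stoplist for w in words):
                continue  # drop phrases containing a stopword
            if not (min_words <= len(words) <= max_words):
                continue  # drop phrases that are too short or too long
            kept.append((phrase, score))
        return kept

    # keyphrases = extractor.get_n_best(n=20)   # (candidate, score) pairs
    # print(filter_keyphrases(keyphrases))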

References
https://www.aclweb.org/anthology/S10-1035.pdf
