Keyphrase Extraction with the WINGNUS Algorithm
I think WINGNUS can be viewed as an improved version of the KEA algorithm: it exploits the logical structure of a document, attending not only to global information about the article but also to locally important regions.
The WINGNUS algorithm
The paper observes, based on corpus statistics, that the important parts of a document tend to appear at sentence beginnings, in titles, and in similarly prominent positions. Rather than feeding in the full document text, WINGNUS therefore reduces the input at several levels of granularity, from the complete text down to a minimal subset, concentrating on those important regions.
1. As in KEA, first select candidate phrases according to a set of rules.
2. Extract features for each candidate: on top of TF-IDF, WINGNUS adds word offset, typeface attributes, phrase length in words, and similar features.
3. Score the candidates with a Naive Bayes model (a minimal end-to-end sketch follows).
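To see where these three steps sit in practice, here is a minimal sketch using the pke toolkit, which ships a WINGNUS implementation; the input file sample.txt is a hypothetical stand-in, and the snippet should be treated as a sketch rather than the canonical pipeline:

    import pke

    # minimal sketch: extract keyphrases from a plain-text file with pke's
    # WINGNUS and its bundled default Naive Bayes model
    extractor = pke.supervised.WINGNUS()
    extractor.load_document(input='sample.txt', language='en')  # hypothetical file
    extractor.candidate_selection()    # step 1: rule-based candidates
    extractor.candidate_weighting()    # steps 2-3: features + Naive Bayes scoring
    print(extractor.get_n_best(n=10))  # top-10 (keyphrase, score) pairs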
Core code (the excerpts below follow the pke toolkit's implementation)
1. Candidate selection
def candidate_selection(self, grammar=None):
    """Select noun phrases (NP) and NPs containing a prepositional phrase
    (NP IN NP) as keyphrase candidates.

    Args:
        grammar (str): grammar defining POS patterns of NPs.
    """
    # initialize default grammar if none provided
    if grammar is None:
        grammar = r"""
            NBAR:
                {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>}
            NP:
                {<NBAR>}
                {<NBAR><ADP><NBAR>}
        """
    self.grammar_selection(grammar)
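The grammar is an NLTK-style chunk grammar over universal POS tags: NBAR matches up to two nouns/proper nouns/adjectives followed by a noun, and NP is either a bare NBAR or two NBARs joined by an adposition. As a rough standalone illustration (the toy sentence and its tags are made up; inside pke, grammar_selection does the equivalent over the document's own POS tags):

    import nltk

    # chunk a toy POS-tagged sentence with the same grammar as above
    grammar = r"""
        NBAR:
            {<NOUN|PROPN|ADJ>{,2}<NOUN|PROPN>}
        NP:
            {<NBAR>}
            {<NBAR><ADP><NBAR>}
    """
    parser = nltk.RegexpParser(grammar)
    tagged = [('automatic', 'ADJ'), ('keyphrase', 'NOUN'), ('extraction', 'NOUN'),
              ('from', 'ADP'), ('scientific', 'ADJ'), ('articles', 'NOUN')]
    tree = parser.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        print(' '.join(word for word, tag in subtree.leaves()))
    # -> automatic keyphrase extraction
    # -> scientific articles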
2. Candidate features
# method of pke's WINGNUS class; relies on math, logging, numpy (as np) and
# pke.utils.load_document_frequency_file being imported at module level
def feature_extraction(self, df=None, training=False, features_set=None):
    """Extract features for each candidate.

    Args:
        df (dict): document frequencies, the number of documents should be
            specified using the "--NB_DOC--" key.
        training (bool): indicates whether features are computed for the
            training set for computing IDF weights, defaults to false.
        features_set (list): the set of features to use, defaults to
            [1, 4, 6].
    """
    # define the default features_set
    if features_set is None:
        features_set = [1, 4, 6]

    # initialize default document frequency counts if none provided
    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(
            self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # initialize the number of documents as --NB_DOC--
    N = df.get('--NB_DOC--', 0) + 1
    if training:
        N -= 1

    # find the maximum offset
    maximum_offset = float(sum([s.length for s in self.sentences]))

    # loop through the candidates
    for k, v in self.candidates.items():

        # initialize features array
        feature_array = []

        # get candidate document frequency
        candidate_df = 1 + df.get(k, 0)

        # hack for handling training documents
        if training and candidate_df > 1:
            candidate_df -= 1

        # compute the idf of the candidate
        idf = math.log(N / candidate_df, 2)

        # [F1] -> TF*IDF
        feature_array.append(len(v.surface_forms) * idf)

        # [F2] -> TF
        feature_array.append(len(v.surface_forms))

        # [F3] -> term frequency of substrings
        tf_of_substrings = 0
        stoplist = self.stoplist
        for i in range(len(v.lexical_form)):
            for j in range(i, min(len(v.lexical_form), i + 3)):
                sub_words = v.lexical_form[i:j + 1]
                sub_string = ' '.join(sub_words)

                # skip if substring is fullstring
                if sub_string == ' '.join(v.lexical_form):
                    continue

                # skip if substring contains a stopword
                if set(sub_words).intersection(stoplist):
                    continue

                # check whether the substring occurs "as is"
                if sub_string in self.candidates:

                    # loop through substring offsets
                    for offset_1 in self.candidates[sub_string].offsets:
                        is_included = False
                        for offset_2 in v.offsets:
                            if offset_2 <= offset_1 <= offset_2 + len(v.lexical_form):
                                is_included = True
                        if not is_included:
                            tf_of_substrings += 1
        feature_array.append(tf_of_substrings)

        # [F4] -> relative first occurrence
        feature_array.append(v.offsets[0] / maximum_offset)

        # [F5] -> relative last occurrence
        feature_array.append(v.offsets[-1] / maximum_offset)

        # [F6] -> length of phrases in words
        feature_array.append(len(v.lexical_form))

        # [F7] -> typeface
        feature_array.append(0)

        # extract information from sentence meta information
        meta = [self.sentences[sid].meta for sid in v.sentence_ids]

        # extract meta information of candidate
        sections = [u['section'] for u in meta if 'section' in u]
        types = [u['type'] for u in meta if 'type' in u]

        # [F8] -> Is in title
        feature_array.append('title' in sections)

        # [F9] -> TitleOverlap
        feature_array.append(0)

        # [F10] -> Header
        feature_array.append('sectionHeader' in types or
                             'subsectionHeader' in types or
                             'subsubsectionHeader' in types)

        # [F11] -> abstract
        feature_array.append('abstract' in sections)

        # [F12] -> introduction
        feature_array.append('introduction' in sections)

        # [F13] -> related work
        feature_array.append('related work' in sections)

        # [F14] -> conclusions
        feature_array.append('conclusions' in sections)

        # [F15] -> HeaderF
        feature_array.append(types.count('sectionHeader') +
                             types.count('subsectionHeader') +
                             types.count('subsubsectionHeader'))

        # [F16] -> abstractF
        feature_array.append(sections.count('abstract'))

        # [F17] -> introductionF
        feature_array.append(sections.count('introduction'))

        # [F18] -> related workF
        feature_array.append(sections.count('related work'))

        # [F19] -> conclusionsF
        feature_array.append(sections.count('conclusions'))

        # add the features to the instance container
        self.instances[k] = np.array([feature_array[i - 1] for i
                                      in features_set])

    # scale features
    self.feature_scaling()
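To make feature [F1] concrete, here is a toy computation with made-up numbers: a candidate occurring 3 times in the current document and in 10 of the 99 reference documents:

    import math

    N = 99 + 1               # --NB_DOC-- reference documents plus this one
    candidate_df = 1 + 10    # document frequency, smoothed by 1
    tf = 3                   # surface forms of the candidate in this document
    idf = math.log(N / candidate_df, 2)  # log2(100 / 11) ~ 3.18
    print(tf * idf)          # ~9.55, the value appended as [F1]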
3. Training and saving the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

def train(training_instances, training_classes, model_file):
    """Train a Naive Bayes classifier and store the model in a file.

    Args:
        training_instances (list): list of feature vectors.
        training_classes (list): list of binary labels.
        model_file (str): the model output file.
    """
    clf = MultinomialNB()
    clf.fit(training_instances, training_classes)
    # dump_model is pke's helper for serializing a trained model to disk
    dump_model(clf, model_file)
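A hypothetical training call; the feature vectors are made up but follow the default features_set [1, 4, 6] (TF*IDF, relative first occurrence, phrase length in words), and the labels mark gold keyphrases:

    # made-up data: one row per candidate from feature_extraction
    X = [[9.55, 0.02, 3],
         [1.20, 0.85, 1]]
    y = [1, 0]  # 1 = gold keyphrase, 0 = not
    train(X, y, 'wingnus-model.pickle')  # hypothetical output path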
4. Comparison with other algorithms
[Figure: keyphrases extracted by KEA for two sample abstracts]
[Figure: keyphrases extracted by WINGNUS for the same two abstracts]
Compared with the KEA output, the WINGNUS keyphrases clearly carry more information.
Remaining issues
Some of the extracted items are not valid words, and frequent but meaningless stopwords are not filtered out either. A natural follow-up improvement is therefore to remove stopword-bearing candidates and to constrain candidate length, which yields more reliable keyphrases; a sketch follows.
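pke's LoadFile exposes a candidate_filtering helper that covers this kind of cleanup; a sketch of the improvement (argument names may vary across pke versions, so verify against your installed copy):

    # drop candidates containing stopwords and enforce length bounds
    # before feature extraction and weighting
    extractor.candidate_selection()
    extractor.candidate_filtering(stoplist=list(extractor.stoplist),
                                  minimum_length=3,       # min characters overall
                                  maximum_word_number=4)  # max words per phrase
    extractor.candidate_weighting()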
References
Nguyen, T. D. and Luong, M.-T. (2010). WINGNUS: Keyphrase Extraction Utilizing Document Logical Structure. Proceedings of SemEval-2010. https://www.aclweb.org/anthology/S10-1035.pdf