With the rapid growth of the internet, information now travels easily: where news once came from television and newspapers, phones and computers have become its carriers. As the barrier to publishing news has dropped, countless stories appear every day, and extracting the key information from large volumes of news text has become increasingly important. This case study uses Python to extract keywords from news articles.
Prepare the news data from which keywords will be extracted and save it as test.txt.
Install the required Python library, jieba, which can be installed with pip install jieba.
Following the earlier case studies, create a new project and add a util.py file to hold the shared utility functions (the code below assumes util.py starts with import math, import numpy as np, and import networkx as nx).
In util.py, create a combine() function that pairs up words.
def combine(word_list, window=2):
    """Generate word pairs within the given window, used to build edges between words.
    word_list -- list of str, a list of words.
    """
    if window < 2:
        window = 2
    for x in range(1, window):
        if x >= len(word_list):
            break
        word_list2 = word_list[x:]
        res = zip(word_list, word_list2)
        for r in res:
            yield r
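A quick check of what combine() yields (the generator is repeated here so the snippet runs on its own):

```python
def combine(word_list, window=2):
    # pair each word with the words up to window-1 positions to its right
    if window < 2:
        window = 2
    for x in range(1, window):
        if x >= len(word_list):
            break
        for r in zip(word_list, word_list[x:]):
            yield r

# window=3 links each word to its next two neighbours
print(list(combine(['a', 'b', 'c', 'd'], window=3)))
# → [('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'c'), ('b', 'd')]
```

Each pair becomes an undirected edge in the word graph, so a larger window produces a denser graph.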
In util.py, create a get_similarity() function that computes the similarity of two sentences.
def get_similarity(word_list1, word_list2):
    """Compute the similarity of two sentences.
    word_list1, word_list2 -- the two sentences, each given as a list of words.
    """
    words = list(set(word_list1 + word_list2))
    vector1 = [float(word_list1.count(word)) for word in words]
    vector2 = [float(word_list2.count(word)) for word in words]
    vector3 = [vector1[x] * vector2[x] for x in range(len(vector1))]
    vector4 = [1 for num in vector3 if num > 0.]
    co_occur_num = sum(vector4)
    if abs(co_occur_num) <= 1e-12:
        return 0.
    denominator = math.log(float(len(word_list1))) + math.log(float(len(word_list2)))
    if abs(denominator) < 1e-12:
        return 0.
    return co_occur_num / denominator
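The formula reduces to: the number of distinct words shared by the two sentences, divided by log|s1| + log|s2|. A compact restatement makes that easy to check by hand (cooccur_similarity is our name for this sketch, not part of the case's files):

```python
import math

def cooccur_similarity(word_list1, word_list2):
    # number of distinct words appearing in both sentences
    co_occur_num = len(set(word_list1) & set(word_list2))
    if co_occur_num == 0:
        return 0.0
    # normalize by the log of each sentence's length
    return co_occur_num / (math.log(len(word_list1)) + math.log(len(word_list2)))

print(round(cooccur_similarity(['我', '爱', '北京'], ['我', '爱', '上海']), 3))
# two shared words: 2 / (log 3 + log 3) ≈ 0.910
```

The log normalization keeps long sentences from dominating purely because they contain more words.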
In util.py, create a sort_words() function that sorts words from most to least important.
def sort_words(vertex_source, edge_source, window=2, pagerank_config={'alpha': 0.85, }):
    sorted_words = []
    word_index = {}
    index_word = {}
    _vertex_source = vertex_source
    _edge_source = edge_source
    words_number = 0
    for word_list in _vertex_source:
        for word in word_list:
            if word not in word_index:
                word_index[word] = words_number
                index_word[words_number] = word
                words_number += 1
    graph = np.zeros((words_number, words_number))
    for word_list in _edge_source:
        for w1, w2 in combine(word_list, window):
            if w1 in word_index and w2 in word_index:
                index1 = word_index[w1]
                index2 = word_index[w2]
                graph[index1][index2] = 1.0
                graph[index2][index1] = 1.0
    debug('graph:\n', graph)
    nx_graph = nx.from_numpy_array(graph)  # nx.from_numpy_matrix() in networkx < 3.0
    scores = nx.pagerank(nx_graph, **pagerank_config)
    sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for index, score in sorted_scores:
        item = AttrDict(word=index_word[index], weight=score)
        sorted_words.append(item)
    return sorted_words
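sort_words() packs each result into an AttrDict, which is used throughout this case but never defined in the text. A minimal sketch that supports both item['word'] and item.word access:

```python
class AttrDict(dict):
    """A dict whose entries can also be read and written as attributes."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # point the instance __dict__ at the dict itself so d.key works
        self.__dict__ = self

item = AttrDict(word='新闻', weight=0.12)
print(item.word, item['weight'])
# → 新闻 0.12
```

This is a common convenience pattern; any implementation that exposes dict keys as attributes will do.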
In util.py, create a sort_sentences() function that sorts sentences from most to least important.
def sort_sentences(sentences, words, sim_func=get_similarity, pagerank_config={'alpha': 0.85, }):
    """
    sentences -- list whose elements are sentences.
    words -- two-dimensional list; each sublist corresponds to a sentence in sentences and consists of its words.
    sim_func -- computes the similarity of two sentences, given two lists of words.
    pagerank_config -- PageRank settings.
    """
    sorted_sentences = []
    _source = words
    sentences_num = len(_source)
    graph = np.zeros((sentences_num, sentences_num))
    for x in range(sentences_num):
        for y in range(x, sentences_num):
            similarity = sim_func(_source[x], _source[y])
            graph[x, y] = similarity
            graph[y, x] = similarity
    nx_graph = nx.from_numpy_array(graph)  # nx.from_numpy_matrix() in networkx < 3.0
    scores = nx.pagerank(nx_graph, **pagerank_config)  # this is a dict
    sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for index, score in sorted_scores:
        item = AttrDict(index=index, sentence=sentences[index], weight=score)
        sorted_sentences.append(item)
    return sorted_sentences
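Both sorting functions delegate the actual ranking to nx.pagerank. Its core idea can be sketched with plain power iteration; the simplified version below assumes every node has at least one edge (no dangling-node handling, unlike the real networkx implementation):

```python
def pagerank(adj, alpha=0.85, iters=100):
    """Power-iteration PageRank on a dense adjacency matrix given as lists of lists."""
    n = len(adj)
    ranks = [1.0 / n] * n
    row_sums = [sum(row) for row in adj]  # out-weight of each node
    for _ in range(iters):
        # each node keeps (1-alpha)/n base weight and receives alpha-damped
        # shares of its neighbours' current ranks
        ranks = [(1 - alpha) / n + alpha * sum(
                     ranks[i] * adj[i][j] / row_sums[i]
                     for i in range(n) if row_sums[i] > 0)
                 for j in range(n)]
    return ranks

# path graph a-b-c: the middle node collects the most weight
print(pagerank([[0, 1, 0], [1, 0, 1], [0, 1, 0]]))
```

On the word graph, well-connected words accumulate weight from their neighbours, which is exactly why frequent, central terms surface as keywords.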
Next, create a segmentation.py file to hold the data-cleaning functions for word and sentence segmentation.
In segmentation.py, create a WordSegmentation class with an initializer and a segment() word-segmentation method.
def segment(self, text, lower=True, use_stop_words=True, use_speech_tags_filter=False):
    """
    lower -- whether to lowercase words (for English text).
    use_stop_words -- if True, filter words against the stop-word set (remove stop words).
    use_speech_tags_filter -- whether to filter by part of speech; if True, keep only words whose tag is in self.default_speech_tag_filter, otherwise do not filter.
    """
    text = util.as_text(text)
    jieba_result = pseg.cut(text)
    if use_speech_tags_filter:
        jieba_result = [w for w in jieba_result if w.flag in self.default_speech_tag_filter]
    else:
        jieba_result = [w for w in jieba_result]
    word_list = [w.word.strip() for w in jieba_result if w.flag != 'x']  # drop special symbols
    word_list = [word for word in word_list if len(word) > 0]
    if lower:
        word_list = [word.lower() for word in word_list]
    if use_stop_words:
        word_list = [word.strip() for word in word_list if word.strip() not in self.stop_words]
    return word_list
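The cleanup steps that follow jieba (strip, drop empty tokens, lowercase, remove stop words) can be exercised on their own. filter_words below is a hypothetical helper mirroring those lines, not part of the case's files:

```python
def filter_words(word_list, stop_words, lower=True):
    words = [w.strip() for w in word_list]    # trim surrounding whitespace
    words = [w for w in words if len(w) > 0]  # drop empty tokens
    if lower:
        words = [w.lower() for w in words]    # lowercase (for English)
    return [w for w in words if w not in stop_words]

print(filter_words([' The ', 'cat', '', 'sat'], stop_words={'the'}))
# → ['cat', 'sat']
```

Keeping these steps separable makes it easy to test the pipeline without loading jieba.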
Then, in segmentation.py, create a SentenceSegmentation class with an initializer and a segment() sentence-splitting method.
def segment(self, text):
    res = [util.as_text(text)]
    util.debug(res)
    util.debug(self.delimiters)
    for sep in self.delimiters:
        text, res = res, []
        for seq in text:
            res += seq.split(sep)
    res = [s.strip() for s in res if len(s.strip()) > 0]
    return res
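The repeated-split trick above can be tried standalone. The delimiter set here is chosen for illustration; the real set comes from the class initializer, which is not shown in this excerpt:

```python
def split_sentences(text, delimiters=('。', '？', '！', '?', '!', ';', '；')):
    res = [text]
    for sep in delimiters:
        # re-split every fragment produced so far on the next delimiter
        text, res = res, []
        for seq in text:
            res += seq.split(sep)
    return [s.strip() for s in res if len(s.strip()) > 0]

print(split_sentences('今天天气很好。我们出去玩!好吗?'))
# → ['今天天气很好', '我们出去玩', '好吗']
```

Each pass re-splits every fragment on one delimiter, so after the loop the text is split on all of them.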
Finally, tie the word and sentence segmentation together: create a Segmentation class in segmentation.py, again with a segment() method.
def segment(self, text, lower=False):
    text = util.as_text(text)
    sentences = self.ss.segment(text)
    words_no_filter = self.ws.segment_sentences(sentences=sentences, lower=lower,
                                                use_stop_words=False, use_speech_tags_filter=False)
    words_no_stop_words = self.ws.segment_sentences(sentences=sentences, lower=lower,
                                                    use_stop_words=True, use_speech_tags_filter=False)
    words_all_filters = self.ws.segment_sentences(sentences=sentences, lower=lower,
                                                  use_stop_words=True, use_speech_tags_filter=True)
    return util.AttrDict(
        sentences=sentences,
        words_no_filter=words_no_filter,
        words_no_stop_words=words_no_stop_words,
        words_all_filters=words_all_filters
    )
With word and sentence segmentation in place, keyword extraction can begin. Create a Keywords.py file with an initializer that sets up the parameters, and an analyze() function that analyzes the text from which keywords are to be extracted.
def analyze(self, text, window=2, lower=False, vertex_source='all_filters', edge_source='no_stop_words', pagerank_config={'alpha': 0.85, }):
    """
    text -- the text to analyze, a string.
    window -- window size, int, used to build edges between words. Defaults to 2.
    lower -- whether to lowercase the text. Defaults to False.
    vertex_source -- which of words_no_filter, words_no_stop_words, words_all_filters to use as the nodes of the PageRank graph. Defaults to 'all_filters'; allowed values are 'no_filter', 'no_stop_words', 'all_filters'. Keywords are drawn from vertex_source.
    edge_source -- which of words_no_filter, words_no_stop_words, words_all_filters to use to build the edges between the nodes of the PageRank graph, in combination with the window parameter. Defaults to 'no_stop_words'; allowed values are 'no_filter', 'no_stop_words', 'all_filters'.
    """
    self.text = text
    self.word_index = {}
    self.index_word = {}
    self.keywords = []
    self.graph = None
    result = self.seg.segment(text=text, lower=lower)
    self.sentences = result.sentences
    self.words_no_filter = result.words_no_filter
    self.words_no_stop_words = result.words_no_stop_words
    self.words_all_filters = result.words_all_filters
    options = ['no_filter', 'no_stop_words', 'all_filters']
    if vertex_source in options:
        _vertex_source = result['words_' + vertex_source]
    else:
        _vertex_source = result['words_all_filters']
    if edge_source in options:
        _edge_source = result['words_' + edge_source]
    else:
        _edge_source = result['words_no_stop_words']
    self.keywords = util.sort_words(_vertex_source, _edge_source, window=window, pagerank_config=pagerank_config)
In Keywords.py, create a get_keywords() function that returns a list of the num most important keywords whose length is at least word_min_len.
def get_keywords(self, num=6, word_min_len=1):
    result = []
    count = 0
    for item in self.keywords:
        if count >= num:
            break
        if len(item.word) >= word_min_len:
            result.append(item)
            count += 1
    return result
In Keywords.py, create a get_keyphrases() function that builds the candidate phrases formed by the top keywords_num keywords, keeps only those that occur at least min_occur_num times in the original text, and returns the list of phrases.
def get_keyphrases(self, keywords_num=12, min_occur_num=2):
    keywords_set = set([item.word for item in self.get_keywords(num=keywords_num, word_min_len=1)])
    keyphrases = set()
    for sentence in self.words_no_filter:
        one = []
        for word in sentence:
            if word in keywords_set:
                one.append(word)
            else:
                if len(one) > 1:
                    keyphrases.add(''.join(one))
                if len(one) == 0:
                    continue
                else:
                    one = []
        if len(one) > 1:  # a run of keywords may end with the sentence
            keyphrases.add(''.join(one))
    return [phrase for phrase in keyphrases
            if self.text.count(phrase) >= min_occur_num]
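The run-collecting logic inside get_keyphrases() can be isolated into a small helper for testing: scan a sentence's word list, and whenever a run of two or more consecutive keywords ends, join it into a phrase (keyword_runs is our name for this sketch):

```python
def keyword_runs(word_list, keywords_set):
    """Collect phrases formed by joining runs of two or more consecutive keywords."""
    phrases = set()
    one = []
    for word in word_list:
        if word in keywords_set:
            one.append(word)
        else:
            if len(one) > 1:
                phrases.add(''.join(one))
            one = []
    if len(one) > 1:  # a run may end at the end of the sentence
        phrases.add(''.join(one))
    return phrases

print(sorted(keyword_runs(['关键词', '提取', '是', '文本', '分析'],
                          {'关键词', '提取', '文本', '分析'})))
# → ['关键词提取', '文本分析']
```

This is why adjacent keywords such as 关键词 and 提取 merge into the keyphrase 关键词提取, while isolated keywords never form phrases.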