Multi-Document Automatic Summarization
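1 Plan
1. Understand what the .bat file does and handle the file input
2. First produce an automatic summary for a single topic, then extend it to multiple topics
3. Compare the result against the expert summaries and evaluate it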
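For step 3, one simple way to compare a generated summary with an expert summary is unigram-overlap recall, in the spirit of ROUGE-1. These notes do not say which metric was used, so the choice of metric, the function name `unigram_recall`, and the sample strings below are assumptions for illustration only.

```python
import re
from collections import Counter


def _tokens(text):
    """Lower-case the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate.

    Counts are clipped, so a word repeated in the candidate is credited
    at most as many times as it occurs in the reference.
    """
    cand_counts = Counter(_tokens(candidate))
    ref_counts = Counter(_tokens(reference))
    if not ref_counts:
        return 0.0
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


# Hypothetical strings, not taken from the DUC data
reference = "Hun Sen guaranteed the safety of all politicians"
candidate = "Hun Sen has guaranteed safety for politicians"
print(unigram_recall(candidate, reference))  # 5 of 8 reference words are covered
```

A recall-style score rewards covering the expert's content; a precision or F-score variant would additionally penalize summaries padded with unrelated words.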
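2 Baseline + Improvement 1 + Improvement 2
Baseline:
Take the first sentence of each of the 10 documents, in order, and concatenate them to form the summary.
The document format is as follows: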
<DOC>
<DOCNO> APW19981118.0276 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> NEWSWIRE </TXTTYPE>
<TEXT>
Cambodian leader Hun Sen has guaranteed the safety and political freedom
of all politicians, trying to ease the fears of his rivals that they
will be arrested or killed if they return to the country. The assurances
were aimed especially at Sam Rainsy, leader of a vocally anti-Hun
Sen opposition party, who was forced to take refuge in the U.N. offices
in September to avoid arrest after Hun Sen accused him of being behind
a plot against his life. Sam Rainsy and the 14 members of parliament
from his party have been holed up overseas for two months. But a deal
reached between Hun Sen and his chief rival, Prince Norodom Ranariddh,
on forming a new government last week has opened the door for their
return. In a letter to King Norodom Sihanouk _ the prince's father
and Cambodia's head of state _ that was broadcast on television Tuesday,
Hun Sen said that guarantees of safety extended to Ranariddh applied
to all politicians. His assurances come a week before the first session
of Cambodia's new parliament, the National Assembly. Sam Rainsy said
Wednesday that he was unsatisfied with the guarantee. He said it contained
indirect language and loopholes that suggest he and his Sam Rainsy
Party members are still under threat of arrest from Hun Sen's ruling
party. ``It should be easy for them to say, `Rainsy and the SRP members
of the assembly have no charges against them and will not be arrested,'''
the opposition leader said in a statement. ``But instead they make
roundabout statements, full of loopholes that can easily be exploited
by a legal system that is completely in their control.'' Ranariddh
told reporters Wednesday that he believed it was safe for Sam Rainsy
in Cambodia. Speaking upon his return from a brief stay in Bangkok,
the prince said he would soon meet with Hun Sen to discuss the apportioning
of ministries in the new coalition government. Last week, Hun Sen's
Cambodian People's Party and Ranariddh's FUNCINPEC party agreed to
form a coalition that would leave Hun Sen as sole prime minister and
make the prince president of the National Assembly. The deal assures
the two-thirds vote in parliament needed to approve a new government.
The men served as co-prime ministers until Hun Sen overthrew Ranariddh
in a coup last year. ``I think Hun Sen has got everything. He's got
the premiership and legitimacy through the election and recognition
from his majesty the king. I don't think there is any benefit for
Hun Sen to cause instability for our country,'' Ranariddh said. The
prince also said that his top general, Nhek Bunchhay, would not be
given back his previous position as the second-ranking general in
the Cambodian military's general staff. Nhek Bunchhay's outnumbered
forces in the capital put up tough but unsuccessful resistance to
last year's coup.
</TEXT>
</DOC>
The Python code is as follows:
'''
Created by hjn on 2020.5.18
Read the text content of the 10 documents of one topic in the DUC dataset
'''
import os

# Raw string, so the backslashes in the Windows path are not treated as escapes
DATASET_DIR = r"E:\pycharm_workspace\data_mining\data_classification\dataset"

# Walk the dataset folder: each sub-folder holds the documents of one topic
for topic in sorted(os.listdir(DATASET_DIR)):
    topic_dir = os.path.join(DATASET_DIR, topic)
    if not os.path.isdir(topic_dir):
        continue
    final_result = []
    for file in sorted(os.listdir(topic_dir)):
        path = os.path.join(topic_dir, file)
        # "with" closes the file even if reading fails
        with open(path) as f:
            # Join all lines into one string so it can be split on the tags
            result_str = ' '.join(line.strip() for line in f)
        # Keep only what lies between <TEXT> and </TEXT>
        text = result_str.split("<TEXT>")[1].split("</TEXT>")[0]
        # Naive split on "."; an abbreviation such as "U.N." would end the sentence early
        sentences = text.split(".")
        final_result.append(sentences[0].strip())
    # Print the baseline summary of this topic on one line
    for row in final_result:
        print(row, end='. ')
    print()
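The baseline logic can also be tried without the dataset on disk. The sketch below is one possible variant, not the code above: it pulls the body out of the `<TEXT>` element with a regular expression and ends the first sentence at the first period that is followed by whitespace and a capital letter, or by the end of the text. This is only a heuristic; it avoids cutting inside `U.N. offices` but would still split `U.N. Security`. The helper name `lead_sentence` and the shortened sample document are assumptions for illustration.

```python
import re

# Body of the <TEXT> element; DOTALL lets "." match the line breaks inside it
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
# Heuristic sentence end: a period followed by whitespace and a capital letter,
# or a period at the very end of the text
SENTENCE_END_RE = re.compile(r"\.(?=\s+[A-Z]|$)")


def lead_sentence(doc):
    """Return the first sentence of the <TEXT> body, or '' if there is no body."""
    match = TEXT_RE.search(doc)
    if match is None:
        return ""
    # Collapse line breaks and repeated spaces into single spaces
    text = " ".join(match.group(1).split())
    end = SENTENCE_END_RE.search(text)
    return text if end is None else text[:end.end()]


# Shortened, hypothetical document in the same layout as the DUC files
sample_doc = """<DOC>
<DOCNO> SAMPLE.0001 </DOCNO>
<TEXT>
Sam Rainsy was forced to take refuge in the U.N. offices
in September. He has been overseas for two months.
</TEXT>
</DOC>"""

print(lead_sentence(sample_doc))
```

Calling `lead_sentence` on each of the 10 documents of a topic and joining the results with spaces gives the same kind of lead-sentence summary as the baseline.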
Implementing the TF-IDF computation:
Learn how to create and use classes
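As a small exercise that combines both points, the sketch below wraps a TF-IDF computation in a class. It is a minimal illustration under stated assumptions, not the implementation from these notes: the class name `TfIdf`, its methods, and the toy sentences are hypothetical; tokens are split on whitespace; term frequency is the raw count divided by the sentence length; and IDF is the unsmoothed `log(N / df)`. scikit-learn's `TfidfTransformer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


class TfIdf:
    """TF-IDF weights for a small list of sentences."""

    def __init__(self, sentences):
        # Each sentence is stored as a list of lower-cased tokens
        self.docs = [s.lower().split() for s in sentences]
        # Document frequency: number of sentences that contain each term
        self.df = Counter(term for doc in self.docs for term in set(doc))

    def idf(self, term):
        """Unsmoothed inverse document frequency; 0.0 for unseen terms."""
        if term not in self.df:
            return 0.0
        return math.log(len(self.docs) / self.df[term])

    def vector(self, index):
        """Map each term of sentence `index` to its TF-IDF weight."""
        doc = self.docs[index]
        counts = Counter(doc)
        return {t: (c / len(doc)) * self.idf(t) for t, c in counts.items()}

    def similarity(self, i, j):
        """Cosine similarity between the TF-IDF vectors of sentences i and j."""
        a, b = self.vector(i), self.vector(j)
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy sentences, not taken from the DUC data
model = TfIdf([
    "hun sen guarantees safety",
    "sam rainsy rejects the guarantee",
    "hun sen meets the prince",
])
print(model.vector(0))
print(model.similarity(0, 2), model.similarity(0, 1))
```

Keeping the document frequencies on the instance means they are counted once in `__init__` and reused by every later call, which is the main convenience of a class over a set of loose functions here.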
A strange problem I ran into (see figure):
Computing TF-IDF and the similarity between sentences:
"""
Created by hjn on 2020.5.30
Read the text content of the 10 documents of one topic in the DUC dataset and compute TF-IDF
"""
import os
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk import data
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Load nltk_data offline. NLTK resolves resources such as tokenizers/punkt
# relative to each search path, so the path must be the nltk_data root folder
data.path.append(r"C:/Users/HJN/nltk_data")

# Stop words to remove before stemming
stop_words = set(stopwords.words('english'))
# Set of punctuation tokens
punctuation = [',', '.', '“', '”', '(', ')', '‘', '’', '\'', '`']


def data_presolve(text):
    """Remove the stop words from text and stem the remaining tokens."""
    word_tokens = word_tokenize(text)
    # Drop stop words; compare in lower case because the stop-word list is lower case
    filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
    # Stem every remaining token
    ps = PorterStemmer()
    stem_words = [ps.stem(w) for w in filtered_sentence]
    # Remove punctuation tokens such as , . “ ” from the list
    return [w for w in stem_words if w not in punctuation]


# Stores the sentences
short_sentence = []
# # Stores the tokenized sentences
# short_sentence_split = []
# Stores the long sentences
doc_