Automatic Summarization: TF-IDF + doc2vec + Sentence Clustering

Multi-Document Automatic Summarization

 

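Contents

Multi-Document Automatic Summarization

1 Plan

2 Baseline + Improvement 1 + Improvement 2

Baseline:

Document format:

Python code:

Implementing the TF-IDF computation:

Improvement 1:

Improvement 2: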


 

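1 Plan

1. Understand what the .bat file does and handle the file input.

2. First produce an automatic summary for a single topic, then for multiple topics.

3. Compare the result with the expert-written summaries and evaluate it.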

 

2 Baseline + Improvement 1 + Improvement 2

Baseline:

Take the first sentence of each of the 10 documents, in order, and concatenate them into the summary.

 

The documents are formatted as follows:

<DOC>
<DOCNO> APW19981118.0276 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> NEWSWIRE </TXTTYPE>
<TEXT>
Cambodian leader Hun Sen has guaranteed the safety and political freedom 
of all politicians, trying to ease the fears of his rivals that they 
will be arrested or killed if they return to the country. The assurances 
were aimed especially at Sam Rainsy, leader of a vocally anti-Hun 
Sen opposition party, who was forced to take refuge in the U.N. offices 
in September to avoid arrest after Hun Sen accused him of being behind 
a plot against his life. Sam Rainsy and the 14 members of parliament 
from his party have been holed up overseas for two months. But a deal 
reached between Hun Sen and his chief rival, Prince Norodom Ranariddh, 
on forming a new government last week has opened the door for their 
return. In a letter to King Norodom Sihanouk _ the prince's father 
and Cambodia's head of state _ that was broadcast on television Tuesday, 
Hun Sen said that guarantees of safety extended to Ranariddh applied 
to all politicians. His assurances come a week before the first session 
of Cambodia's new parliament, the National Assembly. Sam Rainsy said 
Wednesday that he was unsatisfied with the guarantee. He said it contained 
indirect language and loopholes that suggest he and his Sam Rainsy 
Party members are still under threat of arrest from Hun Sen's ruling 
party. ``It should be easy for them to say, `Rainsy and the SRP members 
of the assembly have no charges against them and will not be arrested,''' 
the opposition leader said in a statement. ``But instead they make 
roundabout statements, full of loopholes that can easily be exploited 
by a legal system that is completely in their control.'' Ranariddh 
told reporters Wednesday that he believed it was safe for Sam Rainsy 
in Cambodia. Speaking upon his return from a brief stay in Bangkok, 
the prince said he would soon meet with Hun Sen to discuss the apportioning 
of ministries in the new coalition government. Last week, Hun Sen's 
Cambodian People's Party and Ranariddh's FUNCINPEC party agreed to 
form a coalition that would leave Hun Sen as sole prime minister and 
make the prince president of the National Assembly. The deal assures 
the two-thirds vote in parliament needed to approve a new government. 
The men served as co-prime ministers until Hun Sen overthrew Ranariddh 
in a coup last year. ``I think Hun Sen has got everything. He's got 
the premiership and legitimacy through the election and recognition 
from his majesty the king. I don't think there is any benefit for 
Hun Sen to cause instability for our country,'' Ranariddh said. The 
prince also said that his top general, Nhek Bunchhay, would not be 
given back his previous position as the second-ranking general in 
the Cambodian military's general staff. Nhek Bunchhay's outnumbered 
forces in the capital put up tough but unsuccessful resistance to 
last year's coup. 
</TEXT>
</DOC>
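
As a rough illustration of how a record in the format above could be read, the sketch below pulls the `<DOCNO>` and `<TEXT>` fields out with regular expressions. The helper name `parse_duc_doc` and the shortened sample record are assumptions made for this example; they are not part of the original code.

```python
import re


# Hypothetical helper for illustration; not part of the original post.
def parse_duc_doc(raw):
    """Return (docno, text) from one <DOC>...</DOC> record."""
    docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", raw, re.S)
    text = re.search(r"<TEXT>(.*?)</TEXT>", raw, re.S)
    if docno is None or text is None:
        raise ValueError("record has no <DOCNO> or <TEXT> block")
    # Collapse the hard line breaks and repeated spaces into single spaces
    body = " ".join(text.group(1).split())
    return docno.group(1), body


# Shortened copy of the record shown above
sample = """<DOC>
<DOCNO> APW19981118.0276 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> NEWSWIRE </TXTTYPE>
<TEXT>
Cambodian leader Hun Sen has guaranteed the safety and political freedom
of all politicians, trying to ease the fears of his rivals that they
will be arrested or killed if they return to the country.
</TEXT>
</DOC>"""

docno, body = parse_duc_doc(sample)
print(docno)
print(body)
```

Normalizing the whitespace up front means later steps such as sentence splitting see one continuous string instead of wrapped newswire lines.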

 

The Python code is as follows:

'''
    Created by hjn On 2020.5.18
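    Read the text content of the 10 documents under one topic of the DUC dataset
'''
import os

# Raw string, so the backslashes in the Windows path are not treated as escapes
DATASET_DIR = r"E:\pycharm_workspace\data_mining\data_classification\dataset"

# Each sub-folder of the dataset directory holds the documents of one topic
for topic in sorted(os.listdir(DATASET_DIR)):
    topic_dir = os.path.join(DATASET_DIR, topic)
    if not os.path.isdir(topic_dir):
        continue

    final_result = []
    for file in sorted(os.listdir(topic_dir)):
        path = os.path.join(topic_dir, file)
        # Read all lines; the with-block closes the file afterwards
        with open(path) as f:
            result = f.readlines()

        # Join the lines into one string and drop the newlines
        result_str = ' '.join(result).replace('\n', '')
        # Keep only what sits between <TEXT> and </TEXT>
        text = result_str.split("<TEXT>")[1].split("</TEXT>")[0]
        # Collapse the runs of spaces left over from the wrapped lines
        text = ' '.join(text.split())
        # Baseline: the first sentence is everything up to the first period
        sentences = text.split(".")
        final_result.append(sentences[0])

    # Print the summary of this topic: one lead sentence per document
    for row in final_result:
        print(row, end='. ')
    print()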
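
Splitting on every period cuts a sentence short whenever an abbreviation such as "U.N." appears before its real end. The sketch below shows one possible heuristic for the lead-sentence baseline: treat a position as a sentence boundary only when the punctuation is followed by whitespace and then an uppercase letter or an opening quote. The function names and the two sample strings are made up for illustration, and the heuristic will still mis-split cases such as an abbreviation followed by a capitalized word.

```python
import re

# Heuristic boundary (an assumption for illustration): sentence-ending
# punctuation, whitespace, then an uppercase letter or an opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z`"])')


def lead_sentence(text):
    """Return the first sentence of text under the heuristic above."""
    text = " ".join(text.split())
    return _BOUNDARY.split(text, maxsplit=1)[0]


def lead_baseline(documents):
    """Join the lead sentence of every document into one summary."""
    return " ".join(lead_sentence(doc) for doc in documents)


# Made-up sample documents, loosely modeled on the record shown earlier
docs = [
    "Sam Rainsy took refuge in the U.N. offices in September. He left later.",
    "The deal assures the two-thirds vote. The men served as co-prime ministers.",
]

print(docs[0].split(".")[0])   # naive split stops inside "U.N."
print(lead_baseline(docs))
```

A trained sentence tokenizer would likely be more robust; the regular expression is only meant to make the failure mode of `split(".")` visible.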

 

Implementing the TF-IDF computation:

Learning how to create and use classes.
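
To make the idea concrete, here is a minimal sketch of a TF-IDF class in plain Python. It is not the author's class: the name `TfIdf`, its methods, and the stemmed sample tokens are assumptions for illustration. It uses the textbook weighting tf × log(N / df), whereas scikit-learn's `TfidfTransformer` applies smoothing and L2 normalization by default, so the numbers will not match that library's output.

```python
import math
from collections import Counter


class TfIdf:
    """Tiny TF-IDF model over pre-tokenized sentences (illustrative only)."""

    def __init__(self, sentences):
        # sentences: a list of token lists, one list per sentence
        self.sentences = sentences
        self.n = len(sentences)
        # Document frequency: in how many sentences each term appears
        self.df = Counter(term for s in sentences for term in set(s))

    def idf(self, term):
        # Plain log(N / df); unseen terms get weight 0
        return math.log(self.n / self.df[term]) if self.df[term] else 0.0

    def vector(self, index):
        """Return {term: tf * idf} for the sentence at the given index."""
        tokens = self.sentences[index]
        counts = Counter(tokens)
        return {t: (c / len(tokens)) * self.idf(t) for t, c in counts.items()}


# Made-up, already stemmed tokens
model = TfIdf([
    ["hun", "sen", "guarante", "safeti"],
    ["sam", "rainsi", "unsatisfi", "guarante"],
    ["hun", "sen", "coalit", "govern"],
])
print(model.vector(0))
```

Terms that occur in fewer sentences ("safeti") receive a larger weight than terms shared by several sentences ("hun"), which is the property the later similarity step relies on.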

A strange problem I ran into (see the figure):

 

Computing TF-IDF and the similarity between sentences:

"""
    Created by hjn On 2020.5.30
    Read the text content of the 10 documents under one topic of the DUC dataset and compute TF-IDF
"""

import os
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from nltk import data
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Load nltk_data offline; entries in nltk.data.path should point at the
# nltk_data root folder, not at an individual resource inside it
data.path.append(r"C:/Users/HJN/nltk_data")

# English stop words, removed before stemming
stop_words = set(stopwords.words('english'))

# Punctuation tokens to discard
punctuation = [',', '.', '“', '”', '(', ')', '‘', '’', '\'', '`']

# One stemmer instance shared by all calls
ps = PorterStemmer()


def data_presolve(text):
    """Remove stop words and punctuation from text and stem the remaining words."""
    word_tokens = word_tokenize(text)

    # Drop stop words (the NLTK list is lowercase, so compare in lowercase)
    filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

    # Stem each remaining token
    stem_words = [ps.stem(w) for w in filtered_sentence]

    # Drop tokens that are pure punctuation, such as , . “ ”
    return [w for w in stem_words if w not in punctuation]


# Stores the sentences
short_sentence = []

# # Stores the tokenized sentences
# short_sentence_split = []

# Stores the long sentences
doc_
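
For the sentence-similarity step named in this section, a common choice is cosine similarity between sentence vectors. The sketch below is an assumption about how that step could look, not the author's code: the function names are hypothetical, and raw term counts stand in for the TF-IDF weights so that the example stays self-contained. Any `{term: weight}` dictionary can be passed in the same way.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity between all sentence vectors."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Made-up, already stemmed sentences; counts stand in for TF-IDF weights
sentences = [
    ["hun", "sen", "guarante", "safeti"],
    ["hun", "sen", "coalit", "govern"],
    ["sam", "rainsi", "unsatisfi"],
]
vectors = [Counter(s) for s in sentences]
sim = similarity_matrix(vectors)
for row in sim:
    print([round(x, 2) for x in row])
```

The resulting matrix is symmetric with ones on the diagonal, and it is the kind of input a sentence-clustering step would typically start from.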