Extractive Text Summarization with TextRank (English)

Latest recommended article published 2021-11-11 15:04:37

quantum00549

Category: Study Notes; Tags: Natural Language Processing

Original link: http://blog.itpub.net/31562039/viewspace-2286669/

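Preface

Keeping notes on GitHub means constantly going back to look things up, which is a hassle, so I record assorted consolidated code here. I link the original source wherever I can. Most of this is not my own work, so please don't nitpick.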

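Notes

TextRank extractive summarization; look up the underlying theory yourself.
Original source of this code: http://blog.itpub.net/31562039/viewspace-2286669/
The code targets English text and uses GloVe 100d word vectors. For Chinese you will need to adapt the code; my own version is available for reference at https://blog.csdn.net/ziyi9663/article/details/106996293
Dataset download:
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/10/tennis_articles_v4.csv
I already had the GloVe vectors locally; the original article also provides a download link.
Downloading the NLTK stopword and sentence-tokenizer data may require a VPN or proxy.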
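
For readers who would rather not search for the theory: TextRank treats each sentence as a node in a graph, weights the edges by sentence similarity, and ranks the nodes with PageRank. The block below is a minimal pure-Python sketch of that ranking step, not the networkx implementation used later. The function name `textrank_scores`, the toy matrix, and the damping factor of 0.85 (the networkx default) are assumptions made for illustration.

```python
def textrank_scores(sim, damping=0.85, max_iter=100, tol=1e-8):
    """Rank nodes of a weighted undirected graph by power iteration.

    sim is a square list of lists of non-negative similarities;
    the diagonal is ignored. Hypothetical helper for illustration only.
    """
    n = len(sim)
    # Total edge weight leaving each node, excluding self-loops.
    out_weight = [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by nodes without edges is spread evenly over all nodes.
        dangling = sum(scores[i] for i in range(n) if out_weight[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * sim[i][j] / out_weight[i]
                for i in range(n)
                if i != j and out_weight[i] > 0
            )
            new_scores.append(
                (1 - damping) / n + damping * (incoming + dangling / n)
            )
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy graph: sentence 0 is similar to both others, which are barely alike.
toy = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = textrank_scores(toy)
print(toy_scores)
```

In this toy graph the sentence that resembles the most other sentences gets the highest score, which is why the top-ranked sentences tend to work as a summary.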

Talk is cheap, show me the code.

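# The code below targets Python 3.7. All required libraries were installed with pip; installing some of them may require a VPN or proxy. The author reports that it ran as-is without errors.
# Comments are placed below the code they describe.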
import networkx
# A package for graph data structures and operations. No prior experience is needed; look it up if you are curious.
import numpy as np
import pandas as pd
import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
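# Download the sentence-tokenizer and stopword data. This is needed only once; comment these lines out for later runs.
# Recent NLTK releases may also need nltk.download('punkt_tab').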
from sklearn.metrics.pairwise import cosine_similarity
import re
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords

df = pd.read_csv('tennis_articles_v4.csv')
# Read the article data; the original article includes the download link.
sentences = []
for s in df['article_text']:
    sentences.append(sent_tokenize(s))
    # Split each article into sentences and append the result to the sentences list.

sentences = [y for x in sentences for y in x]
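# Flatten the list.
# The dataset contains several articles. This code puts every sentence of every article into one list, so the summary is extracted from all articles together.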

word_embeddings = {}
GLOVE_DIR = 'glove.6B.100d.txt'
with open(GLOVE_DIR,encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs
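# Load the word vectors. Despite its name, GLOVE_DIR is the path to the vector file itself.
# Each line of the file holds a word followed by its vector components, separated by spaces.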

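clean_sentences = pd.Series(sentences).str.replace('[^a-zA-Z]', ' ', regex=True)
clean_sentences = [s.lower() for s in clean_sentences]
# Clean the text: replace punctuation, digits and special characters with spaces, then lowercase everything.
# regex=True is passed explicitly because pandas 2.0 changed the default to literal matching.
stop_words = set(stopwords.words('english'))
def remove_stopwords(words):
    sen = ' '.join([i for i in words if i not in stop_words])
    return sen
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]
# Remove stopwords. A set makes the membership test fast, and the parameter no longer shadows the built-in str.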
sentences_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum(
            [word_embeddings.get(w,np.zeros((100,))) for w in i.split()]
        )/(len(i.split())+1e-2)
    else:
        v = np.zeros((100,))
    sentences_vectors.append(v)
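# Look up the vector of every word in the sentence (from the GloVe file, 100 dimensions each; words missing from the file count as zero vectors),
# then average them to get the sentence's feature vector. The 1e-2 in the denominator is a small smoothing term.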

similarity_matrix = np.zeros((len(clean_sentences),len(clean_sentences)))
# Initialize the similarity matrix with zeros.
for i in range(len(clean_sentences)):
    for j in range(len(clean_sentences)):
        if i != j:
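            similarity_matrix[i][j] = cosine_similarity(
                sentences_vectors[i].reshape(1,-1),sentences_vectors[j].reshape(1,-1)
            )[0,0]
# Fill the similarity matrix with pairwise cosine similarities.
# cosine_similarity returns a 1x1 array, so [0,0] extracts the scalar before assignment.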
nx_graph = networkx.from_numpy_array(similarity_matrix)
scores = networkx.pagerank(nx_graph)
# Convert the similarity matrix to a graph and run PageRank on it to score each sentence.
ranked_sentences = sorted(
    ((scores[i],s) for i,s in enumerate(sentences)),reverse=True
)
# Sort the sentences by score, highest first.
for i in range(10):
    print(ranked_sentences[i][1])
# Print the 10 highest-scoring sentences, which form the summary.
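
The nested loop above calls `cosine_similarity` once per sentence pair, which gets slow as the number of sentences grows. As a possible alternative, the sketch below computes the whole matrix in one step with NumPy. The helper name `cosine_similarity_matrix` is hypothetical, and clipping negative similarities to zero is my own assumption (PageRank is normally defined for non-negative edge weights); the original code does not do this.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix with a zero diagonal.

    vectors: array-like of shape (n_sentences, dim). Hypothetical helper.
    """
    mat = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    # All-zero rows (e.g. empty sentences) keep zero similarity to everything.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = mat / safe_norms
    sim = unit @ unit.T
    # Assumption: clip negative similarities, since PageRank is normally
    # defined for non-negative edge weights.
    sim = np.clip(sim, 0.0, None)
    # A sentence is not compared with itself, matching the i != j check above.
    np.fill_diagonal(sim, 0.0)
    return sim


demo = cosine_similarity_matrix(
    [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]]
)
print(np.round(demo, 3))
```

With the script above, `cosine_similarity_matrix(sentences_vectors)` could be passed to `networkx.from_numpy_array` in place of `similarity_matrix`.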