NLP (11): Extracting Text Summaries

  • Functions in the gensim.summarization library
    gensim.summarization.summarize(text, ratio=0.2, word_count=None, split=False)
    Parameters:
    text : str
    Given text.
    ratio : float, optional
    Number between 0 and 1 that determines the proportion of the number of
    sentences of the original text to be chosen for the summary.
    word_count : int or None, optional
    Determines how many words will the output contain.
    If both parameters are provided, the ratio will be ignored.
    split : bool, optional
    If True, list of sentences will be returned. Otherwise joined
    strings will be returned.
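To make the interplay of ratio, word_count and split concrete, here is a minimal sketch. It is not gensim's implementation: toy_summarize is a hypothetical helper that ranks sentences by average word frequency instead of TextRank, and its sentence splitting is deliberately naive. It only mimics the parameter behaviour described above.

```python
import re
from collections import Counter


def toy_summarize(text, ratio=0.2, word_count=None, split=False):
    """Frequency-based stand-in for illustration only; NOT gensim's TextRank."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        # Average corpus frequency of the words in the sentence.
        words = re.findall(r'\w+', sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    # Sentence indices ordered from highest to lowest score.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)

    if word_count is None:
        # ratio: keep that proportion of the sentences.
        chosen = ranked[:int(len(sentences) * ratio)]
    else:
        # word_count wins over ratio: stop once the next sentence
        # would push the total past the word budget.
        chosen, total = [], 0
        for i in ranked:
            n = len(sentences[i].split())
            if total + n > word_count:
                break
            chosen.append(i)
            total += n

    # Present the selected sentences in their original order.
    picked = [sentences[i] for i in sorted(chosen)]
    return picked if split else "\n".join(picked)
```

With split=True the result is a list of sentences; with the default split=False it is a single string. Passing word_count makes the ratio argument irrelevant, as the parameter description states.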
  • Code
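from gensim.summarization import summarize # TextRank-based summarization (requires gensim < 4.0; the module was removed in 4.0)
from bs4 import BeautifulSoup # BeautifulSoup, for parsing HTML documents
import requests # library for downloading HTTP resources
urls = { # dictionary mapping paper title -> URL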
    'Deconstructing Voice-over-IP':
    'http://scigen.csail.mit.edu/scicache/269/scimakelatex.25977.A.+G.+Hassan.html',
    'Exploration of the Location-Identity Split':
    'http://scigen.csail.mit.edu/scicache/270/scimakelatex.26087.Ali+Veli.Veli+Ali.Vel+Al.html',
}
# Actual abstracts of the two papers:
# 1.The implications of ambimorphic archetypes have been far-reaching and pervasive. After years of natural research into consistent hashing, we argue the simulation of public-private key pairs, which embodies the confirmed principles of theory. Such a hypothesis might seem perverse but is derived from known results. Our focus in this paper is not on whether the well-known knowledge-based algorithm for the emulation of checksums by Herbert Simon runs in Θ( n ) time, but rather on exploring a semantic tool for harnessing telephony (Swale).
# 2.Superblocks must work. Given the current status of homogeneous configurations, security experts particularly desire the simulation of 802.11b. we consider how the Internet can be applied to the refinement of Scheme.
for title, url in urls.items():
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = soup.get_text() # plain text with the HTML tags stripped
    pos1 = data.find('1 Introduction') + len('1 Introduction')
    pos2 = data.find('Related Work')
    text = data[pos1:pos2].strip() # the introduction section, between pos1 and pos2
    print('PAPER URL: {}'.format(url))
    print('TITLE: {}'.format(title))
    print('GENERATED SUMMARY: {}'.format(summarize(text)))
    print()

Output:

PAPER URL: http://scigen.csail.mit.edu/scicache/269/scimakelatex.25977.A.+G.+Hassan.html
TITLE: Deconstructing Voice-over-IP
GENERATED SUMMARY: ......

PAPER URL: http://scigen.csail.mit.edu/scicache/270/scimakelatex.26087.Ali+Veli.Veli+Ali.Vel+Al.html
TITLE: Exploration of the Location-Identity Split
GENERATED SUMMARY: ......
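
The introduction is located with two str.find calls. str.find returns -1 when the marker is absent, in which case the slice silently covers the wrong span instead of raising an error. A minimal sketch of a more defensive version follows; extract_between is a hypothetical helper, not part of the original post or of any library.

```python
def extract_between(data, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = data.find(start_marker)
    if start == -1:
        return None  # start marker not present in the page text
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = data.find(end_marker, start)
    if end == -1:
        return None  # end marker not present after the start marker
    return data[start:end].strip()
```

In the loop, text = extract_between(data, '1 Introduction', 'Related Work') could then be checked for None before the text is passed to the summarizer.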

Reposted from: https://www.cnblogs.com/peng8098/p/nlp_11.html
