Extracting sentences that begin with a keyword/phrase from an RSS feed in Python

Latest recommended article published 2021-12-30 14:38:25

爱吃兔兔的牛魔王




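Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_34832809/article/details/113505728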


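I'm looking for a way to extract specific text from RSS feeds, specifically from news sites. I want to crawl a feed, look for any instance of [phrase], and return every match of that phrase together with the rest of the sentence that follows it (up to the full stop; nothing too NLP-heavy).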

The closest thing I have found is:

    import csv
    import re

    import feedparser
    import requests
    from bs4 import BeautifulSoup


    def search_article(url, phrases):
        """
        Yield all of the specified phrases that occur in the HTML body of the URL.
        """
        response = requests.get(url)
        # Collect every <div itemprop="articleBody"> element on the page.
        divs = BeautifulSoup(response.text, 'html.parser').find_all(
            'div', {'itemprop': 'articleBody'})
        # Join the text of all body divs once, rather than resetting it per div.
        block = ' '.join(div.text for div in divs)
        for phrase in phrases:
            # Whole-word match of the literal phrase.
            if re.search(r'\b' + re.escape(phrase) + r'\b', block):
                yield phrase


    def search_rss(rss_entries, phrases):
        """
        Search articles listed in the RSS entries for phrases, yielding
        (url, article_title, phrase) tuples.
        """
        for entry in rss_entries:
            for hit_phrase in search_article(entry['link'], phrases):
                yield entry['link'], entry['title'], hit_phrase


    def main(rss_url, phrases, output_csv_path, rss_limit=None):
        rss_entries = feedparser.parse(rss_url).entries[:rss_limit]
        # newline='' is what the csv module expects for files it writes to.
        with open(output_csv_path, 'w', newline='') as f:
            w = csv.writer(f)
            for url, title, phrase in search_rss(rss_entries, phrases):
                print('"{0}" found in "{1}"'.format(phrase, title))
                w.writerow([url, phrase])


    if __name__ == '__main__':
        rss_url = 'http://www.theguardian.com/rss'
        phrases = ['in the future', 'the future will be']
        main(rss_url, phrases, 'output.csv')

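This is not quite what I need, though: I don't need the links to the articles that contain these phrases, I need the sentences that contain them.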

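I'm a beginner in Python (but keen to learn, so I'm giving it a go!) and have some experience with regular expressions. Any suggestions would be greatly appreciated!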
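As a possible starting point for the behaviour described in the question (each occurrence of a phrase plus the rest of its sentence, up to the next full stop), here is a minimal sketch using only the standard `re` module. The function name `sentences_from_phrase` is hypothetical and not part of the code above. It assumes that a plain full stop ends a sentence, so abbreviations such as "U.S." would cut a match short, and it matches case-insensitively so that a phrase at the start of a sentence is also found.

```python
import re


def sentences_from_phrase(text, phrase):
    """Return each occurrence of `phrase` plus the rest of its sentence.

    Hypothetical helper: a "sentence" here simply runs to the next full stop.
    """
    # Collapse runs of whitespace (including newlines) into single spaces,
    # so a phrase split across lines in the source text still matches.
    text = ' '.join(text.split())
    pattern = re.compile(
        # Literal phrase on word boundaries, then everything up to and
        # including the next full stop (or the end of the text).
        r'\b' + re.escape(phrase) + r'\b[^.]*\.?',
        re.IGNORECASE,
    )
    return [match.group(0).strip() for match in pattern.finditer(text)]


if __name__ == '__main__':
    sample = ('We think that in the future cars will fly. Nobody knows. '
              'In the future, robots may cook dinner.')
    for hit in sentences_from_phrase(sample, 'in the future'):
        print(hit)
```

In the script above, the joined article text (`block` in `search_article`) could be passed to a helper like this in place of the `re.search` check, and the returned sentences written to the CSV instead of the bare phrase.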
