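I'm looking for a way to extract specific text from RSS feeds, specifically from news sites. I want to scrape the feed, look for any instance of [phrase], and return each match of that phrase together with the rest of the sentence that follows it (up to the period, nothing too NLP-heavy).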
The closest I've found is:

from bs4 import BeautifulSoup
import csv
import feedparser
import re
import requests


def search_article(url, phrases):
    """
    Yield all of the specified phrases that occur in the HTML body of the URL.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect the text of every article-body <div> into one block.
    divs = soup.find_all('div', {"itemprop": "articleBody"})
    block = ''.join(div.text for div in divs)
    for phrase in phrases:
        # \b restricts the match to whole words.
        if re.search(r'\b' + re.escape(phrase) + r'\b', block):
            yield phrase


def search_rss(rss_entries, phrases):
    """
    Search articles listed in the RSS entries for phrases, yielding
    (url, article_title, phrase) tuples.
    """
    for entry in rss_entries:
        for hit_phrase in search_article(entry['link'], phrases):
            yield entry['link'], entry['title'], hit_phrase


def main(rss_url, phrases, output_csv_path, rss_limit=None):
    # rss_limit=None keeps every entry in the feed.
    rss_entries = feedparser.parse(rss_url).entries[:rss_limit]
    # newline='' stops the csv module from writing blank rows on Windows.
    with open(output_csv_path, 'w', newline='') as f:
        w = csv.writer(f)
        for url, title, phrase in search_rss(rss_entries, phrases):
            print('"{0}" found in "{1}"'.format(phrase, title))
            w.writerow([url, phrase])


if __name__ == '__main__':
    rss_url = 'http://www.theguardian.com/rss'
    phrases = ['in the future', 'the future will be']
    main(rss_url, phrases, 'output.csv')
This returns links to the articles that contain those phrases, but I don't need the links; I need the sentences that contain the phrases.
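To make the goal concrete, here is a rough sketch of the kind of extraction I mean, run on a plain string rather than a real article. The function name `sentences_after` and the `sample` text are made up for illustration, and treating "everything up to the next period" as the end of the sentence is my own simplifying assumption.

```python
import re


def sentences_after(text, phrase):
    """Return each occurrence of `phrase` plus the rest of its sentence."""
    # \b keeps the match on word boundaries, [^.]* consumes everything up to
    # the next period, and \.? includes that period when there is one.
    pattern = re.compile(
        r'\b' + re.escape(phrase) + r'\b[^.]*\.?',
        re.IGNORECASE,
    )
    return [m.group(0).strip() for m in pattern.finditer(text)]


# Hypothetical sample text standing in for an article body.
sample = (
    "We think that in the future cars will fly. "
    "The future will be bright. Nothing here."
)

for phrase in ['in the future', 'the future will be']:
    print(phrase, '->', sentences_after(sample, phrase))
```

A pattern like this would stop early at periods inside abbreviations or numbers (for example "U.S." or "3.5"), which is probably acceptable for what I need.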
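I'm a beginner in Python (but eager to learn, so I'm giving it a try!) and have some experience with regular expressions. Any suggestions would be greatly appreciated!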