Building an RSS Feed Scraper with Python

This is part 1 of building a web scraping tool with Python. We’re using Requests and BeautifulSoup. In parts 2 and 3 of this series, I’ll be illustrating how to build scheduled scraping with Celery and integrating this into a web application with Django.

Background:

I’ve utilized web scraping in different capacities for my projects, whether it be data collection for analysis, creating notifications for myself when sites change, or building web applications. This code is available publicly on my GitHub under web_scraping_example.

This guide will walk through a quick RSS feed scraper for HackerNews. The RSS feed itself is located at https://news.ycombinator.com/rss; it is updated with new posts and site activity at regular intervals.

While this could be accomplished using an RSS reader, this is meant as a quick example using Python that can be adapted to other websites easily.

I will be using the following: Python, Requests, and BeautifulSoup (bs4).

Project outline:

Here’s an outline of the steps we’ll take to create our finalized program:

  1. Creating our project directory and scraping.py file
  2. Testing that we can ping the RSS feed we’re going to scrape
  3. Scraping the site’s XML content
  4. Parsing the content using BS4
  5. Outputting the content to a .txt file

Getting started:

We’ll begin by creating our project directory and changing into that directory from the command line.

$ mkdir web_scraping_example && cd web_scraping_example

Once within our project directory, we’ll create our project file.

$ touch scraping.py

Note: I’m using Ubuntu, so my commands may differ from yours.

If you haven’t already, go ahead and install the packages we’ll be using in this guide. You can do so using pip like the example below.

$ pip install requests
$ pip install bs4
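
An alternative (entirely optional, and not part of the original walkthrough) is to list the dependencies in a requirements.txt file and install them in one step; a minimal sketch:

# requirements.txt
requests
bs4

$ pip install -r requirements.txt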

Importing our libraries:

Now that the basics of our project are set up, we can begin writing the scraping tool itself.

Within our scraping.py we’ll import the packages we’ve installed using pip.

# scraping.py
import requests
from bs4 import BeautifulSoup

The above will allow us to utilize the functions given to us by the Requests and BeautifulSoup libraries.

We can now begin testing our ability to ping the HackerNews RSS feed, as well as write our scraping script.

Note: I will not be including the import lines going forward. All code that remains the same as previously will be noted with ... either above or below the new lines.

Testing the request:

When we’re web scraping, we begin by sending a request to a website. To ensure that we’re capable of scraping at all, we’ll need to test that we can connect.

Let’s begin by creating our base scraping function. This will be what we execute to confirm that we can reach the RSS feed at all.

# scraping.py
# library imports omitted
...

# scraping function
def hackernews_rss():
    try:
        r = requests.get('https://news.ycombinator.com/rss')
        return print('The scraping job succeeded: ', r.status_code)
    except Exception as e:
        print('The scraping job failed. See exception: ')
        print(e)

print('Starting scraping')
hackernews_rss()
print('Finished scraping')

In the above, we call the Requests library and fetch our website using requests.get(...). I’m printing the status code to the terminal using r.status_code to check that the website was reached successfully.

Additionally, I’ve wrapped this in a try/except block to catch any errors we may run into later on down the road.

Once we run the program, we’ll see a successful status code of 200. This states that we’re able to ping the site and “get” information.

$ python scraping.py
Starting scraping
The scraping job succeeded: 200
Finished scraping
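
Note that requests.get() does not raise an exception for non-200 responses on its own. If you want a 404 or 500 to count as a failure too, one option (a small sketch, not part of the original code) is to call r.raise_for_status() before reporting success:

# scraping.py (optional variation)
def hackernews_rss():
    try:
        r = requests.get('https://news.ycombinator.com/rss')
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
        return print('The scraping job succeeded: ', r.status_code)
    except Exception as e:
        print('The scraping job failed. See exception: ')
        print(e)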

Scraping content:

Our program has returned a status code of 200, so we’re all set to begin pulling XML content from the site.

To accomplish this, we’ll begin using BS4 and Requests together.

# scraping.py
...
def hackernews_rss():
    try:
        r = requests.get('https://news.ycombinator.com/rss')
        soup = BeautifulSoup(r.content, features='xml')

        return print(soup)
...

The above will assign our XML content from HackerNews to the soup variable. We’re using r.content to pass the returned XML to BeautifulSoup, which we’ll parse in the next example.

A key thing to note is that we’re passing features='xml'; this would differ in other projects (e.g., if you’re scraping HTML you’d declare an HTML parser instead).

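For example (a quick illustration, not part of this project’s code), the same BeautifulSoup call against an HTML page would typically use the built-in html.parser, while the 'xml' feature generally requires the lxml package to be installed:

# not part of scraping.py -- just illustrating the parser choice
from bs4 import BeautifulSoup

html_soup = BeautifulSoup('<p>hello</p>', features='html.parser')  # built-in HTML parser
xml_soup = BeautifulSoup('<item><title>hi</title></item>', features='xml')  # typically needs lxml installed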

Our output of the above will be a large mess of content that makes very little sense. This is to illustrate that we’re successfully pulling information from the website.

Parsing our data:

We’ve successfully illustrated that we can extract the XML from our HackerNews RSS feed. Next, we’ll begin parsing the information.

The RSS feed was chosen because it’s much easier than parsing website information, as we don’t have to worry about nested HTML elements and pinpointing our exact information.

Let’s begin by looking at the structure of the feed:

<item>
  <title>...</title>
  <link>...</link>
  <pubDate>...</pubDate>
  <comments>...</comments>
  <description>...</description>
</item>

Each of the articles available on the RSS feed follows the above structure, containing all information within item tags — <item>...</item>. We’ll be taking advantage of the consistent item tags to parse our information.

# scraping.py
def hackernews_rss():
    article_list = []
    try:
        r = requests.get('https://news.ycombinator.com/rss')
        soup = BeautifulSoup(r.content, features='xml')
        articles = soup.findAll('item')
        for a in articles:
            title = a.find('title').text
            link = a.find('link').text
            published = a.find('pubDate').text
            article = {
                'title': title,
                'link': link,
                'published': published
            }
            article_list.append(article)
        return print(article_list)
...

Unpacking the above, we’ll begin by checking out the articles = soup.findAll('item'). This will allow us to pull each of the <item>...</item> tags from the XML that we scraped.

Each of the articles is handled inside the loop for a in articles:, which allows us to parse the information into separate variables and append each resulting dictionary to the empty list we’ve created.

BS4 has parsed our XML into a searchable soup object, allowing us to call the .find() function on each item to look up specific tags. By using .text we’re able to strip away the surrounding <tag>...</tag> markup and keep only the string content.

We’re appending each article to a list with article_list.append(article) so we can access them all later.

You should now see a large amount of output when running the scraping program.

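If the full list is too noisy to read, a quick sanity check (a sketch, not part of the original code) is to print how many items were parsed and only the first one, following the same ... convention as above:

# scraping.py
...
def hackernews_rss():
    ...
    try:
        ...
        # optional: summarize instead of printing the whole list
        print('Number of articles scraped:', len(article_list))
        return print('First article:', article_list[0])
...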

Outputting to a file:

So far, the parsed RSS feed has simply been passed to a print() function to display our list once parsing is complete. We can now work on putting the data into a .txt file, which opens the door to analysis and other data-related activities. We’re importing the json module to make this a bit easier; however, I’ve also provided an example without it.

We’ll begin by creating another function def save_function(): that will take in the list from our hackernews_rss() function. This will make it easier for us to make changes in the future.

# scraping.py
import json
...
def save_function(article_list):
    with open('articles.txt', 'w') as outfile:
        json.dump(article_list, outfile)
...

The above utilizes the JSON library to write the output of the scrape to the articles.txt file. This file will be overwritten each time the program is executed.

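As a quick usage note (a sketch, not part of the original article), data written with json.dump can be loaded back later with json.load, which is handy for any follow-up analysis:

# reading the saved articles back (assumes articles.txt was written by json.dump)
import json

with open('articles.txt', 'r') as infile:
    saved_articles = json.load(infile)

print(len(saved_articles), 'articles loaded')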

Another method of writing to the .txt file would be a for loop:

# scraping.py
...
def save_function(article_list):
    with open('articles.txt', 'w') as f:
        for a in article_list:
            # each article is a dict, so convert it to a string before writing
            f.write(str(a) + '\n')

Now that we have our save_function() created, we’ll move into adapting our scrape function to save our data.

# scraping.py
...
def hackernews_rss():
    ...
    try:
        ...
        return save_function(article_list)
...

By changing our return print(article_list) to return save_function(article_list) we’re able to push the data into a .txt file. Running our program will now output a .txt file of the scraped data from the HackerNews RSS feed.

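Putting the pieces together, the finished scraping.py should look roughly like the sketch below (assembled from the snippets above; your copy may differ slightly):

# scraping.py
import json

import requests
from bs4 import BeautifulSoup


def hackernews_rss():
    article_list = []
    try:
        r = requests.get('https://news.ycombinator.com/rss')
        soup = BeautifulSoup(r.content, features='xml')
        articles = soup.findAll('item')
        for a in articles:
            title = a.find('title').text
            link = a.find('link').text
            published = a.find('pubDate').text
            article = {
                'title': title,
                'link': link,
                'published': published
            }
            article_list.append(article)
        # hand the parsed list off to be written to disk
        return save_function(article_list)
    except Exception as e:
        print('The scraping job failed. See exception: ')
        print(e)


def save_function(article_list):
    with open('articles.txt', 'w') as outfile:
        json.dump(article_list, outfile)


print('Starting scraping')
hackernews_rss()
print('Finished scraping')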

Conclusion:

We’ve successfully created an RSS feed scraping tool using Python, Requests, and BeautifulSoup. This allows us to parse XML information into a legible format for us to work with in the future.

Where should we go from here?

  • Scraping more complex information using HTML elements
  • Using Selenium to scrape sites that Requests can’t due to client-side rendering
  • Building a web application that will take in scraped data and display it (i.e., an aggregator)
  • Pulling data from websites of your choice on a schedule

Additional articles:

While this article covered the very basics of web scraping, we can begin to delve into further detail by pushing things into a scheduled sequence using Celery or aggregating our information on a web application using Django.

This article is part of a 3-part series where we begin to look at simple examples of web scraping and aggregation on a scheduled basis.

If you found this helpful, take a look at some of my other pieces. Cheers.

Translated from: https://codeburst.io/building-an-rss-feed-scraper-with-python-73715ca06e1f
