How To Scrape Web Pages with Beautiful Soup and Python 3

Introduction

Many data analysis, big data, and machine learning projects require scraping websites to gather the data that you’ll be working with. The Python programming language is widely used in the data science community, and therefore has an ecosystem of modules and tools that you can use in your own projects. In this tutorial we will be focusing on the Beautiful Soup module.

Beautiful Soup, an allusion to the Mock Turtle’s song found in Chapter 10 of Lewis Carroll’s Alice’s Adventures in Wonderland, is a Python library that allows for quick turnaround on web scraping projects. Currently available as Beautiful Soup 4 and compatible with both Python 2.7 and Python 3, Beautiful Soup creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or tag soup and other malformed markup).

In this tutorial, we will collect and parse a web page in order to grab textual data and write the information we have gathered to a CSV file.

Prerequisites

Before working on this tutorial, you should have a local or server-based Python programming environment set up on your machine.

You should have the Requests and Beautiful Soup modules installed, which you can achieve by following our tutorial “How To Work with Web Data Using Requests and Beautiful Soup with Python 3.” It would also be useful to have a working familiarity with these modules.

Additionally, since we will be working with data scraped from the web, you should be comfortable with HTML structure and tagging.

Understanding the Data

In this tutorial, we’ll be working with data from the official website of the National Gallery of Art in the United States. The National Gallery is an art museum located on the National Mall in Washington, D.C. It holds over 120,000 pieces dated from the Renaissance to the present day done by more than 13,000 artists.

We would like to search the Index of Artists, which, at the time of updating this tutorial, is available via the Internet Archive’s Wayback Machine at the following URL:

https://web.archive.org/web/20170131230332/https://www.nga.gov/collection/an.shtm

Note: The long URL above is due to this website having been archived by the Internet Archive.

The Internet Archive is a non-profit digital library that provides free access to internet sites and other digital media. This organization takes snapshots of websites to preserve sites’ histories, and we can currently access an older version of the National Gallery’s site that was available when this tutorial was first written. The Internet Archive is a good tool to keep in mind when doing any kind of historical data scraping, including comparing across iterations of the same site and available data.

Beneath the Internet Archive’s header, you’ll see a page that looks like this:

Since we’ll be doing this project in order to learn about web scraping with Beautiful Soup, we don’t need to pull too much data from the site, so let’s limit the scope of the artist data we are looking to scrape. Let’s therefore choose one letter — in our example we’ll choose the letter Z — and we’ll see a page that looks like this:

In the page above, we see that the first artist listed at the time of writing is Zabaglia, Niccola, which is a good thing to note for when we start pulling data. We’ll start by working with this first page, with the following URL for the letter Z:

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm

It is important to note for later how many pages total there are for the letter you are choosing to list, which you can discover by clicking through to the last page of artists. In this case, there are 4 pages total, and the last artist listed at the time of writing is Zykmund, Václav. The last page of Z artists has the following URL:

https://web.archive.org/web/20121010201041/http://www.nga.gov/collection/anZ4.htm

However, you can also access the above page by using the same Internet Archive numeric string of the first page:

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ4.htm

This is important to note because we’ll be iterating through these pages later in this tutorial.

To begin to familiarize yourself with how this web page is set up, you can take a look at its DOM, which will help you understand how the HTML is structured. In order to inspect the DOM, you can open your browser’s Developer Tools.

Importing the Libraries

To begin our coding project, let’s activate our Python 3 programming environment. Make sure you’re in the directory where your environment is located, and run the following command:

  • . my_env/bin/activate

With our programming environment activated, we’ll create a new file, with nano for instance. You can name your file whatever you would like; we’ll call it nga_z_artists.py in this tutorial.

  • nano nga_z_artists.py

Within this file, we can begin to import the libraries we’ll be using — Requests and Beautiful Soup.

The Requests library allows you to make use of HTTP within your Python programs in a human readable way, and the Beautiful Soup module is designed to get web scraping done quickly.

We will import both Requests and Beautiful Soup with the import statement. For Beautiful Soup, we’ll be importing it from bs4, the package in which Beautiful Soup 4 is found.

nga_z_artists.py
# Import libraries
import requests
from bs4 import BeautifulSoup

With both the Requests and Beautiful Soup modules imported, we can move on to working to first collect a page and then parse it.

Collecting and Parsing a Web Page

The next step we will need to do is collect the URL of the first web page with Requests. We’ll assign the URL for the first page to the variable page by using the method requests.get().

nga_z_artists.py
import requests
from bs4 import BeautifulSoup


# Collect first page of artists’ list
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

Note: Because the URL is lengthy, the code above and throughout this tutorial will not pass PEP 8 E501 which flags lines longer than 79 characters. You may want to assign the URL to a variable to make the code more readable in final versions. The code in this tutorial is for demonstration purposes and will allow you to swap out shorter URLs as part of your own projects.

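As the note above suggests, one way to keep the line shorter in your own version is to store the URL in a variable first and pass that variable to requests.get(). This is only a small sketch of that pattern; the variable name url is arbitrary:

import requests

# Keeping the long archive URL in its own variable keeps the requests.get() call short
url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm'
page = requests.get(url)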

We’ll now create a BeautifulSoup object, or a parse tree. This object takes as its arguments the page.text document from Requests (the content of the server’s response) and then parses it using Python’s built-in html.parser.

nga_z_artists.py
import requests
from bs4 import BeautifulSoup


page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

With our page collected, parsed, and set up as a BeautifulSoup object, we can move on to collecting the data that we would like.

Pulling Text From a Web Page

For this project, we’ll collect artists’ names and the relevant links available on the website. You may want to collect different data, such as the artists’ nationality and dates. Whatever data you would like to collect, you need to find out how it is described by the DOM of the web page.

To do this, in your web browser, right-click — or CTRL + click on macOS — on the first artist’s name, Zabaglia, Niccola. Within the context menu that pops up, you should see a menu item similar to Inspect Element (Firefox) or Inspect (Chrome).

Once you click on the relevant Inspect menu item, the tools for web developers should appear within your browser. We want to look for the class and tags associated with the artists’ names in this list.

We’ll see first that the table of names is within <div> tags where class="BodyText". This is important to note so that we only search for text within this section of the web page. We also notice that the name Zabaglia, Niccola is in a link tag, since the name references a web page that describes the artist. So we will want to reference the <a> tag for links. Each artist’s name is a reference to a link.

To do this, we’ll use Beautiful Soup’s find() and find_all() methods in order to pull the text of the artists’ names from the BodyText <div>.

nga_z_artists.py
import requests
from bs4 import BeautifulSoup


# Collect and parse first page
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')

# Pull all text from the BodyText div
artist_name_list = soup.find(class_='BodyText')

# Pull text from all instances of <a> tag within BodyText div
artist_name_list_items = artist_name_list.find_all('a')

Next, at the bottom of our program file, we will want to create a for loop in order to iterate over all the artist names that we just put into the artist_name_list_items variable.

We’ll print these names out with the prettify() method in order to turn the Beautiful Soup parse tree into a nicely formatted Unicode string.

nga_z_artists.py
...
artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())

Let’s run the program as we have it so far:

  • python nga_z_artists.py

Once we do so, we’ll receive the following output:

Output
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630"> Zabaglia, Niccola </a> ... <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427"> Zao Wou-Ki </a> <a href="/web/20121007172955/https://www.nga.gov/collection/anZ2.htm"> Zas-Zie </a> <a href="/web/20121007172955/https://www.nga.gov/collection/anZ3.htm"> Zie-Zor </a> <a href="/web/20121007172955/https://www.nga.gov/collection/anZ4.htm"> <strong> next <br/> page </strong> </a>

What we see in the output at this point is the full text and tags related to all of the artists’ names within the <a> tags found in the <div class="BodyText"> tag on the first page, as well as some additional link text at the bottom. Since we don’t want this extra information, let’s work on removing this in the next section.

Removing Superfluous Data

So far, we have been able to collect all the link text data within one <div> section of our web page. However, we don’t want to have the bottom links that don’t reference artists’ names, so let’s work to remove that part.

In order to remove the bottom links of the page, let’s again right-click and Inspect the DOM. We’ll see that the links on the bottom of the <div class="BodyText"> section are contained in an HTML table: <table class="AlphaNav">:

We can therefore use Beautiful Soup to find the AlphaNav class and use the decompose() method to remove a tag from the parse tree and then destroy it along with its contents.

We’ll use the variable last_links to reference these bottom links and add them to the program file:

nga_z_artists.py
import requests
from bs4 import BeautifulSoup


page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

# Remove bottom links
last_links = soup.find(class_='AlphaNav')
last_links.decompose()

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    print(artist_name.prettify())

Now, when we run the program with the python nga_z_artists.py command, we’ll receive the following output:

Output
<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630"> Zabaglia, Niccola </a> <a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202"> Zaccone, Fabian </a> ... <a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11631"> Zanotti, Giampietro </a> <a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3427"> Zao Wou-Ki </a>

At this point, we see that the output no longer includes the links at the bottom of the web page, and now only displays the links associated with artists’ names.

Until now, we have targeted the links with the artists’ names specifically, but we have the extra tag data that we don’t really want. Let’s remove that in the next section.

Pulling the Contents from a Tag

In order to access only the actual artists’ names, we’ll want to target the contents of the <a> tags rather than print out the entire link tag.

We can do this with Beautiful Soup’s .contents, which will return the tag’s children as a Python list data type.

Let’s revise the for loop so that instead of printing the entire link and its tag, we’ll print the list of children (i.e. the artists’ full names):

nga_z_artists.py
import requests
from bs4 import BeautifulSoup


page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')
last_links.decompose()

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Use .contents to pull out the <a> tag’s children
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    print(names)

Note that within the loop we pull out the first item of each tag’s .contents list by referring to its index number, 0.

We can run the program with the python command to view the following output:

Output
Zabaglia, Niccola
Zaccone, Fabian
Zadkine, Ossip
...
Zanini-Viola, Giuseppe
Zanotti, Giampietro
Zao Wou-Ki

We have received back a list of all the artists’ names available on the first page of the letter Z.

However, what if we want to also capture the URLs associated with those artists? We can extract URLs found within a page’s <a> tags by using Beautiful Soup’s get('href') method.

From the output of the links above, we know that the entire URL is not being captured, so we will concatenate the link string with the front of the URL string (in this case https://web.archive.org/).

These lines we’ll also add to the for loop:

nga_z_artists.py
...
for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
    print(names)
    print(links)

When we run the program above, we’ll receive both the artists’ names and the URLs of the links that tell us more about the artists:

Output
Zabaglia, Niccola
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630
Zaccone, Fabian
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202
...
Zanotti, Giampietro
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11631
Zao Wou-Ki
https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427

Although we are now getting information from the website, it is currently just printing to our terminal window. Let’s instead capture this data so that we can use it elsewhere by writing it to a file.

Writing the Data to a CSV File

Collecting data that only lives in a terminal window is not very useful. Comma-separated values (CSV) files allow us to store tabular data in plain text, and is a common format for spreadsheets and databases. Before beginning with this section, you should familiarize yourself with how to handle plain text files in Python.

First, we need to import Python’s built-in csv module along with the other modules at the top of the Python programming file:

import csv

Next, we’ll create and open a file called z-artist-names.csv for us to write to (we’ll use the variable f for file here) by using the 'w' mode. We’ll also write the top row headings: Name and Link, which we’ll pass to the writerow() method as a list:

f = csv.writer(open('z-artist-names.csv', 'w'))
f.writerow(['Name', 'Link'])

Finally, within our for loop, we’ll write each row with the artists’ names and their associated links:

f.writerow([names, links])

You can see the lines for each of these tasks in the file below:

nga_z_artists.py
import requests
import csv
from bs4 import BeautifulSoup


page = requests.get('https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')
last_links.decompose()

# Create a file to write to, add headers row
f = csv.writer(open('z-artist-names.csv', 'w'))
f.writerow(['Name', 'Link'])

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')


    # Add each artist’s name and associated link to a row
    f.writerow([names, links])
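
Note: The csv module’s documentation recommends opening CSV files with newline='' so that the writer can control line endings itself (otherwise you may see blank rows between entries on Windows), and a with block closes the file for you. A possible variation of the file handling above, under those assumptions, is sketched here:

import csv

# Open the output file with newline='' and let the with block close it automatically
with open('z-artist-names.csv', 'w', newline='') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(['Name', 'Link'])
    # ... the scraping code that calls f.writerow([names, links]) would sit inside this block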

When you run the program now with the python command, no output will be returned to your terminal window. Instead, a file will be created in the directory you are working in called z-artist-names.csv.

Depending on what you use to open it, it may look something like this:

z-artist-names.csv
Name,Link
"Zabaglia, Niccola",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11630
"Zaccone, Fabian",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=34202
"Zadkine, Ossip",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3475w
...

Or, it may look more like a spreadsheet:

In either case, you can now use this file to work with the data in more meaningful ways, since the information you have collected is now saved on your computer rather than only printed to the terminal.

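If you would like to double-check what was written, one quick option is to read the file back with Python’s csv.reader. This is an optional sanity check rather than part of the tutorial’s program, and it assumes the script above has already created z-artist-names.csv in your current directory:

import csv

# Print every row that the scraper wrote to the CSV file
with open('z-artist-names.csv', 'r') as csv_file:
    for row in csv.reader(csv_file):
        print(row)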

We have created a program that will pull data from the first page of the list of artists whose last names start with the letter Z. However, there are 4 pages in total of these artists available on the website.

In order to collect all of these pages, we can perform more iterations with for loops. This will revise most of the code we have written so far, but will employ similar concepts.

To start, we’ll want to initialize a list to hold the pages:

pages = []

We will populate this initialized list with the following for loop:

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

Earlier in this tutorial, we noted that we should pay attention to the total number of pages there are that contain artists’ names starting with the letter Z (or whatever letter we’re using). Since there are 4 pages for the letter Z, we constructed the for loop above with a range of 1 to 5 so that it will iterate through each of the 4 pages.

For this specific web site, the URLs begin with the string https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ and then are followed with a number for the page (which will be the integer i from the for loop that we convert to a string) and end with .htm. We will concatenate these strings together and then append the result to the pages list.

In addition to this loop, we’ll have a second loop that will go through each of the pages above. The code in this for loop will look similar to the code we have created so far, as it is doing the task we completed for the first page of the letter Z artists for each of the 4 pages total. Note that because we have put the original program into the second for loop, we now have the original loop as a nested for loop contained in it.

The two for loops will look like this:

pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')

        f.writerow([names, links])

In the code above, you should see that the first for loop is iterating over the pages and the second for loop is scraping data from each of those pages and then is adding the artists’ names and links line by line through each row of each page.

These two for loops come below the import statements, the CSV file creation and writer (with the line for writing the headers of the file), and the initialization of the pages variable (assigned to a list).

Within the greater context of the programming file, the complete code looks like this:

nga_z_artists.py
import requests
import csv
from bs4 import BeautifulSoup


f = csv.writer(open('z-artist-names.csv', 'w'))
f.writerow(['Name', 'Link'])

pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)


for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')

        f.writerow([names, links])

Since this program is doing a bit of work, it will take a little while to create the CSV file. Once it is done, the output will be complete, showing the artists’ names and their associated links from Zabaglia, Niccola to Zykmund, Václav.

Being Considerate

When scraping web pages, it is important to remain considerate of the servers you are grabbing information from.

Check to see if a site has terms of service or terms of use that pertains to web scraping. Also, check to see if a site has an API that allows you to grab data before scraping it yourself.

Be sure to not continuously hit servers to gather data. Once you have collected what you need from a site, run scripts that will go over the data locally rather than burden someone else’s servers.

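One simple way to avoid hitting a server continuously is to pause between requests. For example, if you adapt the loop over pages from earlier in this tutorial, you could add a short delay with time.sleep(); the ten-second pause below is an arbitrary choice for illustration, not a requirement of the site:

import time
import requests

pages = []
for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    # ... process the page here ...
    # Wait a few seconds before requesting the next page to reduce load on the server
    time.sleep(10)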

Additionally, it is a good idea to scrape with a header that has your name and email so that a website can identify you and follow up if they have any questions. An example of a header you can use with the Python Requests library is as follows:

import requests

headers = {
    'User-Agent': 'Your Name, example.com',
    'From': 'email@example.com'
}

url = 'https://example.com'

page = requests.get(url, headers = headers)

Using headers with identifiable information ensures that the people who go over a server’s logs can reach out to you.

Conclusion

This tutorial went through using Python and Beautiful Soup to scrape data from a website. We stored the text that we gathered within a CSV file.

You can continue working on this project by collecting more data and making your CSV file more robust. For example, you may want to include the nationalities and years of each artist. You can also use what you have learned to scrape data from other websites.

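As a rough starting point for that kind of extension, you could look at the text surrounding each artist link. The sketch below is hypothetical: it assumes that each <a> tag sits inside a table row (<tr>) whose other cells (<td>) hold details such as nationality and dates, which you would need to confirm by inspecting the page’s DOM before relying on it:

for artist_name in artist_name_list_items:
    # Hypothetical: climb to the enclosing table row and collect the text of its other cells
    row = artist_name.find_parent('tr')
    if row is not None:
        details = [cell.get_text(strip=True) for cell in row.find_all('td')]
        print(details)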

To continue learning about pulling information from the web, read our tutorial “How To Crawl A Web Page with Scrapy and Python 3.”

Translated from: https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
